copyright © 2009 siemens medical solutions usa, inc. all rights reserved. “so what do those...

97
Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify My Problem?” John Paul Enterprise Hosting Services Research and Development February 20, 2009

Upload: katrina-goodman

Post on 26-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.

“So What Do Those VMware Counters Really Mean and How Do They Help Me Identify My Problem?”

John Paul

Enterprise Hosting Services

Research and Development

February 20, 2009

Page 2: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 2

Acknowledgments and Presentation Goal

The material in this presentation was pulled from a variety of sources, much of which was graciously provided by VMware staff members. In particular I would like to thank Scott Drummonds for his “Interpreting ESXTop Statistics” document, Eric Heck (local VMware SE) for his input and review of the material, and the Performance Team at VMware that shared their insight and valuable information with us during our recent visit to Palo Alto. I also acknowledge and thank the VMware staff for their permission to use their material in this presentation.

This presentation is intended to review the basics for performance analysis for the VI3.5 virtual infrastructure with detailed information on the tools and counters used. It presents a series of examples of how the performance counters report different types of resource consumption with a focus on key counters to observe.

Page 3: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 3

Agenda

Performance Analysis Basics Types of Performance Counters

Top Performance Counters

Basic Performance Analysis Approach

Native VMWare Tools – Where are We Looking? A Comparison of Esxtop and the Virtual Infrastructure Client

A Quick Introduction to Esxtop

A Quick Introduction to the Virtual Infrastructure Client

Core Four Deep Dive - Performance Counters in Action

Closing Thoughts

Questions and Answers

Page 4: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 4

Types of Performance Counters (a.k.a. statistics)

Static – Counters that don’t change during runtime, for example MEMSZ (memsize), Adapter queue depth, VM Name. The static counters are informational and may not be essential during performance problem analysis.

Performance Analysis Basics

Dynamic – Counters that are computed dynamically, for example CPU load average, memory over-commitment load average.

Calculated - Some are calculated from the delta between two successive snapshots. Refresh interval (-d) determines the time between successive snapshots. For example %CPU used = ( CPU used time at snapshot 2 - CPU used time at snapshot 1 ) / time elapsed between snapshots

Page 5: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 5

Top Performance Counters to Use for Initial Problem Determination

Physical/Virtual Machine CPU (queuing)• Average physical CPU utilization*• Peak physical CPU utilization*• CPU Time*• Processor Queue LengthMemory (swapping)• Average Memory Usage• Peak Memory Usage• Page Faults• Page Fault Delta*Disk (latency)• Split IO/Sec• Disk Read Queue Length• Disk Write Queue Length• Average Disk Sector Transfer TimeNetwork (queuing/errors)• Total Packets/second• Bytes Received/second• Bytes Sent/Second• Output queue length

ESX Host

CPU (queuing)• PCPU%•%SYS•%RDY• Average physical CPU utilization• Peak physical CPU utilization• Physical CPU load average

Memory (swapping)• State (memory state)• SWTGT (swap target)• SWCUR (swap current)• SWR/s (swap read/sec)• SWW/s (swap write/sec)• Consumed • Active (working set)• Swapused (instantaneous swap)• Swapin (cumulative swap in)• Swapout (cumulative swap out)• VMmemctl (balloon memory)

* Remember that counters based upon the timer interrupt inside a VM can be inconsistent and should be used as general guides only.

Disk (latency, queuing)• DiskReadLatency• DiskWriteLatency• CMDS/s (commands/sec)• Bytes transferred/received/sec• Disk bus resets• ABRTS/s (aborts/sec)• SPLTCMD/s (I/O split cmds/sec)

Network (queuing/errors)• %DRPTX (packets dropped - TX)• %DRPRX (packets dropped – RX)• MbTX/s (mb transferred/sec – TX)• MbRX/s (mb transferred/sec – RX)

Performance Analysis Basics

Page 6: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 6

A Review of the Basic Performance Analysis Approach

Identify the virtual context of the reported performance problem Where is the problem being seen? (“When I do this here, I get that”) How is the problem being quantified? (“My function is 25% slower) Apply a reasonability check (“Has something changed from the status quo?”)

Monitor the performance from within that virtual context View the performance counters in the same context as the problem Look at the ESX cluster level performance counters Look for atypical behavior (“Is the amount of resources consumed characteristic

of this particular application or task for the server processing tier?” ) Look for repeat offenders! This happens often.

Expand the performance monitoring to each virtual context as needed Are other workloads influencing the virtual context of this particular

application and causing a shortage of a particular resource? Consider how a shortage is instantiated for each of the Core Four

resources

Performance Analysis Basics

Page 7: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 7

A Comparison of Esxtop and the Virtual Infrastructure Client

VIC gives a graphical view of both real-time and trend consumption

VIC combines real-time reporting with short term (1 hour) trending

VIC can report on the virtual machine, ESX host, or ESX cluster

VIC is a bit awkward to get the needed counters in the same view

Esxtop allows more concurrent performance counters to be shown

Esxtop has a higher system overhead to run

Esxtop can sample down to a 2 second sampling period

Esxtop gives a detailed view of each of the Core Four

Recommendation – Use VIC to get a general view of the system performance but use Esxtop for detailed problem analysis

Where Are We Looking?

Page 8: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 8

A Brief Introduction to Esxtop

Launched at the root level of the ESX host

Fairly expensive in overhead, especially if the sampling rate increases

Screens c: cpu (default)

m: memory

n: network

d: disk adapter

u: disk device (new in ESX 3.5)

v: disk VM (new in ESX 3.5)

Can be piped to a file and then imported in W2K System Monitor

Horizontal and vertical screen resolution limits the number of fields and entities that could be viewed so chose your fields wisely

Some of the rollups and counters may be confusing to the casual user

Where Are We Looking?

Page 9: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 9

Using the Esxtop screen view

fields hidden from the view…

Time Uptime running worlds

• Worlds = VMKernel processes (like W2k threads)• ID = world identifier• GID = world group identifier• NWLD = number of worlds

Where Are We Looking?

Page 10: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 10

Using the Esxtop screen view - expanding groups

• In rolled up view some stats are cumulative of all the worlds in the group• Expanded view gives breakdown per world• VM group consists of mks (mouse, keyboard, screen), vcpu, vmx worlds. SMP VMs have additional vcpu and vmm worlds• vmm0, vmm1 = Virtual machine monitors for vCPU0 and vCPU1 respectively

press ‘e’ key

Where Are We Looking?

Page 11: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 11

A Brief Introduction to the Virtual Infrastructure Client

Screens – CPU, Disk, Management Agent, Memory, Network, System vCenter collects performance metrics from the hosts that it manages and aggregates the

data using a consolidation algorithm. The algorithm is optimized to keep the database size constant over time.

vCenter does not display many counters for trend/history screens Esxtop defaults to a 5 second sampling rate while vCenter defaults to a 20 second rate. Default statistics collection periods, samples, and how long they are stored:

Interval Interval Period Number of Samples

Interval Length

Per Hour (real-time) 20 seconds 180

Per day 5 minutes 288 1 day

Per week 30 minutes 336 1 week

Per month 2 hours 360 1 month

Per year 1 day 365 1 year

Where Are We Looking?

Page 12: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 12

The Effect of Sampling Rates/Times on ESXTop and VIC Counters

Total CPU =

21.272GHZ

Total CPU Usage

= 3.68%

Total CPU Usage

= 14.18%

ESXTop

VIC Performance

Tab

VIC Summary

Tab

Total CPU Usage =

10.27%

Page 13: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 13

Virtual Infrastructure Client – CPU Screen

To Change Settings

To Change Screens

Performance Counters in Action

Page 14: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 14

Virtual Infrastructure Client – Disk Screen

Performance Counters in Action

Page 15: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 15

Virtual Infrastructure Client – Change Settings Screen

Performance Counters in Action

Page 16: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 16

A Comparison of Memory Counters in ESXTop and VIC on a 24GB Host

Total Memory

Usage = 52.12%12.51GB Memory

Usage = 52.12%

ESXTop

VIC Performance

Tab

VIC Summary

Tab

Total Memory

Usage = 51.87%

Page 17: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 17

Reservation (Guarantees) • Minimum service level guarantee (in MHz)• Even when system is overcommitted• Needs to pass admission control for start-up

Shares (Share the Resources)• CPU entitlement is directly proportional to VM's shares and depends on the total number of shares issued• Abstract number, only ratio matters

Limit • Absolute upper bound on CPU entitlement (in MHz)• Even when system is not overcommitted

Resource Control Revisit – CPU Example S

hares

Total MHZ

0 MHZ

Limit

Reservation

Performance Counters in Action

Page 18: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 18

Types of Resources – The Core Four

Though the Core Four resources exist at both the ESX host and virtual machine levels, they are not the same in how they are instantiated and reported against.

CPU – processor cycles (vertical), multi-processing (horizontal)

Memory – allocation and sharing

Disk (a.k.a. storage) – throughput, size, latencies, queuing

Network - throughput, latencies, queuing

Though all resources are limited, ESX handles the resources differently. CPU is more strictly scheduled, memory is adjusted and reclaimed (more fluid) if based on shares, disk and network are fixed bandwidth (except for queue depths) resources.

Performance Counters in Action

Page 19: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 19

CPU – Understanding PCPU versus VCPU

It is important to separate the physical CPU (PCPU) resources of the ESX host from the virtual CPU (VCPU) resources that are presented by ESX to the virtual machine. PCPU – The ESX host’s processor resources are exposed only to ESX. The virtual machines are not aware and cannot report on those physical resources.

VCPU – ESX effectively assembles a virtual CPU(s) for each virtual machine from the physical machine’s processors/cores, based upon the type of resource allocation (ex. shares, guarantees, minimums).

Scheduling - The virtual machine is scheduled to run inside the VCPU(s), with the virtual machine’s reporting mechanism (such as W2K’s System Monitor) reporting on the virtual machine’s allocated VCPU(s) and remaining Core Four resources.

Performance Counters in Action

Page 20: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 20

Operating System

Intel Hardware

PCPU PMemory PNIC PDisk

Application

Operating System

Intel Hardware

VCPU VMemory VNIC VDisk

Application

Operating System

Intel Hardware

VCPU VMemory VNIC VDisk

Application

Intel Hardware

PNIC PDiskPMemoryPCPU

PCPU and VCPU Example – Two Virtual Machines

Physical Host2 socket, four core @2.1Ghz = 16.8Ghz CPU

8 GB RAM

Physical Resources

Two Virtual Machines2 cores @2.1Ghz = 4.2Ghz CPU

2 GB RAM

Allocated Physical Resources

6 cores @2.1Ghz = 12.6Ghz CPU (minus virt. overhead)

6 GB RAM (minus virt. overhead)

Limits, maximums, shares all affect real resources

Remaining Physical Resources

Virtual Machine Logical Resources

Each virtual machine defined as a uniprocessor

VCPU = 2.1GHZ since uniprocessor

Memory allocation of 1GB per virtual machine

Performance Counters in Action

Page 21: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 21

CPU – Key Question and Considerations

Allocation – The CPU allocation for a specific workload can be constrained due to the resource settings or number of CPUs, amount of shares, or limits. The key field at the virtual machine level is CPU queuing and at the ESX level it is Ready to Run (%RDY in Esxtop).

Capacity - The virtual machine’s CPU can be constrained due to a lack of sufficient capacity at the ESX host level as evidenced by the PCPU/LCPU utilization.

Contention – The specific workload may be constrained by the consumption of workloads operating outside of their typical patterns

SMP CPU Skewing – The movement towards lazy scheduling of SMP CPUs can cause delays if one CPU gets too far “ahead” of the other. Look for higher %CSTP (co-schedule pending)

Performance Counters in Action

Is there a lack of CPU resources for the VCPU(s) of the virtual machine or for the PCPU(s) of the ESX host?

Page 22: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 22

CPU Performance Counters 1

Field Usage

PCPU(%)Percentage of CPU utilization per physical CPU and total average physical CPU utilization.

LCPU(%)

Percentage of CPU utilization per logical CPU. The percentages for the logical CPUs belonging to a package add up to100 percent. This line appears only if hyper-threading is present and enabled.

CCPU(%) Percentages of total CPU time as reported by the ESX Server service console.

  us— Percentage user time.

  sy— Percentage system time.

  id— Percentage idle time.

  wa— Percentage wait time.

  cs/sec— Context switches per second recorded by the service console

IDResource pool ID or virtual machine ID of the running world’s resource pool or virtual machine, or world ID of running world.

GID Resource pool ID of the running world’s resource pool or virtual machine

NAME Name of running world’s resource pool or virtual machine, or name of running world.

Performance Counters in Action

Page 23: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 23

CPU Performance Counters 2

Field Usage

NWLD

Number of members in running world’s resource pool or virtual machine. If a Group is expanded using the interactive command e(see interactive commands), then NWLD for all the resulting worlds is 1 (some resource pools like the console resource pool have only one member).

%STATE TIMES Set of CPU statistics made up of the following percentages. For a world, the percentages are a percentage of one physical CPU.

%USED Percentage physical CPU used by the resource pool, virtual machine, or world. This is the combination of all of the cores so the % can be greater then 100%.

%RUN

Percentage of total time scheduled. This time does not account for hyper threading and system time. On a hyper threading enabled server, the %RUN can be twice as large as %USED.

%SYS

Percentage of time spent in the ESX Server VMkernel on behalf of the resource pool, virtual machine, or world to process interrupts and to perform other system activities. This time is part of the time used to calculate %USED, above.

%WAIT

Percentage of time the resource pool, virtual machine, or world spent in the blocked or busy wait state. This percentage includes the percentage of time the resource pool, virtual machine, or world was idle.

Performance Counters in Action

Page 24: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 24

CPU Performance Counters 3

Field Usage

%IDLE

Percentage of time the resource pool, virtual machine, or world was idle. Subtract this percentage from %WAIT, above to see the percentage of time the resource pool, virtual machine, or world was waiting for some event.

%RDY Percentage of time the resource pool, virtual machine, or world was ready to run

%OVRLP

Percentage of system time spent during scheduling of a resource pool, virtual machine, or world on behalf of a different resource pool, virtual machine, or world while the resource pool, virtual machine, or world was scheduled. This time is not included in %SYS. For example, if virtual machine A is currently being scheduled and a network packet for virtual machine B is processed by the ESX Server VMkernel, the time spent appears as %OVRLP for virtual machine A and %SYS for virtual machine B.

%CSTP Percentage of time a resource pool spends in a ready, co-deschedule state

%MLMTD

Percentage of time the ESX Server VMkernel deliberately did not run the resource pool, virtual machine, or world because doing so would violate the resource pool, virtual machine, or worlds limit setting. Even though the resource pool, virtual machine, or world is ready to run when it is prevented from running in this way, the %MLMTD time is not included in %RDY time.

EVENTSet of CPU statistics made up of per second event rates. These statistics are COUNTS/s for VMware internal use only.

Performance Counters in Action

Page 25: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 25

CPU Performance Counters 4

Field Usage

CPU ALLOC Set of CPU statistics made up of the following CPU allocation configuration parameters.

AMIN Resource pool, virtual machine, or world attribute Reservation

AMAXResource pool, virtual machine, or world attribute Limit. A value of ‐1 means unlimited.

ASHRS Resource pool, virtual machine, or world attribute Shares

STATSParameters and statistics. These statistics are applicable only to worlds and not to virtual machines or resource pools.

AFFINITYBIT Bit mask showing the current scheduling affinity for the world

HTSHARING Current hyperthreading configuration

CPUThe physical or logical processor on which the world was running when resxtop(or esxtop) obtained this information.

HTQIndicates whether the world is currently quarantined or not. N means no and Y means yes.

TIMER Timer rate for this world.

Performance Counters in Action

Page 26: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 26

Esxtop CPU screen (c)

PCPU = Physical CPU/core

CCPU = Console CPU (CPU 0)

Press ‘f’ key to choose fields

Performance Counters in Action

Page 27: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 27

ESXTop

Virtual Machine View

Idle State on Test Bed (CPU View)

Performance Counters in Action

Page 28: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 28

Idle State on Test Bed – GID 32 Expanded

Cumulative Wait %

Five Worlds Total Idle %Expanded GID Rolled Up GID

Wait includes idle

Performance Counters in Action

Page 29: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 29

High CPU within one virtual machine caused by affinity - Esxtop

Physical CPU Fully

Used

One Virtual CPU is

Fully Used

Case Studies - CPU

Page 30: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 30

High CPU within one virtual machine (affinity) - vCenter

View of the ESX Host

View of the VM

Case Studies - CPU

Page 31: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 31

Add a Second Virtual Machine – MAX CPU (affinity)

2 Physical CPUs Fully

Used

2 Virtual CPUs Fully

Used

Ready to Run

Acceptable

Case Studies - CPU

Page 32: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 32

Add a Third Virtual Machine – MAX CPU (affinity)

3 Physical CPUs Fully

Used

3 Virtual CPUs Fully Used

Ready to Run Acceptable

Case Studies - CPU

Page 33: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 33

Add a Fourth Virtual Machine – MAX CPU (affinity)

4 Physical CPUs Fully

Used

4 Virtual CPUs Fully Used

Ready to Run Acceptable

Case Studies - CPU

Page 34: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 34

Four Virtual Machines – MAX CPU (no affinity)

4 Physical CPUs Fully

Used

4 Virtual CPUs Fully Used

Ready to Run Acceptable

Case Studies - CPU

Page 35: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 35

SMP Implementation WITHOUT CPU Constraints

4 Physical CPUs Fully

Used

4 Virtual CPUs Fully Used

Ready to Run Acceptable

One - 2 CPU SMP VCPU

Case Studies - CPU

Page 36: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 36

SMP Implementation WITHOUT CPU Constraints - VIC

Case Studies - CPU

Page 37: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 37

Case Studies - CPU

SMP Implementation WITHOUT CPU Constraints - Esxtop

4 Physical CPUs Fully

Used

4 Virtual CPUs Fully Used

Ready to Run Acceptable

One - 2 CPU SMP VCPU

Page 38: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 38

SMP Implementation with Mild CPU Constraints

4 Physical CPUs Fully

Used

4 Virtual CPUs Heavily Used

Ready to Run Indicates Severe

Problems

One - 2 CPU SMP VCPUs (7

NWLD)

Case Studies - CPU

Page 39: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 39

SMP Implementation with Severe CPU Constraints

4 Physical CPUs Fully

Used

4 Virtual CPUs

Fully Used

Ready to Run Indicates Severe

Problems

Two - 2 CPU SMP VCPUs

(7 NWLD)

Case Studies - CPU

Page 40: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 40

SMP Implementation with Severe CPU Constraints

Case Studies - CPU

4 Physical CPUs Fully

Used

4 Virtual CPUs

Fully Used

Ready to Run Indicates Severe

Problems

Two - 2 CPU SMP VCPUs

(7 NWLD)

Page 41: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 41

SMP Implementation with Severe CPU Constraints

Case Studies - CPU

Page 42: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 42

Memory – Separating the machine and guest memoryIt is important to note that some statistics refer to guest physical memory while others refer to machine memory. “ Guest physical memory" is the virtual-hardware physical memory presented to the VM. " Machine memory" is actual physical RAM in the ESX host.

In the figure below, two VMs are running on an ESX host, where each block represents 4 KB of memory and each color represents a different set of data on a block.

Inside each VM, the guest OS maps the virtual memory to its physical memory. ESX Kernel maps the guest physical memory to machine memory. Due to ESX Page Sharing technology, guest physical pages with the same content canbe mapped to the same machine page.

Case Studies - Memory

Page 43: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 43

A Brief Look at Ballooning

The W2K balloon driver is located in VMtools

ESX sets a balloon target for each workload at start-up and as workloads are introduced/removed

The balloon driver expands memory consumption, requiring the Virtual Machine operating system to reclaim memory based on its algorithms

Ballooning routinely takes 10-20 minutes to reach the target

The returned memory is now available for ESX to use

Key ballooning fields:

Case Studies - Memory

SZTGT: determined by reservation, limit and memory sharesSWCUR = 0 : no swapping in the pastSWTGT = 0 : no swapping pressureSWR/S, SWR/W = 0 : No swapping activity currently

Page 44: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 44

ESX Memory Sharing - The “Water Bed Effect”

ESX handles memory shares on an ESX host and across an ESX cluster with a result similar to a single water bed, or room full of water beds, depending upon the action and the memory allocation type:

Initial ESX boot (i.e., “lying down on the water bed”) – ESX sets a target working size for each virtual machine, based upon the memory allocations or shares, and uses ballooning to pare back the initial allocations until those targets are reached (if possible).

Steady State (i.e., “minor position changes”) - The host gets into a steady state with small adjustments made to memory allocation targets. Memory “ripples” occur during steady state, with the amplitude dependent upon the workload characteristics and consumption by the virtual machines.

New Event (i.e., “second person on the bed”) – The host receives additional workload via a newly started virtual machine or VMotion moves a virtual machine to the host through a manual step, maintenance mode, or DRS. ESX pares back the target working size of that virtual machine while the other virtual machines lose CPU cycles that are directed to the new workload.

Large Event (i.e., “jumping across water beds”) – The cluster has a major event that causes a substantial movement of workloads to or between multiple hosts. Each of the hosts has to reach a steady state, or to have DRS determine that the workload is not a current candidate for the existing host, moving to another host that has reached a steady state with available capacity. Maintenance mode is another major event.

Case Studies - Memory

Page 45: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 45

Memory – Key Question and Considerations

Is the memory allocation for each workload optimum to prevent swapping at the Virtual Machine level, yet low enough not to constrain other workloads or the ESX host? HA/DRS/Maintenance Mode Regularity – How often do the workloads in the cluster get moved between hosts? Each movement causes an impact on the receiving (negative) and sending (positive) hosts with maintenance mode causing a rolling wave of impact across the cluster, depending upon the timing.

Allocation Type – Each of the allocation types have their drawbacks so tread carefully when choosing the allocation type. One size seldom is right for all needs.

Capacity/Swapping - The virtual machine’s CPU can be constrained due to a lack of sufficient capacity at the ESX host level. Look for regular swapping at the ESX host level as an indicator of a memory capacity issue but be sure to notice memory leaks that artificially force a memory shortage situation.

Case Studies - Memory

Page 46: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 46

Memory Performance Counters 1

Field Usage

PMEM (MB) Displays the machine memory statistics for the server. All numbers are in megabytes.

  total — Total amount of machine memory in the server.

 cos — Amount of machine memory allocated to the ESX Server service console (ESX Server 3 only).

  vmk — Amount of machine memory being used by the ESX Server VMkernel.

 other — Amount of machine memory being used by everything other than the ESX service console (ESX Server 3 only) and ESX Server VMkernel.

  free — Amount of machine memory that is free.

 

state — Current machine memory availability state. Possible values are high, soft, hard and low. High means that the machine memory is not under any pressure and low means that it is.

VMKMEM

Displays the machine memory statistics for the ESX Server VMkernel. All (MB) numbers are in megabytes. . managed — Total amount of machine memory managed by the ESX Server VMkernel. . min free — Minimum amount of machine memory that the ESX Server VMkernel aims to keep free. . rsvd — Total amount of machine memory currently reserved by resource pools. . ursvd — Total amount of machine memory currently unreserved

Case Studies - Memory

Page 47: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 47

Memory Performance Counters 2

Field Usage

COSMEMDisplays the memory statistics as reported by the ESX Server service console (MB) (ESX Server 3 only). All numbers are in megabytes.

  free — Amount of idle memory.

  swap_t — Total swap configured.

  swap_f — Amount of swap free.

  r/s is — Rate at which memory is swapped in from disk.

  w/s — Rate at which memory is swapped to disk.

NUMA (MB)

Displays the ESX Server NUMA statistics. This line appears only if the ESX Server host is running on a NUMA server. All numbers are in megabytes. For each NUMA node in the server, two statistics are displayed

  The total amount of machine memory in the NUMA node that is managed by the ESX Server.

  The amount of machine memory in the node that is currently free (in parentheses).

PSHARE (MB) Displays the ESX Server page‐sharing statistics. All numbers are in megabytes

  shared — Amount of physical memory that is being shared.

  common — Amount of machine memory that is common across worlds.

  saving — Amount of machine memory that is saved because of page sharing.

Case Studies - Memory

Page 48: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 48

Memory Performance Counters 3

Field Usage

SWAP (MB) Displays the ESX Server swap usage statistics. All numbers are in megabytes

  curr — Current swap usage

  target — Where the ESX Server system expects the swap usage to be.

  r/s — Rate at which memory is swapped in by the ESX Server system from disk.

  w/s — Rate at which memory is swapped to disk by the ESX Server system

MEMCTL (MB) Displays the memory balloon statistics. All numbers are in megabytes.

  curr — Total amount of physical memory reclaimed using the vmmemctl module

 target — Total amount of physical memory the ESX Server host attempts to reclaim using the vmmemctl module.

 max — Maximum amount of physical memory the ESX Server host can reclaim using the vmmemctlmodule

AMIN Memory reservation for this resource pool or virtual machine

AMAX Memory limit for this resource pool or virtual machine. A value of 01 means Unlimited

ASHRS Memory shares for this resource pool or virtual machine

AMLMT Memory limit for this resource pool or virtual machine

Case Studies - Memory

Page 49: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 49

Memory Performance Counters 4

Field Usage

AUNITS Type of units shown (kb)

N%L Current percentage of memory allocated to the virtual machine or resource pool that is local

MEMSZ (MB) Amount of physical memory allocated to a resource pool or virtual machine

SZTGT (MB) Amount of machine memory the ESX Server VMkernel wants to allocate to a resource pool or virtual machine

TCHD (MB) Working set estimate for the resource pool or virtual machine

%ACTV Percentage of guest physical memory that is being referenced by the guest. This is an instantaneous value

%ACTVS Percentage of guest physical memory that is being referenced by the guest. This is a slow moving average.

%ACTVF Percentage of guest physical memory that is being referenced by the guest. This is a fast moving average

%ACTVN Percentage of guest physical memory that is being referenced by the guest. This is an estimation

MCTL Memory balloon driver is installed or not. N means no, Y means yes.

MCTLSZ Amount of physical memory reclaimed from the resource pool by way of (MB) ballooning

MCTLTGTAmount of physical memory the ESX Server system can reclaim from the (MB) resource pool or virtual machine by way of ballooning

Case Studies - Memory

Page 50: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 50

Memory Performance Counters 5

Field Usage

MCTLMAX

Maximum amount of physical memory the ESX Server system can reclaim (MB) from the resource pool or virtual machine by way of ballooning. This maximum depends on the guest operating system type

SWCUR (MB) Current swap usage by this resource pool or virtual machine

SWTGT (MB) Target where the ESX Server host expects the swap usage by the resource pool or virtual machine to be

SWR/s (MB) Rate at which the ESX Server host swaps in memory from disk for the resource pool or virtual machine

SWW/s (MB) Rate at which the ESX Server host swaps resource pool or virtual machine memory to disk.

CPTRD (MB) Amount of data read from checkpoint file

CPTTGT Size of checkpoint file. (MB)

ZERO (MB) Resource pool or virtual machine physical pages that are zeroed

SHRD (MB) Resource pool or virtual machine physical pages that are shared.

SHRDSVDMachine pages that are saved because of resource pool or virtual machine (MB) shared pages.

OVHD (MB) Current space overhead for resource pool.

Case Studies - Memory

Page 51: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 51

Memory Performance Counters 6

Field Usage

OVHDMAX Maximum space overhead that might be incurred by resource pool or virtual (MB) machine

OVHDUW Current space overhead for a user world

RESP? Memory Responsive?

NHNCurrent home node for the resource pool or virtual machine. This statistic is applicable only on NUMA systems. If the virtual machine has no home node, a dash (‐) is displayed

NRMEMCurrent amount of remote memory allocated to the virtual machine or (MB) resource pool. This statistic is applicable only on NUMA systems

GST_NDx Guest memory allocated for a resource pool on NUMA node x. This statistic (MB) is applicable on NUMA systems only

OVD_NDx VMM overhead memory allocatedforaresource pool on NUMA node x. This (MB) statistic is applicable on NUMA systems only

Case Studies - Memory

Page 52: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 52

Esxtop memory screen (m)Possible

states: High, Soft, hard and

low

Physical Memory (PMEM)

VMKMEMCOSPCI Hole

VMKMEM - Memory managed by VMKernelCOSMEM - Memory used by Service Console

Case Studies - Memory

Page 53: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 53

Esxtop memory screen (m)

SZTGT = Size targetSWTGT = Swap targetSWCUR = Currently swappedMEMCTL = Balloon driverSWR/S = Swap read /secSWW/S = Swap write /sec

SZTGT : determined by reservation, limit and memory sharesSWCUR = 0 : no swapping in the pastSWTGT = 0 : no swapping pressureSWR/S, SWR/W = 0 : No swapping activity currently

Swapping activity in Service Console

VMKernel Swapping activity

Case Studies - Memory

Page 54: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 54

VIC Memory Screen – Summary Tab

Case Studies - Memory

Host Memory Usage: The amount of physical

memory (in MB) allocated to a guest

Guest Memory Usage: The amount of physical memory (in MB) actively

used by a guest

Requested Memory: The amount of physical

memory (in MB) requested for a guest

Page 55: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 55

ESXTop and VIC Summary Tab – Memory Screens

Virtual Machine Memory

Actively Used

Virtual Machine Requested

Memory

Virtual Machine Target Working

Set Size

Virtual Machines Have

Previously Swapped

Page 56: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 56

Idle State on Test Bed – Memory View

Case Studies - Memory

Page 57: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 57

Memory View at Steady State of 3 Virtual Machines – Memory Shares

Virtual Machine Just Powered On

These VMs are at a memory steady

state

No VM Swapping or Targets

Case Studies - Memory

Most memory is not reserved

Page 58: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 58

Ballooning and Swapping in Progress – Memory View

Ballooning In Effect

Mild swapping Different Size

Targets Due to Different Amount of

Up Time

Possible states: High, Soft, hard and

low

Case Studies - Memory

Page 59: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 59

Memory Reservations – Effect on New Loads

Can’t start fourth virtual machine of >512MB of reserved memory

Fourth virtual machine of 512MB of reserved memory started

6GB of “free” physical memory due to memory sharing over 20

minutes

666MB of unreserved

memory

Three VMs, each with 2GB reserved memory

Case Studies - Memory

What Size Virtual Machine with Reserved Memory Can Be Started?

Page 60: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 60

Memory Shares – Effect on New Loads

Fourth virtual machine of 2GB of memory allocation started successfully

5.9 GB of “free”

physical memory

6GB of unreserved

memory

Case Studies - Memory

Three VMs with 2GB allocation

Page 61: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 61

Storage – Key Question and Considerations

Is the bandwidth and configuration of the storage subsystem sufficient to meet the desired latency for the target workloads? If the latency target is not being met then further analysis may be very time consuming. Queuing - Queuing can happen at any point along the storage path, but is not necessarily a bad thing if the latency meets requirements.

Storage Path Configuration and Capacity – It is critical to know the configuration of the storage path and the capacity of each node along that path. The number of active vmkernel commands must be less then or equal to the queue depth max of any of the storage path components while processing the target storage workload.

Virtual Machines per LUN - The number of outstanding active vmkernel commands per virtual machine times the number of virtual machines on a specific LUN must be less then the queue depth of that adaptor

The Latency vs. Throughput Relationship -

Throughput (MB/sec) = (Outstanding IOs/ latency (msec)) * Block size (KB)

Storage Considerations

Page 62: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 62

Storage – More Questions

How fast can the individual disk drive process a request?

Based upon the block-size and type of I/O (sequential read, sequential write, random read, random write) what type of configuration (RAID, number of physical spindles, cache) is required to match the I/O characteristics and workload demands for average and peak throughput?

Does the network storage (SAN frame) handle the I/O rate down each path and aggregated across the internal bus, frame adaptors, and front end processors?

In order to answer these questions we need to better understand the underlying design, considerations, and basics of the storage subsystem

Storage Considerations

Page 63: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 63

Back-end Storage Design Considerations

Capacity - What is the storage capacity needed for this workload/cluster? Disk drive size (144GB, 300GB) Number of disk drives needed within a single logical unit (ex., LUN)

IOPS Rate – How many I/Os per second are expected with the needed latency? Number of physical spindles per LUN Impact of sharing of physical disk drives between LUNs Configuration (ex., cache) and speed of the disk drive

Availability – How many disk drives, storage components can fail at one time? Type of RAID chosen Amount of redundancy built into the storage solution

Cost – Delivered cost per byte at the required speed and availability Many options are available for each design consideration Final decisions on the choice for each component The cumulative amount of capacity, IOPS rate, and availability often dictate

the overall solution

Storage Considerations

Page 64: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 64

Storage from the Ground Up – Physical Disk Types

SCSI (Small Computer Systems Interface) Fiber Channel (FC)

Allows queuing depth of 16 S-ATA (Serial Advanced Technology Attachment)

No command queuing, lower costs Especially good for sequential I/Os (ex, backup)

SAS (Serial SCSI) Solid State Disk (SSD)

Storage Considerations

Page 65: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 65

Storage from the Ground Up – Configuration Options

Latency – The average time it takes for the requested sector to rotate under the read/write head after a completed seek 5400 (5.5ms), 7200 (4.2ms), 10,000 (3ms) , 15,000 (2ms) RPM Ave. disk latency = 1/2 * rotation

Seek Time – The time it takes for the read/write head to find the physical location of the requested data on the disk Average Seek time: 8-10 ms

Average Access Time – The time it takes for a disk device to service a request. This includes seek time, latency, and command processing overhead time.

Host Transfer Rate – The speed at which the host can transfer the data across the disk interface.

Storage Considerations

Page 66: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 66

The Effect of Rotational Speed on Transfer Time

15,000 RPM disk Average Seek 3.6mSec (1/3 Stroke)

Average latency 2.0mSec

Command and data transfer 0.2mSec

Total random access time = 3.6 + 2.0 + 0.2 = 5.8mSec

1 / 0.0058sec = 172 I/Os per second per disk

At 8KB per I/O that is about 1.4 MB/Second

10,000 RPM disk Average Seek 4.7mSec (1/3 Stroke)

Average latency 3.0 mSec

Command and data transfer < 1mSec

Total random access time = 4.9 + 3.0 + 0.2 = 8.1mSec

1 / 0.0081 = 123 I/0s per second per disk

At 8KB per I/O that is about 984 KB/Sec

Storage Considerations

Seek

Actuator Arm

ConcentricData

Tracks

Page 67: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 67

An Example of the Benefits of Faster Drives

Drive Speed Drive Capacity # of Drives RAID 5 Capacity (GB) Performance (IOPS) Short-Stroked to (GB)

           

10K RPM 146GB 17 2482 700 No

15K RPM 146GB 11 1606 700 No

           

10K RPM 73GB 17 1241 700 No

15K RPM 146GB 11 1606 700 No

           

10K RPM 300GB 16 2064 763 Yes

15K RPM 146GB 14 2044 772 No

Page 68: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 68

Interaction of Transfer Rate, Block Size, and IOPS - Read

0

100

200

300

400

500

600

700

800

900

1 2 3 4 5 6 7 8 9 10

IOPS K/s Block Size KB/s XFR Rate MB/Sec

XFR Rate

Block SizeIOPS

Block size increases, IOPS decrease, throughput increase

Page 69: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 69

Storage from the Ground Up – Common RAID Types/Uses

RAID 1

Data 1

Data 2

Data 3

Data 4

Data 1

Data 2

Data 3

Data 4

Disk 1 Disk 2

RAID 0

Data 1

Data 3

Data 5

Data 7

Data 2

Data 4

Data 6

Data 8

Disk 1 Disk 2

RAID 5

Data 1

Data 1

Data 1

Data p

Data 2

Data 2

Data p

Data 1

Disk 1 Disk 2

Data 3

Data p

Data 2

Data 2

Data p

Data 3

Data 3

Data 3

Disk 3 Disk 4

RAID 0 - Stripes data across two or more disks with no parity Poor redundancy, good performance

RAID 1 – Data is duplicated across disks (mirrored) Good redundancy, good performance on reads, reduced capacity/price

RAID 5 - Data blocks are striped across disks with parity data distributed across the physical disks Good redundancy, performance penalty on writes, high capacity/price

RAID 6 – Data blocks are striped across disks with two copies of parity data distributed across the physical disks Good redundancy, performance penalty on writes, high capacity/price

RAID 6

Data 1

Data 1

Data 1

Data p

Data 2

Data 2

Data p

Data p

Disk 1 Disk 2

Data 3

Data p

Data p

Data 1

Data p

Data p

Data 2

Data 2

Disk 3 Disk 4 Disk 5

Data p

Data 3

Data 3

Data 3

Storage Considerations

Page 70: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 70

Example: Creating Four 150GB LUNs Using RAID 1 + 0

Eight total physical disks, each physical disk has 150GB available space

Effective raw disk capacity of 600GB (4 x 150GB)

Each LUN spans four physical disk drives for performance needs

Each physical disk has four LUNs allocated

Storage Considerations

RAID 1 Four 150GB LUNs

LUN 1

LUN 2

LUN 3

LUN 4

LUN 1

LUN 2

LUN 3

LUN 4

Disk 1 Disk 2Mirrored Disk Pair 1

LUN 1

LUN 2

LUN 3

LUN 4

LUN 1

LUN 2

LUN 3

LUN 4

LUN 1

LUN 2

LUN 3

LUN 4

LUN 1

LUN 2

LUN 3

LUN 4

LUN 1

LUN 2

LUN 3

LUN 4

LUN 1

LUN 2

LUN 3

LUN 4

Disk 3 Disk 4Mirrored Disk Pair 2

Disk 5 Disk 6Mirrored Disk Pair 3

Disk 7 Disk 8Mirrored Disk Pair 4

Page 71: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 71

Example: Creating Four 150GB LUNs Using RAID 5

Four total physical disks, each physical disk has 150GB available space

Effective raw disk capacity of 600GB (4 x 150GB)

Each LUN spans three physical disk drives for performance needs

Each physical disk has four LUNs allocated

Storage Considerations

RAID 5 Four 150GB LUNs

LUN 1

LUN 2

LUN 3

LUN p

LUN 1

LUN 2

LUN p

LUN 4

LUN 1

LUN p

LUN 3

LUN 4

LUN p

LUN 2

LUN 3

LUN 4

Page 72: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 72

The Impact of LUN Data Striping on Performance

Each workload independently makes requests of the physical disks

Higher disk utilization usually results in slower response time

The faster the physical disk response, the less negative impact on response time

High usage of a single disk results in hot spots, with resultant data moves by the storage subsystem to different physical disks

“Short stroking” (using outer disk part) of disks may be needed for high I/O

Storage Considerations

RAID 5 Four 150GB LUNs

LUN 1

LUN 2

LUN 3

LUN p

LUN 1

LUN 2

LUN p

LUN 4

LUN 1

LUN p

LUN 3

LUN 4

LUN p

LUN 2

LUN 3

LUN 4

RAID 5 Short Stroking

LUN 1 LUN 1 LUN 1 LUN p

Page 73: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 73

Network Storage Components That Can Affect Performance/Availability

Size and use of cache

Number of independent internal data paths and buses

Number of front-end interfaces and processors

Types of interfaces supported (ex. Fiber channel and iSCSI)

Number and type of physical disk drives available

MetaLUN Expansion MetaLUNs allow for the aggregation of LUNs

System typically re-stripes data when MetaLUN is changed

Some performance degradation during re-striping

Storage Virtualization Aggregation of storage arrays behind a presented mount point/LUN

Movements between disk drives and tiers control by storage management

Change of physical drives and configuration may be transient and severe

Storage Considerations

Page 74: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 74

SAN Storage Infrastructure – Areas to Watch/Consider

HBA

ISL

FC switch Director

San.jpg

HBA

HBA

HBA Speed

Storage Adapter Queue LengthWorld Queue Length LUN Queue Length

Fiber Bandwidth

FA CPU SpeedDisk Response

RAID Configuration

Disk Speeds

Block Size

Number of Spindles in LUN/array

Cache Size/Type

Storage Considerations

Page 75: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 75

Storage Area Network

VM 3World Queue

Length (WQLEN)

ESX Host

HBA

HBA

VM 1World Queue

Length (WQLEN)

VM 2World Queue

Length (WQLEN)

VM 4World Queue

Length (WQLEN)

Storage Queuing – The Key Throttle Points

LU

N Q

ueu

e L

eng

th (

LQ

LE

N)

Ex

ec

uti

on

Th

rott

leE

xe

cu

tio

n T

hro

ttle

Storage Considerations

Page 76: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 76

Storage I/O – The Key Throttle Point Definitions

Storage Adapter Queue Length (AQLEN) The number of outstanding vmkernel active commands that the adapter is configured to support. This is not settable. It is a parameter passed from the adapter to the kernel.

LUN Queue Length (LQLEN) The maximum number of permitted outstanding vmkernel active commands to a LUN. (This would be the HBA queue depth setting for an HBA.) This is set in the storage adapter configuration via the command line.

World Queue Length (WQLEN) VMware Recommends Not to Change This!!!! The maximum number of permitted outstanding vmkernel active requests to a LUN from any singular virtual machine (min:1, max:256: default: 32) Configuration->Advanced Settings->Disk-> Disk.SchedNumReqOutstanding

Execution Throttle (this is not a displayed counter) The maximum number of permitted outstanding vmkernel active commands that can be executed on any one HBA port (min:1, max:256: default: ~16, depending on vendor) This is set in the HBA driver configuration.

Storage Considerations

Page 77: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 77

Queue Length Rules of Thumb

For a lightly-loaded system, average queue length should be less than 1 per spindle with occasional spikes up to 10. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 0.6 per spindle and less than 0.3 per spindle above a RAID-5 controller.

For a heavily-loaded system that isn’t saturated, average queue length should be less than 2.5 per spindle with infrequent spikes up to 20. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 1.5 per spindle and less than 1 above a RAID-5 controller.

Storage Considerations

Page 78: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 78

Storage Performance Counters 1Field Usage

%USD Percentage of the queue depth used by ESX Server VMkernel active commands. This statistic is applicable only to worlds and devices.

ABRTS/s Number of commands aborted per second.

ACTVNumber of commands in the ESX Server VMkernel that are currently active. This statistic is applicable only to worlds and devices

AQLEN Storage adapter queue depth. Maximum number of ESX Server VMkernel active commands that the adapter driver is configured to support.

BLKSZ Block size in bytes.

CIDStorage adapter channel/controller ID. This ID is visible only if the corresponding adapter, channel and target are expanded.

CMDS/s Number of commands issued per second.

DAVG/cmd Average device latency per command in milliseconds.

DAVG/rd Average device read latency per read operation in milliseconds.

DAVG/wr Average device write latency per write operation in milliseconds.

Device Storage device name. This name is visible only if corresponding world is expanded to devices.

DQLEN Storage device queue depth. This is the maximum number of ESX Server VMkernel active commands that the device is configured to support.

Case Studies - Storage

Page 79: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 79

Storage Performance Counters 2Field Usage

GAVG/cmd Average guest operating system latency per command in milliseconds.

GAVG/rd Average guest operating system read latency per read operation in milliseconds

GAVG/wr Average guest operating system write latency per write operation in milliseconds

GID Resource pool ID of running world’s resource pool.

ID Resource pool ID of the running world’s resource pool or the world ID of the running world.

KAVG/cmd Average ESX Server VMkernel latency per command in milliseconds.

KAVG/rd Average ESX Server VMkernel read latency per read operation in milliseconds

KAVG/wr Average ESX Server VMkernel write latency per write operation in milliseconds

LIDStorage adapter channel target LUN ID. This ID is visible only if the corresponding adapter, channel and target are expanded.

LOADRatio of ESX Server VMkernel active commands plus ESX Server VMkernel queued commands to queue depth. This statistic is applicable only to worlds and devices.

LQLENLUN queue depth. Maximum number of ESX Server VMkernel active commands that the LUN is allowed to have

MBREAD/s Megabytes read per second

MBWRTN/s Megabytes written per second.

NAME Name of running world’s resource pool or name of the running world.

Case Studies - Storage

Page 80: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 80

Storage Performance Counters 3

Field Usage

NCHNS Number of channels.

NDV Number of devices

NLUNS Number of LUNs

NPH Number of paths/partitions

NTGTS Number of targets

NUMBLKSNumber of blocks of the device. It is valid only if the corresponding world is expanded to devices.

NVMS Number of worlds

NWD Number of worlds.

PAECMD/s The number of PAE (Physical Address Extension) commands per second.

PAECP/s Number of PAE copies per second. This statistic is applicable only to paths.

PARTITION Partition ID This ID is visible only if the corresponding device is expanded to partitions.

PATH Path name This name is visible only if the corresponding device is expanded to paths.

QAVG/cmd Average queue latency per command in milliseconds.

QAVG/rd Average queue latency per read operation, in milliseconds

QAVG/wr Average queue latency per write operation, in milliseconds.

Case Studies - Storage

Page 81: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 81

Storage Performance Counters 4Field Usage

QUED Number of commands in the ESX Server VMkernel that are currently queued

READS/s Number of read commands issued per second.

RESETS/s Number of commands reset per second.

SHARES Number of shares

SPLTCMD/s The number of split commands per second.

SPLTCP/s The number of split copies per second.

WID

Storage adapter channel target LUN world ID. This ID is visible only if the corresponding adapter, channel, target and LUN are expanded.

WQLENWorld queue depth. Maximum number of ESX Server VMkernel active commands that the world is allowed to have. This is a per LUN maximum for the world

WQLEN

World queue depth. This is the maximum number of ESX Server VMkernel active commands that the world is allowed to have. This is a per device maximum for the world. It is valid only if the corresponding device is expanded to worlds.

WRITES/s Number of write commands issued per second

Case Studies - Storage

Page 82: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 82

Esxtop disk adapter screen (d)

Host bus adapters (HBAs) - includes SCSI,

iSCSI,RAID, and FC-HBA adapters

Latency stats from the Device, Kernel and the

Guest

DAVG/cmd - Average latency (ms) from the Device (LUN)

KAVG/cmd - Average latency (ms) in the VMKernel

GAVG/cmd - Average latency (ms) in the Guest

Case Studies - Storage

Page 83: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 83

Esxtop disk device screen (u)

LUNs in C:T:L format (Controller: Target: LUN)

Case Studies - Storage

Page 84: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 84

Esxtop disk VM screen (v)

running VMs

Case Studies - Storage

Page 85: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 88

Test Bed Idle State– Device Adapter View

Storage Adapter

Maximum Queue Length

LUN Maximum Queue Length

World Maximum Queue Length

Average Device Latency, Per Command

Case Studies - Storage

Page 86: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 89

Moderate load on two virtual machines

Acceptable latency from the disk subsystem

Commands are queued BUT….

Case Studies - Storage

Page 87: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 90

Heavier load on two virtual machines

Virtual machine latency is consistently above

20ms/second, performance could start

to be an issue

Commands are queued and are exceeding

maximum queue lengths BUT….

Case Studies - Storage

Page 88: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 91

Heavy load on four virtual machines

Virtual machine latency is consistently above

60 ms/second for some VMs, performance will

be an issue

Commands are queued and are exceeding

maximum queue lengths AND….

Case Studies - Storage

Page 89: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 92

Artificial Constraints on Storage

Bad throughput

Good throughput Problem with the disk subsystem

Device Latency is high - cache disabled

Low device Latency

Case Studies - Storage

Page 90: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 93

Network – Key Question and Considerations

Is the bandwidth and configuration of the network sufficient to meet the desired throughput without network errors? Configuration – Is the network configuration known and is the network correctly designed for the needed throughput?

Error Indicators – Dropped packets and low throughput without errors indicates a problem.

Network Counters – These are fairly high level counters. Most network analysis would be done external to the system.

Case Studies - Network

Page 91: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 94

Network Performance Counters 1

Field Usage

PORT Virtual network device port ID

UPLINK Y means the corresponding port is an uplink. N means it is not.

UP Y means the corresponding like is up. N means it is not.

SPEED Link speed in Megabits per second.

FDUPLX Y means the corresponding link is operating in full duplex. N means it is not.

USED Virtual network device port user.

DTYP Virtual network device type. H means hub and S means switch.

PKTTx/s Number of packets transmitted per second.

PKTRX/s Number of packets received per second.

MbTX/s MegaBits transmitted per second.

MbRX/s MegaBits received per second

%DRPTX Percentage of transmit packets dropped

%DRPRX Percentage of receive packets dropped

Case Studies - Network

Page 92: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 95

Esxtop network screen (n)

PKTTX/s - Packets transmitted /secPKTRX/s - Packets received /secMbTx/s - Transmit Throughput in Mbits/secMbRx/s - Receive throughput in Mbits/sec

Port ID: every entity is attached to a port on the virtual switchDNAME - switch where the port belongs to

Physical NIC

Service console

NIC Virtual NICs

Case Studies - Network

Page 93: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 96

Test Bed Idle State – Network View

Case Studies - Network

Page 94: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 97

Test Bed Under Heavy Network Load

No packets are being dropped despite the load

Heavy transmit

and receive loads

BUT….

Case Studies - Network

Page 95: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 98

Closing Thoughts

Know the key counters to look at for each type of resource Be careful on what type of resource allocation technique you use for CPU and RAM. One size may NOT fit all. Consider the impact of events such as maintenance on the performance of a cluster Set up a simple test bed where you can create simple loads to become familiar with the various performance counters and tools Compare your test bed analysis and performance counters with the development and production clusters Know your storage subsystem components and configuration due to the large impact this can have on overall performance Take the time to learn how the various components of the virtual infrastructure work together

Page 96: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 99

Performance Resources

The performance community

http://communities.vmware.com/community/vmtn/general/performance

Performance analysis techniques whitepaper

http://communities.vmware.com/docs/DOC-3930

Performance web page for white papers

http://www.vmware.com/overview/performance

VROOM!—VMware performance blog

http://blogs.vmware.com/performance

Page 97: Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved. “So What Do Those VMware Counters Really Mean and How Do They Help Me Identify

Copyright © 2009 Siemens Medical Solutions USA, Inc. All rights reserved.Page 100

John Paul – [email protected]