TRANSCRIPT
CMG: VMware vSphere Performance Boot Camp
John Paul, Managed Services, R&D
February 29, 2012
Copyright © 2012 Siemens Medical Solutions USA, Inc. All rights reserved.
Acknowledgments and Presentation Goal
The material in this presentation was pulled from a variety of sources, some of which was graciously provided by VMware and Intel staff members. I acknowledge and thank the VMware and Intel staff for their permission to use their material in this presentation.
This presentation is intended to review the basics of performance analysis for the virtual infrastructure, with detailed information on the tools and counters used. It presents a series of examples of how the performance counters report different types of resource consumption, with a focus on key counters to observe. The performance counters do change, and we are not going to go over all of them.
ESXTOP and RESXTOP will be used interchangeably since both tools effectively provide the same counters. The screen shots in this presentation have their colors inverted for readability. The presentation shows screen shots from vSphere 4 and 5 since both are actively in use.
Introductory Comments
Introductory Comments
Trends to Consider – Hardware
Intel Strategies – Intel necessarily had to move to a horizontal, multi-core strategy due to the physical restrictions of what could be done on the current technology base. This resulted in:
 Increasing number of cores per socket
 Stabilization of processor speed (i.e., processor speeds no longer are increasing according to Moore's Law and in fact are slower for newer models)
 Focus on new architectures that allow for more efficient movement of data between memory and the processors, between external sources and the processors, and larger and faster caches associated with different components
OEM Strategies – As Intel moved down this path, system OEMs have assembled the Intel (and AMD) components in different ways (such as multi-socket) to differentiate their offerings. It is important to understand the timing of the Intel architectural releases, the OEM implementation of those releases, and the operating system vendors' use of those features in their code. They aren't always aligned.
Introductory Comments
Trends to Consider – Hardware Assisted Virtualization
Intel VT-x and AMD-V – This provides two forms of CPU operation (root and non-root), allowing the virtualization hypervisor to be less intrusive during workloads. Hardware virtualization with a Virtual Machine Monitor (VMM) versus binary translation resulted in a substantial performance improvement.
Intel EPT and AMD RVI – Memory management unit (MMU) virtualization supports extended page tables (EPT), which eliminated the need for ESX to maintain shadow page tables.
Intel VT-d and AMD-Vi – I/O virtualization assist allows the virtual machines to have direct access to hardware I/O devices, such as network cards and storage controllers (HBAs).
Introductory Comments
Trends to Consider – Software
VMware/Hypervisor Strategies – Horizontal scalability at the hardware layer requires comparable scalability for the hypervisor:
 NUMA support required hypervisor scheduler changes (Wide NUMA)
 Larger CPU and RAM virtual machines
 Efficiency while running larger, more complex workloads
 Abstraction model versus consolidation model for some workloads
 Performance guarantees for the Core Four
 Federation of management tools and resource pools
Introductory Comments
Scheduler
vCPU – The vCPU is an aggregation of the time the scheduler allocates to the workload. It time-slices each core based upon the type of configuration, and it constantly changes which core the VM is on, unless affinity is used.
SMP Lazy Scheduling – The scheduler continues to evolve, using lazy scheduling to launch individual vCPUs and then having others "catch up" if CPU skewing occurs. Note that this has improved across releases.
SMP – Note that SMP effectiveness is NOT linear, depending upon workloads. It is very important to load test your workloads on your hardware to measure the efficiency of SMP. We have found that the higher the number of SMP vCPUs, the lower the efficiency.
Introductory Comments
Resource Pools – These are really a way to group the amount of resources allocated to a specific workload, or grouping of workloads, across VMs and hosts.
Single Unit of Work – It is easy to miss the fact that resource "sharing" really does not affect the single unit of work. The Distributed Resource Scheduler (DRS) moves VMs across ESX hosts where there may be more resources.
Perfmon Counters – The inclusion of ESX counters (via VMware Tools) into the Windows Perfmon counters is helpful for overall analysis and higher-level performance analysis. Many counters are not exposed yet in Perfmon.
Introductory Comments
Hyper-threading – The pre-Nehalem Intel architecture had some problems with hyper-threading efficiencies, causing many people to turn off hyper-threading. The Nehalem architecture seems to have corrected those problems, and vSphere is now hyper-threading aware. You need to understand how hyper-threading works so you know how to interpret the VMware tools (such as ESXTOP, ESXPlot). The different cores will be shown as equals while they don't have equal capacity.
Microsoft Hyper-V – While we are not going to be diving into Hyper-V (or Xen), the basic principles of the hypervisors are the same, though the implementation is quite different. There are good reasons why VMware leads the market in enterprise virtualization implementations.
Introductory Comments – Decision Points
WHAT should we change? – VMware continues to expose more performance-changing settings. One of the key questions that needs to be answered is whether you should take the default settings for operational simplicity or fine-tune for the best possible performance.
NUMA Awareness and Control – Should the NUMA control be turned over to the guest operating system?
ESXTOP versus RESXTOP – Though both work, the use of ESXTOP on the actual host requires SSH to be enabled, which may violate security guidelines.
vSphere Architecture
VMware ESX Architecture
[Architecture diagram] Key points from the diagram:
 CPU is controlled by the scheduler and virtualized by the monitor
 The monitor supports BT (Binary Translation), HW (Hardware assist), and PV (Paravirtualization)
 Memory is allocated by the VMkernel and virtualized by the monitor
 Network and I/O devices are emulated and proxied through native device drivers
 Guest-side components: TCP/IP stack, file system, virtual NIC, virtual SCSI
 VMkernel components: memory allocator, scheduler, virtual switch, file system, NIC drivers, I/O drivers, all running on the physical hardware
Performance Analysis Basics
Key Reference Documents
 vSphere Resource Management (EN-000591-01)
 Performance Best Practices for VMware vSphere 5.0 (EN-000005-04)
Performance Analysis Basics
Types of Resources – The Core Four (Plus One)
Though the Core Four resources exist at both the ESX host and virtual machine levels, they are not the same in how they are instantiated and reported against.
 CPU – processor cycles (vertical), multi-processing (horizontal)
 Memory – allocation and sharing
 Disk (a.k.a. storage) – throughput, size, latencies, queuing
 Network – throughput, latencies, queuing
Though all resources are limited, ESX handles the resources differently. CPU is more strictly scheduled; memory is adjusted and reclaimed (more fluid) if based on shares; disk and network are fixed-bandwidth (except for queue depths) resources.
The fifth Core Four resource is virtualization overhead!
Performance Analysis Basics
vSphere Components in a Context You Are Used To
 World – The smallest schedulable component for vSphere; similar to a process in Windows or a thread in other operating systems
 Groups – A collection of ESX worlds, often associated with a virtual server or a common set of functions, such as:
• Idle
• System
• Helper
• Drivers
• Vmotion
• Console
Performance Analysis Basics
The Five Contexts of Virtualization and Which Tools to Use for Each
[Diagram] The five contexts, from smallest to largest scope:
 Physical Machine – application, operating system, Intel hardware (PCPU, PMemory, PNIC, PDisk)
 Virtual Machine – application, operating system, virtual hardware (VCPU, VMemory, VNIC, VDisk)
 ESX Host Machine – the virtual machines plus the host's Intel hardware
 ESX Host Farm/Cluster – a group of ESX hosts
 ESX Host Complex – the full set of farms/clusters
The diagram pairs each context with its monitoring tools: PerfMon inside the physical and virtual machines, ESXTOP at the ESX host level, and Virtual Center at the farm/cluster and complex levels.
Remember the virtual context
Performance Analysis Basics
Types of Performance Counters (v4)
 Static – Counters that don't change during runtime, for example MEMSZ (memsize), adapter queue depth, VM name. The static counters are informational and may not be essential during performance problem analysis.
 Dynamic – Counters that are computed dynamically, for example CPU load average and memory over-commitment load average.
 Calculated – Some are calculated from the delta between two successive snapshots; the refresh interval (-d) determines the time between snapshots. For example:
%CPU used = (CPU used time at snapshot 2 - CPU used time at snapshot 1) / time elapsed between snapshots
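The calculated-counter formula above can be sanity-checked with a short sketch. The snapshot values are hypothetical; the function name is illustrative, not an esxtop API:

```python
def pct_cpu_used(used_ms_t1, used_ms_t2, elapsed_ms):
    """Calculated counter: %CPU used between two successive snapshots,
    i.e. (delta of used time) / (elapsed wall time) expressed as a percent."""
    return 100.0 * (used_ms_t2 - used_ms_t1) / elapsed_ms

# With a 5-second refresh interval (-d 5), 2,500 ms of used time -> 50%.
print(pct_cpu_used(10_000, 12_500, 5_000))  # 50.0
```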
Performance Analysis Basics
A Review of the Basic Performance Analysis Approach
 Identify the virtual context of the reported performance problem
• Where is the problem being seen? ("When I do this here, I get that")
• How is the problem being quantified? ("My function is 25% slower")
• Apply a reasonability check ("Has something changed from the status quo?")
 Monitor the performance from within that virtual context
• View the performance counters in the same context as the problem
• Look at the ESX cluster level performance counters
• Look for atypical behavior ("Is the amount of resources consumed characteristic of this particular application or task for the server processing tier?")
• Look for repeat offenders! This happens often.
 Expand the performance monitoring to each virtual context as needed
• Are other workloads influencing the virtual context of this particular application and causing a shortage of a particular resource?
• Consider how a shortage is instantiated for each of the Core Four resources
Performance Analysis Basics
Resource Control Revisited – CPU Example
 Reservation (Guarantees)
• Minimum service level guarantee (in MHz)
• When the system is overcommitted it is still the target
• Needs to pass admission control for start-up
 Shares (Share the Resources)
• CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued
• Abstract number, only the ratio matters
 Limit
• Absolute upper bound on CPU entitlement (in MHz)
• Applies even when the system is not overcommitted
[Diagram: a vertical scale from 0 MHz to total MHz, with the reservation as the floor and the limit as the ceiling]
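A minimal sketch of how these three controls interact. The numbers and function names are hypothetical illustrations of the proportional-share idea, not VMware's actual scheduling algorithm:

```python
def share_entitlement(vm_shares, total_shares, available_mhz):
    """Under contention, CPU entitlement is proportional to the VM's
    fraction of the total shares issued; only the ratio matters."""
    return available_mhz * vm_shares / total_shares

def clamp_entitlement(entitled_mhz, reservation_mhz, limit_mhz):
    """The reservation is a guaranteed floor; the limit is an absolute
    ceiling that applies even when the host is not overcommitted."""
    return max(reservation_mhz, min(entitled_mhz, limit_mhz))

# Two VMs with 2000 shares each and one with 1000 shares contend for 10,000 MHz:
e = share_entitlement(2000, 5000, 10_000)  # 4000.0 MHz by share ratio
print(clamp_entitlement(e, 1000, 3000))    # a 3000 MHz limit caps it
```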
Tools
Key Reference Documents
 vSphere Monitoring and Performance (EN-000620-01)
• Chapter 7 – Performance Monitoring Utilities: resxtop and esxtop
 Esxtop for Advanced Users (VSP1999, VMworld 2011)
Tools
Load Generators, Data Gatherers, Data Analyzers
 Load Generators
• IOMeter – www.iometer.org
• Consume – Windows SDK
• SQLIOSIM – http://support.microsoft.com/?id=231619
 Data Gatherers
• ESXTOP
• Virtual Center
• Vscsistats
 Data Analyzers
• ESXTOP (interactive or batch mode)
• Windows Perfmon/System Monitor
• ESXPLOT
Tools
A Comparison of ESXTOP and the vSphere Client
 vC gives a graphical view of both real-time and trend consumption
 vC combines real-time reporting with short-term (1 hour) trending
 vC can report on the virtual machine, ESX host, or ESX cluster
 vC has performance overview charts in vSphere 4 and 5
 vC is limited to 2 unit types at a time for certain views
 ESXTOP allows more concurrent performance counters to be shown
 ESXTOP has a higher system overhead to run
 ESXTOP can sample down to a 2 second sampling period
 ESXTOP gives a detailed view of each of the Core Four
 Recommendation – Use vC to get a general view of the system performance, but use ESXTOP for detailed problem analysis.
Tools
An Introduction to ESXTOP/RESXTOP
 Launched through the vSphere Management Assistant (vMA) or CLI, or via an SSH session (ESXTOP) with the ESX host
 Screens (version 5):
• c: cpu (default)
• d: disk adapter
• h: help
• i: interrupts
• m: memory
• n: network
• p: power management
• u: disk device
• v: disk VM
 Can be piped to a file and then imported into System Monitor/ESXPLOT
 Horizontal and vertical screen resolution limits the number of fields and entities that can be viewed, so choose your fields wisely
 Some of the rollups and counters may be confusing to the casual user
Tools
ESXTOP: New Counters in vSphere 5.0
 World, VM count, vCPU count (CPU screen)
 %VMWait (%Wait - %Idle, CPU screen)
 CPU clock frequency in different P-states (power management screen)
 Failed Disk IOs (disk adapter screen)
• FCMDs/s – failed commands per second
• FReads/s – failed reads per second
• FMBRD/s – failed megabyte reads per second
• FMBWR/s – failed megabyte writes per second
• FRESV/s – failed reservations per second
 VAAI: Block Deletion Operations (disk adapter screen)
• Same counters as Failed Disk IOs above
 Low-Latency Swap (Host Cache – disk screen)
• LLSWR/s – swap-in rate from host cache
• LLSWW/s – swap-out rate to host cache
Tools
ESXTOP: Help Screen (v5)
[Screen shot]
Tools
ESXTOP: CPU screen (v5)
[Screen shot annotations]
 Time and uptime shown; new counter: Worlds = worlds, VMs, vCPU totals
 Fields hidden from the view: ID = ID; GID = world group identifier; NWLD = number of worlds
Tools
ESXTOP: CPU screen (v4) – expanding groups (press the 'e' key)
[Screen shot annotations]
• In the rolled-up view some stats are cumulative of all the worlds in the group
• The expanded view gives a breakdown per world
• A VM group consists of mks (mouse, keyboard, screen), vcpu, and vmx worlds; SMP VMs have additional vcpu and vmm worlds
• vmm0, vmm1 = virtual machine monitors for vCPU0 and vCPU1, respectively
Tools
ESXTOP CPU Screen (v5): Many New Worlds
[Screen shot annotation: new processes using little/no CPU resource]
Tools
ESXTOP CPU Screen (v5): Virtual Machines Only (using the V command)
[Screen shot annotation: a value >= 1 means overload]
Tools
ESXTOP CPU Screen (v5): Virtual Machines Only, Expanded
[Screen shot]
Tools
ESXTOP: CPU screen (v4)
[Screen shot annotations]
 PCPU = physical CPU/core
 CCPU = console CPU (CPU 0)
 Press the 'f' key to choose fields
Tools
ESXTOP: CPU screen (v5)
[Screen shot annotations]
 Core usage now shown (new field)
 PCPU = physical CPU; CORE = core CPU (changed field)
Tools
ESXTOP: Idle State on Test Bed (CPU view, v4)
[Screen shot]
Tools
Idle State on Test Bed – GID 32 Expanded (v4), Virtual Machine View
[Screen shot annotations: wait includes idle; rolled-up GID versus expanded GID; cumulative wait % across the five worlds; total idle %]
Tools
ESXTOP: memory screen (v4)
[Screen shot annotations]
 Possible states: high, soft, hard, and low
 Physical memory (PMEM); PCI hole
 VMKMEM – memory managed by the VMkernel
 COSMEM – memory used by the Service Console
Tools
ESXTOP: memory screen (v5)
[Screen shot annotations: NUMA stats (new fields); changed field]
Tools
ESXTOP: memory screen (v4.0)
[Screen shot annotations]
 Swapping activity in the Service Console and in the VMkernel
 SZTGT = size target, determined by reservation, limit, and memory shares
 SWTGT = swap target; SWTGT = 0 means no swapping pressure
 SWCUR = currently swapped; SWCUR = 0 means no swapping in the past
 MEMCTL = balloon driver
 SWR/s = swap reads/sec; SWW/s = swap writes/sec; SWR/s, SWW/s = 0 means no swapping activity currently
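The interpretation rules in those annotations can be written down as a small helper. This is a sketch: the counter names follow the screen, but the function and its wording are illustrative:

```python
def swap_status(swtgt, swcur, swr_s, sww_s):
    """Interpret the ESXTOP swap counters using the slide's rules of thumb."""
    notes = []
    if swcur == 0:
        notes.append("no swapping in the past")        # SWCUR = 0
    if swtgt == 0:
        notes.append("no swapping pressure")           # SWTGT = 0
    if swr_s == 0 and sww_s == 0:
        notes.append("no swapping activity currently") # SWR/s, SWW/s = 0
    return notes or ["swapping observed - investigate memory pressure"]

print(swap_status(swtgt=0, swcur=0, swr_s=0.0, sww_s=0.0))
```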
Tools
ESXTOP: disk adapter screen (v4)
[Screen shot annotations]
 Host bus adapters (HBAs) – includes SCSI, iSCSI, RAID, and FC-HBA adapters
 Latency stats from the device, the kernel, and the guest:
• DAVG/cmd – average latency (ms) from the device (LUN)
• KAVG/cmd – average latency (ms) in the VMkernel
• GAVG/cmd – average latency (ms) in the guest
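Since the guest-observed latency is roughly the device latency plus the VMkernel latency, the kernel's share can be estimated by subtraction. A sketch with hypothetical sample values:

```python
def kernel_latency(gavg_ms, davg_ms):
    """GAVG (guest-observed) is approximately DAVG (device) + KAVG (VMkernel),
    so the VMkernel component can be estimated as GAVG - DAVG."""
    return gavg_ms - davg_ms

# Hypothetical sample: the guest sees 12 ms while the LUN reports 10 ms,
# leaving roughly 2 ms spent in the VMkernel (queuing included).
print(kernel_latency(12.0, 10.0))  # 2.0
```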
Tools
ESXTOP: disk device screen (v4)
[Screen shot] LUNs in C:T:L format (Controller: Target: LUN)
Tools
ESXTOP: disk VM screen (v4)
[Screen shot] Shows the running VMs
Tools
ESXTOP: network screen (v4)
[Screen shot annotations]
 PKTTX/s – packets transmitted/sec
 PKTRX/s – packets received/sec
 MbTx/s – transmit throughput in Mbits/sec
 MbRx/s – receive throughput in Mbits/sec
 Labels call out the physical NICs, the Service Console NIC, and the virtual NICs
 Port ID – every entity is attached to a port on the virtual switch
 DNAME – the switch where the port belongs
Tools
A Brief Introduction to the vSphere Client
 Screens – CPU, Disk, Management Agent, Memory, Network, System
 vCenter collects performance metrics from the hosts that it manages and aggregates the data using a consolidation algorithm. The algorithm is optimized to keep the database size constant over time.
 vCenter does not display many counters for trend/history screens
 ESXTOP defaults to a 5 second sampling rate while vCenter defaults to a 20 second rate.
 Default statistics collection periods, samples, and how long they are stored:

Interval              | Interval Period | Number of Samples | Interval Length
Per hour (real-time)  | 20 seconds      | 180               | 1 hour
Per day               | 5 minutes       | 288               | 1 day
Per week              | 30 minutes      | 336               | 1 week
Per month             | 2 hours         | 360               | 1 month
Per year              | 1 day           | 365               | 1 year
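The roll-up table is internally consistent: for each interval, the sample period multiplied by the number of samples equals the interval length (treating a month as 30 days). A quick check:

```python
# (period in seconds, number of samples, interval length in seconds)
intervals = {
    "hour":  (20,     180, 3_600),       # 20 s  * 180 = 1 hour
    "day":   (300,    288, 86_400),      # 5 min * 288 = 1 day
    "week":  (1_800,  336, 604_800),     # 30 min * 336 = 1 week
    "month": (7_200,  360, 2_592_000),   # 2 h   * 360 = 30 days
    "year":  (86_400, 365, 31_536_000),  # 1 day * 365 = 1 year
}
for name, (period_s, samples, length_s) in intervals.items():
    assert period_s * samples == length_s, name
print("all roll-ups consistent")
```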
Tools
vSphere Client – CPU Screen (v4)
[Screen shot annotations: to change settings; to change screens]
Tools
vSphere Client – Disk Screen (v4)
[Screen shot]
Tools
vSphere Client – Performance Overview Chart (v4)
[Screen shot] Performance overview charts help to quickly identify bottlenecks and isolate root causes of issues.
Tools
Analyzing Performance from Inside a VM – VM Performance Counters Integration into Perfmon
 Access key host statistics from inside the guest OS
 View "accurate" CPU utilization alongside observed CPU utilization
 Third parties can instrument their agents to access these counters using WMI
 Integrated with VMware Tools
Tools
Summarized Performance Charts (v4)
 Quickly identify bottlenecks and isolate root causes
 Side-by-side performance charts in a single view
 Correlation and drill-down capabilities
 Richer set of performance metrics
[Screen shot annotations: key metrics displayed; aggregated usage]
Tools
A Brief Introduction to ESXPlot
 Launched on a Windows workstation
 Imports data from a .csv file
 Allows an in-depth analysis of an ESXTOP batch file session
 Capture data using ESXTOP batch mode from root using an SSH utility:
• esxtop -a -b > exampleout.csv (for verbose capture)
 Transfer the file to a Windows workstation using WinSCP
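Once transferred, a batch capture is an ordinary CSV and can be inspected without ESXPLOT. A sketch using a synthetic stand-in for the capture; the host name and counter path here are illustrative, though real esxtop batch output does follow the Windows perfmon \\host\group\counter header convention with a timestamp in the first column:

```python
import csv
import io

# Synthetic stand-in for an "esxtop -a -b" capture (hypothetical host/counter).
capture = io.StringIO(
    '"(PDH-CSV 4.0)","\\\\esx01\\Physical Cpu(_Total)\\% Util Time"\n'
    '"02/29/2012 10:00:00","42.5"\n'
    '"02/29/2012 10:00:05","55.0"\n'
)

reader = csv.reader(capture)
header = next(reader)
# Find the column for the counter of interest by its full perfmon-style path.
cpu_col = header.index("\\\\esx01\\Physical Cpu(_Total)\\% Util Time")
samples = [float(row[cpu_col]) for row in reader]
print(sum(samples) / len(samples))  # average utilization across samples
```

Real captures have hundreds of columns, which is exactly why filtering to the fields you need (or using ESXPLOT) matters.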
Tools
ESXPlot
[Screen shot]
Tools
ESXPlot Field Expansion: CPU
[Screen shot]
Tools
ESXPlot Field Expansion: Physical Disk
[Screen shot]
Tools
Top Performance Counters to Use for Initial Problem Determination

Physical/Virtual Machine
 CPU (queuing)
• Average physical CPU utilization
• Peak physical CPU utilization
• CPU Time
• Processor Queue Length
 Memory (swapping)
• Average Memory Usage
• Peak Memory Usage
• Page Faults
• Page Fault Delta*
 Disk (latency)
• Split IO/Sec
• Disk Read Queue Length
• Disk Write Queue Length
• Average Disk Sector Transfer Time
 Network (queuing/errors)
• Total Packets/second
• Bytes Received/second
• Bytes Sent/second
• Output queue length

ESX Host
 CPU (queuing)
• PCPU%
• %SYS
• %RDY
• Average physical CPU utilization
• Peak physical CPU utilization
• Physical CPU load average
 Memory (swapping)
• State (memory state)
• SWTGT (swap target)
• SWCUR (swap current)
• SWR/s (swap read/sec)
• SWW/s (swap write/sec)
• Consumed
• Active (working set)
• Swapused (instantaneous swap)
• Swapin (cumulative swap in)
• Swapout (cumulative swap out)
• VMmemctl (balloon memory)
 Disk (latency, queuing)
• DiskReadLatency
• DiskWriteLatency
• CMDS/s (commands/sec)
• Bytes transferred/received/sec
• Disk bus resets
• ABRTS/s (aborts/sec)
• SPLTCMD/s (I/O split cmds/sec)
 Network (queuing/errors)
• %DRPTX (packets dropped – TX)
• %DRPRX (packets dropped – RX)
• MbTX/s (mb transferred/sec – TX)
• MbRX/s (mb transferred/sec – RX)
Performance Counters in Action
CPU – Understanding PCPU versus VCPU
 It is important to separate the physical CPU (PCPU) resources of the ESX host from the virtual CPU (VCPU) resources that are presented by ESX to the virtual machine.
 PCPU – The ESX host's processor resources are exposed only to ESX. The virtual machines are not aware of and cannot report on those physical resources.
 VCPU – ESX effectively assembles a virtual CPU(s) for each virtual machine from the physical machine's processors/cores, based upon the type of resource allocation (e.g., shares, guarantees, minimums).
 Scheduling – The virtual machine is scheduled to run inside the VCPU(s), with the virtual machine's reporting mechanism (such as W2K's System Monitor) reporting on the virtual machine's allocated VCPU(s) and remaining Core Four resources.
Performance Counters in Action
CPU – Key Question and Considerations
 Is there a lack of CPU resources for the VCPU(s) of the virtual machine or for the PCPU(s) of the ESX host?
 Allocation – The CPU allocation for a specific workload can be constrained by the resource settings: the number of CPUs, the amount of shares, or limits. The key field at the virtual machine level is CPU queuing, and at the ESX level it is Ready to Run (%RDY in ESXTOP).
 Capacity – The virtual machine's CPU can be constrained by a lack of sufficient capacity at the ESX host level, as evidenced by the PCPU/LCPU utilization.
 Contention – The specific workload may be constrained by the consumption of workloads operating outside of their typical patterns.
 SMP CPU Skewing – The movement towards lazy scheduling of SMP CPUs can cause delays if one CPU gets too far "ahead" of the other. Look for higher %CSTP (co-schedule pending).
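Because ESXTOP reports %RDY summed across a group's worlds, it helps to normalize it per vCPU before judging contention. A sketch, where the 10% per-vCPU threshold is a commonly used rule of thumb rather than a value from these slides:

```python
def per_vcpu_ready(group_rdy_pct, n_vcpus):
    """ESXTOP's group %RDY is a sum over worlds, so divide by the vCPU
    count to get a comparable per-vCPU ready time."""
    return group_rdy_pct / n_vcpus

def cpu_contention_flag(group_rdy_pct, n_vcpus, threshold_pct=10.0):
    # threshold_pct is an assumed rule of thumb, not from the presentation
    return per_vcpu_ready(group_rdy_pct, n_vcpus) >= threshold_pct

print(cpu_contention_flag(24.0, 2))  # 12% per vCPU -> True
```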
Performance Counters in Action
CPU State Times and Accounting
[Diagram] Accounting: USED = RUN + SYS - OVRLP
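The accounting identity on this slide as a one-line helper; the percentages below are hypothetical sample values:

```python
def used_pct(run_pct, sys_pct, ovrlp_pct):
    """The slide's accounting identity: USED = RUN + SYS - OVRLP.
    System time is charged to the world; overlap time is backed out."""
    return run_pct + sys_pct - ovrlp_pct

print(used_pct(run_pct=70.0, sys_pct=5.0, ovrlp_pct=2.0))  # 73.0
```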
Performance Counters in Action
High CPU within one virtual machine caused by affinity (ESXTOP v4)
[Screen shot annotations: one physical CPU is fully used; one virtual CPU is fully used]
Performance Counters in Action
High CPU within one virtual machine (affinity) (vCenter v4)
[Screen shots: view of the ESX host; view of the VM]
Performance Counters in Action
SMP Implementation WITHOUT CPU Constraints – ESXTOP v4
[Screen shot annotations: 4 physical CPUs fully used; 4 virtual CPUs fully used; Ready to Run acceptable; one 2-vCPU SMP VM]
Performance Counters in Action
SMP Implementation WITHOUT CPU Constraints – vC v4
[Screen shot]
Performance Counters in Action
SMP Implementation with Mild CPU Constraints – v4
[Screen shot annotations: 4 physical CPUs heavily used; 4 virtual CPUs heavily used; Ready to Run indicates problems; one 2-vCPU SMP VM (7 NWLD)]
Performance Counters in Action
SMP Implementation with Severe CPU Constraints – v4
[Screen shot annotations: 4 physical CPUs fully used; 4 virtual CPUs fully used; Ready to Run indicates severe problems; two 2-vCPU SMP VMs (7 NWLD)]
Performance Counters in Action
SMP Implementation with Severe CPU Constraints – v4
[Screen shot]
Performance Counters in Action
CPU Usage – Without Core Sharing
 The ESX scheduler tries to avoid sharing the same core
[Screen shot]
Performance Counters in Action
CPU Usage – With Core Sharing
[Screen shot]
Introduction to the Intel® QuickPath Interconnect
 The Intel® QuickPath Interconnect is a cache-coherent, high-speed, packet-based, point-to-point interconnect used in Intel's next-generation microprocessors (starting in 2H'08)
 A narrow physical link contains 20 lanes; two uni-directional links complete a QuickPath Interconnect port
 Provides high-bandwidth, low-latency connections between processors, and between processors and the chipset
 Maximum data rate of 6.4 GT/s; 2 bytes/T in 2 directions yields 25.6 GB/s per port
[Diagram: four-socket platform]
Interconnect performance for Intel’s next generation microarchitectures
Intel Topologies
[Diagram: QPI topologies for the Intel® Itanium® processor (Tukwila), Nehalem-EX, Nehalem-EP / Intel® Core™ i7, and Lynnfield – platforms range from 4 full-width links (plus 2 half-width links) down to 2 full-width links, 1 full-width link, or no links]
[Diagram: Nehalem-EP 2S example and Nehalem-EX 4S example, CPUs connected to IOHs]
Different Number of Links for Different Platforms
Intel: QPI Performance Considerations
There is not always a direct correlation between processor performance and interconnect latency/bandwidth. What is important is that the interconnect should perform sufficiently to not limit processor performance.
Max theoretical bandwidth:
Max of 16 bits (2 bytes) of "real" data sent across a full-width link during one clock edge
Double-pumped bus with a max initial frequency of 3.2 GHz
2 bytes/transfer * 2 transfers/cycle * 3.2 GHz = 12.8 GB/s
With the Intel® QuickPath Interconnect at 6.4 GT/s, this translates to 25.6 GB/s across two simultaneous unidirectional links
Max bandwidth with packet overhead:
A typical data transaction is a 64-byte cache line
A typical packet has a header Flit, which requires 4 Phits to transmit across the link
The data payload takes 32 Phits to transfer (64 bytes at 2 bytes/Phit)
With CRC sent inline with the data, a data packet requires 4 Phits for the header + 32 Phits of payload
With the Intel® QuickPath Interconnect at 6.4 GT/s, a 64B cache line transfers in 5.6 ns, which translates to 22.8 GB/s across two simultaneous unidirectional links
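The bandwidth figures on this slide can be reproduced with a few lines of arithmetic. This is a sketch using only the numbers stated above (6.4 GT/s, 2-byte transfers, 4 header + 32 payload phits per 64-byte cache line):

```python
# QPI bandwidth arithmetic, using the values from the slide
GT_PER_S = 6.4e9        # transfers per second on a full-width link
BYTES_PER_TRANSFER = 2  # 16 bits of "real" data per clock edge

raw_per_dir = GT_PER_S * BYTES_PER_TRANSFER   # 12.8 GB/s per direction
raw_per_port = 2 * raw_per_dir                # 25.6 GB/s across both links

# Packet overhead: 4 header phits + 32 payload phits per 64-byte cache line
payload_phits, header_phits = 32, 4
efficiency = payload_phits / (payload_phits + header_phits)
effective = raw_per_port * efficiency         # ~22.8 GB/s effective

transfer_ns = (payload_phits + header_phits) / GT_PER_S * 1e9  # 5.625 ns
```

The 22.8 GB/s figure is simply the 25.6 GB/s raw rate scaled by the 32/36 payload-to-packet ratio.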
Memory
Memory – Separating machine and guest memory
It is important to note that some statistics refer to guest physical memory while others refer to machine memory. "Guest physical memory" is the virtual-hardware physical memory presented to the VM. "Machine memory" is the actual physical RAM in the ESX host. In the figure below, two VMs are running on an ESX host, where each block represents 4 KB of memory and each color represents a different set of data on a block.
Inside each VM, the guest OS maps virtual memory to its physical memory. The ESX kernel maps guest physical memory to machine memory. Due to ESX Page Sharing technology, guest physical pages with the same content can be mapped to the same machine page.
Memory
A Brief Look at Ballooning
The W2K balloon driver is located in VMware Tools
ESX sets a balloon target for each workload at start-up and as workloads are introduced/removed
The balloon driver expands memory consumption, requiring the virtual machine operating system to reclaim memory based on its own algorithms
Ballooning routinely takes 10-20 minutes to reach the target
The returned memory is then available for ESX to use
Key ballooning fields:
SZTGT: determined by reservation, limit, and memory shares
SWCUR = 0: no swapping in the past
SWTGT = 0: no swapping pressure
SWR/s, SWW/s = 0: no swapping activity currently
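As a quick illustration, the swap fields above can be read mechanically. This is a hypothetical sketch: the counter names follow the slide, but the sample values and the helper function are invented for illustration:

```python
# Interpreting esxtop-style swap fields for one VM (sample values are invented)
sample = {"SZTGT": 1024.0, "SWCUR": 0.0, "SWTGT": 0.0, "SWR/s": 0.0, "SWW/s": 0.0}

def memory_state(vm):
    # SWCUR = 0 and SWTGT = 0: no past swapping and no current swap pressure
    if vm["SWCUR"] == 0 and vm["SWTGT"] == 0:
        return "no swapping"
    # Non-zero swap read/write rates mean swapping is happening right now
    if vm["SWR/s"] > 0 or vm["SWW/s"] > 0:
        return "actively swapping"
    return "swapped in the past"

memory_state(sample)  # → 'no swapping'
```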
Memory Interleaving Basics
What is it? A process where memory is stored in a non-contiguous form to optimize access performance and efficiency. Interleaving is usually done at cache-line granularity.
Why do it?
Increase bandwidth by allowing multiple memory accesses at once
Reduce hot spots, since memory is spread out over a wider area
To support NUMA (Non-Uniform Memory Access) based OS/applications – a memory organization where there are different access times for different sections of memory, due to memory being located in different locations; concentrate the data for an application in the memory of the same socket
[Diagram: Tylersburg-DP IOH with memory channels Ch 0–Ch 2 and the system memory map]
Non-NUMA (UMA)
Uniform Memory Access (UMA): addresses are interleaved across memory nodes by cache line. Accesses may or may not have to cross the QPI link.
[Diagram: Socket 0 and Socket 1 memory, DDR3 channels, Tylersburg-DP, system memory map]
Uniform Memory Access lacks tuning for optimal performance
NUMA
Non-Uniform Memory Access (NUMA): addresses are not interleaved across memory nodes by cache line. Each CPU has direct access to a contiguous block of memory.
[Diagram: Socket 0 and Socket 1 memory, DDR3 channels, Tylersburg-EP, system memory map]
Thread affinity benefits from memory attached locally
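The difference between the two layouts can be shown with a toy address-mapping model. Everything below is an illustrative assumption, not ESX or hardware behavior: a two-socket system with cache-line interleaving under UMA and a contiguous 4 GiB block per socket under NUMA:

```python
# Toy model: which socket's memory serves a physical address (assumed layout)
CACHE_LINE = 64
NODE_SIZE = 4 * 2**30  # hypothetical 4 GiB of memory per socket

def uma_socket(addr):
    # UMA: addresses interleaved across the two memory nodes by cache line
    return (addr // CACHE_LINE) % 2

def numa_socket(addr):
    # NUMA: each socket owns one contiguous block of the memory map
    return addr // NODE_SIZE

uma_socket(0), uma_socket(64)    # → (0, 1): adjacent lines alternate sockets
numa_socket(0), numa_socket(64)  # → (0, 0): both lines stay on socket 0
```

This is why thread affinity pays off under NUMA: keeping a thread on the socket that owns its data avoids the QPI hop.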
Memory
ESX Memory Sharing - The "Water Bed Effect"
ESX handles memory shares on an ESX host and across an ESX cluster with a result similar to a single water bed, or a room full of water beds, depending upon the action and the memory allocation type:
Initial ESX boot (i.e., "lying down on the water bed") – ESX sets a target working size for each virtual machine, based upon the memory allocations or shares, and uses ballooning to pare back the initial allocations until those targets are reached (if possible).
Steady State (i.e., "minor position changes") – The host gets into a steady state with small adjustments made to memory allocation targets. Memory "ripples" occur during steady state, with the amplitude dependent upon the workload characteristics and consumption by the virtual machines.
New Event (i.e., "second person on the bed") – The host receives additional workload via a newly started virtual machine, or VMotion moves a virtual machine to the host through a manual step, maintenance mode, or DRS. ESX pares back the target working size of that virtual machine while the other virtual machines lose CPU cycles that are directed to the new workload.
Large Event (i.e., "jumping across water beds") – The cluster has a major event that causes a substantial movement of workloads to or between multiple hosts. Each of the hosts has to reach a steady state, or DRS determines that the workload is not a current candidate for the existing host, moving it to another host that has reached a steady state with available capacity. Maintenance mode is another major event.
Memory
Memory – Key Question and Considerations
Is the memory allocation for each workload optimum to prevent swapping at the virtual machine level, yet low enough not to constrain other workloads or the ESX host?
HA/DRS/Maintenance Mode Regularity – How often do the workloads in the cluster get moved between hosts? Each movement causes an impact on the receiving (negative) and sending (positive) hosts, with maintenance mode causing a rolling wave of impact across the cluster, depending upon the timing.
Allocation Type – Each of the allocation types has its drawbacks, so tread carefully when choosing the allocation type. One size seldom is right for all needs.
Capacity/Swapping – The virtual machine's CPU can be constrained due to a lack of sufficient capacity at the ESX host level. Look for regular swapping at the ESX host level as an indicator of a memory capacity issue, but be sure to notice memory leaks that artificially force a memory shortage situation.
Memory
Idle State on Test Bed – Memory View V4
Memory
Memory View at Steady State of 3 Virtual Machines – Memory Shares V4
[Screenshot callouts: most memory is not reserved; one virtual machine just powered on; these VMs are at memory steady state; no VM swapping or swap targets]
Memory
Ballooning and Swapping in Progress – Memory View V4
[Screenshot callouts: possible states are high, soft, hard, and low; ballooning in effect; mild swapping in effect; different size targets due to different amounts of up time]
Memory
Memory Reservations – Effect on New Loads V4
What size virtual machine with reserved memory can be started?
[Screenshot callouts: 6 GB of "free" physical memory due to memory sharing over 20 minutes; 666 MB of unreserved memory; three VMs, each with 2 GB of reserved memory; a fourth virtual machine with >512 MB of reserved memory can't start; a fourth virtual machine with 512 MB of reserved memory starts]
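The arithmetic behind this screenshot can be sketched as a simple admission check: a VM's reservation plus its overhead must fit in the host's unreserved memory. The per-VM overhead value below is a hypothetical placeholder, since the slide does not state it:

```python
# Admission check implied by the slide: reservation + overhead must fit
unreserved_mb = 666   # unreserved host memory, from the screenshot
overhead_mb = 100     # per-VM memory overhead (hypothetical value)

def can_power_on(reservation_mb):
    # ESX refuses to power on a VM whose reservation cannot be guaranteed
    return reservation_mb + overhead_mb <= unreserved_mb

can_power_on(512)  # True: the 512 MB-reservation VM starts
can_power_on(600)  # False: a larger reservation is refused
```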
Memory
Memory Shares – Effect on New Loads V4
[Screenshot callouts: 5.9 GB of "free" physical memory; 6 GB of unreserved memory; three VMs, each with a 2 GB allocation; a fourth virtual machine with a 2 GB memory allocation starts successfully]
Memory
Virtual Machine with Memory Greater Than a Single NUMA Node V5
[Screenshot callouts: remote NUMA access; local NUMA access; % local NUMA access]
Memory
Wide-NUMA Support in V5 – 1 vCPU, 1 NUMA Node
Memory
Wide-NUMA Support in V5 – 8 vCPUs, 2 NUMA Nodes
Power Management
Power Management Screen V5
Power Management
Impact of Power States on CPU
Power Management
Power Management Impact on CPU V5
Storage Considerations
Storage – Key Question and Considerations
Is the bandwidth and configuration of the storage subsystem sufficient to meet the desired latency (a.k.a. response time) for the target workloads? If the latency target is not being met, then further analysis may be very time consuming.
Storage frame specifications refer to the aggregate bandwidth of the frame or its components, not the single-path capacity of those components.
Queuing – Queuing can happen at any point along the storage path, but it is not necessarily a bad thing if the latency meets requirements.
Storage Path Configuration and Capacity – It is critical to know the configuration of the storage path and the capacity of each component along that path. The number of active vmkernel commands must be less than or equal to the queue-depth maximum of any of the storage path components while processing the target storage workload.
Storage Considerations
Storage – Aggregate versus Single Paths
Storage frame specifications refer to the aggregate bandwidth of the frame or its components, not the single-path capacity of those components*:
DMX Message Bandwidth: 4-6.4 GB/s
DMX Data Bandwidth: 32-128 GB/s
Global Memory: 32-512 GB
Concurrent Memory Transfers: 16-32 (4 per Global Memory Director)
Performance measurement for storage is all about individual paths and the performance of the components contained in those paths.
(* Source – EMC Symmetrix DMX-4 Specification Sheet, c1166-dmx4-ss.pdf)
Storage Considerations
Storage – More Questions
Virtual Machines per LUN – The number of outstanding active vmkernel commands per virtual machine, times the number of virtual machines on a specific LUN, must be less than the queue depth of that adapter.
How fast can the individual disk drive process a request?
Based upon the block size and type of I/O (sequential read, sequential write, random read, random write), what type of configuration (RAID, number of physical spindles, cache) is required to match the I/O characteristics and workload demands for average and peak throughput?
Does the network storage (SAN frame) handle the I/O rate down each path and aggregated across the internal bus, frame adaptors, and front-end processors?
In order to answer these questions we need to better understand the underlying design, considerations, and basics of the storage subsystem.
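The first question above is simple multiplication. A minimal sketch, using the WQLEN default from later in this section and an assumed LUN queue depth:

```python
# VMs-per-LUN rule: per-VM outstanding commands * VM count <= LUN queue depth
outstanding_per_vm = 32  # active vmkernel commands per VM (the WQLEN default)
lun_queue_depth = 64     # assumed LUN queue depth for this example

def max_vms_per_lun(per_vm, depth):
    # Largest VM count whose combined outstanding commands fit in the queue
    return depth // per_vm

max_vms_per_lun(outstanding_per_vm, lun_queue_depth)  # → 2
```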
Storage Considerations
Back-end Storage Design Considerations
Capacity – What is the storage capacity needed for this workload/cluster?
Disk drive size (e.g., 144 GB, 300 GB)
Number of disk drives needed within a single logical unit (e.g., LUN)
IOPS Rate – How many I/Os per second are required, with the needed latency?
Number of physical spindles per LUN
Impact of sharing physical disk drives between LUNs
Configuration (e.g., cache) and speed of the disk drive
Availability – How many disk drives or storage components can fail at one time?
Type of RAID chosen, number of parity drives per grouping
Amount of redundancy built into the storage solution
Cost – Delivered cost per byte at the required speed and availability
Many options are available for each design consideration. Final decisions rest on the choice for each component. The cumulative amount of capacity, IOPS rate, and availability often dictate the overall solution.
Storage Considerations
Storage from the Ground Up – Basic Definitions: Mechanical Drives
Disk Latency – The average time it takes for the requested sector to rotate under the read/write head after a completed seek
5400 RPM (5.5 ms), 7200 RPM (4.2 ms), 10,000 RPM (3 ms), 15,000 RPM (2 ms)
Average disk latency = 1/2 * rotation time
Throughput (MB/sec) = (Outstanding I/Os / latency (msec)) * Block size (KB)
Seek Time – The time it takes for the read/write head to find the physical location of the requested data on the disk
Average seek time: 8-10 ms
Access Time – The total time it takes to locate the data on the drive(s). This includes seek time, latency, settle time, and command processing overhead time.
Host Transfer Rate – The speed at which the host can transfer data across the disk interface.
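The per-RPM latency numbers above follow directly from the definition: average rotational latency is half of one full rotation. A quick sketch:

```python
# Average rotational latency is half of one full rotation
def avg_latency_ms(rpm):
    # 60/rpm seconds per rotation, halved, converted to milliseconds
    return 60.0 / rpm / 2 * 1000

[round(avg_latency_ms(r), 1) for r in (5400, 7200, 10000, 15000)]
# → [5.6, 4.2, 3.0, 2.0]  (the slide rounds 5.56 ms to 5.5 ms)
```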
Storage Considerations
Network Storage Components That Can Affect Performance/Availability
Size and use of cache (i.e., % dedicated to reads versus writes)
Number of independent internal data paths and buses
Number of front-end interfaces and processors
Types of interfaces supported (e.g., Fibre Channel and iSCSI)
Number and type of physical disk drives available
MetaLUN Expansion
MetaLUNs allow for the aggregation of LUNs
The system typically re-stripes data when a MetaLUN is changed
Some performance degradation occurs during re-striping
Storage Virtualization
Aggregation of storage arrays behind a presented mount point/LUN
Movements between disk drives and tiers are controlled by storage management
Changes of physical drives and configuration may be transient and severe
Case Studies - Storage
Test Bed Idle State – Device Adapter View V4
[Screenshot callouts: average device latency per command; storage adapter maximum queue length; world maximum queue length; LUN maximum queue length]
Case Studies - Storage
Moderate Load on Two Virtual Machines V4
[Screenshot callouts: acceptable latency from the disk subsystem; commands are queued, BUT....]
Case Studies - Storage
Heavier Load on Two Virtual Machines
[Screenshot callouts: virtual machine latency is consistently above 20 ms, so performance could start to be an issue; commands are queued and are exceeding maximum queue lengths, BUT....]
Case Studies - Storage
Heavy Load on Four Virtual Machines
[Screenshot callouts: virtual machine latency is consistently above 60 ms for some VMs, so performance will be an issue; commands are queued and are exceeding maximum queue lengths, AND....]
Case Studies - Storage
Artificial Constraints on Storage
[Screenshot callouts: good throughput with low device latency; then a problem with the disk subsystem – bad throughput, and device latency is high because the cache was disabled]
Understanding the Disk Counters and Latencies
Understanding Disk I/O Queuing
Storage Considerations
SAN Storage Infrastructure – Areas to Watch/Consider
[Diagram: host HBAs through FC switch/director (ISL) to the SAN frame – areas to watch include HBA speed, fiber bandwidth, FA CPU speed, disk response, RAID configuration, block size, number of spindles in the LUN/array, disk speeds, storage adapter queue length, world queue length, LUN queue length, and cache size/type]
Storage Considerations
Storage Queuing – The Key Throttle Points
[Diagram: VMs 1-4 on the ESX host, each with a World Queue Length (WQLEN), feeding through HBAs (Execution Throttle) into the Storage Area Network, with a LUN Queue Length (LQLEN) at the LUN]
Storage Considerations
Storage I/O – The Key Throttle Point Definitions
Storage Adapter Queue Length (AQLEN) – The number of outstanding vmkernel active commands that the adapter is configured to support. This is not settable; it is a parameter passed from the adapter to the kernel.
LUN Queue Length (LQLEN) – The maximum number of permitted outstanding vmkernel active commands to a LUN. (This would be the HBA queue-depth setting for an HBA.) This is set in the storage adapter configuration via the command line.
World Queue Length (WQLEN) – VMware recommends not to change this! The maximum number of permitted outstanding vmkernel active requests to a LUN from any single virtual machine (min: 1, max: 256, default: 32). Configuration -> Advanced Settings -> Disk -> Disk.SchedNumReqOutstanding
Execution Throttle (this is not a displayed counter) – The maximum number of permitted outstanding vmkernel active commands that can be executed on any one HBA port (min: 1, max: 256, default: ~16, depending on vendor). This is set in the HBA driver configuration.
Storage Considerations
Queue Length Rules of Thumb
For a lightly-loaded system, average queue length should be less than 1 per spindle, with occasional spikes up to 10. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 0.6 per spindle, and less than 0.3 per spindle above a RAID-5 controller.
For a heavily-loaded system that isn't saturated, average queue length should be less than 2.5 per spindle, with infrequent spikes up to 20. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 1.5 per spindle, and less than 1 above a RAID-5 controller.
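These rules of thumb collapse naturally into a single threshold check. A sketch, with the per-spindle limits taken from the slide and the function shape my own:

```python
# Queue-length rules of thumb as a single check (per-spindle limits from slide)
def queue_length_ok(avg_queue, spindles, heavy_load=False,
                    write_heavy=False, raid5=False):
    if write_heavy:
        # Mirrored: 0.6 (light) / 1.5 (heavy); RAID-5: 0.3 (light) / 1.0 (heavy)
        limit = (1.0 if raid5 else 1.5) if heavy_load else (0.3 if raid5 else 0.6)
    else:
        limit = 2.5 if heavy_load else 1.0
    return avg_queue <= limit * spindles

queue_length_ok(8, 10)                                # True: 8 <= 1.0 * 10
queue_length_ok(8, 10, write_heavy=True, raid5=True)  # False: 8 > 0.3 * 10
```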
Closing Thoughts
Know the key counters to look at for each type of resource
Be careful about what type of resource allocation technique you use for CPU and RAM. One size may NOT fit all.
Consider the impact of events such as maintenance on the performance of a cluster
Set up a simple test bed where you can create simple loads to become familiar with the various performance counters and tools
Compare your test-bed analysis and performance counters with the development and production clusters
Know your storage subsystem components and configuration, due to the large impact they can have on overall performance
Take the time to learn how the various components of the virtual infrastructure work together
John Paul – [email protected]