VMware vSphere Performance Troubleshooting

Upload: dan-brinkmann

Post on 15-May-2015

DESCRIPTION

From the Lewan

TRANSCRIPT

Page 1: VMware vSphere Performance Troubleshooting

vSphere Performance Monitoring and Troubleshooting

Overview

What?

CPU, Memory, Disk, Network

How?

Use available tools and a systematic methodology

Why?

Need to build confidence in virtualizing critical and high demand applications

Page 2: VMware vSphere Performance Troubleshooting

Top Issues

Top Issues:

Storage "performance capacity" oversubscription

Memory oversubscription

SMP overuse

Firmware & driver issues

Page 3: VMware vSphere Performance Troubleshooting

What tools do we have at our disposal?

Top tools for information collection:

vCenter - Performance charts and alarms

Guest OS* - Task Manager/Resource Monitor and PerfMon

ESX Host - esxtop and vscsiStats

vSphere PowerCLI

*Guest based monitoring is subject to inaccuracy

Page 4: VMware vSphere Performance Troubleshooting

Prepare vCenter Settings

Page 5: VMware vSphere Performance Troubleshooting

Prepare vCenter Settings

Page 6: VMware vSphere Performance Troubleshooting

Prepare vCenter Settings

Prepare custom vCenter alerts:

Host Console Swap In Rate: 512 KBps Warning, 1024 KBps Alert

Host Console Swap Out Rate: 512 KBps Warning, 1024 KBps Alert

VM CPU Ready: 1000 ms Warning, 2000 ms Alert

VM Disk Latency: 20 ms Warning, 50 ms Alert
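
The alert levels above can be checked against collected samples with a few lines of Python. This is a minimal sketch, not a vCenter feature: the metric keys are hypothetical labels, and it assumes a level triggers only when the value is strictly above the threshold.

```python
# Warning/alert thresholds copied from the slide.
# The dictionary keys are hypothetical labels, not vCenter counter names.
THRESHOLDS = {
    "host_swap_in_kbps": (512, 1024),
    "host_swap_out_kbps": (512, 1024),
    "vm_cpu_ready_ms": (1000, 2000),
    "vm_disk_latency_ms": (20, 50),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'alert' for one sampled value."""
    warning, alert = THRESHOLDS[metric]
    # Assumption: a level fires only when the value is strictly above it.
    if value > alert:
        return "alert"
    if value > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify("vm_cpu_ready_ms", 1500))
    print(classify("vm_disk_latency_ms", 75))
```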

Page 7: VMware vSphere Performance Troubleshooting

Prepare vCenter Settings

Page 8: VMware vSphere Performance Troubleshooting

Prepare vCenter Settings

Page 9: VMware vSphere Performance Troubleshooting

Prepare esxtop

ESXTOP realtime monitoring:

esxtop (run command from SSH or tech-support mode)

s 2 (refresh view every 2 seconds)

V (View VMs only)

h (for quick in-tool command reference)

Batch Mode for a 5 minute capture of all stats:

esxtop -b -a -d 2 -n 150 > esxtop_capture.csv
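
The batch command above takes 150 samples at a 2-second delay, which is 300 seconds, or 5 minutes. The sketch below shows that arithmetic and one way to pull a counter out of the capture. It assumes the file is a perfmon-style CSV whose headers look like `\\host\Group Cpu(id:name)\% Ready`; the sample data, host name, and VM name are hypothetical.

```python
import csv
import io


def iterations_for(duration_s: int, delay_s: int) -> int:
    """Number of batch iterations (-n) needed to cover duration_s at delay -d."""
    return duration_s // delay_s


def columns_matching(csv_text: str, needle: str) -> dict:
    """Collect the numeric series of every column whose header contains needle."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if needle in name]
    series = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            series[header[i]].append(float(row[i]))
    return series


# Hypothetical two-sample capture; real captures have thousands of columns.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx01\\Group Cpu(1001:vm-a)\\% Ready"\n'
    '"05/15/2015 10:00:00","4.20"\n'
    '"05/15/2015 10:00:02","12.75"\n'
)

if __name__ == "__main__":
    print(iterations_for(300, 2))
    for name, values in columns_matching(SAMPLE, "% Ready").items():
        print(name, max(values))
```

With a real capture, read `esxtop_capture.csv` into a string and pass it in place of `SAMPLE`.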

Page 10: VMware vSphere Performance Troubleshooting

Prepare PowerCLI

Run PowerCLI:

Tip: Run as Administrator

Set-ExecutionPolicy remotesigned

Connect-VIServer -Server <host> -Protocol https -User <user> -Password <pass>

<host> can be IP address or name of ESX server or vCenter

Get-VM

Get-Stat -Entity <VM> -common -realtime

Page 11: VMware vSphere Performance Troubleshooting

Where do we get started?

Page 12: VMware vSphere Performance Troubleshooting

Network Overview

[Diagram components: Virtual Machine, vNIC, vSwitch or dvSwitch, pNIC, Core Switch, Network]

Page 13: VMware vSphere Performance Troubleshooting

Network

Troubleshooting Guidance:

1. Physical Issues - A bad cable, a failing switch port or NIC, or an incompatible/flawed firmware or device driver (use VMXNET3 whenever possible)

2. Configuration Issues - Inconsistent configuration of vSwitches, Port Groups, or upstream VLAN trunks

3. Capacity Issues - Too many VMs on a single NIC; inadequate switch backplane or uplink capacity; sharing “unmanaged” network infrastructure for storage and data

4. Thresholds – Bandwidth saturation, dropped packets

Page 14: VMware vSphere Performance Troubleshooting

Network – What can we see?

Systems Management Tools

VMware Monitoring Tools

Core Switch pNIC

Virtual Machine

vNIC

dvSwitch (only)

Page 15: VMware vSphere Performance Troubleshooting

Network

vCenter Metrics:

Receive packets dropped

Transmit packets dropped

Page 16: VMware vSphere Performance Troubleshooting

Network

ESXTOP Metrics:

Display  Metric  Threshold  Explanation
NETWORK  %DRPTX  1          Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization.
NETWORK  %DRPRX  1          Dropped packets received, hardware overworked. Possible cause: very high network utilization.

Page 17: VMware vSphere Performance Troubleshooting

Network

ESXTOP Commands:

esxtop

s 2 (refresh view every 2 seconds)

n (network view)

f (add/remove fields)

Page 18: VMware vSphere Performance Troubleshooting

Network

ESXTOP Example:

Page 19: VMware vSphere Performance Troubleshooting

Network

PowerCLI Commands:

Get-Stat -net -realtime

Get-Stat -Entity <Host> -stat net.droppedRx.summation

Get-Stat -Entity <Host> -stat net.droppedTx.summation

Page 20: VMware vSphere Performance Troubleshooting

Network – What can’t we see?

Network Monitoring Tools

Core Switch pNIC

Virtual Machine

vNIC

Cisco 1000v only

Page 21: VMware vSphere Performance Troubleshooting

Network

Possible resources for external monitoring:

Native Telnet/SSH/HTTP-based interface counters and stats

Third-party SNMP, NetFlow and ICMP tools

Page 22: VMware vSphere Performance Troubleshooting

CPU Overview

Physical CPU

Virtual Machine

CPU

vCPU

Page 23: VMware vSphere Performance Troubleshooting

CPU

Troubleshooting Guidance:

1. Physical Issues - Rare and always catastrophic (i.e., obvious)

2. Configuration Issues - Too many / too few vCPUs per VM; SMP/HAL mismatch; incorrect CPU affinity settings

3. Capacity Issues - CPU saturation at the guest or host level; CPU starvation due to high IO or other system level ops

4. Thresholds – Waiting for CPU cycles (due to co-scheduling, swapping, high IO)

Page 24: VMware vSphere Performance Troubleshooting

CPU – What can we see?

Systems Management Tools

VMware Monitoring Tools

Physical CPU

Virtual Machine

vCPU

Page 25: VMware vSphere Performance Troubleshooting

CPU

vCenter Metrics:

Host/Guest Saturation

Stacked Graph (per VM)

Usage

Page 26: VMware vSphere Performance Troubleshooting

CPU

vCenter Metrics:

Guest

Ready (realtime chart, 20-second samples: value in ms / 200 = n%)

Swap Wait
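
vCenter reports CPU Ready as a summation in milliseconds per sample, so it has to be divided by the sample length to become a percentage. The sketch below assumes the 20-second realtime interval; for other chart intervals pass the matching `interval_s`. The function name is hypothetical.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU Ready summation (ms) to a percentage of the sample interval."""
    # Ready time divided by the interval length in ms, expressed as a percentage.
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    # The 1000 ms and 2000 ms alarm levels used earlier in this deck.
    for ms in (1000, 2000):
        print(ms, "ms ->", cpu_ready_percent(ms), "%")
```

Under this assumption the 1000 ms warning and 2000 ms alert levels correspond to 5% and 10% ready time, and 10% is the %RDY threshold used for esxtop.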

Page 27: VMware vSphere Performance Troubleshooting

CPU

ESXTOP Metrics:

Display  Metric  Threshold  Explanation
CPU      %RDY    10         Overprovisioning of vCPUs, excessive usage of vSMP, or a limit has been set (check %MLMTD).
CPU      %CSTP   3          Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU      %SYS    20         The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU      %MLMTD  0          The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0, the world is being throttled due to the limit on CPU.
CPU      %SWPWT  5          VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.

Page 28: VMware vSphere Performance Troubleshooting

CPU

ESXTOP Commands:

esxtop

s 2 (refresh view every 2 seconds)

V (view VMs only)

c (CPU view)

e GID (expand/contract a VM world)

Page 29: VMware vSphere Performance Troubleshooting

CPU

ESXTOP Example:

Excessive vCPUs

Page 30: VMware vSphere Performance Troubleshooting

CPU

ESXTOP Example: Now with fewer vCPUs

Page 31: VMware vSphere Performance Troubleshooting

CPU

ESXTOP Example:

SMP impacting multiple VMs

Page 32: VMware vSphere Performance Troubleshooting

CPU

PowerCLI Example

Get-Stat -cpu

Get-Stat -Entity <VM> -stat cpu.ready.summation -realtime

Very cool script code at:

http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/

Page 33: VMware vSphere Performance Troubleshooting

CPU – Not much else to see…

Offline Diagnostics & Systems Management Tools

Physical CPU

Virtual Machine

vCPU

Page 34: VMware vSphere Performance Troubleshooting

CPU

Possible resources for external monitoring:

Vendor specific systems management tools, MS System Center, etc.

http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/

Page 35: VMware vSphere Performance Troubleshooting

Memory Overview

Physical RAM

Virtual Machine

Memory

vRAM

Page 36: VMware vSphere Performance Troubleshooting

Memory

Troubleshooting Guidance:

1. Physical Issues - Rare and usually catastrophic

2. Configuration Issues - Memory overcommit; incorrect configuration of shares, reservations or limits

3. Capacity Issues - Physical memory exhaustion

4. Thresholds – Active memory swapping

Page 37: VMware vSphere Performance Troubleshooting

Memory – What can we see?

Systems Management Tools

VMware Monitoring Tools

Physical RAM

Virtual Machine

vRAM

Page 38: VMware vSphere Performance Troubleshooting

Memory

vCenter Metrics

Swap in rate

Swap out rate

Swap used

Page 39: VMware vSphere Performance Troubleshooting

Memory

ESXTOP Metrics:

Display  Metric    Threshold  Explanation
MEM      MCTLSZ    1          If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted.
MEM      SWCUR     1          If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment.
MEM      SWR/s     1          If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment.
MEM      SWW/s     1          If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment.
MEM      CACHEUSD  0          If larger than 0, the host has compressed memory. Possible cause: memory overcommitment.
MEM      ZIP/s     0          If larger than 0, the host is actively compressing memory. Possible cause: memory overcommitment.
MEM      UNZIP/s   0          If larger than 0, the host is accessing compressed memory. Possible cause: the host was previously overcommitted on memory.
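
The memory counters above can be read as a short decision list: ballooning first, then active versus historical swapping, then compression. The sketch below is one hypothetical way to encode that reading; the function name and the finding strings are illustrative, and the counter values would come from an esxtop capture.

```python
def memory_findings(sample: dict) -> list:
    """Map esxtop memory counters to the overcommitment symptoms in the table."""
    findings = []
    if sample.get("MCTLSZ", 0) > 0:
        findings.append("ballooning")
    # Active swap I/O is more urgent than pages swapped at some earlier time.
    if sample.get("SWR/s", 0) > 0 or sample.get("SWW/s", 0) > 0:
        findings.append("active swapping")
    elif sample.get("SWCUR", 0) > 0:
        findings.append("swapped in the past")
    # Same split for compression: happening now versus left over.
    if sample.get("ZIP/s", 0) > 0:
        findings.append("active compression")
    elif sample.get("CACHEUSD", 0) > 0 or sample.get("UNZIP/s", 0) > 0:
        findings.append("compressed pages present")
    return findings


if __name__ == "__main__":
    print(memory_findings({"MCTLSZ": 512, "SWCUR": 128}))
```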

Page 40: VMware vSphere Performance Troubleshooting

Memory

ESXTOP Commands:

esxtop

s 2 (refresh view every 2 seconds)

V (view VMs only)

m (memory view)

f (add/remove fields)

Page 41: VMware vSphere Performance Troubleshooting

Memory

ESXTOP Example: m – Heavy swapping and ballooning

Page 42: VMware vSphere Performance Troubleshooting

Memory

PowerCLI Commands:

Get-Stat -mem

Get-Stat -Entity <VM> -stat mem.swapoutRate.average -realtime

Get-Stat -Entity <VM> -stat mem.swapinRate.average -realtime

Get-Stat -Entity <VM> -stat mem.vmmemctl.average -realtime

Get-Stat -Entity <Host> -stat mem.swapused.average -realtime

Page 43: VMware vSphere Performance Troubleshooting

Memory – The occasional DIMM failure…

Offline Diagnostics & Systems Management Tools

Physical RAM

Virtual Machine

vRAM

Page 44: VMware vSphere Performance Troubleshooting

Memory

Possible external monitoring options:

Vendor specific systems management tools, MS System Center, etc.

Don’t forget vCenter ‘Hardware Status’ reporting

Page 45: VMware vSphere Performance Troubleshooting

Storage Overview

[Diagram components: Virtual Machine, SCSI Controller, vmdk, RDM, Datastore, HBA, Switch, Controller, LUN, Disk, Storage]

Page 46: VMware vSphere Performance Troubleshooting

Storage

Troubleshooting Guidance:

1. Physical Issues - A bad cable, a failing switch port or HBA/NIC, or an incompatible/flawed firmware or device driver (use LSI Logic Parallel/SAS as appropriate)

2. Configuration Issues - Inconsistent or incorrect configuration of LUN masking, zoning, or multi-pathing; inappropriate resource provisioning; queue depth not aligned with storage type

3. Capacity Issues - Too many VMs or VMDKs on a LUN; too much IO load for an array or RAID group

4. Thresholds – Latency and queuing

Page 47: VMware vSphere Performance Troubleshooting

Storage – What can we see?

Systems Management Tools

VMware Monitoring Tools

vmdk

Disk LUN Controller Switch HBA

vmdk vmdk

Datastore

Virtual Machine

SCSI Controller

RDM

Page 48: VMware vSphere Performance Troubleshooting

Storage

vCenter Metrics:

Datastore

Read latency

Write latency

Page 49: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Metrics:

Display  Metric    Threshold  Explanation
DISK     GAVG      20         Look at “DAVG” and “KAVG”, as the sum of both is GAVG.
DISK     DAVG      20         Disk latency most likely to be caused by the array.
DISK     KAVG      2          Disk latency caused by the VMkernel; high KAVG usually means queuing. Check “QUED”.
DISK     QUED      1          Queue maxed out. Possibly queue depth set too low. Check with array vendor for optimal queue depth value.
DISK     ABRTS/s   1          Aborts issued by guest (VM) because storage is not responding. Can be caused by failed paths.
DISK     RESETS/s  1          The number of commands reset per second.
DISK     CONS/s    20         SCSI reservation conflicts per second. Can be caused by too many VMDKs on a datastore.
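
Because GAVG is the sum of DAVG and KAVG, splitting a slow datastore into its device and kernel parts shows where to look next. The sketch below applies the thresholds from the table (20 ms device, 2 ms kernel, 1 queued command). The function name and messages are hypothetical, and treating a value equal to the threshold as a hit is an assumption.

```python
def diagnose_disk_latency(davg_ms: float, kavg_ms: float, queued: int = 0) -> tuple:
    """Return (GAVG in ms, list of findings) for one esxtop disk sample."""
    gavg_ms = davg_ms + kavg_ms  # guest-observed latency, per the table above
    findings = []
    if davg_ms >= 20:
        findings.append("device latency high: look at the array or fabric")
    if kavg_ms >= 2:
        findings.append("kernel latency high: likely queuing")
    if queued >= 1:
        findings.append("commands queued: review queue depth")
    return gavg_ms, findings


if __name__ == "__main__":
    print(diagnose_disk_latency(25.0, 0.5))
    print(diagnose_disk_latency(4.0, 6.0, queued=12))
```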

Page 50: VMware vSphere Performance Troubleshooting

Storage

[Diagram: storage latency layers. Guest: Application, Filesystem, I/O Drivers. VMkernel: Virtual SCSI, Filesystem. Below the VMkernel: Device Queue.]

A = Application Latency

R = Physical Disk “Disk Secs/Transfer”

G = Guest Latency

K = ESX Kernel

D = Device Latency

Page 51: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Commands (HBA/LUN):

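esxtop

s 2 (refresh view every 2 seconds)

V (view VMs only)

d (disk adapter view)

f (add/remove fields)

e vmhba# (expand/contract an adapter)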

Page 52: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Commands(LUN/Datastore):

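esxtop

s 2 (refresh view every 2 seconds)

V (view VMs only)

u (disk device view)

L 38 (set the device name field length to 38 characters)

f (add/remove fields)

e <devname> (expand/contract a device)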

Page 53: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Commands (VM/VMDK):

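esxtop

s 2 (refresh view every 2 seconds)

V (view VMs only)

v (disk VM view)

f (add/remove fields)

e GID (expand/contract a VM world)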

Page 54: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Examples: d - Multipathing / Expand adapter to view targets

Page 55: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Examples: u - Queuing, Disk or Kernel?

Page 56: VMware vSphere Performance Troubleshooting

Storage

ESXTOP Examples: v - Identify the IO consumer

Page 57: VMware vSphere Performance Troubleshooting

Storage

vscsiStats Command:

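[root@host ~]# cd /usr/lib/vmware/bin

./vscsiStats -l (list running VMs and their world IDs)

./vscsiStats -s -w <worldid> (start collection for that world)

./vscsiStats -w <worldid> -p all -c > /path/vscsistats.csv (print all histograms as CSV)

./vscsiStats -x (stop collection)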

Page 58: VMware vSphere Performance Troubleshooting

Storage

vscsiStats Example:

Page 59: VMware vSphere Performance Troubleshooting

Storage

vscsiStats Example:

Page 60: VMware vSphere Performance Troubleshooting

Storage

vscsiStats Example:

http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/

Page 61: VMware vSphere Performance Troubleshooting

Storage

vscsiStats histograms:

IO lengths of commands
IO lengths of Read commands
IO lengths of Write commands
distance (in LBNs) between successive commands
distance (in LBNs) between successive Read commands
distance (in LBNs) between successive Write commands
distance (in LBNs) between each command from the closest of previous 16
latency of IOs in Microseconds (us)
latency of Read IOs in Microseconds (us)
latency of Write IOs in Microseconds (us)
number of outstanding IOs when a new IO is issued
number of outstanding Read IOs when a new Read IO is issued
number of outstanding Write IOs when a new Write IO is issued
latency of IO interarrival time in Microseconds (us)
latency of IO interarrival time for Reads in Microseconds (us)
latency of IO interarrival time for Writes in Microseconds (us)
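
Since vscsiStats reports bucketed histograms rather than raw samples, a percentile can only be located to the bucket that contains it. The sketch below shows that lookup for a latency histogram. The bucket limits and counts are hypothetical sample data, not output copied from a host.

```python
def percentile_bucket(upper_bounds: list, counts: list, pct: float) -> int:
    """Return the upper bound of the bucket that contains the pct-th percentile."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = total * pct / 100.0
    running = 0
    for bound, count in zip(upper_bounds, counts):
        running += count
        if running >= target:
            return bound
    return upper_bounds[-1]


# Hypothetical latency histogram: bucket upper limits in microseconds and IO counts.
BOUNDS_US = [100, 500, 1000, 5000, 15000]
COUNTS = [10, 40, 30, 15, 5]

if __name__ == "__main__":
    print("median bucket (us):", percentile_bucket(BOUNDS_US, COUNTS, 50))
    print("95th percentile bucket (us):", percentile_bucket(BOUNDS_US, COUNTS, 95))
```

The same lookup works for the IO length and outstanding IO histograms, since they use the same limit-and-count layout.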

Page 62: VMware vSphere Performance Troubleshooting

Storage

PowerCLI Commands:

Get-Stat -disk

Get-Stat -stat disk.totalLatency.average -realtime

Get-Stat -stat disk.deviceLatency.average -realtime

Get-Stat -stat disk.kernelLatency.average -realtime

Page 63: VMware vSphere Performance Troubleshooting

Storage – What can’t we see?

Storage Monitoring Tools

vmdk

Disk LUN Controller Switch HBA

vmdk vmdk

Datastore

Virtual Machine

SCSI Controller

RDM

Page 64: VMware vSphere Performance Troubleshooting

Storage – More of what we can’t see

Network/Fabric Monitoring Tools

vmdk

Disk LUN Controller Switch HBA

vmdk vmdk

Datastore

Virtual Machine

SCSI Controller

RDM

Page 65: VMware vSphere Performance Troubleshooting

Storage

Possible external monitoring solutions:

Vendor specific SAN and fabric/network tools, native Telnet/SSH/HTTP-based tools for most networks, third-party SNMP-based tools

Page 66: VMware vSphere Performance Troubleshooting

Working with PowerCLI

PowerCLI Tips:

For a complete list of stat objects:

Get-StatType -Entity <Host/VM>

Pipe the outputs to a file:

Get-Stat -stat <stat> -realtime | ft -autosize > c:\temp\<filename>.csv

Import the CSV file data to a spreadsheet with fixed width parameters

Build pretty graphs

Page 67: VMware vSphere Performance Troubleshooting

Working with PowerCLI

Page 68: VMware vSphere Performance Troubleshooting

Way More Information

ESXTOP / vscsiStats / PowerCLI:

http://www.yellow-bricks.com/esxtop/ Special thanks to Duncan Epping!

http://communities.vmware.com/docs/DOC-3930

http://communities.vmware.com/docs/DOC-9279

http://communities.vmware.com/docs/DOC-10095

http://www.vmware.com/support/developer/PowerCLI/PowerCLI41/html/Get-Stat.html

http://www.lucd.info/2009/12/30/powercli-vsphere-statistics-part-1-the-basics/

http://simongreaves.co.uk/blog/esxtop-guide

http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/

Page 69: VMware vSphere Performance Troubleshooting

Easy button?

What is the problem with these tools?

Limited alerting mechanisms, no collection automation or historical data for comparison, and no correlation of events!

vCenter Operations Standard / Enterprise