esx performance problems 10 steps
Post on 05-Dec-2014
Monitoring and Intelligently Reacting to ESX Performance
Greg Shields, Partner and Principal Technologist, Concentrated Technology, www.ConcentratedTech.com
This slide deck was used in one of our many conference presentations. We hope you enjoy it, and invite you to use it
within your own organization however you like.
For more information on our company, including information on private classes and upcoming conference appearances, please
visit our Web site, www.ConcentratedTech.com.
For links to newly posted decks, follow us on Twitter: @concentrateddon or @concentratdgreg
This work is copyright ©Concentrated Technology, LLC
Class Discussion
What kinds of performance metrics should one monitor on an ESX server?
– Why?
ESX Performance 101
Processor Use
– Processor use on any server > 80%: consider this “overuse”.
– Reduce processing requirements on VMs.
– Migrate VMs elsewhere, rebalance.
Memory Use
– Memory use on any server > 80%: consider this “overuse”.
– Reduce assigned vRAM to VMs, if possible.
– Migrate VMs elsewhere, rebalance.
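The 80% rule of thumb can be sketched as a small check. This is only an illustration: the function name `overused` and the sample figures are hypothetical, and the percentages are assumed to have been read off the Performance tab already.

```python
# Hypothetical helper: flag hosts whose CPU or memory use exceeds the
# 80% rule of thumb from the slide.
OVERUSE_THRESHOLD_PCT = 80.0


def overused(hosts):
    """Return {host: [resource, ...]} for resources above the threshold.

    `hosts` maps a host name to {"cpu": pct, "memory": pct}.
    """
    flagged = {}
    for name, usage in hosts.items():
        hot = [res for res, pct in sorted(usage.items())
               if pct > OVERUSE_THRESHOLD_PCT]
        if hot:
            flagged[name] = hot
    return flagged


if __name__ == "__main__":
    # Made-up sample readings, not measurements.
    sample = {"esx01": {"cpu": 91.0, "memory": 62.0},
              "esx02": {"cpu": 45.0, "memory": 50.0}}
    print(overused(sample))
```

A host that shows up here is a candidate for the two remedies on the slide: trim what its VMs ask for, or migrate some of them elsewhere.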
ESX Performance 201
Network throughput
– Network throughput > 80% and steady: begin analyzing throughput consumption.
– Consider re-routing heavy consumption to independent pNICs & independent vSwitches.
– Rebalance load, although this tends to just shift problems.
Context Switches
– Context switches significantly higher than baseline
– Analyze workload. Consider V2P.
– Rebalance.
– Upgrade hardware to Nehalem / Opteron
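"Steady" and "significantly higher than baseline" are judgment calls, so any code has to pick numbers. The sketch below assumes "steady" means several consecutive samples above 80%, and "significantly" means more than twice the baseline. Both values, and both function names, are placeholders rather than anything from the deck.

```python
def sustained_above(samples, threshold=80.0, min_run=3):
    """True if at least `min_run` consecutive samples exceed `threshold`.

    `min_run` is an assumed stand-in for "steady".
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False


def exceeds_baseline(current, baseline, factor=2.0):
    """True if `current` is more than `factor` times the baseline.

    `factor` is an assumed stand-in for "significantly higher".
    """
    return current > baseline * factor
```

A single spike does not trip `sustained_above`; only a run of high samples does, which matches the slide's distinction between a busy moment and a steady condition.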
ESX Performance 201
IOPS
– IOPS demand > IOPS supply: consider this “overuse”.
– Analyze with esxtop or Disk | Usage in the Performance tab.
– Adding disks spreads spindle demand, reduces contention.
– Consider more/smaller datastores.
– Consider new storage hardware that can rebalance internally based on observed contention. $$$
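The demand-versus-supply comparison is simple arithmetic once a per-disk figure is chosen. The sketch below is deliberately naive: it treats supply as disks times IOPS per disk and ignores RAID write penalties, caching, and controller limits. The per-disk figure is something you supply from your own hardware; the function names are hypothetical.

```python
import math


def iops_headroom(demand_iops, disks, iops_per_disk):
    """Supply minus demand; a negative result means demand > supply."""
    return disks * iops_per_disk - demand_iops


def disks_needed(demand_iops, iops_per_disk):
    """Smallest number of disks whose combined IOPS covers the demand."""
    return math.ceil(demand_iops / iops_per_disk)
```

This is the "adding disks spreads spindle demand" point in numbers: the same demand divided over more spindles leaves each one with less to do.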
DEMO: ESX performance tab.
DEMO: Customizing perf stats intervals
Thank you! Class Dismissed!
“Uh, Gimme a Break, Greg. Is that All You’ve Got?”
ESX Performance 301
The Structured Approach!
– Greg’s TEN STEP Plan to VM Happiness
– Computers are deterministic.
– Virtual computers are as well; however, they are much more complicated.
– Virtual computers have so many more dependencies than traditional computers. This makes the ad hoc process less intuitive.
– Your “gut feeling” with virtual environments is less effective.
Homework Reading: Performance Troubleshooting for VMware vSphere 4. Get it at VMware.com.
Step 1: VMware Tools
If the VMware Tools aren’t working, this will cause numerous low-level issues.
– Always start by verifying their functionality.
DEMO: Verifying VMware Tools status
Step 2: Verify Host CPU Saturation
CPU saturation on an ESX host creates contention, which slows down all VMs.
– Performance | Advanced
– CPU | Usage
– Is this number consistently above 75%?
– If yes, go to Step 3.
Step 3: Verify VM Ready Time
If host CPU usage is high, then the next step is to see which VM is causing the problem.
– Select Host | Virtual Machines tab | Host CPU – MHz column.
– Locate high-use VM.
– Select VM | Performance tab | CPU | Ready (all vCPUs)
If Ready > 2000ms for any vCPU, then host CPU saturation exists.
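The 2000ms figure only means something relative to the chart's sample interval. Here is a minimal sketch of the conversion, assuming the real-time performance chart and its 20-second sample interval. The slide does not state the interval, so treat it as an assumption; the function names are hypothetical.

```python
# Convert a CPU Ready summation value (milliseconds per sample interval)
# into a percentage of that interval.
REALTIME_INTERVAL_S = 20  # assumed: real-time charts sample every 20 seconds


def ready_percent(ready_ms, interval_s=REALTIME_INTERVAL_S):
    """Percentage of the interval a vCPU spent ready but not scheduled."""
    return ready_ms * 100.0 / (interval_s * 1000.0)


def saturated_vcpus(ready_ms_by_vcpu, limit_ms=2000):
    """Return the vCPU ids whose Ready time exceeds the slide's 2000 ms."""
    return [vcpu for vcpu, ms in sorted(ready_ms_by_vcpu.items())
            if ms > limit_ms]
```

Under that assumption, 2000 ms is 10% of a 20-second sample. On a chart with a longer roll-up interval the same 2000 ms would be a much smaller percentage, so check the interval before applying the threshold.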
Step 3: Solutions
Rebalance VMs.
Move VMs off this host.
Increase CPU shares available to host, if resource constrained.
– Resource Pools can do this.
Reduce the number of vCPUs assigned to VMs.
Add hosts.
Step 4: Verify Guest CPU Saturation
Remember that CPU saturation can happen on the host, but it can also happen in the VM.
– Shares/Limits/Other can restrict guest processing.
– “Everything looks good on the host, but the guest is running at 100%”
Check VM CPU for saturation
– Select VM | Performance tab | CPU | Usage
– Is this number consistently above 75%?
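Steps 2 through 4 amount to a small decision: check the host first, then the guest. The sketch below is one possible reading of that flow. "Consistently" is not defined on the slides, so it is assumed here to mean that at least 90% of samples exceed 75%. The function names and return strings are illustrative only.

```python
CPU_LIMIT_PCT = 75.0  # threshold used in Steps 2 and 4


def mostly_above(samples, limit=CPU_LIMIT_PCT, share=0.9):
    """True if at least `share` of the samples exceed `limit`.

    `share` is an assumed reading of "consistently".
    """
    if not samples:
        return False
    above = sum(1 for s in samples if s > limit)
    return above / len(samples) >= share


def classify_cpu(host_samples, vm_samples):
    """Decide which CPU step applies, checking the host before the guest."""
    if mostly_above(host_samples):
        return "host saturated: go to Step 3"
    if mostly_above(vm_samples):
        return "guest saturated: see Step 4 solutions"
    return "no CPU saturation seen"
```

The second branch is the "host looks fine, guest is at 100%" case from the slide, where shares or limits are the usual suspects.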
Step 4: Solutions
The VM is working too hard
– (Aren’t we all?)
– Not getting enough resources to accomplish its task. Assign more CPU shares.
– Installed workload not well-throttled. Throttle or reconfigure applications. Balance processing across time of day.
– Add vCPUs. Only do this if the application is multi-threaded.
– Remove pinning of processes to processors.
Step 4½: Verify VMs are Actually Using their vCPUs
An interesting reverse!
Assigning multiple vCPUs to a VM that isn’t using them wastes resources.
– If that VM isn’t using the vCPU, remove it so another VM can use it instead.
– Select VM | Performance tab | CPU | Usage
– Look at all vCPU objects.
– Is usage for all vCPUs but one close to 0?
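The "all vCPUs but one close to 0" test is easy to express in code once "close to 0" is given a number. The sketch below assumes anything under 5% counts as idle. That cutoff and the function names are assumptions for illustration.

```python
def idle_vcpus(usage_by_vcpu, idle_pct=5.0):
    """Return the vCPU ids whose usage is below `idle_pct` percent."""
    return [v for v, pct in sorted(usage_by_vcpu.items()) if pct < idle_pct]


def oversized(usage_by_vcpu, idle_pct=5.0):
    """True when a multi-vCPU VM has at most one vCPU doing real work."""
    n = len(usage_by_vcpu)
    return n > 1 and len(idle_vcpus(usage_by_vcpu, idle_pct)) >= n - 1
```

A single-vCPU VM is never reported as oversized, since there is nothing left to remove.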
Step 4½: Solutions
Reduce assigned vCPUs to one.
– …and don’t do that again!
Step 5: Check for Host Memory Swapping
Memory swapping is almost always a condition you want to avoid.
– Swapping exerts an incredible tax on performance.
– A solution of last resort.
– Select Host | Performance tab | Memory | Swap In/Out Rate
– Is either of these above 0?
Step 5: Solutions
Limited solutions for memory swapping.
– Reduce memory overcommit. Drop the level of assigned memory in each VM as appropriate.
– Most of us over-assign memory to VMs anyway. So, at least at first, this can sometimes be effective.
– Reduce reservations. Too many reservations can impact optimization of memory sharing.
– Add RAM.
– Enable resource controls. Note that this might cause VM memory swapping.
DEMO: Verifying a VM’s balloon driver is functioning.
Step 5½: Check for VM Memory Swapping
The solutions for Step 5 can cause downstream effects in each VM.
– You decrease available RAM
– VM doesn’t have enough
– VM itself has to swap
This is a situation just as bad as host swapping.
– Select Host | Performance tab | Memory | Real-Time | Stacked Graph (per VM)
– Are any VMs reporting memory swapping > 0?
– If so, then that VM needs more RAM.
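Steps 5 and 5½ both reduce to "is any swap counter above zero?", once for the host and once per VM. The sketch below is a rough illustration. The function name and input shapes are hypothetical, and the rates are assumed to be in whatever unit the chart reports.

```python
def swapping_report(host_swap_in, host_swap_out, vm_swap_rates):
    """Summarize host-level and per-VM swapping.

    `vm_swap_rates` maps a VM name to its swap rate from the stacked graph.
    """
    return {
        "host_swapping": host_swap_in > 0 or host_swap_out > 0,
        "vms_swapping": sorted(vm for vm, rate in vm_swap_rates.items()
                               if rate > 0),
    }
```

If `host_swapping` is false but `vms_swapping` is not empty, the Step 5 remedies have probably been pushed too far for those VMs.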
Step 5½: Solutions
That VM needs more RAM.
– You’ve gone too far with restricting its resources.
Step 6: Check for Overloaded Storage
Many paths for verifying storage utilization.
– IOPS is an emerging metric.
– Can also verify Command Aborts. Identifies the number of SCSI commands that were aborted.
– Select host | Performance tab | Disk | Command Aborts | Attached LUNs.
– Are any LUNs showing Command Aborts > 0?
Step 6: Solutions
This indicates that the storage layer cannot keep up with the demands of VMs.
– Increase storage performance. $$$
– Segregate storage. Modularity assists here.
– Spread VMFS LUNs across more spindles. Add disks. Reduces storage contention.
– Use tools like vscsiStats to quantify storage behaviors.
– Balance memory with storage. Sometimes throwing more RAM at a VM lessens its storage demand.
– Buy new storage. Buy more storage. $$$
Step 6: vscsiStats
http://communities.vmware.com/docs/DOC-10095
– IO size
– Seek distance
– Outstanding IOs
– Latency in ms
Step 7-1: Check for Inbound Networking Problems
An inbound network problem is a VM that cannot process receive packets.
– Packets are coming in over the wire, but the VM lacks the resources to process them.
– Thus, those packets must be dropped and retransmitted, reducing effective performance.
– This creates a cascading problem. More dropped packets == more retransmitted ones == more to do == more oversubscription. Yikes!
– Select host | Performance tab | Network | Receive Packets Dropped
– Is this value greater than 0?
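The cascade can be made concrete with a little arithmetic. As a simplified model (an assumption, not something from the deck), suppose each delivery attempt is dropped independently with probability p. A packet then needs 1/(1-p) attempts on average, so the work per useful packet grows faster and faster as drops increase. The function names and the split between delivered and dropped counts are hypothetical.

```python
def drop_rate(delivered, dropped):
    """Share of inbound packets that were dropped (0.0 if there was no traffic)."""
    total = delivered + dropped
    return dropped / total if total else 0.0


def expected_attempts(p):
    """Average transmissions per delivered packet under independent drops."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)
```

In this model a 50% drop rate doubles the traffic needed to deliver the same data, which is the "more to do == more oversubscription" loop from the slide.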
Step 7-1: Solutions
An inability to process inbound packets usually relates to vProc overutilization.
– With vNICs, your processor is needed to process their workloads.
– Not enough processor == a less-capable vNIC
– Reduce VM CPU utilization
– Increase VM CPU reservation
– Add pCPUs. Add servers.
– Verify VMs are using the most-effective driver (VMXNET3 for most workloads).
Step 7-2: Check for Outbound Networking Problems
An outbound network problem is a VM that cannot effectively send packets.
– Outbound VM packets are buffered at the vSwitch.
– Heavy traffic at the vSwitch can overload its attached pNIC.
– When this happens, packets get dropped and must be retransmitted.
– Select host | Performance tab | Network | Transmit Packets Dropped
– Is this value greater than 0?
Step 7-2: Solutions
An inability to process outbound packets often requires additional pNICs.
– Aggregate more pNICs to handle outbound load.
– Ensure you’re not using failover mode, but load balancing.
– Rebalance high network use VMs to other hosts.
– Rebalance high network use VMs to other vSwitches (which should be attached to different pNICs).
– Add networking.
– Reduce ambient network traffic. Isolate subnets.
– Ahhh, the old backups network problem. Or, the n00b who multicasts on the server net! We’ve all been that n00b at some point…
Step 8: Check for Slow Storage
“Slow” storage is represented by high storage latency.
– Essentially, the storage isn’t responding fast enough.
– Storage layer itself could be insufficient, or overloaded.
– Select host | Performance tab | Disk | Physical Device Read/Write Latency (all LUNs)
– Are any average latencies greater than 10ms, or any peaks above 20ms?*
– * These are VMware’s suggested starting values. Yours may be different based on storage architecture.
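The two latency thresholds can be applied to a set of samples per LUN. This is a minimal sketch using the starting values quoted above as defaults. The function name and input shape are hypothetical, and the thresholds are parameters because your storage architecture may call for different ones.

```python
AVG_LIMIT_MS = 10.0   # starting value quoted on the slide
PEAK_LIMIT_MS = 20.0  # starting value quoted on the slide


def slow_luns(latency_ms_by_lun, avg_limit=AVG_LIMIT_MS,
              peak_limit=PEAK_LIMIT_MS):
    """Return {lun: {"avg_ms": ..., "peak_ms": ...}} for LUNs over either limit.

    `latency_ms_by_lun` maps a LUN name to a list of latency samples in ms.
    """
    slow = {}
    for lun, samples in latency_ms_by_lun.items():
        if not samples:
            continue  # nothing measured for this LUN
        avg = sum(samples) / len(samples)
        peak = max(samples)
        if avg > avg_limit or peak > peak_limit:
            slow[lun] = {"avg_ms": avg, "peak_ms": peak}
    return slow
```

A LUN is flagged if either condition holds, so one with a modest average but an occasional large spike still shows up.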
Step 8: Solutions
This indicates that the storage layer cannot keep up with the demands of VMs.
– Increase storage performance. $$$
– Segregate storage. Modularity assists here.
– Spread VMFS LUNs across more spindles. Add disks. Reduces storage contention.
– Use tools like vscsiStats to quantify storage behaviors.
– Balance memory with storage. Sometimes throwing more RAM at a VM lessens its storage demand.
– Buy new storage. Buy more storage. $$$
– Notice that these are the same as for Step 6!
Step 8: Solutions
[Diagram: ESX Server, ESX Server, SAN Storage Device]
Step 9: Check for Low VM CPU Utilization
Wait a minute! Isn’t low VM CPU utilization a good thing? Isn’t this why virtualization works?
– Yes, and no.
– Low VM CPU utilization can mean a low-needs workload.
– It can also mean a workload in a wait state.
– Only check here if end user experience is suffering.
– Select VM | Performance tab | CPU | Usage (VM)
– Is this a lower than expected value?
Step 9: Solutions
Suffering end user experience but low CPU utilization usually indicates a wait state.
– Verify other counters: Network, storage.
– Storage response time?
– Network response time?
– Other servers or virtual servers that this workload relies upon to do its job?
– Another common source: Overly restrictive resource allocations.
Step 10: Check for Memory Reclamation
Remember that ESX’s balloon driver will reclaim memory that it doesn’t believe a VM needs.
– However, that driver has very limited visibility into what each VM is actually doing with its memory.
– It becomes a problem when memory that the VM needs is reclaimed. Kind of like a double page fault.
– Select host | Performance tab | Memory | Balloon
– If this value is greater than 0, then…
– Select VM | Performance tab | Memory | Stacked Graph (per VM) | Balloon.
– Is this value greater than 0 for the specific VMs which are experiencing problems?
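The balloon check has two levels: look at the host first, and drill into the troubled VMs only if the host shows ballooning. Here is a rough sketch of that ordering; the function name and argument shapes are hypothetical.

```python
def balloon_suspects(host_balloon, vm_balloon, troubled_vms):
    """Return troubled VMs that are being ballooned, if the host balloons at all.

    `vm_balloon` maps a VM name to its Balloon value from the per-VM graph;
    `troubled_vms` lists the VMs with reported performance problems.
    """
    if host_balloon <= 0:
        return []  # no reclamation on this host, so look elsewhere
    return sorted(vm for vm in troubled_vms if vm_balloon.get(vm, 0) > 0)
```

Ballooning on a VM that nobody is complaining about is left out on purpose, since the slide treats reclamation as a problem only where the memory is actually needed.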
Step 10: Solutions
Ballooning occurs when there’s not enough memory to go around.
– You’re oversubscribing your RAM.
– This can be a good thing, unless it takes memory from where it’s actually needed.
– Eliminate memory overcommitment on the host. Essentially, stop assigning more RAM to VMs than you have.
– Use reservations to ensure adequate memory for VMs.
– Be aware that this may just shift the problem elsewhere.
– Buy RAM. Buy servers. $$$
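"Stop assigning more RAM to VMs than you have" can be checked with a sum and a division. The sketch below is a simplification that ignores hypervisor and per-VM memory overhead. The function names and GB units are assumptions for illustration.

```python
def overcommit_ratio(vm_ram_gb, host_ram_gb):
    """Assigned vRAM divided by physical RAM; above 1.0 means overcommitted."""
    if host_ram_gb <= 0:
        raise ValueError("host_ram_gb must be positive")
    return sum(vm_ram_gb.values()) / host_ram_gb


def ram_to_trim_gb(vm_ram_gb, host_ram_gb):
    """How much assigned vRAM must go (or physical RAM be added) to reach 1.0."""
    return max(0.0, sum(vm_ram_gb.values()) - host_ram_gb)
```

For example, three VMs assigned 16, 32, and 32 GB on a 64 GB host give a ratio of 1.25, with 16 GB to trim or buy.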
ESX Performance 401
Honestly…
– …go buy a product. Let someone else do the work!
This analysis takes time.
– Time that you probably don’t have.
– What you want is actionable information.
– “Convert all this math into a ‘click here’ response.”
ESX Performance 401
Another problem throughout these approaches relates to their “perspective”.
– Virtualization touches everything in the datacenter and introduces dependencies everywhere.
– vSphere’s perspective means that it can only see behaviors as it observes them.
– Metaphor: Einstein’s Theory of Relativity.
Third-party products tie into networking, storage, applications, user experience, etc.
– They can interrelate performance from multiple perspectives.
ESX Performance 401
Who’s Who in Virtualization Performance and Capacity Management
Source: http://www.virtualizationpractice.com/blog/?p=6749
Final Thoughts
Virtualization adds ridiculous interdependencies to the IT datacenter that weren’t there before.
– No human alive can monitor all those metrics effectively and at all times.
– You need actionable information.
– Use these tips to get you started, solve the immediate problems.
– Consider investing in a set-it-and-forget-it solution.
Please fill out evaluations, or more servers will crash!