vSphere Performance Best Practices
Peter Boone, VMware, Inc.
INF-VSP1800
#vmworldinf
Disclaimer
This session may contain product features that are currently under development.
This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Global Support Services and Customer Advocacy
Support offices: Bangalore, India; Tokyo, Japan; Cork, Ireland; Burlington, Canada; Palo Alto, CA; Broomfield, CO
Local language support: Spanish, Portuguese, French, German, Japanese, Chinese
Global coverage: 24x7, 365 days/year
6 Support Centers
1000+ Support Engineers
Follow-the-sun support for Severity 1 issues
Support relationships with 100% of the Fortune 100; 99% of the Fortune 500
Customer Support Day Events
Coming to a location near you: sharing of VMware best practices!
Support Days are a collaboration between VMware Support, Sales and customers – you learn directly from the experts
Topics are driven by customer input, and typically include:
• Best practices
• Tips/tricks
• Top issues
• Product roadmaps/demos
• Certification offerings
http://www.vmware.com/go/supportdays
Overview
What a performance problem sounds like:
• “My VM is running slow and I don’t know what to do!”
• “I tried adding more memory and CPUs but the problem got worse!”
• “My VM is slow on one host but fast on another!”
What to look for? Where to start?
We will explore some of the most common performance-related issues that our support centers receive cases for
A word about performance…
Troubleshooting methodology must define:
• How to find root cause
• How to fix the problem
Must answer these questions:
1. How do we know when we are done?
2. Where do we start looking for problems?
3. How do we know what to look for to identify a problem?
4. How do we find the root-cause of a problem we have identified?
5. What do we change to fix the root-cause?
6. Where do we look next if no problem is found?
Agenda
Benchmarking & Tools
Best Practices and Troubleshooting
The 4 “food groups”
• Memory
• CPU
• Storage
• Network
© 2012 VMware Inc. All rights reserved
BENCHMARKING & TOOLS
Benchmarking
Consistent and reproducible results
Important to have a base level of acceptable performance
• Expectation vs. Acceptable
Determine baseline of performance prior to deployment
• Benchmark on a physical system if applicable
Avoid subjective metrics, stay quantitative
• “The system seems slower”
• “This worked better last year”
Benchmarking
Benchmarking should be done at the application layer
• Use application-specific benchmarking tools and load generators
• Check with the application vendor
Isolate variables, benchmark optimum situation before introducing load
Understand dependencies
• Human interaction
• Other “food groups”
• Compare apples-to-apples
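Staying quantitative can be as simple as expressing every new measurement as a percentage change from the recorded baseline. The Python sketch below illustrates that idea; the function names, the 10% tolerance and the sample throughput numbers are illustrative assumptions, not values from this session.

```python
def deviation_from_baseline(baseline: float, measured: float) -> float:
    """Return the change of a measurement relative to its baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured - baseline) / baseline * 100.0


def is_acceptable(baseline: float, measured: float, tolerance_pct: float = 10.0) -> bool:
    """True when a higher-is-better metric dropped by no more than tolerance_pct."""
    return deviation_from_baseline(baseline, measured) >= -tolerance_pct


# Hypothetical application benchmark: transactions per second before and after a change.
print(round(deviation_from_baseline(1200.0, 1020.0), 1))
print(is_acceptable(1200.0, 1020.0))
```

The tolerance encodes the "Expectation vs. Acceptable" decision, so it should be agreed on before the benchmark is run rather than afterwards.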
Tools – vCenter Operations
Aggregates thousands of metrics into Workload, Capacity, Health scores
Self-learns “normal” conditions using patented analytics
Smart alerts of impending performance and capacity degradation
Identifies potential performance problems before they start
Tools – esxtop
Valuable tool built in to vSphere hosts
View or capture real-time data
• View or playback data later
• Import data in 3rd party tools
vSphere Client performance graphs get their data from esxtop data
• Presentation/unit may be different (e.g. %RDY)
Little overhead impact on the host
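Captured esxtop data can be post-processed with a few lines of script. The Python sketch below averages counter columns from a batch-mode capture (for example, one produced with `esxtop -b -d 5 -n 12 > capture.csv`). It assumes the capture is a CSV whose first column is a timestamp and whose remaining headers are counter paths; the host name, VM name and values in the sample are invented for illustration.

```python
import csv
import io
from statistics import mean


def column_averages(csv_text: str, name_contains: str) -> dict:
    """Average every counter column whose header contains name_contains."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    averages = {}
    for index, name in enumerate(header):
        if index == 0 or name_contains not in name:
            continue  # column 0 is assumed to be the timestamp
        values = [float(row[index]) for row in samples if row[index] != ""]
        if values:
            averages[name] = mean(values)
    return averages


# Tiny hand-made sample in the assumed layout; real captures are far wider.
PREFIX = "\\\\esx01\\Group Cpu(1001:web01)\\"
SAMPLE = (
    f'"(PDH-CSV 4.0) (UTC)(0)","{PREFIX}% Ready","{PREFIX}% Used"\n'
    '"08/27/2012 10:00:00","4.0","35.0"\n'
    '"08/27/2012 10:00:05","6.0","45.0"\n'
)
print(column_averages(SAMPLE, "% Ready"))
```

Filtering by a substring of the header keeps the sketch independent of the exact counter path, which varies by host, world ID and VM name.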
MEMORY
Memory – Allocation
A VM’s RAM is not necessarily physical RAM
• vRAM + overhead = maximum physical RAM
Whether or not that memory is physical or virtual depends on…
• Host configuration
• Shares
• Limits
• Reservations
• Host load
• Idle/Active VMs
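The "vRAM + overhead = maximum physical RAM" relationship is plain addition, sketched below in Python. The 150 MB overhead figure is a made-up placeholder; real overhead depends on the VM's configuration and should be looked up in the Resource Management Guide.

```python
def max_host_memory_mb(vram_mb: float, overhead_mb: float) -> float:
    """Upper bound on the host physical memory one VM can consume."""
    if vram_mb < 0 or overhead_mb < 0:
        raise ValueError("sizes must be non-negative")
    return vram_mb + overhead_mb


# Hypothetical VM: 4096 MB of vRAM and an assumed 150 MB of overhead memory.
print(max_host_memory_mb(4096, 150))
```

How much of that upper bound is actually backed by physical RAM at any moment depends on the factors listed above (shares, limits, reservations, host load).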
Memory – Overhead
Source: vSphere 5.0 Resource Management Guide
Memory – Host Memory Management
Occurs when memory is under contention
Transparent Page Sharing
Ballooning
Compression
Swapping
Memory – Transparent Page Sharing
Memory – Ballooning
Memory – Compression
Memory – Swapping
Memory – VM Resource Allocation
Memory – Resource Pool Allocation
Memory – Ballooning vs. Swapping
Ballooning is better than swapping
Guest can surrender unused/free pages
Guest chooses what to swap, can avoid swapping “hot” pages
Idle memory tax uses ballooning
Memory – Rightsizing
Generally, it is better to OVER-commit than UNDER-commit
If the running VMs are consuming too much host/pool memory…
• Some VMs may not get physical memory
• Ballooning or host swapping
• Higher disk IO
• All VMs slow down
Memory – Rightsizing
If a VM has too little vRAM…
• Applications suffer from lack of RAM
• The guest OS swaps
• Increased disk traffic, thrashing
• SAN slows down as a result of increased disk traffic
If a VM has too much vRAM…
• Higher overhead memory
• Possible decreased failover capacity
• Longer vMotion time
• Larger VSWP file
• Wasted resources
Memory – Troubleshooting
Wrong resource allocation
• May not notice a limit, e.g. VM or template with a limit gets cloned
• Custom share values
Ballooning or swapping at the host level
• Ballooning is a warning sign, not a problem
• Swapping is a performance issue if seen over an extended period
Swapping/paging at the guest level
• Under-provisioned guest memory
Missing balloon driver (VMware Tools)
Memory – Best Practices
Avoid high active host memory over-commitment
• No host swapping occurs when total memory demand is less than the physical memory (assuming no limits)
Right-size guest memory
• Avoid guest OS swapping
• Ensure there is enough vRAM to cover demand peaks
Use a fully automated DRS cluster
• Test that vMotion works
• Use Resource Pools with High/Normal/Low shares
• Avoid using custom shares
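The over-commitment guidance can be checked with simple sums. The Python sketch below compares configured and active memory against host physical memory. It is a simplified model under stated assumptions: it ignores per-VM overhead, VMkernel memory and any limits, and the host size and VM figures are invented.

```python
def host_memory_summary(physical_mb: float, configured_mb: list, active_mb: list) -> dict:
    """Summarize configured vs. active memory against host physical memory."""
    if physical_mb <= 0:
        raise ValueError("physical_mb must be positive")
    configured = sum(configured_mb)
    active = sum(active_mb)
    return {
        "overcommit_ratio": configured / physical_mb,   # > 1.0 means over-committed
        "active_demand_mb": active,
        "demand_fits_in_physical": active < physical_mb,
    }


# Hypothetical host: 64 GB of RAM running six 16 GB VMs that each actively use about 6 GB.
print(host_memory_summary(65536, [16384] * 6, [6000] * 6))
```

In this invented case the host is over-committed 1.5:1 on configured memory, yet active demand still fits in physical memory, which is the condition the slide describes for avoiding host swapping.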
CPU
CPU – Overview
Raw processing power of a given host or VM
• Hosts provide CPU resources
• VMs and Resource Pools consume CPU resources
CPU cores/threads need to be shared between VMs
Fair scheduling
• vCPU time
• Hardware interrupts for a VM
• Parallel processing for SMP VMs
• I/O
CPU – esxtop
Interpret the esxtop columns correctly
%USED – Physical CPU usage
%SYS – Percentage of time in the VMkernel
%RUN – Percentage of total scheduled time to run
%WAIT – Percentage of time in blocked or busy wait states
%IDLE – Percentage of time idle; %WAIT minus %IDLE can be used to estimate I/O wait time
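The I/O wait estimate is a single subtraction, sketched here in Python. The function name and sample values are illustrative; clamping at zero is an added assumption to absorb sampling noise.

```python
def estimate_io_wait_pct(wait_pct: float, idle_pct: float) -> float:
    """Estimate I/O wait as %WAIT minus %IDLE, clamped at zero."""
    return max(0.0, wait_pct - idle_pct)


# Hypothetical esxtop sample for one world: %WAIT = 80, %IDLE = 65.
print(estimate_io_wait_pct(80.0, 65.0))
```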
CPU – Performance Overhead & Utilization
Different workloads have different overhead costs (%SYS) even for the same utilization (%USED)
CPU virtualization adds varying amounts of system overhead
• Direct execution vs. privileged execution
• Paravirtual adapters vs. emulated adapters
• Virtual hardware (Interrupts!)
• Network and storage I/O
CPU – vSMP
Relaxed Co-Scheduling: vCPUs can run out-of-sync
Idle vCPUs incur a scheduling penalty
• Configure only as many vCPUs as needed
• Unneeded vCPUs impose unnecessary scheduling constraints
Use Uniprocessor VMs for single-threaded applications
CPU – Scheduling
Over committing physical CPUs
VMkernel CPU Scheduler
CPU – Ready Time
The percentage of time that a vCPU is ready to execute, but waiting for physical CPU time
Does not necessarily indicate a problem
• Indicates possible CPU contention or limits
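The vSphere Client reports ready time as a millisecond summation per sampling interval, while esxtop shows a percentage, so comparing the two needs a unit conversion. The Python sketch below does that conversion, assuming the 20-second interval of the real-time charts; the function name and sample values are illustrative.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a ready-time summation (ms) into an average percentage per vCPU."""
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return ready_ms * 100.0 / (interval_s * 1000.0) / vcpus


# Hypothetical real-time chart sample: 1000 ms of ready time in a 20 s interval.
print(ready_percent(1000.0))
# The same calculation for a 2-vCPU VM whose summed ready time is 2000 ms.
print(ready_percent(2000.0, vcpus=2))
```

Longer chart roll-ups use longer intervals, so `interval_s` must match the chart the value was read from.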
CPU – NUMA nodes
Non-Uniform Memory Access system architecture
Each node consists of CPU cores and memory
A CPU core in one NUMA node can access memory in another node, but at a small performance cost
CPU – NUMA nodes
The VMkernel will try to keep a VM’s vCPUs local to its memory
• Internal NUMA migrations can occur to balance load
Manual CPU affinity can affect performance
• vCPUs inadvertently spread across NUMA nodes
• Not possible with fully automated DRS
VMs with more vCPUs than cores available in a single NUMA node may see decreased performance
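Whether a VM fits inside one NUMA node is a quick comparison. The Python sketch below checks the fit and counts how many nodes a VM would need at minimum; the function names and the 6-core node size are hypothetical, and actual placement is decided by the VMkernel scheduler.

```python
def fits_in_numa_node(vcpus: int, cores_per_node: int) -> bool:
    """True when all vCPUs of a VM can be placed on a single NUMA node."""
    return vcpus <= cores_per_node


def min_nodes_spanned(vcpus: int, cores_per_node: int) -> int:
    """Smallest number of NUMA nodes needed to give each vCPU its own core."""
    if vcpus < 1 or cores_per_node < 1:
        raise ValueError("vcpus and cores_per_node must be at least 1")
    return -(-vcpus // cores_per_node)  # ceiling division


# Hypothetical host with 6 cores per NUMA node and an 8-vCPU VM.
print(fits_in_numa_node(8, 6), min_nodes_spanned(8, 6))
```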
CPU – Troubleshooting
vCPU to pCPU over allocation
• HyperThreading does not double CPU capacity!
Limits or too many reservations
• Can create artificial limits
Expecting the same consolidation ratios with different workloads
• Virtualizing “easy” systems first, then expanding to heavier systems
• Compare Apples to Apples
• Frequency, turbo, cache sizes, cache sharing, core count, instruction set…
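The vCPU to pCPU allocation ratio can be computed as below. Because HyperThreading does not double capacity, this Python sketch counts physical cores only and leaves logical threads out; the function name and the host and VM sizes are hypothetical, and what ratio is acceptable depends on the workload.

```python
def vcpu_to_pcpu_ratio(vcpus_per_vm: list, sockets: int, cores_per_socket: int) -> float:
    """Ratio of allocated vCPUs to physical cores (HyperThreading threads not counted)."""
    physical_cores = sockets * cores_per_socket
    if physical_cores < 1:
        raise ValueError("host must have at least one physical core")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host: 2 sockets x 8 cores, running six 8-vCPU VMs.
print(vcpu_to_pcpu_ratio([8] * 6, sockets=2, cores_per_socket=8))
```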
CPU – Best Practices
Right-size vSMP VMs
Keep heavy-hitters separated
• Fully automated DRS should do this for you
• Use anti-affinity rules if necessary
Use a fully automated DRS cluster
• Test that vMotion works
• Use Resource Pools with High/Normal/Low shares
• Avoid using custom shares
STORAGE
Storage – esxtop Counters
Different esxtop storage views
• Adapter (d)
• VM (v)
• Disk Device (u)
Key Fields:
• DAVG + KAVG = GAVG
• QUED/USD – Command Queue Depth
• CMDS/s – Commands Per Second
• MBREAD/s
• MBWRTN/s
Storage – Troubleshooting with esxtop
High DAVG: issue beyond the adapter
• Bad/overloaded zoning, over utilized storage processors, too few platters in the RAID set, etc.
High KAVG: issue in the kernel storage stack
• Driver issue
• Full queue
Aborts: GAVG exceeding 5000 ms
• Command will be repeated, storage delay for the VM
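The triage logic above (GAVG is the sum of DAVG and KAVG, and each component points to a different layer) can be sketched in Python as follows. The 25 ms and 2 ms thresholds are assumed rules of thumb chosen for illustration, not figures from this session; tune them to your array vendor's guidance.

```python
def classify_storage_latency(davg_ms: float, kavg_ms: float,
                             davg_limit_ms: float = 25.0,
                             kavg_limit_ms: float = 2.0):
    """Return (GAVG, findings) for one esxtop storage sample."""
    gavg_ms = davg_ms + kavg_ms  # guest-observed latency
    findings = []
    if davg_ms > davg_limit_ms:
        findings.append("high DAVG: look beyond the adapter (fabric, array)")
    if kavg_ms > kavg_limit_ms:
        findings.append("high KAVG: look at the kernel storage stack (driver, queue)")
    return gavg_ms, findings


# Hypothetical sample: the device is slow, the kernel adds almost nothing.
print(classify_storage_latency(30.0, 0.5))
```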
Storage – Benchmarking with iometer
Storage – Storage I/O Control
Allows the use of Shares per VMDK
Throttling occurs when datastore reaches latency threshold
• Higher share VMDKs perform IO first
vCenter monitors latency across all hosts
• Not effective if datastore shared with other vCenters
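Share-based throttling can be pictured as dividing a fixed pool of I/O slots in proportion to each VMDK's shares. The Python sketch below is a deliberately simplified model of that idea, not the actual Storage I/O Control algorithm; the slot count, VMDK names and share values are hypothetical.

```python
def split_by_shares(total_slots: int, shares: dict) -> dict:
    """Divide total_slots among VMDKs in proportion to their share values."""
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("at least one VMDK must have shares")
    return {name: total_slots * value / total_shares for name, value in shares.items()}


# Hypothetical datastore under contention: 32 slots, one high-share disk and two normal ones.
print(split_by_shares(32, {"db.vmdk": 2000, "web.vmdk": 1000, "test.vmdk": 1000}))
```

Shares only matter while the latency threshold is exceeded; below it, each VMDK can issue I/O without being throttled.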
Storage – Storage DRS
Datastore clusters
• Maintenance mode
• Anti-affinity rules
vCenter monitors for latency and disk space
• Migrate VMDKs for better performance or utilization
Not effective with automated tiering SANs
• Check HCL to confirm these features are compatible
Storage – Troubleshooting
Snapshots
Excessive traffic down one HBA / Switch / SP can cause latency
• Consider using Round Robin in conjunction with ALUA
• Always be paranoid when it comes to monitoring storage I/O
Consider your I/O patterns
• Peak time for storage IO?
• Virus scans, database maintenance, user logins
Always consult with array vendor
• They know the best practices for their array!
Storage – Best Practices
Use different tiers of storage for different VM workloads
• Slower storage for OS VMDKs
• Faster storage for databases or other high-IO applications
Use the Paravirtual SCSI adapter
• Reduced overhead, higher throughput
Use path balancing where possible, either through plugins (PowerPath) or Round Robin and ALUA, if supported
Use Storage DRS with SIOC
• Balance for both free space and latency
• Simplified datastore management
NETWORK
Network – Load Balancing
Load balancing defines which uplink is used
• Route based on Port ID
• Route based on IP hash
• Route based on IP hash
• Route based on MAC hash
• Route based on NIC load
Probability of high-bandwidth VMs being on the same physical NIC
Traffic will stay on elected uplink until an event occurs
• NIC link state change, adding/removing NIC from a team, beacon probe timeout…
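To see why two high-bandwidth VMs can land on the same physical NIC, it helps to model uplink selection as a deterministic mapping. The Python sketch below is a simplified illustration of port-ID-based and IP-hash-based selection, not the VMkernel's actual implementation; the modulo and XOR formulas, port numbers, addresses and vmnic names are all assumptions.

```python
import ipaddress


def uplink_by_port_id(port_id: int, uplinks: list) -> str:
    """Pick an uplink from the virtual port ID alone: one vNIC, one uplink."""
    return uplinks[port_id % len(uplinks)]


def uplink_by_ip_hash(src_ip: str, dst_ip: str, uplinks: list) -> str:
    """Pick an uplink per source/destination pair, so one VM can use several uplinks."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]


UPLINKS = ["vmnic0", "vmnic1"]
# Two busy VMs on ports 5 and 7 share vmnic1 in this model.
print(uplink_by_port_id(5, UPLINKS), uplink_by_port_id(7, UPLINKS))
# The same VM talking to two destinations can use both uplinks with IP hash.
print(uplink_by_ip_hash("10.0.0.5", "10.0.0.8", UPLINKS),
      uplink_by_ip_hash("10.0.0.5", "10.0.0.9", UPLINKS))
```

Neither mapping looks at traffic volume, which is the gap that routing based on NIC load is meant to close.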
Network – Troubleshooting
Check counters for NICs and VMs
• Network load imbalance
• 10 Gbps NICs can incur a significant CPU load when running at 100%
Ensure hardware supports TSO
• Use latest drivers and firmware for your NIC on the host
For multi-tier VM applications, use DRS affinity rules to keep VMs on same host
• Same vSwitch / VLAN, rules out physical network
If using Jumbo Frames, ensure they are enabled end-to-end
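An end-to-end Jumbo Frames check amounts to confirming that every hop on the path carries the required MTU. The Python sketch below lists the hops that fall short; the hop names and MTU values are a hypothetical inventory that you would have to collect yourself from the guest, vSwitch, physical switches and storage ports.

```python
def hops_below_mtu(hop_mtus: dict, required_mtu: int = 9000) -> list:
    """Return the hops whose MTU is below required_mtu; an empty list means consistent."""
    return [hop for hop, mtu in hop_mtus.items() if mtu < required_mtu]


# Hypothetical path where one physical switch was left at the default MTU.
PATH = {
    "guest vNIC": 9000,
    "vSwitch": 9000,
    "physical switch": 1500,
    "storage port": 9000,
}
print(hops_below_mtu(PATH))
```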
Network – Best Practices
Use the vmxnet3 virtual adapter
• Less CPU overhead
• 10 Gbps connection to vSwitch
Use the latest driver/firmware for the NICs on the host
Use network shares
• Requires Virtual Distributed Switch 4.1
Isolate vMotion and iSCSI traffic from regular VM traffic
• Separate vSwitches with dedicated NIC(s)
• Most applicable with Gigabit NICs
In conclusion…
Key Takeaways – Performance Best Practices
Understand your environment
• Hardware, storage, networking
• VMs & applications
Advanced configuration values do not need to be tweaked or modified
• In almost all situations
Use fully automated DRS
Use Paravirtual virtual hardware
Important Links