ESX Performance Troubleshooting


ESX Performance Troubleshooting
VMware Technical Support, Broomfield, Colorado
Confidential. © 2009 VMware Inc. All rights reserved.

What is slow performance?
- What does "slow performance" mean?
  - The application responds slowly (latency)
  - The application takes longer to finish a job (throughput)
- Interpretation varies wildly:
  - Slower than expected
  - Throughput is low
  - Latency is high
  - Throughput and latency are fine, but excessive resources are used (efficiency)
- What are high latency, low throughput, and excessive resource usage?
  - These are subjective and relative; both are related to time

Bandwidth, Throughput, Goodput, Latency
- Bandwidth vs. throughput
  - Higher bandwidth does not guarantee higher throughput; low bandwidth is a bottleneck for higher throughput
- Throughput vs. goodput
  - Higher throughput does not mean higher goodput
  - Low throughput is indicative of lower goodput
  - Efficiency = Goodput / Bandwidth (see the sketch after this section)
- Throughput vs. latency
  - Low latency does not guarantee higher throughput, and vice versa
  - Throughput or latency alone can dominate performance

Bandwidth, Throughput, Goodput, Latency
[Figure: relationship between latency, bandwidth, goodput, and throughput]

How to measure performance?
- Higher throughput does not necessarily mean higher performance; goodput could be low
- Throughput is easy to measure, but goodput is not
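A minimal sketch of the distinction above, using made-up numbers for a hypothetical file transfer: throughput counts every byte on the wire, goodput only the useful payload, and efficiency is goodput over bandwidth.

```python
# Illustrative only: made-up numbers for a hypothetical transfer.
link_bandwidth_mbps = 1000          # raw link capacity (bandwidth)
bytes_on_wire = 120_000_000         # everything sent: payload + headers + retransmits
payload_bytes = 100_000_000         # useful application data (goodput)
elapsed_s = 1.2

throughput_mbps = bytes_on_wire * 8 / elapsed_s / 1e6
goodput_mbps = payload_bytes * 8 / elapsed_s / 1e6
efficiency = goodput_mbps / link_bandwidth_mbps   # Efficiency = Goodput / Bandwidth

print(f"throughput {throughput_mbps:.0f} Mb/s, goodput {goodput_mbps:.0f} Mb/s, "
      f"efficiency {efficiency:.0%}")
# Throughput looks high (800 Mb/s), yet efficiency is only ~67%: goodput lags behind.
```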

How do we measure performance?
- Performance itself is never actually measured; we can only quantify the different metrics that affect it
- These metrics describe the state of CPU, memory, disk, and network
- Performance is like beauty: it can never be measured. Scientifically, one could quantify the eyes, nose, lips, cheeks, and ear lobes and conclude that the result should be good, but at the end of the day it lies in the eye of the beholder

Performance Metrics
- CPU: throughput is MIPS (%used), goodput is useful instructions, latency is instruction latency (cache latency, cache misses)
- Memory: throughput is MB/s, goodput is useful data, latency is in nanoseconds
- Storage: throughput is MB/s and IOPS, goodput is useful data, latency is seek time
- Networking: throughput is MB/s and I/Os per second, goodput is useful traffic, latency is in microseconds

Hardware and Performance: CPU
- Processor architecture: Intel Xeon, AMD Opteron
- Processor caches: L1, L2, L3, TLB
- Hyperthreading
- NUMA

Hardware and Performance: Processor Architecture
- Clock speeds from one architecture are not comparable with another
  - A P-III outperforms a P4 on a clock-by-clock basis
  - An Opteron outperforms a P4 on a clock-by-clock basis
- A higher clock speed is not always beneficial; a bigger cache or a better architecture may outperform a higher clock speed (see the sketch after this section)
- Processor-to-memory communication is often the performance bottleneck
  - The processor wastes hundreds of instruction cycles while waiting on memory access
  - Caching alleviates this issue
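A back-of-the-envelope illustration of why clock speed alone is not comparable across architectures: useful work per second is roughly clock rate times instructions retired per cycle, so a slower clock with better IPC (bigger cache, fewer stalls) can win. The CPUs and numbers below are hypothetical, not benchmarks of any real processor.

```python
def useful_mips(clock_ghz, ipc):
    """Rough useful-instruction throughput: clock rate * instructions per cycle (in MIPS)."""
    return clock_ghz * 1e3 * ipc

# Hypothetical processors for illustration only.
fast_clock_poor_ipc = useful_mips(clock_ghz=3.0, ipc=0.8)   # long pipeline, frequent memory stalls
slow_clock_good_ipc = useful_mips(clock_ghz=2.0, ipc=1.5)   # bigger cache, fewer stalls

print(fast_clock_poor_ipc, slow_clock_good_ipc)   # 2400.0 vs 3000.0 MIPS
```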

Hardware and Performance: Processor Cache
- Cache reduces memory access latency
- A bigger cache increases the cache hit probability
- Why not build a bigger cache?
  - It is expensive
  - Cache access latency increases with cache size
  - Cache is therefore built in stages (L1, L2, L3) with varying access latency
- ESX benefits from larger cache sizes; L3 cache seems to boost the performance of networking workloads

Hardware and Performance: TLB (Translation Lookaside Buffer)
- Every running process needs virtual address (VA) to physical address (PA) translation
- Historically this translation was done entirely from memory
- Since memory access is significantly slower and a process needs this table on every context switch, the TLB was introduced
- The TLB is hardware circuitry that caches VA-to-PA mappings
- When a VA is not available in the TLB, a miss (and possibly a page fault) occurs and the mapping must be brought into the TLB (load latency)
- Application performance depends on effective use of the TLB
- The TLB is flushed during a context switch

Hardware and Performance: Hyperthreading
- Introduced with the Pentium 4 and Xeon processors
- Allows simultaneous execution of two threads on a single processor
- HT maintains separate architectural state for each logical processor but shares the underlying processor resources such as the execution units and cache
- HT strives to improve throughput by taking advantage of processor stalls on the other logical processor
- HT performance can be worse than uniprocessor (non-HT) performance if the threads have a high cache hit rate (more than 50%)

Hardware and Performance: Multicore
- Cores have their own L1 cache
- The L2 cache is shared between cores
- Cache coherency is relatively faster compared to SMP systems
- Performance scaling is the same as for SMP systems

Hardware and Performance: NUMA
- Memory contention increases as the number of processors increases
- NUMA alleviates memory contention by localizing memory per processor

Hardware and Performance: Memory (Node Interleaving)
- Opteron processors support two types of memory access: NUMA mode and node interleaving mode
- Node interleaving alternates memory pages between processor nodes so that memory latencies are made uniform; this can improve performance for systems that are not NUMA aware
- On single-core Opteron systems, each NUMA node contains only one core
  - An SMP VM on ESX running on a single-core Opteron system will have to access memory across the NUMA boundary, so SMP VMs may benefit from node interleaving
- On dual-core Opteron systems a single NUMA node has two cores, so NUMA mode can be left on

Hardware and Performance: I/O Devices
- PCI-E, PCI-X, PCI
  - PCI at 66 MHz: 533 MB/s
  - PCI-X at 133 MHz: 1066 MB/s
  - PCI-X at 266 MHz: 2133 MB/s
  - PCI-E bandwidth depends on the number of lanes: each lane adds 250 MB/s, so x16 is 4 GB/s
- PCI bus saturation: dual-port and quad-port devices
  - In the PCI protocol the bus bandwidth is shared by all devices on the bus; only one device can communicate at a time
  - PCI-E allows parallel full-duplex transmission through the use of lanes

Hardware and Performance: I/O Devices (continued)
- SCSI
  - Ultra3/Ultra160 SCSI: 160 MB/s
  - Ultra320 SCSI: 320 MB/s
  - SAS 3 Gbps: 300 MB/s, full duplex
- FC
  - Speed is constrained by the medium and laser wavelength
  - Link speeds: 1 Gb FC 200 MB/s, 2 Gb 400 MB/s, 4 Gb 800 MB/s, 8 Gb 1600 MB/s

ESX Architecture: Performance Perspective

CPU virtualization: the Virtual Machine Monitor
- ESX doesn't trap and emulate every instruction; the x86 architecture does not allow this
- System calls and faults are trapped by the monitor
- Guest code runs in one of three contexts:
  - Direct execution
  - Monitor code (fault handling)
  - Binary translation (BT, for non-virtualizable instructions)
- BT behaves much like a JIT: previously translated code fragments are stored in a translation cache and reused, saving translation overhead

Virtual Machine Monitor: performance implications
- Programs that don't fault or invoke system calls run at near-native speed (e.g. gzip)
- Micro-benchmarks that do nothing but invoke system calls will incur nothing but monitor overhead
- Translation overhead varies with different privileged instructions; the translation cache offsets some of the overhead
- Applications have varying amounts of monitor overhead depending on their call-stack profile
- The call-stack profile of an application can vary with its workload, errors, and other factors
- It is hard to generalize monitor overheads for any workload; monitor overheads measured for an application apply strictly only to identical test conditions

Memory virtualization
- Modern OSes set up page tables for each running process; the x86 paging hardware (TLB) caches VA-to-PA mappings
- Page table shadowing adds a level of indirection
  - The VMM maintains PA-to-MA mappings and folds them into a shadow table
  - This allows the guest to use the x86 paging hardware against the shadow table
- MMU updates
  - The VMM write-protects the shadow page tables (a trace)
  - When the guest updates a page table, the monitor kicks in (page fault) and keeps the shadow page table consistent with the guest's page table
- Hidden page faults
  - Trace faults are hidden from the guest OS; they are monitor overhead
  - Hidden page faults are similar to TLB misses in native environments

Page table shadowing
[Figure: page table shadowing]
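A highly simplified sketch of the shadowing idea described above: the guest maintains VA-to-PA mappings, the VMM maintains PA-to-MA mappings, and the hardware effectively walks a shadow table holding the composed VA-to-MA mapping. The class below is purely conceptual (class and field names are invented for illustration) and is not how the monitor is actually implemented.

```python
class ShadowPageTable:
    """Conceptual model: shadow entries compose guest (VA -> PA) with VMM (PA -> MA)."""

    def __init__(self, guest_pt, pa_to_ma):
        self.guest_pt = guest_pt      # guest page table: VA -> PA (write-protected / traced)
        self.pa_to_ma = pa_to_ma      # VMM mapping: PA -> MA
        self.shadow = {}              # VA -> MA, what the real MMU/TLB would use
        self.hidden_faults = 0        # faults the guest never sees (monitor overhead)

    def translate(self, va):
        if va not in self.shadow:          # hidden page fault, like a TLB miss natively
            self.hidden_faults += 1
            pa = self.guest_pt[va]         # walk the guest's table
            self.shadow[va] = self.pa_to_ma[pa]
        return self.shadow[va]

    def guest_updates_pte(self, va, new_pa):
        # A write to the traced guest page table traps into the monitor,
        # which keeps the corresponding shadow entry consistent.
        self.guest_pt[va] = new_pa
        self.shadow.pop(va, None)          # drop the stale shadow entry
```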

Context switches and process creation
- On native hardware the TLB is flushed during a context switch; the newly switched-in process incurs a TLB miss on its first memory access
- The VMM caches page table entries (PTEs) across context switches (the caching MMU) and tries to keep the shadow PTEs consistent with the guest's PTEs
- If there are many processes running in the guest and they context switch frequently, the VMM may run out of page-table caching; the vmx option workload=terminalservices increases this cache size
- Process creation
  - Every newly created process requires a new page-table mapping, so MMU updates are frequent
  - Shell scripts that spawn commands can cause MMU overhead

I/O path
[Figure: I/O path]

I/O virtualization
- I/O devices are not virtualizable, so they are emulated for the guest OS
- The VMkernel handles storage and networking devices directly, as they are performance critical in server environments; CD-ROM and floppy devices are handled by the service console
- I/O is interrupt driven and therefore incurs monitor overhead; all I/O goes through the VMkernel and involves a context switch from the VMM to the VMkernel
- The latency of a networking device is lower, so delays due to context switches can hamper throughput
- The VMkernel fields I/O interrupts and delivers them to the correct VM; starting with ESX 2.1, the VMkernel delivers interrupts to an idle processor

Virtual networking
- Virtual NICs
  - The queue buffer can overflow if the packet tx/rx rate is high and the VM is not scheduled frequently
  - VMs are scheduled when they have packets for delivery
  - Idle VMs still receive broadcast frames, which wastes CPU resources
  - Guest speed/duplex settings are irrelevant
- Virtual switches don't learn MAC addresses
  - VMs register their MAC addresses, so the virtual switch knows the location of each MAC
- VMnics
  - Listen for the MAC addresses registered by the VMs
  - Layer 2 broadcast frames are passed up

NIC teaming
- Teaming only provides outbound load balancing
- NICs with different capabilities can be teamed; the least common capability in the bond is used
- Out-MAC mode scales with the number of VMs/virtual NICs; traffic from a single virtual NIC is never load balanced
- Out-IP mode scales with the number of unique TCP/IP sessions (see the hashing sketch at the end of this section)
- Incoming traffic can arrive on the same NIC; link aggregation on the physical switches provides inbound load balancing
- Packet reflections can cause performance hits in the guest OS; no empirical data is available
- We fail back when the link comes alive again; performance can be affected if the link flip-flops

vmxnet optimizations
- vmxnet handles clusters of packets at once, reducing context switches and interrupts
  - Clustering kicks in only when the packet receive/transmit rate is high
- vmxnet shares a memory area with the VMkernel, reducing copying overhead
- vmxnet can take advantage of TCP checksum and segmentation offloading (TSO)
- NIC morphing allows loading the vmxnet driver for a vlance virtual device; it probes a new register on the vlance device
- The performance of a NIC-morphed vlance device is the same as that of a vmxnet virtual device

SCSI performance
- Queue depth determines SCSI throughput; when the queue is full, SCSI I/Os are blocked, limiting effective throughput (see the queue-depth sketch at the end of this section)
- Stages of queuing: BusLogic/LSILogic -> VMkernel queue -> VMkernel driver queue depth -> device firmware queue -> queue depth of the LUN
- Sched.numrequestOutstanding: number of outstanding I/O commands per VM (see KB 1269)
- The BusLogic driver in Windows limits the queue depth to 1 (see KB 1890)
- Registry settings are available for maximizing the queue depth of the LSILogic adapter (Maximum Number of Concurrent I/Os)

VMFS
- Uses larger block sizes (1 MB default)
  - A larger block size reduces metadata size; metadata is completely cached in memory
  - Near-native speed is possible because metadata overhead is removed
  - Fewer I/O operations; improves read-ahead cache hits for sequential reads
- Spanning
  - Data fills over to the next LUN sequentially after overflow; there is no striping
  - Does not offer performance improvements
- Distributed access
  - Multiple ESX hosts can access the VMFS volume; only one ESX host updates the metadata at a time

VMFS: volume locking
- Metadata updates are performed through a locking mechanism; a SCSI reservation is used to lock the volume
- Do not confuse this locking with the file-level locks implemented in the VMFS volume for the different access modes
- SCSI reservations
  - A SCSI reservation blocks all I/O operations until the lock is released by the owner
  - A SCSI reservation is usually held for a very short time and released as soon as the update is performed
  - A SCSI reservation conflict happens when a reservation is attempted on a volume that is already locked; this usually happens when multiple ESX hosts contend for metadata updates

VMFS: contention for metadata updates
- Redo-log updates from multiple ESX hosts
- Template deployment with redo-log activity
- Anything that changes/modifies file permissions on every ESX host
- VMFS 3.0 uses a new volume-locking mechanism that significantly reduces the number of SCSI reservations used
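A toy illustration (not ESX's actual algorithm) of why out-IP load balancing on the NIC teaming slide scales with the number of unique TCP/IP sessions, while out-MAC never spreads a single virtual NIC's traffic: the uplink choice is simply a hash of the relevant identifiers. The function names and addresses are made up.

```python
def pick_uplink_out_mac(src_mac, uplinks):
    # Out-MAC: everything from one virtual NIC hashes to the same uplink,
    # so load spreading scales only with the number of VMs / virtual NICs.
    return uplinks[hash(src_mac) % len(uplinks)]

def pick_uplink_out_ip(src_ip, dst_ip, uplinks):
    # Out-IP: different IP pairs (sessions) can land on different uplinks.
    return uplinks[hash((src_ip, dst_ip)) % len(uplinks)]

uplinks = ["vmnic0", "vmnic1"]
print(pick_uplink_out_mac("00:50:56:aa:bb:01", uplinks))        # always the same uplink
print(pick_uplink_out_ip("10.0.0.5", "10.0.1.7", uplinks))      # can differ per session
print(pick_uplink_out_ip("10.0.0.5", "10.0.2.9", uplinks))
```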
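And a back-of-the-envelope sketch of why queue depth bounds SCSI throughput, as stated on the SCSI performance slide: by Little's law, sustained IOPS is roughly the number of outstanding I/Os divided by per-I/O latency, so once the queue is full the device cannot be pushed any harder. The latency figures are hypothetical.

```python
def max_iops(queue_depth, io_latency_ms):
    """Little's law: concurrency = throughput * latency, so IOPS <= depth / latency."""
    return queue_depth / (io_latency_ms / 1000.0)

print(max_iops(queue_depth=1,  io_latency_ms=5))   # ~200 IOPS (BusLogic capped at 1, KB 1890)
print(max_iops(queue_depth=32, io_latency_ms=5))   # ~6400 IOPS with a deeper queue
```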

Service console
- The service console can share interrupt resources with the VMkernel; shared interrupt lines reduce the performance of I/O devices (KB 1290)
- MKS is handled in the service console in ESX 2.x, and its performance is determined by the resources available in the COS
  - The default min CPU allocation is 8% and may not be sufficient if many VMs are running
- Memory recommendations for the service console do not account for the memory used by agents
- VM scalability is limited by the COS in ESX 2.x; ESX 3.x avoids this problem with userworlds in the VMkernel

Understanding ESX Resource Management & Over-Commitment

Scheduling and co-scheduling
- Only one VCPU runs on a CPU at any time
- The scheduler tries to run a VM on the same CPU as much as possible
- The scheduler can move VMs to other processors when it has to meet the VMs' CPU demands
- Co-scheduling
  - SMP VMs are co-scheduled, i.e. all the VCPUs run on their own PCPUs/LCPUs simultaneously
  - Co-scheduling facilitates synchronization/communication between processors, as in the case of a spinlock wait between CPUs
  - The scheduler can run one VCPU without the other for a short period of time (1.5 ms)
  - The guest could halt a co-scheduled CPU it is not using, but Windows doesn't seem to halt the CPU, which wastes CPU cycles

NUMA scheduling
- The scheduler tries to keep a world within the same NUMA node so that cross-NUMA migrations are few
- If a VM's memory pages are split between NUMA nodes, the memory scheduler slowly migrates all of the VM's pages to the local node; over time the system becomes completely NUMA balanced
- On a NUMA architecture, CPU utilization per NUMA node gives a better idea of CPU contention
- When looking at %ready, factor in the CPU contention within the same NUMA node

Hyperthreading
- Hyperthreading support was added in ESX 2.1 and is recommended
- Hyperthreading increases the scheduler's flexibility, especially when running SMP VMs together with UP VMs
- A VM scheduled on an LCPU is charged only half the package seconds
- The scheduler tries to avoid scheduling an SMP VM onto the logical CPUs of the same package
- A high-priority VM may be scheduled onto a package with one of its LCPUs halted; this prevents other running worlds from using that package

HTSharing
- Controls hyperthreading behavior for individual VMs
- htsharing=any: virtual CPUs can be scheduled on any LCPUs; the most flexible option for the scheduler
- htsharing=none: excludes sharing of LCPUs with other VMs; a VM with this option gets a full package or does not get scheduled
  - Essentially this excludes the VM from using logical CPUs (useful for the security paranoid); use this option if an application in the VM is known to perform poorly with HT
- htsharing=internal: applies to SMP VMs only; the same as none, but the VCPUs of the same VM may share a package; the best of both worlds for SMP VMs
  - For UP VMs this translates to none

HT quarantining
- ESX uses P4 performance counters to constantly evaluate the HT performance of running worlds
- If a VM appears to interact badly with HT, it is automatically placed into quarantine mode (i.e. htsharing is set to none)
- If the bad events disappear, the VM is automatically pulled back out of quarantine mode
- Quarantining is completely transparent

CPU affinity
- Defines a subset of LCPUs/PCPUs that a world can run on
- Useful to:
  - partition a server between departments
  - troubleshoot system reliability issues
  - manually set NUMA affinity in ESX 1.5.x
  - help applications that benefit from cache affinity
- Caveats
  - Worlds that don't have affinity can run on any CPU, so they have a better chance of getting scheduled
  - Affinity reduces the scheduler's ability to maintain fairness; min CPU guarantees may not be possible under some circumstances
  - NUMA optimizations (page migrations) are excluded for VMs that have CPU affinity (manual memory affinity can be enforced)
  - SMP VMs should not be pinned to LCPUs
  - Disallows VMotion operations

Proportional shares
- Shares are used only when there is resource contention
- Unused shares (shares of a halting/idling VM) are partitioned across the active VMs (see the entitlement sketch at the end of this section)
- In ESX 2.x shares operate in a flat namespace
- Changing the shares of one world affects the effective CPU cycles received by other running worlds; if VMs use a different share scale, the shares of the other worlds should be changed to the same scale

Minimum CPU
- Guarantees CPU resources when the VM requests them
- Unused resources are not wasted; they are given to other worlds that require them
- Setting min CPU to 100% (200% in the case of an SMP VM) ensures that the VM is not bound by CPU resource limits
- Using min CPU is favored over using CPU affinity or proportional shares
- Admission control verifies whether the min CPU can be guaranteed when the VM is powered on or VMotioned

Demystifying ready time
- A powered-on VM can be running, halted, or in the ready state
- Ready time is the time a VM spends on the run queue waiting to be scheduled
- Ready time accrues when more than one world wants to run at the same time on the same CPU
  - PCPU/VCPU over-commitment with CPU-intensive workloads
  - Scheduler constraints, such as CPU affinity settings
- Higher ready time hurts response times and increases job completion times
- Total accrued ready time is not useful
  - A VM could have accrued ready time during its lifetime without incurring a performance loss (for example during boot)
  - %ready = the ready time accrual rate (see the sketch at the end of this section)

Demystifying ready time (continued)
- There are no universally good or bad values for %ready; it depends on the priority of the VMs. Latency-sensitive applications may require little or no ready time
- Ready time can be reduced by increasing the priority of the VM: allocate more shares, set min CPU, remove CPU affinity

Unexplained ready time
- If a VM accrues ready time while there are enough CPU resources, it is called unexplained ready time
- There is some belief in the field that such a thing actually exists; it is hard to prove or disprove
- It is very hard to determine whether CPU resources were available when ready time accrued
  - CPU utilization is not a good indicator of CPU contention
  - Burstiness is very hard to determine
  - NUMA boundaries: all the VMs may be contending within the same NUMA node
  - Misunderstanding of how the scheduler works
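A minimal sketch of the proportional shares slide above: under contention, each running world's CPU entitlement is its shares divided by the total shares of the worlds that actually want to run, which is why shares only matter when there is contention and why an idle world's shares are effectively redistributed. VM names and share values are made up.

```python
def cpu_entitlement(shares, active):
    """Fraction of CPU each active world is entitled to; idle worlds are ignored."""
    total = sum(shares[w] for w in active)
    return {w: shares[w] / total for w in active}

shares = {"vm_a": 1000, "vm_b": 2000, "vm_c": 1000}

print(cpu_entitlement(shares, active={"vm_a", "vm_b", "vm_c"}))  # a 25%, b 50%, c 25%
print(cpu_entitlement(shares, active={"vm_a", "vm_b"}))          # c idle: a ~33%, b ~67%
```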
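And a small sketch of the "%ready = ready time accrual rate" point: the total accrued ready time says little by itself; what matters is how much ready time accrues per sampling interval. The function and sample values below are illustrative only.

```python
def percent_ready(ready_ms_start, ready_ms_end, interval_ms):
    """Ready time accrued during the interval, as a percentage of that interval."""
    return 100.0 * (ready_ms_end - ready_ms_start) / interval_ms

# Two 5-second samples of a VM's cumulative ready time (hypothetical values):
print(percent_ready(ready_ms_start=120_000, ready_ms_end=120_250, interval_ms=5_000))  # 5%
print(percent_ready(ready_ms_start=120_250, ready_ms_end=122_250, interval_ms=5_000))  # 40%: contention
```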

Resource management in ESX 3.0
- Resource pools: extend the hierarchy; shares operate within the resource pool domain
- MHz: resource allocations are absolute, based on clock cycles; percentage-based allocations can vary with processor speed (see the example below)
- Clusters: aggregate resources from multiple ESX hosts
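A tiny worked example of the MHz point: the same percentage means a different number of cycles on hosts with different clock speeds, while an absolute MHz reservation stays constant across hosts. The numbers are illustrative.

```python
def percent_to_mhz(percent, core_mhz):
    return percent / 100.0 * core_mhz

print(percent_to_mhz(10, core_mhz=2000))   # 200 MHz on a 2.0 GHz host
print(percent_to_mhz(10, core_mhz=3000))   # 300 MHz on a 3.0 GHz host
# An absolute reservation of, say, 300 MHz guarantees the same cycles on either host.
```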

Resource Over-Commitment

CPU over-commitment
- Scheduling: too many things to do!
  - Symptom: high %ready
  - Judicious use of SMP helps
- CPU utilization: too much to do!
  - Symptom: 100% CPU
- Things to watch
  - Misbehaving applications inside the guest
  - Do not rely on guest CPU utilization: halting issues, timer interrupts
  - Some applications/services seem to impact guest halting behavior; this is no longer tied to SMP HALs

CPU over-commitment (continued)
- Higher CPU utilization does not necessarily mean lower performance; application progress is not affected by higher CPU utilization alone
- However, if the higher CPU utilization is due to monitor overhead, it may impact performance by increasing latency
- When there is no headroom (100% CPU), performance degrades; 100% CPU utilization and %ready are almost identical in effect, as both delay application progress
- CPU over-commitment can lead to other performance problems
  - Dropped network packets
  - Poor I/O throughput
  - Higher latency, poor response times

Memory over-commitment
- Guest swapping (warning)
  - The guest page faults while swapping; performance is affected both by the guest swapping and by the monitor overhead of handling page faults
  - Additional disk I/O
- Ballooning (serious)
- VMkernel swapping (critical)
- COS swapping (critical)
  - The VMX process could stall and affect the progress of the VM
  - The VMX could be the victim of a random process kill by the kernel
  - The COS requires additional CPU cycles for handling frequent page faults and disk I/O
- Memory shares determine the rate of ballooning/swapping

Memory over-commitment: ballooning
- Ballooning/swapping stalls the processor and increases delay
- Windows VMs touch all allocated memory pages during boot; memory pages touched by the guest can be reclaimed only by ballooning
- Linux guests touch memory pages on demand; ballooning kicks in only when the guest is under real memory pressure
- Ballooning can be avoided by setting min=max
- /proc/vmware/sched/mem (see the sketch at the end of this section)
  - size vs. sizetgt indicates memory pressure
  - mctl > mctlgt: ballooning out (giving away pages)
  - mctl < mctlgt: ballooning in (taking in pages)
- Memory shares affect the ballooning rate

Memory over-commitment: VMkernel swapping
- Processor stalls due to VMkernel swapping are more expensive than ballooning (because of disk I/O)
- Do not confuse this with:
  - Swap reservation: swap is always reserved for the worst-case scenario; if min < max, reservation = max - min
  - Total swapped pages: only current swap I/O affects performance
- /proc/vmware/sched/mem-verbose
  - swpd: total pages swapped
  - swapin, swapout: swap I/O activity
- SCSI I/O delays during VMkernel swapping could result in system reliability issues

I/O bottlenecks
- PCI bus saturation
- Target device saturation
  - It is easy to saturate storage arrays if the topology is not designed for proper load distribution
- Packet drops
  - Effective throughput is reduced
  - Retransmissions can cause congestion
  - The window size scales down in the case of TCP
- Latency affects throughput (see the sketch below)
  - TCP is very sensitive to latency and packet drops
- Broadcast traffic
  - Multicast and broadcast traffic is sent to all VMs
- Keep an eye on packets/sec and IOPS, not just bandwidth consumption
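A back-of-the-envelope sketch of why latency hits TCP so hard, as noted in the I/O bottlenecks slide: with a fixed window, a single TCP stream's throughput cannot exceed the window size divided by the round-trip time, so added latency (scheduling delays, CPU-bound VMs, retransmits) shows up directly as lower throughput. Values are illustrative.

```python
def tcp_throughput_limit_mbps(window_bytes, rtt_ms):
    """Upper bound for a single TCP stream: window size / round-trip time."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6

print(tcp_throughput_limit_mbps(window_bytes=65_536, rtt_ms=0.5))   # ~1049 Mb/s
print(tcp_throughput_limit_mbps(window_bytes=65_536, rtt_ms=5.0))   # ~105 Mb/s once latency grows
```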
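Returning to the ballooning slide above, a minimal sketch of how its rule of thumb for the /proc/vmware/sched/mem figures can be read. It simply encodes the slide's mctl/mctlgt comparison; the field names follow the slide, and the exact layout of the proc node may differ in practice.

```python
def balloon_state(mctl_pages, mctlgt_pages):
    """Interpret balloon size (mctl) against balloon target (mctlgt), per the slide."""
    if mctl_pages > mctlgt_pages:
        return "ballooning out (giving away pages)"   # slide: mctl > mctlgt
    if mctl_pages < mctlgt_pages:
        return "ballooning in (taking in pages)"      # slide: mctl < mctlgt
    return "balloon at target"

print(balloon_state(mctl_pages=4096, mctlgt_pages=8192))   # hypothetical sample values
print(balloon_state(mctl_pages=8192, mctlgt_pages=2048))
```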
ESX Performance: Application Performance Issues

Before we begin
- From the VM's perspective, a running application is just an x86 workload; any application performance tuning that makes the application run more efficiently will help
- Application performance can vary between versions
  - A new version could be more or less efficient
  - Tuning recommendations could change
- Application behavior can change based on its configuration
- Application performance tuning requires intimate knowledge of how the application behaves
  - Nobody at VMware specializes in application performance tuning
  - Vendors should optimize their software with the understanding that the hardware resources could be shared with other operating systems: the TAP program
  - SpringSource (a unit of VMware) provides developer support for API scripting

Citrix
- Roughly 50-60% monitor overhead: it takes 50-60% more CPU cycles than on the native machine
- The maximum-users limit is hit when the CPU is maxed out: roughly 50% of the users that would be seen in a native environment, in an apples-to-apples comparison
- Citrix logon delays
  - These can happen even on native machines when roaming profiles are configured; refer to the Citrix and Microsoft KB articles
  - Monitor overhead can introduce logon delays
- Workarounds: disable COM ports, set workload=terminalservices, disable unused apps, scale horizontally
- ESX 3.0 improves Citrix performance to roughly 70-80% of native

Database performance
- Scales well with vSMP; recommended
  - Exception: Pervasive SQL is not optimized for SMP
- Key parameters for database workloads: response time, transaction logs, CPU utilization
- Understanding SQL performance is complex; most enterprise databases run some sort of query optimizer that changes the SQL engine parameters dynamically
- Performance will vary from run to run; typically benchmarking is done after priming the database
- Memory is the key resource; SQL performance can vary a lot depending on the available memory

Lotus Domino Server
- One of the better-performing workloads
- Runs at roughly 80-90% of direct-execution speed
- CPU and I/O intensive
- Scalability issues: it is not a good idea to run all Domino servers on the same ESX server

16-bit applications
- 16-bit applications on Windows NT/2000 and above run in a sandboxed virtual machine
- 16-bit apps depend on segmentation: possible monitor overhead
- Some 16-bit apps seem to spin in an idle loop instead of halting the CPU, consuming excessive CPU cycles
- No performance studies have been done yet; there is no compelling application

Netperf throughput
- Maximum throughput is bound by a variety of parameters: available bandwidth, TCP window size, available CPU cycles
- A VM incurs additional CPU overhead for I/O
- CPU utilization for networking varies with:
  - Socket buffer size and MTU, which affect the number of I/O operations performed
  - Driver: vmxnet consumes fewer CPU cycles
  - Offloading features, depending on the driver settings and NIC capabilities
- For most applications, throughput is not the bottleneck; measuring throughput and improving it may not resolve the underlying performance issue

Netperf latency
- Latency plays an important role for many applications
- Latency can increase when:
  - There are too many VMs to schedule
  - The VM is CPU bound
  - Packets are dropped and then retransmitted

Compiler workloads
- MMU intensive: lots of new processes are created, context switched, and destroyed
- An SMP VM may hurt performance
  - Many compiler workloads are not optimized for SMP
  - Process threads can ping-pong between the VCPUs
- Workarounds: disable NPTL, try UP (don't forget to change the HAL), workload=terminalservices might help

ESX Performance Forensics

Troubleshooting methodology
- Understand the problem
  - Pay attention to all the symptoms; pay less attention to subjective metrics
- Know the mechanics of the application
  - Find out how the application works, what resources it uses, and how it interacts with the rest of the system
- Identify the key bottleneck
  - Look for clues in the data and see whether they can be related to the symptoms
  - Eliminate CPU, disk I/O, networking I/O, and memory bottlenecks by running tests; running the right test is critical

Isolating memory bottlenecks
- Ballooning
- Swapping
- Guest MMU overheads

Isolating networking bottlenecks
- Speed/duplex settings
- Link state flapping
- NIC saturation / load balancing
- Packet drops
- Rx/Tx queue overflow

Isolating disk I/O bottlenecks
- Queue depth
- Path thrashing
- LUN thrashing

Isolating CPU bottlenecks
- CPU utilization
- CPU scheduling contention
- Guest CPU usage
- Monitor overhead

Isolating monitor overhead
- Procedures for release builds
- Collect performance snapshots
- Monitor components

Collecting performance snapshots
- Duration
- Delay
- Proc nodes
- Running esxtop on performance snapshots

Collecting benchmarking numbers
- Client-side benchmarks
- Running benchmarks inside the guest

ESX Performance Troubleshooting: Summary

Key points
- Address real performance issues
  - A lot of time can be wasted spinning wheels on theoretical benchmarking studies
  - Real performance issues can easily be described by the end users of the application
- There is no magical configuration parameter that will solve all performance problems
- ESX performance problems are resolved by:
  - Re-architecting the deployment
  - Tuning the application
  - Applying workarounds to circumvent bad workloads
  - Moving to a newer version that addresses a known problem
- Understanding architecture is the key: understanding both ESX and application architecture is essential to resolving performance problems

Questions?

Reference links
- http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf
- http://www.vmware.com/resources/techresources/10041
- http://www.vmware.com/resources/techresources/10054
- http://www.vmware.com/resources/techresources/10066
- http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf
- http://www.vmware.com/pdf/RVI_performance.pdf
- http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
- http://www.vmware.com/files/pdf/perf-vsphere-fault_tolerance.pdf