VMworld 2013: Building a Validation Factory for VMware Partners
Post on 05-Dec-2014
Building a Validation Factory for VMware Partners
Tim Harris, VMware
TEX5485
#TEX5485
Disclaimer
This session may contain product features that are
currently under development.
This session/overview of the new technology represents
no commitment from VMware to deliver these features in
any generally available product.
Features are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
About the Speaker…
Tim Harris:
• At VMware since 2007
• Currently running ISV Validation Program
• Engineering and Lab Resources for TAP members
• Oracle Corp for nearly 10 years
• Managed various performance engineering teams
• Ran Oracle Applications Standard Benchmark effort
• PhD in Computer Science
• Focus on Parallel Computing algorithms and architectures
• BS in Electrical Engineering
Agenda
Validation Services Overview
• Goals and Best Practices
Why Build a Validation Factory?
• Business and Technical Value
Process and Procedures
• Org charts, resources, planning and objectives
Tuning Best Practices and Telco
• What’s challenging today, and how best to solve those challenges
Validation Services
Overview of Validation Services
Engineering Back-End to ISV Alliances
• Lab and Engineer Resources
• Free of Cost, Indirect Revenue for VMware
Performance Validations
• Virtualized Net-New App
Business Continuity/Disaster Recovery
• Site Recovery Manager
• VMware HA, vMotion, DRS, FT
Cloud Migration Services
• vCloud Director
• vApps
• vShield
• Hosting and Billing
[Diagram: three service pillars: Performance Validations, View and BCDR, Cloud/SaaS]
Setting Goals for a Performance Validation
Primary: Remove blockers for adoption
• As perceived by you, the Partner
VMware in supporting role here
• We do not set requirements
Supportability
• Our mutual customers should be happy
Maximize Value Proposition
• Synergy in combined functionality?
• 1 + 1 = 3 opportunities?
Performance Goals
Same performance as physical?
• Is “nearly the same” enough?
What are the application stress points?
• Realtime access to CPU?
• High throughput access to I/O?
• Dynamic memory footprint?
Infrastructure requirements
• Storage requirements
• Load driver requirements
Application level KPIs?
• For small, medium and large customers
Validation Goals and Common vSphere Use Cases
Validation Collaboration
• Many general learning opportunities
What’s the likely vSphere configuration?
• Existing cluster of 6 to 12 nodes
• DRS turned on
• Reservations turned off
• HA turned on
• Mix of diverse workloads
vSphere Admins may
• Prioritize the good of the many
• Vs the good of the few (applications)
Any conflicts with your best practices?
VMware Ready and Validations
VMware Ready is a marketing certification program
• Applications Category requires some performance testing
• Designed as self-service activity
A completed validation can waive testing requirements
• If you’ve done good performance work
• We can provide a testing waiver
Testing requirements are modest
• Apply load and observe behavior and capacity
Why Build a Validation Factory?
What Is a Validation Factory?
Validate All Your Applications
• Solution Level, Suite Level, Company Level
Plan for Capacity with Resource Requirements
• Hardware, Manpower, Marketing, Management
• Move from Event to Service model
Leverage results
• Document, Market, Enable the Field
Broaden solutions
• BC/DR, Hybrid Cloud (Private/Public), VDI
Get Certified
• VMware Ready status for all products
Validation Factory: Why Do It?
Provide Suite level virtualization advice
• Combine point products into virtualized solutions
Differentiate from competitors
• Establish technical leadership across products
Provide broader value of single platform
• Point products not sufficient
Enable delivery of specific deployment architectures
• E.g. 5 product suite on 3 node cluster supports 200 users
Process and Procedures
Org Chart and Process
Centralized Resources are easier
• Center of Expertise Model
Two Major Product Categories
• Need full validation to support
• Just need VMware Ready logo
Build Prioritized List
• Easy/Quick wins
• Hard/Longer Challenges
Internal and External Marketing
• Take credit for incremental achievements
Factory Deliverables
Suite Level VMware Ready Status
• vSphere based solutions
• Reference architectures
• Availability story
• Solution Deployment Guide
Span the Gap from R&D to Field
• Key architects in the loop
• Field enabled to understand and sell
Document and Market
• External doc delivered
• Internal message delivered
Planning Your Validation Effort
Validation Process in Agile Sprints
Planning Sprint: 3 weeks
• Iteratively populate test plan template
• HW resource requirements
• Storage volume and throughput
• Workload and Load Driver Tooling
Execution: 3 weeks
• At VMware Labs or ISV Labs
Wrap up: 3 weeks
• Interactively create Field Facing Documents
• Any joint marketing/press releases/VMware Ready logos, etc.
Add concurrency to increase throughput
• Different products can overlap sprints
[Diagram: Plan → Execute → Wrap-up]
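The overlapping sprints above can be sketched as a simple pipeline: each product still takes three 3-week sprints end to end, but with one team per phase a new product can enter planning every 3 weeks. A minimal sketch (the scheduling helper and product names are illustrative, not part of the program):

```python
# Sketch: overlapping 3-week sprints (plan / execute / wrap-up) per product.
# Serially, one product takes 9 weeks; pipelining the three phases lets a
# new product start every 3 weeks once the planning team frees up.

PHASES = ("plan", "execute", "wrap-up")
SPRINT_WEEKS = 3  # each phase is one 3-week sprint

def schedule(products):
    """Return {product: (start_week, finish_week)}, assuming one team per phase."""
    plan = {}
    for i, product in enumerate(products):
        start = i * SPRINT_WEEKS                      # next free planning slot
        finish = start + len(PHASES) * SPRINT_WEEKS   # 9 weeks end to end
        plan[product] = (start, finish)
    return plan

print(schedule(["App A", "App B", "App C"]))
# App C finishes at week 15 instead of week 27 if run serially
```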
Planning Risk Factors
Infrastructure limitations
• Little is learned by testing with insufficient capacity
• Entire benchmark limited by smallest bottleneck
Storage throughput
• Do we know the requirements?
• Can we verify the device can hit requirements?
• E.g. run IOMeter before testing begins
Length of effort
• Assume problems throughout before locking in dates
• Or choose timeline and work backwards to test schedule
• E.g. We plan 2 weeks of testing and reserve 3 weeks of HW
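The storage pre-check above (run a tool like IOMeter before testing begins) boils down to comparing measured numbers against the test plan's requirements. A minimal sketch; the metric names and figures below are hypothetical placeholders, not real requirements:

```python
# Sketch: sanity-check measured storage numbers (e.g. from an IOMeter run)
# against the test plan's requirements before the execution sprint starts.
# All figures are hypothetical placeholders.

def storage_shortfalls(measured, required):
    """Return the metrics that fall short of the stated requirement."""
    return [name for name, need in required.items()
            if measured.get(name, 0) < need]

required = {"iops": 20_000, "throughput_mbps": 400}
measured = {"iops": 25_000, "throughput_mbps": 350}

print(storage_shortfalls(measured, required))
# ['throughput_mbps'] -> fix the storage device before testing begins
```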
Executing on Your Validation Effort
Environment Build-out
Assume Build period largely single threaded
• Not considered full lab time
Start all staging/installs week before
• Assume long copy/install/datagen steps
• May include snail mail steps
• Ship USB drives for items bigger than 20G
• 10G and under via FTP
Full install on greenfield VM
• Most common process
vApps (OVFs) arguably better
• But more likely to break size limits for FTP
Load Drivers and Validations
Good load driver is critical to Performance testing
• Not virtualization specific
Load drivers are expensive to build
• Assume 2 man years and 6 calendar months
Bad load drivers don’t represent realistic use cases
• Focus should be on customer critical activities
• Proving the performance of edge cases is a waste of resources
• Load should represent common production load
Physical vs. Virtual Comparisons
Obvious choice, but not always correct choice
• Costs substantially more
• Adds a bit more value
Assume P-vs-V costs 2X+ more time/resources
• Physical HW setup is slow and inflexible
• Apples to Oranges comparisons common
Apples to Apples is…
• Must remove resources from physical to match VM
• VM must not consume all physical resources
• Hypervisor will have resources in production
• Needs to have resources in testing too
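The apples-to-apples rule above can be made concrete: trim the physical baseline by the resources the hypervisor would hold in production, and size the VM the same way. A sketch; the overhead figures are assumptions for illustration, not VMware guidance:

```python
# Sketch: size an apples-to-apples physical-vs-virtual comparison. The VM
# must not consume all physical resources (the hypervisor keeps some in
# production), so the physical baseline is trimmed to the same footprint.
# The 2-core / 4 GB hypervisor overhead below is an assumed example.

def matched_configs(host_cores, host_ram_gb, hv_cores=2, hv_ram_gb=4):
    vm = {"vcpus": host_cores - hv_cores, "ram_gb": host_ram_gb - hv_ram_gb}
    physical = dict(vm)  # physical baseline trimmed to match the VM
    return vm, physical

vm, phys = matched_configs(host_cores=16, host_ram_gb=64)
print(vm, phys)  # both sides get 14 cores / 60 GB, never the full 16 / 64
```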
Tuning Best Practices and Telco
Executive Summary: vSphere Tuning in Last 5 Years
Used to be scary – now they just work:
• High I/O Applications: Run at wire speed now
• Monster VM type workloads: Big iron now in a VM
• Enterprise use cases for Linux: Now safer
What’s still hard?
• Realtime requirements under 1 ms
• ESX 3.5 – 100 ms
• ESX 4 and 5 – 10 ms
• ESX 5.1 and 5.5 – working on sub-ms (100s of microseconds) now
• vMotion of Huge Realtime VMs
• 64 GB in-memory DBs like to stay still
Example Telco Workload Challenges
Service Provider Use Cases
• Large SaaS deployments
BC/DR QoS built into application
• Realtime active/passive failover
Conservative by nature
• “Don’t try and fix it if you might break it”
Realtime Transaction Rates
• Latency requirements of <10ms
Tuning Strategies
Shopping list of tune-ables may be misused
• Changes for changes sake
Experimental science says
• Make one change at a time
• Assess value of change
• Remove or move on to next change
Prioritize by relative impact
• No reason to make change if can’t solve a problem
Large Tuning Knobs Available
Incrementally back off virtualization
• Realtime demands likely can be met
Reservations for CPU and Memory
• Hard allocation of resources
If truly needed – CPU Affinity
• Exclusive, or with the halt_desched flag
If truly needed – NIC passthrough
• With SR-IOV or not
Horizontally scaled apps
• Still have less scheduling overhead
Storage design still critical
• Ensure IOPS are available before tuning
Advanced Tuning: CPU Affinity
CPU Affinity (aka Pinning)
• Rumored to be critical for VOIP
• Our data shows little gain with vSphere 4.x and before
Affinity and vSphere 5.0
• Allows “Exclusive Affinity”
• Previously, cores still accessible to other VMs despite affinity
[Chart: max DSP execution time (ms) against SLA, with and without Exclusive Affinity]
Halt Desched vs. Affinity vs. Latency Sensitive
“Pre-Allocating” CPU resources to a VM
• Reducing benefits of virtualization (vMotion, overcommit)
• Reducing scheduling overhead
Hierarchy of Techniques
• Simple reservations first
• Exclusive CPU Affinity (5.0 and beyond)
• Halt Desched option
Latency Sensitive UI available in 5.1 and beyond
• At highest setting, equivalent to Exclusive CPU affinity
Halt Desched
• vCPUs at 100% usage even if no work being done
• monitor_control.halt_desched set to FALSE
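The hierarchy above maps to a handful of per-VM settings. A sketch of the relevant .vmx entries; the reservation values are illustrative assumptions, and the option names are as documented for vSphere 5.x:

```
# Simple reservations first (values are examples, not recommendations)
sched.cpu.min = "8000"                  # CPU reservation in MHz
sched.mem.min = "16384"                 # memory reservation in MB
# 5.1+ Latency Sensitivity setting; at "high" it is equivalent
# to Exclusive CPU Affinity
sched.cpu.latencySensitivity = "high"
# Halt desched option from the slide: vCPUs show 100% usage even when idle
monitor_control.halt_desched = "FALSE"
```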
Horizontal Scaling and Latency Sensitivity
Scheduling overhead a function of vCPUs per VM
• 4 to 8 vCPU VMs may be our sweet spot
Many Applications scale horizontally effectively
• Doesn’t need to impact aggregate resources for an application
• E.g. double vm count and halve vCPUs per VM
• Trade-offs with management overhead of more VMs
Expect less jitter with smaller VMs
• Empirical result across many workloads
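The rescaling rule above (double the VM count, halve the vCPUs per VM) keeps aggregate resources constant while moving per-VM size toward the 4 to 8 vCPU sweet spot. A minimal sketch with made-up starting numbers:

```python
# Sketch: scale horizontally without changing aggregate resources.
# Doubling the VM count while halving vCPUs per VM keeps total vCPUs
# fixed, but smaller VMs carry less scheduling overhead and jitter.

def rescale(vm_count, vcpus_per_vm, factor=2):
    assert vcpus_per_vm % factor == 0, "vCPUs per VM must divide evenly"
    return vm_count * factor, vcpus_per_vm // factor

before = (4, 16)           # 4 VMs x 16 vCPUs = 64 vCPUs aggregate
after = rescale(*before)   # 8 VMs x 8 vCPUs  = 64 vCPUs aggregate
print(after)               # (8, 8) lands in the 4-8 vCPU sweet spot
```

The trade-off, as the slide notes, is the management overhead of running more VMs.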
Non-Uniform Memory Access (NUMA) Impacts
Physical Memory Spread across NUMA Nodes
• Typically one node per socket
Access to remote node’s memory expensive
• Access to local node “cheap”
Monitor from ESXtop
• NUMA stats: %local memory should be 100
• vSphere 5 more NUMA aware than previous
• Smallish VMs and smallish RAM are the best case
Align Core count per socket with vCPUs
• Fully occupy integer socket count
Disable “Node Interleaving” at BIOS to enable NUMA
• Node interleaving (enabled) leads to consistent but poor performance
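The esxtop check above can be automated: the memory view's N%L counter reports what fraction of a VM's memory is NUMA-local, and anything below 100 means remote accesses. A sketch over hypothetical sample data:

```python
# Sketch: flag VMs paying for remote NUMA memory. esxtop's memory view
# reports N%L (percent of a VM's memory on its local NUMA node); values
# below 100 indicate remote accesses. Sample data here is hypothetical.

def remote_numa_vms(stats, threshold=100):
    """stats: {vm_name: pct_local}. Return VMs below the locality threshold."""
    return sorted(vm for vm, pct_local in stats.items() if pct_local < threshold)

sample = {"db01": 100, "app02": 87, "web03": 100}
print(remote_numa_vms(sample))
# ['app02'] -> candidate for resizing or placement changes
```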
Advanced Tuning: Direct Path I/O
Direct Path I/O (aka NIC Pass-through)
• Disables vMotion
• Makes physical NIC available for only one VM
Substantial jitter improvements in realtime workloads
• But at substantial cost in vSphere functionality
SR-IOV provides an alternative
• Reusable NIC with vMotion and Pass-through
[Chart: worst-case latency (ms) against SLA, with and without Direct Path I/O]
Interrupt Management and Latency Sensitive Workloads
Interrupt coalescing in vSphere 4.x and 5
• Does “Adaptive Interrupt Coalescing” by default
• Groups interrupts to reduce impact and CPU
• Group size (queue depth) dynamically adjusts to the workload
Adaptive coalescing may introduce latency
• Can disable coalescing for latency sensitive workloads
• Some improvements observed, but not always a win
Pinning of interrupts
• Likely used with CPU pinning
• Keeps all interrupts on vCPU and hence pCPU
• Modest gain – test before using
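Disabling adaptive coalescing, as described above, is a per-virtual-NIC setting in the .vmx file. A sketch (option name as documented for vmxnet3; X is the virtual NIC index, and as the slide says, test before adopting):

```
# Disable adaptive interrupt coalescing for a latency sensitive workload
# (per-vNIC vmxnet3 setting; some improvements observed, not always a win)
ethernetX.coalescingScheme = "disabled"
```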
Latency Sensitive Tuning and Overcommitment
Safest solution – undercommit physical cores on each host
• E.g. 16 core server runs no more than 14 vCPUs
• 1-2 cores per host and 2G of RAM uncommitted
Challenges with undercommitment
• HW utilization, DRS in cluster with mixed workloads, etc.
• Most viable with dedicated (to one app) clusters
Alternative approaches
• CPU Affinity locks a VM to cores
• Other cores available for general use in cluster
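The undercommit rule above is simple arithmetic: hold back 1-2 cores (and roughly 2 GB of RAM) per host so the hypervisor never competes with latency-sensitive vCPUs. A sketch using the slide's own 16-core example:

```python
# Sketch of the undercommit rule: leave a couple of cores per host
# unallocated so the hypervisor never competes with latency-sensitive
# vCPUs. The 2-core headroom is the slide's example, not a fixed rule.

def vcpu_budget(host_cores, reserved_cores=2):
    """Max vCPUs to schedule on a host, after hypervisor headroom."""
    return host_cores - reserved_cores

print(vcpu_budget(16))  # 14 -> a 16-core server runs no more than 14 vCPUs
```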
Realtime Tuning Summary
Start with simple techniques
• Reservations, BIOS tuning, etc
Move towards pre-allocation of resources
• CPU Exclusive Affinity if CPU bound
• NIC-passthrough if network bound
Consider horizontal scaling of configuration
• More, smaller VMs
Test one change at a time and iterate
• Don’t overlap your changes
Telco Progress In-flight
Active Efforts with Nearly Every Global Telco Provider
• Some solutions in market, more on the way
Easy to virtualize pieces definitely exist
• Careful prioritization of efforts underway
Realtime workloads are achievable
• 2ms for compute and packet send consistently achievable (5.1)
• <1ms QoS work in progress (5.5?)
Availability still adds value
• Augment built in availability story
• Protect previous unprotected components
Validation Factory Summary
Vendors see value in Suite Level solutions design
• TAP program can provide support for such efforts
VMware Ready status for all applications
• Detailed performance assessment for some
What was once hard is now possible
• Most challenging applications successfully virtualized today
Questions?
THANK YOU