VMware vSphere 4.1 HA and DRS Technical Deepdive



    VMware vSphere 4.1

    HA and DRS Technical Deepdive


    VMware vSphere 4.1, HA and DRS Technical Deepdive

    Copyright © 2010 by Duncan Epping and Frank Denneman.

    All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or

    transmitted by any means, electronic, mechanical, or otherwise, without written permission from

    the publisher. No patent liability is assumed with respect to the use of the information contained

    herein. Although every precaution has been taken in the preparation of this book, the publisher and

    authors assume no responsibility for errors or omissions. Neither is any liability assumed for

    damages resulting from the use of the information contained herein.

    International Standard Book Number (ISBN): 9781456301446

    All terms mentioned in this book that are known to be trademarks or service marks have been

    appropriately capitalized.

    Use of a term in this book should not be regarded as affecting the validity of any trademark or

    service mark.

    Version: 1.1


    About the Authors

    Duncan Epping  is a Principal Architect working for VMware as part of the Technical Marketing

    department. Duncan primarily focuses on vStorage initiatives and ESXi. He specializes in

    vSphere, vStorage, VMware HA and Architecture. Duncan is a VMware Certified Professional and

    among the first VMware Certified Design Experts (VCDX 007). Duncan is the owner of Yellow-Bricks.com,

    one of the leading VMware/virtualization blogs worldwide (recently voted number 1 virtualization

    blog for the 4th consecutive time on vsphere-land.com) and lead-author of

    the "vSphere Quick Start Guide" and "Foundation for Cloud Computing with VMware vSphere 4"

    which has recently been published by Usenix/Sage. (#21 in the Short Topics Series). He can be

    followed on twitter at http://twitter.com/DuncanYB.

    Frank Denneman  is a Consulting Architect working for VMware as part of the Professional

    Services Organization. Frank works primarily with large Enterprise customers and Service

    Providers. He is focused on designing large vSphere Infrastructures and specializes in Resource

    Management, DRS in general and storage. Frank is a VMware Certified Professional and among the

    first VMware Certified Design Experts (VCDX 029). Frank is the owner of FrankDenneman.nl which

    has recently been voted number 6 worldwide on vsphere-land.com. He can be followed on twitter

    at http://twitter.com/FrankDenneman.


    Table of Contents

     About the Authors

     Acknowledgements

    Foreword

    Introduction to VMware High Availability

    How Does High Availability Work?

    Pre-requisites

    Firewall Requirements

    Configuring VMware High Availability

    Components of High Availability

    VPXA

    VMAP Plug-In

     AAM

    Nodes

    Promoting Nodes

    Failover Coordinator

    Preferred Primary

    High Availability Constructs

    Isolation Response

    Split-Brain

    Isolation Detection

    Selecting an Additional Isolation Address

    Failure Detection Time

     Adding Resiliency to HA (Network Redundancy)

    Single Service Console with vmnics in Active/Standby Configuration

    Secondary Management Network


    Operation and Tasks of DRS

    Load Balance Calculation

    Events and Statistics

    Migration and Info Requests

    vCenter and Cluster sizing

    DRS Cluster Settings

     Automation Level

    Initial Placement

    Impact of Automation Levels on Procedures

    Resource Management

    Two-Layer Scheduler Architecture

    Resource Entitlement

    Resource Entitlement Calculation

    Calculating DRS Recommendations

    When is DRS Invoked?

    Defragmenting cluster during Host failover

    Recommendation Calculation

    Constraints Correction

    Imbalance Calculation

    Impact of Migration Threshold on Selection Procedure

    Selection of Virtual Machine Candidate

    Cost-Benefit and Risk Analysis Criteria

    The Biggest Bang for the Buck

    Calculating the Migration Recommendation Priority Level

    Influence DRS Recommendations

    Migration Threshold Levels


    Rules

    VM-VM Affinity Rules

    VM-Host Affinity Rules

    Impact of Rules on Organization

    Virtual Machine Automation Level

    Impact of VM Automation Level on DRS Load Balancing Calculation

    Resource Pools and Controls

    Root Resource Pool

    Resource Pools

    Resource pools and simultaneous vMotions

    Under Committed versus Over Committed

    Resource Allocation Settings

    Shares

    Reservation

    VM Level Scheduling: CPU vs Memory

    Impact of Reservations on VMware HA Slot Sizes.

    Behavior of Resource Pool Level Memory Reservations

    Setting a VM Level Reservation inside a Resource Pool

    VMkernel CPU reservation for vMotion

    Reservations Are Not Limits.

    Memory Overhead Reservation

    Expandable Reservation

    Limits

    CPU Resource Scheduling

    Memory Scheduler

    Distributed Power Management


    Enable DPM

    Templates

    DPM Threshold and the Recommendation Rankings

    Evaluating Resource Utilization

    Virtual Machine Demand and ESX Host Capacity Calculation

    Evaluating Power-On and Power-Off Recommendations

    Resource LowScore and HighScore

    Host Power-On Recommendations

    Host Power-Off Recommendations

    DPM Power-Off Cost/Benefit Analysis

    Integration with DRS and High Availability

    Distributed Resource Scheduler

    High Availability

    DPM awareness of High Availability Primary Nodes

    DPM Standby Mode

    DPM WOL Magic Packet

    Baseboard Management Controller

    Protocol Selection Order

    DPM and Host Failure Worst Case Scenario

    DRS, DPM and VMware Fault Tolerance

    DPM Scheduled Tasks

    Summarizing

     Appendix A – Basic Design Principles

    VMware High Availability

    VMware Distributed Resource Scheduler

     Appendix B – HA Advanced Settings


    Acknowledgements

    The authors of this book work for VMware. The opinions expressed here are the authors’ personal

    opinions. Content published was not read or approved in advance by VMware and does not

    necessarily reflect the views and opinions of VMware. This is the authors’ book, not a VMware book.

    First of all we would like to thank our VMware management team (Steve Beck, Director; Rob

    Jenkins, Director) for supporting us on this and other projects.

    A special thanks goes out to our Technical Reviewers: fellow VCDX Panel Member Craig Risinger

    (VMware PSO), Marc Sevigny (VMware HA Engineering), Anne Holler (VMware DRS Engineering)

    and Bouke Groenescheij (Jume.nl) for their very valuable feedback and for keeping us honest.

    A very special thanks to our families and friends for supporting this project. Without your support

    we could not have done this.

    We would like to dedicate this book to the VMware Community. We highly appreciate all the effort

    everyone is putting in to take VMware, Virtualization and Cloud to the next level. This is our gift to

    you.

    Duncan Epping and Frank Denneman


    Foreword

    Since its inception, server virtualization has forever changed how we build and manage the

    traditional x86 datacenter. In its early days of providing an enterprise-ready hypervisor, VMware

    focused their initial virtualization efforts to meet the need for server consolidation. Increased

    optimization of low-utilized systems and lowering datacenter costs of cooling, electricity, and floor

    space requirements was a surefire recipe for VMware’s early success. Shortly after introducing

    virtualization solutions, customers started to see the significant advantages introduced by the

    increased portability and recoverability that were all of a sudden available.

    It’s this increased portability and recoverability that significantly drove VMware’s adoption during

    its highest growth period. Recovery capabilities and options that were once reserved for the most

    critical of workloads within the world’s largest organizations became broadly available to the

    masses. Replication, High-Availability, and Fault Tolerance were once synonymous with "Expensive

    Enterprise Solutions," but are now available to even the smallest of companies. Data protection

    enhancements, when combined with intelligent resource management, placed VMware squarely at the

    top of the market leadership board. VMware's virtualization platform can

    provide near instant recovery time with increasingly more recent recovery points in a properly

    designed environment.

    Now, if you’ve read this far, you likely understand the significant benefits that virtualization can

    provide, and are probably well on your way to building out your virtual infrastructure and strategy.

    The capabilities provided by VMware are not ultimately what dictate the success or failure of a

    virtualization project, especially as increasingly more critical applications are introduced and

    require greater availability and recoverability service levels. It takes a well-designed virtual

    infrastructure and a full understanding of how the business requirements of the organization align

    to the capabilities of the platform.

    This book is going to arm you with the information necessary to understand the in-depth details of

    what VMware can provide you when it comes to improving the availability of your systems. This

    will help you better prepare for, and align to, the requirements of your business as well as set the

    proper expectations with the key stakeholders within the IT organization. Duncan and Frank have

    poured their extensive field experience into this book to enable you to drive broader virtualization

    adoption across more complex and critical applications. This book will enable you to make the

    most educated decisions as you attempt to achieve the next level of maturity within your virtual

    environment.

     Scott Herold

    Lead Architect, Virtualization Business, Quest Software


    Part 1

    VMware High Availability


     Chapter 1

    Introduction to VMware High Availability

    VMware High Availability (HA) provides a simple and cost effective clustering solution to increase

    uptime for virtual machines. HA uses a heartbeat mechanism to detect a host or virtual machine

    failure. In the event of a host failure, affected virtual machines are automatically restarted on other

    production hosts within the cluster with spare capacity. In the case of a failure caused by the Guest

    OS, HA restarts the failed virtual machine on the same host. This feature is called VM Monitoring,

    but sometimes also referred to as VM HA.

    Figure 1: High Availability in action

    Unlike many other clustering solutions HA is literally configured and enabled with 4 clicks.

    However, HA is not, and let's repeat it, is not a 1:1 replacement for solutions like Microsoft Clustering Services (MSCS). MSCS and, for instance, Linux Clustering are stateful clustering solutions

    where the state of the service or application is preserved when one of the nodes fails. The service is

    transitioned to one of the other nodes and it should resume with limited downtime or loss of data.

    With HA the virtual machine is literally restarted and this incurs downtime. HA is a form of stateless

    clustering.


    One might ask why would you want to use HA when a virtual machine is restarted and service is

    temporarily lost. The answer is simple; not all virtual machines (or services) need 99.999% uptime.

    For many services the type of availability HA provides is more than sufficient. Stateful clustering

    does not guarantee 100% uptime, can be complex and needs special skills and training. One example

    is managing patches and updates/upgrades in a MSCS environment; this could even cause more

    downtime if not operated correctly. Just as with MSCS, where a service or application is restarted during a failover, the same happens with HA and the affected virtual machines.

    Besides that, HA reduces complexity, costs (associated with downtime and MSCS), resource

    overhead and unplanned downtime for minimal additional costs. It is important to note that HA,

    contrary to MSCS, does not require any changes to the guest as HA is provided on the hypervisor

    level. Also, VM Monitoring does not require any additional software or OS modifications except for

    VMware Tools, which should be installed anyway.

    We can’t think of a single reason not to use it.

    How Does High Availability Work?

    Before we deep dive into the main constructs of HA and describe all the choices one has when

    configuring HA we will first briefly touch on the requirements. Now, the question of course is how

    does HA work? As briefly touched on in the introduction, HA triggers a response based on the loss

    of heartbeats. However you might be more interested in knowing which components VMware uses

    and what is required in order for HA to function correctly. Maybe if this is the first time you are

    exposed to HA you also want to know how to configure it.

    Pre-requisites

    For those who want to configure HA, the following items are the pre-requisites in order for HA to

    function correctly:

    •  Minimum of two VMware ESX or ESXi hosts

    •  Minimum of 2300MB memory to install the HA Agent

    •  VMware vCenter Server

    •  Redundant Service Console or Management Network (not a requirement, but highly

    recommended)

    •  Shared Storage for VMs – NFS, SAN, iSCSI

    •  Pingable gateway or other reliable address for testing isolation

    We recommend against using a mixed cluster. With that we mean a single cluster containing both ESX and ESXi hosts. Differences in build numbers have led to serious issues in the past when using

    VMware FT. (KB article: 1013637)


    Firewall Requirements

    The following list contains the ports that are used by HA for communication. If your environment

    contains firewalls, ensure these ports are opened for HA to function correctly.

    High Availability port settings:

    •  8042 – UDP - Used for host-to-host "backbone" (message bus) communication.

    •  8042 – TCP - Used by AAM agents to communicate with a remote backbone.

    •  8043 – TCP - Used to locate a backbone at bootstrap time.

    •  8044 – UDP - Used by HA to send heartbeats.

    •  2050 – 2250 - Used by AAM agent process to communicate with the backbone.

    Configuring VMware High Availability

    As described earlier, HA can be configured with the default settings within 4 clicks. The following

    steps however will show you how to create a cluster and how to enable HA including VM Monitoring. Each of the settings and the mechanisms associated with these will be described more

    in-depth in the following chapters.

    1. Select the Hosts & Clusters view.

    2. Right-click the Datacenter in the Inventory tree and click New Cluster.

    3. Give the new cluster an appropriate name. We recommend at a minimum including the

    location of the cluster and a sequence number, e.g. ams-hadrs-001.

    4. In the Cluster Features section of the page, select Turn On VMware HA and click Next.

    5. Ensure Host Monitoring Status and Admission Control are enabled and click Next

    6. Leave the Cluster Default Settings as they are and click Next

    7. Enable VM Monitoring Status by selecting “VM Monitoring Only” and click Next

    8. Leave VMware EVC set to the default and click Next

    9. Leave the Swapfile Policy set to default and click Next

    10. Click Finish to complete the creation of the cluster


    When the HA cluster has been created ESX hosts can be added to the cluster simply by dragging

    them into the cluster. When an ESX host is added to the cluster the HA agent will be loaded.
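    For those who prefer to script this, a minimal PowerCLI sketch is shown below. The vCenter name, datacenter name and host name are assumptions, and VM Monitoring would still be enabled afterwards through the cluster settings:

    Connect-VIServer -Server "vcenter01"                      # hypothetical vCenter Server name
    $dc = Get-Datacenter -Name "Amsterdam"                    # hypothetical datacenter name
    # Create the cluster with HA and Admission Control enabled
    New-Cluster -Name "ams-hadrs-001" -Location $dc -HAEnabled -HAAdmissionControlEnabled
    # Add a host; the HA agent is pushed to the host automatically when it joins the cluster
    Add-VMHost -Name "esx01.local" -Location (Get-Cluster -Name "ams-hadrs-001") -User "root" -Password "password" -Force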


    Chapter 2

    Components of High Availability

    Now that we know what the pre-requisites are and how to configure HA, the next step is to

    describe which components form HA. This is still a "high level" overview however. There is more

    under the cover that we will explain in following chapters. The following diagram depicts a two

    host cluster and shows the key HA components.

    Figure 3: Components of High Availability

    As you can clearly see there are three major components that form the foundation for HA:

    •  VPXA

    •  VMAP

    •  AAM


    VPXA

    The first and probably the most important is VPXA. This is not an HA agent, but it is the vCenter

    agent and it allows your vCenter Server to interact with your ESX host. It also takes care of

    stopping and starting virtual machines if and when needed.

    HA is loosely coupled with vCenter Server. Although HA is configured by vCenter Server, it does not need vCenter to manage an HA failover. It is comforting to know that in case of a host failure

    containing the virtualized vCenter server, HA takes care of the failure and restarts the vCenter

    server on another host, including all other configured virtual machines from that failed host.

    When a virtual vCenter is used we do however recommend setting the correct restart priorities

    within HA to avoid any dependency problems.

    It’s highly recommended to register ESX hosts with their FQDN in vCenter. VMware vCenter

    supplies the name resolution information that HA needs to function. HA stores this locally in a file

    called “FT_HOSTS”. In other words, from an HA perspective there is no need to create local host files

    and it is our recommendation to avoid using local host files. They are too static and will make troubleshooting more difficult.

    To stress this point even more, as of vSphere 4.0 Update 1 host files (i.e. /etc/hosts) are corrected

    automatically by HA. In other words if you have made a typo or for example forgot to add the short

    name HA will correct the host file to make sure nothing interferes with HA.

    Basic design principle:

    Avoid using static host files as it leads to inconsistency, which makes troubleshooting

    difficult.

    VMAP Plug-In

    Next on the list is VMAP. Where vpxa is the process for vCenter to communicate with the host,

    VMAP is the translator for the HA agent (AAM) and vpxa. When vpxa wants to communicate with

    the AAM agent VMAP will translate this into understandable instructions for the AAM agent. A good

    example of what VMAP would translate is the state of a virtual machine: is it powered on or

    powered off? Pre-vSphere 4.0 VMAP was a separate process instead of a plugin linked into vpxa. VMAP is loaded into vpxa at runtime when a host is added to an HA cluster.

    The vpxa communicates with VMAP and VMAP communicates with AAM. When AAM has received and

    flushed the info it will tell VMAP, and VMAP in turn will acknowledge to vpxa that the info has

    been processed. The VMAP plug-in acts as a proxy for communication to AAM.


    One thing you are probably wondering is why do we need VMAP in the first place? Wouldn’t this be

    something vpxa or AAM should be able to do? The answer is yes, either vpxa or AAM should be able

    to provide this functionality. However, when HA was first introduced it was architecturally more

    prudent to create a separate process for dealing with this which has now been turned into a plugin.

    AAM

    That brings us to our next and final component, the AAM agent. The AAM agent is the core of HA

    and actually stands for “Automated Availability Manager”. As stated above, AAM was originally

    developed by Legato. It is responsible for many tasks such as communicating host resource

    information, virtual machine states and HA properties to other hosts in the cluster. AAM stores all

    this info in a database and ensures consistency by replicating this database amongst all primary

    nodes. (Primary nodes are discussed in more detail in chapter 4.) It is often mentioned that HA uses

    an In-Memory database only, this is not the case! The data is stored in a database on local storage or

    in FLASH memory on diskless ESXi hosts.

    One of the other tasks AAM is responsible for is the mechanism with which HA detects

    isolations/failures: heartbeats.

    All this makes the AAM agent one of the most important processes on an ESX host, when HA is

    enabled of course, but we are assuming for now it is. The engineers recognized the importance and

    added an extra level of resiliency to HA. The agent is multi-process and each process acts as a

    watchdog for the other. If one of the processes dies the watchdog functionality will pick up on this

    and restart the process to ensure HA functionality remains without anyone ever noticing it failed. It

    is also resilient to network interruptions and component failures. Inter-host communication

    automatically uses another communication path (if the host is configured with redundant

    management networks) in the case of a network failure. The underlying message framework

    guarantees exactly-once message delivery.

    Chapter 3

    Nodes

    An HA cluster consists of hosts, or nodes as HA calls them. There are two types of nodes. A node is

    either a primary or a secondary node. This concept was introduced to enable scaling up to 32 hosts

    in a cluster and each type of node has a different role. Primary nodes hold cluster settings and all

    “node states”. The data a primary node holds is stored in a persistent database and synchronized

    between primaries as depicted in the diagram above.

    An example of node state data would be host resource usage. In case vCenter is not available the

    primary nodes will always have a very recent calculation of the resource utilization and can take

    this into account when a failover needs to occur. Secondary nodes send their state info to primary

    nodes. This will be sent when changes occur, generally within seconds after a change. As of vSphere

    4.1 by default every host will send an update of its status every 10 seconds. Pre-vSphere 4.1 this

    used to be every second.

    This interval can be controlled by an advanced setting called das.sensorPollingFreq. As stated

    before, the default value of this advanced setting is 10. Although a smaller value will lead to a more

    up-to-date view of the status of the cluster overall, it will also increase the amount of traffic between

    nodes. It is not recommended to decrease this value as it might lead to decreased scalability due to the

    overhead of these status updates. The maximum value of the advanced setting is 30.
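    As an illustration of how such a cluster-level das.* advanced option can be applied without the vSphere Client, the following PowerCLI sketch uses the New-AdvancedSetting cmdlet from newer PowerCLI releases; the cluster name is an assumption carried over from the earlier example:

    # Apply an HA (das.*) advanced option to the cluster; here the node status update interval
    $cluster = Get-Cluster -Name "ams-hadrs-001"
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.sensorPollingFreq" -Value 10 -Confirm:$false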

    As discussed earlier, HA uses a heartbeat mechanism to detect possible outages or network

    isolation. The heartbeat mechanism is used to detect a failed or isolated node. However, a node will

    recognize it is isolated by the fact that it isn’t receiving heartbeats from any of the other nodes.

    Nodes send a heartbeat to each other. Primary nodes send heartbeats to all primary nodes and all

    secondary nodes. Secondary nodes send their heartbeats to all primary nodes, but not to

    secondaries. Nodes send out these heartbeats every second by default. However, this is a

    configurable value through the use of the following cluster advanced setting:

    das.failuredetectioninterval. We do, however, not recommend changing this interval as it was carefully selected by VMware.

    The first 5 hosts that join the HA cluster are automatically selected as primary nodes. All other

    nodes are automatically selected as secondary nodes. When you do a reconfigure for HA, the

    primary nodes and secondary nodes are selected again; this is virtually random.

    Except for the first host that is added to the cluster, any host that joins the cluster must

    communicate with an existing primary node to complete its configuration. At least one primary host

    must be available for HA to operate correctly. If all primary hosts are unavailable, you will not be

    able to add or remove a host from your cluster.

    The vCenter client normally does not show which host is a primary node and which is a secondary

    node. As of vCenter 4.1 a new feature has been added which is called “Operational Status” and can

    be found on the HA section of the Cluster’s summary tab. It will give details around errors and will

    show the primary and secondary nodes. There is one gotcha however; it will only show which

    nodes are primary and secondary in case of an error.


    Figure 5: Cluster operational status

    This however can also be revealed from the Service Console or via PowerCLI. The following are two

    examples of how to list the primary nodes via the Service Console (ESX 4.0):

    Figure 6: List node command

    Another method of showing the primary nodes is:


    Figure 7: List nodes command

    With PowerCLI the primary nodes can be listed with the following lines of code:

    PowerCLI code:

    Get-Cluster | Get-HAPrimaryVMHost

    Now that you have seen that it is possible to list all nodes with the CLI, you probably

    wonder what else is possible… Let’s start with a warning - this is not supported! Currently the

    supported limit of primaries is 5. This is a soft limit however. It is possible to manually add a 6th

    primary but this is neither supported nor encouraged.

    Having more than 5 primaries in a cluster will significantly increase network and CPU overhead.

    There should be no reason to increase the number of primaries beyond 5. For the purpose of

    education we will demonstrate how to promote a secondary node to primary and vice versa.

    To promote a node:


    Promoting Nodes

    A common misunderstanding about HA with regards to primary and secondary nodes is the

    re-election process. When does a re-election, or promotion, occur?

    It is a common misconception that a promotion of a secondary occurs when a primary node fails.

    This is not the case. Let's stress that, this is not the case! The promotion of a secondary node to primary only occurs in one of the following scenarios:

    •  When a primary node is placed in "Maintenance Mode"

    •  When a primary node is disconnected from the cluster

    •  When a primary node is removed from the cluster

    •  When the user clicks “reconfigure for HA” on any ESX host

    This is particularly important for the operational aspect of a virtualized environment. When a host

    fails it is important to ensure its role is migrated to any of the other hosts in case it was an HA

    primary node. To simplify it: when a host fails, we recommend placing it in maintenance mode,

    disconnecting it or removing it from the cluster to avoid any risks!

    If all primary hosts fail simultaneously no HA initiated restart of the virtual machines can take

    place. HA needs at least one primary node to restart virtual machines. This is why you can configure

    HA to tolerate only up to 4 host failures when you have selected the “host failures” Admission

    Control Policy (Remember 5 primaries…). The number of primaries is definitely something to take

    into account when designing for uptime.

    Failover Coordinator

    As explained in the previous section, you will need at least one primary to restart virtual machines.

    The reason for this is that one of the primary nodes will hold the “failover coordinator” role. This

    role will be randomly assigned to a primary node; this role is also sometimes referred to as “active

    primary”. We will use “failover coordinator” for now.

    The failover coordinator coordinates the restart of virtual machines on the remaining primary and

    secondary hosts. The coordinator takes restart priorities into account when coordinating the restarts.

    Pre-vSphere 4.1, when multiple hosts failed at the same time, it would handle the restarts

    serially. In other words, restart the virtual machines of the first failed host (taking restart priorities

    into account) and then restart the virtual machines of the host that failed second (again taking

    restart priorities into account). As of vSphere 4.1 this mechanism has been significantly improved. In the

    case of multiple near-simultaneous host failures, all the host failures that occur within 15 seconds will have all their VMs aggregated and prioritized before the power-on operations occur.

    If the failover coordinator fails, one of the other primaries will take over. This node is again

    randomly selected from the pool of available primary nodes. As any other process within the HA

    stack, the failover coordinator process is carefully watched by the watchdog functionality of HA.


    Pre-vSphere 4.1 the failover coordinator would decide where a virtual machine would be restarted.

    Basically it would check which host had the highest percentage of unreserved and available

    memory and CPU and select it to restart that particular virtual machine. For the next virtual

    machine the same exercise would be done by HA, select the host with the highest percentage of

    unreserved memory and CPU and restart the virtual machine.

    HA does not coordinate with DRS when making the decision on where to place virtual machines. HA

    would rely on DRS. As soon as the virtual machines were restarted, DRS would kick in and

    redistribute the load if and when needed.

    As of vSphere 4.1 virtual machines will be evenly distributed across hosts to lighten the load on the

    hostd service and to get quicker power-on results. HA then relies on DRS to redistribute the load

    later if required. This improvement results in faster restarts of the virtual machines and less stress

    on the ESX hosts. DRS also re-parents the virtual machine when it is booted up as virtual machines

    are failed over into the root resource pool by default. This re-parent process however did already

    exist pre-vSphere 4.1.

    The failover coordinator can restart up to 32 VMs concurrently per host. The number of concurrent

    failovers can be controlled by an advanced setting called das.perHostConcurrentFailoversLimit. As

    stated the default value is 32. Setting a larger value will allow more VMs to be restarted

    concurrently and might reduce the overall VM recovery time, but the average latency to recover

    individual VMs might increase.

    In blade environments it is particularly important to factor the primary nodes and failover

    coordinator concept into your design. When designing a multi chassis environment the impact of a

    single chassis failure needs to be taken into account. When all primary nodes reside in a single

    chassis and the chassis fails, no virtual machines will be restarted as the failover coordinator is the

    only one who initiates the restart of your virtual machines. When it is unavailable, no restart will take place.

    It is a best practice to have the primaries distributed amongst the chassis so that, in case an entire chassis

    fails or a rack loses power, there is still a running primary to coordinate the failover. This can even

    be extended in very large environments by having no more than 2 hosts of a cluster in a chassis.

    The following diagram depicts the scenario where four 8-host clusters are spread across four

    chassis.


    Figure 10: Logical cluster layout on blade environment

    Basic design principle: In blade environments, divide hosts over all blade chassis and never exceed

    four hosts per chassis to avoid having all primary nodes in a single chassis.

    Preferred Primary

    With vSphere 4.1 a new advanced setting has been introduced. This setting is not even

    experimental; it is currently considered unsupported. We don't recommend anyone using it in a

    production environment; if you do want to play around with it, use your test environment.

    This new advanced setting is called das.preferredPrimaries. With this setting multiple hosts of a

    cluster can be manually designated as a preferred node during the primary node election process.

    The list of nodes can either be comma or space separated and both hostnames and IP addresses are

    allowed. Below you can find an example of what this would typically look like. The “=” sign has been

    used as a divider between the setting and the value.


    das.preferredPrimaries = hostname1,hostname2,hostname3

    or

    das.preferredPrimaries = 192.168.1.1 192.168.1.2 192.168.1.3

    As shown there is no need to specify 5 hosts; you can specify any number of hosts. If you specify 5

    hosts, or less, and all 5 hosts are available they will become the primary nodes in your cluster. If you

    specify more than 5 hosts, the first 5 hosts of your list will become primary.

    Again, please be warned that this is considered unsupported at the time of writing and please verify in

    the VMware Availability Guide or online in the knowledge base (kb.vmware.com) what the status is

    of the support on this feature before even thinking about implementing it.

    A workaround found by some pre-vSphere 4.1 was using the "promote/demote" option of HA's CLI

    as described earlier in this chapter. Although this solution could fairly easily be scripted, it is

    unsupported and, as opposed to "das.preferredPrimaries", a rather static solution.


    Chapter 4

    High Availability Constructs

    When configuring HA two major decisions will need to be made.

    •  Isolation Response

    •  Admission Control

    Both are important to how HA behaves. Both will also have an impact on availability. It is really

    important to understand these concepts. Both concepts have specific caveats. Without a good

    understanding of these it is very easy to increase downtime instead of decreasing downtime.

    Isolation Response

    One of the first decisions that will need to be made when HA is configured is the “isolation

    response”. The isolation response refers to the action that HA takes for its VMs when the host has

    lost its connection with the network. This does not necessarily mean that the whole network is

    down; it could just be this host's network ports or just the ports that are used by HA for the

    heartbeat. Even if your virtual machine has a network connection and only your “heartbeat

    network” is isolated the isolation response is triggered.

    Today there are three isolation responses, “Power off”, “Leave powered on” and “Shut down”. This

    answers the question what a host should do when it has detected it is isolated from the network. In

    any of the three cases, the remaining, non-isolated hosts will always try to restart the

    virtual machines no matter which of the following three options is chosen as the isolation response:

    •  Power off – When network isolation occurs all virtual machines are powered off. It is a hard

    stop, or to put it bluntly, the power cable of the VMs will be pulled out!

    •  Shut down – When network isolation occurs all virtual machines running on the host will be

    shut down using VMware Tools. If this is not successful within 5 minutes, a “power off” will

    be executed. This time out value can be adjusted by setting the advanced option

    das.isolationShutdownTimeout. If VMware Tools is not installed, a “power off” will be

    initiated immediately.

    •  Leave powered on – When network isolation occurs on the host, the state of the virtual

    machines remains unchanged.

    This setting can be changed on the cluster settings under virtual machine options.


    Figure 11: Cluster default setting
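    For reference, the cluster default can also be changed from PowerCLI. The sketch below is illustrative only and assumes the example cluster name; verify that your PowerCLI version exposes the -HAIsolationResponse parameter, which covers "Power off" and "Leave powered on" (the "Shut down" response is selected through the cluster settings dialog):

    # Set the cluster default isolation response; "PowerOff" or "DoNothing" (leave powered on)
    Set-Cluster -Cluster (Get-Cluster -Name "ams-hadrs-001") -HAIsolationResponse PowerOff -Confirm:$false
    # Optionally adjust the "Shut down" timeout (5 minutes by default) via the advanced setting mentioned above
    New-AdvancedSetting -Entity (Get-Cluster -Name "ams-hadrs-001") -Type ClusterHA -Name "das.isolationShutdownTimeout" -Value 600 -Confirm:$false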

    The default setting for the isolation response has changed multiple times over the last couple of

    years. Up to ESX 3.5 U2 / vCenter 2.5 U2 the default isolation response when creating a new cluster

    was “Power off”. This changed to “Leave powered on” as of ESX 3.5 U3 / vCenter 2.5 U3. However

    with vSphere 4.0 this has changed again. The default setting for newly created clusters, at the time

    of writing, is “Shut down” which might not be the desired response. When installing a new

    environment, you might want to change the default setting based on your customer's requirements

    or constraints.

    The question remains, which setting should you use? The obvious answer applies here; it depends.

    We prefer "Shut down" because we do not want to use a degraded host to run our virtual machines

    on and it will shut down your virtual machines in a clean manner. Many people however prefer to use

    “Leave powered on” because it eliminates the chances of having a false positive and the associated

    down time with a false positive. A false positive in this case is an isolated heartbeat network but a

    non-isolated virtual machine network and a non-isolated iSCSI / NFS network.

    That leaves the question of how the other HA nodes know whether the host is isolated or has failed.

    HA actually does not know the difference. The other HA nodes will try to restart the affected virtual

    machines in either case. When the host is unavailable, a restart attempt will take place no matter

    which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will not be able to restart the affected virtual machines. The reason for this is the fact that the host that

    is running the virtual machine has a lock on the VMDK and swap files. None of the hosts will be able

    to boot a virtual machine when the files are locked. For those who don’t know, ESX locks files to

    prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a

    host fails, this lock expires and a restart can occur.

    To reiterate, the remaining nodes will always try to restart the “failed” virtual machines. The

    possible lock on the VMDK files belonging to these virtual machines, in the case of an isolation

    event, prevents them from being started. This assumes that the isolated host can still reach the files,

    which might not be true if the files are accessed through the network on iSCSI, NFS, or FCoE based

    storage. HA however will repeatedly try starting the "failed" virtual machines when a restart is unsuccessful.

    The number of retries is configurable as of vCenter 2.5 U4 with the advanced option

    "das.maxvmrestartcount". The default value is 5. Pre-vCenter 2.5 U4 HA would keep retrying

    forever which could lead to serious problems as described in KB article 1009625 where multiple


    Split-Brain

    When creating your design, make sure you understand the isolation response setting. For instance, when using an iSCSI array or NFS based storage, choosing "Leave powered on" as your default

    isolation response might lead to a split-brain situation.

    A split-brain situation can occur when the VMDK file lock times out. This could happen when the

    iSCSI, FCoE or NFS network is also unavailable. In this case the virtual machine is being restarted on

    a different host while it is not being powered off on the original host because the selected isolation

    response is "Leave powered on". This could potentially leave vCenter in an inconsistent state, as

    two VMs with the same UUID would be reported as running on both hosts. This would cause a

    “ping-pong” effect where the VM would appear to live on ESX host 1 at one moment and on ESX

    host 2 soon after.

    VMware’s engineers have recognized this as a potential risk and developed a solution for this

    unwanted situation. (This is not well documented, but briefly explained by one of the engineers on the

    VMTN Community forums. http://communities.vmware.com/message/1488426#1488426.)

    In short: as of version 4.0 Update 2 ESX detects that the lock on the VMDK has been lost and issues a

    question asking whether the virtual machine should be powered off, and auto-answers the question with yes.

    However, you will only see this question if you directly connect to the ESX host. HA will generate an

    event for this auto-answer though, which is viewable within vCenter. Below you can find a

    screenshot of this question.

    Figure 13: Virtual machine message


    As stated above, as of ESX 4 update 2 the question will be auto-answered and the virtual machine

    will be powered off to recover from the split brain scenario.

    The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them

    powered on?

    As described above in earlier versions, "Leave powered on" could lead to a split-brain scenario. You

    would end up seeing multiple virtual machines ping-ponging between hosts as vCenter would not

    know where they resided, as they were active in memory on two hosts. As of ESX 4.0 Update 2, this is

    however not the case anymore and it should be safe to use “Leave powered on”.

    We recommend avoiding the chances of a split-brain scenario. Configure a secondary Service

    Console on the same vSwitch and network as the iSCSI or NFS VMkernel portgroup and, pre-vSphere

    4.0 Update 2, select either "Power off" or "Shut down" as the isolation response. By doing this you

    will be able to detect if there's an outage on the storage network. We will discuss the options you

    have for Service Console / Management Network redundancy more extensively later on.

    Basic design principle:  For network-based storage (iSCSI, NFS, FCoE) it is recommended

    (pre-vSphere 4.0 Update 2) to set the isolation response to "Shut down" or

    "Power off". It is also recommended to have a secondary Service Console (ESX) or

    Management Network (ESXi) running on the same vSwitch as the storage network to detect a

    storage outage and avoid false positives for isolation detection.

    Isolation Detection

    We have explained what the options are to respond to an isolation event. However we have not

    extensively discussed how isolation is detected. This is one of the key mechanisms of HA. Isolation

    detection is a mechanism that takes place on the host that is isolated. The remaining, non-isolated,

    hosts don't know if that host has failed completely or if it is isolated from the network; they only

    know it is unavailable.

    The mechanism is fairly straightforward though and works as earlier explained with heartbeats.

    When a node receives no heartbeats from any of the other nodes for 13 seconds (default setting)

    HA will ping the “isolation address”. Remember primary nodes send heartbeats to primaries and

    secondaries, secondary nodes send heartbeats only to primaries.


    The isolation address is the gateway specified for the Service Console network (or management

    network on ESXi), but there is a possibility to specify one or multiple additional isolation addresses

    with an advanced setting. This advanced setting is called “das.isolationaddress” and could be used

    to reduce the chances of having a false positive. We recommend setting at least one additional

    isolation address.

    Figure 14: das.isolationaddress

    When isolation has been confirmed, meaning no heartbeats have been received and HA was unable

    to ping any of the isolation addresses, HA will execute the isolation response. This could be any of the above-described options: power off, shut down or leave powered on.

    If only one heartbeat is received or just a single isolation address can be pinged the isolation

    response will not be triggered, which is exactly what you want.
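    To summarize this logic, here is an illustrative PowerShell-style sketch; it is not an actual HA interface, and the function and parameter names are made up purely for clarity:

    function Test-TriggerIsolationResponse {
        param(
            [bool]     $HeartbeatReceived,     # was any heartbeat received from another node?
            [string[]] $IsolationAddresses     # the gateway plus any das.isolationaddress entries
        )
        if ($HeartbeatReceived) { return $false }                      # heartbeats seen: not isolated
        foreach ($address in $IsolationAddresses) {
            if (Test-Connection -ComputerName $address -Count 1 -Quiet) {
                return $false                                          # a single pingable address is enough: no response triggered
            }
        }
        return $true                                                   # no heartbeats and nothing pingable: execute the isolation response
    }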


    Selecting an Additional Isolation Address

    A question asked by many people is which address should be specified for this additional isolation

    verification. We generally recommend an isolation address closest to the hosts to avoid too many

    network hops. In many cases the most logical choice is the physical switch to which the host is

    directly connected; another usual suspect would be a router or any other reliable and pingable

    device. However, when you are using network-based shared storage like NFS or, for instance, iSCSI,

    a good choice would be the IP address of the storage device; this way you would also verify if the storage is

    still reachable or not.

    Failure Detection Time

    Failure Detection Time seems to be a concept that is often misunderstood but is critical when

    designing a virtual infrastructure. Failure Detection Time is basically the time it takes before the

    “isolation response” is triggered. There are two primary concepts when we are talking about failure

    detection time:

    •  The time it will take the host to detect it is isolated

    •  The time it will take the non-isolated hosts to mark the unavailable host as isolated and

    initiate the failover

    The following diagram depicts the timeline for both concepts:

    Figure 15: High Availability failure detection time

    The default value for failure detection is 15 seconds (das.failuredetectiontime). In other words the

    failed or isolated host will be declared failed by the other hosts in the HA cluster on the fifteenth

    second and a restart will be initiated by the failover coordinator after one of the primaries has

    verified that the failed or isolated host is unavailable by pinging the host on its management

    network.


    It should be noted that in the case of a dual management network setup both addresses will be

    pinged and 1 second will need to be added to the timeline. This means that the failover coordinator

    will initiate the restart on the 17th second.

    Let’s stress that again, a restart will be initiated after one of the primary nodes has tried to ping all

    of the management network addresses of the failed host.

    Let's assume the isolation response is "Power off". The isolation response "Power off" will be

    triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words a

    “Power off” will be initiated on the fourteenth second. A restart will be initiated on the sixteenth

    second by the failover coordinator if the host has a single management network.
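    The timeline can be summarized with a small worked calculation derived from the text above, assuming a single management network and the default das.failuredetectiontime:

    $failureDetectionTime = 15000 / 1000                 # das.failuredetectiontime, converted to seconds
    $isolationResponseAt  = $failureDetectionTime - 1    # second 14: isolated host triggers "Power off"
    $hostDeclaredFailedAt = $failureDetectionTime        # second 15: other hosts declare the host failed
    $restartInitiatedAt   = $failureDetectionTime + 1    # second 16: failover coordinator initiates restarts
    # With a dual management network, one extra address is pinged and the restart starts at second 17.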

    Does this mean that you can end up with your virtual machines being down and HA not restarting

    them?

    Yes, when the heartbeat returns between the 14th and 16th second the “Power off” might have

    already been initiated. The restart however will not be initiated because the received heartbeat

    indicates that the host is not isolated anymore.

    How can you avoid this?

    Selecting “Leave VM powered on” as an isolation response is one option. Increasing the

    das.failuredetectiontime  will also decrease the chances of running into issues like these, and with

    ESX 3.5 it was a standard best practice to increase the failure detection time to 30 seconds.

    At the time of writing (vSphere) this is not a best practice anymore as with any value the “2-second”

    gap exists and the likelihood of running into this issue is small. We recommend keeping

    das.failuredetectiontime as low as possible to decrease associated down time.

    Basic design principle:  Keep das.failuredetectiontime low for fast responses to failures. If an isolation validation address has been added, "das.isolationaddress", add 5000

    to the default “das.failuredetectiontime” (15000).
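    A PowerCLI sketch of this design principle, assuming the example cluster name used earlier and an arbitrary isolation address (replace it with a reliable address in your environment, such as a physical switch or the IP address of your storage device):

    $cluster = Get-Cluster -Name "ams-hadrs-001"
    # Add an extra isolation address; the default gateway is still used as well
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress2" -Value "192.168.1.254" -Confirm:$false
    # One extra address to ping, so raise the failure detection time from 15000 to 20000 milliseconds
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.failuredetectiontime" -Value 20000 -Confirm:$false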

    Chapter 5

    Adding Resiliency to HA (Network Redundancy)

    Single Service Console with vmnics in Active/Standby Configuration

    Recommended:

    2 physical switches

    The vSwitch should be configured as follows:

    •  vSwitch0: 2 Physical NICs (vmnic0 and vmnic2)

    •  2 Portgroups (Service Console and VMkernel)

    •  Service Console active on vmnic0 and standby on vmnic2

    •  VMkernel active on vmnic2 and standby on vmnic0

    •  Failback set to No

    Each portgroup has a VLAN ID assigned and runs dedicated on its own physical NIC; only in the

    case of a failure is it switched over to the standby NIC. We highly recommend setting failback to

    “No” to avoid chances of a false positive which can occur when a physical switch routes no traffic

    during boot but the ports are reported as “up”. (NIC Teaming Tab)
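    If you prefer to script this failover order instead of using the NIC Teaming tab, the PowerCLI sketch below shows one way of doing it; the host name and portgroup names are assumptions, and the teaming cmdlet parameters should be checked against your PowerCLI version:

    $vSwitch = Get-VirtualSwitch -VMHost (Get-VMHost -Name "esx01.local") -Name "vSwitch0"
    # Service Console: active on vmnic0, standby on vmnic2, failback disabled
    Get-VirtualPortGroup -VirtualSwitch $vSwitch -Name "Service Console" | Get-NicTeamingPolicy |
        Set-NicTeamingPolicy -MakeNicActive "vmnic0" -MakeNicStandby "vmnic2" -FailbackEnabled $false
    # VMkernel: active on vmnic2, standby on vmnic0, failback disabled
    Get-VirtualPortGroup -VirtualSwitch $vSwitch -Name "VMkernel" | Get-NicTeamingPolicy |
        Set-NicTeamingPolicy -MakeNicActive "vmnic2" -MakeNicStandby "vmnic0" -FailbackEnabled $false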

    Pros:  Only 2 NICs in total are needed for the Service Console and VMkernel, especially useful in

    Blade environments. This setup is also less complex.

    Cons: Just a single active path for heartbeats.

    The following diagram depicts the active/standby scenario:

    Figure 16: Active-standby Service Console network layout


    Secondary Management Network

    Requirements:

    •  3 physical NICs

    •  VLAN trunking

    Recommended:

    •  2 physical switches

    •  The vSwitch should be configured as follows:

    •  vSwitch0 – 3 Physical NICs (vmnic0, vmnic1 and vmnic2)

    •  3 Portgroups (Service Console, secondary Service Console and VMkernel)


    The primary Service Console runs on vSwitch0 and is active on vmnic0, with a VLAN assigned on

    either the physical switch port or the portgroup and is connected to the first physical switch. (We

    recommend using a VLAN trunk for all network connections for consistency and flexibility.)

    The secondary Service Console will be active on vmnic2 and connected to the second physical

    switch.

    The VMkernel is active on vmnic1 and standby on vmnic2.

    Pros  - Decreased chances of false alarms due to Spanning Tree "problems" as the setup contains

    two Service Consoles, each connected to only 1 physical switch. Subsequently both Service

    Consoles will be used for the heartbeat mechanism, which will increase resiliency.

    Cons  - Need to set advanced settings. It is mandatory to set an additional isolation address

    (das.isolationaddress2) in order for the secondary Service Console to verify network isolation via a

    different route.

    The following diagram depicts the secondary Service Console scenario:

    Figure 17: Secondary management network


    The question remains: which would we recommend? Both scenarios are fully supported and

    provide a highly redundant environment either way. Redundancy for the Service Console or

    Management Network is important for HA to function correctly and avoid false alarms about the

    host being isolated from the network. We however recommend the first scenario. Redundant NICs

    for your Service Console add a sufficient level of resilience without leading to an overly complex

    environment.


    Chapter 6 

    Admission Control

    Admission Control is often misunderstood and disabled because of this. However Admission

    Control is a must when availability needs to be guaranteed and isn’t that the reason for enabling HA

    in the first place?

    What is HA Admission Control about? Why does HA contain Admission Control?

    The "Availability Guide", a.k.a. the HA bible, states the following:

    "vCenter Server uses Admission Control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected."

    Admission Control guarantees capacity is available for an HA initiated failover by reserving

    resources within a cluster. It calculates the capacity required for a failover based on available

    resources. In other words, if a host is placed into maintenance mode, or disconnected, it is taken out

    of the equation. Available resources also mean that the virtualization overhead has already been

    subtracted from the total. To give an example: Service Console memory and VMkernel memory are

    subtracted from the total amount of memory, which results in the available memory for the virtual

    machines.

    There is one gotcha with Admission Control that we want to bring to your attention before drilling

    into the different policies.

    When Admission Control is set to strict, VMware Distributed Power Management in no way will

    violate availability constraints. This means that it will always ensure multiple hosts are up and running. (For more info on how DPM calculates this, read Chapter 18.)

    When Admission Control was disabled and DPM was enabled in a pre-vSphere 4.1 environment you

    could have ended up with all but one ESX host placed in sleep mode, which could lead to potential

    issues when that particular host failed or resources were scarce as there would be no host available

    to power-on your virtual machines. (KB: http://kb.vmware.com/kb/1007006)


    With vSphere 4.1 however; if there are not enough resources to power on all hosts, DPM will be

    asked to take hosts out of standby mode to make more resources available and the virtual machines

    can then get powered on by HA when those hosts are back online.

     Admission Control Policy

    The Admission Control Policy dictates the mechanism that HA uses to guarantee enough resources

    are available for an HA initiated failover. This section gives a general overview of the available

    Admission Control Policies. The impact of each policy is described in the following section including

    our recommendation.

    HA has three mechanisms to guarantee enough capacity is available to respect virtual machine

    resource reservations.

    Figure 18: Admission control policy


    Below we have listed all three options currently available as the Admission Control Policy. Each

    option has a different mechanism to ensure resources are available for a failover and each option

    has its caveats.

     Admission Control Mechanisms

Each Admission Control Policy has its own Admission Control mechanism. Understanding these mechanisms is important to understanding the impact of design decisions on your cluster. For instance, setting a reservation on a specific virtual machine can have an impact on the achieved consolidation ratio. This section will take you on a journey through the trenches of Admission Control mechanisms.

    Host Failures Cluster Tolerates

    The Admission Control Policy that has been around the longest is the “Host Failures Cluster

    Tolerates” policy. It is also historically the least understood Admission Control Policy due to its

    complex admission control mechanism.

    The so-called “slots” mechanism is used when selecting “host failures cluster tolerates” as the

Admission Control Policy. This mechanism has changed several times in the past, and it is one of the most restrictive policies.

Slots dictate how many virtual machines can be powered on before vCenter starts yelling “Out Of Resources!” Normally a slot represents one virtual machine. Admission Control does not limit HA in restarting virtual machines; it ensures enough resources are available to power on all virtual machines in the cluster by preventing “over-commitment.” For those wondering why HA-initiated failovers are not subject to the Admission Control Policy, think about it for a second: Admission Control is done by vCenter, while HA-initiated restarts are executed directly on the ESX host without the use of vCenter. So even if resources were low and vCenter complained, it could not stop the restart.

    If a failure has occurred and the host has been removed from the cluster, HA will recalculate all the

    values and start with an “N+x” cluster again from scratch. This could result in an over-committed

    cluster as you can imagine.

    “A slot is defined as a logical representation of the memory and CPU resources that satisfy the

    requirements for any powered-on virtual machine in the cluster…”

In other words, a slot is the worst-case CPU and memory reservation scenario in a cluster. This directly leads to the first “gotcha”:

HA uses the highest CPU reservation of any given virtual machine and the highest memory reservation of any given VM in the cluster. If no reservation higher than 256 MHz is set, HA will use a default of 256 MHz for CPU. If no memory reservation is set, HA will use a default of 0MB plus memory overhead for memory. (See the VMware vSphere Resource Management Guide for


more details on memory overhead per virtual machine configuration.) The following example will clarify what “worst-case” actually means.

Example - If virtual machine “VM1” has 2GHz of CPU reserved and 1024MB of memory reserved, and virtual machine “VM2” has 1GHz of CPU reserved and 2048MB of memory reserved, the slot size for memory will be 2048MB (plus memory overhead) and the slot size for CPU will be 2GHz. It is a combination of the highest reservation of both virtual machines. Reservations defined at the Resource Pool level, however, will not affect HA slot size calculations.

Basic design principle: Be really careful with reservations. If there is no need to have them on a per-virtual machine basis, don't configure them, especially when using Host Failures Cluster Tolerates. If reservations are needed, resort to resource pool-based reservations.

Now that we know the worst-case scenario is always taken into account when it comes to slot size calculations, we will describe what dictates the number of available slots per cluster.

We first need to know what the slot size for memory and CPU is. Then we divide the total available CPU resources of a host by the CPU slot size and the total available memory resources of a host by the memory slot size. This leaves us with a number of slots for both memory and CPU. The most restrictive number (again, the worst-case scenario) is the number of slots for this host. If you have 25 CPU slots but only 5 memory slots, the number of available slots for this host will be 5, as HA will always take the worst-case scenario into account to “guarantee” all virtual machines can be powered on in case of a failure or isolation.
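To make this arithmetic concrete, here is a minimal Python sketch of the calculation as described above; the function names and the example host size are ours and purely illustrative, not part of any VMware tool or API.

```python
from math import floor

# Minimal sketch of the slot mechanism described above.
# vm_reservations: (CPU MHz, memory MB) per powered-on VM; host values are
# the resources available after virtualization overhead has been subtracted.
def slot_size(vm_reservations, default_cpu_mhz=256, default_mem_mb=0):
    """Worst-case reservation across all powered-on VMs defines the slot."""
    cpu_slot = max([default_cpu_mhz] + [cpu for cpu, _ in vm_reservations])
    # Memory slot = highest memory reservation; per-VM memory overhead is
    # folded into the reservation numbers here for simplicity.
    mem_slot = max([default_mem_mb] + [mem for _, mem in vm_reservations]) or 1
    return cpu_slot, mem_slot

def slots_per_host(host_cpu_mhz, host_mem_mb, cpu_slot, mem_slot):
    """The most restrictive of the CPU and memory slot counts wins."""
    return min(floor(host_cpu_mhz / cpu_slot), floor(host_mem_mb / mem_slot))

# Example from the text: VM1 (2 GHz, 1024 MB) and VM2 (1 GHz, 2048 MB),
# plus a hypothetical host with 24 GHz and 16 GB available.
cpu_slot, mem_slot = slot_size([(2000, 1024), (1000, 2048)])
print(cpu_slot, mem_slot)                                 # 2000 MHz, 2048 MB
print(slots_per_host(24000, 16384, cpu_slot, mem_slot))   # min(12, 8) = 8 slots
```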

A question we receive a lot is: how do I know what my slot size is? The details around slot sizes can be monitored in the HA section of the cluster's Summary tab by clicking the “Advanced Runtime Info” line.

    Figure 19: High Availability cluster summary tab

This will show the following screen, which specifies the slot size and more useful details around the number of slots available.


    Figure 20: High Availability advanced runtime info

As you can see, using reservations on a per-VM basis can lead to very conservative consolidation ratios. However, with vSphere this is configurable. If you have just one virtual machine with a really high reservation, you can set the following advanced settings to lower the slot size used for these calculations: “das.slotCpuInMHz” or “das.slotMemInMB”.

To avoid not being able to power on the virtual machine with the high reservation, that virtual machine will take up multiple slots. When you are low on resources, this could mean that you are not able to power on this high-reservation virtual machine, as resources may be fragmented throughout the cluster instead of being available on a single host. As of vSphere 4.1, HA will notify DRS that a power-on attempt was unsuccessful and a request will be made to defragment the resources to accommodate the remaining virtual machines that need to be powered on.
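The effect of capping the slot size can be expressed in one line; the sketch below (illustrative Python, names ours) shows how many slots a virtual machine with a large reservation would consume once das.slotMemInMB is set:

```python
from math import ceil

# Sketch of our understanding: when das.slotMemInMB caps the memory slot size,
# a VM whose reservation exceeds that cap simply consumes multiple slots.
def slots_consumed(vm_mem_reservation_mb, mem_slot_mb):
    return max(1, ceil(vm_mem_reservation_mb / mem_slot_mb))

# A 4096 MB reservation with a 1024 MB slot size consumes 4 slots, which are
# subtracted from the cluster-wide slot count.
print(slots_consumed(4096, 1024))  # 4
```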

    The following diagram depicts a scenario where a virtual machine spans multiple slots:


Figure 21: Virtual machine spanning multiple HA slots

Notice that because the memory slot size has been manually set to 1024MB, one of the virtual machines (grouped with dotted lines) spans multiple slots due to a 4GB memory reservation. As you might have noticed, none of the hosts has 4 slots left. Although in total there are enough slots available, they are fragmented and HA will not be able to power on this particular virtual machine directly, but will request DRS to defragment the resources to accommodate this virtual machine's resource requirements.

Admission Control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings. It will take the number of slots this virtual machine will consume into account by subtracting them from the total number of available slots, but it will not verify the number of available slots per host to ensure failover. As stated earlier, though, as of vSphere 4.1 HA will request DRS to defragment the resources. However, this is no guarantee of a successful power-on attempt or slot availability.


Basic design principle: Avoid using advanced settings to decrease the slot size, as it could lead to more downtime and adds an extra layer of complexity. If there is a large discrepancy in size and reservations are set, it might help to put similarly sized virtual machines into their own cluster.

Unbalanced Configurations and Impact on Slot Calculation

It is an industry best practice to create clusters with similar hardware configurations. However, many companies start out with a small VMware cluster when virtualization is introduced and plan on expanding when trust within the organization has been built.

When the time has come to expand, chances are fairly high that the same hardware configuration is no longer available. The question is: will you add the newly bought hosts to the same cluster or create a new cluster?

From a DRS perspective, large clusters are preferred as they increase the load-balancing options. However, there is a caveat for DRS as well, which is described in the DRS section of this book. For HA there is a big caveat, and when you think about it and understand the internal workings of HA, you probably already know what is coming up.

    Let’s first define the term “unbalanced cluster”.

    An unbalanced cluster would for instance be a cluster with 6 hosts of which one contains more

    memory than the other hosts in the cluster.

    Let’s try to clarify that with an example.

    Example: 

What would happen to the total number of slots in a cluster with the following specifications?

•  Six-host cluster
•  Five hosts have 16GB of available memory
•  One host has 32GB of available memory

The sixth host is a brand new host that has just been bought, and as prices of memory have dropped immensely, the decision was made to buy 32GB instead of 16GB.

The cluster contains a virtual machine that has 1 vCPU and 4GB of memory. A 1024MB memory reservation has been defined on this virtual machine. As explained earlier, a reservation will dictate the slot size, which in this case leads to a memory slot size of 1024MB plus memory overhead. For the sake of simplicity we will, however, calculate with 1024MB.


As Admission Control is enabled, a worst-case scenario is taken into account. With a 1024MB slot size, each 16GB host provides 16 slots and the 32GB host provides 32 slots. When a single host failure has been specified, the host with the largest number of slots will be taken out of the equation. In other words, for our cluster this would result in:

esx01 + esx02 + esx03 + esx04 + esx05 = 80 slots available

Although you have doubled the amount of memory in one of your hosts, you are still stuck with only 80 slots in total. As clearly demonstrated, there is absolutely no point in buying additional memory for a single host when your cluster is designed with Admission Control enabled and the “Host Failures Cluster Tolerates” Admission Control Policy has been selected.

In our example the memory slot size happened to be the most restrictive; the same principle applies when the CPU slot size is the most restrictive.

Basic design principle: When using Admission Control, balance your clusters and be conservative with reservations, as they lead to decreased consolidation ratios.

Now what would happen in the scenario above when the number of allowed host failures is set to 2?

In this case esx06 is taken out of the equation along with any one of the remaining hosts in the cluster. This results in 64 slots. This makes sense, doesn't it?
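For completeness, here is a small Python sketch (our own illustration, not a VMware formula) that reproduces the numbers in this example:

```python
# Sketch of our interpretation: with "Host Failures Cluster Tolerates", the N
# hosts providing the MOST slots are removed from the equation (worst case).
def cluster_slots(slots_per_host, host_failures=1):
    remaining = sorted(slots_per_host)[:-host_failures] if host_failures else slots_per_host
    return sum(remaining)

# The unbalanced example above: five 16-slot hosts and one 32-slot host.
hosts = [16, 16, 16, 16, 16, 32]
print(cluster_slots(hosts, host_failures=1))  # 80 (the 32GB host's extra memory is wasted)
print(cluster_slots(hosts, host_failures=2))  # 64
```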

Can you avoid large HA slot sizes due to reservations without resorting to advanced settings? That's the question we get almost daily. The answer used to be no if per-virtual machine reservations were required. HA uses reservations to calculate the slot size and, pre-vSphere, there was no way to tell HA to ignore them without using advanced settings. With vSphere, the new percentage method is an alternative.

    Percentage of Cluster Resources Reserved

With vSphere, VMware introduced the ability to specify a percentage in addition to the number of host failures and a designated failover host. The percentage avoids the slot size issue, as it does not use slots for Admission Control. So what does it use?

When you specify a percentage, that percentage of the total amount of available resources will stay reserved for HA purposes. First of all, HA will add up all available resources to see how much it has available in total (virtualization overhead will be subtracted). Then HA will calculate how many resources are currently reserved by adding up all reservations for both memory and CPU for powered-on virtual machines.

For those virtual machines that do not have a reservation larger than 256 MHz, a default of 256 MHz will be used for CPU, and a default of 0MB plus memory overhead will be used for memory. (The amount of overhead per configuration type can be found in the “Understanding Memory Overhead” section of the Resource Management Guide.)


In other words:

(total amount of available resources – total reserved virtual machine resources) / total amount of available resources
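As a worked illustration of this formula, the following Python sketch computes the current failover capacity for CPU and memory and checks it against the configured percentage; the function names and the example numbers are ours, not a VMware API.

```python
# Minimal sketch of the percentage-based admission check as we understand it.
def current_failover_capacity(total_resources, total_reservations):
    return (total_resources - total_reservations) / total_resources

def admission_allowed(total_mhz, reserved_mhz, total_mb, reserved_mb, configured_pct):
    cpu_capacity = current_failover_capacity(total_mhz, reserved_mhz)
    mem_capacity = current_failover_capacity(total_mb, reserved_mb)
    # A power-on is blocked if either capacity would drop below the configured percentage.
    return cpu_capacity >= configured_pct and mem_capacity >= configured_pct

# Example: 24 GHz / 96 GB cluster, 6 GHz and 64 GB currently reserved, 25% configured.
print(admission_allowed(24000, 6000, 98304, 65536, 0.25))  # CPU 75%, memory ~33% -> True
```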


If you have an unbalanced cluster (hosts with different amounts of CPU or memory resources), your percentage should be equal to, or preferably larger than, the percentage of resources provided by the largest host. This way you ensure that all virtual machines residing on this host can be restarted in case of a host failure.
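A quick back-of-the-envelope check, using hypothetical host sizes, shows how to derive that minimum percentage:

```python
# Hypothetical unbalanced cluster: three 96GB hosts and one 192GB host.
host_memory_gb = [96, 96, 96, 192]
largest_share = max(host_memory_gb) / sum(host_memory_gb)
# The largest host provides 40% of cluster memory, so the configured
# percentage should be at least 40% to cover its failure.
print(f"Reserve at least {largest_share:.0%}")
```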

As explained earlier, this Admission Control Policy does not use slots; as such, resources might be fragmented throughout the cluster. Although as of vSphere 4.1 DRS is notified to rebalance the cluster, if needed, to accommodate these virtual machines' resource requirements, a guarantee cannot be given. We recommend ensuring you have at least one host with enough available capacity to boot the largest virtual machine (CPU/memory reservation). Also make sure you select the highest restart priority for this virtual machine (depending on the SLA, of course) to ensure it will be able to boot.

The following diagram will make it more obvious. You have 5 hosts, each with roughly 80% memory usage, and you have configured HA to reserve 20% of resources. A host fails and all virtual machines will need to fail over. One of those virtual machines has a 4GB memory reservation. As you can imagine, the first power-on attempt for this particular virtual machine will fail due to the fact that none of the hosts has enough memory available to guarantee it.


    Figure 25: Available resources

Basic design principle: Although vSphere 4.1 will utilize DRS to try to accommodate the resource requirements of this virtual machine, a guarantee cannot be given. Do the math: verify that any single host has enough resources to power on your largest virtual machine. Also take restart priority into account for this/these virtual machine(s).

    Failover Host

The third option one could choose is a designated failover host. This is commonly referred to as a hot standby. There is actually not much to tell about this mechanism, as it is “what you see is what you get.” When you designate a host as a failover host, it will not participate in DRS and you will not be able to power on virtual machines on this host! It is almost as if it is in maintenance mode, and it will only be used in case a failover needs to occur.


    Chapter 7

    Impact of Admission Control Policy

As with any decision when architecting your environment, there is an impact. This especially goes for the Admission Control Policy. The first decision that will need to be made is whether Admission Control is enabled or not. We recommend enabling Admission Control, but carefully select the policy and ensure it fits your or your customer's needs.

    Basic design principle:

    Admission Control guarantees enough capacity is available for virtual machine failover. As

    such we recommend enabling it.

We have explained all the mechanisms that are used by each of the policies in Chapter 6. As this is one of the most crucial decisions that needs to be made, we have summarized all the pros and cons for each of the three policies below.

    Host Failures Cluster Tolerates

    This option is historically speaking the most used for Admission Control. Most environments are

    designed with an N+1 redundancy and N+2 is also not uncommon. This Admission Control Policy

    uses “slots” to ensure enough capacity is reserved for failover, which is a fairly complex

    mechanism. Slots are based on VM-level Reservations.

Pros:
•  Fully automated. (When a host is added to a cluster, HA re-calculates how many slots are available.)
•  Ensures failover by calculating slot sizes.

Cons:
•  Can be very conservative and inflexible when reservations are used, as the largest reservation dictates slot sizes.
•  Unbalanced clusters lead to wastage of resources.
•  Complexity for the administrator from a calculation perspective.

Percentage of Cluster Resources Reserved

Percentage-based Admission Control is the latest addition to the HA Admission Control Policies. It is based on per-VM reservation calculations instead of slots.

    Pros:


•  Accurate, as it considers the actual reservation per virtual machine.
•  Cluster dynamically adjusts when resources are added.

Cons:
•  Manual calculations are needed when adding additional hosts to a cluster and the number of host failures needs to remain unchanged.
•  Unbalanced clusters can be a problem when the chosen percentage is too low and resources are fragmented, which means failover of a virtual machine can't be guaranteed, as the reservation of this virtual machine might not be available as resources on a single host.

    Specify a Failover Host

With the Specify a Failover Host Admission Control Policy, when a host fails, HA will attempt to restart all virtual machines on the designated failover host. The designated failover host is essentially a “hot standby.” In other words, DRS will not migrate VMs to this host when resources are scarce or the cluster is imbalanced.

Pros:
•  What you see is what you get.
•  No fragmented resources.

Cons:
•  What you see is what you get.
•  Maximum of one failover host. (N+2 redundancy is impossible.)
•  Dedicated failover host not utilized during normal operations.

    Recommendations

We have been asked many times for our recommendation on Admission Control and it is difficult to answer, as each policy has its pros and cons. However, we generally recommend a percentage-based Admission Control Policy. It is the most flexible policy, as it uses the actual reservation per virtual machine instead of taking a worst-case scenario approach like the number of host failures does. However, the number of host failures policy guarantees the failover level under all circumstances. Percentage-based is less restrictive, but offers lower guarantees that in all scenarios HA will be able to restart all virtual machines. With the added level of integration between HA and DRS, we believe a percentage-based Admission Control Policy will fit most environments.


Basic design principle: Do the math, and take customer requirements into account. We recommend using a percentage-based Admission Control Policy, as it is the most flexible policy.


    Chapter 8

    VM Monitoring

VM Monitoring, or VM-level HA, is an often overlooked but really powerful feature of HA. The reason for this is most likely that it is disabled by default and relatively new compared to HA. We have tried to gather all the info we could around VM Monitoring, but it is a pretty straightforward feature that actually does what you expect it to do.

    With vSphere 4.1 VMware also introduced VM and Application Monitoring. Application Monitoring

    is a brand new feature that Application Developers can leverage to increase resiliency as shown in

    the screenshot below.

    Figure 26: VM and Application Monitoring 

As of writing there was little information around Application Monitoring besides the fact that the Guest SDK is used by application developers or partners, like for instance Symantec, to develop solutions against the SDK. In the case of Symantec, a simplified version of Veritas Cluster Server (VCS) is used to enable application availability monitoring, including of course responding to issues. Note that it is not a multi-node clustering solution like VCS itself but a single-node solution. Symantec ApplicationHA, as it is called, is triggered to get the application up and running again by restarting it. Symantec's ApplicationHA is aware of dependencies and knows in which order services should be started or stopped. If, however, for whatever reason this fails for an "X" number of times (a configurable option within ApplicationHA), HA will be asked to take action. This action will be a restart of the virtual machine.

Although Application Monitoring is relatively new and there are only a few partners currently exploring the capabilities, it does add a whole new level of resiliency in our opinion. We have tested ApplicationHA by Symantec and personally feel it is the missing link. It enables you as a System Admin to integrate your virtualization layer with your application layer. It ensures that protected services are restarted in the correct order and it avoids the common pitfalls associated with restarts and maintenance.


    Why Do You Need VM/Application Monitoring?

VM and Application Monitoring acts at a different level than HA. VM/App Monitoring responds to a single virtual machine or application failure, as opposed to HA, which responds to a host failure. An example of a single virtual machine failure would, for instance, be the infamous “blue screen of death”.

    How Does VM/App Monitoring Work?

VM Monitoring restarts individual virtual machines when needed. VM/App Monitoring uses a concept similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received for a specific amount of time, the virtual machine will be rebooted. The heartbeats are communicated directly to VPXA by VMware Tools; these heartbeats are not sent over a network.
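Conceptually, the check boils down to a simple timeout; the sketch below is only our illustration of that idea (the 30-second interval is a placeholder, as the real value depends on the sensitivity level configured for the cluster):

```python
import time

# Conceptual sketch only: a heartbeat-timeout check in the spirit of VM Monitoring.
FAILURE_INTERVAL_SECONDS = 30  # hypothetical value; depends on configured sensitivity

def needs_reset(last_heartbeat_time, now=None):
    """Return True if no VMware Tools heartbeat was seen within the interval."""
    now = now if now is not None else time.time()
    return (now - last_heartbeat_time) > FAILURE_INTERVAL_SECONDS

# Example: last heartbeat 45 seconds ago, so the VM would be reset.
print(needs_reset(time.time() - 45))  # True
```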

    Figure 27: VM monitoring sensitivity

When enabling VM/App Monitoring, the level of sensitivity can be configured. The default setting should fit most situations. Low sensitivity basically means that the number of allowed “missed” heartbeats is higher, and as such the chances of running into a false positive are lower. However, if a failure occurs and the sensitivity level is set to low, the experienced downtime will be higher. When quick action is required in case of a possible failure, “high sensitivity” can be selected, and as expected this is the opposite of “low sensitivity”.

    Table 1: VM monitoring sensitivity


    Screenshots

The cool thing about VM Monitoring is the fact that it takes screenshots of the VM console. They are taken right before a virtual machine is reset by VM Monitoring. This was added as of vCenter 4.0. It is a very useful feature when a virtual machine “freezes” every once in a while for no apparent reason. This screenshot can be used to debug the virtual machine operating system, if and when needed, and is stored in the virtual machine's working directory.

Basic design principle: VM Monitoring can substantially increase availability. It is part of the HA stack and we heavily recommend using it!


    Flattened Shares

Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA fails over a virtual machine, it will power on the virtual machine in the Root Resource Pool. However, the virtual machine's shares were scaled for its appropriate place in the resource pool hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too many or too few resources relative to its entitlement.

A scenario where this can occur would be the following:

VM1 has 1000 shares and Resource Pool A has 2000 shares. However, Resource Pool A has two VMs and both will have 50% of those 2000 shares. The following diagram depicts this scenario:

    Figure 28: Flatten shares starting point

When the host fails, both VM2 and VM3 will end up on the same level as VM1. However, as a custom shares value of 10,000 was specified on both VM2 and VM3, they will completely blow away VM1 in times of contention. This is depicted in the following diagram:


    Figure 29: Flatten shares host failure

This situation would persist until the next invocation of DRS re-parents the virtual machine to its original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual machine's shares and limits before failover. This flattening process ensures that the virtual machine will get the resources it would have received if it had failed over to the correct Resource Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed under the Root Resource Pool with a shares value of 1000.

Figure 30: Flatten shares after host failure before DRS invocation

Of course, when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will again receive the number of shares originally assigned to them.
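The following Python sketch (our own illustration of the example above, not VMware's actual algorithm) contrasts the pre-4.1 behavior with the flattened behavior:

```python
# Our own illustration of the example above; not VMware's actual algorithm.
def effective_root_shares(vm_shares, pool_shares, vms_in_pool, flatten):
    """Share value the failed-over VM competes with at the root level."""
    if flatten:
        # vSphere 4.1 behavior: the VM is given the share value it effectively
        # had inside its pool (here both VMs split the pool's shares evenly).
        return pool_shares // vms_in_pool
    # Pre-4.1 behavior: the raw custom share value lands at the root as-is.
    return vm_shares

# VM2/VM3 carry 10,000 custom shares inside the 2,000-share Resource Pool A.
print(effective_root_shares(10000, 2000, 2, flatten=False))  # 10000, dwarfing VM1's 1000
print(effective_root_shares(10000, 2000, 2, flatten=True))   # 1000, as in Figure 30
```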


    Chapter 10

    Summarizing

The integration of HA with DRS has been vastly improved, and so has HA in general. We hope everyone sees the benefits of these improvements and of HA and VM and Application Monitoring in general. We have tried to simplify some of the concepts to make them easier to understand; still, we acknowledge that some concepts are difficult to grasp. We hope, though, that after reading this section of the book everyone is confident enough to make the changes to HA needed to increase the resiliency, and essentially the uptime, of your environment, because that is what it is all about.

If there are any questions, please do not hesitate to reach out to either of the authors.


    Part 2

    VMware Distributed Resource Scheduler


    Chapter 11

    What is VMware DRS?

    VMware Distributed Resource Scheduler (DRS) is an infrastructure service run by VMware vCenter

    Server (vCenter). DRS aggregates ESX host resources into clusters and automatically distributes

    these resources to the virtual machines.

    DRS monitors resource usage and continuously optimizes the virtual machine resource distribution

    across ESX hosts.

    DRS computes the resource entitlement for each virtual machine based on static resource allocation

    settings and dynamic settings such as active usage and level of contention.

DRS attempts to satisfy the virtual machine resource entitlement with the resources available in the cluster by leveraging vMotion. vMotion is used either to migrate the virtual machines to alternative ESX hosts with more available resources or to migrate other virtual machines away to free up resources.

    Because DRS is an automated solution and easy to configure, we recommend enabling DRS to

    achieve higher consolidation ratios at low costs.

    A DRS-enabled cluster is often referred to as a DRS cluster. In vSphere 4.1, a DRS cluster can

    manage up to 32 hosts and 3000 VMs.

    Cluster Level Resource Management

Clusters group the resources of the various ESX hosts together and treat them as a pool of resources; DRS presents the aggregated resources as one big host to the virtual machines. Pooling resources allows DRS to create resource pools spanning all hosts in the cluster and to apply cluster-level resource allocation policies. Probably unnecessary to point out, but a virtual machine cannot span hosts even when resources are pooled by using DRS. In addition to resource pools and resource allocation policies, DRS offers the following resource management capabilities.

Initial placement – When a virtual machine is powered on in the cluster, DRS places the virtual machine on an appropriate host or generates a recommendation, depending on the automation level.