VMware vSphere 4.1 HA and DRS Technical Deepdive



    VMware vSphere 4.1

    HA and DRS Technical Deepdive


    VMware vSphere 4.1, HA and DRS Technical Deepdive

    Copyright © 2010 by Duncan Epping and Frank Denneman.

    All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or

    transmitted by any means, electronic, mechanical, or otherwise, without written permission from

    the publisher. No patent liability is assumed with respect to the use of the information contained

    herein. Although every precaution has been taken in the preparation of this book, the publisher and

    authors assume no responsibility for errors or omissions. Neither is any liability assumed for

    damages resulting from the use of the information contained herein.

    International Standard Book Number (ISBN): 9781456301446

    All terms mentioned in this book that are known to be trademarks or service marks have been

    appropriately capitalized.

    Use of a term in this book should not be regarded as affecting the validity of any trademark or

    service mark.

    Version: 1.1


    About the Authors

    Duncan Epping  is a Principal Architect working for VMware as part of the Technical Marketing

    department. Duncan primarily focuses on vStorage initiatives and ESXi. He specializes in

    vSphere, vStorage, VMware HA and Architecture. Duncan is a VMware Certified Professional and

    among the first VMware Certified Design Experts (VCDX 007). Duncan is the owner of Yellow-Bricks.com,

    one of the leading VMware/virtualization blogs worldwide (recently voted number 1 virtualization

    blog for the 4th consecutive time on vsphere-land.com) and lead-author of

    the "vSphere Quick Start Guide" and "Foundation for Cloud Computing with VMware vSphere 4"

    which has recently been published by Usenix/Sage. (#21 in the Short Topics Series). He can be

    followed on twitter at http://twitter.com/DuncanYB.

    Frank Denneman  is a Consulting Architect working for VMware as part of the Professional

    Services Organization. Frank works primarily with large Enterprise customers and Service

    Providers. He is focused on designing large vSphere Infrastructures and specializes in Resource

    Management, DRS in general and storage. Frank is a VMware Certified Professional and among the

    first VMware Certified Design Experts (VCDX 029). Frank is the owner of FrankDenneman.nl which

    has recently been voted number 6 worldwide on vsphere-land.com. He can be followed on twitter

    at http://twitter.com/FrankDenneman.


    Table of Contents

     About the Authors

     Acknowledgements

    Foreword

    Introduction to VMware High Availability

    How Does High Availability Work?

    Pre-requisites

    Firewall Requirements

    Configuring VMware High Availability

    Components of High Availability

    VPXA

    VMAP Plug-In

     AAM

    Nodes

    Promoting Nodes

    Failover Coordinator

    Preferred Primary

    High Availability Constructs

    Isolation Response

    Split-Brain

    Isolation Detection

    Selecting an Additional Isolation Address

    Failure Detection Time

     Adding Resiliency to HA (Network Redundancy)

    Single Service Console with vmnics in Active/Standby Configuration

    Secondary Management Network


    Operation and Tasks of DRS

    Load Balance Calculation

    Events and Statistics

    Migration and Info Requests

    vCenter and Cluster sizing

    DRS Cluster Settings

     Automation Level

    Initial Placement

    Impact of Automation Levels on Procedures

    Resource Management

    Two-Layer Scheduler Architecture

    Resource Entitlement

    Resource Entitlement Calculation

    Calculating DRS Recommendations

    When is DRS Invoked?

    Defragmenting cluster during Host failover

    Recommendation Calculation

    Constraints Correction

    Imbalance Calculation

    Impact of Migration Threshold on Selection Procedure

    Selection of Virtual Machine Candidate

    Cost-Benefit and Risk Analysis Criteria

    The Biggest Bang for the Buck

    Calculating the Migration Recommendation Priority Level

    Influence DRS Recommendations

    Migration Threshold Levels


    Rules

    VM-VM Affinity Rules

    VM-Host Affinity Rules

    Impact of Rules on Organization

    Virtual Machine Automation Level

    Impact of VM Automation Level on DRS Load Balancing Calculation

    Resource Pools and Controls

    Root Resource Pool

    Resource Pools

    Resource pools and simultaneous vMotions

    Under Committed versus Over Committed

    Resource Allocation Settings

    Shares

    Reservation

    VM Level Scheduling: CPU vs Memory

    Impact of Reservations on VMware HA Slot Sizes.

    Behavior of Resource Pool Level Memory Reservations

    Setting a VM Level Reservation inside a Resource Pool

    VMkernel CPU reservation for vMotion

    Reservations Are Not Limits.

    Memory Overhead Reservation

    Expandable Reservation

    Limits

    CPU Resource Scheduling

    Memory Scheduler

    Distributed Power Management


    Enable DPM

    Templates

    DPM Threshold and the Recommendation Rankings

    Evaluating Resource Utilization

    Virtual Machine Demand and ESX Host Capacity Calculation

    Evaluating Power-On and Power-Off Recommendations

    Resource LowScore and HighScore

    Host Power-On Recommendations

    Host Power-Off Recommendations

    DPM Power-Off Cost/Benefit Analysis

    Integration with DRS and High Availability

    Distributed Resource Scheduler

    High Availability

    DPM awareness of High Availability Primary Nodes

    DPM Standby Mode

    DPM WOL Magic Packet

    Baseboard Management Controller

    Protocol Selection Order

    DPM and Host Failure Worst Case Scenario

    DRS, DPM and VMware Fault Tolerance

    DPM Scheduled Tasks

    Summarizing

     Appendix A – Basic Design Principles

    VMware High Availability

    VMware Distributed Resource Scheduler

     Appendix B – HA Advanced Settings


    Acknowledgements

    The authors of this book work for VMware. The opinions expressed here are the authors’ personal

    opinions. Content published was not read or approved in advance by VMware and does not

    necessarily reflect the views and opinions of VMware. This is the authors’ book, not a VMware book.

    First of all we would like to thank our VMware management team (Steve Beck, Director; Rob

    Jenkins, Director) for supporting us on this and other projects.

    A special thanks goes out to our Technical Reviewers: fellow VCDX Panel Member Craig Risinger

    (VMware PSO), Marc Sevigny (VMware HA Engineering), Anne Holler (VMware DRS Engineering)

    and Bouke Groenescheij (Jume.nl) for their very valuable feedback and for keeping us honest.

    A very special thanks to our families and friends for supporting this project. Without your support

    we could not have done this.

    We would like to dedicate this book to the VMware Community. We highly appreciate all the effort

    everyone is putting in to take VMware, Virtualization and Cloud to the next level. This is our gift to

    you.

    Duncan Epping and Frank Denneman


    Foreword

    Since its inception, server virtualization has forever changed how we build and manage the

    traditional x86 datacenter. In its early days of providing an enterprise-ready hypervisor, VMware

    focused their initial virtualization efforts to meet the need for server consolidation. Increased

    optimization of low-utilized systems and lowering datacenter costs of cooling, electricity, and floor

    space requirements was a surefire recipe for VMware’s early success. Shortly after introducing

    virtualization solutions, customers started to see the significant advantages introduced by the

    increased portability and recoverability that were all of a sudden available.

    It’s this increased portability and recoverability that significantly drove VMware’s adoption during

    its highest growth period. Recovery capabilities and options that were once reserved for the most

    critical of workloads within the world’s largest organizations became broadly available to the

    masses. Replication, High-Availability, and Fault Tolerance were once synonymous with "Expensive

    Enterprise Solutions," but are now available to even the smallest of companies. Data protection

    enhancements, when combined with intelligent resource management, placed VMware squarely at the

    top of the market leadership board. VMware's virtualization platform can

    provide near instant recovery time with increasingly more recent recovery points in a properly

    designed environment.

    Now, if you’ve read this far, you likely understand the significant benefits that virtualization can

    provide, and are probably well on your way to building out your virtual infrastructure and strategy.

    The capabilities provided by VMware are not ultimately what dictate the success or failure of a

    virtualization project, especially as increasingly more critical applications are introduced and

    require greater availability and recoverability service levels. It takes a well-designed virtual

    infrastructure and a full understanding of how the business requirements of the organization align

    to the capabilities of the platform.

    This book is going to arm you with the information necessary to understand the in-depth details of

    what VMware can provide you when it comes to improving the availability of your systems. This

    will help you better prepare for, and align to, the requirements of your business as well as set the

    proper expectations with the key stakeholders within the IT organization. Duncan and Frank have

    poured their extensive field experience into this book to enable you to drive broader virtualization

    adoption across more complex and critical applications. This book will enable you to make the

    most educated decisions as you attempt to achieve the next level of maturity within your virtual

    environment.

     Scott Herold

    Lead Architect, Virtualization Business, Quest Software


    Part 1

    VMware High Availability


     Chapter 1

    Introduction to VMware High Availability

    VMware High Availability (HA) provides a simple and cost effective clustering solution to increase

    uptime for virtual machines. HA uses a heartbeat mechanism to detect a host or virtual machine

    failure. In the event of a host failure, affected virtual machines are automatically restarted on other

    production hosts within the cluster with spare capacity. In the case of a failure caused by the Guest

    OS, HA restarts the failed virtual machine on the same host. This feature is called VM Monitoring,

    but sometimes also referred to as VM HA.

    Figure 1: High Availability in action

    Unlike many other clustering solutions HA is literally configured and enabled with 4 clicks.

    However, HA is not, and let's repeat it, is not a 1:1 replacement for solutions like Microsoft Clustering Services (MSCS). MSCS and, for instance, Linux Clustering are stateful clustering solutions

    where the state of the service or application is preserved when one of the nodes fails. The service is

    transitioned to one of the other nodes and it should resume with limited downtime or loss of data.

    With HA the virtual machine is literally restarted and this incurs downtime. HA is a form of stateless

    clustering.


    One might ask why would you want to use HA when a virtual machine is restarted and service is

    temporarily lost. The answer is simple; not all virtual machines (or services) need 99.999% uptime.

    For many services the type of availability HA provides is more than sufficient. Stateful clustering

    does not guarantee 100% uptime, can be complex and needs special skills and training. One example

    is managing patches and updates/upgrades in a MSCS environment; this could even cause more

    downtime if not operated correctly. Just as with MSCS, where a service or application is restarted during a failover, the same happens with HA and the affected virtual machines.

    Besides that, HA reduces complexity, costs (associated with downtime and MSCS), resource

    overhead and unplanned downtime for minimal additional costs. It is important to note that HA,

    contrary to MSCS, does not require any changes to the guest as HA is provided on the hypervisor

    level. Also, VM Monitoring does not require any additional software or OS modifications except for

    VMware Tools, which should be installed anyway.

    We can’t think of a single reason not to use it.

    How Does High Availability Work?

    Before we deep dive into the main constructs of HA and describe all the choices one has when

    configuring HA we will first briefly touch on the requirements. Now, the question of course is how

    does HA work? As briefly touched on in the introduction, HA triggers a response based on the loss

    of heartbeats. However you might be more interested in knowing which components VMware uses

    and what is required in order for HA to function correctly. Maybe if this is the first time you are

    exposed to HA you also want to know how to configure it.

    Pre-requisites

    For those who want to configure HA, the following items are the pre-requisites in order for HA to

    function correctly:

    •  Minimum of two VMware ESX or ESXi hosts

    •  Minimum of 2300MB memory to install the HA Agent

    •  VMware vCenter Server

    •  Redundant Service Console or Management Network (not a requirement, but highly

    recommended)

    •  Shared Storage for VMs – NFS, SAN, iSCSI

    •  Pingable gateway or other reliable address for testing isolation

    We recommend against using a mixed cluster. With that we mean a single cluster containing both ESX and ESXi hosts. Differences in build numbers have led to serious issues in the past when using

    VMware FT. (KB article: 1013637)


    Firewall Requirements

    The following list contains the ports that are used by HA for communication. If your environment

    contains firewalls, ensure these ports are opened for HA to function correctly.

    High Availability port settings:

    •  8042 – UDP - Used for host-to-host "backbone" (message bus) communication.

    •  8042 – TCP - Used by AAM agents to communicate with a remote backbone.

    •  8043 – TCP - Used to locate a backbone at bootstrap time.

    •  8044 – UDP - Used by HA to send heartbeats.

    •  2050 – 2250 - Used by AAM agent process to communicate with the backbone.

    Configuring VMware High Availability

    As described earlier, HA can be configured with the default settings within 4 clicks. The following

    steps however will show you how to create a cluster and how to enable HA including VM Monitoring. Each of the settings and the mechanisms associated with these will be described more

    in-depth in the following chapters.

    1. Select the Hosts & Clusters view.

    2. Right-click the Datacenter in the Inventory tree and click New Cluster.

    3. Give the new cluster an appropriate name. We recommend at a minimum including the

    location of the cluster and a sequence number, e.g. ams-hadrs-001.

    4. In the Cluster Features section of the page, select Turn On VMware HA and click Next.

    5. Ensure Host Monitoring Status and Admission Control are enabled and click Next

    6. Leave the Cluster Default Settings as they are and click Next

    7. Enable VM Monitoring Status by selecting “VM Monitoring Only” and click Next

    8. Leave VMware EVC set to the default and click Next

    9. Leave the Swapfile Policy set to default and click Next

    10. Click Finish to complete the creation of the cluster


    When the HA cluster has been created ESX hosts can be added to the cluster simply by dragging

    them into the cluster. When an ESX host is added to the cluster the HA agent will be loaded.
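    For those who prefer to script this, a minimal PowerCLI sketch is shown below. The vCenter name, datacenter name and host name are assumptions, and VM Monitoring would still be enabled afterwards through the cluster settings:

    Connect-VIServer -Server "vcenter01"                      # hypothetical vCenter Server name
    $dc = Get-Datacenter -Name "Amsterdam"                    # hypothetical datacenter name
    # Create the cluster with HA and Admission Control enabled
    New-Cluster -Name "ams-hadrs-001" -Location $dc -HAEnabled -HAAdmissionControlEnabled
    # Add a host; the HA agent is pushed to the host automatically when it joins the cluster
    Add-VMHost -Name "esx01.local" -Location (Get-Cluster -Name "ams-hadrs-001") -User "root" -Password "password" -Force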


    Chapter 2

    Components of High Availability

    Now that we know what the pre-requisites are and how to configure HA, the next step is to

    describe which components form HA. This is still a "high level" overview however. There is more

    under the cover that we will explain in following chapters. The following diagram depicts a two

    host cluster and shows the key HA components.

    Figure 3: Components of High Availability

    As you can clearly see there are three major components that form the foundation for HA:

    •  VPXA

    •  VMAP

    •  AAM


    VPXA

    The first and probably the most important is VPXA. This is not an HA agent, but it is the vCenter

    agent and it allows your vCenter Server to interact with your ESX host. It also takes care of

    stopping and starting virtual machines if and when needed.

    HA is loosely coupled with vCenter Server. Although HA is configured by vCenter Server, it does not need vCenter to manage an HA failover. It is comforting to know that in case of a host failure

    containing the virtualized vCenter server, HA takes care of the failure and restarts the vCenter

    server on another host, including all other configured virtual machines from that failed host.

    When a virtual vCenter is used we do however recommend setting the correct restart priorities

    within HA to avoid any dependency problems.

    It’s highly recommended to register ESX hosts with their FQDN in vCenter. VMware vCenter

    supplies the name resolution information that HA needs to function. HA stores this locally in a file

    called “FT_HOSTS”. In other words, from an HA perspective there is no need to create local host files

    and it is our recommendation to avoid using local host files. They are too static and will make troubleshooting more difficult.

    To stress this point even more, as of vSphere 4.0 Update 1 host files (i.e. /etc/hosts) are corrected

    automatically by HA. In other words if you have made a typo or for example forgot to add the short

    name HA will correct the host file to make sure nothing interferes with HA.

    Basic design principle:

    Avoid using static host files as it leads to inconsistency, which makes troubleshooting

    difficult.

    VMAP Plug-In

    Next on the list is VMAP. Where vpxa is the process for vCenter to communicate with the host,

    VMAP is the translator for the HA agent (AAM) and vpxa. When vpxa wants to communicate with

    the AAM agent VMAP will translate this into understandable instructions for the AAM agent. A good

    example of what VMAP would translate is the state of a virtual machine: is it powered on or

    powered off? Pre-vSphere 4.0 VMAP was a separate process instead of a plugin linked into vpxa. VMAP is loaded into vpxa at runtime when a host is added to an HA cluster.

    The vpxa communicates with VMAP and VMAP communicates with AAM. When AAM has received and

    flushed the info it will tell VMAP, and VMAP in turn will acknowledge to vpxa that the info has

    been processed. The VMAP plug-in acts as a proxy for communication to AAM.


    One thing you are probably wondering is why do we need VMAP in the first place? Wouldn’t this be

    something vpxa or AAM should be able to do? The answer is yes, either vpxa or AAM should be able

    to provide this functionality. However, when HA was first introduced it was architecturally more

    prudent to create a separate process for dealing with this which has now been turned into a plugin.

    AAM

    That brings us to our next and final component, the AAM agent. The AAM agent is the core of HA

    and actually stands for “Automated Availability Manager”. As stated above, AAM was originally

    developed by Legato. It is responsible for many tasks such as communicating host resource

    information, virtual machine states and HA properties to other hosts in the cluster. AAM stores all

    this info in a database and ensures consistency by replicating this database amongst all primary

    nodes. (Primary nodes are discussed in more detail in chapter 4.) It is often mentioned that HA uses

    an In-Memory database only, this is not the case! The data is stored in a database on local storage or

    in FLASH memory on diskless ESXi hosts.

    One of the other tasks AAM is responsible for is the mechanism with which HA detects

    isolations/failures: heartbeats.

    All this makes the AAM agent one of the most important processes on an ESX host, when HA is

    enabled of course, but we are assuming for now it is. The engineers recognized the importance and

    added an extra level of resiliency to HA. The agent is multi-process and each process acts as a

    watchdog for the other. If one of the processes dies the watchdog functionality will pick up on this

    and restart the process to ensure HA functionality remains without anyone ever noticing it failed. It

    is also resilient to network interruptions and component failures. Inter-host communication

    automatically uses another communication path (if the host is configured with redundant

    management networks) in the case of a network failure. The underlying message framework

    guarantees exactly-once message delivery.

    Chapter 3

    Nodes

    An HA cluster consists of hosts, or nodes as HA calls them. There are two types of nodes. A node is

    either a primary or a secondary node. This concept was introduced to enable scaling up to 32 hosts

    in a cluster and each type of node has a different role. Primary nodes hold cluster settings and all

    “node states”. The data a primary node holds is stored in a persistent database and synchronized

    between primaries as depicted in the diagram above.

    An example of node state data would be host resource usage. In case vCenter is not available the

    primary nodes will always have a very recent calculation of the resource utilization and can take

    this into account when a failover needs to occur. Secondary nodes send their state info to primary

    nodes. This will be sent when changes occur, generally within seconds after a change. As of vSphere

    4.1 by default every host will send an update of its status every 10 seconds. Pre-vSphere 4.1 this

    used to be every second.

    This interval can be controlled by an advanced setting called das.sensorPollingFreq. As stated

    before, the default value of this advanced setting is 10. Although a smaller value will lead to a more

    up-to-date view of the status of the cluster overall, it will also increase the amount of traffic between

    nodes. It is not recommended to decrease this value as it might lead to decreased scalability due to the

    overhead of these status updates. The maximum value of the advanced setting is 30.
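    As an illustration of how such a cluster-level das.* advanced option can be applied without the vSphere Client, the following PowerCLI sketch uses the New-AdvancedSetting cmdlet from newer PowerCLI releases; the cluster name is an assumption carried over from the earlier example:

    # Apply an HA (das.*) advanced option to the cluster; here the node status update interval
    $cluster = Get-Cluster -Name "ams-hadrs-001"
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.sensorPollingFreq" -Value 10 -Confirm:$false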

    As discussed earlier, HA uses a heartbeat mechanism to detect possible outages or network

    isolation. The heartbeat mechanism is used to detect a failed or isolated node. However, a node will

    recognize it is isolated by the fact that it isn’t receiving heartbeats from any of the other nodes.

    Nodes send a heartbeat to each other. Primary nodes send heartbeats to all primary nodes and all

    secondary nodes. Secondary nodes send their heartbeats to all primary nodes, but not to

    secondaries. Nodes send out these heartbeats every second by default. However, this is a

    configurable value through the use of the following cluster advanced setting:

    das.failuredetectioninterval. We do, however, not recommend changing this interval as it was carefully selected by VMware.

    The first 5 hosts that join the HA cluster are automatically selected as primary nodes. All other

    nodes are automatically selected as secondary nodes. When you do a reconfigure for HA, the

    primary nodes and secondary nodes are selected again; this is virtually random.

    Except for the first host that is added to the cluster, any host that joins the cluster must

    communicate with an existing primary node to complete its configuration. At least one primary host

    must be available for HA to operate correctly. If all primary hosts are unavailable, you will not be

    able to add or remove a host from your cluster.

    The vCenter client normally does not show which host is a primary node and which is a secondary

    node. As of vCenter 4.1 a new feature has been added which is called “Operational Status” and can

    be found on the HA section of the Cluster’s summary tab. It will give details around errors and will

    show the primary and secondary nodes. There is one gotcha however; it will only show which

    nodes are primary and secondary in case of an error.


    Figure 5: Cluster operational status

    This however can also be revealed from the Service Console or via PowerCLI. The following are two

    examples of how to list the primary nodes via the Service Console (ESX 4.0):

    Figure 6: List node command

    Another method of showing the primary nodes is:


    Figure 7: List nodes command

    With PowerCLI the primary nodes can be listed with the following lines of code:

    PowerCLI code:

    Get-Cluster | Get-HAPrimaryVMHost

    Now that you have seen that it is possible to list all nodes with the CLI, you probably

    wonder what else is possible… Let’s start with a warning - this is not supported! Currently the

    supported limit of primaries is 5. This is a soft limit however. It is possible to manually add a 6th

    primary but this is neither supported nor encouraged.

    Having more than 5 primaries in a cluster will significantly increase network and CPU overhead.

    There should be no reason to increase the number of primaries beyond 5. For the purpose of

    education we will demonstrate how to promote a secondary node to primary and vice versa.

    To promote a node:


    Promoting Nodes

    A common misunderstanding about HA with regards to primary and secondary nodes is the

    re-election process. When does a re-election, or promotion, occur?

    It is a common misconception that a promotion of a secondary occurs when a primary node fails.

    This is not the case. Let's stress that, this is not the case! The promotion of a secondary node to primary only occurs in one of the following scenarios:

    •  When a primary node is placed in "Maintenance Mode"

    •  When a primary node is disconnected from the cluster

    •  When a primary node is removed from the cluster

    •  When the user clicks “reconfigure for HA” on any ESX host

    This is particularly important for the operational aspect of a virtualized environment. When a host

    fails it is important to ensure its role is migrated to any of the other hosts in case it was an HA

    primary node. To simplify it: when a host fails, we recommend placing it in maintenance mode,

    disconnecting it or removing it from the cluster to avoid any risks!

    If all primary hosts fail simultaneously no HA initiated restart of the virtual machines can take

    place. HA needs at least one primary node to restart virtual machines. This is why you can configure

    HA to tolerate only up to 4 host failures when you have selected the “host failures” Admission

    Control Policy (Remember 5 primaries…). The number of primaries is definitely something to take

    into account when designing for uptime.

    Failover Coordinator

    As explained in the previous section, you will need at least one primary to restart virtual machines.

    The reason for this is that one of the primary nodes will hold the “failover coordinator” role. This

    role will be randomly assigned to a primary node; this role is also sometimes referred to as “active

    primary”. We will use “failover coordinator” for now.

    The failover coordinator coordinates the restart of virtual machines on the remaining primary and

    secondary hosts. The coordinator takes restart priorities into account when coordinating the restarts.

    Pre-vSphere 4.1, when multiple hosts failed at the same time, it would handle the restarts

    serially. In other words, restart the virtual machines of the first failed host (taking restart priorities

    into account) and then restart the virtual machines of the host that failed second (again taking

    restart priorities into account). As of vSphere 4.1 this mechanism has been significantly improved. In the

    case of multiple near-simultaneous host failures, all the host failures that occur within 15 seconds will have all their VMs aggregated and prioritized before the power-on operations occur.

    If the failover coordinator fails, one of the other primaries will take over. This node is again

    randomly selected from the pool of available primary nodes. As any other process within the HA

    stack, the failover coordinator process is carefully watched by the watchdog functionality of HA.


    Pre-vSphere 4.1 the failover coordinator would decide where a virtual machine would be restarted.

    Basically it would check which host had the highest percentage of unreserved and available

    memory and CPU and select it to restart that particular virtual machine. For the next virtual

    machine the same exercise would be done by HA, select the host with the highest percentage of

    unreserved memory and CPU and restart the virtual machine.

    HA does not coordinate with DRS when making the decision on where to place virtual machines. HA

    would rely on DRS. As soon as the virtual machines were restarted, DRS would kick in and

    redistribute the load if and when needed.

    As of vSphere 4.1 virtual machines will be evenly distributed across hosts to lighten the load on the

    hostd service and to get quicker power-on results. HA then relies on DRS to redistribute the load

    later if required. This improvement results in faster restarts of the virtual machines and less stress

    on the ESX hosts. DRS also re-parents the virtual machine when it is booted up as virtual machines

    are failed over into the root resource pool by default. This re-parent process however did already

    exist pre-vSphere 4.1.

    The failover coordinator can restart up to 32 VMs concurrently per host. The number of concurrent

    failovers can be controlled by an advanced setting called das.perHostConcurrentFailoversLimit. As

    stated the default value is 32. Setting a larger value will allow more VMs to be restarted

    concurrently and might reduce the overall VM recovery time, but the average latency to recover

    individual VMs might increase.

    In blade environments it is particularly important to factor the primary nodes and failover

    coordinator concept into your design. When designing a multi chassis environment the impact of a

    single chassis failure needs to be taken into account. When all primary nodes reside in a single

    chassis and the chassis fails, no virtual machines will be restarted as the failover coordinator is the

    only one who initiates the restart of your virtual machines. When it is unavailable, no restart will take place.

    It is a best practice to have the primaries distributed amongst the chassis so that, in case an entire chassis

    fails or a rack loses power, there is still a running primary to coordinate the failover. This can even

    be extended in very large environments by having no more than 2 hosts of a cluster in a chassis.

    The following diagram depicts the scenario where four 8-host clusters are spread across four

    chassis.


    Figure 10: Logical cluster layout on blade environment

    Basic design principle: In blade environments, divide hosts over all blade chassis and never exceed

    four hosts per chassis to avoid having all primary nodes in a single chassis.

    Preferred Primary

    With vSphere 4.1 a new advanced setting has been introduced. This setting is not even

    experimental; it is currently considered unsupported. We don't recommend anyone using it in a

    production environment; if you do want to play around with it, use your test environment.

    This new advanced setting is called das.preferredPrimaries. With this setting multiple hosts of a

    cluster can be manually designated as a preferred node during the primary node election process.

    The list of nodes can either be comma or space separated and both hostnames and IP addresses are

    allowed. Below you can find an example of what this would typically look like. The “=” sign has been

    used as a divider between the setting and the value.


    das.preferredPrimaries = hostname1,hostname2,hostname3

    or

    das.preferredPrimaries = 192.168.1.1 192.168.1.2 192.168.1.3

    As shown there is no need to specify 5 hosts; you can specify any number of hosts. If you specify 5

    hosts, or less, and all 5 hosts are available they will become the primary nodes in your cluster. If you

    specify more than 5 hosts, the first 5 hosts of your list will become primary.

    Again, please be warned that this is considered unsupported at the time of writing and please verify in

    the VMware Availability Guide or online in the knowledge base (kb.vmware.com) what the status is

    of the support on this feature before even thinking about implementing it.

    A workaround found by some pre-vSphere 4.1 was using the "promote/demote" option of HA's CLI

    as described earlier in this chapter. Although this solution could fairly easily be scripted, it is

    unsupported and, as opposed to "das.preferredPrimaries", a rather static solution.


    Chapter 4

    High Availability Constructs

    When configuring HA two major decisions will need to be made.

    •  Isolation Response

    •  Admission Control

    Both are important to how HA behaves. Both will also have an impact on availability. It is really

    important to understand these concepts. Both concepts have specific caveats. Without a good

    understanding of these it is very easy to increase downtime instead of decreasing downtime.

    Isolation Response

    One of the first decisions that will need to be made when HA is configured is the “isolation

    response”. The isolation response refers to the action that HA takes for its VMs when the host has

    lost its connection with the network. This does not necessarily mean that the whole network is

    down; it could just be this host's network ports or just the ports that are used by HA for the

    heartbeat. Even if your virtual machine has a network connection and only your “heartbeat

    network” is isolated the isolation response is triggered.

    Today there are three isolation responses, “Power off”, “Leave powered on” and “Shut down”. This

    answers the question what a host should do when it has detected it is isolated from the network. In

    any of the three cases, the remaining, non-isolated hosts will always try to restart the

    virtual machines no matter which of the following three options is chosen as the isolation response:

    •  Power off – When network isolation occurs all virtual machines are powered off. It is a hard

    stop, or to put it bluntly, the power cable of the VMs will be pulled out!

    •  Shut down – When network isolation occurs all virtual machines running on the host will be

    shut down using VMware Tools. If this is not successful within 5 minutes, a “power off” will

    be executed. This time out value can be adjusted by setting the advanced option

    das.isolationShutdownTimeout. If VMware Tools is not installed, a “power off” will be

    initiated immediately.

    •  Leave powered on – When network isolation occurs on the host, the state of the virtual

    machines remains unchanged.

    This setting can be changed on the cluster settings under virtual machine options.


    Figure 11: Cluster default setting
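    For reference, the cluster default can also be changed from PowerCLI. The sketch below is illustrative only and assumes the example cluster name; verify that your PowerCLI version exposes the -HAIsolationResponse parameter, which covers "Power off" and "Leave powered on" (the "Shut down" response is selected through the cluster settings dialog):

    # Set the cluster default isolation response; "PowerOff" or "DoNothing" (leave powered on)
    Set-Cluster -Cluster (Get-Cluster -Name "ams-hadrs-001") -HAIsolationResponse PowerOff -Confirm:$false
    # Optionally adjust the "Shut down" timeout (5 minutes by default) via the advanced setting mentioned above
    New-AdvancedSetting -Entity (Get-Cluster -Name "ams-hadrs-001") -Type ClusterHA -Name "das.isolationShutdownTimeout" -Value 600 -Confirm:$false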

    The default setting for the isolation response has changed multiple times over the last couple of

    years. Up to ESX 3.5 U2 / vCenter 2.5 U2 the default isolation response when creating a new cluster

    was “Power off”. This changed to “Leave powered on” as of ESX 3.5 U3 / vCenter 2.5 U3. However

    with vSphere 4.0 this has changed again. The default setting for newly created clusters, at the time

    of writing, is “Shut down” which might not be the desired response. When installing a new

    environment, you might want to change the default setting based on your customer's requirements

    or constraints.

    The question remains, which setting should you use? The obvious answer applies here; it depends.

    We prefer "Shut down" because we do not want to use a degraded host to run our virtual machines

    on and it will shut down your virtual machines in a clean manner. Many people however prefer to use

    “Leave powered on” because it eliminates the chances of having a false positive and the associated

    down time with a false positive. A false positive in this case is an isolated heartbeat network but a

    non-isolated virtual machine network and a non-isolated iSCSI / NFS network.

    That leaves the question of how the other HA nodes know whether the host is isolated or has failed.

    HA actually does not know the difference. The other HA nodes will try to restart the affected virtual

    machines in either case. When the host is unavailable, a restart attempt will take place no matter

    which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will not be able to restart the affected virtual machines. The reason for this is the fact that the host that

    is running the virtual machine has a lock on the VMDK and swap files. None of the hosts will be able

    to boot a virtual machine when the files are locked. For those who don’t know, ESX locks files to

    prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a

    host fails, this lock expires and a restart can occur.

    To reiterate, the remaining nodes will always try to restart the “failed” virtual machines. The

    possible lock on the VMDK files belonging to these virtual machines, in the case of an isolation

    event, prevents them from being started. This assumes that the isolated host can still reach the files,

    which might not be true if the files are accessed through the network on iSCSI, NFS, or FCoE based

    storage. HA however will repeatedly try starting the "failed" virtual machines when a restart is unsuccessful.

    The number of retries is configurable as of vCenter 2.5 U4 with the advanced option

    "das.maxvmrestartcount". The default value is 5. Pre-vCenter 2.5 U4 HA would keep retrying

    forever which could lead to serious problems as described in KB article 1009625 where multiple


    Split-Brain

    When creating your design, make sure you understand the isolation response setting. For instance, when using an iSCSI array or NFS based storage, choosing "Leave powered on" as your default

    isolation response might lead to a split-brain situation.

    A split-brain situation can occur when the VMDK file lock times out. This could happen when the

    iSCSI, FCoE or NFS network is also unavailable. In this case the virtual machine is being restarted on

    a different host while it is not being powered off on the original host because the selected isolation

    response is "Leave powered on". This could potentially leave vCenter in an inconsistent state, as

    two VMs with the same UUID would be reported as running on both hosts. This would cause a

    “ping-pong” effect where the VM would appear to live on ESX host 1 at one moment and on ESX

    host 2 soon after.

    VMware’s engineers have recognized this as a potential risk and developed a solution for this

    unwanted situation. (This is not well documented, but briefly explained by one of the engineers on the

    VMTN Community forums. http://communities.vmware.com/message/1488426#1488426.)

    In short: as of version 4.0 Update 2 ESX detects that the lock on the VMDK has been lost and issues a

    question asking whether the virtual machine should be powered off, and auto-answers the question with yes.

    However, you will only see this question if you directly connect to the ESX host. HA will generate an

    event for this auto-answer though, which is viewable within vCenter. Below you can find a

    screenshot of this question.

    Figure 13: Virtual machine message


    As stated above, as of ESX 4 update 2 the question will be auto-answered and the virtual machine

    will be powered off to recover from the split brain scenario.

    The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them

    powered on?

    As described above in earlier versions, "Leave powered on" could lead to a split-brain scenario. You

    would end up seeing multiple virtual machines ping-ponging between hosts as vCenter would not

    know where they resided, as they were active in memory on two hosts. As of ESX 4.0 Update 2, this is

    however not the case anymore and it should be safe to use “Leave powered on”.

    We recommend avoiding the chances of a split-brain scenario. Configure a secondary Service

    Console on the same vSwitch and network as the iSCSI or NFS VMkernel portgroup and, pre-vSphere

    4.0 Update 2, select either "Power off" or "Shut down" as the isolation response. By doing this you

    will be able to detect if there's an outage on the storage network. We will discuss the options you

    have for Service Console / Management Network redundancy more extensively later on.

    Basic design principle:  For network-based storage (iSCSI, NFS, FCoE) it is recommended

    (pre-vSphere 4.0 Update 2) to set the isolation response to "Shut down" or

    "Power off". It is also recommended to have a secondary Service Console (ESX) or

    Management Network (ESXi) running on the same vSwitch as the storage network to detect a

    storage outage and avoid false positives for isolation detection.

    Isolation Detection

    We have explained what the options are to respond to an isolation event. However we have not

    extensively discussed how isolation is detected. This is one of the key mechanisms of HA. Isolation

    detection is a mechanism that takes place on the host that is isolated. The remaining, non-isolated,

    hosts don't know if that host has failed completely or if it is isolated from the network; they only

    know it is unavailable.

    The mechanism is fairly straightforward though and works as earlier explained with heartbeats.

    When a node receives no heartbeats from any of the other nodes for 13 seconds (default setting)

    HA will ping the “isolation address”. Remember primary nodes send heartbeats to primaries and

    secondaries, secondary nodes send heartbeats only to primaries.


    The isolation address is the gateway specified for the Service Console network (or management

    network on ESXi), but there is a possibility to specify one or multiple additional isolation addresses

    with an advanced setting. This advanced setting is called “das.isolationaddress” and could be used

    to reduce the chances of having a false positive. We recommend setting at least one additional

    isolation address.

    Figure 14: das.isolationaddress

    When isolation has been confirmed, meaning no heartbeats have been received and HA was unable

    to ping any of the isolation addresses, HA will execute the isolation response. This could be any of the above-described options: power off, shut down or leave powered on.

    If only one heartbeat is received or just a single isolation address can be pinged the isolation

    response will not be triggered, which is exactly what you want.
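    To summarize this logic, here is an illustrative PowerShell-style sketch; it is not an actual HA interface, and the function and parameter names are made up purely for clarity:

    function Test-TriggerIsolationResponse {
        param(
            [bool]     $HeartbeatReceived,     # was any heartbeat received from another node?
            [string[]] $IsolationAddresses     # the gateway plus any das.isolationaddress entries
        )
        if ($HeartbeatReceived) { return $false }                      # heartbeats seen: not isolated
        foreach ($address in $IsolationAddresses) {
            if (Test-Connection -ComputerName $address -Count 1 -Quiet) {
                return $false                                          # a single pingable address is enough: no response triggered
            }
        }
        return $true                                                   # no heartbeats and nothing pingable: execute the isolation response
    }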


    Selecting an Additional Isolation Address

    A question asked by many people is which address should be specified for this additional isolation

    verification. We generally recommend an isolation address closest to the hosts to avoid too many

    network hops. In many cases the most logical choice is the physical switch to which the host is

    directly connected; another usual suspect would be a router or any other reliable and pingable

    device. However, when you are using network-based shared storage like NFS or, for instance, iSCSI,

    a good choice would be the IP address of the storage device; this way you would also verify if the storage is

    still reachable or not.

    Failure Detection Time

    Failure Detection Time seems to be a concept that is often misunderstood but is critical when

    designing a virtual infrastructure. Failure Detection Time is basically the time it takes before the

    “isolation response” is triggered. There are two primary concepts when we are talking about failure

    detection time:

    •  The time it will take the host to detect it is isolated

    •  The time it will take the non-isolated hosts to mark the unavailable host as isolated and

    initiate the failover

    The following diagram depicts the timeline for both concepts:

    Figure 15: High Availability failure detection time

    The default value for failure detection is 15 seconds (das.failuredetectiontime). In other words the

    failed or isolated host will be declared failed by the other hosts in the HA cluster on the fifteenth

    second and a restart will be initiated by the failover coordinator after one of the primaries has

    verified that the failed or isolated host is unavailable by pinging the host on its management

    network.


    It should be noted that in the case of a dual management network setup both addresses will be

    pinged and 1 second will need to be added to the timeline. This means that the failover coordinator

    will initiate the restart on the 17th second.

    Let’s stress that again, a restart will be initiated after one of the primary nodes has tried to ping all

    of the management network addresses of the failed host.

    Let's assume the isolation response is "Power off". The isolation response "Power off" will be

    triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words a

    “Power off” will be initiated on the fourteenth second. A restart will be initiated on the sixteenth

    second by the failover coordinator if the host has a single management network.
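    The timeline can be summarized with a small worked calculation derived from the text above, assuming a single management network and the default das.failuredetectiontime:

    $failureDetectionTime = 15000 / 1000                 # das.failuredetectiontime, converted to seconds
    $isolationResponseAt  = $failureDetectionTime - 1    # second 14: isolated host triggers "Power off"
    $hostDeclaredFailedAt = $failureDetectionTime        # second 15: other hosts declare the host failed
    $restartInitiatedAt   = $failureDetectionTime + 1    # second 16: failover coordinator initiates restarts
    # With a dual management network, one extra address is pinged and the restart starts at second 17.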

    Does this mean that you can end up with your virtual machines being down and HA not restarting

    them?

    Yes, when the heartbeat returns between the 14th and 16th second the “Power off” might have

    already been initiated. The restart however will not be initiated because the received heartbeat

    indicates that the host is not isolated anymore.

    How can you avoid this?

    Selecting “Leave VM powered on” as an isolation response is one option. Increasing the

    das.failuredetectiontime  will also decrease the chances of running into issues like these, and with

    ESX 3.5 it was a standard best practice to increase the failure detection time to 30 seconds.

    At the time of writing (vSphere) this is not a best practice anymore as with any value the “2-second”

    gap exists and the likelihood of running into this issue is small. We recommend keeping

    das.failuredetectiontime as low as possible to decrease associated down time.

    Basic design principle:  Keep das.failuredetectiontime low for fast responses to failures. If an isolation validation address has been added, "das.isolationaddress", add 5000

    to the default “das.failuredetectiontime” (15000).
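    A PowerCLI sketch of this design principle, assuming the example cluster name used earlier and an arbitrary isolation address (replace it with a reliable address in your environment, such as a physical switch or the IP address of your storage device):

    $cluster = Get-Cluster -Name "ams-hadrs-001"
    # Add an extra isolation address; the default gateway is still used as well
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress2" -Value "192.168.1.254" -Confirm:$false
    # One extra address to ping, so raise the failure detection time from 15000 to 20000 milliseconds
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.failuredetectiontime" -Value 20000 -Confirm:$false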

    Chapter 5

    Adding Resiliency to HA (Network Redundancy)

    Single Service Console with vmnics in Active/Standby Configuration

    Recommended:

    2 physical switches

    The vSwitch should be configured as follows:

    •  vSwitch0: 2 Physical NICs (vmnic0 and vmnic2)

    •  2 Portgroups (Service Console and VMkernel)

    •  Service Console active on vmnic0 and standby on vmnic2

    •  VMkernel active on vmnic2 and standby on vmnic0

    •  Failback set to No

    Each portgroup has a VLAN ID assigned and runs dedicated on its own physical NIC; only in the

    case of a failure is it switched over to the standby NIC. We highly recommend setting failback to

    “No” to avoid chances of a false positive which can occur when a physical switch routes no traffic

    during boot but the ports are reported as “up”. (NIC Teaming Tab)
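    If you prefer to script this failover order instead of using the NIC Teaming tab, the PowerCLI sketch below shows one way of doing it; the host name and portgroup names are assumptions, and the teaming cmdlet parameters should be checked against your PowerCLI version:

    $vSwitch = Get-VirtualSwitch -VMHost (Get-VMHost -Name "esx01.local") -Name "vSwitch0"
    # Service Console: active on vmnic0, standby on vmnic2, failback disabled
    Get-VirtualPortGroup -VirtualSwitch $vSwitch -Name "Service Console" | Get-NicTeamingPolicy |
        Set-NicTeamingPolicy -MakeNicActive "vmnic0" -MakeNicStandby "vmnic2" -FailbackEnabled $false
    # VMkernel: active on vmnic2, standby on vmnic0, failback disabled
    Get-VirtualPortGroup -VirtualSwitch $vSwitch -Name "VMkernel" | Get-NicTeamingPolicy |
        Set-NicTeamingPolicy -MakeNicActive "vmnic2" -MakeNicStandby "vmnic0" -FailbackEnabled $false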

    Pros:  Only 2 NICs in total are needed for the Service Console and VMkernel, especially useful in

    Blade environments. This setup is also less complex.

    Cons: Just a single active path for heartbeats.

    The following diagram depicts the active/standby scenario:

    Figure 16: Active-standby Service Console network layout


    Secondary Management Network

    Requirements:

    •  3 physical NICs

    •  VLAN trunking

    Recommended:

    •  2 physical switches

    •  The vSwitch should be configured as follows:

    •  vSwitch0 – 3 Physical NICs (vmnic0, vmnic1 and vmnic2)

    •  3 Portgroups (Service Console, secondary Service Console and VMkernel)


    The primary Service Console runs on vSwitch0 and is active on vmnic0, with a VLAN assigned on

    either the physical switch port or the portgroup and is connected to the first physical switch. (We

    recommend using a VLAN trunk for all network connections for consistency and flexibility.)

    The secondary Service Console will be active on vmnic2 and connected to the second physical

    switch.

    The VMkernel is active on vmnic1 and standby on vmnic2.

    Pros  - Decreased chances of false alarms due to Spanning Tree "problems" as the setup contains

    two Service Consoles, each connected to only 1 physical switch. Subsequently both Service

    Consoles will be used for the heartbeat mechanism, which will increase resiliency.

    Cons  - Need to set advanced settings. It is mandatory to set an additional isolation address

    (das.isolationaddress2) in order for the secondary Service Console to verify network isolation via a

    different route.

    The following diagram depicts the secondary Service Console scenario:

    Figure 17: Secondary management network


    The question remains: which would we recommend? Both scenarios are fully supported and

    provide a highly redundant environment either way. Redundancy for the Service Console or

    Management Network is important for HA to function correctly and avoid false alarms about the

    host being isolated from the network. We however recommend the first scenario. Redundant NICs

    for your Service Console add a sufficient level of resilience without leading to an overly complex

    environment.


    Chapter 6 

    Admission Control

    Admission Control is often misunderstood and disabled because of this. However Admission

    Control is a must when availability needs to be guaranteed and isn’t that the reason for enabling HA

    in the first place?

    What is HA Admission Control about? Why does HA contain Admission Control?

    The "Availability Guide", a.k.a. the HA bible, states the following:

    "vCenter Server uses Admission Control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected."

    Admission Control guarantees capacity is available for an HA initiated failover by reserving

    resources within a cluster. It calculates the capacity required for a failover based on available

    resources. In other words, if a host is placed into maintenance mode, or disconnected, it is taken out

    of the equation. Available resources also mean that the virtualization overhead has already been

    subtracted from the total. To give an example: Service Console memory and VMkernel memory are

    subtracted from the total amount of memory, which results in the available memory for the virtual

    machines.

    There is one gotcha with Admission Control that we want to bring to your attention before drilling

    into the different policies.

    When Admission Control is set to strict, VMware Distributed Power Management in no way will

    violate availability constraints. This means that it will always ensure multiple hosts are up and running. (For more info on how DPM calculates this, read Chapter 18.)

    When Admission Control was disabled and DPM was enabled in a pre-vSphere 4.1 environment you

    could have ended up with all but one ESX host placed in sleep mode, which could lead to potential

    issues when that particular host failed or resources were scarce as there would be no host available

    to power-on your virtual machines. (KB: http://kb.vmware.com/kb/1007006)


    With vSphere 4.1 however; if there are not enough resources to power on all hosts, DPM will be

    asked to take hosts out of standby mode to make more resources available and the virtual machines

    can then get powered on by HA when those hosts are back online.

     Admission Control Policy

    The Admission Control Policy dictates the mechanism that HA uses to guarantee enough resources

    are available for an HA initiated failover. This section gives a general overview of the available

    Admission Control Policies. The impact of each policy is described in the following section including

    our recommendation.

    HA has three mechanisms to guarantee enough capacity is available to respect virtual machine

    resource reservations.

    Figure 18: Admission control policy


    Below we have listed all three options currently available as the Admission Control Policy. Each

    option has a different mechanism to ensure resources are available for a failover and each option

    has its caveats.

     Admission Control Mechanisms

Each Admission Control Policy has its own Admission Control mechanism. Understanding these mechanisms is important to understanding the impact of design decisions on your cluster. For instance, setting a reservation on a specific virtual machine can have an impact on the achieved consolidation ratio. This section will take you on a journey through the trenches of Admission Control mechanisms.

    Host Failures Cluster Tolerates

    The Admission Control Policy that has been around the longest is the “Host Failures Cluster

    Tolerates” policy. It is also historically the least understood Admission Control Policy due to its

    complex admission control mechanism.

    The so-called “slots” mechanism is used when selecting “host failures cluster tolerates” as the

Admission Control Policy. This mechanism has changed several times in the past, and it is one of the most restrictive policies.

Slots dictate how many virtual machines can be powered on before vCenter starts yelling “Out Of Resources!” Normally a slot represents one virtual machine. Admission Control does not limit HA in restarting virtual machines; it ensures enough resources are available to power on all virtual machines in the cluster by preventing “over-commitment.” For those wondering why HA-initiated failovers are not subject to the Admission Control Policy, think about it for a second: Admission Control is done by vCenter, while HA-initiated restarts are executed directly on the ESX host without the use of vCenter. So even if resources were low and vCenter complained, it could not stop the restart.

    If a failure has occurred and the host has been removed from the cluster, HA will recalculate all the

    values and start with an “N+x” cluster again from scratch. This could result in an over-committed

    cluster as you can imagine.

    “A slot is defined as a logical representation of the memory and CPU resources that satisfy the

    requirements for any powered-on virtual machine in the cluster…”

In other words, a slot is the worst-case CPU and memory reservation scenario in a cluster. This directly leads to the first “gotcha”:

HA uses the highest CPU reservation of any given virtual machine and the highest memory reservation of any given VM in the cluster. If no reservation higher than 256 MHz is set, HA will use a default of 256 MHz for CPU. If no memory reservation is set, HA will use a default of 0MB plus memory overhead for memory. (See the VMware vSphere Resource Management Guide for


more details on memory overhead per virtual machine configuration.) The following example will clarify what “worst-case” actually means.

Example - If virtual machine “VM1” has 2GHz of CPU reserved and 1024MB of memory reserved, and virtual machine “VM2” has 1GHz of CPU reserved and 2048MB of memory reserved, the slot size for memory will be 2048MB (plus memory overhead) and the slot size for CPU will be 2GHz. It is a combination of the highest reservation of both virtual machines. Reservations defined at the Resource Pool level, however, will not affect HA slot size calculations.

Basic design principle: Be really careful with reservations. If there is no need to have them on a per-virtual machine basis, don't configure them, especially when using Host Failures Cluster Tolerates. If reservations are needed, resort to resource pool-based reservations.

Now that we know the worst-case scenario is always taken into account when it comes to slot size calculations, we will describe what dictates the number of available slots per cluster.

We first need to know what the slot size for memory and CPU is. Then we divide the total available CPU resources of a host by the CPU slot size and the total available memory resources of a host by the memory slot size. This leaves us with a number of slots for both memory and CPU. The most restrictive number (again, the worst-case scenario) is the number of slots for this host. If you have 25 CPU slots but only 5 memory slots, the number of available slots for this host will be 5, as HA will always take the worst-case scenario into account to “guarantee” all virtual machines can be powered on in case of a failure or isolation.
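To make this arithmetic concrete, here is a minimal Python sketch of the calculation as described above; the function names and the example host size are ours and purely illustrative, not part of any VMware tool or API.

```python
from math import floor

# Minimal sketch of the slot mechanism described above.
# vm_reservations: (CPU MHz, memory MB) per powered-on VM; host values are
# the resources available after virtualization overhead has been subtracted.
def slot_size(vm_reservations, default_cpu_mhz=256, default_mem_mb=0):
    """Worst-case reservation across all powered-on VMs defines the slot."""
    cpu_slot = max([default_cpu_mhz] + [cpu for cpu, _ in vm_reservations])
    # Memory slot = highest memory reservation; per-VM memory overhead is
    # folded into the reservation numbers here for simplicity.
    mem_slot = max([default_mem_mb] + [mem for _, mem in vm_reservations]) or 1
    return cpu_slot, mem_slot

def slots_per_host(host_cpu_mhz, host_mem_mb, cpu_slot, mem_slot):
    """The most restrictive of the CPU and memory slot counts wins."""
    return min(floor(host_cpu_mhz / cpu_slot), floor(host_mem_mb / mem_slot))

# Example from the text: VM1 (2 GHz, 1024 MB) and VM2 (1 GHz, 2048 MB),
# plus a hypothetical host with 24 GHz and 16 GB available.
cpu_slot, mem_slot = slot_size([(2000, 1024), (1000, 2048)])
print(cpu_slot, mem_slot)                                 # 2000 MHz, 2048 MB
print(slots_per_host(24000, 16384, cpu_slot, mem_slot))   # min(12, 8) = 8 slots
```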

A question we receive a lot is: how do I know what my slot size is? The details around slot sizes can be monitored in the HA section of the cluster's Summary tab by clicking the “Advanced Runtime Info” line.

    Figure 19: High Availability cluster summary tab

This will show the following screen, which specifies the slot size and more useful details around the number of slots available.


    Figure 20: High Availability advanced runtime info

As you can see, using reservations on a per-VM basis can lead to very conservative consolidation ratios. However, with vSphere this is configurable. If you have just one virtual machine with a really high reservation, you can set the following advanced settings to lower the slot size used for these calculations: “das.slotCpuInMHz” or “das.slotMemInMB”.

To avoid not being able to power on the virtual machine with the high reservation, that virtual machine will take up multiple slots. When you are low on resources, this could mean that you are not able to power on this high-reservation virtual machine, as resources may be fragmented throughout the cluster instead of being available on a single host. As of vSphere 4.1, HA will notify DRS that a power-on attempt was unsuccessful and a request will be made to defragment the resources to accommodate the remaining virtual machines that need to be powered on.
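The effect of capping the slot size can be expressed in one line; the sketch below (illustrative Python, names ours) shows how many slots a virtual machine with a large reservation would consume once das.slotMemInMB is set:

```python
from math import ceil

# Sketch of our understanding: when das.slotMemInMB caps the memory slot size,
# a VM whose reservation exceeds that cap simply consumes multiple slots.
def slots_consumed(vm_mem_reservation_mb, mem_slot_mb):
    return max(1, ceil(vm_mem_reservation_mb / mem_slot_mb))

# A 4096 MB reservation with a 1024 MB slot size consumes 4 slots, which are
# subtracted from the cluster-wide slot count.
print(slots_consumed(4096, 1024))  # 4
```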

    The following diagram depicts a scenario where a virtual machine spans multiple slots:


Figure 21: Virtual machine spanning multiple HA slots

Notice that because the memory slot size has been manually set to 1024MB, one of the virtual machines (grouped with dotted lines) spans multiple slots due to a 4GB memory reservation. As you might have noticed, none of the hosts has 4 slots left. Although in total there are enough slots available, they are fragmented and HA will not be able to power on this particular virtual machine directly, but will request DRS to defragment the resources to accommodate this virtual machine's resource requirements.

Admission Control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings. It will take the number of slots this virtual machine will consume into account by subtracting them from the total number of available slots, but it will not verify the number of available slots per host to ensure failover. As stated earlier, though, as of vSphere 4.1 HA will request DRS to defragment the resources. However, this is no guarantee of a successful power-on attempt or slot availability.


Basic design principle: Avoid using advanced settings to decrease the slot size, as it could lead to more downtime and adds an extra layer of complexity. If there is a large discrepancy in size and reservations are set, it might help to put similarly sized virtual machines into their own cluster.

Unbalanced Configurations and Impact on Slot Calculation

It is an industry best practice to create clusters with similar hardware configurations. However, many companies start out with a small VMware cluster when virtualization is introduced and plan on expanding when trust within the organization has been built.

When the time has come to expand, chances are fairly high that the same hardware configuration is no longer available. The question is: will you add the newly bought hosts to the same cluster or create a new cluster?

From a DRS perspective, large clusters are preferred as they increase the load-balancing options. However, there is a caveat for DRS as well, which is described in the DRS section of this book. For HA there is a big caveat, and when you think about it and understand the internal workings of HA, you probably already know what is coming up.

    Let’s first define the term “unbalanced cluster”.

    An unbalanced cluster would for instance be a cluster with 6 hosts of which one contains more

    memory than the other hosts in the cluster.

    Let’s try to clarify that with an example.

    Example: 

What would happen to the total number of slots in a cluster with the following specifications?

•  Six-host cluster
•  Five hosts have 16GB of available memory
•  One host has 32GB of available memory

The sixth host is a brand new host that has just been bought, and as prices of memory have dropped immensely, the decision was made to buy 32GB instead of 16GB.

The cluster contains a virtual machine that has 1 vCPU and 4GB of memory. A 1024MB memory reservation has been defined on this virtual machine. As explained earlier, a reservation will dictate the slot size, which in this case leads to a memory slot size of 1024MB plus memory overhead. For the sake of simplicity we will, however, calculate with 1024MB.


As Admission Control is enabled, a worst-case scenario is taken into account. With a 1024MB slot size, each 16GB host provides 16 slots and the 32GB host provides 32 slots. When a single host failure has been specified, the host with the largest number of slots will be taken out of the equation. In other words, for our cluster this would result in:

esx01 + esx02 + esx03 + esx04 + esx05 = 80 slots available

Although you have doubled the amount of memory in one of your hosts, you are still stuck with only 80 slots in total. As clearly demonstrated, there is absolutely no point in buying additional memory for a single host when your cluster is designed with Admission Control enabled and the “Host Failures Cluster Tolerates” Admission Control Policy has been selected.

In our example the memory slot size happened to be the most restrictive; the same principle applies when the CPU slot size is the most restrictive.

Basic design principle: When using Admission Control, balance your clusters and be conservative with reservations, as they lead to decreased consolidation ratios.

Now what would happen in the scenario above when the number of allowed host failures is set to 2?

In this case esx06 is taken out of the equation along with any one of the remaining hosts in the cluster. This results in 64 slots. This makes sense, doesn't it?
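For completeness, here is a small Python sketch (our own illustration, not a VMware formula) that reproduces the numbers in this example:

```python
# Sketch of our interpretation: with "Host Failures Cluster Tolerates", the N
# hosts providing the MOST slots are removed from the equation (worst case).
def cluster_slots(slots_per_host, host_failures=1):
    remaining = sorted(slots_per_host)[:-host_failures] if host_failures else slots_per_host
    return sum(remaining)

# The unbalanced example above: five 16-slot hosts and one 32-slot host.
hosts = [16, 16, 16, 16, 16, 32]
print(cluster_slots(hosts, host_failures=1))  # 80 (the 32GB host's extra memory is wasted)
print(cluster_slots(hosts, host_failures=2))  # 64
```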

Can you avoid large HA slot sizes due to reservations without resorting to advanced settings? That's the question we get almost daily. The answer used to be no if per-virtual machine reservations were required. HA uses reservations to calculate the slot size and, pre-vSphere, there was no way to tell HA to ignore them without using advanced settings. With vSphere, the new percentage method is an alternative.

    Percentage of Cluster Resources Reserved

With vSphere, VMware introduced the ability to specify a percentage in addition to the number of host failures and a designated failover host. The percentage avoids the slot size issue, as it does not use slots for Admission Control. So what does it use?

When you specify a percentage, that percentage of the total amount of available resources will stay reserved for HA purposes. First of all, HA will add up all available resources to see how much it has available in total (virtualization overhead will be subtracted). Then HA will calculate how many resources are currently reserved by adding up all reservations for both memory and CPU for powered-on virtual machines.

For those virtual machines that do not have a reservation larger than 256 MHz, a default of 256 MHz will be used for CPU, and a default of 0MB plus memory overhead will be used for memory. (The amount of overhead per configuration type can be found in the “Understanding Memory Overhead” section of the Resource Management Guide.)


In other words:

(total amount of available resources – total reserved virtual machine resources) / total amount of available resources
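As a worked illustration of this formula, the following Python sketch computes the current failover capacity for CPU and memory and checks it against the configured percentage; the function names and the example numbers are ours, not a VMware API.

```python
# Minimal sketch of the percentage-based admission check as we understand it.
def current_failover_capacity(total_resources, total_reservations):
    return (total_resources - total_reservations) / total_resources

def admission_allowed(total_mhz, reserved_mhz, total_mb, reserved_mb, configured_pct):
    cpu_capacity = current_failover_capacity(total_mhz, reserved_mhz)
    mem_capacity = current_failover_capacity(total_mb, reserved_mb)
    # A power-on is blocked if either capacity would drop below the configured percentage.
    return cpu_capacity >= configured_pct and mem_capacity >= configured_pct

# Example: 24 GHz / 96 GB cluster, 6 GHz and 64 GB currently reserved, 25% configured.
print(admission_allowed(24000, 6000, 98304, 65536, 0.25))  # CPU 75%, memory ~33% -> True
```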


If you have an unbalanced cluster (hosts with different amounts of CPU or memory resources), your percentage should be equal to, or preferably larger than, the percentage of resources provided by the largest host. This way you ensure that all virtual machines residing on this host can be restarted in case of a host failure.
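A quick back-of-the-envelope check, using hypothetical host sizes, shows how to derive that minimum percentage:

```python
# Hypothetical unbalanced cluster: three 96GB hosts and one 192GB host.
host_memory_gb = [96, 96, 96, 192]
largest_share = max(host_memory_gb) / sum(host_memory_gb)
# The largest host provides 40% of cluster memory, so the configured
# percentage should be at least 40% to cover its failure.
print(f"Reserve at least {largest_share:.0%}")
```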

As explained earlier, this Admission Control Policy does not use slots; as such, resources might be fragmented throughout the cluster. Although as of vSphere 4.1 DRS is notified to rebalance the cluster, if needed, to accommodate these virtual machines' resource requirements, a guarantee cannot be given. We recommend ensuring you have at least one host with enough available capacity to boot the largest virtual machine (CPU/memory reservation). Also make sure you select the highest restart priority for this virtual machine (depending on the SLA, of course) to ensure it will be able to boot.

The following diagram will make it more obvious. You have 5 hosts, each with roughly 80% memory usage, and you have configured HA to reserve 20% of resources. A host fails and all virtual machines will need to fail over. One of those virtual machines has a 4GB memory reservation. As you can imagine, the first power-on attempt for this particular virtual machine will fail due to the fact that none of the hosts has enough memory available to guarantee it.


    Figure 25: Available resources

Basic design principle: Although vSphere 4.1 will utilize DRS to try to accommodate the resource requirements of this virtual machine, a guarantee cannot be given. Do the math: verify that any single host has enough resources to power on your largest virtual machine. Also take restart priority into account for this/these virtual machine(s).

    Failover Host

The third option one could choose is a designated failover host. This is commonly referred to as a hot standby. There is actually not much to tell about this mechanism, as it is “what you see is what you get.” When you designate a host as a failover host, it will not participate in DRS and you will not be able to power on virtual machines on this host! It is almost as if it is in maintenance mode, and it will only be used in case a failover needs to occur.


    Chapter 7

    Impact of Admission Control Policy

As with any decision when architecting your environment, there is an impact. This especially goes for the Admission Control Policy. The first decision that will need to be made is whether Admission Control is enabled or not. We recommend enabling Admission Control, but carefully select the policy and ensure it fits your or your customer's needs.

    Basic design principle:

    Admission Control guarantees enough capacity is available for virtual machine failover. As

    such we recommend enabling it.

We have explained all the mechanisms that are used by each of the policies in Chapter 6. As this is one of the most crucial decisions that needs to be made, we have summarized all the pros and cons for each of the three policies below.

    Host Failures Cluster Tolerates

    This option is historically speaking the most used for Admission Control. Most environments are

    designed with an N+1 redundancy and N+2 is also not uncommon. This Admission Control Policy

    uses “slots” to ensure enough capacity is reserved for failover, which is a fairly complex

    mechanism. Slots are based on VM-level Reservations.

Pros:
•  Fully automated. (When a host is added to a cluster, HA re-calculates how many slots are available.)
•  Ensures failover by calculating slot sizes.

Cons:
•  Can be very conservative and inflexible when reservations are used, as the largest reservation dictates slot sizes.
•  Unbalanced clusters lead to wastage of resources.
•  Complexity for the administrator from a calculation perspective.

Percentage of Cluster Resources Reserved

Percentage-based Admission Control is the latest addition to the HA Admission Control Policies. It is based on per-VM reservation calculations instead of slots.

    Pros:


•  Accurate, as it considers the actual reservation per virtual machine.
•  Cluster dynamically adjusts when resources are added.

Cons:
•  Manual calculations are needed when adding additional hosts to a cluster and the number of host failures needs to remain unchanged.
•  Unbalanced clusters can be a problem when the chosen percentage is too low and resources are fragmented, which means failover of a virtual machine can't be guaranteed, as the reservation of this virtual machine might not be available as resources on a single host.

    Specify a Failover Host

With the Specify a Failover Host Admission Control Policy, when a host fails, HA will attempt to restart all virtual machines on the designated failover host. The designated failover host is essentially a “hot standby.” In other words, DRS will not migrate VMs to this host when resources are scarce or the cluster is imbalanced.

Pros:
•  What you see is what you get.
•  No fragmented resources.

Cons:
•  What you see is what you get.
•  Maximum of one failover host. (N+2 redundancy is impossible.)
•  Dedicated failover host not utilized during normal operations.

    Recommendations

We have been asked many times for our recommendation on Admission Control and it is difficult to answer, as each policy has its pros and cons. However, we generally recommend a percentage-based Admission Control Policy. It is the most flexible policy, as it uses the actual reservation per virtual machine instead of taking a worst-case scenario approach like the number of host failures does. However, the number of host failures policy guarantees the failover level under all circumstances. Percentage-based is less restrictive, but offers lower guarantees that in all scenarios HA will be able to restart all virtual machines. With the added level of integration between HA and DRS, we believe a percentage-based Admission Control Policy will fit most environments.


Basic design principle: Do the math, and take customer requirements into account. We recommend using a percentage-based Admission Control Policy, as it is the most flexible policy.


    Chapter 8

    VM Monitoring

VM Monitoring, or VM-level HA, is an often overlooked but really powerful feature of HA. The reason for this is most likely that it is disabled by default and relatively new compared to HA. We have tried to gather all the info we could around VM Monitoring, but it is a pretty straightforward feature that actually does what you expect it to do.

    With vSphere 4.1 VMware also introduced VM and Application Monitoring. Application Monitoring

    is a brand new feature that Application Developers can leverage to increase resiliency as shown in

    the screenshot below.

    Figure 26: VM and Application Monitoring 

As of writing there was little information around Application Monitoring besides the fact that the Guest SDK is used by application developers or partners, like for instance Symantec, to develop solutions against the SDK. In the case of Symantec, a simplified version of Veritas Cluster Server (VCS) is used to enable application availability monitoring, including of course responding to issues. Note that it is not a multi-node clustering solution like VCS itself but a single-node solution. Symantec ApplicationHA, as it is called, is triggered to get the application up and running again by restarting it. Symantec's ApplicationHA is aware of dependencies and knows in which order services should be started or stopped. If, however, for whatever reason this fails for an "X" number of times (a configurable option within ApplicationHA), HA will be asked to take action. This action will be a restart of the virtual machine.

Although Application Monitoring is relatively new and there are only a few partners currently exploring the capabilities, it does add a whole new level of resiliency in our opinion. We have tested ApplicationHA by Symantec and personally feel it is the missing link. It enables you as a System Admin to integrate your virtualization layer with your application layer. It ensures that protected services are restarted in the correct order and it avoids the common pitfalls associated with restarts and maintenance.


    Why Do You Need VM/Application Monitoring?

VM and Application Monitoring acts at a different level than HA. VM/App Monitoring responds to a single virtual machine or application failure, as opposed to HA, which responds to a host failure. An example of a single virtual machine failure would, for instance, be the infamous “blue screen of death”.

    How Does VM/App Monitoring Work?

VM Monitoring restarts individual virtual machines when needed. VM/App Monitoring uses a concept similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received for a specific amount of time, the virtual machine will be rebooted. The heartbeats are communicated directly to VPXA by VMware Tools; these heartbeats are not sent over a network.
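Conceptually, the check boils down to a simple timeout; the sketch below is only our illustration of that idea (the 30-second interval is a placeholder, as the real value depends on the sensitivity level configured for the cluster):

```python
import time

# Conceptual sketch only: a heartbeat-timeout check in the spirit of VM Monitoring.
FAILURE_INTERVAL_SECONDS = 30  # hypothetical value; depends on configured sensitivity

def needs_reset(last_heartbeat_time, now=None):
    """Return True if no VMware Tools heartbeat was seen within the interval."""
    now = now if now is not None else time.time()
    return (now - last_heartbeat_time) > FAILURE_INTERVAL_SECONDS

# Example: last heartbeat 45 seconds ago, so the VM would be reset.
print(needs_reset(time.time() - 45))  # True
```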

    Figure 27: VM monitoring sensitivity

When enabling VM/App Monitoring, the level of sensitivity can be configured. The default setting should fit most situations. Low sensitivity basically means that the number of allowed “missed” heartbeats is higher, and as such the chances of running into a false positive are lower. However, if a failure occurs and the sensitivity level is set to low, the experienced downtime will be higher. When quick action is required in case of a possible failure, “high sensitivity” can be selected, and as expected this is the opposite of “low sensitivity”.

    Table 1: VM monitoring sensitivity


    Screenshots

The cool thing about VM Monitoring is the fact that it takes screenshots of the VM console. They are taken right before a virtual machine is reset by VM Monitoring. This was added as of vCenter 4.0. It is a very useful feature when a virtual machine “freezes” every once in a while for no apparent reason. This screenshot can be used to debug the virtual machine operating system, if and when needed, and is stored in the virtual machine's working directory.

Basic design principle: VM Monitoring can substantially increase availability. It is part of the HA stack and we heavily recommend using it!


    Flattened Shares

Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA fails over a virtual machine, it will power on the virtual machine in the Root Resource Pool. However, the virtual machine's shares were scaled for its appropriate place in the resource pool hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too many or too few resources relative to its entitlement.

A scenario where this can occur would be the following:

VM1 has 1000 shares and Resource Pool A has 2000 shares. However, Resource Pool A has two VMs and both will have 50% of those 2000 shares. The following diagram depicts this scenario:

    Figure 28: Flatten shares starting point

When the host fails, both VM2 and VM3 will end up on the same level as VM1. However, as a custom shares value of 10,000 was specified on both VM2 and VM3, they will completely blow away VM1 in times of contention. This is depicted in the following diagram:


    Figure 29: Flatten shares host failure

This situation would persist until the next invocation of DRS re-parents the virtual machine to its original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual machine's shares and limits before failover. This flattening process ensures that the virtual machine will get the resources it would have received if it had failed over to the correct Resource Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed under the Root Resource Pool with a shares value of 1000.

Figure 30: Flatten shares after host failure before DRS invocation

Of course, when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will again receive the number of shares originally assigned to them.
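The following Python sketch (our own illustration of the example above, not VMware's actual algorithm) contrasts the pre-4.1 behavior with the flattened behavior:

```python
# Our own illustration of the example above; not VMware's actual algorithm.
def effective_root_shares(vm_shares, pool_shares, vms_in_pool, flatten):
    """Share value the failed-over VM competes with at the root level."""
    if flatten:
        # vSphere 4.1 behavior: the VM is given the share value it effectively
        # had inside its pool (here both VMs split the pool's shares evenly).
        return pool_shares // vms_in_pool
    # Pre-4.1 behavior: the raw custom share value lands at the root as-is.
    return vm_shares

# VM2/VM3 carry 10,000 custom shares inside the 2,000-share Resource Pool A.
print(effective_root_shares(10000, 2000, 2, flatten=False))  # 10000, dwarfing VM1's 1000
print(effective_root_shares(10000, 2000, 2, flatten=True))   # 1000, as in Figure 30
```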


    Chapter 10

    Summarizing

The integration of HA with DRS has been vastly improved, and so has HA in general. We hope everyone sees the benefits of these improvements and of HA and VM and Application Monitoring in general. We have tried to simplify some of the concepts to make them easier to understand; still, we acknowledge that some concepts are difficult to grasp. We hope, though, that after reading this section of the book everyone is confident enough to make the changes to HA needed to increase the resiliency, and essentially the uptime, of your environment, because that is what it is all about.

If there are any questions, please do not hesitate to reach out to either of the authors.


    Part 2

    VMware Distributed Resource Scheduler


    Chapter 11

    What is VMware DRS?

    VMware Distributed Resource Scheduler (DRS) is an infrastructure service run by VMware vCenter

    Server (vCenter). DRS aggregates ESX host resources into clusters and automatically distributes

    these resources to the virtual machines.

    DRS monitors resource usage and continuously optimizes the virtual machine resource distribution

    across ESX hosts.

    DRS computes the resource entitlement for each virtual machine based on static resource allocation

    settings and dynamic settings such as active usage and level of contention.

DRS attempts to satisfy the virtual machine resource entitlement with the resources available in the cluster by leveraging vMotion. vMotion is used either to migrate the virtual machines to alternative ESX hosts with more available resources or to migrate other virtual machines away to free up resources.

    Because DRS is an automated solution and easy to configure, we recommend enabling DRS to

    achieve higher consolidation ratios at low costs.

    A DRS-enabled cluster is often referred to as a DRS cluster. In vSphere 4.1, a DRS cluster can

    manage up to 32 hosts and 3000 VMs.

    Cluster Level Resource Management

Clusters group the resources of the various ESX hosts together and treat them as a pool of resources; DRS presents the aggregated resources as one big host to the virtual machines. Pooling resources allows DRS to create resource pools spanning all hosts in the cluster and to apply cluster-level resource allocation policies. Probably unnecessary to point out, but a virtual machine cannot span hosts even when resources are pooled by using DRS. In addition to resource pools and resource allocation policies, DRS offers the following resource management capabilities.

Initial placement – When a virtual machine is powered on in the cluster, DRS places the virtual machine on an appropriate host or generates a recommendation, depending on the automation level.