VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster Based Infrastructure

Operating and Architecting a vSphere Metro Storage Cluster based infrastructure

Lee Dilworth, VMware

Duncan Epping, VMware

BCO4872

#BCO4872

2

Interact!

If you use Twitter, feel free to tweet about this session and use hashtag #BCO4872

Feel free to take pictures, shoot video, and share it on Twitter / Facebook

Blog about it

• We would love to read your thoughts, your opinion, design decisions!

3

Agenda for Today

Availability Basics

vSphere Metro Storage Cluster Basics

Architecting and Operating

Failure Scenarios

Wrapping up

4

Availability Basics

5

Disaster Avoidance

Avoidance NOT Recovery

• Two sites, One vSphere Cluster

• One vCenter manages BOTH sites

• One site effectively put into maintenance mode

• Hot VM Mobility solution

Intra-cluster vMotion

6

Disaster Recovery

Replication

Recovery NOT avoidance

• Two sites, typically two vSphere Clusters

• Each site is usually managed by its own vCenter

• vMSC solutions CAN support disaster recovery via HA restarts

• Cold VM Mobility Solutions (SRM or vMSC “Federated HA”)

7

vSphere High Availability – Setting the Baseline

vSphere HA minimizes unplanned downtime

Provides automatic VM recovery in minutes

Protects against various types of failures

• Host failure

• Host network isolation

• Permanent loss of datastore

• VM crashes (including VMX)

• Guest OS / Application crashes / hangs

Does not require complex configuration changes

Is Operating System and application-independent

8

vSphere 5.0+ Architecture

HA Agent

• Called the Fault Domain Manager (FDM)

• Provides all the HA on-host functionality

Operation

• vCenter Server manages the cluster

• Failover is not dependent on vCenter

Communicate over

• Management Network

• Datastores

vCenter Server

9

Master and Slave Roles

Any host can be master, selected by election

• All others assume the role of slaves

The Master

• Monitors hosts and VMs

• Manages VM restarts after failures

• Reports cluster state to vCenter Server

The Slave

• Forwards critical state changes to the Master

• Restarts VMs when directed by the Master

• Elects new Master

vCenter Server

10

Network Used for Communication

Network is default communication method

• Used for selecting a Master

• Used for heartbeating

• Used for reporting state to vCenter Server

Network Heartbeating

• Used by a Master to monitor the state of a Slave

• When Master receives no heartbeats it will ping the Slave

• When Slave receives no heartbeats from Master it will ping isolation address

11

Datastores Used for Communication

Datastores are used when the management network is not available

• They are used to determine state (isolated vs failed)

• Only when a failure has occurred!

• vCenter selects two for each host

Files used on datastores

• host-<id>-hb

• Heartbeat file!

• host-<id>-poweron

• Contains power state of VMs and is used to communicate isolation

• First line, either a “0” or a “1” where “1” means isolated

• protectedlist

• Owned by the master, its view of the world
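The heartbeat logic on the last two slides can be sketched in a few lines of Python. This is a simplified illustration, not the FDM implementation: the function names, the "partitioned" state, and the decision order are assumptions made for the example. Only the first line of the host-<id>-poweron file ("0" or "1") comes from the slide; the rest of the file layout is not assumed.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    ISOLATED = "isolated"
    PARTITIONED = "partitioned"  # alive but unreachable over the management network
    FAILED = "failed"


def is_isolated(poweron_text: str) -> bool:
    """Return True when the first line of a host-<id>-poweron file is "1"."""
    lines = poweron_text.splitlines()
    return bool(lines) and lines[0].strip() == "1"


def classify_host(network_heartbeat: bool, ping_ok: bool,
                  datastore_heartbeat: bool, poweron_text: str) -> HostState:
    """Hypothetical master-side decision, in the order described on the slides."""
    if network_heartbeat:
        # Network heartbeats are the default method; datastores are not consulted.
        return HostState.LIVE
    if not ping_ok and not datastore_heartbeat:
        # No heartbeats of any kind and no ping reply: the host is declared dead.
        return HostState.FAILED
    if is_isolated(poweron_text):
        # The host is still alive and has flagged itself as isolated.
        return HostState.ISOLATED
    return HostState.PARTITIONED


print(classify_host(False, False, True, "1\n"))
```

Only the FAILED outcome leads to an unconditional restart of the host's VMs; for an isolated host the outcome depends on the configured isolation response.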

12

vSphere Metro Storage Cluster

the Basics (well sort of)

13

What is a vSphere Metro Storage Cluster

Stretched cluster solution, not a feature!

Requires:

• storage system that “stretches” across sites

• stretched network across sites

Hardware Compatibility List (HCL) – Certified vMSC

• “iSCSI Metro Cluster Storage”

• “FC Metro Cluster Storage”

• “NFS Metro Cluster Storage”

14

vSphere Metro Storage Cluster – Growing Ecosystem

15

Typical vSphere vMSC Setup

[Diagram labels: vCenter, stretched network, vSphere HA cluster, network, storage, vMSC certified storage]

16

Latency Support Requirements

ESXi management network: max supported latency is 10 milliseconds Round Trip Time (RTT)

• Note: 10ms is supported with Enterprise+ licenses only (Metro vMotion); the default is 5ms

Synchronous storage replication link: max supported latency is 5 milliseconds RTT

• Note: some storage vendors have different support requirements!
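As a rough pre-flight check, the limits on this slide can be encoded as follows. The function and parameter names are hypothetical, and the storage limit is the generic 5 ms figure from the slide; a vendor-specific limit would replace it.

```python
STORAGE_REPLICATION_MAX_RTT_MS = 5.0  # generic limit from the slide; vendors may differ


def max_management_rtt_ms(enterprise_plus: bool) -> float:
    """10 ms with Enterprise+ (Metro vMotion), otherwise the 5 ms default."""
    return 10.0 if enterprise_plus else 5.0


def check_latency(mgmt_rtt_ms: float, storage_rtt_ms: float,
                  enterprise_plus: bool) -> list:
    """Return a list of human-readable problems; an empty list means within limits."""
    problems = []
    limit = max_management_rtt_ms(enterprise_plus)
    if mgmt_rtt_ms > limit:
        problems.append(f"management RTT {mgmt_rtt_ms} ms exceeds {limit} ms")
    if storage_rtt_ms > STORAGE_REPLICATION_MAX_RTT_MS:
        problems.append(
            f"storage RTT {storage_rtt_ms} ms exceeds "
            f"{STORAGE_REPLICATION_MAX_RTT_MS} ms"
        )
    return problems


print(check_latency(8.0, 4.0, enterprise_plus=False))
```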


17

When to Use Stretched vSphere Clusters?

Campus / nearby sites

• Sites within Synchronous distance

• Two buildings on a common campus

• Two datacenters within a city

Planned migration important

• Long-distance vMotion for planned maintenance, disaster avoidance, or load balancing

DR Features less critical

• No testing, orchestration, or automation

• VMware HA typically not sufficient for automation – requires scripting / manual process due to VM placement with primary / secondary arrays

• RTOs typically longer

18

Two Architectures: Uniform Host Access Configuration (1/2)

[Diagram: stretched cluster across Site A and Site B, one fabric per site, connected over FC / IP; Storage A presents the LUN as R/W, Storage B presents it as R/O]

19

Two architectures: Non-Uniform Host Access Configuration (2/2)

[Diagram: stretched cluster across Site A and Site B, one fabric per site, connected over FC / IP; Storage A and Storage B both present the distributed LUN as R/W]

20

Defining Some Failure Terminology

All Paths Down (APD) – Aaahhhh where has that device gone?

• Incorrect storage removal i.e. yanked!

• Sudden storage failure

• No time for storage to tell us anything

Permanent Device Loss (PDL) – Aaahhhh the device has gone, OK I understand

• Much nicer than APD, graceful handling of state change

• Storage notifies of device state change via SCSI sense code

• Allows HA to fail over VMs

Split Brain – Hmmm the other half has disappeared, now what?

• Election of second HA master

• Check heartbeat datastore region

• Restart VMs (if needed)
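The difference between APD and PDL comes down to whether the storage had a chance to say anything. A minimal sketch, with hypothetical function and parameter names, and with the sense-code check reduced to a boolean:

```python
def classify_device_loss(responding_paths: int, pdl_sense_code_received: bool) -> str:
    """Classify a device condition using the terminology from this slide.

    pdl_sense_code_received: the array answered with a SCSI sense code saying
    the device is permanently gone (assumed to be decoded elsewhere).
    """
    if pdl_sense_code_received:
        # The storage told us about the state change: graceful, HA can act on it.
        return "PDL"
    if responding_paths == 0:
        # The device vanished without notice: nothing told us why.
        return "APD"
    return "OK"


print(classify_device_loss(responding_paths=0, pdl_sense_code_received=False))
```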

21

Architecting and Operating

vSphere Metro Storage Cluster

22

Will Use Our Environment to Illustrate…

Two sites

Four hosts in total

Stretched network

Stretched storage

One vCenter Server

One vSphere HA Cluster

[Diagram: Site A and Site B, each with its own fabric, joined by the management network; Storage A and Storage B both present the distributed LUN as R/W over FC / IP]

23

HA & DRS – Site Awareness

[Diagram: “What they think…” shows DRS and HA above a single network; “What you’ve actually got…” shows DRS and HA with question marks]

24

Why Should I Care About Site Awareness?

Operational Simplicity

• Group dependent workloads

• Increase HA predictability

• Reduce impact of full cluster partition

• Orchestrate allocation of workloads to “sites”

• Even distribution & consumption of cluster resources

Alignment with Storage

• Locate VMs above read/write device

• Remove unnecessary east/west IO traffic

• Access anywhere devices, align with partition winner per device

25

DRS Design Considerations – Affinity Rules (1/2)

DRS Host Group Per Site

DRS VM Group Per Site

Align Dependent VM Workloads

26

DRS Design Considerations – Affinity Rules (2/2)

Use the “should” rules

• HA does not violate “must” rules, so avoid them in these configurations
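Why “should” rather than “must” can be shown with a toy placement model. This is a sketch under stated assumptions, not HA's placement engine: the host names, the dict layout, and the function name are all hypothetical.

```python
def restart_candidates(vm_site: str, alive_hosts: dict, rule_kind: str) -> list:
    """Return the hosts HA could restart a VM on after a failure.

    alive_hosts maps host name -> site for hosts that are still running.
    rule_kind is "must" or "should" (the VM-to-host affinity rule type).
    """
    if rule_kind == "must":
        # A "must" rule is never violated: only hosts in the VM's own site qualify.
        return sorted(h for h, site in alive_hosts.items() if site == vm_site)
    if rule_kind == "should":
        # A "should" rule is a preference: any surviving host is a candidate.
        return sorted(alive_hosts)
    raise ValueError(f"unknown rule kind: {rule_kind}")


# Site A has failed completely; only Site B hosts are left.
survivors = {"esx-b1": "B", "esx-b2": "B"}
print(restart_candidates("A", survivors, "must"))    # no candidates, VM stays down
print(restart_candidates("A", survivors, "should"))  # restart on Site B is possible
```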

27

Storage DRS Design Considerations

Cluster datastores based on “site affinity”

Avoid unnecessary site-to-site migrations

Set Storage DRS to “Manual”, take control, migration *could* impact availability

Align VMs with storage / site boundary

Group *similar* devices!

28

Network Design Considerations

Network teams usually don’t like the words “Stretch” and “Cluster”

Site-to-Site vMotion – handle carefully

Ingress point to the network? Load balanced / redundant?

Consider application users – site affinity affects data flow too!

Network options are changing (OTV, EoMPLS)

L3 Routing impacts (and options: LISP?)

Co-locate Multi-VM applications

Consider east-west traffic


29

HA Design Considerations – Admission Control

What about Admission Control?

• We typically recommend setting it to 50%, to allow full site fail-over

• Admission control is not a resource management tool

• Only guarantees power-on
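The 50% figure follows from reserving enough capacity to absorb the loss of the largest site. A small sketch, assuming equally sized hosts (an assumption of this example, not of the slide) and a hypothetical function name:

```python
def failover_capacity_percent(hosts_per_site: dict) -> int:
    """Percentage of cluster resources to reserve for a full-site failover.

    Assumes every host contributes the same capacity. The result is rounded
    up so that the reservation is never too small.
    """
    total = sum(hosts_per_site.values())
    largest = max(hosts_per_site.values())
    return (100 * largest + total - 1) // total  # integer ceiling division


print(failover_capacity_percent({"Site A": 2, "Site B": 2}))  # the 2 + 2 host example
```

With uneven sites the number moves away from 50%, which is one reason to keep the sites symmetrical.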

30

HA Design Considerations – Isolation Response

Isolation response

• Configure it based on your infrastructure!

• We cannot make this decision for you, however…

31

HA Design Considerations – Isolation Addresses

Isolation addresses

• Specify two, one at each site, using the advanced setting “das.isolationaddress”

• Note that “default gateway” is an isolation address already!

[Diagram: isolation address 01 and isolation address 02, one at each site]
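A sketch of what the two isolation addresses could look like as cluster advanced options, plus a check that every site is covered. The numbered key names follow the das.isolationaddress setting named on the slide; the IP addresses, the site mapping, and the helper function are made up for illustration.

```python
# Hypothetical values: documentation-range IPs standing in for one pingable
# address per site.
ha_advanced_options = {
    "das.isolationaddress0": "192.0.2.1",     # assumed to live in Site A
    "das.isolationaddress1": "198.51.100.1",  # assumed to live in Site B
}

address_site = {"192.0.2.1": "A", "198.51.100.1": "B"}


def sites_covered(options: dict, site_of_address: dict) -> set:
    """Return the set of sites that have at least one isolation address."""
    return {
        site_of_address[value]
        for key, value in options.items()
        if key.startswith("das.isolationaddress")
    }


missing = {"A", "B"} - sites_covered(ha_advanced_options, address_site)
print("sites without an isolation address:", sorted(missing))
```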

32

HA Design Considerations – Heartbeat Datastores

Each site needs a heartbeat datastore defined to ensure each site can update heartbeat region for storage local to that site

With multiple storage systems consider increasing default from 2 to 4 => 2 per site
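The “2 per site” idea can be sketched as a small planning helper. The helper and the datastore names are hypothetical; das.heartbeatDsPerHost is the HA advanced option that raises the number of heartbeat datastores per host from the default of 2.

```python
def pick_heartbeat_datastores(datastores_by_site: dict, per_site: int = 2) -> list:
    """Pick up to `per_site` datastores from each site, in a stable order."""
    picked = []
    for site in sorted(datastores_by_site):
        picked.extend(sorted(datastores_by_site[site])[:per_site])
    return picked


inventory = {
    "A": ["ds-a2", "ds-a1", "ds-a3"],
    "B": ["ds-b1", "ds-b2"],
}
preferred = pick_heartbeat_datastores(inventory)

# Cluster advanced option matching the number of datastores picked.
ha_advanced_options = {"das.heartbeatDsPerHost": len(preferred)}
print(preferred, ha_advanced_options)
```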

33

HA Design Considerations – Restart Order

You can use “restart priority” to determine restart order

This applies even when there is no contention

This is only about the order in which restarts occur, not about when a VM has finished booting
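Restart priority amounts to an ordering of the restart queue. A minimal sketch, assuming three priority levels and hypothetical VM names; it orders the restart requests and says nothing about when a guest is actually up.

```python
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def restart_order(vms: list) -> list:
    """Order (name, priority) pairs by restart priority.

    sorted() is stable, so VMs with the same priority keep their input order.
    """
    return [name for name, priority in sorted(vms, key=lambda vm: PRIORITY_RANK[vm[1]])]


vms = [("web", "low"), ("db", "high"), ("app", "medium"), ("dns", "high")]
print(restart_order(vms))
```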

34

Operations - Maintaining the Configuration

Storage Device <-> DRS Affinity Group Mappings

Validate DRS Affinity regularly

Are there VM dependencies? Co-locate!

Remember HA doesn’t speak vApp (won’t respect restart order)

…automate if you can!

Some vendors offer tools

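Validating the storage-to-affinity-group mapping is easy to automate once the inventory is exported. The sketch below assumes three plain dictionaries as input (how they are collected is out of scope here), and all names in it are hypothetical.

```python
def find_misaligned_vms(vm_group_site: dict, vm_datastore: dict,
                        datastore_site: dict) -> list:
    """List VMs whose DRS VM group site differs from their datastore's site.

    Returns tuples of (vm, group_site, datastore, datastore_site).
    """
    misaligned = []
    for vm, group_site in sorted(vm_group_site.items()):
        datastore = vm_datastore[vm]
        storage_site = datastore_site[datastore]
        if storage_site != group_site:
            misaligned.append((vm, group_site, datastore, storage_site))
    return misaligned


report = find_misaligned_vms(
    vm_group_site={"vm1": "A", "vm2": "B"},
    vm_datastore={"vm1": "ds-a", "vm2": "ds-a"},
    datastore_site={"ds-a": "A", "ds-b": "B"},
)
print(report)
```

Running a check like this on a schedule catches VMs that drifted after a Storage vMotion or a manual group change.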

35

Failure Scenarios

36

Face Your Fears!

Understand the possibilities

Test them

Test them again and keep going until they feel normal!

[Graphic labels: VM mobility, PARTITION]

37

Scenario - Single Host Failure (Non-Uniform)

[Diagram: Site A and Site B joined by fabric and management networks; Storage A LUN (R/W) and Storage B LUN (R/W), distributed over FC / IP; the failed host is marked X]

A normal HA event

No network or datastore heartbeats

Host will be declared dead

All VMs will be restarted

Could violate affinity rules

38

Scenario - Full Compute Failure in One Site (Non-Uniform)

[Diagram: Site A and Site B joined by fabric and management networks; Storage A LUN (R/W) and Storage B LUN (R/W), distributed over FC / IP; the failed hosts are marked X]

Normal HA event

No datastore or network heartbeats

All virtual machines will be restarted

Note, max 32 concurrent restarts per host

“Sequencing” start up order!

Will violate affinity rules! (should rule)
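The limit of 32 concurrent restarts per host gives a rough lower bound on how many restart “waves” a full-site failure needs. The calculation below is a back-of-the-envelope sketch with a hypothetical function name; it ignores restart priority, resource constraints, and retries.

```python
MAX_CONCURRENT_RESTARTS_PER_HOST = 32  # figure quoted on the slide


def minimum_restart_waves(vm_count: int, surviving_hosts: int) -> int:
    """Lower bound on the number of restart batches after a site failure."""
    if surviving_hosts <= 0:
        raise ValueError("need at least one surviving host")
    slots = MAX_CONCURRENT_RESTARTS_PER_HOST * surviving_hosts
    return (vm_count + slots - 1) // slots  # integer ceiling division


# 100 VMs from the failed site, two surviving hosts in the other site.
print(minimum_restart_waves(100, 2))
```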

39

Scenario - Storage Partition (Uniform)

[Diagram: stretched cluster; Storage A LUN (R/W) and Storage B LUN (R/O) over FC / IP; fabric and management networks between Site A and Site B; the storage partition is marked X]

Virtual machines remained running with no impact!

Will virtual machines be restarted on the other site?

• No, network heartbeats are still being received!

40

Scenario - Storage Partition (Non-uniform)

[Diagram: stretched cluster; Storage A LUN (R/W) and Storage B LUN (R/W) over FC / IP; fabric between Site A and Site B; one site is marked preferred; the storage partition is marked X and PDL]

Virtual machines remained running with no impact!

Will virtual machines be restarted on the other site?

• Yes, PDL sense code issued.

• VM will be killed

• HA will detect and restart!

41

Permanent Device Loss (PDL) Requirements (1/2)

Ensure PDL enhancements are configured

• Cluster Advanced Option

• Set “das.maskCleanShutdownEnabled” to “true”, in advanced settings

• Set to “false” by default in 5.0, change it!

• Set to “true” by default in 5.1 and up

42

Permanent Device Loss (PDL) Requirements (2/2)

Ensure PDL enhancements are configured

• ESXi Host Level changes

• 5.1 and earlier: Set “disk.terminateVMOnPDLDefault” to “true” in “/etc/vmware/settings”

• 5.5 and up: Set advanced setting “VMkernel.Boot.terminateVMOnPDL”
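The version-dependent PDL settings from the last two slides can be summarised as a lookup. The function and the dictionary layout are hypothetical. The value True for “VMkernel.Boot.terminateVMOnPDL” is an assumption, since the slide names the setting without a value.

```python
def pdl_settings(version: str) -> dict:
    """Return the PDL-related settings to change for a given vSphere version.

    Keys describe where the setting lives; values map setting name -> value.
    """
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    settings = {}
    if major_minor >= (5, 5):
        # Value assumed; the slide only names the advanced setting.
        settings["host advanced setting"] = {"VMkernel.Boot.terminateVMOnPDL": True}
    else:
        settings["/etc/vmware/settings"] = {"disk.terminateVMOnPDLDefault": True}
    if major_minor < (5, 1):
        # Defaults to "false" in 5.0 and to "true" from 5.1 onwards.
        settings["cluster advanced option"] = {"das.maskCleanShutdownEnabled": True}
    return settings


for release in ("5.0", "5.1", "5.5"):
    print(release, pdl_settings(release))
```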

43

Scenario - Datacenter Partition (Uniform) (1/3)

[Diagram: stretched cluster; Storage A LUN (R/W) and Storage B LUN (R/O) over FC / IP; fabric between Site A and Site B; the inter-site links are marked X]

Virtual machines remained running with no impact!

Remember the affinity rules

Without affinity rules this would result in APD condition…

44

Scenario - Datacenter Partition (Uniform) (2/3)

[Diagram: stretched cluster; Storage A LUN (R/W) and Storage B LUN (R/O) over FC / IP; fabric and management networks between Site A and Site B; the inter-site links are marked X]

Affinity rule was violated

Same VM restarted in Site A

Results in APD for Site B

Same VM

Same IP address

Same name

Yes, could result in weird behavior!

45

Scenario - Datacenter Partition (Uniform) (3/3)

• VM restarted in site with “storage site-affinity”

• Now you have two active instances of same VM!

• When partition is lifted, VM will be killed!

46

Scenario - Loss of full datacenter (Non-Uniform)

[Diagram: stretched cluster; Storage A LUN (R/W) and Storage B LUN (R/W), distributed over FC / IP; fabric and management networks between Site A and Site B]

All virtual machines will be restarted

Note: in many cases this requires manual intervention from a storage perspective!

HA will retry 5 times and has a compatibility list

Run DRS when site returns, to apply affinity rules and balance load!

47

Wrapping Up

48

Key Takeaways

Design a cluster that meets your needs, and don’t forget operations!

Understand that HA / DRS play a key part in your vMSC success

Testing is critical, don’t just test the easy stuff!

Document process changes, gain operational acceptance

Do not assume it is “Next > Next > Finish”

Ongoing maintenance/checks will be required

Automate as much as you can!

49

Questions?

50

Other VMware Activities Related to This Session

Group Discussions:

BCO1001-GD

Stretched Clusters for Availability with Lee Dilworth

BCO4872

THANK YOU
