
  • 7/27/2019 BCO1159-Architecting and Operating a VMware vSphere Metro Storage Cluster_Final_US.pdf


    Architecting and Operating a VMware vSphere Metro Storage Cluster

    Lee Dilworth, VMware, Inc.

    Duncan Epping, VMware, Inc.

    INF-BCO1159

    #vmworldinf


    Disclaimer

    This session may contain product features that are currently under development.

    This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

    Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

    Technical feasibility and market demand will affect final delivery.

    Pricing and packaging for any new technologies or features discussed or presented have not been determined.


    © 2011 VMware Inc. All rights reserved

    Architecting and Operating a vSphere Metro Storage Cluster (vMSC)

    Lee Dilworth, Principal SE, VMware (Twitter: @LeeDilworth)

    Duncan Epping, Principal Architect, VMware (Twitter: @DuncanYB)


    Interact!

    If you use Twitter, feel free to tweet about this session (#BCO1159)

    Take pictures and share them on Twitter / Facebook

    Signed copy of the vSphere 5.1 Clustering Deepdive for the best picture

    Ask questions!

    Signed copy of the vSphere 5.1 Clustering Deepdive for the best question

    Blog about it

    We would love to read your thoughts, your opinions, and your design decisions!


    vSphere Metro Storage Cluster


    What's This All About - Typical vSphere vMSC Setup

    [Diagram labels: vCenter Server; vSphere Cluster; Site A hosts (ESXi, ESXi, ESXi, ESXi); Site A Datastore; Site B hosts (ESXi, ESXi, ESXi, ESXi); Site B Datastore; vMotion; Active / Active Storage]


    What Is a vSphere Metro Storage Cluster?

    Stretched Cluster Solution

    Requires:

    Storage system that stretches across sites

    Stretched network across sites

    Hardware Compatibility List (HCL) Certified vMSC

    iSCSI Metro Cluster Storage

    FC Metro Cluster Storage


    Latency Support Requirements

    ESXi management network: maximum supported latency is 10 milliseconds round-trip time (RTT)

    Note: 10 ms is supported with Enterprise+ licenses only (Metro vMotion); the default is 5 ms

    Synchronous storage replication link: maximum supported latency is 5 milliseconds RTT

    Note: some storage vendors have different support requirements!

    CAMPUS

    METRO / SYNC
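
    The limits on this slide can be turned into a simple pre-flight check. The following is a minimal sketch, assuming RTT values have already been measured in milliseconds; the function name and parameters are illustrative only and are not part of any VMware tool. Storage vendors may publish different limits, so treat the 5 ms storage figure as a default to override.

```python
def check_vmsc_latency(mgmt_rtt_ms, storage_rtt_ms, enterprise_plus=False):
    """Return a list of violations; an empty list means both links are in range."""
    # Limits as stated on the slide: 10 ms management RTT with Enterprise+
    # (Metro vMotion), 5 ms otherwise; 5 ms RTT for synchronous replication.
    mgmt_limit_ms = 10.0 if enterprise_plus else 5.0
    storage_limit_ms = 5.0

    problems = []
    if mgmt_rtt_ms > mgmt_limit_ms:
        problems.append(
            f"management RTT {mgmt_rtt_ms} ms exceeds {mgmt_limit_ms} ms"
        )
    if storage_rtt_ms > storage_limit_ms:
        problems.append(
            f"storage replication RTT {storage_rtt_ms} ms exceeds {storage_limit_ms} ms"
        )
    return problems


if __name__ == "__main__":
    # 8 ms on the management network is only acceptable with Enterprise+.
    print(check_vmsc_latency(8, 4, enterprise_plus=True))
    print(check_vmsc_latency(8, 4, enterprise_plus=False))
```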


    Two Different Architectures (1/2)

    Uniform host access configuration

    ESXi hosts from both sites are all connected to a storage node in the storage cluster across all sites. Paths are stretched across distance.


    Two Different Architectures (2/2)

    Non-Uniform host access configuration

    ESXi hosts in each site are connected only to storage node(s) in the same site.

    Paths are limited to local site.


    Architecting a

    vSphere Metro Storage Cluster


    Sounds Simple, Right?

    No, think about the whole solution: it's NOT just storage

    vSphere HA is not site aware!

    vSphere DRS is not site aware!

    vSphere Storage DRS is not site aware!


    HA & DRS Site Awareness

    DRS / HA

    What they think...

    What you've actually got...

    DRS

    HA ????


    Other Network Considerations

    Network teams usually don't like the words "Stretch" and "Cluster"

    Network options are changing (OTV, EoMPLS)

    L3 routing impacts (and options: LISP?)

    Site-to-site vMotion: handle carefully

    Co-locate multi-VM applications

    Consider application users: site affinity affects data flow too!

    Consider east-west traffic

    Ingress point to the network? Load balanced / redundant?


    Will Use Our Environment to Illustrate

    Two sites

    Four hosts in total

    Stretched network

    Stretched storage

    One vCenter Server

    One vSphere HA Cluster


    Site Awareness - Why Should I Care?

    VM to storage mapping

    Operational Simplicity

    Application Resiliency

    Site Affinity / Locality matters!


    Site Awareness - Using DRS Affinity

    Host group required per site

    Groups require ongoing management

    Consider multi-tier apps

    Group dependent VMs

    [Diagram labels: Site A VM group, Site B VM group, Site A host group, Site B host group]


    DRS Affinity - Design Considerations

    Use the "should" rules

    HA does not violate "must" rules, therefore avoid them for these configurations
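
    To make the difference concrete, here is a deliberately simplified model of which hosts remain restart candidates after a failure. It is a sketch of the behavior described on this slide, not HA's actual placement algorithm; the function name and the host names are hypothetical.

```python
def restart_candidates(vm_host_group, all_hosts, alive_hosts, rule):
    """Return hosts on which a VM could be restarted under a VM-Host rule.

    rule == "must":   HA never violates the rule, so only surviving hosts
                      inside the VM's host group qualify.
    rule == "should": HA may violate the rule, so any surviving host qualifies
                      (DRS can move the VM back later).
    """
    if rule not in ("should", "must"):
        raise ValueError("rule must be 'should' or 'must'")
    alive = [h for h in all_hosts if h in alive_hosts]
    if rule == "must":
        return [h for h in alive if h in vm_host_group]
    return alive


if __name__ == "__main__":
    site_a = ["esx-a1", "esx-a2"]          # hypothetical host names
    site_b = ["esx-b1", "esx-b2"]
    survivors = set(site_b)                # full compute failure in site A
    print(restart_candidates(site_a, site_a + site_b, survivors, "must"))
    print(restart_candidates(site_a, site_a + site_b, survivors, "should"))
```

    With a "must" rule the candidate list is empty once site A is lost, which is why the slide recommends "should" rules for stretched clusters.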


    Site Awareness - Using SDRS & Datastore Clusters

    Cluster datastores based on site affinity

    Avoid unnecessary site-to-site migrations

    Set SDRS to Manual, take control, migration *could* impact availability

    Align VMs with storage / site boundary

    Group *similar* devices!


    HA Design Considerations - Admission Control

    What about Admission Control?

    We typically recommend setting it to 50%, to allow full site fail-over

    Admission control is not resource management

    Only guarantees power-on
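
    One way to see where the 50% figure comes from: reserve enough capacity to absorb the loss of the largest site. The sketch below assumes identically sized hosts and counts hosts only; the function name is illustrative. As the slide notes, this covers the ability to power VMs on, not their performance afterwards.

```python
def failover_capacity_percent(hosts_per_site):
    """Percentage of cluster capacity to reserve to survive loss of the largest site.

    hosts_per_site maps a site name to its number of (equally sized) hosts.
    """
    total = sum(hosts_per_site.values())
    if total == 0:
        raise ValueError("cluster has no hosts")
    largest_site = max(hosts_per_site.values())
    return 100.0 * largest_site / total


if __name__ == "__main__":
    # Two sites with two hosts each, as in the four-host environment shown earlier.
    print(failover_capacity_percent({"Site A": 2, "Site B": 2}))  # 50.0
```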


    HA Design Considerations - Isolation Addresses

    Isolation Addresses

    Specify two, one at each site, using the advanced setting das.isolationaddress

    [Diagram labels: isolationaddress 01, isolationaddress 02]
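
    As an illustration of "one per site", the sketch below builds the advanced-option entries from a site-to-address map. The numbered suffix (das.isolationaddress0, das.isolationaddress1) is an assumption based on the numbered labels in the diagram, so confirm the exact option names against the vSphere HA documentation for your version. The helper name and the addresses (taken from the reserved documentation ranges) are hypothetical.

```python
import ipaddress


def build_isolation_options(site_addresses):
    """Map one isolation address per site to numbered das.isolationaddress options."""
    if len(site_addresses) < 2:
        raise ValueError("specify one isolation address at each site")
    options = {}
    # Sort by site name so the numbering is deterministic.
    for index, (site, address) in enumerate(sorted(site_addresses.items())):
        ipaddress.ip_address(address)  # raises ValueError if malformed
        options[f"das.isolationaddress{index}"] = address
    return options


if __name__ == "__main__":
    print(build_isolation_options({
        "site-a": "192.0.2.1",      # hypothetical pingable address in site A
        "site-b": "198.51.100.1",   # hypothetical pingable address in site B
    }))
```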


    HA Design Considerations - Heartbeat Datastores

    Each site needs a heartbeat datastore defined to ensure each site can update the heartbeat region for storage local to that site

    With multiple storage systems, consider increasing the default from 2 to 4 => 2 per site


    HA Design Considerations - Permanent Device Loss (PDL)

    Ensure PDL enhancements are configured

    ESXi hosts - Set disk.terminateVMOnPDLDefault to true in /etc/vmware/settings

    Cluster advanced option - Set das.maskCleanShutdownEnabled to true in advanced settings
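
    Because both settings must be in place for the PDL behavior to take effect, a configuration audit might check them together. This is a minimal sketch, assuming the host file uses simple "key = value" lines and that the cluster's advanced options have already been read into a dictionary; both helper names are hypothetical.

```python
def parse_settings(text):
    """Parse 'key = value' lines into a dict with lower-cased keys and values."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip('"').lower()
    return settings


def pdl_enhancements_enabled(host_settings_text, cluster_options):
    """True only if both the host-level and the cluster-level setting are true."""
    host = parse_settings(host_settings_text)
    cluster = {k.lower(): str(v).lower() for k, v in cluster_options.items()}
    return (
        host.get("disk.terminatevmonpdldefault") == "true"
        and cluster.get("das.maskcleanshutdownenabled") == "true"
    )


if __name__ == "__main__":
    host_file = "disk.terminateVMOnPDLDefault = TRUE\n"
    print(pdl_enhancements_enabled(host_file, {"das.maskCleanShutdownEnabled": True}))
```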


    HA Design Considerations - Split Brain

    vSphere 5.0 HA master / slave concept

    By default, 1 master, responsible for HA restarts

    If the master fails, a new one is elected in ~15 seconds

    On partition there will be TWO masters


    HA Design Considerations - Isolation Response

    Isolation response

    Configure it based on your infrastructure!

    We cannot make this decision for you, however


    Operating a

    vSphere Metro Storage Cluster


    Maintaining the Configuration

    HA / DRS settings (per-VM)

    DRS Affinity Group Members

    VM Dependencies - Co-Locate?

    Restart Priorities (HA)

    Remember HA doesn't speak vApp (won't respect restart order)

    Should certain VMs be able to roam?

    Storage Device to DRS Affinity Group Mappings

    Storage Device Split Brain / Detachment rules?

    ...automate if you can!!!!

    [Image labels: DRS, HA]


    So What About Automation / Orchestration?

    Automation / orchestration is key

    Automate virtual machine provisioning

    Validate virtual machine placement

    Validate the VM-Host rules

    Validate the Datastore cluster

    Some vendors offer tools!
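
    The "validate virtual machine placement" step can be sketched as a comparison between where each VM should live and where it is currently running. This is a minimal illustration, assuming the inventory has already been exported into plain dictionaries; the function name, VM names, and host names are hypothetical.

```python
def find_misplaced_vms(vm_site, host_site, vm_host):
    """Return (vm, expected_site, actual_site) for every VM outside its home site.

    vm_site:   VM name -> site it is supposed to run in (from the VM groups)
    host_site: host name -> site the host belongs to (from the host groups)
    vm_host:   VM name -> host it is currently running on
    VMs that belong to no group are reported with expected_site None.
    """
    misplaced = []
    for vm, host in sorted(vm_host.items()):
        expected = vm_site.get(vm)
        actual = host_site.get(host)
        if expected is None or expected != actual:
            misplaced.append((vm, expected, actual))
    return misplaced


if __name__ == "__main__":
    host_site = {"esx-a1": "A", "esx-b1": "B"}
    vm_site = {"web01": "A", "db01": "B"}
    vm_host = {"web01": "esx-b1", "db01": "esx-b1", "new-vm": "esx-a1"}
    for entry in find_misplaced_vms(vm_site, host_site, vm_host):
        print(entry)
```

    Running a check like this on a schedule is one way to spot the configuration drift discussed later in the session.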


    Failure Scenarios


    Face Your Fears!

    Understand the possibilities

    Test them

    Test them again, and keep going until they feel normal!


    Finding Your Fears

    Seek out vendor KB articles

    Review Impact Tables

    Base POC Testing around tables

    Start with biggest impact, get confident with it

    Easy stuff last

    Test with misconfigured VMs

    Restart orders unset

    Incorrect affinity placement

    Learn to spot configuration drift

    Automate as much as possible

    HP/LeftHand KB: 2020097

    EMC VPLEX KB: 2007545

    NetApp KB: 2031038


    Defining Some Failure Terminology

    All Paths Down (APD) - Aaaarghhhh, where has that device gone?

    Incorrect storage removal, i.e. yanked!

    Sudden storage failure

    No time for storage to tell us anything

    Permanent Device Loss (PDL) - Aaahhhh, the device has gone, OK I understand

    Much nicer than APD, graceful handling of state change

    Storage notifies of device state change via SCSI sense code

    Allows HA to fail over VMs

    Split Brain - Hmmm, the other half has disappeared, now what?

    Election of second HA master

    Check heartbeat datastore region

    Restart VMs (if needed)


    More on PDL (And Why Does It Matter?)

    Permanent Device Loss is a specific condition issued by the array through a SCSI sense code

    Virtual machines that do I/O will be killed

    Only when disk.terminateVMOnPDLDefault and das.maskCleanShutdownEnabled have been set to true!

    2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055:handle 8198(vscsi4:0):opened by wid 4499 (vmm0:fri-iscsi-02) has Permanent Device Loss. Killing world group leader 4491

    What about an APD, will this help?

    All Paths Down is a different condition.

    No action will be taken by HA during an APD event.
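
    During failure testing it can help to scan host logs for the PDL message shown on this slide. The sketch below matches that one message shape only; the pattern is derived from the single sample line above, and the reading of "vmm0:<name>" as the affected virtual machine is an assumption. The function name is hypothetical.

```python
import re

# Built from the sample log line on the slide; other ESXi versions may log differently.
PDL_PATTERN = re.compile(
    r"\(vmm\d+:(?P<vm>[^)]+)\)\s*has Permanent Device Loss\.\s*"
    r"Killing world group leader (?P<leader>\d+)"
)


def find_pdl_kills(log_text):
    """Return (vm_name, world_group_leader_id) for each PDL kill message found."""
    return [
        (match.group("vm"), int(match.group("leader")))
        for match in PDL_PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055:handle "
        "8198(vscsi4:0):opened by wid 4499 (vmm0:fri-iscsi-02) "
        "has Permanent Device Loss. Killing world group leader 4491"
    )
    print(find_pdl_kills(sample))
```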


    Scenario - Single Host Failure

    A normal HA event

    No network or datastore heartbeats

    Host will be declared dead

    All VMs will be restarted

    Could violate affinity rules


    Scenario - Full Compute Failure in One Site

    Normal HA event

    No datastore or network heartbeats

    All virtual machines will be restarted

    Note: max 32 concurrent restarts per host

    Sequencing start-up order!

    Will violate affinity rules! (should rule)


    Side Step - Sequencing

    One thing to point out with regard to the start-up order:

    1. Agent virtual machines

    2. FT secondary virtual machines

    3. Virtual Machines configured with a restart priority of high

    4. Virtual Machines configured with a medium restart priority

    5. Virtual Machines configured with a low restart priority

    This is no guarantee: if a restart attempt fails, HA continues with the next virtual machine!
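
    The ordering above, together with the "no guarantee" caveat, can be modeled in a few lines. This is a simplified sketch rather than HA's implementation: the category labels, function names, and the power_on callback are all hypothetical.

```python
# Start-up order from the slide, lowest number first.
RESTART_ORDER = {"agent": 0, "ft_secondary": 1, "high": 2, "medium": 3, "low": 4}


def restart_sequence(vms):
    """Order (name, category) pairs by restart category; ties keep input order."""
    ordered = sorted(vms, key=lambda vm: RESTART_ORDER[vm[1]])
    return [name for name, _category in ordered]


def restart_all(vms, power_on):
    """Try each VM in order; a failed attempt does not block the next VM."""
    failed = []
    for name in restart_sequence(vms):
        if not power_on(name):
            failed.append(name)
    return failed


if __name__ == "__main__":
    inventory = [("web", "low"), ("db", "high"), ("vsa", "agent"), ("app", "medium")]
    print(restart_sequence(inventory))
    # Pretend the database fails to power on; the remaining VMs are still attempted.
    print(restart_all(inventory, power_on=lambda name: name != "db"))
```

    In this model the application tier still starts even though the database did not, which is exactly why restart priority alone should not be relied on for dependency ordering.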


    Scenario - Disk Shelf Failure

    No impact on virtual machines

    Instant switch by storage stack!

    Might incur latency for virtual machines in Frimley

    No HA response required


    Scenario - Storage Partition

    Virtual machines remained running with no impact!

    Remember the affinity rules

    Without affinity rules this would result in an APD condition

    Will virtual machines be restarted on the other site?

    Network heartbeats!


    Scenario - Datacenter Partition (1/2)

    Virtual machines remained running with no impact!

    Remember the affinity rules

    Without affinity rules this would result in an APD condition

    Will virtual machines be restarted from the other site?

    Storage not accessible!


    Scenario - Datacenter Partition (2/2) - Restart of a VM!

    But what if affinity rules were violated?

    Your virtual machine would be available in both sites!


    Scenario - Loss of Full Datacenter (1/2)

    All virtual machines will be restarted

    Note: in many cases this requires manual intervention from a storage perspective!

    Run DRS when the site returns, to apply affinity rules and balance load!


    Scenario - Loss of Full Datacenter (2/2)

    What if the manual fail-over of storage is slow?

    HA retries 5 times by default, over a period of ~30 minutes

    HA keeps a compatibility list so it knows where it can restart what

    The compatibility list contains VM / datastore / portgroup details

    In many cases stretched architectures offer a witness

    Used to determine the problem!


    How About Combining vMSC with

    Site Recovery Manager


    Site Recovery Manager

    [Diagram labels: Metro vMSC Site(s), DR Site (SRM)]


    And What About vCloud Director?

    Database considerations

    Multiple vCD cells!

    NFS share for the cells?

    VXLAN

    Keep in mind that VXLAN needs Edge

    Horse shoe!

    Single Point of Failure


    Key Takeaways

    Design a cluster that meets your needs - don't forget Ops!

    Understand that HA / DRS play a key part in your vMSC success

    Testing is critical, don't just test the easy stuff!

    Document process changes, gain operational acceptance

    Do not assume it is Next > Next > Finish

    Ongoing maintenance/checks will be required

    Automate as much as you can!


    Thank You


    FILL OUT A SURVEY

    EVERY COMPLETE SURVEY IS ENTERED INTO A DRAWING FOR A $25 VMWARE COMPANY STORE GIFT CERTIFICATE
