Chapter 11: Planning for Metropolitan Site Resiliency


    Microsoft Lync Server 2010

    Published: March 2012


This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice.

Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

Copyright 2012 Microsoft Corporation. All rights reserved.


Contents

Planning for Metropolitan Site Resiliency
    The Metropolitan Site Resiliency Solution
        Overview
        Prerequisites
    Test Methodology
        Site Resiliency Topology
            Servers in the Metropolitan Site Resiliency Topology
            Hardware Load Balancers
            WAN/SAN Latency Simulator
            DNS
            Database Storage
        Test Load
        Expected Client Sign-In Behavior
        Test Results
    Findings and Recommendations
        Failback Procedure Recommendations
    Performance Monitoring Counters And Numbers
    DNS and HLB Topology Reference
    Acknowledgements and References


Planning for Metropolitan Site Resiliency

If you require Microsoft Lync Server 2010 communications software to be always available, even in the event of a severe disaster at one geographical location in your organization, you can follow the guidelines in this section to create a topology that offers metropolitan site resiliency.

    In this topology, Lync Server 2010 pools span two geographically separate locations. In such a

    topology, even catastrophic server failure in one location would not seriously disrupt usage,

    because all connection requests would automatically be directed to servers in the same pool but

    at the second location. The site resiliency solution described in this section is designed

    specifically for this split-pool topology and is supported by Microsoft subject to the constraints

    mentioned in Findings and Recommendations.

If your environment does not meet the requirements described in this document, see Planning for Enterprise Voice Resiliency for recommendations about providing resiliency for your Enterprise Voice workload.

    Unless specifically stated otherwise, all server roles have been installed according to the product

    documentation. For details, see Deployment in the Deployment documentation.

    In This Section

The Metropolitan Site Resiliency Solution provides an overview of the tested and supported site resiliency solution.

Test Methodology describes the testing topology, expected behavior, and test results.

Findings and Recommendations provides practical guidance for deploying your own failover solution.

    Notes:

    This section does not include specific procedures for deploying the products that are used in the

    solution. Specific deployment requirements are likely to vary so much among different customers

    that step-by-step instructions are likely to be incomplete or misleading. For step-by-step

    instructions, see the product documentation for the various software and hardware used in this

    solution.

    To successfully follow the topics in this section, you should have a thorough understanding of

    Lync Server 2010 and Windows Server 2008 R2 Failover Clustering.

    The Metropolitan Site Resiliency Solution

    This section describes the tested and supported metropolitan site resiliency solution, including

    prerequisites, topology, and individual components. For details about planning and deploying

    Windows Server 2008 R2 and Lync Server 2010, see the documentation for these products. For

    details about third-party components, see Database Storage and the product documentation

    provided by the makers of those components.


    In This Section

    Overview

    Prerequisites

    Overview

    The metropolitan site resiliency solution described in this section entails the following:

    Splitting the Front End pool between two physical sites, hereafter called North and South.

    In Topology Builder, these two geographical sites are configured as one single Lync Server

    2010 site.

    Creating separate geographically dispersed clusters (physically separated Windows

    Server 2008 R2 failover clusters) for the following:

    Back End Servers

    Group Chat Database Servers

    File Servers

    Deploying a Windows Server 2008 R2 file share witness to which all server clusters are

    connected. To determine where to place the file share witness, refer to the Windows Server

    2008 R2 failover cluster documentation at http://go.microsoft.com/fwlink/?LinkId=211216.

    Enabling synchronous data replication between the geographically dispersed clusters.

    Deploying servers running certain server roles in both sites. These roles include Front

    End Server, A/V Conferencing Server, Director, Edge Server, and Group Chat Server. The

    servers of each type in both sites are contained within one pool of that type, which crosses

    both sites. Except for Group Chat Server, all servers of these types, in both sites, are active.

For Group Chat Server, only the servers in one site can be active at a time. The Group Chat Servers in the other site must be inactive.

Additionally, Monitoring Server and Archiving Server can be deployed in both sites; however, only the Monitoring Server and Archiving Server in one site are associated with the other servers in your deployment. The Monitoring Server and Archiving Server in the other site are deployed but not associated with any pools, and they serve as a "hot" backup.

    The following figure provides an overview of the resulting topology.


    With the topology depicted in the preceding figure, a single site could become unavailable for any

    reason, and users would still be able to access supported unified communications services within

    minutes rather than hours. For a detailed depiction of the topology used to test the solution

    described in this section, see Site Resiliency Topology.

    Scope of Testing and Support

    This site resiliency solution has been tested and is supported by Microsoft for the following

    workloads:

    IM and presence

    Peer-to-peer scenarios; for example, peer-to-peer audio/video sessions

    IM conferencing

    Web conferencing

    A/V conferencing


    Application sharing

    Enterprise Voice and Telephony Integration

    Enterprise Voice applications, including Conferencing Attendant, Conferencing

    Announcement service, Outside Voice Control, and Response Group service

    Approved unified communications devices

    Simple URLs

    Group Chat

    Exchange UM

    Workloads That Are Out of Scope

    The following scenarios can be deployed in the metropolitan site resiliency topology, but the

    automatic failover of these workloads is not designed or supported:

    Federation and Public IM Connectivity

    Remote call control

    Microsoft Lync Web App

    XMPP Gateway

    Prerequisites

    The solution described in this section assumes that your Lync Server deployment meets both the

    core requirements described in the product documentation and all of the following prerequisites.

    To qualify for Microsoft support, your failover solution must meet all these prerequisites.

    All servers that are part of geographically dispersed clusters must be part of the same

stretched VLAN, using the same Layer-2 broadcast domain. All other internal servers running Lync Server server roles can be on a subnet within that server's local data center.

    Edge Servers must be in the perimeter network, and should be on a different subnet than the

    internal servers. Also, the perimeter network need not be stretched between sites.

    Synchronous data replication must be enabled between the primary and secondary sites,

    and the vendor solution that you employ must be supported by Microsoft.

    Round-trip latency between the two sites must not be greater than 20 ms.

    Available bandwidth between the sites must be at least 1 Gbps.

    A geographically dispersed cluster solution based on Windows Server 2008 R2 Failover

    Clustering must be in place. That solution must be certified and supported by Microsoft, and it

    must pass cluster validation as described in the Windows Server 2008 R2 documentation.

For details, see the "What is cluster validation?" section of Failover Cluster Step-by-Step Guide: Validating Hardware for a Failover Cluster at http://go.microsoft.com/fwlink/?LinkId=142436.

    All geographically dispersed cluster servers must be running the 64-bit edition of

    Windows Server 2008 R2.

    All your servers that are running Lync Server must run the Lync Server 2010 version.

    All database servers must be running the 64-bit edition of one of the following:

Test Methodology

Site Resiliency Topology

Depending on the components you choose for your particular implementation of this solution, you might need help from your vendor of choice to deploy this solution.

    This figure is representative of the topology tested, but for purposes of clarity, it does not

    necessarily depict the number of servers used in each pool in the actual test topology. For

    example, in the actual test topology there were four Front End Servers in each site.

    As shown in the figure, the tested topology deployed two central sites and a branch office, along

    with a third location that hosted a file server functioning as a Windows Server 2008 R2 Failover

    Clustering Service file share witness. For details about using a witness in a failover cluster, see

http://go.microsoft.com/fwlink/?LinkId=211004. The file share witness is available to all Windows

    Server 2008 R2 Failover Cluster nodes in both central sites. All Windows Server 2008 R2

    Failover Clusters used in this solution use the Node and File Share Majority quorum mode.

The following topics discuss each of the solution components shown in the preceding figure.


    In This Section

    Servers in the Metropolitan Site Resiliency Topology

    Hardware Load Balancers

    WAN/SAN Latency Simulator

    DNS

    Database Storage

    Servers in the Metropolitan Site Resiliency Topology

    The metropolitan site resiliency topology can include different types of server roles, as follows.

    Front End Pool

    This pool hosts all Lync Server users. Each site, North and South, contains four identically

configured Front End Servers. The Back-End Database is deployed as two Active/Passive SQL Server 2008 geographically dispersed cluster nodes, running on the Windows Server 2008 R2 Failover Clustering service. Synchronous data replication is required between the two Back-End Database Servers.

In our test topology, the Mediation Server was collocated with the Front End Server. Topologies with a stand-alone Mediation Server are also supported.

    Our test topology used DNS load balancing to balance the SIP traffic in the pool, with hardware

    load balancers deployed for the HTTP traffic.

    Topologies that use only hardware load balancers to balance all types of traffic are also supported

    for site resiliency.

    A/V Conferencing Pool

    We deployed a single A/V Conferencing pool with four A/V Conferencing Servers, two in each

    site.

    Director Pool

    We deployed a single Director pool with four Directors, two in each site.

    Edge Pool

    The Edge Servers ran all services (Access Edge service, A/V Conferencing Edge service, and

    Web Conferencing Edge service), but we tested them only for remote-user scenarios. Federation

    and public IM connectivity are beyond the scope of this document.

    We recommend DNS load balancing for your Edge pool, but we also support using hardware load

    balancers. The internal Edge interface and external Edge interface must use the same type of

    load balancing. You cannot use DNS load balancing on one Edge interface and hardware load

    balancing on the other Edge interface. If you use hardware load balancers for the Edge pool, the

    hardware load balancer at one site serves as the primary load balancer and responds to requests

    with the virtual IP address of the appropriate Edge service. If the primary load balancer is

    unavailable, the secondary hardware load balancer at the other site would take over. Each site

    has its own IP subnet; perimeter networks were not stretched across the North and South sites.

    Group Chat Servers


    Each site hosts both a Channel service and a Lookup service, but these services can be active in

    only one of the sites at a time. The Channel service and the Lookup service in the other site must

    be stopped or disabled. In the event of site failover, manual intervention is required to start these

    services at the failover site.

Each site also hosts a Compliance Server, but only one of these servers can be active at a time. In the event of site failover and failback, manual intervention is required to restore the service. For details, see Backing Up the Compliance Server in the Operations documentation.

    We deployed the Group Chat back-end database as two Active/Passive SQL Server 2008

    geographically dispersed cluster nodes running on top of Windows Server 2008 R2 Failover

    Clustering. Data replication between the two back-end database servers must be synchronous. A

    single database instance is used for both Group Chat and compliance data.

    Monitoring Server and Archiving Server

    For Monitoring Server and Archiving Server, we recommend a hot standby deployment. Deploy

    these server roles in both sites, on a single server in each site. Only one of these servers is

    active, and the pools in your deployment are all associated with that active server. The other

    server is deployed and installed, but not associated with any pool.

    If the primary server becomes unavailable, you use Topology Builder to manually associate the

    pools with the standby server, which then becomes the primary server.

    File Server Cluster

    We deployed a file server as a two-node geographically dispersed cluster resource using

    Windows Server 2008 R2 Failover Clustering. Synchronous data replication was required. Any

    Lync Server function that requires a file share and is split across the two sites must use this file

    share cluster. This includes the following:

    Meeting content location

Meeting metadata location

Meeting archive location

    Address Book Server file store

    Application data store

    Client Update data store

    Group Chat compliance file repository

    Group Chat upload files location

    Reverse Proxy

A reverse proxy server is deployed at each site. In our test topology, these servers ran Microsoft Forefront Threat Management Gateway and operated independently of one another. A hardware load balancer was deployed at each site.

    Hardware Load Balancers

    Even when you deploy DNS load balancing, you need hardware load balancers to load balance

    the HTTP traffic to the Front End pools and Director pools.


    Additionally, we deployed hardware load balancers in the perimeter network for the reverse proxy

    servers.

    To provide the highest level of load balancing and high availability, a pair of hardware load

    balancers (HLBs) were deployed with a Global Server Load Balancer (GSLB) at each site. With

all the load balancers in constant communication with each other regarding site and server health, no single device failure at either central site would cause a service disruption for any of

    the users who are currently connected.

This test scenario employed both global server (the F5 BIG-IP GTM) and local server

    (the F5 BIG-IP LTM) HLBs. The global server load balancers were implemented to manage traffic

    to each site based upon central site availability and health, while the local server load balancers

    managed connections within each site to the local servers. This implementation has the following

    advantages:

    Fully-meshed system for the highest level of fault tolerance at a local and global level.

    Complete segmentation of internal and external traffic within the central site.

The ability, if you want, to leverage the hardware to load balance all connections to Front End Servers, Edge Servers, and Directors.

    Although optimal from some perspectives, this deployment does have two distinct disadvantages:

    you need to purchase more HLBs, and the numerous devices create a more complex

    configuration to manage. Consolidation of the load balancing infrastructure is definitely possible

    and in some environments is beneficial. For instance, many deployment designs include a single

    HLB instance or pair in each central site. Although the HLB spans multiple subnets in this design,

the load balancing logic remains the same. F5 produced architectural guidance that explores the tradeoffs between different network designs. For details, see http://go.microsoft.com/fwlink/?LinkId=212143. For details about deployments leveraging HLBs for Lync Server without GSLBs, see the Office Communications Server 2007 R2 Site Resiliency white paper at http://go.microsoft.com/fwlink/?LinkId=211387. The deployments described in that white paper also provide a valid reference architecture for Lync Server 2010.

    By leveraging both local and global load balancers, we achieved both server and site resiliency

    while using a single URL for users to connect to. The GTM resolves a single URL to different IP

    addresses based on the selected load balancing algorithm and availability of global services. By

    having the authoritative Windows DNS servers (contoso.com) delegate the URL

    (pool.contoso.com) to the GTM, users connecting to pool.contoso.com are sent to the appropriate

    site at the time of DNS resolution. The local server load balancer then gets the connection and

    load balances it to the appropriate server.

The HLBs were configured to monitor the Front End pool members by using an HTTP or HTTPS monitor, which gives the load balancers the best information about the health and performance of the servers. The HLBs then use this information to load balance the incoming connections to the best local Front End Server. Using a feature called Priority Group Activation, we also configured the HLBs to proxy connections to the other central site if all the local Front End Servers reached capacity or no longer functioned.

    The global server load balancers (GTM) were configured to monitor the HLBs in each site and to

    direct users to the best performing site. The GTM can be configured to send all users to a specific


    site in the case of active/standby central sites (as was the case for this test), or load balance

    users between the sites for active/active deployments. If one site reaches capacity or becomes

    unavailable, the GTM directs users to the other available site(s).
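The following Python sketch illustrates this site-selection logic for a single DNS query. It is a minimal sketch, not F5 configuration: the gtm_resolve helper, the site names, and the VIP addresses are hypothetical, and a real GTM implements this decision with its own health monitors and load balancing algorithms.

    import random

    def gtm_resolve(sites, mode="active-standby"):
        # sites: ordered list of (name, vip, is_healthy), primary site first.
        # In a real deployment, health comes from the GTM's monitors of the
        # local HLBs at each site.
        healthy = [vip for (name, vip, is_healthy) in sites if is_healthy]
        if not healthy:
            raise RuntimeError("no central site is available")
        if mode == "active-standby":
            return healthy[0]  # primary while it is up, otherwise the survivor
        return random.choice(healthy)  # active/active: spread users across sites

    # Example: the North site is down, so queries are answered with South's VIP.
    print(gtm_resolve([("North", "192.0.2.10", False), ("South", "198.51.100.10", True)]))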

    WAN/SAN Latency Simulator

In order to see the impact of network latency between the two sites, we deployed a network latency simulator. The simulator allowed us to test different latencies and arrive at a recommendation for the maximum acceptable and supported latency.

Besides testing network latency, we also wanted to test the impact of latency on data storage replication. In order to test storage latency, we connected two storage nodes (one at each site) by means of a Fiber Channel to IP gateway. This connection enabled data replication over the IP network, which made it possible to use the network latency simulator to test latency along the data path.

    Note:

    The WAN/SAN latency simulator was used for testing purposes only. The simulator is not

    a requirement for the solution described in this paper and is not required for Microsoft

    support.

    DNS

    This test topology used a split-brain DNS configuration; that is, the parent DNS namespace was

    contoso.com, but resolution records for internal and external users were managed separately.

    This configuration allows for advertising a single URL for any specific Lync Server service while

    maintaining separate servers and routes to access those services for internal and external users.

DNS and DNS load balancing were deployed according to Microsoft best practices. For details, see DNS Requirements for Front End Pools, DNS Requirements for Automatic Client Sign-In, Determining DNS Requirements, and DNS Load Balancing in the Planning documentation. Windows DNS can handle all DNS responsibilities for Lync Server services; however, in this case we used the F5 Global Traffic Manager (GTM) for more granular site awareness and load distribution.

    Windows DNS was authoritative for contoso.com for both internal and external user resolution.

    Service names (such as pool1 for HTTPS requests) needing global load balancing were

    delegated to the GTMs so that Windows DNS could maintain ownership of the overall

    contoso.com namespace but GTM could also load balance what was needed. In this case, we

    used the GTM to manage resolution records for HTTPS access; however, this approach can be

    expanded to cover records for other services as well.

    The following lists provide a configuration snapshot of both the internal and external DNS servers

    that were used in our testing.

    External Windows DNS

    Windows DNS is used, and is authoritative for the contoso.com zone.

    ap.contoso.com points to the external network interface of the Access Edge service.

    webconf.contoso.com points to the external network interface of the Web Conferencing

    Edge service.


    avedge.contoso.com points to the external network interface of the A/V Edge service.

    The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this

    case, the F5 GTM.

    proxy.contoso.com is CNAMEd to proxy.wip.contoso.com, thus granting GTM the

    resolution and load balancing responsibilities.

    proxy.wip.contoso.com is configured on the GTM to load balance users to the HTTP

    reverse proxies.

    Internal Windows DNS

    Windows DNS is used, and is authoritative for the contoso.com zone.

    The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this

    case the F5 GTM.

    webpool1.contoso.com is CNAMEd to webpool1.wip.contoso.com, thus granting GTM the

    resolution and load balancing responsibilities.

    webpool1.wip.contoso.com is configured on the GTM to load balance users to the Front

    End VIPs of the load balancers.
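This delegation can be checked end to end with a short script. The following is a minimal sketch, assuming the third-party dnspython package and the example names used in this section; it walks the CNAME hop into the GTM-delegated zone and prints the A records (the HLB VIPs) that come back.

    import dns.resolver  # third-party package: dnspython

    name = "webpool1.contoso.com"
    # Follow the CNAME into the wip.contoso.com zone delegated to the GTM.
    cname = dns.resolver.resolve(name, "CNAME")[0].target.to_text()
    print(name, "->", cname)  # expected: webpool1.wip.contoso.com.
    # The GTM answers for the delegated name with the VIP of the site it chose.
    for record in dns.resolver.resolve(cname, "A"):
        print(cname, "->", record.address)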

    Database Storage

    In order to implement a geographically dispersed Windows Server 2008 R2 Failover Clustering

    solution, we used two HP StorageWorks Enterprise Virtual Array (EVA) Disk Enclosure storage

    area network (SAN) systems (one per site) as database storage. Storage was carved into disk

    groups, which in turn were associated with their respective clusters. All disk groups used

synchronous data replication. A SAN cluster extension was used as a Windows Server 2008 R2 Failover Clustering resource to facilitate storage failover and failback.

    One of the scenarios we wanted to test was the impact of latency on storage data replication

between the two sites. One problem we encountered was that HP StorageWorks has Fiber Channel

    interfaces but the network latency simulator we used does not support those interfaces. In order

    to connect the two, we used a Fiber Channel to IP gateway that HP provided.

    Test Load

Stress testing included the following:

25,000 concurrent users were using the servers.

6,000 users were in IM sessions, with 50% of those IM sessions having more than two users.

3,000 users were in peer-to-peer A/V calls.

3,000 users were in A/V conferences.

500 active users were in application sharing conferences.

3,000 active users were in data collaboration conferences.

    Expected Client Sign-In Behavior

    This section describes the client sign-in behavior during normal operation and failover. This

    description does not include all the details of signing in but is intended only to illustrate the


    general flow when a user signs in to a metropolitan site resiliency topology that is split across

    geographical sites.

    During normal operation, with DNS load balancing deployed, client sign-in with the site resilient

    topology works basically as it does in any supported topology.

    Normal Sign-In Operation

1. A remote user signs in to Lync 2010. Lync 2010 queries the DNS server for its connection endpoint (the Edge Server in this specific instance). The DNS server returns the list of the FQDNs of the Access Edge service on each Edge Server.

2. The client chooses one of these FQDNs at random and attempts to connect to that Edge Server. This Edge Server may be at either site. If this attempt fails, the client keeps trying different Edge Servers until it succeeds (see the sketch after these steps).

    3. Lync 2010 connects by using TLS to one of the Edge Servers.

    4. The Edge Server forwards the request to a Director. The Director may be at either site.

    5. The Director determines the pool where the user is homed and then forwards the request

    to that pool.

6. The DNS server again returns the list of Front End Servers in the pool, including those servers at both sites. Each user has an assigned list of Front End Servers to which the user's client connects: if the first server on the list is currently unavailable, the client tries the next one on the list, and keeps trying until it succeeds. In this example, the request is forwarded to a Front End Server at the North site.

    7. The response is returned to Lync 2010.
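The random selection and retry behavior in steps 2 and 6 can be sketched as follows. This is a minimal illustration only, assuming Python: the connect_to_pool helper and the port number are hypothetical, and a real client also performs SIP registration, certificate validation, and other steps omitted here.

    import random
    import socket
    import ssl

    def connect_to_pool(fqdns, port=5061, timeout=5):
        # Shuffle the FQDNs returned by DNS, then try each server in turn,
        # as the client does in steps 2 and 6. Servers may be at either site.
        candidates = list(fqdns)
        random.shuffle(candidates)
        context = ssl.create_default_context()
        for fqdn in candidates:
            try:
                raw = socket.create_connection((fqdn, port), timeout=timeout)
                # TLS connection, as in step 3.
                return context.wrap_socket(raw, server_hostname=fqdn)
            except OSError:
                continue  # that server (or its whole site) is down; try the next
        raise ConnectionError("no server in either site is reachable")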

    Failover Sign-In Operation

    The following figures show typical call flow during a user sign-in, in the event that the North site

    fails. Diagrams have been simplified to highlight the most important aspects of the topology.

    The following figure shows the flow for an internal user, with automatic configuration.


    The following figure shows the flow for an internal user, with manual configuration.


    The following figure shows the flow for an external user.


    Test Results

This topic describes the results of Microsoft's testing of the failover solution proposed in this section.

    Central Site Link Latency

    We used a network latency simulator to introduce latency on the simulated WAN link between

    North and South. The recommended topology supports a maximum latency of 20 ms between the

    geographical sites. Improvements in the architecture of Lync Server 2010 enable the allowed

    latency to be higher than the maximum of 15 ms allowed in the Microsoft Office Communications

    Server 2007 R2 metropolitan site resiliency topology.


    15 ms. We started by introducing a 15 ms round-trip latency into both the network path

    between two sites and the data path used for data replication between the two sites. The

    topology continued to operate without problem under these conditions and under load.

20 ms. We then began to increase latency. At 20 ms round-trip latency for both network and data traffic, the topology continued to operate without problem. 20 ms is the maximum supported round-trip latency for this topology in Lync Server 2010.

    Important:

    Microsoft will not support solutions whose network and data latency exceeds 20 ms.

    30 ms. At 30 ms round-trip latency, we started to see degradation in performance. In

    particular, message queues for archiving and monitoring databases started to grow. As a

    result of these increased latencies, user experience also deteriorated. Sign-in time and

    conference creation time both increased, and the A/V experience degraded significantly. For

these reasons, Microsoft does not support a solution where round-trip latency exceeds 20 ms.
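As a rough pre-deployment check of whether two candidate sites fall within the 20 ms limit, you can time TCP handshakes between them, since a TCP connect completes in roughly one round trip. The following Python sketch is an approximation only (the host name and port are placeholders), not a substitute for proper network measurement tools.

    import socket
    import time

    def median_rtt_ms(host, port=443, samples=20):
        # Time TCP handshakes; connect() returns after about one round trip.
        times = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=2):
                pass
            times.append((time.perf_counter() - start) * 1000.0)
        times.sort()
        return times[len(times) // 2]

    rtt = median_rtt_ms("server-in-other-site.contoso.com")  # placeholder name
    print("median RTT %.1f ms: %s" % (rtt, "OK" if rtt <= 20 else "exceeds supported limit"))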

    Failover

    As previously mentioned, all Windows Server 2008 R2 clusters in the topology used a Node and

File Share Majority quorum. As a result, in order to simulate site failover, we isolated all servers and clusters at the North site by cutting connectivity to both the South site and the witness site, and then performed a dirty shutdown of all servers at the North site.

    Results and observations following failure of the North site are as follows:

    The passive SQL Server cluster node became active within minutes. The exact amount of

    time can vary and depends on the details of the environment. Internal users connected to the

    North site were signed out and then automatically signed back in. During the failover,

    presence was not updated, and new actions, such as new IM sessions or conferences, failed

with appropriate errors. No further errors occurred after the failover was complete.

As long as there was a valid network path between peers, ongoing peer-to-peer calls

    continued without interruption.

    UC-PSTN calls were disconnected if the gateway supporting the call became

    unavailable. In that case, users could manually re-establish the call.

Lync 2010 users connected to the North site were disconnected and automatically

    reconnected to the South site within minutes. Users could then continue as before.

    In order to reconnect, Group Chat client users had to sign out and sign back in. The

    Group Chat Channel service and Lookup service in the South site, which were normally

    stopped or disabled at the site, had to be started manually.

    Conferences hosted in the North site automatically failed over to the South site. All users

    were prompted to rejoin the conference after failover completed. Clients could rejoin the

    meeting. Meeting recording continued during the failover. Archiving stopped until the hot

    standby Archiving Server was brought online.

    Manageability continued to work while the North site was down. For example, users could

    be moved from the Survivable Branch Appliance to the Front End pool.


    After the North site went offline, SQL Server clusters and file share clusters in the South

    site came online in a few minutes.

    Site failover duration as observed in our testing was only a few minutes.

    Failback

    For the purposes of our testing, we defined failback as restoring all functionality to the North site

    such that users can reconnect to servers at that site. After the North site was restored, all cluster

    resources were moved back to their nodes at the North site.

    We recommend that you perform your failback in a controlled manner, preferably during off hours,

    as some user disruption can happen during the failback procedures. Results and observations

    following failback of the North site are as follows:

Before cluster resources could be moved back to their nodes at the North site, storage had to be fully resynchronized. If storage has not been resynchronized, clusters will fail to come online. The resynchronization of the storage happened automatically.

    To ensure minimal user impact, the clusters were set not to automatically fail back. Our

    recommendation is to postpone failback until the next maintenance window after ensuring

    storage has fully resynchronized.

    The Front End Servers will come online when they are able to connect to the Active

    Directory Domain Services. If the Back End Database is not yet available when the Front End

    Servers come online, users will have limited functionality.

    After the Front End Servers in the North site are online, new connections will be routed to

    them. Users who are online, and who usually connect through Front End Servers in the North

    site, will be signed out and then signed back in on their usual North site server.

If you want to prevent the Front End Servers at the North site from automatically coming back online (for example, if you want better control over the whole process, or if latency between the two sites has not been restored to acceptable levels), we recommend shutting down the Front End Servers.

    Site failback duration as observed in our testing was under one minute.

    Findings and Recommendations

    The metropolitan site resiliency solution has been tested and is officially supported by Microsoft;

    however, before deploying this topology, you should consider the following findings and

    recommendations.

    Findings

    Cluster failover worked as expected. No manual steps were required, with the exception

    of Group Chat Server, Archiving Server, and Monitoring Server. Front End Servers were able

    to reconnect to the back-end database servers after the failover and resume normal service.

    Microsoft Lync 2010 clients reconnected automatically.

    Cluster failback worked as expected. It is important to ensure that storage has

    resynchronized before failback begins.


    Users will see a quick sign out/sign in sequence as they are transferred back to their usual

    Front End Server, when it becomes available again.

When failover occurred, the Group Chat Channel service and Lookup service at the failover site had to be started manually. Additionally, the Group Chat Compliance Server setting had to be updated manually. For details, see Backing Up the Compliance Server in the Operations documentation.

    Recommendations

    Although testing used two nodes (one per site) in each SQL Server cluster, we

    recommend deploying additional nodes to achieve in-site redundancy for all components in

    the topology. For example, if the active SQL Server node becomes unavailable, a backup

    SQL Server node in the same site and part of the same cluster can assume the workload until

    the failed server is brought back online or replaced.

Although our testing used components provided by certain third-party vendors, the solution does not depend on or stipulate any particular vendors. As long as components are certified and supported by Microsoft, any qualifying vendor will do.

    All individual components of the solution (for example, geographically dispersed cluster

    components) must be supported and, where appropriate, certified by Microsoft. This does not

    mean, however, that Microsoft will directly support individual third-party components. For

    component support, contact the appropriate third-party vendor.

Although a full-scale deployment was not tested, we expect published scale numbers for Lync Server 2010 to hold true. With that in mind, you should plan your capacity so that, in the event of failover, the surviving servers have sufficient capacity to continue operation. For details, see Capacity Planning in the Planning documentation.

    The information in this section should be used only as guidance. Before deploying this

    solution in a production environment, you should build and test it using your own topology.

    Note:

    Microsoft does not support implementations of this solution where network and data-

    replication latency between the primary and secondary sites exceeds 20 ms, or when the

    bandwidth does not support the user model for your organization. When latency exceeds

    20 ms, the end-user experience rapidly deteriorates. In addition, Archiving Server and

    Group Chat Compliance servers are likely to start falling behind, which may in turn cause

    Front End Servers and Group Chat lookup servers to shut down.

    Failback Procedure Recommendations

To fail back and resume normal operation at the North site, the following steps are necessary:

1. Restore the network connection between the two sites. Quality attributes of the network connection (for example, bandwidth, latency, and loss) should be comparable to the quality prior to failover.

Performance Monitoring Counters And Numbers

For each of the counters below, verify that the queue is not increasing unbounded. Establish a baseline for the counter, and monitor the counter to ensure that it does not exceed that baseline.

On Group Chat Channel and Compliance Servers, monitor the MSMQ Service\Total Messages in all Queues counter. The size of the queue will vary depending on load. Verify that the queue is not increasing unbounded. Establish a baseline for the counter, and monitor the counter to make sure that it does not exceed that baseline.

On the Directors, Edge Servers, and Front End Servers, monitor the LC:SIP - 04 - Responses\SIP - 051 - Local 503 Responses/sec counter. This counter indicates whether any server is returning errors indicating that the server is unavailable. At steady state, this counter should be approximately 0. Occasional spikes are acceptable.

On all servers, monitor the LC:SIP - 04 - Responses\SIP - 053 - Local 504 Responses/sec counter. This counter can indicate connection delays or failures with other servers. At steady state, this counter should be approximately 0. Occasional spikes are acceptable. If you see 504 error messages, check the LC:SIP - 01 - Peers\SIP - 017 - Sends Outstanding counter. This counter records the number of requests and responses in the outbound queue, which will indicate which servers are having problems.
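One way to automate baseline checks on these counters is to sample them with the built-in typeperf tool and compare against the baseline you established at steady state. The following Python sketch is illustrative only: the baseline value is a placeholder, and the parsing assumes typeperf's usual quoted CSV output.

    import subprocess

    COUNTER = r"\MSMQ Service\Total messages in all queues"
    BASELINE = 1000.0  # placeholder: use the value measured at steady state

    def sample_counter(counter=COUNTER):
        # Collect a single sample ("-sc 1"); typeperf prints a quoted CSV
        # header line followed by one "timestamp","value" data line.
        out = subprocess.run(["typeperf", counter, "-sc", "1"],
                             capture_output=True, text=True, check=True).stdout
        data = [line for line in out.splitlines() if line.startswith('"')][1]
        return float(data.split(",")[1].strip('"'))

    if sample_counter() > BASELINE:
        print("queue exceeds baseline; archiving or compliance may be falling behind")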

    DNS and HLB Topology Reference

    The following figure is a conceptual overview of how DNS, Global, and Local server load

    balancing were configured to support the metropolitan site resiliency solution.


    In this topology, Global Server Load Balancers (GSLB) were deployed at each site to provide

    failover capabilities at a site level, supporting internal client/server (https) traffic to the pool and

    external reverse proxy (https) traffic for users connected remotely. As part of this configuration,

    Local Server Load Balancers (LSLB) were also deployed at each site to manage https

    connections to Front End servers within the pool, physically located across each site. To support


    the DNS zones delegated internally and externally, the GSLB at each site monitored and routed

    https traffic destined for the following URLs:

    Internally

    https://webpool1.contoso.com

    https://admin.contoso.com

    https://dial.contoso.com

    https://meet.contoso.com

    Externally

    https://proxy.contoso.com

    https://dial.contoso.com

    https://meet.contoso.com

    To support the simple URLs referenced above, CNAME records were created, delegating the

DNS resolution to the GSLB for further routing to the LSLB of choice. For example, as internal client requests resolved to webpool1.contoso.com, they were translated to webpool1.wip.contoso.com by the GSLB, and traffic was routed to one of the local server load balancer's virtual IP addresses (VIPs) as shown.

    If a site failure occurred, the GSLB would redirect future requests to the LSLB VIP that remains.

    For all other Lync Server client-to-server and server-to-server traffic, external or internal, the

    requests were handled by DNS load balancing, which is a new load balancing capability in Lync

    Server 2010.

    Acknowledgements and References

Acknowledgements

We would like to acknowledge the following partners:

    F5 (http://www.f5.com) for providing hardware load balancers and support.

    Hewlett-Packard Development Company (http://www.hp.com/go/clxeva) for providing the

    geographically dispersed cluster solution.

    Network Equipment Technologies (www.net.com) for providing gateways, Survivable

    Branch Appliances, and support.

    Juniper Networks (www.juniper.net) for providing firewalls.

    References

    The following links provide more information about some of the topics in this section:

    For details about Windows Server 2008 R2 Failover Clustering, see the "Getting Started"

    section of "Failover Clustering" at http://go.microsoft.com/fwlink/?LinkId=208305.

    For details about the Windows Server 2008 R2 Failover Cluster Configuration Program,

    see the "Configuration Program" section of "Failover Clustering" at

    http://go.microsoft.com/fwlink/?LinkId=208306.


    For details about SQL Server Always On partners, see "SQL Server Always on Storage

    Solution Partners" at http://go.microsoft.com/fwlink/?LinkId=208307.
