Chapter 11: Planning for Metropolitan Site Resiliency


    Microsoft Lync Server 2010

    Published: March 2012


This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice.

Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

Copyright 2012 Microsoft Corporation. All rights reserved.


Contents

Planning for Metropolitan Site Resiliency
    The Metropolitan Site Resiliency Solution
        Overview
        Prerequisites
    Test Methodology
        Site Resiliency Topology
            Servers in the Metropolitan Site Resiliency Topology
            Hardware Load Balancers
            WAN/SAN Latency Simulator
            DNS
            Database Storage
        Test Load
        Expected Client Sign-In Behavior
        Test Results
    Findings and Recommendations
        Failback Procedure Recommendations
    Performance Monitoring Counters And Numbers
    DNS and HLB Topology Reference
    Acknowledgements and References


Planning for Metropolitan Site Resiliency

If you require Microsoft Lync Server 2010 communications software to be always available, even in the event of a severe disaster at one geographical location in your organization, you can follow the guidelines in this section to create a topology that offers metropolitan site resiliency.

    In this topology, Lync Server 2010 pools span two geographically separate locations. In such a

    topology, even catastrophic server failure in one location would not seriously disrupt usage,

    because all connection requests would automatically be directed to servers in the same pool but

    at the second location. The site resiliency solution described in this section is designed

    specifically for this split-pool topology and is supported by Microsoft subject to the constraints

    mentioned in Findings and Recommendations.

If your environment does not meet the requirements described in this document, see Planning for Enterprise Voice Resiliency for recommendations about providing resiliency for your Enterprise Voice workload.

    Unless specifically stated otherwise, all server roles have been installed according to the product

    documentation. For details, see Deployment in the Deployment documentation.

    In This Section

The Metropolitan Site Resiliency Solution provides an overview of the tested and supported site resiliency solution.

Test Methodology describes the testing topology, expected behavior, and test results.

Findings and Recommendations provides practical guidance for deploying your own failover solution.

    Notes:

    This section does not include specific procedures for deploying the products that are used in the

    solution. Specific deployment requirements are likely to vary so much among different customers

    that step-by-step instructions are likely to be incomplete or misleading. For step-by-step

    instructions, see the product documentation for the various software and hardware used in this

    solution.

    To successfully follow the topics in this section, you should have a thorough understanding of

    Lync Server 2010 and Windows Server 2008 R2 Failover Clustering.

    The Metropolitan Site Resiliency Solution

    This section describes the tested and supported metropolitan site resiliency solution, including

    prerequisites, topology, and individual components. For details about planning and deploying

    Windows Server 2008 R2 and Lync Server 2010, see the documentation for these products. For

    details about third-party components, see Database Storage and the product documentation

    provided by the makers of those components.


    In This Section

    Overview

    Prerequisites

    Overview

    The metropolitan site resiliency solution described in this section entails the following:

    Splitting the Front End pool between two physical sites, hereafter called North and South.

    In Topology Builder, these two geographical sites are configured as one single Lync Server

    2010 site.

    Creating separate geographically dispersed clusters (physically separated Windows

    Server 2008 R2 failover clusters) for the following:

    Back End Servers

    Group Chat Database Servers

    File Servers

    Deploying a Windows Server 2008 R2 file share witness to which all server clusters are

    connected. To determine where to place the file share witness, refer to the Windows Server

    2008 R2 failover cluster documentation at http://go.microsoft.com/fwlink/?LinkId=211216.

    Enabling synchronous data replication between the geographically dispersed clusters.

    Deploying servers running certain server roles in both sites. These roles include Front

    End Server, A/V Conferencing Server, Director, Edge Server, and Group Chat Server. The

    servers of each type in both sites are contained within one pool of that type, which crosses

    both sites. Except for Group Chat Server, all servers of these types, in both sites, are active.

For Group Chat Server, only the servers in one site can be active at a time. The Group Chat Servers in the other site must be inactive.

Additionally, Monitoring Server and Archiving Server can be deployed in both sites; however, only the Monitoring Server and Archiving Server in one site are associated with the other servers in your deployment. The Monitoring Server and Archiving Server in the other site are deployed but not associated with any pools, and they serve as a "hot" backup.

    The following figure provides an overview of the resulting topology.


    With the topology depicted in the preceding figure, a single site could become unavailable for any

    reason, and users would still be able to access supported unified communications services within

    minutes rather than hours. For a detailed depiction of the topology used to test the solution

    described in this section, see Site Resiliency Topology.

    Scope of Testing and Support

    This site resiliency solution has been tested and is supported by Microsoft for the following

    workloads:

    IM and presence

    Peer-to-peer scenarios; for example, peer-to-peer audio/video sessions

    IM conferencing

    Web conferencing

    A/V conferencing


    Application sharing

    Enterprise Voice and Telephony Integration

    Enterprise Voice applications, including Conferencing Attendant, Conferencing

    Announcement service, Outside Voice Control, and Response Group service

    Approved unified communications devices

    Simple URLs

    Group Chat

    Exchange UM

    Workloads That Are Out of Scope

    The following scenarios can be deployed in the metropolitan site resiliency topology, but the

    automatic failover of these workloads is not designed or supported:

    Federation and Public IM Connectivity

    Remote call control

    Microsoft Lync Web App

    XMPP Gateway

    Prerequisites

    The solution described in this section assumes that your Lync Server deployment meets both the

    core requirements described in the product documentation and all of the following prerequisites.

    To qualify for Microsoft support, your failover solution must meet all these prerequisites.

    All servers that are part of geographically dispersed clusters must be part of the same

stretched VLAN, using the same Layer-2 broadcast domain. All other internal servers running Lync Server server roles can be on a subnet within that server's local data center.

    Edge Servers must be in the perimeter network, and should be on a different subnet than the

    internal servers. Also, the perimeter network need not be stretched between sites.

    Synchronous data replication must be enabled between the primary and secondary sites,

    and the vendor solution that you employ must be supported by Microsoft.

    Round-trip latency between the two sites must not be greater than 20 ms.

    Available bandwidth between the sites must be at least 1 Gbps.

    A geographically dispersed cluster solution based on Windows Server 2008 R2 Failover

    Clustering must be in place. That solution must be certified and supported by Microsoft, and it

    must pass cluster validation as described in the Windows Server 2008 R2 documentation.

For details, see the "What is cluster validation?" section of Failover Cluster Step-by-Step Guide: Validating Hardware for a Failover Cluster at http://go.microsoft.com/fwlink/?LinkId=142436.

    All geographically dispersed cluster servers must be running the 64-bit edition of

    Windows Server 2008 R2.

    All your servers that are running Lync Server must run the Lync Server 2010 version.

    All database servers must be running the 64-bit edition of one of the following:

Test Methodology

Site Resiliency Topology

Depending on the components you choose for your particular implementation of this solution, you might need help from your vendor of choice to deploy this solution.

    This figure is representative of the topology tested, but for purposes of clarity, it does not

    necessarily depict the number of servers used in each pool in the actual test topology. For

    example, in the actual test topology there were four Front End Servers in each site.

    As shown in the figure, the tested topology deployed two central sites and a branch office, along

    with a third location that hosted a file server functioning as a Windows Server 2008 R2 Failover

    Clustering Service file share witness. For details about using a witness in a failover cluster, see

http://go.microsoft.com/fwlink/?LinkId=211004. The file share witness is available to all Windows

    Server 2008 R2 Failover Cluster nodes in both central sites. All Windows Server 2008 R2

    Failover Clusters used in this solution use the Node and File Share Majority quorum mode.

The following topics discuss each of the solution components shown in the preceding figure.


    In This Section

    Servers in the Metropolitan Site Resiliency Topology

    Hardware Load Balancers

    WAN/SAN Latency Simulator

    DNS

    Database Storage

    Servers in the Metropolitan Site Resiliency Topology

    The metropolitan site resiliency topology can include different types of server roles, as follows.

    Front End Pool

    This pool hosts all Lync Server users. Each site, North and South, contains four identically

configured Front End Servers. The Back-End Database is deployed as two Active/Passive SQL Server 2008 geographically dispersed cluster nodes, running on the Windows Server 2008 R2 Failover Clustering service. Synchronous data replication is required between the two Back-End Database Servers.

In our test topology, the Mediation Server was collocated with the Front End Server. Topologies with a stand-alone Mediation Server are also supported.

    Our test topology used DNS load balancing to balance the SIP traffic in the pool, with hardware

    load balancers deployed for the HTTP traffic.

    Topologies that use only hardware load balancers to balance all types of traffic are also supported

    for site resiliency.

    A/V Conferencing Pool

    We deployed a single A/V Conferencing pool with four A/V Conferencing Servers, two in each

    site.

    Director Pool

    We deployed a single Director pool with four Directors, two in each site.

    Edge Pool

    The Edge Servers ran all services (Access Edge service, A/V Conferencing Edge service, and

    Web Conferencing Edge service), but we tested them only for remote-user scenarios. Federation

    and public IM connectivity are beyond the scope of this document.

    We recommend DNS load balancing for your Edge pool, but we also support using hardware load

    balancers. The internal Edge interface and external Edge interface must use the same type of

    load balancing. You cannot use DNS load balancing on one Edge interface and hardware load

    balancing on the other Edge interface. If you use hardware load balancers for the Edge pool, the

    hardware load balancer at one site serves as the primary load balancer and responds to requests

    with the virtual IP address of the appropriate Edge service. If the primary load balancer is

    unavailable, the secondary hardware load balancer at the other site would take over. Each site

    has its own IP subnet; perimeter networks were not stretched across the North and South sites.

    Group Chat Servers


    Each site hosts both a Channel service and a Lookup service, but these services can be active in

    only one of the sites at a time. The Channel service and the Lookup service in the other site must

    be stopped or disabled. In the event of site failover, manual intervention is required to start these

    services at the failover site.

Each site also hosts a Compliance Server, but only one of these servers can be active at a time. In the event of site failover and failback, manual intervention is required to restore the service. For details, see Backing Up the Compliance Server in the Operations documentation.

    We deployed the Group Chat back-end database as two Active/Passive SQL Server 2008

    geographically dispersed cluster nodes running on top of Windows Server 2008 R2 Failover

    Clustering. Data replication between the two back-end database servers must be synchronous. A

    single database instance is used for both Group Chat and compliance data.

    Monitoring Server and Archiving Server

    For Monitoring Server and Archiving Server, we recommend a hot standby deployment. Deploy

    these server roles in both sites, on a single server in each site. Only one of these servers is

    active, and the pools in your deployment are all associated with that active server. The other

    server is deployed and installed, but not associated with any pool.

    If the primary server becomes unavailable, you use Topology Builder to manually associate the

    pools with the standby server, which then becomes the primary server.

    File Server Cluster

    We deployed a file server as a two-node geographically dispersed cluster resource using

    Windows Server 2008 R2 Failover Clustering. Synchronous data replication was required. Any

    Lync Server function that requires a file share and is split across the two sites must use this file

    share cluster. This includes the following:

    Meeting content location

Meeting metadata location

Meeting archive location

    Address Book Server file store

    Application data store

    Client Update data store

    Group Chat compliance file repository

    Group Chat upload files location

    Reverse Proxy

A reverse proxy server is deployed at each site. In our test topology, these servers ran Microsoft Forefront Threat Management Gateway and operated independently of one another. A hardware load balancer was deployed at each site.

    Hardware Load Balancers

    Even when you deploy DNS load balancing, you need hardware load balancers to load balance

    the HTTP traffic to the Front End pools and Director pools.


    Additionally, we deployed hardware load balancers in the perimeter network for the reverse proxy

    servers.

    To provide the highest level of load balancing and high availability, a pair of hardware load

    balancers (HLBs) were deployed with a Global Server Load Balancer (GSLB) at each site. With

all the load balancers in constant communication with each other regarding site and server health, no single device failure at either central site would cause a service disruption for any of

    the users who are currently connected.

This test scenario employed both global server (the F5 BIG-IP GTM) and local server

    (the F5 BIG-IP LTM) HLBs. The global server load balancers were implemented to manage traffic

    to each site based upon central site availability and health, while the local server load balancers

    managed connections within each site to the local servers. This implementation has the following

    advantages:

    Fully-meshed system for the highest level of fault tolerance at a local and global level.

    Complete segmentation of internal and external traffic within the central site.

The ability, if you want, to leverage the hardware to load balance all connections to Front End Servers, Edge Servers, and Directors.

    Although optimal from some perspectives, this deployment does have two distinct disadvantages:

    you need to purchase more HLBs, and the numerous devices create a more complex

    configuration to manage. Consolidation of the load balancing infrastructure is definitely possible

    and in some environments is beneficial. For instance, many deployment designs include a single

    HLB instance or pair in each central site. Although the HLB spans multiple subnets in this design,

the load balancing logic remains the same. F5 produced architectural guidance that explores the tradeoffs between different network designs. For details, see http://go.microsoft.com/fwlink/?LinkId=212143. For details about deployments leveraging HLBs for Lync Server without GSLBs, see the Office Communications Server 2007 R2 Site Resiliency white paper at http://go.microsoft.com/fwlink/?LinkId=211387. The deployments described in that white paper also provide a valid reference architecture for Lync Server 2010.

    By leveraging both local and global load balancers, we achieved both server and site resiliency

    while using a single URL for users to connect to. The GTM resolves a single URL to different IP

    addresses based on the selected load balancing algorithm and availability of global services. By

    having the authoritative Windows DNS servers (contoso.com) delegate the URL

    (pool.contoso.com) to the GTM, users connecting to pool.contoso.com are sent to the appropriate

    site at the time of DNS resolution. The local server load balancer then gets the connection and

    load balances it to the appropriate server.

The HLBs were configured to monitor the Front End pool members by using an HTTP or HTTPS monitor, which gives the load balancers the best information about the health and performance of the servers. The HLBs then use this information to load balance the incoming connections to the best local Front End Server. Using a feature called Priority Group Activation, we also configured the HLBs to proxy connections to the other central site if all the local Front End Servers reached capacity or no longer functioned.

    The global server load balancers (GTM) were configured to monitor the HLBs in each site and to

    direct users to the best performing site. The GTM can be configured to send all users to a specific


    site in the case of active/standby central sites (as was the case for this test), or load balance

    users between the sites for active/active deployments. If one site reaches capacity or becomes

    unavailable, the GTM directs users to the other available site(s).
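The following Python sketch illustrates this site-selection logic for a single DNS query. It is a minimal sketch, not F5 configuration: the gtm_resolve helper, the site names, and the VIP addresses are hypothetical, and a real GTM implements this decision with its own health monitors and load balancing algorithms.

    import random

    def gtm_resolve(sites, mode="active-standby"):
        # sites: ordered list of (name, vip, is_healthy), primary site first.
        # In a real deployment, health comes from the GTM's monitors of the
        # local HLBs at each site.
        healthy = [vip for (name, vip, is_healthy) in sites if is_healthy]
        if not healthy:
            raise RuntimeError("no central site is available")
        if mode == "active-standby":
            return healthy[0]  # primary while it is up, otherwise the survivor
        return random.choice(healthy)  # active/active: spread users across sites

    # Example: the North site is down, so queries are answered with South's VIP.
    print(gtm_resolve([("North", "192.0.2.10", False), ("South", "198.51.100.10", True)]))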

    WAN/SAN Latency Simulator

In order to see the impact of network latency between the two sites, we deployed a network latency simulator. The simulator allowed us to test different latencies and arrive at a recommendation for the maximum acceptable and supported latency.

Besides testing network latency, we also wanted to test the impact of latency on data storage replication. In order to test storage latency, we connected two storage nodes (one at each site) by means of a Fiber Channel to IP gateway. This connection enabled data replication over the IP network, which made it possible to use the network latency simulator to test latency along the data path.

    Note:

    The WAN/SAN latency simulator was used for testing purposes only. The simulator is not

    a requirement for the solution described in this paper and is not required for Microsoft

    support.

    DNS

    This test topology used a split-brain DNS configuration; that is, the parent DNS namespace was

    contoso.com, but resolution records for internal and external users were managed separately.

    This configuration allows for advertising a single URL for any specific Lync Server service while

    maintaining separate servers and routes to access those services for internal and external users.

DNS and DNS load balancing were deployed according to Microsoft best practices. For details, see DNS Requirements for Front End Pools, DNS Requirements for Automatic Client Sign-In, Determining DNS Requirements, and DNS Load Balancing in the Planning documentation. Windows DNS can handle all DNS responsibilities for Lync Server services; however, in this case we used the F5 Global Traffic Manager (GTM) for more granular site awareness and load distribution.

    Windows DNS was authoritative for contoso.com for both internal and external user resolution.

    Service names (such as pool1 for HTTPS requests) needing global load balancing were

    delegated to the GTMs so that Windows DNS could maintain ownership of the overall

    contoso.com namespace but GTM could also load balance what was needed. In this case, we

    used the GTM to manage resolution records for HTTPS access; however, this approach can be

    expanded to cover records for other services as well.

    The following lists provide a configuration snapshot of both the internal and external DNS servers

    that were used in our testing.

    External Windows DNS

    Windows DNS is used, and is authoritative for the contoso.com zone.

    ap.contoso.com points to the external network interface of the Access Edge service.

    webconf.contoso.com points to the external network interface of the Web Conferencing

    Edge service.


    avedge.contoso.com points to the external network interface of the A/V Edge service.

    The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this

    case, the F5 GTM.

    proxy.contoso.com is CNAMEd to proxy.wip.contoso.com, thus granting GTM the

    resolution and load balancing responsibilities.

    proxy.wip.contoso.com is configured on the GTM to load balance users to the HTTP

    reverse proxies.

    Internal Windows DNS

    Windows DNS is used, and is authoritative for the contoso.com zone.

    The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this

    case the F5 GTM.

    webpool1.contoso.com is CNAMEd to webpool1.wip.contoso.com, thus granting GTM the

    resolution and load balancing responsibilities.

    webpool1.wip.contoso.com is configured on the GTM to load balance users to the Front

    End VIPs of the load balancers.
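This delegation can be checked end to end with a short script. The following is a minimal sketch, assuming the third-party dnspython package and the example names used in this section; it walks the CNAME hop into the GTM-delegated zone and prints the A records (the HLB VIPs) that come back.

    import dns.resolver  # third-party package: dnspython

    name = "webpool1.contoso.com"
    # Follow the CNAME into the wip.contoso.com zone delegated to the GTM.
    cname = dns.resolver.resolve(name, "CNAME")[0].target.to_text()
    print(name, "->", cname)  # expected: webpool1.wip.contoso.com.
    # The GTM answers for the delegated name with the VIP of the site it chose.
    for record in dns.resolver.resolve(cname, "A"):
        print(cname, "->", record.address)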

    Database Storage

    In order to implement a geographically dispersed Windows Server 2008 R2 Failover Clustering

    solution, we used two HP StorageWorks Enterprise Virtual Array (EVA) Disk Enclosure storage

    area network (SAN) systems (one per site) as database storage. Storage was carved into disk

    groups, which in turn were associated with their respective clusters. All disk groups used

synchronous data replication. A SAN cluster extension was used as a Windows Server 2008 R2 Failover Clustering resource to facilitate storage failover and failback.

    One of the scenarios we wanted to test was the impact of latency on storage data replication

between the two sites. One problem we encountered was that HP StorageWorks has Fiber Channel

    interfaces but the network latency simulator we used does not support those interfaces. In order

    to connect the two, we used a Fiber Channel to IP gateway that HP provided.

    Test Load

Stress testing included the following:

25,000 concurrent users were using the servers.

6,000 users were in IM sessions, with 50% of those IM sessions having more than two users.

3,000 users were in peer-to-peer A/V calls.

3,000 users were in A/V conferences.

500 active users were in application sharing conferences.

3,000 active users were in data collaboration conferences.

    Expected Client Sign-In Behavior

    This section describes the client sign-in behavior during normal operation and failover. This

    description does not include all the details of signing in but is intended only to illustrate the


    general flow when a user signs in to a metropolitan site resiliency topology that is split across

    geographical sites.

    During normal operation, with DNS load balancing deployed, client sign-in with the site resilient

    topology works basically as it does in any supported topology.

    Normal Sign-In Operation

1. A remote user signs in to Lync 2010. Lync 2010 queries the DNS server for its connection endpoint (the Edge Server in this specific instance). The DNS server returns the list of the FQDNs of the Access Edge service on each Edge Server.

2. The client chooses one of these FQDNs at random and attempts to connect to that Edge Server. This Edge Server may be at either site. If this attempt fails, the client keeps trying different Edge Servers until it succeeds (see the sketch after these steps).

    3. Lync 2010 connects by using TLS to one of the Edge Servers.

    4. The Edge Server forwards the request to a Director. The Director may be at either site.

    5. The Director determines the pool where the user is homed and then forwards the request

    to that pool.

6. The DNS server again returns the list of Front End Servers in the pool, including those servers at both sites. Each user has an assigned list of Front End Servers to which the user's client connects: if the first server on the list is currently unavailable, the client tries the next one on the list, and keeps trying until it succeeds. In this example, the request is forwarded to a Front End Server at the North site.

    7. The response is returned to Lync 2010.
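The random selection and retry behavior in steps 2 and 6 can be sketched as follows. This is a minimal illustration only, assuming Python: the connect_to_pool helper and the port number are hypothetical, and a real client also performs SIP registration, certificate validation, and other steps omitted here.

    import random
    import socket
    import ssl

    def connect_to_pool(fqdns, port=5061, timeout=5):
        # Shuffle the FQDNs returned by DNS, then try each server in turn,
        # as the client does in steps 2 and 6. Servers may be at either site.
        candidates = list(fqdns)
        random.shuffle(candidates)
        context = ssl.create_default_context()
        for fqdn in candidates:
            try:
                raw = socket.create_connection((fqdn, port), timeout=timeout)
                # TLS connection, as in step 3.
                return context.wrap_socket(raw, server_hostname=fqdn)
            except OSError:
                continue  # that server (or its whole site) is down; try the next
        raise ConnectionError("no server in either site is reachable")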

    Failover Sign-In Operation

    The following figures show typical call flow during a user sign-in, in the event that the North site

    fails. Diagrams have been simplified to highlight the most important aspects of the topology.

    The following figure shows the flow for an internal user, with automatic configuration.


    The following figure shows the flow for an internal user, with manual configuration.


    The following figure shows the flow for an external user.


    Test Results

This topic describes the results of Microsoft's testing of the failover solution proposed in this section.

    Central Site Link Latency

    We used a network latency simulator to introduce latency on the simulated WAN link between

    North and South. The recommended topology supports a maximum latency of 20 ms between the

    geographical sites. Improvements in the architecture of Lync Server 2010 enable the allowed

    latency to be higher than the maximum of 15 ms allowed in the Microsoft Office Communications

    Server 2007 R2 metropolitan site resiliency topology.


    15 ms. We started by introducing a 15 ms round-trip latency into both the network path

    between two sites and the data path used for data replication between the two sites. The

    topology continued to operate without problem under these conditions and under load.

20 ms. We then began to increase latency. At 20 ms round-trip latency for both network and data traffic, the topology continued to operate without problem. 20 ms is the maximum supported round-trip latency for this topology in Lync Server 2010.

    Important:

    Microsoft will not support solutions whose network and data latency exceeds 20 ms.

    30 ms. At 30 ms round-trip latency, we started to see degradation in performance. In

    particular, message queues for archiving and monitoring databases started to grow. As a

    result of these increased latencies, user experience also deteriorated. Sign-in time and

    conference creation time both increased, and the A/V experience degraded significantly. For

these reasons, Microsoft does not support a solution where round-trip latency exceeds 20 ms.
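As a rough pre-deployment check of whether two candidate sites fall within the 20 ms limit, you can time TCP handshakes between them, since a TCP connect completes in roughly one round trip. The following Python sketch is an approximation only (the host name and port are placeholders), not a substitute for proper network measurement tools.

    import socket
    import time

    def median_rtt_ms(host, port=443, samples=20):
        # Time TCP handshakes; connect() returns after about one round trip.
        times = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=2):
                pass
            times.append((time.perf_counter() - start) * 1000.0)
        times.sort()
        return times[len(times) // 2]

    rtt = median_rtt_ms("server-in-other-site.contoso.com")  # placeholder name
    print("median RTT %.1f ms: %s" % (rtt, "OK" if rtt <= 20 else "exceeds supported limit"))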

    Failover

    As previously mentioned, all Windows Server 2008 R2 clusters in the topology used a Node and

File Share Majority quorum. As a result, in order to simulate site failover, we isolated all servers and clusters at the North site by cutting connectivity to both the South site and the witness site, and then performed a dirty shutdown of all servers at the North site.

    Results and observations following failure of the North site are as follows:

    The passive SQL Server cluster node became active within minutes. The exact amount of

    time can vary and depends on the details of the environment. Internal users connected to the

    North site were signed out and then automatically signed back in. During the failover,

    presence was not updated, and new actions, such as new IM sessions or conferences, failed

with appropriate errors. No further errors occurred after the failover was complete.

As long as there was a valid network path between peers, ongoing peer-to-peer calls

    continued without interruption.

    UC-PSTN calls were disconnected if the gateway supporting the call became

    unavailable. In that case, users could manually re-establish the call.

Lync 2010 users connected to the North site were disconnected and automatically

    reconnected to the South site within minutes. Users could then continue as before.

    In order to reconnect, Group Chat client users had to sign out and sign back in. The

    Group Chat Channel service and Lookup service in the South site, which were normally

    stopped or disabled at the site, had to be started manually.

    Conferences hosted in the North site automatically failed over to the South site. All users

    were prompted to rejoin the conference after failover completed. Clients could rejoin the

    meeting. Meeting recording continued during the failover. Archiving stopped until the hot

    standby Archiving Server was brought online.

    Manageability continued to work while the North site was down. For example, users could

    be moved from the Survivable Branch Appliance to the Front End pool.


    After the North site went offline, SQL Server clusters and file share clusters in the South

    site came online in a few minutes.

    Site failover duration as observed in our testing was only a few minutes.

    Failback

    For the purposes of our testing, we defined failback as restoring all functionality to the North site

    such that users can reconnect to servers at that site. After the North site was restored, all cluster

    resources were moved back to their nodes at the North site.

    We recommend that you perform your failback in a controlled manner, preferably during off hours,

    as some user disruption can happen during the failback procedures. Results and observations

    following failback of the North site are as follows:

Before cluster resources could be moved back to their nodes at the North site, storage had to be fully resynchronized. If storage has not been resynchronized, clusters will fail to come online. The resynchronization of the storage happened automatically.

    To ensure minimal user impact, the clusters were set not to automatically fail back. Our

    recommendation is to postpone failback until the next maintenance window after ensuring

    storage has fully resynchronized.

    The Front End Servers will come online when they are able to connect to the Active

    Directory Domain Services. If the Back End Database is not yet available when the Front End

    Servers come online, users will have limited functionality.

    After the Front End Servers in the North site are online, new connections will be routed to

    them. Users who are online, and who usually connect through Front End Servers in the North

    site, will be signed out and then signed back in on their usual North site server.

If you want to prevent the Front End Servers at the North site from automatically coming back online (for example, if you want better control over the whole process, or if latency between the two sites has not been restored to acceptable levels), we recommend shutting down the Front End Servers.

    Site failback duration as observed in our testing was under one minute.

    Findings and Recommendations

    The metropolitan site resiliency solution has been tested and is officially supported by Microsoft;

    however, before deploying this topology, you should consider the following findings and

    recommendations.

    Findings

    Cluster failover worked as expected. No manual steps were required, with the exception

    of Group Chat Server, Archiving Server, and Monitoring Server. Front End Servers were able

    to reconnect to the back-end database servers after the failover and resume normal service.

    Microsoft Lync 2010 clients reconnected automatically.

    Cluster failback worked as expected. It is important to ensure that storage has

    resynchronized before failback begins.


    Users will see a quick sign out/sign in sequence as they are transferred back to their usual

    Front End Server, when it becomes available again.

When failover occurred, the Group Chat Channel service and Lookup service at the failover site had to be started manually. Additionally, the Group Chat Compliance Server setting had to be updated manually. For details, see Backing Up the Compliance Server in the Operations documentation.

    Recommendations

    Although testing used two nodes (one per site) in each SQL Server cluster, we

    recommend deploying additional nodes to achieve in-site redundancy for all components in

    the topology. For example, if the active SQL Server node becomes unavailable, a backup

    SQL Server node in the same site and part of the same cluster can assume the workload until

    the failed server is brought back online or replaced.

Although our testing used components provided by certain third-party vendors, the solution does not depend on or stipulate any particular vendors. As long as components are certified and supported by Microsoft, any qualifying vendor will do.

    All individual components of the solution (for example, geographically dispersed cluster

    components) must be supported and, where appropriate, certified by Microsoft. This does not

    mean, however, that Microsoft will directly support individual third-party components. For

    component support, contact the appropriate third-party vendor.

Although a full-scale deployment was not tested, we expect published scale numbers for Lync Server 2010 to hold true. With that in mind, you should plan your capacity so that, in the event of failover, the surviving servers have sufficient capacity to continue operation. For details, see Capacity Planning in the Planning documentation.

    The information in this section should be used only as guidance. Before deploying this

    solution in a production environment, you should build and test it using your own topology.

    Note:

    Microsoft does not support implementations of this solution where network and data-

    replication latency between the primary and secondary sites exceeds 20 ms, or when the

    bandwidth does not support the user model for your organization. When latency exceeds

    20 ms, the end-user experience rapidly deteriorates. In addition, Archiving Server and

    Group Chat Compliance servers are likely to start falling behind, which may in turn cause

    Front End Servers and Group Chat lookup servers to shut down.

    Failback Procedure Recommendations

To fail back and resume normal operation at the North site, the following steps are necessary:

1. Restore the network connection between the two sites. Quality attributes of the network connection (for example, bandwidth, latency, and loss) should be comparable to the quality prior to failover.

Performance Monitoring Counters And Numbers

For each of the counters below, verify that the queue is not increasing unbounded. Establish a baseline for the counter, and monitor the counter to ensure that it does not exceed that baseline.

On Group Chat Channel and Compliance Servers, monitor the MSMQ Service\Total Messages in all Queues counter. The size of the queue will vary depending on load. Verify that the queue is not increasing unbounded. Establish a baseline for the counter, and monitor the counter to make sure that it does not exceed that baseline.

On the Directors, Edge Servers, and Front End Servers, monitor the LC:SIP - 04 - Responses\SIP - 051 - Local 503 Responses/sec counter. This counter indicates whether any server is returning errors indicating that the server is unavailable. At steady state, this counter should be approximately 0. Occasional spikes are acceptable.

On all servers, monitor the LC:SIP - 04 - Responses\SIP - 053 - Local 504 Responses/sec counter. This counter can indicate connection delays or failures with other servers. At steady state, this counter should be approximately 0. Occasional spikes are acceptable. If you see 504 error messages, check the LC:SIP - 01 - Peers\SIP - 017 - Sends Outstanding counter. This counter records the number of requests and responses in the outbound queue, which will indicate which servers are having problems.
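One way to automate baseline checks on these counters is to sample them with the built-in typeperf tool and compare against the baseline you established at steady state. The following Python sketch is illustrative only: the baseline value is a placeholder, and the parsing assumes typeperf's usual quoted CSV output.

    import subprocess

    COUNTER = r"\MSMQ Service\Total messages in all queues"
    BASELINE = 1000.0  # placeholder: use the value measured at steady state

    def sample_counter(counter=COUNTER):
        # Collect a single sample ("-sc 1"); typeperf prints a quoted CSV
        # header line followed by one "timestamp","value" data line.
        out = subprocess.run(["typeperf", counter, "-sc", "1"],
                             capture_output=True, text=True, check=True).stdout
        data = [line for line in out.splitlines() if line.startswith('"')][1]
        return float(data.split(",")[1].strip('"'))

    if sample_counter() > BASELINE:
        print("queue exceeds baseline; archiving or compliance may be falling behind")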

    DNS and HLB Topology Reference

    The following figure is a conceptual overview of how DNS, Global, and Local server load

    balancing were configured to support the metropolitan site resiliency solution.


    In this topology, Global Server Load Balancers (GSLB) were deployed at each site to provide

    failover capabilities at a site level, supporting internal client/server (https) traffic to the pool and

    external reverse proxy (https) traffic for users connected remotely. As part of this configuration,

    Local Server Load Balancers (LSLB) were also deployed at each site to manage https

    connections to Front End servers within the pool, physically located across each site. To support


    the DNS zones delegated internally and externally, the GSLB at each site monitored and routed

    https traffic destined for the following URLs:

    Internally

    https://webpool1.contoso.com

    https://admin.contoso.com

    https://dial.contoso.com

    https://meet.contoso.com

    Externally

    https://proxy.contoso.com

    https://dial.contoso.com

    https://meet.contoso.com

    To support the simple URLs referenced above, CNAME records were created, delegating the

DNS resolution to the GSLB for further routing to the LSLB of choice. For example, as internal client requests resolved to webpool1.contoso.com, they were translated to webpool1.wip.contoso.com by the GSLB, and traffic was routed to one of the local server load balancer's virtual IP addresses (VIPs) as shown.

    If a site failure occurred, the GSLB would redirect future requests to the LSLB VIP that remains.

    For all other Lync Server client-to-server and server-to-server traffic, external or internal, the

    requests were handled by DNS load balancing, which is a new load balancing capability in Lync

    Server 2010.

    Acknowledgements and References

Acknowledgements

We would like to acknowledge the following partners:

    F5 (http://www.f5.com) for providing hardware load balancers and support.

    Hewlett-Packard Development Company (http://www.hp.com/go/clxeva) for providing the

    geographically dispersed cluster solution.

    Network Equipment Technologies (www.net.com) for providing gateways, Survivable

    Branch Appliances, and support.

    Juniper Networks (www.juniper.net) for providing firewalls.

    References

    The following links provide more information about some of the topics in this section:

    For details about Windows Server 2008 R2 Failover Clustering, see the "Getting Started"

    section of "Failover Clustering" at http://go.microsoft.com/fwlink/?LinkId=208305.

    For details about the Windows Server 2008 R2 Failover Cluster Configuration Program,

    see the "Configuration Program" section of "Failover Clustering" at

    http://go.microsoft.com/fwlink/?LinkId=208306.


    For details about SQL Server Always On partners, see "SQL Server Always on Storage

    Solution Partners" at http://go.microsoft.com/fwlink/?LinkId=208307.
