Chapter 11: Planning for Metropolitan Site Resiliency
Microsoft Lync Server 2010
Published: March 2012
This document is provided as-is. Information and views expressed in this document, including
URL and other Internet Web site references, may change without notice.
Some examples depicted herein are provided for illustration only and are fictitious. No real
association or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any
Microsoft product. You may copy and use this document for your internal, reference purposes.
Copyright 2012 Microsoft Corporation. All rights reserved.
Contents
Planning for Metropolitan Site Resiliency
    The Metropolitan Site Resiliency Solution
        Overview
        Prerequisites
    Test Methodology
        Site Resiliency Topology
            Servers in the Metropolitan Site Resiliency Topology
            Hardware Load Balancers
            WAN/SAN Latency Simulator
            DNS
            Database Storage
        Test Load
        Expected Client Sign-In Behavior
        Test Results
    Findings and Recommendations
        Failback Procedure Recommendations
        Performance Monitoring Counters and Numbers
    DNS and HLB Topology Reference
    Acknowledgements and References
Planning for Metropolitan Site Resiliency
If you require Microsoft Lync Server 2010 communications software to be always available, even in the event of a severe disaster at one geographical location in your organization, you can follow the guidelines in this section to create a topology that offers metropolitan site resiliency.
In this topology, Lync Server 2010 pools span two geographically separate locations. In such a
topology, even catastrophic server failure in one location would not seriously disrupt usage,
because all connection requests would automatically be directed to servers in the same pool but
at the second location. The site resiliency solution described in this section is designed
specifically for this split-pool topology and is supported by Microsoft subject to the constraints
mentioned in Findings and Recommendations.
If your environment does not meet the requirements described in this document, see Planning for Enterprise Voice Resiliency for recommendations about providing resiliency for your Enterprise Voice workload.
Unless specifically stated otherwise, all server roles have been installed according to the product
documentation. For details, see Deployment in the Deployment documentation.
In This Section
The Metropolitan Site Resiliency Solution provides an overview of the tested and supported site resiliency solution.
Test Methodology describes the testing topology, expected behavior, and test results.
Findings and Recommendations provides practical guidance for deploying your own failover solution.
Notes:
This section does not include specific procedures for deploying the products that are used in the
solution. Specific deployment requirements are likely to vary so much among different customers
that step-by-step instructions are likely to be incomplete or misleading. For step-by-step
instructions, see the product documentation for the various software and hardware used in this
solution.
To successfully follow the topics in this section, you should have a thorough understanding of
Lync Server 2010 and Windows Server 2008 R2 Failover Clustering.
The Metropolitan Site Resiliency Solution
This section describes the tested and supported metropolitan site resiliency solution, including
prerequisites, topology, and individual components. For details about planning and deploying
Windows Server 2008 R2 and Lync Server 2010, see the documentation for these products. For
details about third-party components, see Database Storage and the product documentation
provided by the makers of those components.
In This Section
Overview
Prerequisites
Overview
The metropolitan site resiliency solution described in this section entails the following:
Splitting the Front End pool between two physical sites, hereafter called North and South.
In Topology Builder, these two geographical sites are configured as a single Lync Server 2010 site.
Creating separate geographically dispersed clusters (physically separated Windows
Server 2008 R2 failover clusters) for the following:
Back End Servers
Group Chat Database Servers
File Servers
Deploying a Windows Server 2008 R2 file share witness to which all server clusters are
connected. To determine where to place the file share witness, refer to the Windows Server
2008 R2 failover cluster documentation at http://go.microsoft.com/fwlink/?LinkId=211216.
Enabling synchronous data replication between the geographically dispersed clusters.
Deploying servers running certain server roles in both sites. These roles include Front
End Server, A/V Conferencing Server, Director, Edge Server, and Group Chat Server. The
servers of each type in both sites are contained within one pool of that type, which crosses
both sites. Except for Group Chat Server, all servers of these types, in both sites, are active.
For Group Chat Server, only the servers in one site can be active at a time. The Group Chat Servers in the other site must be inactive.
Additionally, Monitoring Server and Archiving Server can be deployed in both sites; however, only the Monitoring Server and Archiving Server in one site are associated with the other servers in your deployment. The Monitoring Server and Archiving Server in the other site are deployed but not associated with any pools, and they serve as a "hot" backup.
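The activation rules above can be expressed as a quick sanity check. The following is a hypothetical sketch only: the site names, role keys, and data layout are invented for illustration and are not a Lync Server API.

```python
# Hypothetical model of the split topology and its activation rules.
# Site names, role keys, and values are illustrative, not a Lync Server API.

TOPOLOGY = {
    "North": {"FrontEnd": "active", "AVConf": "active", "Director": "active",
              "Edge": "active", "GroupChat": "active", "Monitoring": "primary"},
    "South": {"FrontEnd": "active", "AVConf": "active", "Director": "active",
              "Edge": "active", "GroupChat": "inactive", "Monitoring": "hot-backup"},
}

def validate(topology):
    """Return a list of rule violations for the split-pool topology."""
    errors = []
    # Front End, A/V Conferencing, Director, and Edge are active in both sites.
    for role in ("FrontEnd", "AVConf", "Director", "Edge"):
        if any(site[role] != "active" for site in topology.values()):
            errors.append(f"{role} must be active in every site")
    # Group Chat may be active in only one site at a time.
    if sum(site["GroupChat"] == "active" for site in topology.values()) != 1:
        errors.append("Group Chat must be active in exactly one site")
    # Only one Monitoring/Archiving server is associated with the pools.
    if sum(site["Monitoring"] == "primary" for site in topology.values()) != 1:
        errors.append("exactly one site hosts the primary Monitoring/Archiving server")
    return errors

print(validate(TOPOLOGY))   # -> []
```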
The following figure provides an overview of the resulting topology.
With the topology depicted in the preceding figure, a single site could become unavailable for any
reason, and users would still be able to access supported unified communications services within
minutes rather than hours. For a detailed depiction of the topology used to test the solution
described in this section, see Site Resiliency Topology.
Scope of Testing and Support
This site resiliency solution has been tested and is supported by Microsoft for the following
workloads:
IM and presence
Peer-to-peer scenarios; for example, peer-to-peer audio/video sessions
IM conferencing
Web conferencing
A/V conferencing
Application sharing
Enterprise Voice and Telephony Integration
Enterprise Voice applications, including Conferencing Attendant, Conferencing
Announcement service, Outside Voice Control, and Response Group service
Approved unified communications devices
Simple URLs
Group Chat
Exchange UM
Workloads That Are Out of Scope
The following scenarios can be deployed in the metropolitan site resiliency topology, but the
automatic failover of these workloads is not designed or supported:
Federation and Public IM Connectivity
Remote call control
Microsoft Lync Web App
XMPP Gateway
Prerequisites
The solution described in this section assumes that your Lync Server deployment meets both the
core requirements described in the product documentation and all of the following prerequisites.
To qualify for Microsoft support, your failover solution must meet all these prerequisites.
All servers that are part of geographically dispersed clusters must be part of the same stretched VLAN, using the same Layer 2 broadcast domain. All other internal servers running Lync Server server roles can be on a subnet within that server's local data center.
Edge Servers must be in the perimeter network, and should be on a different subnet than the
internal servers. Also, the perimeter network need not be stretched between sites.
Synchronous data replication must be enabled between the primary and secondary sites,
and the vendor solution that you employ must be supported by Microsoft.
Round-trip latency between the two sites must not be greater than 20 ms.
Available bandwidth between the sites must be at least 1 Gbps.
A geographically dispersed cluster solution based on Windows Server 2008 R2 Failover
Clustering must be in place. That solution must be certified and supported by Microsoft, and it
must pass cluster validation as described in the Windows Server 2008 R2 documentation.
For details, see the What is cluster validation? section of Failover Cluster Step-by-Step Guide: Validating Hardware for a Failover Cluster at http://go.microsoft.com/fwlink/?linkid=142436.
All geographically dispersed cluster servers must be running the 64-bit edition of
Windows Server 2008 R2.
All your servers that are running Lync Server must run the Lync Server 2010 version.
All database servers must be running the 64-bit edition of one of the following:
Depending on the components you choose for your particular implementation of this solution, you might need help from your vendor of choice to deploy this solution.
This figure is representative of the topology tested, but for purposes of clarity, it does not
necessarily depict the number of servers used in each pool in the actual test topology. For
example, in the actual test topology there were four Front End Servers in each site.
As shown in the figure, the tested topology deployed two central sites and a branch office, along
with a third location that hosted a file server functioning as a Windows Server 2008 R2 Failover
Clustering Service file share witness. For details about using a witness in a failover cluster, see
http://go.microsoft.com/fwlink/?LinkId=211004. The file share witness is available to all Windows Server 2008 R2 Failover Cluster nodes in both central sites. All Windows Server 2008 R2 Failover Clusters used in this solution use the Node and File Share Majority quorum mode.
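With Node and File Share Majority, each cluster node and the file share witness holds one vote, and the cluster stays online only while a majority of all votes is reachable. A minimal sketch of that arithmetic (the node counts in the example are illustrative):

```python
def has_quorum(reachable_nodes, total_nodes, witness_reachable):
    """Node and File Share Majority: each node plus the witness holds one
    vote; the cluster keeps quorum while a majority of votes is reachable."""
    total_votes = total_nodes + 1            # nodes + file share witness
    votes = reachable_nodes + (1 if witness_reachable else 0)
    return votes > total_votes // 2

# Two-node geographically dispersed cluster plus witness (3 votes total):
print(has_quorum(1, 2, True))    # surviving node + witness -> cluster stays up
print(has_quorum(1, 2, False))   # surviving node alone -> cluster goes offline
```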
The following topics discuss each of the solution components shown in the preceding figure.
In This Section
Servers in the Metropolitan Site Resiliency Topology
Hardware Load Balancers
WAN/SAN Latency Simulator
DNS
Database Storage
Servers in the Metropolitan Site Resiliency Topology
The metropolitan site resiliency topology can include different types of server roles, as follows.
Front End Pool
This pool hosts all Lync Server users. Each site, North and South, contains four identically
configured Front End Servers. The Back-End Database is deployed as two Active/Passive SQL
Server 2008 geographically dispersed cluster nodes, running on the Windows Server 2008 R2
Failover Clustering service. Synchronous data replication is required between the two Back-End Database Servers.
In our test topology, the Mediation Server was collocated with Front End Server. Topologies with
stand-alone Mediation Server are also supported.
Our test topology used DNS load balancing to balance the SIP traffic in the pool, with hardware
load balancers deployed for the HTTP traffic.
Topologies that use only hardware load balancers to balance all types of traffic are also supported
for site resiliency.
A/V Conferencing Pool
We deployed a single A/V Conferencing pool with four A/V Conferencing Servers, two in each
site.
Director Pool
We deployed a single Director pool with four Directors, two in each site.
Edge Pool
The Edge Servers ran all services (Access Edge service, A/V Conferencing Edge service, and
Web Conferencing Edge service), but we tested them only for remote-user scenarios. Federation
and public IM connectivity are beyond the scope of this document.
We recommend DNS load balancing for your Edge pool, but we also support using hardware load
balancers. The internal Edge interface and external Edge interface must use the same type of
load balancing. You cannot use DNS load balancing on one Edge interface and hardware load
balancing on the other Edge interface. If you use hardware load balancers for the Edge pool, the
hardware load balancer at one site serves as the primary load balancer and responds to requests
with the virtual IP address of the appropriate Edge service. If the primary load balancer is
unavailable, the secondary hardware load balancer at the other site would take over. Each site
has its own IP subnet; perimeter networks were not stretched across the North and South sites.
Group Chat Servers
Each site hosts both a Channel service and a Lookup service, but these services can be active in
only one of the sites at a time. The Channel service and the Lookup service in the other site must
be stopped or disabled. In the event of site failover, manual intervention is required to start these
services at the failover site.
Each site also hosts a Compliance Server, but only one of these servers can be active at a time. In the event of site failover and failback, manual intervention is required to restore the service. For
details, see Backing Up the Compliance Server in the Operations documentation.
We deployed the Group Chat back-end database as two Active/Passive SQL Server 2008
geographically dispersed cluster nodes running on top of Windows Server 2008 R2 Failover
Clustering. Data replication between the two back-end database servers must be synchronous. A
single database instance is used for both Group Chat and compliance data.
Monitoring Server and Archiving Server
For Monitoring Server and Archiving Server, we recommend a hot standby deployment. Deploy
these server roles in both sites, on a single server in each site. Only one of these servers is
active, and the pools in your deployment are all associated with that active server. The other
server is deployed and installed, but not associated with any pool.
If the primary server becomes unavailable, you use Topology Builder to manually associate the
pools with the standby server, which then becomes the primary server.
File Server Cluster
We deployed a file server as a two-node geographically dispersed cluster resource using
Windows Server 2008 R2 Failover Clustering. Synchronous data replication was required. Any
Lync Server function that requires a file share and is split across the two sites must use this file
share cluster. This includes the following:
Meeting content location
Meeting metadata location
Meeting archive location
Address Book Server file store
Application data store
Client Update data store
Group Chat compliance file repository
Group Chat upload files location
Reverse Proxy
A reverse proxy server is deployed at each site. In our test topology, these servers ran Microsoft Forefront Threat Management Gateway, and each ran independently of the other. A hardware load balancer was deployed at each site.
Hardware Load Balancers
Even when you deploy DNS load balancing, you need hardware load balancers to load balance
the HTTP traffic to the Front End pools and Director pools.
Additionally, we deployed hardware load balancers in the perimeter network for the reverse proxy
servers.
To provide the highest level of load balancing and high availability, a pair of hardware load balancers (HLBs) was deployed with a Global Server Load Balancer (GSLB) at each site. With all the load balancers in constant communication with each other regarding site and server health, no single device failure at either central site would cause a service disruption for any of the users who are currently connected.
This test scenario used both global server (the F5 BIG-IP GTM) and local server (the F5 BIG-IP LTM) HLBs. The global server load balancers were implemented to manage traffic
to each site based upon central site availability and health, while the local server load balancers
managed connections within each site to the local servers. This implementation has the following
advantages:
Fully-meshed system for the highest level of fault tolerance at a local and global level.
Complete segmentation of internal and external traffic within the central site.
The ability, if you want, to leverage the hardware to load balance all connections to Front End Servers, Edge Servers, and Directors.
Although optimal from some perspectives, this deployment does have two distinct disadvantages:
you need to purchase more HLBs, and the numerous devices create a more complex
configuration to manage. Consolidation of the load balancing infrastructure is definitely possible
and in some environments is beneficial. For instance, many deployment designs include a single
HLB instance or pair in each central site. Although the HLB spans multiple subnets in this design,
the load balancing logic remains the same. F5 produced architectural guidance that explores the
tradeoffs between different network designs. For details, see http://go.microsoft.com/fwlink/?LinkId=212143. For details about deployments leveraging HLBs for Lync Server without GSLBs, see the Office Communications Server 2007 R2 Site Resiliency white paper at http://go.microsoft.com/fwlink/?LinkId=211387. The deployments described in that white paper also provide a valid reference architecture for Lync Server 2010.
By leveraging both local and global load balancers, we achieved both server and site resiliency
while using a single URL for users to connect to. The GTM resolves a single URL to different IP
addresses based on the selected load balancing algorithm and availability of global services. By
having the authoritative Windows DNS servers (contoso.com) delegate the URL
(pool.contoso.com) to the GTM, users connecting to pool.contoso.com are sent to the appropriate
site at the time of DNS resolution. The local server load balancer then gets the connection and
load balances it to the appropriate server.
The HLBs were configured to monitor the Front End pool members by using an HTTP or HTTPS monitor, which gives the load balancers the best information about the health and performance of the servers. The HLBs then use this information to load balance the incoming connections to the best local Front End. Using a feature called Priority Group Activation, we also configured the HLBs to proxy connections to the other central site if all the local Front Ends reached capacity or no longer functioned.
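The spill-over behavior just described can be sketched as follows. This is a simplified model of priority-based activation, not F5 configuration; the server names and health data are invented for illustration:

```python
# Simplified sketch of priority-based activation: prefer local Front End
# Servers, and proxy to the other central site only when no local server
# can take the connection. Names and health data are illustrative.

LOCAL = [{"name": "fe-north-1", "healthy": True,  "at_capacity": True},
         {"name": "fe-north-2", "healthy": False, "at_capacity": False}]
REMOTE = [{"name": "fe-south-1", "healthy": True, "at_capacity": False}]

def pick_front_end(local, remote):
    """Return a usable local server if any; otherwise spill to the remote site."""
    def usable(pool):
        return [s for s in pool if s["healthy"] and not s["at_capacity"]]
    candidates = usable(local) or usable(remote)   # remote only on local exhaustion
    return candidates[0]["name"] if candidates else None

print(pick_front_end(LOCAL, REMOTE))   # local servers down/full -> fe-south-1
```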
The global server load balancers (GTM) were configured to monitor the HLBs in each site and to
direct users to the best performing site. The GTM can be configured to send all users to a specific
site in the case of active/standby central sites (as was the case for this test), or load balance
users between the sites for active/active deployments. If one site reaches capacity or becomes
unavailable, the GTM directs users to the other available site(s).
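For the active/standby arrangement used in this test, the GTM's decision reduces to answering DNS queries with the preferred site's VIP while that site is healthy, and with the other site's VIP otherwise. A hedged sketch of that logic; the VIP addresses are invented, and a real GTM derives health from its monitors:

```python
# Sketch of active/standby GSLB name resolution. Site VIPs and health flags
# are illustrative; a real GTM derives site health from its monitors.

SITES = [{"name": "North", "vip": "192.0.2.10",    "healthy": False},  # preferred
         {"name": "South", "vip": "198.51.100.10", "healthy": True}]   # standby

def resolve(fqdn, sites):
    """Answer a query for the pool FQDN with the first healthy site's VIP."""
    for site in sites:                      # list order encodes site preference
        if site["healthy"]:
            return site["vip"]
    return None                             # no healthy site: resolution fails

print(resolve("pool.contoso.com", SITES))  # North down -> 198.51.100.10
```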
WAN/SAN Latency Simulator
To see the impact of network latency between the two sites, we deployed a network latency simulator. The simulator allowed us to test different latencies and arrive at a recommendation for the maximum acceptable and supported latency.
Besides testing network latency, we also wanted to test the impact of latency on data storage replication. To test storage latency, we connected the two storage nodes (one at each site) by means of a Fiber Channel to IP gateway. This connection enabled data replication over the IP network, which made it possible to use the network latency simulator to introduce latency along the data path.
Note:
The WAN/SAN latency simulator was used for testing purposes only. The simulator is not
a requirement for the solution described in this paper and is not required for Microsoft
support.
DNS
This test topology used a split-brain DNS configuration; that is, the parent DNS namespace was
contoso.com, but resolution records for internal and external users were managed separately.
This configuration allows for advertising a single URL for any specific Lync Server service while
maintaining separate servers and routes to access those services for internal and external users.
DNS and DNS load balancing were deployed according to Microsoft best practices. For details,
see DNS Requirements for Front End Pools, DNS Requirements for Automatic Client Sign-In,
Determining DNS Requirements, and DNS Load Balancing in the Planning documentation.
Windows DNS can handle all DNS responsibilities for Lync Server services; however, in this case
we used the F5 Global Traffic Manager (GTM) for more granular site awareness and load
distribution.
Windows DNS was authoritative for contoso.com for both internal and external user resolution.
Service names (such as pool1 for HTTPS requests) needing global load balancing were
delegated to the GTMs so that Windows DNS could maintain ownership of the overall
contoso.com namespace but GTM could also load balance what was needed. In this case, we
used the GTM to manage resolution records for HTTPS access; however, this approach can be
expanded to cover records for other services as well.
The following lists provide a configuration snapshot of both the internal and external DNS servers
that were used in our testing.
External Windows DNS
Windows DNS is used, and is authoritative for the contoso.com zone.
ap.contoso.com points to the external network interface of the Access Edge service.
webconf.contoso.com points to the external network interface of the Web Conferencing
Edge service.
avedge.contoso.com points to the external network interface of the A/V Edge service.
The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this
case, the F5 GTM.
proxy.contoso.com is CNAMEd to proxy.wip.contoso.com, thus granting GTM the
resolution and load balancing responsibilities.
proxy.wip.contoso.com is configured on the GTM to load balance users to the HTTP
reverse proxies.
Internal Windows DNS
Windows DNS is used, and is authoritative for the contoso.com zone.
The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this
case the F5 GTM.
webpool1.contoso.com is CNAMEd to webpool1.wip.contoso.com, thus granting GTM the
resolution and load balancing responsibilities.
webpool1.wip.contoso.com is configured on the GTM to load balance users to the Front
End VIPs of the load balancers.
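The records above form a short chain: a CNAME in the Windows-owned contoso.com zone points into the delegated wip.contoso.com zone, which the GTM answers. A sketch of the internal resolution path (the VIP address is invented for illustration):

```python
# Sketch of the internal resolution chain described above. Windows DNS owns
# contoso.com; wip.contoso.com is delegated to the GTM. The VIP is invented.

WINDOWS_DNS = {"webpool1.contoso.com": ("CNAME", "webpool1.wip.contoso.com")}
GTM_DNS = {"webpool1.wip.contoso.com": ("A", "192.0.2.20")}  # Front End VIP

def lookup(name, max_hops=5):
    """Follow CNAMEs from Windows DNS into the delegated GTM zone."""
    for _ in range(max_hops):
        if name.endswith(".wip.contoso.com"):   # delegated zone -> ask the GTM
            rtype, value = GTM_DNS[name]
        else:
            rtype, value = WINDOWS_DNS[name]
        if rtype == "A":
            return value
        name = value                            # chase the CNAME
    raise RuntimeError("CNAME chain too long")

print(lookup("webpool1.contoso.com"))   # -> 192.0.2.20
```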
Database Storage
In order to implement a geographically dispersed Windows Server 2008 R2 Failover Clustering
solution, we used two HP StorageWorks Enterprise Virtual Array (EVA) Disk Enclosure storage
area network (SAN) systems (one per site) as database storage. Storage was carved into disk
groups, which in turn were associated with their respective clusters. All disk groups used
synchronous data replication. A SAN cluster extension was used as a Windows Server 2008 R2 Failover Clustering resource to facilitate storage failover and failback.
One of the scenarios we wanted to test was the impact of latency on storage data replication between the two sites. One problem we encountered was that the HP StorageWorks arrays have Fiber Channel interfaces, but the network latency simulator we used did not support those interfaces. To connect the two, we used a Fiber Channel to IP gateway that HP provided.
Test Load
Stress testing included the following:
25,000 concurrent users were using the servers.
6,000 users were in IM sessions, with 50% of those IM sessions having more than two users.
3,000 users were in peer-to-peer A/V calls.
3,000 users were in A/V conferences.
500 active users were in application sharing conferences.
3,000 active users were in data collaboration conferences.
Expected Client Sign-In Behavior
This section describes the client sign-in behavior during normal operation and failover. This
description does not include all the details of signing in but is intended only to illustrate the
general flow when a user signs in to a metropolitan site resiliency topology that is split across
geographical sites.
During normal operation, with DNS load balancing deployed, client sign-in with the site resilient
topology works basically as it does in any supported topology.
Normal Sign-In Operation
1. A remote user signs in to Lync 2010. Lync 2010 queries the DNS server for its connection endpoint (the Edge Server in this specific instance). The DNS server returns the list of the FQDNs of the Access Edge service on each Edge Server.
2. The client chooses one of these FQDNs at random and attempts to connect to that Edge
Server. This Edge Server may be at either site. If this attempt fails, the client will keep trying
different Edge Servers until it succeeds.
3. Lync 2010 connects by using TLS to one of the Edge Servers.
4. The Edge Server forwards the request to a Director. The Director may be at either site.
5. The Director determines the pool where the user is homed and then forwards the request
to that pool.
6. The DNS server again returns the list of Front End Servers in the pool, including those servers at both sites. Each user has an assigned list of Front End Servers to which the user's client connects: if the first server on the list for that client is currently unavailable, the client tries the next one on the list, and keeps trying until it succeeds. In this example, the request is forwarded to a Front End Server at the North site.
7. The response is returned to Lync 2010.
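The random-pick-and-retry behavior in steps 1 and 2 can be sketched as follows. The FQDNs and the connectivity check are hypothetical stand-ins for a real TLS connection attempt:

```python
import random

# Sketch of client-side DNS load balancing: pick a returned FQDN at random,
# and keep trying others until one connects. Names are illustrative.

def try_connect(fqdn, reachable):
    """Stand-in for a TLS connection attempt; a real client opens a socket."""
    return fqdn in reachable

def sign_in(dns_results, reachable):
    candidates = list(dns_results)
    random.shuffle(candidates)            # random choice spreads the load
    for fqdn in candidates:               # fall through to the next on failure
        if try_connect(fqdn, reachable):
            return fqdn
    return None                           # no Edge Server reachable

edges = ["edge-north.contoso.com", "edge-south.contoso.com"]
print(sign_in(edges, reachable={"edge-south.contoso.com"}))  # North down
```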
Failover Sign-In Operation
The following figures show typical call flow during a user sign-in, in the event that the North site
fails. Diagrams have been simplified to highlight the most important aspects of the topology.
The following figure shows the flow for an internal user, with automatic configuration.
The following figure shows the flow for an internal user, with manual configuration.
The following figure shows the flow for an external user.
Test Results
This topic describes the results of Microsoft's testing of the failover solution proposed in this section.
Central Site Link Latency
We used a network latency simulator to introduce latency on the simulated WAN link between
North and South. The recommended topology supports a maximum latency of 20 ms between the
geographical sites. Improvements in the architecture of Lync Server 2010 enable the allowed
latency to be higher than the maximum of 15 ms allowed in the Microsoft Office Communications
Server 2007 R2 metropolitan site resiliency topology.
15 ms. We started by introducing a 15 ms round-trip latency into both the network path
between two sites and the data path used for data replication between the two sites. The
topology continued to operate without problem under these conditions and under load.
20 ms. We then began to increase latency. At 20 ms round-trip latency for both network and data traffic, the topology continued to operate without problem. 20 ms is the maximum supported round-trip latency for this topology in Lync Server 2010.
Important:
Microsoft will not support solutions whose network and data latency exceeds 20 ms.
30 ms. At 30 ms round-trip latency, we started to see degradation in performance. In
particular, message queues for archiving and monitoring databases started to grow. As a
result of these increased latencies, user experience also deteriorated. Sign-in time and
conference creation time both increased, and the A/V experience degraded significantly. For
these reasons, Microsoft does not support a solution where round-trip latency exceeds 20 ms.
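When qualifying a site pair for this topology, measured round-trip times can be checked against these findings. A hedged sketch; the sample values and the classification labels are invented, only the thresholds come from the test results above:

```python
# Sketch applying the latency findings above: <= 20 ms round trip is
# supported; beyond that, queues grow and user experience degrades.
# Sample values and labels are invented for illustration.

SUPPORTED_RTT_MS = 20

def classify_link(rtt_samples_ms):
    """Judge a WAN/SAN link by its worst observed round-trip latency."""
    worst = max(rtt_samples_ms)
    if worst <= 15:
        return "comfortable"     # matches the 15 ms test point
    if worst <= SUPPORTED_RTT_MS:
        return "supported"       # at the 20 ms support boundary
    return "unsupported"         # 30 ms showed queue growth and A/V degradation

print(classify_link([12, 14, 15]))   # -> comfortable
print(classify_link([18, 20]))       # -> supported
print(classify_link([25, 30]))       # -> unsupported
```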
Failover
As previously mentioned, all Windows Server 2008 R2 clusters in the topology used a Node and File Share Majority quorum. As a result, to simulate site failover we had to isolate all servers and clusters at the North site from both the South site and the witness site. We did this with a dirty shutdown of all servers at the North site.
Results and observations following failure of the North site are as follows:
The passive SQL Server cluster node became active within minutes. The exact amount of
time can vary and depends on the details of the environment. Internal users connected to the
North site were signed out and then automatically signed back in. During the failover,
presence was not updated, and new actions, such as new IM sessions or conferences, failed
with appropriate errors. No more errors occurred after the failover was complete.
As long as there is a valid network path between peers, ongoing peer-to-peer calls
continued without interruption.
UC-PSTN calls were disconnected if the gateway supporting the call became
unavailable. In that case, users could manually re-establish the call.
Lync 2010 users connected to North site were disconnected and automatically
reconnected to the South site within minutes. Users could then continue as before.
In order to reconnect, Group Chat client users had to sign out and sign back in. The
Group Chat Channel service and Lookup service in the South site, which were normally
stopped or disabled at the site, had to be started manually.
Conferences hosted in the North site automatically failed over to the South site. All users
were prompted to rejoin the conference after failover completed. Clients could rejoin the
meeting. Meeting recording continued during the failover. Archiving stopped until the hot
standby Archiving Server was brought online.
Manageability continued to work while the North site was down. For example, users could
be moved from the Survivable Branch Appliance to the Front End pool.
16
-
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
20/26
Chapter 11: Planning for Metropolitan Site Resiliency
After the North site went offline, SQL Server clusters and file share clusters in the South
site came online in a few minutes.
Site failover duration as observed in our testing was only a few minutes.
Failback
For the purposes of our testing, we defined failback as restoring all functionality to the North site
such that users can reconnect to servers at that site. After the North site was restored, all cluster
resources were moved back to their nodes at the North site.
We recommend that you perform your failback in a controlled manner, preferably during off hours,
as some user disruption can happen during the failback procedures. Results and observations
following failback of the North site are as follows:
Before cluster resources can be moved back to their nodes at the North site, storage had
to be fully resynchronized. If storage has not been resynchronized, clusters will fail to come
online. The resynchronization of the storage happened automatically.
To ensure minimal user impact, the clusters were set not to automatically fail back. Our
recommendation is to postpone failback until the next maintenance window after ensuring
storage has fully resynchronized.
The Front End Servers will come online when they are able to connect to the Active
Directory Domain Services. If the Back End Database is not yet available when the Front End
Servers come online, users will have limited functionality.
After the Front End Servers in the North site are online, new connections will be routed to
them. Users who are online, and who usually connect through Front End Servers in the North
site, will be signed out and then signed back in on their usual North site server.
If you want to prevent the Front End Servers at the North site from automatically coming back
onlinefor example, if you want better control over the whole process or if latency between
the two sites has not been restored to acceptable levelswe recommend shutting down theFront End Servers.
Site failback duration as observed in our testing was under one minute.
Findings and Recommendations
The metropolitan site resiliency solution has been tested and is officially supported by Microsoft;
however, before deploying this topology, you should consider the following findings and
recommendations.
Findings
Cluster failover worked as expected. No manual steps were required, with the exception
of Group Chat Server, Archiving Server, and Monitoring Server. Front End Servers were able
to reconnect to the back-end database servers after the failover and resume normal service.
Microsoft Lync 2010 clients reconnected automatically.
Cluster failback worked as expected. It is important to ensure that storage has
resynchronized before failback begins.
17
-
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
21/26
Chapter 11: Planning for Metropolitan Site Resiliency
Users will see a quick sign out/sign in sequence as they are transferred back to their usual
Front End Server, when it becomes available again.
When failover occurred, the Group Chat Channel service Lookup service at the failover
site had to be started manually. Additionally, the Group Chat Compliance Server setting had
to be updated manually. For details, see Backing Up the Compliance Server in theOperations documentation.
Recommendations
Although testing used two nodes (one per site) in each SQL Server cluster, we
recommend deploying additional nodes to achieve in-site redundancy for all components in
the topology. For example, if the active SQL Server node becomes unavailable, a backup
SQL Server node in the same site and part of the same cluster can assume the workload until
the failed server is brought back online or replaced.
Although our testing used components provided by certain third-party vendors, thesolution does not depend on or stipulate any particular vendors. As long as components are
certified and supported by Microsoft, any qualifying vendor will do.
All individual components of the solution (for example, geographically dispersed cluster
components) must be supported and, where appropriate, certified by Microsoft. This does not
mean, however, that Microsoft will directly support individual third-party components. For
component support, contact the appropriate third-party vendor.
Although a full-scale deployment was not tested, we expect published scale numbers for
Lync Server 2010 to hold true. With that in mind, you should plan for enough capacity that
sufficient capacity remains to continue operation in the event of failover. For details, see
Capacity Planning in the Planning documentation.
The information in this section should be used only as guidance. Before deploying this
solution in a production environment, you should build and test it using your own topology.
Note:
Microsoft does not support implementations of this solution where network and data-
replication latency between the primary and secondary sites exceeds 20 ms, or when the
bandwidth does not support the user model for your organization. When latency exceeds
20 ms, the end-user experience rapidly deteriorates. In addition, Archiving Server and
Group Chat Compliance servers are likely to start falling behind, which may in turn cause
Front End Servers and Group Chat lookup servers to shut down.
Failback Procedure Recommendations
To failback and resume normal operation at the North site, the following steps are necessary:
1. Restore network connection between two sites. Quality attributes of the network
connection (for example, bandwidth, latency, and loss) should be comparable to the quality
prior to failover.
18
-
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
22/26
-
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
23/26
Chapter 11: Planning for Metropolitan Site Resiliency
Verify that the queue is not increasing unbounded. Establish a baseline for the counter, and
monitor the counter to ensure that it does not exceed that baseline.
On Group Chat Channel and Compliance Servers, monitor the MSMQ
Service\Total Messages in all Queues counter. The size of the queue will vary
depending on load. Verify that the queue is not increasing unbounded. Establish a baselinefor the counter, and monitor the counter to make sure that it does not exceed that baseline.
On the Directors, Edge Servers, and Front End Servers, monitor the LC:SIP 04
Responses object\ SIP 051 Local 503 Responses/sec counter. This counter
indicates if any server is returning errors indicating that the server is unavailable. At steady
state, this counter should be approximately 0. Occasional spikes are acceptable.
On all servers monitor the LC:SIP 04 Responses \SIP 053 Local 504
Responses/sec counter. This counter can indicate connection delays or failures with
other servers. At steady state, this counter should be approximately 0. Occasional spikes are
acceptable. If you see 504 error messages, check the LC:SIP 01 Peers\SIP 017 -
Sends Outstanding counter. This counter records the number of requests and responses in
the outbound queue, which will indicate which servers are having problems.
DNS and HLB Topology Reference
The following figure is a conceptual overview of how DNS, Global, and Local server load
balancing were configured to support the metropolitan site resiliency solution.
20
-
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
24/26
Chapter 11: Planning for Metropolitan Site Resiliency
In this topology, Global Server Load Balancers (GSLB) were deployed at each site to provide
failover capabilities at a site level, supporting internal client/server (https) traffic to the pool and
external reverse proxy (https) traffic for users connected remotely. As part of this configuration,
Local Server Load Balancers (LSLB) were also deployed at each site to manage https
connections to Front End servers within the pool, physically located across each site. To support
21
-
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
25/26
Chapter 11: Planning for Metropolitan Site Resiliency
the DNS zones delegated internally and externally, the GSLB at each site monitored and routed
https traffic destined for the following URLs:
Internally
https://webpool1.contoso.com
https://admin.contoso.com
https://dial.contoso.com
https://meet.contoso.com
Externally
https://proxy.contoso.com
https://dial.contoso.com
https://meet.contoso.com
To support the simple URLs referenced above, CNAME records were created, delegating the
DNS resolution to the GSLB for further routing to the LSLB of choice. For example, as internal
client requests resolved to webpool1.contoso.com, they were translated towebpool1.wip.contoso.com by the GSLB and traffic was routed to one of the local server load
balancers virtual IP addresses (VIPs) as shown.
If a site failure occurred, the GSLB would redirect future requests to the LSLB VIP that remains.
For all other Lync Server client-to-server and server-to-server traffic, external or internal, the
requests were handled by DNS load balancing, which is a new load balancing capability in Lync
Server 2010.
Acknowledgements and References
AcknowledgementsWe would like to acknowledge the following partners:
F5 (http://www.f5.com) for providing hardware load balancers and support.
Hewlett-Packard Development Company (http://www.hp.com/go/clxeva) for providing the
geographically dispersed cluster solution.
Network Equipment Technologies (www.net.com) for providing gateways, Survivable
Branch Appliances, and support.
Juniper Networks (www.juniper.net) for providing firewalls.
References
The following links provide more information about some of the topics in this section:
For details about Windows Server 2008 R2 Failover Clustering, see the "Getting Started"
section of "Failover Clustering" at http://go.microsoft.com/fwlink/?LinkId=208305.
For details about the Windows Server 2008 R2 Failover Cluster Configuration Program,
see the "Configuration Program" section of "Failover Clustering" at
http://go.microsoft.com/fwlink/?LinkId=208306.
22
http://go.microsoft.com/fwlink/?LinkId=208305http://go.microsoft.com/fwlink/?LinkId=208306http://go.microsoft.com/fwlink/?LinkId=208305http://go.microsoft.com/fwlink/?LinkId=208306 -
7/27/2019 Chapter 11 Planning for Metropolitan Site Resiliency
26/26
Chapter 11: Planning for Metropolitan Site Resiliency
For details about SQL Server Always On partners, see "SQL Server Always on Storage
Solution Partners" at http://go.microsoft.com/fwlink/?LinkId=208307.
http://go.microsoft.com/fwlink/?LinkId=208307http://go.microsoft.com/fwlink/?LinkId=208307http://go.microsoft.com/fwlink/?LinkId=208307