
Exchange Server 2013 Site Resilience

Scott Schnoll

Agenda

• The Preferred Architecture

• Namespace Planning and Principles

• Datacenter Switchovers and Failovers

• Dynamic Quorum and DAGs

The Preferred Architecture

Site Resilience changes in Exchange 2013

Frontend and backend recovery are independent

Most protocol access in Exchange Server 2013 is HTTP
• DNS resolves to multiple IP addresses
• HTTP clients have built-in IP failover capabilities
• Clients skip past IPs that produce hard TCP failures

The namespace is no longer a single point of failure
• Single or multiple namespace options
• Admins can switch over by removing a VIP from DNS or disabling it
• No dealing with DNS latency
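Since clients fail over between the IP addresses a name resolves to, it is worth checking what the namespace actually returns. A minimal sketch in PowerShell, assuming the single-namespace design used throughout this deck (mail.contoso.com with one VIP per datacenter):

  # Check that the shared namespace returns an A record for each datacenter VIP;
  # HTTP clients will skip any address that produces a hard TCP failure.
  Resolve-DnsName -Name mail.contoso.com -Type A | Select-Object Name, IPAddress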

Preferred Architecture: Namespace Design

For a site-resilient datacenter pair, a single namespace per protocol is deployed across both datacenters

• Autodiscover: autodiscover.contoso.com
• HTTP: mail.contoso.com
• IMAP: imap.contoso.com
• SMTP: smtp.contoso.com

Load balancers are configured without session affinity, with one VIP per datacenter

Round-robin, geo-DNS, or other solutions are used to distribute traffic equally across both datacenters


Preferred Architecture: DAG Design

• Each datacenter should be its own Active Directory site

• Deploy the unbound DAG model, spanning each DAG across the two datacenters

• Distribute active copies across all servers in the DAG

• Deploy 4 copies, 2 copies in each datacenter

• One copy will be a lagged copy (7 days) with automatic play down enabled (see the sketch after this list)

• Native Data Protection is used

• Single network is used for MAPI and replication traffic

• Third datacenter used for Witness server, if possible

• Increase DAG size density before creating new DAGs
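A minimal sketch of the lagged copy configuration from the list above, assuming a DAG named DAG1 and a copy of a database DB01 hosted on server MBX4 (hypothetical names):

  # Set a 7-day replay lag on the designated lagged copy
  # (format is days.hours:minutes:seconds).
  Set-MailboxDatabaseCopy -Identity "DB01\MBX4" -ReplayLagTime 7.00:00:00

  # Enable automatic play down of the lag when copy availability drops.
  Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true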

Preferred Architecture

[Diagram: a DAG spans both datacenters in each region, with one mail VIP per datacenter and the witness server in a third location. Regional users resolve their regional namespace via DNS: Selina (somewhere in NA) resolves na.contoso.com to the NA VIPs, and Batman (somewhere in Europe) resolves eur.contoso.com to the European VIPs.]

Namespace Planning & Principles

Namespace Planning

• No need for the namespaces required by Exchange 2010
• Can still deploy regional namespaces to control traffic
• Can still have specific namespaces for protocols

• Two namespace models: Bound Model and Unbound Model

• Leverage split-DNS to minimize namespaces and control connectivity

• Deploy separate namespaces for internal and external Outlook Anywhere host names
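A minimal sketch of the last point, assuming a server named EX1 and a hypothetical internal host name mail-int.contoso.com:

  # Publish different Outlook Anywhere host names internally and externally.
  Set-OutlookAnywhere -Identity "EX1\Rpc (Default Web Site)" `
      -InternalHostname mail-int.contoso.com -InternalClientsRequireSsl $true `
      -ExternalHostname mail.contoso.com -ExternalClientsRequireSsl $true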

Bound Model

[Diagram: users are bound to the datacenter where their DAG's copies are active. Sue (somewhere in NA) resolves mail.contoso.com to the VIP in front of DAG1's active copies; Jane (somewhere in NA) resolves mail2.contoso.com to the VIP in front of DAG2's active copies. Each datacenter holds the active copies for one DAG and the passive copies for the other.]

Unbound Model

[Diagram: a single namespace, mail.contoso.com, round-robins between VIP #1 and VIP #2 across both datacenters; Sue (somewhere in NA) can be directed to either VIP, and the DAG spans both datacenters.]

Load Balancing

• Exchange 2013 no longer requires session affinity to be maintained on the load balancer

• For each protocol session, CAS now maintains a 1:1 relationship with the Mailbox server hosting the user’s data

• Load balancer configuration and health probes will factor into namespace design

• Remember to configure health probes to monitor healthcheck.htm, otherwise the load balancer and Managed Availability will be out of sync
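A minimal sketch of what such a probe checks, using the mail.contoso.com namespace from earlier; Managed Availability serves a per-protocol healthcheck.htm page that returns 200 OK while the protocol is healthy:

  # Probe the OWA health check page the way a load balancer would; a 200
  # response means Managed Availability considers OWA healthy on that endpoint.
  Invoke-WebRequest -Uri "https://mail.contoso.com/owa/healthcheck.htm" -UseBasicParsing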

Single Namespace / Layer 4

[Diagram: the user resolves mail.contoso.com and autodiscover.contoso.com to a Layer 4 load balancer, which runs a single health check per CAS covering the OWA, ECP, EWS, EAS, OAB, MAPI, RPC, and Autodiscover virtual directories.]

Single Namespace / Layer 7

[Diagram: the same namespaces, but a Layer 7 load balancer probes each virtual directory (OWA, ECP, EWS, EAS, OAB, MAPI, RPC, Autodiscover) individually.]

The health check executes against each virtual directory

Multiple Namespaces / Layer 4

[Diagram: each protocol gets its own namespace behind a Layer 4 load balancer: mail.contoso.com, mapi.contoso.com, ecp.contoso.com, ews.contoso.com, eas.contoso.com, oab.contoso.com, oa.contoso.com, and autodiscover.contoso.com, giving per-protocol health checks without Layer 7 inspection.]

Datacenter Switchovers and Failovers

Witness Server Placement

New Witness Server placement options are available
Choose based on business needs and available options

A third-location DAG witness server improves DAG recovery behaviors:
• Automatic recovery on datacenter loss
• The third location's network infrastructure must have independent failure modes

Deployment scenario → Recommendation
• DAG(s) deployed in a single datacenter → Locate the witness server in the same datacenter as the DAG members; one witness server can be shared across DAGs
• DAG(s) deployed across two datacenters, no additional locations available → Locate the witness server in the primary datacenter; one witness server can be shared across DAGs
• DAG(s) deployed across two or more datacenters → Locate the witness server in a third location; one witness server can be shared across DAGs
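A minimal sketch of configuring a third-location witness, assuming a DAG named DAG1 and a hypothetical file server STO-FS1 in the third site:

  # Point the DAG at a witness server in the third location.
  Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer STO-FS1 -WitnessDirectory C:\DAG1\Witness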

Site Resilience - CAS

[Diagram: CAS arrays in the primary datacenter, Redmond, and the alternate datacenter, Portland (cas1 through cas4), each datacenter behind its own VIP: 192.168.1.50 (marked failed) and 10.0.1.50.]

mail.contoso.com: 192.168.1.50, 10.0.1.50
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP.
Removing the failing IP from DNS puts you in control of the VIP's in-service time:
mail.contoso.com: 10.0.1.50

Site Resilience - Mailbox

[Diagram: the primary datacenter, Redmond (mbx1, mbx2), fails; the alternate datacenter, Portland, hosts mbx3 and mbx4; the witness server is in a third datacenter, Stockholm.]

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur.
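A minimal sketch for checking the witness during such an event, assuming a DAG named DAG1:

  # Show which witness the DAG is configured to use and whether the witness
  # share is currently in use (i.e., the witness is holding a vote).
  Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | Format-List Name, WitnessServer, WitnessShareInUse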

Site Resilience - Mailbox

[Diagram: the primary datacenter, Redmond (witness, mbx1, mbx2), fails; the alternate datacenter, Portland, hosts mbx3 and mbx4.]

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate the DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
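When the Redmond servers are completely unreachable, step 1 can only update Active Directory; a minimal sketch of that variant:

  # The failed servers cannot be contacted, so write the "stopped" state to
  # Active Directory only.
  Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Redmond -ConfigurationOnly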

Site Resilience - Mailbox

[Diagram: the same Redmond/Portland failure, now with an alternate witness server in Portland.]

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate the DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
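If an alternate witness was not configured in advance, it can be named during step 3; a minimal sketch, assuming a hypothetical Portland file server PDX-FS1:

  # Activate the surviving members and switch the DAG to an alternate witness.
  Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Portland -AlternateWitnessServer PDX-FS1 -AlternateWitnessDirectory C:\DAG1\AltWitness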

Activation Block Comparison

Tool: Suspend-MailboxDatabaseCopy
Parameter: ActivationOnly
Value: N/A
Instance: Per database copy
Usage: Keep the active copy off a working but questionable drive

Tool: Set-MailboxServer
Parameter: DatabaseCopyAutoActivationPolicy
Value: "Blocked" or "Unrestricted"
Instance: Per server
Usage: Used to control active/passive site resilience configurations and maintenance; an admin can still force a move

Tool: Set-MailboxServer
Parameter: DatabaseCopyActivationDisabledAndMoveNow
Value: $true or $false
Instance: Per server
Usage: Used for faster site failovers while maintaining database availability; databases are not blocked from failing back; continuous move-off operation
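A minimal sketch of the per-copy block, assuming a copy of DB01 on MBX1 sits on a questionable drive:

  # Block activation of this copy while replication continues.
  Suspend-MailboxDatabaseCopy -Identity "DB01\MBX1" -ActivationOnly

  # Once the drive is replaced, make the copy activatable again.
  Resume-MailboxDatabaseCopy -Identity "DB01\MBX1"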

DatabaseCopyActivationDisabledAndMoveNow

New server setting to improve site resilience

Gets all active databases off the server fast; keeping an active there becomes a last resort

Proactively continues attempts to move databases off

The server can still be in service, with databases mounted and mail delivery working
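A minimal sketch, assuming a server named MBX1:

  # Continuously try to move active databases off MBX1 while it stays in service.
  Set-MailboxServer -Identity MBX1 -DatabaseCopyActivationDisabledAndMoveNow $true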

Best Practices

Automate your recovery logic; make it reliable
Think of it as rack/site maintenance

Exercise it regularly

Recovery times are directly dependent on detection and decision times!
Flip the bit! Don't ask about repair times; if there's an outage, go.
Humans are the biggest threat to recovery times.

Dynamic Quorum and DAGs

Dynamic Quorum

In Windows Server 2008 R2, quorum majority is fixed, based on the initial cluster configuration

In Windows Server 2012 (and later), cluster quorum majority is determined by the set of nodes that are active members of the cluster at a given time

This new feature is called Dynamic Quorum, and it is enabled for all clusters by default

Dynamic Quorum

The cluster dynamically manages vote assignment to nodes, based on the state of each node:
• When a node shuts down or crashes, the node loses its quorum vote
• When a node rejoins the cluster, it regains its quorum vote

By adjusting the assignment of quorum votes, the cluster can dynamically increase or decrease the number of quorum votes required to keep running

Dynamic Quorum

By dynamically adjusting the quorum majority requirement, a cluster can sustain sequential node shutdowns down to a single node. This is referred to as a "Last Man Standing" scenario.

Dynamic Quorum

Dynamic quorum does not allow a cluster to sustain a simultaneous failure of a majority of voting members. To continue running, the cluster must always maintain quorum after a node shutdown or failure.

If you manually remove a node’s vote, the cluster does not dynamically add the vote back
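A minimal sketch for verifying the behavior described above, using the FailoverClusters module on a DAG member:

  # 1 means dynamic quorum is enabled (the Windows Server 2012+ default).
  (Get-Cluster).DynamicQuorum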

Dynamic Quorum

[Animation: a cluster sustains sequential node failures, with the required majority dropping after each loss (a majority of 7 votes required, then 4, then 3, then 2). Down to the last two nodes, the cluster adjusts dynamic weights (one node holds a vote of 1, the other 0) so that a single surviving node can keep the cluster running; a simultaneous failure of the remaining voting members still takes the cluster down.]

Dynamic Quorum

Use Get-ClusterNode to verify votes:
• 0 = does not have a quorum vote
• 1 = has a quorum vote

Get-ClusterNode <Name> | Format-Table Name, *Weight, State

Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1              1          1 Up

Dynamic Quorum

Works with most DAGs; third-party replication DAGs have not been tested

All internal testing has it enabled

Office 365 servers use it

Exchange is not dynamic quorum-aware

Does not change quorum requirements

Dynamic Quorum

Cluster team guidance:
• Generally increases the availability of the cluster
• Enabled by default; strongly recommended to leave it enabled
• Allows the cluster to continue running in failure scenarios that are not possible when this option is disabled

Exchange team guidance:
• Leave it enabled for the majority of DAG members
• In some cases where a Windows Server 2008 R2 DAG would have lost quorum, a Windows Server 2012 DAG can maintain quorum
• Don't factor it into availability plans

Dynamic Witness

• Witness offline: the witness vote is removed by the cluster

• Witness online: if necessary, the witness vote is added back by the cluster

• Witness failure: the witness vote is removed by the cluster

Windows Server 2012 R2 and later
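A minimal sketch for observing dynamic witness on Windows Server 2012 R2 or later:

  # 1 means the witness currently holds a vote; 0 means the cluster removed it.
  (Get-Cluster).WitnessDynamicWeight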

Questions?
