Best of MMS 2013 - Didier Van Hoye
DESCRIPTION
Presentation from Best of MMS 2013, held in Belgium on 19 June 2013: Availability Strategies for a Resilient Private Cloud.
TRANSCRIPT
Availability Strategies for a Resilient Private Cloud
Didier Van Hoye - Technical Architect
Microsoft MVP & MEET Member
http://workinghardinit.wordpress.com
@workinghardinit
Sources of Downtime
Data Corruption Component Failure Application Failure
Human Error Maintenance Site Outage
Agenda
• Understanding the various technologies in Windows Server 2012 that can reduce downtime for a private cloud deployment
Planned Downtime
Disaster Recovery
Unplanned Downtime
Hyper-V Availability Suite
Planned Downtime
Live Migration
• Moves a running VM from one host to another with zero perceived downtime
Storage Migration
• Moves a VHD from one storage location to another with zero perceived downtime
Disaster Recovery
Hyper-V Replica
• Replicates VMs to a server in another location, for recovery when a site is lost
Multi-Site Clusters
• Stretching a cluster across sites with hardware or software replication
Unplanned Downtime
Failover Clustering
• Monitors the health of servers & VMs, then restarts and recovers them on the same server or another one in the event of a failure
Core OS Resiliency Features
Data Corruption
Application Failure
Component Failure
Storage Spaces
• Software fault tolerance which provides resiliency to disk failures
Chkdsk
• Repairs data corruption when it occurs
• Vastly improved performance in Windows Server 2012
Guest Clustering
• Application health monitoring and mobility
VM Monitoring
• Application health monitoring for non-cluster aware applications
NIC Teaming
• Resiliency to a network card failure
Storage Multi-Path IO (MPIO)
• Resiliency to an HBA failure
Planned Downtime
Live Migration
• Moves a running VM between hosts with no user-perceived downtime
  • Client is not aware the VM moved to another server
  • Maintains open TCP connections to the guest OS
  • Clients stay connected
• Enables draining a server with zero downtime for planned maintenance
• Note: PING is a poor tool to evaluate a live migration
  • ICMP works at the IP layer, whereas TCP is what makes a live migration seamless: a dropped ping does not mean a client lost its TCP connection
Complete VM Mobility Across the Datacenter
• Live migrate VM and storage to clusters
• Live migrate VM and storage between clusters
• Live migrate VM and storage to a stand-alone server
You can move a VM anywhere in your datacenter with zero downtime!
Live Migration - Initiate Migration
• IT Admin initiates a Live Migration to move a VM from one host to another
[Diagram: client accessing VM; "Live Migrate this VM to another physical machine"; VHDX]
Live Migration - Memory Setup Copy
• VM pre-staged on the new server
• Memory content is copied to the new server
• The initial copy transfers all in-memory content
Live Migration - Brownout: Copy Dirty Pages
• Client continues to access the VM, which results in memory being modified
• Pages are being dirtied
Live Migration - Brownout: Incremental Copy
• Hyper-V tracks changed data, and re-copies over incremental changes
• Subsequent passes get faster as the data set is smaller
Live Migration - Blackout
• VM paused
• Partition state copied
• Window is very small and within the TCP connection timeout
Live Migration - Post-Transition: Cleanup
• Client directed to new host
• ARP issued to have routing devices update their tables
• Since session state is maintained, no reconnections are necessary
• Old VM deleted once migration is verified successful
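The brownout and blackout phases above can be illustrated with a small simulation. This is a minimal sketch, not Hyper-V's actual implementation: the function name, the constant dirty and copy rates, and the `blackout_threshold` and `max_passes` parameters are all assumptions made for illustration. It shows why each pass is smaller than the previous one whenever the network copies pages faster than the guest dirties them.

```python
def simulate_precopy(total_pages, dirty_rate, copy_rate,
                     blackout_threshold, max_passes=10):
    """Simulate iterative pre-copy passes of a live migration.

    total_pages        -- pages of VM memory (hypothetical unit)
    dirty_rate         -- pages the running VM dirties per second (assumed constant)
    copy_rate          -- pages copied per second over the network (assumed constant)
    blackout_threshold -- pause the VM once this few dirty pages remain
    Returns (pages copied in each brownout pass, pages left for the blackout).
    """
    passes = []
    to_copy = total_pages  # the first pass copies all in-memory content
    for _ in range(max_passes):
        passes.append(to_copy)
        # Pages dirtied while this pass was being copied; they must be re-copied.
        to_copy = min(total_pages, to_copy * dirty_rate // copy_rate)
        if to_copy <= blackout_threshold:
            break  # small enough to copy while the VM is paused
    return passes, to_copy


if __name__ == "__main__":
    passes, remaining = simulate_precopy(1_000_000, 10_000, 100_000, 500)
    print("brownout passes:", passes)
    print("pages copied during blackout:", remaining)
```

If the guest dirties memory as fast as the network can copy it, the passes never shrink, which is why the sketch caps the number of passes.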
Simultaneous Live Migrations
• Windows Server 2012 now supports multiple live migrations in parallel
  • An unlimited number of live migrations can be performed in parallel
  • Default configuration of 2 simultaneous live migrations per host
• Wield this power wisely
  • An excessive number of simultaneous migrations may actually take longer overall than running them serially
Dynamic Optimization
• Feature in SCVMM 2012
• Rebalances VMs across hosts
• Live migration
  • Keeps cluster balanced
  • Avoids VM downtime
  • Supports heterogeneous clusters
• Managed resources
  • Considers CPU, memory, disk IO, network IO
  • Optimize when above resource threshold
  • Considers entire cluster
• Options
  • Manual or automatic
  • User controlled frequency
  • Configurable aggressiveness
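The idea of rebalancing when a host is above a resource threshold can be sketched as a greedy loop. This is an illustrative sketch only and not SCVMM's algorithm: it tracks a single load figure per VM, and the function name, the data layout, and the "move the smallest VM first" policy are assumptions.

```python
def rebalance(hosts, threshold):
    """Plan live migrations for hosts whose load exceeds the threshold.

    hosts     -- dict: host name -> dict of VM name -> load (percent of a host)
    threshold -- maximum load a host should carry
    Returns a list of (vm, source, target) moves and updates hosts in place.
    """
    def load(host):
        return sum(hosts[host].values())

    moves = []
    # Visit the most loaded hosts first.
    for source in sorted(hosts, key=load, reverse=True):
        while load(source) > threshold and hosts[source]:
            vm = min(hosts[source], key=hosts[source].get)  # smallest VM
            target = min((h for h in hosts if h != source),
                         key=load, default=None)            # least loaded host
            # Stop if no host can take the VM without exceeding the threshold.
            if target is None or load(target) + hosts[source][vm] > threshold:
                break
            hosts[target][vm] = hosts[source].pop(vm)
            moves.append((vm, source, target))
    return moves
```

A real optimizer would weigh CPU, memory, disk IO and network IO together and consider the entire cluster, as the slide notes; the sketch only shows the threshold-triggered shape of the decision.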
Power Optimization
• Feature in SCVMM 2012
• Rebalance the workload and turn off machines when using Dynamic Optimization
  • Conserve energy in the data center
  • Keeps the cluster balanced, and avoids VM downtime or latency through lack of resources
• Uses out-of-band power management
• User defined schedule
Windows Update
Zero Downtime Automated Patching
• Cluster-Aware Updating
  • Streamlines ‘Patch Tuesday’
  • Zero downtime patching!
• Coordinator updates nodes in the cluster
  • Coordinates with Windows Update Agent (WUA)
  • Updates in a rolling fashion, 1 node at a time
  • Serially steps through all nodes
  • Coordinator can be made clustered, for Self-Updating mode
Workflow
1. Scan nodes to identify the updates needed
2. Identify the node with the fewest workloads
3. Drain the node
4. Call WUA to patch (which leverages WSUS or Windows Update)
5. Verify the patch succeeded
6. Repeat steps 2 – 5 on the next node
7. Repeat on the remaining nodes
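The workflow above can be sketched as a rolling loop. This is a simplified model and not the Cluster-Aware Updating API: `rolling_update`, the `patch` callback and the data layout are hypothetical names, the scan result is passed in as `pending`, and moving workloads back after patching is not modeled.

```python
def rolling_update(workloads, pending, patch):
    """Patch cluster nodes one at a time, draining each node first.

    workloads -- dict: node name -> list of workloads (VMs) it currently runs
    pending   -- dict: node name -> list of updates found by the scan (step 1)
    patch     -- callable(node, updates) -> bool, stands in for the WUA call
    Returns a log of (node, succeeded) in the order nodes were processed.
    """
    log = []
    remaining = [n for n in workloads if pending.get(n)]
    while remaining:
        # Step 2: pick the node with the fewest workloads.
        node = min(remaining, key=lambda n: len(workloads[n]))
        others = [n for n in workloads if n != node]
        if workloads[node] and not others:
            raise RuntimeError("no other node available to drain to")
        # Step 3: drain, moving each workload to the least busy other node.
        while workloads[node]:
            target = min(others, key=lambda n: len(workloads[n]))
            workloads[target].append(workloads[node].pop())
        # Steps 4 and 5: patch, then verify before moving on.
        ok = patch(node, pending[node])
        log.append((node, ok))
        if not ok:
            break  # stop the run rather than risk further nodes
        remaining.remove(node)  # steps 6 and 7: continue with the next node
    return log
```

Stopping at the first failed node is a design choice of the sketch; it keeps the rest of the cluster on a known-good patch level.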
[Diagram: Admin initiates Cluster-Aware Updating; Update Coordinator]
Storage Migration
• Move a VHD or VHDX from one storage location to another with zero downtime
• Storage Migration between hosts without shared storage is done over SMB protocol
• Storage Migration accelerated by arrays that support Offloaded Data Transfer (ODX)
• Enables draining a storage array for planned maintenance
Storage Migration – Initiate Migration
Storage Migrate this VHDX to another disk
Client accessing VM
VM stays running servicing clients
VHDX
Storage Migration – Create Destination VHDX
• New VHDX created on destination storage
Storage Migration – Mirror Writes
• Reads are from Source VHDX
• Writes are done to Source VHDX and also synchronously to the Destination VHDX
Storage Migration – Copy Data
• Source VHDX data is copied over to Destination VHDX
  • Only unchanged blocks are copied over, since changed blocks already reach the destination through the mirrored writes
• ODX will accelerate file copy
• SMB leveraged if storage is not accessible to this server
Storage Migration – Post-Transition Cleanup
• Once all data is synchronized, the VM is switched to the new VHDX
  • Reads and writes transition to the new VHDX
• Source VHDX is only removed once the VM is verified to be running on the Destination VHDX
  • Enables roll-back
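The mirror, copy and switch-over steps of a storage migration can be modeled on a list of blocks. This is a toy sketch under stated assumptions, not Hyper-V's implementation: the `StorageMigration` class and its method names are hypothetical, and the copy pass is treated as atomic for simplicity.

```python
class StorageMigration:
    """Toy model of mirrored writes followed by a copy of unchanged blocks."""

    def __init__(self, source_blocks):
        self.source = source_blocks                # blocks of the source VHDX
        self.dest = [None] * len(source_blocks)    # new destination VHDX
        self.mirrored = set()    # blocks already on the destination via mirroring
        self.active = self.source                  # disk currently serving the VM

    def read(self, index):
        return self.active[index]

    def write(self, index, data):
        if self.active is self.source:
            # During migration, writes go to the source and, synchronously,
            # to the destination.
            self.source[index] = data
            self.dest[index] = data
            self.mirrored.add(index)
        else:
            self.dest[index] = data

    def copy_pass(self):
        """Copy only blocks that mirroring has not already written."""
        copied = 0
        for index, data in enumerate(self.source):
            if index not in self.mirrored:
                self.dest[index] = data
                copied += 1
        return copied

    def switch_over(self):
        """Move the VM to the destination once both disks are identical."""
        if self.dest != self.source:
            raise RuntimeError("destination not yet synchronized")
        self.active = self.dest   # the source is kept, which allows roll-back
```

Because the source stays untouched after the switch, deleting it can wait until the VM is verified on the destination, matching the roll-back point on the slide.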
Unplanned Downtime
Failover Clustering
• Failover Clustering is a distributed system that monitors the health of servers and takes recovery action
• Protects from unplanned downtime:
  • Hardware
  • Host OS
  • VM
  • Guest OS
  • Apps in VMs
• Unplanned downtime results in the VM being restarted on another server
  • Session state lost
Failover Cluster Health Monitoring
• Extensive health monitoring up and down the stack
[Diagram: Node 1 and Node 2, each showing ClusSvc, RHS with vmclusres.dll, VMMS and NetFT across user mode and kernel mode, plus VDev and the Guest OS]
Resiliency Delivered by Host Clustering
• Avoids a single point of failure when consolidating
• Survive Host Crashes
  • VMs restarted on another node
• Restart VM Crashes
  • VM OS restarted on same node
• Recover VM Hangs
  • VM OS restarted on same node
• Zero Downtime Maintenance & Patching
  • Live migrate VMs to other hosts
• Mobility & Load Distribution
  • Live migrate VMs to different servers to load balance
Flexible storage choices for building clusters
[Diagram labels: Shared Storage; FC; iSCSI; FCoE; SMB; SAS RBOD; SAS JBOD; RAID HBA; Spaces; Hardware Replication; Software Replication; Data Replication (3rd party software replication solution); Application Replication (example: Exchange, SQL AlwaysOn); Hyper-V Replica]
Cluster Shared Volumes (CSV)
• Cluster Shared Volumes (CSV) is a clustered file system in Windows Server 2012
  • Enables all servers in a Failover Cluster to access a common NTFS volume
  • Provides a layer of abstraction above NTFS
• Provides applications complete abstraction with respect to which nodes actually own a LUN
• Applications can fail over without requiring drive ownership changes
  • No dismounting and remounting of volumes
  • Faster failover times (i.e. less downtime)
• Increases resiliency and availability
CSV I/O Fault Tolerance
• On a SAN connectivity failure, I/O is redirected via the network through the Coordination Node
• VM running on Node 2 is unaffected
• VMs can then be live migrated to another node with zero client downtime
Application Availability with Guest Clustering
• Guest Clustering is creating a Failover Cluster inside of the virtual machines and failing over applications across VMs
• Delivers:
  • Application Health Monitoring
    • If an application within a VM crashes, the application automatically restarts or fails over
  • Application Mobility
    • If the guest OS needs patching or the VM needs maintenance, the application is moved to another node
Combining Host & Guest Clustering
• Best of both worlds for flexibility and protection
  • VM high-availability & mobility between physical nodes
  • Application & service high-availability & mobility between VMs
• Cluster-on-a-cluster does increase complexity
• Mixing physical and virtual nodes is supported
  • Must pass Validate
[Diagram labels: Cluster; Cluster; Guest Cluster; iSCSI or FC; SAN; SAN]
VM Monitoring
• The host identifies & recovers from service failures in the guest
  • Application level recovery
    • Service Control Manager (SCM) or event triggered
  • Guest level HA recovery
    • Failover Clustering gracefully reboots VM
  • Host level HA recovery
    • Failover Clustering fails over VM to another node
• Generic health monitoring for any application
  • Monitor services through Service Control Manager
  • Generation of specific Event IDs
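The three recovery levels above form an escalation ladder, which can be sketched as a small decision function. The thresholds here are assumptions for illustration only; the slide does not state how many failures trigger each level, and `choose_recovery` is a hypothetical name, not a Failover Clustering API.

```python
from enum import Enum


class Recovery(Enum):
    RESTART_SERVICE = "application level: restart the service in the guest"
    REBOOT_VM = "guest level: gracefully reboot the VM on the same host"
    FAILOVER_VM = "host level: fail the VM over to another node"


def choose_recovery(failure_count, service_restart_limit=2, vm_reboot_limit=1):
    """Pick a recovery action for the Nth consecutive failure of a service.

    service_restart_limit and vm_reboot_limit are assumed values: how many
    failures are handled at the application level, then at the guest level,
    before escalating to a failover.
    """
    if failure_count < 1:
        raise ValueError("failure_count must be at least 1")
    if failure_count <= service_restart_limit:
        return Recovery.RESTART_SERVICE
    if failure_count <= service_restart_limit + vm_reboot_limit:
        return Recovery.REBOOT_VM
    return Recovery.FAILOVER_VM
```

Each level is more disruptive than the one before, so the sketch only escalates once the cheaper action has been tried.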
Disaster Recovery
Hyper-V Replica
• Hypervisor level replication
• Point-in-time replication of a VM to a remote server
• RPO of 5 minutes
[Diagram: VHDX replicated from Site A to Site B]
Multi-Site Clustering
• Automatic and manual failover for DR
• Supports 3rd party hardware and software based replication
[Diagram: a single cluster spanning Site A and Site B]
Service Level Agreement / Business Requirements
Choosing a Disaster Recovery Solution
Hyper-V Replica
• RPO = 5 minutes
• RTO = Manual (longer)
• Cost = In-box in Windows Server
• Complexity = Low
Multi-Site Clustering
• RPO = 0 minutes*
• RTO = Automatic (fast)
• Cost = High
• Complexity = High
*depending on 3rd party replication solution
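The comparison above can be turned into a simple selection rule driven by the business requirements. This is a hedged sketch: the RPO, RTO, cost and complexity values come from the slide, while `DrOption`, `choose_dr` and the "prefer the simplest option that meets the requirements" policy are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo_minutes: int          # worst-case data loss, per the slide
    automatic_failover: bool  # RTO: automatic (fast) vs manual (longer)
    cost: str
    complexity: str


# Ordered from simplest to most complex. The 0-minute RPO of a multi-site
# cluster depends on the 3rd party replication solution in use.
OPTIONS = [
    DrOption("Hyper-V Replica", 5, False, "in-box", "low"),
    DrOption("Multi-Site Clustering", 0, True, "high", "high"),
]


def choose_dr(max_rpo_minutes: int, need_automatic_failover: bool) -> Optional[DrOption]:
    """Return the simplest option that satisfies the stated requirements."""
    for option in OPTIONS:
        meets_rpo = option.rpo_minutes <= max_rpo_minutes
        meets_rto = option.automatic_failover or not need_automatic_failover
        if meets_rpo and meets_rto:
            return option
    return None
```

For example, a service that tolerates 15 minutes of data loss and a manual failover lands on Hyper-V Replica, while one that needs automatic failover lands on Multi-Site Clustering.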
Fault Tolerant Solutions
• Fault tolerant solutions can deliver zero downtime for any unplanned hardware failure
  • Requires special hardware
• Gives higher levels of availability for a broader set of unplanned downtime scenarios
• Partnership with Stratus
  • Mission-Critical Hyper-V Windows Systems in lock step
Summary
• Windows Server delivers a breadth of features which can increase the resiliency of your Private Cloud
• Windows Server 2012 delivers many new availability features
• Delivering a resilient cloud includes planning for:
  • Planned Downtime
  • Unplanned Downtime
  • Disaster Recovery
Thank You to our SPONSORS
Q and A
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.