© 2011 IBM Corporation
Architect’s 2013 Guide to Designing
HA, BC, and DR - Best Practices
Industry Best Practices - IT HA DR BC
Provided by: John Sing, Executive IT Consultant, San Jose, California [email protected]
© 2013 IBM Corporation
September 2013
Contents
Principles of architecting traditional IT HA, DR, BC
Technology and location considerations
Traditional Workloads vs. Internet Scale Workloads
Best Practices Step by Step Methodology
Four Stages of Data Center Efficiency: (pre-req’s for HA/BC/DR)
http://public.dhe.ibm.com/common/ssi/ecm/en/rlw03007usen/RLW03007USEN.PDF
http://www-935.ibm.com/services/us/igs/smarterdatacenter.html
April 2012
Business Process is the Recoverable Unit
(diagram: Infrastructure, Application, and Business layers. Labels: Application 1, Application 2, Application 3, Analytics report, management reports, decision point, http://xyz.xml, MQSeries, WebSphere, SQL, DB2, Business processes A through G)
1. An error occurs on a storage device that correspondingly corrupts a database
2. The error impacts the ability of two or more applications to share critical data
3. The loss of both applications affects two distinctly different business processes
IT Business Continuity must recover at the business process level
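The three-step failure chain above can be sketched as a walk over a dependency map. This is only an illustration: the component names, the `DEPENDENTS` map, and both function names are hypothetical and are not taken from the slide's diagram.

```python
from collections import deque

# Hypothetical dependency map: each component lists what depends on it.
DEPENDENTS = {
    "storage-device": ["db2-database"],
    "db2-database": ["Application 1", "Application 2"],
    "Application 1": ["Business process A"],
    "Application 2": ["Business process C"],
    "Application 3": ["Business process F"],
}


def impacted(failed, dependents=DEPENDENTS):
    """Breadth-first walk from a failed component to everything that depends on it."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def impacted_business_processes(failed, dependents=DEPENDENTS):
    """Keep only the business processes, since those are the recoverable units."""
    return sorted(n for n in impacted(failed, dependents) if n.startswith("Business process"))


print(impacted_business_processes("storage-device"))
```

In this toy map a single storage error reaches two different business processes, which is why recovery is planned per business process rather than per device.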
Still true: synergistic overlap of valid data protection techniques
IT Data Protection:
1. High Availability: fault-tolerant, failure-resistant streamlined infrastructure with affordable cost foundation
2. Continuous Operations: non-disruptive backups and system maintenance coupled with continuous availability of applications
3. Disaster Recovery: protection against unplanned outages such as disasters through reliable, predictable recovery
Outcomes:
• Protection of critical Business data
• Operations continue after a disaster
• Costs are predictable and manageable
• Recovery is predictable and reliable
Still true: Timeline of an IT Recovery
(timeline diagram, left to right: Production, Outage!, Assess, hardware / operating system / data integrity recovery, "Done?", application transaction integrity recovery, "Now we're done!")
Recovery Point Objective (RPO): how much data must be recreated?
Recovery Time Objective (RTO) of hardware data integrity: execute hardware, operating system, and data integrity recovery
– Staff: Operations Staff, Network Staff
– Layers: Management Control, Physical Facilities, Telecom Network, Operating System, Data
Recovery Time Objective (RTO) of transaction integrity: application transaction integrity recovery
– Staff: Applications Staff
– Layer: Applications
Telecom bandwidth is still the major limiting factor for any fast recovery
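A minimal sketch of how the quantities on this timeline relate, assuming four timestamps are known for an incident. The function name, the dictionary keys, and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_copy, outage, data_integrity_restored, transactions_restored):
    """Return the achieved recovery point and both recovery times as timedeltas."""
    if not (last_good_copy <= outage <= data_integrity_restored <= transactions_restored):
        raise ValueError("timestamps must be in chronological order")
    return {
        # Data written after the last good copy must be recreated.
        "recovery_point": outage - last_good_copy,
        # Hardware, operating system, and data are back ("Done?").
        "rto_data_integrity": data_integrity_restored - outage,
        # Applications have resolved in-flight transactions ("Now we're done!").
        "rto_transaction_integrity": transactions_restored - outage,
    }


# Hypothetical incident, for illustration only.
outage = datetime(2013, 9, 1, 10, 0)
metrics = recovery_metrics(
    last_good_copy=outage - timedelta(minutes=15),
    outage=outage,
    data_integrity_restored=outage + timedelta(hours=4),
    transactions_restored=outage + timedelta(hours=6),
)
for name, value in metrics.items():
    print(name, value)
```

The gap between the two recovery times is the point of the slide: recovery is not finished when the hardware and data are back, only when transaction integrity is restored.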
Still true: value of Automation for real-time failover
(timeline diagram: the same recovery timeline as the previous slide, with the Assess, RTO H/W, and RTO trans. integrity phases compressed; labels: Production, Outage!, RPO, Assess, HW, Trans. Recov., "Now we're done!")
Recovery Point Objective (RPO): how much data must be recreated?
Value of automation:
• Reliability
• Repeatability
• Scalability
• Frequent Testing
Still true: Replication Technology Drives RPO
(chart: Recovery Point and Recovery Time axes, each scaled Secs, Mins, Hrs, Days, Wks)
For example, from smallest to largest recovery point:
• Synchronous replication / HA
• Asynchronous replication
• Periodic Replication
• Tape Backup
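A rough sketch of why the replication technique bounds the recovery point. The model is a deliberate simplification and an assumption, not something stated on the slide: synchronous copies lose no acknowledged writes, asynchronous copies lose up to the replication lag, and periodic or tape copies lose up to one full interval plus the time the copy needs to reach the recovery site. The function and parameter names are hypothetical.

```python
def worst_case_rpo_seconds(technique, interval_s=0.0, lag_s=0.0):
    """Approximate upper bound on the data-loss window, in seconds."""
    if interval_s < 0 or lag_s < 0:
        raise ValueError("interval_s and lag_s must be non-negative")
    if technique == "synchronous":
        # Writes are acknowledged only after both copies are updated.
        return 0.0
    if technique == "asynchronous":
        # The remote copy trails production by the replication lag.
        return lag_s
    if technique in ("periodic", "tape"):
        # A failure just before the next copy lands loses a whole interval
        # plus the transfer or transport time of that copy.
        return interval_s + lag_s
    raise ValueError(f"unknown technique: {technique}")


# Hypothetical settings, for illustration only.
print(worst_case_rpo_seconds("asynchronous", lag_s=30))
print(worst_case_rpo_seconds("periodic", interval_s=3600, lag_s=300))
print(worst_case_rpo_seconds("tape", interval_s=86400, lag_s=43200))
```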
Still true: Recovery Automation Drives Recovery Time
(chart: Recovery Point and Recovery Time axes, each scaled Secs, Mins, Hrs, Days, Wks)
Recovery Time includes:
– Fault detection
– Recovering data
– Bringing applications back online
– Network access
For example, from shortest to longest recovery time:
• End to end automated clustering
• Storage automation
• Manual Tape Restore
Still true: “ideal world” construct for IT High Availability and Business Continuity
(diagram: Resilience Program Management cycle in three phases)
Business Prioritization:
– risk assessment: Risks, Vulnerabilities and Threats
– business impact analysis: Impacts of Outage, RTO/RPO
– program assessment: Current Capability
• Maturity Model
• Measure ROI
• Roadmap for Program
Integration into IT:
– Strategy Design
– Program Design
– crisis team, business resumption, disaster recovery, high availability
– High Availability design: Database and Software design; High Availability Servers; Storage, Data Replication
Manage:
– Implement
– program validation: Estimated Recovery Time
– Awareness, Regular Validation, Change Management, Quarterly Management Briefings
Scope: 1. People 2. Processes 3. Plans 4. Strategies 5. Networks 6. Platforms 7. Facilities
Business processes drive strategies and they are integral to the Continuity of Business Operations. A company cannot be resilient without having strategies for alternate workspace, staff members, call centers and communications channels.
Source: IBM STG, IBM Global Services
The 2013 Bottom line: (IT Business Continuity Planning Steps)
For today’s real world environment...
(diagram: the same “ideal world” Resilience Program Management cycle as the previous slide: Business Prioritization, Integration into IT, Manage)
I.e. how to streamline this “ideal” process?
1. Collect information for prioritization
2. Vulnerability, risk assessment, scope
3. Define BC targets based on scope
4. Solution option design and evaluation
5. Recommend solutions and products
6. Recommend strategy and roadmap
Need faster way than even this simplified 2007 version:
• 2013 key #1: need a basic Data Strategy
• 2013 key #2: Workload type
Streamlined BC Actions, with Input and Output (2005 version)
1. Collect info for prioritization
– Input: Business processes, Key Perf. Indicators, IT inventory
– Output: Scope, Resource Business Impact; Component effect on business processes
2. Vulnerability / Risk Assessment
– Input: List of vulnerabilities
– Output: Defined vulnerabilities
3. Define desired HA/BC targets based on scope
– Input: Existing BC capability, KPIs, targets, and success rate
– Output: Defined BC baseline targets, architecture, decision and success criteria
4. Solution design and evaluation
– Input: Technologies and solution options
– Output: Business process segments and solutions
5. Recommend solutions and products
– Input: Generic solutions that meet criteria
– Output: Recommended IBM Solutions and benefits
6. Recommend strategy and roadmap
– Input: Budget, major project milestones, resource availability, business process priority
– Output: Baseline Bus. Cont. strategy, roadmap, benefits, challenges, financial implications and justification
Scope definition of Business Continuity program
(chart: Frequency of Occurrences Per Year, from 1,000 (frequent) down to 1/100,000 (infrequent), plotted against Consequences (Single Occurrence Loss) in Dollars per Occurrence, from lower to higher)
Threats plotted, grouped as availability-related or recovery-related:
• Virus
• Worms
• Disk Failure
• Component Failure
• Power Failure
• Application Outage
• Data Corruption
• Network Problem
• Natural Disaster
• Building Fire
• Terrorism/Civil Unrest
This becomes the scope of HA/BC program
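One common way to combine the two axes of this chart is an annualized loss figure: frequency per year multiplied by the loss per occurrence. The slide itself does not state this formula, so treat the sketch as an assumption; the dollar amounts and frequencies below are invented for illustration.

```python
def annualized_loss(frequency_per_year, loss_per_occurrence):
    """Expected yearly loss for one threat: occurrences per year times dollars per occurrence."""
    if frequency_per_year < 0 or loss_per_occurrence < 0:
        raise ValueError("inputs must be non-negative")
    return frequency_per_year * loss_per_occurrence


# Hypothetical (frequency per year, dollars per occurrence) pairs.
threats = {
    "Disk Failure": (10, 5_000),
    "Power Failure": (1, 30_000),
    "Building Fire": (1 / 100, 20_000_000),
}

# Rank threats so the scope discussion starts with the largest expected loss.
ranked = sorted(threats, key=lambda name: annualized_loss(*threats[name]), reverse=True)
for name in ranked:
    print(name, round(annualized_loss(*threats[name])))
```

A rare but expensive event can outrank a frequent but cheap one, which is why both axes are needed before deciding what falls inside the program's scope.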
Define scope based on prioritized vulnerabilities
Set expectation for phased implementation
Example chart at left shows Vulnerability / Risk Assessment:
– Define what will be on the chart
– This defines the scope of the Business Continuity solution
Divide Scope into implementation phases
– Do not try to solve all vulnerabilities at once
– Instead, focus on delivering tangible visible value in each project step
– Portray that scope expands as project progresses
– This matches expenditure with increasing probability over time
(chart: Risk Assessment, Impact vs. Likelihood, with individual risks grouped into phases at 6 months, 12 months, and 18 months, up to Total Scope)
Step by Step: Typical three phase approach to implementing High Availability, Business Continuity Technologies
Balancing recovery time objective with cost / value
(chart: Cost / Value vs. Recovery Time Objective (guidelines only), with RTO ticks at 15 Min., 1-4 Hr., 4-8 Hr., 8-12 Hr., 12-16 Hr., 24 Hr., and Days; the short-RTO end is labeled "Recovery from a disk image" and the long-RTO end "Recovery from tape copy")
BC Tier 7 – Add Server or Storage replication with end-to-end automated server recovery
BC Tier 6 – Add real-time continuous data replication, server or storage
BC Tier 5 – Add Application/database integration to Backup/Restore
BC Tier 4 – Add Point in Time replication to Backup/Restore
BC Tier 3 – VTL, Data De-Dup, Remote vault
BC Tier 2 – Tape libraries + Automation
BC Tier 1 – Restore from Tape
Step by Step Virtualization, High Availability, Business Continuity data strategy
Balancing recovery time objective with cost / value
(chart: the BC Tier 1 through 7 chart of Cost / Value vs. Recovery Time Objective, overlaid with three segments: Continuous Availability, Rapid Data Recovery, and Backup/Restore)
• Workload types
• Storage Pools
• Cloud deployment if needed
HA/BC pre-requisite: IT Virtualization and Consolidation
Strategic Approach: Data protection is intended side-benefit of IT Virtualization
Questions for Data Protection:
• Funding given today’s cost crunch?
• Complexity of infrastructure to recover?
• Priorities? Resources?
IT Virtualization, Consolidation enhances Data Protection
Data Protection is an intended side benefit of Consolidation, Virtualization
Fact: accelerating IT Consolidation, Virtualization, will accelerate Data Protection
• Fewer Components to Recover
• Invest percentage of Savings
• Invest in more robust Business Resiliency
• Standardize and optimize IT and Business Resiliency solution design
(diagram: Cost-Effective Storage and IT Efficiency; Load Balancing Solution architecture. End Users, Application Servers, High-End Workstations, and Database; Protocols: SAN, CIFS, NFS, HTTP, FTP; Management: Central Administration, Monitoring, File Mgmt; Availability: Data Migration, Replication, Backup)
For traditional IT - Virtualization is fundamental to addressing today’s IT diversity
Virtualization
IT Virtualization is the means to achieve IT Business Continuity
I.e. consolidate Servers, Storage, into virtualized systems
Provides the change agent and political momentum to enable Business Continuity implementation
Reduces management complexity using integrated virtualization and management software
Provides workload optimization needed for affordable maximum performance and efficiency
Becomes possible to identify what to replicate and manage that replication
Implements key tools such as virtual resource mobility within the ensemble
Is perfect foundation to implement the necessary IT strategy, design, tools, procedures, and testing to create IT Business Continuity
Because it also provides the umbrella and political change-agent required to allow IT Business Continuity to be implemented as a by-product
Virtualized IT infrastructure Business Processes
Virtualized systems become the resource pools that enable the recoverability
For traditional IT - Consolidated virtualized systems become the Recoverable Units for IT Business Continuity
Virtualization
IT storage infrastructure: Before
(diagram: End Users, Application Servers, High-End Workstations, and Database, each with its own Servers and Storage; Underutilized Segmented Storage; Copies of Data)
Transformation To Standardization, Virtualization
(animated chart: the "before" environment of End Users, Application Servers, High-End Workstations, Database, and Servers And Storage, with Underutilized Segmented Storage and Copies of Data, is transformed into Virtualized Storage)
Here are the benefits:
• Virtualization: SAN, NAS
• Management: Central Administration, Monitoring, File Mgmt
• Availability: Data Migration, Replication, Backup
• Virtualized Storage: ability to move data between storage pools
• Tiered Storage
• Virtualized De-dup, tape
• High performance petabyte scale
Key strategy: using standardized virtualization, segment data into logical data storage pools by appropriate Data Protection characteristics
(diagram: three storage pools, enabled by virtualization, ranging from Mission Critical to Lower cost)
Continuous Availability (CA) – E2E automation enhances RDR
– RTO = near continuous, RPO = small as possible (Tier 7)
– Priority = uptime, with high value justification
Rapid Data Recovery (RDR) – enhance backup/restore
– For data that requires it
– RTO = minutes, to (approx. range): 2 to 6 hours
– BC Tiers 6, 4
– Balanced priorities = Uptime and cost/value
Backup/Restore (B/R) – assure efficient foundation
– Standardize base backup/restore foundation
– Provide universal 24 hour - 12 hour (approx) recovery capability
– Address requirements for archival, compliance, green energy
– Priority = cost
Know and categorize your data: provides foundation for affordable data protection
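The segmentation above can be sketched as a simple classifier that places a data set in a storage pool by its required RTO. The cut-off values are assumptions chosen to sit inside the approximate ranges on the slide (15 minutes borrowed from the first RTO axis tick, 6 hours from the top of the RDR range); the constant and function names are hypothetical.

```python
# Assumed thresholds, in hours. Tune these to the organization's own targets.
CA_MAX_RTO_HOURS = 0.25   # "near continuous": assumed here to mean under 15 minutes
RDR_MAX_RTO_HOURS = 6.0   # upper end of the slide's approximate 2 to 6 hour range


def storage_pool(rto_hours):
    """Pick the storage pool whose recovery capability meets the required RTO."""
    if rto_hours < 0:
        raise ValueError("rto_hours must be non-negative")
    if rto_hours < CA_MAX_RTO_HOURS:
        return "Continuous Availability"
    if rto_hours <= RDR_MAX_RTO_HOURS:
        return "Rapid Data Recovery"
    # Anything looser falls back to the standardized backup/restore foundation.
    return "Backup/Restore"


for rto in (0.1, 2, 24):
    print(rto, storage_pool(rto))
```

Keeping the thresholds as named constants makes the cost trade-off explicit: tightening either value moves more data into the more expensive pools.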
High Availability, Business Continuity Step by Step virtualization journey
Balancing recovery time objective with cost / value
(chart: the BC Tier 1 through 7 chart of Cost / Value vs. Recovery Time Objective, from "Recovery from a disk image" to "Recovery from tape copy", with Storage pools marked as the Foundation)
Storage Pools
Apply appropriate server, storage technology
• Real-time replication
– Real Time replication (storage or server or software)
– Add automated failover to replicated storage
• Point in time
– Periodic PiT replication: File System; Point in Time Disk; VTL to VTL with Dedup
– File, application, or disk-to-disk periodic replication
• Removable media
– Foundation backup/restore
– Physical or electronic transport
• Petabyte Unstructured
– Petabyte unstructured, due to usage and large scale, typically uses application level intelligent redundancy, failure toleration design
Step by step – architecting remote solution
Methodology Traditional IT: HA / BC / DR in stages, from bottom up
(chart: Cost vs. Recovery Time Objective; two SAN-connected sites, each with Disk and VTL/De-Dup)
Stages, from bottom up:
– Foundation: standardized, automated tape backup (Tier 2, 1)
– Foundation: electronic vaulting, automation, tape lib (Tier 3)
– Add: Point-in-time Copy, disk to disk, Tiered Storage (Tier 4)
Technologies shown:
• IBM FlashCopy, SnapShot
• IBM XIV, SVC, DS, SONAS
• IBM Tivoli Storage Productivity Center 5.1
• IBM ProtecTIER
• IBM Virtual Tape Library
• IBM Tivoli Storage Manager Backup/restore
• VTL, de-dup, remote replication at tape level
Methodology Traditional IT: HA / BC / DR in stages, from bottom up
(chart: Cost vs. Recovery Time Objective; two SAN-connected sites, each with Disk and VTL/De-Dup, plus Application integration, Data replication, and Dynamic End to end Automated Failover of Server, Storage, Applications)
Stages, from bottom up:
– Foundation: standardized, automated tape backup (Tier 2, 1)
– Foundation: electronic vaulting, automation, tape lib (Tier 3)
– Add: Point-in-time Copy, disk to disk for backup/restore (Tier 4)
– Automate applications, database for replication and automation (Tier 5)
– Consolidate and implement real time data availability (Tier 6)
– End to end automated site failover servers, storage, applications (Tier 7)
Technologies shown:
• If storage: Metro Mirror, Global Mirror, Hitachi UR
• XIV, SVC, DS, other storage
• TPC 5.1
• VMware
• PowerHA on p
• Tivoli FlashCopy Manager
• Server virtualization
IBM Disk Mirroring Technology naming
(diagram: product families by segment: Entry, Midrange, Enterprise, NAS, SAN, Virtualization. Products: DS8000, DS6000, ESS, DS5000, DS4000, DCS3700, DS3000, V3700, V7000, N series, SVC, XIV, SONAS)
• FlashCopy: Point in time copy
– SVC, V7000, DS3000, DS4000, DS5000, DS6000, DS8000, ESS, XIV, SONAS, N series
• Metro Mirror: Synchronous Mirroring
– SVC, V7000, DS3500, DCS3700, DS4000, DS5000, DS6000, DS8000, ESS, XIV, N series
• Global Mirror: Asynchronous Mirroring
– SVC, V7000, DCS3700, DS4000, DS5000, DS6000, DS8000, ESS, XIV, SONAS, N series
• Metro / Global Mirror: Three site synchronous and asynchronous mirroring
– DS8000 (sync+async)
– N series (only async)
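The naming lists on this slide can be kept as a small lookup table, for example to check which copy services a given product offers. The product sets below are transcribed from the slide as of 2013 and have not been checked against current product documentation; the `COPY_SERVICES` and `services_for` names are hypothetical.

```python
# Copy service name -> products listed for it on the slide.
COPY_SERVICES = {
    "FlashCopy": {
        "SVC", "V7000", "DS3000", "DS4000", "DS5000", "DS6000",
        "DS8000", "ESS", "XIV", "SONAS", "N series",
    },
    "Metro Mirror": {
        "SVC", "V7000", "DS3500", "DCS3700", "DS4000", "DS5000",
        "DS6000", "DS8000", "ESS", "XIV", "N series",
    },
    "Global Mirror": {
        "SVC", "V7000", "DCS3700", "DS4000", "DS5000", "DS6000",
        "DS8000", "ESS", "XIV", "SONAS", "N series",
    },
}


def services_for(product):
    """Return the copy services the slide lists for a product, in alphabetical order."""
    return sorted(name for name, products in COPY_SERVICES.items() if product in products)


print(services_for("DS8000"))
print(services_for("SONAS"))
```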
Today’s world: High Availability, Business Continuity is a Step by Step data strategy / workload journey
Balancing recovery time objective with cost / value
(chart: the BC Tier 1 through 7 chart of Cost / Value vs. Recovery Time Objective, from "Recovery from a disk image" to "Recovery from tape copy")
• Workload Types
• Data Strategy
• Cloud deployment if needed
Summary – IT High Availability / Business Continuity Best Practices 2012
(diagram labels: Production, Recovery, Workload types, Data strategy)
Foundation:
• Understand my data
• Define scope of recovery
• Backup/Restore Tier 1, 2 Foundation: Storage, server virtualization and consolidation
• Implement remote sites (Tier 1, 2)
• Backup/Restore Tier 1, 2 replicated foundation: SAN and server virtualization and consolidation
Backup/Restore:
• Implement Tier 3 – Consolidate and standardize Backup/Restore methods. Implement tape VTL, data de-dup, Server / Storage Virtualization / Mgmt tools, basic automation
Rapid Data Recovery:
• Implement Tier 4 – Standardize use of disk to disk and Point in Time disk copy
• Implement Tier 5 – Standardize DB / Application Mirroring methods
• Implement Tier 6 – Standardize high volume data replication method
Continuous Availability:
• Implement BC Tier 7 – Standardize use of Continuous Availability automated Failover
Key IT High Availability, Business Continuity Requirements Questions (in proper order):
1. What applications or databases to recover?
2. What platform? (z, p, i, x and Windows, Linux, heterogeneous open, heterogeneous z+Open)
3. What is desired Recovery Time Objective (RTO)?
4. What is distance between the sites? (if there are 2 sites)
5. What is the connectivity, infrastructure, and bandwidth between sites?
6. What specific hardware equipment needs to be recovered?
7. What is the Level of Recovery?
– Planned Outage
– Unplanned Outage
– Transaction Integrity
8. What is the Recovery Point Objective?
9. What is the amount of data to be recovered (in GB or TB)?
10. Who will design the solution?
11. Who will implement the solution?
12. Remaining solutions are valid choices to give to detailed DR evaluation team
(diagram: BC Tiers 1 through 7)
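Questions 3, 5, and 9 interact: the amount of data and the bandwidth between sites put a floor under the achievable recovery time. A back-of-the-envelope sketch, assuming decimal terabytes and a hypothetical link efficiency factor of 0.7 to allow for protocol overhead; both function names are illustrative.

```python
def transfer_hours(data_tb, link_mbps, efficiency=0.7):
    """Hours needed to move data_tb terabytes over a link_mbps link."""
    if data_tb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, link speed, or efficiency")
    bits = data_tb * 1e12 * 8                     # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600


def fits_rto(data_tb, link_mbps, rto_hours, efficiency=0.7):
    """True if the raw transfer alone fits inside the RTO (other recovery steps not counted)."""
    return transfer_hours(data_tb, link_mbps, efficiency) <= rto_hours


# Hypothetical example: 10 TB over a 1 Gbit/s link against a 4 hour RTO.
print(round(transfer_hours(10, 1000), 1), fits_rto(10, 1000, 4))
```

When the transfer alone exceeds the RTO, the data has to be at the recovery site before the outage, which points toward the replication-based tiers rather than restore-over-the-wire.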
Summary
(diagram label: Cloud deployment options)
Principles of architecting traditional IT HA, DR, BC
Technology and location considerations
Traditional Workloads vs. Internet Scale Workloads
Best Practices Step by Step Methodology