system availability talk
DESCRIPTION
Talk I gave on HA, resiliency and recovery of systems
TRANSCRIPT
![Page 1: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/1.jpg)
Michael Richardson
Twitter: @Mr_SPB
© 2011 Energized Work - www.energizedwork.com
Availability and Recoverability
![Page 2: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/2.jpg)
So what is High Availability?
• Five 9s?
• No single point of failure?
• Multiple data centres?
• Fault tolerance?
• Load balancing?
• Uptime?
![Page 3: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/3.jpg)
The 9’s of Availability
![Page 4: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/4.jpg)
The 9’s of Availability
| Availability | Downtime per Year |
| --- | --- |
| One nine (90%) | 36.5 days |
| Two nines (99%) | 3.65 days |
| Three nines (99.9%) | 8.76 hours |
| Four nines (99.99%) | 52.56 minutes |
| Five nines (99.999%) | 5.26 minutes |
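
The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_hours_per_year` is just an illustration, not something from the talk):

```python
# Downtime per year implied by an availability percentage.
# Assumption: a 365-day year, which matches the figures on the slide.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, allowed by the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for percent in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% -> {hours / 24:.2f} days = {hours:.2f} hours = {hours * 60:.2f} minutes")
```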
![Page 5: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/5.jpg)
Problem with the 9’s
• What do they mean?
• Guaranteed, or just an SLA?
• Multiplicity: the availabilities of components that depend on each other multiply
(99.9% * 99.9% * 99.9% = 99.7%)
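
The multiplication can be sketched in a few lines. This assumes every component in the chain must be up for the service to be up, and that components fail independently; `serial_availability` is a hypothetical helper name:

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumption: component failures are independent of each other.
    """
    return prod(availabilities)


# Three components at 99.9% each, as on the slide.
chain = [0.999, 0.999, 0.999]
print(f"combined availability: {serial_availability(chain):.2%}")
```

Each extra dependency in the chain lowers the combined figure, which is why a per-component SLA says little about the service as a whole.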
![Page 6: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/6.jpg)
SLA availability numbers:
just aim to provide a level of confidence in a website’s service
![Page 7: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/7.jpg)
No Single Point of Failure (SPOF)
![Page 8: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/8.jpg)
Two of everything?
![Page 9: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/9.jpg)
Start with this
[Diagram: Users and a single Index.html]
![Page 10: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/10.jpg)
End with this
[Diagram: Users; Firewall 1 and Firewall 2; switch 1 and switch 2; WEB1 and WEB2; APP1 and APP2; DB1 and DB2]
![Page 11: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/11.jpg)
Problems with eliminating SPOF
• It’s expensive (££)
• Where do you draw the line?
• Are failures independent?
• Can you guarantee no SPOF?
• Increased complexity
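
The question of whether failures are independent matters because the usual case for "two of everything" rests on it. A rough sketch, where the 99% and 99.9% figures and the shared dependency are hypothetical values chosen only for illustration:

```python
def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` identical components where one working copy is enough.

    Assumption: the copies fail independently of each other.
    """
    return 1 - (1 - single) ** copies


single = 0.99  # hypothetical availability of one component
pair = redundant_availability(single)
print(f"one component:  {single:.4%}")
print(f"redundant pair: {pair:.4%}")

# Hypothetical shared dependency (for example, one power feed) that both copies
# rely on. It acts as a hidden single point of failure in series with the pair.
shared = 0.999
print(f"pair behind a shared dependency: {pair * shared:.4%}")
```

Under the independence assumption the pair looks far better than a single component, but anything the two copies share caps the result at that shared component's availability.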
![Page 12: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/12.jpg)
Problem: Data Centres Fail
![Page 13: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/13.jpg)
Solution: Get a 2nd Data Centre
![Page 14: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/14.jpg)
Hot/Hot Multisite
• Full range of services available in multiple locations
• Easy to automate failover of sites
• Data consistency is hard
• Capacity planning concerns
![Page 15: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/15.jpg)
Hot/Warm Multisite
• Simpler than Hot/Hot
• Read/write ratio dependent
• Replicate data synchronously or asynchronously?
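
The synchronous versus asynchronous question is a trade-off between write latency and how much data can be lost on failover. The toy model below is only a sketch under stated assumptions: `Primary`, `Replica` and `writes_lost_on_failover` are hypothetical names, everything is in memory, and the asynchronous shipping is triggered by hand after the third write.

```python
class Replica:
    """Warm-site copy of the data, modelled as a list of records."""

    def __init__(self):
        self.log = []


class Primary:
    """Hot-site database that replicates to one replica."""

    def __init__(self, replica, synchronous):
        self.log = []
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # records written locally but not yet shipped (async only)

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Synchronous: the write only completes once the replica has the record.
            self.replica.log.append(record)
        else:
            # Asynchronous: the write completes now and the record is shipped later.
            self.pending.append(record)

    def ship(self):
        """Send any pending records to the replica."""
        self.replica.log.extend(self.pending)
        self.pending.clear()


def writes_lost_on_failover(synchronous):
    replica = Replica()
    primary = Primary(replica, synchronous)
    for i in range(5):
        primary.write(i)
        if i == 2:
            primary.ship()  # background shipping happens to run after the third write
    # The primary site fails here; whatever the replica lacks is lost.
    return len(primary.log) - len(replica.log)


print("lost with synchronous replication: ", writes_lost_on_failover(True))
print("lost with asynchronous replication:", writes_lost_on_failover(False))
```

In this model the synchronous primary never gets ahead of the replica, while the asynchronous one loses whatever was written since the last shipment. The price of the synchronous mode, not modelled here, is that every write waits on the link to the second site.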
![Page 16: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/16.jpg)
Hot/Cold Multisite
• Easy to set up
• Will it work?
• Can it be trusted?
• Cold sites rapidly become stale
• Is it actually valuable?
![Page 17: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/17.jpg)
DR Multisite
• Fingers crossed you never need it
• How can/should you test it?
• Cloud?
![Page 18: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/18.jpg)
Problems with Multiple sites
• It’s expensive (££)
• Managing more systems
• Managing consistency of data
• Managing capacity
• Is it still fail-proof?
• Unless you test it, it’s just a plan
![Page 19: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/19.jpg)
We now have a Complex System
![Page 20: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/20.jpg)
Complex Systems
• More redundancy and automation leads to more complexity.
• More complexity often adds more points of failure.
![Page 21: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/21.jpg)
“How Complex Systems Fail”
Author: Dr. Richard Cook
• Catastrophe is always just around the corner.
• Human operators have dual roles.
• Change introduces new forms of failure.
![Page 22: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/22.jpg)
Failure and Recovery
![Page 23: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/23.jpg)
Questions for the Customer
• What is the cost of downtime?
• What are the RTO and RPO?
![Page 24: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/24.jpg)
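RTO = Recovery Time Objective (how long the service may take to come back after a failure)
RPO = Recovery Point Objective (how much recent data, measured in time, may be lost)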
![Page 25: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/25.jpg)
Aggressive RTO and RPO targets are expensive and have a performance impact.
![Page 26: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/26.jpg)
RTO / RPO example
Problem
• Simple DB
• Business can tolerate up to 15 minutes of downtime
• 10-minute window of data loss
![Page 27: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/27.jpg)
RTO / RPO example
Possible solution
1. Continuously replicate data to a 2nd host
2. Continue with nightly backups and also copy DB transaction logs from the primary host to another system
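
One way to reason about candidate designs is to compare each one's worst-case data loss against the RPO and its worst-case recovery time against the RTO. In the sketch below only the 15-minute RTO and 10-minute RPO come from the example. The `Strategy` type, the strategy names and every per-strategy figure are hypothetical placeholders that would have to be measured on a real system.

```python
from dataclasses import dataclass

RTO_MIN = 15  # from the example: tolerable downtime, in minutes
RPO_MIN = 10  # from the example: tolerable window of data loss, in minutes


@dataclass
class Strategy:
    name: str
    worst_case_data_loss_min: float  # compared against the RPO
    worst_case_recovery_min: float   # compared against the RTO


def meets_objectives(strategy, rto=RTO_MIN, rpo=RPO_MIN):
    """True if the strategy stays within both the recovery time and recovery point objectives."""
    return (strategy.worst_case_recovery_min <= rto
            and strategy.worst_case_data_loss_min <= rpo)


# Hypothetical figures for illustration only; none of these numbers are from the talk.
strategies = [
    Strategy("nightly backup only", worst_case_data_loss_min=24 * 60, worst_case_recovery_min=120),
    Strategy("nightly backup + logs copied every 5 min", worst_case_data_loss_min=5, worst_case_recovery_min=60),
    Strategy("continuous replication to 2nd host", worst_case_data_loss_min=1, worst_case_recovery_min=10),
]

for s in strategies:
    verdict = "meets" if meets_objectives(s) else "misses"
    print(f"{s.name}: {verdict} RTO={RTO_MIN} min / RPO={RPO_MIN} min")
```

With these placeholder numbers, copying logs satisfies the RPO but a slow restore still misses the RTO, which shows why both objectives have to be checked separately.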
![Page 28: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/28.jpg)
So what’s more important?
Increasing Availability
Or
Reducing Recovery Time
![Page 29: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/29.jpg)
MTBF (Mean Time Between Failures)
Or
MTTR (Mean Time To Recovery)
What about MTTD (Mean Time To Detect)?
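
A common steady-state approximation ties these together: availability = MTBF / (MTBF + downtime per failure). The sketch below assumes MTBF means the average running time between failures and treats downtime per failure as detection time plus recovery time. The `availability` function and all the hour values are hypothetical.

```python
def availability(mtbf_hours, mttr_hours, mttd_hours=0.0):
    """Steady-state availability = uptime / (uptime + downtime).

    Assumptions: MTBF is the mean running time between failures, and each
    failure costs detection time (MTTD) plus recovery time (MTTR).
    """
    downtime = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime)


# Hypothetical figures for illustration only.
print(f"baseline (MTBF 1000 h, MTTR 2 h): {availability(1000, 2):.4%}")
print(f"double the MTBF:                  {availability(2000, 2):.4%}")
print(f"halve the MTTR:                   {availability(1000, 1):.4%}")
print(f"halved MTTR, 1 h to detect:       {availability(1000, 1, mttd_hours=1):.4%}")
```

In this model doubling MTBF and halving MTTR give the same availability, and slow detection can cancel out a faster recovery.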
![Page 30: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/30.jpg)
Answer?
It Depends
![Page 31: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/31.jpg)
Failure is inevitable
![Page 32: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/32.jpg)
Ask anyone
![Page 33: System Availability Talk](https://reader031.vdocument.in/reader031/viewer/2022013100/54b767474a795971038b4575/html5/thumbnails/33.jpg)
Thank you
The End
Twitter - @Mr_SPB