J. Gray, Dependability in the Internet Era (acknowledgement: slides from J. Gray, E. Brewer)
Post on 21-Dec-2015
![Page 1: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/1.jpg)
• J. Gray, Dependability in the Internet Era
• (acknowledgement: slides from J. Gray, E. Brewer)
![Page 2: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/2.jpg)
The Last 10 Years: Availability Dark Ages
Ready for a Renaissance?
• Things got better, then things got a lot worse!
[Figure: availability (ticks: 9%, 99%, 99.9%, 99.99%, 99.999%) versus year (1950 to 2010), with curves for Computer Systems, Telephone Systems, Cellphones, and Internet.]
![Page 3: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/3.jpg)
DEPENDABILITY: The 3 ITIES
• RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
• AVAILABILITY: Does it now. (also $1 \gg \frac{\mathrm{MTTR}}{\mathrm{MTTF}+\mathrm{MTTR}}$)
System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time.)
• Holistic vs. Reductionist view
[Figure labels: Security, Integrity, Reliability, Availability]
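The 89% figure on this slide follows from multiplying the two availabilities. The sketch below reproduces that arithmetic. It assumes the terminals and the DB fail independently and that a transaction needs both; the function names are illustrative and not from the slides.

```python
from math import prod


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one module: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up.

    Assumption: the parts fail independently of each other.
    """
    return prod(parts)


# Slide example: 90% of terminals up and 99% of the DB up.
end_to_end = series_availability(0.90, 0.99)
print(f"end-to-end availability: {end_to_end:.1%}")

# A module with MTTF = 99 hours and MTTR = 1 hour is up 99% of the time.
print(f"single module: {availability(99, 1):.1%}")
```

This is the "holistic" view in miniature: the user sees the product of the parts, which is always worse than the weakest part.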
![Page 4: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/4.jpg)
Fail-Fast is Good, Repair is Needed
Improving either MTTR or MTTF gives benefit
[Figure: lifecycle of a module: Fault, Detect, Repair, Return. Fail-fast gives short fault latency.]
High Availability is low UN-Availability
$\text{Unavailability} \approx \frac{\mathrm{MTTR}}{\mathrm{MTTF}}$
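To see why improving either MTTR or MTTF helps, the sketch below compares the exact unavailability, MTTR / (MTTF + MTTR), with the slide's approximation MTTR / MTTF. The numbers (a 10,000-hour MTTF and a 10-hour MTTR) are made up for illustration.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Exact fraction of time a module is down: MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def unavailability_approx(mttf: float, mttr: float) -> float:
    """Approximation MTTR / MTTF, reasonable when MTTF >> MTTR."""
    return mttr / mttf


# Illustrative numbers only (hours); not taken from the slides.
base = unavailability(10_000, 10)
halved_repair = unavailability(10_000, 5)   # repair twice as fast
doubled_life = unavailability(20_000, 10)   # fail half as often

print(f"baseline:      {base:.6f}")
print(f"half the MTTR: {halved_repair:.6f}")
print(f"twice the MTTF:{doubled_life:.6f}")
print(f"approximation: {unavailability_approx(10_000, 10):.6f}")
```

Halving MTTR and doubling MTTF reduce unavailability by the same factor, which is why fast detection and repair matter as much as rarely failing.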
![Page 5: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/5.jpg)
Disks (raid) the BIG Success Story
• Duplex or Parity: masks faults
• Disks @ 1M hours (~100 years)
• But
  – controllers fail and
  – have 1,000s of disks.
• Duplexing or parity, and dual path gives “perfect disks”
• Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
• Only software/operations mistakes are left.
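A rough calculation shows why a large disk farm still sees regular failures, and why duplexing hides them. The sketch below uses the slide's 1M-hour disk MTTF. The 24-hour repair time and the MTTF² / (2 × MTTR) formula for a mirrored pair are assumptions: the formula is a common textbook approximation for independent failures, not something stated on the slide.

```python
HOURS_PER_YEAR = 8760  # 365 days


def expected_failures_per_year(n_disks: int, disk_mttf_hours: float) -> float:
    """Expected number of individual disk failures per year in a farm."""
    return n_disks * HOURS_PER_YEAR / disk_mttf_hours


def mirrored_pair_mttf_hours(disk_mttf_hours: float, repair_hours: float) -> float:
    """Approximate MTTF of a duplexed pair: MTTF^2 / (2 * MTTR).

    Assumptions: independent failures and repair time << disk MTTF.
    Data is lost only if the second disk fails while the first is being repaired.
    """
    return disk_mttf_hours ** 2 / (2 * repair_hours)


DISK_MTTF = 1_000_000   # hours, from the slide
REPAIR_TIME = 24        # hours, an assumed value for illustration

print(expected_failures_per_year(1_000, DISK_MTTF), "failures/year for 1,000 disks")
pair_years = mirrored_pair_mttf_hours(DISK_MTTF, REPAIR_TIME) / HOURS_PER_YEAR
print(f"mirrored pair MTTF: about {pair_years:,.0f} years")
```

With a thousand disks, several fail every year, yet each mirrored pair is expected to outlast any realistic deployment, which is consistent with the "never lost a byte" claim.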
![Page 6: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/6.jpg)
Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance: mask local faults
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover
• Disaster Tolerance: masks site failures
  – Protects against fire, flood, sabotage,..
  – Also, software changes, site moves,…
  – Redundant system and service at remote site.
![Page 7: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/7.jpg)
Availability
[Figure: availability rises from un-managed systems through the levels below.]
• Un-managed
• Well-managed nodes: masks some hardware failures
• Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures
• Well-managed GeoPlex: masks site failures (power, network, fire, move,…); masks some operations failures
![Page 8: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/8.jpg)
Case Studies - Tandem Trends
MTTF improved
Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
NOTE: Systematic under-reporting of
• Environment
• Operations errors
• Application Software
[Figure: two charts for 1985, 1987, and 1989, broken down by primary cause (unknown, environment, operations, maintenance, hardware, software): "Outages / 1000 System Years by Primary Cause" and "% of Outages by Primary Cause".]
![Page 9: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/9.jpg)
Dependability Status circa 1995
• ~4-year MTTF
• 5 9s for well-managed sys. Fault Tolerance Works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
  – New Software.
  – Utilities.
• Need to make all hardware/software changes ONLINE.
![Page 10: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/10.jpg)
Progress?
• MTTF improved from 1950-1995
• MTTR incremental improvements 1970 --- failover
• Hardware and Software online change (pNp) is now standard
• Then the Internet arrived:
  – No project can take more than 3 months.
  – Time to market is everything
  – Change is good.
[Figure labels: Computer Systems, Telephone Systems, Cellphones, Internet]
![Page 11: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/11.jpg)
The Internet Changed Expectations
1990
Phones delivered 99.999%
ATMs delivered 99.99%
Failures were front-page news.
Few hackers
Outages last an “hour”
2005
Cell phones deliver 90%
Web sites deliver 99%
Failures are business-page news
Many hackers.
Outages last a “day”
This is progress?
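The gap between these percentages is easier to judge as downtime per year. The sketch below converts each availability figure from the slide into minutes and days of outage, assuming a 365-day year and that the percentage is a long-run average; the helper name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected minutes of outage per year at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


figures = [
    ("Phones, 1990", 99.999),
    ("ATMs, 1990", 99.99),
    ("Web sites, 2005", 99.0),
    ("Cell phones, 2005", 90.0),
]

for label, percent in figures:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label:<18} {percent:>7}%  {minutes:>10,.1f} min/year  ({minutes / 1440:.2f} days)")
```

Five nines allows roughly five minutes of outage a year, while 90% allows more than 36 days.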
![Page 12: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/12.jpg)
2006
![Page 13: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/13.jpg)
![Page 14: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/14.jpg)
Eric Brewer said it best:
ACID vs BASE
the internet litmus test
• Atomicity, Consistency, Isolation, Durability
• Availability?
• Strong consistency
• Isolation
• Focus on commit
• Conservative (Pessimistic)
• Difficult evolution (e.g. schema)
• Nested transactions

• Basic Availability, Soft State, Eventual Consistency
• Availability FIRST
• Weak consistency: stale data is OK
• Approximate answers OK
• Best effort
• Aggressive (optimistic)
• Easier Evolution.
• Simpler!
• Faster
I think it is a spectrum
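One way to picture the two ends of that spectrum is a toy store with two replicas. The sketch below is a single-process illustration, not Brewer's or Gray's design: the class and method names are hypothetical, replication is simulated with a queue, and versions are plain counters. A BASE-style read answers immediately and may return stale data; a stricter read first applies every pending update.

```python
from collections import deque


class Replica:
    """Holds key -> (version, value); keeps only the newest version."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    """Hypothetical two-replica store with delayed propagation."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = deque()  # updates not yet shipped to the secondary
        self.version = 0

    def write(self, key, value):
        # Acknowledge after updating one replica; ship to the other later.
        self.version += 1
        self.primary.apply(key, self.version, value)
        self.pending.append((key, self.version, value))

    def sync(self):
        # Background propagation: the replicas converge once this runs.
        while self.pending:
            self.secondary.apply(*self.pending.popleft())

    def read_any(self, key):
        # BASE-style read: always answers, possibly with stale data.
        return self.secondary.read(key)

    def read_strong(self, key):
        # Stricter read: wait for propagation before answering.
        self.sync()
        return self.secondary.read(key)


store = EventuallyConsistentStore()
store.write("price", 10)
store.sync()
store.write("price", 12)
print(store.read_any("price"))     # stale value from before the second write
print(store.read_strong("price"))  # newest value, after catching up
```

The weak read never blocks, which is the "Availability FIRST" choice; the strong read pays for freshness by waiting on propagation.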
![Page 15: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/15.jpg)
![Page 16: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/16.jpg)
![Page 17: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/17.jpg)
![Page 18: J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)](https://reader031.vdocument.in/reader031/viewer/2022032201/56649d585503460f94a378fd/html5/thumbnails/18.jpg)