The Network is Reliable
An informal survey of real-world communications failures
BY PETER BAILIS AND KYLE KINGSBURY
CONTENTS
• Abstract
• Various survey reports of network reliability under different circumstances
• Conclusion
ABSTRACT
• “The network is reliable” is one of the fallacies of distributed computing.
• The degree of network reliability is critical for systems to function robustly.
• It is hard to determine the degree of network reliability.
VARIOUS SURVEY REPORTS OF NETWORK RELIABILITY UNDER DIFFERENT CIRCUMSTANCES
LARGE DEPLOYMENTS & ISSUES
• What are large deployments?
Large deployments are globally run distributed systems whose infrastructure spans hundreds of thousands of servers.
• What is considered a serious issue in large deployments?
Partitions: a network partition is a failure, for example of a network device, that splits the network into groups of nodes that cannot communicate with each other.
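The definition of a partition above can be illustrated with a small reachability check. This is a minimal sketch, assuming a hypothetical topology (the node names, the single `switch`, and the helper names are invented for illustration and do not come from the survey): remove one device, then see which groups of nodes can still reach each other.

```python
from collections import deque


def components(nodes, links):
    """Return the groups of nodes that can still reach each other."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        # Ignore links that touch a node which is no longer present.
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search from each node not yet visited.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for peer in adjacency[node]:
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        groups.append(group)
    return groups


# Hypothetical topology: two racks joined only through one switch.
NODES = {"a1", "a2", "b1", "b2", "switch"}
LINKS = [("a1", "a2"), ("b1", "b2"), ("a1", "switch"), ("b1", "switch")]


def after_failure(failed_device):
    """Groups of mutually reachable nodes once one device has failed."""
    return components(NODES - {failed_device}, LINKS)
```

With every device up, `components(NODES, LINKS)` yields a single group. Calling `after_failure("switch")` yields two groups, which is the partition the slide describes.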
LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLES
BEHAVIOR OF NETWORK FAILURES IN MICROSOFT DATACENTERS
• Average failure rate: 5.2 devices/day and 40.8 links/day
• Median loss of 59,000 packets per failure
• Median time to repair of approximately five minutes
• Redundancy improves median traffic by only 43%
[Chart: Per Day Failures, for Devices and Links]
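To put the per-day failure rates in perspective, they can be scaled to a year and turned into an average gap between failures. This is a rough sketch: the constant and function names are illustrative, and it assumes failures are spread evenly over time, which the slides do not claim.

```python
# Per-day failure rates quoted on the slide.
DEVICE_FAILURES_PER_DAY = 5.2
LINK_FAILURES_PER_DAY = 40.8


def failures_per_year(per_day, days=365):
    """Scale a per-day failure rate to a yearly count."""
    return per_day * days


def minutes_between_failures(per_day):
    """Average gap between failures, assuming they are evenly spread."""
    return 24 * 60 / per_day
```

Under that assumption, the rates work out to roughly 1,900 device failures and 14,900 link failures per year, with a link failing about every 35 minutes somewhere in the network.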
LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLES
NETWORK FAILURES IN HP’S MANAGED NETWORKS
Analysis of support ticket data:
• Connectivity-related tickets accounted for 11.4% of support tickets
• 14% of those were of the highest priority level
• Median incident duration was 2 hours 45 minutes for the highest-priority tickets and 4 hours 18 minutes for all tickets
[Chart: Trouble Tickets, Connectivity Related vs. High Priority]
LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLES
THE FIRST YEAR OF A NEW GOOGLE CLUSTER INVOLVES
• Five faulty racks (40–80 machines seeing 50% packet loss)
• Eight network maintenance events (four of which might cause 30-minute random connectivity losses)
• Three router failures (traffic has to be pulled immediately for an hour)
LARGE DEPLOYMENTS & ISSUES (CONTD.)
How do these companies try to cope with network partitions?
Google (per Dean): “easy-to-use” abstractions
PNUTS: weaker consistency alternatives
DATACENTER NETWORK FAILURES
A Google datacenter
Main causes of failures:
1) Power failure
2) Misconfiguration
3) Firmware bugs
4) Topology changes
5) Cable damage
6) Malicious traffic
CLOUD NETWORKS
What are cloud networks?
Key issues:
1) Transient latency
2) Dropped packets
3) Full network partitions
CLOUD NETWORKS (CONTD.)
What happens when two nodes are both connected to the Internet but unable to see each other?
What can we learn from this case?
HOST PROVIDERS
Can hosting providers offer reliable networks?
E.g. Freistil IT: a specific datacenter saw 50%-100% packet loss, which caused the GlusterFS distributed file system to enter split-brain undetected.
Why?
What is the main issue?
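One common guard against split-brain is a majority quorum: a replica accepts writes only while it can see more than half of the cluster, so at most one side of a partition stays writable. The following is a generic sketch of that idea, not a description of how GlusterFS works or of what Freistil IT deployed; the function names are illustrative.

```python
def has_quorum(reachable_replicas, cluster_size):
    """True if this side of a partition sees a strict majority.

    reachable_replicas counts the local replica as well.
    """
    return reachable_replicas > cluster_size // 2


def may_accept_writes(visible_replicas, cluster_size):
    """Refuse writes on any side that cannot prove it is the majority."""
    return has_quorum(len(visible_replicas), cluster_size)
```

In a three-replica cluster split 2/1, only the side with two replicas keeps accepting writes. In a four-replica cluster split 2/2, neither side does, which trades availability for avoiding divergent data.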
WIDE AREA NETWORKS(WAN)
• Why are WAN failures particularly interesting?
• Example: CENIC, average partition duration (over 5 years): 6 minutes for software-related failures (SRF), 8.2 hours for hardware-related failures (HRF)
Conclusion: graceful degradation under partition or increased latency is especially important for WANs.
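One way to degrade gracefully is to bound how long a remote call may take and fall back to the last known value when the bound is exceeded. The sketch below assumes a hypothetical `fetch` callable standing in for a WAN request; the class name, the timeout, and the "fresh"/"stale"/"unavailable" labels are illustrative choices, not something the survey prescribes.

```python
import concurrent.futures


class DegradingClient:
    """Wrap a remote fetch with a timeout and a stale-value fallback."""

    def __init__(self, fetch, timeout_s):
        self._fetch = fetch          # hypothetical remote call: key -> value
        self._timeout_s = timeout_s
        self._cache = {}             # last successfully fetched values
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def get(self, key):
        """Return (value, status), where status says how fresh the value is."""
        future = self._pool.submit(self._fetch, key)
        try:
            value = future.result(timeout=self._timeout_s)
        except Exception:
            # Timeout or remote error: serve the last known value if any.
            if key in self._cache:
                return self._cache[key], "stale"
            return None, "unavailable"
        self._cache[key] = value
        return value, "fresh"
```

Returning the status alongside the value lets the caller decide whether stale data is acceptable for a given operation. Note that a timed-out fetch keeps running in its worker thread; a production client would also need cancellation and a limit on outstanding requests.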
GLOBAL ROUTING FAILURES
• Can an Internet system with a high level of redundancy be safe?
1) Firewall configuration error: e.g. CloudFlare
2) Firmware bug: e.g. Juniper Networks
3) BGP misconfiguration: e.g. Pakistan Telecom
NICS AND DRIVERS
Firmware bug: NIC problem
e.g. BCM5709 (chip model)
Misconfiguration: driver problem
e.g. bnx2
APPLICATION-LEVEL FAILURES
What issues cause messages to be dropped or delayed?
1) Crashes
2) Program errors
3) Scheduler latency
4) Overloaded processes
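From a remote observer's point of view, all four causes look the same: expected messages stop arriving in time. A minimal timeout-based detector makes this concrete. The class, the peer names, and the explicit `now` timestamps are illustrative assumptions chosen to keep the example deterministic.

```python
class HeartbeatDetector:
    """Suspect any peer whose last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # peer name -> time of last heartbeat

    def heartbeat(self, peer, now):
        """Record that a heartbeat from peer arrived at time now."""
        self.last_seen[peer] = now

    def suspected(self, now):
        """Peers that have been silent for longer than the timeout."""
        return {
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        }
```

A crashed process and one that is merely stalled by scheduler latency or overload are both suspected once the timeout passes. Only the stalled one is cleared again when its next heartbeat arrives, so the detector cannot distinguish slow from dead at the moment it raises the suspicion.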
CONCLUSION
Where do communication failures occur?
• Processes
• Servers
• NICs, switches
• Local and wide area networks
• Etc.
CONCLUSION (CONTD.)
• Does a reliable network exist?
• It depends on:
1) Cautious engineering
2) Aggressive network advances
3) Lots of investment
CONCLUSION (CONTD.)
• What can we do? Consider the risk before a partition occurs.
QUESTIONS TIME ! LOL!
THANK YOU FOR YOUR PATIENCE