Understanding Network Failures in Data Centers

DESCRIPTION
Understanding Network Failures in Data Centers. Michael Over. Questions to be Answered: Which devices/links are most unreliable? What causes failures? How do failures impact network traffic? How effective is network redundancy?

TRANSCRIPT
Understanding Network Failures in Data Centers
Michael Over
Which devices/links are most unreliable? What causes failures? How do failures impact network traffic? How effective is network redundancy?
Questions will be answered using multiple data sources commonly collected by network operators.
Questions to be Answered
Demand for dynamic scaling and benefits from economies of scale are driving the creation of mega data centers.
The data center networks need to be scalable, efficient, fault tolerant, and easy to manage.
The issue of reliability has not been well addressed. In this paper, reliability is studied “by analyzing network error logs collected over a year from thousands of network devices across tens of geographically distributed data centers.”
Purpose of Study
Characterize network failure patterns in data centers and understand overall reliability of the network
Leverage lessons learned from this study to guide the design of future data centers
Goals of the Study
Network reliability is studied along three dimensions:
◦ Characterizing the most failure-prone network elements: those that fail with high frequency or that incur high downtime
◦ Estimating the impact of failures: correlate event logs with recent network traffic observed on links involved in the event
◦ Analyzing the effectiveness of network redundancy: compare traffic on a per-link basis during failure events to traffic across all links in the network redundancy group where the failure occurred
Network Reliability
Multiple monitoring tools are put in place by network operators.
Static View
◦ Router configuration files
◦ Device procurement data
Dynamic View
◦ SNMP polling
◦ Syslog
◦ Trouble tickets
Data Sources
Logs track low-level network events and do not necessarily imply application performance impact or service outage
Separate failures that potentially impact network connectivity from high volume and noisy network logs
Analyzing the effectiveness of network redundancy requires correlating multiple data sources across redundant devices and links
Difficulties with Data Sources
Data center networks show high reliability
◦ More than four 9’s of availability for 80% of the links and 60% of the devices
Low-cost, commodity switches such as ToRs and AggS are highly reliable
◦ Top of Rack switches (ToRs) and aggregation switches (AggS) exhibit the highest reliability
Load balancers dominate in terms of failure occurrences, with many short-lived software-related faults
◦ 1 in 5 load balancers exhibit a failure
Key Observations of Study
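As a point of reference (not from the paper itself), “four 9’s” of availability translates into a concrete annual downtime budget, which a short Python sketch makes explicit:

```python
def availability(downtime_hours, period_hours=365 * 24):
    """Fraction of the measurement period an element was up."""
    return 1.0 - downtime_hours / period_hours

# "Four 9's" (99.99% availability) over one year permits at most
# (1 - 0.9999) * 8760 hours of downtime, i.e. about 52.6 minutes.
budget_minutes = (1 - 0.9999) * 365 * 24 * 60
print(round(budget_minutes, 1))  # 52.6
```

So a link with more than roughly an hour of downtime per year already falls below the four-9’s bar the study reports for most links.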
Failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs
◦ Most failures lose a large number of packets relative to the number of lost bytes
Network redundancy is only 40% effective in reducing the median impact of failure
◦ Ideally, network redundancy should completely mask all failures from applications
Key Observations of Study
Best effort: Possible missed events or multiply-logged events
Data cleaned, but some events may still be lost due to software faults or disconnections
Human bias may arise in failure annotations
Network errors do not always impact network traffic or service availability
Thus, failure rates in this study should not be interpreted as necessarily all impacting applications
Limitations of Study
Background
ToRs are the most prevalent device type in the network comprising about 75% of devices
Load balancers are the next most prevalent at approximately 10% of devices
The remaining 15% are AggS, Core, and AccR
Despite being highly reliable, ToRs account for a large amount of downtime
LBs account for few devices but are extremely failure prone, making them a leading contributor of failures
Network Composition
Large volume of short-lived latency-sensitive “mice” flows
Few long-lived throughput-sensitive “elephant” flows
There are higher utilization rates at upper layers of the topology as a result of aggregation and high bandwidth oversubscription
Workload Characteristics
Network Event Logs (SNMP/syslog)
◦ Operators filter the logs and produce a smaller set of actionable events, which are assigned to NOC tickets
NOC Tickets
◦ Operators employ a ticketing system to track the resolution of issues
Network traffic data
◦ Five-minute averages of bytes/packets into and out of each network interface
Network topology data
◦ Static snapshot of network
Methodology & Data Sets
Network devices can send multiple notifications even though a link is operational
They monitor all logged “down” events for devices and links, leading to two types of failures:
◦ Link failures: the connection between two devices is down
◦ Device failures: the device is not functioning for routing/forwarding traffic
Multiple components’ notifications may relate to a single high-level failure or a correlated event
Failure events are correlated with network traffic logs to filter for failures with impact, i.e. those that potentially result in loss of traffic
Defining and Identifying Failures
A single link or device may experience multiple “down” events simultaneously◦ These are grouped together
An element may experience another “down” event before the previous event has been resolved◦ These are also grouped together
Cleaning the Data
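The grouping described on this slide amounts to merging overlapping time intervals per element. A minimal sketch (not the authors’ code; representing each “down” event as a `(start, end)` tuple is an assumption):

```python
def merge_down_events(events):
    """Merge overlapping "down" intervals for a single link or device
    into consolidated failure events."""
    merged = []
    for start, end in sorted(events):
        if merged and start <= merged[-1][1]:   # overlaps the previous event
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two overlapping down events plus one separate event -> two failures
print(merge_down_events([(0, 10), (5, 20), (30, 40)]))  # [(0, 20), (30, 40)]
```

Both cases on the slide (simultaneous “down” events, and a new “down” before the previous one resolves) reduce to this overlap test.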
Goal: Identify failures with impact without access to application monitoring logs
Cannot exactly quantify application impact such as throughput loss or increased response times
◦ Therefore, estimate the impact of failures on network traffic
Correlate each link failure with traffic observed on the link in the recent past, before the time of the failure
◦ Traffic less than before the failure implies impact
Identifying Failures with Impact
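The before/during comparison can be sketched as follows (an illustration, not the authors’ code; the study’s traffic data are five-minute SNMP averages of bytes and packets, and the sample values here are made up):

```python
from statistics import median

def has_impact(traffic_before, traffic_during):
    """Flag a link failure as impactful when the median traffic observed
    during the failure drops below the median observed just before it."""
    return median(traffic_during) < median(traffic_before)

print(has_impact([100, 120, 110], [10, 5, 0]))       # True: traffic dropped
print(has_impact([100, 120, 110], [115, 105, 118]))  # False: no visible drop
```

Using medians rather than means keeps one noisy five-minute sample from deciding the comparison.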
For device failures, additional steps are taken to filter spurious messages
If a device is down, neighboring devices connected to it will observe failures on inter-connecting links.
Verify that at least one link failure with impact has been noted for links incident on the device
This significantly reduces the number of device failures observed
Identifying Failures with Impact
Link Failure Analysis – All Failures
Link Failure Analysis – Failures with Impact
Links experience about an order of magnitude more failures than devices
Link failures are variable and bursty
Device failures are usually caused by maintenance
Failure Analysis
Top of Rack switches (ToRs) have the lowest failure rates
Load balancers (LBs) have the highest failure rate
Probability of Failure
Agg. Impact of Failures - Devices
Properties of Failures
In order to correlate multiple link failures:
◦ The link failures must occur in the same data center
◦ The failures must occur within some predefined time threshold
Observed that link failures tend to be isolated
Grouping Link Failures
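The two grouping rules above can be sketched as below (illustrative only; the tuple layout and the threshold value are assumptions, not the paper’s implementation):

```python
from collections import defaultdict

def group_link_failures(failures, threshold):
    """Group link failures occurring in the same data center within
    `threshold` seconds of one another.
    `failures` is a list of (datacenter, timestamp) tuples."""
    by_dc = defaultdict(list)
    for dc, ts in failures:
        by_dc[dc].append(ts)
    groups = []
    for dc, times in by_dc.items():
        times.sort()
        current = [times[0]]
        for ts in times[1:]:
            if ts - current[-1] <= threshold:
                current.append(ts)       # same correlated event
            else:
                groups.append((dc, current))
                current = [ts]
        groups.append((dc, current))
    return groups

# Failures at t=0 and t=30 in DC1 group together; t=500 is isolated
print(group_link_failures([("DC1", 0), ("DC1", 30), ("DC1", 500)], threshold=60))
```

The observation that link failures tend to be isolated corresponds to most groups containing a single failure.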
Root Causes of Failures
In the absence of application performance data, they estimate the amount of traffic that would have been routed on a failed link had it been available for the duration of a failure
The amount of data that was potentially lost during a failure event is estimated as:
◦ loss = (med_b − med_d) × duration, where med_b and med_d are the median traffic on the link before and during the failure
Link failures incur loss of many packets, but relatively few bytes
◦ Suggests packets lost during failures are mostly keep-alive packets used by applications
Estimating Failure Impact
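A minimal sketch of that loss estimate (illustrative; the units and sample numbers are made up, and the zero floor is an assumption for the case where traffic did not drop):

```python
def estimated_loss(median_before, median_during, duration):
    """Estimate data (bytes or packets) potentially lost during a failure:
    loss = (med_before - med_during) * duration, floored at zero."""
    return max(median_before - median_during, 0) * duration

# e.g. traffic fell from 50 MB/s to 10 MB/s for 120 s -> ~4800 MB potentially lost
print(estimated_loss(50, 10, 120))  # 4800
```

Applying this separately to byte counts and packet counts is what surfaces the “many packets, few bytes” observation above.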
There are several reasons why redundancy may not be 100% effective:
◦ Bugs in fail-over mechanisms can arise if there is uncertainty as to which link or component is the backup
◦ If the redundant components are not configured correctly, they will not be able to re-route traffic away from the failed component
◦ Protocol issues such as TCP backoff, timeouts, and spanning tree reconfigurations may result in loss of traffic
Is Redundancy Effective?
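One way to picture the effectiveness measure from the study (traffic during vs. before a failure, compared per link and across the redundancy group) is as a normalized traffic ratio. A hedged sketch with made-up numbers:

```python
def normalized_traffic(during, before):
    """Ratio of median traffic during a failure to median traffic before it.
    1.0 means the failure was fully masked; lower values mean lost traffic."""
    return during / before if before else 0.0

# The failed link itself carried nothing during the event, but across the
# redundancy group most traffic was rerouted (illustrative numbers only).
per_link = normalized_traffic(during=0, before=100)    # 0.0
per_group = normalized_traffic(during=90, before=100)  # 0.9
print(per_link, per_group)
```

The gap between the per-group ratio and the ideal value of 1.0 is what the fail-over bugs, misconfigurations, and protocol issues listed above fail to close.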
Links highest in the topology benefit most from redundancy
◦ A reliable network core is critical to traffic flow
◦ Redundancy is effective at reducing failure impact
Links from ToRs to aggregation switches benefit the least from redundancy, but have low failure impact
◦ On a per-link basis, these links do not experience significant impact from failures, so there is less room for redundancy to benefit them
Redundancy at Different Layers
Low-end switches exhibit high reliability
Improve reliability of middleboxes
Improve the effectiveness of network redundancy
Discussion
Application failures
◦ NetMedic aims to diagnose application failures in enterprise networks
Network failures
◦ These studies also observed that the majority of failures in data centers are isolated
Failures in cloud computing
◦ Increased focus on understanding component failures
Related Work
Large-scale analysis of network failure events in data centers
Characterize failures of network links and devices
Estimate failure impact
Analyze effectiveness of network redundancy in masking failures
Methodology of correlating network traffic logs with logs of actionable events to filter spurious notifications
Conclusions
Commodity switches exhibit high reliability
Middleboxes need to be better managed
Effectiveness of redundancy at network and application layers needs further investigation
Conclusions
This study considered the occurrence of interface level failures – only one aspect of reliability in data center networks
Future: Correlate logs from application-level monitors
Understand what fraction of application failures can be attributed to network failures.
Future Work
Questions???