ABSENCE: Usage-based Failure Detection in Mobile Networks
Binh Nguyen, Zihui Ge, Jacobus Van der Merwe, He Yan, Jennifer Yates Mobicom 2015
Silent failures
• Silent failures: service disruptions or outages that are not detected by current monitoring systems.
• Causes: newly rolled-out features, bugs on devices, or a combination of both.
[Figure: a mobile network with a RAN connected to the EPC core]
Detecting silent failures is challenging!
Detecting silent failures is difficult - passive network monitoring
• Drops in traffic/usage on network elements do not imply service disruptions:
• Load balancing/maintenance activities.
• Dynamic routing/Self-Organizing Network (SON).
• Key Performance Indicators (KPIs) may not reflect service issues:
• E.g., the accessibility KPI looks good even when only a subset of users can access the network.
[Figure: load over time around a load-balancing event, showing expected vs. actual load]
A “healthy network” (from a monitoring perspective) does not guarantee the service experience of users!
Detecting silent failures is difficult - active service monitoring
• Send test traffic across the network on all service paths.
• Many types of customer devices and applications, and a huge geographic area to probe.
[Figure: test probes traversing the RAN and EPC core]
Active monitoring does not scale!
Relying on customer feedback
• Relying on customer feedback is too slow: it takes customers time to report problems, introducing hours of delay.
• E.g., a failure starts at 16:38 UTC but only manifests in customer feedback at 21:00 UTC, a 3.5-hour delay.
[Figure: number of trouble tickets over time; the event starts at 16:38 UTC but is detected by customer care at 21:00 UTC, 3.5 hours later]
Too slow!
ABSENCE: usage-based failure detection
• ABSENCE is a passive service-monitoring approach: it monitors customer usage passively.
• The absence of customer usage is a reliable indicator of service disruptions in a mobile network.
ABSENCE’s key ideas
[Figure: a group of users generating usage through the mobile network]
• If a failure happens, users cannot use the network as normal.
• When a large number of users cannot use the network, aggregate usage drops.
• This detects both hard failures (outages) and performance degradations.
ABSENCE overview
• Use anonymized and aggregated Call Detail Records (CDRs) collected in real time from a U.S. operator.
[Figure: three weeks of usage time series overlaid; week 3 shows an “absence” of usage relative to the “expected” usage — an anomaly?]
Outline
• Motivation.
• ABSENCE overview.
• Is ABSENCE feasible?
• ABSENCE’s challenges.
• ABSENCE’s event detection.
• Synthetic workload evaluation.
• Operational validation.
Is usage predictable enough?
• While an individual user’s usage is not predictable, the usage of a large group of users is.
• For example, overlaying 3 weeks of usage, the usage of a small group (1,170 users) is less predictable than that of a large group (3,000 users).
[Figure: number of calls over time for weeks 1-3 overlaid, for a 1,170-user group and a 3,000-user group]
Yes, the usage of a large enough group of users is predictable!
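The averaging effect above can be illustrated with a small simulation. This is a sketch, not from the paper: the heavy-tailed lognormal per-user usage model and all parameters are assumptions chosen only to show that the coefficient of variation of a group total shrinks as the group grows.

```python
import random
import statistics

def aggregate_cv(group_size, weeks=50, seed=1):
    """Coefficient of variation (stdev/mean) of a group's total weekly usage.

    Each user's weekly call count is drawn independently from a
    hypothetical heavy-tailed (lognormal) model -- individual usage is
    bursty and unpredictable -- yet the CV of the group total shrinks
    roughly as 1/sqrt(group_size), making the aggregate predictable.
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.lognormvariate(2.0, 1.0) for _ in range(group_size))
        for _ in range(weeks)
    ]
    return statistics.stdev(totals) / statistics.mean(totals)
```

Comparing `aggregate_cv(30)` with `aggregate_cv(3000)` shows the 3,000-user aggregate varying far less week to week, matching the intuition in the slide.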
Outline
• Motivation.
• ABSENCE overview.
• Is ABSENCE feasible?
• ABSENCE’s challenges.
• ABSENCE’s event detection.
• Synthetic workload evaluation.
• Operational validation.
Challenges
• Failures happen at different scopes: geographic area, device make/model, service type.
• How to deal with user mobility?
• How to improve the predictability of aggregate usage?
• How to make ABSENCE scalable, given the large amount of data in the network?
How to detect failures with different scopes?
• Group users based on their geographical information: ZIP-code area, city, state.
• A user can belong to multiple geographical groups at the same time.
• Under each geographical group, users are further divided by device OS and make.
[Figure: example hierarchy for Salt Lake City (ZIP prefix 841): Android (Samsung Gal. S4/S5, HTC) and iOS (iPhone 5, iPhone 6)]
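The grouping above can be sketched as follows. The CDR field names (`zip`, `city`, `state`, `os`, `make`) are hypothetical, since the paper's exact schema is not shown; the point is that one usage record feeds every scope it belongs to.

```python
from collections import defaultdict

def group_keys(record):
    """All scope keys a single usage record contributes to.

    A user belongs to several geographic groups at once (ZIP, city,
    state), and each geographic group is further subdivided by device
    OS and by OS + make.
    """
    keys = []
    for level, value in (("zip", record["zip"]),
                         ("city", record["city"]),
                         ("state", record["state"])):
        keys.append((level, value))                                # geo only
        keys.append((level, value, record["os"]))                  # geo + OS
        keys.append((level, value, record["os"], record["make"]))  # geo + OS + make
    return keys

# One call is counted toward every scope it belongs to:
usage = defaultdict(int)
record = {"zip": "84101", "city": "Salt Lake City", "state": "UT",
          "os": "Android", "make": "Samsung"}
for key in group_keys(record):
    usage[key] += 1
```

With three geographic levels and three device granularities each, every record updates nine aggregate time series, which is how a failure scoped to, say, one device make in one city still shows up as a usage drop in exactly the matching group.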
Outline
• Motivation.
• ABSENCE overview.
• Is ABSENCE feasible?
• ABSENCE’s challenges.
• ABSENCE’s event detection.
• Synthetic workload evaluation.
• Operational validation.
Event detection algorithm
• Decompose the usage time series into trend, seasonal, and noise components:
• Trend: moving average.
• Seasonal: average of values at the same phase.
• Noise = Time series - Trend - Seasonal.
[Figure: a usage time series decomposed into trend, seasonal, and noise components, with detected anomalies marked]
• If the noise falls outside the 95% confidence interval (CI) of the noise component, flag an anomaly.
[Figure: noise component with upper and lower 95% CI bands; a point below the lower band is flagged as an anomaly]
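The detection recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the window length, the exact moving-average form, and the Gaussian band (mean ± 1.96·stdev as the ~95% CI) are assumptions.

```python
import math
import statistics

def detect_anomalies(series, period=24, window=24, z=1.96):
    """Decompose an hourly usage series and flag out-of-band hours.

    Trend: centered moving average. Seasonal: mean of same-phase values
    of the detrended series. Noise = series - trend - seasonal. An hour
    is anomalous when its noise leaves the ~95% band (mean +/- z*stdev).
    """
    n, half = len(series), window // 2
    trend = [statistics.mean(series[max(0, i - half):i + half + 1])
             for i in range(n)]
    detrended = [s - t for s, t in zip(series, trend)]
    phase_mean = [statistics.mean(detrended[p::period]) for p in range(period)]
    noise = [detrended[i] - phase_mean[i % period] for i in range(n)]
    mu, sd = statistics.mean(noise), statistics.stdev(noise)
    return [i for i, x in enumerate(noise) if abs(x - mu) > z * sd]

# Usage sketch: one week of clean diurnal traffic with an injected outage.
usage = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]
for h in range(80, 86):
    usage[h] *= 0.3  # 70% usage drop for 6 hours
```

On this synthetic series the injected drop falls well outside the noise band and is flagged; in the deck, the anomalies of interest are the drops below the lower CI.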
Outline
• Motivation.
• ABSENCE overview.
• Is ABSENCE feasible?
• ABSENCE’s challenges.
• ABSENCE’s event detection.
• Synthetic workload evaluation.
• Operational validation.
Synthetic workload evaluation
• 6 months of real CDRs from a U.S. operator.
• Synthetically introduce failures:
• Network failures: remove usage on base stations.
• Device failures: remove usage on devices.
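The failure-injection step might look like the following sketch. The hourly-series representation and the uniform per-hour impact are assumed simplifications; in the evaluation, usage is removed for specific base stations or devices.

```python
def inject_failure(series, start, duration, impact):
    """Return a copy of an hourly usage series with a synthetic failure.

    During [start, start + duration), the affected scope (e.g. a set of
    base stations, or one device model) loses `impact` fraction of its
    usage; the rest of the series is untouched.
    """
    out = list(series)
    for h in range(start, start + duration):
        out[h] *= (1.0 - impact)
    return out
```

Sweeping `start` (quiet vs. busy hours), `duration`, and `impact` over a real usage trace is how a large bank of labeled failures can be generated for measuring detection.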
Parameters and metrics
Parameters:
• 11,000 failures generated.
• 100 ZIP codes, 10 cities.
• Two popular device types.
• LTE/Voice.
• Duration: 1, 2, 3, 6, 12 hours.
• Quiet and busy hours.
• Impact degree: 0-55%.
Metrics:
• Detection rate = detected events / introduced events.
• Loss ratio = usage lost until detection / normal usage.
Example failure scenarios:
• All Android devices in Los Angeles fail.
• All iPhone 5 devices in Downtown Los Angeles fail.
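The two metrics translate directly to code. A sketch only: matching detected events to introduced events by an event id is an assumed simplification.

```python
def detection_rate(detected_ids, introduced_ids):
    """Detected events / introduced events (fraction in [0, 1])."""
    return len(set(detected_ids) & set(introduced_ids)) / len(introduced_ids)

def loss_ratio(usage_lost_until_detected, normal_usage):
    """Usage lost before the failure is detected, relative to normal usage."""
    return usage_lost_until_detected / normal_usage
```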
Overall detection rate
• Of the 11,000 introduced failures:
• ABSENCE detected >96% of failures with more than 15% impact.
• ABSENCE tends to miss events with <10% impact.
[Figure: detection rate (percent) vs. failure impact (percent), in 5% buckets from 0-5% to 45-50%, for all failures]
Loss ratio of detected failures
• Of all detected failures, ~97% are detected while <10% of usage is lost (during busy hours).
[Figure: CDF of loss ratio (percent) for failures introduced during busy hours]
Outline
• Motivation.
• ABSENCE overview.
• Is ABSENCE feasible?
• ABSENCE’s challenges.
• ABSENCE’s event detection.
• Synthetic workload evaluation.
• Operational validation.
Evaluate against known silent failures from the operator
• 19 silent failure events that were not known to the network operator when they happened.
• ABSENCE detected 19/19: a 100% true positive rate.
Alarm rate and true positive rate
• Use the 19 known events from the operator.
• Alarm rate (m): the average number of alarms per day that an operations team needs to handle.
• Cut-off threshold (n): filters out less impactful events.
• Increasing the cut-off threshold reduces the alarm rate while maintaining ABSENCE’s true positive rate: 100% true positives at m alarms per day, where m is small enough.
[Figure: true positive rate (%) and alarm rate (alarms/day) vs. cut-off threshold (n to 12n); the true positive rate stays at 100% while the alarm rate drops as the threshold increases]
ABSENCE’s alarm rate is practical for operations!
Conclusions
• The absence of customer usage is a reliable indicator of service disruptions in a mobile network.
• Appropriately grouping users yields predictable aggregate usage and high-fidelity anomaly detection.
• Validated with both synthetic workloads and operational data.
• Practical in an operational environment.