Anomaly Detection and Troubleshooting of Large Scale Systems from Event Logs
Presented by Niloy Ganguly
Bivas Mitra, Subhendu Khatuya
Also in collaboration with NetApp
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
Prerequisite
Dataset
Objective
Challenges
Model Development
Anomaly detection framework
Building an automated troubleshooter
Results
Prerequisite
EMS: Event Message System
• EMS provides a built-in logging facility that records all customer activity on the storage appliance.
• The system writes out event indication descriptions using a generic text-based log format.
ONTAP Components
[Figure: ONTAP component stack. Labels: Clients, Network Stack, Protocols, WAFL, RAID, Storage, NVRAM, Disks, Node/Data ONTAP, HA (CFO/SFO), HA Partner, HA Interconnect]
Prerequisite
Case:
Case Filed
"cannot find errors with environment/storage commands but getting messages say to replace the module"
Snapshot of a BURT
Post Case Info
Customer-Support Engg. Communication
Dataset
• Daily Event Message System (EMS) log
• Customer support database
  • The customer support portal provides the platform to report cases and failures and to communicate with support engineers
• Bug database
  • Internally oriented
  • Each case is associated with a bug
Dataset
• Daily Event message system (EMS) log
[Figure: modules (Module 1 to Module 4) and their EMS logs]
Dataset: A Typical EMS Log
| Field | Log Entry Example | Description |
|---|---|---|
| Event Time | Apr 01 2014 09:11:12 | Day, date, timestamp |
| System name | cc-nas1 | Name of the node in the cluster that generated the event |
| Event Message | kern.uptime.filer | Contains subsystem name and event type |
| Severity | info | Severity of the event |
Raw EMS Data
Extracted Information
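The extraction step can be illustrated with a minimal Python sketch. It assumes the four fields from the table above are already available as strings, and that the subsystem is the first dot-separated token of the event message (the slides only say the message "contains subsystem name and event type"). The `EmsEvent` class and `make_event` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EmsEvent:
    """One extracted EMS record (hypothetical container, not a NetApp type)."""
    time: datetime
    system: str
    subsystem: str
    event_type: str
    severity: str


def make_event(event_time, system_name, event_message, severity):
    # Assumption: the subsystem is the token before the first dot,
    # and the remainder of the message names the event type.
    subsystem, _, event_type = event_message.partition(".")
    return EmsEvent(
        time=datetime.strptime(event_time, "%b %d %Y %H:%M:%S"),
        system=system_name,
        subsystem=subsystem,
        event_type=event_type,
        severity=severity.lower(),
    )


event = make_event("Apr 01 2014 09:11:12", "cc-nas1", "kern.uptime.filer", "info")
```

Grouping such records by `subsystem` gives the per-module view that the later feature extraction and graph construction rely on.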
Data filtering
Select bugs with a sufficient number of cases
Select bugs with high priority levels
Eliminate cases with missing data
Final EMS Dataset
| Dataset info | Number |
|---|---|
| Total no. of bugs | 48 |
| Total no. of cases | 4827 |
| No. of customers | 2691 |
| No. of unique systems | 4305 |
| No. of modules | 331 |
| Types of messages | ~8k |
| Timeline | January 2011 to June 2016 |

Case Filed Date
For each filed case, we collected around 18 weeks of logs before the case-filed date and 1 week of logs after it.
Apr 01 09:11:12 INFO kern_uptime_filer_1…
Raw EMS Data
Extracted Information
How to resolve?
Resolution period:
Suppose a customer files a case at time $T_o$ and it is resolved at time $T_c$.
Resolution period = $T_c - T_o$
The support engineers use predefined rules to resolve the problem.
Motivation
Reliable and fast customer support service is a prerequisite in the storage industry
For some complaints, the resolution period is very high.
Resolution period pretty high
50%
(CLUSTER NETWORK DEGRADED) ERROR
Objective 1 (Anomaly detection)
• Leverage the event logs generated by the subsystems/modules
• Develop an anomaly detection framework
[Figure labels: Event log, Days, Anomaly Detector, Failure]
ADELE: Anomaly Detection from Event log Empiricism, accepted in INFOCOM’18
Objective 2 (Troubleshooting)
• Building a troubleshooter which can localize faulty components within a very short time.
• Providing a ranked list of modules to the support engineers
• Reducing the complexity of the diagnostic process
GBTM: Graph Based Troubleshooting Method for Handling Customer Cases Using Storage System Log, accepted in PAKDD’18
Challenges (Anomaly detection)
• Detecting abnormality from logs is challenging in a noisy environment
  • where the log is cluttered with messages from system misconfiguration
• Do event log messages carry signals of anomaly?
• Do the anomaly signals eventually lead to failure?
• File-system fragmentation may cause performance slowdown
• How many false alerts?
Challenges (Troubleshooting)
• Most real systems are complex because their constituent components exhibit functional dependencies
• Each component has its own failure modes. For example, a storage system failure can be caused by disks, physical interconnects, shelves, RAID controllers, etc.
• It is extremely hard for a support engineer to keep domain knowledge up to date in this evolving system.
• In such a large, evolving, complex system, prior knowledge of the dependency tree between modules is not available.
Model development: Attribute Extraction

| Attribute | Description |
|---|---|
| Event Count | Total number of events generated by the subsystem |
| Event Ratio | Ratio of the number of events generated by the subsystem to the total number of messages |
| Mean Inter-arrival Time | Mean time between successive events of the particular subsystem |
| Mean Inter-arrival Distance | Mean number of other messages between successive events of the particular subsystem |
| Severity Spread | Eight features corresponding to event counts of each severity type for the subsystem |
| Time-interval Spread | Six features denoting event counts during six four-hour intervals of the day for the subsystem |
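As a rough illustration of the attribute table, the sketch below computes the 18 per-subsystem features for one day of events. The input layout (tuples of seconds since midnight, subsystem, severity) and the eight severity names are assumptions made for this example; the slides do not list the severity types.

```python
from collections import Counter

# Assumption: eight syslog-style severity names; the talk does not enumerate them.
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "info", "debug"]


def extract_features(events, subsystem):
    """events: one day of (seconds_since_midnight, subsystem, severity), time-sorted."""
    positions = [i for i, ev in enumerate(events) if ev[1] == subsystem]
    times = [events[i][0] for i in positions]

    event_count = len(positions)
    event_ratio = event_count / len(events) if events else 0.0

    # Time gaps and "number of other messages in between" for successive events.
    time_gaps = [b - a for a, b in zip(times, times[1:])]
    dist_gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    mean_iat = sum(time_gaps) / len(time_gaps) if time_gaps else 0.0
    mean_iad = sum(dist_gaps) / len(dist_gaps) if dist_gaps else 0.0

    severity_counts = Counter(events[i][2] for i in positions)
    # Six four-hour slots: 00-04, 04-08, ..., 20-24.
    slot_counts = Counter(min(int(t) // (4 * 3600), 5) for t in times)

    return ([event_count, event_ratio, mean_iat, mean_iad]
            + [severity_counts[s] for s in SEVERITIES]
            + [slot_counts[k] for k in range(6)])


day = [(0, "kern", "info"), (3600, "wafl", "error"), (7200, "kern", "info"),
       (50000, "raid", "warning"), (50400, "kern", "error")]
kern_features = extract_features(day, "kern")
```

Stacking one such vector per module gives a modules x 18 matrix for the day.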
Observation 1: Periodicity
Weekly periodicity can be observed for attributes from the event log
[Figure: number of messages generated by the API module]
Planned maintenance, scheduled backups, workload intensity changes
Anomaly Clues
• If one or more subsystems are going through an anomalous phase,
  • it is reflected in some attributes of the logs generated for those subsystems
Model development: Overview
Extract 18 features from EMS log, for each module
Log transformation
Anomaly score
Model development: Log Transformation
▪ The EMS log of each day is abstracted into a matrix ($X_d$)
• We fit a normal distribution to the features of the last few weeks
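A minimal sketch of this idea follows: fit a normal distribution to the recent history of one feature and measure how far today's value sits from the centre of that distribution. The scoring rule used here (twice the distance of the CDF from 0.5) is an assumption chosen for illustration, not the score defined in the talk, and `deviation_score` and `score_matrix` are hypothetical helper names.

```python
from statistics import NormalDist, mean, pstdev


def deviation_score(history, today):
    """Score in [0, 1]: 0 at the historical mean, close to 1 far in either tail."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        # Degenerate history: any change from the constant value is maximal.
        return 0.0 if today == mu else 1.0
    cdf = NormalDist(mu, sigma).cdf(today)
    return 2 * abs(cdf - 0.5)  # assumed scoring rule, see lead-in


def score_matrix(history_matrices, today_matrix):
    """Apply deviation_score cell by cell to modules x features matrices."""
    return [
        [deviation_score([past[i][j] for past in history_matrices], value)
         for j, value in enumerate(row)]
        for i, row in enumerate(today_matrix)
    ]


history = [10, 12, 11, 9, 10, 11, 12, 10]
typical = deviation_score(history, 10.6)
unusual = deviation_score(history, 30)
```

Because the fit uses only the last few weeks, the baseline follows slow drifts in the workload rather than a fixed global average.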
Model development: Score Matrix
▪ EMS log of each day is abstracted into a matrix (Xd)
▪ We transform the raw matrix (Xd) of dth day into score matrix (St) as follows
Model development: Anomaly Detection
S(i,j) contributes differently to the overall anomaly of the system
[Figure: event log of a day → score matrix → ridge regression (weight matrix W) → anomaly score; above threshold: anomaly, below threshold: no anomaly]
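The weighting step can be sketched as a small ridge regression. The version below is an illustrative stand-in under stated assumptions: each day's score matrix is flattened into a vector, labels are 0 (normal) or 1 (anomalous), the weights are fitted by plain gradient descent, and the 0.5 threshold and toy data are hypothetical. It is not the exact training setup of ADELE.

```python
def flatten(matrix):
    """Turn a modules x features score matrix into one feature vector."""
    return [value for row in matrix for value in row]


def fit_ridge(samples, labels, lam=0.1, lr=0.1, epochs=2000):
    """Minimise (1/2n) * sum((w.x - y)^2) + (lam/2) * ||w||^2 by gradient descent."""
    n, d = len(samples), len(samples[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the L2 penalty
        for x, y in zip(samples, labels):
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def anomaly_score(weights, score_matrix):
    """Weighted sum of the score-matrix entries: sum over W(i,j) * S(i,j)."""
    return sum(w * s for w, s in zip(weights, flatten(score_matrix)))


# Toy data: one module, two features per day; label 1 marks an anomalous day.
train_days = [[[0.1, 0.2]], [[0.9, 0.1]], [[0.2, 0.8]], [[0.95, 0.3]]]
train_labels = [0, 1, 0, 1]
weights = fit_ridge([flatten(day) for day in train_days], train_labels)

THRESHOLD = 0.5  # hypothetical cut-off
is_anomaly = anomaly_score(weights, [[0.9, 0.2]]) > THRESHOLD
```

In the toy data only the first feature tracks the label, so the fitted weight on it dominates, which is the sense in which each S(i,j) contributes differently.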
True positive vs. false positive
High anomaly detection rate with low false alerts
[Figure panels: Step label, Ramp label]
Comparison with Baseline
Graph Construction
Vertex: Each module is a vertex; we consider all 331 possible modules.
Edge: Edges are decided by timestamp difference. If the timestamp difference between events of two modules is less than 300 seconds, a directed edge is formed between them.
Edge weight: The edge weight is as follows, where k is the number of occurrences of the edge and ti is the timestamp difference.
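The edge rule can be sketched as follows. Two points are assumptions of this example: edges are directed from the module of the earlier event to the module of the later one, and self-loops are skipped. The sketch only collects the quantities named above (the occurrence count k and the timestamp differences); it does not compute the edge weight itself.

```python
from collections import defaultdict

WINDOW_SECONDS = 300


def build_edges(events, window=WINDOW_SECONDS):
    """events: iterable of (timestamp_in_seconds, module).

    Returns {(src_module, dst_module): [timestamp differences]}, so the
    occurrence count k of an edge is the length of its list.
    """
    events = sorted(events)
    gaps = defaultdict(list)
    for i, (t_a, mod_a) in enumerate(events):
        for t_b, mod_b in events[i + 1:]:
            dt = t_b - t_a
            if dt >= window:
                break  # events are time-sorted, so later ones are even further away
            if mod_a != mod_b:
                gaps[(mod_a, mod_b)].append(dt)
    return dict(gaps)


edges = build_edges([(0, "kern"), (100, "wafl"), (250, "raid"), (900, "disk")])
```

Here `disk` stays isolated because its only event is more than 300 seconds after every other event.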
Sample Example
Case Filed Date
For each case, we collect 18 weeks of data and construct one graph per week, which gives 18 graphs per case. We assume the last two graphs arise from an anomalous state of the system.
Graph Encoding
Vertex encoding (vbits):
▪ $\log_2 v$ bits to encode the number of vertices $v$ in the graph
▪ $v \log_2 u$ bits to encode the labels of all $v$ vertices, where $u$ is the total number of unique labels
$vbits = \log_2 v + v \log_2 u$
Edge encoding (ebits):
$ebits = e\,(1 + \log_2 u) + K \log_2 m + \log_2 m$
where $e$ is the total number of edges, $K$ is the total number of 1s in the adjacency matrix, and $m = \max e(i,j)$
Row encoding (rbits):
$rbits = (v + 1) \log_2 (b + 1) + \sum_{i=1}^{v} \log_2 \binom{v}{k_i}$
where $k_i$ is the number of 1s in row $i$ of the adjacency matrix and $b = \max_i k_i$
Encoding example
Vertices: kern, wafl, disk, cmds, raid, cifs
Edge labels: kern_cmds, wafl_raid, disk_cifs, wafl_disk, kern_wafl
No. of vertices: $v = 6$; unique labels: $u = 11$; $e = 5$; $K = 5$; $m = 1$
$vbits = \log_2 6 + 6 \log_2 11 \approx 23.34$ bits
$rbits \approx 21.49$ bits
$ebits = e\,(1 + \log_2 u) + K \log_2 m + \log_2 m = 5\,(1 + \log_2 11) + 5 \log_2 1 + \log_2 1 \approx 22.30$ bits
Total bits $\approx 67.13$ bits
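The example can be checked with a short Python sketch of the three encoding terms. It assumes each edge label such as `kern_cmds` denotes a directed edge from `kern` to `cmds`, and it uses the Subdue-style row term $(v+1)\log_2(b+1) + \sum_i \log_2 \binom{v}{k_i}$, which is the form that reproduces the 21.49-bit row figure of the example. The function name `description_length` is hypothetical.

```python
from math import comb, log2


def description_length(vertices, edges, num_unique_labels):
    """Return (vbits, rbits, ebits) for a directed, labelled graph."""
    v = len(vertices)
    index = {name: i for i, name in enumerate(vertices)}

    # adj[i][j] counts the edges from vertex i to vertex j.
    adj = [[0] * v for _ in range(v)]
    for src, dst in edges:
        adj[index[src]][index[dst]] += 1

    e = len(edges)
    ones_per_row = [sum(1 for cell in row if cell) for row in adj]  # k_i
    K = sum(ones_per_row)                       # total number of 1s
    m = max(1, max(max(row) for row in adj))    # max edges between any pair
    b = max(ones_per_row)                       # max number of 1s in a row

    vbits = log2(v) + v * log2(num_unique_labels)
    rbits = (v + 1) * log2(b + 1) + sum(log2(comb(v, k)) for k in ones_per_row)
    ebits = e * (1 + log2(num_unique_labels)) + K * log2(m) + log2(m)
    return vbits, rbits, ebits


vertices = ["kern", "wafl", "disk", "cmds", "raid", "cifs"]
edges = [("kern", "cmds"), ("wafl", "raid"), ("disk", "cifs"),
         ("wafl", "disk"), ("kern", "wafl")]
vbits, rbits, ebits = description_length(vertices, edges, num_unique_labels=11)
total_bits = vbits + rbits + ebits
```

With $m = 1$ both $\log_2 m$ terms vanish, so the edge cost is driven entirely by the number of edges and the label alphabet size.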
Step 1: Finding Abnormal Substructure (PCCS)
Subgraph:
A substructure is a connected subgraph of the overall graph.
Best Substructure:
We consider the best substructure to be the one that minimizes the following value:
$DL(G \mid S) + DL(S)$
where $G$ is the entire graph, $S$ is the substructure, $DL(G \mid S)$ is the description length of $G$ after compressing it using $S$, and $DL(S)$ is the description length of the substructure
Intuition:
An anomalous substructure occurs very infrequently.
Abnormal Substructure finding steps
▪ First, we compute an anomaly score from the transformation cost (insertion and deletion of vertices and edges) needed to match the entity with the best substructure.
▪ We then shortlist only those abnormal substructures whose anomaly score exceeds a certain threshold (0.95).
▪ The problem creating candidate set (PCCS) is then the union of the modules present in the shortlisted anomalous substructures.
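The shortlisting and union steps can be sketched in a few lines of Python. The sketch assumes the anomaly score of every candidate substructure has already been computed elsewhere; the input layout and the `build_pccs` name are hypothetical.

```python
ANOMALY_THRESHOLD = 0.95


def build_pccs(substructures, threshold=ANOMALY_THRESHOLD):
    """substructures: iterable of (modules_in_substructure, anomaly_score).

    Returns the union of modules over all substructures whose score
    exceeds the threshold.
    """
    pccs = set()
    for modules, score in substructures:
        if score > threshold:
            pccs |= set(modules)
    return pccs


candidates = [
    ({"kern", "wafl"}, 0.97),   # shortlisted
    ({"raid", "disk"}, 0.60),   # below threshold, ignored
    ({"wafl", "cifs"}, 0.99),   # shortlisted
]
pccs = build_pccs(candidates)
```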
Step 2: Community Detection
Intuition: If one module of a community fails, other modules in the group might be affected due to dependencies between modules
• We choose the Louvain community detection algorithm
Step 3: Set Expansion
• We calculate a normalized overlapping index between the PCCS and each community
• If the overlapping index exceeds some threshold (0.75) for a particular cluster, we expand the PCCS by incorporating the modules of that specific cluster
Normal Period :: NEPCS
Abnormal Period :: AEPCS
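A minimal sketch of the expansion step is given below. The communities are taken as given (for example, the output of a Louvain run), and the normalization of the overlapping index is an assumption: here it is the fraction of a community's modules that already appear in the PCCS. Both function names are hypothetical.

```python
OVERLAP_THRESHOLD = 0.75


def overlap_index(pccs, community):
    # Assumed normalization: share of the community already covered by the PCCS.
    return len(pccs & community) / len(community) if community else 0.0


def expand_pccs(pccs, communities, threshold=OVERLAP_THRESHOLD):
    """Add every module of a community whose overlap with the PCCS is high enough."""
    expanded = set(pccs)
    for community in communities:
        if overlap_index(pccs, community) > threshold:
            expanded |= community
    return expanded


pccs = {"kern", "wafl", "raid", "disk"}
communities = [
    {"kern", "wafl", "raid", "disk", "nvram"},  # overlap 4/5 = 0.8, absorbed
    {"cifs", "nfs", "kern"},                    # overlap 1/3, ignored
]
expanded = expand_pccs(pccs, communities)
```

Running the same expansion on graphs from the normal weeks and from the abnormal weeks would give the two expanded sets that the slide labels NEPCS and AEPCS.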
Final PCS Construction
• For a case, suppose a module appears $n_1$ times in the abnormal set AEPCS out of a total of $n_{abn}$ abnormal samples, and $n_2$ times in NEPCS out of a total of $n_{norm}$ normal samples.
Then the causality score (CS) of the module is as follows
Normal Period
Abn. Period
The top-ranked modules are considered the final problem creating set (PCS)
An Example
Validation
• Direct (ground truth available)
  • Support engineers extracted the trouble-creating modules from domain knowledge and conversations with the customer for only 20.50% of cases, where evaluation is straightforward
• Indirect
  • Similar cases will have approximately similar problem-creating module sets.
Grouping Similar Cases (Sym-Text Based)
[Figure: symptom text of cases C1, C2, C3, …, Cn → cosine similarity → above threshold (0.80)? Y: similar; N: not similar]
Similar Cases (EMS-Log Based)
[Figure: EMS messages of cases C1, C2, C3, …, Cn, for example (kern_uptime_filer_1, unowned_disk_reminder, callhome_performance_data) and (kern_uptime_filer_1, ems_engine_suppressed, cifs_op_subop_unsupported) → above threshold (0.65)? Y: similar; N: not similar]
Cases that are similar under both groupings are taken as the final similar case set
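The two-stage grouping can be sketched as below. Symptom texts are compared with cosine similarity over word counts, as on the previous slide. The similarity measure for EMS message sets is not stated in the talk, so Jaccard similarity is used here purely as a stand-in. The case layout (a dict with `symptom` and `messages`) and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def jaccard(set_a, set_b):
    # Stand-in measure for EMS message sets (assumption, see lead-in).
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similar_cases(query, others, text_th=0.80, log_th=0.65):
    """Keep only cases that pass both the symptom-text and the EMS-log test."""
    by_text = {cid for cid, case in others.items()
               if cosine_similarity(query["symptom"], case["symptom"]) > text_th}
    by_log = {cid for cid, case in others.items()
              if jaccard(query["messages"], case["messages"]) > log_th}
    return by_text & by_log


msgs = {"kern_uptime_filer_1", "unowned_disk_reminder", "callhome_performance_data"}
query = {"symptom": "cluster network degraded error", "messages": msgs}
others = {
    "C1": {"symptom": "cluster network degraded error", "messages": set(msgs)},
    "C2": {"symptom": "replace the disk module", "messages": set(msgs)},
    "C3": {"symptom": "cluster network degraded error",
           "messages": {"kern_uptime_filer_1", "ems_engine_suppressed",
                        "cifs_op_subop_unsupported"}},
}
final_similar = similar_cases(query, others)
```

Taking the intersection keeps the similar case set conservative: a case must look alike both in how the customer described it and in what the system logged.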
Overlapping Score (Indirect Validation)
Average Overlap: 0.807
The PCSs of similar cases are ~80% similar
Indirect validation
Mathematically, for two arbitrary sets S1 and S2:
Overlapping score (S1, S2) =
False Positive Rate
Average FPR: 9.15%
Intuitively, the problem-causing modules should appear only in the abnormal state. If a module appears in both the NEPCS and AEPCS sets, we treat that module as a false positive.
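The false-positive rule translates into a small set computation. The denominator is an assumption of this sketch: the rate is taken as the share of AEPCS modules that also appear in NEPCS, which the talk does not spell out. The function name is hypothetical.

```python
def false_positive_rate(aepcs, nepcs):
    """Share of abnormal-period modules that also show up in the normal period."""
    if not aepcs:
        return 0.0
    false_positives = aepcs & nepcs  # modules present in both sets
    return len(false_positives) / len(aepcs)


aepcs = {"kern", "wafl", "raid", "disk"}
nepcs = {"kern", "cifs"}
fpr = false_positive_rate(aepcs, nepcs)
```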
Comparison with Baseline
Ranking Modules
We provide a ranked list of modules to the support engineers, which can significantly narrow down the troubleshooting process for around 95% of cases
Conclusion
▪ Logs are challenging to analyze manually because they are noisy
▪ In large-scale systems, constituent system components exhibit functional dependencies.
▪ We proposed ADELE, a machine learning model that detects anomalies with a high detection rate and a low false alert rate.
▪ We proposed GBTM, a troubleshooting tool that abstracts the raw log as a graph structure and infers a probable set of malfunctioning modules with the help of community structure.
Thank you!
Follow the work of Complex Network Research Group (CNeRG), IIT KGP at:
Web: http://www.cnergres.iitkgp.ac.in/
Facebook: https://web.facebook.com/iitkgpcnerg
Twitter: https://www.twitter.com/cnerg