ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009
© 2008 IBM Corporation
Learning, Indexing and Diagnosing Network Faults
Ting Wang†, Mudhakar Srivatsa‡,
Dakshi Agrawal‡ and Ling Liu†
Georgia Institute of Technology†
IBM T.J. Watson Research Center‡
Complex Networks
Network as a graph
– Vertices represent network entities
– Edges represent pair-wise (local) interactions between network entities
Even simple interactions give rise to complex global network phenomena
– Fault cascading in communication networks
– Information spread (e.g., via emails) in social networks
– Infection propagation in protein interaction networks
Key challenge: detecting and understanding emerging global phenomena
Network Monitoring Data
Networks generate massive volumes of monitoring data (a.k.a. events)
– Monitored data consists of observations that are local in both space and time
– Monitored data is incomplete and sometimes even erroneous (e.g., imprecise, or out of order with respect to both time and causality)
Examples
– Ping failure, interface down, high CPU utilization, etc. in communication networks
– Email threads (timestamp, tokenized subject, MIME type, etc.) between members of an organizational hierarchy
– Pathological symptoms in biological networks, e.g., protein interaction networks (PINs)
Key observation: monitoring data gathered from network entities are correlated through the network topology
Network Patterns
Network patterns attempt to efficiently capture spatial (topological) and temporal correlations in monitored data
Key challenges
– Understand the semantics of network patterns
– Identify domain-specific network patterns (e.g., fault diagnosis and prediction in IT systems, information spread and access control in social networks, disease propagation in protein networks)
– How to learn and represent network patterns?
– How to scalably match network patterns against an online stream of network events?
Simplified Examples
Each example is a sequence of three related events e1, e2, e3 involving entities N1 and N2:
– Communication network (iBGP server; OSPF networks N1 and N2): e1: update configuration, withdraw prefix announcement; e2: N1 says N2 is not reachable; e3: N2 says N1 is not reachable
– Organization (Director D; employees N1 and N2): e1: meeting with D and N1; e2: email from N1 to N2; e3: N2 updates project design document
– Social network (person P; friends N1 and N2): e1: P updates a blog on her Facebook page; e2: N1 sends a friend request to N2; e3: N2 views P’s updates and accepts N1’s friend request
Network Patterns
Notation and Formalism
– Event data: <nodeId, type, timestamp, monitorId>
– Network Pattern: <event types, spatial pattern, temporal pattern>
– E.g., INTERFACE DOWN → <LINK DOWN, NEIGHBOR, TIME WINDOW>
Temporal Pattern
– E.g.: Markov chains, frequent item sets
Spatial Pattern: compositions/closures of one or more topological relationships
– Communication networks: upstream, downstream, neighbor, tunnel
– Social networks: manages, friends, team members, IM buddies
– Biological networks: catalyst, inhibitor, suppressor
Figure: Temporal Pattern: Markov Chain (events e1, e2, e3; transitions t11, t12, t13, t22, t23, t33); Temporal Pattern: Frequent Item Sets; Spatial Pattern: Downstream (transitive closure)
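The event and pattern tuples above can be sketched as Python data classes. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the spatial pattern is modeled as a predicate `spatial(x, y)`, the temporal pattern as a plain time window, and the INTERFACE DOWN example is read as "LINK DOWN observed on a neighbor within the window".

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable


@dataclass(frozen=True)
class Event:
    """Event data <nodeId, type, timestamp, monitorId>."""
    node_id: str
    event_type: str
    timestamp: float
    monitor_id: str


@dataclass(frozen=True)
class NetworkPattern:
    """Network pattern <event types, spatial pattern, temporal pattern>."""
    event_types: FrozenSet[str]          # event types that must be observed
    spatial: Callable[[str, str], bool]  # spatial(x, y): is y related to x?
    window: float                        # temporal pattern: time window (seconds)


def matches(pattern: NetworkPattern, node: str, t_ref: float,
            events: Iterable[Event]) -> bool:
    """True if every event type in the pattern is observed at some node that is
    spatially related to `node`, within `window` seconds of `t_ref`."""
    seen = set()
    for e in events:
        if (e.event_type in pattern.event_types
                and abs(e.timestamp - t_ref) <= pattern.window
                and pattern.spatial(node, e.node_id)):
            seen.add(e.event_type)
    return seen == pattern.event_types


# Hypothetical two-router topology: r1 and r2 are neighbors.
LINKS = {("r1", "r2"), ("r2", "r1")}


def neighbor(x: str, y: str) -> bool:
    return (x, y) in LINKS


interface_down = NetworkPattern(frozenset({"LINK DOWN"}), neighbor, 60.0)
events = [Event("r2", "LINK DOWN", 130.0, "m2")]
print(matches(interface_down, "r1", 100.0, events))  # True
```

Keeping the spatial part as a predicate lets the same matcher take any relation (upstream, tunnel, a composition) without changing the pattern type.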
Fault Diagnosis and Prediction in Communication Networks
Challenges: improve the scalability and expressiveness of fault diagnosis
– Limitation of current solutions: complexity grows quadratically with network size
– Correlation rules are pair-wise: expensive to support complex fault diagnosis (e.g., predicting soft failures, or router failures from VRF tunnel events)
– Current solutions lack predictive capability
Approach:
– Fault signatures encode temporal patterns (frequent item sets, Markov chains) and topological patterns spanning the network (upstream, downstream, neighbors, VPN tunnels, etc.)
– Topologically index streaming monitoring data to enable scalable single-pass event correlation and fault diagnosis
– Results in linear complexity, and hence increased scalability
Traditional RCA Engine vs. Proposed Approach
– Inputs: monitoring data (Omnibus) and topology
– Traditional: Correlation Engine (ITNM RCA) applies pair-wise correlation rules
– Proposed: fault signatures (network patterns) matched through a topological index yield the fault diagnosis
– Complexity (traditional): Monitoring data × Monitoring data × Rules
– Complexity (proposed): Monitoring data × Network diameter × Signatures
– Monitoring data ~ linear in network size; network diameter ~ logarithmic in network size for power-law networks
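A worked reading of the two complexity expressions, offered as a sketch: it assumes M is the set of monitoring events, R the pair-wise rules, S the fault signatures, D the network diameter and n the network size, and treats the slide's "~" as asymptotic bounds (an interpretation, not a statement from the authors).

```latex
\begin{aligned}
C_{\mathrm{pairwise}}  &= O\bigl(|M| \cdot |M| \cdot |R|\bigr) = O\bigl(n^{2}\,|R|\bigr),\\
C_{\mathrm{signature}} &= O\bigl(|M| \cdot D \cdot |S|\bigr) = O\bigl(n \log n\,|S|\bigr),
\qquad \text{since } |M| = O(n),\; D = O(\log n).
\end{aligned}
```

Under these assumptions the signature-based approach is linear in network size up to the logarithmic diameter factor, versus quadratic for pair-wise rules.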
Step 1: Learning Network Faults
Learn fault signatures from historical network event data
– Fault Synopsis: Fault Type → Network Pattern
– Fault Signature: Network Pattern → <Fault Type, Spatial Pattern to Localize Faulty Node>
– Fault Diagnosis: <Spatial Pattern to Localize Faulty Node, Network Topology> → Faulty Node
– Fault Prediction: use incrementally matchable network patterns
Use indexable network patterns
– Topological relationships are invertible: neighbor⁻¹ = neighbor, downstream⁻¹ = upstream

Fault Synopsis
Fault Type        up-stream   down-stream   neighbor   …
f1                c1          c2            c3         …
f2                c2          c4            c1         …

Fault Signature
Network Pattern   up-stream   down-stream   neighbor   …
c1                -           f1, p1        f2, p2     …
c2                f1, p1      f2, p2        -          …
c3                -           -             f1, p1     …
c4                f2, p2                               …
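The two tables are related by the invertibility of topological relationships: if fault f at node v shows pattern c at R(v), then seeing c at node u points to f at R⁻¹(u). A minimal sketch of that inversion follows; it assumes p1, p2 are per-fault probabilities supplied separately, treats the tunnel relation as symmetric, and uses hypothetical function names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Inverses from the slide (neighbor^-1 = neighbor, downstream^-1 = upstream);
# treating "tunnel" as symmetric is an assumption.
INVERSE = {"upstream": "downstream", "downstream": "upstream",
           "neighbor": "neighbor", "tunnel": "tunnel"}


def synopsis_to_signature(
    synopsis: Dict[str, Dict[str, str]],
    prob: Dict[str, float],
) -> Dict[str, Dict[str, List[Tuple[str, float]]]]:
    """Invert fault type -> {relation: pattern} into
    pattern -> {inverse relation: [(fault type, probability)]}."""
    signature = defaultdict(lambda: defaultdict(list))
    for fault, by_relation in synopsis.items():
        for relation, pattern in by_relation.items():
            signature[pattern][INVERSE[relation]].append((fault, prob[fault]))
    # Convert nested defaultdicts to plain dicts.
    return {c: dict(rels) for c, rels in signature.items()}


# The Fault Synopsis table above; the probability values are made up.
synopsis = {"f1": {"upstream": "c1", "downstream": "c2", "neighbor": "c3"},
            "f2": {"upstream": "c2", "downstream": "c4", "neighbor": "c1"}}
sig = synopsis_to_signature(synopsis, {"f1": 0.7, "f2": 0.3})
print(sig["c1"])  # {'downstream': [('f1', 0.7)], 'neighbor': [('f2', 0.3)]}
```

Running it on the synopsis table reproduces the signature table row by row, including the c4 → up-stream → f2 entry.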
Step 2: Online Matching
Fault localization using topological indices and hierarchical evidence aggregation
– Topology indexing algorithms + space-time trade-off in computing R(x) and R⁻¹(x), where R ∈ {upstream, downstream, neighbor, tunnel, …}
– Scalable hierarchical evidence aggregation for efficient fault diagnosis

Network Pattern   up-stream     down-stream   neighbor      VPN Tunnel
c1                Device Down   -             f1            -
c2                -             f2            -             Device Down
c3                -             -             Device Down   -
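To illustrate the space-time trade-off in computing R(x) and R⁻¹(x), here is a minimal sketch for the downstream relation, taken as the transitive closure of hypothetical parent → child links. Answering each query by breadth-first search costs time per lookup; precomputing every closure and its inverse (upstream) costs space but turns each lookup into a dictionary access. The names `closure_on_demand` and `TopologyIndex` are illustrative assumptions, not the paper's indexing algorithms.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def closure_on_demand(adj: Dict[str, Set[str]], x: str) -> Set[str]:
    """Time-heavy option: compute R(x) by breadth-first search at query time."""
    seen: Set[str] = set()
    queue = deque(adj.get(x, ()))
    while queue:
        y = queue.popleft()
        if y not in seen:
            seen.add(y)
            queue.extend(adj.get(y, ()))
    return seen


class TopologyIndex:
    """Space-heavy option: precompute R(x) (downstream closure) and its
    inverse R^-1(x) (upstream closure) so each lookup is a dict access."""

    def __init__(self, links: Iterable[Tuple[str, str]]):
        self.down_adj: Dict[str, Set[str]] = defaultdict(set)
        nodes: Set[str] = set()
        for parent, child in links:
            self.down_adj[parent].add(child)
            nodes.update((parent, child))
        self.downstream = {n: closure_on_demand(self.down_adj, n) for n in nodes}
        # Invert the index: y is upstream of x iff x is downstream of y.
        self.upstream: Dict[str, Set[str]] = {n: set() for n in nodes}
        for n, below in self.downstream.items():
            for m in below:
                self.upstream[m].add(n)


idx = TopologyIndex([("core", "agg1"), ("agg1", "edge1"), ("agg1", "edge2")])
print(sorted(idx.downstream["core"]))  # ['agg1', 'edge1', 'edge2']
print(sorted(idx.upstream["edge2"]))   # ['agg1', 'core']
```

Precomputed closures can grow quadratically in the worst case, which is why a mix of the two options is a natural point on the trade-off curve.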
Figures: Evidence Aggregation (nodes n1, n2 with observed patterns c1, c2, c3); Scalable Hierarchical Evidence Aggregation (hierarchy of Bloom filters, bf, over faults f1 … fn)
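Evidence aggregation with the Step 2 table can be sketched as flat vote counting: each observation of pattern c at node v casts a vote for fault f at every node u ∈ R(v) listed under c. This is a simplified stand-in that assumes unweighted votes; it does not reproduce the hierarchical aggregation, and the function and topology names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set, Tuple

# pattern -> relation name -> fault types (cf. the Step 2 table)
Signature = Dict[str, Dict[str, List[str]]]


def aggregate_evidence(
    observations: Iterable[Tuple[str, str]],
    signature: Signature,
    relation: Callable[[str, str], Set[str]],
) -> Counter:
    """Count votes for (fault, node) pairs.

    Each observation (v, c) votes for fault f at every node u in R(v),
    for every (R, f) listed under pattern c in the signature.
    """
    votes: Counter = Counter()
    for v, c in observations:
        for rel_name, faults in signature.get(c, {}).items():
            for u in relation(rel_name, v):
                for f in faults:
                    votes[(f, u)] += 1
    return votes


# Hypothetical three-router chain n1 - n2 - n3.
NEIGHBORS = {"n1": {"n2"}, "n2": {"n1", "n3"}, "n3": {"n2"}}


def related(rel_name: str, v: str) -> Set[str]:
    # Only the neighbor relation is modeled in this toy topology.
    return NEIGHBORS.get(v, set()) if rel_name == "neighbor" else set()


signature: Signature = {"c3": {"neighbor": ["Device Down"]}}
votes = aggregate_evidence([("n1", "c3"), ("n3", "c3")], signature, related)
print(votes.most_common(1))  # [(('Device Down', 'n2'), 2)]
```

Both neighbors of n2 report c3, so the evidence converges on a single candidate, which is the localization effect the slide's aggregation aims for.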
Details
OFFLINE LEARNING
Event datasets → preparation of training data
– Interval filter: segment the event dataset into event bursts
– Support filter: eliminate high-frequency (regular network operations) and low-frequency (noise) burst sets
– Periodicity filter: eliminate burst sets with high periodicity (maintenance operations)
Extract temporal patterns
– Markov chains and maximum likelihood estimation
Extract topological patterns (input: network topology)
– Set of topological relationships: SE, NE, DS, US, TN
– Principle of minimum explanation
Output: fault signatures
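A minimal sketch of the training-data preparation and temporal-pattern extraction above, assuming events are (timestamp, type) pairs, a burst boundary is any gap larger than a fixed threshold, and the Markov chain is estimated from consecutive events within each burst. The gap, the support bounds, and the function names are hypothetical; the periodicity filter and topological-pattern extraction are omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Event = Tuple[float, str]  # hypothetical minimal record: (timestamp, event type)


def segment_bursts(events: List[Event], gap: float) -> List[List[Event]]:
    """Interval filter: a new burst starts whenever the time since the
    previous event exceeds `gap` seconds."""
    bursts: List[List[Event]] = []
    for ev in sorted(events):
        if bursts and ev[0] - bursts[-1][-1][0] <= gap:
            bursts[-1].append(ev)
        else:
            bursts.append([ev])
    return bursts


def support_filter(bursts: List[List[Event]], lo: int, hi: int) -> List[List[Event]]:
    """Support filter: keep bursts whose set of event types occurs between
    `lo` and `hi` times (too rare = noise, too common = regular operations)."""
    def key(burst: List[Event]) -> frozenset:
        return frozenset(t for _, t in burst)
    support = Counter(key(b) for b in bursts)
    return [b for b in bursts if lo <= support[key(b)] <= hi]


def markov_mle(bursts: List[List[Event]]) -> Dict[str, Dict[str, float]]:
    """Maximum likelihood transition probabilities
    P(a -> b) = count(a -> b) / count(a -> *), counted over consecutive
    events inside each burst."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for burst in bursts:
        types = [t for _, t in burst]
        for a, b in zip(types, types[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}


events = [(0, "e1"), (1, "e2"), (2, "e3"),
          (100, "e1"), (101, "e3"), (103, "e2"),
          (500, "x")]
bursts = segment_bursts(events, gap=10.0)
kept = support_filter(bursts, lo=2, hi=10)  # drops the one-off "x" burst
print(markov_mle(kept))
# {'e1': {'e2': 0.5, 'e3': 0.5}, 'e2': {'e3': 1.0}, 'e3': {'e2': 1.0}}
```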
ONLINE MATCHING (inputs: event stream, fault signatures, network topology)
Match temporal patterns
– Min-heap + incremental pattern matching
– Evidences: <f, v, Rv>
Indexed network topology
– Inverted index for constant-time lookup
– Space-time trade-offs
Scalable evidence aggregation
– BIRCH data structure (hierarchical aggregation)
– Optimizations: filter-and-refine (Bloom filter) + slotted aggregation (BIGTABLE)
Output: fault diagnosis and prediction
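The filter-and-refine optimization can be sketched with a plain Bloom filter in front of an exact store: a negative filter answer means the key is definitely absent, and a positive answer is confirmed by the exact lookup. This sketch assumes SHA-256-derived hash positions and string keys such as a hypothetical "fault@node"; the hierarchy of filters, slotted aggregation, and the BIGTABLE-backed storage are not reproduced.

```python
import hashlib
from typing import Dict, Iterator, Optional


class BloomFilter:
    """Bloom filter with k hash positions sliced from one SHA-256 digest."""

    def __init__(self, m: int = 1024, k: int = 3):
        assert 1 <= k <= 8, "a 32-byte digest yields at most 8 4-byte slices"
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, key: str) -> Iterator[int]:
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p] for p in self._positions(key))


class FilteredStore:
    """Filter-and-refine: a cheap Bloom check prunes keys that are certainly
    absent before the exact (more expensive) lookup."""

    def __init__(self, items: Dict[str, float]):
        self.exact = dict(items)
        self.bloom = BloomFilter()
        for key in items:
            self.bloom.add(key)

    def lookup(self, key: str) -> Optional[float]:
        if not self.bloom.might_contain(key):
            return None                  # filter step: definitely absent
        return self.exact.get(key)       # refine step: resolves false positives


store = FilteredStore({"f1@n2": 2.0})
print(store.lookup("f1@n2"))  # 2.0
print(store.lookup("f2@n9"))  # None
```

The refine step keeps answers exact; the filter only saves work, so a false positive costs one extra lookup rather than a wrong diagnosis.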
Fault Diagnosis & Prediction: Scalability
Result Summary:
SNMP Trap messages from a large enterprise (7 ASes, 32 IGP networks, 871 subnets, 1,268 VPN tunnels, 2,068 main nodes, 18,747 interfaces and 192,000 entities) over 14 days in 2007
Topology dataset – European backbone network (2,383 main nodes, spanning 7 countries, 11 ASes and over 100,000 entities)
Network fault simulator and monitoring data generation
Linear scalability; further optimizations: prune-and-search; slotted hierarchical aggregation
Ongoing activities
Integration with IBM Tivoli Network Management suite (ITNM) for live testing and fine-tuning
Network patterns for access control on information flows over: (i) ENRON email data and organization role topology; (ii) Smallblue data and social + information network topology
Summary
Network patterns encode spatial-temporal properties of various networks
– Ability to scalably mine and match network patterns is key for understanding global network phenomena
Case study on fault diagnosis and prediction in communication networks
– Complexity of solution has to be linear in network size
– Topologically indexed databases were a key tool for addressing scalability
Explore more complex network patterns for information, social and biological networks, which exhibit stronger coupling relationships than communication networks
– A failed router does not cause its neighboring router to fail
– A corrupt information node can corrupt its neighbor (e.g., summary node)
– A diseased enzyme can catalyze/inhibit its neighbors
Questions?
Mudhakar Srivatsa
msrivats@us.ibm.com