ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009

© 2008 IBM Corporation

Learning, Indexing and Diagnosing Network Faults

Ting Wang†, Mudhakar Srivatsa‡, Dakshi Agrawal‡ and Ling Liu†

Georgia Institute of Technology†

IBM T.J. Watson Research Center‡

ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009

© 2008 IBM Corporation

Complex Networks

Network as a graph

– Vertices represent network entities

– Edges represent pair-wise (local) interactions between network entities

Even simple interactions give rise to complex global network phenomena

– Fault cascading in communication networks

– Information spread (e.g., via emails) in social networks

– Infection propagation in protein interaction networks

Key challenge is to detect and understand emerging global phenomena

Network Monitoring Data

Networks generate massive monitoring data (aka events)

– Monitored data consists of local (in both space & time) observations on the network

– Monitored data is incomplete and sometimes even erroneous (e.g., imprecise, out-of-order with respect to both time and causality, etc.)

Examples

– Ping failure, interface down, high CPU utilization, etc. in communication networks

– Email threads (time stamp, tokenized subject, MIME type, etc.) between members in an organizational hierarchy

– Pathological symptoms in biological networks – protein interaction networks (PINs)

Key observation: monitoring data gathered from network entities are correlated through the network topology

Network Patterns

Network patterns attempt to efficiently capture spatial (topological) and temporal correlations in monitored data

Key challenges

– Understand the semantics of network patterns

– Identify domain-specific network patterns (e.g., fault diagnosis & prediction in IT systems, information spread and access control on social networks, disease propagation in protein networks, etc.)

– How to learn and represent network patterns?

– How to scalably match network patterns against an online stream of network events?

Simplified Examples

Events are labeled e1, e2, e3 in the figure.

– Communication network (iBGP server; OSPF networks N1 and N2): update configuration, withdraw prefix announcement; N1 says N2 is not reachable; N2 says N1 is not reachable

– Organization (Director D; employees N1 and N2): meeting with D and N1; email from N1 to N2; N2 updates project design document

– Social network (Person P; friends N1 and N2): P updates a blog on her Facebook page; N1 sends friend request to N2; N2 views P's updates and accepts N1's friend request

Network Patterns

Notation and Formalism

– Event data: <nodeId, type, timestamp, monitorId>

– Network Pattern: <event types, spatial pattern, temporal pattern>

– INTERFACE DOWN <LINK DOWN, NEIGHBOR, TIME WINDOW>

Temporal Pattern

– E.g.: Markov chains, frequent item sets

Spatial Pattern: Composition/Closures of one or more topological relationships

– Communication networks: upstream, downstream, neighbor, tunnel

– Social networks: manages, friends, team members, IM buddies

– Biological network: catalyst, inhibitor, suppressor
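The notation above can be sketched as plain data types. This is a minimal illustration, not the authors' implementation: the class names, the `matches` helper, the `related` callback and the use of a single time window as the temporal pattern are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Event:
    """One monitoring record: <nodeId, type, timestamp, monitorId>."""
    node_id: str
    type: str
    timestamp: float
    monitor_id: str


@dataclass(frozen=True)
class NetworkPattern:
    """<event types, spatial pattern, temporal pattern>."""
    event_types: FrozenSet[str]
    spatial: str        # name of a topological relationship, e.g. "neighbor"
    time_window: float  # assumption: a time window stands in for the temporal pattern


def matches(pattern: NetworkPattern, events: List[Event],
            related: Callable[[str, str, str], bool]) -> bool:
    """Check event types, the time window, and pairwise topological relation."""
    if not events:
        return False
    if not pattern.event_types <= {e.type for e in events}:
        return False
    times = [e.timestamp for e in events]
    if max(times) - min(times) > pattern.time_window:
        return False
    nodes = sorted({e.node_id for e in events})
    return all(related(pattern.spatial, a, b)
               for i, a in enumerate(nodes) for b in nodes[i + 1:])


# Hypothetical two-router topology with a single neighbor link.
links = {("r1", "r2")}


def related(relation: str, a: str, b: str) -> bool:
    return relation == "neighbor" and ((a, b) in links or (b, a) in links)


pattern = NetworkPattern(frozenset({"LINK DOWN"}), "neighbor", 30.0)
events = [Event("r1", "LINK DOWN", 10.0, "m1"),
          Event("r2", "LINK DOWN", 12.0, "m1")]
result = matches(pattern, events, related)
```

Here the two LINK DOWN events on neighboring routers fall inside the 30-second window, so the pattern matches; shrinking the window to 1 second makes it fail.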

[Figure: patterns over events e1, e2, e3. Temporal Pattern: Markov Chain (transitions t11, t12, t13, t22, t23, t33); Temporal Pattern: Frequent Item Sets; Spatial Pattern: Downstream (transitive closure)]

Fault Diagnosis and Prediction in Communication Networks

Challenges: improve scalability & expressiveness of fault diagnosis

– Limitation of current solutions: complexity grows as the square of the network size

– Correlation rules are pair-wise: expensive to support complex fault diagnosis (e.g., predicting soft failures, router failure from VRF tunnel events, etc.)

– Current solutions lack predictive capability

Approach:

– Fault signatures encode temporal patterns (frequent item sets, Markov chains) and topological patterns that span the network (upstream, downstream, neighbors, VPN tunnels, etc.)

– Topologically index streaming monitoring data to facilitate scalable single-pass event correlation and fault diagnosis

– Results in linear complexity and therefore increased scalability

Traditional RCA Engine vs. Proposed Approach

– Traditional: Correlation Engine (ITNM RCA) with inputs Monitoring Data (Omnibus), Topology and pair-wise correlation rules

– Proposed: Fault Signatures (Network Patterns) + Topological Index, producing Fault diagnosis

Complexity:

– Traditional: Monitoring data × Monitoring data × Rules

– Proposed: Monitoring data × Network Diameter × Signatures

– Monitoring data ~ linear in network size

– Network diameter ~ logarithmic in network size for power-law networks
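To make the two complexity expressions concrete, the sketch below plugs in illustrative numbers. The constants (one event per node, ten rules or signatures, a base-2 logarithm for the diameter) are assumptions chosen only for the arithmetic, not figures from the slides.

```python
import math


def pairwise_cost(num_events: int, num_rules: int) -> int:
    """Monitoring data x Monitoring data x Rules."""
    return num_events * num_events * num_rules


def indexed_cost(num_events: int, diameter: int, num_signatures: int) -> int:
    """Monitoring data x Network Diameter x Signatures."""
    return num_events * diameter * num_signatures


def estimate(num_nodes: int, events_per_node: int = 1, num_rules: int = 10):
    """Return (pairwise, indexed) cost estimates for a network of num_nodes."""
    events = num_nodes * events_per_node          # linear in network size
    diameter = math.ceil(math.log2(num_nodes))    # assumed logarithmic, base 2
    return (pairwise_cost(events, num_rules),
            indexed_cost(events, diameter, num_rules))


pairwise, indexed = estimate(1024)
```

For 1,024 nodes under these assumptions the pair-wise estimate is 10,485,760 comparisons against 102,400 for the indexed approach, a factor of about 100.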

Step 1: Learning Network Faults

Learn fault signatures from historical network event data

– Fault Synopsis: Fault Type → Network Pattern

– Fault Signature: Network Pattern → <Fault Type, Spatial Pattern to Localize Faulty Node>

– Fault Diagnosis: <Spatial Pattern to Localize Faulty Node, Network Topology> → Faulty Node

– Fault Prediction: Use incrementally matchable network patterns

Use indexable network patterns

– Topological relationships are invertible: neighbor⁻¹ = neighbor, downstream⁻¹ = upstream

Fault Synopsis

Fault Type   up-stream   down-stream   neighbor   …
f1           c1          c2            c3         …
f2           c2          c4            c1         …

Fault Signature

Network Pattern   up-stream   down-stream   neighbor   …
c1                -           f1, p1        f2, p2     …
c2                f1, p1      f2, p2        -          …
c3                -           -             f1, p1     …
c4                f2, p2                               …
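One way to read the two tables is that the fault signature is the fault synopsis turned around using inverse relationships: if pattern c1 appears up-stream of fault f1, then f1 lies down-stream of the node observing c1. The sketch below implements that reading. The function name and dictionary layout are assumptions, and the p1, p2 values are omitted because the slides do not define them.

```python
# Inverse of each topological relationship, as stated on the slide.
INVERSE = {
    "up-stream": "down-stream",
    "down-stream": "up-stream",
    "neighbor": "neighbor",
}


def invert_synopsis(synopsis):
    """Turn {fault: {relation: pattern}} into {pattern: {relation: [faults]}}."""
    signature = {}
    for fault, row in synopsis.items():
        for relation, pattern in row.items():
            by_relation = signature.setdefault(pattern, {})
            by_relation.setdefault(INVERSE[relation], []).append(fault)
    return signature


# The Fault Synopsis table from the slide.
synopsis = {
    "f1": {"up-stream": "c1", "down-stream": "c2", "neighbor": "c3"},
    "f2": {"up-stream": "c2", "down-stream": "c4", "neighbor": "c1"},
}
signature = invert_synopsis(synopsis)
```

Under this reading the result reproduces the Fault Signature rows, for example c1 maps to f1 via down-stream and to f2 via neighbor.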

Step 2: Online Matching

Fault localization using topological indices and hierarchical evidence aggregation

– Topology indexing algorithms + space-time trade-off in computing R(x) and R⁻¹(x), where R ∈ {upstream, downstream, neighbor, tunnel, …}

– Scalable hierarchical evidence aggregation for efficient fault diagnosis

Network Pattern   up-stream     down-stream   neighbor      VPN Tunnel
c1                Device Down   -             f1            -
c2                -             f2            -             Device Down
c3                -             -             Device Down   -

[Figure: Evidence Aggregation (nodes n1 and n2; patterns c1, c2, c3) and Scalable Hierarchical Evidence Aggregation (faults f1, f2, f3, …, fn-1, fn over a hierarchy of bf nodes)]
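The idea of localizing a fault by looking up R(x) and voting on candidate nodes can be sketched as follows. This is a flat (non-hierarchical) aggregation written for illustration only: the `TopologyIndex` class, the `aggregate` function, the node names and the simple vote counting are assumptions, and the VPN Tunnel column is left out because the slides do not state its inverse.

```python
from collections import Counter, defaultdict

INVERSE = {"upstream": "downstream", "downstream": "upstream",
           "neighbor": "neighbor"}

# Signature table from the slide (VPN Tunnel column omitted).
SIGNATURES = {
    "c1": {"upstream": "Device Down", "neighbor": "f1"},
    "c2": {"downstream": "f2"},
    "c3": {"neighbor": "Device Down"},
}


class TopologyIndex:
    """Stores each relationship together with its inverse for direct lookup."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, relation, x, y):
        """Record that y is in relation(x); the inverse entry is added too."""
        self._index[relation][x].add(y)
        self._index[INVERSE[relation]][y].add(x)

    def related(self, relation, x):
        return self._index[relation].get(x, set())


def aggregate(index, observations):
    """Count, per (fault, candidate node), how many observations support it."""
    votes = Counter()
    for node, pattern in observations:
        for relation, fault in SIGNATURES.get(pattern, {}).items():
            for candidate in index.related(relation, node):
                votes[(fault, candidate)] += 1
    return votes


index = TopologyIndex()
index.add("downstream", "x", "n1")   # n1 is downstream of x
index.add("neighbor", "n2", "x")
index.add("neighbor", "n2", "y")

votes = aggregate(index, [("n1", "c1"), ("n2", "c3")])
best = votes.most_common(1)[0]
```

In this toy topology both observations point at node x, so ("Device Down", "x") collects two votes while y collects one.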

Details

OFFLINE LEARNING

Event Datasets

Preparation of training data

– Interval Filter: segment event dataset into event bursts

– Support Filter: eliminate high-frequency (regular network operations) and low-frequency (noise) burst sets

– Periodicity Filter: eliminate burst sets with high periodicity (maintenance operations)

Extract temporal patterns

– Markov chains and maximum likelihood estimation

Extract topological patterns (input: Network Topology)

– Set of topological relationships: SE, NE, DS, US, TN

– Principle of minimum explanation

Fault Signatures
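Two of the offline steps, the interval filter and the maximum likelihood estimate of Markov chain transitions, can be sketched in a few lines. The function names, the gap threshold and the sample events are assumptions for illustration; the support and periodicity filters are not shown.

```python
from collections import Counter, defaultdict


def segment_bursts(events, max_gap):
    """Interval filter: split (timestamp, type) pairs, sorted by time,
    into bursts wherever consecutive events are more than max_gap apart."""
    bursts, current, last = [], [], None
    for timestamp, event_type in events:
        if last is not None and timestamp - last > max_gap:
            bursts.append(current)
            current = []
        current.append(event_type)
        last = timestamp
    if current:
        bursts.append(current)
    return bursts


def estimate_transitions(bursts):
    """Maximum likelihood estimate: P(b | a) = count(a -> b) / count(a -> *)."""
    counts = defaultdict(Counter)
    for burst in bursts:
        for a, b in zip(burst, burst[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(following.values()) for b, n in following.items()}
            for a, following in counts.items()}


# Hypothetical event log: two bursts separated by a long quiet period.
events = [(0, "e1"), (1, "e2"), (2, "e3"),
          (100, "e1"), (101, "e2"), (102, "e2"), (103, "e3")]
bursts = segment_bursts(events, max_gap=10)
probs = estimate_transitions(bursts)
```

With these sample events, e1 is always followed by e2, while e2 is followed by e3 two times out of three and by itself once.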

ONLINE MATCHING

Event Stream

Match temporal patterns (input: Fault Signatures)

– Min-Heap + incremental pattern matching

Evidences: <f, v, Rv>

Indexed network topology (input: Network Topology)

– Inverted Index for constant time lookup

– Space-Time tradeoffs

Scalable Evidence Aggregation

– BIRCH data structure (hierarchical aggregation)

– Optimizations: filter-and-refine (Bloom filter) + slotted aggregation (BIGTABLE)

Fault Diagnosis and Prediction
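The filter-and-refine optimization can be illustrated with a small Bloom filter: a cheap membership test first rules out most nodes, and an exact check then removes any false positives. How the authors apply it is not described in the slides, so the class, its parameters and the node names below are assumptions.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def filter_and_refine(nodes, bloom, exact_evidence):
    # Filter: drop nodes the Bloom filter rules out.
    maybe = [n for n in nodes if bloom.might_contain(n)]
    # Refine: confirm against the exact set to remove false positives.
    return [n for n in maybe if n in exact_evidence]


# Hypothetical evidence set: nodes with supporting evidence for some fault.
exact_evidence = {"r3", "r7"}
bloom = BloomFilter()
for node in exact_evidence:
    bloom.add(node)

nodes = [f"r{i}" for i in range(10)]
result = filter_and_refine(nodes, bloom, exact_evidence)
```

The refine step guarantees the final answer is exact regardless of how many false positives the filter lets through; the filter only reduces how many exact lookups are needed.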

Fault Diagnosis & Prediction: Scalability

Result Summary:

SNMP Trap messages from a large enterprise (7 ASes, 32 IGP networks, 871 subnets, 1,268 VPN tunnels, 2,068 main nodes, 18,747 interfaces and 192,000 entities) over 14 days in 2007

Topology dataset – European backbone network (2,383 main nodes, spans 7 countries, 11 ASes and over 100,000 entities)

Network fault simulator and monitoring data generation

Linear scalability; further optimizations: prune-and-search; slotted hierarchical aggregation

Ongoing activities

Integration with IBM Tivoli Network Management suite (ITNM) for live testing and fine-tuning

Network patterns for access control on information flows over: (i) ENRON email data & organization role topology; (ii) Smallblue data & social + information network topology

Summary

Network patterns encode spatial-temporal properties of various networks

– Ability to scalably mine and match network patterns is key for understanding global network phenomena

Case study on fault diagnosis and prediction in communication networks

– Complexity of solution has to be linear in network size

– Topologically indexed databases were a key tool for addressing scalability

Explore more complex network patterns for information, social and biological networks which exhibit stronger coupling relationships

– A failed router does not cause its neighboring router to fail

– A corrupt information node can corrupt its neighbor (e.g., summary node)

– A diseased enzyme can catalyze/inhibit its neighbors

Questions?

Mudhakar Srivatsa

msrivats@us.ibm.com
