G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks

Post on 31-Dec-2015




He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer Yates

Abstract

● Best effort networks --> QoS

● Manage end-to-end service quality as a whole

● Generic Root Cause Analysis (G-RCA)

o Service Quality Management (SQM)

● FCAPS

Introduction

• Finding the root cause of errors

– transient errors

• Gather information for network operators

• Helps Service Quality Management (SQM) for ISPs.

G-RCA Architecture

• Consists of five main components.

• G-RCA determines where and when to look for diagnostic events.

• Used for:

– Troubleshooting ongoing network issues

– Investigating past behavior.

Data Collection and Management

• Proactively collects data from network, such as alarms, logs and performance measurements.

• Uses a data collector and a database to store the data.

• Each “event” definition includes:

– event name, location type, retrieval process and information
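The four fields of an event definition can be pictured as a small record type. The sketch below is only an illustration: the class name, the example values and the free-form `information` dictionary are assumptions, not G-RCA's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class EventDefinition:
    """One entry in a hypothetical event library (illustrative only)."""
    event_name: str          # name used to refer to the event
    location_type: str       # kind of location the event is reported at
    retrieval_process: str   # how instances are pulled from the data store
    information: dict = field(default_factory=dict)  # free-form details


# Made-up example values, chosen only to show the shape of a definition.
link_flap = EventDefinition(
    event_name="interface-flap",
    location_type="router:interface",
    retrieval_process="query the syslog table for link up/down messages",
    information={"source": "syslog"},
)
```

Keeping the retrieval process inside the definition means a diagnosis rule can refer to an event by name without knowing where its data lives.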

Service Dependency Model

● Figure 2 used to include network elements associated with a problem

● Hard to put the theory into practice

o Traffic sampling data

o Snapshots of router configs

Spatial-Temporal Correlation (1)

● How to relate what has happened to service problem?

● G-RCA defines a temporal and spatial joining rule

● Temporal Joining Rule

○ Defines a time window to allow symptom and diagnostic event to be joined.

○ 6 parameters, three each for the symptom event and the diagnostic event

■ Left expansion margin

■ Right expansion margin

■ Expanding option (Start/End, Start/Start or End/End)

Spatial-Temporal Correlation (2)

○ Symptom and diagnostic event are joined when the windows overlap.
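The temporal joining rule can be sketched in a few lines. This is a minimal illustration that assumes each event carries a start and end time in seconds, and that the expanding option selects which of those two timestamps the left and right margins are applied to. The names `Expansion`, `expand` and `temporally_joined` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expansion:
    left: float    # left expansion margin, in seconds
    right: float   # right expansion margin, in seconds
    option: str    # "start/end", "start/start" or "end/end"


def expand(start, end, rule):
    """Return the expanded (low, high) window for one event."""
    if rule.option == "start/end":
        low, high = start, end
    elif rule.option == "start/start":
        low, high = start, start
    elif rule.option == "end/end":
        low, high = end, end
    else:
        raise ValueError(f"unknown expanding option: {rule.option}")
    return low - rule.left, high + rule.right


def temporally_joined(symptom, diagnostic, symptom_rule, diagnostic_rule):
    """True when the two expanded windows overlap."""
    s_low, s_high = expand(symptom[0], symptom[1], symptom_rule)
    d_low, d_high = expand(diagnostic[0], diagnostic[1], diagnostic_rule)
    return s_low <= d_high and d_low <= s_high


# Made-up margins: look up to 180 s before the symptom starts.
symptom_rule = Expansion(left=180, right=5, option="start/start")
diagnostic_rule = Expansion(left=5, right=5, option="start/end")
joined = temporally_joined((1000, 1010), (900, 905), symptom_rule, diagnostic_rule)
```

With these example margins the symptom window becomes 820 to 1005 and the diagnostic window 895 to 910, so the two events are joined.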

Spatial-Temporal Correlation (3)

● Spatial Joining Rule

○ Symptom event location type

○ Diagnostic event location type

○ Joining level

● Joining level

○ Links symptom locations and diagnostic event locations together

● Model diagnostic signatures using a diagnosis graph

● A symptom and diagnostic event pair is called a diagnosis rule

● G-RCA evaluates the time and location conditions and the collected data

● Determines whether the diagnostic signature is present
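One way to picture the spatial joining rule is to map both event locations to the joining level and check whether they meet there. The sketch below assumes a precomputed lookup table standing in for the service dependency model. The table layout, the location strings and the function names are all hypothetical.

```python
def to_level(location, location_type, level, mapping):
    """Map a location to the set of locations it corresponds to at `level`."""
    if location_type == level:
        return {location}
    # mapping[(from_type, to_type)][location] -> related locations (assumed layout)
    return set(mapping.get((location_type, level), {}).get(location, ()))


def spatially_joined(sym_loc, sym_type, diag_loc, diag_type, level, mapping):
    """True when both events map to at least one common location at `level`."""
    sym_at_level = to_level(sym_loc, sym_type, level, mapping)
    diag_at_level = to_level(diag_loc, diag_type, level, mapping)
    return bool(sym_at_level & diag_at_level)


# Made-up dependency data: one interface that lives on router "r1".
mapping = {("interface", "router"): {"r1:ge-0/0/1": {"r1"}}}

same_router = spatially_joined(
    "r1:ge-0/0/1", "interface", "r1", "router", "router", mapping
)
```

A diagnosis rule would then hold only when both this spatial check and the temporal window check succeed for the same symptom and diagnostic event pair.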

Reasoning Logic

Rule-Based Reasoning Module

• Priority value in the diagnosis graph

– Assigned by operator

– A higher value means more confidence that the diagnostic event is the real root cause

– Can be examined by G-RCA’s Result Browser

• How does rule-based reasoning work?

Diagnosis graph for BGP flaps root cause analysis
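Read from the bullets above, rule-based reasoning amounts to picking, among the diagnostic events that joined with the symptom, the one with the highest operator-assigned priority. The sketch below assumes exactly that. The event names and priority values are invented for illustration and are not taken from the diagnosis graph in the figure.

```python
def pick_root_cause(priorities, matched_events):
    """Return the matched diagnostic event with the highest priority.

    priorities: dict mapping diagnostic event name -> operator-assigned priority
    matched_events: names of diagnostic events that joined with the symptom
    """
    candidates = [name for name in matched_events if name in priorities]
    if not candidates:
        return "unknown"   # no configured rule explains this symptom
    return max(candidates, key=lambda name: priorities[name])


# Hypothetical priorities for illustration only.
priorities = {"interface_flap": 100, "cpu_overload": 50}

cause = pick_root_cause(priorities, ["cpu_overload", "interface_flap"])
```

Here `cause` is "interface_flap", because it carries the larger priority of the two matched events.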

Bayesian Inference

• Determining the root cause is to identify the one producing the maximum likelihood ratio, which the slide labels as a first term and a second term [equation not preserved in the transcript]

– Classes: the potential root causes

– Features: the presence or absence of the diagnostic evidence and the symptom events themselves

• When the features are conditionally independent

– The second term can be decoupled [equation not preserved in the transcript]

• Parameter configuration (ratios of: [symbols not preserved in the transcript])

– bootstrap using the rule-based reasoning

– define a fuzzy type of discrete values

• Low, Medium, and High, which correspond to values 2, 100, and 20 000.
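The equations on this slide did not survive in the transcript. A form consistent with the bullets, offered here as a reconstruction under standard naive Bayes assumptions rather than as the slide's exact notation, writes the root cause class as r, its complement as r̄, and the features as e_1, ..., e_n (these symbols are assumptions):

```latex
\hat{r}
  = \arg\max_{r}\,
    \frac{P(r \mid e_1,\dots,e_n)}{P(\bar{r} \mid e_1,\dots,e_n)}
  = \arg\max_{r}\,
    \underbrace{\frac{P(r)}{P(\bar{r})}}_{\text{first term}}
    \cdot
    \underbrace{\frac{P(e_1,\dots,e_n \mid r)}{P(e_1,\dots,e_n \mid \bar{r})}}_{\text{second term}}

% If the features are conditionally independent given the class:
\frac{P(e_1,\dots,e_n \mid r)}{P(e_1,\dots,e_n \mid \bar{r})}
  = \prod_{i=1}^{n} \frac{P(e_i \mid r)}{P(e_i \mid \bar{r})}
```

Under this reading, the parameters to configure would be the prior ratio and the per-feature ratios, each set to one of the fuzzy values Low, Medium or High.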

Comparison

• In the operational practice, rule-based reasoning logic is often preferred over Bayesian inference

– Easier to configure

– Gives simple and direct association between the diagnosed root cause and the evidence

– Effective in most applications

• However, there are a few cases where Bayesian inference is preferred

– Root cause condition is unobservable

Domain Knowledge Building

● Issue: The diagnosis graph that an operator specifies for an SQM application, especially the initial version, can be inaccurate and incomplete.

● G-RCA addresses the problem of an incomplete diagnosis graph through iterative use of the Correlation Tester and Result Browser.

○ First, the operator filters out the symptom events with known root causes, using the root cause classification capability provided in the Result Browser.

○ Second, the operator can focus on the remaining symptom events by comparing them with other suspected diagnostic events that occur at the same time and are spatially related to the service problem.

Domain Knowledge Building

● On one hand, the second step can be done via the manual drill-down and data exploration capabilities of the Result Browser;

● On the other hand, operators can also run the Correlation Tester blindly between the symptom events without known root causes and each type of suspected diagnostic event.

● As G-RCA emphasizes usability, newly uncovered diagnosis rules need to be verified by operators before being incorporated into the diagnosis graph.
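The blind run of the Correlation Tester can be pictured as scoring every suspected diagnostic event type against the unexplained symptom events and ranking the results for an operator to review. The sketch below uses a plain co-occurrence rate as a simplified stand-in. It is not the statistical test G-RCA uses, and the function names, event names and timestamps are hypothetical.

```python
def cooccurrence_rate(symptom_times, diagnostic_times, window):
    """Fraction of symptom events with a diagnostic event within `window` seconds."""
    if not symptom_times:
        return 0.0
    hits = sum(
        any(abs(s - d) <= window for d in diagnostic_times)
        for s in symptom_times
    )
    return hits / len(symptom_times)


def rank_candidates(symptom_times, candidates, window):
    """Score each candidate diagnostic event type, highest score first."""
    scores = {
        name: cooccurrence_rate(symptom_times, times, window)
        for name, times in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up timestamps (seconds) for symptoms without a known root cause.
unexplained = [100, 200, 300]
candidates = {"event_a": [98, 205, 299], "event_b": [500]}

ranking = rank_candidates(unexplained, candidates, window=10)
```

A raw co-occurrence rate does not account for events that coincide by chance, so a real tester would apply a significance test before proposing a rule, and the operator would still verify each candidate.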

Introduction of G-RCA Applications

• The key advantage of G-RCA in SQM is its capability to be rapidly customized into different RCA applications in the ISP’s network.

• In this section, the following three case studies are included to demonstrate the effectiveness of G-RCA:

– 1) customer BGP flaps

– 2) end-to-end throughput management in a CDN service

– 3) network PIM flaps in multicast VPN

BGP Flaps Root Cause Analysis

Purpose: Understanding the root cause of flaps.

● Achieved with G-RCA by constructing application-specific events and rules.

○ Start by constructing the BGP flap-specific events.

○ Add a few application-specific diagnosis rules.

○ Specify priorities for the different diagnosis rules for BGP flaps RCA. (Please refer to the figure “Diagnosis graph for BGP flaps root cause analysis” shown in the previous slides.)

Application-specific events for BGP flaps root cause analysis

Conclusion

1. G-RCA captures the layered network model in its knowledge library by implementing

- temporal/spatial correlation,

- rule-based reasoning, and

- Bayesian inference.

2. Domain knowledge in an existing RCA application can be refined through the interaction between the RCA engine and the Correlation Tester.

3. To analyse a large number of service quality issues and to classify and trend their root causes, it proactively collects all types of data from different sources and normalizes them in real time.
