
Author: Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica

Presenter :Yinzhi Cao

Outline
- Background
- Origin of X-Trace
- X-Trace: Vector, Flowing Vector, God (Collector), Overhead
- Usage Scenarios
- Potential Problems

Background(1)

Network Diagnosis Scenario One (Accessing a Website)

Background(2)

Scenario Two (Distributed File System)

Background(3) Existing Methods

White Box: X-Trace

Black Box: Sherlock

Comparison of White Box and Black Box (listed as White Box / Black Box)
- Overhead: Large / Small
- Modification to Program: Yes / No
- Notification of Program: No / Yes
- Accuracy: High / Low

Origin of X-Trace

How do we diagnose a person?
1. Radioactive material. Implies: we need a vector flowing in our body.
2. X-ray detector. Implies: we need a collector to monitor activities.
3. Overhead. Implies: there is no free lunch.

X-Trace(Vector)

Vector: X-Trace Metadata

X-Trace(Flowing Vector)

Flowing Vector

Vectors alone are of no use. We make them flow, and then we get the information. The following is an entity we want to diagnose.

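X-Trace(Flowing Vector) Continued: Let Vectors Flow

Two ways: pushNext() and pushDown(). pushNext() propagates the metadata to the next operation in the same layer, and pushDown() propagates it to the layer below.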
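To make pushNext() and pushDown() concrete, here is a minimal Python sketch. It assumes a simplified metadata record holding only a TaskID, an OpID, a ParentID, and an edge type; the real X-Trace metadata carries further fields, and the class name, field names, and random 32-bit OpIDs are illustrative assumptions, not the X-Trace library's API.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class XTraceMetadata:
    """Hypothetical, simplified X-Trace metadata carried with a request."""
    task_id: int                      # shared by every operation of one task
    op_id: int                        # identifies this operation
    parent_id: Optional[int] = None   # OpID of the operation that caused this one
    edge_type: Optional[str] = None   # "NEXT" (same layer) or "DOWN" (layer below)

def _new_op_id() -> int:
    # Assumption: OpIDs are random 32-bit values, unique with high probability.
    return random.getrandbits(32)

def push_next(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate metadata to the next operation in the same layer."""
    return replace(md, op_id=_new_op_id(), parent_id=md.op_id, edge_type="NEXT")

def push_down(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate metadata to the layer below (e.g. from HTTP down to TCP)."""
    return replace(md, op_id=_new_op_id(), parent_id=md.op_id, edge_type="DOWN")

# Example: an HTTP request is forwarded by a proxy, which then opens a TCP leg.
root = XTraceMetadata(task_id=42, op_id=_new_op_id())
at_proxy = push_next(root)      # same layer: client -> proxy
tcp_leg = push_down(at_proxy)   # lower layer: proxy's HTTP -> TCP
```

Because every new OpID records the OpID it came from, the reports of one TaskID can later be stitched back into a tree.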

X-Trace(Collector)

As with diagnosing a person, we need a "god" that collects all the data and reconstructs the task trees offline.

The question is: how?
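One way a collector could do this, sketched under assumptions: each instrumented node sends a report with the TaskID, its own OpID, its parent's OpID, and the edge type; the collector groups reports by TaskID and links every operation to its parent. The `Report` record and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional

class Report(NamedTuple):
    """Hypothetical report an instrumented node sends to the collector."""
    task_id: int
    op_id: int
    parent_id: Optional[int]  # None for the root operation
    edge_type: Optional[str]  # "NEXT", "DOWN", or None for the root
    label: str                # human-readable description of the operation

def build_trees(reports: List[Report]) -> Dict[int, Dict[Optional[int], List[Report]]]:
    """Group reports by task, then index each operation under its parent."""
    trees = defaultdict(lambda: defaultdict(list))
    for r in reports:
        trees[r.task_id][r.parent_id].append(r)
    return trees

def render(children: Dict[Optional[int], List[Report]],
           parent: Optional[int] = None, depth: int = 0) -> List[str]:
    """Depth-first walk from the root; a lost report simply cuts off its subtree."""
    lines: List[str] = []
    for r in sorted(children.get(parent, []), key=lambda rep: rep.op_id):
        edge = f"[{r.edge_type}] " if r.edge_type else ""
        lines.append("  " * depth + edge + r.label)
        lines.extend(render(children, r.op_id, depth + 1))
    return lines

# Reports may reach the collector in any order.
reports = [
    Report(7, 3, 2, "DOWN", "TCP connection to server"),
    Report(7, 1, None, None, "HTTP GET at client"),
    Report(7, 2, 1, "NEXT", "HTTP GET at proxy"),
    Report(7, 4, 2, "NEXT", "HTTP GET at server"),
]
for line in render(build_trees(reports)[7]):
    print(line)
```

This prints the client request at the root, the proxy one level below it, and the TCP connection and server request as the proxy's two children. Note that a missing report (the "report loss" problem listed later) silently drops a whole subtree in this naive design.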

X-Trace(Overhead)

Modification of Existing Program

X-Trace(Overhead) Continued: Influence on Current Network Flow

1. The metadata is very small, so it adds little additional traffic to the network.
2. Reports are sent over separate channels, so they do not occupy the current network flow.

Usage Scenarios of X-Trace(1) Web Request and Recursive DNS Queries

Usage Scenarios of X-Trace(2) A Web Hosting Site

Usage Scenarios of X-Trace(3) An Overlay Network

Potential Problems Mentioned by the Authors
- Report loss
- Managing report traffic
- Non-tree request structures
- Partial deployment
- Security considerations

We have examined the white-box approach. Now let's turn to another approach, which may be less accurate but incurs less overhead. First, we need some models.

Author: Victor Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming

Presenter: Yinzhi Cao

Outline
- Models: Node Model, Network Model, Relationship Model
- How to Use Our Model
- Algorithm Efficiency
- Evaluation

Models

The main idea of this paper is to build a model of the network and use this model for diagnosis.

We have three levels of model: node, network, and relationship.

Node Model

A node has three states: up, troubled, and down.

Network Model

Graph. What's more? An Inference Graph.

Relationship Model(1)

Noisy-Max

Backup Slides 1

First, we use the model below. The circle means that with probability x the output equals the input, and with probability 1 - x the output is up.

Let us use an unordered pair {x,y} to represent node status:
- {1,1} = {1}: up
- {0,1}: troubled
- {0,0} = {0}: down

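Backup Slides 2

So, for a child with parents Parent1 and Parent2, the status of the child can be represented as follows.

$$\mathrm{Status}(\mathrm{Child}) = \left|\,\mathrm{Status}(\mathrm{Parent}_1) \bullet \mathrm{Status}(\mathrm{Parent}_2)\,\right|$$

Here • means outer product, and we define |(x,y)| = ⟨x,y⟩ = xy, applied to every pair of the product.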
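A small Python check of this notation, assuming statuses are unordered sets (so {1,1} collapses to {1}) and that |·| is applied to every pair produced by the outer product. It ignores the noisy probability x from Backup Slides 1 and only verifies the "max" part: the child always lands in the worse of its two parents' states. The name `combine` is hypothetical.

```python
from itertools import product

# Node statuses as unordered sets, following the {x,y} notation.
UP = frozenset({1})           # {1,1} = {1}
TROUBLED = frozenset({0, 1})  # {0,1}
DOWN = frozenset({0})         # {0,0} = {0}

def combine(p1: frozenset, p2: frozenset) -> frozenset:
    """|Status(Parent1) . Status(Parent2)|: form every pair (x, y) of the
    outer product, map it to x*y, and keep the results as a set."""
    return frozenset(x * y for x, y in product(p1, p2))

# Exhaustive check: the child takes the worse of the two parent states.
SEVERITY = {UP: 0, TROUBLED: 1, DOWN: 2}
for a, b in product(SEVERITY, repeat=2):
    worse = a if SEVERITY[a] >= SEVERITY[b] else b
    assert combine(a, b) == worse
```

The unordered-set reading is what makes the outer product work: troubled combined with troubled yields the products {0, 0, 0, 1}, which collapses back to {0, 1}.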

Selector

Failover

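Backup Slides 3

We use the definitions above. Let Status(Parent1) = {x1,x2} (the primary) and Status(Parent2) = {y1,y2} (the backup). Then

$$\mathrm{Status}(\mathrm{Child}) = \{(x_1 + x_2)\,x_1 + \lnot(x_1 + x_2)\,y_1,\; (x_1 + x_2)\,x_2 + \lnot(x_1 + x_2)\,y_2\}$$

Here + means OR, and AND is written by juxtaposition (its symbol is omitted).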
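The failover formula can be evaluated directly in Python. This sketch reads + as OR and juxtaposition as AND, which is the reading under which the child switches to the backup only when the primary is down ({0,0}); statuses are written as ordered pairs (x1, x2) with up = (1, 1), troubled = (0, 1), and down = (0, 0). The function name is hypothetical.

```python
from itertools import product
from typing import Tuple

# Status written as the ordered pair (x1, x2) used in the formula.
State = Tuple[int, int]
UP: State = (1, 1)
TROUBLED: State = (0, 1)
DOWN: State = (0, 0)

def failover(primary: State, backup: State) -> State:
    """Evaluate the failover formula with + as OR and juxtaposition as AND."""
    x1, x2 = primary
    y1, y2 = backup
    alive = x1 | x2   # (x1 + x2): 1 unless the primary is down
    dead = 1 - alive  # not(x1 + x2)
    return ((alive & x1) | (dead & y1), (alive & x2) | (dead & y2))

# The child follows the primary, unless the primary is down; then the backup.
for p, b in product((UP, TROUBLED, DOWN), repeat=2):
    assert failover(p, b) == (b if p == DOWN else p)
```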

How to Use the Model?

Fault Localization on the Inference Graph

Algorithm Efficiency(1)

Calculations inside the Inference Graph (noisy-max relationship)

Reduce the time complexity from O(3^n) to O(n)
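The sketch below illustrates how a linear-time noisy-max computation is possible, under assumptions that may not match the paper's exact formulation: parents are independent, each parent's state is a distribution (P(up), P(troubled), P(down)), each edge passes the parent's state through with dependency probability d (the x of Backup Slides 1) and otherwise yields up, and the child takes the worst effective parent state. Then P(child up) is a product of effective P(up) values, and P(child not down) is a product of effective P(not down) values, which is one pass over n parents. A brute-force version enumerating all 3^n joint states is included for comparison. Function names are hypothetical.

```python
from itertools import product
from math import prod
from typing import List, Tuple

Dist = Tuple[float, float, float]  # (P(up), P(troubled), P(down))

def effective(dist: Dist, d: float) -> Dist:
    """Noisy edge: with probability d the parent's state passes through, else up."""
    up, tr, dn = dist
    return (d * up + (1 - d), d * tr, d * dn)

def noisy_max_linear(parents: List[Tuple[Dist, float]]) -> Dist:
    """O(n): the child is up iff every effective parent is up,
    and not down iff no effective parent is down."""
    eff = [effective(dist, d) for dist, d in parents]
    p_up = prod(e[0] for e in eff)
    p_not_down = prod(1 - e[2] for e in eff)
    return (p_up, p_not_down - p_up, 1 - p_not_down)

def noisy_max_bruteforce(parents: List[Tuple[Dist, float]]) -> Dist:
    """O(3^n): enumerate every joint state of the effective parents."""
    eff = [effective(dist, d) for dist, d in parents]
    out = [0.0, 0.0, 0.0]
    for states in product(range(3), repeat=len(eff)):  # 0=up, 1=troubled, 2=down
        out[max(states)] += prod(e[s] for e, s in zip(eff, states))
    return (out[0], out[1], out[2])

# Three parents given as (distribution, dependency probability d).
parents = [((0.90, 0.08, 0.02), 0.95), ((0.70, 0.20, 0.10), 0.50), ((0.99, 0.00, 0.01), 1.00)]
fast = noisy_max_linear(parents)
slow = noisy_max_bruteforce(parents)
assert all(abs(f - s) < 1e-12 for f, s in zip(fast, slow))
```

The saving comes from never materializing joint states: "worst of n" only needs to know whether all parents are up and whether none is down.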

Algorithm Efficiency(2)

Comparison of Multiple Inputs and Observations

Two Methods to Use

1. Examine data sets with high probability and ignore low-probability ones
2. Dynamic programming (reduce redundancy)

Algorithm Efficiency(3)

The authors base these two methods on two observations:
1. It is very likely that at any point in time only a few root-cause nodes are troubled or down.
2. Since a root cause is assigned to be up in most assignment vectors, evaluating an assignment vector only requires re-evaluating the states at the descendants of root-cause nodes that are not up.
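Observation 1 suggests restricting the search to assignment vectors in which at most k root causes are troubled or down. A sketch of that enumeration, with a hypothetical bound k and hypothetical function names, shows how much the search space shrinks: with n root causes there are sum over j ≤ k of C(n, j)·2^j candidates instead of 3^n.

```python
from itertools import combinations, product
from math import comb
from typing import Dict, Iterator, List

def candidate_assignments(root_causes: List[str], k: int) -> Iterator[Dict[str, str]]:
    """Yield assignment vectors with at most k root causes not up;
    every other root cause is assigned up (observation 1)."""
    for j in range(k + 1):
        for abnormal in combinations(root_causes, j):
            for states in product(("troubled", "down"), repeat=j):
                vector = {rc: "up" for rc in root_causes}
                vector.update(zip(abnormal, states))
                yield vector

def count_candidates(n: int, k: int) -> int:
    """Closed form: choose j abnormal root causes, each troubled or down."""
    return sum(comb(n, j) * 2 ** j for j in range(k + 1))

vectors = list(candidate_assignments(["dns", "proxy", "server", "link"], k=2))
assert len(vectors) == count_candidates(4, 2)
```

For n = 50 and k = 2 this gives 1 + 100 + 4900 = 5001 candidates instead of 3^50. Observation 2 would further reduce the cost of scoring each candidate, since only the descendants of the non-up root causes need recomputation; that part is not shown.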

Evaluation

Inference Graph Established

Accuracy Compared with others

Time to Localize Faults

Impact of Errors in Inference Graph

Open Issues

The node model is very simple: it has only three states. Can we have a continuous model instead?

Can we bring stochastic-process concepts such as Markov chains into this model?
