1. cluster the data. 2. for the data of a cluster, set up the network. 3. begin at a random vertex...


Upload: preston-woods

Post on 26-Dec-2015


TRANSCRIPT


1. Cluster the data.

2. For the data of each cluster, set up the network.

3. Begin at a random vertex as source/sink s, and choose its farthest vertex as the sink/source t.

4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and take the smaller side as the candidate outlier or outlier group.

5. Remove the candidate outlier or outlier group from the graph.

6. Select the next source and go back to step 3 until the stop criterion is met.

7. Coarsen the graph and run the algorithm again; select outliers from the candidate outliers.

Outlier Detection and Evaluation by Network Flow

Ying Liu
Advisor: Alan P. Sprague
{liuyi, sprague}@cis.uab.edu
Department of Computer and Information Sciences
University of Alabama at Birmingham
http://www.cis.uab.edu/kddm

Fraud detection for credit cards

Intrusion detection in computer networks

Novelty detection in images

Detect network bottlenecks

Experiments and Implementation Details

To repair poor-quality clusters generated by a clustering algorithm, we remove from each cluster the outliers that do not belong in it.

Theory foundation: Maximum Flow/Minimum Cut

The maximum flow problem is to find a flow f for which the total flow is maximum. The total flow can be measured at the sink, or at any cut separating the source from the sink. (Ford-Fulkerson algorithm, 1962.)

s->a->b->t: 12

s->a->c->d->b->t: 7

s->c->b->t: 9

s->c->d->t: 3

maximum-flow = minimum-cut = 12 + 7 + 9 + 3 = 31

[Figure: flow network on vertices s, a, b, c, d, t. Edge labels (flow/capacity): 19/19, 12/13, 7/10, 9/9, 7/7, 12/12, 28/30, 3/3, 10/11.]
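The example above can be checked with a short sketch. This is one possible augmenting-path implementation (the Edmonds-Karp variant of Ford-Fulkerson), not necessarily the code used for the poster. The assignment of capacities to individual edges is an assumption: the transcript lists the flow/capacity labels but not which edge carries which label, so the dictionary below is one assignment consistent with the four paths. The function name `max_flow_min_cut` is hypothetical.

```python
from collections import defaultdict, deque


def max_flow_min_cut(capacity, s, t):
    """Return (max flow value, source side of a minimum cut).

    capacity maps directed edges (u, v) to non-negative capacities.
    For an undirected network, list both (u, v) and (v, u).
    """
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), c in capacity.items():
        residual[u][v] += c
        residual[v][u] += 0  # make sure the reverse residual edge exists

    def reachable():
        # Breadth-first search over edges with remaining residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if t not in parent:
            # No augmenting path left: the reached vertices form the source side.
            return flow, set(parent)
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Assumed edge assignment (not given explicitly in the transcript).
example = {
    ("s", "a"): 19, ("s", "c"): 12, ("a", "b"): 13, ("a", "c"): 10,
    ("c", "b"): 9, ("c", "d"): 11, ("d", "b"): 7, ("d", "t"): 3,
    ("b", "t"): 30,
}
value, source_side = max_flow_min_cut(example, "s", "t")
```

Under this assignment the two edges leaving s have total capacity 19 + 12 = 31, which matches the maximum flow stated above.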

Intuition: Suppose t is an outlier, s is the farthest vertex from t. Suppose further that t is far from all other points in the data set. Then each edge between t and other vertices has small capacity, so ({t}, C-{t}) is a cut of small capacity.

Cluster No. 20: 591 points.

7 nearest neighbors: 591 points, 5028 edges. Edges are in two directions.

• Compute the k nearest neighbors of each vertex, and make sure all vertices are connected.

• Compute the capacity between two vertices from the distance between them:

Capacity = 1 / Distance
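A minimal sketch of this setup step follows. It assumes Euclidean distance, distinct points (a zero distance would divide by zero), and capacity equal to the reciprocal of the distance raised to a power n. It does not add the extra edges needed to guarantee connectivity, because the transcript does not say how that is done. The function name `knn_capacity_graph` and its parameters are hypothetical.

```python
import math


def knn_capacity_graph(points, k, n=1):
    """Build a k-nearest-neighbor network with capacity (1 / distance) ** n.

    Returns a dict mapping directed edges (i, j) to capacities. Every
    neighbor link is stored in both directions.
    """
    capacity = {}
    for i, p in enumerate(points):
        # Indices of all other points, nearest first.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            c = (1.0 / math.dist(p, points[j])) ** n
            capacity[(i, j)] = c
            capacity[(j, i)] = c
    return capacity


# Three close points and one far point on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
cap = knn_capacity_graph(points, k=1)
```

In this toy data the far point at 10.0 is attached only through an edge of capacity 1/8, while the close points are joined by edges of capacity 1, which is the low-capacity situation the intuition above relies on.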

Red vertices are the source side; blue vertices are the sink side.

Randomly select a vertex to start, and find its farthest vertex by single-source shortest path. Use these two vertices as source and sink to run the Maximum-Flow/Minimum-Cut algorithm. In this example, the flow from source to sink is 1269. We use the maximum flow as the measure of outlier degree: for an outlier or outlier group, a low maximum flow indicates a strong outlier, and a high maximum flow indicates a weak outlier.
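The farthest-vertex step can be sketched with Dijkstra's algorithm. This assumes edge lengths are the point-to-point distances and that the graph is given as a nested dictionary; both the representation and the function names are illustrative, not taken from the poster.

```python
import heapq


def shortest_path_lengths(adj, source):
    """Dijkstra's algorithm. adj[u][v] is the non-negative length of edge u-v."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def farthest_vertex(adj, source):
    """Return the reachable vertex with the largest shortest-path distance."""
    dist = shortest_path_lengths(adj, source)
    return max(dist, key=dist.get)


# Path graph 0 - 1 - 2 - 3 with one long edge at the end.
adj = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 8.0},
    3: {2: 8.0},
}
sink = farthest_vertex(adj, 0)
```

Starting from vertex 0, the sketch picks vertex 3 as the sink because its shortest-path distance (10.0) is the largest.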

Vertex with smallest (last_wavemin+flow_passed)

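[Figure: flow network with source s, sink t, and vertices a to f. Edge labels (flow/capacity): 3/3, 3/3, 3/10, 3/12, 3/12, 3/10, 0/8, 0/8, 0/4, 0/4, 0/15. The minimum cut is marked.]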

After all the edges from source to sink are saturated, the source continues to search for an augmenting path; this final search is the last wave. We use the vertex with the smallest (last_wave + flow_passed) as the next source.

Vertex with maximum average distance

Compute each vertex’s average distance to the other vertices, then order the vertices by this distance from maximum to minimum. Begin with the vertex that has the maximum average distance. After each minimum cut, some vertices are cut off. The next source is the vertex with maximum average distance among the vertices that have not been cut.
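A small sketch of this selection rule follows. It assumes Euclidean distance and that the averages are computed once over the whole cluster rather than recomputed after each cut; the transcript does not say which. The function names are hypothetical.

```python
import math


def average_distances(points):
    """Average Euclidean distance from each point to all other points."""
    n = len(points)
    return [sum(math.dist(p, q) for q in points) / (n - 1) for p in points]


def next_source(points, removed):
    """Index of the remaining point with the largest average distance."""
    avg = average_distances(points)
    candidates = [i for i in range(len(points)) if i not in removed]
    return max(candidates, key=lambda i: avg[i])


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
first = next_source(points, removed=set())   # nothing cut yet
second = next_source(points, removed={3})    # after point 3 has been cut
```

Here the isolated point at 10.0 is chosen first; once it has been cut, the point at 0.0 has the largest average distance among the rest.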


Loop      Max Flow
No. 4     1267
No. 1     1269
No. 3     3256
No. 5     3937
No. 8     5939
No. 7     7717
No. 14    8962
No. 9     10148
No. 10    16194
No. 2     16533
No. 13    17793
No. 6     25378
No. 11    63797
No. 12    160515
No. 15    359560
No. 17    427908
No. 16    1307310

• Users input the number of outliers or outlier groups they want.

• Use the maximum flow as the stop condition:

If D_flow < D_avg, then stop.

D_flow = 1 / (n-th root of max_flow)

D_avg = average distance of the remaining data
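The stop condition can be written out as a short sketch. It reads the garbled formula as D_flow = 1 / max_flow^(1/n), which is an interpretation: with capacities of the form (1 / distance)^n, this converts a flow value back into a distance-like quantity that can be compared with D_avg. The function names are hypothetical.

```python
def d_flow(max_flow, n):
    """Distance-like value from a max flow: 1 / (n-th root of max_flow)."""
    return 1.0 / max_flow ** (1.0 / n)


def should_stop(max_flow, n, d_avg):
    """Stop when the flow-derived distance falls below the average distance."""
    return d_flow(max_flow, n) < d_avg


# With n = 2, a max flow of 16 corresponds to a distance of 1 / 4.
example_distance = d_flow(16.0, 2)
```

Under this reading, a small maximum flow gives a large D_flow, so strong outliers keep the loop running, and the loop ends once the next candidate is closer than the average distance of the remaining data.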

Loop     Cut                  Max Flow
No. 1    vertex 4             1267
No. 2    vertex 1             1269
No. 3    vertex 3             3256
No. 4    vertex 5             3937
No. 5    vertex 8             5939
No. 6    vertices 7, 9, 10    16531
No. 7    vertex 2             16533
No. 8    vertex 6             25378
No. 9    vertex 11            52498

1. Density-based algorithm

• Measures the difference in density between an object and its neighboring objects.

2. Distribution-based algorithm

• An object O in a dataset T is a UO (p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O.

3. Distance-based algorithm

• The problem of finding all DB (p, D)-outliers can be solved by answering a nearest neighbor or range query centered at each object O.

4. Depth-based algorithm

• Depth-based algorithms find outliers by peeling off the outer layers of convex hulls.

5. Clustering-based algorithm

• Outliers are a byproduct of the clustering process: they are the objects that do not end up in any cluster.
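For comparison with the network flow approach, the distance-based definition in item 3 can be sketched as a brute-force check. It assumes the usual reading that an object is a DB (p, D)-outlier when at least a fraction p of the objects in the dataset lie farther than D from it, and it uses a plain scan where an index would answer the range query in practice. The function name is hypothetical.

```python
import math


def db_outliers(points, p, D):
    """Indices of points with at least a fraction p of the data farther than D."""
    n = len(points)
    result = []
    for i, o in enumerate(points):
        # Range query by brute force: count objects farther than D from o.
        far = sum(1 for q in points if math.dist(o, q) > D)
        if far >= p * n:
            result.append(i)
    return result


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
found = db_outliers(points, p=0.75, D=5.0)
```

With these values only the point at 10.0 qualifies, since three of the four objects are more than 5 away from it.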

Use maximum average distance to select next source

Outlier Detection and Application

Previous work

Repair poor quality of a cluster

Poor quality clusters

Theory foundation

Outlier Detection by Network Flow

Find an outlier/outlier group

• Scale up the capacity by raising the original capacity to the n-th power.


Choose next source

Outliers and maximum flow results

Different parameters

Stop criteria

Different k nearest neighbors

[Figure panels: K = 7, K = 10, K = 15]

As k increases, the network has more edges, and outliers are split into more pieces.

When maximum average distance is used to select the next source, outliers are split into more pieces.

Algorithm process

Final results after running the algorithm again on the coarse network.

Because of the order of removal, outliers 13 and 14 have quite different maximum flows. We coarsen the graph by treating each cut as a single vertex and merging edges.
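One way to sketch the coarsening step is shown below. It assumes that merging edges means summing the capacities of parallel edges between two merged vertices and dropping edges that fall inside a single group; the poster does not spell out the merge rule, so this is an assumption. The function name and the ("group", index) labels are hypothetical.

```python
from collections import defaultdict


def coarsen(capacity, groups):
    """Collapse each group of vertices into one vertex and merge edges.

    capacity maps directed edges (u, v) to capacities; groups is a list of
    vertex sets, one per cut. Parallel edges are merged by summing.
    """
    label = {}
    for index, group in enumerate(groups):
        for v in group:
            label[v] = ("group", index)

    coarse = defaultdict(float)
    for (u, v), c in capacity.items():
        cu, cv = label.get(u, u), label.get(v, v)
        if cu != cv:  # edges inside one group disappear
            coarse[(cu, cv)] += c
    return dict(coarse)


# Triangle on vertices 1, 2, 3; vertices 1 and 2 were removed as one cut.
capacity = {
    (1, 2): 1.0, (2, 1): 1.0,
    (2, 3): 2.0, (3, 2): 2.0,
    (1, 3): 0.5, (3, 1): 0.5,
}
coarse = coarsen(capacity, [{1, 2}])
```

The coarse network can then be passed to the same maximum flow routine, so that candidate outliers are compared with each other as whole groups.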

Outliers

Noisy data

Novel information

Anomaly

Deviation

Set up the Network

Capacity_new = (Capacity_original)^n