1. Cluster the data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as the source/sink s, and choose its farthest vertex as the sink/source t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and take the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier group from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Coarsen the graph and run the algorithm again; select outliers from the candidate outliers.
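Step 4 can be illustrated with a minimal sketch. It assumes an undirected graph given as (u, v, capacity) triples and uses breadth-first augmenting paths (Edmonds-Karp), which is one standard way to compute a maximum flow and not necessarily the implementation used in this work. The function name `min_cut_sides` and the toy graph are hypothetical.

```python
from collections import defaultdict, deque

def min_cut_sides(edges, s, t):
    """Return (max_flow, source_side, sink_side) for undirected capacitated edges."""
    cap = defaultdict(lambda: defaultdict(float))
    for u, v, c in edges:              # an undirected edge becomes two directed arcs
        cap[u][v] += c
        cap[v][u] += c
    vertices = set(cap)

    def reachable_from_source():
        # BFS over arcs that still have residual capacity; records parent pointers.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = reachable_from_source()
        if t not in parent:            # no augmenting path left: the flow is maximum
            break
        bottleneck, v = float("inf"), t
        while parent[v] is not None:   # smallest residual capacity on the path
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:   # push the bottleneck along the path
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

    source_side = set(parent)          # vertices still reachable from s
    return flow, source_side, vertices - source_side

# Hypothetical toy cluster: x is only weakly attached to the triangle a, b, c.
edges = [("a", "b", 10), ("b", "c", 10), ("a", "c", 10), ("c", "x", 1), ("b", "x", 2)]
flow, source_side, sink_side = min_cut_sides(edges, "a", "x")
candidate = min(source_side, sink_side, key=len)   # smaller side of the cut
```

In this toy graph the smaller side of the cut is the weakly attached vertex, which is the behavior step 4 relies on.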
Outlier Detection and Evaluation by Network Flow
Ying Liu
Advisor: Alan P. Sprague
{liuyi, sprague}@cis.uab.edu
Department of Computer and Information Sciences
University of Alabama at Birmingham
http://www.cis.uab.edu/kddm
Fraud detection for credit cards
Intrusion detection in computer networks
Detecting novelties in images
Detecting network bottlenecks
Experiments and Implementation Details
To repair poor-quality clusters generated by a clustering algorithm, we remove from each cluster the outliers that do not belong in it.
Theory foundation: Maximum Flow/Minimum Cut
The maximum flow problem is to find a flow f for which the total flow is maximum. The total flow can be measured at the sink, or it can be measured at any cut separating the source from the sink. (Ford-Fulkerson algorithm, 1962.)
s->a->b->t: 12
s->a->c->d->b->t: 7
s->c->b->t: 9
s->c->d->t: 3
maximum-flow = minimum-cut = 12 + 7 + 9 + 3 = 31
[Figure: example flow network on vertices s, a, b, c, d, t. Edges are labeled flow/capacity: 19/19, 12/13, 7/10, 9/9, 7/7, 12/12, 28/30, 3/3, 10/11.]
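The arithmetic of the example can be checked by adding the four path flows onto the edges they use. The sketch below is illustrative only; the variable names are hypothetical, and the paths and amounts are the ones listed above.

```python
from collections import Counter

# The four augmenting paths of the example and the flow sent along each.
paths = [
    (["s", "a", "b", "t"], 12),
    (["s", "a", "c", "d", "b", "t"], 7),
    (["s", "c", "b", "t"], 9),
    (["s", "c", "d", "t"], 3),
]

# Accumulate the flow carried by each directed edge.
edge_flow = Counter()
for vertices, amount in paths:
    for u, v in zip(vertices, vertices[1:]):
        edge_flow[(u, v)] += amount

# The total flow can be measured leaving the source or entering the sink.
total_out_of_source = sum(f for (u, _), f in edge_flow.items() if u == "s")
total_into_sink = sum(f for (_, v), f in edge_flow.items() if v == "t")

# Flow conservation: at every internal vertex, inflow equals outflow.
conserved = all(
    sum(f for (_, v), f in edge_flow.items() if v == x)
    == sum(f for (u, _), f in edge_flow.items() if u == x)
    for x in "abcd"
)
```

The per-edge totals computed this way are the same nine flow values that appear as the first number of each flow/capacity label in the example network.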
Intuition: Suppose t is an outlier and s is the farthest vertex from t. Suppose further that t is far from all other points in the data set. Then each edge between t and the other vertices has small capacity, so ({t}, C-{t}) is a cut of small capacity.
Cluster No. 20, 591 points
7 nearest neighbors: 591 points, 5028 edges. Edges are in two directions.
• Compute the k nearest neighbors of each vertex, and make sure all vertices are connected.
• Compute the capacity between two vertices from the distance between them:
Capacity = 1 / Distance
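A minimal sketch of this network setup follows, assuming Euclidean distance and a brute-force neighbor search. The function names, the `power` parameter (for the nth-power scaling mentioned elsewhere on the poster), and the choice to skip coincident points are assumptions made for illustration. The poster does not say how connectivity is enforced, so the sketch only checks it.

```python
import math

def knn_capacity_graph(points, k, power=1):
    """Map each undirected edge (i, j), i < j, to capacity (1 / distance) ** power."""
    capacity = {}
    for i, p in enumerate(points):
        # Brute-force k nearest neighbors of point i (ties broken by index).
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            if d == 0:
                continue  # assumption: coincident points are skipped to avoid 1/0
            capacity[(min(i, j), max(i, j))] = (1.0 / d) ** power
    return capacity

def is_connected(n, capacity):
    """Check that the n vertices form a single connected component."""
    adj = {i: set() for i in range(n)}
    for i, j in capacity:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == n

# Hypothetical data: three nearby points and one distant point.
points = [(0, 0), (1, 0), (0, 1), (10, 10)]
capacity = knn_capacity_graph(points, k=2)
connected = is_connected(len(points), capacity)
```

Edges that reach the distant point get a much smaller capacity than edges among the nearby points, which is what makes a cut around an outlier cheap.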
Red vertices are the source side; blue vertices are the sink side.
Randomly select a vertex to start, and find its farthest vertex by single-source shortest path. Use these two vertices as source and sink to run the Maximum-Flow/Minimum-Cut algorithm. Here the flow from source to sink is 1269. We use the maximum flow as the measure of outlier degree: for an outlier or outlier group, a low maximum flow means a strong outlier, and a high maximum flow means a weak outlier.
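Finding the farthest vertex by single-source shortest path can be sketched with Dijkstra's algorithm. The sketch assumes that path lengths are sums of edge distances (not capacities) and that the graph is given as an adjacency list; the function name `farthest_vertex` and the toy graph are hypothetical.

```python
import heapq

def farthest_vertex(adj, source):
    """Return (vertex, dist): the reachable vertex with the largest shortest-path distance."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return max(dist, key=dist.get), dist

# Hypothetical graph: adjacency list of (neighbor, distance) pairs.
adj = {
    "a": [("b", 1), ("c", 5)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 5), ("b", 2), ("d", 1)],
    "d": [("c", 1)],
}
sink, dist = farthest_vertex(adj, "a")
```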
Vertex with smallest (last_wavemin+flow_passed)
[Figure: flow network on vertices s, a, b, c, d, e, f, t with the minimum cut marked. Edges are labeled flow/capacity: 3/3, 3/10, 3/12, 0/8, 0/4, 0/15.]
After all the edges from source to sink have been saturated, the source still tries once more to find an augmenting path; this attempt is the last wave. We use the vertex with the smallest (last_wave + flow_passed) as the next source.
Vertex with maximum average distance
Compute each vertex's average distance to the other vertices, then order the vertices by this distance from maximum to minimum. Begin the process from the vertex with the maximum average distance. After the minimum cut, some vertices are cut off. The next source is chosen from among the vertices not cut, as the one with the maximum average distance.
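This selection rule can be sketched as follows, assuming Euclidean distance and that the average distances are computed once over the full data set, as the description suggests. The names `average_distances`, `next_source`, and `removed` are hypothetical.

```python
import math

def average_distances(points):
    """Average distance from each point to all the other points."""
    n = len(points)
    return [sum(math.dist(p, q) for q in points) / (n - 1) for p in points]

def next_source(points, removed):
    """Index of the not-yet-cut point with the largest average distance."""
    avg = average_distances(points)
    candidates = [i for i in range(len(points)) if i not in removed]
    return max(candidates, key=lambda i: avg[i])

# Hypothetical data: points 3 and 4 lie far from the other three.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (-8, 0)]
first = next_source(points, removed=set())
second = next_source(points, removed={first})
```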
Loop      Max Flow
No. 4     1267
No. 1     1269
No. 3     3256
No. 5     3937
No. 8     5939
No. 7     7717
No. 14    8962
No. 9     10148
No. 10    16194
No. 2     16533
No. 13    17793
No. 6     25378
No. 11    63797
No. 12    160515
No. 15    359560
No. 17    427908
No. 16    1307310
• The user specifies the number of outliers or outlier groups wanted.
• Use the maximum flow as the stop condition:
If Dflow < Davg, then stop.
Dflow = 1 / (nth root of max_flow)
Davg = average distance of the remaining data
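The flow-based stop condition can be written as a small function. This is a sketch under the assumption that n is the same exponent used to scale the capacities; the function name and the numeric values of n and Davg in the example are hypothetical, while the two flow values are taken from the results table.

```python
def should_stop(max_flow, n, avg_distance):
    """Stop when the distance implied by the flow falls below the average distance."""
    d_flow = 1.0 / (max_flow ** (1.0 / n))   # Dflow = 1 / (nth root of max_flow)
    return d_flow < avg_distance

# Hypothetical settings: n = 2 and an average distance of 0.01.
stop_on_low_flow = should_stop(1269, n=2, avg_distance=0.01)
stop_on_high_flow = should_stop(1307310, n=2, avg_distance=0.01)
```

A low maximum flow gives a large Dflow, so strong outliers keep the loop running; a high maximum flow gives a small Dflow and triggers the stop.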
Loop     Cut                  Max Flow
No. 1    vertex 4             1267
No. 2    vertex 1             1269
No. 3    vertex 3             3256
No. 4    vertex 5             3937
No. 5    vertex 8             5939
No. 6    vertices 7, 9, 10    16531
No. 7    vertex 2             16533
No. 8    vertex 6             25378
No. 9    vertex 11            52498
1. Density-based algorithm
• measure the difference in density between an object and its neighboring objects
2. Distribution-based algorithm
• An object O in a dataset T is a UO(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O.
3. Distance-based algorithm
• The problem of finding all DB (p, D)-outliers can be solved by answering a nearest neighbor or range query centered at each object O.
4. Depth-based algorithm
• Depth-based algorithms find the outliers by peeling off the outer layers of convex hulls.
5. Clustering-based algorithm
• Outliers are a byproduct of the clustering process: they are the objects that do not end up in any cluster.
Use maximum average distance to select next source
Outlier Detection and Application
Previous work
Repair poor quality of a cluster
Poor quality clusters
Theory foundation
Outlier Detection by Network Flow
Outlier Detection and Application
Find an outlier/outlier group
• Scale up the capacity by raising the original capacity to the nth power.
Choose next source
Outliers and maximum flow results
Different parameters
Stop criteria
[Figure: Different k nearest neighbors. Panels: K = 7, K = 10, K = 15.]
As k increases, the network has more edges, and outliers are split into more pieces.
When the maximum average distance is used to select the next source, outliers are split into more pieces.
Algorithm process
Final results after running the algorithm again on the coarse network.
Because of the order of removal, outliers 13 and 14 have quite different maximum flows. We therefore coarsen the graph, treating each cut as a single vertex and merging the edges between them.
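The coarsening step can be sketched as a graph contraction. The poster does not say how merged edges are combined; the sketch below assumes that parallel edges between two groups have their capacities summed. The function name `coarsen`, the `group_of` mapping, and the toy graph are hypothetical.

```python
from collections import defaultdict

def coarsen(capacity, group_of):
    """Contract each group of vertices into one vertex.

    capacity: dict mapping undirected edges (u, v) to capacities.
    group_of: dict mapping a vertex to its group label; unmapped vertices stay as-is.
    Assumption: capacities of merged parallel edges are summed.
    """
    merged = defaultdict(float)
    for (u, v), c in capacity.items():
        gu, gv = group_of.get(u, u), group_of.get(v, v)
        if gu == gv:
            continue                      # edge inside a group disappears
        key = (gu, gv) if gu <= gv else (gv, gu)
        merged[key] += c
    return dict(merged)

# Hypothetical example: vertices a and b were removed together as one cut, G1.
capacity = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 4, ("c", "d"): 3}
coarse = coarsen(capacity, {"a": "G1", "b": "G1"})
```

The maximum-flow procedure can then be run again on the coarse graph, so that each candidate group is evaluated as a single vertex.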
Outliers
Noisy data
Novel information
Anomaly
Deviation
Set up the Network
new Capacity = (original Capacity)^n