Outlier Detection
Lian Duan
Management Sciences, UIOWA

What are outliers?
Hawkins-Outlier: An outlier is an observation that deviates so much from other observations as to arouse suspicion that it is generated by a different mechanism.
A relative concept:
  Situation
  Your angle
  An example: Suppose you are the US president.
  Common thing: Compare to history and to the majority.

Outlier Detection and Clustering
Interwoven with each other.
Not all objects should belong to a certain cluster.
Abnormal events might have temporal or spatial locality. (Body Temperature)
Single point outliers and cluster-based outliers

Previous Work
DB(pct,dmin)-Outlier [Binary]: An object p in a dataset D is a DB(pct,dmin)-outlier if at least percentage pct of the objects in D lie at a distance greater than dmin from p.
Density-based local outlier [Degree]: Given the lowest acceptable bound of LOF, LOFLB, an object p in a dataset D is a density-based local outlier if LOF(p)>LOFLB.
Other statistical methods.
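
The binary DB(pct,dmin) test above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: the function name `is_db_outlier`, the use of Euclidean distance, the toy points, and the parameter values are all assumptions made for the example.

```python
import math

def is_db_outlier(p, data, pct, dmin):
    """Binary DB(pct, dmin) test: True if at least fraction `pct` of the
    objects in `data` lie farther than `dmin` from `p`.
    Euclidean distance is an assumption of this sketch."""
    far = sum(1 for q in data if math.dist(p, q) > dmin)
    return far >= pct * len(data)

# Toy data (made up): four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flags = [is_db_outlier(p, points, pct=0.75, dmin=1.0) for p in points]
print(flags)
```

The answer is yes or no for each object, which is why the slide labels this definition [Binary], in contrast to the degree that LOF assigns.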

Local Outlier Factor
Local Density: the inverse of the average distance from an object to its k-nearest neighbors.
Local Outlier Factor: the ratio of the average local density of p's k-nearest neighbors to the local density of p.
The LOF of each object depends on the density of the cluster relative to it and the distance between it and the cluster.
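
A minimal sketch of these two quantities, following the simplified wording of this slide. The standard LOF definition uses reachability distances rather than plain distances, so this is not a full LOF implementation. The function names, the Euclidean metric, and the toy data are assumptions, and the sketch assumes there are no duplicate points and that k is smaller than the number of objects.

```python
import math

def knn(i, data, k):
    """Indices of the k nearest neighbors of data[i], excluding i itself."""
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: math.dist(data[i], data[j]))
    return others[:k]

def local_density(i, data, k):
    """Inverse of the average distance from data[i] to its k nearest neighbors."""
    neighbors = knn(i, data, k)
    avg_dist = sum(math.dist(data[i], data[j]) for j in neighbors) / len(neighbors)
    return 1.0 / avg_dist

def lof(i, data, k):
    """Average local density of the neighbors divided by the local density of data[i]."""
    neighbors = knn(i, data, k)
    neighbor_density = sum(local_density(j, data, k) for j in neighbors) / len(neighbors)
    return neighbor_density / local_density(i, data, k)

# Toy data (made up): corners of a unit square plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
scores = [lof(i, points, k=2) for i in range(len(points))]
print(scores)
```

Objects inside a uniformly dense cluster score close to 1, while the distant point scores well above 1, which is the degree that gets compared against a bound such as LOFLB.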

Illustration of LOF
[Figure: an example]

LOF-Outlier vs. DB(pct,dmin)-Outlier

LDBSCAN=DBSCAN+LOF
DBSCAN: Retrieve all points which are density-reachable from the given Core-Point(MinPts, ε).
Problem: How many are many?
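
The DBSCAN retrieval step can be sketched as a breadth-first expansion from a core point. This is an illustrative sketch under assumptions: the function names, Euclidean distance, counting the point itself toward MinPts, and the toy data are not from the slides.

```python
import math
from collections import deque

def region_query(i, data, eps):
    """Indices of all points within eps of data[i] (data[i] itself included)."""
    return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

def density_reachable(seed, data, eps, min_pts):
    """Indices density-reachable from data[seed]; empty set if seed is not a core point."""
    if len(region_query(seed, data, eps)) < min_pts:
        return set()
    cluster = {seed}
    queue = deque([seed])
    while queue:
        i = queue.popleft()
        neighbors = region_query(i, data, eps)
        if len(neighbors) < min_pts:
            continue  # border point: stays in the cluster but does not expand it
        for j in neighbors:
            if j not in cluster:
                cluster.add(j)
                queue.append(j)
    return cluster

# Toy data (made up): a dense group and a sparse group on a line.
data = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0),
        (10.0, 0.0), (11.0, 0.0), (12.0, 0.0), (13.0, 0.0)]
dense = density_reachable(1, data, eps=0.15, min_pts=3)
sparse_small_eps = density_reachable(5, data, eps=0.15, min_pts=3)
sparse_large_eps = density_reachable(5, data, eps=1.5, min_pts=3)
print(dense, sparse_small_eps, sparse_large_eps)
```

With ε = 0.15 the sparse group has no core point at all, and it only appears once ε is raised to 1.5. Whether a neighborhood holds "many" points depends on an absolute threshold, which is the problem the slide raises.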

LDBSCAN (continued)
A relative concept of core points and similarity.
  Core Points: LOF<LOFUB
  Similarity: p∈N_MinPts(q) and LRD(q)/(1+pct)<LRD(p)<LRD(q)*(1+pct)
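
The two relative tests can be written as small predicates and plugged into a DBSCAN-style expansion. The sketch below assumes that LRD and LOF values and the MinPts-neighborhoods have already been computed. The function names, the expansion loop, and every number in the example are hypothetical and only illustrate the conditions stated on this slide.

```python
from collections import deque

def is_core(i, lof, lofub):
    """Core point test from the slide: LOF < LOFUB."""
    return lof[i] < lofub

def is_similar(p, q, lrd, neighbors, pct):
    """p is in the MinPts-neighborhood of q and
    LRD(q)/(1+pct) < LRD(p) < LRD(q)*(1+pct)."""
    return p in neighbors[q] and lrd[q] / (1 + pct) < lrd[p] < lrd[q] * (1 + pct)

def expand_cluster(seed, lrd, lof, neighbors, lofub, pct):
    """Grow a cluster from a core point by following similar neighbors
    (an assumed DBSCAN-like loop, not the authors' exact procedure)."""
    if not is_core(seed, lof, lofub):
        return set()
    cluster = {seed}
    queue = deque([seed])
    while queue:
        q = queue.popleft()
        if not is_core(q, lof, lofub):
            continue  # non-core members do not expand the cluster
        for p in neighbors[q]:
            if p not in cluster and is_similar(p, q, lrd, neighbors, pct):
                cluster.add(p)
                queue.append(p)
    return cluster

# Hypothetical precomputed values for four objects.
lrd = [1.0, 0.9, 1.1, 0.3]
lof = [1.0, 1.05, 0.95, 3.2]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 3], 3: [2, 0]}
cluster = expand_cluster(0, lrd, lof, neighbors, lofub=2.0, pct=0.5)
print(cluster)
```

Object 3 is left out because its LRD of 0.3 falls outside the band around its neighbor's LRD, and it cannot seed a cluster itself because its LOF exceeds LOFUB.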

LDBSCAN (continued)
The same clustering idea as DBSCAN
Parameters:
  LOFUB
  pct

LDBSCAN (continued)

Advantage
Density-based vs Partitioning Clustering:
  Small clusters, arbitrary shape, and noise.

Advantage (continued)
LDBSCAN vs DBSCAN
  Easier to select proper parameters.
  Handle local density problems.

Advantage (continued)
LDBSCAN vs OPTICS
  Comet-like clusters
  Hierarchical structure

Performance
Experiment facility: Pentium IV 2.4 GHz, 512 MB memory, Red Hat 9.0, JDK 1.4.2
Algorithm steps:
  Search k-nearest neighbors: O(n^2) or O(n log n)
  Calculate LRDs and LOFs: O(n)
  Clustering: O(n)
Its computational complexity is equal to that of LOF.

Experiment
Wisconsin Breast Cancer Data
After data preprocessing, the resultant dataset has 327 (57.8%) benign records and 239 (42.2%) malignant records with nine attributes.
LDBSCAN discovers two clusters and five single point outliers.
  Cluster A contains 296 benign records and 6 malignant records. Its average local density is 0.743.
  Cluster B contains 26 benign records and 233 malignant records. Its average local density is 0.167.
  Five single point outliers, whose LOFs fall in the range from 3 to 5.
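
As a quick arithmetic check on these counts, the short sketch below computes the share of the majority label in each cluster and confirms that the clusters plus the single point outliers account for every record. The "majority share" measure and the variable names are additions for illustration, not something reported on the slide.

```python
# Counts copied from the Wisconsin Breast Cancer slide.
clusters = {
    "A": {"benign": 296, "malignant": 6},
    "B": {"benign": 26, "malignant": 233},
}
single_point_outliers = 5

def majority_share(counts):
    """Fraction of a cluster's records that carry its most common label."""
    return max(counts.values()) / sum(counts.values())

shares = {name: majority_share(c) for name, c in clusters.items()}
total_records = sum(sum(c.values()) for c in clusters.values()) + single_point_outliers

print(shares)
print(total_records)
```

Cluster A is about 98.0% benign and cluster B about 90.0% malignant, and the total of 566 matches the 327 benign plus 239 malignant records of the preprocessed dataset.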

Experiment (continued)
Boston Housing Data
After data preprocessing, the resultant dataset has 506 records with 14 attributes.
Cluster: (1, 82, 0.556); (2, 345, 0.528); (3, 26, 0.477); (4, 34, 0.266); (5, 9, 0.228); (6, 6, 0.127).
4 single point outliers.
Cluster 5 vs Cluster 6 (from cluster 1)
  24.514 (bigger per capita crime rate) vs 20.005;
284th record (from cluster 4): LRD=0.155, LOF=1.468.
  2nd attribute: higher proportion of residential land zoned for lots.
  3rd attribute: lower proportion of non-retail business acres per town.

Appendix: Cluster-based Outliers
Definition 1 (Upper Bound of the Cluster-Based Outlier): Let C1, ..., Ck be the clusters of the database D discovered by LDBSCAN in the sequence that |C1|≥|C2|≥…≥|Ck|. Given parameter α, the number of the objects in the cluster Ci is the UBCBO if (|C1|+|C2|+…+|Ci-1|)≥|D|*α and (|C1|+|C2|+…+|Ci-2|)<|D|*α.
Definition 2 (Cluster-based outlier): Let C1, ..., Ck be the clusters of the database D discovered by LDBSCAN. Cluster-based outliers are the clusters in which the number of the objects is no more than UBCBO.
Definition 3 (Cluster-based outlier factor): Let C1 be a cluster-based outlier and C2 be the nearest non-outlier cluster of C1. The cluster-based outlier factor of C1 is defined as
$$\mathrm{CBOF}(C_1) = |C_1| \cdot \mathrm{dist}(C_1, C_2) \cdot \sum_{p_i \in C_2} \mathrm{lrd}(p_i) \,/\, |C_2|$$
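
Definitions 1 to 3 can be sketched as follows. Several things here are assumptions made for illustration: the function names, the value α = 0.95, reading the second number of each Boston Housing tuple as the cluster size, returning 0 when no cluster follows the one that reaches the α threshold, and passing dist(C1, C2) in as a precomputed number (the slides do not say how it is measured). The CBOF inputs in the example are made up.

```python
def ubcbo(cluster_sizes, n_objects, alpha):
    """Upper bound of the cluster-based outlier (Definition 1)."""
    sizes = sorted(cluster_sizes, reverse=True)  # |C1| >= |C2| >= ... >= |Ck|
    covered = 0
    for idx, size in enumerate(sizes):
        covered += size
        if covered >= n_objects * alpha:
            # sizes[idx] plays the role of C_{i-1}; the next cluster is C_i.
            # Returning 0 when no cluster follows is an assumption of this sketch.
            return sizes[idx + 1] if idx + 1 < len(sizes) else 0
    return 0

def cluster_based_outliers(sizes_by_id, n_objects, alpha):
    """Clusters whose size is no more than UBCBO (Definition 2)."""
    bound = ubcbo(sizes_by_id.values(), n_objects, alpha)
    return {cid: size for cid, size in sizes_by_id.items() if size <= bound}

def cbof(size_c1, dist_c1_c2, lrds_c2):
    """CBOF(C1) = |C1| * dist(C1, C2) * sum of lrd(p_i) over C2 / |C2| (Definition 3)."""
    return size_c1 * dist_c1_c2 * sum(lrds_c2) / len(lrds_c2)

# Cluster sizes read from the Boston Housing slide; |D| = 506; alpha is assumed.
boston = {1: 82, 2: 345, 3: 26, 4: 34, 5: 9, 6: 6}
bound = ubcbo(boston.values(), n_objects=506, alpha=0.95)
outlier_clusters = cluster_based_outliers(boston, n_objects=506, alpha=0.95)
factor = cbof(size_c1=2, dist_c1_c2=3.0, lrds_c2=[0.5, 0.5, 1.0, 1.0])
print(bound, outlier_clusters, factor)
```

Under the assumed α, the four largest clusters cover 487 of the 506 records, so the fifth-largest cluster sets the bound at 9 and clusters 5 and 6 are flagged as cluster-based outliers.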

Experiment (continued)
Abnormal Network Throughput Detection
Network throughput has characteristics that are consistent with self-similarity.
Monitoring 300 nodes every 5 minutes: 3600 readings per hour
Single point VS. Cluster-based
  30 VS. 3 alerts per hour
  Occasional fluctuations VS. Abnormal events over a period

Conclusion
Outlier detection and clustering improve each other's accuracy.
Cluster-based outlier detection is more meaningful.
ADVERTISING: LDBSCAN is good at both outlier detection and clustering.
  Clusters with arbitrary shape and different local density
  Single point outliers and cluster-based outliers
  Degree of outliers