Class Outlier Mining: Distance-Based Approach. By Nabil M. Hewahi and Motaz K. Saad. Presented by Motaz K. Saad, Jan. 2008. International Journal of Intelligent Technology, Vol. 2, No. 1, pp. 55-68, 2007.

Upload: motaz-saad

Post on 21-May-2015


Page 1: Class Outlier Mining

Class Outlier Mining: Distance-Based Approach

By Nabil M. Hewahi and Motaz K. Saad

Presented by Motaz K. Saad

Jan. 2008

International Journal of Intelligent Technology, Vol. 2, No. 1, pp. 55-68, 2007

Page 2: Class Outlier Mining

Abstract

• In large datasets, identifying exceptions or rare cases (unusual patterns) with respect to a group of similar cases is a significant problem.

• The traditional problem (Outlier Mining) is to find exceptions or rare cases in a dataset irrespective of their class labels; such cases are considered rare events with respect to the whole dataset.

Page 3: Class Outlier Mining

Abstract (Cont.)

• Present an overview of Class Outlier.

• Introduce a novel definition of a class outlier and propose the Class Outlier Factor (COF).

• Propose a new algorithm for class outlier mining.

• Present experimental results.

• Perform a comparison study.

Page 4: Class Outlier Mining

Outlier Definition

• An outlier is a data object that does not comply with the general behavior of the data (an unusual pattern).

• It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis.

Page 5: Class Outlier Mining

Outlier Mining

• It is the problem of detecting rare events, deviant objects, and exceptions.

• It is an important data mining issue in knowledge discovery and has attracted increasing interest in recent years.

Page 6: Class Outlier Mining

Outliers

[Figure: example data with the outliers marked]

Page 7: Class Outlier Mining

Outlier Mining: Business Applications

• Medical

• Education

• Fraud detection

• Credit approval

• Stock market analysis

• Identifying computer network intrusions

• Data cleaning

• Surveillance and auditing

• Health monitoring systems

• Insurance, banking, money laundering, telecommunications, etc.

Page 8: Class Outlier Mining

Outlier Detection Methods

• Statistical-based (distribution-based)

• Clustering

• Distance-based

– K Nearest Neighbors (KNN)

– Density-based

• Model-based (neural network): Replicator Neural Network (RNN)

Page 9: Class Outlier Mining

Statistical (Distribution) based Outlier Detection Method


Page 10: Class Outlier Mining

Variation of the Distance-Based Approach for Detecting Outliers

Page 11: Class Outlier Mining

NN Model for Outlier Detection

[Figure: A schematic view of a fully connected Replicator Neural Network.]

Page 12: Class Outlier Mining

What is Class Outlier?

• None of the previous definitions of outliers considers the class labels of the data set.

• This means that all the previous outlier mining methods work on the overall data set without looking closely at each class label separately.

Page 13: Class Outlier Mining

Class Outlier vs. Outlier

[Figure: data points of the classes, with the class outliers and an ordinary outlier marked]

Page 14: Class Outlier Mining

Class Outlier Example in heart‐statlog dataset

Inst#   Att.1   2   3   4     5     6   7   8     9   10    11   12   13   Class
69      47      1   3   108   243   0   0   152   0   0     1    0    3    present
62      44      1   3   120   226   0   0   169   0   0     1    0    3    absent
150     41      1   3   112   250   0   0   179   0   0     1    0    3    absent
179     50      1   3   129   196   0   0   163   0   0     1    0    3    absent
38      42      1   3   130   180   0   0   150   0   0     1    0    3    absent
253     51      1   3   110   175   0   0   123   0   0.6   1    0    3    absent
23      47      1   3   112   204   0   0   143   0   0.1   1    0    3    absent

Class outlier: instance 69, the only "present" case among otherwise very similar "absent" instances.

Class Outlier Mining takes the class labels of the dataset into consideration.

Page 15: Class Outlier Mining

Class Outlier Example in house‐vote‐84 dataset

Inst#   Att.1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   Class
407     n       n   n   y   y   y   n   n   n   n    y    y    y    y    n    n    democrat
306     n       n   n   y   y   y   n   n   n   n    n    y    y    y    n    n    republican
83      n       n   n   y   y   y   n   n   n   n    n    y    y    y    n    n    republican
87      n       n   n   y   y   y   n   n   n   n    n    y    y    y    n    n    republican
303     n       n   n   y   y   y   n   n   n   n    n    y    y    y    n    n    republican
119     n       n   n   y   y   y   n   n   n   n    n    y    y    y    n    n    republican
339     y       n   n   y   y   y   n   n   n   n    y    y    y    y    n    n    republican

Class outlier: instance 407, the only democrat among otherwise nearly identical republican voting records.

Class Outlier Mining takes the class labels of the dataset into consideration.

Page 16: Class Outlier Mining

Party Behavior of house-vote-84 dataset

[Figure: bar chart of the number of votes (y, n, no vote) cast by Democrats and Republicans on Issues 3, 4, 5, 8 and 12. X-axis: Issue #; Y-axis: Votes (0 to 300).]

Page 17: Class Outlier Mining

Class Outlier Definitions

• There are three definitions that handle the problem "given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels":

– Semantic Outlier [He, et al. WAIM’02, 2002]

– Cross Outlier [Papadimitriou and Faloutsos, SSTD’03, 2003]

– Generalization of def. 1 & 2 [He, et al. ESWA'04, 2004]

Page 18: Class Outlier Mining

Semantic Outlier

• Semantic Outlier: a data point that behaves differently from other data points in the same class, while looking normal with respect to data points in another class [He, et al. 2002]

Page 19: Class Outlier Mining

Cross Outlier

• Cross-outlier: given two sets (or classes) of objects, find those which deviate with respect to the other set [Papadimitriou and Faloutsos 2003]

Page 20: Class Outlier Mining

Definition (3)

• Generalization of Definition 1 & 2: The generalization does not consider only outliers that deviate with respect to their own class, but also outliers that deviate with respect to other classes [He, et al. 2004].


Page 21: Class Outlier Mining

The proposed definitions

• Distance (Similarity) Function

• K Nearest Neighbors

• PCL(T, K)

• Deviation

• K‐Distance

• Class Outlier

• Class Outlier Factor (COF)


Page 22: Class Outlier Mining

Distance (Similarity) Function

• Given a data set D = {t1, t2, t3, ..., tn} of tuples, where each tuple ti = <ti1, ti2, ti3, ..., tim, Ci> contains m attributes and the class label Ci, the similarity function based on the Euclidean distance between two data tuples X = <x1, x2, x3, ..., xm> and Y = <y1, y2, y3, ..., ym> (excluding the class labels) is

$$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
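The Euclidean distance on this slide can be sketched in a few lines of Python. This is a minimal illustration for numeric attributes only; the function name `euclidean_distance` is hypothetical, and the slides do not say how nominal attributes are compared.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two attribute tuples (class labels excluded)."""
    if len(x) != len(y):
        raise ValueError("both tuples must have the same number of attributes")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```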

Page 23: Class Outlier Mining

K Nearest Neighbours

• For any positive integer K, the K-Nearest Neighbours of a tuple ti are the K closest tuples in the data set.

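A brute-force sketch of the K nearest neighbours lookup, assuming numeric attributes and Euclidean distance. The helper name `k_nearest_neighbours` is hypothetical. It also assumes that the instance itself counts as its own nearest neighbour (distance 0), which is consistent with the range [1/K, 1] that the slides later give for PCL.

```python
import math


def k_nearest_neighbours(data, index, k):
    """Return the indices of the k tuples closest to data[index].

    Assumption: the instance itself is included (it is at distance 0).
    """
    target = data[index]
    # Sort by distance; on ties at distance 0 the instance itself comes first.
    order = sorted(
        range(len(data)),
        key=lambda j: (math.dist(target, data[j]), j != index),
    )
    return order[:k]


data = [(0, 0), (1, 0), (5, 5), (0, 2)]
print(k_nearest_neighbours(data, 0, 3))  # [0, 1, 3]
```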

Page 24: Class Outlier Mining

PCL (T, K)

• The Probability of the class label of the instance T with respect to the class labels of its K Nearest Neighbours.

• In the pictured example, the instance T has the class label y, so the PCL of the instance T is 2/7.
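PCL can be read as the fraction of the K nearest neighbours that share the class label of T. The sketch below reproduces the 2/7 example; the function name `pcl` and the label list are illustrative, and the neighbour list is assumed to include T itself.

```python
def pcl(label, neighbour_labels):
    """Fraction of the K nearest neighbours carrying the same class label as T.

    Assumption: neighbour_labels includes the label of T itself.
    """
    matches = sum(1 for other in neighbour_labels if other == label)
    return matches / len(neighbour_labels)


# T has label "y"; among its 7 nearest neighbours (T included) two are "y".
neighbour_labels = ["y", "y", "x", "x", "x", "x", "x"]
print(pcl("y", neighbour_labels) == 2 / 7)  # True
```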

Page 25: Class Outlier Mining

Deviation (T)

• Given a subset DCL = {t1, t2, t3, ..., th} of a data set D = {t1, t2, t3, ..., tn}, where h is the number of instances in DCL and n is the number of instances in D.

• Given the instance T, DCL contains all the instances that have the same class label as the instance T.

• The Deviation of T is how much the instance T deviates from the DCL subset.

Page 26: Class Outlier Mining

Deviation (T) (Cont.)

• The Deviation is computed by summing the distance between the instance T and every instance in DCL.

$$Deviation(T) = \sum_{i=1}^{h} d(T, t_i), \quad \text{where } t_i \in DCL$$
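A minimal sketch of the Deviation computation, assuming numeric attributes and Euclidean distance; the function name `deviation` and the toy data are hypothetical. Including T itself in DCL adds a distance of 0, so it does not change the sum.

```python
import math


def deviation(data, labels, index):
    """Sum of distances from data[index] to every instance with the same class label."""
    target, target_label = data[index], labels[index]
    return sum(
        math.dist(target, other)
        for other, label in zip(data, labels)
        if label == target_label
    )


data = [(0, 0), (3, 4), (6, 8), (1, 1)]
labels = ["a", "a", "a", "b"]
print(deviation(data, labels, 0))  # 15.0
```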

Page 27: Class Outlier Mining

Deviation (T) (Cont.)


Page 28: Class Outlier Mining

K‐Distance (The Density Factor)

• K-Distance is the distance between the instance T and its K nearest neighbours, i.e. how close the K nearest neighbours are to the instance T.

$$KDist(T) = \sum_{i=1}^{K} d(T, t_i)$$
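The K-Distance can be sketched as the sum of the K smallest distances from T. The helper name `k_distance` is hypothetical, and the instance itself is assumed to be among the K neighbours. That assumption fits the reported results for instance 407 (K = 7, KDist = 6.0), whose six listed neighbours each differ from it in exactly one vote.

```python
import math


def k_distance(data, index, k):
    """Sum of distances from data[index] to its k nearest neighbours.

    Assumption: the instance itself (distance 0) is one of the k neighbours.
    """
    target = data[index]
    distances = sorted(math.dist(target, other) for other in data)
    return sum(distances[:k])


data = [(0, 0), (1, 0), (0, 2), (5, 5)]
print(k_distance(data, 0, 3))  # 3.0
```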

Page 29: Class Outlier Mining

Class Outlier: The proposed Definition

• Class Outliers are the top N instances which satisfy the following:

– The K-Distance to its K nearest neighbours is the least.

– Its Deviation is the greatest.

– It has a different class label from that of its K nearest neighbours.

Page 30: Class Outlier Mining

Class Outlier Factor (COF)

• COF: The Class Outlier Factor of the instance T is the degree of being a Class Outlier. It is defined as:

$$COF(T) = K \cdot PCL(T, K) + \alpha \cdot \frac{1}{Deviation(T)} + \beta \cdot KDist(T)$$

• PCL(T, K) is scaled from [1/K, 1] to [1, K] by multiplying it by K. The α and β factors control the importance and the effects of Deviation and K-Distance, where 0 ≤ α ≤ M and 0 ≤ β ≤ 1. M is a changeable value based on the application domain and the initial experimental results.
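The COF formula combines the three quantities defined on the previous slides. The sketch below is a direct transcription; the function name `class_outlier_factor` and the input values are illustrative only and are not taken from the experiments.

```python
def class_outlier_factor(pcl, deviation, k_dist, k, alpha, beta):
    """COF(T) = K * PCL(T, K) + alpha * (1 / Deviation(T)) + beta * KDist(T)."""
    return k * pcl + alpha * (1.0 / deviation) + beta * k_dist


# Illustrative values: K = 7, PCL = 2/7, Deviation = 500, KDist = 8.
cof = class_outlier_factor(pcl=2 / 7, deviation=500.0, k_dist=8.0,
                           k=7, alpha=100, beta=0.1)
print(cof)  # roughly 2 + 0.2 + 0.8
```

A low PCL, a large Deviation and a small K-Distance all push COF down, which matches the three conditions in the proposed class outlier definition.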

Page 31: Class Outlier Mining

Guidelines for choosing K, α and β

• If the Deviation is in the hundreds, for example, the best value for α is 100; if the Deviation is in the tens, the best value for α is 10, and so on.

• The optimal value of K is determined by trial and error.

– Many factors affect the optimal value; for example, dataset size and the number of classes are very important factors in choosing the value of K.
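One way to read the α guideline is "match α to the order of magnitude of the Deviation values". The helper below is a hypothetical sketch of that reading; using the median as the typical Deviation is an assumption of this example and is not stated in the slides.

```python
import math
import statistics


def suggest_alpha(deviations):
    """Suggest alpha as the power of ten matching the typical Deviation.

    Assumption: the median stands in for the typical Deviation value.
    """
    typical = statistics.median(deviations)
    return 10 ** math.floor(math.log10(typical))


print(suggest_alpha([896.24, 519.94, 845.19]))  # 100
print(suggest_alpha([12.0, 35.5, 80.1]))        # 10
```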

Page 32: Class Outlier Mining

Optimal value of K

• High value of K might result in wrong estimation of PCL.

• Low value of K means KNN is not well utilized.

• Odd values of K would make more sense for PCL value.


Page 33: Class Outlier Mining

CODB Algorithm Basic Steps

• Rank each instance in the dataset D.

– This is done by calling the Rank procedure after providing CODB with all the necessary data, such as the values of α, β and K.

– The Rank procedure finds the rank of each instance using the COF formula (Page 30) and returns it to CODB.

• CODB maintains a list of only the top N class outliers.

– The lower the COF value of an instance, the higher its priority to be a class outlier.
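The steps above can be sketched end to end as follows. This is not the authors' Weka implementation: the function name `codb`, the brute-force distance computation, the inclusion of each instance among its own neighbours, and the handling of a zero Deviation are all assumptions of this example, which covers numeric attributes only.

```python
import heapq
import math


def codb(data, labels, k, alpha, beta, top_n):
    """Rank every instance by COF and return the top_n lowest as (cof, index) pairs."""
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of instances")
    scored = []
    for i in range(n):
        dists = [math.dist(data[i], data[j]) for j in range(n)]
        # K nearest neighbours; the instance itself is included (assumption).
        neighbours = sorted(range(n), key=lambda j: (dists[j], j != i))[:k]
        pcl = sum(labels[j] == labels[i] for j in neighbours) / k
        k_dist = sum(dists[j] for j in neighbours)
        # Deviation: summed distance to all instances of the same class.
        deviation = sum(dists[j] for j in range(n) if labels[j] == labels[i])
        # Assumption: a zero Deviation (not covered in the slides) maps to infinity.
        inv_dev = 1.0 / deviation if deviation > 0 else math.inf
        cof = k * pcl + alpha * inv_dev + beta * k_dist
        scored.append((cof, i))
    # Lower COF means higher priority, so keep the N smallest scores.
    return heapq.nsmallest(top_n, scored)


# Toy data: instance 8 is labelled "a" but sits inside the "b" cluster.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "a"]
top = codb(data, labels, k=3, alpha=10, beta=0.1, top_n=1)
print([index for _, index in top])  # [8]
```

`heapq.nsmallest` keeps only the N best candidates, which mirrors the slide's remark that CODB maintains a list of just the top N class outliers.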

Page 34: Class Outlier Mining

CODB features

• Direct method: no need for clustering.

• Handles numeric (continuous), nominal and mixed datasets.

• Works on datasets with more than two classes.

• More specific in data object ranking than other related methods.

Page 35: Class Outlier Mining

Experimental results

• The CODB algorithm has been applied to five different real-world datasets.

• All the datasets are publicly available at the UCI Machine Learning Repository.

• The datasets are chosen from various domains, with single or mixed data types and two or more class labels. This variety is used to show the capabilities of the proposed algorithm.

Page 36: Class Outlier Mining

We performed experiments on the following datasets:

• Votes dataset: Nominal, 2 class labels.

• Hepatitis dataset: Mixed, 2 class labels.

• Heart-statlog dataset: Mixed, 2 class labels.

• Credit approval (credit-a) dataset: a good mix of attributes (continuous, nominal with small numbers of values, and nominal with larger numbers of values), 2 class labels.

• Vehicle dataset: Continuous, 4 class labels.

Page 37: Class Outlier Mining

Votes dataset experimental results

• 1984 United States Congressional Voting Records Database.

• Includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes.

• 16 Boolean attributes + class name = 17 attributes.

• 435 instances, 2 classes (61.38% Democrats, 38.62% Republicans).

Page 38: Class Outlier Mining

THE TOP 20 CLASS OUTLIERS OF HOUSE‐VOTE‐84 DATASET

• K = 7

• Top N COF = 20

• Distance type: Euclidean Distance

• α = 100

• β = 0.1

• Remove Instance With Missing Values: false

• Replace Missing Values: false


Page 39: Class Outlier Mining

#    Inst. #   PCL (×K)   Dev      KDist   COF
1    407       1          896.24   6.0     1.71158
2    375       1          881.78   8.07    1.92051
3    388       1          857.35   8.07    1.92375
4    161       1          819.35   8.49    1.97058
5    267       1          523.66   8.49    2.03949
6    71        1          535.52   9.44    2.13061
7    77        1          799.64   10.39   2.16429
8    325       2          846.64   8.49    2.96664
9    160       2          829.17   8.49    2.96913
10   382       2          851.76   9.02    3.01986
11   176       2          519.94   8.49    3.04086
12   384       2          832.95   9.34    3.0543
13   365       2          836.65   9.44    3.0634
14   6         2          849.33   9.76    3.0934
15   355       2          524.28   9.12    3.10283
16   164       2          845.19   10.07   3.12576
17   402       2          480.79   10.93   3.30081
18   151       2          879.84   12.0    3.31366
19   173       3          839.0    8.07    3.9263
20   75        3          841.28   9.44    4.06275

The PCL column lists PCL scaled by K, as it enters the COF formula.
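As a consistency check, the first row of the table can be recomputed from the COF formula with the stated parameters (α = 100, β = 0.1) and the tabulated, rounded values for instance 407. The function name `cof_from_row` is hypothetical.

```python
def cof_from_row(k_pcl, dev, k_dist, alpha=100, beta=0.1):
    """Recompute COF from one table row, where k_pcl is PCL already scaled by K."""
    return k_pcl + alpha * (1.0 / dev) + beta * k_dist


# Instance 407: K*PCL = 1, Dev = 896.24, KDist = 6.0
print(round(cof_from_row(1, 896.24, 6.0), 5))  # 1.71158
```

The other rows agree in the same way up to the rounding of the tabulated Dev and KDist values.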

Page 40: Class Outlier Mining

Comparison Study

• We performed a comparison study with the method of He et al. (2002, 2004).

• Semantic Outlier Factor vs. Class Outlier Factor (SOF vs. COF).

Page 41: Class Outlier Mining

SOF vs. COF


Page 42: Class Outlier Mining

THE TOP 20 SEMANTIC OUTLIERS FOR HOUSE-VOTE-84 DATASET

#    Inst. #   SOF
1    176       0.3036
2    71        0.3394
3    355       0.3645
4    267       0.3659
5    183       0.8726
6    97        0.9892
7    88        1.0724
8    402       1.1690
9    407       1.3309
10   248       1.3487
11   375       1.4520
12   151       1.4927
13   372       1.4950
14   388       1.6365
15   2         1.6489
16   382       1.6727
17   215       1.7010
18   164       1.7168
19   6         1.7236
20   325       1.7259

Page 43: Class Outlier Mining

Comparison Study (Cont.)

• Instance # 176 of votes dataset

– Rank 1 (the top) using SOF.

– Rank 11 using COF.

– PCL(176) = 2/7 → there is another instance of the same class within the seven nearest neighbours.

Page 44: Class Outlier Mining

Comparison Study (Cont.)

• Instance # 407 of votes dataset

– Rank 9 using SOF.

– Rank 1 (the top) using COF.

– PCL(407) = 1/7.

– The Deviation is the greatest, which implies a sort of uniqueness of the instance (object) behaviour.

– The K-Distance of the instance is very small (high density of the other class type).

– The SOF rank of 9 indicates that SOF fails to recognize such important cases.

Page 45: Class Outlier Mining

Conclusions

• In this research we proposed and introduced:

– A novel approach for Class Outlier mining based on the K nearest neighbours, using a distance-based similarity function to determine the nearest neighbours.

– The motivation for Class Outliers and their significance as exceptional cases.

– A ranking score, the Class Outlier Factor (COF), to measure the degree of being a Class Outlier for an object.

Page 46: Class Outlier Mining

Conclusions (Cont.)

– An efficient algorithm for mining and detecting Class Outliers.

– An implementation developed using the Weka framework.

• We presented:

– Experimental results of the algorithm applied to datasets from various domains (medical, business, and other domains) and of different types (continuous, nominal with small numbers of values, nominal with larger numbers of values, and mixed).

Page 47: Class Outlier Mining

Conclusions (Cont.)

– A comparison study with other methods; the results show that our proposed algorithm gives more plausible and reasonable results than the others. In addition, it handles mixed data types and more than two class labels.

Page 48: Class Outlier Mining

Future work

• Proposing Class Outlier Detection Model.

• Taking advantage of the output of this work to find a scheme for inducing Censored Production Rules (CPRs) from large datasets.

• Developing a weighted distance similarity function, where feature weight determination might be based on the information gain.

Page 49: Class Outlier Mining

Thank You! Q&A