class outlier mining

Class Outlier Mining:Distance‐Based Approach

ByNabil M. Hewahi and Motaz K. Saad

Presented byMotaz K. Saad

[email protected]. 2008

International Journal of Intelligent Technology, Vol. 2, No. 1, pp 55-68, 2007

mailto:[email protected]

Abstract

• In large datasets, identifying exception or rarecases with respect to a group of similar cases is to be considered very significant problem. (unusual pattern)

• The traditional problem (Outlier Mining) is to find exception or rare cases in a dataset irrespective of the class label of these cases, they are considered rare event with respect to the whole dataset.

2

Abstract (Cont.)

• Present an overview of Class Outlier.

• Introduce a novel definition of a class outlier and propose COF factor.

• Propose a new algorithm for class outlier mining.

• Present experimental results.

• Perform a comparison study.

3

Outlier Definition

• An Outlier is a data object that does not comply with the general behavior of the data (unusual pattern)

• It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis.

4

Outlier Mining

• It is the problem of detecting rare events, deviant objects, and exceptions.

• Is an important data mining issue in knowledge discovery; it has attracted increasing interests in recent years.

5

Outliers

Outliers

6

Outlier Mining: Business Applications

• Medical • Education • Fraud detection • Credit approving• Stock market analysis• Identifying computer network intrusions• Data cleaning• Surveillance and auditing• Health monitoring systems• Insurance, banking, money laundering telecommunication ..., etc).

7

Outlier Detection Methods

• Statistical based (Distribution based)

• Clustering

• Distance‐based– K Nearest Neighbors (KNN)

– Density‐Based

• Model‐Based (Neural Network): Replicator Neural Network RNN

8

Statistical (Distribution) based Outlier Detection Method

9

Variation of Distance‐Based Approach for detecting Outlier

10

NN Model for Detection Outlier

11

A schematic view of a fully connected Replicator Neural Network.

What is Class Outlier?

• All the previous Definition of Outliers do not consider the class labels of the data set

• This means all the previous methods of Outliers Mining are devoted on the overall data set without looking closely to each class label separately.

12

Class Outlier vs. Outlier

13

Class Outliers

Outlier

X ClassClass

Class Outlier Example in heart‐statlog dataset

14

Att.#Inst# 1 2 3 4 5 6 7 8 9 10 11 12 13 Class

69 47 1 3 108 243 0 0 152 0 0 1 0 3 present 62 44 1 3 120 226 0 0 169 0 0 1 0 3 absent150 41 1 3 112 250 0 0 179 0 0 1 0 3 absent179 50 1 3 129 196 0 0 163 0 0 1 0 3 absent38 42 1 3 130 180 0 0 150 0 0 1 0 3 absent253 51 1 3 110 175 0 0 123 0 .6 1 0 3 absent23 47 1 3 112 204 0 0 143 0 .1 1 0 3 absent

Class Outlier Mining take in consideration the class label of the dataset

Class Outlier

Class Outlier Example in house‐vote‐84 dataset

15

Att.#Inst# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Class407 n n n y y y n n n n y y y y n n democrat

306 n n n y y y n n n n n y y y n n republican





339 y n n y y y n n n n y y y y n n republican

Class Outlier Mining take in consideration the class label of the dataset

Class Outlier

Party Behavior of house‐vote‐84 dataset

16

231

29714

245

8

55

200

12

218

45

436

213

18 22

142

4

163

23

157

8324

133

11

135

2013

0

50

100

150

200

250

300

ynnoVoteynnoVoteynnoVoteynnoVoteynnoVote

Issue 3Issue 4Issue 5Issue 8Issue 12

Issue #

Vote

s

Democrat Republican

Class Outlier Definitions

• There are three definition that handle the problem that is “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels”.– Semantic Outlier [He, et al. WAIM’02, 2002]

– Cross Outlier [Papadimitriou and Faloutsos, SSTD’03, 2003]

– Generalization of def. 1 & 2 [He, et al. ESWA'04, 2004]

17

Semantic Outlier• Semantic Outlier: a data point, which behaves differently with other data points in the same class, while looks normal with respect to data points in another class [He, et al. 2002]

18

Cross Outlier• Cross‐outlier: Given two sets (or classes) of objects, find those which deviate with respect to the other set [Papadimitriou and Faloutsos 2003]

19

Definition (3)

• Generalization of Definition 1 & 2: The generalization does not consider only outliers that deviate with respect to their own class, but also outliers that deviate with respect to other classes [He, et al. 2004].

20

The proposed definitions

• Distance (Similarity) Function

• K Nearest Neighbors

• PCL(T, K)

• Deviation

• K‐Distance

• Class Outlier

• Class Outlier Factor (COF)

21

Distance (Similarity) Function

• Given a data set D = {t1, t2, t3, ..., tn} of tuples where each tuple ti = <ti1, ti2, ti3, ..., tim, Ci> contains m attributes and the class label Ci, the similarity function based on the EuclideanDistance between two data tuples, X = <x1, x2, x3, ...., xm> and Y = <y1, y2, y3,..., ym> (excluding the class labels) is

22

( ) ( )∑ −m

=iYX=YX,d

1

22

K Nearest Neighbours

• For any positive integer K, the K-Nearest Neighbours of a tuple ti are the K closest tuples in the data set.

23

PCL (T, K)

• The Probability of theclass label of theinstance T with respectto the class labels of itsK Nearest Neighbours.

• The instance T has the class label y, So the PCLof the instance T is 2/7.

24

Deviation (T)

• Given a subset DCL = {t1, t2, t3, ..., th} of a data set D = {t1, t2, t3, ..., tn}. Where h is the number of instances in DCL and n is the number of instances of D.

• Given the instance T, DCL contains all the instances that have the similar class label of that of the instance T.

• The Deviation of T is how much the instance T deviates from DCL subset.

25

Deviation (T) (Cont.)

• The Deviation is computed by summing the distance between the instance T and every instance in DCL.

26

( ) ( ) DCL.tWhere,tT,d=TDeviation i

h

=ii ∈∑

1

Deviation (T) (Cont.)

27

K‐Distance (The Density Factor)

• K Distance between theinstance T and its Knearest neighbors, i.e.how much the K nearestneighbors instances areclose to the instance T.

28

( ) ( )∑K

=iitT,d=TKDist

1

Class Outlier: The proposed Definition

• Class Outliers are the top N instances which satisfies the following:– The K‐Distance to its K nearest neighbours is the least.

– Its Deviation is the greatest.

–Has different class label form that of its Knearest neighbours.

29

Class Outlier Factor (COF)

• COF: The Class Outlier Factor of the instance T is the degree of being Class Outlier. The Class Outlier Factor of the instance T is defined as:

• PCL(T) from [1/K,1] to [1,K] by multiplying it by K. α and β factors are to control the importance and the effects of Deviation and K-Distance, where 0 ≤ α ≤ Mand 0 ≤ β ≤ 1. M is a changeable value based on the application domain and the initial experimental results. 30

( ) ( ) ( ) ( )TKDistβ+TDeviation

α+TPCLK=TCOF ∗∗∗1

Guidelines for choosing K, α and β

• If the Deviation in hundreds for example, the best value for α is 100, and if the Deviation in tens, then the best value for α is 10 and so on.

• The optimal value of K is determined by trial and error technique. – There are many factors affecting the optimal value,

for example dataset size and number of classes are very important factors that affect choosing the value of K.

31

Optimal value of K

• High value of K might result in wrong estimation of PCL.

• Low value of K means KNN is not well utilized.

• Odd values of K would make more sense for PCL value.

32

CODB Algorithm Basic Steps

33

• Rank each instance in the dataset D. – This is done by calling the Rank procedure after providing the CODB with all the necessary data such as the value of α, β and K.

– The Rank Procedure finds out the rank of each instance using the formula in slide 28 and gives back the rank to CODB

• CODBmaintains a list of only the instances of the top N class outliers. – The less is the value of COF of an instance, the higher is the priority of the instance to be a class outlier.

CODB features

• Direct method: no need for clustering.

• Handle numeric (continues), nominal and mixed dataset.

• Works on datasets with more than two classes.

• More specific in data object ranking than other related methods.

34

Experimental results

• The CODB algorithm has been applied on five different real world datasets.

• All the datasets are publicly available at the UCI machine learning repository .

• The datasets are chosen from various domains that might have single or mixed data types and with two or more class labels. This variation is being tested on our proposed algorithm to show its capabilities.

35

We performed experiments on the following datasets:

• Votes dataset: Nominal, 2 class labels.

• Hepatitis dataset: Mixed, 2 class labels.

• Heart‐statlog dataset. Mixed, 2 class labels.

• Credit approval (credit‐a) dataset: good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values, 2 class labels

• Vehicle dataset. Continues, 4 class labels

36

Votes dataset experimental results

• 1984 United States Congressional Voting Records Database.

• Includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes.

• 16 Boolean attributes + class name = 17

• 435 instances, 2 classes (61.38% Democrats, 38.62% Republicans).

37

THE TOP 20 CLASS OUTLIERS OF HOUSE‐VOTE‐84 DATASET

• K = 7

• Top N COF = 20

• Distance type: Euclidean Distance

• α = 100

• β = 0.1

• Remove Instance With Missing Values: false

• Replace Missing Values: false

38

# Inst. # PCL Dev KDist # Inst. # PCL Dev KDist

1 4071 896.24 6.0

11 1762 519.94 8.49

COF: 1.71158 COF: 3.04086

2 3751 881.78 8.07

12 3842 832.95 9.34

COF: 1.92051 COF: 3.0543

3 3881 857.35 8.07

13 3652 836.65 9.44

COF: 1.92375 COF: 3.0634

4 1611 819.35 8.49

14 62 849.33 9.76

COF: 1.97058 COF: 3.0934

5 2671 523.66 8.49

15 3552 524.28 9.12

COF: 2.03949 COF: 3.10283

6 711 535.52 9.44

16 1642 845.19 10.07

COF: 2.13061 COF: 3.12576

7 771 799.64 10.39

17 4022 480.79 10.93

COF: 2.16429 COF: 3.30081

8 3252 846.64 8.49

18 1512 879.84 12.0

COF: 2.96664 COF: 3.31366

9 1602 829.17 8.49

19 1733 839.0 8.07

COF: 2.96913 COF: 3.9263

10 3822 851.76 9.02

20 753 841.28 9.44

COF: 3.01986 COF: 4.06275 39

Comparison Study

• We performed a comparison study with He’s method (2002, 2004).

• Semantic Outlier Factor vs. Class Outlier Factor (SOF vs. COF).

40

SOF vs. COF

41

THE TOP 20 SEMANTIC OUTLIERS FOR

HOUSE‐VOTE‐84 DATASET

42

# Inst. # SOF # Inst. # SOF1 176 0.3036 11 375 1.45202 71 0.3394 12 151 1.49273 355 0.3645 13 372 1.49504 267 0.3659 14 388 1.63655 183 0.8726 15 2 1.64896 97 0.9892 16 382 1.67277 88 1.0724 17 215 1.70108 402 1.1690 18 164 1.71689 407 1.3309 19 6 1.723610 248 1.3487 20 325 1.7259

Comparison Study (Cont.)

• Instance # 176 of votes dataset– Rank 1 (the top) using SOF.

– Rank 11 using COF.

– PCL(176) = 2/7 → there is another instance of the same class within the seven nearest neighbours.

43

Comparison Study (Cont.)

• Instance # 407 of votes dataset– Rank 9 using SOF.

– Rank 1 (the top) using COF.

– PCL(407) = 1/7.

– The Deviation is the greatest which implies sort of uniqueness of the instance (object) behaviour.

– The K-Distance of the instance is very small (high density of other class type)

– SOF(407) = rank 9 → indicates the disability of recognizing such important cases.

44

Conclusions

• In this research we proposed and introduced:– A novel approach for Class Outliers mining based on the K nearest neighbours using distance‐based similarity function to determine the nearest neighbours.

– Motivation about Class Outliers and their significance as exceptional cases.

– Ranking score that is Class Outlier Factor (COF) to measure the degree of being a Class Outlier for an object.

45

Conclusions (Cont.)

– An efficient algorithm for mining and detection Class Outliers.

– An implementation has been developed using Weka framework.

• We presented:– Experimental results of the algorithm applied on various domains dataset (medical, business, and other domains), and for different dataset type (continues, nominal with small numbers of values, nominal with larger numbers of values, and mixed).

46

Conclusions (Cont.)

– A comparison study has been performed with other methods and results show that our proposed algorithm gives more plausible and reasonable results than others. In addition, it considers mixed data types and more than two class label.

47

Future work

• Proposing Class Outlier Detection Model.

• Getting advantage of the output of this work to find out a scheme to induce Censored Productions Rules (CPRs) from large datasets.

• Developing a weighted distance similarity function, where feature weight determination might be based on the information gain.

48

Thank You !... Q&A

49

class outlier mining

Technology

hewahi andmotaz

abstract inlargedatasets

distancebasedapproach

saad presentedbymotaz