Applying Support Vector Machines to Imbalanced Datasets Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA) Nathalie Japkowicz (University of Ottawa, Canada) Published: European Conference on Machine Learning (ECML), 2004 Presenter: Rehan Akbani Home Page: http://www.cs.utsa.edu/~rakbani/



TRANSCRIPT

Page 1: Applying Support Vector Machines to Imbalanced Datasets Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA) Nathalie Japkowicz

Applying Support Vector Machines to Imbalanced Datasets

Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA)

Nathalie Japkowicz (University of Ottawa, Canada)

Published: European Conference on Machine Learning (ECML), 2004

Presenter: Rehan Akbani

Home Page: http://www.cs.utsa.edu/~rakbani/

Page 2:

Presentation Outline

Motivation and Problem Definition

Key Issues

Support Vector Machines Background

Problem in Detail

Traditional Approaches to Solve the Problem

Our Approach

Results and Conclusions

Future Work and Suggested Improvements

Page 3:

Motivation

Imbalanced datasets are datasets where the negative instances far outnumber the positive instances (or vice versa).

Naturally occurring imbalanced datasets: gene profiling, medical diagnosis, credit card fraud detection.

Ratios of negative to positive instances of 100 to 1 are not uncommon.

Page 4:

Key Issues

Traditional algorithms such as SVM, decision trees, and neural networks perform poorly with imbalanced data.

Accuracy is not a good metric to measure performance.

Need to improve traditional algorithms so that they can handle imbalanced data.

Need to define other metrics to measure performance.
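Why accuracy misleads can be shown with a small calculation. The sketch below assumes a hypothetical test set with 1000 negatives and 10 positives (a 100 to 1 ratio) and a classifier that labels every instance negative; the counts are illustrative, not taken from the experiments.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Fraction of positive instances that were detected."""
    return tp / (tp + fn)


# Hypothetical test set with a 100:1 imbalance: 1000 negatives, 10 positives.
# Predicting "negative" for everything gets every negative right (TN = 1000)
# and every positive wrong (FN = 10).
acc = accuracy(tp=0, tn=1000, fp=0, fn=10)
sens = sensitivity(tp=0, fn=10)

print(f"accuracy    = {acc:.3f}")   # about 0.990
print(f"sensitivity = {sens:.3f}")  # 0.000
```

Accuracy stays near 99% although no positive instance is found, which is why a different metric is needed.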

Page 5:

Support Vector Machines Background

Find the maximum margin boundary that separates the green and red instances.

Page 6:

Support Vector Machines: Support Vectors

Circled instances are support vectors.

Page 7:

Support Vector Machines: Kernels

Kernels allow non-linear separation of instances, e.g. the Gaussian kernel.
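For reference, the Gaussian (RBF) kernel is usually written as below. The slide names the kernel without giving a formula, so the symbols $x_i$, $x_j$ and the width parameter $\gamma$ are the standard textbook notation, assumed here.

```latex
K(x_i, x_j) = \exp\left(-\gamma \, \lVert x_i - x_j \rVert^2\right), \qquad \gamma > 0
```

The kernel equals 1 when the two instances coincide and decays toward 0 as they move apart, which lets the SVM draw curved boundaries in the original input space.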

Page 8:

Effects of Imbalance on SVM

1. Positive (minority) instances lie further away from the “ideal” boundary.

Page 9:

Effects of Imbalance on SVM

2. Support vector ratio is imbalanced.

Support vectors are shown in red.

Page 10:

Effects of Imbalance on SVM

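3. Weakness of Soft-Margins. Minimize the primal Lagrangian:

$$L_p = \frac{\lVert w \rVert^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} r_i \xi_i$$

Compromise between minimization of total error and maximization of margin.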

Page 11:

Effects of Imbalance on SVM

With few positive instances, the margin is maximized at the cost of only a small total error, so the learned boundary can misclassify most positives.
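This trade-off can be checked numerically. The sketch below evaluates the soft-margin objective (half the squared norm of w plus C times the sum of the slacks, with each slack taken as max(0, 1 - y_i (w . x_i + b))) for the degenerate boundary w = 0, b = -1, which labels everything negative. The data, the value C = 1 and the function name are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return ||w||^2 / 2 + C * sum_i max(0, 1 - y_i * (w . x_i + b))."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    slack_sum = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        slack_sum += max(0.0, 1.0 - yi * score)
    return margin_term + C * slack_sum


# Hypothetical data: 1000 negatives followed by 10 positives.
X = [[float(i % 7), float(i % 3)] for i in range(1010)]
y = [-1] * 1000 + [+1] * 10

# w = 0, b = -1 classifies every instance as negative.
obj = soft_margin_objective(w=[0.0, 0.0], b=-1.0, X=X, y=y, C=1.0)
print(obj)  # 20.0: a slack of 2 for each of the 10 positives, nothing else
```

The 1000 negatives contribute nothing to the objective, so the penalty for ignoring the positive class grows only with the number of positives.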

Page 12:

Traditional Approaches

Oversample the minority class or undersample the majority class.

The sample is no longer random: its distribution no longer approximates the target distribution. Defense: the sample was biased to begin with.

With undersampling, we are discarding instances that may contain valuable information.
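A minimal sketch of the two resampling strategies, assuming plain random selection (the slides do not say which variant was used). The function names and the toy data are hypothetical.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class; the rest is discarded."""
    return rng.sample(majority, n_keep)


def random_oversample(minority, n_total, rng):
    """Duplicate randomly chosen minority instances until n_total exist."""
    extra = [rng.choice(minority) for _ in range(n_total - len(minority))]
    return minority + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = [(float(i), 0.0) for i in range(100)]  # hypothetical majority class
positives = [(float(i), 1.0) for i in range(5)]    # hypothetical minority class

under = random_undersample(negatives, n_keep=len(positives), rng=rng)
over = random_oversample(positives, n_total=len(negatives), rng=rng)

print(len(under), "negatives kept,", len(negatives) - len(under), "discarded")
print(len(over), "positives after oversampling, all copies of the original 5")
```

Undersampling throws away 95 of the 100 negatives here, while plain oversampling adds no new information because every added instance is a duplicate.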

Page 13:

Problem with Undersampling

Figure panels: Before, After.

After undersampling, the learned plane estimates the distance of the ideal plane better, but its orientation is no longer as accurate.

Page 14:

Our Approach – SMOTE with Different Costs (SDC)

To retain all the information, do not undersample the majority class.

Use Synthetic Minority Oversampling TEchnique (SMOTE) (Chawla et al., 2002).

Use Different Error Costs (DEC) to push the boundary away from positive instances (Veropoulos et al., 1999).
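The idea behind SMOTE can be sketched as follows: each synthetic instance is placed at a random point on the segment between a minority instance and one of its k nearest minority neighbours. This is a simplified illustration, not the authors' implementation; in particular it picks the base instance at random rather than looping over every minority instance, and the function name, parameters and toy data are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k, rng):
    """Create n_synthetic points by interpolating between minority neighbours."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, excluding base itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


rng = random.Random(0)
# Hypothetical minority class: the four corners of the unit square.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(positives, n_synthetic=8, k=2, rng=rng)

for point in new_points:
    print(point)
```

Every synthetic point lies between two existing positives, so unlike plain duplication it fills in the region the minority class occupies.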

Page 15:

Effect of DEC

Figure panels: Before DEC, After DEC.
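In the usual formulation of DEC, the single error cost $C$ of the soft-margin Lagrangian is split into one cost per class. The slide shows only the figures, so the equation below is a sketch in standard notation, with $C^{+}$ and $C^{-}$ as assumed names for the costs of errors on positive and negative instances.

```latex
L_p = \frac{\lVert w \rVert^2}{2}
      + C^{+} \sum_{i \,:\, y_i = +1} \xi_i
      + C^{-} \sum_{j \,:\, y_j = -1} \xi_j
      - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right]
      - \sum_{i=1}^{n} r_i \xi_i
```

Choosing $C^{+} > C^{-}$ makes an error on a positive instance more expensive than an error on a negative one, which pushes the boundary away from the positives.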

Page 16:

Effect of SMOTE and DEC – (SDC)

Figure panels: After DEC alone, After SMOTE and DEC.

Page 17:

Experiments

Used 10 different UCI datasets.

Compared with four other algorithms: regular SVM, undersampling (US), Different Error Costs (DEC) alone, and SMOTE alone.

Used linear, polynomial (degree 2) and Radial Basis Function (RBF) (γ = 1) kernels.
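The three kernels can be written out as small functions. The slide fixes only the polynomial degree (2) and the RBF width (γ = 1); the polynomial form (x . z + coef0)^degree with coef0 = 1 is an assumption, since SVM packages differ on it, and the function names are illustrative.

```python
import math


def linear_kernel(x, z):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Assumed form (x . z + coef0)^degree; only the degree is given."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


x, z = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, z))      # 2.0
print(polynomial_kernel(x, z))  # 9.0
print(rbf_kernel(x, z))         # exp(-5), about 0.0067
```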

Page 18:

Metric Used – g-means

Used the g-means metric (Kubat et al., 1997). Higher g-means means better performance:

$$g\text{-means} = \sqrt{\text{Sensitivity} \times \text{Specificity}}$$

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Used by researchers such as Kubat, Matwin, Holte, Wu, and Chang (1997-2003) for imbalanced datasets.

Can be computed easily, and results can be displayed compactly.

Suitable for use with several datasets and SVM, where time and space are limited.
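The metric is simple to compute from a confusion matrix. The sketch below uses hypothetical counts; the function name is illustrative.

```python
import math


def g_means(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical classifier that labels everything negative.
all_negative = g_means(tp=0, tn=1000, fp=0, fn=10)

# Hypothetical classifier with 90% sensitivity and 90% specificity.
balanced = g_means(tp=9, tn=900, fp=100, fn=1)

print(all_negative)        # 0.0
print(round(balanced, 3))  # 0.9
```

Because the two rates are multiplied, a classifier that ignores one class scores 0 no matter how well it does on the other.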

Page 19:

Datasets Used - UCI

Dataset         Imbalance ratio   Positive insts.   Negative insts.   % oversampled
Abalone19       1 : 130           32                4145              1000
Anneal5         1 : 12            67                831               400
Car3            1 : 24            69                1659              400
Glass7          1 : 6             29                185               200
Hepatitis1      1 : 4             32                123               100
Hypothyroid3    1 : 39            95                3677              400
Letter26        1 : 26            734               19266             200
Segment1        1 : 6             330               1980              100
Sick2           1 : 15            231               3541              100
Soybean12       1 : 15            44                639               100
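As a rough worked example, the effect of the oversampling percentage on the class ratio can be computed for the Abalone19 row. The sketch assumes the usual SMOTE convention that N% oversampling adds N/100 synthetic instances per original positive; the slide does not define the column, and the function name is hypothetical.

```python
def after_smote(n_pos, n_neg, percent):
    """Return (synthetic count, total positives, negatives per positive)."""
    synthetic = n_pos * percent // 100
    total_pos = n_pos + synthetic
    return synthetic, total_pos, n_neg / total_pos


# Abalone19 row: 32 positives, 4145 negatives, 1000% oversampled.
print(round(4145 / 32))  # 130, matching the listed ratio of 1 : 130

synthetic, total_pos, ratio = after_smote(n_pos=32, n_neg=4145, percent=1000)
print(synthetic, total_pos, round(ratio, 1))  # 320 352 11.8
```

Under that assumption the ratio drops from about 1 : 130 to about 1 : 12, so the data stay imbalanced after SMOTE and the error costs still have work to do.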

Page 20:

Results

g-means metric for each algorithm and dataset

Dataset        SVM         US          SMOTE       DEC         SDC
abalone        0           0.6436394   0           0.8064562   0.7449049
anneal         1           1           1           1           1
car            0           0.960104    0.9846381   0.3162278   0.9846381
glass          0.8660254   0.880108    0.877058    0.9199519   0.9405399
hepatitis      0.5959695   0.7470874   0.742021    0.6942835   0.7682954
hypothyroid    0.1767767   0.8938961   0.8025625   0.9384492   0.9581446
letter         0.8182931   0.9555176   0.947737    0.9871834   0.9816909
segment        0.9950372   0.9917748   0.9765287   0.9950372   0.9783467
sick           0           0.7641141   0.4071283   0.8627879   0.8695146
soybean        0.9258201   0.9672867   1           1           1
Mean           0.537792    0.880353    0.773767    0.852038    0.922608

Page 21:

Results

g-means graphs for each algorithm and dataset

Chart axes: Datasets (x), g-means (y, 0 to 1.2). Series: SVM, US, SMOTE, DEC, SDC.

Page 22:

Conclusions

Our algorithm (SDC) outperforms the other four algorithms on mean g-means. Undersampling is the runner-up.

SDC performs better than or equal to undersampling in 9 out of 10 datasets.

It always performs better than or equal to SMOTE.

It performs better than or equal to DEC in 7 out of 10 datasets.

It has limitations similar to those of SMOTE:

Assumes the space between two neighboring positive instances is positive.

Assumes the neighborhood of a positive instance is positive.
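The comparisons above can be recomputed from the g-means table on the Results slide. The sketch below copies the table values and counts, for each competitor, the datasets where SDC scores at least as high; the helper names are illustrative.

```python
# g-means values copied from the Results table.
ALGS = ["SVM", "US", "SMOTE", "DEC", "SDC"]
G = {
    "abalone":     [0, 0.6436394, 0, 0.8064562, 0.7449049],
    "anneal":      [1, 1, 1, 1, 1],
    "car":         [0, 0.960104, 0.9846381, 0.3162278, 0.9846381],
    "glass":       [0.8660254, 0.880108, 0.877058, 0.9199519, 0.9405399],
    "hepatitis":   [0.5959695, 0.7470874, 0.742021, 0.6942835, 0.7682954],
    "hypothyroid": [0.1767767, 0.8938961, 0.8025625, 0.9384492, 0.9581446],
    "letter":      [0.8182931, 0.9555176, 0.947737, 0.9871834, 0.9816909],
    "segment":     [0.9950372, 0.9917748, 0.9765287, 0.9950372, 0.9783467],
    "sick":        [0, 0.7641141, 0.4071283, 0.8627879, 0.8695146],
    "soybean":     [0.9258201, 0.9672867, 1, 1, 1],
}


def mean(alg):
    """Mean g-means of one algorithm over all datasets."""
    col = ALGS.index(alg)
    return sum(row[col] for row in G.values()) / len(G)


def at_least_as_good(a, b):
    """Number of datasets where algorithm a scores >= algorithm b."""
    ia, ib = ALGS.index(a), ALGS.index(b)
    return sum(row[ia] >= row[ib] for row in G.values())


for alg in ALGS:
    print(f"{alg:6s} mean g-means = {mean(alg):.6f}")

for other in ["US", "SMOTE", "DEC"]:
    print(f"SDC >= {other} on {at_least_as_good('SDC', other)} of {len(G)} datasets")
```

On these values SDC matches or beats undersampling on 9 datasets, SMOTE on all 10, and DEC on 7, and it has the highest mean.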

Page 23:

Future Work and Suggested Improvements

Design a better oversampling technique that does not assume a convex positive space.

Evaluate the algorithm on biological datasets with extremely high degrees of imbalance (over 10,000 to 1).

Find out whether the technique can be extended to other ML algorithms that have lower execution times than SVM.

Analyze the robustness of the algorithm against noisy minority instances.

Page 24:

Questions?
