
Applying Support Vector Machines to Imbalanced Datasets

Authors: Rehan Akbani, Stephen Kwek (University of Texas at San Antonio, USA); Nathalie Japkowicz (University of Ottawa, Canada)

Published: European Conference on Machine Learning (ECML), 2004

Presenter: Rehan Akbani

Home Page: http://www.cs.utsa.edu/~rakbani/

Presentation Outline

Motivation and Problem Definition
Key Issues
Support Vector Machines Background
Problem in Detail
Traditional Approaches to Solve the Problem
Our Approach
Results and Conclusions
Future Work and Suggested Improvements

Motivation

Imbalanced datasets are datasets where the negative instances far outnumber the positive instances (or vice versa).

Naturally occurring imbalanced datasets:
Gene profiling
Medical diagnosis
Credit card fraud detection

Ratios of negative to positive instances of 100 to 1 are not uncommon.

Key Issues

Traditional algorithms such as SVMs, decision trees, and neural networks perform poorly on imbalanced data.

Accuracy is not a good metric to measure performance.

Need to improve traditional algorithms so that they can handle imbalanced data.

Need to define other metrics to measure performance.
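A small sketch of why accuracy misleads here. The instance counts are hypothetical, chosen only to match the 100 to 1 ratio mentioned earlier; the "classifier" simply labels every instance as negative.

```python
# Hypothetical test set with a 100 : 1 negative-to-positive ratio.
n_negative = 1000
n_positive = 10

# A trivial classifier that labels every instance as negative.
true_negatives = n_negative    # every negative is classified correctly
true_positives = 0             # no positive is ever detected

accuracy = (true_negatives + true_positives) / (n_negative + n_positive)
sensitivity = true_positives / n_positive

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
```

The accuracy is close to 1 even though the classifier never finds a single positive instance, which is why a different metric is needed.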

Support Vector Machines Background

Find the maximum margin boundary that separates the green and red instances.
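As a sketch, the maximum margin boundary can be written in the standard hard-margin form below. The notation is an assumption (it does not appear on this slide): x_i are the instances, y_i ∈ {+1, −1} their labels, and w, b define the separating hyperplane w · x + b = 0.

```latex
\min_{w,\,b} \; \frac{\|w\|^2}{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under this formulation the margin between the two classes is 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin.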

Support Vector Machines: Support Vectors

Circled instances are support vectors.

Support Vector Machines: Kernels

Kernels allow non-linear separation of instances, e.g. the Gaussian kernel.
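A minimal sketch of a Gaussian kernel, assuming the common parametrization K(x, z) = exp(−γ ‖x − z‖²). The function name and the default γ = 1 are illustrative choices, not taken from this slide.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Return exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Identical points give the maximum similarity; distant points approach 0.
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))
print(gaussian_kernel([0.0, 0.0], [3.0, 4.0]))
```

Because the kernel value depends only on the distance between instances, the resulting decision boundary can curve around clusters instead of being a single hyperplane in the input space.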

Effects of Imbalance on SVM

1. Positive (minority) instances lie further away from the “ideal” boundary.


2. Support vector ratio is imbalanced.

Support vectors are shown in red.


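3. Weakness of Soft-Margins. Minimize the primal Lagrangian:

$$L_p = \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} r_i \xi_i$$

Compromise between minimization of total error and maximization of margin.

With few positive instances, the margin can be maximized at the cost of only a small total error.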

Traditional Approaches

Oversample the minority class or undersample the majority class.

The sample is no longer random: its distribution no longer approximates the target distribution. Defense: the sample was biased to begin with.

With undersampling, we are discarding instances that may contain valuable information.
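The two resampling strategies can be sketched as follows. This is an illustrative version that resamples until both classes have the same size; the slides do not state a target ratio, and the function names are hypothetical.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, as large as the minority class."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # the remaining instances are discarded
    return kept, list(minority)

def random_oversample(majority, minority, seed=0):
    """Duplicate random minority instances until both classes have equal size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra

majority = list(range(100))        # 100 hypothetical negative instances
minority = list(range(100, 110))   # 10 hypothetical positive instances

under_major, under_minor = random_undersample(majority, minority)
over_major, over_minor = random_oversample(majority, minority)
print(len(under_major), len(under_minor))
print(len(over_major), len(over_minor))
```

Undersampling throws away 90 of the 100 majority instances in this example, which is the information loss the slide warns about; oversampling keeps everything but only repeats existing minority instances.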

Problem with Undersampling

[Figure: two panels, before and after undersampling]

After undersampling, the learned plane lies closer to the ideal plane, but its orientation is no longer as accurate.

Our Approach – SMOTE with Different Costs (SDC)

Do not undersample the majority class in order to retain all the information.

Use Synthetic Minority Oversampling TEchnique (SMOTE) (Chawla et al., 2002).

Use Different Error Costs (DEC) to push the boundary away from positive instances (Veropoulos et al., 1999).
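A simplified sketch of the SMOTE idea: each synthetic instance is placed at a random point on the line segment between a minority instance and one of its nearest minority neighbours. The function names, the default k = 5, and the use of a single random gap per synthetic point are assumptions made for illustration; this is not the authors' implementation.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def smote(minority, n_per_instance, k=5, seed=0):
    """Generate n_per_instance synthetic points for every minority instance."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # k nearest minority neighbours of x, excluding x itself
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: squared_distance(x, p))[:k]
        for _ in range(n_per_instance):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)])
    return synthetic

# Three hypothetical minority instances on the line x = y.
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
new_points = smote(minority, n_per_instance=2, k=2)
print(len(new_points))
```

Every synthetic point lies between two existing positive instances, so the method implicitly treats that region as positive.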

Effect of DEC

[Figure: two panels, before DEC and after DEC]
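One way to write the DEC idea, in the form usually attributed to Veropoulos et al.: the single penalty C on the slack variables ξ_i of the soft-margin primal Lagrangian is split into C⁺ for positive instances and C⁻ for negative instances. The symbols C⁺ and C⁻ and the exact form below are assumptions for illustration and are not taken from the slides.

```latex
L_p = \frac{\|w\|^2}{2}
    + C^{+} \sum_{\{i \,\mid\, y_i = +1\}} \xi_i
    + C^{-} \sum_{\{j \,\mid\, y_j = -1\}} \xi_j
    - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right]
    - \sum_{i=1}^{n} r_i \xi_i
```

Choosing C⁺ larger than C⁻ makes errors on positive instances more expensive than errors on negative ones, which moves the learned boundary away from the positive class.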

Effect of SMOTE and DEC – (SDC)

[Figure: two panels, after DEC alone and after SMOTE and DEC]

Experiments

Used 10 different UCI datasets.

Compared with four other algorithms:
Regular SVM
Undersampling (US)
Different Error Costs (DEC) alone
SMOTE alone

Used linear, polynomial (degree 2) and Radial Basis Function (RBF) (γ = 1) kernels.

Metric Used – g-means

Used g-means metric (Kubat et al., 1997). Higher g-means means better performance:

$$\text{g-means} = \sqrt{\text{Sensitivity} \times \text{Specificity}}$$

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Used by researchers such as Kubat, Matwin, Holte, Wu, Chang (1997 – 2003) for imbalanced datasets.

Can be computed easily and results can be displayed compactly.

Suitable for use with several datasets and SVM, where time and space are limited.
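The metric can be computed directly from the four confusion-matrix counts. The sketch below uses hypothetical counts; the function name is illustrative.

```python
import math

def g_means(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)   # fraction of positives correctly identified
    specificity = tn / (tn + fp)   # fraction of negatives correctly identified
    return math.sqrt(sensitivity * specificity)

# A classifier that labels everything negative (hypothetical counts).
print(g_means(tp=0, fn=10, tn=1000, fp=0))

# A classifier that trades some specificity for sensitivity.
print(g_means(tp=8, fn=2, tn=900, fp=100))
```

Because the two rates are multiplied, a classifier that ignores the minority class scores 0 no matter how many negatives it gets right.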

Datasets Used - UCI

Dataset        Imbalance   Positive Insts.   Negative Insts.   % Oversampled
Abalone19      1 : 130     32                4145              1000
Anneal5        1 : 12      67                831               400
Car3           1 : 24      69                1659              400
Glass7         1 : 6       29                185               200
Hepatitis1     1 : 4       32                123               100
Hypothyroid3   1 : 39      95                3677              400
Letter26       1 : 26      734               19266             200
Segment1       1 : 6       330               1980              100
Sick2          1 : 15      231               3541              100
Soybean12      1 : 15      44                639               100
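The Imbalance column can be reproduced from the instance counts. The sketch below assumes the ratio is the number of negative instances divided by the number of positive instances, rounded to the nearest integer; the helper name is hypothetical.

```python
def imbalance_ratio(n_positive, n_negative):
    """Return r such that the class ratio is roughly 1 : r."""
    return round(n_negative / n_positive)

# (positive, negative) instance counts copied from the table above.
datasets = {
    "Abalone19": (32, 4145),
    "Anneal5": (67, 831),
    "Car3": (69, 1659),
    "Glass7": (29, 185),
    "Hepatitis1": (32, 123),
    "Hypothyroid3": (95, 3677),
    "Letter26": (734, 19266),
    "Segment1": (330, 1980),
    "Sick2": (231, 3541),
    "Soybean12": (44, 639),
}

for name, (pos, neg) in datasets.items():
    print(f"{name:<13} 1 : {imbalance_ratio(pos, neg)}")
```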

Results

g-means metric for each algorithm and dataset

Dataset       SVM         US          SMOTE       DEC         SDC
abalone       0           0.6436394   0           0.8064562   0.7449049
anneal        1           1           1           1           1
car           0           0.960104    0.9846381   0.3162278   0.9846381
glass         0.8660254   0.880108    0.877058    0.9199519   0.9405399
hepatitis     0.5959695   0.7470874   0.742021    0.6942835   0.7682954
hypothyroid   0.1767767   0.8938961   0.8025625   0.9384492   0.9581446
letter        0.8182931   0.9555176   0.947737    0.9871834   0.9816909
segment       0.9950372   0.9917748   0.9765287   0.9950372   0.9783467
sick          0           0.7641141   0.4071283   0.8627879   0.8695146
soybean       0.9258201   0.9672867   1           1           1
Mean          0.537792    0.880353    0.773767    0.852038    0.922608
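The summary row and pairwise comparisons can be recomputed from the per-dataset values. The numbers below are copied from the table; the helper names are hypothetical.

```python
ALGORITHMS = ("SVM", "US", "SMOTE", "DEC", "SDC")

# g-means per dataset, in the column order SVM, US, SMOTE, DEC, SDC.
RESULTS = {
    "abalone":     (0, 0.6436394, 0, 0.8064562, 0.7449049),
    "anneal":      (1, 1, 1, 1, 1),
    "car":         (0, 0.960104, 0.9846381, 0.3162278, 0.9846381),
    "glass":       (0.8660254, 0.880108, 0.877058, 0.9199519, 0.9405399),
    "hepatitis":   (0.5959695, 0.7470874, 0.742021, 0.6942835, 0.7682954),
    "hypothyroid": (0.1767767, 0.8938961, 0.8025625, 0.9384492, 0.9581446),
    "letter":      (0.8182931, 0.9555176, 0.947737, 0.9871834, 0.9816909),
    "segment":     (0.9950372, 0.9917748, 0.9765287, 0.9950372, 0.9783467),
    "sick":        (0, 0.7641141, 0.4071283, 0.8627879, 0.8695146),
    "soybean":     (0.9258201, 0.9672867, 1, 1, 1),
}

def mean_g_means(algorithm):
    """Average g-means of one algorithm over all datasets."""
    col = ALGORITHMS.index(algorithm)
    return sum(row[col] for row in RESULTS.values()) / len(RESULTS)

def compare(first, second):
    """Return (wins, ties, losses) of `first` against `second` across datasets."""
    a, b = ALGORITHMS.index(first), ALGORITHMS.index(second)
    wins = sum(row[a] > row[b] for row in RESULTS.values())
    ties = sum(row[a] == row[b] for row in RESULTS.values())
    return wins, ties, len(RESULTS) - wins - ties

for algorithm in ALGORITHMS:
    print(f"{algorithm:<6} mean g-means = {mean_g_means(algorithm):.6f}")
for other in ("US", "SMOTE", "DEC"):
    print(f"SDC vs {other}: (wins, ties, losses) = {compare('SDC', other)}")
```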

Results

g-means graphs for each algorithm and dataset

[Figure: g-means (y-axis, 0 to 1.2) by dataset (x-axis) for SVM, US, SMOTE, DEC and SDC]

Conclusions

Our algorithm (SDC) outperforms all the other four algorithms on mean g-means. Undersampling is the runner-up.

SDC performs better than or equal to undersampling in 9 out of 10 datasets (strictly better in 8, tied on anneal).

It always performs better than or equal to SMOTE.

It performs better than or equal to DEC in 7 out of 10 datasets.

It has similar limitations to those of SMOTE:

Assumes the space between two positive neighboring instances is positive.

Assumes the neighborhood of a positive instance is positive.

Future Work and Suggested Improvements

Design a better oversampling technique that does not assume a convex positive space.

Evaluate the algorithm on biological datasets with extremely high degrees of imbalance (over 10,000 to 1).

Find out if the technique can be extended to other ML algorithms which have lower execution time than SVM.

Analyze the robustness of the algorithm against noisy minority instances.

Questions?
