
Page 1: Support Vector Machines for Visualization and Dimensionality Reduction

Support Vector Machines for Visualization and Dimensionality Reduction

Tomasz Maszczyk and Włodzisław Duch

Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

ICANN 2008

Page 2: Support Vector Machines for Visualization and Dimensionality Reduction

Plan

• Main idea.
• Short description of three popular methods frequently used for visualization:
  – Multidimensional Scaling (MDS)
  – Principal Component Analysis (PCA)
  – Fisher Discriminant Analysis (FDA)
• Description of our approach.
• Results on artificial and real data.

Page 3: Support Vector Machines for Visualization and Dimensionality Reduction

Main idea

• Many classifiers work as black-box predictors.
• Their quality is estimated using accuracy or cost measures.
• It is important to be able to evaluate a specific case, showing the confidence in predictions in the region of space close to this case.
• PCA, ICA, MDS and other methods commonly used for direct visualization are useful for exploratory data analysis, but do not provide information about the reliability of the method.
• We show how to use any LDA or SVM classifier, in its linear or kernelized version, for dimensionality reduction and data visualization.

Page 4: Support Vector Machines for Visualization and Dimensionality Reduction

Multidimensional Scaling (MDS)

• Non-linear visualization technique.
• Decreases dimensionality while preserving the original distances of the high-dimensional space.
• Needs only similarities between objects, so an explicit vector representation of objects is not necessary.
• Metric scaling uses quantitative evaluation of similarity based on numerical functions.
• Non-metric scaling uses qualitative information about the pairwise similarities.
• MDS variants differ in their cost functions, optimization algorithms, the number of similarity matrices used, and the use of feature weighting.

Page 5: Support Vector Machines for Visualization and Dimensionality Reduction

Multidimensional Scaling (MDS)

• The most commonly used stress function (see the sketch below):

  S_T(d) = \sum_{i>j}^{n} (D_{ij} - d_{ij})^2

  where d_{ij} are distances in the target space and D_{ij} are distances in the input space, calculated using some metric function.

• These measures are minimized over the positions of all target points.
• Cost functions are not easy to minimize, with multiple local minima representing different mappings.
• The initial configuration is either selected randomly or based on a projection of the data onto the space spanned by the principal components.
• The orientation of the axes is arbitrary, and the values of the coordinates do not have any simple interpretation (only relative distances are important).
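A minimal sketch of this kind of stress minimization, assuming plain gradient descent on S_T with NumPy (illustrative only, not the authors' code; the function name and step size are ours):

```python
# Raw-stress MDS sketch: gradient descent on S_T(d) = sum_{i>j} (D_ij - d_ij)^2.
import numpy as np

def mds_stress_descent(X, n_dims=2, n_iter=500, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # input-space distances D_ij
    Y = rng.normal(scale=0.1, size=(X.shape[0], n_dims))        # random initial configuration
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=-1)                       # target-space distances d_ij
        np.fill_diagonal(d, 1.0)                                # avoid division by zero on the diagonal
        grad = ((d - D) / d)[:, :, None] * diff                 # gradient of S_T w.r.t. Y (up to a factor of 2)
        Y -= lr * grad.sum(axis=1)
    return Y
```

As the slide notes, different random starts may end in different local minima, so in practice the initial configuration is often taken from a PCA projection instead.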

Page 6: Support Vector Machines for Visualization and Dimensionality Reduction

Principal Component Analysis (PCA)

• Linear projection method.
• Finds orthogonal combinations of input features accounting for most of the variation in the data.
• Principal components guarantee minimal loss of information when positions are recreated from their low-dimensional projections.
• Projecting the data into the space defined by 1-3 principal components provides, for each input vector, its representative in the target space.

• For some data distributions such projections show informative structures.
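For reference, a compact PCA projection sketch via eigendecomposition of the covariance matrix (illustrative; the function name is ours):

```python
# PCA sketch: project centered data onto the leading principal components.
import numpy as np

def pca_project(X, n_components=2):
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)         # eigh: the covariance matrix is symmetric
    order = np.argsort(eigval)[::-1]             # sort directions by explained variance
    W = eigvec[:, order[:n_components]]          # top principal directions
    return Xc @ W                                # low-dimensional representatives
```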

Page 7: Support Vector Machines for Visualization and Dimensionality Reduction

Fisher Discriminant Analysis (FDA)

• Supervised methods find projections that separate data from different classes in a better way.

• It maximizes the ratio of between-class to within-class scatter, seeking a direction W such that:

  J(W) = \max_W \frac{W^T S_B W}{W^T S_I W}

  where the scatter matrices S_B and S_I are defined by:

  S_B = \sum_{i=1}^{C} \frac{n_i}{n} (m_i - m)(m_i - m)^T,   S_I = \sum_{i=1}^{C} \frac{n_i}{n} \hat{\Sigma}_i

  where m_i and \hat{\Sigma}_i are the sample means and covariance matrices of each class and m is the overall sample mean.

• In our implementation a pseudoinverse matrix has been used to generate higher FDA directions.
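A two-class sketch of the first Fisher direction, using a pseudoinverse of the within-class scatter as mentioned above (illustrative, not the authors' implementation; the function name is ours):

```python
# Two-class Fisher direction sketch: w ∝ S_I^+ (m1 - m0).
import numpy as np

def fisher_direction(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    n = len(X)
    # within-class scatter S_I = sum_i (n_i / n) * Sigma_i
    S_I = (len(X0) / n) * np.cov(X0, rowvar=False) + (len(X1) / n) * np.cov(X1, rowvar=False)
    w = np.linalg.pinv(S_I) @ (m1 - m0)          # pseudoinverse handles a singular S_I
    return w / np.linalg.norm(w)
```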

Page 8: Support Vector Machines for Visualization and Dimensionality Reduction

Linear SVM

• Searches for a hyperplane that provides a large margin of classification, using a regularization term and quadratic programming.

• Non-linear versions are based on a kernel trick that implicitly maps data vectors to a high-dimensional feature space where the best separating hyperplane is constructed.

• The linear discriminant function is defined by:

  g_W(X) = W^T \cdot X + w_0

• The best discriminating hyperplane should maximize the distance between the decision hyperplane and the vectors that are nearest to it.

• The vector W, orthogonal to the discriminant hyperplane, defines the direction on which data vectors are projected, and thus may be used for one-dimensional projections.

Page 9: Support Vector Machines for Visualization and Dimensionality Reduction

Linear SVM

• The same may be done using a non-linear SVM based on the kernel discriminant:

  g_W(X) = \sum_{i=1}^{N_{sv}} \alpha_i K(X_i, X) + w_0

• The x = g_W(X) values for different classes may be smoothed and displayed as a histogram, estimating either the class-conditional or posterior probabilities.

• The first projection should give g_{W_1}(X) < 0 for vectors from the first class and g_{W_1}(X) > 0 for the second class.

• The second-best direction may be obtained by repeating the SVM calculations in the space orthogonalized to the already obtained W directions (see the sketch below).
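A rough sketch of this sequential, orthogonalized projection for the linear case, here using scikit-learn's LinearSVC purely as a stand-in linear SVM (the library choice and function name are ours, not prescribed by the slides):

```python
# Sequential SVM directions sketch: each new direction is trained on data
# orthogonalized to the directions already obtained.
import numpy as np
from sklearn.svm import LinearSVC

def svm_directions(X, y, n_directions=2, C=1.0):
    Xr = X.astype(float).copy()
    directions = []
    for _ in range(n_directions):
        w = LinearSVC(C=C, dual=False).fit(Xr, y).coef_.ravel()
        w /= np.linalg.norm(w)                   # keep only the direction of W
        directions.append(w)
        Xr = Xr - np.outer(Xr @ w, w)            # project the data onto the orthogonal subspace
    return np.array(directions)

# Usage: W = svm_directions(X, y); coords = X @ W.T gives 2D coordinates for a scatter plot.
```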

Page 10: Support Vector Machines for Visualization and Dimensionality Reduction

Linear SVM

• This process may be repeated to obtain more dimensions.

• Each additional dimension should help to decrease errors estimated by CV.

• Optimal dimensionality is obtained when new dimensions stop decreasing the number of errors in CV tests.

• In the case of a non-linear kernel, g_W(X) provides the first direction, while the second may be generated, e.g., by repeating the training on the subset of vectors that are close to the hyperplane in the extended space, using some other kernel.
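A sketch of the cross-validation stopping criterion described above, reusing the hypothetical svm_directions helper from the earlier sketch (10-fold CV as in the results slide; the details are ours):

```python
# Dimensionality selection sketch: keep adding SVM directions while
# 10-fold cross-validated accuracy on the projected data keeps improving.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def select_dimensionality(X, y, max_dims=10, C=1.0):
    W = svm_directions(X, y, n_directions=max_dims, C=C)    # directions from the previous sketch
    best_score, best_k = -np.inf, 0
    for k in range(1, max_dims + 1):
        Z = X @ W[:k].T                                     # project onto the first k directions
        score = cross_val_score(LinearSVC(C=C, dual=False), Z, y, cv=10).mean()
        if score <= best_score:
            break                                           # the new dimension stopped helping
        best_score, best_k = score, k
    return best_k, best_score
```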

Page 11: Support Vector Machines for Visualization and Dimensionality Reduction

Four datasets

• Parity_8: 8-bit parity dataset (8 binary features and 256 vectors).

• Heart: disease dataset consisting of 270 samples, each described by 13 attributes; 150 cases belong to the group "absence" and 120 to "presence of heart disease".

• Wisconsin: breast cancer data containing 699 samples collected from patients. Among them, 458 biopsies are from patients labeled as "benign" and 241 are labeled as "malignant". Feature 6 has 16 missing values; removing the corresponding vectors leaves 683 examples.

• Leukemia: microarray gene expressions for two types of leukemia (ALL and AML), with a total of 47 ALL and 25 AML samples measured with 7129 probes. Visualization is based on the 100 best features from simple feature ranking using the FDA index.

Page 12: Support Vector Machines for Visualization and Dimensionality Reduction

Parity_8 dataset

Page 13: Support Vector Machines for Visualization and Dimensionality Reduction

Heart dataset

Page 14: Support Vector Machines for Visualization and Dimensionality Reduction

Wisconsin dataset

Page 15: Support Vector Machines for Visualization and Dimensionality Reduction

Leukemia dataset

Page 16: Support Vector Machines for Visualization and Dimensionality Reduction

Estimation of class-conditional probability in the first SVM direction
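The figure itself is not reproduced in this transcript; below is a sketch of how such class-conditional estimates can be obtained by smoothing the projected values g_W(X), here with a Gaussian kernel density estimate (the smoothing choice is ours):

```python
# Class-conditional density sketch along the first SVM direction:
# smooth the projected scores of each class with a Gaussian KDE.
import numpy as np
from scipy.stats import gaussian_kde

def class_conditionals(scores, y, n_points=200):
    grid = np.linspace(scores.min(), scores.max(), n_points)
    return {c: gaussian_kde(scores[y == c])(grid) for c in np.unique(y)}, grid

# Usage: scores = X @ w1; densities, grid = class_conditionals(scores, y)
```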

Page 17: Support Vector Machines for Visualization and Dimensionality Reduction

SVM 10CV accuracy for datasets with reduced features

Page 18: Support Vector Machines for Visualization and Dimensionality Reduction

Summary

• Visualization enables exploratory data analysis and gives more information than accuracy or probability alone.
• It helps to understand what black-box classifiers really do.
• When safety is important, visualization methods may be used to evaluate confidence in predictions.
• Sequential dimensionality reduction based on SVM has several advantages:
  – enables visualization
  – guarantees dimensionality reduction without loss of accuracy
  – increases accuracy of the linear discrimination model
  – is very fast
  – preserves simple interpretation

Page 19: Support Vector Machines for Visualization and Dimensionality Reduction

Summary cont.

• SVM decision borders can be visualized using estimations of class-dependent or posterior probabilities.

• Visualization helps evaluate the reliability of predictions for individual cases, showing them in the context of the known cases.

• We plan to add visualization of probabilities and scatterograms to a few popular SVM packages soon.