Application of Statistical Learning Theory to
DNA Microarray Analysis
by
Sayan Mukherjee
B.S., Princeton University; M.S., Columbia University
Submitted to the Department of Brain Sciences in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
July
(c) Massachusetts Institute of Technology. All rights reserved.
Author ..........................................................
Department of Brain Sciences
Certified by .....................................................
Tomaso Poggio
Uncas and Helen Whitaker Professor of Brain Sciences
Thesis Supervisor
Accepted by ......................................................
Earl Miller
Chairman, Departmental Committee on Graduate Students
Application of Statistical Learning Theory to DNA
Microarray Analysis
by
Sayan Mukherjee
Submitted to the Department of Brain Sciences in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
Abstract
This thesis focuses on applying Support Vector Machines (SVMs), an algorithm founded in the framework of statistical learning theory, to the analysis of DNA microarray data.
The first part of the thesis focuses on extensions to SVMs required for analyzing microarray data. First, the problem of choosing multiple parameters at once for SVMs is addressed. This is used as the basis of a feature selection algorithm that allows us to select which genes are most relevant in discriminating between two classes. A methodology for outputting confidence levels as well as class labels is also developed.
The second part of the thesis consists of a systematic evaluation of a variety of machine learning algorithms on five datasets from four types of molecular cancer classification problems. It also describes some very promising results in predicting treatment outcome from expression data for brain tumors and lymphoma. The algorithms compared are k-Nearest Neighbors (kNN), Naive Bayes (NB), Weighted Voting Average (WV), and Support Vector Machines (SVMs). Learning curves are constructed for the lymphoma treatment and morphology datasets to compare performance as a function of sample size and to address two questions: given enough data, can clinically acceptable error rates be achieved, and how much data is needed to achieve such rates? A simple analytic model is constructed to estimate the variance in classification accuracy due to sample size limitations.
Thesis Supervisor: Tomaso Poggio
Title: Uncas and Helen Whitaker Professor of Brain Sciences
Acknowledgments
I can unequivocally state that I find Acknowledgement sections to be annoying, trite,
and pointless, especially when it is one's own. That being said, I will now proceed to
write such a section.

i thank my advisor Tomaso Poggio for providing me with many opportunities

i thank my first advisor at MIT, Federico Girosi, for being a role model both as a
scientist and a human being

i thank Gadi Geiger for his attempts to keep me out of trouble

i thank Vladimir Vapnik for technical training

i apologize to the Brain and Cognitive Science Department for my profound lack
of knowledge and interest in both brain and cognitive science

i thank, more likely curse, James Schummers and Javid Sadr for teaching me the
little neuroscience i think i might know

i take the liberty of thanking Christine Matter, most formally and effusively and in
deep and hopefully everlasting attachment, for teaching me the word "Unhintergehbarkeit"

to the people that brought me into this world, my two parents, Rina and Shyama
Mukherjee, (unless we are to believe in the sexual theory proposed by the Tralfamadorians)
i owe some love as well as bitterness

to my brother, Neelanjan, whom they also brought into this world, i owe mainly
an apology

i would dedicate my thesis to the three above but i do not see the point
Contents

1 Introduction
1.1 Motivation
1.1.1 DNA Microarray Technology
1.1.2 Molecular Classification of Cancer
1.1.3 Statistical and Computational Challenge
1.2 Statistical Learning as a Framework for Microarray Analysis
1.3 Outline of the thesis
1.3.1 Contributions of the thesis

2 Algorithmic Extensions to SVMs for DNA Microarray Problems
2.1 Support Vector Machines for Classification
2.2 Choosing Multiple Parameters for Support Vector Machines
2.2.1 Single validation estimate
2.2.2 Leave-one-out bounds
2.2.3 Optimizing the kernel parameters
2.2.4 Computing the gradient
2.2.5 An Example
2.3 Feature Selection for Support Vector Machines
2.3.1 Toy data
2.3.2 DNA Microarray Data
2.3.3 Face detection
2.4 Confidence in Predictions and Rejections

3 Comparison of Algorithms using Microarray Data to Predict Cancer Morphology and Treatment Outcome
3.1 Introduction
3.2 Classification Results
3.2.1 Morphology and Lineage
3.2.2 Treatment Outcome
3.2.3 Learning Curves
3.2.4 Removal of Important Genes and Higher Order Information
3.3 Bayes Error and Sample Size Deviations
3.4 Methods
3.4.1 Datasets
3.4.2 Construction of Discriminative Models
3.4.3 Constructing Learning Curves
3.4.4 Sample Size Deviations for a Simple Model

4 Further Remarks
4.1 Summary and Contributions
4.2 Future Work
List of Figures

1-1 An (a) oligonucleotide and (b) cDNA microarray.
2-1 On each of the 16 tiles, the scaling factors of the 16 pixels are identical.
2-2 Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with a polynomial kernel.
2-3 Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with an RBF kernel.
2-4 A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.
2-5 ROC curves for different numbers of PCA gray features.
2-6 Plots of the distance from the hyperplane for test points for the four gene-set sizes (a)-(d). Markers distinguish class ALL from class AML, mistakes are highlighted, and the line indicates the decision boundary.
2-7 Plots of the confidence levels as a function of 1/|d| estimated from a leave-one-out procedure on the training data, for the four gene-set sizes (a)-(d).
2-8 Plot of the confidence levels as a function of 1/d estimated from a leave-one-out procedure on the training data.
3-1 Survival plots for (a) Lymphoma and (b) Brain outcomes.
List of Tables

2.1 Number of errors, rejects, confidence level, and the |d| corresponding to the confidence level for various numbers of genes with the linear SVM discriminating ALL from AML.
2.2 Number of errors as a function of the order of polynomial and the number of important genes removed.
3.1 Estimated Bayes optimal error and deviation.
3.2 Leukemia dataset.
3.3 Lymphoma dataset.
3.4 Medulloblastoma and Glioblastoma dataset.
Chapter 1

Introduction

1.1 Motivation
Recent technological advances in molecular biology and chemistry have brought about
the terminology "high-throughput experiments". Basically, experimental scientists
can now perform thousands of experiments at once. For example, a molecular biologist
can monitor the expression of all the known genes in yeast under various conditions at
once. Similarly, a chemist can synthesize thousands of compounds at the same time.

These experimental scientists, however, now have to analyze the data from these
experiments. For the chemist this might mean asking which of these compounds
might possibly be used as a drug. The molecular biologist may ask which genes are
important for certain cell functions and how a genetic pathway works. These questions
require a bit more than a t-test or an ANOVA. Statistical and computational
procedures to address the scientific questions asked by these experimenters are
developing rapidly. It is also becoming evident that statistical and computational
issues will, as much as experimental methods or technologies, drive which scientific
questions can be answered and what breakthroughs will be made.

This thesis applies a statistical framework called "statistical learning theory" to
a particular high-throughput problem: the analysis of DNA microarray data.
1.1.1 DNA Microarray Technology
Virtually all cell function is carried out by proteins, so a readout of which and
how many proteins are in a cell at a particular moment gives us a great deal of
information about the state of the cell. It was discovered relatively early on in
molecular biology that the abundance and distribution of proteins in cells are correlated
to a large extent with the levels of messenger RNA, mRNA (Brenner et al.;
Nirenberg and Leder).

Various methods are available for detecting and quantifying the amount of mRNA.
The methods take advantage of the sequence complementarity of DNA. The key observation
was that single-stranded DNA binds strongly to nitrocellulose membranes,
which prevents strands from reassociating with each other but permits hybridization
to complementary RNA (Gillespie and Spiegelman). This led to "blotting"
methods, the first of which combined filter hybridization with gel separation of restriction
digests (Southern). The Northern blot was simply such a method applied to
RNA rather than DNA. The problem with these dot-blot techniques is that they are
serial in nature, the mRNA being measured one species at a time, and they are not easy to automate.
DNA microarrays allow one to interrogate the mRNA population expressed by
thousands of genes at once rather than serially as in the dot-blot methods.
The key distinction between DNA microarrays and dot-blots is that
the microarrays use an impermeable, rigid substrate, such as glass, to bind the DNA
sequences. This has many practical advantages over porous membranes and gel pads
(Southern et al.). Two basic types of DNA microarrays are commonly used:
spotted arrays and oligonucleotide arrays. Figure 1-1 shows an example of each
(the figure is from (Lockhart and Winzeler)).

In the spotted array methods (Shalon et al.; Schena et al.) a large
number of cDNAs are prepared from a cDNA library and then spotted onto a glass
slide by a robot. Each cDNA corresponds to one probe, several hundred base pairs long,
taken from near the 3' end of a gene or EST. Each spot on the slide corresponds to a
particular probe. A labelled sample of mRNA is eluted onto the slide and is hybridized
overnight. The arrays are then scanned, and the quantitative fluorescence image, along
with the known position of the cDNA probes, is used to assess whether a gene or EST
is present and its relative abundance. Note that in cDNA arrays the fluorescence image is
a ratio of the abundance of mRNA in two samples.

In the oligonucleotide arrays (Lockhart et al.), multiple probes of 25-mers are
synthesized base by base using photolithography in hundreds of thousands of different
positions on a glass plate. For each gene or EST, multiple probes of length 25 bp are
placed in a particular position of the microarray. Again the probes are taken from
the 3' end of a gene or EST. As in the cDNA array, a labelled sample of mRNA
is eluted onto the slide and is hybridized overnight. The arrays are then scanned,
and the quantitative fluorescence image, along with the known position of the
probes, is used to assess whether a gene or EST is present and its abundance. In the
oligonucleotide arrays the fluorescence image is an absolute measure of the abundance
of mRNA in a sample.

Figure 1-1: An (a) oligonucleotide and (b) cDNA microarray.
1.1.2 Molecular Classification of Cancer
Recent technological advances in molecular genetics (i.e., oligonucleotide and cDNA
microarrays (Hardy; Lockhart et al.; DeRisi et al.)) allow us to easily monitor
the simultaneous and quantitative expression of thousands of genes in clinical
specimens. In this context there is considerable interest in understanding if gene
expression profiles might serve as molecular fingerprints that would allow for a better
and more accurate classification of cancer and other diseases, or biological phenotypes
in general. The analysis of several thousand genes at once, and relating them to
biologically or clinically relevant labels, has required molecular biologists and oncologists
to collaborate with statisticians and computer scientists who have some experience
with producing models from data.

It has been shown that various biological classes can be distinguished with a very
low error rate without using any a priori biological knowledge or expert interpretation
(Golub et al.; Brown et al.; Furey et al.; Mukherjee et al.). This
was done by constructing data-driven models of functions that discriminate between
classes. These models typically consist of two steps: selecting relevant genes or
features, and then building models from the expression patterns of these genes. This
methodology falls in the "learning from examples" paradigm of supervised learning.
In this paradigm a mapping is learned from data, here gene expression patterns, to
a label; this can be a biological class or a continuous value, for example a particular
time in the cell cycle. One then tests the accuracy of this mapping on data that was not
used to generate the model: out-of-sample data.
1.1.3 Statistical and Computational Challenge
From the point of view of statistical learning theory or machine learning, the challenging
aspect of these types of problems is that typically the number of examples or patterns
is relatively small, often a few tens to a few hundred examples, while the dimensionality, the number of
genes whose expression levels are measured, is very large, typically several thousand to tens of thousands in
humans. Given data of this nature, a statistician or machine learning expert would be
tempted to say "Nothing can be said or done," followed by some mumbling about
the curse of dimensionality. However, empirical evidence is mounting that for many
problems accurate models can be built to discriminate biological classes. To understand
why this is so, and to understand which machine learning approaches are best
suited to address this type of data, one needs a non-asymptotic statistical theory. In
this context, statistical learning theory provides a valuable framework for asking questions
about the accuracy of models built on limited data.
1.2 Statistical Learning as a Framework for Microarray Analysis
Statistical learning theory will be used as a framework throughout this thesis for
methodologies to select appropriate classification functions and to determine which genes
are relevant in composing these functions when one is given a small sample of data.
The basic problem is as follows: (a) given ℓ example pairs (gene expression values
and a biological label), construct a classification rule; (b) this rule should generalize
well, that is, correctly classify examples that were not among the example pairs used to construct
the classification rule.

Statistical learning theory allows us to say, in probability, how much the generalization
performance will deviate from the performance on the example pairs as a
function of the number of examples, ℓ, and of a measure of the complexity of the class
of functions used to construct the classification rule. This deviation decreases as the
number of examples increases and increases as the class of functions becomes more
complex.

The Support Vector Machine (SVM) (Vapnik; Cortes and Vapnik) algorithm,
developed in the framework of statistical learning theory, will be the main
algorithm used in this thesis to analyze DNA microarray data. A basic model
selection problem, choosing many parameters at once for the SVM algorithm, will be
addressed, again in the framework of statistical learning theory, and will be applied
to the problem of selecting which genes are relevant in characterizing a particular
biological class.
1.3 Outline of the thesis
Algorithmic Extensions to SVMs for DNA Microarray Problems

Chapter 2 will concentrate on extensions to SVMs that were developed out of
requirements that arose in analyzing microarray data. First the SVM will be introduced
in summary fashion. Then the problem of choosing many parameters at once
for SVMs will be addressed and a solution offered. This solution will form the basis
for a feature selection algorithm. Then we discuss a methodology by which the SVM
will output a confidence level in addition to a class label; one can then reject predictions
with low confidence.
Comparison of Algorithms using Microarray Data to Predict Cancer Morphology and Treatment Outcome

Chapter 3 will consist of a systematic evaluation of a variety of machine learning
algorithms on five datasets from four types of molecular cancer classification problems.
It will also state some promising results in predicting treatment outcome from
expression data for brain tumors and lymphoma. The algorithms compared will be
k-Nearest Neighbors (kNN), Naive Bayes (NB), Weighted Voting Average (WV), and
Support Vector Machines (SVMs). Learning curves are constructed for the lymphoma
treatment and morphology datasets to compare performance as a function of sample
size and to address two questions: given enough data, can clinically acceptable error
rates be achieved, and how much data is needed to achieve such rates? A simple
analytic model is constructed to estimate the variance in classification accuracy due
to sample size limitations. The basic objective of this chapter is to understand the
potential and limitations of these algorithms in solving these types of problems.
1.3.1 Contributions of the thesis
The significant contributions of this thesis are in (a) extending the SVM algorithm
to make it more applicable to the requirements of microarray data analysis, and (b)
a systematic comparison of different algorithms and empirical answers to questions
about the sample sizes required to achieve clinical applicability.

In summary, the contributions of this thesis are:

1. A feature selection methodology for SVMs
2. Computing confidence estimates as well as class labels for SVMs
3. A comparison of algorithms for DNA microarray problems
4. Empirical answers to sample size requirements for two microarray problems
Chapter 2

Algorithmic Extensions to SVMs for DNA Microarray Problems

2.1 Support Vector Machines for Classification
In the problem of supervised learning, one takes a set of input-output pairs Z =
{(x_1, y_1), ..., (x_ℓ, y_ℓ)} and attempts to construct a classifier function f that maps
input vectors x ∈ IR^n onto labels y ∈ Y. We are interested here in pattern recognition
or classification, that is, the case where the set of labels is simply Y = {-1, 1}. The
goal is to find an f ∈ F which minimizes the error (f(x) ≠ y) on future examples.
Learning algorithms usually depend on parameters which control the size of the class
F or the way the search is conducted in F.

The support vector machine can be derived as a particular case of the
regularization framework (Evgeniou et al.; Girosi). Regularization theory
(Tikhonov and Arsenin; Wahba; Girosi and Poggio) formulates the
supervised learning problem as a variational problem of finding the function f that
minimizes the functional

    min_{f ∈ F} H[f] = (1/ℓ) Σ_{i=1}^{ℓ} V(y_i, f(x_i)) + λ ‖f‖²_K,    (2.1)

where V(·,·) is a loss function, ‖f‖²_K is a (semi-)norm in a Reproducing Kernel Hilbert
Space (RKHS) F defined by a (conditionally) positive definite function K called a
kernel, and λ is a regularization parameter. Under general conditions the solution of
equation (2.1) is either

    f(x) = Σ_{i=1}^{ℓ} c_i K(x, x_i),    (2.2)

or

    f(x) = Σ_{i=1}^{ℓ} c_i K(x, x_i) + b,    (2.3)

depending on whether K is positive definite or conditionally positive definite of order
one; for conditionally positive definite K of higher orders more terms would be required
on the right-hand side of (2.3) (Wahba).

Starting from the regularization formulation we will derive the SVM for classification
using the following loss function:

    V(y_i, f(x_i)) = (1 - y_i f(x_i))_+ .    (2.4)
We write our regularized functional as follows:

    min_{f ∈ F} H[f] = (C/ℓ) Σ_{i=1}^{ℓ} (1 - y_i f(x_i))_+ + (1/2) ‖Pf‖²_K,    (2.5)

where P is a projection operator that removes the constant term from any f(x), so
‖P(f(x) + b)‖_K = ‖Pf(x)‖_K for all b. The functional (2.5) can be written as the
following quadratic programming problem proposed in (Cortes and Vapnik):

    min_{f ∈ F, ξ} Φ(f, ξ) = (C/ℓ) Σ_{i=1}^{ℓ} ξ_i² + (1/2) ‖Pf‖²_K    (2.6)

subject to

    y_i f(x_i) ≥ 1 - ξ_i,
    ξ_i ≥ 0 for all i.
The solution of the above problem again has the form

    f(x) = Σ_{i=1}^{ℓ} α_i K(x, x_i) + b,    (2.7)

and the class predicted is the sign of f(x). The solution in general will be sparse
(this is due to the choice of loss function) in that not all α_i will be nonzero; the data
points corresponding to the nonzero α_i are called support vectors.
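The following is a minimal sketch, not the thesis implementation, of fitting a soft-margin SVM and reading off the decision function f(x) = Σ_i α_i y_i K(x, x_i) + b and its support vectors; it assumes scikit-learn and NumPy are available, and the data and parameter values are purely illustrative.

    # Minimal illustrative sketch: fit a soft-margin SVM on synthetic data and inspect
    # the decision values f(x) (the predicted class is sign(f(x))) and the support vectors.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 5) + 1.0, rng.randn(20, 5) - 1.0])  # two synthetic classes
    y = np.hstack([np.ones(20), -np.ones(20)])

    clf = SVC(kernel="rbf", gamma=0.1, C=1.0)   # C trades margin against slack
    clf.fit(X, y)

    d = clf.decision_function(X)                # f(x) for each training point
    print("support vectors:", len(clf.support_), "of", len(X), "training points")
    print("training errors:", int(np.sum(np.sign(d) != y)))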
Historically, SVMs were derived from a different perspective. The initial formulation
was for linear discriminant functions

    f(x) = w · x + b    (2.8)

and the following optimization problem was proposed (Cortes and Vapnik):

    min_{w,b,ξ} Φ(w, b, ξ) = (1/2) ‖w‖² + (C/ℓ) Σ_{i=1}^{ℓ} ξ_i²    (2.9)

subject to

    y_i (w · x_i + b) ≥ 1 - ξ_i,
    ξ_i ≥ 0 for all i.

The solution of the above optimization problem has the same form as (2.7):

    f(x) = Σ_{i=1}^{ℓ} y_i α_i K(x, x_i) + b,    (2.10)

where K(x, x_i) = x · x_i. Historically, the extension to nonlinear discriminant functions
was formulated via potential functions (Aizerman et al.).
The idea was to construct a map from the input space to a high (possibly infinite)
dimensional space K, called feature space, via a function Φ: IR^n → K, and to construct a
linear discriminant in this space:

    f(x) = w · Φ(x) + b = Σ_{p=1}^{N} w_p φ_p(x) + b,    (2.11)

and the following optimization problem was proposed:

    min_{w,b,ξ} Φ(w, b, ξ) = (1/2) ‖w‖²_K + (C/ℓ) Σ_{i=1}^{ℓ} ξ_i²    (2.12)

subject to

    y_i (w · Φ(x_i) + b) ≥ 1 - ξ_i,
    ξ_i ≥ 0 for all i.
This function Φ(x) need never be computed, because the above optimization problem
(2.12) can be written in its dual form (Cortes and Vapnik):

    max_α  Σ_{i=1}^{ℓ} α_i - (1/2) αᵀ M̃ α    (2.13)

subject to

    αᵀ y = 0,
    α ≥ 0,

where y is the vector of labels and the matrix M̃ is the kernel matrix with a ridge
added (Cortes and Vapnik; Cristianini and Shawe-Taylor):

    M̃ = M + (ℓ/(2C)) I,    (2.14)

and

    M_ij = y_i y_j K(x_i, x_j) = y_i y_j Φ(x_i) · Φ(x_j) = y_i y_j Σ_{p=1}^{N} φ_p(x_i) φ_p(x_j)    (2.15)

is well defined. By well defined we mean that the following series converges:

    K(x, y) = Σ_{p=1}^{N} λ_p φ_p(x) φ_p(y),    (2.16)

where N is possibly infinite and λ_p is a sequence of positive numbers; note that we
can renormalize φ_p so that λ_p = 1, and the equivalence between equations (2.15) and
(2.16) is then clear. The convergence in equation (2.16) holds for (conditionally) positive
definite kernels from the following fact (Courant and Hilbert; Mercer):

    K(x, y) = Σ_{i=1}^{N} d_i ψ_i(x) ψ_i(y),    (2.17)

with N possibly infinite, where ψ_i(x) and d_i are the eigenfunctions and eigenvalues of K.
When all the slack variables ξ_i = 0 (the data is separable by a hyperplane in the
space K), the hyperplane that minimizes the functional Φ has a geometric interpretation.
This hyperplane is called the optimal hyperplane and is the one with the maximal distance
(in K space) between the hyperplane and the closest image Φ(x_i) of a vector x_i from
the training data. For nonseparable training data a generalization of this concept is
used.

Suppose that the maximal distance is equal to γ and that the images Φ(x_1), ..., Φ(x_ℓ)
of the training vectors x_1, ..., x_ℓ are within a sphere of radius R. Then the following
theorem holds true (Vapnik and Chapelle).

Theorem 2.1 Given a training set Z = {(x_1, y_1), ..., (x_ℓ, y_ℓ)} of size ℓ, a feature
space H and a hyperplane (w, b), the margin γ(w, b, Z) and the radius R(Z) are
defined by

    γ(w, b, Z) = min_{(x_i, y_i) ∈ Z} y_i (w · Φ(x_i) + b) / ‖w‖,

    R(Z) = min_a max_{x_i} ‖Φ(x_i) - a‖.

The maximum margin algorithm L_ℓ : (IR^n × Y)^ℓ → K × IR takes as input a training
set of size ℓ and returns a hyperplane in feature space such that the margin γ(w, b, Z)
is maximized. Note that assuming the training set separable means that γ > 0. Under
this assumption, for all probability measures P underlying the data Z, the expectation
of the misclassification probability

    p_err(w, b) = P( sign(w · Φ(X) + b) ≠ Y )

has the bound

    E{ p_err(L_{ℓ-1}(Z)) } ≤ (1/ℓ) E{ R²(Z) / γ²(L_ℓ(Z), Z) }.

The expectation is taken over the random draw of a training set Z of size ℓ - 1 for
the left hand side and size ℓ for the right hand side.

This theorem justifies the idea of constructing a hyperplane that separates the
data with a large margin: the larger the margin, the better the performance of the
constructed hyperplane. Note however that according to the theorem the average
performance depends on the ratio E{R²/γ²} and not simply on the margin γ.
2.2 Choosing Multiple Parameters for Support Vector Machines
The SVM algorithm usually depends on several parameters. One of them, denoted C,
controls the tradeoff between margin maximization and error minimization. Other
parameters appear in the nonlinear mapping into feature space; they are called
kernel parameters. For simplicity, we use the trick in equation (2.14) that allows us
to consider C as a kernel parameter, so that all parameters can be treated in a unified
framework.[1]

It is widely acknowledged that a key factor in an SVM's performance is the choice
of the kernel. However, in practice, very few different types of kernels have been
used, due to the difficulty of appropriately tuning the parameters. We present here a
technique that allows one to deal with a large number of parameters and thus allows the
use of more complex kernels.

Our goal is not only to find the hyperplane which maximizes the margin but
also the values of the mapping parameters that yield the best generalization error. To
do so, we propose a minimax approach: maximize the margin over the hyperplane
coefficients and minimize an estimate of the generalization error over the set of kernel
parameters. This last step is performed using a standard gradient descent approach.

We consider a kernel K_θ depending on a set of parameters θ. The decision
function given by an SVM is

    f(x) = sign( Σ_{i=1}^{ℓ} α_i⁰ y_i K_θ(x_i, x) + b ),    (2.18)

where the coefficients α_i⁰ are obtained by maximizing the following functional:

    W(α) = Σ_{i=1}^{ℓ} α_i - (1/2) αᵀ M̃ α    (2.19)

subject to

    αᵀ y = 0,
    α ≥ 0,

where M̃ = M + (ℓ/(2C)) I and M_ij = y_i y_j K_θ(x_i, x_j).

Ideally we would like to choose the value of the kernel parameters that minimizes
the true risk of the SVM classifier. Unfortunately, since this quantity is not accessible,
one has to build estimates or bounds for it. Next, we present several measures of the
expected error rate of an SVM.

[1] This section is based on the work done in (Chapelle et al.).
2.2.1 Single validation estimate

If one has enough data available, it is possible to estimate the true error on a validation
set. This estimate is unbiased and its variance gets smaller as the size of the validation
set increases. If the validation set is {(x'_i, y'_i)}_{1≤i≤p}, the estimate is

    T = (1/p) Σ_{i=1}^{p} Ψ(-y'_i f(x'_i)),    (2.20)

where Ψ is the step function: Ψ(x) = 1 when x > 0 and Ψ(x) = 0 otherwise.
2.2.2 Leave-one-out bounds

The leave-one-out procedure consists of removing one element from the training data,
constructing the decision rule on the basis of the remaining training data, and then
testing on the removed element. In this fashion one tests all ℓ elements of the training
data (using ℓ different decision rules). Let us denote the number of errors in the
leave-one-out procedure by L(x_1, y_1, ..., x_ℓ, y_ℓ). It is known (Luntz and Brailovsky)
that the leave-one-out procedure gives an almost unbiased estimate of the
expected generalization error:

Lemma 2.1

    E{ p_err^{ℓ-1} } = (1/ℓ) E{ L(x_1, y_1, ..., x_ℓ, y_ℓ) },

where p_err^{ℓ-1} is the probability of test error for the machine trained on a sample of size
ℓ - 1, and the expectations are taken over the random choice of the sample.

Although this lemma makes the leave-one-out estimator a good choice when estimating
the generalization error, it is nevertheless very costly to actually compute, since it
requires running the training algorithm ℓ times. The strategy is thus to upper bound
or approximate this estimator by an easy-to-compute quantity T having, if possible,
an analytical expression.
If we denote by f⁰ the classifier obtained when all training examples are present
and f^p the one obtained when example p has been removed, we can write

    L(x_1, y_1, ..., x_ℓ, y_ℓ) = Σ_{p=1}^{ℓ} Ψ(-y_p f^p(x_p)),    (2.21)

which can also be written as

    L(x_1, y_1, ..., x_ℓ, y_ℓ) = Σ_{p=1}^{ℓ} Ψ( -y_p f⁰(x_p) + y_p (f⁰(x_p) - f^p(x_p)) ).

Thus, if U_p is an upper bound for y_p (f⁰(x_p) - f^p(x_p)), we get the following upper
bound on the leave-one-out error:

    L(x_1, y_1, ..., x_ℓ, y_ℓ) ≤ Σ_{p=1}^{ℓ} Ψ(U_p - 1),

since for hard margin SVMs y_p f⁰(x_p) ≥ 1 and Ψ is monotonically increasing.
Support vector count

Since removing a non-support vector from the training set does not change the solution
computed by the machine (i.e., U_p = f⁰(x_p) - f^p(x_p) = 0 for x_p a non-support
vector), we can restrict the preceding sum to support vectors and upper bound each
term in the sum by one, which gives the following bound on the number of errors made
by the leave-one-out procedure (Vapnik):

    T = N_SV / ℓ,

where N_SV denotes the number of support vectors.
Jaakkola-Haussler bound

For SVMs without threshold, analyzing the optimization performed by the SVM algorithm
when computing the leave-one-out error, Jaakkola and Haussler (Jaakkola and Haussler)
proved the inequality

    y_p (f⁰(x_p) - f^p(x_p)) ≤ α_p⁰ K(x_p, x_p) = U_p,

which leads to the following upper bound:

    T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ( α_p⁰ K(x_p, x_p) - 1 ).
Note that Wahba et al. (Wahba et al.) proposed an estimate of the number
of errors made by the leave-one-out procedure, which in the hard margin SVM case
turns out to be

    T = Σ_p α_p⁰ K(x_p, x_p),

which can be seen as an upper bound of the Jaakkola-Haussler one, since Ψ(x - 1) ≤ x
for x ≥ 0.
Opper-Winther bound

For hard margin SVMs without threshold, Opper and Winther (Opper and Winther)
used a method inspired by linear response theory to prove the following: under the
assumption that the set of support vectors does not change when removing the example
p, we have

    y_p (f⁰(x_p) - f^p(x_p)) = α_p⁰ / (K_SV⁻¹)_pp,

where K_SV is the matrix of dot products between support vectors, leading to the
following estimate:

    T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ( α_p⁰ / (K_SV⁻¹)_pp - 1 ).
Radius-margin bound

For SVMs without threshold and with no training errors, Vapnik (Vapnik) proposed
the following upper bound on the number of errors of the leave-one-out procedure:

    T = (1/ℓ) R² / γ²,

where R and γ are the radius and the margin as defined in Theorem 2.1.
Span bound

Vapnik and Chapelle (Vapnik and Chapelle) derived an estimate using the concept
of the span of support vectors.

Under the assumption that the set of support vectors remains the same during
the leave-one-out procedure, the following equality is true:

    y_p (f⁰(x_p) - f^p(x_p)) = α_p⁰ S_p²,

where S_p is the distance between the point Φ(x_p) and the set Λ_p, where

    Λ_p = { Σ_{i≠p, α_i⁰>0} λ_i Φ(x_i) :  Σ_{i≠p} λ_i = 1 }.    (2.22)

This gives the exact number of errors made by the leave-one-out procedure under the
previous assumption:

    T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ( α_p⁰ S_p² - 1 ).    (2.23)
The span estimate can be related to other approximations.

Link with the Jaakkola-Haussler bound

If we consider SVMs without threshold, the constraint Σ λ_i = 1 can be removed
in the definition of the span. Then we can easily upper bound the value of the
span, S_p² ≤ K(x_p, x_p), and thus recover the Jaakkola-Haussler bound.

Link with R²/γ²

For each support vector we have y_p f⁰(x_p) = 1. Since for x ≥ 0, Ψ(x - 1) ≤ x,
the number of errors made by the leave-one-out procedure is bounded by

    Σ_p α_p⁰ S_p².

It has been shown (Vapnik and Chapelle) that the span S_p is bounded
by the diameter of the smallest sphere enclosing the training points, and since
Σ_p α_p⁰ = 1/γ², we finally get

    T ≤ (1/ℓ) 4R² / γ².

A similar derivation to the one used in the span bound has been proposed in
(Joachims), where the leave-one-out error is bounded by |{p : 2 α_p⁰ R² ≥ y_p f⁰(x_p)}|,
with K(x_i, x_i) ≤ R² for all i.
Link with Opper-Winther

When the support vectors do not change, the hard margin case without threshold
gives the same value as the Opper-Winther bound, namely

    S_p² = 1 / (K_SV⁻¹)_pp.
2.2.3 Optimizing the kernel parameters

Let us go back to the SVM algorithm. We assume that the kernel K depends on one or
several parameters, encoded into a vector θ = (θ_1, ..., θ_n). We thus consider a class
of decision functions parametrized by α, b, and θ:

    f_{α,b,θ}(x) = sign( Σ_{i=1}^{ℓ} α_i y_i K_θ(x, x_i) + b ).

We want to choose the values of the parameters α and θ such that W (see equation
(2.19)) is maximized (maximum margin algorithm) and T, the model selection
criterion, is minimized (best kernel parameters). More precisely, for θ fixed, we want
to have α⁰ = arg max_α W(α) and to choose θ⁰ such that

    θ⁰ = arg min_θ T(α⁰, θ).

When θ is a one-dimensional parameter, one typically tries a finite number of
values and picks the one which gives the lowest value of the criterion T. When both
T and the SVM solution are continuous with respect to θ, a better approach has
been proposed by Cristianini et al. (Cristianini et al.): using an incremental
optimization algorithm, one can train an SVM with little effort when θ is changed
by a small amount. However, as soon as θ has more than one component, computing
T(α, θ) for every possible value of θ becomes intractable, and one rather looks for a
way to optimize T along a trajectory in the kernel parameter space.

Using the gradient of a model selection criterion to optimize the model parameters
has been proposed in (Bengio) and demonstrated in the case of linear regression
and time-series prediction. It has also been proposed by (Larsen et al.) to
optimize the regularization parameters of a neural network.
Here we propose an algorithm that alternates the SVM optimization with a gradient
step in the direction of the gradient of T in the parameter space. This can be
achieved by the following iterative procedure:

1. Initialize θ to some value.
2. Using a standard SVM algorithm, find the maximum of the quadratic form W:
       α⁰(θ) = arg max_α W(α, θ).
3. Update the parameters θ such that T is minimized.
   This is typically achieved by a gradient step (see below).
4. Go to step 2, or stop when the minimum of T is reached.

Solving step 3 requires estimating how T varies with θ. We will thus restrict
ourselves to the case where K_θ can be differentiated with respect to θ. Moreover, we
will only consider cases where the gradient of T with respect to θ can be computed
(or approximated).

Note that α⁰ depends implicitly on θ, since α⁰ is defined as the maximum of W.
Then, if we have n kernel parameters (θ_1, ..., θ_n), the total derivative of
T⁰(θ) ≡ T(α⁰(θ), θ) with respect to θ_p is

    ∂T⁰/∂θ_p = ∂T/∂θ_p |_{α⁰ fixed} + (∂T/∂α⁰)(∂α⁰/∂θ_p).

Having computed the gradient ∇_θ T(α⁰, θ), a way of performing step 3 is to make
a gradient step:

    δθ_k = -ε ∂T(α⁰, θ)/∂θ_k,

for some small, and eventually decreasing, ε. The convergence can be improved with
the use of second order derivatives (Newton's method):

    δθ_k = -(Δ_θ T)⁻¹ ∂T(α⁰, θ)/∂θ_k,

where the operator Δ is defined by

    (Δ_θ T)_{i,j} = ∂²T(α⁰, θ) / ∂θ_i ∂θ_j.

In this formulation, additional constraints can be imposed through projection of the
gradient.
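The sketch below illustrates this alternating procedure under simplifying assumptions; it is not the thesis implementation. Step 2 fits an SVM for the current kernel parameters θ, and step 3 takes a step on the criterion T = R²‖w‖² (the bound of Theorem 2.1 up to constants). Two simplifications are assumptions of this sketch only: the gradient is estimated by finite differences rather than by the analytic formulas of the next section, and R² is approximated by the feature-space radius around the centroid rather than by the smallest enclosing sphere.

    # Simplified sketch of the alternating optimization over kernel parameters theta.
    # Assumes scikit-learn and NumPy; all names and parameter values are illustrative.
    import numpy as np
    from sklearn.svm import SVC

    def rbf_kernel(X, Z, theta):
        """Anisotropic RBF kernel: K(x,z) = exp(-sum_i (x_i - z_i)^2 / (2*theta_i^2))."""
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2 / (2.0 * theta ** 2)).sum(-1)
        return np.exp(-d2)

    def criterion(X, y, theta, C=10.0):
        K = rbf_kernel(X, X, theta)
        clf = SVC(kernel="precomputed", C=C).fit(K, y)     # step 2: maximize W for fixed theta
        sv, coef = clf.support_, clf.dual_coef_.ravel()    # coef_i = alpha_i * y_i
        w2 = coef @ K[np.ix_(sv, sv)] @ coef               # ||w||^2 = 1/gamma^2
        R2 = np.max(np.diag(K) - 2 * K.mean(1) + K.mean()) # centroid-based radius estimate
        return R2 * w2                                     # T = R^2 * ||w||^2

    def optimize_theta(X, y, theta, steps=20, lr=0.05, h=1e-3):
        for _ in range(steps):                             # step 3: gradient step on theta
            grad = np.zeros_like(theta)
            for k in range(len(theta)):                    # finite-difference gradient of T
                e = np.zeros_like(theta); e[k] = h
                grad[k] = (criterion(X, y, theta + e) - criterion(X, y, theta - e)) / (2 * h)
            theta = np.maximum(theta - lr * grad, 1e-3)    # keep kernel widths positive
        return theta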
2.2.4 Computing the gradient

In this section we describe the computation of the gradient (with respect to the kernel
parameters) of the different estimates of the generalization error. First, for the bound
R²/γ² (see Theorem 2.1), we obtain a formulation of the derivative of the margin
and of the radius. For the validation error (see equation (2.20)), we show how to
calculate the derivative of the hyperplane parameters α⁰ and b. Finally, details about
the computation of the derivative of the span bound (2.23) are not included in this
thesis; see (Chapelle et al.) for details.

We first begin with a useful lemma.
Lemma 2.2 Suppose we are given an (n × 1) vector v_θ and an (n × n) matrix P_θ,
both smoothly depending on a parameter θ. Consider the function

    L(θ) = max_{x ∈ F} ( xᵀ v_θ - (1/2) xᵀ P_θ x ),

where

    F = { x : bᵀx = c, x ≥ 0 }.

Let x̄ be the vector x at which the maximum in L(θ) is attained. If this maximum is
unique, then

    ∂L(θ)/∂θ = x̄ᵀ ∂v_θ/∂θ - (1/2) x̄ᵀ (∂P_θ/∂θ) x̄.

In other words, it is possible to differentiate L with respect to θ as if x̄ did not depend
on θ. Note that this is also true if one (or both) of the constraints in the definition
of F are removed.
Proof. We first need to express the equality constraint with a Lagrange multiplier
λ and the inequality constraints with Lagrange multipliers μ_i:

    L(θ) = max_{x, λ, μ} ( xᵀ v_θ - (1/2) xᵀ P_θ x - λ(bᵀx - c) + μᵀx ).    (2.24)

At the maximum, the following conditions are verified:

    v_θ - P_θ x̄ = λ̄ b - μ̄,
    bᵀ x̄ = c,
    μ̄_i x̄_i = 0, for all i.

We will not consider differentiability problems here; the interested reader can
find details in (Bonnans and Shapiro). The main result is that whenever x̄ is
unique, L is differentiable.

We have

    ∂L(θ)/∂θ = x̄ᵀ ∂v_θ/∂θ - (1/2) x̄ᵀ (∂P_θ/∂θ) x̄ + (∂x̄/∂θ)ᵀ (v_θ - P_θ x̄),

where the last term can be written as follows:

    (∂x̄/∂θ)ᵀ (v_θ - P_θ x̄) = λ̄ (∂x̄/∂θ)ᵀ b - (∂x̄/∂θ)ᵀ μ̄.

Using the derivatives of the optimality conditions, namely

    (∂x̄/∂θ)ᵀ b = 0,
    (∂μ̄_i/∂θ) x̄_i + μ̄_i (∂x̄_i/∂θ) = 0,

and the fact that either μ̄_i = 0 or x̄_i = 0, we get

    μ̄_i (∂x̄_i/∂θ) = 0, for all i,

hence

    (∂x̄/∂θ)ᵀ (v_θ - P_θ x̄) = 0,

and the result follows.
Computing the derivative of the margin

Note that in feature space the separating hyperplane { x : w · Φ(x) + b = 0 } has the
following expansion:

    w = Σ_{i=1}^{ℓ} α_i⁰ y_i Φ(x_i),

and is normalized such that

    min_{1≤i≤ℓ} y_i (w · Φ(x_i) + b) = 1.

It follows from the definition of the margin in Theorem 2.1 that the latter is γ =
1/‖w‖. Thus we can write the bound R²/γ² as R²‖w‖². The previous lemma enables us
to compute the derivative of ‖w‖². Indeed, it can be shown (Vapnik) that

    (1/2) ‖w‖² = W(α⁰),

and the lemma can be applied to the standard SVM optimization problem (2.19),
giving

    ∂‖w‖²/∂θ_p = - Σ_{i,j=1}^{ℓ} α_i⁰ α_j⁰ y_i y_j ∂K(x_i, x_j)/∂θ_p.
Computing the derivative of the radius

Computing the radius of the smallest sphere enclosing the training points can be
achieved by solving the following quadratic problem (Vapnik):

    R² = max_β  Σ_{i=1}^{ℓ} β_i K(x_i, x_i) - Σ_{i,j=1}^{ℓ} β_i β_j K(x_i, x_j)

under the constraints

    Σ_{i=1}^{ℓ} β_i = 1,
    β_i ≥ 0, for all i.

We can again use the previous lemma to compute the derivative of the radius:

    ∂R²/∂θ_p = Σ_{i=1}^{ℓ} β_i⁰ ∂K(x_i, x_i)/∂θ_p - Σ_{i,j=1}^{ℓ} β_i⁰ β_j⁰ ∂K(x_i, x_j)/∂θ_p,

where β⁰ maximizes the quadratic form above.
Computing the derivative of the span-rule

Now let us consider the span value. Recall that the span of the support vector x_p is
defined as the distance between the point Φ(x_p) and the set Λ_p defined by (2.22).
The value of the span can then be written as

    S_p² = min_λ max_μ ( ‖ Φ(x_p) - Σ_{i≠p} λ_i Φ(x_i) ‖² + 2μ ( Σ_{i≠p} λ_i - 1 ) ),

where we introduced a Lagrange multiplier μ to enforce the constraint Σ_i λ_i = 1.

Introducing the extended vector λ̃ = (λᵀ, μ)ᵀ and the extended matrix of the dot
products between support vectors,

    K̃_SV = ( K_SV  1 ;  1ᵀ  0 ),

the value of the span can be written as

    S_p² = min_λ max_μ ( K(x_p, x_p) - 2 λ̃ᵀ v + λ̃ᵀ H λ̃ ),

where H is the submatrix of K̃_SV with row and column p removed, and v is the p-th
column of K̃_SV (with entry p removed).

From the fact that the optimal value of λ̃ is H⁻¹v, it follows that

    S_p² = K(x_p, x_p) - vᵀ H⁻¹ v = 1 / (K̃_SV⁻¹)_pp.    (2.25)

The last equality comes from the following block matrix identity, known as the
"Woodbury" formula (Lutkepohl):

    ( A_1  Aᵀ ;  A  A_2 )⁻¹ = ( B_1  Bᵀ ;  B  B_2 ),  where B_1 = (A_1 - Aᵀ A_2⁻¹ A)⁻¹.

The closed form we obtain is particularly attractive, since we can compute the
value of the span for each support vector just by inverting the matrix K̃_SV.
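A small numerical sketch of this closed form (illustrative, not the thesis code): build the bordered matrix K̃_SV from the support-vector kernel matrix, invert it once, and read off every S_p² from the diagonal of the inverse.

    # Span values S_p^2 = 1 / (K_tilde_SV^{-1})_pp for all support vectors at once.
    # K_sv is the kernel matrix restricted to the support vectors; illustrative code only.
    import numpy as np

    def span_values(K_sv):
        n = K_sv.shape[0]
        K_tilde = np.zeros((n + 1, n + 1))
        K_tilde[:n, :n] = K_sv          # dot products between support vectors
        K_tilde[:n, n] = 1.0            # border of ones enforcing sum(lambda) = 1
        K_tilde[n, :n] = 1.0
        inv = np.linalg.inv(K_tilde)
        return 1.0 / np.diag(inv)[:n]   # S_p^2 for p = 1, ..., n_SV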
The derivative of the span is then

    ∂S_p²/∂θ = S_p⁴ ( K̃_SV⁻¹ (∂K̃_SV/∂θ) K̃_SV⁻¹ )_pp.
2.2.5 An Example

In this experiment we try to choose the scaling factors for an RBF kernel and for a
polynomial kernel. More precisely, we consider kernels of the following form:

    K(x, z) = exp( - Σ_i (x_i - z_i)² / (2σ_i²) )

and

    K(x, z) = ( 1 + Σ_i x_i z_i / σ_i² )^d.
Most of the experiments have been carried out on the USPS handwritten digit
recognition database. This database consists of 7291 training examples and 2007 test
examples of digit images of size 16 x 16 pixels. We try to classify digits 0 to 4 against
5 to 9. The training set was split into several subsets, and each of these subsets was
used successively during training.

To assess the feasibility of our gradient descent approach for finding kernel
parameters, we first used only 16 parameters, each one corresponding to a scaling factor
for a square tile of 16 pixels, as shown in Figure 2-1.

Figure 2-1: On each of the 16 tiles, the scaling factors of the 16 pixels are identical.

The scaling parameters were initialized to 1. The evolution of the test error and
of the bound R²/γ² is plotted versus the number of iterations of the gradient descent
procedure in Figures 2-2 (polynomial kernel) and 2-3 (RBF kernel).

Figure 2-2: Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with a polynomial kernel.

Figure 2-3: Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with an RBF kernel.

Note that for the polynomial kernel the test error went below the best test error
obtained with only one scaling parameter. Thus, by taking several scaling parameters,
we managed to make the test error decrease.
2.3 Feature Selection for Support Vector Machines
The motivation for feature selection is threefold:

1. Improve generalization error
2. Determine the relevant features (for explanatory purposes)
3. Reduce the dimensionality of the input space (for real-time applications)
Finding optimal scaling parameters can lead to feature selection algorithms. Indeed,
if one of the input components is useless for the classification problem, its
scaling factor is likely to become small. And if a scaling factor becomes small enough,
that component can be removed without affecting the classification algorithm.
This leads to the following idea for feature selection: keep the features whose scaling
factors are the largest. This can also be performed in a principal components space,
where we scale each principal component by a scaling factor.

Previous work on feature selection for SVMs does exist; however, it has been
limited to linear kernels (Bradley and Mangasarian; Guyon et al.), generative
models (Jebara and Jaakkola), or analysis of perturbations of the margin
(Evgeniou et al.; Guyon et al.). Our approach can be applied to nonlinear
problems outside of the generative model framework and can be thought of as a
generalization of the approach in (Guyon et al.; Evgeniou et al.).[2]

We consider two different parametrizations of the kernel. The first one corresponds
to rescaling the data in the input space:

    K_θ(x, z) = K(θ ∗ x, θ ∗ z),

where θ ∈ IR^n and ∗ denotes the componentwise product. The second one corresponds
to rescaling in the principal components space:

    K_θ(x, z) = K(θ ∗ (Vᵀx), θ ∗ (Vᵀz)),

where V is the matrix of principal components.

We compute θ and V using the following iterative procedure (a sketch of this loop is given after the list):

1. Initialize θ = (1, 1, ..., 1).
2. In the case of principal component scaling, perform principal component analysis to compute the matrix V.
3. Solve the SVM optimization problem.
4. Minimize the estimate of the error T with respect to θ with a gradient step.
5. If a local minimum of T is not reached, go to step 3.
6. Discard the dimensions corresponding to small elements of θ and return to step 3.

[2] This section is based on the work done in (Weston et al.).
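The sketch below illustrates steps 5 and 6 under simplifying assumptions (it is not the thesis implementation): once θ has been optimized by the gradient procedure of Section 2.2.3, dimensions are ranked by their scaling factors and only the largest are kept; with principal-component scaling the same ranking is applied to the PCA coordinates instead of the raw inputs. All function names here are illustrative.

    # Illustrative selection step: keep the dimensions with the largest scaling factors,
    # either in the input space or in the principal-components space.
    import numpy as np
    from sklearn.decomposition import PCA

    def keep_largest(theta, n_keep):
        return np.argsort(theta)[::-1][:n_keep]       # indices of the largest scaling factors

    def reduce_input_space(X, theta, n_keep):
        return X[:, keep_largest(theta, n_keep)]      # input-space feature selection

    def reduce_pca_space(X, theta_pca, n_keep):
        # theta_pca holds one scaling factor per principal component considered
        pca = PCA(n_components=len(theta_pca)).fit(X)
        Z = pca.transform(X)                          # coordinates in principal-components space
        return Z[:, keep_largest(theta_pca, n_keep)]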
We demonstrate this idea on two toy problems, where we show that feature selection
reduces generalization error. We then apply our feature selection algorithm
to DNA microarray data, where it is important to find which genes are relevant in
performing the classification; for these problems, too, feature selection appears to improve
performance. Lastly, we apply the algorithm to face detection and
show that we can greatly reduce the input dimension without sacrificing performance.
2.3.1 Toy data

We compared several algorithms:

- The standard SVM algorithm with no feature selection
- Our feature selection algorithm with the estimate R²/γ² and with the span estimate
- The standard SVM applied after feature selection via a filter method

The three filter methods we used choose the m largest features according to Pearson
correlation coefficients, the Fisher criterion score, and the Kolmogorov-Smirnov
test.[3] Note that the Pearson coefficients and the Fisher criterion cannot model
nonlinear dependencies.
[3] F(r) = (μ_r⁺ - μ_r⁻)² / ((σ_r⁺)² + (σ_r⁻)²), where μ_r^± is the mean value of the r-th feature in the positive and negative
classes and σ_r^± is the corresponding standard deviation.
KS_tst(r) = √ℓ sup | P̂{X ≤ f_r} - P̂{X ≤ f_r, y_r = 1} |, where f_r denotes the r-th feature from
each training example and P̂ is the corresponding empirical distribution.
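A short sketch of these filter scores (an illustrative implementation, assuming NumPy and SciPy, with labels coded as +1/-1): each feature is ranked by its absolute Pearson correlation with the label, by the Fisher criterion, or by a two-sample Kolmogorov-Smirnov statistic, and the m highest-scoring features are kept.

    # Illustrative filter-method scores for ranking features; labels y are assumed in {+1, -1}.
    import numpy as np
    from scipy.stats import ks_2samp

    def fisher_score(X, y):
        pos, neg = X[y == 1], X[y == -1]
        return (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0) + 1e-12)

    def pearson_score(X, y):
        Xc, yc = X - X.mean(0), y - y.mean()
        return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

    def ks_score(X, y):
        return np.array([ks_2samp(X[y == 1, r], X[y == -1, r]).statistic
                         for r in range(X.shape[1])])

    def top_m(scores, m):
        return np.argsort(scores)[::-1][:m]   # indices of the m largest features by a given score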
In the two following artificial datasets our objective was to assess the ability of
the algorithm to select a small number of target features in the presence of irrelevant
and redundant features (Weston et al.).

For the first (linear) example, six dimensions out of 202 were relevant. The probability of y = 1
or -1 was equal. The first three features {x1, x2, x3} were drawn as x_i = y N(i, 1), and
the second three features {x4, x5, x6} were drawn as x_i = N(0, 1), with probability
0.7; otherwise the first three were drawn as x_i = N(0, 1) and the second three as
x_i = y N(i - 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., 202.

For the second (nonlinear) example, two dimensions out of 52 were relevant. The probability of
y = 1 or -1 was equal. The data are drawn as follows: if y = -1 then
{x1, x2} are drawn from N(μ1, Σ) or N(μ2, Σ) with equal probability, with μ1 = (-3/4, -3)
and μ2 = (3/4, 3) and Σ = I; if y = 1 then {x1, x2} are drawn again from two normal
distributions with equal probability, with μ1 = (3, -3) and μ2 = (-3, 3) and the
same Σ as before. The rest of the features are noise, x_i = N(0, 20), i = 3, ..., 52.

In the linear problem the first six features have redundancy and the rest of the
features are irrelevant. In the nonlinear problem all but the first two features are
irrelevant.
We used a linear kernel for the linear problem and a second order polynomial
kernel for the nonlinear problem.

We required the feature selection algorithms to keep only the best two features.
The results are shown in Figure 2-4 for various training set sizes, taking the average
test error over repeated runs at each training set size. The Fisher score (not
shown in the graphs due to space constraints) performed almost identically to the correlation
coefficients.

In both problems we clearly see that our method outperforms the other classical
methods for feature selection. In the nonlinear problem, among the filter methods
only the Kolmogorov-Smirnov test improved performance over standard SVMs.
[Figure 2-4: two panels, (a) linear problem and (b) nonlinear problem; the curves compare Span-Bound & Forward Selection, RW-Bound & Gradient, Standard SVMs, Correlation Coefficients, and the Kolmogorov-Smirnov Test.]

Figure 2-4: A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.
2.3.2 DNA Microarray Data

Next, we tested this idea on two leukemia discrimination problems (Golub et al.)
and on a problem of predicting treatment outcome for medulloblastoma.[4] The first
problem was to classify myeloid versus lymphoblastic leukemias based on the expression
of 7129 genes. The training set consists of 38 examples and the test set of 34 examples.
Standard linear SVMs make a small number of errors on the test set. Using gradient
descent on R²/γ² we achieved comparable or better accuracy using only tens of genes,
and even with a handful of genes. Using the Fisher score to select features also resulted
in a small number of errors at both gene-set sizes.

The second leukemia classification problem was discriminating B versus T cells
for lymphoblastic cells (Golub et al.). Standard linear SVMs make a small number of
errors for this problem. Using either the span bound or gradient descent on R²/γ²
reduces the errors using only a few genes, whereas the Fisher score makes more errors
using the same number of genes.

The final problem is one of predicting treatment outcome for patients that have
medulloblastoma. Here there are 60 examples, each with 7129 expression values, in
the dataset, and we use leave-one-out to measure the error rate. A standard SVM
with a Gaussian kernel misclassifies a substantial fraction of the patients, while selecting
a subset of genes by gradient descent on R²/γ² reduced the leave-one-out error.

[4] The database will be available at http://waldo.wi.mit.edu/MPR/data_sets.html
2.3.3 Face detection

The trainable system for detecting frontal and near-frontal views of faces in gray
images presented in (Heisele et al.) gave good results in terms of detection rates.
The system used the gray values of 19 x 19 images as inputs to a second-degree polynomial
kernel SVM. This choice of kernel leads to tens of thousands of features in the feature
space. Searching an image for faces at different scales took several minutes on a PC.
To make the system real-time, reducing the dimensionality of the input space and the
feature space was required. Feature selection in principal components space was
used to reduce the dimensionality of the input space (Serre et al.).

The method was evaluated on the large MIT-CMU test set, consisting of several
hundred faces and tens of millions of non-face patterns. In Figure 2-5 we compare the ROC
curves obtained for different numbers of selected components.

The results showed that increasing the number of components beyond a few dozen does not
improve the performance of the system (Serre et al.).

Figure 2-5: ROC curves for different numbers of PCA gray features.
2.4 Confidence in Predictions and Rejections

For many applications, especially clinical applications, the concept of the confidence
of a prediction, and of rejecting a sample rather than making a call, is very important. In
this section we develop such a methodology for SVMs. Prior work on outputting
probabilities or confidences from SVMs can be found in (Platt; Vapnik).

The basic idea is to reject points near the optimal hyperplane, for which the
classifier may not be very confident of the class label. We introduce confidence levels
based on the SVM output d:

    d = Σ_{i=1}^{ℓ} α_i⁰ y_i K(x_i, x) + b.

These confidence levels are a function of d and are computed from the training data.
This allows us to reject samples below a certain value of |d| (similarly, we could have
two different confidence values d⁺ and d⁻ for the two sides of the optimal hyperplane)
because they do not fall within the confidence level. Introducing confidence levels
resulted in improved accuracy for all four gene-set sizes at the cost of a small number
of rejects, depending on the gene set (Table 2.1). Figure 2-6 plots the d values for the test
data and the classification and rejection intervals. The genes were selected by the
signal-to-noise criterion used in (Golub et al.).
Table 2.1: Number of errors, rejects, confidence level, and the |d| corresponding to the confidence level, for various numbers of genes, with the linear SVM discriminating ALL from AML. (Columns: number of genes, rejects, errors, confidence level, |d| at that confidence level.)
The computation of the confidence level is based on a Bayesian formulation and
the following assumption for SVMs:

    p(c | x) = p(c | d).

We can rewrite p(c | d) as

    p(c | d) = p(d | c) p(c) / p(d).

For our problem, we assume p(1) = p(-1) and that p(d | 1) = p(-d | -1); this allows
us to simply estimate p(|d| | {1, -1}). We make the previous assumptions so that we
only have to estimate one confidence level based upon |d| rather than two confidence
levels, one for class 1 and one for class -1.

We use the leave-one-out estimator on the training data to get ℓ values of |d|. We
then estimate the distribution function F̂(|d|) from these values. This was done
using an automated nonparametric density estimation algorithm which has no free
parameters (Mukherjee and Vapnik; Vapnik and Mukherjee). The confidence
level C(|d|) is simply

    C(|d|) = 1 - F̂(|d|).

Figure 2-7 is a plot of the confidence level as a function of 1/|d| for the four cases. If we
look at the d values for the two classes separately, we obtain two confidence levels (Figure 2-8).
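A minimal sketch of this construction (illustrative, assuming scikit-learn and NumPy): collect the leave-one-out decision values d on the training set, form the distribution of |d|, and report C(|d|) = 1 - F̂(|d|). Here the empirical distribution function stands in for the nonparametric density estimator used in the thesis, which is an assumption of this sketch.

    # Confidence levels from leave-one-out SVM outputs; the empirical CDF replaces the
    # thesis's nonparametric density estimate, and all parameter values are illustrative.
    import numpy as np
    from sklearn.svm import SVC

    def loo_decision_values(X, y, C=1.0):
        d = np.empty(len(y))
        for p in range(len(y)):                     # refit once per held-out training point
            keep = np.arange(len(y)) != p
            clf = SVC(kernel="linear", C=C).fit(X[keep], y[keep])
            d[p] = clf.decision_function(X[p:p + 1])[0]
        return d

    def confidence(abs_d, loo_abs_d):
        """C(|d|) = 1 - F(|d|), with F the empirical CDF of the leave-one-out |d| values."""
        F = np.mean(loo_abs_d <= abs_d)
        return 1.0 - F

    # A rejection rule then discards any test sample whose |d| lies below the threshold
    # associated with the desired confidence level.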
Figure 2-6: Plots of the distance from the hyperplane for test points for the four gene-set sizes (a)-(d). Markers distinguish class ALL from class AML, mistakes are highlighted, and the line indicates the decision boundary.
Figure 2-7: Plots of the confidence levels as a function of 1/|d| estimated from a leave-one-out procedure on the training data, for the four gene-set sizes (a)-(d).
Figure 2-8: Plot of the confidence levels as a function of 1/d estimated from a leave-one-out procedure on the training data.
Chapter 3

Comparison of Algorithms using Microarray Data to Predict Cancer Morphology and Treatment Outcome

3.1 Introduction
Effective cancer treatment depends upon the availability of curative therapies, accurate diagnosis of the disease so that the appropriate therapy can be utilized, and the ability to classify patients according to groups who are likely to benefit from each existing therapy. For most tumors, anatomical and morphological analysis remains the standard method by which clinical decision making is directed. Methods of cancer detection have improved over the last two decades, but there are still serious limitations in our ability to accurately classify tumors in a manner that would allow for more rational and systematic clinical decision making. Using standard techniques, the empirical classification for a given tumor can vary dramatically from patient to patient. This may be due to biological differences that cannot yet be measured by standard methods. A particular class of biological differences that holds promise is the difference in gene expression patterns for different tumor types. Cancer is a disease where the pattern of expression of genes involved in differentiation and cell growth is altered, resulting in a state of uncontrolled growth that becomes clinically apparent (Hanahan and Weinberg, 2000; Weinberg and Varmus, 1993). Recent technological advances in molecular genetics, i.e. oligonucleotide and cDNA microarrays (Hardy, 1999), allow us to monitor easily the simultaneous and quantitative expression of thousands of genes in clinical specimens. In this context there is considerable interest in understanding whether gene expression profiles might serve as molecular fingerprints that would allow for a better and more accurate classification of cancer and other diseases or biological phenotypes in general. These molecular fingerprints may be used not just to create gene lists that are particular to a certain taxonomy of cancer, but also to discover new taxonomies and possibly uncover the structure of genetic networks not yet understood.
It has been shown (Golub et al., 1999) that two types of leukemia, acute myeloid and acute lymphoblastic, can be distinguished with a very low error rate without using any a priori biological knowledge or expert interpretation. The methodology was rather general and followed the paradigm of data collection, feature selection, model building by cross-validation, and model testing on an independent dataset. The classifier used a weighted voting algorithm, and by achieving low error rates it proved the feasibility of performing molecular classification using only gene expression patterns. Over the last year we have also built binary classifiers to distinguish normal vs. malignant tissues and to differentiate lineages: B-cell vs. T-cell leukemia (Slonim et al., 2000), glioblastoma vs. medulloblastoma (Pomeroy et al., 2001), and follicular vs. large B-cell lymphoma (Shipp et al., 2001). Given enough samples, these kinds of results can be achieved with low error rates because there are many relevant features ("marker" genes) correlated with the target class that can be exploited by the classifier algorithm.
Other problems, such as outcome prediction (low vs. high risk), i.e. who will respond to clinical treatments such as chemotherapy or radiation, are more problematic and challenging due to the small number of marker genes in a background of morphologically identical samples and the intrinsic complexity of the phenotype. Specifically, for the two treatment outcome problems, medulloblastoma and large B-cell lymphoma, the algorithms achieve error rates well above those of the morphology problems. Based on this, one can designate a hierarchy of problems of increasing complexity as follows:
1. Histological differences: normal vs. malignant, skin vs. brain tissue.
2. Morphologies: different leukemia types, ALL vs. AML.
3. Lineage: B-cell vs. T-cell, follicular vs. large B-cell lymphoma.
4. Outcome: treatment outcome, relapse, or drug sensitivity.
In this work we make a systematic study of algorithms and molecular classification problems with the goal of understanding which are the best algorithms and methodologies, and also of identifying the main obstacles that hamper computational approaches to molecular classification problems in the fourth class. This class is probably the most important from a clinical perspective because there are almost no effective traditional methods to perform this critical classification.
The core result and most clinically relevant part of this study is the comparison of several algorithms: k-Nearest Neighbors (kNN) (Duda and Hart, 1973), Naive Bayes (NB) (Duda and Hart, 1973), Weighted Voting Average (WV) (Slonim et al., 2000), and SVMs (Vapnik, 1998). We examined the performance of these algorithms for all the datasets above in terms of various error measures (for example, two of the measures used were the number of errors when a prediction is made for all samples, and the number of errors and percentage of calls made when the classifiers rejected low confidence predictions). For clinical applications error rates must be very low. For all the discrimination tasks except for treatment outcome prediction we fall within an acceptable range.
In general, for reasonable (consistent) classifiers, as more samples are given the classifiers converge to the smallest achievable error rate for a given dataset, often called the Bayes error. An important question is the following: for a given dataset, how much will the addition of n samples lower the error rate? For the treatment outcome problem this translates to the practical question of, given our performance with n samples, about how many more are needed to achieve a desired error rate, and is it even possible to achieve such a rate? In addition, given enough samples all reasonable algorithms will perform well; however, some may require an order of magnitude more samples to reach this performance level. Since we are always faced with few samples in gene expression studies this issue is very important. The above questions are addressed by constructing learning curves for the above algorithms, fitting the empirical error rates to the following model of error rate as a function of the number of samples:

    Err(n) = a n^{-α} + b,

where n is the number of samples, Err(n) is the error rate given n samples, a is a constant, α > 0 is the rate of convergence of the algorithm, and b is the error rate as the number of samples goes to infinity, which is the smallest achievable error rate. This analysis was done for the lymphoma treatment and brain outcome datasets.
Lastly, we build a simple model of the expression data and derive closed form expressions for the performance of a standard type of classifier (a perceptron or hyperplane discriminant) for this model. This analysis allows us to state both the optimal accuracy of a classifier and the variance in the accuracy as a function of sample size, strength of the underlying signal in the model, and number of informative genes. This analysis helps give us an indication of how much of a change in performance accuracy is statistically significant. We applied this analysis to our datasets by assuming the datasets follow the simple model, choosing the number of informative genes, and using empirical estimates from the data to determine the signal strength.
Classification Results

We state classification results for the algorithms on the three different datasets. As expected, the error rates were lower for morphology classification than for treatment outcome, following the general hierarchy of increasing complexity described in the introduction. See the Methods section below for details about the classifiers and the datasets.
[Figure omitted: classification results for the morphology and lineage datasets.]
Morphology and Lineage

The results for the various datasets are listed in the results figure above. Most algorithms performed very well for the leukemia morphology and lineage distinctions (AML vs. ALL, B- vs. T-cell), achieving zero errors. For the follicular vs. large B-cell lymphoma distinction, SVM and kNN produced the smallest error rates. The reason there are some mistakes is presumably because the problem is slightly more difficult than the one corresponding to the leukemias, or because, given the larger dataset size, the chances of mislabeled samples are higher. For the glioblastoma vs. medulloblastoma distinction all algorithms perform well, achieving low error rates. One of the reasons these classifications can be done with such small error rates is that there are many relevant features highly correlated with the target class that are used by the different algorithms.
Treatment Outcome

The results for treatment outcome are listed in the results figure below. Prediction results for all algorithms for leukemia treatment are at chance level, or the error rate that would be achieved by always predicting the majority class. This might be due to the small sample size, the inherent complexity of the problem, the heterogeneity of the samples, or a basic lack of correlation between outcome and the expression values. For the lymphoma and brain outcome predictions the results are more promising, and the algorithms achieve error rates well below chance.
[Figure omitted: classification results for the treatment outcome datasets.]
[Figure: Survival plots for (a) Lymphoma and (b) Brain outcomes.]
For outcome prediction it is important to consider not only global error rates but also the false positive and false negative rates. The reason is that clinically the classes are not symmetric. For example, classifying a high risk patient in the low risk class implies that the patient may get less treatment (e.g. chemotherapy). For this reason, when comparing algorithms one has to look at the errors per class or the average error per class. For the medulloblastoma dataset we list two models generated by the SVM: one using global error as an optimization criterion, the other using the average error per class as an optimization criterion.
Another way to measure the accuracy of the algorithms besides the error rate is to compute the Kaplan-Meier survival plot and statistic (Lachin, 2000). Survival statistics incorporate time-to-event information into the classification problem, so a patient that is misclassified as alive is penalized more if the patient died early in the study rather than towards the end. One of the most common survival estimators is the Kaplan-Meier estimator (Lachin, 2000). One constructs empirical distribution functions, with respect to time, of the number of patients alive in the two classes: those predicted to live and those predicted to die. One then tests against the null hypothesis that these two empirical distributions were drawn from the same probability distribution. As can be seen in the survival figure above, the p-values of the Kaplan-Meier statistic for the lymphoma (WV) and brain (kNN) outcome predictions were both small. Note that even relatively high error rates may still be significant from a survival perspective.
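The following is a small self-contained sketch of the Kaplan-Meier estimate referred to above, written in Python with hypothetical variable names (times, events, pred); a log-rank style test comparing the two estimated curves would then supply p-values of the kind quoted in the text.

    import numpy as np

    def kaplan_meier(times, events):
        """Kaplan-Meier survival estimate.
        times  : follow-up time for each patient
        events : 1 if the patient died (event observed), 0 if censored
        Returns the event times and the estimated survival after each."""
        order = np.argsort(times)
        times, events = np.asarray(times)[order], np.asarray(events)[order]
        at_risk, surv = len(times), 1.0
        out_t, out_s = [], []
        for t in np.unique(times):
            mask = times == t
            deaths = events[mask].sum()
            if deaths > 0:
                surv *= 1.0 - deaths / at_risk
                out_t.append(t)
                out_s.append(surv)
            at_risk -= mask.sum()   # remove deaths and censored cases at time t
        return np.array(out_t), np.array(out_s)

    # Survival curves for the two predicted groups (hypothetical arrays):
    # t_lo, s_lo = kaplan_meier(times[pred == "low"], events[pred == "low"])
    # t_hi, s_hi = kaplan_meier(times[pred == "high"], events[pred == "high"])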
Learning Curves

Learning curves are constructed for both the lymphoma treatment and brain outcome datasets. For details on how each data point was estimated and the way the curve was fit, see the Methods subsection on constructing learning curves.
The corresponding figures plot the learning curves and the data points used to fit the curves for the treatment and morphology data, respectively. The algorithms considered in these plots are the SVM and kNN algorithms. One can see that the SVM has a quicker rate of convergence; it does better given fewer training examples than the kNN algorithm. Four curves were estimated, two per dataset, each of the form

    Err(n) = a n^{-α} + b,

where Err(n) is the error rate as a function of the number of samples n, the constant term b gives us an estimate of the smallest error rate achievable, and α gives us the rate of convergence of the algorithm. One of the fitted curves is the learning curve for brain treatment outcome using the SVM algorithm, and another is the learning curve for lymphoma outcome using the SVM algorithm.
Removal of Important Genes and Higher Order Information

We examined how well the SVM performed when the most important genes according to the signal-to-noise ratio criterion (Golub et al., 1999) were removed from the leukemia morphology discrimination problem. We also examined whether higher order interactions helped when important genes are removed.
Higher order statistics seem in fact to increase performance when the problem is artificially made more difficult by removing a moderate number of the top features. Beyond this point, higher order kernels hindered performance. This result is consistent with the concepts of generalization error upon which the SVM algorithm is based. When the data is less noisy, the advantage of the flexibility of a more complicated model outweighs the disadvantage of the possibility of overfitting. When the data is noisy, these aspects are reversed, so a simpler model performs better. The SVM performed well until a large number of features were removed (see the table below). Biologically this is interesting because it hints that genes do interact and are not independent.
[Table: number of errors as a function of the order of the polynomial kernel (1st, 2nd, and 3rd order) and the number of important genes removed.]
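A rough sketch of this experiment is given below, assuming the signal-to-noise ratio of Golub et al. as the ranking criterion and scikit-learn's polynomial-kernel SVM; the regularization and kernel constants are illustrative, not the values used in the thesis.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score, LeaveOneOut

    def signal_to_noise(X, y):
        """Signal-to-noise ratio of Golub et al. for each gene:
        (mu_1 - mu_2) / (sigma_1 + sigma_2), with labels in {+1, -1}."""
        X1, X2 = X[y == 1], X[y == -1]
        return (X1.mean(0) - X2.mean(0)) / (X1.std(0) + X2.std(0))

    def errors_after_removal(X, y, n_removed, degree):
        """Leave-one-out errors of a polynomial-kernel SVM after discarding
        the n_removed genes with the largest |signal-to-noise|."""
        rank = np.argsort(-np.abs(signal_to_noise(X, y)))
        keep = rank[n_removed:]                      # drop the top-ranked genes
        clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=1.0)
        scores = cross_val_score(clf, X[:, keep], y, cv=LeaveOneOut())
        return int(round((1.0 - scores.mean()) * len(y)))   # number of errors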
Bayes Error and Sample Size Deviations
In this section we introduce a very simple model of the gene expression data. Given this simple model and a linear classifier, we compute the accuracy of the optimal classifier and the deviation from this accuracy due to the fact that we have a finite sample size n. See the Methods subsection on sample size deviations for details of the derivation used in this section.
We assume that the data from the two classes are drawn from Gaussian distributions that are independent across features or genes. We also assume that the prior probability of each class is equal. One can compute the generalization error of this model when we separate with a hyperplane (linear classifier). Knowing the parameters of the two class distributions, we can compute the Bayes optimal hyperplane, the hyperplane that results in the smallest generalization error. The variation in estimating the hyperplane due to the fact that we have few samples is then computed. We can then estimate the deviation from the Bayes optimal error.
We now have a procedure to compute the Bayes optimal error and the finite sample deviation from this error under the model assumptions above. We apply this analysis to the various datasets. The above analysis requires knowledge of the means and variances of the distributions of the two classes for each gene. We replace these values with the sample means and sample variances of the classes to construct the table below.
Data set                  Sample size   Number of genes   Bayes optimal error   Deviation
Leukemia Morphology       ��            �                 �%                    �%
Leukemia Lineage (ALL)    ��            �                 �%                    �%
Lymphoma Morphology       ��            ��                �%                    ��%
Lymphoma Outcome          ��            �                 ��%                   ��%
Brain Morphology          �             �                 �%                    �%
Brain Outcome             ��            �                 ��%                   ��%

Table: Estimated Bayes optimal error and deviation.
Methods
The datasets, data preparation, classification algorithms used, and details of the models follow.
Datasets
The three datasets used are good representatives of liquid and solid primary tumors with all the complexities of real world clinical tumor samples. All the datasets correspond to binary discriminations or labels.
Leukemia
A set of �� samples was derived from bone marrow aspirates performed at the time of diagnosis, prior to any chemotherapy. The dataset contains acute lymphoblastic leukemia (ALL, B- and T-cell) and acute myeloid leukemia (AML). These samples were randomly selected from the leukemia cell bank based on availability. Samples were selected without regard to immunophenotype, cytogenetics, or other molecular features.
Data set                       Total samples   Class 1      Class 2
                                               ALL          AML
Leukemia Morphology (train)    ��              ��           ��
Leukemia Morphology (test)     ��              ��           ��
                                               B-cell       T-cell
Leukemia Lineage (ALL)         ��              ��           ��
                                               Low risk     High risk
Leukemia Outcome (AML)         ��              ��           ��

Table: Leukemia dataset.
Low risk means patients who were alive at the time of the last survey or patients who died from non-disease-related causes. High risk corresponds to patients who died from disease after treatment.
Lymphoma
These datasets contain samples corresponding to excisional lymph node biopsy specimens obtained from �� patients with follicular lymphoma (FSC) and �� with diffuse large cell lymphoma (DLCL).
The average follow up for these patients is �� months. Low risk means patients who were alive at the time of the last survey or patients who died from non-disease-related causes. High risk corresponds to patients who died from disease after treatment.
Data set               Total samples   Class 1      Class 2
                                       FSC          DLCL
Lymphoma Morphology    ��              ��           ��
                                       Low risk     High risk
Lymphoma Outcome       ��              ��           ��

Table: Lymphoma dataset.
Brain
These samples correspond to glioblastoma and childhood medulloblastoma (cerebellum) tumors obtained from several sources.
Data set             Total samples   Class 1      Class 2
                                     Glioma       MD
Brain Morphology     ��              ��           ��
                                     Low risk     High risk
Brain Outcome        ��              ��           ��

Table: Medulloblastoma and Glioblastoma dataset.
The samples included in the outcome dataset had at least two years of follow up after treatment. Low risk corresponds to patients who were alive at the last survey. High risk corresponds to patients who died in the first two years after treatment. The long term survival rate of medulloblastoma patients is only moderate, and the survivors generally suffer adverse side effects as a consequence of radiation therapy. This is one of the reasons it is important to find better methods of outcome classification.
Data Preparation
The biological samples used in this work were obtained from different tumor banks, but the process of extracting RNA and the basic laboratory protocol were essentially the same. For details see the protocols section of the web site http://www.genome.wi.mit.edu/MPR. The samples were snap frozen in liquid nitrogen and stored at �� degrees. All samples were obtained prior to the patients receiving any chemotherapy or radiation treatment. The RNA was hybridized overnight to Affymetrix high-density oligonucleotide microarrays containing probes for ���� known human genes and ��� expressed sequence tags (ESTs). The arrays were scanned with a Hewlett-Packard scanner, and the expression levels for each gene were calculated using Affymetrix GENECHIP analysis software. The data obtained from the arrays were rescaled in order to adjust for minor differences in overall array intensity.
Construction of Discriminative Models
This is the general methodology that we follow:
1. Obtain and filter expression data. Expression values are thresholded below and above fixed floor and ceiling values; then we apply a variation filter (a minimum fold change and a minimum absolute variation across samples) to eliminate genes that do not change significantly across the samples in the dataset.
2. Define a target class based on morphological or clinical information. Here we choose either the morphology, the lineage, or the long term clinical treatment outcome of the samples (patients).
3. Select the features ("marker" genes) with the highest correlation with the target class.
4. Build a classifier in cross-validation (leave-one-out) and measure the error rate.
5. Build a final classifier using all the samples and measure the test error if an independent test set is available.
Since the dimensionality of the datasets is quite large (thousands of potential features), the feature selection process can be quite important for some algorithms; a sketch of the filtering in step 1 is given below.
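A minimal sketch of the filtering in step 1 follows. The floor, ceiling, fold-change, and minimum-variation thresholds are left as parameters, since the exact values are not recoverable from this transcript, and the precise form of the variation filter (fold change plus absolute difference) is an assumption of the sketch.

    import numpy as np

    def filter_expression(X, floor, ceiling, fold_change, min_delta):
        """Threshold expression values and apply a variation filter.
        X has shape (samples, genes); floor is assumed to be positive."""
        X = np.clip(X, floor, ceiling)                    # threshold below and above
        gene_max, gene_min = X.max(axis=0), X.min(axis=0)
        keep = (gene_max / gene_min >= fold_change) & \
               (gene_max - gene_min >= min_delta)          # variation filter
        return X[:, keep], keep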
A description of the various classifiers follows.
Weighted Voting
The weighted voting algorithm used was identical to that used in (Golub et al., 1999; Slonim et al., 2000); a brief sketch follows.
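The sketch below follows the published description of that algorithm: each selected gene casts a vote weighted by its signal-to-noise ratio, with a per-gene decision boundary at the midpoint of the class means. The number of marker genes is an illustrative parameter, not the value used in the thesis.

    import numpy as np

    class WeightedVoting:
        """Sketch of the weighted voting classifier (labels in {+1, -1})."""
        def fit(self, X, y, n_genes=50):
            X1, X2 = X[y == 1], X[y == -1]
            m1, m2 = X1.mean(0), X2.mean(0)
            self.a = (m1 - m2) / (X1.std(0) + X2.std(0))   # signal-to-noise weights
            self.b = (m1 + m2) / 2.0                        # per-gene decision boundary
            self.genes = np.argsort(-np.abs(self.a))[:n_genes]
            return self

        def predict(self, X):
            votes = self.a[self.genes] * (X[:, self.genes] - self.b[self.genes])
            return np.where(votes.sum(axis=1) >= 0, 1, -1)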
k-Nearest Neighbors

The kNN algorithm uses the cosine distance as the metric:

    d(x, y) = 1 − (x · y) / (‖x‖ ‖y‖).

Two variations of the standard procedure of voting among the nearest neighbors were used. In the first variation we used a weighted vote of the k nearest neighbors, where the weight was one divided by the Euclidean distance. In the second variation the weight was 1/k, where k is the rank of the neighbor. Note that for this weighting to have an effect different from the standard kNN vote, k must be greater than 2. (The series Σ_k 1/k is divergent, and it is for this reason that this type of weighting can give different results than the standard vote.)
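A minimal sketch of this classifier, with the 1/rank weighting of the second variation, might look as follows (labels are assumed to be +1/-1):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5, rank_weighted=True):
        """k-NN with cosine distance d(x, y) = 1 - <x, y>/(||x|| ||y||);
        with rank_weighted=True the j-th nearest neighbour gets weight 1/j."""
        sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
        dist = 1.0 - sims
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / np.arange(1, k + 1) if rank_weighted else np.ones(k)
        score = np.sum(weights * y_train[nearest])
        return 1 if score >= 0 else -1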
Naive Bayes

In our implementation we make the following assumptions (a small sketch under these assumptions follows the list):
1. the expression of each gene follows a Gaussian distribution for each class;
2. the expression levels of different genes are independent;
3. the prior probabilities for each class are equal.
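Under these three assumptions the classifier reduces to a few lines; the sketch below is illustrative and assumes +1/-1 labels.

    import numpy as np

    def naive_bayes_predict(X_train, y_train, x, eps=1e-9):
        """Gaussian naive Bayes with per-gene Gaussians, independent genes,
        and equal class priors."""
        scores = {}
        for c in (-1, 1):
            Xc = X_train[y_train == c]
            mu, sd = Xc.mean(0), Xc.std(0) + eps
            # log-likelihood of x under the product of per-gene Gaussians
            scores[c] = np.sum(-0.5 * ((x - mu) / sd) ** 2 - np.log(sd))
        return max(scores, key=scores.get)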
Support Vector Machines

We use the SVM with the radius-margin ratio (RMR) as our feature selection criterion.
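The RMR criterion itself is developed in the earlier chapter on extensions to SVMs; the sketch below only illustrates the quantity R²‖w‖² for a linear kernel, and it approximates the radius R crudely by the largest distance of a sample from the data centroid rather than by the exact smallest enclosing sphere, so it should be read as an assumption-laden illustration rather than the thesis's implementation.

    import numpy as np
    from sklearn.svm import SVC

    def radius_margin_ratio(X, y, C=1.0):
        """Rough estimate of R^2 * ||w||^2 for a linear SVM."""
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w_sq = float(np.sum(clf.coef_ ** 2))          # ||w||^2 = 1 / margin^2
        R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))  # crude radius
        return R ** 2 * w_sq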
Constructing Learning Curves

Learning curves are constructed for both the lymphoma treatment and follicular vs. large B-cell datasets. Here we give the methodology for constructing the learning curves.
A range of training set sizes was used for the lymphoma treatment dataset and for the follicular vs. large B-cell dataset, with a held-out test set of fixed size in both cases. The numbers of training and test samples in the two classes were drawn proportionally to the numbers of samples in the population. We drew fifty respective training and test sets, where each respective training and test set did not contain overlapping points.
We set the parameters of the algorithms using points not in the test or training sets for cross-validation.
We then average the results of the fifty trials and fit the following function

    Err(n) = a n^{-α} + b

using a least squares criterion to obtain the learning curve; a sketch of this fit is given below. We also plot standard deviation bars based upon the standard deviation over the fifty trials.
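A minimal sketch of the least squares fit, using scipy's curve_fit with illustrative starting values and bounds, is:

    import numpy as np
    from scipy.optimize import curve_fit

    def fit_learning_curve(sizes, error_rates):
        """Least-squares fit of Err(n) = a * n**(-alpha) + b.
        sizes       : training set sizes used
        error_rates : mean test error at each size, averaged over the trials
        Returns the fitted (a, alpha, b)."""
        model = lambda n, a, alpha, b: a * np.power(n, -alpha) + b
        (a, alpha, b), _ = curve_fit(model,
                                     np.asarray(sizes, float),
                                     np.asarray(error_rates, float),
                                     p0=(1.0, 0.5, 0.1),
                                     bounds=([0, 0, 0], [np.inf, np.inf, 1.0]))
        return a, alpha, b

The constant term b then serves as the estimate of the smallest achievable error rate and alpha as the rate of convergence, as described above.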
Sample Size Deviations for a Simple Model

Here we give the details of how the Bayes optimal error, and the deviation from that error as a function of sample size, are derived.
We assume that the data from the two classes are drawn from Gaussian distributions that are independent across dimensions or features:

    P_1(x) ∝ exp( − Σ_{i=1..d} (x_i − t_{1i})² / (2 σ_{1i}²) ),

    P_2(x) ∝ exp( − Σ_{i=1..d} (x_i − t_{2i})² / (2 σ_{2i}²) ).

We also assume that the prior probability of each class is equal. One can compute the generalization error of this model given a hyperplane w:

    A = (1/2) [ ∫_{x·w ≤ 0} P_1(x) dx + ∫_{x·w > 0} P_2(x) dx ],

which can be simplified to the following:

    A = (1/2) [ erf( (w · t_1) / ‖w̃_1‖ ) − erf( −(w · t_2) / ‖w̃_2‖ ) ],

where w̃_j is the vector with elements w̃_{ji} = σ_{ji} w_i.
Now, given the two class distributions, we can compute the Bayes optimal hyperplane as that for which

    (w · t_1) / ‖w̃_1‖ = − (w · t_2) / ‖w̃_2‖.

Note that the covariance matrices for both classes are diagonal, so this equality is achieved component-wise; this results in

    w_{bi} = ( t_{1i} σ_{2i} + t_{2i} σ_{1i} ) / ( σ_{1i} + σ_{2i} ),

where w_{bi} is the ith component of the Bayes optimal hyperplane.
We now need to compute the variation in the elements w_{bi} due to finite sample size. Since the standard deviations of the two classes are estimated from the data, the estimation error in the standard deviations is simply the standard error for a sample size of n:

    σ̂_{1i} = σ_{1i} ± σ_{1i} / √n.

We now need to compute the deviation from the Bayes optimal point for each component. First we replace σ_{1i} in the expression for w_{bi} with σ̂_{1i}, where σ̂_{1i} = σ_{1i} ± σ_{1i}/√n. We then replace the class means t_{1i} with sample means. We can write the following random variable ζ as the numerator of the Bayes optimal point computation:

    ζ = x σ̂_{2i} + y σ̂_{1i},

where x is drawn from the distribution of class 1 and y is drawn from class 2. This random variable can be thought of as the distribution of deviations from an approximate Bayes optimal point. The random variable ζ is distributed as follows:

    P(ζ) = (1 / (√(2π) c)) exp( −(ζ − b_a)² / (2c²) ),

where c is the standard deviation of ζ and b_a = t_{1i} σ̂_{2i} + t_{2i} σ̂_{1i}. Given the above distribution, if we assume that (the deviation can be derived without this assumption, but it is more complicated)

    b_a = w_{bi},

then we get the following equality:

    P{ |ζ − w_{bi}| ≤ k c / √n } = erf(k),

where c is the standard deviation in the distribution above and n is the number of examples from the two classes. For the rest of the section we will set k = 2. From this analysis we deviate each component w_{bi} by ± k c / √n in the accuracy expression above and compute the generalization error.
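As a numerical cross-check of this analysis, the sketch below swaps the closed-form perturbation above for a direct Monte Carlo simulation under the same independent-Gaussian model. It uses a simple diagonal plug-in linear rule as the finite-sample classifier; that choice, and the parameter names, are assumptions of the sketch rather than the exact construction of the text.

    import numpy as np
    from scipy.special import erf

    def gaussian_cdf(z):
        return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

    def true_error(w, b, t1, t2, s1, s2):
        """Exact error of sign(w.x + b) when class 1 ~ N(t1, diag(s1^2)) and
        class 2 ~ N(t2, diag(s2^2)), equal priors, independent genes."""
        e1 = gaussian_cdf(-(w @ t1 + b) / np.linalg.norm(w * s1))  # class 1 misclassified
        e2 = gaussian_cdf((w @ t2 + b) / np.linalg.norm(w * s2))   # class 2 misclassified
        return 0.5 * (e1 + e2)

    def error_vs_sample_size(t1, t2, s1, s2, n, trials=1000, seed=0):
        """Mean and spread of the error of a plug-in linear rule trained on
        n samples per class drawn from the model."""
        rng = np.random.default_rng(seed)
        errs = []
        for _ in range(trials):
            x1 = rng.normal(t1, s1, size=(n, len(t1)))
            x2 = rng.normal(t2, s2, size=(n, len(t2)))
            m1, m2 = x1.mean(0), x2.mean(0)
            pooled = 0.5 * (x1.var(0) + x2.var(0)) + 1e-12
            w = (m1 - m2) / pooled            # diagonal plug-in direction
            b = -w @ (m1 + m2) / 2.0          # midpoint threshold
            errs.append(true_error(w, b, t1, t2, s1, s2))
        return float(np.mean(errs)), float(np.std(errs))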
Chapter �
Further Remarks
Summary and Contributions

This thesis consisted of two parts. The first described extensions to SVMs required for the problem of analyzing microarray data. The second consisted of a systematic evaluation of a variety of machine learning algorithms on five datasets from four types of molecular cancer classification problems; it also offered an empirical approach to asking questions about sample size and performance, as well as looking at the problem of predicting treatment outcome.
The contributions of the thesis, as outlined in the introduction, can be summarized as follows:
1. Feature selection for SVMs.
2. Outputting confidence estimates as well as class labels for SVMs.
3. A comparison of algorithms for DNA microarray problems.
4. Empirical answers to sample size requirements for two microarray problems.
Future Work

Some future areas of research are listed below.
- The fact that the leave-one-out estimator is almost unbiased no longer holds when one tries to optimize the leave-one-out quantity with respect to a parameter of one's algorithm. So it is of theoretical interest to understand how this bias-variance tradeoff affects our methodology for choosing kernel parameters.
- Oligonucleotide microarrays are not the only way information about tumors is gathered: cDNA microarrays, northern blots, and nuclease protection assays can be used. An area of great practical importance is constructing classifiers and features robust enough that they can be learned on one type of technology and then transferred accurately to another. For example, suppose that after analyzing microarray data it is found that only three genes are relevant for a particular process a physician wants to screen. It is very reasonable to run northern blots for these three genes. How should the discriminant function estimated with the microarray data be mapped into a function to discriminate samples from a northern blot?
- In general, classification and gene selection are not the final goal of the cancer genomics problem. We really want to infer genetic networks from the data. In yeast, graphical model approaches are being developed (Hartemink et al., 2001) to analyze genomic expression data in a way that permits genetic regulatory networks to be represented in a biologically interpretable form. It would be of interest if methodologies such as ANOVA decompositions (Vapnik, 1998) could be used to generate kernels for SVMs that function as logical operations on genes. This would allow one to use a performance criterion as a model selection tool to infer which kernels best characterize a set of expression data. The different models would be different ANOVA decompositions, which correspond to different regulatory networks.
- Most classification methods used so far in analyzing microarray data have difficulties in extracting subtaxonomies in classes, especially in more difficult classification tasks such as treatment outcome prediction. A framework that allows the combination of classifiers in a variety of ways, from voting to trees, with each classifier built from different genes, can be thought of as a genetic regulatory network. An extension of one approach to combining classifiers (Niyogi et al., 2000) may lead to the development of an algorithm that allows for the learning of subtaxonomies.
- Recent work has related the stability of an algorithm to its generalization performance (Bousquet and Elisseeff, 2000). This work depends upon a concentration of measure inequality called McDiarmid's inequality (McDiarmid, 1989), which allows one to take either the worst case deviation or the expected deviation when a point is left out of an algorithm and use this in a Hoeffding-like inequality. This type of approach may be used to select how many genes are relevant.
Bibliography
[Aizerman et al., 1964a] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. The problem of pattern recognition learning and the method of potential functions. Avtomatika i Telemekhanika, 1964.
[Aizerman et al., 1964b] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Avtomatika i Telemekhanika, 1964.
[Bengio, 2000] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 2000.
[Bonnans and Shapiro, 2000] J.F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer-Verlag, 2000.
[Bousquet and Elisseeff, 2000] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In Neural Information Processing Systems, Denver, CO, 2000.
[Bradley and Mangasarian, 1998] P.S. Bradley and O.L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. International Conference on Machine Learning, San Francisco, CA, 1998.
[Brenner et al., 1961] S. Brenner, F. Jacob, and M. Meselson. An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 1961.
[Brown et al., 1999] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares Jr., and D. Haussler. Support vector machine classification of microarray gene expression data. Technical Report UCSC-CRL, Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA, 1999.
[Chapelle et al., 2001] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing many kernel parameters for support vector machines. Machine Learning, 2001.
[Cortes and Vapnik, 1995a] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 1995.
[Cortes and Vapnik, 1995b] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 1995.
[Courant and Hilbert, 1953] R. Courant and D. Hilbert. Methods of Mathematical Physics, Vol. 1. Interscience, London, England, 1953.
[Cristianini and Shawe-Taylor, 2000] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[Cristianini et al., 1998] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Dynamically adapting kernels in support vector machines. In Advances in Neural Information Processing Systems, 1998.
[DeRisi et al., 1996] J.L. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, and J.M. Trent. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics, 1996.
[Duda and Hart, 1973] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[Evgeniou et al., 1999] T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999.
[Evgeniou et al., 2000] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Proceedings of the Asian Conference on Computer Vision, 2000.
[Furey et al., 2000] T.S. Furey, N. Cristianini, N. Duffy, M. Schummer, D.W. Bednarski, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000.
[Gillespie and Spiegelman, 1965] D. Gillespie and S. Spiegelman. A quantitative assay for DNA-RNA hybrids with DNA immobilized on a membrane. J. Mol. Biol., 1965.
[Girosi and Poggio, 1991] F. Girosi and T. Poggio. Networks for learning: a view from the theory of approximation of functions. In P. Antognetti and V. Milutinovic, editors, Neural Networks: Concepts, Applications, and Implementations, Vol. I. Prentice Hall, Englewood Cliffs, New Jersey, 1991.
[Girosi, 1998] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 1998.
[Golub et al., 1999] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999.
[Guyon et al., 2002] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.
[Hanahan and Weinberg, 2000] D. Hanahan and R. Weinberg. The hallmarks of cancer. Cell, 2000.
[Hardy, 1999] R.L. Hardy. The chipping forecast. Nature Genetics, January 1999.
[Hartemink et al., 2001] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, and R.A. Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In Pacific Symposium on Biocomputing, 2001.
[Heisele et al., 2000] B. Heisele, T. Poggio, and M. Pontil. Face detection in still gray images. AI Memo, Massachusetts Institute of Technology, 2000.
[Jaakkola and Haussler, 1999] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proc. of Neural Information Processing Conference, 1999.
[Jebara and Jaakkola, 2000] T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, Stanford, CA, 2000.
[Joachims, 2000] T. Joachims. Estimating the generalization performance of a SVM efficiently. In International Conference on Machine Learning, 2000.
[Lachin, 2000] J.M. Lachin. Biostatistical Methods: The Assessment of Relative Risks. John Wiley and Sons, N.Y., 2000.
[Larsen et al., 1998] J. Larsen, C. Svarer, L.N. Andersen, and L.K. Hansen. Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade. Springer, 1998.
[Lockhart and Winzeler, 2000] D.J. Lockhart and E. Winzeler. Genomics, gene expression and DNA arrays. Nature, 2000.
[Lockhart et al., 1996] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E.L. Brown. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 1996.
[Luntz and Brailovsky, 1969] A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical pattern recognition. Technicheskaya Kibernetica, 1969.
[Lutkepohl, 1996] H. Lutkepohl. Handbook of Matrices. Wiley and Sons, 1996.
[McDiarmid, 1989] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, 1989.
[Mercer, 1909] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A, 1909.
[Mukherjee and Vapnik, 1999] S. Mukherjee and V. Vapnik. Multivariate density estimation: An SVM approach. AI Memo, Massachusetts Institute of Technology, 1999.
[Mukherjee et al., 1999] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J.P. Mesirov, and T. Poggio. Support vector machine classification of microarray data. AI Memo, Massachusetts Institute of Technology, 1999.
[Nirenberg and Leder, 1964] M.W. Nirenberg and P. Leder. The effect of trinucleotides upon the binding of sRNA to ribosomes. Science, 1964.
[Niyogi et al., 2000] P. Niyogi, J.B. Pierrot, and O. Siohan. Multiple classifiers by constrained minimization. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2000.
[Opper and Winther, 2000] M. Opper and O. Winther. Gaussian processes and SVM: Mean field and leave-one-out. In Advances in Large Margin Classifiers. MIT Press, 2000.
[Platt, 1999] J.C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
[Pomeroy et al., 2001] S. Pomeroy, P. Tamayo, L. Sturla, M. Angelo, M. McLaughlin, J. Kim, L. Goumnerova, P. Black, C. Lau, J. Allen, D. Zagzag, J. Olson, T. Curran, C. Wetmore, J. Biegel, T. Poggio, S. Mukherjee, A. Califano, G. Stolovitzky, D. Louis, J. Mesirov, E. Lander, and T. Golub. Gene expression based classification and outcome prediction of central nervous system embryonal tumors. Nature Medicine, 2001.
[Schena et al., 1995] M. Schena, D. Shalon, and P.O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 1995.
[Serre et al., 2000] T. Serre, B. Heisele, S. Mukherjee, and T. Poggio. Feature selection for face detection. AI Memo, Massachusetts Institute of Technology, 2000.
[Shalon et al., 1996] D. Shalon, S.J. Smith, and P.O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 1996.
[Shipp et al., 2001] M. Shipp, P. Tamayo, M. Gaasenbeek, M. Angelo, T. Ray, M. Reich, J. Mesirov, D. Neuberg, J. Aster, T. Poggio, S. Mukherjee, and T. Golub. Diffuse large B cell lymphoma outcome prediction by gene expression profiling. 2001.
[Slonim et al., 2000] D. Slonim, P. Tamayo, J.P. Mesirov, T. Golub, and E. Lander. Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual Conference on Computational Molecular Biology (RECOMB), 2000.
[Southern et al., 1999] E.M. Southern, K. Mir, and M. Shchepinov. Molecular interactions on microarrays. Nature Genetics Supplement, 1999.
[Southern, 1975] E.M. Southern. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol., 1975.
[Tikhonov and Arsenin, 1977] A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-posed Problems. W.H. Winston, Washington, D.C., 1977.
[Vapnik and Chapelle, 2000] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 2000.
[Vapnik and Mukherjee, 2000] V. Vapnik and S. Mukherjee. Multivariate density estimation: A support vector machine approach. In NIPS 12. Morgan Kaufmann Publishers, San Mateo, CA, 2000.
[Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[Vapnik, 1998] V.N. Vapnik. Statistical Learning Theory. J. Wiley, 1998.
[Wahba et al., 2000] G. Wahba, Y. Lin, and H. Zhang. Generalized approximate cross-validation for support vector machines: another way to look at margin-like quantities. In Advances in Large Margin Classifiers. MIT Press, 2000.
[Wahba, 1990] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, SIAM, Philadelphia, 1990.
[Weinberg and Varmus, 1993] R. Weinberg and H. Varmus. Genes and the Biology of Cancer. W.H. Freeman and Co., 1993.
[Weston et al., 2000] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for support vector machines. In Advances in Neural Information Processing Systems, 2000.