Application of Statistical Learning Theory to
DNA Microarray Analysis
by
Sayan Mukherjee
B.S., Princeton University; M.S., Columbia University
Submitted to the Department of Brain Sciences in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
July
(c) Massachusetts Institute of Technology. All rights reserved.
Author ..........................................................
Department of Brain Sciences
Certified by .....................................................
Tomaso Poggio
Uncas and Helen Whitaker Professor of Brain Sciences
Thesis Supervisor
Accepted by ......................................................
Earl Miller
Chairman, Departmental Committee on Graduate Students
Application of Statistical Learning Theory to DNA
Microarray Analysis
by
Sayan Mukherjee
Submitted to the Department of Brain Sciences in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
Abstract
This thesis focuses on applying Support Vector Machines (SVMs), an algorithm founded in the framework of statistical learning theory, to the analysis of DNA microarray data.
The first part of the thesis focuses on extensions to SVMs required for analyzing microarray data. First, the problem of choosing multiple parameters at once for SVMs is addressed. This is used as the basis of a feature selection algorithm that allows us to select which genes are most relevant in discriminating between two classes. A methodology for outputting confidence levels as well as class labels is also developed.
The second part of the thesis consists of a systematic evaluation of a variety of machine learning algorithms on five datasets from four types of molecular cancer classification problems. It also describes some very promising results in predicting treatment outcome from expression data for brain tumors and lymphoma. The algorithms compared are k-Nearest Neighbors (kNN), Naive Bayes (NB), Weighted Voting Average (WV), and Support Vector Machines (SVMs). Learning curves are constructed for the lymphoma treatment and morphology datasets to compare performance as a function of sample size and to address two questions: given enough data, can clinically acceptable error rates be achieved, and how much data is needed to achieve such rates? A simple analytic model is constructed to estimate the variance in classification accuracy due to sample size limitations.
Thesis Supervisor: Tomaso Poggio
Title: Uncas and Helen Whitaker Professor of Brain Sciences
Acknowledgments
I can unequivocally state that I find Acknowledgement sections to be annoying, trite,
and pointless, especially when it is one's own. That being said, I will now proceed to
write such a section.

i thank my advisor Tomaso Poggio for providing me with many opportunities

i thank my first advisor at MIT, Federico Girosi, for being a role model both as a
scientist and a human being

i thank Gadi Geiger for his attempts to keep me out of trouble

i thank Vladimir Vapnik for technical training

i apologize to the Brain and Cognitive Science Department for my profound lack
of knowledge and interest in both brain and cognitive science

i thank, more likely curse, James Schummers and Javid Sadr for teaching me the
little neuroscience i think i might know

i take the liberty of thanking Christine Matter, most formally and effusively and in
deep and hopefully everlasting attachment, for teaching me the word "Unhintergehbarkeit"

to the people that brought me into this world, my two parents, Rina and Shyama
Mukherjee, (unless we are to believe in the sexual theory proposed by the Tralfamadorians)
i owe some love as well as bitterness

to my brother, Neelanjan, whom they also brought into this world, i owe mainly
an apology

i would dedicate my thesis to the three above but i do not see the point
Contents

1 Introduction
1.1 Motivation
1.1.1 DNA Microarray Technology
1.1.2 Molecular Classification of Cancer
1.1.3 Statistical and Computational Challenge
1.2 Statistical Learning as a Framework for Microarray Analysis
1.3 Outline of the thesis
1.3.1 Contributions of the thesis

2 Algorithmic Extensions to SVMs for DNA Microarray Problems
2.1 Support Vector Machines for Classification
2.2 Choosing Multiple Parameters for Support Vector Machines
2.2.1 Single validation estimate
2.2.2 Leave-one-out bounds
2.2.3 Optimizing the kernel parameters
2.2.4 Computing the gradient
2.2.5 An Example
2.3 Feature Selection for Support Vector Machines
2.3.1 Toy data
2.3.2 DNA Microarray Data
2.3.3 Face detection
2.4 Confidence in Predictions and Rejections

3 Comparison of Algorithms using Microarray Data to Predict Cancer Morphology and Treatment Outcome
3.1 Introduction
3.2 Classification Results
3.2.1 Morphology and Lineage
3.2.2 Treatment Outcome
3.2.3 Learning Curves
3.2.4 Removal of Important Genes and Higher Order Information
3.3 Bayes Error and Sample Size Deviations
3.4 Methods
3.4.1 Datasets
3.4.2 Construction of Discriminative Models
3.4.3 Constructing Learning Curves
3.4.4 Sample Size Deviations for a Simple Model

4 Further Remarks
4.1 Summary and Contributions
4.2 Future Work
List of Figures

1-1 An (a) oligonucleotide and (b) cDNA microarray.
2-1 On each of the 16 tiles, the scaling factors of the 16 pixels are identical.
2-2 Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with a polynomial kernel.
2-3 Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with an RBF kernel.
2-4 A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.
2-5 ROC curves for different numbers of PCA gray features.
2-6 Plots of the distance from the hyperplane for test points for the four gene-set sizes (a)-(d). Markers distinguish class ALL from class AML, mistakes are highlighted, and the line indicates the decision boundary.
2-7 Plots of the confidence levels as a function of 1/|d| estimated from a leave-one-out procedure on the training data, for the four gene-set sizes (a)-(d).
2-8 Plot of the confidence levels as a function of 1/d estimated from a leave-one-out procedure on the training data.
3-1 Survival plots for (a) Lymphoma and (b) Brain outcomes.
List of Tables

2.1 Number of errors, rejects, confidence level, and the |d| corresponding to the confidence level for various numbers of genes with the linear SVM discriminating ALL from AML.
2.2 Number of errors as a function of the order of polynomial and the number of important genes removed.
3.1 Estimated Bayes optimal error and deviation.
3.2 Leukemia dataset.
3.3 Lymphoma dataset.
3.4 Medulloblastoma and Glioblastoma dataset.
Chapter 1

Introduction

1.1 Motivation
Recent technological advances in molecular biology and chemistry have brought about
the terminology "high-throughput experiments". Basically, experimental scientists
can now perform thousands of experiments at once. For example, a molecular biologist
can monitor the expression of all the known genes in yeast under various conditions at
once. Similarly, a chemist can synthesize thousands of compounds at the same time.

These experimental scientists, however, now have to analyze the data from these
experiments. For the chemist this might mean asking which of these compounds
might possibly be used as a drug. The molecular biologist may ask which genes are
important for certain cell functions and how a genetic pathway works. These questions
require a bit more than a t-test or an ANOVA. Statistical and computational
procedures to address the scientific questions asked by these experimenters are
developing rapidly. It is also becoming evident that statistical and computational
issues will, as much as experimental methods or technologies, drive which scientific
questions can be answered and what breakthroughs will be made.

This thesis applies a statistical framework called "statistical learning theory" to
a particular high-throughput problem: the analysis of DNA microarray data.
1.1.1 DNA Microarray Technology
Virtually all cell function is carried out by proteins, so a readout of which and
how many proteins are in a cell at a particular moment gives us a great deal of
information about the state of the cell. It was discovered relatively early on in
molecular biology that the abundance and distribution of proteins in cells are correlated
to a large extent with the levels of messenger RNA, mRNA (Brenner et al.;
Nirenberg and Leder).

Various methods are available for detecting and quantifying the amount of mRNA.
The methods take advantage of the sequence complementarity of DNA. The key observation
was that single-stranded DNA binds strongly to nitrocellulose membranes,
which prevents strands from reassociating with each other but permits hybridization
to complementary RNA (Gillespie and Spiegelman). This led to "blotting"
methods, the first of which combined filter hybridization with gel separation of restriction
digests (Southern). The Northern blot was simply such a method applied to
RNA rather than DNA. The problem with these dot-blot techniques is that they are
serial in nature, the mRNA being measured one species at a time, and they are not easy to automate.
DNA microarrays allow one to interrogate the mRNA population expressed by
thousands of genes at once rather than serially as in the dot-blot methods.
The key distinction between DNA microarrays and dot-blots is that
the microarrays use an impermeable, rigid substrate, such as glass, to bind the DNA
sequences. This has many practical advantages over porous membranes and gel pads
(Southern et al.). Two basic types of DNA microarrays are commonly used:
spotted arrays and oligonucleotide arrays. Figure 1-1 shows an example of each
(the figure is from (Lockhart and Winzeler)).

In the spotted array methods (Shalon et al.; Schena et al.) a large
number of cDNAs are prepared from a cDNA library and then spotted onto a glass
slide by a robot. Each cDNA corresponds to one probe, several hundred base pairs long,
taken from near the 3' end of a gene or EST. Each spot on the slide corresponds to a
particular probe. A labelled sample of mRNA is eluted onto the slide and is hybridized
overnight. The arrays are then scanned, and the quantitative fluorescence image, along
with the known position of the cDNA probes, is used to assess whether a gene or EST
is present and its relative abundance. Note that in cDNA arrays the fluorescence image is
a ratio of the abundance of mRNA in two samples.

In the oligonucleotide arrays (Lockhart et al.), multiple probes of 25-mers are
synthesized base by base using photolithography in hundreds of thousands of different
positions on a glass plate. For each gene or EST, multiple probes of length 25 bp are
placed in a particular position of the microarray. Again the probes are taken from
the 3' end of a gene or EST. As in the cDNA array, a labelled sample of mRNA
is eluted onto the slide and is hybridized overnight. The arrays are then scanned,
and the quantitative fluorescence image, along with the known position of the
probes, is used to assess whether a gene or EST is present and its abundance. In the
oligonucleotide arrays the fluorescence image is an absolute measure of the abundance
of mRNA in a sample.

Figure 1-1: An (a) oligonucleotide and (b) cDNA microarray.
1.1.2 Molecular Classification of Cancer
Recent technological advances in molecular genetics (i.e., oligonucleotide and cDNA
microarrays (Hardy; Lockhart et al.; DeRisi et al.)) allow us to easily monitor
the simultaneous and quantitative expression of thousands of genes in clinical
specimens. In this context there is considerable interest in understanding if gene
expression profiles might serve as molecular fingerprints that would allow for a better
and more accurate classification of cancer and other diseases, or biological phenotypes
in general. The analysis of several thousand genes at once, and relating them to
biologically or clinically relevant labels, has required molecular biologists and oncologists
to collaborate with statisticians and computer scientists who have some experience
with producing models from data.

It has been shown that various biological classes can be distinguished with a very
low error rate without using any a priori biological knowledge or expert interpretation
(Golub et al.; Brown et al.; Furey et al.; Mukherjee et al.). This
was done by constructing data-driven models of functions that discriminate between
classes. These models typically consist of two steps: selecting relevant genes or
features, and then building models from the expression patterns of these genes. This
methodology falls in the "learning from examples" paradigm of supervised learning.
In this paradigm a mapping is learned from data, here gene expression patterns, to
a label; this can be a biological class or a continuous value, for example a particular
time in the cell cycle. One then tests the accuracy of this mapping on data that was not
used to generate the model: out-of-sample data.
1.1.3 Statistical and Computational Challenge
From the point of view of statistical learning theory or machine learning, the challenging
aspect of these types of problems is that typically the number of examples or patterns
is relatively small, often a few tens to a few hundred examples, while the dimensionality, the number of
genes whose expression levels are measured, is very large, typically several thousand to tens of thousands in
humans. Given data of this nature, a statistician or machine learning expert would be
tempted to say "Nothing can be said or done," followed by some mumbling about
the curse of dimensionality. However, empirical evidence is mounting that for many
problems accurate models can be built to discriminate biological classes. To understand
why this is so, and to understand which machine learning approaches are best
suited to address this type of data, one needs a non-asymptotic statistical theory. In
this context, statistical learning theory provides a valuable framework for asking questions
about the accuracy of models built on limited data.
1.2 Statistical Learning as a Framework for Microarray Analysis
Statistical learning theory will be used as a framework throughout this thesis for
methodologies to select appropriate classification functions and to determine which genes
are relevant in composing these functions when one is given a small sample of data.
The basic problem is as follows: (a) given ℓ example pairs (gene expression values
and a biological label), construct a classification rule; (b) this rule should generalize
well, that is, correctly classify examples that were not among the example pairs used to construct
the classification rule.

Statistical learning theory allows us to say, in probability, how much the generalization
performance will deviate from the performance on the example pairs as a
function of the number of examples, ℓ, and of a measure of the complexity of the class
of functions used to construct the classification rule. This deviation decreases as the
number of examples increases and increases as the class of functions becomes more
complex.

The Support Vector Machine (SVM) (Vapnik; Cortes and Vapnik) algorithm,
developed in the framework of statistical learning theory, will be the main
algorithm used in this thesis to analyze DNA microarray data. A basic model
selection problem, choosing many parameters at once for the SVM algorithm, will be
addressed, again in the framework of statistical learning theory, and will be applied
to the problem of selecting which genes are relevant in characterizing a particular
biological class.
1.3 Outline of the thesis
Algorithmic Extensions to SVMs for DNA Microarray Problems

Chapter 2 will concentrate on extensions to SVMs that were developed out of
requirements that arose in analyzing microarray data. First the SVM will be introduced
in summary fashion. Then the problem of choosing many parameters at once
for SVMs will be addressed and a solution offered. This solution will form the basis
for a feature selection algorithm. Then we discuss a methodology by which the SVM
will output a confidence level in addition to a class label; one can then reject predictions
with low confidence.
Comparison of Algorithms using Microarray Data to Predict Cancer Morphology and Treatment Outcome

Chapter 3 will consist of a systematic evaluation of a variety of machine learning
algorithms on five datasets from four types of molecular cancer classification problems.
It will also state some promising results in predicting treatment outcome from
expression data for brain tumors and lymphoma. The algorithms compared will be
k-Nearest Neighbors (kNN), Naive Bayes (NB), Weighted Voting Average (WV), and
Support Vector Machines (SVMs). Learning curves are constructed for the lymphoma
treatment and morphology datasets to compare performance as a function of sample
size and to address two questions: given enough data, can clinically acceptable error
rates be achieved, and how much data is needed to achieve such rates? A simple
analytic model is constructed to estimate the variance in classification accuracy due
to sample size limitations. The basic objective of this chapter is to understand the
potential and limitations of these algorithms in solving these types of problems.
1.3.1 Contributions of the thesis
The significant contributions of this thesis are in (a) extending the SVM algorithm
to make it more applicable to the requirements of microarray data analysis, and (b)
a systematic comparison of different algorithms and empirical answers to questions
about the sample sizes required to achieve clinical applicability.

In summary, the contributions of this thesis are:

1. A feature selection methodology for SVMs
2. Computing confidence estimates as well as class labels for SVMs
3. A comparison of algorithms for DNA microarray problems
4. Empirical answers to sample size requirements for two microarray problems
Chapter 2

Algorithmic Extensions to SVMs for DNA Microarray Problems

2.1 Support Vector Machines for Classification
In the problem of supervised learning, one takes a set of input-output pairs Z =
{(x_1, y_1), ..., (x_ℓ, y_ℓ)} and attempts to construct a classifier function f that maps
input vectors x ∈ IR^n onto labels y ∈ Y. We are interested here in pattern recognition
or classification, that is, the case where the set of labels is simply Y = {-1, 1}. The
goal is to find an f ∈ F which minimizes the error (f(x) ≠ y) on future examples.
Learning algorithms usually depend on parameters which control the size of the class
F or the way the search is conducted in F.

The support vector machine can be derived as a particular case of the
regularization framework (Evgeniou et al.; Girosi). Regularization theory
(Tikhonov and Arsenin; Wahba; Girosi and Poggio) formulates the
supervised learning problem as a variational problem of finding the function f that
minimizes the functional

    min_{f ∈ F} H[f] = (1/ℓ) Σ_{i=1}^{ℓ} V(y_i, f(x_i)) + λ ‖f‖²_K,    (2.1)

where V(·,·) is a loss function, ‖f‖²_K is a (semi-)norm in a Reproducing Kernel Hilbert
Space (RKHS) F defined by a (conditionally) positive definite function K called a
kernel, and λ is a regularization parameter. Under general conditions the solution of
equation (2.1) is either

    f(x) = Σ_{i=1}^{ℓ} c_i K(x, x_i),    (2.2)

or

    f(x) = Σ_{i=1}^{ℓ} c_i K(x, x_i) + b,    (2.3)

depending on whether K is positive definite or conditionally positive definite of order
one; for conditionally positive definite K of higher orders more terms would be required
on the right-hand side of (2.3) (Wahba).

Starting from the regularization formulation we will derive the SVM for classification
using the following loss function:

    V(y_i, f(x_i)) = (1 - y_i f(x_i))_+ .    (2.4)
We write our regularized functional as follows:

    min_{f ∈ F} H[f] = (C/ℓ) Σ_{i=1}^{ℓ} (1 - y_i f(x_i))_+ + (1/2) ‖Pf‖²_K,    (2.5)

where P is a projection operator that removes the constant term from any f(x), so
‖P(f(x) + b)‖_K = ‖Pf(x)‖_K for all b. The functional (2.5) can be written as the
following quadratic programming problem proposed in (Cortes and Vapnik):

    min_{f ∈ F, ξ} Φ(f, ξ) = (C/ℓ) Σ_{i=1}^{ℓ} ξ_i² + (1/2) ‖Pf‖²_K    (2.6)

subject to

    y_i f(x_i) ≥ 1 - ξ_i,
    ξ_i ≥ 0 for all i.
The solution of the above problem again has the form

    f(x) = Σ_{i=1}^{ℓ} α_i K(x, x_i) + b,    (2.7)

and the class predicted is the sign of f(x). The solution in general will be sparse
(this is due to the choice of loss function) in that not all α_i will be nonzero; the data
points corresponding to the nonzero α_i are called support vectors.
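The following is a minimal sketch, not the thesis implementation, of fitting a soft-margin SVM and reading off the decision function f(x) = Σ_i α_i y_i K(x, x_i) + b and its support vectors; it assumes scikit-learn and NumPy are available, and the data and parameter values are purely illustrative.

    # Minimal illustrative sketch: fit a soft-margin SVM on synthetic data and inspect
    # the decision values f(x) (the predicted class is sign(f(x))) and the support vectors.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 5) + 1.0, rng.randn(20, 5) - 1.0])  # two synthetic classes
    y = np.hstack([np.ones(20), -np.ones(20)])

    clf = SVC(kernel="rbf", gamma=0.1, C=1.0)   # C trades margin against slack
    clf.fit(X, y)

    d = clf.decision_function(X)                # f(x) for each training point
    print("support vectors:", len(clf.support_), "of", len(X), "training points")
    print("training errors:", int(np.sum(np.sign(d) != y)))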
Historically, SVMs were derived from a different perspective. The initial formulation
was for linear discriminant functions

    f(x) = w · x + b    (2.8)

and the following optimization problem was proposed (Cortes and Vapnik):

    min_{w,b,ξ} Φ(w, b, ξ) = (1/2) ‖w‖² + (C/ℓ) Σ_{i=1}^{ℓ} ξ_i²    (2.9)

subject to

    y_i (w · x_i + b) ≥ 1 - ξ_i,
    ξ_i ≥ 0 for all i.

The solution of the above optimization problem has the same form as (2.7):

    f(x) = Σ_{i=1}^{ℓ} y_i α_i K(x, x_i) + b,    (2.10)

where K(x, x_i) = x · x_i. Historically, the extension to nonlinear discriminant functions
was formulated via potential functions (Aizerman et al.).
The idea was to construct a map from the input space to a high (possibly infinite)
dimensional space K, called feature space, via a function Φ: IR^n → K, and to construct a
linear discriminant in this space:

    f(x) = w · Φ(x) + b = Σ_{p=1}^{N} w_p φ_p(x) + b,    (2.11)

and the following optimization problem was proposed:

    min_{w,b,ξ} Φ(w, b, ξ) = (1/2) ‖w‖²_K + (C/ℓ) Σ_{i=1}^{ℓ} ξ_i²    (2.12)

subject to

    y_i (w · Φ(x_i) + b) ≥ 1 - ξ_i,
    ξ_i ≥ 0 for all i.
This function Φ(x) need never be computed, because the above optimization problem
(2.12) can be written in its dual form (Cortes and Vapnik):

    max_α  Σ_{i=1}^{ℓ} α_i - (1/2) αᵀ M̃ α    (2.13)

subject to

    αᵀ y = 0,
    α ≥ 0,

where y is the vector of labels and the matrix M̃ is the kernel matrix with a ridge
added (Cortes and Vapnik; Cristianini and Shawe-Taylor):

    M̃ = M + (ℓ/(2C)) I,    (2.14)

and

    M_ij = y_i y_j K(x_i, x_j) = y_i y_j Φ(x_i) · Φ(x_j) = y_i y_j Σ_{p=1}^{N} φ_p(x_i) φ_p(x_j)    (2.15)

is well defined. By well defined we mean that the following series converges:

    K(x, y) = Σ_{p=1}^{N} λ_p φ_p(x) φ_p(y),    (2.16)

where N is possibly infinite and λ_p is a sequence of positive numbers; note that we
can renormalize φ_p so that λ_p = 1, and the equivalence between equations (2.15) and
(2.16) is then clear. The convergence in equation (2.16) holds for (conditionally) positive
definite kernels from the following fact (Courant and Hilbert; Mercer):

    K(x, y) = Σ_{i=1}^{N} d_i ψ_i(x) ψ_i(y),    (2.17)

with N possibly infinite, where ψ_i(x) and d_i are the eigenfunctions and eigenvalues of K.
When all the slack variables ξ_i = 0 (the data is separable by a hyperplane in the
space K), the hyperplane that minimizes the functional Φ has a geometric interpretation.
This hyperplane is called the optimal hyperplane and is the one with the maximal distance
(in K space) between the hyperplane and the closest image Φ(x_i) of a vector x_i from
the training data. For nonseparable training data a generalization of this concept is
used.

Suppose that the maximal distance is equal to γ and that the images Φ(x_1), ..., Φ(x_ℓ)
of the training vectors x_1, ..., x_ℓ are within a sphere of radius R. Then the following
theorem holds true (Vapnik and Chapelle).

Theorem 2.1 Given a training set Z = {(x_1, y_1), ..., (x_ℓ, y_ℓ)} of size ℓ, a feature
space H and a hyperplane (w, b), the margin γ(w, b, Z) and the radius R(Z) are
defined by

    γ(w, b, Z) = min_{(x_i, y_i) ∈ Z} y_i (w · Φ(x_i) + b) / ‖w‖,

    R(Z) = min_a max_{x_i} ‖Φ(x_i) - a‖.

The maximum margin algorithm L_ℓ : (IR^n × Y)^ℓ → K × IR takes as input a training
set of size ℓ and returns a hyperplane in feature space such that the margin γ(w, b, Z)
is maximized. Note that assuming the training set separable means that γ > 0. Under
this assumption, for all probability measures P underlying the data Z, the expectation
of the misclassification probability

    p_err(w, b) = P( sign(w · Φ(X) + b) ≠ Y )

has the bound

    E{ p_err(L_{ℓ-1}(Z)) } ≤ (1/ℓ) E{ R²(Z) / γ²(L_ℓ(Z), Z) }.

The expectation is taken over the random draw of a training set Z of size ℓ - 1 for
the left hand side and size ℓ for the right hand side.

This theorem justifies the idea of constructing a hyperplane that separates the
data with a large margin: the larger the margin, the better the performance of the
constructed hyperplane. Note however that according to the theorem the average
performance depends on the ratio E{R²/γ²} and not simply on the margin γ.
2.2 Choosing Multiple Parameters for Support Vector Machines
The SVM algorithm usually depends on several parameters. One of them, denoted C,
controls the tradeoff between margin maximization and error minimization. Other
parameters appear in the nonlinear mapping into feature space; they are called
kernel parameters. For simplicity, we use the trick in equation (2.14) that allows us
to consider C as a kernel parameter, so that all parameters can be treated in a unified
framework.[1]

It is widely acknowledged that a key factor in an SVM's performance is the choice
of the kernel. However, in practice, very few different types of kernels have been
used, due to the difficulty of appropriately tuning the parameters. We present here a
technique that allows one to deal with a large number of parameters and thus allows the
use of more complex kernels.

Our goal is not only to find the hyperplane which maximizes the margin but
also the values of the mapping parameters that yield the best generalization error. To
do so, we propose a minimax approach: maximize the margin over the hyperplane
coefficients and minimize an estimate of the generalization error over the set of kernel
parameters. This last step is performed using a standard gradient descent approach.

We consider a kernel K_θ depending on a set of parameters θ. The decision
function given by an SVM is

    f(x) = sign( Σ_{i=1}^{ℓ} α_i⁰ y_i K_θ(x_i, x) + b ),    (2.18)

where the coefficients α_i⁰ are obtained by maximizing the following functional:

    W(α) = Σ_{i=1}^{ℓ} α_i - (1/2) αᵀ M̃ α    (2.19)

subject to

    αᵀ y = 0,
    α ≥ 0,

where M̃ = M + (ℓ/(2C)) I and M_ij = y_i y_j K_θ(x_i, x_j).

Ideally we would like to choose the value of the kernel parameters that minimizes
the true risk of the SVM classifier. Unfortunately, since this quantity is not accessible,
one has to build estimates or bounds for it. Next, we present several measures of the
expected error rate of an SVM.

[1] This section is based on the work done in (Chapelle et al.).
2.2.1 Single validation estimate

If one has enough data available, it is possible to estimate the true error on a validation
set. This estimate is unbiased and its variance gets smaller as the size of the validation
set increases. If the validation set is {(x'_i, y'_i)}_{1≤i≤p}, the estimate is

    T = (1/p) Σ_{i=1}^{p} Ψ(-y'_i f(x'_i)),    (2.20)

where Ψ is the step function: Ψ(x) = 1 when x > 0 and Ψ(x) = 0 otherwise.
2.2.2 Leave-one-out bounds

The leave-one-out procedure consists of removing one element from the training data,
constructing the decision rule on the basis of the remaining training data, and then
testing on the removed element. In this fashion one tests all ℓ elements of the training
data (using ℓ different decision rules). Let us denote the number of errors in the
leave-one-out procedure by L(x_1, y_1, ..., x_ℓ, y_ℓ). It is known (Luntz and Brailovsky)
that the leave-one-out procedure gives an almost unbiased estimate of the
expected generalization error:

Lemma 2.1

    E{ p_err^{ℓ-1} } = (1/ℓ) E{ L(x_1, y_1, ..., x_ℓ, y_ℓ) },

where p_err^{ℓ-1} is the probability of test error for the machine trained on a sample of size
ℓ - 1, and the expectations are taken over the random choice of the sample.

Although this lemma makes the leave-one-out estimator a good choice when estimating
the generalization error, it is nevertheless very costly to actually compute, since it
requires running the training algorithm ℓ times. The strategy is thus to upper bound
or approximate this estimator by an easy-to-compute quantity T having, if possible,
an analytical expression.
If we denote by f⁰ the classifier obtained when all training examples are present
and f^p the one obtained when example p has been removed, we can write

    L(x_1, y_1, ..., x_ℓ, y_ℓ) = Σ_{p=1}^{ℓ} Ψ(-y_p f^p(x_p)),    (2.21)

which can also be written as

    L(x_1, y_1, ..., x_ℓ, y_ℓ) = Σ_{p=1}^{ℓ} Ψ( -y_p f⁰(x_p) + y_p (f⁰(x_p) - f^p(x_p)) ).

Thus, if U_p is an upper bound for y_p (f⁰(x_p) - f^p(x_p)), we get the following upper
bound on the leave-one-out error:

    L(x_1, y_1, ..., x_ℓ, y_ℓ) ≤ Σ_{p=1}^{ℓ} Ψ(U_p - 1),

since for hard margin SVMs y_p f⁰(x_p) ≥ 1 and Ψ is monotonically increasing.
Support vector count

Since removing a non-support vector from the training set does not change the solution
computed by the machine (i.e., U_p = f⁰(x_p) - f^p(x_p) = 0 for x_p a non-support
vector), we can restrict the preceding sum to support vectors and upper bound each
term in the sum by one, which gives the following bound on the number of errors made
by the leave-one-out procedure (Vapnik):

    T = N_SV / ℓ,

where N_SV denotes the number of support vectors.
Jaakkola-Haussler bound

For SVMs without threshold, analyzing the optimization performed by the SVM algorithm
when computing the leave-one-out error, Jaakkola and Haussler (Jaakkola and Haussler)
proved the inequality

    y_p (f⁰(x_p) - f^p(x_p)) ≤ α_p⁰ K(x_p, x_p) = U_p,

which leads to the following upper bound:

    T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ( α_p⁰ K(x_p, x_p) - 1 ).
Note that Wahba et al. (Wahba et al.) proposed an estimate of the number
of errors made by the leave-one-out procedure, which in the hard margin SVM case
turns out to be

    T = Σ_p α_p⁰ K(x_p, x_p),

which can be seen as an upper bound of the Jaakkola-Haussler one, since Ψ(x - 1) ≤ x
for x ≥ 0.
Opper-Winther bound

For hard margin SVMs without threshold, Opper and Winther (Opper and Winther)
used a method inspired by linear response theory to prove the following: under the
assumption that the set of support vectors does not change when removing the example
p, we have

    y_p (f⁰(x_p) - f^p(x_p)) = α_p⁰ / (K_SV⁻¹)_pp,

where K_SV is the matrix of dot products between support vectors, leading to the
following estimate:

    T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ( α_p⁰ / (K_SV⁻¹)_pp - 1 ).
Radius-margin bound

For SVMs without threshold and with no training errors, Vapnik (Vapnik) proposed
the following upper bound on the number of errors of the leave-one-out procedure:

    T = (1/ℓ) R² / γ²,

where R and γ are the radius and the margin as defined in Theorem 2.1.
Span bound

Vapnik and Chapelle (Vapnik and Chapelle) derived an estimate using the concept
of the span of support vectors.

Under the assumption that the set of support vectors remains the same during
the leave-one-out procedure, the following equality is true:

    y_p (f⁰(x_p) - f^p(x_p)) = α_p⁰ S_p²,

where S_p is the distance between the point Φ(x_p) and the set Λ_p, where

    Λ_p = { Σ_{i≠p, α_i⁰>0} λ_i Φ(x_i) :  Σ_{i≠p} λ_i = 1 }.    (2.22)

This gives the exact number of errors made by the leave-one-out procedure under the
previous assumption:

    T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ( α_p⁰ S_p² - 1 ).    (2.23)
The span estimate can be related to other approximations.

Link with the Jaakkola-Haussler bound

If we consider SVMs without threshold, the constraint Σ λ_i = 1 can be removed
in the definition of the span. Then we can easily upper bound the value of the
span, S_p² ≤ K(x_p, x_p), and thus recover the Jaakkola-Haussler bound.

Link with R²/γ²

For each support vector we have y_p f⁰(x_p) = 1. Since for x ≥ 0, Ψ(x - 1) ≤ x,
the number of errors made by the leave-one-out procedure is bounded by

    Σ_p α_p⁰ S_p².

It has been shown (Vapnik and Chapelle) that the span S_p is bounded
by the diameter of the smallest sphere enclosing the training points, and since
Σ_p α_p⁰ = 1/γ², we finally get

    T ≤ (1/ℓ) 4R² / γ².

A similar derivation to the one used in the span bound has been proposed in
(Joachims), where the leave-one-out error is bounded by |{p : 2 α_p⁰ R² ≥ y_p f⁰(x_p)}|,
with K(x_i, x_i) ≤ R² for all i.
Link with Opper-Winther

When the support vectors do not change, the hard margin case without threshold
gives the same value as the Opper-Winther bound, namely

    S_p² = 1 / (K_SV⁻¹)_pp.
2.2.3 Optimizing the kernel parameters

Let us go back to the SVM algorithm. We assume that the kernel K depends on one or
several parameters, encoded into a vector θ = (θ_1, ..., θ_n). We thus consider a class
of decision functions parametrized by α, b, and θ:

    f_{α,b,θ}(x) = sign( Σ_{i=1}^{ℓ} α_i y_i K_θ(x, x_i) + b ).

We want to choose the values of the parameters α and θ such that W (see equation
(2.19)) is maximized (maximum margin algorithm) and T, the model selection
criterion, is minimized (best kernel parameters). More precisely, for θ fixed, we want
to have α⁰ = arg max_α W(α) and to choose θ⁰ such that

    θ⁰ = arg min_θ T(α⁰, θ).

When θ is a one-dimensional parameter, one typically tries a finite number of
values and picks the one which gives the lowest value of the criterion T. When both
T and the SVM solution are continuous with respect to θ, a better approach has
been proposed by Cristianini et al. (Cristianini et al.): using an incremental
optimization algorithm, one can train an SVM with little effort when θ is changed
by a small amount. However, as soon as θ has more than one component, computing
T(α, θ) for every possible value of θ becomes intractable, and one rather looks for a
way to optimize T along a trajectory in the kernel parameter space.

Using the gradient of a model selection criterion to optimize the model parameters
has been proposed in (Bengio) and demonstrated in the case of linear regression
and time-series prediction. It has also been proposed by (Larsen et al.) to
optimize the regularization parameters of a neural network.
Here we propose an algorithm that alternates the SVM optimization with a gradient
step in the direction of the gradient of T in the parameter space. This can be
achieved by the following iterative procedure:

1. Initialize θ to some value.
2. Using a standard SVM algorithm, find the maximum of the quadratic form W:
       α⁰(θ) = arg max_α W(α, θ).
3. Update the parameters θ such that T is minimized.
   This is typically achieved by a gradient step (see below).
4. Go to step 2, or stop when the minimum of T is reached.

Solving step 3 requires estimating how T varies with θ. We will thus restrict
ourselves to the case where K_θ can be differentiated with respect to θ. Moreover, we
will only consider cases where the gradient of T with respect to θ can be computed
(or approximated).

Note that α⁰ depends implicitly on θ, since α⁰ is defined as the maximum of W.
Then, if we have n kernel parameters (θ_1, ..., θ_n), the total derivative of
T⁰(θ) ≡ T(α⁰(θ), θ) with respect to θ_p is

    ∂T⁰/∂θ_p = ∂T/∂θ_p |_{α⁰ fixed} + (∂T/∂α⁰)(∂α⁰/∂θ_p).

Having computed the gradient ∇_θ T(α⁰, θ), a way of performing step 3 is to make
a gradient step:

    δθ_k = -ε ∂T(α⁰, θ)/∂θ_k,

for some small, and eventually decreasing, ε. The convergence can be improved with
the use of second order derivatives (Newton's method):

    δθ_k = -(Δ_θ T)⁻¹ ∂T(α⁰, θ)/∂θ_k,

where the operator Δ is defined by

    (Δ_θ T)_{i,j} = ∂²T(α⁰, θ) / ∂θ_i ∂θ_j.

In this formulation, additional constraints can be imposed through projection of the
gradient.
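The sketch below illustrates this alternating procedure under simplifying assumptions; it is not the thesis implementation. Step 2 fits an SVM for the current kernel parameters θ, and step 3 takes a step on the criterion T = R²‖w‖² (the bound of Theorem 2.1 up to constants). Two simplifications are assumptions of this sketch only: the gradient is estimated by finite differences rather than by the analytic formulas of the next section, and R² is approximated by the feature-space radius around the centroid rather than by the smallest enclosing sphere.

    # Simplified sketch of the alternating optimization over kernel parameters theta.
    # Assumes scikit-learn and NumPy; all names and parameter values are illustrative.
    import numpy as np
    from sklearn.svm import SVC

    def rbf_kernel(X, Z, theta):
        """Anisotropic RBF kernel: K(x,z) = exp(-sum_i (x_i - z_i)^2 / (2*theta_i^2))."""
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2 / (2.0 * theta ** 2)).sum(-1)
        return np.exp(-d2)

    def criterion(X, y, theta, C=10.0):
        K = rbf_kernel(X, X, theta)
        clf = SVC(kernel="precomputed", C=C).fit(K, y)     # step 2: maximize W for fixed theta
        sv, coef = clf.support_, clf.dual_coef_.ravel()    # coef_i = alpha_i * y_i
        w2 = coef @ K[np.ix_(sv, sv)] @ coef               # ||w||^2 = 1/gamma^2
        R2 = np.max(np.diag(K) - 2 * K.mean(1) + K.mean()) # centroid-based radius estimate
        return R2 * w2                                     # T = R^2 * ||w||^2

    def optimize_theta(X, y, theta, steps=20, lr=0.05, h=1e-3):
        for _ in range(steps):                             # step 3: gradient step on theta
            grad = np.zeros_like(theta)
            for k in range(len(theta)):                    # finite-difference gradient of T
                e = np.zeros_like(theta); e[k] = h
                grad[k] = (criterion(X, y, theta + e) - criterion(X, y, theta - e)) / (2 * h)
            theta = np.maximum(theta - lr * grad, 1e-3)    # keep kernel widths positive
        return theta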
2.2.4 Computing the gradient

In this section we describe the computation of the gradient (with respect to the kernel
parameters) of the different estimates of the generalization error. First, for the bound
R²/γ² (see Theorem 2.1), we obtain a formulation of the derivative of the margin
and of the radius. For the validation error (see equation (2.20)), we show how to
calculate the derivative of the hyperplane parameters α⁰ and b. Finally, details about
the computation of the derivative of the span bound (2.23) are not included in this
thesis; see (Chapelle et al.) for details.

We first begin with a useful lemma.
Lemma 2.2 Suppose we are given an (n × 1) vector v_θ and an (n × n) matrix P_θ,
both smoothly depending on a parameter θ. Consider the function

    L(θ) = max_{x ∈ F} ( xᵀ v_θ - (1/2) xᵀ P_θ x ),

where

    F = { x : bᵀx = c, x ≥ 0 }.

Let x̄ be the vector x at which the maximum in L(θ) is attained. If this maximum is
unique, then

    ∂L(θ)/∂θ = x̄ᵀ ∂v_θ/∂θ - (1/2) x̄ᵀ (∂P_θ/∂θ) x̄.

In other words, it is possible to differentiate L with respect to θ as if x̄ did not depend
on θ. Note that this is also true if one (or both) of the constraints in the definition
of F are removed.
Proof. We first need to express the equality constraint with a Lagrange multiplier
λ and the inequality constraints with Lagrange multipliers μ_i:

    L(θ) = max_{x, λ, μ} ( xᵀ v_θ - (1/2) xᵀ P_θ x - λ(bᵀx - c) + μᵀx ).    (2.24)

At the maximum, the following conditions are verified:

    v_θ - P_θ x̄ = λ̄ b - μ̄,
    bᵀ x̄ = c,
    μ̄_i x̄_i = 0, for all i.

We will not consider differentiability problems here; the interested reader can
find details in (Bonnans and Shapiro). The main result is that whenever x̄ is
unique, L is differentiable.

We have

    ∂L(θ)/∂θ = x̄ᵀ ∂v_θ/∂θ - (1/2) x̄ᵀ (∂P_θ/∂θ) x̄ + (∂x̄/∂θ)ᵀ (v_θ - P_θ x̄),

where the last term can be written as follows:

    (∂x̄/∂θ)ᵀ (v_θ - P_θ x̄) = λ̄ (∂x̄/∂θ)ᵀ b - (∂x̄/∂θ)ᵀ μ̄.

Using the derivatives of the optimality conditions, namely

    (∂x̄/∂θ)ᵀ b = 0,
    (∂μ̄_i/∂θ) x̄_i + μ̄_i (∂x̄_i/∂θ) = 0,

and the fact that either μ̄_i = 0 or x̄_i = 0, we get

    μ̄_i (∂x̄_i/∂θ) = 0, for all i,

hence

    (∂x̄/∂θ)ᵀ (v_θ - P_θ x̄) = 0,

and the result follows.
Computing the derivative of the margin

Note that in feature space the separating hyperplane { x : w · Φ(x) + b = 0 } has the
following expansion:

    w = Σ_{i=1}^{ℓ} α_i⁰ y_i Φ(x_i),

and is normalized such that

    min_{1≤i≤ℓ} y_i (w · Φ(x_i) + b) = 1.

It follows from the definition of the margin in Theorem 2.1 that the latter is γ =
1/‖w‖. Thus we can write the bound R²/γ² as R²‖w‖². The previous lemma enables us
to compute the derivative of ‖w‖². Indeed, it can be shown (Vapnik) that

    (1/2) ‖w‖² = W(α⁰),

and the lemma can be applied to the standard SVM optimization problem (2.19),
giving

    ∂‖w‖²/∂θ_p = - Σ_{i,j=1}^{ℓ} α_i⁰ α_j⁰ y_i y_j ∂K(x_i, x_j)/∂θ_p.
Computing the derivative of the radius

Computing the radius of the smallest sphere enclosing the training points can be
achieved by solving the following quadratic problem (Vapnik):

    R² = max_β  Σ_{i=1}^{ℓ} β_i K(x_i, x_i) - Σ_{i,j=1}^{ℓ} β_i β_j K(x_i, x_j)

under the constraints

    Σ_{i=1}^{ℓ} β_i = 1,
    β_i ≥ 0, for all i.

We can again use the previous lemma to compute the derivative of the radius:

    ∂R²/∂θ_p = Σ_{i=1}^{ℓ} β_i⁰ ∂K(x_i, x_i)/∂θ_p - Σ_{i,j=1}^{ℓ} β_i⁰ β_j⁰ ∂K(x_i, x_j)/∂θ_p,

where β⁰ maximizes the quadratic form above.
Computing the derivative of the span-rule

Now let us consider the span value. Recall that the span of the support vector x_p is
defined as the distance between the point Φ(x_p) and the set Λ_p defined by (2.22).
The value of the span can then be written as

    S_p² = min_λ max_μ ( ‖ Φ(x_p) - Σ_{i≠p} λ_i Φ(x_i) ‖² + 2μ ( Σ_{i≠p} λ_i - 1 ) ),

where we introduced a Lagrange multiplier μ to enforce the constraint Σ_i λ_i = 1.

Introducing the extended vector λ̃ = (λᵀ, μ)ᵀ and the extended matrix of the dot
products between support vectors,

    K̃_SV = ( K_SV  1 ;  1ᵀ  0 ),

the value of the span can be written as

    S_p² = min_λ max_μ ( K(x_p, x_p) - 2 λ̃ᵀ v + λ̃ᵀ H λ̃ ),

where H is the submatrix of K̃_SV with row and column p removed, and v is the p-th
column of K̃_SV (with entry p removed).

From the fact that the optimal value of λ̃ is H⁻¹v, it follows that

    S_p² = K(x_p, x_p) - vᵀ H⁻¹ v = 1 / (K̃_SV⁻¹)_pp.    (2.25)

The last equality comes from the following block matrix identity, known as the
"Woodbury" formula (Lutkepohl):

    ( A_1  Aᵀ ;  A  A_2 )⁻¹ = ( B_1  Bᵀ ;  B  B_2 ),  where B_1 = (A_1 - Aᵀ A_2⁻¹ A)⁻¹.

The closed form we obtain is particularly attractive, since we can compute the
value of the span for each support vector just by inverting the matrix K̃_SV.
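A small numerical sketch of this closed form (illustrative, not the thesis code): build the bordered matrix K̃_SV from the support-vector kernel matrix, invert it once, and read off every S_p² from the diagonal of the inverse.

    # Span values S_p^2 = 1 / (K_tilde_SV^{-1})_pp for all support vectors at once.
    # K_sv is the kernel matrix restricted to the support vectors; illustrative code only.
    import numpy as np

    def span_values(K_sv):
        n = K_sv.shape[0]
        K_tilde = np.zeros((n + 1, n + 1))
        K_tilde[:n, :n] = K_sv          # dot products between support vectors
        K_tilde[:n, n] = 1.0            # border of ones enforcing sum(lambda) = 1
        K_tilde[n, :n] = 1.0
        inv = np.linalg.inv(K_tilde)
        return 1.0 / np.diag(inv)[:n]   # S_p^2 for p = 1, ..., n_SV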
The derivative of the span is then

    ∂S_p²/∂θ = S_p⁴ ( K̃_SV⁻¹ (∂K̃_SV/∂θ) K̃_SV⁻¹ )_pp.
2.2.5 An Example

In this experiment we try to choose the scaling factors for an RBF kernel and for a
polynomial kernel. More precisely, we consider kernels of the following form:

    K(x, z) = exp( - Σ_i (x_i - z_i)² / (2σ_i²) )

and

    K(x, z) = ( 1 + Σ_i x_i z_i / σ_i² )^d.
Most of the experiments have been carried out on the USPS handwritten digit
recognition database. This database consists of 7291 training examples and 2007 test
examples of digit images of size 16 x 16 pixels. We try to classify digits 0 to 4 against
5 to 9. The training set was split into several subsets, and each of these subsets was
used successively during training.

To assess the feasibility of our gradient descent approach for finding kernel
parameters, we first used only 16 parameters, each one corresponding to a scaling factor
for a square tile of 16 pixels, as shown in Figure 2-1.

Figure 2-1: On each of the 16 tiles, the scaling factors of the 16 pixels are identical.

The scaling parameters were initialized to 1. The evolution of the test error and
of the bound R²/γ² is plotted versus the number of iterations of the gradient descent
procedure in Figures 2-2 (polynomial kernel) and 2-3 (RBF kernel).

Figure 2-2: Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with a polynomial kernel.

Figure 2-3: Evolution of the test error (left) and of the bound R²/γ² (right) during the gradient descent optimization with an RBF kernel.

Note that for the polynomial kernel the test error went below the best test error
obtained with only one scaling parameter. Thus, by taking several scaling parameters,
we managed to make the test error decrease.
2.3 Feature Selection for Support Vector Machines
The motivation for feature selection is threefold:

1. Improve generalization error
2. Determine the relevant features (for explanatory purposes)
3. Reduce the dimensionality of the input space (for real-time applications)
Finding optimal scaling parameters can lead to feature selection algorithms. Indeed,
if one of the input components is useless for the classification problem, its
scaling factor is likely to become small. And if a scaling factor becomes small enough,
that component can be removed without affecting the classification algorithm.
This leads to the following idea for feature selection: keep the features whose scaling
factors are the largest. This can also be performed in a principal components space,
where we scale each principal component by a scaling factor.

Previous work on feature selection for SVMs does exist; however, it has been
limited to linear kernels (Bradley and Mangasarian; Guyon et al.), generative
models (Jebara and Jaakkola), or analysis of perturbations of the margin
(Evgeniou et al.; Guyon et al.). Our approach can be applied to nonlinear
problems outside of the generative model framework and can be thought of as a
generalization of the approach in (Guyon et al.; Evgeniou et al.).[2]

We consider two different parametrizations of the kernel. The first one corresponds
to rescaling the data in the input space:

    K_θ(x, z) = K(θ ∗ x, θ ∗ z),

where θ ∈ IR^n and ∗ denotes the componentwise product. The second one corresponds
to rescaling in the principal components space:

    K_θ(x, z) = K(θ ∗ (Vᵀx), θ ∗ (Vᵀz)),

where V is the matrix of principal components.

We compute θ and V using the following iterative procedure (a sketch of this loop is given after the list):

1. Initialize θ = (1, 1, ..., 1).
2. In the case of principal component scaling, perform principal component analysis to compute the matrix V.
3. Solve the SVM optimization problem.
4. Minimize the estimate of the error T with respect to θ with a gradient step.
5. If a local minimum of T is not reached, go to step 3.
6. Discard the dimensions corresponding to small elements of θ and return to step 3.

[2] This section is based on the work done in (Weston et al.).
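The sketch below illustrates steps 5 and 6 under simplifying assumptions (it is not the thesis implementation): once θ has been optimized by the gradient procedure of Section 2.2.3, dimensions are ranked by their scaling factors and only the largest are kept; with principal-component scaling the same ranking is applied to the PCA coordinates instead of the raw inputs. All function names here are illustrative.

    # Illustrative selection step: keep the dimensions with the largest scaling factors,
    # either in the input space or in the principal-components space.
    import numpy as np
    from sklearn.decomposition import PCA

    def keep_largest(theta, n_keep):
        return np.argsort(theta)[::-1][:n_keep]       # indices of the largest scaling factors

    def reduce_input_space(X, theta, n_keep):
        return X[:, keep_largest(theta, n_keep)]      # input-space feature selection

    def reduce_pca_space(X, theta_pca, n_keep):
        # theta_pca holds one scaling factor per principal component considered
        pca = PCA(n_components=len(theta_pca)).fit(X)
        Z = pca.transform(X)                          # coordinates in principal-components space
        return Z[:, keep_largest(theta_pca, n_keep)]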
We demonstrate this idea on two toy problems, where we show that feature selection
reduces generalization error. We then apply our feature selection algorithm
to DNA microarray data, where it is important to find which genes are relevant in
performing the classification; for these problems, too, feature selection appears to improve
performance. Lastly, we apply the algorithm to face detection and
show that we can greatly reduce the input dimension without sacrificing performance.
2.3.1 Toy data

We compared several algorithms:

- The standard SVM algorithm with no feature selection
- Our feature selection algorithm with the estimate R²/γ² and with the span estimate
- The standard SVM applied after feature selection via a filter method

The three filter methods we used choose the m largest features according to Pearson
correlation coefficients, the Fisher criterion score, and the Kolmogorov-Smirnov
test.[3] Note that the Pearson coefficients and the Fisher criterion cannot model
nonlinear dependencies.
[3] F(r) = (μ_r⁺ - μ_r⁻)² / ((σ_r⁺)² + (σ_r⁻)²), where μ_r^± is the mean value of the r-th feature in the positive and negative
classes and σ_r^± is the corresponding standard deviation.
KS_tst(r) = √ℓ sup | P̂{X ≤ f_r} - P̂{X ≤ f_r, y_r = 1} |, where f_r denotes the r-th feature from
each training example and P̂ is the corresponding empirical distribution.
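A short sketch of these filter scores (an illustrative implementation, assuming NumPy and SciPy, with labels coded as +1/-1): each feature is ranked by its absolute Pearson correlation with the label, by the Fisher criterion, or by a two-sample Kolmogorov-Smirnov statistic, and the m highest-scoring features are kept.

    # Illustrative filter-method scores for ranking features; labels y are assumed in {+1, -1}.
    import numpy as np
    from scipy.stats import ks_2samp

    def fisher_score(X, y):
        pos, neg = X[y == 1], X[y == -1]
        return (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0) + 1e-12)

    def pearson_score(X, y):
        Xc, yc = X - X.mean(0), y - y.mean()
        return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

    def ks_score(X, y):
        return np.array([ks_2samp(X[y == 1, r], X[y == -1, r]).statistic
                         for r in range(X.shape[1])])

    def top_m(scores, m):
        return np.argsort(scores)[::-1][:m]   # indices of the m largest features by a given score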
In the two following artificial datasets our objective was to assess the ability of
the algorithm to select a small number of target features in the presence of irrelevant
and redundant features (Weston et al.).

For the first (linear) example, six dimensions out of 202 were relevant. The probability of y = 1
or -1 was equal. The first three features {x1, x2, x3} were drawn as x_i = y N(i, 1), and
the second three features {x4, x5, x6} were drawn as x_i = N(0, 1), with probability
0.7; otherwise the first three were drawn as x_i = N(0, 1) and the second three as
x_i = y N(i - 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., 202.

For the second (nonlinear) example, two dimensions out of 52 were relevant. The probability of
y = 1 or -1 was equal. The data are drawn as follows: if y = -1 then
{x1, x2} are drawn from N(μ1, Σ) or N(μ2, Σ) with equal probability, with μ1 = (-3/4, -3)
and μ2 = (3/4, 3) and Σ = I; if y = 1 then {x1, x2} are drawn again from two normal
distributions with equal probability, with μ1 = (3, -3) and μ2 = (-3, 3) and the
same Σ as before. The rest of the features are noise, x_i = N(0, 20), i = 3, ..., 52.

In the linear problem the first six features have redundancy and the rest of the
features are irrelevant. In the nonlinear problem all but the first two features are
irrelevant.
We used a linear kernel for the linear problem and a second order polynomial
kernel for the nonlinear problem.

We required the feature selection algorithms to keep only the best two features.
The results are shown in Figure 2-4 for various training set sizes, taking the average
test error over repeated runs at each training set size. The Fisher score (not
shown in the graphs due to space constraints) performed almost identically to the correlation
coefficients.

In both problems we clearly see that our method outperforms the other classical
methods for feature selection. In the nonlinear problem, among the filter methods
only the Kolmogorov-Smirnov test improved performance over standard SVMs.
[Figure 2-4: two panels, (a) linear problem and (b) nonlinear problem; the curves compare Span-Bound & Forward Selection, RW-Bound & Gradient, Standard SVMs, Correlation Coefficients, and the Kolmogorov-Smirnov Test.]

Figure 2-4: A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.
2.3.2 DNA Microarray Data

Next, we tested this idea on two leukemia discrimination problems (Golub et al.)
and on a problem of predicting treatment outcome for medulloblastoma.[4] The first
problem was to classify myeloid versus lymphoblastic leukemias based on the expression
of 7129 genes. The training set consists of 38 examples and the test set of 34 examples.
Standard linear SVMs make a small number of errors on the test set. Using gradient
descent on R²/γ² we achieved comparable or better accuracy using only tens of genes,
and even with a handful of genes. Using the Fisher score to select features also resulted
in a small number of errors at both gene-set sizes.

The second leukemia classification problem was discriminating B versus T cells
for lymphoblastic cells (Golub et al.). Standard linear SVMs make a small number of
errors for this problem. Using either the span bound or gradient descent on R²/γ²
reduces the errors using only a few genes, whereas the Fisher score makes more errors
using the same number of genes.

The final problem is one of predicting treatment outcome for patients that have
medulloblastoma. Here there are 60 examples, each with 7129 expression values, in
the dataset, and we use leave-one-out to measure the error rate. A standard SVM
with a Gaussian kernel misclassifies a substantial fraction of the patients, while selecting
a subset of genes by gradient descent on R²/γ² reduced the leave-one-out error.

[4] The database will be available at http://waldo.wi.mit.edu/MPR/data_sets.html
2.3.3 Face detection

The trainable system for detecting frontal and near-frontal views of faces in gray
images presented in (Heisele et al.) gave good results in terms of detection rates.
The system used the gray values of 19 x 19 images as inputs to a second-degree polynomial
kernel SVM. This choice of kernel leads to tens of thousands of features in the feature
space. Searching an image for faces at different scales took several minutes on a PC.
To make the system real-time, reducing the dimensionality of the input space and the
feature space was required. Feature selection in principal components space was
used to reduce the dimensionality of the input space (Serre et al.).

The method was evaluated on the large MIT-CMU test set, consisting of several
hundred faces and tens of millions of non-face patterns. In Figure 2-5 we compare the ROC
curves obtained for different numbers of selected components.

The results showed that increasing the number of components beyond a few dozen does not
improve the performance of the system (Serre et al.).

Figure 2-5: ROC curves for different numbers of PCA gray features.
2.4 Confidence in Predictions and Rejections

For many applications, especially clinical applications, the concept of the confidence
of a prediction, and of rejecting a sample rather than making a call, is very important. In
this section we develop such a methodology for SVMs. Prior work on outputting
probabilities or confidences from SVMs can be found in (Platt; Vapnik).

The basic idea is to reject points near the optimal hyperplane, for which the
classifier may not be very confident of the class label. We introduce confidence levels
based on the SVM output d:

    d = Σ_{i=1}^{ℓ} α_i⁰ y_i K(x_i, x) + b.

These confidence levels are a function of d and are computed from the training data.
This allows us to reject samples below a certain value of |d| (similarly, we could have
two different confidence values d⁺ and d⁻ for the two sides of the optimal hyperplane)
because they do not fall within the confidence level. Introducing confidence levels
resulted in improved accuracy for all four gene-set sizes at the cost of a small number
of rejects, depending on the gene set (Table 2.1). Figure 2-6 plots the d values for the test
data and the classification and rejection intervals. The genes were selected by the
signal-to-noise criterion used in (Golub et al.).
Table 2.1: Number of errors, rejects, confidence level, and the |d| corresponding to the confidence level, for various numbers of genes, with the linear SVM discriminating ALL from AML. (Columns: number of genes, rejects, errors, confidence level, |d| at that confidence level.)
The computation of the confidence level is based on a Bayesian formulation and
the following assumption for SVMs:

    p(c | x) = p(c | d).

We can rewrite p(c | d) as

    p(c | d) = p(d | c) p(c) / p(d).

For our problem, we assume p(1) = p(-1) and that p(d | 1) = p(-d | -1); this allows
us to simply estimate p(|d| | {1, -1}). We make the previous assumptions so that we
only have to estimate one confidence level based upon |d| rather than two confidence
levels, one for class 1 and one for class -1.

We use the leave-one-out estimator on the training data to get ℓ values of |d|. We
then estimate the distribution function F̂(|d|) from these values. This was done
using an automated nonparametric density estimation algorithm which has no free
parameters (Mukherjee and Vapnik; Vapnik and Mukherjee). The confidence
level C(|d|) is simply

    C(|d|) = 1 - F̂(|d|).

Figure 2-7 is a plot of the confidence level as a function of 1/|d| for the four cases. If we
look at the d values for the two classes separately, we obtain two confidence levels (Figure 2-8).
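A minimal sketch of this construction (illustrative, assuming scikit-learn and NumPy): collect the leave-one-out decision values d on the training set, form the distribution of |d|, and report C(|d|) = 1 - F̂(|d|). Here the empirical distribution function stands in for the nonparametric density estimator used in the thesis, which is an assumption of this sketch.

    # Confidence levels from leave-one-out SVM outputs; the empirical CDF replaces the
    # thesis's nonparametric density estimate, and all parameter values are illustrative.
    import numpy as np
    from sklearn.svm import SVC

    def loo_decision_values(X, y, C=1.0):
        d = np.empty(len(y))
        for p in range(len(y)):                     # refit once per held-out training point
            keep = np.arange(len(y)) != p
            clf = SVC(kernel="linear", C=C).fit(X[keep], y[keep])
            d[p] = clf.decision_function(X[p:p + 1])[0]
        return d

    def confidence(abs_d, loo_abs_d):
        """C(|d|) = 1 - F(|d|), with F the empirical CDF of the leave-one-out |d| values."""
        F = np.mean(loo_abs_d <= abs_d)
        return 1.0 - F

    # A rejection rule then discards any test sample whose |d| lies below the threshold
    # associated with the desired confidence level.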
Figure 2-6: Plots of the distance from the hyperplane for test points for the four gene-set sizes (a)-(d). Markers distinguish class ALL from class AML, mistakes are highlighted, and the line indicates the decision boundary.
Figure 2-7: Plots of the confidence levels as a function of 1/|d| estimated from a leave-one-out procedure on the training data, for the four gene-set sizes (a)-(d).
Figure 2-8: Plot of the confidence levels as a function of 1/d estimated from a leave-one-out procedure on the training data.
Chapter 3

Comparison of Algorithms using Microarray Data to Predict Cancer Morphology and Treatment Outcome

3.1 Introduction
Effective cancer treatment depends upon the availability of curative therapies, accurate diagnosis of the disease so that the appropriate therapy can be utilized, and the ability to classify patients according to groups who are likely to benefit from each existing therapy. For most tumors, anatomical and morphological analysis remains the standard method by which clinical decision making is directed. Methods of cancer detection have improved over the last two decades, but there are still serious limitations in our ability to accurately classify tumors in a manner that would allow for more rational and systematic clinical decision making. Using standard techniques, the empirical classification for a given tumor can vary dramatically from patient to patient. This may be due to biological differences that cannot yet be measured by standard methods. A particular class of biological differences that holds promise is the difference in gene expression patterns for different tumor types. Cancer is a disease where the pattern of expression of genes involved in differentiation and cell growth is altered, resulting in a state of uncontrolled growth that becomes clinically apparent (Hanahan and Weinberg, 2000; Weinberg and Varmus, 1993). Recent technological advances in molecular genetics, i.e. oligonucleotide and cDNA microarrays (Hardy, 1999), allow us to monitor easily the simultaneous and quantitative expression of thousands of genes in clinical specimens. In this context there is considerable interest in understanding whether gene expression profiles might serve as molecular fingerprints that would allow for a better and more accurate classification of cancer and other diseases or biological phenotypes in general. These molecular fingerprints may be used not just to create gene lists that are particular to a certain taxonomy of cancer, but also to discover new taxonomies and possibly uncover the structure of genetic networks not yet understood.
It has been shown (Golub et al., 1999) that two types of leukemia, acute myeloid and acute lymphoblastic, can be distinguished with a very low error rate without using any a priori biological knowledge or expert interpretation. The methodology was rather general and followed the paradigm of data collection, feature selection, model building by cross-validation, and model testing on an independent dataset. The classifier used a weighted voting algorithm, and by achieving low error rates it proved the feasibility of performing molecular classification using only gene expression patterns. Over the last year we have also built binary classifiers to distinguish normal vs. malignant tissues and to differentiate lineages: B-cell vs. T-cell leukemia (Slonim et al., 2000), glioblastoma vs. medulloblastoma (Pomeroy et al., 2001), and follicular vs. large B-cell lymphoma (Shipp et al., 2001). Given enough samples, these kinds of results can be achieved with low error rates because there are many relevant features ("marker" genes) correlated with the target class that can be exploited by the classifier algorithm.
Other problems, such as outcome prediction (low vs. high risk), i.e. who will respond to clinical treatments such as chemotherapy or radiation, are more problematic and challenging due to the small number of marker genes in a background of morphologically identical samples and the intrinsic complexity of the phenotype. Specifically, for the two treatment outcome problems, medulloblastoma and large B-cell lymphoma, the algorithms achieve error rates well above those of the morphology problems. Based on this, one can designate a hierarchy of problems of increasing complexity as follows:
1. Histological differences: normal vs. malignant, skin vs. brain tissue.
2. Morphologies: different leukemia types, ALL vs. AML.
3. Lineage: B-cell vs. T-cell, follicular vs. large B-cell lymphoma.
4. Outcome: treatment outcome, relapse, or drug sensitivity.
In this work we make a systematic study of algorithms and molecular classification problems with the goal of understanding which are the best algorithms and methodologies, and also of identifying the main obstacles that hamper computational approaches to molecular classification problems in the fourth class. This class is probably the most important from a clinical perspective because there are almost no effective traditional methods to perform this critical classification.
The core result and most clinically relevant part of this study is the comparison of several algorithms: k-Nearest Neighbors (kNN) (Duda and Hart, 1973), Naive Bayes (NB) (Duda and Hart, 1973), Weighted Voting Average (WV) (Slonim et al., 2000), and SVMs (Vapnik, 1998). We examined the performance of these algorithms for all the datasets above in terms of various error measures (for example, two of the measures used were the number of errors when a prediction is made for all samples, and the number of errors and percentage of calls made when the classifiers rejected low confidence predictions). For clinical applications error rates must be very low. For all the discrimination tasks except for treatment outcome prediction we fall within an acceptable range.
In general, for reasonable (consistent) classifiers, as more samples are given the classifiers converge to the smallest achievable error rate for a given dataset, often called the Bayes error. An important question is the following: for a given dataset, how much will the addition of n samples lower the error rate? For the treatment outcome problem this translates to the practical question of, given our performance with n samples, about how many more are needed to achieve a desired error rate, and is it even possible to achieve such a rate? In addition, given enough samples all reasonable algorithms will perform well; however, some may require an order of magnitude more samples to reach this performance level. Since we are always faced with few samples in gene expression studies this issue is very important. The above questions are addressed by constructing learning curves for the above algorithms, fitting the empirical error rates to the following model of error rate as a function of the number of samples:

    Err(n) = a n^{-α} + b,

where n is the number of samples, Err(n) is the error rate given n samples, a is a constant, α > 0 is the rate of convergence of the algorithm, and b is the error rate as the number of samples goes to infinity, which is the smallest achievable error rate. This analysis was done for the lymphoma treatment and brain outcome datasets.
Lastly, we build a simple model of the expression data and derive closed form expressions for the performance of a standard type of classifier (a perceptron or hyperplane discriminant) for this model. This analysis allows us to state both the optimal accuracy of a classifier and the variance in the accuracy as a function of sample size, strength of the underlying signal in the model, and number of informative genes. This analysis helps give us an indication of how much of a change in performance accuracy is statistically significant. We applied this analysis to our datasets by assuming the datasets follow the simple model, choosing the number of informative genes, and using empirical estimates from the data to determine the signal strength.
Classification Results

We state classification results for the algorithms on the three different datasets. As expected, the error rates were lower for morphology classification than for treatment outcome, following the general hierarchy of increasing complexity described in the introduction. See the Methods section below for details about the classifiers and the datasets.
[Figure omitted: classification results for the morphology and lineage datasets.]
Morphology and Lineage

The results for the various datasets are listed in the results figure above. Most algorithms performed very well for the leukemia morphology and lineage distinctions (AML vs. ALL, B- vs. T-cell), achieving zero errors. For the follicular vs. large B-cell lymphoma distinction, SVM and kNN produced the smallest error rates. The reason there are some mistakes is presumably because the problem is slightly more difficult than the one corresponding to the leukemias, or because, given the larger dataset size, the chances of mislabeled samples are higher. For the glioblastoma vs. medulloblastoma distinction all algorithms perform well, achieving low error rates. One of the reasons these classifications can be done with such small error rates is that there are many relevant features highly correlated with the target class that are used by the different algorithms.
Treatment Outcome

The results for treatment outcome are listed in the results figure below. Prediction results for all algorithms for leukemia treatment are at chance level, or the error rate that would be achieved by always predicting the majority class. This might be due to the small sample size, the inherent complexity of the problem, the heterogeneity of the samples, or a basic lack of correlation between outcome and the expression values. For the lymphoma and brain outcome predictions the results are more promising, and the algorithms achieve error rates well below chance.
[Figure omitted: classification results for the treatment outcome datasets.]
[Figure: Survival plots for (a) Lymphoma and (b) Brain outcomes.]
For outcome prediction it is important to consider not only global error rates but also the false positive and false negative rates. The reason is that clinically the classes are not symmetric. For example, classifying a high risk patient in the low risk class implies that the patient may get less treatment (e.g. chemotherapy). For this reason, when comparing algorithms one has to look at the errors per class or the average error per class. For the medulloblastoma dataset we list two models generated by the SVM: one using global error as an optimization criterion, the other using the average error per class as an optimization criterion.
Another way to measure the accuracy of the algorithms besides the error rate is to compute the Kaplan-Meier survival plot and statistic (Lachin, 2000). Survival statistics incorporate time-to-event information into the classification problem, so a patient that is misclassified as alive is penalized more if the patient died early in the study rather than towards the end. One of the most common survival estimators is the Kaplan-Meier estimator (Lachin, 2000). One constructs empirical distribution functions, with respect to time, of the number of patients alive in the two classes: those predicted to live and those predicted to die. One then tests against the null hypothesis that these two empirical distributions were drawn from the same probability distribution. As can be seen in the survival figure above, the p-values of the Kaplan-Meier statistic for the lymphoma (WV) and brain (kNN) outcome predictions were both small. Note that even relatively high error rates may still be significant from a survival perspective.
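The following is a small self-contained sketch of the Kaplan-Meier estimate referred to above, written in Python with hypothetical variable names (times, events, pred); a log-rank style test comparing the two estimated curves would then supply p-values of the kind quoted in the text.

    import numpy as np

    def kaplan_meier(times, events):
        """Kaplan-Meier survival estimate.
        times  : follow-up time for each patient
        events : 1 if the patient died (event observed), 0 if censored
        Returns the event times and the estimated survival after each."""
        order = np.argsort(times)
        times, events = np.asarray(times)[order], np.asarray(events)[order]
        at_risk, surv = len(times), 1.0
        out_t, out_s = [], []
        for t in np.unique(times):
            mask = times == t
            deaths = events[mask].sum()
            if deaths > 0:
                surv *= 1.0 - deaths / at_risk
                out_t.append(t)
                out_s.append(surv)
            at_risk -= mask.sum()   # remove deaths and censored cases at time t
        return np.array(out_t), np.array(out_s)

    # Survival curves for the two predicted groups (hypothetical arrays):
    # t_lo, s_lo = kaplan_meier(times[pred == "low"], events[pred == "low"])
    # t_hi, s_hi = kaplan_meier(times[pred == "high"], events[pred == "high"])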
Learning Curves

Learning curves are constructed for both the lymphoma treatment and brain outcome datasets. For details on how each data point was estimated and the way the curve was fit, see the Methods subsection on constructing learning curves.
The corresponding figures plot the learning curves and the data points used to fit the curves for the treatment and morphology data, respectively. The algorithms considered in these plots are the SVM and kNN algorithms. One can see that the SVM has a quicker rate of convergence; it does better given fewer training examples than the kNN algorithm. Four curves were estimated, two per dataset, each of the form

    Err(n) = a n^{-α} + b,

where Err(n) is the error rate as a function of the number of samples n, the constant term b gives us an estimate of the smallest error rate achievable, and α gives us the rate of convergence of the algorithm. One of the fitted curves is the learning curve for brain treatment outcome using the SVM algorithm, and another is the learning curve for lymphoma outcome using the SVM algorithm.
Removal of Important Genes and Higher Order Information

We examined how well the SVM performed when the most important genes according to the signal-to-noise ratio criterion (Golub et al., 1999) were removed from the leukemia morphology discrimination problem. We also examined whether higher order interactions helped when important genes are removed.
Higher order statistics seem in fact to increase performance when the problem is artificially made more difficult by removing a moderate number of the top features. Beyond this point, higher order kernels hindered performance. This result is consistent with the concepts of generalization error upon which the SVM algorithm is based. When the data is less noisy, the advantage of the flexibility of a more complicated model outweighs the disadvantage of the possibility of overfitting. When the data is noisy, these aspects are reversed, so a simpler model performs better. The SVM performed well until a large number of features were removed (see the table below). Biologically this is interesting because it hints that genes do interact and are not independent.
[Table: number of errors as a function of the order of the polynomial kernel (1st, 2nd, and 3rd order) and the number of important genes removed.]
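A rough sketch of this experiment is given below, assuming the signal-to-noise ratio of Golub et al. as the ranking criterion and scikit-learn's polynomial-kernel SVM; the regularization and kernel constants are illustrative, not the values used in the thesis.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score, LeaveOneOut

    def signal_to_noise(X, y):
        """Signal-to-noise ratio of Golub et al. for each gene:
        (mu_1 - mu_2) / (sigma_1 + sigma_2), with labels in {+1, -1}."""
        X1, X2 = X[y == 1], X[y == -1]
        return (X1.mean(0) - X2.mean(0)) / (X1.std(0) + X2.std(0))

    def errors_after_removal(X, y, n_removed, degree):
        """Leave-one-out errors of a polynomial-kernel SVM after discarding
        the n_removed genes with the largest |signal-to-noise|."""
        rank = np.argsort(-np.abs(signal_to_noise(X, y)))
        keep = rank[n_removed:]                      # drop the top-ranked genes
        clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=1.0)
        scores = cross_val_score(clf, X[:, keep], y, cv=LeaveOneOut())
        return int(round((1.0 - scores.mean()) * len(y)))   # number of errors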
Bayes Error and Sample Size Deviations
In this section we introduce a very simple model of the gene expression data. Given this simple model and a linear classifier, we compute the accuracy of the optimal classifier and the deviation from this accuracy due to the fact that we have a finite sample size n. See the Methods subsection on sample size deviations for details of the derivation used in this section.
We assume that the data from the two classes are drawn from Gaussian distributions that are independent across features or genes. We also assume that the prior probability of each class is equal. One can compute the generalization error of this model when we separate with a hyperplane (linear classifier). Knowing the parameters of the two class distributions, we can compute the Bayes optimal hyperplane, the hyperplane that results in the smallest generalization error. The variation in estimating the hyperplane due to the fact that we have few samples is then computed. We can then estimate the deviation from the Bayes optimal error.
We now have a procedure to compute the Bayes optimal error and the finite sample deviation from this error under the model assumptions above. We apply this analysis to the various datasets. The above analysis requires knowledge of the means and variances of the distributions of the two classes for each gene. We replace these values with the sample means and sample variances of the classes to construct the table below.
Data set                  Sample size   Number of genes   Bayes optimal error   Deviation
Leukemia Morphology       ��            �                 �%                    �%
Leukemia Lineage (ALL)    ��            �                 �%                    �%
Lymphoma Morphology       ��            ��                �%                    ��%
Lymphoma Outcome          ��            �                 ��%                   ��%
Brain Morphology          �             �                 �%                    �%
Brain Outcome             ��            �                 ��%                   ��%

Table: Estimated Bayes optimal error and deviation.
Methods
The datasets, data preparation, classification algorithms used, and details of the models follow.
Datasets
The three datasets used are good representatives of liquid and solid primary tumors with all the complexities of real world clinical tumor samples. All the datasets correspond to binary discriminations or labels.
Leukemia
A set of �� samples was derived from bone marrow aspirates performed at the time of diagnosis, prior to any chemotherapy. The dataset contains acute lymphoblastic leukemia (ALL, B- and T-cell) and acute myeloid leukemia (AML). These samples were randomly selected from the leukemia cell bank based on availability. Samples were selected without regard to immunophenotype, cytogenetics, or other molecular features.
Data set                       Total samples   Class 1      Class 2
                                               ALL          AML
Leukemia Morphology (train)    ��              ��           ��
Leukemia Morphology (test)     ��              ��           ��
                                               B-cell       T-cell
Leukemia Lineage (ALL)         ��              ��           ��
                                               Low risk     High risk
Leukemia Outcome (AML)         ��              ��           ��

Table: Leukemia dataset.
Low risk means patients who were alive at the time of the last survey or patients who died from non-disease-related causes. High risk corresponds to patients who died from disease after treatment.
Lymphoma
These datasets contain samples corresponding to excisional lymph node biopsy specimens obtained from �� patients with follicular lymphoma (FSC) and �� with diffuse large cell lymphoma (DLCL).
The average follow up for these patients is �� months. Low risk means patients who were alive at the time of the last survey or patients who died from non-disease-related causes. High risk corresponds to patients who died from disease after treatment.
Data set               Total samples   Class 1      Class 2
                                       FSC          DLCL
Lymphoma Morphology    ��              ��           ��
                                       Low risk     High risk
Lymphoma Outcome       ��              ��           ��

Table: Lymphoma dataset.
Brain
These samples correspond to glioblastoma and childhood medulloblastoma (cerebellum) tumors obtained from several sources.
Data set             Total samples   Class 1      Class 2
                                     Glioma       MD
Brain Morphology     ��              ��           ��
                                     Low risk     High risk
Brain Outcome        ��              ��           ��

Table: Medulloblastoma and Glioblastoma dataset.
The samples included in the outcome dataset had at least two years of follow up after treatment. Low risk corresponds to patients who were alive at the last survey. High risk corresponds to patients who died in the first two years after treatment. The long term survival rate of medulloblastoma patients is only moderate, and the survivors generally suffer adverse side effects as a consequence of radiation therapy. This is one of the reasons it is important to find better methods of outcome classification.
Data Preparation
The biological samples used in this work were obtained from different tumor banks, but the process of extracting RNA and the basic laboratory protocol were essentially the same. For details see the protocols section of the web site http://www.genome.wi.mit.edu/MPR. The samples were snap frozen in liquid nitrogen and stored at �� degrees. All samples were obtained prior to the patients receiving any chemotherapy or radiation treatment. The RNA was hybridized overnight to Affymetrix high-density oligonucleotide microarrays containing probes for ���� known human genes and ��� expressed sequence tags (ESTs). The arrays were scanned with a Hewlett-Packard scanner, and the expression levels for each gene were calculated using Affymetrix GENECHIP analysis software. The data obtained from the arrays were rescaled in order to adjust for minor differences in overall array intensity.
Construction of Discriminative Models
This is the general methodology that we follow:
1. Obtain and filter expression data. Expression values are thresholded below and above fixed floor and ceiling values; then we apply a variation filter (a minimum fold change and a minimum absolute variation across samples) to eliminate genes that do not change significantly across the samples in the dataset.
2. Define a target class based on morphological or clinical information. Here we choose either the morphology, the lineage, or the long term clinical treatment outcome of the samples (patients).
3. Select the features ("marker" genes) with the highest correlation with the target class.
4. Build a classifier in cross-validation (leave-one-out) and measure the error rate.
5. Build a final classifier using all the samples and measure the test error if an independent test set is available.
Since the dimensionality of the datasets is quite large (thousands of potential features), the feature selection process can be quite important for some algorithms; a sketch of the filtering in step 1 is given below.
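A minimal sketch of the filtering in step 1 follows. The floor, ceiling, fold-change, and minimum-variation thresholds are left as parameters, since the exact values are not recoverable from this transcript, and the precise form of the variation filter (fold change plus absolute difference) is an assumption of the sketch.

    import numpy as np

    def filter_expression(X, floor, ceiling, fold_change, min_delta):
        """Threshold expression values and apply a variation filter.
        X has shape (samples, genes); floor is assumed to be positive."""
        X = np.clip(X, floor, ceiling)                    # threshold below and above
        gene_max, gene_min = X.max(axis=0), X.min(axis=0)
        keep = (gene_max / gene_min >= fold_change) & \
               (gene_max - gene_min >= min_delta)          # variation filter
        return X[:, keep], keep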
A description of the various classifiers follows.
Weighted Voting
The weighted voting algorithm used was identical to that used in (Golub et al., 1999; Slonim et al., 2000); a brief sketch follows.
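The sketch below follows the published description of that algorithm: each selected gene casts a vote weighted by its signal-to-noise ratio, with a per-gene decision boundary at the midpoint of the class means. The number of marker genes is an illustrative parameter, not the value used in the thesis.

    import numpy as np

    class WeightedVoting:
        """Sketch of the weighted voting classifier (labels in {+1, -1})."""
        def fit(self, X, y, n_genes=50):
            X1, X2 = X[y == 1], X[y == -1]
            m1, m2 = X1.mean(0), X2.mean(0)
            self.a = (m1 - m2) / (X1.std(0) + X2.std(0))   # signal-to-noise weights
            self.b = (m1 + m2) / 2.0                        # per-gene decision boundary
            self.genes = np.argsort(-np.abs(self.a))[:n_genes]
            return self

        def predict(self, X):
            votes = self.a[self.genes] * (X[:, self.genes] - self.b[self.genes])
            return np.where(votes.sum(axis=1) >= 0, 1, -1)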
k-Nearest Neighbors

The kNN algorithm uses the cosine distance as the metric:

    d(x, y) = 1 − (x · y) / (‖x‖ ‖y‖).

Two variations of the standard procedure of voting among the nearest neighbors were used. In the first variation we used a weighted vote of the k nearest neighbors, where the weight was one divided by the Euclidean distance. In the second variation the weight was 1/k, where k is the rank of the neighbor. Note that for this weighting to have an effect different from the standard kNN vote, k must be greater than 2. (The series Σ_k 1/k is divergent, and it is for this reason that this type of weighting can give different results than the standard vote.)
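A minimal sketch of this classifier, with the 1/rank weighting of the second variation, might look as follows (labels are assumed to be +1/-1):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5, rank_weighted=True):
        """k-NN with cosine distance d(x, y) = 1 - <x, y>/(||x|| ||y||);
        with rank_weighted=True the j-th nearest neighbour gets weight 1/j."""
        sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
        dist = 1.0 - sims
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / np.arange(1, k + 1) if rank_weighted else np.ones(k)
        score = np.sum(weights * y_train[nearest])
        return 1 if score >= 0 else -1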
Naive Bayes

In our implementation we make the following assumptions (a small sketch under these assumptions follows the list):
1. the expression of each gene follows a Gaussian distribution for each class;
2. the expression levels of different genes are independent;
3. the prior probabilities for each class are equal.
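Under these three assumptions the classifier reduces to a few lines; the sketch below is illustrative and assumes +1/-1 labels.

    import numpy as np

    def naive_bayes_predict(X_train, y_train, x, eps=1e-9):
        """Gaussian naive Bayes with per-gene Gaussians, independent genes,
        and equal class priors."""
        scores = {}
        for c in (-1, 1):
            Xc = X_train[y_train == c]
            mu, sd = Xc.mean(0), Xc.std(0) + eps
            # log-likelihood of x under the product of per-gene Gaussians
            scores[c] = np.sum(-0.5 * ((x - mu) / sd) ** 2 - np.log(sd))
        return max(scores, key=scores.get)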
Support Vector Machines

We use the SVM with the radius-margin ratio (RMR) as our feature selection criterion.
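The RMR criterion itself is developed in the earlier chapter on extensions to SVMs; the sketch below only illustrates the quantity R²‖w‖² for a linear kernel, and it approximates the radius R crudely by the largest distance of a sample from the data centroid rather than by the exact smallest enclosing sphere, so it should be read as an assumption-laden illustration rather than the thesis's implementation.

    import numpy as np
    from sklearn.svm import SVC

    def radius_margin_ratio(X, y, C=1.0):
        """Rough estimate of R^2 * ||w||^2 for a linear SVM."""
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w_sq = float(np.sum(clf.coef_ ** 2))          # ||w||^2 = 1 / margin^2
        R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))  # crude radius
        return R ** 2 * w_sq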
Constructing Learning Curves

Learning curves are constructed for both the lymphoma treatment and follicular vs. large B-cell datasets. Here we give the methodology for constructing the learning curves.
A range of training set sizes was used for the lymphoma treatment dataset and for the follicular vs. large B-cell dataset, with a held-out test set of fixed size in both cases. The numbers of training and test samples in the two classes were drawn proportionally to the numbers of samples in the population. We drew fifty respective training and test sets, where each respective training and test set did not contain overlapping points.
We set the parameters of the algorithms using points not in the test or training sets for cross-validation.
We then average the results of the fifty trials and fit the following function

    Err(n) = a n^{-α} + b

using a least squares criterion to obtain the learning curve; a sketch of this fit is given below. We also plot standard deviation bars based upon the standard deviation over the fifty trials.
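A minimal sketch of the least squares fit, using scipy's curve_fit with illustrative starting values and bounds, is:

    import numpy as np
    from scipy.optimize import curve_fit

    def fit_learning_curve(sizes, error_rates):
        """Least-squares fit of Err(n) = a * n**(-alpha) + b.
        sizes       : training set sizes used
        error_rates : mean test error at each size, averaged over the trials
        Returns the fitted (a, alpha, b)."""
        model = lambda n, a, alpha, b: a * np.power(n, -alpha) + b
        (a, alpha, b), _ = curve_fit(model,
                                     np.asarray(sizes, float),
                                     np.asarray(error_rates, float),
                                     p0=(1.0, 0.5, 0.1),
                                     bounds=([0, 0, 0], [np.inf, np.inf, 1.0]))
        return a, alpha, b

The constant term b then serves as the estimate of the smallest achievable error rate and alpha as the rate of convergence, as described above.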
Sample Size Deviations for a Simple Model

Here we give the details of how the Bayes optimal error, and the deviation from that error as a function of sample size, are derived.
We assume that the data from the two classes are drawn from Gaussian distributions that are independent across dimensions or features:

    P_1(x) ∝ exp( − Σ_{i=1..d} (x_i − t_{1i})² / (2 σ_{1i}²) ),

    P_2(x) ∝ exp( − Σ_{i=1..d} (x_i − t_{2i})² / (2 σ_{2i}²) ).

We also assume that the prior probability of each class is equal. One can compute the generalization error of this model given a hyperplane w:

    A = (1/2) [ ∫_{x·w ≤ 0} P_1(x) dx + ∫_{x·w > 0} P_2(x) dx ],

which can be simplified to the following:

    A = (1/2) [ erf( (w · t_1) / ‖w̃_1‖ ) − erf( −(w · t_2) / ‖w̃_2‖ ) ],

where w̃_j is the vector with elements w̃_{ji} = σ_{ji} w_i.
Now, given the two class distributions, we can compute the Bayes optimal hyperplane as that for which

    (w · t_1) / ‖w̃_1‖ = − (w · t_2) / ‖w̃_2‖.

Note that the covariance matrices for both classes are diagonal, so this equality is achieved component-wise; this results in

    w_{bi} = ( t_{1i} σ_{2i} + t_{2i} σ_{1i} ) / ( σ_{1i} + σ_{2i} ),

where w_{bi} is the ith component of the Bayes optimal hyperplane.
We now need to compute the variation in the elements w_{bi} due to finite sample size. Since the standard deviations of the two classes are estimated from the data, the estimation error in the standard deviations is simply the standard error for a sample size of n:

    σ̂_{1i} = σ_{1i} ± σ_{1i} / √n.

We now need to compute the deviation from the Bayes optimal point for each component. First we replace σ_{1i} in the expression for w_{bi} with σ̂_{1i}, where σ̂_{1i} = σ_{1i} ± σ_{1i}/√n. We then replace the class means t_{1i} with sample means. We can write the following random variable ζ as the numerator of the Bayes optimal point computation:

    ζ = x σ̂_{2i} + y σ̂_{1i},

where x is drawn from the distribution of class 1 and y is drawn from class 2. This random variable can be thought of as the distribution of deviations from an approximate Bayes optimal point. The random variable ζ is distributed as follows:

    P(ζ) = (1 / (√(2π) c)) exp( −(ζ − b_a)² / (2c²) ),

where c is the standard deviation of ζ and b_a = t_{1i} σ̂_{2i} + t_{2i} σ̂_{1i}. Given the above distribution, if we assume that (the deviation can be derived without this assumption, but it is more complicated)

    b_a = w_{bi},

then we get the following equality:

    P{ |ζ − w_{bi}| ≤ k c / √n } = erf(k),

where c is the standard deviation in the distribution above and n is the number of examples from the two classes. For the rest of the section we will set k = 2. From this analysis we deviate each component w_{bi} by ± k c / √n in the accuracy expression above and compute the generalization error.
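As a numerical cross-check of this analysis, the sketch below swaps the closed-form perturbation above for a direct Monte Carlo simulation under the same independent-Gaussian model. It uses a simple diagonal plug-in linear rule as the finite-sample classifier; that choice, and the parameter names, are assumptions of the sketch rather than the exact construction of the text.

    import numpy as np
    from scipy.special import erf

    def gaussian_cdf(z):
        return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

    def true_error(w, b, t1, t2, s1, s2):
        """Exact error of sign(w.x + b) when class 1 ~ N(t1, diag(s1^2)) and
        class 2 ~ N(t2, diag(s2^2)), equal priors, independent genes."""
        e1 = gaussian_cdf(-(w @ t1 + b) / np.linalg.norm(w * s1))  # class 1 misclassified
        e2 = gaussian_cdf((w @ t2 + b) / np.linalg.norm(w * s2))   # class 2 misclassified
        return 0.5 * (e1 + e2)

    def error_vs_sample_size(t1, t2, s1, s2, n, trials=1000, seed=0):
        """Mean and spread of the error of a plug-in linear rule trained on
        n samples per class drawn from the model."""
        rng = np.random.default_rng(seed)
        errs = []
        for _ in range(trials):
            x1 = rng.normal(t1, s1, size=(n, len(t1)))
            x2 = rng.normal(t2, s2, size=(n, len(t2)))
            m1, m2 = x1.mean(0), x2.mean(0)
            pooled = 0.5 * (x1.var(0) + x2.var(0)) + 1e-12
            w = (m1 - m2) / pooled            # diagonal plug-in direction
            b = -w @ (m1 + m2) / 2.0          # midpoint threshold
            errs.append(true_error(w, b, t1, t2, s1, s2))
        return float(np.mean(errs)), float(np.std(errs))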
Chapter �
Further Remarks
Summary and Contributions

This thesis consisted of two parts. The first described extensions to SVMs required for the problem of analyzing microarray data. The second consisted of a systematic evaluation of a variety of machine learning algorithms on five datasets from four types of molecular cancer classification problems; it also offered an empirical approach to asking questions about sample size and performance, as well as looking at the problem of predicting treatment outcome.
The contributions of the thesis, as outlined in the introduction, can be summarized as follows:
1. Feature selection for SVMs.
2. Outputting confidence estimates as well as class labels for SVMs.
3. A comparison of algorithms for DNA microarray problems.
4. Empirical answers to sample size requirements for two microarray problems.
Future Work

Some future areas of research are listed below.
- The fact that the leave-one-out estimator is almost unbiased no longer holds when one tries to optimize the leave-one-out quantity with respect to a parameter of one's algorithm. So it is of theoretical interest to understand how this bias-variance tradeoff affects our methodology for choosing kernel parameters.
- Oligonucleotide microarrays are not the only way information about tumors is gathered: cDNA microarrays, northern blots, and nuclease protection assays can be used. An area of great practical importance is constructing classifiers and features robust enough that they can be learned on one type of technology and then transferred accurately to another. For example, suppose that after analyzing microarray data it is found that only three genes are relevant for a particular process a physician wants to screen. It is very reasonable to run northern blots for these three genes. How should the discriminant function estimated with the microarray data be mapped into a function to discriminate samples from a northern blot?
- In general, classification and gene selection are not the final goal of the cancer genomics problem. We really want to infer genetic networks from the data. In yeast, graphical model approaches are being developed (Hartemink et al., 2001) to analyze genomic expression data in a way that permits genetic regulatory networks to be represented in a biologically interpretable form. It would be of interest if methodologies such as ANOVA decompositions (Vapnik, 1998) could be used to generate kernels for SVMs that function as logical operations on genes. This would allow one to use a performance criterion as a model selection tool to infer which kernels best characterize a set of expression data. The different models would be different ANOVA decompositions, which correspond to different regulatory networks.
- Most classification methods used so far in analyzing microarray data have difficulties in extracting subtaxonomies in classes, especially in more difficult classification tasks such as treatment outcome prediction. A framework that allows the combination of classifiers in a variety of ways, from voting to trees, with each classifier built from different genes, can be thought of as a genetic regulatory network. An extension of one approach to combining classifiers (Niyogi et al., 2000) may lead to the development of an algorithm that allows for the learning of subtaxonomies.
- Recent work has related the stability of an algorithm to its generalization performance (Bousquet and Elisseeff, 2000). This work depends upon a concentration of measure inequality called McDiarmid's inequality (McDiarmid, 1989), which allows one to take either the worst case deviation or the expected deviation when a point is left out of an algorithm and use this in a Hoeffding-like inequality. This type of approach may be used to select how many genes are relevant.
Bibliography
[Aizerman et al., 1964a] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. The problem of pattern recognition learning and the method of potential functions. Avtomatika i Telemekhanika, 1964.
[Aizerman et al., 1964b] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Avtomatika i Telemekhanika, 1964.
[Bengio, 2000] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 2000.
[Bonnans and Shapiro, 2000] J.F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer-Verlag, 2000.
[Bousquet and Elisseeff, 2000] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In Neural Information Processing Systems, Denver, CO, 2000.
[Bradley and Mangasarian, 1998] P.S. Bradley and O.L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. International Conference on Machine Learning, San Francisco, CA, 1998.
[Brenner et al., 1961] S. Brenner, F. Jacob, and M. Meselson. An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 1961.
[Brown et al., 1999] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares Jr., and D. Haussler. Support vector machine classification of microarray gene expression data. Technical Report UCSC-CRL, Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA, 1999.
[Chapelle et al., 2001] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing many kernel parameters for support vector machines. Machine Learning, 2001.
[Cortes and Vapnik, 1995a] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 1995.
[Cortes and Vapnik, 1995b] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 1995.
[Courant and Hilbert, 1953] R. Courant and D. Hilbert. Methods of Mathematical Physics, Vol. 1. Interscience, London, England, 1953.
[Cristianini and Shawe-Taylor, 2000] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[Cristianini et al., 1998] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Dynamically adapting kernels in support vector machines. In Advances in Neural Information Processing Systems, 1998.
[DeRisi et al., 1996] J.L. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, and J.M. Trent. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics, 1996.
[Duda and Hart, 1973] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[Evgeniou et al., 1999] T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999.
[Evgeniou et al., 2000] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Proceedings of the Asian Conference on Computer Vision, 2000.
[Furey et al., 2000] T.S. Furey, N. Cristianini, N. Duffy, M. Schummer, D.W. Bednarski, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000.
[Gillespie and Spiegelman, 1965] D. Gillespie and S. Spiegelman. A quantitative assay for DNA-RNA hybrids with DNA immobilized on a membrane. J. Mol. Biol., 1965.
[Girosi and Poggio, 1991] F. Girosi and T. Poggio. Networks for learning: a view from the theory of approximation of functions. In P. Antognetti and V. Milutinovic, editors, Neural Networks: Concepts, Applications, and Implementations, Vol. I. Prentice Hall, Englewood Cliffs, New Jersey, 1991.
[Girosi, 1998] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 1998.
[Golub et al., 1999] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999.
[Guyon et al., 2002] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.
[Hanahan and Weinberg, 2000] D. Hanahan and R. Weinberg. The hallmarks of cancer. Cell, 2000.
[Hardy, 1999] R.L. Hardy. The chipping forecast. Nature Genetics, January 1999.
[Hartemink et al., 2001] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, and R.A. Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In Pacific Symposium on Biocomputing, 2001.
[Heisele et al., 2000] B. Heisele, T. Poggio, and M. Pontil. Face detection in still gray images. AI Memo, Massachusetts Institute of Technology, 2000.
[Jaakkola and Haussler, 1999] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proc. of Neural Information Processing Conference, 1999.
[Jebara and Jaakkola, 2000] T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, Stanford, CA, 2000.
[Joachims, 2000] T. Joachims. Estimating the generalization performance of a SVM efficiently. In International Conference on Machine Learning, 2000.
[Lachin, 2000] J.M. Lachin. Biostatistical Methods: The Assessment of Relative Risks. John Wiley and Sons, N.Y., 2000.
[Larsen et al., 1998] J. Larsen, C. Svarer, L.N. Andersen, and L.K. Hansen. Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade. Springer, 1998.
[Lockhart and Winzeler, 2000] D.J. Lockhart and E. Winzeler. Genomics, gene expression and DNA arrays. Nature, 2000.
[Lockhart et al., 1996] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E.L. Brown. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 1996.
[Luntz and Brailovsky, 1969] A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical pattern recognition. Technicheskaya Kibernetica, 1969.
[Lutkepohl, 1996] H. Lutkepohl. Handbook of Matrices. Wiley and Sons, 1996.
[McDiarmid, 1989] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, 1989.
[Mercer, 1909] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A, 1909.
[Mukherjee and Vapnik, 1999] S. Mukherjee and V. Vapnik. Multivariate density estimation: An SVM approach. AI Memo, Massachusetts Institute of Technology, 1999.
[Mukherjee et al., 1999] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J.P. Mesirov, and T. Poggio. Support vector machine classification of microarray data. AI Memo, Massachusetts Institute of Technology, 1999.
[Nirenberg and Leder, 1964] M.W. Nirenberg and P. Leder. The effect of trinucleotides upon the binding of sRNA to ribosomes. Science, 1964.
[Niyogi et al., 2000] P. Niyogi, J.B. Pierrot, and O. Siohan. Multiple classifiers by constrained minimization. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2000.
[Opper and Winther, 2000] M. Opper and O. Winther. Gaussian processes and SVM: Mean field and leave-one-out. In Advances in Large Margin Classifiers. MIT Press, 2000.
[Platt, 1999] J.C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
[Pomeroy et al., 2001] S. Pomeroy, P. Tamayo, L. Sturla, M. Angelo, M. McLaughlin, J. Kim, L. Goumnerova, P. Black, C. Lau, J. Allen, D. Zagzag, J. Olson, T. Curran, C. Wetmore, J. Biegel, T. Poggio, S. Mukherjee, A. Califano, G. Stolovitzky, D. Louis, J. Mesirov, E. Lander, and T. Golub. Gene expression based classification and outcome prediction of central nervous system embryonal tumors. Nature Medicine, 2001.
[Schena et al., 1995] M. Schena, D. Shalon, and P.O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 1995.
[Serre et al., 2000] T. Serre, B. Heisele, S. Mukherjee, and T. Poggio. Feature selection for face detection. AI Memo, Massachusetts Institute of Technology, 2000.
[Shalon et al., 1996] D. Shalon, S.J. Smith, and P.O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 1996.
[Shipp et al., 2001] M. Shipp, P. Tamayo, M. Gaasenbeek, M. Angelo, T. Ray, M. Reich, J. Mesirov, D. Neuberg, J. Aster, T. Poggio, S. Mukherjee, and T. Golub. Diffuse large B cell lymphoma outcome prediction by gene expression profiling. 2001.
[Slonim et al., 2000] D. Slonim, P. Tamayo, J.P. Mesirov, T. Golub, and E. Lander. Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual Conference on Computational Molecular Biology (RECOMB), 2000.
[Southern et al., 1999] E.M. Southern, K. Mir, and M. Shchepinov. Molecular interactions on microarrays. Nature Genetics Supplement, 1999.
[Southern, 1975] E.M. Southern. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol., 1975.
[Tikhonov and Arsenin, 1977] A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-posed Problems. W.H. Winston, Washington, D.C., 1977.
[Vapnik and Chapelle, 2000] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 2000.
[Vapnik and Mukherjee, 2000] V. Vapnik and S. Mukherjee. Multivariate density estimation: A support vector machine approach. In NIPS 12. Morgan Kaufmann Publishers, San Mateo, CA, 2000.
[Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[Vapnik, 1998] V.N. Vapnik. Statistical Learning Theory. J. Wiley, 1998.
[Wahba et al., 2000] G. Wahba, Y. Lin, and H. Zhang. Generalized approximate cross-validation for support vector machines: another way to look at margin-like quantities. In Advances in Large Margin Classifiers. MIT Press, 2000.
[Wahba, 1990] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, SIAM, Philadelphia, 1990.
[Weinberg and Varmus, 1993] R. Weinberg and H. Varmus. Genes and the Biology of Cancer. W.H. Freeman and Co., 1993.
[Weston et al., 2000] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for support vector machines. In Advances in Neural Information Processing Systems, 2000.