Centre for Integrative Bioinformatics, VU
elena@few.vu.nl

Lecture 6: Machine Learning
Bioinformatics Data Analysis and Tools
Supervised Learning

A supervisor labels observations from an unknown system with the property of interest; these labeled observations form the training dataset. The ML algorithm fits a model to this dataset, and the model then predicts the property for a new observation.

Classification
Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data
Data mining, Clustering
What is your question?

• What are the target genes of my knock-out gene? • Look for genes that have different time profiles between different cell types.
Gene discovery, differential expression

• Are the genes in a specified group all up-regulated in a specified condition?
Gene set, differential expression

• Can I use the expression profiles of cancer patients to predict survival? • Can I identify groups of genes that are predictive of a particular class of tumors?
Class prediction, classification

• Are there tumor sub-types not previously identified? • Are there groups of co-expressed genes?
Class discovery, clustering

• Detection of gene regulatory mechanisms. • Do my genes group into previously undiscovered pathways?
Clustering. Often expression data alone are not enough; sequence and other information need to be incorporated.
Basic principles of discrimination

Objects fall into predefined classes {1, 2, …, K}.

• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
• Aim: predict Y from X.

Example: an object with feature vector X = {colour, shape} = {red, square} has class label Y = 2. What classification rule assigns Y given X?
Discrimination and Prediction

Discrimination: a classification technique is applied to a learning set (data with known classes) to derive a classification rule.

Prediction: the classification rule then assigns classes to data with unknown classes.
Example: A Classification Problem

• Categorize images of fish, say "Atlantic salmon" vs. "Pacific salmon"
• Use features such as length, width, lightness, fin shape and number, mouth position, etc.
• Steps:
1. Preprocessing (e.g., background subtraction)
2. Feature extraction
3. Classification

(example from Duda & Hart)
Classification in Bioinformatics
• Computational diagnostics: early cancer detection

• Tumor biomarker discovery

• Protein folding prediction

• Protein-protein binding site prediction
• Gene function prediction
• …
Example: breast cancer prognosis. Predict whether a patient has a bad prognosis (recurrence < 5 yrs) or a good prognosis (recurrence > 5 yrs).

Reference: van 't Veer LJ et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530-536.

• Objects: arrays
• Feature vectors: gene expression
• Predefined classes: clinical outcome

A classification rule derived from the learning set assigns a new array to a class, e.g. good prognosis (metastasis > 5 yrs).
Classification Techniques
• K Nearest Neighbor classifier
• Support Vector Machines
• …
Instance-Based Learning

Key idea: just store all training examples <xi, f(xi)>.

Nearest neighbor:
• Given query instance xq, first locate the nearest training example xn, then estimate f(xq) = f(xn)

k-nearest neighbor:
• Given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued)
• Take the mean of the f values of the k nearest neighbors (if real-valued): f(xq) = (1/k) Σi=1..k f(xi)
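The two k-NN rules above (majority vote for discrete-valued targets, mean for real-valued targets) can be sketched in a few lines. This is an illustrative sketch, not code from the lecture; the function names and the choice of Euclidean distance are our own.

```python
from collections import Counter
import math

def knn_classify(train, xq, k=3):
    """Discrete-valued target: majority vote among the k nearest neighbors.
    `train` is a list of (feature_tuple, label) pairs."""
    # Sort training examples by Euclidean distance to the query point xq.
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], xq))[:k]
    # Take a vote among the k nearest labels.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_regress(train, xq, k=3):
    """Real-valued target: f(xq) = (1/k) * sum of the k nearest f values."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], xq))[:k]
    return sum(f for _, f in nearest) / k
```

Note that no model is built: all work happens at query time, which is exactly the "lazy learner" property discussed next.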
K-Nearest Neighbor
• A lazy learner …
• Issues:
– How many neighbors?
– What similarity measure?
Which similarity or dissimilarity measure?

• A metric is a measure of the similarity or dissimilarity between two data objects
• Two main classes of metric:
– Correlation coefficients (similarity): compare the shape of the expression curves. Types of correlation: centered, un-centered, rank correlation
– Distance metrics (dissimilarity): City Block (Manhattan) distance, Euclidean distance
• Pearson Correlation Coefficient (centered correlation):

r = (1/(n-1)) Σi=1..n ((xi - x̄)/Sx) ((yi - ȳ)/Sy)

where x̄ and ȳ are the means, and Sx and Sy the standard deviations, of x and y.

Correlation is a measure between -1 (negative correlation) and 1 (positive correlation). You can use the absolute correlation to capture both positive and negative correlation.
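As a concrete check of the formula above, here is a minimal pure-Python implementation of the centered (Pearson) correlation; the function name is our own, not from the slides.

```python
import math

def pearson(x, y):
    """Centered (Pearson) correlation of two expression profiles:
    r = (1/(n-1)) * sum(((x_i - mean_x)/S_x) * ((y_i - mean_y)/S_y)),
    where S_x, S_y are sample standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)
```

Two profiles rising together give r near 1; one rising while the other falls gives r near -1.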
Potential pitfalls

Correlation = 1 does not mean the profiles are identical: correlation compares only the shape of the curves, not their magnitude.
Distance metrics

• City Block (Manhattan) distance: d(X, Y) = Σi |xi - yi|
– Sum of absolute differences across dimensions
– Less sensitive to outliers
– Diamond-shaped clusters

• Euclidean distance: d(X, Y) = √(Σi (xi - yi)²)
– Most commonly used distance
– Sphere-shaped clusters
– Corresponds to the geometric distance in the multidimensional space

where gene X = (x1, …, xn) and gene Y = (y1, …, yn).
(Figures: genes X and Y plotted against Condition 1 and Condition 2 for each metric.)
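Under the definitions above (gene X = (x1, …, xn), gene Y = (y1, …, yn)), both distances can be written directly; a small sketch with function names of our own:

```python
import math

def manhattan(x, y):
    """City Block (Manhattan) distance: sum of absolute differences
    across dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: geometric distance in the multidimensional
    space of conditions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, for X = (0, 0) and Y = (3, 4) the Manhattan distance is 7 while the Euclidean distance is 5.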
Euclidean vs Correlation (I)
• Euclidean distance
• Correlation
When to Consider Nearest Neighbors

• Instances map to points in R^N
• Fewer than 20 attributes per instance
• Lots of training data

Advantages:
• Training is very fast
• Can learn complex target functions
• Does not lose information

Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes
Voronoi Diagram
query point qf
nearest neighbor qi
3-Nearest Neighbors
query point qf
3 nearest neighbors
2x,1o
7-Nearest Neighbors
query point qf
7 nearest neighbors
3x,4o
Nearest Neighbor (continuous)
1-nearest neighbor
Nearest Neighbor (continuous)
3-nearest neighbor
Nearest Neighbor (continuous)
5-nearest neighbor
Nearest Neighbor
• Approximates the target function f(x) only locally, at the single query point x = xq
• Locally weighted regression is a generalization of instance-based learning (IBL)
Curse of Dimensionality

Imagine instances described by 20 attributes, but only 10 are relevant to the target function.

Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional.

One approach: weight the features according to their relevance!
• Stretch the j-th axis by weight zj, where z1, …, zn are chosen to minimize prediction error
• Use cross-validation to automatically choose the weights z1, …, zn
• Note that setting zj to zero eliminates this dimension altogether (feature subset selection)
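The axis-stretching idea can be made concrete: multiply each coordinate difference by its weight zj before squaring, so zj = 0 drops feature j entirely. A minimal sketch (names are ours; choosing the zj by cross-validation is left out):

```python
import math

def weighted_euclidean(x, y, z):
    """Euclidean distance after stretching axis j by weight z[j].
    Setting z[j] = 0 eliminates dimension j (feature subset selection)."""
    return math.sqrt(sum((zj * (a - b)) ** 2 for a, b, zj in zip(x, y, z)))
```

With z = (1, 1) this is the plain Euclidean distance; with z = (1, 0) only the first, relevant feature contributes.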
Practical implementations

• Weka: the IBk classifier
• TiMBL: an optimized memory-based learner
Example: Tumor Classification
• Reliable and precise classification essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarray)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
Tumor Classification Using Gene Expression Data
Three main types of ML problems associated with tumor classification:
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
• Classification of malignancies into known classes (supervised learning – discrimination)
• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).
Example: leukemia subtypes (B-ALL, T-ALL, AML)

Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

• Objects: arrays
• Feature vectors: gene expression
• Predefined classes: tumor type

A classification rule derived from the learning set assigns a new array to a class, e.g. T-ALL.
Nearest neighbor rule
SVM

• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
• The most popular optimization algorithms for SVMs are SMO [Platt '99] and SVMlight [Joachims '99]; both use decomposition to hill-climb over a subset of the αi's at a time.
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.
SVM
• To discriminate between two classes, given a training dataset:
– Map the data to a higher-dimensional space (feature space)
– Separate the two classes using an optimal linear separator
Feature Space Mapping

• Map the original data to some higher-dimensional feature space where the training set is linearly separable:

Φ: x → φ(x)
The "Kernel Trick"

• The linear classifier relies on the inner product between vectors: K(xi, xj) = xi^T xj
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)^T φ(xj)
• A kernel function is a function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xi^T xj)².

Need to show that K(xi, xj) = φ(xi)^T φ(xj):

K(xi, xj) = (1 + xi^T xj)²
= 1 + xi1² xj1² + 2 xi1 xj1 xi2 xj2 + xi2² xj2² + 2 xi1 xj1 + 2 xi2 xj2
= [1, xi1², √2 xi1 xi2, xi2², √2 xi1, √2 xi2]^T [1, xj1², √2 xj1 xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)^T φ(xj), where φ(x) = [1, x1², √2 x1 x2, x2², √2 x1, √2 x2]
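The identity derived above is easy to check numerically: the kernel value (1 + xi^T xj)² and the inner product of the explicit feature maps agree to floating-point precision, even though the kernel never constructs the 6-dimensional vectors. A small sketch (function names are ours):

```python
import math

def poly_kernel(x, y):
    """K(x, y) = (1 + x^T y)^2 for 2-dimensional vectors x and y."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2)
    return [1.0, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

This is the point of the trick: the classifier only ever needs K(xi, xj), so the feature space can be huge (or infinite-dimensional) at no extra cost.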
Linear Separators
Optimal hyperplane

The optimal hyper-plane separates the two classes with maximal margin ρ. The training points lying on the margin are the support vectors; the support vectors uniquely characterize the optimal hyper-plane.
Optimal hyperplane: geometric view

w · xi + b ≥ +1 for yi = +1
w · xi + b ≤ -1 for yi = -1
Soft Margin Classification

• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
Weakening the constraints

• Allow objects to violate the constraints to some degree
• Introduce 'slack' variables ξi
Influence of C

The parameter C penalizes the slack variables, trading margin width against training errors. Even with slack, erroneous objects can still have a (large) influence on the solution.
SVM

• Advantages:
– maximizes the margin between the two classes in the feature space characterized by a kernel function
– robust with respect to high input dimension
• Disadvantages:
– difficult to incorporate background knowledge
– sensitive to outliers
SVM and outliers
Classifying new examples

• Given a new point x, its class membership is sign[f(x, α*, b*)], where

f(x, α*, b*) = w* · x + b* = Σ_{i ∈ SV} yi αi* xi^T x + b*

The data enter only in the form of dot products! In general,

f(x, α*, b*) = Σ_{i ∈ SV} yi αi* K(xi, x) + b*

where K is the kernel function.
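Once the support vectors, multipliers αi* and offset b* are known, the decision function above is a one-liner. This toy sketch uses hand-picked values rather than a trained model, and all names are our own:

```python
def svm_decision(support, x, kernel, b):
    """f(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b.
    `support` is a list of (x_i, y_i, alpha_i) triples; the predicted
    class is the sign of the returned value."""
    return sum(a * y * kernel(xi, x) for xi, y, a in support) + b

def linear_kernel(u, v):
    """K(u, v) = u^T v, the plain dot product."""
    return sum(a * b for a, b in zip(u, v))
```

Swapping `linear_kernel` for any other kernel (e.g. a polynomial kernel) changes the classifier without touching `svm_decision`: the data enter only through K.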
Classification: CV error

• Training error: the empirical error on the training data
• Error on an independent test set: test error
• Cross-validation (CV) error:
– Leave-one-out (LOO)
– n-fold CV

n-fold CV: the samples are split into n folds; each fold (1/n of the samples) is held out for testing while the remaining (n-1)/n are used for training. Errors are counted on each held-out fold and summarized as the CV error rate.
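The leave-one-out variant can be sketched directly: hold out each sample in turn, train on the rest, and count misclassifications. Here it is paired with a 1-nearest-neighbor rule as the classifier; the helper names are ours, not from the slides:

```python
import math

def nearest_neighbor(train, xq):
    """1-NN rule: return the label of the closest training example."""
    return min(train, key=lambda xy: math.dist(xy[0], xq))[1]

def loo_cv_error(data, classify):
    """Leave-one-out CV: hold out each (x, y) in turn, classify x using
    the remaining samples, and return the fraction misclassified."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # all samples except the i-th
        if classify(train, x) != y:
            errors += 1
    return errors / len(data)
```

Because no parameters are estimated here, 1-NN is cheap to "retrain" on each fold; for costlier classifiers, n-fold CV with n well below the sample count is the usual compromise.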