TRANSCRIPT
Center for Genes, Environment, and Health
Machine Learning CPBS7711
Oct 1, 2013
Sonia Leach, PhD Assistant Professor
Center for Genes, Environment, and Health
National Jewish Health
Someone once said
“Artificial Intelligence = Search”
so Machine Learning = “Induction of new knowledge from experience, and the ability to improve”?
Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics.
We might say the defining question of Computer Science is “How can we build machines that
solve problems, and which problems are inherently tractable/intractable?”
The question that largely defines Statistics is “What can be inferred from data plus a set of
modeling assumptions, with what reliability?”
The defining question for Machine Learning builds on both, but it is a distinct question.
Whereas Computer Science has focused primarily on how to manually program computers,
Machine Learning focuses on the question of how to get computers to program themselves
(from experience plus some initial structure).
Whereas Statistics has focused primarily on what conclusions can be inferred from data,
Machine Learning incorporates additional questions about what computational architectures
and algorithms can be used to most effectively capture, store, index, retrieve and merge these
data, how multiple learning subtasks can be orchestrated in a larger system, and questions of
computational tractability.
We say that a machine learns with respect to a particular task T, performance metric P, and type
of experience E, if the system reliably improves its performance P at task T, following
experience E.
- Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
Also an interesting discussion of the differences among AI, ML, Data Mining, and Stats:
http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai
Machine Learning
• From Wikipedia:
– Decision tree learning
– Association rule learning
– Artificial neural networks
– Inductive logic programming
– Support vector machines
– Clustering
– Bayesian networks
– Reinforcement learning
– Representation learning
– Similarity and metric learning
– Sparse dictionary learning
• From Alpaydin, Intro to Mach Learn:
– Supervised Learning
– Bayesian Decision Theory
– Parametric Methods
– Multivariate Methods
– Dimensionality Reduction
– Clustering
– Nonparametric Methods
– Decision Trees
– Linear Discrimination
– Multilayer Perceptrons
– Local Models
– Kernel Machines
– Bayesian Estimation
– Hidden Markov Models
– Graphical Models
– Combining Multiple Learners
– Reinforcement Learning
http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf
Machine Learning (what I will cover)
• Unsupervised
– Dimensionality Reduction
• PCA
– Clustering
• k-Means, SOM, Hierarchical
– Association Set Mining
– Probabilistic Graphical Models
• HMMs, Bayes Nets
• Supervised
– k-Nearest Neighbor
– Neural Nets
– Decision Trees/Random Forests
– SVMs
– Naïve Bayes
• Issues
– Regression/Classification
– Feature selection/reduction
– Missing data
– Boosting/bagging/jackknife
– Cross validation, generalization
– Model selection
Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) …]
R: http://cran.r-project.org/web/views/MachineLearning.html
Dimensionality Reduction: Principal Components Analysis (PCA)
• Motivation: Instead of considering all variables, use a small number of linear combinations of those variables, with minimum information lost
http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/
[Figure: 2D toy data. If only one of the two variables could be chosen to represent the data, the y-axis would be chosen, since it explains more variance; but the amount of variance explained by the first principal component P1 is greater than that explained by either original axis.]
Principal Components Analysis (PCA)
• If X = (x1, x2, …, xp) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is
X → Y = Γᵀ(X − μ)
s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
– Linear orthogonal transform of original data to a new coordinate system
– each component is a linear combination of the original variables
• coefficients of the variables in the linear combo = Loadings
• data transformed to the new coords = Scores
– components ordered by percentage of variance explained along the new axis
– number of components = minimum dimension of the input data matrix
– set of orthogonal vectors not unique, not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above & R princomp) or singular value decomposition (SVD) (R prcomp)
Adapted from S-plus Guide to Statistics
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X − μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
[1,] -2.292745 5.827588 8.966977 -7.1630488 -2.2195936
[2,] 25.846460 13.457048 -3.257987 0.5344066 0.4777994
[3,] -14.856875 4.337867 -4.057297 -2.5308172 1.4998247
[4,] 70.434116 -3.286077 6.423473 3.9571310 0.8815369
[5,] 13.768664 -4.392701 -6.058773 -4.7551497 -2.2951908
[6,] -28.899236 -4.611347 4.338621 -2.2710490 6.7118075
[7,] 5.216449 -4.536616 -7.625423 2.2093319 3.2618335
[8,] -3.432334 -11.115805 -3.553422 -0.9908949 -4.1604420
[9,] -31.579207 8.354892 -2.497369 5.6986938 -1.9742069
[10,] -34.205292 -4.034848 7.321199 5.3113963 -2.1833687
Y(scores)
diffgeom complex algebra reals stats
1 36 58 43 36 37
2 62 54 50 46 52
3 31 42 41 40 29
4 76 78 69 66 81
5 46 56 52 56 40
6 12 42 38 38 28
7 39 46 51 54 41
8 30 51 54 52 32
9 22 32 43 28 22
10 9 40 47 30 24
X
Component Importance: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 30.142 7.179 5.786 4.098 3.084
Proportion of Variance 0.890 0.050 0.032 0.016 0.009
Cumulative Proportion 0.890 0.941 0.974 0.990 1.000
Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
diffgeom 0.638 0.599 -0.407 -0.112 -0.237
complex 0.372 -0.230 0.593 -0.595 -0.320
algebra 0.240 -0.371 0.645 -0.624
reals 0.333 -0.671 -0.557 -0.234 0.271
statistics 0.535 0.414 0.404 0.615
(loadings)
EXAMPLE IN R
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center              # mean vector mu
Gamma = pc$loadings         # loadings matrix Gamma
Y = pc$scores               # scores Y
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2       # ~ lambda_i
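A quick aside (not on the original slide): prcomp computes the same transformation via SVD and is the numerically preferred R function; a minimal sketch, assuming the same X as above:

## prcomp() uses SVD of the centered data matrix rather than
## eigendecomposition of the covariance matrix; scores and loadings
## agree with princomp() up to column signs (sdev differs by a factor
## sqrt((n-1)/n), since princomp divides by n and prcomp by n-1)
pc2 = prcomp(X)
head(pc2$x)          # scores, compare to pc$scores
pc2$rotation         # loadings, compare to pc$loadings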
Principal Components Analysis (PCA)
Adapted from S-plus Guide to Statistics
## Verify Y = (X - mu) * Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)   # ~0 up to rounding
## Verify X represented by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s")
biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)
Biplot arrows for the original variables:
length = proportion of variance explained in the 2 plotted components
direction = relative loadings on the 2 components
ex) diffgeom largest (+,+), algebra smallest (−,+)
Clustering
• Partitioning
– Must specify number of clusters
– K-Means, Self-Organizing Maps (SOM/Kohonen Net)
• (Agglomerative) Hierarchical Clustering
– Do not need to specify number of clusters
– Need to specify distance metric and linkage method
• Other approaches
– Fuzzy clustering (probabilistic membership)
– Spectral Clustering (using eigenvalue decomposition)
Clustering
http://apandre.wordpress.com/visible-data/cluster-analysis/
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
k-Means
• Initialize: Select the initial k Centroids
– REPEAT
• Form k clusters by assigning all points to the ‘closest’ Centroid
• Recompute the Centroid for each cluster
– UNTIL the Centroids don’t change or all changes are below a predefined threshold
• Initial Centroids can be random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random 1st assignment
• ‘closest’ typically defined by Euclidean distance (Voronoi diagram)
• Prone to local optima, so typically do N random restarts and take the best (min sum of squared distE to Centroids); a sketch follows below
• In practice, favors separated spherical clusters
distE(x, y) = distE(y, x) = ( Σi=1..n (xi − yi)² )^(1/2)
Images from wikipedia
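A minimal k-means sketch with base R kmeans(); the toy data and parameter choices are illustrative assumptions, not from the slide:

## two Gaussian blobs; nstart = 25 does 25 random restarts and keeps
## the solution with the minimum total within-cluster SSE
set.seed(1)
X = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
          matrix(rnorm(100, mean = 4), ncol = 2))
km = kmeans(X, centers = 2, nstart = 25)
km$centers          # final Centroids
km$tot.withinss     # sum of squared distE to Centroids
plot(X, col = km$cluster); points(km$centers, pch = 8, cex = 2)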
k-Means
http://en.wikipedia.org/wiki/K-means_clustering
[Figure: k-Means iterations 0 through 5 on example data, centroids converging]
Images from wikipedia
Self-Organizing Maps (SOM)
• Similar to k-Means: the goal is to assign each data point to the map node (cf. Centroid in k-Means) whose weight vector is ‘closest’ to the data-space vector (minimize distE(x, w))
• Difference: map nodes constrained by neighborhood
relationships, whereas k-Means Centroids freely move
• Must input initial topology, map ‘stretches’ to cover nD
data in 2D, similar data assigned to map neighbors
Image from wikipedia
Self-Organizing Maps (SOM)
• 1. Initialization – Choose random
values for initial weight vectors wj.
• 2. Sampling – Draw a sample
training input vector x from the
input space.
• 3. Matching – Find the winning
neuron I(x) with weight vector
closest to input vector (i.e.,min distE)
• 4. Updating – Apply the weight
update equation
Δwji = η(t) · Tj,I(x)(t) · (xi − wji)
where η(t) = learning rate @ time t*
Tj,I(x)(t) = neighborhood function @ time t
• 5. Continuation – keep returning
to step 2 until the feature map stops
changing.
http://www.sciencedirect.com/science/article/pii/S0014579399005244 * Informal intro to simulated annealing, gradient descent…
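A minimal SOM sketch with the kohonen package (an assumption; the slides name kohonen only later under Examples in R); toy data assumed:

## a 4x4 hexagonal grid 'stretches' over the 2D data; similar points
## end up assigned to neighboring map nodes
library(kohonen)
set.seed(1)
X = matrix(rnorm(400), ncol = 2)
sm = som(scale(X), grid = somgrid(xdim = 4, ydim = 4, topo = "hexagonal"))
plot(sm, type = "mapping")   # data-to-node assignments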
Self-Organizing Maps (SOM)
http://www.sciencedirect.com/science/article/pii/S0014579399005244
Hierarchical Clustering
• Divisive – (top down) start with all points in 1 cluster, successively subdivide until the full tree is built
• Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge the ‘closest’ pair of clusters at each step until the root
– Requires a metric to define ‘closest’ – distance is no longer between points, but between clusters
– The linkage strategy determines which pair to merge, often based on pairwise point comparisons
• Dendrogram shows order of splits
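A minimal agglomerative sketch with base R hclust(); the data are illustrative assumptions:

## Euclidean distance + complete linkage; the dendrogram shows the
## order of merges, and cutree() extracts a flat clustering
set.seed(1)
X = matrix(rnorm(60), ncol = 2)
hc = hclust(dist(X, method = "euclidean"), method = "complete")
plot(hc)             # dendrogram
cutree(hc, k = 3)    # cut into 3 clusters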
Distance Metrics
• Euclidean
– distance in Euclidean space
• Pearson Correlation
– linear relationships
• Spearman Correlation
– monotonic relationships
• Mutual Information
– non-linear relationships
• Polyserial Correlation
– correlation continuous vs ordinal (polychoric if ordinal vs ordinal)
• Hamming Distance, Jaccard, Dice (binary variables)
distE(x, y) = ( Σi=1..n (xi − yi)² )^(1/2)
distP(x, y) = 1 − [ Σi=1..n (xi − x̄)(yi − ȳ) ] / [ √(Σi=1..n (xi − x̄)²) · √(Σi=1..n (yi − ȳ)²) ]
distS(rx, ry) = 1 − [ Σi=1..n (rxi − r̄x)(ryi − r̄y) ] / [ √(Σi=1..n (rxi − r̄x)²) · √(Σi=1..n (ryi − r̄y)²) ], where rz = rank(z)
distMI(x, y) = H(x, y) − MI(x, y), where
H(x) = −Σx px log px, H(x, y) = −Σx,y px,y log px,y, and MI(x, y) = H(x) + H(y) − H(x, y)
distH = M01 + M10 (Hamming)
distJ = (M01 + M10) / (M01 + M10 + M11) (Jaccard; good when 0 gives no info)
distD = 1 − 2·|X ∩ Y| / (|X| + |Y|) (Dice; like Jaccard but 2*Matches)
where Mab = # positions with x = a and y = b
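A hedged sketch of these metrics in base R (toy vectors assumed):

x = c(1, 2, 3, 4, 5); y = c(2, 4, 5, 4, 6)
sqrt(sum((x - y)^2))                    # Euclidean
1 - cor(x, y, method = "pearson")       # Pearson distance
1 - cor(x, y, method = "spearman")      # Spearman distance
a = c(1, 0, 1, 1); b = c(1, 1, 0, 1)    # binary vectors
sum(a != b)                             # Hamming: M01 + M10
sum(a != b) / sum(a | b)                # Jaccard distance
1 - 2 * sum(a & b) / (sum(a) + sum(b))  # Dice distance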
Distance Metrics
• Euclidean vs Pearson (linear) vs Spearman (monotonic)
Numbers are Pearson correlation
Note Pearson invariant to slope
Pearson=0 if non-linear
[Figure: example variable pairs annotated with Pearson, Spearman, and Euclidean distance values]
Linkage Methods
• Single Linkage
argmin over S,T of min s∈S, t∈T dist(s, t)
• Complete Linkage
argmin over S,T of max s∈S, t∈T dist(s, t)
• Average Linkage (a.k.a. group average)
argmin over S,T of average s∈S, t∈T dist(s, t)
• Centroid Linkage – min dist(centroid(S), centroid(T)) (people erring after the Eisen et al 1998 TreeView paper think this = Average Linkage!)
• Ward’s Linkage (optimizes the same criterion as k-Means)
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from Lozupone lecture – assumes constant rate of evolution; average linkage with Euclidean distance
Choosing the Number of Clusters
• Rule of thumb: k ≈ √(n/2)
• Elbow or knee method (bend in plot of metric)
• K-means likes spherical clusters, so minimize within-cluster variation W(K) (SSE, sum of distances of all points to their cluster mean), or maximize between-cluster variation B(K) (distance between clusters), or both: CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)]*
• Gap Statistic
– Calculate SSE; randomize the dataset and calculate SSErand, n times; gap = log(mean SSErand / SSE)
• Hierarchical – plot the distance chosen at each merge (okay for single, complete)
See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for
long list of indices, NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and
http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
[Figure: W(K), B(K), CH(K), and Gap(K) plotted against the number of clusters K]
*Calinski & Harabasz 1974
Association Set Mining
• Also known as Market Basket Analysis
{milk, eggs} ⇒ {butter}
• Support of itemset X: supp(X) = # transactions containing itemset X
• Confidence of rule: conf(X ⇒ Y) = supp(X & Y) / supp(X)
• Lift of rule (performance over assuming independence):
lift(X ⇒ Y) = supp(X & Y) / (supp(X) * supp(Y))
• Want rules with max supp, conf, lift
• Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html
Association Set Mining
• Tables of data are converted to transactions by creating binary variables for all categories of all variables (continuous variables must be discretized; missing data okay)
ID     Gender  Age  Height (inches)  Race       Diagnosis
CC245  Male     6   25               Caucasian  Depression
CC346  Male    75   60               African    COPD
CC978          30   54               Asian      Obesity
CC125  Female  15   54               African
{ {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y},
{gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y},
{age_adult=Y, height_50-59=Y,race_AS=Y, diag_obes=Y},
{gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} }
Association Set Mining
Example in R: arules pkg, apriori algorithm
lhs rhs support confidence lift
1 {Class=2nd,
Age=Child} => {Survived=Yes} 0.011 1.000 3.097
2 {Class=2nd,
Sex=Female,
Age=Child} => {Survived=Yes} 0.006 1.000 3.096
3 {Class=1st,
Sex=Female} => {Survived=Yes} 0.064 0.972 3.010
4 {Class=1st,
Sex=Female,
Age=Adult} => {Survived=Yes} 0.064 0.972 3.010
… 12 {Sex=Female,
Survived=Yes} => {Age=Adult} 0.143 0.918 0.966
27 {Class=2nd} => {Age=Adult} 0.118 0.915 0.963
Note that rule 2 is subsumed by rule 1, which has better lift (and support) – redundant rules can be removed
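A minimal runnable sketch with arules; the slide's rules come from the Titanic data, but the Groceries transactions bundled with arules are assumed here for a self-contained example:

library(arules)
data("Groceries")
rules = apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))   # top rules by lift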
Probabilistic Graphical Models
[Figure: 2×2 taxonomy of dynamic probabilistic graphical models. A Markov Process (MP) chains states Xt−1 → Xt. Adding observability gives the Hidden Markov Model (HMM), with observations Ot emitted from hidden states. Adding utility gives the Markov Decision Process (MDP), with actions At and utilities Ut. Adding observability and utility gives the Partially Observable Markov Decision Process (POMDP).]
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
– Initial state distribution πi = Pr(X1 = i)
– Transition probability aij = Pr(Xt = j | Xt−1 = i)
– Emission probability bik = Pr(Ot = k | Xt = i)
• Given an observation sequence O = O1,O2,…,OT, how do we compute Pr(O | λ)?
Hidden Markov Model (HMM)
[Figure: 3-state example with N = 3, M = 2, π = (0.25, 0.55, 0.2), a 3×3 transition matrix A, and a 3×2 emission matrix B]
• Probability of O is the sum over all state sequences:
Pr(O|λ) = Σall X Pr(O|X, λ) Pr(X|λ)
= Σall X πx1 bx1,o1 ax1,x2 bx2,o2 … axT−1,xT bxT,oT
• What is the computational complexity of this sum?
• At each t there are N states to reach, so N^T possible state sequences and 2T multiplications per sequence, meaning O(2T·N^T) operations
• So 3 states, length-10 sequence = 1,180,980 operations, and length 20 ≈ 1e11!
• Efficient dynamic programming algorithm to do this: the Forward algorithm (Baum and Welch, O(N²T))
The Forward Algorithm
Probability of a sequence is the sum of all paths that can produce it
Example (CpG islands): two states, CpG and Non-CpG, with transitions CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→Non-CpG 0.9, Non-CpG→CpG 0.1, and emissions:
CpG: G .3, C .3, A .2, T .2
Non-CpG: G .1, C .1, A .4, T .4
Forward trellis for the observation sequence G C G A A:
G: α(CpG) = .3, α(Non-CpG) = .1
C: α(CpG) = .3·(.3·.8 + .1·.1) = .075, α(Non-CpG) = .1·(.3·.2 + .1·.9) = .015
G: α(CpG) = .3·(.075·.8 + .015·.1) = .0185, α(Non-CpG) = .1·(.075·.2 + .015·.9) = .0029
A: α(CpG) = .2·(.0185·.8 + .0029·.1) = .003, α(Non-CpG) = .4·(.0185·.2 + .0029·.9) = .0025
A: α(CpG) = .2·(.003·.8 + .0025·.1) = .0005, α(Non-CpG) = .4·(.003·.2 + .0025·.9) = .0011
David Pollock’s Lecture
Parameter estimation by Baum-Welch (Forward-Backward Algorithm):
Forward variable αt(i) = Pr(O1..t, Xt = i | λ)
Backward variable βt(i) = Pr(Ot+1..T | Xt = i, λ)
DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdf
and erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf
Forward Algorithm
• Dynamic programming method to compute the forward variable αt(i) = Pr(O1..t, Xt = i | λ)
• Base condition: for 1 ≤ i ≤ N
α1(i) = πi bi,o1
• Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1
αt+1(j) = [ Σi=1..N αt(i) aij ] bj,ot+1
• Then the probability of the sequence is
Pr(O | λ) = Σi=1..N αT(i)
*Backward algorithm
for βt(i) is analogous
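A sketch of the Forward algorithm in R on the CpG example above (state 1 = CpG, 2 = Non-CpG; observations coded G=1, C=2, A=3, T=4). To reproduce the trellis numbers, an all-ones initial vector is assumed in place of a proper π, so the first column equals the emission probabilities:

forward = function(pi0, A, B, obs) {
  N = length(pi0); Tn = length(obs)
  alpha = matrix(0, N, Tn)
  alpha[, 1] = pi0 * B[, obs[1]]                 # base condition
  for (t in 1:(Tn - 1))                          # recurrence, O(N^2 T)
    alpha[, t + 1] = (t(A) %*% alpha[, t]) * B[, obs[t + 1]]
  list(alpha = alpha, prob = sum(alpha[, Tn]))   # Pr(O | lambda)
}
A = matrix(c(0.8, 0.2,                           # CpG -> CpG, Non-CpG
             0.1, 0.9), 2, 2, byrow = TRUE)      # Non-CpG -> CpG, Non-CpG
B = matrix(c(0.3, 0.3, 0.2, 0.2,                 # CpG emissions: G C A T
             0.1, 0.1, 0.4, 0.4), 2, 4, byrow = TRUE)
fw = forward(c(1, 1), A, B, obs = c(1, 2, 1, 3, 3))  # G C G A A
round(fw$alpha, 4)   # columns: (.3,.1), (.075,.015), (.0185,.0029), ...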
Applications in Bioinformatics
• DNA – motif matching, gene matching,
multiple sequence alignment
• Amino Acids – domain matching, fold
recognition
• Microarrays/Whole Genome Sequencing –
assign copy number
• ChIP-chip/seq – distinct chromatin states
Bayesian Networks
• Given set of random variables,
the joint probability distribution
can be represented by:
– Structure: Directed Acyclic Graph
(DAG)
• variables are nodes, absence of arcs
captures conditional independencies
– Parameters: Local Conditional
Probability Distributions (CPDs)
• conditional probability of variable given
values of parents in graph
• Joint Probability factors into
product of local CPDs:
Pr(X1, X2, …, Xn) = Πi=1..n Pr(Xi | Parents(Xi))
Bayesian Networks
• Generally can think of directed arcs as
‘causal’ (be careful!)
– If the sprinkler is on OR it is raining, then the
grass will be wet: Pr(W|S,R)
• If observe wet grass, can determine
whether because of sprinkler or rain
– Pr(R|W) and Pr(S|W)
– Bayes rule: Pr(X|Y) = Pr(Y|X) Pr(X) / Pr(Y)
• Note S and R compete to explain W: this
model says sprinkler usage is independent
of rain, but if know the grass is wet, and it is
raining, then it is less likely that the
sprinkler being on is the explanation for W
– Pr(S|W,R) < Pr(S|W) “explaining away”
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
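A small enumeration sketch of explaining away; the CPD numbers below are illustrative assumptions (in the spirit of Murphy's tutorial), not values from the slide:

pS = 0.1; pR = 0.2                      # assumed priors Pr(S), Pr(R)
pW = function(s, r)                     # assumed CPD Pr(W=1 | S=s, R=r)
  if (s && r) 0.99 else if (s || r) 0.9 else 0.0
joint = function(s, r)                  # Pr(S=s, R=r, W=1)
  (if (s) pS else 1 - pS) * (if (r) pR else 1 - pR) * pW(s, r)
pW1   = joint(0,0) + joint(0,1) + joint(1,0) + joint(1,1)  # Pr(W=1)
pS_W  = (joint(1,0) + joint(1,1)) / pW1                    # Pr(S=1 | W=1)
pS_WR = joint(1,1) / (joint(0,1) + joint(1,1))             # Pr(S=1 | W=1, R=1)
c(pS_W, pS_WR)   # ~0.36 vs ~0.11: rain 'explains away' the sprinkler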
Applications in Bioinformatics
Gene regulatory networks (Friedman et al, 2000, PMID: 11108481)
Predicting clinical outcomes
using expression data (Gevaert et al, 2006, PMID: 16873470)
Determining Regulators with PRMS (Segal et al, 2002, RECOMB)
Gene Function Prediction (Troyanskaya et al, 2003, PMID: 12826619 )
Hanalyzer – edge scores (Leach et al, 2009, PMID: 19325874)
Supervised Learning
• Given examples (x,y) of input features x and
output variable y, learn function f(x)=y
– Regression (continuous response) vs Classification
(discrete response)
– Feature selection vs Feature Reduction
– Cross validation (Leave-One-Out vs N-Fold)
– Generalization (Training set error vs Test set error)
– Model Selection (AIC, BIC)
– Boosting/bagging/jackknife
– Missing data and Imputation
– Curse of dimensionality
Supervised Learning
• Boosting (weak learners on different subsets)
– Train H1 on a random data split; sample among H1’s predictions so the next data set, used to train H2, is half wrong, half right under H1. Train H3 on examples where both H1 and H2 are wrong. Return the majority vote of H1, H2, H3 (Adaboost weights examples, weighted vote)
• Bagging (bootstrap aggregating)
– Train multiple models on random with-replacement (bootstrap) splits of the input data, average their predictions
• Jackknife (vs bootstrap) – disjoint subsets of data
• Model Selection: balance goodness of fit (likelihood L) with complexity of the model (number of parameters k) for n samples
– Bayesian information criterion (BIC): minimize k ln(n) − 2 ln(L)
– Akaike information criterion (AIC): minimize 2k − 2 ln(L) (weaker penalty, better-developed theory than BIC)
• Curse of dimensionality – the greater the dimension D, the sparser the data samples are in covering the space, so more and more data are needed to learn properly
Decision Boundaries
https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
k-Nearest Neighbors
• Store a database of (x, y) pairs; classify a new example by majority vote of its k nearest neighbors (regression if assigning the (weighted) mean y in the neighborhood)
• No training needed, non-parametric, sensitive to local structure in the data; the most frequent class tends to dominate
• Curse of dimensionality: with many variables, any query is equidistant to all points – reduce features by PCA
• Allows complicated boundaries between classes
(Example from the figure: if k = 3, the green query point is classified red; if k = 5, blue)
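A minimal kNN sketch with class::knn (the package named on the Examples in R slide); toy data assumed:

library(class)
set.seed(1)
train = rbind(matrix(rnorm(40, 0), ncol = 2),
              matrix(rnorm(40, 3), ncol = 2))
cl   = factor(rep(c("red", "blue"), each = 20))
test = matrix(c(1.5, 1.5), ncol = 2)   # one query point
knn(train, test, cl, k = 3)    # majority vote of 3 nearest neighbors
knn(train, test, cl, k = 5)    # the vote can flip as k grows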
Neural Network: Linear Perceptron
• Learning: Initialize wt, choose learning rate η
• 1) Calculate prediction y*j,t = f[wt · xj]
• 2) Update weights wt+1 = wt + η (yj − y*j,t) xj
• Repeat 1 & 2 until (yj − y*j,t) < threshold
– Can be generalized to multi-class
– Optimal only if data are linearly separable
[Figure: linearly separable vs non-separable data; step activation function]
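A from-scratch sketch of the perceptron rule above (step activation; the data and learning rate are illustrative assumptions):

set.seed(1)
X = cbind(1, rbind(matrix(rnorm(40, 0), ncol = 2),   # bias column + 2 features
                   matrix(rnorm(40, 3), ncol = 2)))
y = rep(c(0, 1), each = 20)
f = function(z) as.numeric(z > 0)          # step activation
w = rep(0, 3); eta = 0.1                   # weights, learning rate
for (epoch in 1:100) {
  for (j in sample(nrow(X))) {
    ystar = f(sum(w * X[j, ]))             # 1) prediction
    w = w + eta * (y[j] - ystar) * X[j, ]  # 2) update
  }
}
mean(f(X %*% w) == y)    # training accuracy (1 if linearly separable)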
Neural Network:Multi-Layer Perceptron
• Smooth activation
function instead
• Can also have
multiple hidden
layers
• Can learn when data
not linearly separable
• Learn like before but
backpropagation from
output layer
44 Center for Genes, Environment, and Health
Smooth activation function (sigmoid, tanh)
Decision Tree
• Node is the attribute tested, branch is the outcome, leaf is the (majority) class (prob)
• Discrete: X = xi?, real-valued: X < value?
• Greedy algorithm chooses the best attribute to split upon:
– pi = fraction of items labelled i in the set
– Gini impurity: IG(p) = Σi Σj≠i pi pj
(prob item labeled i is chosen × prob i mistakenly assigned class j)
– Information gain (entropy): IE(p) = −Σi pi log2 pi
– Real-valued: SSE
• EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles
[Figure: example clinical decision tree with Y/N splits on BIOPSY+, Rx SIDE EFFECT, BREATH >90%, and BREATH <30%; each leaf shows Died/Alive counts]
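A minimal decision-tree sketch with rpart (the package named on the Examples in R slide), on the built-in iris data since the clinical tree in the figure is not reproducible from the slide:

library(rpart)
fit = rpart(Species ~ ., data = iris, method = "class")
print(fit)                      # greedy splits chosen by impurity reduction
predict(fit, iris[c(1, 51, 101), ], type = "class")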
Random Forest
• Classifier consisting of an ensemble of decision trees {h(x, Θk)} where Θk is some i.i.d. random vector, and each tree casts a vote for the class of x (Breiman 2001)
1. Bagging – Θk is a random selection of N samples (with replacement) to grow the tree
2. Dietterich 98: Θk is a random split among the n best splits
3. Ho 98: Θk is a random subset of features to grow the tree
4. Adaboost-like: Θk is random weights on the examples
– 4 better than {2,3} better than 1 on generalization error
• How many features to select at each node depends on internal estimates of generalization error, classifier strength, and correlation between trees: Out-of-bag estimates
Random Forest
• Most popular implementation of {h(x, Θk)}: bagging (random with replacement from input) + random subset of features
– If the set of features is small, trees are more correlated, so can make new features as random linear combinations of the original features
• Out-of-bag classifier for a specific {x,y} = aggregate over trees that didn’t use {x,y} as training data (removes the need for setting aside test data)
• Out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation)
• Can estimate variable importance from OOB estimates
– For the mth variable, permute its values and compare the misclassification rate of the OOB classifiers on the ‘noised up’ data with OOB on the real data; a large increase implies the mth variable is important
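A minimal sketch with the randomForest package (an assumption; not named on the slides), showing the OOB error and permutation importance described above:

library(randomForest)
set.seed(1)
rf = randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf$err.rate[500, "OOB"]   # out-of-bag error estimate, no test set needed
importance(rf)            # permutation-based variable importance
varImpPlot(rf)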
Support Vector Machine (SVM)
• Support vectors are the points lying closest to the decision surface; they maximize the ‘margin’ of the hyperplane separating the examples (the solution changes if SVs are removed)
• Kernel function – maps non-linearly-separable data to a transformed space
• Non-probabilistic; optimization rather than greedy search; not affected by local minima; theoretical guarantees of performance; escapes the curse of dimensionality
• Distance between H and H1 is 1/||w||, so to maximize the margin, minimize ||w|| = √(Σi wi²) s.t. no points lie between H1 & H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
i.e., yi (xi·w + b) ≥ 1
• Quadratic program (constrained optimization, solved by (the dual of) the Lagrangian multiplier method):
Max LD = Σi αi − ½ Σi Σj αi αj yi yj xi·xj s.t. w = Σi αi yi xi and Σi αi yi = 0
• If not linearly separable, use a transformation to a space where the data are linearly separable, via kernels
• If using the 1-norm, weights = variable importance
http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
Support Vector Machine (SVM)
http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf
Support Vector Machine (SVM)
Not separated by a linear function, but can be by a quadratic one
• Polynomial (p = 1 is linear): K(x, x′) = (x·x′ + 1)^p
• Radial basis function (Gaussian): K(x, x′) = exp(−‖x − x′‖² / (2σ²))
• ~sigmoid (like Neural Net): K(x, x′) = tanh(κ x·x′ − δ)
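A minimal sketch with e1071::svm (the package named on the Examples in R slide), using the radial kernel on data that no linear function separates:

library(e1071)
set.seed(1)
d = data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y = factor(ifelse(d$x1^2 + d$x2^2 < 1, "in", "out"))  # circular boundary
fit = svm(y ~ ., data = d, kernel = "radial", gamma = 1, cost = 1)
table(predict(fit, d), d$y)   # training confusion matrix
plot(fit, d)                  # decision boundary and support vectors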
~sigmoid (like Neural Net)
Naïve Bayes
• Recall Bayes rule:
Pr(X|Y) = Pr(Y|X) Pr(X) / Pr(Y)
• Classifier:
Pr(C|F1,…,Fn) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn)
– Note the denominator does not depend on C (effectively a constant Z)
– “Naïve” assumption: the Fi, Fj are assumed independent given C
– Simplifies calculation:
Pr(C|F1,…,Fn) = (1/Z) Pr(C) Πi Pr(Fi|C)
• Learn parameters Pr(C) & each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …)
– Can learn each Pr(Fi|C) independently; escapes the curse of dimensionality; the dataset need not scale with # Fi
[Figure: naïve Bayes graph, class node C with feature children F1, F2, F3, …, Fn]
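A minimal sketch with e1071::naiveBayes, which fits Gaussian Pr(Fi|C) for numeric features:

library(e1071)
nb = naiveBayes(Species ~ ., data = iris)
nb$apriori                  # class prior counts for Pr(C)
nb$tables$Sepal.Length      # per-class mean and sd for Pr(Fi|C)
predict(nb, iris[c(1, 51, 101), ], type = "class")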
Examples in R
• Making 2D datasets
– Install libraries: mlbench
• Clustering (Hierarchical, K-Means, SOM)
– Install libraries: kohonen
• Classification (kNN, NN, DT, SVM, NB)
– Install libraries: class (if R > 3.0, o/w knn), neuralnet, rpart, e1071