TRANSCRIPT
Center for Genes, Environment, and Health
Machine Learning CPBS7711
Oct 1, 2013
Sonia Leach, PhD Assistant Professor
Center for Genes, Environment, and Health
National Jewish Health
Someone once said
“Artificial Intelligence = Search”
so Machine Learning = “Induction of new knowledge from experience, and the ability to improve”?
Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics.
We might say the defining question of Computer Science is “How can we build machines that
solve problems, and which problems are inherently tractable/intractable?”
The question that largely defines Statistics is “What can be inferred from data plus a set of
modeling assumptions, with what reliability?”
The defining question for Machine Learning builds on both, but it is a distinct question.
Whereas Computer Science has focused primarily on how to manually program computers,
Machine Learning focuses on the question of how to get computers to program themselves
(from experience plus some initial structure).
Whereas Statistics has focused primarily on what conclusions can be inferred from data,
Machine Learning incorporates additional questions about what computational architectures
and algorithms can be used to most effectively capture, store, index, retrieve and merge these
data, how multiple learning subtasks can be orchestrated in a larger system, and questions of
computational tractability.
We say that a machine learns with respect to a particular task T, performance metric P, and type
of experience E, if the system reliably improves its performance P at task T, following
experience E.
- Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
Also an interesting discussion of the differences among AI, ML, Data Mining, and Stats:
http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai
Machine Learning
• From Wikipedia:
– Decision tree learning
– Association rule learning
– Artificial neural networks
– Inductive logic programming
– Support vector machines
– Clustering
– Bayesian networks
– Reinforcement learning
– Representation learning
– Similarity and metric learning
– Sparse dictionary learning
• From Alpaydin, Intro to Mach Learn:
– Supervised Learning
– Bayesian Decision Theory
– Parametric Methods
– Multivariate Methods
– Dimensionality Reduction
– Clustering
– Nonparametric Methods
– Decision Trees
– Linear Discrimination
– Multilayer Perceptrons
– Local Models
– Kernel Machines
– Bayesian Estimation
– Hidden Markov Models
– Graphical Models
– Combining Multiple Learners
– Reinforcement Learning
http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf
Machine Learning (what I will cover)
• Unsupervised
– Dimensionality Reduction
• PCA
– Clustering
• k-Means, SOM, Hierarchical
– Association Set Mining
– Probabilistic Graphical Models
• HMMs, Bayes Nets
• Supervised
– k-Nearest Neighbor
– Neural Nets
– Decision Trees/Random Forests
– SVMs
– Naïve Bayes
• Issues
– Regression/Classification
– Feature selection/reduction
– Missing data
– Boosting/bagging/jackknife
– Cross validation, generalization
– Model selection
Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) …]
R: http://cran.r-project.org/web/views/MachineLearning.html
Dimensionality Reduction: Principal Components Analysis (PCA)
• Motivation: Instead of considering all variables, use a small number of linear combinations of those variables, with minimum information lost
http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/
[Figure: 2D toy data. If only one of the two variables could be chosen to represent the data, the y-axis would be chosen, since it explains more variance; but the amount of variance explained by the first principal component P1 is greater than that explained by either original axis.]
Principal Components Analysis (PCA)
• If X = (x1, x2, …, xp) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is
X → Y = Γᵀ(X − μ)
s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
– Linear orthogonal transform of original data to a new coordinate system
– each component is a linear combination of the original variables
• coefficients of the variables in the linear combo = Loadings
• data transformed to the new coords = Scores
– components ordered by percentage of variance explained along the new axis
– number of components = minimum dimension of the input data matrix
– set of orthogonal vectors not unique, not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above & R princomp) or singular value decomposition (SVD) (R prcomp)
Adapted from S-plus Guide to Statistics
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X − μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
[1,] -2.292745 5.827588 8.966977 -7.1630488 -2.2195936
[2,] 25.846460 13.457048 -3.257987 0.5344066 0.4777994
[3,] -14.856875 4.337867 -4.057297 -2.5308172 1.4998247
[4,] 70.434116 -3.286077 6.423473 3.9571310 0.8815369
[5,] 13.768664 -4.392701 -6.058773 -4.7551497 -2.2951908
[6,] -28.899236 -4.611347 4.338621 -2.2710490 6.7118075
[7,] 5.216449 -4.536616 -7.625423 2.2093319 3.2618335
[8,] -3.432334 -11.115805 -3.553422 -0.9908949 -4.1604420
[9,] -31.579207 8.354892 -2.497369 5.6986938 -1.9742069
[10,] -34.205292 -4.034848 7.321199 5.3113963 -2.1833687
Y(scores)
diffgeom complex algebra reals stats
1 36 58 43 36 37
2 62 54 50 46 52
3 31 42 41 40 29
4 76 78 69 66 81
5 46 56 52 56 40
6 12 42 38 38 28
7 39 46 51 54 41
8 30 51 54 52 32
9 22 32 43 28 22
10 9 40 47 30 24
X
Component Importance: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 30.142 7.179 5.786 4.098 3.084
Proportion of Variance 0.890 0.050 0.032 0.016 0.009
Cumulative Proportion 0.890 0.941 0.974 0.990 1.000
Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
diffgeom 0.638 0.599 -0.407 -0.112 -0.237
complex 0.372 -0.230 0.593 -0.595 -0.320
algebra 0.240 -0.371 0.645 -0.624
reals 0.333 -0.671 -0.557 -0.234 0.271
statistics 0.535 0.414 0.404 0.615
(loadings)
EXAMPLE IN R
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center              # mean vector mu
Gamma = pc$loadings         # loadings matrix Gamma
Y = pc$scores               # scores Y
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2       # ~ lambda_i
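A quick aside (not on the original slide): prcomp computes the same transformation via SVD and is the numerically preferred R function; a minimal sketch, assuming the same X as above:

## prcomp() uses SVD of the centered data matrix rather than
## eigendecomposition of the covariance matrix; scores and loadings
## agree with princomp() up to column signs (sdev differs by a factor
## sqrt((n-1)/n), since princomp divides by n and prcomp by n-1)
pc2 = prcomp(X)
head(pc2$x)          # scores, compare to pc$scores
pc2$rotation         # loadings, compare to pc$loadings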
Principal Components Analysis (PCA)
Adapted from S-plus Guide to Statistics
## Verify Y = (X - mu) * Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)   # ~0 up to rounding
## Verify X represented by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s")
biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)
Biplot arrows for the original variables:
length = proportion of variance explained in the 2 plotted components
direction = relative loadings on the 2 components
ex) diffgeom largest (+,+), algebra smallest (−,+)
Clustering
• Partitioning
– Must specify number of clusters
– K-Means, Self-Organizing Maps (SOM/Kohonen Net)
• (Agglomerative) Hierarchical Clustering
– Do not need to specify number of clusters
– Need to specify distance metric and linkage method
• Other approaches
– Fuzzy clustering (probabilistic membership)
– Spectral Clustering (using eigenvalue decomposition)
Clustering
http://apandre.wordpress.com/visible-data/cluster-analysis/
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
k-Means
• Initialize: Select the initial k Centroids
– REPEAT
• Form k clusters by assigning all points to the ‘closest’ Centroid
• Recompute the Centroid for each cluster
– UNTIL the Centroids don’t change or all changes are below a predefined threshold
• Initial Centroids can be random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random 1st assignment
• ‘closest’ typically defined by Euclidean distance (Voronoi diagram)
• Prone to local optima, so typically do N random restarts and take the best (min sum of squared distE to Centroids); a sketch follows below
• In practice, favors separated spherical clusters
distE(x, y) = distE(y, x) = ( Σi=1..n (xi − yi)² )^(1/2)
Images from wikipedia
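A minimal k-means sketch with base R kmeans(); the toy data and parameter choices are illustrative assumptions, not from the slide:

## two Gaussian blobs; nstart = 25 does 25 random restarts and keeps
## the solution with the minimum total within-cluster SSE
set.seed(1)
X = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
          matrix(rnorm(100, mean = 4), ncol = 2))
km = kmeans(X, centers = 2, nstart = 25)
km$centers          # final Centroids
km$tot.withinss     # sum of squared distE to Centroids
plot(X, col = km$cluster); points(km$centers, pch = 8, cex = 2)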
k-Means
http://en.wikipedia.org/wiki/K-means_clustering
[Figure: k-Means iterations 0 through 5 on example data, centroids converging]
Images from wikipedia
Self-Organizing Maps (SOM)
• Similar to k-Means: the goal is to assign each data point to the map node (cf. Centroid in k-Means) whose weight vector is ‘closest’ to the data-space vector (minimize distE(x, w))
• Difference: map nodes constrained by neighborhood
relationships, whereas k-Means Centroids freely move
• Must input initial topology, map ‘stretches’ to cover nD
data in 2D, similar data assigned to map neighbors
Image from wikipedia
Self-Organizing Maps (SOM)
• 1. Initialization – Choose random
values for initial weight vectors wj.
• 2. Sampling – Draw a sample
training input vector x from the
input space.
• 3. Matching – Find the winning
neuron I(x) with weight vector
closest to input vector (i.e.,min distE)
• 4. Updating – Apply the weight
update equation
Δwji = η(t) · Tj,I(x)(t) · (xi − wji)
where η(t) = learning rate @ time t*
Tj,I(x)(t) = neighborhood function @ time t
• 5. Continuation – keep returning
to step 2 until the feature map stops
changing.
http://www.sciencedirect.com/science/article/pii/S0014579399005244 * Informal intro to simulated annealing, gradient descent…
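A minimal SOM sketch with the kohonen package (an assumption; the slides name kohonen only later under Examples in R); toy data assumed:

## a 4x4 hexagonal grid 'stretches' over the 2D data; similar points
## end up assigned to neighboring map nodes
library(kohonen)
set.seed(1)
X = matrix(rnorm(400), ncol = 2)
sm = som(scale(X), grid = somgrid(xdim = 4, ydim = 4, topo = "hexagonal"))
plot(sm, type = "mapping")   # data-to-node assignments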
Self-Organizing Maps (SOM)
http://www.sciencedirect.com/science/article/pii/S0014579399005244
Hierarchical Clustering
• Divisive – (top down) start with all points in 1 cluster, successively subdivide until the full tree is built
• Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge the ‘closest’ pair of clusters at each step until the root
– Requires a metric to define ‘closest’ – distance is no longer between points, but between clusters
– The linkage strategy determines which pair to merge, often based on pairwise point comparisons
• Dendrogram shows order of splits
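A minimal agglomerative sketch with base R hclust(); the data are illustrative assumptions:

## Euclidean distance + complete linkage; the dendrogram shows the
## order of merges, and cutree() extracts a flat clustering
set.seed(1)
X = matrix(rnorm(60), ncol = 2)
hc = hclust(dist(X, method = "euclidean"), method = "complete")
plot(hc)             # dendrogram
cutree(hc, k = 3)    # cut into 3 clusters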
Distance Metrics
• Euclidean
– distance in Euclidean space
• Pearson Correlation
– linear relationships
• Spearman Correlation
– monotonic relationships
• Mutual Information
– non-linear relationships
• Polyserial Correlation
– correlation continuous vs ordinal (polychoric if ordinal vs ordinal)
• Hamming Distance, Jaccard, Dice (binary variables)
distE(x, y) = ( Σi=1..n (xi − yi)² )^(1/2)
distP(x, y) = 1 − [ Σi=1..n (xi − x̄)(yi − ȳ) ] / [ √(Σi=1..n (xi − x̄)²) · √(Σi=1..n (yi − ȳ)²) ]
distS(rx, ry) = 1 − [ Σi=1..n (rxi − r̄x)(ryi − r̄y) ] / [ √(Σi=1..n (rxi − r̄x)²) · √(Σi=1..n (ryi − r̄y)²) ], where rz = rank(z)
distMI(x, y) = H(x, y) − MI(x, y), where
H(x) = −Σx px log px, H(x, y) = −Σx,y px,y log px,y, and MI(x, y) = H(x) + H(y) − H(x, y)
distH = M01 + M10 (Hamming)
distJ = (M01 + M10) / (M01 + M10 + M11) (Jaccard; good when 0 gives no info)
distD = 1 − 2·|X ∩ Y| / (|X| + |Y|) (Dice; like Jaccard but 2*Matches)
where Mab = # positions with x = a and y = b
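A hedged sketch of these metrics in base R (toy vectors assumed):

x = c(1, 2, 3, 4, 5); y = c(2, 4, 5, 4, 6)
sqrt(sum((x - y)^2))                    # Euclidean
1 - cor(x, y, method = "pearson")       # Pearson distance
1 - cor(x, y, method = "spearman")      # Spearman distance
a = c(1, 0, 1, 1); b = c(1, 1, 0, 1)    # binary vectors
sum(a != b)                             # Hamming: M01 + M10
sum(a != b) / sum(a | b)                # Jaccard distance
1 - 2 * sum(a & b) / (sum(a) + sum(b))  # Dice distance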
Distance Metrics
• Euclidean vs Pearson (linear) vs Spearman (monotonic)
Numbers are Pearson correlation
Note Pearson invariant to slope
Pearson=0 if non-linear
[Figure: example variable pairs annotated with Pearson, Spearman, and Euclidean distance values]
Linkage Methods
• Single Linkage
argmin over S,T of min s∈S, t∈T dist(s, t)
• Complete Linkage
argmin over S,T of max s∈S, t∈T dist(s, t)
• Average Linkage (a.k.a. group average)
argmin over S,T of average s∈S, t∈T dist(s, t)
• Centroid Linkage – min dist(centroid(S), centroid(T)) (people erring after the Eisen et al 1998 TreeView paper think this = Average Linkage!)
• Ward’s Linkage (optimizes the same criterion as k-Means)
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from Lozupone lecture – assumes constant rate of evolution; average linkage with Euclidean distance
Choosing the Number of Clusters
• Rule of thumb: k ≈ √(n/2)
• Elbow or knee method (bend in plot of metric)
• K-means likes spherical clusters, so minimize within-cluster variation W(K) (SSE, sum of distances of all points to their cluster mean), or maximize between-cluster variation B(K) (distance between clusters), or both: CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)]*
• Gap Statistic
– Calculate SSE; randomize the dataset and calculate SSErand, n times; gap = log(mean SSErand / SSE)
• Hierarchical – plot the distance chosen at each merge (okay for single, complete)
See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for
long list of indices, NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and
http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
[Figure: W(K), B(K), CH(K), and Gap(K) plotted against the number of clusters K]
*Calinski & Harabasz 1974
Association Set Mining
• Also known as Market Basket Analysis
{milk, eggs} ⇒ {butter}
• Support of itemset X: supp(X) = # transactions containing itemset X
• Confidence of rule: conf(X ⇒ Y) = supp(X & Y) / supp(X)
• Lift of rule (performance over assuming independence):
lift(X ⇒ Y) = supp(X & Y) / (supp(X) * supp(Y))
• Want rules with max supp, conf, lift
• Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html
Association Set Mining
• Tables of data are converted to transactions by creating binary variables for all categories of all variables (continuous variables must be discretized; missing data okay)
ID     Gender  Age  Height (inches)  Race       Diagnosis
CC245  Male     6   25               Caucasian  Depression
CC346  Male    75   60               African    COPD
CC978          30   54               Asian      Obesity
CC125  Female  15   54               African
{ {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y},
{gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y},
{age_adult=Y, height_50-59=Y,race_AS=Y, diag_obes=Y},
{gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} }
Association Set Mining
Example in R: arules pkg, apriori algorithm
lhs rhs support confidence lift
1 {Class=2nd,
Age=Child} => {Survived=Yes} 0.011 1.000 3.097
2 {Class=2nd,
Sex=Female,
Age=Child} => {Survived=Yes} 0.006 1.000 3.096
3 {Class=1st,
Sex=Female} => {Survived=Yes} 0.064 0.972 3.010
4 {Class=1st,
Sex=Female,
Age=Adult} => {Survived=Yes} 0.064 0.972 3.010
… 12 {Sex=Female,
Survived=Yes} => {Age=Adult} 0.143 0.918 0.966
27 {Class=2nd} => {Age=Adult} 0.118 0.915 0.963
Note that rule 2 is subsumed by rule 1, which has better lift (and support) – redundant rules can be removed
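A minimal runnable sketch with arules; the slide's rules come from the Titanic data, but the Groceries transactions bundled with arules are assumed here for a self-contained example:

library(arules)
data("Groceries")
rules = apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))   # top rules by lift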
Probabilistic Graphical Models
[Figure: 2×2 taxonomy of dynamic probabilistic graphical models. A Markov Process (MP) chains states Xt−1 → Xt. Adding observability gives the Hidden Markov Model (HMM), with observations Ot emitted from hidden states. Adding utility gives the Markov Decision Process (MDP), with actions At and utilities Ut. Adding observability and utility gives the Partially Observable Markov Decision Process (POMDP).]
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
– Initial state distribution πi = Pr(X1 = i)
– Transition probability aij = Pr(Xt = j | Xt−1 = i)
– Emission probability bik = Pr(Ot = k | Xt = i)
• Given an observation sequence O = O1,O2,…,OT, how do we compute Pr(O | λ)?
Hidden Markov Model (HMM)
[Figure: 3-state example with N = 3, M = 2, π = (0.25, 0.55, 0.2), a 3×3 transition matrix A, and a 3×2 emission matrix B]
• Probability of O is the sum over all state sequences:
Pr(O|λ) = Σall X Pr(O|X, λ) Pr(X|λ)
= Σall X πx1 bx1,o1 ax1,x2 bx2,o2 … axT−1,xT bxT,oT
• What is the computational complexity of this sum?
• At each t there are N states to reach, so N^T possible state sequences and 2T multiplications per sequence, meaning O(2T·N^T) operations
• So 3 states, length-10 sequence = 1,180,980 operations, and length 20 ≈ 1e11!
• Efficient dynamic programming algorithm to do this: the Forward algorithm (Baum and Welch, O(N²T))
The Forward Algorithm
Probability of a sequence is the sum of all paths that can produce it
Example (CpG islands): two states, CpG and Non-CpG, with transitions CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→Non-CpG 0.9, Non-CpG→CpG 0.1, and emissions:
CpG: G .3, C .3, A .2, T .2
Non-CpG: G .1, C .1, A .4, T .4
Forward trellis for the observation sequence G C G A A:
G: α(CpG) = .3, α(Non-CpG) = .1
C: α(CpG) = .3·(.3·.8 + .1·.1) = .075, α(Non-CpG) = .1·(.3·.2 + .1·.9) = .015
G: α(CpG) = .3·(.075·.8 + .015·.1) = .0185, α(Non-CpG) = .1·(.075·.2 + .015·.9) = .0029
A: α(CpG) = .2·(.0185·.8 + .0029·.1) = .003, α(Non-CpG) = .4·(.0185·.2 + .0029·.9) = .0025
A: α(CpG) = .2·(.003·.8 + .0025·.1) = .0005, α(Non-CpG) = .4·(.003·.2 + .0025·.9) = .0011
David Pollock’s Lecture
Parameter estimation by Baum-Welch (Forward-Backward Algorithm):
Forward variable αt(i) = Pr(O1..t, Xt = i | λ)
Backward variable βt(i) = Pr(Ot+1..T | Xt = i, λ)
DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdf
and erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf
Forward Algorithm
• Dynamic programming method to compute the forward variable αt(i) = Pr(O1..t, Xt = i | λ)
• Base condition: for 1 ≤ i ≤ N
α1(i) = πi bi,o1
• Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1
αt+1(j) = [ Σi=1..N αt(i) aij ] bj,ot+1
• Then the probability of the sequence is
Pr(O | λ) = Σi=1..N αT(i)
*Backward algorithm
for βt(i) is analogous
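A sketch of the Forward algorithm in R on the CpG example above (state 1 = CpG, 2 = Non-CpG; observations coded G=1, C=2, A=3, T=4). To reproduce the trellis numbers, an all-ones initial vector is assumed in place of a proper π, so the first column equals the emission probabilities:

forward = function(pi0, A, B, obs) {
  N = length(pi0); Tn = length(obs)
  alpha = matrix(0, N, Tn)
  alpha[, 1] = pi0 * B[, obs[1]]                 # base condition
  for (t in 1:(Tn - 1))                          # recurrence, O(N^2 T)
    alpha[, t + 1] = (t(A) %*% alpha[, t]) * B[, obs[t + 1]]
  list(alpha = alpha, prob = sum(alpha[, Tn]))   # Pr(O | lambda)
}
A = matrix(c(0.8, 0.2,                           # CpG -> CpG, Non-CpG
             0.1, 0.9), 2, 2, byrow = TRUE)      # Non-CpG -> CpG, Non-CpG
B = matrix(c(0.3, 0.3, 0.2, 0.2,                 # CpG emissions: G C A T
             0.1, 0.1, 0.4, 0.4), 2, 4, byrow = TRUE)
fw = forward(c(1, 1), A, B, obs = c(1, 2, 1, 3, 3))  # G C G A A
round(fw$alpha, 4)   # columns: (.3,.1), (.075,.015), (.0185,.0029), ...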
Applications in Bioinformatics
• DNA – motif matching, gene matching,
multiple sequence alignment
• Amino Acids – domain matching, fold
recognition
• Microarrays/Whole Genome Sequencing –
assign copy number
• ChIP-chip/seq – distinct chromatin states
Bayesian Networks
• Given set of random variables,
the joint probability distribution
can be represented by:
– Structure: Directed Acyclic Graph
(DAG)
• variables are nodes, absence of arcs
captures conditional independencies
– Parameters: Local Conditional
Probability Distributions (CPDs)
• conditional probability of variable given
values of parents in graph
• Joint Probability factors into
product of local CPDs:
Pr(X1, X2, …, Xn) = Πi=1..n Pr(Xi | Parents(Xi))
Bayesian Networks
• Generally can think of directed arcs as
‘causal’ (be careful!)
– If the sprinkler is on OR it is raining, then the
grass will be wet: Pr(W|S,R)
• If observe wet grass, can determine
whether because of sprinkler or rain
– Pr(R|W) and Pr(S|W)
– Bayes rule: Pr(X|Y) = Pr(Y|X) Pr(X) / Pr(Y)
• Note S and R compete to explain W: this
model says sprinkler usage is independent
of rain, but if know the grass is wet, and it is
raining, then it is less likely that the
sprinkler being on is the explanation for W
– Pr(S|W,R) < Pr(S|W) “explaining away”
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
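A small enumeration sketch of explaining away; the CPD numbers below are illustrative assumptions (in the spirit of Murphy's tutorial), not values from the slide:

pS = 0.1; pR = 0.2                      # assumed priors Pr(S), Pr(R)
pW = function(s, r)                     # assumed CPD Pr(W=1 | S=s, R=r)
  if (s && r) 0.99 else if (s || r) 0.9 else 0.0
joint = function(s, r)                  # Pr(S=s, R=r, W=1)
  (if (s) pS else 1 - pS) * (if (r) pR else 1 - pR) * pW(s, r)
pW1   = joint(0,0) + joint(0,1) + joint(1,0) + joint(1,1)  # Pr(W=1)
pS_W  = (joint(1,0) + joint(1,1)) / pW1                    # Pr(S=1 | W=1)
pS_WR = joint(1,1) / (joint(0,1) + joint(1,1))             # Pr(S=1 | W=1, R=1)
c(pS_W, pS_WR)   # ~0.36 vs ~0.11: rain 'explains away' the sprinkler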
Applications in Bioinformatics
Gene regulatory networks (Friedman et al, 2000, PMID: 11108481)
Predicting clinical outcomes
using expression data (Gevaert et al, 2006, PMID: 16873470)
Determining Regulators with PRMS (Segal et al, 2002, RECOMB)
Gene Function Prediction (Troyanskaya et al, 2003, PMID: 12826619 )
Hanalyzer – edge scores (Leach et al, 2009, PMID: 19325874)
Supervised Learning
• Given examples (x,y) of input features x and
output variable y, learn function f(x)=y
– Regression (continuous response) vs Classification
(discrete response)
– Feature selection vs Feature Reduction
– Cross validation (Leave-One-Out vs N-Fold)
– Generalization (Training set error vs Test set error)
– Model Selection (AIC, BIC)
– Boosting/bagging/jackknife
– Missing data and Imputation
– Curse of dimensionality
Supervised Learning
• Boosting (weak learners on different subsets)
– Train H1 on a random data split; sample among H1’s predictions so the next data set, used to train H2, is half wrong, half right under H1. Train H3 on examples where both H1 and H2 are wrong. Return the majority vote of H1, H2, H3 (Adaboost weights examples, weighted vote)
• Bagging (bootstrap aggregating)
– Train multiple models on random with-replacement (bootstrap) splits of the input data, average their predictions
• Jackknife (vs bootstrap) – disjoint subsets of data
• Model Selection: balance goodness of fit (likelihood L) with complexity of the model (number of parameters k) for n samples
– Bayesian information criterion (BIC): minimize k ln(n) − 2 ln(L)
– Akaike information criterion (AIC): minimize 2k − 2 ln(L) (weaker penalty, better-developed theory than BIC)
• Curse of dimensionality – the greater the dimension D, the sparser the data samples are in covering the space, so more and more data are needed to learn properly
Decision Boundaries
https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
k-Nearest Neighbors
• Store a database of (x, y) pairs; classify a new example by majority vote of its k nearest neighbors (regression if assigning the (weighted) mean y in the neighborhood)
• No training needed, non-parametric, sensitive to local structure in the data; the most frequent class tends to dominate
• Curse of dimensionality: with many variables, any query is equidistant to all points – reduce features by PCA
• Allows complicated boundaries between classes
(Example from the figure: if k = 3, the green query point is classified red; if k = 5, blue)
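A minimal kNN sketch with class::knn (the package named on the Examples in R slide); toy data assumed:

library(class)
set.seed(1)
train = rbind(matrix(rnorm(40, 0), ncol = 2),
              matrix(rnorm(40, 3), ncol = 2))
cl   = factor(rep(c("red", "blue"), each = 20))
test = matrix(c(1.5, 1.5), ncol = 2)   # one query point
knn(train, test, cl, k = 3)    # majority vote of 3 nearest neighbors
knn(train, test, cl, k = 5)    # the vote can flip as k grows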
Neural Network: Linear Perceptron
• Learning: Initialize wt, choose learning rate η
• 1) Calculate prediction y*j,t = f[wt · xj]
• 2) Update weights wt+1 = wt + η (yj − y*j,t) xj
• Repeat 1 & 2 until (yj − y*j,t) < threshold
– Can be generalized to multi-class
– Optimal only if data are linearly separable
[Figure: linearly separable vs non-separable data; step activation function]
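A from-scratch sketch of the perceptron rule above (step activation; the data and learning rate are illustrative assumptions):

set.seed(1)
X = cbind(1, rbind(matrix(rnorm(40, 0), ncol = 2),   # bias column + 2 features
                   matrix(rnorm(40, 3), ncol = 2)))
y = rep(c(0, 1), each = 20)
f = function(z) as.numeric(z > 0)          # step activation
w = rep(0, 3); eta = 0.1                   # weights, learning rate
for (epoch in 1:100) {
  for (j in sample(nrow(X))) {
    ystar = f(sum(w * X[j, ]))             # 1) prediction
    w = w + eta * (y[j] - ystar) * X[j, ]  # 2) update
  }
}
mean(f(X %*% w) == y)    # training accuracy (1 if linearly separable)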
Neural Network:Multi-Layer Perceptron
• Smooth activation
function instead
• Can also have
multiple hidden
layers
• Can learn when data
not linearly separable
• Learn like before but
backpropagation from
output layer
44 Center for Genes, Environment, and Health
Smooth activation function (sigmoid, tanh)
Decision Tree
• Node is the attribute tested, branch is the outcome, leaf is the (majority) class (prob)
• Discrete: X = xi?, real-valued: X < value?
• Greedy algorithm chooses the best attribute to split upon:
– pi = fraction of items labelled i in the set
– Gini impurity: IG(p) = Σi Σj≠i pi pj
(prob item labeled i is chosen × prob i mistakenly assigned class j)
– Information gain (entropy): IE(p) = −Σi pi log2 pi
– Real-valued: SSE
• EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles
[Figure: example clinical decision tree with Y/N splits on BIOPSY+, Rx SIDE EFFECT, BREATH >90%, and BREATH <30%; each leaf shows Died/Alive counts]
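A minimal decision-tree sketch with rpart (the package named on the Examples in R slide), on the built-in iris data since the clinical tree in the figure is not reproducible from the slide:

library(rpart)
fit = rpart(Species ~ ., data = iris, method = "class")
print(fit)                      # greedy splits chosen by impurity reduction
predict(fit, iris[c(1, 51, 101), ], type = "class")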
Random Forest
• Classifier consisting of an ensemble of decision trees {h(x, Θk)} where Θk is some i.i.d. random vector, and each tree casts a vote for the class of x (Breiman 2001)
1. Bagging – Θk is a random selection of N samples (with replacement) to grow the tree
2. Dietterich 98: Θk is a random split among the n best splits
3. Ho 98: Θk is a random subset of features to grow the tree
4. Adaboost-like: Θk is random weights on the examples
– 4 better than {2,3} better than 1 on generalization error
• How many features to select at each node depends on internal estimates of generalization error, classifier strength, and correlation between trees: Out-of-bag estimates
Random Forest
• Most popular implementation of {h(x, Θk)}: bagging (random with replacement from input) + random subset of features
– If the set of features is small, trees are more correlated, so can make new features as random linear combinations of the original features
• Out-of-bag classifier for a specific {x,y} = aggregate over trees that didn’t use {x,y} as training data (removes the need for setting aside test data)
• Out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation)
• Can estimate variable importance from OOB estimates
– For the mth variable, permute its values and compare the misclassification rate of the OOB classifiers on the ‘noised up’ data with OOB on the real data; a large increase implies the mth variable is important
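A minimal sketch with the randomForest package (an assumption; not named on the slides), showing the OOB error and permutation importance described above:

library(randomForest)
set.seed(1)
rf = randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf$err.rate[500, "OOB"]   # out-of-bag error estimate, no test set needed
importance(rf)            # permutation-based variable importance
varImpPlot(rf)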
Support Vector Machine (SVM)
• Support vectors are the points lying closest to the decision surface; they maximize the ‘margin’ of the hyperplane separating the examples (the solution changes if SVs are removed)
• Kernel function – maps non-linearly-separable data to a transformed space
• Non-probabilistic; optimization rather than greedy search; not affected by local minima; theoretical guarantees of performance; escapes the curse of dimensionality
• Distance between H and H1 is 1/||w||, so to maximize the margin, minimize ||w|| = √(Σi wi²) s.t. no points lie between H1 & H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
i.e., yi (xi·w + b) ≥ 1
• Quadratic program (constrained optimization, solved by (the dual of) the Lagrangian multiplier method):
Max LD = Σi αi − ½ Σi Σj αi αj yi yj xi·xj s.t. w = Σi αi yi xi and Σi αi yi = 0
• If not linearly separable, use a transformation to a space where the data are linearly separable, via kernels
• If using the 1-norm, weights = variable importance
http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
Support Vector Machine (SVM)
http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf
Support Vector Machine (SVM)
Not separated by a linear function, but can be by a quadratic one
• Polynomial (p = 1 is linear): K(x, x′) = (x·x′ + 1)^p
• Radial basis function (Gaussian): K(x, x′) = exp(−‖x − x′‖² / (2σ²))
• ~sigmoid (like Neural Net): K(x, x′) = tanh(κ x·x′ − δ)
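A minimal sketch with e1071::svm (the package named on the Examples in R slide), using the radial kernel on data that no linear function separates:

library(e1071)
set.seed(1)
d = data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y = factor(ifelse(d$x1^2 + d$x2^2 < 1, "in", "out"))  # circular boundary
fit = svm(y ~ ., data = d, kernel = "radial", gamma = 1, cost = 1)
table(predict(fit, d), d$y)   # training confusion matrix
plot(fit, d)                  # decision boundary and support vectors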
~sigmoid (like Neural Net)
Naïve Bayes
• Recall Bayes rule:
Pr(X|Y) = Pr(Y|X) Pr(X) / Pr(Y)
• Classifier:
Pr(C|F1,…,Fn) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn)
– Note the denominator does not depend on C (effectively a constant Z)
– “Naïve” assumption: the Fi, Fj are assumed independent given C
– Simplifies calculation:
Pr(C|F1,…,Fn) = (1/Z) Pr(C) Πi Pr(Fi|C)
• Learn parameters Pr(C) & each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …)
– Can learn each Pr(Fi|C) independently; escapes the curse of dimensionality; the dataset need not scale with # Fi
[Figure: naïve Bayes graph, class node C with feature children F1, F2, F3, …, Fn]
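A minimal sketch with e1071::naiveBayes, which fits Gaussian Pr(Fi|C) for numeric features:

library(e1071)
nb = naiveBayes(Species ~ ., data = iris)
nb$apriori                  # class prior counts for Pr(C)
nb$tables$Sepal.Length      # per-class mean and sd for Pr(Fi|C)
predict(nb, iris[c(1, 51, 101), ], type = "class")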
Examples in R
• Making 2D datasets
– Install libraries: mlbench
• Clustering (Hierarchical, K-Means, SOM)
– Install libraries: kohonen
• Classification (kNN, NN, DT, SVM, NB)
– Install libraries: class (if R > 3.0, o/w knn), neuralnet, rpart, e1071