School of Computer Science
Probabilistic Graphical Models
Parameter Est. in fully observed BNs
Eric Xing, Lecture 4, January 25, 2016
Reading: KF-chap 17
[Figure: example Bayesian networks over nodes X1, X2, X3, X4]
Learning Graphical Models
The goal: given a set of independent samples (assignments of the random variables), find the best (e.g., the most likely) Bayesian network, i.e., both the DAG and the CPDs.
(B,E,A,C,R) = (T,F,F,T,F)
(B,E,A,C,R) = (T,F,T,T,F)
........
(B,E,A,C,R) = (F,T,T,T,F)
[Figure: the Burglary/Earthquake/Alarm/Call/Radio network over nodes B, E, A, C, R, together with a learned CPT P(A | E, B) over the parent configurations (b,e), (b,~e), (~b,e), (~b,~e), with entries 0.9/0.1, 0.2/0.8, 0.9/0.1, 0.01/0.99]
Structural learning
Parameter learning
Learning Graphical Models
Scenarios:
- completely observed GMs: directed, undirected
- partially or unobserved GMs: directed, undirected (an open research topic)
Estimation principles:
- Maximum likelihood estimation (MLE)
- Bayesian estimation
- Maximal conditional likelihood
- Maximal "margin"
- Maximum entropy
We use "learning" as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.
ML Structural Learning for completely observed GMs
[Figure: the E, R, B, A, C network to be learned from data]
Data:
$$(x_1^{(1)}, \ldots, x_n^{(1)}),\ (x_1^{(2)}, \ldots, x_n^{(2)}),\ \ldots,\ (x_1^{(M)}, \ldots, x_n^{(M)})$$
Two "Optimal" approaches
"Optimal" here means the employed algorithms are guaranteed to return a structure that maximizes the objective (e.g., the log-likelihood). Many heuristics used to be popular, but they offer no guarantee of attaining optimality or interpretability, and some do not even have an explicit objective, e.g., structural EM, module networks, greedy structural search, deep learning via auto-encoders, etc.
We will learn two classes of algorithms for guaranteed structure learning, which are likely the only known methods enjoying such a guarantee, but they apply only to certain families of graphs:
- Trees: the Chow-Liu algorithm (this lecture)
- Pairwise MRFs: covariance selection, neighborhood selection (later)
Structural Search
How many graphs over n nodes? $O(2^{n^2})$
How many trees over n nodes? $O(n!)$
But it turns out that we can find the exact solution of an optimal tree (under MLE)!
- Trick 1: the MLE score decomposes into edge-related terms.
- Trick 2: in a tree, each node has only one parent!
This yields the Chow-Liu algorithm.
Information Theoretic Interpretation of ML
$$
\begin{aligned}
\ell(\theta_G; D) &= \log p(D \mid \theta_G, G) \\
&= \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)}\big) \\
&= \sum_i \Big( \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)}\big) \Big) \\
&= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \frac{\mathrm{count}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{M} \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}\big)
\end{aligned}
$$
The key step: from a sum over data points to a sum over counts of variable states.
Information Theoretic Interpretation of ML (con'd)
Plugging in the MLE $\hat{p}$:
$$
\begin{aligned}
\ell(\hat{\theta}_G; D) &= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \hat{p}\big(x_i \mid \mathbf{x}_{\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \bigg( \frac{\hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{\hat{p}(x_i)\, \hat{p}\big(\mathbf{x}_{\pi_i(G)}\big)} \cdot \hat{p}(x_i) \bigg) \\
&= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \frac{\hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{\hat{p}(x_i)\, \hat{p}\big(\mathbf{x}_{\pi_i(G)}\big)} + M \sum_i \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i) \\
&= M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\end{aligned}
$$
This is a decomposable score and a function of the graph structure: the mutual information between each node and its parents, minus (structure-independent) marginal entropies.
Chow-Liu tree learning algorithm
Objective function:
$$\ell(\hat{\theta}_G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)$$
Since the entropy term does not depend on the structure, maximizing the likelihood reduces to maximizing
$$C(G) = M \sum_i \hat{I}\big(x_i, x_{\pi_i(G)}\big)$$
Chow-Liu: for each pair of variables $x_i$ and $x_j$:
- Compute the empirical distribution: $\hat{p}(X_i, X_j) = \dfrac{\mathrm{count}(x_i, x_j)}{M}$
- Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \dfrac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)}$
Define a graph with nodes $x_1, \ldots, x_n$; edge $(i,j)$ gets weight $\hat{I}(X_i, X_j)$.
Chow-Liu algorithm (con'd)
Objective function:
$$C(G) = M \sum_i \hat{I}\big(x_i, x_{\pi_i(G)}\big)$$
Chow-Liu gives the optimal tree BN:
- Compute the maximum-weight spanning tree under the edge weights $\hat{I}(X_i, X_j)$.
- Directions in the BN: pick any node as root, then do a breadth-first search to define the directions.
I-equivalence: all orientations obtained this way represent the same distribution. For example, the trees rooted at A, at C, and at E over the skeleton {A-B, A-C, C-D, C-E} all have the same score:
$$C(G) = \hat{I}(A,B) + \hat{I}(A,C) + \hat{I}(C,D) + \hat{I}(C,E)$$
A minimal sketch of the algorithm follows.
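A minimal Python sketch of Chow-Liu (not from the lecture; `empirical_mutual_information` and `chow_liu_tree` are our own names, and Prim's method stands in for any maximum-weight spanning-tree routine):

```python
import numpy as np

def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X; Y) from two columns of discrete data."""
    M = len(x)
    joint = {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    px, py = {}, {}
    for (a, b), c in joint.items():
        px[a] = px.get(a, 0) + c
        py[b] = py.get(b, 0) + c
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / M
        mi += p_ab * np.log(p_ab / ((px[a] / M) * (py[b] / M)))
    return mi

def chow_liu_tree(data):
    """data: (M, n) array of discrete values. Returns the n-1 edges of the
    maximum-weight spanning tree under pairwise empirical mutual information
    (Prim's algorithm on the dense MI matrix)."""
    M, n = data.shape
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = empirical_mutual_information(data[:, i], data[:, j])
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Add the heaviest edge crossing the cut between tree and non-tree nodes.
        best = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: W[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges
```

Orienting the returned undirected edges away from any chosen root (e.g., by BFS) then yields one member of the I-equivalence class.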
Structure Learning for general graphs
Theorem: the problem of learning a BN structure with at most d parents per node is NP-hard for any (fixed) d ≥ 2.
Most structure-learning approaches therefore use heuristics that exploit score decomposition. Two heuristics that exploit decomposition in different ways:
- Greedy search through the space of node orders
- Local search over graph structures
ML Parameter Est. for completely observed GMs of given structure
[Figure: a two-node model Z → X]
The data: {(z1,x1), (z2,x2), (z3,x3), ..., (zN,xN)}
Parameter Learning
Assume G is known and fixed, either from expert design or as an intermediate outcome of iterative structure learning.
Goal: estimate the parameters $\theta$ from a dataset of N independent, identically distributed (iid) training cases D = {x1, . . . , xN}.
In general, each training case $x_n = (x_{n,1}, \ldots, x_{n,M})$ is a vector of M values, one per node. The model can be:
- completely observable, i.e., every element in $x_n$ is known (no missing values, no hidden variables), or
- partially observable, i.e., $\exists i$ s.t. $x_{n,i}$ is not observed.
In this lecture we consider learning parameters for a BN that has a given structure and is completely observable:
$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) = \sum_i \Big( \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) \Big)$$
Review of density estimation
Density estimation can be viewed as a single-node graphical model. Such models are:
- instances of exponential-family distributions,
- the building blocks of general GMs,
and admit both MLE and Bayesian estimates.
[Figure: plate model with $x_i$ replicated N times; equivalently the nodes x1, x2, x3, ..., xN]
Discrete Distributions
Bernoulli distribution: Ber(p)
$$P(x) = \begin{cases} 1-p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases} \qquad \Longleftrightarrow \qquad P(x) = p^x (1-p)^{1-x}$$
Multinomial distribution: Mult(1, $\theta$)
Multinomial (indicator) variable: $X = [X_1, \ldots, X_6]$, where $X_j \in \{0,1\}$, $\sum_{j \in [1,\ldots,6]} X_j = 1$, and $X_j = 1$ w.p. $\theta_j$ with $\sum_{j \in [1,\ldots,6]} \theta_j = 1$:
$$p\big(x(j)\big) = P(X_j = 1) = \theta_j, \quad \text{where } j \text{ indexes the die face}$$
e.g., for a nucleotide variable $x$ over $\{A, C, G, T\}$:
$$p(x) = \theta_A^{x_A}\, \theta_C^{x_C}\, \theta_G^{x_G}\, \theta_T^{x_T} = \prod_k \theta_k^{x_k}$$
Discrete Distributions (con'd)
Multinomial distribution: Mult(n, $\theta$)
Count variable: $n = [n_1, \ldots, n_K]^T$, where $\sum_k n_k = N$.
$$p(n) = \frac{N!}{n_1!\, n_2! \cdots n_K!}\, \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_K^{n_K} = \frac{N!}{n_1!\, n_2! \cdots n_K!} \prod_k \theta_k^{n_k}$$
Example: multinomial model
Data: we observed N iid die rolls (K-sided): D = {5, 1, K, ..., 3}
Representation: unit basis vectors
$$x_n = [x_{n,1}, x_{n,2}, \ldots, x_{n,K}]^T, \quad \text{where } x_{n,k} \in \{0,1\} \text{ and } \sum_{k=1}^K x_{n,k} = 1$$
Model: $X_n = \{k\}$ w.p. $\theta_k$, $k = 1, \ldots, K$, with $\sum_k \theta_k = 1$.
The likelihood of a single observation $x_n$:
$$P(x_n) = P\big(\{x_{n,k} = 1, \text{ where } k \text{ indexes the die side of the } n\text{th roll}\}\big) = \prod_k \theta_k^{x_{n,k}}$$
The likelihood of the dataset D = {x1, ..., xN}:
$$P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{n=1}^N P(x_n \mid \theta) = \prod_n \prod_k \theta_k^{x_{n,k}} = \prod_k \theta_k^{\sum_n x_{n,k}} = \prod_k \theta_k^{n_k}$$
[Figure: plate model with $x_i$ replicated N times]
MLE: constrained optimization with Lagrange multipliers
Objective function:
$$\ell(\theta; D) = \log P(D \mid \theta) = \log \prod_k \theta_k^{n_k} = \sum_k n_k \log \theta_k$$
We need to maximize this subject to the constraint $\sum_{k=1}^K \theta_k = 1$.
Constrained cost function with a Lagrange multiplier:
$$\tilde{\ell} = \sum_k n_k \log \theta_k + \lambda \Big( 1 - \sum_{k=1}^K \theta_k \Big)$$
Take derivatives w.r.t. $\theta_k$:
$$\frac{\partial \tilde{\ell}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; \theta_k = \frac{n_k}{\lambda}, \qquad \lambda = \sum_k n_k = N$$
$$\hat{\theta}_{k,MLE} = \frac{n_k}{N} \qquad \text{or} \qquad \hat{\theta}_{k,MLE} = \frac{1}{N} \sum_n x_{n,k}$$
Frequency as sample mean.
Sufficient statistics: the counts $n = (n_1, \ldots, n_K)$, with $n_k = \sum_n x_{n,k}$, are sufficient statistics of the data D.
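A minimal numpy sketch of this closed-form MLE (the roll data below is a made-up example):

```python
import numpy as np

# Hypothetical example: N rolls of a K = 6 sided die, encoded as faces 1..6.
rolls = np.array([5, 1, 6, 3, 3, 2, 6, 1, 5, 3])
K = 6

# Sufficient statistics: the per-face counts n_k = sum_n x_{n,k}.
counts = np.bincount(rolls, minlength=K + 1)[1:]

# MLE is the normalized frequency: theta_k = n_k / N.
theta_mle = counts / counts.sum()
print(theta_mle)   # sums to 1 by construction
```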
Bayesian estimation
Dirichlet distribution (prior):
$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$$
Posterior distribution of $\theta$:
$$P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \prod_k \theta_k^{n_k} \prod_k \theta_k^{\alpha_k - 1} = \prod_k \theta_k^{\alpha_k + n_k - 1}$$
Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.
Posterior mean estimation:
$$\bar{\theta}_k = \int \theta_k\, p(\theta \mid D)\, d\theta = C \int \theta_k \prod_{k'} \theta_{k'}^{\alpha_{k'} + n_{k'} - 1}\, d\theta = \frac{n_k + \alpha_k}{N + |\alpha|}$$
Dirichlet parameters can be understood as pseudo-counts.
[Figure: plate model $\alpha \to \theta \to x_i$, with $x_i$ replicated N times]
More on Dirichlet Prior
Where does the normalizing constant $C(\alpha)$ come from? Integration by parts; $\Gamma(\cdot)$ is the gamma function:
$$\int \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}\, d\theta_1 \cdots d\theta_K = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma\big(\sum_k \alpha_k\big)} = \frac{1}{C(\alpha)}$$
$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt, \qquad \text{and for integers } n:\ \Gamma(n+1) = n!$$
Marginal likelihood:
$$p\big(\{x_1, \ldots, x_N\} \mid \alpha\big) = p(n \mid \alpha) = \int p(n \mid \theta)\, p(\theta \mid \alpha)\, d\theta = \frac{C(\alpha)}{C(\alpha + n)}$$
Posterior in closed form:
$$P\big(\theta \mid \{x_1, \ldots, x_N\}, \alpha\big) = \frac{p(n \mid \theta)\, p(\theta \mid \alpha)}{p(n \mid \alpha)} = C(\alpha + n) \prod_k \theta_k^{\alpha_k + n_k - 1} = \mathrm{Dir}(\alpha + n)$$
Posterior predictive rate:
$$p\big(x_{N+1} = i \mid \{x_1, \ldots, x_N\}, \alpha\big) = \int p(x_{N+1} = i \mid \theta)\, \mathrm{Dir}(\theta \mid \alpha + n)\, d\theta = \frac{\alpha_i + n_i}{|\alpha| + |n|}$$
Sequential Bayesian updating
Start with a Dirichlet prior: $P(\theta \mid \alpha) = \mathrm{Dir}(\theta : \alpha)$.
Observe N' samples with sufficient statistics $n'$. The posterior becomes $P(\theta \mid \alpha, n') = \mathrm{Dir}(\theta : \alpha + n')$.
Observe another N'' samples with sufficient statistics $n''$. The posterior becomes $P(\theta \mid \alpha, n', n'') = \mathrm{Dir}(\theta : \alpha + n' + n'')$.
So sequentially absorbing data, in any order, is equivalent to a batch update.
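A tiny numerical sketch of this order-invariance (the count vectors are made up):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior pseudo-counts
batch1 = np.array([3, 0, 2])             # sufficient statistics n'
batch2 = np.array([1, 4, 0])             # sufficient statistics n''

sequential = (alpha + batch1) + batch2   # absorb n', then n''
batch = alpha + (batch1 + batch2)        # absorb everything at once
assert np.array_equal(sequential, batch) # sequential == batch update

posterior_mean = sequential / sequential.sum()  # (alpha_k + n_k) / (|alpha| + N)
print(posterior_mean)
```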
Hierarchical Bayesian Models
$\theta$ are the parameters for the likelihood $p(x \mid \theta)$; $\alpha$ are the parameters for the prior $p(\theta \mid \alpha)$. We can have hyper-hyper-parameters, etc. We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the top-level hyper-parameters constants.
Where do we get the prior?
- Intelligent guesses
- Empirical Bayes (Type-II maximum likelihood): computing point estimates of $\alpha$:
$$\hat{\alpha}_{MLE} = \arg\max_{\alpha}\, p(n \mid \alpha)$$
Limitation of Dirichlet Prior
[Figure: plate model $\alpha \to \theta \to x_i$, with $x_i$ replicated N times]
The Dirichlet's covariance structure is very restricted: it cannot encode arbitrary correlations among the components of $\theta$ (this motivates the logistic normal prior on the next slide).
The Logistic Normal Prior
$$\gamma \sim N_{K-1}(\mu, \Sigma), \quad \gamma_K = 0, \qquad \theta_i = \frac{e^{\gamma_i}}{\sum_{j=1}^{K} e^{\gamma_j}} = \exp\big\{ \gamma_i - C(\gamma) \big\}, \qquad C(\gamma) = \log \sum_{i=1}^{K} e^{\gamma_i}$$
Here $C(\gamma)$ is the log partition function (normalization constant), and we write $\theta \sim \mathrm{LN}(\mu, \Sigma)$.
[Figure: plate model $(\mu, \Sigma) \to \gamma \to x_i$, with $x_i$ replicated N times]
Pro: covariance structure. Con: non-conjugate (we will discuss how to solve this later).
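A small sampling sketch (our own code, not from the lecture; it assumes the $\gamma_K = 0$ parameterization above): draw $\gamma$ from a Gaussian and push it through the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_logistic_normal(mu, Sigma, size=1):
    """Draw theta ~ LN(mu, Sigma): sample gamma ~ N(mu, Sigma) in K-1 dims,
    append gamma_K = 0, and map through the softmax."""
    gamma = rng.multivariate_normal(mu, Sigma, size=size)      # (size, K-1)
    gamma = np.hstack([gamma, np.zeros((size, 1))])            # fix gamma_K = 0
    theta = np.exp(gamma - gamma.max(axis=1, keepdims=True))   # stable softmax
    return theta / theta.sum(axis=1, keepdims=True)

# Correlated components, which a Dirichlet cannot express:
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
print(sample_logistic_normal(mu, Sigma, size=3))   # rows sum to 1
```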
Logistic Normal Densities
[Figure: example logistic normal densities on the simplex for various $(\mu, \Sigma)$]
Continuous Distributions
Uniform probability density function:
$$p(x) = \begin{cases} 1/(b-a) & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$$
Normal (Gaussian) probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$$
- The distribution is symmetric, and is often illustrated as a bell-shaped curve.
- Two parameters, $\mu$ (mean) and $\sigma$ (standard deviation), determine the location and shape of the distribution.
- The highest point on the normal curve is at the mean, which is also the median and mode.
- The mean can be any numerical value: negative, zero, or positive.
Multivariate Gaussian:
$$p(\mathbf{X}; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (\mathbf{X} - \mu)^T \Sigma^{-1} (\mathbf{X} - \mu) \Big\}$$
MLE for a multivariate Gaussian
It can be shown that the MLE for $\mu$ and $\Sigma$ is
$$\mu_{MLE} = \frac{1}{N} \sum_n x_n, \qquad \Sigma_{MLE} = \frac{1}{N} \sum_n (x_n - \mu_{ML})(x_n - \mu_{ML})^T = \frac{1}{N} S$$
where the scatter matrix is
$$S = \sum_n (x_n - \mu_{ML})(x_n - \mu_{ML})^T = \Big( \sum_n x_n x_n^T \Big) - N \mu_{ML} \mu_{ML}^T$$
The sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^T$.
Here $x_n = (x_{n,1}, \ldots, x_{n,K})^T$ and $X$ is the data matrix with rows $x_1^T, x_2^T, \ldots, x_N^T$. Note that $X^T X = \sum_n x_n x_n^T$ may not be full rank (e.g., if N < D), in which case $\Sigma_{ML}$ is not invertible.
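A minimal numpy sketch of these estimators (the data matrix is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))        # made-up data matrix: N = 500, D = 3

N = X.shape[0]
mu_mle = X.mean(axis=0)                  # (1/N) sum_n x_n

# Scatter matrix S = sum_n (x_n - mu)(x_n - mu)^T, computed from centered data.
centered = X - mu_mle
S = centered.T @ centered
Sigma_mle = S / N                        # note: np.cov uses 1/(N-1) by default

print(mu_mle)
print(Sigma_mle)
```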
Bayesian parameter estimation for a Gaussian
There are various reasons to pursue a Bayesian approach:
- We would like to update our estimates sequentially over time.
- We may have prior knowledge about the expected magnitude of the parameters.
- The MLE for $\Sigma$ may not be full rank if we don't have enough data.
We will restrict our attention to conjugate priors, and consider various cases in order of increasing complexity: known $\sigma$, unknown $\mu$; known $\mu$, unknown $\sigma$; unknown $\mu$ and $\sigma$.
Bayesian estimation: unknown µ, known σ
Normal prior:
$$P(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\big\{ -(\mu - \mu_0)^2 / 2\sigma_0^2 \big\}$$
Joint probability:
$$P(x, \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 \Big\} \times \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\big\{ -(\mu - \mu_0)^2 / 2\sigma_0^2 \big\}$$
Posterior:
$$P(\mu \mid x) = \frac{1}{(2\pi\tilde{\sigma}^2)^{1/2}} \exp\big\{ -(\mu - \tilde{\mu})^2 / 2\tilde{\sigma}^2 \big\}$$
$$\text{where } \tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, \bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0, \qquad \tilde{\sigma}^2 = \Big( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \Big)^{-1}$$
and $\bar{x} = \frac{1}{N} \sum_n x_n$ is the sample mean.
Bayesian estimation: unknown µ, known σ (con'd)
The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to the relative noise levels:
$$\mu_N = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, \bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0, \qquad \frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}$$
The precision of the posterior, $1/\sigma_N^2$, is the precision of the prior, $1/\sigma_0^2$, plus one contribution of the data precision, $1/\sigma^2$, for each observed data point.
[Figure: sequentially updating the posterior over the mean; true µ∗ = 0.8 (unknown), (σ2)∗ = 0.1 (known)]
Effect of a single data point:
$$\mu_1 = \mu_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\, (x - \mu_0)$$
Uninformative (vague/flat) prior, $\sigma_0^2 \to \infty$: the posterior mean reduces to the MLE, $\mu_N \to \bar{x}$.
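A minimal sketch of this posterior update (our own helper name; the data is simulated with the slide's µ∗ = 0.8, (σ2)∗ = 0.1):

```python
import numpy as np

def posterior_mu(x, sigma2, mu0, sigma2_0):
    """Posterior N(mu_N, sigma2_N) over the mean of a Gaussian with known
    variance sigma2, given a N(mu0, sigma2_0) prior."""
    N = len(x)
    prec = N / sigma2 + 1.0 / sigma2_0              # posterior precision
    mu_N = (x.sum() / sigma2 + mu0 / sigma2_0) / prec
    return mu_N, 1.0 / prec

rng = np.random.default_rng(2)
x = rng.normal(0.8, np.sqrt(0.1), size=20)          # µ* = 0.8, (σ²)* = 0.1
print(posterior_mu(x, sigma2=0.1, mu0=0.0, sigma2_0=1.0))
```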
Other scenarios
Known $\mu$, unknown $\lambda = 1/\sigma^2$:
- The conjugate prior for $\lambda$ is a Gamma with shape $a_0$ and rate (inverse scale) $b_0$.
- Equivalently, the conjugate prior for $\sigma^2$ is an Inverse-Gamma.
Unknown $\mu$ and unknown $\sigma^2$:
- The conjugate prior is the Normal-Inverse-Gamma.
- Semi-conjugate prior.
Multivariate case:
- The conjugate prior is the Normal-Inverse-Wishart.
Estimation of conditional density
Conditional density estimation can be viewed as a two-node graphical model. Such models are:
- instances of GLIMs (generalized linear models),
- the building blocks of general GMs,
and admit both MLE and Bayesian estimates. See the supplementary slides.
[Figure: two-node models Q → X]
MLE for general BNs
If we assume that the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) = \sum_i \Big( \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) \Big)$$
[Figure: a five-node BN, illustrating local counting, e.g., splitting the data by X2 = 0 vs. X2 = 1 and X5 = 0 vs. X5 = 1]
Plates
A plate is a "macro" that allows subgraphs to be replicated. For iid (exchangeable) data, the likelihood is
$$p(D \mid \theta) = \prod_n p(x_n \mid \theta)$$
We can represent this as a Bayes net with N nodes. The rules of plates are simple:
- Repeat every structure in a box a number of times given by the integer in the corner of the box (e.g., N), updating the plate index variable (e.g., n) as you go.
- Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.
[Figure: the expanded net θ → X1, X2, ..., XN, and the equivalent plate θ → Xn with plate index N]
Decomposable likelihood of a BN
Consider the distribution defined by the directed acyclic GM:
$$p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$$
This is exactly like learning four separate small BNs, each of which consists of a node and its parents.
[Figure: the network X1 → X2, X1 → X3, {X2, X3} → X4, and its four node-plus-parents fragments]
MLE for BNs with tabular CPDs
Assume each CPD is represented as a table (multinomial):
$$\theta_{ijk} \overset{\text{def}}{=} p\big(X_i = j \mid X_{\pi_i} = k\big)$$
Note that in the case of multiple parents, $X_{\pi_i}$ has a composite state, and the CPD will be a high-dimensional table.
The sufficient statistics are the counts of family configurations:
$$n_{ijk} \overset{\text{def}}{=} \sum_n x_{n,i}^j\, x_{n,\pi_i}^k$$
The log-likelihood is
$$\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
$$\hat{\theta}_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$$
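A minimal sketch of this counting rule in Python (our own helper `mle_cpd`; the binary data matrix is made up):

```python
import numpy as np
from collections import Counter

def mle_cpd(data, child, parents, child_card):
    """MLE of a tabular CPD p(X_child = j | X_parents = k) from an (N, M)
    integer data matrix: theta_ijk = n_ijk / sum_j' n_ij'k."""
    counts = Counter()
    for row in data:
        k = tuple(row[p] for p in parents)       # composite parent state
        counts[(k, row[child])] += 1             # family-configuration count n_ijk
    cpd = {}
    for (k, j), n in counts.items():
        cpd.setdefault(k, np.zeros(child_card))[j] = n
    return {k: v / v.sum() for k, v in cpd.items()}

# Hypothetical binary data over nodes (X0, X1, X2); estimate p(X2 | X0, X1):
data = np.array([[0,0,0],[0,1,1],[1,0,1],[1,1,1],[0,0,0],[1,1,0]])
print(mle_cpd(data, child=2, parents=[0, 1], child_card=2))
```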
How to define parameter priors?
[Figure: the Earthquake, Burglary, Radio, Alarm, Call network]
Factorization:
$$p(\mathbf{X}) = \prod_{i=1}^M p\big(x_i \mid \mathbf{x}_{\pi_i}\big)$$
Local distributions defined by, e.g., multinomial parameters:
$$p\big(x_i^k \mid \mathbf{x}_{\pi_i}^j\big) = \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j}$$
How to define the parameter prior $p(\theta \mid G)$?
Assumptions (Geiger & Heckerman 97, 99):
- Complete model equivalence
- Global parameter independence
- Local parameter independence
- Likelihood and prior modularity
Global & Local Parameter Independence
Global parameter independence: for every DAG model,
$$p(\theta_m \mid G) = \prod_{i=1}^M p(\theta_i \mid G)$$
Local parameter independence: for every node,
$$p(\theta_i \mid G) = \prod_{j=1}^{q_i} p\big(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j} \mid G\big)$$
For example, in the alarm network, $P(\mathrm{Call} \mid \mathrm{Alarm} = \mathrm{YES})$ is independent of $P(\mathrm{Call} \mid \mathrm{Alarm} = \mathrm{NO})$.
[Figure: the Earthquake, Burglary, Radio, Alarm, Call network]
Parameter Independence, Graphical View
Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!
[Figure: samples (X1, X2) for sample 1, sample 2, ..., with parameter nodes $\theta_1$ and $\theta_{2|1}$ feeding X1 and X2 respectively; global parameter independence separates $\theta_1$ from $\theta_{2|1}$, local parameter independence separates the rows of $\theta_{2|1}$]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)
Discrete DAG models: $x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Multi}(\theta)$, with a Dirichlet prior:
$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$$
Gaussian DAG models: $x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)$, with a Normal prior:
$$p(\mu \mid \nu, \Psi) = \frac{1}{(2\pi)^{n/2} |\Psi|^{1/2}} \exp\big\{ -\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu) \big\}$$
or a Normal-Wishart prior:
$$p(\mu \mid \nu, \alpha_\mu, W) = \mathrm{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big),$$
$$p(W \mid \alpha_w, T) = c(n, \alpha_w)\, |T|^{\alpha_w / 2}\, |W|^{(\alpha_w - n - 1)/2} \exp\big\{ -\tfrac{1}{2}\, \mathrm{tr}(TW) \big\}$$
Parameter sharing
Consider a time-invariant (stationary) 1st-order Markov model.
Initial state probability vector:
$$\pi_k \overset{\text{def}}{=} p\big(X_1^k = 1\big)$$
State transition probability matrix:
$$A_{ij} \overset{\text{def}}{=} p\big(X_t^j = 1 \mid X_{t-1}^i = 1\big)$$
The joint:
$$p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^T p(X_t \mid X_{t-1})$$
The log-likelihood:
$$\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^T \log p\big(x_{n,t} \mid x_{n,t-1}, A\big)$$
Again, we optimize each parameter separately. $\pi$ is a multinomial frequency vector, and we've seen it before. What about A?
[Figure: chain X1 → X2 → X3 → ... → XT with shared transition matrix A]
Learning a Markov chain transition matrix
A is a stochastic matrix: $\sum_j A_{ij} = 1$. Each row of A is a multinomial distribution, so the MLE of $A_{ij}$ is the fraction of transitions from i to j:
$$\hat{A}_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^T x_{n,t-1}^i\, x_{n,t}^j}{\sum_n \sum_{t=2}^T x_{n,t-1}^i}$$
Application: if the states $X_t$ represent words, this is called a bigram language model.
Sparse data problem: if $i \to j$ did not occur in the data, we will have $\hat{A}_{ij} = 0$, and then any future sequence with the word pair $i \to j$ will have zero probability. A standard hack is backoff smoothing or deleted interpolation, i.e., mixing the MLE with a backoff distribution $\eta$:
$$\tilde{A}_{i\cdot} = \lambda\, \eta + (1 - \lambda)\, \hat{A}_{i\cdot}^{ML}$$
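A minimal sketch of the counting estimator (our own helper `transition_mle`; the sequences are made up):

```python
import numpy as np

def transition_mle(sequences, K):
    """MLE of a K x K transition matrix from integer state sequences:
    A_ij = #(i -> j) / #(i -> anything)."""
    counts = np.zeros((K, K))
    for seq in sequences:
        for s, t in zip(seq[:-1], seq[1:]):
            counts[s, t] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Leave rows of never-visited states at zero instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Made-up state sequences over K = 3 states:
seqs = [[0, 1, 1, 2, 0], [1, 2, 2, 0, 1]]
A = transition_mle(seqs, K=3)
print(A)   # each visited row sums to 1
```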
Bayesian language model
Global and local parameter independence: the posteriors of the rows $A_{i\cdot}$ and $A_{i'\cdot}$ are factorized despite the v-structure on $X_t$, because $X_{t-1}$ acts like a multiplexer.
Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix:
$$\hat{A}_{ij}^{Bayes} \overset{\text{def}}{=} p(j \mid i, D, \beta_i) = \frac{\#(i \to j) + \beta_{ij}}{\#(i \to \cdot) + |\beta_i|}, \qquad |\beta_i| = \sum_j \beta_{ij}$$
which interpolates between the (normalized) prior and $\hat{A}_{ij}^{ML}$.
We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).
[Figure: chain X1 → X2 → ... → XT with Dirichlet-prior nodes on the rows Ai·, Ai'· of A]
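A short sketch of the smoothed (posterior-mean) estimate, reusing raw counts like those produced above (the counts and pseudo-counts are made up):

```python
import numpy as np

def transition_bayes(counts, beta):
    """Posterior-mean transition matrix under a per-row Dirichlet(beta_i) prior:
    A_ij = (counts_ij + beta_ij) / (counts_i. + |beta_i|)."""
    post = counts + beta
    return post / post.sum(axis=1, keepdims=True)

counts = np.array([[3., 1., 0.],
                   [0., 2., 2.],
                   [1., 0., 1.]])
beta = np.ones((3, 3))     # uniform pseudo-counts: no zero-probability bigrams
print(transition_bayes(counts, beta))
```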
Example: HMM: two scenarios
Supervised learning: estimation when the "right answer" is known. Examples:
- GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands.
- GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the "right answer" is unknown. Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: update the parameters $\theta$ of the model to maximize P(x|$\theta$) --- maximum likelihood (ML) estimation.
Recall definition of HMM
[Figure: HMM with hidden chain y1 → y2 → y3 → ... → yT (shared transition matrix A) and emissions yt → xt]
Transition probabilities between any two states:
$$p\big(y_t^j = 1 \mid y_{t-1}^i = 1\big) = a_{i,j}$$
or
$$p\big(y_t \mid y_{t-1}^i = 1\big) \sim \mathrm{Multinomial}\big(a_{i,1}, a_{i,2}, \ldots, a_{i,M}\big), \ \forall i \in I$$
Start probabilities:
$$p(y_1) \sim \mathrm{Multinomial}\big(\pi_1, \pi_2, \ldots, \pi_M\big)$$
Emission probabilities associated with each state:
$$p\big(x_t \mid y_t^i = 1\big) \sim \mathrm{Multinomial}\big(b_{i,1}, b_{i,2}, \ldots, b_{i,K}\big), \ \forall i \in I$$
or in general:
$$p\big(x_t \mid y_t^i = 1\big) \sim f(\cdot \mid \theta_i), \ \forall i \in I$$
Supervised ML estimation
Given x = x1...xN for which the true state path y = y1...yN is known, define:
- Aij = # of times the state transition i → j occurs in y
- Bik = # of times state i in y emits k in x
(here $x_{n,t}, y_{n,t}$, for $t = 1:T$, $n = 1:N$)
We can show that the maximum likelihood parameters are:
$$\hat{a}_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^T y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^T y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$$
$$\hat{b}_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^T y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^T y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$$
What if x is continuous? We can treat $\{(x_{n,t}, y_{n,t})\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian.
Supervised ML estimation, ctd.
Intuition: when we know the underlying states, the best estimate of $\theta$ is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x|$\theta$) is maximized, but $\theta$ is unreasonable; zero probabilities are VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0; bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
Solution for small training sets: add pseudocounts.
- Aij = # of times the state transition i → j occurs in y, + Rij
- Bik = # of times state i in y emits k in x, + Sik
Rij and Sik are pseudocounts representing our prior belief. The total pseudocounts, Ri = ΣjRij and Si = ΣkSik, are the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior:
- Larger total pseudocounts ⇒ strong prior belief.
- Small total pseudocounts: just to avoid 0 probabilities --- smoothing.
This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts. A sketch of the resulting counting rules follows.
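A minimal sketch of supervised HMM estimation with pseudocounts (our own helper `hmm_supervised_mle`; the casino data mirrors the example above, with states 0 = Fair, 1 = Loaded and faces 1..6 re-encoded as 0..5):

```python
import numpy as np

def hmm_supervised_mle(xs, ys, M, K, R=None, S=None):
    """Supervised estimates of HMM parameters from paired sequences.
    xs, ys: lists of equal-length integer sequences (observations, states).
    M: number of hidden states; K: number of emission symbols.
    R, S: optional M x M and M x K pseudo-count matrices."""
    A = np.zeros((M, M)) if R is None else R.astype(float)
    B = np.zeros((M, K)) if S is None else S.astype(float)
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1          # transition counts A_ij (+ R_ij)
        for t in range(len(y)):
            B[y[t], x[t]] += 1              # emission counts B_ik (+ S_ik)
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)

# All-Fair rolls; without pseudocounts the Loaded rows would be 0/0.
x = [[1, 0, 4, 5, 0, 1, 2, 5, 1, 2]]
y = [[0] * 10]
a, b = hmm_supervised_mle(x, y, M=2, K=6, R=np.ones((2, 2)), S=np.ones((2, 6)))
print(a); print(b)
```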
Summary: Learning GM
For a fully observed BN, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.
- Structural learning: Chow-Liu, neighborhood selection.
- Learning single-node GMs (density estimation): exponential-family distributions, typical discrete distributions, typical continuous distributions, conjugate priors.
- Learning two-node BNs (GLIMs): conditional density estimation, classification.
- Learning BNs with more nodes: local operations.
Supplemental review:
Classification: Generative and discriminative approaches
[Figure: two-node models Q → X (generative) and X → Q (discriminative)]
- Linear/logistic regression
- Two-node fully observed BNs
- Conditional mixtures
Classification
Goal: we wish to learn f: X → Y.
- Generative: modeling the joint distribution of all the data.
- Discriminative: modeling only the points at the boundary.
Conditional Gaussian
The data: $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}$
Both nodes are observed:
- Y is a class indicator vector: $p(y_n) = \mathrm{multi}(y_n : \pi) = \prod_k \pi_k^{y_n^k}$
- X is a conditional Gaussian variable with a class-specific mean:
$$p\big(x_n \mid y_n^k = 1, \mu, \sigma\big) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x_n - \mu_k)^2 \Big\}$$
$$p\big(x_n \mid y_n, \mu, \sigma\big) = \prod_k N(x_n : \mu_k, \sigma)^{y_n^k}$$
[Figure: plate model Yi → Xi, replicated N times]
MLE of conditional Gaussian
Data log-likelihood:
$$\ell(\theta; D) = \log \prod_n p(x_n, y_n) = \log \prod_n p(y_n \mid \pi) + \log \prod_n p\big(x_n \mid y_n, \mu, \sigma\big)$$
$$= \sum_n \sum_k y_n^k \log \pi_k \;-\; \sum_n \sum_k y_n^k \frac{1}{2\sigma^2} (x_n - \mu_k)^2 + C$$
MLE:
$$\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D) = \frac{\sum_n y_n^k}{N} = \frac{n_k}{N} \qquad \text{(the fraction of samples of class k)}$$
$$\hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D) = \frac{\sum_n y_n^k\, x_n}{\sum_n y_n^k} \qquad \text{(the average of samples of class k)}$$
[Figure: plate model Yi → Xi, replicated N times]
Bayesian estimation of conditional Gaussian
Priors:
$$P(\pi) = \mathrm{Dir}(\pi : \alpha), \qquad P(\mu_k) = \mathrm{Normal}(\mu_k : \nu_k, \tau_k)$$
Posterior mean (Bayesian estimates):
$$\hat{\pi}_{k,Bayes} = \frac{n_k + \alpha_k}{N + |\alpha|}$$
$$\hat{\mu}_{k,Bayes} = \frac{n_k/\sigma^2}{n_k/\sigma^2 + 1/\tau_k^2}\, \hat{\mu}_{k,ML} + \frac{1/\tau_k^2}{n_k/\sigma^2 + 1/\tau_k^2}\, \nu_k$$
[Figure: plate model Yi → Xi with priors on π and µ]
Classification
Gaussian discriminative analysis: the joint probability of a datum and its label is
$$p\big(x_n, y_n^k = 1 \mid \mu, \sigma\big) = p\big(y_n^k = 1\big)\, p\big(x_n \mid y_n^k = 1, \mu, \sigma\big) = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x_n - \mu_k)^2 \Big\}$$
Given a datum $x_n$, we predict its label using the conditional probability of the label given the datum:
$$p\big(y_n^k = 1 \mid x_n, \mu, \sigma\big) = \frac{\pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\big\{ -\frac{1}{2\sigma^2} (x_n - \mu_k)^2 \big\}}{\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\big\{ -\frac{1}{2\sigma^2} (x_n - \mu_{k'})^2 \big\}}$$
This is basic inference: introduce evidence, and then normalize.
[Figure: two-node model Yi → Xi]
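A minimal sketch of this prediction rule (our own helper `gda_predict_proba`; the two-class parameters are made up):

```python
import numpy as np

def gda_predict_proba(x, pi, mu, sigma2):
    """p(y = k | x) for a shared-variance Gaussian class-conditional model:
    form the joint pi_k * N(x; mu_k, sigma2), then normalize over k."""
    log_joint = np.log(pi) - 0.5 * (x - mu) ** 2 / sigma2   # shared constants cancel
    log_joint -= log_joint.max()                            # numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

# Hypothetical two-class model:
pi = np.array([0.6, 0.4])
mu = np.array([-1.0, 2.0])
print(gda_predict_proba(0.5, pi, mu, sigma2=1.0))
```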
Transductive classification
Given $x_n$, what is its corresponding $y_n$, when we know the answer for a set of training data?
Frequentist prediction: we fit $\pi$, $\mu$, and $\sigma$ from the data first, and then predict:
$$p\big(y_n^k = 1 \mid x_n, \mu, \sigma, \pi\big) = \frac{p\big(x_n, y_n^k = 1 \mid \mu, \sigma, \pi\big)}{p(x_n \mid \mu, \sigma, \pi)} = \frac{\pi_k\, N(x_n : \mu_k, \sigma)}{\sum_{k'} \pi_{k'}\, N(x_n : \mu_{k'}, \sigma)}$$
Bayesian: we compute the posterior distribution of the parameters first, and average over it.
[Figure: plate of training pairs (Yi, Xi), i = 1..N, sharing parameters with the query pair (Yn, Xn)]
Linear Regression
The data: $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}$
Both nodes are observed:
- X is an input vector
- Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
A regression scheme can be used to model p(y|x) directly, rather than p(x,y).
[Figure: plate model Xi → Yi, replicated N times]
A discriminative probabilistic model
Let us assume that the target variable and the inputs are related by the equation
$$y_i = \theta^T \mathbf{x}_i + \varepsilon_i$$
where ε is an error term of unmodeled effects or random noise.
Now assume that ε follows a Gaussian N(0, σ); then we have:
$$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \Big)$$
By the independence assumption:
$$L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \Big)^n \exp\Big( -\frac{\sum_{i=1}^n (y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \Big)$$
Linear regression
Hence the log-likelihood is:
$$\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^n (y_i - \theta^T \mathbf{x}_i)^2$$
Do you recognize the last term? Yes, it is:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^n (\theta^T \mathbf{x}_i - y_i)^2$$
It is the same as the MSE! Maximizing the likelihood is minimizing the squared error.
A recap: solving for θ
LMS update rule:
$$\theta^{t+1} = \theta^t + \alpha\, (y_n - \theta^{tT} \mathbf{x}_n)\, \mathbf{x}_n$$
Pros: on-line, low per-step cost. Cons: coordinate-wise, maybe slow-converging.
Steepest descent:
$$\theta^{t+1} = \theta^t + \alpha \sum_{i=1}^n (y_i - \theta^{tT} \mathbf{x}_i)\, \mathbf{x}_i$$
Pros: fast-converging, easy to implement. Cons: a batch method.
Normal equations:
$$\theta^* = (X^T X)^{-1} X^T \mathbf{y}$$
Pros: a single-shot algorithm! Easiest to implement. Cons: need to compute the pseudo-inverse $(X^T X)^{-1}$, which is expensive and has numerical issues (e.g., the matrix may be singular).
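A minimal numpy sketch comparing the normal equations with batch steepest descent on $J(\theta)$ (the data is simulated; `lstsq` stands in for forming the pseudo-inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.standard_normal(100)])  # bias + 1 feature
theta_true = np.array([2.0, -0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(100)            # y = theta^T x + noise

# Normal equations: theta* = (X^T X)^{-1} X^T y.
theta_ne, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch steepest descent on J(theta) = 1/2 sum_i (theta^T x_i - y_i)^2.
theta_gd = np.zeros(2)
alpha = 0.005
for _ in range(2000):
    theta_gd += alpha * X.T @ (y - X @ theta_gd)

print(theta_ne, theta_gd)   # both should be close to theta_true
```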
Bayesian linear regression
Simple GMs are the building blocks of complex BNs
- Density estimation: parametric and nonparametric methods [single node X]
- Regression: linear, conditional mixture, nonparametric [X → Y]
- Classification: generative and discriminative approaches [Q → X and X → Q]