Bayesian Linear Regression (University at Buffalo, CSE574, Chapter 3, Srihari)
TRANSCRIPT
Machine Learning Srihari
Topics in Bayesian Regression
• Recall Max Likelihood Linear Regression
• Parameter Distribution
• Predictive Distribution
• Equivalent Kernel
Linear Regression: model complexity M
• Polynomial regression

  y(x,w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j

  – Red curves are the best fits with M = 0, 1, 3, 9 and N = 10
  – M = 0 and M = 1: poor representations of sin(2πx)
  – M = 3: best fit to sin(2πx)
  – M = 9: over-fit, a poor representation of sin(2πx)
Max Likelihood Regression
• Input vector x, basis functions {φ_1(x), ..., φ_M(x)}:

  y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

  Radial basis functions: φ_j(x) = exp{ -(1/2)(x - μ_j)^T Σ^{-1} (x - μ_j) }

• Objective function: the max-likelihood objective with N examples {x_1,..,x_N} (equivalent to the mean-squared-error objective) is

  E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2

  Regularized MSE with N examples (λ is the regularization coefficient):

  E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w

• Closed-form ML solution:

  w_ML = (Φ^T Φ)^{-1} Φ^T t

  where Φ is the N × M design matrix with entries Φ_{nj} = φ_j(x_n), and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse. The regularized solution is:

  w_ML = (λI + Φ^T Φ)^{-1} Φ^T t

• Gradient descent: w^{(τ+1)} = w^{(τ)} - η ∇E, with

  ∇E = - Σ_{n=1}^{N} { t_n - w^{(τ)T} φ(x_n) } φ(x_n)

  Regularized version:

  ∇E = - [ Σ_{n=1}^{N} { t_n - w^{(τ)T} φ(x_n) } φ(x_n) ] + λ w^{(τ)}
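The closed-form ML and regularized solutions can be sketched numerically. A minimal NumPy sketch, with a polynomial basis; the data set, seed, and λ value are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)   # noisy sin(2*pi*x) targets

# Design matrix: Phi[n, j] = phi_j(x_n) = x_n**j (polynomial basis)
Phi = np.vander(x, M, increasing=True)

# Closed-form ML solution via the Moore-Penrose pseudo-inverse:
# w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.pinv(Phi) @ t

# Regularized solution: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
lam = 0.1
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

In practice `np.linalg.solve` on the normal equations is preferred over forming an explicit inverse.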
Shortcomings of MLE
• The M.L.E. of the parameters w does not address
  – M, the model complexity: how many basis functions?
• Over-fitting is controlled by the data size N
  – More data allows a better fit without overfitting
  – Regularization also controls overfitting (λ controls its effect):

  E(w) = E_D(w) + λ E_W(w)

  where E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 and E_W(w) = (1/2) w^T w

• But M and the choice of the φ_j are still important
  – M can be determined by a holdout set, but this is wasteful of data
• Model complexity and over-fitting are better handled using the Bayesian approach
Bayesian Linear Regression
• Using Bayes rule, the posterior is proportional to Likelihood × Prior:

  p(w | t) = p(t | w) p(w) / p(t)

  – where p(t|w) is the likelihood of the observed data
  – p(w) is the prior distribution over the parameters
• We will look at:
  – A normal distribution for the prior p(w)
  – A likelihood p(t|w) that is a product of Gaussians based on the noise model
  – And conclude that the posterior is also Gaussian
Gaussian Prior over Parameters
• Assume a multivariate Gaussian prior for w (which has components w_0,..,w_{M-1}):

  p(w) = N(w | m_0, S_0), with mean m_0 and covariance matrix S_0

• If we choose S_0 = α^{-1} I, the variances of the weights are all equal to α^{-1} and the covariances are zero
• Figure: contours of p(w) in (w_0, w_1) space, with zero mean (m_0 = 0) and isotropic over the weights (same variances)
Likelihood of Data is Gaussian
• Assume a noise precision parameter β:

  t = y(x,w) + ε, where ε is defined probabilistically as Gaussian noise: p(t | x, w, β) = N(t | y(x,w), β^{-1})
  (note that the output t is a scalar)

• The likelihood of t = {t_1,..,t_N} is then

  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )

  – This is the probability of the target data t given the parameters w and inputs X = {x_1,..,x_N}
  – Due to the Gaussian noise, the likelihood p(t|w) is also Gaussian
Posterior Distribution is also Gaussian
• Prior: p(w) = N(w | m_0, S_0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise:

  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )

• It follows that the posterior p(w|t) is also Gaussian
• Proof: use a standard result for Gaussians. If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian:
  – Let p(w) = N(w | μ, Λ^{-1}) and p(t|w) = N(t | Aw + b, L^{-1})
  – Then the marginal is p(t) = N(t | Aμ + b, L^{-1} + A Λ^{-1} A^T) and the conditional is p(w|t) = N(w | Σ{A^T L(t - b) + Λμ}, Σ), where Σ = (Λ + A^T L A)^{-1}
Exact Form of the Posterior Distribution
• We have the prior p(w) = N(w | m_0, S_0) and the likelihood

  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )

• The posterior is also Gaussian, and can be written directly as

  p(w|t) = N(w | m_N, S_N)

  – where m_N is the mean of the posterior, given by m_N = S_N (S_0^{-1} m_0 + β Φ^T t), with Φ the design matrix (entries Φ_{nj} = φ_j(x_n))
  – and S_N is the covariance matrix of the posterior, given by S_N^{-1} = S_0^{-1} + β Φ^T Φ
• Figure: prior p(w | α) = N(w | 0, α^{-1} I) and posterior in (w_0, w_1) weight space, for scalar input x and y(x,w) = w_0 + w_1 x
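The posterior update m_N = S_N(S_0^{-1} m_0 + β Φ^T t), S_N^{-1} = S_0^{-1} + β Φ^T Φ is direct to compute. A minimal NumPy sketch; the quadratic data set, seed, and the values α = 2, β = 25 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 3
alpha, beta = 2.0, 25.0                  # prior precision, noise precision (assumed)

x = rng.uniform(-1.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)   # basis 1, x, x^2
w_true = np.array([-0.3, 0.5, 0.0])
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), N)

# Conjugate Gaussian update:
#   S_N^{-1} = S0^{-1} + beta * Phi^T Phi
#   m_N      = S_N (S0^{-1} m0 + beta * Phi^T t)
m0, S0 = np.zeros(M), np.eye(M) / alpha  # isotropic prior N(0, I/alpha)
S0_inv = np.linalg.inv(S0)
S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)
```

With m_0 = 0 and S_0 = α^{-1}I, m_N coincides with the ridge-regression solution with λ = α/β, which is a useful sanity check.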
Properties of the Posterior
1. Since the posterior p(w|t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean
   – Thus the maximum-posterior weight is w_MAP = m_N
2. For an infinitely broad prior S_0 = α^{-1} I, i.e., precision α → 0, the mean m_N reduces to the maximum likelihood value, i.e., the mean is the solution vector

   w_ML = (Φ^T Φ)^{-1} Φ^T t

3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, then the posterior at any stage acts as the prior distribution for the subsequent data points
Choose a Simple Gaussian Prior p(w)
• Zero mean (m_0 = 0), isotropic (same variances) Gaussian with a single precision parameter α:

  p(w | α) = N(w | 0, α^{-1} I)

• The corresponding posterior distribution is p(w|t) = N(w | m_N, S_N), where

  m_N = β S_N Φ^T t   and   S_N^{-1} = α I + β Φ^T Φ

• Note: β is the noise precision and α is the precision of the parameters w in the prior
• With infinitely many samples the posterior collapses to a point estimate
• Figure: prior p(w | α) ~ N(0, α^{-1} I) in (w_0, w_1) space, for y(x,w) = w_0 + w_1 x
Equivalence to MLE with Regularization
• Since

  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )   and   p(w | α) = N(w | 0, α^{-1} I)

• we have

  p(w | t) ∝ [ Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} ) ] N(w | 0, α^{-1} I)

• The log of the posterior is

  ln p(w | t) = -(β/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 - (α/2) w^T w + const

• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error

  E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w

  with the addition of the quadratic regularization term w^T w, with λ = α/β
Bayesian Linear Regression Example (Straight-Line Fit)
• Single input variable x, single target variable t
• Goal is to fit the linear model y(x,w) = w_0 + w_1 x
• The goal of linear regression is to recover w = [w_0, w_1] given the samples (x_n, t_n)
Data Generation
• Synthetic data are generated from f(x,w) = w_0 + w_1 x with parameter values w_0 = -0.3 and w_1 = 0.5
  – First choose x_n from U(x | -1, 1), then evaluate f(x_n, w)
  – Add Gaussian noise with standard deviation 0.2 to get the target t_n
  – Noise precision parameter: β = (1/0.2)^2 = 25
• For the prior over w, p(w | α) = N(w | 0, α^{-1} I), we choose α = 2
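The data-generation recipe above can be sketched directly; a minimal NumPy version (the sample size N and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20

x = rng.uniform(-1.0, 1.0, N)        # x_n drawn from U(x | -1, 1)
f = -0.3 + 0.5 * x                   # f(x, w) with w0 = -0.3, w1 = 0.5
t = f + rng.normal(0.0, 0.2, N)      # add Gaussian noise, standard deviation 0.2

beta = (1 / 0.2) ** 2                # noise precision = 25
alpha = 2.0                          # precision of the prior p(w | alpha)
```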
Sampling p(w) and p(w|t)
• Goal of Bayesian linear regression: determine p(w|t)
• Each sample of w represents a straight line y(x,w) = w_0 + w_1 x in data space (modified as examples arrive)
• Figure: the distribution in (w_0, w_1) space and six samples drawn from it, for the prior p(w) (no examples) and for the posterior p(w|t) (with two examples)
Sequential Bayesian Learning
• Since there are only two parameters, we can plot the prior and posterior distributions in parameter space
• We look at the sequential update of the posterior: the prior/posterior p(w) combines with the likelihood p(t|x,w), viewed as a function of w, to give p(w|t)
• Figure (rows, top to bottom):
  – Before any data point is observed: the prior alone
  – After the first data point (x_1, t_1) is observed: the likelihood for that point and the resulting posterior; a band in (w_0, w_1) space represents the values of w_0, w_1 whose straight lines pass near the data point
  – Likelihood for the 2nd point alone, and the posterior after two data points
  – Likelihood for the 20th point alone, and the posterior after twenty data points
• Right-hand column: six samples (regression functions) corresponding to y(x,w) with w drawn from the current posterior
• The white cross marks the true parameter value; each likelihood panel plots p(t|x,w) for a single data point, while each posterior panel plots p(w|t) given all points so far
• With infinitely many points the posterior becomes a delta function centered at the true parameters
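The sequential update described above (posterior at each stage becomes the prior for the next point) can be checked numerically: processing points one at a time must give the same posterior as one batch update. A sketch under assumed values α = 2, β = 25 and a synthetic straight-line data set:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0

x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 20)

def update(m, S, phi, t_n):
    """Condition the Gaussian N(w | m, S) on one observation (phi, t_n)."""
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(phi, phi))
    m_new = S_new @ (S_inv @ m + beta * phi * t_n)
    return m_new, S_new

# Start from the prior N(0, I/alpha) and fold in one point at a time
m, S = np.zeros(2), np.eye(2) / alpha
for x_n, t_n in zip(x, t):
    m, S = update(m, S, np.array([1.0, x_n]), t_n)

# Batch computation over all 20 points, for comparison
Phi = np.column_stack([np.ones(20), x])
S_batch = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
```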
Generalization of the Gaussian Prior
• The Gaussian prior over the parameters is

  p(w | α) = N(w | 0, α^{-1} I)

  Maximization of the posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error

  E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w

• A more general prior yields the lasso and its variations:

  p(w | α) = [ (q/2) (α/2)^{1/q} / Γ(1/q) ]^M exp( -(α/2) Σ_{j=1}^{M} |w_j|^q )

  – q = 2 corresponds to the Gaussian
  – This corresponds to minimization of the regularized error function

  (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q
Predictive Distribution
• Usually we are not interested in the value of w itself, but in predicting t for a new value of x:

  p(t | t, X, x), or p(t | t)

  – leaving out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach (sum rule of probability):

  p(t) = ∫ p(t, w) dw = ∫ p(t | w) p(w) dw

• We can now write

  p(t | t) = ∫ p(t | w) p(w | t) dw
Predictive Distribution with α, β, x, t
• We can predict t for a new value of x using

  p(t | t, α, β) = ∫ p(t | w, β) · p(w | t, α, β) dw

  – with explicit dependence on the prior parameter α, the noise parameter β, and the targets t in the training set (the conditioning variables X and x are left out for convenience, and the sum rule p(t) = ∫ p(t|w) p(w) dw has been applied)
• The first factor is the conditional of the target t given the weights w: p(t | x, w, β) = N(t | y(x,w), β^{-1})
• The second factor is the posterior over the weights w: p(w|t) = N(w | m_N, S_N), where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ
• The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian:

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
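The predictive mean m_N^T φ(x) and variance 1/β + φ(x)^T S_N φ(x) are easy to compute once m_N and S_N are known. A minimal sketch for the straight-line model, under assumed values α = 2, β = 25 and a synthetic data set:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0

x = rng.uniform(-1.0, 1.0, 15)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 15)
Phi = np.column_stack([np.ones(15), x])          # basis phi(x) = (1, x)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predict(x_new):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    phi = np.array([1.0, x_new])
    return m_N @ phi, 1.0 / beta + phi @ S_N @ phi

mean, var = predict(0.5)
```

The variance is always at least the noise floor 1/β; far from the data the parameter-uncertainty term grows.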
Variance of the Predictive Distribution
• The predictive distribution is a Gaussian:

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

• The first term, 1/β, represents the noise in the data; the second term represents the uncertainty associated with the parameters w, where S_N^{-1} = α I + β Φ^T Φ gives the covariance of the posterior p(w|t)
• Since the noise process and the distribution of w are independent Gaussians, their variances are additive
• As the number of samples increases the posterior becomes narrower:

  σ_{N+1}^2(x) ≤ σ_N^2(x)

• As N → ∞, the second term of the variance goes to zero, and the variance of the predictive distribution arises solely from the additive noise parameter β
Example of Predictive Distribution
• Data generated from sin(2πx)
• Model: nine Gaussian basis functions

  y(x,w) = Σ_{j=0}^{8} w_j φ_j(x) = w^T φ(x),   φ_j(x) = exp( -(x - μ_j)^2 / 2σ^2 )

• Noise model: p(t | x, w, β) = N(t | y(x,w), β^{-1})
• Predictive distribution:

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

  with m_N = β S_N Φ^T t, S_N^{-1} = α I + β Φ^T Φ, and α, β coming from the assumed prior p(w | α) = N(w | 0, α^{-1} I)
• Figure: plot of p(t|x) for one data point showing the mean of the predictive distribution (red) and one standard deviation (pink)
Predictive Distribution Variance
• Using data from sin(2πx), the Bayesian prediction is

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

  assuming a Gaussian prior over the parameters, p(w | α) = N(w | 0, α^{-1} I), a Gaussian noise model, p(t | x, w, β) = N(t | y(x,w), β^{-1}), S_N^{-1} = α I + β Φ^T Φ, and the design matrix Φ with entries Φ_{nj} = φ_j(x_n)
• σ_N(x), the standard deviation of t, is smallest in the neighborhood of the data points
• Uncertainty decreases as more data points are observed
• Figure (N = 1, 2, 4, 25): mean of the Gaussian predictive distribution, with a band of one standard deviation from the mean
• The plot only shows the point-wise predictive variance; to show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w)
Plots of the Function y(x,w)
• Draw samples w from the posterior distribution p(w|t) = N(w | m_N, S_N) and plot the corresponding functions y(x,w) = w^T φ(x)
• This shows the covariance between predictions at different values of x
• For a given function and a pair x, x', the values of y, y' are determined by k(x,x'), which in turn is determined by the samples
• Figure: samples for N = 1, 2, 4, 25
Disadvantage of Local Basis Functions
• The predictive distribution, assuming a Gaussian prior p(w | α) = N(w | 0, α^{-1} I) and Gaussian noise t = y(x,w) + ε, where the noise is defined probabilistically as p(t | x, w, β) = N(t | y(x,w), β^{-1}), is

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
  and S_N^{-1} = α I + β Φ^T Φ

• With localized basis functions, e.g., Gaussians, in regions away from the basis function centers the contribution of the second term of the variance σ_N^2(x) goes to zero, leaving only the noise contribution β^{-1}
• The model thus becomes very confident outside the region occupied by the basis functions
  – This problem is avoided by the alternative Bayesian approach of Gaussian Processes
Dealing with Unknown β
• If both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β), which is given by a Gaussian-gamma distribution; in the univariate case:

  p(μ, λ) = N( μ | μ_0, (βλ)^{-1} ) Gam( λ | a, b )

• In this case the predictive distribution is a Student's t-distribution:

  St(x | μ, λ, ν) = [ Γ(ν/2 + 1/2) / Γ(ν/2) ] (λ / πν)^{1/2} [ 1 + λ(x - μ)^2 / ν ]^{-ν/2 - 1/2}
Mean of p(w|t) has a Kernel Interpretation
• The regression function is

  y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

• If we take a Bayesian approach with Gaussian prior p(w) = N(w | m_0, S_0), then the posterior is p(w|t) = N(w | m_N, S_N), where

  m_N = S_N ( S_0^{-1} m_0 + β Φ^T t )
  S_N^{-1} = S_0^{-1} + β Φ^T Φ

• With the zero-mean isotropic prior p(w | α) = N(w | 0, α^{-1} I):

  m_N = β S_N Φ^T t,   S_N^{-1} = α I + β Φ^T Φ

• The posterior mean β S_N Φ^T t has a kernel interpretation
  – This sets the stage for kernel methods and Gaussian Processes
Equivalent Kernel
• The posterior mean of w is m_N = β S_N Φ^T t
  – where S_N^{-1} = S_0^{-1} + β Φ^T Φ, S_0 is the covariance matrix of the prior p(w), β is the noise parameter, and Φ is the design matrix, which depends on the samples
• Substitute this mean value into the regression function y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x); the mean of the predictive distribution at a point x is

  y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n = Σ_{n=1}^{N} k(x, x_n) t_n

  – where k(x, x') = β φ(x)^T S_N φ(x') is the equivalent kernel
• Thus the mean of the predictive distribution is a linear combination of the training-set target variables t_n
  – Note: the equivalent kernel depends on the input values x_n from the dataset because they appear in S_N
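The identity above, y(x, m_N) = Σ_n k(x, x_n) t_n with k(x, x') = β φ(x)^T S_N φ(x'), can be verified numerically. A sketch with a Gaussian basis; the data set, seed, basis width, and the values α = 2, β = 25 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
N, M = 30, 9

x = rng.uniform(-1.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

mu = np.linspace(-1.0, 1.0, M)                       # Gaussian basis centres
def phi(u):
    """Design rows phi(u): Gaussian basis with width s = 0.2."""
    return np.exp(-np.subtract.outer(np.atleast_1d(u), mu) ** 2 / (2 * 0.2 ** 2))

Phi = phi(x)                                         # N x M design matrix
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_star = 0.3
k = (beta * phi(x_star) @ S_N @ Phi.T)[0]            # k(x*, x_n) for n = 1..N

pred_direct = phi(x_star)[0] @ m_N                   # m_N^T phi(x*)
pred_kernel = k @ t                                  # sum_n k(x*, x_n) t_n
```

Both routes give the same prediction, since k^T t = β φ(x*)^T S_N Φ^T t = φ(x*)^T m_N.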
Kernel Function
• Regression functions such as

  y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n,   k(x, x') = β φ(x)^T S_N φ(x')

  that take a linear combination of the training-set target values are known as linear smoothers
• They depend on the input values x_n from the data set, since these appear in the definition of S_N^{-1} = S_0^{-1} + β Φ^T Φ through the design matrix Φ
Example of the Kernel for a Gaussian Basis
• Gaussian basis: φ_j(x) = exp( -(x - μ_j)^2 / 2s^2 )
• Equivalent kernel: k(x, x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ
• The data set used to generate the kernel was 200 values of x equally spaced in (-1, 1)
• Figure: plot of k(x, x') as a function of x and x'; for three values of x the behavior of k(x, x') is shown as a slice
• The kernels are localized around x, i.e., k(x, x') peaks when x = x'
• The kernel is used directly in regression: the mean of the predictive distribution is obtained by forming a weighted combination of the target values,

  y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n

• Data points close to x are given higher weight than points further removed from x
Equivalent Kernel for a Polynomial Basis Function
• Basis: φ_j(x) = x^j; kernel k(x, x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ
• Plotted as a function of x' for x = 0: a localized function of x', even though the corresponding basis function is nonlocal
• Data points close to x are given higher weight than points further removed from x
Equivalent Kernel for a Sigmoidal Basis Function
• Basis: φ_j(x) = σ( (x - μ_j) / s ), where σ(a) = 1 / (1 + exp(-a))
• Kernel: k(x, x') = β φ(x)^T S_N φ(x')
• A localized function of x', even though the corresponding basis function is nonlocal
Covariance between y(x) and y(x')
• Using p(w|t) ~ N(w | m_N, S_N) and k(x, x') = β φ(x)^T S_N φ(x'):

  cov[ y(x), y(x') ] = cov[ φ(x)^T w, w^T φ(x') ] = φ(x)^T S_N φ(x') = β^{-1} k(x, x')

• From the form of the equivalent kernel k(x, x'), the predictive means at nearby points y(x), y(x') are highly correlated; for more distant pairs the correlation is smaller
• An important insight: the value of the kernel function between two points is directly related to the covariance between their target values. The kernel captures the covariance
Predictive Plot vs. Posterior Plots
• The predictive distribution

  p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

  allows us to visualize the pointwise uncertainty in the predictions, governed by σ_N^2(x)
• Drawing samples from the posterior p(w|t) and plotting the corresponding functions y(x,w) lets us visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the kernel
Directly Specifying the Kernel Function
• The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression:
  – Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel,
  – directly define kernel functions and use them to make predictions for a new input x, given the observation set
• This leads to a practical framework for regression (and classification) called Gaussian Processes
Summing Kernel Values over Samples
• The equivalent kernel k(x, x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ, defines the weights by which the target values are combined to make a prediction at x:

  y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n

• It can be shown that the weights sum to one:

  Σ_{n=1}^{N} k(x, x_n) = 1   for all values of x

• This result can be understood intuitively: the summation is equivalent to considering the predictive mean ŷ(x) for a data set in which t_n = 1 for all n
• Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), then we can fit the training data exactly, and hence ŷ(x) = 1
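The sum-to-one property can be checked numerically. Note the argument above implicitly assumes the posterior mean reproduces the exact fit, so the check below uses a very broad prior (tiny α); the basis, widths, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
beta = 25.0
alpha = 1e-8        # near-flat prior: the sum-to-one result holds in this limit

x = rng.uniform(-1.0, 1.0, 50)
mu = np.linspace(-1.0, 1.0, 5)

def phi(u):
    """Bias term plus Gaussian basis functions (width 0.3)."""
    u = np.atleast_1d(u)
    gauss = np.exp(-np.subtract.outer(u, mu) ** 2 / (2 * 0.3 ** 2))
    return np.column_stack([np.ones(len(u)), gauss])

Phi = phi(x)        # N x M with N > M, linearly independent, includes a constant column
M = Phi.shape[1]
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

x_star = 0.25
weights = (beta * phi(x_star) @ S_N @ Phi.T)[0]   # k(x*, x_n) for all n
total = weights.sum()
```

Here Σ_n k(x*, x_n) = φ(x*)^T m_N computed with all t_n = 1; since the constant basis function fits that data exactly, the total is approximately 1.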
Kernel Function Properties
• The equivalent kernel k(x, x') = β φ(x)^T S_N φ(x') can be positive or negative
  – Although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training-set target variables
• The equivalent kernel satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector ψ(x) of nonlinear functions:

  k(x, z) = ψ(x)^T ψ(z),   where ψ(x) = β^{1/2} S_N^{1/2} φ(x)