
    MLPR: Tutorial Sheet 5 Answers

    School of Informatics, University of Edinburgh

    Instructor: Charles Sutton

    1. Consider the following classification problem. There are two real-valued features x1, x2 ∈ R and a binary class label. The class label is determined by

        y = 1 if x2 ≥ |x1|,  0 otherwise.    (1)

    (a) Can this be perfectly represented by a feedforward neural network without a hidden layer? Why or why not?

    (b) Let’s consider a simpler problem for a moment. Consider the classification problem

        y = 1 if x2 ≥ x1,  0 otherwise.    (2)

    Design a single neuron that solves this problem. Pick the weights by hand. For an activation function, use the hard threshold function

        h(a) = 1 if a ≥ 0,  0 otherwise.

    (c) Now go back to the classification problem at the beginning of this question. Design a two-layer feedforward network (i.e., one hidden layer) that represents this function. Use the hard threshold activation function as in the previous question. Hints: Use two units in the hidden layer. The unit from the last question will be one of the units, and you will need to design one more. Your output unit will essentially perform a binary AND operation on the hidden units.

    (d) Consider the network that you created in the previous question. Suppose that we replace the hard threshold activation in that network with a sigmoid activation function. Would your network still be correct? If not, is there a simple way to modify your network to make it approximately correct?

    Solution:

    (a) No. A neural network without a hidden layer simply computes a linear function of the inputs. The decision boundary required for this problem is not linear.

    (b) Call the neuron z1, and compute its output by

    z1 = h(v11x1 + v12x2 + v10)

    Set v11 = −1, v12 = 1, v10 = 0. Now z1 = 1 ⇐⇒ x2 − x1 ≥ 0, i.e. x2 ≥ x1.

    (c) Define another hidden unit z2 as

    z2 = h(v21x1 + v22x2 + v20)

    Set v21 = 1, v22 = 1, v20 = 0. Now z2 = 1 ⇐⇒ x1 + x2 ≥ 0, i.e. x2 ≥ −x1.


    The intersection of the region where z1 = 1 and the region where z2 = 1 is exactly the region where y = 1. We can set up a logical AND with the following output unit:

    y = h(w1z1 + w2z2 + w0),

    where we choose w1 = 1, w2 = 1, w0 = −1.5. (Any bias with −2 ≤ w0 < −1 works: then w1z1 + w2z2 + w0 ≥ 0 exactly when z1 = z2 = 1.)

    You can show that now y = 1 ⇐⇒ x2 ≥ |x1|. Try this on a few example points to verify that it works.

    (d) Something interesting happens here: along the line x2 = x1 the hidden unit z1 now outputs 0.5 rather than a hard 0 or 1, and similarly z2 outputs 0.5 along the line x2 = −x1. The hidden units are no longer binary, so the output unit no longer computes an exact AND, and the output changes smoothly near the boundary of the wedge instead of jumping between 0 and 1. Notice also that along the x1-axis (x2 = 0) we have z1 + z2 = σ(x1) + σ(−x1) = 1 for every x1, so the network produces the same output value along the whole axis even though the true label is 1 only at x1 = 0. Here is a contour of the output unit as a function of x1 and x2:

    [Contour plot of the network output as a function of x1 and x2, with both axes ranging from −4 to 4.]

    We can use the sigmoid function to approximate a threshold function, though. Define a family of functions

        σc(x) = 1 / (1 + exp{−cx}).

    As we take c → ∞, this approximates the hard threshold function h. What this means for us is that if we replace h with sigmoids, we can just multiply all the weights by a large positive constant and we will recover (approximately) the same network.
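    To make this concrete, here is a minimal NumPy sketch of the network from parts (c) and (d), using the weights chosen above. The sigmoid version simply scales all the weights by a large constant (c = 20 here is an arbitrary illustrative choice).

```python
import numpy as np

def hard_threshold(a):
    return (a >= 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def network(x1, x2, activation, c=1.0):
    """Two-layer network for y = 1 iff x2 >= |x1|.

    Hidden unit z1 fires when x2 >= x1, z2 fires when x2 >= -x1, and the
    output unit ANDs them. The constant c scales all the weights, which
    sharpens the sigmoid version towards the hard-threshold network.
    """
    z1 = activation(c * (-x1 + x2))         # v11 = -1, v12 = 1, v10 = 0
    z2 = activation(c * ( x1 + x2))         # v21 =  1, v22 = 1, v20 = 0
    return activation(c * (z1 + z2 - 1.5))  # w1 = 1, w2 = 1, w0 = -1.5

# A few test points (x1, x2) and their true labels y = 1[x2 >= |x1|].
pts = np.array([[0.0, 1.0], [1.0, 2.0], [-1.0, 2.0],
                [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
true_y = (pts[:, 1] >= np.abs(pts[:, 0])).astype(float)

print("true labels:   ", true_y)
print("hard threshold:", network(pts[:, 0], pts[:, 1], hard_threshold))
print("scaled sigmoid:", np.round(network(pts[:, 0], pts[:, 1], sigmoid, c=20.0), 3))
```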

    2. Let p(x) be a mixture of Gaussians

        p(x) = ∑_{k=1}^K πk N(x; µk, Σk).

    (a) What is E[X]?

    (b) What is Cov[X]? Hint: Use Cov[X] = E[XX^T] − E[X] E[X]^T.

    Solution:


    (a)

        E[X] = ∫ p(x) x dx
             = ∫ ∑_{k=1}^K πk N(x; µk, Σk) x dx
             = ∑_{k=1}^K πk µk.

    That is, the mean of the mixture is the weighted average of the individual means.

    (b) Define Ek(f(X)) to be the expectation of f according to component k, i.e.,

        Ek(f(X)) = ∫ N(x; µk, Σk) f(x) dx.

    A similar argument to the previous question shows that

        E[XiXj] = ∑_k πk Ek(XiXj).

    So for the covariance matrix we have

        Cov[X] = E[XX^T] − E[X] E[X]^T
               = ∑_{k=1}^K πk Ek(XX^T) − ∑_{j=1}^K ∑_{k=1}^K πj πk µj µk^T.

    Since Ek(XX^T) = Σk + µk µk^T, this can equivalently be written as Cov[X] = ∑_k πk (Σk + µk µk^T) − E[X] E[X]^T.
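    These formulas are easy to check numerically. Below is a small NumPy sketch that compares them with Monte Carlo estimates for an arbitrary two-component, two-dimensional mixture (the mixture parameters are made-up illustrative values; the closed-form covariance uses Ek(XX^T) = Σk + µk µk^T as noted above).

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary 2-component, 2-dimensional mixture (illustrative values).
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 1.0], [2.0, -1.0]])
Sigmas = np.array([[[1.0, 0.2], [0.2, 0.5]],
                   [[0.8, -0.3], [-0.3, 1.5]]])

# Closed-form mean and covariance of the mixture.
mean = np.sum(pis[:, None] * mus, axis=0)                  # sum_k pi_k mu_k
second_moment = np.sum(pis[:, None, None] *
                       (Sigmas + mus[:, :, None] * mus[:, None, :]), axis=0)
cov = second_moment - np.outer(mean, mean)                 # E[XX^T] - E[X]E[X]^T

# Monte Carlo check: sample component labels, then sample from each Gaussian.
n = 200_000
ks = rng.choice(len(pis), size=n, p=pis)
samples = np.concatenate([
    rng.multivariate_normal(mus[k], Sigmas[k], size=int(np.sum(ks == k)))
    for k in range(len(pis))
])

print("closed-form mean:", mean, "  MC mean:", samples.mean(axis=0))
print("closed-form cov:\n", cov)
print("MC cov:\n", np.cov(samples, rowvar=False))
```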

    3. Show that the Metropolis-Hastings algorithm for sampling from P(x) satisfies detailed balance w.r.t. P(x).

    Solution: Consider the transition probability from x to y and from y to x. Without loss of generality, presume P(x)Q(y|x) ≥ P(y)Q(x|y) (the symmetry of detailed balance means that if it holds in one direction it holds in both). Then the transition probability combines the acceptance probability and the proposal distribution:

        P(y|x) = [P(y)Q(x|y) / (P(x)Q(y|x))] Q(y|x) + δ(y − x) ∫ dy′ Q(y′|x) (1 − min{1, P(y′)Q(x|y′) / (P(x)Q(y′|x))}).

    The second term comes from the case where the proposal is rejected: we propose some y′, reject it, and end up back at x. We need to account for the fact that we could have proposed y′ anywhere, hence the integral. Therefore

        P(y|x)P(x) = [P(y)Q(x|y) / Q(y|x)] Q(y|x) + K(x) δ(y − x) = P(y)Q(x|y) + K(x) δ(y − x),

    where we have absorbed the integral term (multiplied by P(x)) into K(x).

    We can ignore the case where x = y, as detailed balance trivially holds then (P(x|x)P(x) = P(x|x)P(x)). So for x ≠ y:

        P(y|x)P(x) = P(y)Q(x|y)

    In the other direction, the acceptance probability is 1 (since P(x)Q(y|x) ≥ P(y)Q(x|y) by assumption), so the proposal is always accepted and we have

        P(x|y)P(y) = Q(x|y)P(y)

    This is equal to P(y|x)P(x), so detailed balance holds. This makes it clear why the use of proposals and acceptance is useful in building MCMC methods: as the x = y case trivially satisfies detailed balance, any imbalance in the probability elsewhere can simply be absorbed by a rejection and put into the x = y case without worrying about it.
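    For a concrete illustration, here is a small NumPy sketch that builds the full Metropolis-Hastings transition matrix for an arbitrary discrete target and proposal (both made up for this example) and checks detailed balance numerically.

```python
import numpy as np

# An arbitrary discrete target distribution P over 4 states (illustrative).
p = np.array([0.1, 0.4, 0.3, 0.2])

# An arbitrary proposal Q(y|x); rows index the current state x.
Q = np.array([[0.20, 0.50, 0.20, 0.10],
              [0.30, 0.10, 0.40, 0.20],
              [0.25, 0.25, 0.25, 0.25],
              [0.10, 0.20, 0.30, 0.40]])

n = len(p)
T = np.zeros((n, n))  # Metropolis-Hastings transition matrix, T[x, y] = P(y|x)
for x in range(n):
    for y in range(n):
        if x == y:
            continue
        accept = min(1.0, p[y] * Q[y, x] / (p[x] * Q[x, y]))
        T[x, y] = Q[x, y] * accept
    T[x, x] = 1.0 - T[x].sum()  # rejected proposals leave the chain at x

# Detailed balance: p[x] T[x, y] == p[y] T[y, x] for all pairs of states.
flow = p[:, None] * T
print("detailed balance holds:", np.allclose(flow, flow.T))
print("P is stationary:       ", np.allclose(p @ T, p))
```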


    4. Consider the univariate Gaussian model from the previous tutorial. The data is D = {x1, . . . , xn}, where each xi is independent with

        p(xi|µ, τ) = √(τ/(2π)) exp{ −(τ/2)(xi − µ)² }.

    Now we will assume that both µ and τ are unknown. We will place prior distributions:

        p(µ|µ0, τ0) = √(τ0/(2π)) exp{ −(τ0/2)(µ − µ0)² }

        p(τ|α, β) = (β^α / Γ(α)) τ^(α−1) exp{−βτ}

    Describe the Gibbs sampling algorithm to sample from the posterior p(µ, τ|D). Give the specific equations for each step.

    Solution: It will be easiest if we write down the joint distribution first (we will drop all terms that are constant with respect to µ and τ):

        p(D, µ, τ) = p(τ|α, β) p(µ|µ0, τ0) ∏i p(xi|µ, τ)
                   ∝ τ^(n/2+α−1) exp{ −(τ/2) ∑i (xi − µ)² − (τ0/2)(µ − µ0)² − βτ }

    The Gibbs sampler maintains a current guess (µt, τt) at every iteration t. It resamples µt+1 from p(µ|τt, D) and then τt+1 from p(τ|µt+1, D). We can compute these conditional distributions very quickly by using our tricks of ignoring constants.

    First, for τ, we have:

        p(τ|µ, D) ∝ p(D, µ, τ)
                  ∝ τ^(n/2+α−1) exp{ −(τ/2) ∑i (xi − µ)² − βτ }
                  = τ^(n/2+α−1) exp{ −τ (β + (1/2) ∑i (xi − µ)²) }
                  ∝ Gamma(α + n/2, β + (1/2) ∑i (xi − µ)²).

    So we resample τ from a Gamma distribution whose shape parameter depends only on the number of data points, and whose rate parameter depends on the squared deviations of the data from the current value of µ.

    The conditional distribution for µ requires a bit more algebra (we need to complete the square). There are two main tricks:

    (a) We can ignore any constants with respect to µ. This works within the exponential too: exp{µ + C} =exp {µ} exp {C} ∝ exp {µ} if C is constant with respect to µ.

    (b) Suppose that we are able to work out that

        p(µ|τ, D) ∝ exp{ −(t/2)(µ² − 2aµ) }.

    Then we know that p(µ|τ, D) is normal with mean a and precision t (this can be shown by completing the square). So we’re going to start with p(D, µ, τ) and try to massage it into this form.


    Using these tricks gives us:

        p(µ|τ, D) ∝ p(D, µ, τ)
                  ∝ τ^(n/2+α−1) exp{ −(τ/2) ∑i (xi − µ)² − (τ0/2)(µ − µ0)² − βτ }
                  ∝ exp{ −(τ/2) ∑i (xi − µ)² − (τ0/2)(µ − µ0)² }
                  = exp{ −(τ/2) ∑i xi² + τµ ∑i xi − (nτ/2) µ² − (τ0/2) µ² + τ0 µ µ0 − (τ0/2) µ0² }
                  ∝ exp{ µ² (−nτ/2 − τ0/2) + µ (τ ∑i xi + τ0 µ0) }
                  ∝ exp{ −((nτ + τ0)/2) (µ² − 2µ (τ ∑i xi + τ0 µ0)/(nτ + τ0)) },

    which is exactly the form in trick (b). So p(µ|τ, D) is normal with precision t = nτ + τ0 (that is, variance (nτ + τ0)^(−1)) and mean

        a = (τ ∑i xi + τ0 µ0)/(nτ + τ0) = (nτ/(nτ + τ0)) x̄ + (τ0/(nτ + τ0)) µ0,

    where by x̄ = n^(−1) ∑i xi we mean the sample mean of D. The conditional mean a has a nice interpretation: it is the weighted average of the sample mean x̄ and the prior mean µ0, where the weights are given by the strength of the prior τ0 and our current estimate τ of the strength of the likelihood. Notice that τ will change as we run the Gibbs sampler, and also that as n increases, we tend to give more weight to the sample mean rather than the prior mean.
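    Putting the two conditionals together, a minimal NumPy implementation of this Gibbs sampler might look as follows. The synthetic data, prior hyperparameters, and initialisation are arbitrary illustrative choices; note that NumPy parameterises the Gamma distribution by shape and scale, so the rate derived above must be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): draws from a Gaussian with known parameters.
true_mu, true_tau = 2.0, 4.0                       # tau is the precision (1/variance)
x = rng.normal(true_mu, 1.0 / np.sqrt(true_tau), size=100)

# Prior hyperparameters (arbitrary, weakly informative choices).
mu0, tau0 = 0.0, 0.1                               # Gaussian prior on mu
alpha, beta = 1.0, 1.0                             # Gamma prior on tau

def gibbs(x, n_iters=5000):
    """Gibbs sampler for p(mu, tau | D) using the conditionals derived above."""
    n, xbar = len(x), x.mean()
    mu, tau = 0.0, 1.0                             # arbitrary initialisation
    samples = np.empty((n_iters, 2))
    for t in range(n_iters):
        # mu | tau, D  ~  Normal with precision n*tau + tau0 and mean a
        prec = n * tau + tau0
        a = (n * tau * xbar + tau0 * mu0) / prec
        mu = rng.normal(a, 1.0 / np.sqrt(prec))
        # tau | mu, D  ~  Gamma(alpha + n/2, rate = beta + 0.5 * sum_i (x_i - mu)^2)
        rate = beta + 0.5 * np.sum((x - mu) ** 2)
        tau = rng.gamma(alpha + n / 2, scale=1.0 / rate)   # NumPy uses scale = 1/rate
        samples[t] = mu, tau
    return samples

samples = gibbs(x)
burn_in = 1000
print("posterior mean of mu :", samples[burn_in:, 0].mean())
print("posterior mean of tau:", samples[burn_in:, 1].mean())
```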
