
MA3K1 Mathematics of Machine Learning March 10, 2020

Test Problems

These problems are indicative of the type of problem you may encounter in the exam. The questions are a mixture of knowledge-based questions, questions that require calculation, and some that may be a bit more challenging. This is, however, not the exam format! The exam itself will follow the customary format of having one compulsory question worth 40 points, and four additional parts worth 20 points each, of which three should be chosen. Even though the problems in this set are indicative of exam questions, they may not cover all the material for the exam, nor will the exam cover all the topics touched upon in this set. It should therefore be used in conjunction with the problem sets from class and the lecture notes.

1 Statistical Learning Theory

(1) Explain the concept of generalization risk and how it relates to the problem of overfitting.

(2) State the Markov inequality, and use it to show that for any t ∈ R, λ ≥ 0 and a random variable X,

P(X ≥ t) ≤ e^{−λt} E[e^{λX}].

(3) Let H be a set of binary classifiers on a space X.

(a) Define the empirical risk R̂(h) with respect to the unit loss function based on n random samples, and the generalization risk R(h) with respect to a probability distribution on X × Y.

(b) Find the error in the following argument:

Let δ ∈ [0, 1) and assume that for every h ∈ H we have a bound

|R̂(h) − R(h)| ≤ C(n, δ) (1.1)

with probability at least 1 − δ. Let ĥ ∈ H be an empirical risk minimizer, i.e., R̂(ĥ) = min_{h∈H} R̂(h). Since (1.1) holds for every h with probability at least 1 − δ, it holds in particular for ĥ with probability at least 1 − δ, and therefore

|R̂(ĥ) − R(ĥ)| ≤ C(n, δ)

holds with probability at least 1− δ.

(4) Given an input space X and output space Y = {0, 1}, consider a random variable (X, Y) distributed on X × Y. Assume that X has a continuous distribution with density ρ, that the conditional density of X given Y = 1 is ρ1(x) := ρY=1(x), and that the conditional density of X given Y = 0 is ρ0(x) := ρY=0(x).


(a) Define the regression function and the notion of a Bayes classifier h∗ : X → Y for binary classification.

(b) Assuming that P(Y = 1) = p, show that the Bayes classifier for the problem above is given by

h∗(x) = 1 if ρ1(x)/ρ0(x) > (1 − p)/p, and h∗(x) = 0 otherwise.

(5) Explain the concept of Probably Approximately Correct (PAC) learning.

(6) Consider the following table, describing the function XOR: {0, 1}2 → {0, 1} (this is the “exclusive or” function).

x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

Let H be the set of linear classifiers on R2, where each h ∈ H is of the form

h(x) = 1 if wTx + b > 0, and h(x) = 0 if wTx + b ≤ 0,

for some w ∈ R2 and b ∈ R.

(a) Show that there is no h ∈ H such that h(x) = XOR(x) for x ∈ {0, 1}2.

(b) Determine the number of functions f : {0, 1}2 → {0, 1} that can be represented by elements of H.

(7) Let X = {0, 1}^d ⊂ Rd. For every I ⊂ [d] = {1, . . . , d}, consider a function hI : X → {0, 1}, where

hI(x) = 1 if {i : xi = 1} ⊂ I, and hI(x) = 0 else.

Let H denote the set of all such functions. Define the notion of VC dimension, and show that the VC dimension of H is at least d.

(8) Define the empirical Rademacher complexity and the Rademacher complexity of a finite set of classifiers H with values in {−1, 1} and with respect to an arbitrary loss function L taking values in [0, 1]. State Massart’s Lemma and use it to derive the Rademacher complexity bound

Rn(L ◦ H) ≤ √(2 log(|H|)/n).


(9) Consider the unit ℓ1-ball

B_1^d := B_1^d(0, 1) = {x ∈ Rd : ‖x‖1 ≤ 1},

where ‖x‖1 = ∑_{i=1}^d |xi| is the ℓ1-norm. Let d1 be the metric d1(x, y) = ‖x − y‖1. Define the notion of an ε-net and the covering number, and show that for ε ∈ (0, 1), the covering number N(B_1^d, d1, ε) is bounded by

N(B_1^d, d1, ε) ≤ (c/ε)^d

for some constant c independent of d and ε. You may use that the volume of the ℓ1-ball of radius r scales as vol B_1^d(0, r) = r^d · vol B_1^d.

2 Optimization

(10) For each of the functions below, state (with justification) whether it is convex or not.

(a) f(x) = log(1− e−x) for x ∈ R;

(b) f(x) = |x1| + x2² for x ∈ R2;

(c) f(x) = x1² − sin(x2) for x ∈ R2;

(d) f(x) = min{max{x1, x2}, 0} for x ∈ R3;

(e) f(x) = 1{x ≥ 1} for x ∈ R.

(11) For each of the sets below, determine whether they are convex or not.

(a) C = {x ∈ R2 : ‖x‖1 = 1};

(b) C = {x ∈ R2 : x1² − x2² = 0};

(c) C = {x ∈ R2 : x2 ≥ e^{x1}};

(d) C = {A ∈ Rn×n : det(A) ≠ 0}.

(12) Let f ∈ C1(Rd) be a convex function and C ⊂ Rd a convex set. Show that x∗ is an optimal point of the optimization problem

minimize f(x) subject to x ∈ C

if and only if for all y ∈ C, 〈∇f(x∗), y − x∗〉 ≥ 0. Show, by counterexample, that the statement is not true if C is not required to be convex.

(13) Let f(x) = (x1² + x2)² be a function on R2. Show that the direction p = (−1, 1)T is a descent direction at x0 = (1, 0)T, and determine a step length α that minimizes f(x0 + αp).


(14) Determine the Lagrange dual of the following problem.

minimize x² − 1 subject to (x − 3)(x − 1) ≤ 0.

Does strong duality hold?

(15) Define the concept of a linear Support Vector Machine (SVM) and show how a separating hyperplane of maximal margin can be found by solving a quadratic optimization problem.

(16) Determine (with justification or counterexample) which of the following statements is true and which is false.

(a) A convex function has a unique minimizer;

(b) A function f ∈ C2(Rd) is convex if and only if for all x,y ∈ Rd,

f(y) ≥ f(x) + 〈∇f(y),y − x〉;

(c) Gradient descent converges linearly;

(d) A convex function is differentiable.

(17) Define the notions of Lipschitz continuity, α-convexity, and β-smoothness. Show directly, without using the convergence bound for gradient descent for α-convex and β-smooth functions, that if α = β for some function f, then gradient descent with a suitable step size converges in one iteration.

(18) Describe stochastic gradient descent and explain the benefits and limitations as opposed to gradient descent.

3 Deep Learning

(19) Describe the backpropagation algorithm for deep neural networks.

(20) Consider the logical AND function on {0, 1}2, defined as AND(x1, x2) = 1 if x1 = 1 and x2 = 1, and AND(x1, x2) = 0 else. Specify a neural network with ReLU activation (that is, ρ(x) = max{x, 0}) that implements the AND function.

(21) Consider the following function:


Figure 1: The saw function

Describe a neural network with ReLU activation (that is, ρ(x) = max{x, 0}) that implements this function on the interval [0, 1].

(22) Suppose you are given the data set displayed in Figure 2. Describe the structure of a neural network with logistic sigmoid activation function that separates the triangles from the circles, and sketch a decision boundary. The output should consist of a number p ∈ [0, 1] such that p > 1/2 when the input is a circle and p < 1/2 otherwise.

Figure 2: A classification problem

(23) Describe the idea of a Convolutional Neural Network and the features that make it useful for image classification tasks.

(24) Define the robustness of a classifier with a finite number of classes {1, . . . , K} and the distance to misclassification of a point. Given a linear classifier h, describe an algorithm that takes any input x and computes the smallest perturbation r such that x + r is in a different class than x.


(25) Suppose we have two generators Gi : Zi → X, i ∈ {0, 1}, and one discriminator D : X → {0, 1}. Assume that on Zi we have probability densities ρZi, and if Zi is a random variable on Zi distributed according to this density, then Xi = Gi(Zi) is a random variable on X distributed with density ρXi, for i ∈ {0, 1}. The goal of D is to determine whether an observed random sample x ∈ X was generated by G0 or G1.

Describe the problem of training D to distinguish data coming from G0 from data coming from G1 as an optimization problem, and characterize an optimal solution.


Solutions

Solution (1) We assume that there is an input space X, an output space Y, and a probability distribution on X × Y. Given a map h : X → Y and a loss function L defined on Y × Y, the generalization risk of h is the expected value

R(h) = E[L(h(X), Y )],

where (X, Y) is distributed on X × Y according to the given distribution. The generalization risk gives an indication of how h “performs”, on average, on unseen data.

If we observe n samples (x1, y1), . . . , (xn, yn) drawn from this distribution, we can construct a classifier h by minimizing the empirical risk

R̂(h) = (1/n) ∑_{i=1}^n L(h(xi), yi)

over a class of functions H. When the class H is very large (for example, if it consists of all possible functions from X to Y), then we can find a function h for which the empirical risk R̂(h) is very small or even zero (if we simply “memorise” the observed data). Such a function h can have a large generalization risk: it has been adapted too closely to the observed data, at the expense of not generalizing well to unobserved data. This is called overfitting.
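As a rough numerical illustration of this point (my own addition, not part of the original solution), the following Python sketch compares the empirical risk of a classifier that simply memorises its training data with a Monte Carlo estimate of its generalization risk; the toy distribution, the sample sizes and the helper names are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # Toy distribution: X uniform on [0, 1], Y = 1{X > 1/2} with 20% label noise.
        x = rng.random(n)
        y = (x > 0.5).astype(int)
        flip = rng.random(n) < 0.2
        return x, np.where(flip, 1 - y, y)

    x_train, y_train = sample(50)
    lookup = dict(zip(x_train, y_train))

    def h_memorise(x):
        # Returns the stored label on training points and a constant guess elsewhere.
        return lookup.get(x, 0)

    def risk(h, xs, ys):
        # Average unit loss 1{h(x) != y} over the given sample.
        return np.mean([h(x) != y for x, y in zip(xs, ys)])

    x_test, y_test = sample(100_000)
    print("empirical risk      :", risk(h_memorise, x_train, y_train))  # 0.0
    print("generalization risk :", risk(h_memorise, x_test, y_test))    # roughly 0.5

The empirical risk is zero because the classifier reproduces the training labels exactly, while on fresh data it is no better than a constant guess.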

Solution (2) The Markov inequality states that, given t > 0 and a non-negative random variable X, we have

P(X ≥ t) ≤ E[X]/t.

Given any t ∈ R and λ > 0, the function x ↦ e^{λx} is strictly increasing and e^{λX} is a non-negative random variable, and hence

P(X ≥ t) = P(e^{λX} ≥ e^{λt}) ≤ E[e^{λX}]/e^{λt},

where the last inequality follows by applying Markov’s inequality to e^{λX}. For λ = 0 the claimed bound holds trivially, since the right-hand side equals 1.
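A quick Monte Carlo sanity check of this bound (my own addition; the choice X ~ N(0, 1), the threshold t = 2 and the values of λ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(1_000_000)   # samples of X ~ N(0, 1)
    t = 2.0

    p_emp = np.mean(x >= t)
    for lam in [0.5, 1.0, 2.0]:
        bound = np.exp(-lam * t) * np.mean(np.exp(lam * x))
        print(f"lambda = {lam}: bound = {bound:.4f},  P(X >= {t}) ~ {p_emp:.4f}")
    # For X ~ N(0, 1) one has E[e^{lam X}] = e^{lam^2/2}, so the bound e^{lam^2/2 - lam t}
    # is smallest at lam = t, giving e^{-2} ~ 0.135 here, compared with P(X >= 2) ~ 0.023.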

Solution (3) (a) The generalization risk is the expected value

R(h) = E[1{h(X) ≠ Y}] = P(h(X) ≠ Y).

The empirical risk is

R̂(h) = (1/n) ∑_{i=1}^n 1{h(Xi) ≠ Yi}.

(b) Assume that we have n random samples (X1, Y1), . . . , (Xn, Yn) that are independent and identically distributed. For each fixed h ∈ H, we have the bound

|R̂(h) − R(h)| ≤ C(n, δ),


with probability at least 1 − δ, by assumption. Now if ĥ is the minimizer of R̂(h), that is,

ĥ = argmin_{h∈H} (1/n) ∑_{i=1}^n 1{h(Xi) ≠ Yi},

then ĥ is also a random variable: it depends on the random data (Xi, Yi). Hence, in the expression |R̂(ĥ) − R(ĥ)|, the argument ĥ is not a fixed classifier but varies with the data (Xi, Yi), so the bound for a fixed h cannot simply be applied to it. To conclude, one needs a bound that holds uniformly over all h ∈ H with probability at least 1 − δ (for example, obtained via a union bound).

Solution (4) (a) The regression function for binary classification is

f(x) = E[Y | X = x] = P(Y = 1 | X = x).

The Bayes classifier can be defined as

h∗(x) = 1 if f(x) > 1/2, and h∗(x) = 0 else.

(b) To compute the Bayes classifier, we use Bayes’ rule:

P(Y = 1 | X = x) = ρ1(x)P(Y = 1)/ρ(x) = p · ρ1(x) / ((1 − p)ρ0(x) + pρ1(x)).

We can rearrange:

p · ρ1(x) / ((1 − p)ρ0(x) + pρ1(x)) > 1/2  ⇔  ρ1(x)/ρ0(x) > (1 − p)/p.
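As an illustration only (my own addition), the following sketch evaluates this decision rule for made-up Gaussian class-conditional densities ρ0, ρ1 and a made-up prior p; none of these choices come from the notes.

    import numpy as np
    from scipy.stats import norm

    # Illustrative choices: rho1 = density of N(1, 1), rho0 = density of N(-1, 1), p = P(Y = 1) = 0.3.
    p = 0.3
    rho1 = norm(loc=1.0, scale=1.0).pdf
    rho0 = norm(loc=-1.0, scale=1.0).pdf

    def bayes_classifier(x):
        # h*(x) = 1 iff rho1(x)/rho0(x) > (1 - p)/p, as derived above.
        return (rho1(x) / rho0(x) > (1 - p) / p).astype(int)

    xs = np.linspace(-3, 3, 7)
    for x, label in zip(xs, bayes_classifier(xs)):
        print(f"x = {x:+.1f} -> h*(x) = {label}")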

Solution (5) A hypothesis class H is called PAC-learnable if there exists a classifier ĥ ∈ H depending on n random samples (Xi, Yi), i ∈ {1, . . . , n}, and a polynomial function p(x, y, z, w), such that for any ε > 0 and δ ∈ (0, 1), for all distributions on X × Y,

P(R(ĥ) ≤ inf_{h∈H} R(h) + ε) ≥ 1 − δ

holds whenever n ≥ p(1/ε, 1/δ, size(X), size(H)).

Solution (6) (a) A function h ∈ H that implements XOR would correspond to a line in R2 that separates the points {(0, 0), (1, 1)} from the points {(0, 1), (1, 0)}. Any line h(x) = wTx + b that has (0, 0) on one side, say, h(0, 0) < 0, has to have b < 0. Assume that h separates (0, 0) from (0, 1) and (1, 0). Then h(1, 0) > 0 implies w1 + b > 0, and h(0, 1) > 0 implies that w2 + b > 0. In particular, if we sum these two expressions, we get w1 + w2 + 2b > 0. But since b < 0, this implies that h(1, 1) = w1 + w2 + b > 0. This shows that any line separating (0, 0) from (0, 1) and (1, 0) also separates (0, 0) from (1, 1), and hence XOR is not possible via linear classifiers.

(b) The problem amounts to finding the ways in which the corners of a square can be separated by lines. There are 16 = 2^4 possible binary functions.


x1 x2 |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 0  0 |  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
 0  1 |  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1
 1  0 |  0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1
 1  1 |  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1

The task is to figure out which of these can be implemented by a linear classifier. Clearly, for every such function f that can be implemented by linear separation, the inverted 1 − f can also be implemented (if f(x) = 1{h(x) > 0}, then 1 − f(x) = 1{−h(x) > 0}, after perturbing b slightly so that no corner lies exactly on the line). Clearly, cases 1 and 16 can be implemented (take any line that contains all the points on one side). We also get all the cases in the above table with exactly one 1 or exactly one 0 by a line that has one point on one side and all the others on the other side (like x2 − x1 + 1/2, for example). This makes another 8 cases. By considering conditions of the form

xi − 1/2 > 0, xi − 1/2 < 0, i ∈ {1, 2}

we get four additional cases, taking the total number of functions that can be implemented using H to 14. The remaining two cases are XOR and 1 − XOR, which cannot be implemented using H as seen in Part (a).
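This count can be confirmed by brute force. The sketch below (my own addition) tests each of the 16 Boolean functions for strict linear separability via a small feasibility LP, requiring wTx + b ≥ 1 on the 1-points and wTx + b ≤ −1 on the 0-points; for these four corner points this is equivalent to being implementable by some h ∈ H. It relies on scipy’s linprog.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    count = 0
    for labels in itertools.product([0, 1], repeat=4):         # all 16 Boolean functions
        A, rhs = [], []
        for point, label in zip(corners, labels):
            row = [point[0], point[1], 1.0]                     # coefficients of (w1, w2, b)
            if label == 1:
                A.append([-v for v in row]); rhs.append(-1.0)   # w.x + b >= 1
            else:
                A.append(row); rhs.append(-1.0)                 # w.x + b <= -1
        res = linprog(c=[0, 0, 0], A_ub=A, b_ub=rhs,
                      bounds=[(None, None)] * 3, method="highs")
        count += res.success
        if not res.success:
            print("not linearly separable:", labels)            # XOR and 1 - XOR
    print("linearly separable functions:", count)               # 14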

Solution (7) Let H be a set of classifiers h : X → Y. The growth function ΠH is defined as

ΠH(n) = max_{x∈X^n} |{h(x) : h ∈ H}|,

where h(x) = (h(x1), . . . , h(xn)) if x = (x1, . . . , xn). The VC dimension of H is

VC(H) = max{n ∈ N : ΠH(n) = 2^n}.

We can take the set x = (e1, . . . , ed), where ei is the i-th unit vector. Then

|{h(x) : h ∈ H}| = 2^d,

since for any 0–1 vector σ of length d, taking the function hI where I consists of those indices i for which σi = 1 gives the sign pattern hI(x) = σ. This shows that the VC dimension is at least d.

Solution (8) Set K = |H|. For z = (z1, . . . , zn), with zi = (xi, yi), we can define the empirical Rademacher complexity as

R̂z(L ◦ H) = Eσ[ sup_{g∈L◦H} (1/n) ∑_{i=1}^n σi g(zi) ],

where g(zi) = L(h(xi), yi), and the Rademacher complexity as

Rn(L ◦ H) = EZ[R̂Z(L ◦ H)].


The expectation is with respect to independent Rademacher random variables σi, taking values 1 and −1 with equal probability. Massart’s Lemma states that if S ⊂ Rn consists of K elements, then

R(S) ≤ (r · √(2 log(K))) / n,

where r is the maximal 2-norm of elements in S and R denotes the Rademacher complexity of a set. For fixed z = (z1, . . . , zn), Massart’s Lemma implies that the empirical Rademacher complexity is bounded by

R̂z(L ◦ H) ≤ (r · √(2 log(K))) / n,

where

r = sup{‖x‖ : x ∈ S},  S = {(g(z1), . . . , g(zn)) : g ∈ L ◦ H}.

Since by assumption g(z) = L(h(x), y) ∈ [0, 1], we have r ≤ √n, and taking the expectation over Z we get

Rn(L ◦ H) ≤ √(2 log(K)/n).

Solution (9) There are different ways of solving this problem. One would be to copy verbatim the proof for the ℓ2-ball done in class. The covering number corresponding to a set S, a (pseudo)metric d and ε > 0 is defined as the smallest size of an ε-net for S. An ε-net T for S, in turn, is defined as a set of points such that for every x ∈ S there exists y ∈ T with d(x, y) ≤ ε. To construct an ε-net for B_1^d, start with T = {x1} and add points successively. If T = {x1, . . . , xk}, find a point x ∈ S such that d1(x, xj) > ε for 1 ≤ j ≤ k, if possible, and stop if not. This process will eventually terminate since B_1^d is bounded, and the resulting set T = {x1, . . . , xN} is (by construction) an ε-net. Also by construction, the balls B_1^d(xj, ε/2) around the xj are disjoint, and the union of these balls is contained in the larger ball B_1^d(0, 1 + ε/2) of radius 1 + ε/2. The volume of this ball is greater than or equal to the volume of the union of the balls B_1^d(xj, ε/2),

N · vol B_1^d(x1, ε/2) = vol(⋃_j B_1^d(xj, ε/2)) ≤ vol B_1^d(0, 1 + ε/2),

and hence

N(B_1^d, d1, ε) ≤ |T| = N ≤ vol B_1^d(0, 1 + ε/2) / vol B_1^d(x1, ε/2) = ((1 + ε/2)/(ε/2))^d ≤ (3/ε)^d

for ε < 1.
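The greedy process can also be mimicked numerically. The sketch below (my own addition) is only an approximation of the construction above: candidate points are sampled at random from the ball rather than searched exhaustively, and the sampler, the number of trials and the values of ε are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(2)

    def random_point_l1_ball(d):
        # A random point of the l1 ball: random signs times a point of the simplex, rescaled.
        u = rng.dirichlet(np.ones(d)) * rng.random() ** (1 / d)
        return u * rng.choice([-1, 1], size=d)

    def greedy_net(d, eps, trials=20_000):
        # Keep any sampled point whose l1 distance to the current net exceeds eps.
        net = [np.zeros(d)]
        for _ in range(trials):
            x = random_point_l1_ball(d)
            if min(np.sum(np.abs(x - y)) for y in net) > eps:
                net.append(x)
        return net

    for eps in [0.75, 0.5, 0.25]:
        size = len(greedy_net(2, eps))
        print(f"d = 2, eps = {eps}: greedy set of size {size}, bound (3/eps)^d = {(3 / eps) ** 2:.0f}")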

Solution (10)

(a) The function is not convex on R, since it is not defined on all of R.


(b) The function x ↦ x2² is convex, since the Hessian matrix is positive semidefinite. The function x ↦ |x1| is convex, since |λx1 + (1 − λ)y1| ≤ λ|x1| + (1 − λ)|y1| by the triangle inequality. As the sum of two convex functions is convex, it follows that f is convex.

(c) The function f is not convex, as the Hessian matrix is not positive semidefinite at every point.

(d) The function f is not convex. The function m(x) := max{x1, x2} is convex, since for any x, y and λ ∈ [0, 1] each of the following inequalities is trivially true,

λx1 + (1 − λ)y1 ≤ λ max{x1, x2} + (1 − λ) max{y1, y2},
λx2 + (1 − λ)y2 ≤ λ max{x1, x2} + (1 − λ) max{y1, y2},

and the inequality

max{λx1 + (1 − λ)y1, λx2 + (1 − λ)y2} ≤ λ max{x1, x2} + (1 − λ) max{y1, y2}

is therefore also true. Now assume that f(x) = min{m(x), 0} were convex. Consider two distinct points x and y with m(x) > 0 and m(y) < 0, and let λ ∈ (0, 1) be such that m(λx + (1 − λ)y) > 0 (this is possible by taking λ close to 1). Then

0 = min{m(λx + (1 − λ)y), 0} ≤ λ min{m(x), 0} + (1 − λ) min{m(y), 0} = (1 − λ)m(y) < 0,

which is a contradiction.

(e) The function f is not convex. Take, for example, x = 0, y = 2 and λ = 1/2. Then f(λx + (1 − λ)y) = f(1) = 1, but λf(x) + (1 − λ)f(y) = 1/2 < 1.

Solution (11)

(a) The set C is not convex. Take the points (0, −1) and (0, 1). These are in C, but their convex combination (0, 0) = 1/2 (0, −1) + 1/2 (0, 1) is not.

(b) The set is not convex: it consists of the union of the lines x1 = x2 and x1 = −x2. The points (1, 1) and (1, −1) are in this set, but their convex combination (1, 0) = 1/2 (1, 1) + 1/2 (1, −1) is not.

(c) This set is convex: it is the area above the graph of the exponential function. Formally, taking two points (x1, x2) and (y1, y2) in C and λ ∈ [0, 1], we have

λx2 + (1 − λ)y2 ≥ λ e^{x1} + (1 − λ) e^{y1}

by the definition of C. The function t ↦ e^t is convex (as can be seen by looking at the second derivative), and therefore

λ e^{x1} + (1 − λ) e^{y1} ≥ e^{λx1 + (1−λ)y1}.

This shows that λx + (1 − λ)y ∈ C.

(d) This set is not convex. Take any non-singular matrix A and its negative −A. The zero matrix is a convex combination of these (their midpoint), but it is singular.


Solution (12) Suppose x∗ is such that for all y ∈ C,

〈∇f(x∗),y − x∗〉 ≥ 0, (3.1)

Then, since f is a convex function, for all y ∈ C we have

f(y) ≥ f(x∗) + 〈∇f(x∗),y − x∗〉 ≥ f(x∗),

which shows that x∗ is a minimizer in C. To show the opposite direction, assume that x∗ is a minimizer but that (3.1) does not hold. This means that there exists a y ∈ C such that 〈∇f(x∗), y − x∗〉 < 0. Since both x∗ and y are in C and C is convex, any point z(λ) = (1 − λ)x∗ + λy with λ ∈ [0, 1] is also in C. At λ = 0 we have

d/dλ f(z(λ))|_{λ=0} = 〈∇f(x∗), y − x∗〉 < 0.

Since the derivative at λ = 0 is negative, the function f(z(λ)) is decreasing at λ = 0, and therefore, for small λ > 0, f(z(λ)) < f(z(0)) = f(x∗), in contradiction to the assumption that x∗ is a minimizer.

To see that the statement need not be true if C is not convex, let C = {(x, y) : x² + (y + 0.5)² ≥ 1} and consider the function f(x, y) = x² + y². Over C, this function has the unique minimizer x∗ = (0, 0.5)T, and the gradient at this point is ∇f(x∗) = (0, 1)T. However, there exist plenty of points y ∈ C for which the second coordinate of y − x∗ is negative, contradicting the statement.

Solution (13) The gradient of f(x) is given by

∇f(x) = 2(x1² + x2) · (2x1, 1)T.

At (1, 0), this evaluates to (4, 2)T. The inner product with p is negative, so this is a descent direction. To compute the optimal step length, we need to minimize the function

ϕ(α) = f(x0 + αp) = ((1 − α)² + α)².

Setting the derivative to 0,

ϕ′(α) = 2((1 − α)² + α)(1 − 2(1 − α)) = 2((1 − α)² + α)(2α − 1) = 0,

we see that the only real root is α = 0.5. Moreover, the second derivative is positive, ϕ′′(0.5) > 0, and we therefore get a minimum. The step length is therefore 0.5.
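A short numerical check of the gradient, the sign of 〈∇f(x0), p〉 and the optimal step length (my own addition; the grid of step lengths is arbitrary):

    import numpy as np

    def f(x):
        return (x[0] ** 2 + x[1]) ** 2

    def grad_f(x):
        return 2 * (x[0] ** 2 + x[1]) * np.array([2 * x[0], 1.0])

    x0 = np.array([1.0, 0.0])
    p = np.array([-1.0, 1.0])

    print("gradient at x0:", grad_f(x0))          # [4. 2.]
    print("<grad, p>     :", grad_f(x0) @ p)      # -2.0 < 0, so p is a descent direction

    alphas = np.linspace(0, 1, 101)
    values = [f(x0 + a * p) for a in alphas]
    print("best step on the grid:", alphas[int(np.argmin(values))])   # 0.5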

Solution (14) The Lagrangian function is

L(x, λ) = x² − 1 + λ(x − 1)(x − 3).

The derivative of the Lagrangian with respect to x is

∂L/∂x = 2(1 + λ)x − 4λ.


Setting the derivative to zero, we get for λ ≠ −1,

x = 2λ/(1 + λ). (3.2)

If λ > −1, this is a minimum, as the second derivative shows. If λ = −1, then L(x, λ) = 4x − 4 is unbounded from below, and if λ < −1, then the second derivative with respect to x is negative, and hence the solution (3.2) is a maximum and the function L(x, λ) is again unbounded from below. We therefore have

g(λ) = inf_x L(x, λ) = 4λ²/(1 + λ)² − 1 + λ(2λ/(1 + λ) − 1)(2λ/(1 + λ) − 3)  if λ > −1,  and g(λ) = −∞ if λ ≤ −1.

The Lagrange dual problem is

maximize over λ ≥ 0:  4λ²/(1 + λ)² − 1 + λ(2λ/(1 + λ) − 1)(2λ/(1 + λ) − 3).

Strong duality holds, as the problem is convex and has a strictly feasible point.

Solution (15) A linear support vector machine (SVM) is a classifier associated to a separating hyperplane with maximal margin. Assume we are given linearly separable points x1, . . . , xn ∈ Rd with labels yi ∈ {−1, 1} for i ∈ {1, . . . , n}. A separating hyperplane is defined as a hyperplane

H = {x ∈ Rd : wTx+ b = 0},

where w ∈ Rd and b ∈ R are such that for all i ∈ {1, . . . , n},

wTxi + b > 0 if yi = 1,
wTxi + b < 0 if yi = −1.

Define the classifier h(x) = sign(wTx + b), and let H+ and H− be the open half-spaces where h(x) = 1 and h(x) = −1. Given any point x ∉ H, we define the distance to H as the δ given by

δ²/2 := min_r (1/2)‖r‖²  subject to  wT(x + r) + b = 0.

We can obtain a closed form solution either geometrically (as in Lecture 17), or by considering the Lagrangian (as in Lecture 27). Using the second approach, we get

L(r, λ) = (1/2)‖r‖² + λ(wT(x + r) + b).

Taking the gradient with respect to r, we get the relation

∇rL(r, λ) = r + λw = 0 ⇒ r∗ = −λ∗w. (3.3)


To determine the Lagrange multiplier, we take the inner product with w and obtain

wTr∗ = −λ∗‖w‖²  ⇒  λ∗ = −wTr∗/‖w‖².

Using the fact that wTr∗ = −(wTx + b), and replacing λ∗ in (3.3), we get

r∗ = −((wTx + b)/‖w‖²) · w,   δ = ‖r∗‖ = |wTx + b|/‖w‖.

If we normalize w and b such that wTx + b = 1 for the training points x closest to the hyperplane, then we get

δ = 1/‖w‖,

and the margin is 2δ. Maximizing the margin over all separating hyperplanes is therefore equivalent to the quadratic optimization problem

minimize (1/2)‖w‖² subject to yi(wTxi + b) ≥ 1 for i ∈ {1, . . . , n}.
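A minimal sketch of this quadratic program, using the cvxpy modelling library (which is not used in the notes) on a made-up, linearly separable toy data set:

    import cvxpy as cp
    import numpy as np

    # Toy data, invented for the illustration.
    X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (w^T x_i + b) >= 1
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    prob.solve()

    print("w =", w.value, " b =", b.value)
    print("margin 2/||w|| =", 2 / np.linalg.norm(w.value))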

Solution (16)

(a) False. A convex function has at most one minimal value, but not necessarily a unique minimizer. Example: f(x) = 0, for which every point is a minimizer.

(b) False. It would be correct with ∇f(x) instead of ∇f(y). A counterexample is f(x) = −(1/2)‖x‖² (or the negative of any non-affine differentiable convex function).

(c) False. Gradient descent does not need to converge at all (take f(x) = −x or any function that is unbounded from below).

(d) False. The function f(x) = |x| is convex but not differentiable at 0.

Solution (17) A function f : Rd → Rk is called Lipschitz continuous with Lipschitz constant L > 0, if for all x, y ∈ Rd,

‖f(x) − f(y)‖ ≤ L · ‖x − y‖.

A function f : Rd → R is called β-smooth for some β > 0 if the gradient is Lipschitz continuous:

‖∇f(x) − ∇f(y)‖ ≤ β · ‖x − y‖

for all x, y ∈ Rd. A function f ∈ C1(Rd) is called α-strongly convex for some α > 0 if for every x, y ∈ Rd,

f(y) + 〈∇f(y), x − y〉 + (α/2)‖x − y‖² ≤ f(x).

A function that is both α-strongly convex and β-smooth satisfies

(α/2)‖x − y‖² ≤ f(x) − f(y) − 〈∇f(y), x − y〉 ≤ (β/2)‖x − y‖².


If α = β, we have equality, and hence, for fixed y and any x,

f(x) = f(y) + 〈∇f(y), x − y〉 + (α/2)‖x − y‖².

In particular, taking y = 0, a := f(0) and b = ∇f(0), we get the form

f(x) = a + 〈b, x〉 + (α/2)‖x‖².

This function is strictly convex and therefore has a unique minimizer. Setting the gradient ∇f(x) = b + αx to zero, we see that the minimizer is given by

x∗ = −(1/α) b.

Starting with any x0, we immediately arrive at this minimizer by taking a step in the direction of the negative gradient with step length 1/α.
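A quick numerical check of this one-step convergence (my own addition; the values of a, b, α and the starting point are arbitrary):

    import numpy as np

    alpha = 2.0
    a, b = 1.5, np.array([3.0, -1.0, 0.5])

    def f(x):
        return a + b @ x + 0.5 * alpha * x @ x

    def grad(x):
        return b + alpha * x

    x0 = np.array([10.0, -7.0, 4.0])
    x1 = x0 - (1.0 / alpha) * grad(x0)      # one gradient step with step size 1/alpha
    print("after one step :", x1)           # equals -b/alpha = [-1.5, 0.5, -0.25]
    print("gradient there :", grad(x1))     # the zero vector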

Solution (18) Suppose we are given a function f(x) = (1/n) ∑_{i=1}^n fi(x). To compute the gradient of f, we need to compute the gradient of each of the terms fi, which can be expensive if n is large. Stochastic Gradient Descent (SGD) proceeds by selecting one or more of the fi at random, computing the gradient of these functions, and taking a step in the negative gradient direction as computed from this subset of functions. The obvious benefit is reduced computational cost at each step. The drawback is that convergence is no longer guaranteed, even if all the functions involved are strictly convex and smooth. Rather than bounding the convergence ‖xk − x∗‖ of iterates to a minimizer x∗, in SGD one bounds the expected convergence.
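A minimal SGD sketch on a least-squares objective of this form, with fi(x) = (1/2)(ai·x − yi)² (my own addition; the data, the constant step size and the iteration count are arbitrary choices made for simplicity):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 200, 5
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    y = A @ x_true + 0.1 * rng.standard_normal(n)   # f_i(x) = 0.5 * (a_i . x - y_i)^2

    x = np.zeros(d)
    eta = 0.01                                      # constant step size
    for _ in range(20_000):
        i = rng.integers(n)                         # pick one term uniformly at random
        g = (A[i] @ x - y[i]) * A[i]                # gradient of f_i at x
        x = x - eta * g

    x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
    print("distance to the least-squares solution:", np.linalg.norm(x - x_ls))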

Solution (19) Let F(x) be the output of a neural network on input x, and let C = L(F(x), y) be a loss function evaluated at some test data (x, y). The function F depends on weight matrices W^k and bias terms b^k, where k denotes the layer. Backpropagation is a method to compute the gradients

∂C/∂w^k_{ij},  ∂C/∂b^k_j.

The network starts with a^0 = x, and at the k-th layer it computes

z^k = W^k a^{k−1} + b^k,  a^k = ρ(z^k),

where ρ is the activation function.

The sensitivity of the network to the j-th node in the k-th layer is measured by

δ^k_j = ∂C/∂z^k_j.

Let δ^k be the vector of δ^k_j entries and ◦ denote componentwise multiplication. Then backpropagation computes the vectors δ^k recursively from the output as

δ^ℓ = ρ′(z^ℓ) ◦ ∇_{a^ℓ}L(a^ℓ, y),  δ^k = ρ′(z^k) ◦ (W^{k+1})^T δ^{k+1},


where F(x) = a^ℓ is the output. The gradients are then computed as

∂C/∂w^k_{ij} = δ^k_i a^{k−1}_j,  ∂C/∂b^k_i = δ^k_i.

When training a neural network, one typically uses stochastic gradient descent, and this requires the computation of gradients.
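A minimal implementation of this recursion for a small fully connected ReLU network with squared-error loss is sketched below (my own addition; the layer sizes and data are made up). The last lines compare one backpropagated gradient entry with a finite-difference approximation.

    import numpy as np

    rng = np.random.default_rng(4)
    relu = lambda z: np.maximum(z, 0)
    relu_prime = lambda z: (z > 0).astype(float)

    sizes = [3, 4, 2]                      # layer widths: input 3, hidden 4, output 2
    W = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(2)]
    b = [rng.standard_normal(sizes[k + 1]) for k in range(2)]
    x = rng.standard_normal(3)
    y = rng.standard_normal(2)

    def forward(W, b, x):
        a, zs, acts = x, [], [x]
        for Wk, bk in zip(W, b):
            z = Wk @ a + bk                # z^k = W^k a^{k-1} + b^k
            a = relu(z)                    # a^k = rho(z^k)
            zs.append(z); acts.append(a)
        return zs, acts

    def backprop(W, b, x, y):
        zs, acts = forward(W, b, x)
        delta = relu_prime(zs[-1]) * (acts[-1] - y)    # delta^l = rho'(z^l) o grad_a L
        dW, db = [None] * len(W), [None] * len(W)
        for k in reversed(range(len(W))):
            dW[k] = np.outer(delta, acts[k])           # dC/dW^k = delta^k (a^{k-1})^T
            db[k] = delta                              # dC/db^k = delta^k
            if k > 0:
                delta = relu_prime(zs[k - 1]) * (W[k].T @ delta)
        return dW, db

    dW, db = backprop(W, b, x, y)

    loss = lambda: 0.5 * np.sum((forward(W, b, x)[1][-1] - y) ** 2)
    eps = 1e-6
    base = loss()
    W[0][0, 0] += eps
    print("backprop:", dW[0][0, 0], " finite difference:", (loss() - base) / eps)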

Solution (20) The ReLU activation function maps all non-positive numbers to 0 and 1 to 1, so the aim is to find a function in two variables that takes on non-positive values on (0, 0), (0, 1) and (1, 0), and the value 1 on (1, 1). There is no unique way of solving this problem, but one could try out a function of the form f(x1, x2) = w1x1 + w2x2 + b and determine the coefficients w1, w2 and b from the conditions. For example, the condition

w1 · 1 + w2 · 1 + b = 1

allows us to eliminate b = 1− w1 − w2. The inequality

w1 · 0 + w2 · 0 + b = 1− w1 − w2 ≤ 0

implies that w1 + w2 ≥ 1. The conditions

w1 · 1 + w2 · 0 + b = 1 − w2 ≤ 0,  w1 · 0 + w2 · 1 + b = 1 − w1 ≤ 0

imply that w1 ≥ 1 and w2 ≥ 1. We could, for example, choose w1 = w2 = 1 and b = −1. Applying ReLU to the resulting function gives

AND(x1, x2) = ρ(x1 + x2 − 1).

Of course, one can also try to guess this function in a less systematic way by a bit of trial and error.
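A two-line check of this network on all four inputs (my own addition):

    from itertools import product

    relu = lambda z: max(z, 0)
    AND = lambda x1, x2: relu(x1 + x2 - 1)

    for x1, x2 in product([0, 1], repeat=2):
        print(x1, x2, "->", AND(x1, x2))    # 0 0 -> 0, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1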

Solution (21) Restricted to the interval [0, 1], the function can be expressed as

1− ρ(4x) + 2ρ(4x− 1)− 2ρ(4x− 2) + 2ρ(4x− 3). (3.4)

This can be implemented by a neural network with one input node, four nodes in the hidden layer, and one output node.


Figure 3: A neural network with one hidden layer.

The starting point is to write each of the four lines in the graph as a ReLU function when restricted to its respective interval on the x-axis. In our case, this is

f1(x) = 1− ρ(4x), x ∈ [0, 0.25)

f2(x) = ρ(4x− 1), x ∈ [0.25, 0.5)

f3(x) = 1− ρ(4x− 2), x ∈ [0.5, 0.75)

f4(x) = ρ(4x− 3), x ∈ [0.75, 1].

We next need to add these functions together, but need to cancel the effect of fi on the interval corresponding to fi+1. As we see from the graph, if we continue f1 on [0.25, 0.5), it cancels out with f2. If we continue f2 on the interval [0.5, 0.75), it cancels out with f3 − 1. If we continue f3 on the interval [0.75, 1], it cancels out with f4. Therefore, we get the desired function by setting

f(x) = f1(x) + 2f2(x) + 2(f3(x) − 1) + 2f4(x),

which leads to the form (3.4).
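A quick numerical check of (3.4) at and between the knots (my own addition); the printed values alternate between 1 and 0 at multiples of 0.25, matching the zigzag shape of the saw function in Figure 1:

    import numpy as np

    relu = lambda z: np.maximum(z, 0)

    def saw(x):
        # One input, four hidden ReLU units, one output, as in (3.4).
        return 1 - relu(4 * x) + 2 * relu(4 * x - 1) - 2 * relu(4 * x - 2) + 2 * relu(4 * x - 3)

    for x in [0.0, 0.125, 0.25, 0.5, 0.75, 1.0]:
        print(f"saw({x:.3f}) = {saw(x):.3f}")
    # saw(0) = 1, saw(0.25) = 0, saw(0.5) = 1, saw(0.75) = 0, saw(1) = 1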

Solution (22) There are many different ways to solve this problem. Perhaps the most illuminating way is to first construct a function that solves the classification task using indicator functions, and then approximate the indicator functions using a neural network structure.

To construct a classifier based on the points plotted, it is useful to first try to separate the points by hyperplanes. In our case, we can delineate the region containing the triangle (blue) points as the set of training points satisfying the inequalities

−x + 0.5 > 0,
y − 0.5 > 0.

If one of these inequalities is not satisfied, then a point is labelled as a circle (red) point. We can formulate this using the indicator function

f(x, y) = 1{−x + 0.5 < 0} + 1{y − 0.5 < 0}


and declaring a point to be a circle if f(x, y) > 0 and a triangle else. We can approximate the classification task using sigmoid functions, with the weights chosen so that the sigmoids become “steep” at 0 and approximate an indicator function. For example,

f(x, y) = σ(σ(a(x− 0.5)) + σ(b(0.5− y))− 0.5),

with the weight a chosen so large that σ(a(x − 0.5)) will be close to 1 if x > 0.5 and close to 0 if x < 0.5 (and similarly with b). This way, the sum of the two sigmoid terms will be close to 0 for triangle points, and close to 1 or 2 for circles. By subtracting 0.5, we make sure that the function will be negative for the triangles and positive for the circles. Finally, applying a sigmoid to the result ensures that we get a value between 0 and 1 as output.

A neural network structure that implements this function is depicted in Figure 4:

Figure 4: A neural network with one hidden layer; the first-layer weights are w11, w12, w21, w22 and the output weights are u1, u2.

As a function, the network in this general form can be written as

f(x) = σ(uTσ(Wx+ b) + c).

In our special case,

W = ( a   0 )
    ( 0  −b ),   b = (−0.5a, 0.5b)T,   u = (1, 1)T,   c = −0.5.

A typical form of the resulting decision boundary is shown in Figure 5. For this particular plot we used a = b = 10; higher values will lead to a sharper corner.

Figure 5: A decision boundary
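A small sketch that evaluates this network on two representative points (my own addition; a = b = 10 as in the plot, and the two test points are arbitrary). In the code the steepness parameter b is renamed to avoid clashing with the bias vector.

    import numpy as np

    sigma = lambda z: 1 / (1 + np.exp(-z))
    a = steep_b = 10.0                              # steepness parameters a and b

    W = np.array([[a, 0.0], [0.0, -steep_b]])
    bias = np.array([-0.5 * a, 0.5 * steep_b])
    u = np.array([1.0, 1.0])
    c = -0.5

    def f(point):
        # f(x) = sigma(u^T sigma(W x + b) + c)
        return sigma(u @ sigma(W @ point + bias) + c)

    print("triangle-region point (0, 1):", f(np.array([0.0, 1.0])))   # below 1/2
    print("circle-region point   (1, 0):", f(np.array([1.0, 0.0])))   # above 1/2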


Solution (23) Given a vector x ∈ Rd and a sequence g = (g_{−d+1}, . . . , g_{d−1})T, the convolution is defined as the vector y = (y1, . . . , yd)T, where

y_k = ∑_{i=1}^d x_i · g_{k−i},  k ∈ {1, . . . , d}.

In terms of matrices,

( y1 )   ( g0      g−1     · · ·   g−d+1 ) ( x1 )
( y2 ) = ( g1      g0      · · ·   g−d+2 ) ( x2 )
( ... )  ( ...     ...     · · ·   ...   ) ( ... )
( yd )   ( gd−1    gd−2    · · ·   g0    ) ( xd )

The sequence g is called a filter. Convolutional Neural Networks (CNNs) operate on image data, and in the context of colour image data, a vector x that enters a layer of the CNN will not be seen as a single vector in some Rd, but as a tensor with format N × M × 3. Here, the image is considered to be an N × M matrix, with each entry corresponding to a pixel, and each pixel is represented by a 3-tuple consisting of the red, blue and green components. A convolution filter will therefore operate on a subset of this tensor, typically an n × m × 3 window. Typical parameters are N = M = 32 and n = m = 5.


Each 5 × 5 × 3 block is mapped linearly to an entry of a 28 × 28 matrix in the same way (that is, applying the same filter). The filter applied is called a kernel and the resulting map is interpreted as representing a feature. Specifically, instead of writing it as a vector g, we can write it as a matrix K, and then have, if we denote the input image by X,

Y = X ∗ K,  Yij = ∑_{k,ℓ} Xkℓ · K_{i−k, j−ℓ}.

In a convolutional layer, various different kernels or convolutions are applied to the image. This can be seen as extracting several features from an image. CNNs have the advantage of being scalable: they work on images of arbitrary size.
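The convolution formula can be checked directly against a library implementation. The sketch below (my own addition) computes Yij = ∑_{k,ℓ} Xkℓ K_{i−k,j−ℓ} with explicit loops and compares the result with scipy.signal.convolve2d on random data.

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(6)
    X = rng.standard_normal((6, 6))       # toy "image"
    K = rng.standard_normal((3, 3))       # toy kernel

    def conv2d_direct(X, K):
        # Y_ij = sum_{k,l} X_{kl} * K_{i-k, j-l}, over the indices where K is defined.
        n1, n2 = X.shape
        m1, m2 = K.shape
        Y = np.zeros((n1 + m1 - 1, n2 + m2 - 1))
        for i in range(Y.shape[0]):
            for j in range(Y.shape[1]):
                for k in range(n1):
                    for l in range(n2):
                        if 0 <= i - k < m1 and 0 <= j - l < m2:
                            Y[i, j] += X[k, l] * K[i - k, j - l]
        return Y

    print(np.allclose(conv2d_direct(X, K), convolve2d(X, K)))   # True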


Solution (24) Let h : Rd → Y be any classifier. Define the smallest perturbation that moves a data point into a different class as

∆(x) = inf_r {‖r‖ : h(x + r) ≠ h(x)}. (3.5)

Given a probability distribution on the input space, define the robustness as

ρ(h) = E[∆(X)/‖X‖].

For a linear classifier with only two classes, determined by the sign of wTx + b, we get the vector that moves a point x to the boundary by solving the optimization problem (as for SVMs)

r∗ = argmin_r (1/2)‖r‖² subject to wT(x + r) + b = 0.

The solution is

r∗ = −((wTx + b)/‖w‖²) · w,  with norm ‖r∗‖ = |wTx + b|/‖w‖.

Assume that we now have k linear functions f1, . . . , fk, with fi(x) = wi^T x + bi, and a classifier h : Rd → {1, . . . , k} that assigns to each x the index j of the largest value fj(x) (this corresponds to the one-versus-rest setting for multiclass classification). Let x be such that max_j fj(x) = fk(x) and define the linear functions

gi(x) := fi(x) − fk(x) = (wi − wk)^T x + (bi − bk),  i ∈ {1, . . . , k − 1}.

Then

x ∈ ⋂_{1≤i≤k−1} {y : gi(y) < 0}.

The intersection of half-spaces is a polyhedron, and x is in the interior of the polyhedron P delineated by the hyperplanes Hi = {y : gi(y) = 0} (see Figure 6).

Figure 6: The distance to misclassification is the radius of the largest enclosed ball in a polyhedron P.


A perturbation x + r ceases to be in class k as soon as gj(x + r) > 0 for some j, and the smallest length of such an r equals the radius of the largest ball centred at x that is enclosed in the polyhedron. Formally, noting that ∇gi(x) = wi − wk, we choose the index

j := argmin_i |gi(x)| / ‖∇gi(x)‖,  and set  r∗ = −(gj(x)/‖∇gj(x)‖²) · ∇gj(x). (3.6)
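A minimal implementation of (3.6) for a made-up linear classifier (my own addition; the weights, the test point and the factor 1.001 used to step just past the closest boundary are arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    k, d = 4, 3
    W = rng.standard_normal((k, d))        # rows w_i of the scores f_i(x) = w_i . x + b_i
    bvec = rng.standard_normal(k)

    classify = lambda x: int(np.argmax(W @ x + bvec))

    def smallest_perturbation(x):
        scores = W @ x + bvec
        top = int(np.argmax(scores))
        others = [i for i in range(k) if i != top]
        g = np.array([scores[i] - scores[top] for i in others])       # g_i(x)
        grads = np.array([W[i] - W[top] for i in others])             # grad g_i
        dists = np.abs(g) / np.linalg.norm(grads, axis=1)             # distances to the hyperplanes
        j = int(np.argmin(dists))
        return -g[j] / np.linalg.norm(grads[j]) ** 2 * grads[j]       # formula (3.6)

    x = rng.standard_normal(d)
    r = smallest_perturbation(x)
    print("class of x           :", classify(x))
    print("class of x + 1.001 r :", classify(x + 1.001 * r))
    print("size of perturbation :", np.linalg.norm(r))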

Solution (25) The discriminator D would like to achieve, on average, a large value on G0(Z0), and a small value on G1(Z1). Using the logarithm, this can be expressed as the problem of maximizing

EZ0[log(D(G0(Z0)))] + EZ1[log(1 − D(G1(Z1)))].

As an integral, using the push-forward measures, we get the expression

∫X [log(D(x)) ρX0(x) + log(1 − D(x)) ρX1(x)] dx.

We choose D(x) so that the integrand becomes maximal. So considering the function

f(y) = log(y) ρX0(x) + log(1 − y) ρX1(x)

and computing the first and second derivatives,

f′(y) = ρX0(x)/y − ρX1(x)/(1 − y),  f′′(y) = −ρX0(x)/y² − ρX1(x)/(1 − y)² < 0,

we see that we get a maximum at

y = ρX0(x) / (ρX0(x) + ρX1(x)),

which is the value of D∗(x) that we want.
