5 Vector Calculus
Many algorithms in machine learning are inherently based on optimizing an objective function with respect to a set of desired model parameters that control how well a model explains the data: Finding good parameters can be phrased as an optimization problem. Examples include linear regression (see Chapter 9), where we look at curve-fitting problems and optimize linear weight parameters to maximize the likelihood; neural-network auto-encoders for dimensionality reduction and data compression, where the parameters are the weights and biases of each layer, and where we minimize a reconstruction error by repeated application of the chain rule; Gaussian mixture models (see Chapter 12) for modeling data distributions, where we optimize the location and shape parameters of each mixture component to maximize the likelihood of the model. Figure 5.1 illustrates some of these problems, which we typically solve by using optimization algorithms that exploit gradient information (first-order methods). Figure 5.2 gives an overview of how concepts in this chapter are related and how they are connected to other chapters of the book.

In this chapter, we will discuss how to compute gradients of functions, which is often essential to facilitate learning in machine learning models. Therefore, vector calculus is one of the fundamental mathematical tools we need in machine learning.
Figure 5.1 Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions. (a) Regression problem: Find parameters such that the curve explains the observations (circles) well. (b) Density estimation with a Gaussian mixture model: Find means and covariances such that the data (dots) can be explained well.
Figure 5.2 A mind map of the concepts introduced in this chapter (difference quotient, partial derivatives, Jacobian, Hessian, Taylor series), along with when they are used in other parts of the book (Chapter 7 Optimization, Chapter 9 Regression, Chapter 10 Classification, Chapter 11 Dimensionality reduction, Chapter 12 Density estimation).
Figure 5.3 The average incline of a function f between x_0 and x_0 + δx is the incline of the secant (blue) through f(x_0) and f(x_0 + δx) and given by δy/δx.
5.1 Differentiation of Univariate Functions

In the following, we briefly revisit differentiation of a univariate function, which we may already know from school. We start with the difference quotient of a univariate function y = f(x), x, y ∈ R, which we will subsequently use to define derivatives.
Definition 5.1 (Difference Quotient). The difference quotient

\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}    (5.1)

computes the slope of the secant line through two points on the graph of f. In Figure 5.3 these are the points with x-coordinates x_0 and x_0 + δx.

The difference quotient can also be considered the average slope of f between x and x + δx if we assume f to be a linear function. In the limit for δx → 0, we obtain the tangent of f at x, if f is differentiable. The tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit

\frac{df}{dx} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ,    (5.2)

and the secant in Figure 5.3 becomes a tangent.
Example (Derivative of a Polynomial)
We want to compute the derivative of f(x) = x^n, n ∈ N. We may already know that the answer will be nx^{n-1}, but we want to derive this result using the definition of the derivative as the limit of the difference quotient.

Using the definition of the derivative in (5.2) we obtain

\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}    (5.3)
= \lim_{h \to 0} \frac{(x+h)^n - x^n}{h}    (5.4)
= \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^i - x^n}{h} .    (5.5)

We see that x^n = \binom{n}{0} x^{n-0} h^0. By starting the sum at 1 the x^n-term cancels, and we obtain

\frac{df}{dx} = \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^i}{h}    (5.6)
= \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1}    (5.7)
= \lim_{h \to 0} \left( \binom{n}{1} x^{n-1} + \underbrace{\sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1}}_{\to 0 \text{ as } h \to 0} \right)    (5.8)
= \frac{n!}{1!(n-1)!} x^{n-1} = n x^{n-1} .    (5.9)
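As a quick numerical sanity check (a minimal sketch, not from the book; the helper name difference_quotient is illustrative), we can evaluate the difference quotient (5.1) for a small step size and compare it with the analytic result nx^{n-1}:

```python
import numpy as np

def difference_quotient(f, x, dx=1e-6):
    """Slope of the secant through (x, f(x)) and (x + dx, f(x + dx)), see (5.1)."""
    return (f(x + dx) - f(x)) / dx

n, x = 5, 1.7
numeric = difference_quotient(lambda t: t**n, x)
analytic = n * x**(n - 1)   # n x^{n-1} from (5.9)
print(numeric, analytic)    # the two values agree to several decimal places
```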
5.1.1 Taylor Series

The Taylor series is a representation of a function f as an (infinite) sum of terms. These terms are determined using derivatives of f, evaluated at a point x_0.

Definition 5.3 (Taylor Series). For a function f : R → R, the Taylor series of a smooth function f ∈ C^∞ at x_0 is defined as

f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k ,    (5.10)

where f^{(k)}(x_0) is the kth derivative of f at x_0 and \frac{f^{(k)}(x_0)}{k!} are the coefficients of the polynomial. (f ∈ C^∞ means that f is infinitely often continuously differentiable.) For x_0 = 0, we obtain the Maclaurin series as a special instance of the Taylor series.
Definition 5.4 (Taylor Polynomial). The Taylor polynomial of degree n of f at x_0 contains the first n + 1 terms of the series in (5.10) and is defined as

T_n := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k .    (5.11)
Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around x_0. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree k ≤ n since all derivatives f^{(i)}, i > k vanish. ♦
Example (Taylor Polynomial)
We consider the polynomial

f(x) = x^4    (5.12)

and seek the Taylor polynomial T_6, evaluated at x_0 = 1. We start by computing the coefficients f^{(k)}(1) for k = 0, . . . , 6:

f(1) = 1    (5.13)
f'(1) = 4    (5.14)
f''(1) = 12    (5.15)
f^{(3)}(1) = 24    (5.16)
f^{(4)}(1) = 24    (5.17)
f^{(5)}(1) = 0    (5.18)
f^{(6)}(1) = 0    (5.19)

Therefore, the desired Taylor polynomial is

T_6(x) = \sum_{k=0}^{6} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k    (5.20)
= 1 + 4(x - 1) + 6(x - 1)^2 + 4(x - 1)^3 + (x - 1)^4 + 0 .    (5.21)

Multiplying out and re-arranging yields

T_6(x) = (1 - 4 + 6 - 4 + 1) + x(4 - 12 + 12 - 4) + x^2(6 - 12 + 6) + x^3(4 - 4) + x^4    (5.22)
= x^4 = f(x) ,    (5.23)

i.e., we obtain an exact representation of the original function.
Example (Taylor Series)
Consider the smooth function

f(x) = \sin(x) + \cos(x) ∈ C^∞ .    (5.24)

We seek a Taylor series expansion of f at x_0 = 0, which is the Maclaurin series expansion of f. We obtain the following derivatives:

f(0) = \sin(0) + \cos(0) = 1    (5.25)
f'(0) = \cos(0) - \sin(0) = 1    (5.26)
f''(0) = -\sin(0) - \cos(0) = -1    (5.27)
f^{(3)}(0) = -\cos(0) + \sin(0) = -1    (5.28)
f^{(4)}(0) = \sin(0) + \cos(0) = f(0) = 1    (5.29)
...

We can see a pattern here: The coefficients in our Taylor series are only ±1 (since sin(0) = 0), each of which occurs twice before switching to the other one. Furthermore, f^{(k+4)}(0) = f^{(k)}(0).

Therefore, the full Taylor series expansion of f at x_0 = 0 is given by

f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k    (5.30)
= 1 + x - \frac{1}{2!}x^2 - \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \frac{1}{5!}x^5 - \cdots    (5.31)
= 1 - \frac{1}{2!}x^2 + \frac{1}{4!}x^4 \mp \cdots + x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 \mp \cdots    (5.32)
= \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} + \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1}    (5.33)
= \cos(x) + \sin(x) ,    (5.34)

where we used the power series representations

\cos(x) = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} ,    (5.35)
\sin(x) = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1} .    (5.36)

Figure 5.4 shows the corresponding first Taylor polynomials T_n for n = 0, 1, 5, 10.
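A small numerical sketch (not part of the book; the helper name taylor_sin_cos is made up for illustration) makes the behavior of the truncated series visible by comparing T_n from (5.30) with sin(x) + cos(x):

```python
import numpy as np
from math import factorial

def taylor_sin_cos(x, n):
    """Taylor polynomial T_n of sin(x) + cos(x) at x0 = 0, using the coefficients from (5.30)."""
    # The k-th Maclaurin coefficient of sin + cos cycles through +1, +1, -1, -1, ... divided by k!
    coeffs = [(-1) ** (k // 2) / factorial(k) for k in range(n + 1)]
    return sum(c * x**k for k, c in enumerate(coeffs))

x = 1.5
for n in (0, 1, 5, 10):
    print(n, taylor_sin_cos(x, n), np.sin(x) + np.cos(x))
```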
5.1.2 Differentiation Rules

In the following, we briefly state basic differentiation rules, where we denote the derivative of f by f'.
Figure 5.4 Taylor polynomials. The original function f(x) = sin(x) + cos(x) (black, solid) is approximated by Taylor polynomials (dashed) around x_0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T_{10} is already similar to f in [-4, 4].
Product Rule: (f(x)g(x))' = f'(x)g(x) + f(x)g'(x)    (5.37)
Quotient Rule: \left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}    (5.38)
Sum Rule: (f(x) + g(x))' = f'(x) + g'(x)    (5.39)
Chain Rule: \big(g(f(x))\big)' = (g ∘ f)'(x) = g'(f)f'(x)    (5.40)

Here, g ∘ f is a function composition x ↦ f(x) ↦ g(f(x)).
Example (Chain Rule)
Let us compute the derivative of the function h(x) = (2x + 1)^4 using the chain rule. With

h(x) = (2x + 1)^4 = g(f(x)) ,    (5.41)
f(x) = 2x + 1 ,    (5.42)
g(f) = f^4    (5.43)

we obtain the derivatives of f and g as

f'(x) = 2 ,    (5.44)
g'(f) = 4f^3 ,    (5.45)

such that the derivative of h is given as

h'(x) = g'(f)f'(x) = (4f^3) · 2 \overset{(5.42)}{=} 4(2x + 1)^3 · 2 = 8(2x + 1)^3 ,    (5.46)

where we used the chain rule, see (5.40), and substituted the definition of f in (5.42) in g'(f).
5.2 Partial Differentiation and Gradients

Differentiation as discussed in Section 5.1 applies to functions f of a scalar variable x ∈ R. In the following, we consider the general case where the function f depends on one or more variables x ∈ R^n, e.g., f(x) = f(x_1, x_2). The generalization of the derivative to functions of several variables is the gradient.

We find the gradient of the function f with respect to x by varying one variable at a time and keeping the others constant. The gradient is then the collection of these partial derivatives.
Definition 5.5 (Partial Derivative). For a function f : R^n → R, x ↦ f(x), x ∈ R^n of n variables x_1, . . . , x_n we define the partial derivatives as

\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, \dots, x_n) - f(x)}{h}
\vdots
\frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, \dots, x_{n-1}, x_n + h) - f(x)}{h}    (5.47)

and collect them in the row vector

\nabla_x f = \mathrm{grad}\, f = \frac{df}{dx} = \left[ \frac{\partial f(x)}{\partial x_1} \;\; \frac{\partial f(x)}{\partial x_2} \;\; \cdots \;\; \frac{\partial f(x)}{\partial x_n} \right] \in R^{1 \times n} ,    (5.48)

where n is the number of variables and 1 is the dimension of the image/range of f. Here, we used the compact vector notation x = [x_1, . . . , x_n]^⊤. The row vector in (5.48) is called the gradient of f or the Jacobian and is the generalization of the derivative from Section 5.1.
Example (Partial Derivatives using the Chain Rule)
For f(x, y) = (x + 2y^3)^2, we obtain the partial derivatives

\frac{\partial f(x, y)}{\partial x} = 2(x + 2y^3) \frac{\partial}{\partial x}(x + 2y^3) = 2(x + 2y^3) ,    (5.49)
\frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3) \frac{\partial}{\partial y}(x + 2y^3) = 12(x + 2y^3) y^2 ,    (5.50)

where we used the chain rule (5.40) to compute the partial derivatives. (We can use results from scalar differentiation: Each partial derivative is a derivative with respect to a scalar.)
Remark (Gradient as a Row Vector). It is not uncommon in the literature to define the gradient vector as a column vector, following the convention that vectors are generally column vectors. The reason why we define the gradient vector as a row vector is twofold: First, we can consistently generalize the gradient to a setting where f : R^n → R^m no longer maps onto the real line (then the gradient becomes a matrix). Second, we can immediately apply the multi-variate chain rule without paying attention to the dimension of the gradient. We will discuss both points later. ♦
Example (Gradient)
For f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 ∈ R, the partial derivatives (i.e., the derivatives of f with respect to x_1 and x_2) are

\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3    (5.51)
\frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2    (5.52)

and the gradient is then

\frac{df}{dx} = \left[ \frac{\partial f(x_1, x_2)}{\partial x_1} \;\; \frac{\partial f(x_1, x_2)}{\partial x_2} \right] = \left[ 2 x_1 x_2 + x_2^3 \;\;\; x_1^2 + 3 x_1 x_2^2 \right] \in R^{1 \times 2} .    (5.53)
5.2.1 Basic Rules of Partial Differentiation

In the multivariate case, where x ∈ R^n, the basic differentiation rules that we know from school (e.g., sum rule, product rule, chain rule; see also Section 5.1.2) still apply. However, when we compute derivatives with respect to vectors x ∈ R^n we need to pay attention: Our gradients now involve vectors and matrices, and matrix multiplication is no longer commutative (see Section 2.2.1), i.e., the order matters. (Product rule: (fg)' = f'g + fg'; sum rule: (f + g)' = f' + g'; chain rule: (g(f))' = g'(f)f'.)
Here are the general product rule, sum rule and chain rule:

Product Rule: \frac{\partial}{\partial x}\big(f(x) g(x)\big) = \frac{\partial f}{\partial x} g(x) + f(x) \frac{\partial g}{\partial x}    (5.54)
Sum Rule: \frac{\partial}{\partial x}\big(f(x) + g(x)\big) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}    (5.55)
Chain Rule: \frac{\partial}{\partial x}(g \circ f)(x) = \frac{\partial}{\partial x}\big(g(f(x))\big) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial x}    (5.56)
Let us have a closer look at the chain rule. The chain rule (5.56) resembles to some degree the rules for matrix multiplication where we said that neighboring dimensions have to match for matrix multiplication to be defined, see Section 2.2.1. If we go from left to right, the chain rule exhibits similar properties: ∂f shows up in the "denominator" of the first factor and in the "numerator" of the second factor. If we multiply the factors together, multiplication is defined (the dimensions of ∂f match, and ∂f "cancels"), such that ∂g/∂x remains. (This is only an intuition, but not mathematically correct since the partial derivative is not a fraction.)
5.2.2 Chain Rule

Consider a function f : R^2 → R of two variables x_1, x_2. Furthermore, x_1(t) and x_2(t) are themselves functions of t. To compute the gradient of f with respect to t, we need to apply the chain rule (5.56) for multivariate functions as

\frac{df}{dt} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1(t)}{\partial t} \\ \frac{\partial x_2(t)}{\partial t} \end{bmatrix} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} ,    (5.57)

where d denotes the gradient and ∂ partial derivatives.
Example
Consider f(x_1, x_2) = x_1^2 + 2x_2, where x_1 = sin t and x_2 = cos t, then

\frac{df}{dt} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}    (5.58)
= 2 \sin t \frac{\partial \sin t}{\partial t} + 2 \frac{\partial \cos t}{\partial t}    (5.59)
= 2 \sin t \cos t - 2 \sin t = 2 \sin t (\cos t - 1)    (5.60)

is the corresponding derivative of f with respect to t.
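As a quick sanity check (a sketch that is not part of the text; the names below are illustrative), we can compare (5.60) against a central finite difference of t ↦ f(x_1(t), x_2(t)):

```python
import numpy as np

def f_of_t(t):
    x1, x2 = np.sin(t), np.cos(t)
    return x1**2 + 2 * x2                                 # f(x1, x2) = x1^2 + 2 x2

t, h = 0.8, 1e-5
numeric = (f_of_t(t + h) - f_of_t(t - h)) / (2 * h)       # central difference quotient
analytic = 2 * np.sin(t) * (np.cos(t) - 1)                # (5.60)
print(numeric, analytic)
```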
If f(x_1, x_2) is a function of x_1 and x_2, where x_1(s, t) and x_2(s, t) are themselves functions of two variables s and t, the chain rule yields the partial derivatives

\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s} ,    (5.61)
\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} ,    (5.62)

and the gradient is obtained by the matrix multiplication

\frac{df}{d(s, t)} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial (s, t)} = \underbrace{\left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right]}_{= \frac{\partial f}{\partial x}} \underbrace{\begin{bmatrix} \frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial x}{\partial (s,t)}} .    (5.63)

This compact way of writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to start transposing gradients for the matrix dimensions to match. This may still be straightforward as long as the gradient is a vector or a matrix; however, when the gradient becomes a tensor (we will discuss this in the following), the transpose is no longer a triviality.
Remark (Verifying the Correctness of the Implementation of the Gradient). The definition of the partial derivatives as the limit of the corresponding difference quotient, see (5.47), can be exploited when numerically checking the correctness of gradients in computer programs: When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10^{-4}) and compare the finite-difference approximation from (5.47) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that

\sqrt{\frac{\sum_i (dh_i - df_i)^2}{\sum_i (dh_i + df_i)^2}} < 10^{-3} ,

where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the ith variable x_i. ♦
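The following sketch (not from the book; function names are illustrative) implements this gradient check for the example f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 from (5.51)–(5.53):

```python
import numpy as np

def f(x):
    return x[0]**2 * x[1] + x[0] * x[1]**3

def analytic_gradient(x):
    # (5.51) and (5.52), collected as in (5.53)
    return np.array([2*x[0]*x[1] + x[1]**3, x[0]**2 + 3*x[0]*x[1]**2])

def finite_difference_gradient(f, x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        grad[i] = (f(x + e) - f(x)) / h      # difference quotient, see (5.47)
    return grad

x = np.array([1.3, -0.4])
dh = finite_difference_gradient(f, x)
df = analytic_gradient(x)
rel_error = np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))
print(rel_error < 1e-3)   # True if the implementation is (probably) correct
```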
5.3 Gradients of Vector-Valued Functions (Vector Fields)

Thus far, we discussed partial derivatives and gradients of functions f : R^n → R mapping to the real numbers. In the following, we will generalize the concept of the gradient to vector-valued functions f : R^n → R^m, where n, m > 1.
For a function f : R^n → R^m and a vector x = [x_1, . . . , x_n]^⊤ ∈ R^n, the corresponding vector of function values is given as

f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix} \in R^m .    (5.64)

Writing the vector-valued function in this way allows us to view a vector-valued function f : R^n → R^m as a vector of functions [f_1, . . . , f_m]^⊤, f_i : R^n → R that map onto R. The differentiation rules for every f_i are exactly the ones we discussed in Section 5.2.
Therefore, the partial derivative of a vector-valued function f : R^n → R^m with respect to x_i ∈ R, i = 1, . . . , n, is given as the vector

\frac{\partial f}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} = \begin{bmatrix} \lim_{h \to 0} \frac{f_1(x_1, \dots, x_{i-1}, x_i + h, x_{i+1}, \dots, x_n) - f_1(x)}{h} \\ \vdots \\ \lim_{h \to 0} \frac{f_m(x_1, \dots, x_{i-1}, x_i + h, x_{i+1}, \dots, x_n) - f_m(x)}{h} \end{bmatrix} \in R^m .    (5.65)

From (5.48), we know that we obtain the gradient of f with respect to a vector as the row vector of the partial derivatives. In (5.65), every partial derivative ∂f/∂x_i is a column vector. Therefore, we obtain the gradient of f : R^n → R^m with respect to x ∈ R^n by collecting these
partial derivatives:
\frac{df(x)}{dx} = \left[ \frac{\partial f(x)}{\partial x_1} \;\; \cdots \;\; \frac{\partial f(x)}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{bmatrix} \in R^{m \times n} .    (5.66)
Definition 5.6 (Jacobian). The collection of all first-order partial derivatives of a vector-valued function f : R^n → R^m is called the Jacobian. The Jacobian J is an m × n matrix, which we define and arrange as follows:

J = \nabla_x f = \frac{df(x)}{dx} = \left[ \frac{\partial f(x)}{\partial x_1} \;\; \cdots \;\; \frac{\partial f(x)}{\partial x_n} \right]    (5.67)
= \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{bmatrix} ,    (5.68)

x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} , \quad J(i, j) = \frac{\partial f_i}{\partial x_j} .    (5.69)

(The gradient of a function f : R^n → R^m is a matrix of size m × n.)

In particular, a function f : R^n → R^1, which maps a vector x ∈ R^n onto a scalar (e.g., f(x) = \sum_{i=1}^{n} x_i), possesses a Jacobian that is a row vector (matrix of dimension 1 × n), see (5.48).
Example (Gradient of a Vector Field)
We are given

f(x) = Ax , f(x) ∈ R^M , A ∈ R^{M×N} , x ∈ R^N .

To compute the gradient df/dx we first determine the dimension of df/dx: Since f : R^N → R^M, it follows that df/dx ∈ R^{M×N}. Second, we compute the partial derivatives of f with respect to every x_j:

f_i(x) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij} .    (5.70)

Finally, we collect the partial derivatives in the Jacobian and obtain the gradient as

\frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \in R^{M×N} .    (5.71)
Example (Chain Rule)
Consider the function h : R → R, h(t) = (f ∘ g)(t) with

f : R^2 → R    (5.72)
g : R → R^2    (5.73)
f(x) = \exp(x_1 x_2^2) ,    (5.74)
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix}    (5.75)

and compute the gradient of h with respect to t.

Since f : R^2 → R and g : R → R^2 we note that

\frac{\partial f}{\partial x} \in R^{1×2} ,    (5.76)
\frac{\partial g}{\partial t} \in R^{2×1} .    (5.77)

The desired gradient is computed by applying the chain rule:

\frac{dh}{dt} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial t} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix}    (5.78)
= \left[ \exp(x_1 x_2^2) x_2^2 \;\;\; 2 \exp(x_1 x_2^2) x_1 x_2 \right] \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix}    (5.79)
= \exp(x_1 x_2^2) \big( x_2^2 (\cos t - t \sin t) + 2 x_1 x_2 (\sin t + t \cos t) \big) ,    (5.80)

where x_1 = t cos t and x_2 = t sin t, see (5.75).
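A short numerical sketch (not from the book) makes the dimension bookkeeping in (5.78) concrete: we form the 1×2 Jacobian of f and the 2×1 Jacobian of g at one value of t and multiply them; the variable names are illustrative only.

```python
import numpy as np

t = 1.2
x1, x2 = t * np.cos(t), t * np.sin(t)                # (5.75)

df_dx = np.array([[np.exp(x1 * x2**2) * x2**2,       # ∂f/∂x, shape (1, 2)
                   2 * np.exp(x1 * x2**2) * x1 * x2]])
dx_dt = np.array([[np.cos(t) - t * np.sin(t)],       # ∂x/∂t, shape (2, 1)
                  [np.sin(t) + t * np.cos(t)]])

dh_dt = df_dx @ dx_dt                                # (1, 2) @ (2, 1) -> (1, 1), see (5.78)
print(dh_dt.item())
```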
Example (Gradient of a Linear Model)
Let us consider the linear model

y = Φθ ,    (5.81)

where θ ∈ R^D is a parameter vector, Φ ∈ R^{N×D} are input features and y ∈ R^N are the corresponding observations. We define the following functions:

L(e) := ‖e‖^2 ,    (5.82)
e(θ) := y − Φθ .    (5.83)

We seek ∂L/∂θ, and we will use the chain rule for this purpose. (We will discuss this model in much more detail in Chapter 9 in the context of linear regression.)

Before we start any calculation, we determine the dimensionality of the gradient as

\frac{\partial L}{\partial θ} \in R^{1×D} .    (5.84)
The chain rule allows us to compute the gradient as

\frac{\partial L}{\partial θ} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial θ} ,    (5.85)

i.e., every element is given by

\frac{\partial L}{\partial θ}[1, i] = \sum_{j=1}^{N} \frac{\partial L}{\partial e}[j] \frac{\partial e}{\partial θ}[j, i] .    (5.86)

(In code: dLdtheta = np.einsum('j,ji', dLde, dedtheta).)

We know that ‖e‖^2 = e^⊤e (see Section 3.2) and determine

\frac{\partial L}{\partial e} = 2 e^⊤ \in R^{1×N} .    (5.87)

Furthermore, we obtain

\frac{\partial e}{\partial θ} = −Φ \in R^{N×D} ,    (5.88)

such that our desired derivative is

\frac{\partial L}{\partial θ} = −2 e^⊤ Φ \overset{(5.83)}{=} −2 \underbrace{(y^⊤ − θ^⊤ Φ^⊤)}_{1×N} \underbrace{Φ}_{N×D} \in R^{1×D} .    (5.89)

Remark. We would have obtained the same result without using the chain rule by immediately looking at the function

L_2(θ) := ‖y − Φθ‖^2 = (y − Φθ)^⊤(y − Φθ) .    (5.90)

This approach is still practical for simple functions like L_2 but becomes impractical if we consider deep function compositions. ♦
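Building on the einsum expression above, here is a small sketch (not from the book; variable names and random data are purely illustrative) that evaluates (5.87)–(5.89) and compares the chain-rule result with the gradient obtained directly from (5.90):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

e = y - Phi @ theta                              # (5.83)
dLde = 2 * e                                     # (5.87), row vector 1 x N
dedtheta = -Phi                                  # (5.88), N x D
dLdtheta = np.einsum('j,ji', dLde, dedtheta)     # chain rule (5.86)

direct = -2 * (y - Phi @ theta) @ Phi            # gradient of (5.90), cf. (5.89)
print(np.allclose(dLdtheta, direct))             # True
```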
5.4 Gradients of Matrices

We will encounter situations where we need to take gradients of matrices with respect to vectors (or other matrices), which results in a multi-dimensional tensor. For example, if we compute the gradient of an m × n matrix with respect to a p × q matrix, the resulting Jacobian would be (p × q) × (m × n), i.e., a four-dimensional tensor (or array). Since matrices represent linear mappings, we can exploit the fact that there is a vector-space isomorphism (linear, invertible mapping) between the space R^{m×n} of m × n matrices and the space R^{mn} of mn vectors. Therefore, we can re-shape our matrices into vectors of lengths mn and pq, respectively. The gradient using these mn vectors results in a Jacobian of size pq × mn. Figure 5.5 visualizes both approaches. (Matrices can be transformed into vectors by stacking the columns of the matrix, i.e., "flattening".)
Figure 5.5 Visualization of gradient computation of a matrix with respect to a vector. We are interested in computing the gradient of A ∈ R^{4×2} with respect to a vector x ∈ R^3; the gradient dA/dx ∈ R^{4×2×3}. We follow two equivalent approaches to arrive there. (a) Approach 1: We compute the partial derivatives ∂A/∂x_1, ∂A/∂x_2, ∂A/∂x_3, each of which is a 4 × 2 matrix, and collate them in a 4 × 2 × 3 tensor. (b) Approach 2: We re-shape (flatten) A ∈ R^{4×2} into a vector A ∈ R^8. Then, we compute the gradient dA/dx ∈ R^{8×3}. We obtain the gradient tensor by re-shaping this gradient as illustrated above.

In practical applications, it is often desirable to re-shape the matrix into a vector and continue working with this Jacobian matrix: The chain rule (5.56) boils down to simple matrix multiplication, whereas in the case of a Jacobian tensor, we will need to pay more attention to what dimensions we need to sum out.
Example (Gradient of Vectors with Respect to Matrices)
Let us consider the following example, where

f = Ax , f ∈ R^M , A ∈ R^{M×N} , x ∈ R^N    (5.91)

and where we seek the gradient df/dA. Let us start again by determining the dimension of the gradient as

\frac{df}{dA} \in R^{M×(M×N)} .    (5.92)

By definition, the gradient is the collection of the partial derivatives:

\frac{df}{dA} = \begin{bmatrix} \frac{\partial f_1}{\partial A} \\ \vdots \\ \frac{\partial f_M}{\partial A} \end{bmatrix} , \quad \frac{\partial f_i}{\partial A} \in R^{1×(M×N)} .    (5.93)

To compute the partial derivatives, it will be helpful to explicitly write out the matrix-vector multiplication:

f_i = \sum_{j=1}^{N} A_{ij} x_j , \quad i = 1, \dots, M ,    (5.94)

and the partial derivatives are then given as

\frac{\partial f_i}{\partial A_{iq}} = x_q .    (5.95)

This allows us to compute the partial derivatives of f_i with respect to a row of A, which is given as

\frac{\partial f_i}{\partial A_{i,:}} = x^⊤ \in R^{1×1×N} ,    (5.96)
\frac{\partial f_i}{\partial A_{k \neq i,:}} = 0^⊤ \in R^{1×1×N} ,    (5.97)

where we have to pay attention to the correct dimensionality. Since f_i maps onto R and each row of A is of size 1 × N, we obtain a 1 × 1 × N-sized tensor as the partial derivative of f_i with respect to a row of A.

We can now stack the partial derivatives to obtain the desired gradient as

\frac{\partial f_i}{\partial A} = \begin{bmatrix} 0^⊤ \\ \vdots \\ x^⊤ \\ 0^⊤ \\ \vdots \\ 0^⊤ \end{bmatrix} \in R^{1×(M×N)} .    (5.98)
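The following sketch (not part of the book; all names and the random test data are illustrative) builds the gradient tensor df/dA explicitly and checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 3
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Analytic gradient: grad[i, k, q] = ∂f_i/∂A_{kq} = x_q if k == i, else 0, see (5.95)-(5.97)
grad = np.zeros((M, M, N))
for i in range(M):
    grad[i, i, :] = x

# Finite-difference check, entry by entry
h = 1e-6
grad_fd = np.zeros_like(grad)
for k in range(M):
    for q in range(N):
        A_h = A.copy()
        A_h[k, q] += h
        grad_fd[:, k, q] = (A_h @ x - A @ x) / h

print(np.allclose(grad, grad_fd, atol=1e-4))   # True
```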
Example (Gradient of Matrices with Respect to Matrices)
Consider a matrix L ∈ R^{m×n} and f : R^{m×n} → R^{n×n} with

f(L) = L^⊤L =: K ∈ R^{n×n} ,    (5.99)

where we seek the gradient dK/dL. To solve this hard problem, let us first write down what we already know: We know that the gradient has the dimensions

\frac{dK}{dL} \in R^{(n×n)×(m×n)} ,    (5.100)

which is a tensor. If we compute the partial derivative of f with respect to a single entry L_{ij}, i ∈ {1, . . . , m}, j ∈ {1, . . . , n}, of L, we obtain an n × n matrix

\frac{\partial K}{\partial L_{ij}} \in R^{n×n} .    (5.101)

Furthermore, we know that

\frac{dK_{pq}}{dL} \in R^{1×m×n}    (5.102)

for p, q = 1, . . . , n, where K_{pq} = f_{pq}(L) is the (p, q)-th entry of K = f(L).

Denoting the i-th column of L by l_i, we see that every entry of K is given by an inner product of two columns of L, i.e.,

K_{pq} = l_p^⊤ l_q = \sum_{k=1}^{m} L_{kp} L_{kq} .    (5.103)

When we now compute the partial derivative ∂K_{pq}/∂L_{ij}, we obtain

\frac{\partial K_{pq}}{\partial L_{ij}} = \sum_{k=1}^{m} \frac{\partial}{\partial L_{ij}} L_{kp} L_{kq} = \partial_{pqij} ,    (5.104)

\partial_{pqij} = \begin{cases} L_{iq} & \text{if } j = p, p \neq q \\ L_{ip} & \text{if } j = q, p \neq q \\ 2 L_{iq} & \text{if } j = p, p = q \\ 0 & \text{otherwise} \end{cases}    (5.105)

From (5.100), we know that the desired gradient has the dimension (n × n) × (m × n), and every single entry of this tensor is given by ∂_{pqij} in (5.105), where p, q, j = 1, . . . , n and i = 1, . . . , m.
5.5 Useful Identities for Computing Gradients

In the following, we list some useful gradients that are frequently required in a machine learning context (Petersen and Pedersen, 2012):

\frac{\partial}{\partial X} f(X)^⊤ = \left( \frac{\partial f(X)}{\partial X} \right)^⊤    (5.106)
\frac{\partial}{\partial X} \mathrm{tr}(f(X)) = \mathrm{tr}\left( \frac{\partial f(X)}{\partial X} \right)    (5.107)
\frac{\partial}{\partial X} \det(f(X)) = \det(f(X)) \mathrm{tr}\left( f^{-1}(X) \frac{\partial f(X)}{\partial X} \right)    (5.108)
\frac{\partial}{\partial X} f^{-1}(X) = -f^{-1}(X) \frac{\partial f(X)}{\partial X} f^{-1}(X)    (5.109)
\frac{\partial a^⊤ X^{-1} b}{\partial X} = -(X^{-1})^⊤ a b^⊤ (X^{-1})^⊤    (5.110)
\frac{\partial x^⊤ a}{\partial x} = a^⊤    (5.111)
\frac{\partial a^⊤ x}{\partial x} = a^⊤    (5.112)
\frac{\partial a^⊤ X b}{\partial X} = a b^⊤    (5.113)
\frac{\partial x^⊤ B x}{\partial x} = x^⊤ (B + B^⊤)    (5.114)
\frac{\partial}{\partial s} (x - As)^⊤ W (x - As) = -2 (x - As)^⊤ W A \quad \text{for symmetric } W    (5.115)
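As a quick sketch (not from the book; the helper numerical_grad and the random test data are made up for illustration), we can spot-check the vector identities (5.112) and (5.114) with finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.standard_normal(n)
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def numerical_grad(f, x, h=1e-6):
    # one-sided finite differences along each coordinate direction
    return np.array([(f(x + h * e) - f(x)) / h for e in np.eye(len(x))])

# (5.112): d(a^T x)/dx = a^T
print(np.allclose(numerical_grad(lambda v: a @ v, x), a, atol=1e-4))
# (5.114): d(x^T B x)/dx = x^T (B + B^T)
print(np.allclose(numerical_grad(lambda v: v @ B @ v, x), x @ (B + B.T), atol=1e-4))
```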
5.6 Backpropagation and Automatic Differentiation

In many machine learning applications, we find good model parameters by performing gradient descent (Chapter 7), which relies on the fact that we can compute the gradient of a learning objective with respect to the parameters of the model. For a given objective function, we can obtain the gradient with respect to the model parameters using calculus and applying the chain rule, see Section 5.2.2. We already had a taster in Section 5.3 when we looked at the gradient of a squared loss with respect to the parameters of a linear regression model.

Figure 5.6 Forward pass in a multi-layer neural network to compute the loss function L as a function of the inputs x and the parameters of the network.

Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big) .    (5.116)

By application of the chain rule, and noting that differentiation is linear, we compute the gradient

\frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big)\big(2x + 2x \exp(x^2)\big)
= 2x \left( \frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big) \right)\big(1 + \exp(x^2)\big) .    (5.117)
Writing out the gradient in this explicit way is often impractical since it often results in a very lengthy expression for a derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which is an unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.
5.6.1 Gradients in a Deep Network

In machine learning, the chain rule plays an important role when optimizing parameters of a hierarchical model (e.g., for maximum likelihood estimation). An area where the chain rule is used to an extreme is Deep Learning, where the function value y is computed as a deep function composition

y = (f_K ∘ f_{K−1} ∘ · · · ∘ f_1)(x) = f_K(f_{K−1}(· · · (f_1(x)) · · · )) ,    (5.118)

where x are the inputs (e.g., images), y are the observations (e.g., class labels) and every function f_i, i = 1, . . . , K possesses its own parameters. In neural networks with multiple layers, we have functions f_i(x) = σ(A_i x_{i−1} + b_i) in the ith layer. Here x_{i−1} is the output of layer i − 1 and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to unclutter the notation.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters A_j, b_j for j = 0, . . . , K − 1. This also requires us to compute the gradient of L with respect to the inputs of each layer.

For example, if we have inputs x and observations y and a network structure defined by

f_0 = x    (5.119)
f_i = σ_i(A_{i−1} f_{i−1} + b_{i−1}) , \quad i = 1, . . . , K ,    (5.120)

see also Figure 5.6 for a visualization, we may be interested in finding A_j, b_j for j = 0, . . . , K − 1, such that the squared loss

L(θ) = ‖y − f_K(θ, x)‖^2    (5.121)

is minimized, where θ = {A_j, b_j}, j = 0, . . . , K − 1.

Figure 5.7 Backward pass in a multi-layer neural network to compute the gradients of the loss function.
To obtain the gradients with respect to the parameter set θ, we require the partial derivatives of L with respect to the parameters θ_j = {A_j, b_j} of each layer j = 0, . . . , K − 1. The chain rule allows us to determine the partial derivatives as

\frac{\partial L}{\partial θ_{K−1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial θ_{K−1}}    (5.122)
\frac{\partial L}{\partial θ_{K−2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K−1}} \frac{\partial f_{K−1}}{\partial θ_{K−2}}    (5.123)
\frac{\partial L}{\partial θ_{K−3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K−1}} \frac{\partial f_{K−1}}{\partial f_{K−2}} \frac{\partial f_{K−2}}{\partial θ_{K−3}}    (5.124)
\frac{\partial L}{\partial θ_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K−1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial θ_i}    (5.125)
In (5.122)–(5.125), the factors of the form ∂f_{k}/∂f_{k−1} are partial derivatives of the output of a layer with respect to its inputs, whereas the factors of the form ∂f_{j+1}/∂θ_j are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives ∂L/∂θ_{i+1}, then most of the computation can be reused to compute ∂L/∂θ_i. The additional terms that we need to compute are the last two factors in (5.125). Figure 5.7 visualizes that the gradients are passed backward through the network. A more in-depth discussion about gradients of neural networks can be found at https://tinyurl.com/yalcxgtv.
There are efficient ways of implementing this repeated application of the chain rule using backpropagation (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986). A good discussion about backpropagation and the chain rule is available at https://tinyurl.com/ycfm2yrw.
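The following is a minimal two-layer sketch of the backward recursion (5.122)–(5.125); it is not the book's implementation, and the tanh activation, squared loss and all variable names are choices made here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
D, H, M = 3, 4, 2                       # input, hidden and output dimensions
A0, b0 = rng.standard_normal((H, D)), rng.standard_normal(H)
A1, b1 = rng.standard_normal((M, H)), rng.standard_normal(M)
x, y = rng.standard_normal(D), rng.standard_normal(M)

def forward(A0, b0, A1, b1):
    f1 = np.tanh(A0 @ x + b0)           # first layer, cf. (5.120)
    f2 = A1 @ f1 + b1                   # second (linear) layer
    return f1, f2

def loss(A0, b0, A1, b1):
    _, f2 = forward(A0, b0, A1, b1)
    return np.sum((y - f2) ** 2)        # squared loss (5.121)

# Backward pass: dL/df2 is computed once and reused for earlier layers, cf. (5.122)-(5.125)
f1, f2 = forward(A0, b0, A1, b1)
dL_df2 = 2 * (f2 - y)                   # ∂L/∂f_K
dL_dA1 = np.outer(dL_df2, f1)           # ∂L/∂θ_{K-1}
dL_db1 = dL_df2
dL_dz1 = (A1.T @ dL_df2) * (1 - f1**2)  # propagate through the tanh layer
dL_dA0 = np.outer(dL_dz1, x)            # ∂L/∂θ_{K-2}
dL_db0 = dL_dz1

# Finite-difference check of one entry of one of the gradients
h = 1e-6
A0_h = A0.copy(); A0_h[1, 2] += h
fd = (loss(A0_h, b0, A1, b1) - loss(A0, b0, A1, b1)) / h
print(np.isclose(fd, dL_dA0[1, 2], atol=1e-4))   # True
```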
Figure 5.8 Simple graph illustrating the flow of data from x to y via some intermediate variables a, b.
5.6.2 Automatic Differentiation

It turns out that backpropagation is a special case of a general technique in numerical analysis called automatic differentiation. We can think of automatic differentiation as a set of techniques to numerically (in contrast to symbolically) evaluate the exact (up to machine precision) gradient of a function by working with intermediate variables and applying the chain rule. (Automatic differentiation is different from symbolic differentiation and numerical approximations of the gradient, e.g., by using finite differences.) Automatic differentiation applies a series of elementary arithmetic operations, e.g., addition and multiplication, and elementary functions, e.g., sin, cos, exp, log. By applying the chain rule to these operations, the gradient of quite complicated functions can be computed automatically. Automatic differentiation applies to general computer programs and has forward and reverse modes.
Figure 5.8 shows a simple graph representing the data flow from inputs x to outputs y via some intermediate variables a, b. If we were to compute the derivative dy/dx, we would apply the chain rule and obtain

\frac{dy}{dx} = \frac{dy}{db} \frac{db}{da} \frac{da}{dx} .    (5.126)

Intuitively, the forward and reverse mode differ in the order of multiplication. Due to the associativity of matrix multiplication we can choose between

\frac{dy}{dx} = \left( \frac{dy}{db} \frac{db}{da} \right) \frac{da}{dx} ,    (5.127)
\frac{dy}{dx} = \frac{dy}{db} \left( \frac{db}{da} \frac{da}{dx} \right) .    (5.128)

(In the general case, we work with Jacobians, which can be vectors, matrices or tensors.)
Equation (5.127) would be the reverse mode because gradients are propagated backward through the graph, i.e., reverse to the data flow. Equation (5.128) would be the forward mode, where the gradients flow with the data from left to right through the graph.

In the following, we will focus on reverse mode automatic differentiation, which is backpropagation. In the context of neural networks, where the input dimensionality is often much higher than the dimensionality of the labels, the reverse mode is computationally significantly cheaper than the forward mode. Let us start with an instructive example.
Example
Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big)    (5.129)

from (5.116). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:
Figure 5.9 Computation graph with inputs x, function values f and intermediate variables a, b, c, d, e.
+ f
a = x2 , (5.130)
b = exp(a) , (5.131)
c = a+ b , (5.132)
d =√c , (5.133)
e = cos(c) , (5.134)
f = d+ e . (5.135)
This is the same kind of thinking process that occurs when applying the2962
chain rule. Observe that the above set of equations require fewer opera-2963
tions than a direct naive implementation of the function f(x) as defined2964
in (5.116). The corresponding computation graph in Figure 5.9 shows the2965
flow of data and computations required to obtain the function value f .2966
The set of equations that include intermediate variables can be thought of as a computation graph, a representation that is widely used in implementations of neural network software libraries. We can directly compute the derivatives of the intermediate variables with respect to their corresponding inputs by recalling the definition of the derivative of elementary functions. We obtain:

\frac{\partial a}{\partial x} = 2x ,    (5.136)
\frac{\partial b}{\partial a} = \exp(a) ,    (5.137)
\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b} ,    (5.138)
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}} ,    (5.139)
\frac{\partial e}{\partial c} = -\sin(c) ,    (5.140)
\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e} .    (5.141)
By looking at the computation graph in Figure 5.9, we can compute ∂f/∂x by working backward from the output, and we obtain the following relations:

\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c} ,    (5.142)
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b} ,    (5.143)
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a} ,    (5.144)
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x} .    (5.145)
Note that we have implicitly applied the chain rule to obtain ∂f/∂x. By substituting the results of the derivatives of the elementary functions, we get

\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c)) ,    (5.146)
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1 ,    (5.147)
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1 ,    (5.148)
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x .    (5.149)
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counter-intuitive since the mathematical expression for the derivative ∂f/∂x (5.117) is significantly more complicated than the mathematical expression of the function f(x) in (5.116).
Automatic differentiation is a formalization of the example above. Let x_1, . . . , x_d be the input variables to the function, x_{d+1}, . . . , x_{D−1} be the intermediate variables and x_D the output variable. Then the computation graph can be expressed as an equation

For i = d + 1, . . . , D :  x_i = g_i(x_{Pa(x_i)}) ,    (5.150)

where g_i(·) are elementary functions and x_{Pa(x_i)} are the parent nodes of the variable x_i in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition f = x_D and hence

\frac{\partial f}{\partial x_D} = 1 .    (5.151)

For other variables x_i, we apply the chain rule

\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in Pa(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in Pa(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i} ,    (5.152)

where Pa(x_j) is the set of parent nodes of x_j in the computing graph.

Equation (5.150) is the forward propagation of a function, whereas (5.152)
is the backpropagation of the gradient through the computation graph. (Auto-differentiation in reverse mode requires a parse tree.) For neural network training, we backpropagate the error of the prediction with respect to the label.

The automatic differentiation approach above works whenever we have a function that can be expressed as a computation graph, where the elementary functions are differentiable. In fact, the function may not even be a mathematical function but a computer program. However, not all computer programs can be automatically differentiated, e.g., if we cannot find differentiable elementary functions. Programming structures, such as for loops and if statements, require more care as well.
5.7 Higher-order Derivatives

So far, we discussed gradients, i.e., first-order derivatives. Sometimes, we are interested in derivatives of higher order, e.g., when we want to use Newton's Method for optimization, which requires second-order derivatives (Nocedal and Wright, 2006). In Section 5.1.1, we discussed the Taylor series to approximate functions using polynomials. In the multivariate case, we can do exactly the same, and we will do so in the following. But let us start with some notation.

Consider a function f : R^2 → R of two variables x, y. We use the following notation for higher-order partial derivatives (and for gradients):

• ∂^2f/∂x^2 is the second partial derivative of f with respect to x
• ∂^nf/∂x^n is the nth partial derivative of f with respect to x
• ∂^2f/∂x∂y is the partial derivative obtained by first partially differentiating with respect to x and then with respect to y
• ∂^2f/∂y∂x is the partial derivative obtained by first partially differentiating with respect to y and then with respect to x
The Hessian is the collection of all second-order partial derivatives.

If f(x, y) is a twice (continuously) differentiable function then

\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} ,    (5.153)

i.e., the order of differentiation does not matter, and the corresponding Hessian matrix

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}    (5.154)

is symmetric. Generally, for x ∈ R^n and f : R^n → R, the Hessian is an n × n matrix. The Hessian measures the local geometry of curvature.

Remark (Hessian of a Vector Field). If f : R^n → R^m is a vector field, the Hessian is an (m × n × n)-tensor. ♦
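As a small sketch (not from the book; the helper hessian_fd and the forward-difference scheme are choices made here for illustration), the Hessian of a scalar function can be approximated entrywise by finite differences, which also lets us check the symmetry (5.153); we use the two-variable function from the example in Section 5.8 below:

```python
import numpy as np

def hessian_fd(f, x, h=1e-5):
    """Finite-difference approximation of the Hessian of f: R^n -> R at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h**2
    return H

f = lambda v: v[0]**2 + 2 * v[0] * v[1] + v[1]**3     # f(x, y) = x^2 + 2xy + y^3
H = hessian_fd(f, np.array([1.0, 2.0]))
print(np.round(H, 3))    # approximately [[2, 2], [2, 12]], and symmetric
```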
Figure 5.10 Linear approximation of a function. The original function f is linearized at x_0 = −2 using a first-order Taylor series expansion.
5.8 Linearization and Multivariate Taylor Series

The gradient ∇f of a function f is often used for a locally linear approximation of f around x_0:

f(x) ≈ f(x_0) + (∇_x f)(x_0)(x − x_0) .    (5.155)

Here (∇_x f)(x_0) is the gradient of f with respect to x, evaluated at x_0. Figure 5.10 illustrates the linear approximation of a function f at an input x_0. The original function is approximated by a straight line. This approximation is locally accurate, but the further we move away from x_0 the worse the approximation gets. Equation (5.155) is a special case of a multivariate Taylor series expansion of f at x_0, where we consider only the first two terms. We discuss the more general case in the following, which will allow for better approximations.

Definition 5.7 (Multivariate Taylor Series). For the multivariate Taylor series, we consider a function

f : R^D → R    (5.156)
x ↦ f(x) , x ∈ R^D ,    (5.157)

that is smooth at x_0. When we define the difference vector δ := x − x_0, the Taylor series of f at x_0 is defined as

f(x) = \sum_{k=0}^{\infty} \frac{D_x^k f(x_0)}{k!} δ^k ,    (5.158)

where D_x^k f(x_0) is the k-th (total) derivative of f with respect to x, evaluated at x_0.
Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at x_0 contains the first n + 1 components of the series in (5.158) and is defined as

T_n = \sum_{k=0}^{n} \frac{D_x^k f(x_0)}{k!} δ^k .    (5.159)

Figure 5.11 Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) Given a vector δ ∈ R^4, we obtain the outer product δ^2 := δ ⊗ δ = δδ^⊤ ∈ R^{4×4} as a matrix. (b) An outer product δ^3 := δ ⊗ δ ⊗ δ ∈ R^{4×4×4} results in a third-order tensor ("three-dimensional matrix"), i.e., an array with three indexes.
Remark (Notation). In (5.158) and (5.159), we used the slightly sloppy notation of δ^k, which is not defined for vectors x ∈ R^D, D > 1, and k > 1. Note that both D_x^k f and δ^k are k-th order tensors, i.e., k-dimensional arrays. The k-th order tensor δ^k ∈ R^{D×D×···×D} (k times) is obtained as a k-fold outer product, denoted by ⊗, of the vector δ ∈ R^D. For example,

δ^2 = δ ⊗ δ = δδ^⊤ , δ^2[i, j] = δ[i]δ[j] ,    (5.160)
δ^3 = δ ⊗ δ ⊗ δ , δ^3[i, j, k] = δ[i]δ[j]δ[k] .    (5.161)

(A vector can be implemented as a 1-dimensional array, a matrix as a 2-dimensional array.) Figure 5.11 visualizes two such outer products. In general, we obtain the following terms in the Taylor series:

D_x^k f(x_0) δ^k = \sum_a \cdots \sum_k D_x^k f(x_0)[a, \dots, k] δ[a] \cdots δ[k] ,    (5.162)

where D_x^k f(x_0) δ^k contains k-th order polynomials.
Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms D_x^k f(x_0) δ^k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x_0:

k = 0 : D_x^0 f(x_0) δ^0 = f(x_0) ∈ R    (5.163)
k = 1 : D_x^1 f(x_0) δ^1 = \underbrace{∇_x f(x_0)}_{1×D} \underbrace{δ}_{D×1} = \sum_i ∇_x f(x_0)[i] δ[i] ∈ R    (5.164)
k = 2 : D_x^2 f(x_0) δ^2 = \mathrm{tr}\big( \underbrace{H}_{D×D} \underbrace{δ}_{D×1} \underbrace{δ^⊤}_{1×D} \big) = δ^⊤ H δ    (5.165)
= \sum_i \sum_j H[i, j] δ[i] δ[j] ∈ R    (5.166)
k = 3 : D_x^3 f(x_0) δ^3 = \sum_i \sum_j \sum_k D_x^3 f(x_0)[i, j, k] δ[i] δ[j] δ[k] ∈ R    (5.167)

(In code: np.einsum('i,i', Df1, d), np.einsum('ij,i,j', Df2, d, d), np.einsum('ijk,i,j,k', Df3, d, d, d).) ♦
Example (Taylor-Series Expansion of a Function with Two Variables)
Consider the function

f(x, y) = x^2 + 2xy + y^3 .    (5.168)

We want to compute the Taylor series expansion of f at (x_0, y_0) = (1, 2). Before we start, let us discuss what to expect: The function in (5.168) is a polynomial of degree 3. We are looking for a Taylor series expansion, which itself is a linear combination of polynomials. Therefore, we do not expect the Taylor series expansion to contain terms of fourth or higher order to express a third-order polynomial. This means, it should be sufficient to determine the first four terms of (5.158) for an exact alternative representation of (5.168).

To determine the Taylor series expansion, we start with the constant term and the first-order derivatives, which are given by

f(1, 2) = 13    (5.169)
\frac{\partial f}{\partial x} = 2x + 2y \implies \frac{\partial f}{\partial x}(1, 2) = 6    (5.170)
\frac{\partial f}{\partial y} = 2x + 3y^2 \implies \frac{\partial f}{\partial y}(1, 2) = 14 .    (5.171)

Therefore, we obtain

D^1_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = \left[ \frac{\partial f}{\partial x}(1, 2) \;\; \frac{\partial f}{\partial y}(1, 2) \right] = [6 \;\; 14] ∈ R^{1×2}    (5.172)

such that

\frac{D^1_{x,y} f(1, 2)}{1!} δ = [6 \;\; 14] \begin{bmatrix} x − 1 \\ y − 2 \end{bmatrix} = 6(x − 1) + 14(y − 2) .    (5.173)

Note that D^1_{x,y} f(1, 2)δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by

\frac{\partial^2 f}{\partial x^2} = 2 \implies \frac{\partial^2 f}{\partial x^2}(1, 2) = 2    (5.174)
\frac{\partial^2 f}{\partial y^2} = 6y \implies \frac{\partial^2 f}{\partial y^2}(1, 2) = 12    (5.175)
\frac{\partial^2 f}{\partial x \partial y} = 2 \implies \frac{\partial^2 f}{\partial x \partial y}(1, 2) = 2    (5.176)
\frac{\partial^2 f}{\partial y \partial x} = 2 \implies \frac{\partial^2 f}{\partial y \partial x}(1, 2) = 2 .    (5.177)

When we collect the second-order partial derivatives, we obtain the Hessian

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 6y \end{bmatrix} ,    (5.178)

such that

H(1, 2) = \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} ∈ R^{2×2} .    (5.179)

Therefore, the next term of the Taylor-series expansion is given by

\frac{D^2_{x,y} f(1, 2)}{2!} δ^2 = \frac{1}{2} δ^⊤ H(1, 2) δ    (5.180)
= \frac{1}{2} [x − 1 \;\; y − 2] \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \begin{bmatrix} x − 1 \\ y − 2 \end{bmatrix}    (5.181)
= (x − 1)^2 + 2(x − 1)(y − 2) + 6(y − 2)^2 .    (5.182)

Here, D^2_{x,y} f(1, 2)δ^2 contains only quadratic terms, i.e., second-order polynomials.
The third-order derivatives are obtained as

D^3_x f = \left[ \frac{\partial H}{\partial x} \;\; \frac{\partial H}{\partial y} \right] ∈ R^{2×2×2} ,    (5.183)
D^3_x f[:, :, 1] = \frac{\partial H}{\partial x} = \begin{bmatrix} \frac{\partial^3 f}{\partial x^3} & \frac{\partial^3 f}{\partial x \partial y \partial x} \\ \frac{\partial^3 f}{\partial y \partial x^2} & \frac{\partial^3 f}{\partial y^2 \partial x} \end{bmatrix} ,    (5.184)
D^3_x f[:, :, 2] = \frac{\partial H}{\partial y} = \begin{bmatrix} \frac{\partial^3 f}{\partial x^2 \partial y} & \frac{\partial^3 f}{\partial x \partial y^2} \\ \frac{\partial^3 f}{\partial y \partial x \partial y} & \frac{\partial^3 f}{\partial y^3} \end{bmatrix} .    (5.185)

Since most second-order partial derivatives in the Hessian in (5.178) are constant, the only non-zero third-order partial derivative is

\frac{\partial^3 f}{\partial y^3} = 6 \implies \frac{\partial^3 f}{\partial y^3}(1, 2) = 6 .    (5.186)

Higher-order derivatives and the mixed derivatives of degree 3 (e.g., \frac{\partial^3 f}{\partial x^2 \partial y}) vanish, such that

D^3_{x,y} f[:, :, 1] = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} , \quad D^3_{x,y} f[:, :, 2] = \begin{bmatrix} 0 & 0 \\ 0 & 6 \end{bmatrix}    (5.187)

and

\frac{D^3_{x,y} f(1, 2)}{3!} δ^3 = (y − 2)^3 ,    (5.188)
which collects all cubic terms (third-order polynomials) of the Taylor series.

Overall, the (exact) Taylor series expansion of f at (x_0, y_0) = (1, 2) is

f(x) = f(1, 2) + D^1_{x,y} f(1, 2) δ + \frac{D^2_{x,y} f(1, 2)}{2!} δ^2 + \frac{D^3_{x,y} f(1, 2)}{3!} δ^3    (5.189)
= f(1, 2) + \frac{\partial f(1, 2)}{\partial x}(x − 1) + \frac{\partial f(1, 2)}{\partial y}(y − 2)    (5.190)
+ \frac{1}{2!}\left( \frac{\partial^2 f(1, 2)}{\partial x^2}(x − 1)^2 + \frac{\partial^2 f(1, 2)}{\partial y^2}(y − 2)^2 \right.    (5.191)
\left. + 2\frac{\partial^2 f(1, 2)}{\partial x \partial y}(x − 1)(y − 2) \right) + \frac{1}{6}\frac{\partial^3 f(1, 2)}{\partial y^3}(y − 2)^3    (5.192)
= 13 + 6(x − 1) + 14(y − 2)    (5.193)
+ (x − 1)^2 + 6(y − 2)^2 + 2(x − 1)(y − 2) + (y − 2)^3 .    (5.194)

In this case, we obtained an exact Taylor series expansion of the polynomial in (5.168), i.e., the polynomial in (5.194) is identical to the original polynomial in (5.168). In this particular example, this result is not surprising since the original function was a third-order polynomial, which we expressed through a linear combination of constant terms, first-order, second-order and third-order polynomials in (5.194).
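As a small numerical sketch (not part of the text; the function names are illustrative), we can confirm that the expansion (5.189)–(5.194) reproduces f exactly at arbitrary points:

```python
import numpy as np

def f(x, y):
    return x**2 + 2 * x * y + y**3             # (5.168)

def taylor_at_1_2(x, y):
    dx, dy = x - 1, y - 2                       # δ = (x - x0, y - y0)
    return (13 + 6 * dx + 14 * dy               # constant and linear terms (5.193)
            + dx**2 + 6 * dy**2 + 2 * dx * dy   # quadratic terms
            + dy**3)                            # cubic term, see (5.194)

rng = np.random.default_rng(4)
pts = rng.standard_normal((5, 2))
print(all(np.isclose(f(x, y), taylor_at_1_2(x, y)) for x, y in pts))   # True
```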
5.9 Further Reading

In machine learning (and other disciplines), we often need to compute expectations, i.e., we need to solve integrals of the form

E_x[f(x)] = \int f(x) p(x) dx .    (5.195)

Even if p(x) is in a convenient form (e.g., Gaussian), this integral generally cannot be solved analytically. The Taylor series expansion of f is one way of finding an approximate solution: Assuming p(x) = N(µ, Σ) is Gaussian, then the first-order Taylor series expansion around µ locally linearizes the nonlinear function f. For linear functions, we can compute the mean (and the covariance) exactly if p(x) is Gaussian distributed (see Section 6.5). This property is heavily exploited by the Extended Kalman Filter (Maybeck, 1979) for online state estimation in nonlinear dynamical systems (also called "state-space models"). Other deterministic ways to approximate the integral in (5.195) are the unscented transform (Julier and Uhlmann, 1997), which does not require any gradients, or the Laplace approximation (Bishop, 2006), which uses the Hessian for a local Gaussian approximation of p(x) at the posterior mean.
Exercises

5.1 Consider the following functions

f_1(x) = \sin(x_1) \cos(x_2) , x ∈ R^2    (5.196)
f_2(x, y) = x^⊤ y , x, y ∈ R^n    (5.197)
f_3(x) = x x^⊤ , x ∈ R^n    (5.198)

1. What are the dimensions of ∂f_i/∂x?
2. Compute the Jacobians.

5.2 Differentiate f with respect to t and g with respect to X, where

f(t) = \sin(\log(t^⊤ t)) , t ∈ R^D    (5.199)
g(X) = \mathrm{tr}(AXB) , A ∈ R^{D×E} , X ∈ R^{E×F} , B ∈ R^{F×D} ,    (5.200)

where tr denotes the trace.

5.3 Compute the derivatives df/dx of the following functions by using the chain rule. Provide the dimensions of every single partial derivative. Describe your steps in detail.

1. f(z) = \log(1 + z) , z = x^⊤ x , x ∈ R^D
2. f(z) = \sin(z) , z = Ax + b , A ∈ R^{E×D} , x ∈ R^D , b ∈ R^E ,
   where sin(·) is applied to every element of z.

5.4 Compute the derivatives df/dx of the following functions. Describe your steps in detail.

1. Use the chain rule. Provide the dimensions of every single partial derivative.

   f(z) = \exp(-\tfrac{1}{2} z)
   z = g(y) = y^⊤ S^{-1} y
   y = h(x) = x − µ

   where x, µ ∈ R^D, S ∈ R^{D×D}.

2. f(x) = \mathrm{tr}(x x^⊤ + σ^2 I) , x ∈ R^D
   Here tr(A) is the trace of A, i.e., the sum of the diagonal elements A_{ii}.
   Hint: Explicitly write out the outer product.

3. Use the chain rule. Provide the dimensions of every single partial derivative. You do not need to compute the product of the partial derivatives explicitly.

   f = \tanh(z) ∈ R^M
   z = Ax + b , x ∈ R^N , A ∈ R^{M×N} , b ∈ R^M .

   Here, tanh is applied to every component of z.