5 Vector Calculus
Many algorithms in machine learning are inherently based on optimizing an objective function with respect to a set of desired model parameters that control how well a model explains the data: Finding good parameters can be phrased as an optimization problem. Examples include linear regression (see Chapter 9), where we look at curve-fitting problems and optimize linear weight parameters to maximize the likelihood; neural-network auto-encoders for dimensionality reduction and data compression, where the parameters are the weights and biases of each layer, and where we minimize a reconstruction error by repeated application of the chain rule; Gaussian mixture models (see Chapter 12) for modeling data distributions, where we optimize the location and shape parameters of each mixture component to maximize the likelihood of the model. Figure 5.1 illustrates some of these problems, which we typically solve by using optimization algorithms that exploit gradient information (first-order methods). Figure 5.2 gives an overview of how concepts in this chapter are related and how they are connected to other chapters of the book.

In this chapter, we will discuss how to compute gradients of functions, which is often essential to facilitate learning in machine learning models. Therefore, vector calculus is one of the fundamental mathematical tools we need in machine learning.
Figure 5.1 Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions. (a) Regression problem: Find parameters such that the curve explains the observations (circles) well. (b) Density estimation with a Gaussian mixture model: Find means and covariances such that the data (dots) can be explained well.
Figure 5.2 A mind map of the concepts introduced in this chapter (difference quotient, partial derivatives, Jacobian, Hessian, Taylor series), along with when they are used in other parts of the book (Chapter 7 Optimization, Chapter 9 Regression, Chapter 10 Classification, Chapter 11 Dimensionality reduction, Chapter 12 Density estimation).
Figure 5.3 The average incline of a function f between x_0 and x_0 + δx is the incline of the secant (blue) through f(x_0) and f(x_0 + δx) and given by δy/δx.
5.1 Differentiation of Univariate Functions

In the following, we briefly revisit differentiation of a univariate function, which we may already know from school. We start with the difference quotient of a univariate function y = f(x), x, y ∈ R, which we will subsequently use to define derivatives.
Definition 5.1 (Difference Quotient). The difference quotient

\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}    (5.1)

computes the slope of the secant line through two points on the graph of f. In Figure 5.3 these are the points with x-coordinates x_0 and x_0 + δx.

The difference quotient can also be considered the average slope of f between x and x + δx if we assume f to be a linear function. In the limit for δx → 0, we obtain the tangent of f at x, if f is differentiable. The tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit

\frac{df}{dx} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ,    (5.2)

and the secant in Figure 5.3 becomes a tangent.
Example (Derivative of a Polynomial)
We want to compute the derivative of f(x) = x^n, n ∈ N. We may already know that the answer will be nx^{n-1}, but we want to derive this result using the definition of the derivative as the limit of the difference quotient.

Using the definition of the derivative in (5.2) we obtain

\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}    (5.3)
= \lim_{h \to 0} \frac{(x+h)^n - x^n}{h}    (5.4)
= \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^i - x^n}{h} .    (5.5)

We see that x^n = \binom{n}{0} x^{n-0} h^0. By starting the sum at 1 the x^n-term cancels, and we obtain

\frac{df}{dx} = \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^i}{h}    (5.6)
= \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1}    (5.7)
= \lim_{h \to 0} \left( \binom{n}{1} x^{n-1} + \underbrace{\sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1}}_{\to 0 \text{ as } h \to 0} \right)    (5.8)
= \frac{n!}{1!(n-1)!} x^{n-1} = n x^{n-1} .    (5.9)
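As a quick numerical sanity check (a minimal sketch, not from the book; the helper name difference_quotient is illustrative), we can evaluate the difference quotient (5.1) for a small step size and compare it with the analytic result nx^{n-1}:

```python
import numpy as np

def difference_quotient(f, x, dx=1e-6):
    """Slope of the secant through (x, f(x)) and (x + dx, f(x + dx)), see (5.1)."""
    return (f(x + dx) - f(x)) / dx

n, x = 5, 1.7
numeric = difference_quotient(lambda t: t**n, x)
analytic = n * x**(n - 1)   # n x^{n-1} from (5.9)
print(numeric, analytic)    # the two values agree to several decimal places
```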
5.1.1 Taylor Series

The Taylor series is a representation of a function f as an (infinite) sum of terms. These terms are determined using derivatives of f, evaluated at a point x_0.

Definition 5.3 (Taylor Series). For a function f : R → R, the Taylor series of a smooth function f ∈ C^∞ at x_0 is defined as

f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k ,    (5.10)

where f^{(k)}(x_0) is the kth derivative of f at x_0 and \frac{f^{(k)}(x_0)}{k!} are the coefficients of the polynomial. (f ∈ C^∞ means that f is infinitely often continuously differentiable.) For x_0 = 0, we obtain the Maclaurin series as a special instance of the Taylor series.
Definition 5.4 (Taylor Polynomial). The Taylor polynomial of degree n of f at x_0 contains the first n + 1 terms of the series in (5.10) and is defined as

T_n := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k .    (5.11)
Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around x_0. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree k ≤ n since all derivatives f^{(i)}, i > k vanish. ♦
Example (Taylor Polynomial)
We consider the polynomial

f(x) = x^4    (5.12)

and seek the Taylor polynomial T_6, evaluated at x_0 = 1. We start by computing the coefficients f^{(k)}(1) for k = 0, . . . , 6:

f(1) = 1    (5.13)
f'(1) = 4    (5.14)
f''(1) = 12    (5.15)
f^{(3)}(1) = 24    (5.16)
f^{(4)}(1) = 24    (5.17)
f^{(5)}(1) = 0    (5.18)
f^{(6)}(1) = 0    (5.19)

Therefore, the desired Taylor polynomial is

T_6(x) = \sum_{k=0}^{6} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k    (5.20)
= 1 + 4(x - 1) + 6(x - 1)^2 + 4(x - 1)^3 + (x - 1)^4 + 0 .    (5.21)

Multiplying out and re-arranging yields

T_6(x) = (1 - 4 + 6 - 4 + 1) + x(4 - 12 + 12 - 4) + x^2(6 - 12 + 6) + x^3(4 - 4) + x^4    (5.22)
= x^4 = f(x) ,    (5.23)

i.e., we obtain an exact representation of the original function.
Example (Taylor Series)
Consider the smooth function

f(x) = \sin(x) + \cos(x) ∈ C^∞ .    (5.24)

We seek a Taylor series expansion of f at x_0 = 0, which is the Maclaurin series expansion of f. We obtain the following derivatives:

f(0) = \sin(0) + \cos(0) = 1    (5.25)
f'(0) = \cos(0) - \sin(0) = 1    (5.26)
f''(0) = -\sin(0) - \cos(0) = -1    (5.27)
f^{(3)}(0) = -\cos(0) + \sin(0) = -1    (5.28)
f^{(4)}(0) = \sin(0) + \cos(0) = f(0) = 1    (5.29)
...

We can see a pattern here: The coefficients in our Taylor series are only ±1 (since sin(0) = 0), each of which occurs twice before switching to the other one. Furthermore, f^{(k+4)}(0) = f^{(k)}(0).

Therefore, the full Taylor series expansion of f at x_0 = 0 is given by

f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k    (5.30)
= 1 + x - \frac{1}{2!}x^2 - \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \frac{1}{5!}x^5 - \cdots    (5.31)
= 1 - \frac{1}{2!}x^2 + \frac{1}{4!}x^4 \mp \cdots + x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 \mp \cdots    (5.32)
= \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} + \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1}    (5.33)
= \cos(x) + \sin(x) ,    (5.34)

where we used the power series representations

\cos(x) = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} ,    (5.35)
\sin(x) = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1} .    (5.36)

Figure 5.4 shows the corresponding first Taylor polynomials T_n for n = 0, 1, 5, 10.
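A small numerical sketch (not part of the book; the helper name taylor_sin_cos is made up for illustration) makes the behavior of the truncated series visible by comparing T_n from (5.30) with sin(x) + cos(x):

```python
import numpy as np
from math import factorial

def taylor_sin_cos(x, n):
    """Taylor polynomial T_n of sin(x) + cos(x) at x0 = 0, using the coefficients from (5.30)."""
    # The k-th Maclaurin coefficient of sin + cos cycles through +1, +1, -1, -1, ... divided by k!
    coeffs = [(-1) ** (k // 2) / factorial(k) for k in range(n + 1)]
    return sum(c * x**k for k, c in enumerate(coeffs))

x = 1.5
for n in (0, 1, 5, 10):
    print(n, taylor_sin_cos(x, n), np.sin(x) + np.cos(x))
```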
5.1.2 Differentiation Rules

In the following, we briefly state basic differentiation rules, where we denote the derivative of f by f'.
Figure 5.4 Taylor polynomials. The original function f(x) = sin(x) + cos(x) (black, solid) is approximated by Taylor polynomials (dashed) around x_0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T_{10} is already similar to f in [-4, 4].
Product Rule: (f(x)g(x))' = f'(x)g(x) + f(x)g'(x)    (5.37)
Quotient Rule: \left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}    (5.38)
Sum Rule: (f(x) + g(x))' = f'(x) + g'(x)    (5.39)
Chain Rule: \big(g(f(x))\big)' = (g ∘ f)'(x) = g'(f)f'(x)    (5.40)

Here, g ∘ f is a function composition x ↦ f(x) ↦ g(f(x)).
Example (Chain Rule)
Let us compute the derivative of the function h(x) = (2x + 1)^4 using the chain rule. With

h(x) = (2x + 1)^4 = g(f(x)) ,    (5.41)
f(x) = 2x + 1 ,    (5.42)
g(f) = f^4    (5.43)

we obtain the derivatives of f and g as

f'(x) = 2 ,    (5.44)
g'(f) = 4f^3 ,    (5.45)

such that the derivative of h is given as

h'(x) = g'(f)f'(x) = (4f^3) · 2 \overset{(5.42)}{=} 4(2x + 1)^3 · 2 = 8(2x + 1)^3 ,    (5.46)

where we used the chain rule, see (5.40), and substituted the definition of f in (5.42) in g'(f).
5.2 Partial Differentiation and Gradients

Differentiation as discussed in Section 5.1 applies to functions f of a scalar variable x ∈ R. In the following, we consider the general case where the function f depends on one or more variables x ∈ R^n, e.g., f(x) = f(x_1, x_2). The generalization of the derivative to functions of several variables is the gradient.

We find the gradient of the function f with respect to x by varying one variable at a time and keeping the others constant. The gradient is then the collection of these partial derivatives.
Definition 5.5 (Partial Derivative). For a function f : R^n → R, x ↦ f(x), x ∈ R^n of n variables x_1, . . . , x_n we define the partial derivatives as

\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, \dots, x_n) - f(x)}{h}
\vdots
\frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, \dots, x_{n-1}, x_n + h) - f(x)}{h}    (5.47)

and collect them in the row vector

\nabla_x f = \mathrm{grad}\, f = \frac{df}{dx} = \left[ \frac{\partial f(x)}{\partial x_1} \;\; \frac{\partial f(x)}{\partial x_2} \;\; \cdots \;\; \frac{\partial f(x)}{\partial x_n} \right] \in R^{1 \times n} ,    (5.48)

where n is the number of variables and 1 is the dimension of the image/range of f. Here, we used the compact vector notation x = [x_1, . . . , x_n]^⊤. The row vector in (5.48) is called the gradient of f or the Jacobian and is the generalization of the derivative from Section 5.1.
Example (Partial Derivatives using the Chain Rule)
For f(x, y) = (x + 2y^3)^2, we obtain the partial derivatives

\frac{\partial f(x, y)}{\partial x} = 2(x + 2y^3) \frac{\partial}{\partial x}(x + 2y^3) = 2(x + 2y^3) ,    (5.49)
\frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3) \frac{\partial}{\partial y}(x + 2y^3) = 12(x + 2y^3) y^2 ,    (5.50)

where we used the chain rule (5.40) to compute the partial derivatives. (We can use results from scalar differentiation: Each partial derivative is a derivative with respect to a scalar.)
Remark (Gradient as a Row Vector). It is not uncommon in the literature to define the gradient vector as a column vector, following the convention that vectors are generally column vectors. The reason why we define the gradient vector as a row vector is twofold: First, we can consistently generalize the gradient to a setting where f : R^n → R^m no longer maps onto the real line (then the gradient becomes a matrix). Second, we can immediately apply the multi-variate chain rule without paying attention to the dimension of the gradient. We will discuss both points later. ♦
Example (Gradient)
For f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 ∈ R, the partial derivatives (i.e., the derivatives of f with respect to x_1 and x_2) are

\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3    (5.51)
\frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2    (5.52)

and the gradient is then

\frac{df}{dx} = \left[ \frac{\partial f(x_1, x_2)}{\partial x_1} \;\; \frac{\partial f(x_1, x_2)}{\partial x_2} \right] = \left[ 2 x_1 x_2 + x_2^3 \;\;\; x_1^2 + 3 x_1 x_2^2 \right] \in R^{1 \times 2} .    (5.53)
5.2.1 Basic Rules of Partial Differentiation

In the multivariate case, where x ∈ R^n, the basic differentiation rules that we know from school (e.g., sum rule, product rule, chain rule; see also Section 5.1.2) still apply. However, when we compute derivatives with respect to vectors x ∈ R^n we need to pay attention: Our gradients now involve vectors and matrices, and matrix multiplication is no longer commutative (see Section 2.2.1), i.e., the order matters. (Product rule: (fg)' = f'g + fg'; sum rule: (f + g)' = f' + g'; chain rule: (g(f))' = g'(f)f'.)
Here are the general product rule, sum rule and chain rule:

Product Rule: \frac{\partial}{\partial x}\big(f(x) g(x)\big) = \frac{\partial f}{\partial x} g(x) + f(x) \frac{\partial g}{\partial x}    (5.54)
Sum Rule: \frac{\partial}{\partial x}\big(f(x) + g(x)\big) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}    (5.55)
Chain Rule: \frac{\partial}{\partial x}(g \circ f)(x) = \frac{\partial}{\partial x}\big(g(f(x))\big) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial x}    (5.56)
Let us have a closer look at the chain rule. The chain rule (5.56) resembles to some degree the rules for matrix multiplication where we said that neighboring dimensions have to match for matrix multiplication to be defined, see Section 2.2.1. If we go from left to right, the chain rule exhibits similar properties: ∂f shows up in the "denominator" of the first factor and in the "numerator" of the second factor. If we multiply the factors together, multiplication is defined (the dimensions of ∂f match, and ∂f "cancels"), such that ∂g/∂x remains. (This is only an intuition, but not mathematically correct since the partial derivative is not a fraction.)
5.2.2 Chain Rule

Consider a function f : R^2 → R of two variables x_1, x_2. Furthermore, x_1(t) and x_2(t) are themselves functions of t. To compute the gradient of f with respect to t, we need to apply the chain rule (5.56) for multivariate functions as

\frac{df}{dt} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1(t)}{\partial t} \\ \frac{\partial x_2(t)}{\partial t} \end{bmatrix} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} ,    (5.57)

where d denotes the gradient and ∂ partial derivatives.
Example
Consider f(x_1, x_2) = x_1^2 + 2x_2, where x_1 = sin t and x_2 = cos t, then

\frac{df}{dt} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}    (5.58)
= 2 \sin t \frac{\partial \sin t}{\partial t} + 2 \frac{\partial \cos t}{\partial t}    (5.59)
= 2 \sin t \cos t - 2 \sin t = 2 \sin t (\cos t - 1)    (5.60)

is the corresponding derivative of f with respect to t.
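As a quick sanity check (a sketch that is not part of the text; the names below are illustrative), we can compare (5.60) against a central finite difference of t ↦ f(x_1(t), x_2(t)):

```python
import numpy as np

def f_of_t(t):
    x1, x2 = np.sin(t), np.cos(t)
    return x1**2 + 2 * x2                                 # f(x1, x2) = x1^2 + 2 x2

t, h = 0.8, 1e-5
numeric = (f_of_t(t + h) - f_of_t(t - h)) / (2 * h)       # central difference quotient
analytic = 2 * np.sin(t) * (np.cos(t) - 1)                # (5.60)
print(numeric, analytic)
```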
If f(x_1, x_2) is a function of x_1 and x_2, where x_1(s, t) and x_2(s, t) are themselves functions of two variables s and t, the chain rule yields the partial derivatives

\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s} ,    (5.61)
\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} ,    (5.62)

and the gradient is obtained by the matrix multiplication

\frac{df}{d(s, t)} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial (s, t)} = \underbrace{\left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right]}_{= \frac{\partial f}{\partial x}} \underbrace{\begin{bmatrix} \frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial x}{\partial (s,t)}} .    (5.63)

This compact way of writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to start transposing gradients for the matrix dimensions to match. This may still be straightforward as long as the gradient is a vector or a matrix; however, when the gradient becomes a tensor (we will discuss this in the following), the transpose is no longer a triviality.
Remark (Verifying the Correctness of the Implementation of the Gradient). The definition of the partial derivatives as the limit of the corresponding difference quotient, see (5.47), can be exploited when numerically checking the correctness of gradients in computer programs: When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10^{-4}) and compare the finite-difference approximation from (5.47) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that

\sqrt{\frac{\sum_i (dh_i - df_i)^2}{\sum_i (dh_i + df_i)^2}} < 10^{-3} ,

where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the ith variable x_i. ♦
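The following sketch (not from the book; function names are illustrative) implements this gradient check for the example f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 from (5.51)–(5.53):

```python
import numpy as np

def f(x):
    return x[0]**2 * x[1] + x[0] * x[1]**3

def analytic_gradient(x):
    # (5.51) and (5.52), collected as in (5.53)
    return np.array([2*x[0]*x[1] + x[1]**3, x[0]**2 + 3*x[0]*x[1]**2])

def finite_difference_gradient(f, x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        grad[i] = (f(x + e) - f(x)) / h      # difference quotient, see (5.47)
    return grad

x = np.array([1.3, -0.4])
dh = finite_difference_gradient(f, x)
df = analytic_gradient(x)
rel_error = np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))
print(rel_error < 1e-3)   # True if the implementation is (probably) correct
```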
5.3 Gradients of Vector-Valued Functions (Vector Fields)

Thus far, we discussed partial derivatives and gradients of functions f : R^n → R mapping to the real numbers. In the following, we will generalize the concept of the gradient to vector-valued functions f : R^n → R^m, where n, m > 1.
For a function f : R^n → R^m and a vector x = [x_1, . . . , x_n]^⊤ ∈ R^n, the corresponding vector of function values is given as

f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix} \in R^m .    (5.64)

Writing the vector-valued function in this way allows us to view a vector-valued function f : R^n → R^m as a vector of functions [f_1, . . . , f_m]^⊤, f_i : R^n → R that map onto R. The differentiation rules for every f_i are exactly the ones we discussed in Section 5.2.
Therefore, the partial derivative of a vector-valued function f : R^n → R^m with respect to x_i ∈ R, i = 1, . . . , n, is given as the vector

\frac{\partial f}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} = \begin{bmatrix} \lim_{h \to 0} \frac{f_1(x_1, \dots, x_{i-1}, x_i + h, x_{i+1}, \dots, x_n) - f_1(x)}{h} \\ \vdots \\ \lim_{h \to 0} \frac{f_m(x_1, \dots, x_{i-1}, x_i + h, x_{i+1}, \dots, x_n) - f_m(x)}{h} \end{bmatrix} \in R^m .    (5.65)

From (5.48), we know that we obtain the gradient of f with respect to a vector as the row vector of the partial derivatives. In (5.65), every partial derivative ∂f/∂x_i is a column vector. Therefore, we obtain the gradient of f : R^n → R^m with respect to x ∈ R^n by collecting these
partial derivatives:
\frac{df(x)}{dx} = \left[ \frac{\partial f(x)}{\partial x_1} \;\; \cdots \;\; \frac{\partial f(x)}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{bmatrix} \in R^{m \times n} .    (5.66)
Definition 5.6 (Jacobian). The collection of all first-order partial derivatives of a vector-valued function f : R^n → R^m is called the Jacobian. The Jacobian J is an m × n matrix, which we define and arrange as follows:

J = \nabla_x f = \frac{df(x)}{dx} = \left[ \frac{\partial f(x)}{\partial x_1} \;\; \cdots \;\; \frac{\partial f(x)}{\partial x_n} \right]    (5.67)
= \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{bmatrix} ,    (5.68)

x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} , \quad J(i, j) = \frac{\partial f_i}{\partial x_j} .    (5.69)

(The gradient of a function f : R^n → R^m is a matrix of size m × n.)

In particular, a function f : R^n → R^1, which maps a vector x ∈ R^n onto a scalar (e.g., f(x) = \sum_{i=1}^{n} x_i), possesses a Jacobian that is a row vector (matrix of dimension 1 × n), see (5.48).
Example (Gradient of a Vector Field)
We are given

f(x) = Ax , f(x) ∈ R^M , A ∈ R^{M×N} , x ∈ R^N .

To compute the gradient df/dx we first determine the dimension of df/dx: Since f : R^N → R^M, it follows that df/dx ∈ R^{M×N}. Second, we compute the partial derivatives of f with respect to every x_j:

f_i(x) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij} .    (5.70)

Finally, we collect the partial derivatives in the Jacobian and obtain the gradient as

\frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \in R^{M×N} .    (5.71)
Example (Chain Rule)
Consider the function h : R → R, h(t) = (f ∘ g)(t) with

f : R^2 → R    (5.72)
g : R → R^2    (5.73)
f(x) = \exp(x_1 x_2^2) ,    (5.74)
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix}    (5.75)

and compute the gradient of h with respect to t.

Since f : R^2 → R and g : R → R^2 we note that

\frac{\partial f}{\partial x} \in R^{1×2} ,    (5.76)
\frac{\partial g}{\partial t} \in R^{2×1} .    (5.77)

The desired gradient is computed by applying the chain rule:

\frac{dh}{dt} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial t} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix}    (5.78)
= \left[ \exp(x_1 x_2^2) x_2^2 \;\;\; 2 \exp(x_1 x_2^2) x_1 x_2 \right] \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix}    (5.79)
= \exp(x_1 x_2^2) \big( x_2^2 (\cos t - t \sin t) + 2 x_1 x_2 (\sin t + t \cos t) \big) ,    (5.80)

where x_1 = t cos t and x_2 = t sin t, see (5.75).
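A short numerical sketch (not from the book) makes the dimension bookkeeping in (5.78) concrete: we form the 1×2 Jacobian of f and the 2×1 Jacobian of g at one value of t and multiply them; the variable names are illustrative only.

```python
import numpy as np

t = 1.2
x1, x2 = t * np.cos(t), t * np.sin(t)                # (5.75)

df_dx = np.array([[np.exp(x1 * x2**2) * x2**2,       # ∂f/∂x, shape (1, 2)
                   2 * np.exp(x1 * x2**2) * x1 * x2]])
dx_dt = np.array([[np.cos(t) - t * np.sin(t)],       # ∂x/∂t, shape (2, 1)
                  [np.sin(t) + t * np.cos(t)]])

dh_dt = df_dx @ dx_dt                                # (1, 2) @ (2, 1) -> (1, 1), see (5.78)
print(dh_dt.item())
```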
Example (Gradient of a Linear Model)
Let us consider the linear model

y = Φθ ,    (5.81)

where θ ∈ R^D is a parameter vector, Φ ∈ R^{N×D} are input features and y ∈ R^N are the corresponding observations. We define the following functions:

L(e) := ‖e‖^2 ,    (5.82)
e(θ) := y − Φθ .    (5.83)

We seek ∂L/∂θ, and we will use the chain rule for this purpose. (We will discuss this model in much more detail in Chapter 9 in the context of linear regression.)

Before we start any calculation, we determine the dimensionality of the gradient as

\frac{\partial L}{\partial θ} \in R^{1×D} .    (5.84)
The chain rule allows us to compute the gradient as

\frac{\partial L}{\partial θ} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial θ} ,    (5.85)

i.e., every element is given by

\frac{\partial L}{\partial θ}[1, i] = \sum_{j=1}^{N} \frac{\partial L}{\partial e}[j] \frac{\partial e}{\partial θ}[j, i] .    (5.86)

(In code: dLdtheta = np.einsum('j,ji', dLde, dedtheta).)

We know that ‖e‖^2 = e^⊤e (see Section 3.2) and determine

\frac{\partial L}{\partial e} = 2 e^⊤ \in R^{1×N} .    (5.87)

Furthermore, we obtain

\frac{\partial e}{\partial θ} = −Φ \in R^{N×D} ,    (5.88)

such that our desired derivative is

\frac{\partial L}{\partial θ} = −2 e^⊤ Φ \overset{(5.83)}{=} −2 \underbrace{(y^⊤ − θ^⊤ Φ^⊤)}_{1×N} \underbrace{Φ}_{N×D} \in R^{1×D} .    (5.89)

Remark. We would have obtained the same result without using the chain rule by immediately looking at the function

L_2(θ) := ‖y − Φθ‖^2 = (y − Φθ)^⊤(y − Φθ) .    (5.90)

This approach is still practical for simple functions like L_2 but becomes impractical if we consider deep function compositions. ♦
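Building on the einsum expression above, here is a small sketch (not from the book; variable names and random data are purely illustrative) that evaluates (5.87)–(5.89) and compares the chain-rule result with the gradient obtained directly from (5.90):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

e = y - Phi @ theta                              # (5.83)
dLde = 2 * e                                     # (5.87), row vector 1 x N
dedtheta = -Phi                                  # (5.88), N x D
dLdtheta = np.einsum('j,ji', dLde, dedtheta)     # chain rule (5.86)

direct = -2 * (y - Phi @ theta) @ Phi            # gradient of (5.90), cf. (5.89)
print(np.allclose(dLdtheta, direct))             # True
```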
5.4 Gradients of Matrices

We will encounter situations where we need to take gradients of matrices with respect to vectors (or other matrices), which results in a multi-dimensional tensor. For example, if we compute the gradient of an m × n matrix with respect to a p × q matrix, the resulting Jacobian would be (p × q) × (m × n), i.e., a four-dimensional tensor (or array). Since matrices represent linear mappings, we can exploit the fact that there is a vector-space isomorphism (linear, invertible mapping) between the space R^{m×n} of m × n matrices and the space R^{mn} of mn vectors. Therefore, we can re-shape our matrices into vectors of lengths mn and pq, respectively. The gradient using these mn vectors results in a Jacobian of size pq × mn. Figure 5.5 visualizes both approaches. (Matrices can be transformed into vectors by stacking the columns of the matrix, i.e., "flattening".)
Figure 5.5 Visualization of gradient computation of a matrix with respect to a vector. We are interested in computing the gradient of A ∈ R^{4×2} with respect to a vector x ∈ R^3; the gradient dA/dx ∈ R^{4×2×3}. We follow two equivalent approaches to arrive there. (a) Approach 1: We compute the partial derivatives ∂A/∂x_1, ∂A/∂x_2, ∂A/∂x_3, each of which is a 4 × 2 matrix, and collate them in a 4 × 2 × 3 tensor. (b) Approach 2: We re-shape (flatten) A ∈ R^{4×2} into a vector A ∈ R^8. Then, we compute the gradient dA/dx ∈ R^{8×3}. We obtain the gradient tensor by re-shaping this gradient as illustrated above.

In practical applications, it is often desirable to re-shape the matrix into a vector and continue working with this Jacobian matrix: The chain rule (5.56) boils down to simple matrix multiplication, whereas in the case of a Jacobian tensor, we will need to pay more attention to what dimensions we need to sum out.
Example (Gradient of Vectors with Respect to Matrices)
Let us consider the following example, where

f = Ax , f ∈ R^M , A ∈ R^{M×N} , x ∈ R^N    (5.91)

and where we seek the gradient df/dA. Let us start again by determining the dimension of the gradient as

\frac{df}{dA} \in R^{M×(M×N)} .    (5.92)

By definition, the gradient is the collection of the partial derivatives:

\frac{df}{dA} = \begin{bmatrix} \frac{\partial f_1}{\partial A} \\ \vdots \\ \frac{\partial f_M}{\partial A} \end{bmatrix} , \quad \frac{\partial f_i}{\partial A} \in R^{1×(M×N)} .    (5.93)

To compute the partial derivatives, it will be helpful to explicitly write out the matrix-vector multiplication:

f_i = \sum_{j=1}^{N} A_{ij} x_j , \quad i = 1, \dots, M ,    (5.94)

and the partial derivatives are then given as

\frac{\partial f_i}{\partial A_{iq}} = x_q .    (5.95)

This allows us to compute the partial derivatives of f_i with respect to a row of A, which is given as

\frac{\partial f_i}{\partial A_{i,:}} = x^⊤ \in R^{1×1×N} ,    (5.96)
\frac{\partial f_i}{\partial A_{k \neq i,:}} = 0^⊤ \in R^{1×1×N} ,    (5.97)

where we have to pay attention to the correct dimensionality. Since f_i maps onto R and each row of A is of size 1 × N, we obtain a 1 × 1 × N-sized tensor as the partial derivative of f_i with respect to a row of A.

We can now stack the partial derivatives to obtain the desired gradient as

\frac{\partial f_i}{\partial A} = \begin{bmatrix} 0^⊤ \\ \vdots \\ x^⊤ \\ 0^⊤ \\ \vdots \\ 0^⊤ \end{bmatrix} \in R^{1×(M×N)} .    (5.98)
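The following sketch (not part of the book; all names and the random test data are illustrative) builds the gradient tensor df/dA explicitly and checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 3
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Analytic gradient: grad[i, k, q] = ∂f_i/∂A_{kq} = x_q if k == i, else 0, see (5.95)-(5.97)
grad = np.zeros((M, M, N))
for i in range(M):
    grad[i, i, :] = x

# Finite-difference check, entry by entry
h = 1e-6
grad_fd = np.zeros_like(grad)
for k in range(M):
    for q in range(N):
        A_h = A.copy()
        A_h[k, q] += h
        grad_fd[:, k, q] = (A_h @ x - A @ x) / h

print(np.allclose(grad, grad_fd, atol=1e-4))   # True
```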
Example (Gradient of Matrices with Respect to Matrices)
Consider a matrix L ∈ R^{m×n} and f : R^{m×n} → R^{n×n} with

f(L) = L^⊤L =: K ∈ R^{n×n} ,    (5.99)

where we seek the gradient dK/dL. To solve this hard problem, let us first write down what we already know: We know that the gradient has the dimensions

\frac{dK}{dL} \in R^{(n×n)×(m×n)} ,    (5.100)

which is a tensor. If we compute the partial derivative of f with respect to a single entry L_{ij}, i ∈ {1, . . . , m}, j ∈ {1, . . . , n}, of L, we obtain an n × n matrix

\frac{\partial K}{\partial L_{ij}} \in R^{n×n} .    (5.101)

Furthermore, we know that

\frac{dK_{pq}}{dL} \in R^{1×m×n}    (5.102)

for p, q = 1, . . . , n, where K_{pq} = f_{pq}(L) is the (p, q)-th entry of K = f(L).

Denoting the i-th column of L by l_i, we see that every entry of K is given by an inner product of two columns of L, i.e.,

K_{pq} = l_p^⊤ l_q = \sum_{k=1}^{m} L_{kp} L_{kq} .    (5.103)

When we now compute the partial derivative ∂K_{pq}/∂L_{ij}, we obtain

\frac{\partial K_{pq}}{\partial L_{ij}} = \sum_{k=1}^{m} \frac{\partial}{\partial L_{ij}} L_{kp} L_{kq} = \partial_{pqij} ,    (5.104)

\partial_{pqij} = \begin{cases} L_{iq} & \text{if } j = p, p \neq q \\ L_{ip} & \text{if } j = q, p \neq q \\ 2 L_{iq} & \text{if } j = p, p = q \\ 0 & \text{otherwise} \end{cases}    (5.105)

From (5.100), we know that the desired gradient has the dimension (n × n) × (m × n), and every single entry of this tensor is given by ∂_{pqij} in (5.105), where p, q, j = 1, . . . , n and i = 1, . . . , m.
5.5 Useful Identities for Computing Gradients

In the following, we list some useful gradients that are frequently required in a machine learning context (Petersen and Pedersen, 2012):

\frac{\partial}{\partial X} f(X)^⊤ = \left( \frac{\partial f(X)}{\partial X} \right)^⊤    (5.106)
\frac{\partial}{\partial X} \mathrm{tr}(f(X)) = \mathrm{tr}\left( \frac{\partial f(X)}{\partial X} \right)    (5.107)
\frac{\partial}{\partial X} \det(f(X)) = \det(f(X)) \mathrm{tr}\left( f^{-1}(X) \frac{\partial f(X)}{\partial X} \right)    (5.108)
\frac{\partial}{\partial X} f^{-1}(X) = -f^{-1}(X) \frac{\partial f(X)}{\partial X} f^{-1}(X)    (5.109)
\frac{\partial a^⊤ X^{-1} b}{\partial X} = -(X^{-1})^⊤ a b^⊤ (X^{-1})^⊤    (5.110)
\frac{\partial x^⊤ a}{\partial x} = a^⊤    (5.111)
\frac{\partial a^⊤ x}{\partial x} = a^⊤    (5.112)
\frac{\partial a^⊤ X b}{\partial X} = a b^⊤    (5.113)
\frac{\partial x^⊤ B x}{\partial x} = x^⊤ (B + B^⊤)    (5.114)
\frac{\partial}{\partial s} (x - As)^⊤ W (x - As) = -2 (x - As)^⊤ W A \quad \text{for symmetric } W    (5.115)
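As a quick sketch (not from the book; the helper numerical_grad and the random test data are made up for illustration), we can spot-check the vector identities (5.112) and (5.114) with finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.standard_normal(n)
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def numerical_grad(f, x, h=1e-6):
    # one-sided finite differences along each coordinate direction
    return np.array([(f(x + h * e) - f(x)) / h for e in np.eye(len(x))])

# (5.112): d(a^T x)/dx = a^T
print(np.allclose(numerical_grad(lambda v: a @ v, x), a, atol=1e-4))
# (5.114): d(x^T B x)/dx = x^T (B + B^T)
print(np.allclose(numerical_grad(lambda v: v @ B @ v, x), x @ (B + B.T), atol=1e-4))
```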
5.6 Backpropagation and Automatic Differentiation

In many machine learning applications, we find good model parameters by performing gradient descent (Chapter 7), which relies on the fact that we can compute the gradient of a learning objective with respect to the parameters of the model. For a given objective function, we can obtain the gradient with respect to the model parameters using calculus and applying the chain rule, see Section 5.2.2. We already had a taster in Section 5.3 when we looked at the gradient of a squared loss with respect to the parameters of a linear regression model.

Figure 5.6 Forward pass in a multi-layer neural network to compute the loss function L as a function of the inputs x and the parameters of the network.

Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big) .    (5.116)

By application of the chain rule, and noting that differentiation is linear, we compute the gradient

\frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big)\big(2x + 2x \exp(x^2)\big)
= 2x \left( \frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big) \right)\big(1 + \exp(x^2)\big) .    (5.117)
Writing out the gradient in this explicit way is often impractical since it often results in a very lengthy expression for a derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which is an unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.
5.6.1 Gradients in a Deep Network

In machine learning, the chain rule plays an important role when optimizing parameters of a hierarchical model (e.g., for maximum likelihood estimation). An area where the chain rule is used to an extreme is Deep Learning, where the function value y is computed as a deep function composition

y = (f_K ∘ f_{K−1} ∘ · · · ∘ f_1)(x) = f_K(f_{K−1}(· · · (f_1(x)) · · · )) ,    (5.118)

where x are the inputs (e.g., images), y are the observations (e.g., class labels) and every function f_i, i = 1, . . . , K possesses its own parameters. In neural networks with multiple layers, we have functions f_i(x) = σ(A_i x_{i−1} + b_i) in the ith layer. Here x_{i−1} is the output of layer i − 1 and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to unclutter the notation.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters A_j, b_j for j = 0, . . . , K − 1. This also requires us to compute the gradient of L with respect to the inputs of each layer.

For example, if we have inputs x and observations y and a network structure defined by

f_0 = x    (5.119)
f_i = σ_i(A_{i−1} f_{i−1} + b_{i−1}) , \quad i = 1, . . . , K ,    (5.120)

see also Figure 5.6 for a visualization, we may be interested in finding A_j, b_j for j = 0, . . . , K − 1, such that the squared loss

L(θ) = ‖y − f_K(θ, x)‖^2    (5.121)

is minimized, where θ = {A_j, b_j}, j = 0, . . . , K − 1.

Figure 5.7 Backward pass in a multi-layer neural network to compute the gradients of the loss function.
To obtain the gradients with respect to the parameter set θ, we require the partial derivatives of L with respect to the parameters θ_j = {A_j, b_j} of each layer j = 0, . . . , K − 1. The chain rule allows us to determine the partial derivatives as

\frac{\partial L}{\partial θ_{K−1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial θ_{K−1}}    (5.122)
\frac{\partial L}{\partial θ_{K−2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K−1}} \frac{\partial f_{K−1}}{\partial θ_{K−2}}    (5.123)
\frac{\partial L}{\partial θ_{K−3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K−1}} \frac{\partial f_{K−1}}{\partial f_{K−2}} \frac{\partial f_{K−2}}{\partial θ_{K−3}}    (5.124)
\frac{\partial L}{\partial θ_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K−1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial θ_i}    (5.125)
In (5.122)–(5.125), the factors of the form ∂f_{k}/∂f_{k−1} are partial derivatives of the output of a layer with respect to its inputs, whereas the factors of the form ∂f_{j+1}/∂θ_j are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives ∂L/∂θ_{i+1}, then most of the computation can be reused to compute ∂L/∂θ_i. The additional terms that we need to compute are the last two factors in (5.125). Figure 5.7 visualizes that the gradients are passed backward through the network. A more in-depth discussion about gradients of neural networks can be found at https://tinyurl.com/yalcxgtv.
There are efficient ways of implementing this repeated application of the chain rule using backpropagation (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986). A good discussion about backpropagation and the chain rule is available at https://tinyurl.com/ycfm2yrw.
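The following is a minimal two-layer sketch of the backward recursion (5.122)–(5.125); it is not the book's implementation, and the tanh activation, squared loss and all variable names are choices made here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
D, H, M = 3, 4, 2                       # input, hidden and output dimensions
A0, b0 = rng.standard_normal((H, D)), rng.standard_normal(H)
A1, b1 = rng.standard_normal((M, H)), rng.standard_normal(M)
x, y = rng.standard_normal(D), rng.standard_normal(M)

def forward(A0, b0, A1, b1):
    f1 = np.tanh(A0 @ x + b0)           # first layer, cf. (5.120)
    f2 = A1 @ f1 + b1                   # second (linear) layer
    return f1, f2

def loss(A0, b0, A1, b1):
    _, f2 = forward(A0, b0, A1, b1)
    return np.sum((y - f2) ** 2)        # squared loss (5.121)

# Backward pass: dL/df2 is computed once and reused for earlier layers, cf. (5.122)-(5.125)
f1, f2 = forward(A0, b0, A1, b1)
dL_df2 = 2 * (f2 - y)                   # ∂L/∂f_K
dL_dA1 = np.outer(dL_df2, f1)           # ∂L/∂θ_{K-1}
dL_db1 = dL_df2
dL_dz1 = (A1.T @ dL_df2) * (1 - f1**2)  # propagate through the tanh layer
dL_dA0 = np.outer(dL_dz1, x)            # ∂L/∂θ_{K-2}
dL_db0 = dL_dz1

# Finite-difference check of one entry of one of the gradients
h = 1e-6
A0_h = A0.copy(); A0_h[1, 2] += h
fd = (loss(A0_h, b0, A1, b1) - loss(A0, b0, A1, b1)) / h
print(np.isclose(fd, dL_dA0[1, 2], atol=1e-4))   # True
```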
Figure 5.8 Simple graph illustrating the flow of data from x to y via some intermediate variables a, b.
5.6.2 Automatic Differentiation

It turns out that backpropagation is a special case of a general technique in numerical analysis called automatic differentiation. We can think of automatic differentiation as a set of techniques to numerically (in contrast to symbolically) evaluate the exact (up to machine precision) gradient of a function by working with intermediate variables and applying the chain rule. (Automatic differentiation is different from symbolic differentiation and numerical approximations of the gradient, e.g., by using finite differences.) Automatic differentiation applies a series of elementary arithmetic operations, e.g., addition and multiplication, and elementary functions, e.g., sin, cos, exp, log. By applying the chain rule to these operations, the gradient of quite complicated functions can be computed automatically. Automatic differentiation applies to general computer programs and has forward and reverse modes.
Figure 5.8 shows a simple graph representing the data flow from inputs x to outputs y via some intermediate variables a, b. If we were to compute the derivative dy/dx, we would apply the chain rule and obtain

\frac{dy}{dx} = \frac{dy}{db} \frac{db}{da} \frac{da}{dx} .    (5.126)

Intuitively, the forward and reverse mode differ in the order of multiplication. Due to the associativity of matrix multiplication we can choose between

\frac{dy}{dx} = \left( \frac{dy}{db} \frac{db}{da} \right) \frac{da}{dx} ,    (5.127)
\frac{dy}{dx} = \frac{dy}{db} \left( \frac{db}{da} \frac{da}{dx} \right) .    (5.128)

(In the general case, we work with Jacobians, which can be vectors, matrices or tensors.)
Equation (5.127) would be the reverse mode because gradients are propagated backward through the graph, i.e., reverse to the data flow. Equation (5.128) would be the forward mode, where the gradients flow with the data from left to right through the graph.

In the following, we will focus on reverse mode automatic differentiation, which is backpropagation. In the context of neural networks, where the input dimensionality is often much higher than the dimensionality of the labels, the reverse mode is computationally significantly cheaper than the forward mode. Let us start with an instructive example.
Example
Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big)    (5.129)

from (5.116). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:
Figure 5.9 Computation graph with inputs x, function values f and intermediate variables a, b, c, d, e.
+ f
a = x2 , (5.130)
b = exp(a) , (5.131)
c = a+ b , (5.132)
d =√c , (5.133)
e = cos(c) , (5.134)
f = d+ e . (5.135)
This is the same kind of thinking process that occurs when applying the2962
chain rule. Observe that the above set of equations require fewer opera-2963
tions than a direct naive implementation of the function f(x) as defined2964
in (5.116). The corresponding computation graph in Figure 5.9 shows the2965
flow of data and computations required to obtain the function value f .2966
The set of equations that include intermediate variables can be thought of as a computation graph, a representation that is widely used in implementations of neural network software libraries. We can directly compute the derivatives of the intermediate variables with respect to their corresponding inputs by recalling the definition of the derivative of elementary functions. We obtain:

\frac{\partial a}{\partial x} = 2x ,    (5.136)
\frac{\partial b}{\partial a} = \exp(a) ,    (5.137)
\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b} ,    (5.138)
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}} ,    (5.139)
\frac{\partial e}{\partial c} = -\sin(c) ,    (5.140)
\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e} .    (5.141)
By looking at the computation graph in Figure 5.9, we can compute ∂f/∂x by working backward from the output, and we obtain the following relations:

\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c} ,    (5.142)
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b} ,    (5.143)
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a} ,    (5.144)
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x} .    (5.145)
Note that we have implicitly applied the chain rule to obtain ∂f/∂x. By substituting the results of the derivatives of the elementary functions, we get

\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c)) ,    (5.146)
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1 ,    (5.147)
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1 ,    (5.148)
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x .    (5.149)
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counter-intuitive since the mathematical expression for the derivative ∂f/∂x (5.117) is significantly more complicated than the mathematical expression of the function f(x) in (5.116).
Automatic differentiation is a formalization of the example above. Let x_1, . . . , x_d be the input variables to the function, x_{d+1}, . . . , x_{D−1} be the intermediate variables and x_D the output variable. Then the computation graph can be expressed as an equation

For i = d + 1, . . . , D :  x_i = g_i(x_{Pa(x_i)}) ,    (5.150)

where g_i(·) are elementary functions and x_{Pa(x_i)} are the parent nodes of the variable x_i in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition f = x_D and hence

\frac{\partial f}{\partial x_D} = 1 .    (5.151)

For other variables x_i, we apply the chain rule

\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in Pa(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in Pa(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i} ,    (5.152)

where Pa(x_j) is the set of parent nodes of x_j in the computing graph.

Equation (5.150) is the forward propagation of a function, whereas (5.152)
is the backpropagation of the gradient through the computation graph. (Auto-differentiation in reverse mode requires a parse tree.) For neural network training, we backpropagate the error of the prediction with respect to the label.

The automatic differentiation approach above works whenever we have a function that can be expressed as a computation graph, where the elementary functions are differentiable. In fact, the function may not even be a mathematical function but a computer program. However, not all computer programs can be automatically differentiated, e.g., if we cannot find differentiable elementary functions. Programming structures, such as for loops and if statements, require more care as well.
5.7 Higher-order Derivatives

So far, we discussed gradients, i.e., first-order derivatives. Sometimes, we are interested in derivatives of higher order, e.g., when we want to use Newton's Method for optimization, which requires second-order derivatives (Nocedal and Wright, 2006). In Section 5.1.1, we discussed the Taylor series to approximate functions using polynomials. In the multivariate case, we can do exactly the same, and we will do so in the following. But let us start with some notation.

Consider a function f : R^2 → R of two variables x, y. We use the following notation for higher-order partial derivatives (and for gradients):

• ∂^2f/∂x^2 is the second partial derivative of f with respect to x
• ∂^nf/∂x^n is the nth partial derivative of f with respect to x
• ∂^2f/∂x∂y is the partial derivative obtained by first partially differentiating with respect to x and then with respect to y
• ∂^2f/∂y∂x is the partial derivative obtained by first partially differentiating with respect to y and then with respect to x
The Hessian is the collection of all second-order partial derivatives.

If f(x, y) is a twice (continuously) differentiable function then

\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} ,    (5.153)

i.e., the order of differentiation does not matter, and the corresponding Hessian matrix

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}    (5.154)

is symmetric. Generally, for x ∈ R^n and f : R^n → R, the Hessian is an n × n matrix. The Hessian measures the local geometry of curvature.

Remark (Hessian of a Vector Field). If f : R^n → R^m is a vector field, the Hessian is an (m × n × n)-tensor. ♦
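As a small sketch (not from the book; the helper hessian_fd and the forward-difference scheme are choices made here for illustration), the Hessian of a scalar function can be approximated entrywise by finite differences, which also lets us check the symmetry (5.153); we use the two-variable function from the example in Section 5.8 below:

```python
import numpy as np

def hessian_fd(f, x, h=1e-5):
    """Finite-difference approximation of the Hessian of f: R^n -> R at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h**2
    return H

f = lambda v: v[0]**2 + 2 * v[0] * v[1] + v[1]**3     # f(x, y) = x^2 + 2xy + y^3
H = hessian_fd(f, np.array([1.0, 2.0]))
print(np.round(H, 3))    # approximately [[2, 2], [2, 12]], and symmetric
```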
Figure 5.10 Linear approximation of a function. The original function f is linearized at x_0 = −2 using a first-order Taylor series expansion.
5.8 Linearization and Multivariate Taylor Series

The gradient ∇f of a function f is often used for a locally linear approximation of f around x_0:

f(x) ≈ f(x_0) + (∇_x f)(x_0)(x − x_0) .    (5.155)

Here (∇_x f)(x_0) is the gradient of f with respect to x, evaluated at x_0. Figure 5.10 illustrates the linear approximation of a function f at an input x_0. The original function is approximated by a straight line. This approximation is locally accurate, but the further we move away from x_0 the worse the approximation gets. Equation (5.155) is a special case of a multivariate Taylor series expansion of f at x_0, where we consider only the first two terms. We discuss the more general case in the following, which will allow for better approximations.

Definition 5.7 (Multivariate Taylor Series). For the multivariate Taylor series, we consider a function

f : R^D → R    (5.156)
x ↦ f(x) , x ∈ R^D ,    (5.157)

that is smooth at x_0. When we define the difference vector δ := x − x_0, the Taylor series of f at x_0 is defined as

f(x) = \sum_{k=0}^{\infty} \frac{D_x^k f(x_0)}{k!} δ^k ,    (5.158)

where D_x^k f(x_0) is the k-th (total) derivative of f with respect to x, evaluated at x_0.
Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at x_0 contains the first n + 1 components of the series in (5.158) and is defined as

T_n = \sum_{k=0}^{n} \frac{D_x^k f(x_0)}{k!} δ^k .    (5.159)

Figure 5.11 Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) Given a vector δ ∈ R^4, we obtain the outer product δ^2 := δ ⊗ δ = δδ^⊤ ∈ R^{4×4} as a matrix. (b) An outer product δ^3 := δ ⊗ δ ⊗ δ ∈ R^{4×4×4} results in a third-order tensor ("three-dimensional matrix"), i.e., an array with three indexes.
Remark (Notation). In (5.158) and (5.159), we used the slightly sloppy notation of δ^k, which is not defined for vectors x ∈ R^D, D > 1, and k > 1. Note that both D_x^k f and δ^k are k-th order tensors, i.e., k-dimensional arrays. The k-th order tensor δ^k ∈ R^{D×D×···×D} (k times) is obtained as a k-fold outer product, denoted by ⊗, of the vector δ ∈ R^D. For example,

δ^2 = δ ⊗ δ = δδ^⊤ , δ^2[i, j] = δ[i]δ[j] ,    (5.160)
δ^3 = δ ⊗ δ ⊗ δ , δ^3[i, j, k] = δ[i]δ[j]δ[k] .    (5.161)

(A vector can be implemented as a 1-dimensional array, a matrix as a 2-dimensional array.) Figure 5.11 visualizes two such outer products. In general, we obtain the following terms in the Taylor series:

D_x^k f(x_0) δ^k = \sum_a \cdots \sum_k D_x^k f(x_0)[a, \dots, k] δ[a] \cdots δ[k] ,    (5.162)

where D_x^k f(x_0) δ^k contains k-th order polynomials.
Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms D_x^k f(x_0) δ^k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x_0:

k = 0 : D_x^0 f(x_0) δ^0 = f(x_0) ∈ R    (5.163)
k = 1 : D_x^1 f(x_0) δ^1 = \underbrace{∇_x f(x_0)}_{1×D} \underbrace{δ}_{D×1} = \sum_i ∇_x f(x_0)[i] δ[i] ∈ R    (5.164)
k = 2 : D_x^2 f(x_0) δ^2 = \mathrm{tr}\big( \underbrace{H}_{D×D} \underbrace{δ}_{D×1} \underbrace{δ^⊤}_{1×D} \big) = δ^⊤ H δ    (5.165)
= \sum_i \sum_j H[i, j] δ[i] δ[j] ∈ R    (5.166)
k = 3 : D_x^3 f(x_0) δ^3 = \sum_i \sum_j \sum_k D_x^3 f(x_0)[i, j, k] δ[i] δ[j] δ[k] ∈ R    (5.167)

(In code: np.einsum('i,i', Df1, d), np.einsum('ij,i,j', Df2, d, d), np.einsum('ijk,i,j,k', Df3, d, d, d).) ♦
Example (Taylor-Series Expansion of a Function with Two Variables)
Consider the function

f(x, y) = x^2 + 2xy + y^3 .    (5.168)

We want to compute the Taylor series expansion of f at (x_0, y_0) = (1, 2). Before we start, let us discuss what to expect: The function in (5.168) is a polynomial of degree 3. We are looking for a Taylor series expansion, which itself is a linear combination of polynomials. Therefore, we do not expect the Taylor series expansion to contain terms of fourth or higher order to express a third-order polynomial. This means, it should be sufficient to determine the first four terms of (5.158) for an exact alternative representation of (5.168).

To determine the Taylor series expansion, we start with the constant term and the first-order derivatives, which are given by

f(1, 2) = 13    (5.169)
\frac{\partial f}{\partial x} = 2x + 2y \implies \frac{\partial f}{\partial x}(1, 2) = 6    (5.170)
\frac{\partial f}{\partial y} = 2x + 3y^2 \implies \frac{\partial f}{\partial y}(1, 2) = 14 .    (5.171)

Therefore, we obtain

D^1_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = \left[ \frac{\partial f}{\partial x}(1, 2) \;\; \frac{\partial f}{\partial y}(1, 2) \right] = [6 \;\; 14] ∈ R^{1×2}    (5.172)

such that

\frac{D^1_{x,y} f(1, 2)}{1!} δ = [6 \;\; 14] \begin{bmatrix} x − 1 \\ y − 2 \end{bmatrix} = 6(x − 1) + 14(y − 2) .    (5.173)

Note that D^1_{x,y} f(1, 2)δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by

\frac{\partial^2 f}{\partial x^2} = 2 \implies \frac{\partial^2 f}{\partial x^2}(1, 2) = 2    (5.174)
\frac{\partial^2 f}{\partial y^2} = 6y \implies \frac{\partial^2 f}{\partial y^2}(1, 2) = 12    (5.175)
\frac{\partial^2 f}{\partial x \partial y} = 2 \implies \frac{\partial^2 f}{\partial x \partial y}(1, 2) = 2    (5.176)
\frac{\partial^2 f}{\partial y \partial x} = 2 \implies \frac{\partial^2 f}{\partial y \partial x}(1, 2) = 2 .    (5.177)

When we collect the second-order partial derivatives, we obtain the Hessian

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 6y \end{bmatrix} ,    (5.178)

such that

H(1, 2) = \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} ∈ R^{2×2} .    (5.179)

Therefore, the next term of the Taylor-series expansion is given by

\frac{D^2_{x,y} f(1, 2)}{2!} δ^2 = \frac{1}{2} δ^⊤ H(1, 2) δ    (5.180)
= \frac{1}{2} [x − 1 \;\; y − 2] \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \begin{bmatrix} x − 1 \\ y − 2 \end{bmatrix}    (5.181)
= (x − 1)^2 + 2(x − 1)(y − 2) + 6(y − 2)^2 .    (5.182)

Here, D^2_{x,y} f(1, 2)δ^2 contains only quadratic terms, i.e., second-order polynomials.
The third-order derivatives are obtained as

D^3_x f = \left[ \frac{\partial H}{\partial x} \;\; \frac{\partial H}{\partial y} \right] ∈ R^{2×2×2} ,    (5.183)
D^3_x f[:, :, 1] = \frac{\partial H}{\partial x} = \begin{bmatrix} \frac{\partial^3 f}{\partial x^3} & \frac{\partial^3 f}{\partial x \partial y \partial x} \\ \frac{\partial^3 f}{\partial y \partial x^2} & \frac{\partial^3 f}{\partial y^2 \partial x} \end{bmatrix} ,    (5.184)
D^3_x f[:, :, 2] = \frac{\partial H}{\partial y} = \begin{bmatrix} \frac{\partial^3 f}{\partial x^2 \partial y} & \frac{\partial^3 f}{\partial x \partial y^2} \\ \frac{\partial^3 f}{\partial y \partial x \partial y} & \frac{\partial^3 f}{\partial y^3} \end{bmatrix} .    (5.185)

Since most second-order partial derivatives in the Hessian in (5.178) are constant, the only non-zero third-order partial derivative is

\frac{\partial^3 f}{\partial y^3} = 6 \implies \frac{\partial^3 f}{\partial y^3}(1, 2) = 6 .    (5.186)

Higher-order derivatives and the mixed derivatives of degree 3 (e.g., \frac{\partial^3 f}{\partial x^2 \partial y}) vanish, such that

D^3_{x,y} f[:, :, 1] = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} , \quad D^3_{x,y} f[:, :, 2] = \begin{bmatrix} 0 & 0 \\ 0 & 6 \end{bmatrix}    (5.187)

and

\frac{D^3_{x,y} f(1, 2)}{3!} δ^3 = (y − 2)^3 ,    (5.188)
which collects all cubic terms (third-order polynomials) of the Taylor series.

Overall, the (exact) Taylor series expansion of f at (x_0, y_0) = (1, 2) is

f(x) = f(1, 2) + D^1_{x,y} f(1, 2) δ + \frac{D^2_{x,y} f(1, 2)}{2!} δ^2 + \frac{D^3_{x,y} f(1, 2)}{3!} δ^3    (5.189)
= f(1, 2) + \frac{\partial f(1, 2)}{\partial x}(x − 1) + \frac{\partial f(1, 2)}{\partial y}(y − 2)    (5.190)
+ \frac{1}{2!}\left( \frac{\partial^2 f(1, 2)}{\partial x^2}(x − 1)^2 + \frac{\partial^2 f(1, 2)}{\partial y^2}(y − 2)^2 \right.    (5.191)
\left. + 2\frac{\partial^2 f(1, 2)}{\partial x \partial y}(x − 1)(y − 2) \right) + \frac{1}{6}\frac{\partial^3 f(1, 2)}{\partial y^3}(y − 2)^3    (5.192)
= 13 + 6(x − 1) + 14(y − 2)    (5.193)
+ (x − 1)^2 + 6(y − 2)^2 + 2(x − 1)(y − 2) + (y − 2)^3 .    (5.194)

In this case, we obtained an exact Taylor series expansion of the polynomial in (5.168), i.e., the polynomial in (5.194) is identical to the original polynomial in (5.168). In this particular example, this result is not surprising since the original function was a third-order polynomial, which we expressed through a linear combination of constant terms, first-order, second-order and third-order polynomials in (5.194).
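As a small numerical sketch (not part of the text; the function names are illustrative), we can confirm that the expansion (5.189)–(5.194) reproduces f exactly at arbitrary points:

```python
import numpy as np

def f(x, y):
    return x**2 + 2 * x * y + y**3             # (5.168)

def taylor_at_1_2(x, y):
    dx, dy = x - 1, y - 2                       # δ = (x - x0, y - y0)
    return (13 + 6 * dx + 14 * dy               # constant and linear terms (5.193)
            + dx**2 + 6 * dy**2 + 2 * dx * dy   # quadratic terms
            + dy**3)                            # cubic term, see (5.194)

rng = np.random.default_rng(4)
pts = rng.standard_normal((5, 2))
print(all(np.isclose(f(x, y), taylor_at_1_2(x, y)) for x, y in pts))   # True
```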
5.9 Further Reading

In machine learning (and other disciplines), we often need to compute expectations, i.e., we need to solve integrals of the form

E_x[f(x)] = \int f(x) p(x) dx .    (5.195)

Even if p(x) is in a convenient form (e.g., Gaussian), this integral generally cannot be solved analytically. The Taylor series expansion of f is one way of finding an approximate solution: Assuming p(x) = N(µ, Σ) is Gaussian, then the first-order Taylor series expansion around µ locally linearizes the nonlinear function f. For linear functions, we can compute the mean (and the covariance) exactly if p(x) is Gaussian distributed (see Section 6.5). This property is heavily exploited by the Extended Kalman Filter (Maybeck, 1979) for online state estimation in nonlinear dynamical systems (also called "state-space models"). Other deterministic ways to approximate the integral in (5.195) are the unscented transform (Julier and Uhlmann, 1997), which does not require any gradients, or the Laplace approximation (Bishop, 2006), which uses the Hessian for a local Gaussian approximation of p(x) at the posterior mean.
Exercises

5.1 Consider the following functions

f_1(x) = \sin(x_1) \cos(x_2) , x ∈ R^2    (5.196)
f_2(x, y) = x^⊤ y , x, y ∈ R^n    (5.197)
f_3(x) = x x^⊤ , x ∈ R^n    (5.198)

1. What are the dimensions of ∂f_i/∂x?
2. Compute the Jacobians.

5.2 Differentiate f with respect to t and g with respect to X, where

f(t) = \sin(\log(t^⊤ t)) , t ∈ R^D    (5.199)
g(X) = \mathrm{tr}(AXB) , A ∈ R^{D×E} , X ∈ R^{E×F} , B ∈ R^{F×D} ,    (5.200)

where tr denotes the trace.

5.3 Compute the derivatives df/dx of the following functions by using the chain rule. Provide the dimensions of every single partial derivative. Describe your steps in detail.

1. f(z) = \log(1 + z) , z = x^⊤ x , x ∈ R^D
2. f(z) = \sin(z) , z = Ax + b , A ∈ R^{E×D} , x ∈ R^D , b ∈ R^E ,
   where sin(·) is applied to every element of z.

5.4 Compute the derivatives df/dx of the following functions. Describe your steps in detail.

1. Use the chain rule. Provide the dimensions of every single partial derivative.

   f(z) = \exp(-\tfrac{1}{2} z)
   z = g(y) = y^⊤ S^{-1} y
   y = h(x) = x − µ

   where x, µ ∈ R^D, S ∈ R^{D×D}.

2. f(x) = \mathrm{tr}(x x^⊤ + σ^2 I) , x ∈ R^D
   Here tr(A) is the trace of A, i.e., the sum of the diagonal elements A_{ii}.
   Hint: Explicitly write out the outer product.

3. Use the chain rule. Provide the dimensions of every single partial derivative. You do not need to compute the product of the partial derivatives explicitly.

   f = \tanh(z) ∈ R^M
   z = Ax + b , x ∈ R^N , A ∈ R^{M×N} , b ∈ R^M .

   Here, tanh is applied to every component of z.