
Vector Calculus

Marc Deisenroth
Centre for Artificial Intelligence, Department of Computer Science, University College London

@mpd37 · m.deisenroth@ucl.ac.uk

https://deisenroth.cc

AIMS Rwanda and AIMS Ghana

March/April, 2020

Reading Material

Mathematics for Machine Learning
Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book's web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.

Chapter 5: https://mml-book.com

Regression in Machine Learning (1)

[Figure: data points and the maximum likelihood estimate with a polynomial of degree 5; axes $x \in [-5, 5]$ and $f(x) \in [-3, 3]$.]

Setting: inputs $x \in \mathbb{R}^D$, outputs/targets $y \in \mathbb{R}$

Goal: find a function that models the relationship between $x$ and $y$ (regression/curve fitting)

Model: a function $f$ that depends on parameters $\theta$. Examples:

Linear model: $f(x, \theta) = \theta^\top x$, with $x, \theta \in \mathbb{R}^D$

Neural network: $f(x, \theta) = \mathrm{NN}(x, \theta)$

Regression in Machine Learning (2)

Training data, e.g., $N$ pairs $(x_i, y_i)$ of inputs $x_i$ and observations $y_i$

Training the model means finding parameters $\theta^*$ such that $f(x_i, \theta^*) \approx y_i$

[Figure: data points and the maximum likelihood estimate with a polynomial of degree 5; axes $x \in [-5, 5]$ and $f(x) \in [-3, 3]$.]

Define a loss function, e.g., $\sum_{i=1}^N (y_i - f(x_i, \theta))^2$, which we want to optimize

Typically: optimization based on some form of gradient descent, so differentiation is required.

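To make this training setup concrete, here is a minimal sketch (not from the slides; the data, step size, and iteration count are invented) that fits the linear model $f(x, \theta) = \theta^\top x$ by gradient descent on the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: N pairs (x_i, y_i) with a known linear relationship.
N, D = 100, 3
X = rng.normal(size=(N, D))                    # inputs x_i in R^D, stacked row-wise
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)  # noisy observations y_i

def loss(theta):
    """Squared loss L(theta) = sum_i (y_i - theta^T x_i)^2."""
    return np.sum((y - X @ theta) ** 2)

def grad(theta):
    """Gradient dL/dtheta = -2 X^T (y - X theta), derived with the chain rule."""
    return -2.0 * X.T @ (y - X @ theta)

# Plain gradient descent; the step size is chosen by hand for this toy problem.
theta = np.zeros(D)
for _ in range(500):
    theta -= 1e-3 * grad(theta)

print(loss(theta), theta)  # theta should end up close to theta_true
```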

Types of Differentiation

1. Scalar differentiation, $f : \mathbb{R} \to \mathbb{R}$: differentiate $y \in \mathbb{R}$ with respect to $x \in \mathbb{R}$

2. Scalar differentiation of a vector, $f : \mathbb{R} \to \mathbb{R}^N$: differentiate $y \in \mathbb{R}^N$ with respect to $x \in \mathbb{R}$

3. Multivariate case, $f : \mathbb{R}^N \to \mathbb{R}$: differentiate $y \in \mathbb{R}$ with respect to the vector $x \in \mathbb{R}^N$

4. Vector fields, $f : \mathbb{R}^N \to \mathbb{R}^M$: differentiate the vector $y \in \mathbb{R}^M$ with respect to the vector $x \in \mathbb{R}^N$

5. General derivatives, $f : \mathbb{R}^{M \times N} \to \mathbb{R}^{P \times Q}$: differentiate the matrix $Y \in \mathbb{R}^{P \times Q}$ with respect to the matrix $X \in \mathbb{R}^{M \times N}$

Scalar Differentiation $f : \mathbb{R} \to \mathbb{R}$

The derivative is defined as the limit of the difference quotient:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

The difference quotient is the slope of the secant line through $f(x)$ and $f(x + h)$; as $h \to 0$ it becomes the slope of the tangent.

Some Examples

$f(x) = x^n \;\Rightarrow\; f'(x) = n x^{n-1}$

$f(x) = \sin(x) \;\Rightarrow\; f'(x) = \cos(x)$

$f(x) = \tanh(x) \;\Rightarrow\; f'(x) = 1 - \tanh^2(x)$

$f(x) = \exp(x) \;\Rightarrow\; f'(x) = \exp(x)$

$f(x) = \log(x) \;\Rightarrow\; f'(x) = \frac{1}{x}$
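As a quick sanity check (a sketch, not part of the slides), each entry of this table can be compared against a central difference quotient; the test point and step size are arbitrary choices:

```python
import numpy as np

def central_diff(f, x, h=1e-6):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7  # arbitrary test point (positive, so log is defined)
checks = [
    (lambda t: t**3,  lambda t: 3 * t**2),
    (np.sin,          np.cos),
    (np.tanh,         lambda t: 1 - np.tanh(t)**2),
    (np.exp,          np.exp),
    (np.log,          lambda t: 1 / t),
]
for f, fprime in checks:
    assert abs(central_diff(f, x) - fprime(x)) < 1e-6
```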

Differentiation Rules

Sum rule: $\big(f(x) + g(x)\big)' = f'(x) + g'(x) = \frac{df(x)}{dx} + \frac{dg(x)}{dx}$

Product rule: $\big(f(x)\,g(x)\big)' = f'(x)g(x) + f(x)g'(x) = \frac{df(x)}{dx}\,g(x) + f(x)\,\frac{dg(x)}{dx}$

Chain rule: $(g \circ f)'(x) = \big(g(f(x))\big)' = g'(f(x))\,f'(x) = \frac{dg(f(x))}{df}\,\frac{df(x)}{dx}$

Quotient rule: $\left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2} = \frac{\frac{df}{dx}g(x) - f(x)\frac{dg}{dx}}{(g(x))^2}$

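These rules can also be verified symbolically; the following sketch uses SymPy (a tool assumption of mine, not mentioned on the slides) to check the product and quotient rules for one arbitrary pair of functions:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x) * x**2      # arbitrary example functions
g = sp.exp(x) + 1

# Product rule: (f g)' == f' g + f g'
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + f * sp.diff(g, x)
assert sp.simplify(lhs - rhs) == 0

# Quotient rule: (f/g)' == (f' g - f g') / g^2
lhs = sp.diff(f / g, x)
rhs = (sp.diff(f, x) * g - f * sp.diff(g, x)) / g**2
assert sp.simplify(lhs - rhs) == 0
```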

Example: Scalar Chain Rule

$$(g \circ f)'(x) = \big(g(f(x))\big)' = g'(f(x))\,f'(x) = \frac{dg}{df}\,\frac{df}{dx}$$

Beginner: $g(z) = 6z + 3$, $z = f(x) = -2x + 5$

$$(g \circ f)'(x) = \underbrace{6}_{dg/df} \cdot \underbrace{(-2)}_{df/dx} = -12$$

Advanced: $g(z) = \tanh(z)$, $z = f(x) = x^n$

$$(g \circ f)'(x) = \underbrace{\big(1 - \tanh^2(x^n)\big)}_{dg/df} \cdot \underbrace{n x^{n-1}}_{df/dx}$$

Work it out with your neighbors.

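A numerical cross-check of the advanced example (a sketch; the test point and exponent are arbitrary choices):

```python
import numpy as np

def composed(x, n):
    return np.tanh(x**n)   # (g o f)(x) with g = tanh, f(x) = x^n

def chain_rule_derivative(x, n):
    return (1 - np.tanh(x**n)**2) * n * x**(n - 1)   # dg/df * df/dx

x, n, h = 1.3, 4, 1e-6
numeric = (composed(x + h, n) - composed(x - h, n)) / (2 * h)
assert abs(numeric - chain_rule_derivative(x, n)) < 1e-6
```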

Scalar Differentiation $f : \mathbb{R} \to \mathbb{R}^N$

$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_N(x) \end{bmatrix} \in \mathbb{R}^N, \qquad x \in \mathbb{R}$$

Here, the $f_n$ are different functions, e.g.,

$$f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ \sin(x) \end{bmatrix}$$

Differentiation: compute the derivative of each $f_n$:

$$\frac{df}{dx} = \begin{bmatrix} \frac{df_1}{dx} \\ \vdots \\ \frac{df_N}{dx} \end{bmatrix} \in \mathbb{R}^N$$

The derivative of a (column) vector with respect to a scalar input is a (column) vector.

Multivariate Differentiation $f : \mathbb{R}^N \to \mathbb{R}$

$$y = f(x), \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \in \mathbb{R}^N$$

Partial derivative (change one coordinate at a time):

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_N) - f(x)}{h}$$

Jacobian vector (gradient) collects all partial derivatives:

$$\frac{df}{dx} = \Big[\frac{\partial f}{\partial x_1} \;\cdots\; \frac{\partial f}{\partial x_N}\Big] \in \mathbb{R}^{1 \times N}$$

Note: this is a row vector.

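A small sketch (not from the slides) that assembles the gradient row vector of a function $f : \mathbb{R}^N \to \mathbb{R}$ from one-coordinate difference quotients:

```python
import numpy as np

def gradient(f, x, h=1e-6):
    """Numerical gradient of f: R^N -> R at x, returned as a 1 x N row vector."""
    grad = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                    # perturb one coordinate at a time
        grad[0, i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: x[0]**2 * x[1] + x[0] * x[1]**3       # example used on the next slide
print(gradient(f, np.array([1.0, 2.0])))            # approx [[12., 13.]]
```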

Example: Multivariate Differentiation

Beginner: $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$

Advanced: $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x_1, x_2) = (x_1 + 2x_2^3)^2 \in \mathbb{R}$

Partial derivatives? Gradient? Work it out with your neighbors.

Beginner:

$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2x_1 x_2 + x_2^3, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3x_1 x_2^2$$

Advanced:

$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2(x_1 + 2x_2^3) \cdot \underbrace{1}_{\frac{\partial}{\partial x_1}(x_1 + 2x_2^3)}, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = 2(x_1 + 2x_2^3) \cdot \underbrace{6x_2^2}_{\frac{\partial}{\partial x_2}(x_1 + 2x_2^3)}$$

Gradient: $\frac{df}{dx} = \Big[\frac{\partial f(x_1, x_2)}{\partial x_1} \;\; \frac{\partial f(x_1, x_2)}{\partial x_2}\Big] \in \mathbb{R}^{1 \times 2}$

Beginner: $\frac{df}{dx} = \big[\, 2x_1 x_2 + x_2^3 \quad x_1^2 + 3x_1 x_2^2 \,\big]$

Advanced: $\frac{df}{dx} = \big[\, 2(x_1 + 2x_2^3) \quad 12(x_1 + 2x_2^3)x_2^2 \,\big]$

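Both gradients can be checked symbolically; a sketch with SymPy (the variable names are mine):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

beginner = x1**2 * x2 + x1 * x2**3
advanced = (x1 + 2 * x2**3)**2

# Gradient as a 1 x 2 row: [df/dx1, df/dx2]
print([sp.diff(beginner, v) for v in (x1, x2)])
# [2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2]

print([sp.factor(sp.diff(advanced, v)) for v in (x1, x2)])
# factors to 2*(x1 + 2*x2**3) and 12*x2**2*(x1 + 2*x2**3)
```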

Example: Multivariate Chain Rule

Consider the function

$$L(e) = \tfrac{1}{2}\|e\|^2 = \tfrac{1}{2} e^\top e, \qquad e = y - Ax, \quad x \in \mathbb{R}^N, \; A \in \mathbb{R}^{M \times N}, \; e, y \in \mathbb{R}^M$$

Compute the gradient $\frac{dL}{dx}$. What is the dimension/size of $\frac{dL}{dx}$?

Work it out with your neighbors.

$$\frac{dL}{dx} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial x}$$

$$\frac{\partial L}{\partial e} = e^\top \in \mathbb{R}^{1 \times M} \quad (1)$$

$$\frac{\partial e}{\partial x} = -A \in \mathbb{R}^{M \times N} \quad (2)$$

$$\Longrightarrow \quad \frac{dL}{dx} = e^\top(-A) = -(y - Ax)^\top A \in \mathbb{R}^{1 \times N}$$

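A numerical sketch of this result (the sizes and data are random, illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 3
A = rng.normal(size=(M, N))
x = rng.normal(size=N)
y = rng.normal(size=M)

L = lambda x: 0.5 * np.sum((y - A @ x) ** 2)

analytic = -(y - A @ x) @ A          # dL/dx = -(y - Ax)^T A, shape (N,)

h = 1e-6
numeric = np.array([(L(x + h * np.eye(N)[i]) - L(x - h * np.eye(N)[i])) / (2 * h)
                    for i in range(N)])
assert np.allclose(analytic, numeric, atol=1e-5)
```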

Vector Field Differentiation $f : \mathbb{R}^N \to \mathbb{R}^M$

$$y = f(x) \in \mathbb{R}^M, \qquad x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} f_1(x_1, \ldots, x_N) \\ \vdots \\ f_M(x_1, \ldots, x_N) \end{bmatrix}$$

Jacobian matrix (collection of all partial derivatives):

$$\frac{df}{dx} = \begin{bmatrix} \frac{dy_1}{dx} \\ \vdots \\ \frac{dy_M}{dx} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} \in \mathbb{R}^{M \times N}$$


Example: Vector Field Differentiation

$$f(x) = Ax, \qquad f(x) \in \mathbb{R}^M, \; A \in \mathbb{R}^{M \times N}, \; x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} A_{11}x_1 + A_{12}x_2 + \cdots + A_{1N}x_N \\ \vdots \\ A_{M1}x_1 + A_{M2}x_2 + \cdots + A_{MN}x_N \end{bmatrix}$$

Compute the gradient $\frac{df}{dx}$.

Dimension of $\frac{df}{dx}$: since $f : \mathbb{R}^N \to \mathbb{R}^M$, it follows that $\frac{df}{dx} \in \mathbb{R}^{M \times N}$.

Gradient:

$$f_i(x) = \sum_{j=1}^N A_{ij} x_j \;\Longrightarrow\; \frac{\partial f_i}{\partial x_j} = A_{ij}$$

$$\Longrightarrow \quad \frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \in \mathbb{R}^{M \times N}$$

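A sketch confirming that the Jacobian of $f(x) = Ax$ is $A$ itself, using a generic numerical Jacobian (the sizes are arbitrary):

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Numerical Jacobian of f: R^N -> R^M at x, shape (M, N)."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))   # column df/dx_i
    return np.stack(cols, axis=1)

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
x = rng.normal(size=3)
assert np.allclose(jacobian(lambda v: A @ v, x), A, atol=1e-5)
```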

Dimensionality of the Gradient

In general: a function $f : \mathbb{R}^N \to \mathbb{R}^M$ has a gradient that is an $M \times N$ matrix with

$$\frac{df}{dx} \in \mathbb{R}^{M \times N}, \qquad \frac{df}{dx}[m, n] = \frac{\partial f_m}{\partial x_n}$$

Gradient dimension: (# target dimensions) $\times$ (# input dimensions).

Chain Rule

$$\frac{\partial}{\partial x}(g \circ f)(x) = \frac{\partial}{\partial x}\big(g(f(x))\big) = \frac{\partial g(f)}{\partial f}\,\frac{\partial f(x)}{\partial x}$$

Example: Chain Rule

Consider $f : \mathbb{R}^2 \to \mathbb{R}$ and $x : \mathbb{R} \to \mathbb{R}^2$:

$$f(x) = f(x_1, x_2) = x_1^2 + 2x_2, \qquad x(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} = \begin{bmatrix} \sin(t) \\ \cos(t) \end{bmatrix}$$

What are the dimensions of $\frac{df}{dx}$ and $\frac{dx}{dt}$? Work it out with your neighbors.

$1 \times 2$ and $2 \times 1$.

Compute the gradient $\frac{df}{dt}$ using the chain rule:

$$\frac{df}{dt} = \frac{df}{dx}\,\frac{dx}{dt} = \Big[\frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2}\Big] \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix} = \big[\, 2\sin t \quad 2 \,\big] \begin{bmatrix} \cos t \\ -\sin t \end{bmatrix} = 2\sin t \cos t - 2\sin t = 2\sin t\,(\cos t - 1)$$

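A sketch checking this result against a difference quotient in $t$ (the test point is an arbitrary choice):

```python
import numpy as np

f = lambda x1, x2: x1**2 + 2 * x2
composed = lambda t: f(np.sin(t), np.cos(t))        # f(x(t))

t, h = 0.9, 1e-6
numeric = (composed(t + h) - composed(t - h)) / (2 * h)
analytic = 2 * np.sin(t) * (np.cos(t) - 1)          # chain-rule result
assert abs(numeric - analytic) < 1e-6
```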

Derivatives with Respect to Matrices

Recall: a function $f : \mathbb{R}^N \to \mathbb{R}^M$ has a gradient that is an $M \times N$ matrix with

$$\frac{df}{dx} \in \mathbb{R}^{M \times N}, \qquad \frac{df}{dx}[m, n] = \frac{\partial f_m}{\partial x_n}$$

Gradient dimension: (# target dimensions) $\times$ (# input dimensions).

This generalizes to the case when the inputs ($N$) or targets ($M$) are matrices:

A function $f : \mathbb{R}^{M \times N} \to \mathbb{R}^{P \times Q}$ has a gradient that is a $(P \times Q) \times (M \times N)$ object (a tensor):

$$\frac{df}{dX} \in \mathbb{R}^{(P \times Q) \times (M \times N)}, \qquad \frac{df}{dX}[p, q, m, n] = \frac{\partial f_{pq}}{\partial X_{mn}}$$


Example 1: Derivatives with Respect to Matrices

$$f = Ax, \qquad f \in \mathbb{R}^M, \; A \in \mathbb{R}^{M \times N}, \; x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} A_{11}x_1 + A_{12}x_2 + \cdots + A_{1N}x_N \\ \vdots \\ A_{M1}x_1 + A_{M2}x_2 + \cdots + A_{MN}x_N \end{bmatrix}$$

What is the size of $\frac{df}{dA}$? (# target dimensions) $\times$ (# input dimensions) $= M \times (M \times N)$

$$\frac{df}{dA} = \begin{bmatrix} \frac{\partial f_1}{\partial A} \\ \vdots \\ \frac{\partial f_M}{\partial A} \end{bmatrix}, \qquad \frac{\partial f_m}{\partial A} \in \mathbb{R}^{1 \times (M \times N)}$$


Example 2: Derivatives w.r.t. Matrices

$$f_m = \sum_{n=1}^N A_{mn} x_n, \qquad m = 1, \ldots, M$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_i(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} A_{11}x_1 + A_{12}x_2 + \cdots + A_{1N}x_N \\ \vdots \\ A_{i1}x_1 + A_{i2}x_2 + \cdots + A_{iN}x_N \\ \vdots \\ A_{M1}x_1 + A_{M2}x_2 + \cdots + A_{MN}x_N \end{bmatrix}$$

$$\frac{\partial f_m}{\partial A_{mq}} = x_q \in \mathbb{R}$$

$$\frac{\partial f_m}{\partial A_{m,:}} = x^\top \in \mathbb{R}^{1 \times 1 \times N}$$

$$\frac{\partial f_m}{\partial A_{k \neq m,:}} = 0^\top \in \mathbb{R}^{1 \times 1 \times N}$$

$$\frac{\partial f_m}{\partial A} = \begin{bmatrix} 0^\top \\ \vdots \\ x^\top \\ \vdots \\ 0^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)} \quad \text{(with $x^\top$ in the $m$-th row)}$$

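A sketch (my own, with arbitrary sizes) that materializes $\frac{df}{dA}$ for $f = Ax$ as an array of shape $(M, M, N)$ and confirms the $x^\top$-in-row-$m$ structure numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 2
A = rng.normal(size=(M, N))
x = rng.normal(size=N)

# Analytic df/dA: entry [m, k, n] = d f_m / d A_{kn} = x_n if k == m, else 0.
analytic = np.zeros((M, M, N))
for m in range(M):
    analytic[m, m, :] = x

# Numerical check, perturbing one matrix entry at a time.
h = 1e-6
numeric = np.zeros((M, M, N))
for k in range(M):
    for n in range(N):
        E = np.zeros((M, N))
        E[k, n] = h
        numeric[:, k, n] = ((A + E) @ x - (A - E) @ x) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-5)
```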

Gradient Computation: Two Alternatives

Consider $f : \mathbb{R}^3 \to \mathbb{R}^{4 \times 2}$, $f(x) = A \in \mathbb{R}^{4 \times 2}$, where the entries $A_{ij}$ depend on a vector $x \in \mathbb{R}^3$.

We can compute $\frac{dA(x)}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$ in two equivalent ways:

Alternative 1 (partial derivatives): compute the partial derivatives $\frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3} \in \mathbb{R}^{4 \times 2}$ and collate them into the tensor $\frac{dA}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$.

Alternative 2 (re-shaping): re-shape $A \in \mathbb{R}^{4 \times 2}$ into a vector in $\mathbb{R}^8$, compute its gradient in $\mathbb{R}^{8 \times 3}$, and re-shape that gradient back into $\frac{dA}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$.

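A sketch of the two alternatives for a concrete map (the particular $A(x)$ below is an invented example; both routes use difference quotients):

```python
import numpy as np

def A_of_x(x):
    """Invented example: a 4x2 matrix whose entries depend on x in R^3."""
    return np.array([[x[0],          x[1] * x[2]],
                     [x[0]**2,       x[2]],
                     [np.sin(x[1]),  x[0] * x[1]],
                     [x[2]**3,       1.0]])

x, h = np.array([0.5, 1.0, -0.3]), 1e-6

# Alternative 1: one 4x2 slice dA/dx_i per coordinate, collated to 4x2x3.
slices = []
for i in range(3):
    e = np.zeros(3)
    e[i] = h
    slices.append((A_of_x(x + e) - A_of_x(x - e)) / (2 * h))
dA_dx = np.stack(slices, axis=-1)              # shape (4, 2, 3)

# Alternative 2: flatten A to R^8, take the 8x3 gradient, re-shape back.
flat = lambda v: A_of_x(v).reshape(-1)
J = np.stack([(flat(x + h * np.eye(3)[i]) - flat(x - h * np.eye(3)[i])) / (2 * h)
              for i in range(3)], axis=1)      # shape (8, 3)
assert np.allclose(J.reshape(4, 2, 3), dA_dx)
```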

Gradients of a Single-Layer Neural Network

[Diagram: $x \to z \to f$, with parameters $A, b$ on the first edge and the nonlinearity $\sigma$ on the second.]

$$f = \tanh(\underbrace{Ax + b}_{=:\, z \,\in\, \mathbb{R}^M}) \in \mathbb{R}^M, \qquad x \in \mathbb{R}^N, \quad A \in \mathbb{R}^{M \times N}, \quad b \in \mathbb{R}^M$$

$$\frac{\partial f}{\partial b} = \underbrace{\frac{\partial f}{\partial z}}_{M \times M}\, \underbrace{\frac{\partial z}{\partial b}}_{M \times M} \in \mathbb{R}^{M \times M}, \qquad \frac{\partial f}{\partial b}[i, j] = \sum_{l=1}^{M} \frac{\partial f}{\partial z}[i, l]\, \frac{\partial z}{\partial b}[l, j]$$

$$\frac{\partial f}{\partial A} = \underbrace{\frac{\partial f}{\partial z}}_{M \times M}\, \underbrace{\frac{\partial z}{\partial A}}_{M \times (M \times N)} \in \mathbb{R}^{M \times (M \times N)}, \qquad \frac{\partial f}{\partial A}[i, j, k] = \sum_{l=1}^{M} \frac{\partial f}{\partial z}[i, l]\, \frac{\partial z}{\partial A}[l, j, k]$$

$$\frac{\partial f}{\partial z} = \underbrace{\mathrm{diag}\big(1 - \tanh^2(z)\big)}_{\in\, \mathbb{R}^{M \times M}}, \qquad \frac{\partial z}{\partial b} = \underbrace{I}_{\in\, \mathbb{R}^{M \times M}}, \qquad \frac{\partial z}{\partial A} = \underbrace{\begin{bmatrix} x^\top & 0^\top & \cdots & 0^\top \\ 0^\top & x^\top & \cdots & 0^\top \\ \vdots & & \ddots & \vdots \\ 0^\top & 0^\top & \cdots & x^\top \end{bmatrix}}_{\in\, \mathbb{R}^{M \times (M \times N)}}$$

(Row $m$ of $\frac{\partial z}{\partial A}$ contains $x^\top$ in block $m$ and zeros elsewhere.)

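A sketch that forms $\partial f/\partial b$ and $\partial f/\partial A$ for this network and verifies both against difference quotients (the sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 3, 2
A = rng.normal(size=(M, N))
b = rng.normal(size=M)
x = rng.normal(size=N)

def f(A, b):
    return np.tanh(A @ x + b)       # single-layer network, x held fixed

z = A @ x + b
df_dz = np.diag(1 - np.tanh(z) ** 2)        # M x M
df_db = df_dz                               # since dz/db = I
df_dA = np.einsum('ij,k->ijk', df_dz, x)    # dz/dA[l,j,k] = delta_{lj} x_k

h = 1e-6
# Check df/db column by column.
num_db = np.stack([(f(A, b + h * np.eye(M)[j]) - f(A, b - h * np.eye(M)[j])) / (2 * h)
                   for j in range(M)], axis=1)
assert np.allclose(df_db, num_db, atol=1e-5)

# Check df/dA entry by entry.
num_dA = np.zeros((M, M, N))
for j in range(M):
    for k in range(N):
        E = np.zeros((M, N))
        E[j, k] = h
        num_dA[:, j, k] = (f(A + E, b) - f(A - E, b)) / (2 * h)
assert np.allclose(df_dA, num_dA, atol=1e-5)
```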

Putting Things Together

Inputs $x \in \mathbb{R}^N$

Observed outputs $y = f_\theta(z) + \epsilon \in \mathbb{R}^M$, $\epsilon \sim \mathcal{N}(0, \Sigma)$

Train a single-layer neural network with

$$f_\theta(z) = \tanh(z) \in \mathbb{R}^M, \qquad z = Ax + b \in \mathbb{R}^M, \qquad \theta = \{A, b\}$$

Find $A, b$ such that the squared loss

$$L(\theta) = \tfrac{1}{2}\|e\|^2 \in \mathbb{R}, \qquad e = y - f_\theta(z) \in \mathbb{R}^M$$

is minimized.


Putting Things Together

Partial derivatives:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial f}\,\frac{\partial f}{\partial z}\,\frac{\partial z}{\partial A}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial f}\,\frac{\partial f}{\partial z}\,\frac{\partial z}{\partial b}$$

$$\frac{\partial L}{\partial e} = \underbrace{e^\top}_{\in\, \mathbb{R}^{1 \times M}}, \qquad \frac{\partial e}{\partial f} = \underbrace{-I}_{\in\, \mathbb{R}^{M \times M}}, \qquad \frac{\partial f}{\partial z} = \underbrace{\mathrm{diag}\big(1 - \tanh^2(z)\big)}_{\in\, \mathbb{R}^{M \times M}}$$

$$\frac{\partial z}{\partial A} = \underbrace{\begin{bmatrix} x^\top & 0^\top & \cdots & 0^\top \\ 0^\top & x^\top & \cdots & 0^\top \\ \vdots & & \ddots & \vdots \\ 0^\top & 0^\top & \cdots & x^\top \end{bmatrix}}_{\in\, \mathbb{R}^{M \times (M \times N)}}, \qquad \frac{\partial z}{\partial b} = \underbrace{I}_{\in\, \mathbb{R}^{M \times M}}$$
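Multiplying these factors out gives compact expressions: $\frac{\partial L}{\partial b} = -\big(e \odot (1 - \tanh^2(z))\big)^\top$ and, reshaped to $M \times N$, $\frac{\partial L}{\partial A} = -\big(e \odot (1 - \tanh^2(z))\big)\, x^\top$. The assembly below is my own sketch of this, checked numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 3, 2
A, b = rng.normal(size=(M, N)), rng.normal(size=M)
x, y = rng.normal(size=N), rng.normal(size=M)

def loss(A, b):
    e = y - np.tanh(A @ x + b)
    return 0.5 * np.sum(e ** 2)

z = A @ x + b
e = y - np.tanh(z)
delta = e * (1 - np.tanh(z) ** 2)   # e^T (-I) diag(1 - tanh^2 z), up to the sign
dL_db = -delta                      # the 1 x M row vector, stored as shape (M,)
dL_dA = -np.outer(delta, x)         # shape (M, N)

h = 1e-6
num_db = np.array([(loss(A, b + h * np.eye(M)[j]) - loss(A, b - h * np.eye(M)[j])) / (2 * h)
                   for j in range(M)])
assert np.allclose(dL_db, num_db, atol=1e-5)

num_dA = np.zeros((M, N))
for j in range(M):
    for k in range(N):
        E = np.zeros((M, N))
        E[j, k] = h
        num_dA[j, k] = (loss(A + E, b) - loss(A - E, b)) / (2 * h)
assert np.allclose(dL_dA, num_dA, atol=1e-5)
```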

Gradients of a Multi-Layer Neural Network

[Diagram: $x \to f_1 \to \cdots \to f_{K-1} \to f_K \to L$, with parameters $A_0, b_0$ on the first edge, then $A_1, b_1$, up to $A_{K-1}, b_{K-1}$ on the last layer edge.]

Inputs $x$, observed outputs $y$

Train a multi-layer neural network with

$$f_0 = x, \qquad f_i = \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \ldots, K$$

Find $A_j, b_j$ for $j = 0, \ldots, K-1$, such that the squared loss

$$L(\theta) = \|y - f_{K,\theta}(x)\|^2$$

is minimized, where $\theta = \{A_j, b_j\}$, $j = 0, \ldots, K-1$.


Gradients of a Multi-Layer Neural Network

[Diagram as above, with parameters $A_0, b_0, \ldots, A_{K-1}, b_{K-1}$.]

$$\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K}\,\frac{\partial f_K}{\partial \theta_{K-1}}$$

$$\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K}\,\frac{\partial f_K}{\partial f_{K-1}}\,\frac{\partial f_{K-1}}{\partial \theta_{K-2}}$$

$$\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K}\,\frac{\partial f_K}{\partial f_{K-1}}\,\frac{\partial f_{K-1}}{\partial f_{K-2}}\,\frac{\partial f_{K-2}}{\partial \theta_{K-3}}$$

$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K}\,\frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}}\,\frac{\partial f_{i+1}}{\partial \theta_i}$$

Intermediate derivatives are stored during the forward pass and reused when the products above are evaluated.

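For completeness, a compact backprop sketch for this architecture with $\sigma_i = \tanh$ at every layer; this is my own illustration of the recursion above, not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(6)
sizes = [2, 4, 3]      # x in R^2, one hidden layer in R^4, output f_K in R^3
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]      # (A_j, b_j) per layer
x, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[-1])

# Forward pass: store each intermediate value f_i for reuse in the backward pass.
fs = [x]
for A, b in params:
    fs.append(np.tanh(A @ fs[-1] + b))

# Backward pass: propagate dL/df_i from the loss back through the layers.
dL_df = -2 * (y - fs[-1])                 # L(theta) = ||y - f_K||^2
grads = []
for (A, b), f_prev, f_cur in zip(reversed(params), reversed(fs[:-1]), reversed(fs[1:])):
    dL_dz = dL_df * (1 - f_cur ** 2)      # tanh'(z) = 1 - tanh(z)^2 = 1 - f_cur^2
    grads.append((np.outer(dL_dz, f_prev), dL_dz))    # (dL/dA_j, dL/db_j)
    dL_df = A.T @ dL_dz                   # pass the gradient to the layer below
grads.reverse()

# Numerical spot check on one bias entry of the first layer.
def loss(ps):
    f = x
    for A, b in ps:
        f = np.tanh(A @ f + b)
    return np.sum((y - f) ** 2)

h = 1e-6
(A0, b0), rest = params[0], params[1:]
numeric = (loss([(A0, b0 + h * np.eye(sizes[1])[0])] + rest) - loss(params)) / h
assert abs(numeric - grads[0][1][0]) < 1e-4
```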

Example: Regression with Neural Networks

Regression with a neural network parametrized by $\theta$:

$$y = f_\theta(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$$

Given inputs $x_n$ and corresponding (noisy) observations $y_n$, $n = 1, \ldots, N$, find parameters $\theta^*$ that minimize the squared loss

$$L(\theta) = \sum_{n=1}^N (y_n - f_\theta(x_n))^2 = \|y - f(X)\|^2$$


Training NNs as Maximum Likelihood Estimation

Training a neural network in the above way corresponds to maximum likelihood estimation:

If $y = \mathrm{NN}(x, \theta) + \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, then the log-likelihood is (up to an additive constant)

$$\log p(y \mid X, \theta) = -\tfrac{1}{2}\|y - \mathrm{NN}(x, \theta)\|^2$$

Find $\theta^*$ by minimizing the negative log-likelihood:

$$\theta^* = \arg\min_\theta\, -\log p(y \mid x, \theta) = \arg\min_\theta\, \tfrac{1}{2}\|y - \mathrm{NN}(x, \theta)\|^2 = \arg\min_\theta\, L(\theta)$$

Maximum likelihood estimation can lead to overfitting (interpreting noise as signal).


Example: Linear Regression (1)

Linear regression with a polynomial of order $M$:

$$y = f(x, \theta) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$$

$$f(x, \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_M x^M = \sum_{i=0}^M \theta_i x^i$$

Given inputs $x_i$ and corresponding (noisy) observations $y_i$, $i = 1, \ldots, N$, find parameters $\theta = [\theta_0, \ldots, \theta_M]^\top$ that minimize the squared loss (equivalently: maximize the likelihood)

$$L(\theta) = \sum_{i=1}^N (y_i - f(x_i, \theta))^2$$

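Because this model is linear in the parameters $\theta$, the squared loss can also be minimized in closed form with least squares on the polynomial design matrix; a sketch (the data-generating function and degree are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 30, 5                                  # 30 data points, degree-5 polynomial
x = rng.uniform(-5, 5, size=N)
y = np.sin(x) + 0.1 * rng.normal(size=N)      # invented noisy targets

Phi = np.vander(x, M + 1, increasing=True)    # design matrix: Phi[i, j] = x_i**j
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # minimizes ||y - Phi theta||^2

print(theta)                                  # [theta_0, ..., theta_M]
```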

Example: Linear Regression (2)

[Figure: data and the maximum likelihood estimate with a polynomial of degree 16; axes $x \in [-5, 5]$ and $f(x) \in [-3, 3]$.]

Regularization, model selection etc. can address overfitting.

Alternative approach based on integration.


Summary

[Recap figures: collating the partial derivatives $\partial A/\partial x_i \in \mathbb{R}^{4 \times 2}$ into $dA/dx \in \mathbb{R}^{4 \times 2 \times 3}$; the degree-16 polynomial fit; the multi-layer network $x \to f_1 \to \cdots \to f_K \to L$ with parameters $A_0, b_0, \ldots, A_{K-1}, b_{K-1}$.]

Vector-valued differentiation

Chain rule

Check the dimension of the gradients
