Lecture 5: Regression
C4B Machine Learning Hilary 2011 A. Zisserman
• Linear regression
• Loss function
• Ridge regression
• Basis functions
• Dual representation and kernels
Regression
• Suppose we are given a training set of N observations
  ( x_1, y_1 ), …, ( x_N, y_N )  with x_i ∈ R^d, y_i ∈ R
• The regression problem is to estimate f(x) from this data such that
  y_i = f(x_i)
Learning by optimization
• As in the case of classification, learning a regressor can be formulated as an optimization:
Minimize with respect to f ∈ F:

  Σ_{i=1}^N ℓ( f(x_i), y_i ) + λ R(f)

(the first term is the loss function, the second the regularization)
• There is a choice of both loss functions and regularization
• e.g. squared loss, SVM “hinge-like” loss
• squared regularizer, lasso regularizer
• Algorithms can be kernelized
Choice of regression function – non-linear basis functions
• The function for regression, f(x, w), is a non-linear function of x, but linear in w:
  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x) = w^T Φ(x)
• For example, for x ∈ R, polynomial regression with φ_j(x) = x^j:

  f(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^M w_j x^j

  e.g. for M = 3,

  f(x, w) = ( w_0, w_1, w_2, w_3 ) ( 1, x, x², x³ )^T = w^T Φ(x)

  Φ : x → Φ(x),  R → R⁴
• or the basis functions can be Gaussians centred on the training data:

  φ_j(x) = exp( −( x − x_j )² / 2σ² )

  e.g. for 3 points,

  f(x, w) = ( w_1, w_2, w_3 ) ( e^{−(x−x_1)²/2σ²}, e^{−(x−x_2)²/2σ²}, e^{−(x−x_3)²/2σ²} )^T = w^T Φ(x)

  Φ : x → Φ(x),  R → R³
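As a concrete sketch (Python, not from the slides; the function names poly_features and gauss_features are illustrative choices), both feature maps can be written as:

  import numpy as np

  def poly_features(x, M):
      # Map scalars x to (1, x, x^2, ..., x^M): an N x (M+1) design matrix.
      return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

  def gauss_features(x, centres, sigma):
      # Map scalars x to Gaussian responses exp(-(x - x_j)^2 / (2 sigma^2)),
      # one column per centre (here, per training point).
      x = np.asarray(x, dtype=float)[:, None]
      c = np.asarray(centres, dtype=float)[None, :]
      return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

Each row of the returned matrix is Φ(x_i)^T for one input x_i.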
Least squares “ridge regression”
• Cost function – squared loss:
  Ẽ(w) = ½ Σ_{i=1}^N { f(x_i, w) − y_i }² + (λ/2) ‖w‖²

  (the first term is the loss function, the second the regularization)

• Regression function for x (1D):

  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x) = w^T Φ(x)

• NB squared loss arises in Maximum Likelihood estimation for an error model in which the measured target value is the true value plus Gaussian noise:

  y_i = f(x_i) + n_i,  n_i ∼ N(0, σ²)
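Making that explicit: under this noise model the negative log-likelihood is

  −log Π_{i=1}^N N( y_i | f(x_i, w), σ² ) = (1/2σ²) Σ_{i=1}^N ( y_i − f(x_i, w) )² + (N/2) log( 2πσ² )

so maximizing the likelihood over w is the same as minimizing the squared loss.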
Solving for the weights w
Notation: write the target and regressed values as N-vectors
  y = ( y_1, y_2, …, y_N )^T,   f = ( Φ(x_1)^T w, Φ(x_2)^T w, …, Φ(x_N)^T w )^T = Φw

where

        [ 1  φ_1(x_1) …  φ_M(x_1) ]
  Φ  =  [ 1  φ_1(x_2) …  φ_M(x_2) ]    and   w = ( w_0, w_1, …, w_M )^T
        [ ⋮      ⋮            ⋮    ]
        [ 1  φ_1(x_N) …  φ_M(x_N) ]
e.g. for polynomial regression with basis functions up to x²:

         [ 1  x_1  x_1² ]
  Φw  =  [ 1  x_2  x_2² ]  ( w_0, w_1, w_2 )^T
         [ ⋮   ⋮     ⋮  ]
         [ 1  x_N  x_N² ]
Φ is the N × M design matrix (one row per data point, one column per basis function)
  Ẽ(w) = ½ Σ_{i=1}^N { f(x_i, w) − y_i }² + (λ/2) ‖w‖²

       = ½ Σ_{i=1}^N ( y_i − w^T Φ(x_i) )² + (λ/2) ‖w‖²

       = ½ ‖y − Φw‖² + (λ/2) ‖w‖²
Now set the derivative w.r.t. w to zero for the minimum:

  dẼ(w)/dw = −Φ^T ( y − Φw ) + λw = 0

Hence

  ( Φ^T Φ + λI ) w = Φ^T y

  w = ( Φ^T Φ + λI )^{-1} Φ^T y
  w = ( Φ^T Φ + λI )^{-1} Φ^T y
  M×1      M×M           M×N  N×1

with M basis functions and N data points (assume N > M)

• This shows that there is a unique solution.
• If λ = 0 (no regularization), then

    w = ( Φ^T Φ )^{-1} Φ^T y = Φ⁺ y

  where Φ⁺ is the pseudo-inverse of Φ (pinv in Matlab)
• Adding the term λI improves the conditioning of the inverse: even if Φ is not full rank, ( Φ^T Φ + λI ) is invertible for any λ > 0, and better conditioned as λ grows
• As λ → ∞,  w → (1/λ) Φ^T y → 0
• Often the regularization is applied only to the inhomogeneous part of w, i.e. to w̃, where w = ( w_0, w̃ )
  w = ( Φ^T Φ + λI )^{-1} Φ^T y

  f(x, w) = w^T Φ(x) = Φ(x)^T w
          = Φ(x)^T ( Φ^T Φ + λI )^{-1} Φ^T y
          = b(x)^T y

Output is a linear blend, b(x)^T y, of the training values {y_i}
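A minimal Python sketch of this closed-form fit (illustrative: np.linalg.solve replaces the explicit inverse for numerical stability, and the true curve and noise level here are assumed stand-ins for the slides' example):

  import numpy as np

  def fit_ridge(Phi, y, lam):
      # Solve (Phi^T Phi + lam I) w = Phi^T y for the ridge weights.
      return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

  rng = np.random.default_rng(0)
  x = np.linspace(0, 1, 9)                                   # N = 9 samples
  y = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(9)  # noisy targets
  Phi = np.vander(x, 8, increasing=True)                     # degree M = 7 polynomial basis
  w = fit_ridge(Phi, y, lam=1e-3)

  # Prediction at new points; this is the same linear blend b(x)^T y in weight form.
  x_new = np.linspace(0, 1, 200)
  f_new = np.vander(x_new, 8, increasing=True) @ w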
[Figure: sample points and the ideal fit (the true curve, in red) over x ∈ [0, 1]]
Example 1: polynomial basis functions
• The red curve is the true function (which is not a polynomial)
• The data points are samples from the curve with added noise in y.
• There is a choice in both the degree, M, of the basis functions used, and in the strength of the regularization
  f(x, w) = Σ_{j=0}^M w_j x^j = w^T Φ(x)

  w is an (M+1)-dimensional vector

  Φ : x → Φ(x),  R → R^{M+1}
[Figures: fits for λ = 100, λ = 0.001, λ = 1e-10 and λ = 1e-15, each showing the sample points and the ideal fit]
N = 9 samples, M = 7
[Figures: the M = 3 polynomial basis functions, and the corresponding least-squares fit (sample points, ideal fit, least-squares solution)]
[Figures: the M = 5 polynomial basis functions, and the corresponding least-squares fit (sample points, ideal fit, least-squares solution)]
[Figure: sample points and the ideal fit (the true curve, in red)]
Example 2: Gaussian basis functions
• The red curve is the true function (which is not a polynomial)
• The data points are samples from the curve with added noise in y.
• Basis functions are centred on the training data (N points)
• There is a choice in both the scale, sigma, of the basis functions used, and in the strength of the regularization
  f(x, w) = Σ_{i=1}^N w_i e^{−(x−x_i)²/σ²} = w^T Φ(x)

  w is an N-vector

  Φ : x → Φ(x),  R → R^N
N = 9 samples, sigma = 0.334
[Figures: Gaussian-basis fits for λ = 100, λ = 0.001, λ = 1e-10 and λ = 1e-15, each showing the sample points and the ideal fit]
Choosing lambda using a validation set

[Figure: training and validation error norm against log λ, with the minimum validation error marked]
[Figure: sample points, ideal fit, and the fit with λ chosen on the validation set]
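A sketch of this selection in Python (the λ grid, σ value, and the train/validation split are illustrative assumptions; the design matrix follows the slides' exp(−(x−x_i)²/σ²) basis centred on the training data):

  import numpy as np

  def gauss_design(x, centres, sigma):
      # N x N_centres matrix of exp(-(x - x_j)^2 / sigma^2) responses.
      return np.exp(-(x[:, None] - centres[None, :]) ** 2 / sigma ** 2)

  def choose_lambda(x_tr, y_tr, x_val, y_val, sigma=0.334):
      best_err, best_lam = np.inf, None
      for lam in 10.0 ** np.arange(-15, 3):     # grid over log lambda
          Phi = gauss_design(x_tr, x_tr, sigma)  # centres on the training points
          w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(x_tr)), Phi.T @ y_tr)
          err = np.linalg.norm(gauss_design(x_val, x_tr, sigma) @ w - y_val)
          if err < best_err:                     # keep the minimum validation error
              best_err, best_lam = err, lam
      return best_lam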
[Figures: validation-set fit and the Gaussian basis functions for sigma = 0.1]
[Figures: validation-set fit and the Gaussian basis functions for sigma = 0.334]
Summary and preview

So far we have considered the primal problem, where

  f(x, w) = Σ_{i=1}^M w_i φ_i(x) = w^T Φ(x)

and we wanted a solution for w ∈ R^M.

Now we will consider the dual problem, where

  w = Σ_{i=1}^N a_i Φ(x_i)

and we want a solution for a ∈ R^N.

We will see that

• there is a closed form solution for a,
• the solution involves the N×N Gram matrix k(x_i, x_j) = Φ(x_i)^T Φ(x_j),
• so we can use the kernel trick again to replace scalar products.
Starting from

  Ẽ(w) = ½ Σ_{i=1}^N ( y_i − w^T Φ(x_i) )² + (λ/2) ‖w‖²

the derivative w.r.t. w gives

  dẼ(w)/dw = −Σ_{i=1}^N ( y_i − w^T Φ(x_i) ) Φ(x_i) + λw = 0

Hence

  w = Σ_{i=1}^N [ ( y_i − w^T Φ(x_i) ) / λ ] Φ(x_i) = Σ_{i=1}^N a_i Φ(x_i)

Again the vector w can be written as a linear combination of the training data (the Representer Theorem).
Dual Representation

  w = Σ_{i=1}^N a_i Φ(x_i) = [ Φ(x_1) Φ(x_2) … Φ(x_N) ] ( a_1, a_2, …, a_N )^T = Φ^T a
  M×1                          M×N                        N×1

where Φ is the N × M design matrix, Φ : x → Φ(x), R → R^M (assume N > M)
Substitute w = Φ^T a into

  Ẽ(w) = ½ ‖y − Φw‖² + (λ/2) ‖w‖²

       = ½ ‖y − ΦΦ^T a‖² + (λ/2) a^T ΦΦ^T a

       = ½ ‖y − Ka‖² + (λ/2) a^T Ka

where K = ΦΦ^T is the N × N kernel Gram matrix with entries k(x_i, x_j) = Φ(x_i)^T Φ(x_j). Minimize Ẽ(a) w.r.t. a to show (exercise) that

  a = ( K + λI )^{-1} y
  N×1     N×N      N×1

• the dual version involves inverting an N × N matrix, cf. inverting an M × M matrix in the primal
  f(x, w) = w^T Φ(x) = Φ(x)^T w = Φ(x)^T Φ^T a
          = Φ(x)^T [ Φ(x_1) Φ(x_2) … Φ(x_N) ] a
          = ( Φ(x)^T Φ(x_1), Φ(x)^T Φ(x_2), …, Φ(x)^T Φ(x_N) ) a
          = ( k(x, x_1), k(x, x_2), …, k(x, x_N) ) a

Write k(x) = ( k(x, x_1), k(x, x_2), …, k(x, x_N) )^T; then

  f(x) = k(x)^T a = k(x)^T ( K + λI )^{-1} y
• Again, see that the output is a linear blend, b(x)^T y, of the training values {y_i}, where

    b(x)^T = k(x)^T ( K + λI )^{-1}

• All the advantages of kernels: it is only necessary to provide k(x_i, x_j) rather than compute Φ(x) explicitly.
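A minimal Python sketch of the dual fit and prediction, assuming a Gaussian kernel as a stand-in for k:

  import numpy as np

  def gauss_kernel(x1, x2, sigma=0.334):
      # k(x, x') = exp(-(x - x')^2 / sigma^2), matching the Gaussian basis above.
      return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / sigma ** 2)

  def kernel_ridge_fit(x_tr, y_tr, lam, sigma=0.334):
      K = gauss_kernel(x_tr, x_tr, sigma)                        # N x N Gram matrix
      return np.linalg.solve(K + lam * np.eye(len(x_tr)), y_tr)  # a = (K + lam I)^{-1} y

  def kernel_ridge_predict(x_new, x_tr, a, sigma=0.334):
      return gauss_kernel(x_new, x_tr, sigma) @ a                # f(x) = k(x)^T a

Only kernel evaluations enter; Φ(x) never has to be formed explicitly.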
Example: 3D Human Pose from Silhouettes
Ankur Agarwal and Bill Triggs, ICML 2004
Objective: Recover 3D human body pose from image silhouettes
• 3D pose = joint angles
Applications:
• motion capture, resynthesis
• human-computer interaction
• action recognition
• visual surveillance
Silhouette descriptors to represent input
Use Shape Context Histograms (distributions of local shape context responses) as the input descriptor x.
Regression for pose vector y

Predict the 3D pose y given the shape context histogram x as

  y = A k(x)

where

• k(x) = ( k(x, x_1), k(x, x_2), …, k(x, x_N) )^T is a vector of scalar basis functions
• A = ( a_1, a_2, …, a_N ) is a matrix of dual vectors a_j
  (compare with the scalar version y = Σ_{i=1}^N a_i k(x_i, x))

Learn A from the training data {x_i, y_i} by optimizing the cost function

  min_A Σ_{i=1}^N ‖ y_i − A k(x_i) ‖² + λ trace( A^T A )
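Setting the gradient of this cost to zero gives ( K^T K + λI ) A^T = K^T Y, where K is the N×N Gram matrix with K[i, j] = k(x_i, x_j) and Y is the N×d matrix stacking the pose vectors y_i as rows. A sketch of the resulting solve (illustrative, not the authors' code):

  import numpy as np

  def fit_pose_regressor(K, Y, lam):
      # Minimize sum_i ||y_i - A k(x_i)||^2 + lam * trace(A^T A).
      # K: (N, N) Gram matrix; Y: (N, d) stacked pose vectors.
      # Returns A as (d, N), so a new pose is predicted as A @ k(x).
      N = K.shape[0]
      At = np.linalg.solve(K.T @ K + lam * np.eye(N), K.T @ Y)
      return At.T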
Training and test data

[Figure: video recordings, motion capture data, re-rendered poses (y), re-rendered silhouettes]
Results: Synthetic Spiral Walk Test Sequence

• Mean angular error per d.o.f. = 6.0°
• Complete sequence regressed from individual images
• 15% of instances show errors due to ambiguities
Ambiguities in pose reconstruction
The silhouette-to-pose problem is inherently multi-valued
Add tracking to disambiguate
Tracking results

Tracking a real motion sequence:

1. Preprocessing
   a) Background subtraction
   b) Shadow removal for silhouette extraction

2. Regression
   Obtain the 3D body joint angles as output; these can be used to render synthetic models …
Background reading
• Bishop, chapters 3 & 6.1
• More on web page:
http://www.robots.ox.ac.uk/~az/lectures/ml
• e.g. Gaussian process regression