Lecture 5: Regression
C4B Machine Learning Hilary 2011 A. Zisserman
• Linear regression
• Loss function
• Ridge regression
• Basis functions
• Dual representation and kernels
Regression
• Suppose we are given a training set of N observations
  ( x_1, y_1 ), …, ( x_N, y_N )  with x_i ∈ R^d, y_i ∈ R
• The regression problem is to estimate f(x) from this data such that
  y_i = f(x_i)
Learning by optimization
• As in the case of classification, learning a regressor can be formulated as an optimization:
Minimize with respect to f ∈ F:

  Σ_{i=1}^N ℓ( f(x_i), y_i ) + λ R(f)

(the first term is the loss function, the second the regularization)
• There is a choice of both loss functions and regularization
• e.g. squared loss, SVM “hinge-like” loss
• squared regularizer, lasso regularizer
• Algorithms can be kernelized
Choice of regression function – non-linear basis functions
• The function for regression, f(x, w), is a non-linear function of x, but linear in w:
  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x) = w^T Φ(x)
• For example, for x ∈ R, polynomial regression with φ_j(x) = x^j:

  f(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^M w_j x^j

  e.g. for M = 3,

  f(x, w) = ( w_0, w_1, w_2, w_3 ) ( 1, x, x², x³ )^T = w^T Φ(x)

  Φ : x → Φ(x),  R → R⁴
• or the basis functions can be Gaussians centred on the training data:

  φ_j(x) = exp( −( x − x_j )² / 2σ² )

  e.g. for 3 points,

  f(x, w) = ( w_1, w_2, w_3 ) ( e^{−(x−x_1)²/2σ²}, e^{−(x−x_2)²/2σ²}, e^{−(x−x_3)²/2σ²} )^T = w^T Φ(x)

  Φ : x → Φ(x),  R → R³
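As a concrete sketch (Python, not from the slides; the function names poly_features and gauss_features are illustrative choices), both feature maps can be written as:

  import numpy as np

  def poly_features(x, M):
      # Map scalars x to (1, x, x^2, ..., x^M): an N x (M+1) design matrix.
      return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

  def gauss_features(x, centres, sigma):
      # Map scalars x to Gaussian responses exp(-(x - x_j)^2 / (2 sigma^2)),
      # one column per centre (here, per training point).
      x = np.asarray(x, dtype=float)[:, None]
      c = np.asarray(centres, dtype=float)[None, :]
      return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

Each row of the returned matrix is Φ(x_i)^T for one input x_i.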
Least squares “ridge regression”
• Cost function – squared loss:
  Ẽ(w) = ½ Σ_{i=1}^N { f(x_i, w) − y_i }² + (λ/2) ‖w‖²

  (the first term is the loss function, the second the regularization)

• Regression function for x (1D):

  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x) = w^T Φ(x)

• NB squared loss arises in Maximum Likelihood estimation for an error model in which the measured target value is the true value plus Gaussian noise:

  y_i = f(x_i) + n_i,  n_i ∼ N(0, σ²)
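Making that explicit: under this noise model the negative log-likelihood is

  −log Π_{i=1}^N N( y_i | f(x_i, w), σ² ) = (1/2σ²) Σ_{i=1}^N ( y_i − f(x_i, w) )² + (N/2) log( 2πσ² )

so maximizing the likelihood over w is the same as minimizing the squared loss.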
Solving for the weights w
Notation: write the target and regressed values as N-vectors
  y = ( y_1, y_2, …, y_N )^T,   f = ( Φ(x_1)^T w, Φ(x_2)^T w, …, Φ(x_N)^T w )^T = Φw

where

        [ 1  φ_1(x_1) …  φ_M(x_1) ]
  Φ  =  [ 1  φ_1(x_2) …  φ_M(x_2) ]    and   w = ( w_0, w_1, …, w_M )^T
        [ ⋮      ⋮            ⋮    ]
        [ 1  φ_1(x_N) …  φ_M(x_N) ]
e.g. for polynomial regression with basis functions up to x²:

         [ 1  x_1  x_1² ]
  Φw  =  [ 1  x_2  x_2² ]  ( w_0, w_1, w_2 )^T
         [ ⋮   ⋮     ⋮  ]
         [ 1  x_N  x_N² ]
Φ is the N × M design matrix (one row per data point, one column per basis function)
  Ẽ(w) = ½ Σ_{i=1}^N { f(x_i, w) − y_i }² + (λ/2) ‖w‖²

       = ½ Σ_{i=1}^N ( y_i − w^T Φ(x_i) )² + (λ/2) ‖w‖²

       = ½ ‖y − Φw‖² + (λ/2) ‖w‖²
Now set the derivative w.r.t. w to zero for the minimum:

  dẼ(w)/dw = −Φ^T ( y − Φw ) + λw = 0

Hence

  ( Φ^T Φ + λI ) w = Φ^T y

  w = ( Φ^T Φ + λI )^{-1} Φ^T y
  w = ( Φ^T Φ + λI )^{-1} Φ^T y
  M×1      M×M           M×N  N×1

with M basis functions and N data points (assume N > M)

• This shows that there is a unique solution.
• If λ = 0 (no regularization), then

    w = ( Φ^T Φ )^{-1} Φ^T y = Φ⁺ y

  where Φ⁺ is the pseudo-inverse of Φ (pinv in Matlab)
• Adding the term λI improves the conditioning of the inverse: even if Φ is not full rank, ( Φ^T Φ + λI ) is invertible for any λ > 0, and better conditioned as λ grows
• As λ → ∞,  w → (1/λ) Φ^T y → 0
• Often the regularization is applied only to the inhomogeneous part of w, i.e. to w̃, where w = ( w_0, w̃ )
  w = ( Φ^T Φ + λI )^{-1} Φ^T y

  f(x, w) = w^T Φ(x) = Φ(x)^T w
          = Φ(x)^T ( Φ^T Φ + λI )^{-1} Φ^T y
          = b(x)^T y

Output is a linear blend, b(x)^T y, of the training values {y_i}
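A minimal Python sketch of this closed-form fit (illustrative: np.linalg.solve replaces the explicit inverse for numerical stability, and the true curve and noise level here are assumed stand-ins for the slides' example):

  import numpy as np

  def fit_ridge(Phi, y, lam):
      # Solve (Phi^T Phi + lam I) w = Phi^T y for the ridge weights.
      return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

  rng = np.random.default_rng(0)
  x = np.linspace(0, 1, 9)                                   # N = 9 samples
  y = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(9)  # noisy targets
  Phi = np.vander(x, 8, increasing=True)                     # degree M = 7 polynomial basis
  w = fit_ridge(Phi, y, lam=1e-3)

  # Prediction at new points; this is the same linear blend b(x)^T y in weight form.
  x_new = np.linspace(0, 1, 200)
  f_new = np.vander(x_new, 8, increasing=True) @ w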
[Figure: sample points and the ideal fit (the true curve, in red) over x ∈ [0, 1]]
Example 1: polynomial basis functions
• The red curve is the true function (which is not a polynomial)
• The data points are samples from the curve with added noise in y.
• There is a choice in both the degree, M, of the basis functions used, and in the strength of the regularization
  f(x, w) = Σ_{j=0}^M w_j x^j = w^T Φ(x)

  w is an (M+1)-dimensional vector

  Φ : x → Φ(x),  R → R^{M+1}
[Figures: fits for λ = 100, λ = 0.001, λ = 1e-10 and λ = 1e-15, each showing the sample points and the ideal fit]
N = 9 samples, M = 7
[Figures: the M = 3 polynomial basis functions, and the corresponding least-squares fit (sample points, ideal fit, least-squares solution)]
[Figures: the M = 5 polynomial basis functions, and the corresponding least-squares fit (sample points, ideal fit, least-squares solution)]
[Figure: sample points and the ideal fit (the true curve, in red)]
Example 2: Gaussian basis functions
• The red curve is the true function (which is not a polynomial)
• The data points are samples from the curve with added noise in y.
• Basis functions are centred on the training data (N points)
• There is a choice in both the scale, sigma, of the basis functions used, and in the strength of the regularization
  f(x, w) = Σ_{i=1}^N w_i e^{−(x−x_i)²/σ²} = w^T Φ(x)

  w is an N-vector

  Φ : x → Φ(x),  R → R^N
N = 9 samples, sigma = 0.334
[Figures: Gaussian-basis fits for λ = 100, λ = 0.001, λ = 1e-10 and λ = 1e-15, each showing the sample points and the ideal fit]
Choosing lambda using a validation set

[Figure: training and validation error norm against log λ, with the minimum validation error marked]
[Figure: sample points, ideal fit, and the fit with λ chosen on the validation set]
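A sketch of this selection in Python (the λ grid, σ value, and the train/validation split are illustrative assumptions; the design matrix follows the slides' exp(−(x−x_i)²/σ²) basis centred on the training data):

  import numpy as np

  def gauss_design(x, centres, sigma):
      # N x N_centres matrix of exp(-(x - x_j)^2 / sigma^2) responses.
      return np.exp(-(x[:, None] - centres[None, :]) ** 2 / sigma ** 2)

  def choose_lambda(x_tr, y_tr, x_val, y_val, sigma=0.334):
      best_err, best_lam = np.inf, None
      for lam in 10.0 ** np.arange(-15, 3):     # grid over log lambda
          Phi = gauss_design(x_tr, x_tr, sigma)  # centres on the training points
          w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(x_tr)), Phi.T @ y_tr)
          err = np.linalg.norm(gauss_design(x_val, x_tr, sigma) @ w - y_val)
          if err < best_err:                     # keep the minimum validation error
              best_err, best_lam = err, lam
      return best_lam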
[Figures: validation-set fit and the Gaussian basis functions for sigma = 0.1]
[Figures: validation-set fit and the Gaussian basis functions for sigma = 0.334]
Summary and preview

So far we have considered the primal problem, where

  f(x, w) = Σ_{i=1}^M w_i φ_i(x) = w^T Φ(x)

and we wanted a solution for w ∈ R^M.

Now we will consider the dual problem, where

  w = Σ_{i=1}^N a_i Φ(x_i)

and we want a solution for a ∈ R^N.

We will see that

• there is a closed form solution for a,
• the solution involves the N×N Gram matrix k(x_i, x_j) = Φ(x_i)^T Φ(x_j),
• so we can use the kernel trick again to replace scalar products.
Starting from

  Ẽ(w) = ½ Σ_{i=1}^N ( y_i − w^T Φ(x_i) )² + (λ/2) ‖w‖²

the derivative w.r.t. w gives

  dẼ(w)/dw = −Σ_{i=1}^N ( y_i − w^T Φ(x_i) ) Φ(x_i) + λw = 0

Hence

  w = Σ_{i=1}^N [ ( y_i − w^T Φ(x_i) ) / λ ] Φ(x_i) = Σ_{i=1}^N a_i Φ(x_i)

Again the vector w can be written as a linear combination of the training data (the Representer Theorem).
Dual Representation

  w = Σ_{i=1}^N a_i Φ(x_i) = [ Φ(x_1) Φ(x_2) … Φ(x_N) ] ( a_1, a_2, …, a_N )^T = Φ^T a
  M×1                          M×N                        N×1

where Φ is the N × M design matrix, Φ : x → Φ(x), R → R^M (assume N > M)
Substitute w = Φ^T a into

  Ẽ(w) = ½ ‖y − Φw‖² + (λ/2) ‖w‖²

       = ½ ‖y − ΦΦ^T a‖² + (λ/2) a^T ΦΦ^T a

       = ½ ‖y − Ka‖² + (λ/2) a^T Ka

where K = ΦΦ^T is the N × N kernel Gram matrix with entries k(x_i, x_j) = Φ(x_i)^T Φ(x_j). Minimize Ẽ(a) w.r.t. a to show (exercise) that

  a = ( K + λI )^{-1} y
  N×1     N×N      N×1

• the dual version involves inverting an N × N matrix, cf. inverting an M × M matrix in the primal
  f(x, w) = w^T Φ(x) = Φ(x)^T w = Φ(x)^T Φ^T a
          = Φ(x)^T [ Φ(x_1) Φ(x_2) … Φ(x_N) ] a
          = ( Φ(x)^T Φ(x_1), Φ(x)^T Φ(x_2), …, Φ(x)^T Φ(x_N) ) a
          = ( k(x, x_1), k(x, x_2), …, k(x, x_N) ) a

Write k(x) = ( k(x, x_1), k(x, x_2), …, k(x, x_N) )^T; then

  f(x) = k(x)^T a = k(x)^T ( K + λI )^{-1} y
• Again, see that the output is a linear blend, b(x)^T y, of the training values {y_i}, where

    b(x)^T = k(x)^T ( K + λI )^{-1}

• All the advantages of kernels: it is only necessary to provide k(x_i, x_j) rather than compute Φ(x) explicitly.
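A minimal Python sketch of the dual fit and prediction, assuming a Gaussian kernel as a stand-in for k:

  import numpy as np

  def gauss_kernel(x1, x2, sigma=0.334):
      # k(x, x') = exp(-(x - x')^2 / sigma^2), matching the Gaussian basis above.
      return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / sigma ** 2)

  def kernel_ridge_fit(x_tr, y_tr, lam, sigma=0.334):
      K = gauss_kernel(x_tr, x_tr, sigma)                        # N x N Gram matrix
      return np.linalg.solve(K + lam * np.eye(len(x_tr)), y_tr)  # a = (K + lam I)^{-1} y

  def kernel_ridge_predict(x_new, x_tr, a, sigma=0.334):
      return gauss_kernel(x_new, x_tr, sigma) @ a                # f(x) = k(x)^T a

Only kernel evaluations enter; Φ(x) never has to be formed explicitly.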
Example: 3D Human Pose from Silhouettes
Ankur Agarwal and Bill Triggs, ICML 2004
Objective: Recover 3D human body pose from image silhouettes
• 3D pose = joint angles
Applications:
• motion capture, resynthesis
• human-computer interaction
• action recognition
• visual surveillance
Silhouette descriptors to represent input
Use Shape Context Histograms (distributions of local shape context responses) as the input descriptor x.
Regression for pose vector y

Predict the 3D pose y given the shape context histogram x as

  y = A k(x)

where

• k(x) = ( k(x, x_1), k(x, x_2), …, k(x, x_N) )^T is a vector of scalar basis functions
• A = ( a_1, a_2, …, a_N ) is a matrix of dual vectors a_j
  (compare with the scalar version y = Σ_{i=1}^N a_i k(x_i, x))

Learn A from the training data {x_i, y_i} by optimizing the cost function

  min_A Σ_{i=1}^N ‖ y_i − A k(x_i) ‖² + λ trace( A^T A )
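Setting the gradient of this cost to zero gives ( K^T K + λI ) A^T = K^T Y, where K is the N×N Gram matrix with K[i, j] = k(x_i, x_j) and Y is the N×d matrix stacking the pose vectors y_i as rows. A sketch of the resulting solve (illustrative, not the authors' code):

  import numpy as np

  def fit_pose_regressor(K, Y, lam):
      # Minimize sum_i ||y_i - A k(x_i)||^2 + lam * trace(A^T A).
      # K: (N, N) Gram matrix; Y: (N, d) stacked pose vectors.
      # Returns A as (d, N), so a new pose is predicted as A @ k(x).
      N = K.shape[0]
      At = np.linalg.solve(K.T @ K + lam * np.eye(N), K.T @ Y)
      return At.T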
Training and test data

[Figure: video recordings, motion capture data, re-rendered poses (y), re-rendered silhouettes]
Results: Synthetic Spiral Walk Test Sequence

• Mean angular error per d.o.f. = 6.0°
• Complete sequence regressed from individual images
• 15% of instances show errors due to ambiguities
Ambiguities in pose reconstruction
The silhouette-to-pose problem is inherently multi-valued
Add tracking to disambiguate
Tracking results

Tracking a real motion sequence:

1. Preprocessing
   a) Background subtraction
   b) Shadow removal for silhouette extraction

2. Regression
   Obtain the 3D body joint angles as output; these can be used to render synthetic models …
Background reading
• Bishop, chapters 3 & 6.1
• More on web page:
http://www.robots.ox.ac.uk/~az/lectures/ml
• e.g. Gaussian process regression