Unsupervised Kernel Regression
Roland Memisevic
February 9, 2004
Joint work with Peter Meinicke and Stefan Klanke
Unsupervised Learning as Generalized Regression
$$y = f^*(x; \theta) + u, \qquad E(u) = 0$$
• Now call x ’latent’
• Need to solve for both x and θ.
• → iterate:
– Projection – Regression
– EM (for a true random variable x)
• Most often we choose: f(x) = Wb(x)
Idea: Use Non-parametric Regression Function
Consider:
$$f(x) = E(y \mid x) = \frac{\int y \, p(x, y) \, dy}{\int p(x, y) \, dy}$$

Approximate p(x, y) with the kernel density estimate p̂(x, y):

$$\hat{p}(x, y) = \frac{1}{N} \sum_{j=1}^{N} K_q(x, x_j) \, K_d(y, y_j),$$

for some (isotropic, ...) kernel functions K_q and K_d.

Then:
Nadaraya-Watson Estimator
$$f(x) = \sum_{j=1}^{N} \frac{K(x, x_j)}{\sum_{k=1}^{N} K(x, x_k)} \, y_j =: Y b(x)$$

with (un-normalized) kernel functions K(·, ·), e.g. RBF:

$$K(x_i, x_j) = \exp\left( -\frac{1}{2 h_l^2} \, \|x_i - x_j\|^2 \right)$$

Note: model complexity ↔ kernel bandwidth h_l
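As an aside: the estimator is only a few lines of code. A minimal NumPy sketch, with points stored row-wise; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h=1.0):
    """f(x) = sum_j K(x, x_j) y_j / sum_k K(x, x_k), with an RBF kernel."""
    # squared distances between the query x and all training inputs x_j
    d2 = np.sum((X - x) ** 2, axis=1)
    K = np.exp(-d2 / (2.0 * h ** 2))   # un-normalized RBF kernel values
    b = K / K.sum()                    # normalized kernel vector b(x)
    return Y.T @ b                     # f(x) = Y b(x)
```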
Now call x latent...
Unsupervised Kernel Regression (1)
...and minimize the data-space reconstruction error:
$$E_{\mathrm{obs}}(X) = \frac{1}{N} \sum_{i=1}^{N} \|y_i - f(x_i)\|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{N} \frac{K(x_i, x_j)}{\sum_{k=1}^{N} K(x_i, x_k)} \, y_j \Big\|^2$$

Set h_l = 1.0

model complexity ↔ scale of X
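A minimal sketch of this objective, assuming h_l = 1 as above and row-wise arrays (X is N × q, Y is N × d):

```python
import numpy as np

def ukr1_error(X, Y):
    """E_obs(X) = (1/N) sum_i ||y_i - f(x_i)||^2 with the Nadaraya-Watson f."""
    # N x N matrix of squared latent distances
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * d2)                         # RBF kernel, h_l = 1
    B = K / K.sum(axis=1, keepdims=True)          # row i holds b(x_i)^T
    R = B @ Y                                     # reconstructions f(x_i)
    return np.mean(np.sum((Y - R) ** 2, axis=1))
```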
Unsupervised Kernel Regression (1)
Learning with gradient descent.
Complexity: O(N²dq) for both E and ∂E/∂X.
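For illustration, a bare-bones descent loop on the ukr1_error sketch above; the numerical gradient, learning rate, and step count stand in for the analytic ∂E/∂X one would use in practice:

```python
from scipy.optimize import approx_fprime

def ukr1_descend(X, Y, lr=0.01, steps=100):
    """Plain gradient descent on ukr1_error (defined in the sketch above)."""
    shape = X.shape
    for _ in range(steps):
        # finite-difference gradient of E_obs with respect to the latents
        g = approx_fprime(X.ravel(),
                          lambda x: ukr1_error(x.reshape(shape), Y),
                          1e-6)
        X = X - lr * g.reshape(shape)
    return X
```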
Unsupervised Kernel Regression (1)
Regularization (a): ’Weight Decay’:
instead of E_obs minimize

$$E_P(X) = E_{\mathrm{obs}}(X) + \lambda P(X),$$

for some penalty P(X), e.g.

$$P(X) = \|X\|_F^2, \qquad \text{or} \qquad P(X) = -\sum_{i=1}^{N} \hat{p}(x_i)$$

$$\hat{X} = \arg\min_X E_P(X)$$
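A one-line sketch of the weight-decay variant, reusing ukr1_error from above; the default λ is arbitrary:

```python
import numpy as np

def ukr1_penalized(X, Y, lam=0.1):
    """E_P(X) = E_obs(X) + lambda * ||X||_F^2."""
    return ukr1_error(X, Y) + lam * np.sum(X ** 2)
```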
Unsupervised Kernel Regression (1)
Regularization (b): Constrain the solution:
$$\hat{X} = \arg\min_X E_{\mathrm{obs}}(X), \qquad \text{subject to } c(X) \le 0$$
... or subject to bound constraints.
Unsupervised Kernel Regression (1)
Regularization (c): Crossvalidation
Use the modified objective function:
$$E_{\mathrm{cv}} = \frac{1}{N} \sum_{i=1}^{N} \|y_i - f_{-i}(x_i)\|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)} \, y_j \Big\|^2$$

’Built-in’ leave-one-out cross-validation!
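In code, the leave-one-out form needs a single change to the earlier sketch: zero the diagonal of the kernel matrix before normalizing, so that no point takes part in its own reconstruction:

```python
import numpy as np

def ukr1_loo_error(X, Y):
    """E_cv: leave-one-out variant of the data-space reconstruction error."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * d2)
    np.fill_diagonal(K, 0.0)                  # exclude j = i from both sums
    B = K / K.sum(axis=1, keepdims=True)
    R = B @ Y
    return np.mean(np.sum((Y - R) ** 2, axis=1))
```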
Unsupervised Kernel Regression (1)
To avoid local minima use ’Deterministic Annealing’:
Begin with strong regularization, then anneal it away.

→ Applies to regularization variants (a) and (b).
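A hedged sketch of such an annealing loop for penalty variant (a), reusing ukr1_penalized from above; the start value, decay factor, and number of stages are assumptions, not values from the talk:

```python
from scipy.optimize import minimize

def ukr1_anneal(X0, Y, lam0=1.0, decay=0.7, stages=20):
    """Re-optimize the penalized objective while shrinking lambda."""
    X, lam = X0.copy(), lam0
    for _ in range(stages):
        res = minimize(lambda x: ukr1_penalized(x.reshape(X0.shape), Y, lam),
                       X.ravel())
        X = res.x.reshape(X0.shape)
        lam *= decay              # gradually remove the regularization
    return X
```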
Unsupervised Kernel Regression (1)
After training:
• Learned X.
• Learned ’Forward mapping’ implicitly.
• Define the ’backward mapping’ by orthogonal projection:

$$g(y) := \arg\min_x \|y - f(x)\|_2^2$$

(Initialize with the ’nearest reconstruction’ $x_0 := \arg\min_{x_i} \|y - f(x_i)\|_2^2, \; i = 1, \dots, N$)
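An illustrative sketch of this projection, reusing nadaraya_watson from above together with scipy's general-purpose minimizer:

```python
import numpy as np
from scipy.optimize import minimize

def backward_map(y, X, Y, h=1.0):
    """g(y) = argmin_x ||y - f(x)||^2, started at the nearest reconstruction."""
    recon = np.array([nadaraya_watson(xi, X, Y, h) for xi in X])  # all f(x_i)
    x0 = X[np.argmin(np.sum((recon - y) ** 2, axis=1))]           # nearest reconstruction
    res = minimize(lambda x: np.sum((y - nadaraya_watson(x, X, Y, h)) ** 2), x0)
    return res.x
```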
Unsupervised Kernel Regression (1)
[Figure: error curve (log scale) and intermediate solutions on a toy data set for η = 0.8^1, 0.8^2, 0.8^4, 0.8^7, 0.8^20.]
Discussion UKR(1)
• Solves the NLDR problem ’completely’
• Principled
• ’Built in’ regularizer
• Arbitrary latent space dimensionalities
• Latent space density (Consider non-uniformly distributed data!)
• Introduction of prior knowledge
Unsupervised Kernel Regression (2)
Idea:
Turn Nadaraya-Watson Estimator upside down...
Instead of E(y | x), estimate E(x | y):

$$g(y) := \sum_{j=1}^{N} \frac{K(y, y_j)}{\sum_{k=1}^{N} K(y, y_k)} \, x_j$$
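In code this is the first Nadaraya-Watson sketch with the roles of the two spaces exchanged:

```python
import numpy as np

def reversed_nw(y, X, Y, h=1.0):
    """g(y) = sum_j K(y, y_j) x_j / sum_k K(y, y_k)."""
    d2 = np.sum((Y - y) ** 2, axis=1)   # distances now live in observable space
    K = np.exp(-d2 / (2.0 * h ** 2))
    return X.T @ (K / K.sum())
```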
Justification? E.g. ’Noisy Interpolation’ (Webb, 1994).
Unsupervised Kernel Regression (2)
→ The objective function is now:

$$E_{\mathrm{lat}}(X) = \frac{1}{N} \sum_{i=1}^{N} \|x_i - g(y_i)\|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{N} \frac{K(y_i, y_j)}{\sum_{k=1}^{N} K(y_i, y_k)} \, x_j \Big\|^2 =: \frac{1}{N} \|X - X B(Y)\|_F^2$$
Unsupervised Kernel Regression (2)
Closed-form solution, because

$$E_{\mathrm{lat}}(X) = \frac{1}{N} \operatorname{tr}\big( X (I_N - B(Y)) (I_N - B(Y))^T X^T \big) = \frac{1}{N} \operatorname{tr}(X Q X^T)$$

with $Q := (I_N - B(Y)) (I_N - B(Y))^T$.

We have to minimize a quadratic form in Q!

→ Constraints: $X \mathbf{1}_N = 0$ and $X X^T = I_q$; solution by EVD.
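A minimal sketch of the spectral solution with row-wise data (Y is N × d, the returned latents are N × q); it assumes the zero eigenvalue belonging to the constant direction is isolated:

```python
import numpy as np

def ukr2(Y, q, h=1.0):
    """Minimize tr(X Q X^T) subject to X 1_N = 0 and X X^T = I_q via EVD."""
    N = Y.shape[0]
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2.0 * h ** 2))
    B = K / K.sum(axis=1, keepdims=True)      # B(Y); rows sum to one
    M = np.eye(N) - B
    Q = M @ M.T
    H = np.eye(N) - np.ones((N, N)) / N       # centering enforces X 1_N = 0
    w, V = np.linalg.eigh(H @ Q @ H)          # eigenvalues in ascending order
    # skip the constant direction (eigenvalue 0), keep the q next-smallest
    return V[:, 1:q + 1]
```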
Unsupervised Kernel Regression (2)
After training:
• Learned X.
• Learned ’Backward mapping’ implicitly.
• ...but not f
Unsupervised Kernel Regression (2)
[Figure: UKR(2) solutions for 15 kernel bandwidths from h = 0.61 to h = 4.2444.]

Which is the correct kernel bandwidth!?
Discussion UKR(2)
• Learns only X and g
• ’Yet another Spectral Method’ - however, from a regression perspective!
• Generalizes to new, unseen (observable space) data
• Fast
• Can provide useful initializations...
Unsupervised Kernel Regression (2+1)
A way to combine the advantages of both methods:
Use UKR(2) as an initialization for UKR(1).
Problem: the UKR(2) solution would be destroyed after a few gradient steps.

Idea: minimize the UKR(1) error with respect to the scale of X only:

$$E(S) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j \neq i} \frac{K(S x_i, S x_j)}{\sum_{k \neq i} K(S x_i, S x_k)} \, y_j \Big\|^2$$

with S a diagonal scaling matrix.
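A sketch of this rescaling step, optimizing the q diagonal entries of S under the leave-one-out error ukr1_loo_error from earlier; initializing at S = I is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

def rescale_latents(X, Y):
    """Minimize E(S) over diag(S), keeping the UKR(2) latents X fixed."""
    q = X.shape[1]
    # scaling latent dimension m by s[m] is the same as using the latents X S
    res = minimize(lambda s: ukr1_loo_error(X * s, Y), np.ones(q))
    return X * res.x                           # the rescaled latents S x_i
```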
Now:
• Using SX, we obtain a ’complete model’.
• Can continue with UKR(1).
• Or just do model assessment.
Experiments
UKR(2) vs. LLE
Use UKR(2) and LLE as initialization. Then determine reconstruction error.
Performance on toy data:
                     halfcircle         scurve            spiral
σ²                   0.0      0.2       0.0      0.5      0.0      0.05
LLE + rescaling      0.00060  0.0735    0.3279   0.4677   0.2364   0.2453
UKR(2) + rescaling   0.00035  0.0472    0.1107   0.1911   0.1971   0.2057
LLE + UKR(1)         0.00059  0.0300    0.0790   0.0725   0.0582   0.0400
UKR(2) + UKR(1)      0.00035  0.0218    0.0481   0.0685   0.0319   0.0369
Experiments
UKR vs. PPS and GTM: (Chang and Ghosh, 2001)
Approximation of UCI data:

                     iris              glass             diabetes
q                    1       2         1       2         1       2
GTM                  2.7020  0.9601    8.0681  1.9634    6.7100  1.8822
PPS                  2.5786  0.8757    7.9465  1.8616    6.5509  1.8202
UKR(2)               2.1164  1.2667    7.6977  6.1930    6.4130  4.9041
UKR(2) + UKR(1)      1.2017  0.5961    5.8665  4.3678    5.8018  3.9466
UKR(1) Homotopy      0.9414  0.5651    4.7515  3.6402    6.2488  3.9619
Experiments
Visualization: MNIST-Digits
[Figure: two-dimensional latent-space visualization of the MNIST digits.]
Experiments
The duck from the beginning in 3d:

[Figure: three-dimensional plot of the duck data set.]
Experiments
Interpolation in latent space
vs.
interpolation in observable space
[Figure: interpolation in latent space vs. in observable space.]
Experiments
Introducing prior information:
Let the user determine what ends up in which dimension.

Approach: for different properties, sort the data into pre-defined equivalence classes.

In the (latent) dimensions that shall reflect some specific property, penalize the distance between objects that belong to the same class (with respect to this property).
Experiments
The (2-dimensional) PCA representation of the ’olivettifaces’ data looks like:

[Figure: 2-D PCA scatter plot of the Olivetti faces, with each point labeled by subject.]
Experiments
Considering the two properties ’identity’ and ’wears glasses’, the latent representation might look like:

[Figure: learned latent representation; horizontal axis ’glasses vs. non-glasses’, vertical axis ’identity’, points labeled by subject.]
Experiments
Now we can give people glasses:
’Bridge the gap’
[Figure: two-dimensional toy data set.]

Clustering or Dimensionality Reduction?!
’Bridge the gap’
• Recall: UKR learns a latent-space density, not just latent representatives and not just a principal manifold!

• We can use the latent density as a constraint at test time:
’Bridge the gap’
[Figure: three panels illustrating the effect of the latent density constraint on the toy data.]
Things Not Covered/Future Work:
• (Mercer-)Kernel Feature Space Variant
• Approximate (whole) data set using a reduced set of prototypes
• Condensing
• ’Two data spaces’, CCA
• Matlab Toolbox