Unsupervised Kernel Regression
Roland Memisevic
February 9, 2004
Joint work with Peter Meinicke and Stefan Klanke
Unsupervised Learning as Generalized Regression
$$y = f^*(x; \theta) + u, \qquad E(u) = 0$$
• Now call x ’latent’
• Need to solve for both x and θ.
• → iterate:
– Projection – Regression
– EM (for a true random variable x)
• Most often we choose: f(x) = Wb(x)
Idea: Use Non-parametric Regression Function
Consider:
$$f(x) = E(y \mid x) = \frac{\int y \, p(x, y) \, dy}{\int p(x, y) \, dy}$$

Approximate p(x, y) with the kernel density estimate p̂(x, y):

$$\hat{p}(x, y) = \frac{1}{N} \sum_{j=1}^{N} K_q(x, x_j) \, K_d(y, y_j),$$

for some (isotropic, ...) kernel functions K_q and K_d.

Then:
Nadaraya-Watson Estimator
$$f(x) = \sum_{j=1}^{N} \frac{K(x, x_j)}{\sum_{k=1}^{N} K(x, x_k)} \, y_j =: Y b(x)$$

with (un-normalized) kernel functions K(·, ·), e.g. RBF:

$$K(x_i, x_j) = \exp\left( -\frac{1}{2 h_l^2} \, \|x_i - x_j\|^2 \right)$$

Note: model complexity ↔ kernel bandwidth h_l
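As an aside: the estimator is only a few lines of code. A minimal NumPy sketch, with points stored row-wise; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h=1.0):
    """f(x) = sum_j K(x, x_j) y_j / sum_k K(x, x_k), with an RBF kernel."""
    # squared distances between the query x and all training inputs x_j
    d2 = np.sum((X - x) ** 2, axis=1)
    K = np.exp(-d2 / (2.0 * h ** 2))   # un-normalized RBF kernel values
    b = K / K.sum()                    # normalized kernel vector b(x)
    return Y.T @ b                     # f(x) = Y b(x)
```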
Now call x latent...
Unsupervised Kernel Regression (1)
...and minimize the data-space reconstruction error:
$$E_{\mathrm{obs}}(X) = \frac{1}{N} \sum_{i=1}^{N} \|y_i - f(x_i)\|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{N} \frac{K(x_i, x_j)}{\sum_{k=1}^{N} K(x_i, x_k)} \, y_j \Big\|^2$$

Set h_l = 1.0

model complexity ↔ scale of X
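A minimal sketch of this objective, assuming h_l = 1 as above and row-wise arrays (X is N × q, Y is N × d):

```python
import numpy as np

def ukr1_error(X, Y):
    """E_obs(X) = (1/N) sum_i ||y_i - f(x_i)||^2 with the Nadaraya-Watson f."""
    # N x N matrix of squared latent distances
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * d2)                         # RBF kernel, h_l = 1
    B = K / K.sum(axis=1, keepdims=True)          # row i holds b(x_i)^T
    R = B @ Y                                     # reconstructions f(x_i)
    return np.mean(np.sum((Y - R) ** 2, axis=1))
```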
Unsupervised Kernel Regression (1)
Learning with gradient descent.
Complexity: O(N²dq) for both E and ∂E/∂X.
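For illustration, a bare-bones descent loop on the ukr1_error sketch above; the numerical gradient, learning rate, and step count stand in for the analytic ∂E/∂X one would use in practice:

```python
from scipy.optimize import approx_fprime

def ukr1_descend(X, Y, lr=0.01, steps=100):
    """Plain gradient descent on ukr1_error (defined in the sketch above)."""
    shape = X.shape
    for _ in range(steps):
        # finite-difference gradient of E_obs with respect to the latents
        g = approx_fprime(X.ravel(),
                          lambda x: ukr1_error(x.reshape(shape), Y),
                          1e-6)
        X = X - lr * g.reshape(shape)
    return X
```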
Unsupervised Kernel Regression (1)
Regularization (a): ’Weight Decay’:
instead of E_obs minimize

$$E_P(X) = E_{\mathrm{obs}}(X) + \lambda P(X),$$

for some penalty P(X), e.g.

$$P(X) = \|X\|_F^2, \qquad \text{or} \qquad P(X) = -\sum_{i=1}^{N} \hat{p}(x_i)$$

$$\hat{X} = \arg\min_X E_P(X)$$
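A one-line sketch of the weight-decay variant, reusing ukr1_error from above; the default λ is arbitrary:

```python
import numpy as np

def ukr1_penalized(X, Y, lam=0.1):
    """E_P(X) = E_obs(X) + lambda * ||X||_F^2."""
    return ukr1_error(X, Y) + lam * np.sum(X ** 2)
```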
Unsupervised Kernel Regression (1)
Regularization (b): Constrain the solution:
$$\hat{X} = \arg\min_X E_{\mathrm{obs}}(X), \qquad \text{subject to } c(X) \le 0$$
... or subject to bound constraints.
Unsupervised Kernel Regression (1)
Regularization (c): Crossvalidation
Use the modified objective function:
$$E_{\mathrm{cv}} = \frac{1}{N} \sum_{i=1}^{N} \|y_i - f_{-i}(x_i)\|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big\| y_i - \sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)} \, y_j \Big\|^2$$

’Built-in’ leave-one-out cross-validation!
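In code, the leave-one-out form needs a single change to the earlier sketch: zero the diagonal of the kernel matrix before normalizing, so that no point takes part in its own reconstruction:

```python
import numpy as np

def ukr1_loo_error(X, Y):
    """E_cv: leave-one-out variant of the data-space reconstruction error."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * d2)
    np.fill_diagonal(K, 0.0)                  # exclude j = i from both sums
    B = K / K.sum(axis=1, keepdims=True)
    R = B @ Y
    return np.mean(np.sum((Y - R) ** 2, axis=1))
```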
Unsupervised Kernel Regression (1)
To avoid local minima use ’Deterministic Annealing’:
Begin with strong regularization, then anneal it away.

→ Applies to regularization variants (a) and (b).
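A hedged sketch of such an annealing loop for penalty variant (a), reusing ukr1_penalized from above; the start value, decay factor, and number of stages are assumptions, not values from the talk:

```python
from scipy.optimize import minimize

def ukr1_anneal(X0, Y, lam0=1.0, decay=0.7, stages=20):
    """Re-optimize the penalized objective while shrinking lambda."""
    X, lam = X0.copy(), lam0
    for _ in range(stages):
        res = minimize(lambda x: ukr1_penalized(x.reshape(X0.shape), Y, lam),
                       X.ravel())
        X = res.x.reshape(X0.shape)
        lam *= decay              # gradually remove the regularization
    return X
```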
Unsupervised Kernel Regression (1)
After training:
• Learned X.
• Learned ’Forward mapping’ implicitly.
• Define the ’backward mapping’ by orthogonal projection:

$$g(y) := \arg\min_x \|y - f(x)\|_2^2$$

(Initialize with the ’nearest reconstruction’ $x_0 := \arg\min_{x_i} \|y - f(x_i)\|_2^2, \; i = 1, \dots, N$)
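An illustrative sketch of this projection, reusing nadaraya_watson from above together with scipy's general-purpose minimizer:

```python
import numpy as np
from scipy.optimize import minimize

def backward_map(y, X, Y, h=1.0):
    """g(y) = argmin_x ||y - f(x)||^2, started at the nearest reconstruction."""
    recon = np.array([nadaraya_watson(xi, X, Y, h) for xi in X])  # all f(x_i)
    x0 = X[np.argmin(np.sum((recon - y) ** 2, axis=1))]           # nearest reconstruction
    res = minimize(lambda x: np.sum((y - nadaraya_watson(x, X, Y, h)) ** 2), x0)
    return res.x
```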
Unsupervised Kernel Regression (1)
[Figure: error curve (log scale) and intermediate solutions on a toy data set for η = 0.8^1, 0.8^2, 0.8^4, 0.8^7, 0.8^20.]
Discussion UKR(1)
• Solves the NLDR problem ’completely’
• Principled
• ’Built in’ regularizer
• Arbitrary latent space dimensionalities
• Latent space density (Consider non-uniformly distributed data!)
• Introduction of prior knowledge
Unsupervised Kernel Regression (2)
Idea:
Turn Nadaraya-Watson Estimator upside down...
Instead of E(y | x), estimate E(x | y):

$$g(y) := \sum_{j=1}^{N} \frac{K(y, y_j)}{\sum_{k=1}^{N} K(y, y_k)} \, x_j$$
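In code this is the first Nadaraya-Watson sketch with the roles of the two spaces exchanged:

```python
import numpy as np

def reversed_nw(y, X, Y, h=1.0):
    """g(y) = sum_j K(y, y_j) x_j / sum_k K(y, y_k)."""
    d2 = np.sum((Y - y) ** 2, axis=1)   # distances now live in observable space
    K = np.exp(-d2 / (2.0 * h ** 2))
    return X.T @ (K / K.sum())
```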
Justification? E.g. ’Noisy Interpolation’ (Webb, 1994).
Unsupervised Kernel Regression (2)
→ The objective function is now:

$$E_{\mathrm{lat}}(X) = \frac{1}{N} \sum_{i=1}^{N} \|x_i - g(y_i)\|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{N} \frac{K(y_i, y_j)}{\sum_{k=1}^{N} K(y_i, y_k)} \, x_j \Big\|^2 =: \frac{1}{N} \|X - X B(Y)\|_F^2$$
Unsupervised Kernel Regression (2)
Closed-form solution, because

$$E_{\mathrm{lat}}(X) = \frac{1}{N} \operatorname{tr}\big( X (I_N - B(Y)) (I_N - B(Y))^T X^T \big) = \frac{1}{N} \operatorname{tr}(X Q X^T)$$

with $Q := (I_N - B(Y)) (I_N - B(Y))^T$.

We have to minimize a quadratic form in Q!

→ Constraints: $X \mathbf{1}_N = 0$ and $X X^T = I_q$; solution by EVD.
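A minimal sketch of the spectral solution with row-wise data (Y is N × d, the returned latents are N × q); it assumes the zero eigenvalue belonging to the constant direction is isolated:

```python
import numpy as np

def ukr2(Y, q, h=1.0):
    """Minimize tr(X Q X^T) subject to X 1_N = 0 and X X^T = I_q via EVD."""
    N = Y.shape[0]
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2.0 * h ** 2))
    B = K / K.sum(axis=1, keepdims=True)      # B(Y); rows sum to one
    M = np.eye(N) - B
    Q = M @ M.T
    H = np.eye(N) - np.ones((N, N)) / N       # centering enforces X 1_N = 0
    w, V = np.linalg.eigh(H @ Q @ H)          # eigenvalues in ascending order
    # skip the constant direction (eigenvalue 0), keep the q next-smallest
    return V[:, 1:q + 1]
```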
Unsupervised Kernel Regression (2)
After training:
• Learned X.
• Learned ’Backward mapping’ implicitly.
• ...but not f
Unsupervised Kernel Regression (2)
[Figure: UKR(2) solutions for 15 kernel bandwidths from h = 0.61 to h = 4.2444.]

Which is the correct kernel bandwidth!?
Discussion UKR(2)
• Learns only X and g
• ’Yet another Spectral Method’ - however, from a regression perspective!
• Generalizes to new, unseen (observable space) data
• Fast
• Can provide useful initializations...
Unsupervised Kernel Regression (2+1)
A way to combine the advantages of both methods:
Use UKR(2) as an initialization for UKR(1).
Problem: the UKR(2) solution would be destroyed after a few gradient steps.

Idea: minimize the UKR(1) error with respect to the scale of X only:

$$E(S) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j \neq i} \frac{K(S x_i, S x_j)}{\sum_{k \neq i} K(S x_i, S x_k)} \, y_j \Big\|^2$$

with S a diagonal scaling matrix.
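A sketch of this rescaling step, optimizing the q diagonal entries of S under the leave-one-out error ukr1_loo_error from earlier; initializing at S = I is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

def rescale_latents(X, Y):
    """Minimize E(S) over diag(S), keeping the UKR(2) latents X fixed."""
    q = X.shape[1]
    # scaling latent dimension m by s[m] is the same as using the latents X S
    res = minimize(lambda s: ukr1_loo_error(X * s, Y), np.ones(q))
    return X * res.x                           # the rescaled latents S x_i
```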
Now:
• Using SX, we obtain a ’complete model’.
• Can continue with UKR(1).
• Or just do model assessment.
Experiments
UKR(2) vs. LLE
Use UKR(2) and LLE as initialization. Then determine reconstruction error.
Performance on toy data:
                     halfcircle         scurve            spiral
σ²                   0.0      0.2       0.0      0.5      0.0      0.05
LLE + rescaling      0.00060  0.0735    0.3279   0.4677   0.2364   0.2453
UKR(2) + rescaling   0.00035  0.0472    0.1107   0.1911   0.1971   0.2057
LLE + UKR(1)         0.00059  0.0300    0.0790   0.0725   0.0582   0.0400
UKR(2) + UKR(1)      0.00035  0.0218    0.0481   0.0685   0.0319   0.0369
Experiments
UKR vs. PPS and GTM: (Chang and Ghosh, 2001)
Approximation of UCI data:

                     iris              glass             diabetes
q                    1       2         1       2         1       2
GTM                  2.7020  0.9601    8.0681  1.9634    6.7100  1.8822
PPS                  2.5786  0.8757    7.9465  1.8616    6.5509  1.8202
UKR(2)               2.1164  1.2667    7.6977  6.1930    6.4130  4.9041
UKR(2) + UKR(1)      1.2017  0.5961    5.8665  4.3678    5.8018  3.9466
UKR(1) Homotopy      0.9414  0.5651    4.7515  3.6402    6.2488  3.9619
Experiments
Visualization: MNIST-Digits
[Figure: two-dimensional latent-space visualization of the MNIST digits.]
Experiments
The duck from the beginning in 3d:

[Figure: three-dimensional plot of the duck data set.]
Experiments
Interpolation in latent space
vs.
interpolation in observable space
[Figure: interpolation in latent space vs. in observable space.]
Experiments
Introducing prior information:
Let the user determine what ends up in which dimension.

Approach: for different properties, sort the data into pre-defined equivalence classes.

In the (latent) dimensions that shall reflect some specific property, penalize the distance between objects that belong to the same class (with respect to this property).
Experiments
The (2-dimensional) PCA representation of the ’olivettifaces’ data looks like:

[Figure: 2-D PCA scatter plot of the Olivetti faces, with each point labeled by subject.]
Experiments
Considering the two properties ’identity’ and ’wears glasses’, the latent representation might look like:

[Figure: learned latent representation; horizontal axis ’glasses vs. non-glasses’, vertical axis ’identity’, points labeled by subject.]
Experiments
Now we can give people glasses:
’Bridge the gap’
[Figure: two-dimensional toy data set.]

Clustering or Dimensionality Reduction?!
’Bridge the gap’
• Recall: UKR learns a latent-space density, not just latent representatives and not just a principal manifold!

• We can use the latent density as a constraint at test time:
’Bridge the gap’
[Figure: three panels illustrating the effect of the latent density constraint on the toy data.]
Things Not Covered/Future Work:
• (Mercer-)Kernel Feature Space Variant
• Approximate (whole) data set using a reduced set of prototypes
• Condensing
• ’Two data spaces’, CCA
• Matlab Toolbox