
Unsupervised Kernel Regression

Roland Memisevic

February 9, 2004

Joint work with Peter Meinicke and Stefan Klanke

Unsupervised Kernel Regression


1

Unsupervised Kernel Regression

Unsupervised Learning as Generalized Regression

y = f∗(x; θ) + u, E(u) = 0

• Now call x ’latent’

• Need to solve for both x and θ.

• → iterate:

– Projection - Regression

– EM (for x a true random variable)

• Most often we choose: f(x) = Wb(x)

2

Unsupervised Kernel Regression

Idea: Use Non-parametric Regression Function

Consider:

f(x) = E(y|x) = ∫ y p(x, y) dy / ∫ p(x, y) dy

Approximate p(x, y) with Kernel Density Estimate p̂(x, y):

p̂(x, y) = (1/N) ∑_{j=1}^{N} Kq(x, xj) Kd(y, yj),

for some (isotropic, ...) kernel functions Kq and Kd.
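As a side note, a minimal numpy sketch of such a joint kernel density estimate (Gaussian kernels; the bandwidths hq, hd and the row-wise array layout are illustrative assumptions, not from the talk):

import numpy as np

def khat(a, b, h):
    # un-normalized Gaussian kernel between all rows of a and all rows of b
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def p_hat(x, y, X, Y, hq=1.0, hd=1.0):
    # p_hat(x, y) = (1/N) * sum_j Kq(x, x_j) * Kd(y, y_j)
    # x: (M, q) query latent points, y: (M, d) query data points
    # X: (N, q) latent sample,       Y: (N, d) data sample
    return (khat(x, X, hq) * khat(y, Y, hd)).mean(axis=1)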

Then:

3

Unsupervised Kernel Regression

Nadaraya-Watson Estimator

f(x) = ∑_{j=1}^{N} [ K(x, xj) / ∑_{k=1}^{N} K(x, xk) ] yj =: Y b(x)

with (un-normalized) kernel functions K(·, ·), e.g. RBF:

K(xi, xj) = exp( −‖xi − xj‖² / (2 hl²) )

Note: model complexity ↔ kernel bandwidth hl
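A minimal numpy sketch of the estimator (RBF kernel with bandwidth h; storing points as rows of the arrays is an assumed convention):

import numpy as np

def nadaraya_watson(x, X, Y, h=1.0):
    # f(x) = sum_j [ K(x, x_j) / sum_k K(x, x_k) ] y_j
    d2 = ((x[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # (M, N) squared distances
    K = np.exp(-d2 / (2.0 * h ** 2))                       # RBF kernel values
    B = K / K.sum(axis=1, keepdims=True)                   # rows of b(x), summing to 1
    return B @ Y                                           # (M, d) predictions

Each prediction is a convex combination of the training outputs; shrinking h makes the weights more local and the model more complex.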

Now call x latent...

4

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

...and minimize the data-space reconstruction error:

Eobs(X) = (1/N) ∑_{i=1}^{N} ‖yi − f(xi)‖²

        = (1/N) ∑_{i=1}^{N} ‖yi − ∑_{j=1}^{N} [ K(xi, xj) / ∑_{k=1}^{N} K(xi, xk) ] yj‖²

Set hl = 1.0

model complexity ↔ scale of X
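A sketch of this objective with hl fixed to 1 (latent points stored as rows of X, an assumed convention):

import numpy as np

def e_obs(X, Y):
    # E_obs(X) = (1/N) * sum_i || y_i - sum_j [ K(x_i, x_j) / sum_k K(x_i, x_k) ] y_j ||^2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise latent distances
    K = np.exp(-0.5 * d2)                                  # RBF kernel, h = 1
    B = K / K.sum(axis=1, keepdims=True)                   # row-stochastic weights
    return ((Y - B @ Y) ** 2).sum() / len(Y)               # mean reconstruction error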

5

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

Learning with gradient descent.

Complexity: O(N²dq) for both E and ∂E/∂X

6

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

Regularization (a): ’Weight Decay’:

instead of Eobs minimize

EP(X) = Eobs(X) + λ P(X),

for some penalty P(X), e.g.

P(X) = ‖X‖²_F,   or   P(X) = −∑_{i=1}^{N} p̂(xi)

X̂ = arg min_X EP(X)

7

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

Regularization (b): Constrain the solution:

X̂ = arg min_X Eobs(X),

subject to c(X) ≤ 0

... or subject to bound constraints.

8

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

Regularization (c): Crossvalidation

Use the modified objective function:

Ecv = (1/N) ∑_{i=1}^{N} ‖yi − f−i(xi)‖²

    = (1/N) ∑_{i=1}^{N} ‖yi − ∑_{j≠i} [ K(xi, xj) / ∑_{k≠i} K(xi, xk) ] yj‖²

'Built-in' Leave-One-Out Cross-Validation!
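In code, the leave-one-out form amounts to zeroing the diagonal of the kernel matrix before normalizing (a sketch, same assumed conventions as above):

import numpy as np

def e_cv(X, Y):
    # E_cv(X): point i is reconstructed without using its own kernel weight
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2)
    np.fill_diagonal(K, 0.0)                    # exclude the j = i term
    B = K / K.sum(axis=1, keepdims=True)        # normalize over j != i
    return ((Y - B @ Y) ** 2).sum() / len(Y)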

9

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

To avoid local minima use ’Deterministic Annealing’:

Begin with strong regularization.

Then gradually anneal it away.

→ Applies to regularization variants (a) and (b).

10

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

After training:

• Learned X.

• Learned ’Forward mapping’ implicitly.

• Define ’Backward mapping’ by orthogonal projection:

g(y) := arg min_x ‖y − f(x)‖²

(Initialize with the 'nearest reconstruction':

x0 := arg min_{xi} ‖y − f(xi)‖², i = 1, ..., N)
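A rough sketch of this projection using a general-purpose optimizer (scipy's minimize is just one possible choice; hl = 1 as above):

import numpy as np
from scipy.optimize import minimize

def backward_map(y, X, Y):
    # g(y) = arg min_x || y - f(x) ||^2, with f the learned Nadaraya-Watson mapping
    def f(x):
        b = np.exp(-0.5 * ((X - x) ** 2).sum(axis=1))
        return (b / b.sum()) @ Y
    # initialize with the 'nearest reconstruction' among the training latent points
    recon = np.array([f(xi) for xi in X])
    x0 = X[np.argmin(((recon - y) ** 2).sum(axis=1))]
    return minimize(lambda x: ((y - f(x)) ** 2).sum(), x0).x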

11

Unsupervised Kernel Regression

Unsupervised Kernel Regression (1)

[Figure: toy-data annealing sequence; five panels with error curves and solutions, labelled η = 0.8, 0.82, 0.84, 0.87, 0.820]

12

Unsupervised Kernel Regression

Discussion UKR(1)

• Solves the NLDR problem 'completely'

• Principled

• ’Built in’ regularizer

• Arbitrary latent space dimensionalities

• Latent space density (Consider non-uniformly distributed data!)

• Introduction of prior knowledge

13

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2)

Idea:

Turn Nadaraya-Watson Estimator upside down...

Instead of E(y|x) estimate E(x|y)

g(y) := ∑_{j=1}^{N} [ K(y, yj) / ∑_{k=1}^{N} K(y, yk) ] xj

Justification? E.g. ’Noisy Interpolation’ (Webb, 1994).

14

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2)

→ Objective function now:

Elat(X) = (1/N) ∑_{i=1}^{N} ‖xi − g(yi)‖²,

        = (1/N) ∑_{i=1}^{N} ‖xi − ∑_{j=1}^{N} [ K(yi, yj) / ∑_{k=1}^{N} K(yi, yk) ] xj‖²,

        =: (1/N) ‖X − X B(Y)‖²_F

15

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2)

Closed-form solution, because

Elat(X) = (1/N) tr( X (I_N − B(Y)) (I_N − B(Y))^T X^T )

        = (1/N) tr( X Q X^T ),   with Q := (I_N − B(Y)) (I_N − B(Y))^T

→ We have to minimize a quadratic form in X!

→ Constraints: X 1_N = 0 and X X^T = I_q; solution by EVD.
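A sketch of this spectral solution (the handling of the centering constraint follows the usual LLE-style convention of dropping the constant eigenvector, which may need extra care for degenerate spectra):

import numpy as np

def ukr2(Y, q, h=1.0):
    # latent coordinates X (q x N) from the eigen-decomposition of Q
    N = len(Y)
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * h ** 2))
    B = K / K.sum(axis=1, keepdims=True)            # B(Y), row-stochastic
    M = np.eye(N) - B
    Q = M @ M.T                                      # (I - B)(I - B)^T
    H = np.eye(N) - np.ones((N, N)) / N              # centering projector enforces X 1_N = 0
    w, V = np.linalg.eigh(H @ Q @ H)                 # ascending eigenvalues
    # drop the constant eigenvector (eigenvalue 0), keep the next q directions
    return V[:, 1:q + 1].T                           # rows are orthonormal: X X^T = I_q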

16

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2)

After training:

• Learned X.

• Learned ’Backward mapping’ implicitly.

• ...but not f

17

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2)

18

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2)

[Figure: UKR(2) solutions for increasing kernel bandwidths, panels labelled h = 0.61, 0.8696, 1.1292, ..., 4.2444]

Which is the correct kernel bandwidth!?

19

Unsupervised Kernel Regression

Discussion UKR(2)

• Learns only X and g

• ’Yet another Spectral Method’ - however, from a regression perspective!

• Generalizes to new, unseen (observable space) data

• Fast

• Can provide useful initializations...

20

Unsupervised Kernel Regression

Unsupervised Kernel Regression (2+1)

A way to combine the advantages of both methods:

Use UKR(2) as an initialization for UKR(1).

Problem: UKR(2) solution will be destroyed after a few gradient steps.

Idea: minimize UKR(1)-error with respect to scale of X:

E(S) = ∑_{i=1}^{N} ‖yi − ∑_{j≠i} [ K(S xi, S xj) / ∑_{k≠i} K(S xi, S xk) ] yj‖²,

with S a diagonal scaling matrix.
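A sketch of optimizing only this diagonal scaling (a general-purpose optimizer such as scipy's minimize is one possible choice; latent points as rows is an assumed convention):

import numpy as np
from scipy.optimize import minimize

def rescale(X, Y):
    # find a diagonal S so that S X minimizes the UKR(1) leave-one-out error
    N, q = X.shape
    def e(s):
        Z = X * s                                   # same as applying S = diag(s)
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * d2)
        np.fill_diagonal(K, 0.0)                    # j != i sums, as in E(S) above
        B = K / K.sum(axis=1, keepdims=True)
        return ((Y - B @ Y) ** 2).sum()
    s = minimize(e, np.ones(q)).x
    return X * s                                    # the rescaled 'complete model'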

21

Unsupervised Kernel Regression

Now:

• Using SX, we obtain a ’complete model’.

• Can continue with UKR(1).

• Or just do model assessment.

22

Unsupervised Kernel Regression

Experiments

UKR(2) vs. LLE

Use UKR(2) and LLE as initializations, then determine the reconstruction error.

Performance on toy data:

                     halfcircle          scurve              spiral
σ²                   0.0      0.2        0.0      0.5        0.0      0.05
LLE + rescaling      0.00060  0.0735     0.3279   0.4677     0.2364   0.2453
UKR(2) + rescaling   0.00035  0.0472     0.1107   0.1911     0.1971   0.2057
LLE + UKR(1)         0.00059  0.0300     0.0790   0.0725     0.0582   0.0400
UKR(2) + UKR(1)      0.00035  0.0218     0.0481   0.0685     0.0319   0.0369

23

Unsupervised Kernel Regression

Experiments

UKR vs. PPS and GTM: (Chang and Ghosh, 2001)

Approximation of UCI data:

                     iris                glass               diabetes
q                    1        2          1        2          1        2
GTM                  2.7020   0.9601     8.0681   1.9634     6.7100   1.8822
PPS                  2.5786   0.8757     7.9465   1.8616     6.5509   1.8202
UKR(2)               2.1164   1.2667     7.6977   6.1930     6.4130   4.9041
UKR(2) + UKR(1)      1.2017   0.5961     5.8665   4.3678     5.8018   3.9466
UKR(1) Homotopy      0.9414   0.5651     4.7515   3.6402     6.2488   3.9619

24

Unsupervised Kernel Regression

Experiments

Visualization: MNIST-Digits

25

Unsupervised Kernel Regression

[Figure: 2-D visualization of the MNIST digits]

26

Unsupervised Kernel Regression

Experiments

The duck from the beginning in 3d:

[Figure: 3-D plot of the duck data]

27

Unsupervised Kernel Regression

Experiments

Interpolation in Latent Space vs. Interpolation in Observable Space

28

Unsupervised Kernel Regression

Experiments

29

Unsupervised Kernel Regression

Experiments

Introducing prior information:

Let the user determine what should end up in which dimension.

Approach: for each property, sort the data into pre-defined equivalence classes.

In the (latent) dimensions that shall reflect a specific property, penalize the distance between objects that belong to the same class (with respect to this property).
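One way such a penalty could be written is sketched below (the class labels, the chosen latent dimensions, and the weight lam are illustrative assumptions, not specifics from the talk):

import numpy as np

def class_penalty(X, labels, dims):
    # in the selected latent dimensions, penalize squared distances between
    # latent points whose objects share the same class for the given property
    Z = X[:, dims]
    p = 0.0
    for c in np.unique(labels):
        Zc = Z[labels == c]
        p += ((Zc[:, None, :] - Zc[None, :, :]) ** 2).sum()
    return p

# used as an additive term, e.g.  E(X) = E_obs(X) + lam * class_penalty(X, labels, dims)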

30

Unsupervised Kernel Regression

Experiments

The (2-dimensional) PCA representation of the 'olivettifaces' data looks like:

[Figure: scatter plot of the 2-D PCA projection of the Olivetti faces, points labelled by subject identity]

31

Unsupervised Kernel Regression

Experiments

Considering the two properties 'identity' and 'wears glasses', the latent representation might look like:

[Figure: latent representation of the Olivetti faces; one axis separates 'glasses vs. non-glasses', the other reflects 'identity']

32

Unsupervised Kernel Regression

Experiments

Now we can give people glasses:

33

Unsupervised Kernel Regression

’Bridge the gap’


Clustering or Dimensionality Reduction?!

34

Unsupervised Kernel Regression

’Bridge the gap’

• Recall: UKR learns a latent space density, not just latent representatives and not just a principal manifold!

• We can use the latent density as a constraint at test time:

35

Unsupervised Kernel Regression

’Bridge the gap’


36

Unsupervised Kernel Regression

Things Not Covered/Future Work:

• (Mercer-)Kernel Feature Space Variant

• Approximate (whole) data set using a reduced set of prototypes

• Condensing

• ’Two data spaces’, CCA

• Matlab Toolbox

37
