
Page 1

Manifold Learning: From Linear To Nonlinear

Wei-Lun (Harry) Chao

Date: April 7, 2011

1

Page 2

Outline

• Notation and fundamentals of linear algebra

• PCA and LDA

• Topology, manifold, and embedding

• MDS

• ISOMAP

• LLE

• Laplacian eigenmap

2

Page 3

Reference

• [1] J. A. Lee et al., Nonlinear Dimensionality Reduction, 2007

• [2] R. O. Duda et al., Pattern Classification, 2001

• [3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997

• [4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000

• [5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000

• [6] L. K. Saul et al., Think globally, fit locally, 2003

• [7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003

• [8] T. F. Cootes et al., Active appearance models, 1998

3

Page 4

Notation

• Data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$;  low-D: $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N},\ \mathbf{y}^{(n)} \in R^p$

• Matrix:

$A_{n \times m} = [\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(m)}] = [a_{ij}]_{1 \le i \le n,\ 1 \le j \le m}$

• Vector: (column vector)

$\mathbf{a}^{(i)} = [a_1^{(i)}, a_2^{(i)}, \ldots, a_n^{(i)}]^T$

• Matrix form of data set:

$X_{d \times N} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}]$

4

Page 5

Fundamentals of Linear Algebra

• SVD (singular value decomposition):

$X_{d \times N} = V_{d \times d}\, \Sigma_{d \times N}\, U_{N \times N}^T$, where

$V = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(d)}]$,  $U = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(N)}]$,

$\Sigma_{d \times N}$ is zero off the main diagonal, with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d \ge 0$ on the diagonal,

$V^T V = V V^T = I_{d \times d}$,  $U^T U = U U^T = I_{N \times N}$

5

Page 6

Fundamentals of Linear Algebra

• EVD (eigenvector decomposition):

$A_{N \times N} V = V \Lambda$, with $V = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(N)}]$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N)$, so $A = V \Lambda V^{-1}$

• SVD vs. EVD (symmetric positive semi-definite):

$A = X X^T = V \Sigma U^T U \Sigma^T V^T = V (\Sigma \Sigma^T) V^T$, so $A V = V (\Sigma \Sigma^T)$: the eigenvectors of $A$ are the left singular vectors of $X$, and the eigenvalues are the squared singular values

• Caution: Eigenvectors are not always orthogonal!

6
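To make the SVD–EVD relation concrete, here is a minimal NumPy sketch (not part of the original slides; the matrix sizes are illustrative) that checks numerically that the eigenvectors of $A = XX^T$ are the left singular vectors of $X$ and that the eigenvalues are the squared singular values.

import numpy as np

# Minimal sketch: verify the SVD/EVD relation for A = X X^T.
rng = np.random.default_rng(0)
d, N = 5, 20
X = rng.standard_normal((d, N))

# SVD of the data matrix: X = V Sigma U^T (NumPy's "u" is the slides' V)
V, sigma, Ut = np.linalg.svd(X, full_matrices=False)

# EVD of the symmetric PSD matrix A = X X^T (eigh returns ascending order)
A = X @ X.T
eigvals, eigvecs = np.linalg.eigh(A)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

print(np.allclose(eigvals, sigma ** 2))              # eigenvalues = sigma^2
print(np.allclose(np.abs(eigvecs), np.abs(V)))       # eigenvectors = V up to sign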

Page 7

Fundamentals of Linear Algebra

• Determinant:

$\det(A_{N \times N}) = \prod_{n=1}^{N} \lambda_n$

• Trace:

$\mathrm{tr}(A_{N \times N}) = \sum_{n=1}^{N} \mathrm{diagonal}(A)_n$,  $\mathrm{tr}(A_{d \times N} B_{N \times d}) = \mathrm{tr}(B_{N \times d} A_{d \times N})$

• Rank:

$\mathrm{rank}(A) = \mathrm{rank}(V \Sigma U^T)$ = # nonzero diagonal elements of $\Sigma$ = # independent columns of $A$ = # nonzero eigenvalues (square $A$)

7

Page 8

Dimensionality reduction

• Operation:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$  →  low-D: $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N},\ \mathbf{y}^{(n)} \in R^p$  $(p \ll d)$

• Reason:

Compression

Knowledge discovery or feature extraction

Irrelevant and noise feature removal

Visualization

Curse of dimensionality

8

Page 9

Dimensionality reduction

• Methods:

Feature transform: $f: R^d \to R^p,\ \mathbf{y} = f(\mathbf{x}),\ p \le d$;  linear form: $\mathbf{y} = W_{p \times d}\, \mathbf{x}$

Feature selection: $\mathbf{y} = [x_{s(1)}, x_{s(2)}, \ldots, x_{s(p)}]^T$, where $s$ denotes the selection indices

• Criterion:

Preserve some properties or structures of the high-D feature space in the low-D feature space

These properties are measured from data

9

Page 10

Principal Component Analysis (PCA)

10

[1] J. A. Lee et al., Nonlinear Dimensionality Reduction, 2007
[2] R. O. Duda et al., Pattern Classification, 2001

Page 11

Principal component analysis (PCA)

• PCA is basic yet important and useful:

Easy to train and use

Lots of additional functionalities: noise reduction, ellipse fitting, singular matrix inversion, ……

• Also called Karhunen-Loeve transform (KL transform)

• Criteria:

Maximum variance with decorrelation

Minimum reconstruction error

11

Page 12

Principal component analysis (PCA)

• Surprising usage: face recognition and encoding

[Figure: a face image encoded as a weighted combination of eigenfaces, with coefficients −2181, +627, +389, …]

12

Page 13

Principal component analysis (PCA)

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$

• Preprocessing: centering (mean can be added back)

$\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)}$,  $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\,\mathbf{1}_N^T$

• Model:

$\mathbf{y} = W^T \mathbf{x}$, where $\mathbf{y} \in R^p$ and $W$ is $d \times p$ (orthonormal: $W^T W = I_{p \times p}$);  reconstruction: $\hat{\mathbf{x}} = W \mathbf{y} = W W^T \mathbf{x}$

13

Page 14

Maximum variance with decorrelation

• The low-D feature vectors should be decorrelated

• Covariance:

$\mathrm{cov}(x_1, x_2) = E[(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)] = \frac{1}{N}\sum_{n=1}^{N}(x_1^{(n)} - \bar{x}_1)(x_2^{(n)} - \bar{x}_2)$

• Covariance matrix:

$C_{xx} = [\mathrm{cov}(x_i, x_j)]_{d \times d} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}^{(n)} - \bar{\mathbf{x}})(\mathbf{x}^{(n)} - \bar{\mathbf{x}})^T = \frac{1}{N} X X^T$ (after centering)

14

Page 15

Maximum variance with decorrelation

• Covariance matrix

$\mathrm{cov}(x_1, x_2) = E[(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)] = \frac{1}{N}\sum_{n=1}^{N}(x_1^{(n)} - \bar{x}_1)(x_2^{(n)} - \bar{x}_2)$

15

Page 16

Maximum variance with decorrelation

• Decorrelation:

$C_{yy} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{y}^{(n)} \mathbf{y}^{(n)T} = \frac{1}{N}\sum_{n=1}^{N} W^T \mathbf{x}^{(n)} \mathbf{x}^{(n)T} W = W^T C_{xx} W$ should be a diagonal matrix:

$C_{yy} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$

16

Page 17

Maximum variance with decorrelation

• Maximum variance

$W^* = \arg\max_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{y}^{(n)}\|^2 = \arg\max_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \|W^T \mathbf{x}^{(n)}\|^2$

$= \arg\max_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} (W^T \mathbf{x}^{(n)})^T (W^T \mathbf{x}^{(n)}) = \arg\max_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \mathrm{tr}\{W^T \mathbf{x}^{(n)} \mathbf{x}^{(n)T} W\}$

(a 1 × 1 scalar equals its own trace, which after reordering becomes a p × p trace)

$= \arg\max_{W^T W = I} \mathrm{tr}\Big\{W^T \Big(\frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)} \mathbf{x}^{(n)T}\Big) W\Big\} = \arg\max_{W^T W = I} \mathrm{tr}\{W^T C_{xx} W\}$

17

Page 18

Maximum variance with decorrelation

• Optimization:

$W^* = \arg\max_{W^T W = I} \mathrm{tr}\{W^T C_{xx} W\}$ subject to $C_{yy} = W^T C_{xx} W$ is diagonal

• Solution:

$W^* = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$, where $\mathbf{v}^{(i)}$ is the eigenvector of $C_{xx} \in R^{d \times d}$ with the $i$-th largest eigenvalue

$C_{xx} = \frac{1}{N} X X^T = \frac{1}{N} V \Sigma U^T U \Sigma^T V^T = V \big(\frac{1}{N} \Sigma \Sigma^T\big) V^T$,  $W^{*T} W^* = I_{p \times p}$

18

Page 19

Maximum variance with decorrelation

• Proof:

Assume $p$ is 1, so the objective is $\mathbf{w}^T C_{xx} \mathbf{w}$ with the constraint $\mathbf{w}^T \mathbf{w} = 1$:

$E(\mathbf{w}, \lambda) = \mathbf{w}^T C_{xx} \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1)$

Take the partial derivative and set it to 0:

$\frac{\partial E}{\partial \mathbf{w}} = 2 C_{xx} \mathbf{w} - 2 \lambda \mathbf{w} = 0 \;\Rightarrow\; C_{xx} \mathbf{w} = \lambda \mathbf{w}$, so $\mathbf{w}$ is an eigenvector of $C_{xx}$

Then $\mathbf{w}^T C_{xx} \mathbf{w} = \lambda\, \mathbf{w}^T \mathbf{w} = \lambda$, so the maximum is attained when $\mathbf{w}$ is the eigenvector with the largest eigenvalue

19

Page 20

Minimum reconstruction error

• Mean square error is preferred:

$W^* = \arg\min_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{x}^{(n)} - \hat{\mathbf{x}}^{(n)}\|^2 = \arg\min_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{x}^{(n)} - W W^T \mathbf{x}^{(n)}\|^2$

$= \arg\min_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \big((I - W W^T)\mathbf{x}^{(n)}\big)^T \big((I - W W^T)\mathbf{x}^{(n)}\big)$

$= \arg\min_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)T} (I - 2 W W^T + W W^T W W^T)\, \mathbf{x}^{(n)} = \arg\min_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)T} (I - W W^T)\, \mathbf{x}^{(n)}$

$= \arg\max_{W^T W = I} \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)T} W W^T \mathbf{x}^{(n)} = \arg\max_{W^T W = I} \mathrm{tr}\{W^T C_{xx} W\}$

the same optimization problem as maximum variance with decorrelation

20

Page 21

Algorithm

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$

• Preprocessing: centering (mean can be added back)

$\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)}$,  $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\,\mathbf{1}_N^T$

• Model:

$\mathbf{z} = W^T \mathbf{x}$, where $\mathbf{z} \in R^p$ and $W$ is $d \times p$ (orthonormal: $W^T W = I_{p \times p}$);  reconstruction: $\hat{\mathbf{x}} = W \mathbf{z} = W W^T \mathbf{x}$

21

Page 22

Algorithm

• Algorithm 1: (EVD)

1. $C_{xx} = V \Lambda V^{-1}$, where the $\lambda_i$ in $\Lambda$ are in descending order

2. $W = V \begin{bmatrix} I_{p \times p} \\ O_{(d-p) \times p} \end{bmatrix} = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$

• Algorithm 2: (SVD)

1. $X = V \Sigma U^T$, where the $\sigma_i$ in $\Sigma$ are in descending order

2. $W = V \begin{bmatrix} I_{p \times p} \\ O_{(d-p) \times p} \end{bmatrix} = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$

22
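As a concrete reading of Algorithms 1 and 2, the following is a minimal NumPy sketch (the helper name pca and the toy data sizes are illustrative, not from the slides) that computes W both from the EVD of C_xx and from the SVD of the centered data matrix; the two routes span the same subspace up to sign.

import numpy as np

def pca(X, p):
    """Minimal PCA sketch following Algorithms 1 and 2 above.
    X is d x N (columns are samples); returns W (d x p) and Z (p x N)."""
    Xc = X - X.mean(axis=1, keepdims=True)        # centering

    # Algorithm 1: EVD of the covariance matrix C_xx = Xc Xc^T / N
    C = Xc @ Xc.T / Xc.shape[1]
    eigvals, V = np.linalg.eigh(C)                # ascending order
    W_evd = V[:, ::-1][:, :p]                     # p largest eigenvectors

    # Algorithm 2: SVD of the centered data matrix (NumPy's "u" is the slides' V)
    V2, _, _ = np.linalg.svd(Xc, full_matrices=False)
    W_svd = V2[:, :p]                             # same subspace, up to sign

    W = W_svd
    Z = W.T @ Xc                                  # low-D features z = W^T x
    return W, Z

# Example: project 100 random 5-D points onto 2 principal components.
X = np.random.default_rng(0).standard_normal((5, 100))
W, Z = pca(X, 2)
print(W.shape, Z.shape)   # (5, 2) (2, 100)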

Page 23

Illustration

• What is PCA doing:

23


Page 24

Summary

• PCA exploits 2nd-order statistical properties measured from data (simple and hard to over-fit)

• Usually used as a “preprocessing step” in applications

• Rank:

$C_{xx} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}^{(n)} - \bar{\mathbf{x}})(\mathbf{x}^{(n)} - \bar{\mathbf{x}})^T = \frac{1}{N} X X^T$

After centering, $\mathrm{rank}(C_{xx}) \le N - 1$, so in general $p \le N - 1$

24

Page 25

Examples

• Active appearance model [8]: $\mathbf{z} = W^T \mathbf{x}$,  $\hat{\mathbf{x}} = W \mathbf{z} = W W^T \mathbf{x}$

25

[8]

Page 26

Linear Discriminant Analysis (LDA)

26

[2] R. O. Duda et al., Pattern Classification, 2001
[3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997

Page 27

Linear discriminant analysis (LDA)

• PCA is unsupervised

• LDA takes the label information into consideration

• Achieved low-D features are efficient for discrimination

27

Page 28

Linear discriminant analysis (LDA)

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$, with labels $\mathrm{label}(\mathbf{x}^{(n)}) \in L = \{l_1, l_2, \ldots, l_c\}$

• Model:

$\mathbf{y} = W^T \mathbf{x}$, where $\mathbf{y} \in R^p$ and $W$ is $d \times p$

• Notation:

$X_i = \{\mathbf{x}^{(n)} \in X \mid \mathrm{label}(\mathbf{x}^{(n)}) = l_i\}$,  $N_i$ = # samples in class $i$

class mean: $\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}^{(n)} \in X_i} \mathbf{x}^{(n)}$,  total mean: $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)}$

between-class scatter: $S_B = \sum_{i=1}^{c} N_i (\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^T$

within-class scatter: $S_W = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_i} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)^T$

28

Page 29

Linear discriminant analysis (LDA)

• Properties of the scatter matrices:

$S_B = \sum_{i=1}^{c} N_i (\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^T$ measures the inter-class separation

$S_W = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_i} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)^T$ measures the intra-class tightness

• Scatter matrices in low-D:

between-class: $\tilde{S}_B = \sum_{i=1}^{c} N_i (W^T\boldsymbol{\mu}_i - W^T\boldsymbol{\mu})(W^T\boldsymbol{\mu}_i - W^T\boldsymbol{\mu})^T = W^T S_B W$

within-class: $\tilde{S}_W = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_i} (W^T\mathbf{x}^{(n)} - W^T\boldsymbol{\mu}_i)(W^T\mathbf{x}^{(n)} - W^T\boldsymbol{\mu}_i)^T = W^T S_W W$

29

Page 30

Linear discriminant analysis (LDA)

30

Page 31

Criterion and algorithm

• Criterion of LDA:

Maximize the ratio of $W^T S_B W$ to $W^T S_W W$ “in some sense”

• Determinant and trace are suitable scalar measures:

$W^* = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|}$

• With the Rayleigh quotient:

If $S_B$ and $S_W$ are both symmetric positive semi-definite and $S_W$ is nonsingular,

$W^* = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$, where $S_B \mathbf{v}^{(i)} = \lambda_i S_W \mathbf{v}^{(i)}$ and the generalized eigenvalues $\lambda_i$ are in descending order

31
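A minimal sketch of the Rayleigh-quotient solution, assuming NumPy/SciPy and a nonsingular S_W (the singular case is treated two slides later); scipy.linalg.eigh(Sb, Sw) solves the generalized eigenproblem S_B v = λ S_W v. The function name lda and the toy data are illustrative only.

import numpy as np
from scipy.linalg import eigh

def lda(X, labels, p):
    """Minimal LDA sketch: X is d x N, labels has length N; returns W (d x p).
    Assumes S_W is nonsingular."""
    d, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for l in np.unique(labels):
        Xi = X[:, labels == l]
        Ni = Xi.shape[1]
        mui = Xi.mean(axis=1, keepdims=True)
        Sb += Ni * (mui - mu) @ (mui - mu).T
        Sw += (Xi - mui) @ (Xi - mui).T
    # Generalized eigenproblem S_B v = lambda S_W v (eigh returns ascending order)
    eigvals, V = eigh(Sb, Sw)
    return V[:, ::-1][:, :p]          # eigenvectors with the largest eigenvalues

# Example: two 3-D Gaussian classes projected onto one discriminant direction
# (note p <= c - 1 = 1 here, consistent with the next slide).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 1, (3, 50)), rng.normal(3, 1, (3, 50))])
labels = np.array([0] * 50 + [1] * 50)
W = lda(X, labels, 1)
print(W.shape)   # (3, 1)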

Page 32

Note and Problem

• Note:

$S_B \mathbf{v}^{(i)} = \lambda_i S_W \mathbf{v}^{(i)} \;\Rightarrow\; S_W^{-1} S_B \mathbf{v}^{(i)} = \lambda_i \mathbf{v}^{(i)}$

$\mathrm{rank}(S_B) \le c - 1$, so at most $c - 1$ of the $\lambda_i$ are nonzero, and hence $p \le c - 1$

• Problem:

$\mathrm{rank}(S_W) \le N - c$, and $S_W$ is $d \times d$

If $\mathrm{rank}(S_W) < d$, $S_W$ is singular and the Rayleigh quotient is useless

32

Page 33

Solution

• Problem:

$S_W = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_i} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)^T$ is singular

• Solution:

PCA+LDA:

1. Perform PCA on $X$: $W_{PCA} \in R^{d \times (N-c)}$, $\mathbf{x}' = W_{PCA}^T \mathbf{x}$

2. Compute $S_W'$ (now $(N-c) \times (N-c)$); if it is nonsingular, the problem is solved

3. For new samples, $\mathbf{y} = W_{LDA}^T W_{PCA}^T \mathbf{x}$

Null-space:

1. $W^* = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|}$: find $W$ that makes $W^T S_W W = 0$

2. Extract the columns of $W^*$ from the null space of $S_W$

33

Page 34

Example

[Figure from [3]: Eigenfaces vs. Fisherfaces face recognition example]

34

Page 35

Topology, Manifold, and Embedding

35

[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007

Page 36

Topology

• Geometrical point of view

If two or more features are dependent, their joint distribution does not span the whole feature space

The dependence induces some structure (object) in the feature space

$(x_1, x_2) = g(s),\ a \le s \le b$

[Figure: a curve traced out by $g(s)$ in the 2-D feature space, from $g(a)$ to $g(b)$]

36

Page 37

Topology

• Topology:

Allowed: Deformation, twisting, and stretching

Not allowed: Tearing

A topological object means properties and structures

A topological object is represented (embedded) as a spatial object in the feature space

Topology abstracts the intrinsic structure, but ignores the details of the spatial object

Ex: circle and ellipse are topologically homeomorphic

37

Page 38

Manifold

• Geometrical space: dimensionality + structure

• Neighborhood:

$Nei(\mathbf{x}) = \mathrm{ball}\ B_\varepsilon(\mathbf{x}) = \{\mathbf{x}^{(i)} \mid \|\mathbf{x} - \mathbf{x}^{(i)}\|_{L_2} \le \varepsilon\} \subset R^d$

• A topological space can be characterized by neighborhoods

• A manifold is a locally Euclidean topological space

• Euclidean space:

$dis(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = \|\mathbf{x}^{(1)} - \mathbf{x}^{(2)}\|_{L_2}$ is meaningful

• In general, any spatial object that is nearly flat at small scale is a manifold

38

Page 39

Manifold

[Figure: Swiss-roll manifold; 3D + non-Euclidean (ambient space) vs. 2D + Euclidean (manifold coordinates)]

39

[5]

Page 40

Embedding

• Embedding:

Embedding is a representation of a topological object (e.g. a manifold or a graph) in a certain space, in such a way that its topological properties are preserved

A smooth manifold is differentiable and has a “functional structure” linking the features with the latent variables

The dimensionality of a manifold is the number of latent variables

A k-manifold can be embedded in any d-dimensional space with d equal to or larger than 2k+1

40

Page 41

Manifold learning

• Manifold learning:

Re-embed a k-manifold in d-dimensional space into a p-dimensional space with p < d

In practice, the structure of the manifold can only be measured from the underlying data.

• Importance:

The space property of the original high-D space is measured by data, while the space property of the target low-D space can be defined by users!

41

Page 42

Example

$(x_1, x_2, x_3) = g_1(s),\ a \le s \le b$  →  re-embedding  →  $(x_1, x_2) = g_2(s),\ a \le s \le b$

[Figure: a 1-D curve $g_1(s)$ in 3-D, from $g_1(a)$ to $g_1(b)$, re-embedded as a curve $g_2(s)$ in 2-D, from $g_2(a)$ to $g_2(b)$]

Latent variable: $s \in [a, b]$

Re-embedding: $f: g_1(s) \mapsto g_2(s)$

42

Page 43

Multidimensional Scaling (MDS)

43

[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007

Page 44

Multidimensional Scaling (MDS)

• Distance preserving:

$dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = dis(\mathbf{y}^{(i)}, \mathbf{y}^{(j)})$

• Scaling: methods that construct a configuration of samples in a target metric space from information about inter-point distances

• MDS is a scaling whose target space is Euclidean

• Here we consider classical metric MDS

• Metric MDS indeed preserves pairwise inner products rather than pairwise distances

• Metric MDS is unsupervised

44

Page 45

Multidimensional Scaling (MDS)

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$

• Preprocessing: centering (mean can be added back)

$\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)}$,  $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\,\mathbf{1}_N^T$

• Model:

There is no $W$ to train;  $f: R^d \to R^p,\ \mathbf{x} \mapsto \mathbf{y},\ p \le d$

45

Page 46

Criterion

• Inner product (scalar product):

$s_X(i, j) = s(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \langle \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \rangle = \mathbf{x}^{(i)T} \mathbf{x}^{(j)}$

• Gram matrix: recording pairwise inner products

$S = [s_X(i, j)]_{1 \le i, j \le N} = X^T X$  ($N \times N$)

• Usually we only know Z, but not X

46

Page 47

Criterion

• Criterion 1:

$Y^* = \arg\min_Y \sum_{i=1}^{N}\sum_{j=1}^{N} \big(s_X(i, j) - \mathbf{y}^{(i)T}\mathbf{y}^{(j)}\big)^2 = \arg\min_Y \|S - Y^T Y\|_F^2$

where $\|\cdot\|_F$ is the matrix (Frobenius) norm, $\|A\|_F^2 = \sum_{i,j} a_{ij}^2 = \mathrm{tr}(A^T A)$, and $Y = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(N)}]$ is $p \times N$

• Criterion 2:

$X^T X = S = Y^T Y$

47

Page 48

Algorithm

• EVD:

1. $S = X^T X = U \Lambda U^T = (U \Lambda^{1/2})(\Lambda^{1/2} U^T)$, with the eigenvalues in descending order and $\mathrm{rank}(S) \le d$

2. $Y = \begin{bmatrix} I_{p \times p} & O \end{bmatrix}_{p \times N} \Lambda^{1/2} U^T$  (the top-$p$ rows of $\Lambda^{1/2} U^T$)

More generally, $Y = Q \begin{bmatrix} I_{p \times p} & O \end{bmatrix} \Lambda^{1/2} U^T$, where $Q$ is an arbitrary $p \times p$ orthonormal matrix for rotation

48
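The EVD algorithm above reduces to a few lines of NumPy. The sketch below (the function name metric_mds is illustrative) keeps the p largest eigenpairs of the Gram matrix and returns Y = Λ^{1/2} U^T restricted to the top p rows; the final check confirms Y^T Y ≈ S when p equals the data dimensionality.

import numpy as np

def metric_mds(S, p):
    """Minimal classical-MDS sketch: EVD of the Gram matrix S (N x N),
    keep the p largest eigenpairs, return Y (p x N) with Y^T Y ~= S."""
    eigvals, U = np.linalg.eigh(S)                 # ascending order
    eigvals, U = eigvals[::-1], U[:, ::-1]         # descending order
    lam = np.clip(eigvals[:p], 0.0, None)          # guard tiny negative values
    return np.diag(np.sqrt(lam)) @ U[:, :p].T

# Example: Gram matrix of centered data; the embedding reproduces it.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))
X -= X.mean(axis=1, keepdims=True)
S = X.T @ X
Y = metric_mds(S, 3)
print(np.allclose(Y.T @ Y, S))   # True (up to numerical error)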

Page 49

PCA vs. MDS

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$;  SVD: $X = V \Sigma U^T$

• PCA: EVD on the covariance matrix

$C_{xx} = \frac{1}{N} X X^T = \frac{1}{N} V \Sigma U^T U \Sigma^T V^T = V \big(\frac{1}{N} \Sigma \Sigma^T\big) V^T$

$Y_{PCA} = W_{PCA}^T X = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} V^T V \Sigma U^T = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} \Sigma U^T$

• MDS: EVD on the Gram matrix

$S = X^T X = U \Sigma^T V^T V \Sigma U^T = U (\Sigma^T \Sigma) U^T$

$Y_{MDS} = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} (\Sigma^T \Sigma)^{1/2} U^T$

49

Page 50

PCA vs. MDS

• Discard the rotation term and, with some derivations:

$Y_{MDS} = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} (\Sigma^T \Sigma)^{1/2} U^T = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} \Sigma U^T = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} V^T V \Sigma U^T = Y_{PCA}$

• Comparison

PCA: EVD on the $d \times d$ matrix $C_{xx} = \frac{1}{N} X X^T$

MDS: EVD on the $N \times N$ matrix $S = X^T X$

SVD: SVD on the $d \times N$ matrix $X$

50

Page 51

For test data

• Model: (generative view)

$\mathbf{y} = W^T \mathbf{x}$, $\hat{\mathbf{x}} = W \mathbf{y}$;  use $W = V \begin{bmatrix} I_{p \times p} \\ O \end{bmatrix}$ from PCA for convenience

• For a new coming test $\mathbf{x}$:

$\mathbf{s} = X^T \mathbf{x} = U \Sigma^T V^T \mathbf{x}$  (the inner products of $\mathbf{x}$ with the training samples)

$\mathbf{y} = W^T \mathbf{x} = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} V^T \mathbf{x}$

• Finally: (with $\Lambda = \Sigma^T \Sigma$)

$\mathbf{y} = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} \Lambda^{-1/2} U^T \mathbf{s}$

51

Page 52

MDS with pairwise distance

• How about a training set with pairwise distances only?

$D = [d_{ij} = dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]_{1 \le i, j \le N}$, with no $X$ and no $Z$

52

Page 53

Distance metric

• Distance metric:

Nonnegative: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \ge 0$, and $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = 0$ iff $\mathbf{x}^{(i)} = \mathbf{x}^{(j)}$

Symmetric: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = dis(\mathbf{x}^{(j)}, \mathbf{x}^{(i)})$

Triangular: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \le dis(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) + dis(\mathbf{x}^{(k)}, \mathbf{x}^{(j)})$

• Minkowski distance: (order p)

$dis_p(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \left[\sum_{k=1}^{d} |x_k^{(i)} - x_k^{(j)}|^p\right]^{1/p}$

53
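A tiny worked example of the Minkowski distance (the order used here is unrelated to the target dimensionality p used elsewhere in the slides); for order 1 it is the Manhattan distance and for order 2 the Euclidean distance.

import numpy as np

def minkowski(x1, x2, order):
    """Minimal sketch of the order-p Minkowski distance between two vectors."""
    return np.sum(np.abs(x1 - x2) ** order) ** (1.0 / order)

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x1, x2, 1))   # 7.0  (Manhattan)
print(minkowski(x1, x2, 2))   # 5.0  (Euclidean)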

Page 54

Distance metric

• (Training) data set:

$D = [d_{ij} = dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]_{1 \le i, j \le N}$, with no $X$ and no $Z$

• Euclidean distance and inner product:

$dis_{L_2}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \left[\sum_{k=1}^{d} (x_k^{(i)} - x_k^{(j)})^2\right]^{1/2}$

$dis_{L_2}^2(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = (\mathbf{x}^{(i)} - \mathbf{x}^{(j)})^T (\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) = \mathbf{x}^{(i)T}\mathbf{x}^{(i)} - 2\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)} + \mathbf{x}^{(j)T}\mathbf{x}^{(j)} = s_X(i, i) - 2\,s_X(i, j) + s_X(j, j)$

$\Rightarrow\ s_X(i, j) = \frac{1}{2}\big\{s_X(i, i) + s_X(j, j) - dis_{L_2}^2(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})\big\}$

54

Page 55

Distance to inner product

• Define the squared-distance matrix:

$D_2 = [d_{ij}^2]_{1 \le i, j \le N}$, with $d_{ij} = dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$

• Double centering: (for centered data)

$S_X = -\frac{1}{2}\left(D_2 - \frac{1}{N} D_2\, \mathbf{1}_N\mathbf{1}_N^T - \frac{1}{N} \mathbf{1}_N\mathbf{1}_N^T D_2 + \frac{1}{N^2} \mathbf{1}_N\mathbf{1}_N^T D_2\, \mathbf{1}_N\mathbf{1}_N^T\right)$

elementwise: $s_X(i, j) = -\frac{1}{2}\left(d_{ij}^2 - \frac{1}{N}\sum_{k} d_{ik}^2 - \frac{1}{N}\sum_{m} d_{mj}^2 + \frac{1}{N^2}\sum_{m, k} d_{mk}^2\right)$

55
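The double-centering formula maps directly to code. Below is a minimal NumPy sketch (the helper name double_centering is illustrative), followed by a check that it recovers X^T X from the pairwise Euclidean distances of centered data.

import numpy as np

def double_centering(D):
    """Minimal sketch: turn a pairwise Euclidean distance matrix D (N x N)
    into the Gram matrix S_X of the (implicitly centered) data."""
    D2 = D ** 2                                   # squared distances
    row = D2.mean(axis=1, keepdims=True)          # (1/N) sum_k d_ik^2
    col = D2.mean(axis=0, keepdims=True)          # (1/N) sum_m d_mj^2
    total = D2.mean()                             # (1/N^2) sum_{m,k} d_mk^2
    return -0.5 * (D2 - row - col + total)

# Example: recover X^T X (for centered X) from pairwise distances.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))
X -= X.mean(axis=1, keepdims=True)
diff = X[:, :, None] - X[:, None, :]
D = np.sqrt((diff ** 2).sum(axis=0))
print(np.allclose(double_centering(D), X.T @ X))   # True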

Page 56

Proof

• Proof: (using $d_{mj}^2 = s(m, m) - 2\,s(m, j) + s(j, j)$ and centered data, $\sum_m \mathbf{x}^{(m)} = \mathbf{0}$)

$\frac{1}{N}\sum_{m=1}^{N} d_{mj}^2 = \frac{1}{N}\sum_{m=1}^{N}\big[s(m, m) - 2\,s(m, j) + s(j, j)\big] = \frac{1}{N}\sum_{m=1}^{N} \mathbf{x}^{(m)T}\mathbf{x}^{(m)} - \frac{2}{N}\Big(\sum_{m=1}^{N}\mathbf{x}^{(m)}\Big)^T\mathbf{x}^{(j)} + \mathbf{x}^{(j)T}\mathbf{x}^{(j)} = \frac{1}{N}\sum_{m=1}^{N} \mathbf{x}^{(m)T}\mathbf{x}^{(m)} + \mathbf{x}^{(j)T}\mathbf{x}^{(j)}$

Similarly, $\frac{1}{N}\sum_{k=1}^{N} d_{ik}^2 = \frac{1}{N}\sum_{k=1}^{N} \mathbf{x}^{(k)T}\mathbf{x}^{(k)} + \mathbf{x}^{(i)T}\mathbf{x}^{(i)}$

56

Page 57

Proof

• Proof:

$\frac{1}{N^2}\sum_{m=1}^{N}\sum_{k=1}^{N} d_{mk}^2 = \frac{1}{N^2}\sum_{m,k}\big[s(m, m) - 2\,s(m, k) + s(k, k)\big] = \frac{2}{N}\sum_{m=1}^{N} \mathbf{x}^{(m)T}\mathbf{x}^{(m)}$

• Finally

$-\frac{1}{2}\left(d_{ij}^2 - \frac{1}{N}\sum_{k} d_{ik}^2 - \frac{1}{N}\sum_{m} d_{mj}^2 + \frac{1}{N^2}\sum_{m,k} d_{mk}^2\right) = -\frac{1}{2}\left(-2\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)}\right) = \mathbf{x}^{(i)T}\mathbf{x}^{(j)} = s_X(i, j)$

57

Page 58

Algorithm

• Given X:

Get Z, perform MDS

• Given Z:

Perform MDS

• Given D:

Square each entry in D (to obtain the squared-distance matrix $D_2$)

Perform double centering

Perform MDS

58

Page 59

Summary

• Metric MDS preserves pairwise inner products instead of pairwise distances

• It preserves linear properties

• Extension:

Sammon’s nonlinear mapping:

$E_{NLM} = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{\big(dis_X(i, j) - dis_Y(i, j)\big)^2}{dis_X(i, j)}$

Curvilinear component analysis (CCA):

$E_{CCA} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \big(dis_X(i, j) - dis_Y(i, j)\big)^2\, h\big(dis_Y(i, j)\big)$

59

Page 60

From Linear To Nonlinear

60

Page 61

Linear

• PCA, LDA, MDS are linear:

Matrix operation

Linear properties (sum, scaling, commutative, …)

• Inner product, covariance:

$\langle \mathbf{x}^{(i)} + \mathbf{x}^{(j)}, \mathbf{x}^{(k)} \rangle = \langle \mathbf{x}^{(i)}, \mathbf{x}^{(k)} \rangle + \langle \mathbf{x}^{(j)}, \mathbf{x}^{(k)} \rangle$, i.e. $(\mathbf{x}^{(i)} + \mathbf{x}^{(j)})^T \mathbf{x}^{(k)} = \mathbf{x}^{(i)T}\mathbf{x}^{(k)} + \mathbf{x}^{(j)T}\mathbf{x}^{(k)}$

• Assumption on the original feature space:

Euclidean or Euclidean with “rotation and scaling”

61

Page 62

Problem

• If there exists structure in the feature space:

$(x_1, x_2, x_3) = g_1(s),\ a \le s \le b$

[Figure: a 1-D curve $g_1(s)$ in 3-D, from $g_1(a)$ to $g_1(b)$; a linear projection collapses (“crashes”) the structure]

62

Page 63

Manifold way

• Assumption:

The latent space is nonlinearly embedded in the feature space

The latent space is a manifold, and so is the feature space

The feature space is locally smooth and Euclidean

• Local geometry or property:

Distance preserving (ISOMAP)

Neighborhood (topology) preserving (LLE)

• Caution:

These properties and structures are measured in the feature space

63

Page 64

Isometric Feature Mapping (ISOMAP)

64

[4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000

Page 65

ISOMAP

• Distance metric in the feature space: geodesic distance

• How to measure:

Small scale: Euclidean distance in $R^d$

Large scale: shortest path in a graph

• The space to re-embed into:

p-dimensional Euclidean space

After we get the pairwise distances, we can embed them in many kinds of spaces

65

Page 66

Graph

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$

• Vertices (assume the samples $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$ are placed in order)

66

Page 67

Small scale

• Small scale: Euclidean; large scale: graph distance

• Vertices + edges (assume the samples $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$ are placed in order)

$dis(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) = \sum_{i=1}^{N-1} \|\mathbf{x}^{(i)} - \mathbf{x}^{(i+1)}\|_{L_2}$

67

Page 68

Distance metric

• MDS: (vertices + edges, samples placed in order)

the target is $dis(\mathbf{y}^{(1)}, \mathbf{y}^{(N)}) = \|\mathbf{y}^{(1)} - \mathbf{y}^{(N)}\|_{L_2} = dis(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) = \sum_{i=1}^{N-1} \|\mathbf{x}^{(i)} - \mathbf{x}^{(i+1)}\|_{L_2}$

68

Page 69

Algorithm

• Presetting:

Define the $N \times N$ distance matrix $D = [d_{ij}]_{1 \le i, j \le N}$

Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$ (defined on the next slide)

• (1) Geodesic distance in the neighborhood:

for i = 1:N
  for j = 1:N
    if ($\mathbf{x}^{(j)} \in Nei(i)$ and $i \ne j$)
      $d_{ij} = \|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_{L_2}$
    end
  end
end

69

Page 70

Algorithm

• (1) Geodesic distance in the neighborhood:

Neighbor:

ε-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\|\mathbf{x}^{(j)} - \mathbf{x}^{(i)}\|_{L_2} \le \varepsilon$

K-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$

• (2) Geodesic distance at large scale: (shortest path)

Floyd’s algorithm (run several rounds until convergence):

for each pair (i, j)
  for k = 1:N
    $d_{ij} = \min\{d_{ij},\ d_{ik} + d_{kj}\}$
  end
end

70

Page 71

Algorithm

• (3) MDS:

Transfer pairwise distances into inner products:

$S = -\frac{1}{2} H D_2 H$, where $H = I_{N \times N} - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T$, i.e. $H(i, j) = \delta_{ij} - 1/N$ (for centering)

EVD:

$S = U \Lambda U^T = (U\Lambda^{1/2})(\Lambda^{1/2}U^T)$,  $Y = \begin{bmatrix} I_{p \times p} & O \end{bmatrix} \Lambda^{1/2} U^T$  $(p \le d)$

Proof:

$-\frac{1}{2} H D_2 H = -\frac{1}{2}\Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big) D_2 \Big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\Big) = -\frac{1}{2}\Big(D_2 - \frac{1}{N}\mathbf{1}\mathbf{1}^T D_2 - \frac{1}{N} D_2\, \mathbf{1}\mathbf{1}^T + \frac{1}{N^2}\mathbf{1}\mathbf{1}^T D_2\, \mathbf{1}\mathbf{1}^T\Big) = S$

71
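Putting steps (1)–(3) together, here is a minimal NumPy sketch of ISOMAP (the function name, K, and the toy spiral are illustrative; it assumes the K-NN graph is connected, and it uses the standard Floyd–Warshall ordering with k in the outer loop, which converges in a single pass rather than the repeated rounds mentioned above).

import numpy as np

def isomap(X, K, p):
    """Minimal ISOMAP sketch: K-NN graph, shortest paths, then classical MDS.
    X is d x N (columns are samples); returns Y (p x N)."""
    N = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    euc = np.sqrt((diff ** 2).sum(axis=0))            # Euclidean distances

    # (1) keep only edges to the K nearest neighbors (symmetrized)
    D = np.full((N, N), np.inf)
    np.fill_diagonal(D, 0.0)
    nn = np.argsort(euc, axis=1)[:, 1:K + 1]
    for i in range(N):
        D[i, nn[i]] = euc[i, nn[i]]
        D[nn[i], i] = euc[i, nn[i]]

    # (2) geodesic distances = shortest paths (Floyd-Warshall)
    for k in range(N):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])

    # (3) classical MDS on the geodesic distance matrix
    H = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * H @ (D ** 2) @ H
    eigvals, U = np.linalg.eigh(S)
    eigvals, U = eigvals[::-1], U[:, ::-1]
    return np.diag(np.sqrt(np.clip(eigvals[:p], 0, None))) @ U[:, :p].T

# Example: unroll a noisy 1-D spiral embedded in 2-D.
t = np.linspace(0, 3 * np.pi, 120)
X = np.vstack([np.cos(t), np.sin(t)]) * t
Y = isomap(X, K=8, p=1)
print(Y.shape)   # (1, 120)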

Page 72

Example

• Swiss roll:

72

[4]

Page 73

Example

• Swiss roll: 350 points

MDS

ISOMAP

73

[1]

Page 74

Example

74

[4]

Page 75

Summary

• Compared to MDS, ISOMAP has the ability to discover the underlying structure (latent variables) that is nonlinearly embedded in the feature space

• It is a “global” method, which preserves all pairwise distances

• The Euclidean-space assumption in the low-D space implies convexity, which sometimes fails.

75

Page 76

Locally Linear Embedding (LLE)

76

[5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000
[6] L. K. Saul et al., Think globally, fit locally, 2003

Page 77

LLE

• Neighborhood preserving:

Based on the fundamental manifold properties

Preserve the local geometry of each sample and its neighbors

Ignore the global geometry in large scale

• Assumption:

Well-sampled with sufficient data

Each sample and its neighbors lie on or close to a local linear patch (sub-plane) of the manifold

77

Page 78

LLE

• Properties:

Local geometry is characterized by the linear coefficients that reconstruct each sample from its neighbors

These coefficients are robust to rotation, scaling, and translation (RST)

• Re-embedding:

Assume the target space is locally smooth (a manifold)

Locally Euclidean, but not necessarily at large scale

Reconstruction coefficients are still meaningful

Stick the local patches onto the low-D global coordinate system

78

Page 79

LLE

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$

79

Page 80

Neighborhood properties

• Linear reconstruction coefficients:

80

Page 81

Re-embedding

• Local patches into global coordinate:

81

Page 82

Illustration

82

[5]

Page 83

Algorithm

• Presetting:

Define the $N \times N$ weight matrix $W = [w_{ij}]_{1 \le i, j \le N}$ (initialized to 0), with rows $W = [\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots, \mathbf{w}^{(N)}]^T$

Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$

• (1) Find the neighbors of each sample

Neighbor:

ε-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\|\mathbf{x}^{(j)} - \mathbf{x}^{(i)}\|_{L_2} \le \varepsilon$

K-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$, with $K > p$

83

Page 84

Algorithm

• (2) Linear reconstruction coefficients:

Objective function:

$\min_W E(W) = \min_W \sum_{i=1}^{N} \left\|\mathbf{x}^{(i)} - \sum_{j=1}^{N} w_{ij}\,\mathbf{x}^{(j)}\right\|_{L_2}^2 = \min_W \sum_{i=1}^{N} \left\|\mathbf{x}^{(i)} - X\,\mathbf{w}^{(i)}\right\|_{L_2}^2$

Constraints: (for RST invariance)

for all $i$: $w_{ij} = 0$ if $\mathbf{x}^{(j)} \notin Nei(i)$, and $\sum_{j=1}^{N} w_{ij} = 1$

84

Page 85

Algorithm

• (2) Linear reconstruction coefficients: (for each sample)

Define the neighbor indices of $\mathbf{x}^{(i)}$ as $\{h_1, \ldots, h_m\}$ with $m = |Nei(i)|$, collect the neighbors in $H^{(i)} = [\mathbf{x}^{(h_1)}, \ldots, \mathbf{x}^{(h_m)}]$, and let $\tilde{\mathbf{w}} \in R^m$ be the nonzero weights of sample $i$ (so $\mathbf{1}^T\tilde{\mathbf{w}} = 1$). Then

$\left\|\mathbf{x}^{(i)} - \sum_{m'} \tilde{w}_{m'}\,\mathbf{x}^{(h_{m'})}\right\|_{L_2}^2 = \left\|(\mathbf{x}^{(i)}\mathbf{1}^T - H^{(i)})\,\tilde{\mathbf{w}}\right\|_{L_2}^2 = \tilde{\mathbf{w}}^T C\, \tilde{\mathbf{w}}$

where $C = (\mathbf{x}^{(i)}\mathbf{1}^T - H^{(i)})^T(\mathbf{x}^{(i)}\mathbf{1}^T - H^{(i)})$ is the local Gram matrix, with $C(m', m'') = (\mathbf{x}^{(i)} - \mathbf{x}^{(h_{m'})})^T(\mathbf{x}^{(i)} - \mathbf{x}^{(h_{m''})})$

85

Page 86

Algorithm

• (2) Linear reconstruction coefficients:

$E = \tilde{\mathbf{w}}^T C\, \tilde{\mathbf{w}} - \lambda(\mathbf{1}^T\tilde{\mathbf{w}} - 1)$,  $\frac{\partial E}{\partial \tilde{\mathbf{w}}} = 2C\tilde{\mathbf{w}} - \lambda\mathbf{1} = 0 \;\Rightarrow\; \tilde{\mathbf{w}} = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^T C^{-1}\mathbf{1}}$

Algorithm: run for each sample $\mathbf{x}^{(i)}$:

Define $C = (\mathbf{x}^{(i)}\mathbf{1}^T - H^{(i)})^T(\mathbf{x}^{(i)}\mathbf{1}^T - H^{(i)})$ and solve $\tilde{\mathbf{w}} = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^T C^{-1}\mathbf{1}}$

for $m' = 1 : |Nei(i)|$:  set $w_{i\,h_{m'}} = \tilde{w}_{m'}$
end

86
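Step (2) in code: a minimal NumPy sketch (the function name and the small regularization term are illustrative; the regularizer is a common practical safeguard when C is singular, e.g. when K > d, and is not part of the slide's derivation). It solves C w = 1 for each sample and normalizes the weights to sum to one.

import numpy as np

def lle_weights(X, K, reg=1e-3):
    """Minimal sketch of LLE step (2): reconstruction weights of every sample
    from its K nearest neighbors. X is d x N; returns the N x N matrix W."""
    d, N = X.shape
    diff = X[:, :, None] - X[:, None, :]
    dist = np.sqrt((diff ** 2).sum(axis=0))
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(dist[i])[1:K + 1]            # indices h_1..h_K
        G = X[:, [i]] - X[:, nbrs]                      # d x K differences
        C = G.T @ G                                     # local Gram matrix
        C += reg * np.trace(C) * np.eye(K)              # regularize if singular
        w = np.linalg.solve(C, np.ones(K))              # solve C w = 1
        W[i, nbrs] = w / w.sum()                        # enforce sum-to-one
    return W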

Page 87

Algorithm

• (3) Re-embedding: (minimize the reconstruction error again)

$E(Y) = \sum_{i=1}^{N} \left\|\mathbf{y}^{(i)} - \sum_{j=1}^{N} w_{ij}\,\mathbf{y}^{(j)}\right\|_{L_2}^2$, with $Y = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(N)}]$  ($p \times N$)

$= \mathrm{tr}\{Y(I - W)^T(I - W)Y^T\} = \mathrm{tr}\{Y(I - W - W^T + W^T W)Y^T\}$

87

Page 88

Algorithm

• (3) Re-embedding:

Definition:

$M = (I - W)^T(I - W) = I - W - W^T + W^T W$  ($N \times N$),  $m_{ij} = \delta_{ij} - w_{ij} - w_{ji} + \sum_{k=1}^{N} w_{ki} w_{kj}$

Constraints: (avoid degenerate solutions)

$\sum_{n=1}^{N} \mathbf{y}^{(n)} = \mathbf{0}$,  $\frac{1}{N}\sum_{n=1}^{N} \mathbf{y}^{(n)}\mathbf{y}^{(n)T} = \frac{1}{N} Y Y^T = I$

Optimization:

$Y^* = \arg\min_Y \mathrm{tr}\{Y M Y^T\}$, subject to $\sum_{n} \mathbf{y}^{(n)} = \mathbf{0}$ and $\frac{1}{N} Y Y^T = I$

Apply the Rayleigh–Ritz theorem

88

Page 89

Algorithm

• (3) Re-embedding:

Additional property (each row of $M$ sums to 0, because each row of $W$ sums to 1):

$\sum_{j=1}^{N} m_{ij} = \sum_{j=1}^{N}\Big[\delta_{ij} - w_{ij} - w_{ji} + \sum_{k=1}^{N} w_{ki} w_{kj}\Big] = 1 - 1 - \sum_{j} w_{ji} + \sum_{k} w_{ki}\sum_{j} w_{kj} = -\sum_{j} w_{ji} + \sum_{k} w_{ki} = 0$

so $\mathbf{1}_N$ is an eigenvector of $M$ with $\lambda = 0$

Solution: (EVD)

$M V = V \Lambda$ with the eigenvalues in ascending order; discard the first eigenvector ($\frac{1}{\sqrt{N}}\mathbf{1}_N$, $\lambda = 0$) and set

$Y^* = \arg\min_Y \mathrm{tr}\{Y M Y^T\} = \sqrt{N}\,[\mathbf{v}^{(2)}, \mathbf{v}^{(3)}, \ldots, \mathbf{v}^{(p+1)}]^T$

89
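Step (3) in code: a minimal NumPy sketch (the function name is illustrative) that builds M = (I − W)^T(I − W), performs the EVD, discards the constant eigenvector, and keeps the next p eigenvectors; it can be chained with the lle_weights sketch given earlier.

import numpy as np

def lle_embed(W, p):
    """Minimal sketch of LLE step (3): given the N x N weight matrix W from
    step (2), embed into p dimensions. Returns Y (p x N)."""
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, V = np.linalg.eigh(M)                 # ascending eigenvalues
    # discard the constant eigenvector (eigenvalue ~ 0), keep the next p
    Y = V[:, 1:p + 1].T * np.sqrt(N)               # so that (1/N) Y Y^T = I
    return Y

# Example usage with the earlier sketch:
# W = lle_weights(X, K=10); Y = lle_embed(W, p=2)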

Page 90

Example

• Swiss roll: 350 points

90

[5]

[1]

Page 91

Example

• S shape:

91

[6]

Page 92

Example

92

[5]

Page 93

Summary

• Although the global geometry isn’t explicitly preserved during LLE, it can still be reconstructed from the overlapping local neighborhoods.

• The matrix M on which the EVD is performed is indeed sparse.

• K is a key factor in LLE, as it is in ISOMAP

93

Page 94

Laplacian eigenmap

94

[7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003

Page 95

General setting

• (Training) data set:

high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N},\ \mathbf{x}^{(n)} \in R^d$

• Preprocessing: centering (mean can be added back)

$\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)}$,  $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\,\mathbf{1}_N^T$

• Want to achieve:

low-D: $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N},\ \mathbf{y}^{(n)} \in R^p$

95

Page 96

Laplacian eigenmap

• Fundamental:

Laplace–Beltrami operator (for smoothness)

• Presetting:

Define the $N \times N$ weight matrix $W = [w_{ij}]_{1 \le i, j \le N}$ (initialized to 0)

Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$

• Neighborhood definition:

ε-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\|\mathbf{x}^{(j)} - \mathbf{x}^{(i)}\|_{L_2} \le \varepsilon$

K-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$

96

Page 97

Algorithm

• (1) Neighborhood definition

• (2) Weight computation:

Heat kernel: $w_{ij} = \exp\left(-\frac{\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_{L_2}^2}{t}\right)$ if $\mathbf{x}^{(j)} \in Nei(i)$, and $w_{ij} = 0$ otherwise

• (3) Re-embedding:

$E(Y) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \|\mathbf{y}^{(i)} - \mathbf{y}^{(j)}\|_{L_2}^2\, W(i, j) = \mathrm{tr}\{Y(D - W)Y^T\} = \mathrm{tr}\{Y L Y^T\}$

where $D$ is diagonal with $d_{ii} = \sum_{j} w_{ji}$, and $L = D - W$ is the graph Laplacian

97

Page 98

Optimization

• Optimization:

$Y^* = \arg\min_{Y D Y^T = I} \mathrm{tr}\{Y L Y^T\}$

Solution: generalized EVD $L\mathbf{v} = \lambda D\mathbf{v}$, i.e. $L V = D V \Lambda$, with the eigenvalues in ascending order

$\mathbf{1}_N$ is an eigenvector of $L$ with $\lambda = 0$ (discard it), so

$Y^* = [\mathbf{v}^{(2)}, \mathbf{v}^{(3)}, \ldots, \mathbf{v}^{(p+1)}]^T$

98
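The three steps above fit in a short sketch, assuming NumPy/SciPy; scipy.linalg.eigh(L, D) solves the generalized problem L v = λ D v. The function name and the parameters K and t are illustrative; the K-NN graph is symmetrized so that D has strictly positive diagonal entries.

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, K, t, p):
    """Minimal Laplacian-eigenmap sketch: K-NN graph, heat-kernel weights,
    then the generalized EVD L v = lambda D v. X is d x N; returns Y (p x N)."""
    N = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    sq = (diff ** 2).sum(axis=0)                       # squared distances
    W = np.zeros((N, N))
    nn = np.argsort(sq, axis=1)[:, 1:K + 1]
    for i in range(N):                                 # symmetric K-NN graph
        W[i, nn[i]] = np.exp(-sq[i, nn[i]] / t)        # heat-kernel weights
        W[nn[i], i] = W[i, nn[i]]
    D = np.diag(W.sum(axis=1))
    L = D - W                                          # graph Laplacian
    eigvals, V = eigh(L, D)                            # ascending eigenvalues
    return V[:, 1:p + 1].T                             # drop the constant eigenvector

# Example usage: Y = laplacian_eigenmap(X, K=10, t=1.0, p=2)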

Page 99

Example

• Swiss roll: 2000 points

99

[7]

Page 100

Example

• Example: From 3D to 3D

100

[1]

Page 101

Thank you for listening

101