
Michel Verleysen 1

Learning high-dimensional data

Michel Verleysen, Université catholique de Louvain (Louvain-la-Neuve, Belgium)

Electricity department

January 2002

Michel Verleysen 2

Acknowledgements

• Part of this work has been realized in collaboration with PhD students:
  - Philippe Thissen
  - Jean-Luc Voz
  - Amaury Lendasse
  - John Lee

• Some ideas and figures come from
  - D. L. Donoho, High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality. Lecture on August 8, 2000, to the American Mathematical Society "Math Challenges of the 21st Century". Available from http://www-stat.stanford.edu/~donoho/.


Michel Verleysen 3

Data mining

• Data mining: find information in large databases (large = D and/or N)

[Figure: an N × D data matrix: N lines (number of observations) and D columns (dimension of the space)]

Michel Verleysen 4

High-Dimensional spaces

• Inputs = High-Dimension (HD) vectors (D is large)
  - many parameters in the model
  - local minima
  - slow convergence

• The questions
  - Learning algorithms in HD spaces?
  - Local learning or not?

• The arguments
  - real data are seldom HD (concept of intrinsic dimension)
  - local learning is not worse than global learning…

[Figure: two-layer MLP: inputs x0, x1, …, xD; hidden units z0, z1, …, zM with activation g and first-layer weights w_ij^(1); output y with activation h and second-layer weights w_ij^(2)]


Michel Verleysen 5

Contents

• High-dimensional data
  - Surprising results
  - Intrinsic dimension

• Local learning
  - Use of distance measures

• Dimension reduction
  - non-linear projection
  - application to time-series forecasting

Michel Verleysen 6

Contents

• High-dimensional data
  - Surprising results
  - Intrinsic dimension

• Local learning
  - Use of distance measures

• Dimension reduction
  - non-linear projection
  - application to time-series forecasting


Michel Verleysen 7

Data mining: large databases

• (biomedical) signals

Michel Verleysen 8

Data mining: large databases

• financial data


Michel Verleysen 9

Data mining: large databases

• imagery

Michel Verleysen 10

Data mining: large databases

• recordings of consumers’ habits (credit cards)
• bio-data (human genome, etc.)
• satellite images
• hyperspectral images
• …


Michel Verleysen 11

John Wilder Tukey

• The Future of Data Analysis, Ann. Math. Statist., 33, 1-67, 1962.

« Analyze data rather than prove theorems… »

• In other words:
  - data are here
  - they will be coming more and more in the future
  - we must analyze them
  - with very humble means
  - insistence on mathematics will distract us from fundamental points

From D.L. Donoho, op. cit.

Michel Verleysen 12

Empty space phenomenon

• Necessity to fill space with learning points

• # learning points grows exponentially with dimension (see the numerical sketch below)

[Figure: three plots of f(x) versus x, illustrating the density of learning points needed to sample a function]
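
As a rough numerical illustration of the bullet above (a minimal sketch; the resolution k = 10 is an arbitrary choice, not a figure from the talk): keeping k grid points per axis in the unit hypercube [0, 1]^D requires k^D learning points.

```python
# Back-of-the-envelope count for the empty space phenomenon: a regular grid
# with k points per axis needs k**D points to cover the unit hypercube [0, 1]^D.
k = 10  # desired resolution per axis (arbitrary choice for the illustration)
for D in (1, 2, 3, 5, 10, 20):
    print(f"D = {D:2d}  ->  {float(k ** D):.1e} learning points")
```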


Michel Verleysen 13

Example: Silverman’s result

• How to approximate a Gaussian distribution with Gaussian kernels

• Desired accuracy: 90%

[Figure: number of required points (# points, log scale from 1 to 10^6) versus dimension (dim, 0 to 10)]

Michel Verleysen 14

Surprising phenomena in HD spaces

• Sphere volume

• Sphere volume / cube volume

• Embedded spheres (radius ratio = 0.9)

V(d) = π^(d/2) · r^d / Γ(d/2 + 1)

[Figures: sphere volume versus dimension (0 to 30); sphere/cube volume ratio versus dimension; volume ratio of the embedded spheres versus dimension; see the numerical sketch below]
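
The three curves can be recomputed directly from the volume formula above; this minimal sketch (the listed dimensions are arbitrary) prints the unit-sphere volume, the sphere-to-cube volume ratio, and the volume fraction of the unit ball lying outside the embedded ball of radius 0.9.

```python
from math import pi, gamma

def sphere_volume(d, r=1.0):
    # V(d) = pi**(d/2) * r**d / Gamma(d/2 + 1)
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20, 30):
    v_ball = sphere_volume(d)          # unit ball
    v_cube = 2.0 ** d                  # enclosing cube [-1, 1]^d
    shell = 1.0 - 0.9 ** d             # fraction of the ball outside radius 0.9
    print(f"d = {d:2d}  V_sphere = {v_ball:9.3e}  "
          f"V_sphere/V_cube = {v_ball / v_cube:9.3e}  "
          f"outside r = 0.9: {shell:.3f}")
```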


Michel Verleysen 15


Gaussian kernels

• 1-D Gaussian

• % of points inside a sphere of radius 1.65 (see the sketch below)

[Figures: the 1-D Gaussian density with the interval [−1.65, 1.65] marked; % of points inside the radius-1.65 sphere versus dimension (0 to 30)]
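
This curve can be computed exactly: the squared norm of a standard d-dimensional Gaussian sample follows a chi-square law with d degrees of freedom, so the mass inside radius 1.65 is P(χ²_d ≤ 1.65²). A minimal sketch (assuming SciPy is available):

```python
from scipy.stats import chi2

radius = 1.65
for d in (1, 2, 3, 5, 10, 20, 30):
    mass = chi2.cdf(radius ** 2, df=d)   # P(chi2_d <= radius**2)
    print(f"d = {d:2d}  fraction of the Gaussian mass inside radius {radius}: {mass:.4f}")
```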

Michel Verleysen 16

Concentration of measure phenomenon

• Take all pairwise distances in random data

• Compute the average A and the variance V of these distances

• If D increases then
  - V remains fixed
  - A increases

• All distances seem to concentrate!!!

• Example: Euclidean norm of samples
  - average A increases with √D
  - variance V remains fixed

• → samples seem to be normalized!
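
A small simulation of this effect (a sketch with arbitrary sample sizes, and with Gaussian data standing in for generic random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of samples per dimension setting (arbitrary)
for D in (2, 10, 100, 1000):
    x = rng.standard_normal((n, D))      # random data
    norms = np.linalg.norm(x, axis=1)    # Euclidean norms of the samples
    print(f"D = {D:4d}  mean norm = {norms.mean():7.2f}  "
          f"std = {norms.std():.3f}  relative spread = {norms.std() / norms.mean():.4f}")
```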


Michel Verleysen 17

Intrinsic dimension

• No definition

• Example:

From « Analyse de données par réseaux de neurones auto-organisés », P. Demartines, Ph.D. thesis, Institut National Polytechnique de Grenoble (France), 1994.

Michel Verleysen 18

Estimation of intrinsic dimensions 1/4

• small region → approximately a plane

• PCA applied on small regions

[Overview of estimators: Grassberger-Procaccia, box counting, a posteriori estimation, local PCA]
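
A hedged illustration of this local-PCA idea (not code from the talk; the toy data set, the neighbourhood size and the 95% variance threshold are assumptions): in a small neighbourhood the manifold is roughly flat, so the number of dominant local PCA eigenvalues estimates the intrinsic dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 1000, 25                          # number of points, neighbourhood size
t = rng.uniform(0.0, 4.0 * np.pi, N)     # toy data: a helix (a 1-D curve) in 3-D
X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])

dims = []
for i in range(N):
    idx = np.argsort(np.linalg.norm(X - X[i], axis=1))[:k]  # k nearest neighbours
    local = X[idx] - X[idx].mean(axis=0)                    # centre the small region
    eigvals = np.linalg.svd(local, compute_uv=False) ** 2   # local PCA spectrum
    explained = np.cumsum(eigvals) / eigvals.sum()
    dims.append(int(np.searchsorted(explained, 0.95)) + 1)  # components for 95% variance
print(f"average local-PCA dimension: {np.mean(dims):.2f}   (expected: about 1)")
```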


Michel Verleysen 19

Estimation of intrinsic dimensions 2/4

[Overview of estimators: Grassberger-Procaccia, box counting, a posteriori estimation, local PCA]

From P. Demartines, op. cit.

Michel Verleysen 20

Estimation of intrinsic dimensions 3/4

• Similar to box counting

• Mutual distances between pairs of points

• Advantage: N(N−1)/2 distances for N points

• log-log graph: the number C(r) of pairs of points closer than r, versus r

[Overview of estimators: Grassberger-Procaccia, box counting, a posteriori estimation, local PCA]
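
A hedged sketch of the procedure (an illustration, not the original Grassberger-Procaccia code; the test manifold and the radius range are assumptions, and SciPy is assumed available): the slope of log C(r) versus log r at small radii estimates the intrinsic dimension.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
N = 1000
t = rng.uniform(0.0, 2.0 * np.pi, N)     # toy data: a 2-D cylinder surface ...
s = rng.uniform(0.0, 1.0, N)             # ... embedded in a 5-D space
X = np.column_stack([np.cos(t), np.sin(t), s, np.zeros(N), np.zeros(N)])

dists = pdist(X)                         # the N(N-1)/2 pairwise distances
radii = np.logspace(-1.5, -0.5, 10)      # small radii: local behaviour only
counts = np.array([(dists < r).sum() for r in radii])
slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
print(f"estimated correlation dimension: {slope:.2f}   (expected: about 2)")
```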


Michel Verleysen 21

Estimation of intrinsic dimensions 4/4

• example: forecasting problem

[Overview of estimators: Grassberger-Procaccia, box counting, a posteriori estimation, local PCA]

[Diagram: IN → dimension reduction → neural network → performances OK? → OUT (if no, loop back)]

Michel Verleysen 22

Intrinsic dimension: limitation of the concept

• Seen from very close

• Seen from very far


Michel Verleysen 23

Contents

• High-dimensional data
  - Surprising results
  - Intrinsic dimension

• Local learning
  - Use of distance measures

• Dimension reduction
  - non-linear projection
  - application to time-series forecasting

Michel Verleysen 24

Local learning

• « Local »: by means of local functions

• Typical example: Gaussian kernels

• Radial Basis Function Networks

• Local and global: ANN are interpolators, and do not extrapolate to regions without learning data!

f(x) = Σ_{j=1..P} w_j · K_j(x)    with    K_j(x) = exp( −‖x − c_j‖² / h_j² )


Michel Verleysen 25

Radial-Basis Function Networks (RBFN) for approximation

• Advantages (over MLP, …):
  - split computation of
      - centres c_j
      - widths h_j
      - multiplying factors w_j
  - easier learning

f(x) = Σ_{j=1..P} w_j · K_j(x)    with    K_j(x) = exp( −‖x − c_j‖² / h_j² )
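
A minimal sketch of this split training (the grid-placed centres, the width rule, and the least-squares solution for the w_j are assumptions made for the example, not the exact scheme of the talk):

```python
import numpy as np

def rbf_design(X, centres, widths):
    # K_j(x) = exp(-||x - c_j||^2 / h_j^2) for every sample x and every centre c_j
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / widths[None, :] ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)                  # toy 1-D target

centres = np.linspace(-3.0, 3.0, 10).reshape(-1, 1)                    # 1. centres c_j (here: a grid)
widths = np.full(len(centres), 2.0 * (centres[1, 0] - centres[0, 0]))  # 2. widths h_j
K = rbf_design(X, centres, widths)
w, *_ = np.linalg.lstsq(K, y, rcond=None)                              # 3. factors w_j (least squares)

print(f"training RMSE: {np.sqrt(np.mean((y - K @ w) ** 2)):.3f}")
```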

Michel Verleysen 26

Local learning 2/2

• Sum of sigmoids ≈ Gaussian!


Michel Verleysen 27

Problem with (local ?) learning

• Most ANN use distances between input and weight vectors: ‖x − c_i‖
  - RBFN: as argument to radial kernels
  - VQ and SOM: to choose the « winner »
  - first layer in MLP: scalar products x·w

• In high-dimensional spaces: all these distances seem identical!
  (concentration of measure phenomenon)

• → need for
  - other neural networks
  - same neural networks, with other distance measures

Michel Verleysen 28

Using new distance measures

• Example: RBF

  f(x) = Σ_{j=1..P} w_j · K_j(x)    with    K_j(x) = exp( −‖x − c_j‖² / h_j² )

• Use of « super-Gaussian » kernels:

  K_j(x) = exp( −‖x − c_j‖^r / h_j^r )

[Figure: kernel profiles as a function of the distance to the centre, for the Gaussian and super-Gaussian cases]
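
A small sketch comparing the two kernel profiles (the exponents r and the width h are arbitrary choices for the illustration):

```python
import numpy as np

def kernel(dist, h=4.0, r=2.0):
    # K(x) = exp(-||x - c||^r / h^r), written as a function of the distance ||x - c||
    return np.exp(-(dist / h) ** r)

dist = np.linspace(0.0, 14.0, 8)
for r in (2.0, 4.0, 8.0):            # r = 2: Gaussian; r > 2: flatter "super-Gaussian"
    values = "  ".join(f"{v:.2f}" for v in kernel(dist, r=r))
    print(f"r = {r:.0f}:  {values}")
```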


Michel Verleysen 29

Not assuming evidence…

• widths of kernels in RBF are not necessarily equal to the STDV of samples in the Voronoi zone!

  σ_j = STDV( points in the Voronoi zone of c_j )

  K_j(x) = exp( −‖x − c_j‖² / (q·σ_j)² )

[Figure: RBF error as a function of the q factor]
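
A hedged sketch of this width rule (the data, the centres and the candidate q values are arbitrary, and the per-zone spread is computed as a single pooled standard deviation):

```python
import numpy as np

def voronoi_widths(X, centres, q):
    # sigma_j = spread of the samples in the Voronoi zone of c_j; width = q * sigma_j
    labels = np.argmin(((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1), axis=1)
    sigmas = np.array([X[labels == j].std() if np.any(labels == j) else 1.0
                       for j in range(len(centres))])
    return q * sigmas

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(500, 2))
centres = rng.uniform(-3.0, 3.0, size=(8, 2))
for q in (1.0, 2.0, 4.0):            # candidate q factors, to be compared on the RBF error
    print(f"q = {q}:  widths = {np.round(voronoi_widths(X, centres, q), 2)}")
```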

Michel Verleysen 30

Contents

• High-dimensional data
  - Surprising results
  - Intrinsic dimension

• Local learning
  - Use of distance measures

• Dimension reduction
  - non-linear projection
  - application to time-series forecasting


Michel Verleysen 31

Dimension reduction

reduced # inputs → reduced # model parameters → reduced overfitting

Michel Verleysen 32

Dimension reduction & intrinsic dimension

• Questions:
  - intrinsic dimension is unknown
  - non-linear submanifolds


Michel Verleysen 33

Kohonen maps

• Based on topology preservation

Michel Verleysen 34

Distance-preservation methods

• Many non-linear projection methods: based on distance preservation between input and output pairs

• Examples
  - Multi-dimensional scaling (MDS)
  - Sammon’s mapping
  - Curvilinear Component Analysis

• Principle: place points in the output space, so that pairwise distances are as equal as possible to the corresponding pairwise distances in the input space
  - Impossible to respect all pairwise distances → insist on small ones (local pairs)


Michel Verleysen 35

Curvilinear Component Analysis

• Principle
  - q < d
  - respect the mutual distances between pairs of points

  data in d-dim original space  →  non-linear projection  →  data in q-dim state space

  E = Σ_{i=1..N} Σ_{j=1..N} ( X_ij − Y_ij )² · F( Y_ij, λ )

  where X_ij and Y_ij are the pairwise distances in the original and in the projected space, and F is a weighting function favouring small output distances.
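
A minimal sketch of this cost (the step function used for F is one common choice and an assumption here; the CCA optimisation itself is not reproduced):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cca_error(X, Y, lam):
    # E = sum_ij (X_ij - Y_ij)^2 * F(Y_ij, lambda), with F a step weighting function
    dX = pdist(X)                  # pairwise distances in the d-dim input space
    dY = pdist(Y)                  # pairwise distances in the q-dim output space
    F = (dY < lam).astype(float)   # favour the preservation of small output distances
    return float(((dX - dY) ** 2 * F).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))  # toy data, d = 5
Y = X[:, :2]                       # a naive q = 2 "projection" (plain truncation)
print(f"CCA cost of the truncated projection: {cca_error(X, Y, lam=2.0):.1f}")
```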

Michel Verleysen 36

Adjusting parameters in CCA

• Need for a curvilinear distance
  - VQ → centroids
  - linking centroids
  - measuring the distances via the links

• In practice: generalized distance

  Δ_ij(t) = (1 − ω)·X_ij + ω·δ_ij,   with a weighting ω = ω(t) that evolves during learning
  (X_ij: Euclidean distance, δ_ij: curvilinear distance measured via the links)


Michel Verleysen 37

CCA: examples

Michel Verleysen 38

Application of dimension reduction to forecasting tasks

• Regressor:
  - past values x(t−i)
  - exogenous data in(j)

• Forecasting:

  x(t+1) = f( x(t), x(t−1), …, x(t−k), in(1), in(2), …, in(l) )

• Non-linear forecasting:
  1. optimise the regressor on a linear predictor
  2. use the same regressor with a non-linear predictor f
  - trial and error (computational load!)
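
A hedged sketch of step 1 above, choosing the regressor size with a linear predictor (the toy series and its coefficients are assumptions, not the talk’s example):

```python
import numpy as np

def make_regressor(x, k):
    # rows [x(t), x(t-1), ..., x(t-k+1)], targets x(t+1)
    rows = np.array([x[t - k + 1 : t + 1][::-1] for t in range(k - 1, len(x) - 1)])
    return rows, x[k:]

rng = np.random.default_rng(0)
x = np.zeros(1200)                                   # toy series driven by two past values
for t in range(2, len(x) - 1):
    x[t + 1] = 0.5 * x[t] + 0.3 * x[t - 2] ** 2 + 0.1 * rng.standard_normal()

for k in range(1, 7):                                # candidate regressor sizes
    R, y = make_regressor(x[:1000], k)               # training part
    A = np.column_stack([R, np.ones(len(R))])        # linear predictor with intercept
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    Rv, yv = make_regressor(x[1000:], k)             # validation part
    err = np.mean((np.column_stack([Rv, np.ones(len(Rv))]) @ w - yv) ** 2)
    print(f"regressor size {k}:  validation MSE = {err:.5f}")
```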


Michel Verleysen 39

Forecasting: selection of input variables

• Start with many input variables, then reduce their number

• Two options:

  1. selection of input variables
     - interpretability
     - limited to existing variables

  2. projection of input variables
     - linear: PCA
     - non-linear: CCA, Kohonen, etc., based on Takens’ theorem

Michel Verleysen 40

Forecasting: Takens’ theorem 1/2

• Takens’ theorem: q ≤ size of regressor ≤ 2q+1 (AR model)

• time series: intrinsic dimension q = 1

[Figures: the time series x(t) versus t; the delay plots x(t) versus x(t−1), and x(t) versus x(t−1) versus x(t−2)]


Michel Verleysen 41

Forecasting: Takens’ theorem 2/2

• Takens’ theorem:

  q ≤ size of regressor ≤ 2q+1

• In the (2q+1)-dimensional space, there exists a q-dimensional surface without intersection points

• Projection from 2q+1 to q is possible!

Michel Verleysen 42

Forecasting: 1st example 1/2

• Artificial series (two past values!)

  x(t+1) = a·x(t) + b·x²(t−2) + ε(t)

• Linear AR model

[Figure: sum of errors (on 1000 samples) of the linear AR model versus AR order (1 to 10)]


Michel Verleysen 43

Forecasting: 1st example 2/2

• Non-linear AR model
  - initial regressor: size = 6
  - intrinsic dimension: 2
  - CCA from dim = 6 to dim = 2
  - MLP on 2-dim data

[Figure: sum of errors (on 1000 samples) versus AR order (1 to 6), for the MLP without projection and for the MLP after non-linear projection]

Michel Verleysen 44

Forecasting: 2nd example 1/2

• Daily returns of the BEL20 index

• 42 indicators from inputs and exogenous variables:
  - returns: x_t, x_{t-10}, x_{t-20}, x_{t-40}, …, y_t, y_{t-10}, …
  - differences of returns: x_t − x_{t-5}, x_{t-5} − x_{t-10}, …, y_t − y_{t-5}
  - oscillators: K(20), K(40), …
  - moving averages: MM(10), MM(50), …
  - exponential moving averages: MME(10), MME(50), …
  - etc.

[Figure: daily returns of the BEL20 index over about 2500 trading days (values between −0.025 and 0.025)]


Michel Verleysen 45

Forecasting: 2nd example 2/2

• Method:
  - 42 indicators
  - PCA → 25 variables
  - Grassberger-Procaccia: intrinsic dimension = 9
  - CCA → 9 variables
  - RBF → forecasting

• Result: % of correct approximations of the sign (90-day average)

• On average: 57.2% on the test set

[Figure: percentage of correct sign approximations over the test set]

Michel Verleysen 46

Conclusion

• High dimensions: > 3!

• NN in general: also difficulties in high dimensions

• Common problems:
  - Euclidean distance
  - empty space phenomenon

• Towards solutions…
  1. Local NN (RBFN, etc.): easier learning
  2. Generic methods for non-linear projections: reduce dimension

• Open perspectives:
  - using dimension reduction techniques based on topology (and not on distances)
  - study the possibility of non-Euclidean distances and non-Gaussian kernels