Michel Verleysen 1
Learning high-dimensional data
Michel Verleysen
Université catholique de Louvain (Louvain-la-Neuve, Belgium)
Electricity Department
January 2002
Michel Verleysen 2
Acknowledgements
• Part of this work has been realized in collaboration with PhD students:
  - Philippe Thissen
  - Jean-Luc Voz
  - Amaury Lendasse
  - John Lee
• Some ideas and figures come from:
  - D. L. Donoho, High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality. Lecture on August 8, 2000, to the American Mathematical Society "Math Challenges of the 21st Century". Available from http://www-stat.stanford.edu/~donoho/.
Michel Verleysen 3
Data mining
• Data mining: find information in large databases (large = D and/or N)
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D} \\ x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix} \qquad \begin{array}{l} N \text{ lines (number of observations)} \\ D \text{ columns (dimension of the space)} \end{array}$$
Michel Verleysen 4
High-Dimensional spaces
• Inputs = High-Dimension (HD) vectors (D is large)
  - many parameters in the model
  - local minima
  - slow convergence
• The questions:
  - Learning algorithms in HD spaces?
  - Local learning or not?
• The arguments:
  - real data are seldom HD (concept of intrinsic dimension)
  - local learning is not worse than global learning…
[Figure: a two-layer MLP with inputs x_0 … x_D, first-layer weights w_ij^(1), hidden units z_0 … z_M (activation g), second-layer weights w_ij^(2), and output y (activation h)]
Michel Verleysen 5
Contents
• High-dimensional data
  - Surprising results
  - Intrinsic dimension
• Local learning
  - Use of distance measures
• Dimension reduction
  - non-linear projection
  - application to time-series forecasting
Michel Verleysen 6
Contents
• High-dimensional data
  - Surprising results
  - Intrinsic dimension
• Local learning
  - Use of distance measures
• Dimension reduction
  - non-linear projection
  - application to time-series forecasting
Michel Verleysen 7
Data mining: large databases
• (biomedical) signals
Michel Verleysen 8
Data mining: large databases
• financial data
Michel Verleysen 9
Data mining: large databases
• imagery
Michel Verleysen 10
Data mining: large databases
• recordings of consumers' habits (credit cards)
• bio-data (human genome, etc.)
• satellite images
• hyperspectral images
• …
Michel Verleysen 11
John Wilder Tukey
• The Future of Data Analysis, Ann. Math. Statist., 33, 1-67, 1962.
"Analyze data rather than prove theorems…"
• In other words:
  - data are here
  - they will be coming more and more in the future
  - we must analyze them
  - with very humble means
  - insistence on mathematics will distract us from fundamental points
From D. L. Donoho, op. cit.
Michel Verleysen 12
Empty space phenomenon
• Necessity to fill space with learning points
• The number of learning points grows exponentially with the dimension
[Figure: three plots of f(x) versus x, illustrating how the quality of an approximation depends on how densely the learning points fill the space]
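A minimal sketch of this exponential growth (the regular-grid model and the spacing h = 0.1 are illustrative assumptions, not from the slides):

```python
# To cover [0, 1]^d with a regular grid of spacing h, one needs (1/h)^d
# points: the required number of learning points explodes with the dimension.
h = 0.1
for d in (1, 2, 3, 5, 10):
    print(f"dim {d:2d}: {round((1 / h) ** d):>12,} grid points")
```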
Michel Verleysen 13
Example: Silverman's result
• How to approximate a Gaussian distribution with Gaussian kernels?
• Desired accuracy: 90%
[Figure: number of required points (log scale, 1E+00 to 1E+06) versus the dimension (0 to 10): the sample size needed to reach the desired accuracy grows exponentially]
Michel Verleysen 14
Surprising phenomena in HD spaces
• Sphere volume
• Sphere volume / cube volume
• Embedded spheres (radius ratio = 0.9)

$$V(d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \, r^d$$

[Figure: three plots versus the dimension (0 to 30): the volume of the unit sphere, which peaks around d = 5 and then tends to 0; the sphere/cube volume ratio, which tends to 0; and the volume ratio of two embedded spheres with radius ratio 0.9, which also tends to 0]
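The three curves can be reproduced directly from this formula; a short sketch (the dimension range is chosen to match the plots):

```python
import math

def sphere_volume(d, r=1.0):
    """V(d) = pi^(d/2) * r^d / Gamma(d/2 + 1), the volume of a d-ball."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (2, 5, 10, 20, 30):
    v = sphere_volume(d)             # unit sphere volume: peaks near d = 5
    ratio = v / 2 ** d               # sphere inscribed in the cube [-1, 1]^d
    shell = 1 - 0.9 ** d             # volume fraction outside the 0.9-radius sphere
    print(f"dim {d:2d}: V = {v:9.5f}   sphere/cube = {ratio:.2e}   shell = {shell:.3f}")
```

Almost all of the volume of a high-dimensional sphere lies in a thin shell near its surface, which is what the embedded-spheres curve shows.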
Michel Verleysen 15
Gaussian kernels
• 1-D Gaussian
• % of points inside a sphere of radius 1.65
[Figure: left, the 1-D Gaussian density with the interval [-1.65, 1.65] marked; right, the fraction of points inside the sphere of radius 1.65 versus the dimension (0 to 30), dropping towards 0]
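A hedged Monte-Carlo sketch of the right-hand curve (the sample size and the dimensions are illustrative choices):

```python
import numpy as np

# Fraction of a standard d-dimensional Gaussian falling inside the sphere
# of radius 1.65 (in 1-D, [-1.65, 1.65] holds about 90% of the mass).
rng = np.random.default_rng(0)
for d in (1, 2, 5, 10, 20, 30):
    x = rng.standard_normal((100_000, d))
    inside = (np.linalg.norm(x, axis=1) < 1.65).mean()
    print(f"dim {d:2d}: {inside:.3f}")
```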
Michel Verleysen 16
Concentration of measure phenomenon
• Take all pairwise distances in random data
• Compute the average A and the variance V of these distances
• If D increases, then:
  - V remains fixed
  - A increases
• All distances seem to concentrate!
• Example: Euclidean norm of samples
  - average A increases with √D
  - variance V remains fixed
• → samples seem to be normalized!
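A quick numerical check of this phenomenon (a sketch; the sample size and dimensions are arbitrary choices):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Pairwise Euclidean distances between uniform random points: the average A
# grows roughly like sqrt(D) while the variance V stays flat, so all
# distances look alike in high dimension.
rng = np.random.default_rng(0)
x_full = rng.random((500, 1000))
for d in (2, 10, 100, 1000):
    dist = pdist(x_full[:, :d])          # all 500*499/2 pairwise distances
    print(f"D = {d:4d}: A = {dist.mean():7.3f}   V = {dist.var():.4f}")
```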
Michel Verleysen 17
Intrinsic dimension
• No unique definition
• Example:
From "Analyse de données par réseaux de neurones auto-organisés", P. Demartines, Ph.D. thesis, Institut National Polytechnique de Grenoble (France), 1994.
Michel Verleysen 18
Estimation of intrinsic dimensions 1/4
• a small region is approximately a plane
• PCA applied on small regions

[Methods: Grassberger-Procaccia · box counting · a posteriori estimation · local PCA]
Michel Verleysen 19
Estimation of intrinsic dimensions 2/4
[Methods: Grassberger-Procaccia · box counting · a posteriori estimation · local PCA]
From P. Demartines, op. cit.
Michel Verleysen 20
Estimation of intrinsic dimensions 3/4
• Similar to box counting
• Mutual distances between pairs of points
• Advantage: N(N−1)/2 distances for N points
• log N versus log r graph: N(r) is the number of pairs of points closer than r; the slope of the graph estimates the intrinsic dimension

[Methods: Grassberger-Procaccia · box counting · a posteriori estimation · local PCA]
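A compact sketch of this estimator (the radius range and the test data are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(x, r_min, r_max, n_radii=20):
    """Slope of log C(r) versus log r, where C(r) is the fraction of the
    N(N-1)/2 point pairs whose distance is below r (Grassberger-Procaccia)."""
    d = np.sort(pdist(x))
    radii = np.logspace(np.log10(r_min), np.log10(r_max), n_radii)
    c = np.searchsorted(d, radii) / len(d)
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

# A 1-D curve embedded in 3-D: the estimate should be close to 1.
rng = np.random.default_rng(0)
t = rng.random(2000)
helix = np.c_[np.cos(4 * np.pi * t), np.sin(4 * np.pi * t), t]
print(correlation_dimension(helix, 0.01, 0.1))   # ~1
```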
Michel Verleysen 21
Estimation of intrinsic dimensions 4/4
• example: a forecasting problem

[Methods: Grassberger-Procaccia · box counting · a posteriori estimation · local PCA]

[Diagram: IN → Dimension reduction → Neural network → Performances OK? If no, loop back to the dimension reduction; if yes, OUT]
Michel Verleysen 22
Intrinsic dimension: limitation of the concept
• Seen from very close
• Seen from very far
(the estimated intrinsic dimension depends on the scale of observation)
Michel Verleysen 23
Contents
• High-dimensional data
  - Surprising results
  - Intrinsic dimension
• Local learning
  - Use of distance measures
• Dimension reduction
  - non-linear projection
  - application to time-series forecasting
Michel Verleysen 24
Local learning
• "Local": by means of local functions
• Typical example: Gaussian kernels
• Radial Basis Function Networks
• Local and global: ANNs are interpolators, and do not extrapolate to regions without learning data!
$$f(\mathbf{x}) = \sum_{j=1}^{P} w_j \, K_j(\mathbf{x}), \qquad K_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_j\|^2}{h_j^2}\right)$$
Michel Verleysen 25
Radial-Basis Function Networks (RBFN) for approximation
• Advantages (over MLP, …):
  - split computation of:
    - centres c_j
    - widths h_j
    - multiplying factors w_j
  - easier learning

$$f(\mathbf{x}) = \sum_{j=1}^{P} w_j \, K_j(\mathbf{x}), \qquad K_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_j\|^2}{h_j^2}\right)$$
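A minimal sketch of this split learning scheme (centres drawn from the data, a single global width, weights by linear least squares; all settings here are illustrative assumptions):

```python
import numpy as np

def rbf_design(x, centres, h):
    """Matrix of kernel activations K_j(x_i) = exp(-||x_i - c_j||^2 / h^2)."""
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, (200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(200)

centres = x[rng.choice(len(x), 15, replace=False)]   # step 1: centres
h = 0.8                                              # step 2: widths (heuristic)
phi = rbf_design(x, centres, h)
w, *_ = np.linalg.lstsq(phi, y, rcond=None)          # step 3: weights, closed form
y_hat = rbf_design(x, centres, h) @ w
```

Because the weights appear linearly, the last step is a simple least-squares problem once the centres and widths are fixed, which is the "easier learning" claimed above.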
Michel Verleysen 26
Local learning 2/2
• A sum of (suitably shifted and signed) sigmoids approximates a Gaussian bump: even an MLP can behave locally!
Michel Verleysen 27
Problem with (local?) learning
• Most ANNs use distances between input and weight vectors:
  - RBFN: ||x − c_i|| as argument to the radial kernels
  - VQ and SOM: ||x − c_i|| to choose the "winner"
  - first layer of an MLP: scalar products x · w
• In high-dimensional spaces, all these distances seem identical!
(concentration of measure phenomenon)
• → need for:
  - other neural networks
  - the same neural networks, with other distance measures
Michel Verleysen 28
Using new distance measures
• Example: RBF
• Use of "super-Gaussian" kernels:

$$K_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_j\|^r}{h_j^r}\right)$$

instead of the standard Gaussian kernels

$$f(\mathbf{x}) = \sum_{j=1}^{P} w_j \, K_j(\mathbf{x}), \qquad K_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_j\|^2}{h_j^2}\right)$$

[Figure: two kernel profiles plotted on 0 to 14, comparing the standard Gaussian with a super-Gaussian kernel]
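A tiny sketch of such a kernel (the helper name and the exponent values are assumptions for illustration):

```python
import numpy as np

def super_gaussian_kernel(x, c, h, r):
    """exp(-||x - c||^r / h^r); r = 2 recovers the Gaussian kernel."""
    return np.exp(-np.linalg.norm(x - c, axis=-1) ** r / h ** r)

radii = np.array([[0.5], [1.0], [1.5]])
c = np.zeros(1)
print(super_gaussian_kernel(radii, c, 1.0, 2))   # standard Gaussian
print(super_gaussian_kernel(radii, c, 1.0, 6))   # flatter top, sharper decay
```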
Michel Verleysen 29
Not assuming evidence…
• The widths of the kernels in an RBFN are not necessarily equal to the standard deviation of the samples in the Voronoi zone!

$$\sigma_j = \mathrm{STDV}(\text{points in the Voronoi zone of } \mathbf{c}_j)$$

$$K_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_j\|^2}{(q\,\sigma_j)^2}\right)$$

[Figure: RBF error (0 to 16) versus the width factor q (0 to 6), showing that the best q is not necessarily 1]
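A sketch of this width heuristic (a hypothetical helper; empty Voronoi zones are given a zero width for simplicity):

```python
import numpy as np

def voronoi_widths(x, centres, q):
    """h_j = q * sigma_j, with sigma_j the standard deviation of the samples
    in the Voronoi zone of centre c_j; q must be tuned, it is not 1 a priori."""
    nearest = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    sigma = np.array([x[nearest == j].std() if (nearest == j).any() else 0.0
                      for j in range(len(centres))])
    return q * sigma
```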
Michel Verleysen 30
Contents
• High-dimensional data
  - Surprising results
  - Intrinsic dimension
• Local learning
  - Use of distance measures
• Dimension reduction
  - non-linear projection
  - application to time-series forecasting
Michel Verleysen 31
Dimension reduction
reduced # inputs → reduced # model parameters → reduced overfitting
Michel Verleysen 32
Dimension reduction & intrinsic dimension
• Questions:
  - the intrinsic dimension is unknown
  - non-linear submanifolds
Michel Verleysen 33
Kohonen maps
• Based on topology preservation
Michel Verleysen 34
Distance-preservation methods
• Many non-linear projection methods are based on distance preservation between input and output pairs
• Examples:
  - Multi-dimensional scaling (MDS)
  - Sammon's mapping
  - Curvilinear Component Analysis (CCA)
• Principle: place points in the output space so that pairwise distances are as equal as possible to the corresponding pairwise distances in the input space
  - Impossible to respect all pairwise distances → insist on small ones (local pairs)
Michel Verleysen 35
Curvilinear Component Analysis
• Principle: a non-linear projection from the data in the d-dimensional original space to data in a q-dimensional state space, with q < d
• respect of the mutual distances between pairs of points:

$$E = \sum_{i=1}^{N} \sum_{j=1}^{N} \left(X_{ij} - Y_{ij}\right)^2 F(Y_{ij}, \lambda)$$

where X_ij and Y_ij are the pairwise distances in the original and state spaces, and F is a decreasing function of Y_ij that favours the preservation of small (local) distances.
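A direct transcription of this cost (a sketch; the step-function weighting F(Y_ij, λ) = 1 when Y_ij < λ is one common choice, assumed here):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cca_cost(x_high, y_low, lam):
    """E = sum_ij (X_ij - Y_ij)^2 * F(Y_ij, lambda), with F a step function
    that only penalizes pairs that are close in the projected space."""
    X = pdist(x_high)          # pairwise distances in the original space
    Y = pdist(y_low)           # pairwise distances in the projected space
    return float(np.sum((X - Y) ** 2 * (Y < lam)))
```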
Michel Verleysen 36
Adjusting parameters in CCA
• Need for a curvilinear distance:
  - VQ → centroids
  - linking the centroids
  - measuring the distances via the links
• In practice, a generalized distance mixing the Euclidean distance X_ij with the curvilinear distance δ_ij:

$$\Delta_{ij}(t) = \big(1 - \omega(t)\big)\, X_{ij} + \omega(t)\, \delta_{ij}$$
Michel Verleysen 37
CCA: examples
Michel Verleysen 38
Application of dimension reduction to forecasting tasks
• Regressor:
  - past values x(t−i)
  - exogenous data in(j)
• Forecasting:

$$x(t+1) = f\big(x(t), x(t-1), \ldots, x(t-k),\; in(1), in(2), \ldots, in(l)\big)$$

• Non-linear forecasting:
  1. optimise the regressor on a linear predictor
  2. use the same regressor with a non-linear predictor f
  - trial and error (computational load!)
Michel Verleysen 39
Forecasting: selection of input variables
• Starting with many input variables, then reduce their number
• Two options:
  1. selection of input variables
     + interpretability
     − limited to existing variables
  2. projection of input variables
     - linear: PCA
     - non-linear: CCA, Kohonen maps, etc., based on Takens' theorem
Michel Verleysen 40
Forecasting: Takens' theorem 1/2
p Takens’ theorem:q ≤ size of regressor ≤ 2q+1(AR model)
time seriesintrinsic dimension (q) = 1
t
x(t)
x(t)
x(t-1)
x(t)
x(t-2)
x(t-1)
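The regressor construction behind the figure is a simple delay embedding; a sketch:

```python
import numpy as np

def delay_embedding(x, m):
    """Rows (x(t), x(t-1), ..., x(t-m+1)): the size-m regressor of the series."""
    return np.column_stack([x[m - 1 - i:len(x) - i] for i in range(m)])

t = np.linspace(0, 20 * np.pi, 2000)
series = np.sin(t)                    # a series with intrinsic dimension q = 1
emb = delay_embedding(series, 3)      # its image is a closed 1-D curve in 3-D
```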
Michel Verleysen 41
Forecasting: Takens' theorem 2/2
• Takens' theorem: q ≤ size of the regressor ≤ 2q + 1
• In the (2q+1)-dimensional space, there exists a q-dimensional surface without self-intersections
• Projection from dimension 2q+1 to dimension q is possible!
Michel Verleysen 42
Forecasting: 1st example 1/2
• Artificial series:

$$x(t+1) = a\,x(t) + b\,x^2(t-2) + \varepsilon(t)$$

→ two past values!
• Linear AR model:

[Figure: sum of errors (on 1000 samples) versus the AR order (1 to 10)]
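A sketch generating such a series (the coefficients a, b and the noise level are illustrative assumptions; the slide does not give them):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, noise = 0.6, 0.2, 0.05                    # assumed values, for illustration
x = np.zeros(1000)
for t in range(2, len(x) - 1):
    # x(t+1) = a*x(t) + b*x^2(t-2) + eps(t): only x(t) and x(t-2) enter the
    # dynamics, the "two past values" of the slide.
    x[t + 1] = a * x[t] + b * x[t - 2] ** 2 + noise * rng.standard_normal()
```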
Michel Verleysen 43
Forecasting: 1st example 2/2
• Non-linear AR model:
  - initial regressor: size = 6
  - intrinsic dimension: 2
  - CCA from dim = 6 to dim = 2
  - MLP on the 2-dim data

[Figure: sum of errors (on 1000 samples) versus the AR order (1 to 6), comparing the MLP without projection to the MLP after non-linear projection]
Michel Verleysen 44
Forecasting: 2nd example 1/2
• Daily returns of the BEL20 index
• 42 indicators built from the inputs and exogenous variables:
  - returns: x_t, x_{t−10}, x_{t−20}, x_{t−40}, …, y_t, y_{t−10}, …
  - differences of returns: x_t − x_{t−5}, x_{t−5} − x_{t−10}, …, y_t − y_{t−5}, …
  - oscillators: K(20), K(40), …
  - moving averages: MM(10), MM(50), …
  - exponential moving averages: MME(10), MME(50), …
  - etc.

[Figure: the daily returns of the BEL20 index over about 2500 days, ranging between −0.025 and 0.025]
Michel Verleysen 45
Forecasting: 2nd example 2/2
• Method:
  - 42 indicators
  - PCA → 25 variables
  - Grassberger-Procaccia: intrinsic dimension = 9
  - CCA → 9 variables
  - RBF → forecasting
• Result: % of correct approximations of the sign (90-day average)
• On average: 57.2% on the test set

[Figure: percentage of correct sign predictions (0 to 90) over about 1600 test days]
Michel Verleysen 46
Conclusion
• High dimensions: > 3!
• NN in general: also difficulties in high dimensions
• Common problems:
  - Euclidean distance
  - empty space phenomenon
• Towards solutions…
  1. Local NN (RBFN, etc.): easier learning
  2. Generic methods for non-linear projections: reduce the dimension
• Open perspectives:
  - using dimension reduction techniques based on topology (and not on distances)
  - studying the possibility of non-Euclidean distances and non-Gaussian kernels