Neuronal Goal
Transform n-dimensional vectors into m-dimensional vectors, with m < n.
Example: transform from 2 to 1 dimension.
We look for axes which minimise projection errors and maximise the variance after projection
Algorithm (cont'd)
Preserve as much of the variance as possible.
Figure: rotate, then project; the retained direction carries more information (variance), the discarded one less.
Linear transformations – example
2D vectors X in a unit circle with mean (1,1); Y = AX, A = 2x2 matrix
The shape is elongated, rotated and the mean is shifted.
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
Invariant distances
Euclidean distance is not invariant to general linear transformations
This is invariant only for orthonormal matrices, AᵀA = I, that make rigid rotations, without stretching or shrinking distances.
Idea: standardize the data in some way to create invariant distances.
$$\mathbf{Y} = \mathbf{A}\mathbf{X}$$
$$\|\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\|^2 = \left(\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\right)^T\left(\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\right) = \left(\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\right)^T \mathbf{A}^T\mathbf{A}\left(\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\right)$$
Data standardization
For each component of the data vectors X^(j)T = (X_1^(j), ..., X_d^(j)), j = 1..n, calculate the mean and std; n – number of vectors, d – their dimension.
$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)}; \qquad \bar{\mathbf{X}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}^{(j)}$$

Vector of mean feature values: averages over the rows of the data matrix

$$\mathbf{X} = \begin{pmatrix}
X_1^{(1)} & X_1^{(2)} & \cdots & X_1^{(n)} \\
X_2^{(1)} & X_2^{(2)} & \cdots & X_2^{(n)} \\
\vdots & \vdots & & \vdots \\
X_d^{(1)} & X_d^{(2)} & \cdots & X_d^{(n)}
\end{pmatrix}$$
Standard deviation
Calculate standard deviation:
Transform X => Z, standardized data vectors
$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)} \qquad \text{(vector of mean feature values)}$$
$$\sigma_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2$$

Variance = square of the standard deviation (std): the sum of all squared deviations from the mean value.

$$Z_i^{(j)} = \left(X_i^{(j)} - \bar{X}_i\right)\big/\,\sigma_i$$
Std data
Std data: zero mean and unit variance.
Standardize the data after making the data transformation.
Effect: the data becomes invariant to scaling only (diagonal transformations).
Distances are invariant, and the data distribution is the same?
How to make data invariant to any linear transformations?
$$\bar{Z}_i = \frac{1}{n}\sum_{j=1}^{n} Z_i^{(j)} = \frac{1}{n\,\sigma_i}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right) = 0$$
$$\sigma^2_{Z,i} = \frac{1}{n-1}\sum_{j=1}^{n}\left(Z_i^{(j)} - \bar{Z}_i\right)^2 = \frac{1}{(n-1)\,\sigma_i^2}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2 = 1$$
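To make the standardization step concrete, here is a minimal NumPy sketch; the toy data matrix `X` (d x n, features in rows) and the use of `ddof=1` are illustrative assumptions, not part of the original slides:

```python
import numpy as np

# Toy data matrix: d features (rows) x n vectors (columns), as on the slides.
rng = np.random.default_rng(0)
X = rng.normal(loc=[[3.0], [7.0]], scale=[[2.0], [0.5]], size=(2, 100))

X_mean = X.mean(axis=1, keepdims=True)          # vector of mean feature values
X_std = X.std(axis=1, ddof=1, keepdims=True)    # per-feature standard deviation

Z = (X - X_mean) / X_std                        # Z_i^(j) = (X_i^(j) - mean_i) / std_i

print(Z.mean(axis=1))          # ~0 for every feature
print(Z.std(axis=1, ddof=1))   # ~1 for every feature
```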
Terminology (Covariance)
• How two dimensions vary from the mean with respect to each other
$$\mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n-1}$$
cov(X,Y) > 0: dimensions increase together
cov(X,Y) < 0: one increases, the other decreases
cov(X,Y) = 0: dimensions are uncorrelated (not necessarily independent)
Terminology (Covariance Matrix)
• Contains covariance values between all possible dimensions:
$$C_{n \times n} = \left(c_{ij} \;\middle|\; c_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)\right)$$
• Example for three dimensions (x, y, z) (always symmetric):
$$C = \begin{pmatrix}
\mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\
\mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\
\mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z)
\end{pmatrix}$$

cov(x,x) is the variance of component x.
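A small sketch of how such a covariance matrix can be computed numerically; the three toy dimensions are made up for illustration, and `np.cov` uses the same 1/(n-1) normalization as the formula above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 0.8 * x + 0.2 * rng.normal(size=n)   # correlated with x
z = rng.normal(size=n)                    # independent of x and y

data = np.vstack([x, y, z])               # 3 x n, one row per dimension

C = np.cov(data)                          # 3 x 3 covariance matrix, symmetric
print(C)                                  # diagonal: variances; off-diagonal: covariances
```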
Properties of the Cov matrix
• Can be used for creating a distance that is not sensitive to linear transformations
• Can be used to find directions which maximize the variance
• Determines a Gaussian distribution uniquely (up to a shift)
Data standardization example
For our example Y = AX, assuming the X means = 1 and variances = 1:
Transformation:
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

Vectors of mean feature values:
$$\bar{\mathbf{X}} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \bar{\mathbf{Y}} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$

Variances:
$$\boldsymbol{\sigma}_X = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \boldsymbol{\sigma}_Y^2 = \mathrm{Diag}\left(\mathbf{A}\mathbf{A}^T\right) = \begin{pmatrix} 5 \\ 2 \end{pmatrix} \quad \text{(check it!)}$$

$$\|\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\|^2 = \left(\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\right)^T \mathbf{A}^T\mathbf{A}\left(\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\right)$$

How to make this invariant?
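The "check it!" above can be verified numerically; a minimal sketch under the slide's assumptions (mean of X equal to (1, 1), unit variances, uncorrelated components):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# Mean of Y = A @ mean of X, with mean(X) = (1, 1):
print(A @ np.array([1.0, 1.0]))       # [3. 2.]  -> matches the slide

# With Cov(X) = I (unit variances, uncorrelated), Cov(Y) = A Cov(X) A^T = A A^T:
print(np.diag(A @ A.T))               # [5. 2.]  -> variances of Y1, Y2
```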
Covariance matrix
Variance (spread around mean value) + correlation between features.
where X is d x n dimensional matrix of vectors shifted to their means.
Covariance matrix is symmetric Cij = Cji and positive definite.
Diagonal elements are variances (squares of the std's): σ_i² = C_ii.
$$C_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}\left(X_i^{(k)} - \bar{X}_i\right)\left(X_j^{(k)} - \bar{X}_j\right); \qquad i, j = 1 \ldots d$$
$$\mathbf{C}_X = \frac{1}{n-1}\sum_{k=1}^{n}\left(\mathbf{X}^{(k)} - \bar{\mathbf{X}}\right)\left(\mathbf{X}^{(k)} - \bar{\mathbf{X}}\right)^T = \frac{1}{n-1}\,\mathbf{X}\mathbf{X}^T$$

Pearson correlation coefficient:
$$r_{ij} = C_{ij}\,/\,(\sigma_i\,\sigma_j) \in [-1, +1]$$
A spherical distribution of the data has C_X = I (unit matrix).
Elongated ellipsoids: large off-diagonal elements, strong correlations between features.
C_X is d × d.
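To connect C_X and the Pearson coefficients r_ij in code, one can normalize the covariance matrix by the std's; a short sketch with made-up two-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.5], [1.5, 1.0]], size=1000).T

C = np.cov(data)                          # covariance matrix C_X (d x d)
sigma = np.sqrt(np.diag(C))               # per-feature std's
R = C / np.outer(sigma, sigma)            # Pearson correlations r_ij = C_ij / (sigma_i * sigma_j)

print(R)                                  # entries in [-1, 1], diagonal = 1
print(np.corrcoef(data))                  # same result via NumPy's built-in
```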
Mahalanobis distance
Linear combinations of features lead to rotations and scaling of the data.
Mahalanobis distance:
$$\|\mathbf{X}\|^2_{\mathbf{C}_X} = \mathbf{X}^T \mathbf{C}_X^{-1}\,\mathbf{X}$$
is invariant to linear transformations:
$$\bar{\mathbf{Y}} = \mathbf{A}\bar{\mathbf{X}}; \qquad \mathbf{Y} = \mathbf{A}\mathbf{X}; \qquad \mathbf{C}_Y = \mathbf{A}\,\mathbf{C}_X\,\mathbf{A}^T$$
$$\|\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\|^2_{\mathbf{C}_Y}
= \left(\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\right)^T \mathbf{C}_Y^{-1}\left(\mathbf{Y}^{(1)}-\mathbf{Y}^{(2)}\right)
= \left(\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\right)^T \mathbf{A}^T\left(\mathbf{A}\,\mathbf{C}_X\,\mathbf{A}^T\right)^{-1}\mathbf{A}\left(\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\right)
= \|\mathbf{X}^{(1)}-\mathbf{X}^{(2)}\|^2_{\mathbf{C}_X}$$
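A minimal sketch of the Mahalanobis distance and its invariance under Y = AX; the covariance and mixing matrices below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 2.0]], size=500).T   # d x n
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
Y = A @ X

def mahalanobis2(u, v, C):
    """Squared Mahalanobis distance (u - v)^T C^{-1} (u - v)."""
    d = u - v
    return d @ np.linalg.solve(C, d)

C_X = np.cov(X)
C_Y = np.cov(Y)

# The distance between the first two data points is the same in X- and Y-space:
print(mahalanobis2(X[:, 0], X[:, 1], C_X))
print(mahalanobis2(Y[:, 0], Y[:, 1], C_Y))
```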
Principal components
How to avoid correlated features?
Correlations between features ⇒ the covariance matrix is non-diagonal!
Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features; the columns of Z are the eigenvectors of C_X.
C – symmetric, positive definite matrix: XᵀCX > 0 for ‖X‖ > 0;
its eigenvectors are orthonormal,
its eigenvalues are all non-negative.
Z – matrix of orthonormal eigenvectors (because C_X is real and symmetric); it transforms X into Y, with diagonal C_Y, i.e. decorrelated features.
$$\mathbf{C}_X\,\mathbf{Z}^{(i)} = \lambda_i\,\mathbf{Z}^{(i)}; \qquad \mathbf{Z}^{(i)T}\mathbf{Z}^{(j)} = \delta_{ij}; \qquad \lambda_i \ge 0$$
$$\mathbf{Y} = \mathbf{Z}^T\mathbf{X}; \qquad \mathbf{C}_Y = \mathbf{Z}^T\mathbf{C}_X\,\mathbf{Z} = \mathbf{Z}^T\mathbf{Z}\,\boldsymbol{\Lambda} = \boldsymbol{\Lambda}$$

In matrix form: X and Y are d × n; Z, C_X and C_Y are d × d.
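A short numerical sketch of this diagonalization using NumPy's symmetric eigensolver; the three-dimensional toy data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0, 0],
                            [[4.0, 2.0, 0.5],
                             [2.0, 3.0, 0.3],
                             [0.5, 0.3, 1.0]], size=1000).T     # d x n, roughly zero-mean

C_X = np.cov(X)
lam, Z = np.linalg.eigh(C_X)        # eigenvalues (ascending) and orthonormal eigenvectors
lam, Z = lam[::-1], Z[:, ::-1]      # reorder: largest variance first

Y = Z.T @ X                          # decorrelated features
C_Y = np.cov(Y)
print(np.round(C_Y, 3))              # ~diagonal, with the eigenvalues lam on the diagonal
```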
Matrix form
Eigenproblem for the C matrix in matrix form, C_X Z = ZΛ:

$$\begin{pmatrix}
C_{11} & C_{12} & \cdots & C_{1d} \\
C_{21} & C_{22} & \cdots & C_{2d} \\
\vdots & \vdots & & \vdots \\
C_{d1} & C_{d2} & \cdots & C_{dd}
\end{pmatrix}
\begin{pmatrix}
Z_{11} & Z_{12} & \cdots & Z_{1d} \\
Z_{21} & Z_{22} & \cdots & Z_{2d} \\
\vdots & \vdots & & \vdots \\
Z_{d1} & Z_{d2} & \cdots & Z_{dd}
\end{pmatrix}
=
\begin{pmatrix}
Z_{11} & Z_{12} & \cdots & Z_{1d} \\
Z_{21} & Z_{22} & \cdots & Z_{2d} \\
\vdots & \vdots & & \vdots \\
Z_{d1} & Z_{d2} & \cdots & Z_{dd}
\end{pmatrix}
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_d
\end{pmatrix}$$
Principal components
PCA: an old idea – C. Pearson (1901), H. Hotelling (1933).
Result: PC are linear combinations of all features, providing new uncorrelated features, with diagonal covariance matrix = eigenvalues.
$$\mathbf{Y} = \mathbf{Z}^T\mathbf{X}; \qquad \mathbf{C}_Y = \mathbf{Z}^T\mathbf{C}_X\,\mathbf{Z} = \boldsymbol{\Lambda}$$
$$\mathbf{C}_X = \mathbf{Z}\,\boldsymbol{\Lambda}\,\mathbf{Z}^T$$
Small λ_i ⇒ small variance ⇒ the data change little in direction Y_i.
PCA minimizes the reconstruction error of the C matrix: the Z_i vectors for large λ_i are sufficient to reconstruct it, because vectors with small eigenvalues contribute very little to the covariance matrix.
The principal components are the vectors X transformed using the eigenvectors of C_X.
The covariance matrix of the transformed vectors is diagonal ⇒ ellipsoidal distribution of the data.
Two components for visualization
New coordinate system: axes ordered according to variance = size of the eigenvalue.
The first k dimensions account for a fraction

$$V_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$$

of the total variance (note that the λ_i are variances); frequently 80–90% is sufficient for a rough description.
Diagonalization methods: see Numerical Recipes, www.nr.com
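The fraction V_k is easy to compute from the eigenvalues; a small sketch, assuming `lam` holds eigenvalues sorted in decreasing order (the numbers are made up):

```python
import numpy as np

lam = np.array([4.9, 2.1, 0.7, 0.2, 0.1])      # example eigenvalues (variances), largest first

V = np.cumsum(lam) / np.sum(lam)                # V_k = sum_{i<=k} lam_i / sum_i lam_i
k = int(np.searchsorted(V, 0.90)) + 1           # smallest k capturing at least 90% of the variance
print(V)                                         # cumulative variance fractions
print(k)                                         # number of components to keep
```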
Solving for Eigenvalues & Eigenvectors
• Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.
• Ax = λx ⇔ (A − λI)x = 0
• How to calculate x and λ:
– calculate det(A − λI); this yields a polynomial of degree n
– determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ
– solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
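For instance, a small worked 2×2 example of this procedure (the matrix is chosen only for illustration):

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3)$$

so λ₁ = 3 and λ₂ = 1; solving (A − λI)x = 0 gives the eigenvectors x₁ ∝ (1, 1)ᵀ and x₂ ∝ (1, −1)ᵀ.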
PCA properties
PC Analysis (PCA) may be achieved by:
• transformation making covariance matrix diagonal
• projecting the data on a line for which the sum of squares of distances from the original points to their projections is minimal.
• orthogonal transformation to new variables that have stationary variances
True covariance matrices are usually not known, estimated from data.
This works well on single-cluster data; more complex structure may require local PCA, separately for each cluster.
PCA is useful for: finding new, more informative, uncorrelated features;
reducing dimensionality: rejecting low-variance features;
reconstructing covariance matrices from low-dimensional data.
PCA Wisconsin example
Wisconsin Breast Cancer data:
• Collected at the University of Wisconsin Hospitals, USA.
• 699 cases, 458 (65.5%) benign (red), 241 malignant (green).
• 9 features: quantized 1, 2 .. 10, cell properties, ex:
Clump Thickness, Uniformity of Cell Size, Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei,
Bland Chromatin, Normal Nucleoli, Mitoses.
2D scatterograms do not show any structure no matter which subspaces are taken!
Example cont.
PCA gives useful information already in 2D.
Taking the first PCA component of the standardized data:
if (Y_1 > 0.41) then benign, else malignant;
18 errors / 699 cases ⇒ 97.4% accuracy.
The transformed vectors are not standardized (the std's are given in the original figure).
The eigenvalues converge slowly, but the classes are separated well.
PCA disadvantages
Useful for dimensionality reduction, but:
• Largest variance determines which components are used, but does not guarantee interesting viewpoint for clustering data.
• The meaning of features is lost when linear combinations are formed.
Analysis of coefficients in Z1 and other important eigenvectors may show which original features are given much weight.
PCA may be also done in an efficient way by performing singular value decomposition of the standardized data matrix.
PCA is also called the Karhunen-Loève transformation.
Many variants of PCA are described in A. Webb, Statistical pattern recognition, J. Wiley 2002.
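As noted above, PCA can also be obtained from the SVD of the (centered or standardized) data matrix; a minimal NumPy sketch with an illustrative d × n data matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 200))                    # d x n raw data (illustrative)

Z = X - X.mean(axis=1, keepdims=True)            # center (optionally also divide by std)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Columns of U are the principal directions; singular values relate to eigenvalues of C_X:
lam = s**2 / (Z.shape[1] - 1)
Y = U.T @ Z                                      # principal components of the data

print(lam)
print(np.allclose(np.linalg.eigvalsh(np.cov(X))[::-1], lam))   # same spectrum as C_X
```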
Exercise (will be part of Ex. 1)
• How would you efficiently calculate the PCA of data where the dimensionality d is much larger than the number of vector observations n?
2 skewed distributions
PCA transformation for 2D data:
First component will be chosen along the largest variance line, both clusters will strongly overlap, no interesting structure will be visible.
In fact, projection onto the axis orthogonal to the first PCA component has much more discriminating power.
Discriminant coordinates should be used to reveal class structure.
Projection Pursuit
Projection Pursuit (PP)
PCA and FDA are linear; PP may be linear or non-linear.
Find interesting “criterion of fit”, or “figure of merit” function,
that allows for low-dim (usually 2D or 3D) projection.
Interesting indices may use a priori knowledge about the problem:
1. mean nearest neighbor distance – increase clustering of Y^(j)
2. maximize mutual information between classes and features
3. find projections that have non-Gaussian distributions.
The last index does not use a priori knowledge; it leads to Independent Component Analysis (ICA). ICA features are not only uncorrelated, but also independent.
$$\mathbf{Y}^{(j)T} = \left(Y_1^{(j)}, Y_2^{(j)}\right); \qquad \mathbf{Y}^{(j)} = f\left(\mathbf{X}^{(j)};\mathbf{W}\right) \quad \text{– general transformation with parameters } \mathbf{W}$$
$$I(\mathbf{Y};\mathbf{W}) \quad \text{– index of “interestingness”}$$
Kurtosis
ICA is a special version of PP, recently very popular.
Gaussian distributions of a variable Y are characterized by 2 parameters:
mean value: $\bar{Y} = E\{Y\}$
variance: $\sigma^2(Y) = E\{(Y - E\{Y\})^2\}$
These are the first 2 moments of the distribution; all higher cumulants are 0 for G(Y).
One simple measure of non-Gaussianity of projections is the 4th moment (cumulant) of the distribution, called kurtosis, which measures the "peakedness" of the distribution. For E{Y} = 0 the kurtosis is:
$$\kappa_4(Y) = E\{Y^4\} - 3\left(E\{Y^2\}\right)^2$$
Super-Gaussian distributions have a long tail and a peak at zero, κ₄(y) > 0, like binary image data.
Sub-Gaussian distributions are more flat and have κ₄(y) < 0, like speech signal data.
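A quick numerical check of these kurtosis signs; the Laplace (super-Gaussian) and uniform (sub-Gaussian) samples are illustrative choices, not taken from the slides:

```python
import numpy as np

def kurtosis4(y):
    """Fourth cumulant kappa_4(Y) = E{Y^4} - 3 (E{Y^2})^2 for zero-mean Y."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

rng = np.random.default_rng(6)
print(kurtosis4(rng.normal(size=100_000)))            # ~0  (Gaussian)
print(kurtosis4(rng.laplace(size=100_000)))           # >0  (super-Gaussian: peaked, long tails)
print(kurtosis4(rng.uniform(-1, 1, size=100_000)))    # <0  (sub-Gaussian: flat)
```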
Correlation and independence
Features Y_i, Y_j are uncorrelated if the covariance matrix is diagonal, or:
$$E\{Y_i Y_j\} = E\{Y_i\}\,E\{Y_j\}$$
Uncorrelated features are orthogonal.
Statistically independent features Y_i, Y_j give, for any functions f_1, f_2:
$$E\{f_1(Y_i)\,f_2(Y_j)\} = E\{f_1(Y_i)\}\,E\{f_2(Y_j)\}$$
This is a much stronger condition than lack of correlation; in particular the functions may be powers of the variables. Any non-Gaussian distribution will still have statistically dependent features after the PCA transformation.
Variables are statistically independent if their joint probability distribution is a product of the probabilities for all variables:
$$p\left(X_1, X_2, \ldots, X_n\right) = \prod_{i=1}^{n} p\left(X_i\right)$$
PP/ICA example
Example: PCA and PP based on maximal kurtosis: note nice separation of the blue class.
Some remarks
• Many formulations of PP and ICA methods exist.
• PP is used for data visualization and dimensionality reduction.
• Nonlinear projections are frequently considered, but solutions are more numerically intensive.
• PCA may also be viewed as PP, maximizing the variance (for standardized data):
$$\mathbf{W}^{(1)} = \arg\max_{\mathbf{W}}\; E\left\{\left(\mathbf{W}^T\mathbf{X}\right)^2\right\}$$
Other components are found in the space orthogonal to W^(1)T X:
$$\mathbf{W}^{(k)} = \arg\max_{\mathbf{W}}\; E\left\{\left[\mathbf{W}^T\left(\mathbf{I} - \sum_{i=1}^{k-1}\mathbf{W}^{(i)}\mathbf{W}^{(i)T}\right)\mathbf{X}\right]^2\right\}$$
The same index is used, with projection onto the space orthogonal to the k−1 previous PCs.
The index I(Y;W) is based here on the maximum variance.
How do we find multiple projections?
• The statistical approach is complicated:
– perform a transformation on the data to eliminate structure in the already-found direction,
– then perform PP again.
• The neural computation approach: lateral inhibition.
ICA demos
• ICA has many applications in signal and image analysis.
• Finding independent signal sources allows for separation of signals from different sources, removal of noise or artifacts.
Observations X are a linear mixture W of unknown sources Y:
$$\mathbf{X} = \mathbf{W}^T\,\mathbf{Y}$$
Both W and Y are unknown! This is a blind separation problem. How can they be found?
If Y are independent components and W is a linear mixing, the problem is similar to FDA or PCA, only the criterion function is different.
Play with the ICALab PCA/ICA Matlab software for signal/image analysis: http://www.bsp.brain.riken.go.jp/page7.html
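A minimal blind-separation sketch. The slides point to the ICALab Matlab toolbox; as an assumption, scikit-learn's FastICA is used here instead, with two made-up toy sources:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]      # two independent sources (n_samples x 2)
S += 0.05 * rng.normal(size=S.shape)

W_mix = np.array([[1.0, 0.5],
                  [0.4, 1.0]])
X = S @ W_mix.T                                        # observed mixtures: each row is W_mix @ source

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                           # recovered sources (up to order/scale/sign)
print(ica.mixing_.shape)                               # estimated mixing matrix (2 x 2)
```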
ICA demo: images & audio
Example from Cichocki’s lab,
http://www.bsp.brain.riken.go.jp/page7.html
X space for images:
take the intensity of all pixels – one vector per image, or
take smaller patches (e.g. 64x64), increasing the number of vectors.
• 5 images: originals, mixed, convergence of ICA iterations
X space for signals:
sample the signal for some time t
• 10 songs: mixed samples and separated samples
Self-organization
PCA, FDA, ICA, PP are all inspired by statistics, although some neural-inspired methods have been proposed to find interesting solutions, especially for their non-linear versions.
• Brains learn to discover the structure of signals: visual, tactile, olfactory, auditory (speech and sounds).
• This is a good example of unsupervised learning: spontaneous development of feature detectors, compressing internal information that is needed to model environmental states (inputs).
• Some simple stimuli lead to complex behavioral patterns in animals; brains use specialized microcircuits to derive vital information from signals – for example, amygdala nuclei in rats sensitive to ultrasound signals signifying “cat around”.
Models of self-organization
SOM or SOFM (Self-Organized Feature Mapping) – self-organizing feature map, one of the simplest models.
How can such maps develop spontaneously?
Local neural connections: neurons interact strongly with those nearby, but weakly with those that are far (in addition inhibiting some intermediate neurons).
History:
– von der Malsburg and Willshaw (1976): competitive learning, Hebb mechanisms, "Mexican hat" interactions, models of visual systems.
– Amari (1980): models of continuous neural tissue.
– Kohonen (1981): simplification, no inhibition; leaving two essential factors – competition and cooperation.
Computational Intelligence: Methods and Applications
Lecture 8: Projection Pursuit & Independent Component Analysis
Włodzisław Duch, SCE, NTU, Singapore
Google: Duch

Computational Intelligence: Methods and Applications
Lecture 6: Principal Component Analysis
Włodzisław Duch, SCE, NTU, Singapore
http://www.ntu.edu.sg/home/aswduch