
Page 1:

S.Y. Kung

Princeton University

Kernel Approaches to Supervised/ Unsupervised Machine Learning*

*Short course organized by NTU and the IEEE Signal Processing Society, Taipei Section, June 29-30, 2009. For lecture notes, see the website of the IEEE Signal Processing Society, Taipei Section: http://ewh.ieee.org/r10/taipei/

Page 2:

Topics:

1. Kernel-Based Unsupervised/Supervised Machine Learning: Overview

2. Kernel-Based Representation Spaces

3. Kernel Spectral Analysis (Kernel PCA)

4. Kernel-Based Supervised Classifiers

5. Fast Clustering Methods for Positive-Definite Kernels (Kernel Component Analysis and the Kernel Trick)

6. Extension to Nonvectorial Data and NPD (Non-Positive-Definite) Kernels

Page 3:

Machine learning techniques are useful for data mining and pattern recognition when only pairwise relationships of the training objects are known, as opposed to the training objects themselves. Such a pairwise learning approach has special appeal for many practical applications, such as multimedia and bioinformatics, where a variety of heterogeneous sources of information are available. The primary information used in the kernel approach is either (1) the kernel matrix (K) associated with vectorial data or (2) the similarity matrix (S) associated with nonvectorial objects.

1. Overview

Page 4:

From Linear to Kernel SVMs

Subject to the constraints:

Wolfe optimization:

Linear inner-product objective function

Kernel inner-product objective function

Decision Function:
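The equations for this slide are not in the transcript; a minimal sketch of the standard formulation (an assumption consistent with the labels above, with a_i the Lagrange multipliers, y_i the teacher values, and C the usual upper bound):

Wolfe (dual) optimization with the linear inner-product objective function:

$$\max_{a}\; L(a) = \sum_{i=1}^{N} a_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j$$

subject to the constraints $\sum_i a_i y_i = 0$ and $0 \le a_i \le C$.

The kernel inner-product objective function simply replaces $\mathbf{x}_i^T \mathbf{x}_j$ by $K(\mathbf{x}_i, \mathbf{x}_j)$, and the decision function becomes

$$f(\mathbf{x}) = \sum_{i=1}^{N} a_i y_i K(\mathbf{x}_i, \mathbf{x}) + b.$$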

Page 5:

Why the Kernel Approach: A New Distance Metric

Traditionally, we have the linear dot-product for two vectors x and y:

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y},$$

with the corresponding squared distance

$$D^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\,\mathbf{x}^T\mathbf{y} = \langle \mathbf{x}, \mathbf{x} \rangle + \langle \mathbf{y}, \mathbf{y} \rangle - 2\,\langle \mathbf{x}, \mathbf{y} \rangle.$$

A new dot-product means a different distance (norm) metric in the new Hilbert space. In the kernel approach, the new distance is

$$\tilde{D}^2 = K(\mathbf{x}, \mathbf{x}) + K(\mathbf{y}, \mathbf{y}) - 2\,K(\mathbf{x}, \mathbf{y}).$$

Page 6:

Kernelization ⇒ Cornerization

Page 7:

Application to XOR Problem

[Figure: the four XOR training points, labeled teacher = +1, +1, -1, -1.]

Not linearly separable in X, but linearly separable in the kernel spaces, e.g. F, E, and K.

Page 8:

PCA vs. Kernel PCA

Page 9:

Such a pairwise learning approach has special appeal for many practical applications, such as multimedia and bioinformatics, where a variety of heterogeneous sources of information are available.

The learning techniques can be extended to situations where only pairwise relationships of the training objects are known, as opposed to the training objects themselves.

Why the Kernel Approach: nonvectorial data

Page 10:

Two Kinds of Kernel (or Similarity) Matrices

The primary information used in the kernel approach:

• Vectorial data: a PD kernel matrix, with a well-defined centroid/distance, e.g. a polynomial kernel or a Gaussian kernel. (The sigmoid kernel, rarely used, can be NPD.)

• Nonvectorial data: a similarity matrix, which may be an NPD kernel matrix.
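The kernel formulas themselves are not reproduced in the transcript; common forms (an assumption, with p, σ, κ, θ as user-chosen parameters) are

$$K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^T\mathbf{y})^p \ \text{(polynomial)},\qquad
K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right) \ \text{(Gaussian)},\qquad
K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\,\mathbf{x}^T\mathbf{y} + \theta) \ \text{(sigmoid, possibly NPD)}.$$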

Page 11:

(1) Kernel matrix (K) for vectorial data

Microarray gene expression data:

Kernel matrix for tissue samples:

Kernel matrix for genes:
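A minimal MATLAB sketch of these two kernel matrices, under the assumption that a linear kernel is used and that the rows of the expression matrix G are genes and its columns are tissue samples (the matrix G itself is hypothetical):

% Minimal sketch (assumptions: linear kernel; rows of G are genes, columns are samples).
G = randn(2000, 96);          % hypothetical 2000-gene x 96-sample expression matrix
K_samples = G' * G;           % 96 x 96 kernel matrix for the tissue samples
K_genes   = G  * G';          % 2000 x 2000 kernel matrix for the genes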

Page 12:

Hierarchical clustering of gene expression data: the grouping of genes (rows) and of normal and malignant lymphocytes (columns). Depicted are the 1.8 million measurements of gene expression from 128 microarray analyses of 96 samples of normal and malignant lymphocytes.

Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature 403, 503-511 (2000).

Microarray Data Matrix

Page 13:

(2) Similarity matrix (S) for nonvectorial objects.

Page 14:

Topics:

1. Kernel-Based Unsupervised/Supervised Machine Learning: Overview

2. Kernel-Based Representation Spaces
• 2.1 Basic Kernel Hilbert Space
• 2.2 Spectral Space
• 2.3 Empirical Space
• 2.4 Linear Mapping Between Spaces

3. Kernel Spectral Analysis (Kernel PCA)

4. Kernel-Based Supervised Classifiers

5. Fast Clustering Methods for Positive-Definite Kernels (Kernel Component Analysis and the Kernel Trick)

6. Extension to Nonvectorial Data and NPD Kernels

Page 15:

Kernel Based Representation Spaces

New vector space:

“Linear” discriminant function:

Page 16:

Polynomial Kernel Function

New vector space:

“Linear” discriminant function:
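The explicit forms are missing from the transcript; a minimal sketch for the second-order polynomial kernel in two dimensions (an assumption consistent with the XOR example that follows):

$$K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^T\mathbf{y})^2 = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{y}),\qquad
\boldsymbol{\phi}(\mathbf{x}) = \big[\,1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\,\big]^T,$$

so the new vector space has finite dimension J = 6, and the "linear" discriminant function is $f(\mathbf{x}) = \mathbf{u}^T \boldsymbol{\phi}(\mathbf{x}) + b$.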

Page 17:

XOR Training Dataset

[Figure: the four XOR training points with teacher values -1, +1, -1, +1.]

Page 18:

The representation vector space has the following properties:

•It is a Hilbert space endowed with dot-product.

•It has non-orthogonal bases with a finite dimension, J.

•It is independent of the training dataset.

Page 19:

Kernel and Basis Functions / Orthogonal Basis Functions

Assume K(x, y) is positive definite (PD) [21], i.e., the eigenvalues satisfy λ'_k ≥ 0.

Page 20:

These basis functions form the coordinate bases of the representation vector space with the dot-product:

2.1 Basic Kernel Hilbert Space
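The expansion itself is not reproduced in the transcript; a minimal sketch of the standard Mercer decomposition that these slides rely on (using the eigenvalues λ'_k above and orthonormal eigenfunctions ψ_k, which are assumptions of this sketch):

$$K(\mathbf{x}, \mathbf{y}) = \sum_{k} \lambda'_k\, \psi_k(\mathbf{x})\, \psi_k(\mathbf{y}) = \sum_{k} \phi_k(\mathbf{x})\, \phi_k(\mathbf{y}),\qquad \phi_k(\mathbf{x}) = \sqrt{\lambda'_k}\,\psi_k(\mathbf{x}),$$

so the basis functions φ_k form the coordinate bases of the representation space, with the dot-product $\langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y})$.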

Page 21:

•It is a Hilbert space endowed with a dot-product .

•It has ordered orthogonal bases with indefinite dimension, and

•It is independent of the training dataset.

Basic Kernel Hilbert Space: F

It has the following properties:

Page 22:

Kernel Matrix

Two Factorizations of the Kernel Matrix:

(1) F: Basic Kernel Hilbert Space

(2) E: Kernel Spectral Space
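The two factorizations are shown only graphically in the original; a minimal sketch (an assumption consistent with the numerical session on Page 76, where E = Λ^{1/2}Uᵀ):

$$\text{(1)}\quad \mathbf{K} = \boldsymbol{\Phi}^T \boldsymbol{\Phi}, \qquad \boldsymbol{\Phi} = [\,\boldsymbol{\phi}(\mathbf{x}_1)\ \cdots\ \boldsymbol{\phi}(\mathbf{x}_N)\,] \quad\text{(via the basic kernel Hilbert space F)}$$

$$\text{(2)}\quad \mathbf{K} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T = \mathbf{E}^T \mathbf{E}, \qquad \mathbf{E} = \boldsymbol{\Lambda}^{1/2}\mathbf{U}^T \quad\text{(via the kernel spectral space E)}$$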

Page 23:

Factorization: Basic Kernel Hilbert Space

Page 24:

Kernel Spectral Factorization

Apply the eigenvalue decomposition to the kernel matrix K:

[Figure: decomposition of K into spectral components 1, 2, ..., N.]

Page 25:

2.2 Spectral Vector Space: E

Page 26:

Spectral Space: E

It has the following properties:

• It is a Hilbert space endowed with the conventional (linear) dot-product.
• It has a finite dimension, N, with ordered orthogonal bases.
• It is dependent on the training dataset. More precisely, E should be referred to as the kernel spectral space with respect to the training dataset.

Page 27:

The Twin Spaces

F: Basic Kernel Hilbert Space
E: Kernel Spectral Space

Page 28:

The limiting case of KCA (i.e., with an indefinite number of training patterns covering the entire vector space R^N):

F: Basic Kernel Hilbert Space
E: Kernel Spectral Space

When N is extremely large, with the training dataset uniformly distributed over the entire R^N, then

$$\lambda'_1 = \lambda_1,\quad \lambda'_2 = \lambda_2,\quad \ldots,\quad \lambda'_N = \lambda_N.$$

Page 29:

Preservation of Dot-Products

F E

K-Means on E

Page 30:

The spectral space E is effective for dimension reduction.

Note that, although both kernel Hilbert spaces (F and E) have ordered orthogonal bases, the latter is more effective for dimension reduction since it takes into account the distribution of the training dataset.

e.g. Spectral K-Means on E

Given the equivalence between the two spaces, the (same) supervised/unsupervised learning results can be (more easily) computed via the finite-dimensional spectral space, as opposed to the indefinite-dimensional space F.
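A minimal MATLAB sketch of this spectral route (assumptions: K is an N x N PD kernel matrix already computed, two clusters, and the reduced dimension m is a user choice; kmeans is the Statistics and Machine Learning Toolbox routine):

% Spectral K-means sketch: ordinary K-means run in the (reduced) spectral space E.
[U, Lambda] = eig(K);                         % K = U*Lambda*U'
[lam, idx]  = sort(diag(Lambda), 'descend');  % order the spectral components
U = U(:, idx);
E = diag(sqrt(lam)) * U';                     % columns of E are the training vectors in E
m = 2;                                        % keep only the m leading components
labels = kmeans(E(1:m, :)', 2);               % cluster the N reduced vectors into 2 clusters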

Page 31:

XOR Dataset

Empirical Vector

Page 32:

diagonalization

Kernel Matrix

XOR Dataset

spectral factorization

Page 33:

XOR Dataset

Spectral Space: E / Empirical Vector

[Figure: diagonalization of the kernel matrix into ordered orthogonal components 1, 2, ..., N.]

Page 34:

Now verify, e.g., for the two training vectors

$$\begin{bmatrix} -1.41 \\ -1.41 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 1.41 \\ 1.41 \end{bmatrix} \qquad (\mathbf{x}_1,\ \mathbf{x}_2).$$

Page 35:

Solution by Kernel K-means

Solution by K-means

Comparison of Clustering Analyses

XOR Dataset

Page 36:

The empirical vector is defined as

2.3 Empirical Space: K
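The defining formula is missing from the transcript; the empirical vector is commonly defined (an assumption consistent with the later slides) as the vector of kernel values against all N training vectors,

$$\vec{k}(\mathbf{x}) = \big[\,K(\mathbf{x}_1, \mathbf{x}),\ K(\mathbf{x}_2, \mathbf{x}),\ \ldots,\ K(\mathbf{x}_N, \mathbf{x})\,\big]^T \in \mathbb{R}^N.$$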

Page 37:

empirical vector:

4 empirical vectors

4 original vectors (XOR Dataset)

Page 38:

Empirical Space: K

It has the following properties:

• It is a Hilbert space endowed with the conventional (linear) dot-product.
• It has a finite dimension, N, with non-orthogonal bases. (Thus, unlike the spectral space, it cannot enjoy the advantage of having ordered orthogonal bases.)
• It is dependent on the training dataset.

Page 39:

“The Three Spaces”

2.4 Linear Mapping:

Page 40:

Introduce a unifying Vector “z”

Linear Mapping:

Page 41:

Kernel PCA / Linear Mapping:

Page 42:

The discriminant function (for SVM or KFD) can be expressed as

Representation Theorem

A kernel solution w can always be represented by

Linear Mapping:
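The formula itself is not in the transcript; a minimal sketch of the standard statement (an assumption): a kernel solution w can always be represented as a linear combination of the mapped training vectors,

$$\mathbf{w} = \sum_{i=1}^{N} a_i\, \boldsymbol{\phi}(\mathbf{x}_i),$$

so the discriminant function (for SVM or KFD) can be expressed entirely through kernel evaluations:

$$f(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b = \sum_{i=1}^{N} a_i\, K(\mathbf{x}_i, \mathbf{x}) + b = \mathbf{a}^T \vec{k}(\mathbf{x}) + b.$$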

Page 43:

Consequently, it opens up a new and promising clustering strategy, termed "empirical K-means".

Non-Preservation of Dot-Products

Page 44:

Topics:

1. Kernel-Based Unsupervised/Supervised Machine Learning: Overview

2. Kernel-Based Representation Spaces

3. Kernel Spectral Analysis
• 3.1 Conventional PCA
• 3.2 Kernel PCA

4. Kernel Fisher Discriminant (KFD) Analysis

5. Kernel-Based Support Vector Machine and Robustification of Training

6. Fast Clustering Methods for Positive-Definite Kernels (Kernel Component Analysis and the Kernel Trick)

7. Extension to Nonvectorial Data and NPD (Non-Positive-Definite) Kernels

Page 45:

Therefore, PCA may be solved by applying eigenvalue decomposition (i.e., diagonalization) to the pairwise score matrix X^T X, as opposed to the scatter matrix

S = X X^T.

3.1 Conventional PCA
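A one-line justification (standard linear algebra, added here for completeness): for a centered data matrix X with M features and N samples, the scatter matrix $S = XX^T$ and the pairwise score (Gram) matrix $X^T X$ share the same nonzero eigenvalues, since $X^T X \mathbf{v} = \lambda \mathbf{v}$ implies $XX^T(X\mathbf{v}) = \lambda (X\mathbf{v})$. This is why the XOR example on the next slide obtains the eigenvalues 4.0 and 4.0 from either matrix.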

Page 46:

which has two eigenvalues, 4.0 and 4.0.

For the four XOR training data,

The two nonzero eigenvalues are again 4.0 and 4.0, just the same as before.

The corresponding pairwise score matrix is

Page 47:

The kernel PCA basically amounts to the spectral decomposition of K.

3.2 Kernel Spectral Analysis [32]

Page 48:

Reconfirms the Representer Theorem:

The kernel PCA basically amounts to the spectral decomposition of K.

Page 49:

The m-th components of the entire set of training data:

Finding the m-th (kernel) components:

Page 50:

Kernel PCA

KCA

The eigenvector property can be confirmed as follows:

Page 51:

From the spectral factorization perspective

Kernel PCA

Page 52:

Kernel PCA: XOR Dataset

Page 53:

The two principal components

The two principal eigenvectors

Kernel PCA

Page 54:

PCA vs. Kernel PCA

Page 55:

Cornering Effect of Kernel Hilbert Space

PCA vs. Kernel PCA

Page 56:
Page 57:

Topics:

1. Kernel-Based Unsupervised/Supervised Machine Learning: Overview
2. Kernel-Based Representation Spaces
3. Principal Component Analysis (Kernel PCA)

4. Supervised Kernel Classifiers
4.1 Fisher Discriminant Analysis (FDA)
4.2 Kernel Fisher Discriminant (KFD)
4.3 Perfect KFD
4.4 Perturbed Fisher Discriminant (PFD) - Stochastic Modeling of the Training Dataset
4.5 Hybrid Fisher-SVM Approach

5. Fast Clustering Methods for Positive-Definite Kernels

6. Extension to Nonvectorial Data and NPD (Non-Positive-Definite) Kernels

Page 58:

4.1 Fisher Discriminant Analysis (FDA) [8] = Linear Discriminant Analysis (LDA)

In the traditional LDA,

Between-Class Scatter Matrix

Total Scatter Matrix

Within-Class Scatter Matrix
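The matrix definitions are not reproduced in the transcript; the standard definitions (an assumption, with μ_+ and μ_- the class centroids, μ the overall centroid, and N_± the class sizes) are

$$S_W = \sum_{\pm}\ \sum_{\mathbf{x}_t \in C_\pm} (\mathbf{x}_t - \boldsymbol{\mu}_\pm)(\mathbf{x}_t - \boldsymbol{\mu}_\pm)^T, \qquad
S_B = \sum_{\pm} N_\pm\, (\boldsymbol{\mu}_\pm - \boldsymbol{\mu})(\boldsymbol{\mu}_\pm - \boldsymbol{\mu})^T,$$

$$S = \sum_{t=1}^{N} (\mathbf{x}_t - \boldsymbol{\mu})(\mathbf{x}_t - \boldsymbol{\mu})^T = S_W + S_B.$$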

Page 59:

Traditional FDA: • Data-Distribution-Dependent

• Range: [0, ∞]

A useful variant,

both share the same optimal w.

• Range: [0,1]

Assume that N > M, so that S is invertible:
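The two criteria are shown only graphically in the original; written out (an assumption consistent with the ranges quoted above), the traditional Fisher score and its variant are

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B\, \mathbf{w}}{\mathbf{w}^T S_W\, \mathbf{w}} \in [0, \infty), \qquad
J'(\mathbf{w}) = \frac{\mathbf{w}^T S_B\, \mathbf{w}}{\mathbf{w}^T S\, \mathbf{w}} \in [0, 1],$$

and since $S = S_W + S_B$ gives $J' = J/(1+J)$, a monotonically increasing function of J, both criteria share the same optimal w.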

Page 60:

Note that

By definition:

Page 61:

where

$$\|\mathbf{d}\| = 1.$$

Page 62:

Generically,

where

Fisher Score

Closed-Form Solution of FDA (M≥N):

Page 63:

The optimal solution can be expressed as
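The expression itself is missing from the transcript; the classical closed form (an assumption, valid when the relevant scatter matrix is invertible) is

$$\mathbf{w}^{\ast} \propto S_W^{-1}(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-) \propto S^{-1}(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-),$$

the two versions giving the same direction because $S = S_W + S_B$ and $S_B$ is proportional to $(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)^T$.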

Page 64:

In the KFD analysis, we shall assume the generic situation in which the training vectors are distinct and other pathological distributions are excluded.

References [22-25]

4.2 KFD Analysis

Page 65:

“universal” vector: z

Most common mappings:

Page 66:

“Data” Matrix in Spectral Space: E

Pre-centered Matrix:
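A minimal MATLAB sketch of the pre-centering step (assumption: the columns of E are the N training vectors in the spectral space, as in the numerical example on Page 76; relies on implicit expansion, R2016b+):

ee = E - mean(E, 2);     % pre-centered data matrix in the spectral space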

Page 67:
Page 68:

4.3 Perfect KFD

Page 69:

A perfect KFD solution expressed in the “universal” vector space is:

Perfect KFD

Page 70:

Within-Class Scatter Matrix → 0 after the projection.

Note that J = 1, which implies that

Namely, the positive and negative training vectors fall on two separate but parallel hyperplanes.

Page 71:

Positive & Negative Hyperplanes.

The entries of d have two different values:

for positive hyperplane

for negative hyperplane

for all the positive training patterns.

for all the negative training patterns.

Page 72:

FDA/LDA: XOR Example

Between-class scatter matrix: S_B = 0.

Within-class (= total) scatter matrix:

$$S_W = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 2 & -2 \\ -2 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} = S.$$

Fisher score = 0.

The optima are Data-Distribution-Dependent.

Page 73:

KFD: XOR Example

Data-Distribution-Independent.

Data-Distribution-Dependent.

Because E is Data-Distribution-Dependent.

Page 74:

Confirmed for this example:

The optima are Data-Distribution-Independent.

KFD: XOR Example

Next: More examples. Numerical Issues.

Page 75:

First example:

Data-Distribution-Independence

[Figure: the four training vectors]

$$\mathbf{x}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\quad \mathbf{x}_2 = \begin{bmatrix} 1.1 \\ 0.8 \end{bmatrix},\quad \mathbf{x}_3 = \begin{bmatrix} -1 \\ -1 \end{bmatrix},\quad \mathbf{x}_4 = \begin{bmatrix} -1.1 \\ -0.8 \end{bmatrix}$$
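The kernel matrix shown on the next slide is consistent with the second-order polynomial kernel K(x, y) = (1 + xᵀy)²; a minimal MATLAB sketch that reproduces it (the kernel choice is an inference from the numbers, not stated explicitly in the transcript):

X = [ 1   1.1   -1   -1.1 ;
      1   0.8   -1   -0.8 ];      % columns are x1..x4
K = (1 + X' * X).^2               % 4 x 4 kernel matrix, matching the next slide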

Page 76:

>> K =

    9.0000    8.4100    1.0000    0.8100
    8.4100    8.1225    0.8100    0.7225
    1.0000    0.8100    9.0000    8.4100
    0.8100    0.7225    8.4100    8.1225

>> [UT, Lambda] = eigs(K)

UT =
    0.5154   -0.5098   -0.4841    0.4900
    0.4841   -0.4900    0.5154   -0.5098
    0.5154    0.5098   -0.4841   -0.4900
    0.4841    0.4900    0.5154    0.5098

Lambda =
   18.6606         0         0         0
         0   15.3059         0         0
         0         0    0.1844         0
         0         0         0    0.0941

LambdaSqrt =
    4.3198         0         0         0
         0    3.9123         0         0
         0         0    0.4294         0
         0         0         0    0.3067

>> E = LambdaSqrt * UT'

    2.2264    2.0913    2.2264    2.0913
   -1.9943   -1.9172    1.9943    1.9172
   -0.2079    0.2213   -0.2079    0.2213
    0.1503   -0.1564   -0.1503    0.1564

>> ee = E - mean(E, 2)            % data-centered E

    0.0675   -0.0675    0.0675   -0.0675
   -1.9943   -1.9172    1.9943    1.9172
   -0.2146    0.2146   -0.2146    0.2146
    0.1503   -0.1564   -0.1503    0.1564

>> denominator_matrix = ee * ee'

    0.0182    0.0000   -0.0580    0.0000
    0.0000   15.3059    0.0000    0.0000
   -0.0580    0.0000    0.1842    0.0000
    0.0000    0.0000    0.0000    0.0941

Page 77:

>> ee                             % data-centered E

    0.0675   -0.0675    0.0675   -0.0675
   -1.9943   -1.9172    1.9943    1.9172
   -0.2146    0.2146   -0.2146    0.2146
    0.1503   -0.1564   -0.1503    0.1564

Page 78:

KFD: Numerical Analysis

Two different decision vectors and two different scenarios:

>> v = [0 1 0 0]'    is numerically stable, as it is robust w.r.t. perturbations.

>> v = [0 0 1 0]'    is numerically risky, as a small perturbation could cause a big effect.

In general, for v = [x x ... x x]', each element has its own sensitivity.

Page 79:

Easy Case

>> v = [0 1 0 0]'

>> numerator_matrix

    0.0000    0.0000    0.0000    0.0000
    0.0000   15.3       0.0000    0.0236
    0.0000    0.0000    0.0000    0.0000
    0.0000    0.0236    0.0000    0.00005

>> Fisher Score = 0.9996
Imperfect FDA score = 0.9997

[ rho      eta_opt ]
  1.0000   0.0000

[ score_ve   score_hybrid_AR   score_hybrid_dR   score_vopt ]
  0.9351     0.9383            0.9383            0.9383

Page 80:

Hard Case

>> v = [0 0 1 0]'

>> numerator_matrix = ee*d*d'*ee'

    0.0182   -0.0000   -0.0580    0.0000
   -0.0000    0.0000    0.0000   -0.0000
   -0.0580    0.0000    0.1843   -0.0000
    0.0000   -0.0000   -0.0000    0.0000

>> Fisher Score = 1.
"Imperfect" FDA score = 1.0000

[ rho      eta_opt ]
  1.0000  -0.2401

[ score_ve   score_hybrid_AR   score_hybrid_dR   score_vopt ]
  0.1684     0.0919            0.1571            0.1684

Page 81:

Not as Hard Case - poorest result

>> v = [0 0 0 1]'

>> numerator_matrix = ee*d*d'*ee'

    0.0000   -0.0000    0.0000    0.0000
   -0.0000    0.0060   -0.0000   -0.0237
    0.0000   -0.0000    0.0000    0.0000
    0.0000   -0.0237    0.0000    0.0940

>> Fisher Score = 1.
Imperfect FDA score = 0.9634

vopt_dR =
   -0.0000
    0.0047
   -0.0000
    0.2803

[ rho      eta_opt ]
  1.0000   0.0000

[ score_ve   score_hybrid_AR   score_hybrid_dR   score_vopt ]
  0.0860     0.0863            0.0863            0.0863

Page 82:

It may be further verified that a vector is a perfect KFD if and only if it falls on the two-dimensional solution space prescribed above.

Solution Space of Perfect KFDs

Page 83:

Solution Space of Perfect KFDs

[Figure: the signal direction, the mass center, and 90° angles illustrating a perfect KFD.]

Illustration of why both the numerator and denominator are pictorially shown to be equal to 1.

Page 84:

Invariant Properties

The denominator remains constant:

The numerator remains constant:

Page 85:

To create a maximum margin between the training vectors and the decision hyperplane, the threshold is set as the average of the two projected centroids (instead of the projected mass center).

Selection of decision threshold: b

Page 86:

Two Case Studies:

This self-centered solution matches the SVM approach.

Case I:

Case II:

Page 87:

However, different perfect KFDs have their own margins of separation, which have a simple formula:

This plays a vital role in the perturbation analysis, since the wider the separation, the more robust the classifier.

Margin of Separation

Page 88:

Margin of Separation

Page 89:

The solution can be obtained by minimization of

This perfect KFD has a maximal PFD among its own kind.

Maximum Separation by Perfect KFD

Page 90:

4.4 Perturbed Fisher Discriminant (PFD)

Perfect is not optimal!

Optimal is not perfect!

Page 91:

Stochastic Modeling of the Training Dataset

Now each training pattern is viewed as an instantiation of a noisy measurement of the (noise-free) signal, i.e., the data matrix associated with the robustified (noise-added) training patterns can be expressed as:

X+N

With some extensive analysis, the following can be established for the robustification in the spectral space

Page 92:

Application Motivation: Stochastic Model of Training Dataset

Perturbed Fisher Discriminant

Simulation Supports

Say No to Conventional KFD or Perfect KFD.

Say Yes to PFD !

Page 93:

Perturbed Fisher Discriminant

Page 94:

PFD for Perfect KFD

Maximal PFD Among Perfect KFDs

Best Perfect KFD from PFD perspective:

Page 95:

Perfect KFD vs. Imperfect KFD

[Figure: the signal and noise components, with total = signal + noise; 90° angles; the mass center; the perfect KFD direction vs. the imperfect KFD direction.]

Page 96:

where

Globally Optimal PFD Solution

and

Page 97:

0.9247    1.1131

score(best perfect) score(AR) score(vopt)

rho=.04: 0.9825 0.9880 0.9891

rho=.4: 0.8486 0.9081 0.9292

Dataset2: (easy case):

Page 98:

0.0671    1.0056

score(best perfect) score(AR) score(vopt)

rho=.04: 0.5448 0.6283 0.6910

rho=.4: 0.1069 0.1983 0.5190

Dataset5 (hard case):

Page 99:

(1) for rho=1: PFD= 0.8017 versus 0.8464, with KFD=0.9933;

(2) for rho=.1: PFD= 0.9675 versus 0.9778, with KFD=0.9959; and

(3) again for rho=.1, PFDs are 0.9573 and 0.9734 for perfect optimal KFD and regularized perfect KFD respectively.

Simulations: compare z = d and the optimal z (5 points).

Page 100:

4.5 Hybrid Fisher-SVM

Page 101:

Empirical Space

Computationally speaking, when all three spaces are basically equivalent (e.g., for KFD and SVM), the empirical space is the most attractive, since the spectral space incurs the extra cost of spectral decomposition.

Page 102:

Positive & Negative Hyperplanes: in K

It is perfect, but not optimal!

Page 103:

The Optimal PFD Solution

Regularization by a stochastic model of the data.

Derivation:

Page 104:

An alternative and effective regularization approach is via the imposition of constraints in the Wolfe optimization, a technique adopted by kernel-SVM classifiers to alleviate over-training problems.

Regularization by Constraints

Page 105:

Linear SVM

subject to

Lagrange multipliers can be solved by the Wolfe optimization:

Page 106:

Kernel SVM

subject to

Solution by the Wolfe optimization:

Since the dot-product is changed to:

Page 107:

Ignoring for a moment the constraints on a_i, the optimal solution can be obtained by taking the derivative of the above with respect to a_i. This leads to:

Linear System Solution

Page 108:

XOR Example

Nonsingular K and exact solution:

Linear System Solution

KFD Solution

Page 109:

A Hybrid Classifier Combining KFD and SVM

(1) On one hand, the cost function in the optimization formulation follows that of KFD.

(2) On the other hand, a second regularization, based on the Wolfe optimization adopted by SVM, is incorporated.

Hybrid Fisher-SVM Classifier

Page 110:

Hybrid Fisher-SVM Classifier

Page 111:

where zi = di - η, and

This is exactly the same as the optimal PFD solution. Recall that

Constraint-Relaxed Solution

Hybrid Fisher-SVM Classifier

Page 112:

Class-Balanced Wolfe Optimization with Perturbation-Adjusted Kernel Matrix

The approach is based on class-balanced Wolfe optimization with perturbation-adjusted kernel matrix.

1. Just like KFD, the class-balanced constraints will be adopted.

2. Under the perturbation analysis, the kernel matrix should be replaced by its perturbation-adjusted, more realistic representation:

In summary, by imposing a stochastic model on the training data, we derive a hybrid Fisher-SVM classifier which contains the Fisher solution and the SVM solution as special cases. On one hand, it gives the same solution as the optimal Fisher when the constraints are lifted. On the other hand, it approaches the (class-balanced) SVM when the perturbation is small.

Page 113:

Does such a shift improve the predictability of the supervised classifiers?

K → K + s I

Kernel Shift
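A minimal MATLAB sketch of this diagonal shift (assumption: s is chosen just large enough to make the smallest eigenvalue non-negative; other choices of s are possible):

lamMin = min(eig(K));                 % most negative eigenvalue of an NPD kernel matrix
s = max(0, -lamMin);                  % shift needed to remove the negative spectrum
Kshift = K + s * eye(size(K, 1));     % K -> K + s*I is positive semi-definite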

Page 114:

Subcellular Dataset: hy50

Page 115:

------------------------------------------
Kernel    PNE        min_evalue        Accuracy (w/o shift --> w/ shift)
-------------------------------------------
K1         8.37%        -27.79         41.71% --> 52.66%
K2        20.10%        -23.27         49.47% --> 54.87%
K3         0%        4.0922e+004       54.73%
K4        10.16%    -5.9002e+005       54.82% --> 51.96%
K5         0%        1.6442e-010       75.08%

PNE: percentage of negative eigenvalues in the kernel before applying the diagonal shift.

Subcellular Dataset: hy50

Page 116:

Topics:

1. Overview
2. Kernel-Based Representation Spaces
3. Principal Component Analysis (Kernel PCA)
4. Supervised Classifiers

5. Clustering Methods for PD Kernels
• 5.1 K-means and Convergence
• 5.2 Kernel K-means
• 5.3 Spectral Space: Dimension Reduction
• 5.4 Kernel NJ: Hierarchical Clustering
• 5.5 Fast Kernel Trick

6. Extension to Nonvectorial Data and NPD Kernels

Page 117:

Unsupervised Clustering Analysis

Gene expression

Three major types of clustering approaches (unsupervised):
- K-means (EM)
- SOM
- HC (hierarchical clustering)

Page 118:

5.1 Conventional K-meansand Convergence

References: [9][17][18]

Page 119:

Clustering Criteria:

• Intra-Cluster Compactness

• Inter-Cluster Divergence

Page 120:

Exclusive Membership in K-means

[Figure: a training vector x_t is assigned exclusive membership in a single cluster.]

Page 121:

The best parameters can be trained by maximizing the log-likelihood function:

Likelihood-Based Criteria

Let the K classes be denoted by

and parameterized by

Page 122:

Under the i.i.d. (independent and identically distributed) assumption,

Substituting Eq. ** into Eq.*, we obtain

**

*

exclusive membership

i.i.d.

Page 123:

K-means

The objective function to minimize is
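Written out (the standard form, consistent with the E(X) expressions on the following slides):

$$E(X) = \sum_{k=1}^{K}\ \sum_{\mathbf{x}_t \in C_k} \left\| \mathbf{x}_t - \boldsymbol{\mu}_k \right\|^2, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{\mathbf{x}_t \in C_k} \mathbf{x}_t.$$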

Page 124:

K-means Flowchart

[Flowchart: Initiation --> Winner Identification --> Centroid Update --> Convergence? -- No: loop back; Yes: Stop.]

Page 125:

Monotonic Convergence: The convergence is assured if (and only if) the pairwise matrix is positive definite. Thus, the convergence is assured for all the vector clustering problems, since all the commonly adopted kernel functions are positive definite.

Sufficiency & Necessity

Page 126:

Monotonic Convergence: The convergence is assured if (and only if) the pairwise matrix is positive definite. Thus, the convergence is assured for all the vector clustering problems, since most commonly adopted kernel functions are positive definite.

Sufficiency & Necessity

Page 127:

M-Step: update the new centroids (two arrows)

[Figure: two clusters with centroids μ1 and μ2; the arrows indicate the centroid updates.]

$$E(X) = \sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}(\mathbf{x}_t) \right\|^2$$

... so as to reduce E(X):

Sufficiency

Page 128:

E-Step: winner identification, i.e., new memberships for some data. (The 3rd and 5th data points will have a new owner (winner).)

$$E(X) = \sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}(\mathbf{x}_t) \right\|^2$$

Does the new ownership reduce E(X)?

Sufficiency

Page 129:

E-Step: new memberships (color coded) are assigned.

$$\sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}(\mathbf{x}_t) \right\|^2 \quad\text{vs.}\quad \sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}'(\mathbf{x}_t) \right\|^2$$

Convergence: Sufficiency

Reduction of Energy

Page 130:

Since each re-assigned vector moves to a centroid that is no farther away,

$$\sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}'(\mathbf{x}_t) \right\|^2 \;\leq\; \sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}(\mathbf{x}_t) \right\|^2 .$$

The new memberships help reduce E(X)!

Sufficiency

Page 131:

[Figure: the updated centroids μ1 and μ2.]

$$\sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}''(\mathbf{x}_t) \right\|^2 \;\leq\; \sum_{\mathbf{x}_t \in X} \left\| \mathbf{x}_t - \boldsymbol{\mu}'(\mathbf{x}_t) \right\|^2$$

Likewise, a function is called monotonically decreasing (non-increasing) if, whenever x ≤ y, then f(x) ≥ f(y), so it reverses the order.

M-Step: Maximization / Minimization

Reduction of Energy

Page 132:

Necessity: proof by counter-example.

[Figure: counter-example with the points [1, 0]^T, [0, 0]^T, [0, 1]^T (labeled A, B, C) and the updated centroid C'.]

If the kernel matrix is NPD, then a reduction of energy after the centroid update is not guaranteed!

Page 133:

5.2 Kernel K-Means

Page 134:

In conventional K-means, the objective function is

The kernel K-means aims to minimize:

Kernel K-Means

Criteria A: centroid-based objective function

K-means has very limited capability to separate clusters that are non-linearly separable in input space.

Page 135:

Algorithm: Kernel K-means

(i) Winner Identification: Given any training vector x, compute its distances to the K centroids. Identify the winning cluster as the one with the shortest distance

(ii) Membership Update: If the vector x is added to a cluster Ck, then

update the centroid of Ck

(whether explicitly or not).

For PD Kernel, centroids are defined.

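A minimal MATLAB sketch of these two steps, using the kernel trick so that the centroids never need to be formed explicitly (assumptions: K is an N x N PD kernel matrix, two clusters, random initialization, and a simple empty-cluster guard):

N = size(K, 1);
nClusters = 2;
labels = randi(nClusters, N, 1);                 % random initial memberships
for iter = 1:100
    d2 = zeros(N, nClusters);
    for k = 1:nClusters
        idx = (labels == k);
        nk  = sum(idx);
        if nk == 0, d2(:, k) = inf; continue; end
        % ||phi(x_t) - mu_k||^2 = K(t,t) - (2/nk)*sum_i K(t,i) + (1/nk^2)*sum_{i,j} K(i,j)
        d2(:, k) = diag(K) - (2/nk) * sum(K(:, idx), 2) + sum(sum(K(idx, idx))) / nk^2;
    end
    [~, newLabels] = min(d2, [], 2);             % (i) winner identification
    if isequal(newLabels, labels), break; end    % converged
    labels = newLabels;                          % (ii) membership update
end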

Page 136:

Spectral K-Means

Kernel trick: Implicit formula

• Kernel PCA provides a suitable Hilbert space which can adequately manifest the similarity of the objects, vectorial or non-vectorial.

• It circumvents the explicit derivation of the centroids.
• It obviates the need to perform eigenvalue decomposition.

Page 137:

Assuming that the pairwise (similarity) matrix K is PD, the following clustering criteria are equivalent.

Equivalence under Mercer (PD) Condition:

Three (Equivalent) Objective Functions

Page 138:

centroid-free and norm-free objective function

centroid-free but norm-based objective function [Duda73]

centroid-based objective function

Kernel K-means
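One standard way to write the three criteria listed above (an assumption; the transcript omits the formulas), for clusters C_k of size N_k:

centroid-based:
$$A = \sum_{k=1}^{K}\ \sum_{\mathbf{x} \in C_k} \left\| \boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\mu}_k \right\|^2,$$

centroid-free but norm-based:
$$B = \sum_{k=1}^{K} \frac{1}{2N_k} \sum_{\mathbf{x}, \mathbf{x}' \in C_k} \left\| \boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\phi}(\mathbf{x}') \right\|^2,$$

centroid-free and norm-free (to be maximized, since $A = B = \sum_{\mathbf{x}} K(\mathbf{x},\mathbf{x}) - C$ and the first term does not depend on the clustering):
$$C = \sum_{k=1}^{K} \frac{1}{N_k} \sum_{\mathbf{x}, \mathbf{x}' \in C_k} K(\mathbf{x}, \mathbf{x}').$$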

Page 139:

centroid-free and norm-free objective function

centroid-free but norm-based objective function [Duda73]

centroid-based objective function

Kernel K-means

Page 140:

Compare A and B

B =

A =

Page 141:

C is centroid-free and norm-free!

Compare B and C

B =

Page 142:

5.3 Spectral K-Means

•Kernel Spectral Analysis is an effective approach.

Page 143:

In [7]: "... kernel k-means and spectral clustering ... are related. Specifically, we can rewrite the weighted kernel k-means objective function as a trace maximization problem whose relaxation can be solved with eigenvectors. The result shows how a particular kernel and weight scheme is connected to the spectral algorithm of Ng, Jordan, and Weiss [27]. ... by choosing the weights in particular ways, the weighted kernel k-means objective function is identical to the normalized cut."

Kernel K-means and Spectral clustering

Page 144:

In the reduced-dimension

In the full-dimension, the full column of E is used:

Spectral Approximation


The Spectral “Approximate” Kernel matrix:

Page 145:

where

centroid-free and norm-free objective function

Spectral Error AnalysisThe Spectral “Error” Kernel matrix is defined as

Page 146:

Spectral Error Analysis

Since the first summation is independent of the clustering,

Page 147:

Perturbation Analysis

[Matrix inequality: the perturbation of the centroid-free objective function, involving the block-diagonal cluster-membership matrix with blocks (1/N_k) 1 1^T, is upper bounded in terms of λ_{m+1}.]

Page 148:

If the kernel matrix is NPD, it is necessary to convert it to PD. One way to achieve this objective is by adopting a PD kernel matrix in the empirical space:

The dimension-reduced vectors are formed from the columns of

are used as the training vectors.

In the full-dimension, the full-size columns of

Approximation in Empirical Eigen-Space
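One concrete choice (an assumption consistent with "a PD kernel matrix in the empirical space"): take the Gram matrix of the empirical vectors, i.e. of the columns of K, which is

$$\tilde{\mathbf{K}} = \mathbf{K}^T \mathbf{K} = \mathbf{K}^2$$

and is positive semi-definite even when K itself is NPD.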

Page 149:

Perturbation Analysis

[Matrix inequality: the same block-diagonal bound as before, with λ_{m+1} replaced by max{λ_{m+1}, |λ_N|}.]

By using the same perturbation analysis, and noting in addition that the perturbation is upper bounded by the largest-magnitude eigenvalue of [the error kernel matrix],

Here K denotes the number of clusters (instead of the kernel matrix).

Page 150:

The popular spectral clustering for graph partitioning [27,34] resorts to a uniform (as opposed to weighted) truncation strategy, where the number of 1's is usually set to be equal to K or greater.

In the Spectral Clustering, the training patterns are now represented by the columns of E". Unfortunately, the KCA-type error analysis is difficult to derive for the spectral clustering.

Comparison with Spectral Clustering

Page 151:

Comparison Table

Page 152:

The perturbation analysis has several practical implications:

• As long as Kλ_{m+1} is within the tolerance level, the m-dimensional spectral K-means will be sufficient to yield a satisfactory clustering result.

• If the global optimum holds a margin over any local optimum by an amount greater than 2Kλ_{m+1} (considering the worst possible situation), then the full-dimensional spectral K-means and the m-dimensional spectral K-means will have the same optimal solution.

• When an NPD kernel matrix is encountered, it is common practice to discard the components corresponding to the negative spectral values. In this case, the same error analysis can be applied again to assess the level of deviation.

Page 153:

Kernel K-Means (SOM/NJ) without explicitly computing centroids

Many applications may be solved without any specific space.

Kernel Trick for Unsupervised Learning

Kernel Trick needs No Space at All.

Page 154:

Kernel Trick is effective for:

Kernel NJ Algorithm

Kernel K-Means for Highly Sparse Data

Neighbor-Joining (NJ) Algorithm

Page 155:

5.4 Hierarchical NJ Algorithm

Page 156:

Hierarchical Clustering

Hierarchical clustering analysis plays an important role in biological taxonomic studies, as a phylogenetic tree or evolutionary tree is often adopted to show the evolutionary lineages among various biological entities.

Page 157

Phylogenetic Tree of SARS Viral Genes: the phylogenetic tree can help reveal the epidemic links.

The severe acute respiratory syndrome (SARS) virus started its outbreak during the winter of 2002. All of the SARS viruses can be linked via a phylogenetic tree, obtained from genomic data by measuring the DNA similarities between various SARS viruses.

Page 158

Kernel-NJ: Five vectorial or nonvectorial objects (A, B, C, D, E) have the following (positive-definite) kernel matrix:

Page 159

Neighbor-Joining (NJ) Algorithm

The algorithm starts with N nodes, i.e., one object per node:

Step 1: Compute the N(N−1)/2 pairwise similarity/distance measures for the N objects. These values correspond to the off-diagonal entries of a symmetric N×N similarity matrix.

Step 2: Merge the two most similar nodes: insert the merged node and remove the two old nodes.

Step 3: After N−1 mergers, all the objects are merged into one common ancestor, i.e., the root of all objects.

A minimal sketch of this merging loop is given below.
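A minimal sketch of the three steps, assuming a symmetric similarity matrix S is already available (Step 1) and leaving the specific similarity-update rule as a pluggable function (the centroid-based rule used by Kernel-NJ is sketched a few slides below); the helper names are illustrative only:

```python
import numpy as np

def hierarchical_merge(S, update_fn):
    """Generic N-1 merger loop over a symmetric similarity matrix S.

    S         : (N, N) symmetric similarity matrix (Step 1 output)
    update_fn : callable(S, i, j) -> reduced matrix after merging nodes i, j
    Returns the list of merged-node names, ending at the common root.
    """
    S = np.array(S, dtype=float)
    names = [f"n{i}" for i in range(len(S))]
    mergers = []
    while len(S) > 1:
        # Step 2: pick the most similar off-diagonal pair (i, j).
        mask = ~np.eye(len(S), dtype=bool)
        i, j = np.unravel_index(np.argmax(np.where(mask, S, -np.inf)), S.shape)
        i, j = min(i, j), max(i, j)
        merged_name = f"({names[i]},{names[j]})"
        mergers.append(merged_name)
        names = [n for k, n in enumerate(names) if k not in (i, j)] + [merged_name]
        S = update_fn(S, i, j)     # reduced (len-1) x (len-1) matrix
    return mergers                 # Step 3: after N-1 mergers, one root remains
```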

Page 160

The conventional NJ score update is approximation-based:

|CD| ≠ ( |AD| + |BD| ) / 2,

whereas the exact, centroid-based method is natural:

C = (A + B) / 2,

since distance is defined for a Hilbert space endowed with an inner product.

Page 161

Kernel Trick

The winner is identified as the cluster, say C_k, that yields the shortest distance.

The computation can be carried out exclusively in terms of the kernel entries K_ij (see the sketch below).
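For concreteness, a minimal sketch of this centroid-free computation, using the standard kernel-trick identity d²(x_n, c_k) = K_nn − (2/N_k) Σ_{j∈C_k} K_nj + (1/N_k²) Σ_{i,j∈C_k} K_ij (the helper names and the tiny linear-kernel example are illustrative assumptions):

```python
import numpy as np

def kernel_dist2(K, n, members):
    """Squared distance, in the kernel-induced space, from pattern n to the
    centroid of the cluster whose member indices are `members`.
    Uses only entries of the kernel matrix K (the 'kernel trick')."""
    members = np.asarray(members)
    p = len(members)
    cross = K[n, members].sum()                 # sum_j K_nj
    auto = K[np.ix_(members, members)].sum()    # sum_ij K_ij
    return K[n, n] - 2.0 * cross / p + auto / p ** 2

def winner(K, n, clusters):
    """Identify the winning cluster (shortest kernel distance) for pattern n."""
    d2 = [kernel_dist2(K, n, c) for c in clusters]
    return int(np.argmin(d2)), d2

# Hypothetical usage with a tiny linear-kernel matrix:
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
K = X @ X.T
k, d2 = winner(K, n=1, clusters=[[0], [2, 3]])
print("winner:", k, "distances:", np.round(d2, 3))
```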

Page 162

Kernel Trick

Page 163

Kernel Trick for NJ Algorithm

(i) Winner Identification: Given any training vector x, compute its distances to the K centroids and identify the winning cluster as the one with the shortest distance.

(ii) Membership Update: If a node is added to a cluster C_k, merge that node into C_k to form C'_k.

Page 164

Kernel-NJ Algorithm: because s_DE = 0.8 is the highest pairwise similarity, (D, E) will first merge into a new node F.

This leads to a reduced (4 × 4) similarity matrix, with squared distances such as

d²_AF = s_AA + s_FF − 2 s_AF = 1 + 0.9 − 2 × 0.3 = 1.3,
d²_CF = s_CC + s_FF − 2 s_CF = 1 + 0.9 − 2 × 0.35 = 1.2.

(A sketch of this merge update follows.)
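A minimal sketch of the centroid-based merge that produces this reduced matrix. Only s_DE = 0.8, the unit self-similarities, s_AB = 0.6, and the averaged values s_AF = 0.3 and s_CF = 0.35 are taken from the slides; the remaining entries of the 5×5 matrix below are purely hypothetical placeholders chosen to be consistent with them:

```python
import numpy as np

def merge_centroid(S, i, j):
    """Merge nodes i and j of similarity matrix S into their centroid F.
    New similarities follow the inner-product (centroid) rule:
        s_xF = (s_xi + s_xj) / 2
        s_FF = (s_ii + 2*s_ij + s_jj) / 4
    Returns the reduced matrix with F appended as the last node."""
    keep = [k for k in range(len(S)) if k not in (i, j)]
    s_xF = (S[keep, i] + S[keep, j]) / 2.0
    s_FF = (S[i, i] + 2.0 * S[i, j] + S[j, j]) / 4.0
    R = np.empty((len(keep) + 1, len(keep) + 1))
    R[:-1, :-1] = S[np.ix_(keep, keep)]
    R[:-1, -1] = R[-1, :-1] = s_xF
    R[-1, -1] = s_FF
    return R

# Hypothetical 5x5 kernel matrix for (A, B, C, D, E); only the entries needed
# to reproduce the slide's numbers are meaningful, the rest are placeholders.
S = np.array([
    [1.00, 0.60, 0.50, 0.30, 0.30],   # assume s_AD = s_AE = 0.3
    [0.60, 1.00, 0.50, 0.40, 0.40],
    [0.50, 0.50, 1.00, 0.35, 0.35],   # assume s_CD = s_CE = 0.35
    [0.30, 0.40, 0.35, 1.00, 0.80],   # s_DE = 0.8 (largest off-diagonal)
    [0.30, 0.40, 0.35, 0.80, 1.00],
])
R = merge_centroid(S, 3, 4)           # merge D and E into F
s_AA, s_FF, s_AF = R[0, 0], R[-1, -1], R[0, -1]
s_CC, s_CF = R[2, 2], R[2, -1]
print("s_FF  =", s_FF)                                 # 0.9
print("d2_AF =", s_AA + s_FF - 2 * s_AF)               # 1.3
print("d2_CF =", s_CC + s_FF - 2 * s_CF)               # 1.2
```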

Page 165

Kernel-NJ Algorithm (continued): the successive mergers produce centroids denoted by F, G, and H.

Page 166

5.5 Fast Kernel Trick for Kernel K-Means

Page 167

Winner Identification

Page 168

Similarity Table for Fast Kernel Trick

N Objects × K Clusters

Page 169

In the actual coding of the fast kernel trick, we must take into account the fact that whenever a cluster wins a new member, another cluster loses an old member. As a result, two clusters (i.e., two columns of the similarity table) have to be updated. For simplicity of illustration, we focus only on the winning cluster.

(Similarity table: N rows (objects) × K columns (clusters); only the winner's column and the loser's column are updated.)

Page 170

Update of Centroids' Auto-Correlations:

Update of Similarity Table:

Slow Update

Fast Update

Page 171

Fast Kernel Trick

• Winner Identification
• Winner Update:
  – Winner Column Update
  – Winner's Auto-Correlation
• Loser Update:
  – Loser Column Update
  – Loser's Auto-Correlation

negligible

Page 172

Fast Kernel Trick

Winner Column Update:
Winner's Auto-Correlation:
Loser Column Update:
Loser's Auto-Correlation:

(A sketch of these updates follows.)
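A minimal sketch of this bookkeeping under the usual conventions (the table T[n, k] = Σ_{j∈C_k} K_nj and the auto-correlation A[k] = Σ_{i,j∈C_k} K_ij are notational assumptions made here): when pattern n moves from the losing cluster q to the winning cluster k, only two columns of T and the two scalars A[k], A[q] need updating.

```python
import numpy as np

def fast_update(K, T, A, sizes, n, winner, loser):
    """Incrementally update the similarity table T (N x K), the cluster
    auto-correlations A (K,), and the cluster sizes when pattern n moves
    from cluster `loser` to cluster `winner`.

    T[i, k] = sum over j in C_k of K[i, j]     (one column per cluster)
    A[k]    = sum over i, j in C_k of K[i, j]
    """
    # Winner's auto-correlation: add n's old row-sum over C_winner plus K_nn.
    A[winner] += 2.0 * T[n, winner] + K[n, n]
    # Loser's auto-correlation: remove n's old row-sum over C_loser minus K_nn.
    A[loser] -= 2.0 * T[n, loser] - K[n, n]
    # Winner / loser column updates: O(N) each, no centroid recomputation.
    T[:, winner] += K[:, n]
    T[:, loser] -= K[:, n]
    sizes[winner] += 1
    sizes[loser] -= 1

def dist2(K, T, A, sizes, n, k):
    """Squared kernel-space distance from pattern n to the centroid of
    cluster k, read off the table in O(1)."""
    return K[n, n] - 2.0 * T[n, k] / sizes[k] + A[k] / sizes[k] ** 2

# Hypothetical check against brute-force recomputation:
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3)); K = X @ X.T
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
T = np.stack([K[:, labels == k].sum(axis=1) for k in range(2)], axis=1)
A = np.array([K[np.ix_(labels == k, labels == k)].sum() for k in range(2)])
sizes = np.array([(labels == k).sum() for k in range(2)])
fast_update(K, T, A, sizes, n=7, winner=1, loser=0)   # move pattern 7: 0 -> 1
labels[7] = 1
assert np.allclose(A[1], K[np.ix_(labels == 1, labels == 1)].sum())
```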

Page 173

Note that K-means via kernel component analysis incurs a one-time cost of O(N³) ~ O(N⁴) for the factorization, for a total of roughly

O(N³) + Nr · Ne · O(KMN),

where Nr denotes the required number of K-means runs.

In contrast, the kernel trick obviates the need for performing spectral decomposition.

When to Use What?

Comparison of Computational Cost (as a function of Nr)

Page 174

As stated in [7],

"Thus far, only eigenvector based algorithms have been employed to minimize normalized cuts in spectral clustering and image segmentation.However, software to compute eigenvectors of large sparse matrices (often based on the Lanczos algorithm) can have substantial computational overheads, especially when a large number of eigenvectors are to be computed. In such situations, our equivalence has an important implication: we can use k-means-like iterative algorithms for directly minimizing the normalized-cut of a graph."

When to use (Fast) Kernel Trick?

For a sparse kernel matrix, the (fast) kernel trick is relatively more attractive.

Page 175

When not to use the Fast Kernel Trick?

The fast kernel trick has less of an advantage:

1. When the original vector dimension is small, or when the dimension can be substantially reduced with a good approximation. (This is because the processing time of the kernel trick is independent of the vector dimension.)

2. For Kernel NJ: since it is itself a finite algorithm, the plain kernel trick is already fast enough, and there is no need for the fast kernel trick or spectral dimension reduction.

Page 176

For a regular, non-sparse kernel matrix, Spectral K-Means is relatively more attractive.

The kernel trick is less effective if the required number of K-means runs, Nr, is large (multiple trials, e.g., MATLAB: Kmeans(A,B,C,..)).

Page 177

Topics:

1. Overview
2. Kernel Based Representation Spaces
3. Principal Component Analysis (Kernel PCA)
4. Supervised Classifiers
5. Fast Clustering Methods for Positive-Definite Kernels
6. Extension to Nonvectorial Data and NPD (Non-Positive-Definite) Kernels

6.1 The Shift Approach
  a. Shift-Invariance
  b. Concern on the Shift Approach
6.2 The Truncation Approach
6.3 Empirical Vector Space

Page 178

Extension from vectorial data to nonvectorial data.

A vital assumption: a positive-definite S.

For vectorial data, the centroid is defined, e.g. c = (A + B)/2; for nonvectorial objects (A, B, C, D, ...), the centroid is undefined.

Distance is defined for a Hilbert space endowed with an inner product.

Page 179

Application of K-means to Graph Partitioning:

By adjusting kernels and weights, the clustering algorithm can be applied to a broad spectrum of graph-partitioning applications. Some examples are:

• Clustering of gene-expression profiles
• Network analysis of protein-protein interactions
• Interconnection networks
• Parallel-processing algorithms: partitioning
• Document data mining: word-by-document matrix
• Market-basket data: transaction-by-item matrix

Page 180

For all vector clustering problems: the problem is well defined and its convergence is assured if positive-definite (PD) kernel functions are adopted.

For nonvectorial data clustering problems: the problem definition may require some attention; the convergence of K-means can be assured only after the similarity matrix is converted to a PD matrix.

Page 181

The ESSENCE of having a PD Kernel

Page 182

'Forcing' a PD Kernel: Three Approaches

• Shift Approach
• Truncation Approach
• Squared Kernel: K²

Page 183

6.1 The Shift Approach

If K is NPD, it must be converted to a PD matrix.

Kernel Shift: K → K + sI (a minimal sketch follows).
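A minimal sketch of the shift (the helper name and the tiny example matrix are hypothetical): choose s at least as large as the magnitude of the most negative eigenvalue, so that K + sI becomes positive (semi-)definite.

```python
import numpy as np

def shift_to_pd(K, margin=1e-10):
    """Return (K + s*I, s) with s chosen so the result is positive definite.
    s = 0 if K is already PD; otherwise s slightly exceeds |lambda_min|."""
    lam_min = np.linalg.eigvalsh(K).min()
    s = 0.0 if lam_min > margin else (-lam_min + margin)
    return K + s * np.eye(len(K)), s

# Hypothetical NPD similarity matrix (eigenvalues 2.5 and -0.5):
S = np.array([[1.0, 1.5],
              [1.5, 1.0]])
S_pd, s = shift_to_pd(S)
print("shift s =", round(s, 6), " eigs:", np.round(np.linalg.eigvalsh(S_pd), 6))
```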

Page 184

6.1a Shift-Invariance

Does such a shift change the nature of the problem?

• Kernel NJ is not shift-invariant.
• Kernel K-means is shift-invariant.
• Kernel SOM can be shift-invariant.

Page 185

Kernel NJ is Not Shift-Invariant

Applying the shift, the squared distance is adjusted by

s + s/(p + 1).

This penalizes smaller clusters and favors larger ones: for Kernel-NJ, a positive shift tends to favor the identification of larger clusters. (A short derivation is sketched below.)
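A short derivation of this adjustment, under the assumption (made here for concreteness) that p + 1 denotes the size of the cluster against whose centroid the squared distance is measured:

```latex
% Squared distance from a pattern x (outside cluster C, |C| = p+1) to the
% centroid of C, written purely in kernel entries:
\[
d^2(x, C) \;=\; K_{xx} \;-\; \frac{2}{p+1}\sum_{j \in C} K_{xj}
            \;+\; \frac{1}{(p+1)^2}\sum_{i,j \in C} K_{ij}.
\]
% Under the shift K -> K + sI, only the diagonal entries grow by s:
%   K_{xx}              ->  K_{xx} + s
%   \sum_{j} K_{xj}     ->  unchanged            (x not in C)
%   \sum_{i,j} K_{ij}   ->  + (p+1) s            (the |C| diagonal terms)
\[
d^2_{\mathrm{shifted}}(x, C) \;=\; d^2(x, C) \;+\; s \;+\; \frac{s}{p+1},
\]
% so smaller clusters (small p) receive a larger penalty than larger ones.
```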

Page 186

Kernel-NJ: the shift leads to a reduced (4 × 4) similarity matrix with

s'_AA = s'_BB = s'_CC = 1 + s  and  s'_FF = 0.9 + s/2,

so the shifted squared distances become

d'²_CF = (1 + s) + (0.9 + s/2) − 2 × 0.35 = 1.2 + 3s/2,
d'²_AB = (1 + s) + (1 + s) − 2 × 0.6 = 0.8 + 2s.

Without the shift: d²_CF = (1) + (0.9) − 2 × 0.35 = 1.2 and d²_AB = (1) + (1) − 2 × 0.6 = 0.8, so d²_AB < d²_CF.

With s = 2: d'²_CF = 1.2 + 3 × 2/2 = 4.2 and d'²_AB = 0.8 + 2 × 2 = 4.8, so d'²_AB > d'²_CF: the shift reverses the merging order.

(A numerical check follows.)
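A quick numerical check of this reversal, using only the entries quoted on these slides (s_AA = s_BB = s_CC = 1, s_AB = 0.6, s_FF = 0.9, s_CF = 0.35):

```python
def d2(s_xx, s_yy, s_xy):
    """Squared distance between two (possibly merged) nodes from similarities."""
    return s_xx + s_yy - 2.0 * s_xy

s = 2.0
# Unshifted: d2_AB = 0.8 < d2_CF = 1.2, so (A, B) would merge first.
print("no shift :", d2(1, 1, 0.6), "<", d2(1, 0.9, 0.35))
# Shifted: diagonals gain s, s_FF gains s/2, cross terms unchanged.
# Now d2_AB = 4.8 > d2_CF = 4.2, so (C, F) merges first instead.
print("shift s=2:", d2(1 + s, 1 + s, 0.6), ">", d2(1 + s, 0.9 + s / 2, 0.35))
```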

Page 187

Kernel K-means is Shift-Invariant

Equivalence under the Mercer (PD) condition:
• centroid-based objective function
• centroid-free objective function
• centroid-free and norm-free objective function

Page 188

Example: centroid-free and norm-free objective function

Page 189

Kernel K-means admits a centroid-free criterion

Equivalence under the Mercer (PD) condition:
• centroid-based objective function
• centroid-free objective function

(With an NPD kernel, only a pseudo-norm is induced.)

Page 190

Proof of Shift-Invariance

Under the shift K → K + sI, the individual terms of the centroid-free objective change only by amounts (such as +s, +2s, +2N_k s, and +Ks) that sum to a constant independent of the partition; hence the optimal clustering is unchanged. A compact derivation follows.
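A compact version of the argument, written with the standard centroid-free form of the kernel K-means objective (the symbols E and N_k are introduced here for convenience):

```latex
% Centroid-free kernel K-means objective (N patterns, K clusters C_1..C_K):
\[
E(K) \;=\; \sum_{i=1}^{N} K_{ii} \;-\; \sum_{k=1}^{K} \frac{1}{N_k}
           \sum_{i,j \in C_k} K_{ij}.
\]
% Under the shift K -> K + sI:
%   \sum_i K_{ii}                 gains  N s
%   \sum_{i,j \in C_k} K_{ij}     gains  N_k s, so the second sum gains K s
\[
E(K + sI) \;=\; E(K) \;+\; (N - K)\,s,
\]
% a constant independent of the partition, so kernel K-means is shift-invariant.
```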

Page 191

Conversion of K into a PD Matrix.

Kernel K-means is Shift-Invariant

Page 192

If K is NPD, it must be converted to a PD matrix.

Kernel Shift: K → K + sI

Page 193

6.1b Concern on the Shift Approach

Is the formulation rational?

Test case: the tale of the twins.

Page 194

Necessity: proof by counter-example.

(Figure: a counter-example with points A, B, C, and the twin C′ in the plane; the coordinate vectors [1, 0]ᵀ, [0, 0]ᵀ, and [0, 1]ᵀ are shown.)

Page 195

Example for Pseudo-Norm in Type-B

(Figure: the same points A, B, C, C′, now shown as empirical vectors; the coordinate vectors [1, 0]ᵀ, [0, 0]ᵀ, and [0, 1]ᵀ are shown.)

Note: empirical vectors.

Page 196

The concern: after the shift, the twins C and C′ are separated far from each other.

(Figure: the shifted empirical vectors e(A), e(B), e(C), e(C′); the twins, identical before the shift, are now far apart.)

Page 197

6.2 The Truncation Approach

A popular and simple means to make an NPD matrix PD is the truncation of its negative spectral components [1,30]; the same perturbation analysis applies. The retained spectrum satisfies λm ≥ 0 > λm+1. A minimal sketch follows.
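A minimal sketch of the truncation (helper name and example matrix are hypothetical): eigendecompose, zero out the negative eigenvalues, and rebuild the matrix.

```python
import numpy as np

def truncate_to_pd(K):
    """Zero out the negative spectral components of a symmetric matrix K.
    Returns the truncated (PSD) matrix and the largest-magnitude negative
    eigenvalue that was discarded (0 if K was already PSD)."""
    lam, U = np.linalg.eigh(K)
    lam_trunc = np.clip(lam, 0.0, None)
    K_pd = (U * lam_trunc) @ U.T            # U diag(lam_trunc) U^T
    lam_neg = lam.min() if lam.min() < 0 else 0.0
    return K_pd, lam_neg

# Hypothetical NPD similarity matrix (eigenvalues 2.5 and -0.5):
S = np.array([[1.0, 1.5],
              [1.5, 1.0]])
S_pd, lam_neg = truncate_to_pd(S)
print(np.round(S_pd, 3), " discarded eigenvalue:", lam_neg)
```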

Page 198

6.3 Empirical Vector Space

The K² Approach [36]:

• Supervised Learning
• Unsupervised Learning: Empirical K-Means

Page 199

Empirical K-Means: Empirical Vectors

Example for Pseudo-Norm in Type-B (the same A, B, C, C′ configuration, now mapped to empirical vectors).

The proximity of the twins is preserved in the empirical space.

Page 200

Page 201

Empirical Eigen-Space vs. Spectral Space

• It is a Hilbert space endowed with the conventional (linear) dot-product.
• It has a finite dimension, N, with non-orthogonal bases; its energy is more concentrated in the higher components.
• It is dependent on the training dataset.

The Empirical Eigen-Space has:
1. Positive eigenvalues for all kinds of data (vectorial/nonvectorial).
2. A stronger denoising capability than the spectral space (its energy is more concentrated in the higher components).

A minimal sketch of clustering in this space follows.
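As an illustration of the empirical-vector (K²) route, a minimal sketch (scikit-learn's KMeans assumed; the helper name and the toy NPD matrix are hypothetical) that represents each object by its empirical vector, i.e., its column of the kernel/similarity matrix, and runs ordinary K-means on these vectors; the induced inner product is (K²)_ij, which is positive semi-definite even when K is NPD:

```python
import numpy as np
from sklearn.cluster import KMeans

def empirical_kmeans(K, n_clusters, n_init=10, seed=0):
    """Empirical K-means: object i is represented by the empirical vector
    e(x_i) = K[:, i]; ordinary K-means is run on these vectors.
    The implied kernel <e(x_i), e(x_j)> = (K @ K)[i, j] = (K^2)_{ij}
    is positive semi-definite even if K itself is NPD."""
    E = K.T                     # row i = empirical vector of object i
    return KMeans(n_clusters=n_clusters, n_init=n_init,
                  random_state=seed).fit_predict(E)

# Hypothetical NPD similarity matrix for 4 objects (two similar pairs):
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 1.2],   # the 1.2 entry makes S NPD
              [0.1, 0.1, 1.2, 1.0]])
print("NPD:", np.linalg.eigvalsh(S).min() < 0, " labels:", empirical_kmeans(S, 2))
```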

Page 202

Approximation in Empirical Eigen-Space

If the kernel matrix is NPD, it is necessary to convert it to PD. One way to achieve this is by adopting a PD kernel matrix in the empirical space.

In the full dimension, the full-size columns are used as the training vectors; the dimension-reduced vectors are formed from the truncated columns.

Page 203

Perturbation Analysis

Perturbation bound: the deviation introduced by discarding the components beyond the m-th is again upper bounded by Kλm+1.

By using the same perturbation analysis, and noting that the perturbation is upper bounded by the largest eigenvalue (in magnitude) among the discarded components, the bound Kλm+1 follows.

Here K denotes the number of clusters (instead of the kernel matrix).

Page 204

Future Research: Kernel Approach to the Integration of Heterogeneous Data [33].

Page 205

References

Page 206

References

Page 207

References

Page 208

References

For background reading on Linear Algebra:

G. Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2003.

G. Golub and C. Van Loan, "Matrix Computations", Johns Hopkins University Press, 1989.