PCA Network: Unsupervised Learning Networks

Post on 22-Dec-2015


TRANSCRIPT

Page 1: PCA Network: Unsupervised Learning Networks. PCA is a representation network useful for signal, image, and video processing.

• PCA Network

Unsupervised Learning Networks

Page 2:

PCA is a representation network useful for signal, image, and video processing.

Page 3: PCA Networks

In order to analyze multi-dimensional input vectors, the representation with maximum information is principal component analysis (PCA).

PCA:

• per component: extract the most significant features;

• inter-component: avoid duplication or redundancy between the neurons.

Page 4:

An estimate of the autocorrelation matrix is obtained by taking the time average over the sample vectors:

R̂x = (1/M) Σt x(t) x(t)T

Its eigen-decomposition is

Rx = U Λ UT
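The time-average estimate and its eigen-decomposition can be checked numerically; the data matrix, dimensions, and seed below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# M hypothetical sample vectors x(t), stacked as rows of X
M, n = 1000, 3
X = rng.normal(size=(M, n)) @ np.diag([3.0, 1.5, 0.5])

# time-average estimate of the autocorrelation matrix: R = (1/M) sum_t x(t) x(t)^T
R = (X.T @ X) / M

# eigen-decomposition R = U Lambda U^T (R is symmetric, so eigh applies)
lam, U = np.linalg.eigh(R)
lam, U = lam[::-1], U[:, ::-1]        # reorder eigenvalues to descending

print(np.allclose(U @ np.diag(lam) @ U.T, R))   # True: decomposition reconstructs R
```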

Page 5:

The optimal matrix W is formed by the first m eigenvectors of Rx:

x̂(t) = W a(t)

The errors of the optimal estimate are [Jain89]:

• matrix-2-norm error = λm+1

• least-mean-square error = Σi=m+1..n λi
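The least-mean-square formula can be verified numerically: keeping the first m eigenvectors, the mean-squared reconstruction error equals the sum of the discarded eigenvalues. The data and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n, m = 5000, 4, 2

# hypothetical zero-mean data with an anisotropic spectrum
X = rng.normal(size=(M, n)) @ np.diag([3.0, 2.0, 0.8, 0.3])

R = (X.T @ X) / M
lam, U = np.linalg.eigh(R)
lam, U = lam[::-1], U[:, ::-1]          # descending eigenvalues

W = U[:, :m]                            # first m eigenvectors
a = X @ W                               # a(t) = W^T x(t)
Xhat = a @ W.T                          # reconstruction x^(t) = W a(t)

mse = np.mean(np.sum((X - Xhat) ** 2, axis=1))
print(mse - lam[m:].sum())              # ~0: LMS error equals sum of discarded eigenvalues
```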

Page 6: First PC

To enhance the correlation between the input x(t) and the extracted component a(t), it is natural to use a Hebbian-type rule:

a(t) = w(t)T x(t)

w(t+1) = w(t) + β x(t) a(t)

Page 7: Oja Learning Rule

The Oja learning rule is equivalent to a normalized Hebbian rule: apply the Hebbian update, renormalize w to unit length, and expand to first order in β. The result is

Δw(t) = β [x(t) a(t) − w(t) a(t)2]
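A minimal sketch of the Oja rule on a synthetic data stream (the data, learning rate, and seed are illustrative assumptions); the weight vector should align with the principal eigenvector of Rx while staying near unit length:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 20000, 3

# hypothetical correlated input stream x(t)
X = rng.normal(size=(M, n)) @ np.diag([3.0, 1.5, 0.5])

w = rng.normal(size=n)
w /= np.linalg.norm(w)
beta = 1e-3                                # small fixed learning rate (assumption)

for x in X:
    a = w @ x                              # a(t) = w(t)^T x(t)
    w += beta * (a * x - (a ** 2) * w)     # Oja rule: Hebbian term minus normalization

# compare with the principal eigenvector e1 of the estimated Rx
lam, U = np.linalg.eigh((X.T @ X) / M)
e1 = U[:, -1]
print(abs(w @ e1))                         # should be close to 1 (alignment up to sign)
```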

Page 8:

Page 9: Single Component

Convergence theorem: by the Oja learning rule, w(t) converges asymptotically (with probability 1) to

w = w(∞) = e1

where e1 is the principal eigenvector of Rx.

Page 10:

Proof: starting from the Oja rule,

Δw(t) = β [x(t) a(t) − w(t) a(t)2]

Δw(t) = β [x(t) x(t)T w(t) − a(t)2 w(t)]

Take the average over a block of data, redenote ť as the block time index, and let σ(ť) denote the block average of a(t)2:

Δw(ť) = β [Rx − σ(ť) I] w(ť)

Δw(ť) = β [U Λ UT − σ(ť) I] w(ť)

Δw(ť) = β U [Λ − σ(ť) I] UT w(ť)

Δ UT w(ť) = β [Λ − σ(ť) I] UT w(ť)

With Θ(ť) = UT w(ť):

ΔΘ(ť) = β [Λ − σ(ť) I] Θ(ť)

Page 11: Convergence Rates

Write Θ(ť) = [θ1(ť) θ2(ť) … θn(ť)]T. Each of the eigen-components is enhanced or dampened by

θi(ť+1) = [1 + β′(λi − σ(ť))] θi(ť)

so the relative dominance of the principal component grows, with growth rate

[1 + β′(λi − σ(ť))] / [1 + β′(λ1 − σ(ť))]

Page 12: Simulation: Decay Rates of PCs
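The decay of the non-principal components can be reproduced by iterating the averaged dynamics directly in the eigenbasis; the eigenvalue spectrum and step size below are illustrative assumptions:

```python
import numpy as np

# averaged Oja dynamics in the eigenbasis:
#   θi(t+1) = [1 + β(λi − σ(t))] θi(t),  with σ(t) = Θ^T Λ Θ (the mean of a(t)^2)
lam = np.array([4.0, 2.0, 1.0, 0.5])           # hypothetical eigenvalue spectrum
beta = 0.01
theta = np.ones_like(lam) / np.sqrt(len(lam))  # equal mixture of all eigen-components

for _ in range(5000):
    sigma = theta @ (lam * theta)
    theta = (1.0 + beta * (lam - sigma)) * theta

print(np.round(theta, 4))   # principal component survives; the others decay toward 0
```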

Page 13: Multiple Principal Components

How to extract multiple principal components?

Page 14:

Let W denote an n×m weight matrix:

ΔW(t) = β [x(t) − W(t) a(t)] a(t)T

Concern: duplication/redundancy among the extracted components.
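A sketch of this multi-component (subspace) rule on synthetic data: W should converge to an approximately orthonormal basis of the span of the top-m eigenvectors, though not necessarily to the eigenvectors themselves, which illustrates the redundancy concern. Data, sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, m = 30000, 4, 2

# hypothetical data whose top-2 eigenvectors are the first two coordinate axes
X = rng.normal(size=(M, n)) @ np.diag([3.0, 2.0, 0.7, 0.3])

W = 0.1 * rng.normal(size=(n, m))
beta = 1e-3

for x in X:
    a = W.T @ x                         # a(t) = W(t)^T x(t)
    W += beta * np.outer(x - W @ a, a)  # subspace rule: dW = beta (x - W a) a^T

# columns of W should (approximately) span the top-m eigenspace of Rx
lam, U = np.linalg.eigh((X.T @ X) / M)
Em = U[:, -m:]                          # top-m eigenvectors
residual = np.linalg.norm(W - Em @ Em.T @ W)
print(residual)                         # small: columns of W lie in span(e1, e2)
```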

Page 15: Deflation Method

Assume that the first component has already been obtained; then the output value can be "deflated" by the following transformation:

x̃ = (I − w1 w1T) x
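A numerical sketch of deflation (the data below are a hypothetical example): after the w1-component is removed, the deflated data carry no energy along w1, and their principal eigenvector is the second eigenvector of the original Rx:

```python
import numpy as np

rng = np.random.default_rng(4)
M, n = 5000, 3
X = rng.normal(size=(M, n)) @ np.diag([3.0, 1.5, 0.5])

R = (X.T @ X) / M
lam, U = np.linalg.eigh(R)
w1 = U[:, -1]                           # principal eigenvector e1

# deflation: x~ = (I - w1 w1^T) x removes the first principal component
D = np.eye(n) - np.outer(w1, w1)
Xt = X @ D                              # D is symmetric, so x~(t) = D x(t)

lam2, U2 = np.linalg.eigh((Xt.T @ Xt) / M)
print(np.abs(Xt @ w1).max())            # ~0: no energy left along w1
print(abs(U2[:, -1] @ U[:, -2]))        # ~1: top PC of deflated data is e2
```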

Page 16: Lateral Orthogonalization Network

The basic idea is to allow the old hidden units to influence the new units so that the new ones do not duplicate information (in full or in part) already provided by the old units. By this approach, the deflation process is effectively implemented in an adaptive manner.

Page 17:

Page 18: APEX Network (multiple PCs)

Page 19:

APEX: Adaptive Principal-component Extractor

The Oja rule for the i-th component (e.g. i = 2):

Δwi(t) = β [x(t) ai(t) − wi(t) ai(t)2]

The dynamic orthogonalization rule (e.g. i = 2, j = 1):

Δαij(t) = β [ai(t) aj(t) − αij(t) ai(t)2]
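A sketch of the two coupled APEX rules for the second unit. For simplicity the first unit is assumed already converged to e1 (found here by direct eigendecomposition rather than by learning); the data, learning rate, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
M, n = 40000, 3
X = rng.normal(size=(M, n)) @ np.diag([3.0, 1.5, 0.5])

lam, U = np.linalg.eigh((X.T @ X) / M)
w1 = U[:, -1]                       # first unit, assumed already converged (w1 = e1)

w2 = rng.normal(size=n)
w2 /= np.linalg.norm(w2)
alpha = 0.0                         # lateral weight α21
beta = 1e-3

for x in X:
    a1 = w1 @ x
    a2 = w2 @ x - alpha * a1                        # output with lateral inhibition
    w2 += beta * (a2 * x - (a2 ** 2) * w2)          # Oja rule for the 2nd component
    alpha += beta * (a1 * a2 - (a2 ** 2) * alpha)   # dynamic orthogonalization rule

e2 = U[:, -2]
print(abs(w2 @ e2))                 # should be close to 1: w2 extracts the second PC
```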

Page 20: Convergence Theorem: Multiple Components

The Hebbian weight matrix W(t) in APEX converges asymptotically to the matrix formed by the m largest principal components. That is, W(t) converges (with probability 1) to

W(∞) = W

where W is the matrix formed by the m row vectors wiT, with

wi = wi(∞) = ei

Page 21:

Proof (i = 2, j = 1): the update rules are

Δw2(t) = β [x(t) a2(t) − w2(t) a2(t)2]

Δα(t) = β [a1(t) a2(t) − α(t) a2(t)2]

Premultiplying the first rule by w1T (using w1T x(t) = a1(t)):

w1T Δw2(t) = β [a1(t) a2(t) − w1T w2(t) a2(t)2]

Subtracting the two rules:

Δ[w1T w2(t) − α(t)] = −β [w1T w2(t) − α(t)] a2(t)2

[w1T w2(t+1) − α(t+1)] = [1 − β a2(t)2] [w1T w2(t) − α(t)]

Hence w1T w2(t) − α(t) → 0, i.e. α(t) → w1T w2(t). The second output then becomes

a2(t) = x(t)T w2(t) − α(t) a1(t) → x(t)T [I − w1 w1T] w2(t)

which is exactly the deflated component.

Page 22: Learning Rates of APEX

With the block time index ť:

[w1T w2(ť+1) − α(ť+1)] = [1 − β′ σ(ť)] [w1T w2(ť) − α(ť)]

The fastest decay is obtained with β′ = 1/σ(ť), which suggests the practical learning rates

• β = 1/[Σt a2(t)2]

• β = 1/[Σt γt a2(t)2] (with a forgetting factor γ)

Page 23: Other Extensions

• PAPEX: Hierarchical Extraction

• DCA: Discriminant Component Analysis

• ICA: Independent Component Analysis