TRANSCRIPT
Faster Training in Nonlinear ICA using MISEP (with a simple introduction to nonlinear ICA and to MISEP)
Luís B. Almeida – IST and INESC-ID, Lisbon, Portugal
Summary
♦ Mutual information as a dependence measure
♦ Mutual information as output entropy
♦ Minimizing the mutual information
♦ Examples
♦ Learning speed
♦ Conclusions
What is MISEP? How does it perform nonlinear ICA/BSS?
MISEP is an extension of INFOMAX to nonlinear ICA/BSS.

But how?

♦ We wish to use mutual information (MI) as the dependence measure to be minimized.
♦ INFOMAX minimizes the mutual information, but
  – it is limited to linear ICA;
  – it needs a priori knowledge of the sources' distributions (at least approximately).
♦ We shall extend INFOMAX in two directions:
  – extending it to nonlinear ICA;
  – using adaptive estimation of the components' distributions.
♦ The result is an extension of INFOMAX, which we call MISEP.
Setting
s – source vector
o – observation vector
y – vector of estimated components
F – mixture (linear or nonlinear)
G – ICA system (linear or nonlinear)
[Figure: block diagram – sources s1, s2 pass through the mixture F to give observations o1, o2, which pass through the ICA system G to give estimated components y1, y2.]
Mutual information as a dependence measure
o – observations
y – estimated components
Mutual information:

$$I(\mathbf{Y}) = \sum_i H(Y_i) - H(\mathbf{Y})$$

– the sum of the marginal entropies minus the joint entropy.

♦ $I(\mathbf{Y})$ is also the Kullback-Leibler divergence between $\prod_i p_{Y_i}(Y_i)$ and the true distribution $p_{\mathbf{Y}}(\mathbf{Y})$.
♦ $I(\mathbf{Y})$ is non-negative, and is zero only if the $Y_i$ are independent of one another. It is therefore a good measure of the dependence of the $Y_i$.
[Figure: the ICA system G maps observations o1, o2 to estimated components y1, y2.]
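To make the definition concrete, here is a minimal numerical sketch using histogram-based entropy estimates (the estimator, the binning, and the sample data are illustrative choices, not part of MISEP):

```python
import numpy as np

def entropy_hist(samples, bins=30):
    """Differential entropy estimate from a 1-D histogram (illustrative only)."""
    counts, edges = np.histogram(samples, bins=bins, density=True)
    p, w = counts[counts > 0], np.diff(edges)[counts > 0]
    return -np.sum(p * np.log(p) * w)

def mutual_info_2d(y, bins=30):
    """I(Y) = H(Y1) + H(Y2) - H(Y1, Y2) for a 2-column sample array y."""
    h_marg = sum(entropy_hist(y[:, i], bins) for i in range(y.shape[1]))
    counts, ex, ey = np.histogram2d(y[:, 0], y[:, 1], bins=bins, density=True)
    cells = np.outer(np.diff(ex), np.diff(ey))
    p, c = counts[counts > 0], cells[counts > 0]
    h_joint = -np.sum(p * np.log(p) * c)
    return h_marg - h_joint   # non-negative, ~0 iff the components are independent

# Independent components give I(Y) close to zero; mixing them raises it.
rng = np.random.default_rng(0)
s = rng.laplace(size=(10000, 2))                     # two independent sources
print(mutual_info_2d(s))                             # close to 0
print(mutual_info_2d(s @ [[1.0, 0.5], [0.5, 1.0]]))  # clearly positive
```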
Expressing the mutual information as output entropy
♦ The mutual information is hard to minimize directly. But…
♦ If the transformations $\psi_i$ are invertible, the mutual information is not affected: $I(\mathbf{Z}) = I(\mathbf{Y})$.
♦ If $\psi_i$ is the cumulative probability function (CPF) of $Y_i$, then $Z_i$ is uniformly distributed in $[0,1]$ and $H(Z_i) = 0$:

$$I(\mathbf{Y}) = I(\mathbf{Z}) = \sum_i H(Z_i) - H(\mathbf{Z}) = -H(\mathbf{Z})$$

♦ Maximizing the output entropy $H(\mathbf{Z})$ is therefore equivalent to minimizing $I(\mathbf{Y})$.
[Figure: the outputs y1, y2 of G pass through the nonlinearities ψ1, ψ2, yielding z1, z2.]
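This is the classical probability integral transform; a quick sketch with an illustrative unit-Laplace component (the distribution choice is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.laplace(size=100_000)     # a component Y_i with a known distribution

def laplace_cdf(t):
    """CPF psi_i of a unit Laplace variable (closed form)."""
    return np.where(t < 0, 0.5 * np.exp(t), 1.0 - 0.5 * np.exp(-t))

z = laplace_cdf(y)                # Z_i = psi_i(Y_i)

# Z_i is (empirically) uniform in [0, 1], hence H(Z_i) = 0
counts, _ = np.histogram(z, bins=20, range=(0.0, 1.0))
print(counts / len(z))            # every bin close to 1/20 = 0.05
```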
How do we find the cumulative functions?
♦ INFOMAX (Bell & Sejnowski, 1995) – cumulative functions known a priori (at least approximately).
♦ MISEP – estimates the CPFs adaptively, by maximum entropy.
Estimating the cumulative functions (continued)
$$I(\mathbf{Y}) = \sum_i H(Z_i) - H(\mathbf{Z}) \qquad\qquad H(\mathbf{Z}) = \sum_i H(Z_i) - I(\mathbf{Y})$$

♦ If the distributions of the components $Y_i$ were kept fixed, maximizing $H(\mathbf{Z})$ would be equivalent to maximizing each of the marginal entropies…
♦ … but at the end of training (at convergence) the $Y_i$ are fixed!
♦ If each of the outputs $Z_i$ is bounded to $[0,1]$, it will become uniform in that interval, and each $\psi_i$ will be the CPF of $Y_i$, as desired.
♦ $\psi_i$ will have to be constrained to be an increasing function.
Estimating the cumulative functions (continued)
By maximizing the output entropy we will:
♦ adapt the output MLPs to yield the CPFs of the $Y_i$;
♦ minimize the mutual information $I(\mathbf{Y})$.

The output MLPs are restricted to yield monotonically increasing functions, bounded to $[0,1]$.
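One simple way to enforce such constraints, sketched here for illustration (not necessarily the paper's exact scheme), is to square the weights of a small MLP with sigmoid units, so that the map is increasing and its output stays in $(0,1)$:

```python
import numpy as np

def monotone_psi(y, w1, b1, w2, b2):
    """Increasing map bounded to (0, 1): non-negative (squared) weights
    combined with increasing, bounded sigmoid activations."""
    sig = lambda u: 1.0 / (1.0 + np.exp(-u))
    h = sig(np.outer(y, w1 ** 2) + b1)   # each hidden unit increasing in y
    return sig(h @ (w2 ** 2) + b2)       # output increasing, in (0, 1)

rng = np.random.default_rng(6)
w1, b1 = rng.normal(size=4), rng.normal(size=4)
w2, b2 = rng.normal(size=4), rng.normal()
y = np.linspace(-5.0, 5.0, 101)
z = monotone_psi(y, w1, b1, w2, b2)
print(np.all(np.diff(z) >= 0))           # True: monotonically increasing
print(0 < z.min() and z.max() < 1)       # True: bounded
```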
Maximizing the output entropy
$$H(\mathbf{Z}) = H(\mathbf{O}) + \left\langle \log \left| \det \mathbf{J} \right| \right\rangle$$

with $\mathbf{J} = \partial \mathbf{Z} / \partial \mathbf{O}$ (the Jacobian of the transformation).

$H(\mathbf{O})$ is fixed. We need to maximize $\langle \log |\det \mathbf{J}| \rangle$.
But how do we do that?
Network that computes J:
The upper part of the figure is the separating network. The lower part computes the Jacobian.
The lower part is essentially a linearized version of the upper part. Its input is the identity matrix.
[Figure: the upper network (weight blocks A, B, C, D with nonlinearities Φ) maps o through y to z; the lower network uses the same weights but with Φ replaced by its derivative Φ′, takes the identity matrix I as input, and outputs the Jacobian J.]
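A minimal sketch of this construction for a simplified one-hidden-layer separating network (the shapes, the tanh nonlinearity, and the random weights are illustrative stand-ins for the A, B, C, D blocks of the figure):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2                                    # number of observations/components
A = rng.normal(size=(5, n))              # hidden-layer weights (illustrative)
B = rng.normal(size=(n, 5))              # output weights (illustrative)
phi = np.tanh                            # nonlinearity Φ
dphi = lambda u: 1.0 - np.tanh(u) ** 2   # its derivative Φ'

def forward(o):
    """Upper network: y = B Φ(A o)."""
    return B @ phi(A @ o)

def jacobian(o):
    """Lower network: propagate the identity matrix through the linearized
    net, with Φ replaced by a diagonal scaling by Φ'(A o)."""
    H = A @ np.eye(n)                    # identity matrix fed into the net
    return B @ (dphi(A @ o)[:, None] * H)

o = rng.normal(size=n)
J = jacobian(o)

# finite-difference check that J is indeed ∂y/∂o
eps = 1e-6
J_fd = np.column_stack([(forward(o + eps * e) - forward(o - eps * e)) / (2 * eps)
                        for e in np.eye(n)])
print(np.allclose(J, J_fd, atol=1e-6))   # True
```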
Maximization of the entropy (continued)
We have to backpropagate through the lower network (and through the shaded arrows, into the upper network).
Input to the backpropagation network:

$$\frac{\partial \log |\det \mathbf{J}|}{\partial \mathbf{J}} = \left( \mathbf{J}^{-1} \right)^{T}$$
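A quick numerical check of this gradient on a random test matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
J = rng.normal(size=(2, 2))
grad = np.linalg.inv(J).T                # (J^{-1})^T

# finite-difference check, entry by entry
eps = 1e-6
fd = np.zeros_like(J)
for i in range(2):
    for j in range(2):
        Jp, Jm = J.copy(), J.copy()
        Jp[i, j] += eps
        Jm[i, j] -= eps
        fd[i, j] = (np.log(abs(np.linalg.det(Jp)))
                    - np.log(abs(np.linalg.det(Jm)))) / (2 * eps)
print(np.allclose(grad, fd, atol=1e-6))  # True
```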
Examples

1. Linear ICA – two supergaussians

[Figure: six signal waveforms plotted against sample index, up to 2×10⁴ samples.]
Scatter plots

[Figure: scatter plots of the original mixture and of just the 100 training points, together with the estimated cumulative functions.]
A supergaussian and a subgaussian

[Figure: scatter plots and estimated cumulative functions for this case.]
Nonlinear ICA, two supergaussians

[Figure: scatter plots and estimated cumulative functions (labels not recoverable).]

A supergaussian and a subgaussian

[Figure: scatter plots and estimated cumulative functions (labels not recoverable).]
Two subgaussians

[Figure: scatter plots and estimated cumulative functions (labels not recoverable).]

A local minimum of the mutual information

[Figure: scatter plots of a solution corresponding to a local minimum of the mutual information (labels not recoverable).]
Nonlinear mixture of two speech signals
(listen to the demo)
Mixture:

$$o_1 = s_1 + a\,(s_2)^2 \qquad\qquad o_2 = s_2 + a\,(s_1)^2$$
Signal-to-interference ratios:
Mixture: 9.1 dB
Separated: 16.9 dB
Improvement: 7.8 dB
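For reference, a sketch of how such signal-to-interference ratios can be computed when the signal and interference parts of a channel are available separately (the helper name, the constant a, and the sample data are illustrative):

```python
import numpy as np

def sir_db(signal, interference):
    """Signal-to-interference ratio in dB: 10 log10(P_signal / P_interference)."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(interference ** 2))

# e.g. in the mixture o1 = s1 + a (s2)^2, the interference in o1 is a (s2)^2
rng = np.random.default_rng(4)
s1, s2 = rng.laplace(size=(2, 10_000))
a = 0.1
print(sir_db(s1, a * s2 ** 2))   # SIR of the raw mixture channel o1
```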
Problem
Learning is often slower than one would expect
[Figure: scatter plots of the separation after 350 epochs. It improves very slowly over the next 600 epochs.]
Why can't the system "spread" the high-density region into the low-density one?

Possible cause: the units of the MLP are non-local. Moving a unit to improve a part of the space would harm some other part of the space.
Solution
Use local units (e.g. Radial Basis Function units)
Nonlinear ICA block:
♦ The direct connections yield a linear mapping, which is then modified by the RBF units.
♦ The RBF units' centers are trained by k-means; the radii are computed by a simple heuristic.
♦ Only the output weights are trained by the gradient of the objective function.

[Figure: nonlinear ICA block – the observation vector o feeds both the direct connections and the RBF units, whose outputs are combined into y.]
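A minimal sketch of such a block, assuming scikit-learn's KMeans for the centers and the distance to the nearest other center as the radius heuristic (the heuristic and all names here are illustrative choices, not necessarily those of the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_rbf_block(obs, n_units=20):
    """Fit the non-trainable parts: centers by k-means, radii by a heuristic."""
    centers = KMeans(n_clusters=n_units, n_init=10,
                     random_state=0).fit(obs).cluster_centers_
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    radii = d.min(axis=1)                 # distance to nearest other center
    return centers, radii

def rbf_features(obs, centers, radii):
    """Gaussian RBF activations, plus the raw observations (direct connections)."""
    d2 = ((obs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.hstack([obs, np.exp(-d2 / (2.0 * radii ** 2))])

# Only the output weights W would be trained by the gradient of the
# entropy objective; centers and radii stay fixed after initialization.
rng = np.random.default_rng(5)
obs = rng.normal(size=(1000, 2))
centers, radii = build_rbf_block(obs)
F = rbf_features(obs, centers, radii)     # shape (1000, 2 + 20)
W = 0.01 * rng.normal(size=(F.shape[1], 2))
y = F @ W                                 # estimated components
```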
Results
[Figure: scatter plots of the components recovered with the MLP (left) and with the RBF network (right).]
Number of epochs:

                 Two supergaussians      Supergaussian & subgaussian
                 MLP        RBF          MLP        RBF
Mean             500        68           610        233
St. deviation    152        10           266        87
But more recent results
(which didn't make it into this paper)
♦ The speed advantage of RBFs may be due more to initialization than to locality.
♦ MLPs with hidden units initialized "crisscrossing" the whole observation space have shown learning speeds comparable to those of RBFs.
♦ MLPs don't usually need explicit regularization, but RBFs do – and the amount of regularization has to be adjusted by hand in each case.
♦ Download the most recent preprint (see the last page).
Conclusions
• Extension of INFOMAX.
• ICA is performed by minimizing the mutual information of the extracted components.
• Estimation of the independent components and of their distributions is performed by a single network, with a single objective function.
• Can handle a wide variety of components' distributions.
• Able to perform linear and nonlinear ICA and nonlinear source separation.
• Networks of local units yield better performance.
  – But this may have more to do with a good initialization than with the local units (see the preprint of the new paper submitted to Signal Processing).
A related issue
Is nonlinear source separation really possible?
♦ Purely blind nonlinear source separation is an ill-posed problem. It has an infinite number of solutions, not trivially related to one another.
♦ But we often solve ill-posed problems (e.g. the training of multilayer perceptrons).
♦ What we need is some extra information, which is often available (e.g. smoothness).
♦ We can then use regularization to find an essentially unique solution.
♦ In our MLP-based examples, the regularization inherent to the MLP sufficed.
♦ In the RBF-based examples we needed explicit regularization – weight decay.
♦ But in all our test cases we were able to perform nonlinear source separation.
Christian Jutten's counter-example
(NIPS 2002 workshop)
[Figure: an identity mapping of a uniform distribution (the smoothest possible mapping) and a twisted mapping of it (still uniform, but less smooth).]

A smoothing regularizer would select the first mapping, and not the second one.
Most recent and most comprehensive preprint
Luís B. Almeida, "MISEP – Linear and Nonlinear ICA Based on Mutual Information", submitted to Signal Processing, special issue on ICA.
Download at http://neural.inesc-id.pt/~lba/papers/AlmeidaSigProc2003.pdf
Probably also already available at the COGPRINTS archive, http://cogprints.ecs.soton.ac.uk/
(search for 'MISEP' in the title)
MATLAB-compatible toolkit
http://neural.inesc-id.pt/~lba/ica/mitoolbox.html