
Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA

Aapo Hyvärinen

with Hiroshi Morioka

Dept of Computer Science, University of Helsinki, Finland

Facebook AI Summit, 13th June 2016


Abstract

- How to extract features from multi-dimensional data when there are no labels (unsupervised)?

- We consider data with temporal structure

- We learn features that enable discriminating data from different time segments (taking segment indices as class labels)

- We use ordinary neural networks with multinomial logistic regression: the last hidden layer gives the features

- Surprising theoretical result: this learns to estimate a nonlinear ICA model with
  - general nonlinear mixing x(t) = f(s(t))
  - nonstationary components s_i(t)


Background: Need for generative models like ICA

- Unsupervised deep learning is a largely unsolved problem

- Important since labels are often difficult (costly) to obtain

- Most approaches are heuristic; it is not very clear what they are doing

- Best would be to define a generative model, and estimate it

- Cf. linear unsupervised learning:
  - independent component analysis (ICA) / sparse coding:
    generative models which are well-defined, i.e. identifiable
    (Darmois–Skitovich around 1950; Comon, 1994)

- If we define and estimate generative models:
  - we know better what we are doing
  - we can use all the theory of probabilistic methods
  - ... but admittedly, it is theoretically more challenging


Background: Nonlinear ICA may not be well-defined

- For a random vector x, it is easy to assume a nonlinear generative model

      x = f(s)                                            (1)

  with mutually independent hidden/latent components s_i.

- However, the model is not identifiable:
  - i.e. many different nonlinear transforms of x give independent components: no guarantee we can recover the original s_i
  - if we assume data with no temporal structure, and general smooth invertible nonlinearities f
    (Darmois, 1952; Hyvärinen and Pajunen, 1999)

- Nevertheless, estimation has been attempted by many authors, e.g. Tan and Zurada (2001), Almeida (2003), and recent deep learning work (Dinh et al., 2015)


Background: Temporal correlations can help

- Harmeling et al. (2003) suggested using temporal structure:
  find features that change as slowly as possible (Földiák, 1991)

      x ⇒ s

  - they used kernel-based models of the nonlinearities

- A well-known idea in the linear ICA (source separation) literature (Tong et al., 1991; Belouchrani et al., 1997)

- In the linear case, identifiable if the autocorrelations are distinct for different sources (a rather strict condition!)

- In the nonlinear case, identifiability is unknown, but certainly not better than in the linear case!


Background: Temporal structure as nonstationarity

- A less-known principle in linear source separation:
  sources are nonstationary (Matsuoka et al., 1995)

      x ⇒ s

- Usually, we assume the variances of the sources change in time:

      s_i(t) ∼ N(0, σ_i(t)²)                              (2)

- The linear model x(t) = As(t) is identifiable under weak assumptions (Pham and Cardoso, 2001)

- So far, this principle has not been used in the nonlinear case... (a data-generation sketch of the model follows)
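
To make the generative model concrete, here is a minimal sketch (not the authors' code; all sizes are illustrative) that generates sources with segment-wise modulated variances as in (2) and mixes them with a random invertible MLP as in (1):

```python
# Sketch of the assumed generative model: nonstationary sources + nonlinear mixing.
import numpy as np

rng = np.random.default_rng(0)
n, n_segments, seg_len = 4, 64, 512          # sources, segments, points per segment
T = n_segments * seg_len

# Nonstationary sources: s_i(t) ~ N(0, sigma_i(tau)^2), constant within a segment.
sigmas = rng.uniform(0.2, 2.0, size=(n_segments, n))
s = rng.standard_normal((T, n)) * np.repeat(sigmas, seg_len, axis=0)

# Smooth invertible mixing x = f(s): a random leaky-ReLU MLP (leaky ReLU and
# almost-surely invertible square weight matrices keep f invertible).
def leaky_relu(z, alpha=0.2):
    return np.where(z > 0, z, alpha * z)

x = s
for _ in range(2):                            # two mixing layers
    x = leaky_relu(x @ rng.standard_normal((n, n)))

labels = np.repeat(np.arange(n_segments), seg_len)   # segment index = class label
```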


Time-contrastive learning: Intuitive motivation

- Assume we are given an n-dimensional time series x(t), with t the time index

- Divide the time series (arbitrarily) into k segments (e.g. bins of equal size, 100–1000 points in each segment)

- Train a multi-layer perceptron to discriminate between the segments:
  - the number of classes is k; the index of the segment is the class label
  - use multinomial logistic regression: well-known algorithms/software

- The classifier should find a good representation in its hidden layers, in particular regarding nonstationarity

- Turns unsupervised learning into supervised learning, cf. noise-contrastive estimation or generative adversarial nets (a minimal training sketch follows)
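
As a concrete illustration, a minimal TCL training loop in PyTorch (an assumed implementation, not the original code; it reuses x and labels from the generative sketch above):

```python
# Minimal TCL sketch: MLP + multinomial logistic regression on segment labels;
# the last hidden layer is the feature extractor h(x).
import torch
import torch.nn as nn

class TCLNet(nn.Module):
    def __init__(self, n_in, n_classes):
        super().__init__()
        self.features = nn.Sequential(        # feature extractor h(x)
            nn.Linear(n_in, 32), nn.ReLU(),
            nn.Linear(32, n_in), nn.ReLU(),   # last hidden layer: one unit per source
        )
        self.classifier = nn.Linear(n_in, n_classes)  # softmax (multinomial LR) layer

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h), h

X = torch.as_tensor(x, dtype=torch.float32)
y = torch.as_tensor(labels)
net = TCLNet(n_in=n, n_classes=n_segments)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()               # = multinomial logistic regression loss

for epoch in range(500):                      # full-batch training, for simplicity
    opt.zero_grad()
    logits, _ = net(X)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()

with torch.no_grad():
    _, h = net(X)                             # learned features h(x(t))
```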


Theorem: TCL estimates nonlinear nonstationary ICA

- Assume the data follows the nonlinear ICA model x(t) = f(s(t)) with:
  - independent sources s_i(t) with nonstationary variances,
    i.e. s_i(t) ∼ N(0, σ_i(τ)²) in segment τ
  - smooth, invertible nonlinear mixing f: Rⁿ → Rⁿ
  - (+ technical assumptions on the non-degeneracy of the σ_i(τ))

- Assume we apply time-contrastive learning on x(t):
  - i.e. logistic regression to discriminate between time segments
  - using an MLP whose last-hidden-layer outputs form the vector h(x(t))

- Then s(t)² = A h(x(t)) for some linear mixing matrix A (squaring is element-wise)

- I.e.: TCL demixes the nonlinear ICA model up to a linear mixing (which can be estimated by linear ICA) and up to squaring

- This is a constructive proof of identifiability (up to squaring); see the post-processing sketch below
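
Per the theorem, h(x(t)) equals s(t)² only up to a linear mixing A, so a final linear ICA step on the features recovers the squared sources. A short sketch with scikit-learn's FastICA, continuing the variables from the sketches above:

```python
# Undo the remaining linear mixing A with ordinary linear ICA.
from sklearn.decomposition import FastICA

s2_est = FastICA(n_components=n, max_iter=1000, random_state=0).fit_transform(h.numpy())
# s2_est should now match s(t)^2 up to permutation, scaling, and sign.
```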


Illustration and comments

[Figure: A) Generative model: n source signals s_1(t), ..., s_n(t) over time t = 1, 2, 3, ..., T, with variances changing across segments, passed through a nonlinear mixture to give the n observed signals. B) Time-contrastive learning: a feature extractor (MLP) computes m feature values h(x(t)), and multinomial logistic regression predicts the segment labels; Theorem 1 links the two panels.]

- Nonstationarity enables identifiability, since independence of the sources must hold at all time points ⇒ enough constraints

- Many data sets are well known to be nonstationary:
  - video, EEG/MEG, financial time series

- We can generalize the nonstationarity to the exponential family

- We can combine with dimension reduction: find only the nonstationary manifold


Sketch of proof of Theorem

- Notation: h, hidden unit outputs; x_t, data; w_τ, logistic regression coefficients for segment τ; p_τ, data pdf in segment τ.

- By the theory of logistic regression, we learn differences of the log-pdfs of the classes:

      w_τᵀ h(x_t) + b_τ = log p_τ(x_t) − log p_1(x_t) + const.            (3)

- By the nonlinear ICA model, we have

      log p_τ(x) = Σ_{i=1}^n λ_{τ,i} s_i² + log |det Jg(x)| − log Z(λ_τ),   (4)

  where Jg is the Jacobian of the demixing function g = f⁻¹.

- So the s_i² and the h_i(x_t) span the same subspace
  ⇒ the s_i² are linear transformations of the hidden units (spelled out below)
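
To spell out the last step: subtracting (4) for segment 1 from (4) for segment τ cancels the Jacobian term, and comparing with (3) gives the key identity (a sketch in the slide's notation):

```latex
\[
  \mathbf{w}_\tau^\top \mathbf{h}(\mathbf{x}_t)
  = \sum_{i=1}^{n} \left(\lambda_{\tau,i} - \lambda_{1,i}\right) s_i(t)^2 + \text{const.}
  \qquad \text{for all } \tau
\]
% If the matrix of differences (lambda_{tau,i} - lambda_{1,i}) has rank n
% (the non-degeneracy assumption on the sigma_i(tau)), the s_i^2 and the
% h_i(x_t) span the same subspace, i.e. s(t)^2 = A h(x(t)).
```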


Simulations with artificial data

Create data according to the model and try to recover the sources. The nonlinear mixing is by another (random) MLP; segment length 512 points.

[Figure: two panels, both as a function of the number of segments (8, 16, 32, 64, 128, 256, 512). Left, "Recovery of sources": mean correlation (0–1) between true and estimated sources for TCL, NSVICA, and kTDSEP with L = 1–5 layers, and DAE with L = 1–3. Right, "Classification accuracy": accuracy (%) of the TCL classifier for L = 1–5, together with the corresponding chance levels.]

kTDSEP: Harmeling et al. (2003)
DAE: denoising autoencoder
NSVICA: linear nonstationarity-based method
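
The left panel's metric can be computed, for example, as the mean absolute correlation between the true squared sources and the estimated components after an optimal one-to-one matching; a hedged sketch (the exact evaluation details are an assumption), continuing the variables from the earlier sketches:

```python
# Mean absolute correlation between true and estimated components,
# after matching components with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_correlation(s_true, s_est):
    k = s_true.shape[1]
    C = np.abs(np.corrcoef(s_true.T, s_est.T)[:k, k:])   # |corr| of all pairs
    rows, cols = linear_sum_assignment(-C)               # best one-to-one matching
    return C[rows, cols].mean()

print(mean_correlation(s**2, s2_est))   # near 1 means the sources were recovered
```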


Experiments with brain imaging data

- MEG data (like EEG, but better)

- Sources estimated from resting-state data (no stimulation)

- a) Validation by classifying another data set with four stimulation modalities: visual, auditory, tactile, rest
  - trained a linear SVM on the estimated sources (a sketch of this step follows below)
  - number of layers in the MLP ranging from 1 to 4

- b) Attempt to visualize the nonlinear processing
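
A sketch of the validation step in a) (the variable names h_task and modality are hypothetical, and the pipeline details are assumptions):

```python
# Linear SVM on TCL features from the task session, predicting the four
# stimulation modalities; compare the score against the 25% chance level.
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svm = LinearSVC(C=1.0)
scores = cross_val_score(svm, h_task, modality, cv=10)   # hypothetical arrays
print(scores.mean())
```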

a) [bar chart: classification accuracy (%) of TCL, DAE, kTDSEP, and NSVICA; for TCL and DAE, L = 1 to 4]   b) [spatial patterns at layers L1, L2, L3]

Figure 3: Real MEG data. a) Classification accuracies of linear SVMs newly trained with task-session data to predict stimulation labels in task sessions, with feature extractors trained in advance on resting-session data. Error bars give standard errors of the mean across ten repetitions. For TCL and DAE, accuracies are given for different numbers of layers L. The horizontal line shows the chance level (25%). b) Example of spatial patterns of nonstationary components learned by TCL. Each small panel corresponds to one spatial pattern, with the measurement helmet seen from three different angles (left, back, right); red/yellow is positive and blue is negative. "L3" shows the approximate total spatial pattern of one selected third-layer unit. "L2" shows the patterns of the three second-layer units maximally contributing to this L3 unit. "L1" shows, for each L2 unit, the two most strongly contributing first-layer units.

Results. Figure 3a) shows the comparison of classification accuracies between the different methods, for different numbers of layers L = {1, 2, 3, 4}. The classification accuracies of the TCL method were consistently higher than those of the other (baseline) methods.¹ We can also see the superior performance of multi-layer networks (L ≥ 3) compared with the linear case (L = 1), which indicates the importance of nonlinear demixing in the TCL method.

Figure 3b) shows an example of spatial patterns learned by the TCL method. For simplicity of visualization, we plotted spatial patterns for the three-layer model. We manually picked one of the ten hidden nodes in the third layer and plotted its weighted-averaged sensor signals (Figure 3b, L3). We also visualized the most strongly contributing second- and first-layer nodes. We see progressive pooling of L1 units to form left temporal, right temporal, and occipito-parietal patterns in L2, which are then all pooled together in L3, resulting in a bilateral temporal pattern with a negative contribution from the occipito-parietal region. Most of the spatial patterns in the third layer (not shown) are actually similar to those previously reported using functional magnetic resonance imaging (fMRI) and MEG [2, 4]. Interestingly, none of the hidden units seems to represent artefacts, in contrast to ICA.

8 Conclusion

We proposed a new learning principle for unsupervised feature (representation) learning. It is based on analyzing nonstationarity in temporal data by discriminating between time segments. The ensuing "time-contrastive learning" is easy to implement since it only uses ordinary neural network training: a multi-layer perceptron with logistic regression. However, we showed that, surprisingly, it can estimate independent components in a nonlinear mixing model up to certain indeterminacies, assuming that the independent components are nonstationary in a suitable way. The indeterminacies include a linear mixing (which can be resolved by a further linear ICA step) and component-wise nonlinearities, such as squares or absolute values. TCL also avoids the computation of the gradient of the Jacobian, which is a major problem with maximum likelihood estimation [5].

Our developments also give by far the strongest identifiability proof of nonlinear ICA in the literature. The indeterminacies actually reduce to just inevitable monotonic component-wise transformations in the case of modulated Gaussian sources. Thus, our results pave the way for further developments in nonlinear ICA, which has so far seriously suffered from the lack of almost any identifiability theory. Experiments on real MEG found neuroscientifically interesting networks. Other promising future application domains include video data, econometric data, and biomedical data such as EMG and ECG, in which nonstationary variances seem to play a major role.

¹ Note that classification using the final linear ICA is equivalent to using whitening, since ICA only makes a further orthogonal rotation, and it could be replaced by whitening without affecting classification accuracy.


Conclusion

- We proposed the intuitive idea of time-contrastive learning:
  - divide a multivariate time series into segments, and learn to discriminate between them, e.g. by ordinary MLP (deep) learning
  - unsupervised learning via supervised learning
  - no new algorithms or software needed

- TCL can be shown to estimate a nonlinear ICA model:
  - with general (smooth, invertible) nonlinear mixing functions
  - assuming the sources are nonstationary

- (Note: the likelihood or mutual information of the nonlinear ICA model would be much more difficult to compute)

- First case of nonlinear ICA (or source separation) with general identifiability results!! (?)

- Future work:
  - application to image/video data etc.
  - combining nonstationarity with autocorrelations
