Tensor decompositions in statistical signal processing
mmds.imm.dtu.dk/presentations/comon.pdf

Pierre Comon
I3S, CNRS, University of Nice, Sophia-Antipolis, France
MMDS, July 2, 2009


TRANSCRIPT

Page 1: Title

Tensor decompositions in statistical signal processing

Pierre Comon
I3S, CNRS, University of Nice, Sophia-Antipolis, France
July 2, 2009

Outline: Introduction, Statistics, Tensors, TheEnd, App

Page 2: Linear statistical model

    y = A s + b    (1)

with

- y : K × 1, random ("sensors")
- s : P × 1, random, with statistically independent components ("sources")
- A : K × P, deterministic ("mixing matrix")
- b : errors ("noise"; may be removed for P large enough)

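Model (1) can be sketched numerically; a minimal simulation in which the sizes K, P, the number of samples, and the source law are illustrative choices, not part of the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, N = 4, 3, 1000               # sensors, sources, samples (illustrative)

A = rng.normal(size=(K, P))         # deterministic mixing matrix
S = rng.laplace(size=(P, N))        # independent non-Gaussian sources, one per row
B = 0.1 * rng.normal(size=(K, N))   # additive noise

Y = A @ S + B                       # each column of Y is one realization of y
print(Y.shape)                      # (K, N)
```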

Page 5: Other writing

    y = [A, I] (s ; b)

which is of the noiseless form

    y = Ā s̄    (2)

with a larger source dimension P̄ = P + K.

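The rewriting above is a direct block identity; a quick numerical confirmation (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K, P = 4, 3
A = rng.normal(size=(K, P))
s = rng.normal(size=P)
b = rng.normal(size=K)

A_bar = np.hstack([A, np.eye(K)])   # [A, I], shape K x (P + K)
s_bar = np.concatenate([s, b])      # stacked source vector (s ; b)

# the noisy model equals the augmented noiseless model
assert np.allclose(A @ s + b, A_bar @ s_bar)
```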

Page 7: Taxonomy

Problems addressed in the literature:

K ≥ P: "over-determined" → approximation problem
- can be reduced to a square P × P regular mixture
- A orthogonal or unitary
- A square invertible

K < P: "under-determined" → exact decomposition
- A rectangular with pairwise linearly independent columns


Page 12: Goals

Solely from realizations of the observation vector y:

In the stochastic framework:
- Estimate the matrix A: blind identification
- Estimate realizations of the "source" vector s: blind separation/extraction/equalization

In the deterministic framework:
- Build a data tensor Y thanks to a diversity k, and decompose it:

      Y_ijk = Σ_p A_ip s_jp B_kp

- A and s are jointly estimated

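The deterministic decomposition above is a trilinear (CP-type) model; a minimal sketch of assembling such a tensor from its factors, where the dimensions and random factors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, Kd, P = 4, 5, 3, 2            # tensor dims and number of components (illustrative)

A = rng.normal(size=(I, P))
S = rng.normal(size=(J, P))          # plays the role of s_jp
B = rng.normal(size=(Kd, P))

# Y_ijk = sum_p A_ip S_jp B_kp
Y = np.einsum('ip,jp,kp->ijk', A, S, B)

# cross-check one entry against the explicit sum over p
y_000 = sum(A[0, p] * S[0, p] * B[0, p] for p in range(P))
assert np.isclose(Y[0, 0, 0], y_000)
```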

Page 19: Application areas

1. Telecommunications (cellular, satellite, military)
2. Radar, sonar
3. Biomedical (echography, electroencephalography, electrocardiography)...
4. Speech, audio
5. Machine learning
6. Data mining
7. Control...

Page 20: Identifiability & uniqueness

- Uniqueness/identifiability up to inherent ambiguities
- Finite number of solutions

Page 21: Equivalent representations

Let y admit two representations

    y = A s  and  y = B z

where s (resp. z) has independent components, and A (resp. B) has pairwise noncollinear columns.

These representations are equivalent if every column of A is proportional to some column of B, and vice versa.

If all representations of y are equivalent, they are said to be essentially unique (permutation & scale ambiguities only).

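Essential equivalence (columns matching up to permutation and scaling) can be tested mechanically; the helper below is an illustration written for this transcript, not code from the talk:

```python
import numpy as np

def essentially_equal(A, B, tol=1e-10):
    """True if every column of A is proportional to some column of B and vice versa."""
    def cols_prop(X, Y):
        for x in X.T:
            # x must be collinear with at least one column of Y (rank of [x, y] <= 1)
            if not any(np.linalg.matrix_rank(np.column_stack([x, y]), tol=tol) <= 1
                       for y in Y.T):
                return False
        return True
    return cols_prop(A, B) and cols_prop(B, A)

A = np.array([[1., 0.], [0., 1.]])
B = np.array([[0., 3.], [-2., 0.]])   # columns of A, permuted and rescaled
assert essentially_equal(A, B)
assert not essentially_equal(A, np.array([[1., 1.], [0., 1.]]))
```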

Page 24: Identifiability & uniqueness theorems

Let y be a random vector of the form y = A s, where the s_p are independent and A has pairwise noncollinear columns.

Identifiability theorem: y can be represented as y = A1 s1 + A2 s2, where s1 is non-Gaussian, s2 is Gaussian and independent of s1, and A1 is essentially unique.

Uniqueness theorem: if in addition the columns of A1 are linearly independent, then the distribution of s1 is unique up to scale and location indeterminacies.


Page 27: Example of uniqueness

Let the s_i be independent with no Gaussian component, and the b_i be independent Gaussian. Then the linear model below is identifiable, but A2 is not essentially unique whereas A1 is:

    (s1 + s2 + 2 b1, s1 + 2 b2)ᵀ = A1 s + A2 (b1, b2)ᵀ = A1 s + A3 (b1 + b2, b1 − b2)ᵀ

with

    A1 = [1 1; 1 0],  A2 = [2 0; 0 2],  A3 = [1 1; 1 −1]

Hence the distribution of s is essentially unique, but (A1, A2) is not equivalent to (A1, A3).
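The two Gaussian factorizations on this slide really do coincide; a quick numerical verification of A2 (b1, b2)ᵀ = A3 (b1 + b2, b1 − b2)ᵀ:

```python
import numpy as np

A2 = np.array([[2., 0.], [0., 2.]])
A3 = np.array([[1., 1.], [1., -1.]])

rng = np.random.default_rng(3)
b = rng.normal(size=2)                        # (b1, b2)
b_rot = np.array([b[0] + b[1], b[0] - b[1]])  # (b1 + b2, b1 - b2)

# both factorizations produce the same Gaussian contribution
assert np.allclose(A2 @ b, A3 @ b_rot)
```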

Page 28: Example of non-uniqueness

Let the s_i be independent with no Gaussian component, and the b_i be independent Gaussian. Then the linear model below is identifiable, but the distribution of s is not unique, because a 2 × 4 matrix cannot be full column rank:

    (s1 + s3 + s4 + 2 b1, s2 + s3 − s4 + 2 b2)ᵀ
        = A (s1, s2, s3 + b1 + b2, s4 + b1 − b2)ᵀ
        = A (s1 + 2 b1, s2 + 2 b2, s3, s4)ᵀ

with

    A = [1 0 1 1; 0 1 1 −1]
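Both source assignments above yield the same observation; a check of the identity, plus the rank observation that drives the non-uniqueness:

```python
import numpy as np

A = np.array([[1., 0., 1., 1.],
              [0., 1., 1., -1.]])

rng = np.random.default_rng(4)
s = rng.laplace(size=4)
b = rng.normal(size=2)

z1 = np.array([s[0], s[1], s[2] + b[0] + b[1], s[3] + b[0] - b[1]])
z2 = np.array([s[0] + 2 * b[0], s[1] + 2 * b[1], s[2], s[3]])

assert np.allclose(A @ z1, A @ z2)        # same observation, two "source" vectors
assert np.linalg.matrix_rank(A) == 2 < 4  # A cannot be full column rank
```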

Page 29: Characteristic functions

First characteristic function:
- Real scalar: Φ_x(t) := E{e^{i t x}} = ∫ e^{i t u} dF_x(u)
- Real multivariate: Φ_x(t) := E{e^{i tᵀ x}} = ∫ e^{i tᵀ u} dF_x(u)

Second characteristic function:

    Ψ(t) := log Φ(t)

Properties:
- Always exists in a neighborhood of 0
- Uniquely defined as long as Φ(t) ≠ 0

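For a standard Gaussian the second characteristic function is known in closed form, Ψ_x(t) = −t²/2, which gives an empirical sanity check of the definitions (sample size and tolerance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200_000)         # standard Gaussian samples

t = 0.5
phi = np.mean(np.exp(1j * t * x))    # empirical first c.f.  Phi_x(t)
psi = np.log(phi)                    # empirical second c.f. Psi_x(t)

# true value for N(0,1): Psi(t) = -t^2 / 2 = -0.125 at t = 0.5
assert abs(psi - (-t**2 / 2)) < 2e-2
```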

Page 35: Characteristic functions (cont'd)

Properties of the 2nd characteristic function (cont'd):

- If a c.f. Ψ_x(t) is a polynomial, then its degree is at most 2 and x is Gaussian (Marcinkiewicz, 1938).
  Proof: complicated...

- If (x, y) are statistically independent, then

      Ψ_{x,y}(u, v) = Ψ_x(u) + Ψ_y(v)    (3)

  Proof: Ψ_{x,y}(u, v) = log E{exp i(ux + vy)} = log[ E{exp(i ux)} E{exp(i vy)} ].


Page 40: Problem posed in terms of characteristic functions

If the s_p are independent and y = A s, we have the core equation:

    Ψ_y(u) = Σ_p ψ_p( Σ_q u_q A_qp )    (4)

where ψ_p(w) := log E{exp(i s_p w)}.

Proof: plug y = A s into the definition of Ψ_y and get

    Φ_y(u) := E{exp(i uᵀ A s)} = E{exp(i Σ_{p,q} u_q A_qp s_p)}

Since the s_p are independent,

    Φ_y(u) = Π_p E{exp(i Σ_q u_q A_qp s_p)}

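Equation (4) can be verified in closed form when the sources are unit-variance Gaussian, since then ψ_p(w) = −w²/2 and Ψ_y(u) = −½ uᵀ A Aᵀ u; the Gaussian case is used below only as a convenient exact test:

```python
import numpy as np

rng = np.random.default_rng(6)
K, P = 3, 4
A = rng.normal(size=(K, P))
u = rng.normal(size=K)

# left-hand side: Psi_y(u) for y = A s with s ~ N(0, I)
lhs = -0.5 * u @ A @ A.T @ u

# right-hand side of (4): sum_p psi_p( sum_q u_q A_qp ), with psi_p(w) = -w^2/2
w = A.T @ u                  # w_p = sum_q u_q A_qp
rhs = np.sum(-0.5 * w**2)

assert np.isclose(lhs, rhs)
```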

Page 43: Function decomposition

Problem: equation (4) shows that we have to decompose a multivariate function into a sum of univariate ones.

Weierstrass (1885), Stone (1948), Hilbert (1900), Kolmogorov (1957), Sprecher (1965), Hornik (1989), Cybenko (1989)...

➽ But here, Ψ_y and the ψ_p's are characteristic functions.


Page 45: Problem seen as non-symmetric tensor decomposition

Back to core equation (4):

    Ψ_y(u) = Σ_p ψ_p( Σ_q u_q A_qp ),  u ∈ G

Assumption: the functions ψ_p, 1 ≤ p ≤ P, admit finite derivatives up to order d in a neighborhood of the origin containing G.

Taking d = 3 as a working example:

    ∂³Ψ_y / ∂u_i ∂u_j ∂u_k (u) = Σ_{p=1}^{P} A_ip A_jp A_kp ψ_p^{(3)}( Σ_{q=1}^{K} u_q A_qp )

which is of the form T_ijkℓ = Σ_p A_ip A_jp A_kp F_ℓp, if evaluated at L points u_ℓ ∈ G.

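Stacking the third derivative at L grid points gives a 4-way tensor, symmetric in the first three modes and free in the last; a small sketch of that construction, where the values in F stand in for ψ_p^{(3)} evaluated at the grid points and are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)
K, P, L = 3, 2, 5
A = rng.normal(size=(K, P))
F = rng.normal(size=(L, P))   # F_lp ~ psi_p^{(3)} at point u_l (placeholder values)

# T_ijkl = sum_p A_ip A_jp A_kp F_lp
T = np.einsum('ip,jp,kp,lp->ijkl', A, A, A, F)

assert T.shape == (K, K, K, L)
assert np.allclose(T, T.transpose(1, 0, 2, 3))  # symmetric in modes (i, j, k)
```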

Page 48: Problem seen as symmetric tensor decomposition (1/2)

If only the origin is in G, i.e. u_ℓ = 0, then

    C_ijk = Σ_p A_ip A_jp A_kp f_p

A multilinear relation relating cumulants.
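The cumulant relation above can be observed empirically: mix skewed independent sources and compare the sample third-moment tensor of y (which equals the third-cumulant tensor for zero-mean data) with the multilinear prediction. The source law (centered exponential, third cumulant 2), sizes, and tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
K, P, N = 3, 2, 400_000
A = rng.uniform(-1.0, 1.0, size=(K, P))

# skewed independent zero-mean sources: Exp(1) - 1, with kappa3 = 2
s = rng.exponential(size=(P, N)) - 1.0
y = A @ s

# empirical third-order cumulant tensor of the zero-mean y
C_emp = np.einsum('in,jn,kn->ijk', y, y, y) / N

# multilinear prediction: C_ijk = sum_p A_ip A_jp A_kp * kappa3_p
C_th = 2.0 * np.einsum('ip,jp,kp->ijk', A, A, A)

assert np.allclose(C_emp, C_th, atol=0.25)  # loose tolerance: Monte-Carlo estimate
```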

Page 49: Problem seen as symmetric tensor decomposition (2/2)

Equivalent writing (still with d = 3 as working example):

    C = Σ_{p=1}^{P} f_p h(p) ⊗ h(p) ⊗ h(p)

where h(p) := pth column of A.

If the vectors h(p) are not pairwise collinear, P is hopefully minimal, i.e. equal to the rank of the tensor C → condition.


Page 52: Expected rank

For an order-d symmetric tensor of dimension K:

- Number of equations: (K + d − 1 choose d)
- Number of unknowns: P K
- For what P can one have an exact fit?

"Expected rank":

    R(K, d) := ⌈ (K + d − 1 choose d) / K ⌉    (5)

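Formula (5) is straightforward to evaluate; a short sketch (the function name is mine, not from the slides), using integer arithmetic for the ceiling:

```python
from math import comb

def expected_symmetric_rank(K, d):
    # ceil rule of Eq. (5): R(K, d) = ceil( C(K+d-1, d) / K )
    return (comb(K + d - 1, d) + K - 1) // K

# d = 3, K = 3: C(5, 3) = 10 equations, K = 3 unknowns per rank-1 term
assert expected_symmetric_rank(3, 3) == 4
```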

Clebsch's statement

Alfred Clebsch (1833–1872)

For (d, K) = (4, 3), a generic tensor cannot be written as the sum of 5 rank-1 terms, even though #unknowns = 15 = #equations.

Hirschowitz theorem

From the Alexander–Hirschowitz theorem (1995), one deduces [CGLM08]:

Theorem. For d > 2, the generic rank of a dth-order symmetric tensor of dimension K is always equal to the expected rank

    Rs = R(K, d)    (6)

except for (d, K) ∈ {(3, 5), (4, 3), (4, 4), (4, 5)}.

➽ Only a finite number of exceptions!

Identifiability

Identifiability in the wider sense: finite number of solutions.

Necessary condition for identifiability: P ≤ R(K, d).
Not sufficient, cf. Clebsch (1861), Hirschowitz (1995).

Sufficient condition for almost sure identifiability: P < R(K, d).

Necessary and sufficient condition for almost sure identifiability:

    P = (K+d−1 choose d) / K = R(K, d) = Rs


Generic rank of symmetric tensors

Symmetric tensors of order d and dimension K:

  d \ K   2   3   4   5   6   7   8
    2     2   3   4   5   6   7   8
    3     2   4   5   8  10  12  15
    4     3   6  10  15  21  30  42
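The d = 3 and d = 4 rows of this table can be cross-checked against the ceil rule (5) and the exception list of the Hirschowitz theorem slide. Comparing the two slides, each exceptional (d, K) pair sits exactly one above the expected rank; that "+1" is my observation from the numbers shown, used here as an assumption. The d = 2 row is excluded since the theorem requires d > 2. A sketch:

```python
from math import comb

def expected_rank(K, d):
    # ceil rule: R(K, d) = ceil( C(K+d-1, d) / K ), integer ceiling
    return (comb(K + d - 1, d) + K - 1) // K

table = {3: [2, 4, 5, 8, 10, 12, 15],    # d = 3, K = 2..8
         4: [3, 6, 10, 15, 21, 30, 42]}  # d = 4, K = 2..8
exceptions = {(3, 5), (4, 3), (4, 4), (4, 5)}  # (d, K) pairs from the theorem

for d, row in table.items():
    for K, rank in zip(range(2, 9), row):
        bump = 1 if (d, K) in exceptions else 0  # exceptions sit one above the ceil rule
        assert rank == expected_rank(K, d) + bump
```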

Deterministic approaches

Model:

    Y_ijk = ∑_p A_ip s_jp B_kp,  i.e.  Y = ∑_{p=1}^{P} a(p) ⊗ s(p) ⊗ b(p)
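The trilinear model above can be sketched directly in pure Python (the helper name `trilinear_sum` and the toy factor vectors are hypothetical, chosen only to illustrate the sum of rank-1 terms):

```python
def trilinear_sum(factors):
    """Y[i][j][k] = sum_p a_p[i] * s_p[j] * b_p[k] for a list of (a, s, b) triples."""
    I, J, K = (len(v) for v in factors[0])
    return [[[sum(a[i] * s[j] * b[k] for a, s, b in factors)
              for k in range(K)]
             for j in range(J)]
            for i in range(I)]

# P = 2 rank-1 terms of a 2 x 2 x 2 tensor
Y = trilinear_sum([([1, 0], [1, 2], [1, 1]),
                   ([0, 1], [3, 0], [1, 2])])
assert Y[0][1][0] == 2 and Y[1][0][1] == 6
```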

Expected rank

For an order-d tensor of dimensions K_ℓ, 1 ≤ ℓ ≤ d:

Number of equations: ∏_ℓ K_ℓ
Number of unknowns: ∑_ℓ K_ℓ P − (d − 1)P

"Expected rank" again given by the ceil rule:

    R(K_1, ..., K_d) := ⌈ ∏_ℓ K_ℓ / (∑_ℓ K_ℓ − d + 1) ⌉    (7)

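Formula (7) in code (function name mine); note that the ceil rule only gives the *expected* rank, and the 3×3×3 case below shows it can undershoot the true generic rank:

```python
from math import prod

def expected_rank_general(*dims):
    # ceil rule of Eq. (7): ceil( prod(K_l) / (sum(K_l) - d + 1) ), integer ceiling
    d = len(dims)
    denom = sum(dims) - d + 1
    return (prod(dims) + denom - 1) // denom

assert expected_rank_general(2, 2, 2) == 2   # 8 / 4
assert expected_rank_general(5, 4, 3) == 6   # 60 / 10
# the rule is only "expected": 3x3x3 gives ceil(27/7) = 4, but the generic rank is 5
assert expected_rank_general(3, 3, 3) == 4
```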

Identifiability

Necessary condition for identifiability: P ≤ R(K_1, ..., K_d)

Sufficient condition for almost sure identifiability: P < R(K_1, ..., K_d)

For general tensors, the generic rank R is not yet known everywhere.


Generic rank of order-3 free tensors (1)

Generic rank of I × J × K tensors:

            K = 2            K = 3        K = 4
  I \ J    2   3   4   5    3   4   5    4   5
    2      2   3   4   4    3   4   5    4   5
    3      3   3   4   5    5   5   5    6   6
    4      4   4   4   5    5   6   6    7   8
    5      4   5   5   5    5   6   8    8   9
    6      4   6   6   6    6   7   8    8  10
    7      4   6   7   7    7   7   9    9  10
    8      4   6   8   8    8   8   9   10  11
    9      4   6   8   9    9   9   9   10  12
   10      4   6   8  10    9  10  10   10  12
   11      4   6   8  10    9  11  11   11  13
   12      4   6   8  10    9  12  12   12  13

Perspectives

Efficient algorithms to compute the finite set of solutions

Detection of weak defectivity: rare cases with infinitely many solutions

Construction of additional diversity: increases the order

Thanks for your attention


Darmois–Skitovich theorem (1953)

Theorem. Let s_i be statistically independent random variables, and consider two linear statistics

    y1 = ∑_i a_i s_i  and  y2 = ∑_i b_i s_i.

If y1 and y2 are statistically independent, then the random variables s_k for which a_k b_k ≠ 0 are Gaussian.

NB: holds over both R and C.
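A cheap necessary-condition check of the theorem uses fourth-order cumulants (an illustration of mine, strictly weaker than the theorem itself): by multilinearity of cumulants and independence of the sources, Cum{y1, y1, y2, y2} = ∑_i a_i² b_i² κ4(s_i), which must vanish whenever y1 and y2 are independent. For non-Gaussian sources (κ4 ≠ 0) this forces the a_i² b_i² weights to cancel:

```python
def cross_cumulant4(a, b, kappa4):
    """Cum{y1, y1, y2, y2} for y1 = sum a_i s_i, y2 = sum b_i s_i,
    with independent zero-mean sources s_i of fourth cumulants kappa4[i]."""
    return sum(ai**2 * bi**2 * k for ai, bi, k in zip(a, b, kappa4))

k4_uniform = -2 / 15  # fourth cumulant of a uniform source on [-1, 1]
# both non-Gaussian sources enter y1 and y2: nonzero cumulant, so y1, y2 are dependent
assert cross_cumulant4([1, 1], [1, -1], [k4_uniform, k4_uniform]) != 0
# a_k b_k = 0 for every k: the fourth-order obstruction vanishes
assert cross_cumulant4([1, 0], [0, 1], [k4_uniform, k4_uniform]) == 0
```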

Proof of the Darmois–Skitovich theorem

Assume the pairs [a_k, b_k] are pairwise non-collinear, and that the second characteristic functions ψ_p of the sources are differentiable.

1. Independence of y1 and y2 implies

       ∑_{k=1}^{P} ψ_k(u a_k + v b_k) = ∑_{k=1}^{P} [ψ_k(u a_k) + ψ_k(v b_k)]

   This is trivial for the terms with a_k b_k = 0; from now on, restrict the sum to the terms with a_k b_k ≠ 0.

2. Write this at u + α/a_P and v − α/b_P:

       ∑_{k=1}^{P} ψ_k( u a_k + v b_k + α (a_k/a_P − b_k/b_P) ) = f(u) + g(v)

3. Subtract to cancel the Pth term, divide by α, and let α → 0:

       ∑_{k=1}^{P−1} (a_k/a_P − b_k/b_P) ψ_k^{(1)}(u a_k + v b_k) = f^{(1)}(u) + g^{(1)}(v)

   for some univariate functions f^{(1)} and g^{(1)}.

Conclusion: we are left with one term fewer.


4. Repeating the procedure (P − 1) times gives:

       ∏_{j=2}^{P} (a_1/a_j − b_1/b_j) · ψ_1^{(P−1)}(u a_1 + v b_1) = f^{(P−1)}(u) + g^{(P−1)}(v)

5. Hence ψ_1^{(P−1)}(u a_1 + v b_1) is linear, being a sum of two univariate functions (ψ_1^{(P)} is a constant because a_1 b_1 ≠ 0).

6. Eventually ψ_1 is a polynomial.

7. Lastly, invoke the Marcinkiewicz theorem to conclude that s_1 is Gaussian.

8. The same holds for any ψ_p such that a_p b_p ≠ 0: s_p is Gaussian.

NB: the result also holds if the ψ_p are not differentiable.


Definition of Cumulants

Moments:

    μ_r := E{x^r} = (−ı)^r ∂^r Φ(t)/∂t^r |_{t=0}

Cumulants:

    C_x^{(r)} := Cum{x, ..., x} (r times) = (−ı)^r ∂^r Ψ(t)/∂t^r |_{t=0}

Here Φ_x(t) = E{e^{ı t x}} is the characteristic function of x. The relationship between moments and cumulants is obtained by expanding both sides of

    log Φ_x(t) = Ψ_x(t)

in Taylor series.

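Expanding log Φ = Ψ term by term yields the standard moment-to-cumulant recursion κ_n = μ_n − ∑_{m<n} C(n−1, m−1) κ_m μ_{n−m}. The sketch below (the function name and this particular recursion form are mine, not from the slides) converts raw moments into cumulants and verifies that a Gaussian has vanishing cumulants beyond order 2:

```python
from math import comb

def cumulants_from_moments(mu):
    """Convert raw moments [mu_1, ..., mu_n] into cumulants [k_1, ..., k_n]
    via k_n = mu_n - sum_{m=1}^{n-1} C(n-1, m-1) * k_m * mu_{n-m}."""
    kappa = []
    for n in range(1, len(mu) + 1):
        tail = sum(comb(n - 1, m - 1) * kappa[m - 1] * mu[n - m - 1]
                   for m in range(1, n))
        kappa.append(mu[n - 1] - tail)
    return kappa

# standard Gaussian: raw moments 0, 1, 0, 3 -> cumulants vanish beyond order 2
assert cumulants_from_moments([0, 1, 0, 3]) == [0, 1, 0, 0]
```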

Polynomial interpolation

Alexander–Hirschowitz theorem (1995). Let L(d, m) be the space of hypersurfaces of degree at most d in m variables. This space has dimension D(m, d) := (m+d choose d) − 1.

Theorem. Let {p_i} be K given distinct points in the complex projective space P^m. The dimension of the linear subspace of hypersurfaces of L(d, m) having multiplicity at least 2 at every point p_i is

    D(m, d) − K(m + 1)

except in the following cases:
• d = 2 and 2 ≤ K ≤ m
• d ≥ 3 and (m, d, K) ∈ {(2, 4, 5), (3, 4, 9), (4, 4, 14), (4, 3, 7)}

In other words, there are only a finite number of exceptions.
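The dimension count behind the theorem is easy to reproduce (function name mine); the classical (m, d, K) = (2, 4, 5) exception illustrates it: the naive count predicts no quartic with 5 generic double points, yet the squared conic through those 5 points is one.

```python
from math import comb

def hypersurface_space_dim(m, d):
    # D(m, d) = C(m+d, d) - 1: dimension of the space of degree-<=d hypersurfaces in m variables
    return comb(m + d, d) - 1

# plane quartics (m = 2, d = 4): D = C(6, 4) - 1 = 14
assert hypersurface_space_dim(2, 4) == 14
# naive count with K = 5 double points: 14 - 5 * (2 + 1) = -1, i.e. expected empty,
# yet (m, d, K) = (2, 4, 5) is an exception: the squared conic through the 5 points survives
assert hypersurface_space_dim(2, 4) - 5 * (2 + 1) == -1
```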

References

F. L. HITCHCOCK. "Multiple invariants and generalized rank of a p-way matrix or tensor." J. Math. and Phys., 7(1):39–79, 1927.

L. R. TUCKER. "Some mathematical notes for three-mode factor analysis." Psychometrika, 31:279–311, 1966.

J. B. KRUSKAL. "Three-way arrays: Rank and uniqueness of trilinear decompositions." Linear Algebra and Applications, 18:95–138, 1977.

V. STRASSEN. "Rank and optimal computation of generic tensors." Linear Algebra Appl., 52:645–685, July 1983.

P. COMON and B. MOURRAIN. "Decomposition of quantics in sums of powers of linear forms." Signal Processing, Elsevier, 53(2):93–107, September 1996.

N. D. SIDIROPOULOS and R. BRO. "On the uniqueness of multilinear decomposition of N-way arrays." J. Chemometrics, 14:229–239, 2000.

P. COMON. "Tensor decompositions." In J. G. McWhirter and I. K. Proudler, eds., Mathematics in Signal Processing V, pages 1–24. Clarendon Press, Oxford, UK, 2002.

A. SMILDE, R. BRO, and P. GELADI. Multi-Way Analysis. Wiley, 2004.

P. COMON, G. GOLUB, L.-H. LIM, and B. MOURRAIN. "Symmetric tensors and symmetric tensor rank." SIAM Journal on Matrix Analysis Appl., 30(3):1254–1279, 2008.

P. COMON, J. M. F. ten BERGE, L. De LATHAUWER, and J. CASTAING. "Generic and typical ranks of multi-way arrays." Linear Algebra Appl., 2009. See also http://arxiv.org/abs/0802.2371

P. COMON, X. LUCIANI, and A. de ALMEIDA. "Tensor decompositions, alternating least squares and other tales." J. Chemometrics, 2009, special issue.

A. STEGEMAN and P. COMON. "Subtracting a best rank-1 approximation does not necessarily decrease tensor rank." 2009.

Participants in the ANR-06-BLAN-0074 "DECOTES" project:

Signal team, Lab. I3S UMR6070, Univ. Nice and CNRS

Galaad team, INRIA, UR Sophia-Antipolis - Mediterranee

Lab. LTSI U642, INSERM and Univ. Rennes 1

Thales Communications