Source: people.csail.mit.edu/moitra/docs/tensors.pdf
TRANSCRIPT
TENSOR DECOMPOSITIONS AND THEIR APPLICATIONS
ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY
SPEARMAN'S HYPOTHESIS
Charles Spearman (1904): There are two types of intelligence, eductive and reproductive
eductive (adj): the ability to make sense out of complexity
reproductive (adj): the ability to store and reproduce information
To test this theory, he invented Factor Analysis:
M ≈ A B^T
where M is the tests (10) × students (1000) matrix of scores, and the inner dimension is 2
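As a concrete stand-in for the factor analysis step, here is a numpy sketch that computes a rank-2 factorization M ≈ A B^T of a synthetic score matrix via the SVD; the data, names, and tolerances are illustrative, not Spearman's actual procedure.

```python
# A rank-2 factorization M ~ A B^T via the SVD, as a stand-in for factor analysis.
import numpy as np

rng = np.random.default_rng(9)
scores = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 1000))  # tests x students
scores += 0.01 * rng.standard_normal(scores.shape)                      # small noise
U, s, Vt = np.linalg.svd(scores, full_matrices=False)
A = U[:, :2] * s[:2]          # tests x 2
B = Vt[:2].T                  # students x 2
assert np.allclose(scores, A @ B.T, atol=0.1)  # rank-2 part explains the scores
```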
Given: M = Σ_i a_i ⊗ b_i
When can we recover the factors a_i and b_i uniquely?
M = A B^T = (A R)(R^{-1} B^T)
("correct" factors on the left; an alternative factorization on the right)
Claim: The factors {a_i} and {b_i} are not determined uniquely unless we impose additional conditions on them, e.g. if {a_i} and {b_i} are orthogonal, or rank(M) = 1
This is called the rotation problem; it is a major issue in factor analysis and motivates the study of tensor methods…
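A minimal numpy sketch of the rotation problem (all names and dimensions illustrative): any invertible R yields an alternative factorization of the same M.

```python
# The rotation problem, numerically: two different factorizations of one M.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2))    # tests x inner-dimension
B = rng.standard_normal((1000, 2))  # students x inner-dimension
M = A @ B.T

R = np.linalg.qr(rng.standard_normal((2, 2)))[0]  # any invertible R will do
A_alt, B_alt = A @ R, B @ np.linalg.inv(R).T      # alternative factors
assert np.allclose(M, A_alt @ B_alt.T)            # same M, different factors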
OUTLINE
The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions
Part I: Algorithms
• The Rotation Problem
• Jennrich's Algorithm
Part II: Applications
• Phylogenetic Reconstruction
• Pure Topic Models
Part III: Smoothed Analysis
• Overcomplete Problems
• Kruskal Rank and the Khatri-Rao Product
MATRIX DECOMPOSITIONS
M = a_1 ⊗ b_1 + a_2 ⊗ b_2 + … + a_R ⊗ b_R
TENSOR DECOMPOSITIONS
T = a_1 ⊗ b_1 ⊗ c_1 + … + a_R ⊗ b_R ⊗ c_R
The (i, j, k) entry of x ⊗ y ⊗ z is x(i) × y(j) × z(k)
When are tensor decompositions unique?
Theorem [Jennrich 1970]: Suppose {a_i} and {b_i} are linearly independent and no pair of vectors in {c_i} is a scalar multiple of each other. Then
T = a_1 ⊗ b_1 ⊗ c_1 + … + a_R ⊗ b_R ⊗ c_R
is unique up to permuting the rank one terms and rescaling the factors.
Equivalently, the rank one factors are unique.
There is a simple algorithm to compute the factors too!
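A one-line numpy check of the entry formula for a rank-one tensor (illustrative):

```python
# Check: the (i, j, k) entry of x (x) y (x) z equals x(i) * y(j) * z(k).
import numpy as np

rng = np.random.default_rng(1)
x, y, z = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
T = np.einsum('i,j,k->ijk', x, y, z)   # the rank-one tensor x (x) y (x) z
assert np.isclose(T[2, 3, 1], x[2] * y[3] * z[1])
```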
JENNRICH'S ALGORITHM
Compute T(·, ·, x) = Σ_i x_i T_i, i.e. add up the matrix slices T_i weighted by the entries of x
(x is chosen uniformly at random from S^{n-1})
If T = a ⊗ b ⊗ c, then T(·, ·, x) = ⟨c, x⟩ a b^T. In general:
T(·, ·, x) = Σ_i ⟨c_i, x⟩ a_i b_i^T = A D_x B^T, where D_x = diag(⟨c_1, x⟩, …, ⟨c_R, x⟩)
The algorithm:
Compute T(·, ·, x) = A D_x B^T
Compute T(·, ·, y) = A D_y B^T
Diagonalize T(·, ·, x) T(·, ·, y)^{-1} = A D_x B^T (B^T)^{-1} D_y^{-1} A^{-1} = A D_x D_y^{-1} A^{-1}
Claim: whp (over x, y) the eigenvalues are distinct, so the eigendecomposition is unique and recovers the a_i's
Diagonalize T(·, ·, y) T(·, ·, x)^{-1} similarly
Match up the factors (their eigenvalues are reciprocals) and find {c_i} by solving a linear system
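Below is a minimal numpy sketch of the algorithm just described; the function name, the use of pseudo-inverses, and the eigenvalue-matching step are my choices, and it assumes the conditions of Jennrich's theorem with R ≤ n.

```python
# A sketch of Jennrich's algorithm: contract T with two random unit vectors,
# diagonalize, match eigen-pairs whose eigenvalues are reciprocals, then solve
# a linear (least-squares) system for the c_i's.
import numpy as np

def jennrich(T, R, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2, n3 = T.shape
    x = rng.standard_normal(n3); x /= np.linalg.norm(x)
    y = rng.standard_normal(n3); y /= np.linalg.norm(y)
    Mx = np.einsum('ijk,k->ij', T, x)   # T(.,.,x) = A D_x B^T
    My = np.einsum('ijk,k->ij', T, y)   # T(.,.,y) = A D_y B^T

    # Eigenvectors of Mx My^+ are the a_i's (eigenvalues <c_i,x>/<c_i,y>);
    # eigenvectors of (Mx^+ My)^T are the b_i's, with reciprocal eigenvalues.
    la, A = np.linalg.eig(Mx @ np.linalg.pinv(My))
    lb, B = np.linalg.eig((np.linalg.pinv(Mx) @ My).T)

    ia = np.argsort(-np.abs(la))[:R]; la, A = la[ia].real, A[:, ia].real
    ib = np.argsort(-np.abs(lb))[:R]; lb, B = lb[ib].real, B[:, ib].real
    match = [int(np.argmin(np.abs(la[i] * lb - 1.0))) for i in range(R)]
    B = B[:, match]                     # pair up reciprocal eigenvalues

    # T(j,k,l) = sum_i a_i(j) b_i(k) c_i(l) is linear in the c_i's: solve it.
    KR = np.stack([np.outer(A[:, i], B[:, i]).ravel() for i in range(R)], axis=1)
    C = np.linalg.lstsq(KR, T.reshape(n1 * n2, n3), rcond=None)[0].T
    return A, B, C

# Self-check on a random rank-5 tensor.
rng = np.random.default_rng(3)
A0, B0, C0 = (rng.standard_normal((8, 5)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = jennrich(T, 5)
assert np.allclose(np.einsum('ir,jr,kr->ijk', A, B, C), T, atol=1e-6)
```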
Given: M = Σ_i a_i ⊗ b_i. When can we recover the factors a_i and b_i uniquely?
This is only possible if {a_i} and {b_i} are orthonormal, or rank(M) = 1
Given: T = Σ_i a_i ⊗ b_i ⊗ c_i. When can we recover the factors a_i, b_i and c_i uniquely?
Jennrich: If {a_i} and {b_i} are full rank and no pair in {c_i} are scalar multiples of each other
OUTLINE
The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions
Part I: Algorithms
• The Rotation Problem
• Jennrich's Algorithm
Part II: Applications
• Phylogenetic Reconstruction
• Pure Topic Models
Part III: Smoothed Analysis
• Overcomplete Problems
• Kruskal Rank and the Khatri-Rao Product
PHYLOGENETIC RECONSTRUCTION
[Figure: the "Tree of Life" — internal nodes x, y, z are extinct; leaves a, b, c, d are extant]
Σ = alphabet
root: π : Σ → R_+, the "initial distribution"
each edge, e.g. (z, b): R_{z,b}, a "conditional distribution"
In each sample, we observe a symbol (from Σ) at each extant node, where we sample from π for the root and propagate it using R_{x,y}, etc.
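To make the sampling procedure concrete, here is a tiny simulator of the generative process on a path x → y → b; the names and parameters are illustrative, not from the slides.

```python
# Sampling from the model: draw the root symbol from pi, then propagate each
# symbol along the edges using the conditional distributions R (rows sum to 1).
import numpy as np

rng = np.random.default_rng(10)
k = 3                                        # |Sigma|
pi = np.array([0.5, 0.3, 0.2])               # initial distribution at the root
R_xy = rng.dirichlet(np.ones(k), size=k)     # R[s] = distribution of child | parent = s
R_yb = rng.dirichlet(np.ones(k), size=k)

def sample():
    x = rng.choice(k, p=pi)                  # symbol at the root
    y = rng.choice(k, p=R_xy[x])             # propagate along edge (x, y)
    b = rng.choice(k, p=R_yb[y])             # propagate along edge (y, b)
    return b                                 # we only observe extant nodes

samples = [sample() for _ in range(1000)]
```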
HIDDEN MARKOV MODELS
[Figure: a chain of hidden states x, y, z, …, each emitting an observed symbol a, b, c, …]
π : Σ_s → R_+, the "initial distribution"
R_{x,y}: "transition matrices"; O_{x,a}: "observation matrices"
In each sample, we observe a symbol (from Σ_o) at each observed node, where we sample from π for the start and propagate it using R_{x,y}, etc. (over Σ_s)
Question: Can we reconstruct just the topology from random samples?
Usually, we assume R_{x,y}, etc. are full rank so that we can re-root the tree arbitrarily.
[Steel, 1994]: The following is a distance function on the edges
d_{x,y} = −ln |det(P_{x,y})| + ½ Σ_{σ∈Σ} ln π_{x,σ} + ½ Σ_{σ∈Σ} ln π_{y,σ}
where P_{x,y} is the joint distribution, and the distance between leaves is the sum of distances on the path in the tree.
(It's not even obvious it's nonnegative!)
[Erdos, Steel, Szekely, Warnow, 1997]: Used Steel's distance function and quartet tests (which of the pairings ab|cd, ac|bd, ad|bc is correct?) to reconstruct the topology, from polynomially many samples.
For many problems (e.g. HMMs) finding the transition matrices is the main issue…
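Steel's distance transcribes directly into numpy. Below is a sketch (steel_distance is a hypothetical helper; P is the |Σ| × |Σ| joint distribution of the symbols at two nodes), along with a sanity check that the distance adds along a two-edge path:

```python
# Steel's log-det distance, plus a sanity check of additivity along a path.
import numpy as np

def steel_distance(P):
    pi_x, pi_y = P.sum(axis=1), P.sum(axis=0)   # marginals at the two nodes
    return (-np.log(np.abs(np.linalg.det(P)))
            + 0.5 * np.log(pi_x).sum() + 0.5 * np.log(pi_y).sum())

pi = np.array([0.6, 0.4])                        # distribution at node x
Rxy = np.array([[0.9, 0.1], [0.2, 0.8]])         # row-stochastic transitions
Ryz = np.array([[0.7, 0.3], [0.25, 0.75]])
Pxy = np.diag(pi) @ Rxy                          # joint distribution of (x, y)
Pyz = np.diag(pi @ Rxy) @ Ryz                    # joint distribution of (y, z)
Pxz = np.diag(pi) @ (Rxy @ Ryz)                  # joint distribution of (x, z)
assert np.isclose(steel_distance(Pxz),
                  steel_distance(Pxy) + steel_distance(Pyz))
```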
[Chang, 1996]: The model is identifiable (if the R's are full rank)
[Figure: zoom in on an internal node z and three leaves a, b, c reachable from it]
Joint distribution over (a, b, c):
Σ_σ Pr[z = σ] · Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]
where the vectors Pr[b | z = σ] are the columns of R_{z,b}, etc.
[Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples.
Question: Is the full-rank assumption necessary?
[Mossel, Roch, 2006]: It is as hard as noisy-parity to learn the parameters of a general HMM.
Noisy-parity is an infamous problem in learning, where O(n) samples suffice but the best known algorithm, due to [Blum, Kalai, Wasserman, 2003], runs in time 2^{n/log n}.
(It's now used as a hard problem to build cryptosystems!)
THE POWER OF CONDITIONAL INDEPENDENCE
[Phylogenetic Trees/HMMs]:
Σ_σ Pr[z = σ] · Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]
(joint distribution on leaves a, b, c)
PURE TOPIC MODELS
• Each topic is a distribution on words (a vector of length m; there are r topics)
• Each document is about only one topic (stochastically generated)
• For each document, we sample L words from its topic's distribution
M ≈ A W
(M is the words (m) × documents matrix; A is the words (m) × topics (r) matrix; each column of W indicates which topic its document is about)
[Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (if A is full rank)
Question: Where can we find three conditionally independent random variables?
The first, second and third words are independent conditioned on the topic t (and are random samples from A_t)
THE POWER OF CONDITIONAL INDEPENDENCE
[Phylogenetic Trees/HMMs]:
Σ_σ Pr[z = σ] · Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]
(joint distribution on leaves a, b, c)
[Pure Topic Models/LDA]:
Σ_j Pr[topic = j] · A_j ⊗ A_j ⊗ A_j
(joint distribution on the first three words)
[Community Detection]:
Σ_j Pr[C_x = j] · (CAΠ)_j ⊗ (CBΠ)_j ⊗ (CCΠ)_j
(counting stars)
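All three moments share the same form: a weighted sum of rank-one terms built from conditionally independent views. Here is a quick numpy sketch of the topic-model version (names illustrative); since A is full rank, Jennrich's algorithm (sketched earlier) applies to this tensor.

```python
# Build the third moment of a pure topic model:
# sum_j Pr[topic=j] A_j (x) A_j (x) A_j.
import numpy as np

rng = np.random.default_rng(5)
m, r = 20, 4
A = rng.dirichlet(np.ones(m), size=r).T      # m x r; columns are topic distributions
w = rng.dirichlet(np.ones(r))                # Pr[topic = j]
T = np.einsum('j,ij,kj,lj->ikl', w, A, A, A) # T[i,k,l] = Pr[word1=i, word2=k, word3=l]
```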
OUTLINE
The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions
Part I: Algorithms
• The Rotation Problem
• Jennrich's Algorithm
Part II: Applications
• Phylogenetic Reconstruction
• Pure Topic Models
Part III: Smoothed Analysis
• Overcomplete Problems
• Kruskal Rank and the Khatri-Rao Product
So far, Jennrich's algorithm has been the key, but it has a crucial limitation. Let
T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i
where the {a_i} are n-dimensional vectors.
Question: What if R is much larger than n?
This is called the overcomplete case, e.g. the number of factors is much larger than the number of observations…
In such cases, why stop at third-order tensors?
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
Ques=on: Can we find its factors, even if R is much larger than n?
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
flat(T) = bi bi bi × × i = 1
R
(where bi = ai ai × KR )
n2-‐dimensional vector whose (j,k)th entry is the product of the jth and kth entries of ai Khatri-‐Rao product
Ques=on: Can we find its factors, even if R is much larger than n?
Let’s flaven it:
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
flat(T) = bi bi bi × × i = 1
R
(where bi = ai ai × KR )
n2-‐dimensional vector whose (j,k)th entry is the product of the jth and kth entries of ai Khatri-‐Rao product
Ques=on: Can we find its factors, even if R is much larger than n?
Let’s flaven it by rearranging its entries into a third-‐order tensor:
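In numpy, this flattening is literally a reshape: group the six modes in pairs so that each pair becomes one n²-dimensional mode (a sketch with illustrative names):

```python
# flat(T): reshape a sixth-order tensor into a third-order one, and check that
# its rank-one factors are the Khatri-Rao squares b_i = a_i (KR) a_i.
import numpy as np

rng = np.random.default_rng(6)
n, R = 4, 6
A = rng.standard_normal((n, R))
T6 = np.einsum('ir,jr,kr,lr,mr,pr->ijklmp', *([A] * 6))  # sixth-order tensor
flatT = T6.reshape(n * n, n * n, n * n)                  # n^2 per mode

B = np.stack([np.outer(A[:, i], A[:, i]).ravel() for i in range(R)], axis=1)
assert np.allclose(flatT, np.einsum('ir,jr,kr->ijk', B, B, B))
```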
Question: Can we apply Jennrich's Algorithm to flat(T)?
When are the new factors b_i = a_i ⊗_KR a_i linearly independent?
Example #1: Let {a_i} be all (n choose 2) vectors with exactly two ones.
Then the {b_i} are vectorizations of the matrices a_i a_i^T; the off-diagonal entries of a_i a_i^T are non-zero only in b_i, so the {b_i} are linearly independent.
Example #2: Let {a_1, …, a_n} and {a_{n+1}, …, a_{2n}} be two random orthonormal bases.
Then there is a linear dependence with 2n terms:
Σ_{i=1}^{n} a_i ⊗_KR a_i − Σ_{i=n+1}^{2n} a_i ⊗_KR a_i = 0
(as matrices, both sums equal the identity)
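This dependence is easy to verify numerically: viewed as matrices, the Khatri-Rao squares a_i ⊗_KR a_i are the outer products a_i a_i^T, and those of each orthonormal basis sum to the identity (a sketch):

```python
# Verify Example #2: the 2n Khatri-Rao squares of two orthonormal bases are
# linearly dependent, since each basis's outer products sum to the identity.
import numpy as np

rng = np.random.default_rng(7)
n = 5
U = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthonormal basis #1
V = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthonormal basis #2
sum_U = sum(np.outer(U[:, i], U[:, i]) for i in range(n))  # = U U^T = I
sum_V = sum(np.outer(V[:, i], V[:, i]) for i in range(n))  # = V V^T = I
assert np.allclose(sum_U - sum_V, 0)
```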
THE KRUSKAL RANK
Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent.
With b_i = a_i ⊗_KR a_i and k-rank({a_i}) = n:
Example #1: k-rank({b_i}) = R = (n choose 2)
Example #2: k-rank({b_i}) = 2n − 1
The Kruskal rank always adds under the Khatri-Rao product, but sometimes it multiplies, and that can allow us to handle R ≫ n.
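A brute-force numpy check of the definition (exponential in general, fine for small examples; kruskal_rank is a hypothetical helper), verifying Example #2's value of 2n − 1:

```python
# Kruskal rank by exhaustive search, and a check of Example #2's value 2n - 1.
import numpy as np
from itertools import combinations

def kruskal_rank(B):
    """Largest k such that every k-subset of the columns is linearly independent."""
    n, R = B.shape
    k = 0
    for size in range(1, min(n, R) + 1):
        if all(np.linalg.matrix_rank(B[:, list(S)]) == size
               for S in combinations(range(R), size)):
            k = size
        else:
            break
    return k

rng = np.random.default_rng(8)
n = 4
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
B = np.stack([np.outer(a, a).ravel() for a in np.concatenate([U.T, V.T])], axis=1)
assert kruskal_rank(B) == 2 * n - 1   # 2n vectors, k-rank 2n - 1
```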
[Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product.
Proof: With b_i = a_i ⊗_KR a_i, the set of {a_i} where det({b_i}) = 0 has measure zero.
But this yields a very weak bound on the condition number of {b_i}… which is what we need to apply it to learning/statistics, where we only have an estimate of T.
Definition: The robust Kruskal rank (k-rank_γ) of {b_i} is the largest k s.t. every set of k vectors has condition number at most O(γ).
[Bhaskara, Charikar, Vijayaraghavan, 2013]: The robust Kruskal rank always adds under the Khatri-Rao product.
[Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then
k-rank_γ({b_i}) = R
for R = n²/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ).
Note: These bounds are easy to prove with inverse polynomial failure probability, but then γ depends on δ.
This can be extended to any constant order Khatri-Rao product.
Hence we can apply Jennrich's Algorithm to flat(T) with R ≫ n.
Sample application: Algorithm for learning mixtures of n^{O(1)} spherical Gaussians in R^n, if their means are ε-perturbed (also obtained independently by [Anderson, Belkin, Goyal, Rademacher, Voss, 2014]).
SUMMARY
• Tensor decompositions are unique under much more general conditions, compared to matrix decompositions
• Jennrich's Algorithm (rediscovered many times!), and its many applications in learning/statistics
• Introduced new models to study overcomplete problems (R ≫ n)
• Open question: Are there algorithms for order-k tensors that work with R = n^{0.51k}?
Any questions?