Source: people.csail.mit.edu/moitra/docs/tensors.pdf
TRANSCRIPT
TENSOR DECOMPOSITIONS AND THEIR APPLICATIONS
ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY
SPEARMAN'S HYPOTHESIS
Charles Spearman (1904): There are two types of intelligence, eductive and reproductive
eductive (adj): the ability to make sense out of complexity
reproductive (adj): the ability to store and reproduce information
To test this theory, he invented Factor Analysis:
M ≈ A B^T
where M is the tests (10) × students (1000) matrix of scores, and the inner dimension is 2
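As a concrete stand-in for the factor analysis step, here is a numpy sketch that computes a rank-2 factorization M ≈ A B^T of a synthetic score matrix via the SVD; the data, names, and tolerances are illustrative, not Spearman's actual procedure.

```python
# A rank-2 factorization M ~ A B^T via the SVD, as a stand-in for factor analysis.
import numpy as np

rng = np.random.default_rng(9)
scores = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 1000))  # tests x students
scores += 0.01 * rng.standard_normal(scores.shape)                      # small noise
U, s, Vt = np.linalg.svd(scores, full_matrices=False)
A = U[:, :2] * s[:2]          # tests x 2
B = Vt[:2].T                  # students x 2
assert np.allclose(scores, A @ B.T, atol=0.1)  # rank-2 part explains the scores
```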
Given: M = Σ_i a_i ⊗ b_i
When can we recover the factors a_i and b_i uniquely?
M = A B^T = (A R)(R^{-1} B^T)
("correct" factors on the left; an alternative factorization on the right)
Claim: The factors {a_i} and {b_i} are not determined uniquely unless we impose additional conditions on them, e.g. if {a_i} and {b_i} are orthogonal, or rank(M) = 1
This is called the rotation problem; it is a major issue in factor analysis and motivates the study of tensor methods…
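A minimal numpy sketch of the rotation problem (all names and dimensions illustrative): any invertible R yields an alternative factorization of the same M.

```python
# The rotation problem, numerically: two different factorizations of one M.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2))    # tests x inner-dimension
B = rng.standard_normal((1000, 2))  # students x inner-dimension
M = A @ B.T

R = np.linalg.qr(rng.standard_normal((2, 2)))[0]  # any invertible R will do
A_alt, B_alt = A @ R, B @ np.linalg.inv(R).T      # alternative factors
assert np.allclose(M, A_alt @ B_alt.T)            # same M, different factors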
OUTLINE
The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions
Part I: Algorithms
• The Rotation Problem
• Jennrich's Algorithm
Part II: Applications
• Phylogenetic Reconstruction
• Pure Topic Models
Part III: Smoothed Analysis
• Overcomplete Problems
• Kruskal Rank and the Khatri-Rao Product
MATRIX DECOMPOSITIONS
M = a_1 ⊗ b_1 + a_2 ⊗ b_2 + … + a_R ⊗ b_R
TENSOR DECOMPOSITIONS
T = a_1 ⊗ b_1 ⊗ c_1 + … + a_R ⊗ b_R ⊗ c_R
The (i, j, k) entry of x ⊗ y ⊗ z is x(i) × y(j) × z(k)
When are tensor decompositions unique?
Theorem [Jennrich 1970]: Suppose {a_i} and {b_i} are linearly independent and no pair of vectors in {c_i} is a scalar multiple of each other. Then
T = a_1 ⊗ b_1 ⊗ c_1 + … + a_R ⊗ b_R ⊗ c_R
is unique up to permuting the rank one terms and rescaling the factors.
Equivalently, the rank one factors are unique.
There is a simple algorithm to compute the factors too!
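A one-line numpy check of the entry formula for a rank-one tensor (illustrative):

```python
# Check: the (i, j, k) entry of x (x) y (x) z equals x(i) * y(j) * z(k).
import numpy as np

rng = np.random.default_rng(1)
x, y, z = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
T = np.einsum('i,j,k->ijk', x, y, z)   # the rank-one tensor x (x) y (x) z
assert np.isclose(T[2, 3, 1], x[2] * y[3] * z[1])
```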
JENNRICH'S ALGORITHM
Compute T(·, ·, x) = Σ_i x_i T_i, i.e. add up the matrix slices T_i weighted by the entries of x
(x is chosen uniformly at random from S^{n-1})
If T = a ⊗ b ⊗ c, then T(·, ·, x) = ⟨c, x⟩ a b^T. In general:
T(·, ·, x) = Σ_i ⟨c_i, x⟩ a_i b_i^T = A D_x B^T, where D_x = diag(⟨c_1, x⟩, …, ⟨c_R, x⟩)
The algorithm:
Compute T(·, ·, x) = A D_x B^T
Compute T(·, ·, y) = A D_y B^T
Diagonalize T(·, ·, x) T(·, ·, y)^{-1} = A D_x B^T (B^T)^{-1} D_y^{-1} A^{-1} = A D_x D_y^{-1} A^{-1}
Claim: whp (over x, y) the eigenvalues are distinct, so the eigendecomposition is unique and recovers the a_i's
Diagonalize T(·, ·, y) T(·, ·, x)^{-1} similarly
Match up the factors (their eigenvalues are reciprocals) and find {c_i} by solving a linear system
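Below is a minimal numpy sketch of the algorithm just described; the function name, the use of pseudo-inverses, and the eigenvalue-matching step are my choices, and it assumes the conditions of Jennrich's theorem with R ≤ n.

```python
# A sketch of Jennrich's algorithm: contract T with two random unit vectors,
# diagonalize, match eigen-pairs whose eigenvalues are reciprocals, then solve
# a linear (least-squares) system for the c_i's.
import numpy as np

def jennrich(T, R, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2, n3 = T.shape
    x = rng.standard_normal(n3); x /= np.linalg.norm(x)
    y = rng.standard_normal(n3); y /= np.linalg.norm(y)
    Mx = np.einsum('ijk,k->ij', T, x)   # T(.,.,x) = A D_x B^T
    My = np.einsum('ijk,k->ij', T, y)   # T(.,.,y) = A D_y B^T

    # Eigenvectors of Mx My^+ are the a_i's (eigenvalues <c_i,x>/<c_i,y>);
    # eigenvectors of (Mx^+ My)^T are the b_i's, with reciprocal eigenvalues.
    la, A = np.linalg.eig(Mx @ np.linalg.pinv(My))
    lb, B = np.linalg.eig((np.linalg.pinv(Mx) @ My).T)

    ia = np.argsort(-np.abs(la))[:R]; la, A = la[ia].real, A[:, ia].real
    ib = np.argsort(-np.abs(lb))[:R]; lb, B = lb[ib].real, B[:, ib].real
    match = [int(np.argmin(np.abs(la[i] * lb - 1.0))) for i in range(R)]
    B = B[:, match]                     # pair up reciprocal eigenvalues

    # T(j,k,l) = sum_i a_i(j) b_i(k) c_i(l) is linear in the c_i's: solve it.
    KR = np.stack([np.outer(A[:, i], B[:, i]).ravel() for i in range(R)], axis=1)
    C = np.linalg.lstsq(KR, T.reshape(n1 * n2, n3), rcond=None)[0].T
    return A, B, C

# Self-check on a random rank-5 tensor.
rng = np.random.default_rng(3)
A0, B0, C0 = (rng.standard_normal((8, 5)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = jennrich(T, 5)
assert np.allclose(np.einsum('ir,jr,kr->ijk', A, B, C), T, atol=1e-6)
```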
Given: M = Σ_i a_i ⊗ b_i. When can we recover the factors a_i and b_i uniquely?
This is only possible if {a_i} and {b_i} are orthonormal, or rank(M) = 1
Given: T = Σ_i a_i ⊗ b_i ⊗ c_i. When can we recover the factors a_i, b_i and c_i uniquely?
Jennrich: If {a_i} and {b_i} are full rank and no pair in {c_i} are scalar multiples of each other
OUTLINE
The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions
Part I: Algorithms
• The Rotation Problem
• Jennrich's Algorithm
Part II: Applications
• Phylogenetic Reconstruction
• Pure Topic Models
Part III: Smoothed Analysis
• Overcomplete Problems
• Kruskal Rank and the Khatri-Rao Product
PHYLOGENETIC RECONSTRUCTION
[Figure: the "Tree of Life" — internal nodes x, y, z are extinct; leaves a, b, c, d are extant]
Σ = alphabet
root: π : Σ → R_+, the "initial distribution"
each edge, e.g. (z, b): R_{z,b}, a "conditional distribution"
In each sample, we observe a symbol (from Σ) at each extant node, where we sample from π for the root and propagate it using R_{x,y}, etc.
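To make the sampling procedure concrete, here is a tiny simulator of the generative process on a path x → y → b; the names and parameters are illustrative, not from the slides.

```python
# Sampling from the model: draw the root symbol from pi, then propagate each
# symbol along the edges using the conditional distributions R (rows sum to 1).
import numpy as np

rng = np.random.default_rng(10)
k = 3                                        # |Sigma|
pi = np.array([0.5, 0.3, 0.2])               # initial distribution at the root
R_xy = rng.dirichlet(np.ones(k), size=k)     # R[s] = distribution of child | parent = s
R_yb = rng.dirichlet(np.ones(k), size=k)

def sample():
    x = rng.choice(k, p=pi)                  # symbol at the root
    y = rng.choice(k, p=R_xy[x])             # propagate along edge (x, y)
    b = rng.choice(k, p=R_yb[y])             # propagate along edge (y, b)
    return b                                 # we only observe extant nodes

samples = [sample() for _ in range(1000)]
```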
HIDDEN MARKOV MODELS
[Figure: a chain of hidden states x, y, z, …, each emitting an observed symbol a, b, c, …]
π : Σ_s → R_+, the "initial distribution"
R_{x,y}: "transition matrices"; O_{x,a}: "observation matrices"
In each sample, we observe a symbol (from Σ_o) at each observed node, where we sample from π for the start and propagate it using R_{x,y}, etc. (over Σ_s)
Question: Can we reconstruct just the topology from random samples?
Usually, we assume R_{x,y}, etc. are full rank so that we can re-root the tree arbitrarily.
[Steel, 1994]: The following is a distance function on the edges
d_{x,y} = −ln |det(P_{x,y})| + ½ Σ_{σ∈Σ} ln π_{x,σ} + ½ Σ_{σ∈Σ} ln π_{y,σ}
where P_{x,y} is the joint distribution, and the distance between leaves is the sum of distances on the path in the tree.
(It's not even obvious it's nonnegative!)
[Erdos, Steel, Szekely, Warnow, 1997]: Used Steel's distance function and quartet tests (which of the pairings ab|cd, ac|bd, ad|bc is correct?) to reconstruct the topology, from polynomially many samples.
For many problems (e.g. HMMs) finding the transition matrices is the main issue…
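Steel's distance transcribes directly into numpy. Below is a sketch (steel_distance is a hypothetical helper; P is the |Σ| × |Σ| joint distribution of the symbols at two nodes), along with a sanity check that the distance adds along a two-edge path:

```python
# Steel's log-det distance, plus a sanity check of additivity along a path.
import numpy as np

def steel_distance(P):
    pi_x, pi_y = P.sum(axis=1), P.sum(axis=0)   # marginals at the two nodes
    return (-np.log(np.abs(np.linalg.det(P)))
            + 0.5 * np.log(pi_x).sum() + 0.5 * np.log(pi_y).sum())

pi = np.array([0.6, 0.4])                        # distribution at node x
Rxy = np.array([[0.9, 0.1], [0.2, 0.8]])         # row-stochastic transitions
Ryz = np.array([[0.7, 0.3], [0.25, 0.75]])
Pxy = np.diag(pi) @ Rxy                          # joint distribution of (x, y)
Pyz = np.diag(pi @ Rxy) @ Ryz                    # joint distribution of (y, z)
Pxz = np.diag(pi) @ (Rxy @ Ryz)                  # joint distribution of (x, z)
assert np.isclose(steel_distance(Pxz),
                  steel_distance(Pxy) + steel_distance(Pyz))
```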
[Chang, 1996]: The model is identifiable (if the R's are full rank)
[Figure: zoom in on an internal node z and three leaves a, b, c reachable from it]
Joint distribution over (a, b, c):
Σ_σ Pr[z = σ] · Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]
where the vectors Pr[b | z = σ] are the columns of R_{z,b}, etc.
[Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples.
Question: Is the full-rank assumption necessary?
[Mossel, Roch, 2006]: It is as hard as noisy-parity to learn the parameters of a general HMM.
Noisy-parity is an infamous problem in learning, where O(n) samples suffice but the best known algorithm, due to [Blum, Kalai, Wasserman, 2003], runs in time 2^{n/log n}.
(It's now used as a hard problem to build cryptosystems!)
THE POWER OF CONDITIONAL INDEPENDENCE
[Phylogenetic Trees/HMMs]:
Σ_σ Pr[z = σ] · Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]
(joint distribution on leaves a, b, c)
PURE TOPIC MODELS
• Each topic is a distribution on words (a vector of length m; there are r topics)
• Each document is about only one topic (stochastically generated)
• For each document, we sample L words from its topic's distribution
M ≈ A W
(M is the words (m) × documents matrix; A is the words (m) × topics (r) matrix; each column of W indicates which topic its document is about)
[Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (if A is full rank)
Question: Where can we find three conditionally independent random variables?
The first, second and third words are independent conditioned on the topic t (and are random samples from A_t)
THE POWER OF CONDITIONAL INDEPENDENCE
[Phylogenetic Trees/HMMs]:
Σ_σ Pr[z = σ] · Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]
(joint distribution on leaves a, b, c)
[Pure Topic Models/LDA]:
Σ_j Pr[topic = j] · A_j ⊗ A_j ⊗ A_j
(joint distribution on the first three words)
[Community Detection]:
Σ_j Pr[C_x = j] · (CAΠ)_j ⊗ (CBΠ)_j ⊗ (CCΠ)_j
(counting stars)
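All three moments share the same form: a weighted sum of rank-one terms built from conditionally independent views. Here is a quick numpy sketch of the topic-model version (names illustrative); since A is full rank, Jennrich's algorithm (sketched earlier) applies to this tensor.

```python
# Build the third moment of a pure topic model:
# sum_j Pr[topic=j] A_j (x) A_j (x) A_j.
import numpy as np

rng = np.random.default_rng(5)
m, r = 20, 4
A = rng.dirichlet(np.ones(m), size=r).T      # m x r; columns are topic distributions
w = rng.dirichlet(np.ones(r))                # Pr[topic = j]
T = np.einsum('j,ij,kj,lj->ikl', w, A, A, A) # T[i,k,l] = Pr[word1=i, word2=k, word3=l]
```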
OUTLINE
The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions
Part I: Algorithms
• The Rotation Problem
• Jennrich's Algorithm
Part II: Applications
• Phylogenetic Reconstruction
• Pure Topic Models
Part III: Smoothed Analysis
• Overcomplete Problems
• Kruskal Rank and the Khatri-Rao Product
So far, Jennrich's algorithm has been the key, but it has a crucial limitation. Let
T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i
where the {a_i} are n-dimensional vectors.
Question: What if R is much larger than n?
This is called the overcomplete case, e.g. the number of factors is much larger than the number of observations…
In such cases, why stop at third-order tensors?
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
Ques=on: Can we find its factors, even if R is much larger than n?
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
flat(T) = bi bi bi × × i = 1
R
(where bi = ai ai × KR )
n2-‐dimensional vector whose (j,k)th entry is the product of the jth and kth entries of ai Khatri-‐Rao product
Ques=on: Can we find its factors, even if R is much larger than n?
Let’s flaven it:
Consider a sixth-‐order tensor T:
T = ai ai ai × × i = 1
R
ai ai ai × × ×
flat(T) = bi bi bi × × i = 1
R
(where bi = ai ai × KR )
n2-‐dimensional vector whose (j,k)th entry is the product of the jth and kth entries of ai Khatri-‐Rao product
Ques=on: Can we find its factors, even if R is much larger than n?
Let’s flaven it by rearranging its entries into a third-‐order tensor:
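In numpy, this flattening is literally a reshape: group the six modes in pairs so that each pair becomes one n²-dimensional mode (a sketch with illustrative names):

```python
# flat(T): reshape a sixth-order tensor into a third-order one, and check that
# its rank-one factors are the Khatri-Rao squares b_i = a_i (KR) a_i.
import numpy as np

rng = np.random.default_rng(6)
n, R = 4, 6
A = rng.standard_normal((n, R))
T6 = np.einsum('ir,jr,kr,lr,mr,pr->ijklmp', *([A] * 6))  # sixth-order tensor
flatT = T6.reshape(n * n, n * n, n * n)                  # n^2 per mode

B = np.stack([np.outer(A[:, i], A[:, i]).ravel() for i in range(R)], axis=1)
assert np.allclose(flatT, np.einsum('ir,jr,kr->ijk', B, B, B))
```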
Question: Can we apply Jennrich's Algorithm to flat(T)?
When are the new factors b_i = a_i ⊗_KR a_i linearly independent?
Example #1: Let {a_i} be all (n choose 2) vectors with exactly two ones.
Then the {b_i} are vectorizations of the matrices a_i a_i^T; the off-diagonal entries of a_i a_i^T are non-zero only in b_i, so the {b_i} are linearly independent.
Example #2: Let {a_1, …, a_n} and {a_{n+1}, …, a_{2n}} be two random orthonormal bases.
Then there is a linear dependence with 2n terms:
Σ_{i=1}^{n} a_i ⊗_KR a_i − Σ_{i=n+1}^{2n} a_i ⊗_KR a_i = 0
(as matrices, both sums equal the identity)
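This dependence is easy to verify numerically: viewed as matrices, the Khatri-Rao squares a_i ⊗_KR a_i are the outer products a_i a_i^T, and those of each orthonormal basis sum to the identity (a sketch):

```python
# Verify Example #2: the 2n Khatri-Rao squares of two orthonormal bases are
# linearly dependent, since each basis's outer products sum to the identity.
import numpy as np

rng = np.random.default_rng(7)
n = 5
U = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthonormal basis #1
V = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthonormal basis #2
sum_U = sum(np.outer(U[:, i], U[:, i]) for i in range(n))  # = U U^T = I
sum_V = sum(np.outer(V[:, i], V[:, i]) for i in range(n))  # = V V^T = I
assert np.allclose(sum_U - sum_V, 0)
```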
THE KRUSKAL RANK
Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent.
With b_i = a_i ⊗_KR a_i and k-rank({a_i}) = n:
Example #1: k-rank({b_i}) = R = (n choose 2)
Example #2: k-rank({b_i}) = 2n − 1
The Kruskal rank always adds under the Khatri-Rao product, but sometimes it multiplies, and that can allow us to handle R ≫ n.
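A brute-force numpy check of the definition (exponential in general, fine for small examples; kruskal_rank is a hypothetical helper), verifying Example #2's value of 2n − 1:

```python
# Kruskal rank by exhaustive search, and a check of Example #2's value 2n - 1.
import numpy as np
from itertools import combinations

def kruskal_rank(B):
    """Largest k such that every k-subset of the columns is linearly independent."""
    n, R = B.shape
    k = 0
    for size in range(1, min(n, R) + 1):
        if all(np.linalg.matrix_rank(B[:, list(S)]) == size
               for S in combinations(range(R), size)):
            k = size
        else:
            break
    return k

rng = np.random.default_rng(8)
n = 4
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
B = np.stack([np.outer(a, a).ravel() for a in np.concatenate([U.T, V.T])], axis=1)
assert kruskal_rank(B) == 2 * n - 1   # 2n vectors, k-rank 2n - 1
```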
[Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product.
Proof: With b_i = a_i ⊗_KR a_i, the set of {a_i} where det({b_i}) = 0 has measure zero.
But this yields a very weak bound on the condition number of {b_i}… which is what we need to apply it to learning/statistics, where we only have an estimate of T.
Definition: The robust Kruskal rank (k-rank_γ) of {b_i} is the largest k s.t. every set of k vectors has condition number at most O(γ).
[Bhaskara, Charikar, Vijayaraghavan, 2013]: The robust Kruskal rank always adds under the Khatri-Rao product.
[Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then
k-rank_γ({b_i}) = R
for R = n²/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ).
Note: These bounds are easy to prove with inverse polynomial failure probability, but then γ depends on δ.
This can be extended to any constant order Khatri-Rao product.
Hence we can apply Jennrich's Algorithm to flat(T) with R ≫ n.
Sample application: Algorithm for learning mixtures of n^{O(1)} spherical Gaussians in R^n, if their means are ε-perturbed (also obtained independently by [Anderson, Belkin, Goyal, Rademacher, Voss, 2014]).
SUMMARY
• Tensor decompositions are unique under much more general conditions, compared to matrix decompositions
• Jennrich's Algorithm (rediscovered many times!), and its many applications in learning/statistics
• Introduced new models to study overcomplete problems (R ≫ n)
• Open question: Are there algorithms for order-k tensors that work with R = n^{0.51k}?
Any questions?