Guaranteed Learning of Latent Variable Models
through Spectral and Tensor Methods
Anima Anandkumar
U.C. Irvine
Application 1: Clustering
Basic operation of grouping data points.
Hypothesis: each data point belongs to an unknown group.
Probabilistic/latent variable viewpoint
The groups represent different distributions. (e.g. Gaussian).
Each data point is drawn from one of the given distributions (e.g., Gaussian mixtures).
Application 2: Topic Modeling
Document modeling
Observed: words in document corpus.
Hidden: topics.
Goal: carry out document summarization.
Application 3: Understanding Human Communities
Social Networks
Observed: network of social ties, e.g. friendships, co-authorships
Hidden: groups/communities of actors.
Application 4: Recommender Systems
Recommender System
Observed: Ratings of users for various products, e.g. yelp reviews.
Goal: Predict new recommendations.
Modeling: Find groups/communities of users and products.
Application 5: Feature Learning
Feature Engineering
Learn good features/representations for classification tasks, e.g. image and speech recognition.
Sparse representations, low dimensional hidden structures.
Application 6: Computational Biology
Observed: gene expression levels
Goal: discover gene groups
Hidden variables: regulators controlling gene groups
How to model hidden effects?
Basic Approach: mixtures/clusters
Hidden variable h is categorical.
Advanced: Probabilistic models
Hidden variable h has more general distributions.
Can model mixed memberships.
[Figure: graphical model with observed variables x1, . . . , x5 and hidden variables h1, h2, h3.]
This talk: basic mixture model. Tomorrow: advanced models.
Challenges in Learning
Basic goal in all mentioned applications
Discover hidden structure in data: unsupervised learning.
Challenge: Conditions for Identifiability
When can model be identified (given infinite computation and data)?
Does identifiability also lead to tractable algorithms?
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard.
Practice: EM, Variational Bayes have no consistency guarantees.
Efficient computational and sample complexities?
In this series: guaranteed and efficient learning through spectral methods
What this talk series will cover
This talk
Start with the most basic latent variable model: Gaussian mixtures.
Basic spectral approach: principal component analysis (PCA).
Beyond correlations: higher order moments.
Tensor notations, PCA extension to tensors.
Guaranteed learning of Gaussian mixtures.
Introduce topic models. Present a unified viewpoint.
Tomorrow: using tensor methods for learning latent variable models and analysis of the tensor decomposition method.
Wednesday: implementation of tensor methods.
Resources for Today’s Talk
“A Method of Moments for Mixture Models and Hidden Markov Models,” by A. Anandkumar, D. Hsu, and S.M. Kakade. Proc. of COLT, June 2012.
“Tensor Decompositions for Learning Latent Variable Models,” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky, Oct. 2012.
Resources available at http://newport.eecs.uci.edu/anandkumar/MLSS.html
Outline
1 Introduction
2 Warm-up: PCA and Gaussian Mixtures
3 Higher order moments for Gaussian Mixtures
4 Tensor Factorization Algorithm
5 Learning Topic Models
6 Conclusion
Warm-up: PCA

Optimization problem
For (centered) points x_i ∈ R^d, find a projection P ∈ R^{d×d} with Rank(P) = k s.t.

min_P (1/n) Σ_{i∈[n]} ‖x_i − P x_i‖².

Result: If S = Cov(X) and S = UΛU^⊤ is its eigen-decomposition, then P = U_(k) U_(k)^⊤, where U_(k) holds the top-k eigenvectors.

Proof
By the Pythagorean theorem: Σ_i ‖x_i − P x_i‖² = Σ_i ‖x_i‖² − Σ_i ‖P x_i‖².
So maximize: (1/n) Σ_i ‖P x_i‖² = (1/n) Σ_i Tr[P x_i x_i^⊤ P^⊤] = Tr[P S P^⊤].
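As an illustrative aside (not part of the original slides), this result can be checked in a few lines of MATLAB: form the sample covariance, take its top-k eigenvectors, and project. The data and variable names below are assumptions for the example.

d = 10; n = 5000; k = 3;
X  = randn(n, d) * randn(d, d);             % toy data; rows are the points x_i
Xc = X - repmat(mean(X, 1), n, 1);          % center the points
S  = (Xc' * Xc) / n;                        % sample covariance S
[U, L] = eig(S);                            % S = U*L*U'
[~, idx] = sort(diag(L), 'descend');
Uk = U(:, idx(1:k));                        % top-k eigenvectors U_(k)
P  = Uk * Uk';                              % optimal rank-k projection P = U_(k) U_(k)'
err = norm(Xc - Xc * P, 'fro')^2 / n        % minimized reconstruction error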
Review of linear algebra

For a matrix S, u is an eigenvector if Su = λu, and λ is the eigenvalue.
For symmetric S ∈ R^{d×d}, there are d eigenvalues.
S = Σ_{i∈[d]} λ_i u_i u_i^⊤, and U is orthogonal.

Rayleigh Quotient
For a matrix S with eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_d and corresponding eigenvectors u_1, . . . , u_d,

max_{‖z‖=1} z^⊤Sz = λ_1,  min_{‖z‖=1} z^⊤Sz = λ_d,

and the optimizing vectors are u_1 and u_d.

Optimal Projection
max over rank-k orthogonal projections P (P² = P, Rank(P) = k): Tr(P^⊤SP) = λ_1 + λ_2 + . . . + λ_k, and P spans {u_1, . . . , u_k}.
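A quick numerical sanity check of these facts, purely illustrative (the random matrix and names are assumptions):

d = 6; k = 2;
B = randn(d, d); S = B * B';                       % random symmetric PSD matrix
[U, L] = eig(S);
[lam, idx] = sort(diag(L), 'descend'); U = U(:, idx);
z = randn(d, 1); z = z / norm(z);                  % arbitrary unit vector
fprintf('z''Sz = %.3f  <=  lambda_1 = %.3f\n', z' * S * z, lam(1));
P = U(:, 1:k) * U(:, 1:k)';                        % projection onto top-k eigenspace
fprintf('Tr(P''SP) = %.3f  vs  lambda_1+...+lambda_k = %.3f\n', ...
        trace(P' * S * P), sum(lam(1:k)));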
PCA on Gaussian Mixtures

Mixture of k Gaussians: each sample is x = Ah + z.
h ∈ {e_1, . . . , e_k}, the basis vectors. E[h] = w.
A ∈ R^{d×k}: columns are the component means.
Let µ := Aw be the mean.
z ∼ N(0, σ²I) is white Gaussian noise.

E[(x − µ)(x − µ)^⊤] = Σ_{i∈[k]} w_i (a_i − µ)(a_i − µ)^⊤ + σ²I.

How the above equation is obtained
E[(x − µ)(x − µ)^⊤] = E[(Ah − µ)(Ah − µ)^⊤] + E[zz^⊤]
= Σ_{i∈[k]} w_i (a_i − µ)(a_i − µ)^⊤ + σ²I.
PCA on Gaussian Mixtures

E[(x − µ)(x − µ)^⊤] = Σ_{i∈[k]} w_i (a_i − µ)(a_i − µ)^⊤ + σ²I.

The vectors {a_i − µ} are linearly dependent: Σ_i w_i (a_i − µ) = 0. The PSD matrix Σ_{i∈[k]} w_i (a_i − µ)(a_i − µ)^⊤ therefore has rank ≤ k − 1.
(k − 1)-PCA on the covariance matrix, together with µ, yields span(A).
The lowest eigenvalue of the covariance matrix yields σ².
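An illustrative sketch of this recovery on simulated data: the smallest eigenvalue of the sample covariance estimates σ², and the top (k − 1) eigenvectors, together with µ, approximately span span(A). The parameters (d, k, n, σ, A, w) below are arbitrary choices for the example.

d = 8; k = 3; n = 100000; sigma = 0.5;
A = randn(d, k);                                  % columns a_i: component means
w = [0.5; 0.3; 0.2];                              % mixing weights
comp = sum(bsxfun(@gt, rand(n, 1), cumsum(w)'), 2) + 1;   % sample component labels
X = A(:, comp)' + sigma * randn(n, d);            % x = a_h + z, z ~ N(0, sigma^2 I)
mu  = mean(X, 1)';                                % estimate of mu = A*w
S   = cov(X, 1);                                  % sample covariance of x
lam = sort(eig(S), 'descend');
sigma2_hat = lam(end)                             % smallest eigenvalue ~ sigma^2
% the top (k-1) eigenvectors of S, together with mu, approximately span span(A)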
How to Learn A?
Learning through Spectral Clustering
Learning A through Spectral Clustering
Project samples x onto span(A).
Distance-based clustering (e.g. k-means).
A series of works, e.g. Vempala & Wang.
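A minimal sketch of this projection-plus-clustering recipe, continuing the simulated mixture from the previous sketch; kmeans requires the Statistics and Machine Learning Toolbox, and all names are illustrative.

% continuing the sketch above (A, w, k, X, mu, S already defined)
[U, L] = eig(S);
[~, idx] = sort(diag(L), 'descend');
Uk1 = U(:, idx(1:k-1));                           % top (k-1) eigenvectors: estimated span
Y = bsxfun(@minus, X, mu') * Uk1;                 % project centered samples onto the span
cluster_ids = kmeans(Y, k);                       % distance-based clustering in low dimension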
Failure to cluster under large variance.
Learning Gaussian Mixtures Without Separation Constraints?
Beyond PCA: Spectral Methods on Tensors

How to learn the component means A (not just its span) without separation constraints?
PCA is a spectral method on (covariance) matrices.
◮ Are higher order moments helpful?
What if the number of components exceeds the observed dimensionality, k > d?
◮ Do higher order moments help to learn overcomplete models?
What if the data is not Gaussian?
◮ Moment-based estimation of probabilistic latent variable models?
Tensor Notation for Higher Order Moments
Multi-variate higher order moments form tensors.
Are there spectral operations on tensors akin to PCA?
Matrix
E[x ⊗ x] ∈ R^{d×d} is a second order tensor.
E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].
For matrices: E[x ⊗ x] = E[xx^⊤].

Tensor
E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third order tensor.
E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
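As a concrete illustration (names are assumptions), the empirical third order moment can be formed as a d × d × d MATLAB array:

d = 4; n = 10000;
X = randn(n, d);                                  % rows are samples x
M3_hat = zeros(d, d, d);
for t = 1:n
    xt = X(t, :)';
    M3_hat = M3_hat + reshape(kron(xt, kron(xt, xt)), d, d, d);
end
M3_hat = M3_hat / n;                              % M3_hat(i1,i2,i3) ~ E[x_i1 * x_i2 * x_i3]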
Third order moment for Gaussian mixtures

Consider a mixture of k Gaussians: each sample is x = Ah + z.
h ∈ {e_1, . . . , e_k}, the basis vectors. E[h] = w.
A ∈ R^{d×k}: columns are the component means. µ := Aw is the mean.
z ∼ N(0, σ²I) is white Gaussian noise.

E[x ⊗ x ⊗ x] = Σ_i w_i a_i ⊗ a_i ⊗ a_i + σ² Σ_i (µ ⊗ e_i ⊗ e_i + e_i ⊗ µ ⊗ e_i + e_i ⊗ e_i ⊗ µ)

Intuition behind the equation
E[x ⊗ x ⊗ x] = E[(Ah) ⊗ (Ah) ⊗ (Ah)] + E[(Ah) ⊗ z ⊗ z] + . . .
= Σ_i w_i · a_i ⊗ a_i ⊗ a_i + σ² Σ_i (µ ⊗ e_i ⊗ e_i + . . .)

How to recover the parameters A and w from the third order moment?
Simplifications

E[x ⊗ x ⊗ x] = Σ_i w_i a_i ⊗ a_i ⊗ a_i + σ² Σ_i (µ ⊗ e_i ⊗ e_i + . . .)

σ² is obtained from σ_min(Cov(x)).
E[x ⊗ x] = Σ_i w_i a_i ⊗ a_i + σ²I.

Can obtain
M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,  M2 = Σ_i w_i a_i ⊗ a_i.

How to obtain the parameters A and w from M2 and M3?
Tensor Slices

M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,  M2 = Σ_i w_i a_i ⊗ a_i.

Multilinear transformation of a tensor
M3(B, C, D) := Σ_i w_i (B^⊤a_i) ⊗ (C^⊤a_i) ⊗ (D^⊤a_i)

Slice of a tensor
M3(I, I, r) = Σ_i w_i a_i ⊗ a_i ⟨a_i, r⟩ = A · Diag(w) Diag(A^⊤r) · A^⊤
M2 = A · Diag(w) · A^⊤
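An illustrative check of the slice identity above on toy parameters (names are assumptions): contracting the third mode of M3 with r reproduces A · Diag(w) Diag(A^⊤r) · A^⊤.

d = 5; k = 3;
A = randn(d, k); w = [0.5; 0.3; 0.2]; r = randn(d, 1);
M3 = zeros(d, d, d);                              % M3 = sum_i w_i a_i (x) a_i (x) a_i
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(kron(ai, kron(ai, ai)), d, d, d);
end
slice = zeros(d, d);                              % M3(I, I, r): contract third mode with r
for i3 = 1:d
    slice = slice + M3(:, :, i3) * r(i3);
end
closed_form = A * diag(w) * diag(A' * r) * A';    % A Diag(w) Diag(A'r) A'
norm(slice - closed_form, 'fro')                  % ~ 0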
Eigen-decomposition

M3(I, I, r) = A · Diag(w) Diag(A^⊤r) · A^⊤,  M2 = A · Diag(w) · A^⊤.

Assumption: A ∈ R^{d×k} has full column rank.
Let M2 = UΛU^⊤ be its eigen-decomposition, with U ∈ R^{d×k}.
U^⊤M2U = Λ ∈ R^{k×k} is invertible.

X = (U^⊤ M3(I, I, r) U)(U^⊤ M2 U)^{−1} = V · Diag(λ̃) · V^{−1}.

Substitution: X = (U^⊤A) Diag(A^⊤r) (U^⊤A)^{−1}.
We have v_i ∝ U^⊤a_i.

Technical Detail
r = Uθ with θ drawn uniformly from the sphere, to ensure an eigen-gap.
Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices

M3(I, I, r) = A · Diag(w) Diag(A^⊤r) · A^⊤,  M2 = A · Diag(w) · A^⊤.
(U^⊤ M3(I, I, r) U)(U^⊤ M2 U)^{−1} = V Diag(λ̃) V^{−1}.

Recovery of A (method 1)
Since v_i ∝ U^⊤a_i, recover A up to scale: a_i ∝ U v_i.

Recovery of A (method 2)
Let Θ ∈ R^{k×k} be a random rotation matrix.
Consider the k slices M3(I, I, θ_i) and find the eigen-decomposition above for each.
Let Λ̃ = [λ̃_1 | . . . | λ̃_k] be the matrix of eigenvalues of all slices.
Recover A as: A = UΘ^{−1}Λ̃.
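A hedged sketch of recovery method 1, run on exact population moments of a toy mixture so there is no sampling noise; names are illustrative, and the recovered columns match the a_i only up to scale and permutation.

d = 6; k = 3;
A = randn(d, k); w = [0.5; 0.3; 0.2];
M2 = A * diag(w) * A';
[U, L] = eig(M2);
[~, idx] = sort(diag(L), 'descend');
U = U(:, idx(1:k));                               % U in R^{d x k} spans range(A)
theta = randn(k, 1); theta = theta / norm(theta);
r = U * theta;                                    % random probe direction r = U*theta
slice = A * diag(w) * diag(A' * r) * A';          % population M3(I, I, r)
X = (U' * slice * U) / (U' * M2 * U);             % (U' M3(I,I,r) U)(U' M2 U)^{-1}
[V, ~] = eig(X);                                  % X = V Diag(lambda_tilde) V^{-1}
A_dir = real(U * V);                              % columns proportional to the a_i
% (eigenvalues are real here; real() only guards against tiny round-off)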
Putting it together

Implications
Learn Gaussian mixtures through slices of the third-order moment.
Guaranteed learning through eigen-decomposition.
“A Method of Moments for Mixture Models and Hidden Markov Models,” by A. Anandkumar, D. Hsu, and S.M. Kakade. Proc. of COLT, June 2012.

Shortcomings
The resulting product is not symmetric: the eigen-decomposition V Diag(λ̃) V^{−1} does not yield an orthonormal V. More involved in practice.
Requires a good eigen-gap in Diag(λ̃) for recovery. For r = Uθ, where θ is drawn uniformly from the unit sphere, the gap is of order 1/k^{2.5}. Numerical instability in practice.
M3(I, I, r) is only a (random) slice of the tensor. The full information is not utilized.
More efficient learning methods using higher order moments?
Tensor Factorization

Recover A and w from M2 and M3.
M2 = Σ_i w_i a_i ⊗ a_i,  M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

[Figure: Tensor M3 = w_1 · a_1 ⊗ a_1 ⊗ a_1 + w_2 · a_2 ⊗ a_2 ⊗ a_2 + . . .]

a ⊗ a ⊗ a is a rank-1 tensor, since its (i1, i2, i3)-th entry is a_{i1} a_{i2} a_{i3}.
M3 is a sum of rank-1 terms.
When is this the most compact representation? (Identifiability)
Can we recover the decomposition? (Algorithm)
The most compact representation is known as the CP decomposition (CANDECOMP/PARAFAC).
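A tiny illustrative check of the rank-1 claim above (names are assumptions):

d = 4; a = randn(d, 1);
T1 = reshape(kron(a, kron(a, a)), d, d, d);       % a (x) a (x) a as a d x d x d array
fprintf('T1(2,3,1) = %.4f,  a(2)*a(3)*a(1) = %.4f\n', T1(2,3,1), a(2)*a(3)*a(1));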
Initial Ideas

M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

What if A has orthogonal columns?
M3(I, a_1, a_1) = Σ_i w_i ⟨a_i, a_1⟩² a_i = w_1 a_1.
The a_i are eigenvectors of the tensor M3.
Analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Two Problems
How to find eigenvectors of a tensor?
A is not orthogonal in general.
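Before addressing these two problems, a short illustrative check of the eigenvector identity above when the columns of A are orthonormal (names are assumptions):

d = 5; k = 3;
[A, ~] = qr(randn(d, k), 0);                      % A with orthonormal columns a_i
w = [0.5; 0.3; 0.2];
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(kron(ai, kron(ai, ai)), d, d, d);
end
a1 = A(:, 1);
v = reshape(M3, d, d*d) * kron(a1, a1);           % M3(I, a1, a1)
norm(v - w(1) * a1)                               % ~ 0: a1 is a tensor eigenvector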
Whitening

M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,  M2 = Σ_i w_i a_i ⊗ a_i.

Find a whitening matrix W s.t. W^⊤A = V is an orthogonal matrix.
When A ∈ R^{d×k} has full column rank, this is an invertible transformation.

[Figure: whitening by W maps a_1, a_2, a_3 to orthogonal vectors v_1, v_2, v_3.]
Using Whitening to Obtain Orthogonal Tensor

[Figure: tensor M3 mapped to tensor T by the multilinear transform.]

M3 ∈ R^{d×d×d} and T ∈ R^{k×k×k}.
T = M3(W, W, W) = Σ_i w_i (W^⊤a_i)^{⊗3}.
T = Σ_{i∈[k]} w_i · v_i ⊗ v_i ⊗ v_i is orthogonal.
Dimensionality reduction when k ≪ d.
How to Find Whitening Matrix?

M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,  M2 = Σ_i w_i a_i ⊗ a_i.

[Figure: whitening by W maps a_1, a_2, a_3 to orthogonal vectors v_1, v_2, v_3.]

Use the pairwise moments M2 to find W s.t. W^⊤ M2 W = I.
Eigen-decomposition M2 = U Diag(λ̃) U^⊤, then W = U Diag(λ̃)^{−1/2}.
V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix.

T = M3(W, W, W) = Σ_i w_i^{−1/2} (W^⊤ a_i √w_i)^{⊗3} = Σ_i λ_i v_i ⊗ v_i ⊗ v_i,  λ_i := w_i^{−1/2}.

T is an orthogonal tensor.
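A sketch of this whitening construction on exact moments of a toy model (names are assumptions); the checks confirm W^⊤M2W = I and that V is orthogonal, so M3(W, W, W) is an orthogonal tensor.

d = 6; k = 3;
A = randn(d, k); w = [0.5; 0.3; 0.2];
M2 = A * diag(w) * A';
[U, L] = eig(M2);
[lam, idx] = sort(diag(L), 'descend');
U = U(:, idx(1:k)); lam = lam(1:k);
W = U * diag(lam .^ (-0.5));                 % whitening matrix
norm(W' * M2 * W - eye(k), 'fro')            % ~ 0: W' M2 W = I
V = W' * A * diag(sqrt(w));
norm(V' * V - eye(k), 'fro')                 % ~ 0: V is orthogonal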
Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i,  ⟨v_i, v_j⟩ = δ_{i,j} ∀ i, j.

T(I, v_1, v_1) = Σ_i λ_i ⟨v_i, v_1⟩² v_i = λ_1 v_1.
The v_i are eigenvectors of the tensor T.

Tensor Power Method
Start from an initial vector v. Iterate v ↦ T(I, v, v) / ‖T(I, v, v)‖.

Canonical recovery method
Randomly initialize the power method. Run to convergence to obtain v with eigenvalue λ.
Deflate: T ← T − λ v ⊗ v ⊗ v, and repeat.
Putting it together

Gaussian mixture: x = Ah + z, where E[h] = w and z ∼ N(0, σ²I).
M2 = Σ_i w_i a_i ⊗ a_i,  M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

Obtain the whitening matrix W from the SVD of M2.
Use W for the multilinear transform: T = M3(W, W, W).
Find the eigenvectors of T through the power method and deflation (sketched below).

What about learning other latent variable models?
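Before turning to other models, an end-to-end illustrative sketch of this pipeline on exact population moments (so recovery is essentially exact): whiten M2, form T = M3(W, W, W), run the tensor power method with deflation, and undo the whitening. The parameters, tolerances, and variable names are assumptions for the example.

d = 6; k = 3;
A = randn(d, k); w = [0.5; 0.3; 0.2];
M2 = A * diag(w) * A';
[U, L] = eig(M2); [lam, idx] = sort(diag(L), 'descend');
U = U(:, idx(1:k)); lam = lam(1:k);
W = U * diag(lam .^ (-0.5));                    % whitening: W'*M2*W = I
B = W' * A;                                     % whitened component means
T = zeros(k, k, k);                             % T = M3(W,W,W) = sum_i w_i (W'a_i)^{x3}
for i = 1:k
    T = T + w(i) * reshape(kron(B(:,i), kron(B(:,i), B(:,i))), k, k, k);
end
V = zeros(k, k); lambda = zeros(k, 1);
for i = 1:k                                     % power method + deflation
    v = randn(k, 1); v = v / norm(v);
    for iter = 1:100
        Tv = reshape(T, k, k*k) * kron(v, v);   % T(I, v, v)
        v_new = Tv / norm(Tv);
        if norm(v_new - v) < 1e-12, v = v_new; break; end
        v = v_new;
    end
    lambda(i) = v' * (reshape(T, k, k*k) * kron(v, v));   % eigenvalue T(v,v,v)
    V(:, i) = v;
    T = T - lambda(i) * reshape(kron(v, kron(v, v)), k, k, k);    % deflate
end
w_hat = lambda .^ (-2)                          % since lambda_i = w_i^{-1/2}
A_hat = pinv(W') * V * diag(lambda)             % a_i = lambda_i * (W')^+ v_i, up to permutation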
Topic Modeling
Probabilistic Topic Models
Useful abstraction for automatic categorization of documents
Observed: words. Hidden: topics.
Bag of words: order of words does not matter
Graphical model representation
l words in a document x1, . . . , xl.
h: proportions of topics in a document.
Word xi generated from topic yi.
Exchangeability: x1 ⊥⊥ x2 ⊥⊥ . . . |h
A(i, j) := P[x_m = i | y_m = j]: the topic-word matrix.

[Figure: graphical model with topic mixture h, topics y1, . . . , y5, and words x1, . . . , x5, each word generated via the topic-word matrix A.]
Distribution of the topic proportions vector h

If there are k topics, h has a distribution over the simplex ∆^{k−1}:
∆^{k−1} := {h ∈ R^k : h_i ∈ [0, 1], Σ_i h_i = 1}.

Simplification for today: there is only one topic in each document.
◮ h ∈ {e_1, . . . , e_k}: basis vectors.
Tomorrow: Latent Dirichlet allocation.
Formulation as Linear Models

Distribution of the words x1, x2, . . .
Order the d words in the vocabulary. If x1 is the jth word, assign x1 = e_j ∈ R^d.
The distribution of each xi is supported on the vertices of ∆^{d−1}.

Properties
Pr[x1 | topic h = i] = E[x1 | topic h = i] = a_i.
Linear model: E[xi | h] = Ah.
Multiview model: h is fixed and multiple words (xi) are generated.
Geometric Picture for Topic Models

[Figure: a document corresponds to a topic proportions vector h in the topic simplex; with a single topic, h sits at a vertex, and the words x1, x2, . . . are generated from the corresponding column of A.]
Gaussian Mixtures vs. Topic Models

(Spherical) mixture of Gaussians:
k means a_1, . . . , a_k; component h = i with prob. w_i; observe x with spherical noise, x = a_i + z, z ∼ N(0, σ_i²I).

(Single) topic model:
k topics a_1, . . . , a_k; topic h = i with prob. w_i; observe l (exchangeable) words x1, x2, . . . , xl drawn i.i.d. from a_i.

Unified Linear Model: E[x | h] = Ah
Gaussian mixture: single view, spherical noise.
Topic model: multi-view, heteroskedastic noise.
Three words per document suffice for learning:
M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,  M2 = Σ_i w_i a_i ⊗ a_i.
Recap: Basic Tensor Decomposition Method
Toy Example in MATLAB
Simulated Samples: Exchangeable Model
Whiten The Samples
◮ Second Order Moments
◮ Matrix Decomposition
Orthogonal Tensor Eigen Decomposition
◮ Third Order Moments
◮ Power Iteration
Simulated Samples: Exchangeable Model

Model Parameters
Hidden state: h ∈ basis {e_1, . . . , e_k}, k = 2.
Observed states: xi ∈ basis {e_1, . . . , e_d}, d = 3.
Conditional independence: x1 ⊥⊥ x2 ⊥⊥ x3 | h. Transition matrix: A.
Exchangeability: E[xi | h] = Ah, ∀ i ∈ {1, 2, 3}.

Generate Samples Snippet
for t = 1 : n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample | h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end
Whiten The Samples

Second Order Moments
M2 = (1/n) Σ_t x1^t ⊗ x2^t

Whitening Matrix
W = U_w L_w^{−0.5},  [U_w, L_w] = k-svd(M2)

Whiten Data
y1^t = W^⊤ x1^t

Orthogonal Basis
V = W^⊤A → V^⊤V = I

Whitening Snippet
fprintf('The second order moment M2:'); M2 = x1' * x2 / n
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:'); Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;

a_1 is not orthogonal to a_2, whereas v_1 ⊥ v_2.
Orthogonal Tensor Eigen Decomposition

Third Order Moments
T = (1/n) Σ_{t∈[n]} y1^t ⊗ y2^t ⊗ y3^t ≈ Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i,  V^⊤V = I

Gradient Ascent
T(I, v_1, v_1) = (1/n) Σ_{t∈[n]} ⟨v_1, y2^t⟩⟨v_1, y3^t⟩ y1^t ≈ Σ_i λ_i ⟨v_i, v_1⟩² v_i = λ_1 v_1.

The v_i are eigenvectors of the tensor T.
Orthogonal Tensor Eigen Decomposition

T ← T − Σ_j λ_j v_j^{⊗3},  v ← T(I, v, v) / ‖T(I, v, v)‖

Power Iteration Snippet
V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = normc(v_old);
    for iter = 1 : Maxiter
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;
        if i > 1
            % deflation
            for j = 1 : i-1
                v_new = v_new - (V(:, j) * (v_old' * V(:, j))^2) * Lambda(j);
            end
        end
        lambda = norm(v_new); v_new = normc(v_new);
        if norm(v_old - v_new) < TOL
            fprintf('Converged at iteration %d.', iter);
            V(:, i) = v_new; Lambda(i, 1) = lambda;
            break;
        end
        v_old = v_new;
    end
end

[Figure: power iterations 0 through 5; green: ground truth, red: estimate at each iteration.]
Conclusion

[Figure: multi-view model with hidden h and observed x1, x2, x3.]

Gaussian mixtures and topic models: a unified linear representation.
Learning through higher order moments.
Tensor decomposition via whitening and the power method.

Tomorrow's lecture
Latent variable models and moments. Analysis of the tensor power method.
Wednesday's lecture: Implementation of tensor methods.

Code snippet available at http://newport.eecs.uci.edu/anandkumar/MLSS.html