
Page 1:

Aravindan Vijayaraghavan

NYU ⇒ Northwestern University

Smoothed Analysis of Tensor Decompositions

based on joint work with

Aditya Bhaskara

Google Research

Moses Charikar

Princeton

Ankur Moitra

MIT

Page 2:

Multi-dimensional arrays

Tensors


• A tensor of order p (a p-tensor) has size n × n × ⋯ × n (p times)

• Entries are real numbers (over ℝ)
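For concreteness (not part of the original slides), a minimal numpy sketch of a p-tensor as a data structure; the sizes are placeholders:

```python
import numpy as np

n, p = 5, 3
# An order-p tensor of size n x n x ... x n (p times), with real entries
T = np.random.randn(*([n] * p))
print(T.shape)   # (5, 5, 5)
```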

Page 3:

Low rank Decompositions

A low-rank tensor can be written as a sum of a few rank-one tensors:

T = ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i

Rank(T) = smallest k s.t. T can be written as a sum of k rank-1 tensors

3-Tensors: [Figure: T drawn as a sum of k rank-one terms with factor vectors a_1, …, a_k and c_1, …, c_k]

• Rank of a p-tensor T (of size n × ⋯ × n) is ≤ n^{p−1}

Low-rank ε-approximation: a low-rank decomposition approximating T up to error ε in Frobenius norm, i.e. ‖T − ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i‖_F ≤ ε
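As an illustrative sketch of the definitions above (names and sizes are placeholders, not the talk's notation), build T = ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i with numpy and measure the Frobenius error of a candidate low-rank approximation:

```python
import numpy as np

n, k = 10, 4
A, B, C = np.random.randn(n, k), np.random.randn(n, k), np.random.randn(n, k)

# T = sum_{i=1}^k a_i (x) b_i (x) c_i  -- a 3-tensor of rank at most k
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# Frobenius error of a rank-(k-1) candidate obtained by dropping one term
T_hat = np.einsum('ir,jr,kr->ijk', A[:, :-1], B[:, :-1], C[:, :-1])
eps = np.linalg.norm(T - T_hat)   # ||T - sum_i a_i (x) b_i (x) c_i||_F
```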

Page 4:

Tensor Decomposition: Uniqueness

Thm [Kruskal '77]. Rank-k decompositions for 3-tensors are unique (non-algorithmic) under a rank condition (k ≤ 3n/2 − 1)

• p-tensors: the rank condition gives k ≤ (pn − p + 1)/2 [SB01]

Thm [Jennrich via Harshman '70]. Finds the unique rank-k decomposition of a 3-tensor when the vectors of the decomposition are linearly independent (hence k ≤ n)

• “Full-rank” case. Rediscovered in [Leurgans et al. 93, Chang’96]

Thm [De Lathauwer, Castaing, Cardoso '07]. Algorithm for 4-tensors of rank k, generically, when k ≤ c·n²

• p-tensors: generically handle k ≤ c·n^{⌊p/2⌋}

Thm [Chiantini-Ottaviani '14]. Uniqueness for 3-tensors of rank k ≤ n²/3, generically

Page 5:

Algorithms for Tensor Decompositions

NP-hard in general when rank k ≫ n (except in special settings) [Hillar-Lim]

• Polynomial time algorithms* for robust Tensor decompositions

• Introduce Smoothed Analysis to overcome worst-case intractability

• Handle rank 𝒌 ≫ 𝒏 for higher order tensors (𝒑 ≥ 𝟓).

*Algorithms run in time poly_p(n, k, 1/ε) and recover the vectors of the decomposition up to ε error in ℓ₂ norm.

This talk

Page 6:

Talk Plan

1. Applications of Tensor Decompositions to ML

– Motivating Algorithm properties

2. Smoothed Analysis Model and Results

3. Overview of the Proof

Page 7:

Learning Probabilistic Models: Parameter Estimation

Learning goal: Can the parameters of the model be learned from polynomial samples generated by the model ?

HMMs for speech recognition

Mixture of Gaussians for clustering points

Question: Can given data be “explained” by a simple probabilistic model?

Multiview models

Page 8:

Parameters:

• Mixing weights w_1, w_2, …, w_k

• Gaussian G_i: mean μ_i, diagonal covariance Σ_i

Learning problem: Given many sample points, find (w_i, μ_i, Σ_i)

Probabilistic model for Clustering in 𝒏-dims

Mixtures of (axis-aligned) Gaussians

• Algorithms use O(exp(k) · poly(n)) samples and time [FOS'06, MV'10]

• Lower bound of Ω(exp(k)) [MV'10] in the worst case

[Figure: sample points x and cluster means μ_i in ℝⁿ]

Aim: poly(k, n) guarantees in realistic settings
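To make the generative model concrete, here is a minimal sketch (not from the slides) of sampling from a mixture of axis-aligned Gaussians; the weights, means, and variances below are toy placeholders:

```python
import numpy as np

def sample_mixture(w, mu, sigma2, num_samples, rng=np.random.default_rng()):
    """w: (k,) mixing weights; mu: (k, n) means; sigma2: (k, n) diagonal covariances."""
    z = rng.choice(len(w), size=num_samples, p=w)          # latent cluster labels
    return mu[z] + rng.normal(size=(num_samples, mu.shape[1])) * np.sqrt(sigma2[z])

# Toy instance: k = 2 clusters in n = 3 dimensions
X = sample_mixture(np.array([0.6, 0.4]),
                   np.array([[0., 0., 0.], [3., 3., 3.]]),
                   np.ones((2, 3)), num_samples=1000)
```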

Page 9:

Method of Moments and Tensor decompositions

Step 1. Compute a tensor whose decomposition encodes the model parameters.

Step 2. Find the decomposition (and hence the parameters).

[Figure: an n × n × n tensor with entries E[x_i x_j x_k]]

T = ∑_{i=1}^{k} w_i μ_i ⊗ μ_i ⊗ μ_i

• Uniqueness ⟹ recover the parameters w_i and μ_i

• Algorithm for decomposition ⟹ efficient learning

[Chang] [Allman, Matias, Rhodes]

[Anandkumar,Ge,Hsu, Kakade, Telgarsky]

Third moment tensor
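A minimal sketch of step 1 (an assumption-laden illustration, not the constructions in the cited works): estimate E[x ⊗ x ⊗ x] empirically. The raw third moment equals ∑_i w_i μ_i ⊗ μ_i ⊗ μ_i only in idealized settings (e.g. multiview-style models); for Gaussian mixtures the cited works subtract covariance correction terms, which are omitted here.

```python
import numpy as np

def empirical_third_moment(X):
    """Empirical estimate of E[x (x) x (x) x] from a sample matrix X of shape (m, n).
    Note: correction terms needed for Gaussian mixtures are intentionally omitted."""
    return np.einsum('si,sj,sk->ijk', X, X, X) / X.shape[0]
```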

Page 10:

Aim: 1. Uniqueness of tensor decompositions

     2. Algorithms taking time poly(n, k), robust to noise ε = 1/poly(n, k)

Robustness to Errors

Beware: sampling error.

Empirical estimate T̃ ≈_ε ∑_{i=1}^{k} w_i μ_i ⊗ μ_i ⊗ μ_i

With poly(n) samples, error ϵ ≈ 1/poly(n, k)

Thm [Goyal-Vempala-Xiao]. Jennrich's polytime algorithm for tensor decompositions of rank k ≤ n is robust up to 1/poly(n) error

⇒ Efficient learning for many probabilistic models when the number of clusters/topics k ≤ dimension n [Chang '96, Mossel-Roch '06, Hsu-Kakade '12, Anandkumar et al. '09-'14]

Page 11:

Overcomplete Setting

Number of clusters/topics/states 𝐤 ≫ dimension 𝐧

Computer Vision Speech

NP-hard in worst-case (for rank k ≥ 6n)

Polytime decomposition of tensors of rank k ≫ n? (rank can be as large as k ≤ n^{p−1})

Page 12:

Smoothed Analysis

[Spielman & Teng 2000]

• Small random perturbation of input makes instances easy

• Best polytime guarantees in the absence of any worst-case guarantees

Good smoothed analysis guarantees:

• Worst instances are isolated

• Example: the simplex algorithm solves LPs efficiently (explains practice)

Page 13:

Smoothed Analysis for Learning

Learning setting (e.g. Mixtures of Gaussians)

Worst-case instances: Means 𝝁𝒊 in pathological configurations

Means are not in adversarial configurations in the real world!

What if the means μ_i are perturbed slightly (μ_i → μ̃_i)?

Parameters of the model are perturbed slightly.

Page 14:

Smoothed Analysis for Tensor Decompositions

1. Given a tensor (of size d × d × ⋯ × d)

   T = ∑_{i=1}^{k} a_i^{(1)} ⊗ a_i^{(2)} ⊗ ⋯ ⊗ a_i^{(p)}

2. Each ã_i^{(j)} is a random ρ-perturbation of a_i^{(j)}, i.e. add an independent (Gaussian) random vector of length ≈ ρ.

3. Input: T̃ = ∑_{i=1}^{k} ã_i^{(1)} ⊗ ã_i^{(2)} ⊗ ⋯ ⊗ ã_i^{(p)} + noise. Analyse the algorithm on T̃.

Factors of the Decomposition are perturbed

For mixtures of Gaussians, the means μ_i are perturbed slightly:

T = ∑_{i=1}^{k} w_i μ_i ⊗ μ_i ⊗ ⋯ ⊗ μ_i
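A small sketch of the perturbation step in the model above; the coordinate-wise scaling ρ/√n (so each added Gaussian vector has expected norm ≈ ρ) is one natural convention, assumed here for illustration:

```python
import numpy as np

def rho_perturb(A, rho, rng=np.random.default_rng()):
    """Return A~: each column of A plus an independent Gaussian vector of
    expected length ~ rho (per-coordinate standard deviation rho / sqrt(n))."""
    n, k = A.shape
    return A + (rho / np.sqrt(n)) * rng.normal(size=(n, k))
```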

Page 15:

Smoothed Analysis model

• Different from elements of T being perturbed

• More similar in spirit to generic results than to average-case analysis: no regions that are hard

T̃ = ∑_{i=1}^{k} ã_i^{(1)} ⊗ ã_i^{(2)} ⊗ ⋯ ⊗ ã_i^{(p)} + noise

[Figure: problem instance space, with hard instances isolated among easy ones]

• Robust analog for generic results ?

Page 16:

Algorithmic Guarantees

Thm. Polynomial time algorithm for decomposing a p-tensor (of size n^p) in the smoothed analysis model, when rank k ≤ n^{⌊(p−1)/2⌋}, w.p. 1 − exp(−n^{f(p)}).

Running time, sample complexity = poly_p(n, 1/ρ).

Guarantees for order-p tensors in n dimensions; rank of the tensor = k (number of clusters):

• Previous algorithms: k ≤ n

• This work (smoothed case): k ≤ n^{⌊(p−1)/2⌋}

Corollary. Polytime algorithms (smoothed analysis) for learning the parameters of mixtures of axis-aligned Gaussians, multiview models, etc. when the number of clusters k ≤ n^C for any constant C, w.h.p.

Page 17:

Interpreting Smoothed Analysis Guarantees

Time, sample complexity = poly_p(n, 1/ρ).

Works with probability 1 − exp(−ρ · n^{3^{−p}})

• Exponentially small failure probability in polytime (for constant p)

Smooth Interpolation between Worst-case and Average-case

[Anderson, Belkin, Goyal, Rademacher, Voss '14]:

Time, sample complexity = poly_p(n, 1/ρ, 1/τ). Works with probability 1 − τ.

Page 18:

Algorithm Details

Page 19:

Algorithm Outline

1. An algorithm for 3-tensors in the ``full rank setting'' (k ≤ n).

   [Jennrich '70] A simple (robust) algorithm for a 3-tensor T when σ_k(A), σ_k(B), σ_2(C) ≥ 1/poly(n, k)

2. For higher order tensors, use ``tensoring / flattening''.

   • Helps handle the over-complete setting (k ≫ n)

Recall: T = ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i, where the n × k matrix A has columns a_i (similarly B and C). Aim: recover A, B, C.

• Any algorithm for full-rank tensors suffices

Page 20:

Blast from the Past

Recall: T ≈_ε ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i, where the n × k matrix A has columns a_i (similarly B and C). Aim: recover A, B, C.

Qn. Is this algorithm robust to errors ?

Yes ! Needs perturbation bounds for eigenvectors. [Stewart-Sun]

Thm. Efficiently decompose T ≈_ε ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i and recover A, B, C up to ε · poly(n, k) error when 1) A, B have min singular value ≥ 1/poly(n), and 2) C does not have parallel columns (robustly).

[Jennrich via Harshman '70] Algorithm for a 3-tensor T = ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i when:

• A, B have rank k, i.e. the a_i (and the b_i) are linearly independent

• C has rank ≥ 2

• Reduces to matrix eigen-decompositions

Page 21:

Algorithm:

1. Take a random combination of the slices of T along w_1 to get M_1.

2. Take a random combination of the slices along w_2 to get M_2.

3. Find the eigendecomposition of M_1 M_2^† to get A. Similarly recover B and C.

Decomposition algorithm [Jennrich]

T ≈_ε ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i

Thm. Efficiently decompose T ≈_ε ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i and recover A, B, C up to ε · poly(n, k) error (in Frobenius norm) when 1) A, B are full rank, i.e. min singular value ≥ 1/poly(n), and 2) C does not have parallel columns (in a robust sense).
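The three algorithm steps above, as a minimal numpy sketch in exact arithmetic (it ignores the eigenvector perturbation bounds the robust version needs); the eigenvalue-based pairing of the A- and B-factors and the least-squares recovery of C are implementation choices made for this illustration:

```python
import numpy as np

def jennrich(T, k, rng=np.random.default_rng()):
    """Decompose T ~ sum_{i=1}^k a_i (x) b_i (x) c_i, assuming A, B have full
    column rank and C has no parallel columns (Jennrich's conditions)."""
    n = T.shape[0]
    w1, w2 = rng.normal(size=n), rng.normal(size=n)
    # Random contractions along the third mode: M_w = sum_i <w, c_i> a_i b_i^T
    M1 = np.einsum('ijl,l->ij', T, w1)
    M2 = np.einsum('ijl,l->ij', T, w2)
    # M1 pinv(M2) = A diag(lam) pinv(A) with lam_i = <w1,c_i>/<w2,c_i>, so its
    # eigenvectors for the k nonzero eigenvalues are the a_i (up to scaling).
    eva, VA = np.linalg.eig(M1 @ np.linalg.pinv(M2))
    evb, VB = np.linalg.eig(M1.T @ np.linalg.pinv(M2.T))   # eigenvectors ~ b_i
    # Keep the k largest eigenvalues; pair A- and B-eigenvectors by eigenvalue.
    ia = np.argsort(-np.abs(eva))[:k]; ia = ia[np.argsort(np.real(eva[ia]))]
    ib = np.argsort(-np.abs(evb))[:k]; ib = ib[np.argsort(np.real(evb[ib]))]
    A, B = np.real(VA[:, ia]), np.real(VB[:, ib])
    # Recover C (absorbing all scalings) by least squares on the flattened tensor.
    KR = np.einsum('ir,jr->ijr', A, B).reshape(n * n, k)    # columns a_i (x) b_i
    C = np.linalg.lstsq(KR, T.reshape(n * n, n), rcond=None)[0].T
    return A, B, C
```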

Page 22:

Handling high rank: into the techniques

Page 23:

Mapping to Higher Dimensions

How do we handle the case rank k = Ω(n²)?

(or vectors with ``many'' linear dependencies?)

Consider a 6th-order tensor with rank k ≤ n²:

T = ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i ⊗ d_i ⊗ e_i ⊗ f_i

Trick: view T as an n² × n² × n² object; the vectors in its decomposition are {a_i ⊗ b_i}, {c_i ⊗ d_i}, {e_i ⊗ f_i}.

Qn: Are these vectors {a_i ⊗ b_i}_{i=1…k} linearly independent? Is the ``dimensionality'' Ω(n²)?
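A small numpy sketch of the trick above (shapes chosen only for illustration): build a 6th-order tensor, reshape it into an n² × n² × n² 3-tensor, and check that its factors are exactly the product vectors a_i ⊗ b_i, c_i ⊗ d_i, e_i ⊗ f_i:

```python
import numpy as np

n, k = 4, 10                                  # k can be as large as ~ n^2 here
A, B, C, D, E, F = (np.random.randn(n, k) for _ in range(6))

# 6th-order tensor T = sum_i a_i (x) b_i (x) c_i (x) d_i (x) e_i (x) f_i
T6 = np.einsum('ar,br,cr,dr,er,fr->abcdef', A, B, C, D, E, F)

# View T as an n^2 x n^2 x n^2 object; factors become a_i(x)b_i, c_i(x)d_i, e_i(x)f_i
T3 = T6.reshape(n * n, n * n, n * n)
AB = np.einsum('ir,jr->ijr', A, B).reshape(n * n, k)   # columns a_i (x) b_i
CD = np.einsum('ir,jr->ijr', C, D).reshape(n * n, k)
EF = np.einsum('ir,jr->ijr', E, F).reshape(n * n, k)
assert np.allclose(T3, np.einsum('ir,jr,kr->ijk', AB, CD, EF))
```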

Page 24:

Bad cases

Smoothed Analysis: Can we hope for the ``dimension'' to multiply, typically?

Bad example where k > 2n:

• Columns of A = B are composed of two orthonormal bases of ℝⁿ

• Every n vectors of A and B are linearly independent

• But (2n − 1) vectors of Z are linearly dependent!

Setup: A and B are n × k matrices with columns a_i, b_i, each of rank n; Z is the n² × k matrix with columns z_i = a_i ⊗ b_i ∈ ℝ^{n²}, for i = 1 … k = n².

The dimension does not grow multiplicatively in the worst case

But, bad examples are pathological and hard to construct!

Qn: Are {a_i ⊗ b_i}_{i=1…k} linearly independent? Is the ``dimensionality'' Ω(n²)?

Page 25:

Product vectors & linear structure

• ``Flattening'' of the 3p-order moment tensor

• The new factor matrix is full rank, using smoothed analysis.

Theorem. For any matrix A (n × k) with k < n^p/2: σ_k(Z) ≥ 1/poly(k, n, 1/ρ) with probability 1 − exp(−poly(n)).

Setup: A (n × k) has columns a_i; each column is ρ-perturbed to ã_i; Z (n^p × k) has columns ã_i ⊗ ã_i ⊗ ⋯ ⊗ ã_i (the Khatri-Rao product).
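A numerical sketch of the quantity in the theorem (parameters and the perturbation scaling are placeholder assumptions): ρ-perturb the columns of an arbitrary A, form the p-fold Khatri-Rao power Z with columns ã_i ⊗ ⋯ ⊗ ã_i, and inspect σ_k(Z):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise tensor product: column i of the result is a_i (x) b_i."""
    (n, k), (m, _) = A.shape, B.shape
    return np.einsum('ir,jr->ijr', A, B).reshape(n * m, k)

n, k, p, rho = 8, 30, 2, 0.1                  # regime k < n^p / 2
A = np.random.randn(n, k)                     # an arbitrary (even adversarial) matrix
A_t = A + (rho / np.sqrt(n)) * np.random.randn(n, k)   # rho-perturbed columns a~_i

Z = A_t
for _ in range(p - 1):                        # Z has columns a~_i (x) ... (x) a~_i
    Z = khatri_rao(Z, A_t)

sigma_k = np.linalg.svd(Z, compute_uv=False)[k - 1]
print(sigma_k)   # the theorem: >= 1/poly(k, n, 1/rho) with high probability
```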

Page 26:

Proof sketch (two-wise product, p = 2)

Main issue: the perturbation happens before the product.

• Easy if the columns were perturbed after the tensor product (simple anti-concentration bounds)

Technical component: show that products of perturbed vectors behave like random vectors in ℝ^{n²}.

Prop. For any matrix A, the n² × k matrix Z (with k < n²/2) has σ_k(Z) ≥ 1/poly(k, n, 1/ρ) with probability 1 − exp(−poly(n)).

• only 2𝑛 bits of randomness in 𝑛2 dims • Block dependencies

Page 27:

Projections of product vectors

Easy case: for an n²-dimensional vector x and a ρ-perturbation x̃ of x, the projection of x̃ onto S is ≥ poly(ρ, 1/n) w.h.p.; anti-concentration for polynomials implies this with probability 1 − 1/poly(n).


Much tougher for product of perturbations! (inherent block structure)

Question. Given any vectors a, b ∈ ℝⁿ and Gaussian ρ-perturbations ã, b̃, does ã ⊗ b̃ have projection ≥ poly(ρ, 1/n) onto any given n²/2-dimensional subspace S ⊂ ℝ^{n²} with probability 1 − exp(−√n)?

Page 28:

Projections of product vectors

Question. Given any vectors a, b ∈ ℝⁿ and Gaussian ρ-perturbations ã, b̃, does ã ⊗ b̃ have projection ≥ poly(ρ, 1/n) onto any given n²/2-dimensional subspace S ⊂ ℝ^{n²} with probability 1 − exp(−√n)?

Π_S is the (n²/2) × n² projection matrix onto S.

Key identity: Π_S(ã ⊗ b̃) = Π_S(b̃) · ã, where Π_S(b̃) is the (n²/2) × n matrix obtained by taking the dot product of each n-length block of (each row of) Π_S with b̃.
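The reshaping identity above is easy to check numerically; in this sketch P stands in for Π_S (any (n²/2) × n² matrix satisfies the identity), and all names are illustrative:

```python
import numpy as np

n = 6
m = n * n // 2                          # dim(S) = n^2 / 2
P = np.random.randn(m, n * n)           # stand-in for Pi_S, rows indexed by S
a, b = np.random.randn(n), np.random.randn(n)

lhs = P @ np.kron(a, b)                 # Pi_S (a (x) b)
Pb = P.reshape(m, n, n) @ b             # dot product of each n-block with b -> (m, n)
rhs = Pb @ a                            # (Pi_S b) a
assert np.allclose(lhs, rhs)
```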

Page 29:

Two steps of the proof:

1. W.h.p. (over the perturbation of b), Π_S(b̃) has at least r eigenvalues > poly(ρ, 1/n). [We will show this with r = √n.]

2. If Π_S(b̃) has r eigenvalues > poly(ρ, 1/n), then w.p. 1 − exp(−r) (over the perturbation of a), ã ⊗ b̃ has a large projection onto S. [Follows easily by analyzing the projection of a vector onto a dim-r space.]

Page 30:

Structure in any subspace S

Suppose we choose Π_S first, and the n × n ``blocks'' in Π_S were orthogonal…

• (Restricted to n columns) entry (i, j) of Π_S(b̃) is ⟨v_ij, b̃⟩ for some v_ij ∈ ℝⁿ

• It is a translated i.i.d. Gaussian matrix, so it has many (≈ √n) big eigenvalues.

Page 31:

Finding Structure in any subspace S

Main claim: every c·n²-dimensional space S has ~√n vectors with such a structure.

Property: the picked blocks (n-dimensional vectors v_1, v_2, …, v_{√n}) have a ``reasonable'' component orthogonal to the span of the rest.

The earlier argument goes through even with blocks that are not fully orthogonal!

Page 32:

Main claim (sketch)

Idea: obtain ``good'' columns one by one.

• Show there exists a block with many linearly independent ``choices'' (crucially using the fact that we have an Ω(n²)-dimensional subspace)

• Fix some choices and argue that the same property still holds, …

• Uses a delicate inductive argument

Generalization: a similar result holds for higher order products, and implies the main result.

Page 33:

Summary

• Polynomial time algorithms when rank k ≫ n:

  – Tensor decompositions when rank k = n^{O(1)}

  – Learning when the number of clusters/topics k = n^{O(1)}

• Smoothed Analysis for Tensor Decompositions & Learning

Page 34:

Future Directions

Smoothed Analysis for Tensor Decompositions, Learning

• Handling ranks that match the generic results for uniqueness and algorithms?

• Polynomially robust analogs of [Chiantini-Ottaviani] or [Cardoso]?

• Proofs for generic results that are more amenable to noise?

Better Robustness to Errors

• Tensor decomposition algorithms that are more robust to errors?

  Promising: [Barak-Kelner-Steurer '14] using the Lasserre hierarchy

• Modelling errors?

Page 35:

Thank You!

Questions?