TRANSCRIPT
Beating the Perils of Non-Convexity:
Machine Learning using Tensor Methods
Anima Anandkumar
Joint work with Majid Janzamin and Hanie Sedghi.
U.C. Irvine
Learning with Big Data
Learning is finding a needle in a haystack.
High dimensional regime: as data grows, more variables!
Useful information: low-dimensional structures.
Learning with big data: ill-posed problem.
Learning with big data: statistically and computationally challenging!
Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning

Clustering: k-means, hierarchical, . . .

Maximum likelihood estimation: probabilistic latent variable models

Supervised Learning

Optimizing a neural network with respect to a loss function

[Figure: input → neuron → output]
Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg. The real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex
Convex vs. Non-convex Optimization

Convex: unique optimum (global = local). Non-convex: multiple local optima.

In high dimensions, possibly exponentially many local optima.

How to deal with non-convexity?
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Training Neural Networks
Tremendous practical impact with deep learning.

Algorithm: backpropagation.

Highly non-convex optimization.
Toy Example: Failure of Backpropagation

[Figure: labeled input samples in (x1, x2) with classes y = 1 and y = −1. Goal: binary classification.]

[Figure: two-layer network — inputs x1, x2, hidden units σ(·), weights w1, w2, output y.]

Our method: guaranteed risk bounds for training neural networks.
Backpropagation vs. Our Method

Weights w2 randomly drawn and fixed.

[Figure: backprop (quadratic) loss surface over weights w1(1), w1(2).]

[Figure: loss surface for our method over weights w1(1), w1(2).]
Overcoming Hardness of Training
In general, training a neural network is NP-hard.
How does knowledge of input distribution help?
Generative vs. Discriminative Models
[Figure: generative model p(x, y) over input data x, for classes y = 0 and y = 1.]

[Figure: discriminative model p(y|x) over input data x, for classes y = 0 and y = 1.]

Generative models: encode domain knowledge.

Discriminative models: good classification performance.

A neural network is a discriminative model.

Do generative models help in discriminative tasks?
Feature Transformation for Training Neural Networks

Feature learning: learn φ(·) from input data.

How to use φ(·) to train neural networks? [Figure: x → φ(x) → y]

Multivariate moments: many possibilities, . . .

E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .
Tensor Notation for Higher Order Moments
Multi-variate higher order moments form tensors.
Are there spectral operations on tensors akin to PCA on matrices?
Matrix

E[x ⊗ y] ∈ R^{d×d} is a second-order tensor.

E[x ⊗ y]_{i1,i2} = E[x_{i1} y_{i2}].

For matrices: E[x ⊗ y] = E[x y⊤].

Tensor

E[x ⊗ x ⊗ y] ∈ R^{d×d×d} is a third-order tensor.

E[x ⊗ x ⊗ y]_{i1,i2,i3} = E[x_{i1} x_{i2} y_{i3}].

In general, E[φ(x) ⊗ y] is a tensor.

What class of φ(·) is useful for training neural networks?
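As a concrete illustration of these moment tensors, the empirical estimates of E[x ⊗ y] and E[x ⊗ x ⊗ y] can be formed in a few lines (a numpy sketch with synthetic data; the dimensions n and d are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                        # hypothetical sample count and dimension
x = rng.normal(size=(n, d))           # inputs x_i in R^d
y = rng.normal(size=(n, d))           # outputs y_i in R^d

# Empirical second-order cross-moment E[x ⊗ y] = E[x y^T]: a d×d matrix
M2 = np.einsum('ni,nj->ij', x, y) / n

# Empirical third-order cross-moment E[x ⊗ x ⊗ y]: a d×d×d tensor
# with entries M3[i1, i2, i3] ≈ E[x_{i1} x_{i2} y_{i3}]
M3 = np.einsum('ni,nj,nk->ijk', x, x, y) / n
```

Note M3 is symmetric in its first two modes, since both come from the same x.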
Score Function Transformations

Score function for x ∈ R^d with pdf p(·):

S1(x) := −∇_x log p(x)

mth-order score function:

Sm(x) := (−1)^m ∇^(m) p(x) / p(x)

Input: x ∈ R^d. Outputs: S1(x) ∈ R^d, S2(x) ∈ R^{d×d}, S3(x) ∈ R^{d×d×d}.
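For a concrete case, the first three score functions have closed forms when the input is standard Gaussian, x ~ N(0, I_d); a minimal numpy sketch (the function name is ours, not from the talk):

```python
import numpy as np

def score_functions_standard_normal(x):
    """Score functions S1, S2, S3 for x ~ N(0, I_d).

    Sm(x) := (-1)^m ∇^(m) p(x) / p(x); for the standard Gaussian these
    reduce to the Hermite-like tensors below.
    """
    d = x.shape[0]
    I = np.eye(d)
    S1 = x                                    # -∇ log p(x) = x
    S2 = np.outer(x, x) - I                   # x x^T - I
    xxx = np.einsum('i,j,k->ijk', x, x, x)    # x ⊗ x ⊗ x
    S3 = (xxx
          - np.einsum('i,jk->ijk', x, I)      # subtract the three
          - np.einsum('j,ik->ijk', x, I)      # symmetrized x ⊗ I terms
          - np.einsum('k,ij->ijk', x, I))
    return S1, S2, S3
```

For a general density, Sm must be estimated from the input distribution, which is where the generative model enters.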
Moments of a Neural Network

E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2

[Figure: one-hidden-layer network — inputs x1, x2, . . . , xd, k hidden units σ(·), first-layer weights A1, output weights a2, output y.]

Given labeled examples {(xi, yi)}:

E[y · Sm(x)] = E[∇^(m) f(x)]

⇓

M1 = E[y · S1(x)] = ∑_{j∈[k]} λ1,j · (A1)j

M2 = E[y · S2(x)] = ∑_{j∈[k]} λ2,j · (A1)j ⊗ (A1)j

M3 = E[y · S3(x)] = ∑_{j∈[k]} λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j

Why are tensors required?

Matrix decomposition recovers the subspace, not the actual weights.

Tensor decomposition uniquely recovers the weights under non-degeneracy.
Guaranteed learning of weights of first layer via tensor decomposition.
Learning the other parameters via a Fourier technique.
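The decomposition step can be sketched with the tensor power method (a minimal sketch assuming a symmetric, orthogonally decomposable tensor; the guaranteed algorithms in the cited papers add whitening, deflation, and noise robustness):

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """Recover one rank-1 component (lam, u) of a symmetric,
    orthogonally decomposable tensor T = sum_j lam_j u_j^{⊗3}."""
    d = T.shape[0]
    rng = np.random.default_rng(seed)
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = np.einsum('ijk,j,k->i', T, u, u)    # the map u -> T(I, u, u)
        u /= np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # eigenvalue T(u, u, u)
    return lam, u

# Hypothetical example: T = 3 e1^{⊗3} + 1 e2^{⊗3} in d = 3
d = 3
e1, e2 = np.eye(d)[0], np.eye(d)[1]
T = 3 * np.einsum('i,j,k->ijk', e1, e1, e1) \
    + np.einsum('i,j,k->ijk', e2, e2, e2)
lam, u = tensor_power_iteration(T)
```

Which component the iteration converges to depends on the random start; deflating (subtracting lam · u^{⊗3} and repeating) recovers the rest.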
NN-LiFT: Neural Network LearnIng using Feature Tensors

Input: x ∈ R^d → score function S3(x) ∈ R^{d×d×d}.

Estimating M3 using labeled data {(xi, yi)} via the cross-moment

(1/n) ∑_{i=1}^{n} yi ⊗ S3(xi)

CP tensor decomposition: rank-1 components are the estimates of the columns of A1.

Fourier technique ⇒ a2, b1, b2.
Estimation error bound

Guaranteed learning of weights of the first layer via tensor decomposition.

M3 = E[y ⊗ S3(x)] = ∑_{j∈[k]} λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j

Full column rank assumption on weight matrix A1.

Guaranteed tensor decomposition (AGHKT’14, AGJ’14).

Learning the other parameters via a Fourier technique.

Theorem (JSA’14)
With number of samples n = poly(d, k), we have w.h.p.
|f̂(x) − f(x)|² ≤ O(1/n).

“Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” by M. Janzamin, H. Sedghi and A., June 2015.
Our Main Result: Risk Bounds

Approximating an arbitrary function f(x) with bounded

C_f := ∫_{R^d} ‖ω‖₂ · |F(ω)| dω

n samples, d input dimension, k number of neurons.

Theorem (JSA’14)
Assume C_f is small. Then
E[|f̂(x) − f(x)|²] ≤ O(C_f²/k) + O(1/n).

Polynomial sample complexity n in terms of dimensions d, k.

Computational complexity same as SGD with enough parallel processors.

“Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” by M. Janzamin, H. Sedghi and A., June 2015.
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Tractable Learning for LVMs
[Figures: latent variable models — GMM; HMM with hidden states h1, h2, h3 and observations x1, x2, x3; ICA with sources h1, . . . , hk and observations x1, . . . , xd; multiview and topic models.]
At Scale Tensor Computations

Randomized Tensor Sketches

Naive computation scales exponentially in the order of the tensor.

Propose randomized FFT sketches.

Computational complexity independent of tensor order.

Linear scaling in input dimension and number of samples.

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

BLAS: Basic Linear Algebra Subprograms, highly optimized libraries.

Use extended BLAS to minimize data permutation and I/O calls.

(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A., NIPS 2015.
(2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A.
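The FFT sketching idea can be illustrated on a rank-1 tensor: the count sketch of u ⊗ v ⊗ w equals the circular convolution of per-mode count sketches, computable by FFT without ever forming the d³ tensor. A sketch with our own names and hypothetical sizes, checked against the naive computation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, b = 8, 32                       # input dim and sketch length (hypothetical)

# Independent count-sketch hashes and signs for each of the 3 modes
h = [rng.integers(0, b, size=d) for _ in range(3)]
s = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]

def mode_sketch(v, m):
    """Count sketch of vector v using the mode-m hash/signs."""
    out = np.zeros(b)
    np.add.at(out, h[m], s[m] * v)
    return out

def tensor_sketch_rank1(u, v, w):
    """Sketch of u ⊗ v ⊗ w via FFT: O(d + b log b), independent of d^3."""
    f = (np.fft.fft(mode_sketch(u, 0))
         * np.fft.fft(mode_sketch(v, 1))
         * np.fft.fft(mode_sketch(w, 2)))
    return np.real(np.fft.ifft(f))

u, v, w = rng.normal(size=(3, d))
sk_fast = tensor_sketch_rank1(u, v, w)

# Same sketch computed naively from the full d×d×d tensor
T = np.einsum('i,j,k->ijk', u, v, w)
sk_naive = np.zeros(b)
for i in range(d):
    for j in range(d):
        for k in range(d):
            idx = (h[0][i] + h[1][j] + h[2][k]) % b
            sk_naive[idx] += s[0][i] * s[1][j] * s[2][k] * T[i, j, k]
```

The two sketches agree exactly, since circular convolution of the mode sketches implements the combined hash (h1(i) + h2(j) + h3(k)) mod b.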
Preliminary Results on Spark

In-memory processing of Spark: ideal for iterative tensor methods.

Alternating Least Squares for Tensor Decomposition:

min_{λ,A,B,C} ‖T − ∑_{i=1}^{k} λ_i A(:, i) ⊗ B(:, i) ⊗ C(:, i)‖²_F

[Figure: tensor slices distributed across workers 1, . . . , k; factors B, C shared; rows of A updated independently.]
Results on NYTimes corpus (3 × 10⁵ documents, 10⁸ words):

Spark: 26 mins. Map-Reduce: 4 hrs.

Topic Modeling at Lightning Speeds via Tensor Factorization on Spark by F. Huang, A., under preparation.
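The ALS update that the workers parallelize can be sketched in a few lines (a numpy sketch with hypothetical names; one update of factor A given fixed B, C, with the weights λ absorbed into the columns of A):

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (d2*d3) x k."""
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, B.shape[1])

def als_step(T, B, C):
    """One ALS update for A in min ||T - sum_r A(:,r) ⊗ B(:,r) ⊗ C(:,r)||_F^2.

    Unfold T along mode 1, then solve the least-squares problem
    A = T1 · KR · (B^T B * C^T C)^+.  Each row of A is an independent
    least-squares problem, which is what lets workers update rows in parallel.
    """
    T1 = T.reshape(T.shape[0], -1)          # mode-1 unfolding: d1 x (d2*d3)
    KR = khatri_rao(B, C)                   # (d2*d3) x k
    G = (B.T @ B) * (C.T @ C)               # k x k Gram via Hadamard product
    return T1 @ KR @ np.linalg.pinv(G)

# Hypothetical rank-2 check: exact recovery when B, C are the true factors
rng = np.random.default_rng(0)
d, k = 4, 2
A, B, C = rng.normal(size=(3, d, k))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
A_hat = als_step(T, B, C)
```

The Hadamard-product identity KR^T KR = (B^T B) * (C^T C) avoids forming the large normal equations directly.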
Convolutional Tensor Decomposition
[Figure: (a) convolutional dictionary model x = ∑_i f*_i ∗ w*_i; (b) reformulated model x = F* w*.]

Cumulant = λ1 (F*₁)^{⊗3} + λ2 (F*₂)^{⊗3} + · · ·

Efficient methods for tensor decomposition with circulant constraints.

Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A., June 2015.
Reinforcement Learning (RL) of POMDPs

Partially observable Markov decision processes.

Proposed Method

Consider memoryless policies. Episodic learning: indirect exploration.

Tensor methods: careful conditioning required for learning.

First RL method for POMDPs with logarithmic regret bounds.

[Figure: POMDP graphical model — hidden states xi, xi+1, xi+2; observations yi, yi+1; rewards ri, ri+1; actions ai, ai+1.]

[Figure: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy.]
Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azizzadenesheli, A. Lazaric, A., under preparation.
Outline
1 Introduction
2 Guaranteed Training of Neural Networks
3 Overview of Other Results on Tensors
4 Conclusion
Summary and Outlook

Summary

Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.

First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!

Outlook

Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .

Unified framework for tractable non-convex methods with guaranteed convergence to global optima?
![Page 60: Beating the Perils of Non-Convexity: Machine Learning ......Learning with big data: ill-posed problem. LearningwithBigData Learning is finding needle in a haystack High dimensional](https://reader035.vdocument.in/reader035/viewer/2022071215/6045ecbdd8659f6be3126bb4/html5/thumbnails/60.jpg)
My Research Group and Resources
Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/