Non-convex Optimization for Machine
Learning: Design, Analysis, and
Understanding
Tengyu Ma
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
Adviser: Professor Sanjeev Arora
November 2017
© Copyright by Tengyu Ma, 2017.
All rights reserved.
Abstract
Non-convex optimization is ubiquitous in modern machine learning: recent break-
throughs in deep learning require optimizing non-convex training objective functions;
problems that admit accurate convex relaxation can often be solved more efficiently
with non-convex formulations. However, the theoretical understanding of non-convex
optimization remains rather limited. Can we extend the algorithmic frontier by efficiently optimizing a family of interesting non-convex functions? Can we successfully
apply non-convex optimization to machine learning problems with provable guaran-
tees? How do we interpret the complicated models in machine learning that demand
non-convex optimizers?
Towards addressing these questions, in this thesis we theoretically study various
machine learning models, including sparse coding, topic models, matrix completion,
linear dynamical systems, and word embeddings.
We first consider how to find a coarse solution to serve as a good starting point
for local improvement algorithms such as stochastic gradient descent. We propose ef-
ficient methods for sparse coding and topic inference with better provable guarantees.
Second, we propose a framework for analyzing local improvement algorithms that
start from a coarse solution. We apply it successfully to the sparse coding problem.
Then, we consider a family of non-convex functions satisfying that all local minima
are also global (and some additional regularity property). Such functions can be
optimized by local improvement algorithms efficiently from a random or arbitrary
starting point. The challenge that we address here, in turn, becomes proving that
an objective function belongs to this class. We establish such results for the natural
learning objectives of matrix completion and linear dynamical systems.
Finally, we make steps towards interpreting the non-linear models that require
non-convex training algorithms. We reflect on the principles of word embeddings in
natural language processing. We give a generative model for the texts, using which
we explain why different non-convex formulations such as word2vec and GloVe can
learn similar word embeddings with the surprising property that analogous words
have embeddings with similar differences.
Acknowledgements
First and foremost, I would like to thank my advisor, Professor Sanjeev Arora, for
all his advice, encouragement, inspiration, and support. Throughout the past five
years, he has been a never-ending source of wisdom and insight. I continuously learn
from his commitment to research, his fortitude in exploring the unknown, his taste in
research problems, and his comprehensive knowledge of theoretical computer science.
I always remember his constructive and considerate suggestions for decisions in
my life, and I've been greatly influenced by his philosophy. I couldn't have wished for a
better advisor.
I am also most thankful to Avi Wigderson, Benjamin Recht, Elad Hazan, Moritz
Hardt, David Steurer, Ankur Moitra, and Rong Ge for their guidance, encouragement,
and collaboration. I enjoyed very much discussing research with them, and I
learned various aspects of research from them, ranging from mathematical techniques
to high-level thinking, and from technical writing to speaking to general audiences. They have
had a considerable influence on the content of this thesis and generally on my work during
graduate school.
I owe many thanks to Moritz Hardt, Elad Hazan, Benjamin Recht, and Yoram
Singer, who influenced me significantly beyond research collaborations in the last two
years of my Ph.D. They kindly spent a lot of time helping me navigate the job
market without panic. The philosophical discussions with Elad and Sanjeev in the
kitchen on the fourth floor of the computer science department shaped my research taste.
Many thanks to Moritz Hardt and Yoram Singer for hosting me as an intern and
a visitor at Google in 2015 and 2016. The research discussions with Moritz, Ben,
Yoram, and other team members broadened my horizons and injected a more practical
perspective into my research.
I was very fortunate to collaborate with many brilliant researchers. I would like
to thank my coauthors Naman Agarwal, Zeyuan Allen-Zhu, Sanjeev Arora, Aditya
Bhaskara, Mark Braverman, Brian Bullins, Xi Chen, Dan Garber, Ankit Garg, Rong
Ge, Moritz Hardt, Elad Hazan, Frederic Koehler, Jason D. Lee, Yuanzhi Li, Yingyu
Liang, Qihang Lin, Ankur Moitra, Huy Nguyen, Benjamin Recht, Andrej Risteski,
Jonathan Shi, David Steurer, Xiaoming Sun, Bo Tang, Yajun Wang, Avi Wigderson,
David P. Woodruff, Tianbao Yang, Huacheng Yu, Yi Zhang, Jiawei Zhang, and Yuan
Zhou.
I would also like to thank the wonderful group of researchers at Princeton University
for creating such a great environment for machine learning, theoretical computer
science, statistics, and applied mathematics. At Princeton, within a ten-minute walk,
I was able to get thoughtful answers, comments, and feedback from world-class
experts on any question or idea. Thank you to all the staff of the Computer
Science Department, especially Ms. Melissa Lawson, Mitra Kelly, and Nicole Wagenblast,
for their administrative work.
Thanks to Elad Hazan, Mark Braverman, David Steurer, and Rong Ge for being
on my thesis committee. Thanks to Andrew Chi-Chih Yao for creating the fantastic
Yao’s special pilot class where I was an undergraduate student. Thanks to Xiaoming
Sun, Yajun Wang for advising my undergraduate research.
This thesis was supported in part by NSF grants CCF-1527371 and DMS-1317308,
ONR grant N00014-16-1-2329, a Simons Investigator Award, a Simons Collaboration
Grant, a Simons Award in Theoretical Computer Science, an IBM Ph.D. Fellowship,
a Simons-Berkeley Research Fellowship, a Siebel Scholarship, and a Princeton
Honorific Fellowship.
Furthermore, some of the work in this thesis was conducted while I was an intern at
Google and a fellow at the Simons Institute for the Theory of Computing. Thank
you all for your support.
Heartfelt thanks to all my friends at Princeton University, who made my time in
Princeton very enjoyable.
Finally, I would like to thank my family — my parents Qinglong Ma and Li Li,
and my wife Wenxin Xu for their love and support.
To Wenxin
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction 1
1.1 Analyzing Local Improvement Algorithms for Non-convex Optimization 2
1.1.1 Local Convergence Starting from Coarse Solutions . . . . . . . 3
1.1.2 Finding Coarse Solutions . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Global Convergence from Simple Initialization . . . . . . . . . 5
1.2 Interpreting Non-convex Objective Functions . . . . . . . . . . . . . . 6
1.3 Problems and Main Contributions . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Matrix Completion . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.4 Learning Linear Dynamical Systems . . . . . . . . . . . . . . . 11
1.4 Previously Published work . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
I Local Improvement Algorithms Starting from Coarse
Solutions 15
2 Finding Coarse Solutions 16
2.1 Spectral Initialization for Sparse Coding . . . . . . . . . . . . . . . . 16
2.1.1 Introduction and Problem Definition . . . . . . . . . . . . . . 16
2.1.2 Assumptions and Main Results . . . . . . . . . . . . . . . . . 19
2.1.3 Related Work and Notes . . . . . . . . . . . . . . . . . . . . . 21
2.1.4 The Spectral Algorithm and Key Observation . . . . . . . . . 22
2.1.5 Infinite Samples Case . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.6 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Convex Initialization for Topic Modeling Inference . . . . . . . . . . . 34
2.2.1 Introduction and Main Results . . . . . . . . . . . . . . . . . 35
2.2.2 Notations and Preliminaries . . . . . . . . . . . . . . . . . . . 40
2.2.3 δ-Biased Minimum Variance Estimators . . . . . . . . . . . . 41
2.2.4 Thresholded Linear Inverse Algorithm and its Guarantees . . . 45
2.3 Discussion: Special Initialization vs Trivial Initialization . . . . . . . 48
3 Local Convergence to a Global Minimum 50
3.1 Analysis Framework via Lyapunov Function . . . . . . . . . . . . . . 50
3.1.1 Generalization to Stochastic Updates . . . . . . . . . . . . . . 54
3.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.3 Limitation and Relation to Part II . . . . . . . . . . . . . . . 57
3.2 Analyzing Alternating Minimization Algorithm for Sparse Coding . . 58
3.2.1 Alternating Minimization for Sparse Coding . . . . . . . . . . 58
3.2.2 Applying the Framework to Analyzing Alternating Minimization 59
3.2.3 Algorithms and Main Results . . . . . . . . . . . . . . . . . . 61
3.3 Support Recovery Guarantees of Decoding . . . . . . . . . . . . . . . 64
3.4 Analysis Overview: Infinite Samples Setting . . . . . . . . . . . . . . 67
3.4.1 Making Progress at Each Iteration . . . . . . . . . . . . . . . 68
3.4.2 Maintaining Spectral Norm . . . . . . . . . . . . . . . . . . . 73
3.5 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 More Alternating Minimization Algorithms . . . . . . . . . . . . . . . 84
3.6.1 Analysis of a Variant of Olshausen-Field Update . . . . . . . . 84
3.6.2 Removing Systemic Error . . . . . . . . . . . . . . . . . . . . 88
II Global Convergence with Arbitrary Initialization 91
4 Analysis via Optimization Landscape 92
4.1 Local Optimality vs Global Optimality . . . . . . . . . . . . . . . . . 93
5 Matrix Completion 100
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Proof Strategy: “Simple” Proofs are More Generalizable . . . . . . . 106
5.3 Warm-up: Rank-1 Case . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.1 Handling Incoherent x . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Extension to General x . . . . . . . . . . . . . . . . . . . . . . 119
5.4 Rank-r Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.6 Finding the Exact Factorization . . . . . . . . . . . . . . . . . . . . . 134
5.7 Concentration Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 140
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6 Learning Linear Dynamical Systems 148
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.1.1 Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.2 Proper Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.3 The Power of Over-parameterization . . . . . . . . . . . . . . 154
6.1.4 Multi-input Multi-output Systems . . . . . . . . . . . . . . . . 155
6.1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.1.6 Proof Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.1.7 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.2 Population Risk in Frequency Domain . . . . . . . . . . . . . . . . . 160
6.2.1 Quasi-convexity of the Idealized Risk . . . . . . . . . . . . . . 162
6.2.2 Justifying Idealized Risk . . . . . . . . . . . . . . . . . . . . . 164
6.3 Effective Relaxations of Spectral Radius . . . . . . . . . . . . . . . . 165
6.3.1 Efficiently Computing the Projection . . . . . . . . . . . . . . 168
6.4 Learning Acquiescent Systems . . . . . . . . . . . . . . . . . . . . . . 169
6.5 The Power of Improper Learning . . . . . . . . . . . . . . . . . . . . 175
6.5.1 Instability of the Minimum Representation . . . . . . . . . . . 177
6.5.2 Power of Improper Learning in Various Cases . . . . . . . . . 178
6.5.3 Improper Learning Using Linear Regression . . . . . . . . . . 187
6.6 Learning Multi-input Multi-output (MIMO) Systems . . . . . . . . . 188
6.7 Technicalities: Mean and Variance of the Gradient Estimator . . . . . 191
6.8 Back-propagation Implementation . . . . . . . . . . . . . . . . . . . . 200
6.9 Projection to the Constraint Set . . . . . . . . . . . . . . . . . . . . . 201
III Interpreting Non-linear Models and Their Non-convex
Objective Functions 204
7 Understanding Word Embedding Methods Using Generative Mod-
els 205
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.1.2 Benefits of Generative Approaches . . . . . . . . . . . . . . . 209
7.2 Generative Model and Its Properties . . . . . . . . . . . . . . . . . . 210
7.2.1 Weakening the Model Assumptions . . . . . . . . . . . . . . . 218
7.3 Training objective and relationship to other models . . . . . . . . . . 219
7.4 Explaining relations=lines . . . . . . . . . . . . . . . . . . . . . . 223
7.5 Experimental Verification . . . . . . . . . . . . . . . . . . . . . . . . 228
7.5.1 Model Verification . . . . . . . . . . . . . . . . . . . . . . . . 228
7.5.2 Performance on Analogy Tasks . . . . . . . . . . . . . . . . . 231
7.5.3 Verifying relations=lines . . . . . . . . . . . . . . . . . . . 232
7.6 Proof of Main Theorems and Lemmas . . . . . . . . . . . . . . . . . . 234
7.6.1 Analyzing Partition Function Zc . . . . . . . . . . . . . . . . . 242
7.6.2 Helper Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . 247
7.7 Maximum Likelihood Estimator for Co-occurrence . . . . . . . . . . . 259
7.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
8 Mathematical Tools 264
8.1 Concentration Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 264
8.1.1 Hoeffding’s inequality and Bernstein’s inequalities . . . . . . . 264
8.1.2 Sub-Gaussian and Sub-exponential Random Variables . . . . . 267
8.2 Spectral Perturbation Theorems . . . . . . . . . . . . . . . . . . . . . 268
8.3 Auxiliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Chapter 1
Introduction
Non-convex optimization algorithms have been widely used in modern machine learn-
ing, especially deep learning. Can we design, analyze, and interpret non-convex
optimization algorithms in a principled way? This thesis aims to put non-convex
optimization on a more solid theoretical footing. We design and analyze non-convex
optimization algorithms for machine learning problems including sparse coding, topic
models, matrix completion, and learning linear dynamical systems, and we interpret the
effectiveness of different non-convex algorithms for learning word embeddings that
capture semantic information.
In this chapter, we present a general overview of the questions and techniques
discussed in this thesis. We give a brief survey of the existing approaches for designing
and analyzing local improvement algorithms in Section 1.1, and discuss the related
work that motivates us to provide the theoretical interpretation of the non-convex
algorithms for training non-linear models in Section 1.2. We define the concrete
problems that will be studied and the main contributions in Section 1.3. This brief
overview is intended to convey the flavor of the work contained herein. We will provide
additional background and motivation later in each chapter.
1.1 Analyzing Local Improvement Algorithms for
Non-convex Optimization
Finding a global minimizer of a non-convex optimization problem — or even just a
degree-4 polynomial — is NP-hard in the worst case [83]. In fact, it’s also NP-hard
to check whether a point is a local minimum or not [130].
Despite the intractability results, non-convex optimization is the main algorith-
mic technique behind many state-of-the-art machine learning and deep learning re-
sults. Local improvement heuristics such as stochastic gradient descent [36], Momen-
tum [141], Adagrad [58], RMSProp [166], and Adam [99], are simple, scalable and
easy to implement, and they surprisingly return high-quality solutions (if not global
minima) [54, 49].
Given the empirical success of non-convex optimization in machine learning and
artificial intelligence, it becomes increasingly important to have a fundamental un-
derstanding of non-convex optimization algorithms and to design faster ones. Part I
and Part II of this thesis aim to develop mathematical techniques to analyze the
non-convex optimization algorithms for various machine learning problems.
We note that such analysis techniques have to be aware of the special properties
of the optimization problems, which, in machine learning, depend on the properties
of the data, the model, and the loss function. In this thesis, we mostly formulate
the problem by assuming the data are generated from some unknown and realistic
parametric distributions, and therefore work with average-case algorithm analysis.
Furthermore, the interface between users and optimizers in the context of machine
learning can be more relaxed. The optimizers can specialize in a restricted family of
objective functions, instead of addressing all differentiable functions, which would be
difficult or even impossible; the optimizers should also be allowed to potentially ask
the users to change the model parameterization, loss function, and regularization to
make the objective function easier. In Section 1.1.3, and in more detail in Part II, we
will discuss a family of functions with certain landscape properties that allow efficient
optimization, and show that the objectives for several machine learning problems
belong to this family, and demonstrate theoretically that the choice of the regularizers
and over-parameterization of the models can make the landscape easier for optimizers.
There are two dominating paradigms to analyze and design non-convex optimiza-
tion algorithms: a) finding a coarse approximate solution first, followed by local
improvement algorithms, and b) running local improvement algorithms from a trivial
initialization. Paradigm a) allows us to deploy very precise mathematical tools for
analysis and often gives strong theoretical guarantees, whereas paradigm b) is more
popular in practice because it can be applied simply to problems for which we don’t
know how to find coarse solutions without local improvement algorithms. This thesis
contributes to the development of both these two paradigms, as outlined below.
1.1.1 Local Convergence Starting from Coarse Solutions
For any sufficiently smooth objective function f and its global minimum x*, there
exists a neighborhood N of x* in which the function is convex. Therefore, starting
from an initializer x0 in this neighborhood N, local improvement algorithms such
as first-order or second-order convex optimization algorithms converge to the global
minimum x*. However, the size of such a neighborhood N depends on the structure of
the specific problem; it can often be very small and scale inverse polynomially in
the dimension.
On the other hand, the true basin of attraction of the global minimum x*, defined
as the set of points x0 from which local improvement algorithms can converge
to x*, should be much larger than what we can prove from the argument above.
Various analysis frameworks have been developed to achieve a sharper analysis of the
basin of attraction [43, 20, 93, 72, 75, 164]. In most of these cases, the local improvement
algorithms can provably start from a neighborhood of the global minimum with a
radius that doesn't depend on the dimensionality.
In Chapter 3 of this thesis, we will present a framework that was previously
published in [14]. The key idea behind most of these analysis frameworks is to design
a Lyapunov function that measures the distance from the current iterate xk to
the global minimum x*, and to show that it decreases monotonically to zero. Our
framework contains most, if not all, of the others as special cases.
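Schematically, such a framework posits a potential (Lyapunov) function V and proves a per-iteration contraction; the following is a simplified sketch, not the precise conditions of [14]:

```latex
V(x_{k+1}) \le (1 - \delta)\, V(x_k) \quad \text{for some fixed } \delta \in (0,1)
\;\;\Longrightarrow\;\; V(x_k) \le (1-\delta)^k\, V(x_0) \longrightarrow 0 .
```

When V dominates a distance to the global minimum, e.g. V(x) ≥ c‖x − x*‖², this contraction yields geometric convergence of the iterates xk to x*.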
The strength of this local convergence analysis is that it often allows very precise
tools to reason about running time, sample complexity, etc. (see, e.g., [48]), partly
due to its similarity to well-understood convex optimization. However, it is limited
to the situations where we already have a fast algorithm to obtain a coarse solution,
which we will discuss in the next subsection.
1.1.2 Finding Coarse Solutions
Coarse approximate solutions for non-convex optimization problems, especially in un-
supervised learning, can often be obtained by leveraging the data distributions using
spectral methods, combinatorial optimization, and convex optimization. The coarse
estimate may be far from statistically optimal, but the local improvement algorithms
on non-convex objectives that follow can find a more accurate solution. For example,
for the matrix completion problem (formally defined in Section 1.3), a simple
singular value decomposition (SVD) of the data matrix can recover a coarse estimate
of the underlying matrix. Such coarse solutions can often be good enough to enter
the basin of attraction of a global minimum of the non-convex objective,
as is the case for matrix completion [93, 75, 163], matrix sensing [168],
phase retrieval [43], topic models [8, 12], noisy-or networks [15], dictionary learn-
ing [10, 11, 14].
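To illustrate the SVD-based initialization for matrix completion, here is a hedged sketch on synthetic data; the 1/p rescaling and the synthetic low-rank model are illustrative assumptions of this toy setup, not the exact estimator analyzed in later chapters.

```python
import numpy as np

def svd_initialization(M_obs, mask, r):
    # Rescaling the zero-filled matrix by 1/p makes it an unbiased
    # estimate of M; truncating its SVD to rank r removes most of the noise.
    p = mask.mean()                       # empirical observation probability
    U, s, Vt = np.linalg.svd(M_obs / p)
    return (U[:, :r] * s[:r]) @ Vt[:r]    # best rank-r approximation

rng = np.random.default_rng(0)
d, r = 200, 3
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-r target
mask = rng.random((d, d)) < 0.7           # observe each entry w.p. ~0.7
M_hat = svd_initialization(M * mask, mask, r)
rel_err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
```

The estimate is coarse (a constant relative error, not statistically optimal), which is precisely why it is followed by local improvement.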
In this thesis we will present two results along this line of research: a fast (coarse)
inference algorithm for topic models based on convex optimization (Section 2.2) and
a fast learning algorithm for dictionary learning based on spectral techniques (Sec-
tion 2.1). The key challenge in this direction is to solve models with prominent
non-linearities. The recent work of Arora et al. [15] addresses the non-linearity in
noisy-or networks via non-linear moment methods, but it’s beyond the scope of this
thesis.
1.1.3 Global Convergence from Simple Initialization
One of the largest families of non-convex functions for which we can optimize provably
efficiently is the set of functions without bad local minima, which are often called
strict-saddle [67] or ridable functions [162]. Loosely speaking, if all local minima
of a function are also global minima and the function, in addition, satisfies some
mild regularity conditions, then many local improvement algorithms can find a global
minimum efficiently [134, 162, 138, 67, 3, 46, 106]. It can be easily seen that convex
functions form a subset of this family.
Many interesting non-convex objective functions commonly used in machine learning
fall into this class. This is well known for the eigenvector problems (PCA/SVD)
[21, 157]. Recent work has established such properties for objective functions for
SVD/PCA, phase retrieval/synchronization, orthogonal tensor decomposition, matrix
decomposition, matrix sensing, learning linear dynamical systems, learning one-layer
hidden neural nets [67, 162, 22, 69, 74, 30, 148].
In this thesis, we will mostly focus on the analysis of the landscape for matrix
completion (Chapter 5) and learning linear dynamical systems (Chapter 6).
The optimization landscape properties have also been investigated on simplified
neural networks models. Kawaguchi [96] shows that the landscape of deep neural nets
doesn’t have bad local minima but has degenerate saddle points. Hardt and Ma [73]
show that re-parametrization using identity connection as in residual networks [80]
can remove the degenerate saddle points in the optimization landscape of deep linear
residual networks. Soudry and Carmon [155] show that an over-parameterized
neural network doesn't have bad differentiable local minima. Hardt et al. [74] analyze
the power of over-parameterization in a linear recurrent network (which is equivalent
to a linear dynamical system). Ge, Lee, and Ma [148] learn (a subset of) one-hidden-layer
neural networks by designing a new objective function that has
no spurious local minima and whose global minima recover the weights of the networks.
1.2 Interpreting Non-convex Objective Functions
Another intriguing question is how we can interpret the resulting solutions obtained
from optimizing a non-convex objective for a non-linear model. A particularly in-
teresting example is the word embeddings in the context of natural language pro-
cessing [26, 53, 124, 139]. Recently, researchers have discovered an efficient way of
assigning every word a vector in Euclidean space, which captures the semantic
information in the words. The striking property of the embeddings is that analogous
pairs of words have embeddings with similar differences. Moreover, such results can
be achieved by various methods, including simple recurrent neural networks
as in word2vec [124] or non-linear matrix factorization models like GloVe and the PMI
method [50, 109, 139]. Despite their successful applications, it remains unclear why
these methods result in such vectors without any particular mechanism that targets
this property.
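The analogy property can be illustrated with hand-constructed toy vectors; the embeddings below are hypothetical, built purely for illustration, whereas real word2vec/GloVe vectors are learned from corpora.

```python
import numpy as np

# Toy embeddings: one "gender" direction g and one "royalty" direction roy,
# so that emb["queen"] - emb["king"] == emb["woman"] - emb["man"].
g = np.array([1.0, 0.0, 0.0])
roy = np.array([0.0, 1.0, 0.0])
emb = {
    "man":   -g,
    "woman":  g,
    "king":  -g + roy,
    "queen":  g + roy,
    "apple": np.array([0.0, 0.0, 1.0]),  # an unrelated word
}

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? by the vector offset emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # return the word (other than a, b, c) closest in cosine similarity
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], target))

best = analogy("man", "woman", "king", emb)  # -> "queen"
```

The same vector-offset rule is what the analogy experiments in Chapter 7 evaluate on learned embeddings.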
In Chapter 7, we will show that the non-convex objective functions of word2vec,
GloVe, and PMI, in fact, correspond to different ways of learning a generative model
of the texts that we propose, a dynamic version of the log-linear topic model of [127].
GloVe and PMI correspond to moment methods for learning this generative model
and word2vec corresponds to an expectation-maximization algorithm for learning the
model. It also helps explain why low-dimensional semantic embeddings contain linear
algebraic structure that allows solving word analogies, as shown by [122] and
many subsequent papers.
Another new explanatory feature of our model is that the low dimensionality of word
embeddings plays a key theoretical role, unlike in previous papers, where the model
is agnostic about the dimension of the embeddings and the superiority of low-dimensional
embeddings is an empirical finding (starting with [56]). Specifically,
our theoretical analysis makes the key assumption that the set of all word vectors
is spatially isotropic, meaning that they have no preferred direction in space.
We will show that the low-rank fitting of the log co-occurrence matrix has a certain
denoising effect when the dimensionality of the word vectors is much smaller than
the vocabulary size. The theory in Chapter 7 also inspired the sense embeddings and
sentence embeddings developed in [18, 17].
1.3 Problems and Main Contributions
In this section, we define the machine learning problems that are studied in this
thesis and summarize the main results.
1.3.1 Sparse Coding
Sparse coding, also called dictionary learning, is a latent variable model for the
distribution of the observed data. A basic latent variable model describes the data
distribution p(y) by p(y) = ∫ pθ(y | x) pα(x) dx, where x is an unobserved latent variable
and θ and α are parameters that govern the conditional distribution and the
distribution of x, respectively. We are given n examples y(1), . . . , y(n) from the distribution
p(y), and the task is to learn the model parameter θ. Sometimes, in addition, we would
like to recover the parameter α and the posterior distribution p(x | y = y(j)) for each
example.
In sparse coding or dictionary learning, the latent variable x is a random sparse
vector in Rm. Conditioned on x, the data point y is
generated via a fixed dictionary A ∈ Rd×m by

y = Ax + ξ,
where ξ is a noise vector which is often assumed to be Gaussian. Thus this gives an
implicit definition of the distribution pA(y | x). Let A1, . . . , Am be the columns of
the matrix A, which are often called dictionary atoms. In words, we would like to
decompose the observed vector y into a sparse combination of the dictionary atoms
A1, . . . , Am.
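The generative model above can be sketched as follows; the dimensions, sparsity level, and Gaussian noise scale are illustrative assumptions of this toy sketch, not parameters from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 64, 256, 5                 # data dimension, # atoms, sparsity of x

A = rng.standard_normal((d, m))
A /= np.linalg.norm(A, axis=0)       # unit-norm dictionary atoms A_1, ..., A_m

def sample(n, noise_std=0.01):
    """Draw n samples y = A x + xi, where x is k-sparse."""
    Y = np.empty((n, d))
    for i in range(n):
        x = np.zeros(m)
        support = rng.choice(m, size=k, replace=False)  # random support of x
        x[support] = rng.standard_normal(k)             # random magnitudes
        Y[i] = A @ x + noise_std * rng.standard_normal(d)  # y = Ax + noise
    return Y

Y = sample(100)
```

The learning algorithms in Chapters 2 and 3 receive only Y and must recover A (up to permutation and sign flips of the atoms).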
Sparse coding was originally formulated by neuroscientists Olshausen and
Field [135] for the study of human visual systems. They gave experimental evidence
that it produces coding matrices for image patches that resemble known features
(such as Gabor filters) in the V1 portion of the visual cortex. It has been widely used
in computer vision and image processing such as segmentation, retrieval, de-noising
and super-resolution (see references in [61] and more discussions in Chapter 2).
In Section 2.1 of Chapter 2, under realistic assumptions on the true dictionary ma-
trix and the distribution of the latent variables, we give algorithms based on spectral
methods that provably return a coarse solution that approximates each dictionary
atom (up to permutation and sign flip) up to o(1) relative error in Euclidean distance.
In Chapter 3, we design and analyze various alternating minimization algorithms that
start from the coarse solutions obtained from the spectral algorithms, which provably
converge to much more accurate solutions in polynomial time.
1.3.2 Topic Models
Topic models [32] are latent variable models for the distribution of the bag of words
representation of the documents: Suppose we have a vocabulary of D words. Each
document is viewed as a collection of unordered words and can therefore be represented
as a vector in RD, with each entry being the frequency of the corresponding
word in the document. The document is assumed to be generated by the following
process. There are k topics, each of which is a distribution over words with a corresponding
probability vector Ai ∈ RD; thus Ai takes nonnegative values and sums
to 1. Each word of the document is assumed to be generated by first picking a topic i
from a distribution x ∈ Rk over the topics, and then picking a word according to the
corresponding distribution Ai. A simple calculation shows that this is equivalent to
assuming that each of the words in the document is drawn i.i.d. from the distribution
Ax, where A = [A1, . . . , Ak] is often called the word-topic matrix.
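The generative process above can be sketched as follows; the Dirichlet priors, vocabulary size, and document length are illustrative assumptions of this sketch, not part of the model definition.

```python
import numpy as np

rng = np.random.default_rng(2)
D, k, doc_len = 1000, 20, 300        # vocabulary size, # topics, words per doc

# Word-topic matrix A: each column A_i is a distribution over the D words.
A = rng.dirichlet(np.full(D, 0.5), size=k).T          # shape (D, k)

def sample_document():
    x = rng.dirichlet(np.full(k, 0.5))                # topic proportions
    word_dist = A @ x                                 # each word i.i.d. ~ Ax
    return rng.choice(D, size=doc_len, p=word_dist)   # word indices

doc = sample_document()
```

Learning recovers A from many such documents; inference recovers x for a single document given A.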
The learning problem here is to recover the word-topic matrix. Researchers have
developed various techniques for this problem, including singular value decomposition [56],
variational inference [33], MCMC [71], tensor decomposition [8], and anchor-word
algorithms [12].
However, there has been comparatively less progress on designing algorithms
with provable guarantees for the inference problem for topic models — given a doc-
ument y and the word-topic matrix A, how do we infer the topic proportion vector
x that was used to generate the document? The result in Section 2.2 takes a step
in this direction by providing convex optimization based algorithms for estimating
the topic proportion vector x. The algorithm is very simple and fast because it only
uses a (carefully chosen) linear transformation plus thresholding, but it only returns
coarse solutions that are not statistically optimal. The solutions can be refined by
running gradient ascent on maximum likelihood estimators.
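The inference step can be sketched as a linear map followed by thresholding; the matrix B below is a hypothetical linear inverse of A (the careful construction of B is the subject of Section 2.2), and the threshold level is illustrative.

```python
import numpy as np

def thresholded_linear_inverse(y, B, tau):
    """Coarse topic-proportion estimate: linear transformation, then threshold.

    B is a (hypothetical) carefully chosen linear inverse of the word-topic
    matrix A; tau is the threshold that zeroes out small, noisy coordinates.
    """
    z = B @ y                          # linear transformation of the document
    z = np.where(z >= tau, z, 0.0)     # thresholding
    s = z.sum()
    return z / s if s > 0 else z       # renormalize to a distribution

# Noiseless illustration with B taken to be the pseudoinverse of A:
rng = np.random.default_rng(4)
D, k = 50, 5
A = rng.dirichlet(np.full(D, 0.5), size=k).T   # word-topic matrix, (D, k)
x_true = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # true topic proportions
y = A @ x_true                                 # noiseless word frequencies
x_hat = thresholded_linear_inverse(y, np.linalg.pinv(A), tau=0.1)
```

With finite-length documents y is a noisy estimate of Ax, which is why the choice of B and tau requires the careful analysis of Section 2.2.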
1.3.3 Matrix Completion
Matrix completion is the problem of recovering a low-rank matrix from partially
observed entries. It has been widely used in collaborative filtering and recommender
systems [102, 146], dimension reduction [42] and multi-class learning [5].
The simplest setting of matrix completion is the following: Let M ∈ Rd×d be
the target matrix that we aim to recover. We assume that it has rank r ≪ d.
We assume that we observe the values of a set of entries of the matrix, denoted
by Ω = {(i, j) : Mi,j is observed}. Here Ω is often assumed to come from some
distribution, e.g., the uniform distribution over subsets of entries of a fixed size.
Our goal is to recover the underlying matrix M from these observations with as few
observations as possible.
There has been extensive work on designing efficient algorithms for matrix com-
pletion with guarantees. One earlier line of results (see [157, 159, 145, 45, 44] and
the references therein) rely on convex relaxations. These algorithms achieve strong
statistical guarantees but are relatively computationally expensive in practice.
In Chapter 5 of this thesis, we prove that the commonly used non-convex objective
function for positive semidefinite matrix completion has no spurious local minima —
all local minima must also be global. Therefore, many popular optimization algo-
rithms such as stochastic gradient descent can provably solve positive semidefinite
matrix completion with arbitrary initialization in polynomial time. The result can be
generalized to the setting when the observed entries contain noise. We believe that
our main proof strategy can be useful for understanding geometric properties of other
statistical problems involving partial or noisy observations. The result is built upon
recent progress for analyzing local improvement algorithms from good starting point
for matrix completion [97, 98, 93, 72, 75, 163, 176, 47, 151, 47]. See Chapter 5 for
more related work.
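To make the claim concrete, here is a minimal numpy sketch of the non-convex approach: parametrize the positive semidefinite matrix as XX⊤ with X ∈ Rd×r and run gradient descent on the squared error over the observed entries, starting from an arbitrary random initialization. The step size, iteration count, and observation probability are illustrative choices, not the setting analyzed in Chapter 5.

```python
import numpy as np

def grad_step(X, M, mask, lr):
    """One gradient step on f(X) = sum_{(i,j) in Omega} ((X X^T)_{ij} - M_{ij})^2."""
    R = mask * (X @ X.T - M)           # residual, restricted to observed entries
    return X - lr * 2.0 * (R + R.T) @ X

rng = np.random.default_rng(1)
d, r = 20, 2
Z = rng.standard_normal((d, r))
M = Z @ Z.T                            # the rank-r PSD target matrix
mask = (rng.random((d, d)) < 0.5).astype(float)
mask = np.maximum(mask, mask.T)        # a symmetric observation pattern Omega

X = 0.1 * rng.standard_normal((d, r))  # arbitrary (random) initialization
err0 = np.linalg.norm(mask * (X @ X.T - M)) / np.linalg.norm(mask * M)
for _ in range(1000):
    X = grad_step(X, M, mask, lr=0.005)
err = np.linalg.norm(mask * (X @ X.T - M)) / np.linalg.norm(mask * M)
```

In this toy run the relative error on the observed entries drops far below its initial value, consistent with the no-spurious-local-minima phenomenon described above.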
1.3.4 Learning Linear Dynamical Systems
As the name suggests, the problem of learning linear dynamical systems is a super-
vised learning problem that aims to estimate the underlying dynamical system that
maps a sequence of inputs x1, . . . , xT to a sequence of outputs y1, . . . , yT . Part of our
motivation for studying this problem comes from the desire to better understand
the optimization issues in sequence-to-sequence learning models such as recurrent neu-
ral networks or long short-term memory networks. If we remove all non-linear state
transitions from a recurrent neural network, we are left with the state transition
representation of a linear dynamical system.
To be sure, linear dynamical systems are also very important in their own right
and have been studied for many decades independently of machine learning within the
control theory community [114] and the learning problem in this context corresponds
to “linear dynamical system identification”. In the context of machine learning, linear
systems play an important role in numerous tasks. For example, their estimation
arises as subroutines of reinforcement learning in robotics [108], location and mapping
estimation in robotic systems [59], and estimation of pose from video [143].
More formally, we receive noisy observations generated by the following time-
invariant linear system:
ht+1 = Aht +Bxt
yt = Cht +Dxt + ξt
Here, A,B,C,D are linear transformations with compatible dimensions and we denote
by Θ = (A,B,C,D) the parameters of the system. The vector ht represents the
hidden state of the model at time t. Its dimension n is called the order of the system.
The stochastic noise variables ξt perturb the output of the system which is why the
model is called an output error model in control theory. We assume the variables are
drawn i.i.d. from a distribution with mean 0 and variance σ2.
We assume we have N pairs of sequences (x, y) as training examples,

S = {(x(1), y(1)), . . . , (x(N), y(N))}.

Each input sequence x ∈ RT of length T is drawn from a distribution and y is the
corresponding output of the system above generated from an unknown initial state
h. We allow the unknown initial state to vary from one input sequence to the next.
This only makes the learning problem more challenging.
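For concreteness, the data-generating process above can be sketched as follows, here with scalar inputs and outputs and illustrative dimensions (this is not the experimental setup of Chapter 6):

```python
import numpy as np

def generate_sequence(A, B, C, D, x, sigma, h0, rng):
    """Roll out h_{t+1} = A h_t + B x_t,  y_t = C h_t + D x_t + xi_t."""
    h = h0.copy()
    ys = []
    for xt in x:
        noise = sigma * rng.standard_normal()   # output noise xi_t
        ys.append(float(C @ h + D * xt + noise))
        h = A @ h + B * xt                      # hidden-state transition
    return np.array(ys)

rng = np.random.default_rng(2)
n, T = 3, 20                          # order of the system and sequence length
A = 0.5 * np.eye(n)                   # an (illustrative) stable transition matrix
B = rng.standard_normal(n)
C = rng.standard_normal(n)
D = 0.3
x = rng.standard_normal(T)            # input sequence drawn from a distribution
h0 = rng.standard_normal(n)           # unknown initial state, varying per sequence
y = generate_sequence(A, B, C, D, x, sigma=0.1, h0=h0, rng=rng)
```

Note that only (x, y) pairs would be available to the learner; A, B, C, D and the initial state h0 are hidden.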
In Chapter 6, we show that under structural assumptions on the input distribution
and the ground-truth parameters, stochastic gradient descent efficiently minimizes the
maximum likelihood objective of an unknown linear system given noisy observations
generated by the system. We also show that over-parameterization of the model can
relax many assumptions on the ground-truth parameters significantly by making the
landscape of the objective function easier for optimizers.
1.4 Previously Published Work
Several portions of this thesis are based on previously published joint work with
collaborators, which I will describe briefly below.
The material presented in Section 2.1 of Chapter 2 and Chapter 3 is based on the
joint paper [14] with Sanjeev Arora, Rong Ge and Ankur Moitra, previously published
in COLT 2015. Section 2.2 of Chapter 2 is based on the joint work [13] with Sanjeev
Arora, Rong Ge, Frederic Koehler, and Ankur Moitra, a preliminary version of which
is published in ICML 2016.
Chapter 5 contains material based on the joint work [69] with Rong Ge and Jason
D. Lee, which is published in NIPS 2016. Chapter 6 is based on the joint paper with
Moritz Hardt and Benjamin Recht [74] that will appear in JMLR. Chapter 7 is based
on the joint work [16] with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej
Risteski, which has been published in TACL.
1.5 Notations
Before proceeding to the thesis, we define some notations that are often used. Other
notations and terminologies will be defined when they first occur.
We use R to denote the real numbers, C to denote the complex numbers, and N =
{0, 1, . . . } to denote the natural numbers. Let [m] be a shorthand for {1, . . . ,m}.
Unless explicitly stated otherwise, O(·)-notation hides absolute multiplicative con-
stants. Concretely, every occurrence of O(x) is a placeholder for some function f(x)
that satisfies ∀x ∈ R, |f(x)| ≤ C|x| for some absolute constant C > 0. Similarly,
Ω(x) is a placeholder for a function g(x) that satisfies ∀x ∈ R, |g(x)| ≥ |x|/C for
some absolute constant C > 0. The notations ≲ and ≳ also hide absolute multiplicative
constants: a ≲ b means that there exists a universal constant C > 0 such that a ≤ Cb.
Throughout, we will use ‖·‖ to denote the Euclidean norm of a vector and the spectral
norm of a matrix. Let ‖·‖F denote the Frobenius norm of a matrix. For a matrix
A, let |A|∞ = max |Aij| be the infinity norm of the vectorized version of A. Let
|A|p→q = max‖x‖p=1 ‖Ax‖q be the ℓp → ℓq induced norm. Let tr(A) be the trace of a
square matrix A. For two matrices A and B with the same dimensions, let their inner
product be ⟨A,B⟩ = tr(A⊤B).
For a square matrix A, let λmax(A) and λmin(A) denote the largest and smallest
eigenvalues respectively. For a matrix A ∈ Rm×n, let σmax(A) be its largest
singular value and σmin(A) be its min{m,n}-th largest singular value. Let Idd denote
the identity matrix of dimension d × d; we omit the subscript when it is clear
from the context.
For symmetric matrices A and B, let A ⪰ B mean that A − B is positive semidef-
inite, and let A ⪯ B mean that B − A is positive semidefinite.
Let 1(·) denote the indicator function: for an event E, we have that 1(E) = 1 if
E happens and 1(E) = 0 otherwise. Let δij be a shorthand for 1(i = j).
Part I
Local Improvement Algorithms
Starting from Coarse Solutions
Chapter 2
Finding Coarse Solutions
In this chapter, we discuss two algorithms that give coarse solutions for the sparse
coding problem and the topic inference problem. The algorithm for sparse coding uses
singular value decomposition (SVD) on a weighted moment matrix of the data,
whereas the algorithm for topic model inference is simply a carefully chosen linear
transformation followed by thresholding. The local improvement algorithms can start
from these coarse solutions and converge to more accurate solutions (see Chapter 3).
2.1 Spectral Initialization for Sparse Coding
2.1.1 Introduction and Problem Definition
Sparse coding or dictionary learning consists of learning to express (i.e., code) a set of
input vectors, say image patches, as linear combinations of a small number of vectors
chosen from a large dictionary. It is a basic task in many fields. In signal processing,
a wide variety of signals turn out to be sparse in an appropriately chosen basis (see
references in [118]). In neuroscience, sparse representations are believed to improve
the energy efficiency of the brain by allowing most neurons to be inactive at any given
time. In machine learning, imposing sparsity as a constraint on the representation
is a useful way to avoid over-fitting. Additionally, methods for sparse coding can
be thought of as a tool for feature extraction and are the basis for a number of
important tasks in image processing such as segmentation, retrieval, de-noising and
super-resolution (see references in [61]), as well as a building block for some deep
learning architectures [144]. It is also a basic problem in linear algebra itself since it
involves finding a better basis.
The notion was introduced by neuroscientists Olshausen and Field [135], who for-
malized it as follows: given a dataset y(1), y(2), . . . , y(N) ∈ Rd, our goal is to find a set
of basis vectors A1, A2, . . . , Ar ∈ Rd and sparse coefficient vectors x(1), x(2), . . . , x(N) ∈
Rr that minimize the reconstruction error

∑_{i=1}^{N} ‖y(i) − A · x(i)‖₂² + ∑_{i=1}^{N} ρ(x(i))  (2.1.1)
where A is the d × r coding matrix whose j-th column is Aj and ρ(·) is a nonlin-
ear penalty function used to encourage sparsity. This objective is nonconvex
because both A and the x(i)'s are unknown. Their paper, as well as subsequent work,
chooses r to be larger than d (the so-called overcomplete case) because this allows greater
flexibility in adapting the representation to the data. We remark that sparse coding
should not be confused with the related — and usually easier — problem of finding
the sparse representations of the y(i)’s given the coding matrix A, variously called
compressed sensing or sparse recovery [40, 41].
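As an illustration, objective (2.1.1) can be written down directly. The formulation above leaves the penalty ρ unspecified; here we take the common ℓ1 choice ρ(x) = λ‖x‖₁ as one example.

```python
import numpy as np

def sparse_coding_objective(A, X, Y, lam=0.1):
    """Objective (2.1.1) with the example choice rho(x) = lam * ||x||_1.

    Y is d x N with columns y^(i); A is the d x r dictionary;
    X is r x N with columns x^(i). The objective is nonconvex in (A, X)
    jointly, although it is convex in each factor separately.
    """
    reconstruction = np.sum((Y - A @ X) ** 2)   # sum_i ||y^(i) - A x^(i)||_2^2
    penalty = lam * np.abs(X).sum()             # sum_i rho(x^(i))
    return reconstruction + penalty

rng = np.random.default_rng(0)
d, r, N = 10, 15, 5                             # overcomplete: r > d (illustrative)
A = rng.standard_normal((d, r))
X = rng.standard_normal((r, N))
Y = A @ X                                       # perfectly reconstructable data
value = sparse_coding_objective(A, X, Y)        # only the penalty term remains
```

The joint nonconvexity in (A, X) is exactly why local search on (2.1.1) needs the careful initialization developed in this section.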
Olshausen and Field also gave a local search/gradient descent heuristic for trying
to minimize the nonconvex energy function (2.1.1). They gave experimental evidence
that it produces coding matrices for image patches that resemble known features
(such as Gabor filters) in the V1 portion of the visual cortex.
There is a large gap between theory and practice in terms of how to initialize local
search algorithms to optimize objective (2.1.1). The usual approach is to set A randomly or
to populate its columns with samples.1 These often work, but we do not know how
to analyze them.
The main contribution of this section is a novel method for initializing
the local search algorithm. The initialization is guaranteed to give an estimate of the
dictionary which is o(1)-close in relative column-wise error. Such an initialization
suffices for us to use the local search algorithm in Section 3.2.3. For the analysis of the
local search algorithm from this initialization, we refer the reader to Chapter 3.
Our algorithm and analysis are based on the generative model proposed in [136]
(and also [112]), which places sparse coding in a more familiar probabilistic setting
whereby the data points y(i) are assumed to be probabilistically generated according
to a model y(i) = A∗ · x∗(i) + noise, where x∗(1), x∗(2), . . . , x∗(N) are samples from some
appropriate distribution and A∗ is an unknown code. We formalize this model below:
Generative model: We assume that each example is generated as y = A∗x∗ + ξ,
where A∗ is a ground-truth dictionary and x∗ is drawn from an unknown distribution
D such that, with probability 1, the support S = supp(x∗) has size at most
k.
We are given N examples y(1), . . . , y(N) generated as above. We use x∗(1), . . . , x∗(N)
and ξ(1), . . . , ξ(N) to denote the underlying coefficients and noise vectors that generate
the examples respectively. The goal is to estimate the ground-truth dictionary A∗ and
the coefficients x∗(1), . . . , x∗(N) from the examples as accurately as possible.
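One concrete distribution satisfying this informal description (the formal assumptions appear in Section 2.1.2) is a uniformly random size-k support with independent ±1 non-zero entries. A sampler for this special case might look like this:

```python
import numpy as np

def sample_example(A_star, k, sigma, rng):
    """Draw one example y = A* x* + xi.

    The support of x* is a uniformly random set of size k and the non-zero
    entries are independent +-1 signs, so E[x*_i | S] = 0, E[(x*_i)^2 | S] = 1,
    and |x*_i| is bounded away from zero on the support.
    """
    d, r = A_star.shape
    x = np.zeros(r)
    S = rng.choice(r, size=k, replace=False)             # the support of x*
    x[S] = rng.choice([-1.0, 1.0], size=k)
    xi = (sigma / np.sqrt(d)) * rng.standard_normal(d)   # spherical Gaussian noise
    return A_star @ x + xi, x

rng = np.random.default_rng(0)
d, r, k = 20, 30, 3
A_star = rng.standard_normal((d, r))
A_star /= np.linalg.norm(A_star, axis=0)     # unit-norm columns
y, x = sample_example(A_star, k, sigma=0.5, rng=rng)
```

The learner sees only the examples y; both A∗ and the coefficients x∗ are hidden.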
Outline of the rest of the section: In Section 2.1.2, we state the main assumptions
and results of this section. Section 2.1.3 summarizes the related work. Section 2.1.4
states the algorithm formally and provides the key intuition behind it. Section 2.1.5
gives the analysis in the infinite-sample case, and Section 2.1.6 completes
the proof of the main theorem by providing sample complexity bounds.
1Empirical evidence suggests that the latter choice is significantly better than the former.
2.1.2 Assumptions and Main Results
Since the sparse coding problem is very likely to be intractable for worst-case A∗ and
distributions of x∗, we will make a few assumptions on the true dictionary A∗ and
the distribution of the coefficients x∗ (similar to those in earlier papers [156, 10, 1]).2
We assume A∗ is an incoherent dictionary, since these are widespread in signal
processing [61] and statistics [57], and include various families of wavelets, Gabor
filters as well as randomly generated dictionaries.
Assumption 2.1.1 (Incoherence). We assume A∗ is µ-incoherent in the sense that
each column of A∗ has unit ℓ2 norm and the maximum pairwise inner product is
bounded by

|⟨A∗i, A∗j⟩| ≤ µ/√d.  (2.1.2)

We also assume ‖A∗‖ ≲ √(r/d).
We make the following (relatively weak) assumptions on the distribution of the sup-
port of x∗. We require that the non-zero coordinates in S have bounded pairwise
correlation.
Assumption 2.1.2. With probability 1, the support S = supp(x∗) has size at most
k. The marginals of the support S satisfy qi = Pr[i ∈ S] = Θ(k/r) and qi,j = Pr[i, j ∈
S] = Θ(k²/r²).
Conditioned on the support S, we assume the non-zero entries of x∗ to be inde-
pendent. Furthermore, we assume that the non-zero entries are bounded away from
zero, namely, that they have an absolute lower bound.
Assumption 2.1.3. Conditioned on the choice of S, the coordinates of x∗S are in-
dependent and their marginals satisfy ∀i ∈ S, E[x∗i | S] = 0 and E[(x∗i)² | S] = 1.
Moreover, conditioned on i ∈ S, with probability 1, |x∗i| ≥ C for some absolute con-
stant C ∈ (0, 1].

2Recently, Hazan and the author made progress on worst-case guarantees without generative
models [79], but the problem definition there is changed into an improper setting.
We remark that the casual reader should just think of x∗ as being drawn from some
distribution that has independent coordinates. Even in this simpler setting —which
has polynomial time algorithms using Independent Component Analysis—we do not
know of any rigorous analysis of heuristics like Olshausen-Field. The earlier papers
were only interested in polynomial-time algorithms, so did not wish to assume inde-
pendence.
Finally, throughout this paper, we will assume the following regime of the parameters
k, r, d. Again, r can be allowed to be larger by lowering the sparsity.

Assumption 2.1.4 (Regime of parameters). We assume throughout this chapter that
k ≤ √d/(µ log d) and r²k² log³ r ≤ ρd³ for some small enough absolute constant ρ.

We remark that the assumption r²k² log³ r ≤ ρd³ is slightly weaker than the assump-
tion that r ≲ d. Therefore, casual readers can think of our parameter regime as
k ≪ √d and r ≲ d (assuming µ is a constant and ignoring logarithmic factors).
Before stating the main theorem of this section, we first define the measure of
closeness that we use:

Definition 2.1.5. A is δ-close to A∗ if there is a permutation π : [r] → [r] and a
choice of signs σ : [r] → {±1} such that ‖σ(i)Aπ(i) − A∗i‖ ≤ δ for all i.

This is a natural measure to use since we can only hope to learn the columns of A∗ up
to relabeling and sign flips. Our main theorem shows that we can learn a dictionary
A that is 1/log d-close to the true A∗ in this measure.
Theorem 2.1.6. Under Assumptions 2.1.1, 2.1.2, 2.1.3, and 2.1.4, for a sufficiently large
constant c, given N = c(kr log⁴ d + σ⁴r² log⁴ d/d) examples, there exists an algorithm
(Algorithm 1) that returns a matrix A that is O(ρ/log d)-close to the ground-truth dictio-
nary A∗ in O(rd²N) time. Here ρ is the arbitrarily small absolute constant defined in
Assumption 2.1.4.
We establish various building blocks towards the proof of the main theorem in the
following sections and finally formally prove it in Section 2.1.6.
2.1.3 Related Work and Notes
A common thread in recent work on sparse coding is to assume a generative model;
the precise details vary, but each has the property that given enough samples the
solution is essentially unique. [156] gave an algorithm that succeeds when A∗ has full
column rank (in particular, m ≤ n), which works up to sparsity roughly √n. However,
this algorithm is not applicable in the more prevalent overcomplete setting.
[10] and [2, 1] independently gave algorithms for the overcomplete case assuming
that A∗ is µ-incoherent (as defined above). The former gave an
algorithm that works up to sparsity n^{1/2−γ}/µ for any γ > 0, but the running time is
n^{Θ(1/γ)}.
[2, 1] gave an algorithm that works up to sparsity either n^{1/4}/µ or n^{1/6}/µ depend-
ing on the particular assumptions of the model. These works also analyze alternat-
ing minimization, but assume that it starts from an estimate A that is column-wise
1/poly(n)-close to A∗, in which case the objective function is essentially convex.
[23] gave a new approach based on the sum-of-squares hierarchy that works for
sparsity up to n^{1−γ} for any γ > 0. But in order to output an estimate that is column-
wise ε-close to A∗, the running time of the algorithm is n^{1/ε^{O(1)}}. In most applications,
one needs to set (say) ε = 1/k to get a useful estimate; in this case their
algorithm runs in exponential time. The sample complexity of the above algorithms is
also rather large, at least Ω(m²) if not much larger. Here we will give simple and
more efficient algorithms based on alternating minimization whose column-wise error
decreases geometrically, and that work for sparsity up to n^{1/2}/(µ log n). We remark that
even empirically, alternating minimization does not appear to work much beyond this
bound.
2.1.4 The Spectral Algorithm and Key Observation
Our algorithm works by reweighting the second moments and then applying a spectral
decomposition. We introduce the idea using the noiseless case. Let u = A∗α and v = A∗α′
be two samples from our model whose supports are U and V respec-
tively, and assume U and V have a singleton intersection U ∩ V = {i}. The main idea
is that when this happens, we can reweight a fresh sample y with the factor ⟨y, u⟩⟨y, v⟩ and
compute the weighted second moment Muv defined as follows:

Muv := E[⟨y, u⟩⟨y, v⟩yy⊤].  (2.1.3)

Intuitively, yy⊤ is a linear combination of A∗jA∗j⊤ for all j ∈ [r], and ⟨y, u⟩⟨y, v⟩
puts a higher weight on those yy⊤ with a larger contribution in the direction of A∗iA∗i⊤,
since both u and v have a non-trivial correlation with A∗i. As a result, the top singular
vector of Muv will be close to A∗i.
Of course, we do not have such a pair u, v in hand to start with. Fortunately,
using the top two singular values of Muv, we can test whether u and v have the property
that |U ∩ V| = 1. Therefore we repeatedly choose pairs u, v at random until we have
recovered all the dictionary atoms. The procedure is summarized in the following
algorithm (with an additional correction term which is necessary for the noisy case):
The key idea of the algorithm discussed above is formalized in the following
proposition. We will invoke this proposition several times in the analysis of Algo-
rithm 1: to verify whether or not the supports U and V share a common element,
Algorithm 1 Weighted Spectral Initialization for Sparse Coding

Input: A set of N1 + N2 examples, split into two sets with N1 and N2 examples each.
Output: A matrix A that approximates the true dictionary A∗.

1. Set L = ∅.
2. While |L| < r:
   (a) Choose two samples u and v uniformly at random from the first N1 examples.
   (b) Let Puv = (uv⊤ + vu⊤ + ⟨u, v⟩Idd + (d + 2)Idd)σ⁴/d².
   (c) Use the second set of N2 examples to compute an empirical estimate of Muv, denoted by M̂uv.
   (d) Compute the top two singular values σ1, σ2 and the top singular vector z of M̂uv − Puv.
   (e) If σ1 ≳ k/r and σ2 ≤ k/(r log r): if z is not within distance 1/log r of any vector in L (even after a sign flip), add z to L.
3. Output the matrix A whose columns are the vectors in L.
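The reweighting step at the heart of Algorithm 1 can be exercised numerically. The sketch below implements only the core computation (the weighted moment and its top singular vector) in the noiseless case, where Puv = 0, using a random dictionary and illustrative parameters; it is not the full algorithm with the singular-value test and the list L.

```python
import numpy as np

def weighted_moment(Y, u, v):
    """Empirical estimate of M_uv = E[<y,u><y,v> y y^T]; columns of Y are samples."""
    w = (Y.T @ u) * (Y.T @ v)               # weights <y_i, u><y_i, v>
    return (Y * w) @ Y.T / Y.shape[1]

rng = np.random.default_rng(3)
d, r, k, N = 200, 40, 3, 30000

A = rng.standard_normal((d, r))
A /= np.linalg.norm(A, axis=0)              # unit-norm, nearly incoherent columns

# k-sparse sign coefficients with uniformly random supports.
X = np.zeros((r, N))
supp = np.argsort(rng.random((N, r)), axis=1)[:, :k]
X[supp.ravel(), np.repeat(np.arange(N), k)] = rng.choice([-1.0, 1.0], size=N * k)
Y = A @ X                                    # noiseless samples, so P_uv = 0

# Two "samples" whose supports intersect exactly in {0}.
u = A[:, [0, 1, 2]] @ np.array([1.0, 1.0, -1.0])
v = A[:, [0, 3, 4]] @ np.array([1.0, -1.0, 1.0])

M = weighted_moment(Y, u, v)
z = np.linalg.svd(M)[0][:, 0]                # top singular vector of the moment
alignment = abs(z @ A[:, 0])                 # correlation with the shared atom
```

In runs of this kind the top singular vector correlates strongly with the shared dictionary element A∗1, matching the intuition described above.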
and again to show that, if they do, we can approximately recover the corresponding
column of A∗ from the top singular vector of Muv.
Proposition 2.1.7. Let u = A∗α + ξ and v = A∗α′ + ξ′ be two examples. Define
Puv = (uv⊤ + vu⊤ + ⟨u, v⟩Idd + (d + 2)Idd)σ⁴/d². Let U = supp(α) and V = supp(α′),
and let β = A∗⊤u and β′ = A∗⊤v. Let ci, qi be defined as in Assumptions 2.1.2
and 2.1.3 and ρ be defined as in Assumption 2.1.4. Then, we have that

Muv − Puv = ∑_{i∈U∩V} qi ci βi β′i A∗i A∗i⊤ + E  (2.1.4)

where E is bounded by

‖E‖ ≲ ρk/(r log d).  (2.1.5)
As alluded to before, a direct consequence of Proposition 2.1.7 is that when u and
v share a unique dictionary element, the top singular value of Muv stands out and
the top singular vector is close to a true dictionary element, as formalized in the
corollary below.

Corollary 2.1.8. In the setting of Proposition 2.1.7, suppose u = A∗α + ξ and
v = A∗α′ + ξ′ are two random samples such that supp(α) ∩ supp(α′) = {i}. Then, the
top singular vector of Muv is O(ρ/log d)-close to A∗i.

Proof. When u and v share the unique dictionary element i, the first term on the
RHS of (2.1.4) simply reduces to qi ci βi β′i A∗i A∗i⊤. Moreover, follow-
ing from Proposition 2.1.7 and from the assumptions that ci ≥ 1 and qi = Ω(k/r)
(Assumption 2.1.2), the coefficient qi ci βi β′i is at least Ω(k/r). Using the noise bound
(2.1.5), by a singular value perturbation theorem (for example, Wedin's The-
orem 8.2.3 in Section 8.2), we have that the top singular vector of Muv − Puv is
O(ρk/(r log d))/Ω(k/r) = O(ρ/log d)-close to A∗i, which completes the proof.
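The perturbation step in this proof can be checked numerically: for a rank-one matrix with singular-value gap 0.5, a perturbation of spectral norm 0.01 moves the top singular vector by roughly O(0.01/0.5). The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 100
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                      # the direction we hope to recover
signal = 0.5 * np.outer(a, a)               # rank-one matrix, top singular value 0.5
E = rng.standard_normal((d, d))
E *= 0.01 / np.linalg.norm(E, 2)            # perturbation with spectral norm 0.01
z = np.linalg.svd(signal + E)[0][:, 0]      # top (left) singular vector
closeness = min(np.linalg.norm(z - a), np.linalg.norm(z + a))
```

The minimum over the sign flip mirrors the sign ambiguity in Definition 2.1.5.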
We will prove Proposition 2.1.7 in the rest of this section. In Section 2.1.5, we will
prove the correctness of Algorithm 1 in the infinite-sample regime. Fully analyzing
Algorithm 1 and proving Theorem 2.1.6 requires carefully bounding the difference
between the empirical estimate M̂uv and Muv, which will be handled in Section 2.1.6.
Towards Proposition 2.1.7, we first state the following lemma, which gives an
explicit expression for Muv.

Lemma 2.1.9. In the setting of Proposition 2.1.7, we have that

Muv = Puv + ∑_{i∈U∩V} qi ci βi β′i A∗i A∗i⊤ + ∑_{i∈[r]\(U∩V)} qi ci βi β′i A∗i A∗i⊤
      + ∑_{i,j∈[r], i≠j} qi,j (βi β′i A∗j A∗j⊤ + βi β′j A∗i A∗j⊤ + β′i βj A∗i A∗j⊤)  (2.1.6)
Proof of Lemma 2.1.9. First of all, recall that y = A∗x∗ + ξ. Plugging this into
the definition of Muv (equation (2.1.3)), we have

Muv = E_S[E_{x∗S}[⟨u, A∗S x∗S + ξ⟩⟨v, A∗S x∗S + ξ⟩(A∗S x∗S + ξ)(x∗S⊤ A∗S⊤ + ξ⊤) | S]]
      (by the law of total expectation)
    = E_S[E_{x∗S}[⟨β, x∗S⟩⟨β′, x∗S⟩ A∗S x∗S x∗S⊤ A∗S⊤ | S]]
      + E[⟨u, ξ⟩⟨v, ξ⟩ξξ⊤] + E[⟨β, x∗S⟩⟨β′, x∗S⟩ξξ⊤]  (2.1.7)

where the last line follows from E[x∗S] = 0, the fact that ξ has mean zero, and the
independence of x∗ and ξ.
Note that since ξ ∼ N(0, (σ²/d)Idd), we have

E[⟨u, ξ⟩⟨v, ξ⟩ξξ⊤] + E[⟨ξ, ξ⟩ξξ⊤] = (uv⊤ + vu⊤ + ⟨u, v⟩Idd + (d + 2)Idd)σ⁴/d².  (2.1.8)

Then, replacing A∗x∗ by ∑_i A∗i x∗i and expanding the sum, using the fact that the
entries of x∗S are independent and have mean zero, we have that

E_S[E_{x∗S}[⟨β, x∗S⟩⟨β′, x∗S⟩ A∗S x∗S x∗S⊤ A∗S⊤ | S]]
= E_S[∑_{i∈S} ci βi β′i A∗i A∗i⊤ + ∑_{i,j∈S, i≠j} (βi β′i A∗j A∗j⊤ + βi β′j A∗i A∗j⊤ + β′i βj A∗i A∗j⊤)]
= ∑_{i∈[r]} qi ci βi β′i A∗i A∗i⊤ + ∑_{i,j∈[r], i≠j} qi,j (βi β′i A∗j A∗j⊤ + βi β′j A∗i A∗j⊤ + β′i βj A∗i A∗j⊤).  (2.1.9)

Equation (2.1.6) then follows from combining equations (2.1.8), (2.1.9), and (2.1.7)
above.
Next, in preparation for bounding the error term in Proposition 2.1.7, we first state
a simple lemma that controls the singular values of the submatrices of A∗. It is a
direct consequence of the incoherence assumption 2.1.1.

Lemma 2.1.10. Under Assumptions 2.1.1 and 2.1.4, we have that for any subset
S ⊂ [r] of size at most k,

σmin(A∗S) ≥ 1/2, and σmax(A∗S) ≤ 3/2.
Proof. We apply the Gershgorin circle theorem to the matrix A∗S⊤A∗S. This is a matrix
of size |S| × |S|. By Assumption 2.1.1, the (i, j)-th off-diagonal entry ⟨A∗i, A∗j⟩ is
bounded by µ/√d in absolute value. Thus, in every row, the sum of the off-diagonal
entries in absolute value is at most kµ/√d. All the diagonal entries are of the form
⟨A∗i, A∗i⟩, which is equal to 1. Since kµ/√d ≤ 1/2 by Assumption 2.1.4, the Gershgorin
circle theorem implies that σmax(A∗S⊤A∗S) ≤ 1 + 1/2 = 3/2 and σmin(A∗S⊤A∗S) ≥
1 − 1/2 = 1/2. The lemma follows by taking square roots.
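A quick numerical sanity check of this Gershgorin argument on a random dictionary (random unit columns are incoherent with high probability), with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r, k = 400, 100, 5
A = rng.standard_normal((d, r))
A /= np.linalg.norm(A, axis=0)              # random unit columns, incoherent whp

S = rng.choice(r, size=k, replace=False)
G = A[:, S].T @ A[:, S]                     # Gram matrix of the submatrix A*_S
radius = np.abs(G - np.eye(k)).sum(axis=1).max()   # largest Gershgorin radius
svals = np.linalg.svd(A[:, S], compute_uv=False)
```

Gershgorin places every eigenvalue of the Gram matrix in [1 − radius, 1 + radius], so the singular values of A[:, S] land in [√(1 − radius), √(1 + radius)], well inside [1/2, 3/2] here.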
Next, we establish some useful properties of β and β′, in preparation for bounding
the error terms in equation (2.1.6) (the terms on the second line of (2.1.6)).

Claim 2.1.11. With high probability it holds that (a) for each i ∈ [r] we have
|βi − αi| ≲ µ(k + σ) log r/√d, and (b) ‖β‖ ≲ √(r(k + σ)/d). In particular, since the
difference between βi and αi is o(1) for our setting of parameters, we conclude that
if αi ≠ 0 then C − o(1) ≤ |βi|, and if αi = 0 then |βi| ≲ µ(k + σ) log r/√d.
Proof. Recall that U is the support of α and let R = U\{i}. Then:

βi − αi = A∗i⊤A∗U αU + A∗i⊤ξ − αi = A∗i⊤A∗R αR + A∗i⊤ξ.

To bound the two terms on the RHS of the equation above, we first note that A∗i⊤ξ is
a Gaussian random variable with variance ‖A∗i‖²σ²/d = σ²/d. Therefore, with high
probability, |A∗i⊤ξ| ≲ (σ/√d) · log r. Since A∗ is incoherent, we have that ‖A∗i⊤A∗R‖ ≤
µ√(k/d). Moreover, recall that the entries of αR are independent subgaussian
random variables. Therefore, by Hoeffding's inequality (see Theorem 8.1.2), we have
that, with high probability, |⟨A∗i⊤A∗R, αR⟩| ≲ µk log r/√d, which implies the first part of
the claim.
For the second part, we can bound ‖β‖ ≤ ‖A∗‖‖A∗U‖‖α‖ + ‖A∗‖‖ξ‖. Since α is
a k-sparse vector with independent subgaussian entries, with high probability
‖α‖ ≲ √k. It follows that with high probability ‖β‖ ≲ √(r(k + σ)/d).
Now we are ready to bound the error terms in equation (2.1.6).

Lemma 2.1.12. In the setting of Lemma 2.1.9, let

E1 = ∑_{i∉U∩V} qi ci βi β′i A∗i A∗i⊤,
E2 = ∑_{i,j∈[r], i≠j} qi,j βi β′i A∗j A∗j⊤,
E3 = ∑_{i,j∈[r], i≠j} qi,j (βi A∗i β′j A∗j⊤ + β′i A∗i βj A∗j⊤).

Then, we have that ‖E1‖, ‖E2‖, ‖E3‖ ≲ ρk/(r log d).

Proof. Let R = [r]\(U ∩ V). Then we can rewrite E1 = A∗R D1 A∗R⊤, where D1 is a
diagonal matrix whose entries are qi ci βi β′i for i ∈ R.
As a preparation, we first bound ‖D1‖. To this end, we invoke the first
part of Claim 2.1.11 to conclude that |βi β′i| ≲ µ²(k + σ)² log² r/d. Also recall that by
Assumptions 2.1.2 and 2.1.3, we have qi ci ≲ k/r. Therefore

‖D1‖ ≲ µ²(k + σ)² k log² r/(rd).
Since ‖A∗R‖ ≤ ‖A∗‖ ≲ √(r/d), we have that

‖E1‖ ≤ ‖A∗R‖‖D1‖‖A∗R‖ ≲ µ²(k + σ)² k r log² r/d³ ≲ ρk/(r log r),  (2.1.10)

where the last inequality uses the assumptions that σ ≤ k and k²r² log³ r ≤ ρd³ (see
Assumption 2.1.4).
The second term E2 is a sum of positive semidefinite matrices, and we will make
crucial use of this fact below:

E2 = ∑_{i≠j} qi,j βi β′i A∗j A∗j⊤
   ⪯ O(k²/r²)(∑_{i∈[r]} |βi β′i|)(∑_{j∈[r]} A∗j A∗j⊤)
     (by qi,j ≲ k²/r² and then completing the square)
   ⪯ O(k²/r²)‖β‖‖β′‖ A∗A∗⊤.  (by the Cauchy–Schwarz inequality)

We can now invoke the second part of Claim 2.1.11 and conclude that ‖E2‖ ≤
O(k²/r²)‖β‖‖β′‖‖A∗‖² ≲ r(k + σ)k²/d³ ≲ ρk/(r log r), where we have used
Assumption 2.1.4 in the last inequality.
For the third error term E3, by symmetry we need only consider terms of the
form qi,j βi β′j A∗i A∗j⊤. We can collect these terms and write them as A∗QA∗⊤, where
Qi,j = 0 if i = j and Qi,j = qi,j βi β′j if i ≠ j. First, we bound the Frobenius norm of
Q by the Cauchy–Schwarz inequality:

‖Q‖F = √(∑_{i≠j, i,j∈[r]} q²i,j β²i (β′j)²) ≲ √(k⁴/r⁴ (∑_{i∈[r]} β²i)(∑_{j∈[r]} (β′j)²)) ≲ k²/r² · ‖β‖‖β′‖.

Finally, we have that

‖E3‖ ≤ 2‖A∗‖²‖Q‖ ≲ (r/d) · (k²/r²) · (r(k + σ)/d) ≲ ρk/(r log r),

where the inequality in the middle uses the bounds in Claim 2.1.11 and the last
inequality uses Assumption 2.1.4. This completes the proof.
Finally, combining Lemma 2.1.12 and Lemma 2.1.9 gives the proof of Proposi-
tion 2.1.7.
2.1.5 Infinite Samples Case
In this section, we prove an infinite-sample version of Theorem 2.1.6 by repeatedly
invoking Proposition 2.1.7. In Section 2.1.6, we will give the full proof of Theorem 2.1.6
by considering the finite-sample case.

Theorem 2.1.13. In the setting of Theorem 2.1.6, if Algorithm 1 has access to Muv
(defined in equation (2.1.3)) instead of the empirical average M̂uv, then with high
probability A is δ-close to A∗, where δ ≲ ρ/log d.
Towards proving Theorem 2.1.13, we first show that the test based on singular
values (that is, σ1 ≳ k/r and σ2 ≤ k/(r log r) in Algorithm 1) can successfully
determine whether u and v share a common dictionary atom, as desired.

Lemma 2.1.14. In the setting of Theorem 2.1.6, given two samples u = A∗α + ξ
and v = A∗α′ + ξ′, if the top singular value of Muv − Puv is at least Ω(k/r) and
the second largest singular value is at most k/(r log r), then with high probability
|supp(α) ∩ supp(α′)| = 1.
Proof. We prove this by contradiction. Assume that |supp(α) ∩ supp(α′)| ≠ 1. We
further divide the analysis into two cases.
First, suppose |supp(α) ∩ supp(α′)| = 0. Then the term ∑_{i∈U∩V} qi ci βi β′i A∗i A∗i⊤
on the RHS of (2.1.4) is zero. Since ‖E‖ ≲ ρk/(r log d) by Proposition 2.1.7, we get a
contradiction with σ1(Muv − Puv) ≳ k/r.
Second, suppose U = supp(α) and V = supp(α′) share more than one dictionary
element. Let S = U ∩ V, so |S| ≥ 2. We rewrite Muv − Puv as Muv − Puv =
A∗S DS A∗S⊤ + E, where DS is a diagonal matrix whose entries are equal to qi ci βi β′i. All
diagonal entries of DS have magnitude at least Ω(k/r). By the incoherence assumption
(Assumption 2.1.1), the smallest singular value of A∗S is at least
1/2, and therefore the second largest singular value of A∗S DS A∗S⊤ is at least

σ2(A∗S DS A∗S⊤) ≥ σmin(A∗S)² σ2(DS) ≳ k/r.

It follows by Weyl's theorem (see Theorem 8.2.1 in Section 8.2) that σ2(Muv − Puv) ≥
σ2(A∗S DS A∗S⊤) − ‖E‖ ≳ k/r, which contradicts the assumption that σ2(Muv −
Puv) ≤ k/(r log r).
Now we are ready to prove Theorem 2.1.13. The idea is that every vector added to
the list L will be close to one of the dictionary elements (by Lemma 2.1.14), and that for
every dictionary element, the list L contains at least one close vector, because we have
enough random samples.

Proof of Theorem 2.1.13. By Lemma 2.1.14, we know every vector added into L must
be close to one of the dictionary elements. On the other hand, for any dictionary ele-
ment A∗i, by the bounded moment assumption on the distribution of x∗ (Assumption 2.1.3),
we know

Pr[U ∩ V = {i}] = Pr[i ∈ U] Pr[i ∈ V] Pr[(U ∩ V)\{i} = ∅ | i ∈ U, i ∈ V]
≥ Pr[i ∈ U] Pr[i ∈ V](1 − ∑_{j≠i, j∈[r]} Pr[j ∈ U ∩ V | i ∈ U, i ∈ V])  (by the union bound)
= Ω(k²/r²) · (1 − r · O(k²/r²))
= Ω(k²/r²),

where the last equality uses the assumption k < √r (Assumption 2.1.4).
Therefore, given O((r² log² d)/k²) trials, with high
probability there is a pair u, v that intersects uniquely at i, for every i ∈ [r]. By
Corollary 2.1.8, this implies that L must contain at least one vector that
is close to A∗i, for every dictionary element.
Finally, since all the dictionary elements have pairwise distance at least 1/2 (by incoher-
ence), the connected components of L correctly identify the different dictionary ele-
ments. Hence, the output A must be O(ρ/log d)-close to A∗.
2.1.6 Sample Complexity
Here we show that with only Õ(kr) samples, the difference between the true Muv matrix
and the estimated M̂uv matrix is already small enough.
Proposition 2.1.15. In the setting of Theorem 2.1.6 and Proposition 2.1.7, let M̂uv
be the empirical estimate of Muv with N2 examples. Then, we have that with high
probability

‖M̂uv − Muv‖ ≲ k² log⁴ d/N2 + √(k³ log⁴ d/(rN2)) + √(k² log⁴ d σ⁴/(dN2)).  (2.1.11)
Recall that if we use y(1), . . . , y(N2) to denote the examples used, then

M̂uv = (1/N2) ∑_{i=1}^{N2} ⟨y(i), u⟩⟨y(i), v⟩ y(i) y(i)⊤  (2.1.12)

is an average of independent matrix random variables. We will use an extension of the ma-
trix Bernstein inequality (Corollary 8.1.4 in Section 8.1.1) to control the fluctuation.
In preparation for applying it, we prove the following claims, which bound the norms
and variances of the summands on the RHS of (2.1.12). We start with the individual
spectral norms.

Claim 2.1.16. In the setting of Proposition 2.1.15, we have that with high probability,
|⟨u, y⟩| ≲ √k log d and ‖y‖ ≲ √(k log d). As a direct consequence, with high probability,
we have

‖⟨u, y⟩⟨v, y⟩yy⊤‖ ≲ k² log³ d.

Proof. Recall that u = A∗α + ξu = A∗S αS + ξu where S = supp(α), and y = A∗R x∗R + ξ
where R = supp(x∗). Because α is k-sparse and has subgaussian non-zero entries, and
‖A∗S‖ ≲ 1 (by Lemma 2.1.10), we have that ‖u‖ ≤ ‖A∗S‖‖αS‖ + ‖ξ‖ ≲ √(k log d) + σ.
The same bound holds for y as well, because they come from the same distribution.
Next, we write |⟨u, y⟩| = |⟨A∗R⊤u, x∗R⟩ + ⟨u, ξ⟩|. Note that with high probability,

‖A∗R⊤u‖ ≤ ‖A∗R‖‖u‖ ≲ ‖u‖  (by ‖A∗R‖ ≲ 1, as in Lemma 2.1.10)
         ≤ √(k log d) + σ  (by ‖u‖ ≲ √(k log d) + σ)
         ≲ √(k log d).  (by σ ≤ √k, Assumption 2.1.4)

Since x∗R has independent subgaussian entries, it follows that with high probability
|⟨A∗R⊤u, x∗R⟩| ≲ √k log d, and similarly |⟨u, ξ⟩| ≲ √k log d, which gives the claimed
bound on |⟨u, y⟩|. The consequence then follows from ‖⟨u, y⟩⟨v, y⟩yy⊤‖ ≤
|⟨u, y⟩| · |⟨v, y⟩| · ‖y‖².
Next we bound the variances of the summands in equation (2.1.12).
Claim 2.1.17. In the setting of Proposition 2.1.15, we have

‖E[⟨u, y⟩²⟨v, y⟩² yy⊤yy⊤]‖ ≲ k² log⁴ d · (k/r + σ⁴/d).

Proof. By Claim 2.1.16, we have that

E[⟨u, y⟩²⟨v, y⟩² yy⊤yy⊤] = E[⟨u, y⟩²⟨v, y⟩² ‖y‖² yy⊤]
⪯ O(k log d) · E[⟨u, y⟩²⟨v, y⟩² yy⊤]   (by ‖y‖² ≲ k log d in Claim 2.1.16)
⪯ O(k² log³ d) · E[⟨v, y⟩² yy⊤].   (by ⟨u, y⟩² ≲ k log² d in Claim 2.1.16)

On the other hand, notice that E[⟨v, y⟩² yy⊤] = Mvv, and using Proposition 2.1.7 we have that ‖Mvv − Pvv‖ ≲ k/r. Moreover, we have that ‖Pvv‖ ≲ σ⁴/d. Therefore,

‖E[⟨v, y⟩² yy⊤]‖ ≤ ‖Mvv − Pvv‖ + ‖Pvv‖ ≲ k/r + σ⁴/d.

Therefore, altogether we obtain that

‖E[⟨u, y⟩²⟨v, y⟩² yy⊤yy⊤]‖ ≲ k² log⁴ d · (k/r + σ⁴/d).   (2.1.13)
Proof of Proposition 2.1.15. Now we can apply the matrix Bernstein inequality and conclude that with high probability,

‖M̂uv − Muv‖ ≲ (k² log⁴ d)/N2 + √((k³ log⁴ d)/(rN2)) + √((k² σ⁴ log⁴ d)/(dN2)).
Finally, we are ready to prove the main theorem of Section 2.1.
Proof of Theorem 2.1.6. First of all, the conclusion of Proposition 2.1.7 remains true for M̂uv with N2 examples. To see this, we can simply write

M̂uv − Puv = qi ci βi β′i A∗i A∗i⊤ + E + (M̂uv − Muv),

where E is the same as in the proof of Proposition 2.1.7, and the last two terms E + (M̂uv − Muv) constitute the perturbation: we can view M̂uv − Muv as an additional perturbation term of the same magnitude. We have that when U ∩ V = {i}, the top singular vector of M̂uv is O(ρ/ log d)-close to A∗i. Similarly, we can prove that the conclusion of Lemma 2.1.14 is also true for M̂uv. Note that we actually choose N2 such that the perturbation of M̂uv matches the noise level in Lemma 2.1.14:
(k² log⁴ d)/N2 + √((k³ log⁴ d)/(rN2)) + √((k² log⁴ d · σ⁴)/(dN2)) ≲ k/(r log r).

Here we use the fact that N2 ≥ c(kr log⁴ d + σ⁴r² log⁴ d/d) for a sufficiently large absolute constant c. With these perturbation bounds in hand, the rest of the proof follows exactly that of the infinite-sample case given in Theorem 2.1.13.
2.2 Convex Initialization for Topic Modeling Inference
Recently, there has been considerable progress on designing algorithms with provable guarantees — typically using linear algebraic methods — for parameter learning in latent variable models, including the sparse coding problem discussed in Section 2.1. But designing provable algorithms for inference has proven to be more challenging.
In this section, we take a first step towards provable inference in topic models. We design an initialization algorithm based on linear programming that approximately infers the topic coefficients (loadings) for a document. The initialization provably recovers the support of the topic coefficients under realistic assumptions on the word-topic matrix. Starting from this initialization, we can solve the inference problem by optimizing the maximum likelihood estimator under the correct support — which turns out to be a convex problem.
2.2.1 Introduction and Main Results
Recently, there has been considerable progress on designing new algorithms for parameter learning with provable guarantees. Since the usual maximum likelihood
estimator is often NP-hard to compute even in simple models, these new algorithms
use alternative estimators based on the method of moments and linear algebra. Their
analysis usually involves making a structural assumption about the parameters of the
problem, which can often be justified in applications. Some highlights include algo-
rithms for topic modeling [10, 6], learning mixture models [129, 87, 68], community
detection [7] and (special cases of) deep learning [11, 94].
But there has been comparatively little progress on designing algorithms with provable guarantees for inference. This section takes a first step in this direction, in the context of topic models. Our algorithms leverage a property of topic models (Definition 2.2.3) that turns out to hold in many datasets — the existence of a good approximate inverse matrix.
We also give empirical results that demonstrate that our algorithm works on
realistic topic models. On synthetic data, its error is competitive with state-of-the-
art approaches (which have no such provable guarantees). It obtains somewhat weaker
results on real data.
Here we describe topic modeling, and why inference appears more difficult than
parameter learning. In topic modeling, each document is represented as a bag of
words where we ignore the order in which words occur. The model assumes there
is a fixed set of k topics, each of which is a distribution over words. Thus the ith
topic is a vector Ai ∈ RD (where D is the number of words in the language) whose
coordinates are nonnegative and sum to 1. Each document is generated by first
picking its topic proportions from some distribution; say xi is the proportion of topic i, so that ∑_i xi = 1. The model assumes a distribution on x that favors sparse or approximately sparse vectors; a popular choice is the Dirichlet distribution [33].
Then the document w1, w2, . . . , wn is generated by drawing n words independently from the distribution A · x, where A is the matrix whose columns are the topics. It is important to note that the document size n can be quite small (e.g., n may be 400, and D may be 50,000), so the empirical distribution of words in a document is, in general, a very inaccurate approximation to Ax. With some abuse of notation, we denote the document by y and also think of y as a vector in RD, whose jth coordinate is the number of occurrences of word j in the document.
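As a concrete illustration of this generative process, the following sketch samples one document. The dimensions and the Dirichlet concentration parameter are hypothetical, chosen small for illustration (a real corpus would have D around 50,000):

```python
import numpy as np

# Hypothetical small dimensions for illustration; the columns of A are topics,
# i.e., probability distributions over the D words.
rng = np.random.default_rng(3)
D, k, n = 1000, 20, 400
A = rng.dirichlet(np.full(D, 0.05), size=k).T   # D x k word-topic matrix
x = np.zeros(k)
x[rng.choice(k, size=3, replace=False)] = 1/3   # r = 3 topics, equal proportions
words = rng.choice(D, size=n, p=A @ x)          # the document w_1, ..., w_n
y = np.bincount(words, minlength=D)             # its count-vector form
```

Note how sparse y is relative to the dense distribution A · x: most of its D coordinates are zero when n ≪ D.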
Parameter learning involves recovering the best A for a corpus of documents; this
can be seen as a latent structure in the corpus. Recent (provable) algorithms for
this problem [6, 10] use the method of moments, leveraging the fact that some form
of averaging over the corpus yields a linear algebraic problem for recovering A. For
example, the word-word co-occurrence matrix (whose (i, j) entry is the probability that words i, j co-occur in a document) is given by

E_x[A x x⊤ A⊤] = A Z A⊤,

where Z is the second moment matrix of the prior distribution on x. It is possible to recover A from this expression, under natural conditions like separability [10]. Alternatively, one can use a co-occurrence tensor and recover A under weaker assumptions [6].
In the inference problem, we know the topic matrix A and are given a single
document y generated using this matrix. The goal is to find the posterior distribution
x|y. This can be seen as labeling or categorizing this document, which is important
in applications. Inference is reminiscent of classical regression problems where the goal is to find x given y = Ax + noise. The key difference here is the nature of the noise — each word is a single draw from the distribution Ax, so coordinate j of the corresponding indicator vector is 1 with probability (Ax)j and 0 otherwise — which means that the noise on a coordinate-by-coordinate basis can be much larger than the signal. In particular, the vector y ∈ RD is very sparse even though Ax is dense.
This problem can be seen as an analog of sparse linear regression when the target (regression) vector x has nonnegative coordinates and ∑_i xi = 1. (This is distinct from the usual ℓ1-regression, where the regression vector is measured in ℓ2 even though the loss function is ℓ1.) The difficulty here, in addition to the issue of high coordinate-wise error already mentioned, is that the usual sparsity-enforcing ℓ1-regularization buys nothing, since the solution needs to exactly satisfy ‖x‖1 = 1.
Inference seems more difficult than parameter learning because averaging over many documents is no longer an option. Furthermore, the solution x is not unique in general, and in some cases the posterior distribution on x is not well concentrated around any particular value. (In practice Gibbs sampling can be used to sample from the posterior [71, 175], but as mentioned, a rigorous analysis has proved difficult; in general, inference is NP-hard.) We will view inference as a problem of recovering some ground truth x∗ that was used to generate the document, and we show that with probability close to 1 our estimate x is close to x∗ in ℓ1 norm.
Bayesian vs Frequentist Views. So far we have not differentiated between Bayesian and frequentist approaches to framing the inference problem; we now show that the two are closely related here. The above description is frequentist, assuming an unknown "ground truth" vector x∗ of topic proportions (which is r-sparse for some small r) was used to generate a document y, using a distribution y|x∗. Let Ex∗ be the event that our algorithm recovers a vector x such that ‖x − x∗‖1 ≤ ε. For our algorithm, Pr_{y|x∗}[Ex∗] ≥ 1 − δ² for some δ > 0. By contrast, in the Bayesian view, one assumes a prior distribution on x∗ and seeks to output a sample from the conditional distribution x∗|y. Now we show that the success of our frequentist algorithm implies that the posterior x∗|y must also be concentrated, placing most of its probability mass on the set of x∗ with ‖x − x∗‖1 ≤ ε, where x is the algorithm's output.
By the law of total expectation, we have

Pr_{x∗,y}[Ex∗] = E_{x∗}[ Pr_{y|x∗}[Ex∗ | x∗] ] ≥ 1 − δ².

Switching the order of expectation, we obtain

E_y[ Pr_{x∗|y}[Ex∗ | y] ] ≥ 1 − δ².

Then it follows by a Markov argument that

Pr_y[ Pr_{x∗|y}[Ex∗ | y] ≥ 1 − δ ] ≥ 1 − δ.
Note that the inner probability is over the posterior distribution p_{x∗|y}, while the event Ex∗ only depends on the output x of the algorithm given y. Thus, with probability at least 1 − δ over the choice of y, at least 1 − δ of the probability mass of x∗|y is concentrated in the ℓ1 ball of radius ε around the algorithm's answer x.
From now on the goal of our algorithm is to recover x∗ given y, and we identify
conditions under which the event has a probability close to 1.
Minimum Variance Estimators (with Bias). Having set up the problem as
above, next, we consider how to recover an approximation to x∗ given a document y
generated with topic proportions x∗.
Since A has orders of magnitude more rows than columns, it has many left inverses
to choose from. If we find any matrix B where BA is equal to the identity matrix,
then By is an unbiased estimate for x∗. However, this estimate has high variance if B
has large entries, necessitating working with only very large documents. Motivated
by applications to collaborative filtering, [100] introduce the notion of the `1 condition
number (see Definition 2.2.1) of A, which allows them to construct a left inverse B
with a much smaller maximum entry. We introduce a weaker notion of condition
number called the `∞-to-`1 condition number, which leverages the observation that
even if BA is close to the identity matrix it still yields a good linear estimator for
x∗. We call B an approximate inverse of A. Moreover, it has the benefit that the
condition number, as well as the approximate left inverse B with minimum variance,
can also be computed in polynomial time using a linear program (Proposition 2.2.4)!
In our experiments, we compute the exact condition number of word-topic matrices
that were found using standard topic modeling algorithms on real-life corpora. (By
contrast, we do not know the `1 condition number of these matrices.) In all of the
examples, we found that the condition number is at most a small constant, which
allows us to compute good approximate left inverses to the topic matrix A to enable
us to estimate x∗ even with relatively short documents.
Main results. Our main result (Theorem 2.2.5) shows that when the condition
number is small, it is possible to estimate x∗ using a combination of thresholding
and a left inverse B of minimum variance. Our overall algorithm requires n = O(r2)
samples to achieve o(1/r) error in `∞ norm and o(1) error in `1-norm, where r is
the number of topics represented in the document. It runs efficiently in O(nk) time.
Note that we do not need to assume a particular model (e.g., uniform random) for the r topics; the algorithm works even when the topics may be correlated with each
other. This means that we can recover the support of x∗ when each of its non-zero
coordinates is suitably bounded away from zero.
This algorithm can serve as an initialization method for the MLE estimator. In fact, it can be shown that maximizing the log-likelihood function over the support recovered by this algorithm further reduces the estimation error (measured in the ℓ1-norm) to O(√r/n). This part is beyond the scope of this thesis, because the MLE is convex given the correct support and the analysis involves a mostly statistical argument. We refer the readers to [13] for more details. (One can also find a matching sample complexity lower bound for recovering the support of x∗ in [13].)
Thus, to sum up, our overall approach involves simple linear algebraic primitives
followed by convex programming. For a topic model with k topics, the sample com-
plexity of our algorithms depends on log k instead of k. This is important in practice
as k is often at least 100. The accuracy on synthetic data is good for sparse x, though
not quite as good as Gibbs sampling. However, if we forgo the convex programming
step we can compute a reasonable estimate for x from a single matrix-vector mul-
tiplication plus thresholding, which is an order of magnitude faster than finding an
estimate of the same quality via Gibbs sampling.
And of course, our approach comes with a performance guarantee.
2.2.2 Notations and Preliminaries
In addition to the description of the topic model in Section 2.2.1, we introduce the following notation. We use Sk = {z ∈ Rk≥0 : ‖z‖1 = 1} to denote the k-dimensional probability simplex. We assume throughout that the true topic proportion vector x∗ ∈ Sk is r-sparse. Sometimes we also abuse notation and use y as a D-dimensional vector instead of a set; in this case yi is the number of times word i appears in the document. We will use a_i⊤ to denote the i-th row of A. We will use cat(p) to denote the categorical distribution defined by a probability vector p. The Euclidean, ℓ1, and ℓ∞ norms of a vector are denoted by ‖ · ‖, ‖ · ‖1, and ‖ · ‖∞, respectively.
Condition Numbers of Matrices The condition number of a matrix usually refers to the ratio of its largest and smallest singular values. However, this concept is tied to the ℓ2 norm, and for probability distributions the most natural norms are ℓ1 and ℓ∞.
Next we define the matrix norms that we will utilize. Let |A|∞ = max_{i,j} |Aij| denote the maximum absolute value of the entries of the matrix A, and |A|1 = ∑_{i,j} |Aij| denote the sum of the absolute values of the entries of A.
We will also work with various notions of condition number, that we will use in
our guarantees.
Definition 2.2.1 (`1-condition number). For a nonnegative matrix A, define its `1-
condition number κ(A) to be the minimum κ such that for any x ∈ Rk,
‖Ax‖1 ≥ ‖x‖1/κ (2.2.1)
This condition number was introduced by [100] in analyzing various algorithms
for collaborative filtering. We will use a weaker (i.e., smaller) notion of the condition
number. Empirically, it seems that most of the word-topic matrices that we have
encountered have a reasonably small `1-condition number, and have an even smaller
`∞ → `1-condition number.
Definition 2.2.2 (`∞ → `1-condition number). Let λ(A) be the minimum number λ
such that for any x ∈ Rk,
‖Ax‖1 ≥ ‖x‖∞/λ (2.2.2)
Remark 2.2.1. Based on the relationship between `1 and `∞ norm, we have that
λ(A) ≤ κ(A) ≤ kλ(A).
2.2.3 δ-Biased Minimum Variance Estimators
Let y ∈ RD be the document vector whose i-th entry yi is the number of times word i
appears. Our estimator attempts to infer the true topic vector x∗ by left-multiplying
y with some matrix B. Intuitively, E[By] = BAx∗, so we want BA to be close to
the identity matrix. On the other hand, when we apply B to the document vector,
each word will select a column of B, and its variance on any entry is bounded by the
maximum entry in B. Therefore we would like to optimize over two things: first, we
want BA to be close to identity; second, we want the matrix B to have small |B|∞.
This inspires the following linear program:
Definition 2.2.3. For A ∈ RD×k and δ ≥ 0, define λδ(A) to be the solution of the
following linear program:
λδ(A) = min |B|∞
s.t. |BA − Idk|∞ ≤ δ, (2.2.3)
B ∈ Rk×D. (2.2.4)
We will refer to the minimizer B of the above convex program as the δ-biased
minimum variance inverse for A. The solution to the above convex program will help
minimize our sample complexity both theoretically and empirically.
Allowing a nonzero δ can potentially reduce the variance of the estimator while introducing a small bias. Such a bias-variance trade-off has been studied in other settings [128, 95].
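Since the objective |B|∞ and the constraints of program (2.2.4) decouple across the rows of B, the program can be solved as k independent small LPs, one per row. The following sketch (our own transcription, using `scipy.optimize.linprog`; names and dimensions are ours, not from the thesis) computes a δ-biased minimum variance inverse:

```python
import numpy as np
from scipy.optimize import linprog

def min_variance_inverse(A, delta):
    """Solve min |B|_inf s.t. |BA - Id|_inf <= delta (Definition 2.2.3).

    The program decouples across the rows of B, so we solve one LP per row,
    with variables (b, t) where b is the row and t upper-bounds max_j |b_j|.
    """
    D, k = A.shape
    B = np.zeros((k, D))
    for i in range(k):
        c = np.zeros(D + 1)
        c[-1] = 1.0                                   # objective: minimize t
        # |b_j| <= t for all j, written as two sets of linear inequalities
        G1 = np.hstack([np.eye(D), -np.ones((D, 1))])
        G2 = np.hstack([-np.eye(D), -np.ones((D, 1))])
        # |(bA)_j - e_i(j)| <= delta for all j
        e = np.zeros(k)
        e[i] = 1.0
        G3 = np.hstack([A.T, np.zeros((k, 1))])
        G4 = np.hstack([-A.T, np.zeros((k, 1))])
        G = np.vstack([G1, G2, G3, G4])
        h = np.concatenate([np.zeros(2 * D), e + delta, delta - e])
        res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * (D + 1))
        B[i] = res.x[:D]
    return B
```

The quantity λδ(A) is then `np.abs(B).max()`; for word-topic matrices with thousands of words, solving these k LPs is inexpensive.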
What is the optimal |B|∞? To answer this question we take the dual of the LP (2.2.4) (with variable Q ∈ Rk×k):

maximize tr(Q) − δ|Q|1
s.t. |AQ|1 ≤ 1 (2.2.5)

We can further show that equation (2.2.5) is equivalent to the following (non-convex) program with vector variable x ∈ Rk (see equation (2.2.8) in the proof of Proposition 2.2.4):

maximize ‖x‖∞ − δ‖x‖1
s.t. ‖Ax‖1 ≤ 1
Note that this is very closely related to the condition number λ in Definition 2.2.2.
In particular, the optimal value is exactly λ(A) when δ = 0! When δ > 0 this can be
viewed as a relaxation of the `∞ → `1 condition number.
Proposition 2.2.4. For any δ ≥ 0, we have that
λδ(A) ≤ λ0(A) = λ(A) ≤ κ(A) .
Proof. Let J be the all-1's matrix. We rewrite the program (2.2.4) as a linear program by introducing an auxiliary variable t:

λδ(A) = min t
s.t. B ≤ tJ
−B ≤ tJ
BA − Id ≤ δJ
−BA + Id ≤ δJ

Let P1, P2 ∈ Rk×D and Q1, Q2 ∈ Rk×k be the dual variables for the four (sets of) constraints, and let ⟨X, Y⟩ = tr(X⊤Y) denote the inner product of two matrices. Then the dual of the program above is

maximize ⟨Q2 − Q1, Id⟩ − δ⟨Q2 + Q1, J⟩
s.t. (P1 − P2) + (Q1 − Q2)A⊤ = 0
⟨P1 + P2, J⟩ = 1
P1, P2, Q1, Q2 ≥ 0 (2.2.6)
Let Q = Q2 − Q1 and P = P1 − P2. Observing that ⟨P1 + P2, J⟩ ≥ |P|1 and ⟨Q1 + Q2, J⟩ ≥ |Q|1, it is easy to verify that program (2.2.6) is equivalent to the program below:

maximize tr(Q) − δ|Q|1
s.t. P + QA⊤ = 0
|P|1 ≤ 1 (2.2.7)

Towards further simplification, we claim that program (2.2.7) is equivalent to the following (non-convex) program with vector variable x ∈ Rk:

maximize ‖x‖∞ − δ‖x‖1
s.t. ‖Ax‖1 ≤ 1 (2.2.8)

Indeed, suppose program (2.2.7) has optimal value λ and program (2.2.8) has optimal value λ′ with optimal solution xopt. We first show that for any x,

‖x‖∞ − δ‖x‖1 ≤ λ′‖Ax‖1, (2.2.9)
which follows from the homogeneity of the program. Then, consider any P, Q that satisfy the constraints of (2.2.7). Let Qj denote the j-th row of Q. We have

tr(Q) − δ|Q|1 ≤ ∑_{j=1}^k (‖Qj‖∞ − δ‖Qj‖1) ≤ λ′ ∑_{j=1}^k ‖AQj‖1 = λ′|P|1 ≤ λ′,

where the second inequality is by equation (2.2.9). Therefore λ ≤ λ′.
On the other hand, suppose xopt has its largest absolute value in coordinate i. Let Q be the matrix whose i-th row is xopt and 0 elsewhere, and let P = −QA⊤. Then it is straightforward to check that P, Q satisfy the constraints of (2.2.7) and have objective value λ′. Therefore λ′ ≤ λ. Hence we obtain that λ = λ′.
Finally, from (2.2.8) it is easy to see that λ0(A) = λ(A) and λδ(A) ≤ λ0(A).
2.2.4 Thresholded Linear Inverse Algorithm and its Guarantees
In this section we show how to estimate the topic proportion vector using a δ-biased minimum variance inverse B of the word-topic matrix A (Definition 2.2.3). For a small δ (that is, δ ≲ 1/r), given a solution B of program (2.2.4) with entries of absolute value at most λδ(A), the following Thresholded Linear Inverse estimator (Algorithm 2) is guaranteed to be close to the true x∗ in both ℓ1 and ℓ∞ norm. Recall that the threshold function thτ(·) is defined as

thτ(t) = t if t > τ, and thτ(t) = 0 otherwise. (2.2.10)
Algorithm 2 Thresholded Linear Inverse Algorithm (TLI)
Input: Document y with n words, and δ-biased inverse matrix B of matrix A.
Output: Topic vector estimator x.
1. Compute x = By/n.
2. For all i ∈ [k], let xi = thτ(xi), where τ = 2λδ(A)√(log k/n) + δ.
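A direct transcription of Algorithm 2, together with a toy run on a hypothetical two-topic model whose topics have disjoint word supports (so an exact left inverse exists, δ = 0 and λ0(A) = 1); the dimensions are ours, chosen for illustration:

```python
import numpy as np

def tli(y, n, B, lam_delta, delta, k):
    """Thresholded Linear Inverse (Algorithm 2)."""
    x_bar = B @ y / n                                      # Step 1
    tau = 2 * lam_delta * np.sqrt(np.log(k) / n) + delta   # Step 2 threshold
    return np.where(x_bar > tau, x_bar, 0.0)

# Toy model: topic 1 is uniform on words {0, 1}, topic 2 on words {2, 3}.
rng = np.random.default_rng(5)
A = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
B = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # BA = Id, |B|_inf = 1
x_star = np.array([1.0, 0.0])                               # 1-sparse ground truth
n = 400
words = rng.choice(4, size=n, p=A @ x_star)
y = np.bincount(words, minlength=4)
x_hat = tli(y, n, B, lam_delta=1.0, delta=0.0, k=2)         # recovers [1.0, 0.0]
```

As noted in Section 2.2.1, this single matrix-vector multiplication plus thresholding already yields a reasonable estimate, without any convex programming step.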
Theorem 2.2.5. Suppose document y is generated from an r-sparse topic vector x∗. For any ε > 4δr, given n = Ω(λδ(A)²r² log k/ε²) samples, with high probability Algorithm 2 returns a vector that has ℓ1-distance at most ε to x∗.
Our first step is to bound the variance of the partial estimator x before threshold-
ing. Our bound will utilize the maximum entry in B, which is why we tried to find
B that minimizes this quantity in the first place.
Lemma 2.2.6. With probability at least 1 − 1/k², it holds that

‖x − x∗‖∞ ≤ δ + 2λδ(A)√((log k)/n). (2.2.11)
Proof of Lemma 2.2.6. Let Ij be the indicator vector of the j-th word in the document; that is, Ij = e_w ∈ RD if the j-th word in the document is the w-th word in the vocabulary. Then, by definition, we have that y = ∑_{j∈[n]} Ij. Next, by the definition of x, we have that

xi = (1/n) ∑_{j=1}^n (BIj)i,
where (BIj)i is the i-th coordinate of BIj. Thus, we have written xi as a sum of independent random variables. We will use a concentration inequality to show that xi is concentrated around its mean. The key is that the way we have chosen B ensures that the estimator has bias at most δ and small variance.
To elaborate, we can compute the expectation of the partial estimator xi:

E[xi] = (BAx∗)i = ∑_{j=1}^k (BA)_{i,j} x∗j = x∗i + ∑_{j=1}^k ((BA)_{i,j} − 1(i = j)) x∗j,

where 1(i = j) = 1 if and only if i = j. Recall that by construction (equation (2.2.3)), we have that for all i and j, |(BA)_{i,j} − 1(i = j)| ≤ δ. Hence,

|∑_{j=1}^k ((BA)_{i,j} − 1(i = j)) x∗j| ≤ δ ∑_{j=1}^k x∗j = δ. (2.2.12)

Therefore we conclude that |E[xi] − x∗i| ≤ δ, which shows that our partial estimator xi has bias at most δ on each coordinate.
Now we can appeal to standard concentration arguments to show the concentration of xi. Recall that xi is a sum of independent random variables, xi = (1/n) ∑_{j=1}^n (BIj)i, and each summand is bounded in absolute value by max_j |(BIj)i| ≤ λδ(A). We apply Hoeffding's inequality (see Theorem 8.1.2 in Section 8.1.1) and obtain that with probability at least 1 − 1/k²,

|xi − E[xi]| ≤ 2λδ(A)√((log k)/n).

This, together with equation (2.2.12) that bounds the bias, completes the proof of the lemma.
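A quick simulation of the lemma on a toy matrix with an exact inverse (δ = 0, λ0(A) = 1, our own illustrative dimensions); the observed ℓ∞ deviation of By/n from x∗ falls within the 2λδ(A)√(log k / n) bound:

```python
import numpy as np

# With delta = 0, the bound of Lemma 2.2.6 reduces to 2 * lambda * sqrt(log k / n).
rng = np.random.default_rng(4)
k, n = 2, 10_000
A = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
B = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # BA = Id, |B|_inf = 1
x_star = np.array([0.7, 0.3])
words = rng.choice(4, size=n, p=A @ x_star)
y = np.bincount(words, minlength=4)
x_bar = B @ y / n                     # the partial estimator before thresholding
deviation = np.max(np.abs(x_bar - x_star))
```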
The lemma above shows that the vector x is close to the true x∗ in the infinity norm. As a direct corollary, we conclude that after thresholding x, we obtain the correct support of x∗, provided that x∗ does not have very small nonzero entries.
Corollary 2.2.7. With high probability, the output x of Algorithm 2 satisfies that for every i ∈ [k], if x∗i = 0 then xi = 0, and if x∗i ≥ 4λδ(A)√((log k)/n) + 2δ then xi > 0. In particular, if all the nonzero entries of x∗ are at least ε/r for some ε > 4δr, the algorithm finds the correct support with O((λδ(A)²r² log k)/ε²) samples.
Using the corollary above we can then prove Theorem 2.2.5. The key intuition is that x incurs error only on the non-zero coordinates of x∗, and only a bounded amount of error on each of them.
Proof of Theorem 2.2.5. By Lemma 2.2.6 and a union bound, with probability at least 1 − 1/k we have |xi − x∗i| ≤ δ + 2λδ(A)√((log k)/n) for every i ∈ [k]. Thus in Step 2 of the algorithm we are guaranteed that if x∗i = 0, then xi must be smaller than the threshold, and therefore xi = 0.
On the other hand, if x∗i > 0, then accounting for both the estimation error and the at most τ that thresholding can subtract,

|x∗i − xi| ≤ 2δ + 4λδ(A)√((log k)/n).

Since x and x∗ are entry-wise close, error can accumulate on at most the r coordinates where x∗i > 0. For δ ≤ ε/(4r), and for n = 64λδ(A)²r² log k/ε², we have 4λδ(A)√((log k)/n) ≤ ε/(2r). Combining these two facts, we conclude that

‖x∗ − x‖1 ≤ r(2δ + 4λδ(A)√((log k)/n)) ≤ ε,

which completes the proof.
2.3 Discussion: Special Initialization vs Trivial Initialization
In this chapter, we designed special initializations for the local improvement algorithms for sparse coding and topic model inference. Attentive readers may notice that designing these kinds of initializations often relies on a thorough understanding of the specific structure of the problem. For sparse coding, we exploit the incoherence of the dictionary and the sparsity of the coefficient vectors. For topic model inference, we leverage the special ℓ∞ → ℓ1 condition number of the word-topic matrix. Exploiting these structures allows us to design strong initialization algorithms that return approximate solutions. Designing such initializations is often challenging, but once obtained, they often simplify the analysis and design of the non-convex optimization algorithms that follow, as shown in Chapter 3.
In practice, simpler and more generic initializations are sometimes deployed for many problems, including various statistical inference problems and the training of neural networks. One such initialization is random Gaussian initialization with a small variance, which turns out to be effective for training large-scale neural networks. A slightly more delicate one is the standard initialization for sparse coding: the initial dictionary contains random samples as columns. Although certain small tricks, such as tuning the variance of the initializers, are required to achieve maximal empirical performance, these methods largely perform reliably well and are very easy to implement. It is notoriously difficult to analyze non-convex optimization algorithms starting from such initializations, since the initializer reveals little about how the dynamics of the iterates will evolve. Part II is dedicated to developing analysis tools for such situations.
Chapter 3
Local Convergence to a Global Minimum
In this chapter, we introduce a framework for designing and analyzing the local search
algorithms that converge from a reasonable initialization quickly to a global minimum
(see Section 3.1). Then in Section 3.2 we apply the framework to design and analyze
several local search algorithms for the sparse coding problem. Together with the
initialization algorithms studied in Section 2.1, our algorithms improve upon the
sample complexity of existing approaches. We believe that our analysis framework
will have applications in other settings where simple iterative algorithms are used.
3.1 Analysis Framework via Lyapunov Function
Consider a general iterative algorithm that attempts to converge to a desired solution z∗. In machine learning settings, z∗ often corresponds to some ground truth or the global optimum of some objective function. The algorithm starts with an initialization z0. At each step s, given the current iterate zs, it computes some direction gs and updates its estimate as:

zs+1 = zs − ηgs. (3.1.1)
Such an algorithm can be seen as a dynamical system. Our goal is to show that the sequence of iterates z0, z1, . . . converges to (or gets close to) the target point z∗. To design a framework for proving convergence, it helps to indulge in daydreaming/wishful thinking: what property would we like the updates to have, to simplify our job?
A natural idea is to define a Lyapunov function V(z) and show that: (i) V(zs) decreases to 0 (at a certain speed) as s → ∞; (ii) when V(z) is close to 0, then z is close to z∗.
In this chapter, we consider possibly the most trivial Lyapunov function, the (squared) Euclidean distance to the target point, V(z) = ‖z − z∗‖². This is also used in the standard convergence proof for convex functions, since moving in the direction opposite to the gradient can be shown to reduce this measure V(·).[1]
Simple algebraic manipulation shows that when the learning rate η is small enough, it is necessary and sufficient to have ⟨gs, zs − z∗⟩ > 0 in order for V(zs+1) < V(zs). Namely, the movement direction −gs should be correlated with the ideal direction z∗ − zs.
To get quantitative bounds on the running time, we need to ensure that V(zs) not only decreases but does so rapidly. The next definition formalizes this: intuitively speaking, it says that gs and zs − z∗ make an angle strictly less than 90 degrees.
Definition 3.1.1. A vector gs is (α, β, εs)-correlated with z∗ if

⟨gs, zs − z∗⟩ ≥ α‖zs − z∗‖² + β‖gs‖² − εs.

[1] One can imagine more complicated ways of proving convergence, e.g., showing that V(zs) ultimately goes to 0 even though it doesn't decrease in every step. The analyses of mirror descent and Nesterov's acceleration use such a progress measure. Such analyses often use Fenchel duality followed by a telescoping sum, which seems to rely on the convexity of the objective function.
The traditional analysis of convex optimization corresponds to the setting where z∗ is the global optimum of some convex function f and εs = 0. Specifically, if f(·) is 2α-strongly convex and 1/(2β)-smooth, then gs = ∇f(zs) is (α, β, 0)-correlated with z∗. We will refer to εs as the bias. Allowing the bias makes the framework more general and will be necessary for the case of sparse coding.
If the algorithm can at each step find such update directions that are correlated
with z∗, then the familiar convergence proof of convex optimization can be modified
to show rapid convergence here as well, except the convergence is approximate, to
some point in the neighborhood of z∗.
Theorem 3.1.2. Suppose gs satisfies Definition 3.1.1 for s = 1, 2, . . . , T, suppose η satisfies 0 < η ≤ 2β, and let ε = max_{s=1,...,T} εs. Then, for any s = 1, . . . , T,

‖zs+1 − z∗‖² ≤ (1 − 2αη)‖zs − z∗‖² + 2ηεs.

In particular, the update rule above converges to z∗ geometrically with systematic error ε/α, in the sense that

‖zs − z∗‖² ≤ (1 − 2αη)^s ‖z0 − z∗‖² + ε/α.
The proof closely follows existing proofs in convex optimization (see, e.g., []).
Proof of Theorem 3.1.2. We expand the error:

‖zs+1 − z∗‖² = ‖zs − z∗‖² − 2η⟨gs, zs − z∗⟩ + η²‖gs‖²
= ‖zs − z∗‖² − η(2⟨gs, zs − z∗⟩ − η‖gs‖²)
≤ ‖zs − z∗‖² − η(2α‖zs − z∗‖² + (2β − η)‖gs‖² − 2εs)   (by Definition 3.1.1 and η ≤ 2β)
≤ ‖zs − z∗‖² − η(2α‖zs − z∗‖² − 2εs)
≤ (1 − 2αη)‖zs − z∗‖² + 2ηεs.

Solving this recurrence, we have ‖zs+1 − z∗‖² ≤ (1 − 2αη)^{s+1} R² + ε/α, where R = ‖z0 − z∗‖. Furthermore, if εs < (α/2)‖zs − z∗‖², we have instead

‖zs+1 − z∗‖² ≤ (1 − 2αη)‖zs − z∗‖² + αη‖zs − z∗‖² = (1 − αη)‖zs − z∗‖²,

which yields Corollary 3.1.3 below.
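As a sanity check of Definition 3.1.1 and Theorem 3.1.2, the following sketch runs the update (3.1.1) on a toy strongly convex quadratic (our own choice of H, α, β), where the exact gradient is (α, β, 0)-correlated with z∗ = 0:

```python
import numpy as np

# f(z) = 0.5 z^T H z with H = diag(1, 4): 2*alpha-strongly convex with
# 2*alpha = 1, and 1/(2*beta)-smooth with 1/(2*beta) = 4; here z* = 0.
H = np.diag([1.0, 4.0])
alpha, beta = 0.5, 1.0 / 8
eta = 2 * beta                      # the largest step size Theorem 3.1.2 allows
z_star = np.zeros(2)

rng = np.random.default_rng(1)
z = rng.standard_normal(2)
for s in range(50):
    g = H @ z                       # the update direction g_s
    # the correlation condition of Definition 3.1.1 with eps_s = 0:
    assert g @ (z - z_star) >= alpha * np.linalg.norm(z - z_star) ** 2 \
        + beta * np.linalg.norm(g) ** 2 - 1e-12
    z = z - eta * g
```

The iterates contract geometrically, with ‖zs − z∗‖² ≤ (1 − 2αη)^s ‖z0 − z∗‖² and no systematic error since ε = 0.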
The theorem above has a term ε/α that reflects the approximation error caused by the bias in each iteration. The following corollary says this approximation error goes away if the bias also decreases proportionally to ‖zs − z∗‖². The necessity and benefit of allowing this small bias term will become clearer in Section 3.2.2, where we apply this framework to analyze alternating minimization algorithms.
Corollary 3.1.3. In the setting of Theorem 3.1.2, if in addition εs < (α/2)‖zs − z∗‖² for s = 1, . . . , T, then

‖zs − z∗‖² ≤ (1 − αη)^s ‖z0 − z∗‖².
In fact, we can extend the analysis above to obtain identical results for the case of constrained optimization. Suppose we are interested in optimizing a convex function f(z) over a convex set B. The standard approach is to take a step in the direction of the negative gradient (or −gs in our case) and then project onto B after each iteration, namely, replace zs+1 by Proj_B zs+1, the closest point in B to zs+1 in Euclidean distance. It is well known that if z∗ ∈ B, then ‖Proj_B z − z∗‖ ≤ ‖z − z∗‖. Therefore we obtain the following as an immediate corollary of the above analysis:
Corollary 3.1.4. Suppose gs satisfies Definition 3.1.1 for s = 1, 2, . . . , T, set 0 < η ≤ 2β and ε = max_{s=1,...,T} εs, and suppose that z∗ lies in a convex set B. Then the update rule zs+1 = Proj_B(zs − ηgs) satisfies, for any s = 1, . . . , T,

‖zs − z∗‖² ≤ (1 − 2αη)^s ‖z0 − z∗‖² + ε/α.

In particular, zs converges to z∗ geometrically with systematic error ε/α. Additionally, if εs < (α/2)‖zs − z∗‖² for s = 1, . . . , T, then

‖zs − z∗‖² ≤ (1 − αη)^s ‖z0 − z∗‖².
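The projected variant can be checked the same way. A sketch on the same toy quadratic, with B the box [0, 2]² (a hypothetical constraint set containing z∗ = 0):

```python
import numpy as np

# Same quadratic as before (H = diag(1, 4), alpha = 1/2, beta = 1/8), but each
# step is followed by a Euclidean projection onto the box B = [0, 2]^2.
H = np.diag([1.0, 4.0])
eta = 0.25

def proj_box(z, lo=0.0, hi=2.0):
    return np.clip(z, lo, hi)       # Euclidean projection onto a box is a clip

z = np.array([2.0, 2.0])            # initialize at a corner of B
for s in range(60):
    z = proj_box(z - eta * (H @ z))
```

Since projection onto a convex set containing z∗ can only decrease the distance to z∗, the same geometric contraction goes through unchanged.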
3.1.1 Generalization to Stochastic Updates
In machine learning applications, most objective functions involve an average of loss functions over individual samples. This special structure allows various speed-ups of gradient descent by using stochasticity — one can form an unbiased estimator of the average of the individual gradients. In many of these settings, the update rule still takes the form of equation (3.1.1), but g_s is a random variable instead of a deterministic vector. Towards handling this stochasticity, we introduce in this subsection extensions of Definition 3.1.1 and Theorem 3.1.2.
Definition 3.1.5. A random vector g_s is (α, β, ε_s)-correlated-whp with a desired solution z∗ if, with probability at least 1 − d^{−Ω(1)},

〈g_s, z_s − z∗〉 ≥ α‖z_s − z∗‖² + β‖g_s‖² − ε_s .
This is a strong condition, as it requires that the random vector be well-correlated with the desired solution with very high probability. In some cases we can further relax the definition as follows:
Definition 3.1.6. A random vector g_s is (α, β, ε_s)-correlated-in-expectation with a desired solution z∗ if

E[〈g_s, z_s − z∗〉] ≥ α‖z_s − z∗‖² + β E[‖g_s‖²] − ε_s .

We remark that E[‖g_s‖²] can be much larger than ‖E[g_s]‖², so the above notion is still stronger than requiring (say) that the expected vector E[g_s] be (α, β, ε_s)-correlated with z∗.
Theorem 3.1.7. Suppose the random vector g_s is (α, β, ε_s)-correlated-whp with z∗ for s = 1, 2, . . . , T, where T ≤ poly(d), and η satisfies 0 < η ≤ 2β. Then for any s = 1, . . . , T,

E[‖z_{s+1} − z∗‖²] ≤ (1 − 2αη)‖z_s − z∗‖² + 2ηε_s .

In particular, if ‖z_0 − z∗‖ ≤ δ_0 and ε_s ≤ α · o((1 − 2αη)^s)δ_0² + ε, then the updates converge to z∗ geometrically with systematic error ε/α, in the sense that

E[‖z_s − z∗‖²] ≤ (1 − 2αη)^s δ_0² + ε/α .
The proof is identical to that of Theorem 3.1.2 except that we take the expectation
of both sides.
3.1.2 Related Work
The proof of Theorem 3.1.2 is a straightforward extension of the standard analysis of gradient descent for strongly convex objective functions (see [133] for background on convex optimization). Much previous work has viewed optimization algorithms as dynamical systems and used control-theoretic techniques to analyze them, especially in the context of convex optimization (see [107] and the references therein).
Recently, a large body of work has proved convergence of local search algorithms from a good initialization for many statistical learning problems: matrix completion [93, 72, 75, 164], sparse coding [14], phase retrieval [43], and mixtures of Gaussians [20], just to name a few. These works have identified many similar conditions under which specific non-convex optimization problems can be solved to near-optimality. All of these conditions, e.g., the conditions in [43, 20], can be seen as some weakening of convexity of the objective function (except the analysis for matrix completion in [72], which views the updates as a noisy power method).
Our condition appears to contain most if not all of these as special cases. Often the update direction g_s in these papers is related to the gradient. For example, using the gradient instead of g_s in our correlation condition turns it into the regularity condition proposed by [43] for analyzing the Wirtinger flow algorithm for phase retrieval. The gradient stability condition in [20] is also a special case, where g_s is required to be close enough to ∇h(z_s) for some convex h such that z∗ is the optimum of h. Then, since ∇h(z_s) makes an angle of less than 90 degrees with z_s − z∗ (which follows from the convexity of h), so does g_s.
The advantage of our framework is that it encourages one to think of algorithms where g_s is not the gradient. Thus applying the framework doesn't require understanding the behavior of the gradient on the entire landscape of the objective function; instead, one only needs to understand the update direction (which is under the algorithm designer's control) at the sequence of points actually encountered while running the algorithm. This slight change of perspective may be powerful.
3.1.3 Limitation and Relation to Part II
We would like to note that despite the flexibility of the framework argued above, its power is limited to the analysis of "local" convergence — namely, it can only handle convergence from a reasonably good initialization. The fundamental reason is that the framework requires specifying a target solution z∗ and showing that the iterates approach z∗. Recall that non-convex functions often have multiple local minima, or even multiple approximate global minima with the same function values due to symmetry. Therefore, with random or arbitrary initialization, it is unclear which global minimum the iterates will converge to, even if we believe they will converge to one of them. The minimum requirement for applying this framework (and many of its variants) is that the initialization can serve as a tie-breaker between equivalent global minima, in the sense that the initialization guarantees convergence to a specific target solution — which is of course unknown to the algorithm, but should be identifiable in the analysis.
Towards going beyond this limitation, in Part II we extend these local convergence techniques so as to prove convergence from random or arbitrary initialization for machine learning problems such as matrix completion (Chapter 5) and learning linear dynamical systems (Chapter 6).
Finally, we note that the analysis of the problems in Part II also depends on the machinery developed in this chapter: in the analysis of matrix completion in Chapter 5, the local convergence result requires special treatment that uses the framework in this section. For learning linear dynamical systems in Chapter 6, we essentially use an over-parameterization technique to reduce the problem to situations that can be tackled by the analysis framework in this section.
3.2 Analyzing Alternating Minimization Algorithms for Sparse Coding
In this section we design and analyze several alternating minimization algorithms for sparse coding. Section 3.2.1 recalls the energy function defined in Section 2.1.1 and introduces the generic alternating minimization algorithm for solving it. Section 3.2.2 highlights the approach for applying the tools of Section 3.1 to the alternating minimization setting. Section 3.2.3 states the detailed algorithms and main results, and outlines the proof techniques. Sections 3.3–3.6 give the detailed proofs.
3.2.1 Alternating Minimization for Sparse Coding
We inherit most of the notations and setup for sparse coding from Section 2.1, and
we refer the readers to Section 2.1.1 for the motivation of the sparse coding problem.
We recall the non-convex objective proposed by Olshausen and Field [135]:

E(A, x(1), . . . , x(N)) = ∑_{i=1}^N ‖y(i) − A·x(i)‖₂² + ∑_{i=1}^N ρ(x(i)) ,   (3.2.1)
where ρ(·) is a nonlinear penalty function used to encourage sparsity. This function is non-convex because both A and the x(i)'s are unknown. Surprisingly, various local search algorithms on the energy function (2.1.1) work very well, as do related algorithms such as MOD [4] and k-SVD [62] on related objective functions with hard constraints. In fact, these methods are so effective that sparse coding is considered in practice to be a solved problem, even though it has no polynomial time algorithm per se.
In this chapter, the generic scheme we will be interested in is given in Algorithm 3; our analysis can be extended to k-SVD without many changes as well.
It is a heuristic for minimizing the non-convex function in (3.2.1) where the penalty function is a hard constraint. Towards describing our algorithm, it will be helpful to denote by X the shorthand for [x(1), . . . , x(N)] and to consider the objective without the sparsity penalty,

E(A, X) = ∑_{i=1}^N ‖y(i) − A·x(i)‖₂² .   (3.2.2)
Algorithm 3 alternates between updating the estimates A and X. The crucial step is that if we fix X and compute the gradient of E(A, X) with respect to A, we get

∇_A E(A, X) = ∑_{i=1}^N −2(y(i) − A·x(i))(x(i))^⊤ .   (3.2.3)

We then take a step in the opposite direction to update A. Here and throughout the chapter, η is the learning rate, and needs to be set appropriately.
Algorithm 3 Generic Alternating Minimization Approach
Given: initializer A^0 ∈ R^{d×r}
Repeat for s = 0, 1, . . . , T:
  Decode: find a sparse solution to A^s x(i) = y(i) for i = 1, 2, . . . , N
  Set X^s such that its columns are the x(i) for i = 1, 2, . . . , N
  Update: A^{s+1} = A^s − ηg^s, where g^s is the gradient of E(A^s, X^s) with respect to A^s
3.2.2 Applying the Framework to Analyzing Alternating Minimization
Recall that in the framework proposed in Section 3.1, we analyze algorithms of the type z_{s+1} = z_s − ηg_s and measure the progress of the algorithm by a simple Lyapunov function. Here we are dealing with an alternating update algorithm, and we cast it into our framework as follows.
We view Algorithm 3 as trying to minimize an unknown convex function, specifically f(A) = E(A, X∗), which is strictly convex and hence has a unique optimum that can be reached via gradient descent. Here X∗ is shorthand for the collection of the unknown coefficient vectors x∗(1), . . . , x∗(N). This function is unknown since the algorithm does not know X∗.
Although f is unknown, the analysis will show (directly) that the direction of movement is correlated with A∗ − A^s. The setup is reminiscent of stochastic gradient descent, which moves in a direction whose expectation is the gradient of a known convex function. By contrast, here the function f(·) is unknown; furthermore, the expectation of g^s is not the true gradient and has a bias, caused by the error in the current iterate A^s and the inexactness of the decoding algorithm (see the paragraph below for more explanation of the decoding algorithm). Due to this bias, we will only be able to prove that our algorithms reach an approximate optimum, up to an error whose magnitude is determined by the bias.
Choice of decoding algorithm: How should the algorithm update X? The usual approach is to solve a sparse recovery problem with respect to the current code matrix A. However, many of the standard basis pursuit algorithms (such as solving a linear program with an ℓ₁ penalty; see [61] and the references therein) are difficult to analyze when there is error in the dictionary itself. This is in part because the solution does not have a closed form in terms of the dictionary matrix. Instead, we take a much simpler approach to the sparse recovery problem, which uses matrix-vector multiplication followed by thresholding: we set x = th_{C/2}((A^s)^⊤ y), where th_{C/2}(·) keeps only the coordinates whose magnitude is at least C/2 and zeros out the rest. Recall that the non-zero coordinates of x∗ have magnitude at least C by Assumption 2.1.3. It turns out that this approximate decoding algorithm suffices for approximate convergence, even though it introduces an additional bias. We will remove the bias using a more complicated decoding algorithm (see Algorithm 6 in Section 3.2.3).
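A minimal sketch of this thresholding decoder follows; the toy orthonormal dictionary in the usage example is an illustrative assumption, chosen so that decoding is exact:

```python
import numpy as np

def threshold_decode(A, y, C):
    """x = th_{C/2}(A^T y): keep the coordinates of A^T y whose magnitude
    is at least C/2, and zero out the rest."""
    z = A.T @ y
    return np.where(np.abs(z) >= C / 2.0, z, 0.0)

# Toy check with a perfectly incoherent (orthonormal) dictionary:
d = 8
A_star = np.eye(d)                       # hypothetical ground-truth dictionary
x_star = np.zeros(d)
x_star[[1, 4]] = [1.5, -2.0]             # sparse code, nonzeros of magnitude >= C
y = A_star @ x_star                      # observed sample
x = threshold_decode(A_star, y, C=1.0)   # recovers x_star exactly here
```

Note that the decoder is a single matrix-vector product plus an entrywise threshold, so unlike basis pursuit it has a closed form in terms of the dictionary, which is what makes it amenable to the analysis.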
Remark: Balakrishnan et al. [19] propose a similar framework for analyzing EM algorithms for hidden variable models. The difference is that their condition is really about the geometry of the objective function, whereas ours is about a property of the direction of movement. We therefore have the flexibility to choose different decoding procedures. This flexibility allows us to have a closed form for X^s and to obtain a useful functional form for g^s.
3.2.3 Algorithms and Main Results
We first recall that we use column-wise Euclidean distance after permutation to measure the closeness between a solution A and the ground truth A∗.

Definition 2.1.5. A is δ-close to A∗ if there is a permutation π : [m] → [m] and a choice of signs σ : [m] → {±1} such that ‖σ(i)A_{π(i)} − A∗_i‖ ≤ δ for all i.
This is a natural measure to use, since we can only hope to learn the columns of A∗ up to relabeling and sign-flips. In our analysis, we will assume throughout that π(·) is the identity permutation and σ(·) ≡ +1, because our family of generative models is invariant under this relabeling and the assumption simplifies our notation.
For the purpose of analysis, we will use a slightly stronger measure of closeness, where we additionally require A − A∗ to have bounded spectral norm.

Definition 3.2.1. We say A is (δ, κ)-near to A∗ if A is δ-close to A∗ in the sense of Definition 2.1.5, and in addition ‖A − A∗‖ ≤ κ‖A∗‖.
As alluded to before, our simplest algorithm is the alternating update algorithm with a thresholding decoder. We also analyze two variants of the Olshausen-Field update rule. We first start with another simplification, where in the update rule we use (y(j) − A^s x(j)) sign(x(j))^⊤ instead of (y(j) − A^s x(j))(x(j))^⊤ in the gradient computation (3.2.3). This simplifies the analysis but does not change its essence.
Algorithm 4 Neurally Plausible Update Rule
Initialize A^0 that is (δ_0, 2)-near to A∗
Repeat for s = 0, 1, . . . , T:
  Sample: let y(1), . . . , y(N) be a set of N fresh examples.
  Decode: x(j) = th_{C/2}((A^s)^⊤ y(j)) for all j ∈ [N]
  Update:
    A^{s+1} = A^s − ηg^s ,   (3.2.4)
  where
    g^s = (1/N) ∑_{j=1}^N (y(j) − A^s x(j)) sign(x(j))^⊤ .   (3.2.5)
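As a sanity check on the decode/update mechanics in (3.2.4)–(3.2.5), the sketch below runs one step on a toy instance where A^s already equals an orthonormal ground truth A∗ (an illustrative assumption): decoding is then exact, the residual y − A^s x vanishes, and the update direction g^s is zero, i.e., the ground truth is a fixed point of the iteration.

```python
import numpy as np

def decode(A, Y, C):
    """Columnwise thresholding decode: X = th_{C/2}(A^T Y)."""
    Z = A.T @ Y
    return np.where(np.abs(Z) >= C / 2.0, Z, 0.0)

def update_direction(A, Y, X):
    """g^s = (1/N) * sum_j (y^(j) - A x^(j)) sign(x^(j))^T, as in (3.2.5)."""
    N = Y.shape[1]
    return (Y - A @ X) @ np.sign(X).T / N

# Hypothetical toy instance: orthonormal ground truth, samples y = A* x*.
rng = np.random.default_rng(0)
d = r = 8
N, C = 50, 1.0
A_star = np.eye(d)
X_star = np.zeros((r, N))
for j in range(N):
    support = rng.choice(r, size=2, replace=False)      # k = 2 sparse code
    X_star[support, j] = rng.choice([-2.0, 2.0], size=2)
Y = A_star @ X_star

X = decode(A_star, Y, C)             # exact here: X equals X_star
g = update_direction(A_star, Y, X)   # zero residual => zero update direction
```

The interesting regime, analyzed below, is of course when A^s is only (δ_0, 2)-near to A∗, where the decode is still correct on the support but the update direction carries a small bias.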
Theorem 3.2.2. Suppose Assumptions 2.1.1, 2.1.2, 2.1.3, and 2.1.4 hold, A^0 is (2δ, 2)-near to A∗, η = Θ(r/k), and each update step in Algorithm 4 uses N = Ω(mk) fresh samples. Then

E[‖A^s_i − A∗_i‖²] ≤ (1 − τ)^s ‖A^0_i − A∗_i‖² + O(k/d)

for some absolute constant 0 < τ < 1 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the column-wise error is O(√(k/d)).
Revisiting Olshausen-Field
In this subsection we analyze a variant of the Olshausen-Field update rule. Due to some technical difficulties that will become clearer later, we will also need to make a (slightly) stronger assumption on the distributional model for the support S = supp(x∗).
Algorithm 5 Olshausen-Field Update Rule
Initialize A^0 that is (δ_0, 2)-near to A∗
Repeat for s = 0, 1, . . . , T:
  Decode: x = th_{C/2}((A^s)^⊤ y) for each sample y
  Update: A^{s+1} = A^s − ηg^s, where g^s = E[(y − A^s x) x^⊤]
  Project: A^{s+1} = Proj_B A^{s+1} (where B is defined in Definition 3.6.3)
Assumption 3.2.3. For any distinct indices i, j, k ∈ [r], we have Pr[i, j, k ∈ supp(x∗)] = O(k³/r³). For simplicity we also denote Pr[i, j, k ∈ supp(x∗)] by q_{ijk}.

Under this assumption, the variant of the Olshausen-Field update rule in Algorithm 5 converges approximately to the ground truth.
Theorem 3.2.4. Suppose Assumptions 2.1.1, 3.2.3, 2.1.3, and 2.1.4 hold, A^0 is (2δ, 2)-near to A∗, and η = Θ(r/k). Then Algorithm 5 satisfies, at each step s,

‖A^s − A∗‖²_F ≤ (1 − τ)^s ‖A^0 − A∗‖²_F + O(rk²/d²)

for some 0 < τ < 1/2 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the error in Frobenius norm is O(√r · k/d).
We defer the proof of this theorem to Section 3.6.1. Currently it uses a projection step (via convex programming) that may not be needed in practice, but that the proof requires.
Removing the Systematic Error
In this subsection, we design a new update rule that converges geometrically until the column-wise error is d^{−Ω(1)}. The basic idea is to engineer a new decoding matrix (instead of using (A^s)^⊤) that projects out the components along the column currently being updated. This has the effect of removing a certain bias that occurs in the previous update rules, Algorithm 4 and Algorithm 5.
Algorithm 6 Unbiased Update Rule
Initialize A^0 that is (δ_0, 2)-near to A∗
Repeat for s = 0, 1, . . . , T:
  Decode: x = th_{C/2}((A^s)^⊤ y) for each sample y;
          x_i = th_{C/2}((B^{(s,i)})^⊤ y) for each sample y and each i ∈ [m]
  Update: A^{s+1}_i = A^s_i − ηg^s_i, where g^s_i = E[(y − B^{(s,i)} x_i) sign(x)_i] for each i ∈ [m]
We use B^{(s,i)} to denote the decoding matrix used when updating the ith column in the sth step. We set B^{(s,i)}_i = A_i and B^{(s,i)}_j = Proj_{A_i^⊥} A_j for j ≠ i. Note that B^{(s,i)}_{−i} (i.e., B^{(s,i)} with the ith column removed) is now orthogonal to A_i. We will rely on this fact when we bound the error.
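The construction of B^{(s,i)} can be sketched as follows; the random matrix A below is a hypothetical stand-in for the current iterate, and the normalization by ‖A_i‖² makes the projection exact even when the columns are not unit-norm:

```python
import numpy as np

def decoding_matrix(A, i):
    """B^(s,i): column i is A_i; every other column j is Proj_{A_i^perp} A_j,
    the projection of A_j onto the orthogonal complement of A_i."""
    Ai = A[:, i]
    B = A - np.outer(Ai, Ai @ A) / (Ai @ Ai)   # subtract the component along A_i
    B[:, i] = Ai                               # restore the i-th column itself
    return B

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))      # hypothetical current iterate
i = 1
B = decoding_matrix(A, i)
B_minus_i = np.delete(B, i, axis=1)  # B^(s,i) with the i-th column removed
```

By construction, B^{(s,i)}_{−i}^⊤ A_i = 0, which is exactly the orthogonality fact used in the error bound.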
Theorem 3.2.5. Suppose that A^0 is (2δ, 2)-near to A∗ and that η = Θ(r/k). Then Algorithm 6 satisfies, at each step s,

‖A^s_i − A∗_i‖² ≤ (1 − τ)^s ‖A^0_i − A∗_i‖² + d^{−Ω(1)}

for some 0 < τ < 1/2 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the column-wise error is d^{−Ω(1)}.
Outline of the remaining sections: Section 3.3 establishes a useful property of the decoding algorithm, namely, that it recovers the correct support of the hidden coefficients x∗. Section 3.4 analyzes the infinite-sample version of Algorithm 4, which sheds light on the finite sample complexity bounds in Section 3.5. Section 3.6 extends the analysis to Algorithms 5 and 6.
3.3 Support Recovery Guarantees of Decoding
In this section, we analyze the properties of the simple thresholding method used in the decoding steps of Algorithms 4, 5, and 6. We will show that it recovers the support of each sample with high probability (over the randomness of x∗). This corresponds to the fact that sparse recovery for incoherent dictionaries is much easier when the non-zero coefficients do not take on a wide range of values; in particular, one does not need iterative pursuit algorithms in this case. It is an ingredient in analyzing all of the update rules we consider in this chapter.
Proposition 3.3.1. Under Assumptions 2.1.1, 2.1.2, 2.1.3, and 2.1.4, suppose that A is δ-close to A∗ for δ ≤ c/log d, where c is a sufficiently small absolute constant. Then with high probability over the choice of the random sample y = A∗x∗, we have

sign(th_{C/2}(A^⊤ y)) = sign(x∗) .
Towards seeing the key intuition behind Proposition 3.3.1, recall that y = A∗x∗ and consider

〈A_i, y〉 = 〈A_i, A∗_i〉 x∗_i + Z_i ,   (3.3.1)

where Z_i is defined as

Z_i = ∑_{j ≠ i} 〈A_i, A∗_j〉 x∗_j .   (3.3.2)

Note that Z_i is a mean-zero random variable which measures the contribution of the cross terms. By the assumption that A is δ-close to A∗, we have |〈A_i, A∗_i〉| ≥ (1 − δ²/2), and therefore |〈A_i, A∗_i〉 x∗_i| is either larger than (1 − δ²/2)C or equal to zero, depending on whether or not i ∈ S. Our main goal is to show that the random variable Z_i is much smaller than C with high probability; this follows from standard concentration bounds, as shown in the formal proof below.
Proof. Recall equation (3.3.1) and let Z_i be defined as in equation (3.3.2). The random variable Z_i has two sources of randomness: the support S of x∗, and the random values of x∗ conditioned on the support S. We prove a stronger statement that only requires the second source of randomness: namely, even conditioned on the support S, with high probability S = {i : |〈A_i, y〉| > C/2} and sign(〈A_i, y〉) = sign(x∗).
We remark that Z_i is a sum of independent subgaussian random variables. We control the variance of Z_i:

Var(Z_i) = ∑_{j ∈ S\{i}} 〈A_i, A∗_j〉² .   (3.3.3)

Next we bound each summand in the equation above:

〈A_i, A∗_j〉² = (〈A∗_i, A∗_j〉 + 〈A_i − A∗_i, A∗_j〉)²
            ≤ 2(〈A∗_i, A∗_j〉² + 〈A_i − A∗_i, A∗_j〉²)   (by the Cauchy-Schwarz inequality)
            ≤ 2µ²/d + 2〈A_i − A∗_i, A∗_j〉² .   (by incoherence, Assumption 2.1.1)

It follows that

Var(Z_i) ≤ 2µ²k/d + 2 ∑_{j ∈ S\{i}} 〈A_i − A∗_i, A∗_j〉²   (3.3.4)
         = 2µ²k/d + 2‖(A∗_{S\{i}})^⊤ (A_i − A∗_i)‖²   (3.3.5)
         ≲ c/log d .   (by ‖A∗_{S\{i}}‖ ≤ 2 from Lemma 2.1.10, and k ≤ √d/(µ log d))
Hence Z_i is a subgaussian random variable with variance at most O(c/log d). For a sufficiently small absolute constant c, we conclude that with high probability |Z_i| ≤ λ_δ(A)/4. Finally, we take a union bound over all indices i ∈ [m] and obtain that |Z_i| ≤ λ_δ(A)/4 for all i. By equation (3.3.1) and the fact that |〈A_i, A∗_i〉| ≥ (1 − δ²/2), we have

|〈A_i, y〉 − x∗_i| ≤ λ_δ(A)/4 .   (3.3.6)

Recall that by Assumption 2.1.3 we have |x∗_i| ≥ λ_δ(A) whenever x∗_i ≠ 0. This and equation (3.3.6) complete the proof.
3.4 Analysis Overview: Infinite Samples Setting
In this section, as a warm-up, we assume that each iteration of Algorithm 4 uses an infinite number of samples, and prove the corresponding simplified version of Theorem 3.2.2. The proof of this simplified theorem highlights the essential ideas behind the proof of Theorem 3.2.2, which can be found in Section 3.5.

Theorem 3.4.1. In the setting of Theorem 3.2.2, suppose Algorithm 4 has access to an infinite number of examples; namely, suppose Algorithm 4 uses ḡ^s_i (defined in equation (3.4.1) below) instead of g^s_i. Then the conclusion of Theorem 3.2.2 holds:

E[‖A^s_i − A∗_i‖²] ≤ (1 − τ)^s ‖A^0_i − A∗_i‖² + O(k/d)

for some absolute constant 0 < τ < 1 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the column-wise error is O(√(k/d)).
We define ḡ^s to be the expectation of g^s:

ḡ^s := E[g^s] = E[(y − A^s x) sign(x)^⊤] ,   (3.4.1)

where x := th_{C/2}((A^s)^⊤ y) is the decoding of y. The infinite sample case essentially means we use ḡ^s in the update equation (3.2.4) instead of g^s.
The proof of Theorem 3.4.1 applies our framework (Theorem 3.1.2) inductively. The first step (Proposition 3.4.2 in Section 3.4.1) is to show that ḡ^s meets the condition of Theorem 3.1.2, that is, ḡ^s is (α, β, ε)-correlated with the target solution A∗. However, this step requires two conditions on the current iterate A^s: a) A^s is already reasonably close to A∗, and b) a technical condition that the spectral norm of A^s − A∗ is bounded by 2‖A∗‖.

The first condition holds naturally under the inductive hypothesis, whereas proving condition b) requires an additional step in Section 3.4.2: we show in Proposition 3.4.5 that condition b) also holds at every step. We put everything together and prove Theorem 3.4.1 at the end of this section.
3.4.1 Making Progress at Each Iteration
Recall that, as defined in Assumption 2.1.3, we have q_i = Pr[x∗_i ≠ 0] and q_{i,j} = Pr[x∗_i x∗_j ≠ 0], and define in addition p_i = E[x∗_i sign(x∗_i) | x∗_i ≠ 0]. Recall that A∗_{−i} denotes the matrix obtained by deleting the ith column of A∗. The following proposition is the main step in our analysis.
Proposition 3.4.2. In the setting of Theorem 3.4.1, if at iteration s the iterate A^s is (2δ, 2)-near to A∗, then the direction ḡ^s_i is (α, β, ε)-correlated with A∗_i, where α = Ω(k/r), β = Ω(r/k), and ε = O(k³/(rd²)).
Furthermore, we make the following progress in terms of the distance to the ground truth:

‖A^{s+1}_i − A∗_i‖² ≤ (1 − 2αη)‖A^s_i − A∗_i‖² + O(ηk²/d²) .   (3.4.2)
Towards proving Proposition 3.4.2, we first use the properties of the generative model to derive a new formula for ḡ^s that is more amenable to analysis.

Lemma 3.4.3. In the setting of Proposition 3.4.2, we have

ḡ^s_i = p_i q_i (λ^s_i A^s_i − A∗_i + ε^s_i ± γ^s_i)   (3.4.3)

with λ^s_i, γ^s_i, ε^s_i satisfying

λ^s_i = 〈A^s_i, A∗_i〉 ,   ‖γ^s_i‖ ≤ 1/d^{Ω(1)} ,
ε^s_i = (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i / q_i ,   ‖ε^s_i‖ ≤ O(k/d) .

Note that p_i q_i is a scaling constant and λ^s_i ≈ 1; hence from formula (3.4.3) we should expect that ḡ^s_i is correlated with A^s_i − A∗_i. (The exact amount of correlation will be bounded using the lemmas below.)
Lemma 3.4.3 also turns out to be useful for showing that ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖ (Proposition 3.4.5). We will also be able to reuse much of this analysis in Lemma 3.6.1 and Lemma 3.6.6, because the formula is derived for a general decoding matrix B in the proof below.
Proof. Since A^s is (2δ, 2)-near to A∗, A^s is 2δ-close to A∗. We can therefore invoke Proposition 3.3.1 and conclude that with high probability sign(x∗) = sign(x). Let F_{x∗} be the event that sign(x∗) = sign(x), let 1_{F_{x∗}} be the indicator function of this event, and let 1_{F̄_{x∗}} be the indicator of its complement.

To avoid an overwhelming number of superscripts, let B = A^s throughout this proof. Here and in the rest of the proof, γ^s_i denotes any vector whose norm is negligible (i.e., smaller than 1/d^C for any large constant C > 1).

We can write ḡ^s_i = E[(y − Bx) sign(x_i)]. Using the fact that 1_{F_{x∗}} + 1_{F̄_{x∗}} = 1 and that F_{x∗} happens with very high probability,

ḡ^s_i = E[(y − Bx) sign(x_i) 1_{F_{x∗}}] + E[(y − Bx) sign(x_i) 1_{F̄_{x∗}}]
      = E[(y − Bx) sign(x_i) 1_{F_{x∗}}] ± γ^s_i .   (3.4.4)
The key point is that this allows us to essentially replace sign(x) with sign(x∗). Moreover, let S = supp(x∗). Note that when F_{x∗} happens, S is also the support of x. Recall that according to the decoding rule (with A^s replaced by B for notational simplicity), x = th_{C/2}(B^⊤ y). Therefore x_S = (B^⊤y)_S = B_S^⊤ y = B_S^⊤ A∗ x∗. Using again the fact that the support of x is S, we have Bx = B_S B_S^⊤ A∗ x∗. Plugging this into equation (3.4.4):

ḡ^s_i = E[(y − Bx) sign(x_i) 1_{F_{x∗}}] ± γ^s_i = E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i) 1_{F_{x∗}}] ± γ^s_i
      = E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i)] − E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i) 1_{F̄_{x∗}}] ± γ^s_i
      = E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i)] ± γ^s_i ,
where again we have used the fact that F_{x∗} happens with very high probability. Now we rewrite the expectation above using subconditioning, where we first choose the support S of x∗ and then choose the nonzero values x∗_S:
E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i)] = E_S[ E_{x∗_S}[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i) | S] ] = E[p_i (I − B_S B_S^⊤) A∗_i] ,

where we use the fact that E[x∗_i · sign(x∗_i) | S] = p_i. Let R = S \ {i}. Using the fact that B_S B_S^⊤ = B_i B_i^⊤ + B_R B_R^⊤, we can split the quantity above into two parts:

ḡ^s_i = p_i E[(I − B_i B_i^⊤) A∗_i] + p_i E[B_R B_R^⊤] A∗_i
      = p_i q_i (I − B_i B_i^⊤) A∗_i + p_i (B_{−i} diag(q_{i,j}) B_{−i}^⊤) A∗_i ± γ^s_i ,

where diag(q_{i,j}) is the r × r diagonal matrix whose (j, j)-th entry is equal to q_{i,j}, and B_{−i} is the matrix obtained by zeroing out the ith column of B. Here we used the facts that Pr[i ∈ S] = q_i and Pr[i, j ∈ S] = q_{i,j}.
Now we set B = A^s, and rearranging the terms, we have

ḡ^s_i = p_i q_i (〈A^s_i, A∗_i〉 A^s_i − A∗_i + ε^s_i ± γ^s_i) ,

where ε^s_i = (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i / q_i, which can be bounded as follows:

‖ε^s_i‖ ≤ ‖A^s_{−i}‖² max_{j≠i} q_{i,j}/q_i ≤ O(k/r)‖A^s‖² = O(k/d) ,

where the last step uses the fact that max_{i≠j} q_{i,j} / min_i q_i ≤ O(k/r), which is an assumption of our generative model.
Lemma 3.4.4. If a vector z is equal to 4α(A^s_i − A∗_i) + v where ‖v‖ ≤ α‖A^s_i − A∗_i‖ + ζ, then z is (α, 1/(100α), ζ²/α)-correlated with A∗_i.
Proof. Throughout this proof s is fixed, so we omit the superscript s to simplify notation. By assumption, z already has a component pointing in the correct direction A_i − A∗_i; we only need to show that the norm of the extra term v is small enough. First we bound the norm of z by the triangle inequality:

‖z‖ ≤ ‖4α(A_i − A∗_i)‖ + ‖v‖ ≤ 5α‖A_i − A∗_i‖ + ζ ,   (3.4.5)

and therefore ‖z‖² ≤ 50α²‖A_i − A∗_i‖² + 2ζ². Similarly, we can bound the inner product between z and A_i − A∗_i by

〈z, A_i − A∗_i〉 = 〈4α(A_i − A∗_i) + v, A_i − A∗_i〉 ≥ 4α‖A_i − A∗_i‖² − ‖v‖‖A_i − A∗_i‖ .   (3.4.6)
Here the last inequality follows from Cauchy-Schwarz. Now we are ready to prove the target inequality by combining equations (3.4.5) and (3.4.6):

〈z, A_i − A∗_i〉 − α‖A_i − A∗_i‖² − (1/(100α))‖z‖² + ζ²/α
≥ 4α‖A_i − A∗_i‖² − ‖v‖‖A_i − A∗_i‖ − α‖A_i − A∗_i‖² − (1/(100α))‖z‖² + ζ²/α   (by equation (3.4.6))
≥ 3α‖A_i − A∗_i‖² − (α‖A_i − A∗_i‖ + ζ)‖A_i − A∗_i‖ − (1/(100α))(50α²‖A_i − A∗_i‖² + 2ζ²) + ζ²/α   (by equation (3.4.5))
≥ α‖A_i − A∗_i‖² − ζ‖A_i − A∗_i‖ + ζ²/(4α)
= (√α‖A_i − A∗_i‖ − ζ/(2√α))² ≥ 0 .

This completes the proof of the lemma.
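The inequality asserted by Lemma 3.4.4 can also be spot-checked numerically. In the sketch below, the dimension and the values of α and ζ are hypothetical; the vector diff plays the role of A_i − A∗_i, and v is scaled to sit exactly at the cap ‖v‖ = α‖A_i − A∗_i‖ + ζ:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, zeta = 0.1, 0.05                   # hypothetical parameters
diff = rng.standard_normal(5)             # plays the role of A_i - A*_i
v = rng.standard_normal(5)
v *= (alpha * np.linalg.norm(diff) + zeta) / np.linalg.norm(v)
z = 4 * alpha * diff + v                  # z as in the lemma's hypothesis

# (alpha, 1/(100*alpha), zeta^2/alpha)-correlation with A*_i:
lhs = z @ diff                            # <z, A_i - A*_i>
rhs = alpha * (diff @ diff) + (1.0 / (100 * alpha)) * (z @ z) - zeta**2 / alpha
```

By the lemma, lhs ≥ rhs holds for every admissible v, so the check passes regardless of the random draw.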
We are now ready to combine the lemmas above and prove Proposition 3.4.2.

Proof of Proposition 3.4.2. We first use the form in Lemma 3.4.3, ḡ^s_i = p_i q_i(λ_i A^s_i − A∗_i + ε^s_i + γ^s_i), where λ_i = 〈A^s_i, A∗_i〉. We can write ḡ^s_i = p_i q_i(A^s_i − A∗_i) + p_i q_i((λ_i − 1)A^s_i + ε^s_i + γ^s_i). Now we apply Lemma 3.4.4 with z = ḡ^s_i, 4α = p_i q_i = Θ(k/r), and v = p_i q_i((λ_i − 1)A^s_i + ε^s_i + γ^s_i). The norm of v can be bounded in two parts: the first term p_i q_i(λ_i − 1)A^s_i has norm p_i q_i(1 − λ_i), which is at most α‖A^s_i − A∗_i‖; and the remaining terms have norm bounded by ζ = O(k²/(rd)).

With these parameter choices, Lemma 3.4.4 implies that ḡ^s_i is (Ω(k/r), Ω(r/k), O(k³/(rd²)))-correlated with A∗_i. Equation (3.4.2) in the proposition then follows from applying our analysis framework (Theorem 3.1.2).
3.4.2 Maintaining Spectral Norm
In this section we show that the spectral norm bound ‖A^s − A∗‖ ≤ 2‖A∗‖ is preserved after every iteration.

Proposition 3.4.5. In the setting of Theorem 3.4.1, suppose that A^s is (2δ, 2)-near to A∗. Then ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖.

This proposition is expected: if the algorithm indeed drives the iterate towards A∗, then ‖A^s − A∗‖ should also be roughly decreasing as s increases. Moreover, here we only require that ‖A^s − A∗‖ never exceed 2‖A∗‖. In the proof we again invoke Lemma 3.4.3 to obtain a functional form for A^{s+1}_i − A∗_i in terms of the matrices A∗ and A^s, and then use various linear-algebraic inequalities to bound its spectral norm.
Proof. Again, we will make crucial use of Lemma 3.4.3. Substituting and rearranging terms, we have

A^{s+1}_i − A∗_i = A^s_i − A∗_i − ηḡ^s_i
               = (1 − ηp_i q_i)(A^s_i − A∗_i) + ηp_i q_i(1 − λ^s_i)A^s_i − ηp_i (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i ± γ^s_i .

Our first task is to write this equation in a more convenient form. In particular, let U and V be the matrices whose columns are U_i = p_i q_i(1 − λ^s_i)A^s_i and V_i = p_i (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i. Then we can rewrite the above equation as

A^{s+1} − A∗ = (A^s − A∗) diag(1 − ηp_i q_i) + ηU − ηV ± γ^s ,   (3.4.7)

where diag(1 − ηp_i q_i) is the r × r diagonal matrix whose entries along the diagonal are 1 − ηp_i q_i.
We bound the spectral norm of A^{s+1} − A∗ by bounding the spectral norm of each of the matrices on the right-hand side of (3.4.7). The first term is straightforward:

‖(A^s − A∗) diag(1 − ηp_i q_i)‖ ≤ ‖A^s − A∗‖ · (1 − η min_i p_i q_i) ≤ 2(1 − Ω(ηk/r))‖A∗‖ ,   (3.4.8)

where the last inequality uses the assumptions that p_i = Θ(1) and q_i = Θ(k/r), and that ‖A^s − A∗‖ ≤ 2‖A∗‖.

Regarding the term ηU in equation (3.4.7), by definition U = A^s diag(p_i q_i(1 − λ^s_i)). It follows that

‖U‖ ≤ δ max_i p_i q_i ‖A^s‖ = o(k/r) · ‖A∗‖ ,

where we have used the facts that λ^s_i ≥ 1 − δ with δ = o(1), and that ‖A^s‖ ≤ ‖A^s − A∗‖ + ‖A∗‖ = O(‖A∗‖).
It remains to bound the third term. We first introduce an auxiliary matrix Q, defined as follows: Q_{ii} = 0 and Q_{j,i} = q_{i,j}〈A^s_j, A∗_i〉 for j ≠ i. It is straightforward to verify the following claim:

Claim 3.4.6. The ith column of A^s Q is equal to (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i.

Recall the definition p_i = E[x∗_i sign(x∗_i) | x∗_i ≠ 0]. Therefore we can write V = A^s Q diag(p_i). We bound the spectral norm of Q from above by its Frobenius norm:

‖Q‖_F ≤ (max_{i≠j} q_{i,j}) · ( ∑_{i≠j} 〈A^s_j, A∗_i〉² )^{1/2} = O(k²/r²)‖A∗^⊤ A^s‖_F .   (3.4.9)

Moreover, since A∗^⊤ A^s is an r × r matrix, its Frobenius norm can be at most a √r factor larger than its spectral norm; that is, ‖A∗^⊤ A^s‖_F ≤ √r ‖A∗^⊤ A^s‖ ≤ √r ‖A∗‖‖A^s‖.
Hence we have

‖V‖ ≤ (max_i p_i) ‖A^s‖‖Q‖   (by V = A^s Q diag(p_i))
    ≤ O(k²/r²) ‖A^s‖‖A∗^⊤ A^s‖_F   (by equation (3.4.9) and p_i ≲ 1, Assumption 2.1.3)
    ≤ O(k²√r/r²) ‖A^s‖²‖A∗‖   (by ‖A∗^⊤ A^s‖_F ≤ √r‖A∗‖‖A^s‖)
    ≤ o(k/r) ‖A∗‖ .   (by ‖A^s‖ ≲ ‖A∗‖ and the choice of parameters in Assumption 2.1.4)

Finally, putting all the pieces together, we have

‖A^{s+1} − A∗‖ ≤ ‖(A^s − A∗) diag(1 − ηp_i q_i)‖ + η‖U‖ + η‖V‖ + ‖γ^s‖
             ≤ 2(1 − Ω(ηk/r))‖A∗‖ + o(ηk/r)‖A∗‖ + o(ηk/r)‖A∗‖ + ‖γ^s‖
             ≤ 2(1 − Ω(ηk/r))‖A∗‖   (3.4.10)
             ≤ 2‖A∗‖ ,

and this completes the proof of the proposition.
Proof of Theorem 3.4.1
We close this section with the formal proof of Theorem 3.4.1.

Proof of Theorem 3.4.1. We proceed by induction on s. The inductive hypothesis is that the theorem is true at step s and that A^s is (2δ, 2)-near to A∗. The hypothesis holds for s = 0 by the assumption on the initialization. Now assume the inductive hypothesis holds for some s. Proposition 3.4.2 says that if A^s is (2δ, 2)-near to A∗, which is guaranteed by the inductive hypothesis, then ḡ^s_i is indeed (Ω(k/r), Ω(r/k), O(k³/(rd²)))-correlated with A∗_i. Invoking our analysis framework (Theorem 3.1.2), we have

‖A^{s+1}_i − A∗_i‖² ≤ (1 − τ)‖A^s_i − A∗_i‖² + O(k²/d²) ≤ (1 − τ)^{s+1}‖A^0_i − A∗_i‖² + O(k²/d²) .

Therefore A^{s+1} is also 2δ-close to A∗. We then invoke Proposition 3.4.5 to show that the spectral norm bound ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖ also holds, which completes the induction.
3.5 Sample Complexity
In the previous sections, we analyzed various update rules assuming that the algorithm is given the exact expectation of a matrix-valued random variable. Here we show that these algorithms can just as well use approximations to the expectation, computed by taking a small number of samples.
We focus on analyzing the sample complexity of Algorithm 4, but a similar analysis extends to the other update rules as well.
In order to prove Theorem 3.2.2, we proceed in two steps. First we show that when A^s is (δ_s, 2)-near to A∗, the approximate gradient is (α, β, ε_s)-correlated-whp with the optimal solution A∗, with ε_s ≤ O(k²/rd) + α · o(δ_s²). This allows us to use Theorem 3.1.7 as long as we can guarantee that the spectral norm of A^s − A∗ is small. Next we show an extension of Proposition 3.4.5 which works even with the random approximate gradient; hence the nearness property is preserved during the iterations.

The key idea boils down to establishing various types of concentration of the empirical gradient ĝ^s_i around its expectation g^s_i. For example, the following proposition establishes the concentration in Euclidean distance.
Proposition 3.5.1. In the setting of Theorem 3.2.2, suppose A^s is (2δ, 2)-near to A∗. Then, we have that with high probability,

‖ĝ^s_i − g^s_i‖ ≤ k/r · (o(δ_s) + O(√(k/d))).
The proposition above straightforwardly implies the following finite-sample extension of Proposition 3.4.2.
Corollary 3.5.2 (Finite-sample extension of Proposition 3.4.2). In the setting of Proposition 3.5.1, the direction ĝ^s_i as defined in Algorithm 4 is (α, β, ε_s)-correlated-whp with A∗_i, with α = Ω(k/r), β = Ω(r/k) and ε_s ≤ α · o(δ_s²) + O(k²/rd).
Proof of Corollary 3.5.2 using Proposition 3.5.1. Using Proposition 3.4.3 we can write ĝ^s_i (whp) as ĝ_i = (ĝ_i − g_i) + g_i = 4α(A^s_i − A∗_i) + v with ‖v‖ ≤ α‖A^s_i − A∗_i‖ + O(k/r) · (o(δ_s) + O(√(k/d))). By Proposition 3.4.4, we have that ĝ_i is (Ω(k/r), Ω(r/k), o(r/d · δ_s²) + O(k²/rd))-correlated-whp with A∗_i.
Proposition 3.5.1 and other concentration inequalities also imply a finite-sample extension of Proposition 3.4.5.
Corollary 3.5.3 (Finite-sample extension of Proposition 3.4.5). In the setting of Theorem 3.2.2, suppose A^s is (δ_s, 2)-near to A∗ with δ_s ≤ c/log d for some sufficiently small absolute constant c. Then, with high probability, A^{s+1} satisfies ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖.
Proof. Using equation (3.4.10) in the proof of Proposition 3.4.5, recalling that A^{s+1} corresponds to A^s − ηĝ^s in our case, we have that

‖A^s − ηg^s − A∗‖ ≤ 2(1 − Ω(ηk/r))‖A∗‖.    (3.5.1)

Recall that η is set to be Θ(r/k); using Proposition 3.5.1 we have that ‖ĝ^s − g^s‖ ≤ k/r · (o(δ_s) + O(√(k/d))), so η‖ĝ^s − g^s‖ ≤ o(1). Moreover, since ‖A∗‖²_F = r, we have that ‖A∗‖ ≥ √(r/d).
This implies that

‖A^{s+1} − A∗‖ ≤ ‖A^s − ηg^s − A∗‖ + η‖ĝ^s − g^s‖
   ≤ 2(1 − Ω(ηk/r))‖A∗‖ + o(1)
   ≤ (2 − Ω(1) + o(1))‖A∗‖ ≤ 2‖A∗‖,

which completes the proof.
Using Corollary 3.5.2 and Corollary 3.5.3 we can prove the main Theorem 3.2.2 by
induction on the step s.
Proof of Theorem 3.2.2. Similarly to the proof of Theorem 3.4.1, the theorem follows immediately by induction using Corollary 3.5.2 and Corollary 3.5.3, and then applying Theorem 3.1.7.
In the rest of the subsection, we prove Proposition 3.5.1 using various concentra-
tion inequalities.
Proof of Proposition 3.5.1
We fix an index i and prove the concentration for the column ĝ^s_i. Note that by definition, we have that

ĝ^s_i = (1/N) Σ_{j=1}^N (y^{(j)} − A^s x^{(j)}) sign(x^{(j)}_i).    (3.5.2)

To simplify the notation, we omit the superscript s since it is irrelevant for this proposition, and let Z_i be shorthand for the random variable (y − Ax) sign(x_i) conditioned on the event i ∈ S:

Z_i := (y − Ax) sign(x_i) | i ∈ S.    (3.5.3)
Therefore we see that ĝ^s_i is essentially an average of independent realizations of the random variable Z_i. We will use the Bernstein inequality to prove the concentration. In preparation, we study the absolute bound on Z_i and its variance, which are the essential quantities for applying the Bernstein inequality. We also note that we will use an extension of the Bernstein inequality (Corollary 8.1.4), for which it suffices to have a high-probability uniform upper bound on the norm of Z_i.

We start with a claim that controls ‖Z_i‖ with high probability.
Claim 3.5.4. In the setting of Proposition 3.5.1, let Z_i be defined as in equation (3.5.3). Then, with high probability, we have that

‖Z_i‖ ≲ µk/√d + kδ_s² + δ_s√(k log d).
Proof. We expand y − Ax as

y − Ax = (A∗_S − A_S A_S^⊤ A∗_S)x∗_S = (A∗_S − A_S)x∗_S + A_S(Id − A_S^⊤ A∗_S)x∗_S,    (3.5.4)

and we will bound each of the two terms separately. Since x∗_S has sub-Gaussian entries with variance bounded by O(1), the vector (A∗_S − A_S)x∗_S can be written as a sum of vectors scaled by independent sub-Gaussian entries:

(A∗_S − A_S)x∗_S = Σ_{i∈S} (A∗_i − A_i)x∗_i.    (3.5.5)

Therefore (A∗_S − A_S)x∗_S is also a sub-Gaussian random vector with variance proxy ‖A∗_S − A_S‖² (see Lemma 8.1.7 for this property of sub-Gaussian random variables). Then, we have that with high probability,

‖(A∗_S − A_S)x∗_S‖ ≲ ‖A∗_S − A_S‖_F √(log d).
Since A is δ_s-close to A∗ and |S| ≤ k, we have ‖A∗_S − A_S‖_F ≤ O(δ_s√k), and thus we conclude that

‖(A∗_S − A_S)x∗_S‖ ≲ δ_s√(k log d).    (3.5.6)
Regarding the second summand on the RHS of equation (3.5.4), we have that

‖A_S(A_S^⊤A∗_S − I)‖_F ≤ ‖A_S‖‖A_S^⊤A∗_S − I‖_F    (by the inequality ‖UV‖_F ≤ ‖U‖‖V‖_F)
   ≤ (‖A∗_S‖ + δ_s√k)(‖(A_S − A∗_S)^⊤A∗_S‖_F + ‖A∗_S^⊤A∗_S − I‖_F)    (by ‖A∗_S − A_S‖ ≤ δ_s√k and the triangle inequality)
   ≤ (2 + δ_s√k)(‖A∗_S‖‖A_S − A∗_S‖_F + ‖A∗_S^⊤A∗_S − I‖_F)    (by ‖A∗_S‖ ≤ 2 (Lemma 2.1.10) and the inequality ‖UV‖_F ≤ ‖U‖‖V‖_F)
   ≤ (2 + δ_s√k)(‖A∗_S‖‖A_S − A∗_S‖_F + µk/√d)    (by Lemma 2.1.10)
   ≤ O(µk/√d + δ_s²k + δ_s√k)    (by ‖A∗_S − A_S‖_F ≤ δ_s√k and ‖A∗_S‖ ≤ 2)

Finally, plugging the bound above and equation (3.5.6) into equation (3.5.4), we conclude that

‖(y − Ax) sign(x_i)‖ ≤ O(‖(A∗_S − A_S)x∗_S‖ + ‖A_S(A_S^⊤A∗_S − I)‖_F) ≲ µk/√d + kδ_s² + δ_s√(k log d).
Next we bound from above the variance of the random variable Zi.
Claim 3.5.5. In the setting of Proposition 3.5.1, let Z_i be defined as in equation (3.5.3). We have that

E[‖Z_i‖²] ≲ k²δ_s² + k³/d.
Proof. Again we rewrite y − Ax as y − Ax = (A∗_S − A_S A_S^⊤ A∗_S)x∗_S. Using the fact that x∗_S is conditionally independent of S with E[x∗_S(x∗_S)^⊤] = Id, we obtain that

E[‖Z_i‖²] = E[‖(y − Ax) sign(x_i)‖² | i ∈ S]
   = E[‖(A∗_S − A_S A_S^⊤ A∗_S)x∗_S‖² | i ∈ S]
   = E[‖A∗_S − A_S A_S^⊤ A∗_S‖²_F | i ∈ S]    (by E[x∗_S(x∗_S)^⊤] = Id)

Then again we rewrite A∗_S − A_S A_S^⊤ A∗_S as

A∗_S − A_S A_S^⊤ A∗_S = (A∗_S − A_S) + A_S(Id − A_S^⊤ A∗_S).    (3.5.7)
We bound the Frobenius norms of the two terms on the RHS of the equation above separately. First, since A is δ_s-close to A∗, the matrix A∗_S − A_S has column-wise norm at most δ_s, and therefore ‖A∗_S − A_S‖_F ≤ √k δ_s. Second, note that ‖A_S‖_F ≲ √k because each column of A has norm 1 ± δ_s. Therefore, we have that

E[‖A_S(Id − A_S^⊤A∗_S)‖²_F | i ∈ S] ≲ k E[‖Id − A_S^⊤A∗_S‖²_F | i ∈ S],    (3.5.8)
where we used the inequality ‖UV‖_F ≤ ‖U‖_F‖V‖_F. We divide the k² terms in ‖Id − A_S^⊤A∗_S‖²_F largely according to whether the indices contain i, because this affects the conditional probability. We have that
E[‖Id − A_S^⊤A∗_S‖²_F | i ∈ S]    (3.5.9)
= E[Σ_{j∈S} (1 − ⟨A_j, A∗_j⟩)² | i ∈ S] + E[Σ_{j,ℓ∈S\{i}, j≠ℓ} ⟨A_j, A∗_ℓ⟩² | i ∈ S]    (3.5.10)
  + E[Σ_{j∈S\{i}} ⟨A_j, A∗_i⟩² | i ∈ S] + E[Σ_{j∈S\{i}} ⟨A_i, A∗_j⟩² | i ∈ S]    (3.5.11)
≤ O(kδ_s²) + O(k²/d).    (3.5.12)
Since A is δ_s-close to A∗, we have that (1 − ⟨A_j, A∗_j⟩)² ≤ δ_s² for any j, and therefore

E[Σ_{j∈S} (1 − ⟨A_j, A∗_j⟩)² | i ∈ S] ≤ kδ_s².    (3.5.13)
Using the assumption that Pr[j ∈ S, ℓ ∈ S | i ∈ S] ≲ k²/r², we have

E[Σ_{j,ℓ∈S\{i}, j≠ℓ} ⟨A_j, A∗_ℓ⟩² | i ∈ S] ≲ k²/r² · Σ_{j,ℓ∈[r]\{i}} ⟨A_j, A∗_ℓ⟩²
   ≤ k²/r² · ‖A^⊤A∗‖²_F
   ≤ k²/r² · ‖A∗‖² ‖A‖²_F    (by the inequality ‖UV‖_F ≤ ‖U‖‖V‖_F)
   ≲ k²/r² · r²/d² · d = k²/d    (by Assumption 2.1.1 and ‖A‖²_F ≲ d)
By the assumption Pr[j ∈ S | i ∈ S] ≲ k/r (Assumption 2.1.2), we have that

E[Σ_{j∈S\{i}} ⟨A_j, A∗_i⟩² | i ∈ S] ≲ k/r · ‖A_{−i}^⊤A∗_i‖² ≤ k/r · ‖A_{−i}‖²‖A∗_i‖² ≤ k/r · r/d = k/d

(by the assumption that A is (δ_s, 2)-near to A∗, we have ‖A‖ ≲ ‖A∗‖ ≲ √(r/d)).
We can get a similar bound for E[Σ_{j∈S\{i}} ⟨A_i, A∗_j⟩² | i ∈ S]. Hence, plugging the bounds above into equation (3.5.11), we conclude that

E[‖Id − A_S^⊤A∗_S‖²_F | i ∈ S] ≤ O(kδ_s²) + O(k²/d).

Now by equation (3.5.8) we complete the proof.
Now we are ready to apply the Bernstein inequality to prove Proposition 3.5.1.
Proof of Proposition 3.5.1. We fix the index i in the proof. We omit the superscript s throughout the proof to simplify the notation. Let W_i = {j ∈ [N] : i ∈ supp(x∗^{(j)})} be the set of examples that use dictionary atom i. Recalling the definition of ĝ (equation (3.2.5)), we have that

ĝ_i = (1/N) Σ_{j∈W_i} (y^{(j)} − Ax^{(j)}) sign(x^{(j)}_i).
Note that for every j ∈ W_i, (y^{(j)} − Ax^{(j)}) sign(x^{(j)}_i) has the same distribution as Z_i, and therefore it satisfies the bounds on norm and variance in Claim 3.5.4 and Claim 3.5.5. Then, by the generalized version of the Bernstein inequality (Corollary 8.1.4), we have that with high probability,
‖ĝ_i − g_i‖ ≲ (1/N) · ((µk/√d + kδ_s² + δ_s√(k log d)) log d + √(|W_i|(k²δ_s² + k³/d) log d))
   ≲ (1/N) · ((µk/√d + kδ_s² + δ_s√(k log d)) log d + √(N · k/r · (k²δ_s² + k³/d) log d))
   ≲ k/r · (o(δ_s) + O(√(k/d))),
where the last inequality follows from our choice of parameters (Assumption 2.1.4) and from N ≥ crk log² d for a sufficiently large absolute constant c.
3.6 More Alternating Minimization Algorithms
Here we prove Theorem 3.2.4 and Theorem 3.2.5. Note that in Algorithm 5 and Algorithm 6, for simplicity, we use the expectation of the gradient over the samples instead of the empirical average. We can show that these algorithms would maintain the same guarantees if we used p = Ω(mk) samples to estimate g^s, as we did in Algorithm 4. However, these proofs would require repeating calculations very similar to those we performed in Section 3.5, and so we only claim that these algorithms maintain their guarantees if they use a polynomial number of samples to approximate the expectation.
3.6.1 Analysis of a Variant of Olshausen-Field Update
We give a variant of the Olshausen-Field update rule in Algorithm 5. Our first goal
is to prove that each column of gs is (α, β, ε)-correlated with A∗i . The main step is to
prove an analogue of Proposition 3.4.3 that holds for the new update rule.
Proposition 3.6.1. Suppose that A^s is (2δ, 5)-near to A∗. Then each column of g^s in Algorithm 5 takes the form

g^s_i = q_i(λ^s_i A∗_i − (λ^s_i)² A^s_i + ε^s_i),

where λ^s_i = ⟨A^s_i, A∗_i⟩. Moreover, the norm of ε^s_i can be bounded by ‖ε^s_i‖ ≤ O(k²/rd).
We remark that, unlike the statement of Proposition 3.4.4, here we will not explicitly state the functional form of ε^s_i because we will not need it.
Proof. The proof parallels that of Proposition 3.4.3, although we will use slightly different conditioning arguments as needed. Again, we define F_{x∗} as the event that sign(x∗) = sign(x), and let 1_{F_{x∗}} be the indicator function of this event. We can invoke Proposition 3.3.1 and conclude that this event happens with high probability. Moreover, let F_i be the event that i is in the set S = supp(x∗), and let 1_{F_i} be its indicator function.
When the event F_{x∗} happens, the decoding satisfies x_S = A_S^⊤A∗_S x∗_S and all the other entries are zero. Throughout this proof s is fixed, and so we will omit the superscript s for notational convenience. We can now rewrite g_i as

g_i = E[(y − Ax)x_i] = E[(y − Ax)x_i 1_{F_{x∗}}] + E[(y − Ax)x_i (1 − 1_{F_{x∗}})]
   = E[(I − A_S A_S^⊤) A∗_S x∗_S x∗_S^⊤ A∗_S^⊤ A_i 1_{F_{x∗}} 1_{F_i}] ± γ
   = E[(I − A_S A_S^⊤) A∗_S x∗_S x∗_S^⊤ A∗_S^⊤ A_i 1_{F_i}] ± γ
Once again our strategy is to rewrite the expectation above using subconditioning, where we first choose the support S of x∗, and then choose the nonzero values x∗_S:

g_i = E_S[E_{x∗_S}[(I − A_S A_S^⊤) A∗_S x∗_S x∗_S^⊤ A∗_S^⊤ A_i 1_{F_i} | S]] ± γ
   = E[(I − A_S A_S^⊤) A∗_S A∗_S^⊤ A_i 1_{F_i}] ± γ
   = E[(I − A_i A_i^⊤ − A_R A_R^⊤)(A∗_i A∗_i^⊤ + A∗_R A∗_R^⊤) A_i 1_{F_i}] ± γ
   = E[(I − A_i A_i^⊤) A∗_i A∗_i^⊤ A_i 1_{F_i}] + E[(I − A_i A_i^⊤) A∗_R A∗_R^⊤ A_i 1_{F_i}]
     − E[A_R A_R^⊤ A∗_i A∗_i^⊤ A_i 1_{F_i}] − E[A_R A_R^⊤ A∗_R A∗_R^⊤ A_i 1_{F_i}] ± γ
Next we will compute the expectation of each of the terms on the right-hand side. This part of the proof will be somewhat more involved than the proof of Proposition 3.4.3, because the terms above are quadratic instead of linear. The leading term is equal to q_i(λ_i A∗_i − λ_i² A_i) and the remaining terms contribute to ε_i. The second term is equal to (I − A_i A_i^⊤) A∗_{−i} diag(q_{i,j}) A∗_{−i}^⊤ A_i, which has norm bounded by O(k²/rd). The third term is equal to λ_i A_{−i} diag(q_{i,j}) A_{−i}^⊤ A∗_i, which again has norm bounded by O(k²/rd). The final term is equal to
E[A_R A_R^⊤ A∗_R A∗_R^⊤ A_i 1_{F_i}] = Σ_{j₁,j₂≠i} E[(A_{j₁}A_{j₁}^⊤)(A∗_{j₂}A∗_{j₂}^⊤) A_i 1_{F_i} 1_{F_{j₁}} 1_{F_{j₂}}]
   = Σ_{j₁≠i} (Σ_{j₂≠i} q_{i,j₁,j₂} ⟨A∗_{j₂}, A_i⟩⟨A∗_{j₂}, A_{j₁}⟩) A_{j₁}
   = A_{−i} v,

where v is the vector whose j₁-th component is equal to Σ_{j₂≠i} q_{i,j₁,j₂} ⟨A∗_{j₂}, A_i⟩⟨A∗_{j₂}, A_{j₁}⟩.
The absolute value of v_{j₁} is bounded by

|v_{j₁}| ≤ O(k²/r²)|⟨A∗_{j₁}, A_i⟩| + O(k³/r³)(Σ_{j₂≠j₁,i} (⟨A∗_{j₂}, A_i⟩² + ⟨A∗_{j₂}, A_{j₁}⟩²))
   ≤ O(k²/r²)|⟨A∗_{j₁}, A_i⟩| + O(k³/r³)‖A∗‖² = O(k²/r²)(|⟨A∗_{j₁}, A_i⟩| + k/d).
The first inequality uses the bounds on the q's and the AM-GM inequality; the second uses the bound on the spectral norm of A∗. We can now bound the norm of v as follows:

‖v‖ ≤ O(k²/r² · √(r/d)),

and this implies that the last term satisfies ‖A_{−i}‖‖v‖ ≤ O(k²/rd). Combining all these bounds completes the proof of the lemma.
We are now ready to prove that the update rule satisfies Definition 3.1.1. This
again uses Proposition 3.4.4, except that we invoke Proposition 3.6.1 instead. Com-
bining these lemmas we obtain:
Lemma 3.6.2. Suppose that As is (2δ, 5)-near to A∗. Then for each i, gsi as defined
in Algorithm 5 is (α, β, ε)-correlated with A∗i , where α = Ω(k/r), β ≥ Ω(r/k) and
ε = O(k3/rd2).
Notice that in the third step of Algorithm 5 we project back (with respect to the Frobenius norm) onto a convex set B, which we define below. Viewed as the minimization of a convex function over a convex constraint set, this projection can be computed by various convex optimization algorithms, e.g., the subgradient method (see Theorem 3.2.3 in Section 3.2.4 of Nesterov's book [132] for more details). Without this modification, it seems that the update rule given in Algorithm 5 does not necessarily preserve nearness.
Definition 3.6.3. Let B = {A : A is δ₀-close to A^0 and ‖A‖ ≤ 2‖A∗‖}.
The crucial properties of this set are summarized in the following claim:
Claim 3.6.4. (a) A∗ ∈ B, and (b) each A ∈ B is (2δ₀, 5)-near to A∗.

Proof. The first part of the claim follows because, by assumption, A∗ is δ₀-close to A^0 and trivially ‖A∗‖ ≤ 2‖A∗‖. The second part follows because ‖A − A∗‖ ≤ ‖A − A^0‖ + ‖A^0 − A∗‖ ≤ 4‖A∗‖. This completes the proof of the claim.
By the convexity of B and the fact that A∗ ∈ B, we have that projection doesn’t
increase the error in Frobenius norm.
Claim 3.6.5. For any matrix A, ‖ProjBA− A∗‖F ≤ ‖A− A∗‖F .
We now have the tools to analyze Algorithm 5 by fitting it into the framework
of Corollary 3.1.4. In particular, we prove that it converges to a globally optimal
solution by connecting it to an approximate form of projected gradient descent:
Proof of Theorem 3.2.4. We note that projecting onto B ensures that at the start of each step ‖A^s − A∗‖ ≤ 5‖A∗‖. Hence g^s_i is (Ω(k/r), Ω(r/k), O(k³/rd²))-correlated with A∗_i for each i, which follows from Lemma 3.6.2. This implies that g^s is (Ω(k/r), Ω(r/k), O(k³/d²))-correlated with A∗ in Frobenius norm. Finally, we can apply Corollary 3.1.4 (applied to matrices with the Frobenius norm) to complete the proof of the theorem.
3.6.2 Removing Systemic Error
The proof of Theorem 3.2.5 is parallel to that of Theorem 3.4.1 and Theorem 3.2.4.
As usual, our first step is to show that gs is correlated with A∗: Theorem 3.2.5 follows
from the two lemmas below (Lemma 3.6.6 and 3.6.7) straightforwardly.
Lemma 3.6.6. Suppose that As is (δ, 5)-near to A∗. Then for each i, gsi as defined
in Algorithm 6 is (α, β, ε)-correlated with A∗i , where α = Ω(k/r), β ≥ Ω(r/k) and
ε ≤ d−ω(1).
Proof. We chose to write the proof of Proposition 3.4.3 in a way that lets us reuse the calculation here. In particular, instead of substituting A^s in the calculation we can substitute B^{(s,i)}, and we get:

g^{(s,i)} = p_i q_i (λ^s_i A^s_i − A∗_i + B^{(s,i)}_{−i} diag(q_{i,j}) B^{(s,i)⊤}_{−i} A∗_i) + γ.
Recall that λ^s_i = ⟨A^s_i, A∗_i⟩. Now we can write g^{(s,i)} = p_i q_i (A^s_i − A∗_i) + v, where

v = p_i q_i (λ^s_i − 1) A^s_i + p_i q_i B^{(s,i)}_{−i} diag(q_{i,j}) B^{(s,i)⊤}_{−i} A∗_i + γ.
Indeed the norm of the first term piqi(λsi − 1)Asi is smaller than piqi‖Asi − A∗i ‖.
Recall that the second term was the main contribution to the systemic error when we analyzed the earlier update rules. However, in this case we can use the fact that B^{(s,i)⊤}_{−i} A^s_i = 0 to rewrite the second term above as

p_i q_i B^{(s,i)}_{−i} diag(q_{i,j}) B^{(s,i)⊤}_{−i} (A∗_i − A^s_i).

Hence we can bound the norm of the second term by O(k²/rd)‖A∗_i − A^s_i‖, which is also much smaller than p_i q_i ‖A^s_i − A∗_i‖.
Combining these two bounds we have that ‖v‖ ≤ p_i q_i ‖A^s_i − A∗_i‖/4 + γ, so we can take ζ = γ = d^{−ω(1)} in Proposition 3.4.4. We can complete the proof by invoking Proposition 3.4.4, which implies that g^{(s,i)} is (Ω(k/r), Ω(r/k), d^{−ω(1)})-correlated with A∗_i.
This lemma would be all we would need, if we added a third step that projects onto
B as we did in Algorithm 5. However here we do not need to project at all, because the
update rule maintains nearness and thus we can avoid this computationally intensive
step.
Lemma 3.6.7. Suppose that As is (δ, 2)-near to A∗. Then ‖As+1 − A∗‖ ≤ 2‖A∗‖ in
Algorithm 6.
The proof of the above lemma parallels that of Proposition 3.4.5. We will focus on highlighting the differences in bounding the error term, to avoid repeating the same calculation.
Proof sketch. We will use A to denote A^s and B^{(i)} to denote B^{(s,i)} to simplify the notation. Also let Â_i denote the normalized column Â_i = A_i/‖A_i‖, so that we can write B^{(i)}_{−i} = (I − Â_i Â_i^⊤)A_{−i}. Hence the error term is given by

(I − Â_i Â_i^⊤) A_{−i} diag(q_{i,j}) A_{−i}^⊤ (I − Â_i Â_i^⊤) A∗_i.

Let C be the matrix whose columns are C_i = (I − Â_i Â_i^⊤)A∗_i = A∗_i − ⟨Â_i, A∗_i⟩Â_i. This implies that ‖C‖ ≤ O(√(r/d)). We can now rewrite the error term above as

A_{−i} diag(q_{i,j}) A_{−i}^⊤ C_i − Â_i Â_i^⊤ A_{−i} diag(q_{i,j}) A_{−i}^⊤ C_i.
It follows from the proof of Proposition 3.4.5 that the first term above has norm bounded by O(r/d · √(r/d)). This is because in Proposition 3.4.5 we bounded the term A_{−i} diag(q_{i,j}) A_{−i}^⊤ A∗_i, and in fact all we used in that proof was the fact that ‖A∗‖ = O(√(r/d)), which also holds for C.
All that remains is to bound the second term. We note that its columns are scalar multiples of Â_i, where the coefficient can be bounded as follows: ‖Â_i‖‖A_{−i}‖²‖diag(q_{i,j})‖‖A∗_i‖ ≤ O(k²/rd). Hence we can bound the spectral norm of the second term by O(k²/rd)‖A‖ = O∗(r/d · √(r/d)). We can now combine these two bounds, which together with the calculation in Proposition 3.4.5 completes the proof.
Part II
Global Convergence with
Arbitrary Initialization
Chapter 4
Analysis via Optimization
Landscape
Section 2.3 and Section 3.1.3 in Part I discussed the strengths and limitations of the "finding coarse solutions + local convergence" paradigm for non-convex optimization. In the following several chapters we are concerned with non-convex optimization algorithms that use simpler initialization schemes, such as random initialization or arbitrary initialization. A key difference from the analysis techniques in Chapter 3 is that we establish useful geometric properties of the objective functions instead of analyzing the algorithmic updates directly. Loosely speaking, we show that all local minima of the target objective function are also global minima. Since many local search algorithms converge to approximate local minima (from arbitrary or random initialization), such a property implies convergence to an approximate global minimum.
In the rest of this section, we briefly survey known results regarding optimization algorithms for functions with such properties. In Chapters 5 and 6, we develop techniques to prove such properties for specific objective functions, which are the crux of Part II.
4.1 Local Optimality vs Global Optimality
Let f be a twice-differentiable function from Rd to R. Recall that x is a local minimum of f(·) if there exists an open neighborhood N of x in which the function value is at least f(x): ∀z ∈ N, f(z) ≥ f(x). We use ∇f(x) to denote the gradient of the function, and ∇²f(x) to denote its Hessian (∇²f(x) is a d × d matrix with [∇²f(x)]_{i,j} = ∂²f(x)/∂x_i∂x_j). It is well known that local minima of the function f must satisfy some necessary conditions:
Definition 4.1.1. A point x satisfies the first-order necessary condition for optimality (later abbreviated as the first-order optimality condition) if ∇f(x) = 0. A point x satisfies the second-order necessary condition for optimality (later abbreviated as the second-order optimality condition) if ∇²f(x) ⪰ 0.
A point x is a critical point or stationary point of the function f if the gradient vanishes at x. Thus a local minimum is a critical point, and so is a global minimum.
For a general function f, even computing a local minimum can be intractable: even a degree-four polynomial can be NP-hard to optimize [83], or even just to check whether a point is not a local minimum [130]. These impossibility results motivate us to look for stronger structure in the target function f. Indeed, under the following strict-saddle assumption, we can efficiently find a local minimum of the function f. Loosely speaking, a strict-saddle function satisfies the property that every saddle point has strictly negative curvature in some direction.
Definition 4.1.2 (strict saddle, c.f. [67, 106]). Suppose f(·): Rd → R is twice differentiable. For α, β, γ ≥ 0, we say f is (α, β, γ)-strict saddle if every x ∈ Rd satisfies at least one of the following three conditions:

1. ‖∇f(x)‖ ≥ α.
2. λ_min(∇²f(x)) ≤ −β.
3. There exists a local minimum x⋆ that is γ-close to x in Euclidean distance.
We see that if a function is (α, β, γ)-strict saddle, then for ε < min{α, β²} an ε-approximate local minimum is γ-close to some local minimum. We also note that this specific parameterization of the condition may not be the most quantitatively harmonious one, but it suffices to allow polynomial-time algorithms. The following theorem is not stated in the strongest form; it aims to emphasize that various algorithms can converge to a local minimum in polynomial time.
Theorem 4.1.3. Let f be a twice-differentiable (α, β, γ)-strict saddle function from Rd to R. Suppose we have access to its gradient (or an unbiased estimator of its gradient, or its Hessian) in poly(d) time. Then there are algorithms (such as stochastic gradient descent or second-order algorithms) that converge to a local minimum with ε error in the domain in time poly(d, 1/α, 1/β, 1/γ, 1/ε).
There have been tremendous efforts to obtain faster and faster algorithms for converging to a local minimum. Nesterov and Polyak [134] and Sun et al. [162] give second-order algorithms for finding an approximate local minimum. Stochastic gradient descent can converge to a local minimum in polynomial time from any starting point [138, 67]. The works [3, 46] also give the fastest algorithms in terms of the ε-dependency.
However, the surprising observation that this part of the thesis aims to explain is that finding a local minimum is often sufficient for solving many machine learning problems. Empirical evidence suggests that the local minima found in practice are actually close to global minima. The explanation that this thesis advocates is:

Many machine learning objective functions have the property that all or most local minima are approximate global minima.
We will show in Chapters 5 and 6 that the natural objective functions for matrix completion and linear dynamical systems indeed have this property. If the property above holds together with the strict saddle property (which we can prove for matrix completion and linear dynamical systems), then many local search algorithms can find a global minimum:
Theorem 4.1.4 (Informal). Let f be a twice-differentiable function from Rd to R. Suppose there exist ε₀, τ₀ > 0 and a universal constant c > 0 such that if a point x satisfies ‖∇f(x)‖ ≤ ε ≤ ε₀ and ∇²f(x) ⪰ −τ₀ · Id, then x is ε^c-close to a global minimum of f. Then many optimization algorithms, including cubic regularization, trust-region methods, and stochastic gradient descent, can find a global minimum of f up to δ error in ℓ₂ norm in the domain in time poly(1/δ, 1/τ₀, d).
A strictly stronger condition than "all local minima are global" is that "every critical point is a global minimum." In this case, the objective function has only a single local minimum, which is also the global minimum. Gradient descent is known to converge to this global minimum linearly (under a quantitative version of this condition), as stated below. However, we also note that since the condition rules out multiple local minima, it cannot hold for many objective functions used in practice, which do have multiple local minima and critical points due to certain symmetries.
Theorem 4.1.5. Suppose the function f has an L-Lipschitz continuous gradient and there exist µ > 0 and x∗ such that for every x,

‖∇f(x)‖² ≥ µ(f(x) − f(x∗)).    (4.1.1)

Then gradient descent with step size 1/L has a linear convergence rate.
Condition (4.1.1) is called the Polyak-Łojasiewicz condition, and the theorem above was first proved by Polyak [142]. We note that it is often difficult to verify the Polyak-Łojasiewicz condition, since the quantity ‖∇f(x)‖² is often a complicated function and therefore inequality (4.1.1) is difficult to establish.
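For a concrete instance (a standard example attributed to Karimi et al., not taken from this thesis), f(x) = x² + 3 sin²(x) is non-convex yet satisfies the Polyak-Łojasiewicz condition with x∗ = 0, and gradient descent with step size 1/L drives the objective to zero rapidly:

```python
import numpy as np

# A standard example (due to Karimi et al., not from this thesis) of a
# non-convex function satisfying the Polyak-Lojasiewicz condition (4.1.1):
#   f(x) = x^2 + 3 sin^2(x),   x* = 0,   f(x*) = 0,
# whose gradient f'(x) = 2x + 3 sin(2x) is L-Lipschitz with L = 8
# (since |f''(x)| = |2 + 6 cos(2x)| <= 8), and ||f'(x)||^2 >= mu * f(x)
# for some mu > 0 even though f is not convex.
f = lambda x: x ** 2 + 3.0 * np.sin(x) ** 2
grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

L = 8.0
x = 2.0                      # arbitrary starting point
history = [f(x)]
for _ in range(100):
    x = x - grad(x) / L      # gradient descent with step size 1/L
    history.append(f(x))

print(history[-1])  # f(x_k) decreases monotonically toward f(x*) = 0
```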
An easier-to-verify condition is quasi-convexity, which we discuss next. Quasi-convexity is also stronger than the Polyak-Łojasiewicz condition, which makes it less applicable. Nevertheless, we will show in Chapter 6 that the objective function for learning linear dynamical systems can be made quasi-convex with over-parameterization.
Quasi-convexity
It is known that, under certain mild conditions, (stochastic) gradient descent converges to a local minimum even on non-convex functions [67, 106]. Usually, for concrete problems, the challenge is to prove that there is no spurious local minimum other than the target solution. Here we introduce a condition similar to the quasi-convexity notion in [78], which ensures that any point with vanishing gradient is the optimal solution. Roughly speaking, the condition says that at any point θ the negative gradient −∇f(θ) should be positively correlated with the direction θ∗ − θ pointing towards the optimum. Our condition is slightly weaker than that in [78], since we only require quasi-convexity and smoothness with respect to the optimum, and this (simple) extension will be necessary for our analysis.
Definition 4.1.6 (Weak quasi-convexity). We say an objective function f is τ-weakly-quasi-convex (τ-WQC) over a domain B with respect to a global minimum θ∗ if there is a positive constant τ > 0 such that for all θ ∈ B,

∇f(θ)^⊤(θ − θ∗) ≥ τ(f(θ) − f(θ∗)).    (4.1.2)

We further say f is Γ-weakly-smooth if for any point θ, ‖∇f(θ)‖² ≤ Γ(f(θ) − f(θ∗)).
Note that any Γ-smooth convex function in the usual sense is indeed O(Γ)-weakly-smooth. For a random vector X ∈ Rn, we define its variance to be V[X] = E[‖X − EX‖²].
Definition 4.1.7. We call r(θ) an unbiased estimator of ∇f(θ) with variance V if it
satisfies E[r(θ)] = ∇f(θ) and V[r(θ)] ≤ V .
Projected stochastic gradient descent over some closed convex set B with learning rate η > 0 refers to the following algorithm, in which Π_B denotes the Euclidean projection onto B:

for k = 0 to K − 1:
    w_{k+1} = θ_k − η r(θ_k)
    θ_{k+1} = Π_B(w_{k+1})
return θ_j with j uniformly picked from {1, . . . , K}    (4.1.3)
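The procedure above can be sketched in a few lines; everything in the snippet (the toy quadratic objective, noise level, ball radius, learning rate, and iteration count) is an illustrative choice of ours rather than a setting from the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch of procedure (4.1.3) on the toy objective
# f(theta) = ||theta - theta*||^2, which is 2-WQC with respect to theta*.
d = 5
theta_star = np.ones(d)                  # global minimum, lies inside B
f = lambda th: float(np.sum((th - theta_star) ** 2))

def r(th):
    # unbiased gradient estimator with bounded variance
    return 2.0 * (th - theta_star) + 0.1 * rng.standard_normal(d)

def proj_B(w, radius=3.0):
    # Euclidean projection onto the ball B = {w : ||w|| <= radius}
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

eta, K = 0.05, 500
theta = np.zeros(d)
iterates = []
for _ in range(K):
    w = theta - eta * r(theta)           # stochastic gradient step
    theta = proj_B(w)                    # project back onto B
    iterates.append(theta)

# The procedure returns a uniformly random iterate, whose expected
# error Proposition 4.1.8 below bounds; the final iterate is also close.
theta_out = iterates[rng.integers(K)]
print(f(iterates[-1]))
```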
The following Proposition is well known for convex objective functions (corre-
sponding to 1-weakly-quasi-convex functions). We extend it (straightforwardly) to
the case when τ -WQC holds with any positive constant τ .
Proposition 4.1.8. Suppose the objective function f is τ-weakly-quasi-convex and Γ-weakly-smooth, and r(·) is an unbiased estimator for ∇f(θ) with variance V. Moreover, suppose the global minimum θ∗ belongs to B, and the initial point θ0 satisfies ‖θ0 − θ∗‖ ≤ R. Then projected stochastic gradient descent (4.1.3) with a proper learning rate returns θ_K in K iterations with expected error

f(θ_K) − f(θ∗) ≤ O(max{ΓR²/(τ²K), R√V/(τ√K)}).
The proof below uses a simple variation of the standard convergence analysis
of stochastic gradient descent (see, for example, [35]), and demonstrates that the
argument still works for weakly-quasi-convex functions.
Proof of Proposition 4.1.8. We start by using the weakly-quasi-convex condition, and then the rest follows a variant of the standard analysis of non-smooth projected subgradient descent¹. We condition on θ_k and have that
gradient descent1. We conditioned on θk, and have that
τ(f(θk)− f(θ∗)) ≤ ∇f(θk)>(θk − θ∗) = E[r(θk)
>(θk − θ∗) | θk]
= E[
1
η(θk − wk+1)(θk − θ∗) | θk
]=
1
η
(E[‖θk − wk+1‖2 | θk
]+ ‖θk − θ∗‖2 − E
[‖wk+1 − θ∗‖2 | θk
])= η E
[‖r(θk)‖2
]+
1
η
(‖θk − θ∗‖2 − E
[‖wk+1 − θ∗‖2 | θk
])(4.1.4)
where the first inequality uses weak quasi-convexity and the rest of the lines are simply algebraic manipulations. Since θ_{k+1} is the projection of w_{k+1} onto B and θ∗ belongs to B, we have ‖w_{k+1} − θ∗‖ ≥ ‖θ_{k+1} − θ∗‖. Together with (4.1.4) and

E[‖r(θ_k)‖²] = ‖∇f(θ_k)‖² + V[r(θ_k)] ≤ Γ(f(θ_k) − f(θ∗)) + V,
we obtain that

τ(f(θ_k) − f(θ∗)) ≤ ηΓ(f(θ_k) − f(θ∗)) + ηV + (1/η)(‖θ_k − θ∗‖² − E[‖θ_{k+1} − θ∗‖² | θ_k]).
Taking expectation over all the randomness and summing over k, we obtain that

Σ_{k=0}^{K−1} E[f(θ_k) − f(θ∗)] ≤ (1/(τ − ηΓ)) (ηKV + (1/η)‖θ0 − θ∗‖²) ≤ (1/(τ − ηΓ)) (ηKV + (1/η)R²),
where we use the assumption that ‖θ0 − θ∗‖ ≤ R. Suppose K ≥ 4R²Γ²/(Vτ²); then we take η = R/√(VK). Therefore we have that τ − ηΓ ≥ τ/2, and therefore

Σ_{k=0}^{K−1} E[f(θ_k) − f(θ∗)] ≤ 4R√V·√K/τ.    (4.1.5)
¹Although we used weak smoothness to get a slightly better bound.
On the other hand, if K ≤ 4R²Γ²/(Vτ²), we pick η = τ/(2Γ) and obtain that

Σ_{k=0}^{K−1} E[f(θ_k) − f(θ∗)] ≤ (2/τ)(τKV/(2Γ) + 2ΓR²/τ) ≤ 8ΓR²/τ².    (4.1.6)
Therefore, using equations (4.1.6) and (4.1.5), we obtain that when choosing η properly according to K as above,

E_{k∈[K]}[f(θ_k) − f(θ∗)] ≤ max{8ΓR²/(τ²K), 4R√V/(τ√K)}.
Remark 4.1.1. It is straightforward to see (from the proof) that the algorithm tolerates inverse-exponential bias in the gradient estimator. Technically, suppose E[r(θ)] = ∇f(θ) ± ζ; then f(θ_K) − f(θ∗) ≤ O(max{ΓR²/(τ²K), R√V/(τ√K)}) + poly(K) · ζ. Throughout, we assume that the error that we are shooting for is inverse-polynomial, and therefore the effect of an inverse-exponential bias is negligible.
Finally, we note that the sum of two quasi-convex functions may no longer be quasi-convex. However, if each function in a collection is τ-WQC with respect to a common point θ∗, then their sum is also τ-WQC. This follows from the linearity of the gradient operator.
Proposition 4.1.9. Suppose the functions f1, . . . , fn are individually τ-weakly-quasi-convex in B with respect to a common global minimum θ∗. Then for non-negative weights w1, . . . , wn, the linear combination f = Σ_{i=1}^n w_i f_i is also τ-weakly-quasi-convex with respect to θ∗ in B.
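Proposition 4.1.9 is easy to sanity-check numerically; the toy instance below (our own choice of convex, hence 1-WQC, summands with common minimum θ∗ = 0) verifies inequality (4.1.2) for a non-negative combination on a grid of points:

```python
import numpy as np

# Numerical sanity check of Proposition 4.1.9 on a toy instance (the
# summands and weights are illustrative): f1 and f2 are convex, hence
# 1-WQC, and share the global minimum theta* = 0, so any non-negative
# combination should satisfy inequality (4.1.2) with tau = 1.
f1, g1 = (lambda t: t ** 2), (lambda t: 2.0 * t)
f2, g2 = (lambda t: t ** 4), (lambda t: 4.0 * t ** 3)
w1, w2 = 0.3, 1.7
f = lambda t: w1 * f1(t) + w2 * f2(t)
g = lambda t: w1 * g1(t) + w2 * g2(t)   # gradients combine linearly

theta_star, tau = 0.0, 1.0
ok = all(
    g(t) * (t - theta_star) >= tau * (f(t) - f(theta_star))
    for t in np.linspace(-2.0, 2.0, 101)
)
print(ok)  # True: the weighted sum still satisfies (4.1.2) on the grid
```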
Chapter 5
Matrix completion
Matrix completion is a basic machine learning problem that has wide applications,
especially in collaborative filtering and recommender systems. Simple non-convex op-
timization algorithms are popular and effective in practice. Despite recent progress
in proving various non-convex algorithms converge from a good initial point, it re-
mains unclear why random or arbitrary initialization suffices in practice. We prove
that the commonly used non-convex objective function for positive semidefinite ma-
trix completion has no spurious local minima – all local minima must also be global.
Therefore, many popular optimization algorithms such as (stochastic) gradient de-
scent can provably solve positive semidefinite matrix completion with arbitrary ini-
tialization in polynomial time. The result can be generalized to the setting when the
observed entries contain noise. We believe that our main proof strategy can be useful
for understanding geometric properties of other statistical problems involving partial
or noisy observations.
5.1 Introduction
Matrix completion is the problem of recovering a low-rank matrix from partially
observed entries. It has been widely used in collaborative filtering and recommender
systems [102, 146], dimension reduction [42] and multi-class learning [5]. There has
been extensive work on designing efficient algorithms for matrix completion with
guarantees. One earlier line of results (see [145, 45, 44] and the references therein)
relies on convex relaxations. These algorithms achieve strong statistical guarantees,
but are quite computationally expensive in practice.
More recently, there has been growing interest in analyzing non-convex algorithms for matrix completion [97, 98, 93, 72, 75, 163, 176, 47, 151]. Let M ∈ Rd×d be the target matrix with rank r ≪ d that we aim to recover, and let Ω = {(i, j) : M_{i,j} is observed} be the set of observed entries. These methods are instantiations of optimization algorithms applied to the objective¹

f(X) = (1/2) Σ_{(i,j)∈Ω} [M_{i,j} − (XX^⊤)_{i,j}]².    (5.1.1)
These algorithms are much faster than the convex relaxation algorithms, which is
crucial for their empirical success in large-scale collaborative filtering applications
[102].
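To make the objective concrete, the following is a minimal numpy sketch of (5.1.1) and its gradient. The dimensions, sampling scheme, and seed here are illustrative assumptions, not the setup analyzed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, p = 50, 2, 0.5                      # toy sizes (assumed for illustration)
Z = rng.standard_normal((d, r))
M = Z @ Z.T                               # symmetric rank-r ground truth
upper = np.triu(rng.random((d, d)) < p)   # sample each (i, j) with probability p
mask = upper | upper.T                    # observe (i, j) and (j, i) together

def f(X):
    """Objective (5.1.1): squared error over the observed entries only."""
    return 0.5 * np.sum(((M - X @ X.T) * mask) ** 2)

def grad_f(X):
    """Gradient of (5.1.1): -2 * P_Omega(M - X X^T) X for a symmetric mask."""
    return -2.0 * ((M - X @ X.T) * mask) @ X
```

A central-difference check against f confirms the gradient formula; the factor 2 comes from Ω containing both (i, j) and (j, i).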
Most of the theoretical analyses of the non-convex procedures require careful initialization schemes: the initial point should already be close to the optimum². In fact, Sun
and Luo [163] showed that after this initialization the problem is effectively strongly-
convex; hence many different optimization procedures can be analyzed by standard
techniques from convex optimization.
However, in practice, people typically use a random initialization, which still leads
to robust and fast convergence. Why can these practical algorithms find the optimal
solution in spite of the non-convexity? In this work, we investigate this question
¹In this paper, we focus on the symmetric case when the true M has a symmetric decomposition M = ZZ⊤. Some previous papers work on the asymmetric case when M = ZW⊤, which is harder than the symmetric case.
²The work of De Sa et al. [151] is an exception, which gives an algorithm that uses fresh samples at every iteration to solve matrix completion (and other matrix problems) approximately.
and show that the matrix completion objective has no spurious local minima. More
precisely, we show that any local minimum X of the objective function f(·) is also a
global minimum, with f(X) = 0, and recovers the correct low-rank matrix M.
Our characterization of the structure in the objective function implies that
(stochastic) gradient descent from an arbitrary starting point converges to a global
minimum. This is because gradient descent converges to a local minimum [67, 106],
and every local minimum is also a global minimum.
5.1.1 Main Results
Assume the target matrix M is symmetric and each entry of M is observed with
probability p independently³. We assume M = ZZ⊤ for some matrix Z ∈ ℝ^{d×r}.
There are two known issues with matrix completion. First, the choice of Z is not
unique, since M = (ZR)(ZR)⊤ for any orthonormal matrix R. Our goal is to find one
of these equivalent solutions.
Another issue is that matrix completion is impossible when M is “aligned” with
the standard basis. For example, when M is the identity matrix on its first r × r block,
we will very likely observe only 0 entries. To address this issue, we make the
following standard assumption:
Assumption 5.1.1. For any row Z_i of Z, we have

‖Z_i‖ ≤ µ/√d · ‖Z‖_F .

Moreover, Z has a bounded condition number σ_max(Z)/σ_min(Z) = κ.
Throughout this paper, we think of µ and κ as small constants, and the sample
complexity depends polynomially on these two parameters. Also, note that this
³The entries (i, j) and (j, i) are the same. With probability p we observe both entries and otherwise we observe neither.
assumption is independent of the choice of Z: all Z such that ZZT = M have
the same row norms and Frobenius norm.
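This invariance is easy to verify numerically. The sketch below (with illustrative, hypothetical sizes and helper names of our own) computes the smallest µ satisfying Assumption 5.1.1 and the condition number κ, and checks that both are unchanged under Z → ZR for an orthonormal R:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 100, 4                              # toy sizes, assumed for illustration
Z = rng.standard_normal((d, r))

def incoherence_mu(Z):
    """Smallest mu with max_i ||Z_i|| <= mu / sqrt(d) * ||Z||_F."""
    d = Z.shape[0]
    return np.max(np.linalg.norm(Z, axis=1)) * np.sqrt(d) / np.linalg.norm(Z)

def condition_kappa(Z):
    """sigma_max(Z) / sigma_min(Z)."""
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

# Row norms, ||Z||_F, and singular values are all invariant under Z -> Z R
# for orthonormal R, so mu and kappa do not depend on the choice of Z.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
```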
This assumption is similar to the “incoherence” assumption [44]. Our assumption
is the same as the one used in analyzing non-convex algorithms [97, 98, 163].
We enforce X to also satisfy this assumption via a regularizer:

f(X) = (1/2) ∑_{(i,j)∈Ω} [ M_{i,j} − (XX⊤)_{i,j} ]² + R(X), (5.1.2)
where R(X) is a function that penalizes X when one of its rows is too large. See
Section 5.3 and Section 5.4 for the precise definition. Our main result shows that in
this setting, the regularized objective function has no spurious local minimum:
Theorem 5.1.2 (Informal). All local minima of the regularized objective (5.1.2)
satisfy XX⊤ = ZZ⊤ = M when p ≥ poly(κ, r, µ, log d)/d.
Combined with the results in [67, 106] (see more discussions in Section 5.1.2), we
have,
Theorem 5.1.3 (Informal). With high probability, stochastic gradient descent on the
regularized objective (5.1.2) will converge to a solution X such that XX⊤ = ZZ⊤ = M
in polynomial time from any starting point. Gradient descent will converge to such a
point with probability 1 from a random starting point.
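As an informal, purely illustrative companion to Theorem 5.1.3 (this toy run omits the regularizer and uses tiny hypothetical dimensions, so it is not the theorem's formal setting), plain gradient descent on the rank-1 factorized objective from a random start typically recovers M:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 30, 0.8                            # toy sizes, assumed for illustration
z = rng.standard_normal(d)
z /= np.linalg.norm(z)                    # ground truth with ||z|| = 1
M = np.outer(z, z)
upper = np.triu(rng.random((d, d)) < p)
mask = upper | upper.T                    # symmetric observation pattern

x = rng.standard_normal(d) / np.sqrt(d)   # random initialization
eta = 0.1
for _ in range(3000):
    x += eta * 2.0 * ((M - np.outer(x, x)) * mask) @ x   # descent step

# z is identifiable only up to sign
err = min(np.linalg.norm(x - z), np.linalg.norm(x + z))
```

With this generous sampling rate the observed entries over-determine the rank-1 factor, so the iterate approaches ±z rather than a spurious point.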
Our results are also robust to noise. Even if each entry is corrupted with
Gaussian noise of standard deviation µ²‖Z‖²_F/d (comparable to the magnitude of
the entry itself!), we can still guarantee that all the local minima satisfy
‖XX⊤ − ZZ⊤‖_F ≤ ε when p is large enough. See the discussion in Appendix 5.5 for results on
noisy matrix completion.
Our main technique is to show that every point that satisfies the first and second
order necessary conditions for optimality must be the desired solution. To achieve
this, we use new ideas to analyze the effect of the regularizer and show how it is
useful for modifying the first and second order conditions to exclude any spurious
local minimum.
5.1.2 Related Work
Matrix Completion The earlier theoretical works on matrix completion analyzed
the nuclear norm minimization [159, 145, 45, 44, 131]. This line of work has the clean-
est and strongest theoretical guarantees; [45, 145] showed that if |Ω| ≳ drµ² log² d, the
nuclear norm convex relaxation recovers the exact underlying low-rank matrix. The
solution can be computed by solving a convex program in polynomial time.
However, the primary disadvantage of nuclear norm methods is their computational
and memory requirements — the fastest known provable algorithms require O(d²)
memory and thus at least O(d²) running time, which could be both prohibitive for
moderate to large values of d. Many algorithms have been proposed to improve the
runtime (either theoretically or empirically) (see, for example, [158, 120, 77], and the
references therein). Burer and Monteiro [38] proposed factorizing the optimization
variable M = XX⊤ and optimizing over X ∈ ℝ^{d×r} instead of M ∈ ℝ^{d×d}. This
approach only requires O(dr) memory, and a single gradient iteration takes time O(|Ω|),
so it has a much lower memory requirement and computational complexity than the
nuclear norm relaxation. On the other hand, the factorization causes the optimization
problem to be non-convex in X, which leads to theoretical difficulties in analyzing
algorithms. Keshavan et al. [97, 98] showed that well-initialized gradient descent
recovers M. The works [75, 72, 93, 47] showed that well-initialized alternating least
squares, block coordinate descent, and gradient descent converge to M. Jain and
Netrapalli [92] gave a fast algorithm that iterates between gradient descent in the relaxed
space and projection onto the set of low-rank matrices. The work [151] analyzes stochastic
gradient descent with fresh samples at each iteration from random initialization
and shows that it approximately converges to the optimal solution. [163, 176, 177, 168]
provided a more unified analysis by showing that with careful initialization many al-
gorithms, including gradient descent and alternating least squares, succeed. [163, 177]
accomplished this by showing an analog of strong convexity in the neighborhood of
the solution M .
Non-convex Optimization Recently, a line of work analyzes non-convex optimization
by separating the problem into two aspects: the geometric aspect, which
shows the function has no spurious local minima, and the algorithmic aspect, which
designs efficient algorithms that converge to a local minimum satisfying the first order
and (relaxed versions of) the second order necessary conditions.
Our result is the first that explains the geometry of the matrix completion
objective. Similar geometric results are only known for a few problems: SVD/PCA,
phase retrieval/synchronization, orthogonal tensor decomposition, and dictionary
learning [21, 157, 67, 162, 22]. The matrix completion objective requires different tools due
to the sampling of the observed entries, as well as carefully managing the regularizer
to restrict the geometry. In parallel to our work, Bhojanapalli et al. [30] showed similar
results for matrix sensing, which is closely related to matrix completion. Loh and
Wainwright [116] showed that for many statistical settings that involve missing/noisy
data and non-convex regularizers, any stationary point of the non-convex objective
is close to global optima; furthermore, there is a unique stationary point that is the
global minimum under stronger assumptions [115].
On the algorithmic side, it is known that second-order algorithms like cubic
regularization [134] and trust-region [162] algorithms converge to local minima that
approximately satisfy first and second order conditions. Gradient descent is also
known to converge to local minima [106] from a random starting point. Stochastic
gradient descent can converge to a local minimum in polynomial time from any
starting point [138, 67]. All of these results can be applied to our setting, implying
that various heuristics used in practice are guaranteed to solve matrix completion.
Notations: For Ω ⊂ [d] × [d], let P_Ω be the linear operator that maps a matrix A
to P_Ω(A), which agrees with A on Ω and is 0 outside of Ω. In this
chapter, for a matrix A, we let A_i denote the i-th row of A. We also use the shorthand
‖A‖_Ω = ‖P_Ω(A)‖_F .
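Concretely, P_Ω is just entrywise multiplication with a 0/1 mask; a minimal sketch (the helper names are ours, not the thesis's):

```python
import numpy as np

def P_Omega(A, mask):
    """Keep A's values on the observed set Omega (mask True), zero elsewhere."""
    return np.where(mask, A, 0.0)

def norm_Omega(A, mask):
    """The shorthand ||A||_Omega = ||P_Omega(A)||_F."""
    return np.linalg.norm(P_Omega(A, mask))
```

Note that P_Ω is a projection: applying it twice changes nothing.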
5.2 Proof Strategy: “Simple” Proofs are More
Generalizable
In this section, we demonstrate the key ideas behind our analysis using the rank r = 1
case. In particular, we first give a “simple” proof for the fully observed case. Then we
show this simple proof can be easily generalized to the random observation case. We
believe that this proof strategy is applicable to other statistical problems involving
partial/noisy observations. The proof sketches in this section are only meant to be
illustrative and may not be fully rigorous in various places. We refer the readers to
Section 5.3 and Section 5.4 for the complete proofs.
In the rank r = 1 case, we assume M = zz⊤, where ‖z‖ = 1 and ‖z‖_∞ ≤ µ/√d.
Let ε ≪ 1 be the target accuracy that we aim to achieve in this section, and let
p = poly(µ, log d)/(dε).
For simplicity, we focus on the following domain B of incoherent vectors, where
the regularizer R(x) vanishes:

B = { x : ‖x‖_∞ < 2µ/√d } . (5.2.1)
Inside this domain B, we can restrict our attention to the objective function
without the regularizer, defined as

g(x) = (1/2) · ‖P_Ω(M − xx⊤)‖²_F . (5.2.2)
The global minima of g(·) are z and −z, with function value 0. Our goal in this
section is to (informally) prove that all the local minima of g(·) are O(√ε)-close to
±z. In later sections we will formally prove that the only local minima are ±z.

Lemma 5.2.1 (Partial observation case, informally stated). Under the setting of this
section, in the domain B, all local minima of the function g(·) are O(√ε)-close to ±z.
It turns out to be insightful to consider the full observation case when Ω = [d]×[d].
The corresponding objective is

g̃(x) = (1/2) · ‖M − xx⊤‖²_F . (5.2.3)

Observe that g(x) is a sampled version of g̃(x), and therefore we expect that they
share the same geometric properties. In particular, if g̃(x) does not have spurious
local minima, then neither should g(x).

Lemma 5.2.2 (Full observation case, informally stated). Under the setting of this
section, in the domain B, the function g̃(·) has only two local minima, ±z.
Before introducing the “simple” proof, let us first look at a delicate proof that
does not generalize well.
Difficult to Generalize Proof of Lemma 5.2.2 We compute the gradient and
Hessian of the full observation objective g̃(x):

∇g̃(x) = Mx − ‖x‖²x , (5.2.4)

∇²g̃(x) = 2xx⊤ − M + ‖x‖² · I . (5.2.5)

Therefore, a critical point x satisfies ∇g̃(x) = Mx − ‖x‖²x = 0, and thus it must
be an eigenvector of M, with ‖x‖² the corresponding eigenvalue. Next, we prove that
the Hessian is positive definite only at the top eigenvector. Let x be an eigenvector
with eigenvalue λ = ‖x‖², where λ is strictly less than the top eigenvalue λ⋆, and let z be
the top eigenvector. Since z ⊥ x, we have ⟨z, ∇²g̃(x)z⟩ = −⟨z, Mz⟩ + ‖x‖² = −λ⋆ + λ < 0,
which shows that x is not a local minimum. Thus only ±z can be local minimizers,
and it is easily verified that ∇²g̃(±z) is indeed positive definite.
The difficulty in generalizing the proof above to the partial observation case is
that it relies heavily on the properties of eigenvectors. Suppose we want to imitate the
proof above for the partial observation case. The first difficulty is how to solve the
equation ∇g(x) = P_Ω(M − xx⊤)x = 0. Moreover, even if we had a reasonable
approximation for the critical points (the solutions of ∇g(x) = 0), it would be difficult
to examine the Hessian at these critical points without the orthogonality of
the eigenvectors.
“Simple” and Generalizable proof The lessons from the subsection above suggest
that we find an alternative proof for the full observation case that is generalizable.
The alternative proof will be simple in the sense that it does not use the notion of
eigenvectors and eigenvalues. Concretely, the key observation behind most of the
analysis in this paper is the following:
Proofs that consist of inequalities that are linear in 1_Ω are often easily generalizable
to the partial observation case.

Here, statements that are linear in 1_Ω mean statements of the form
∑_{i,j} 1_{(i,j)∈Ω} T_{ij} ≤ a. We will call these kinds of proofs “simple” proofs in this section.
Roughly speaking, the observation follows from the law of large numbers: suppose
T_{ij}, (i, j) ∈ [d] × [d], is a sequence of bounded real numbers. Then the sampled sum
∑_{(i,j)∈Ω} T_{ij} = ∑_{i,j} 1_{(i,j)∈Ω} T_{ij} is an accurate estimate of p ∑_{i,j} T_{ij} when the sampling
probability p is relatively large. Consequently, the mathematical implications of
p ∑ T_{ij} ≤ a are expected to be similar to the implications of ∑_{(i,j)∈Ω} T_{ij} ≤ a, up to
some small error introduced by the approximation. To make this concrete, we give
below informal proofs of Lemma 5.2.2 and Lemma 5.2.1 that consist only of statements
that are linear in 1_Ω. Readers will see that, due to this linearity, the proof for the
partial observation case (Claims 1p and 2p below) is a direct generalization of the proof
for the full observation case (Claims 1f and 2f) via concentration inequalities
(which will be discussed more at the end of the section).
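The law-of-large-numbers heuristic above is easy to see numerically; in this hypothetical simulation (sizes chosen for illustration only), the sampled sum of bounded numbers T_ij lands very close to p times the full sum:

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 400, 0.3                           # illustrative sizes
T = rng.uniform(-1.0, 1.0, size=(d, d))   # bounded real numbers T_ij
omega = rng.random((d, d)) < p            # keep each (i, j) with probability p

sampled = T[omega].sum()                  # sum over Omega of T_ij
scaled_full = p * T.sum()                 # p * sum over all (i, j)

# normalized deviation; it shrinks like 1/sqrt(p) * 1/d for bounded T_ij
dev = abs(sampled - scaled_full) / (d * d)
```

So an inequality of the form Σ_Ω T_ij ≤ a carries roughly the same information as p Σ_{i,j} T_ij ≤ a, which is exactly the transfer the "simple" proofs exploit.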
A “simple” proof of Lemma 5.2.2 and its generalization to Lemma 5.2.1
We prove Lemma 5.2.2 by combining the two claims below. We will also give their
generalizations to the partial observation case.
Claim 1f. Suppose x ∈ B satisfies ∇g̃(x) = 0, where g̃ is the full observation objective (5.2.3). Then ⟨x, z⟩² = ‖x‖⁴.

Proof. We have

∇g̃(x) = (zz⊤ − xx⊤)x = 0
⇒ ⟨x, ∇g̃(x)⟩ = ⟨x, (zz⊤ − xx⊤)x⟩ = 0 (5.2.6)
⇒ ⟨x, z⟩² = ‖x‖⁴
Intuitively, this proof says that the norm of a critical point x is controlled by its
correlation with z.
The following claim is the counterpart of Claim 1f in the partial observation case.
Claim 1p. Suppose x ∈ B satisfies ∇g(x) = 0. Then ⟨x, z⟩² ≥ ‖x‖⁴ − ε.

Proof. Imitating the proof of Claim 1f,

∇g(x) = P_Ω(zz⊤ − xx⊤)x = 0
⇒ ⟨x, ∇g(x)⟩ = ⟨x, P_Ω(zz⊤ − xx⊤)x⟩ = 0 (5.2.7)
⇒ ⟨x, z⟩² ≥ ‖x‖⁴ − ε

The last step uses the fact that equations (5.2.6) and (5.2.7) are approximately equal
up to a scaling factor p for any x ∈ B, since (5.2.7) is a sampled version of (5.2.6).
Claim 2f. If x ∈ B has a positive semidefinite Hessian ∇²g̃(x) ⪰ 0, then ‖x‖² ≥ 1/3.

Proof. By the assumption on x, we have ⟨z, ∇²g̃(x)z⟩ ≥ 0. Calculating the
quadratic form of the Hessian (see Proposition 5.3.1 for details),

⟨z, ∇²g̃(x)z⟩ = ‖zx⊤ + xz⊤‖²_F − 2z⊤(zz⊤ − xx⊤)z ≥ 0 (5.2.8)
⇒ ‖x‖² + 2⟨z, x⟩² ≥ 1
⇒ ‖x‖² ≥ 1/3 (since ⟨z, x⟩² ≤ ‖x‖²)
Claim 2p. If x ∈ B has a positive semidefinite Hessian ∇²g(x) ⪰ 0, then ‖x‖² ≥ 1/3 − ε.
Proof. Imitating the proof of Claim 2f, calculating the quadratic form of the Hessian
at z (see Proposition 5.3.1), we have

⟨z, ∇²g(x)z⟩ = ‖P_Ω(zx⊤ + xz⊤)‖²_F − 2z⊤P_Ω(zz⊤ − xx⊤)z ≥ 0 (5.2.9)
⇒ · · · (same steps as in Claim 2f)
⇒ ‖x‖² ≥ 1/3 − ε

Here we use the fact that ⟨z, ∇²g(x)z⟩ ≈ p · ⟨z, ∇²g̃(x)z⟩ for any x ∈ B, where g̃ is
the full observation objective (5.2.3).
With these two claims, we are ready to prove Lemmas 5.2.2 and 5.2.1, using one more
step that is linear in 1_Ω.

Proof of Lemma 5.2.2. By Claims 1f and 2f, x satisfies ⟨x, z⟩² = ‖x‖⁴ ≥ 1/9.
Moreover, ∇g̃(x) = 0 implies

⟨z, ∇g̃(x)⟩ = ⟨z, (zz⊤ − xx⊤)x⟩ = 0 (5.2.10)
⇒ ⟨x, z⟩(1 − ‖x‖²) = 0
⇒ ‖x‖² = 1 (by ⟨x, z⟩² ≥ 1/9)

Then by Claim 1f again we obtain ⟨x, z⟩² = 1, and therefore x = ±z.
The proof of Lemma 5.2.1 is almost the same as that of Lemma 5.2.2.

Proof of Lemma 5.2.1. By Claims 1p and 2p, x satisfies ⟨x, z⟩² ≥ ‖x‖⁴ ≥ 1/9 − O(ε).
Moreover, ∇g(x) = 0 implies

⟨z, ∇g(x)⟩ = ⟨z, P_Ω(zz⊤ − xx⊤)x⟩ = 0 (5.2.11)
⇒ · · · (same steps as in the proof of Lemma 5.2.2)
⇒ ‖x‖² = 1 ± O(ε)

Since (5.2.11) is the sampled version of equation (5.2.10), we expect them to lead to
the same conclusion up to some approximation. Then by Claim 1p again we obtain
⟨x, z⟩² = 1 ± O(ε), and therefore x is O(√ε)-close to either of ±z.
Subtleties regarding uniform convergence In the proof sketches above, our
key idea is to use concentration inequalities to link the full observation objective
with its partial observation counterpart. However, we require a uniform convergence
result. For example, we need a statement like: “w.h.p. over the choice of Ω,
equations (5.2.6) and (5.2.7) are similar to each other up to scaling.” This type of
statement is often only true for x inside the incoherent ball B. The fix for this is
the regularizer. For non-incoherent x, we will use a different argument that exploits the
properties of the regularizer. This is beyond the main proof strategy of this section
and will be discussed in subsequent sections.
5.3 Warm-up: Rank-1 Case
In this section, using the general proof strategy described in the previous section, we
provide a formal proof for the rank-1 case. In subsection 5.3.1, we formally work out
the proof sketches of Section 5.2. In subsection 5.3.2, we prove that, due to the effect
of the regularizer, the objective function does not have any local minimum outside
the incoherent ball B.
In the rank-1 case, the objective function simplifies to

f(x) = (1/2) ‖P_Ω(M − xx⊤)‖²_F + λ R(x) . (5.3.1)

Here we use the regularization

R(x) = ∑_{i=1}^d h(x_i), where h(t) = (|t| − α)⁴ · 1{|t| ≥ α} .

The parameters λ and α will be chosen as in Theorem 5.3.2. We will choose
α > 10µ/√d so that R(x) = 0 for incoherent x; thus R only penalizes coherent
x. Moreover, we note that R(x) has a Lipschitz second order derivative.⁴
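A small numerical sketch of the penalty h and the regularizer R (the value of α below is a placeholder, not a recommended setting): h and its derivative vanish on incoherent coordinates, and the fourth power keeps the first and second derivatives continuous at |t| = α.

```python
import numpy as np

def h(t, alpha):
    """Per-coordinate penalty (|t| - alpha)^4 on coordinates with |t| > alpha."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) > alpha, (np.abs(t) - alpha) ** 4, 0.0)

def h_prime(t, alpha):
    """Derivative 4(|t| - alpha)^3 sign(t); the 4th power (rather than the 2nd)
    is what makes the second derivative Lipschitz at |t| = alpha."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) > alpha,
                    4 * (np.abs(t) - alpha) ** 3 * np.sign(t), 0.0)

def R(x, alpha):
    """R(x) = sum_i h(x_i): zero for incoherent x, large for coherent x."""
    return h(x, alpha).sum()
```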
We first state the optimality conditions.
Proposition 5.3.1. The first order optimality condition of objective (5.3.1) is

2 P_Ω(M − xx⊤)x = λ∇R(x) , (5.3.2)

and the second order optimality condition requires:

∀v ∈ ℝ^d, ‖P_Ω(vx⊤ + xv⊤)‖²_F + λ v⊤∇²R(x)v ≥ 2 v⊤P_Ω(M − xx⊤)v . (5.3.3)

Moreover, the τ-relaxed second order optimality condition requires:

∀v ∈ ℝ^d, ‖P_Ω(vx⊤ + xv⊤)‖²_F + λ v⊤∇²R(x)v ≥ 2 v⊤P_Ω(M − xx⊤)v − τ‖v‖² . (5.3.4)
Proof. We take the Taylor expansion around the point x. Let δ be an infinitesimal
vector. We have

f(x + δ) = (1/2) ‖P_Ω(M − (x+δ)(x+δ)⊤)‖²_F + λR(x+δ) + o(‖δ‖²)
= (1/2) ‖P_Ω(M − xx⊤ − (xδ⊤ + δx⊤) − δδ⊤)‖²_F
  + λ( R(x) + ⟨∇R(x), δ⟩ + (1/2) δ⊤∇²R(x)δ ) + o(‖δ‖²)
= (1/2) ‖M − xx⊤‖²_Ω + λR(x)
  − ⟨P_Ω(M − xx⊤), xδ⊤ + δx⊤⟩ + λ⟨∇R(x), δ⟩
  − ⟨P_Ω(M − xx⊤), δδ⊤⟩ + (1/2) ‖P_Ω(xδ⊤ + δx⊤)‖²_F + (1/2) λ δ⊤∇²R(x)δ + o(‖δ‖²).

By symmetry, ⟨P_Ω(M − xx⊤), xδ⊤⟩ = ⟨P_Ω(M − xx⊤), δx⊤⟩ = ⟨P_Ω(M − xx⊤)x, δ⟩,
so the first order optimality condition is: for all δ, ⟨−2P_Ω(M − xx⊤)x + λ∇R(x), δ⟩ = 0,
which is equivalent to 2P_Ω(M − xx⊤)x = λ∇R(x).

The second order optimality condition says −⟨P_Ω(M − xx⊤), δδ⊤⟩ + (1/2)‖P_Ω(xδ⊤ +
δx⊤)‖²_F + (1/2)λ δ⊤∇²R(x)δ ≥ 0 for every δ, which is exactly equivalent to equation
(5.3.3).

⁴This is the main reason for us to choose the 4-th power instead of the 2-nd power.
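The gradient formula underlying (5.3.2) can be sanity-checked with finite differences. A sketch with toy, hypothetical parameters (this is not the parameter regime of Theorem 5.3.2):

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha, lam = 12, 0.5, 2.0              # illustrative values only
z = rng.standard_normal(d)
z /= np.linalg.norm(z)
M = np.outer(z, z)
upper = np.triu(rng.random((d, d)) < 0.5)
mask = upper | upper.T                    # symmetric observed set Omega

def R(x):
    return np.where(np.abs(x) > alpha, (np.abs(x) - alpha) ** 4, 0.0).sum()

def grad_R(x):
    return np.where(np.abs(x) > alpha,
                    4 * (np.abs(x) - alpha) ** 3 * np.sign(x), 0.0)

def f(x):
    """Objective (5.3.1)."""
    return 0.5 * np.sum(((M - np.outer(x, x)) * mask) ** 2) + lam * R(x)

def grad_f(x):
    """Setting this to zero is exactly the first order condition (5.3.2)."""
    return -2.0 * ((M - np.outer(x, x)) * mask) @ x + lam * grad_R(x)
```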
We give the precise version of Theorem 5.1.2 for the rank-1 case.

Theorem 5.3.2. Suppose p ≥ c µ⁶ log^{1.5} d / d, where c is a large enough absolute constant. Set
α = 10µ√(1/d) and λ ≥ µ²p/α². Then, with high probability over the randomness of
Ω, the only points in ℝ^d that satisfy both first and second order optimality conditions
(or τ-relaxed optimality conditions with τ < 0.1p) are z and −z.
In the rest of this section, we will first prove that if x is constrained to be
incoherent (so that the regularizer is 0 and concentration is straightforward) and
satisfies the optimality conditions, then x has to be z or −z. Then we go on to explain
how the regularizer helps us change the geometry for those points that are far away
from z, so that we can rule them out as local minima. For simplicity, we
will focus on the part that shows a local minimum x must be close enough to z.
Lemma 5.3.3. In the setting of Theorem 5.3.2, suppose x satisfies the first-order
and second-order optimality conditions (5.3.2) and (5.3.3). Then, with p as in
Theorem 5.3.2,

‖xx⊤ − zz⊤‖²_F ≤ O(ε) ,

where ε = µ³(pd)^{−1/2}.

This turns out to be the main challenge. Once we have proved that x is close, we can
apply the result of Sun and Luo [163] (see Lemma 5.6.1) and obtain Theorem 5.3.2.
5.3.1 Handling Incoherent x
To demonstrate the key idea, in this section we restrict our attention to the subset
of ℝ^d that contains incoherent x with ℓ₂ norm bounded by 1; that is, we consider

B = { x : ‖x‖_∞ ≤ 2µ/√d, ‖x‖ ≤ 1 } . (5.3.5)
Note that the desired solution z is in B, and the regularization R(x) vanishes
inside B.
The following lemmas assume that x satisfies the first and second order optimality
conditions, and deduce a sequence of properties that x must satisfy.
Lemma 5.3.4. Under the setting of Theorem 5.3.2, with high probability over the
choice of Ω, for any x ∈ B that satisfies the second-order optimality condition (5.3.3)
we have

‖x‖² ≥ 1/4 .

The same is true if x ∈ B only satisfies the τ-relaxed second order optimality condition
for τ ≤ 0.1p.

Proof. We plug v = z into the second-order optimality condition (5.3.3) and obtain

‖P_Ω(zx⊤ + xz⊤)‖²_F ≥ 2 z⊤P_Ω(M − xx⊤)z . (5.3.6)
Intuitively, when restricted to Ω, the squared Frobenius norm on the LHS and the
quadratic form on the RHS should both be approximately a p fraction of their
unrestricted counterparts. In fact, both LHS and RHS can be written as sums of terms
of the form ⟨P_Ω(uv⊤), P_Ω(st⊤)⟩, because

‖P_Ω(zx⊤ + xz⊤)‖²_F = 2⟨P_Ω(zx⊤), P_Ω(zx⊤)⟩ + 2⟨P_Ω(zx⊤), P_Ω(xz⊤)⟩ ,
2 z⊤P_Ω(M − xx⊤)z = 2⟨P_Ω(zz⊤), P_Ω(zz⊤)⟩ − 2⟨P_Ω(xx⊤), P_Ω(zz⊤)⟩ .
Therefore we can use concentration inequalities (Theorem 5.7.1) and simplify:

LHS of (5.3.6) = p‖zx⊤ + xz⊤‖²_F ± O(pε)
= 2p‖x‖²‖z‖² + 2p⟨x, z⟩² ± O(pε) , (since x, z ∈ B)

where ε = O(µ²√(log d/(pd))). Similarly, by Theorem 5.7.1 again, we have

RHS of (5.3.6) = 2( ⟨P_Ω(zz⊤), P_Ω(zz⊤)⟩ − ⟨P_Ω(xx⊤), P_Ω(zz⊤)⟩ ) (since M = zz⊤)
= 2p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε) . (by Theorem 5.7.1 and x, z ∈ B)

(Note that even if we use the τ-relaxed second order optimality condition, the RHS
only becomes 1.99p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε), which does not affect the later proofs.)
Plugging the estimates above back into equation (5.3.6), we have

2p‖x‖²‖z‖² + 2p⟨x, z⟩² ≥ 2p‖z‖⁴ − 2p⟨x, z⟩² − O(pε) ,

which implies 6p‖x‖²‖z‖² ≥ 2p‖x‖²‖z‖² + 4p⟨x, z⟩² ≥ 2p‖z‖⁴ − O(pε), using
⟨x, z⟩² ≤ ‖x‖²‖z‖². Since ‖z‖² = 1 and ε is sufficiently small, we complete the proof.
Next we use the first order optimality condition to pin down another property of x:
it has to be close to z after scaling. Note that this does not directly mean that x is
close to z, since x = 0 also satisfies the first order optimality condition (and therefore
the conclusion (5.3.7) below).

Lemma 5.3.5. With high probability over the randomness of Ω, any x ∈ B that
satisfies the first-order optimality condition (5.3.2) also satisfies

‖⟨z, x⟩z − ‖x‖²x‖ ≤ O(ε) , (5.3.7)

where ε = O(µ³(pd)^{−1/2}).
Proof. Since x ∈ B, we have R(x) = 0. Therefore the first-order optimality condition
says

P_Ω(M − xx⊤)x = P_Ω(zz⊤)x − P_Ω(xx⊤)x = 0 . (5.3.8)

Intuitively, we hope P_Ω(zz⊤) ≈ p·zz⊤ and P_Ω(xx⊤)x ≈ p‖x‖²x. These are made
precise by the concentration inequalities Lemma 5.7.4 and Theorem 5.7.2, respectively.
By Theorem 5.7.2, with high probability over the choice of Ω, for every x ∈ B,

‖P_Ω(xx⊤)x − p·xx⊤x‖ ≤ pε‖x‖³ ≤ pε , (5.3.9)

where ε = O(µ³(pd)^{−1/2}). Similarly, by Lemma 5.7.4, with high probability over the
choice of Ω,

‖P_Ω(zz⊤) − p·zz⊤‖ ≤ εp ,

for ε = O(µ²(pd)^{−1/2}). Therefore, for every x,

‖P_Ω(zz⊤)x − p·zz⊤x‖ ≤ εp‖x‖ ≤ εp . (5.3.10)

Plugging estimates (5.3.10) and (5.3.9) into equation (5.3.8), we complete the proof.
Finally, we combine the two optimality conditions and show that equation (5.3.7)
implies xx⊤ must be close to zz⊤.

Lemma 5.3.6. Suppose the vector x satisfies ‖x‖² ≥ 1/4 and ‖⟨z, x⟩z − ‖x‖²x‖ ≤ δ.
Then for δ ∈ (0, 0.1),

‖xx⊤ − zz⊤‖²_F ≤ O(δ) .
Figure 5.1: Partition of ℝ^d into regions where our lemmas apply. For example, Lemma 5.3.8 rules out the possibility that a point x in the green region is a local minimum. The green region is the intersection of the ℓ∞ norm ball and the ℓ₂ norm ball. Both the white region and the yellow region have non-zero gradient, but for different reasons.
Proof. Write z = ux + v, where u ∈ ℝ and v is a vector orthogonal to x. Then
⟨z, x⟩z = u²‖x‖²x + u‖x‖²v, and therefore

δ ≥ ‖⟨z, x⟩z − ‖x‖²x‖ = ‖x‖²√(u²‖v‖² + (1 − u²)²) .

In particular, |1 − u²| ≤ 4δ and u‖v‖ ≤ 4δ. This means |u| ∈ 1 ± 3δ and
‖v‖ ≤ 8δ. Now we expand xx⊤ − zz⊤:

xx⊤ − zz⊤ = (1 − u²)xx⊤ − u(xv⊤ + vx⊤) − vv⊤ .

All of these terms have Frobenius norm bounded by O(δ), and therefore
‖xx⊤ − zz⊤‖²_F ≤ O(δ).
5.3.2 Extension to General x
We have shown that when x is incoherent and satisfies the first and second order
optimality conditions, it must be close to z or −z. Now we need to consider the more
general case, when x may have some very large coordinates. Here the main intuition
is that the first order optimality condition with a proper regularizer is enough to
guarantee that x cannot have an entry that is much bigger than µ/√d.
Lemma 5.3.7. With high probability over the choice of Ω, for any x that satisfies
the first-order optimality condition (5.3.2), we have

‖x‖_∞ ≤ 4 max{ α, µ√(p/λ) } . (5.3.11)

Here we recall that α was chosen to be 10µ/√d, and λ is chosen large enough that
α dominates the second term µ√(p/λ) in the setting of Theorem 5.3.2.
Proof of Lemma 5.3.7. Let i⋆ = argmax_j |x_j|. Without loss of generality, suppose
x_{i⋆} ≥ 0. Suppose the i⋆-th row of Ω consists of the entries with indices {i⋆} × S_{i⋆}.
If |x_{i⋆}| ≤ 2α, we are done, so in the rest of the proof we assume |x_{i⋆}| > 2α. Note
that when p ≥ c(log d)/d for a sufficiently large constant c, with high probability over
the choice of Ω we have |S_{i⋆}| ≤ 2pd. In the rest of the argument we work with such
an Ω.

We will compare the i⋆-th coordinates of the LHS and RHS of the first-order
optimality condition (5.3.2). For preparation, we have

|(P_Ω(M)x)_{i⋆}| = |(P_Ω(zz⊤)x)_{i⋆}| = | ∑_{j∈S_{i⋆}} z_{i⋆} z_j x_j |
≤ |x_{i⋆}| ∑_{j∈S_{i⋆}} |z_{i⋆} z_j| ≤ |x_{i⋆}| · µ²/d · |S_{i⋆}| ≤ 2|x_{i⋆}| pµ² , (5.3.12)
where the last step uses |S_{i⋆}| ≤ 2pd. Moreover, we have

(P_Ω(xx⊤)x)_{i⋆} = ∑_{j∈S_{i⋆}} x_{i⋆} x_j² ≥ 0 ,

and

(λ∇R(x))_{i⋆} = 4λ(|x_{i⋆}| − α)³ sign(x_{i⋆}) ≥ (λ/2)|x_{i⋆}|³ . (since |x_{i⋆}| ≥ 2α)

Now, plugging the bounds above into the i⋆-th coordinate of equation (5.3.2), we
obtain

4|x_{i⋆}| pµ² ≥ 2(P_Ω(M − xx⊤)x)_{i⋆} ≥ (λ∇R(x))_{i⋆} ≥ (λ/2)|x_{i⋆}|³ ,

which implies |x_{i⋆}| ≤ 4√(pµ²/λ).
Setting λ ≥ µ²p/α² and α = 10µ√(1/d), Lemma 5.3.7 ensures that any x that
satisfies the first-order optimality condition lies in the following ball:

B′ = { x ∈ ℝ^d : ‖x‖_∞ ≤ 4α } .
We would then like to continue with arguments similar to Lemmas 5.3.4 and
5.3.5. However, things become more complicated, as now we need to account for
the contribution of the regularizer.

Lemma 5.3.8 (Extension of Lemma 5.3.4). In the setting of Theorem 5.3.2, with
high probability over the choice of Ω, if x ∈ B′ satisfies the second-order optimality
condition (5.3.3), or the τ-relaxed condition for τ ≤ 0.1p, then ‖x‖² ≥ 1/8.

The guarantees and proofs are very similar to Lemma 5.3.4. The main intuition
is that we can restrict our attention to the coordinates on which the regularizer vanishes.
Proof. If ‖x‖ ≥ 1, then we are done; in the rest of the proof we assume ‖x‖ ≤ 1.
The proof is very similar to that of Lemma 5.3.4, except that we plug v = z_J into
equation (5.3.3), where J = {i : |x_i| ≤ α}. Since z_J is supported on coordinates with
|x_i| ≤ α, the regularizer term z_J⊤∇²R(x)z_J vanishes. We thus obtain that x satisfies

‖P_Ω(z_J x⊤ + x z_J⊤)‖²_F ≥ 2 z_J⊤ P_Ω(M − xx⊤) z_J . (5.3.13)

Recall that x ∈ B′, so ‖x‖_∞ ≤ 4α, and we assumed w.l.o.g. that ‖x‖ ≤ 1. Moreover,
‖z_J‖_∞ ≤ µ/√d and ‖z_J‖ ≤ 1. Similarly to the derivation in the proof of Lemma 5.3.4,
we apply Theorem 5.7.1 (twice) and obtain that, with high probability over the choice
of Ω, for every such x, with ε = O(µ²(pd)^{−1/2}),

LHS of (5.3.13) = p‖z_J x⊤ + x z_J⊤‖²_F ± O(pε) = 2p‖x‖²‖z_J‖² + 2p⟨x, z_J⟩² ± O(pε) ,

RHS of (5.3.13) = 2( ⟨P_Ω(zz⊤), P_Ω(z_J z_J⊤)⟩ − ⟨P_Ω(xx⊤), P_Ω(z_J z_J⊤)⟩ ) (since M = zz⊤)
= 2p‖z_J‖⁴ − 2p⟨x, z_J⟩² ± O(pε) . (by Theorem 5.7.1)

(Again, using the τ-relaxed second order optimality condition does not affect the RHS
by much, so it does not change the later steps.) Plugging the estimates above back
into equation (5.3.13), we have

p‖x‖²‖z_J‖² + 2p⟨x, z_J⟩² ≥ p‖z_J‖⁴ − O(pε) .

Using Cauchy-Schwarz, ⟨x, z_J⟩² ≤ ‖x‖²‖z_J‖², and therefore we obtain
‖z_J‖²‖x‖² ≥ (1/3)‖z_J‖⁴ − O(ε).
Finally, we claim that ‖z_J‖² ≥ 1/2, which completes the proof, since
‖x‖² ≥ (1/3)‖z_J‖² − O(ε) ≥ 1/8.

Claim 5.3.9. Suppose α ≥ 4µ/√d and x satisfies ‖x‖_∞ ≤ 4α and ‖x‖² ≤ 2. Let
J = {i : |x_i| ≤ α}. Then we have ‖z_J‖² ≥ 1/2.

The claim can be proved simply as follows. Since ‖x‖² ≤ 2, we have |J^c| ≤ 2/α²,
and therefore ‖z_{J^c}‖² ≤ 2µ²/(dα²). This further implies that ‖z_J‖² = ‖z‖² −
‖z_{J^c}‖² ≥ 1 − 2µ²/(dα²) ≥ 1/2, because α ≥ 2µ/√d.
We will now deal with the first order optimality condition. We first write out the
basic extension of Lemma 5.3.5, which follows from the same proof, except that we
now include the regularizer term.

Lemma 5.3.10 (Basic extension of Lemma 5.3.5). With high probability over the
randomness of Ω, any x ∈ B′ that satisfies the first-order optimality condition (5.3.2)
also satisfies

‖⟨z, x⟩z − ‖x‖²x − γ · ∇R(x)‖ ≤ O(ε) , (5.3.14)

where ε = O(µ⁶(pd)^{−1/2}) and γ = λ/(2p) ≥ 0.
Next we show that we can remove the regularizer term; the main observation
here is that the nonzero entries of ∇R(x) all have the same sign as the corresponding
entries of x.

Lemma 5.3.11. Suppose x ∈ B′ satisfies ‖x‖² ≥ 1/8. Then, under the same
assumptions as Lemma 5.3.10, we have

‖⟨x, z⟩z − ‖x‖²x‖ ≤ O(ε) .
Proof. Let L = {i : |x_i| ≥ α}. For i ∉ L, we have (∇R(x))_i = 0. Therefore it
suffices to prove that for every i ∈ L,

(z_i⟨z, x⟩ − x_i‖x‖²)² ≤ (z_i⟨z, x⟩ − x_i‖x‖² − (γ∇R(x))_i)² ,

for which it suffices to prove

(∇R(x))_i (x_i‖x‖² − z_i⟨z, x⟩) ≥ 0 . (5.3.15)

Since (∇R(x))_i = γ_i x_i for some γ_i ≥ 0, we have

(∇R(x))_i · x_i‖x‖² = γ_i x_i² ‖x‖² ≥ (1/√8) γ_i x_i² ‖x‖ . (since ‖x‖² ≥ 1/8)

On the other hand, we have

(∇R(x))_i · z_i⟨z, x⟩ = γ_i x_i z_i ⟨z, x⟩ ≤ (1/4) γ_i x_i² ‖x‖‖z‖ . (by |x_i| ≥ α ≥ 4|z_i|)

Combining the two inequalities above, we obtain equation (5.3.15), which completes
the proof.
Finally, we combine Lemmas 5.3.7, 5.3.8, 5.3.11, and 5.3.6 to prove Lemma 5.3.3.
The arguments are also summarized in Figure 5.1, where we partition ℝ^d into regions
where our lemmas apply.
5.4 Rank-r Case
In this section we show how to extend the results of Section 5.3 to recover matrices
of rank r. We still use the proof strategy of Section 5.2, though for simplicity we only
write down the proof for the partial observation case; the analysis for the full
observation case (which was our starting point) can be obtained by substituting
[d] × [d] for Ω everywhere.

Recall that in this case we assume the original matrix M = ZZ⊤, where Z ∈ ℝ^{d×r},
and we assume Assumption 5.1.1. The objective function is very similar to the rank-1
case:

f(X) = (1/2) ‖P_Ω(M − XX⊤)‖²_F + λ R(X) , (5.4.1)

where R(X) = ∑_{i=1}^d h(‖X_i‖), with h(t) = (|t| − α)⁴ · 1{|t| ≥ α} as in the rank-1
case. Here α and λ are again parameters that will be determined later.
Without loss of generality, we assume that ‖Z‖²_F = r in this section. This implies that σ_max(Z) ≥ 1 ≥ σ_min(Z). We now state the first- and second-order optimality conditions:
Proposition 5.4.1. If X is a local optimum of objective function (5.4.1), its first-order optimality condition is
2P_Ω(M)X = 2P_Ω(XX^⊤)X + λ∇R(X) , (5.4.2)
and the second-order optimality condition is equivalent to
∀V ∈ R^{d×r} : ‖P_Ω(V X^⊤ + XV^⊤)‖²_F + λ⟨V, ∇²R(X)[V]⟩ ≥ 2⟨P_Ω(M − XX^⊤), V V^⊤⟩ . (5.4.3)
Note that the regularizer is now more complicated than in the one-dimensional case, but luckily we still have the following nice property.
Proposition 5.4.2. We have that ∇R(X) = ΓX, where Γ ∈ R^{d×d} is a diagonal matrix with Γ_ii = (4(‖X_i‖ − α)³/‖X_i‖)·1{‖X_i‖ ≥ α}. As a direct consequence, ⟨(∇R(X))_i, X_i⟩ ≥ 0 for every i ∈ [d].
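Proposition 5.4.2 can be sanity-checked numerically; the sketch below (ours) compares the closed form ∇R(X) = ΓX, with Γ_ii = 4(‖X_i‖ − α)³/‖X_i‖ on active rows, against a central finite-difference gradient of R.

```python
import numpy as np

def R(X, alpha):
    row_norms = np.linalg.norm(X, axis=1)
    return np.sum(np.maximum(row_norms - alpha, 0.0) ** 4)

def grad_R(X, alpha):
    # nabla R(X) = Gamma X, Gamma_ii = 4 (||X_i|| - alpha)^3 / ||X_i|| on rows with ||X_i|| >= alpha
    row_norms = np.linalg.norm(X, axis=1)
    gamma = 4.0 * np.maximum(row_norms - alpha, 0.0) ** 3 / np.maximum(row_norms, 1e-12)
    return gamma[:, None] * X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
alpha, eps = 0.5, 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (R(Xp, alpha) - R(Xm, alpha)) / (2 * eps)
max_error = np.max(np.abs(numeric - grad_R(X, alpha)))
```

Note that `np.maximum(row_norms - alpha, 0.0)` already zeroes out the inactive rows, so no explicit indicator is needed.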
Now we are ready to state the precise version of Theorem 5.1.2:
Theorem 5.4.3. Suppose p ≥ C max{µ⁶κ¹⁶r⁴, µ⁴κ⁴r⁶}·d⁻¹ log² d, where C is a large enough constant. Let α = 4µκr/√d and λ ≥ µ²rp/α². Then with high probability over the randomness of Ω, any local minimum X of f(·) satisfies f(X) = 0, and in particular ZZ^⊤ = XX^⊤.
The proof of this theorem follows a similar path as that of Theorem 5.3.2. We first notice that, because of the regularizer, any matrix X that satisfies the first-order optimality condition must be somewhat incoherent (this is analogous to Lemma 5.3.7):
Lemma 5.4.4. Suppose |S_i| ≤ 2pd for every i. Then for any X satisfying the first-order optimality condition (5.4.2), we have
‖X‖_{2→∞} = max_i ‖X_i‖ ≤ 4 max{α, µ√(rp/λ)} . (5.4.4)
Proof. Let i* = argmax_i ‖X_i‖, and suppose the i-th row of Ω consists of the entries with indices {i} × S_i. If ‖X_{i*}‖ ≤ 2α, then we are done. Therefore in the rest of the proof we assume ‖X_{i*}‖ ≥ 2α.
We will compare the i*-th rows of the LHS and RHS of (5.4.2). For preparation, we have
(P_Ω(M)X)_{i*} = (P_Ω(ZZ^⊤)X)_{i*} = (P_Ω(ZZ^⊤))_{i*} X . (5.4.5)
Then we have that
‖(P_Ω(ZZ^⊤))_{i*}‖₁ = ∑_{j∈S_{i*}} |⟨Z_{i*}, Z_j⟩| ≤ ∑_{j∈S_{i*}} ‖Z_{i*}‖‖Z_j‖ ≤ |S_{i*}|·µ²r/d (by incoherence of Z)
≤ 2µ²rp . (by |S_{i*}| ≤ 2pd)
Therefore we can bound the ℓ₂ norm of the LHS of the first-order optimality condition (5.4.2) by
‖(P_Ω(ZZ^⊤)X)_{i*}‖ ≤ ‖(P_Ω(ZZ^⊤))_{i*}‖₁ · ‖X^⊤‖_{1→2}
≤ 2µ²rp ‖X‖_{2→∞} (by ‖X‖_{2→∞} = ‖X^⊤‖_{1→2})
= 2µ²rp ‖X_{i*}‖ . (5.4.6)
Next we lower-bound the norm of the RHS of equation (5.4.2). We have that
(P_Ω(XX^⊤)X)_{i*} = ∑_{j∈S_{i*}} ⟨X_{i*}, X_j⟩X_j = X_{i*} (∑_{j∈S_{i*}} X_j^⊤X_j) ,
which implies that
⟨(P_Ω(XX^⊤)X)_{i*}, X_{i*}⟩ = X_{i*} (∑_{j∈S_{i*}} X_j^⊤X_j) X_{i*}^⊤ ≥ 0 . (5.4.7)
Using Proposition 5.4.2 we obtain that
⟨(P_Ω(XX^⊤)X)_{i*}, (∇R(X))_{i*}⟩ = Γ_{i*i*} · X_{i*} (∑_{j∈S_{i*}} X_j^⊤X_j) X_{i*}^⊤ ≥ 0 . (5.4.8)
It follows that
‖(P_Ω(XX^⊤)X)_{i*} + λ(∇R(X))_{i*}‖ ≥ ‖λ(∇R(X))_{i*}‖ (by equation (5.4.8))
= 4λ(‖X_{i*}‖ − α)³ (by Proposition 5.4.2)
≥ (λ/2)·‖X_{i*}‖³ . (by the assumption ‖X_{i*}‖ ≥ 2α)
Plugging this and equation (5.4.6) into the first-order optimality condition (5.4.2), we obtain ‖X_{i*}‖ ≤ √(8µ²rp/λ), which completes the proof.
Next, we prove a property implied by the first-order optimality condition, which is similar to Lemma 5.3.10.
Lemma 5.4.5. In the setting of Theorem 5.4.3, with high probability over the choice of Ω, for any X that satisfies the first-order optimality condition (5.4.2), we have
‖X‖²_F ≤ 2rσ_max(Z)² . (5.4.9)
Moreover, we have
σ_max(X) ≤ 2σ_max(Z)·r^{1/6} , (5.4.10)
and
‖ZZ^⊤X − XX^⊤X − γ∇R(X)‖_F ≤ O(δ) , (5.4.11)
where δ = O(µ³κ³r² log^{0.75}(d)·σ_max(Z)^{−3}·(dp)^{−1/2}) and γ = λ/(2p) ≥ 0.
Proof. If ‖X‖_F ≤ √(2r)·σ_max(Z) we are done, so assume ‖X‖_F ≥ √(2r)·σ_max(Z). By Lemma 5.4.4, we have that max_i ‖X_i‖ ≤ 4α = 4µκr/√d, and therefore max_i ‖X_i‖ ≤ ν‖X‖_F/√d with ν = O(µκ√r/σ_max(Z)). Then by Theorem 5.7.2, we have that
‖P_Ω(ZZ^⊤)X − pZZ^⊤X‖_F ≤ pδ
and
‖P_Ω(XX^⊤)X − pXX^⊤X‖_F ≤ pδ ,
where δ = O(µ³κ³r² log^{0.75}(d)·σ_max(Z)^{−3}·(dp)^{−1/2}). These two bounds, together with the first-order optimality condition (5.4.2), imply equation (5.4.11). Moreover, we have
p‖ZZ^⊤X‖_F = ‖P_Ω(ZZ^⊤)X‖_F ± pδ
= ‖P_Ω(XX^⊤)X + λ∇R(X)‖_F ± pδ (by equation (5.4.2))
≥ ‖P_Ω(XX^⊤)X‖_F ± pδ (by equation (5.4.8))
≥ p‖XX^⊤X‖_F ± 2pδ . (5.4.12)
Suppose X has singular values σ₁ ≥ · · · ≥ σ_r. Then ‖ZZ^⊤X‖²_F ≤ ‖ZZ^⊤‖²‖X‖²_F ≤ σ_max(Z)⁴‖X‖²_F = σ_max(Z)⁴(σ₁² + · · · + σ_r²). On the other hand, ‖XX^⊤X‖²_F = σ₁⁶ + · · · + σ_r⁶. Therefore, equation (5.4.12) implies that
(1 + O(δ))·σ_max(Z)⁴ ∑_{i=1}^r σ_i² ≥ ∑_{i=1}^r σ_i⁶ ,
and the conclusion then follows from Proposition 8.3.4.
Now we turn to the second-order optimality condition, which implies that the smallest singular value of X is large (similar to Lemma 5.3.8). Note that this lemma remains true even if X only satisfies a relaxed second-order optimality condition with τ = 0.01pσ_min(Z).
Lemma 5.4.6. In the setting of Theorem 5.4.3, with high probability over the choice of Ω, suppose X satisfies equations (5.4.9) and (5.4.4) and the second-order optimality condition (5.4.3). Then
σ_min(X) ≥ (1/4)·σ_min(Z) . (5.4.13)
Proof. Let J = {i : ‖X_i‖ ≤ α}, and let v ∈ R^r be such that ‖Xv‖ = σ_min(X). Let Z_J be the matrix that has the same i-th row as Z for every i ∈ J and 0 elsewhere. Since Z_J has column rank at most r, by the variational characterization of singular values there exists a unit vector z_J ∈ col-span(Z_J) such that ‖X^⊤z_J‖ ≤ σ_min(X).
We claim that σ_min(Z_J) ≥ (1/2)·σ_min(Z). Let L = [d] − J. Since ‖X_i‖ ≥ α for any i ∈ L, we have |L|α² ≤ ‖X‖²_F ≤ 2rσ_max(Z)² (by equation (5.4.9)), and it follows that |L| ≤ 2rσ_max(Z)²/α². Therefore,
σ_min(Z_J) ≥ σ_min(Z) − σ_max(Z_L) ≥ σ_min(Z) − ‖Z_L‖_F
≥ σ_min(Z) − √(|L|rµ²/d) ≥ σ_min(Z) − √(2r²σ_max(Z)²µ²/(α²d))
≥ (1/2)·σ_min(Z) . (by α ≥ rκµ/√d)
Since z_J ∈ col-span(Z_J) is a unit vector, z_J can be written as z_J = Z_Jβ where ‖β‖ ≤ 1/σ_min(Z_J) ≤ O(1/σ_min(Z)). This in turn implies that ‖z_J‖_∞ ≤ ‖Z_J‖_{2→∞}‖β‖ ≤ O(µ√(r/d)/σ_min(Z)) ≤ O(µκ√(r/d)).
We plug V = z_Jv^⊤ into the second-order optimality condition (5.4.3). Since z_J ∈ col-span(Z_J), it is supported on the subset J, and therefore ∇²R(X)[V] = 0, so the regularization term in (5.4.3) vanishes. For simplicity, let y = X^⊤z_J and w = Xv. Taking V = z_Jv^⊤ in equation (5.4.3) yields
‖P_Ω(wz_J^⊤ + z_Jw^⊤)‖²_F ≥ 2⟨P_Ω(ZZ^⊤ − XX^⊤), z_Jz_J^⊤⟩ .
Note that we have ‖w‖_∞ ≤ ‖X‖_{2→∞}‖v‖ ≤ µ√(r/d). Recalling that ‖z_J‖_∞ ≤ O(µκ√(r/d)), by Theorem 5.7.1 we have that
p‖wz_J^⊤ + z_Jw^⊤‖²_F ≥ 2p⟨ZZ^⊤ − XX^⊤, z_Jz_J^⊤⟩ − δp ,
where δ = O(µ²κr²(pd)^{−1/2}). Then simple algebraic manipulation gives that
⟨w, z_J⟩² + ‖w‖²‖z_J‖² + ‖X^⊤z_J‖² ≥ ‖Z^⊤z_J‖² − δ/2 . (5.4.14)
Note that ⟨w, z_J⟩ = ⟨v, X^⊤z_J⟩ = ⟨y, v⟩. Recall that ‖z_J‖ = 1 and z_J ∈ col-span(Z_J), and therefore ‖Z^⊤z_J‖ = ‖Z_J^⊤z_J‖ ≥ σ_min(Z_J). Moreover, recall that ‖y‖ = ‖X^⊤z_J‖ ≤ σ_min(X). Using these, we can bound the LHS of equation (5.4.14):
⟨w, z_J⟩² + ‖w‖²‖z_J‖² + ‖X^⊤z_J‖² ≤ ⟨y, v⟩² + ‖w‖² + ‖y‖²
≤ 2‖y‖² + σ²_min(X) (by Cauchy–Schwarz and ‖w‖ = σ_min(X))
≤ 3σ²_min(X) . (by ‖y‖ ≤ σ_min(X))
Together with equation (5.4.14) and ‖Z^⊤z_J‖ ≥ σ_min(Z_J), this gives
σ_min(X) ≥ (1/2 − O(δ))·σ_min(Z_J) . (5.4.15)
Combining equation (5.4.15) with the lower bound on σ_min(Z_J), we complete the proof.
As before, we show that it is possible to remove the regularizer term; again the intuition is that the regularizer always points in the same direction as X.
Lemma 5.4.7. Suppose X satisfies equations (5.4.4), (5.4.13), and (5.4.10). Then for any γ ≥ 0,
‖ZZ^⊤X − XX^⊤X‖²_F ≤ ‖ZZ^⊤X − XX^⊤X − γ∇R(X)‖²_F . (5.4.16)
Proof. Let L = {i : ‖X_i‖ ≥ α}. For i ∉ L, we have that (∇R(X))_i = 0. Therefore it suffices to prove that for every i ∈ L,
‖Z_iZ^⊤X − X_iX^⊤X‖² ≤ ‖Z_iZ^⊤X − X_iX^⊤X − γ(∇R(X))_i‖² ,
and for this it suffices to prove that
⟨(∇R(X))_i, X_iX^⊤X − Z_iZ^⊤X⟩ ≥ 0 . (5.4.17)
By Proposition 5.4.2, we have (∇R(X))_i = Γ_ii X_i for some Γ_ii ≥ 0. Then
⟨(∇R(X))_i, X_iX^⊤X⟩ = Γ_ii⟨X_i, X_iX^⊤X⟩ ≥ Γ_ii‖X_i‖²·σ_min(X^⊤X)
≥ (1/4)·Γ_ii‖X_i‖²·σ_min(Z)² . (by equation (5.4.13))
On the other hand, we have
⟨(∇R(X))_i, Z_iZ^⊤X⟩ = Γ_ii⟨X_i, Z_iZ^⊤X⟩ ≤ Γ_ii‖X_i‖‖Z_i‖·σ_max(Z^⊤X)
≤ Γ_ii‖X_i‖‖Z_i‖·σ_max(Z)σ_max(X)
≤ Γ_ii‖X_i‖‖Z_i‖·σ_max(Z)²r^{1/6} (by equation (5.4.10))
≤ (1/10)·Γ_ii‖X_i‖²·σ_min(Z)²r^{−1/3} . (by ‖X_i‖ ≥ α ≥ 10√r·κ²‖Z_i‖)
Combining the two inequalities above we obtain equation (5.4.17), which completes the proof.
Finally we show that a small value of ‖ZZ^⊤X − XX^⊤X‖_F implies that ZZ^⊤ is close to XX^⊤ (this is similar to Lemma 5.3.6).
Lemma 5.4.8. Suppose X and Z satisfy σ_min(X) ≥ (1/4)·σ_min(Z) and
‖ZZ^⊤X − XX^⊤X‖_F ≤ δ ,
where δ ≤ σ³_min(Z)/C for a large enough constant C. Then
‖XX^⊤ − ZZ^⊤‖_F ≤ O(δκ²/σ_min(Z)) .
Proof. The proof is similar to the one-dimensional case: we separate Z into the component in the column span of X and the component in its orthogonal complement. We then show that the projection of Z onto the column span is close to X, and the projection onto the orthogonal complement must be small.
Let Z = U + V, where U = Proj_{span(X)} Z is the projection of Z onto the column span of X, and V is the projection onto the orthogonal complement. Since V^⊤X = 0, we know
ZZ^⊤X = (U + V)(U + V)^⊤X = UU^⊤X + V U^⊤X .
Here the columns of the first term UU^⊤X are in the column span of X, and the columns of the second term V U^⊤X are in the orthogonal complement. Therefore,
‖ZZ^⊤X − XX^⊤X‖²_F = ‖UU^⊤X − XX^⊤X‖²_F + ‖V U^⊤X‖²_F ≤ δ² .
In particular, both terms are bounded by δ². Therefore ‖UU^⊤ − XX^⊤‖²_F ≤ δ²/σ²_min(X) ≤ 16δ²/σ²_min(Z).
Also, we know σ_min(UU^⊤X) ≥ σ_min(XX^⊤X) − δ ≥ σ_min(Z)³/128 when δ ≤ σ_min(Z)³/128. Therefore σ_min(U^⊤X) is at least σ_min(Z)³/(128‖Z‖). Now ‖V‖²_F ≤ δ²/σ_min(U^⊤X)² ≤ O(δ²‖Z‖²/σ_min(Z)⁶).
Finally, we can bound ‖UV^⊤‖_F by ‖U‖‖V‖_F ≤ ‖Z‖‖V‖_F (the last inequality because U is a projection of Z), which is at least Ω(‖V‖²_F) when δ ≤ σ_min(Z)³/128; therefore
‖ZZ^⊤ − XX^⊤‖_F ≤ ‖UU^⊤ − XX^⊤‖_F + 2‖UV^⊤‖_F + ‖V V^⊤‖_F ≤ O(δ‖Z‖²/σ_min(Z)³) .
The last ingredient needed to prove the main theorem is a result of Sun and Luo [163], which shows that whenever XX^⊤ is close to ZZ^⊤, the objective function is essentially strongly convex, and the only points with zero gradient are points where XX^⊤ = ZZ^⊤; this is made precise in Lemma 5.6.1. Now we are ready to prove Theorem 5.4.3:
Proof of Theorem 5.4.3. Suppose X satisfies the first- and second-order optimality conditions. Then by Lemma 5.4.5 and Lemma 5.4.4, we have that X satisfies equations (5.4.4), (5.4.9), (5.4.10) and (5.4.11). Then by Lemma 5.4.6, we obtain that σ_min(X) ≥ (1/4)·σ_min(Z). Now by Lemma 5.4.7 and equation (5.4.11), we have that ‖ZZ^⊤X − XX^⊤X‖_F ≤ δ for δ ≤ cσ_min(Z)³/κ² with a sufficiently small constant c. Then by Lemma 5.4.8, we obtain that ‖ZZ^⊤ − XX^⊤‖_F ≤ cσ_min(Z)² for a sufficiently small constant c. By Lemma 5.6.1, in this region the only points that satisfy the first-order optimality condition must satisfy XX^⊤ = ZZ^⊤.
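Since the theorem guarantees that all local minima are global, simple local search should succeed from a random initialization. The toy experiment below is our own illustration (fully observed case, regularizer off, step size and dimensions our choices, not from the text): plain gradient descent on (1/2)‖M − XX^⊤‖²_F drives XX^⊤ to ZZ^⊤.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 30, 3
Z = rng.normal(size=(d, r)) / np.sqrt(d)
M = Z @ Z.T                                # ground truth, fully observed

X = rng.normal(size=(d, r)) / np.sqrt(d)   # random initialization
eta = 0.05                                 # step size (our choice)
for _ in range(20000):
    grad = 2.0 * (X @ X.T - M) @ X         # gradient of 1/2 ||M - X X^T||_F^2
    X -= eta * grad

recovery_error = np.linalg.norm(X @ X.T - M)
```

Note that X itself need not converge to Z (the factorization is only unique up to rotation), but XX^⊤ matches ZZ^⊤.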
Handling Noise To handle noise, note that we can only hope to obtain an approximate solution in the presence of noise, and that our lemmas depend only on concentration bounds, which still apply in the noisy setting. See Section 5.5 for details.
5.5 Handling Noise
Suppose that instead of observing the matrix ZZ^⊤, we actually observe a noisy version M = ZZ^⊤ + N, where N is a Gaussian matrix with independent N(0, σ²) entries. In this case we should not hope to recover ZZ^⊤ exactly (as two nearby Z's may generate the same observation). In this section we show that our arguments still hold even with fairly large noise.
Theorem 5.5.1. Let µ̄ = max{µ, (4σd√(log d)/r)^{1/2}}. Suppose p ≥ Cµ̄⁶κ¹²r⁴d⁻¹ε⁻² log^{1.5} d, where C is a large enough constant. Let α = 2µ̄κr/√d and λ ≥ µ̄²rp/α². Then with high probability over the randomness of Ω, any local minimum X of f(·) satisfies
‖XX^⊤ − ZZ^⊤‖_F ≤ ε .
In fact, a noise level with σ√(log d) as large as µ²r/d (that is, noise almost as large as the maximum possible entry) does not change the conclusions of the lemmas in this section.
Proof. There are only three places in the proof where the noise makes a difference: (1) the infinity-norm bound on M, used in Lemma 5.4.4; (2) the LHS of the first-order optimality condition (equation (5.4.2)); (3) the RHS of the second-order optimality condition (equation (5.4.3)).
What we require in these three steps is: (1) |M|_∞ should be at most µ̄²r/d; (2) the noise correlation should satisfy |⟨P_Ω(N), P_Ω(W)⟩| ≤ O(σ|W|_∞ dr log d + √(pd²rσ²|W|_∞‖W‖_F)·log d); (3) ‖P_Ω(N)‖ ≤ εp‖ZZ^⊤‖_F. With µ̄ = max{µ, (4σd√(log d)/r)^{1/2}}, all of these are satisfied (by Lemmas 5.7.5 and 5.7.6).
Following the proof, we then see δ ≤ cεσ_min(Z)/κ² for a small enough constant c, and by Lemma 5.4.8 we know ‖XX^⊤ − ZZ^⊤‖_F ≤ ε.
5.6 Finding the Exact Factorization
In Section 5.4, we showed that any point that satisfies the first- and second-order necessary conditions must satisfy ‖XX^⊤ − ZZ^⊤‖_F ≤ c for a small enough constant c. In this section we show that in fact XX^⊤ must be exactly equal to ZZ^⊤. The proof technique here is mostly based on the work of Sun and Luo [163]. However, we have to modify their proof because we use a slightly different regularizer, and we work in the symmetric case. The main lemma of [163] can be rephrased as follows in our setting:
Lemma 5.6.1 (Analog of Lemma 3.1 in [163]). Suppose p ≥ Cµ⁴r⁶κ⁴d⁻¹ log d for a large enough absolute constant C, and let ε = σ_min(Z)²/100. With high probability over the randomness of Ω, we have that for any point X in the set
B_ε = {X ∈ R^{d×r} : ‖XX^⊤ − ZZ^⊤‖_F ≤ ε, ‖X‖_{2→∞} ≤ 16µκr/√d} , (5.6.1)
there exists a matrix U such that UU^⊤ = ZZ^⊤ and
⟨∇f(X), X − U⟩ ≥ (p/4)·‖M − XX^⊤‖²_F .
As a consequence, any point X in the set B_ε that satisfies the first-order optimality condition must be a global optimum (or, equivalently, satisfy XX^⊤ = ZZ^⊤).
Recall that f(X) = (1/2)‖P_Ω(M − XX^⊤)‖²_F + λR(X). The proof of Lemma 5.6.1 consists of three steps:
1. The regularizer has nonnegative correlation with X − U: for any U such that UU^⊤ = ZZ^⊤, we have ⟨∇R(X), X − U⟩ ≥ 0 (Claim 5.6.3).
2. There exists a matrix U such that UU^⊤ = ZZ^⊤ and U is close to X (Claim 5.6.4).
3. Argue that ⟨∇f(X), X − U⟩ ≥ (p/4)‖P_Ω(M − XX^⊤)‖²_F when U is close to X (see the proof of Lemma 5.6.1).
Before going into details, the first useful observation is that all matrices U with UU^⊤ = ZZ^⊤ have the same row norms.
Claim 5.6.2. Suppose U, Z ∈ R^{d×r} satisfy UU^⊤ = ZZ^⊤. Then, for any i ∈ [d] we have ‖U_i‖ = ‖Z_i‖. Consequently, ‖U‖_F = ‖Z‖_F.
Proof. Suppose UU^⊤ = ZZ^⊤; then we have U = ZR where R is an orthonormal matrix. In particular, the i-th row of U is equal to U_i = Z_iR. Since the ℓ₂ norm (and the Frobenius norm) is preserved under multiplication by an orthonormal matrix, we know ‖U_i‖ = ‖Z_i‖. The Frobenius norm bound follows immediately.
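Claim 5.6.2 is easy to check numerically: multiplying Z on the right by any orthonormal R preserves both the Gram matrix and every row norm. A quick sketch (dimensions and construction are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 12, 3
Z = rng.normal(size=(d, r))

# random orthonormal r x r matrix via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
U = Z @ Q

same_gram = np.allclose(U @ U.T, Z @ Z.T)                # U U^T = Z Z^T
same_row_norms = np.allclose(np.linalg.norm(U, axis=1),
                             np.linalg.norm(Z, axis=1))  # ||U_i|| = ||Z_i||
```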
Note that this simple observation is only true in the symmetric case. This claim serves the same role as the bounds on the row norms of U, V in the asymmetric case (Propositions 4.1 and 4.2 of [163]).
Next we are ready to argue that the regularizer is always positively correlated
with X − U .
Claim 5.6.3. For any U such that UU^⊤ = ZZ^⊤, we have
⟨∇R(X), X − U⟩ ≥ 0 .
Proof. Since the regularizer is applied independently to individual rows, we can rewrite ⟨∇R(X), X − U⟩ = ∑_{i=1}^d ⟨(∇R(X))_i, X_i − U_i⟩ and focus on the i-th row.
For each row X_i, (∇R(X))_i is 0 when ‖X_i‖ ≤ 2µ√r/√d; in that case ⟨(∇R(X))_i, X_i − U_i⟩ = 0.
When ‖X_i‖ is larger than 2µ√r/√d, we know (∇R(X))_i is always in the same direction as X_i. In this case λ(∇R(X))_i = γX_i for some γ > 0, and ‖X_i‖ ≥ 2µ√r/√d ≥ 2‖Z_i‖ = 2‖U_i‖ (where the last equality is by Claim 5.6.2). Therefore, by Cauchy–Schwarz,
⟨X_i, X_i − U_i⟩ ≥ ‖X_i‖² − ‖X_i‖‖U_i‖ ≥ ‖X_i‖²/2 > 0 .
This then implies ⟨λ(∇R(X))_i, X_i − U_i⟩ = γ⟨X_i, X_i − U_i⟩ > 0.
Next we prove that the gradient of (1/2)‖P_Ω(M − XX^⊤)‖²_F has a large correlation with X − U. This is analogous to Proposition 4.2 in [163].
Claim 5.6.4. Suppose ‖XX^⊤ − M‖_F = ε ≤ σ_min(Z)²/100. Then there exists a matrix U such that UU^⊤ = M and ‖X − U‖_F ≤ 5ε√r/σ_min(Z)².
Proof. Without loss of generality we assume M is a diagonal matrix whose first r diagonal entries are σ₁(Z)², σ₂(Z)², . . . , σ_r(Z)² (this can be done by a change of basis); that is, M = diag(σ₁(Z)², . . . , σ_r(Z)², 0, . . . , 0). We use M′ to denote the first r × r principal submatrix of M.
We write X = [V; W], where V ∈ R^{r×r} contains the first r rows of X and W ∈ R^{(d−r)×r} contains the remaining rows. We similarly write U = [P; Q], where P and Q denote the first r rows and the remaining rows, respectively.
To construct U, we first notice that Q must be the zero matrix, since M is nonzero only in its top-left r × r block. A natural guess for P is then a "normalized" version of V.
Concretely, we construct P := V S = V (V^⊤(M′)^{−1}V)^{−1/2}, where S := (V^⊤(M′)^{−1}V)^{−1/2}. The difference between U and X then satisfies ‖U − X‖_F ≤ ‖P − V‖_F + ‖W‖_F.
Since ‖XX^⊤ − M‖_F ≤ ε, we know ‖M′ − V V^⊤‖²_F + 2‖V W^⊤‖²_F ≤ ε²; in particular, both terms are smaller than ε².
First, we bound ‖W‖_F. Note that since ‖M′ − V V^⊤‖_F ≤ ε ≤ σ_min(Z)²/100, we know σ_min(V)² ≥ 0.99σ_min(Z)², and therefore σ_min(V) ≥ 0.9σ_min(Z). Now we know ‖W‖_F ≤ ‖V W^⊤‖_F/σ_min(V) ≤ 2ε/σ_min(Z).
Next we bound ‖P − V‖_F. Since ‖M′ − V V^⊤‖_F ≤ ε ≤ σ_min(Z)²/100, we know
(1 − 2ε/σ_min(Z)²)·V V^⊤ ⪯ M′ ⪯ (1 + 2ε/σ_min(Z)²)·V V^⊤ .
This implies ‖V‖_F ≤ 1.1‖Z‖_F and (1 − 2ε/σ_min(Z)²)·I ⪯ V^⊤(M′)^{−1}V ⪯ (1 + 2ε/σ_min(Z)²)·I. Therefore the matrix S is also very close to the identity; in particular, ‖S − I‖ ≤ 2ε/σ_min(Z)². Now we know ‖P − V‖_F = ‖V(S − I)‖_F ≤ ‖V‖_F‖S − I‖ ≤ 3ε‖Z‖_F/σ_min(Z)². Using the fact that ‖Z‖_F = √r, we conclude ‖U − X‖_F ≤ ‖P − V‖_F + ‖W‖_F ≤ 5ε√r/σ_min(Z)².
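The construction of U in this proof can also be checked numerically: with S = (V^⊤(M′)^{−1}V)^{−1/2}, the matrix P = VS satisfies PP^⊤ = M′ whenever V is invertible. A small sketch (ours, with the inverse square root computed by eigendecomposition; the perturbation scale 0.01 is an arbitrary choice in the regime of the claim):

```python
import numpy as np

def inv_sqrt(A):
    # inverse square root of a symmetric positive definite matrix
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(3)
r = 4
Mp = np.diag(rng.uniform(1.0, 2.0, size=r))      # M' = diag(sigma_i(Z)^2)
# V: a small perturbation of M'^{1/2}, as in the setting of the claim
V = np.sqrt(Mp) + 0.01 * rng.normal(size=(r, r))

S = inv_sqrt(V.T @ np.linalg.inv(Mp) @ V)
P = V @ S
reconstructs = np.allclose(P @ P.T, Mp)           # P P^T = M' holds exactly
S_near_identity = np.linalg.norm(S - np.eye(r), 2) < 0.2
```

The identity PP^⊤ = V(V^⊤(M′)^{−1}V)^{−1}V^⊤ = M′ holds for any invertible V; closeness of S to I (and hence of P to V) is what uses the assumption that VV^⊤ ≈ M′.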
We can now combine this claim with a sampling lemma to show that ‖P_Ω((X − U)(X − U)^⊤)‖²_F is small:
Lemma 5.6.5. Under the same setting as Lemma 5.6.1, with probability at least 1 − 1/(2n)⁴ over the choice of Ω, if U satisfies the conclusion of Claim 5.6.4, then
‖P_Ω((X − U)(X − U)^⊤)‖²_F ≤ (p/25)·‖M − XX^⊤‖²_F .
Intuitively, this lemma is true because ‖(X − U)(X − U)^⊤‖_F ≤ 25‖M − XX^⊤‖²_F · r/σ_min(Z)⁴, which is much smaller than ‖M − XX^⊤‖_F when ‖M − XX^⊤‖_F is small. By concentration inequalities we expect ‖P_Ω((X − U)(X − U)^⊤)‖²_F to be roughly equal to p‖(X − U)(X − U)^⊤‖²_F, and therefore much smaller than p‖M − XX^⊤‖²_F. The proof of this lemma is exactly the same as that of Proposition 4.3 in [163] (in fact, it is directly implied by Proposition 4.3), so we omit it here.
We also need a different concentration bound, for the norm of the projection of the matrix a = U(X − U)^⊤ + (X − U)U^⊤. Unlike the previous lemma, here we want ‖P_Ω(a)‖_F to be large.
Lemma 5.6.6. Under the same setting as Lemma 5.6.1, let a = U(X − U)^⊤ + (X − U)U^⊤, where U is constructed as in Claim 5.6.4. Then, with high probability, we have that for any X ∈ B_ε,
‖P_Ω(a)‖²_F ≥ (5p/6)·‖a‖²_F .
Intuitively this should be true because a lies in the tangent space {UW^⊤ + W′U^⊤ : W, W′ ∈ R^{d×r}}, which has dimension O(dr). The proof follows from Theorem 3.4 of [145], and is written out in detail in Equations (37)–(41) of [163].
Finally we are ready to prove the main lemma. The proof is the same as the outline given in Section 4.1 of [163]; we give it here for completeness.
Proof of Lemma 5.6.1. Note that f(X) is equal to h(X) + λR(X), where h(X) = (1/2)‖P_Ω(M − XX^⊤)‖²_F and R(X) is the regularizer. By Claim 5.6.3 we know ⟨∇R(X), X − U⟩ ≥ 0, so we only need to prove that there exists a U such that UU^⊤ = ZZ^⊤ and ⟨∇h(X), X − U⟩ ≥ (p/4)‖M − XX^⊤‖²_F.
Define a = U(X − U)^⊤ + (X − U)U^⊤ and b = (U − X)(U − X)^⊤; then XX^⊤ − M = a + b and (X − U)X^⊤ + X(X − U)^⊤ = a + 2b.
Now
⟨∇h(X), X − U⟩ = 2⟨P_Ω(XX^⊤ − M)X, X − U⟩
= ⟨P_Ω(XX^⊤ − M), (X − U)X^⊤ + X(X − U)^⊤⟩
= ⟨P_Ω(a + b), P_Ω(a + 2b)⟩
= ‖P_Ω(a)‖²_F + 2‖P_Ω(b)‖²_F + 3⟨P_Ω(a), P_Ω(b)⟩
≥ ‖P_Ω(a)‖²_F + 2‖P_Ω(b)‖²_F − 3‖P_Ω(a)‖_F‖P_Ω(b)‖_F .
Let ε = ‖M − XX^⊤‖_F. Note that from Claim 5.6.4 and Lemma 5.6.5, we know
‖b‖_F ≤ ε/10 , ‖P_Ω(b)‖_F ≤ √p·ε/5 .
Therefore, as long as we can show that ‖P_Ω(a)‖_F is large, we are done. This is true because ‖a‖_F ≥ ‖M − XX^⊤‖_F − ‖b‖_F ≥ 9ε/10. Hence by Lemma 5.6.6 we know
‖P_Ω(a)‖²_F ≥ (5p/6)·‖a‖²_F ≥ (27/40)·pε² .
Combining the bounds for ‖P_Ω(a)‖_F and ‖P_Ω(b)‖_F, we know ⟨∇h(X), X − U⟩ ≥ (p/4)‖M − XX^⊤‖²_F. Together with the fact that ⟨∇R(X), X − U⟩ ≥ 0, we know
⟨∇f(X), X − U⟩ ≥ (p/4)·‖M − XX^⊤‖²_F .
5.7 Concentration Inequalities
In this section we prove the concentration inequalities used in the main part of the chapter. We first show that the inner product of two low-rank matrices is approximately preserved after restricting to the observed entries. This is mostly used in the arguments about the second-order necessary conditions.
Theorem 5.7.1. With high probability over the choice of Ω, for any two rank-r matrices W, Z ∈ R^{d×d}, we have
|⟨P_Ω(W), P_Ω(Z)⟩ − p⟨W, Z⟩| ≤ O(|W|_∞|Z|_∞ dr log d + √(pdr·|W|_∞|Z|_∞‖W‖_F‖Z‖_F)·log d) .
Proof. Since both the LHS and the RHS are bilinear in W and Z, without loss of generality we assume that the Frobenius norms of W and Z are both equal to 1. Note that in this case we should expect |W|_∞ ≥ 1/d.
Let δ_{i,j} be the indicator variable for Ω. We know
⟨P_Ω(W), P_Ω(Z)⟩ = ∑_{i,j} δ_{i,j}W_{i,j}Z_{i,j} ,
and in expectation it is equal to p⟨W, Z⟩. Let Q = ∑_{i,j}(δ_{i,j} − p)·W_{i,j}Z_{i,j}. We can then view Q as a sum of independent terms (note that δ_{i,j} = δ_{j,i}, but we can simply merge the two terms and the variance is at most a factor 2 larger). The expectation is E[Q] = 0, each term in the sum is bounded by |W|_∞|Z|_∞, and the variance is bounded by
V[Q] ≤ p∑_{i,j}(W_{i,j}Z_{i,j})² ≤ p·max_{i,j}|W_{i,j}|²·∑_{i,j}Z²_{i,j} ≤ p|W|²_∞ .
Similarly, we also know V[Q] ≤ p|Z|²_∞, and hence V[Q] ≤ p|W|_∞|Z|_∞.
Now we can apply Bernstein's inequality: with probability at most η,
|Q − E[Q]| ≥ O(|W|_∞|Z|_∞ log(1/η) + √(p|W|_∞|Z|_∞ log(1/η))) .
By Proposition 8.3.3, there is a set Γ of size d^{O(dr)} such that for any rank-r matrix X there is a matrix X̄ ∈ Γ with ‖X − X̄‖_F ≤ 1/d³. When W and Z come from this set, we can set η = d^{−Cdr} for a large enough constant C. By a union bound, with high probability
|Q − E[Q]| ≤ O(|W|_∞|Z|_∞ dr log d + √(pdr·|W|_∞|Z|_∞)·log d) .
When W and Z are not from the set Γ, let W̄ and Z̄ be the closest matrices in Γ; then we know |⟨P_Ω(W), P_Ω(Z)⟩ − p⟨W, Z⟩ − (⟨P_Ω(W̄), P_Ω(Z̄)⟩ − p⟨W̄, Z̄⟩)| ≤ O(1/d³) ≤ |W|_∞|Z|_∞ dr log d. Therefore we still have
|⟨P_Ω(W), P_Ω(Z)⟩ − p⟨W, Z⟩| ≤ O(|W|_∞|Z|_∞ dr log d + √(pdr·|W|_∞|Z|_∞‖W‖_F‖Z‖_F)·log d) .
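The statement of Theorem 5.7.1 is easy to observe in simulation: for two fixed low-rank matrices, ⟨P_Ω(W), P_Ω(Z)⟩ concentrates around p⟨W, Z⟩. A quick Monte Carlo sketch (the dimensions and sampling rate are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, p = 200, 2, 0.3
W = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / d   # rank-r matrix
Zm = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / d  # rank-r matrix

mask = rng.random((d, d)) < p                  # each entry observed with probability p
sampled = np.sum((mask * W) * (mask * Zm))     # <P_Omega(W), P_Omega(Z)>
expected = p * np.sum(W * Zm)                  # p <W, Z>
rel_error = abs(sampled - expected) / (p * np.linalg.norm(W) * np.linalg.norm(Zm))
```

For these parameters the relative error (measured against p‖W‖_F‖Z‖_F, matching the bilinear normalization in the proof) is on the order of a percent.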
The next theorem shows that P_Ω(XX^⊤)X is roughly equal to pXX^⊤X; this is one of the major terms in the gradient.
Theorem 5.7.2. When p ≥ Cν⁶r log² d/(dε²) for a large enough constant C, with high probability over the randomness of Ω, for any matrix X ∈ R^{d×r} such that ‖X_i‖ ≤ ν·√(1/d)·‖X‖_F for every i, we have
‖P_Ω(XX^⊤)X − pXX^⊤X‖_F ≤ pε‖X‖³_F . (5.7.1)
Proof. Without loss of generality we assume ‖X‖_F = 1. Let δ_{i,j} be the indicator variable for Ω. We first prove the result when the δ_{i,j} are independent, and then use standard techniques to show that the same argument works for δ_{i,j} = δ_{j,i}.
Note that
[P_Ω(XX^⊤)X]_i = ∑_j δ_{i,j}⟨X_i, X_j⟩X_j ,
whose expectation is equal to
[pXX^⊤X]_i = p∑_j ⟨X_i, X_j⟩X_j .
We know ‖X_i‖ ≤ ν√(1/d), therefore each term is bounded by ν³(1/d)^{3/2}. Let Z_i be the random variable ‖[P_Ω(XX^⊤)X]_i − [pXX^⊤X]_i‖²; it is easy to see that E[Z_i] ≤ pd·ν⁶(1/d)³ = pν⁶/d², and the variance satisfies V[Z_i] = E[Z_i²] − E[Z_i]² ≤ pd·ν¹²(1/d)⁶ + 2E[Z_i]² ≤ 3E[Z_i]² (as long as p > 1/d). Our goal now is to prove that ∑_{i=1}^d Z_i ≤ p²ε² for all X.
Let Z̄_i be a truncated version of Z_i. That is, Z̄_i = Z_i when Z_i ≤ [2pd·ν³(1/d)^{3/2}]², and Z̄_i = [2pd·ν³(1/d)^{3/2}]² otherwise. It is not hard to see that Z̄_i has smaller mean and variance than Z_i. Also, by the vector Bernstein inequality, we know
Pr[√Z_i ≥ t] ≤ d·exp(−t²/(pν⁶/d² + t·ν³/d^{3/2})) .
Notice that for the truncated variable this bound is only relevant when t ≤ O(pν³d^{−1/2}) (because otherwise the probability is 0), and in that regime the variance term always dominates. Therefore √Z̄_i is a subgaussian random variable.
By Bernstein's inequality, we know the moments of √Z̄_i − E[√Z̄_i] are dominated by those of a Gaussian distribution with variance O(E[Z_i] log d).
Now we can use the concentration bound for quadratics of subgaussian random variables [89]: with probability at least 1 − exp(−t),
∑_{i=1}^d Z̄_i ≤ O(E[Z_i]·(log d)·(d + 2√(dt) + 2t)) .
This means that with probability at least 1 − exp(−Cdr log d) for some large constant C, we have ∑_{i=1}^d Z̄_i ≤ O(pν⁶r log² d/d). This failure probability is low enough for us to union bound over all X in a standard ε-net in which every X is within distance (ε/d)⁶ of a net point. Therefore we know that with high probability, for all X in the ε-net, ∑_{i=1}^d Z̄_i ≤ O(pν⁶r log² d/d), which is smaller than p²ε² when p ≥ Cν⁶r log² d/(dε²) for a large enough constant C.
For any X that is not in the ε-net, let X̄ be the closest point to X in the net; then ‖X − X̄‖_F ≤ (ε/d)⁶, and the bound for X clearly follows from the bound for X̄.
Now, to convert the sum of the Z̄_i into the sum of the Z_i, notice that with high probability there are at most 2pd entries of Ω in every row. When that happens, Z_i is always bounded by [2pd·ν³(1/d)^{3/2}]², so Z̄_i = Z_i. Let event 1 be that ∑_{i=1}^d Z̄_i ≤ p²ε² for all X, and let event 2 be that there are at most 2pd entries per row. With high probability both events happen, and in that case ∑_{i=1}^d Z_i ≤ p²ε² for all X.
Handling δ_{i,j} = δ_{j,i} First notice that the diagonal entries δ_{i,i} cannot change the Frobenius norm by more than O(ν³(1/d)^{3/2}·√d) ≤ pε, so we can ignore the diagonal terms. Now, for independent terms δ_{i,j}, let γ_{j,i} = δ_{i,j}; then by a union bound both δ_{i,j} and γ_{i,j} satisfy the inequality, and by the triangle inequality (δ_{i,j} + γ_{i,j})/2 also satisfies it. Let τ_{i,j} be the true indicator of Ω (hence τ_{i,j} = τ_{j,i}), and let τ′_{i,j} be an independent copy. We know that (τ_{i,j} + τ′_{i,j})/2 has the same distribution as (δ_{i,j} + γ_{i,j})/2 (for off-diagonal entries), therefore with high probability the inequality is true for (τ_{i,j} + τ′_{i,j})/2. The theorem then follows from the standard decoupling claim below (note that sup_{‖X‖_F=1} ‖P_Ω(XX^⊤)X − pXX^⊤X‖_F is a norm in the indicator variables of Ω):
Claim 5.7.3. Let X, Y be two i.i.d. random variables taking values in a normed space. Then
Pr[‖X‖ ≥ t] ≤ 3·Pr[‖X + Y‖ ≥ 2t/3] .
Proof. Let X, Y, Z be i.i.d. random variables. Then
Pr[‖X‖ ≥ t] = Pr[‖(X + Y) + (X + Z) − (Y + Z)‖ ≥ 2t]
≤ Pr[‖X + Y‖ ≥ 2t/3] + Pr[‖X + Z‖ ≥ 2t/3] + Pr[‖Y + Z‖ ≥ 2t/3]
= 3·Pr[‖X + Y‖ ≥ 2t/3] .
Finally, we argue that randomly sampling the entries of a matrix gives a good spectral approximation. This is a standard lemma used in arguing about the P_Ω(M)X term in the gradient P_Ω(M − XX^⊤)X.
Lemma 5.7.4. Suppose W ∈ R^{d×d} satisfies |W|_∞ ≤ (ν/d)·‖W‖_F. Then with high probability (at least 1 − d^{−10}) over the choice of Ω,
‖P_Ω(W) − pW‖ ≤ εp‖W‖_F ,
where ε = O(ν√(log d/(pd))).
Proof. Without loss of generality we assume ‖W‖_F = 1. The proof follows simply from an application of the matrix Bernstein inequality. We view P_Ω(W) as
P_Ω(W) = ∑_{(i,j)∈[d]²} s_{ij}W_{ij}Δ_{ij} ,
where Δ_{ij} ∈ R^{d×d} is the indicator matrix for entry (i, j), and the s_{ij} ∈ {0, 1} are independent Bernoulli variables with probability p of being 1. Then we have that E[P_Ω(W)] = pW and ‖s_{ij}W_{ij}Δ_{ij}‖ ≤ (ν/d)·‖W‖_F. Moreover, we compute the variance by
∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{ij}^⊤Δ_{ij}] = ∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{jj}] = ∑_{j∈[d]} p(∑_{i∈[d]} W²_{ij})·Δ_{jj} . (5.7.2)
Therefore
‖∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{ij}^⊤Δ_{ij}]‖ ≤ pν²/d .
Similarly we can control ‖∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{ij}Δ_{ij}^⊤]‖ by pν²/d (again, although δ_{i,j} = δ_{j,i}, the bounds here are correct up to constant factors). It then follows from the non-commutative Bernstein inequality [91] that
Pr_Ω[‖P_Ω(W) − pW‖ ≥ εp] ≤ d·exp(−Ω(ε²pd/ν²)) .
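Lemma 5.7.4 can likewise be observed numerically: for a "flat" matrix W (every entry of magnitude ‖W‖_F/d, i.e. ν = 1), the spectral-norm deviation ‖P_Ω(W) − pW‖ is a small fraction of p‖W‖_F. A quick sketch (our own parameters; we sample entries independently rather than symmetrically, which only changes constants):

```python
import numpy as np

rng = np.random.default_rng(5)
d, p = 300, 0.2
W = rng.choice([-1.0, 1.0], size=(d, d)) / d     # flat: |W|_inf = ||W||_F / d, so nu = 1

mask = rng.random((d, d)) < p                    # independent Bernoulli(p) sampling
deviation = np.linalg.norm(mask * W - p * W, ord=2)  # spectral norm of the error
budget = p * np.linalg.norm(W)                       # p * ||W||_F; lemma gives eps < 1 here
```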
Concentration Lemmas for Noise Matrix N Next we state the concentration lemmas that are necessary when the observed matrix is perturbed by Gaussian noise. The proofs of these lemmas are exactly the same as (in fact, even simpler than) those of the corresponding results we have just proven. The first lemma is used in the same setting as Theorem 5.7.1.
Lemma 5.7.5. Let N be a random matrix with independent Gaussian entries N(0, σ²). With high probability over the support Ω and the Gaussian matrix N, for any low-rank matrix W we have
|⟨P_Ω(N), P_Ω(W)⟩| ≤ O(σ|W|_∞ dr log d + √(pd²rσ²|W|_∞‖W‖_F)·log d) .
Proof. The proof is exactly the same as that of Theorem 5.7.1, since ⟨P_Ω(N), P_Ω(W)⟩ is a sum of independent terms to which the same Bernstein inequality applies.
Next we show that randomly sampling the entries of a Gaussian matrix gives a matrix with low spectral norm.
Lemma 5.7.6. Let N be a random matrix with independent Gaussian entries N(0, σ²). With high probability over the choice of Ω and N, we have
‖P_Ω(N)‖ ≤ εpσd ,
where ε = O(√(log d/(pd))).
Proof. Again, the proof follows from the same argument as Lemma 5.7.4.
5.8 Conclusions
Although the matrix completion objective is non-convex, we showed that the objective function has very nice properties ensuring that all local minima are also global. This property gives guarantees for many basic optimization algorithms. An important open problem is the robustness of this property under different model assumptions: Can we extend the result to handle asymmetric matrix completion? Is it possible to add weights to different entries (similar to the settings studied in [113])? Can we replace the objective function with a different distance measure rather than the Frobenius norm (which is related to work on 1-bit matrix sensing [55])? We hope this framework of analyzing the geometry of objective functions can be applied to other problems.
Chapter 6
Learning Linear Dynamical
Systems
In this chapter, we use a non-convex optimization approach to attack the problem of learning linear dynamical systems. We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear system identification has been studied for many decades; yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.
6.1 Introduction
Many learning problems are by their nature sequence problems, where the goal is to fit a model that maps a sequence of inputs x₁, . . . , x_T to a corresponding sequence of observations y₁, . . . , y_T. Text translation, speech recognition, time series prediction, video captioning and question answering systems, to name a few, are all sequence-to-sequence learning problems. For a sequence model to be both expressive and parsimonious in its parameterization, it is crucial to equip the model with memory, thus allowing its prediction at time t to depend on previously seen inputs.
Recurrent neural networks form an expressive class of non-linear sequence models. Through their many variants, such as long short-term memory networks [84], recurrent neural networks have seen remarkable empirical success in a broad range of domains. At their core, neural networks are typically trained using some form of (stochastic) gradient descent. Even though the training objective is non-convex, it is widely observed in practice that gradient descent quickly approaches a good set of model parameters. Understanding the effectiveness of gradient descent for non-convex objectives on theoretical grounds is a major open problem in this area.
If we remove all non-linear state transitions from a recurrent neural network,
we are left with the state transition representation of a linear dynamical system.
Notwithstanding, the natural training objective for linear systems remains non-convex
due to the composition of multiple linear operators in the system. If there is any hope
of eventually understanding recurrent neural networks, it will be inevitable to develop
a solid understanding of this special case first.
To be sure, linear dynamical systems are important in their own right and have been studied for many decades independently of machine learning within the control theory community. Control theory provides a rich set of techniques for identifying and manipulating linear systems. The learning problem in this context corresponds to "linear dynamical system identification". Maximum likelihood estimation with gradient descent is a popular heuristic for dynamical system identification [114]. In the context of machine learning, linear systems play an important role in numerous tasks. For example, their estimation arises as a subroutine of reinforcement learning in robotics [108], location and mapping estimation in robotic systems [59], and estimation of pose from video [143].
In this work, we show that gradient descent efficiently minimizes the maximum
likelihood objective of an unknown linear system given noisy observations generated
by the system. More formally, we receive noisy observations generated by the following
time-invariant linear system:
ht+1 = Aht +Bxt (6.1.1)
yt = Cht +Dxt + ξt
Here, A, B, C, D are linear transformations with compatible dimensions, and we denote by Θ = (A, B, C, D) the parameters of the system. The vector h_t represents the hidden state of the model at time t. Its dimension n is called the order of the system. The stochastic noise variables ξ_t perturb the output of the system, which is why the model is called an output error model in control theory. We assume the variables are drawn i.i.d. from a distribution with mean 0 and variance σ².
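To make the dynamics concrete, here is a minimal NumPy simulation of (6.1.1); the matrices and the `simulate` helper are illustrative choices for this sketch, not objects defined in the text.

```python
import numpy as np

def simulate(A, B, C, D, xs, h0, noise_std=0.0, rng=None):
    """Roll out the system (6.1.1): h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t + xi_t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h = h0.astype(float)
    ys = []
    for x in xs:
        xi = noise_std * rng.standard_normal(C.shape[0])
        ys.append(C @ h + D @ x + xi)   # observation at time t
        h = A @ h + B @ x               # state transition to time t+1
    return np.array(ys)

# A toy order-2 system (arbitrary matrices for illustration).
A = np.array([[0.0, 1.0], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.5]])
xs = [np.array([1.0])] + [np.array([0.0])] * 3   # an impulse input
ys = simulate(A, B, C, D, xs, h0=np.zeros(2))
```

On the impulse input above, the noiseless outputs read off D, CB, CAB, CA²B in order, matching the impulse-response view used later in the chapter.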
Throughout the paper we focus on controllable and externally stable systems.
A linear system is externally stable (or equivalently bounded-input bounded-output
stable) if and only if the spectral radius of A, denoted ρ(A), is strictly bounded by 1.
Controllability is a mild non-degeneracy assumption that we formally define later.
Without loss of generality, we further assume that the transformations B, C and D
have bounded Frobenius norm. This can be achieved by a rescaling of the output
variables. We assume we have N pairs of sequences (x, y) as training examples,

S = { (x⁽¹⁾, y⁽¹⁾), . . . , (x⁽ᴺ⁾, y⁽ᴺ⁾) } .

Each input sequence x ∈ ℝᵀ of length T is drawn from a distribution and y is the
corresponding output of the system above generated from an unknown initial state
h. We allow the unknown initial state to vary from one input sequence to the next.
This only makes the learning problem more challenging.
Our goal is to fit a linear system to the observations. We parameterize our model in exactly the same way as (6.1.1). That is, for linear mappings (Â, B̂, Ĉ, D̂), the trained model is defined as:

ĥ_{t+1} = Â ĥ_t + B̂ x_t ,   ŷ_t = Ĉ ĥ_t + D̂ x_t        (6.1.2)
The (population) risk of the model is obtained by feeding the learned system with the correct initial states and comparing its predictions with the ground truth in expectation over inputs and errors. Denoting by ŷ_t the t-th prediction of the trained model starting from the correct initial state that generated y_t, and using Θ̂ as a shorthand for (Â, B̂, Ĉ, D̂), we formally define the population risk as:

f(Θ̂) = E_{x_t, ξ_t} [ (1/T) Σ_{t=1}^{T} ‖ŷ_t − y_t‖² ]        (6.1.3)
Note that even though the prediction ŷ_t is generated from the correct initial state, the learning algorithm does not have access to the correct initial state for its training sequences.
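The population risk (6.1.3) can be approximated by plain Monte Carlo: draw inputs and an initial state, run the true system (with noise) and the candidate system (noiselessly, from the same correct initial state), and average the squared error. The sketch below does this for an illustrative order-1 scalar system; all helper names and parameter values are assumptions for the example. At the true parameters the estimate reduces to the noise variance σ².

```python
import numpy as np

def rollout(A, B, C, D, xs, h0):
    """Noiseless rollout of a scalar (order-1) linear system."""
    h, ys = h0, []
    for x in xs:
        ys.append(C * h + D * x)
        h = A * h + B * x
    return np.array(ys)

def estimate_risk(theta, theta_hat, T=50, trials=2000, sigma=0.1, seed=0):
    """Monte Carlo estimate of the population risk (6.1.3)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        xs = rng.standard_normal(T)
        h0 = rng.standard_normal()
        y = rollout(*theta, xs, h0) + sigma * rng.standard_normal(T)  # noisy truth
        y_hat = rollout(*theta_hat, xs, h0)  # prediction from the correct h0
        total += np.mean((y - y_hat) ** 2)
    return total / trials

theta = (0.8, 1.0, 1.0, 0.0)   # true (A, B, C, D) as scalars, chosen stable
risk_at_truth = estimate_risk(theta, theta, sigma=0.1)
```

At the truth the residual is exactly the observation noise, so `risk_at_truth` concentrates around σ² = 0.01; any misspecified system incurs strictly larger risk.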
While the squared loss objective turns out to be non-convex, it has many ap-
pealing properties. Assuming the inputs xt and errors ξt are drawn independently
from a Gaussian distribution, the corresponding population objective corresponds to
maximum likelihood estimation. In this work, we make the weaker assumption that
the inputs and errors are drawn independently from possibly different distributions.
The independence assumption is certainly idealized for some learning applications.
However, in control applications the inputs can often be chosen by the controller
rather than by nature. Moreover, the outputs of the system at various time steps are
correlated through the unknown hidden state and therefore not independent even if
the inputs are.
6.1.1 Our Results
We show that we can efficiently minimize the population risk using projected stochas-
tic gradient descent. The bulk of our work applies to single-input single-output (SISO)
systems meaning that inputs and outputs are scalars xt, yt ∈ R. However, the hidden
state can have arbitrary dimension n. Every controllable SISO admits a convenient
canonical form called controllable canonical form that we formally introduce later. In
this canonical form, the transition matrix A is governed by n parameters a1, . . . , an
which coincide with the coefficients of the characteristic polynomial of A. The mini-
mal assumption under which we might hope to learn the system is that the spectral
radius of A is smaller than 1. However, the set of such matrices is non-convex and
does not have enough structure for our analysis. We will therefore make additional
assumptions. The assumptions we need differ between the case where we are trying to learn a system with n parameters, and the case where we allow ourselves to over-specify the trained model with n′ > n parameters. The former is sometimes called
proper learning, while the latter is called improper learning. In the improper case, we
are essentially able to learn any system with spectral radius less than 1 under a mild
separation condition on the roots of the characteristic polynomial. Our assumption
in the proper case is stronger and we introduce it next.
6.1.2 Proper Learning
Suppose that the state transition matrix A is given by parameters a₁, . . . , a_n, and consider the polynomial q(z) = 1 + a₁z + a₂z² + · · · + a_n z^n over the complex numbers ℂ.
We will require that the image of the unit circle on the complex plane under the
polynomial q is contained in the cone of complex numbers whose real part is larger
than their absolute imaginary part. Formally, for all z ∈ ℂ such that |z| = 1, we require that Re(q(z)) > |Im(q(z))|. Here, Re(z) and Im(z) denote the real and imaginary parts of z, respectively. We illustrate this condition in the figure on the right for a degree-4 system.
[Figure: illustration of the condition in the complex plane for a degree-4 system.]
Our assumption has three important implications. First, it implies (via Rouché's theorem) that the spectral radius of A is smaller than 1 and therefore ensures the stability of the system. Second, the vectors satisfying our assumption form a convex set in ℝⁿ. Finally, it ensures that the objective function is weakly quasi-convex, a condition we introduce later when we show that it enables stochastic gradient descent to make sufficient progress.
We note in passing that our assumption can be satisfied via the ℓ₁-norm constraint ‖a‖₁ ≤ √2/2. Moreover, if we pick random Gaussian coefficients with expected norm bounded by o(1/√(log n)), then the resulting vector will satisfy our assumption with probability 1 − o(1). Roughly speaking, the assumption requires that the roots of the characteristic polynomial p(z) = z^n + a₁z^{n−1} + · · · + a_n be relatively dispersed inside the unit circle. (For comparison, at the other end of the spectrum, the polynomial p(z) = (z − 0.99)^n has all of its roots colliding at the same point and does not satisfy the assumption.)
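The cone condition Re(q(z)) > |Im(q(z))| is straightforward to test numerically on a grid of the unit circle. The sketch below (helper names are ours) checks it for a coefficient vector with ‖a‖₁ < √2/2, which the text asserts is sufficient, and for the clustered-roots example p(z) = (z − 0.99)⁴, which violates it.

```python
import numpy as np

def q_values(a, num=2048):
    """Evaluate q(z) = 1 + a_1 z + ... + a_n z^n on a grid of the unit circle."""
    z = np.exp(2j * np.pi * np.arange(num) / num)
    powers = z[:, None] ** np.arange(1, len(a) + 1)
    return 1.0 + powers @ np.asarray(a, dtype=complex)

def in_cone(a, num=2048):
    """Grid check of the assumption Re(q(z)) > |Im(q(z))| on |z| = 1."""
    q = q_values(a, num)
    return bool(np.all(q.real > np.abs(q.imag)))

# A vector with l1-norm below sqrt(2)/2 satisfies the assumption...
a_good = np.array([0.3, -0.2, 0.1])        # ||a||_1 = 0.6 < 0.707
# ...while clustered roots, e.g. p(z) = (z - 0.99)^4, violate it.
a_bad = np.poly(0.99 * np.ones(4))[1:]     # coefficients after the leading 1
```

Geometrically, ‖a‖₁ ≤ √2/2 forces every q(z) into the disk of radius √2/2 around 1, which is exactly the disk inscribed in the cone Re(w) ≥ |Im(w)|.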
Theorem 6.1.1 (Informal). Under our assumption, projected stochastic gradient descent, when given N sample sequences of length T, returns parameters Θ̂ with population risk

f(Θ̂) ≤ f(Θ) + O( √((n⁵ + σ²n³)/(TN)) ) .
Recall that f(Θ) is the population risk of the optimal system, and σ² refers to the variance of the noise variables. We also assume that the inputs x_t are drawn
from a pairwise independent distribution with mean 0 and variance 1. Note, however,
that this does not imply independence of the outputs as these are correlated by a
common hidden state. The stated version of our result glosses over the fact that we
need our assumption to hold with a small amount of slack; a precise version follows
in Section 6.3. Our theorem establishes a polynomial convergence rate for stochastic
gradient descent. Since each iteration of the algorithm only requires a sequence of
matrix operations and an efficient projection step, the running time is polynomial, as
well. Likewise, the sample requirements are polynomial since each iteration requires
only a single fresh example. An important feature of this result is that the error
decreases with both the length T and the number of samples N . Computationally,
the projection step can be a bottleneck, although it is unlikely to be required in
practice and may be an artifact of our analysis.
6.1.3 The Power of Over-parameterization
Endowing the model with additional parameters compared to the ground truth turns
out to be surprisingly powerful. We show that we can essentially remove the assumption we previously made in proper learning. The idea is simple. If p is the characteristic polynomial of A, of degree n, we can find a system of order n′ > n such that the characteristic polynomial of its transition matrix becomes p · p′ for some polynomial p′ of degree n′ − n. This means that to apply our result we only need the
polynomial p · p′ to satisfy our assumption. At this point, we can choose p′ to be an
approximation of the inverse p−1. For sufficiently good approximation, the resulting
polynomial p · p′ is close to 1 and therefore satisfies our assumption. Such an ap-
proximation exists generically for n′ = O(n) under mild non-degeneracy assumptions
on the roots of p. In particular, any small random perturbation of the roots would
suffice.
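The inverse-approximation step can be checked numerically: compute the truncated power series u(z) of 1/q(z) (where q(z) = z^n p(1/z) has no roots in the closed unit disk when p is stable), and verify that q·u stays uniformly close to 1 on the unit circle, hence comfortably inside the cone. A sketch with an arbitrary example polynomial; `inverse_series` is our helper name.

```python
import numpy as np

def inverse_series(q, d):
    """First d+1 power-series coefficients of 1/q(z), assuming q[0] = 1."""
    b = np.zeros(d + 1)
    b[0] = 1.0
    for k in range(1, d + 1):
        b[k] = -sum(q[j] * b[k - j] for j in range(1, min(k, len(q) - 1) + 1))
    return b

# q(z) = z^n p(1/z) for an example stable polynomial with spread roots.
q = np.array([1.0, 0.3, -0.2, 0.1])
u = inverse_series(q, d=40)        # truncated series of 1/q, degree 40
qu = np.convolve(q, u)             # ascending coefficients of q(z) u(z)

z = np.exp(2j * np.pi * np.arange(512) / 512)
vals = np.polyval(qu[::-1], z)     # q(z) u(z) evaluated on the unit circle
max_err = np.max(np.abs(vals - 1.0))
```

By construction all coefficients of q·u up to degree 40 match the constant 1, so only a geometrically small tail remains, and the product lands deep inside the cone Re(w) > |Im(w)|.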
Theorem 6.1.2 (Informal). Under a mild non-degeneracy assumption, stochastic gradient descent returns parameters Θ̂ corresponding to a system of order n′ = O(n) with population risk

f(Θ̂) ≤ f(Θ) + O( √((n⁵ + σ²n³)/(TN)) ) ,

when given N sample sequences of length T.
We remark that the idea we sketched also shows that, in the extreme, improper
learning of linear dynamical systems becomes easy in the sense that the problem
can be solved using linear regression against the outputs of the system. However,
our general result interpolates between the proper case and the regime where linear
regression works. We discuss this in more detail in Section 6.5.3.
6.1.4 Multi-input Multi-output Systems
Both results we saw immediately extend to single-input multi-output (SIMO) sys-
tems as the dimensionality of C and D are irrelevant for us. The case of multi-input
multi-output (MIMO) systems is more delicate. Specifically, our results carry over
to a broad family of multi-input multi-output systems. However, in general MIMO
systems no longer enjoy canonical forms like SISO systems. In Section 6.6, we intro-
duce a natural generalization of controllable canonical form for MIMO systems and
extend our results to this case.
6.1.5 Related Work
System identification is a core problem in dynamical systems and has been studied
in depth for many years. The most popular reference on this topic is the text by
Ljung [114]. Nonetheless, the list of non-asymptotic results on identifying linear
systems from noisy data is surprisingly short. Several authors have recently tried
to estimate the sample complexity of dynamical system identification using machine
learning tools [170, 39, 173]. All of these results are rather pessimistic, with sample complexity bounds that are exponential in the degree of the linear system and other relevant quantities. In contrast, we prove that gradient descent has sample complexity polynomial in all of these parameters. Moreover, all of these papers
only focus on how well empirical risk approximates the true population risk and do
not provide guarantees about any algorithmic schemes for minimizing the empirical
risk.
The only result to our knowledge which provides polynomial sample complexity for identifying linear dynamical systems is by Shah et al. [153]. Here, the authors show that if certain frequency-domain information about the linear dynamical system is observed, then the linear system can be identified by solving a second-order cone programming problem. This result applies to improper learning only, and the size of the resulting system may be quite large, scaling as (1 − ρ(A))⁻². As we describe
in this work, very simple algorithms work in the improper setting when the system
degree is allowed to be polynomial in (1− ρ(A))−1. Moreover, it is not immediately
clear how to translate the frequency-domain results to the time-domain identification
problem discussed above.
Our main assumption about the image of the polynomial q(z) is an appeal to the theory of passive systems. A system is passive if the dot product between the input sequence u_t and the output sequence y_t is strictly positive. Physically, this notion corresponds to systems that cannot create energy. For example, a circuit made solely of
resistors, capacitors, and inductors would be a passive electrical system. If one added
an amplifier to the internals of the system, then it would no longer be passive. The set
of passive systems is a subset of the set of stable systems, and the subclass is some-
what easier to work with mathematically. Indeed, Megretski used tools from passive
systems to provide a relaxation technique for a family of identification problems in dy-
namical systems [121]. His approach is to lower bound a nonlinear least-squares cost
with a convex functional. However, he does not prove that his technique can identify
any of the systems, even asymptotically. The work of Söderström and Stoica [154] analyzes the landscape of the population risk in the frequency domain and shows that under certain conditions (closely related to ours), the population risk has a unique global minimum. This effectively established Lemma 6.2.3 of our paper.
Bazanella et al use a passivity condition to prove quasi-convexity of a cost function
that arises in control design [24]. Building on this work, Eckhard and Bazanella prove
a weaker version of Lemma 6.2.3 in the context of system identification [60]. But no
sample complexity or global convergence proofs are provided in either [154] or [24].
6.1.6 Proof Overview
The first important step in our proof is to express the population risk in the Fourier domain, where it is closely related to what we call the idealized risk. The idealized risk essentially captures the ℓ₂-difference between the transfer function of the learned system and
the ground truth. The transfer function is a fundamental object in control theory.
Any linear system is completely characterized by its transfer function G(z) = C(zI − A)⁻¹B. In the case of a SISO system, the transfer function is a rational function of degree n over the complex numbers and can be written as G(z) = s(z)/p(z). In the canonical form introduced in Section 6.1.7, the coefficients of p(z) are precisely the parameters that specify A. Moreover, z^n p(1/z) = 1 + a₁z + a₂z² + · · · + a_n z^n is the polynomial we encountered in the introduction. Under the assumption illustrated earlier, we show
in Section 6.2 that the idealized risk is weakly quasi-convex (Lemma 6.2.3). Quasi-
convexity implies that gradients cannot vanish except at the optimum of the objective
function; we review this (mostly known) material in Section 4.1. In particular, this
lemma implies that in principle we can hope to show that gradient descent converges
to a global optimum. However, there are several important issues that we need to
address. First, the result only applies to idealized risk, not our actual population risk
objective. Therefore it is not clear how to obtain unbiased gradients of the idealized
risk objective. Second, there is a subtlety in even defining a suitable empirical risk
objective. The reason is that risk is defined with respect to the correct initial state of
the system which we do not have access to during training. We overcome both of these
problems. In particular, we design an almost unbiased estimator of the gradient of
the idealized risk in Lemma 6.4.4 and give variance bounds of the gradient estimator
(Lemma 6.4.5).
Our results on improper learning in Section 6.5 rely on a surprisingly simple
but powerful insight. We can extend the degree of the transfer function G(z) by
extending both numerator and denominator with a polynomial u(z) such that G(z) =
s(z)u(z)/p(z)u(z). While this results in an equivalent system in terms of input-output
behavior, it can dramatically change the geometry of the optimization landscape.
In particular, we can see that only p(z)u(z) has to satisfy the assumption of our
proper learning algorithm. This allows us, for example, to put u(z) ≈ p(z)−1 so
that p(z)u(z) ≈ 1, hence trivially satisfying our assumption. A suitable inverse
approximation exists under light assumptions and requires degree no more than d =
O(n). Algorithmically, there is almost no change. We simply run stochastic gradient
descent with n+ d model parameters rather than n model parameters.
6.1.7 Preliminaries
For a complex matrix (or vector, or number) C, we use Re(C) to denote the real part, Im(C) the imaginary part, C̄ the conjugate, and C* = C̄ᵀ its conjugate transpose. We use | · | to denote the absolute value of a complex number c. For complex vectors u and v, we use ⟨u, v⟩ = u*v to denote the inner product, and ‖u‖ = √(u*u) is the norm of u. For complex matrices A and B of the same dimensions, ⟨A, B⟩ = tr(A*B) defines an inner product, and ‖A‖_F = √(tr(A*A)) is the Frobenius norm. For a square matrix
A, we use ρ(A) to denote the spectral radius of A, that is, the largest absolute value
of the elements in the spectrum of A. We use Id_n to denote the identity matrix of dimension n × n, and we drop the subscript when it is clear from context. We let e_i denote the i-th standard basis vector.
A SISO of order n is in controllable canonical form if A and B have the following form:

A = ⎡   0      1      0    · · ·   0  ⎤
    ⎢   0      0      1    · · ·   0  ⎥
    ⎢   ⋮      ⋮      ⋮     ⋱      ⋮  ⎥
    ⎢   0      0      0    · · ·   1  ⎥
    ⎣ −a_n  −a_{n−1}  −a_{n−2} · · · −a_1 ⎦ ,

B = [0, 0, . . . , 0, 1]ᵀ .        (6.1.4)
We will parameterize Â, B̂, Ĉ, D̂ accordingly. We will write A = C(a) for brevity, where a denotes the unknown last row [−a_n, . . . , −a_1] of the matrix A. We will use â to denote the corresponding training variable for a. Since B is known, B̂ is no longer a trainable parameter and is forced to be equal to B. Moreover, C is a row vector and we use [c₁, · · · , c_n] to denote its coordinates (and similarly for Ĉ).
A SISO is controllable if and only if the matrix [B | AB | A²B | · · · | A^{n−1}B] has rank n. This statement corresponds to the condition that all hidden states should be reachable from some initial condition and input trajectory. Any controllable system admits a controllable canonical form [81]. For a vector a = [a_n, . . . , a₁], let p_a(z) denote the polynomial

p_a(z) = z^n + a₁z^{n−1} + · · · + a_n .        (6.1.5)

When a defines the matrix A that appears in controllable canonical form, p_a is precisely the characteristic polynomial of A. That is, p_a(z) = det(zI − A).
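Both facts are easy to verify numerically: the companion matrix built from (6.1.4) has characteristic polynomial p_a, and its controllability matrix [B | AB | A²B] has full rank. A sketch with arbitrary illustrative coefficients:

```python
import numpy as np

def companion(a_rev):
    """Companion matrix C(a) in controllable canonical form (6.1.4).
    a_rev = [a_n, ..., a_1] is the negated last row, as in the text."""
    n = len(a_rev)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)      # superdiagonal of ones
    A[-1, :] = -np.asarray(a_rev)   # last row [-a_n, ..., -a_1]
    return A

a_rev = [0.1, -0.2, 0.3]            # [a_3, a_2, a_1]
A = companion(a_rev)
B = np.array([0.0, 0.0, 1.0])       # B = e_n in canonical form

# Characteristic polynomial: np.poly returns [1, a_1, a_2, a_3].
char = np.poly(A)

# Controllability matrix [B | AB | A^2 B] should have rank n = 3.
ctrb = np.column_stack([np.linalg.matrix_power(A, k) @ B for k in range(3)])
rank = np.linalg.matrix_rank(ctrb)
```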
6.2 Population Risk in Frequency Domain
We next establish conditions under which risk is weakly-quasi-convex. Our strategy
is to first approximate the risk functional f(Θ) by what we call idealized risk. This
approximation of the objective function is fairly straightforward; we justify it toward
the end of the section. We can show that

f(Θ̂) ≈ ‖D − D̂‖² + Σ_{k=0}^{∞} (CA^kB − ĈÂ^kB)² .        (6.2.1)
The leading term ‖D − D̂‖² is convex in D̂, which appears nowhere else in the objective. It therefore does not affect the convergence of the algorithm (up to lower-order terms), by virtue of Proposition 4.1.9, and we restrict our attention to the remaining terms.
Definition 6.2.1 (Idealized risk). We define the idealized risk as

g(Â, Ĉ) = Σ_{k=0}^{∞} (CA^kB − ĈÂ^kB)² .        (6.2.2)
We now use basic concepts from control theory (see [81, 82] for more background)
to express the idealized risk (6.2.2) in Fourier domain. The transfer function of the
linear system is
G(z) = C(zI − A)⁻¹B .        (6.2.3)
Note that G(z) is a rational function of degree n over the complex numbers, and hence we can find polynomials s(z) and p(z) such that G(z) = s(z)/p(z), with the convention that the leading coefficient of p(z) is 1. In controllable canonical form (6.1.4), the coefficients of p will correspond to the last row of A, while the coefficients of
s correspond to the entries of C. Also note that

G(z) = Σ_{t=1}^{∞} z^{−t} CA^{t−1}B = Σ_{t=1}^{∞} z^{−t} r_{t−1} .

The sequence r = (r₀, r₁, . . . , r_t, . . .) = (CB, CAB, . . . , CA^tB, . . .) is called the impulse response of the linear system. The behavior of a linear system is uniquely determined by its impulse response and therefore by G(z). Analogously, we denote the transfer function of the learned system by Ĝ(z) = Ĉ(zI − Â)⁻¹B = ŝ(z)/p̂(z). The idealized risk (6.2.2) is a function only of the impulse response r̂ of the learned system, and therefore a function only of Ĝ(z).
For future reference, we note that by an elementary calculation (see Lemma 8.3.1) we have

G(z) = C(zI − A)⁻¹B = (c₁ + c₂z + · · · + c_n z^{n−1}) / (z^n + a₁z^{n−1} + · · · + a_n) ,        (6.2.4)

which implies that s(z) = c₁ + c₂z + · · · + c_n z^{n−1} and p(z) = z^n + a₁z^{n−1} + · · · + a_n.
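Formula (6.2.4) can be checked numerically by comparing the state-space evaluation C(zI − A)⁻¹B against the ratio of polynomials at a few test points. The sketch below uses arbitrary illustrative coefficients:

```python
import numpy as np

a = np.array([0.3, -0.2, 0.1])       # [a_1, a_2, a_3], arbitrary stable choice
c = np.array([1.0, 0.5, -0.3])       # [c_1, c_2, c_3]
n = len(a)

# Controllable canonical form (6.1.4).
A = np.zeros((n, n)); A[:-1, 1:] = np.eye(n - 1); A[-1, :] = -a[::-1]
B = np.zeros(n); B[-1] = 1.0

def G_state(z):
    """C (zI - A)^{-1} B, evaluated from the state-space matrices."""
    return c @ np.linalg.solve(z * np.eye(n) - A, B)

def G_ratio(z):
    """(c_1 + c_2 z + ... + c_n z^{n-1}) / p_a(z), as in (6.2.4)."""
    s = sum(c[j] * z ** j for j in range(n))
    p = z ** n + sum(a[j] * z ** (n - 1 - j) for j in range(n))
    return s / p

pts = [2.0, 1.5 + 0.5j, 2.0 - 1.2j]
max_diff = max(abs(G_state(z) - G_ratio(z)) for z in pts)
```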
With these definitions in mind, we are ready to express idealized risk in Fourier
domain.
Proposition 6.2.2. Suppose p_â(z) has all its roots inside the unit circle. Then the idealized risk g(Â, Ĉ) can be written in the Fourier domain as

g(Â, Ĉ) = ∫₀^{2π} | Ĝ(e^{iθ}) − G(e^{iθ}) |² dθ .
Proof. Note that G(e^{iθ}) is the Fourier transform of the sequence r_k, and likewise Ĝ(e^{iθ}) is the Fourier transform¹ of r̂_k. Therefore, by Parseval's Theorem, we have

g(Â, Ĉ) = Σ_{k=0}^{∞} ‖r̂_k − r_k‖² = ∫₀^{2π} | Ĝ(e^{iθ}) − G(e^{iθ}) |² dθ .
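This identity can be sanity-checked numerically. One caveat on conventions: with the unnormalized Fourier transform, Parseval's identity reads Σ_k |r_k|² = (1/2π)∫₀^{2π}|G(e^{iθ})|² dθ, so the sketch below includes an explicit 1/(2π) factor (via a mean over the grid); the text presumably absorbs this constant into its Fourier convention. Systems and helper names are illustrative.

```python
import numpy as np

def companion(a_rev):
    """Companion matrix with last row [-a_n, ..., -a_1], as in (6.1.4)."""
    n = len(a_rev)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -np.asarray(a_rev)
    return A

def transfer(C, A, B, z):
    """Evaluate G(z) = C (zI - A)^{-1} B on an array of complex points."""
    n = A.shape[0]
    return np.array([C @ np.linalg.solve(zz * np.eye(n) - A, B) for zz in z])

# True and learned systems; coefficients are arbitrary stable examples.
A_t = companion([0.1, -0.2, 0.3]);  C_t = np.array([1.0, 0.5, -0.3])
A_h = companion([0.05, -0.1, 0.2]); C_h = np.array([0.8, 0.4, -0.2])
B = np.array([0.0, 0.0, 1.0])

# Time domain: truncated idealized risk (6.2.2); terms decay geometrically.
g_time = sum((C_t @ np.linalg.matrix_power(A_t, k) @ B
              - C_h @ np.linalg.matrix_power(A_h, k) @ B) ** 2
             for k in range(300))

# Fourier domain: (1/2pi) * integral of |G - G_hat|^2 over the unit circle.
theta = 2 * np.pi * np.arange(4096) / 4096
z = np.exp(1j * theta)
g_freq = np.mean(np.abs(transfer(C_t, A_t, B, z) - transfer(C_h, A_h, B, z)) ** 2)
```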
6.2.1 Quasi-convexity of the Idealized Risk

Now that we have a convenient expression for the risk in the Fourier domain, we can prove that the idealized risk g(Â, Ĉ) is weakly quasi-convex when â is not too far from the true a, in the sense that p_â(z) and p_a(z) make an angle of less than π/2 for every z on the unit circle. We will use the convention that a and â refer to the parameters that specify A and Â, respectively.

Lemma 6.2.3. For τ > 0 and every Ĉ, the idealized risk g(Â, Ĉ) is τ-weakly-quasi-convex over the domain

N_τ(a) = { â ∈ ℝⁿ : Re( p_a(z)/p_â(z) ) ≥ τ/2 , ∀ z ∈ ℂ s.t. |z| = 1 } .        (6.2.5)
Proof. We first analyze a single term h = |Ĝ(z) − G(z)|². Recall that Ĝ(z) = ŝ(z)/p̂(z), where p̂(z) = p_â(z) = z^n + â₁z^{n−1} + · · · + â_n. Note that z is fixed and h is a function of â and Ĉ. Then it is straightforward to see that

∂h/∂ŝ(z) = 2 Re[ (1/p̂(z)) ( ŝ(z)/p̂(z) − s(z)/p(z) )* ]        (6.2.6)

and

∂h/∂p̂(z) = −2 Re[ ( ŝ(z)/p̂(z)² ) ( ŝ(z)/p̂(z) − s(z)/p(z) )* ] .        (6.2.7)
¹The Fourier transform exists since ‖r̂_k‖ = ‖ĈÂ^kB‖ ≤ ‖Ĉ‖ ‖Â^k‖ ‖B‖ ≤ c ρ(Â)^k, where c does not depend on k and ρ(Â) < 1.
Since ŝ(z) and p̂(z) are linear in Ĉ and â respectively, by the chain rule we have

⟨∂h/∂â, â − a⟩ + ⟨∂h/∂Ĉ, Ĉ − C⟩ = (∂h/∂p̂(z)) ⟨∂p̂(z)/∂â, â − a⟩ + (∂h/∂ŝ(z)) ⟨∂ŝ(z)/∂Ĉ, Ĉ − C⟩
  = (∂h/∂p̂(z)) (p̂(z) − p(z)) + (∂h/∂ŝ(z)) (ŝ(z) − s(z)) .

Plugging the formulas (6.2.6) and (6.2.7) for ∂h/∂ŝ(z) and ∂h/∂p̂(z) into the equation above, we obtain

⟨∂h/∂â, â − a⟩ + ⟨∂h/∂Ĉ, Ĉ − C⟩
  = 2 Re[ ( −ŝ(z)(p̂(z) − p(z)) + p̂(z)(ŝ(z) − s(z)) ) / p̂(z)² · ( ŝ(z)/p̂(z) − s(z)/p(z) )* ]
  = 2 Re[ ( ŝ(z)p(z) − s(z)p̂(z) ) / p̂(z)² · ( ŝ(z)/p̂(z) − s(z)/p(z) )* ]
  = 2 Re[ p(z)/p̂(z) ] · | ŝ(z)/p̂(z) − s(z)/p(z) |²
  = 2 Re[ p(z)/p̂(z) ] · | Ĝ(z) − G(z) |² .

Hence h = |Ĝ(z) − G(z)|² is τ′-weakly-quasi-convex with τ′ = 2 min_{|z|=1} Re( p(z)/p̂(z) ), which is at least τ for â ∈ N_τ(a). This implies our claim, since by Proposition 6.2.2 the idealized risk g is a convex combination of functions of the form |Ĝ(z) − G(z)|² with |z| = 1, and Proposition 4.1.9 shows that convex combinations preserve weak quasi-convexity.
For future reference, we also prove that the idealized risk is O(n²/τ₁⁴)-weakly smooth.

Lemma 6.2.4. The idealized risk g(Â, Ĉ) is Γ-weakly smooth with Γ = O(n²/τ₁⁴).
Proof. By equation (6.2.6) and the chain rule we get

∂g/∂Ĉ = ∫_𝕋 ( ∂|Ĝ(z) − G(z)|² / ∂ŝ(z) ) · ( ∂ŝ(z)/∂Ĉ ) dz = ∫_𝕋 2 Re[ (1/p̂(z)) ( ŝ(z)/p̂(z) − s(z)/p(z) )* ] · [1, . . . , z^{n−1}] dz .

Therefore, by the Cauchy–Schwarz inequality, we can bound the norm of the gradient by

‖∂g/∂Ĉ‖² ≤ ( ∫_𝕋 | ŝ(z)/p̂(z) − s(z)/p(z) |² dz ) · ( ∫_𝕋 4 ‖[1, . . . , z^{n−1}]‖² · |1/p̂(z)|² dz ) ≤ O(n/τ₁²) · g(Â, Ĉ) .

Similarly, we can show that ‖∂g/∂â‖² ≤ O(n²/τ₁²) · g(Â, Ĉ).
6.2.2 Justifying Idealized Risk
We need to justify the approximation we made in Equation (6.2.1).
Lemma 6.2.5. Assume that the x_t are drawn i.i.d. from an arbitrary distribution with mean 0 and variance 1, and that the ξ_t are drawn i.i.d. with mean 0 and variance σ². Then the population risk f(Θ̂) can be written as

f(Θ̂) = (D − D̂)² + Σ_{k=1}^{T−1} (1 − k/T) (CA^{k−1}B − ĈÂ^{k−1}B)² + σ² .        (6.2.8)
Proof of Lemma 6.2.5. Under the distributional assumptions on ξ_t and x_t, we can calculate the objective function analytically. We write out y_t and ŷ_t in terms of the inputs:

y_t = D x_t + Σ_{k=1}^{t−1} CA^{t−k−1}B x_k + CA^{t−1}h₀ + ξ_t ,
ŷ_t = D̂ x_t + Σ_{k=1}^{t−1} ĈÂ^{t−k−1}B x_k + ĈÂ^{t−1}h₀ .

Therefore, using the fact that the x_t are independent with mean 0 and covariance Id, the expectation of the error can be calculated (formally by Claim 8.3.2) as

E[ ‖ŷ_t − y_t‖² ] = ‖D − D̂‖²_F + Σ_{k=1}^{t−1} ‖CA^{t−k−1}B − ĈÂ^{t−k−1}B‖²_F + E[‖ξ_t‖²] .        (6.2.9)

Using E[‖ξ_t‖²] = σ², it follows that

f(Θ̂) = ‖D − D̂‖²_F + Σ_{k=1}^{T−1} (1 − k/T) ‖CA^{k−1}B − ĈÂ^{k−1}B‖²_F + σ² .        (6.2.10)
Recall that under the controllable canonical form (6.1.4), B = e_n is known, and therefore B̂ = B is no longer a variable. We use â for the training variable corresponding to a. The expected objective function (6.2.10) then simplifies to

f(Θ̂) = (D − D̂)² + Σ_{k=1}^{T−1} (1 − k/T) (CA^{k−1}B − ĈÂ^{k−1}B)² + σ² .
The previous lemma does not yet control higher order contributions present in
the idealized risk. This requires additional structure that we introduce in the next
section.
6.3 Effective Relaxations of Spectral Radius
The previous section showed quasi-convexity of the idealized risk. However, several
steps are missing towards showing finite sample guarantees for stochastic gradient
descent. In particular, we will need to control the variance of the stochastic gradient at
any system that we encounter in the training. For this purpose we formally introduce
our main assumption now and show that it serves as an effective relaxation of the spectral radius. The results below will be used for proving convergence of stochastic gradient descent in Section 6.4.
Consider the following convex region C in the complex plane:

C = { z : Re(z) ≥ (1 + τ₀)|Im(z)| } ∩ { z : τ₁ < Re(z) < τ₂ } ,        (6.3.1)

where τ₀, τ₁, τ₂ > 0 are constants that are considered fixed throughout the paper. Our bounds will have polynomial dependency on these parameters.
Definition 6.3.1. We say a polynomial p(z) is α-acquiescent if { p(z)/z^n : |z| = α } ⊆ C. A linear system with transfer function G(z) = s(z)/p(z) is α-acquiescent if its denominator p(z) is.
The set of coefficients a ∈ ℝⁿ defining acquiescent systems forms a convex set. Formally, for a positive α > 0, define the convex set B_α ⊆ ℝⁿ as

B_α = { a ∈ ℝⁿ : { p_a(z)/z^n : |z| = α } ⊆ C } .        (6.3.2)

We note that definition (6.3.2) is equivalent to the definition B_α = { a ∈ ℝⁿ : { z^n p_a(1/z) : |z| = 1/α } ⊆ C }, which is the version we used in the introduction for simplicity. Indeed, we can verify the convexity of B_α from the definition and the convexity of C: a, b ∈ B_α implies that p_a(z)/z^n, p_b(z)/z^n ∈ C, and therefore p_{(a+b)/2}(z)/z^n = ½ (p_a(z)/z^n + p_b(z)/z^n) ∈ C. We also note that the parameter α in the definition of acquiescence corresponds to the spectral radius of the companion matrix. In particular, an acquiescent system is stable for α < 1.
Lemma 6.3.2. Suppose a ∈ Bα, then the roots of polynomial pa(z) have magnitudes
bounded by α. Therefore the controllable canonical form A = C(a) defined by a has
spectral radius ρ(A) < α.
Proof. Define the holomorphic functions f(z) = z^n and g(z) = p_a(z) = z^n + a₁z^{n−1} + · · · + a_n. We apply the symmetric form of Rouché's theorem [63] on the circle K = { z : |z| = α }. For any point z on K, we have |f(z)| = α^n and |f(z) − g(z)| = α^n · |1 − p_a(z)/z^n|. Since a ∈ B_α, we have p_a(z)/z^n ∈ C for any z with |z| = α. Observe that for any c ∈ C we have |1 − c| < 1 + |c|; therefore

|f(z) − g(z)| = α^n |1 − p_a(z)/z^n| < α^n (1 + |p_a(z)|/|z^n|) = |f(z)| + |p_a(z)| = |f(z)| + |g(z)| .

Hence, by Rouché's theorem, f and g have the same number of roots inside the circle K. Since f(z) = z^n has exactly n roots inside K, g has all of its n roots inside the circle K.
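Lemma 6.3.2 can be illustrated numerically: verify (a grid version of) membership in B_α, then confirm that the companion matrix has spectral radius below α. The constants τ₀, τ₁, τ₂ and the coefficients below are arbitrary choices for the example.

```python
import numpy as np

def in_B_alpha(a, alpha, tau0=0.1, tau1=0.1, tau2=10.0, num=2048):
    """Grid check that p_a(z)/z^n lies in the cone C of (6.3.1) for |z| = alpha."""
    z = alpha * np.exp(2j * np.pi * np.arange(num) / num)
    p = np.polyval(np.concatenate(([1.0], a)), z)   # p_a(z), a = [a_1, ..., a_n]
    w = p / z ** len(a)
    return bool(np.all((w.real >= (1 + tau0) * np.abs(w.imag))
                       & (tau1 < w.real) & (w.real < tau2)))

alpha = 0.9
a = np.array([0.2, -0.1, 0.05])        # [a_1, a_2, a_3], small coefficients
A = np.zeros((3, 3)); A[:-1, 1:] = np.eye(2)
A[-1, :] = -a[::-1]                    # last row [-a_3, -a_2, -a_1]
rho = max(abs(np.linalg.eigvals(A)))   # spectral radius of the companion matrix
```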
The following lemma establishes that B_α is a monotone family of sets in α. The proof follows from the maximum modulus principle applied to the harmonic functions Re(z^n p(1/z)) and Im(z^n p(1/z)). We remark that there are larger convex sets than B_α that ensure a bounded spectral radius. However, in order to also guarantee monotonicity and the no-blow-up property below, we restrict our attention to B_α.
Lemma 6.3.3 (Monotonicity of B_α). For any 0 < α < β, we have B_α ⊂ B_β.

Proof. Let q_a(z) = 1 + a₁z + · · · + a_n z^n, and note that q_a(1/z) = p_a(z)/z^n. Therefore B_α = { a : q_a(z) ∈ C, ∀ |z| = 1/α }. Suppose a ∈ B_α; then Re(q_a(z)) ≥ τ₁ for every z with |z| = 1/α. Since Re(q_a(z)) is the real part of the holomorphic function q_a(z), it is a harmonic function. By the minimum principle for harmonic functions, for any |z| ≤ 1/α we have Re(q_a(z)) ≥ inf_{|w|=1/α} Re(q_a(w)) ≥ τ₁. In particular, this holds for |z| = 1/β < 1/α. Similarly, we can prove that for z with |z| = 1/β, Re(q_a(z)) ≥ (1 + τ₀)|Im(q_a(z))|, and the remaining conditions for a being in B_β.
Our next lemma entails that acquiescent systems have well-behaved impulse responses.

Lemma 6.3.4 (No-blow-up property). Suppose a ∈ B_α for some α ≤ 1. Then the companion matrix A = C(a) satisfies

Σ_{k=0}^{∞} ‖α^{−k}A^kB‖² ≤ 2πn α^{−2n}/τ₁² .        (6.3.3)

Moreover, for any k ≥ 0,

‖A^kB‖² ≤ min{ 2πn/τ₁² , 2πn α^{2k−2n}/τ₁² } .
Proof of Lemma 6.3.4. Let f_λ = Σ_{k=0}^{∞} e^{iλk} α^{−k} A^k B be the Fourier transform of the series α^{−k}A^kB. Then using Parseval's Theorem, we have

Σ_{k=0}^{∞} ‖α^{−k}A^kB‖² = ∫₀^{2π} |f_λ|² dλ = ∫₀^{2π} | (I − α^{−1}e^{iλ}A)^{−1}B |² dλ
  = ∫₀^{2π} ( Σ_{j=1}^{n} α^{2j} ) / | p_a(αe^{−iλ}) |² dλ ≤ ∫₀^{2π} n / | p_a(αe^{−iλ}) |² dλ ,        (6.3.4)

where at the last step we used the fact that (I − wA)^{−1}B = (1/p_a(w^{−1})) [w^{−1}, w^{−2}, . . . , w^{−n}]ᵀ (see Lemma 8.3.1), together with α ≤ 1. Since a ∈ B_α, we have |q_a(α^{−1}e^{iλ})| ≥ Re(q_a(α^{−1}e^{iλ})) ≥ τ₁, and therefore p_a(αe^{−iλ}) = α^n e^{−inλ} q_a(α^{−1}e^{iλ}) has magnitude at least τ₁α^n. Plugging this into equation (6.3.4), we conclude that

Σ_{k=0}^{∞} ‖α^{−k}A^kB‖² ≤ ∫₀^{2π} n / | p_a(αe^{−iλ}) |² dλ ≤ 2πn α^{−2n}/τ₁² .

Finally, we establish the bound for ‖A^kB‖². By Lemma 6.3.3, we have B_α ⊂ B₁ for α ≤ 1, so we can take α = 1 in equation (6.3.3) and it still holds. That is, we have

Σ_{k=0}^{∞} ‖A^kB‖² ≤ 2πn/τ₁² ,

which in particular implies ‖A^kB‖² ≤ 2πn/τ₁² for every k. The second bound follows directly from (6.3.3): ‖α^{−k}A^kB‖² ≤ 2πn α^{−2n}/τ₁², that is, ‖A^kB‖² ≤ 2πn α^{2k−2n}/τ₁².
6.3.1 Efficiently Computing the Projection
In our algorithm, we require a projection onto Bα. However, the only requirement of
the projection step is that it projects onto a set contained inside Bα that also contains
the true linear system. So a variety of subroutines can be used to compute this projection or an approximation. First, the explicit projection onto B_α is representable by a semidefinite program. This is because each of the three constraints can be checked by testing whether a trigonometric polynomial is non-negative. A simple inner approximation can be constructed by requiring the constraints to hold on a finite grid of size O(n). One can check that this provides a tight, polyhedral approximation to the set
Bα, following an argument similar to Section C of Bhaskar et al [29]. See Section 6.9
for more detailed discussion on why projection on a polytope suffices. Furthermore,
sometimes we can replace the constraint by an `1 or `2-constraint if we know that the
system satisfies the corresponding assumption. Removing the projection step entirely
is an interesting open problem.
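As a minimal illustration of the grid idea (not the semidefinite or polyhedral formulations discussed above), the sketch below checks the three constraints of (6.3.1) on a finite grid of |z| = α, and restores feasibility by bisecting toward 0, which lies strictly inside B_α. This is a crude feasible-point heuristic sufficient to land inside B_α; it is not the exact Euclidean projection.

```python
import numpy as np

def feasible_on_grid(a, alpha, tau0=0.1, tau1=0.1, tau2=10.0, num=512):
    """Check p_a(z)/z^n in the cone C of (6.3.1) on a finite grid of |z| = alpha."""
    a = np.asarray(a, dtype=float)
    z = alpha * np.exp(2j * np.pi * np.arange(num) / num)
    w = np.polyval(np.concatenate(([1.0], a)), z) / z ** len(a)
    return bool(np.all((w.real >= (1 + tau0) * np.abs(w.imag))
                       & (tau1 < w.real) & (w.real < tau2)))

def restore_feasibility(a, alpha, steps=50):
    """Bisect along the segment from 0 (strictly feasible) to a.
    A crude feasible-point heuristic, not the true Euclidean projection."""
    a = np.asarray(a, dtype=float)
    if feasible_on_grid(a, alpha):
        return a
    lo, hi = 0.0, 1.0   # scale factors: lo * a is feasible, hi * a is not
    for _ in range(steps):
        mid = (lo + hi) / 2
        if feasible_on_grid(mid * a, alpha):
            lo = mid
        else:
            hi = mid
    return lo * a

a_bad = np.array([2.0, 0.0, 0.0])   # violates the cone constraints on |z| = 0.9
a_fix = restore_feasibility(a_bad, alpha=0.9)
```

Because B_α is convex and contains 0, every point on the segment up to the returned scale is feasible, so the bisection always terminates at a valid point.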
6.4 Learning Acquiescent Systems
Next we show that we can learn acquiescent systems.
Theorem 6.4.1. Suppose the true system Θ is α-acquiescent and satisfies ‖C‖ ≤ 1. Then with N samples of length T ≥ Ω(n + 1/(1 − α)), stochastic gradient descent (Algorithm 7) with projection set B_α returns parameters Θ̂ = (Â, B̂, Ĉ, D̂) with population risk

f(Θ̂) ≤ f(Θ) + O( n²/N + √((n⁵ + σ²n³)/(TN)) ) ,        (6.4.1)

where the O(·)-notation hides polynomial dependencies on 1/(1 − α), 1/τ₀, 1/τ₁, τ₂, and R = ‖a‖.
Algorithm 7 Projected stochastic gradient descent with partial loss
For i = 0 to N:
1. Take a fresh sample ((x₁, . . . , x_T), (y₁, . . . , y_T)). Let ỹ_t be the simulated outputs² of the system Θ̂ on inputs x and initial state h₀ = 0.
2. Let T₁ = T/4. Run stochastic gradient descent³ on the loss function ℓ((x, y), Θ̂) = (1/(T − T₁)) Σ_{t>T₁} ‖ỹ_t − y_t‖². Concretely, letting G_A = ∂ℓ/∂â, G_C = ∂ℓ/∂Ĉ, and G_D = ∂ℓ/∂D̂, we update
   [â, Ĉ, D̂] → [â, Ĉ, D̂] − η [G_A, G_C, G_D] .
3. Project Θ̂ = (â, Ĉ, D̂) onto the set B_α ⊗ ℝⁿ ⊗ ℝ.
Recall that T is the length of the sequence and N is the number of samples. The
first term in the bound (6.4.1) comes from the smoothness of the population risk
and the second comes from the variance of the gradient estimator of population risk
(which will be described in detail below). An important (but not surprising) feature here is that the variance scales as 1/T, and therefore for long sequences we in fact obtain a 1/N convergence rate instead of a 1/√N rate (for relatively small N).
We can further balance the variance of the estimator against the number of samples by breaking each long sequence of length T into Θ(T/n) short sequences of length Θ(n), and then running Algorithm 7 on these TN/n shorter sequences. This leads to the following bound, which gives the expected dependency on T and N: TN should be counted as the true number of samples for the sequence-to-sequence model.
Corollary 6.4.2. Under the assumptions of Theorem 6.4.1, Algorithm 8 returns
parameters Θ̂ with population risk

f(Θ̂) ≤ f(Θ) + O( √((n⁵ + σ²n³)/(TN)) ),

where the O(·)-notation hides polynomial dependencies on 1/(1 − α), 1/τ₀, 1/τ₁, τ₂, and
R = ‖a‖.
Algorithm 8 Projected stochastic gradient descent for long sequences
Input: N sample sequences of length T. Output: Learned system Θ̂.
1. Divide each sample of length T into T/(βn) samples of length βn, where β is a large enough constant. Then run Algorithm 7 with the new samples and obtain Θ̂.

²Note that ŷ_t here is different from the ŷ_t defined in equation (6.1.2), which is used to define the population risk: here ŷ_t is obtained from the (wrong) initial state ĥ₀ = 0, while in (6.1.2) it is obtained from the correct initial state.
³See Algorithm Box 9 for a detailed back-propagation algorithm that computes the gradient.
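The sequence-splitting step of Algorithm 8 is mechanical; the following sketch handles one sample (discarding any leftover tail shorter than βn is an assumption of this sketch, as the thesis does not specify how a remainder is handled):

```python
def split_sample(xs, ys, beta, n):
    """Divide one length-T input/output sample into floor(T/(beta*n))
    samples of length beta*n, dropping any leftover tail."""
    chunk = beta * n
    return [(xs[i:i + chunk], ys[i:i + chunk])
            for i in range(0, len(xs) - chunk + 1, chunk)]
```

Each short piece is then fed to Algorithm 7 as a fresh sample, which is what makes the h₀ = 0 trick of the partial loss essential.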
We remark that the gradient computation takes time linear in Tn, since one can use
the chain rule (also called back-propagation) to compute the gradient efficiently.
For completeness, Algorithm 9 gives a detailed implementation. Finally and
importantly, we remark that although we defined the population risk as the expected
error with respect to sequences of length T, our error bound actually generalizes to
any longer (or shorter) sequences of length T′ ≥ max{n, 1/(1 − α)}. By the explicit
formula for f(Θ̂) (Lemma 6.2.5) and the fact that ‖ĈÂᵏB̂‖ decays exponentially for
k ≥ n (Lemma 6.3.4), we can bound the population risk on sequences of different
lengths. Concretely, letting f_{T′}(Θ̂) denote the population risk on sequences of
length T′, we have for all T′ ≥ max{n, 1/(1 − α)},

f_{T′}(Θ̂) ≤ 1.1 f(Θ̂) + exp(−(1 − α) min{T, T′}) ≤ O( √((n⁵ + σ²n³)/(TN)) ).

We note that generalization to longer sequences does deserve attention: in practice,
it is usually difficult to train non-linear recurrent networks that generalize to
sequences longer than the training data.
Our proof of Theorem 6.4.1 consists of three parts: a) showing the idealized
risk is quasi-convex in the convex set B_α (Lemma 6.4.3); b) designing an (almost)
unbiased estimator of the gradient of the idealized risk (Lemma 6.4.4); and c)
bounding the variance of the gradient estimator (Lemma 6.4.5).

First of all, using the theory developed in Section 6.2 (Lemma 6.2.3 and
Lemma 6.2.4), it is straightforward to verify that in the convex set B_α ⊗ Rⁿ, the
idealized risk is both weakly-quasi-convex and weakly-smooth.

Lemma 6.4.3. Under the conditions of Theorem 6.4.1, the idealized risk (6.2.2) is
τ-weakly-quasi-convex in the convex set B_α ⊗ Rⁿ and Γ-weakly-smooth, where τ =
Ω(τ₀τ₁/τ₂) and Γ = O(n²/τ₁⁴).
Proof of Lemma 6.4.3. It suffices to show that for all a, â ∈ B_α, we have â ∈
N_τ(a) for τ = Ω(τ₀τ₁/τ₂). Indeed, by the monotonicity of the family of sets B_α
(Lemma 6.3.3), we have a, â ∈ B₁, which by definition means that for every z on the
unit circle, p_a(z)/zⁿ, p_â(z)/zⁿ ∈ C. By the definition of C, for any points w, ŵ ∈ C,
the angle φ between w and ŵ is at most π − Ω(τ₀) and the ratio of their magnitudes
is at least τ₁/τ₂, which implies ℜ(w/ŵ) = (|w|/|ŵ|) · cos(φ) ≥ Ω(τ₀τ₁/τ₂). Therefore
ℜ(p_a(z)/p_â(z)) ≥ Ω(τ₀τ₁/τ₂), and we conclude that â ∈ N_τ(a). The smoothness
bound was established in Lemma 6.2.4.
Towards designing an unbiased estimator of the gradient, we note a small caveat
that prevents us from simply using the gradient of the empirical risk, as is commonly
done for other (static) problems. Recall that the population risk is defined as the
expected risk with known initial state h₀, while during training we do not have
access to the initial states; with the naive approach we could not even estimate the
population risk from samples without knowing the initial states.

We argue that being able to handle missing initial states is indeed desirable: in
most interesting applications h₀ is unknown (or even to be learned). Moreover, the
ability to handle unknown h₀ allows us to break a very long sequence into shorter
sequences, which is what yields Corollary 6.4.2. The difficulty here is essentially
that we have a supervised learning problem with missing data h₀. We get around it
by simply ignoring the first T₁ = Ω(T) outputs of the system and setting the
corresponding errors to 0. Since the influence of h₀ on any output later than time
k ≥ T₁ ≥ max{n, 1/(1 − α)} is inverse exponentially small, we can safely assume
h₀ = 0 when the error before time T₁ is not taken into account.

This small trick also makes our algorithm suitable for cases where these early
outputs are actually not observed. This is indeed an interesting setting, since in
many sequence-to-sequence models [165] there is no output in the first fraction of
iterations (of course, these models have non-linear operations that we cannot handle).
The proof of the correctness of the estimator is almost trivial and deferred to
Section 6.7.
Lemma 6.4.4. Under the assumptions of Theorem 6.4.1, suppose a, â ∈ B_α. Then in
Algorithm 7, at each iteration, G_A, G_C are unbiased estimators of the gradient of the
idealized risk (6.2.2) in the sense that

E[G_A, G_C] = [∂g/∂â, ∂g/∂Ĉ] ± exp(−Ω((1 − α)T)) .   (6.4.2)
Finally, we control the variance of the gradient estimator.
Lemma 6.4.5. The (almost) unbiased estimator (G_A, G_C) of the gradient of g(Â, Ĉ)
has variance bounded by

V[G_A] + V[G_C] ≤ O( (n³Λ²/τ₁⁶ + σ²n²Λ/τ₁⁴) / T ),

where Λ = O(max{n, 1/(1 − α)} log(1/(1 − α))).
Note that Lemma 6.4.5 does not directly follow from the Γ-weak-smoothness of the
population risk, since it is not clear whether the loss function ℓ((x, y), Θ̂) is also
Γ-smooth for every sample. Moreover, even if it were, smoothness would only yield a
variance bound of order Γ², while the true variance scales as 1/T. The discrepancy
comes from the fact that smoothness implies an upper bound on the expected squared
norm of the gradient, which equals the variance plus the squared norm of the mean.
Although for many other problems the variance is on the same order as the squared
mean, for our sequence-to-sequence model the variance actually decreases with the
length of the data, and therefore the variance bound obtained from smoothness is
very pessimistic.
Instead, we bound the variance directly. The calculation is tedious but simple in
spirit. We mainly need Lemma 6.3.4 to control the various sums that show up when
calculating the expectation. The only tricky part is obtaining the 1/T dependency,
which corresponds to the cancellation of the contributions from the cross terms. In
the proof we essentially write out the variance as a (complicated) function of Â, Ĉ
consisting of sums of terms involving (ĈÂᵏB̂ − CAᵏB) and ÂᵏB̂, and we control
these sums using Lemma 6.3.4. The proof is deferred to Section 6.7.
Finally, we are ready to prove Theorem 6.4.1. We essentially combine
Lemma 6.4.3, Lemma 6.4.4, and Lemma 6.4.5 with the generic convergence result,
Proposition 4.1.8. This gives low error in the idealized risk, and we then relate the
idealized risk to the population risk.

Proof of Theorem 6.4.1. We consider g′(Â, Ĉ, D̂) = (D̂ − D)² + g(Â, Ĉ), an
extended version of the idealized risk which takes the contribution of D̂ into
account. By Lemma 6.4.4, Algorithm 7 computes G_A, G_C, which are almost
unbiased estimators of the gradients of g′ up to negligible error
exp(−Ω((1 − α)T)), and by Lemma 6.7.1, G_D is an unbiased estimator of the
gradient of g′ with respect to D̂. Moreover, by Lemma 6.4.5, these estimators have
total variance V = O((n⁵ + σ²n³)/T), where O(·) hides dependencies on τ₁ and
(1 − α). Applying Proposition 4.1.8 (which only requires an unbiased estimator of
the gradient of g′), we obtain that after N iterations we converge to a point with

g′(â, Ĉ, D̂) ≤ O( n²/N + √((n⁵ + σ²n³)/(TN)) ).

Then, by Lemma 6.2.5, we have f(Θ̂) ≤ g′(â, Ĉ, D̂) + σ² = g′(â, Ĉ, D̂) + f(Θ) ≤
O( n²/N + √((n⁵ + σ²n³)/(TN)) ) + f(Θ), which completes the proof.
6.5 The Power of Improper Learning
We observe an interesting and important fact about the theory in Section 6.4: it
relies solely on a condition on the characteristic function p(z). This suggests that the
geometry of the training objective depends mostly on the denominator of the transfer
function, even though the system is uniquely determined by the transfer function
G(z) = s(z)/p(z). This might seem to be an undesirable discrepancy between the
behavior of the system and our analysis of the optimization problem.

However, we can actually exploit this discrepancy to design improper learning
algorithms that succeed under much weaker assumptions. We rely on the following
simple observation about the invariance of a system G(z) = s(z)/p(z). For an
arbitrary polynomial u(z) with leading coefficient 1, we can write G(z) as

G(z) = s(z)u(z)/(p(z)u(z)) = s̃(z)/p̃(z),

where s̃ = su and p̃ = pu. Therefore the system s̃(z)/p̃(z) has identical behavior
to G. Although this is a redundant representation of G(z), it should be counted as
an acceptable solution. After all, learning the minimum representation⁴ of a linear
system is impossible in general; in fact, we will encounter an example in Section 6.5.1.
While not changing the behavior of the system, the extension from p(z) to p̃(z)
does affect the geometry of the optimization problem. In particular, if p̃(z) is now an
α-acquiescent characteristic polynomial as defined in Definition 6.3.1, then we can
find it simply using stochastic gradient descent, as shown in Section 6.4. Observe
that we do not require knowledge of u(z), only its existence. Denoting by d the
degree of u, the algorithm itself is simply stochastic gradient descent with n + d
model parameters instead of n.

Our discussion motivates the following definition.

⁴The minimum representation of a transfer function G(z) is defined as the representation G(z) = s(z)/p(z) with p(z) having minimum degree.
Definition 6.5.1. A polynomial p(z) of degree n is α-acquiescent by extension of
degree d if there exists a polynomial u(z) of degree d and leading coefficient 1 such
that p(z)u(z) is α-acquiescent.

For a transfer function G(z), we define its H₂ norm as

‖G‖²_{H₂} = (1/2π) ∫₀^{2π} |G(e^{iθ})|² dθ .

We assume (without loss of generality) that the true transfer function G(z) has
bounded H₂ norm, that is, ‖G‖_{H₂} ≤ 1. This can be achieved by a rescaling⁵ of
the matrix C.
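The H₂ norm is easy to evaluate numerically, and for a polynomial h(z) = b₀ + b₁z + · · · + b_{n−1}z^{n−1} it reduces to the coefficient norm, ‖h‖_{H₂} = ‖b‖ (Parseval), a fact used later in the proof of Theorem 6.5.2. A small sketch (the grid size is an arbitrary choice):

```python
import cmath

def h2_norm_sq(G, num=4096):
    """Approximate ||G||_{H2}^2 = (1/2pi) * integral over [0, 2pi) of
    |G(e^{i theta})|^2 by an equispaced average over the unit circle."""
    total = 0.0
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        total += abs(G(z)) ** 2
    return total / num

def poly(b):
    """The polynomial z -> b[0] + b[1] z + ... from its coefficient list."""
    return lambda z: sum(bj * z ** j for j, bj in enumerate(b))
```

For instance, for b = [1, 2, 3] the squared H₂ norm is 1 + 4 + 9 = 14, matching ‖b‖², and for the rational function 1/(z − 0.5) one gets the geometric series 1/(1 − 0.25) = 4/3.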
Theorem 6.5.2. Suppose the true system has transfer function G(z) = s(z)/p(z)
with a characteristic function p(z) that is α-acquiescent by extension of degree d, and
‖G‖_{H₂} ≤ 1. Then projected stochastic gradient descent with m = n + d states (that is,
Algorithm 8 with m states) returns a system Θ̂ with population risk

f(Θ̂) ≤ O( √((m⁵ + σ²m³)/(TN)) ),

where the O(·) notation hides polynomial dependencies on τ₀, τ₁, τ₂, and 1/(1 − α).
The theorem follows directly from Corollary 6.4.2 (with some additional care about
the scaling).

Proof of Theorem 6.5.2. Let p̃(z) = p(z)u(z) be the acquiescent extension of p(z).
Since τ₀ ≤ |p(z)u(z)| = |p̃(z)| ≤ τ₂ on the unit circle, we have |s̃(z)| = |s(z)||u(z)| ≤
τ₂|s(z)|/|p(z)| = O_τ(|G(z)|). Therefore s̃(z) satisfies ‖s̃‖_{H₂} = O_τ(‖s(z)/p(z)‖_{H₂}) =
O_τ(‖G(z)‖_{H₂}) ≤ O_τ(1). This means that the vector C̃ that determines the
coefficients of s̃ satisfies ‖C̃‖ ≤ O_τ(1), since for a polynomial h(z) = b₀ + · · · +
b_{n−1}z^{n−1} we have ‖h‖_{H₂} = ‖b‖. Therefore we can apply Corollary 6.4.2 to
complete the proof.

⁵In fact, this is a natural scaling that makes comparing errors easier. Recall that the population risk is essentially ‖G − Ĝ‖²_{H₂}; rescaling C so that ‖G‖_{H₂} = 1 ensures that an error ≪ 1 corresponds to non-trivial performance.
In the rest of this section, we discuss in Section 6.5.1 the instability of the minimum
representation, and in Section 6.5.2 we give several examples where the characteristic
function p(z) is not α-acquiescent but is α-acquiescent by extension with small
degree d.

As a final remark, the examples illustrated in the following subsections may be
far from optimally analyzed. It is beyond the scope of this paper to understand the
optimal conditions under which p(z) is acquiescent by extension.
6.5.1 Instability of the Minimum Representation
We begin by constructing a contrived example where the minimum representation of
G(z) is not stable at all, and as a consequence one cannot hope to recover the
minimum representation of G(z).

Consider G(z) = s(z)/p(z) := (zⁿ − 0.8ⁿ)/((z − 0.1)(zⁿ − 0.9ⁿ)) and G′(z) =
s′(z)/p′(z) := 1/(z − 0.1). Clearly these are the minimum representations of G(z)
and G′(z), and both satisfy acquiescence. On the one hand, the characteristic
polynomials p(z) and p′(z) are very different. On the other hand, the transfer
functions G(z) and G′(z) take almost the same values on the unit circle, up to
exponentially small error:

|G(z) − G′(z)| = (0.9ⁿ − 0.8ⁿ)/|(z − 0.1)(zⁿ − 0.9ⁿ)| ≤ exp(−Ω(n)) .

Moreover, the transfer functions G(z) and G′(z) are on the order of Θ(1) on the unit
circle. This suggests that from an (inverse polynomially accurate) approximation of
the transfer function G(z), we cannot hope to recover the minimum representation in
any sense, even if the minimum representation satisfies acquiescence.
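The gap between the two transfer functions can be checked numerically; in this sketch (with n = 50 and a 400-point grid as arbitrary choices) the maximum gap on the unit circle is already below 10⁻², and it shrinks further as n grows.

```python
import cmath

def G(z, n):
    # minimum representation with a characteristic polynomial of degree n + 1
    return (z ** n - 0.8 ** n) / ((z - 0.1) * (z ** n - 0.9 ** n))

def G_prime(z):
    # minimum representation with a characteristic polynomial of degree 1
    return 1.0 / (z - 0.1)

def max_gap(n, num=400):
    """Largest |G - G'| over an equispaced grid on the unit circle."""
    return max(abs(G(cmath.exp(2j * cmath.pi * k / num), n) -
                   G_prime(cmath.exp(2j * cmath.pi * k / num)))
               for k in range(num))
```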
6.5.2 Power of Improper Learning in Various Cases
We illustrate the use of improper learning through various examples below.
Example: artificial construction
We consider a simple contrived example where improper learning helps us learn
the transfer function dramatically. We exhibit a characteristic function which is not
1-acquiescent but is (α + 1)/2-acquiescent by extension of degree 3.

Let n be a large enough integer and let α be a constant. Let J = {1, n − 1, n} and
ω = e^{2πi/n}, and define p(z) = z³ ∏_{j∈[n], j∉J} (z − αω^j). Then we have

p(z)/zⁿ = ∏_{j∈[n], j∉J} (1 − αω^j/z) = (1 − αⁿ/zⁿ) / ((1 − αω/z)(1 − αω^{−1}/z)(1 − α/z)) .   (6.5.1)

Taking z = e^{−iπ/2}, the argument (phase) of p(z)/zⁿ is roughly −3π/4, and
therefore p(z)/zⁿ ∉ C, which implies that p(z) is not 1-acquiescent. On the other
hand, picking u(z) = (z − αω)(z − α)(z − αω^{−1}) as the helper polynomial, from
equation (6.5.1) we have that p(z)u(z)/z^{n+3} = 1 − αⁿ/zⁿ takes values inverse
exponentially close to 1 on the circle of radius (α + 1)/2. Therefore p(z)u(z) is
(α + 1)/2-acquiescent.
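The identity p(z)u(z)/z^{n+3} = 1 − αⁿ/zⁿ can be verified numerically: multiplying back the three removed roots recovers ∏_{j∈[n]}(z − αω^j) = zⁿ − αⁿ. A sketch with n = 12 and α = 0.5 (arbitrary choices for illustration):

```python
import cmath

n, alpha = 12, 0.5
omega = cmath.exp(2j * cmath.pi / n)
J = {1, n - 1, n}

def p(z):
    # z^3 times the product over roots alpha * omega^j with j outside J
    out = z ** 3
    for j in range(1, n + 1):
        if j not in J:
            out *= (z - alpha * omega ** j)
    return out

def u(z):
    # helper polynomial restoring the three removed roots
    return (z - alpha * omega) * (z - alpha) * (z - alpha * omega ** (n - 1))

def residual(num=200):
    """max over the unit circle of |p(z)u(z)/z^{n+3} - (1 - alpha^n/z^n)|."""
    worst = 0.0
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        worst = max(worst, abs(p(z) * u(z) / z ** (n + 3)
                               - (1 - alpha ** n / z ** n)))
    return worst
```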
Example: characteristic function with separated roots
A characteristic polynomial with well-separated roots is acquiescent by extension.
Our bound depends on the following quantity of p, which characterizes the
separation of the roots.

Definition 6.5.3. For a polynomial h(z) of degree n with roots λ₁, . . . , λₙ inside the
unit circle, define the quantity Γ(h) as

Γ(h) := Σ_{j∈[n]} | λⱼⁿ / ∏_{i≠j}(λᵢ − λⱼ) | .

Lemma 6.5.4. Suppose p(z) is a polynomial of degree n with distinct roots inside the
circle of radius α. Let Γ = Γ(p). Then p(z) is α-acquiescent by extension of degree
d = O(max{(1 − α)^{−1} log(√n Γ · ‖p‖_{H₂}), 0}).
Our main idea is to extend p(z) by multiplying it by a polynomial u that
approximates p^{−1} (in a relatively weak sense), so that pu always takes values in the
set C.

Lemma 6.5.5 (Approximation of the inverse of a polynomial). Suppose p(z) is a
polynomial of degree n and leading coefficient 1 with distinct roots inside the circle of
radius α, and let Γ = Γ(p). Then for d = O(max{(1/(1 − α)) log(Γ/((1 − α)ζ)), 0}),
there exists a polynomial h(z) of degree d and leading coefficient 1 such that for all z
on the unit circle,

| z^{n+d}/p(z) − h(z) | ≤ ζ .
We believe the following lemma should be known, but we provide a proof for
completeness. Towards proving Lemma 6.5.5, we use the following lemma to express
the inverse of a polynomial as a sum of inverses of degree-1 polynomials.

Lemma 6.5.6. Let p(z) = (z − λ₁) · · · (z − λₙ), where the λⱼ are distinct. Then

1/p(z) = Σ_{j=1}^{n} tⱼ/(z − λⱼ) ,  where tⱼ = ( ∏_{i≠j}(λⱼ − λᵢ) )^{−1} .   (6.5.2)
Proof of Lemma 6.5.6. By interpolating the constant function 1 at the points
λ₁, . . . , λₙ with the Lagrange interpolation formula, we have

1 = Σ_{j=1}^{n} ∏_{i≠j}(x − λᵢ) / ∏_{i≠j}(λⱼ − λᵢ) .   (6.5.3)

Dividing both sides by p(z), we obtain equation (6.5.2).
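Equation (6.5.2) is easy to verify numerically for concrete distinct roots (the roots and evaluation point below are arbitrary):

```python
def partial_fraction_weights(roots):
    """t_j = 1 / prod_{i != j} (lambda_j - lambda_i), as in Lemma 6.5.6."""
    ts = []
    for j, lj in enumerate(roots):
        prod = 1.0
        for i, li in enumerate(roots):
            if i != j:
                prod *= (lj - li)
        ts.append(1.0 / prod)
    return ts

def expansion_gap(roots, z):
    """|1/p(z) - sum_j t_j/(z - lambda_j)| at a test point z."""
    direct = 1.0
    for l in roots:
        direct /= (z - l)
    expanded = sum(t / (z - l)
                   for t, l in zip(partial_fraction_weights(roots), roots))
    return abs(direct - expanded)
```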
The following lemma computes the Fourier coefficients of the function 1/(z − λ).

Lemma 6.5.7. Let m ∈ Z, let K be the unit circle in the complex plane, and let
λ ∈ C lie inside K. Then

∫_K z^m/(z − λ) dz = 2πiλ^m  if m ≥ 0,  and 0 otherwise.
Proof of Lemma 6.5.7. For m ≥ 0, since z^m is a holomorphic function, by Cauchy's
integral formula we have

∫_K z^m/(z − λ) dz = 2πiλ^m .

For m < 0, the change of variable y = z^{−1} gives

∫_K z^m/(z − λ) dz = ∫_K y^{−m−1}/(1 − λy) dy .

Since |λy| = |λ| < 1, by Taylor expansion we have

∫_K y^{−m−1}/(1 − λy) dy = ∫_K y^{−m−1} ( Σ_{k=0}^{∞} (λy)^k ) dy .

Since the series Σ(λy)^k is dominated by Σ|λ|^k, which converges, we can exchange
the integral with the sum. Each integrand y^{−m−1+k} is holomorphic for m < 0 and
k ≥ 0, and therefore we conclude that

∫_K y^{−m−1}/(1 − λy) dy = 0 .
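Lemma 6.5.7 can be checked with a trapezoidal rule on the unit circle, which converges spectrally fast for this integrand: writing z = e^{iθ}, we have (1/2πi)∮ f(z) dz ≈ (1/N) Σ_k f(z_k) z_k.

```python
import cmath

def cauchy_integral(m, lam, num=256):
    """(1/2 pi i) times the contour integral of z^m / (z - lam) over |z| = 1,
    approximated on an equispaced grid; the extra factor z is dz/(i dtheta)."""
    total = 0j
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        total += z ** m / (z - lam) * z
    return total / num
```

Dividing the lemma's value by 2πi, the expected results are λ^m for m ≥ 0 and 0 for m < 0.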
Now we are ready to prove Lemma 6.5.5.

Proof of Lemma 6.5.5. Let m = n + d. We compute the Fourier expansion of
z^m/p(z) on the unit circle; that is, we write

e^{imθ}/p(e^{iθ}) = Σ_{k=−∞}^{∞} βₖ e^{ikθ} ,

where

βₖ = (1/2π) ∫₀^{2π} e^{i(m−k)θ}/p(e^{iθ}) dθ = (1/2πi) ∫_K z^{m−k−1}/p(z) dz .

By Lemma 6.5.6, we write 1/p(z) = Σ_{j=1}^{n} tⱼ/(z − λⱼ). It then follows that

βₖ = (1/2πi) Σ_{j=1}^{n} tⱼ ∫_K z^{m−k−1}/(z − λⱼ) dz .

Using Lemma 6.5.7, we obtain

βₖ = Σ_{j=1}^{n} tⱼ λⱼ^{m−k−1}  if k ≤ m − 1,  and βₖ = 0 otherwise.   (6.5.4)

We claim that

Σ_{j=1}^{n} tⱼ λⱼ^{n−1} = 1 ,  and  Σ_{j=1}^{n} tⱼ λⱼ^{s} = 0  for 0 ≤ s < n − 1 .

Indeed, these identities follow by writing out the Lagrange interpolation of the
polynomial f(x) = x^s with s ≤ n − 1 and comparing leading coefficients. Therefore
we can further simplify βₖ to

βₖ = Σ_{j=1}^{n} tⱼ λⱼ^{m−k−1}  if k < m − n ;  βₖ = 1  if k = m − n ;  βₖ = 0  otherwise.   (6.5.5)

Let h(z) = Σ_{k≥0} βₖ z^k. Then h(z) is a polynomial of degree d = m − n with
leading coefficient 1. Moreover, writing γ = 1 − α, for our choice of d we have

| z^m/p(z) − h(z) | = | Σ_{k<0} βₖ z^k | ≤ Σ_{k<0} |βₖ| ≤ Σ_{k<0} Γ(1 − γ)^{d−k−1} = Γ(1 − γ)^d/γ < ζ ,

where we used |βₖ| ≤ Σⱼ |tⱼ||λⱼ|ⁿ(1 − γ)^{d−k−1} ≤ Γ(1 − γ)^{d−k−1} for k < 0.
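The construction in the proof is fully explicit, so it can be checked numerically: using the βₖ from (6.5.5), the degree-d polynomial h approximates z^{n+d}/p(z) on the unit circle up to the geometric tail. The roots, degrees, and grid size below are arbitrary choices for illustration.

```python
import cmath

def extension_poly(roots, d):
    """Coefficients beta_0..beta_d of h(z) per equation (6.5.5), m = n + d."""
    n, m = len(roots), len(roots) + d
    ts = []
    for j, lj in enumerate(roots):
        prod = 1.0
        for i, li in enumerate(roots):
            if i != j:
                prod *= (lj - li)
        ts.append(1.0 / prod)
    betas = [sum(t * l ** (m - k - 1) for t, l in zip(ts, roots))
             for k in range(d)]
    betas.append(1.0)   # beta_{m-n} = beta_d = 1: h is monic of degree d
    return betas

def max_error(roots, d, num=400):
    """max over the unit circle of |z^m / p(z) - h(z)|."""
    betas = extension_poly(roots, d)
    m = len(roots) + d
    worst = 0.0
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        p = 1.0
        for l in roots:
            p *= (z - l)
        h = sum(b * z ** j for j, b in enumerate(betas))
        worst = max(worst, abs(z ** m / p - h))
    return worst
```

Increasing d shrinks the error geometrically, in line with the Γ(1 − γ)^d/γ bound.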
Proof of Lemma 6.5.4. Let γ = 1 − α. Using Lemma 6.5.5 with ζ = 0.5‖p‖⁻¹_{H∞},
there exists a polynomial u of degree d = O(max{(1/(1 − α)) log(Γ‖p‖_{H∞}), 0}) such
that

| z^{n+d}/p(z) − u(z) | ≤ ζ .

Then we have

| p(z)u(z)/z^{n+d} − 1 | ≤ ζ|p(z)| ≤ 0.5 .

Therefore p(z)u(z)/z^{n+d} ∈ C_{τ₀,τ₁,τ₂} for constants τ₀, τ₁, τ₂. Finally, noting that
for a degree-n polynomial we have ‖h‖_{H∞} ≤ √n · ‖h‖_{H₂} completes the proof.
Example: Characteristic polynomial with random roots
We consider the following generative model for a characteristic polynomial of degree
2n. We generate n complex numbers λ₁, . . . , λₙ uniformly at random on the circle of
radius α < 1, and take λᵢ, λ̄ᵢ for i = 1, . . . , n as the roots of p(z); that is, p(z) =
(z − λ₁)(z − λ̄₁) · · · (z − λₙ)(z − λ̄ₙ). We show that with good probability (over the
randomness of the λᵢ), the polynomial p(z) satisfies the conditions of the previous
example, so that it can be learned efficiently by our improper learning algorithm.

Theorem 6.5.8. Suppose p(z), with random roots inside the circle of radius α, is
generated by the process described above. Then with high probability over the choice
of p, we have Γ(p) ≤ exp(Õ(√n)) and ‖p‖_{H₂} ≤ exp(Õ(√n)). As a corollary, p(z) is
α-acquiescent by extension of degree O((1 − α)^{−1}n).
Towards proving Theorem 6.5.8, we need the following lemma about the expected
distance (in log-space) between two random points of radii ρ and r.

Lemma 6.5.9. Let x ∈ C be a fixed point with |x| = ρ, and let λ be drawn uniformly
at random from the circle of radius r. Then E[ln |x − λ|] = ln max{ρ, r}.

Proof. Suppose r ≠ ρ. Let N be an integer and ω = e^{2πi/N}. Then we have

E[ln |x − λ|] = lim_{N→∞} (1/N) Σ_{k=1}^{N} ln |x − rω^k| .   (6.5.6)

The right-hand side of equation (6.5.6) can be computed easily by observing that

(1/N) Σ_{k=1}^{N} ln |x − rω^k| = (1/N) ln | ∏_{k=1}^{N} (x − rω^k) | = (1/N) ln |x^N − r^N| .

Therefore, when ρ > r, we have lim_{N→∞} (1/N) ln |x^N − r^N| = ln ρ +
lim_{N→∞} (1/N) ln |(x/ρ)^N − (r/ρ)^N| = ln ρ. On the other hand, when ρ < r, we
similarly have lim_{N→∞} (1/N) Σ_{k=1}^{N} ln |x − rω^k| = ln r. Therefore E[ln |x − λ|] =
ln max{ρ, r}. For ρ = r, a similar proof (with more careful treatment of regularity
conditions) shows that E[ln |x − λ|] = ln r.
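Lemma 6.5.9 is easy to check numerically via the grid average in (6.5.6) (the radii and grid size below are arbitrary):

```python
import math, cmath

def avg_log_dist(x, r, num=20000):
    """(1/N) * sum_k ln|x - r w^k|: an equispaced-grid version of
    E[ln|x - lambda|] for lambda uniform on the circle of radius r."""
    total = 0.0
    for k in range(num):
        lam = r * cmath.exp(2j * cmath.pi * k / num)
        total += math.log(abs(x - lam))
    return total / num
```

With x = 0.8 and r = 0.5 the average is ln 0.8 (the larger radius wins), while with x = 0.3 and r = 0.5 it is ln 0.5.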
Now we are ready to prove Theorem 6.5.8.

Proof of Theorem 6.5.8. Fix an index i and the choice of λᵢ, and consider the random
variable

Yᵢ = ln( |λᵢ|^{2n} / ( ∏_{j≠i} |λᵢ − λⱼ| · ∏_{j∈[n]} |λᵢ − λ̄ⱼ| ) ) = 2n ln |λᵢ| − Σ_{j≠i} ln |λᵢ − λⱼ| − Σ_{j∈[n]} ln |λᵢ − λ̄ⱼ| .

By Lemma 6.5.9, we have E[Yᵢ] = 2n ln |λᵢ| − Σ_{j≠i} E[ln |λᵢ − λⱼ|] − Σ_{j} E[ln |λᵢ − λ̄ⱼ|] =
ln α. Let Zⱼ = ln |λᵢ − λⱼ| − E[ln |λᵢ − λⱼ|]. Then the Zⱼ are mean-zero random
variables with ψ₁-Orlicz norm bounded by a constant, since E[e^{ln |λᵢ−λⱼ|} − 1] ≤ 1.
Therefore, by the Bernstein inequality for random variables with sub-exponential tails
(for example, [105, Theorem 6.21]), with high probability (1 − n^{−10}) it holds that
|Σ_{j≠i} Zⱼ| ≤ Õ(√n), where Õ hides logarithmic factors. Therefore, with high
probability, |Yᵢ| ≤ Õ(√n).

Finally, we take a union bound over all i ∈ [n] and obtain that with high probability
|Yᵢ| ≤ Õ(√n) for all i ∈ [n], which implies Σᵢ exp(Yᵢ) ≤ exp(Õ(√n)) and hence
Γ(p) ≤ exp(Õ(√n)). With a similar technique, we can prove that ‖p‖_{H₂} ≤ exp(Õ(√n)).
Example: Passive systems
We show that with improper learning we can learn almost all passive systems, an
important class of stable linear dynamical systems discussed earlier. We start with
the definition of a strict-input passive system.

Definition 6.5.10 (Passive system, cf. [103]). A SISO linear system is strict-input
passive if and only if for some τ₀ > 0 and any z on the unit circle, ℜ(G(z)) ≥ τ₀.

In order to learn passive systems, we need to strengthen the definition of strict
passivity slightly. To make this precise, we define the following subset of the complex
plane: for positive constants τ₀, τ₁, τ₂, define

C⁺_{τ₀,τ₁,τ₂} = { z ∈ C : |z| ≤ τ₂, ℜ(z) ≥ τ₁, ℜ(z) ≥ τ₀|ℑ(z)| } .   (6.5.7)

We say a transfer function G(z) = s(z)/p(z) is (τ₀, τ₁, τ₂)-strict-input passive if
for any z on the unit circle we have G(z) ∈ C⁺_{τ₀,τ₁,τ₂}. Note that for small constants
τ₀, τ₁ and a large constant τ₂, this essentially says the system is strict-input passive.

Now we are ready to state the main theorem of this subsection. We prove that
passive systems can be learned improperly with a constant factor more states
(dimensions), assuming s(z) has all its roots strictly inside the unit circle and
Γ(s) ≤ exp(O(n)).
Theorem 6.5.11. Suppose G(z) = s(z)/p(z) is (τ₀, τ₁, τ₂)-strict-input passive.
Moreover, suppose the roots of s(z) lie inside the circle of radius α, Γ = Γ(s) ≤
exp(O(n)), and ‖p‖_{H₂} ≤ exp(O(n)). Then p(z) is α-acquiescent by extension of
degree d = O_{τ,α}(n), and as a consequence we can learn G(z) with n + d states in
polynomial time.

Moreover, if in addition we assume that G(z) ∈ C_{τ₀,τ₁,τ₂} for every z on the unit
circle, then p(z) is α-acquiescent by extension of degree d = O_{τ,α}(n).

The proof of Theorem 6.5.11 is similar in spirit to that of Lemma 6.5.4. It
follows directly from a combination of Lemma 6.5.12 and Lemma 6.5.13 below.
Lemma 6.5.12 shows that the denominator of a transfer function (under the stated
assumptions) can be extended to a polynomial that takes values in C⁺ on the unit
circle, and Lemma 6.5.13 shows that it can be further extended to another polynomial
that takes values in C.
Lemma 6.5.12. Suppose the roots of s are inside the circle of radius α < 1, and let
Γ = Γ(s). If the transfer function G(z) = s(z)/p(z) satisfies G(z) ∈ C_{τ₀,τ₁,τ₂}
(or G(z) ∈ C⁺_{τ₀,τ₁,τ₂}) for every z on the unit circle, then there exists u(z) of degree
d = O_τ(max{(1/(1 − α)) log(√nΓ · ‖p‖_{H₂}/(1 − α)), 0}) such that p(z)u(z)/z^{n+d} ∈
C_{τ′₀,τ′₁,τ′₂} (respectively, p(z)u(z)/z^{n+d} ∈ C⁺_{τ′₀,τ′₁,τ′₂}) for τ′ = Θ_τ(1), where
O_τ(·), Θ_τ(·) hide polynomial dependencies on τ₀, τ₁, τ₂.

Proof of Lemma 6.5.12. By the fact that G(z) = s(z)/p(z) ∈ C_{τ₀,τ₁,τ₂}, we have
p(z)/s(z) ∈ C_{τ′₀,τ′₁,τ′₂} for some τ′ that depends polynomially on τ. Using
Lemma 6.5.5, there exists u(z) of degree d such that

| z^{n+d}/s(z) − u(z) | ≤ ζ ,

where we set ζ ≪ min{τ′₀, τ′₁}/τ′₂ · ‖p‖⁻¹_{H∞}. Then we have

| p(z)u(z)/z^{n+d} − p(z)/s(z) | ≤ |p(z)|ζ ≪ min{τ′₀, τ′₁} .   (6.5.8)

It follows from equation (6.5.8) that p(z)u(z)/z^{n+d} ∈ C_{τ″₀,τ″₁,τ″₂}, where τ″ depends
polynomially on τ. The same proof works when C is replaced by C⁺.
Lemma 6.5.13. Suppose p(z), of degree n and leading coefficient 1, satisfies
p(z)/zⁿ ∈ C⁺_{τ₀,τ₁,τ₂} for every z on the unit circle. Then there exists u(z) of degree
d = O_τ(n) such that p(z)u(z)/z^{n+d} ∈ C_{τ′₀,τ′₁,τ′₂} for every z on the unit circle, with
τ′₀, τ′₁, τ′₂ = Θ_τ(1), where O_τ(·), Θ_τ(·) hide dependencies on τ₀, τ₁, τ₂.

Proof of Lemma 6.5.13. We first fix z on the unit circle. Let √(p(z)/zⁿ) denote the
square root of p(z)/zⁿ with principal value. Write p(z)/zⁿ = τ₂(1 + (p(z)/(τ₂zⁿ) − 1))
and take the Taylor expansion

1/√(p(z)/zⁿ) = τ₂^{−1/2} (1 + (p(z)/(τ₂zⁿ) − 1))^{−1/2} = τ₂^{−1/2} Σ_{k=0}^{∞} cₖ (p(z)/(τ₂zⁿ) − 1)^k ,

where the cₖ are the Taylor coefficients of (1 + x)^{−1/2}. Note that since τ₁ < |p(z)| <
τ₂, we have |p(z)/(τ₂zⁿ) − 1| < 1 − τ₁/τ₂. Therefore, truncating the Taylor series at
k = O_τ(1), we obtain a rational function h(z) of the form

h(z) = τ₂^{−1/2} Σ_{j=0}^{k} cⱼ (p(z)/(τ₂zⁿ) − 1)^j ,

which approximates 1/√(p(z)/zⁿ) with precision ζ ≪ min{τ₀, τ₁/τ₂}, that is,
| 1/√(p(z)/zⁿ) − h(z) | ≤ ζ. Therefore we obtain

| p(z)h(z)/zⁿ − √(p(z)/zⁿ) | ≤ ζ|p(z)/zⁿ| ≤ ζτ₂ .

Note that since p(z)/zⁿ ∈ C⁺_{τ₀,τ₁,τ₂}, we have √(p(z)/zⁿ) ∈ C_{τ′₀,τ′₁,τ′₂} for some
constants τ′₀, τ′₁, τ′₂, and therefore p(z)h(z)/zⁿ ∈ C_{τ′₀,τ′₁,τ′₂}. Note that h(z) is not
yet a polynomial. Let u(z) = z^{nk}h(z); then u(z) is a polynomial of degree at most
nk, and p(z)u(z)/z^{n(k+1)} ∈ C_{τ′₀,τ′₁,τ′₂} for every z on the unit circle.
6.5.3 Improper Learning Using Linear Regression
In this subsection, we show that under a stronger assumption than α-acquiescence
by extension, we can improperly learn a linear dynamical system with linear
regression, up to some fixed bias.

The basic idea is to fit a linear function mapping [x_{k−ℓ}, . . . , xₖ] to yₖ. This
is equivalent to a dynamical system with ℓ hidden states whose companion matrix A
in (6.1.4) is chosen with a_ℓ = 1 and a_{ℓ−1} = · · · = a₁ = 0. In this case, the hidden
states exactly memorize the previous ℓ inputs, and the output is a linear combination
of the hidden states.

Equivalently, in the frequency domain, this corresponds to fitting the transfer
function G(z) = s(z)/p(z) with a rational function of the form
(c₁z^{ℓ−1} + · · · + c_ℓ)/z^{ℓ−1} = c₁ + c₂z^{−1} + · · · + c_ℓ z^{−(ℓ−1)}. The following
is a sufficient condition on the characteristic polynomial p(z) that guarantees the
existence of such a fit.
Definition 6.5.14. A polynomial p(z) of degree n is extremely-acquiescent by
extension of degree d with bias ε if there exists a polynomial u(z) of degree d and
leading coefficient 1 such that for all z on the unit circle,

| p(z)u(z)/z^{n+d} − 1 | ≤ ε .   (6.5.9)

We remark that if p(z) is 1-acquiescent by extension of degree d, then there exists
u(z) such that p(z)u(z)/z^{n+d} ∈ C. Therefore, equation (6.5.9) above is a much
stronger requirement than acquiescence by extension.⁶
When p(z) is extremely-acquiescent, the transfer function G(z) = s(z)/p(z) can
be approximated by s(z)u(z)/z^{n+d} up to bias ε. Let ℓ = n + d + 1 and s(z)u(z) =
c₁z^{ℓ−1} + · · · + c_ℓ. Then G(z) can be approximated with bias ε by the following
dynamical system with ℓ hidden states: we choose A = C(a) with a_ℓ = 1 and
a_{ℓ−1} = · · · = a₁ = 0, and C = [c₁, . . . , c_ℓ]. As argued previously, such a dynamical
system simply memorizes the previous ℓ inputs, and therefore it is equivalent to
linear regression from the features [x_{k−ℓ}, . . . , xₖ] to the output yₖ.

⁶We need (1 − δ)-acquiescence by extension in the previous subsections for small δ > 0, though this is merely an additional technicality needed for the sample complexity. We ignore the difference between (1 − δ)-acquiescence and 1-acquiescence for the purposes of this subsection.
Proposition 6.5.15 (Informal). Suppose the true system G(z) = s(z)/p(z) satisfies
that p(z) is extremely-acquiescent by extension of degree d. Then using linear
regression we can learn the mapping from [x_{k−ℓ}, . . . , xₖ] to yₖ with bias ε and
polynomial sample complexity.

We remark that with linear regression the bias ε goes to zero only as we increase
the length ℓ of the feature window, not as we increase the number of samples.
Moreover, linear regression requires a stronger assumption than the improper learning
results in the previous subsections do. The latter can be viewed as an interpolation
between the proper case and the regime where linear regression works.
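The regression reading can be sketched directly: when the true system is itself a length-ℓ finite impulse response (a special case in which the bias ε is zero), ordinary least squares on a sliding window of inputs recovers the output map exactly. The normal-equation solver below is a minimal stand-in for any least-squares routine, and the window length and coefficients are arbitrary choices.

```python
import random

def solve(Ab):
    """Gauss-Jordan elimination with partial pivoting on an augmented matrix."""
    m = len(Ab)
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(Ab[r][col]))
        Ab[col], Ab[piv] = Ab[piv], Ab[col]
        for r in range(m):
            if r != col and Ab[r][col] != 0:
                f = Ab[r][col] / Ab[col][col]
                Ab[r] = [a - f * b for a, b in zip(Ab[r], Ab[col])]
    return [Ab[r][m] / Ab[r][r] for r in range(m)]

def fit_fir(xs, ys, ell):
    """Least squares from the window [x_{k-ell+1}, ..., x_k] to y_k,
    via the normal equations (R^T R) c = R^T y."""
    rows = [xs[k - ell + 1:k + 1][::-1] for k in range(ell - 1, len(xs))]
    targets = ys[ell - 1:]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(ell)]
         for i in range(ell)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(ell)]
    return solve([A[i] + [b[i]] for i in range(ell)])

random.seed(1)
c_true = [0.7, -0.3, 0.2, 0.1]             # y_k = sum_j c_j x_{k-j}
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [sum(c * xs[k - j] for j, c in enumerate(c_true)) if k >= 3 else 0.0
      for k in range(500)]
c_hat = fit_fir(xs[4:], ys[4:], 4)         # skip the warm-up outputs
```

As the surrounding text notes, for a general (non-FIR) system the fitted window only approximates the true map, with a bias that shrinks as ℓ grows.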
6.6 Learning Multi-input Multi-output (MIMO)
Systems
We consider multi-input multi-output systems whose transfer functions have a
common denominator p(z),

G(z) = (1/p(z)) · S(z) ,   (6.6.1)

where S(z) is an ℓ_out × ℓ_in matrix, each entry of which is a polynomial with real
coefficients of degree at most n, and p(z) = zⁿ + a₁z^{n−1} + · · · + aₙ. Note that here
we use ℓ_in to denote the dimension of the inputs of the system and ℓ_out the
dimension of the outputs.
Although a special case of general MIMO systems, this class still contains many
interesting cases, such as the transfer functions studied in [65, 64], where G(z) is
assumed to take the form G(z) = R₀ + Σ_{i=1}^{n} Rᵢ/(z − λᵢ), for λ₁, . . . , λₙ ∈ C with
conjugate symmetry and Rᵢ ∈ C^{ℓ_out×ℓ_in} satisfying Rᵢ = R̄ⱼ whenever λᵢ = λ̄ⱼ.

In order to learn the system G(z), we parametrize p(z) by its coefficients a₁, . . . , aₙ
and S(z) by the coefficients of its entries. Note that each entry of S(z) depends on
n + 1 real coefficients, so the collection of coefficients forms a third-order tensor of
dimension ℓ_out × ℓ_in × (n + 1). It will be convenient to collect the leading
coefficients of the entries of S(z) into a matrix of dimension ℓ_out × ℓ_in, named D,
and the rest of the coefficients into a matrix of dimension ℓ_out × ℓ_in n, denoted by C.
This will be particularly intuitive when a state-space representation is used to learn
the system from samples, as discussed later. We parametrize the learned transfer
function Ĝ(z) by â, Ĉ, and D̂ in the same way.
Let us define the risk function in the frequency domain as

g(Â, Ĉ, D̂) = ∫₀^{2π} ‖ G(e^{iθ}) − Ĝ(e^{iθ}) ‖²_F dθ .   (6.6.2)

The following lemma is an analog of Lemma 6.2.3 for the MIMO case. Its proof
follows from a straightforward extension of the proof of Lemma 6.2.3, by observing
that the matrices S(z) (or Ŝ(z)) commute with the scalars p(z) and p̂(z), and that
Ŝ(z), p̂(z) are linear in â, Ĉ.

Lemma 6.6.1. The risk function g(â, Ĉ) defined in (6.6.2) is τ-weakly-quasi-convex
in the domain

N_τ(a) = { â ∈ Rⁿ : ℜ( p_a(z)/p_â(z) ) ≥ τ/2, ∀ z ∈ C s.t. |z| = 1 } ⊗ R^{ℓ_in×ℓ_out×n′} .
Finally, as alluded to before, we use a particular state-space representation for
learning the system in the time domain from example sequences. It is known that any
transfer function of the form (6.6.1) can be realized uniquely by the state-space
system of the following special case of Brunovsky normal form [37]:

A =
[      0           Id_{ℓ_in}        0         · · ·        0       ]
[      0               0        Id_{ℓ_in}     · · ·        0       ]
[      ⋮               ⋮             ⋮          ⋱           ⋮       ]
[      0               0             0        · · ·   Id_{ℓ_in}    ]
[ −aₙId_{ℓ_in}  −a_{n−1}Id_{ℓ_in}  −a_{n−2}Id_{ℓ_in}  · · ·  −a₁Id_{ℓ_in} ] ,

B = [0; . . . ; 0; Id_{ℓ_in}] (a block column),   (6.6.3)

and

C ∈ R^{ℓ_out×nℓ_in} , D ∈ R^{ℓ_out×ℓ_in} .
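Since A and B in (6.6.3) are block versions of the companion matrix C(a) and a basis vector, they can be built as Kronecker products with Id_{ℓ_in}; a sketch (the list-of-lists representation is an arbitrary choice):

```python
def kron_with_identity(M, p):
    """Kronecker product M (x) Id_p as nested lists."""
    rows, cols = len(M), len(M[0])
    out = [[0.0] * (cols * p) for _ in range(rows * p)]
    for i in range(rows):
        for j in range(cols):
            for r in range(p):
                out[i * p + r][j * p + r] = M[i][j]
    return out

def brunovsky_A_B(a, l_in):
    """State matrices of (6.6.3): A = C(a) (x) Id, B = e_n (x) Id,
    with a = [a_1, ..., a_n]."""
    n = len(a)
    comp = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        comp[i][i + 1] = 1.0            # super-diagonal identity blocks
    for j in range(n):
        comp[n - 1][j] = -a[n - 1 - j]  # last block row: -a_n, ..., -a_1
    A = kron_with_identity(comp, l_in)
    B = kron_with_identity([[0.0]] * (n - 1) + [[1.0]], l_in)
    return A, B
```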
The following theorem is a straightforward extension of Corollary 6.4.2 and
Theorem 6.5.2 to the MIMO case.

Theorem 6.6.2. Suppose the transfer function G(z) of a MIMO system takes the
form (6.6.1) and has norm ‖G‖_{H₂} ≤ 1. If the common denominator p(z) is
α-acquiescent by extension of degree d, then projected stochastic gradient descent over
the state-space representation (6.6.3) returns Θ̂ with risk

f(Θ̂) ≤ poly(n + d, σ, τ, (1 − α)^{−1}) / (TN) .

We note that since A and B are simply the tensor products of C(a) and eₙ with
Id_{ℓ_in}, the no-blow-up property (Lemma 6.3.4) for AᵏB still holds. Therefore, to
prove Theorem 6.6.2 we essentially only need to repeat the proof of Lemma 6.4.5 with
matrix notation and matrix norms. We defer the proof to the full version.
6.7 Technicalities: Mean and Variance of the Gradient Estimator
In this section, we formally prove Lemma 6.4.4 and Lemma 6.4.5, which control the
mean and variance of the gradient estimator used in Algorithm 7.
Proof of Lemma 6.4.4
Lemma 6.4.4 follows directly from the following more general lemma, which also handles
the multi-input multi-output case. It can be seen from a calculation similar to
the proof of Lemma 6.2.5. We mainly need to control the tail of the series using the
no-blow-up property (Lemma 6.3.4) and argue that the wrong value of the initial
state h_0 does not affect the partial loss function ℓ((x, y), Θ̂) (defined in
Algorithm 7). This is simply because after time T1 = T/4, the influence of the initial
state is already washed out.
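To illustrate the wash-out effect numerically, the following sketch (a toy stable SISO system with made-up coefficients; not part of the proof) checks that the contribution C A^{t−1} h_0 of the initial state decays geometrically:

```python
import numpy as np

def companion(a):
    """Companion matrix C(a) with last row (-a_n, ..., -a_1) (ordering assumed)."""
    n = len(a)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -a[::-1]
    return A

A = companion(np.array([0.3, 0.2, 0.1]))  # toy stable system: roots inside the unit disk
C = np.ones(3)
h0 = np.ones(3)
# |C A^{t-1} h0| measures how much the (wrong) initial state still influences y_t.
influence = [abs(C @ np.linalg.matrix_power(A, t) @ h0) for t in range(0, 60, 10)]
print(influence[0], influence[-1])
```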
Lemma 6.7.1. In Algorithm 9, the values of G_A, G_C, G_D are equal to the gradients
of g(Â, Ĉ) + ‖D̂ − D‖² with respect to Â, Ĉ and D̂, up to inverse exponentially small
error.
Proof of Lemma 6.7.1. We first show that the partial empirical loss function
ℓ((x, y), Θ̂) has expectation almost equal to the idealized risk (up to the term
for D and an exponentially small error),

E[ℓ((x, y), Θ̂)] = g(Â, Ĉ) + ‖D̂ − D‖² ± exp(−Ω((1 − α)T)) .
This can be seen from a calculation similar to the proof of Lemma 6.2.5. Note
that

y_t = D x_t + Σ_{k=1}^{t−1} C A^{t−k−1} B x_k + C A^{t−1} h_0 + ξ_t   and   ŷ_t = D̂ x_t + Σ_{k=1}^{t−1} Ĉ Â^{t−k−1} B̂ x_k .
(6.7.1)
Therefore, when t ≥ T1 ≥ Ω(T), we have ‖C A^{t−1} h_0‖ ≤ exp(−Ω((1 − α)T)),
and therefore the effect of h_0 is negligible. Then we have that
E[ℓ((x, y), Θ̂)] = 1/(T − T1) · E[ Σ_{t>T1}^{T} ‖ŷ_t − y_t‖² ] ± exp(−Ω((1 − α)T))

= ‖D̂ − D‖² + 1/(T − T1) Σ_{T≥t>T1} Σ_{0≤j≤t−1} ‖Ĉ Â^j B̂ − C A^j B‖² ± exp(−Ω((1 − α)T))

= ‖D̂ − D‖² + Σ_{j=0}^{T1} ‖Ĉ Â^j B̂ − C A^j B‖² + Σ_{T≥j≥T1} ((T − j)/(T − T1)) ‖Ĉ Â^j B̂ − C A^j B‖² ± exp(−Ω((1 − α)T))

= ‖D̂ − D‖² + Σ_{j=0}^{∞} ‖Ĉ Â^j B̂ − C A^j B‖² ± exp(−Ω((1 − α)T)) ,

where the first line uses the fact that ‖C A^{t−1} h_0‖ ≤ exp(−Ω((1 − α)T)), the second
uses equation (6.2.9), and the last line uses the no-blow-up property of A^k B
(Lemma 6.3.4).
Similarly, we can prove that the gradient of E[ℓ((x, y), Θ̂)] is also close to the
gradient of g(Â, Ĉ) + ‖D̂ − D‖², up to an inverse exponentially small error.
Proof of Lemma 6.4.5
Proof of Lemma 6.4.5. Both G_A and G_C can be written as quadratic forms
(with vector coefficients) in x_1, . . . , x_T and ξ_1, . . . , ξ_T. That is, we write

G_A = Σ_{s,t} x_s x_t u_{st} + Σ_{s,t} x_s ξ_t u′_{st}   and   G_C = Σ_{s,t} x_s x_t v_{st} + Σ_{s,t} x_s ξ_t v′_{st} ,

where u_{st}, u′_{st}, v_{st}, v′_{st} are vectors that will be calculated later. By Claim 6.7.2, we
have that
V[ Σ_{s,t} x_s x_t u_{st} + Σ_{s,t} x_s ξ_t u′_{st} ] ≤ O(1) Σ_{s,t} ‖u_{st}‖² + O(σ²) Σ_{s,t} ‖u′_{st}‖² .   (6.7.2)
Therefore, in order to bound V[G_A] from above, it suffices to bound Σ_{s,t} ‖u_{st}‖² and
Σ_{s,t} ‖u′_{st}‖², and similarly for G_C.
We begin by writing out u_{st} for fixed s, t ∈ [T] and bounding its norm. We use the
same set of notations as in the proof of Lemma 6.4.4. Recall that we set r_k = C A^k B
and r̂_k = Ĉ Â^k B̂, and Δr_k = r̂_k − r_k. Moreover, let z_k = A^k B. We note that the sums
of ‖z_k‖² and Δr_k² can be controlled. By the assumption of the lemma, we have that

Σ_{k=t}^{∞} ‖z_k‖² ≤ 2πn/τ₁² ,   ‖z_k‖² ≤ 2πn α^{2k−2n}/τ₁² ,   (6.7.3)

Σ_{k=t}^{∞} Δr_k² ≤ 4πn/τ₁² ,   Δr_k² ≤ 4πn α^{2k−2n}/τ₁² ,   (6.7.4)

which will be used many times in the proof that follows.
We calculate the explicit form of G_A using the explicit back-propagation Algorithm 9.
We have that in Algorithm 9,

h_k = Σ_{j=1}^{k} A^{k−j} B x_j = Σ_{j=1}^{k} z_{k−j} x_j   (6.7.5)
and

Δh_k = Σ_{j=k}^{T} (A^⊤)^{j−k} C^⊤ Δy_j = Σ_{j=k}^{T} (A^⊤)^{j−k} C^⊤ 1(j > T1) ( ξ_j + Σ_{ℓ=1}^{j} Δr_{j−ℓ} x_ℓ ) .   (6.7.6)
Then using G_A = Σ_{k≥2} B^⊤ Δh_k h_{k−1}^⊤ and equations (6.7.5) and (6.7.6) above,
we have that

u_{st} = Σ_{k=2}^{T} ( Σ_{j≥max{k,s,T1+1}} Δr_{j−s} C A^{j−k} B ) 1(k ≥ t+1) · A^{k−t−1} B
      = Σ_{k=2}^{T} ( Σ_{j≥max{k,s,T1+1}} Δr_{j−s} r_{j−k} ) 1(k ≥ t+1) · z_{k−t−1} ,   (6.7.7)
and that

u′_{st} = Σ_{k=2}^{T} z_{k−1−s} · 1(k ≥ s+1) · r_{t−k} · 1(t > max{T1, k})
       = Σ_{s+1≤k≤t} z_{k−1−s} · r_{t−k} · 1(t > T1) .   (6.7.8)
Towards bounding ‖u_{st}‖, we consider four different cases. Let

Λ = Ω( max{ n, (1 − α)^{−1} log(1/(1 − α)) } )

be a threshold.
Case 1: When 0 ≤ s − t ≤ Λ, we rewrite u_{st} by rearranging equation (6.7.7),

u_{st} = Σ_{T≥k≥s} z_{k−t−1} Σ_{j≥max{k,T1+1}} Δr_{j−s} r_{j−k} + Σ_{t<k<s} z_{k−t−1} Σ_{j≥max{s,T1+1}} Δr_{j−s} r_{j−k}

= Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ Σ_{s≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} + Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ Σ_{s>k>t} r_{ℓ+s−k} z_{k−t−1} ,
where at the second line we did the change of variables ℓ = j − s. Then by the
Cauchy–Schwarz inequality, we have

‖u_{st}‖² ≤ 2 ( Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ² ) · T₁ + 2 ( Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ² ) · T₂ ,   (6.7.9)

where

T₁ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖²   and   T₂ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s>k>t} r_{ℓ+s−k} z_{k−t−1} ‖² .
We can bound the contribution from Δr_ℓ² using equation (6.7.4), and it remains
to bound the terms T₁ and T₂. Using the tail bounds for ‖z_k‖ (equation (6.7.3)) and the
fact that |r_k| = |C A^k B| ≤ ‖A^k B‖ = ‖z_k‖, we have that
T₁ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖² ≤ Σ_{ℓ≥0} ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ )² .   (6.7.10)
We bound the inner sum on the RHS of (6.7.10) using the fact that ‖z_k‖² ≤
O(n α^{2k−2n}/τ₁²) and obtain that

Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ ≤ Σ_{s≤k≤ℓ+s} O(n α^{(ℓ+s−t−1)−2n}/τ₁²) ≤ O(ℓ n α^{(ℓ+s−t−1)−2n}/τ₁²) .   (6.7.11)
Note that equation (6.7.11) is particularly effective when ℓ > Λ. When ℓ ≤ Λ, we can
refine the bound using equation (6.7.3) and obtain that

Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ ≤ ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}|² )^{1/2} ( Σ_{s≤k≤ℓ+s} ‖z_{k−t−1}‖² )^{1/2}
 ≤ O(√n/τ₁) · O(√n/τ₁) = O(n/τ₁²) .   (6.7.12)
Plugging equations (6.7.12) and (6.7.11) into equation (6.7.10), we have that

Σ_{ℓ≥0} ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ )² ≤ Σ_{Λ≥ℓ≥0} O(n²/τ₁⁴) + Σ_{ℓ>Λ} O(ℓ² n² α^{2(ℓ+s−t−1)−4n}/τ₁⁴)
 ≤ O(n²Λ/τ₁⁴) + O(n²/τ₁⁴) = O(n²Λ/τ₁⁴) .   (6.7.13)
For the second term in equation (6.7.9), we bound similarly,

T₂ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s>k>t} r_{ℓ+s−k} z_{k−t−1} ‖² ≤ O(n²Λ/τ₁⁴) .   (6.7.14)
Therefore, using the bounds for T₁ and T₂ we obtain that

‖u_{st}‖² ≤ O(n³Λ/τ₁⁶) .   (6.7.15)
Case 2: When s − t > Λ, we tighten equation (6.7.13) by observing that

T₁ ≤ Σ_{ℓ≥0} ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ )² ≤ α^{2(s−t−1)−4n} Σ_{ℓ≥0} O(ℓ² n² α^{2ℓ}/τ₁⁴)
 ≤ α^{s−t−1} · O(n²/(τ₁⁴ (1 − α)³)) ,   (6.7.16)

where we used equation (6.7.11). Similarly we can prove that

T₂ ≤ α^{s−t−1} · O(n²/(τ₁⁴ (1 − α)³)) .
Therefore, we have that when s − t ≥ Λ,

‖u_{st}‖² ≤ O(n³/((1 − α)³ τ₁⁶)) · α^{s−t−1} .   (6.7.17)
Case 3: When −Λ ≤ s − t ≤ 0, we can rewrite u_{st} and use the Cauchy–Schwarz
inequality to obtain

u_{st} = Σ_{T≥k≥t+1} z_{k−t−1} Σ_{j≥max{k,T1+1}} Δr_{j−s} r_{j−k} = Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ Σ_{t+1≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ,

and

‖u_{st}‖² ≤ ( Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ² ) Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{t+1≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖² .
Using almost the same arguments as in equations (6.7.11) and (6.7.12), we have that

Σ_{t+1≤k≤ℓ+s} |r_{ℓ+s−k}| · ‖z_{k−t−1}‖ ≤ O(ℓ n α^{(ℓ+s−t−1)−2n}/τ₁²)

and

Σ_{t+1≤k≤ℓ+s} |r_{ℓ+s−k}| · ‖z_{k−t−1}‖ ≤ O(√n/τ₁) · O(√n/τ₁) = O(n/τ₁²) .
Then using the same type of argument as in equation (6.7.13), we have that

Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{t+1≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖² ≤ O(n²Λ/τ₁⁴) + O(n²/τ₁⁴) = O(n²Λ/τ₁⁴) .

It follows that in this case ‖u_{st}‖ satisfies the same bound as in (6.7.15).
Case 4: When s − t ≤ −Λ, we use a different simplification of u_{st} from the above. First
of all, it follows from (6.7.7) that

‖u_{st}‖ ≤ Σ_{k=2}^{T} Σ_{j≥max{k,s,T1+1}} ‖Δr_{j−s} r_{j−k} z_{k−t−1}‖ 1(k ≥ t+1)   (6.7.18)
 ≤ Σ_{k≥t+1} ‖z_{k−t−1}‖ Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| .
Since j − s ≥ k − s > 4n, it follows that

Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| ≤ Σ_{j≥max{k,T1+1}} O(√n/τ₁ · α^{j−s−n}) · O(√n/τ₁ · α^{j−k−n})
 ≤ O(n/(τ₁² (1 − α)) · α^{k−s−n}) .
Then we have that

‖u_{st}‖² ≤ ( Σ_{k≥t+1} ‖z_{k−t−1}‖ Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| )²
 ≤ ( Σ_{k≥t+1} ‖z_{k−t−1}‖² ) Σ_{k≥t+1} ( Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| )²
 ≤ O(n/τ₁²) · O(n²/(τ₁⁴ (1 − α)³) α^{t−s}) = O(n³/(τ₁⁶ (1 − α)³) α^{t−s}) .
Therefore, using the bound for ‖u_{st}‖² obtained in the four cases above and taking
the sum over s, t, we obtain that

Σ_{1≤s,t≤T} ‖u_{st}‖² ≤ Σ_{s,t∈[T]: |s−t|≤Λ} O(n³Λ/τ₁⁶) + Σ_{s,t: |s−t|≥Λ} O(n³/(τ₁⁶ (1 − α)³)) α^{|t−s|−1}
 ≤ O(T n³Λ²/τ₁⁶) + O(n³/τ₁⁶) = O(T n³Λ²/τ₁⁶) .   (6.7.19)
We have finished the bounds for ‖u_{st}‖ and now we turn to bounding ‖u′_{st}‖². Using the
formula for u′_{st} (equation (6.7.8)), we have that for t ≤ s + 1, u′_{st} = 0. For s + Λ ≥ t ≥
s + 2, we have by the Cauchy–Schwarz inequality,

‖u′_{st}‖ ≤ ( Σ_{s+1≤k≤t} ‖z_{k−1−s}‖² )^{1/2} ( Σ_{s+1≤k≤t} |r_{t−k}|² )^{1/2} ≤ O(n/τ₁²) .
On the other hand, for t > s + Λ, by the bound |r_k|² ≤ ‖z_k‖² ≤ O(n α^{2k−2n}/τ₁²),
we have

‖u′_{st}‖ ≤ Σ_{s+1≤k≤t−1} ‖z_{k−1−s}‖ · |r_{t−k}| ≤ Σ_{s+1≤k≤t−1} n α^{t−s−1}/τ₁² ≤ O(n (t − s) α^{t−s−1}/τ₁²) .
Therefore, taking the sum over s, t, similarly to equation (6.7.19),

Σ_{s,t∈[T]} ‖u′_{st}‖² ≤ O(T n²Λ/τ₁⁴) .   (6.7.20)
Then using equation (6.7.2) together with equations (6.7.19) and (6.7.20), we obtain that

V[(T − T1) G_A] ≤ O( T n³Λ²/τ₁⁶ + σ² T n²Λ/τ₁⁴ ) .

Hence, it follows that

V[G_A] ≤ 1/(T − T1)² · V[(T − T1) G_A] ≤ O( n³Λ²/τ₁⁶ + σ² n²Λ/τ₁⁴ ) / T .
We can prove the bound for GC similarly.
Claim 6.7.2. Let x_1, . . . , x_T be independent random variables with mean 0, variance 1,
and 4-th moment bounded by O(1), and let u_{ij} be vectors for i, j ∈ [T]. Moreover,
let ξ_1, . . . , ξ_T be independent random variables with mean 0 and variance σ², and let u′_{ij}
be vectors for i, j ∈ [T]. Then,
V[ Σ_{i,j} x_i x_j u_{ij} + Σ_{i,j} x_i ξ_j u′_{ij} ] ≤ O(1) Σ_{i,j} ‖u_{ij}‖² + O(σ²) Σ_{i,j} ‖u′_{ij}‖² .
Proof. Note that the two sums in the target are uncorrelated, and the second has mean 0;
therefore it suffices to bound the variance of each sum individually. The proof follows from
the linearity of expectation and the independence of the x_i's:
E[ ‖ Σ_{i,j} x_i x_j u_{ij} ‖² ] = Σ_{i,j} Σ_{k,ℓ} E[ x_i x_j x_k x_ℓ u_{ij}^⊤ u_{kℓ} ]
 = Σ_i E[ u_{ii}^⊤ u_{ii} x_i⁴ ] + Σ_{i≠j} E[ u_{ii}^⊤ u_{jj} x_i² x_j² ] + Σ_{i,j} E[ x_i² x_j² ( u_{ij}^⊤ u_{ij} + u_{ij}^⊤ u_{ji} ) ]
 ≤ Σ_{i,j} u_{ii}^⊤ u_{jj} + O(1) Σ_{i,j} ‖u_{ij} + u_{ji}‖²
 = ‖ Σ_i u_{ii} ‖² + O(1) Σ_{i,j} ‖u_{ij}‖² ,
where at the second line we used the fact that for any monomial x^α with an odd degree
in one of the x_i's, E[x^α] = 0. Note that E[ Σ_{i,j} x_i x_j u_{ij} ] = Σ_i u_{ii}. Therefore,
V[ Σ_{i,j} x_i x_j u_{ij} ] = E[ ‖ Σ_{i,j} x_i x_j u_{ij} ‖² ] − ‖ E[ Σ_{i,j} x_i x_j u_{ij} ] ‖² ≤ O(1) Σ_{i,j} ‖u_{ij}‖² .   (6.7.21)
Similarly, we can control V[ Σ_{i,j} x_i ξ_j u′_{ij} ] by O(σ²) Σ_{i,j} ‖u′_{ij}‖².
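As a quick Monte Carlo sanity check of Claim 6.7.2 (not part of the proof; the O(1) and O(σ²) constants are taken to be 4 here, which suffices for the Rademacher variables used in this toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim, trials, sigma = 6, 3, 20000, 0.5
u = rng.normal(size=(T, T, dim))    # vectors u_ij
up = rng.normal(size=(T, T, dim))   # vectors u'_ij

samples = np.zeros((trials, dim))
for m in range(trials):
    x = rng.choice([-1.0, 1.0], size=T)           # mean 0, variance 1, bounded 4th moment
    xi = sigma * rng.choice([-1.0, 1.0], size=T)  # mean 0, variance sigma^2
    samples[m] = (np.einsum('i,j,ijd->d', x, x, u)
                  + np.einsum('i,j,ijd->d', x, xi, up))

var_est = samples.var(axis=0).sum()   # total variance of the vector-valued sum
bound = 4 * (u ** 2).sum() + 4 * sigma ** 2 * (up ** 2).sum()
print(var_est <= bound)  # prints True
```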
6.8 Back-propagation Implementation
In this section we give a detailed implementation of back-propagation to compute
the gradient of the loss function. The algorithm is stated for the general MIMO case with
the parameterization (6.6.3). To obtain the SISO sub-case, simply take ℓ_in = ℓ_out = 1.
Algorithm 9 Back-propagation
Parameters: â ∈ ℝⁿ, Ĉ ∈ ℝ^{ℓout×nℓin}, and D̂ ∈ ℝ^{ℓout×ℓin}. Let Â = MCC(â) = C(â) ⊗ Id_ℓin and B = e_n ⊗ Id_ℓin.
Input: samples ((x^(1), y^(1)), . . . , (x^(N), y^(N))) and projection set B_α.
for each sample (x^(j), y^(j)) = ((x_1, . . . , x_T), (y_1, . . . , y_T)) do
  Feed-forward pass:
    ĥ_0 = 0 ∈ ℝ^{nℓin}.
    for k = 1 to T: ĥ_k ← Â ĥ_{k−1} + B x_k and ŷ_k ← Ĉ ĥ_k + D̂ x_k. end for
  Back-propagation:
    Δĥ_{T+1} ← 0, G_A ← 0, G_C ← 0, G_D ← 0, T1 ← T/4.
    for k = T down to 1:
      If k > T1, Δŷ_k ← ŷ_k − y_k; otherwise Δŷ_k ← 0. Let Δĥ_k ← Ĉ^⊤ Δŷ_k + Â^⊤ Δĥ_{k+1}.
      Update G_C ← G_C + (1/(T−T1)) Δŷ_k ĥ_k^⊤, G_A ← G_A − (1/(T−T1)) B^⊤ Δĥ_k ĥ_{k−1}^⊤, and G_D ← G_D + (1/(T−T1)) Δŷ_k x_k^⊤.
    end for
  Gradient update: Â ← Â − η · G_A, Ĉ ← Ĉ − η · G_C, D̂ ← D̂ − η · G_D.
  Projection step: Obtain â from Â and set â ← Π_{B_α}(â), and Â = MCC(â).
end for
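The following is a minimal NumPy sketch of one feed-forward/back-propagation pass of Algorithm 9 in the SISO case (ℓ_in = ℓ_out = 1). The sign convention for g_a and the coefficient ordering in C(a) are guesses inferred from the text; G_C and G_D match the gradient of the (halved) partial loss:

```python
import numpy as np

def companion(a):
    """Companion matrix C(a) with last row (-a_n, ..., -a_1) (ordering assumed)."""
    n = len(a)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -a[::-1]
    return A

def forward_backward(a_hat, C_hat, D_hat, x, y, T1):
    """One feed-forward + back-propagation pass of Algorithm 9, SISO case."""
    n, T = len(a_hat), len(x)
    A = companion(a_hat)
    B = np.zeros(n); B[-1] = 1.0          # B = e_n
    h = np.zeros((T + 1, n))
    y_hat = np.zeros(T)
    for k in range(1, T + 1):             # feed-forward pass
        h[k] = A @ h[k - 1] + B * x[k - 1]
        y_hat[k - 1] = C_hat @ h[k] + D_hat * x[k - 1]
    g_a, g_C, g_D = np.zeros(n), np.zeros(n), 0.0
    dh = np.zeros(n)                      # Delta h_{k+1}
    scale = 1.0 / (T - T1)
    for k in range(T, 0, -1):             # back-propagation pass
        dy = y_hat[k - 1] - y[k - 1] if k > T1 else 0.0
        dh = C_hat * dy + A.T @ dh        # Delta h_k
        g_C += scale * dy * h[k]
        g_a -= scale * dh[-1] * h[k - 1]  # B^T Delta h_k = last coordinate; sign as in Algorithm 9
        g_D += scale * dy * x[k - 1]
    return g_a, g_C, g_D, y_hat
```

A finite-difference check on g_C and g_D against the partial loss is an easy sanity test of the backward pass.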
6.9 Projection to the Constraint Set
In order to have a fast projection algorithm to the convex set Bα, we consider a grid
GM of size M over the circle with radius α. We will show that M = Oτ (n) will be
enough to approximate the set Bα in the sense that projecting to the approximating
set suffices for the convergence.
Let B′_{α,τ0,τ1,τ2} = { â : p_â(z)/zⁿ ∈ C_{τ0,τ1,τ2}, ∀ z ∈ G_M } and B_{α,τ0,τ1,τ2} = { â : p_â(z)/zⁿ ∈ C_{τ0,τ1,τ2}, ∀ |z| = α }. Here C_{τ0,τ1,τ2} is defined the same as before, though we use the
subscripts to emphasize the dependency on the τ_i's,

C_{τ0,τ1,τ2} = { z : ℜz ≥ (1 + τ0)|ℑz| } ∩ { z : τ1 < ℜz < τ2 } .   (6.9.1)
We will first show that with M = O_τ(n), the set B′_{α,κ0,κ1,κ2} can be sandwiched
between the two sets B_{α,τ0,τ1,τ2} and B_{α,τ′0,τ′1,τ′2}.

Lemma 6.9.1. For any τ0 > τ′0, τ1 > τ′1, τ2 < τ′2, we have that for M = O_τ(n), there
exist κ0, κ1, κ2 that depend polynomially on the τ_i's and τ′_i's such that B_{α,τ0,τ1,τ2} ⊂ B′_{α,κ0,κ1,κ2} ⊂
B_{α,τ′0,τ′1,τ′2}.
Before proving the lemma, we demonstrate how to use it in our algorithm:
we pick τ′0 = τ0/2, τ′1 = τ1/2 and τ′2 = 2τ2, and find the κ_i's guaranteed by the lemma
above. Then we use B′_{α,κ0,κ1,κ2} as the projection set in the algorithm (instead of
B_{α,τ0,τ1,τ2}). First of all, the ground-truth solution Θ is in the set B′_{α,κ0,κ1,κ2}. Moreover,
since B′_{α,κ0,κ1,κ2} ⊂ B_{α,τ′0,τ′1,τ′2}, the iterates Θ̂ will remain in the
set B_{α,τ′0,τ′1,τ′2}, and therefore the quasi-convexity of the objective function still holds.⁷
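A minimal sketch of the membership test underlying the grid-based projection set B′ (helper names are hypothetical; the actual projection step solves a linear program over these constraints):

```python
import numpy as np

def in_cone(w, k0, k1, k2):
    """w in C_{k0,k1,k2} = {Re w >= (1+k0)|Im w|} ∩ {k1 < Re w < k2}, per (6.9.1)."""
    return w.real >= (1 + k0) * abs(w.imag) and k1 < w.real < k2

def in_grid_set(a, alpha, M, k0, k1, k2):
    """Check a in B'_{alpha,k0,k1,k2}: p_a(z)/z^n in the cone at all M grid points |z| = alpha."""
    n = len(a)
    for k in range(M):
        z = alpha * np.exp(2j * np.pi * k / M)
        p = z ** n + sum(a[i] * z ** (n - 1 - i) for i in range(n))  # p_a(z)
        if not in_cone(p / z ** n, k0, k1, k2):
            return False
    return True

print(in_grid_set(np.zeros(3), alpha=0.9, M=60, k0=0.1, k1=0.5, k2=2.0))  # a = 0: ratio is 1
```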
Note that the set B′_{α,κ0,κ1,κ2} consists of O(n) linear constraints, and therefore we
can use linear programming to solve the projection problem. Moreover, since the
points on the grid form a Fourier basis, the Fast Fourier Transform can
potentially be used to speed up the projection. Finally, we will prove Lemma 6.9.1. We
need S. Bernstein's inequality for polynomials.
Theorem 6.9.2 (Bernstein's inequality; see, for example, [152]). Let p(z) be any
polynomial of degree n with complex coefficients. Then,

sup_{|z|≤1} |p′(z)| ≤ n sup_{|z|≤1} |p(z)| .
We will use the following corollary of Bernstein’s inequality.
Corollary 6.9.3. Let p(z) be any polynomial of degree n with complex coefficients.
Then, for m = 20n,

sup_{|z|≤1} |p′(z)| ≤ 2n sup_{k∈[m]} |p(e^{2ikπ/m})| .
⁷With a slightly worse parameter, up to a constant factor, since the κ_i's differ from the τ_i's by constant factors.
Proof. For simplicity let τ = sup_{k∈[m]} |p(e^{2ikπ/m})|, and let τ′ = sup_{|z|≤1} |p(z)|.
If τ′ ≤ 2τ then we are done by Bernstein's inequality. Now let us assume that τ′ >
2τ. Suppose |p(z)| = τ′ for some z with |z| = 1. Then there exists k such that |z − e^{2πik/m}| ≤ 4/m and
|p(e^{2πik/m})| ≤ τ. Therefore by the Cauchy mean-value theorem there exists
ξ that lies between z and e^{2πik/m} such that |p′(ξ)| ≥ m(τ′ − τ)/4 ≥ 1.1nτ′, which
contradicts Bernstein's inequality.
Lemma 6.9.4. Suppose a polynomial p of degree n satisfies |p(w)| ≤ τ for every
w = αe^{2iπk/m}, k ∈ [m], for some m ≥ 20n. Then for every z with |z| = α there exists w =
αe^{2iπk/m} such that |p(z) − p(w)| ≤ O(nατ/m).
Proof. Let g(z) = p(αz), a polynomial of degree at most n; then
g′(z) = α p′(αz). Let w = αe^{2iπk/m} be such that |z − w| ≤ O(α/m). Then we have

|p(z) − p(w)| = |g(z/α) − g(w/α)| ≤ sup_{|x|≤1} |g′(x)| · (1/α) |z − w|   (by Cauchy's mean-value theorem)
 ≤ sup_{|x|≤1} |p′(x)| · |z − w| ≤ nτ |z − w|   (by Corollary 6.9.3)
 ≤ O(αnτ/m) .
Now we are ready to prove Lemma 6.9.1.
Proof of Lemma 6.9.1. We choose κ_i = ½(τ_i + τ′_i). The first inclusion is trivial; we
prove the second one. Consider â ∈ B′_{α,κ0,κ1,κ2}; we will show that â ∈
B_{α,τ′0,τ′1,τ′2}. Let q_â(z) = p_â(z⁻¹) zⁿ. By Lemma 6.9.4, for every z with |z| = 1/α, we
have that there exists w = α⁻¹ e^{2πik/M} for some integer k such that |q_â(z) − q_â(w)| ≤
O(τ2 n/(αM)). Therefore, letting M = cn for a sufficiently large constant c (which depends on the τ_i's),
we have that for every z with |z| = 1/α, q_â(z) ∈ C_{τ′0,τ′1,τ′2}. This completes the proof.
Part III
Interpreting Non-linear Models
and Their Non-convex Objective
Functions
Chapter 7
Understanding Word Embedding
Methods Using Generative Models
Semantic word embeddings represent the meaning of a word via a vector, and are cre-
ated by diverse methods, the learning of which often involves non-convex optimization
problems such as weighted matrix factorization or learning neural networks.
This chapter proposes a new generative model, a dynamic version of the log-linear
topic model of [127], under which we can explain the effectiveness of these diverse
methods. The methodological novelty is to use this generative model to compute
closed form expressions for word statistics. This provides a theoretical justification
for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter
choices. It also helps explain why low-dimensional semantic embeddings contain linear
algebraic structure that allows solution of word analogies, as shown by [122] and many
subsequent papers.
Experimental support is provided for the generative model assumptions, the most
important of which is that latent word vectors are fairly uniformly dispersed in space.
7.1 Introduction
Vector representations of words (word embeddings) try to capture relationships be-
tween words as distance or angle, and have many applications in computational
linguistics and machine learning. They are constructed by various models whose
unifying philosophy is that the meaning of a word is defined by “the company it
keeps” [66], namely, co-occurrence statistics. The simplest methods use word vectors
that explicitly represent co-occurrence statistics. Reweighting heuristics are known
to improve these methods, as is dimension reduction [56]. Some reweighting methods
are nonlinear, which include taking the square root of co-occurrence counts [147], or
the logarithm, or the related Pointwise Mutual Information (PMI) [50]. These are
collectively referred to as Vector Space Models, surveyed in [169].
Neural network language models [149, 150, 26, 53] propose another way to con-
struct embeddings: the word vector is simply the neural network’s internal represen-
tation for the word. This method is nonlinear and nonconvex. It was popularized
via word2vec, a family of energy-based models in [123, 125], followed by a matrix
factorization approach called GloVe [140]. The first paper also showed how to solve
analogies using linear algebra on word embeddings. Experiments and theory were
used to suggest that these newer methods are related to the older PMI-based models,
but with new hyperparameters and/or term reweighting methods [111].
But note that even the old PMI method is a bit mysterious. The simplest version
considers a symmetric matrix with each row/column indexed by a word. The entry
for (w,w′) is PMI(w,w′) = log[ p(w,w′) / (p(w)p(w′)) ], where p(w,w′) is the empirical probability
of words w,w′ appearing within a window of certain size in the corpus, and p(w)
is the marginal probability of w. (More complicated models could use asymmetric
matrices with columns corresponding to context words or phrases, and also involve
tensorization.) Then word vectors are obtained by low-rank SVD on this matrix, or
a related matrix with term reweightings. In particular, the PMI matrix is found to
be closely approximated by a low rank matrix: there exist word vectors in say 300
dimensions, which is much smaller than the number of words in the dictionary, such
that
〈vw, vw′〉 ≈ PMI(w,w′) (7.1.1)
where ≈ should be interpreted loosely.
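As a small illustration of the PMI-plus-SVD pipeline described above (the co-occurrence counts are made up; real pipelines use corpus statistics and term reweighting, and the factorization below ignores possible negative eigenvalues of the PMI matrix):

```python
import numpy as np

# Made-up symmetric co-occurrence counts for a 4-word vocabulary.
counts = np.array([[10., 6., 1., 1.],
                   [ 6., 8., 1., 1.],
                   [ 1., 1., 9., 5.],
                   [ 1., 1., 5., 7.]])
p_joint = counts / counts.sum()            # empirical p(w, w')
p_word = p_joint.sum(axis=1)               # marginal p(w)
pmi = np.log(p_joint / np.outer(p_word, p_word))

# Rank-2 truncated SVD of the PMI matrix; rows of V serve as word vectors.
U, S, Vt = np.linalg.svd(pmi)
approx = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]
V = U[:, :2] * np.sqrt(S[:2])
print(np.linalg.norm(pmi - approx))        # low-rank approximation error
```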
There appears to be no theoretical explanation for this empirical finding about
the approximate low rank of the PMI matrix. The current paper addresses this.
Specifically, we propose a probabilistic model of text generation that augments the
log-linear topic model of [127] with dynamics, in the form of a random walk over
a latent discourse space. The chief methodological contribution is using the model
priors to analytically derive a closed-form expression that directly explains (7.1.1);
see Theorem 7.2.2 in Section 7.2. Section 7.3 builds on this insight to give a rigorous
justification for models such as word2vec and GloVe, including the hyperparameter
choices for the latter. The insight also leads to a mathematical explanation for why
these word embeddings allow analogies to be solved using linear algebra; see Sec-
tion 7.4. Section 7.5 shows good empirical fit to this model's assumptions and predic-
tions, including the surprising one that word vectors are pretty uniformly distributed
(isotropic) in space.
7.1.1 Related Work
Latent variable probabilistic models of language have been used for word embeddings
before, including Latent Dirichlet Allocation (LDA) and its more complicated variants
(see the survey [?]), and some neurally inspired nonlinear models [127, 117]. In fact,
LDA evolved out of efforts in the 1990s to provide a generative model that “explains”
the success of older vector space methods like Latent Semantic Indexing [137, 86].
However, none of these earlier generative models has been linked to PMI models.
[111] tried to relate word2vec to PMI models. They showed that if there were no
dimension constraint in word2vec, specifically, the “skip-gram with negative sampling
(SGNS)” version of the model, then its solutions would satisfy (7.1.1), provided the
right hand side were replaced by PMI(w,w′) − β for some scalar β. However, skip-
gram is a discriminative model (due to the use of negative sampling), not generative.
Furthermore, their argument only applies to very high-dimensional word embeddings,
and thus does not address low-dimensional embeddings, which have superior quality
in applications.
[76] focuses on issues similar to our paper. They model text generation as a
random walk on words, which are assumed to be embedded as vectors in a geometric
space. Given that the last word produced was w, the probability that the next word
is w′ is assumed to be given by h(|vw−vw′ |2) for a suitable function h, and this model
leads to an explanation of (7.1.1). By contrast our random walk involves a latent
discourse vector, which has a clearer semantic interpretation and has proven useful in
subsequent work, e.g. understanding structure of word embeddings for polysemous
words [17]. Also our work clarifies some weighting and bias terms in the training
objectives of previous methods (Section 7.3) and also the phenomenon discussed in
the next paragraph.
Researchers have tried to understand why vectors obtained from the highly non-
linear word2vec models exhibit linear structures [110, 140]. Specifically, for analogies
like “man:woman::king :??,” queen happens to be the word whose vector vqueen is the
most similar to the vector vking − vman + vwoman. This suggests that simple semantic
relationships, such as masculine vs feminine tested in the above example, correspond
approximately to a single direction in space, a phenomenon we will henceforth refer
to as relations=lines.
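The analogy-solving recipe just described can be sketched as follows (the embeddings here are hand-made toy vectors, not trained ones):

```python
import numpy as np

# Hand-made toy embeddings; in practice these come from word2vec/GloVe training.
vecs = {"man":   np.array([ 1.0, 0.1, 0.0]),
        "woman": np.array([ 1.0, 0.9, 0.0]),
        "king":  np.array([ 0.2, 0.1, 1.0]),
        "queen": np.array([ 0.2, 0.9, 1.0]),
        "apple": np.array([-1.0, 0.0, 0.3])}

def solve_analogy(a, b, c, vecs):
    """a:b :: c:? -- argmax of cosine similarity to v_b - v_a + v_c, excluding a, b, c."""
    target = vecs[b] - vecs[a] + vecs[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    cands = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(cands, key=lambda w: cos(cands[w], target))

print(solve_analogy("man", "woman", "king", vecs))  # prints queen
```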
Section 7.4 surveys earlier attempts to explain this phenomenon and their short-
coming, namely, that they ignore the large approximation error in relationships like
(7.1.1). This error appears larger than the difference between the best solution and
the second best (incorrect) solution in analogy solving, so that this error could in
principle lead to a complete failure in analogy solving. In our explanation, the low
dimensionality of the word vectors plays a key role. This can also be seen as a the-
oretical explanation of the old observation that dimension reduction improves the
quality of word embeddings for various tasks. The intuitive explanation often given
—that smaller models generalize better—turns out to be fallacious, since the training
method for creating embeddings makes no reference to analogy solving. Thus there is
no a priori reason why low-dimensional model parameters (i.e., lower model capacity)
should lead to better performance in analogy solving, just as there is no reason they
are better at some other unrelated task like predicting the weather.
7.1.2 Benefits of Generative Approaches
In addition to giving some form of “unification” of existing methods, our generative
model also brings more interpretability to word embeddings beyond traditional cosine
similarity and even analogy solving. For example, it led to an understanding of how
the different senses of a polysemous word (e.g., bank) reside in linear superposition
within the word embedding [17]. Such insight into embeddings may prove useful in
the numerous settings in NLP and neuroscience where they are used.
Another new explanatory feature of our model is that low dimensionality of word
embeddings plays a key theoretical role —unlike in previous papers where the model
is agnostic about the dimension of the embeddings, and the superiority of low-
dimensional embeddings is an empirical finding (starting with [56]). Specifically,
our theoretical analysis makes the key assumption that the set of all word vectors
(which are latent variables of the generative model) are spatially isotropic, which
means that they have no preferred direction in space. Having n vectors be isotropic
in d dimensions requires d ≪ n. This isotropy is needed in the calculations (i.e.,
multidimensional integral) that yield (7.1.1). It also holds empirically for our word
vectors, as shown in Section 7.5.
The isotropy of low-dimensional word vectors also plays a key role in our ex-
planation of the relations=lines phenomenon (Section 7.4). The isotropy has a
“purification” effect that mitigates the effect of the (rather large) approximation error
in the PMI models.
7.2 Generative Model and Its Properties
The model treats corpus generation as a dynamic process, where the t-th word is
produced at step t. The process is driven by the random walk of a discourse vector
c_t ∈ ℝᵈ. Its coordinates represent what is being talked about.¹ Each word has a
(time-invariant) latent vector v_w ∈ ℝᵈ that captures its correlations with the discourse
vector. We model this bias with a log-linear word production model:
Pr[w emitted at time t | ct] ∝ exp(〈ct, vw〉). (7.2.1)
The discourse vector ct does a slow random walk (meaning that ct+1 is obtained
from ct by adding a small random displacement vector), so that nearby words are
generated under similar discourses. We are interested in the probabilities that word
pairs co-occur near each other, so occasional big jumps in the random walk are allowed
because they have negligible effect on these probabilities.
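A minimal simulation of this generative process (the step size and the renormalization of c_t back to the unit sphere are simplifications of the model's random walk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, eps = 1000, 50, 200, 0.1
word_vecs = rng.normal(size=(n, d))            # latent word vectors

c = rng.normal(size=d); c /= np.linalg.norm(c) # discourse vector on the unit sphere
corpus = []
for t in range(T):
    logits = word_vecs @ c
    p = np.exp(logits - logits.max()); p /= p.sum()    # Pr[w | c_t] ∝ exp(<c_t, v_w>)
    corpus.append(int(rng.choice(n, p=p)))
    step = rng.normal(size=d)
    step *= eps / (np.sqrt(d) * np.linalg.norm(step))  # slow step of l2 norm eps/sqrt(d)
    c = (c + step) / np.linalg.norm(c + step)          # stay on the sphere (simplification)
print(len(corpus))  # prints 200
```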
A similar log-linear model appears in [127] but without the random walk. The
linear chain CRF of [52] is more general. The dynamic topic model of [34] utilizes topic
¹This is a different interpretation of the term "discourse" compared to some other settings in computational linguistics.
dynamics, but with a linear word production model. [25] have proposed a dynamic
model for text using Kalman Filters, where the sequence of words is generated from
Gaussian linear dynamical systems, rather than the log-linear model in our case.
The novelty here over such past works is a theoretical analysis in the method-of-
moments tradition [88, 51]. Assuming a prior on the random walk we analytically
integrate out the hidden random variables and compute a simple closed form ex-
pression that approximately connects the model parameters to the observable joint
probabilities (see Theorem 7.2.2). This is reminiscent of analysis of similar random
walk models in finance [31].
Model details. Let n denote the number of words and d denote the dimension of
the discourse space, where 1 ≤ d ≤ n. Inspecting (7.2.1) suggests word vectors need
to have varying lengths, to fit the empirical finding that word probabilities satisfy
a power law. Furthermore, we will assume that in the bulk, the word vectors are
distributed uniformly in space, earlier referred to as isotropy. This can be quantified
as a prior in the Bayesian tradition. More precisely, the ensemble of word vectors
consists of i.i.d. draws generated by v = s · v̂, where v̂ is drawn from the spherical Gaussian
distribution, and s is a scalar random variable. We assume s is a random scalar with
expectation τ = Θ(1) and s is always upper bounded by κ, which is another constant.
Here τ governs the expected magnitude of 〈v, ct〉, and it is particularly important to
choose it to be Θ(1) so that the distribution Pr[w|ct] ∝ exp(〈vw, ct〉) is interesting.2
Moreover, the dynamic range of word probabilities will roughly equal exp(κ2), so one
should think of κ as an absolute constant like 5. These details about s are important
for realistic modeling but not too important in our analysis. (Furthermore, readers
uncomfortable with this simplistic Bayesian prior should look at Section 7.2.1 below.)
Finally, we clarify the nature of the random walk. We assume that the stationary
distribution of the random walk is uniform over the unit sphere, denoted by C. The
2A larger τ will make Pr[w|ct] too peaked and a smaller one will make it too uniform.
transition kernel of the random walk can be in any form so long as at each step the
movement of the discourse vector is at most ε2/√d in `2 norm.3 This is still fast
enough to let the walk mix quickly in the space.
The following lemma (whose proof appears in Section 7.6.1) is central to the analy-
sis. It says that under the Bayesian prior, the partition function Z_c = Σ_w exp(⟨v_w, c⟩),
which is the implied normalization in equation (7.2.1), is close to some constant Z for
most of the discourses c. This can be seen as a plausible theoretical explanation of
a phenomenon called self-normalization in log-linear models: ignoring the partition
function or treating it as a constant (which greatly simplifies training) is known to
often give good results. This has also been studied in [9].
Lemma 7.2.1 (Concentration of partition functions). If the word vectors satisfy the
Bayesian prior described in the model details, then

Pr_{c∼C} [ (1 − ε_z) Z ≤ Z_c ≤ (1 + ε_z) Z ] ≥ 1 − δ ,   (7.2.2)

for ε_z = O(1/√n), and δ = exp(−Ω(log² n)).
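A quick numerical illustration of this concentration (standard Gaussian word vectors, so ⟨v_w, c⟩ ~ N(0, 1) for unit-norm c; this is a simplification of the prior in the model details):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 50
word_vecs = rng.normal(size=(n, d))   # so <v_w, c> ~ N(0, 1) for unit-norm c

zs = []
for _ in range(20):
    c = rng.normal(size=d); c /= np.linalg.norm(c)   # c roughly uniform on the sphere
    zs.append(np.exp(word_vecs @ c).sum())           # Z_c = sum_w exp(<v_w, c>)
zs = np.array(zs)
print(zs.std() / zs.mean())   # small relative spread, consistent with (7.2.2)
```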
The concentration of the partition functions then leads to our main theorem
(the proof is in the Section 7.6). The theorem gives simple closed form approxi-
mations for p(w), the probability of word w in the corpus, and p(w,w′), the prob-
ability that two words w,w′ occur next to each other. The theorem states the
result for the window size q = 2, but the same analysis works for pairs that ap-
pear in a small window, say of size 10, as stated in Corollary 7.2.3. Recall that
PMI(w,w′) = log[p(w,w′)/(p(w)p(w′))].
³More precisely, the proof extends to any symmetric product stationary distribution C with sub-Gaussian coordinates satisfying E_c[‖c‖²] = 1, and steps such that for all c_t, E_{p(c_{t+1}|c_t)}[exp(κ√d ‖c_{t+1} − c_t‖)] ≤ 1 + ε₂ for some small ε₂.
Theorem 7.2.2. Suppose the word vectors satisfy the inequality (7.2.2), and window
size q = 2. Then,

log p(w,w′) = ‖v_w + v_{w′}‖² / (2d) − 2 log Z ± ε ,   (7.2.3)

log p(w) = ‖v_w‖² / (2d) − log Z ± ε ,   (7.2.4)

for ε = O(ε_z) + O(1/d) + O(ε₂). Jointly these imply:

PMI(w,w′) = ⟨v_w, v_{w′}⟩ / d ± O(ε) .   (7.2.5)
Remarks 1. Since the word vectors have ℓ₂ norm of the order of √d, for two typical
word vectors v_w, v_{w′}, ‖v_w + v_{w′}‖² is of the order of Θ(d). Therefore the noise level
ε is very small compared to the leading term ‖v_w + v_{w′}‖² / (2d). For PMI, however, the
noise level O(ε) could be comparable to the leading term, and empirically we also
find higher error here.
Remarks 2. Variants of the expression for the joint probability in (7.2.3) had been
hypothesized based upon empirical evidence in [123], [70], and [119].
Remarks 3. Theorem 7.2.2 directly leads to the extension to a general window size
q as follows:

Corollary 7.2.3. Let p_q(w,w′) be the co-occurrence probability in windows of size q,
and PMI_q(w,w′) be the corresponding PMI value. Then

log p_q(w,w′) = ‖v_w + v_{w′}‖² / (2d) − 2 log Z + γ ± ε ,

PMI_q(w,w′) = ⟨v_w, v_{w′}⟩ / d + γ ± O(ε) ,

where γ = log( q(q−1)/2 ).
It is quite easy to see that Theorem 7.2.2 implies Corollary 7.2.3: when the
window size is q, the pair w, w′ could appear in any of the q(q−1)/2 pairs of positions within the window,
and the joint probability of w, w′ is roughly the same for any positions because the
discourse vector changes slowly. (Of course, the error term gets worse as we consider
larger window sizes, although for any constant size, the statement of the theorem is
correct.) This is also consistent with the shift β for fitting PMI in [111], which showed
that without dimension constraints, the solution to skip-gram with negative sampling
satisfies PMI (w,w′) − β = 〈vw, vw′〉 for a constant β that is related to the negative
sampling in the optimization. Our result justifies via a generative model why this
should be satisfied even for low dimensional word vectors.
Proof sketches
Here we provide the proof sketches, while the complete proof can be found in the
Section 7.6.
Proof sketch of Theorem 7.2.2 Let w and w′ be two arbitrary words. Let c
and c′ denote two consecutive context vectors, where c ∼ C and c′|c is defined by the
Markov kernel p(c′ | c).
We start by using the law of total expectation, integrating out the hidden variables
c and c′:
p(w,w′) = E_{c,c′}[ Pr[w,w′ | c, c′] ] = E_{c,c′}[ p(w|c) p(w′|c′) ]
        = E_{c,c′}[ (exp(⟨v_w, c⟩)/Z_c) · (exp(⟨v_{w′}, c′⟩)/Z_{c′}) ] .   (7.2.6)
An expectation like (7.2.6) would normally be difficult to analyze because of the
partition functions. However, we can assume the inequality (7.2.2), that is, that the
partition function typically does not vary much for most context vectors c. Let F be
the event that both Z_c and Z_{c′} are within (1 ± ε_z)Z. Then by (7.2.2) and the union
bound, event F happens with probability at least 1 − 2 exp(−Ω(log² n)). We will split
the right-hand side (RHS) of (7.2.6) into two parts according to whether F happens
or not.
RHS of (7.2.6) = Ec,c′ [ (exp(〈vw, c〉)/Zc) · (exp(〈vw′, c′〉)/Zc′) · 1F ]        (:= T1)
              + Ec,c′ [ (exp(〈vw, c〉)/Zc) · (exp(〈vw′, c′〉)/Zc′) · 1F̄ ]        (:= T2)        (7.2.7)
where F̄ denotes the complement of event F, and 1F and 1F̄ denote the indicator functions of F and F̄, respectively. When F happens, we can replace Zc by Z at the cost of a 1 ± εz factor; the first term on the RHS of (7.2.7) then equals

T1 = ((1 ± O(εz))/Z²) Ec,c′ [exp(〈vw, c〉) exp(〈vw′, c′〉) 1F]        (7.2.8)
On the other hand, we can use E[1F̄] = Pr[F̄] ≤ exp(−Ω(log² n)) to show that the second term on the RHS of (7.2.7) is negligible:

|T2| = exp(−Ω(log^{1.8} n)).        (7.2.9)
This claim must be handled somewhat carefully, since the RHS does not depend on d at all. Briefly, the reason it holds is as follows. In the regime when d is small (√d = o(log² n)), any word vector vw and discourse vector c satisfy exp(〈vw, c〉) ≤ exp(‖vw‖) = exp(O(√d)), and since E[1F̄] = exp(−Ω(log² n)), the claim follows directly. In the regime when d is large (√d = Ω(log² n)), we can use concentration inequalities to show that, except with a small probability exp(−Ω(d)) = exp(−Ω(log² n)), a uniform sample from the sphere behaves equivalently to sampling all of the coordinates from a Gaussian distribution with mean 0 and variance 1/d, in which case the claim is not too difficult to show using Gaussian tail bounds.
Therefore it suffices to consider only (7.2.8). Our model assumptions state that c and c′ cannot be too different. We leverage that by rewriting (7.2.8) slightly:

T1 = ((1 ± O(εz))/Z²) Ec [ exp(〈vw, c〉) Ec′|c[exp(〈vw′, c′〉)] ]
   = ((1 ± O(εz))/Z²) Ec [ exp(〈vw, c〉) A(c) ]        (7.2.10)
where A(c) := Ec′|c[exp(〈vw′, c′〉)]. We claim that A(c) = (1 ± O(ε2)) exp(〈vw′, c〉). Doing some algebraic manipulation,

A(c) = exp(〈vw′, c〉) Ec′|c[exp(〈vw′, c′ − c〉)] .
Furthermore, by our model assumptions, ‖c − c′‖ ≤ ε2/√d. So

|〈vw′, c − c′〉| ≤ ‖vw′‖ · ‖c − c′‖ = O(ε2),

and thus A(c) = (1 ± O(ε2)) exp(〈vw′, c〉). Plugging this simplification of A(c) into (7.2.10),
T1 = ((1 ± O(εz))/Z²) E[exp(〈vw + vw′, c〉)].        (7.2.11)
Since c has the uniform distribution over the sphere, the random variable 〈vw + vw′, c〉 has a distribution quite similar to the Gaussian N(0, ‖vw + vw′‖²/d), especially when d is relatively large. Observe that E[exp(X)] has a closed form for a Gaussian random variable X ∼ N(0, σ²):

E[exp(X)] = ∫ (1/(σ√(2π))) exp(−x²/(2σ²)) exp(x) dx = exp(σ²/2) .        (7.2.12)
Bounding the difference between 〈vw + vw′, c〉 and a Gaussian random variable, we can show that for ε = O(1/d),

E[exp(〈vw + vw′, c〉)] = (1 ± ε) exp(‖vw + vw′‖²/(2d)).        (7.2.13)
Therefore, the series of simplifications and approximations above (concretely, combining equations (7.2.6), (7.2.7), (7.2.9), (7.2.11), and (7.2.13)) leads to the desired bound on log p(w,w′) for the case when the window size is q = 2. The bound on log p(w) can be shown similarly.
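The Gaussian moment calculation above can be sanity-checked numerically. The following sketch (assuming NumPy; the dimension and norm are illustrative, not those of the actual experiments) samples discourse vectors uniformly from the sphere and compares the empirical value of E[exp(〈v, c〉)] against the prediction exp(‖v‖²/(2d)) of (7.2.13):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                        # embedding dimension (illustrative choice)
v = rng.normal(size=d)
v *= 4.0 / np.linalg.norm(v)   # plays the role of v_w + v_w', moderate norm

# Sample discourse vectors c uniformly from the unit sphere in R^d.
n_samples = 50_000
c = rng.normal(size=(n_samples, d))
c /= np.linalg.norm(c, axis=1, keepdims=True)

# Compare E[exp(<v, c>)] with the Gaussian-based prediction exp(||v||^2/(2d)).
empirical = np.exp(c @ v).mean()
predicted = np.exp(np.linalg.norm(v) ** 2 / (2 * d))
rel_err = abs(empirical - predicted) / predicted
```

With these sizes the relative error is well below the O(1/d) term in the theorem, consistent with the sphere behaving like a Gaussian with coordinate variance 1/d.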
Proof sketch of Lemma 7.2.1 Note that for fixed c, when the word vectors have the Gaussian priors assumed in our model, Zc = ∑w exp(〈vw, c〉) is a sum of independent random variables.
We first claim that, using proper concentration-of-measure tools, it can be shown that the variance of Zc is relatively small compared to its mean Evw[Zc], and thus Zc concentrates around its mean. Note that this is quite non-trivial: the random variable exp(〈vw, c〉) is neither bounded nor sub-Gaussian/sub-exponential, since its tail is approximately inverse poly-logarithmic instead of inverse exponential. In fact, the same concentration phenomenon does not happen for w: the occurrence probability of word w is not necessarily concentrated, because the ℓ2 norm of vw can vary a lot in our model, which allows the frequencies of the words to have a large dynamic range.
So now it suffices to show that the values Evw[Zc] for different c are close to each other. Using the fact that the word vector directions have a Gaussian distribution, Evw[Zc] turns out to depend only on the norm of c (which is equal to 1). More precisely,

Evw[Zc] = f(‖c‖²) = f(1),        (7.2.14)
where f is defined as f(α) = n Es[exp(s²α/2)], and s has the same distribution as the norms of the word vectors. We sketch the proof of this. In our model, vw = sw · v̂w, where v̂w is a Gaussian vector with identity covariance I. Then

Evw[Zc] = n Evw[exp(〈vw, c〉)] = n Esw[ Ev̂w|sw[exp(〈vw, c〉) | sw] ],

where the second equality is just an application of the law of total expectation, picking the norm of the (random) vector vw first, followed by its direction. Conditioned on sw, 〈vw, c〉 is a Gaussian random variable with variance sw²‖c‖², and therefore, using a calculation similar to (7.2.12), we have

Ev̂w|sw[exp(〈vw, c〉) | sw] = exp(sw²‖c‖²/2) .

Hence, Evw[Zc] = n Es[exp(s²‖c‖²/2)], as needed.
7.2.1 Weakening the Model Assumptions
For readers uncomfortable with Bayesian priors, we can replace our assumptions with
concrete properties of word vectors that are empirically verifiable (Section 7.5.1) for
our final word vectors, and in fact also for word vectors computed using other recent
methods.
The word meanings are assumed to be represented by some “ground truth” vectors,
which the experimenter is trying to recover. These ground truth vectors are assumed
to be spatially isotropic in the bulk, in the following two specific ways: (i) For almost
all unit vectors c, the sum ∑w exp(〈vw, c〉) is close to a constant Z; (ii) Singular values
of the matrix of word vectors satisfy properties similar to those of random matrices,
as formalized in the paragraph before Theorem 7.4.1. Our Bayesian prior on the
word vectors happens to imply that these two conditions hold with high probability.
But the conditions may hold even if the prior doesn’t hold. Furthermore, they are
compatible with all sorts of local structure among word vectors such as existence of
clusterings, which would be absent in truly random vectors drawn from our prior.
7.3 Training objective and relationship to other
models
To get a training objective out of Theorem 7.2.2, we reason as follows. Let Xw,w′
be the number of times words w and w′ co-occur within the same window in the
corpus. The probability p(w,w′) of such a co-occurrence at any particular time is
given by (7.2.3). Successive samples from a random walk are not independent, but if the random walk mixes fairly quickly (the mixing time is related to the logarithm of the vocabulary size), then the distribution of the Xw,w′'s is very close to a multinomial distribution Mul(L, {p(w,w′)}), where L = ∑w,w′ Xw,w′ is the total number of word pairs.
Assuming this approximation, we show below that the maximum likelihood values for the word vectors correspond to the following optimization:

min_{{vw},C} ∑w,w′ Xw,w′ ( log(Xw,w′) − ‖vw + vw′‖² − C )²

As is usual, empirical performance is improved by weighting down very frequent word pairs, possibly because very frequent words such as "the" do not fit our model. This is done by replacing the weight Xw,w′ with its truncation min{Xw,w′, Xmax}, where Xmax is a constant such as 100. We call this objective with the truncated weights SN (Squared Norm).
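As an illustration, the truncated-weight SN objective can be written out directly. This is a minimal sketch in plain Python (the dictionary-based data layout and names are ours for readability, not from the implementation used in Section 7.5):

```python
import math

def sn_objective(counts, vecs, C, x_max=100.0):
    """SN (Squared Norm) objective with truncated co-occurrence weights.

    counts : dict mapping a word pair (w, wp) to its co-occurrence count X_{w,w'}
    vecs   : dict mapping a word to its vector (list of floats)
    C      : scalar offset, playing the role of 2 log Z in the model
    """
    total = 0.0
    for (w, wp), x in counts.items():
        weight = min(x, x_max)                  # truncated weight min{X, X_max}
        s = [a + b for a, b in zip(vecs[w], vecs[wp])]
        sq_norm = sum(t * t for t in s)         # ||v_w + v_w'||^2
        resid = math.log(x) - sq_norm - C
        total += weight * resid * resid
    return total
```

In practice the sum runs only over observed pairs, and the objective is minimized over the vectors and C with AdaGrad, as described in Section 7.5.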
We now give its derivation. Maximizing the likelihood of the Xw,w′'s is equivalent to maximizing

ℓ = log ∏(w,w′) p(w,w′)^Xw,w′ .

Denote by ∆w,w′ the logarithm of the ratio between the expected count and the empirical count:

∆w,w′ = log( L p(w,w′) / Xw,w′ ).        (7.3.1)

Then, with some calculation, we obtain the following, where c is independent of the empirical observations Xw,w′:

ℓ = c + ∑(w,w′) Xw,w′ ∆w,w′        (7.3.2)
On the other hand, using e^x ≈ 1 + x + x²/2 when x is small,⁴ we have

L = ∑(w,w′) L p(w,w′) = ∑(w,w′) Xw,w′ e^∆w,w′ ≈ ∑(w,w′) Xw,w′ (1 + ∆w,w′ + ∆w,w′²/2).

⁴This Taylor series approximation has an error of the order of x³, but ignoring it can be theoretically justified as follows. For a large Xw,w′, its value approaches its expectation, and thus the corresponding ∆w,w′ is close to 0, so ignoring ∆w,w′³ is well justified. The terms where ∆w,w′ is significant correspond to Xw,w′'s that are small. But empirically, the Xw,w′'s obey a power-law distribution (see, e.g., [140]), using which it can be shown that these terms contribute a small fraction of the final objective (7.3.3). So we can safely ignore the errors.
Note that L = ∑(w,w′) Xw,w′, so

∑(w,w′) Xw,w′ ∆w,w′ ≈ −(1/2) ∑(w,w′) Xw,w′ ∆w,w′² .

Plugging this into (7.3.2) leads to

2(c − ℓ) ≈ ∑(w,w′) Xw,w′ ∆w,w′² .        (7.3.3)
So maximizing the likelihood is approximately equivalent to minimizing the right
hand side, which (by examining (7.3.1)) leads to our objective.
Objective for training with PMI. A similar objective, PMI, can be obtained from (7.2.5) by computing an approximate MLE, using the fact that the error between the empirical and true values of PMI(w,w′) is driven by the smaller term p(w,w′), not by the larger terms p(w), p(w′):

min_{vw} ∑w,w′ Xw,w′ ( PMI(w,w′) − 〈vw, vw′〉 )²

This is of course very analogous to classical VSM methods, with a novel reweighting method.
Fitting either of these objectives involves solving a version of weighted SVD, which is NP-hard but empirically seems solvable in our setting via AdaGrad [58].
Connection to GloVe. Compare SN with the objective used by GloVe [140]:

∑w,w′ f(Xw,w′) ( log(Xw,w′) − 〈vw, vw′〉 − sw − sw′ − C )²

with f(Xw,w′) = min{Xw,w′^{3/4}, 100}. Their weighting method and the need for the bias terms sw, sw′, C were derived by trial and error; here they are all predicted and given meanings by Theorem 7.2.2, specifically sw = ‖vw‖².
Connection to word2vec (CBOW). The CBOW model in word2vec posits the following probability for a word wk+1 given the previous k words w1, w2, . . . , wk:

p(wk+1 | w1, . . . , wk) ∝ exp(〈vwk+1, (1/k) ∑ki=1 vwi〉).
This expression seems mysterious since it depends upon the average word vector for the previous k words. We show it can be theoretically justified. Assume a simplified version of our model, where a small window of k words is generated as follows: sample c ∼ C, where C is the uniform distribution over unit vectors; then sample (w1, w2, . . . , wk) ∼ exp(〈∑ki=1 vwi, c〉)/Zc. Furthermore, assume Zc = Z for any c.
Lemma 7.3.1. In the simplified version of our model, the maximum-a-posteriori (MAP) estimate of c given (w1, w2, . . . , wk) is (∑ki=1 vwi) / ‖∑ki=1 vwi‖.

Proof. The c maximizing p(c | w1, w2, . . . , wk) is the maximizer of p(c) p(w1, w2, . . . , wk | c). Since p(c) = p(c′) for any c, c′, and we have p(w1, w2, . . . , wk | c) = exp(〈∑i vwi, c〉)/Z, the maximizer is clearly c = (∑ki=1 vwi) / ‖∑ki=1 vwi‖.
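Lemma 7.3.1 is a direct consequence of Cauchy–Schwarz and is easy to check numerically. A small sketch (assuming NumPy; the sizes are arbitrary) compares the claimed MAP estimate against many random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5
window = rng.normal(size=(k, d))       # word vectors v_{w_1}, ..., v_{w_k}
s = window.sum(axis=0)
c_map = s / np.linalg.norm(s)          # claimed MAP estimate of the discourse c

# Any other unit vector gives a smaller value of <sum_i v_{w_i}, c>, hence a
# smaller posterior p(c | w_1, ..., w_k) under the uniform prior over c.
others = rng.normal(size=(1000, d))
others /= np.linalg.norm(others, axis=1, keepdims=True)
best_other = float((others @ s).max())
map_value = float(s @ c_map)           # equals ||s||, the Cauchy-Schwarz maximum
```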
Thus using the MAP estimate of ct gives essentially the same expression as CBOW, apart from the rescaling, which is often omitted for computational efficiency in empirical works.
7.4 Explaining relations=lines
As mentioned, word analogies like “a:b::c:??” can be solved via a linear algebraic
expression:
argmind ‖va − vb − vc + vd‖22 , (7.4.1)
where vectors have been normalized such that ‖vd‖2 = 1. This suggests that the
semantic relationships being tested in the analogy are characterized by a straight
line,5 referred to earlier as relations=lines.
Using our model we will show the following for low-dimensional embeddings: for
each such relation R there is a direction µR in space such that for any word pair a, b
satisfying the relation, va − vb is like µR plus some noise vector. This happens for
relations satisfying a certain condition described below. Empirical results supporting
this theory appear in Section 7.5, where this linear structure is further leveraged to
slightly improve analogy solving.
A side product of our argument will be a mathematical explanation of the em-
pirically well-established superiority of low-dimensional word embeddings over high-
dimensional ones in this setting [110]. As mentioned earlier, the usual explanation
that smaller models generalize better is fallacious.
We first sketch what was missing in prior attempts to prove versions of rela-
tions=lines from first principles. The basic issue is approximation error: the differ-
ence between the best solution and the 2nd best solution to (7.4.1) is typically small,
whereas the approximation error in the objective in the low-dimensional solutions is
larger. For instance, if one uses our PMI objective, then the weighted average of
the termwise error in (7.2.5) is 17%, and the expression in (7.4.1) above contains six inner products. Thus in principle the approximation error could cause the method to fail and prevent the linear relationship from emerging, but it does not.

⁵Note that this interpretation has been disputed; e.g., it is argued in [110] that (7.4.1) can be understood using only the classical connection between inner product and word similarity, using which the objective (7.4.1) is slightly improved to a different objective called 3COSMUL. However, this "explanation" is still dogged by the issue of large termwise error pinpointed here, since inner product is only a rough approximation to word similarity. Furthermore, the experiments in Section 7.5 clearly support the relations=lines interpretation.
Prior explanations. [140] propose a model where such linear relationships should occur by design. They posit that queen is a solution to the analogy "man:woman::king:??" because

p(χ | king)/p(χ | queen) ≈ p(χ | man)/p(χ | woman),        (7.4.2)
where p(χ | king) denotes the conditional probability of seeing word χ in a small
window of text around king. Relationship (7.4.2) is intuitive since both sides will be
≈ 1 for gender-neutral χ like “walks” or “food”, will be > 1 when χ is like “he, Henry”
and will be < 1 when χ is like “dress, she, Elizabeth.” This was also observed by [110].
Given (7.4.2), they then posit that the correct model describing word embeddings in terms of word occurrences must be a homomorphism from (ℝᵈ, +) to (ℝ₊, ×), so vector differences map to ratios of probabilities. This leads to the expression

log pw,w′ = 〈vw, vw′〉 + bw + bw′ ,

and their method is a (weighted) least-squares fit for this expression. One shortcoming
of this argument is that the homomorphism assumption assumes the linear relation-
ships instead of explaining them from a more basic principle. More importantly, the
empirical fit to the homomorphism has nontrivial approximation error, high enough
that it does not imply the desired strong linear relationships.
[111] show that empirically, skip-gram vectors satisfy
〈vw, vw′〉 ≈ PMI(w,w′) (7.4.3)
up to some shift. They also give an argument suggesting this relationship must be
present if the solution is allowed to be very high-dimensional. Unfortunately, that
argument does not extend to low-dimensional embeddings. Even if it did, the issue
of termwise approximation error remains.
Our explanation. The current paper has introduced a generative model to theo-
retically explain the emergence of relationship (7.4.3). However, as noted after The-
orem 7.2.2, the issue of high approximation error does not go away either in theory
or in the empirical fit. We now show that the isotropy of word vectors (assumed
in the theoretical model and verified empirically) implies that even a weak version
of (7.4.3) is enough to imply the emergence of the observed linear relationships in
low-dimensional embeddings.
This argument will assume the analogy in question involves a relation that obeys
Pennington et al.’s suggestion in (7.4.2). Namely, for such a relation R there exists
function νR(·) depending only upon R such that for any a, b satisfying R there is a
noise function ξa,b,R(·) for which:
p(χ | a)
p(χ | b)= νR(χ) · ξa,b,R(χ) (7.4.4)
For different words χ there is huge variation in (7.4.4), so the multiplicative noise
may be large.
Our goal is to show that the low-dimensional word embeddings have the property
that there is a vector µR such that for every pair of words a, b in that relation,
va − vb = µR + noise vector, where the noise vector is small.
Taking logarithms of (7.4.4) results in:

log( p(χ | a)/p(χ | b) ) = log(νR(χ)) + ζa,b,R(χ)        (7.4.5)
Theorem 7.2.2 implies that the left-hand side simplifies to log(p(χ|a)/p(χ|b)) = (1/d)〈vχ, va − vb〉 + εa,b(χ), where ε captures the small approximation errors induced by the inexactness of Theorem 7.2.2. This adds yet more noise! Denoting by V the n × d matrix whose rows are the vectors vχ, we rewrite (7.4.5) as:

V (va − vb) = d log(νR) + ζ′a,b,R        (7.4.6)

where log(νR) is the element-wise log of the vector νR and ζ′a,b,R = d(ζa,b,R − εa,b,R) is the noise.
In essence, (7.4.6) shows that va − vb is a solution to a linear regression in d variables with n constraints, with ζ′a,b,R being the "noise." The design matrix in the regression is
V , the matrix of all word vectors, which in our model (as well as empirically) satisfies
an isotropy condition. This makes it random-like, and thus solving the regression by
left-multiplying by V †, the pseudo-inverse of V , ought to “denoise” effectively. We
now show that it does.
Our model assumed the set of all word vectors satisfies bulk properties similar
to a set of Gaussian vectors. The next theorem will only need the following weaker
properties. (1) The smallest non-zero singular value of V is larger than some constant
c1 times the quadratic mean of the singular values, namely, ‖V ‖F/√d. Empirically we
find c1 ≈ 1/3 holds; see Section 7.5. (2) The left singular vectors behave like random
vectors with respect to ζ ′a,b,R, namely, have inner product at most c2‖ζ ′a,b,R‖/√n with
ζ ′a,b,R, for some constant c2. (3) The max norm of a row in V is O(√d).
Theorem 7.4.1 (Noise reduction). Under the conditions of the previous paragraph, the noise in the dimension-reduced semantic vector space satisfies

‖ζa,b,R‖ ≲ ‖ζ′a,b,R‖ √d / n .
As a corollary, the relative error in the dimension-reduced space is smaller by a factor of √(d/n).
Proof of Theorem 7.4.1 The proof uses the standard analysis of linear regression. Let V = PΣQ^T be the SVD of V and let σ1, . . . , σd be the singular values of V (the diagonal entries of Σ). For notational ease we omit the subscripts in ζ and ζ′ since they are not relevant for this proof. Since V† = QΣ⁻¹P^T, and thus ζ = V†ζ′ = QΣ⁻¹P^Tζ′, we have

‖ζ‖ ≤ σd⁻¹ ‖P^Tζ′‖.        (7.4.7)

We claim

σd⁻¹ ≤ √(1/(c1n)).        (7.4.8)

Indeed, ∑di=1 σi² = O(nd), since the average squared norm of a word vector is d; the claim then follows from the first assumption. Furthermore, by the second assumption, ‖P^Tζ′‖∞ ≤ (c2/√n)‖ζ′‖, so

‖P^Tζ′‖² ≤ (c2²d/n) ‖ζ′‖².        (7.4.9)
Plugging (7.4.8) and (7.4.9) into (7.4.7), we get

‖ζ‖ ≤ √(1/(c1n)) · √(c2²d/n) ‖ζ′‖ = (c2√d / (√c1 n)) ‖ζ′‖,

as desired. The last statement follows because the norm of the signal, which is d log(νR) originally and V†d log(νR) = va − vb after dimension reduction, also gets reduced by a factor of √n.
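The denoising effect of the pseudo-inverse can be illustrated on synthetic data. In this sketch (assuming NumPy; toy sizes, with both V and the noise drawn at random so that assumptions (1)–(3) hold with high probability), the noise norm shrinks by roughly the predicted factor √d/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 50            # toy vocabulary size and embedding dimension

# Isotropic "word vector" matrix V (rows roughly Gaussian, as in the prior).
V = rng.normal(size=(n, d))
# A generic high-dimensional noise vector zeta' with no preferred direction.
zeta_prime = rng.normal(size=n)

# Solving the regression V x = d log(nu_R) + zeta' by left-multiplying with
# the pseudo-inverse maps the noise zeta' to zeta = V^dagger zeta'.
zeta = np.linalg.pinv(V) @ zeta_prime

ratio = float(np.linalg.norm(zeta) / np.linalg.norm(zeta_prime))
bound = np.sqrt(d) / n     # Theorem 7.4.1 predicts ratio = O(sqrt(d)/n)
```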
7.5 Experimental Verification
In this section, we provide experiments empirically supporting our generative model.
Corpus. All word embedding vectors are trained on the English Wikipedia (March 2015 dump). It is pre-processed by a standard pipeline (removing non-textual elements, sentence splitting, and tokenization), leaving about 3 billion tokens. Words that appeared fewer than 1000 times in the corpus are ignored, resulting in a vocabulary of 68,430. Co-occurrence counts are then computed using windows of 10 tokens to each side of the focus word.
Training method. Our embedding vectors are trained by optimizing the SN ob-
jective using AdaGrad [58] with initial learning rate of 0.05 and 100 iterations. The
PMI objective derived from (7.2.5) was also used. SN has an average (weighted) termwise error of 5%, and PMI has 17%. We observed that SN vectors typically fit the model better and have better performance, which can be explained by the larger errors in PMI, as implied by Theorem 7.2.2. So we only report the results for SN.
For comparison, GloVe and two variants of word2vec (skip-gram and CBOW)
vectors are trained. GloVe’s vectors are trained on the same co-occurrence as SN
with the default parameter values.6 word2vec vectors are trained using a window size
of 10, with other parameters set to default values.7
7.5.1 Model Verification
Experiments were run to test our modeling assumptions. First, we tested two counterintuitive properties: the concentration of the partition function Zc for different discourse vectors c (see Lemma 7.2.1), and the random-like behavior of the matrix of
⁶http://nlp.stanford.edu/projects/glove/
⁷https://code.google.com/p/word2vec/
[Figure: histograms of the normalized partition function for four embedding methods; panels (a) SN, (b) GloVe, (c) CBOW, (d) skip-gram; x-axis: partition function value, y-axis: percentage.]

Figure 7.1: The partition function Zc. The figure shows the histogram of Zc for 1000 random vectors c of appropriate norm, as defined in the text. The x-axis is normalized by the mean of the values. The values Zc for different c concentrate around the mean, mostly in [0.9, 1.1]. This concentration phenomenon is predicted by our analysis.
[Figure: scatter plot; x-axis: natural logarithm of frequency, y-axis: squared norm.]

Figure 7.2: The linear relationship between the squared norms of our word vectors and the logarithms of the word frequencies. Each dot in the plot corresponds to a word, where the x-axis is the natural logarithm of the word frequency, and the y-axis is the squared norm of the word vector. The Pearson correlation coefficient between the two is 0.75, indicating a significant linear relationship, which strongly supports our mathematical prediction, equation (7.2.4) of Theorem 7.2.2.
word embeddings in terms of its singular values (see Theorem 7.4.1). For compar-
ison we also tested these properties for word2vec and GloVe vectors, though they
are trained by different objectives. Finally, we tested the linear relation between the
squared norms of our word vectors and the logarithm of the word frequencies, as
implied by Theorem 7.2.2.
Partition function. Our theory predicts the counter-intuitive concentration of the partition function Zc = ∑w′ exp(〈vw′, c〉) for a random discourse vector c (see Lemma 7.2.1). This is verified empirically by picking a uniformly random direction, of norm ‖c‖ = 4/µw, where µw is the average norm of the word vectors.⁸ Figure 7.1(a) shows the histogram of Zc for 1000 such randomly chosen c's for our vectors. The values are concentrated, mostly in the range [0.9, 1.1] times the mean. Concentration is also observed for other types of vectors, especially for GloVe and CBOW.
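The concentration in Figure 7.1 can also be reproduced on synthetic vectors drawn from the model's prior. A sketch (assuming NumPy; toy vocabulary size and dimension, not the trained vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 100                       # toy vocabulary size and dimension

# Word vectors drawn from the spherical Gaussian prior of the model.
word_vecs = rng.normal(size=(n, d))
mu_w = np.linalg.norm(word_vecs, axis=1).mean()

# Random discourse directions, rescaled to norm 4 / mu_w as in the text.
n_dirs = 200
c = rng.normal(size=(n_dirs, d))
c *= (4.0 / mu_w) / np.linalg.norm(c, axis=1, keepdims=True)

Z = np.exp(word_vecs @ c.T).sum(axis=0)  # partition function Z_c per direction
Z_rel = Z / Z.mean()                     # normalized as in Figure 7.1
```

At these sizes the normalized values fall well inside [0.9, 1.1], matching the concentration predicted by Lemma 7.2.1.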
Isotropy with respect to singular values. Our theoretical explanation of rela-
tions=lines assumes that the matrix of word vectors behaves like a random matrix
with respect to the properties of singular values. In our embeddings, the quadratic
mean of the singular values is 34.3, while the minimum non-zero singular value of our
word vectors is 11. Therefore, the ratio between them is a small constant, consistent
with our model. The ratios for GloVe, CBOW, and skip-gram are 1.4, 10.1, and 3.1,
respectively, which are also small constants.
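For comparison, the corresponding ratio for a truly random Gaussian matrix is easy to compute. A sketch (assuming NumPy; toy sizes) checks that the smallest singular value is a constant fraction of the quadratic mean ‖V‖F/√d:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 50
V = rng.normal(size=(n, d))                 # random-like word-vector matrix
sigma = np.linalg.svd(V, compute_uv=False)

# Quadratic mean of the singular values: sqrt(sum_i sigma_i^2 / d) = ||V||_F / sqrt(d).
quad_mean = np.linalg.norm(V, "fro") / np.sqrt(d)
c1 = float(sigma.min() / quad_mean)         # the constant in assumption (1)
```

For n much larger than d, random matrix theory puts the smallest singular value near (1 − √(d/n)) times the quadratic mean, so c1 is a constant bounded away from 0, consistent with the isotropy assumption.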
Squared norms vs. word frequencies. Figure 7.2 shows a scatter plot of the squared norms of our vectors against the logarithms of the word frequencies. A linear relationship is observed (Pearson correlation 0.75), thus supporting Theorem 7.2.2. The correlation is stronger for high-frequency words, possibly because the corresponding terms have higher weights in the training objective.
This correlation is much weaker for other types of word embeddings. This is
possibly because they have more free parameters (“knobs to turn”), which imbue
the embeddings with other properties. This can also cause the difference in the
concentration of the partition function for the two methods.
⁸Note that our model uses the inner products between the discourse vectors and word vectors, so it is invariant if the discourse vectors are scaled by s while the word vectors are scaled by 1/s for any s > 0. Therefore, one needs to choose the norm of c properly. We assume ‖c‖µw = √d/κ ≈ 4 for a constant κ = 5, so that it gives a reasonable fit to the predicted dynamic range of word frequencies according to our theory; see the model details in Section 7.2.
Relations        SN    GloVe  CBOW  skip-gram
G: semantic      0.84  0.85   0.79  0.73
G: syntactic     0.61  0.65   0.71  0.68
G: total         0.71  0.73   0.74  0.70
M: adjective     0.50  0.56   0.58  0.58
M: noun          0.69  0.70   0.56  0.58
M: verb          0.48  0.53   0.64  0.56
M: total         0.53  0.57   0.62  0.57

Table 7.1: The accuracy on two word analogy task testbeds: G (the GOOGLE testbed); M (the MSR testbed). Performance is close to the state of the art despite using a generative model with provable properties.
7.5.2 Performance on Analogy Tasks
We compare the performance of our word vectors on analogy tasks, specifically the two testbeds GOOGLE and MSR [122, 125]. The former contains 7874 semantic questions such as "man:woman::king:??" and 10167 syntactic ones such as "run:runs::walk:??". The latter has 8000 syntactic questions for adjectives, nouns, and verbs.
To solve these tasks, we use linear algebraic queries.9 That is, first normalize the
vectors to unit norm and then solve “a:b::c:??” by
argmind ‖va − vb − vc + vd‖22 . (7.5.1)
The algorithm succeeds if the best d happens to be correct.
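The linear algebraic query (7.5.1) is straightforward to implement. A minimal sketch (assuming NumPy; the vocabulary and vectors are the caller's, and we follow the standard convention of excluding the three query words from the candidates):

```python
import numpy as np

def solve_analogy(vecs, vocab, a, b, c):
    """Solve a:b::c:? via argmin_d ||v_a - v_b - v_c + v_d||^2, i.e. (7.5.1).

    vecs  : (n, d) array of word vectors (unit-normalized in the experiments)
    vocab : list of the n words; a, b, c must be in vocab
    """
    idx = {w: i for i, w in enumerate(vocab)}
    # Minimizing ||v_a - v_b - v_c + v_d|| is finding v_d closest to this target.
    target = vecs[idx[b]] + vecs[idx[c]] - vecs[idx[a]]
    dists = np.linalg.norm(vecs - target, axis=1)
    for w in (a, b, c):                # exclude the query words themselves
        dists[idx[w]] = np.inf
    return vocab[int(np.argmin(dists))]
```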
The performance of the different methods is presented in Table 7.1. Our vectors achieve performance comparable to the state of the art on semantic analogies (similar accuracy to GloVe, better than word2vec). On syntactic tasks, they achieve accuracy 0.04 lower than GloVe and skip-gram, while CBOW typically outperforms the others.¹⁰ The reason is probably that our model ignores local word order, whereas the other models capture it to some extent. For example, the word "she" can affect
⁹One can instead use the 3COSMUL objective of [110], which increases the accuracy by about 3%. But it is not linear, while our focus here is the linear algebraic structure.
¹⁰It was earlier reported that skip-gram outperforms CBOW [122, 140]. This may be due to the different training data sets and hyperparameters used.
the context by a lot and determine if the next word is “thinks” rather than “think”.
Incorporating such linguistic features in the model is left for future work.
7.5.3 Verifying relations=lines
The theory in Section 7.4 predicts the existence of a direction for a relation, whereas
earlier [110] had questioned if this phenomenon is real. The experiment uses the
analogy testbed, where each relation is tested using 20 or more analogies. For each relation, we take the set of vectors vab = va − vb where the word pair (a, b) satisfies the relation. We then calculate the top singular vectors of the matrix formed by these vab's, and compute the cosine similarity (i.e., normalized inner product) of each individual vab to the singular vectors. We observed that most (va − vb)'s are correlated with
the first singular vector, but have inner products around 0 with the second singular
vector. Over all relations, the average projection on the first singular vector is 0.51
(semantic: 0.58; syntactic: 0.46), and the average on the second singular vector is
0.035. For example, Table 7.2 shows the mean similarities and standard deviations on
the first and second singular vectors for 4 relations. Similar results are also obtained for the word embeddings produced by GloVe and word2vec. Therefore, the first singular vector can be taken as the direction associated with the relation, while the other components are like random noise, in line with our model.
Cheating solver for analogy testbeds. The above linear structure suggests a
better (but cheating) way to solve the analogy task. This uses the fact that the same
semantic relationship (e.g., masculine-feminine, singular-plural) is tested many times
in the testbed. If a relation R is represented by a direction µR then the cheating
algorithm can learn this direction (via rank 1 SVD) after seeing a few examples of
the relationship. Then use the following method of solving “a:b::c:??”: look for a
relation   1            2            3            4            5            6            7
1st        0.65 ± 0.07  0.61 ± 0.09  0.52 ± 0.08  0.54 ± 0.18  0.60 ± 0.21  0.35 ± 0.17  0.42 ± 0.16
2nd        0.02 ± 0.28  0.00 ± 0.23  0.05 ± 0.30  0.06 ± 0.27  0.01 ± 0.24  0.07 ± 0.24  0.01 ± 0.25

relation   8            9            10           11           12           13           14
1st        0.56 ± 0.09  0.53 ± 0.08  0.37 ± 0.11  0.72 ± 0.10  0.37 ± 0.14  0.40 ± 0.19  0.43 ± 0.14
2nd        0.00 ± 0.22  0.01 ± 0.26  0.02 ± 0.20  0.01 ± 0.24  0.07 ± 0.26  0.07 ± 0.23  0.09 ± 0.23

Table 7.2: The verification of relation directions on 2 semantic and 2 syntactic relations in the GOOGLE testbed. Relations include cap-com: capital-common-countries; cap-wor: capital-world; adj-adv: gram1-adjective-to-adverb; opp: gram2-opposite. For each relation, take vab = va − vb for pairs (a, b) in the relation, and then calculate the top singular vectors of the matrix formed by these vab's. The rows labeled "1st"/"2nd" show the cosine similarities of individual vab to the 1st/2nd singular vector (mean and standard deviation).
              SN    GloVe  CBOW  skip-gram
w/o RD        0.71  0.73   0.74  0.70
RD (k = 20)   0.74  0.77   0.79  0.75
RD (k = 30)   0.79  0.80   0.82  0.80
RD (k = 40)   0.76  0.80   0.80  0.77

Table 7.3: The accuracy of the RD algorithm (i.e., the cheater method) on the GOOGLE testbed. The RD algorithm is described in the text. For comparison, the row "w/o RD" shows the accuracy of the old method without using RD.
word d such that vc − vd has the largest projection on µR, the relation direction for
(a, b). This can boost success rates by about 10%.
The testbed can try to combat such cheating by giving analogy questions in a
random order. But the cheating algorithm can just cluster the presented analogies to
learn which of them are in the same relation. Thus the final algorithm, named analogy
solver with relation direction (RD), is: take all vectors va − vb for all the word pairs
(a, b) presented among the analogy questions and do k-means clustering on them; for
each (a, b), estimate the relation direction by taking the first singular vector of its
cluster, and substitute that for va− vb in (7.5.1) when solving the analogy. Table 7.3
shows the performance on GOOGLE with different values of k; e.g. using our SN
vectors and k = 30 leads to 0.79 accuracy. Thus future designers of analogy testbeds
should remember not to test the same relationship too many times! This still leaves
                 SN    GloVe  CBOW  skip-gram
w/o RD-nn        0.71  0.73   0.74  0.70
RD-nn (k = 10)   0.71  0.74   0.77  0.73
RD-nn (k = 20)   0.72  0.75   0.77  0.74
RD-nn (k = 30)   0.73  0.76   0.78  0.74

Table 7.4: The accuracy of the RD-nn algorithm on the GOOGLE testbed. The algorithm is described in the text. For comparison, the row "w/o RD-nn" shows the accuracy of the old method without using RD-nn.
other ways to cheat, such as learning the directions for interesting semantic relations
from other collections of analogies.
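The core of RD, estimating µR by a rank-1 SVD of the difference vectors and then scoring candidates by alignment with it, can be sketched as follows (assuming NumPy; the k-means clustering step is omitted and the relation's pairs are taken as given; "projection" is implemented here as cosine similarity, one reasonable reading of the text):

```python
import numpy as np

def relation_direction(diff_vecs):
    """Estimate mu_R as the top singular vector of the matrix whose rows are
    the difference vectors v_a - v_b for pairs (a, b) in the relation."""
    M = np.asarray(diff_vecs, dtype=float)
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    mu = vt[0]
    if mu @ M.mean(axis=0) < 0:      # fix the sign ambiguity of the SVD
        mu = -mu
    return mu

def solve_with_direction(vecs, vocab, c, mu_R):
    """Pick the word d whose difference v_c - v_d best aligns with mu_R."""
    idx = {w: i for i, w in enumerate(vocab)}
    diffs = vecs[idx[c]] - vecs
    norms = np.linalg.norm(diffs, axis=1)
    norms[idx[c]] = np.inf           # exclude the query word itself
    return vocab[int(np.argmax((diffs @ mu_R) / norms))]
```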
Non-cheating solver for analogy testbeds. Now we show that even if a rela-
tionship is tested only once in the testbed, there is a way to use the above structure.
Given “a:b::c:??,” the solver first finds the top 300 nearest neighbors of a and those
of b, and then finds among these neighbors the top k pairs (a′, b′) so that the cosine
similarities between va′ − vb′ and va − vb are largest. Finally, the solver uses these pairs to estimate the relation direction (via rank-1 SVD), and substitutes this (corrected) estimate for va − vb in (7.5.1) when solving the analogy. This algorithm is named
analogy solver with relation direction by nearest neighbors (RD-nn). Table 7.4 shows
its performance, which consistently improves over the old method by about 3%.
7.6 Proof of Main Theorems and Lemmas
In this section we prove Theorem 7.2.2 and Lemma 7.2.1 (restated below).
Theorem 7.2.2. Suppose the word vectors satisfy equation (7.2.2), and the window size is q = 2. Then

log p(w,w′) = ‖vw + vw′‖²/(2d) − 2 logZ ± ε,        (7.6.1)

log p(w) = ‖vw‖²/(2d) − logZ ± ε,        (7.6.2)

for ε = O(εz) + O(1/d) + O(ε2). Jointly these imply:

PMI(w,w′) = 〈vw, vw′〉/d ± O(ε).        (7.6.3)
Lemma 7.2.1. If the word vectors satisfy the Bayesian prior v = s · v̂, where v̂ is drawn from the spherical Gaussian distribution and s is a scalar random variable, then with high probability the entire ensemble of word vectors satisfies

Prc∼C [(1 − εz)Z ≤ Zc ≤ (1 + εz)Z] ≥ 1 − δ,        (7.6.4)

for εz = O(1/√n) and δ = exp(−Ω(log² n)).
In this section, we first prove Theorem 7.2.2 using Lemma 7.2.1 and some helper lemmas. Lemma 7.2.1 will be proved in Section 7.6.1, and the helper lemmas will be proved in Section 7.6.2. Please see Section 7.2 of the main paper for the intuition behind the proof and a cleaner sketch without the technicalities.
We now begin the proof. Let c be the hidden discourse that determines the probability
of word w, and c′ be the next one that determines w′. We use p(c′|c) to denote the
Markov kernel (transition matrix) of the Markov chain. Let C be the stationary
distribution of discourse vector c, and D be the joint distribution of (c, c′). We
marginalize over the contexts c, c′ and then use the independence of w,w′ conditioned
on c, c′,
p(w,w′) = E_{(c,c′)∼D}[ ( exp(〈v_w, c〉)/Z_c ) · ( exp(〈v_{w′}, c′〉)/Z_{c′} ) ].   (7.6.5)
We first get rid of the partition function Zc using Lemma 7.2.1. As sketched in
the main paper, essentially we will replace Zc by Z in equation (7.6.5), though a very
careful control of the approximation error is required. Then we arrive at the following
claim.
Claim 7.6.1. Under the setting of Theorem 7.2.2,

log p(w,w′) = log( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z).
Proof of Claim 7.6.1. Formally, let F₁ be the event that c satisfies

(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z.   (7.6.6)

Similarly, let F₂ be the event that c′ satisfies (1 − ε_z)Z ≤ Z_{c′} ≤ (1 + ε_z)Z; let F = F₁ ∩ F₂, and let F̄ be its negation. Moreover, let 1_F be the indicator function of the event F. By Lemma 7.2.1 and the union bound, we have E[1_F] = Pr[F] ≥ 1 − exp(−Ω(log² n)).
We first decompose the integral (7.6.5) into two parts according to whether the event F happens:

p(w,w′) = E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ]
        + E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F̄ ].   (7.6.7)

We bound the first quantity on the right hand side using (7.2.2) and the definition of F:

E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ]
  ≤ (1 + ε_z)² (1/Z²) E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ].   (7.6.8)
For the second quantity on the right hand side of (7.6.7), we have by the Cauchy–Schwarz inequality,

( E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F̄ ] )²
  ≤ ( E_{(c,c′)∼D}[ (1/Z_c²) exp(〈v_w, c〉)² 1_F̄ ] ) · ( E_{(c,c′)∼D}[ (1/Z_{c′}²) exp(〈v_{w′}, c′〉)² 1_F̄ ] )
  ≤ ( E_c[ (1/Z_c²) exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ] ) · ( E_{c′}[ (1/Z_{c′}²) exp(〈v_{w′}, c′〉)² E_{c|c′}[1_F̄] ] ).   (7.6.9)

Using the fact that Z_c ≥ 1, we have

E_c[ (1/Z_c²) exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ] ≤ E_c[ exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ].

We can split this expectation as

E_c[ exp(〈v_w, c〉)² 1_{〈v_w,c〉>0} E_{c′|c}[1_F̄] ] + E_c[ exp(〈v_w, c〉)² 1_{〈v_w,c〉<0} E_{c′|c}[1_F̄] ].   (7.6.10)

The second term of (7.6.10) is upper bounded by

E_{c,c′}[1_F̄] ≤ exp(−Ω(log² n)).

We proceed to the first term of (7.6.10) and observe the following property of it:

E_c[ exp(〈v_w, c〉)² 1_{〈v_w,c〉>0} E_{c′|c}[1_F̄] ] ≤ E_c[ exp(〈αv_w, c〉)² 1_{〈v_w,c〉>0} E_{c′|c}[1_F̄] ] ≤ E_c[ exp(〈αv_w, c〉)² E_{c′|c}[1_F̄] ],

where α > 1. Therefore, it is sufficient to bound

E_c[ exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ]

when ‖v_w‖ = Ω(√d).
Let us denote by z the random variable 2〈v_w, c〉, and let r(z) = E_{c′|z}[1_F̄], which is a function of z taking values in [0, 1]. We wish to upper bound E_c[exp(z)r(z)]. The worst-case r(z) can be quantified using a continuous version of Abel's inequality, proven in Lemma 7.6.5, which gives

E_c[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ],   (7.6.11)

where t satisfies E_c[1_{[t,+∞)}(z)] = Pr[z ≥ t] = E_c[r(z)] ≤ exp(−Ω(log² n)). We then claim that Pr[z ≥ t] ≤ exp(−Ω(log² n)) implies that t ≥ Ω(log^{0.9} n).
If c were distributed as N(0, (1/d)I), this would be a simple tail bound. However, as c is distributed uniformly on the sphere, this requires special care, and the claim follows by applying Lemma 7.6.2 instead.
Finally, applying Corollary 7.6.4, we have

E[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} n)).   (7.6.12)
We have the same bound for c′ as well. Hence, for the second quantity on the right hand side of (7.6.7), we have

E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F̄ ]
  ≤ ( E_c[ (1/Z_c²) exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ] )^{1/2} · ( E_{c′}[ (1/Z_{c′}²) exp(〈v_{w′}, c′〉)² E_{c|c′}[1_F̄] ] )^{1/2}
  ≤ exp(−Ω(log^{1.8} n)),   (7.6.13)

where the first inequality follows from Cauchy–Schwarz, and the second from the calculation above.
Combining (7.6.7), (7.6.8), and (7.6.13), we obtain

p(w,w′) ≤ (1 + ε_z)² (1/Z²) E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ] + exp(−Ω(log^{1.8} n))
        ≤ (1 + ε_z)² (1/Z²) ( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] + δ₀ ),

where δ₀ = exp(−Ω(log^{1.8} n)) · Z² ≤ exp(−Ω(log^{1.8} n)), by the fact that Z ≤ exp(2κ)n = O(n). Note that κ is treated as an absolute constant throughout the paper. On the other hand, we can lower bound similarly

p(w,w′) ≥ (1 − ε_z)² (1/Z²) E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ]
        ≥ (1 − ε_z)² (1/Z²) ( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] − δ₀ ).

Taking logarithms, the multiplicative error translates to an additive error:

log p(w,w′) = log( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z).
This completes the proof of the claim.
To exploit the fact that c and c′ should be close to each other, we further rewrite log p(w,w′) by reorganizing the expectations above:

log p(w,w′) = log( E_c[ exp(〈v_w, c〉) E_{c′|c}[exp(〈v_{w′}, c′〉)] ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z)
            = log( E_c[ exp(〈v_w, c〉) A(c) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z),   (7.6.14)

where A(c) in the inner integral is defined as

A(c) := E_{c′|c}[ exp(〈v_{w′}, c′〉) ].
It can be bounded as in the following claim.
Claim 7.6.2. In the setting of Theorem 7.2.2, we have A(c) = (1 ± ε₂) exp(〈v_{w′}, c〉).
Proof of Claim 7.6.2. Since every word vector satisfies ‖v_w‖ ≤ κ√d, we have 〈v_{w′}, c − c′〉 ≤ ‖v_{w′}‖ ‖c − c′‖ ≤ κ√d ‖c − c′‖. Then we can bound A(c) by

A(c) = E_{c′|c}[ exp(〈v_{w′}, c′〉) ]
     = exp(〈v_{w′}, c〉) E_{c′|c}[ exp(〈v_{w′}, c′ − c〉) ]
     ≤ exp(〈v_{w′}, c〉) E_{c′|c}[ exp(κ√d ‖c − c′‖) ]
     ≤ (1 + ε₂) exp(〈v_{w′}, c〉),

where the last inequality follows from our model assumptions. To derive a lower bound on A(c), observe that

E_{c′|c}[ exp(κ√d ‖c − c′‖) ] + E_{c′|c}[ exp(−κ√d ‖c − c′‖) ] ≥ 2.

Therefore, our model assumptions imply that

E_{c′|c}[ exp(−κ√d ‖c − c′‖) ] ≥ 1 − ε₂.

Hence,

A(c) = exp(〈v_{w′}, c〉) E_{c′|c}[ exp(〈v_{w′}, c′ − c〉) ]
     ≥ exp(〈v_{w′}, c〉) E_{c′|c}[ exp(−κ√d ‖c − c′‖) ]
     ≥ (1 − ε₂) exp(〈v_{w′}, c〉).
This completes the proof of the claim.
Plugging the just-obtained estimate of A(c) into equation (7.6.14), we get

log p(w,w′) = log( E_c[ exp(〈v_w, c〉) A(c) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z)
            = log( E_c[ (1 ± ε₂) exp(〈v_w, c〉) exp(〈v_{w′}, c〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z)
            = log( E_c[ exp(〈v_w + v_{w′}, c〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z) + log(1 ± ε₂).   (7.6.15)

Now it suffices to compute E_c[exp(〈v_w + v_{w′}, c〉)]. Note that if c had the distribution N(0, (1/d)I), which is very similar to the uniform distribution over the sphere, then we would straightforwardly get E_c[exp(〈v_w + v_{w′}, c〉)] = exp(‖v_w + v_{w′}‖²/(2d)). For c having the uniform distribution over the sphere, by Lemma 7.6.6 the same equality holds approximately:

E_c[ exp(〈v_w + v_{w′}, c〉) ] = (1 ± ε₃) exp( ‖v_w + v_{w′}‖²/(2d) ),   (7.6.16)

where ε₃ = O(1/d).

Plugging equation (7.6.16) into equation (7.6.15), we have

log p(w,w′) = log( (1 ± ε₃) exp(‖v_w + v_{w′}‖²/(2d)) ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z) + log(1 ± ε₂)
            = ‖v_w + v_{w′}‖²/(2d) + O(ε₃) + O(δ₀′) − 2 log Z ± 2ε_z ± ε₂,

where δ₀′ = δ₀ · ( E_{c∼C}[exp(〈v_w + v_{w′}, c〉)] )^{−1} = exp(−Ω(log^{1.8} n)). Note that ε₃ = O(1/d), ε_z = O(1/√n), and ε₂ is small by assumption; therefore we obtain

log p(w,w′) = ‖v_w + v_{w′}‖²/(2d) − 2 log Z ± ( O(ε_z) + O(ε₂) + O(1/d) ).
7.6.1 Analyzing Partition Function Zc
In this subsection, we prove Lemma 7.2.1. We first prove that the means of Z_c are all (1 + o(1))-close to each other, and then prove that Z_c is concentrated around its mean. It turns out the concentration part is non-trivial because the random variable of concern, exp(〈v_w, c〉), is not well-behaved in terms of its tail: exp(〈v_w, c〉) is not sub-Gaussian for any variance proxy. This essentially disallows us from using an existing concentration inequality directly. We get around this issue by considering a truncated version of exp(〈v_w, c〉), which is bounded and has tail properties similar to the original in the regime we are concerned with.
We bound the mean and variance of Zc first in the lemma below.
Lemma 7.6.1. For any fixed unit vector c ∈ R^d, we have E[Z_c] ≥ n and V[Z_c] ≤ O(n).

Proof of Lemma 7.6.1. Recall that by definition

Z_c = Σ_w exp(〈v_w, c〉).
We fix the context c and view the v_w's as random variables throughout this proof. Recall that v_w is composed as v_w = s_w · v̂_w, where s_w is the scaling and v̂_w is drawn from a spherical Gaussian with identity covariance I_{d×d}. Let s be a random variable with the same distribution as s_w. We lower bound the mean of Z_c as follows:

E[Z_c] = n E[exp(〈v_w, c〉)] ≥ n E[1 + 〈v_w, c〉] = n,
where the last equality holds because of the symmetry of the spherical Gaussian distribution. On the other hand, to upper bound the mean of Z_c, we condition on the scaling s_w:

E[Z_c] = n E[exp(〈v_w, c〉)] = n E[ E[ exp(〈v_w, c〉) | s_w ] ].

Note that conditioned on s_w, 〈v_w, c〉 is a Gaussian random variable with variance σ² = s_w². Therefore,

E[ exp(〈v_w, c〉) | s_w ] = ∫ (1/(σ√(2π))) exp(−x²/(2σ²)) exp(x) dx
  = ∫ (1/(σ√(2π))) exp( −(x − σ²)²/(2σ²) + σ²/2 ) dx
  = exp(σ²/2).

It follows that

E[Z_c] = n E[exp(σ²/2)] = n E[exp(s_w²/2)] = n E[exp(s²/2)].
We calculate the variance of Z_c as follows:

V[Z_c] = Σ_w V[ exp(〈v_w, c〉) ] ≤ n E[ exp(2〈v_w, c〉) ] = n E[ E[ exp(2〈v_w, c〉) | s_w ] ].

By a very similar calculation as above, using the fact that 2〈v_w, c〉 is a Gaussian random variable with variance 4σ² = 4s_w²,

E[ exp(2〈v_w, c〉) | s_w ] = exp(2σ²).

Therefore, we have

V[Z_c] ≤ n E[ E[ exp(2〈v_w, c〉) | s_w ] ] = n E[exp(2σ²)] = n E[exp(2s²)] ≤ Λn,

for Λ = exp(8κ²) a constant; at the last step we used the fact that s ≤ κ almost surely.
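The Gaussian moment identity used twice above is easy to sanity-check numerically. The following Monte-Carlo sketch (with an arbitrary choice of σ, not taken from the text) estimates E[exp(g)] for g ∼ N(0, σ²) and compares it against the closed form exp(σ²/2):

```python
import math
import random

def mgf_check(sigma=1.5, trials=400000, seed=2):
    """Monte-Carlo estimate of E[exp(g)] for g ~ N(0, sigma^2), to be
    compared against the closed form exp(sigma^2 / 2) used in the proof."""
    rng = random.Random(seed)
    est = sum(math.exp(rng.gauss(0.0, sigma)) for _ in range(trials)) / trials
    return est, math.exp(sigma * sigma / 2.0)
```

With a few hundred thousand samples the estimate agrees with exp(σ²/2) to within a few percent.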
Now we are ready to prove Lemma 7.2.1.
Proof of Lemma 7.2.1. We fix the choice of c, and first prove concentration using the randomness of the v_w's. Note that exp(〈v_w, c〉) is neither sub-Gaussian nor sub-exponential (in fact, the Orlicz norm of the random variable exp(〈v_w, c〉) is not bounded). This prevents us from applying the usual concentration inequalities. The proof deals with this issue in a slightly more specialized manner.
Let us define F_w to be the event that |〈v_w, c〉| ≤ (1/2) log n. We claim that Pr[F_w] ≥ 1 − exp(−Ω(log² n)). Indeed, note that 〈v_w, c〉 | s_w has a Gaussian distribution with standard deviation s_w‖c‖ = s_w ≤ 2κ almost surely. Therefore, by the Gaussianity of 〈v_w, c〉 we have

Pr[ |〈v_w, c〉| ≥ (1/2) log n | s_w ] ≤ 2 exp( −Ω( (1/4) log² n / κ² ) ) = exp(−Ω(log² n)),

where Ω(·) hides the dependency on κ, which is treated as an absolute constant. Taking expectations over s_w, we obtain

Pr[F_w] = Pr[ |〈v_w, c〉| ≤ (1/2) log n ] ≥ 1 − exp(−Ω(log² n)).
Note that by definition, conditioned on F_w it holds in particular that exp(〈v_w, c〉) ≤ √n.

Let the random variable X_w have the same distribution as exp(〈v_w, c〉) | F_w. We prove that the random variable Z′_c = Σ_w X_w concentrates well. By convexity of the exponential function, the mean of Z′_c is lower bounded:

E[Z′_c] = n E[ exp(〈v_w, c〉) | F_w ] ≥ n exp( E[ 〈v_w, c〉 | F_w ] ) = n,

and the variance is upper bounded by

V[Z′_c] ≤ n E[ exp(〈v_w, c〉)² | F_w ] ≤ (1/Pr[F_w]) E[ exp(〈v_w, c〉)² ] ≤ (1/Pr[F_w]) Λn ≤ 1.1Λn,
where the second inequality uses the fact that

E[ exp(〈v_w, c〉)² ] = Pr[F_w] E[ exp(〈v_w, c〉)² | F_w ] + Pr[F̄_w] E[ exp(〈v_w, c〉)² | F̄_w ]
  ≥ Pr[F_w] E[ exp(〈v_w, c〉)² | F_w ].
Moreover, by definition, |X_w| ≤ √n for any w. Therefore, by Bernstein's inequality, we have

Pr[ |Z′_c − E[Z′_c]| > εn ] ≤ exp( − (ε²n²/2) / (1.1Λn + (1/3)√n · εn) ).

Note that E[Z′_c] ≥ n; therefore, for ε ≥ log² n/√n, we have

Pr[ |Z′_c − E[Z′_c]| > εE[Z′_c] ] ≤ Pr[ |Z′_c − E[Z′_c]| > εn ] ≤ exp( − (ε²n²/2) / (Λn + (1/3)√n · εn) )
  ≤ exp( −Ω( min{ ε²n/Λ, ε√n } ) )
  ≤ exp(−Ω(log² n)).
Let F̄ = ∪_w F̄_w be the union of the complements of the events F_w, so that F = ∩_w F_w. Then by the union bound, it holds that Pr[F̄] ≤ Σ_w Pr[F̄_w] ≤ n · exp(−Ω(log² n)) = exp(−Ω(log² n)). By definition, Z′_c has the same distribution as Z_c | F. Therefore, we have

Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F ] ≤ exp(−Ω(log² n)),   (7.6.17)

and therefore

Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] ] = Pr[F] · Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F ] + Pr[F̄] · Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F̄ ]
  ≤ Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F ] + Pr[F̄]
  ≤ exp(−Ω(log² n)),   (7.6.18)

where at the last line we used the fact that Pr[F̄] ≤ exp(−Ω(log² n)) and equation (7.6.17).
Let Z = E[Z′_c] = n E[ exp(〈v_w, c〉) | |〈v_w, c〉| < (1/2) log n ] (note that E[Z′_c] depends only on the norm ‖c‖, which is equal to 1). Therefore, we obtain that with high probability over the randomness of the v_w's,

(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z.   (7.6.19)

Taking the expectation over the randomness of c, we have

Pr_{c, v_w}[ (7.6.19) holds ] ≥ 1 − exp(−Ω(log² n)).

Therefore, by a standard averaging argument (using Markov's inequality), we have

Pr_{v_w}[ Pr_c[ (7.6.19) holds ] ≥ 1 − exp(−Ω(log² n)) ] ≥ 1 − exp(−Ω(log² n)).

From now on we fix a choice of the v_w's so that Pr_c[ (7.6.19) holds ] ≥ 1 − exp(−Ω(log² n)) is true. Therefore, in the rest of the proof only c is a random variable, and with probability 1 − exp(−Ω(log² n)) over the randomness of c it holds that

(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z.   (7.6.20)
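Lemma 7.2.1 is also easy to probe empirically. The following sketch uses toy parameters (with s_w ≡ 1, so that v_w ∼ N(0, I_d)): it computes Z_c for several random discourse vectors c and reports the max/min ratio, which should be close to 1 for moderately large n:

```python
import math
import random

def partition_spread(n=2000, d=50, trials=20, seed=0):
    """Draw n word vectors from a spherical Gaussian, then compute
    Z_c = sum_w exp(<v_w, c>) for `trials` random unit vectors c and
    return max(Z_c) / min(Z_c) as a measure of concentration."""
    rng = random.Random(seed)
    words = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    zs = []
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        c = [x / norm for x in g]
        zs.append(sum(math.exp(sum(vi * ci for vi, ci in zip(v, c)))
                      for v in words))
    return max(zs) / min(zs)
```

For these parameters the spread is on the order of a few percent, in line with ε_z = O(1/√n).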
7.6.2 Helper Lemmas
The following lemmas are helper lemmas that were used in the proof above. We use
Cd to denote the uniform distribution over the unit sphere in Rd.
Lemma 7.6.2 (Tail bound for spherical distribution). If c ∼ C_d, v ∈ R^d is a vector with ‖v‖ = Ω(√d), and t = ω(1), then the random variable z = 〈v, c〉 satisfies Pr[z ≥ t] = e^{−O(t²)}.
Proof. If c = (c₁, c₂, …, c_d) ∼ C_d, then c is equal in distribution to (c₁/‖c‖, c₂/‖c‖, …, c_d/‖c‖), where the c_i are i.i.d. samples from a univariate Gaussian with mean 0 and variance 1/d. By spherical symmetry, we may assume that v = (‖v‖, 0, …, 0). Let us introduce the random variable r = Σ_{i=2}^d c_i². Since

Pr[〈v, c〉 ≥ t] = Pr[ ‖v‖c₁/‖c‖ ≥ t ]
  = Pr[ ‖v‖c₁/‖c‖ ≥ t | r ≤ 100 ] Pr[r ≤ 100] + Pr[ ‖v‖c₁/‖c‖ ≥ t | r > 100 ] Pr[r > 100],

it is sufficient to lower bound Pr[r ≤ 100] and Pr[ ‖v‖c₁/‖c‖ ≥ t | r ≤ 100 ]. The former probability is easily seen to be lower bounded by a constant by a Chernoff bound. Consider the latter one next. It holds that

Pr[ ‖v‖c₁/‖c‖ ≥ t | r ≤ 100 ] = Pr[ c₁ ≥ √( t²r/(‖v‖² − t²) ) | r ≤ 100 ]
  ≥ Pr[ c₁ ≥ √( 100t²/(‖v‖² − t²) ) ].

Denoting t̃ = √( 100t²/(‖v‖² − t²) ), by a well-known Gaussian tail bound it follows that

Pr[ c₁ ≥ t̃ ] = e^{−O(dt̃²)} ( 1/(√d t̃) − (1/(√d t̃))³ ) = e^{−O(t²)},

where the last equality holds since ‖v‖ = Ω(√d) and t = ω(1).
Lemma 7.6.3. If c ∼ C_d, v ∈ R^d is a vector with ‖v‖ = Θ(√d), and t = ω(1), then the random variable z = 〈v, c〉 satisfies

E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(t²)) + exp(−Ω(d)).
Proof. Similarly as in Lemma 7.6.2, if c = (c₁, c₂, …, c_d) ∼ C_d, then c is equal in distribution to (c₁/‖c‖, c₂/‖c‖, …, c_d/‖c‖), where the c_i are i.i.d. samples from a univariate Gaussian with mean 0 and variance 1/d. Again, by spherical symmetry, we may assume v = (‖v‖, 0, …, 0). Let us introduce the random variable r = Σ_{i=2}^d c_i². Then, for an arbitrary u > 1, some algebraic manipulation shows

Pr[ exp(〈v, c〉)1_{[t,+∞)}(〈v, c〉) ≥ u ] = Pr[ exp(〈v, c〉) ≥ u ∧ 〈v, c〉 ≥ t ]
  = Pr[ exp(‖v‖c₁/‖c‖) ≥ u ∧ ‖v‖c₁/‖c‖ ≥ t ]
  = Pr[ c₁ ≥ max( √( ū²r/(‖v‖² − ū²) ), √( t²r/(‖v‖² − t²) ) ) ],   (7.6.21)

where we denote ū = log u. Since c₁ is a mean-0 univariate Gaussian with variance 1/d and ‖v‖ = Ω(√d), conditioned on r we have, for all x,

Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) ] = O( e^{−Ω(x²r)} ).

Next, we show that r is lower bounded by a constant with probability 1 − exp(−Ω(d)). Indeed, r is equal in distribution to (1/d)χ²_{d−1}, where χ²_k is a chi-squared distribution with k degrees of freedom. Standard concentration bounds [104] imply that for all ξ ≥ 0, Pr[ r − 1 ≤ −2√(ξ/d) ] ≤ exp(−ξ). Taking ξ = αd for a constant α implies that with probability 1 − exp(−Ω(d)), r ≥ M for some constant M. We can now rewrite

Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) ]
  = Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) | r ≥ M ] Pr[r ≥ M] + Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) | r < M ] Pr[r < M].

The first term is clearly bounded by e^{−Ω(x²)} and the second by exp(−Ω(d)). Therefore,

Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) ] = O( max( exp(−Ω(x²)), exp(−Ω(d)) ) ).   (7.6.22)
Putting (7.6.21) and (7.6.22) together, we get

Pr[ exp(〈v, c〉)1_{[t,+∞)}(〈v, c〉) ≥ u ] = O( exp( −Ω( min( d, (max(ū, t))² ) ) ) )   (7.6.23)

(where again we denote ū = log u).

For any random variable X with non-negative support, it is easy to check that

E[X] = ∫₀^∞ Pr[X ≥ x] dx.

Hence,

E[ exp(z)1_{[t,+∞)}(z) ] = ∫₀^∞ Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du
  = ∫₀^{exp(‖v‖)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du.

To bound this integral, we split into the following two cases:

• Case t² ≥ d: max(ū, t) ≥ t, so min(d, (max(ū, t))²) = d. Hence, (7.6.23) implies

E[ exp(z)1_{[t,+∞)}(z) ] ≤ exp(‖v‖) exp(−Ω(d)) = exp(−Ω(d)),

where the last equality follows since ‖v‖ = O(√d).

• Case t² < d: we split the integral into two portions: u ∈ [0, exp(t)] and u ∈ [exp(t), exp(‖v‖)].

When u ∈ [0, exp(t)], max(ū, t) = t, so min(d, (max(ū, t))²) = t². Hence,

∫₀^{exp(t)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du ≤ exp(t) exp(−Ω(t²)) = exp(−Ω(t²)).

When u ∈ [exp(t), exp(‖v‖)], max(ū, t) = ū. But ū ≤ log(exp(‖v‖)) = O(√d), so min(d, (max(ū, t))²) = ū². Hence,

∫_{exp(t)}^{exp(‖v‖)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du ≤ ∫_{exp(t)}^{exp(‖v‖)} exp(−Ω((log u)²)) du.

Making the change of variables ū = log u, we can rewrite the last integral as

∫_t^{‖v‖} exp(−Ω(ū²)) exp(ū) dū = O(exp(−Ω(t²))),

where the last equality is the usual Gaussian tail bound.

In either case, we get that

∫₀^{exp(‖v‖)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du = exp(−Ω(t²)) + exp(−Ω(d)),

which is what we want.
As a corollary to the above lemma, we get the following:
Corollary 7.6.4. If c ∼ C_d, v ∈ R^d is a vector with ‖v‖ = Θ(√d), and t = Ω(log^{0.9} n), then

E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} n)).
Proof. We claim the proof is trivial if d = o(log⁴ n). Indeed, in this case, exp(〈v, c〉) ≤ exp(‖v‖) = exp(O(√d)). Hence,

E[ exp(z)1_{[t,+∞)}(z) ] ≤ exp(O(√d)) E[1_{[t,+∞)}(z)] = exp(O(√d)) Pr[z ≥ t].

Since by Lemma 7.6.2, Pr[z ≥ t] ≤ exp(−Ω(log² n)), we get

E[ exp(z)1_{[t,+∞)}(z) ] = exp( O(√d) − Ω(log² n) ) = exp(−Ω(log^{1.8} n)),

as we wanted.

So, we may without loss of generality assume that d = Ω(log⁴ n). In this case, Lemma 7.6.3 implies

E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} n)) + exp(−Ω(d)) = exp(−Ω(log^{1.8} n)),

where the last equality holds because d = Ω(log⁴ n) and t² = Ω(log^{1.8} n), so we get the claim we wanted.
Lemma 7.6.5 (Continuous Abel's Inequality). Let 0 ≤ r(x) ≤ 1 be a function such that E[r(x)] = ρ. Moreover, suppose the increasing function u(x) satisfies E[|u(x)|] < ∞. Let t be the real number such that E[1_{[t,+∞)}(x)] = ρ. Then we have

E[u(x)r(x)] ≤ E[ u(x)1_{[t,+∞)}(x) ].   (7.6.24)
Proof. Let f denote the density of x. Let G(z) = ∫_z^∞ f(x)r(x) dx and H(z) = ∫_z^∞ f(x)1_{[t,+∞)}(x) dx. Then we have G(z) ≤ H(z) for all z. Indeed, for z ≥ t this is trivial since r(x) ≤ 1; for z ≤ t, we have H(z) = E[1_{[t,+∞)}(x)] = ρ = E[r(x)] ≥ ∫_z^∞ f(x)r(x) dx = G(z). Then, by integration by parts, we have

∫_{−∞}^∞ u(x)f(x)r(x) dx = −∫_{−∞}^∞ u(x) dG
  = −u(x)G(x) |_{−∞}^{∞} + ∫_{−∞}^{+∞} G(x)u′(x) dx
  ≤ ∫_{−∞}^{+∞} H(x)u′(x) dx
  = ∫_{−∞}^∞ u(x)f(x)1_{[t,+∞)}(x) dx,

where at the third line we use the facts that u(x)G(x) → 0 as x → ±∞ and that u′(x) ≥ 0, and at the last line we integrate by parts again.
Lemma 7.6.6. Let v ∈ R^d be a fixed vector with norm ‖v‖ ≤ κ√d for an absolute constant κ. Then for the random variable c with uniform distribution over the sphere, we have

log E[exp(〈v, c〉)] = ‖v‖²/(2d) ± ε,   (7.6.25)

where ε = O(1/d).
Proof. Let g ∼ N(0, I); then g/‖g‖ has the same distribution as c. Let r = ‖v‖. Since c is spherically symmetric, we can assume without loss of generality that v = (r, 0, …, 0). Let x = g₁ and y = √(g₂² + ⋯ + g_d²). Then x ∼ N(0, 1), and y² has a χ² distribution with mean d − 1 and variance O(d).

Let F be the event that x ≤ 20 log d and 1.5√d ≥ y ≥ 0.5√d. Note that Pr[F] ≥ 1 − exp(−Ω(log^{1.8} d)). By Proposition 7.6.4, we have that

E[exp(〈v, c〉)] = E[exp(〈v, c〉) | F] · (1 ± exp(−Ω(log^{1.8} d))).

Conditioned on the event F, we have

E[exp(〈v, c〉) | F] = E[ exp( rx/√(x² + y²) ) | F ]
  = E[ exp( rx/y − rx³/( y√(x² + y²)(y + √(x² + y²)) ) ) | F ]
  = E[ exp(rx/y) · exp( −rx³/( y√(x² + y²)(y + √(x² + y²)) ) ) | F ]
  = E[ exp(rx/y) | F ] · (1 ± O(log³ d / d)),   (7.6.26)

where we used the fact that r ≤ κ√d. Let E be the event that 1.5√d ≥ y ≥ 0.5√d. Using Proposition 7.6.3, we have that

E[exp(rx/y) | F] = E[exp(rx/y) | E] ± exp(−Ω(log² d)).   (7.6.27)

Then let z = y²/(d − 1) and w = z − 1, so that z has mean 1 and variance O(1/d), and w has mean 0 and variance O(1/d). Conditioned on y, the variable x is a standard Gaussian, so E[exp(rx/y) | y] = exp(r²/(2y²)). Therefore,

E[ exp(rx/y) | E ] = E[ E[exp(rx/y) | y] | E ] = E[ exp(r²/(2y²)) | E ]
  = E[ exp( (r²/(2(d − 1))) · (1/z) ) | E ]
  = E[ exp( (r²/(2(d − 1))) · (1 − w + w²/(1 + w)) ) | E ]
  = exp( r²/(2(d − 1)) ) · E[ exp( (r²/(2(d − 1))) · (−w + w²/(1 + w)) ) | E ]
  = exp( r²/(2(d − 1)) ) · E[ 1 − O(w) ± O(w²) | E ]
  = exp( r²/(2(d − 1)) ) · (1 ± O(1/d)),

where the second-to-last line uses the facts that, conditioned on E, w is bounded below by a constant greater than −1 and r²/(2(d − 1)) = O(1), so the Taylor expansion approximates the exponential accurately, and the last line uses the facts that |E[w | E]| = O(1/d) and E[w² | E] = O(1/d). Finally, r²/(2(d − 1)) = r²/(2d) ± O(1/d) since r ≤ κ√d. Combining the series of approximations above completes the proof.
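Lemma 7.6.6 lends itself to a direct numerical check. The sketch below (toy dimension and sample size, chosen for speed) estimates E[exp(〈v, c〉)] for c uniform on the unit sphere by Monte-Carlo:

```python
import math
import random

def mean_exp_inner(v, trials=20000, seed=1):
    """Monte-Carlo estimate of E[exp(<v, c>)] for c uniform on the unit
    sphere in R^d (c is generated as a normalized Gaussian vector)."""
    rng = random.Random(seed)
    d = len(v)
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        total += math.exp(sum(vi * gi / norm for vi, gi in zip(v, g)))
    return total / trials
```

For v with ‖v‖² = 25 in d = 100 dimensions, the lemma predicts log E[exp(〈v, c〉)] ≈ 25/200 = 0.125 up to O(1/d).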
We finally provide the proofs of a few helper propositions on conditional expectations given high-probability events, used in the lemma above.
Proposition 7.6.3. Suppose x ∼ N(0, σ²) with σ = O(1). Then for any event E with Pr[E] ≥ 1 − exp(−Ω(log² d)), we have that E[exp(x)] = E[exp(x) | E] ± exp(−Ω(log² d)).
Proof. Let Ē denote the complement of the event E. We will consider the upper and lower bounds separately. Since

E[exp(x)] = E[exp(x) | E] Pr[E] + E[exp(x) | Ē] Pr[Ē],

we have

E[exp(x)] ≤ E[exp(x) | E] + E[exp(x) | Ē] Pr[Ē]   (7.6.28)

and

E[exp(x)] ≥ E[exp(x) | E] (1 − exp(−Ω(log² d))) ≥ E[exp(x) | E] − E[exp(x) | E] exp(−Ω(log² d)).   (7.6.29)
Consider the upper bound (7.6.28) first. To show the statement of the lemma, it suffices to bound E[exp(x) | Ē] Pr[Ē]. Working towards that, notice that

E[exp(x) | Ē] Pr[Ē] = E[exp(x)1_Ē] = E[ exp(x) E[1_Ē | x] ] = E[exp(x)r(x)],

where we denote r(x) = E[1_Ē | x]. We wish to upper bound E[exp(x)r(x)]. By Lemma 7.6.5, we have

E[exp(x)r(x)] ≤ E[ exp(x)1_{[t,∞)}(x) ],

where t is such that E[1_{[t,∞)}(x)] = E[r(x)]. However, since E[r(x)] = Pr[Ē] = exp(−Ω(log² d)), it must be the case that t = Ω(log d), by the standard Gaussian tail bound and the assumption that σ = O(1). In turn, this means

E[ exp(x)1_{[t,∞)}(x) ] = (1/(σ√(2π))) ∫_t^∞ e^x e^{−x²/(2σ²)} dx
  = e^{σ²/2} · (1/(σ√(2π))) ∫_t^∞ e^{−(x − σ²)²/(2σ²)} dx
  = e^{σ²/2} · Pr[x′ ≥ t],

where x′ is distributed as a univariate Gaussian with mean σ² and variance σ². Bearing in mind that σ = O(1),

e^{σ²/2} Pr[x′ ≥ t] = exp(−Ω(t²)) = exp(−Ω(log² d))

by the usual Gaussian tail bound, which proves the bound we need for (7.6.28).
We proceed to consider the lower bound (7.6.29). To show the statement of the lemma, we will bound E[exp(x) | E]. Notice trivially that since exp(x) ≥ 0,

E[exp(x) | E] ≤ E[exp(x)] / Pr[E].

Since Pr[E] ≥ 1 − exp(−Ω(log² d)), we have 1/Pr[E] ≤ 1 + O(exp(−Ω(log² d))). So it suffices to bound E[exp(x)]. However,

E[exp(x)] = (1/(σ√(2π))) ∫_{−∞}^{+∞} e^x e^{−x²/(2σ²)} dx
  = e^{σ²/2} · (1/(σ√(2π))) ∫_{−∞}^{+∞} e^{−(x − σ²)²/(2σ²)} dx
  = e^{σ²/2} = O(1),

since the last integrand is a Gaussian density, which integrates to 1. Putting this together with the estimate of 1/Pr[E], we get that E[exp(x) | E] = O(1). Plugging this back into (7.6.29), we get the desired lower bound.
Proposition 7.6.4. Suppose c ∼ C and v is an arbitrary vector with ‖v‖ = O(√d). Then for any event E with Pr[E] ≥ 1 − exp(−Ω(log² d)), we have that E[exp(〈v, c〉)] = E[exp(〈v, c〉) | E] ± exp(−Ω(log^{1.8} d)).
Proof of Proposition 7.6.4. Let z = 〈v, c〉. We proceed similarly as in the proof of Proposition 7.6.3. We have

E[exp(z)] = E[exp(z) | E] Pr[E] + E[exp(z) | Ē] Pr[Ē],

and hence

E[exp(z)] ≤ E[exp(z) | E] + E[exp(z) | Ē] Pr[Ē]   (7.6.30)

and

E[exp(z)] ≥ E[exp(z) | E] Pr[E] ≥ E[exp(z) | E] − E[exp(z) | E] exp(−Ω(log² d)).   (7.6.31)

We again proceed by separating the upper and lower bounds.

We first consider the upper bound (7.6.30). Notice that

E[exp(z) | Ē] Pr[Ē] = E[exp(z)1_Ē].

We can split the last expression as

E[ exp(〈v, c〉)1_{〈v,c〉>0}1_Ē ] + E[ exp(〈v, c〉)1_{〈v,c〉<0}1_Ē ].

The second term is upper bounded by

E[1_Ē] ≤ exp(−Ω(log² d)).

We proceed to the first term and observe the following property of it:

E[ exp(〈v, c〉)1_{〈v,c〉>0}1_Ē ] ≤ E[ exp(〈αv, c〉)1_{〈v,c〉>0}1_Ē ] ≤ E[ exp(〈αv, c〉)1_Ē ],

where α > 1. Therefore, it is sufficient to bound

E[exp(z)1_Ē]

when ‖v‖ = Θ(√d). Let us denote r(z) = E[1_Ē | z]. Using Lemma 7.6.5, we have that

E_c[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ],   (7.6.32)

where t satisfies E_c[1_{[t,+∞)}(z)] = Pr[z ≥ t] = E_c[r(z)] ≤ exp(−Ω(log² d)). We then claim that Pr[z ≥ t] ≤ exp(−Ω(log² d)) implies that t ≥ Ω(log^{0.9} d). Indeed, this follows by directly applying Lemma 7.6.2. Afterward, applying Lemma 7.6.3, we have:

E[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} d)),   (7.6.33)

which proves the upper bound we want.

We now proceed to the lower bound (7.6.31), which is again similar to the lower bound in the proof of Proposition 7.6.3: we just need to bound E[exp(z) | E]. As in Proposition 7.6.3, since exp(z) ≥ 0,

E[exp(z) | E] ≤ E[exp(z)] / Pr[E].

Consider the event E′ : z ≤ t, for t = Θ(log^{0.9} d), which by Lemma 7.6.2 satisfies Pr[E′] ≥ 1 − exp(−Ω(log² d)). By the upper bound we just showed,

E[exp(z)] ≤ E[exp(z) | E′] + exp(−Ω(log² d)) = O(exp(log^{0.9} d)),

where the last equality follows since conditioned on E′, z = O(log^{0.9} d). Finally, this implies

E[exp(z) | E] ≤ (1/Pr[E]) · O(exp(log^{0.9} d)) = O(exp(log^{0.9} d)),

where the last equality follows since Pr[E] ≥ 1 − exp(−Ω(log² d)). Putting this together with (7.6.31), we get

E[exp(z)] ≥ E[exp(z) | E] Pr[E] ≥ E[exp(z) | E] − E[exp(z) | E] exp(−Ω(log² d))
  ≥ E[exp(z) | E] − O(exp(log^{0.9} d)) exp(−Ω(log² d)) ≥ E[exp(z) | E] − exp(−Ω(log² d)),

which is what we needed.
7.7 Maximum Likelihood Estimator for Co-occurrence
Let L be the corpus size, and X_{w,w′} the number of times words w, w′ co-occur within a context of size 10 in the corpus. According to the model, the probability of this event at any particular time satisfies log p(w,w′) ∝ ‖v_w + v_{w′}‖²₂. Successive samples from a random walk are not independent, of course, but if the random walk mixes fairly quickly (and the mixing time of our random walk is related to the logarithm of the number of words), then the set of X_{w,w′}'s over all word pairs is distributed, up to a very close approximation, as a multinomial distribution Mul(L̃, {p(w,w′)}), where L̃ = Σ_{w,w′} X_{w,w′} is the total number of word pairs in consideration (roughly 10L).
Assuming this approximation, we show below that the maximum likelihood values for the word vectors correspond to the following optimization:

min_{{v_w}, C}  Σ_{w,w′} X_{w,w′} ( log(X_{w,w′}) − ‖v_w + v_{w′}‖²₂ − C )²   (Objective SN)
Now we give the derivation of the objective. According to the multinomial distribution, maximizing the likelihood of {X_{w,w′}} is equivalent to maximizing

ℓ = log( Π_{(w,w′)} p(w,w′)^{X_{w,w′}} ) = Σ_{(w,w′)} X_{w,w′} log p(w,w′).
To reason about the likelihood, denote the logarithm of the ratio between the expected count and the empirical count by

Δ_{w,w′} = log( L̃ p(w,w′) / X_{w,w′} ).

Note that

ℓ = Σ_{(w,w′)} X_{w,w′} log p(w,w′)
  = Σ_{(w,w′)} X_{w,w′} [ log( X_{w,w′}/L̃ ) + log( L̃ p(w,w′)/X_{w,w′} ) ]
  = Σ_{(w,w′)} X_{w,w′} log( X_{w,w′}/L̃ ) + Σ_{(w,w′)} X_{w,w′} log( L̃ p(w,w′)/X_{w,w′} )
  = c + Σ_{(w,w′)} X_{w,w′} Δ_{w,w′},   (7.7.1)

where we let c denote the constant Σ_{(w,w′)} X_{w,w′} log( X_{w,w′}/L̃ ). Furthermore, we have

L̃ = Σ_{(w,w′)} L̃ p(w,w′)
  = Σ_{(w,w′)} X_{w,w′} e^{Δ_{w,w′}}
  = Σ_{(w,w′)} X_{w,w′} ( 1 + Δ_{w,w′} + Δ_{w,w′}²/2 + O(|Δ_{w,w′}|³) ),

and also L̃ = Σ_{(w,w′)} X_{w,w′}. So

Σ_{(w,w′)} X_{w,w′} Δ_{w,w′} = −Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}²/2 + Σ_{(w,w′)} X_{w,w′} O(|Δ_{w,w′}|³).
Plugging this into (7.7.1) leads to

c − ℓ = Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}²/2 + Σ_{(w,w′)} X_{w,w′} O(|Δ_{w,w′}|³).   (7.7.2)

When the last term is much smaller than the first term on the right hand side, maximizing the likelihood is approximately equivalent to minimizing the first term on the right hand side, which is our objective:

Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}² ≈ Σ_{(w,w′)} X_{w,w′} ( ‖v_w + v_{w′}‖²₂/(2d) − log X_{w,w′} + log L̃ − 2 log Z )²,

where Z is the partition function.
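The objective is simple to write down in code. The following is a minimal sketch of Objective SN on toy data (hypothetical `vecs` and `counts`; a real implementation would fit the vectors and the constant C by stochastic gradient descent over corpus co-occurrence counts):

```python
import math

def sn_objective(vecs, C, counts):
    """Objective SN: sum over pairs of X * (log X - ||v_w + v_w'||^2 - C)^2.
    `vecs` maps words to vectors; `counts` maps pairs (w, w') to X_{w,w'}."""
    total = 0.0
    for (w1, w2), x in counts.items():
        sq_norm = sum((a + b) ** 2 for a, b in zip(vecs[w1], vecs[w2]))
        total += x * (math.log(x) - sq_norm - C) ** 2
    return total
```

If the counts exactly satisfy log X_{w,w′} = ‖v_w + v_{w′}‖² + C, the objective is zero.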
We now argue that the last term in (7.7.2) is much smaller than the first term on the right hand side. For a large X_{w,w′}, the quantity Δ_{w,w′} is close to 0, and thus the induced approximation error is small. Small X_{w,w′}'s only contribute a small fraction of the final objective, so we can safely ignore their errors. To see this, note that the objective Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}² and the error term Σ_{(w,w′)} X_{w,w′} O(|Δ_{w,w′}|³) differ by a factor of |Δ_{w,w′}| for each X_{w,w′}. For large X_{w,w′}'s, |Δ_{w,w′}| ≪ 1, and thus their corresponding errors are much smaller than the objective. So we only need to consider the X_{w,w′}'s that are small constants. The co-occurrence counts obey a power law distribution (see, e.g., [140]). That is, if one sorts the X_{w,w′} in decreasing order, then the r-th value in the list is roughly

x[r] = k/r^{5/4},

where k is some constant. Some calculation shows that

L̃ ≈ 4k,    Σ_{X_{w,w′} ≤ x} X_{w,w′} ≈ 4k^{4/5}x^{1/5},

and thus when x is a small constant,

Σ_{X_{w,w′} ≤ x} X_{w,w′} / L̃ ≈ (4x/L̃)^{1/5} = O(1/L̃^{1/5}).

So only a negligible mass of the X_{w,w′}'s are small constants, and this mass vanishes as L̃ increases. Furthermore, we empirically observe that the relative error of our objective is about 5%, which means that the errors induced by the X_{w,w′}'s that are small constants are only a small fraction of the objective. Therefore, Σ_{w,w′} X_{w,w′} O(|Δ_{w,w′}|³) is small compared to the objective and can be safely ignored.
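The power-law calculation above can be checked numerically. The sketch below uses synthetic counts x[r] = k/r^{5/4} over a truncated rank range (so the agreement with the asymptotic formula is only up to a modest constant factor):

```python
def powerlaw_small_count_mass(k=1.0e4, r_max=1000000, x=3.0):
    """With synthetic counts x[r] = k / r^(5/4), return (a) the fraction of
    the total count mass coming from counts <= x, and (b) the prediction
    (4x / total)^(1/5) from the text."""
    total = 0.0
    small = 0.0
    for r in range(1, r_max + 1):
        v = k / r ** 1.25
        total += v
        if v <= x:
            small += v
    return small / total, (4.0 * x / total) ** 0.2
```

For these parameters the empirical small-count mass fraction and the predicted O(1/L̃^{1/5}) value agree to within a constant factor close to 1.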
7.8 Conclusions
A simple generative model has been introduced to explain the classical PMI based
word embedding models, as well as recent variants involving energy-based models and
matrix factorization. The model yields an optimization objective with essentially “no
knobs to turn”, yet the embeddings lead to good performance on analogy tasks, and
fit other predictions of our generative model. A model with fewer knobs to turn should
be seen as a better scientific explanation (Occam’s razor), and certainly makes the
embeddings more interpretable.
The spatial isotropy of word vectors is both an assumption in our model, and also
a new empirical finding of our paper. We feel it may help with further development
of language models. It is important for explaining the success of solving analogies via
low dimensional vectors (relations=lines). It also implies that semantic relation-
262
ships among words manifest themselves as special directions among word embeddings
(Section 7.4), which lead to a cheater algorithm for solving analogy testbeds.
Our model is tailored to capturing semantic similarity, more akin to a log-linear
dynamic topic model. In particular, local word order is unimportant. Designing
similar generative models (with provable and interpretable properties) with linguistic
features is left for future work.
Chapter 8
Mathematical Tools
In this chapter, we collect mathematical tools that are used in this thesis and may be of independent interest. The concentration inequalities and spectral perturbation bounds in Section 8.1 and Section 8.2 appear in many other machine learning and statistical settings. In particular, Corollary 8.1.4 and Theorem 8.2.5 are straightforward extensions of the matrix Bernstein inequality and Wedin's theorem, but they are user-friendly in many machine learning settings.
8.1 Concentration Inequalities
This section contains a collection of known technical results which are useful in proving
the concentration bounds in various chapters of this thesis.
8.1.1 Hoeffding’s inequality and Bernstein’s inequalities
Theorem 8.1.1 (Bernstein Inequality [28], cf. [27]). Let X₁, …, X_n be independent real-valued variables with finite variances σ_i² = V[X_i], bounded in the sense that |X_i − E[X_i]| ≤ M. Let σ² = Σ_i σ_i². Then we have

Pr[ | Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] | > t ] ≤ 2 exp( −t² / (2σ² + (2/3)Mt) ).

As a consequence, for any d ≥ 1 and C > 0, we have that with probability at least 1 − d^{−C},

| Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] | ≲ CM log d + σ√(C log d).   (8.1.1)
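As a quick numerical illustration (a toy simulation, not part of the original text), one can compare the empirical tail of a sum of bounded variables with the Bernstein bound:

```python
import math
import random

def bernstein_bound(t, sigma2, M):
    """Right-hand side of the Bernstein tail bound above."""
    return 2.0 * math.exp(-t * t / (2.0 * sigma2 + (2.0 / 3.0) * M * t))

def empirical_tail(t, n=200, reps=5000, seed=0):
    """Fraction of runs with |sum_i X_i| > t for X_i uniform on [-1, 1]
    (so E[X_i] = 0, M = 1 and sigma_i^2 = 1/3)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        if abs(s) > t:
            hits += 1
    return hits / reps
```

At roughly three standard deviations the empirical tail sits comfortably below the Bernstein bound, as expected.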
Up to constant factors, Hoeffding's inequality can be seen as a corollary of Bernstein's inequality when the variance term $\sigma^2$ is bounded trivially by $nM^2$ using the uniform bound $M$.
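The tail bound (8.1.1) is easy to probe empirically. The following sketch (our illustration, not part of the thesis; the Rademacher distribution, the threshold $t$, and the sample sizes are arbitrary choices) compares the empirical tail of a sum of $\pm 1$ variables, for which $M = 1$ and $\sigma^2 = n$, against the Bernstein bound:

```python
import numpy as np

# Empirical sanity check of Bernstein's inequality (Theorem 8.1.1) for
# centered, bounded variables: X_i uniform on {-1, +1} (Rademacher),
# so M = 1 and sigma_i^2 = 1, hence sigma^2 = n.
rng = np.random.default_rng(0)
n, trials, t = 100, 20000, 25.0

sums = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)
empirical_tail = np.mean(np.abs(sums) > t)

M, sigma2 = 1.0, float(n)
bernstein_bound = 2 * np.exp(-t**2 / (2 * sigma2 + (2.0 / 3.0) * M * t))

print(empirical_tail, bernstein_bound)
assert empirical_tail <= bernstein_bound  # the bound dominates the empirical tail
```

The bound is loose here, as expected: Bernstein's inequality holds for every bounded distribution, not just the Rademacher one.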
Theorem 8.1.2 (Hoeffding's inequality [85]). Let $X_1, \ldots, X_n$ be independent real-valued random variables whose fluctuations are bounded in the sense that $|X_i - \mathbb{E}[X_i]| \le M$ almost surely. Then we have
$$\Pr\left[\left|\sum_{i=1}^n X_i - \mathbb{E}\left[\sum_{i=1}^n X_i\right]\right| > t\right] \le \exp\left(-\frac{t^2}{2M^2 n}\right)\,.$$
Theorem 8.1.3 (Matrix Bernstein Inequality [90], cf. [167]). Let $X_1, \ldots, X_n$ be independent matrix random variables with common dimension $d_1 \times d_2$. We assume that
$$\mathbb{E}[X_i] = 0, \quad \text{and} \quad \|X_i\| \le M \text{ a.s.}, \quad \forall i \in [n]\,. \quad (8.1.2)$$
Define $\sigma > 0$ by
$$\sigma^2 = \max\left\{\left\|\sum_{i=1}^n X_i X_i^\top\right\|, \left\|\sum_{i=1}^n X_i^\top X_i\right\|\right\}\,. \quad (8.1.3)$$
Then, we have that
$$\Pr\left[\left\|\sum_{i=1}^n X_i\right\| > t\right] \le (d_1 + d_2)\exp\left(-\frac{t^2}{2\sigma^2 + \frac{2}{3}Mt}\right)\,. \quad (8.1.4)$$
As a consequence, for any $d \ge 1$ and $C > 0$, we have that with probability at least $1 - d^{-C}$,
$$\left\|\sum_{i=1}^n X_i\right\| \lesssim CM\log d + \sigma\sqrt{C\log d}\,. \quad (8.1.5)$$
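As with the scalar case, the matrix bound can be sanity-checked numerically. The sketch below (ours, not from the thesis; the choice $X_i = \epsilon_i A_i$ with Rademacher signs $\epsilon_i$ and fixed Gaussian matrices $A_i$ is only an illustrative instance) computes $M$, $\sigma$, and the right-hand side of (8.1.5) with the hidden constant set to 1, and compares it with $\|\sum_i X_i\|$ on one draw:

```python
import numpy as np

# Illustration of the matrix Bernstein inequality (Theorem 8.1.3), with
# X_i = eps_i * A_i for Rademacher signs eps_i and fixed matrices A_i,
# so that E[X_i] = 0 and ||X_i|| <= M = max_i ||A_i||.
rng = np.random.default_rng(1)
n, d1, d2, C = 50, 8, 6, 2.0

A = rng.standard_normal((n, d1, d2))
M = max(np.linalg.norm(A[i], 2) for i in range(n))
# sigma^2 as in (8.1.3); for X_i = eps_i A_i, X_i X_i^T = A_i A_i^T deterministically.
sigma2 = max(np.linalg.norm(sum(A[i] @ A[i].T for i in range(n)), 2),
             np.linalg.norm(sum(A[i].T @ A[i] for i in range(n)), 2))
sigma = np.sqrt(sigma2)

d = d1 + d2
# Right-hand side of (8.1.5), read with the hidden constant set to 1.
bound = C * M * np.log(d) + sigma * np.sqrt(C * np.log(d))

eps = rng.choice([-1.0, 1.0], size=n)
op_norm = np.linalg.norm(np.tensordot(eps, A, axes=1), 2)
print(op_norm, bound)
assert op_norm <= bound  # holds with high probability, not deterministically
```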
Corollary 8.1.4. Let $X_1, \ldots, X_n$ be independent matrix random variables with common dimension $d_1 \times d_2$. We assume that they are zero centered, that is, $\mathbb{E}[X_i] = 0$ for all $i \in [n]$. Suppose there exist $M > 0$, $\varepsilon \in [0,1]$, and $\delta > 0$ such that
$$\Pr\left[\|X_i\| \ge M\right] \le \varepsilon, \quad \text{and} \quad \left\|\mathbb{E}\left[X_i \mathbf{1}(\|X_i\| \ge M)\right]\right\| \le \delta, \quad \forall i \in [n]\,. \quad (8.1.6)$$
Define $\sigma > 0$ by
$$\sigma^2 = \max\left\{\left\|\sum_{i=1}^n X_i X_i^\top\right\|, \left\|\sum_{i=1}^n X_i^\top X_i\right\|\right\}\,. \quad (8.1.7)$$
Then, for any $d \ge 1$ and $C > 0$, we have that with probability at least $1 - d^{-C} - n\varepsilon$,
$$\left\|\sum_{i=1}^n X_i\right\| \lesssim CM\log d + \sigma\sqrt{C\log d} + n\delta\,. \quad (8.1.8)$$
Typically, $\varepsilon$ and $\delta$ in Corollary 8.1.4 will be chosen to be very small, so that the conclusion of Corollary 8.1.4 is not much different from that of Theorem 8.1.3.
Proof. Let $Z_i = X_i \mathbf{1}(\|X_i\| \le M)$. Then we have that
$$\max\left\{\left\|\sum_{i=1}^n Z_i Z_i^\top\right\|, \left\|\sum_{i=1}^n Z_i^\top Z_i\right\|\right\} \le \max\left\{\left\|\sum_{i=1}^n X_i X_i^\top\right\|, \left\|\sum_{i=1}^n X_i^\top X_i\right\|\right\} = \sigma^2,$$
and $\|Z_i\| \le M$ a.s. Applying the matrix Bernstein inequality (Theorem 8.1.3) to $\sum_i Z_i$ gives that with probability at least $1 - d^{-C}$,
$$\left\|\sum_{i=1}^n Z_i - \mathbb{E}\left[\sum_i Z_i\right]\right\| \lesssim CM\log d + \sigma\sqrt{C\log d}\,. \quad (8.1.9)$$
Note that $\mathbb{E}\left[\sum_i Z_i\right] = -\mathbb{E}\left[\sum_i X_i \mathbf{1}(\|X_i\| \ge M)\right]$, and therefore by equation (8.1.6) we have that $\left\|\mathbb{E}\left[\sum_i Z_i\right]\right\| \le n\delta$. Therefore, by the triangle inequality, we have that
$$\left\|\sum_{i=1}^n Z_i\right\| \le \left\|\sum_{i=1}^n Z_i - \mathbb{E}\left[\sum_i Z_i\right]\right\| + n\delta \lesssim CM\log d + \sigma\sqrt{C\log d} + n\delta\,.$$
Note that with probability at least $1 - n\varepsilon$ we have $\|X_i\| \le M$, and thus $X_i = Z_i$, for all $i \in [n]$. Thus, by another union bound and the triangle inequality, we have that with probability at least $1 - d^{-C} - n\varepsilon$, equation (8.1.8) holds. $\square$
8.1.2 Sub-Gaussian and Sub-exponential Random Variables
In this subsection, we give the formal definitions of sub-Gaussian and sub-exponential random variables, which are used heavily in our analysis. We also summarize their properties.
Definition 8.1.5 (c.f. [171, Definition 2.1]). A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-Gaussian with variance proxy $\sigma^2$ if
$$\mathbb{E}\left[e^{\lambda(X-\mu)}\right] \le e^{\frac{\sigma^2\lambda^2}{2}}, \quad \forall \lambda \in \mathbb{R}\,. \quad (8.1.10)$$
Definition 8.1.6 (c.f. [171, Definition 2.2]). A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-exponential if there are non-negative parameters $(\nu, b)$ such that
$$\mathbb{E}\left[e^{\lambda(X-\mu)}\right] \le e^{\frac{\nu^2\lambda^2}{2}}, \quad \forall \lambda \text{ with } |\lambda| < 1/b\,.$$
A sum of independent sub-Gaussian random variables remains sub-Gaussian, with variance proxy equal to the sum of the variance proxies of the summands.
Lemma 8.1.7 (c.f. [171]). Suppose independent random variables $X_1, \ldots, X_n$ are sub-Gaussian with variance proxies $\sigma_1^2, \ldots, \sigma_n^2$ respectively. Then $X_1 + \cdots + X_n$ is a sub-Gaussian random variable with variance proxy $\sigma_*^2$, where $\sigma_* = \sqrt{\sum_{k\in[n]} \sigma_k^2}$.
A sum of independent sub-exponential random variables remains sub-exponential (with different parameters).
Lemma 8.1.8 (c.f. [171]). Suppose independent random variables $X_1, \ldots, X_n$ are sub-exponential with parameters $(\nu_1, b_1), \ldots, (\nu_n, b_n)$ respectively. Then $X_1 + \cdots + X_n$ is a sub-exponential random variable with parameters $(\nu_*, b_*)$, where $\nu_* = \sqrt{\sum_{k\in[n]} \nu_k^2}$ and $b_* = \max_k b_k$.
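Definition 8.1.5 and Lemma 8.1.7 can be verified exactly for Rademacher variables, whose moment generating function is $\cosh(\lambda) \le e^{\lambda^2/2}$ (so the variance proxy is $1$, and a sum of $n$ of them has variance proxy $n$). A short numerical check of these two facts (our illustration, not part of the thesis):

```python
import numpy as np

# Check Definition 8.1.5 and Lemma 8.1.7 on Rademacher variables:
# E[e^{lambda X}] = cosh(lambda) <= e^{lambda^2/2}, so X is sub-Gaussian with
# variance proxy 1, and a sum of n independent copies has variance proxy n.
lambdas = np.linspace(-5, 5, 101)
n = 7

single_mgf = np.cosh(lambdas)               # exact MGF of one Rademacher variable
single_bound = np.exp(lambdas**2 / 2)       # sub-Gaussian bound with sigma^2 = 1
assert np.all(single_mgf <= single_bound + 1e-12)

sum_mgf = np.cosh(lambdas) ** n             # MGF of the independent sum (product of MGFs)
sum_bound = np.exp(n * lambdas**2 / 2)      # Lemma 8.1.7: proxy sigma_*^2 = n
assert np.all(sum_mgf <= sum_bound + 1e-12)
print("MGF bounds verified")
```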
8.2 Spectral Perturbation Theorems
In this section, we list some standard spectral perturbation inequalities that are helpful in machine learning and for many results in this thesis. Most of them can be found in the book by Stewart and Sun [161].
Given $\widetilde{A} = A + E$ as a perturbed version of $A$, Weyl's theorem [174] and Mirsky's theorem [126] bound the perturbations of the individual singular values:
Theorem 8.2.1 (Weyl’s theorem [174] and Mirsky’s Theorem [126], c.f. [160]). Let
A ∈ Rm×n with m ≥ n and σk(·) denotes the k-th largest singular value. Suppose
A = A+ E. Then,
σk(A)− ‖E‖ ≤ σk(A) ≤ σk(A) + ‖E‖, ∀i = 1, . . . , n
268
Moreover,
n∑k=1
(σk(A)− σk(A)
)2
≤ ‖E‖2F .
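Since Theorem 8.2.1 is deterministic, it can be checked directly on random instances (our illustration, not part of the thesis; the dimensions and noise scale are arbitrary):

```python
import numpy as np

# Numerical check of Weyl's and Mirsky's theorems (Theorem 8.2.1):
# singular values move by at most ||E|| individually (Weyl), and by at
# most ||E||_F in the joint Euclidean sense (Mirsky).
rng = np.random.default_rng(0)
m, n = 9, 5
A = rng.standard_normal((m, n))
E = 0.1 * rng.standard_normal((m, n))
A_tilde = A + E

s = np.linalg.svd(A, compute_uv=False)
s_tilde = np.linalg.svd(A_tilde, compute_uv=False)

op = np.linalg.norm(E, 2)
fro = np.linalg.norm(E, 'fro')

assert np.all(np.abs(s_tilde - s) <= op + 1e-10)        # Weyl's theorem
assert np.sum((s_tilde - s) ** 2) <= fro**2 + 1e-10     # Mirsky's theorem
print("Weyl and Mirsky bounds hold")
```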
The singular vector perturbation is bounded by Wedin's theorem. Towards stating it, we first recall the definition of principal angles between subspaces (PABS) [101].

Definition 8.2.2. Suppose $\mathcal{X}$ and $\mathcal{Y}$ are two subspaces of $\mathbb{R}^n$ of dimension $p$ and $q$ respectively. Let $X \in \mathbb{R}^{n\times p}$ and $Y \in \mathbb{R}^{n\times q}$ be orthonormal bases of $\mathcal{X}$ and $\mathcal{Y}$ respectively. Then the vector of principal angles between the subspaces $\mathcal{X}$ and $\mathcal{Y}$, denoted by $\Theta(\mathcal{X}, \mathcal{Y})$, is the vector of dimension $m = \min(p, q)$ such that
$$\cos(\Theta(\mathcal{X}, \mathcal{Y})) = S(X^\top Y) \quad (8.2.1)$$
where $\cos(\cdot)$ is the entry-wise cosine function and $S(X^\top Y)$ denotes the list of singular values of $X^\top Y$ in decreasing order.
Remark 8.2.1. Other characterizations of principal angles between subspaces can be found, e.g., in [101]. When $p = q = 1$, the principal angle coincides with the angle between the two vectors.
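Definition 8.2.2 translates directly into code: take orthonormal bases, compute the singular values of $X^\top Y$, and apply $\arccos$ entry-wise. A sketch (ours, not part of the thesis; `principal_angles` is a hypothetical helper name):

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles between the column spans of orthonormal bases X, Y
    (Definition 8.2.2): arccos of the singular values of X^T Y."""
    s = np.linalg.svd(X.T @ Y, compute_uv=False)     # decreasing order
    return np.arccos(np.clip(s, -1.0, 1.0))          # angles, increasing order

rng = np.random.default_rng(0)
n = 6
X, _ = np.linalg.qr(rng.standard_normal((n, 3)))     # orthonormal basis, p = 3
Y, _ = np.linalg.qr(rng.standard_normal((n, 2)))     # orthonormal basis, q = 2

theta = principal_angles(X, Y)
assert theta.shape == (2,)                           # m = min(p, q) angles
assert np.all(theta >= -1e-12) and np.all(theta <= np.pi / 2 + 1e-12)

# p = q = 1 recovers the usual angle between two vectors (Remark 8.2.1).
u = np.array([[1.0], [0.0]])
v = np.array([[1.0], [1.0]]) / np.sqrt(2)
assert np.isclose(principal_angles(u, v)[0], np.pi / 4)
print("principal angles:", theta)
```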
Wedin’s Theorem bounds the principal angles between the subspaces of the singular
vectors of A and its perturbations. The following one is the original and strongest
form of the theorem. We will state a weaker version later that is more convenient for
machine learning applications.
Theorem 8.2.3 (Wedin’s Theorem [172]; c.f. [160]). Given matrices A,E ∈ Rm×n
with m ≥ n. Let A have the singular value decomposition
A = [U1, U2, U3]
Σ1 0
0 Σ2
0 0
[V1, V2]>. (8.2.2)
Let A = A+ E has singular vector decomposition
A = [U1, U2, U3]
Σ1 0
0 Σ2
0 0
[V1, V2]>. (8.2.3)
Let Φ = Θ(U1, U1) and Ψ = Θ(V1, V1).1 Suppose that there exists some δ > 0 such
that
mini,j|[Σ1]i,i − [Σ2]j,j| ≥ δ, and min
i,i|[Σ1]i,i| ≥ δ,
Let R = AV1 − U1Σ1 and S = A>U1 − V1Σ1. Then,
‖ sin(Φ)‖2 + ‖ sin(Ψ)‖2 ≤ ‖R‖2F + ‖S‖2
F
δ2≤ 2‖E‖2
F
δ2.
In many applications, we have a good understanding of the spectra of $A$ and $E$, but not of the spectrum of $\widetilde{A} = A + E$. Thus it would be ideal if the condition of the theorem involved only the spectra of $A$ and $E$. Such a theorem can be obtained as a straightforward extension of Weyl's theorem and Wedin's theorem.
¹We note that the notation used here is slightly different from that in [160]. Here $\Phi$ and $\Psi$ are vectors that contain the principal angles, and the norm $\|\cdot\|$ below is the Euclidean norm of the vector.
Theorem 8.2.4 (User-friendly version of Wedin's Theorem). Given matrices $A, E \in \mathbb{R}^{m\times n}$ with $m \ge n$, let $\widetilde{A} = A + E$, and let $\widetilde{A}$ and $A$ have the singular value decompositions (8.2.3) and (8.2.2) respectively. Suppose that
$$\min_i\,[\Sigma_1]_{i,i} - \max_i\,[\Sigma_2]_{i,i} \ge \delta > \|E\|\,.$$
Let $\Phi = \Theta(U_1, \widetilde{U}_1)$ and $\Psi = \Theta(V_1, \widetilde{V}_1)$. Then,
$$\|\sin(\Phi)\|^2 + \|\sin(\Psi)\|^2 \le \frac{2\|E\|_F^2}{(\delta - \|E\|)^2}\,.$$
Proof. Using Weyl’s Theorem, we have that mini[Σ1]i,i ≥ mini[Σ1]i,i − ‖E‖. Then
applying Wedin’s Theorem we complete the proof.
Remark 8.2.2. We remark that Theorem 8.2.4 is as strong as the original Wedin's theorem up to a universal constant factor. This is because when $\delta \ge 2\|E\|$, we have that $\|\sin(\Phi)\|^2 + \|\sin(\Psi)\|^2 \le 8\|E\|_F^2/\delta^2$; otherwise we can directly bound $\|\sin(\Phi)\|^2 + \|\sin(\Psi)\|^2$ by $2$.
The perturbation bounds above depend on the Frobenius norm of the matrix $E$. The following version of the theorem depends instead on the spectral norm of $E$.
Theorem 8.2.5 (User-friendly version of Wedin's Theorem with spectral norm). Given matrices $A, E \in \mathbb{R}^{m\times n}$ with $m \ge n$, let $\widetilde{A} = A + E$, and let $\widetilde{A}$ and $A$ have the singular value decompositions (8.2.3) and (8.2.2) respectively. Suppose that
$$\min_i\,[\Sigma_1]_{i,i} - \max_i\,[\Sigma_2]_{i,i} \ge \delta > \|E\|\,.$$
Let $\Phi = \Theta(U_1, \widetilde{U}_1)$ and $\Psi = \Theta(V_1, \widetilde{V}_1)$. Then,
$$\max\left\{\|\sin(\Phi)\|_\infty, \|\sin(\Psi)\|_\infty\right\} \le \frac{\|E\|}{\delta - \|E\|}\,.$$
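Theorem 8.2.4 can be checked on a concrete instance (our illustration, not part of the thesis; the singular values are chosen to create a clear gap $\delta$, and the helper name is ours):

```python
import numpy as np

def sin_principal_angles(U, U_tilde):
    # Sines of the principal angles between the column spans (Definition 8.2.2).
    s = np.linalg.svd(U.T @ U_tilde, compute_uv=False)
    return np.sqrt(np.clip(1.0 - s**2, 0.0, None))

rng = np.random.default_rng(0)
m, n, k = 8, 6, 2
# A with a clear gap between the top-k singular values (Sigma_1) and the rest (Sigma_2).
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U0 @ np.diag([10.0, 9.0, 1.0, 0.9, 0.8, 0.7]) @ V0.T
E = 0.01 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A)
Ut, st, Vtt = np.linalg.svd(A + E)

delta = s[k - 1] - s[k]                      # min Sigma_1 - max Sigma_2
op = np.linalg.norm(E, 2)
assert delta > op                            # condition of Theorem 8.2.4

lhs = (np.sum(sin_principal_angles(U[:, :k], Ut[:, :k]) ** 2)
       + np.sum(sin_principal_angles(Vt[:k].T, Vtt[:k].T) ** 2))
rhs = 2 * np.linalg.norm(E, 'fro') ** 2 / (delta - op) ** 2   # Theorem 8.2.4
assert lhs <= rhs
print(lhs, rhs)
```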
Perturbation bounds for the pseudo-inverse. When we have a lower bound on $\sigma_{\min}(A)$, it is easy to obtain bounds on the perturbation of the pseudo-inverse.
Theorem 8.2.6 ([161, Theorem 3.4]). Consider the perturbation $B = A + E$ of a matrix $A \in \mathbb{R}^{m\times n}$. Assume that $\mathrm{rank}(A) = \mathrm{rank}(B) = n$. Then
$$\|B^\dagger - A^\dagger\| \le \sqrt{2}\,\|A^\dagger\|\,\|B^\dagger\|\,\|E\|\,.$$
Note that this theorem is not strong enough when the perturbation is only known
to be τ -spectrally bounded in our definition.
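Theorem 8.2.6 is also easy to check numerically (our illustration, not part of the thesis; the dimensions and perturbation scale are arbitrary):

```python
import numpy as np

# Numerical check of Theorem 8.2.6: with rank(A) = rank(B) = n,
# ||B^+ - A^+|| <= sqrt(2) * ||A^+|| * ||B^+|| * ||E||.
rng = np.random.default_rng(0)
m, n = 7, 4
A = rng.standard_normal((m, n))
E = 0.05 * rng.standard_normal((m, n))
B = A + E
assert np.linalg.matrix_rank(A) == n and np.linalg.matrix_rank(B) == n

Ap, Bp = np.linalg.pinv(A), np.linalg.pinv(B)
lhs = np.linalg.norm(Bp - Ap, 2)
rhs = (np.sqrt(2) * np.linalg.norm(Ap, 2)
       * np.linalg.norm(Bp, 2) * np.linalg.norm(E, 2))
assert lhs <= rhs
print(lhs, rhs)
```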
8.3 Auxiliary Lemmas
In this section, we collect a few auxiliary mathematical lemmas that are used in
various sections of this thesis.
The following lemma, regarding the inverse of the controllable canonical form, is used in Section 6.2.
Lemma 8.3.1. Let $B = e_n \in \mathbb{R}^{n\times 1}$ and $w \in \mathbb{C}$. Suppose that $A$ with $\rho(A) \cdot |w| < 1$ has the controllable canonical form $A = C(a)$. Then
$$(I - wA)^{-1}B = \frac{1}{p_a(w^{-1})}\begin{bmatrix} w^{-1} \\ w^{-2} \\ \vdots \\ w^{-n} \end{bmatrix}$$
where $p_a(x) = x^n + a_1 x^{n-1} + \cdots + a_n$ is the characteristic polynomial of $A$.
Proof. Let $v = (I - wA)^{-1}B$; then we have $(I - wA)v = B$. Note that $B = e_n$, and $I - wA$ is of the form
$$I - wA = \begin{bmatrix} 1 & -w & 0 & \cdots & 0 \\ 0 & 1 & -w & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & -w \\ a_n w & a_{n-1} w & a_{n-2} w & \cdots & 1 + a_1 w \end{bmatrix}\,. \quad (8.3.1)$$
Therefore we obtain that $v_k = w v_{k+1}$ for $1 \le k \le n-1$. That is, $v_k = v_0 w^{-k}$ with $v_0 = v_1 w$. Using the fact that $((I - wA)v)_n = 1$, we obtain that $v_0 = p_a(w^{-1})^{-1}$, where $p_a(\cdot)$ is the polynomial $p_a(x) = x^n + a_1 x^{n-1} + \cdots + a_n$. Hence $v_k = w^{-k}/p_a(w^{-1})$ for every $k$, which is exactly the claimed formula. $\square$
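Lemma 8.3.1 can be verified numerically by building the companion matrix explicitly (our illustration, not part of the thesis; the coefficients $a$ and the value of $w$ are arbitrary choices subject to $\rho(A) \cdot |w| < 1$):

```python
import numpy as np

# Check Lemma 8.3.1: A in controllable canonical form has ones on the
# superdiagonal and last row (-a_n, ..., -a_1); B = e_n.
rng = np.random.default_rng(0)
n = 4
a = rng.uniform(-0.3, 0.3, size=n)            # coefficients a_1, ..., a_n
w = 0.5                                       # chosen so that rho(A) * |w| < 1

A = np.diag(np.ones(n - 1), k=1)
A[-1, :] = -a[::-1]                           # last row: -a_n, ..., -a_1
assert np.max(np.abs(np.linalg.eigvals(A))) * abs(w) < 1

B = np.zeros(n)
B[-1] = 1.0                                   # B = e_n

lhs = np.linalg.solve(np.eye(n) - w * A, B)   # (I - wA)^{-1} B

p_a = np.polyval(np.concatenate(([1.0], a)), 1.0 / w)   # p_a(w^{-1})
rhs = np.array([w ** (-k) for k in range(1, n + 1)]) / p_a

assert np.allclose(lhs, rhs)
print(lhs)
```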
The following elementary claim was useful in the proof of Lemma 6.2.5 in Section 6.2.2.
Claim 8.3.2. Suppose $x_1, \ldots, x_n$ are independent random variables with mean $0$ and covariance matrix $I_d$, and $U_1, \ldots, U_n$ are fixed matrices. Then
$$\mathbb{E}\left[\left\|\sum_{k=1}^n U_k x_k\right\|^2\right] = \sum_{k=1}^n \|U_k\|_F^2\,.$$
Proof. We have that
$$\mathbb{E}\left[\left\|\sum_{k=1}^n U_k x_k\right\|^2\right] = \sum_{k,\ell} \mathbb{E}\left[\operatorname{tr}\left(U_k x_k x_\ell^\top U_\ell^\top\right)\right] = \sum_{k} \operatorname{tr}\left(U_k \mathbb{E}\left[x_k x_k^\top\right] U_k^\top\right) = \sum_{k=1}^n \|U_k\|_F^2\,. \qquad \square$$
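Claim 8.3.2 can be checked by Monte Carlo with Gaussian $x_k$ (our illustration, not part of the thesis; the dimensions and number of trials are arbitrary):

```python
import numpy as np

# Monte Carlo check of Claim 8.3.2: for independent x_k with mean 0 and
# covariance I_d, and fixed matrices U_k, E||sum_k U_k x_k||^2 = sum_k ||U_k||_F^2.
rng = np.random.default_rng(0)
n, d, m, trials = 3, 4, 5, 200000

U = rng.standard_normal((n, m, d))            # n fixed m x d matrices
exact = sum(np.linalg.norm(U[k], 'fro') ** 2 for k in range(n))

x = rng.standard_normal((trials, n, d))       # x_k ~ N(0, I_d), independent
v = np.einsum('kmd,tkd->tm', U, x)            # sum_k U_k x_k, one row per trial
estimate = np.mean(np.sum(v ** 2, axis=1))

assert abs(estimate - exact) / exact < 0.05   # loose Monte Carlo tolerance
print(exact, estimate)
```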
The following proposition regarding the size of an $\varepsilon$-net is useful in the proof of Theorem 5.7.1 in Section 5.7.

Proposition 8.3.3. For any $\zeta \in (0,1)$, there is a set $\Gamma$ of rank-$r$ $d\times d$ matrices such that for any rank-$r$ $d\times d$ matrix $X$ with Frobenius norm at most $1$, there is a matrix $\widehat{X} \in \Gamma$ with $\|X - \widehat{X}\|_F \le \zeta$. The size of $\Gamma$ is bounded by $(d/\zeta)^{O(dr)}$.
Proof. A standard construction of $\varepsilon$-nets shows that there is a set $P \subset \mathbb{R}^d$ of size $(d/\varepsilon)^{O(d)}$ such that for any $u$ with $\|u\| \le 1$, there is a $\widehat{u} \in P$ such that $\|u - \widehat{u}\| \le \varepsilon$. The same construction applies to matrices under the Frobenius norm, as that is the same as vectors under the $\ell_2$ norm.

Here we let $\varepsilon = 0.1\zeta$, and construct three sets $P_1, P_2, P_3$, where $P_1$ is an $\varepsilon$-net for $d \times r$ matrices with Frobenius norm at most $\sqrt{r}$, $P_2$ is an $\varepsilon$-net for $r \times r$ diagonal matrices with Frobenius norm at most $1$, and $P_3$ is an $\varepsilon$-net for $r \times d$ matrices with Frobenius norm at most $\sqrt{r}$.

Now we define $\Gamma = \{\widehat{U}\widehat{D}\widehat{V} \mid \widehat{U} \in P_1, \widehat{D} \in P_2, \widehat{V} \in P_3\}$. Clearly the size of $\Gamma$ is as promised. For any rank-$r$ $d\times d$ matrix $X$ with singular value decomposition $X = UDV$, we can find $\widehat{U} \in P_1$, $\widehat{D} \in P_2$, and $\widehat{V} \in P_3$ that are $\varepsilon$-close to $U, D, V$ respectively. Therefore $\widehat{U}\widehat{D}\widehat{V} \in \Gamma$ and
$$\|\widehat{U}\widehat{D}\widehat{V} - UDV\|_F \le 8\varepsilon \le \zeta\,. \qquad \square$$
The following proposition, which connects the $\ell_6$ norm to the $\ell_2$ norm, is useful in the proof of Lemma 5.4.5 in Section 5.4.

Proposition 8.3.4. Let $a_1, \ldots, a_r \ge 0$ and $C \ge 0$. Then $C^4(a_1^2 + \cdots + a_r^2) \ge a_1^6 + \cdots + a_r^6$ implies that $a_1^2 + \cdots + a_r^2 \le C^2 r$ and that $\max_i a_i \le C r^{1/6}$.
Proof. By the Cauchy–Schwarz inequality, we have
$$\left(\sum_{i=1}^r a_i^2\right)\left(\sum_{i=1}^r a_i^6\right) \ge \left(\sum_{i=1}^r a_i^4\right)^2 \ge \left(\frac{1}{r}\left(\sum_{i=1}^r a_i^2\right)^2\right)^2\,.$$
Using the assumption together with the inequality above, we have that $a_1^2 + \cdots + a_r^2 \le C^2 r$. Combined with the assumption, this implies that $a_1^6 + \cdots + a_r^6 \le C^6 r$, which implies that $\max_i a_i \le C r^{1/6}$. $\square$
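Proposition 8.3.4 can be stress-tested on random nonnegative vectors (our illustration, not part of the thesis; the entry range, $r$, and $C$ are arbitrary): whenever the premise holds, both conclusions must hold.

```python
import numpy as np

# Randomized check of Proposition 8.3.4: whenever the premise
# C^4 * sum(a_i^2) >= sum(a_i^6) holds, both conclusions must follow.
rng = np.random.default_rng(0)
r, C = 10, 1.5
satisfied = 0

for _ in range(1000):
    a = rng.uniform(0.0, 1.8, size=r)
    if C**4 * np.sum(a**2) >= np.sum(a**6):              # premise
        satisfied += 1
        assert np.sum(a**2) <= C**2 * r + 1e-9           # first conclusion
        assert np.max(a) <= C * r ** (1.0 / 6.0) + 1e-9  # second conclusion

assert satisfied > 0                                     # check is not vacuous
print(satisfied, "random instances satisfied the premise")
```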
Bibliography
[1] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. In COLT, pages 123–137, 2014.
[2] A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In arXiv:1309.1952, 2013.
[3] Naman Agarwal, Zeyuan Allen Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent, 2017.
[4] M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designingovercomplete dictionaries for sparse representation. In IEEE Trans. on SignalProcessing, pages 4311–4322, 2006.
[5] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncover-ing shared structures in multiclass classification. In Proceedings of the 24thinternational conference on Machine learning, pages 17–24. ACM, 2007.
[6] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y. Liu. A spectral algorithmfor latent dirichlet allocation. In NIPS, pages 926–934, 2012.
[7] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade. Atensor approach to learning mixed membership community models. Journal ofMachine Learning Research, 15(1):2239–2312, 2014.
[8] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and MatusTelgarsky. Tensor decompositions for learning latent variable models. Journalof Machine Learning Research, 15(1):2773–2832, 2014.
[9] Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In Proceedings of the Annual Meeting of the North AmericanChapter of the Association for Computational Linguistics, 2014.
[10] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent andovercomplete dictionaries. In COLT, pages 779–806, 2014.
[11] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable boundsfor learning some deep representations. In Proceedings of the 31th Interna-tional Conference on Machine Learning, ICML 2014, Beijing, China, 21-26June 2014, pages 584–592, 2014.
[12] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, DavidSontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modelingwith provable guarantees. In International Conference on Machine Learning,pages 280–288, 2013.
[13] Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, and Ankur Moitra.Provable algorithms for inference in topic models. In The 33rd Inter-national Conference on Machine Learning (ICML 2016). arXiv preprintarXiv:1605.08491, 2016.
[14] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, andneural algorithms for sparse coding. In Proceedings of The 28th Conference onLearning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 113–149,2015.
[15] Sanjeev Arora, Rong Ge, Tengyu Ma, and Andrej Risteski. Provable learning ofnoisy-or networks. In Proceedings of the 49th Annual ACM SIGACT Symposiumon Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23,2017, pages 1057–1066, 2017.
[16] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Alatent variable model approach to pmi-based word embeddings. Transactionsof the Association for Computational Linguistics, 4:385–399, 2016.
[17] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski.Linear algebraic structure of word senses, with applications to polysemy. Tech-nical report, ArXiV, 2016. http://arxiv.org/abs/1502.03520.
[18] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beatbaseline for sentence embeddings. In 5th International Conference on LearningRepresentations (ICLR 2017), 2017.
[19] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guar-antees for the EM algorithm: From population to sample-based analysis. CoRR,abs/1408.2156, 2014.
[20] Sivaraman Balakrishnan, Martin J Wainwright, Bin Yu, et al. Statistical guar-antees for the em algorithm: From population to sample-based analysis. TheAnnals of Statistics, 45(1):77–120, 2017.
[21] Pierre Baldi and Kurt Hornik. Neural networks and principal component analy-sis: Learning from examples without local minima. Neural networks, 2(1):53–58,1989.
[22] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rankapproach for semidefinite programs arising in synchronization and communitydetection. arXiv preprint arXiv:1602.04426, 2016.
[23] Boaz Barak, John Kelner, and David Steurer. Dictionary learning using sum-of-square hierarchy. 2014.
[24] Alexandre S. Bazanella, Michel Gevers, Ljubisa Miskovic, and Brian D.O. An-derson. Iterative minimization of h2 control performance criteria. Automatica,44:2549–2559, 2008.
[25] David Belanger and Sham M. Kakade. A linear dynamical system model fortext. In Proceedings of the 32nd International Conference on Machine Learning,2015.
[26] Yoshua Bengio, Holger Schwenk, Jean-Sebastien Senecal, Frederic Morin, andJean-Luc Gauvain. Neural probabilistic language models. In Innovations inMachine Learning. 2006.
[27] George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
[28] S. Bernstein. Theory of Probability, 1927.
[29] Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Atomic normdenoising with applications to line spectral estimation. IEEE Transactions onSignal Processing, 61(23):5987–5999, 2013.
[30] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global Optimality of LocalSearch for Low Rank Matrix Recovery. ArXiv e-prints, May 2016.
[31] Fischer Black and Myron Scholes. The pricing of options and corporate liabili-ties. Journal of Political Economy, 1973.
[32] D. Blei. Introduction to probabilistic topic models. Communications of theACM, pages 77–84, 2012.
[33] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of MachineLearning Research, pages 993–1022, 2003. Preliminary version in NIPS 2001.
[34] David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings ofthe 23rd International Conference on Machine Learning, 2006.
[35] Leon Bottou. On-line learning in neural networks. chapter On-line Learningand Stochastic Approximations, pages 9–42. Cambridge University Press, NewYork, NY, USA, 1998.
[36] Leon Bottou. Online algorithms and stochastic approximations. In David Saad,editor, Online Learning and Neural Networks. Cambridge University Press,Cambridge, UK, 1998. revised, oct 2012.
[37] Pavol Brunovsky. A classification of linear controllable systems. Kybernetika,06(3):(173)–188, 1970.
[38] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithmfor solving semidefinite programs via low-rank factorization. Mathematical Pro-gramming, 95(2):329–357, 2003.
[39] M. C. Campi and Erik Weyer. Finite sample properties of system identificationmethods. IEEE Transactions on Automatic Control, 47(8):1329–1334, 2002.
[40] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incompleteand inaccurate measurements. In Communications of Pure and Applied Math,pages 1207–1223, 2006.
[41] E. Candes and T. Tao. Decoding by linear programming. In IEEE Trans. onInformation Theory, pages 4203–4215, 2005.
[42] Emmanuel J Candes, Xiaodong Li, Yi Ma, and John Wright. Robust principalcomponent analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[43] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrievalvia wirtinger flow: Theory and algorithms. IEEE Transactions on InformationTheory, 61(4):1985–2007, 2015.
[44] Emmanuel J Candes and Benjamin Recht. Exact matrix completion via convexoptimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
[45] Emmanuel J Candes and Terence Tao. The power of convex relaxation:Near-optimal matrix completion. Information Theory, IEEE Transactions on,56(5):2053–2080, 2010.
[46] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Acceleratedmethods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
[47] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projectedgradient descent: General statistical and algorithmic guarantees. arXiv preprintarXiv:1509.03025, 2015.
[48] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems ofequations is nearly as easy as solving linear systems. In Advances in NeuralInformation Processing Systems, pages 739–747, 2015.
[49] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, andYann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[50] Kenneth Ward Church and Patrick Hanks. Word association norms, mutualinformation, and lexicography. Computational linguistics, 1990.
[51] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar.Spectral learning of latent-variable PCFGs. In Proceedings of the 50th AnnualMeeting of the Association for Computational Linguistics: Long Papers-Volume1, 2012.
[52] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[53] Ronan Collobert and Jason Weston. A unified architecture for natural languageprocessing: Deep neural networks with multitask learning. In Proceedings ofthe 25th International Conference on Machine Learning, 2008.
[54] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, SuryaGanguli, and Yoshua Bengio. Identifying and attacking the saddle point prob-lem in high-dimensional non-convex optimization. In Advances in neural infor-mation processing systems, pages 2933–2941, 2014.
[55] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bitmatrix completion. Information and Inference, 3(3):189–223, 2014.
[56] Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas,and Richard A. Harshman. Indexing by latent semantic analysis. Journal ofthe American Society for Information Science, 1990.
[57] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition.In IEEE Trans. on Information Theory, pages 2845–2862, 1999.
[58] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods foronline learning and stochastic optimization. The Journal of Machine LearningResearch, 2011.
[59] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping:part i. Robotics & Automation Magazine, IEEE, 13(2):99–110, 2006.
[60] Diego Eckhard and Alexandre Sanfelice Bazanella. On the global convergenceof identification of output error models. In Proc. 18th IFAC World congress,2011.
[61] Michael Elad. Sparse and Redundant Representations: From Theory to Appli-cations in Signal and Image Processing. Springer Publishing Company, Incor-porated, 1st edition, 2010.
[62] K. Engan, S. Aase, and J. Hakon-Husoy. Method of optimal directions for framedesign. In ICASSP, pages 2443–2446, 1999.
[63] T. Estermann. Complex numbers and functions. Athlone Press, 1962.
[64] Maryam Fazel, Haitham Hindi, and S Boyd. Rank minimization and applica-tions in system theory. In Proc. American Control Conference, volume 4, pages3273–3278. IEEE, 2004.
[65] Maryam Fazel, Haitham Hindi, and Stephen P Boyd. A rank minimiza-tion heuristic with application to minimum order system approximation. InProc. American Control Conference, volume 6, pages 4734–4739. IEEE, 2001.
[66] John Rupert Firth. A synopsis of linguistic theory. 1957.
[67] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddlepoints—online stochastic gradient for tensor decomposition. In Proceedings ofThe 28th Conference on Learning Theory, pages 797–842, 2015.
[68] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of gaus-sians in high dimensions. In Proceedings of the Forty-Seventh Annual ACM onSymposium on Theory of Computing, STOC 2015, Portland, OR, USA, June14-17, 2015, pages 761–770, 2015.
[69] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spuriouslocal minimum. Advances in Neural Information Processing Systems (NIPS),2016.
[70] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclideanembedding of co-occurrence data. Journal of Machine Learning Research, 2007.
[71] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedingsof the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[72] Moritz Hardt. Understanding alternating minimization for matrix completion.In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Sympo-sium on, pages 651–660. IEEE, 2014.
[73] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In 5th Inter-national Conference on Learning Representations (ICLR 2017), 2017.
[74] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns lineardynamical systems. CoRR, abs/1609.05191, 2016.
[75] Moritz Hardt and Mary Wootters. Fast matrix completion without the conditionnumber. In Conference on Learning Theory, pages 638–678, 2014.
[76] Tatsunori B. Hashimoto, David Alvarez-Melis, and Tommi S. Jaakkola. Wordembeddings as metric recovery in semantic spaces. Transactions of the Associ-ation for Computational Linguistics, 2016.
[77] Trevor Hastie, Rahul Mazumder, Jason Lee, and Reza Zadeh. Matrix comple-tion and low-rank svd via fast alternating least squares. Journal of MachineLearning Research, 2014.
[78] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. Beyond Convexity: StochasticQuasi-Convex Optimization. ArXiv e-prints, July 2015.
[79] Elad Hazan and Tengyu Ma. A non-generative framework and convex relax-ations for unsupervised learning. In Neural Information Processing Systems(NIPS), 2016, 2016.
[80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learn-ing for image recognition. In Proceedings of the IEEE conference on computervision and pattern recognition, pages 770–778, 2016.
[81] Christiaan Heij, Andre Ran, and Freek van Schagen. Introduction to mathe-matical systems theory : linear systems, identification and control. Birkhauser,Basel, Boston, Berlin, 2007.
[82] Joao P Hespanha. Linear systems theory. Princeton university press, 2009.
[83] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are np-hard. J.ACM, 60(6):45, 2013.
[84] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. NeuralComputation, 9(8):1735–1780, 1997.
[85] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[86] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of theFifteenth Conference on Uncertainty in Artificial Intelligence, 1999.
[87] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians:moment methods and spectral decompositions. In Proceedings of the 4th con-ference on Innovations in Theoretical Computer Science, pages 11–20, 2013.
[88] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm forlearning hidden markov models. Journal of Computer and System Sciences,2012.
[89] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for quadraticforms of subgaussian random vectors. Electron. Commun. Probab, 17(52):1–6,2012.
[90] R. Imbuzeiro Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. ArXiv e-prints, November 2009.
[91] R. Imbuzeiro Oliveira. Sums of random Hermitian matrices and an inequalityby Rudelson. ArXiv e-prints, April 2010.
[92] Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finitesamples. In Proceedings of The 28th Conference on Learning Theory, pages1007–1034, 2015.
[93] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix com-pletion using alternating minimization. In Proceedings of the forty-fifth annualACM symposium on Theory of computing, pages 665–674. ACM, 2013.
[94] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods.ArXiv e-prints, June 2015.
[95] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesistesting for high-dimensional regression. Journal of Machine Learning Research,15(1):2869–2909, 2014.
[96] Kenji Kawaguchi. Deep learning without poor local minima. In Advances inNeural Information Processing Systems, pages 586–594, 2016.
[97] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrixcompletion from a few entries. Information Theory, IEEE Transactions on,56(6):2980–2998, 2010.
[98] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix com-pletion from noisy entries. The Journal of Machine Learning Research, 11:2057–2078, 2010.
[99] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
[100] Jon M. Kleinberg and Mark Sandler. Using mixture models for collaborativefiltering. J. Comput. Syst. Sci., 74(1):49–69, 2008.
[101] Andrew V. Knyazev and Peizhen Zhu. Principal angles between subspaces and their tangents. arXiv preprint, 2012.
[102] Yehuda Koren. The bellkor solution to the netflix grand prize. Netflix prizedocumentation, 81, 2009.
[103] Nicholas Kottenstette and Panos J Antsaklis. Relationships between positivereal, passive dissipative, & positive systems. In American Control Conference(ACC), 2010, pages 409–416. IEEE, 2010.
[104] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional bymodel selection. Ann. Statist., 28(5):1302–1338, 10 2000.
[105] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperime-try and processes, volume 23. Springer Science & Business Media, 2013.
[106] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradientdescent converges to minimizers. University of California, Berkeley, 1050:16,2016.
[107] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and designof optimization algorithms via integral quadratic constraints. SIAM Journal onOptimization, 26(1):57–95, 2016.
[108] Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings ofThe 30th International Conference on Machine Learning, pages 1–9, 2013.
[109] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrixfactorization. In Advances in Neural Information Processing Systems (NIPS),2015.
[110] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicitword representations. In Proceedings of the Eighteenth Conference on Compu-tational Natural Language Learning, 2014.
[111] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrixfactorization. In Advances in Neural Information Processing Systems, 2014.
[112] M. Lewicki and T. Sejnowski. Learning overcomplete representations. In NeuralComputation, pages 337–365, 2000.
[113] Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee ofweighted low-rank approximation via alternating minimization. arXiv preprintarXiv:1602.02262, 2016.
[114] Lennart Ljung. System Identification. Theory for the user. Prentice Hall, UpperSaddle River, NJ, 2nd edition, 1998.
[115] Po-Ling Loh and Martin J Wainwright. Support recovery without incoherence:A case for nonconvex regularization. arXiv preprint arXiv:1412.5632, 2014.
[116] Po-Ling Loh and Martin J. Wainwright. Regularized m-estimators with noncon-vexity: statistical and algorithmic theory for local optima. Journal of MachineLearning Research, 16:559–616, 2015.
[117] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y.Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InThe 49th Annual Meeting of the Association for Computational Linguistics,2011.
[118] S. Mallat. A wavelet tour of signal processing. In Academic-Press, 1998.
[119] Yariv Maron, Michael Lamar, and Elie Bienenstock. Sphere embedding: Anapplication to part-of-speech induction. In Advances in Neural InformationProcessing Systems, 2010.
[120] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularizationalgorithms for learning large incomplete matrices. Journal of machine learningresearch, 11(Aug):2287–2322, 2010.
[121] Alexandre Megretski. Convex optimization in robust identification of nonlinearfeedback. In Proceedings of the 47th Conference on Decision and Control, 2008.
[122] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estima-tion of word representations in vector space. Proceedings of the InternationalConference on Learning Representations, 2013.
[123] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean.Distributed representations of words and phrases and their compositionality. InAdvances in Neural Information Processing Systems, 2013.
[124] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and JeffreyDean. Distributed representations of words and phrases and their composition-ality. In Advances in Neural Information Processing Systems (NIPS), 2015.
[125] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
[126] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960.
[127] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[128] Ankur Moitra and Michael E. Saks. A polynomial time algorithm for lossy population recovery. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 110–116, 2013.
[129] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In the 51st Annual Symposium on the Foundations of Computer Science (FOCS), 2010.
[130] Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
[131] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.
[132] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.
[133] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[134] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton methodand its global performance. Mathematical Programming, 108(1):177–205, 2006.
[135] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[136] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004.
[137] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998.
[138] Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pages 698–712, 1990.
[139] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[140] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, 2014.
[141] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[142] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
[143] Ali Rahimi, Ben Recht, and Trevor Darrell. Learning appearance manifolds from video. In Proc. IEEE CVPR, 2005.
[144] Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20, pages 1185–1192, 2007.
[145] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
[146] Jasson D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.
[147] Douglas L. T. Rohde, Laura M. Gonnerman, and David C. Plaut. An improved model of semantic similarity based on lexical co-occurrence. Communications of the Association for Computing Machinery, 2006.
[148] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. 2017.
[149] David E. Rumelhart, Geoffrey E. Hinton, and James L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1986.
[150] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 1988.
[151] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2332–2341, 2015.
[152] A. C. Schaeffer. Inequalities of A. Markoff and S. Bernstein for polynomials and related functions. Bulletin of the American Mathematical Society, 47(8):565–579, 1941.
[153] Parikshit Shah, Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Linear system identification via atomic norm regularization. In Proceedings of the 51st Conference on Decision and Control, 2012.
[154] Torsten Soderstrom and Petre Stoica. Some properties of the output error method. Automatica, 18(1):93–99, 1982.
[155] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
[156] D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. Journal of Machine Learning Research, 2012.
[157] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML, 2003.
[158] Nathan Srebro, Jason Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2004.
[159] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer, 2005.
[160] Gilbert W. Stewart. Perturbation theory for the singular value decomposition. Technical report, 1998.
[161] G. W. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662, 1977.
[162] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
[163] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.
[164] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
[165] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. 27th NIPS, pages 3104–3112, 2014.
[166] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[167] J. A. Tropp. An introduction to matrix concentration inequalities. arXiv e-prints, January 2015.
[168] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
[169] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 2010.
[170] M. Vidyasagar and Rajeeva L. Karandikar. A learning theory approach to system identification and stochastic adaptive control. Journal of Process Control, 18(3):421–430, 2008.
[171] Martin Wainwright. Basic tail and concentration bounds, 2015.
[172] Per-Ake Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
[173] Erik Weyer and M. C. Campi. Finite sample properties of system identification methods. In Proceedings of the 38th Conference on Decision and Control, 1999.
[174] H. Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71:441–479, 1912.
[175] Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, 2009.
[176] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.
[177] Qinqing Zheng and John Lafferty. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.