Non-convex Optimization for Machine
Learning: Design, Analysis, and
Understanding
Tengyu Ma
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
Adviser: Professor Sanjeev Arora
November 2017
© Copyright by Tengyu Ma, 2017.
All rights reserved.
Abstract
Non-convex optimization is ubiquitous in modern machine learning: recent break-
throughs in deep learning require optimizing non-convex training objective functions;
problems that admit accurate convex relaxation can often be solved more efficiently
with non-convex formulations. However, the theoretical understanding of non-convex
optimization remains rather limited. Can we extend the algorithmic frontier by efficiently optimizing a family of interesting non-convex functions? Can we successfully
apply non-convex optimization to machine learning problems with provable guaran-
tees? How do we interpret the complicated models in machine learning that demand
non-convex optimizers?
Towards addressing these questions, in this thesis we theoretically study various
machine learning models, including sparse coding, topic models, matrix completion,
linear dynamical systems, and word embeddings.
We first consider how to find a coarse solution to serve as a good starting point
for local improvement algorithms such as stochastic gradient descent. We propose ef-
ficient methods for sparse coding and topic inference with better provable guarantees.
Second, we propose a framework for analyzing local improvement algorithms that
start from a coarse solution. We apply it successfully to the sparse coding problem.
Then, we consider a family of non-convex functions satisfying that all local minima
are also global (and some additional regularity property). Such functions can be
optimized by local improvement algorithms efficiently from a random or arbitrary
starting point. The challenge that we address here, in turn, becomes proving that
an objective function belongs to this class. We establish such results for the natural
learning objectives of matrix completion and linear dynamical systems.
Finally, we make steps towards interpreting the non-linear models that require
non-convex training algorithms. We reflect on the principles of word embeddings in
natural language processing. We give a generative model for the texts, using which
we explain why different non-convex formulations such as word2vec and GloVe can
learn similar word embeddings with the surprising property that analogous words
have embeddings with similar differences.
Acknowledgements
First and foremost, I would like to thank my advisor, Professor Sanjeev Arora, for
all his advice, encouragement, inspiration, and support. Throughout the past five
years, he has been a never-ending source of wisdom and insight. I continuously learn
from his commitment to research, his fortitude in exploring the unknown, his taste in
research problems, and his comprehensive knowledge of theoretical computer science.
I always remember his constructive and considerate suggestions for decisions in
my life, and I've been greatly influenced by his philosophy. I couldn't have wished for a
better advisor.
I am also most thankful to Avi Wigderson, Benjamin Recht, Elad Hazan, Moritz
Hardt, David Steurer, Ankur Moitra, and Rong Ge for their guidance, encouragement,
and collaboration. I enjoyed very much discussing research with them, and I
learned various aspects of research from them, ranging from mathematical techniques
to high-level thinking, and from technical writing to speaking to general audiences. They have
had a considerable influence on the content of this thesis and generally on my work during
graduate school.
I owe many thanks to Moritz Hardt, Elad Hazan, Benjamin Recht, and Yoram
Singer, who influenced me significantly beyond research collaborations in the last two
years of my Ph.D. They kindly spent a lot of time helping me navigate the job
market without panic. The philosophical discussions with Elad and Sanjeev in the
kitchen on the fourth floor of the computer science department shaped my research taste.
Many thanks to Moritz Hardt and Yoram Singer for hosting me as an intern and
a visitor at Google in 2015 and 2016. The research discussions with Moritz, Ben,
Yoram, and other team members broadened my horizons and injected a more practical
perspective into my research.
I was very fortunate to collaborate with many brilliant researchers. I would like
to thank my coauthors Naman Agarwal, Zeyuan Allen-Zhu, Sanjeev Arora, Aditya
Bhaskara, Mark Braverman, Brian Bullins, Xi Chen, Dan Garber, Ankit Garg, Rong
Ge, Moritz Hardt, Elad Hazan, Frederic Koehler, Jason D. Lee, Yuanzhi Li, Yingyu
Liang, Qihang Lin, Ankur Moitra, Huy Nguyen, Benjamin Recht, Andrej Risteski,
Jonathan Shi, David Steurer, Xiaoming Sun, Bo Tang, Yajun Wang, Avi Wigderson,
David P. Woodruff, Tianbao Yang, Huacheng Yu, Yi Zhang, Jiawei Zhang, and Yuan
Zhou.
I would also like to thank the wonderful group of researchers at Princeton University
for creating such a great environment for machine learning, theoretical computer
science, statistics, and applied mathematics. At Princeton, within a ten-minute walk,
I was able to get thoughtful answers, comments, and feedback from world-class
experts on any question or idea. Thank you to all the staff of the Computer
Science Department, especially Ms. Melissa Lawson, Mitra Kelly, and Nicole Wagenblast,
for their administrative work.
Thanks to Elad Hazan, Mark Braverman, David Steurer, and Rong Ge for being
on my thesis committee. Thanks to Andrew Chi-Chih Yao for creating the fantastic
Yao’s special pilot class where I was an undergraduate student. Thanks to Xiaoming
Sun, Yajun Wang for advising my undergraduate research.
This thesis was supported in part by NSF grants CCF-1527371 and DMS-1317308,
ONR grant N00014-16-1-2329, a Simons Investigator Award, a Simons Collaboration
Grant, a Simons Award in Theoretical Computer Science, an IBM Ph.D. Fellowship,
a Simons-Berkeley Research Fellowship, a Siebel Scholarship, and a Princeton
Honorific Fellowship.
Furthermore, some of the work in this thesis was conducted while I was an intern at
Google and a fellow at the Simons Institute for the Theory of Computing. Thank
you all for your support.
Heartfelt thanks to all my friends at Princeton University, who made my time in
Princeton very enjoyable.
Finally, I would like to thank my family — my parents Qinglong Ma and Li Li,
and my wife Wenxin Xu for their love and support.
To Wenxin
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction 1
1.1 Analyzing Local Improvement Algorithms for Non-convex Optimization 2
1.1.1 Local Convergence Starting from Coarse Solutions . . . . . . . 3
1.1.2 Finding Coarse Solutions . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Global Convergence from Simple Initialization . . . . . . . . . 5
1.2 Interpreting Non-convex Objective Functions . . . . . . . . . . . . . . 6
1.3 Problems and Main Contributions . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Matrix Completion . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.4 Learning Linear Dynamical Systems . . . . . . . . . . . . . . . 11
1.4 Previously Published work . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
I Local Improvement Algorithms Starting from Coarse
Solutions 15
2 Finding Coarse Solutions 16
2.1 Spectral Initialization for Sparse Coding . . . . . . . . . . . . . . . . 16
2.1.1 Introduction and Problem Definition . . . . . . . . . . . . . . 16
2.1.2 Assumptions and Main Results . . . . . . . . . . . . . . . . . 19
2.1.3 Related Work and Notes . . . . . . . . . . . . . . . . . . . . . 21
2.1.4 The Spectral Algorithm and Key Observation . . . . . . . . . 22
2.1.5 Infinite Samples Case . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.6 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Convex Initialization for Topic Modeling Inference . . . . . . . . . . . 34
2.2.1 Introduction and Main Results . . . . . . . . . . . . . . . . . 35
2.2.2 Notations and Preliminaries . . . . . . . . . . . . . . . . . . . 40
2.2.3 δ-Biased Minimum Variance Estimators . . . . . . . . . . . . 41
2.2.4 Thresholded Linear Inverse Algorithm and its Guarantees . . . 45
2.3 Discussion: Special Initialization vs Trivial Initialization . . . . . . . 48
3 Local Convergence to a Global Minimum 50
3.1 Analysis Framework via Lyapunov Function . . . . . . . . . . . . . . 50
3.1.1 Generalization to Stochastic Updates . . . . . . . . . . . . . . 54
3.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.3 Limitation and Relation to Part II . . . . . . . . . . . . . . . 57
3.2 Analyzing Alternating Minimization Algorithm for Sparse Coding . . 58
3.2.1 Alternating Minimization for Sparse Coding . . . . . . . . . . 58
3.2.2 Applying the Framework to Analyzing Alternating Minimization 59
3.2.3 Algorithms and Main Results . . . . . . . . . . . . . . . . . . 61
3.3 Support Recovery Guarantees of Decoding . . . . . . . . . . . . . . . 64
3.4 Analysis Overview: Infinite Samples Setting . . . . . . . . . . . . . . 67
3.4.1 Making Progress at Each Iteration . . . . . . . . . . . . . . . 68
3.4.2 Maintaining Spectral Norm . . . . . . . . . . . . . . . . . . . 73
3.5 Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 More Alternating Minimization Algorithms . . . . . . . . . . . . . . . 84
3.6.1 Analysis of a Variant of Olshausen-Field Update . . . . . . . . 84
3.6.2 Removing Systemic Error . . . . . . . . . . . . . . . . . . . . 88
II Global Convergence with Arbitrary Initialization 91
4 Analysis via Optimization Landscape 92
4.1 Local Optimality vs Global Optimality . . . . . . . . . . . . . . . . . 93
5 Matrix Completion 100
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Proof Strategy: “Simple” Proofs are More Generalizable . . . . . . . 106
5.3 Warm-up: Rank-1 Case . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.1 Handling Incoherent x . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Extension to General x . . . . . . . . . . . . . . . . . . . . . . 119
5.4 Rank-r Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.6 Finding the Exact Factorization . . . . . . . . . . . . . . . . . . . . . 134
5.7 Concentration Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 140
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6 Learning Linear Dynamical Systems 148
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.1.1 Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.2 Proper Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.3 The Power of Over-parameterization . . . . . . . . . . . . . . 154
6.1.4 Multi-input Multi-output Systems . . . . . . . . . . . . . . . . 155
6.1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.1.6 Proof Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.1.7 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.2 Population Risk in Frequency Domain . . . . . . . . . . . . . . . . . 160
6.2.1 Quasi-convexity of the Idealized Risk . . . . . . . . . . . . . . 162
6.2.2 Justifying Idealized Risk . . . . . . . . . . . . . . . . . . . . . 164
6.3 Effective Relaxations of Spectral Radius . . . . . . . . . . . . . . . . 165
6.3.1 Efficiently Computing the Projection . . . . . . . . . . . . . . 168
6.4 Learning Acquiescent Systems . . . . . . . . . . . . . . . . . . . . . . 169
6.5 The Power of Improper Learning . . . . . . . . . . . . . . . . . . . . 175
6.5.1 Instability of the Minimum Representation . . . . . . . . . . . 177
6.5.2 Power of Improper Learning in Various Cases . . . . . . . . . 178
6.5.3 Improper Learning Using Linear Regression . . . . . . . . . . 187
6.6 Learning Multi-input Multi-output (MIMO) Systems . . . . . . . . . 188
6.7 Technicalities: Mean and Variance of the Gradient Estimator . . . . . 191
6.8 Back-propagation Implementation . . . . . . . . . . . . . . . . . . . . 200
6.9 Projection to the Constraint Set . . . . . . . . . . . . . . . . . . . . . 201
III Interpreting Non-linear Models and Their Non-convex
Objective Functions 204
7 Understanding Word Embedding Methods Using Generative Mod-
els 205
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.1.2 Benefits of Generative Approaches . . . . . . . . . . . . . . . 209
7.2 Generative Model and Its Properties . . . . . . . . . . . . . . . . . . 210
7.2.1 Weakening the Model Assumptions . . . . . . . . . . . . . . . 218
7.3 Training objective and relationship to other models . . . . . . . . . . 219
7.4 Explaining relations=lines . . . . . . . . . . . . . . . . . . . . . . 223
7.5 Experimental Verification . . . . . . . . . . . . . . . . . . . . . . . . 228
7.5.1 Model Verification . . . . . . . . . . . . . . . . . . . . . . . . 228
7.5.2 Performance on Analogy Tasks . . . . . . . . . . . . . . . . . 231
7.5.3 Verifying relations=lines . . . . . . . . . . . . . . . . . . . 232
7.6 Proof of Main Theorems and Lemmas . . . . . . . . . . . . . . . . . . 234
7.6.1 Analyzing Partition Function Zc . . . . . . . . . . . . . . . . . 242
7.6.2 Helper Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . 247
7.7 Maximum Likelihood Estimator for Co-occurrence . . . . . . . . . . . 259
7.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
8 Mathematical Tools 264
8.1 Concentration Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 264
8.1.1 Hoeffding’s inequality and Bernstein’s inequalities . . . . . . . 264
8.1.2 Sub-Gaussian and Sub-exponential Random Variables . . . . . 267
8.2 Spectral Perturbation Theorems . . . . . . . . . . . . . . . . . . . . . 268
8.3 Auxiliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Chapter 1
Introduction
Non-convex optimization algorithms have been widely used in modern machine learn-
ing, especially deep learning. Can we design, analyze, and interpret non-convex
optimization algorithms in a principled way? This thesis aims to put non-convex
optimization on a more solid theoretical footing. We design and analyze non-convex
optimization algorithms for machine learning problems including sparse coding, topic
models, matrix completion, and learning linear dynamical systems, and we interpret the
effectiveness of different non-convex algorithms for learning word embeddings that
capture semantic information.
In this chapter, we present a general overview of the questions and techniques
discussed in this thesis. We give a brief survey of the existing approaches for designing
and analyzing local improvement algorithms in Section 1.1, and discuss the related
work that motivates us to provide the theoretical interpretation of the non-convex
algorithms for training non-linear models in Section 1.2. We define the concrete
problems that will be studied and the main contributions in Section 1.3. This brief
overview is intended to convey the flavor of the work contained herein. We will provide
additional background and motivation later in each chapter.
1.1 Analyzing Local Improvement Algorithms for
Non-convex Optimization
Finding a global minimizer of a non-convex optimization problem — or even just a
degree-4 polynomial — is NP-hard in the worst case [83]. In fact, it’s also NP-hard
to check whether a point is a local minimum or not [130].
Despite the intractability results, non-convex optimization is the main algorith-
mic technique behind many state-of-the-art machine learning and deep learning re-
sults. Local improvement heuristics such as stochastic gradient descent [36], Momen-
tum [141], Adagrad [58], RMSProp [166], and Adam [99], are simple, scalable and
easy to implement, and they surprisingly return high-quality solutions (if not global
minima) [54, 49].
Given the empirical success of non-convex optimization in machine learning and
artificial intelligence, it becomes increasingly important to have a fundamental un-
derstanding of non-convex optimization algorithms and to design faster ones. Part I
and Part II of this thesis aim to develop mathematical techniques to analyze the
non-convex optimization algorithms for various machine learning problems.
We note that such analysis techniques have to be aware of the special properties
of the optimization problems, which, in machine learning, depend on the properties
of the data, the model, and the loss function. In this thesis, we mostly formulate
the problem by assuming the data are generated from some unknown and realistic
parametric distributions, and therefore work with average-case algorithm analysis.
Furthermore, the interface between users and optimizers in the context of machine
learning can be more relaxed. The optimizers can specialize in a restricted family of
objective functions, instead of addressing all differentiable functions, which would be
difficult or even impossible; the optimizers should also be allowed to potentially ask
the users to change the model parameterization, loss function, and regularization to
make the objective function easier. In Section 1.1.3, and in more detail in Part II, we
will discuss a family of functions with certain landscape properties that allow efficient
optimization, and show that the objectives for several machine learning problems
belong to this family, and demonstrate theoretically that the choice of the regularizers
and over-parameterization of the models can make the landscape easier for optimizers.
There are two dominating paradigms to analyze and design non-convex optimiza-
tion algorithms: a) finding a coarse approximate solution first, followed by local
improvement algorithms, and b) running local improvement algorithms from a trivial
initialization. Paradigm a) allows us to deploy very precise mathematical tools for
analysis and often gives strong theoretical guarantees, whereas paradigm b) is more
popular in practice because it can be applied simply to problems for which we don’t
know how to find coarse solutions without local improvement algorithms. This thesis
contributes to the development of both these two paradigms, as outlined below.
1.1.1 Local Convergence Starting from Coarse Solutions
For any sufficiently smooth objective function f and its global minimum x*, there
exists a neighborhood N of x* in which the function is convex. Therefore, starting
from an initializer x0 in this neighborhood N, local improvement algorithms such
as first-order or second-order convex optimization algorithms converge to the global
minimum x*. However, the size of such a neighborhood N depends on the structure of
the specific problem; it can often be very small and scale inverse polynomially in
the dimension.
On the other hand, the true basin of attraction of the global minimum x*, defined
as the set of points x0 from which local improvement algorithms can converge
to x*, should be much larger than what we can prove from the argument above.
Various analysis frameworks have been developed to achieve a sharper analysis of the
basin of attraction [43, 20, 93, 72, 75, 164]. In most of these cases, the local improvement
algorithms can provably start from a neighborhood of the global minimum with a
radius that doesn't depend on the dimensionality.
In Chapter 3 of this thesis, we will present a framework that was previously
published in [14]. The key idea behind most of these analysis frameworks is to design
a Lyapunov function that measures the distance from the current iterate xk to
the global minimum x*, and to show that it decreases monotonically to zero. Our
framework contains most, if not all, of the others as special cases.
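Schematically, such a framework posits a potential (Lyapunov) function V and proves a per-iteration contraction; the following is a simplified sketch, not the precise conditions of [14]:

```latex
V(x_{k+1}) \le (1 - \delta)\, V(x_k) \quad \text{for some fixed } \delta \in (0,1)
\;\;\Longrightarrow\;\; V(x_k) \le (1-\delta)^k\, V(x_0) \longrightarrow 0 .
```

When V dominates a distance to the global minimum, e.g. V(x) ≥ c‖x − x*‖², this contraction yields geometric convergence of the iterates xk to x*.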
The strength of this local convergence analysis is that it often allows very precise
tools to reason about running time, sample complexity, etc. (see, e.g., [48]), partly
due to its similarity to well-understood convex optimization. However, it is limited
to the situations where we already have a fast algorithm to obtain a coarse solution,
which we will discuss in the next subsection.
1.1.2 Finding Coarse Solutions
Coarse approximate solutions for non-convex optimization problems, especially in un-
supervised learning, can often be obtained by leveraging the data distributions using
spectral methods, combinatorial optimization, and convex optimization. The coarse
estimate may be far from statistically optimal, but the local improvement algorithms
on non-convex objectives that follow can find a more accurate solution. For example,
for the matrix completion problem (formally defined in Section 1.3), a simple
singular value decomposition (SVD) of the data matrix can recover a coarse estimate
of the underlying matrix. Such coarse solutions can often be good enough to enter
the basin of attraction of a global minimum of the non-convex objective,
as is the case for matrix completion [93, 75, 163], matrix sensing [168],
phase retrieval [43], topic models [8, 12], noisy-or networks [15], dictionary learn-
ing [10, 11, 14].
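To illustrate the SVD-based initialization for matrix completion, here is a hedged sketch on synthetic data; the 1/p rescaling and the synthetic low-rank model are illustrative assumptions of this toy setup, not the exact estimator analyzed in later chapters.

```python
import numpy as np

def svd_initialization(M_obs, mask, r):
    # Rescaling the zero-filled matrix by 1/p makes it an unbiased
    # estimate of M; truncating its SVD to rank r removes most of the noise.
    p = mask.mean()                       # empirical observation probability
    U, s, Vt = np.linalg.svd(M_obs / p)
    return (U[:, :r] * s[:r]) @ Vt[:r]    # best rank-r approximation

rng = np.random.default_rng(0)
d, r = 200, 3
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-r target
mask = rng.random((d, d)) < 0.7           # observe each entry w.p. ~0.7
M_hat = svd_initialization(M * mask, mask, r)
rel_err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
```

The estimate is coarse (a constant relative error, not statistically optimal), which is precisely why it is followed by local improvement.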
In this thesis we will present two results along this line of research: a fast (coarse)
inference algorithm for topic models based on convex optimization (Section 2.2) and
a fast learning algorithm for dictionary learning based on spectral techniques (Sec-
tion 2.1). The key challenge in this direction is to solve models with prominent
non-linearities. The recent work of Arora et al. [15] addresses the non-linearity in
noisy-or networks via non-linear moment methods, but it’s beyond the scope of this
thesis.
1.1.3 Global Convergence from Simple Initialization
One of the largest families of non-convex functions for which we can optimize provably
efficiently is the set of functions without bad local minima, which are often called
strict-saddle [67] or ridable functions [162]. Loosely speaking, if all local minima
of a function are also global minima and the function, in addition, satisfies some
mild regularity conditions, then many local improvement algorithms can find a global
minimum efficiently [134, 162, 138, 67, 3, 46, 106]. It can be easily seen that convex
functions form a subset of this family.
Many interesting non-convex objective functions commonly used in machine learning
fall into this class. This is well known for the eigenvector problems (PCA/SVD)
[21, 157]. Recent work has established such properties for objective functions for
SVD/PCA, phase retrieval/synchronization, orthogonal tensor decomposition, matrix
decomposition, matrix sensing, learning linear dynamical systems, learning one-layer
hidden neural nets [67, 162, 22, 69, 74, 30, 148].
In this thesis, we will mostly focus on the analysis of the landscape for matrix
completion (Chapter 5) and learning linear dynamical systems (Chapter 6).
The optimization landscape properties have also been investigated on simplified
neural networks models. Kawaguchi [96] shows that the landscape of deep neural nets
doesn’t have bad local minima but has degenerate saddle points. Hardt and Ma [73]
show that re-parametrization using identity connection as in residual networks [80]
can remove the degenerate saddle points in the optimization landscape of deep linear
residual networks. Soudry and Carmon [155] show that an over-parameterized
neural network doesn't have bad differentiable local minima. Hardt et al. [74] analyze
the power of over-parameterization in a linear recurrent network (which is equivalent
to a linear dynamical system). Ge, Lee, and Ma [148] learn (a subset of) one-hidden-layer
neural networks by designing a new objective function that has
no spurious local minima and whose global minima recover the weights of the networks.
1.2 Interpreting Non-convex Objective Functions
Another intriguing question is how we can interpret the resulting solutions obtained
from optimizing a non-convex objective for a non-linear model. A particularly in-
teresting example is the word embeddings in the context of natural language pro-
cessing [26, 53, 124, 139]. Recently, researchers have discovered an efficient way of
assigning every word a vector in Euclidean space, which captures the semantic
information in the words. The striking property of the embeddings is that analogous
pairs of words have embeddings with similar differences. Moreover, such results can
be achieved by various methods, including simple recurrent neural networks
as in word2vec [124] or non-linear matrix factorization models like GloVe and the PMI
method [50, 109, 139]. Despite their successful applications, it remains unclear why
these methods result in such vectors without any particular mechanism that targets
this property.
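The analogy property can be illustrated with hand-constructed toy vectors; the embeddings below are hypothetical, built purely for illustration, whereas real word2vec/GloVe vectors are learned from corpora.

```python
import numpy as np

# Toy embeddings: one "gender" direction g and one "royalty" direction roy,
# so that emb["queen"] - emb["king"] == emb["woman"] - emb["man"].
g = np.array([1.0, 0.0, 0.0])
roy = np.array([0.0, 1.0, 0.0])
emb = {
    "man":   -g,
    "woman":  g,
    "king":  -g + roy,
    "queen":  g + roy,
    "apple": np.array([0.0, 0.0, 1.0]),  # an unrelated word
}

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? by the vector offset emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # return the word (other than a, b, c) closest in cosine similarity
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], target))

best = analogy("man", "woman", "king", emb)  # -> "queen"
```

The same vector-offset rule is what the analogy experiments in Chapter 7 evaluate on learned embeddings.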
In Chapter 7, we will show that the non-convex objective functions of word2vec,
GloVe, and PMI, in fact, correspond to different ways of learning a generative model
of the texts that we propose, a dynamic version of the log-linear topic model of [127].
GloVe and PMI correspond to moment methods for learning this generative model
and word2vec corresponds to an expectation-maximization algorithm for learning the
model. It also helps explain why low-dimensional semantic embeddings contain linear
algebraic structure that allows solving word analogies, as shown by [122] and
many subsequent papers.
Another new explanatory feature of our model is that the low dimensionality of word
embeddings plays a key theoretical role, unlike in previous papers, where the model
is agnostic about the dimension of the embeddings and the superiority of low-dimensional
embeddings is an empirical finding (starting with [56]). Specifically,
our theoretical analysis makes the key assumption that the set of all word vectors
is spatially isotropic, meaning that they have no preferred direction in space.
We will show that the low-rank fitting of the log co-occurrence matrix has a certain
denoising effect when the dimensionality of the word vectors is much smaller than
the vocabulary size. The theory in Chapter 7 also inspired the sense embeddings and
sentence embeddings developed in [18, 17].
1.3 Problems and Main Contributions
In this section, we define the machine learning problems that are studied in this
thesis and summarize the main results.
1.3.1 Sparse Coding
Sparse coding, also called dictionary learning, is a latent variable model for the
distribution of the observed data. A basic latent variable model describes the data
distribution p(y) by p(y) = ∫ pθ(y | x) pα(x) dx, where x is an unobserved latent variable
and θ and α are parameters that govern the conditional distribution and the
distribution of x, respectively. We are given n examples y(1), . . . , y(n) from the distribution
p(y), and the task is to learn the model parameter θ. Sometimes, in addition, we would
like to recover the parameter α and the posterior distribution p(x | y = y(j)) for each
example.
In sparse coding or dictionary learning, the latent variable x is a random sparse
vector in Rm. Conditioned on x, the data point y is
generated via a fixed dictionary A ∈ Rd×m by

y = Ax + ξ,
where ξ is a noise vector which is often assumed to be Gaussian. Thus this gives an
implicit definition of the distribution pA(y | x). Let A1, . . . , Am be the columns of
the matrix A, which are often called dictionary atoms. In words, we would like to
decompose the observed vector y into a sparse combination of the dictionary atoms
A1, . . . , Am.
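The generative model above can be sketched as follows; the dimensions, sparsity level, and Gaussian noise scale are illustrative assumptions of this toy sketch, not parameters from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 64, 256, 5                 # data dimension, # atoms, sparsity of x

A = rng.standard_normal((d, m))
A /= np.linalg.norm(A, axis=0)       # unit-norm dictionary atoms A_1, ..., A_m

def sample(n, noise_std=0.01):
    """Draw n samples y = A x + xi, where x is k-sparse."""
    Y = np.empty((n, d))
    for i in range(n):
        x = np.zeros(m)
        support = rng.choice(m, size=k, replace=False)  # random support of x
        x[support] = rng.standard_normal(k)             # random magnitudes
        Y[i] = A @ x + noise_std * rng.standard_normal(d)  # y = Ax + noise
    return Y

Y = sample(100)
```

The learning algorithms in Chapters 2 and 3 receive only Y and must recover A (up to permutation and sign flips of the atoms).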
Sparse coding was originally formulated by neuroscientists Olshausen and
Field [135] for the study of human visual systems. They gave experimental evidence
that it produces coding matrices for image patches that resemble known features
(such as Gabor filters) in the V1 portion of the visual cortex. It has been widely used
in computer vision and image processing such as segmentation, retrieval, de-noising
and super-resolution (see references in [61] and more discussions in Chapter 2).
In Section 2.1 of Chapter 2, under realistic assumptions on the true dictionary ma-
trix and the distribution of the latent variables, we give algorithms based on spectral
methods that provably return a coarse solution that approximates each dictionary
atom (up to permutation and sign flip) up to o(1) relative error in Euclidean distance.
In Chapter 3, we design and analyze various alternating minimization algorithms that
start from the coarse solutions obtained from the spectral algorithms, which provably
converge to much more accurate solutions in polynomial time.
1.3.2 Topic Models
Topic models [32] are latent variable models for the distribution of the bag of words
representation of the documents: Suppose we have a vocabulary of D words. Each
document is viewed as a collection of unordered words and can therefore be represented
as a vector in RD, with each entry being the frequency of the corresponding
word in the document. The document is assumed to be generated by the following
process. There are k topics, each of which is a distribution over words with a corresponding
probability vector Ai ∈ RD; thus Ai takes nonnegative values and sums
to 1. Each word of the document is assumed to be generated by first picking a topic i
from a distribution x ∈ Rk over the topics, and then picking a word according to the
corresponding distribution Ai. A simple calculation shows that this is equivalent to
assuming that each of the words in the document is drawn i.i.d. from the distribution
Ax, where A = [A1, . . . , Ak] is often called the word-topic matrix.
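The generative process above can be sketched as follows; the Dirichlet priors, vocabulary size, and document length are illustrative assumptions of this sketch, not part of the model definition.

```python
import numpy as np

rng = np.random.default_rng(2)
D, k, doc_len = 1000, 20, 300        # vocabulary size, # topics, words per doc

# Word-topic matrix A: each column A_i is a distribution over the D words.
A = rng.dirichlet(np.full(D, 0.5), size=k).T          # shape (D, k)

def sample_document():
    x = rng.dirichlet(np.full(k, 0.5))                # topic proportions
    word_dist = A @ x                                 # each word i.i.d. ~ Ax
    return rng.choice(D, size=doc_len, p=word_dist)   # word indices

doc = sample_document()
```

Learning recovers A from many such documents; inference recovers x for a single document given A.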
The learning problem here is to recover the word-topic matrix. Researchers have
developed various techniques for this problem, including singular value decomposition [56],
variational inference [33], MCMC [71], tensor decomposition [8], and anchor-word
algorithms [12].
However, there has been comparatively less progress on designing algorithms
with provable guarantees for the inference problem for topic models — given a doc-
ument y and the word-topic matrix A, how do we infer the topic proportion vector
x that was used to generate the document? The result in Section 2.2 takes a step
in this direction by providing convex optimization based algorithms for estimating
the topic proportion vector x. The algorithm is very simple and fast because it only
uses a (carefully chosen) linear transformation plus thresholding, but it only returns
coarse solutions that are not statistically optimal. The solutions can be refined by
running gradient ascent on maximum likelihood estimators.
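The inference step can be sketched as a linear map followed by thresholding; the matrix B below is a hypothetical linear inverse of A (the careful construction of B is the subject of Section 2.2), and the threshold level is illustrative.

```python
import numpy as np

def thresholded_linear_inverse(y, B, tau):
    """Coarse topic-proportion estimate: linear transformation, then threshold.

    B is a (hypothetical) carefully chosen linear inverse of the word-topic
    matrix A; tau is the threshold that zeroes out small, noisy coordinates.
    """
    z = B @ y                          # linear transformation of the document
    z = np.where(z >= tau, z, 0.0)     # thresholding
    s = z.sum()
    return z / s if s > 0 else z       # renormalize to a distribution

# Noiseless illustration with B taken to be the pseudoinverse of A:
rng = np.random.default_rng(4)
D, k = 50, 5
A = rng.dirichlet(np.full(D, 0.5), size=k).T   # word-topic matrix, (D, k)
x_true = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # true topic proportions
y = A @ x_true                                 # noiseless word frequencies
x_hat = thresholded_linear_inverse(y, np.linalg.pinv(A), tau=0.1)
```

With finite-length documents y is a noisy estimate of Ax, which is why the choice of B and tau requires the careful analysis of Section 2.2.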
1.3.3 Matrix Completion
Matrix completion is the problem of recovering a low-rank matrix from partially
observed entries. It has been widely used in collaborative filtering and recommender
systems [102, 146], dimension reduction [42] and multi-class learning [5].
The simplest setting of matrix completion is the following: Let M ∈ Rd×d be
the target matrix that we aim to recover. We assume that it has rank r ≪ d.
We assume that we observe the values of a set of entries of the matrix, denoted
by Ω = {(i, j) : Mi,j is observed}. Here Ω is often assumed to come from some
distribution, e.g., the uniform distribution over subsets of entries of a fixed size.
Our goal is to recover the underlying matrix M from these observations with as few
observations as possible.
There has been extensive work on designing efficient algorithms for matrix com-
pletion with guarantees. One earlier line of results (see [157, 159, 145, 45, 44] and
the references therein) rely on convex relaxations. These algorithms achieve strong
statistical guarantees but are relatively computationally expensive in practice.
In Chapter 5 of this thesis, we prove that the commonly used non-convex objective
function for positive semidefinite matrix completion has no spurious local minima —
all local minima must also be global. Therefore, many popular optimization algo-
rithms such as stochastic gradient descent can provably solve positive semidefinite
matrix completion with arbitrary initialization in polynomial time. The result can be
generalized to the setting when the observed entries contain noise. We believe that
our main proof strategy can be useful for understanding geometric properties of other
statistical problems involving partial or noisy observations. The result is built upon
recent progress for analyzing local improvement algorithms from good starting point
for matrix completion [97, 98, 93, 72, 75, 163, 176, 47, 151, 47]. See Chapter 5 for
more related work.
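To make the claim concrete, here is a minimal numpy sketch of the non-convex approach: parametrize the positive semidefinite matrix as XX⊤ with X ∈ Rd×r and run gradient descent on the squared error over the observed entries, starting from an arbitrary random initialization. The step size, iteration count, and observation probability are illustrative choices, not the setting analyzed in Chapter 5.

```python
import numpy as np

def grad_step(X, M, mask, lr):
    """One gradient step on f(X) = sum_{(i,j) in Omega} ((X X^T)_{ij} - M_{ij})^2."""
    R = mask * (X @ X.T - M)           # residual, restricted to observed entries
    return X - lr * 2.0 * (R + R.T) @ X

rng = np.random.default_rng(1)
d, r = 20, 2
Z = rng.standard_normal((d, r))
M = Z @ Z.T                            # the rank-r PSD target matrix
mask = (rng.random((d, d)) < 0.5).astype(float)
mask = np.maximum(mask, mask.T)        # a symmetric observation pattern Omega

X = 0.1 * rng.standard_normal((d, r))  # arbitrary (random) initialization
err0 = np.linalg.norm(mask * (X @ X.T - M)) / np.linalg.norm(mask * M)
for _ in range(1000):
    X = grad_step(X, M, mask, lr=0.005)
err = np.linalg.norm(mask * (X @ X.T - M)) / np.linalg.norm(mask * M)
```

In this toy run the relative error on the observed entries drops far below its initial value, consistent with the no-spurious-local-minima phenomenon described above.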
1.3.4 Learning Linear Dynamical Systems
As the name suggests, the problem of learning linear dynamical systems is a super-
vised learning problem that aims to estimate the underlying dynamical system that
maps a sequence of inputs x1, . . . , xT to a sequence of outputs y1, . . . , yT . Part of our
motivation for studying this problem comes from the desire to better understand
the optimization issues in sequence-to-sequence learning models such as recurrent neu-
ral networks or long short-term memory networks. If we remove all non-linear state
transitions from a recurrent neural network, we are left with the state transition
representation of a linear dynamical system.
To be sure, linear dynamical systems are also very important in their own right
and have been studied for many decades independently of machine learning within the
control theory community [114] and the learning problem in this context corresponds
to “linear dynamical system identification”. In the context of machine learning, linear
systems play an important role in numerous tasks. For example, their estimation
arises as subroutines of reinforcement learning in robotics [108], location and mapping
estimation in robotic systems [59], and estimation of pose from video [143].
More formally, we receive noisy observations generated by the following time-
invariant linear system:
ht+1 = Aht +Bxt
yt = Cht +Dxt + ξt
Here, A,B,C,D are linear transformations with compatible dimensions and we denote
by Θ = (A,B,C,D) the parameters of the system. The vector ht represents the
hidden state of the model at time t. Its dimension n is called the order of the system.
The stochastic noise variables ξt perturb the output of the system which is why the
model is called an output error model in control theory. We assume the variables are
drawn i.i.d. from a distribution with mean 0 and variance σ2.
We assume we have N pairs of sequences (x, y) as training examples,

S = {(x(1), y(1)), . . . , (x(N), y(N))}.

Each input sequence x ∈ RT of length T is drawn from a distribution and y is the
corresponding output of the system above generated from an unknown initial state
h. We allow the unknown initial state to vary from one input sequence to the next.
This only makes the learning problem more challenging.
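For concreteness, the data-generating process above can be sketched as follows, here with scalar inputs and outputs and illustrative dimensions (this is not the experimental setup of Chapter 6):

```python
import numpy as np

def generate_sequence(A, B, C, D, x, sigma, h0, rng):
    """Roll out h_{t+1} = A h_t + B x_t,  y_t = C h_t + D x_t + xi_t."""
    h = h0.copy()
    ys = []
    for xt in x:
        noise = sigma * rng.standard_normal()   # output noise xi_t
        ys.append(float(C @ h + D * xt + noise))
        h = A @ h + B * xt                      # hidden-state transition
    return np.array(ys)

rng = np.random.default_rng(2)
n, T = 3, 20                          # order of the system and sequence length
A = 0.5 * np.eye(n)                   # an (illustrative) stable transition matrix
B = rng.standard_normal(n)
C = rng.standard_normal(n)
D = 0.3
x = rng.standard_normal(T)            # input sequence drawn from a distribution
h0 = rng.standard_normal(n)           # unknown initial state, varying per sequence
y = generate_sequence(A, B, C, D, x, sigma=0.1, h0=h0, rng=rng)
```

Note that only (x, y) pairs would be available to the learner; A, B, C, D and the initial state h0 are hidden.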
In Chapter 6, we show that under structural assumptions on the input distribution
and the ground-truth parameters, stochastic gradient descent efficiently minimizes the
maximum likelihood objective of an unknown linear system given noisy observations
generated by the system. We also show that over-parameterization of the model can
relax many assumptions on the ground-truth parameters significantly by making the
landscape of the objective function easier for optimizers.
1.4 Previously Published Work
Several portions of this thesis are based on previously published joint work with
collaborators, which I will describe briefly below.
The material presented in Section 2.1 of Chapter 2 and Chapter 3 is based on the
joint paper [14] with Sanjeev Arora, Rong Ge and Ankur Moitra, previously published
in COLT 2015. Section 2.2 of Chapter 2 is based on the joint work [13] with Sanjeev
Arora, Rong Ge, Frederic Koehler, and Ankur Moitra, a preliminary version of which
is published in ICML 2016.
Chapter 5 contains material based on the joint work [69] with Rong Ge and Jason
D. Lee, which is published in NIPS 2016. Chapter 6 is based on the joint paper with
Moritz Hardt and Benjamin Recht [74] that will appear in JMLR. Chapter 7 is based
on the joint work [16] with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej
Risteski, which has been published in TACL.
1.5 Notations
Before proceeding to the thesis, we define some notations that are often used. Other
notations and terminologies will be defined when they first occur.
We use R to denote the real numbers, C to denote the complex numbers, and N =
{0, 1, . . . } to denote the natural numbers. Let [m] be a shorthand for {1, . . . ,m}.
Unless explicitly stated otherwise, O(·)-notation hides absolute multiplicative con-
stants. Concretely, every occurrence of O(x) is a placeholder for some function f(x)
that satisfies ∀x ∈ R, |f(x)| ≤ C|x| for some absolute constant C > 0. Similarly,
Ω(x) is a placeholder for a function g(x) that satisfies ∀x ∈ R, |g(x)| ≥ |x|/C for
some absolute constant C > 0. The notations ≲ and ≳ also hide absolute multiplicative
constants: a ≲ b means that there exists a universal constant C > 0 such that a ≤ Cb.
Throughout, we will use ‖·‖ to denote the Euclidean norm of a vector and the spectral
norm of a matrix. Let ‖·‖F denote the Frobenius norm of a matrix. For a matrix
A, let |A|∞ = max |Aij| be the infinity norm of the vectorized version of A. Let
|A|p→q = max‖x‖p=1 ‖Ax‖q be the ℓp → ℓq induced norm. Let tr(A) be the trace of a
square matrix A. For two matrices A and B with the same dimensions, let their inner
product be ⟨A,B⟩ = tr(A⊤B).
For a square matrix A, let λmax(A) and λmin(A) denote the largest and smallest
eigenvalues respectively. For a matrix A ∈ Rm×n, let σmax(A) be its largest
singular value and σmin(A) be its min{m,n}-th largest singular value. Let Idd denote
the identity matrix of dimension d × d; we omit the subscript when it is clear
from the context.
For symmetric matrices A and B, let A ⪰ B mean that A − B is positive semidef-
inite, and let A ⪯ B mean that B − A is positive semidefinite.
Let 1(·) denote the indicator function: for an event E, we have that 1(E) = 1 if
E happens and 1(E) = 0 otherwise. Let δij be a shorthand for 1(i = j).
Part I
Local Improvement Algorithms
Starting from Coarse Solutions
Chapter 2
Finding Coarse Solutions
In this chapter, we discuss two algorithms that give coarse solutions for the sparse
coding problem and the topic inference problem. The algorithm for sparse coding uses
singular value decomposition (SVD) on a weighted moment matrix of the data,
whereas the algorithm for topic model inference is simply a carefully chosen linear
transformation followed by thresholding. The local improvement algorithms can start
from these coarse solutions and converge to more accurate solutions (see Chapter 3).
2.1 Spectral Initialization for Sparse Coding
2.1.1 Introduction and Problem Definition
Sparse coding or dictionary learning consists of learning to express (i.e., code) a set of
input vectors, say image patches, as linear combinations of a small number of vectors
chosen from a large dictionary. It is a basic task in many fields. In signal processing,
a wide variety of signals turn out to be sparse in an appropriately chosen basis (see
references in [118]). In neuroscience, sparse representations are believed to improve
the energy efficiency of the brain by allowing most neurons to be inactive at any given
time. In machine learning, imposing sparsity as a constraint on the representation
is a useful way to avoid over-fitting. Additionally, methods for sparse coding can
be thought of as a tool for feature extraction and are the basis for a number of
important tasks in image processing such as segmentation, retrieval, de-noising and
super-resolution (see references in [61]), as well as a building block for some deep
learning architectures [144]. It is also a basic problem in linear algebra itself since it
involves finding a better basis.
The notion was introduced by neuroscientists Olshausen and Field [135], who for-
malized it as follows: given a dataset y(1), y(2), . . . , y(N) ∈ Rd, our goal is to find a set
of basis vectors A1, A2, . . . , Ar ∈ Rd and sparse coefficient vectors x(1), x(2), . . . , x(N) ∈
Rr that minimize the reconstruction error

∑_{i=1}^{N} ‖y(i) − A · x(i)‖₂² + ∑_{i=1}^{N} ρ(x(i))  (2.1.1)
where A is the d × r coding matrix whose j-th column is Aj and ρ(·) is a nonlin-
ear penalty function used to encourage sparsity. This objective is nonconvex
because both A and the x(i)'s are unknown. Their paper, as well as subsequent work,
chooses r to be larger than d (the so-called overcomplete case) because this allows greater
flexibility in adapting the representation to the data. We remark that sparse coding
should not be confused with the related — and usually easier — problem of finding
the sparse representations of the y(i)’s given the coding matrix A, variously called
compressed sensing or sparse recovery [40, 41].
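As an illustration, objective (2.1.1) can be written down directly. The formulation above leaves the penalty ρ unspecified; here we take the common ℓ1 choice ρ(x) = λ‖x‖₁ as one example.

```python
import numpy as np

def sparse_coding_objective(A, X, Y, lam=0.1):
    """Objective (2.1.1) with the example choice rho(x) = lam * ||x||_1.

    Y is d x N with columns y^(i); A is the d x r dictionary;
    X is r x N with columns x^(i). The objective is nonconvex in (A, X)
    jointly, although it is convex in each factor separately.
    """
    reconstruction = np.sum((Y - A @ X) ** 2)   # sum_i ||y^(i) - A x^(i)||_2^2
    penalty = lam * np.abs(X).sum()             # sum_i rho(x^(i))
    return reconstruction + penalty

rng = np.random.default_rng(0)
d, r, N = 10, 15, 5                             # overcomplete: r > d (illustrative)
A = rng.standard_normal((d, r))
X = rng.standard_normal((r, N))
Y = A @ X                                       # perfectly reconstructable data
value = sparse_coding_objective(A, X, Y)        # only the penalty term remains
```

The joint nonconvexity in (A, X) is exactly why local search on (2.1.1) needs the careful initialization developed in this section.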
Olshausen and Field also gave a local search/gradient descent heuristic for trying
to minimize the nonconvex energy function (2.1.1). They gave experimental evidence
that it produces coding matrices for image patches that resemble known features
(such as Gabor filters) in the V1 portion of the visual cortex.
There is a large gap between theory and practice in terms of how to initialize local
search algorithms to optimize objective (2.1.1). The usual approach is to set A randomly or
to populate its columns with samples.1 These often work, but we do not know how
to analyze them.
The main contribution of this section is a novel method for initializing
the local search algorithm. The initialization is guaranteed to give an estimate of the
dictionary which is o(1)-close in relative column-wise error. Such an initialization
suffices for us to use the local search algorithm in Section 3.2.3. For the analysis of the
local search algorithm from this initialization, we refer the reader to Chapter 3.
Our algorithm and analysis are based on the generative model proposed in [136]
(and also [112]), which places sparse coding in a more familiar probabilistic setting
whereby the data points y(i) are assumed to be probabilistically generated according
to a model y(i) = A∗ · x∗(i) + noise, where x∗(1), x∗(2), . . . , x∗(N) are samples from some
appropriate distribution and A∗ is an unknown code. We formalize this model below:
Generative model: We assume that each example is generated as y = A∗x∗ + ξ,
where A∗ is a ground-truth dictionary and x∗ is drawn from an unknown distribution
D such that, with probability 1, the support S = supp(x∗) has size at most
k.
We are given N examples y(1), . . . , y(N) generated as above. We use x∗(1), . . . , x∗(N)
and ξ(1), . . . , ξ(N) to denote the underlying coefficients and noise vectors that generate
the examples respectively. The goal is to estimate the ground-truth dictionary A∗ and
the coefficients x∗(1), . . . , x∗(N) from the examples as accurately as possible.
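One concrete distribution satisfying this informal description (the formal assumptions appear in Section 2.1.2) is a uniformly random size-k support with independent ±1 non-zero entries. A sampler for this special case might look like this:

```python
import numpy as np

def sample_example(A_star, k, sigma, rng):
    """Draw one example y = A* x* + xi.

    The support of x* is a uniformly random set of size k and the non-zero
    entries are independent +-1 signs, so E[x*_i | S] = 0, E[(x*_i)^2 | S] = 1,
    and |x*_i| is bounded away from zero on the support.
    """
    d, r = A_star.shape
    x = np.zeros(r)
    S = rng.choice(r, size=k, replace=False)             # the support of x*
    x[S] = rng.choice([-1.0, 1.0], size=k)
    xi = (sigma / np.sqrt(d)) * rng.standard_normal(d)   # spherical Gaussian noise
    return A_star @ x + xi, x

rng = np.random.default_rng(0)
d, r, k = 20, 30, 3
A_star = rng.standard_normal((d, r))
A_star /= np.linalg.norm(A_star, axis=0)     # unit-norm columns
y, x = sample_example(A_star, k, sigma=0.5, rng=rng)
```

The learner sees only the examples y; both A∗ and the coefficients x∗ are hidden.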
Outline of the rest of the section: In Section 2.1.2, we state the main assumptions
and results of this section. Section 2.1.3 summarizes the related work. Section 2.1.4
states the algorithm formally and provides the key intuition behind it. Section 2.1.5
gives the analysis in the infinite-sample case, and Section 2.1.6 completes
the proof of the main theorem by providing sample complexity bounds.
1Empirical evidence suggests that the latter choice is significantly better than the former.
2.1.2 Assumptions and Main Results
Since the sparse coding problem is very likely to be intractable for worst-case A∗ and
distributions of x∗, we will make a few assumptions on the true dictionary A∗ and
the distribution of the coefficients x∗ (similar to those in earlier papers [156, 10, 1]).2
We assume A∗ is an incoherent dictionary, since these are widespread in signal
processing [61] and statistics [57], and include various families of wavelets, Gabor
filters as well as randomly generated dictionaries.
Assumption 2.1.1 (Incoherence). We assume A∗ is µ-incoherent in the sense that
each column of A∗ has unit ℓ2 norm and the maximum pairwise inner product is
bounded by

|⟨A∗i, A∗j⟩| ≤ µ/√d.  (2.1.2)

We also assume ‖A∗‖ ≲ √(r/d).
We make the following (relatively weak) assumptions on the distribution of the sup-
port of x∗. We require that the non-zero coordinates in S have bounded pairwise
correlation.
Assumption 2.1.2. With probability 1, the support S = supp(x∗) has size at most
k. The marginals of the support S satisfy qi = Pr[i ∈ S] = Θ(k/r) and qi,j = Pr[i, j ∈
S] = Θ(k²/r²).
Conditioned on the support S, we assume the non-zero entries of x∗ to be inde-
pendent. Furthermore, we assume that the non-zero entries are bounded away from
zero, namely, that they have an absolute lower bound.
Assumption 2.1.3. Conditioned on the choice of S, the coordinates of x∗S are in-
dependent and their marginals satisfy ∀i ∈ S, E[x∗i | S] = 0 and E[(x∗i)² | S] = 1.
Moreover, conditioned on i ∈ S, with probability 1, |x∗i| ≥ C for some absolute con-
stant C ∈ (0, 1].

2Recently, Hazan and the author made progress on worst-case guarantees without generative
models [79], but the problem definition there is changed into an improper setting.
We remark that the casual reader should just think of x∗ as being drawn from some
distribution that has independent coordinates. Even in this simpler setting —which
has polynomial time algorithms using Independent Component Analysis—we do not
know of any rigorous analysis of heuristics like Olshausen-Field. The earlier papers
were only interested in polynomial-time algorithms, so did not wish to assume inde-
pendence.
Finally, throughout this paper, we will assume the following regime of the parameters
k, r, d. Again, r can be allowed to be larger by lowering the sparsity.

Assumption 2.1.4 (Regime of parameters). We assume throughout this chapter that
k ≤ √d/(µ log d) and r²k² log³ r ≤ ρd³ for some small enough absolute constant ρ.

We remark that the assumption r²k² log³ r ≤ ρd³ is slightly weaker than the assump-
tion that r ≲ d. Therefore, casual readers can think of our parameter regime as
k ≪ √d and r ≲ d (assuming µ is a constant and ignoring logarithmic factors).
Before stating the main theorem of this section, we first define the measure of
closeness that we use:

Definition 2.1.5. A is δ-close to A∗ if there is a permutation π : [r] → [r] and a
choice of signs σ : [r] → {±1} such that ‖σ(i)Aπ(i) − A∗i‖ ≤ δ for all i.

This is a natural measure to use since we can only hope to learn the columns of A∗ up
to relabeling and sign flips. Our main theorem shows that we can learn a dictionary
A that is 1/log d-close to the true A∗ in this measure.
Theorem 2.1.6. Under Assumptions 2.1.1, 2.1.2, 2.1.3, and 2.1.4, for a sufficiently large
constant c, given N = c(kr log⁴ d + σ⁴r² log⁴ d/d) examples, there exists an algorithm
(Algorithm 1) that returns a matrix A that is O(ρ/log d)-close to the ground-truth dictio-
nary A∗ in O(rd²N) time. Here ρ is the arbitrarily small absolute constant defined in
Assumption 2.1.4.
We establish various building blocks towards the proof of the main theorem in the
following sections and finally formally prove it in Section 2.1.6.
2.1.3 Related Work and Notes
A common thread in recent work on sparse coding is to assume a generative model;
the precise details vary, but each has the property that given enough samples the
solution is essentially unique. [156] gave an algorithm that succeeds when A∗ has full
column rank (in particular, m ≤ n), which works up to sparsity roughly √n. However,
this algorithm is not applicable in the more prevalent overcomplete setting.
[10] and [2, 1] independently gave algorithms for the overcomplete case assuming
that A∗ is µ-incoherent (as defined above). The former gave an
algorithm that works up to sparsity n^{1/2−γ}/µ for any γ > 0, but the running time is
n^{Θ(1/γ)}.
[2, 1] gave an algorithm that works up to sparsity either n^{1/4}/µ or n^{1/6}/µ depend-
ing on the particular assumptions of the model. These works also analyze alternat-
ing minimization, but assume that it starts from an estimate A that is column-wise
1/poly(n)-close to A∗, in which case the objective function is essentially convex.
[23] gave a new approach based on the sum-of-squares hierarchy that works for
sparsity up to n^{1−γ} for any γ > 0. But in order to output an estimate that is column-
wise ε-close to A∗, the running time of the algorithm is n^{1/ε^{O(1)}}. In most applications,
one needs to set (say) ε = 1/k to get a useful estimate; in this case their
algorithm runs in exponential time. The sample complexity of the above algorithms is
also rather large, at least Ω(m²) if not much larger. Here we will give simple and
more efficient algorithms based on alternating minimization whose column-wise error
decreases geometrically, and that work for sparsity up to n^{1/2}/(µ log n). We remark that
even empirically, alternating minimization does not appear to work much beyond this
bound.
2.1.4 The Spectral Algorithm and Key Observation
Our algorithm works by reweighting the second moments and then applying a spectral
decomposition. We introduce the idea using the noiseless case. Let u = A∗α and v = A∗α′
be two samples from our model whose supports are U and V respec-
tively, and assume U and V have a singleton intersection U ∩ V = {i}. The main idea
is that when this happens, we can reweight a fresh sample y with the factor ⟨y, u⟩⟨y, v⟩ and
compute the weighted second moment Muv defined as follows:

Muv := E[⟨y, u⟩⟨y, v⟩yy⊤].  (2.1.3)

Intuitively, yy⊤ is a linear combination of A∗jA∗j⊤ for all j ∈ [r], and ⟨y, u⟩⟨y, v⟩
puts a higher weight on those yy⊤ with a larger contribution in the direction of A∗iA∗i⊤,
since both u and v have a non-trivial correlation with A∗i. As a result, the top singular
vector of Muv will be close to A∗i.
Of course, we do not have such a pair u, v in hand to start with. Fortunately,
using the top two singular values of Muv, we can test whether u and v have the property
that |U ∩ V| = 1. Therefore we repeatedly choose pairs u, v at random until we have
recovered all the dictionary atoms. The procedure is summarized in the following
algorithm (with an additional correction term which is necessary for the noisy case):
The key idea of the algorithm discussed above is formalized in the following
proposition. We will invoke this proposition several times in the analysis of Algo-
rithm 1: to verify whether or not the supports U and V share a common element,
Algorithm 1 Weighted Spectral Initialization for Sparse Coding

Input: A set of N1 + N2 examples, split into two sets with N1 and N2 examples each.
Output: A matrix A that approximates the true dictionary A∗.

1. Set L = ∅.
2. While |L| < r:
   (a) Choose two samples u and v uniformly at random from the first N1 examples.
   (b) Let Puv = (uv⊤ + vu⊤ + ⟨u, v⟩Idd + (d + 2)Idd)σ⁴/d².
   (c) Use the second set of N2 examples to compute an empirical estimate of Muv, denoted by M̂uv.
   (d) Compute the top two singular values σ1, σ2 and the top singular vector z of M̂uv − Puv.
   (e) If σ1 ≳ k/r and σ2 ≤ k/(r log r): if z is not within distance 1/log r of any vector in L (even after a sign flip), add z to L.
3. Output the matrix A whose columns are the vectors in L.
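The reweighting step at the heart of Algorithm 1 can be exercised numerically. The sketch below implements only the core computation (the weighted moment and its top singular vector) in the noiseless case, where Puv = 0, using a random dictionary and illustrative parameters; it is not the full algorithm with the singular-value test and the list L.

```python
import numpy as np

def weighted_moment(Y, u, v):
    """Empirical estimate of M_uv = E[<y,u><y,v> y y^T]; columns of Y are samples."""
    w = (Y.T @ u) * (Y.T @ v)               # weights <y_i, u><y_i, v>
    return (Y * w) @ Y.T / Y.shape[1]

rng = np.random.default_rng(3)
d, r, k, N = 200, 40, 3, 30000

A = rng.standard_normal((d, r))
A /= np.linalg.norm(A, axis=0)              # unit-norm, nearly incoherent columns

# k-sparse sign coefficients with uniformly random supports.
X = np.zeros((r, N))
supp = np.argsort(rng.random((N, r)), axis=1)[:, :k]
X[supp.ravel(), np.repeat(np.arange(N), k)] = rng.choice([-1.0, 1.0], size=N * k)
Y = A @ X                                    # noiseless samples, so P_uv = 0

# Two "samples" whose supports intersect exactly in {0}.
u = A[:, [0, 1, 2]] @ np.array([1.0, 1.0, -1.0])
v = A[:, [0, 3, 4]] @ np.array([1.0, -1.0, 1.0])

M = weighted_moment(Y, u, v)
z = np.linalg.svd(M)[0][:, 0]                # top singular vector of the moment
alignment = abs(z @ A[:, 0])                 # correlation with the shared atom
```

In runs of this kind the top singular vector correlates strongly with the shared dictionary element A∗1, matching the intuition described above.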
and again to show that, if they do, we can approximately recover the corresponding
column of A∗ from the top singular vector of Muv.
Proposition 2.1.7. Let u = A∗α + ξ and v = A∗α′ + ξ′ be two examples. Define
Puv = (uv⊤ + vu⊤ + ⟨u, v⟩Idd + (d + 2)Idd)σ⁴/d². Let U = supp(α) and V = supp(α′),
and let β = A∗⊤u and β′ = A∗⊤v. Let ci, qi be defined as in Assumptions 2.1.2
and 2.1.3 and ρ be defined as in Assumption 2.1.4. Then, we have that

Muv − Puv = ∑_{i∈U∩V} qi ci βi β′i A∗i A∗i⊤ + E  (2.1.4)

where E is bounded by

‖E‖ ≲ ρk/(r log d).  (2.1.5)
As alluded to before, a direct consequence of Proposition 2.1.7 is that when u and
v share a unique dictionary element, the top singular value of Muv stands out and
the top singular vector is close to a true dictionary element, as formalized in the
corollary below.

Corollary 2.1.8. In the setting of Proposition 2.1.7, suppose u = A∗α + ξ and
v = A∗α′ + ξ′ are two random samples such that supp(α) ∩ supp(α′) = {i}. Then, the
top singular vector of Muv is O(ρ/log d)-close to A∗i.

Proof. When u and v share the unique dictionary element i, the first term on the
RHS of (2.1.4) simply reduces to qi ci βi β′i A∗i A∗i⊤. Moreover, follow-
ing from Proposition 2.1.7 and from the assumptions that ci ≥ 1 and qi = Ω(k/r)
(Assumption 2.1.2), the coefficient qi ci βi β′i is at least Ω(k/r). Using the noise bound
(2.1.5), by a singular value perturbation theorem (for example, Wedin's The-
orem 8.2.3 in Section 8.2), we have that the top singular vector of Muv − Puv is
O(ρk/(r log d))/Ω(k/r) = O(ρ/log d)-close to A∗i, which completes the proof.
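The perturbation step in this proof can be checked numerically: for a rank-one matrix with singular-value gap 0.5, a perturbation of spectral norm 0.01 moves the top singular vector by roughly O(0.01/0.5). The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 100
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                      # the direction we hope to recover
signal = 0.5 * np.outer(a, a)               # rank-one matrix, top singular value 0.5
E = rng.standard_normal((d, d))
E *= 0.01 / np.linalg.norm(E, 2)            # perturbation with spectral norm 0.01
z = np.linalg.svd(signal + E)[0][:, 0]      # top (left) singular vector
closeness = min(np.linalg.norm(z - a), np.linalg.norm(z + a))
```

The minimum over the sign flip mirrors the sign ambiguity in Definition 2.1.5.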
We will prove Proposition 2.1.7 in the rest of this section. In Section 2.1.5, we will
prove the correctness of Algorithm 1 in the infinite-sample regime. Fully analyzing
Algorithm 1 and proving Theorem 2.1.6 requires carefully bounding the difference
between the empirical estimate M̂uv and Muv, which will be handled in Section 2.1.6.
Towards Proposition 2.1.7, we first state the following lemma, which gives an
explicit expression for Muv.

Lemma 2.1.9. In the setting of Proposition 2.1.7, we have that

Muv = Puv + ∑_{i∈U∩V} qi ci βi β′i A∗i A∗i⊤ + ∑_{i∈[r]\(U∩V)} qi ci βi β′i A∗i A∗i⊤
      + ∑_{i,j∈[r], i≠j} qi,j (βi β′i A∗j A∗j⊤ + βi β′j A∗i A∗j⊤ + β′i βj A∗i A∗j⊤)  (2.1.6)
Proof of Lemma 2.1.9. First of all, recall that y = A∗x∗ + ξ. Plugging this into
the definition of Muv (equation (2.1.3)), we have

Muv = E_S[E_{x∗S}[⟨u, A∗S x∗S + ξ⟩⟨v, A∗S x∗S + ξ⟩(A∗S x∗S + ξ)(x∗S⊤ A∗S⊤ + ξ⊤) | S]]
      (by the law of total expectation)
    = E_S[E_{x∗S}[⟨β, x∗S⟩⟨β′, x∗S⟩ A∗S x∗S x∗S⊤ A∗S⊤ | S]]
      + E[⟨u, ξ⟩⟨v, ξ⟩ξξ⊤] + E[⟨β, x∗S⟩⟨β′, x∗S⟩ξξ⊤]  (2.1.7)

where the last line follows from E[x∗S] = 0, the fact that ξ has mean zero, and the
independence of x∗ and ξ.
Note that since ξ ∼ N(0, (σ²/d)Idd), we have

E[⟨u, ξ⟩⟨v, ξ⟩ξξ⊤] + E[⟨ξ, ξ⟩ξξ⊤] = (uv⊤ + vu⊤ + ⟨u, v⟩Idd + (d + 2)Idd)σ⁴/d².  (2.1.8)

Then, replacing A∗x∗ by ∑_i A∗i x∗i and expanding the sum, using the fact that the
entries of x∗S are independent and have mean zero, we have that

E_S[E_{x∗S}[⟨β, x∗S⟩⟨β′, x∗S⟩ A∗S x∗S x∗S⊤ A∗S⊤ | S]]
= E_S[∑_{i∈S} ci βi β′i A∗i A∗i⊤ + ∑_{i,j∈S, i≠j} (βi β′i A∗j A∗j⊤ + βi β′j A∗i A∗j⊤ + β′i βj A∗i A∗j⊤)]
= ∑_{i∈[r]} qi ci βi β′i A∗i A∗i⊤ + ∑_{i,j∈[r], i≠j} qi,j (βi β′i A∗j A∗j⊤ + βi β′j A∗i A∗j⊤ + β′i βj A∗i A∗j⊤).  (2.1.9)

Equation (2.1.6) then follows from combining equations (2.1.8), (2.1.9), and (2.1.7)
above.
Next, in preparation for bounding the error term in Proposition 2.1.7, we first state
a simple lemma that controls the singular values of the submatrices of A∗. It is a
direct consequence of the incoherence assumption 2.1.1.

Lemma 2.1.10. Under Assumptions 2.1.1 and 2.1.4, we have that for any subset
S ⊂ [r] of size at most k,

σmin(A∗S) ≥ 1/2, and σmax(A∗S) ≤ 3/2.
Proof. We apply the Gershgorin circle theorem to the matrix A∗S⊤A∗S. This is a matrix
of size |S| × |S|. By Assumption 2.1.1, the (i, j)-th off-diagonal entry ⟨A∗i, A∗j⟩ is
bounded by µ/√d in absolute value. Thus, in every row, the sum of the off-diagonal
entries in absolute value is at most kµ/√d. All the diagonal entries are of the form
⟨A∗i, A∗i⟩, which is equal to 1. Since kµ/√d ≤ 1/2 by Assumption 2.1.4, the Gershgorin
circle theorem implies that σmax(A∗S⊤A∗S) ≤ 1 + 1/2 = 3/2 and σmin(A∗S⊤A∗S) ≥
1 − 1/2 = 1/2. The lemma follows by taking square roots.
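A quick numerical sanity check of this Gershgorin argument on a random dictionary (random unit columns are incoherent with high probability), with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r, k = 400, 100, 5
A = rng.standard_normal((d, r))
A /= np.linalg.norm(A, axis=0)              # random unit columns, incoherent whp

S = rng.choice(r, size=k, replace=False)
G = A[:, S].T @ A[:, S]                     # Gram matrix of the submatrix A*_S
radius = np.abs(G - np.eye(k)).sum(axis=1).max()   # largest Gershgorin radius
svals = np.linalg.svd(A[:, S], compute_uv=False)
```

Gershgorin places every eigenvalue of the Gram matrix in [1 − radius, 1 + radius], so the singular values of A[:, S] land in [√(1 − radius), √(1 + radius)], well inside [1/2, 3/2] here.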
Next, we establish some useful properties of β and β′, in preparation for bounding
the error terms in equation (2.1.6) (the terms on the second line of (2.1.6)).

Claim 2.1.11. With high probability it holds that (a) for each i ∈ [r] we have
|βi − αi| ≲ µ(k + σ) log r/√d, and (b) ‖β‖ ≲ √(r(k + σ)/d). In particular, since the
difference between βi and αi is o(1) for our setting of parameters, we conclude that
if αi ≠ 0 then C − o(1) ≤ |βi|, and if αi = 0 then |βi| ≲ µ(k + σ) log r/√d.
Proof. Recall that U is the support of α and let R = U\{i}. Then:

βi − αi = A∗i⊤A∗U αU + A∗i⊤ξ − αi = A∗i⊤A∗R αR + A∗i⊤ξ.

To bound the two terms on the RHS of the equation above, we first note that A∗i⊤ξ is
a Gaussian random variable with variance ‖A∗i‖²σ²/d = σ²/d. Therefore, with high
probability, |A∗i⊤ξ| ≲ (σ/√d) · log r. Since A∗ is incoherent, we have that ‖A∗i⊤A∗R‖ ≤
µ√(k/d). Moreover, recall that the entries of αR are independent subgaussian
random variables. Therefore, by Hoeffding's inequality (see Theorem 8.1.2), we have
that, with high probability, |⟨A∗i⊤A∗R, αR⟩| ≲ µk log r/√d, which implies the first part of
the claim.
For the second part, we can bound ‖β‖ ≤ ‖A∗‖‖A∗U‖‖α‖ + ‖A∗‖‖ξ‖. Since α is
a k-sparse vector with independent subgaussian entries, with high probability
‖α‖ ≲ √k. It follows that with high probability ‖β‖ ≲ √(r(k + σ)/d).
Now we are ready to bound the error terms in equation (2.1.6).

Lemma 2.1.12. In the setting of Lemma 2.1.9, let

E1 = ∑_{i∉U∩V} qi ci βi β′i A∗i A∗i⊤,
E2 = ∑_{i,j∈[r], i≠j} qi,j βi β′i A∗j A∗j⊤,
E3 = ∑_{i,j∈[r], i≠j} qi,j (βi A∗i β′j A∗j⊤ + β′i A∗i βj A∗j⊤).

Then, we have that ‖E1‖, ‖E2‖, ‖E3‖ ≲ ρk/(r log d).

Proof. Let R = [r]\(U ∩ V). Then we can rewrite E1 = A∗R D1 A∗R⊤, where D1 is a
diagonal matrix whose entries are qi ci βi β′i for i ∈ R.
As a preparation, we first bound ‖D1‖. To this end, we invoke the first
part of Claim 2.1.11 to conclude that |βi β′i| ≲ µ²(k + σ)² log² r/d. Also recall that by
Assumptions 2.1.2 and 2.1.3, we have qi ci ≲ k/r. Therefore

‖D1‖ ≲ µ²(k + σ)² k log² r/(rd).
Since ‖A∗R‖ ≤ ‖A∗‖ ≲ √(r/d), we have that

‖E1‖ ≤ ‖A∗R‖‖D1‖‖A∗R‖ ≲ µ²(k + σ)² k r log² r/d³ ≲ ρk/(r log r),  (2.1.10)

where the last inequality uses the assumptions that σ ≤ k and k²r² log³ r ≤ ρd³ (see
Assumption 2.1.4).
The second term E2 is a sum of positive semidefinite matrices, and we will make
crucial use of this fact below:

E2 = ∑_{i≠j} qi,j βi β′i A∗j A∗j⊤
   ⪯ O(k²/r²)(∑_{i∈[r]} |βi β′i|)(∑_{j∈[r]} A∗j A∗j⊤)
     (by qi,j ≲ k²/r² and then completing the square)
   ⪯ O(k²/r²)‖β‖‖β′‖ A∗A∗⊤.  (by the Cauchy–Schwarz inequality)

We can now invoke the second part of Claim 2.1.11 and conclude that ‖E2‖ ≤
O(k²/r²)‖β‖‖β′‖‖A∗‖² ≲ r(k + σ)k²/d³ ≲ ρk/(r log r), where we have used
Assumption 2.1.4 in the last inequality.
For the third error term E3, by symmetry we need only consider terms of the
form qi,j βi β′j A∗i A∗j⊤. We can collect these terms and write them as A∗QA∗⊤, where
Qi,j = 0 if i = j and Qi,j = qi,j βi β′j if i ≠ j. First, we bound the Frobenius norm of
Q by the Cauchy–Schwarz inequality:

‖Q‖F = √(∑_{i≠j, i,j∈[r]} q²i,j β²i (β′j)²) ≲ √(k⁴/r⁴ (∑_{i∈[r]} β²i)(∑_{j∈[r]} (β′j)²)) ≲ k²/r² · ‖β‖‖β′‖.

Finally, we have that

‖E3‖ ≤ 2‖A∗‖²‖Q‖ ≲ (r/d) · (k²/r²) · (r(k + σ)/d) ≲ ρk/(r log r),

where the inequality in the middle uses the bounds in Claim 2.1.11 and the last
inequality uses Assumption 2.1.4. This completes the proof.
Finally, combining Lemma 2.1.12 and Lemma 2.1.9 gives the proof of Proposi-
tion 2.1.7.
2.1.5 Infinite Samples Case
In this section, we prove an infinite-sample version of Theorem 2.1.6 by repeatedly
invoking Proposition 2.1.7. In Section 2.1.6, we will give the full proof of Theorem 2.1.6
by considering the finite-sample case.

Theorem 2.1.13. In the setting of Theorem 2.1.6, if Algorithm 1 has access to Muv
(defined in equation (2.1.3)) instead of the empirical average M̂uv, then with high
probability A is δ-close to A∗, where δ ≲ ρ/log d.
Towards proving Theorem 2.1.13, we first show that the test based on singular
values (that is, σ1 ≳ k/r and σ2 ≤ k/(r log r) in Algorithm 1) can successfully
determine whether u and v share a common dictionary atom, as desired.

Lemma 2.1.14. In the setting of Theorem 2.1.6, given two samples u = A∗α + ξ
and v = A∗α′ + ξ′, if the top singular value of Muv − Puv is at least Ω(k/r) and
the second largest singular value is at most k/(r log r), then with high probability
|supp(α) ∩ supp(α′)| = 1.
Proof. We prove this by contradiction. Assume that |supp(α) ∩ supp(α′)| ≠ 1. We
further divide the analysis into two cases.
First, suppose |supp(α) ∩ supp(α′)| = 0. Then the term ∑_{i∈U∩V} qi ci βi β′i A∗i A∗i⊤
on the RHS of (2.1.4) is zero. Since ‖E‖ ≲ ρk/(r log d) by Proposition 2.1.7, we get a
contradiction with σ1(Muv − Puv) ≳ k/r.
Second, suppose U = supp(α) and V = supp(α′) share more than one dictionary
element. Let S = U ∩ V, so |S| ≥ 2. We rewrite Muv − Puv as Muv − Puv =
A∗S DS A∗S⊤ + E, where DS is a diagonal matrix whose entries are equal to qi ci βi β′i. All
diagonal entries of DS have magnitude at least Ω(k/r). By the incoherence assumption
(Assumption 2.1.1), the smallest singular value of A∗S is at least
1/2, and therefore the second largest singular value of A∗S DS A∗S⊤ is at least

σ2(A∗S DS A∗S⊤) ≥ σmin(A∗S)² σ2(DS) ≳ k/r.

It follows by Weyl's theorem (see Theorem 8.2.1 in Section 8.2) that σ2(Muv − Puv) ≥
σ2(A∗S DS A∗S⊤) − ‖E‖ ≳ k/r, which contradicts the assumption that σ2(Muv −
Puv) ≤ k/(r log r).
Now we are ready to prove Theorem 2.1.13. The idea is that every vector added to
the list L will be close to one of the dictionary elements (by Lemma 2.1.14), and that for
every dictionary element, the list L contains at least one close vector, because we have
enough random samples.

Proof of Theorem 2.1.13. By Lemma 2.1.14, we know every vector added into L must
be close to one of the dictionary elements. On the other hand, for any dictionary ele-
ment A∗i, by the bounded moment assumption on the distribution of x∗ (Assumption 2.1.3),
we know

Pr[U ∩ V = {i}] = Pr[i ∈ U] Pr[i ∈ V] Pr[(U ∩ V)\{i} = ∅ | i ∈ U, i ∈ V]
≥ Pr[i ∈ U] Pr[i ∈ V](1 − ∑_{j≠i, j∈[r]} Pr[j ∈ U ∩ V | i ∈ U, i ∈ V])  (by the union bound)
= Ω(k²/r²) · (1 − r · O(k²/r²))
= Ω(k²/r²),

where the last equality uses the assumption k < √r (Assumption 2.1.4).
Therefore, given O((r² log² d)/k²) trials, with high
probability there is a pair u, v that intersects uniquely at i, for every i ∈ [r]. By
Corollary 2.1.8, this implies that L must contain at least one vector that
is close to A∗i, for every dictionary element.
Finally, since all the dictionary elements have pairwise distance at least 1/2 (by incoher-
ence), the connected components of L correctly identify the different dictionary ele-
ments. Hence, the output A must be O(ρ/log d)-close to A∗.
2.1.6 Sample Complexity
Here we show that with only Õ(kr) samples, the difference between the true Muv matrix
and the estimated M̂uv matrix is already small enough.
Proposition 2.1.15. In the setting of Theorem 2.1.6 and Proposition 2.1.7, let M̂uv
be the empirical estimate of Muv with N2 examples. Then, we have that with high
probability

‖M̂uv − Muv‖ ≲ k² log⁴ d/N2 + √(k³ log⁴ d/(rN2)) + √(k² log⁴ d σ⁴/(dN2)).  (2.1.11)
Recall that if we use y(1), . . . , y(N2) to denote the examples used, then

M̂uv = (1/N2) ∑_{i=1}^{N2} ⟨y(i), u⟩⟨y(i), v⟩ y(i) y(i)⊤  (2.1.12)

is an average of independent matrix random variables. We will use an extension of the ma-
trix Bernstein inequality (Corollary 8.1.4 in Section 8.1.1) to control the fluctuation.
In preparation for applying it, we prove the following claims, which bound the norms
and variances of the summands on the RHS of (2.1.12). We start with the individual
spectral norms.

Claim 2.1.16. In the setting of Proposition 2.1.15, we have that with high probability,
|⟨u, y⟩| ≲ √k log d and ‖y‖ ≲ √(k log d). As a direct consequence, with high probability,
we have

‖⟨u, y⟩⟨v, y⟩yy⊤‖ ≲ k² log³ d.

Proof. Recall that u = A∗α + ξu = A∗S αS + ξu where S = supp(α), and y = A∗R x∗R + ξ
where R = supp(x∗). Because α is k-sparse and has subgaussian non-zero entries, and
‖A∗S‖ ≲ 1 (by Lemma 2.1.10), we have that ‖u‖ ≤ ‖A∗S‖‖αS‖ + ‖ξ‖ ≲ √(k log d) + σ.
The same bound holds for y as well, because they come from the same distribution.
Next, we write |⟨u, y⟩| = |⟨A∗R⊤u, x∗R⟩ + ⟨u, ξ⟩|. Note that with high probability,

‖A∗R⊤u‖ ≤ ‖A∗R‖‖u‖ ≲ ‖u‖  (by ‖A∗R‖ ≲ 1, as in Lemma 2.1.10)
         ≤ √(k log d) + σ  (by ‖u‖ ≲ √(k log d) + σ)
         ≲ √(k log d).  (by σ ≤ √k, Assumption 2.1.4)

Since x∗R has independent subgaussian entries, it follows that with high probability
|⟨A∗R⊤u, x∗R⟩| ≲ √k log d, and similarly |⟨u, ξ⟩| ≲ √k log d, which gives the claimed
bound on |⟨u, y⟩|. The consequence then follows from ‖⟨u, y⟩⟨v, y⟩yy⊤‖ ≤
|⟨u, y⟩| · |⟨v, y⟩| · ‖y‖².
Next we bound the variances of the summands in equation (2.1.12).
Claim 2.1.17. In the setting of Proposition 2.1.15, we have

‖E[⟨u, y⟩²⟨v, y⟩² yy⊤yy⊤]‖ ≲ k² log⁴ d · (k/r + σ⁴/d).

Proof. By Claim 2.1.16, we have that

E[⟨u, y⟩²⟨v, y⟩² yy⊤yy⊤] = E[⟨u, y⟩²⟨v, y⟩² ‖y‖² yy⊤]
⪯ O(k log d) · E[⟨u, y⟩²⟨v, y⟩² yy⊤]   (by ‖y‖² ≲ k log d in Claim 2.1.16)
⪯ O(k² log³ d) · E[⟨v, y⟩² yy⊤].   (by ⟨u, y⟩² ≲ k log² d in Claim 2.1.16)

On the other hand, notice that E[⟨v, y⟩² yy⊤] = Mvv, and using Proposition 2.1.7 we have that ‖Mvv − Pvv‖ ≲ k/r. Moreover, we have that ‖Pvv‖ ≲ σ⁴/d. Therefore,

‖E[⟨v, y⟩² yy⊤]‖ ≤ ‖Mvv − Pvv‖ + ‖Pvv‖ ≲ k/r + σ⁴/d.

Therefore, altogether we obtain that

‖E[⟨u, y⟩²⟨v, y⟩² yy⊤yy⊤]‖ ≲ k² log⁴ d · (k/r + σ⁴/d).   (2.1.13)
Proof of Proposition 2.1.15. Now we can apply the matrix Bernstein inequality and conclude that with high probability,

‖M̂uv − Muv‖ ≲ (k² log⁴ d)/N2 + √((k³ log⁴ d)/(rN2)) + √((k² σ⁴ log⁴ d)/(dN2)).
Finally, we are ready to prove the main theorem of Section 2.1.
Proof of Theorem 2.1.6. First of all, the conclusion of Proposition 2.1.7 remains true for M̂uv with N2 examples. To see this, we can simply write

M̂uv − Puv = qi ci βi β′i A∗i A∗i⊤ + E + (M̂uv − Muv),

where E is the same as in the proof of Proposition 2.1.7, and the last two terms E + (M̂uv − Muv) constitute the perturbation: we can view M̂uv − Muv as an additional perturbation term of the same magnitude. We have that when U ∩ V = {i}, the top singular vector of M̂uv is O(ρ/ log d)-close to A∗i. Similarly, we can prove that the conclusion of Lemma 2.1.14 is also true for M̂uv. Note that we actually choose N2 such that the perturbation of M̂uv matches the noise level in Lemma 2.1.14:
(k² log⁴ d)/N2 + √((k³ log⁴ d)/(rN2)) + √((k² log⁴ d · σ⁴)/(dN2)) ≲ k/(r log r).

Here we use the fact that N2 ≥ c(kr log⁴ d + σ⁴r² log⁴ d/d) for a sufficiently large absolute constant c. With these perturbation bounds in hand, the rest of the proof follows exactly that of the infinite-sample case given in Theorem 2.1.13.
2.2 Convex Initialization for Topic Modeling Inference
Recently, there has been considerable progress on designing algorithms with provable guarantees — typically using linear algebraic methods — for parameter learning in latent variable models, including the sparse coding problem discussed in Section 2.1. But designing provable algorithms for inference has proven to be more challenging.
In this section, we take a first step towards provable inference in topic models. We design an initialization algorithm based on linear programming that approximately infers the topic coefficients (loadings) for a document. The initialization provably recovers the support of the topic coefficients under realistic assumptions on the word-topic matrix. Starting from this initialization, we can solve the inference problem by optimizing the maximum likelihood estimator under the correct support — which turns out to be a convex problem.
2.2.1 Introduction and Main Results
Recently, there has been considerable progress on designing new algorithms for parameter learning with provable guarantees. Since the usual maximum likelihood
estimator is often NP-hard to compute even in simple models, these new algorithms
use alternative estimators based on the method of moments and linear algebra. Their
analysis usually involves making a structural assumption about the parameters of the
problem, which can often be justified in applications. Some highlights include algo-
rithms for topic modeling [10, 6], learning mixture models [129, 87, 68], community
detection [7] and (special cases of) deep learning [11, 94].
But there has been comparatively little progress on designing algorithms with provable guarantees for inference. This section takes a first step in this direction, in the context of topic models. Our algorithms leverage a property of topic models (Definition 2.2.3) that turns out to hold in many datasets — the existence of a good approximate inverse matrix.
We also give empirical results that demonstrate that our algorithm works on
realistic topic models. On synthetic data, its error is competitive with state-of-the-
art approaches (which have no such provable guarantees). It obtains somewhat weaker
results on real data.
Here we describe topic modeling, and why inference appears more difficult than
parameter learning. In topic modeling, each document is represented as a bag of
words where we ignore the order in which words occur. The model assumes there
is a fixed set of k topics, each of which is a distribution over words. Thus the ith
topic is a vector Ai ∈ RD (where D is the number of words in the language) whose
coordinates are nonnegative and sum to 1. Each document is generated by first
picking its topic proportions from some distribution; say xi is the proportion of topic i, so that ∑_i xi = 1. The model assumes a distribution on x that favors sparse or approximately sparse vectors; a popular choice is the Dirichlet distribution [33].
Then the document w1, w2, . . . , wn is generated by drawing n words independently from the distribution A · x, where A is the matrix whose columns are the topics. It is important to note that the document size n can be quite small (e.g., n may be 400, and D may be 50,000), so the empirical distribution of words in a document is, in general, a very inaccurate approximation to Ax. With some abuse of notation, we denote the document by y and also think of y as a vector in RD, whose jth coordinate is the number of occurrences of word j in the document.
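As a concrete illustration of this generative process, the following sketch samples one document. The dimensions and the Dirichlet concentration parameter are hypothetical, chosen small for illustration (a real corpus would have D around 50,000):

```python
import numpy as np

# Hypothetical small dimensions for illustration; the columns of A are topics,
# i.e., probability distributions over the D words.
rng = np.random.default_rng(3)
D, k, n = 1000, 20, 400
A = rng.dirichlet(np.full(D, 0.05), size=k).T   # D x k word-topic matrix
x = np.zeros(k)
x[rng.choice(k, size=3, replace=False)] = 1/3   # r = 3 topics, equal proportions
words = rng.choice(D, size=n, p=A @ x)          # the document w_1, ..., w_n
y = np.bincount(words, minlength=D)             # its count-vector form
```

Note how sparse y is relative to the dense distribution A · x: most of its D coordinates are zero when n ≪ D.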
Parameter learning involves recovering the best A for a corpus of documents; this
can be seen as a latent structure in the corpus. Recent (provable) algorithms for
this problem [6, 10] use the method of moments, leveraging the fact that some form
of averaging over the corpus yields a linear algebraic problem for recovering A. For
example, the word-word co-occurrence matrix (whose (i, j) entry is the probability that words i, j co-occur in a document) is given by

E_x[A x x⊤ A⊤] = A Z A⊤,

where Z is the second moment matrix of the prior distribution on x. It is possible to recover A from this expression, under natural conditions like separability [10]. Alternatively, one can use a co-occurrence tensor and recover A under weaker assumptions [6].
In the inference problem, we know the topic matrix A and are given a single
document y generated using this matrix. The goal is to find the posterior distribution
x|y. This can be seen as labeling or categorizing this document, which is important
in applications. Inference is reminiscent of classical regression problems where the goal is to find x given y = Ax + noise. The key difference here is the nature of the noise — each word is a single draw from the distribution Ax, so coordinate j of the corresponding indicator vector is 1 with probability (Ax)j and 0 otherwise — which means that the noise on a coordinate-by-coordinate basis can be much larger than the signal. In particular, the vector y ∈ RD is very sparse even though Ax is dense.
This problem can be seen as an analog of sparse linear regression when the target (regression) vector x has nonnegative coordinates and ∑_i xi = 1. (This is distinct from the usual ℓ1-regression, where the regression vector is measured in ℓ2 even though the loss function is ℓ1.) The difficulty here, in addition to the issue of high coordinate-wise error already mentioned, is that the usual sparsity-enforcing ℓ1-regularization buys nothing, since the solution needs to exactly satisfy ‖x‖1 = 1.
Inference seems more difficult than parameter learning because averaging over many documents is no longer an option. Furthermore, the solution x is not unique in general, and in some cases the posterior distribution on x is not well concentrated around any particular value. (In practice Gibbs sampling can be used to sample from the posterior [71, 175], but as mentioned, a rigorous analysis has proved difficult; in general, inference is NP-hard.) We will view inference as a problem of recovering some ground truth x∗ that was used to generate the document, and we show that with probability close to 1 our estimate x is close to x∗ in ℓ1 norm.
Bayesian vs Frequentist Views. So far we have not differentiated between Bayesian and frequentist approaches to framing the inference problem; we now show that the two are closely related here. The above description is frequentist, assuming an unknown "ground truth" vector x∗ of topic proportions (which is r-sparse for some small r) was used to generate a document y, using a distribution y|x∗. Let Ex∗ be the event that our algorithm recovers a vector x such that ‖x − x∗‖1 ≤ ε. For our algorithm, Pr_{y|x∗}[Ex∗] ≥ 1 − δ² for some δ > 0. By contrast, in the Bayesian view, one assumes a prior distribution on x∗ and seeks to output a sample from the conditional distribution x∗|y. Now we show that the success of our frequentist algorithm implies that the posterior x∗|y must also be concentrated, placing most of its probability mass on the set of x∗ with ‖x − x∗‖1 ≤ ε, where x is the algorithm's output.
By the law of total expectation, we have

Pr_{x∗,y}[Ex∗] = E_{x∗}[ Pr_{y|x∗}[Ex∗ | x∗] ] ≥ 1 − δ².

Switching the order of expectation, we obtain

E_y[ Pr_{x∗|y}[Ex∗ | y] ] ≥ 1 − δ².

Then it follows by a Markov argument that

Pr_y[ Pr_{x∗|y}[Ex∗ | y] ≥ 1 − δ ] ≥ 1 − δ.
Note that the inner probability is over the posterior distribution p_{x∗|y}, while the event Ex∗ only depends on the output x of the algorithm given y. Thus, with probability at least 1 − δ over the choice of y, at least 1 − δ of the probability mass of x∗|y is concentrated in the ℓ1 ball of radius ε around the algorithm's answer x.
From now on the goal of our algorithm is to recover x∗ given y, and we identify
conditions under which the event has a probability close to 1.
Minimum Variance Estimators (with Bias). Having set up the problem as
above, next, we consider how to recover an approximation to x∗ given a document y
generated with topic proportions x∗.
Since A has orders of magnitude more rows than columns, it has many left inverses
to choose from. If we find any matrix B where BA is equal to the identity matrix,
then By is an unbiased estimate for x∗. However, this estimate has high variance if B
has large entries, necessitating working with only very large documents. Motivated
by applications to collaborative filtering, [100] introduce the notion of the `1 condition
number (see Definition 2.2.1) of A, which allows them to construct a left inverse B
with a much smaller maximum entry. We introduce a weaker notion of condition
number called the `∞-to-`1 condition number, which leverages the observation that
even if BA is close to the identity matrix it still yields a good linear estimator for
x∗. We call B an approximate inverse of A. Moreover, it has the benefit that the
condition number, as well as the approximate left inverse B with minimum variance,
can also be computed in polynomial time using a linear program (Proposition 2.2.4)!
In our experiments, we compute the exact condition number of word-topic matrices
that were found using standard topic modeling algorithms on real-life corpora. (By
contrast, we do not know the `1 condition number of these matrices.) In all of the
examples, we found that the condition number is at most a small constant, which
allows us to compute good approximate left inverses to the topic matrix A to enable
us to estimate x∗ even with relatively short documents.
Main results. Our main result (Theorem 2.2.5) shows that when the condition
number is small, it is possible to estimate x∗ using a combination of thresholding
and a left inverse B of minimum variance. Our overall algorithm requires n = O(r2)
samples to achieve o(1/r) error in `∞ norm and o(1) error in `1-norm, where r is
the number of topics represented in the document. It runs efficiently in O(nk) time.
Note that we do not need to assume a particular model (e.g., uniform random) for the r topics; the algorithm works even when the topics may be correlated with each
other. This means that we can recover the support of x∗ when each of its non-zero
coordinates is suitably bounded away from zero.
This algorithm can serve as an initialization method for the MLE estimator. In fact, it can be shown that maximizing the log-likelihood function over the support recovered by this algorithm further reduces the estimation error (measured in the ℓ1-norm) to O(√r/n). This part is beyond the scope of this thesis, because the MLE is convex given the correct support and the analysis involves a mostly statistical argument. We refer the readers to [13] for more details. (One can also find a matching sample complexity lower bound for recovering the support of x∗ in [13].)
Thus, to sum up, our overall approach involves simple linear algebraic primitives
followed by convex programming. For a topic model with k topics, the sample com-
plexity of our algorithms depends on log k instead of k. This is important in practice
as k is often at least 100. The accuracy on synthetic data is good for sparse x, though
not quite as good as Gibbs sampling. However, if we forgo the convex programming
step we can compute a reasonable estimate for x from a single matrix-vector mul-
tiplication plus thresholding, which is an order of magnitude faster than finding an
estimate of the same quality via Gibbs sampling.
And of course, our approach comes with a performance guarantee.
2.2.2 Notations and Preliminaries
In addition to the description of the topic model in Section 2.2.1, we introduce the following notation. We use Sk = {z ∈ Rk≥0 : ‖z‖1 = 1} to denote the k-dimensional probability simplex. We assume throughout that the true topic proportion vector x∗ ∈ Sk is r-sparse. Sometimes we also abuse notation and use y as a D-dimensional vector instead of a set; in this case yi is the number of times word i appears in the document. We will use a_i⊤ to denote the i-th row of A. We will use cat(p) to denote the categorical distribution defined by a probability vector p. The Euclidean, ℓ1, and ℓ∞ norms of a vector are denoted by ‖ · ‖, ‖ · ‖1, and ‖ · ‖∞, respectively.
Condition Numbers of Matrices The condition number of a matrix usually refers to the ratio of its largest and smallest singular values. However, this concept is tied to the ℓ2 norm, and for probability distributions the most natural norms are ℓ1 and ℓ∞.
Next we define the matrix norms that we will utilize. Let |A|∞ = max_{i,j} |Aij| denote the maximum absolute value of the entries of the matrix A, and |A|1 = ∑_{i,j} |Aij| denote the sum of the absolute values of the entries of A.
We will also work with various notions of condition number, that we will use in
our guarantees.
Definition 2.2.1 (`1-condition number). For a nonnegative matrix A, define its `1-
condition number κ(A) to be the minimum κ such that for any x ∈ Rk,
‖Ax‖1 ≥ ‖x‖1/κ (2.2.1)
This condition number was introduced by [100] in analyzing various algorithms
for collaborative filtering. We will use a weaker (i.e., smaller) notion of the condition
number. Empirically, it seems that most of the word-topic matrices that we have
encountered have a reasonably small `1-condition number, and have an even smaller
`∞ → `1-condition number.
Definition 2.2.2 (`∞ → `1-condition number). Let λ(A) be the minimum number λ
such that for any x ∈ Rk,
‖Ax‖1 ≥ ‖x‖∞/λ (2.2.2)
Remark 2.2.1. Based on the relationship between `1 and `∞ norm, we have that
λ(A) ≤ κ(A) ≤ kλ(A).
2.2.3 δ-Biased Minimum Variance Estimators
Let y ∈ RD be the document vector whose i-th entry yi is the number of times word i
appears. Our estimator attempts to infer the true topic vector x∗ by left-multiplying
y with some matrix B. Intuitively, E[By] = BAx∗, so we want BA to be close to
the identity matrix. On the other hand, when we apply B to the document vector,
each word will select a column of B, and its variance on any entry is bounded by the
maximum entry in B. Therefore we would like to optimize over two things: first, we
want BA to be close to identity; second, we want the matrix B to have small |B|∞.
This inspires the following linear program:
Definition 2.2.3. For A ∈ RD×k and δ ≥ 0, define λδ(A) to be the solution of the
following linear program:
λδ(A) = min |B|∞
s.t. |BA − Idk|∞ ≤ δ, (2.2.3)
B ∈ Rk×D. (2.2.4)
We will refer to the minimizer B of the above convex program as the δ-biased
minimum variance inverse for A. The solution to the above convex program will help
minimize our sample complexity both theoretically and empirically.
Allowing a nonzero δ can potentially reduce the variance of the estimator while introducing a small bias. Such a bias-variance trade-off has been studied in other settings [128, 95].
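Since the objective |B|∞ and the constraints of program (2.2.4) decouple across the rows of B, the program can be solved as k independent small LPs, one per row. The following sketch (our own transcription, using `scipy.optimize.linprog`; names and dimensions are ours, not from the thesis) computes a δ-biased minimum variance inverse:

```python
import numpy as np
from scipy.optimize import linprog

def min_variance_inverse(A, delta):
    """Solve min |B|_inf s.t. |BA - Id|_inf <= delta (Definition 2.2.3).

    The program decouples across the rows of B, so we solve one LP per row,
    with variables (b, t) where b is the row and t upper-bounds max_j |b_j|.
    """
    D, k = A.shape
    B = np.zeros((k, D))
    for i in range(k):
        c = np.zeros(D + 1)
        c[-1] = 1.0                                   # objective: minimize t
        # |b_j| <= t for all j, written as two sets of linear inequalities
        G1 = np.hstack([np.eye(D), -np.ones((D, 1))])
        G2 = np.hstack([-np.eye(D), -np.ones((D, 1))])
        # |(bA)_j - e_i(j)| <= delta for all j
        e = np.zeros(k)
        e[i] = 1.0
        G3 = np.hstack([A.T, np.zeros((k, 1))])
        G4 = np.hstack([-A.T, np.zeros((k, 1))])
        G = np.vstack([G1, G2, G3, G4])
        h = np.concatenate([np.zeros(2 * D), e + delta, delta - e])
        res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * (D + 1))
        B[i] = res.x[:D]
    return B
```

The quantity λδ(A) is then `np.abs(B).max()`; for word-topic matrices with thousands of words, solving these k LPs is inexpensive.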
What is the optimal |B|∞? To answer this question we take the dual of the LP (2.2.4) (with variable Q ∈ Rk×k):

maximize tr(Q) − δ|Q|1
s.t. |AQ|1 ≤ 1 (2.2.5)

We can further show that equation (2.2.5) is equivalent to the following (non-convex) program with vector variable x ∈ Rk (see equation (2.2.8) in the proof of Proposition 2.2.4):

maximize ‖x‖∞ − δ‖x‖1
s.t. ‖Ax‖1 ≤ 1
Note that this is very closely related to the condition number λ in Definition 2.2.2.
In particular, the optimal value is exactly λ(A) when δ = 0! When δ > 0 this can be
viewed as a relaxation of the `∞ → `1 condition number.
Proposition 2.2.4. For any δ ≥ 0, we have that
λδ(A) ≤ λ0(A) = λ(A) ≤ κ(A) .
Proof. Let J be the all-1's matrix. We rewrite the program (2.2.4) as a linear program by introducing an auxiliary variable t:

λδ(A) = min t
s.t. B ≤ tJ
−B ≤ tJ
BA − Id ≤ δJ
−BA + Id ≤ δJ

Let P1, P2 ∈ Rk×D and Q1, Q2 ∈ Rk×k be the dual variables for the four (sets of) constraints, and let ⟨X, Y⟩ = tr(X⊤Y) denote the inner product of two matrices. Then the dual of the program above is

maximize ⟨Q2 − Q1, Id⟩ − δ⟨Q2 + Q1, J⟩
s.t. (P1 − P2) + (Q1 − Q2)A⊤ = 0
⟨P1 + P2, J⟩ = 1
P1, P2, Q1, Q2 ≥ 0 (2.2.6)
Let Q = Q2 − Q1 and P = P1 − P2. Observing that ⟨P1 + P2, J⟩ ≥ |P|1 and ⟨Q1 + Q2, J⟩ ≥ |Q|1, it is easy to verify that program (2.2.6) is equivalent to the program below:

maximize tr(Q) − δ|Q|1
s.t. P + QA⊤ = 0
|P|1 ≤ 1 (2.2.7)

Towards further simplification, we claim that program (2.2.7) is equivalent to the following (non-convex) program with vector variable x ∈ Rk:

maximize ‖x‖∞ − δ‖x‖1
s.t. ‖Ax‖1 ≤ 1 (2.2.8)

Indeed, suppose program (2.2.7) has optimal value λ and program (2.2.8) has optimal value λ′ with optimal solution xopt. We first show that for any x,

‖x‖∞ − δ‖x‖1 ≤ λ′‖Ax‖1, (2.2.9)
which follows from the homogeneity of the program. Then, consider any P, Q that satisfy the constraints of (2.2.7). Let Qj denote the j-th row of Q. We have

tr(Q) − δ|Q|1 ≤ ∑_{j=1}^k (‖Qj‖∞ − δ‖Qj‖1) ≤ λ′ ∑_{j=1}^k ‖AQj‖1 = λ′|P|1 ≤ λ′,

where the second inequality is by equation (2.2.9). Therefore λ ≤ λ′.
On the other hand, suppose xopt has its largest absolute value in coordinate i. Let Q be the matrix whose i-th row is xopt and 0 elsewhere, and let P = −QA⊤. Then it is straightforward to check that P, Q satisfy the constraints of (2.2.7) and have objective value λ′. Therefore λ′ ≤ λ. Hence we obtain that λ = λ′.
Finally, from (2.2.8) it is easy to see that λ0(A) = λ(A) and λδ(A) ≤ λ0(A).
2.2.4 Thresholded Linear Inverse Algorithm and its Guarantees
In this section we show how to estimate the topic proportion vector using a δ-biased minimum variance inverse B of the word-topic matrix A (Definition 2.2.3). For a small δ (that is, δ ≲ 1/r), given a solution B of program (2.2.4) with entries of absolute value at most λδ(A), the following Thresholded Linear Inverse estimator (Algorithm 2) is guaranteed to be close to the true x∗ in both ℓ1 and ℓ∞ norm. Recall that the threshold function thτ(·) is defined as

thτ(t) = t if t > τ, and thτ(t) = 0 otherwise. (2.2.10)
Algorithm 2 Thresholded Linear Inverse Algorithm (TLI)
Input: Document y with n words, and δ-biased inverse matrix B of matrix A.
Output: Topic vector estimator x.
1. Compute x = By/n.
2. For all i ∈ [k], let xi = thτ(xi), where τ = 2λδ(A)√(log k/n) + δ.
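A direct transcription of Algorithm 2, together with a toy run on a hypothetical two-topic model whose topics have disjoint word supports (so an exact left inverse exists, δ = 0 and λ0(A) = 1); the dimensions are ours, chosen for illustration:

```python
import numpy as np

def tli(y, n, B, lam_delta, delta, k):
    """Thresholded Linear Inverse (Algorithm 2)."""
    x_bar = B @ y / n                                      # Step 1
    tau = 2 * lam_delta * np.sqrt(np.log(k) / n) + delta   # Step 2 threshold
    return np.where(x_bar > tau, x_bar, 0.0)

# Toy model: topic 1 is uniform on words {0, 1}, topic 2 on words {2, 3}.
rng = np.random.default_rng(5)
A = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
B = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # BA = Id, |B|_inf = 1
x_star = np.array([1.0, 0.0])                               # 1-sparse ground truth
n = 400
words = rng.choice(4, size=n, p=A @ x_star)
y = np.bincount(words, minlength=4)
x_hat = tli(y, n, B, lam_delta=1.0, delta=0.0, k=2)         # recovers [1.0, 0.0]
```

As noted in Section 2.2.1, this single matrix-vector multiplication plus thresholding already yields a reasonable estimate, without any convex programming step.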
Theorem 2.2.5. Suppose document y is generated from an r-sparse topic vector x∗. For any ε > 4δr, given n = Ω(λδ(A)²r² log k/ε²) samples, with high probability Algorithm 2 returns a vector that has ℓ1-distance at most ε to x∗.
Our first step is to bound the variance of the partial estimator x before threshold-
ing. Our bound will utilize the maximum entry in B, which is why we tried to find
B that minimizes this quantity in the first place.
Lemma 2.2.6. With probability at least 1 − 1/k², it holds that

‖x − x∗‖∞ ≤ δ + 2λδ(A)√((log k)/n). (2.2.11)
Proof of Lemma 2.2.6. Let Ij be the indicator vector of the j-th word in the document; that is, Ij = e_w ∈ RD if the j-th word in the document is the w-th word in the vocabulary. Then, by definition, we have that y = ∑_{j∈[n]} Ij. Next, by the definition of x, we have that

xi = (1/n) ∑_{j=1}^n (BIj)i,
where (BIj)i is the i-th coordinate of BIj. Thus, we have written xi as a sum of independent random variables. We will use a concentration inequality to show that xi is concentrated around its mean. The key is that the way we have chosen B ensures that the estimator has bias at most δ and small variance.
To elaborate, we can compute the expectation of the partial estimator xi:

E[xi] = (BAx∗)i = ∑_{j=1}^k (BA)_{i,j} x∗j = x∗i + ∑_{j=1}^k ((BA)_{i,j} − 1(i = j)) x∗j,

where 1(i = j) = 1 if and only if i = j. Recall that by construction (equation (2.2.3)), we have that for all i and j, |(BA)_{i,j} − 1(i = j)| ≤ δ. Hence,

|∑_{j=1}^k ((BA)_{i,j} − 1(i = j)) x∗j| ≤ δ ∑_{j=1}^k x∗j = δ. (2.2.12)

Therefore we conclude that |E[xi] − x∗i| ≤ δ, which shows that our partial estimator xi has bias at most δ on each coordinate.
Now we can appeal to standard concentration arguments to show the concentration of xi. Recall that xi is a sum of independent random variables, xi = (1/n) ∑_{j=1}^n (BIj)i, and each summand is bounded in absolute value by max_j |(BIj)i| ≤ λδ(A). We apply Hoeffding's inequality (see Theorem 8.1.2 in Section 8.1.1) and obtain that with probability at least 1 − 1/k²,

|xi − E[xi]| ≤ 2λδ(A)√((log k)/n).

This, together with equation (2.2.12) that bounds the bias, completes the proof of the lemma.
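A quick simulation of the lemma on a toy matrix with an exact inverse (δ = 0, λ0(A) = 1, our own illustrative dimensions); the observed ℓ∞ deviation of By/n from x∗ falls within the 2λδ(A)√(log k / n) bound:

```python
import numpy as np

# With delta = 0, the bound of Lemma 2.2.6 reduces to 2 * lambda * sqrt(log k / n).
rng = np.random.default_rng(4)
k, n = 2, 10_000
A = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
B = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # BA = Id, |B|_inf = 1
x_star = np.array([0.7, 0.3])
words = rng.choice(4, size=n, p=A @ x_star)
y = np.bincount(words, minlength=4)
x_bar = B @ y / n                     # the partial estimator before thresholding
deviation = np.max(np.abs(x_bar - x_star))
```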
The lemma above shows that the vector x is close to the true x∗ in the infinity norm. As a direct corollary, we conclude that after thresholding x, we obtain the correct support of x∗, provided that x∗ does not have very small nonzero entries.
Corollary 2.2.7. With high probability, the output x of Algorithm 2 satisfies that for every i ∈ [k], if x∗i = 0 then xi = 0, and if x∗i ≥ 4λδ(A)√((log k)/n) + 2δ then xi > 0. In particular, if all the nonzero entries of x∗ are at least ε/r for some ε > 4δr, the algorithm finds the correct support with O((λδ(A)²r² log k)/ε²) samples.
Using the corollary above we can then prove Theorem 2.2.5. The key intuition is that x incurs error only on the non-zero coordinates of x∗, and only a bounded amount of error on each of them.
Proof of Theorem 2.2.5. By Lemma 2.2.6 and a union bound, with probability at least 1 − 1/k we have |xi − x∗i| ≤ δ + 2λδ(A)√((log k)/n) for every i ∈ [k]. Thus in Step 2 of the algorithm we are guaranteed that if x∗i = 0, then xi must be smaller than the threshold, and therefore xi = 0.
On the other hand, if x∗i > 0, then accounting for both the estimation error and the at most τ that thresholding can subtract,

|x∗i − xi| ≤ 2δ + 4λδ(A)√((log k)/n).

Since x and x∗ are entry-wise close, error can accumulate on at most the r coordinates where x∗i > 0. For δ ≤ ε/(4r), and for n = 64λδ(A)²r² log k/ε², we have 4λδ(A)√((log k)/n) ≤ ε/(2r). Combining these two facts, we conclude that

‖x∗ − x‖1 ≤ r(2δ + 4λδ(A)√((log k)/n)) ≤ ε,

which completes the proof.
2.3 Discussion: Special Initialization vs Trivial Initialization
In this chapter, we designed special initializations for the local improvement algorithms for sparse coding and topic model inference. Attentive readers may notice that designing these kinds of initializations often relies on a thorough understanding of the specific structure of the problem. For sparse coding, we exploit the incoherence of the dictionary and the sparsity of the coefficient vectors. For topic model inference, we leverage the special ℓ∞ → ℓ1 condition number of the word-topic matrix. Exploiting these structures allows us to design strong initialization algorithms that return approximate solutions. Designing such initializations is often challenging, but once obtained, they often simplify the analysis and design of the non-convex optimization algorithms that follow, as shown in Chapter 3.
In practice, simpler and more generic initializations are sometimes deployed for many problems, including various statistical inference problems and the training of neural networks. One such initialization is random Gaussian initialization with a small variance, which turns out to be effective for training large-scale neural networks. A slightly more delicate one is the standard initialization for sparse coding: the initial dictionary contains random samples as columns. Although certain small tricks, such as tuning the variance of the initializers, are required to achieve maximal empirical performance, these methods largely perform reliably well and are very easy to implement. It is notoriously difficult to analyze non-convex optimization algorithms starting from such initializations, since the initializer reveals little about how the dynamics of the iterates will evolve. Part II is dedicated to developing analysis tools for such situations.
Chapter 3
Local Convergence to a Global Minimum
In this chapter, we introduce a framework for designing and analyzing the local search
algorithms that converge from a reasonable initialization quickly to a global minimum
(see Section 3.1). Then in Section 3.2 we apply the framework to design and analyze
several local search algorithms for the sparse coding problem. Together with the
initialization algorithms studied in Section 2.1, our algorithms improve upon the
sample complexity of existing approaches. We believe that our analysis framework
will have applications in other settings where simple iterative algorithms are used.
3.1 Analysis Framework via Lyapunov Function
Consider a general iterative algorithm that attempts to converge to a desired solution z∗. In machine learning settings, z∗ often corresponds to some ground truth or the global optimum of some objective function. The algorithm starts with an initialization z0. At each step s, given the current iterate zs, it computes some direction gs and updates its estimate as:

zs+1 = zs − ηgs. (3.1.1)
Such an algorithm can be seen as a dynamical system. Our goal is to show that the sequence of iterates z0, z1, . . . converges to (or gets close to) the target point z∗. To design a framework for proving convergence, it helps to indulge in daydreaming/wishful thinking: what property would we like the updates to have, to simplify our job?
A natural idea is to define a Lyapunov function V(z) and show that: (i) V(zs) decreases to 0 (at a certain speed) as s → ∞; (ii) when V(z) is close to 0, then z is close to z∗.
In this chapter, we consider possibly the most trivial Lyapunov function, the (squared) Euclidean distance to the target point, V(z) = ‖z − z∗‖². This is also used in the standard convergence proof for convex functions, since moving in the direction opposite to the gradient can be shown to reduce this measure V(·).[1]
Simple algebraic manipulation shows that when the learning rate η is small enough, it is necessary and sufficient to have ⟨gs, zs − z∗⟩ > 0 in order for V(zs+1) < V(zs). Namely, the movement direction −gs should be correlated with the ideal direction z∗ − zs.
To get quantitative bounds on the running time, we need to ensure that V(zs) not only decreases but does so rapidly. The next definition formalizes this: intuitively speaking, it says that gs and zs − z∗ make an angle strictly less than 90 degrees.
Definition 3.1.1. A vector gs is (α, β, εs)-correlated with z∗ if

⟨gs, zs − z∗⟩ ≥ α‖zs − z∗‖² + β‖gs‖² − εs.

[1] One can imagine more complicated ways of proving convergence, e.g., showing that V(zs) ultimately goes to 0 even though it doesn't decrease in every step. The analyses of mirror descent and Nesterov's acceleration use such a progress measure. Such analyses often use Fenchel duality followed by a telescoping sum, which seems to rely on the convexity of the objective function.
The traditional analysis of convex optimization corresponds to the setting where z∗ is the global optimum of some convex function f and εs = 0. Specifically, if f(·) is 2α-strongly convex and 1/(2β)-smooth, then gs = ∇f(zs) is (α, β, 0)-correlated with z∗. We will refer to εs as the bias. Allowing the bias makes the framework more general and will be necessary for the case of sparse coding.
If the algorithm can at each step find such update directions that are correlated
with z∗, then the familiar convergence proof of convex optimization can be modified
to show rapid convergence here as well, except the convergence is approximate, to
some point in the neighborhood of z∗.
Theorem 3.1.2. Suppose gs satisfies Definition 3.1.1 for s = 1, 2, . . . , T, suppose η satisfies 0 < η ≤ 2β, and let ε = max_{s=1,...,T} εs. Then, for any s = 1, . . . , T,

‖zs+1 − z∗‖² ≤ (1 − 2αη)‖zs − z∗‖² + 2ηεs.

In particular, the update rule above converges to z∗ geometrically with systematic error ε/α, in the sense that

‖zs − z∗‖² ≤ (1 − 2αη)^s ‖z0 − z∗‖² + ε/α.
The proof closely follows existing proofs in convex optimization (see, e.g., []).
Proof of Theorem 3.1.2. We expand the error:

‖zs+1 − z∗‖² = ‖zs − z∗‖² − 2η⟨gs, zs − z∗⟩ + η²‖gs‖²
= ‖zs − z∗‖² − η(2⟨gs, zs − z∗⟩ − η‖gs‖²)
≤ ‖zs − z∗‖² − η(2α‖zs − z∗‖² + (2β − η)‖gs‖² − 2εs)   (by Definition 3.1.1 and η ≤ 2β)
≤ ‖zs − z∗‖² − η(2α‖zs − z∗‖² − 2εs)
≤ (1 − 2αη)‖zs − z∗‖² + 2ηεs.

Solving this recurrence, we have ‖zs+1 − z∗‖² ≤ (1 − 2αη)^{s+1} R² + ε/α, where R = ‖z0 − z∗‖. Furthermore, if εs < (α/2)‖zs − z∗‖², we have instead

‖zs+1 − z∗‖² ≤ (1 − 2αη)‖zs − z∗‖² + αη‖zs − z∗‖² = (1 − αη)‖zs − z∗‖²,

which yields Corollary 3.1.3 below.
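As a sanity check of Definition 3.1.1 and Theorem 3.1.2, the following sketch runs the update (3.1.1) on a toy strongly convex quadratic (our own choice of H, α, β), where the exact gradient is (α, β, 0)-correlated with z∗ = 0:

```python
import numpy as np

# f(z) = 0.5 z^T H z with H = diag(1, 4): 2*alpha-strongly convex with
# 2*alpha = 1, and 1/(2*beta)-smooth with 1/(2*beta) = 4; here z* = 0.
H = np.diag([1.0, 4.0])
alpha, beta = 0.5, 1.0 / 8
eta = 2 * beta                      # the largest step size Theorem 3.1.2 allows
z_star = np.zeros(2)

rng = np.random.default_rng(1)
z = rng.standard_normal(2)
for s in range(50):
    g = H @ z                       # the update direction g_s
    # the correlation condition of Definition 3.1.1 with eps_s = 0:
    assert g @ (z - z_star) >= alpha * np.linalg.norm(z - z_star) ** 2 \
        + beta * np.linalg.norm(g) ** 2 - 1e-12
    z = z - eta * g
```

The iterates contract geometrically, with ‖zs − z∗‖² ≤ (1 − 2αη)^s ‖z0 − z∗‖² and no systematic error since ε = 0.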
The theorem above has a term ε/α that reflects the approximation error caused by the bias in each iteration. The following corollary says this approximation error goes away if the bias also decreases proportionally to ‖zs − z∗‖². The necessity and benefit of allowing this small bias term will become clearer in Section 3.2.2, where we apply this framework to analyze alternating minimization algorithms.
Corollary 3.1.3. In the setting of Theorem 3.1.2, if in addition εs < (α/2)‖zs − z∗‖² for s = 1, . . . , T, then

‖zs − z∗‖² ≤ (1 − αη)^s ‖z0 − z∗‖².
In fact, we can extend the analysis above to obtain identical results for the case of constrained optimization. Suppose we are interested in optimizing a convex function f(z) over a convex set B. The standard approach is to take a step in the direction of the negative gradient (or −gs in our case) and then project onto B after each iteration, namely, replace zs+1 by Proj_B zs+1, the closest point in B to zs+1 in Euclidean distance. It is well known that if z∗ ∈ B, then ‖Proj_B z − z∗‖ ≤ ‖z − z∗‖. Therefore we obtain the following as an immediate corollary of the above analysis:
Corollary 3.1.4. Suppose gs satisfies Definition 3.1.1 for s = 1, 2, . . . , T, set 0 < η ≤ 2β and ε = max_{s=1,...,T} εs, and suppose that z∗ lies in a convex set B. Then the update rule zs+1 = Proj_B(zs − ηgs) satisfies, for any s = 1, . . . , T,

‖zs − z∗‖² ≤ (1 − 2αη)^s ‖z0 − z∗‖² + ε/α.

In particular, zs converges to z∗ geometrically with systematic error ε/α. Additionally, if εs < (α/2)‖zs − z∗‖² for s = 1, . . . , T, then

‖zs − z∗‖² ≤ (1 − αη)^s ‖z0 − z∗‖².
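The projected variant can be checked the same way. A sketch on the same toy quadratic, with B the box [0, 2]² (a hypothetical constraint set containing z∗ = 0):

```python
import numpy as np

# Same quadratic as before (H = diag(1, 4), alpha = 1/2, beta = 1/8), but each
# step is followed by a Euclidean projection onto the box B = [0, 2]^2.
H = np.diag([1.0, 4.0])
eta = 0.25

def proj_box(z, lo=0.0, hi=2.0):
    return np.clip(z, lo, hi)       # Euclidean projection onto a box is a clip

z = np.array([2.0, 2.0])            # initialize at a corner of B
for s in range(60):
    z = proj_box(z - eta * (H @ z))
```

Since projection onto a convex set containing z∗ can only decrease the distance to z∗, the same geometric contraction goes through unchanged.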
3.1.1 Generalization to Stochastic Updates
In machine learning applications, most objective functions involve an average of loss functions over individual samples. This special structure allows various speed-ups of gradient descent by using stochasticity — one can form an unbiased estimator of the average of the individual gradients. In many of these settings, the update rule still takes the form of equation (3.1.1), but g_s is a random variable instead of a deterministic vector. Towards handling this stochasticity, we introduce in this subsection extensions of Definition 3.1.1 and Theorem 3.1.2.
Definition 3.1.5. A random vector g_s is (α, β, ε_s)-correlated-whp with a desired solution z∗ if, with probability at least 1 − d^{−Ω(1)},

〈g_s, z_s − z∗〉 ≥ α‖z_s − z∗‖² + β‖g_s‖² − ε_s .
This is a strong condition, as it requires that the random vector be well-correlated with the desired solution with very high probability. In some cases we can further relax the definition as follows:
Definition 3.1.6. A random vector g_s is (α, β, ε_s)-correlated-in-expectation with a desired solution z∗ if

E[〈g_s, z_s − z∗〉] ≥ α‖z_s − z∗‖² + β E[‖g_s‖²] − ε_s .

We remark that E[‖g_s‖²] can be much larger than ‖E[g_s]‖², so the above notion is still stronger than requiring (say) that the expected vector E[g_s] be (α, β, ε_s)-correlated with z∗.
Theorem 3.1.7. Suppose the random vector g_s is (α, β, ε_s)-correlated-whp with z∗ for s = 1, 2, . . . , T, where T ≤ poly(d), and η satisfies 0 < η ≤ 2β. Then for any s = 1, . . . , T,

E[‖z_{s+1} − z∗‖²] ≤ (1 − 2αη)‖z_s − z∗‖² + 2ηε_s .

In particular, if ‖z_0 − z∗‖ ≤ δ_0 and ε_s ≤ α · o((1 − 2αη)^s)δ_0² + ε, then the updates converge to z∗ geometrically with systematic error ε/α, in the sense that

E[‖z_s − z∗‖²] ≤ (1 − 2αη)^s δ_0² + ε/α .
The proof is identical to that of Theorem 3.1.2 except that we take the expectation
of both sides.
3.1.2 Related Work
The proof of Theorem 3.1.2 is a straightforward extension of the standard analysis of gradient descent for strongly convex objective functions (see [133] for background on convex optimization). Much previous work has viewed optimization algorithms as dynamical systems and used control-theoretic techniques to analyze them, especially in the context of convex optimization (see [107] and the references therein).
Recently, a large body of work has proved convergence of local search algorithms from a good initialization for many statistical learning problems: matrix completion [93, 72, 75, 164], sparse coding [14], phase retrieval [43], and mixtures of Gaussians [20], just to name a few. These works have identified many similar conditions under which specific non-convex optimization problems can be solved to near-optimality. All of these conditions, e.g., the conditions in [43, 20], can be seen as some weakening of convexity of the objective function (except the analysis for matrix completion in [72], which views the updates as a noisy power method).
Our condition appears to contain most if not all of these as special cases. Often the update direction g_s in these papers is related to the gradient. For example, using the gradient instead of g_s in our correlation condition turns it into the regularity condition proposed by [43] for analyzing the Wirtinger flow algorithm for phase retrieval. The gradient stability condition in [20] is also a special case, where g_s is required to be close enough to ∇h(z_s) for some convex h such that z∗ is the optimum of h. Then, since ∇h(z_s) makes an angle of less than 90 degrees with z_s − z∗ (which follows from the convexity of h), so does g_s.
The advantage of our framework is that it encourages one to think of algorithms where g_s is not the gradient. Thus applying the framework doesn't require understanding the behavior of the gradient on the entire landscape of the objective function; instead, one only needs to understand the update direction (which is under the algorithm designer's control) at the sequence of points actually encountered while running the algorithm. This slight change of perspective may be powerful.
3.1.3 Limitation and Relation to Part II
We would like to note that despite the flexibility of the framework argued above, its power is limited to the analysis of "local" convergence — namely, it can only handle convergence from a reasonably good initialization. The fundamental reason is that the framework requires specifying a target solution z∗ and showing that the iterates approach z∗. Recall that non-convex functions often have multiple local minima, or even multiple approximate global minima with the same function values due to symmetry. Therefore, with random or arbitrary initialization, it is unclear which global minimum the iterates will converge to, even if we believe they will converge to one of them. The minimum requirement for applying this framework (and many of its variants) is that the initialization can serve as a tie-breaker between equivalent global minima, in the sense that the initialization guarantees convergence to a specific target solution — which is of course unknown to the algorithm, but should be identifiable in the analysis.
Towards going beyond this limitation, in Part II we extend these local convergence techniques so as to prove convergence from random or arbitrary initialization for machine learning problems such as matrix completion (Chapter 5) and learning linear dynamical systems (Chapter 6).
Finally, we note that the analysis of the problems in Part II also depends on the machinery developed in this chapter: in the analysis of matrix completion in Chapter 5, the local convergence result requires special treatment that uses the framework in this section. For learning linear dynamical systems in Chapter 6, we essentially use an over-parameterization technique to reduce the problem to situations that can be tackled by the analysis framework in this section.
3.2 Analyzing Alternating Minimization Algorithms for Sparse Coding
In this section we design and analyze several alternating minimization algorithms for sparse coding. Section 3.2.1 recalls the energy function defined in Section 2.1.1 and introduces the generic alternating minimization algorithm for solving it. Section 3.2.2 highlights the approach for applying the tools of Section 3.1 to the alternating minimization setting. Section 3.2.3 states the detailed algorithms and main results, and outlines the proof techniques. Sections 3.3–3.6 give the detailed proofs.
3.2.1 Alternating Minimization for Sparse Coding
We inherit most of the notations and setup for sparse coding from Section 2.1, and
we refer the readers to Section 2.1.1 for the motivation of the sparse coding problem.
We recall the non-convex objective proposed by Olshausen and Field [135]:

E(A, x(1), . . . , x(N)) = ∑_{i=1}^N ‖y(i) − A·x(i)‖₂² + ∑_{i=1}^N ρ(x(i)) ,   (3.2.1)
where ρ(·) is a nonlinear penalty function used to encourage sparsity. This function is non-convex because both A and the x(i)'s are unknown. Surprisingly, various local search algorithms on the energy function (2.1.1) work very well, as do related algorithms such as MOD [4] and k-SVD [62] on related objective functions with hard constraints. In fact, these methods are so effective that sparse coding is considered in practice to be a solved problem, even though it has no polynomial time algorithm per se.
In this chapter, the generic scheme we will be interested in is given in Algorithm 3; our analysis can be extended to k-SVD without many changes as well.
It is a heuristic for minimizing the non-convex function in (3.2.1) where the penalty function is a hard constraint. Towards describing our algorithm, it will be helpful to denote by X the shorthand for [x(1), . . . , x(N)] and to consider the objective without the sparsity penalty,

E(A, X) = ∑_{i=1}^N ‖y(i) − A·x(i)‖₂² .   (3.2.2)
Algorithm 3 alternates between updating the estimates A and X. The crucial step is that if we fix X and compute the gradient of E(A, X) with respect to A, we get

∇_A E(A, X) = ∑_{i=1}^N −2(y(i) − A·x(i))(x(i))^⊤ .   (3.2.3)

We then take a step in the opposite direction to update A. Here and throughout the chapter, η is the learning rate, and needs to be set appropriately.
Algorithm 3 Generic Alternating Minimization Approach
Given: initializer A^0 ∈ R^{d×r}
Repeat for s = 0, 1, . . . , T:
  Decode: find a sparse solution to A^s x(i) = y(i) for i = 1, 2, . . . , N
  Set X^s such that its columns are the x(i) for i = 1, 2, . . . , N
  Update: A^{s+1} = A^s − ηg^s, where g^s is the gradient of E(A^s, X^s) with respect to A^s
3.2.2 Applying the Framework to Analyzing Alternating Minimization
Recall that in the framework proposed in Section 3.1, we analyze algorithms of the type z_{s+1} = z_s − ηg_s and measure the progress of the algorithm by a simple Lyapunov function. Here we are dealing with an alternating update algorithm, and we cast it into our framework as follows.
We view Algorithm 3 as trying to minimize an unknown convex function, specifically f(A) = E(A, X∗), which is strictly convex and hence has a unique optimum that can be reached via gradient descent. Here X∗ is shorthand for the collection of the unknown coefficient vectors x∗(1), . . . , x∗(N). This function is unknown since the algorithm does not know X∗.
Although f is unknown, the analysis will show (directly) that the direction of movement is correlated with A∗ − A^s. The setup is reminiscent of stochastic gradient descent, which moves in a direction whose expectation is the gradient of a known convex function. By contrast, here the function f(·) is unknown; furthermore, the expectation of g^s is not the true gradient and has a bias, caused by the error in the current iterate A^s and the inexactness of the decoding algorithm (see the paragraph below for more explanation of the decoding algorithm). Due to this bias, we will only be able to prove that our algorithms reach an approximate optimum, up to an error whose magnitude is determined by the bias.
Choice of decoding algorithm: How should the algorithm update X? The usual approach is to solve a sparse recovery problem with respect to the current code matrix A. However, many of the standard basis pursuit algorithms (such as solving a linear program with an ℓ₁ penalty; see [61] and the references therein) are difficult to analyze when there is error in the dictionary itself. This is in part because the solution does not have a closed form in terms of the dictionary matrix. Instead, we take a much simpler approach to the sparse recovery problem, which uses matrix-vector multiplication followed by thresholding: we set x = th_{C/2}((A^s)^⊤ y), where th_{C/2}(·) keeps only the coordinates whose magnitude is at least C/2 and zeros out the rest. Recall that the non-zero coordinates of x∗ have magnitude at least C by Assumption 2.1.3. It turns out that this approximate decoding algorithm suffices for approximate convergence, even though it introduces an additional bias. We will remove the bias using a more complicated decoding algorithm (see Algorithm 6 in Section 3.2.3).
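A minimal sketch of this thresholding decoder follows; the toy orthonormal dictionary in the usage example is an illustrative assumption, chosen so that decoding is exact:

```python
import numpy as np

def threshold_decode(A, y, C):
    """x = th_{C/2}(A^T y): keep the coordinates of A^T y whose magnitude
    is at least C/2, and zero out the rest."""
    z = A.T @ y
    return np.where(np.abs(z) >= C / 2.0, z, 0.0)

# Toy check with a perfectly incoherent (orthonormal) dictionary:
d = 8
A_star = np.eye(d)                       # hypothetical ground-truth dictionary
x_star = np.zeros(d)
x_star[[1, 4]] = [1.5, -2.0]             # sparse code, nonzeros of magnitude >= C
y = A_star @ x_star                      # observed sample
x = threshold_decode(A_star, y, C=1.0)   # recovers x_star exactly here
```

Note that the decoder is a single matrix-vector product plus an entrywise threshold, so unlike basis pursuit it has a closed form in terms of the dictionary, which is what makes it amenable to the analysis.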
Remark: Balakrishnan et al. [19] propose a similar framework for analyzing EM algorithms for hidden variable models. The difference is that their condition is really about the geometry of the objective function, whereas ours is about a property of the direction of movement. We therefore have the flexibility to choose different decoding procedures. This flexibility allows us to have a closed form for X^s and to obtain a useful functional form for g^s.
3.2.3 Algorithms and Main Results
We first recall that we use column-wise Euclidean distance after permutation to measure the closeness between a solution A and the ground truth A∗.

Definition 2.1.5. A is δ-close to A∗ if there is a permutation π : [m] → [m] and a choice of signs σ : [m] → {±1} such that ‖σ(i)A_{π(i)} − A∗_i‖ ≤ δ for all i.
This is a natural measure to use, since we can only hope to learn the columns of A∗ up to relabeling and sign-flips. In our analysis, we will assume throughout that π(·) is the identity permutation and σ(·) ≡ +1, because our family of generative models is invariant under this relabeling and the assumption simplifies our notation.
For the purpose of analysis, we will use a slightly stronger measure of closeness, where we additionally require A − A∗ to have bounded spectral norm.

Definition 3.2.1. We say A is (δ, κ)-near to A∗ if A is δ-close to A∗ in the sense of Definition 2.1.5, and in addition ‖A − A∗‖ ≤ κ‖A∗‖.
As alluded to before, our simplest algorithm is the alternating update algorithm with a thresholding decoder. We also analyze two variants of the Olshausen-Field update rule. We first start with another simplification, where in the update rule we use (y(j) − A^s x(j)) sign(x(j))^⊤ instead of (y(j) − A^s x(j))(x(j))^⊤ in the gradient computation (3.2.3). This simplifies the analysis but does not change its essence.
Algorithm 4 Neurally Plausible Update Rule
Initialize A^0 that is (δ_0, 2)-near to A∗
Repeat for s = 0, 1, . . . , T:
  Sample: let y(1), . . . , y(N) be a set of N fresh examples.
  Decode: x(j) = th_{C/2}((A^s)^⊤ y(j)) for all j ∈ [N]
  Update:
    A^{s+1} = A^s − ηg^s ,   (3.2.4)
  where
    g^s = (1/N) ∑_{j=1}^N (y(j) − A^s x(j)) sign(x(j))^⊤ .   (3.2.5)
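As a sanity check on the decode/update mechanics in (3.2.4)–(3.2.5), the sketch below runs one step on a toy instance where A^s already equals an orthonormal ground truth A∗ (an illustrative assumption): decoding is then exact, the residual y − A^s x vanishes, and the update direction g^s is zero, i.e., the ground truth is a fixed point of the iteration.

```python
import numpy as np

def decode(A, Y, C):
    """Columnwise thresholding decode: X = th_{C/2}(A^T Y)."""
    Z = A.T @ Y
    return np.where(np.abs(Z) >= C / 2.0, Z, 0.0)

def update_direction(A, Y, X):
    """g^s = (1/N) * sum_j (y^(j) - A x^(j)) sign(x^(j))^T, as in (3.2.5)."""
    N = Y.shape[1]
    return (Y - A @ X) @ np.sign(X).T / N

# Hypothetical toy instance: orthonormal ground truth, samples y = A* x*.
rng = np.random.default_rng(0)
d = r = 8
N, C = 50, 1.0
A_star = np.eye(d)
X_star = np.zeros((r, N))
for j in range(N):
    support = rng.choice(r, size=2, replace=False)      # k = 2 sparse code
    X_star[support, j] = rng.choice([-2.0, 2.0], size=2)
Y = A_star @ X_star

X = decode(A_star, Y, C)             # exact here: X equals X_star
g = update_direction(A_star, Y, X)   # zero residual => zero update direction
```

The interesting regime, analyzed below, is of course when A^s is only (δ_0, 2)-near to A∗, where the decode is still correct on the support but the update direction carries a small bias.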
Theorem 3.2.2. Suppose Assumptions 2.1.1, 2.1.2, 2.1.3, and 2.1.4 hold, A^0 is (2δ, 2)-near to A∗, η = Θ(r/k), and each update step in Algorithm 4 uses N = Ω(mk) fresh samples. Then

E[‖A^s_i − A∗_i‖²] ≤ (1 − τ)^s ‖A^0_i − A∗_i‖² + O(k/d)

for some absolute constant 0 < τ < 1 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the column-wise error is O(√(k/d)).
Revisiting Olshausen-Field
In this subsection we analyze a variant of the Olshausen-Field update rule. Due to some technical difficulties that will become clearer later, we will also need to make a (slightly) stronger assumption on the distributional model for the support S = supp(x∗).
Algorithm 5 Olshausen-Field Update Rule
Initialize A^0 that is (δ_0, 2)-near to A∗
Repeat for s = 0, 1, . . . , T:
  Decode: x = th_{C/2}((A^s)^⊤ y) for each sample y
  Update: A^{s+1} = A^s − ηg^s, where g^s = E[(y − A^s x) x^⊤]
  Project: A^{s+1} = Proj_B A^{s+1} (where B is defined in Definition 3.6.3)
Assumption 3.2.3. For any distinct indices i, j, k ∈ [r], we have Pr[i, j, k ∈ supp(x∗)] = O(k³/r³). For simplicity we also denote Pr[i, j, k ∈ supp(x∗)] by q_{ijk}.

Under this assumption, the variant of the Olshausen-Field update rule in Algorithm 5 converges approximately to the ground truth.
Theorem 3.2.4. Suppose Assumptions 2.1.1, 3.2.3, 2.1.3, and 2.1.4 hold, A^0 is (2δ, 2)-near to A∗, and η = Θ(r/k). Then Algorithm 5 satisfies, at each step s,

‖A^s − A∗‖²_F ≤ (1 − τ)^s ‖A^0 − A∗‖²_F + O(rk²/d²)

for some 0 < τ < 1/2 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the error in Frobenius norm is O(√r · k/d).
We defer the proof of this theorem to Section 3.6.1. Currently it uses a projection step (via convex programming) that may not be needed in practice, but that the proof requires.
Removing the Systematic Error
In this subsection, we design a new update rule that converges geometrically until the column-wise error is d^{−Ω(1)}. The basic idea is to engineer a new decoding matrix (instead of using (A^s)^⊤) that projects out the components along the column currently being updated. This has the effect of removing a certain bias that occurs in the previous update rules, Algorithm 4 and Algorithm 5.
Algorithm 6 Unbiased Update Rule
Initialize A^0 that is (δ_0, 2)-near to A∗
Repeat for s = 0, 1, . . . , T:
  Decode: x = th_{C/2}((A^s)^⊤ y) for each sample y;
          x_i = th_{C/2}((B^{(s,i)})^⊤ y) for each sample y and each i ∈ [m]
  Update: A^{s+1}_i = A^s_i − ηg^s_i, where g^s_i = E[(y − B^{(s,i)} x_i) sign(x)_i] for each i ∈ [m]
We use B^{(s,i)} to denote the decoding matrix used when updating the ith column in the sth step. We set B^{(s,i)}_i = A_i and B^{(s,i)}_j = Proj_{A_i^⊥} A_j for j ≠ i. Note that B^{(s,i)}_{−i} (i.e., B^{(s,i)} with the ith column removed) is now orthogonal to A_i. We will rely on this fact when we bound the error.
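The construction of B^{(s,i)} can be sketched as follows; the random matrix A below is a hypothetical stand-in for the current iterate, and the normalization by ‖A_i‖² makes the projection exact even when the columns are not unit-norm:

```python
import numpy as np

def decoding_matrix(A, i):
    """B^(s,i): column i is A_i; every other column j is Proj_{A_i^perp} A_j,
    the projection of A_j onto the orthogonal complement of A_i."""
    Ai = A[:, i]
    B = A - np.outer(Ai, Ai @ A) / (Ai @ Ai)   # subtract the component along A_i
    B[:, i] = Ai                               # restore the i-th column itself
    return B

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))      # hypothetical current iterate
i = 1
B = decoding_matrix(A, i)
B_minus_i = np.delete(B, i, axis=1)  # B^(s,i) with the i-th column removed
```

By construction, B^{(s,i)}_{−i}^⊤ A_i = 0, which is exactly the orthogonality fact used in the error bound.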
Theorem 3.2.5. Suppose that A^0 is (2δ, 2)-near to A∗ and that η = Θ(r/k). Then Algorithm 6 satisfies, at each step s,

‖A^s_i − A∗_i‖² ≤ (1 − τ)^s ‖A^0_i − A∗_i‖² + d^{−Ω(1)}

for some 0 < τ < 1/2 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the column-wise error is d^{−Ω(1)}.
Outline of the remaining sections: Section 3.3 establishes a useful property of the decoding algorithm, namely, that it recovers the correct support of the hidden coefficients x∗. Section 3.4 analyzes the infinite-sample version of Algorithm 4, which sheds light on the finite sample complexity bounds in Section 3.5. Section 3.6 extends the analysis to Algorithms 5 and 6.
3.3 Support Recovery Guarantees of Decoding
In this section, we analyze the properties of the simple thresholding method used in the decoding steps of Algorithms 4, 5, and 6. We will show that it recovers the support of each sample with high probability (over the randomness of x∗). This corresponds to the fact that sparse recovery for incoherent dictionaries is much easier when the non-zero coefficients do not take on a wide range of values; in particular, one does not need iterative pursuit algorithms in this case. It is an ingredient in analyzing all of the update rules we consider in this chapter.
Proposition 3.3.1. Under Assumptions 2.1.1, 2.1.2, 2.1.3, and 2.1.4, suppose that A is δ-close to A∗ for δ ≤ c/log d, where c is a sufficiently small absolute constant. Then with high probability over the choice of the random sample y = A∗x∗, we have

sign(th_{C/2}(A^⊤ y)) = sign(x∗) .
Towards seeing the key intuition behind Proposition 3.3.1, recall that y = A∗x∗ and consider

〈A_i, y〉 = 〈A_i, A∗_i〉 x∗_i + Z_i ,   (3.3.1)

where Z_i is defined as

Z_i = ∑_{j ≠ i} 〈A_i, A∗_j〉 x∗_j .   (3.3.2)

Note that Z_i is a mean-zero random variable which measures the contribution of the cross terms. By the assumption that A is δ-close to A∗, we have |〈A_i, A∗_i〉| ≥ (1 − δ²/2), and therefore |〈A_i, A∗_i〉 x∗_i| is either larger than (1 − δ²/2)C or equal to zero, depending on whether or not i ∈ S. Our main goal is to show that the random variable Z_i is much smaller than C with high probability; this follows from standard concentration bounds, as shown in the formal proof below.
Proof. Recall equation (3.3.1) and let Z_i be defined as in equation (3.3.2). The random variable Z_i has two sources of randomness: the support S of x∗, and the random values of x∗ conditioned on the support S. We prove a stronger statement that only requires the second source of randomness: namely, even conditioned on the support S, with high probability S = {i : |〈A_i, y〉| > C/2} and sign(〈A_i, y〉) = sign(x∗).
We remark that Z_i is a sum of independent subgaussian random variables. We control the variance of Z_i:

Var(Z_i) = ∑_{j ∈ S\{i}} 〈A_i, A∗_j〉² .   (3.3.3)

Next we bound each summand in the equation above:

〈A_i, A∗_j〉² = (〈A∗_i, A∗_j〉 + 〈A_i − A∗_i, A∗_j〉)²
            ≤ 2(〈A∗_i, A∗_j〉² + 〈A_i − A∗_i, A∗_j〉²)   (by the Cauchy-Schwarz inequality)
            ≤ 2µ²/d + 2〈A_i − A∗_i, A∗_j〉² .   (by incoherence, Assumption 2.1.1)

It follows that

Var(Z_i) ≤ 2µ²k/d + 2 ∑_{j ∈ S\{i}} 〈A_i − A∗_i, A∗_j〉²   (3.3.4)
         = 2µ²k/d + 2‖(A∗_{S\{i}})^⊤ (A_i − A∗_i)‖²   (3.3.5)
         ≲ c/log d .   (by ‖A∗_{S\{i}}‖ ≤ 2 from Lemma 2.1.10, and k ≤ √d/(µ log d))
Hence Z_i is a subgaussian random variable with variance at most O(c/log d). For a sufficiently small absolute constant c, we conclude that with high probability |Z_i| ≤ λ_δ(A)/4. Finally, we take a union bound over all indices i ∈ [m] and obtain that |Z_i| ≤ λ_δ(A)/4 for all i. By equation (3.3.1) and the fact that |〈A_i, A∗_i〉| ≥ (1 − δ²/2), we have

|〈A_i, y〉 − x∗_i| ≤ λ_δ(A)/4 .   (3.3.6)

Recall that by Assumption 2.1.3 we have |x∗_i| ≥ λ_δ(A) whenever x∗_i ≠ 0. This and equation (3.3.6) complete the proof.
3.4 Analysis Overview: Infinite Samples Setting
In this section, as a warm-up, we assume that each iteration of Algorithm 4 uses an infinite number of samples, and prove the corresponding simplified version of Theorem 3.2.2. The proof of this simplified theorem highlights the essential ideas behind the proof of Theorem 3.2.2, which can be found in Section 3.5.

Theorem 3.4.1. In the setting of Theorem 3.2.2, suppose Algorithm 4 has access to an infinite number of examples; namely, suppose Algorithm 4 uses ḡ^s_i (defined in equation (3.4.1) below) instead of g^s_i. Then the conclusion of Theorem 3.2.2 holds:

E[‖A^s_i − A∗_i‖²] ≤ (1 − τ)^s ‖A^0_i − A∗_i‖² + O(k/d)

for some absolute constant 0 < τ < 1 and any s = 1, 2, . . . , T. In particular, the iterates converge to A∗ geometrically until the column-wise error is O(√(k/d)).
We define ḡ^s to be the expectation of g^s:

ḡ^s := E[g^s] = E[(y − A^s x) sign(x)^⊤] ,   (3.4.1)

where x := th_{C/2}((A^s)^⊤ y) is the decoding of y. The infinite sample case essentially means we use ḡ^s in the update equation (3.2.4) instead of g^s.
The proof of Theorem 3.4.1 applies our framework (Theorem 3.1.2) inductively. The first step (Proposition 3.4.2 in Section 3.4.1) is to show that ḡ^s meets the condition of Theorem 3.1.2, that is, ḡ^s is (α, β, ε)-correlated with the target solution A∗. However, this step requires two conditions on the current iterate A^s: a) A^s is already reasonably close to A∗, and b) a technical condition that the spectral norm of A^s − A∗ is bounded by 2‖A∗‖.

The first condition holds naturally under the inductive hypothesis, whereas proving condition b) requires an additional step in Section 3.4.2: we show in Proposition 3.4.5 that condition b) also holds at every step. We put everything together and prove Theorem 3.4.1 at the end of this section.
3.4.1 Making Progress at Each Iteration
Recall that, as defined in Assumption 2.1.3, we have q_i = Pr[x∗_i ≠ 0] and q_{i,j} = Pr[x∗_i x∗_j ≠ 0], and define in addition p_i = E[x∗_i sign(x∗_i) | x∗_i ≠ 0]. Recall that A∗_{−i} denotes the matrix obtained by deleting the ith column of A∗. The following proposition is the main step in our analysis.
Proposition 3.4.2. In the setting of Theorem 3.4.1, if at iteration s the iterate A^s is (2δ, 2)-near to A∗, then the direction ḡ^s_i is (α, β, ε)-correlated with A∗_i, where α = Ω(k/r), β = Ω(r/k), and ε = O(k³/(rd²)).
Furthermore, we make the following progress in terms of the distance to the ground truth:

‖A^{s+1}_i − A∗_i‖² ≤ (1 − 2αη)‖A^s_i − A∗_i‖² + O(ηk²/d²) .   (3.4.2)
Towards proving Proposition 3.4.2, we first use the properties of the generative model to derive a new formula for ḡ^s that is more amenable to analysis.

Lemma 3.4.3. In the setting of Proposition 3.4.2, we have

ḡ^s_i = p_i q_i (λ^s_i A^s_i − A∗_i + ε^s_i ± γ^s_i)   (3.4.3)

with λ^s_i, γ^s_i, ε^s_i satisfying

λ^s_i = 〈A^s_i, A∗_i〉 ,   ‖γ^s_i‖ ≤ 1/d^{Ω(1)} ,
ε^s_i = (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i / q_i ,   ‖ε^s_i‖ ≤ O(k/d) .

Note that p_i q_i is a scaling constant and λ^s_i ≈ 1; hence from formula (3.4.3) we should expect that ḡ^s_i is correlated with A^s_i − A∗_i. (The exact amount of correlation will be bounded using the lemmas below.)
Lemma 3.4.3 also turns out to be useful for showing that ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖ (Proposition 3.4.5). We will also be able to reuse much of this analysis in Lemma 3.6.1 and Lemma 3.6.6, because the formula is derived for a general decoding matrix B in the proof below.
Proof. Since A^s is (2δ, 2)-near to A∗, A^s is 2δ-close to A∗. We can therefore invoke Proposition 3.3.1 and conclude that with high probability sign(x∗) = sign(x). Let F_{x∗} be the event that sign(x∗) = sign(x), let 1_{F_{x∗}} be the indicator function of this event, and let 1_{F̄_{x∗}} be the indicator of its complement.

To avoid an overwhelming number of superscripts, let B = A^s throughout this proof. Here and in the rest of the proof, γ^s_i denotes any vector whose norm is negligible (i.e., smaller than 1/d^C for any large constant C > 1).

We can write ḡ^s_i = E[(y − Bx) sign(x_i)]. Using the fact that 1_{F_{x∗}} + 1_{F̄_{x∗}} = 1 and that F_{x∗} happens with very high probability,

ḡ^s_i = E[(y − Bx) sign(x_i) 1_{F_{x∗}}] + E[(y − Bx) sign(x_i) 1_{F̄_{x∗}}]
      = E[(y − Bx) sign(x_i) 1_{F_{x∗}}] ± γ^s_i .   (3.4.4)
The key point is that this allows us to essentially replace sign(x) with sign(x∗). Moreover, let S = supp(x∗). Note that when F_{x∗} happens, S is also the support of x. Recall that according to the decoding rule (with A^s replaced by B for notational simplicity), x = th_{C/2}(B^⊤ y). Therefore x_S = (B^⊤y)_S = B_S^⊤ y = B_S^⊤ A∗ x∗. Using again the fact that the support of x is S, we have Bx = B_S B_S^⊤ A∗ x∗. Plugging this into equation (3.4.4):

ḡ^s_i = E[(y − Bx) sign(x_i) 1_{F_{x∗}}] ± γ^s_i = E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i) 1_{F_{x∗}}] ± γ^s_i
      = E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i)] − E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i) 1_{F̄_{x∗}}] ± γ^s_i
      = E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i)] ± γ^s_i ,
where again we have used the fact that F_{x∗} happens with very high probability. Now we rewrite the expectation above using subconditioning, where we first choose the support S of x∗ and then choose the nonzero values x∗_S:
E[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i)] = E_S[ E_{x∗_S}[(I − B_S B_S^⊤) A∗ x∗ · sign(x∗_i) | S] ] = E[p_i (I − B_S B_S^⊤) A∗_i] ,

where we use the fact that E[x∗_i · sign(x∗_i) | S] = p_i. Let R = S \ {i}. Using the fact that B_S B_S^⊤ = B_i B_i^⊤ + B_R B_R^⊤, we can split the quantity above into two parts:

ḡ^s_i = p_i E[(I − B_i B_i^⊤) A∗_i] + p_i E[B_R B_R^⊤] A∗_i
      = p_i q_i (I − B_i B_i^⊤) A∗_i + p_i (B_{−i} diag(q_{i,j}) B_{−i}^⊤) A∗_i ± γ^s_i ,

where diag(q_{i,j}) is the r × r diagonal matrix whose (j, j)-th entry is equal to q_{i,j}, and B_{−i} is the matrix obtained by zeroing out the ith column of B. Here we used the facts that Pr[i ∈ S] = q_i and Pr[i, j ∈ S] = q_{i,j}.
Now we set B = A^s, and rearranging the terms, we have

ḡ^s_i = p_i q_i (〈A^s_i, A∗_i〉 A^s_i − A∗_i + ε^s_i ± γ^s_i) ,

where ε^s_i = (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i / q_i, which can be bounded as follows:

‖ε^s_i‖ ≤ ‖A^s_{−i}‖² max_{j≠i} q_{i,j}/q_i ≤ O(k/r)‖A^s‖² = O(k/d) ,

where the last step uses the fact that max_{i≠j} q_{i,j} / min_i q_i ≤ O(k/r), which is an assumption of our generative model.
Lemma 3.4.4. If a vector z is equal to 4α(A^s_i − A∗_i) + v where ‖v‖ ≤ α‖A^s_i − A∗_i‖ + ζ, then z is (α, 1/(100α), ζ²/α)-correlated with A∗_i.
Proof. Throughout this proof s is fixed, so we omit the superscript s to simplify notation. By assumption, z already has a component pointing in the correct direction A_i − A∗_i; we only need to show that the norm of the extra term v is small enough. First we bound the norm of z by the triangle inequality:

‖z‖ ≤ ‖4α(A_i − A∗_i)‖ + ‖v‖ ≤ 5α‖A_i − A∗_i‖ + ζ ,   (3.4.5)

and therefore ‖z‖² ≤ 50α²‖A_i − A∗_i‖² + 2ζ². Similarly, we can bound the inner product between z and A_i − A∗_i by

〈z, A_i − A∗_i〉 = 〈4α(A_i − A∗_i) + v, A_i − A∗_i〉 ≥ 4α‖A_i − A∗_i‖² − ‖v‖‖A_i − A∗_i‖ .   (3.4.6)
Here the last inequality follows from Cauchy-Schwarz. Now we are ready to prove the target inequality by combining equations (3.4.5) and (3.4.6):

〈z, A_i − A∗_i〉 − α‖A_i − A∗_i‖² − (1/(100α))‖z‖² + ζ²/α
≥ 4α‖A_i − A∗_i‖² − ‖v‖‖A_i − A∗_i‖ − α‖A_i − A∗_i‖² − (1/(100α))‖z‖² + ζ²/α   (by equation (3.4.6))
≥ 3α‖A_i − A∗_i‖² − (α‖A_i − A∗_i‖ + ζ)‖A_i − A∗_i‖ − (1/(100α))(50α²‖A_i − A∗_i‖² + 2ζ²) + ζ²/α   (by equation (3.4.5))
≥ α‖A_i − A∗_i‖² − ζ‖A_i − A∗_i‖ + ζ²/(4α)
= (√α‖A_i − A∗_i‖ − ζ/(2√α))² ≥ 0 .

This completes the proof of the lemma.
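The inequality asserted by Lemma 3.4.4 can also be spot-checked numerically. In the sketch below, the dimension and the values of α and ζ are hypothetical; the vector diff plays the role of A_i − A∗_i, and v is scaled to sit exactly at the cap ‖v‖ = α‖A_i − A∗_i‖ + ζ:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, zeta = 0.1, 0.05                   # hypothetical parameters
diff = rng.standard_normal(5)             # plays the role of A_i - A*_i
v = rng.standard_normal(5)
v *= (alpha * np.linalg.norm(diff) + zeta) / np.linalg.norm(v)
z = 4 * alpha * diff + v                  # z as in the lemma's hypothesis

# (alpha, 1/(100*alpha), zeta^2/alpha)-correlation with A*_i:
lhs = z @ diff                            # <z, A_i - A*_i>
rhs = alpha * (diff @ diff) + (1.0 / (100 * alpha)) * (z @ z) - zeta**2 / alpha
```

By the lemma, lhs ≥ rhs holds for every admissible v, so the check passes regardless of the random draw.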
We are now ready to combine the lemmas above and prove Proposition 3.4.2.

Proof of Proposition 3.4.2. We first use the form in Lemma 3.4.3, ḡ^s_i = p_i q_i(λ_i A^s_i − A∗_i + ε^s_i + γ^s_i), where λ_i = 〈A^s_i, A∗_i〉. We can write ḡ^s_i = p_i q_i(A^s_i − A∗_i) + p_i q_i((λ_i − 1)A^s_i + ε^s_i + γ^s_i). Now we apply Lemma 3.4.4 with z = ḡ^s_i, 4α = p_i q_i = Θ(k/r), and v = p_i q_i((λ_i − 1)A^s_i + ε^s_i + γ^s_i). The norm of v can be bounded in two parts: the first term p_i q_i(λ_i − 1)A^s_i has norm p_i q_i(1 − λ_i), which is at most α‖A^s_i − A∗_i‖; and the remaining terms have norm bounded by ζ = O(k²/(rd)).

With these parameter choices, Lemma 3.4.4 implies that ḡ^s_i is (Ω(k/r), Ω(r/k), O(k³/(rd²)))-correlated with A∗_i. Equation (3.4.2) in the proposition then follows from applying our analysis framework (Theorem 3.1.2).
3.4.2 Maintaining Spectral Norm
In this section we show that the spectral norm bound ‖A^s − A∗‖ ≤ 2‖A∗‖ is preserved after every iteration.

Proposition 3.4.5. In the setting of Theorem 3.4.1, suppose that A^s is (2δ, 2)-near to A∗. Then ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖.

This proposition is expected: if the algorithm indeed drives the iterate towards A∗, then ‖A^s − A∗‖ should also be roughly decreasing as s increases. Moreover, here we only require that ‖A^s − A∗‖ never exceed 2‖A∗‖. In the proof we again invoke Lemma 3.4.3 to obtain a functional form for A^{s+1}_i − A∗_i in terms of the matrices A∗ and A^s, and then use various linear-algebraic inequalities to bound its spectral norm.
Proof. Again, we will make crucial use of Lemma 3.4.3. Substituting and rearranging terms, we have

A^{s+1}_i − A∗_i = A^s_i − A∗_i − ηḡ^s_i
               = (1 − ηp_i q_i)(A^s_i − A∗_i) + ηp_i q_i(1 − λ^s_i)A^s_i − ηp_i (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i ± γ^s_i .

Our first task is to write this equation in a more convenient form. In particular, let U and V be the matrices whose columns are U_i = p_i q_i(1 − λ^s_i)A^s_i and V_i = p_i (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i. Then we can rewrite the above equation as

A^{s+1} − A∗ = (A^s − A∗) diag(1 − ηp_i q_i) + ηU − ηV ± γ^s ,   (3.4.7)

where diag(1 − ηp_i q_i) is the r × r diagonal matrix whose entries along the diagonal are 1 − ηp_i q_i.
We bound the spectral norm of A^{s+1} − A∗ by bounding the spectral norm of each of the matrices on the right-hand side of (3.4.7). The first term is straightforward:

‖(A^s − A∗) diag(1 − ηp_i q_i)‖ ≤ ‖A^s − A∗‖ · (1 − η min_i p_i q_i) ≤ 2(1 − Ω(ηk/r))‖A∗‖ ,   (3.4.8)

where the last inequality uses the assumptions that p_i = Θ(1) and q_i = Θ(k/r), and that ‖A^s − A∗‖ ≤ 2‖A∗‖.

Regarding the term ηU in equation (3.4.7), by definition U = A^s diag(p_i q_i(1 − λ^s_i)). It follows that

‖U‖ ≤ δ max_i p_i q_i ‖A^s‖ = o(k/r) · ‖A∗‖ ,

where we have used the facts that λ^s_i ≥ 1 − δ with δ = o(1), and that ‖A^s‖ ≤ ‖A^s − A∗‖ + ‖A∗‖ = O(‖A∗‖).
It remains to bound the third term. We first introduce an auxiliary matrix Q, defined as follows: Q_{ii} = 0 and Q_{j,i} = q_{i,j}〈A^s_j, A∗_i〉 for j ≠ i. It is straightforward to verify the following claim:

Claim 3.4.6. The ith column of A^s Q is equal to (A^s_{−i} diag(q_{i,j}) (A^s_{−i})^⊤) A∗_i.

Recall the definition p_i = E[x∗_i sign(x∗_i) | x∗_i ≠ 0]. Therefore we can write V = A^s Q diag(p_i). We bound the spectral norm of Q from above by its Frobenius norm:

‖Q‖_F ≤ (max_{i≠j} q_{i,j}) · ( ∑_{i≠j} 〈A^s_j, A∗_i〉² )^{1/2} = O(k²/r²)‖A∗^⊤ A^s‖_F .   (3.4.9)

Moreover, since A∗^⊤ A^s is an r × r matrix, its Frobenius norm can be at most a √r factor larger than its spectral norm; that is, ‖A∗^⊤ A^s‖_F ≤ √r ‖A∗^⊤ A^s‖ ≤ √r ‖A∗‖‖A^s‖.
Hence we have

‖V‖ ≤ (max_i p_i) ‖A^s‖‖Q‖   (by V = A^s Q diag(p_i))
    ≤ O(k²/r²) ‖A^s‖‖A∗^⊤ A^s‖_F   (by equation (3.4.9) and p_i ≲ 1, Assumption 2.1.3)
    ≤ O(k²√r/r²) ‖A^s‖²‖A∗‖   (by ‖A∗^⊤ A^s‖_F ≤ √r‖A∗‖‖A^s‖)
    ≤ o(k/r) ‖A∗‖ .   (by ‖A^s‖ ≲ ‖A∗‖ and the choice of parameters in Assumption 2.1.4)

Finally, putting all the pieces together, we have

‖A^{s+1} − A∗‖ ≤ ‖(A^s − A∗) diag(1 − ηp_i q_i)‖ + η‖U‖ + η‖V‖ + ‖γ^s‖
             ≤ 2(1 − Ω(ηk/r))‖A∗‖ + o(ηk/r)‖A∗‖ + o(ηk/r)‖A∗‖ + ‖γ^s‖
             ≤ 2(1 − Ω(ηk/r))‖A∗‖   (3.4.10)
             ≤ 2‖A∗‖ ,

and this completes the proof of the proposition.
Proof of Theorem 3.4.1
We close this section with the formal proof of Theorem 3.4.1.

Proof of Theorem 3.4.1. We proceed by induction on s. The inductive hypothesis is that the theorem is true at step s and that A^s is (2δ, 2)-near to A∗. The hypothesis holds for s = 0 by the assumption on the initialization. Now assume the inductive hypothesis holds for some s. Proposition 3.4.2 says that if A^s is (2δ, 2)-near to A∗, which is guaranteed by the inductive hypothesis, then ḡ^s_i is indeed (Ω(k/r), Ω(r/k), O(k³/(rd²)))-correlated with A∗_i. Invoking our analysis framework (Theorem 3.1.2), we have

‖A^{s+1}_i − A∗_i‖² ≤ (1 − τ)‖A^s_i − A∗_i‖² + O(k²/d²) ≤ (1 − τ)^{s+1}‖A^0_i − A∗_i‖² + O(k²/d²) .

Therefore A^{s+1} is also 2δ-close to A∗. We then invoke Proposition 3.4.5 to show that the spectral norm bound ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖ also holds, which completes the induction.
3.5 Sample Complexity
In the previous sections, we analyzed various update rules assuming that the algorithm is given the exact expectation of a matrix-valued random variable. Here we show that these algorithms can just as well use approximations to the expectation, computed by taking a small number of samples.
We focus on analyzing the sample complexity of Algorithm 4, but a similar analysis extends to the other update rules as well.
In order to prove Theorem 3.2.2, we proceed in two steps. First we show that when A^s is (δ_s, 2)-near to A∗, the approximate gradient is (α, β, ε_s)-correlated-whp with the optimal solution A∗, with ε_s ≤ O(k²/rd) + α · o(δ_s²). This allows us to use Theorem 3.1.7 as long as we can guarantee that the spectral norm of A^s − A∗ is small. Next we show an extension of Proposition 3.4.5 which works even with the random approximate gradient; hence the nearness property is preserved during the iterations.

The key idea boils down to establishing various types of concentration of the empirical gradient ĝ^s_i around its expectation g^s_i. For example, the following proposition establishes the concentration in Euclidean distance.
Proposition 3.5.1. In the setting of Theorem 3.2.2, suppose A^s is (2δ, 2)-near to A∗. Then, we have that with high probability,

‖ĝ^s_i − g^s_i‖ ≤ k/r · (o(δ_s) + O(√(k/d))).
The proposition above straightforwardly implies the following finite-sample extension of Proposition 3.4.2.
Corollary 3.5.2 (Finite-sample extension of Proposition 3.4.2). In the setting of Proposition 3.5.1, the direction ĝ^s_i as defined in Algorithm 4 is (α, β, ε_s)-correlated-whp with A∗_i, with α = Ω(k/r), β = Ω(r/k) and ε_s ≤ α · o(δ_s²) + O(k²/rd).
Proof of Corollary 3.5.2 using Proposition 3.5.1. Using Proposition 3.4.3 we can write ĝ^s_i (whp) as ĝ_i = (ĝ_i − g_i) + g_i = 4α(A^s_i − A∗_i) + v with ‖v‖ ≤ α‖A^s_i − A∗_i‖ + O(k/r) · (o(δ_s) + O(√(k/d))). By Proposition 3.4.4, we have that ĝ_i is (Ω(k/r), Ω(r/k), o(r/d · δ_s²) + O(k²/rd))-correlated-whp with A∗_i.
Proposition 3.5.1 and other concentration inequalities also imply a finite-sample extension of Proposition 3.4.5.
Corollary 3.5.3 (Finite-sample extension of Proposition 3.4.5). In the setting of Theorem 3.2.2, suppose A^s is (δ_s, 2)-near to A∗ with δ_s ≤ c/log d for some sufficiently small absolute constant c. Then, with high probability, A^{s+1} satisfies ‖A^{s+1} − A∗‖ ≤ 2‖A∗‖.
Proof. Using equation (3.4.10) in the proof of Proposition 3.4.5, recalling that A^{s+1} corresponds to A^s − ηĝ^s in our case, we have that

‖A^s − ηg^s − A∗‖ ≤ 2(1 − Ω(ηk/r))‖A∗‖.    (3.5.1)

Recall that η is set to be Θ(r/k); using Proposition 3.5.1 we have that ‖ĝ^s − g^s‖ ≤ k/r · (o(δ_s) + O(√(k/d))), so η‖ĝ^s − g^s‖ ≤ o(1). Moreover, since ‖A∗‖²_F = r, we have that ‖A∗‖ ≥ √(r/d).
This implies that

‖A^{s+1} − A∗‖ ≤ ‖A^s − ηg^s − A∗‖ + η‖ĝ^s − g^s‖
   ≤ 2(1 − Ω(ηk/r))‖A∗‖ + o(1)
   ≤ (2 − Ω(1) + o(1))‖A∗‖ ≤ 2‖A∗‖,

which completes the proof.
Using Corollary 3.5.2 and Corollary 3.5.3 we can prove the main Theorem 3.2.2 by
induction on the step s.
Proof of Theorem 3.2.2. Similarly to the proof of Theorem 3.4.1, the theorem follows immediately by induction using Corollary 3.5.2 and Corollary 3.5.3, and then applying Theorem 3.1.7.
In the rest of the subsection, we prove Proposition 3.5.1 using various concentra-
tion inequalities.
Proof of Proposition 3.5.1
We fix an index i and prove the concentration for the column ĝ^s_i. Note that by definition, we have that

ĝ^s_i = (1/N) Σ_{j=1}^N (y^{(j)} − A^s x^{(j)}) sign(x^{(j)}_i).    (3.5.2)

To simplify the notation, we omit the superscript s since it is irrelevant for this proposition, and let Z_i be shorthand for the random variable (y − Ax) sign(x_i) conditioned on the event i ∈ S:

Z_i := (y − Ax) sign(x_i) | i ∈ S.    (3.5.3)
Therefore we see that ĝ^s_i is essentially an average of independent realizations of the random variable Z_i. We will use the Bernstein inequality to prove the concentration. In preparation, we study the absolute bound on Z_i and its variance, which are the essential quantities for applying the Bernstein inequality. We also note that we will use an extension of the Bernstein inequality (Corollary 8.1.4), for which it suffices to have a high-probability uniform upper bound on the norm of Z_i.

We start with a claim that controls ‖Z_i‖ with high probability.
Claim 3.5.4. In the setting of Proposition 3.5.1, let Z_i be defined as in equation (3.5.3). Then, with high probability, we have that

‖Z_i‖ ≲ µk/√d + kδ_s² + δ_s√(k log d).
Proof. We expand y − Ax as

y − Ax = (A∗_S − A_S A_S^⊤ A∗_S)x∗_S = (A∗_S − A_S)x∗_S + A_S(Id − A_S^⊤ A∗_S)x∗_S,    (3.5.4)

and we will bound each of the two terms separately. Since x∗_S has sub-Gaussian entries with variance bounded by O(1), the vector (A∗_S − A_S)x∗_S can be written as a sum of vectors scaled by independent sub-Gaussian entries:

(A∗_S − A_S)x∗_S = Σ_{i∈S} (A∗_i − A_i)x∗_i.    (3.5.5)

Therefore (A∗_S − A_S)x∗_S is also a sub-Gaussian random vector with variance proxy ‖A∗_S − A_S‖² (see Lemma 8.1.7 for this property of sub-Gaussian random variables). Then, we have that with high probability,

‖(A∗_S − A_S)x∗_S‖ ≲ ‖A∗_S − A_S‖_F √(log d).
Since A is δ_s-close to A∗ and |S| ≤ k, we have ‖A∗_S − A_S‖_F ≤ O(δ_s√k), and thus we conclude that

‖(A∗_S − A_S)x∗_S‖ ≲ δ_s√(k log d).    (3.5.6)
Regarding the second summand on the RHS of equation (3.5.4), we have that

‖A_S(A_S^⊤A∗_S − I)‖_F ≤ ‖A_S‖‖A_S^⊤A∗_S − I‖_F    (by the inequality ‖UV‖_F ≤ ‖U‖‖V‖_F)
   ≤ (‖A∗_S‖ + δ_s√k)(‖(A_S − A∗_S)^⊤A∗_S‖_F + ‖A∗_S^⊤A∗_S − I‖_F)    (by ‖A∗_S − A_S‖ ≤ δ_s√k and the triangle inequality)
   ≤ (2 + δ_s√k)(‖A∗_S‖‖A_S − A∗_S‖_F + ‖A∗_S^⊤A∗_S − I‖_F)    (by ‖A∗_S‖ ≤ 2 (Lemma 2.1.10) and the inequality ‖UV‖_F ≤ ‖U‖‖V‖_F)
   ≤ (2 + δ_s√k)(‖A∗_S‖‖A_S − A∗_S‖_F + µk/√d)    (by Lemma 2.1.10)
   ≤ O(µk/√d + δ_s²k + δ_s√k)    (by ‖A∗_S − A_S‖_F ≤ δ_s√k and ‖A∗_S‖ ≤ 2)

Finally, plugging the bound above and equation (3.5.6) into equation (3.5.4), we conclude that

‖(y − Ax) sign(x_i)‖ ≤ O(‖(A∗_S − A_S)x∗_S‖ + ‖A_S(A_S^⊤A∗_S − I)‖_F) ≲ µk/√d + kδ_s² + δ_s√(k log d).
Next we bound from above the variance of the random variable Zi.
Claim 3.5.5. In the setting of Proposition 3.5.1, let Z_i be defined as in equation (3.5.3). We have that

E[‖Z_i‖²] ≲ k²δ_s² + k³/d.
Proof. Again we rewrite y − Ax as y − Ax = (A∗_S − A_S A_S^⊤ A∗_S)x∗_S. Using the fact that x∗_S is conditionally independent of S with E[x∗_S(x∗_S)^⊤] = Id, we obtain that

E[‖Z_i‖²] = E[‖(y − Ax) sign(x_i)‖² | i ∈ S]
   = E[‖(A∗_S − A_S A_S^⊤ A∗_S)x∗_S‖² | i ∈ S]
   = E[‖A∗_S − A_S A_S^⊤ A∗_S‖²_F | i ∈ S]    (by E[x∗_S(x∗_S)^⊤] = Id)

Then again we rewrite A∗_S − A_S A_S^⊤ A∗_S as

A∗_S − A_S A_S^⊤ A∗_S = (A∗_S − A_S) + A_S(Id − A_S^⊤ A∗_S).    (3.5.7)
We bound the Frobenius norms of the two terms on the RHS of the equation above separately. First, since A is δ_s-close to A∗, the matrix A∗_S − A_S has column-wise norm at most δ_s, and therefore ‖A∗_S − A_S‖_F ≤ √k δ_s. Second, note that ‖A_S‖_F ≲ √k because each column of A has norm 1 ± δ_s. Therefore, we have that

E[‖A_S(Id − A_S^⊤A∗_S)‖²_F | i ∈ S] ≲ k E[‖Id − A_S^⊤A∗_S‖²_F | i ∈ S],    (3.5.8)
where we used the inequality ‖UV‖_F ≤ ‖U‖_F‖V‖_F. We divide the k² terms in ‖Id − A_S^⊤A∗_S‖²_F largely according to whether the indices contain i, because this affects the conditional probability. We have that
E[‖Id − A_S^⊤A∗_S‖²_F | i ∈ S]    (3.5.9)
= E[Σ_{j∈S} (1 − ⟨A_j, A∗_j⟩)² | i ∈ S] + E[Σ_{j,ℓ∈S\{i}, j≠ℓ} ⟨A_j, A∗_ℓ⟩² | i ∈ S]    (3.5.10)
  + E[Σ_{j∈S\{i}} ⟨A_j, A∗_i⟩² | i ∈ S] + E[Σ_{j∈S\{i}} ⟨A_i, A∗_j⟩² | i ∈ S]    (3.5.11)
≤ O(kδ_s²) + O(k²/d).    (3.5.12)
Since A is δ_s-close to A∗, we have that (1 − ⟨A_j, A∗_j⟩)² ≤ δ_s² for any j, and therefore

E[Σ_{j∈S} (1 − ⟨A_j, A∗_j⟩)² | i ∈ S] ≤ kδ_s².    (3.5.13)
Using the assumption that Pr[j ∈ S, ℓ ∈ S | i ∈ S] ≲ k²/r², we have

E[Σ_{j,ℓ∈S\{i}, j≠ℓ} ⟨A_j, A∗_ℓ⟩² | i ∈ S] ≲ k²/r² · Σ_{j,ℓ∈[r]\{i}} ⟨A_j, A∗_ℓ⟩²
   ≤ k²/r² · ‖A^⊤A∗‖²_F
   ≤ k²/r² · ‖A∗‖² ‖A‖²_F    (by the inequality ‖UV‖_F ≤ ‖U‖‖V‖_F)
   ≲ k²/r² · r²/d² · d = k²/d    (by Assumption 2.1.1 and ‖A‖²_F ≲ d)
By the assumption Pr[j ∈ S | i ∈ S] ≲ k/r (Assumption 2.1.2), we have that

E[Σ_{j∈S\{i}} ⟨A_j, A∗_i⟩² | i ∈ S] ≲ k/r · ‖A_{−i}^⊤A∗_i‖² ≤ k/r · ‖A_{−i}‖²‖A∗_i‖² ≤ k/r · r/d = k/d

(by the assumption that A is (δ_s, 2)-near to A∗, we have ‖A‖ ≲ ‖A∗‖ ≲ √(r/d)).
We can get a similar bound for E[Σ_{j∈S\{i}} ⟨A_i, A∗_j⟩² | i ∈ S]. Hence, plugging the bounds above into equation (3.5.11), we conclude that

E[‖Id − A_S^⊤A∗_S‖²_F | i ∈ S] ≤ O(kδ_s²) + O(k²/d).

Now by equation (3.5.8) we complete the proof.
Now we are ready to apply the Bernstein inequality to prove Proposition 3.5.1.
Proof of Proposition 3.5.1. We fix the index i in the proof. We omit the superscript s throughout the proof to simplify the notation. Let W_i = {j ∈ [N] : i ∈ supp(x∗^{(j)})} be the set of examples that use dictionary atom i. Recalling the definition of ĝ (equation (3.2.5)), we have that

ĝ_i = (1/N) Σ_{j∈W_i} (y^{(j)} − Ax^{(j)}) sign(x^{(j)}_i).
Note that for every j ∈ W_i, (y^{(j)} − Ax^{(j)}) sign(x^{(j)}_i) has the same distribution as Z_i, and therefore it satisfies the bounds on norm and variance in Claim 3.5.4 and Claim 3.5.5. Then, by the generalized version of the Bernstein inequality (Corollary 8.1.4), we have that with high probability,
‖ĝ_i − g_i‖ ≲ (1/N) · ((µk/√d + kδ_s² + δ_s√(k log d)) log d + √(|W_i|(k²δ_s² + k³/d) log d))
   ≲ (1/N) · ((µk/√d + kδ_s² + δ_s√(k log d)) log d + √(N · k/r · (k²δ_s² + k³/d) log d))
   ≲ k/r · (o(δ_s) + O(√(k/d))),
where the last inequality follows from our choice of parameters (Assumption 2.1.4) and from N ≥ crk log² d for a sufficiently large absolute constant c.
3.6 More Alternating Minimization Algorithms
Here we prove Theorem 3.2.4 and Theorem 3.2.5. Note that in Algorithm 5 and Algorithm 6, for simplicity, we use the expectation of the gradient over the samples instead of the empirical average. We can show that these algorithms would maintain the same guarantees if we used p = Ω(mk) samples to estimate g^s, as we did in Algorithm 4. However, these proofs would require repeating calculations very similar to those we performed in Section 3.5, and so we only claim that these algorithms maintain their guarantees if they use a polynomial number of samples to approximate the expectation.
3.6.1 Analysis of a Variant of Olshausen-Field Update
We give a variant of the Olshausen-Field update rule in Algorithm 5. Our first goal
is to prove that each column of gs is (α, β, ε)-correlated with A∗i . The main step is to
prove an analogue of Proposition 3.4.3 that holds for the new update rule.
Proposition 3.6.1. Suppose that A^s is (2δ, 5)-near to A∗. Then each column of g^s in Algorithm 5 takes the form

g^s_i = q_i(λ^s_i A∗_i − (λ^s_i)² A^s_i + ε^s_i),

where λ^s_i = ⟨A^s_i, A∗_i⟩. Moreover, the norm of ε^s_i can be bounded by ‖ε^s_i‖ ≤ O(k²/rd).
We remark that, unlike the statement of Proposition 3.4.4, here we will not explicitly state the functional form of ε^s_i because we will not need it.
Proof. The proof parallels that of Proposition 3.4.3, although we will use slightly different conditioning arguments as needed. Again, we define F_{x∗} as the event that sign(x∗) = sign(x), and let 1_{F_{x∗}} be the indicator function of this event. We can invoke Proposition 3.3.1 and conclude that this event happens with high probability. Moreover, let F_i be the event that i is in the set S = supp(x∗), and let 1_{F_i} be its indicator function.
When the event F_{x∗} happens, the decoding satisfies x_S = A_S^⊤A∗_S x∗_S and all the other entries are zero. Throughout this proof s is fixed, and so we will omit the superscript s for notational convenience. We can now rewrite g_i as

g_i = E[(y − Ax)x_i] = E[(y − Ax)x_i 1_{F_{x∗}}] + E[(y − Ax)x_i (1 − 1_{F_{x∗}})]
   = E[(I − A_S A_S^⊤) A∗_S x∗_S x∗_S^⊤ A∗_S^⊤ A_i 1_{F_{x∗}} 1_{F_i}] ± γ
   = E[(I − A_S A_S^⊤) A∗_S x∗_S x∗_S^⊤ A∗_S^⊤ A_i 1_{F_i}] ± γ
Once again our strategy is to rewrite the expectation above using subconditioning, where we first choose the support S of x∗, and then choose the nonzero values x∗_S:

g_i = E_S[E_{x∗_S}[(I − A_S A_S^⊤) A∗_S x∗_S x∗_S^⊤ A∗_S^⊤ A_i 1_{F_i} | S]] ± γ
   = E[(I − A_S A_S^⊤) A∗_S A∗_S^⊤ A_i 1_{F_i}] ± γ
   = E[(I − A_i A_i^⊤ − A_R A_R^⊤)(A∗_i A∗_i^⊤ + A∗_R A∗_R^⊤) A_i 1_{F_i}] ± γ
   = E[(I − A_i A_i^⊤) A∗_i A∗_i^⊤ A_i 1_{F_i}] + E[(I − A_i A_i^⊤) A∗_R A∗_R^⊤ A_i 1_{F_i}]
     − E[A_R A_R^⊤ A∗_i A∗_i^⊤ A_i 1_{F_i}] − E[A_R A_R^⊤ A∗_R A∗_R^⊤ A_i 1_{F_i}] ± γ
Next we will compute the expectation of each of the terms on the right-hand side. This part of the proof will be somewhat more involved than the proof of Proposition 3.4.3, because the terms above are quadratic instead of linear. The leading term is equal to q_i(λ_i A∗_i − λ_i² A_i) and the remaining terms contribute to ε_i. The second term is equal to (I − A_i A_i^⊤) A∗_{−i} diag(q_{i,j}) A∗_{−i}^⊤ A_i, which has norm bounded by O(k²/rd). The third term is equal to λ_i A_{−i} diag(q_{i,j}) A_{−i}^⊤ A∗_i, which again has norm bounded by O(k²/rd). The final term is equal to
E[A_R A_R^⊤ A∗_R A∗_R^⊤ A_i 1_{F_i}] = Σ_{j₁,j₂≠i} E[(A_{j₁}A_{j₁}^⊤)(A∗_{j₂}A∗_{j₂}^⊤) A_i 1_{F_i} 1_{F_{j₁}} 1_{F_{j₂}}]
   = Σ_{j₁≠i} (Σ_{j₂≠i} q_{i,j₁,j₂} ⟨A∗_{j₂}, A_i⟩⟨A∗_{j₂}, A_{j₁}⟩) A_{j₁}
   = A_{−i} v,

where v is the vector whose j₁-th component is equal to Σ_{j₂≠i} q_{i,j₁,j₂} ⟨A∗_{j₂}, A_i⟩⟨A∗_{j₂}, A_{j₁}⟩.
The absolute value of v_{j₁} is bounded by

|v_{j₁}| ≤ O(k²/r²)|⟨A∗_{j₁}, A_i⟩| + O(k³/r³)(Σ_{j₂≠j₁,i} (⟨A∗_{j₂}, A_i⟩² + ⟨A∗_{j₂}, A_{j₁}⟩²))
   ≤ O(k²/r²)|⟨A∗_{j₁}, A_i⟩| + O(k³/r³)‖A∗‖² = O(k²/r²)(|⟨A∗_{j₁}, A_i⟩| + k/d).
The first inequality uses the bounds on the q's and the AM-GM inequality; the second uses the bound on the spectral norm of A∗. We can now bound the norm of v as follows:

‖v‖ ≤ O(k²/r² · √(r/d)),

and this implies that the last term satisfies ‖A_{−i}‖‖v‖ ≤ O(k²/rd). Combining all these bounds completes the proof of the lemma.
We are now ready to prove that the update rule satisfies Definition 3.1.1. This
again uses Proposition 3.4.4, except that we invoke Proposition 3.6.1 instead. Com-
bining these lemmas we obtain:
Lemma 3.6.2. Suppose that As is (2δ, 5)-near to A∗. Then for each i, gsi as defined
in Algorithm 5 is (α, β, ε)-correlated with A∗i , where α = Ω(k/r), β ≥ Ω(r/k) and
ε = O(k3/rd2).
Notice that in the third step of Algorithm 5 we project back (with respect to the Frobenius norm) onto a convex set B, which we define below. Viewed as the minimization of a convex function over a convex constraint set, this projection can be computed by various convex optimization algorithms, e.g., the subgradient method (see Theorem 3.2.3 in Section 3.2.4 of Nesterov's book [132] for more details). Without this modification, it seems that the update rule given in Algorithm 5 does not necessarily preserve nearness.
Definition 3.6.3. Let B = {A : A is δ₀-close to A^0 and ‖A‖ ≤ 2‖A∗‖}.
The crucial properties of this set are summarized in the following claim:
Claim 3.6.4. (a) A∗ ∈ B, and (b) each A ∈ B is (2δ₀, 5)-near to A∗.

Proof. The first part of the claim follows because, by assumption, A∗ is δ₀-close to A^0 and trivially ‖A∗‖ ≤ 2‖A∗‖. The second part follows because ‖A − A∗‖ ≤ ‖A − A^0‖ + ‖A^0 − A∗‖ ≤ 4‖A∗‖. This completes the proof of the claim.
By the convexity of B and the fact that A∗ ∈ B, we have that projection doesn’t
increase the error in Frobenius norm.
Claim 3.6.5. For any matrix A, ‖ProjBA− A∗‖F ≤ ‖A− A∗‖F .
We now have the tools to analyze Algorithm 5 by fitting it into the framework
of Corollary 3.1.4. In particular, we prove that it converges to a globally optimal
solution by connecting it to an approximate form of projected gradient descent:
Proof of Theorem 3.2.4. We note that projecting onto B ensures that at the start of each step ‖A^s − A∗‖ ≤ 5‖A∗‖. Hence g^s_i is (Ω(k/r), Ω(r/k), O(k³/rd²))-correlated with A∗_i for each i, which follows from Lemma 3.6.2. This implies that g^s is (Ω(k/r), Ω(r/k), O(k³/d²))-correlated with A∗ in Frobenius norm. Finally, we can apply Corollary 3.1.4 (applied to matrices with the Frobenius norm) to complete the proof of the theorem.
3.6.2 Removing Systemic Error
The proof of Theorem 3.2.5 is parallel to that of Theorem 3.4.1 and Theorem 3.2.4.
As usual, our first step is to show that gs is correlated with A∗: Theorem 3.2.5 follows
from the two lemmas below (Lemma 3.6.6 and 3.6.7) straightforwardly.
Lemma 3.6.6. Suppose that As is (δ, 5)-near to A∗. Then for each i, gsi as defined
in Algorithm 6 is (α, β, ε)-correlated with A∗i , where α = Ω(k/r), β ≥ Ω(r/k) and
ε ≤ d−ω(1).
Proof. We chose to write the proof of Proposition 3.4.3 in a way that lets us reuse the calculation here. In particular, instead of substituting A^s in the calculation we can substitute B^{(s,i)}, and we get:

g^{(s,i)} = p_i q_i (λ^s_i A^s_i − A∗_i + B^{(s,i)}_{−i} diag(q_{i,j}) B^{(s,i)⊤}_{−i} A∗_i) + γ.
Recall that λ^s_i = ⟨A^s_i, A∗_i⟩. Now we can write g^{(s,i)} = p_i q_i (A^s_i − A∗_i) + v, where

v = p_i q_i (λ^s_i − 1) A^s_i + p_i q_i B^{(s,i)}_{−i} diag(q_{i,j}) B^{(s,i)⊤}_{−i} A∗_i + γ.
Indeed the norm of the first term piqi(λsi − 1)Asi is smaller than piqi‖Asi − A∗i ‖.
Recall that the second term was the main contribution to the systemic error when we analyzed the earlier update rules. However, in this case we can use the fact that B^{(s,i)⊤}_{−i} A^s_i = 0 to rewrite the second term above as

p_i q_i B^{(s,i)}_{−i} diag(q_{i,j}) B^{(s,i)⊤}_{−i} (A∗_i − A^s_i).

Hence we can bound the norm of the second term by O(k²/rd)‖A∗_i − A^s_i‖, which is also much smaller than p_i q_i ‖A^s_i − A∗_i‖.
Combining these two bounds we have that ‖v‖ ≤ p_i q_i ‖A^s_i − A∗_i‖/4 + γ, so we can take ζ = γ = d^{−ω(1)} in Proposition 3.4.4. We can complete the proof by invoking Proposition 3.4.4, which implies that g^{(s,i)} is (Ω(k/r), Ω(r/k), d^{−ω(1)})-correlated with A∗_i.
This lemma would be all we would need, if we added a third step that projects onto
B as we did in Algorithm 5. However here we do not need to project at all, because the
update rule maintains nearness and thus we can avoid this computationally intensive
step.
Lemma 3.6.7. Suppose that As is (δ, 2)-near to A∗. Then ‖As+1 − A∗‖ ≤ 2‖A∗‖ in
Algorithm 6.
The proof of the above lemma parallels that of Proposition 3.4.5. We will focus on highlighting the differences in bounding the error term, to avoid repeating the same calculation.
Proof sketch. We will use A to denote A^s and B^{(i)} to denote B^{(s,i)} to simplify the notation. Also let Â_i denote the normalized column Â_i = A_i/‖A_i‖, so that we can write B^{(i)}_{−i} = (I − Â_i Â_i^⊤)A_{−i}. Hence the error term is given by

(I − Â_i Â_i^⊤) A_{−i} diag(q_{i,j}) A_{−i}^⊤ (I − Â_i Â_i^⊤) A∗_i.

Let C be the matrix whose columns are C_i = (I − Â_i Â_i^⊤)A∗_i = A∗_i − ⟨Â_i, A∗_i⟩Â_i. This implies that ‖C‖ ≤ O(√(r/d)). We can now rewrite the error term above as

A_{−i} diag(q_{i,j}) A_{−i}^⊤ C_i − Â_i Â_i^⊤ A_{−i} diag(q_{i,j}) A_{−i}^⊤ C_i.
It follows from the proof of Proposition 3.4.5 that the first term above has norm bounded by O(r/d · √(r/d)). This is because in Proposition 3.4.5 we bounded the term A_{−i} diag(q_{i,j}) A_{−i}^⊤ A∗_i, and in fact all we used in that proof was the fact that ‖A∗‖ = O(√(r/d)), which also holds for C.
All that remains is to bound the second term. We note that its columns are scalar multiples of Â_i, where the coefficient can be bounded as follows: ‖Â_i‖‖A_{−i}‖²‖diag(q_{i,j})‖‖A∗_i‖ ≤ O(k²/rd). Hence we can bound the spectral norm of the second term by O(k²/rd)‖A‖ = O∗(r/d · √(r/d)). We can now combine these two bounds, which together with the calculation in Proposition 3.4.5 completes the proof.
Part II
Global Convergence with
Arbitrary Initialization
Chapter 4
Analysis via Optimization
Landscape
Section 2.3 and Section 3.1.3 in Part I discussed the strengths and limitations of the "finding coarse solutions + local convergence" paradigm for non-convex optimization. In the following several chapters we are concerned with non-convex optimization algorithms that use simpler initialization schemes, such as random initialization or arbitrary initialization. A key difference from the analysis techniques in Chapter 3 is that we establish useful geometric properties of the objective functions instead of analyzing the algorithmic updates directly. Loosely speaking, we show that all local minima of the target objective function are also global minima. Since many local search algorithms converge to approximate local minima (from arbitrary or random initialization), such a property implies convergence to an approximate global minimum.
In the rest of this section, we briefly survey known results regarding optimization algorithms for functions with such properties. In Chapters 5 and 6, we develop techniques to prove such properties for specific objective functions, which are the crux of Part II.
4.1 Local Optimality vs Global Optimality
Let f be a twice-differentiable function from Rd to R. Recall that x is a local minimum of f(·) if there exists an open neighborhood N of x in which the function value is at least f(x): ∀z ∈ N, f(z) ≥ f(x). We use ∇f(x) to denote the gradient of the function, and ∇²f(x) to denote its Hessian (∇²f(x) is a d × d matrix with [∇²f(x)]_{i,j} = ∂²f(x)/∂x_i∂x_j). It is well known that local minima of the function f must satisfy some necessary conditions:
Definition 4.1.1. A point x satisfies the first-order necessary condition for optimality (later abbreviated as the first-order optimality condition) if ∇f(x) = 0. A point x satisfies the second-order necessary condition for optimality (later abbreviated as the second-order optimality condition) if ∇²f(x) ⪰ 0.
A point x is a critical point or stationary point of the function f if the gradient vanishes at x. Thus a local minimum is a critical point, and so is a global minimum.
For a general function f, even computing a local minimum can be intractable: even a degree-four polynomial can be NP-hard to optimize [83], or even just to check whether a point is not a local minimum [130]. These impossibility results motivate us to look for stronger structure in the target function f. Indeed, under the following strict-saddle assumption, we can efficiently find a local minimum of the function f. Loosely speaking, a strict-saddle function satisfies the property that every saddle point has strictly negative curvature in some direction.
Definition 4.1.2 (strict saddle, c.f. [67, 106]). Suppose f(·): Rd → R is twice differentiable. For α, β, γ ≥ 0, we say f is (α, β, γ)-strict saddle if every x ∈ Rd satisfies at least one of the following three conditions:

1. ‖∇f(x)‖ ≥ α.
2. λ_min(∇²f(x)) ≤ −β.
3. There exists a local minimum x⋆ that is γ-close to x in Euclidean distance.
We see that if a function is (α, β, γ)-strict saddle, then for ε < min{α, β²} an ε-approximate local minimum is γ-close to some local minimum. We also note that this specific parameterization of the condition may not be the most quantitatively harmonious one, but it suffices to allow polynomial-time algorithms. The following theorem is not stated in the strongest form; it aims to emphasize that various algorithms can converge to a local minimum in polynomial time.
Theorem 4.1.3. Let f be a twice-differentiable (α, β, γ)-strict saddle function from Rd to R. Suppose we have access to its gradient (or an unbiased estimator of its gradient, or its Hessian) in poly(d) time. Then there are algorithms (such as stochastic gradient descent or second-order algorithms) that converge to a local minimum with ε error in the domain in time poly(d, 1/α, 1/β, 1/γ, 1/ε).
There have been tremendous efforts to obtain faster and faster algorithms for converging to a local minimum. Nesterov and Polyak [134] and Sun et al. [162] give second-order algorithms for finding an approximate local minimum. Stochastic gradient descent can converge to a local minimum in polynomial time from any starting point [138, 67]. The works [3, 46] also give the fastest algorithms in terms of the ε-dependency.
However, the surprising observation that this part of the thesis aims to explain is that finding a local minimum is often sufficient for solving many machine learning problems. Empirical evidence suggests that the local minima found in practice are actually close to global minima. The explanation that this thesis advocates is:

Many machine learning objective functions have the property that all or most local minima are approximate global minima.
We will show in Chapters 5 and 6 that the natural objective functions for matrix completion and linear dynamical systems indeed have this property. If the property above holds together with the strict saddle property (which we can prove for matrix completion and linear dynamical systems), then many local search algorithms can find a global minimum:
Theorem 4.1.4 (Informal). Let f be a twice-differentiable function from Rd to R. Suppose there exist ε₀, τ₀ > 0 and a universal constant c > 0 such that if a point x satisfies ‖∇f(x)‖ ≤ ε ≤ ε₀ and ∇²f(x) ⪰ −τ₀ · Id, then x is ε^c-close to a global minimum of f. Then many optimization algorithms, including cubic regularization, trust-region methods, and stochastic gradient descent, can find a global minimum of f up to δ error in ℓ₂ norm in the domain in time poly(1/δ, 1/τ₀, d).
A strictly stronger condition than "all local minima are global" is that "every critical point is a global minimum." In this case, the objective function has only a single local minimum, which is also the global minimum. Gradient descent is known to converge to this global minimum linearly (under a quantitative version of this condition), as stated below. However, we also note that since the condition rules out multiple local minima, it cannot hold for many objective functions used in practice, which do have multiple local minima and critical points due to certain symmetries.
Theorem 4.1.5. Suppose the function f has an L-Lipschitz continuous gradient and there exist µ > 0 and x∗ such that for every x,

‖∇f(x)‖² ≥ µ(f(x) − f(x∗)).    (4.1.1)

Then gradient descent with step size 1/L has a linear convergence rate.
Condition (4.1.1) is called the Polyak-Łojasiewicz condition, and the theorem above was first proved by Polyak [142]. We note that it is often difficult to verify the Polyak-Łojasiewicz condition, since the quantity ‖∇f(x)‖² is often a complicated function and therefore inequality (4.1.1) is difficult to establish.
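For a concrete instance (a standard example attributed to Karimi et al., not taken from this thesis), f(x) = x² + 3 sin²(x) is non-convex yet satisfies the Polyak-Łojasiewicz condition with x∗ = 0, and gradient descent with step size 1/L drives the objective to zero rapidly:

```python
import numpy as np

# A standard example (due to Karimi et al., not from this thesis) of a
# non-convex function satisfying the Polyak-Lojasiewicz condition (4.1.1):
#   f(x) = x^2 + 3 sin^2(x),   x* = 0,   f(x*) = 0,
# whose gradient f'(x) = 2x + 3 sin(2x) is L-Lipschitz with L = 8
# (since |f''(x)| = |2 + 6 cos(2x)| <= 8), and ||f'(x)||^2 >= mu * f(x)
# for some mu > 0 even though f is not convex.
f = lambda x: x ** 2 + 3.0 * np.sin(x) ** 2
grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

L = 8.0
x = 2.0                      # arbitrary starting point
history = [f(x)]
for _ in range(100):
    x = x - grad(x) / L      # gradient descent with step size 1/L
    history.append(f(x))

print(history[-1])  # f(x_k) decreases monotonically toward f(x*) = 0
```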
An easier-to-verify condition is quasi-convexity, which we discuss next. Quasi-convexity is also stronger than the Polyak-Łojasiewicz condition, which makes it less applicable. Nevertheless, we will show in Chapter 6 that the objective function for learning linear dynamical systems can be made quasi-convex with over-parameterization.
Quasi-convexity
It is known that, under certain mild conditions, (stochastic) gradient descent converges to a local minimum even on non-convex functions [67, 106]. Usually, for concrete problems, the challenge is to prove that there is no spurious local minimum other than the target solution. Here we introduce a condition similar to the quasi-convexity notion in [78], which ensures that any point with vanishing gradient is the optimal solution. Roughly speaking, the condition says that at any point θ the negative gradient −∇f(θ) should be positively correlated with the direction θ∗ − θ pointing towards the optimum. Our condition is slightly weaker than that in [78], since we only require quasi-convexity and smoothness with respect to the optimum, and this (simple) extension will be necessary for our analysis.
Definition 4.1.6 (Weak quasi-convexity). We say an objective function f is τ-weakly-quasi-convex (τ-WQC) over a domain B with respect to a global minimum θ∗ if there is a positive constant τ > 0 such that for all θ ∈ B,

∇f(θ)^⊤(θ − θ∗) ≥ τ(f(θ) − f(θ∗)).    (4.1.2)

We further say f is Γ-weakly-smooth if for any point θ, ‖∇f(θ)‖² ≤ Γ(f(θ) − f(θ∗)).
Note that any Γ-smooth convex function in the usual sense is indeed O(Γ)-weakly-smooth. For a random vector X ∈ Rn, we define its variance to be V[X] = E[‖X − EX‖²].
Definition 4.1.7. We call r(θ) an unbiased estimator of ∇f(θ) with variance V if it
satisfies E[r(θ)] = ∇f(θ) and V[r(θ)] ≤ V .
Projected stochastic gradient descent over some closed convex set B with learning rate η > 0 refers to the following algorithm, in which Π_B denotes the Euclidean projection onto B:

for k = 0 to K − 1:
    w_{k+1} = θ_k − η r(θ_k)
    θ_{k+1} = Π_B(w_{k+1})
return θ_j with j uniformly picked from {1, . . . , K}    (4.1.3)
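The procedure above can be sketched in a few lines; everything in the snippet (the toy quadratic objective, noise level, ball radius, learning rate, and iteration count) is an illustrative choice of ours rather than a setting from the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch of procedure (4.1.3) on the toy objective
# f(theta) = ||theta - theta*||^2, which is 2-WQC with respect to theta*.
d = 5
theta_star = np.ones(d)                  # global minimum, lies inside B
f = lambda th: float(np.sum((th - theta_star) ** 2))

def r(th):
    # unbiased gradient estimator with bounded variance
    return 2.0 * (th - theta_star) + 0.1 * rng.standard_normal(d)

def proj_B(w, radius=3.0):
    # Euclidean projection onto the ball B = {w : ||w|| <= radius}
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

eta, K = 0.05, 500
theta = np.zeros(d)
iterates = []
for _ in range(K):
    w = theta - eta * r(theta)           # stochastic gradient step
    theta = proj_B(w)                    # project back onto B
    iterates.append(theta)

# The procedure returns a uniformly random iterate, whose expected
# error Proposition 4.1.8 below bounds; the final iterate is also close.
theta_out = iterates[rng.integers(K)]
print(f(iterates[-1]))
```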
The following Proposition is well known for convex objective functions (corre-
sponding to 1-weakly-quasi-convex functions). We extend it (straightforwardly) to
the case when τ -WQC holds with any positive constant τ .
Proposition 4.1.8. Suppose the objective function f is τ-weakly-quasi-convex and Γ-weakly-smooth, and r(·) is an unbiased estimator for ∇f(θ) with variance V. Moreover, suppose the global minimum θ∗ belongs to B, and the initial point θ0 satisfies ‖θ0 − θ∗‖ ≤ R. Then projected stochastic gradient descent (4.1.3) with a proper learning rate returns θ_K in K iterations with expected error

f(θ_K) − f(θ∗) ≤ O(max{ΓR²/(τ²K), R√V/(τ√K)}).
The proof below uses a simple variation of the standard convergence analysis
of stochastic gradient descent (see, for example, [35]), and demonstrates that the
argument still works for weakly-quasi-convex functions.
Proof of Proposition 4.1.8. We start by using the weakly-quasi-convex condition, and then the rest follows a variant of the standard analysis of non-smooth projected subgradient descent¹. We condition on θ_k and have that
gradient descent1. We conditioned on θk, and have that
τ(f(θk)− f(θ∗)) ≤ ∇f(θk)>(θk − θ∗) = E[r(θk)
>(θk − θ∗) | θk]
= E[
1
η(θk − wk+1)(θk − θ∗) | θk
]=
1
η
(E[‖θk − wk+1‖2 | θk
]+ ‖θk − θ∗‖2 − E
[‖wk+1 − θ∗‖2 | θk
])= η E
[‖r(θk)‖2
]+
1
η
(‖θk − θ∗‖2 − E
[‖wk+1 − θ∗‖2 | θk
])(4.1.4)
where the first inequality uses weak quasi-convexity and the rest of the lines are simply algebraic manipulations. Since θ_{k+1} is the projection of w_{k+1} onto B and θ∗ belongs to B, we have ‖w_{k+1} − θ∗‖ ≥ ‖θ_{k+1} − θ∗‖. Together with (4.1.4) and

E[‖r(θ_k)‖²] = ‖∇f(θ_k)‖² + V[r(θ_k)] ≤ Γ(f(θ_k) − f(θ∗)) + V,
we obtain that

τ(f(θ_k) − f(θ∗)) ≤ ηΓ(f(θ_k) − f(θ∗)) + ηV + (1/η)(‖θ_k − θ∗‖² − E[‖θ_{k+1} − θ∗‖² | θ_k]).
Taking expectation over all the randomness and summing over k, we obtain that

Σ_{k=0}^{K−1} E[f(θ_k) − f(θ∗)] ≤ (1/(τ − ηΓ)) (ηKV + (1/η)‖θ0 − θ∗‖²) ≤ (1/(τ − ηΓ)) (ηKV + (1/η)R²),
where we use the assumption that ‖θ0 − θ∗‖ ≤ R. Suppose K ≥ 4R²Γ²/(Vτ²); then we take η = R/√(VK). Therefore we have that τ − ηΓ ≥ τ/2, and therefore

Σ_{k=0}^{K−1} E[f(θ_k) − f(θ∗)] ≤ 4R√V·√K/τ.    (4.1.5)
¹Although we used weak smoothness to get a slightly better bound.
On the other hand, if K ≤ 4R²Γ²/(Vτ²), we pick η = τ/(2Γ) and obtain that

Σ_{k=0}^{K−1} E[f(θ_k) − f(θ∗)] ≤ (2/τ)(τKV/(2Γ) + 2ΓR²/τ) ≤ 8ΓR²/τ².    (4.1.6)
Therefore, using equations (4.1.6) and (4.1.5), we obtain that when choosing η properly according to K as above,

E_{k∈[K]}[f(θ_k) − f(θ∗)] ≤ max{8ΓR²/(τ²K), 4R√V/(τ√K)}.
Remark 4.1.1. It is straightforward to see (from the proof) that the algorithm tolerates inverse-exponential bias in the gradient estimator. Technically, suppose E[r(θ)] = ∇f(θ) ± ζ; then f(θ_K) − f(θ∗) ≤ O(max{ΓR²/(τ²K), R√V/(τ√K)}) + poly(K) · ζ. Throughout, we assume that the error that we are shooting for is inverse-polynomial, and therefore the effect of an inverse-exponential bias is negligible.
Finally, we note that the sum of two quasi-convex functions may no longer be quasi-convex. However, if each function in a collection is τ-WQC with respect to a common point θ∗, then their sum is also τ-WQC. This follows from the linearity of the gradient operator.
Proposition 4.1.9. Suppose the functions f1, . . . , fn are individually τ-weakly-quasi-convex in B with respect to a common global minimum θ∗. Then for non-negative weights w1, . . . , wn, the linear combination f = Σ_{i=1}^n w_i f_i is also τ-weakly-quasi-convex with respect to θ∗ in B.
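Proposition 4.1.9 is easy to sanity-check numerically; the toy instance below (our own choice of convex, hence 1-WQC, summands with common minimum θ∗ = 0) verifies inequality (4.1.2) for a non-negative combination on a grid of points:

```python
import numpy as np

# Numerical sanity check of Proposition 4.1.9 on a toy instance (the
# summands and weights are illustrative): f1 and f2 are convex, hence
# 1-WQC, and share the global minimum theta* = 0, so any non-negative
# combination should satisfy inequality (4.1.2) with tau = 1.
f1, g1 = (lambda t: t ** 2), (lambda t: 2.0 * t)
f2, g2 = (lambda t: t ** 4), (lambda t: 4.0 * t ** 3)
w1, w2 = 0.3, 1.7
f = lambda t: w1 * f1(t) + w2 * f2(t)
g = lambda t: w1 * g1(t) + w2 * g2(t)   # gradients combine linearly

theta_star, tau = 0.0, 1.0
ok = all(
    g(t) * (t - theta_star) >= tau * (f(t) - f(theta_star))
    for t in np.linspace(-2.0, 2.0, 101)
)
print(ok)  # True: the weighted sum still satisfies (4.1.2) on the grid
```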
Chapter 5
Matrix completion
Matrix completion is a basic machine learning problem that has wide applications,
especially in collaborative filtering and recommender systems. Simple non-convex op-
timization algorithms are popular and effective in practice. Despite recent progress
in proving various non-convex algorithms converge from a good initial point, it re-
mains unclear why random or arbitrary initialization suffices in practice. We prove
that the commonly used non-convex objective function for positive semidefinite ma-
trix completion has no spurious local minima – all local minima must also be global.
Therefore, many popular optimization algorithms such as (stochastic) gradient de-
scent can provably solve positive semidefinite matrix completion with arbitrary ini-
tialization in polynomial time. The result can be generalized to the setting when the
observed entries contain noise. We believe that our main proof strategy can be useful
for understanding geometric properties of other statistical problems involving partial
or noisy observations.
5.1 Introduction
Matrix completion is the problem of recovering a low-rank matrix from partially
observed entries. It has been widely used in collaborative filtering and recommender
systems [102, 146], dimension reduction [42] and multi-class learning [5]. There has
been extensive work on designing efficient algorithms for matrix completion with
guarantees. One earlier line of results (see [145, 45, 44] and the references therein)
relies on convex relaxations. These algorithms achieve strong statistical guarantees,
but are quite computationally expensive in practice.
More recently, there has been growing interest in analyzing non-convex algorithms for matrix completion [97, 98, 93, 72, 75, 163, 176, 47, 151]. Let M ∈ Rd×d be the target matrix with rank r ≪ d that we aim to recover, and let Ω = {(i, j) : M_{i,j} is observed} be the set of observed entries. These methods are instantiations of optimization algorithms applied to the objective¹

f(X) = (1/2) Σ_{(i,j)∈Ω} [M_{i,j} − (XX^⊤)_{i,j}]².    (5.1.1)
These algorithms are much faster than the convex relaxation algorithms, which is
crucial for their empirical success in large-scale collaborative filtering applications
[102].
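To make the objective concrete, the following is a minimal numpy sketch of (5.1.1) and its gradient. The dimensions, sampling scheme, and seed here are illustrative assumptions, not the setup analyzed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, p = 50, 2, 0.5                      # toy sizes (assumed for illustration)
Z = rng.standard_normal((d, r))
M = Z @ Z.T                               # symmetric rank-r ground truth
upper = np.triu(rng.random((d, d)) < p)   # sample each (i, j) with probability p
mask = upper | upper.T                    # observe (i, j) and (j, i) together

def f(X):
    """Objective (5.1.1): squared error over the observed entries only."""
    return 0.5 * np.sum(((M - X @ X.T) * mask) ** 2)

def grad_f(X):
    """Gradient of (5.1.1): -2 * P_Omega(M - X X^T) X for a symmetric mask."""
    return -2.0 * ((M - X @ X.T) * mask) @ X
```

A central-difference check against f confirms the gradient formula; the factor 2 comes from Ω containing both (i, j) and (j, i).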
Most of the theoretical analyses of the non-convex procedures require careful initialization schemes: the initial point should already be close to the optimum². In fact, Sun
and Luo [163] showed that after this initialization the problem is effectively strongly-
convex; hence many different optimization procedures can be analyzed by standard
techniques from convex optimization.
However, in practice, people typically use a random initialization, which still leads
to robust and fast convergence. Why can these practical algorithms find the optimal
solution in spite of the non-convexity? In this work, we investigate this question
¹In this paper, we focus on the symmetric case when the true M has a symmetric decomposition M = ZZ⊤. Some previous papers work on the asymmetric case when M = ZW⊤, which is harder than the symmetric case.
²The work of De Sa et al. [151] is an exception, which gives an algorithm that uses fresh samples at every iteration to solve matrix completion (and other matrix problems) approximately.
and show that the matrix completion objective has no spurious local minima. More
precisely, we show that any local minimum X of the objective function f(·) is also a
global minimum, with f(X) = 0, and recovers the correct low-rank matrix M.
Our characterization of the structure in the objective function implies that
(stochastic) gradient descent from an arbitrary starting point converges to a global
minimum. This is because gradient descent converges to a local minimum [67, 106],
and every local minimum is also a global minimum.
5.1.1 Main Results
Assume the target matrix M is symmetric and each entry of M is observed with
probability p independently³. We assume M = ZZ⊤ for some matrix Z ∈ ℝ^{d×r}.
There are two known issues with matrix completion. First, the choice of Z is not
unique, since M = (ZR)(ZR)⊤ for any orthonormal matrix R. Our goal is to find one
of these equivalent solutions.
Another issue is that matrix completion is impossible when M is “aligned” with
the standard basis. For example, when M is the identity matrix on its first r × r block,
we will very likely observe only 0 entries. To address this issue, we make the
following standard assumption:
Assumption 5.1.1. For any row Z_i of Z, we have

‖Z_i‖ ≤ µ/√d · ‖Z‖_F .

Moreover, Z has a bounded condition number σ_max(Z)/σ_min(Z) = κ.
Throughout this paper, we think of µ and κ as small constants, and the sample
complexity depends polynomially on these two parameters. Also, note that this
³The entries (i, j) and (j, i) are the same. With probability p we observe both entries and otherwise we observe neither.
assumption is independent of the choice of Z: all Z such that ZZT = M have
the same row norms and Frobenius norm.
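This invariance is easy to verify numerically. The sketch below (with illustrative, hypothetical sizes and helper names of our own) computes the smallest µ satisfying Assumption 5.1.1 and the condition number κ, and checks that both are unchanged under Z → ZR for an orthonormal R:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 100, 4                              # toy sizes, assumed for illustration
Z = rng.standard_normal((d, r))

def incoherence_mu(Z):
    """Smallest mu with max_i ||Z_i|| <= mu / sqrt(d) * ||Z||_F."""
    d = Z.shape[0]
    return np.max(np.linalg.norm(Z, axis=1)) * np.sqrt(d) / np.linalg.norm(Z)

def condition_kappa(Z):
    """sigma_max(Z) / sigma_min(Z)."""
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

# Row norms, ||Z||_F, and singular values are all invariant under Z -> Z R
# for orthonormal R, so mu and kappa do not depend on the choice of Z.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
```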
This assumption is similar to the “incoherence” assumption [44]. Our assumption
is the same as the one used in analyzing non-convex algorithms [97, 98, 163].
We enforce X to also satisfy this assumption via a regularizer:

f(X) = (1/2) ∑_{(i,j)∈Ω} [ M_{i,j} − (XX⊤)_{i,j} ]² + R(X), (5.1.2)
where R(X) is a function that penalizes X when one of its rows is too large. See
Section 5.3 and Section 5.4 for the precise definition. Our main result shows that in
this setting, the regularized objective function has no spurious local minimum:
Theorem 5.1.2 (Informal). All local minima of the regularized objective (5.1.2)
satisfy XX⊤ = ZZ⊤ = M when p ≥ poly(κ, r, µ, log d)/d.
Combined with the results in [67, 106] (see more discussions in Section 5.1.2), we
have,
Theorem 5.1.3 (Informal). With high probability, stochastic gradient descent on the
regularized objective (5.1.2) will converge to a solution X such that XX⊤ = ZZ⊤ = M
in polynomial time from any starting point. Gradient descent will converge to such a
point with probability 1 from a random starting point.
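As an informal, purely illustrative companion to Theorem 5.1.3 (this toy run omits the regularizer and uses tiny hypothetical dimensions, so it is not the theorem's formal setting), plain gradient descent on the rank-1 factorized objective from a random start typically recovers M:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 30, 0.8                            # toy sizes, assumed for illustration
z = rng.standard_normal(d)
z /= np.linalg.norm(z)                    # ground truth with ||z|| = 1
M = np.outer(z, z)
upper = np.triu(rng.random((d, d)) < p)
mask = upper | upper.T                    # symmetric observation pattern

x = rng.standard_normal(d) / np.sqrt(d)   # random initialization
eta = 0.1
for _ in range(3000):
    x += eta * 2.0 * ((M - np.outer(x, x)) * mask) @ x   # descent step

# z is identifiable only up to sign
err = min(np.linalg.norm(x - z), np.linalg.norm(x + z))
```

With this generous sampling rate the observed entries over-determine the rank-1 factor, so the iterate approaches ±z rather than a spurious point.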
Our results are also robust to noise. Even if each entry is corrupted with
Gaussian noise of standard deviation µ²‖Z‖²_F/d (comparable to the magnitude of
the entry itself!), we can still guarantee that all the local minima satisfy
‖XX⊤ − ZZ⊤‖_F ≤ ε when p is large enough. See the discussion in Appendix 5.5 for results on
noisy matrix completion.
Our main technique is to show that every point that satisfies the first and second
order necessary conditions for optimality must be the desired solution. To achieve
this, we use new ideas to analyze the effect of the regularizer and show how it is
useful for modifying the first and second order conditions to exclude any spurious
local minimum.
5.1.2 Related Work
Matrix Completion The earlier theoretical works on matrix completion analyzed
the nuclear norm minimization [159, 145, 45, 44, 131]. This line of work has the clean-
est and strongest theoretical guarantees; [45, 145] showed that if |Ω| ≳ drµ² log² d, the
nuclear norm convex relaxation recovers the exact underlying low-rank matrix. The
solution can be computed by solving a convex program in polynomial time.
However, the primary disadvantage of nuclear norm methods is their computational
and memory requirements — the fastest known provable algorithms require O(d²)
memory and thus at least O(d²) running time, which could be both prohibitive for
moderate to large values of d. Many algorithms have been proposed to improve the
runtime (either theoretically or empirically) (see, for example, [158, 120, 77], and the
references therein). Burer and Monteiro [38] proposed factorizing the optimization
variable M = XX⊤ and optimizing over X ∈ ℝ^{d×r} instead of M ∈ ℝ^{d×d}. This
approach only requires O(dr) memory, and a single gradient iteration takes time O(|Ω|),
so it has a much lower memory requirement and computational complexity than the
nuclear norm relaxation. On the other hand, the factorization causes the optimization
problem to be non-convex in X, which leads to theoretical difficulties in analyzing
algorithms. Keshavan et al. [97, 98] showed that well-initialized gradient descent
recovers M. The works [75, 72, 93, 47] showed that well-initialized alternating least
squares, block coordinate descent, and gradient descent converge to M. Jain and
Netrapalli [92] gave a fast algorithm that iterates between gradient descent in the relaxed
space and projection onto the set of low-rank matrices. The work [151] analyzes stochastic
gradient descent with fresh samples at each iteration from random initialization
and shows that it approximately converges to the optimal solution. [163, 176, 177, 168]
provided a more unified analysis by showing that with careful initialization many al-
gorithms, including gradient descent and alternating least squares, succeed. [163, 177]
accomplished this by showing an analog of strong convexity in the neighborhood of
the solution M .
Non-convex Optimization Recently, a line of work analyzes non-convex optimization
by separating the problem into two aspects: the geometric aspect, which
shows the function has no spurious local minima, and the algorithmic aspect, which
designs efficient algorithms that converge to a local minimum satisfying the first order
and (relaxed versions of) the second order necessary conditions.
Our result is the first that explains the geometry of the matrix completion
objective. Similar geometric results are only known for a few problems: SVD/PCA,
phase retrieval/synchronization, orthogonal tensor decomposition, and dictionary
learning [21, 157, 67, 162, 22]. The matrix completion objective requires different tools due
to the sampling of the observed entries, as well as carefully managing the regularizer
to restrict the geometry. In parallel to our work, Bhojanapalli et al. [30] showed similar
results for matrix sensing, which is closely related to matrix completion. Loh and
Wainwright [116] showed that for many statistical settings that involve missing/noisy
data and non-convex regularizers, any stationary point of the non-convex objective
is close to global optima; furthermore, there is a unique stationary point that is the
global minimum under stronger assumptions [115].
On the algorithmic side, it is known that second-order algorithms like cubic
regularization [134] and trust-region [162] algorithms converge to local minima that
approximately satisfy first and second order conditions. Gradient descent is also
known to converge to local minima [106] from a random starting point. Stochastic
gradient descent can converge to a local minimum in polynomial time from any
starting point [138, 67]. All of these results can be applied to our setting, implying
that various heuristics used in practice are guaranteed to solve matrix completion.
Notations: For Ω ⊂ [d] × [d], let P_Ω be the linear operator that maps a matrix A
to P_Ω(A), which agrees with A on Ω and is 0 outside of Ω. In this
chapter, for a matrix A, we let A_i denote the i-th row of A. We also use the shorthand
‖A‖_Ω = ‖P_Ω(A)‖_F .
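Concretely, P_Ω is just entrywise multiplication with a 0/1 mask; a minimal sketch (the helper names are ours, not the thesis's):

```python
import numpy as np

def P_Omega(A, mask):
    """Keep A's values on the observed set Omega (mask True), zero elsewhere."""
    return np.where(mask, A, 0.0)

def norm_Omega(A, mask):
    """The shorthand ||A||_Omega = ||P_Omega(A)||_F."""
    return np.linalg.norm(P_Omega(A, mask))
```

Note that P_Ω is a projection: applying it twice changes nothing.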
5.2 Proof Strategy: “Simple” Proofs are More
Generalizable
In this section, we demonstrate the key ideas behind our analysis using the rank r = 1
case. In particular, we first give a “simple” proof for the fully observed case. Then we
show this simple proof can be easily generalized to the random observation case. We
believe that this proof strategy is applicable to other statistical problems involving
partial/noisy observations. The proof sketches in this section are only meant to be
illustrative and may not be fully rigorous in various places. We refer the readers to
Section 5.3 and Section 5.4 for the complete proofs.
In the rank r = 1 case, we assume M = zz⊤, where ‖z‖ = 1 and ‖z‖_∞ ≤ µ/√d.
Let ε ≪ 1 be the target accuracy that we aim to achieve in this section, and let
p = poly(µ, log d)/(dε).
For simplicity, we focus on the following domain B of incoherent vectors, where
the regularizer R(x) vanishes:

B = { x : ‖x‖_∞ < 2µ/√d } . (5.2.1)
Inside this domain B, we can restrict our attention to the objective function
without the regularizer, defined as

g(x) = (1/2) · ‖P_Ω(M − xx⊤)‖²_F . (5.2.2)
The global minima of g(·) are z and −z, with function value 0. Our goal in this
section is to (informally) prove that all the local minima of g(·) are O(√ε)-close to
±z. In later sections we will formally prove that the only local minima are ±z.

Lemma 5.2.1 (Partial observation case, informally stated). Under the setting of this
section, in the domain B, all local minima of the function g(·) are O(√ε)-close to ±z.
It turns out to be insightful to consider the full observation case when Ω = [d]×[d].
The corresponding objective is

g̃(x) = (1/2) · ‖M − xx⊤‖²_F . (5.2.3)

Observe that g(x) is a sampled version of g̃(x), and therefore we expect that they
share the same geometric properties. In particular, if g̃(x) does not have spurious
local minima, then neither should g(x).

Lemma 5.2.2 (Full observation case, informally stated). Under the setting of this
section, in the domain B, the function g̃(·) has only two local minima, ±z.
Before introducing the “simple” proof, let us first look at a delicate proof that
does not generalize well.
Difficult to Generalize Proof of Lemma 5.2.2 We compute the gradient and
Hessian of the full observation objective g̃(x):

∇g̃(x) = Mx − ‖x‖²x , (5.2.4)

∇²g̃(x) = 2xx⊤ − M + ‖x‖² · I . (5.2.5)

Therefore, a critical point x satisfies ∇g̃(x) = Mx − ‖x‖²x = 0, and thus it must
be an eigenvector of M, with ‖x‖² the corresponding eigenvalue. Next, we prove that
the Hessian is positive definite only at the top eigenvector. Let x be an eigenvector
with eigenvalue λ = ‖x‖², where λ is strictly less than the top eigenvalue λ⋆, and let z be
the top eigenvector. Since z ⊥ x, we have ⟨z, ∇²g̃(x)z⟩ = −⟨z, Mz⟩ + ‖x‖² = −λ⋆ + λ < 0,
which shows that x is not a local minimum. Thus only ±z can be local minimizers,
and it is easily verified that ∇²g̃(±z) is indeed positive definite.
The difficulty in generalizing the proof above to the partial observation case is
that it relies heavily on the properties of eigenvectors. Suppose we want to imitate the
proof above for the partial observation case. The first difficulty is how to solve the
equation ∇g(x) = P_Ω(M − xx⊤)x = 0. Moreover, even if we had a reasonable
approximation for the critical points (the solutions of ∇g(x) = 0), it would be difficult
to examine the Hessian at these critical points without the orthogonality of
the eigenvectors.
“Simple” and Generalizable proof The lessons from the subsection above suggest
that we find an alternative proof for the full observation case that is generalizable.
The alternative proof will be simple in the sense that it does not use the notion of
eigenvectors and eigenvalues. Concretely, the key observation behind most of the
analysis in this paper is the following:
Proofs that consist of inequalities that are linear in 1_Ω are often easily generalizable
to the partial observation case.

Here, statements that are linear in 1_Ω mean statements of the form
∑_{i,j} 1_{(i,j)∈Ω} T_{ij} ≤ a. We will call these kinds of proofs “simple” proofs in this section.
Roughly speaking, the observation follows from the law of large numbers: suppose
T_{ij}, (i, j) ∈ [d] × [d], is a sequence of bounded real numbers. Then the sampled sum
∑_{(i,j)∈Ω} T_{ij} = ∑_{i,j} 1_{(i,j)∈Ω} T_{ij} is an accurate estimate of p ∑_{i,j} T_{ij} when the sampling
probability p is relatively large. Consequently, the mathematical implications of
p ∑ T_{ij} ≤ a are expected to be similar to the implications of ∑_{(i,j)∈Ω} T_{ij} ≤ a, up to
some small error introduced by the approximation. To make this concrete, we give
below informal proofs of Lemma 5.2.2 and Lemma 5.2.1 that consist only of statements
that are linear in 1_Ω. Readers will see that, due to this linearity, the proof for the
partial observation case (Claims 1p and 2p below) is a direct generalization of the proof
for the full observation case (Claims 1f and 2f) via concentration inequalities
(which will be discussed more at the end of the section).
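The law-of-large-numbers heuristic above is easy to see numerically; in this hypothetical simulation (sizes chosen for illustration only), the sampled sum of bounded numbers T_ij lands very close to p times the full sum:

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 400, 0.3                           # illustrative sizes
T = rng.uniform(-1.0, 1.0, size=(d, d))   # bounded real numbers T_ij
omega = rng.random((d, d)) < p            # keep each (i, j) with probability p

sampled = T[omega].sum()                  # sum over Omega of T_ij
scaled_full = p * T.sum()                 # p * sum over all (i, j)

# normalized deviation; it shrinks like 1/sqrt(p) * 1/d for bounded T_ij
dev = abs(sampled - scaled_full) / (d * d)
```

So an inequality of the form Σ_Ω T_ij ≤ a carries roughly the same information as p Σ_{i,j} T_ij ≤ a, which is exactly the transfer the "simple" proofs exploit.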
A “simple” proof of Lemma 5.2.2 and its generalization to Lemma 5.2.1
We prove Lemma 5.2.2 by combining the two claims below. We will also give their
generalizations to the partial observation case.
Claim 1f. Suppose x ∈ B satisfies ∇g̃(x) = 0, where g̃ is the full observation objective (5.2.3). Then ⟨x, z⟩² = ‖x‖⁴.

Proof. We have

∇g̃(x) = (zz⊤ − xx⊤)x = 0
⇒ ⟨x, ∇g̃(x)⟩ = ⟨x, (zz⊤ − xx⊤)x⟩ = 0 (5.2.6)
⇒ ⟨x, z⟩² = ‖x‖⁴
Intuitively, this proof says that the norm of a critical point x is controlled by its
correlation with z.
The following claim is the counterpart of Claim 1f in the partial observation case.
Claim 1p. Suppose x ∈ B satisfies ∇g(x) = 0. Then ⟨x, z⟩² ≥ ‖x‖⁴ − ε.

Proof. Imitating the proof of Claim 1f,

∇g(x) = P_Ω(zz⊤ − xx⊤)x = 0
⇒ ⟨x, ∇g(x)⟩ = ⟨x, P_Ω(zz⊤ − xx⊤)x⟩ = 0 (5.2.7)
⇒ ⟨x, z⟩² ≥ ‖x‖⁴ − ε

The last step uses the fact that equations (5.2.6) and (5.2.7) are approximately equal
up to a scaling factor p for any x ∈ B, since (5.2.7) is a sampled version of (5.2.6).
Claim 2f. If x ∈ B has a positive semidefinite Hessian ∇²g̃(x) ⪰ 0, then ‖x‖² ≥ 1/3.

Proof. By the assumption on x, we have ⟨z, ∇²g̃(x)z⟩ ≥ 0. Calculating the
quadratic form of the Hessian (see Proposition 5.3.1 for details),

⟨z, ∇²g̃(x)z⟩ = ‖zx⊤ + xz⊤‖²_F − 2z⊤(zz⊤ − xx⊤)z ≥ 0 (5.2.8)
⇒ ‖x‖² + 2⟨z, x⟩² ≥ 1
⇒ ‖x‖² ≥ 1/3 (since ⟨z, x⟩² ≤ ‖x‖²)
Claim 2p. If x ∈ B has a positive semidefinite Hessian ∇²g(x) ⪰ 0, then ‖x‖² ≥ 1/3 − ε.
Proof. Imitating the proof of Claim 2f, calculating the quadratic form of the Hessian
at z (see Proposition 5.3.1), we have

⟨z, ∇²g(x)z⟩ = ‖P_Ω(zx⊤ + xz⊤)‖²_F − 2z⊤P_Ω(zz⊤ − xx⊤)z ≥ 0 (5.2.9)
⇒ · · · (same steps as in Claim 2f)
⇒ ‖x‖² ≥ 1/3 − ε

Here we use the fact that ⟨z, ∇²g(x)z⟩ ≈ p · ⟨z, ∇²g̃(x)z⟩ for any x ∈ B, where g̃ is
the full observation objective (5.2.3).
With these two claims, we are ready to prove Lemmas 5.2.2 and 5.2.1, using one more
step that is linear in 1_Ω.

Proof of Lemma 5.2.2. By Claims 1f and 2f, x satisfies ⟨x, z⟩² = ‖x‖⁴ ≥ 1/9.
Moreover, ∇g̃(x) = 0 implies

⟨z, ∇g̃(x)⟩ = ⟨z, (zz⊤ − xx⊤)x⟩ = 0 (5.2.10)
⇒ ⟨x, z⟩(1 − ‖x‖²) = 0
⇒ ‖x‖² = 1 (by ⟨x, z⟩² ≥ 1/9)

Then by Claim 1f again we obtain ⟨x, z⟩² = 1, and therefore x = ±z.
The proof of Lemma 5.2.1 is almost the same as that of Lemma 5.2.2.

Proof of Lemma 5.2.1. By Claims 1p and 2p, x satisfies ⟨x, z⟩² ≥ ‖x‖⁴ ≥ 1/9 − O(ε).
Moreover, ∇g(x) = 0 implies

⟨z, ∇g(x)⟩ = ⟨z, P_Ω(zz⊤ − xx⊤)x⟩ = 0 (5.2.11)
⇒ · · · (same steps as in the proof of Lemma 5.2.2)
⇒ ‖x‖² = 1 ± O(ε)

Since (5.2.11) is the sampled version of equation (5.2.10), we expect them to lead to
the same conclusion up to some approximation. Then by Claim 1p again we obtain
⟨x, z⟩² = 1 ± O(ε), and therefore x is O(√ε)-close to either of ±z.
Subtleties regarding uniform convergence In the proof sketches above, our
key idea is to use concentration inequalities to link the full observation objective
with its partial observation counterpart. However, we require a uniform convergence
result. For example, we need a statement like: “w.h.p. over the choice of Ω,
equations (5.2.6) and (5.2.7) are similar to each other up to scaling.” This type of
statement is often only true for x inside the incoherent ball B. The fix for this is
the regularizer. For non-incoherent x, we will use a different argument that exploits the
properties of the regularizer. This is beyond the main proof strategy of this section
and will be discussed in subsequent sections.
5.3 Warm-up: Rank-1 Case
In this section, using the general proof strategy described in the previous section, we
provide a formal proof for the rank-1 case. In subsection 5.3.1, we formally work out
the proof sketches of Section 5.2. In subsection 5.3.2, we prove that, due to the effect
of the regularizer, the objective function does not have any local minimum outside
the incoherent ball B.
In the rank-1 case, the objective function simplifies to

f(x) = (1/2) ‖P_Ω(M − xx⊤)‖²_F + λ R(x) . (5.3.1)

Here we use the regularization

R(x) = ∑_{i=1}^d h(x_i), where h(t) = (|t| − α)⁴ · 1{|t| ≥ α} .

The parameters λ and α will be chosen as in Theorem 5.3.2. We will choose
α > 10µ/√d so that R(x) = 0 for incoherent x; thus R only penalizes coherent
x. Moreover, we note that R(x) has a Lipschitz second order derivative.⁴
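A small numerical sketch of the penalty h and the regularizer R (the value of α below is a placeholder, not a recommended setting): h and its derivative vanish on incoherent coordinates, and the fourth power keeps the first and second derivatives continuous at |t| = α.

```python
import numpy as np

def h(t, alpha):
    """Per-coordinate penalty (|t| - alpha)^4 on coordinates with |t| > alpha."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) > alpha, (np.abs(t) - alpha) ** 4, 0.0)

def h_prime(t, alpha):
    """Derivative 4(|t| - alpha)^3 sign(t); the 4th power (rather than the 2nd)
    is what makes the second derivative Lipschitz at |t| = alpha."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) > alpha,
                    4 * (np.abs(t) - alpha) ** 3 * np.sign(t), 0.0)

def R(x, alpha):
    """R(x) = sum_i h(x_i): zero for incoherent x, large for coherent x."""
    return h(x, alpha).sum()
```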
We first state the optimality conditions.
Proposition 5.3.1. The first order optimality condition of objective (5.3.1) is

2 P_Ω(M − xx⊤)x = λ∇R(x) , (5.3.2)

and the second order optimality condition requires:

∀v ∈ ℝ^d, ‖P_Ω(vx⊤ + xv⊤)‖²_F + λ v⊤∇²R(x)v ≥ 2 v⊤P_Ω(M − xx⊤)v . (5.3.3)

Moreover, the τ-relaxed second order optimality condition requires:

∀v ∈ ℝ^d, ‖P_Ω(vx⊤ + xv⊤)‖²_F + λ v⊤∇²R(x)v ≥ 2 v⊤P_Ω(M − xx⊤)v − τ‖v‖² . (5.3.4)
Proof. We take the Taylor expansion around the point x. Let δ be an infinitesimal
vector. We have

f(x + δ) = (1/2) ‖P_Ω(M − (x+δ)(x+δ)⊤)‖²_F + λR(x+δ) + o(‖δ‖²)
= (1/2) ‖P_Ω(M − xx⊤ − (xδ⊤ + δx⊤) − δδ⊤)‖²_F
  + λ( R(x) + ⟨∇R(x), δ⟩ + (1/2) δ⊤∇²R(x)δ ) + o(‖δ‖²)
= (1/2) ‖M − xx⊤‖²_Ω + λR(x)
  − ⟨P_Ω(M − xx⊤), xδ⊤ + δx⊤⟩ + λ⟨∇R(x), δ⟩
  − ⟨P_Ω(M − xx⊤), δδ⊤⟩ + (1/2) ‖P_Ω(xδ⊤ + δx⊤)‖²_F + (1/2) λ δ⊤∇²R(x)δ + o(‖δ‖²).

By symmetry, ⟨P_Ω(M − xx⊤), xδ⊤⟩ = ⟨P_Ω(M − xx⊤), δx⊤⟩ = ⟨P_Ω(M − xx⊤)x, δ⟩,
so the first order optimality condition is: for all δ, ⟨−2P_Ω(M − xx⊤)x + λ∇R(x), δ⟩ = 0,
which is equivalent to 2P_Ω(M − xx⊤)x = λ∇R(x).

The second order optimality condition says −⟨P_Ω(M − xx⊤), δδ⊤⟩ + (1/2)‖P_Ω(xδ⊤ +
δx⊤)‖²_F + (1/2)λ δ⊤∇²R(x)δ ≥ 0 for every δ, which is exactly equivalent to equation
(5.3.3).

⁴This is the main reason for us to choose the 4-th power instead of the 2-nd power.
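The gradient formula underlying (5.3.2) can be sanity-checked with finite differences. A sketch with toy, hypothetical parameters (this is not the parameter regime of Theorem 5.3.2):

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha, lam = 12, 0.5, 2.0              # illustrative values only
z = rng.standard_normal(d)
z /= np.linalg.norm(z)
M = np.outer(z, z)
upper = np.triu(rng.random((d, d)) < 0.5)
mask = upper | upper.T                    # symmetric observed set Omega

def R(x):
    return np.where(np.abs(x) > alpha, (np.abs(x) - alpha) ** 4, 0.0).sum()

def grad_R(x):
    return np.where(np.abs(x) > alpha,
                    4 * (np.abs(x) - alpha) ** 3 * np.sign(x), 0.0)

def f(x):
    """Objective (5.3.1)."""
    return 0.5 * np.sum(((M - np.outer(x, x)) * mask) ** 2) + lam * R(x)

def grad_f(x):
    """Setting this to zero is exactly the first order condition (5.3.2)."""
    return -2.0 * ((M - np.outer(x, x)) * mask) @ x + lam * grad_R(x)
```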
We give the precise version of Theorem 5.1.2 for the rank-1 case.

Theorem 5.3.2. Suppose p ≥ c µ⁶ log^{1.5} d / d, where c is a large enough absolute constant. Set
α = 10µ√(1/d) and λ ≥ µ²p/α². Then, with high probability over the randomness of
Ω, the only points in ℝ^d that satisfy both first and second order optimality conditions
(or τ-relaxed optimality conditions with τ < 0.1p) are z and −z.
In the rest of this section, we will first prove that if x is constrained to be
incoherent (so that the regularizer is 0 and concentration is straightforward) and
satisfies the optimality conditions, then x has to be z or −z. Then we go on to explain
how the regularizer helps us change the geometry for those points that are far away
from z, so that we can rule them out as local minima. For simplicity, we
will focus on the part that shows a local minimum x must be close enough to z.
Lemma 5.3.3. In the setting of Theorem 5.3.2, suppose x satisfies the first-order
and second-order optimality conditions (5.3.2) and (5.3.3). Then, with p as in
Theorem 5.3.2,

‖xx⊤ − zz⊤‖²_F ≤ O(ε) ,

where ε = µ³(pd)^{−1/2}.

This turns out to be the main challenge. Once we have proved that x is close, we can
apply the result of Sun and Luo [163] (see Lemma 5.6.1) and obtain Theorem 5.3.2.
5.3.1 Handling Incoherent x
To demonstrate the key idea, in this section we restrict our attention to the subset
of ℝ^d that contains incoherent x with ℓ₂ norm bounded by 1; that is, we consider

B = { x : ‖x‖_∞ ≤ 2µ/√d, ‖x‖ ≤ 1 } . (5.3.5)
Note that the desired solution z is in B, and the regularization R(x) vanishes
inside B.
The following lemmas assume that x satisfies the first and second order optimality
conditions, and deduce a sequence of properties that x must satisfy.
Lemma 5.3.4. Under the setting of Theorem 5.3.2, with high probability over the
choice of Ω, for any x ∈ B that satisfies the second-order optimality condition (5.3.3)
we have

‖x‖² ≥ 1/4 .

The same is true if x ∈ B only satisfies the τ-relaxed second order optimality condition
for τ ≤ 0.1p.

Proof. We plug v = z into the second-order optimality condition (5.3.3) and obtain

‖P_Ω(zx⊤ + xz⊤)‖²_F ≥ 2 z⊤P_Ω(M − xx⊤)z . (5.3.6)
Intuitively, when restricted to Ω, the squared Frobenius norm on the LHS and the
quadratic form on the RHS should both be approximately a p fraction of their
unrestricted counterparts. In fact, both LHS and RHS can be written as sums of terms
of the form ⟨P_Ω(uv⊤), P_Ω(st⊤)⟩, because

‖P_Ω(zx⊤ + xz⊤)‖²_F = 2⟨P_Ω(zx⊤), P_Ω(zx⊤)⟩ + 2⟨P_Ω(zx⊤), P_Ω(xz⊤)⟩ ,
2 z⊤P_Ω(M − xx⊤)z = 2⟨P_Ω(zz⊤), P_Ω(zz⊤)⟩ − 2⟨P_Ω(xx⊤), P_Ω(zz⊤)⟩ .
Therefore we can use concentration inequalities (Theorem 5.7.1) and simplify:

LHS of (5.3.6) = p‖zx⊤ + xz⊤‖²_F ± O(pε)
= 2p‖x‖²‖z‖² + 2p⟨x, z⟩² ± O(pε) , (since x, z ∈ B)

where ε = O(µ²√(log d/(pd))). Similarly, by Theorem 5.7.1 again, we have

RHS of (5.3.6) = 2( ⟨P_Ω(zz⊤), P_Ω(zz⊤)⟩ − ⟨P_Ω(xx⊤), P_Ω(zz⊤)⟩ ) (since M = zz⊤)
= 2p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε) . (by Theorem 5.7.1 and x, z ∈ B)

(Note that even if we use the τ-relaxed second order optimality condition, the RHS
only becomes 1.99p‖z‖⁴ − 2p⟨x, z⟩² ± O(pε), which does not affect the later proofs.)
Plugging the estimates above back into equation (5.3.6), we have

2p‖x‖²‖z‖² + 2p⟨x, z⟩² ≥ 2p‖z‖⁴ − 2p⟨x, z⟩² − O(pε) ,

which implies 6p‖x‖²‖z‖² ≥ 2p‖x‖²‖z‖² + 4p⟨x, z⟩² ≥ 2p‖z‖⁴ − O(pε), using
⟨x, z⟩² ≤ ‖x‖²‖z‖². Since ‖z‖² = 1 and ε is sufficiently small, we complete the proof.
Next we use the first order optimality condition to pin down another property of x:
it has to be close to z after scaling. Note that this does not directly mean that x is
close to z, since x = 0 also satisfies the first order optimality condition (and therefore
the conclusion (5.3.7) below).

Lemma 5.3.5. With high probability over the randomness of Ω, any x ∈ B that
satisfies the first-order optimality condition (5.3.2) also satisfies

‖⟨z, x⟩z − ‖x‖²x‖ ≤ O(ε) , (5.3.7)

where ε = O(µ³(pd)^{−1/2}).
Proof. Since x ∈ B, we have R(x) = 0. Therefore the first-order optimality condition
says

P_Ω(M − xx⊤)x = P_Ω(zz⊤)x − P_Ω(xx⊤)x = 0 . (5.3.8)

Intuitively, we hope P_Ω(zz⊤) ≈ p·zz⊤ and P_Ω(xx⊤)x ≈ p‖x‖²x. These are made
precise by the concentration inequalities Lemma 5.7.4 and Theorem 5.7.2, respectively.
By Theorem 5.7.2, with high probability over the choice of Ω, for every x ∈ B,

‖P_Ω(xx⊤)x − p·xx⊤x‖ ≤ pε‖x‖³ ≤ pε , (5.3.9)

where ε = O(µ³(pd)^{−1/2}). Similarly, by Lemma 5.7.4, with high probability over the
choice of Ω,

‖P_Ω(zz⊤) − p·zz⊤‖ ≤ εp ,

for ε = O(µ²(pd)^{−1/2}). Therefore, for every x,

‖P_Ω(zz⊤)x − p·zz⊤x‖ ≤ εp‖x‖ ≤ εp . (5.3.10)

Plugging estimates (5.3.10) and (5.3.9) into equation (5.3.8), we complete the proof.
Finally, we combine the two optimality conditions and show that equation (5.3.7)
implies xx⊤ must be close to zz⊤.

Lemma 5.3.6. Suppose the vector x satisfies ‖x‖² ≥ 1/4 and ‖⟨z, x⟩z − ‖x‖²x‖ ≤ δ.
Then for δ ∈ (0, 0.1),

‖xx⊤ − zz⊤‖²_F ≤ O(δ) .
Figure 5.1: Partition of ℝ^d into regions where our lemmas apply. For example, Lemma 5.3.8 rules out the possibility that a point x in the green region is a local minimum. The green region is the intersection of the ℓ∞ norm ball and the ℓ₂ norm ball. Both the white region and the yellow region have non-zero gradient, but for different reasons.
Proof. Write z = ux + v, where u ∈ ℝ and v is a vector orthogonal to x. Then
⟨z, x⟩z = u²‖x‖²x + u‖x‖²v, and therefore

δ ≥ ‖⟨z, x⟩z − ‖x‖²x‖ = ‖x‖²√(u²‖v‖² + (1 − u²)²) .

In particular, |1 − u²| ≤ 4δ and u‖v‖ ≤ 4δ. This means |u| ∈ 1 ± 3δ and
‖v‖ ≤ 8δ. Now we expand xx⊤ − zz⊤:

xx⊤ − zz⊤ = (1 − u²)xx⊤ − u(xv⊤ + vx⊤) − vv⊤ .

All of these terms have Frobenius norm bounded by O(δ), and therefore
‖xx⊤ − zz⊤‖²_F ≤ O(δ).
5.3.2 Extension to General x
We have shown that when x is incoherent and satisfies the first and second order
optimality conditions, it must be close to z or −z. Now we need to consider the more
general case, when x may have some very large coordinates. Here the main intuition
is that the first order optimality condition with a proper regularizer is enough to
guarantee that x cannot have an entry that is much bigger than µ/√d.
Lemma 5.3.7. With high probability over the choice of Ω, for any x that satisfies
the first-order optimality condition (5.3.2), we have

‖x‖_∞ ≤ 4 max{ α, µ√(p/λ) } . (5.3.11)

Here we recall that α was chosen to be 10µ/√d, and λ is chosen large enough that
α dominates the second term µ√(p/λ) in the setting of Theorem 5.3.2.
Proof of Lemma 5.3.7. Let i⋆ = argmax_j |x_j|. Without loss of generality, suppose
x_{i⋆} ≥ 0. Suppose the i⋆-th row of Ω consists of the entries with indices {i⋆} × S_{i⋆}.
If |x_{i⋆}| ≤ 2α, we are done, so in the rest of the proof we assume |x_{i⋆}| > 2α. Note
that when p ≥ c(log d)/d for a sufficiently large constant c, with high probability over
the choice of Ω we have |S_{i⋆}| ≤ 2pd. In the rest of the argument we work with such
an Ω.

We will compare the i⋆-th coordinates of the LHS and RHS of the first-order
optimality condition (5.3.2). For preparation, we have

|(P_Ω(M)x)_{i⋆}| = |(P_Ω(zz⊤)x)_{i⋆}| = | ∑_{j∈S_{i⋆}} z_{i⋆} z_j x_j |
≤ |x_{i⋆}| ∑_{j∈S_{i⋆}} |z_{i⋆} z_j| ≤ |x_{i⋆}| · µ²/d · |S_{i⋆}| ≤ 2|x_{i⋆}| pµ² , (5.3.12)
where the last step uses |S_{i⋆}| ≤ 2pd. Moreover, we have

(P_Ω(xx⊤)x)_{i⋆} = ∑_{j∈S_{i⋆}} x_{i⋆} x_j² ≥ 0 ,

and

(λ∇R(x))_{i⋆} = 4λ(|x_{i⋆}| − α)³ sign(x_{i⋆}) ≥ (λ/2)|x_{i⋆}|³ . (since |x_{i⋆}| ≥ 2α)

Now, plugging the bounds above into the i⋆-th coordinate of equation (5.3.2), we
obtain

4|x_{i⋆}| pµ² ≥ 2(P_Ω(M − xx⊤)x)_{i⋆} ≥ (λ∇R(x))_{i⋆} ≥ (λ/2)|x_{i⋆}|³ ,

which implies |x_{i⋆}| ≤ 4√(pµ²/λ).
Setting λ ≥ µ²p/α² and α = 10µ√(1/d), Lemma 5.3.7 ensures that any x that
satisfies the first-order optimality condition lies in the following ball:

B′ = { x ∈ ℝ^d : ‖x‖_∞ ≤ 4α } .
We would then like to continue with arguments similar to Lemmas 5.3.4 and
5.3.5. However, things become more complicated, as now we need to account for
the contribution of the regularizer.

Lemma 5.3.8 (Extension of Lemma 5.3.4). In the setting of Theorem 5.3.2, with
high probability over the choice of Ω, if x ∈ B′ satisfies the second-order optimality
condition (5.3.3), or the τ-relaxed condition for τ ≤ 0.1p, then ‖x‖² ≥ 1/8.

The guarantees and proofs are very similar to Lemma 5.3.4. The main intuition
is that we can restrict our attention to the coordinates on which the regularizer vanishes.
Proof. If ‖x‖ ≥ 1, then we are done; in the rest of the proof we assume ‖x‖ ≤ 1.
The proof is very similar to that of Lemma 5.3.4, except that we plug v = z_J into
equation (5.3.3), where J = {i : |x_i| ≤ α}. Since z_J is supported on coordinates with
|x_i| ≤ α, the regularizer term z_J⊤∇²R(x)z_J vanishes. We thus obtain that x satisfies

‖P_Ω(z_J x⊤ + x z_J⊤)‖²_F ≥ 2 z_J⊤ P_Ω(M − xx⊤) z_J . (5.3.13)

Recall that x ∈ B′, so ‖x‖_∞ ≤ 4α, and we assumed w.l.o.g. that ‖x‖ ≤ 1. Moreover,
‖z_J‖_∞ ≤ µ/√d and ‖z_J‖ ≤ 1. Similarly to the derivation in the proof of Lemma 5.3.4,
we apply Theorem 5.7.1 (twice) and obtain that, with high probability over the choice
of Ω, for every such x, with ε = O(µ²(pd)^{−1/2}),

LHS of (5.3.13) = p‖z_J x⊤ + x z_J⊤‖²_F ± O(pε) = 2p‖x‖²‖z_J‖² + 2p⟨x, z_J⟩² ± O(pε) ,

RHS of (5.3.13) = 2( ⟨P_Ω(zz⊤), P_Ω(z_J z_J⊤)⟩ − ⟨P_Ω(xx⊤), P_Ω(z_J z_J⊤)⟩ ) (since M = zz⊤)
= 2p‖z_J‖⁴ − 2p⟨x, z_J⟩² ± O(pε) . (by Theorem 5.7.1)

(Again, using the τ-relaxed second order optimality condition does not affect the RHS
by much, so it does not change the later steps.) Plugging the estimates above back
into equation (5.3.13), we have

p‖x‖²‖z_J‖² + 2p⟨x, z_J⟩² ≥ p‖z_J‖⁴ − O(pε) .

Using Cauchy-Schwarz, ⟨x, z_J⟩² ≤ ‖x‖²‖z_J‖², and therefore we obtain
‖z_J‖²‖x‖² ≥ (1/3)‖z_J‖⁴ − O(ε).
Finally, we claim that ‖z_J‖² ≥ 1/2, which completes the proof, since
‖x‖² ≥ (1/3)‖z_J‖² − O(ε) ≥ 1/8.

Claim 5.3.9. Suppose α ≥ 4µ/√d and x satisfies ‖x‖_∞ ≤ 4α and ‖x‖² ≤ 2. Let
J = {i : |x_i| ≤ α}. Then we have ‖z_J‖² ≥ 1/2.

The claim can be proved simply as follows. Since ‖x‖² ≤ 2, we have |J^c| ≤ 2/α²,
and therefore ‖z_{J^c}‖² ≤ 2µ²/(dα²). This further implies that ‖z_J‖² = ‖z‖² −
‖z_{J^c}‖² ≥ 1 − 2µ²/(dα²) ≥ 1/2, because α ≥ 2µ/√d.
We will now deal with the first order optimality condition. We first write out the
basic extension of Lemma 5.3.5, which follows from the same proof, except that we
now include the regularizer term.

Lemma 5.3.10 (Basic extension of Lemma 5.3.5). With high probability over the
randomness of Ω, any x ∈ B′ that satisfies the first-order optimality condition (5.3.2)
also satisfies

‖⟨z, x⟩z − ‖x‖²x − γ · ∇R(x)‖ ≤ O(ε) , (5.3.14)

where ε = O(µ⁶(pd)^{−1/2}) and γ = λ/(2p) ≥ 0.
Next we show that we can remove the regularizer term; the main observation
here is that the nonzero entries of ∇R(x) all have the same sign as the corresponding
entries of x.

Lemma 5.3.11. Suppose x ∈ B′ satisfies ‖x‖² ≥ 1/8. Then, under the same
assumptions as Lemma 5.3.10, we have

‖⟨x, z⟩z − ‖x‖²x‖ ≤ O(ε) .
Proof. Let L = {i : |x_i| ≥ α}. For i ∉ L, we have (∇R(x))_i = 0. Therefore it
suffices to prove that for every i ∈ L,

(z_i⟨z, x⟩ − x_i‖x‖²)² ≤ (z_i⟨z, x⟩ − x_i‖x‖² − (γ∇R(x))_i)² ,

for which it suffices to prove

(∇R(x))_i (x_i‖x‖² − z_i⟨z, x⟩) ≥ 0 . (5.3.15)

Since (∇R(x))_i = γ_i x_i for some γ_i ≥ 0, we have

(∇R(x))_i · x_i‖x‖² = γ_i x_i² ‖x‖² ≥ (1/√8) γ_i x_i² ‖x‖ . (since ‖x‖² ≥ 1/8)

On the other hand, we have

(∇R(x))_i · z_i⟨z, x⟩ = γ_i x_i z_i ⟨z, x⟩ ≤ (1/4) γ_i x_i² ‖x‖‖z‖ . (by |x_i| ≥ α ≥ 4|z_i|)

Combining the two inequalities above, we obtain equation (5.3.15), which completes
the proof.
Finally, we combine Lemmas 5.3.7, 5.3.8, 5.3.11, and 5.3.6 to prove Lemma 5.3.3.
The arguments are also summarized in Figure 5.1, where we partition ℝ^d into regions
where our lemmas apply.
5.4 Rank-r Case
In this section we show how to extend the results of Section 5.3 to recover matrices
of rank r. We still use the proof strategy of Section 5.2, though for simplicity we only
write down the proof for the partial observation case; the analysis for the full
observation case (which was our starting point) can be obtained by substituting
[d] × [d] for Ω everywhere.

Recall that in this case we assume the original matrix M = ZZ⊤, where Z ∈ ℝ^{d×r},
and we assume Assumption 5.1.1. The objective function is very similar to the rank-1
case:

f(X) = (1/2) ‖P_Ω(M − XX⊤)‖²_F + λ R(X) , (5.4.1)

where R(X) = ∑_{i=1}^d h(‖X_i‖), with h(t) = (|t| − α)⁴ · 1{|t| ≥ α} as in the rank-1
case. Here α and λ are again parameters that will be determined later.
Without loss of generality, we assume that ‖Z‖²_F = r in this section. This implies that σ_max(Z) ≥ 1 ≥ σ_min(Z). We now state the first- and second-order optimality conditions:
Proposition 5.4.1. If X is a local optimum of objective function (5.4.1), its first-order optimality condition is
2P_Ω(M)X = 2P_Ω(XX^⊤)X + λ∇R(X) , (5.4.2)
and the second-order optimality condition is equivalent to
∀V ∈ R^{d×r} : ‖P_Ω(V X^⊤ + XV^⊤)‖²_F + λ⟨V, ∇²R(X)[V]⟩ ≥ 2⟨P_Ω(M − XX^⊤), V V^⊤⟩ . (5.4.3)
Note that the regularizer is now more complicated than in the one-dimensional case, but luckily we still have the following nice property.
Proposition 5.4.2. We have that ∇R(X) = ΓX, where Γ ∈ R^{d×d} is a diagonal matrix with Γ_ii = (4(‖X_i‖ − α)³/‖X_i‖)·1{‖X_i‖ ≥ α}. As a direct consequence, ⟨(∇R(X))_i, X_i⟩ ≥ 0 for every i ∈ [d].
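Proposition 5.4.2 can be sanity-checked numerically; the sketch below (ours) compares the closed form ∇R(X) = ΓX, with Γ_ii = 4(‖X_i‖ − α)³/‖X_i‖ on active rows, against a central finite-difference gradient of R.

```python
import numpy as np

def R(X, alpha):
    row_norms = np.linalg.norm(X, axis=1)
    return np.sum(np.maximum(row_norms - alpha, 0.0) ** 4)

def grad_R(X, alpha):
    # nabla R(X) = Gamma X, Gamma_ii = 4 (||X_i|| - alpha)^3 / ||X_i|| on rows with ||X_i|| >= alpha
    row_norms = np.linalg.norm(X, axis=1)
    gamma = 4.0 * np.maximum(row_norms - alpha, 0.0) ** 3 / np.maximum(row_norms, 1e-12)
    return gamma[:, None] * X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
alpha, eps = 0.5, 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (R(Xp, alpha) - R(Xm, alpha)) / (2 * eps)
max_error = np.max(np.abs(numeric - grad_R(X, alpha)))
```

Note that `np.maximum(row_norms - alpha, 0.0)` already zeroes out the inactive rows, so no explicit indicator is needed.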
Now we are ready to state the precise version of Theorem 5.1.2:
Theorem 5.4.3. Suppose p ≥ C max{µ⁶κ¹⁶r⁴, µ⁴κ⁴r⁶}·d⁻¹ log² d, where C is a large enough constant. Let α = 4µκr/√d and λ ≥ µ²rp/α². Then with high probability over the randomness of Ω, any local minimum X of f(·) satisfies f(X) = 0, and in particular ZZ^⊤ = XX^⊤.
The proof of this theorem follows a similar path as that of Theorem 5.3.2. We first notice that, because of the regularizer, any matrix X that satisfies the first-order optimality condition must be somewhat incoherent (this is analogous to Lemma 5.3.7):
Lemma 5.4.4. Suppose |S_i| ≤ 2pd for every i. Then for any X satisfying the first-order optimality condition (5.4.2), we have
‖X‖_{2→∞} = max_i ‖X_i‖ ≤ 4 max{α, µ√(rp/λ)} . (5.4.4)
Proof. Let i* = argmax_i ‖X_i‖, and suppose the i-th row of Ω consists of the entries with indices {i} × S_i. If ‖X_{i*}‖ ≤ 2α, then we are done. Therefore in the rest of the proof we assume ‖X_{i*}‖ ≥ 2α.
We will compare the i*-th rows of the LHS and RHS of (5.4.2). For preparation, we have
(P_Ω(M)X)_{i*} = (P_Ω(ZZ^⊤)X)_{i*} = (P_Ω(ZZ^⊤))_{i*} X . (5.4.5)
Then we have that
‖(P_Ω(ZZ^⊤))_{i*}‖₁ = ∑_{j∈S_{i*}} |⟨Z_{i*}, Z_j⟩| ≤ ∑_{j∈S_{i*}} ‖Z_{i*}‖‖Z_j‖ ≤ |S_{i*}|·µ²r/d (by incoherence of Z)
≤ 2µ²rp . (by |S_{i*}| ≤ 2pd)
Therefore we can bound the ℓ₂ norm of the LHS of the first-order optimality condition (5.4.2) by
‖(P_Ω(ZZ^⊤)X)_{i*}‖ ≤ ‖(P_Ω(ZZ^⊤))_{i*}‖₁ · ‖X^⊤‖_{1→2}
≤ 2µ²rp ‖X‖_{2→∞} (by ‖X‖_{2→∞} = ‖X^⊤‖_{1→2})
= 2µ²rp ‖X_{i*}‖ . (5.4.6)
Next we lower-bound the norm of the RHS of equation (5.4.2). We have that
(P_Ω(XX^⊤)X)_{i*} = ∑_{j∈S_{i*}} ⟨X_{i*}, X_j⟩X_j = X_{i*} (∑_{j∈S_{i*}} X_j^⊤X_j) ,
which implies that
⟨(P_Ω(XX^⊤)X)_{i*}, X_{i*}⟩ = X_{i*} (∑_{j∈S_{i*}} X_j^⊤X_j) X_{i*}^⊤ ≥ 0 . (5.4.7)
Using Proposition 5.4.2 we obtain that
⟨(P_Ω(XX^⊤)X)_{i*}, (∇R(X))_{i*}⟩ = Γ_{i*i*} · X_{i*} (∑_{j∈S_{i*}} X_j^⊤X_j) X_{i*}^⊤ ≥ 0 . (5.4.8)
It follows that
‖(P_Ω(XX^⊤)X)_{i*} + λ(∇R(X))_{i*}‖ ≥ ‖λ(∇R(X))_{i*}‖ (by equation (5.4.8))
= 4λ(‖X_{i*}‖ − α)³ (by Proposition 5.4.2)
≥ (λ/2)·‖X_{i*}‖³ . (by the assumption ‖X_{i*}‖ ≥ 2α)
Plugging this and equation (5.4.6) into the first-order optimality condition (5.4.2), we obtain ‖X_{i*}‖ ≤ √(8µ²rp/λ), which completes the proof.
Next, we prove a property implied by the first-order optimality condition, which is similar to Lemma 5.3.10.
Lemma 5.4.5. In the setting of Theorem 5.4.3, with high probability over the choice of Ω, for any X that satisfies the first-order optimality condition (5.4.2), we have
‖X‖²_F ≤ 2rσ_max(Z)² . (5.4.9)
Moreover, we have
σ_max(X) ≤ 2σ_max(Z)·r^{1/6} , (5.4.10)
and
‖ZZ^⊤X − XX^⊤X − γ∇R(X)‖_F ≤ O(δ) , (5.4.11)
where δ = O(µ³κ³r² log^{0.75}(d)·σ_max(Z)^{−3}·(dp)^{−1/2}) and γ = λ/(2p) ≥ 0.
Proof. If ‖X‖_F ≤ √(2r)·σ_max(Z) we are done, so assume ‖X‖_F ≥ √(2r)·σ_max(Z). By Lemma 5.4.4, we have that max_i ‖X_i‖ ≤ 4α = 4µκr/√d, and therefore max_i ‖X_i‖ ≤ ν‖X‖_F/√d with ν = O(µκ√r/σ_max(Z)). Then by Theorem 5.7.2, we have that
‖P_Ω(ZZ^⊤)X − pZZ^⊤X‖_F ≤ pδ
and
‖P_Ω(XX^⊤)X − pXX^⊤X‖_F ≤ pδ ,
where δ = O(µ³κ³r² log^{0.75}(d)·σ_max(Z)^{−3}·(dp)^{−1/2}). These two bounds, together with the first-order optimality condition (5.4.2), imply equation (5.4.11). Moreover, we have
p‖ZZ^⊤X‖_F = ‖P_Ω(ZZ^⊤)X‖_F ± pδ
= ‖P_Ω(XX^⊤)X + λ∇R(X)‖_F ± pδ (by equation (5.4.2))
≥ ‖P_Ω(XX^⊤)X‖_F ± pδ (by equation (5.4.8))
≥ p‖XX^⊤X‖_F ± 2pδ . (5.4.12)
Suppose X has singular values σ₁ ≥ · · · ≥ σ_r. Then ‖ZZ^⊤X‖²_F ≤ ‖ZZ^⊤‖²‖X‖²_F ≤ σ_max(Z)⁴‖X‖²_F = σ_max(Z)⁴(σ₁² + · · · + σ_r²). On the other hand, ‖XX^⊤X‖²_F = σ₁⁶ + · · · + σ_r⁶. Therefore, equation (5.4.12) implies that
(1 + O(δ))·σ_max(Z)⁴ ∑_{i=1}^r σ_i² ≥ ∑_{i=1}^r σ_i⁶ ,
and the conclusion then follows from Proposition 8.3.4.
Now we turn to the second-order optimality condition, which implies that the smallest singular value of X is large (similar to Lemma 5.3.8). Note that this lemma remains true even if X only satisfies a relaxed second-order optimality condition with τ = 0.01pσ_min(Z).
Lemma 5.4.6. In the setting of Theorem 5.4.3, with high probability over the choice of Ω, suppose X satisfies equations (5.4.9) and (5.4.4) and the second-order optimality condition (5.4.3). Then
σ_min(X) ≥ (1/4)·σ_min(Z) . (5.4.13)
Proof. Let J = {i : ‖X_i‖ ≤ α}, and let v ∈ R^r be such that ‖Xv‖ = σ_min(X). Let Z_J be the matrix that has the same i-th row as Z for every i ∈ J and 0 elsewhere. Since Z_J has column rank at most r, by the variational characterization of singular values there exists a unit vector z_J ∈ col-span(Z_J) such that ‖X^⊤z_J‖ ≤ σ_min(X).
We claim that σ_min(Z_J) ≥ (1/2)·σ_min(Z). Let L = [d] − J. Since ‖X_i‖ ≥ α for any i ∈ L, we have |L|α² ≤ ‖X‖²_F ≤ 2rσ_max(Z)² (by equation (5.4.9)), and it follows that |L| ≤ 2rσ_max(Z)²/α². Therefore,
σ_min(Z_J) ≥ σ_min(Z) − σ_max(Z_L) ≥ σ_min(Z) − ‖Z_L‖_F
≥ σ_min(Z) − √(|L|rµ²/d) ≥ σ_min(Z) − √(2r²σ_max(Z)²µ²/(α²d))
≥ (1/2)·σ_min(Z) . (by α ≥ rκµ/√d)
Since z_J ∈ col-span(Z_J) is a unit vector, z_J can be written as z_J = Z_Jβ where ‖β‖ ≤ 1/σ_min(Z_J) ≤ O(1/σ_min(Z)). This in turn implies that ‖z_J‖_∞ ≤ ‖Z_J‖_{2→∞}‖β‖ ≤ O(µ√(r/d)/σ_min(Z)) ≤ O(µκ√(r/d)).
We plug V = z_Jv^⊤ into the second-order optimality condition (5.4.3). Since z_J ∈ col-span(Z_J), it is supported on the subset J, and therefore ∇²R(X)[V] = 0, so the regularization term in (5.4.3) vanishes. For simplicity, let y = X^⊤z_J and w = Xv. Taking V = z_Jv^⊤ in equation (5.4.3) yields
‖P_Ω(wz_J^⊤ + z_Jw^⊤)‖²_F ≥ 2⟨P_Ω(ZZ^⊤ − XX^⊤), z_Jz_J^⊤⟩ .
Note that we have ‖w‖_∞ ≤ ‖X‖_{2→∞}‖v‖ ≤ µ√(r/d). Recalling that ‖z_J‖_∞ ≤ O(µκ√(r/d)), by Theorem 5.7.1 we have that
p‖wz_J^⊤ + z_Jw^⊤‖²_F ≥ 2p⟨ZZ^⊤ − XX^⊤, z_Jz_J^⊤⟩ − δp ,
where δ = O(µ²κr²(pd)^{−1/2}). Then simple algebraic manipulation gives that
⟨w, z_J⟩² + ‖w‖²‖z_J‖² + ‖X^⊤z_J‖² ≥ ‖Z^⊤z_J‖² − δ/2 . (5.4.14)
Note that ⟨w, z_J⟩ = ⟨v, X^⊤z_J⟩ = ⟨y, v⟩. Recall that ‖z_J‖ = 1 and z_J ∈ col-span(Z_J), and therefore ‖Z^⊤z_J‖ = ‖Z_J^⊤z_J‖ ≥ σ_min(Z_J). Moreover, recall that ‖y‖ = ‖X^⊤z_J‖ ≤ σ_min(X). Using these, we can bound the LHS of equation (5.4.14):
⟨w, z_J⟩² + ‖w‖²‖z_J‖² + ‖X^⊤z_J‖² ≤ ⟨y, v⟩² + ‖w‖² + ‖y‖²
≤ 2‖y‖² + σ²_min(X) (by Cauchy–Schwarz and ‖w‖ = σ_min(X))
≤ 3σ²_min(X) . (by ‖y‖ ≤ σ_min(X))
Together with equation (5.4.14) and ‖Z^⊤z_J‖ ≥ σ_min(Z_J), this gives
σ_min(X) ≥ (1/2 − O(δ))·σ_min(Z_J) . (5.4.15)
Combining equation (5.4.15) with the lower bound on σ_min(Z_J), we complete the proof.
As before, we show that it is possible to remove the regularizer term; again the intuition is that the regularizer always points in the same direction as X.
Lemma 5.4.7. Suppose X satisfies equations (5.4.4), (5.4.13), and (5.4.10). Then for any γ ≥ 0,
‖ZZ^⊤X − XX^⊤X‖²_F ≤ ‖ZZ^⊤X − XX^⊤X − γ∇R(X)‖²_F . (5.4.16)
Proof. Let L = {i : ‖X_i‖ ≥ α}. For i ∉ L, we have that (∇R(X))_i = 0. Therefore it suffices to prove that for every i ∈ L,
‖Z_iZ^⊤X − X_iX^⊤X‖² ≤ ‖Z_iZ^⊤X − X_iX^⊤X − γ(∇R(X))_i‖² ,
and for this it suffices to prove that
⟨(∇R(X))_i, X_iX^⊤X − Z_iZ^⊤X⟩ ≥ 0 . (5.4.17)
By Proposition 5.4.2, we have (∇R(X))_i = Γ_ii X_i for some Γ_ii ≥ 0. Then
⟨(∇R(X))_i, X_iX^⊤X⟩ = Γ_ii⟨X_i, X_iX^⊤X⟩ ≥ Γ_ii‖X_i‖²·σ_min(X^⊤X)
≥ (1/4)·Γ_ii‖X_i‖²·σ_min(Z)² . (by equation (5.4.13))
On the other hand, we have
⟨(∇R(X))_i, Z_iZ^⊤X⟩ = Γ_ii⟨X_i, Z_iZ^⊤X⟩ ≤ Γ_ii‖X_i‖‖Z_i‖·σ_max(Z^⊤X)
≤ Γ_ii‖X_i‖‖Z_i‖·σ_max(Z)σ_max(X)
≤ Γ_ii‖X_i‖‖Z_i‖·σ_max(Z)²r^{1/6} (by equation (5.4.10))
≤ (1/10)·Γ_ii‖X_i‖²·σ_min(Z)²r^{−1/3} . (by ‖X_i‖ ≥ α ≥ 10√r·κ²‖Z_i‖)
Combining the two inequalities above we obtain equation (5.4.17), which completes the proof.
Finally we show that a small value of ‖ZZ^⊤X − XX^⊤X‖_F implies that ZZ^⊤ is close to XX^⊤ (this is similar to Lemma 5.3.6).
Lemma 5.4.8. Suppose X and Z satisfy σ_min(X) ≥ (1/4)·σ_min(Z) and
‖ZZ^⊤X − XX^⊤X‖_F ≤ δ ,
where δ ≤ σ³_min(Z)/C for a large enough constant C. Then
‖XX^⊤ − ZZ^⊤‖_F ≤ O(δκ²/σ_min(Z)) .
Proof. The proof is similar to the one-dimensional case: we separate Z into the component in the column span of X and the component in its orthogonal complement. We then show that the projection of Z onto the column span is close to X, and the projection onto the orthogonal complement must be small.
Let Z = U + V, where U = Proj_{span(X)} Z is the projection of Z onto the column span of X, and V is the projection onto the orthogonal complement. Since V^⊤X = 0, we know
ZZ^⊤X = (U + V)(U + V)^⊤X = UU^⊤X + V U^⊤X .
Here the columns of the first term UU^⊤X are in the column span of X, and the columns of the second term V U^⊤X are in the orthogonal complement. Therefore,
‖ZZ^⊤X − XX^⊤X‖²_F = ‖UU^⊤X − XX^⊤X‖²_F + ‖V U^⊤X‖²_F ≤ δ² .
In particular, both terms are bounded by δ². Therefore ‖UU^⊤ − XX^⊤‖²_F ≤ δ²/σ²_min(X) ≤ 16δ²/σ²_min(Z).
Also, we know σ_min(UU^⊤X) ≥ σ_min(XX^⊤X) − δ ≥ σ_min(Z)³/128 when δ ≤ σ_min(Z)³/128. Therefore σ_min(U^⊤X) is at least σ_min(Z)³/(128‖Z‖). Now ‖V‖²_F ≤ δ²/σ_min(U^⊤X)² ≤ O(δ²‖Z‖²/σ_min(Z)⁶).
Finally, we can bound ‖UV^⊤‖_F by ‖U‖‖V‖_F ≤ ‖Z‖‖V‖_F (the last inequality because U is a projection of Z), which is at least Ω(‖V‖²_F) when δ ≤ σ_min(Z)³/128; therefore
‖ZZ^⊤ − XX^⊤‖_F ≤ ‖UU^⊤ − XX^⊤‖_F + 2‖UV^⊤‖_F + ‖V V^⊤‖_F ≤ O(δ‖Z‖²/σ_min(Z)³) .
The last ingredient needed to prove the main theorem is a result of Sun and Luo [163], which shows that whenever XX^⊤ is close to ZZ^⊤, the objective function is essentially strongly convex, and the only points with zero gradient are points where XX^⊤ = ZZ^⊤; this is made precise in Lemma 5.6.1. Now we are ready to prove Theorem 5.4.3:
Proof of Theorem 5.4.3. Suppose X satisfies the first- and second-order optimality conditions. Then by Lemma 5.4.5 and Lemma 5.4.4, we have that X satisfies equations (5.4.4), (5.4.9), (5.4.10) and (5.4.11). Then by Lemma 5.4.6, we obtain that σ_min(X) ≥ (1/4)·σ_min(Z). Now by Lemma 5.4.7 and equation (5.4.11), we have that ‖ZZ^⊤X − XX^⊤X‖_F ≤ δ for δ ≤ cσ_min(Z)³/κ² with a sufficiently small constant c. Then by Lemma 5.4.8, we obtain that ‖ZZ^⊤ − XX^⊤‖_F ≤ cσ_min(Z)² for a sufficiently small constant c. By Lemma 5.6.1, in this region the only points that satisfy the first-order optimality condition must satisfy XX^⊤ = ZZ^⊤.
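Since the theorem guarantees that all local minima are global, simple local search should succeed from a random initialization. The toy experiment below is our own illustration (fully observed case, regularizer off, step size and dimensions our choices, not from the text): plain gradient descent on (1/2)‖M − XX^⊤‖²_F drives XX^⊤ to ZZ^⊤.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 30, 3
Z = rng.normal(size=(d, r)) / np.sqrt(d)
M = Z @ Z.T                                # ground truth, fully observed

X = rng.normal(size=(d, r)) / np.sqrt(d)   # random initialization
eta = 0.05                                 # step size (our choice)
for _ in range(20000):
    grad = 2.0 * (X @ X.T - M) @ X         # gradient of 1/2 ||M - X X^T||_F^2
    X -= eta * grad

recovery_error = np.linalg.norm(X @ X.T - M)
```

Note that X itself need not converge to Z (the factorization is only unique up to rotation), but XX^⊤ matches ZZ^⊤.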
Handling Noise To handle noise, note that we can only hope to obtain an approximate solution in the presence of noise, and that our lemmas depend only on concentration bounds, which still apply in the noisy setting. See Section 5.5 for details.
5.5 Handling Noise
Suppose that instead of observing the matrix ZZ^⊤, we actually observe a noisy version M = ZZ^⊤ + N, where N is a Gaussian matrix with independent N(0, σ²) entries. In this case we should not hope to recover ZZ^⊤ exactly (as two nearby Z's may generate the same observation). In this section we show that our arguments still hold even with fairly large noise.
Theorem 5.5.1. Let µ̄ = max{µ, (4σd√(log d)/r)^{1/2}}. Suppose p ≥ Cµ̄⁶κ¹²r⁴d⁻¹ε⁻² log^{1.5} d, where C is a large enough constant. Let α = 2µ̄κr/√d and λ ≥ µ̄²rp/α². Then with high probability over the randomness of Ω, any local minimum X of f(·) satisfies
‖XX^⊤ − ZZ^⊤‖_F ≤ ε .
In fact, a noise level with σ√(log d) as large as µ²r/d (that is, noise almost as large as the maximum possible entry) does not change the conclusions of the lemmas in this section.
Proof. There are only three places in the proof where the noise makes a difference: (1) the infinity-norm bound on M, used in Lemma 5.4.4; (2) the LHS of the first-order optimality condition (equation (5.4.2)); (3) the RHS of the second-order optimality condition (equation (5.4.3)).
What we require in these three steps is: (1) |M|_∞ should be at most µ̄²r/d; (2) the noise correlation should satisfy |⟨P_Ω(N), P_Ω(W)⟩| ≤ O(σ|W|_∞ dr log d + √(pd²rσ²|W|_∞‖W‖_F)·log d); (3) ‖P_Ω(N)‖ ≤ εp‖ZZ^⊤‖_F. With µ̄ = max{µ, (4σd√(log d)/r)^{1/2}}, all of these are satisfied (by Lemmas 5.7.5 and 5.7.6).
Following the proof, we then see δ ≤ cεσ_min(Z)/κ² for a small enough constant c, and by Lemma 5.4.8 we know ‖XX^⊤ − ZZ^⊤‖_F ≤ ε.
5.6 Finding the Exact Factorization
In Section 5.4, we showed that any point that satisfies the first- and second-order necessary conditions must satisfy ‖XX^⊤ − ZZ^⊤‖_F ≤ c for a small enough constant c. In this section we show that in fact XX^⊤ must be exactly equal to ZZ^⊤. The proof technique here is mostly based on the work of Sun and Luo [163]. However, we have to modify their proof because we use a slightly different regularizer, and we work in the symmetric case. The main lemma of [163] can be rephrased as follows in our setting:
Lemma 5.6.1 (Analog of Lemma 3.1 in [163]). Suppose p ≥ Cµ⁴r⁶κ⁴d⁻¹ log d for a large enough absolute constant C, and let ε = σ_min(Z)²/100. With high probability over the randomness of Ω, we have that for any point X in the set
B_ε = {X ∈ R^{d×r} : ‖XX^⊤ − ZZ^⊤‖_F ≤ ε, ‖X‖_{2→∞} ≤ 16µκr/√d} , (5.6.1)
there exists a matrix U such that UU^⊤ = ZZ^⊤ and
⟨∇f(X), X − U⟩ ≥ (p/4)·‖M − XX^⊤‖²_F .
As a consequence, any point X in the set B_ε that satisfies the first-order optimality condition must be a global optimum (or, equivalently, satisfy XX^⊤ = ZZ^⊤).
Recall that f(X) = (1/2)‖P_Ω(M − XX^⊤)‖²_F + λR(X). The proof of Lemma 5.6.1 consists of three steps:
1. The regularizer has nonnegative correlation with X − U: for any U such that UU^⊤ = ZZ^⊤, we have ⟨∇R(X), X − U⟩ ≥ 0 (Claim 5.6.3).
2. There exists a matrix U such that UU^⊤ = ZZ^⊤ and U is close to X (Claim 5.6.4).
3. Argue that ⟨∇f(X), X − U⟩ ≥ (p/4)‖P_Ω(M − XX^⊤)‖²_F when U is close to X (see the proof of Lemma 5.6.1).
Before going into details, the first useful observation is that all matrices U with UU^⊤ = ZZ^⊤ have the same row norms.
Claim 5.6.2. Suppose U, Z ∈ R^{d×r} satisfy UU^⊤ = ZZ^⊤. Then, for any i ∈ [d] we have ‖U_i‖ = ‖Z_i‖. Consequently, ‖U‖_F = ‖Z‖_F.
Proof. Suppose UU^⊤ = ZZ^⊤; then we have U = ZR where R is an orthonormal matrix. In particular, the i-th row of U is equal to U_i = Z_iR. Since the ℓ₂ norm (and the Frobenius norm) is preserved under multiplication by an orthonormal matrix, we know ‖U_i‖ = ‖Z_i‖. The Frobenius norm bound follows immediately.
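Claim 5.6.2 is easy to check numerically: multiplying Z on the right by any orthonormal R preserves both the Gram matrix and every row norm. A quick sketch (dimensions and construction are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 12, 3
Z = rng.normal(size=(d, r))

# random orthonormal r x r matrix via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
U = Z @ Q

same_gram = np.allclose(U @ U.T, Z @ Z.T)                # U U^T = Z Z^T
same_row_norms = np.allclose(np.linalg.norm(U, axis=1),
                             np.linalg.norm(Z, axis=1))  # ||U_i|| = ||Z_i||
```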
Note that this simple observation is only true in the symmetric case. This claim serves the same role as the bounds on the row norms of U, V in the asymmetric case (Propositions 4.1 and 4.2 of [163]).
Next we are ready to argue that the regularizer is always positively correlated
with X − U .
Claim 5.6.3. For any U such that UU^⊤ = ZZ^⊤, we have
⟨∇R(X), X − U⟩ ≥ 0 .
Proof. Since the regularizer is applied independently to individual rows, we can rewrite ⟨∇R(X), X − U⟩ = ∑_{i=1}^d ⟨(∇R(X))_i, X_i − U_i⟩ and focus on the i-th row.
For each row X_i, (∇R(X))_i is 0 when ‖X_i‖ ≤ 2µ√r/√d; in that case ⟨(∇R(X))_i, X_i − U_i⟩ = 0.
When ‖X_i‖ is larger than 2µ√r/√d, we know (∇R(X))_i is always in the same direction as X_i. In this case λ(∇R(X))_i = γX_i for some γ > 0, and ‖X_i‖ ≥ 2µ√r/√d ≥ 2‖Z_i‖ = 2‖U_i‖ (where the last equality is by Claim 5.6.2). Therefore, by Cauchy–Schwarz,
⟨X_i, X_i − U_i⟩ ≥ ‖X_i‖² − ‖X_i‖‖U_i‖ ≥ ‖X_i‖²/2 > 0 .
This then implies ⟨λ(∇R(X))_i, X_i − U_i⟩ = γ⟨X_i, X_i − U_i⟩ > 0.
Next we prove that the gradient of (1/2)‖P_Ω(M − XX^⊤)‖²_F has a large correlation with X − U. This is analogous to Proposition 4.2 in [163].
Claim 5.6.4. Suppose ‖XX^⊤ − M‖_F = ε ≤ σ_min(Z)²/100. Then there exists a matrix U such that UU^⊤ = M and ‖X − U‖_F ≤ 5ε√r/σ_min(Z)².
Proof. Without loss of generality we assume M is a diagonal matrix whose first r diagonal entries are σ₁(Z)², σ₂(Z)², . . . , σ_r(Z)² (this can be done by a change of basis); that is, M = diag(σ₁(Z)², . . . , σ_r(Z)², 0, . . . , 0). We use M′ to denote the first r × r principal submatrix of M.
We write X = [V; W], where V ∈ R^{r×r} contains the first r rows of X and W ∈ R^{(d−r)×r} contains the remaining rows. We similarly write U = [P; Q], where P and Q denote the first r rows and the remaining rows, respectively.
To construct U, we first notice that Q must be the zero matrix, since M is nonzero only in its top-left r × r block. A natural guess for P is then a "normalized" version of V.
Concretely, we construct P := V S = V (V^⊤(M′)^{−1}V)^{−1/2}, where S := (V^⊤(M′)^{−1}V)^{−1/2}. The difference between U and X then satisfies ‖U − X‖_F ≤ ‖P − V‖_F + ‖W‖_F.
Since ‖XX^⊤ − M‖_F ≤ ε, we know ‖M′ − V V^⊤‖²_F + 2‖V W^⊤‖²_F ≤ ε²; in particular, both terms are smaller than ε².
First, we bound ‖W‖_F. Note that since ‖M′ − V V^⊤‖_F ≤ ε ≤ σ_min(Z)²/100, we know σ_min(V)² ≥ 0.99σ_min(Z)², and therefore σ_min(V) ≥ 0.9σ_min(Z). Now we know ‖W‖_F ≤ ‖V W^⊤‖_F/σ_min(V) ≤ 2ε/σ_min(Z).
Next we bound ‖P − V‖_F. Since ‖M′ − V V^⊤‖_F ≤ ε ≤ σ_min(Z)²/100, we know
(1 − 2ε/σ_min(Z)²)·V V^⊤ ⪯ M′ ⪯ (1 + 2ε/σ_min(Z)²)·V V^⊤ .
This implies ‖V‖_F ≤ 1.1‖Z‖_F and (1 − 2ε/σ_min(Z)²)·I ⪯ V^⊤(M′)^{−1}V ⪯ (1 + 2ε/σ_min(Z)²)·I. Therefore the matrix S is also very close to the identity; in particular, ‖S − I‖ ≤ 2ε/σ_min(Z)². Now we know ‖P − V‖_F = ‖V(S − I)‖_F ≤ ‖V‖_F‖S − I‖ ≤ 3ε‖Z‖_F/σ_min(Z)². Using the fact that ‖Z‖_F = √r, we conclude ‖U − X‖_F ≤ ‖P − V‖_F + ‖W‖_F ≤ 5ε√r/σ_min(Z)².
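The construction of U in this proof can also be checked numerically: with S = (V^⊤(M′)^{−1}V)^{−1/2}, the matrix P = VS satisfies PP^⊤ = M′ whenever V is invertible. A small sketch (ours, with the inverse square root computed by eigendecomposition; the perturbation scale 0.01 is an arbitrary choice in the regime of the claim):

```python
import numpy as np

def inv_sqrt(A):
    # inverse square root of a symmetric positive definite matrix
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(3)
r = 4
Mp = np.diag(rng.uniform(1.0, 2.0, size=r))      # M' = diag(sigma_i(Z)^2)
# V: a small perturbation of M'^{1/2}, as in the setting of the claim
V = np.sqrt(Mp) + 0.01 * rng.normal(size=(r, r))

S = inv_sqrt(V.T @ np.linalg.inv(Mp) @ V)
P = V @ S
reconstructs = np.allclose(P @ P.T, Mp)           # P P^T = M' holds exactly
S_near_identity = np.linalg.norm(S - np.eye(r), 2) < 0.2
```

The identity PP^⊤ = V(V^⊤(M′)^{−1}V)^{−1}V^⊤ = M′ holds for any invertible V; closeness of S to I (and hence of P to V) is what uses the assumption that VV^⊤ ≈ M′.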
We can now combine this claim with a sampling lemma to show that ‖P_Ω((X − U)(X − U)^⊤)‖²_F is small:
Lemma 5.6.5. Under the same setting as Lemma 5.6.1, with probability at least 1 − 1/(2n)⁴ over the choice of Ω, if U satisfies the conclusion of Claim 5.6.4, then
‖P_Ω((X − U)(X − U)^⊤)‖²_F ≤ (p/25)·‖M − XX^⊤‖²_F .
Intuitively, this lemma is true because ‖(X − U)(X − U)^⊤‖_F ≤ 25‖M − XX^⊤‖²_F · r/σ_min(Z)⁴, which is much smaller than ‖M − XX^⊤‖_F when ‖M − XX^⊤‖_F is small. By concentration inequalities we expect ‖P_Ω((X − U)(X − U)^⊤)‖²_F to be roughly equal to p‖(X − U)(X − U)^⊤‖²_F, and therefore much smaller than p‖M − XX^⊤‖²_F. The proof of this lemma is exactly the same as that of Proposition 4.3 in [163] (in fact, it is directly implied by Proposition 4.3), so we omit it here.
We also need a different concentration bound, for the norm of the projection of the matrix a = U(X − U)^⊤ + (X − U)U^⊤. Unlike the previous lemma, here we want ‖P_Ω(a)‖_F to be large.
Lemma 5.6.6. Under the same setting as Lemma 5.6.1, let a = U(X − U)^⊤ + (X − U)U^⊤, where U is constructed as in Claim 5.6.4. Then, with high probability, we have that for any X ∈ B_ε,
‖P_Ω(a)‖²_F ≥ (5p/6)·‖a‖²_F .
Intuitively this should be true because a lies in the tangent space {UW^⊤ + W′U^⊤ : W, W′ ∈ R^{d×r}}, which has dimension O(dr). The proof follows from Theorem 3.4 of [145], and is written out in detail in Equations (37)–(41) of [163].
Finally we are ready to prove the main lemma. The proof is the same as the outline given in Section 4.1 of [163]; we give it here for completeness.
Proof of Lemma 5.6.1. Note that f(X) is equal to h(X) + λR(X), where h(X) = (1/2)‖P_Ω(M − XX^⊤)‖²_F and R(X) is the regularizer. By Claim 5.6.3 we know ⟨∇R(X), X − U⟩ ≥ 0, so we only need to prove that there exists a U such that UU^⊤ = ZZ^⊤ and ⟨∇h(X), X − U⟩ ≥ (p/4)‖M − XX^⊤‖²_F.
Define a = U(X − U)^⊤ + (X − U)U^⊤ and b = (U − X)(U − X)^⊤; then XX^⊤ − M = a + b and (X − U)X^⊤ + X(X − U)^⊤ = a + 2b.
Now
⟨∇h(X), X − U⟩ = 2⟨P_Ω(XX^⊤ − M)X, X − U⟩
= ⟨P_Ω(XX^⊤ − M), (X − U)X^⊤ + X(X − U)^⊤⟩
= ⟨P_Ω(a + b), P_Ω(a + 2b)⟩
= ‖P_Ω(a)‖²_F + 2‖P_Ω(b)‖²_F + 3⟨P_Ω(a), P_Ω(b)⟩
≥ ‖P_Ω(a)‖²_F + 2‖P_Ω(b)‖²_F − 3‖P_Ω(a)‖_F‖P_Ω(b)‖_F .
Let ε = ‖M − XX^⊤‖_F. Note that from Claim 5.6.4 and Lemma 5.6.5, we know
‖b‖_F ≤ ε/10 , ‖P_Ω(b)‖_F ≤ √p·ε/5 .
Therefore, as long as we can show that ‖P_Ω(a)‖_F is large, we are done. This is true because ‖a‖_F ≥ ‖M − XX^⊤‖_F − ‖b‖_F ≥ 9ε/10. Hence by Lemma 5.6.6 we know
‖P_Ω(a)‖²_F ≥ (5p/6)·‖a‖²_F ≥ (27/40)·pε² .
Combining the bounds for ‖P_Ω(a)‖_F and ‖P_Ω(b)‖_F, we know ⟨∇h(X), X − U⟩ ≥ (p/4)‖M − XX^⊤‖²_F. Together with the fact that ⟨∇R(X), X − U⟩ ≥ 0, we know
⟨∇f(X), X − U⟩ ≥ (p/4)·‖M − XX^⊤‖²_F .
5.7 Concentration Inequalities
In this section we prove the concentration inequalities used in the main part of the chapter. We first show that the inner product of two low-rank matrices is approximately preserved after restricting to the observed entries. This is mostly used in the arguments about the second-order necessary conditions.
Theorem 5.7.1. With high probability over the choice of Ω, for any two rank-r matrices W, Z ∈ R^{d×d}, we have
|⟨P_Ω(W), P_Ω(Z)⟩ − p⟨W, Z⟩| ≤ O(|W|_∞|Z|_∞ dr log d + √(pdr·|W|_∞|Z|_∞‖W‖_F‖Z‖_F)·log d) .
Proof. Since both the LHS and the RHS are bilinear in W and Z, without loss of generality we assume that the Frobenius norms of W and Z are both equal to 1. Note that in this case we should expect |W|_∞ ≥ 1/d.
Let δ_{i,j} be the indicator variable for Ω. We know
⟨P_Ω(W), P_Ω(Z)⟩ = ∑_{i,j} δ_{i,j}W_{i,j}Z_{i,j} ,
and in expectation it is equal to p⟨W, Z⟩. Let Q = ∑_{i,j}(δ_{i,j} − p)·W_{i,j}Z_{i,j}. We can then view Q as a sum of independent terms (note that δ_{i,j} = δ_{j,i}, but we can simply merge the two terms and the variance is at most a factor 2 larger). The expectation is E[Q] = 0, each term in the sum is bounded by |W|_∞|Z|_∞, and the variance is bounded by
V[Q] ≤ p∑_{i,j}(W_{i,j}Z_{i,j})² ≤ p·max_{i,j}|W_{i,j}|²·∑_{i,j}Z²_{i,j} ≤ p|W|²_∞ .
Similarly, we also know V[Q] ≤ p|Z|²_∞, and hence V[Q] ≤ p|W|_∞|Z|_∞.
Now we can apply Bernstein's inequality: with probability at most η,
|Q − E[Q]| ≥ O(|W|_∞|Z|_∞ log(1/η) + √(p|W|_∞|Z|_∞ log(1/η))) .
By Proposition 8.3.3, there is a set Γ of size d^{O(dr)} such that for any rank-r matrix X there is a matrix X̄ ∈ Γ with ‖X − X̄‖_F ≤ 1/d³. When W and Z come from this set, we can set η = d^{−Cdr} for a large enough constant C. By a union bound, with high probability
|Q − E[Q]| ≤ O(|W|_∞|Z|_∞ dr log d + √(pdr·|W|_∞|Z|_∞)·log d) .
When W and Z are not from the set Γ, let W̄ and Z̄ be the closest matrices in Γ; then we know |⟨P_Ω(W), P_Ω(Z)⟩ − p⟨W, Z⟩ − (⟨P_Ω(W̄), P_Ω(Z̄)⟩ − p⟨W̄, Z̄⟩)| ≤ O(1/d³) ≤ |W|_∞|Z|_∞ dr log d. Therefore we still have
|⟨P_Ω(W), P_Ω(Z)⟩ − p⟨W, Z⟩| ≤ O(|W|_∞|Z|_∞ dr log d + √(pdr·|W|_∞|Z|_∞‖W‖_F‖Z‖_F)·log d) .
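The statement of Theorem 5.7.1 is easy to observe in simulation: for two fixed low-rank matrices, ⟨P_Ω(W), P_Ω(Z)⟩ concentrates around p⟨W, Z⟩. A quick Monte Carlo sketch (the dimensions and sampling rate are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, p = 200, 2, 0.3
W = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / d   # rank-r matrix
Zm = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / d  # rank-r matrix

mask = rng.random((d, d)) < p                  # each entry observed with probability p
sampled = np.sum((mask * W) * (mask * Zm))     # <P_Omega(W), P_Omega(Z)>
expected = p * np.sum(W * Zm)                  # p <W, Z>
rel_error = abs(sampled - expected) / (p * np.linalg.norm(W) * np.linalg.norm(Zm))
```

For these parameters the relative error (measured against p‖W‖_F‖Z‖_F, matching the bilinear normalization in the proof) is on the order of a percent.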
The next theorem shows that P_Ω(XX^⊤)X is roughly equal to pXX^⊤X; this is one of the major terms in the gradient.
Theorem 5.7.2. When p ≥ Cν⁶r log² d/(dε²) for a large enough constant C, with high probability over the randomness of Ω, for any matrix X ∈ R^{d×r} such that ‖X_i‖ ≤ ν·√(1/d)·‖X‖_F for every i, we have
‖P_Ω(XX^⊤)X − pXX^⊤X‖_F ≤ pε‖X‖³_F . (5.7.1)
Proof. Without loss of generality we assume ‖X‖_F = 1. Let δ_{i,j} be the indicator variable for Ω. We first prove the result when the δ_{i,j} are independent, and then use standard techniques to show that the same argument works for δ_{i,j} = δ_{j,i}.
Note that
[P_Ω(XX^⊤)X]_i = ∑_j δ_{i,j}⟨X_i, X_j⟩X_j ,
whose expectation is equal to
[pXX^⊤X]_i = p∑_j ⟨X_i, X_j⟩X_j .
We know ‖X_i‖ ≤ ν√(1/d), therefore each term is bounded by ν³(1/d)^{3/2}. Let Z_i be the random variable ‖[P_Ω(XX^⊤)X]_i − [pXX^⊤X]_i‖²; it is easy to see that E[Z_i] ≤ pd·ν⁶(1/d)³ = pν⁶/d², and the variance satisfies V[Z_i] = E[Z_i²] − E[Z_i]² ≤ pd·ν¹²(1/d)⁶ + 2E[Z_i]² ≤ 3E[Z_i]² (as long as p > 1/d). Our goal now is to prove that ∑_{i=1}^d Z_i ≤ p²ε² for all X.
Let Z̄_i be a truncated version of Z_i. That is, Z̄_i = Z_i when Z_i ≤ [2pd·ν³(1/d)^{3/2}]², and Z̄_i = [2pd·ν³(1/d)^{3/2}]² otherwise. It is not hard to see that Z̄_i has smaller mean and variance than Z_i. Also, by the vector Bernstein inequality, we know
Pr[√Z_i ≥ t] ≤ d·exp(−t²/(pν⁶/d² + t·ν³/d^{3/2})) .
Notice that for the truncated variable this bound is only relevant when t ≤ O(pν³d^{−1/2}) (because otherwise the probability is 0), and in that regime the variance term always dominates. Therefore √Z̄_i is a subgaussian random variable.
By Bernstein's inequality, we know the moments of √Z̄_i − E[√Z̄_i] are dominated by those of a Gaussian distribution with variance O(E[Z_i] log d).
Now we can use the concentration bound for quadratics of subgaussian random variables [89]: with probability at least 1 − exp(−t),
∑_{i=1}^d Z̄_i ≤ O(E[Z_i]·(log d)·(d + 2√(dt) + 2t)) .
This means that with probability at least 1 − exp(−Cdr log d) for some large constant C, we have ∑_{i=1}^d Z̄_i ≤ O(pν⁶r log² d/d). This failure probability is low enough for us to union bound over all X in a standard ε-net in which every X is within distance (ε/d)⁶ of a net point. Therefore we know that with high probability, for all X in the ε-net, ∑_{i=1}^d Z̄_i ≤ O(pν⁶r log² d/d), which is smaller than p²ε² when p ≥ Cν⁶r log² d/(dε²) for a large enough constant C.
For any X that is not in the ε-net, let X̄ be the closest point to X in the net; then ‖X − X̄‖_F ≤ (ε/d)⁶, and the bound for X clearly follows from the bound for X̄.
Now, to convert the sum of the Z̄_i into the sum of the Z_i, notice that with high probability there are at most 2pd entries of Ω in every row. When that happens, Z_i is always bounded by [2pd·ν³(1/d)^{3/2}]², so Z̄_i = Z_i. Let event 1 be that ∑_{i=1}^d Z̄_i ≤ p²ε² for all X, and let event 2 be that there are at most 2pd entries per row. With high probability both events happen, and in that case ∑_{i=1}^d Z_i ≤ p²ε² for all X.
Handling δ_{i,j} = δ_{j,i} First notice that the diagonal entries δ_{i,i} cannot change the Frobenius norm by more than O(ν³(1/d)^{3/2}·√d) ≤ pε, so we can ignore the diagonal terms. Now, for independent terms δ_{i,j}, let γ_{j,i} = δ_{i,j}; then by a union bound both δ_{i,j} and γ_{i,j} satisfy the inequality, and by the triangle inequality (δ_{i,j} + γ_{i,j})/2 also satisfies it. Let τ_{i,j} be the true indicator of Ω (hence τ_{i,j} = τ_{j,i}), and let τ′_{i,j} be an independent copy. We know that (τ_{i,j} + τ′_{i,j})/2 has the same distribution as (δ_{i,j} + γ_{i,j})/2 (for off-diagonal entries), therefore with high probability the inequality is true for (τ_{i,j} + τ′_{i,j})/2. The theorem then follows from the standard decoupling claim below (note that sup_{‖X‖_F=1} ‖P_Ω(XX^⊤)X − pXX^⊤X‖_F is a norm in the indicator variables of Ω):
Claim 5.7.3. Let X, Y be two i.i.d. random variables taking values in a normed space. Then
Pr[‖X‖ ≥ t] ≤ 3·Pr[‖X + Y‖ ≥ 2t/3] .
Proof. Let X, Y, Z be i.i.d. random variables. Then
Pr[‖X‖ ≥ t] = Pr[‖(X + Y) + (X + Z) − (Y + Z)‖ ≥ 2t]
≤ Pr[‖X + Y‖ ≥ 2t/3] + Pr[‖X + Z‖ ≥ 2t/3] + Pr[‖Y + Z‖ ≥ 2t/3]
= 3·Pr[‖X + Y‖ ≥ 2t/3] .
Finally, we argue that randomly sampling the entries of a matrix gives a good spectral approximation. This is a standard lemma used in arguing about the P_Ω(M)X term in the gradient P_Ω(M − XX^⊤)X.
Lemma 5.7.4. Suppose W ∈ R^{d×d} satisfies |W|_∞ ≤ (ν/d)·‖W‖_F. Then with high probability (at least 1 − d^{−10}) over the choice of Ω,
‖P_Ω(W) − pW‖ ≤ εp‖W‖_F ,
where ε = O(ν√(log d/(pd))).
Proof. Without loss of generality we assume ‖W‖_F = 1. The proof follows simply from an application of the matrix Bernstein inequality. We view P_Ω(W) as
P_Ω(W) = ∑_{(i,j)∈[d]²} s_{ij}W_{ij}Δ_{ij} ,
where Δ_{ij} ∈ R^{d×d} is the indicator matrix for entry (i, j), and the s_{ij} ∈ {0, 1} are independent Bernoulli variables with probability p of being 1. Then we have that E[P_Ω(W)] = pW and ‖s_{ij}W_{ij}Δ_{ij}‖ ≤ (ν/d)·‖W‖_F. Moreover, we compute the variance by
∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{ij}^⊤Δ_{ij}] = ∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{jj}] = ∑_{j∈[d]} p(∑_{i∈[d]} W²_{ij})·Δ_{jj} . (5.7.2)
Therefore
‖∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{ij}^⊤Δ_{ij}]‖ ≤ pν²/d .
Similarly we can control ‖∑_{(i,j)∈[d]²} E[s_{ij}W²_{ij}Δ_{ij}Δ_{ij}^⊤]‖ by pν²/d (again, although δ_{i,j} = δ_{j,i}, the bounds here are correct up to constant factors). It then follows from the non-commutative Bernstein inequality [91] that
Pr_Ω[‖P_Ω(W) − pW‖ ≥ εp] ≤ d·exp(−Ω(ε²pd/ν²)) .
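Lemma 5.7.4 can likewise be observed numerically: for a "flat" matrix W (every entry of magnitude ‖W‖_F/d, i.e. ν = 1), the spectral-norm deviation ‖P_Ω(W) − pW‖ is a small fraction of p‖W‖_F. A quick sketch (our own parameters; we sample entries independently rather than symmetrically, which only changes constants):

```python
import numpy as np

rng = np.random.default_rng(5)
d, p = 300, 0.2
W = rng.choice([-1.0, 1.0], size=(d, d)) / d     # flat: |W|_inf = ||W||_F / d, so nu = 1

mask = rng.random((d, d)) < p                    # independent Bernoulli(p) sampling
deviation = np.linalg.norm(mask * W - p * W, ord=2)  # spectral norm of the error
budget = p * np.linalg.norm(W)                       # p * ||W||_F; lemma gives eps < 1 here
```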
Concentration Lemmas for Noise Matrix N Next we state the concentration lemmas that are necessary when the observed matrix is perturbed by Gaussian noise. The proofs of these lemmas are exactly the same as (in fact, even simpler than) those of the corresponding results we have just proven. The first lemma is used in the same setting as Theorem 5.7.1.
Lemma 5.7.5. Let N be a random matrix with independent Gaussian entries N(0, σ²). With high probability over the support Ω and the Gaussian matrix N, for any low-rank matrix W we have
|⟨P_Ω(N), P_Ω(W)⟩| ≤ O(σ|W|_∞ dr log d + √(pd²rσ²|W|_∞‖W‖_F)·log d) .
Proof. The proof is exactly the same as that of Theorem 5.7.1, since ⟨P_Ω(N), P_Ω(W)⟩ is a sum of independent terms to which the same Bernstein inequality applies.
Next we show that randomly sampling the entries of a Gaussian matrix gives a matrix with low spectral norm.
Lemma 5.7.6. Let N be a random matrix with independent Gaussian entries N(0, σ²). With high probability over the choice of Ω and N, we have
‖P_Ω(N)‖ ≤ εpσd ,
where ε = O(√(log d/(pd))).
Proof. Again, the proof follows from the same argument as Lemma 5.7.4.
5.8 Conclusions
Although the matrix completion objective is non-convex, we showed that the objective function has very nice properties ensuring that all local minima are also global. This property gives guarantees for many basic optimization algorithms. An important open problem is the robustness of this property under different model assumptions: Can we extend the result to handle asymmetric matrix completion? Is it possible to add weights to different entries (similar to the settings studied in [113])? Can we replace the objective function with a different distance measure rather than the Frobenius norm (which is related to work on 1-bit matrix sensing [55])? We hope this framework of analyzing the geometry of objective functions can be applied to other problems.
Chapter 6
Learning Linear Dynamical
Systems
In this chapter, we use a non-convex optimization approach to attack the problem of learning linear dynamical systems. We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear system identification has been studied for many decades; yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.
6.1 Introduction
Many learning problems are by their nature sequence problems, where the goal is to fit a model that maps a sequence of inputs x₁, . . . , x_T to a corresponding sequence of observations y₁, . . . , y_T. Text translation, speech recognition, time series prediction, video captioning and question answering systems, to name a few, are all sequence-to-sequence learning problems. For a sequence model to be both expressive and parsimonious in its parameterization, it is crucial to equip the model with memory, thus allowing its prediction at time t to depend on previously seen inputs.
Recurrent neural networks form an expressive class of non-linear sequence models. Through their many variants, such as long short-term memory networks [84], recurrent neural networks have seen remarkable empirical success in a broad range of domains. At their core, neural networks are typically trained using some form of (stochastic) gradient descent. Even though the training objective is non-convex, it is widely observed in practice that gradient descent quickly approaches a good set of model parameters. Understanding the effectiveness of gradient descent for non-convex objectives on theoretical grounds is a major open problem in this area.
If we remove all non-linear state transitions from a recurrent neural network,
we are left with the state transition representation of a linear dynamical system.
Notwithstanding, the natural training objective for linear systems remains non-convex
due to the composition of multiple linear operators in the system. If there is any hope
of eventually understanding recurrent neural networks, it will be inevitable to develop
a solid understanding of this special case first.
To be sure, linear dynamical systems are important in their own right and have been studied for many decades independently of machine learning within the control theory community. Control theory provides a rich set of techniques for identifying and manipulating linear systems. The learning problem in this context corresponds to "linear dynamical system identification". Maximum likelihood estimation with gradient descent is a popular heuristic for dynamical system identification [114]. In the context of machine learning, linear systems play an important role in numerous tasks. For example, their estimation arises as a subroutine of reinforcement learning in robotics [108], location and mapping estimation in robotic systems [59], and estimation of pose from video [143].
In this work, we show that gradient descent efficiently minimizes the maximum
likelihood objective of an unknown linear system given noisy observations generated
by the system. More formally, we receive noisy observations generated by the following
time-invariant linear system:
ht+1 = Aht +Bxt (6.1.1)
yt = Cht +Dxt + ξt
Here, A, B, C, D are linear transformations with compatible dimensions, and we denote by Θ = (A, B, C, D) the parameters of the system. The vector h_t represents the hidden state of the model at time t. Its dimension n is called the order of the system. The stochastic noise variables ξ_t perturb the output of the system, which is why the model is called an output error model in control theory. We assume the variables are drawn i.i.d. from a distribution with mean 0 and variance σ².
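To make the dynamics concrete, here is a minimal NumPy simulation of (6.1.1); the matrices and the `simulate` helper are illustrative choices for this sketch, not objects defined in the text.

```python
import numpy as np

def simulate(A, B, C, D, xs, h0, noise_std=0.0, rng=None):
    """Roll out the system (6.1.1): h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t + xi_t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h = h0.astype(float)
    ys = []
    for x in xs:
        xi = noise_std * rng.standard_normal(C.shape[0])
        ys.append(C @ h + D @ x + xi)   # observation at time t
        h = A @ h + B @ x               # state transition to time t+1
    return np.array(ys)

# A toy order-2 system (arbitrary matrices for illustration).
A = np.array([[0.0, 1.0], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.5]])
xs = [np.array([1.0])] + [np.array([0.0])] * 3   # an impulse input
ys = simulate(A, B, C, D, xs, h0=np.zeros(2))
```

On the impulse input above, the noiseless outputs read off D, CB, CAB, CA²B in order, matching the impulse-response view used later in the chapter.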
Throughout the paper we focus on controllable and externally stable systems.
A linear system is externally stable (or equivalently bounded-input bounded-output
stable) if and only if the spectral radius of A, denoted ρ(A), is strictly bounded by 1.
Controllability is a mild non-degeneracy assumption that we formally define later.
Without loss of generality, we further assume that the transformations B, C and D
have bounded Frobenius norm. This can be achieved by a rescaling of the output
variables. We assume we have N pairs of sequences (x, y) as training examples,

S = { (x⁽¹⁾, y⁽¹⁾), . . . , (x⁽ᴺ⁾, y⁽ᴺ⁾) } .

Each input sequence x ∈ ℝᵀ of length T is drawn from a distribution and y is the
corresponding output of the system above generated from an unknown initial state
h. We allow the unknown initial state to vary from one input sequence to the next.
This only makes the learning problem more challenging.
Our goal is to fit a linear system to the observations. We parameterize our model in exactly the same way as (6.1.1). That is, for linear mappings (Â, B̂, Ĉ, D̂), the trained model is defined as:

ĥ_{t+1} = Â ĥ_t + B̂ x_t ,   ŷ_t = Ĉ ĥ_t + D̂ x_t        (6.1.2)
The (population) risk of the model is obtained by feeding the learned system with the correct initial states and comparing its predictions with the ground truth in expectation over inputs and errors. Denoting by ŷ_t the t-th prediction of the trained model starting from the correct initial state that generated y_t, and using Θ̂ as a shorthand for (Â, B̂, Ĉ, D̂), we formally define the population risk as:

f(Θ̂) = E_{x_t, ξ_t} [ (1/T) Σ_{t=1}^{T} ‖ŷ_t − y_t‖² ]        (6.1.3)
Note that even though the prediction ŷ_t is generated from the correct initial state, the learning algorithm does not have access to the correct initial state for its training sequences.
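The population risk (6.1.3) can be approximated by plain Monte Carlo: draw inputs and an initial state, run the true system (with noise) and the candidate system (noiselessly, from the same correct initial state), and average the squared error. The sketch below does this for an illustrative order-1 scalar system; all helper names and parameter values are assumptions for the example. At the true parameters the estimate reduces to the noise variance σ².

```python
import numpy as np

def rollout(A, B, C, D, xs, h0):
    """Noiseless rollout of a scalar (order-1) linear system."""
    h, ys = h0, []
    for x in xs:
        ys.append(C * h + D * x)
        h = A * h + B * x
    return np.array(ys)

def estimate_risk(theta, theta_hat, T=50, trials=2000, sigma=0.1, seed=0):
    """Monte Carlo estimate of the population risk (6.1.3)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        xs = rng.standard_normal(T)
        h0 = rng.standard_normal()
        y = rollout(*theta, xs, h0) + sigma * rng.standard_normal(T)  # noisy truth
        y_hat = rollout(*theta_hat, xs, h0)  # prediction from the correct h0
        total += np.mean((y - y_hat) ** 2)
    return total / trials

theta = (0.8, 1.0, 1.0, 0.0)   # true (A, B, C, D) as scalars, chosen stable
risk_at_truth = estimate_risk(theta, theta, sigma=0.1)
```

At the truth the residual is exactly the observation noise, so `risk_at_truth` concentrates around σ² = 0.01; any misspecified system incurs strictly larger risk.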
While the squared loss objective turns out to be non-convex, it has many ap-
pealing properties. Assuming the inputs xt and errors ξt are drawn independently
from a Gaussian distribution, the corresponding population objective corresponds to
maximum likelihood estimation. In this work, we make the weaker assumption that
the inputs and errors are drawn independently from possibly different distributions.
The independence assumption is certainly idealized for some learning applications.
However, in control applications the inputs can often be chosen by the controller
rather than by nature. Moreover, the outputs of the system at various time steps are
correlated through the unknown hidden state and therefore not independent even if
the inputs are.
6.1.1 Our Results
We show that we can efficiently minimize the population risk using projected stochas-
tic gradient descent. The bulk of our work applies to single-input single-output (SISO)
systems meaning that inputs and outputs are scalars xt, yt ∈ R. However, the hidden
state can have arbitrary dimension n. Every controllable SISO admits a convenient
canonical form called controllable canonical form that we formally introduce later. In
this canonical form, the transition matrix A is governed by n parameters a1, . . . , an
which coincide with the coefficients of the characteristic polynomial of A. The mini-
mal assumption under which we might hope to learn the system is that the spectral
radius of A is smaller than 1. However, the set of such matrices is non-convex and
does not have enough structure for our analysis. We will therefore make additional
assumptions. The assumptions we need differ between the case where we are trying to learn a system with n parameters, and the case where we allow ourselves to over-specify the trained model with n′ > n parameters. The former is sometimes called
proper learning, while the latter is called improper learning. In the improper case, we
are essentially able to learn any system with spectral radius less than 1 under a mild
separation condition on the roots of the characteristic polynomial. Our assumption
in the proper case is stronger and we introduce it next.
6.1.2 Proper Learning
Suppose that the state transition matrix A is given by parameters a₁, . . . , a_n, and consider the polynomial q(z) = 1 + a₁z + a₂z² + · · · + a_n z^n over the complex numbers ℂ.
We will require that the image of the unit circle on the complex plane under the
polynomial q is contained in the cone of complex numbers whose real part is larger
than their absolute imaginary part. Formally, for all z ∈ ℂ such that |z| = 1, we require that Re(q(z)) > |Im(q(z))|. Here, Re(z) and Im(z) denote the real and imaginary parts of z, respectively. We illustrate this condition in the figure on the right for a degree-4 system.
[Figure: illustration of the condition in the complex plane for a degree-4 system.]
Our assumption has three important implications. First, it implies (via Rouché's theorem) that the spectral radius of A is smaller than 1 and therefore ensures the stability of the system. Second, the vectors satisfying our assumption form a convex set in ℝⁿ. Finally, it ensures that the objective function is weakly quasi-convex, a condition we introduce later when we show that it enables stochastic gradient descent to make sufficient progress.
We note in passing that our assumption can be satisfied via the ℓ₁-norm constraint ‖a‖₁ ≤ √2/2. Moreover, if we pick random Gaussian coefficients with expected norm bounded by o(1/√(log n)), then the resulting vector will satisfy our assumption with probability 1 − o(1). Roughly speaking, the assumption requires that the roots of the characteristic polynomial p(z) = z^n + a₁z^{n−1} + · · · + a_n be relatively dispersed inside the unit circle. (For comparison, at the other end of the spectrum, the polynomial p(z) = (z − 0.99)^n has all of its roots colliding at the same point and does not satisfy the assumption.)
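The cone condition Re(q(z)) > |Im(q(z))| is straightforward to test numerically on a grid of the unit circle. The sketch below (helper names are ours) checks it for a coefficient vector with ‖a‖₁ < √2/2, which the text asserts is sufficient, and for the clustered-roots example p(z) = (z − 0.99)⁴, which violates it.

```python
import numpy as np

def q_values(a, num=2048):
    """Evaluate q(z) = 1 + a_1 z + ... + a_n z^n on a grid of the unit circle."""
    z = np.exp(2j * np.pi * np.arange(num) / num)
    powers = z[:, None] ** np.arange(1, len(a) + 1)
    return 1.0 + powers @ np.asarray(a, dtype=complex)

def in_cone(a, num=2048):
    """Grid check of the assumption Re(q(z)) > |Im(q(z))| on |z| = 1."""
    q = q_values(a, num)
    return bool(np.all(q.real > np.abs(q.imag)))

# A vector with l1-norm below sqrt(2)/2 satisfies the assumption...
a_good = np.array([0.3, -0.2, 0.1])        # ||a||_1 = 0.6 < 0.707
# ...while clustered roots, e.g. p(z) = (z - 0.99)^4, violate it.
a_bad = np.poly(0.99 * np.ones(4))[1:]     # coefficients after the leading 1
```

Geometrically, ‖a‖₁ ≤ √2/2 forces every q(z) into the disk of radius √2/2 around 1, which is exactly the disk inscribed in the cone Re(w) ≥ |Im(w)|.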
Theorem 6.1.1 (Informal). Under our assumption, projected stochastic gradient descent, when given N sample sequences of length T, returns parameters Θ̂ with population risk

f(Θ̂) ≤ f(Θ) + O( √((n⁵ + σ²n³)/(TN)) ) .
Recall that f(Θ) is the population risk of the optimal system, and σ² refers to the variance of the noise variables. We also assume that the inputs x_t are drawn
from a pairwise independent distribution with mean 0 and variance 1. Note, however,
that this does not imply independence of the outputs as these are correlated by a
common hidden state. The stated version of our result glosses over the fact that we
need our assumption to hold with a small amount of slack; a precise version follows
in Section 6.3. Our theorem establishes a polynomial convergence rate for stochastic
gradient descent. Since each iteration of the algorithm only requires a sequence of
matrix operations and an efficient projection step, the running time is polynomial, as
well. Likewise, the sample requirements are polynomial since each iteration requires
only a single fresh example. An important feature of this result is that the error
decreases with both the length T and the number of samples N . Computationally,
the projection step can be a bottleneck, although it is unlikely to be required in
practice and may be an artifact of our analysis.
6.1.3 The Power of Over-parameterization
Endowing the model with additional parameters compared to the ground truth turns
out to be surprisingly powerful. We show that we can essentially remove the assumption we previously made in proper learning. The idea is simple. If p is the characteristic polynomial of A, of degree n, we can find a system of order n′ > n such that the characteristic polynomial of its transition matrix becomes p · p′ for some polynomial p′ of degree n′ − n. This means that to apply our result we only need the
polynomial p · p′ to satisfy our assumption. At this point, we can choose p′ to be an
approximation of the inverse p−1. For sufficiently good approximation, the resulting
polynomial p · p′ is close to 1 and therefore satisfies our assumption. Such an ap-
proximation exists generically for n′ = O(n) under mild non-degeneracy assumptions
on the roots of p. In particular, any small random perturbation of the roots would
suffice.
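The inverse-approximation step can be checked numerically: compute the truncated power series u(z) of 1/q(z) (where q(z) = z^n p(1/z) has no roots in the closed unit disk when p is stable), and verify that q·u stays uniformly close to 1 on the unit circle, hence comfortably inside the cone. A sketch with an arbitrary example polynomial; `inverse_series` is our helper name.

```python
import numpy as np

def inverse_series(q, d):
    """First d+1 power-series coefficients of 1/q(z), assuming q[0] = 1."""
    b = np.zeros(d + 1)
    b[0] = 1.0
    for k in range(1, d + 1):
        b[k] = -sum(q[j] * b[k - j] for j in range(1, min(k, len(q) - 1) + 1))
    return b

# q(z) = z^n p(1/z) for an example stable polynomial with spread roots.
q = np.array([1.0, 0.3, -0.2, 0.1])
u = inverse_series(q, d=40)        # truncated series of 1/q, degree 40
qu = np.convolve(q, u)             # ascending coefficients of q(z) u(z)

z = np.exp(2j * np.pi * np.arange(512) / 512)
vals = np.polyval(qu[::-1], z)     # q(z) u(z) evaluated on the unit circle
max_err = np.max(np.abs(vals - 1.0))
```

By construction all coefficients of q·u up to degree 40 match the constant 1, so only a geometrically small tail remains, and the product lands deep inside the cone Re(w) > |Im(w)|.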
Theorem 6.1.2 (Informal). Under a mild non-degeneracy assumption, stochastic gradient descent returns parameters Θ̂ corresponding to a system of order n′ = O(n) with population risk

f(Θ̂) ≤ f(Θ) + O( √((n⁵ + σ²n³)/(TN)) ) ,

when given N sample sequences of length T.
We remark that the idea we sketched also shows that, in the extreme, improper
learning of linear dynamical systems becomes easy in the sense that the problem
can be solved using linear regression against the outputs of the system. However,
our general result interpolates between the proper case and the regime where linear
regression works. We discuss this in more detail in Section 6.5.3.
6.1.4 Multi-input Multi-output Systems
Both results we saw immediately extend to single-input multi-output (SIMO) sys-
tems as the dimensionality of C and D are irrelevant for us. The case of multi-input
multi-output (MIMO) systems is more delicate. Specifically, our results carry over
to a broad family of multi-input multi-output systems. However, in general MIMO
systems no longer enjoy canonical forms like SISO systems. In Section 6.6, we intro-
duce a natural generalization of controllable canonical form for MIMO systems and
extend our results to this case.
6.1.5 Related Work
System identification is a core problem in dynamical systems and has been studied
in depth for many years. The most popular reference on this topic is the text by
Ljung [114]. Nonetheless, the list of non-asymptotic results on identifying linear
systems from noisy data is surprisingly short. Several authors have recently tried
to estimate the sample complexity of dynamical system identification using machine
learning tools [170, 39, 173]. All of these results are rather pessimistic, with sample complexity bounds that are exponential in the degree of the linear system and other relevant quantities. In contrast, we prove that gradient descent has sample complexity polynomial in all of these parameters. Moreover, all of these papers
only focus on how well empirical risk approximates the true population risk and do
not provide guarantees about any algorithmic schemes for minimizing the empirical
risk.
The only result to our knowledge which provides polynomial sample complexity for identifying linear dynamical systems is by Shah et al. [153]. Here, the authors show that if certain frequency-domain information about the linear dynamical system is observed, then the linear system can be identified by solving a second-order cone programming problem. This result applies to improper learning only, and the size of the resulting system may be quite large, scaling as (1 − ρ(A))⁻². As we describe
in this work, very simple algorithms work in the improper setting when the system
degree is allowed to be polynomial in (1− ρ(A))−1. Moreover, it is not immediately
clear how to translate the frequency-domain results to the time-domain identification
problem discussed above.
Our main assumption about the image of the polynomial q(z) is an appeal to the theory of passive systems. A system is passive if the dot product between the input sequence u_t and the output sequence y_t is strictly positive. Physically, this notion corresponds to systems that cannot create energy. For example, a circuit made solely of
resistors, capacitors, and inductors would be a passive electrical system. If one added
an amplifier to the internals of the system, then it would no longer be passive. The set
of passive systems is a subset of the set of stable systems, and the subclass is some-
what easier to work with mathematically. Indeed, Megretski used tools from passive
systems to provide a relaxation technique for a family of identification problems in dy-
namical systems [121]. His approach is to lower bound a nonlinear least-squares cost
with a convex functional. However, he does not prove that his technique can identify
any of the systems, even asymptotically. The work of Söderström and Stoica [154] analyzes the landscape of the population risk in the frequency domain and shows that under certain conditions (closely related to ours), the population risk has a unique global minimum. This effectively established Lemma 6.2.3 of our paper.
Bazanella et al use a passivity condition to prove quasi-convexity of a cost function
that arises in control design [24]. Building on this work, Eckhard and Bazanella prove
a weaker version of Lemma 6.2.3 in the context of system identification [60]. But no
sample complexity or global convergence proofs are provided in either [154] or [24].
6.1.6 Proof Overview
The first important step in our proof is to express the population risk in the Fourier domain, where it is closely related to what we call the idealized risk. The idealized risk essentially captures the ℓ₂-difference between the transfer function of the learned system and
the ground truth. The transfer function is a fundamental object in control theory.
Any linear system is completely characterized by its transfer function G(z) = C(zI − A)⁻¹B. In the case of a SISO system, the transfer function is a rational function of degree n over the complex numbers and can be written as G(z) = s(z)/p(z). In the canonical form introduced in Section 6.1.7, the coefficients of p(z) are precisely the parameters that specify A. Moreover, z^n p(1/z) = 1 + a₁z + a₂z² + · · · + a_n z^n is the polynomial we encountered in the introduction. Under the assumption illustrated earlier, we show
in Section 6.2 that the idealized risk is weakly quasi-convex (Lemma 6.2.3). Quasi-
convexity implies that gradients cannot vanish except at the optimum of the objective
function; we review this (mostly known) material in Section 4.1. In particular, this
lemma implies that in principle we can hope to show that gradient descent converges
to a global optimum. However, there are several important issues that we need to
address. First, the result only applies to idealized risk, not our actual population risk
objective. Therefore it is not clear how to obtain unbiased gradients of the idealized
risk objective. Second, there is a subtlety in even defining a suitable empirical risk
objective. The reason is that risk is defined with respect to the correct initial state of
the system which we do not have access to during training. We overcome both of these
problems. In particular, we design an almost unbiased estimator of the gradient of
the idealized risk in Lemma 6.4.4 and give variance bounds of the gradient estimator
(Lemma 6.4.5).
Our results on improper learning in Section 6.5 rely on a surprisingly simple
but powerful insight. We can extend the degree of the transfer function G(z) by
extending both numerator and denominator with a polynomial u(z) such that G(z) =
s(z)u(z)/p(z)u(z). While this results in an equivalent system in terms of input-output
behavior, it can dramatically change the geometry of the optimization landscape.
In particular, we can see that only p(z)u(z) has to satisfy the assumption of our
proper learning algorithm. This allows us, for example, to put u(z) ≈ p(z)−1 so
that p(z)u(z) ≈ 1, hence trivially satisfying our assumption. A suitable inverse
approximation exists under light assumptions and requires degree no more than d =
O(n). Algorithmically, there is almost no change. We simply run stochastic gradient
descent with n+ d model parameters rather than n model parameters.
6.1.7 Preliminaries
For a complex matrix (or vector, or number) C, we use Re(C) to denote the real part, Im(C) the imaginary part, C̄ the conjugate, and C* = C̄ᵀ its conjugate transpose. We use | · | to denote the absolute value of a complex number c. For complex vectors u and v, we use ⟨u, v⟩ = u*v to denote the inner product, and ‖u‖ = √(u*u) is the norm of u. For complex matrices A and B of the same dimensions, ⟨A, B⟩ = tr(A*B) defines an inner product, and ‖A‖_F = √(tr(A*A)) is the Frobenius norm. For a square matrix
A, we use ρ(A) to denote the spectral radius of A, that is, the largest absolute value
of the elements in the spectrum of A. We use Id_n to denote the identity matrix of dimension n × n, and we drop the subscript when it is clear from context. We let e_i denote the i-th standard basis vector.
A SISO of order n is in controllable canonical form if A and B have the following form:

A = ⎡   0      1      0    · · ·   0  ⎤
    ⎢   0      0      1    · · ·   0  ⎥
    ⎢   ⋮      ⋮      ⋮     ⋱      ⋮  ⎥
    ⎢   0      0      0    · · ·   1  ⎥
    ⎣ −a_n  −a_{n−1}  −a_{n−2} · · · −a_1 ⎦ ,

B = [0, 0, . . . , 0, 1]ᵀ .        (6.1.4)
We will parameterize Â, B̂, Ĉ, D̂ accordingly. We will write A = C(a) for brevity, where a denotes the unknown last row [−a_n, . . . , −a_1] of the matrix A. We will use â to denote the corresponding training variable for a. Since B is known, B̂ is no longer a trainable parameter and is forced to be equal to B. Moreover, C is a row vector and we use [c₁, · · · , c_n] to denote its coordinates (and similarly for Ĉ).
A SISO is controllable if and only if the matrix [B | AB | A²B | · · · | A^{n−1}B] has rank n. This statement corresponds to the condition that all hidden states should be reachable from some initial condition and input trajectory. Any controllable system admits a controllable canonical form [81]. For a vector a = [a_n, . . . , a₁], let p_a(z) denote the polynomial

p_a(z) = z^n + a₁z^{n−1} + · · · + a_n .        (6.1.5)

When a defines the matrix A that appears in controllable canonical form, p_a is precisely the characteristic polynomial of A. That is, p_a(z) = det(zI − A).
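Both facts are easy to verify numerically: the companion matrix built from (6.1.4) has characteristic polynomial p_a, and its controllability matrix [B | AB | A²B] has full rank. A sketch with arbitrary illustrative coefficients:

```python
import numpy as np

def companion(a_rev):
    """Companion matrix C(a) in controllable canonical form (6.1.4).
    a_rev = [a_n, ..., a_1] is the negated last row, as in the text."""
    n = len(a_rev)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)      # superdiagonal of ones
    A[-1, :] = -np.asarray(a_rev)   # last row [-a_n, ..., -a_1]
    return A

a_rev = [0.1, -0.2, 0.3]            # [a_3, a_2, a_1]
A = companion(a_rev)
B = np.array([0.0, 0.0, 1.0])       # B = e_n in canonical form

# Characteristic polynomial: np.poly returns [1, a_1, a_2, a_3].
char = np.poly(A)

# Controllability matrix [B | AB | A^2 B] should have rank n = 3.
ctrb = np.column_stack([np.linalg.matrix_power(A, k) @ B for k in range(3)])
rank = np.linalg.matrix_rank(ctrb)
```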
6.2 Population Risk in Frequency Domain
We next establish conditions under which risk is weakly-quasi-convex. Our strategy
is to first approximate the risk functional f(Θ) by what we call idealized risk. This
approximation of the objective function is fairly straightforward; we justify it toward
the end of the section. We can show that

f(Θ̂) ≈ ‖D − D̂‖² + Σ_{k=0}^{∞} (CA^kB − ĈÂ^kB)² .        (6.2.1)
The leading term ‖D − D̂‖² is convex in D̂, which appears nowhere else in the objective. It therefore does not affect the convergence of the algorithm (up to lower-order terms), by virtue of Proposition 4.1.9, and we restrict our attention to the remaining terms.
Definition 6.2.1 (Idealized risk). We define the idealized risk as

g(Â, Ĉ) = Σ_{k=0}^{∞} (CA^kB − ĈÂ^kB)² .        (6.2.2)
We now use basic concepts from control theory (see [81, 82] for more background)
to express the idealized risk (6.2.2) in Fourier domain. The transfer function of the
linear system is
G(z) = C(zI − A)⁻¹B .        (6.2.3)
Note that G(z) is a rational function of degree n over the complex numbers, and hence we can find polynomials s(z) and p(z) such that G(z) = s(z)/p(z), with the convention that the leading coefficient of p(z) is 1. In controllable canonical form (6.1.4), the coefficients of p will correspond to the last row of A, while the coefficients of
s correspond to the entries of C. Also note that

G(z) = Σ_{t=1}^{∞} z^{−t} CA^{t−1}B = Σ_{t=1}^{∞} z^{−t} r_{t−1} .

The sequence r = (r₀, r₁, . . . , r_t, . . .) = (CB, CAB, . . . , CA^tB, . . .) is called the impulse response of the linear system. The behavior of a linear system is uniquely determined by its impulse response and therefore by G(z). Analogously, we denote the transfer function of the learned system by Ĝ(z) = Ĉ(zI − Â)⁻¹B = ŝ(z)/p̂(z). The idealized risk (6.2.2) is a function only of the impulse response r̂ of the learned system, and therefore a function only of Ĝ(z).
For future reference, we note that by an elementary calculation (see Lemma 8.3.1) we have

G(z) = C(zI − A)⁻¹B = (c₁ + c₂z + · · · + c_n z^{n−1}) / (z^n + a₁z^{n−1} + · · · + a_n) ,        (6.2.4)

which implies that s(z) = c₁ + c₂z + · · · + c_n z^{n−1} and p(z) = z^n + a₁z^{n−1} + · · · + a_n.
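Formula (6.2.4) can be checked numerically by comparing the state-space evaluation C(zI − A)⁻¹B against the ratio of polynomials at a few test points. The sketch below uses arbitrary illustrative coefficients:

```python
import numpy as np

a = np.array([0.3, -0.2, 0.1])       # [a_1, a_2, a_3], arbitrary stable choice
c = np.array([1.0, 0.5, -0.3])       # [c_1, c_2, c_3]
n = len(a)

# Controllable canonical form (6.1.4).
A = np.zeros((n, n)); A[:-1, 1:] = np.eye(n - 1); A[-1, :] = -a[::-1]
B = np.zeros(n); B[-1] = 1.0

def G_state(z):
    """C (zI - A)^{-1} B, evaluated from the state-space matrices."""
    return c @ np.linalg.solve(z * np.eye(n) - A, B)

def G_ratio(z):
    """(c_1 + c_2 z + ... + c_n z^{n-1}) / p_a(z), as in (6.2.4)."""
    s = sum(c[j] * z ** j for j in range(n))
    p = z ** n + sum(a[j] * z ** (n - 1 - j) for j in range(n))
    return s / p

pts = [2.0, 1.5 + 0.5j, 2.0 - 1.2j]
max_diff = max(abs(G_state(z) - G_ratio(z)) for z in pts)
```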
With these definitions in mind, we are ready to express idealized risk in Fourier
domain.
Proposition 6.2.2. Suppose p_â(z) has all its roots inside the unit circle. Then the idealized risk g(Â, Ĉ) can be written in the Fourier domain as

g(Â, Ĉ) = ∫₀^{2π} | Ĝ(e^{iθ}) − G(e^{iθ}) |² dθ .
Proof. Note that G(e^{iθ}) is the Fourier transform of the sequence r_k, and likewise Ĝ(e^{iθ}) is the Fourier transform¹ of r̂_k. Therefore, by Parseval's Theorem, we have

g(Â, Ĉ) = Σ_{k=0}^{∞} ‖r̂_k − r_k‖² = ∫₀^{2π} | Ĝ(e^{iθ}) − G(e^{iθ}) |² dθ .
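This identity can be sanity-checked numerically. One caveat on conventions: with the unnormalized Fourier transform, Parseval's identity reads Σ_k |r_k|² = (1/2π)∫₀^{2π}|G(e^{iθ})|² dθ, so the sketch below includes an explicit 1/(2π) factor (via a mean over the grid); the text presumably absorbs this constant into its Fourier convention. Systems and helper names are illustrative.

```python
import numpy as np

def companion(a_rev):
    """Companion matrix with last row [-a_n, ..., -a_1], as in (6.1.4)."""
    n = len(a_rev)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -np.asarray(a_rev)
    return A

def transfer(C, A, B, z):
    """Evaluate G(z) = C (zI - A)^{-1} B on an array of complex points."""
    n = A.shape[0]
    return np.array([C @ np.linalg.solve(zz * np.eye(n) - A, B) for zz in z])

# True and learned systems; coefficients are arbitrary stable examples.
A_t = companion([0.1, -0.2, 0.3]);  C_t = np.array([1.0, 0.5, -0.3])
A_h = companion([0.05, -0.1, 0.2]); C_h = np.array([0.8, 0.4, -0.2])
B = np.array([0.0, 0.0, 1.0])

# Time domain: truncated idealized risk (6.2.2); terms decay geometrically.
g_time = sum((C_t @ np.linalg.matrix_power(A_t, k) @ B
              - C_h @ np.linalg.matrix_power(A_h, k) @ B) ** 2
             for k in range(300))

# Fourier domain: (1/2pi) * integral of |G - G_hat|^2 over the unit circle.
theta = 2 * np.pi * np.arange(4096) / 4096
z = np.exp(1j * theta)
g_freq = np.mean(np.abs(transfer(C_t, A_t, B, z) - transfer(C_h, A_h, B, z)) ** 2)
```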
6.2.1 Quasi-convexity of the Idealized Risk

Now that we have a convenient expression for the risk in the Fourier domain, we can prove that the idealized risk g(Â, Ĉ) is weakly quasi-convex when â is not too far from the true a, in the sense that p_â(z) and p_a(z) make an angle of less than π/2 for every z on the unit circle. We will use the convention that a and â refer to the parameters that specify A and Â, respectively.

Lemma 6.2.3. For τ > 0 and every Ĉ, the idealized risk g(Â, Ĉ) is τ-weakly-quasi-convex over the domain

N_τ(a) = { â ∈ ℝⁿ : Re( p_a(z)/p_â(z) ) ≥ τ/2 , ∀ z ∈ ℂ s.t. |z| = 1 } .        (6.2.5)
Proof. We first analyze a single term h = |Ĝ(z) − G(z)|². Recall that Ĝ(z) = ŝ(z)/p̂(z), where p̂(z) = p_â(z) = z^n + â₁z^{n−1} + · · · + â_n. Note that z is fixed and h is a function of â and Ĉ. Then it is straightforward to see that

∂h/∂ŝ(z) = 2 Re[ (1/p̂(z)) ( ŝ(z)/p̂(z) − s(z)/p(z) )* ]        (6.2.6)

and

∂h/∂p̂(z) = −2 Re[ ( ŝ(z)/p̂(z)² ) ( ŝ(z)/p̂(z) − s(z)/p(z) )* ] .        (6.2.7)
¹The Fourier transform exists since ‖r̂_k‖ = ‖ĈÂ^kB‖ ≤ ‖Ĉ‖ ‖Â^k‖ ‖B‖ ≤ c ρ(Â)^k, where c does not depend on k and ρ(Â) < 1.
Since ŝ(z) and p̂(z) are linear in Ĉ and â respectively, by the chain rule we have

⟨∂h/∂â, â − a⟩ + ⟨∂h/∂Ĉ, Ĉ − C⟩ = (∂h/∂p̂(z)) ⟨∂p̂(z)/∂â, â − a⟩ + (∂h/∂ŝ(z)) ⟨∂ŝ(z)/∂Ĉ, Ĉ − C⟩
  = (∂h/∂p̂(z)) (p̂(z) − p(z)) + (∂h/∂ŝ(z)) (ŝ(z) − s(z)) .

Plugging the formulas (6.2.6) and (6.2.7) for ∂h/∂ŝ(z) and ∂h/∂p̂(z) into the equation above, we obtain

⟨∂h/∂â, â − a⟩ + ⟨∂h/∂Ĉ, Ĉ − C⟩
  = 2 Re[ ( −ŝ(z)(p̂(z) − p(z)) + p̂(z)(ŝ(z) − s(z)) ) / p̂(z)² · ( ŝ(z)/p̂(z) − s(z)/p(z) )* ]
  = 2 Re[ ( ŝ(z)p(z) − s(z)p̂(z) ) / p̂(z)² · ( ŝ(z)/p̂(z) − s(z)/p(z) )* ]
  = 2 Re[ p(z)/p̂(z) ] · | ŝ(z)/p̂(z) − s(z)/p(z) |²
  = 2 Re[ p(z)/p̂(z) ] · | Ĝ(z) − G(z) |² .

Hence h = |Ĝ(z) − G(z)|² is τ′-weakly-quasi-convex with τ′ = 2 min_{|z|=1} Re( p(z)/p̂(z) ), which is at least τ for â ∈ N_τ(a). This implies our claim, since by Proposition 6.2.2 the idealized risk g is a convex combination of functions of the form |Ĝ(z) − G(z)|² with |z| = 1, and Proposition 4.1.9 shows that convex combinations preserve weak quasi-convexity.
For future reference, we also prove that the idealized risk is O(n²/τ₁⁴)-weakly smooth.

Lemma 6.2.4. The idealized risk g(Â, Ĉ) is Γ-weakly smooth with Γ = O(n²/τ₁⁴).
Proof. By equation (6.2.6) and the chain rule we get

∂g/∂Ĉ = ∫_𝕋 ( ∂|Ĝ(z) − G(z)|² / ∂ŝ(z) ) · ( ∂ŝ(z)/∂Ĉ ) dz = ∫_𝕋 2 Re[ (1/p̂(z)) ( ŝ(z)/p̂(z) − s(z)/p(z) )* ] · [1, . . . , z^{n−1}] dz .

Therefore, by the Cauchy–Schwarz inequality, we can bound the norm of the gradient by

‖∂g/∂Ĉ‖² ≤ ( ∫_𝕋 | ŝ(z)/p̂(z) − s(z)/p(z) |² dz ) · ( ∫_𝕋 4 ‖[1, . . . , z^{n−1}]‖² · |1/p̂(z)|² dz ) ≤ O(n/τ₁²) · g(Â, Ĉ) .

Similarly, we can show that ‖∂g/∂â‖² ≤ O(n²/τ₁²) · g(Â, Ĉ).
6.2.2 Justifying Idealized Risk
We need to justify the approximation we made in Equation (6.2.1).
Lemma 6.2.5. Assume that the x_t are drawn i.i.d. from an arbitrary distribution with mean 0 and variance 1, and that the ξ_t are drawn i.i.d. with mean 0 and variance σ². Then the population risk f(Θ̂) can be written as

f(Θ̂) = (D − D̂)² + Σ_{k=1}^{T−1} (1 − k/T) (CA^{k−1}B − ĈÂ^{k−1}B)² + σ² .        (6.2.8)
Proof of Lemma 6.2.5. Under the distributional assumptions on ξ_t and x_t, we can calculate the objective function analytically. We write out y_t and ŷ_t in terms of the inputs:

y_t = D x_t + Σ_{k=1}^{t−1} CA^{t−k−1}B x_k + CA^{t−1}h₀ + ξ_t ,
ŷ_t = D̂ x_t + Σ_{k=1}^{t−1} ĈÂ^{t−k−1}B x_k + ĈÂ^{t−1}h₀ .

Therefore, using the fact that the x_t are independent with mean 0 and covariance Id, the expectation of the error can be calculated (formally by Claim 8.3.2) as

E[ ‖ŷ_t − y_t‖² ] = ‖D − D̂‖²_F + Σ_{k=1}^{t−1} ‖CA^{t−k−1}B − ĈÂ^{t−k−1}B‖²_F + E[‖ξ_t‖²] .        (6.2.9)

Using E[‖ξ_t‖²] = σ², it follows that

f(Θ̂) = ‖D − D̂‖²_F + Σ_{k=1}^{T−1} (1 − k/T) ‖CA^{k−1}B − ĈÂ^{k−1}B‖²_F + σ² .        (6.2.10)
Recall that under the controllable canonical form (6.1.4), B = e_n is known, and therefore B̂ = B is no longer a variable. We use â for the training variable corresponding to a. The expected objective function (6.2.10) then simplifies to

f(Θ̂) = (D − D̂)² + Σ_{k=1}^{T−1} (1 − k/T) (CA^{k−1}B − ĈÂ^{k−1}B)² + σ² .
The previous lemma does not yet control higher order contributions present in
the idealized risk. This requires additional structure that we introduce in the next
section.
6.3 Effective Relaxations of Spectral Radius
The previous section showed quasi-convexity of the idealized risk. However, several
steps are missing towards showing finite sample guarantees for stochastic gradient
descent. In particular, we will need to control the variance of the stochastic gradient at
any system that we encounter in the training. For this purpose we formally introduce
our main assumption now and show that it serves as an effective relaxation of the spectral radius. The results below will be used for proving convergence of stochastic gradient descent in Section 6.4.
Consider the following convex region C in the complex plane:

C = { z : Re(z) ≥ (1 + τ₀)|Im(z)| } ∩ { z : τ₁ < Re(z) < τ₂ } ,        (6.3.1)

where τ₀, τ₁, τ₂ > 0 are constants that are considered fixed throughout the paper. Our bounds will have polynomial dependency on these parameters.
Definition 6.3.1. We say a polynomial p(z) is α-acquiescent if { p(z)/z^n : |z| = α } ⊆ C. A linear system with transfer function G(z) = s(z)/p(z) is α-acquiescent if its denominator p(z) is.
The set of coefficients a ∈ ℝⁿ defining acquiescent systems forms a convex set. Formally, for a positive α > 0, define the convex set B_α ⊆ ℝⁿ as

B_α = { a ∈ ℝⁿ : { p_a(z)/z^n : |z| = α } ⊆ C } .        (6.3.2)

We note that definition (6.3.2) is equivalent to the definition B_α = { a ∈ ℝⁿ : { z^n p_a(1/z) : |z| = 1/α } ⊆ C }, which is the version we used in the introduction for simplicity. Indeed, we can verify the convexity of B_α from the definition and the convexity of C: a, b ∈ B_α implies that p_a(z)/z^n, p_b(z)/z^n ∈ C, and therefore p_{(a+b)/2}(z)/z^n = ½ (p_a(z)/z^n + p_b(z)/z^n) ∈ C. We also note that the parameter α in the definition of acquiescence corresponds to the spectral radius of the companion matrix. In particular, an acquiescent system is stable for α < 1.
Lemma 6.3.2. Suppose a ∈ Bα, then the roots of polynomial pa(z) have magnitudes
bounded by α. Therefore the controllable canonical form A = C(a) defined by a has
spectral radius ρ(A) < α.
Proof. Define the holomorphic functions f(z) = z^n and g(z) = p_a(z) = z^n + a₁z^{n−1} + · · · + a_n. We apply the symmetric form of Rouché's theorem [63] on the circle K = { z : |z| = α }. For any point z on K, we have |f(z)| = α^n and |f(z) − g(z)| = α^n · |1 − p_a(z)/z^n|. Since a ∈ B_α, we have p_a(z)/z^n ∈ C for any z with |z| = α. Observe that for any c ∈ C we have |1 − c| < 1 + |c|; therefore

|f(z) − g(z)| = α^n |1 − p_a(z)/z^n| < α^n (1 + |p_a(z)|/|z^n|) = |f(z)| + |p_a(z)| = |f(z)| + |g(z)| .

Hence, by Rouché's theorem, f and g have the same number of roots inside the circle K. Since f(z) = z^n has exactly n roots inside K, g has all of its n roots inside the circle K.
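Lemma 6.3.2 can be illustrated numerically: verify (a grid version of) membership in B_α, then confirm that the companion matrix has spectral radius below α. The constants τ₀, τ₁, τ₂ and the coefficients below are arbitrary choices for the example.

```python
import numpy as np

def in_B_alpha(a, alpha, tau0=0.1, tau1=0.1, tau2=10.0, num=2048):
    """Grid check that p_a(z)/z^n lies in the cone C of (6.3.1) for |z| = alpha."""
    z = alpha * np.exp(2j * np.pi * np.arange(num) / num)
    p = np.polyval(np.concatenate(([1.0], a)), z)   # p_a(z), a = [a_1, ..., a_n]
    w = p / z ** len(a)
    return bool(np.all((w.real >= (1 + tau0) * np.abs(w.imag))
                       & (tau1 < w.real) & (w.real < tau2)))

alpha = 0.9
a = np.array([0.2, -0.1, 0.05])        # [a_1, a_2, a_3], small coefficients
A = np.zeros((3, 3)); A[:-1, 1:] = np.eye(2)
A[-1, :] = -a[::-1]                    # last row [-a_3, -a_2, -a_1]
rho = max(abs(np.linalg.eigvals(A)))   # spectral radius of the companion matrix
```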
The following lemma establishes that B_α is a monotone family of sets in α. The proof follows from the maximum modulus principle applied to the harmonic functions Re(z^n p(1/z)) and Im(z^n p(1/z)). We remark that there are larger convex sets than B_α that ensure a bounded spectral radius. However, in order to also guarantee monotonicity and the no-blow-up property below, we restrict our attention to B_α.
Lemma 6.3.3 (Monotonicity of B_α). For any 0 < α < β, we have B_α ⊂ B_β.

Proof. Let q_a(z) = 1 + a₁z + · · · + a_n z^n, and note that q_a(1/z) = p_a(z)/z^n. Therefore B_α = { a : q_a(z) ∈ C, ∀ |z| = 1/α }. Suppose a ∈ B_α; then Re(q_a(z)) ≥ τ₁ for every z with |z| = 1/α. Since Re(q_a(z)) is the real part of the holomorphic function q_a(z), it is a harmonic function. By the minimum principle for harmonic functions, for any |z| ≤ 1/α we have Re(q_a(z)) ≥ inf_{|w|=1/α} Re(q_a(w)) ≥ τ₁. In particular, this holds for |z| = 1/β < 1/α. Similarly, we can prove that for z with |z| = 1/β, Re(q_a(z)) ≥ (1 + τ₀)|Im(q_a(z))|, and the remaining conditions for a being in B_β.
Our next lemma entails that acquiescent systems have well-behaved impulse responses.

Lemma 6.3.4 (No-blow-up property). Suppose a ∈ B_α for some α ≤ 1. Then the companion matrix A = C(a) satisfies

Σ_{k=0}^{∞} ‖α^{−k}A^kB‖² ≤ 2πn α^{−2n}/τ₁² .        (6.3.3)

Moreover, for any k ≥ 0,

‖A^kB‖² ≤ min{ 2πn/τ₁² , 2πn α^{2k−2n}/τ₁² } .
Proof of Lemma 6.3.4. Let f_λ = Σ_{k=0}^{∞} e^{iλk} α^{−k} A^k B be the Fourier transform of the series α^{−k}A^kB. Then using Parseval's Theorem, we have

Σ_{k=0}^{∞} ‖α^{−k}A^kB‖² = ∫₀^{2π} |f_λ|² dλ = ∫₀^{2π} | (I − α^{−1}e^{iλ}A)^{−1}B |² dλ
  = ∫₀^{2π} ( Σ_{j=1}^{n} α^{2j} ) / | p_a(αe^{−iλ}) |² dλ ≤ ∫₀^{2π} n / | p_a(αe^{−iλ}) |² dλ ,        (6.3.4)

where at the last step we used the fact that (I − wA)^{−1}B = (1/p_a(w^{−1})) [w^{−1}, w^{−2}, . . . , w^{−n}]ᵀ (see Lemma 8.3.1), together with α ≤ 1. Since a ∈ B_α, we have |q_a(α^{−1}e^{iλ})| ≥ Re(q_a(α^{−1}e^{iλ})) ≥ τ₁, and therefore p_a(αe^{−iλ}) = α^n e^{−inλ} q_a(α^{−1}e^{iλ}) has magnitude at least τ₁α^n. Plugging this into equation (6.3.4), we conclude that

Σ_{k=0}^{∞} ‖α^{−k}A^kB‖² ≤ ∫₀^{2π} n / | p_a(αe^{−iλ}) |² dλ ≤ 2πn α^{−2n}/τ₁² .

Finally, we establish the bound for ‖A^kB‖². By Lemma 6.3.3, we have B_α ⊂ B₁ for α ≤ 1, so we can take α = 1 in equation (6.3.3) and it still holds. That is, we have

Σ_{k=0}^{∞} ‖A^kB‖² ≤ 2πn/τ₁² ,

which in particular implies ‖A^kB‖² ≤ 2πn/τ₁² for every k. The second bound follows directly from (6.3.3): ‖α^{−k}A^kB‖² ≤ 2πn α^{−2n}/τ₁², that is, ‖A^kB‖² ≤ 2πn α^{2k−2n}/τ₁².
6.3.1 Efficiently Computing the Projection
In our algorithm, we require a projection onto Bα. However, the only requirement of
the projection step is that it projects onto a set contained inside Bα that also contains
the true linear system. So a variety of subroutines can be used to compute this projection or an approximation. First, the explicit projection onto B_α is representable by a semidefinite program. This is because each of the three constraints can be checked by testing whether a trigonometric polynomial is non-negative. A simple inner approximation can be constructed by requiring the constraints to hold on a finite grid of size O(n). One can check that this provides a tight, polyhedral approximation to the set
Bα, following an argument similar to Section C of Bhaskar et al [29]. See Section 6.9
for more detailed discussion on why projection on a polytope suffices. Furthermore,
sometimes we can replace the constraint by an `1 or `2-constraint if we know that the
system satisfies the corresponding assumption. Removing the projection step entirely
is an interesting open problem.
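As a minimal illustration of the grid idea (not the semidefinite or polyhedral formulations discussed above), the sketch below checks the three constraints of (6.3.1) on a finite grid of |z| = α, and restores feasibility by bisecting toward 0, which lies strictly inside B_α. This is a crude feasible-point heuristic sufficient to land inside B_α; it is not the exact Euclidean projection.

```python
import numpy as np

def feasible_on_grid(a, alpha, tau0=0.1, tau1=0.1, tau2=10.0, num=512):
    """Check p_a(z)/z^n in the cone C of (6.3.1) on a finite grid of |z| = alpha."""
    a = np.asarray(a, dtype=float)
    z = alpha * np.exp(2j * np.pi * np.arange(num) / num)
    w = np.polyval(np.concatenate(([1.0], a)), z) / z ** len(a)
    return bool(np.all((w.real >= (1 + tau0) * np.abs(w.imag))
                       & (tau1 < w.real) & (w.real < tau2)))

def restore_feasibility(a, alpha, steps=50):
    """Bisect along the segment from 0 (strictly feasible) to a.
    A crude feasible-point heuristic, not the true Euclidean projection."""
    a = np.asarray(a, dtype=float)
    if feasible_on_grid(a, alpha):
        return a
    lo, hi = 0.0, 1.0   # scale factors: lo * a is feasible, hi * a is not
    for _ in range(steps):
        mid = (lo + hi) / 2
        if feasible_on_grid(mid * a, alpha):
            lo = mid
        else:
            hi = mid
    return lo * a

a_bad = np.array([2.0, 0.0, 0.0])   # violates the cone constraints on |z| = 0.9
a_fix = restore_feasibility(a_bad, alpha=0.9)
```

Because B_α is convex and contains 0, every point on the segment up to the returned scale is feasible, so the bisection always terminates at a valid point.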
6.4 Learning Acquiescent Systems
Next we show that we can learn acquiescent systems.
Theorem 6.4.1. Suppose the true system Θ is α-acquiescent and satisfies ‖C‖ ≤ 1. Then with N samples of length T ≥ Ω(n + 1/(1 − α)), stochastic gradient descent (Algorithm 7) with projection set B_α returns parameters Θ̂ = (Â, B̂, Ĉ, D̂) with population risk

f(Θ̂) ≤ f(Θ) + O( n²/N + √((n⁵ + σ²n³)/(TN)) ) ,        (6.4.1)

where the O(·)-notation hides polynomial dependencies on 1/(1 − α), 1/τ₀, 1/τ₁, τ₂, and R = ‖a‖.
Algorithm 7 Projected stochastic gradient descent with partial loss
For i = 0 to N:
1. Take a fresh sample ((x₁, . . . , x_T), (y₁, . . . , y_T)). Let ỹ_t be the simulated outputs² of the system Θ̂ on inputs x and initial state h₀ = 0.
2. Let T₁ = T/4. Run stochastic gradient descent³ on the loss function ℓ((x, y), Θ̂) = (1/(T − T₁)) Σ_{t>T₁} ‖ỹ_t − y_t‖². Concretely, letting G_A = ∂ℓ/∂â, G_C = ∂ℓ/∂Ĉ, and G_D = ∂ℓ/∂D̂, we update
   [â, Ĉ, D̂] → [â, Ĉ, D̂] − η [G_A, G_C, G_D] .
3. Project Θ̂ = (â, Ĉ, D̂) onto the set B_α ⊗ ℝⁿ ⊗ ℝ.
Recall that T is the length of the sequence and N is the number of samples. The
first term in the bound (6.4.1) comes from the smoothness of the population risk
and the second comes from the variance of the gradient estimator of population risk
(which will be described in detail below). An important (but not surprising) feature here is that the variance scales as 1/T, and therefore for long sequences we in fact obtain a 1/N convergence rate instead of a 1/√N rate (for relatively small N).
We can further balance the variance of the estimator against the number of samples by breaking each long sequence of length T into Θ(T/n) short sequences of length Θ(n), and then running Algorithm 7 on these TN/n shorter sequences. This leads to the following bound, which gives the expected dependency on T and N: TN should be counted as the true number of samples for the sequence-to-sequence model.
Corollary 6.4.2. Under the assumptions of Theorem 6.4.1, Algorithm 8 returns
parameters Θ̂ with population risk

f(Θ̂) ≤ f(Θ) + O( √((n⁵ + σ²n³)/(TN)) ),

where the O(·)-notation hides polynomial dependencies on 1/(1 − α), 1/τ₀, 1/τ₁, τ₂, and
R = ‖a‖.
Algorithm 8 Projected stochastic gradient descent for long sequences
Input: N sample sequences of length T. Output: Learned system Θ̂.
1. Divide each sample of length T into T/(βn) samples of length βn, where β is a large enough constant. Then run Algorithm 7 with the new samples and obtain Θ̂.

²Note that ŷ_t here is different from the ŷ_t defined in equation (6.1.2), which is used to define the population risk: here ŷ_t is obtained from the (wrong) initial state ĥ₀ = 0, while in (6.1.2) it is obtained from the correct initial state.
³See Algorithm Box 9 for a detailed back-propagation algorithm that computes the gradient.
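The sequence-splitting step of Algorithm 8 is mechanical; the following sketch handles one sample (discarding any leftover tail shorter than βn is an assumption of this sketch, as the thesis does not specify how a remainder is handled):

```python
def split_sample(xs, ys, beta, n):
    """Divide one length-T input/output sample into floor(T/(beta*n))
    samples of length beta*n, dropping any leftover tail."""
    chunk = beta * n
    return [(xs[i:i + chunk], ys[i:i + chunk])
            for i in range(0, len(xs) - chunk + 1, chunk)]
```

Each short piece is then fed to Algorithm 7 as a fresh sample, which is what makes the h₀ = 0 trick of the partial loss essential.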
We remark that the gradient computation takes time linear in Tn, since one can use
the chain rule (also called back-propagation) to compute the gradient efficiently.
For completeness, Algorithm 9 gives a detailed implementation. Finally and
importantly, we remark that although we defined the population risk as the expected
error with respect to sequences of length T, our error bound actually generalizes to
any longer (or shorter) sequences of length T′ ≥ max{n, 1/(1 − α)}. By the explicit
formula for f(Θ̂) (Lemma 6.2.5) and the fact that ‖ĈÂᵏB̂‖ decays exponentially for
k ≥ n (Lemma 6.3.4), we can bound the population risk on sequences of different
lengths. Concretely, letting f_{T′}(Θ̂) denote the population risk on sequences of
length T′, we have for all T′ ≥ max{n, 1/(1 − α)},

f_{T′}(Θ̂) ≤ 1.1 f(Θ̂) + exp(−(1 − α) min{T, T′}) ≤ O( √((n⁵ + σ²n³)/(TN)) ).

We note that generalization to longer sequences does deserve attention: in practice,
it is usually difficult to train non-linear recurrent networks that generalize to
sequences longer than the training data.
Our proof of Theorem 6.4.1 consists of three parts: a) showing the idealized
risk is quasi-convex in the convex set B_α (Lemma 6.4.3); b) designing an (almost)
unbiased estimator of the gradient of the idealized risk (Lemma 6.4.4); and c)
bounding the variance of the gradient estimator (Lemma 6.4.5).

First of all, using the theory developed in Section 6.2 (Lemma 6.2.3 and
Lemma 6.2.4), it is straightforward to verify that in the convex set B_α ⊗ Rⁿ, the
idealized risk is both weakly-quasi-convex and weakly-smooth.

Lemma 6.4.3. Under the conditions of Theorem 6.4.1, the idealized risk (6.2.2) is
τ-weakly-quasi-convex in the convex set B_α ⊗ Rⁿ and Γ-weakly-smooth, where τ =
Ω(τ₀τ₁/τ₂) and Γ = O(n²/τ₁⁴).
Proof of Lemma 6.4.3. It suffices to show that for all a, â ∈ B_α, we have â ∈
N_τ(a) for τ = Ω(τ₀τ₁/τ₂). Indeed, by the monotonicity of the family of sets B_α
(Lemma 6.3.3), we have a, â ∈ B₁, which by definition means that for every z on the
unit circle, p_a(z)/zⁿ, p_â(z)/zⁿ ∈ C. By the definition of C, for any points w, ŵ ∈ C,
the angle φ between w and ŵ is at most π − Ω(τ₀) and the ratio of their magnitudes
is at least τ₁/τ₂, which implies ℜ(w/ŵ) = (|w|/|ŵ|) · cos(φ) ≥ Ω(τ₀τ₁/τ₂). Therefore
ℜ(p_a(z)/p_â(z)) ≥ Ω(τ₀τ₁/τ₂), and we conclude that â ∈ N_τ(a). The smoothness
bound was established in Lemma 6.2.4.
Towards designing an unbiased estimator of the gradient, we note a small caveat
that prevents us from simply using the gradient of the empirical risk, as is commonly
done for other (static) problems. Recall that the population risk is defined as the
expected risk with known initial state h₀, while during training we do not have
access to the initial states; with the naive approach we could not even estimate the
population risk from samples without knowing the initial states.

We argue that being able to handle missing initial states is indeed desirable: in
most interesting applications h₀ is unknown (or even to be learned). Moreover, the
ability to handle unknown h₀ allows us to break a very long sequence into shorter
sequences, which is what yields Corollary 6.4.2. The difficulty here is essentially
that we have a supervised learning problem with missing data h₀. We get around it
by simply ignoring the first T₁ = Ω(T) outputs of the system and setting the
corresponding errors to 0. Since the influence of h₀ on any output later than time
k ≥ T₁ ≥ max{n, 1/(1 − α)} is inverse exponentially small, we can safely assume
h₀ = 0 when the error before time T₁ is not taken into account.

This small trick also makes our algorithm suitable for cases where these early
outputs are actually not observed. This is indeed an interesting setting, since in
many sequence-to-sequence models [165] there is no output in the first fraction of
iterations (of course, these models have non-linear operations that we cannot handle).
The proof of the correctness of the estimator is almost trivial and deferred to
Section 6.7.
Lemma 6.4.4. Under the assumptions of Theorem 6.4.1, suppose a, â ∈ B_α. Then in
Algorithm 7, at each iteration, G_A, G_C are unbiased estimators of the gradient of the
idealized risk (6.2.2) in the sense that

E[G_A, G_C] = [∂g/∂â, ∂g/∂Ĉ] ± exp(−Ω((1 − α)T)) .   (6.4.2)
Finally, we control the variance of the gradient estimator.
Lemma 6.4.5. The (almost) unbiased estimator (G_A, G_C) of the gradient of g(Â, Ĉ)
has variance bounded by

V[G_A] + V[G_C] ≤ O( (n³Λ²/τ₁⁶ + σ²n²Λ/τ₁⁴) / T ),

where Λ = O(max{n, 1/(1 − α)} log(1/(1 − α))).
Note that Lemma 6.4.5 does not directly follow from the Γ-weak-smoothness of the
population risk, since it is not clear whether the loss function ℓ((x, y), Θ̂) is also
Γ-smooth for every sample. Moreover, even if it were, smoothness would only yield a
variance bound of order Γ², while the true variance scales as 1/T. The discrepancy
comes from the fact that smoothness implies an upper bound on the expected squared
norm of the gradient, which equals the variance plus the squared norm of the mean.
Although for many other problems the variance is on the same order as the squared
mean, for our sequence-to-sequence model the variance actually decreases with the
length of the data, and therefore the variance bound obtained from smoothness is
very pessimistic.
Instead, we bound the variance directly. The calculation is tedious but simple in
spirit. We mainly need Lemma 6.3.4 to control the various sums that show up when
calculating the expectation. The only tricky part is obtaining the 1/T dependency,
which corresponds to the cancellation of the contributions from the cross terms. In
the proof we essentially write out the variance as a (complicated) function of Â, Ĉ
consisting of sums of terms involving (ĈÂᵏB̂ − CAᵏB) and ÂᵏB̂, and we control
these sums using Lemma 6.3.4. The proof is deferred to Section 6.7.
Finally, we are ready to prove Theorem 6.4.1. We essentially combine
Lemma 6.4.3, Lemma 6.4.4, and Lemma 6.4.5 with the generic convergence result,
Proposition 4.1.8. This gives low error in the idealized risk, and we then relate the
idealized risk to the population risk.

Proof of Theorem 6.4.1. We consider g′(Â, Ĉ, D̂) = (D̂ − D)² + g(Â, Ĉ), an
extended version of the idealized risk which takes the contribution of D̂ into
account. By Lemma 6.4.4, Algorithm 7 computes G_A, G_C, which are almost
unbiased estimators of the gradients of g′ up to negligible error
exp(−Ω((1 − α)T)), and by Lemma 6.7.1, G_D is an unbiased estimator of the
gradient of g′ with respect to D̂. Moreover, by Lemma 6.4.5, these estimators have
total variance V = O((n⁵ + σ²n³)/T), where O(·) hides dependencies on τ₁ and
(1 − α). Applying Proposition 4.1.8 (which only requires an unbiased estimator of
the gradient of g′), we obtain that after N iterations we converge to a point with

g′(â, Ĉ, D̂) ≤ O( n²/N + √((n⁵ + σ²n³)/(TN)) ).

Then, by Lemma 6.2.5, we have f(Θ̂) ≤ g′(â, Ĉ, D̂) + σ² = g′(â, Ĉ, D̂) + f(Θ) ≤
O( n²/N + √((n⁵ + σ²n³)/(TN)) ) + f(Θ), which completes the proof.
6.5 The Power of Improper Learning
We observe an interesting and important fact about the theory in Section 6.4: it
relies solely on a condition on the characteristic function p(z). This suggests that the
geometry of the training objective depends mostly on the denominator of the transfer
function, even though the system is uniquely determined by the transfer function
G(z) = s(z)/p(z). This might seem to be an undesirable discrepancy between the
behavior of the system and our analysis of the optimization problem.

However, we can actually exploit this discrepancy to design improper learning
algorithms that succeed under much weaker assumptions. We rely on the following
simple observation about the invariance of a system G(z) = s(z)/p(z). For an
arbitrary polynomial u(z) with leading coefficient 1, we can write G(z) as

G(z) = s(z)u(z)/(p(z)u(z)) = s̃(z)/p̃(z),

where s̃ = su and p̃ = pu. Therefore the system s̃(z)/p̃(z) has identical behavior
to G. Although this is a redundant representation of G(z), it should be counted as
an acceptable solution. After all, learning the minimum representation⁴ of a linear
system is impossible in general; in fact, we will encounter an example in Section 6.5.1.
While not changing the behavior of the system, the extension from p(z) to p̃(z)
does affect the geometry of the optimization problem. In particular, if p̃(z) is now an
α-acquiescent characteristic polynomial as defined in Definition 6.3.1, then we can
find it simply using stochastic gradient descent, as shown in Section 6.4. Observe
that we do not require knowledge of u(z), only its existence. Denoting by d the
degree of u, the algorithm itself is simply stochastic gradient descent with n + d
model parameters instead of n.

Our discussion motivates the following definition.

⁴The minimum representation of a transfer function G(z) is defined as the representation G(z) = s(z)/p(z) with p(z) having minimum degree.
Definition 6.5.1. A polynomial p(z) of degree n is α-acquiescent by extension of
degree d if there exists a polynomial u(z) of degree d and leading coefficient 1 such
that p(z)u(z) is α-acquiescent.

For a transfer function G(z), we define its H₂ norm as

‖G‖²_{H₂} = (1/2π) ∫₀^{2π} |G(e^{iθ})|² dθ .

We assume (without loss of generality) that the true transfer function G(z) has
bounded H₂ norm, that is, ‖G‖_{H₂} ≤ 1. This can be achieved by a rescaling⁵ of
the matrix C.
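The H₂ norm is easy to evaluate numerically, and for a polynomial h(z) = b₀ + b₁z + · · · + b_{n−1}z^{n−1} it reduces to the coefficient norm, ‖h‖_{H₂} = ‖b‖ (Parseval), a fact used later in the proof of Theorem 6.5.2. A small sketch (the grid size is an arbitrary choice):

```python
import cmath

def h2_norm_sq(G, num=4096):
    """Approximate ||G||_{H2}^2 = (1/2pi) * integral over [0, 2pi) of
    |G(e^{i theta})|^2 by an equispaced average over the unit circle."""
    total = 0.0
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        total += abs(G(z)) ** 2
    return total / num

def poly(b):
    """The polynomial z -> b[0] + b[1] z + ... from its coefficient list."""
    return lambda z: sum(bj * z ** j for j, bj in enumerate(b))
```

For instance, for b = [1, 2, 3] the squared H₂ norm is 1 + 4 + 9 = 14, matching ‖b‖², and for the rational function 1/(z − 0.5) one gets the geometric series 1/(1 − 0.25) = 4/3.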
Theorem 6.5.2. Suppose the true system has transfer function G(z) = s(z)/p(z)
with a characteristic function p(z) that is α-acquiescent by extension of degree d, and
‖G‖_{H₂} ≤ 1. Then projected stochastic gradient descent with m = n + d states (that is,
Algorithm 8 with m states) returns a system Θ̂ with population risk

f(Θ̂) ≤ O( √((m⁵ + σ²m³)/(TN)) ),

where the O(·) notation hides polynomial dependencies on τ₀, τ₁, τ₂, and 1/(1 − α).
The theorem follows directly from Corollary 6.4.2 (with some additional care about
the scaling).

Proof of Theorem 6.5.2. Let p̃(z) = p(z)u(z) be the acquiescent extension of p(z).
Since τ₀ ≤ |p(z)u(z)| = |p̃(z)| ≤ τ₂ on the unit circle, we have |s̃(z)| = |s(z)||u(z)| ≤
τ₂|s(z)|/|p(z)| = O_τ(|G(z)|). Therefore s̃(z) satisfies ‖s̃‖_{H₂} = O_τ(‖s(z)/p(z)‖_{H₂}) =
O_τ(‖G(z)‖_{H₂}) ≤ O_τ(1). This means that the vector C̃ that determines the
coefficients of s̃ satisfies ‖C̃‖ ≤ O_τ(1), since for a polynomial h(z) = b₀ + · · · +
b_{n−1}z^{n−1} we have ‖h‖_{H₂} = ‖b‖. Therefore we can apply Corollary 6.4.2 to
complete the proof.

⁵In fact, this is a natural scaling that makes comparing errors easier. Recall that the population risk is essentially ‖G − Ĝ‖²_{H₂}; rescaling C so that ‖G‖_{H₂} = 1 ensures that an error ≪ 1 corresponds to non-trivial performance.
In the rest of this section, we discuss in Section 6.5.1 the instability of the minimum
representation, and in Section 6.5.2 we give several examples where the characteristic
function p(z) is not α-acquiescent but is α-acquiescent by extension with small
degree d.

As a final remark, the examples illustrated in the following subsections may be
far from optimally analyzed. It is beyond the scope of this paper to understand the
optimal conditions under which p(z) is acquiescent by extension.
6.5.1 Instability of the Minimum Representation
We begin by constructing a contrived example where the minimum representation of
G(z) is not stable at all, and as a consequence one cannot hope to recover the
minimum representation of G(z).

Consider G(z) = s(z)/p(z) := (zⁿ − 0.8ⁿ)/((z − 0.1)(zⁿ − 0.9ⁿ)) and G′(z) =
s′(z)/p′(z) := 1/(z − 0.1). Clearly these are the minimum representations of G(z)
and G′(z), and both satisfy acquiescence. On the one hand, the characteristic
polynomials p(z) and p′(z) are very different. On the other hand, the transfer
functions G(z) and G′(z) take almost the same values on the unit circle, up to
exponentially small error:

|G(z) − G′(z)| = (0.9ⁿ − 0.8ⁿ)/|(z − 0.1)(zⁿ − 0.9ⁿ)| ≤ exp(−Ω(n)) .

Moreover, the transfer functions G(z) and G′(z) are on the order of Θ(1) on the unit
circle. This suggests that from an (inverse polynomially accurate) approximation of
the transfer function G(z), we cannot hope to recover the minimum representation in
any sense, even if the minimum representation satisfies acquiescence.
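The gap between the two transfer functions can be checked numerically; in this sketch (with n = 50 and a 400-point grid as arbitrary choices) the maximum gap on the unit circle is already below 10⁻², and it shrinks further as n grows.

```python
import cmath

def G(z, n):
    # minimum representation with a characteristic polynomial of degree n + 1
    return (z ** n - 0.8 ** n) / ((z - 0.1) * (z ** n - 0.9 ** n))

def G_prime(z):
    # minimum representation with a characteristic polynomial of degree 1
    return 1.0 / (z - 0.1)

def max_gap(n, num=400):
    """Largest |G - G'| over an equispaced grid on the unit circle."""
    return max(abs(G(cmath.exp(2j * cmath.pi * k / num), n) -
                   G_prime(cmath.exp(2j * cmath.pi * k / num)))
               for k in range(num))
```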
6.5.2 Power of Improper Learning in Various Cases
We illustrate the use of improper learning through various examples below.
Example: artificial construction
We consider a simple contrived example where improper learning helps us learn
the transfer function dramatically. We exhibit a characteristic function which is not
1-acquiescent but is (α + 1)/2-acquiescent by extension of degree 3.

Let n be a large enough integer and let α be a constant. Let J = {1, n − 1, n} and
ω = e^{2πi/n}, and define p(z) = z³ ∏_{j∈[n], j∉J} (z − αω^j). Then we have

p(z)/zⁿ = ∏_{j∈[n], j∉J} (1 − αω^j/z) = (1 − αⁿ/zⁿ) / ((1 − αω/z)(1 − αω^{−1}/z)(1 − α/z)) .   (6.5.1)

Taking z = e^{−iπ/2}, the argument (phase) of p(z)/zⁿ is roughly −3π/4, and
therefore p(z)/zⁿ ∉ C, which implies that p(z) is not 1-acquiescent. On the other
hand, picking u(z) = (z − αω)(z − α)(z − αω^{−1}) as the helper polynomial, from
equation (6.5.1) we have that p(z)u(z)/z^{n+3} = 1 − αⁿ/zⁿ takes values inverse
exponentially close to 1 on the circle of radius (α + 1)/2. Therefore p(z)u(z) is
(α + 1)/2-acquiescent.
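The identity p(z)u(z)/z^{n+3} = 1 − αⁿ/zⁿ can be verified numerically: multiplying back the three removed roots recovers ∏_{j∈[n]}(z − αω^j) = zⁿ − αⁿ. A sketch with n = 12 and α = 0.5 (arbitrary choices for illustration):

```python
import cmath

n, alpha = 12, 0.5
omega = cmath.exp(2j * cmath.pi / n)
J = {1, n - 1, n}

def p(z):
    # z^3 times the product over roots alpha * omega^j with j outside J
    out = z ** 3
    for j in range(1, n + 1):
        if j not in J:
            out *= (z - alpha * omega ** j)
    return out

def u(z):
    # helper polynomial restoring the three removed roots
    return (z - alpha * omega) * (z - alpha) * (z - alpha * omega ** (n - 1))

def residual(num=200):
    """max over the unit circle of |p(z)u(z)/z^{n+3} - (1 - alpha^n/z^n)|."""
    worst = 0.0
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        worst = max(worst, abs(p(z) * u(z) / z ** (n + 3)
                               - (1 - alpha ** n / z ** n)))
    return worst
```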
Example: characteristic function with separated roots
A characteristic polynomial with well-separated roots is acquiescent by extension.
Our bound depends on the following quantity of p, which characterizes the
separation of the roots.

Definition 6.5.3. For a polynomial h(z) of degree n with roots λ₁, . . . , λₙ inside the
unit circle, define the quantity Γ(h) as

Γ(h) := Σ_{j∈[n]} | λⱼⁿ / ∏_{i≠j}(λᵢ − λⱼ) | .

Lemma 6.5.4. Suppose p(z) is a polynomial of degree n with distinct roots inside the
circle of radius α. Let Γ = Γ(p). Then p(z) is α-acquiescent by extension of degree
d = O(max{(1 − α)^{−1} log(√n Γ · ‖p‖_{H₂}), 0}).
Our main idea is to extend p(z) by multiplying it by a polynomial u that
approximates p^{−1} (in a relatively weak sense), so that pu always takes values in the
set C.

Lemma 6.5.5 (Approximation of the inverse of a polynomial). Suppose p(z) is a
polynomial of degree n and leading coefficient 1 with distinct roots inside the circle of
radius α, and let Γ = Γ(p). Then for d = O(max{(1/(1 − α)) log(Γ/((1 − α)ζ)), 0}),
there exists a polynomial h(z) of degree d and leading coefficient 1 such that for all z
on the unit circle,

| z^{n+d}/p(z) − h(z) | ≤ ζ .
We believe the following lemma should be known, but we provide a proof for
completeness. Towards proving Lemma 6.5.5, we use the following lemma to express
the inverse of a polynomial as a sum of inverses of degree-1 polynomials.

Lemma 6.5.6. Let p(z) = (z − λ₁) · · · (z − λₙ), where the λⱼ are distinct. Then

1/p(z) = Σ_{j=1}^{n} tⱼ/(z − λⱼ) ,  where tⱼ = ( ∏_{i≠j}(λⱼ − λᵢ) )^{−1} .   (6.5.2)
Proof of Lemma 6.5.6. By interpolating the constant function 1 at the points
λ₁, . . . , λₙ with the Lagrange interpolation formula, we have

1 = Σ_{j=1}^{n} ∏_{i≠j}(x − λᵢ) / ∏_{i≠j}(λⱼ − λᵢ) .   (6.5.3)

Dividing both sides by p(z), we obtain equation (6.5.2).
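Equation (6.5.2) is easy to verify numerically for concrete distinct roots (the roots and evaluation point below are arbitrary):

```python
def partial_fraction_weights(roots):
    """t_j = 1 / prod_{i != j} (lambda_j - lambda_i), as in Lemma 6.5.6."""
    ts = []
    for j, lj in enumerate(roots):
        prod = 1.0
        for i, li in enumerate(roots):
            if i != j:
                prod *= (lj - li)
        ts.append(1.0 / prod)
    return ts

def expansion_gap(roots, z):
    """|1/p(z) - sum_j t_j/(z - lambda_j)| at a test point z."""
    direct = 1.0
    for l in roots:
        direct /= (z - l)
    expanded = sum(t / (z - l)
                   for t, l in zip(partial_fraction_weights(roots), roots))
    return abs(direct - expanded)
```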
The following lemma computes the Fourier coefficients of the function 1/(z − λ).

Lemma 6.5.7. Let m ∈ Z, let K be the unit circle in the complex plane, and let
λ ∈ C lie inside K. Then

∫_K z^m/(z − λ) dz = 2πiλ^m  if m ≥ 0,  and 0 otherwise.
Proof of Lemma 6.5.7. For m ≥ 0, since z^m is a holomorphic function, by Cauchy's
integral formula we have

∫_K z^m/(z − λ) dz = 2πiλ^m .

For m < 0, the change of variable y = z^{−1} gives

∫_K z^m/(z − λ) dz = ∫_K y^{−m−1}/(1 − λy) dy .

Since |λy| = |λ| < 1, by Taylor expansion we have

∫_K y^{−m−1}/(1 − λy) dy = ∫_K y^{−m−1} ( Σ_{k=0}^{∞} (λy)^k ) dy .

Since the series Σ(λy)^k is dominated by Σ|λ|^k, which converges, we can exchange
the integral with the sum. Each integrand y^{−m−1+k} is holomorphic for m < 0 and
k ≥ 0, and therefore we conclude that

∫_K y^{−m−1}/(1 − λy) dy = 0 .
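Lemma 6.5.7 can be checked with a trapezoidal rule on the unit circle, which converges spectrally fast for this integrand: writing z = e^{iθ}, we have (1/2πi)∮ f(z) dz ≈ (1/N) Σ_k f(z_k) z_k.

```python
import cmath

def cauchy_integral(m, lam, num=256):
    """(1/2 pi i) times the contour integral of z^m / (z - lam) over |z| = 1,
    approximated on an equispaced grid; the extra factor z is dz/(i dtheta)."""
    total = 0j
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        total += z ** m / (z - lam) * z
    return total / num
```

Dividing the lemma's value by 2πi, the expected results are λ^m for m ≥ 0 and 0 for m < 0.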
Now we are ready to prove Lemma 6.5.5.

Proof of Lemma 6.5.5. Let m = n + d. We compute the Fourier expansion of
z^m/p(z) on the unit circle; that is, we write

e^{imθ}/p(e^{iθ}) = Σ_{k=−∞}^{∞} βₖ e^{ikθ} ,

where

βₖ = (1/2π) ∫₀^{2π} e^{i(m−k)θ}/p(e^{iθ}) dθ = (1/2πi) ∫_K z^{m−k−1}/p(z) dz .

By Lemma 6.5.6, we write 1/p(z) = Σ_{j=1}^{n} tⱼ/(z − λⱼ). It then follows that

βₖ = (1/2πi) Σ_{j=1}^{n} tⱼ ∫_K z^{m−k−1}/(z − λⱼ) dz .

Using Lemma 6.5.7, we obtain

βₖ = Σ_{j=1}^{n} tⱼ λⱼ^{m−k−1}  if k ≤ m − 1,  and βₖ = 0 otherwise.   (6.5.4)

We claim that

Σ_{j=1}^{n} tⱼ λⱼ^{n−1} = 1 ,  and  Σ_{j=1}^{n} tⱼ λⱼ^{s} = 0  for 0 ≤ s < n − 1 .

Indeed, these identities follow by writing out the Lagrange interpolation of the
polynomial f(x) = x^s with s ≤ n − 1 and comparing leading coefficients. Therefore
we can further simplify βₖ to

βₖ = Σ_{j=1}^{n} tⱼ λⱼ^{m−k−1}  if k < m − n ;  βₖ = 1  if k = m − n ;  βₖ = 0  otherwise.   (6.5.5)

Let h(z) = Σ_{k≥0} βₖ z^k. Then h(z) is a polynomial of degree d = m − n with
leading coefficient 1. Moreover, writing γ = 1 − α, for our choice of d we have

| z^m/p(z) − h(z) | = | Σ_{k<0} βₖ z^k | ≤ Σ_{k<0} |βₖ| ≤ Σ_{k<0} Γ(1 − γ)^{d−k−1} = Γ(1 − γ)^d/γ < ζ ,

where we used |βₖ| ≤ Σⱼ |tⱼ||λⱼ|ⁿ(1 − γ)^{d−k−1} ≤ Γ(1 − γ)^{d−k−1} for k < 0.
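The construction in the proof is fully explicit, so it can be checked numerically: using the βₖ from (6.5.5), the degree-d polynomial h approximates z^{n+d}/p(z) on the unit circle up to the geometric tail. The roots, degrees, and grid size below are arbitrary choices for illustration.

```python
import cmath

def extension_poly(roots, d):
    """Coefficients beta_0..beta_d of h(z) per equation (6.5.5), m = n + d."""
    n, m = len(roots), len(roots) + d
    ts = []
    for j, lj in enumerate(roots):
        prod = 1.0
        for i, li in enumerate(roots):
            if i != j:
                prod *= (lj - li)
        ts.append(1.0 / prod)
    betas = [sum(t * l ** (m - k - 1) for t, l in zip(ts, roots))
             for k in range(d)]
    betas.append(1.0)   # beta_{m-n} = beta_d = 1: h is monic of degree d
    return betas

def max_error(roots, d, num=400):
    """max over the unit circle of |z^m / p(z) - h(z)|."""
    betas = extension_poly(roots, d)
    m = len(roots) + d
    worst = 0.0
    for k in range(num):
        z = cmath.exp(2j * cmath.pi * k / num)
        p = 1.0
        for l in roots:
            p *= (z - l)
        h = sum(b * z ** j for j, b in enumerate(betas))
        worst = max(worst, abs(z ** m / p - h))
    return worst
```

Increasing d shrinks the error geometrically, in line with the Γ(1 − γ)^d/γ bound.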
Proof of Lemma 6.5.4. Let γ = 1 − α. Using Lemma 6.5.5 with ζ = 0.5‖p‖⁻¹_{H∞},
there exists a polynomial u of degree d = O(max{(1/(1 − α)) log(Γ‖p‖_{H∞}), 0}) such
that

| z^{n+d}/p(z) − u(z) | ≤ ζ .

Then we have

| p(z)u(z)/z^{n+d} − 1 | ≤ ζ|p(z)| ≤ 0.5 .

Therefore p(z)u(z)/z^{n+d} ∈ C_{τ₀,τ₁,τ₂} for constants τ₀, τ₁, τ₂. Finally, noting that
for a degree-n polynomial we have ‖h‖_{H∞} ≤ √n · ‖h‖_{H₂} completes the proof.
Example: Characteristic polynomial with random roots
We consider the following generative model for a characteristic polynomial of degree
2n. We generate n complex numbers λ₁, . . . , λₙ uniformly at random on the circle of
radius α < 1, and take λᵢ, λ̄ᵢ for i = 1, . . . , n as the roots of p(z); that is, p(z) =
(z − λ₁)(z − λ̄₁) · · · (z − λₙ)(z − λ̄ₙ). We show that with good probability (over the
randomness of the λᵢ), the polynomial p(z) satisfies the conditions of the previous
example, so that it can be learned efficiently by our improper learning algorithm.

Theorem 6.5.8. Suppose p(z), with random roots inside the circle of radius α, is
generated by the process described above. Then with high probability over the choice
of p, we have Γ(p) ≤ exp(Õ(√n)) and ‖p‖_{H₂} ≤ exp(Õ(√n)). As a corollary, p(z) is
α-acquiescent by extension of degree O((1 − α)^{−1}n).
Towards proving Theorem 6.5.8, we need the following lemma about the expected
distance (in log-space) between two random points of radii ρ and r.

Lemma 6.5.9. Let x ∈ C be a fixed point with |x| = ρ, and let λ be drawn uniformly
at random from the circle of radius r. Then E[ln |x − λ|] = ln max{ρ, r}.

Proof. Suppose r ≠ ρ. Let N be an integer and ω = e^{2πi/N}. Then we have

E[ln |x − λ|] = lim_{N→∞} (1/N) Σ_{k=1}^{N} ln |x − rω^k| .   (6.5.6)

The right-hand side of equation (6.5.6) can be computed easily by observing that

(1/N) Σ_{k=1}^{N} ln |x − rω^k| = (1/N) ln | ∏_{k=1}^{N} (x − rω^k) | = (1/N) ln |x^N − r^N| .

Therefore, when ρ > r, we have lim_{N→∞} (1/N) ln |x^N − r^N| = ln ρ +
lim_{N→∞} (1/N) ln |(x/ρ)^N − (r/ρ)^N| = ln ρ. On the other hand, when ρ < r, we
similarly have lim_{N→∞} (1/N) Σ_{k=1}^{N} ln |x − rω^k| = ln r. Therefore E[ln |x − λ|] =
ln max{ρ, r}. For ρ = r, a similar proof (with more careful treatment of regularity
conditions) shows that E[ln |x − λ|] = ln r.
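Lemma 6.5.9 is easy to check numerically via the grid average in (6.5.6) (the radii and grid size below are arbitrary):

```python
import math, cmath

def avg_log_dist(x, r, num=20000):
    """(1/N) * sum_k ln|x - r w^k|: an equispaced-grid version of
    E[ln|x - lambda|] for lambda uniform on the circle of radius r."""
    total = 0.0
    for k in range(num):
        lam = r * cmath.exp(2j * cmath.pi * k / num)
        total += math.log(abs(x - lam))
    return total / num
```

With x = 0.8 and r = 0.5 the average is ln 0.8 (the larger radius wins), while with x = 0.3 and r = 0.5 it is ln 0.5.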
Now we are ready to prove Theorem 6.5.8.

Proof of Theorem 6.5.8. Fix an index i and the choice of λᵢ, and consider the random
variable

Yᵢ = ln( |λᵢ|^{2n} / ( ∏_{j≠i} |λᵢ − λⱼ| · ∏_{j∈[n]} |λᵢ − λ̄ⱼ| ) ) = 2n ln |λᵢ| − Σ_{j≠i} ln |λᵢ − λⱼ| − Σ_{j∈[n]} ln |λᵢ − λ̄ⱼ| .

By Lemma 6.5.9, we have E[Yᵢ] = 2n ln |λᵢ| − Σ_{j≠i} E[ln |λᵢ − λⱼ|] − Σ_{j} E[ln |λᵢ − λ̄ⱼ|] =
ln α. Let Zⱼ = ln |λᵢ − λⱼ| − E[ln |λᵢ − λⱼ|]. Then the Zⱼ are mean-zero random
variables with ψ₁-Orlicz norm bounded by a constant, since E[e^{ln |λᵢ−λⱼ|} − 1] ≤ 1.
Therefore, by the Bernstein inequality for random variables with sub-exponential tails
(for example, [105, Theorem 6.21]), with high probability (1 − n^{−10}) it holds that
|Σ_{j≠i} Zⱼ| ≤ Õ(√n), where Õ hides logarithmic factors. Therefore, with high
probability, |Yᵢ| ≤ Õ(√n).

Finally, we take a union bound over all i ∈ [n] and obtain that with high probability
|Yᵢ| ≤ Õ(√n) for all i ∈ [n], which implies Σᵢ exp(Yᵢ) ≤ exp(Õ(√n)) and hence
Γ(p) ≤ exp(Õ(√n)). With a similar technique, we can prove that ‖p‖_{H₂} ≤ exp(Õ(√n)).
Example: Passive systems
We show that with improper learning we can learn almost all passive systems, an
important class of stable linear dynamical systems discussed earlier. We start with
the definition of a strict-input passive system.

Definition 6.5.10 (Passive system, cf. [103]). A SISO linear system is strict-input
passive if and only if for some τ₀ > 0 and any z on the unit circle, ℜ(G(z)) ≥ τ₀.

In order to learn passive systems, we need to strengthen the definition of strict
passivity slightly. To make this precise, we define the following subset of the complex
plane: for positive constants τ₀, τ₁, τ₂, define

C⁺_{τ₀,τ₁,τ₂} = { z ∈ C : |z| ≤ τ₂, ℜ(z) ≥ τ₁, ℜ(z) ≥ τ₀|ℑ(z)| } .   (6.5.7)

We say a transfer function G(z) = s(z)/p(z) is (τ₀, τ₁, τ₂)-strict-input passive if
for any z on the unit circle we have G(z) ∈ C⁺_{τ₀,τ₁,τ₂}. Note that for small constants
τ₀, τ₁ and a large constant τ₂, this essentially says the system is strict-input passive.

Now we are ready to state the main theorem of this subsection. We prove that
passive systems can be learned improperly with a constant factor more states
(dimensions), assuming s(z) has all its roots strictly inside the unit circle and
Γ(s) ≤ exp(O(n)).
Theorem 6.5.11. Suppose G(z) = s(z)/p(z) is (τ₀, τ₁, τ₂)-strict-input passive.
Moreover, suppose the roots of s(z) lie inside the circle of radius α, Γ = Γ(s) ≤
exp(O(n)), and ‖p‖_{H₂} ≤ exp(O(n)). Then p(z) is α-acquiescent by extension of
degree d = O_{τ,α}(n), and as a consequence we can learn G(z) with n + d states in
polynomial time.

Moreover, if in addition we assume that G(z) ∈ C_{τ₀,τ₁,τ₂} for every z on the unit
circle, then p(z) is α-acquiescent by extension of degree d = O_{τ,α}(n).

The proof of Theorem 6.5.11 is similar in spirit to that of Lemma 6.5.4. It
follows directly from a combination of Lemma 6.5.12 and Lemma 6.5.13 below.
Lemma 6.5.12 shows that the denominator of a transfer function (under the stated
assumptions) can be extended to a polynomial that takes values in C⁺ on the unit
circle, and Lemma 6.5.13 shows that it can be further extended to another polynomial
that takes values in C.
Lemma 6.5.12. Suppose the roots of s are inside the circle of radius α < 1, and let
Γ = Γ(s). If the transfer function G(z) = s(z)/p(z) satisfies G(z) ∈ C_{τ₀,τ₁,τ₂}
(or G(z) ∈ C⁺_{τ₀,τ₁,τ₂}) for every z on the unit circle, then there exists u(z) of degree
d = O_τ(max{(1/(1 − α)) log(√nΓ · ‖p‖_{H₂}/(1 − α)), 0}) such that p(z)u(z)/z^{n+d} ∈
C_{τ′₀,τ′₁,τ′₂} (respectively, p(z)u(z)/z^{n+d} ∈ C⁺_{τ′₀,τ′₁,τ′₂}) for τ′ = Θ_τ(1), where
O_τ(·), Θ_τ(·) hide polynomial dependencies on τ₀, τ₁, τ₂.

Proof of Lemma 6.5.12. By the fact that G(z) = s(z)/p(z) ∈ C_{τ₀,τ₁,τ₂}, we have
p(z)/s(z) ∈ C_{τ′₀,τ′₁,τ′₂} for some τ′ that depends polynomially on τ. Using
Lemma 6.5.5, there exists u(z) of degree d such that

| z^{n+d}/s(z) − u(z) | ≤ ζ ,

where we set ζ ≪ min{τ′₀, τ′₁}/τ′₂ · ‖p‖⁻¹_{H∞}. Then we have

| p(z)u(z)/z^{n+d} − p(z)/s(z) | ≤ |p(z)|ζ ≪ min{τ′₀, τ′₁} .   (6.5.8)

It follows from equation (6.5.8) that p(z)u(z)/z^{n+d} ∈ C_{τ″₀,τ″₁,τ″₂}, where τ″ depends
polynomially on τ. The same proof works when C is replaced by C⁺.
Lemma 6.5.13. Suppose p(z), of degree n and leading coefficient 1, satisfies
p(z)/zⁿ ∈ C⁺_{τ₀,τ₁,τ₂} for every z on the unit circle. Then there exists u(z) of degree
d = O_τ(n) such that p(z)u(z)/z^{n+d} ∈ C_{τ′₀,τ′₁,τ′₂} for every z on the unit circle, with
τ′₀, τ′₁, τ′₂ = Θ_τ(1), where O_τ(·), Θ_τ(·) hide dependencies on τ₀, τ₁, τ₂.

Proof of Lemma 6.5.13. We first fix z on the unit circle. Let √(p(z)/zⁿ) denote the
square root of p(z)/zⁿ with principal value. Write p(z)/zⁿ = τ₂(1 + (p(z)/(τ₂zⁿ) − 1))
and take the Taylor expansion

1/√(p(z)/zⁿ) = τ₂^{−1/2} (1 + (p(z)/(τ₂zⁿ) − 1))^{−1/2} = τ₂^{−1/2} Σ_{k=0}^{∞} cₖ (p(z)/(τ₂zⁿ) − 1)^k ,

where the cₖ are the Taylor coefficients of (1 + x)^{−1/2}. Note that since τ₁ < |p(z)| <
τ₂, we have |p(z)/(τ₂zⁿ) − 1| < 1 − τ₁/τ₂. Therefore, truncating the Taylor series at
k = O_τ(1), we obtain a rational function h(z) of the form

h(z) = τ₂^{−1/2} Σ_{j=0}^{k} cⱼ (p(z)/(τ₂zⁿ) − 1)^j ,

which approximates 1/√(p(z)/zⁿ) with precision ζ ≪ min{τ₀, τ₁/τ₂}, that is,
| 1/√(p(z)/zⁿ) − h(z) | ≤ ζ. Therefore we obtain

| p(z)h(z)/zⁿ − √(p(z)/zⁿ) | ≤ ζ|p(z)/zⁿ| ≤ ζτ₂ .

Note that since p(z)/zⁿ ∈ C⁺_{τ₀,τ₁,τ₂}, we have √(p(z)/zⁿ) ∈ C_{τ′₀,τ′₁,τ′₂} for some
constants τ′₀, τ′₁, τ′₂, and therefore p(z)h(z)/zⁿ ∈ C_{τ′₀,τ′₁,τ′₂}. Note that h(z) is not
yet a polynomial. Let u(z) = z^{nk}h(z); then u(z) is a polynomial of degree at most
nk, and p(z)u(z)/z^{n(k+1)} ∈ C_{τ′₀,τ′₁,τ′₂} for every z on the unit circle.
6.5.3 Improper Learning Using Linear Regression
In this subsection, we show that under a stronger assumption than α-acquiescence
by extension, we can improperly learn a linear dynamical system with linear
regression, up to some fixed bias.

The basic idea is to fit a linear function mapping [x_{k−ℓ}, . . . , xₖ] to yₖ. This
is equivalent to a dynamical system with ℓ hidden states whose companion matrix A
in (6.1.4) is chosen with a_ℓ = 1 and a_{ℓ−1} = · · · = a₁ = 0. In this case, the hidden
states exactly memorize the previous ℓ inputs, and the output is a linear combination
of the hidden states.

Equivalently, in the frequency domain, this corresponds to fitting the transfer
function G(z) = s(z)/p(z) with a rational function of the form
(c₁z^{ℓ−1} + · · · + c_ℓ)/z^{ℓ−1} = c₁ + c₂z^{−1} + · · · + c_ℓ z^{−(ℓ−1)}. The following
is a sufficient condition on the characteristic polynomial p(z) that guarantees the
existence of such a fit.
Definition 6.5.14. A polynomial p(z) of degree n is extremely-acquiescent by
extension of degree d with bias ε if there exists a polynomial u(z) of degree d and
leading coefficient 1 such that for all z on the unit circle,

| p(z)u(z)/z^{n+d} − 1 | ≤ ε .   (6.5.9)

We remark that if p(z) is 1-acquiescent by extension of degree d, then there exists
u(z) such that p(z)u(z)/z^{n+d} ∈ C. Therefore, equation (6.5.9) above is a much
stronger requirement than acquiescence by extension.⁶
When p(z) is extremely-acquiescent, the transfer function G(z) = s(z)/p(z) can
be approximated by s(z)u(z)/z^{n+d} up to bias ε. Let ℓ = n + d + 1 and s(z)u(z) =
c₁z^{ℓ−1} + · · · + c_ℓ. Then G(z) can be approximated with bias ε by the following
dynamical system with ℓ hidden states: we choose A = C(a) with a_ℓ = 1 and
a_{ℓ−1} = · · · = a₁ = 0, and C = [c₁, . . . , c_ℓ]. As argued previously, such a dynamical
system simply memorizes the previous ℓ inputs, and therefore it is equivalent to
linear regression from the features [x_{k−ℓ}, . . . , xₖ] to the output yₖ.

⁶We need (1 − δ)-acquiescence by extension in the previous subsections for small δ > 0, though this is merely an additional technicality needed for the sample complexity. We ignore the difference between (1 − δ)-acquiescence and 1-acquiescence for the purposes of this subsection.
Proposition 6.5.15 (Informal). Suppose the true system G(z) = s(z)/p(z) satisfies
that p(z) is extremely-acquiescent by extension of degree d. Then using linear
regression we can learn the mapping from [x_{k−ℓ}, . . . , xₖ] to yₖ with bias ε and
polynomial sample complexity.

We remark that with linear regression the bias ε goes to zero only as we increase
the length ℓ of the feature window, not as we increase the number of samples.
Moreover, linear regression requires a stronger assumption than the improper learning
results in the previous subsections do. The latter can be viewed as an interpolation
between the proper case and the regime where linear regression works.
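The regression reading can be sketched directly: when the true system is itself a length-ℓ finite impulse response (a special case in which the bias ε is zero), ordinary least squares on a sliding window of inputs recovers the output map exactly. The normal-equation solver below is a minimal stand-in for any least-squares routine, and the window length and coefficients are arbitrary choices.

```python
import random

def solve(Ab):
    """Gauss-Jordan elimination with partial pivoting on an augmented matrix."""
    m = len(Ab)
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(Ab[r][col]))
        Ab[col], Ab[piv] = Ab[piv], Ab[col]
        for r in range(m):
            if r != col and Ab[r][col] != 0:
                f = Ab[r][col] / Ab[col][col]
                Ab[r] = [a - f * b for a, b in zip(Ab[r], Ab[col])]
    return [Ab[r][m] / Ab[r][r] for r in range(m)]

def fit_fir(xs, ys, ell):
    """Least squares from the window [x_{k-ell+1}, ..., x_k] to y_k,
    via the normal equations (R^T R) c = R^T y."""
    rows = [xs[k - ell + 1:k + 1][::-1] for k in range(ell - 1, len(xs))]
    targets = ys[ell - 1:]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(ell)]
         for i in range(ell)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(ell)]
    return solve([A[i] + [b[i]] for i in range(ell)])

random.seed(1)
c_true = [0.7, -0.3, 0.2, 0.1]             # y_k = sum_j c_j x_{k-j}
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [sum(c * xs[k - j] for j, c in enumerate(c_true)) if k >= 3 else 0.0
      for k in range(500)]
c_hat = fit_fir(xs[4:], ys[4:], 4)         # skip the warm-up outputs
```

As the surrounding text notes, for a general (non-FIR) system the fitted window only approximates the true map, with a bias that shrinks as ℓ grows.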
6.6 Learning Multi-input Multi-output (MIMO)
Systems
We consider multi-input multi-output systems whose transfer functions have a
common denominator p(z),

G(z) = (1/p(z)) · S(z) ,   (6.6.1)

where S(z) is an ℓ_out × ℓ_in matrix, each entry of which is a polynomial with real
coefficients of degree at most n, and p(z) = zⁿ + a₁z^{n−1} + · · · + aₙ. Note that here
we use ℓ_in to denote the dimension of the inputs of the system and ℓ_out the
dimension of the outputs.
Although a special case of general MIMO systems, this class still contains many
interesting cases, such as the transfer functions studied in [65, 64], where G(z) is
assumed to take the form G(z) = R₀ + Σ_{i=1}^{n} Rᵢ/(z − λᵢ), for λ₁, . . . , λₙ ∈ C with
conjugate symmetry and Rᵢ ∈ C^{ℓ_out×ℓ_in} satisfying Rᵢ = R̄ⱼ whenever λᵢ = λ̄ⱼ.

In order to learn the system G(z), we parametrize p(z) by its coefficients a₁, . . . , aₙ
and S(z) by the coefficients of its entries. Note that each entry of S(z) depends on
n + 1 real coefficients, so the collection of coefficients forms a third-order tensor of
dimension ℓ_out × ℓ_in × (n + 1). It will be convenient to collect the leading
coefficients of the entries of S(z) into a matrix of dimension ℓ_out × ℓ_in, named D,
and the rest of the coefficients into a matrix of dimension ℓ_out × ℓ_in n, denoted by C.
This will be particularly intuitive when a state-space representation is used to learn
the system from samples, as discussed later. We parametrize the learned transfer
function Ĝ(z) by â, Ĉ, and D̂ in the same way.
Let us define the risk function in the frequency domain as

g(Â, Ĉ, D̂) = ∫₀^{2π} ‖ G(e^{iθ}) − Ĝ(e^{iθ}) ‖²_F dθ .   (6.6.2)

The following lemma is an analog of Lemma 6.2.3 for the MIMO case. Its proof
follows from a straightforward extension of the proof of Lemma 6.2.3, by observing
that the matrices S(z) (or Ŝ(z)) commute with the scalars p(z) and p̂(z), and that
Ŝ(z), p̂(z) are linear in â, Ĉ.

Lemma 6.6.1. The risk function g(â, Ĉ) defined in (6.6.2) is τ-weakly-quasi-convex
in the domain

N_τ(a) = { â ∈ Rⁿ : ℜ( p_a(z)/p_â(z) ) ≥ τ/2, ∀ z ∈ C s.t. |z| = 1 } ⊗ R^{ℓ_in×ℓ_out×n′} .
Finally, as alluded to before, we use a particular state-space representation for
learning the system in the time domain from example sequences. It is known that any
transfer function of the form (6.6.1) can be realized uniquely by the state-space
system of the following special case of Brunovsky normal form [37]:

A =
[      0           Id_{ℓ_in}        0         · · ·        0       ]
[      0               0        Id_{ℓ_in}     · · ·        0       ]
[      ⋮               ⋮             ⋮          ⋱           ⋮       ]
[      0               0             0        · · ·   Id_{ℓ_in}    ]
[ −aₙId_{ℓ_in}  −a_{n−1}Id_{ℓ_in}  −a_{n−2}Id_{ℓ_in}  · · ·  −a₁Id_{ℓ_in} ] ,

B = [0; . . . ; 0; Id_{ℓ_in}] (a block column),   (6.6.3)

and

C ∈ R^{ℓ_out×nℓ_in} , D ∈ R^{ℓ_out×ℓ_in} .
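Since A and B in (6.6.3) are block versions of the companion matrix C(a) and a basis vector, they can be built as Kronecker products with Id_{ℓ_in}; a sketch (the list-of-lists representation is an arbitrary choice):

```python
def kron_with_identity(M, p):
    """Kronecker product M (x) Id_p as nested lists."""
    rows, cols = len(M), len(M[0])
    out = [[0.0] * (cols * p) for _ in range(rows * p)]
    for i in range(rows):
        for j in range(cols):
            for r in range(p):
                out[i * p + r][j * p + r] = M[i][j]
    return out

def brunovsky_A_B(a, l_in):
    """State matrices of (6.6.3): A = C(a) (x) Id, B = e_n (x) Id,
    with a = [a_1, ..., a_n]."""
    n = len(a)
    comp = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        comp[i][i + 1] = 1.0            # super-diagonal identity blocks
    for j in range(n):
        comp[n - 1][j] = -a[n - 1 - j]  # last block row: -a_n, ..., -a_1
    A = kron_with_identity(comp, l_in)
    B = kron_with_identity([[0.0]] * (n - 1) + [[1.0]], l_in)
    return A, B
```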
The following theorem is a straightforward extension of Corollary 6.4.2 and
Theorem 6.5.2 to the MIMO case.

Theorem 6.6.2. Suppose the transfer function G(z) of a MIMO system takes the
form (6.6.1) and has norm ‖G‖_{H₂} ≤ 1. If the common denominator p(z) is
α-acquiescent by extension of degree d, then projected stochastic gradient descent over
the state-space representation (6.6.3) returns Θ̂ with risk

f(Θ̂) ≤ poly(n + d, σ, τ, (1 − α)^{−1}) / (TN) .

We note that since A and B are simply the tensor products of C(a) and eₙ with
Id_{ℓ_in}, the no-blow-up property (Lemma 6.3.4) for AᵏB still holds. Therefore, to
prove Theorem 6.6.2 we essentially only need to repeat the proof of Lemma 6.4.5 with
matrix notation and matrix norms. We defer the proof to the full version.
6.7 Technicalities: Mean and Variance of the Gradient Estimator
In this section, we formally prove Lemma 6.4.4 and Lemma 6.4.5, which control the
mean and variance of the gradient estimator used in Algorithm 7.
Proof of Lemma 6.4.4
Lemma 6.4.4 follows directly from the following more general lemma, which also handles
the multi-input multi-output case. It can be seen from a calculation similar to
the proof of Lemma 6.2.5. We mainly need to control the tail of the series using the
no-blow-up property (Lemma 6.3.4) and argue that the wrong value of the initial
state h_0 does not affect the partial loss function ℓ((x, y), Θ̂) (defined in
Algorithm 7). This is simply because after time T1 = T/4, the influence of the initial
state is already washed out.
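To illustrate the wash-out effect numerically, the following sketch (a toy stable SISO system with made-up coefficients; not part of the proof) checks that the contribution C A^{t−1} h_0 of the initial state decays geometrically:

```python
import numpy as np

def companion(a):
    """Companion matrix C(a) with last row (-a_n, ..., -a_1) (ordering assumed)."""
    n = len(a)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -a[::-1]
    return A

A = companion(np.array([0.3, 0.2, 0.1]))  # toy stable system: roots inside the unit disk
C = np.ones(3)
h0 = np.ones(3)
# |C A^{t-1} h0| measures how much the (wrong) initial state still influences y_t.
influence = [abs(C @ np.linalg.matrix_power(A, t) @ h0) for t in range(0, 60, 10)]
print(influence[0], influence[-1])
```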
Lemma 6.7.1. In Algorithm 9, the values of G_A, G_C, G_D are equal to the gradients
of g(Â, Ĉ) + ‖D̂ − D‖² with respect to Â, Ĉ and D̂, up to inverse exponentially small
error.
Proof of Lemma 6.7.1. We first show that the partial empirical loss function
ℓ((x, y), Θ̂) has expectation almost equal to the idealized risk (up to the term
for D and an exponentially small error),

E[ℓ((x, y), Θ̂)] = g(Â, Ĉ) + ‖D̂ − D‖² ± exp(−Ω((1 − α)T)) .
This can be seen from a calculation similar to the proof of Lemma 6.2.5. Note
that

y_t = D x_t + Σ_{k=1}^{t−1} C A^{t−k−1} B x_k + C A^{t−1} h_0 + ξ_t   and   ŷ_t = D̂ x_t + Σ_{k=1}^{t−1} Ĉ Â^{t−k−1} B̂ x_k .
(6.7.1)
Therefore, when t ≥ T1 ≥ Ω(T), we have ‖C A^{t−1} h_0‖ ≤ exp(−Ω((1 − α)T)),
and therefore the effect of h_0 is negligible. Then we have that
E[ℓ((x, y), Θ̂)] = 1/(T − T1) · E[ Σ_{t>T1}^{T} ‖ŷ_t − y_t‖² ] ± exp(−Ω((1 − α)T))

= ‖D̂ − D‖² + 1/(T − T1) Σ_{T≥t>T1} Σ_{0≤j≤t−1} ‖Ĉ Â^j B̂ − C A^j B‖² ± exp(−Ω((1 − α)T))

= ‖D̂ − D‖² + Σ_{j=0}^{T1} ‖Ĉ Â^j B̂ − C A^j B‖² + Σ_{T≥j≥T1} ((T − j)/(T − T1)) ‖Ĉ Â^j B̂ − C A^j B‖² ± exp(−Ω((1 − α)T))

= ‖D̂ − D‖² + Σ_{j=0}^{∞} ‖Ĉ Â^j B̂ − C A^j B‖² ± exp(−Ω((1 − α)T)) ,

where the first line uses the fact that ‖C A^{t−1} h_0‖ ≤ exp(−Ω((1 − α)T)), the second
uses equation (6.2.9), and the last line uses the no-blow-up property of A^k B
(Lemma 6.3.4).
Similarly, we can prove that the gradient of E[ℓ((x, y), Θ̂)] is also close to the
gradient of g(Â, Ĉ) + ‖D̂ − D‖², up to an inverse exponentially small error.
Proof of Lemma 6.4.5
Proof of Lemma 6.4.5. Both G_A and G_C can be written as quadratic forms
(with vector coefficients) in x_1, . . . , x_T and ξ_1, . . . , ξ_T. That is, we write

G_A = Σ_{s,t} x_s x_t u_{st} + Σ_{s,t} x_s ξ_t u′_{st}   and   G_C = Σ_{s,t} x_s x_t v_{st} + Σ_{s,t} x_s ξ_t v′_{st} ,

where u_{st}, u′_{st}, v_{st}, v′_{st} are vectors that will be calculated later. By Claim 6.7.2, we
have that
V[ Σ_{s,t} x_s x_t u_{st} + Σ_{s,t} x_s ξ_t u′_{st} ] ≤ O(1) Σ_{s,t} ‖u_{st}‖² + O(σ²) Σ_{s,t} ‖u′_{st}‖² .   (6.7.2)
Therefore, in order to bound V[G_A] from above, it suffices to bound Σ_{s,t} ‖u_{st}‖² and
Σ_{s,t} ‖u′_{st}‖², and similarly for G_C.
We begin by writing out u_{st} for fixed s, t ∈ [T] and bounding its norm. We use the
same set of notations as in the proof of Lemma 6.4.4. Recall that we set r_k = C A^k B
and r̂_k = Ĉ Â^k B̂, and Δr_k = r̂_k − r_k. Moreover, let z_k = A^k B. We note that the sums
of ‖z_k‖² and Δr_k² can be controlled. By the assumption of the lemma, we have that

Σ_{k=t}^{∞} ‖z_k‖² ≤ 2πn/τ₁² ,   ‖z_k‖² ≤ 2πn α^{2k−2n}/τ₁² ,   (6.7.3)

Σ_{k=t}^{∞} Δr_k² ≤ 4πn/τ₁² ,   Δr_k² ≤ 4πn α^{2k−2n}/τ₁² ,   (6.7.4)

which will be used many times in the proof that follows.
We calculate the explicit form of G_A using the explicit back-propagation Algorithm 9.
We have that in Algorithm 9,

h_k = Σ_{j=1}^{k} A^{k−j} B x_j = Σ_{j=1}^{k} z_{k−j} x_j   (6.7.5)
and

Δh_k = Σ_{j=k}^{T} (A^⊤)^{j−k} C^⊤ Δy_j = Σ_{j=k}^{T} (A^⊤)^{j−k} C^⊤ 1(j > T1) ( ξ_j + Σ_{ℓ=1}^{j} Δr_{j−ℓ} x_ℓ ) .   (6.7.6)
Then using G_A = Σ_{k≥2} B^⊤ Δh_k h_{k−1}^⊤ and equations (6.7.5) and (6.7.6) above,
we have that

u_{st} = Σ_{k=2}^{T} ( Σ_{j≥max{k,s,T1+1}} Δr_{j−s} C A^{j−k} B ) 1(k ≥ t+1) · A^{k−t−1} B
      = Σ_{k=2}^{T} ( Σ_{j≥max{k,s,T1+1}} Δr_{j−s} r_{j−k} ) 1(k ≥ t+1) · z_{k−t−1} ,   (6.7.7)
and that

u′_{st} = Σ_{k=2}^{T} z_{k−1−s} · 1(k ≥ s+1) · r_{t−k} · 1(t > max{T1, k})
       = Σ_{s+1≤k≤t} z_{k−1−s} · r_{t−k} · 1(t > T1) .   (6.7.8)
Towards bounding ‖u_{st}‖, we consider four different cases. Let

Λ = Ω( max{ n, (1 − α)^{−1} log(1/(1 − α)) } )

be a threshold.
Case 1: When 0 ≤ s − t ≤ Λ, we rewrite u_{st} by rearranging equation (6.7.7),

u_{st} = Σ_{T≥k≥s} z_{k−t−1} Σ_{j≥max{k,T1+1}} Δr_{j−s} r_{j−k} + Σ_{t<k<s} z_{k−t−1} Σ_{j≥max{s,T1+1}} Δr_{j−s} r_{j−k}

= Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ Σ_{s≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} + Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ Σ_{s>k>t} r_{ℓ+s−k} z_{k−t−1} ,
where at the second line we did the change of variables ℓ = j − s. Then by the
Cauchy–Schwarz inequality, we have

‖u_{st}‖² ≤ 2 ( Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ² ) · T₁ + 2 ( Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ² ) · T₂ ,   (6.7.9)

where

T₁ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖²   and   T₂ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s>k>t} r_{ℓ+s−k} z_{k−t−1} ‖² .
We can bound the contribution from Δr_ℓ² using equation (6.7.4), and it remains
to bound the terms T₁ and T₂. Using the tail bounds for ‖z_k‖ (equation (6.7.3)) and the
fact that |r_k| = |C A^k B| ≤ ‖A^k B‖ = ‖z_k‖, we have that
T₁ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖² ≤ Σ_{ℓ≥0} ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ )² .   (6.7.10)
We bound the inner sum on the RHS of (6.7.10) using the fact that ‖z_k‖² ≤
O(n α^{2k−2n}/τ₁²) and obtain that

Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ ≤ Σ_{s≤k≤ℓ+s} O(n α^{(ℓ+s−t−1)−2n}/τ₁²) ≤ O(ℓ n α^{(ℓ+s−t−1)−2n}/τ₁²) .   (6.7.11)
Note that equation (6.7.11) is particularly effective when ℓ > Λ. When ℓ ≤ Λ, we can
refine the bound using equation (6.7.3) and obtain that

Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ ≤ ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}|² )^{1/2} ( Σ_{s≤k≤ℓ+s} ‖z_{k−t−1}‖² )^{1/2}
 ≤ O(√n/τ₁) · O(√n/τ₁) = O(n/τ₁²) .   (6.7.12)
Plugging equations (6.7.12) and (6.7.11) into equation (6.7.10), we have that

Σ_{ℓ≥0} ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ )² ≤ Σ_{Λ≥ℓ≥0} O(n²/τ₁⁴) + Σ_{ℓ>Λ} O(ℓ² n² α^{2(ℓ+s−t−1)−4n}/τ₁⁴)
 ≤ O(n²Λ/τ₁⁴) + O(n²/τ₁⁴) = O(n²Λ/τ₁⁴) .   (6.7.13)
For the second term in equation (6.7.9), we bound similarly,

T₂ = Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{s>k>t} r_{ℓ+s−k} z_{k−t−1} ‖² ≤ O(n²Λ/τ₁⁴) .   (6.7.14)
Therefore, using the bounds for T₁ and T₂ we obtain that

‖u_{st}‖² ≤ O(n³Λ/τ₁⁶) .   (6.7.15)
Case 2: When s − t > Λ, we tighten equation (6.7.13) by observing that

T₁ ≤ Σ_{ℓ≥0} ( Σ_{s≤k≤ℓ+s} |r_{ℓ+s−k}| ‖z_{k−t−1}‖ )² ≤ α^{2(s−t−1)−4n} Σ_{ℓ≥0} O(ℓ² n² α^{2ℓ}/τ₁⁴)
 ≤ α^{s−t−1} · O(n²/(τ₁⁴ (1 − α)³)) ,   (6.7.16)

where we used equation (6.7.11). Similarly we can prove that

T₂ ≤ α^{s−t−1} · O(n²/(τ₁⁴ (1 − α)³)) .
Therefore, we have that when s − t ≥ Λ,

‖u_{st}‖² ≤ O(n³/((1 − α)³ τ₁⁶)) · α^{s−t−1} .   (6.7.17)
Case 3: When −Λ ≤ s − t ≤ 0, we can rewrite u_{st} and use the Cauchy–Schwarz
inequality to obtain

u_{st} = Σ_{T≥k≥t+1} z_{k−t−1} Σ_{j≥max{k,T1+1}} Δr_{j−s} r_{j−k} = Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ Σ_{t+1≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ,

and

‖u_{st}‖² ≤ ( Σ_{ℓ≥0, ℓ≥T1+1−s} Δr_ℓ² ) Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{t+1≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖² .
Using almost the same arguments as in equations (6.7.11) and (6.7.12), we have that

Σ_{t+1≤k≤ℓ+s} |r_{ℓ+s−k}| · ‖z_{k−t−1}‖ ≤ O(ℓ n α^{(ℓ+s−t−1)−2n}/τ₁²)

and

Σ_{t+1≤k≤ℓ+s} |r_{ℓ+s−k}| · ‖z_{k−t−1}‖ ≤ O(√n/τ₁) · O(√n/τ₁) = O(n/τ₁²) .
Then using the same type of argument as in equation (6.7.13), we have that

Σ_{ℓ≥0, ℓ≥T1+1−s} ‖ Σ_{t+1≤k≤ℓ+s, k≤T} r_{ℓ+s−k} z_{k−t−1} ‖² ≤ O(n²Λ/τ₁⁴) + O(n²/τ₁⁴) = O(n²Λ/τ₁⁴) .

It follows that in this case ‖u_{st}‖ satisfies the same bound as in (6.7.15).
Case 4: When s − t ≤ −Λ, we use a different simplification of u_{st} from the above. First
of all, it follows from (6.7.7) that

‖u_{st}‖ ≤ Σ_{k=2}^{T} Σ_{j≥max{k,s,T1+1}} ‖Δr_{j−s} r_{j−k} z_{k−t−1}‖ 1(k ≥ t+1)   (6.7.18)
 ≤ Σ_{k≥t+1} ‖z_{k−t−1}‖ Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| .
Since j − s ≥ k − s > 4n, it follows that

Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| ≤ Σ_{j≥max{k,T1+1}} O(√n/τ₁ · α^{j−s−n}) · O(√n/τ₁ · α^{j−k−n})
 ≤ O(n/(τ₁² (1 − α)) · α^{k−s−n}) .
Then we have that

‖u_{st}‖² ≤ ( Σ_{k≥t+1} ‖z_{k−t−1}‖ Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| )²
 ≤ ( Σ_{k≥t+1} ‖z_{k−t−1}‖² ) Σ_{k≥t+1} ( Σ_{j≥max{k,T1+1}} |Δr_{j−s} r_{j−k}| )²
 ≤ O(n/τ₁²) · O(n²/(τ₁⁴ (1 − α)³) α^{t−s}) = O(n³/(τ₁⁶ (1 − α)³) α^{t−s}) .
Therefore, using the bound for ‖u_{st}‖² obtained in the four cases above and taking
the sum over s, t, we obtain that

Σ_{1≤s,t≤T} ‖u_{st}‖² ≤ Σ_{s,t∈[T]: |s−t|≤Λ} O(n³Λ/τ₁⁶) + Σ_{s,t: |s−t|≥Λ} O(n³/(τ₁⁶ (1 − α)³)) α^{|t−s|−1}
 ≤ O(T n³Λ²/τ₁⁶) + O(n³/τ₁⁶) = O(T n³Λ²/τ₁⁶) .   (6.7.19)
We have finished the bounds for ‖u_{st}‖ and now we turn to bounding ‖u′_{st}‖². Using the
formula for u′_{st} (equation (6.7.8)), we have that for t ≤ s + 1, u′_{st} = 0. For s + Λ ≥ t ≥
s + 2, we have by the Cauchy–Schwarz inequality,

‖u′_{st}‖ ≤ ( Σ_{s+1≤k≤t} ‖z_{k−1−s}‖² )^{1/2} ( Σ_{s+1≤k≤t} |r_{t−k}|² )^{1/2} ≤ O(n/τ₁²) .
On the other hand, for t > s + Λ, by the bound |r_k|² ≤ ‖z_k‖² ≤ O(n α^{2k−2n}/τ₁²),
we have

‖u′_{st}‖ ≤ Σ_{s+1≤k≤t−1} ‖z_{k−1−s}‖ · |r_{t−k}| ≤ Σ_{s+1≤k≤t−1} n α^{t−s−1}/τ₁² ≤ O(n (t − s) α^{t−s−1}/τ₁²) .
Therefore, taking the sum over s, t, similarly to equation (6.7.19),

Σ_{s,t∈[T]} ‖u′_{st}‖² ≤ O(T n²Λ/τ₁⁴) .   (6.7.20)
Then using equation (6.7.2) together with equations (6.7.19) and (6.7.20), we obtain that

V[(T − T1) G_A] ≤ O( T n³Λ²/τ₁⁶ + σ² T n²Λ/τ₁⁴ ) .

Hence, it follows that

V[G_A] ≤ 1/(T − T1)² · V[(T − T1) G_A] ≤ O( n³Λ²/τ₁⁶ + σ² n²Λ/τ₁⁴ ) / T .
We can prove the bound for GC similarly.
Claim 6.7.2. Let x_1, . . . , x_T be independent random variables with mean 0, variance 1,
and 4-th moment bounded by O(1), and let u_{ij} be vectors for i, j ∈ [T]. Moreover,
let ξ_1, . . . , ξ_T be independent random variables with mean 0 and variance σ², and let u′_{ij}
be vectors for i, j ∈ [T]. Then,
V[ Σ_{i,j} x_i x_j u_{ij} + Σ_{i,j} x_i ξ_j u′_{ij} ] ≤ O(1) Σ_{i,j} ‖u_{ij}‖² + O(σ²) Σ_{i,j} ‖u′_{ij}‖² .
Proof. Note that the two sums in the target are uncorrelated, and the second has mean 0;
therefore it suffices to bound the variance of each sum individually. The proof follows from
the linearity of expectation and the independence of the x_i's:
E[ ‖ Σ_{i,j} x_i x_j u_{ij} ‖² ] = Σ_{i,j} Σ_{k,ℓ} E[ x_i x_j x_k x_ℓ u_{ij}^⊤ u_{kℓ} ]
 = Σ_i E[ u_{ii}^⊤ u_{ii} x_i⁴ ] + Σ_{i≠j} E[ u_{ii}^⊤ u_{jj} x_i² x_j² ] + Σ_{i,j} E[ x_i² x_j² ( u_{ij}^⊤ u_{ij} + u_{ij}^⊤ u_{ji} ) ]
 ≤ Σ_{i,j} u_{ii}^⊤ u_{jj} + O(1) Σ_{i,j} ‖u_{ij} + u_{ji}‖²
 = ‖ Σ_i u_{ii} ‖² + O(1) Σ_{i,j} ‖u_{ij}‖² ,
where at the second line we used the fact that for any monomial x^α with an odd degree
in one of the x_i's, E[x^α] = 0. Note that E[ Σ_{i,j} x_i x_j u_{ij} ] = Σ_i u_{ii}. Therefore,
V[ Σ_{i,j} x_i x_j u_{ij} ] = E[ ‖ Σ_{i,j} x_i x_j u_{ij} ‖² ] − ‖ E[ Σ_{i,j} x_i x_j u_{ij} ] ‖² ≤ O(1) Σ_{i,j} ‖u_{ij}‖² .   (6.7.21)
Similarly, we can control V[ Σ_{i,j} x_i ξ_j u′_{ij} ] by O(σ²) Σ_{i,j} ‖u′_{ij}‖².
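As a quick Monte Carlo sanity check of Claim 6.7.2 (not part of the proof; the O(1) and O(σ²) constants are taken to be 4 here, which suffices for the Rademacher variables used in this toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim, trials, sigma = 6, 3, 20000, 0.5
u = rng.normal(size=(T, T, dim))    # vectors u_ij
up = rng.normal(size=(T, T, dim))   # vectors u'_ij

samples = np.zeros((trials, dim))
for m in range(trials):
    x = rng.choice([-1.0, 1.0], size=T)           # mean 0, variance 1, bounded 4th moment
    xi = sigma * rng.choice([-1.0, 1.0], size=T)  # mean 0, variance sigma^2
    samples[m] = (np.einsum('i,j,ijd->d', x, x, u)
                  + np.einsum('i,j,ijd->d', x, xi, up))

var_est = samples.var(axis=0).sum()   # total variance of the vector-valued sum
bound = 4 * (u ** 2).sum() + 4 * sigma ** 2 * (up ** 2).sum()
print(var_est <= bound)  # prints True
```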
6.8 Back-propagation Implementation
In this section we give a detailed implementation of back-propagation to compute
the gradient of the loss function. The algorithm is stated for the general MIMO case with
the parameterization (6.6.3). To obtain the SISO sub-case, simply take ℓ_in = ℓ_out = 1.
Algorithm 9 Back-propagation
Parameters: â ∈ ℝⁿ, Ĉ ∈ ℝ^{ℓout×nℓin}, and D̂ ∈ ℝ^{ℓout×ℓin}. Let Â = MCC(â) = C(â) ⊗ Id_ℓin and B = e_n ⊗ Id_ℓin.
Input: samples ((x^(1), y^(1)), . . . , (x^(N), y^(N))) and projection set B_α.
for each sample (x^(j), y^(j)) = ((x_1, . . . , x_T), (y_1, . . . , y_T)) do
  Feed-forward pass:
    ĥ_0 = 0 ∈ ℝ^{nℓin}.
    for k = 1 to T: ĥ_k ← Â ĥ_{k−1} + B x_k and ŷ_k ← Ĉ ĥ_k + D̂ x_k. end for
  Back-propagation:
    Δĥ_{T+1} ← 0, G_A ← 0, G_C ← 0, G_D ← 0, T1 ← T/4.
    for k = T down to 1:
      If k > T1, Δŷ_k ← ŷ_k − y_k; otherwise Δŷ_k ← 0. Let Δĥ_k ← Ĉ^⊤ Δŷ_k + Â^⊤ Δĥ_{k+1}.
      Update G_C ← G_C + (1/(T−T1)) Δŷ_k ĥ_k^⊤, G_A ← G_A − (1/(T−T1)) B^⊤ Δĥ_k ĥ_{k−1}^⊤, and G_D ← G_D + (1/(T−T1)) Δŷ_k x_k^⊤.
    end for
  Gradient update: Â ← Â − η · G_A, Ĉ ← Ĉ − η · G_C, D̂ ← D̂ − η · G_D.
  Projection step: Obtain â from Â and set â ← Π_{B_α}(â), and Â = MCC(â).
end for
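The following is a minimal NumPy sketch of one feed-forward/back-propagation pass of Algorithm 9 in the SISO case (ℓ_in = ℓ_out = 1). The sign convention for g_a and the coefficient ordering in C(a) are guesses inferred from the text; G_C and G_D match the gradient of the (halved) partial loss:

```python
import numpy as np

def companion(a):
    """Companion matrix C(a) with last row (-a_n, ..., -a_1) (ordering assumed)."""
    n = len(a)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -a[::-1]
    return A

def forward_backward(a_hat, C_hat, D_hat, x, y, T1):
    """One feed-forward + back-propagation pass of Algorithm 9, SISO case."""
    n, T = len(a_hat), len(x)
    A = companion(a_hat)
    B = np.zeros(n); B[-1] = 1.0          # B = e_n
    h = np.zeros((T + 1, n))
    y_hat = np.zeros(T)
    for k in range(1, T + 1):             # feed-forward pass
        h[k] = A @ h[k - 1] + B * x[k - 1]
        y_hat[k - 1] = C_hat @ h[k] + D_hat * x[k - 1]
    g_a, g_C, g_D = np.zeros(n), np.zeros(n), 0.0
    dh = np.zeros(n)                      # Delta h_{k+1}
    scale = 1.0 / (T - T1)
    for k in range(T, 0, -1):             # back-propagation pass
        dy = y_hat[k - 1] - y[k - 1] if k > T1 else 0.0
        dh = C_hat * dy + A.T @ dh        # Delta h_k
        g_C += scale * dy * h[k]
        g_a -= scale * dh[-1] * h[k - 1]  # B^T Delta h_k = last coordinate; sign as in Algorithm 9
        g_D += scale * dy * x[k - 1]
    return g_a, g_C, g_D, y_hat
```

A finite-difference check on g_C and g_D against the partial loss is an easy sanity test of the backward pass.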
6.9 Projection to the Constraint Set
In order to have a fast projection algorithm to the convex set Bα, we consider a grid
GM of size M over the circle with radius α. We will show that M = Oτ (n) will be
enough to approximate the set Bα in the sense that projecting to the approximating
set suffices for the convergence.
Let B′_{α,τ0,τ1,τ2} = { â : p_â(z)/zⁿ ∈ C_{τ0,τ1,τ2}, ∀ z ∈ G_M } and B_{α,τ0,τ1,τ2} = { â : p_â(z)/zⁿ ∈ C_{τ0,τ1,τ2}, ∀ |z| = α }. Here C_{τ0,τ1,τ2} is defined the same as before, though we use the
subscripts to emphasize the dependency on the τ_i's,

C_{τ0,τ1,τ2} = { z : ℜz ≥ (1 + τ0)|ℑz| } ∩ { z : τ1 < ℜz < τ2 } .   (6.9.1)
We will first show that with M = O_τ(n), the set B′_{α,κ0,κ1,κ2} can be sandwiched
between the two sets B_{α,τ0,τ1,τ2} and B_{α,τ′0,τ′1,τ′2}.

Lemma 6.9.1. For any τ0 > τ′0, τ1 > τ′1, τ2 < τ′2, we have that for M = O_τ(n), there
exist κ0, κ1, κ2 that depend polynomially on the τ_i's and τ′_i's such that B_{α,τ0,τ1,τ2} ⊂ B′_{α,κ0,κ1,κ2} ⊂
B_{α,τ′0,τ′1,τ′2}.
Before proving the lemma, we demonstrate how to use it in our algorithm:
we pick τ′0 = τ0/2, τ′1 = τ1/2 and τ′2 = 2τ2, and find the κ_i's guaranteed by the lemma
above. Then we use B′_{α,κ0,κ1,κ2} as the projection set in the algorithm (instead of
B_{α,τ0,τ1,τ2}). First of all, the ground-truth solution Θ is in the set B′_{α,κ0,κ1,κ2}. Moreover,
since B′_{α,κ0,κ1,κ2} ⊂ B_{α,τ′0,τ′1,τ′2}, the iterates Θ̂ will remain in the
set B_{α,τ′0,τ′1,τ′2}, and therefore the quasi-convexity of the objective function still holds.⁷
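A minimal sketch of the membership test underlying the grid-based projection set B′ (helper names are hypothetical; the actual projection step solves a linear program over these constraints):

```python
import numpy as np

def in_cone(w, k0, k1, k2):
    """w in C_{k0,k1,k2} = {Re w >= (1+k0)|Im w|} ∩ {k1 < Re w < k2}, per (6.9.1)."""
    return w.real >= (1 + k0) * abs(w.imag) and k1 < w.real < k2

def in_grid_set(a, alpha, M, k0, k1, k2):
    """Check a in B'_{alpha,k0,k1,k2}: p_a(z)/z^n in the cone at all M grid points |z| = alpha."""
    n = len(a)
    for k in range(M):
        z = alpha * np.exp(2j * np.pi * k / M)
        p = z ** n + sum(a[i] * z ** (n - 1 - i) for i in range(n))  # p_a(z)
        if not in_cone(p / z ** n, k0, k1, k2):
            return False
    return True

print(in_grid_set(np.zeros(3), alpha=0.9, M=60, k0=0.1, k1=0.5, k2=2.0))  # a = 0: ratio is 1
```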
Note that the set B′_{α,κ0,κ1,κ2} consists of O(n) linear constraints, and therefore we
can use linear programming to solve the projection problem. Moreover, since the
points on the grid form a Fourier basis, the Fast Fourier Transform can
potentially be used to speed up the projection. Finally, we will prove Lemma 6.9.1. We
need S. Bernstein's inequality for polynomials.
Theorem 6.9.2 (Bernstein's inequality; see, for example, [152]). Let p(z) be any
polynomial of degree n with complex coefficients. Then,

sup_{|z|≤1} |p′(z)| ≤ n sup_{|z|≤1} |p(z)| .
We will use the following corollary of Bernstein’s inequality.
Corollary 6.9.3. Let p(z) be any polynomial of degree n with complex coefficients.
Then, for m = 20n,

sup_{|z|≤1} |p′(z)| ≤ 2n sup_{k∈[m]} |p(e^{2ikπ/m})| .
⁷With a slightly worse parameter, up to a constant factor, since the κ_i's differ from the τ_i's by constant factors.
Proof. For simplicity let τ = sup_{k∈[m]} |p(e^{2ikπ/m})|, and let τ′ = sup_{|z|≤1} |p(z)|.
If τ′ ≤ 2τ then we are done by Bernstein's inequality. Now let us assume that τ′ >
2τ. Suppose |p(z)| = τ′ for some z with |z| = 1. Then there exists k such that |z − e^{2πik/m}| ≤ 4/m and
|p(e^{2πik/m})| ≤ τ. Therefore by the Cauchy mean-value theorem there exists
ξ that lies between z and e^{2πik/m} such that |p′(ξ)| ≥ m(τ′ − τ)/4 ≥ 1.1nτ′, which
contradicts Bernstein's inequality.
Lemma 6.9.4. Suppose a polynomial p of degree n satisfies |p(w)| ≤ τ for every
w = αe^{2iπk/m}, k ∈ [m], for some m ≥ 20n. Then for every z with |z| = α there exists w =
αe^{2iπk/m} such that |p(z) − p(w)| ≤ O(nατ/m).
Proof. Let g(z) = p(αz), a polynomial of degree at most n; then
g′(z) = α p′(αz). Let w = αe^{2iπk/m} be such that |z − w| ≤ O(α/m). Then we have

|p(z) − p(w)| = |g(z/α) − g(w/α)| ≤ sup_{|x|≤1} |g′(x)| · (1/α) |z − w|   (by Cauchy's mean-value theorem)
 ≤ sup_{|x|≤1} |p′(x)| · |z − w| ≤ nτ |z − w|   (by Corollary 6.9.3)
 ≤ O(αnτ/m) .
Now we are ready to prove Lemma 6.9.1.
Proof of Lemma 6.9.1. We choose κ_i = ½(τ_i + τ′_i). The first inclusion is trivial; we
prove the second one. Consider â ∈ B′_{α,κ0,κ1,κ2}; we will show that â ∈
B_{α,τ′0,τ′1,τ′2}. Let q_â(z) = p_â(z⁻¹) zⁿ. By Lemma 6.9.4, for every z with |z| = 1/α, we
have that there exists w = α⁻¹ e^{2πik/M} for some integer k such that |q_â(z) − q_â(w)| ≤
O(τ2 n/(αM)). Therefore, letting M = cn for a sufficiently large constant c (which depends on the τ_i's),
we have that for every z with |z| = 1/α, q_â(z) ∈ C_{τ′0,τ′1,τ′2}. This completes the proof.
Part III
Interpreting Non-linear Models
and Their Non-convex Objective
Functions
Chapter 7
Understanding Word Embedding
Methods Using Generative Models
Semantic word embeddings represent the meaning of a word via a vector, and are cre-
ated by diverse methods, the learning of which often involves non-convex optimization
problems such as weighted matrix factorization or learning neural networks.
This chapter proposes a new generative model, a dynamic version of the log-linear
topic model of [127], under which we can explain the effectiveness of these diverse
methods. The methodological novelty is to use this generative model to compute
closed form expressions for word statistics. This provides a theoretical justification
for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter
choices. It also helps explain why low-dimensional semantic embeddings contain linear
algebraic structure that allows solution of word analogies, as shown by [122] and many
subsequent papers.
Experimental support is provided for the generative model assumptions, the most
important of which is that latent word vectors are fairly uniformly dispersed in space.
7.1 Introduction
Vector representations of words (word embeddings) try to capture relationships be-
tween words as distance or angle, and have many applications in computational
linguistics and machine learning. They are constructed by various models whose
unifying philosophy is that the meaning of a word is defined by “the company it
keeps” [66], namely, co-occurrence statistics. The simplest methods use word vectors
that explicitly represent co-occurrence statistics. Reweighting heuristics are known
to improve these methods, as is dimension reduction [56]. Some reweighting methods
are nonlinear, which include taking the square root of co-occurrence counts [147], or
the logarithm, or the related Pointwise Mutual Information (PMI) [50]. These are
collectively referred to as Vector Space Models, surveyed in [169].
Neural network language models [149, 150, 26, 53] propose another way to con-
struct embeddings: the word vector is simply the neural network’s internal represen-
tation for the word. This method is nonlinear and nonconvex. It was popularized
via word2vec, a family of energy-based models in [123, 125], followed by a matrix
factorization approach called GloVe [140]. The first paper also showed how to solve
analogies using linear algebra on word embeddings. Experiments and theory were
used to suggest that these newer methods are related to the older PMI-based models,
but with new hyperparameters and/or term reweighting methods [111].
But note that even the old PMI method is a bit mysterious. The simplest version
considers a symmetric matrix with each row/column indexed by a word. The entry
for (w,w′) is PMI(w,w′) = log[ p(w,w′) / (p(w)p(w′)) ], where p(w,w′) is the empirical probability
of words w,w′ appearing within a window of certain size in the corpus, and p(w)
is the marginal probability of w. (More complicated models could use asymmetric
matrices with columns corresponding to context words or phrases, and also involve
tensorization.) Then word vectors are obtained by low-rank SVD on this matrix, or
a related matrix with term reweightings. In particular, the PMI matrix is found to
be closely approximated by a low rank matrix: there exist word vectors in say 300
dimensions, which is much smaller than the number of words in the dictionary, such
that
〈vw, vw′〉 ≈ PMI(w,w′) (7.1.1)
where ≈ should be interpreted loosely.
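As a small illustration of the PMI-plus-SVD pipeline described above (the co-occurrence counts are made up; real pipelines use corpus statistics and term reweighting, and the factorization below ignores possible negative eigenvalues of the PMI matrix):

```python
import numpy as np

# Made-up symmetric co-occurrence counts for a 4-word vocabulary.
counts = np.array([[10., 6., 1., 1.],
                   [ 6., 8., 1., 1.],
                   [ 1., 1., 9., 5.],
                   [ 1., 1., 5., 7.]])
p_joint = counts / counts.sum()            # empirical p(w, w')
p_word = p_joint.sum(axis=1)               # marginal p(w)
pmi = np.log(p_joint / np.outer(p_word, p_word))

# Rank-2 truncated SVD of the PMI matrix; rows of V serve as word vectors.
U, S, Vt = np.linalg.svd(pmi)
approx = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]
V = U[:, :2] * np.sqrt(S[:2])
print(np.linalg.norm(pmi - approx))        # low-rank approximation error
```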
There appears to be no theoretical explanation for this empirical finding about
the approximate low rank of the PMI matrix. The current paper addresses this.
Specifically, we propose a probabilistic model of text generation that augments the
log-linear topic model of [127] with dynamics, in the form of a random walk over
a latent discourse space. The chief methodological contribution is using the model
priors to analytically derive a closed-form expression that directly explains (7.1.1);
see Theorem 7.2.2 in Section 7.2. Section 7.3 builds on this insight to give a rigorous
justification for models such as word2vec and GloVe, including the hyperparameter
choices for the latter. The insight also leads to a mathematical explanation for why
these word embeddings allow analogies to be solved using linear algebra; see Sec-
tion 7.4. Section 7.5 shows good empirical fit to this model's assumptions and predic-
tions, including the surprising one that word vectors are pretty uniformly distributed
(isotropic) in space.
7.1.1 Related Work
Latent variable probabilistic models of language have been used for word embeddings
before, including Latent Dirichlet Allocation (LDA) and its more complicated variants
(see the survey [?]), and some neurally inspired nonlinear models [127, 117]. In fact,
LDA evolved out of efforts in the 1990s to provide a generative model that “explains”
the success of older vector space methods like Latent Semantic Indexing [137, 86].
However, none of these earlier generative models has been linked to PMI models.
[111] tried to relate word2vec to PMI models. They showed that if there were no
dimension constraint in word2vec, specifically, the “skip-gram with negative sampling
(SGNS)” version of the model, then its solutions would satisfy (7.1.1), provided the
right hand side were replaced by PMI(w,w′) − β for some scalar β. However, skip-
gram is a discriminative model (due to the use of negative sampling), not generative.
Furthermore, their argument only applies to very high-dimensional word embeddings,
and thus does not address low-dimensional embeddings, which have superior quality
in applications.
[76] focuses on issues similar to our paper. They model text generation as a
random walk on words, which are assumed to be embedded as vectors in a geometric
space. Given that the last word produced was w, the probability that the next word
is w′ is assumed to be given by h(|vw−vw′ |2) for a suitable function h, and this model
leads to an explanation of (7.1.1). By contrast our random walk involves a latent
discourse vector, which has a clearer semantic interpretation and has proven useful in
subsequent work, e.g. understanding structure of word embeddings for polysemous
words [17]. Also our work clarifies some weighting and bias terms in the training
objectives of previous methods (Section 7.3) and also the phenomenon discussed in
the next paragraph.
Researchers have tried to understand why vectors obtained from the highly non-
linear word2vec models exhibit linear structures [110, 140]. Specifically, for analogies
like “man:woman::king :??,” queen happens to be the word whose vector vqueen is the
most similar to the vector vking − vman + vwoman. This suggests that simple semantic
relationships, such as masculine vs feminine tested in the above example, correspond
approximately to a single direction in space, a phenomenon we will henceforth refer
to as relations=lines.
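The analogy-solving recipe just described can be sketched as follows (the embeddings here are hand-made toy vectors, not trained ones):

```python
import numpy as np

# Hand-made toy embeddings; in practice these come from word2vec/GloVe training.
vecs = {"man":   np.array([ 1.0, 0.1, 0.0]),
        "woman": np.array([ 1.0, 0.9, 0.0]),
        "king":  np.array([ 0.2, 0.1, 1.0]),
        "queen": np.array([ 0.2, 0.9, 1.0]),
        "apple": np.array([-1.0, 0.0, 0.3])}

def solve_analogy(a, b, c, vecs):
    """a:b :: c:? -- argmax of cosine similarity to v_b - v_a + v_c, excluding a, b, c."""
    target = vecs[b] - vecs[a] + vecs[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    cands = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(cands, key=lambda w: cos(cands[w], target))

print(solve_analogy("man", "woman", "king", vecs))  # prints queen
```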
Section 7.4 surveys earlier attempts to explain this phenomenon and their short-
coming, namely, that they ignore the large approximation error in relationships like
(7.1.1). This error appears larger than the difference between the best solution and
the second best (incorrect) solution in analogy solving, so that this error could in
principle lead to a complete failure in analogy solving. In our explanation, the low
dimensionality of the word vectors plays a key role. This can also be seen as a the-
oretical explanation of the old observation that dimension reduction improves the
quality of word embeddings for various tasks. The intuitive explanation often given
—that smaller models generalize better—turns out to be fallacious, since the training
method for creating embeddings makes no reference to analogy solving. Thus there is
no a priori reason why low-dimensional model parameters (i.e., lower model capacity)
should lead to better performance in analogy solving, just as there is no reason they
are better at some other unrelated task like predicting the weather.
7.1.2 Benefits of Generative Approaches
In addition to giving some form of “unification” of existing methods, our generative
model also brings more interpretability to word embeddings beyond traditional cosine
similarity and even analogy solving. For example, it led to an understanding of how
the different senses of a polysemous word (e.g., bank) reside in linear superposition
within the word embedding [17]. Such insight into embeddings may prove useful in
the numerous settings in NLP and neuroscience where they are used.
Another new explanatory feature of our model is that low dimensionality of word
embeddings plays a key theoretical role —unlike in previous papers where the model
is agnostic about the dimension of the embeddings, and the superiority of low-
dimensional embeddings is an empirical finding (starting with [56]). Specifically,
our theoretical analysis makes the key assumption that the set of all word vectors
(which are latent variables of the generative model) are spatially isotropic, which
means that they have no preferred direction in space. Having n vectors be isotropic
in d dimensions requires d ≪ n. This isotropy is needed in the calculations (i.e.,
multidimensional integral) that yield (7.1.1). It also holds empirically for our word
vectors, as shown in Section 7.5.
The isotropy of low-dimensional word vectors also plays a key role in our ex-
planation of the relations=lines phenomenon (Section 7.4). The isotropy has a
“purification” effect that mitigates the effect of the (rather large) approximation error
in the PMI models.
7.2 Generative Model and Its Properties
The model treats corpus generation as a dynamic process, where the t-th word is
produced at step t. The process is driven by the random walk of a discourse vector
c_t ∈ ℝᵈ. Its coordinates represent what is being talked about.¹ Each word has a
(time-invariant) latent vector v_w ∈ ℝᵈ that captures its correlations with the discourse
vector. We model this bias with a log-linear word production model:
Pr[w emitted at time t | ct] ∝ exp(〈ct, vw〉). (7.2.1)
The discourse vector ct does a slow random walk (meaning that ct+1 is obtained
from ct by adding a small random displacement vector), so that nearby words are
generated under similar discourses. We are interested in the probabilities that word
pairs co-occur near each other, so occasional big jumps in the random walk are allowed
because they have negligible effect on these probabilities.
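A minimal simulation of this generative process (the step size and the renormalization of c_t back to the unit sphere are simplifications of the model's random walk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, eps = 1000, 50, 200, 0.1
word_vecs = rng.normal(size=(n, d))            # latent word vectors

c = rng.normal(size=d); c /= np.linalg.norm(c) # discourse vector on the unit sphere
corpus = []
for t in range(T):
    logits = word_vecs @ c
    p = np.exp(logits - logits.max()); p /= p.sum()    # Pr[w | c_t] ∝ exp(<c_t, v_w>)
    corpus.append(int(rng.choice(n, p=p)))
    step = rng.normal(size=d)
    step *= eps / (np.sqrt(d) * np.linalg.norm(step))  # slow step of l2 norm eps/sqrt(d)
    c = (c + step) / np.linalg.norm(c + step)          # stay on the sphere (simplification)
print(len(corpus))  # prints 200
```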
A similar log-linear model appears in [127] but without the random walk. The
linear chain CRF of [52] is more general. The dynamic topic model of [34] utilizes topic
¹This is a different interpretation of the term "discourse" compared to some other settings in computational linguistics.
dynamics, but with a linear word production model. [25] have proposed a dynamic
model for text using Kalman Filters, where the sequence of words is generated from
Gaussian linear dynamical systems, rather than the log-linear model in our case.
The novelty here over such past works is a theoretical analysis in the method-of-
moments tradition [88, 51]. Assuming a prior on the random walk we analytically
integrate out the hidden random variables and compute a simple closed form ex-
pression that approximately connects the model parameters to the observable joint
probabilities (see Theorem 7.2.2). This is reminiscent of analysis of similar random
walk models in finance [31].
Model details. Let n denote the number of words and d denote the dimension of
the discourse space, where 1 ≤ d ≤ n. Inspecting (7.2.1) suggests word vectors need
to have varying lengths, to fit the empirical finding that word probabilities satisfy
a power law. Furthermore, we will assume that in the bulk, the word vectors are
distributed uniformly in space, earlier referred to as isotropy. This can be quantified
as a prior in the Bayesian tradition. More precisely, the ensemble of word vectors
consists of i.i.d. draws generated by v = s · v̂, where v̂ is drawn from the spherical Gaussian
distribution, and s is a scalar random variable. We assume s is a random scalar with
expectation τ = Θ(1) and s is always upper bounded by κ, which is another constant.
Here τ governs the expected magnitude of 〈v, ct〉, and it is particularly important to
choose it to be Θ(1) so that the distribution Pr[w|ct] ∝ exp(〈vw, ct〉) is interesting.2
Moreover, the dynamic range of word probabilities will roughly equal exp(κ2), so one
should think of κ as an absolute constant like 5. These details about s are important
for realistic modeling but not too important in our analysis. (Furthermore, readers
uncomfortable with this simplistic Bayesian prior should look at Section 7.2.1 below.)
Finally, we clarify the nature of the random walk. We assume that the stationary
distribution of the random walk is uniform over the unit sphere, denoted by C. The
2A larger τ will make Pr[w|ct] too peaked and a smaller one will make it too uniform.
transition kernel of the random walk can be in any form so long as at each step the
movement of the discourse vector is at most ε2/√d in `2 norm.3 This is still fast
enough to let the walk mix quickly in the space.
The following lemma (whose proof appears in Section 7.6.1) is central to the analy-
sis. It says that under the Bayesian prior, the partition function Z_c = Σ_w exp(⟨v_w, c⟩),
which is the implied normalization in equation (7.2.1), is close to some constant Z for
most of the discourses c. This can be seen as a plausible theoretical explanation of
a phenomenon called self-normalization in log-linear models: ignoring the partition
function or treating it as a constant (which greatly simplifies training) is known to
often give good results. This has also been studied in [9].
Lemma 7.2.1 (Concentration of partition functions). If the word vectors satisfy the
Bayesian prior described in the model details, then

Pr_{c∼C} [ (1 − ε_z) Z ≤ Z_c ≤ (1 + ε_z) Z ] ≥ 1 − δ ,   (7.2.2)

for ε_z = O(1/√n), and δ = exp(−Ω(log² n)).
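A quick numerical illustration of this concentration (standard Gaussian word vectors, so ⟨v_w, c⟩ ~ N(0, 1) for unit-norm c; this is a simplification of the prior in the model details):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 50
word_vecs = rng.normal(size=(n, d))   # so <v_w, c> ~ N(0, 1) for unit-norm c

zs = []
for _ in range(20):
    c = rng.normal(size=d); c /= np.linalg.norm(c)   # c roughly uniform on the sphere
    zs.append(np.exp(word_vecs @ c).sum())           # Z_c = sum_w exp(<v_w, c>)
zs = np.array(zs)
print(zs.std() / zs.mean())   # small relative spread, consistent with (7.2.2)
```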
The concentration of the partition functions then leads to our main theorem
(the proof is in the Section 7.6). The theorem gives simple closed form approxi-
mations for p(w), the probability of word w in the corpus, and p(w,w′), the prob-
ability that two words w,w′ occur next to each other. The theorem states the
result for the window size q = 2, but the same analysis works for pairs that ap-
pear in a small window, say of size 10, as stated in Corollary 7.2.3. Recall that
PMI(w,w′) = log[p(w,w′)/(p(w)p(w′))].
³More precisely, the proof extends to any symmetric product stationary distribution C with sub-Gaussian coordinates satisfying E_c[‖c‖²] = 1, and steps such that for all c_t, E_{p(c_{t+1}|c_t)}[exp(κ√d ‖c_{t+1} − c_t‖)] ≤ 1 + ε₂ for some small ε₂.
Theorem 7.2.2. Suppose the word vectors satisfy the inequality (7.2.2), and window
size q = 2. Then,

log p(w,w′) = ‖v_w + v_{w′}‖² / (2d) − 2 log Z ± ε ,   (7.2.3)

log p(w) = ‖v_w‖² / (2d) − log Z ± ε ,   (7.2.4)

for ε = O(ε_z) + O(1/d) + O(ε₂). Jointly these imply:

PMI(w,w′) = ⟨v_w, v_{w′}⟩ / d ± O(ε) .   (7.2.5)
Remarks 1. Since the word vectors have ℓ₂ norm of the order of √d, for two typical
word vectors v_w, v_{w′}, ‖v_w + v_{w′}‖² is of the order of Θ(d). Therefore the noise level
ε is very small compared to the leading term ‖v_w + v_{w′}‖² / (2d). For PMI, however, the
noise level O(ε) could be comparable to the leading term, and empirically we also
find higher error here.
Remarks 2. Variants of the expression for the joint probability in (7.2.3) had been
hypothesized based upon empirical evidence in [123], [70], and [119].
Remarks 3. Theorem 7.2.2 directly leads to the extension to a general window size
q as follows:

Corollary 7.2.3. Let p_q(w,w′) be the co-occurrence probability in windows of size q,
and PMI_q(w,w′) be the corresponding PMI value. Then

log p_q(w,w′) = ‖v_w + v_{w′}‖² / (2d) − 2 log Z + γ ± ε ,

PMI_q(w,w′) = ⟨v_w, v_{w′}⟩ / d + γ ± O(ε) ,

where γ = log( q(q−1)/2 ).
It is quite easy to see that Theorem 7.2.2 implies Corollary 7.2.3: when the
window size is q, the pair w, w′ could appear in any of the q(q−1)/2 pairs of positions within the window,
and the joint probability of w, w′ is roughly the same for any positions because the
discourse vector changes slowly. (Of course, the error term gets worse as we consider
larger window sizes, although for any constant size, the statement of the theorem is
correct.) This is also consistent with the shift β for fitting PMI in [111], which showed
that without dimension constraints, the solution to skip-gram with negative sampling
satisfies PMI (w,w′) − β = 〈vw, vw′〉 for a constant β that is related to the negative
sampling in the optimization. Our result justifies via a generative model why this
should be satisfied even for low dimensional word vectors.
Proof sketches
Here we provide the proof sketches, while the complete proof can be found in the
Section 7.6.
Proof sketch of Theorem 7.2.2 Let w and w′ be two arbitrary words. Let c
and c′ denote two consecutive context vectors, where c ∼ C and c′|c is defined by the
Markov kernel p(c′ | c).
We start by using the law of total expectation, integrating out the hidden variables
c and c′:
p(w,w′) = E_{c,c′}[ Pr[w,w′ | c, c′] ] = E_{c,c′}[ p(w|c) p(w′|c′) ]
        = E_{c,c′}[ (exp(⟨v_w, c⟩)/Z_c) · (exp(⟨v_{w′}, c′⟩)/Z_{c′}) ] .   (7.2.6)
An expectation like (7.2.6) would normally be difficult to analyze because of the
partition functions. However, we can assume the inequality (7.2.2), that is, that the
partition function typically does not vary much for most context vectors c. Let F be
the event that both Z_c and Z_{c′} are within (1 ± ε_z)Z. Then by (7.2.2) and the union
bound, event F happens with probability at least 1 − 2 exp(−Ω(log² n)). We will split
the right-hand side (RHS) of (7.2.6) into two parts according to whether F happens
or not.
RHS of (7.2.6) = Ec,c′ [ (exp(〈vw, c〉)/Zc) · (exp(〈vw′, c′〉)/Zc′) · 1F ]        (:= T1)
              + Ec,c′ [ (exp(〈vw, c〉)/Zc) · (exp(〈vw′, c′〉)/Zc′) · 1F̄ ]        (:= T2)        (7.2.7)
where F̄ denotes the complement of event F, and 1F and 1F̄ denote the indicator functions of F and F̄, respectively. When F happens, we can replace Zc by Z at the cost of a 1 ± εz factor; the first term on the RHS of (7.2.7) then equals

T1 = ((1 ± O(εz))/Z²) Ec,c′ [exp(〈vw, c〉) exp(〈vw′, c′〉) 1F]        (7.2.8)
On the other hand, we can use E[1F̄] = Pr[F̄] ≤ exp(−Ω(log² n)) to show that the second term on the RHS of (7.2.7) is negligible:

|T2| = exp(−Ω(log^{1.8} n)).        (7.2.9)
This claim must be handled somewhat carefully, since the RHS does not depend on d at all. Briefly, the reason it holds is as follows. In the regime when d is small (√d = o(log² n)), any word vector vw and discourse vector c satisfy exp(〈vw, c〉) ≤ exp(‖vw‖) = exp(O(√d)), and since E[1F̄] = exp(−Ω(log² n)), the claim follows directly. In the regime when d is large (√d = Ω(log² n)), we can use concentration inequalities to show that, except with a small probability exp(−Ω(d)) = exp(−Ω(log² n)), a uniform sample from the sphere behaves equivalently to sampling all of the coordinates from a Gaussian distribution with mean 0 and variance 1/d, in which case the claim is not too difficult to show using Gaussian tail bounds.
Therefore it suffices to consider only (7.2.8). Our model assumptions state that c and c′ cannot be too different. We leverage that by rewriting (7.2.8) slightly:

T1 = ((1 ± O(εz))/Z²) Ec [ exp(〈vw, c〉) Ec′|c[exp(〈vw′, c′〉)] ]
   = ((1 ± O(εz))/Z²) Ec [ exp(〈vw, c〉) A(c) ]        (7.2.10)
where A(c) := Ec′|c[exp(〈vw′, c′〉)]. We claim that A(c) = (1 ± O(ε2)) exp(〈vw′, c〉). Doing some algebraic manipulation,

A(c) = exp(〈vw′, c〉) Ec′|c[exp(〈vw′, c′ − c〉)] .
Furthermore, by our model assumptions, ‖c − c′‖ ≤ ε2/√d. So

|〈vw′, c − c′〉| ≤ ‖vw′‖ · ‖c − c′‖ = O(ε2),

and thus A(c) = (1 ± O(ε2)) exp(〈vw′, c〉). Plugging this simplification of A(c) into (7.2.10),
T1 = ((1 ± O(εz))/Z²) E[exp(〈vw + vw′, c〉)].        (7.2.11)
Since c has the uniform distribution over the sphere, the random variable 〈vw + vw′, c〉 has a distribution quite similar to the Gaussian N(0, ‖vw + vw′‖²/d), especially when d is relatively large. Observe that E[exp(X)] has a closed form for a Gaussian random variable X ∼ N(0, σ²):

E[exp(X)] = ∫ (1/(σ√(2π))) exp(−x²/(2σ²)) exp(x) dx = exp(σ²/2) .        (7.2.12)
Bounding the difference between 〈vw + vw′, c〉 and a Gaussian random variable, we can show that for ε = O(1/d),

E[exp(〈vw + vw′, c〉)] = (1 ± ε) exp(‖vw + vw′‖²/(2d)).        (7.2.13)
Therefore, the series of simplifications and approximations above (concretely, combining equations (7.2.6), (7.2.7), (7.2.9), (7.2.11), and (7.2.13)) leads to the desired bound on log p(w,w′) for the case when the window size is q = 2. The bound on log p(w) can be shown similarly.
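The Gaussian moment calculation above can be sanity-checked numerically. The following sketch (assuming NumPy; the dimension and norm are illustrative, not those of the actual experiments) samples discourse vectors uniformly from the sphere and compares the empirical value of E[exp(〈v, c〉)] against the prediction exp(‖v‖²/(2d)) of (7.2.13):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                        # embedding dimension (illustrative choice)
v = rng.normal(size=d)
v *= 4.0 / np.linalg.norm(v)   # plays the role of v_w + v_w', moderate norm

# Sample discourse vectors c uniformly from the unit sphere in R^d.
n_samples = 50_000
c = rng.normal(size=(n_samples, d))
c /= np.linalg.norm(c, axis=1, keepdims=True)

# Compare E[exp(<v, c>)] with the Gaussian-based prediction exp(||v||^2/(2d)).
empirical = np.exp(c @ v).mean()
predicted = np.exp(np.linalg.norm(v) ** 2 / (2 * d))
rel_err = abs(empirical - predicted) / predicted
```

With these sizes the relative error is well below the O(1/d) term in the theorem, consistent with the sphere behaving like a Gaussian with coordinate variance 1/d.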
Proof sketch of Lemma 7.2.1 Note that for fixed c, when the word vectors have the Gaussian priors assumed in our model, Zc = ∑w exp(〈vw, c〉) is a sum of independent random variables.
We first claim that, using proper concentration-of-measure tools, it can be shown that the variance of Zc is relatively small compared to its mean Evw[Zc], and thus Zc concentrates around its mean. Note that this is quite non-trivial: the random variable exp(〈vw, c〉) is neither bounded nor sub-Gaussian/sub-exponential, since its tail is approximately inverse poly-logarithmic instead of inverse exponential. In fact, the same concentration phenomenon does not happen for w: the occurrence probability of word w is not necessarily concentrated, because the ℓ2 norm of vw can vary a lot in our model, which allows the frequencies of the words to have a large dynamic range.
So now it suffices to show that the values Evw[Zc] for different c are close to each other. Using the fact that the word vector directions have a Gaussian distribution, Evw[Zc] turns out to depend only on the norm of c (which is equal to 1). More precisely,

Evw[Zc] = f(‖c‖²) = f(1),        (7.2.14)
where f is defined as f(α) = n Es[exp(s²α/2)], and s has the same distribution as the norms of the word vectors. We sketch the proof of this. In our model, vw = sw · v̂w, where v̂w is a Gaussian vector with identity covariance I. Then

Evw[Zc] = n Evw[exp(〈vw, c〉)] = n Esw[ Ev̂w|sw[exp(〈vw, c〉) | sw] ],

where the second equality is just an application of the law of total expectation, picking the norm of the (random) vector vw first, followed by its direction. Conditioned on sw, 〈vw, c〉 is a Gaussian random variable with variance sw²‖c‖², and therefore, using a calculation similar to (7.2.12), we have

Ev̂w|sw[exp(〈vw, c〉) | sw] = exp(sw²‖c‖²/2) .

Hence, Evw[Zc] = n Es[exp(s²‖c‖²/2)], as needed.
7.2.1 Weakening the Model Assumptions
For readers uncomfortable with Bayesian priors, we can replace our assumptions with
concrete properties of word vectors that are empirically verifiable (Section 7.5.1) for
our final word vectors, and in fact also for word vectors computed using other recent
methods.
The word meanings are assumed to be represented by some “ground truth” vectors,
which the experimenter is trying to recover. These ground truth vectors are assumed
to be spatially isotropic in the bulk, in the following two specific ways: (i) For almost
all unit vectors c, the sum ∑w exp(〈vw, c〉) is close to a constant Z; (ii) Singular values
of the matrix of word vectors satisfy properties similar to those of random matrices,
as formalized in the paragraph before Theorem 7.4.1. Our Bayesian prior on the
word vectors happens to imply that these two conditions hold with high probability.
But the conditions may hold even if the prior doesn’t hold. Furthermore, they are
compatible with all sorts of local structure among word vectors such as existence of
clusterings, which would be absent in truly random vectors drawn from our prior.
7.3 Training objective and relationship to other
models
To get a training objective out of Theorem 7.2.2, we reason as follows. Let Xw,w′
be the number of times words w and w′ co-occur within the same window in the
corpus. The probability p(w,w′) of such a co-occurrence at any particular time is
given by (7.2.3). Successive samples from a random walk are not independent, but if the random walk mixes fairly quickly (the mixing time is related to the logarithm of the vocabulary size), then the distribution of the Xw,w′'s is very close to a multinomial distribution Mul(L, {p(w,w′)}), where L = ∑w,w′ Xw,w′ is the total number of word pairs.
Assuming this approximation, we show below that the maximum likelihood values for the word vectors correspond to the following optimization:

min_{{vw},C} ∑w,w′ Xw,w′ ( log(Xw,w′) − ‖vw + vw′‖² − C )²

As is usual, empirical performance is improved by weighting down very frequent word pairs, possibly because very frequent words such as "the" do not fit our model. This is done by replacing the weight Xw,w′ with its truncation min{Xw,w′, Xmax}, where Xmax is a constant such as 100. We call this objective with the truncated weights SN (Squared Norm).
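As an illustration, the truncated-weight SN objective can be written out directly. This is a minimal sketch in plain Python (the dictionary-based data layout and names are ours for readability, not from the implementation used in Section 7.5):

```python
import math

def sn_objective(counts, vecs, C, x_max=100.0):
    """SN (Squared Norm) objective with truncated co-occurrence weights.

    counts : dict mapping a word pair (w, wp) to its co-occurrence count X_{w,w'}
    vecs   : dict mapping a word to its vector (list of floats)
    C      : scalar offset, playing the role of 2 log Z in the model
    """
    total = 0.0
    for (w, wp), x in counts.items():
        weight = min(x, x_max)                  # truncated weight min{X, X_max}
        s = [a + b for a, b in zip(vecs[w], vecs[wp])]
        sq_norm = sum(t * t for t in s)         # ||v_w + v_w'||^2
        resid = math.log(x) - sq_norm - C
        total += weight * resid * resid
    return total
```

In practice the sum runs only over observed pairs, and the objective is minimized over the vectors and C with AdaGrad, as described in Section 7.5.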
We now give its derivation. Maximizing the likelihood of the Xw,w′'s is equivalent to maximizing

ℓ = log ∏(w,w′) p(w,w′)^Xw,w′ .

Denote by ∆w,w′ the logarithm of the ratio between the expected count and the empirical count:

∆w,w′ = log( L p(w,w′) / Xw,w′ ).        (7.3.1)

Then, with some calculation, we obtain the following, where c is independent of the empirical observations Xw,w′:

ℓ = c + ∑(w,w′) Xw,w′ ∆w,w′        (7.3.2)
On the other hand, using e^x ≈ 1 + x + x²/2 when x is small,⁴ we have

L = ∑(w,w′) L p(w,w′) = ∑(w,w′) Xw,w′ e^∆w,w′ ≈ ∑(w,w′) Xw,w′ (1 + ∆w,w′ + ∆w,w′²/2).

⁴This Taylor series approximation has an error of the order of x³, but ignoring it can be theoretically justified as follows. For a large Xw,w′, its value approaches its expectation, and thus the corresponding ∆w,w′ is close to 0, so ignoring ∆w,w′³ is well justified. The terms where ∆w,w′ is significant correspond to Xw,w′'s that are small. But empirically, the Xw,w′'s obey a power-law distribution (see, e.g., [140]), using which it can be shown that these terms contribute a small fraction of the final objective (7.3.3). So we can safely ignore the errors.
Note that L = ∑(w,w′) Xw,w′, so

∑(w,w′) Xw,w′ ∆w,w′ ≈ −(1/2) ∑(w,w′) Xw,w′ ∆w,w′² .

Plugging this into (7.3.2) leads to

2(c − ℓ) ≈ ∑(w,w′) Xw,w′ ∆w,w′² .        (7.3.3)
So maximizing the likelihood is approximately equivalent to minimizing the right
hand side, which (by examining (7.3.1)) leads to our objective.
Objective for training with PMI. A similar objective, PMI, can be obtained from (7.2.5) by computing an approximate MLE, using the fact that the error between the empirical and true values of PMI(w,w′) is driven by the smaller term p(w,w′), not by the larger terms p(w), p(w′):

min_{vw} ∑w,w′ Xw,w′ ( PMI(w,w′) − 〈vw, vw′〉 )²

This is of course very analogous to classical VSM methods, with a novel reweighting method.
Fitting either of these objectives involves solving a version of weighted SVD, which is NP-hard but empirically seems solvable in our setting via AdaGrad [58].
Connection to GloVe. Compare SN with the objective used by GloVe [140]:

∑w,w′ f(Xw,w′) ( log(Xw,w′) − 〈vw, vw′〉 − sw − sw′ − C )²

with f(Xw,w′) = min{Xw,w′^{3/4}, 100}. Their weighting method and the need for the bias terms sw, sw′, C were derived by trial and error; here they are all predicted and given meanings by Theorem 7.2.2, specifically sw = ‖vw‖².
Connection to word2vec (CBOW). The CBOW model in word2vec posits the following probability for a word wk+1 given the previous k words w1, w2, . . . , wk:

p(wk+1 | w1, . . . , wk) ∝ exp(〈vwk+1, (1/k) ∑ki=1 vwi〉).
This expression seems mysterious since it depends upon the average word vector for the previous k words. We show it can be theoretically justified. Assume a simplified version of our model, where a small window of k words is generated as follows: sample c ∼ C, where C is the uniform distribution over unit vectors; then sample (w1, w2, . . . , wk) ∼ exp(〈∑ki=1 vwi, c〉)/Zc. Furthermore, assume Zc = Z for any c.
Lemma 7.3.1. In the simplified version of our model, the maximum-a-posteriori (MAP) estimate of c given (w1, w2, . . . , wk) is (∑ki=1 vwi) / ‖∑ki=1 vwi‖.

Proof. The c maximizing p(c | w1, w2, . . . , wk) is the maximizer of p(c) p(w1, w2, . . . , wk | c). Since p(c) = p(c′) for any c, c′, and we have p(w1, w2, . . . , wk | c) = exp(〈∑i vwi, c〉)/Z, the maximizer is clearly c = (∑ki=1 vwi) / ‖∑ki=1 vwi‖.
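Lemma 7.3.1 is a direct consequence of Cauchy–Schwarz and is easy to check numerically. A small sketch (assuming NumPy; the sizes are arbitrary) compares the claimed MAP estimate against many random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5
window = rng.normal(size=(k, d))       # word vectors v_{w_1}, ..., v_{w_k}
s = window.sum(axis=0)
c_map = s / np.linalg.norm(s)          # claimed MAP estimate of the discourse c

# Any other unit vector gives a smaller value of <sum_i v_{w_i}, c>, hence a
# smaller posterior p(c | w_1, ..., w_k) under the uniform prior over c.
others = rng.normal(size=(1000, d))
others /= np.linalg.norm(others, axis=1, keepdims=True)
best_other = float((others @ s).max())
map_value = float(s @ c_map)           # equals ||s||, the Cauchy-Schwarz maximum
```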
Thus using the MAP estimate of ct gives essentially the same expression as CBOW, apart from the rescaling, which is often omitted for computational efficiency in empirical works.
7.4 Explaining relations=lines
As mentioned, word analogies like “a:b::c:??” can be solved via a linear algebraic
expression:
argmind ‖va − vb − vc + vd‖22 , (7.4.1)
where vectors have been normalized such that ‖vd‖2 = 1. This suggests that the
semantic relationships being tested in the analogy are characterized by a straight
line,5 referred to earlier as relations=lines.
Using our model we will show the following for low-dimensional embeddings: for
each such relation R there is a direction µR in space such that for any word pair a, b
satisfying the relation, va − vb is like µR plus some noise vector. This happens for
relations satisfying a certain condition described below. Empirical results supporting
this theory appear in Section 7.5, where this linear structure is further leveraged to
slightly improve analogy solving.
A side product of our argument will be a mathematical explanation of the em-
pirically well-established superiority of low-dimensional word embeddings over high-
dimensional ones in this setting [110]. As mentioned earlier, the usual explanation
that smaller models generalize better is fallacious.
We first sketch what was missing in prior attempts to prove versions of rela-
tions=lines from first principles. The basic issue is approximation error: the differ-
ence between the best solution and the 2nd best solution to (7.4.1) is typically small,
whereas the approximation error in the objective in the low-dimensional solutions is
larger. For instance, if one uses our PMI objective, then the weighted average of
the termwise error in (7.2.5) is 17%, and the expression in (7.4.1) above contains six inner products. Thus in principle the approximation error could cause the method to fail and prevent the linear relationship from emerging, but it does not.

⁵Note that this interpretation has been disputed; e.g., it is argued in [110] that (7.4.1) can be understood using only the classical connection between inner product and word similarity, using which the objective (7.4.1) is slightly improved to a different objective called 3COSMUL. However, this "explanation" is still dogged by the issue of large termwise error pinpointed here, since inner product is only a rough approximation to word similarity. Furthermore, the experiments in Section 7.5 clearly support the relations=lines interpretation.
Prior explanations. [140] propose a model where such linear relationships should occur by design. They posit that queen is a solution to the analogy "man:woman::king:??" because

p(χ | king)/p(χ | queen) ≈ p(χ | man)/p(χ | woman),        (7.4.2)
where p(χ | king) denotes the conditional probability of seeing word χ in a small
window of text around king. Relationship (7.4.2) is intuitive since both sides will be
≈ 1 for gender-neutral χ like “walks” or “food”, will be > 1 when χ is like “he, Henry”
and will be < 1 when χ is like “dress, she, Elizabeth.” This was also observed by [110].
Given (7.4.2), they then posit that the correct model describing word embeddings in terms of word occurrences must be a homomorphism from (ℝᵈ, +) to (ℝ₊, ×), so vector differences map to ratios of probabilities. This leads to the expression

log pw,w′ = 〈vw, vw′〉 + bw + bw′ ,

and their method is a (weighted) least-squares fit for this expression. One shortcoming
of this argument is that the homomorphism assumption assumes the linear relation-
ships instead of explaining them from a more basic principle. More importantly, the
empirical fit to the homomorphism has nontrivial approximation error, high enough
that it does not imply the desired strong linear relationships.
[111] show that empirically, skip-gram vectors satisfy
〈vw, vw′〉 ≈ PMI(w,w′) (7.4.3)
up to some shift. They also give an argument suggesting this relationship must be
present if the solution is allowed to be very high-dimensional. Unfortunately, that
argument does not extend to low-dimensional embeddings. Even if it did, the issue
of termwise approximation error remains.
Our explanation. The current paper has introduced a generative model to theo-
retically explain the emergence of relationship (7.4.3). However, as noted after The-
orem 7.2.2, the issue of high approximation error does not go away either in theory
or in the empirical fit. We now show that the isotropy of word vectors (assumed
in the theoretical model and verified empirically) implies that even a weak version
of (7.4.3) is enough to imply the emergence of the observed linear relationships in
low-dimensional embeddings.
This argument will assume the analogy in question involves a relation that obeys
Pennington et al.’s suggestion in (7.4.2). Namely, for such a relation R there exists
function νR(·) depending only upon R such that for any a, b satisfying R there is a
noise function ξa,b,R(·) for which:
p(χ | a)
p(χ | b)= νR(χ) · ξa,b,R(χ) (7.4.4)
For different words χ there is huge variation in (7.4.4), so the multiplicative noise
may be large.
Our goal is to show that the low-dimensional word embeddings have the property
that there is a vector µR such that for every pair of words a, b in that relation,
va − vb = µR + noise vector, where the noise vector is small.
Taking logarithms of (7.4.4) results in:

log( p(χ | a)/p(χ | b) ) = log(νR(χ)) + ζa,b,R(χ)        (7.4.5)
Theorem 7.2.2 implies that the left-hand side simplifies to log(p(χ|a)/p(χ|b)) = (1/d)〈vχ, va − vb〉 + εa,b(χ), where ε captures the small approximation errors induced by the inexactness of Theorem 7.2.2. This adds yet more noise! Denoting by V the n × d matrix whose rows are the vectors vχ, we rewrite (7.4.5) as:

V (va − vb) = d log(νR) + ζ′a,b,R        (7.4.6)

where log(νR) is the element-wise log of the vector νR and ζ′a,b,R = d(ζa,b,R − εa,b,R) is the noise.
In essence, (7.4.6) shows that va − vb is a solution to a linear regression in d variables with n constraints, with ζ′a,b,R being the "noise." The design matrix in the regression is
V , the matrix of all word vectors, which in our model (as well as empirically) satisfies
an isotropy condition. This makes it random-like, and thus solving the regression by
left-multiplying by V †, the pseudo-inverse of V , ought to “denoise” effectively. We
now show that it does.
Our model assumed the set of all word vectors satisfies bulk properties similar
to a set of Gaussian vectors. The next theorem will only need the following weaker
properties. (1) The smallest non-zero singular value of V is larger than some constant
c1 times the quadratic mean of the singular values, namely, ‖V ‖F/√d. Empirically we
find c1 ≈ 1/3 holds; see Section 7.5. (2) The left singular vectors behave like random
vectors with respect to ζ ′a,b,R, namely, have inner product at most c2‖ζ ′a,b,R‖/√n with
ζ ′a,b,R, for some constant c2. (3) The max norm of a row in V is O(√d).
Theorem 7.4.1 (Noise reduction). Under the conditions of the previous paragraph, the noise in the dimension-reduced semantic vector space satisfies

‖ζa,b,R‖ ≲ ‖ζ′a,b,R‖ √d / n .
As a corollary, the relative error in the dimension-reduced space is smaller by a factor of √(d/n).
Proof of Theorem 7.4.1 The proof uses the standard analysis of linear regression. Let V = PΣQ^T be the SVD of V and let σ1, . . . , σd be the singular values of V (the diagonal entries of Σ). For notational ease we omit the subscripts in ζ and ζ′ since they are not relevant for this proof. Since V† = QΣ⁻¹P^T, and thus ζ = V†ζ′ = QΣ⁻¹P^Tζ′, we have

‖ζ‖ ≤ σd⁻¹ ‖P^Tζ′‖.        (7.4.7)

We claim

σd⁻¹ ≤ √(1/(c1n)).        (7.4.8)

Indeed, ∑di=1 σi² = O(nd), since the average squared norm of a word vector is d; the claim then follows from the first assumption. Furthermore, by the second assumption, ‖P^Tζ′‖∞ ≤ (c2/√n)‖ζ′‖, so

‖P^Tζ′‖² ≤ (c2²d/n) ‖ζ′‖².        (7.4.9)
Plugging (7.4.8) and (7.4.9) into (7.4.7), we get

‖ζ‖ ≤ √(1/(c1n)) · √(c2²d/n) ‖ζ′‖ = (c2√d / (√c1 n)) ‖ζ′‖,

as desired. The last statement follows because the norm of the signal, which is d log(νR) originally and V†d log(νR) = va − vb after dimension reduction, also gets reduced by a factor of √n.
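The denoising effect of the pseudo-inverse can be illustrated on synthetic data. In this sketch (assuming NumPy; toy sizes, with both V and the noise drawn at random so that assumptions (1)–(3) hold with high probability), the noise norm shrinks by roughly the predicted factor √d/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 50            # toy vocabulary size and embedding dimension

# Isotropic "word vector" matrix V (rows roughly Gaussian, as in the prior).
V = rng.normal(size=(n, d))
# A generic high-dimensional noise vector zeta' with no preferred direction.
zeta_prime = rng.normal(size=n)

# Solving the regression V x = d log(nu_R) + zeta' by left-multiplying with
# the pseudo-inverse maps the noise zeta' to zeta = V^dagger zeta'.
zeta = np.linalg.pinv(V) @ zeta_prime

ratio = float(np.linalg.norm(zeta) / np.linalg.norm(zeta_prime))
bound = np.sqrt(d) / n     # Theorem 7.4.1 predicts ratio = O(sqrt(d)/n)
```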
7.5 Experimental Verification
In this section, we provide experiments empirically supporting our generative model.
Corpus. All word embedding vectors are trained on the English Wikipedia (March 2015 dump). It is pre-processed by a standard pipeline (removing non-textual elements, sentence splitting, and tokenization), leaving about 3 billion tokens. Words that appeared fewer than 1000 times in the corpus are ignored, resulting in a vocabulary of 68,430. Co-occurrence counts are then computed using windows of 10 tokens to each side of the focus word.
Training method. Our embedding vectors are trained by optimizing the SN ob-
jective using AdaGrad [58] with initial learning rate of 0.05 and 100 iterations. The
PMI objective derived from (7.2.5) was also used. SN has an average (weighted) termwise error of 5%, and PMI has 17%. We observed that SN vectors typically fit the model better and have better performance, which can be explained by the larger errors in PMI, as implied by Theorem 7.2.2. So we only report the results for SN.
For comparison, GloVe and two variants of word2vec (skip-gram and CBOW)
vectors are trained. GloVe’s vectors are trained on the same co-occurrence as SN
with the default parameter values.6 word2vec vectors are trained using a window size
of 10, with other parameters set to default values.7
7.5.1 Model Verification
Experiments were run to test our modeling assumptions. First, we tested two counterintuitive properties: the concentration of the partition function Zc for different discourse vectors c (see Lemma 7.2.1), and the random-like behavior of the matrix of
⁶http://nlp.stanford.edu/projects/glove/
⁷https://code.google.com/p/word2vec/
[Figure: histograms of the normalized partition function for four embedding methods; panels (a) SN, (b) GloVe, (c) CBOW, (d) skip-gram; x-axis: partition function value, y-axis: percentage.]

Figure 7.1: The partition function Zc. The figure shows the histogram of Zc for 1000 random vectors c of appropriate norm, as defined in the text. The x-axis is normalized by the mean of the values. The values Zc for different c concentrate around the mean, mostly in [0.9, 1.1]. This concentration phenomenon is predicted by our analysis.
[Figure: scatter plot; x-axis: natural logarithm of frequency, y-axis: squared norm.]

Figure 7.2: The linear relationship between the squared norms of our word vectors and the logarithms of the word frequencies. Each dot in the plot corresponds to a word, where the x-axis is the natural logarithm of the word frequency, and the y-axis is the squared norm of the word vector. The Pearson correlation coefficient between the two is 0.75, indicating a significant linear relationship, which strongly supports our mathematical prediction, equation (7.2.4) of Theorem 7.2.2.
word embeddings in terms of its singular values (see Theorem 7.4.1). For compar-
ison we also tested these properties for word2vec and GloVe vectors, though they
are trained by different objectives. Finally, we tested the linear relation between the
squared norms of our word vectors and the logarithm of the word frequencies, as
implied by Theorem 7.2.2.
Partition function. Our theory predicts the counter-intuitive concentration of the partition function Zc = ∑w′ exp(〈vw′, c〉) for a random discourse vector c (see Lemma 7.2.1). This is verified empirically by picking a uniformly random direction, of norm ‖c‖ = 4/µw, where µw is the average norm of the word vectors.⁸ Figure 7.1(a) shows the histogram of Zc for 1000 such randomly chosen c's for our vectors. The values are concentrated, mostly in the range [0.9, 1.1] times the mean. Concentration is also observed for other types of vectors, especially for GloVe and CBOW.
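The concentration in Figure 7.1 can also be reproduced on synthetic vectors drawn from the model's prior. A sketch (assuming NumPy; toy vocabulary size and dimension, not the trained vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 100                       # toy vocabulary size and dimension

# Word vectors drawn from the spherical Gaussian prior of the model.
word_vecs = rng.normal(size=(n, d))
mu_w = np.linalg.norm(word_vecs, axis=1).mean()

# Random discourse directions, rescaled to norm 4 / mu_w as in the text.
n_dirs = 200
c = rng.normal(size=(n_dirs, d))
c *= (4.0 / mu_w) / np.linalg.norm(c, axis=1, keepdims=True)

Z = np.exp(word_vecs @ c.T).sum(axis=0)  # partition function Z_c per direction
Z_rel = Z / Z.mean()                     # normalized as in Figure 7.1
```

At these sizes the normalized values fall well inside [0.9, 1.1], matching the concentration predicted by Lemma 7.2.1.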
Isotropy with respect to singular values. Our theoretical explanation of rela-
tions=lines assumes that the matrix of word vectors behaves like a random matrix
with respect to the properties of singular values. In our embeddings, the quadratic
mean of the singular values is 34.3, while the minimum non-zero singular value of our
word vectors is 11. Therefore, the ratio between them is a small constant, consistent
with our model. The ratios for GloVe, CBOW, and skip-gram are 1.4, 10.1, and 3.1,
respectively, which are also small constants.
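For comparison, the corresponding ratio for a truly random Gaussian matrix is easy to compute. A sketch (assuming NumPy; toy sizes) checks that the smallest singular value is a constant fraction of the quadratic mean ‖V‖F/√d:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 50
V = rng.normal(size=(n, d))                 # random-like word-vector matrix
sigma = np.linalg.svd(V, compute_uv=False)

# Quadratic mean of the singular values: sqrt(sum_i sigma_i^2 / d) = ||V||_F / sqrt(d).
quad_mean = np.linalg.norm(V, "fro") / np.sqrt(d)
c1 = float(sigma.min() / quad_mean)         # the constant in assumption (1)
```

For n much larger than d, random matrix theory puts the smallest singular value near (1 − √(d/n)) times the quadratic mean, so c1 is a constant bounded away from 0, consistent with the isotropy assumption.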
Squared norms vs. word frequencies. Figure 7.2 shows a scatter plot of the squared norms of our vectors against the logarithms of the word frequencies. A linear relationship is observed (Pearson correlation 0.75), thus supporting Theorem 7.2.2. The correlation is stronger for high-frequency words, possibly because the corresponding terms have higher weights in the training objective.
This correlation is much weaker for other types of word embeddings. This is
possibly because they have more free parameters (“knobs to turn”), which imbue
the embeddings with other properties. This can also cause the difference in the
concentration of the partition function for the two methods.
⁸Note that our model uses the inner products between the discourse vectors and word vectors, so it is invariant if the discourse vectors are scaled by s while the word vectors are scaled by 1/s for any s > 0. Therefore, one needs to choose the norm of c properly. We assume ‖c‖µw = √d/κ ≈ 4 for a constant κ = 5, so that it gives a reasonable fit to the predicted dynamic range of word frequencies according to our theory; see the model details in Section 7.2.
Relations        SN    GloVe  CBOW  skip-gram
G: semantic      0.84  0.85   0.79  0.73
G: syntactic     0.61  0.65   0.71  0.68
G: total         0.71  0.73   0.74  0.70
M: adjective     0.50  0.56   0.58  0.58
M: noun          0.69  0.70   0.56  0.58
M: verb          0.48  0.53   0.64  0.56
M: total         0.53  0.57   0.62  0.57

Table 7.1: The accuracy on two word analogy task testbeds: G (the GOOGLE testbed); M (the MSR testbed). Performance is close to the state of the art despite using a generative model with provable properties.
7.5.2 Performance on Analogy Tasks
We compare the performance of our word vectors on analogy tasks, specifically the two testbeds GOOGLE and MSR [122, 125]. The former contains 7874 semantic questions such as "man:woman::king:??" and 10167 syntactic ones such as "run:runs::walk:??". The latter has 8000 syntactic questions for adjectives, nouns, and verbs.
To solve these tasks, we use linear algebraic queries.9 That is, first normalize the
vectors to unit norm and then solve “a:b::c:??” by
argmind ‖va − vb − vc + vd‖22 . (7.5.1)
The algorithm succeeds if the best d happens to be correct.
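The linear algebraic query (7.5.1) is straightforward to implement. A minimal sketch (assuming NumPy; the vocabulary and vectors are the caller's, and we follow the standard convention of excluding the three query words from the candidates):

```python
import numpy as np

def solve_analogy(vecs, vocab, a, b, c):
    """Solve a:b::c:? via argmin_d ||v_a - v_b - v_c + v_d||^2, i.e. (7.5.1).

    vecs  : (n, d) array of word vectors (unit-normalized in the experiments)
    vocab : list of the n words; a, b, c must be in vocab
    """
    idx = {w: i for i, w in enumerate(vocab)}
    # Minimizing ||v_a - v_b - v_c + v_d|| is finding v_d closest to this target.
    target = vecs[idx[b]] + vecs[idx[c]] - vecs[idx[a]]
    dists = np.linalg.norm(vecs - target, axis=1)
    for w in (a, b, c):                # exclude the query words themselves
        dists[idx[w]] = np.inf
    return vocab[int(np.argmin(dists))]
```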
The performance of the different methods is presented in Table 7.1. Our vectors achieve performance comparable to the state of the art on semantic analogies (similar accuracy to GloVe, better than word2vec). On syntactic tasks, they achieve accuracy 0.04 lower than GloVe and skip-gram, while CBOW typically outperforms the others.¹⁰ The reason is probably that our model ignores local word order, whereas the other models capture it to some extent. For example, the word "she" can affect
⁹One can instead use the 3COSMUL objective of [110], which increases the accuracy by about 3%. But it is not linear, while our focus here is the linear algebraic structure.
¹⁰It was earlier reported that skip-gram outperforms CBOW [122, 140]. This may be due to the different training data sets and hyperparameters used.
the context by a lot and determine if the next word is “thinks” rather than “think”.
Incorporating such linguistic features in the model is left for future work.
7.5.3 Verifying relations=lines
The theory in Section 7.4 predicts the existence of a direction for a relation, whereas
earlier [110] had questioned if this phenomenon is real. The experiment uses the
analogy testbed, where each relation is tested using 20 or more analogies. For each relation, we take the set of vectors vab = va − vb where the word pair (a, b) satisfies the relation. We then calculate the top singular vectors of the matrix formed by these vab's, and compute the cosine similarity (i.e., normalized inner product) of each individual vab to the singular vectors. We observed that most (va − vb)'s are correlated with
the first singular vector, but have inner products around 0 with the second singular
vector. Over all relations, the average projection on the first singular vector is 0.51
(semantic: 0.58; syntactic: 0.46), and the average on the second singular vector is
0.035. For example, Table 7.2 shows the mean similarities and standard deviations on
the first and second singular vectors for 4 relations. Similar results are also obtained for the word embeddings produced by GloVe and word2vec. Therefore, the first singular vector can be taken as the direction associated with the relation, while the other components are like random noise, in line with our model.
Cheating solver for analogy testbeds. The above linear structure suggests a
better (but cheating) way to solve the analogy task. This uses the fact that the same
semantic relationship (e.g., masculine-feminine, singular-plural) is tested many times
in the testbed. If a relation R is represented by a direction µR then the cheating
algorithm can learn this direction (via rank 1 SVD) after seeing a few examples of
the relationship. Then use the following method of solving “a:b::c:??”: look for a
relation   1            2            3            4            5            6            7
1st        0.65 ± 0.07  0.61 ± 0.09  0.52 ± 0.08  0.54 ± 0.18  0.60 ± 0.21  0.35 ± 0.17  0.42 ± 0.16
2nd        0.02 ± 0.28  0.00 ± 0.23  0.05 ± 0.30  0.06 ± 0.27  0.01 ± 0.24  0.07 ± 0.24  0.01 ± 0.25

relation   8            9            10           11           12           13           14
1st        0.56 ± 0.09  0.53 ± 0.08  0.37 ± 0.11  0.72 ± 0.10  0.37 ± 0.14  0.40 ± 0.19  0.43 ± 0.14
2nd        0.00 ± 0.22  0.01 ± 0.26  0.02 ± 0.20  0.01 ± 0.24  0.07 ± 0.26  0.07 ± 0.23  0.09 ± 0.23

Table 7.2: The verification of relation directions on 2 semantic and 2 syntactic relations in the GOOGLE testbed. Relations include cap-com: capital-common-countries; cap-wor: capital-world; adj-adv: gram1-adjective-to-adverb; opp: gram2-opposite. For each relation, take vab = va − vb for pairs (a, b) in the relation, and then calculate the top singular vectors of the matrix formed by these vab's. The rows labeled "1st"/"2nd" show the cosine similarities of individual vab to the 1st/2nd singular vector (mean and standard deviation).
              SN    GloVe  CBOW  skip-gram
w/o RD        0.71  0.73   0.74  0.70
RD (k = 20)   0.74  0.77   0.79  0.75
RD (k = 30)   0.79  0.80   0.82  0.80
RD (k = 40)   0.76  0.80   0.80  0.77

Table 7.3: The accuracy of the RD algorithm (i.e., the cheater method) on the GOOGLE testbed. The RD algorithm is described in the text. For comparison, the row "w/o RD" shows the accuracy of the old method without using RD.
word d such that vc − vd has the largest projection on µR, the relation direction for
(a, b). This can boost success rates by about 10%.
The testbed can try to combat such cheating by giving analogy questions in a
random order. But the cheating algorithm can just cluster the presented analogies to
learn which of them are in the same relation. Thus the final algorithm, named analogy
solver with relation direction (RD), is: take all vectors va − vb for all the word pairs
(a, b) presented among the analogy questions and do k-means clustering on them; for
each (a, b), estimate the relation direction by taking the first singular vector of its
cluster, and substitute that for va− vb in (7.5.1) when solving the analogy. Table 7.3
shows the performance on GOOGLE with different values of k; e.g. using our SN
vectors and k = 30 leads to 0.79 accuracy. Thus future designers of analogy testbeds
should remember not to test the same relationship too many times! This still leaves
                 SN    GloVe  CBOW  skip-gram
w/o RD-nn        0.71  0.73   0.74  0.70
RD-nn (k = 10)   0.71  0.74   0.77  0.73
RD-nn (k = 20)   0.72  0.75   0.77  0.74
RD-nn (k = 30)   0.73  0.76   0.78  0.74

Table 7.4: The accuracy of the RD-nn algorithm on the GOOGLE testbed. The algorithm is described in the text. For comparison, the row "w/o RD-nn" shows the accuracy of the old method without using RD-nn.
other ways to cheat, such as learning the directions for interesting semantic relations
from other collections of analogies.
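The core of RD, estimating µR by a rank-1 SVD of the difference vectors and then scoring candidates by alignment with it, can be sketched as follows (assuming NumPy; the k-means clustering step is omitted and the relation's pairs are taken as given; "projection" is implemented here as cosine similarity, one reasonable reading of the text):

```python
import numpy as np

def relation_direction(diff_vecs):
    """Estimate mu_R as the top singular vector of the matrix whose rows are
    the difference vectors v_a - v_b for pairs (a, b) in the relation."""
    M = np.asarray(diff_vecs, dtype=float)
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    mu = vt[0]
    if mu @ M.mean(axis=0) < 0:      # fix the sign ambiguity of the SVD
        mu = -mu
    return mu

def solve_with_direction(vecs, vocab, c, mu_R):
    """Pick the word d whose difference v_c - v_d best aligns with mu_R."""
    idx = {w: i for i, w in enumerate(vocab)}
    diffs = vecs[idx[c]] - vecs
    norms = np.linalg.norm(diffs, axis=1)
    norms[idx[c]] = np.inf           # exclude the query word itself
    return vocab[int(np.argmax((diffs @ mu_R) / norms))]
```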
Non-cheating solver for analogy testbeds. Now we show that even if a rela-
tionship is tested only once in the testbed, there is a way to use the above structure.
Given “a:b::c:??,” the solver first finds the top 300 nearest neighbors of a and those
of b, and then finds among these neighbors the top k pairs (a′, b′) so that the cosine
similarities between va′ − vb′ and va − vb are largest. Finally, the solver uses these pairs to estimate the relation direction (via rank-1 SVD), and substitutes this (corrected) estimate for va − vb in (7.5.1) when solving the analogy. This algorithm is named
analogy solver with relation direction by nearest neighbors (RD-nn). Table 7.4 shows
its performance, which consistently improves over the old method by about 3%.
7.6 Proof of Main Theorems and Lemmas
In this section we prove Theorem 7.2.2 and Lemma 7.2.1 (restated below).
Theorem 7.2.2. Suppose the word vectors satisfy equation (7.2.2), and the window size is q = 2. Then

log p(w,w′) = ‖vw + vw′‖²/(2d) − 2 logZ ± ε,        (7.6.1)

log p(w) = ‖vw‖²/(2d) − logZ ± ε,        (7.6.2)

for ε = O(εz) + O(1/d) + O(ε2). Jointly these imply:

PMI(w,w′) = 〈vw, vw′〉/d ± O(ε).        (7.6.3)
Lemma 7.2.1. If the word vectors satisfy the Bayesian prior v = s · v̂, where v̂ is drawn from the spherical Gaussian distribution and s is a scalar random variable, then with high probability the entire ensemble of word vectors satisfies

Prc∼C [(1 − εz)Z ≤ Zc ≤ (1 + εz)Z] ≥ 1 − δ,        (7.6.4)

for εz = O(1/√n) and δ = exp(−Ω(log² n)).
In this section, we first prove Theorem 7.2.2 using Lemma 7.2.1 and some helper lemmas. Lemma 7.2.1 will be proved in Section 7.6.1, and the helper lemmas will be proved in Section 7.6.2. Please see Section 7.2 of the main paper for the intuition behind the proof and a cleaner sketch without the technicalities.
We now begin the proof. Let c be the hidden discourse that determines the probability
of word w, and c′ be the next one that determines w′. We use p(c′|c) to denote the
Markov kernel (transition matrix) of the Markov chain. Let C be the stationary
distribution of discourse vector c, and D be the joint distribution of (c, c′). We
marginalize over the contexts c, c′ and then use the independence of w,w′ conditioned
on c, c′,
p(w,w′) = E_{(c,c′)∼D}[ ( exp(〈v_w, c〉)/Z_c ) · ( exp(〈v_{w′}, c′〉)/Z_{c′} ) ].   (7.6.5)
We first get rid of the partition function Zc using Lemma 7.2.1. As sketched in
the main paper, essentially we will replace Zc by Z in equation (7.6.5), though a very
careful control of the approximation error is required. Then we arrive at the following
claim.
Claim 7.6.1. Under the setting of Theorem 7.2.2,

log p(w,w′) = log( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z).
Proof of Claim 7.6.1. Formally, let F₁ be the event that c satisfies

(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z.   (7.6.6)

Similarly, let F₂ be the event that c′ satisfies (1 − ε_z)Z ≤ Z_{c′} ≤ (1 + ε_z)Z; let F = F₁ ∩ F₂, and let F̄ be its negation. Moreover, let 1_F be the indicator function of the event F. By Lemma 7.2.1 and the union bound, we have E[1_F] = Pr[F] ≥ 1 − exp(−Ω(log² n)).
We first decompose the integral (7.6.5) into two parts according to whether the event F happens:

p(w,w′) = E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ]
        + E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F̄ ].   (7.6.7)

We bound the first quantity on the right hand side using (7.2.2) and the definition of F:

E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ]
  ≤ (1 + ε_z)² (1/Z²) E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ].   (7.6.8)
For the second quantity on the right hand side of (7.6.7), we have by the Cauchy–Schwarz inequality,

( E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F̄ ] )²
  ≤ ( E_{(c,c′)∼D}[ (1/Z_c²) exp(〈v_w, c〉)² 1_F̄ ] ) · ( E_{(c,c′)∼D}[ (1/Z_{c′}²) exp(〈v_{w′}, c′〉)² 1_F̄ ] )
  ≤ ( E_c[ (1/Z_c²) exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ] ) · ( E_{c′}[ (1/Z_{c′}²) exp(〈v_{w′}, c′〉)² E_{c|c′}[1_F̄] ] ).   (7.6.9)

Using the fact that Z_c ≥ 1, we have

E_c[ (1/Z_c²) exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ] ≤ E_c[ exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ].

We can split this expectation as

E_c[ exp(〈v_w, c〉)² 1_{〈v_w,c〉>0} E_{c′|c}[1_F̄] ] + E_c[ exp(〈v_w, c〉)² 1_{〈v_w,c〉<0} E_{c′|c}[1_F̄] ].   (7.6.10)

The second term of (7.6.10) is upper bounded by

E_{c,c′}[1_F̄] ≤ exp(−Ω(log² n)).

We proceed to the first term of (7.6.10) and observe the following property of it:

E_c[ exp(〈v_w, c〉)² 1_{〈v_w,c〉>0} E_{c′|c}[1_F̄] ] ≤ E_c[ exp(〈αv_w, c〉)² 1_{〈v_w,c〉>0} E_{c′|c}[1_F̄] ] ≤ E_c[ exp(〈αv_w, c〉)² E_{c′|c}[1_F̄] ],

where α > 1. Therefore, it is sufficient to bound

E_c[ exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ]

when ‖v_w‖ = Ω(√d).
Let us denote by z the random variable 2〈v_w, c〉, and let r(z) = E_{c′|z}[1_F̄], which is a function of z taking values in [0, 1]. We wish to upper bound E_c[exp(z)r(z)]. The worst-case r(z) can be quantified using a continuous version of Abel's inequality, proven in Lemma 7.6.5, which gives

E_c[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ],   (7.6.11)

where t satisfies E_c[1_{[t,+∞)}(z)] = Pr[z ≥ t] = E_c[r(z)] ≤ exp(−Ω(log² n)). We then claim that Pr[z ≥ t] ≤ exp(−Ω(log² n)) implies that t ≥ Ω(log^{0.9} n).
If c were distributed as N(0, (1/d)I), this would be a simple tail bound. However, as c is distributed uniformly on the sphere, this requires special care, and the claim follows by applying Lemma 7.6.2 instead.
Finally, applying Corollary 7.6.4, we have

E[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} n)).   (7.6.12)
We have the same bound for c′ as well. Hence, for the second quantity on the right hand side of (7.6.7), we have

E_{(c,c′)∼D}[ (1/(Z_c Z_{c′})) exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F̄ ]
  ≤ ( E_c[ (1/Z_c²) exp(〈v_w, c〉)² E_{c′|c}[1_F̄] ] )^{1/2} · ( E_{c′}[ (1/Z_{c′}²) exp(〈v_{w′}, c′〉)² E_{c|c′}[1_F̄] ] )^{1/2}
  ≤ exp(−Ω(log^{1.8} n)),   (7.6.13)

where the first inequality follows from Cauchy–Schwarz, and the second from the calculation above.
Combining (7.6.7), (7.6.8), and (7.6.13), we obtain

p(w,w′) ≤ (1 + ε_z)² (1/Z²) E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ] + exp(−Ω(log^{1.8} n))
        ≤ (1 + ε_z)² (1/Z²) ( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] + δ₀ ),

where δ₀ = exp(−Ω(log^{1.8} n)) · Z² ≤ exp(−Ω(log^{1.8} n)), by the fact that Z ≤ exp(2κ)n = O(n). Note that κ is treated as an absolute constant throughout the paper. On the other hand, we can lower bound similarly

p(w,w′) ≥ (1 − ε_z)² (1/Z²) E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) 1_F ]
        ≥ (1 − ε_z)² (1/Z²) ( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] − δ₀ ).

Taking logarithms, the multiplicative error translates to an additive error:

log p(w,w′) = log( E_{(c,c′)∼D}[ exp(〈v_w, c〉) exp(〈v_{w′}, c′〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z).
This completes the proof of the claim.
To exploit the fact that c and c′ should be close to each other, we further rewrite log p(w,w′) by reorganizing the expectations above:

log p(w,w′) = log( E_c[ exp(〈v_w, c〉) E_{c′|c}[exp(〈v_{w′}, c′〉)] ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z)
            = log( E_c[ exp(〈v_w, c〉) A(c) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z),   (7.6.14)

where A(c) in the inner integral is defined as

A(c) := E_{c′|c}[ exp(〈v_{w′}, c′〉) ].
It can be bounded as in the following claim.
Claim 7.6.2. In the setting of Theorem 7.2.2, we have A(c) = (1 ± ε₂) exp(〈v_{w′}, c〉).
Proof of Claim 7.6.2. Since every word vector satisfies ‖v_w‖ ≤ κ√d, we have 〈v_{w′}, c − c′〉 ≤ ‖v_{w′}‖ ‖c − c′‖ ≤ κ√d ‖c − c′‖. Then we can bound A(c) by

A(c) = E_{c′|c}[ exp(〈v_{w′}, c′〉) ]
     = exp(〈v_{w′}, c〉) E_{c′|c}[ exp(〈v_{w′}, c′ − c〉) ]
     ≤ exp(〈v_{w′}, c〉) E_{c′|c}[ exp(κ√d ‖c − c′‖) ]
     ≤ (1 + ε₂) exp(〈v_{w′}, c〉),

where the last inequality follows from our model assumptions. To derive a lower bound on A(c), observe that

E_{c′|c}[ exp(κ√d ‖c − c′‖) ] + E_{c′|c}[ exp(−κ√d ‖c − c′‖) ] ≥ 2.

Therefore, our model assumptions imply that

E_{c′|c}[ exp(−κ√d ‖c − c′‖) ] ≥ 1 − ε₂.

Hence,

A(c) = exp(〈v_{w′}, c〉) E_{c′|c}[ exp(〈v_{w′}, c′ − c〉) ]
     ≥ exp(〈v_{w′}, c〉) E_{c′|c}[ exp(−κ√d ‖c − c′‖) ]
     ≥ (1 − ε₂) exp(〈v_{w′}, c〉).
This completes the proof of the claim.
Plugging the just-obtained estimate of A(c) into equation (7.6.14), we get

log p(w,w′) = log( E_c[ exp(〈v_w, c〉) A(c) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z)
            = log( E_c[ (1 ± ε₂) exp(〈v_w, c〉) exp(〈v_{w′}, c〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z)
            = log( E_c[ exp(〈v_w + v_{w′}, c〉) ] ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z) + log(1 ± ε₂).   (7.6.15)

Now it suffices to compute E_c[exp(〈v_w + v_{w′}, c〉)]. Note that if c had the distribution N(0, (1/d)I), which is very similar to the uniform distribution over the sphere, then we would straightforwardly get E_c[exp(〈v_w + v_{w′}, c〉)] = exp(‖v_w + v_{w′}‖²/(2d)). For c having the uniform distribution over the sphere, by Lemma 7.6.6 the same equality holds approximately:

E_c[ exp(〈v_w + v_{w′}, c〉) ] = (1 ± ε₃) exp( ‖v_w + v_{w′}‖²/(2d) ),   (7.6.16)

where ε₃ = O(1/d).

Plugging equation (7.6.16) into equation (7.6.15), we have

log p(w,w′) = log( (1 ± ε₃) exp(‖v_w + v_{w′}‖²/(2d)) ± δ₀ ) − 2 log Z + 2 log(1 ± ε_z) + log(1 ± ε₂)
            = ‖v_w + v_{w′}‖²/(2d) + O(ε₃) + O(δ₀′) − 2 log Z ± 2ε_z ± ε₂,

where δ₀′ = δ₀ · ( E_{c∼C}[exp(〈v_w + v_{w′}, c〉)] )^{−1} = exp(−Ω(log^{1.8} n)). Note that ε₃ = O(1/d), ε_z = O(1/√n), and ε₂ is small by assumption; therefore we obtain

log p(w,w′) = ‖v_w + v_{w′}‖²/(2d) − 2 log Z ± ( O(ε_z) + O(ε₂) + O(1/d) ).
7.6.1 Analyzing Partition Function Zc
In this subsection, we prove Lemma 7.2.1. We first prove that the means of Z_c are all (1 + o(1))-close to each other, and then prove that Z_c is concentrated around its mean. It turns out the concentration part is non-trivial because the random variable of concern, exp(〈v_w, c〉), is not well-behaved in terms of its tail: exp(〈v_w, c〉) is not sub-Gaussian for any variance proxy. This essentially disallows us from using an existing concentration inequality directly. We get around this issue by considering a truncated version of exp(〈v_w, c〉), which is bounded and has tail properties similar to the original in the regime we are concerned with.
We bound the mean and variance of Zc first in the lemma below.
Lemma 7.6.1. For any fixed unit vector c ∈ R^d, we have E[Z_c] ≥ n and V[Z_c] ≤ O(n).

Proof of Lemma 7.6.1. Recall that by definition

Z_c = Σ_w exp(〈v_w, c〉).
We fix the context c and view the v_w's as random variables throughout this proof. Recall that v_w is composed as v_w = s_w · v̂_w, where s_w is the scaling and v̂_w is drawn from a spherical Gaussian with identity covariance I_{d×d}. Let s be a random variable with the same distribution as s_w. We lower bound the mean of Z_c as follows:

E[Z_c] = n E[exp(〈v_w, c〉)] ≥ n E[1 + 〈v_w, c〉] = n,
where the last equality holds because of the symmetry of the spherical Gaussian distribution. On the other hand, to upper bound the mean of Z_c, we condition on the scaling s_w:

E[Z_c] = n E[exp(〈v_w, c〉)] = n E[ E[ exp(〈v_w, c〉) | s_w ] ].

Note that conditioned on s_w, 〈v_w, c〉 is a Gaussian random variable with variance σ² = s_w². Therefore,

E[ exp(〈v_w, c〉) | s_w ] = ∫ (1/(σ√(2π))) exp(−x²/(2σ²)) exp(x) dx
  = ∫ (1/(σ√(2π))) exp( −(x − σ²)²/(2σ²) + σ²/2 ) dx
  = exp(σ²/2).

It follows that

E[Z_c] = n E[exp(σ²/2)] = n E[exp(s_w²/2)] = n E[exp(s²/2)].
We calculate the variance of Z_c as follows:

V[Z_c] = Σ_w V[ exp(〈v_w, c〉) ] ≤ n E[ exp(2〈v_w, c〉) ] = n E[ E[ exp(2〈v_w, c〉) | s_w ] ].

By a very similar calculation as above, using the fact that 2〈v_w, c〉 is a Gaussian random variable with variance 4σ² = 4s_w²,

E[ exp(2〈v_w, c〉) | s_w ] = exp(2σ²).

Therefore, we have

V[Z_c] ≤ n E[ E[ exp(2〈v_w, c〉) | s_w ] ] = n E[exp(2σ²)] = n E[exp(2s²)] ≤ Λn,

for Λ = exp(8κ²) a constant; at the last step we used the fact that s ≤ κ almost surely.
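The Gaussian moment identity used twice above is easy to sanity-check numerically. The following Monte-Carlo sketch (with an arbitrary choice of σ, not taken from the text) estimates E[exp(g)] for g ∼ N(0, σ²) and compares it against the closed form exp(σ²/2):

```python
import math
import random

def mgf_check(sigma=1.5, trials=400000, seed=2):
    """Monte-Carlo estimate of E[exp(g)] for g ~ N(0, sigma^2), to be
    compared against the closed form exp(sigma^2 / 2) used in the proof."""
    rng = random.Random(seed)
    est = sum(math.exp(rng.gauss(0.0, sigma)) for _ in range(trials)) / trials
    return est, math.exp(sigma * sigma / 2.0)
```

With a few hundred thousand samples the estimate agrees with exp(σ²/2) to within a few percent.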
Now we are ready to prove Lemma 7.2.1.
Proof of Lemma 7.2.1. We fix the choice of c, and first prove concentration using the randomness of the v_w's. Note that exp(〈v_w, c〉) is neither sub-Gaussian nor sub-exponential (in fact, the Orlicz norm of the random variable exp(〈v_w, c〉) is not bounded). This prevents us from applying the usual concentration inequalities. The proof deals with this issue in a slightly more specialized manner.
Let us define F_w to be the event that |〈v_w, c〉| ≤ (1/2) log n. We claim that Pr[F_w] ≥ 1 − exp(−Ω(log² n)). Indeed, note that 〈v_w, c〉 | s_w has a Gaussian distribution with standard deviation s_w‖c‖ = s_w ≤ 2κ almost surely. Therefore, by the Gaussianity of 〈v_w, c〉 we have

Pr[ |〈v_w, c〉| ≥ (1/2) log n | s_w ] ≤ 2 exp( −Ω( (1/4) log² n / κ² ) ) = exp(−Ω(log² n)),

where Ω(·) hides the dependency on κ, which is treated as an absolute constant. Taking expectations over s_w, we obtain

Pr[F_w] = Pr[ |〈v_w, c〉| ≤ (1/2) log n ] ≥ 1 − exp(−Ω(log² n)).
Note that by definition, conditioned on F_w it holds in particular that exp(〈v_w, c〉) ≤ √n.

Let the random variable X_w have the same distribution as exp(〈v_w, c〉) | F_w. We prove that the random variable Z′_c = Σ_w X_w concentrates well. By convexity of the exponential function, the mean of Z′_c is lower bounded:

E[Z′_c] = n E[ exp(〈v_w, c〉) | F_w ] ≥ n exp( E[ 〈v_w, c〉 | F_w ] ) = n,

and the variance is upper bounded by

V[Z′_c] ≤ n E[ exp(〈v_w, c〉)² | F_w ] ≤ (1/Pr[F_w]) E[ exp(〈v_w, c〉)² ] ≤ (1/Pr[F_w]) Λn ≤ 1.1Λn,
where the second inequality uses the fact that

E[ exp(〈v_w, c〉)² ] = Pr[F_w] E[ exp(〈v_w, c〉)² | F_w ] + Pr[F̄_w] E[ exp(〈v_w, c〉)² | F̄_w ]
  ≥ Pr[F_w] E[ exp(〈v_w, c〉)² | F_w ].
Moreover, by definition, |X_w| ≤ √n for any w. Therefore, by Bernstein's inequality, we have

Pr[ |Z′_c − E[Z′_c]| > εn ] ≤ exp( − (ε²n²/2) / (1.1Λn + (1/3)√n · εn) ).

Note that E[Z′_c] ≥ n; therefore, for ε ≥ log² n/√n, we have

Pr[ |Z′_c − E[Z′_c]| > εE[Z′_c] ] ≤ Pr[ |Z′_c − E[Z′_c]| > εn ] ≤ exp( − (ε²n²/2) / (Λn + (1/3)√n · εn) )
  ≤ exp( −Ω( min{ ε²n/Λ, ε√n } ) )
  ≤ exp(−Ω(log² n)).
Let F̄ = ∪_w F̄_w be the union of the complements of the events F_w, so that F = ∩_w F_w. Then by the union bound, it holds that Pr[F̄] ≤ Σ_w Pr[F̄_w] ≤ n · exp(−Ω(log² n)) = exp(−Ω(log² n)). By definition, Z′_c has the same distribution as Z_c | F. Therefore, we have

Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F ] ≤ exp(−Ω(log² n)),   (7.6.17)

and therefore

Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] ] = Pr[F] · Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F ] + Pr[F̄] · Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F̄ ]
  ≤ Pr[ |Z_c − E[Z′_c]| > εE[Z′_c] | F ] + Pr[F̄]
  ≤ exp(−Ω(log² n)),   (7.6.18)

where at the last line we used the fact that Pr[F̄] ≤ exp(−Ω(log² n)) and equation (7.6.17).
Let Z = E[Z′_c] = n E[ exp(〈v_w, c〉) | |〈v_w, c〉| < (1/2) log n ] (note that E[Z′_c] depends only on the norm ‖c‖, which is equal to 1). Therefore, we obtain that with high probability over the randomness of the v_w's,

(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z.   (7.6.19)

Taking the expectation over the randomness of c, we have

Pr_{c, v_w}[ (7.6.19) holds ] ≥ 1 − exp(−Ω(log² n)).

Therefore, by a standard averaging argument (using Markov's inequality), we have

Pr_{v_w}[ Pr_c[ (7.6.19) holds ] ≥ 1 − exp(−Ω(log² n)) ] ≥ 1 − exp(−Ω(log² n)).

From now on we fix a choice of the v_w's so that Pr_c[ (7.6.19) holds ] ≥ 1 − exp(−Ω(log² n)) is true. Therefore, in the rest of the proof only c is a random variable, and with probability 1 − exp(−Ω(log² n)) over the randomness of c it holds that

(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z.   (7.6.20)
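Lemma 7.2.1 is also easy to probe empirically. The following sketch uses toy parameters (with s_w ≡ 1, so that v_w ∼ N(0, I_d)): it computes Z_c for several random discourse vectors c and reports the max/min ratio, which should be close to 1 for moderately large n:

```python
import math
import random

def partition_spread(n=2000, d=50, trials=20, seed=0):
    """Draw n word vectors from a spherical Gaussian, then compute
    Z_c = sum_w exp(<v_w, c>) for `trials` random unit vectors c and
    return max(Z_c) / min(Z_c) as a measure of concentration."""
    rng = random.Random(seed)
    words = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    zs = []
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        c = [x / norm for x in g]
        zs.append(sum(math.exp(sum(vi * ci for vi, ci in zip(v, c)))
                      for v in words))
    return max(zs) / min(zs)
```

For these parameters the spread is on the order of a few percent, in line with ε_z = O(1/√n).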
7.6.2 Helper Lemmas
The following lemmas are helper lemmas that were used in the proof above. We use
Cd to denote the uniform distribution over the unit sphere in Rd.
Lemma 7.6.2 (Tail bound for spherical distribution). If c ∼ C_d, v ∈ R^d is a vector with ‖v‖ = Ω(√d), and t = ω(1), then the random variable z = 〈v, c〉 satisfies Pr[z ≥ t] = e^{−O(t²)}.
Proof. If c = (c₁, c₂, …, c_d) ∼ C_d, then c is equal in distribution to (c₁/‖c‖, c₂/‖c‖, …, c_d/‖c‖), where the c_i are i.i.d. samples from a univariate Gaussian with mean 0 and variance 1/d. By spherical symmetry, we may assume that v = (‖v‖, 0, …, 0). Let us introduce the random variable r = Σ_{i=2}^d c_i². Since

Pr[〈v, c〉 ≥ t] = Pr[ ‖v‖c₁/‖c‖ ≥ t ]
  = Pr[ ‖v‖c₁/‖c‖ ≥ t | r ≤ 100 ] Pr[r ≤ 100] + Pr[ ‖v‖c₁/‖c‖ ≥ t | r > 100 ] Pr[r > 100],

it is sufficient to lower bound Pr[r ≤ 100] and Pr[ ‖v‖c₁/‖c‖ ≥ t | r ≤ 100 ]. The former probability is easily seen to be lower bounded by a constant by a Chernoff bound. Consider the latter one next. It holds that

Pr[ ‖v‖c₁/‖c‖ ≥ t | r ≤ 100 ] = Pr[ c₁ ≥ √( t²r/(‖v‖² − t²) ) | r ≤ 100 ]
  ≥ Pr[ c₁ ≥ √( 100t²/(‖v‖² − t²) ) ].

Denoting t̃ = √( 100t²/(‖v‖² − t²) ), by a well-known Gaussian tail bound it follows that

Pr[ c₁ ≥ t̃ ] = e^{−O(dt̃²)} ( 1/(√d t̃) − (1/(√d t̃))³ ) = e^{−O(t²)},

where the last equality holds since ‖v‖ = Ω(√d) and t = ω(1).
Lemma 7.6.3. If c ∼ C_d, v ∈ R^d is a vector with ‖v‖ = Θ(√d), and t = ω(1), then the random variable z = 〈v, c〉 satisfies

E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(t²)) + exp(−Ω(d)).
Proof. Similarly as in Lemma 7.6.2, if c = (c₁, c₂, …, c_d) ∼ C_d, then c is equal in distribution to (c₁/‖c‖, c₂/‖c‖, …, c_d/‖c‖), where the c_i are i.i.d. samples from a univariate Gaussian with mean 0 and variance 1/d. Again, by spherical symmetry, we may assume v = (‖v‖, 0, …, 0). Let us introduce the random variable r = Σ_{i=2}^d c_i². Then, for an arbitrary u > 1, some algebraic manipulation shows

Pr[ exp(〈v, c〉)1_{[t,+∞)}(〈v, c〉) ≥ u ] = Pr[ exp(〈v, c〉) ≥ u ∧ 〈v, c〉 ≥ t ]
  = Pr[ exp(‖v‖c₁/‖c‖) ≥ u ∧ ‖v‖c₁/‖c‖ ≥ t ]
  = Pr[ c₁ ≥ max( √( ū²r/(‖v‖² − ū²) ), √( t²r/(‖v‖² − t²) ) ) ],   (7.6.21)

where we denote ū = log u. Since c₁ is a mean-0 univariate Gaussian with variance 1/d and ‖v‖ = Ω(√d), conditioned on r we have, for all x,

Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) ] = O( e^{−Ω(x²r)} ).

Next, we show that r is lower bounded by a constant with probability 1 − exp(−Ω(d)). Indeed, r is equal in distribution to (1/d)χ²_{d−1}, where χ²_k is a chi-squared distribution with k degrees of freedom. Standard concentration bounds [104] imply that for all ξ ≥ 0, Pr[ r − 1 ≤ −2√(ξ/d) ] ≤ exp(−ξ). Taking ξ = αd for a constant α implies that with probability 1 − exp(−Ω(d)), r ≥ M for some constant M. We can now rewrite

Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) ]
  = Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) | r ≥ M ] Pr[r ≥ M] + Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) | r < M ] Pr[r < M].

The first term is clearly bounded by e^{−Ω(x²)} and the second by exp(−Ω(d)). Therefore,

Pr[ c₁ ≥ √( x²r/(‖v‖² − x²) ) ] = O( max( exp(−Ω(x²)), exp(−Ω(d)) ) ).   (7.6.22)
Putting (7.6.21) and (7.6.22) together, we get

Pr[ exp(〈v, c〉)1_{[t,+∞)}(〈v, c〉) ≥ u ] = O( exp( −Ω( min( d, (max(ū, t))² ) ) ) )   (7.6.23)

(where again we denote ū = log u).

For any random variable X with non-negative support, it is easy to check that

E[X] = ∫₀^∞ Pr[X ≥ x] dx.

Hence,

E[ exp(z)1_{[t,+∞)}(z) ] = ∫₀^∞ Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du
  = ∫₀^{exp(‖v‖)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du.

To bound this integral, we split into the following two cases:

• Case t² ≥ d: max(ū, t) ≥ t, so min(d, (max(ū, t))²) = d. Hence, (7.6.23) implies

E[ exp(z)1_{[t,+∞)}(z) ] ≤ exp(‖v‖) exp(−Ω(d)) = exp(−Ω(d)),

where the last equality follows since ‖v‖ = O(√d).

• Case t² < d: we split the integral into two portions: u ∈ [0, exp(t)] and u ∈ [exp(t), exp(‖v‖)].

When u ∈ [0, exp(t)], max(ū, t) = t, so min(d, (max(ū, t))²) = t². Hence,

∫₀^{exp(t)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du ≤ exp(t) exp(−Ω(t²)) = exp(−Ω(t²)).

When u ∈ [exp(t), exp(‖v‖)], max(ū, t) = ū. But ū ≤ log(exp(‖v‖)) = O(√d), so min(d, (max(ū, t))²) = ū². Hence,

∫_{exp(t)}^{exp(‖v‖)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du ≤ ∫_{exp(t)}^{exp(‖v‖)} exp(−Ω((log u)²)) du.

Making the change of variables ū = log u, we can rewrite the last integral as

∫_t^{‖v‖} exp(−Ω(ū²)) exp(ū) dū = O(exp(−Ω(t²))),

where the last equality is the usual Gaussian tail bound.

In either case, we get that

∫₀^{exp(‖v‖)} Pr[ exp(z)1_{[t,+∞)}(z) ≥ u ] du = exp(−Ω(t²)) + exp(−Ω(d)),

which is what we want.
As a corollary to the above lemma, we get the following:
Corollary 7.6.4. If c ∼ C_d, v ∈ R^d is a vector with ‖v‖ = Θ(√d), and t = Ω(log^{0.9} n), then

E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} n)).
Proof. We claim the proof is trivial if d = o(log⁴ n). Indeed, in this case, exp(〈v, c〉) ≤ exp(‖v‖) = exp(O(√d)). Hence,

E[ exp(z)1_{[t,+∞)}(z) ] ≤ exp(O(√d)) E[1_{[t,+∞)}(z)] = exp(O(√d)) Pr[z ≥ t].

Since by Lemma 7.6.2, Pr[z ≥ t] ≤ exp(−Ω(log² n)), we get

E[ exp(z)1_{[t,+∞)}(z) ] = exp( O(√d) − Ω(log² n) ) = exp(−Ω(log^{1.8} n)),

as we wanted.

So, we may without loss of generality assume that d = Ω(log⁴ n). In this case, Lemma 7.6.3 implies

E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} n)) + exp(−Ω(d)) = exp(−Ω(log^{1.8} n)),

where the last equality holds because d = Ω(log⁴ n) and t² = Ω(log^{1.8} n), so we get the claim we wanted.
Lemma 7.6.5 (Continuous Abel's Inequality). Let 0 ≤ r(x) ≤ 1 be a function such that E[r(x)] = ρ. Moreover, suppose the increasing function u(x) satisfies E[|u(x)|] < ∞. Let t be the real number such that E[1_{[t,+∞)}(x)] = ρ. Then we have

E[u(x)r(x)] ≤ E[ u(x)1_{[t,+∞)}(x) ].   (7.6.24)
Proof. Let f denote the density of x. Let G(z) = ∫_z^∞ f(x)r(x) dx and H(z) = ∫_z^∞ f(x)1_{[t,+∞)}(x) dx. Then we have G(z) ≤ H(z) for all z. Indeed, for z ≥ t this is trivial since r(x) ≤ 1; for z ≤ t, we have H(z) = E[1_{[t,+∞)}(x)] = ρ = E[r(x)] ≥ ∫_z^∞ f(x)r(x) dx = G(z). Then, by integration by parts, we have

∫_{−∞}^∞ u(x)f(x)r(x) dx = −∫_{−∞}^∞ u(x) dG
  = −u(x)G(x) |_{−∞}^{∞} + ∫_{−∞}^{+∞} G(x)u′(x) dx
  ≤ ∫_{−∞}^{+∞} H(x)u′(x) dx
  = ∫_{−∞}^∞ u(x)f(x)1_{[t,+∞)}(x) dx,

where at the third line we use the facts that u(x)G(x) → 0 as x → ±∞ and that u′(x) ≥ 0, and at the last line we integrate by parts again.
Lemma 7.6.6. Let v ∈ R^d be a fixed vector with norm ‖v‖ ≤ κ√d for an absolute constant κ. Then for the random variable c with uniform distribution over the sphere, we have

log E[exp(〈v, c〉)] = ‖v‖²/(2d) ± ε,   (7.6.25)

where ε = O(1/d).
Proof. Let g ∼ N(0, I); then g/‖g‖ has the same distribution as c. Let r = ‖v‖. Since c is spherically symmetric, we can assume without loss of generality that v = (r, 0, …, 0). Let x = g₁ and y = √(g₂² + ⋯ + g_d²). Then x ∼ N(0, 1), and y² has a χ² distribution with mean d − 1 and variance O(d).

Let F be the event that x ≤ 20 log d and 1.5√d ≥ y ≥ 0.5√d. Note that Pr[F] ≥ 1 − exp(−Ω(log^{1.8} d)). By Proposition 7.6.4, we have that

E[exp(〈v, c〉)] = E[exp(〈v, c〉) | F] · (1 ± exp(−Ω(log^{1.8} d))).

Conditioned on the event F, we have

E[exp(〈v, c〉) | F] = E[ exp( rx/√(x² + y²) ) | F ]
  = E[ exp( rx/y − rx³/( y√(x² + y²)(y + √(x² + y²)) ) ) | F ]
  = E[ exp(rx/y) · exp( −rx³/( y√(x² + y²)(y + √(x² + y²)) ) ) | F ]
  = E[ exp(rx/y) | F ] · (1 ± O(log³ d / d)),   (7.6.26)

where we used the fact that r ≤ κ√d. Let E be the event that 1.5√d ≥ y ≥ 0.5√d. Using Proposition 7.6.3, we have that

E[exp(rx/y) | F] = E[exp(rx/y) | E] ± exp(−Ω(log² d)).   (7.6.27)

Then let z = y²/(d − 1) and w = z − 1, so that z has mean 1 and variance O(1/d), and w has mean 0 and variance O(1/d). Conditioned on y, the variable x is a standard Gaussian, so E[exp(rx/y) | y] = exp(r²/(2y²)). Therefore,

E[ exp(rx/y) | E ] = E[ E[exp(rx/y) | y] | E ] = E[ exp(r²/(2y²)) | E ]
  = E[ exp( (r²/(2(d − 1))) · (1/z) ) | E ]
  = E[ exp( (r²/(2(d − 1))) · (1 − w + w²/(1 + w)) ) | E ]
  = exp( r²/(2(d − 1)) ) · E[ exp( (r²/(2(d − 1))) · (−w + w²/(1 + w)) ) | E ]
  = exp( r²/(2(d − 1)) ) · E[ 1 − O(w) ± O(w²) | E ]
  = exp( r²/(2(d − 1)) ) · (1 ± O(1/d)),

where the second-to-last line uses the facts that, conditioned on E, w is bounded below by a constant greater than −1 and r²/(2(d − 1)) = O(1), so the Taylor expansion approximates the exponential accurately, and the last line uses the facts that |E[w | E]| = O(1/d) and E[w² | E] = O(1/d). Finally, r²/(2(d − 1)) = r²/(2d) ± O(1/d) since r ≤ κ√d. Combining the series of approximations above completes the proof.
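Lemma 7.6.6 lends itself to a direct numerical check. The sketch below (toy dimension and sample size, chosen for speed) estimates E[exp(〈v, c〉)] for c uniform on the unit sphere by Monte-Carlo:

```python
import math
import random

def mean_exp_inner(v, trials=20000, seed=1):
    """Monte-Carlo estimate of E[exp(<v, c>)] for c uniform on the unit
    sphere in R^d (c is generated as a normalized Gaussian vector)."""
    rng = random.Random(seed)
    d = len(v)
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        total += math.exp(sum(vi * gi / norm for vi, gi in zip(v, g)))
    return total / trials
```

For v with ‖v‖² = 25 in d = 100 dimensions, the lemma predicts log E[exp(〈v, c〉)] ≈ 25/200 = 0.125 up to O(1/d).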
We finally provide the proofs of a few helper propositions on conditional expectations given high-probability events, used in the lemma above.
Proposition 7.6.3. Suppose x ∼ N(0, σ²) with σ = O(1). Then for any event E with Pr[E] ≥ 1 − exp(−Ω(log² d)), we have that E[exp(x)] = E[exp(x) | E] ± exp(−Ω(log² d)).
Proof. Let Ē denote the complement of the event E. We will consider the upper and lower bounds separately. Since

E[exp(x)] = E[exp(x) | E] Pr[E] + E[exp(x) | Ē] Pr[Ē],

we have

E[exp(x)] ≤ E[exp(x) | E] + E[exp(x) | Ē] Pr[Ē]   (7.6.28)

and

E[exp(x)] ≥ E[exp(x) | E] (1 − exp(−Ω(log² d))) ≥ E[exp(x) | E] − E[exp(x) | E] exp(−Ω(log² d)).   (7.6.29)
Consider the upper bound (7.6.28) first. To show the statement of the lemma, it suffices to bound E[exp(x) | Ē] Pr[Ē]. Working towards that, notice that

E[exp(x) | Ē] Pr[Ē] = E[exp(x)1_Ē] = E[ exp(x) E[1_Ē | x] ] = E[exp(x)r(x)],

where we denote r(x) = E[1_Ē | x]. We wish to upper bound E[exp(x)r(x)]. By Lemma 7.6.5, we have

E[exp(x)r(x)] ≤ E[ exp(x)1_{[t,∞)}(x) ],

where t is such that E[1_{[t,∞)}(x)] = E[r(x)]. However, since E[r(x)] = Pr[Ē] = exp(−Ω(log² d)), it must be the case that t = Ω(log d), by the standard Gaussian tail bound and the assumption that σ = O(1). In turn, this means

E[ exp(x)1_{[t,∞)}(x) ] = (1/(σ√(2π))) ∫_t^∞ e^x e^{−x²/(2σ²)} dx
  = e^{σ²/2} · (1/(σ√(2π))) ∫_t^∞ e^{−(x − σ²)²/(2σ²)} dx
  = e^{σ²/2} · Pr[x′ ≥ t],

where x′ is distributed as a univariate Gaussian with mean σ² and variance σ². Bearing in mind that σ = O(1),

e^{σ²/2} Pr[x′ ≥ t] = exp(−Ω(t²)) = exp(−Ω(log² d))

by the usual Gaussian tail bound, which proves the bound we need for (7.6.28).
We proceed to consider the lower bound (7.6.29). To show the statement of the lemma, we will bound E[exp(x) | E]. Notice trivially that since exp(x) ≥ 0,

E[exp(x) | E] ≤ E[exp(x)] / Pr[E].

Since Pr[E] ≥ 1 − exp(−Ω(log² d)), we have 1/Pr[E] ≤ 1 + O(exp(−Ω(log² d))). So it suffices to bound E[exp(x)]. However,

E[exp(x)] = (1/(σ√(2π))) ∫_{−∞}^{+∞} e^x e^{−x²/(2σ²)} dx
  = e^{σ²/2} · (1/(σ√(2π))) ∫_{−∞}^{+∞} e^{−(x − σ²)²/(2σ²)} dx
  = e^{σ²/2} = O(1),

since the last integrand is a Gaussian density, which integrates to 1. Putting this together with the estimate of 1/Pr[E], we get that E[exp(x) | E] = O(1). Plugging this back into (7.6.29), we get the desired lower bound.
Proposition 7.6.4. Suppose c ∼ C and v is an arbitrary vector with ‖v‖ = O(√d). Then for any event E with Pr[E] ≥ 1 − exp(−Ω(log² d)), we have that E[exp(〈v, c〉)] = E[exp(〈v, c〉) | E] ± exp(−Ω(log^{1.8} d)).
Proof of Proposition 7.6.4. Let z = 〈v, c〉. We proceed similarly as in the proof of Proposition 7.6.3. We have

E[exp(z)] = E[exp(z) | E] Pr[E] + E[exp(z) | Ē] Pr[Ē],

and hence

E[exp(z)] ≤ E[exp(z) | E] + E[exp(z) | Ē] Pr[Ē]   (7.6.30)

and

E[exp(z)] ≥ E[exp(z) | E] Pr[E] ≥ E[exp(z) | E] − E[exp(z) | E] exp(−Ω(log² d)).   (7.6.31)

We again proceed by separating the upper and lower bounds.

We first consider the upper bound (7.6.30). Notice that

E[exp(z) | Ē] Pr[Ē] = E[exp(z)1_Ē].

We can split the last expression as

E[ exp(〈v, c〉)1_{〈v,c〉>0}1_Ē ] + E[ exp(〈v, c〉)1_{〈v,c〉<0}1_Ē ].

The second term is upper bounded by

E[1_Ē] ≤ exp(−Ω(log² d)).

We proceed to the first term and observe the following property of it:

E[ exp(〈v, c〉)1_{〈v,c〉>0}1_Ē ] ≤ E[ exp(〈αv, c〉)1_{〈v,c〉>0}1_Ē ] ≤ E[ exp(〈αv, c〉)1_Ē ],

where α > 1. Therefore, it is sufficient to bound

E[exp(z)1_Ē]

when ‖v‖ = Θ(√d). Let us denote r(z) = E[1_Ē | z]. Using Lemma 7.6.5, we have that

E_c[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ],   (7.6.32)

where t satisfies E_c[1_{[t,+∞)}(z)] = Pr[z ≥ t] = E_c[r(z)] ≤ exp(−Ω(log² d)). We then claim that Pr[z ≥ t] ≤ exp(−Ω(log² d)) implies that t ≥ Ω(log^{0.9} d). Indeed, this follows by directly applying Lemma 7.6.2. Afterward, applying Lemma 7.6.3, we have:

E[exp(z)r(z)] ≤ E[ exp(z)1_{[t,+∞)}(z) ] = exp(−Ω(log^{1.8} d)),   (7.6.33)

which proves the upper bound we want.

We now proceed to the lower bound (7.6.31), which is again similar to the lower bound in the proof of Proposition 7.6.3: we just need to bound E[exp(z) | E]. As in Proposition 7.6.3, since exp(z) ≥ 0,

E[exp(z) | E] ≤ E[exp(z)] / Pr[E].

Consider the event E′ : z ≤ t, for t = Θ(log^{0.9} d), which by Lemma 7.6.2 satisfies Pr[E′] ≥ 1 − exp(−Ω(log² d)). By the upper bound we just showed,

E[exp(z)] ≤ E[exp(z) | E′] + exp(−Ω(log² d)) = O(exp(log^{0.9} d)),

where the last equality follows since conditioned on E′, z = O(log^{0.9} d). Finally, this implies

E[exp(z) | E] ≤ (1/Pr[E]) · O(exp(log^{0.9} d)) = O(exp(log^{0.9} d)),

where the last equality follows since Pr[E] ≥ 1 − exp(−Ω(log² d)). Putting this together with (7.6.31), we get

E[exp(z)] ≥ E[exp(z) | E] Pr[E] ≥ E[exp(z) | E] − E[exp(z) | E] exp(−Ω(log² d))
  ≥ E[exp(z) | E] − O(exp(log^{0.9} d)) exp(−Ω(log² d)) ≥ E[exp(z) | E] − exp(−Ω(log² d)),

which is what we needed.
7.7 Maximum Likelihood Estimator for Co-occurrence
Let L be the corpus size, and X_{w,w′} the number of times words w, w′ co-occur within a context of size 10 in the corpus. According to the model, the probability of this event at any particular time satisfies log p(w,w′) ∝ ‖v_w + v_{w′}‖²₂. Successive samples from a random walk are not independent, of course, but if the random walk mixes fairly quickly (and the mixing time of our random walk is related to the logarithm of the number of words), then the set of X_{w,w′}'s over all word pairs is distributed, up to a very close approximation, as a multinomial distribution Mul(L̃, {p(w,w′)}), where L̃ = Σ_{w,w′} X_{w,w′} is the total number of word pairs in consideration (roughly 10L).
Assuming this approximation, we show below that the maximum likelihood values for the word vectors correspond to the following optimization:

min_{{v_w}, C}  Σ_{w,w′} X_{w,w′} ( log(X_{w,w′}) − ‖v_w + v_{w′}‖²₂ − C )²   (Objective SN)
Now we give the derivation of the objective. According to the multinomial distribution, maximizing the likelihood of {X_{w,w′}} is equivalent to maximizing

ℓ = log( Π_{(w,w′)} p(w,w′)^{X_{w,w′}} ) = Σ_{(w,w′)} X_{w,w′} log p(w,w′).
To reason about the likelihood, denote the logarithm of the ratio between the expected count and the empirical count by

Δ_{w,w′} = log( L̃ p(w,w′) / X_{w,w′} ).

Note that

ℓ = Σ_{(w,w′)} X_{w,w′} log p(w,w′)
  = Σ_{(w,w′)} X_{w,w′} [ log( X_{w,w′}/L̃ ) + log( L̃ p(w,w′)/X_{w,w′} ) ]
  = Σ_{(w,w′)} X_{w,w′} log( X_{w,w′}/L̃ ) + Σ_{(w,w′)} X_{w,w′} log( L̃ p(w,w′)/X_{w,w′} )
  = c + Σ_{(w,w′)} X_{w,w′} Δ_{w,w′},   (7.7.1)

where we let c denote the constant Σ_{(w,w′)} X_{w,w′} log( X_{w,w′}/L̃ ). Furthermore, we have

L̃ = Σ_{(w,w′)} L̃ p(w,w′)
  = Σ_{(w,w′)} X_{w,w′} e^{Δ_{w,w′}}
  = Σ_{(w,w′)} X_{w,w′} ( 1 + Δ_{w,w′} + Δ_{w,w′}²/2 + O(|Δ_{w,w′}|³) ),

and also L̃ = Σ_{(w,w′)} X_{w,w′}. So

Σ_{(w,w′)} X_{w,w′} Δ_{w,w′} = −Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}²/2 + Σ_{(w,w′)} X_{w,w′} O(|Δ_{w,w′}|³).
Plugging this into (7.7.1) leads to

c − ℓ = Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}²/2 + Σ_{(w,w′)} X_{w,w′} O(|Δ_{w,w′}|³).   (7.7.2)

When the last term is much smaller than the first term on the right hand side, maximizing the likelihood is approximately equivalent to minimizing the first term on the right hand side, which is our objective:

Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}² ≈ Σ_{(w,w′)} X_{w,w′} ( ‖v_w + v_{w′}‖²₂/(2d) − log X_{w,w′} + log L̃ − 2 log Z )²,

where Z is the partition function.
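The objective is simple to write down in code. The following is a minimal sketch of Objective SN on toy data (hypothetical `vecs` and `counts`; a real implementation would fit the vectors and the constant C by stochastic gradient descent over corpus co-occurrence counts):

```python
import math

def sn_objective(vecs, C, counts):
    """Objective SN: sum over pairs of X * (log X - ||v_w + v_w'||^2 - C)^2.
    `vecs` maps words to vectors; `counts` maps pairs (w, w') to X_{w,w'}."""
    total = 0.0
    for (w1, w2), x in counts.items():
        sq_norm = sum((a + b) ** 2 for a, b in zip(vecs[w1], vecs[w2]))
        total += x * (math.log(x) - sq_norm - C) ** 2
    return total
```

If the counts exactly satisfy log X_{w,w′} = ‖v_w + v_{w′}‖² + C, the objective is zero.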
We now argue that the last term in (7.7.2) is much smaller than the first term on the right hand side. For a large X_{w,w′}, the quantity Δ_{w,w′} is close to 0, and thus the induced approximation error is small. Small X_{w,w′}'s only contribute a small fraction of the final objective, so we can safely ignore their errors. To see this, note that the objective Σ_{(w,w′)} X_{w,w′} Δ_{w,w′}² and the error term Σ_{(w,w′)} X_{w,w′} O(|Δ_{w,w′}|³) differ by a factor of |Δ_{w,w′}| for each X_{w,w′}. For large X_{w,w′}'s, |Δ_{w,w′}| ≪ 1, and thus their corresponding errors are much smaller than the objective. So we only need to consider the X_{w,w′}'s that are small constants. The co-occurrence counts obey a power law distribution (see, e.g., [140]). That is, if one sorts the X_{w,w′} in decreasing order, then the r-th value in the list is roughly

x[r] = k/r^{5/4},

where k is some constant. Some calculation shows that

L̃ ≈ 4k,    Σ_{X_{w,w′} ≤ x} X_{w,w′} ≈ 4k^{4/5}x^{1/5},

and thus when x is a small constant,

Σ_{X_{w,w′} ≤ x} X_{w,w′} / L̃ ≈ (4x/L̃)^{1/5} = O(1/L̃^{1/5}).

So only a negligible mass of the X_{w,w′}'s are small constants, and this mass vanishes as L̃ increases. Furthermore, we empirically observe that the relative error of our objective is about 5%, which means that the errors induced by the X_{w,w′}'s that are small constants are only a small fraction of the objective. Therefore, Σ_{w,w′} X_{w,w′} O(|Δ_{w,w′}|³) is small compared to the objective and can be safely ignored.
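The power-law calculation above can be checked numerically. The sketch below uses synthetic counts x[r] = k/r^{5/4} over a truncated rank range (so the agreement with the asymptotic formula is only up to a modest constant factor):

```python
def powerlaw_small_count_mass(k=1.0e4, r_max=1000000, x=3.0):
    """With synthetic counts x[r] = k / r^(5/4), return (a) the fraction of
    the total count mass coming from counts <= x, and (b) the prediction
    (4x / total)^(1/5) from the text."""
    total = 0.0
    small = 0.0
    for r in range(1, r_max + 1):
        v = k / r ** 1.25
        total += v
        if v <= x:
            small += v
    return small / total, (4.0 * x / total) ** 0.2
```

For these parameters the empirical small-count mass fraction and the predicted O(1/L̃^{1/5}) value agree to within a constant factor close to 1.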
7.8 Conclusions
A simple generative model has been introduced to explain the classical PMI based
word embedding models, as well as recent variants involving energy-based models and
matrix factorization. The model yields an optimization objective with essentially “no
knobs to turn”, yet the embeddings lead to good performance on analogy tasks, and
fit other predictions of our generative model. A model with fewer knobs to turn should
be seen as a better scientific explanation (Occam’s razor), and certainly makes the
embeddings more interpretable.
The spatial isotropy of word vectors is both an assumption in our model, and also
a new empirical finding of our paper. We feel it may help with further development
of language models. It is important for explaining the success of solving analogies via
low dimensional vectors (relations=lines). It also implies that semantic relation-
262
ships among words manifest themselves as special directions among word embeddings
(Section 7.4), which lead to a cheater algorithm for solving analogy testbeds.
Our model is tailored to capturing semantic similarity, more akin to a log-linear
dynamic topic model. In particular, local word order is unimportant. Designing
similar generative models (with provable and interpretable properties) with linguistic
features is left for future work.
Chapter 8
Mathematical Tools
In this chapter, we collect mathematical tools that are used in this thesis and may be of independent interest. The concentration inequalities and spectral perturbation bounds in Section 8.1 and Section 8.2 appear in many other machine learning and statistical settings. In particular, Corollary 8.1.4 and Theorem 8.2.5 are straightforward extensions of the matrix Bernstein inequality and Wedin's theorem, but they are user-friendly in many machine learning settings.
8.1 Concentration Inequalities
This section contains a collection of known technical results which are useful in proving
the concentration bounds in various chapters of this thesis.
8.1.1 Hoeffding’s inequality and Bernstein’s inequalities
Theorem 8.1.1 (Bernstein Inequality [28], cf. [27]). Let X₁, …, X_n be independent real-valued variables with finite variances σ_i² = V[X_i], bounded in the sense that |X_i − E[X_i]| ≤ M. Let σ² = Σ_i σ_i². Then we have

Pr[ | Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] | > t ] ≤ 2 exp( −t² / (2σ² + (2/3)Mt) ).

As a consequence, for any d ≥ 1 and C > 0, we have that with probability at least 1 − d^{−C},

| Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] | ≲ CM log d + σ√(C log d).   (8.1.1)
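As a quick numerical illustration (a toy simulation, not part of the original text), one can compare the empirical tail of a sum of bounded variables with the Bernstein bound:

```python
import math
import random

def bernstein_bound(t, sigma2, M):
    """Right-hand side of the Bernstein tail bound above."""
    return 2.0 * math.exp(-t * t / (2.0 * sigma2 + (2.0 / 3.0) * M * t))

def empirical_tail(t, n=200, reps=5000, seed=0):
    """Fraction of runs with |sum_i X_i| > t for X_i uniform on [-1, 1]
    (so E[X_i] = 0, M = 1 and sigma_i^2 = 1/3)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        if abs(s) > t:
            hits += 1
    return hits / reps
```

At roughly three standard deviations the empirical tail sits comfortably below the Bernstein bound, as expected.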
Up to constant factors, Hoeffding's inequality can be seen as a corollary of Bernstein's inequality when the variance term $\sigma^2$ is bounded trivially by $nM^2$ using the uniform bound $M$.
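The tail bound (8.1.1) is easy to probe empirically. The following sketch (our illustration, not part of the thesis; the Rademacher distribution, the threshold $t$, and the sample sizes are arbitrary choices) compares the empirical tail of a sum of $\pm 1$ variables, for which $M = 1$ and $\sigma^2 = n$, against the Bernstein bound:

```python
import numpy as np

# Empirical sanity check of Bernstein's inequality (Theorem 8.1.1) for
# centered, bounded variables: X_i uniform on {-1, +1} (Rademacher),
# so M = 1 and sigma_i^2 = 1, hence sigma^2 = n.
rng = np.random.default_rng(0)
n, trials, t = 100, 20000, 25.0

sums = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)
empirical_tail = np.mean(np.abs(sums) > t)

M, sigma2 = 1.0, float(n)
bernstein_bound = 2 * np.exp(-t**2 / (2 * sigma2 + (2.0 / 3.0) * M * t))

print(empirical_tail, bernstein_bound)
assert empirical_tail <= bernstein_bound  # the bound dominates the empirical tail
```

The bound is loose here, as expected: Bernstein's inequality holds for every bounded distribution, not just the Rademacher one.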
Theorem 8.1.2 (Hoeffding's inequality [85]). Let $X_1, \ldots, X_n$ be independent real-valued random variables whose fluctuations are bounded in the sense that $|X_i - \mathbb{E}[X_i]| \le M$ almost surely. Then we have
$$\Pr\left[\left|\sum_{i=1}^n X_i - \mathbb{E}\left[\sum_{i=1}^n X_i\right]\right| > t\right] \le \exp\left(-\frac{t^2}{2M^2 n}\right)\,.$$
Theorem 8.1.3 (Matrix Bernstein Inequality [90], cf. [167]). Let $X_1, \ldots, X_n$ be independent matrix random variables with common dimension $d_1 \times d_2$. We assume that
$$\mathbb{E}[X_i] = 0, \quad \text{and} \quad \|X_i\| \le M \text{ a.s.}, \quad \forall i \in [n]\,. \quad (8.1.2)$$
Define $\sigma > 0$ by
$$\sigma^2 = \max\left\{\left\|\sum_{i=1}^n X_i X_i^\top\right\|, \left\|\sum_{i=1}^n X_i^\top X_i\right\|\right\}\,. \quad (8.1.3)$$
Then, we have that
$$\Pr\left[\left\|\sum_{i=1}^n X_i\right\| > t\right] \le (d_1 + d_2)\exp\left(-\frac{t^2}{2\sigma^2 + \frac{2}{3}Mt}\right)\,. \quad (8.1.4)$$
As a consequence, for any $d \ge 1$ and $C > 0$, we have that with probability at least $1 - d^{-C}$,
$$\left\|\sum_{i=1}^n X_i\right\| \lesssim CM\log d + \sigma\sqrt{C\log d}\,. \quad (8.1.5)$$
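As with the scalar case, the matrix bound can be sanity-checked numerically. The sketch below (ours, not from the thesis; the choice $X_i = \epsilon_i A_i$ with Rademacher signs $\epsilon_i$ and fixed Gaussian matrices $A_i$ is only an illustrative instance) computes $M$, $\sigma$, and the right-hand side of (8.1.5) with the hidden constant set to 1, and compares it with $\|\sum_i X_i\|$ on one draw:

```python
import numpy as np

# Illustration of the matrix Bernstein inequality (Theorem 8.1.3), with
# X_i = eps_i * A_i for Rademacher signs eps_i and fixed matrices A_i,
# so that E[X_i] = 0 and ||X_i|| <= M = max_i ||A_i||.
rng = np.random.default_rng(1)
n, d1, d2, C = 50, 8, 6, 2.0

A = rng.standard_normal((n, d1, d2))
M = max(np.linalg.norm(A[i], 2) for i in range(n))
# sigma^2 as in (8.1.3); for X_i = eps_i A_i, X_i X_i^T = A_i A_i^T deterministically.
sigma2 = max(np.linalg.norm(sum(A[i] @ A[i].T for i in range(n)), 2),
             np.linalg.norm(sum(A[i].T @ A[i] for i in range(n)), 2))
sigma = np.sqrt(sigma2)

d = d1 + d2
# Right-hand side of (8.1.5), read with the hidden constant set to 1.
bound = C * M * np.log(d) + sigma * np.sqrt(C * np.log(d))

eps = rng.choice([-1.0, 1.0], size=n)
op_norm = np.linalg.norm(np.tensordot(eps, A, axes=1), 2)
print(op_norm, bound)
assert op_norm <= bound  # holds with high probability, not deterministically
```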
Corollary 8.1.4. Let $X_1, \ldots, X_n$ be independent matrix random variables with common dimension $d_1 \times d_2$. We assume that they are zero centered, that is, $\mathbb{E}[X_i] = 0$ for all $i \in [n]$. Suppose there exist $M > 0$, $\varepsilon \in [0,1]$, and $\delta > 0$ such that
$$\Pr\left[\|X_i\| \ge M\right] \le \varepsilon, \quad \text{and} \quad \left\|\mathbb{E}\left[X_i \mathbf{1}(\|X_i\| \ge M)\right]\right\| \le \delta, \quad \forall i \in [n]\,. \quad (8.1.6)$$
Define $\sigma > 0$ by
$$\sigma^2 = \max\left\{\left\|\sum_{i=1}^n X_i X_i^\top\right\|, \left\|\sum_{i=1}^n X_i^\top X_i\right\|\right\}\,. \quad (8.1.7)$$
Then, for any $d \ge 1$ and $C > 0$, we have that with probability at least $1 - d^{-C} - n\varepsilon$,
$$\left\|\sum_{i=1}^n X_i\right\| \lesssim CM\log d + \sigma\sqrt{C\log d} + n\delta\,. \quad (8.1.8)$$
Typically, $\varepsilon$ and $\delta$ in Corollary 8.1.4 will be chosen to be very small, so that the conclusion of Corollary 8.1.4 is not much different from that of Theorem 8.1.3.
Proof. Let $Z_i = X_i \mathbf{1}(\|X_i\| \le M)$. Then we have that
$$\max\left\{\left\|\sum_{i=1}^n Z_i Z_i^\top\right\|, \left\|\sum_{i=1}^n Z_i^\top Z_i\right\|\right\} \le \max\left\{\left\|\sum_{i=1}^n X_i X_i^\top\right\|, \left\|\sum_{i=1}^n X_i^\top X_i\right\|\right\} = \sigma^2,$$
and $\|Z_i\| \le M$ a.s. Applying the matrix Bernstein inequality (Theorem 8.1.3) to $\sum_i Z_i$ gives that with probability at least $1 - d^{-C}$,
$$\left\|\sum_{i=1}^n Z_i - \mathbb{E}\left[\sum_i Z_i\right]\right\| \lesssim CM\log d + \sigma\sqrt{C\log d}\,. \quad (8.1.9)$$
Note that $\mathbb{E}\left[\sum_i Z_i\right] = -\mathbb{E}\left[\sum_i X_i \mathbf{1}(\|X_i\| \ge M)\right]$, and therefore by equation (8.1.6) we have that $\left\|\mathbb{E}\left[\sum_i Z_i\right]\right\| \le n\delta$. Therefore, by the triangle inequality, we have that
$$\left\|\sum_{i=1}^n Z_i\right\| \le \left\|\sum_{i=1}^n Z_i - \mathbb{E}\left[\sum_i Z_i\right]\right\| + n\delta \lesssim CM\log d + \sigma\sqrt{C\log d} + n\delta\,.$$
Note that with probability at least $1 - n\varepsilon$ we have $\|X_i\| \le M$, and thus $X_i = Z_i$, for all $i \in [n]$. Thus, by another union bound and the triangle inequality, we have that with probability at least $1 - d^{-C} - n\varepsilon$, equation (8.1.8) holds. $\square$
8.1.2 Sub-Gaussian and Sub-exponential Random Variables
In this subsection, we give the formal definitions of sub-Gaussian and sub-exponential random variables, which are used heavily in our analysis. We also summarize their properties.
Definition 8.1.5 (c.f. [171, Definition 2.1]). A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-Gaussian with variance proxy $\sigma^2$ if
$$\mathbb{E}\left[e^{\lambda(X-\mu)}\right] \le e^{\frac{\sigma^2\lambda^2}{2}}, \quad \forall \lambda \in \mathbb{R}\,. \quad (8.1.10)$$
Definition 8.1.6 (c.f. [171, Definition 2.2]). A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-exponential if there are non-negative parameters $(\nu, b)$ such that
$$\mathbb{E}\left[e^{\lambda(X-\mu)}\right] \le e^{\frac{\nu^2\lambda^2}{2}}, \quad \forall \lambda \text{ with } |\lambda| < 1/b\,.$$
A sum of independent sub-Gaussian random variables remains sub-Gaussian, with variance proxy equal to the sum of the variance proxies of the summands.
Lemma 8.1.7 (c.f. [171]). Suppose independent random variables $X_1, \ldots, X_n$ are sub-Gaussian with variance proxies $\sigma_1^2, \ldots, \sigma_n^2$ respectively. Then $X_1 + \cdots + X_n$ is a sub-Gaussian random variable with variance proxy $\sigma_*^2$, where $\sigma_* = \sqrt{\sum_{k\in[n]} \sigma_k^2}$.
A sum of independent sub-exponential random variables remains sub-exponential (with different parameters).
Lemma 8.1.8 (c.f. [171]). Suppose independent random variables $X_1, \ldots, X_n$ are sub-exponential with parameters $(\nu_1, b_1), \ldots, (\nu_n, b_n)$ respectively. Then $X_1 + \cdots + X_n$ is a sub-exponential random variable with parameters $(\nu_*, b_*)$, where $\nu_* = \sqrt{\sum_{k\in[n]} \nu_k^2}$ and $b_* = \max_k b_k$.
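Definition 8.1.5 and Lemma 8.1.7 can be verified exactly for Rademacher variables, whose moment generating function is $\cosh(\lambda) \le e^{\lambda^2/2}$ (so the variance proxy is $1$, and a sum of $n$ of them has variance proxy $n$). A short numerical check of these two facts (our illustration, not part of the thesis):

```python
import numpy as np

# Check Definition 8.1.5 and Lemma 8.1.7 on Rademacher variables:
# E[e^{lambda X}] = cosh(lambda) <= e^{lambda^2/2}, so X is sub-Gaussian with
# variance proxy 1, and a sum of n independent copies has variance proxy n.
lambdas = np.linspace(-5, 5, 101)
n = 7

single_mgf = np.cosh(lambdas)               # exact MGF of one Rademacher variable
single_bound = np.exp(lambdas**2 / 2)       # sub-Gaussian bound with sigma^2 = 1
assert np.all(single_mgf <= single_bound + 1e-12)

sum_mgf = np.cosh(lambdas) ** n             # MGF of the independent sum (product of MGFs)
sum_bound = np.exp(n * lambdas**2 / 2)      # Lemma 8.1.7: proxy sigma_*^2 = n
assert np.all(sum_mgf <= sum_bound + 1e-12)
print("MGF bounds verified")
```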
8.2 Spectral Perturbation Theorems
In this section, we list some standard spectral perturbation inequalities that are helpful in machine learning and for many results in this thesis. Most of them can be found in the book by Stewart and Sun [161].
Given $\widetilde{A} = A + E$ as a perturbed version of $A$, Weyl's theorem [174] and Mirsky's theorem [126] bound the perturbations of the individual singular values:
Theorem 8.2.1 (Weyl’s theorem [174] and Mirsky’s Theorem [126], c.f. [160]). Let
A ∈ Rm×n with m ≥ n and σk(·) denotes the k-th largest singular value. Suppose
A = A+ E. Then,
σk(A)− ‖E‖ ≤ σk(A) ≤ σk(A) + ‖E‖, ∀i = 1, . . . , n
268
Moreover,
n∑k=1
(σk(A)− σk(A)
)2
≤ ‖E‖2F .
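Since Theorem 8.2.1 is deterministic, it can be checked directly on random instances (our illustration, not part of the thesis; the dimensions and noise scale are arbitrary):

```python
import numpy as np

# Numerical check of Weyl's and Mirsky's theorems (Theorem 8.2.1):
# singular values move by at most ||E|| individually (Weyl), and by at
# most ||E||_F in the joint Euclidean sense (Mirsky).
rng = np.random.default_rng(0)
m, n = 9, 5
A = rng.standard_normal((m, n))
E = 0.1 * rng.standard_normal((m, n))
A_tilde = A + E

s = np.linalg.svd(A, compute_uv=False)
s_tilde = np.linalg.svd(A_tilde, compute_uv=False)

op = np.linalg.norm(E, 2)
fro = np.linalg.norm(E, 'fro')

assert np.all(np.abs(s_tilde - s) <= op + 1e-10)        # Weyl's theorem
assert np.sum((s_tilde - s) ** 2) <= fro**2 + 1e-10     # Mirsky's theorem
print("Weyl and Mirsky bounds hold")
```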
The singular vector perturbation is bounded by Wedin's theorem. Towards stating it, we first recall the definition of principal angles between subspaces (PABS) [101].

Definition 8.2.2. Suppose $\mathcal{X}$ and $\mathcal{Y}$ are two subspaces of $\mathbb{R}^n$ of dimension $p$ and $q$ respectively. Let $X \in \mathbb{R}^{n\times p}$ and $Y \in \mathbb{R}^{n\times q}$ be orthonormal bases of $\mathcal{X}$ and $\mathcal{Y}$ respectively. Then the vector of principal angles between the subspaces $\mathcal{X}$ and $\mathcal{Y}$, denoted by $\Theta(\mathcal{X}, \mathcal{Y})$, is the vector of dimension $m = \min(p, q)$ such that
$$\cos(\Theta(\mathcal{X}, \mathcal{Y})) = S(X^\top Y) \quad (8.2.1)$$
where $\cos(\cdot)$ is the entry-wise cosine function and $S(X^\top Y)$ denotes the list of singular values of $X^\top Y$ in decreasing order.
Remark 8.2.1. Other characterizations of principal angles between subspaces can be found, e.g., in [101]. When $p = q = 1$, the principal angle coincides with the angle between the two vectors.
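Definition 8.2.2 translates directly into code: take orthonormal bases, compute the singular values of $X^\top Y$, and apply $\arccos$ entry-wise. A sketch (ours, not part of the thesis; `principal_angles` is a hypothetical helper name):

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles between the column spans of orthonormal bases X, Y
    (Definition 8.2.2): arccos of the singular values of X^T Y."""
    s = np.linalg.svd(X.T @ Y, compute_uv=False)     # decreasing order
    return np.arccos(np.clip(s, -1.0, 1.0))          # angles, increasing order

rng = np.random.default_rng(0)
n = 6
X, _ = np.linalg.qr(rng.standard_normal((n, 3)))     # orthonormal basis, p = 3
Y, _ = np.linalg.qr(rng.standard_normal((n, 2)))     # orthonormal basis, q = 2

theta = principal_angles(X, Y)
assert theta.shape == (2,)                           # m = min(p, q) angles
assert np.all(theta >= -1e-12) and np.all(theta <= np.pi / 2 + 1e-12)

# p = q = 1 recovers the usual angle between two vectors (Remark 8.2.1).
u = np.array([[1.0], [0.0]])
v = np.array([[1.0], [1.0]]) / np.sqrt(2)
assert np.isclose(principal_angles(u, v)[0], np.pi / 4)
print("principal angles:", theta)
```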
Wedin’s Theorem bounds the principal angles between the subspaces of the singular
vectors of A and its perturbations. The following one is the original and strongest
form of the theorem. We will state a weaker version later that is more convenient for
machine learning applications.
Theorem 8.2.3 (Wedin’s Theorem [172]; c.f. [160]). Given matrices A,E ∈ Rm×n
with m ≥ n. Let A have the singular value decomposition
A = [U1, U2, U3]
Σ1 0
0 Σ2
0 0
[V1, V2]>. (8.2.2)
Let A = A+ E has singular vector decomposition
A = [U1, U2, U3]
Σ1 0
0 Σ2
0 0
[V1, V2]>. (8.2.3)
Let Φ = Θ(U1, U1) and Ψ = Θ(V1, V1).1 Suppose that there exists some δ > 0 such
that
mini,j|[Σ1]i,i − [Σ2]j,j| ≥ δ, and min
i,i|[Σ1]i,i| ≥ δ,
Let R = AV1 − U1Σ1 and S = A>U1 − V1Σ1. Then,
‖ sin(Φ)‖2 + ‖ sin(Ψ)‖2 ≤ ‖R‖2F + ‖S‖2
F
δ2≤ 2‖E‖2
F
δ2.
In many applications, we have a good understanding of the spectra of $A$ and $E$, but not of the spectrum of $\widetilde{A} = A + E$. Thus it would be ideal if the condition of the theorem involved only the spectra of $A$ and $E$. Such a theorem can be obtained as a straightforward extension of Weyl's theorem and Wedin's theorem.
¹We note that the notation used here is slightly different from that in [160]. Here $\Phi$ and $\Psi$ are vectors that contain the principal angles, and the norm $\|\cdot\|$ below is the Euclidean norm of the vector.
Theorem 8.2.4 (User-friendly version of Wedin's Theorem). Given matrices $A, E \in \mathbb{R}^{m\times n}$ with $m \ge n$, let $\widetilde{A} = A + E$, and let $\widetilde{A}$ and $A$ have the singular value decompositions (8.2.3) and (8.2.2) respectively. Suppose that
$$\min_i\,[\Sigma_1]_{i,i} - \max_i\,[\Sigma_2]_{i,i} \ge \delta > \|E\|\,.$$
Let $\Phi = \Theta(U_1, \widetilde{U}_1)$ and $\Psi = \Theta(V_1, \widetilde{V}_1)$. Then,
$$\|\sin(\Phi)\|^2 + \|\sin(\Psi)\|^2 \le \frac{2\|E\|_F^2}{(\delta - \|E\|)^2}\,.$$
Proof. Using Weyl’s Theorem, we have that mini[Σ1]i,i ≥ mini[Σ1]i,i − ‖E‖. Then
applying Wedin’s Theorem we complete the proof.
Remark 8.2.2. We remark that Theorem 8.2.4 is as strong as the original Wedin's theorem up to a universal constant factor. This is because when $\delta \ge 2\|E\|$, we have that $\|\sin(\Phi)\|^2 + \|\sin(\Psi)\|^2 \le 8\|E\|_F^2/\delta^2$; otherwise we can directly bound $\|\sin(\Phi)\|^2 + \|\sin(\Psi)\|^2$ by $2$.
The perturbation bounds above depend on the Frobenius norm of the matrix $E$. The following version of the theorem depends instead on the spectral norm of $E$.
Theorem 8.2.5 (User-friendly version of Wedin's Theorem with spectral norm). Given matrices $A, E \in \mathbb{R}^{m\times n}$ with $m \ge n$, let $\widetilde{A} = A + E$, and let $\widetilde{A}$ and $A$ have the singular value decompositions (8.2.3) and (8.2.2) respectively. Suppose that
$$\min_i\,[\Sigma_1]_{i,i} - \max_i\,[\Sigma_2]_{i,i} \ge \delta > \|E\|\,.$$
Let $\Phi = \Theta(U_1, \widetilde{U}_1)$ and $\Psi = \Theta(V_1, \widetilde{V}_1)$. Then,
$$\max\left\{\|\sin(\Phi)\|_\infty, \|\sin(\Psi)\|_\infty\right\} \le \frac{\|E\|}{\delta - \|E\|}\,.$$
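Theorem 8.2.4 can be checked on a concrete instance (our illustration, not part of the thesis; the singular values are chosen to create a clear gap $\delta$, and the helper name is ours):

```python
import numpy as np

def sin_principal_angles(U, U_tilde):
    # Sines of the principal angles between the column spans (Definition 8.2.2).
    s = np.linalg.svd(U.T @ U_tilde, compute_uv=False)
    return np.sqrt(np.clip(1.0 - s**2, 0.0, None))

rng = np.random.default_rng(0)
m, n, k = 8, 6, 2
# A with a clear gap between the top-k singular values (Sigma_1) and the rest (Sigma_2).
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U0 @ np.diag([10.0, 9.0, 1.0, 0.9, 0.8, 0.7]) @ V0.T
E = 0.01 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A)
Ut, st, Vtt = np.linalg.svd(A + E)

delta = s[k - 1] - s[k]                      # min Sigma_1 - max Sigma_2
op = np.linalg.norm(E, 2)
assert delta > op                            # condition of Theorem 8.2.4

lhs = (np.sum(sin_principal_angles(U[:, :k], Ut[:, :k]) ** 2)
       + np.sum(sin_principal_angles(Vt[:k].T, Vtt[:k].T) ** 2))
rhs = 2 * np.linalg.norm(E, 'fro') ** 2 / (delta - op) ** 2   # Theorem 8.2.4
assert lhs <= rhs
print(lhs, rhs)
```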
Perturbation bounds for the pseudo-inverse. When we have a lower bound on $\sigma_{\min}(A)$, it is easy to obtain bounds on the perturbation of the pseudo-inverse.
Theorem 8.2.6 ([161, Theorem 3.4]). Consider the perturbation $B = A + E$ of a matrix $A \in \mathbb{R}^{m\times n}$. Assume that $\mathrm{rank}(A) = \mathrm{rank}(B) = n$. Then
$$\|B^\dagger - A^\dagger\| \le \sqrt{2}\,\|A^\dagger\|\,\|B^\dagger\|\,\|E\|\,.$$
Note that this theorem is not strong enough when the perturbation is only known
to be τ -spectrally bounded in our definition.
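Theorem 8.2.6 is also easy to check numerically (our illustration, not part of the thesis; the dimensions and perturbation scale are arbitrary):

```python
import numpy as np

# Numerical check of Theorem 8.2.6: with rank(A) = rank(B) = n,
# ||B^+ - A^+|| <= sqrt(2) * ||A^+|| * ||B^+|| * ||E||.
rng = np.random.default_rng(0)
m, n = 7, 4
A = rng.standard_normal((m, n))
E = 0.05 * rng.standard_normal((m, n))
B = A + E
assert np.linalg.matrix_rank(A) == n and np.linalg.matrix_rank(B) == n

Ap, Bp = np.linalg.pinv(A), np.linalg.pinv(B)
lhs = np.linalg.norm(Bp - Ap, 2)
rhs = (np.sqrt(2) * np.linalg.norm(Ap, 2)
       * np.linalg.norm(Bp, 2) * np.linalg.norm(E, 2))
assert lhs <= rhs
print(lhs, rhs)
```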
8.3 Auxiliary Lemmas
In this section, we collect a few auxiliary mathematical lemmas that are used in
various sections of this thesis.
The following lemma, regarding the inverse of the controllable canonical form, is used in Section 6.2.
Lemma 8.3.1. Let $B = e_n \in \mathbb{R}^{n\times 1}$ and $w \in \mathbb{C}$. Suppose that $A$ with $\rho(A) \cdot |w| < 1$ has the controllable canonical form $A = C(a)$. Then
$$(I - wA)^{-1}B = \frac{1}{p_a(w^{-1})}\begin{bmatrix} w^{-1} \\ w^{-2} \\ \vdots \\ w^{-n} \end{bmatrix}$$
where $p_a(x) = x^n + a_1 x^{n-1} + \cdots + a_n$ is the characteristic polynomial of $A$.
Proof. Let $v = (I - wA)^{-1}B$; then we have $(I - wA)v = B$. Note that $B = e_n$, and $I - wA$ is of the form
$$I - wA = \begin{bmatrix} 1 & -w & 0 & \cdots & 0 \\ 0 & 1 & -w & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & -w \\ a_n w & a_{n-1} w & a_{n-2} w & \cdots & 1 + a_1 w \end{bmatrix}\,. \quad (8.3.1)$$
Therefore we obtain that $v_k = w v_{k+1}$ for $1 \le k \le n-1$. That is, $v_k = v_0 w^{-k}$ with $v_0 = v_1 w$. Using the fact that $((I - wA)v)_n = 1$, we obtain that $v_0 = p_a(w^{-1})^{-1}$, where $p_a(\cdot)$ is the polynomial $p_a(x) = x^n + a_1 x^{n-1} + \cdots + a_n$. Hence $v_k = w^{-k}/p_a(w^{-1})$ for every $k$, which is exactly the claimed formula. $\square$
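Lemma 8.3.1 can be verified numerically by building the companion matrix explicitly (our illustration, not part of the thesis; the coefficients $a$ and the value of $w$ are arbitrary choices subject to $\rho(A) \cdot |w| < 1$):

```python
import numpy as np

# Check Lemma 8.3.1: A in controllable canonical form has ones on the
# superdiagonal and last row (-a_n, ..., -a_1); B = e_n.
rng = np.random.default_rng(0)
n = 4
a = rng.uniform(-0.3, 0.3, size=n)            # coefficients a_1, ..., a_n
w = 0.5                                       # chosen so that rho(A) * |w| < 1

A = np.diag(np.ones(n - 1), k=1)
A[-1, :] = -a[::-1]                           # last row: -a_n, ..., -a_1
assert np.max(np.abs(np.linalg.eigvals(A))) * abs(w) < 1

B = np.zeros(n)
B[-1] = 1.0                                   # B = e_n

lhs = np.linalg.solve(np.eye(n) - w * A, B)   # (I - wA)^{-1} B

p_a = np.polyval(np.concatenate(([1.0], a)), 1.0 / w)   # p_a(w^{-1})
rhs = np.array([w ** (-k) for k in range(1, n + 1)]) / p_a

assert np.allclose(lhs, rhs)
print(lhs)
```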
The following elementary claim was useful in the proof of Lemma 6.2.5 in Section 6.2.2.
Claim 8.3.2. Suppose $x_1, \ldots, x_n$ are independent random variables with mean $0$ and covariance matrix $I_d$, and $U_1, \ldots, U_n$ are fixed matrices. Then
$$\mathbb{E}\left[\left\|\sum_{k=1}^n U_k x_k\right\|^2\right] = \sum_{k=1}^n \|U_k\|_F^2\,.$$
Proof. We have that
$$\mathbb{E}\left[\left\|\sum_{k=1}^n U_k x_k\right\|^2\right] = \sum_{k,\ell} \mathbb{E}\left[\operatorname{tr}\left(U_k x_k x_\ell^\top U_\ell^\top\right)\right] = \sum_{k} \operatorname{tr}\left(U_k \mathbb{E}\left[x_k x_k^\top\right] U_k^\top\right) = \sum_{k=1}^n \|U_k\|_F^2\,. \qquad \square$$
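Claim 8.3.2 can be checked by Monte Carlo with Gaussian $x_k$ (our illustration, not part of the thesis; the dimensions and number of trials are arbitrary):

```python
import numpy as np

# Monte Carlo check of Claim 8.3.2: for independent x_k with mean 0 and
# covariance I_d, and fixed matrices U_k, E||sum_k U_k x_k||^2 = sum_k ||U_k||_F^2.
rng = np.random.default_rng(0)
n, d, m, trials = 3, 4, 5, 200000

U = rng.standard_normal((n, m, d))            # n fixed m x d matrices
exact = sum(np.linalg.norm(U[k], 'fro') ** 2 for k in range(n))

x = rng.standard_normal((trials, n, d))       # x_k ~ N(0, I_d), independent
v = np.einsum('kmd,tkd->tm', U, x)            # sum_k U_k x_k, one row per trial
estimate = np.mean(np.sum(v ** 2, axis=1))

assert abs(estimate - exact) / exact < 0.05   # loose Monte Carlo tolerance
print(exact, estimate)
```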
The following proposition regarding the size of an $\varepsilon$-net is useful in the proof of Theorem 5.7.1 in Section 5.7.

Proposition 8.3.3. For any $\zeta \in (0,1)$, there is a set $\Gamma$ of rank-$r$ $d\times d$ matrices such that for any rank-$r$ $d\times d$ matrix $X$ with Frobenius norm at most $1$, there is a matrix $\widehat{X} \in \Gamma$ with $\|X - \widehat{X}\|_F \le \zeta$. The size of $\Gamma$ is bounded by $(d/\zeta)^{O(dr)}$.
Proof. A standard construction of $\varepsilon$-nets shows that there is a set $P \subset \mathbb{R}^d$ of size $(d/\varepsilon)^{O(d)}$ such that for any $u$ with $\|u\| \le 1$, there is a $\widehat{u} \in P$ such that $\|u - \widehat{u}\| \le \varepsilon$. The same construction applies to matrices under the Frobenius norm, as that is the same as vectors under the $\ell_2$ norm.

Here we let $\varepsilon = 0.1\zeta$, and construct three sets $P_1, P_2, P_3$, where $P_1$ is an $\varepsilon$-net for $d \times r$ matrices with Frobenius norm at most $\sqrt{r}$, $P_2$ is an $\varepsilon$-net for $r \times r$ diagonal matrices with Frobenius norm at most $1$, and $P_3$ is an $\varepsilon$-net for $r \times d$ matrices with Frobenius norm at most $\sqrt{r}$.

Now we define $\Gamma = \{\widehat{U}\widehat{D}\widehat{V} \mid \widehat{U} \in P_1, \widehat{D} \in P_2, \widehat{V} \in P_3\}$. Clearly the size of $\Gamma$ is as promised. For any rank-$r$ $d\times d$ matrix $X$ with singular value decomposition $X = UDV$, we can find $\widehat{U} \in P_1$, $\widehat{D} \in P_2$, and $\widehat{V} \in P_3$ that are $\varepsilon$-close to $U, D, V$ respectively. Therefore $\widehat{U}\widehat{D}\widehat{V} \in \Gamma$ and
$$\|\widehat{U}\widehat{D}\widehat{V} - UDV\|_F \le 8\varepsilon \le \zeta\,. \qquad \square$$
The following proposition, which connects the $\ell_6$ norm to the $\ell_2$ norm, is useful in the proof of Lemma 5.4.5 in Section 5.4.

Proposition 8.3.4. Let $a_1, \ldots, a_r \ge 0$ and $C \ge 0$. Then $C^4(a_1^2 + \cdots + a_r^2) \ge a_1^6 + \cdots + a_r^6$ implies that $a_1^2 + \cdots + a_r^2 \le C^2 r$ and that $\max_i a_i \le C r^{1/6}$.
Proof. By the Cauchy–Schwarz inequality, we have
$$\left(\sum_{i=1}^r a_i^2\right)\left(\sum_{i=1}^r a_i^6\right) \ge \left(\sum_{i=1}^r a_i^4\right)^2 \ge \left(\frac{1}{r}\left(\sum_{i=1}^r a_i^2\right)^2\right)^2\,.$$
Using the assumption together with the inequality above, we have that $a_1^2 + \cdots + a_r^2 \le C^2 r$. Combined with the assumption, this implies that $a_1^6 + \cdots + a_r^6 \le C^6 r$, which implies that $\max_i a_i \le C r^{1/6}$. $\square$
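Proposition 8.3.4 can be stress-tested on random nonnegative vectors (our illustration, not part of the thesis; the entry range, $r$, and $C$ are arbitrary): whenever the premise holds, both conclusions must hold.

```python
import numpy as np

# Randomized check of Proposition 8.3.4: whenever the premise
# C^4 * sum(a_i^2) >= sum(a_i^6) holds, both conclusions must follow.
rng = np.random.default_rng(0)
r, C = 10, 1.5
satisfied = 0

for _ in range(1000):
    a = rng.uniform(0.0, 1.8, size=r)
    if C**4 * np.sum(a**2) >= np.sum(a**6):              # premise
        satisfied += 1
        assert np.sum(a**2) <= C**2 * r + 1e-9           # first conclusion
        assert np.max(a) <= C * r ** (1.0 / 6.0) + 1e-9  # second conclusion

assert satisfied > 0                                     # check is not vacuous
print(satisfied, "random instances satisfied the premise")
```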
Bibliography
[1] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. In COLT, pages 123–137, 2014.
[2] A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In arXiv:1309.1952, 2013.
[3] Naman Agarwal, Zeyuan Allen Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent, 2017.
[4] M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designingovercomplete dictionaries for sparse representation. In IEEE Trans. on SignalProcessing, pages 4311–4322, 2006.
[5] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncover-ing shared structures in multiclass classification. In Proceedings of the 24thinternational conference on Machine learning, pages 17–24. ACM, 2007.
[6] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y. Liu. A spectral algorithmfor latent dirichlet allocation. In NIPS, pages 926–934, 2012.
[7] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade. Atensor approach to learning mixed membership community models. Journal ofMachine Learning Research, 15(1):2239–2312, 2014.
[8] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and MatusTelgarsky. Tensor decompositions for learning latent variable models. Journalof Machine Learning Research, 15(1):2773–2832, 2014.
[9] Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In Proceedings of the Annual Meeting of the North AmericanChapter of the Association for Computational Linguistics, 2014.
[10] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent andovercomplete dictionaries. In COLT, pages 779–806, 2014.
[11] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable boundsfor learning some deep representations. In Proceedings of the 31th Interna-tional Conference on Machine Learning, ICML 2014, Beijing, China, 21-26June 2014, pages 584–592, 2014.
[12] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, DavidSontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modelingwith provable guarantees. In International Conference on Machine Learning,pages 280–288, 2013.
[13] Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, and Ankur Moitra.Provable algorithms for inference in topic models. In The 33rd Inter-national Conference on Machine Learning (ICML 2016). arXiv preprintarXiv:1605.08491, 2016.
[14] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, andneural algorithms for sparse coding. In Proceedings of The 28th Conference onLearning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 113–149,2015.
[15] Sanjeev Arora, Rong Ge, Tengyu Ma, and Andrej Risteski. Provable learning ofnoisy-or networks. In Proceedings of the 49th Annual ACM SIGACT Symposiumon Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23,2017, pages 1057–1066, 2017.
[16] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Alatent variable model approach to pmi-based word embeddings. Transactionsof the Association for Computational Linguistics, 4:385–399, 2016.
[17] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski.Linear algebraic structure of word senses, with applications to polysemy. Tech-nical report, ArXiV, 2016. http://arxiv.org/abs/1502.03520.
[18] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beatbaseline for sentence embeddings. In 5th International Conference on LearningRepresentations (ICLR 2017), 2017.
[19] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guar-antees for the EM algorithm: From population to sample-based analysis. CoRR,abs/1408.2156, 2014.
[20] Sivaraman Balakrishnan, Martin J Wainwright, Bin Yu, et al. Statistical guar-antees for the em algorithm: From population to sample-based analysis. TheAnnals of Statistics, 45(1):77–120, 2017.
[21] Pierre Baldi and Kurt Hornik. Neural networks and principal component analy-sis: Learning from examples without local minima. Neural networks, 2(1):53–58,1989.
[22] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rankapproach for semidefinite programs arising in synchronization and communitydetection. arXiv preprint arXiv:1602.04426, 2016.
[23] Boaz Barak, John Kelner, and David Steurer. Dictionary learning using sum-of-square hierarchy. 2014.
[24] Alexandre S. Bazanella, Michel Gevers, Ljubisa Miskovic, and Brian D.O. An-derson. Iterative minimization of h2 control performance criteria. Automatica,44:2549–2559, 2008.
[25] David Belanger and Sham M. Kakade. A linear dynamical system model fortext. In Proceedings of the 32nd International Conference on Machine Learning,2015.
[26] Yoshua Bengio, Holger Schwenk, Jean-Sebastien Senecal, Frederic Morin, andJean-Luc Gauvain. Neural probabilistic language models. In Innovations inMachine Learning. 2006.
[27] George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
[28] S. Bernstein. Theory of Probability, 1927.
[29] Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Atomic normdenoising with applications to line spectral estimation. IEEE Transactions onSignal Processing, 61(23):5987–5999, 2013.
[30] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global Optimality of LocalSearch for Low Rank Matrix Recovery. ArXiv e-prints, May 2016.
[31] Fischer Black and Myron Scholes. The pricing of options and corporate liabili-ties. Journal of Political Economy, 1973.
[32] D. Blei. Introduction to probabilistic topic models. Communications of theACM, pages 77–84, 2012.
[33] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of MachineLearning Research, pages 993–1022, 2003. Preliminary version in NIPS 2001.
[34] David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings ofthe 23rd International Conference on Machine Learning, 2006.
[35] Leon Bottou. On-line learning in neural networks. chapter On-line Learningand Stochastic Approximations, pages 9–42. Cambridge University Press, NewYork, NY, USA, 1998.
[36] Leon Bottou. Online algorithms and stochastic approximations. In David Saad,editor, Online Learning and Neural Networks. Cambridge University Press,Cambridge, UK, 1998. revised, oct 2012.
[37] Pavol Brunovsky. A classification of linear controllable systems. Kybernetika,06(3):(173)–188, 1970.
[38] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithmfor solving semidefinite programs via low-rank factorization. Mathematical Pro-gramming, 95(2):329–357, 2003.
[39] M. C. Campi and Erik Weyer. Finite sample properties of system identificationmethods. IEEE Transactions on Automatic Control, 47(8):1329–1334, 2002.
[40] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incompleteand inaccurate measurements. In Communications of Pure and Applied Math,pages 1207–1223, 2006.
[41] E. Candes and T. Tao. Decoding by linear programming. In IEEE Trans. onInformation Theory, pages 4203–4215, 2005.
[42] Emmanuel J Candes, Xiaodong Li, Yi Ma, and John Wright. Robust principalcomponent analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[43] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrievalvia wirtinger flow: Theory and algorithms. IEEE Transactions on InformationTheory, 61(4):1985–2007, 2015.
[44] Emmanuel J Candes and Benjamin Recht. Exact matrix completion via convexoptimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
[45] Emmanuel J Candes and Terence Tao. The power of convex relaxation:Near-optimal matrix completion. Information Theory, IEEE Transactions on,56(5):2053–2080, 2010.
[46] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Acceleratedmethods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
[47] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projectedgradient descent: General statistical and algorithmic guarantees. arXiv preprintarXiv:1509.03025, 2015.
[48] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems ofequations is nearly as easy as solving linear systems. In Advances in NeuralInformation Processing Systems, pages 739–747, 2015.
[49] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, andYann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[50] Kenneth Ward Church and Patrick Hanks. Word association norms, mutualinformation, and lexicography. Computational linguistics, 1990.
[51] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar.Spectral learning of latent-variable PCFGs. In Proceedings of the 50th AnnualMeeting of the Association for Computational Linguistics: Long Papers-Volume1, 2012.
[52] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[53] Ronan Collobert and Jason Weston. A unified architecture for natural languageprocessing: Deep neural networks with multitask learning. In Proceedings ofthe 25th International Conference on Machine Learning, 2008.
[54] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, SuryaGanguli, and Yoshua Bengio. Identifying and attacking the saddle point prob-lem in high-dimensional non-convex optimization. In Advances in neural infor-mation processing systems, pages 2933–2941, 2014.
[55] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bitmatrix completion. Information and Inference, 3(3):189–223, 2014.
[56] Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas,and Richard A. Harshman. Indexing by latent semantic analysis. Journal ofthe American Society for Information Science, 1990.
[57] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition.In IEEE Trans. on Information Theory, pages 2845–2862, 1999.
[58] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods foronline learning and stochastic optimization. The Journal of Machine LearningResearch, 2011.
[59] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping:part i. Robotics & Automation Magazine, IEEE, 13(2):99–110, 2006.
[60] Diego Eckhard and Alexandre Sanfelice Bazanella. On the global convergenceof identification of output error models. In Proc. 18th IFAC World congress,2011.
[61] Michael Elad. Sparse and Redundant Representations: From Theory to Appli-cations in Signal and Image Processing. Springer Publishing Company, Incor-porated, 1st edition, 2010.
[62] K. Engan, S. Aase, and J. Hakon-Husoy. Method of optimal directions for framedesign. In ICASSP, pages 2443–2446, 1999.
[63] T. Estermann. Complex numbers and functions. Athlone Press, 1962.
[64] Maryam Fazel, Haitham Hindi, and S Boyd. Rank minimization and applica-tions in system theory. In Proc. American Control Conference, volume 4, pages3273–3278. IEEE, 2004.
[65] Maryam Fazel, Haitham Hindi, and Stephen P Boyd. A rank minimiza-tion heuristic with application to minimum order system approximation. InProc. American Control Conference, volume 6, pages 4734–4739. IEEE, 2001.
[66] John Rupert Firth. A synopsis of linguistic theory. 1957.
[67] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddlepoints—online stochastic gradient for tensor decomposition. In Proceedings ofThe 28th Conference on Learning Theory, pages 797–842, 2015.
[68] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of gaus-sians in high dimensions. In Proceedings of the Forty-Seventh Annual ACM onSymposium on Theory of Computing, STOC 2015, Portland, OR, USA, June14-17, 2015, pages 761–770, 2015.
[69] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spuriouslocal minimum. Advances in Neural Information Processing Systems (NIPS),2016.
[70] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclideanembedding of co-occurrence data. Journal of Machine Learning Research, 2007.
[71] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedingsof the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[72] Moritz Hardt. Understanding alternating minimization for matrix completion.In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Sympo-sium on, pages 651–660. IEEE, 2014.
[73] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In 5th Inter-national Conference on Learning Representations (ICLR 2017), 2017.
[74] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns lineardynamical systems. CoRR, abs/1609.05191, 2016.
[75] Moritz Hardt and Mary Wootters. Fast matrix completion without the conditionnumber. In Conference on Learning Theory, pages 638–678, 2014.
[76] Tatsunori B. Hashimoto, David Alvarez-Melis, and Tommi S. Jaakkola. Wordembeddings as metric recovery in semantic spaces. Transactions of the Associ-ation for Computational Linguistics, 2016.
[77] Trevor Hastie, Rahul Mazumder, Jason Lee, and Reza Zadeh. Matrix comple-tion and low-rank svd via fast alternating least squares. Journal of MachineLearning Research, 2014.
[78] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. Beyond Convexity: StochasticQuasi-Convex Optimization. ArXiv e-prints, July 2015.
[79] Elad Hazan and Tengyu Ma. A non-generative framework and convex relax-ations for unsupervised learning. In Neural Information Processing Systems(NIPS), 2016, 2016.
[80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learn-ing for image recognition. In Proceedings of the IEEE conference on computervision and pattern recognition, pages 770–778, 2016.
[81] Christiaan Heij, Andre Ran, and Freek van Schagen. Introduction to mathe-matical systems theory : linear systems, identification and control. Birkhauser,Basel, Boston, Berlin, 2007.
[82] Joao P Hespanha. Linear systems theory. Princeton university press, 2009.
[83] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are np-hard. J.ACM, 60(6):45, 2013.
[84] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. NeuralComputation, 9(8):1735–1780, 1997.
[85] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[86] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of theFifteenth Conference on Uncertainty in Artificial Intelligence, 1999.
[87] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians:moment methods and spectral decompositions. In Proceedings of the 4th con-ference on Innovations in Theoretical Computer Science, pages 11–20, 2013.
[88] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm forlearning hidden markov models. Journal of Computer and System Sciences,2012.
[89] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for quadraticforms of subgaussian random vectors. Electron. Commun. Probab, 17(52):1–6,2012.
[90] R. Imbuzeiro Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. ArXiv e-prints, November 2009.
[91] R. Imbuzeiro Oliveira. Sums of random Hermitian matrices and an inequalityby Rudelson. ArXiv e-prints, April 2010.
[92] Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finitesamples. In Proceedings of The 28th Conference on Learning Theory, pages1007–1034, 2015.
[93] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix com-pletion using alternating minimization. In Proceedings of the forty-fifth annualACM symposium on Theory of computing, pages 665–674. ACM, 2013.
[94] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods.ArXiv e-prints, June 2015.
[95] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesistesting for high-dimensional regression. Journal of Machine Learning Research,15(1):2869–2909, 2014.
[96] Kenji Kawaguchi. Deep learning without poor local minima. In Advances inNeural Information Processing Systems, pages 586–594, 2016.
[97] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrixcompletion from a few entries. Information Theory, IEEE Transactions on,56(6):2980–2998, 2010.
[98] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix com-pletion from noisy entries. The Journal of Machine Learning Research, 11:2057–2078, 2010.
[99] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
[100] Jon M. Kleinberg and Mark Sandler. Using mixture models for collaborativefiltering. J. Comput. Syst. Sci., 74(1):49–69, 2008.
[101] Andrew V. Knyazev and Peizhen Zhu. Principal angles between subspaces and their tangents. arXiv preprint, 2012.
[102] Yehuda Koren. The bellkor solution to the netflix grand prize. Netflix prizedocumentation, 81, 2009.
[103] Nicholas Kottenstette and Panos J Antsaklis. Relationships between positivereal, passive dissipative, & positive systems. In American Control Conference(ACC), 2010, pages 409–416. IEEE, 2010.
[104] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional bymodel selection. Ann. Statist., 28(5):1302–1338, 10 2000.
[105] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperime-try and processes, volume 23. Springer Science & Business Media, 2013.
[106] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradientdescent converges to minimizers. University of California, Berkeley, 1050:16,2016.
[107] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and designof optimization algorithms via integral quadratic constraints. SIAM Journal onOptimization, 26(1):57–95, 2016.
[108] Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings ofThe 30th International Conference on Machine Learning, pages 1–9, 2013.
[109] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrixfactorization. In Advances in Neural Information Processing Systems (NIPS),2015.
[110] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicitword representations. In Proceedings of the Eighteenth Conference on Compu-tational Natural Language Learning, 2014.
[111] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrixfactorization. In Advances in Neural Information Processing Systems, 2014.
[112] M. Lewicki and T. Sejnowski. Learning overcomplete representations. In NeuralComputation, pages 337–365, 2000.
[113] Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee ofweighted low-rank approximation via alternating minimization. arXiv preprintarXiv:1602.02262, 2016.
[114] Lennart Ljung. System Identification. Theory for the user. Prentice Hall, UpperSaddle River, NJ, 2nd edition, 1998.
[115] Po-Ling Loh and Martin J Wainwright. Support recovery without incoherence:A case for nonconvex regularization. arXiv preprint arXiv:1412.5632, 2014.
[116] Po-Ling Loh and Martin J. Wainwright. Regularized m-estimators with noncon-vexity: statistical and algorithmic theory for local optima. Journal of MachineLearning Research, 16:559–616, 2015.
[117] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y.Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InThe 49th Annual Meeting of the Association for Computational Linguistics,2011.
[118] S. Mallat. A wavelet tour of signal processing. In Academic-Press, 1998.
[119] Yariv Maron, Michael Lamar, and Elie Bienenstock. Sphere embedding: Anapplication to part-of-speech induction. In Advances in Neural InformationProcessing Systems, 2010.
[120] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularizationalgorithms for learning large incomplete matrices. Journal of machine learningresearch, 11(Aug):2287–2322, 2010.
[121] Alexandre Megretski. Convex optimization in robust identification of nonlinearfeedback. In Proceedings of the 47th Conference on Decision and Control, 2008.
[122] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estima-tion of word representations in vector space. Proceedings of the InternationalConference on Learning Representations, 2013.
[123] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean.Distributed representations of words and phrases and their compositionality. InAdvances in Neural Information Processing Systems, 2013.
[124] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and JeffreyDean. Distributed representations of words and phrases and their composition-ality. In Advances in Neural Information Processing Systems (NIPS), 2015.
[125] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
[126] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960.
[127] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[128] Ankur Moitra and Michael E. Saks. A polynomial time algorithm for lossy population recovery. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 110–116, 2013.
[129] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In the 51st Annual Symposium on the Foundations of Computer Science (FOCS), 2010.
[130] Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
[131] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.
[132] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.
[133] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[134] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton methodand its global performance. Mathematical Programming, 108(1):177–205, 2006.
[135] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[136] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004.
[137] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998.
[138] Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pages 698–712, 1990.
[139] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[140] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, 2014.
[141] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[142] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
[143] Ali Rahimi, Ben Recht, and Trevor Darrell. Learning appearance manifolds from video. In Proc. IEEE CVPR, 2005.
[144] Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20, pages 1185–1192, 2007.
[145] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
[146] Jasson D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.
[147] Douglas L. T. Rohde, Laura M. Gonnerman, and David C. Plaut. An improved model of semantic similarity based on lexical co-occurrence. Communications of the Association for Computing Machinery, 2006.
[148] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. 2017.
[149] David E. Rumelhart, Geoffrey E. Hinton, and James L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1986.
[150] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 1988.
[151] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2332–2341, 2015.
[152] A. C. Schaeffer. Inequalities of A. Markoff and S. Bernstein for polynomials and related functions. Bulletin of the American Mathematical Society, 47(8):565–579, 1941.
[153] Parikshit Shah, Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Linear system identification via atomic norm regularization. In Proceedings of the 51st Conference on Decision and Control, 2012.
[154] Torsten Soderstrom and Petre Stoica. Some properties of the output error method. Automatica, 18(1):93–99, 1982.
[155] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
[156] D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. Journal of Machine Learning Research, 2012.
[157] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML, 2003.
[158] Nathan Srebro, Jason Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2004.
[159] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer, 2005.
[160] Gilbert W. Stewart. Perturbation theory for the singular value decomposition. Technical report, 1998.
[161] G. W. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662, 1977.
[162] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
[163] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.
[164] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
[165] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. 27th NIPS, pages 3104–3112, 2014.
[166] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[167] J. A. Tropp. An introduction to matrix concentration inequalities. arXiv e-prints, January 2015.
[168] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
[169] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 2010.
[170] M. Vidyasagar and Rajeeva L. Karandikar. A learning theory approach to system identification and stochastic adaptive control. Journal of Process Control, 18(3):421–430, 2008.
[171] Martin Wainwright. Basic tail and concentration bounds, 2015.
[172] Per-Ake Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
[173] Erik Weyer and M. C. Campi. Finite sample properties of system identification methods. In Proceedings of the 38th Conference on Decision and Control, 1999.
[174] H. Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71:441–479, 1912.
[175] Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, 2009.
[176] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.
[177] Qinqing Zheng and John Lafferty. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.