
HAL Id: tel-02433016
https://tel.archives-ouvertes.fr/tel-02433016v2
Submitted on 18 Jun 2020

On efficient methods for high-dimensional statistical estimation

Dmitry Babichev

To cite this version: Dmitry Babichev. On efficient methods for high-dimensional statistical estimation. Machine Learning [stat.ML]. Université Paris sciences et lettres, 2019. English. NNT: 2019PSLEE032. tel-02433016v2.


"The best angle from which to approach any problem is the try-angle."

Author Unknown


Abstract

In this thesis we consider several aspects of parameter estimation for statistics and machine learning, and optimization techniques applicable to these problems. The goal of parameter estimation is to find the unknown hidden parameters which govern the data, for example the parameters of an unknown probability density. The construction of estimators through optimization problems is only one side of the coin: finding the optimal value of the parameter is often itself an optimization problem that needs to be solved using various techniques. Fortunately, these optimization problems are convex for a wide class of problems, and we can exploit their structure to obtain fast convergence rates.

The first main contribution of the thesis is to develop moment-matching techniques for multi-index non-linear regression problems. We consider the classical non-linear regression problem, which is infeasible in high dimensions due to the curse of dimensionality; we therefore assume a model which states that the data is in fact a nonlinear function of several linear projections of the data. We combine two existing techniques, the average derivative estimator (ADE) and Sliced Inverse Regression (SIR), to develop a hybrid method (SADE) without some of the weak sides of its parents: it works both for multi-index models and under weak assumptions on the data distribution. We also extend this method to higher-order moments. We provide a theoretical analysis of the constructed estimators for both the finite-sample and population cases.

In the second main contribution we use a special type of averaging for stochastic gradient descent. We consider generalized linear models for conditional exponential families (such as logistic regression), where the goal is to find the unknown value of the parameter. Classical approaches, such as stochastic gradient descent (SGD) with a constant step size, are known to converge only to some neighborhood of the optimal value of the parameter, even with Ruppert-Polyak averaging. We propose averaging the moment parameters, which we call prediction functions. For finite-dimensional models this type of averaging can surprisingly lead to a negative error, i.e., this approach provides us with an estimator better than any linear estimator can ever achieve. For infinite-dimensional models our approach converges to the optimal prediction, while parameter averaging never does.

The third main contribution of this thesis deals with Fenchel-Young losses. We consider multi-class linear classifiers with losses of a certain type, such that their dual conjugate has a direct product of simplices as support. The corresponding saddle-point convex-concave formulation has a special form with a bilinear matrix term, and classical approaches suffer from the time-consuming multiplication of matrices. We show that for multi-class SVM losses, under a mild regularity assumption and with smart matrix-multiplication sampling techniques, our approach has an iteration complexity which is sublinear in the size of the data. This means that to perform one iteration we only pay the price O(n + d + k), for k the number of classes, d the number of features and n the number of samples, whereas all existing techniques use at least one of nd, nk or dk arithmetic operations per iteration. This is possible due to the right choice of geometries and the use of a mirror descent approach.

Keywords: parameter estimation, moment-matching, constant step-size SGD, conditional exponential family, Fenchel-Young loss, mirror descent.


Résumé

In this thesis, we examine several aspects of parameter estimation for statistics and machine learning techniques, as well as the optimization methods applicable to these problems. The goal of parameter estimation is to find the unknown hidden parameters which govern the data, for example the parameters of an unknown probability density. The construction of estimators through optimization problems is only one part of the problem: finding the optimal value of the parameter is often itself an optimization problem which must be solved using various techniques. These optimization problems are often convex for a wide class of problems, and we can exploit their structure to obtain fast convergence rates.

The first main contribution of the thesis is to develop moment-matching techniques for multi-index non-linear regression problems. We consider the classical non-linear regression problem, which is infeasible in high dimensions because of the curse of dimensionality, and we therefore assume that the data is in fact a non-linear function of several linear projections of the data. We combine two existing techniques, the average derivative estimator (ADE) and Sliced Inverse Regression (SIR), to develop a hybrid method (SADE) without some of the weak sides of its parents: it works both for multi-index models and under weak assumptions on the data distribution. We also extend this method to higher-order moments. We provide a theoretical analysis of the constructed estimators for both the finite-sample and population cases.

In the second main contribution, we use a special type of averaging for stochastic gradient descent. We consider generalized linear models for conditional exponential families (such as logistic regression), where the goal is to find the unknown value of the parameter. Classical approaches, such as stochastic gradient descent (SGD) with a constant step size, converge only to a neighborhood of the optimal value of the parameter, even with Ruppert-Polyak averaging. We propose averaging the moment parameters, which we call prediction functions. For finite-dimensional models, this type of averaging can surprisingly lead to a negative error, i.e., this approach provides an estimator better than any linear estimator can ever achieve. For infinite-dimensional models, our approach converges to the optimal prediction, while parameter averaging never does.

The third main contribution of this thesis concerns Fenchel-Young losses. We consider multi-class linear classifiers with losses of a certain type, such that their dual conjugate has a direct product of simplices as support. The corresponding convex-concave saddle-point formulation has a special form with a bilinear matrix term, and classical approaches suffer from the time-consuming matrix multiplications. We show that for multi-class SVM losses, under a mild regularity assumption and with efficient sampling techniques, our approach has an iteration complexity which is sublinear in the size of the data. This means that to perform one iteration we only need to pay a cost of O(n + d + k), for k the number of classes, d the number of features and n the number of samples, whereas all existing techniques use at least one of nd, nk or dk arithmetic operations per iteration. This is possible thanks to the right choice of geometries and the use of a mirror descent approach.

Keywords: parameter estimation, method of moments, constant step-size SGD, conditional exponential family, Fenchel-Young loss, mirror descent.


Acknowledgements

First of all, I would like to thank my supervisor Francis Bach. His guidance and his ability to explain difficult concepts in simple words helped me throughout my Ph.D. His open-door policy, whereby anyone could come and ask questions whenever his door was open, was a great support. I would also like to thank my co-advisor Anatoli Juditsky, with whom I had several very fruitful discussions and who instilled in me statistical thinking.

This thesis has received funding from the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no. 642685 MacSeNet. I would like to thank Helen Cooper for all the training, summer schools and workshops she helped to organize as part of this framework. I am grateful to Pierre Vandergheynst, with whom I had the opportunity to work during my academic internship at EPFL in Lausanne. I am also grateful to Dave Betts, who mentored me in the field of audio restoration during my industrial internship in Cambridge.

I would also like to thank Arnak Dalalyan and Stephane Chretien for accepting to review my thesis, and to express my sincere gratitude to Olivier Cappé and Franck Iutzeler for accepting to be part of the jury for my thesis.

I would like to thank all members of the Willow and Sierra teams, old and new, for all the group meetings, thanks to which I expanded my scientific outlook. I especially want to thank my friend and colleague Dmitrii Ostrovskii, with whom I worked on the third project, and who was able to explain a large amount of information to me in a short time.

I am also grateful to all my friends, with whom I spent many joyful hours and days and who share my hobbies. I would like to thank Andrey, Dasha, Kolya, Tolik, Anya and Katya, my teammates in the intellectual game "What? Where? When?". I would like to thank Marwa, who introduced me to the world of climbing. Thank you, Vitalic, for our almost daily talks. I also thank my friends Denis, Dima and Petr for their support.

I would like to thank my parents, who opened up the wonderful world of mathematics for me, for their faith in me. I thank my sister Tanka for all her sisterly care and her readiness to come to the rescue at any moment.

Last but not least, I am very grateful to my girlfriend Tatiana Shpakova for her love and for always being there in easy and difficult times.


Contents

1 Introduction  3
  1.1 Principles for parameter estimation  3
    1.1.1 Moment matching  3
    1.1.2 Maximum likelihood  6
    1.1.3 Loss functions  8
  1.2 Convex optimization  11
    1.2.1 Euclidean geometry  12
    1.2.2 Non-Euclidean geometry  14
    1.2.3 Stochastic setup  16
  1.3 Saddle point optimization  17

2 Sliced inverse regression with score functions  21
  2.1 Introduction  21
  2.2 Estimation with infinite sample size  25
    2.2.1 SADE: Sliced average derivative estimation  25
    2.2.2 SPHD: Sliced principal Hessian directions  28
    2.2.3 Relationship between first and second order methods  30
  2.3 Estimation from finite sample  30
    2.3.1 Estimator and algorithm for SADE  31
    2.3.2 Estimator and algorithm for SPHD  33
    2.3.3 Consistency for the SADE estimator and algorithm  34
  2.4 Learning score functions  35
    2.4.1 Score matching to estimate score from data  35
    2.4.2 Score matching for sliced inverse regression: two-step approach  37
    2.4.3 Score matching for SIR: direct approach  40
  2.5 Experiments  41
    2.5.1 Known score functions  41
    2.5.2 Unknown score functions  44
  2.6 Conclusion  46
  2.7 Appendix. Proofs  47
    2.7.1 Probabilistic lemma  47
    2.7.2 Proof of Theorem 2.1  47
    2.7.3 Proof of Theorem 2.2  52

3 Constant step-size SGD for probabilistic modeling  57
  3.1 Introduction  57
  3.2 Constant step size stochastic gradient descent  59
  3.3 Warm-up: exponential families  60
  3.4 Conditional exponential families  61
    3.4.1 From estimators to prediction functions  62
    3.4.2 Averaging predictions  63
    3.4.3 Two types of averaging  64
  3.5 Finite-dimensional models  66
    3.5.1 Earlier work  67
    3.5.2 Averaging predictions  67
  3.6 Infinite-dimensional models  68
  3.7 Experiments  70
    3.7.1 Synthetic data  70
    3.7.2 Real data  72
  3.8 Conclusion  73
  3.9 Appendix. Explicit form of B, B̄_w and B̄_m  75
    3.9.1 Estimation without averaging  76
    3.9.2 Estimation with averaging parameters  76
    3.9.3 Estimation with averaging predictions  76

4 Sublinear Primal-Dual Algorithms for Large-Scale Classification  79
  4.1 Introduction  79
    4.1.1 Related work  83
  4.2 Choice of geometry and basic routines  83
    4.2.1 Basic schemes: Mirror Descent and Mirror Prox  85
    4.2.2 Choice of the partial potentials  88
    4.2.3 Recap of the deterministic algorithms  90
    4.2.4 Accuracy bounds for the deterministic algorithms  92
  4.3 Sampling schemes  94
    4.3.1 Partial sampling  95
    4.3.2 Full sampling  98
    4.3.3 Efficient implementation of SVM with Full Sampling  101
  4.4 Discussion of Alternative Geometries  102
  4.5 Experiments  107
  4.6 Conclusion and perspectives  109
  4.7 Appendix  110
    4.7.1 Motivation for the multiclass hinge loss  110
    4.7.2 General accuracy bounds for the composite saddle-point MD  110
    4.7.3 Auxiliary lemmas  116
    4.7.4 Proof of Proposition 4.1  117
    4.7.5 Proof of Proposition 4.2  118
    4.7.6 Proof of Proposition 4.3  120
    4.7.7 Correctness of subroutines in Algorithm 1  121
    4.7.8 Additional remarks on Algorithm 1  125

5 Conclusion and Future Work  127


Contributions and thesis outline

In Chapter 1 we give a brief overview of parameter estimation, covering the method of moments, maximum likelihood and loss minimization problems.

Chapter 2 is dedicated to a special application of the method of moments to non-linear regression. This chapter is based on the journal article Slice inverse regression with score functions, D. Babichev and F. Bach, Electronic Journal of Statistics [Babichev and Bach, 2018b]. The main contributions of this chapter are as follows:

— We propose score function extensions to sliced inverse regression problems, both for first-order and second-order score functions.

— We consider the infinite sample size setting and show that in the population case our estimators are superior to the non-sliced versions.

— We also consider the finite-sample case and show the consistency of our estimators given the exact score functions. We provide non-asymptotic bounds under sub-Gaussian assumptions.

— We propose to learn the score function as well, either in two steps, i.e., first learning the score function and then the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm.

— We illustrate our results on a series of experiments.

In Chapter 3 we consider a special type of averaging for constant step-size SGD applied to generalized linear models. This chapter is based on the conference paper Constant step size stochastic gradient descent for probabilistic modeling, D. Babichev and F. Bach, Proceedings of Uncertainty in Artificial Intelligence (UAI 2018), accepted as an oral presentation [Babichev and Bach, 2018a]. The main contributions of this chapter are:

— For generalized linear models, we propose averaging moment parameters instead of natural parameters for constant step-size stochastic gradient descent.

— For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model.

— For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging the natural parameters never does.

— We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.

In Chapter 4 we develop a sublinear method for Fenchel-Young losses; this is joint work with Dmitrii Ostrovskii. We have submitted this work to ICML 2019 under the title Sublinear-time training of multiclass classifiers with Fenchel-Young losses, Dmitry Babichev, Dmitrii Ostrovskii and Francis Bach. The main contributions of this chapter are:

— We develop efficient algorithms for solving regularized empirical risk minimization problems via the associated saddle-point problem, using: (i) sampling for the computationally heavy matrix multiplications and (ii) the right choice of geometry for mirror descent type algorithms.

— The less aggressive partial sampling scheme is applicable to any loss minimization problem whose dual conjugate has a direct product of simplices as support. This leads to a cost of O(n(d + k)) per iteration.

— The more aggressive full sampling scheme, applied to the multiclass hinge loss, leads to a sublinear cost of O(d + n + k) per iteration.

— We conclude the chapter with numerical experiments.


Chapter 1

Introduction

In this chapter we give a brief overview of the two main pillars of this thesis: parameter estimation, and optimization (mostly convex minimization and convex-concave saddle point problems).

1.1 Principles for parameter estimation

Parameter estimation is a branch of statistics and machine learning that solves the problem of estimating an unknown set of parameters given some observations. There is a large variety of methods and models, and we consider three of the most famous: moment matching, maximum likelihood and risk minimization. The main applications of parameter estimation are density and conditional density estimation (and more generally model estimation), which in turn are used in regression, classification and clustering problems.

1.1.1 Moment matching

Moment matching is a technique for finding the values of parameters using several moments of the distribution, i.e., expectations of powers of the random variable. The method of moments was introduced at least by Chebyshev and Pearson in the late 1800s (see for example [Casella and Berger, 2002] for a discussion). Suppose that we have a real-valued random variable X drawn from a family of distributions {f(· | θ) | θ ∈ Θ}.

Given a sample (x_1, ..., x_n), the goal is to estimate the true value of the parameter θ*. The classical approach is to consider the first k moments μ_i = E[X^i] = g_i(θ), i = 1, ..., k, and to solve the non-linear system of equations to find the estimator θ̂:

\hat\mu_1 = \frac{1}{n} \sum_{i=1}^{n} x_i = g_1(\hat\theta),
\quad \vdots
\hat\mu_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k = g_k(\hat\theta).


Even though the method is called moment matching, it is not necessary to use moments of the random variable: in general it can use any functions h_i(X) as long as the expectations ∫ h_i(x) f(x | θ) dx can be computed easily. This approach can be extended to the case of random vectors and cross-moments [Hansen, 1982]. We now discuss some strong and weak points of this approach, and illustrate the method on several examples, starting with toy ones and finishing with state-of-the-art methods applicable to non-linear regression.

Disadvantages and advantages

The main advantage of this approach is that it is quite simple and the estimators are consistent under mild assumptions [Hansen, 1982]. Also, in some cases the solutions can be found in closed form, whereas the maximum likelihood approach may require a large computational effort.

However, in some sense this approach is inferior to the maximum likelihood approach, and the estimators are often biased. In some cases, especially for small samples, the results can lie outside of the parameter space. Also, the nonlinear system of equations may be hard to solve.

Uniform distribution example

Let us start with a simple example, where we need to estimate the parameters of the one-dimensional uniform distribution X ~ U[a, b]. The first and second moments are μ_1 = E X = (a + b)/2 and μ_2 = E X² = (a² + ab + b²)/3. Solving the system of these two equations, given a sample (x_1, ..., x_n) and using the sample moments μ̂_1 and μ̂_2 instead of the true ones, we get a formula for estimating the parameters a and b:

(\hat a, \hat b) = \Big( \hat\mu_1 - \sqrt{3(\hat\mu_2 - \hat\mu_1^2)},\; \hat\mu_1 + \sqrt{3(\hat\mu_2 - \hat\mu_1^2)} \Big).
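As a minimal illustration (not from the thesis), the following Python sketch implements this moment-matching estimator for the uniform distribution; the sample is simulated with assumed parameters a = -1 and b = 2.

    import numpy as np

    def uniform_moment_estimator(x):
        # Sample moments: mu1 = empirical mean, mu2 = empirical mean of squares.
        mu1 = np.mean(x)
        mu2 = np.mean(x ** 2)
        # Solve mu1 = (a + b)/2 and mu2 = (a^2 + ab + b^2)/3 for (a, b).
        half_width = np.sqrt(3.0 * max(mu2 - mu1 ** 2, 0.0))
        return mu1 - half_width, mu1 + half_width

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 2.0, size=10_000)   # assumed true parameters a = -1, b = 2
    print(uniform_moment_estimator(x))        # estimates close to (-1, 2)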

Linear regression example

Consider now a simple linear regression model y = x^⊤ b + ε, where y ∈ R, the vectors x, b ∈ R^d and the error ε has zero expectation. The goal is to estimate the unknown vector of parameters b. Consider the cross-moments E(x_i y) = E(x_i x^⊤ b), for i = 1, ..., d, where x_i is the i-th component of x. Replacing the expectations with empirical ones for the sample (x_i, y_i), we get the equation

\frac{y_1 x_1 + \dots + y_n x_n}{n} = \frac{x_1 x_1^\top + \dots + x_n x_n^\top}{n} \cdot \hat b,

which is the traditional normal equation [Goldberger, 1964]. Finally, arranging the vectors in the matrix X ∈ R^{n×d} and y ∈ R^n, we recover b̂ = (X^⊤ X)^{-1} X^⊤ y. Hence, in this particular formulation the moment matching estimator coincides with the ordinary least squares estimator.
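As a small sketch (not from the thesis), the following Python code sets up the empirical cross-moments and solves the normal equation; the design, noise level and true parameter vector are assumptions chosen for the illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 5
    X = rng.normal(size=(n, d))
    b_true = rng.normal(size=d)                 # assumed ground-truth parameter vector
    y = X @ b_true + 0.1 * rng.normal(size=n)   # zero-mean noise

    # Empirical cross-moments: (1/n) X^T y = ((1/n) X^T X) b_hat  (the normal equation).
    b_hat = np.linalg.solve(X.T @ X / n, X.T @ y / n)
    print(np.allclose(b_hat, b_true, atol=0.05))  # True: OLS recovers b up to noise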


Exponential families

Note that for exponential families, with probability density given by f(x | θ) = h(x) exp(θ^⊤ T(x) − A(θ)), moment matching is equivalent to maximum likelihood estimation [Lehmann and Casella, 2006]. We discuss this in more detail in the next section.

Score functions

Consider the general non-linear regression problem

y = f(x) + \varepsilon, \quad x \in \mathbb{R}^d,\; y \in \mathbb{R},

with the error ε independent of the data and E ε = 0. The ambitious goal is to estimate the unknown function f, given samples (x_i, y_i). However, it is impossible to solve this problem in this loose formulation, due to the curse of dimensionality in the non-parametric regression setup. Indeed, classical non-parametric estimation results show that convergence rates with any relevant performance measure can decrease as n^{-C/d} [Tsybakov, 2009, Györfi et al., 2002]. This means that the number of sample points n needed to reach some level of precision is exponential in the dimension d. A classical way to circumvent the curse of dimensionality is to impose an additional condition: the dependence of the data on some hidden lower-dimensional structure. Let us start with a simple assumption:

— x is normal and f(x) = g(w^⊤ x), with a matrix w ∈ R^{d×k}.

If ๐‘˜ = 1, this model is called a single-index model and a multi-index in the othercase [Horowitz, 2012]. In this formulation, the goal is to estimate the unknown matrixof parameters ๐‘ค โˆˆ R๐‘˜ร—๐‘‘ and we can use moment matching techniques. We use againcross-moments and it is not difficult to show that for a single-index, using Steinโ€™slemma, that E(๐‘ฆ๐‘ฅ) โˆผ ๐‘ค1 [Stein, 1981, Brillinger, 1982]. Indeed, using independenceof noise, the Gaussian probability density ๐‘(๐‘ฅ) โˆผ exp(โˆ’๐‘ฅ2/2) and integration byparts:

E(๐‘ฆ๐‘ฅ) = E((๐‘“(๐‘ฅ) + ๐œ€)๐‘ฅ

)= E

(๐‘”(๐‘คโŠค

1 ๐‘ฅ)๐‘ฅ)

=

โˆซ๐‘”(๐‘คโŠค

1 ๐‘ฅ)๐‘(๐‘ฅ)๐‘ฅ๐‘‘๐‘ฅ =

=

โˆซ๐‘”(๐‘คโŠค

1 ๐‘ฅ)โˆ‡๐‘(๐‘ฅ)๐‘‘๐‘ฅ =

โˆซโˆ‡๐‘”(๐‘คโŠค

1 ๐‘ฅ)๐‘(๐‘ฅ)๐‘‘๐‘ฅ โˆผ ๐‘ค1.

The straightforward extension of this approach uses the notion of the score function

\mathcal{S}_1(x) = -\nabla \log p(x),

where p(x) is the probability density of the data x. We use a moment matching technique of the form E(S_1(x) y) ∼ w_1, as is done in [Stoker, 1986]. The most recent approaches for multi-index models by [Janzamin et al., 2014] and [Janzamin et al., 2015] use the notion of higher-order scores S_m(x) = (−1)^m ∇^{(m)} p(x) / p(x) (which are tensors) and the cross-moments E[y · S_m(x)] to train neural networks.
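As a small illustration (not from the thesis), the following Python sketch uses the first-order score of a known Gaussian design, S_1(x) = Σ^{-1}(x − μ), to recover the direction w_1 of a single-index model up to scaling; the link function g and all parameters below are assumptions chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20_000, 6
    mu, Sigma = np.zeros(d), np.eye(d)              # assumed known Gaussian design
    X = rng.multivariate_normal(mu, Sigma, size=n)

    w1 = np.array([1.0, -2.0, 0.0, 0.5, 0.0, 0.0])  # hidden direction (assumption)
    y = np.tanh(X @ w1) + 0.1 * rng.normal(size=n)  # y = g(w1^T x) + noise

    scores = np.linalg.solve(Sigma, (X - mu).T).T   # S_1(x) = Sigma^{-1} (x - mu)
    w1_hat = scores.T @ y / n                       # empirical E[S_1(x) y], proportional to w1
    w1_hat /= np.linalg.norm(w1_hat)
    print(abs(w1_hat @ (w1 / np.linalg.norm(w1))))  # close to 1: the direction is recovered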


Contribution of this thesis

One more extension of the method of moments uses conditional moments E(x | y), known as Sliced Inverse Regression (SIR, [Li, 1991]), and second-order conditional moments E(xx^⊤ | y) (Principal Hessian Directions, PHD, [Li, 1992]), which rely on a normal distribution assumption. In Chapter 2 of this thesis we develop a new method combining the strong sides of Stein's lemma and SIR. We propose new approaches (SADE and SPHD) and develop an analysis for both the population and finite-sample cases.

1.1.2 Maximum likelihood

Maximum likelihood estimation (MLE) is another classical approach to estimating the unknown parameters, by maximizing the likelihood function: intuitively, the selected parameter makes the observed data most probable. More formally, let X = (x_1, ..., x_n) be a random sample from a family of distributions {f(· | θ) | θ ∈ Θ}; then

\hat\theta \in \arg\max_{\theta \in \Theta} \mathcal{L}(\theta; X),

where L(θ; X) = f_X(X | θ) is the so-called likelihood function, that is, the joint probability density for the given realization of the sample.

In practice, it is often convenient to work with the negative natural logarithm of the likelihood function, called the negative log-likelihood: ℓ(θ; X) = − ln L(θ; X). If the data (x_1, ..., x_n) are independent and identically distributed, then the joint density can be written as a product of densities for single x_i's, and the average negative log-likelihood minimization problem takes the form

\arg\min_{\theta \in \Theta} \hat\ell(\theta; X) = \arg\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^{n} \ln f(x_i \mid \theta),

where f(x_i | θ) is the value of the probability density at the point x_i.

Advantages and disadvantages

Under mild assumptions the maximum likelihood estimator is consistent, asymptotically efficient and asymptotically normal [LeCam, 1953, Akaike, 1998]. Even though for simple models solutions can be found in closed form, for more advanced models optimization methods must be used to obtain the solution. Fortunately, the optimization problem is convex for a wide class of likelihood estimators, such as exponential families and conditional exponential families [Koller and Friedman, 2009, Murphy, 2012]. On the other hand, the optimization problem may not be convex if we consider, for example, mixture models. Moreover, maximum likelihood estimators are robust to model mis-specification: if the real data distribution f* does not come from the model {f(· | θ) | θ ∈ Θ}, we can still use the approach, and the solution in the infinite-data limit will be the projection (in the Kullback-Leibler sense) of f* onto the set of allowed distributions.


Example: Exponential families

The probability density for an exponential family can be written in the following form:

f(x \mid \theta) = h(x) \exp\big(\theta^\top T(x) - A(\theta)\big),

where h(x) is the base measure, T(x) ∈ R^d is the sufficient statistic and A is the log-partition function, which is always convex. Note that we do not assume that the data distribution p(x) comes from this exponential family. Then the average negative log-likelihood is equal to

\ell(\theta; x) = \frac{1}{n} \sum_{i=1}^{n} \big[ A(\theta) - \theta^\top T(x_i) \big],

which is a convex problem and can be solved using any convex minimization approach. Another view of this problem is moment matching: the solution can be found from

A'(\theta) = \frac{1}{n} \sum_{i=1}^{n} T(x_i),

where we match the moment of the sufficient statistic T(x). Note that in fact a large variety of classical distributions can be represented in this form: Bernoulli, normal, Poisson, exponential, Gamma, Beta, Dirichlet and many others. However, there are a few which cannot, for example Student's distribution, and mixtures of classical distributions are not in this family.
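As a minimal sketch (not from the thesis), consider the Poisson family written in exponential-family form with natural parameter θ = log λ, sufficient statistic T(x) = x and log-partition A(θ) = e^θ; the moment-matching equation A'(θ̂) = (1/n) Σ T(x_i) then has the closed-form solution θ̂ = log x̄, as the Python snippet below checks on simulated data with an assumed rate.

    import numpy as np

    rng = np.random.default_rng(0)
    lam_true = 3.5                        # assumed true rate, so theta_true = log(3.5)
    x = rng.poisson(lam_true, size=5_000)

    # Poisson as an exponential family: T(x) = x, A(theta) = exp(theta),
    # hence A'(theta_hat) = mean(x)  =>  theta_hat = log(mean(x)).
    theta_hat = np.log(np.mean(x))
    print(theta_hat, np.log(lam_true))    # the two values are close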

Example: Conditional Exponential families

Now let us consider the classical conditional exponential families:

f(y \mid x, \theta) = h(y) \cdot \exp\big( y \cdot \eta_\theta(x) - a(\eta_\theta(x)) \big).

Again writing down the average negative log-likelihood, the goal is to minimize

\frac{1}{n} \sum_{i=1}^{n} \big[ - y_i \cdot \eta_\theta(x_i) + a(\eta_\theta(x_i)) \big].

These families are used for regression and classification problems and are closely related to generalized linear models [McCullagh and Nelder, 1989] when the natural parameter is a linear combination in a known basis: η_θ(x) = θ^⊤ Φ(x). One of the most popular choices for the function a'(·) is the sigmoid function a'(t) = σ(t) = 1/(1 + e^{−t}), which leads to the logistic regression model. Let us illustrate the connection of logistic regression with the Bernoulli distribution. Let p be the parameter of a Bernoulli distribution with x ∈ {0, 1}; then the probability mass function can be written as

q(x \mid p) = p^x (1-p)^{1-x} = \exp\Big( x \log \tfrac{p}{1-p} + \log(1-p) \Big) = \exp\big( x \cdot \theta - \log(1 + e^{\theta}) \big),

where θ = log(p/(1−p)); hence the logistic model is an unconstrained reformulation of the Bernoulli model, and the lack of constraints is often seen as a benefit for optimization.

Another common choice is Poisson regression [McCullagh, 1984], where a(t) = exp(t), which is used when the dependent variable is a count.
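As a small sketch (not from the thesis), assuming a linear feature map Φ(x) = x and the logistic link a(t) = log(1 + e^t), the following Python code evaluates the average negative log-likelihood above and its gradient, which could be fed to any gradient-based optimizer.

    import numpy as np

    def logistic_nll_and_grad(theta, X, y):
        # Average negative log-likelihood (1/n) sum_i [ -y_i eta_i + a(eta_i) ],
        # with eta_i = theta^T x_i and a(t) = log(1 + exp(t)).
        eta = X @ theta
        nll = np.mean(-y * eta + np.logaddexp(0.0, eta))
        # Gradient: (1/n) sum_i [ a'(eta_i) - y_i ] x_i, where a'(t) = sigmoid(t).
        grad = X.T @ (1.0 / (1.0 + np.exp(-eta)) - y) / len(y)
        return nll, grad

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    theta_true = np.array([1.0, -1.0, 0.5])             # assumed parameters
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))
    print(logistic_nll_and_grad(np.zeros(3), X, y)[0])  # equals log(2) at theta = 0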

Contribution of this thesis

In Chapter 3 of this thesis we consider constant step-size stochastic gradient descent for conditional exponential families. Instead of averaging the parameters θ, we propose averaging the so-called prediction functions μ = a'(θ^⊤ Φ(x)), which leads to better convergence to the optimal prediction, especially for infinite-dimensional models.

1.1.3 Loss functions

One more view of the parameter estimation problem is risk minimization, which goes beyond the maximum likelihood approach. A non-negative real-valued loss function ℓ(ŷ, y) measures the performance for a classification or regression problem, i.e., the discrepancy between the prediction ŷ of a model and the true outcome y. The expected loss R(θ) = E[ℓ(f(x | θ), y)] is called the risk. However, in practice the true distribution P(x, y) is inaccessible, and the empirical risk is used instead. Hence, the classical regularized empirical risk minimization problem is written as

\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i \mid \theta), y_i\big) + \lambda \Omega(\theta), \qquad (1.1.1)

where Ω(θ) is a regularizer, which is used to avoid overfitting. There are several classical choices of loss function, and we start with, in some sense, the most intuitive one:

0-1 loss

This loss, which is also called the misclassification loss, is used in classification problems where the response y takes values in a finite set. The definition speaks for itself: the loss is zero in the case of a correct classification and one otherwise:

\ell\big(f(x_i \mid \theta), y_i\big) = 1 \;\Longleftrightarrow\; f(x_i \mid \theta) \neq y_i.

It is known to lead to NP-hard problems, even for linear classifiers [Feldman et al., 2012, Ben-David et al., 2003]. That is why in practice people use convex relaxations of the 0-1 loss function (see below).


Quadratic loss

Another classical choice of loss function in the case of regression problems is the quadratic loss:

\ell\big(f(x_i \mid \theta), y_i\big) = \big(f(x_i \mid \theta) - y_i\big)^2.

It has a simple form, and is smooth (in contrast to the ℓ1 loss) and convex. Thus, the empirical risk minimization problem becomes mean squared error minimization.

The connection with likelihood estimators can be shown using the linear model with Gaussian noise:

y = x^\top \theta + \varepsilon, \quad \text{where } \varepsilon \sim \mathcal{N}(0, \sigma^2).

Then the negative log-likelihood is

l(y; x, \theta) \sim (x^\top \theta - y)^2,

and the empirical negative log-likelihood minimization problem for this model is equivalent to mean squared error minimization (but the application of least squares is not limited to Gaussian noise).

Let us also illustrate a connection of the quadratic loss with regression: consider that we are looking for a function f such that y = f(x) + ε, where x ∈ R^d and y ∈ R. Then the expected loss (risk) can be written as

R(f) = \int \ell(f(x), y)\, p(x, y)\, dx\, dy.

Solving this functional minimization problem for the quadratic loss, we get the solution f(x) = E_y[y | x], which is the conditional expectation of y given x and is known as the regression function.

Convex surrogates

Here we consider the main examples of convex surrogates of the 0-1 loss, which make the computation more tractable.

Hinge loss. It is written as

\ell\big(f(x_i \mid \theta), y_i\big) = \max\big(0,\, 1 - f(x_i \mid \theta)\, y_i\big),

and is used in the soft-margin support vector machine (SVM) approach, introduced in its modern form by [Cortes and Vapnik, 1995]. The method tries to find a hyperplane which separates the data. In practice, the regularized problem is usually considered, due to non-robustness, especially for separable data:

\Big[ \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i \theta^\top x_i\big) \Big] + \lambda \|\theta\|^2.

The classical way of solving the minimization problem is to switch to a quadratic programming problem (originally [Cortes and Vapnik, 1995], see also [Bishop, 2006]). However, nowadays more recent approaches are used, such as gradient descent and stochastic gradient descent methods [Shalev-Shwartz et al., 2011].
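As a minimal sketch (not from the thesis), the following Python code performs subgradient descent on the regularized hinge objective above; the decaying step size and the synthetic data are assumptions chosen for the illustration.

    import numpy as np

    def hinge_objective_and_subgrad(theta, X, y, lam):
        # Objective: (1/n) sum_i max(0, 1 - y_i theta^T x_i) + lam ||theta||^2.
        margins = y * (X @ theta)
        obj = np.mean(np.maximum(0.0, 1.0 - margins)) + lam * theta @ theta
        # Subgradient: -y_i x_i on violated margins (margin < 1), plus 2 lam theta.
        active = (margins < 1.0).astype(float)
        subgrad = -(X.T @ (active * y)) / len(y) + 2.0 * lam * theta
        return obj, subgrad

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = np.sign(X @ np.array([1.0, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=300))
    theta, lam = np.zeros(4), 0.01
    for t in range(1, 201):                    # decaying step size, gamma_t ~ 1/sqrt(t)
        _, g = hinge_objective_and_subgrad(theta, X, y, lam)
        theta -= 0.5 / np.sqrt(t) * g
    print(hinge_objective_and_subgrad(theta, X, y, lam)[0])  # objective well below 1.0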

Logistic loss and Exponential loss. The logistic loss is defined as ℓ(f(x_i | θ), y_i) = log(1 + exp(−y_i · f(x_i | θ))) and the exponential loss is defined as ℓ(f(x_i | θ), y_i) = exp(−y_i · f(x_i | θ)). In fact, these two losses are dictated by maximum log-likelihood formulations: the logistic loss takes its origin in logistic regression and the exponential loss is used in Poisson regression. Moreover, we can say that every negative log-likelihood minimization problem is equivalent to a loss minimization problem, if we introduce the corresponding log-likelihood loss. The opposite is typically not true (for example for the hinge loss).

Graphical representation

We summarize the discussed losses in Figure 1-1, where we renormalize some of them to pass through the point (1, 0). The dashed line represents the regression formulation and the solid ones are for classification problems.

Figure 1-1 – Graphical representation of classical loss functions (exponential, hinge, 0-1, logistic and square losses).

Contribution of this thesis

In Chapter 4 of this thesis we consider so-called Fenchel-Young losses for multiclass linear classifiers, which extend convex surrogates to the multiclass setting. This leads to convex-concave saddle-point problems with expensive matrix multiplications. Using stochastic optimization methods in a non-Euclidean setup (mirror descent) and specific variance reduction techniques, we are able to reach a running-time complexity per iteration which is sublinear in the natural dimensionality of the problem.


1.2 Convex optimization

In this section we discuss the basics of convex optimization. Convexity is a prevailing setup for optimization problems, due to the theoretical guarantees available for this class of functions. The classical convex minimization problem is the following:

\min_{x \in \mathcal{X}} f(x),

where X ⊂ R^d is a convex set: for any two points x, y ∈ X and for every α ∈ [0, 1] the point αx + (1 − α)y is also in X. This means that for any two points inside the set, the whole segment connecting these points is also in the set. The function f is convex as well: for any x, y ∈ X and α ∈ [0, 1], f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y); this means that the epigraph of the function f is a convex set: every chord lies above the graph of the function. It is known for convex minimization problems that:

— If a local minimum exists, it is also a global minimum.
— The set of all global minima is convex (note that the global minimum is not necessarily unique).

Consider the unconstrained minimization problem, where X = R^d. The first property provides us with a criterion for a global minimum: if the function f is differentiable, then

\nabla f(x^*) = 0 \;\Leftrightarrow\; x^* \text{ is a global minimum}.

If the function f is not differentiable, we can still use the criterion, where the gradient of the function is replaced by a subgradient, a generalization of the gradient for convex functions which defines the cone of directions in which the function increases [Rockafellar, 2015]:

0 \in \partial f(x^*) \;\Leftrightarrow\; x^* \text{ is a global minimum}.

However, in practice the solution can be found in closed form only for simple problems. For the majority of convex problems iterative computational methods are used, and the starting point for them is gradient descent (or steepest descent):

x_{t+1} = x_t - \gamma_t \nabla f(x_t).

The idea is straightforward: we look at the direction in which the function increases and then descend in the opposite direction. Classical choices for the step size are constant and decaying: γ_t = C t^{−α}, where α ∈ [0, 1].
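As a minimal sketch (not from the thesis), the following Python code runs gradient descent with a constant step size on a simple quadratic objective; the matrix, vector and step size are assumptions chosen for the illustration.

    import numpy as np

    # Minimize f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b.
    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    grad = lambda x: A @ x - b

    x = np.zeros(2)
    gamma = 1.0 / np.linalg.eigvalsh(A).max()  # constant step size 1/L (L = largest eigenvalue)
    for t in range(200):
        x = x - gamma * grad(x)
    print(x, np.linalg.solve(A, b))            # the iterate approaches the minimizer A^{-1} b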

We also consider classical assumptions, such as bounded gradients, smoothness (which requires differentiability) and strong convexity (for more details see [Nesterov, 2013, Bubeck, 2015]). The first definition is the weakest one, bounded gradients, which does not require differentiability:

Definition 1.1. The function f has bounded gradients (subgradients) if, for any x ∈ X and any g(x) ∈ ∂f(x),

\|g(x)\| \le B.

Definition 1.2. The function f is called smooth with Lipschitz constant L if, for all x, y ∈ X,

f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2,

which is equivalent, when f is convex, to

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.

Definition 1.3. The function f is called strongly convex with constant μ if, for all x, y ∈ X,

f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2.

The intuitive meaning of Lipschitz smoothness is that at every point the function f can be bounded above by a quadratic function with coefficient L/2; strong convexity means that at every point the function f is bounded below by a quadratic function with coefficient μ/2. A graphical representation of the one-dimensional case can be found in Figure 1-2.

Figure 1-2 – Graphical representation of Lipschitz smoothness (upper quadratic bound) and strong convexity (lower quadratic bound).

1.2.1 Euclidean geometry

Note that we did not specify the norms used in the definitions of smoothness and strong convexity. If we take the standard Euclidean norm, we recover the so-called Euclidean geometry, and the constants L and μ correspond to the largest and smallest eigenvalues of the Hessian, if the function is twice differentiable.


Projected gradient descent

The gradient descent method for constrained problems is called projected gradient descent (see [Bertsekas, 1999] and references therein), and one iteration is the following:

x_{t+1} = \Pi_{\mathcal{X}}\big( x_t - \gamma_t g(x_t) \big), \quad \text{with } g(x_t) \in \partial f(x_t),

where \Pi_{\mathcal{X}}(x) = \arg\min_{y \in \mathcal{X}} \|x - y\| is the Euclidean projection of the point x onto the set X. We illustrate this approach in Figure 1-3. We can also look at this iteration through the prism of proximal approaches [Moreau, 1965, Rockafellar, 1976], as done for example in [Beck and Teboulle, 2003]:

x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{ \langle x, g(x_t) \rangle + \frac{1}{2\gamma_t} \|x - x_t\|^2 \Big\}, \quad \text{where } g(x_t) \in \partial f(x_t).

Figure 1-3 – Graphical representation of projected gradient descent: first we perform the gradient step x_{t+1/2} = x_t − γ_t g(x_t), then project onto X to obtain x_{t+1}.

Note that the proximal operator should be simple, i.e., the structure of the set X should be simple enough to compute the projection either in closed form or with a number of operations commensurate with the computation of the gradient.
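As a minimal sketch (not from the thesis) of projected gradient descent, assume X is a Euclidean ball of radius R, for which the projection has a simple closed form; the objective below is an assumption for illustration.

    import numpy as np

    def project_ball(x, radius):
        # Euclidean projection onto {x : ||x|| <= radius}: rescale if outside.
        norm = np.linalg.norm(x)
        return x if norm <= radius else x * (radius / norm)

    # Minimize f(x) = ||x - c||^2 over the unit ball (the constraint is active at the optimum).
    c = np.array([2.0, 2.0])
    grad = lambda x: 2.0 * (x - c)

    x, gamma = np.zeros(2), 0.25
    for t in range(100):
        x = project_ball(x - gamma * grad(x), 1.0)
    print(x, c / np.linalg.norm(c))   # the solution is the projection of c onto the unit ball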

We now state the main results for functions under different assumptions (short proofs can be found for example in [Bubeck, 2015]):

Theorem 1.1. Let f be convex with gradients bounded by a constant B, and let the radius of the set X be R, i.e., sup_{x,y ∈ X} ‖x − y‖ = 2R. Then projected gradient descent with the decaying step size γ_t = R / (B√t) satisfies

f(\bar{x}_t) - f(x^*) \le \frac{RB}{\sqrt{t}}, \quad \text{where } \bar{x}_t = \frac{1}{t} \sum_{i=1}^{t} x_i.

Hence, in the case of bounded gradients we get a 1/√t convergence rate.


Theorem 1.2. Let f be convex and L-smooth on X. Then projected gradient descent with constant step size γ_t = 1/L satisfies

f(x_t) - f(x^*) \le \frac{4L \|x_1 - x^*\|^2}{t}.

Hence, in the case of an L-smooth function we get a 1/t convergence rate. If we add the assumption of strong convexity of the function f, we recover an exponential rate:

Theorem 1.3. Let f be μ-strongly convex and L-smooth on X. Then projected gradient descent with constant step size γ_t = 1/L satisfies

\|x_{t+1} - x^*\|^2 \le \exp\Big(-t \cdot \frac{\mu}{L}\Big) \|x_1 - x^*\|^2.

Note also that the acceleration techniques proposed by Nesterov can be used to improve the convergence rates. The first work in this direction [Nesterov, 1983] was proposed for smooth functions in the unconstrained setup and improved the rate from 1/t to 1/t². The case of non-smooth functions was considered in [Nesterov, 2005], improving the rate from 1/√t to 1/t using a special smoothing technique which can be applied to functions with an explicit max-structure. Constrained problems were considered in [Nesterov, 2007] and [Beck and Teboulle, 2009] with the same convergence rates.

Note that Theorems 1.1, 1.2 and 1.3 use the Euclidean geometry: the definitions of smoothness and strong convexity are given with respect to the usual Euclidean norm. In the next section we extend these results to a more general setup.

1.2.2 Non-Euclidean geometry

The idea of the non-Euclidean approach is to exploit the geometry of the set X and achieve the rates of projected gradient descent with constants adapting to the geometry of the set X. Let us fix an arbitrary norm ‖·‖_X and a compact set X ⊂ R^d. We introduce the definition of the dual norm:

Definition 1.4. The norm ‖·‖_{X*} is called the dual norm if ‖g‖_{X*} = sup_{x ∈ R^d : ‖x‖_X ≤ 1} g^⊤ x.

Now we need to adapt the definitions of bounded gradients, smoothness and strong convexity (see for example [Bubeck, 2015, Beck and Teboulle, 2003]): (i) we say that the function f has bounded gradients if

\|g(x)\|_{\mathcal{X}^*} \le B \quad \text{for any } g(x) \in \partial f(x);

(ii) we say that the function f is L-smooth with respect to the norm ‖·‖_X if

\|\nabla f(x) - \nabla f(y)\|_{\mathcal{X}^*} \le L \|x - y\|_{\mathcal{X}},

where ‖·‖_{X*} is the dual norm; (iii) f is μ-strongly convex if

f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|_{\mathcal{X}}^2.

The idea of the mirror approach is to consider a so-called potential function Φ(x) (also called a distance-generating function, DGF, or mirror map), then switch to the mirror space (using the gradient of the potential), perform the gradient step in that space, return to the primal space and project the point onto the set X.

Mirror descent

More formally, one step of mirror descent [Nemirovsky and Yudin, 1983] is

\nabla \Phi(x_{t+1/2}) = \nabla \Phi(x_t) - \gamma_t \nabla f(x_t),
x_{t+1} \in \Pi^{\Phi}_{\mathcal{X}}(x_{t+1/2}),

where \Pi^{\Phi}_{\mathcal{X}}(y) = \arg\min_{x \in \mathcal{X}} B_\Phi(x, y) is the projection in the Bregman divergence sense, with B_\Phi(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^\top (x - y). However, this notation can be difficult to embrace, and [Beck and Teboulle, 2003] introduced an equivalent formulation without switching to the mirror space; one step of mirror descent can then be written as:

x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{ \langle x, \nabla f(x_t) \rangle + \frac{1}{\gamma_t} B_\Phi(x, x_t) \Big\}. \qquad \text{(Mirror descent)}

Note the similarity with projected gradient descent. In practice, the potential Φ(x) should be a strictly convex and differentiable function with the following properties:

— Φ(x) is 1-strongly convex on X with respect to ‖·‖_X.
— The effective square radius of the set X, defined as Ω = sup_{x ∈ X} Φ(x) − inf_{x ∈ X} Φ(x), should be small.
— The proximal step above should be feasible.

Finally, we can formulate convergence rates for the mirror descent algorithm in the non-smooth case (a short proof can be found for example in [Bubeck, 2015]):

Theorem 1.4. Let Φ be 1-strongly convex with respect to ‖·‖_X and let f be convex with B-bounded gradients with respect to ‖·‖_X. Then mirror descent with γ_t = \frac{\sqrt{2\Omega}}{B\sqrt{t}} satisfies

f(\bar{x}_t) - f(x^*) \le B \sqrt{\frac{2\Omega}{t}}, \quad \text{where } \bar{x}_t = \frac{1}{t} \sum_{i=1}^{t} x_i.

Note that there are various extensions of mirror descent, such as mirror prox [Nemirovski, 2004], dual averaging [Nesterov, 2007] and extensions to saddle point minimax problems [Juditsky and Nemirovski, 2011a,b, Nesterov and Nemirovski, 2013].


One more direction is the NoLips approach, where the idea is to get rid of the Lipschitz-continuous gradient assumption, using a convexity condition which captures the geometry of the constraints [Bauschke et al., 2016].

Examples of geometries

ℓ2 geometry. This is the simplest geometry, which actually corresponds to the Euclidean case. Indeed, in this case Φ(x) = ½‖x‖²₂ is 1-strongly convex with respect to the ℓ2 norm on the whole of R^d. The Bregman divergence associated with this potential is B_Φ(x, y) = ½‖x − y‖²₂, and one step of mirror descent reduces to one step of projected gradient descent. Note that Theorem 1.4 reduces to Theorem 1.1 in this case.

ℓ1 geometry. This is a more interesting choice of geometry, which is also called the simplex setup. If we define the potential as the negative entropy

\Phi(x) = \sum_{i=1}^{d} x_i \log x_i,

then, using Pinsker's inequality, we can show that Φ is 1-strongly convex with respect to the ℓ1 norm on the simplex Δ_d = {x ∈ R^d_+ : Σ_{i=1}^d x_i = 1}. The Bregman divergence associated with the negative entropy is the so-called Kullback-Leibler divergence:

B_\Phi(x, y) = D_{\mathrm{KL}}(x \,\|\, y) = \sum_{i=1}^{d} x_i \log \frac{x_i}{y_i}.

The effective squared radius of Δ_d is Ω = log d, and moreover the solution of one step of MD can be found in closed form, leading to so-called multiplicative updates. This implies that if we minimize a function f on the simplex Δ_d such that ‖∇f‖_∞ is bounded, the right choice of geometry gives rates of order O(√(log d / t)), whereas the Euclidean geometry only gives rates of order O(√(d / t)).

One more classical setup is often used: the ℓ1 ball can be handled via the simplex setup by doubling the number of variables. Instead of d real values (positive or negative), we consider 2d non-negative values and transform the d-dimensional ℓ1 ball into the 2d-dimensional simplex.
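As a minimal sketch (not from the thesis) of mirror descent with the negative-entropy potential on the simplex, the closed-form step is the multiplicative update x_{t+1,i} ∝ x_{t,i} exp(−γ_t [∇f(x_t)]_i); the linear objective below is an assumption chosen for the illustration.

    import numpy as np

    def entropic_md_step(x, grad, gamma):
        # Multiplicative update: closed-form mirror descent step for the
        # negative-entropy potential on the simplex (Bregman divergence = KL).
        w = x * np.exp(-gamma * grad)
        return w / w.sum()

    # Minimize the linear function f(x) = c^T x over the simplex (minimum at the smallest c_i).
    d = 5
    c = np.array([0.9, 0.2, 0.7, 0.5, 0.8])
    x = np.full(d, 1.0 / d)                    # start at the uniform distribution
    Omega, B = np.log(d), np.max(np.abs(c))    # effective radius and gradient bound
    for t in range(1, 501):
        x = entropic_md_step(x, c, np.sqrt(2 * Omega) / (B * np.sqrt(t)))
    print(np.round(x, 3))                      # the mass concentrates on the coordinate with smallest c_i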

1.2.3 Stochastic setup

In this section we consider stochastic approaches, going back to [Robbins and Monro, 1951], where we do not use the gradient ∇f(x) but evaluate a noisy version of it: namely, a stochastic oracle g(x) such that E g(x) = ∇f(x). The classical setting in machine learning is when the objective function f is the sample mean of observations f_i (possibly with some regularizer term):

f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) + R(x),

and we can choose the oracle as g(x) = ∇f_i(x) + ∇R(x), where i is chosen uniformly among the n points. In fact, all negative log-likelihood minimization and loss minimization problems have this form. This also applies to the situation of single-pass SGD, where the bounds are then on the generalization error. To define the quality of an oracle g(x), we assume the existence of the moment

B^2 = \sup_{x \in \mathcal{X}} \mathbb{E}\|g(x)\|_{\mathcal{X}^*}^2

in the non-smooth case, and assume the existence of the variance in the smooth case:

\sigma^2 = \sup_{x \in \mathcal{X}} \mathbb{E}\|g(x) - \nabla f(x)\|_{\mathcal{X}^*}^2,

where the norm depends on the geometry. Let us consider the general stochastic mirror descent approach:

x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{ \langle x, g(x_t) \rangle + \frac{1}{\gamma_t} B_\Phi(x, x_t) \Big\}. \qquad \text{(Stochastic Mirror Descent)}

Now we can formulate the following theorem [Juditsky et al., 2011, Lan, 2012, Xiao, 2010]:

Theorem 1.5. Let Φ be 1-strongly convex with respect to ‖·‖_X and let B² be the second moment of the oracle g(x). Then stochastic mirror descent with step size γ_t = \frac{\sqrt{2\Omega}}{B\sqrt{t}} satisfies

\mathbb{E} f(\bar{x}_t) - f(x^*) \le B \sqrt{\frac{2\Omega}{t}}, \quad \text{where } \bar{x}_t = \frac{1}{t} \sum_{i=1}^{t} x_i.

Note the similarity of this result with Theorem 1.4: the only difference is thatbounded gradients are replaced with the bounded moment of the oracle.
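As a minimal sketch (not from the thesis), the following Python code implements the finite-sum stochastic oracle above and plugs it into plain Euclidean stochastic gradient descent (the ℓ2 instance of stochastic mirror descent) with iterate averaging; the quadratic losses f_i and the step-size schedule are assumptions chosen for the illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1_000, 5
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def oracle(theta):
        # Unbiased stochastic gradient: pick i uniformly and return the gradient of
        # f_i(theta) = 0.5 (x_i^T theta - y_i)^2, so that E g(theta) = grad f(theta).
        i = rng.integers(n)
        return (X[i] @ theta - y[i]) * X[i]

    theta, theta_avg = np.zeros(d), np.zeros(d)
    for t in range(1, 5_001):
        theta -= 0.1 / np.sqrt(t) * oracle(theta)      # decaying step size gamma_t ~ 1/sqrt(t)
        theta_avg += (theta - theta_avg) / t           # running average of the iterates
    print(np.mean((X @ theta_avg - y) ** 2))           # small average squared residual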

1.3 Saddle point optimization

In this section we consider saddle point optimization. Consider two convex and compact sets X ⊂ R^{d_1} and Y ⊂ R^{d_2}. Let φ(x, y) : X × Y → R be a function such that φ(·, y) is convex and φ(x, ·) is concave. The goal is to find

\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \varphi(x, y).

A classical example is obtained from Fenchel duality below. We present here resultsfrom [Juditsky and Nemirovski, 2011a,b, Nesterov and Nemirovski, 2013], concerningthe mirror descent approach.

Introduce the (sub)gradient field G(x, y) = \big( \partial_x \varphi(x, y), -\partial_y \varphi(x, y) \big), the analogue of (sub)gradients for saddle point problems. Then subgradients are given by (g_X(x, y), g_Y(x, y)) ∈ G(x, y).


The analogue of bounded gradients is written as:

Definition 1.5. A function $\varphi(x, y)$ has bounded gradients if $\|g_{\mathcal{X}}(x, y)\|_{\mathcal{X}^*} \le \mathcal{L}_{\mathcal{X}}$ and $\|g_{\mathcal{Y}}(x, y)\|_{\mathcal{Y}^*} \le \mathcal{L}_{\mathcal{Y}}$ for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

To evaluate the quality of a point $(x, y) \in \mathcal{X} \times \mathcal{Y}$, we introduce the notion of the so-called duality gap:

Definition 1.6. The duality gap is $\Delta_{\mathrm{dual}}(x, y) = \max_{y' \in \mathcal{Y}} \varphi(x, y') - \min_{x' \in \mathcal{X}} \varphi(x', y)$.

Observe that the duality gap is the sum of the primal gap $\max_{y \in \mathcal{Y}} \varphi(x, y) - \varphi(x^*, y^*)$ and the dual gap $\varphi(x^*, y^*) - \min_{x \in \mathcal{X}} \varphi(x, y)$, where $(x^*, y^*)$ is a saddle point. Introduce the variable $z = (x, y)$ and the set $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The main motivation for the duality gap is that it can be controlled in the following way, similar to convex optimization:
$$\Delta_{\mathrm{dual}}(z) \le \max_{z' \in \mathcal{Z}} g(z)^\top (z - z'),$$
where $g(z)$ is in the gradient field $G(z)$; for more details see Bertsekas [1999].

Recall that in order to apply mirror descent, we first need to construct a potential and choose geometries. Let $\Phi_{\mathcal{X}}(x)$ be a potential defined for the variable $x$, such that it is 1-strongly convex with respect to a norm $\|\cdot\|_{\mathcal{X}}$ on $\mathcal{X}$. Similarly, let $\Phi_{\mathcal{Y}}(y)$ be a potential defined for the variable $y$, such that it is 1-strongly convex with respect to a norm $\|\cdot\|_{\mathcal{Y}}$ on $\mathcal{Y}$. Let $\Omega_{\mathcal{X}}$ and $\Omega_{\mathcal{Y}}$ be the effective squared radii of the sets $\mathcal{X}$ and $\mathcal{Y}$ with respect to the corresponding norms.

Let us construct the composite potential $\Phi_{\mathcal{Z}}(z) = \frac{\mathcal{L}_{\mathcal{X}}}{\sqrt{\Omega_{\mathcal{X}}}} \Phi_{\mathcal{X}}(x) + \frac{\mathcal{L}_{\mathcal{Y}}}{\sqrt{\Omega_{\mathcal{Y}}}} \Phi_{\mathcal{Y}}(y); then one step of Saddle Point Mirror Descent (SP-MD) is the following:
$$z_{t+1} \in \arg\min_{z \in \mathcal{Z}}\; \langle g_t, z \rangle + \frac{1}{\gamma_t} B_{\Phi_{\mathcal{Z}}}(z, z_t), \qquad g_t \in G(x_t, y_t). \qquad \text{(SP-MD)}$$

Finally we can formulate the convergence rates:

Theorem 1.6. Let the function $\varphi(x, y)$ have bounded gradients with constants $\mathcal{L}_{\mathcal{X}}$ and $\mathcal{L}_{\mathcal{Y}}$. Then SP-MD with stepsize $\gamma_t = \sqrt{\frac{2}{t}}$ satisfies
$$\max_{y \in \mathcal{Y}} \varphi(\bar{x}_t, y) - \min_{x \in \mathcal{X}} \varphi(x, \bar{y}_t) \le \big(\sqrt{\Omega_{\mathcal{X}}}\, \mathcal{L}_{\mathcal{X}} + \sqrt{\Omega_{\mathcal{Y}}}\, \mathcal{L}_{\mathcal{Y}}\big) \sqrt{\frac{2}{t}}.$$
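As an illustration (not from the thesis, hypothetical names), here is a minimal Python/NumPy sketch of SP-MD for a bilinear matrix game $\min_{x \in \Delta_{d_1}} \max_{y \in \Delta_{d_2}} x^\top M y$, where both blocks use the entropic geometry, so each partial update is a multiplicative update; for this problem the duality gap of the averaged iterates is available in closed form. The stepsize below uses a simple scaling by the gradient bound in place of the composite potential's weighting.

```python
import numpy as np

def spmd_matrix_game(M, n_steps):
    """Saddle-point mirror descent for min_x max_y x^T M y on two simplices.

    The gradient field is G(x, y) = (M y, -M^T x); both blocks use the
    negative-entropy potential, so each step is a multiplicative update.
    Returns the averaged iterates and their duality gap
    max_y xbar^T M y - min_x x^T M ybar = max(M^T xbar) - min(M ybar).
    """
    d1, d2 = M.shape
    x, y = np.full(d1, 1 / d1), np.full(d2, 1 / d2)
    x_sum, y_sum = np.zeros(d1), np.zeros(d2)
    L = np.abs(M).max()                      # bound on the partial gradients
    for t in range(1, n_steps + 1):
        gx, gy = M @ y, -M.T @ x             # gradient field at (x_t, y_t)
        gamma = np.sqrt(2.0 / t) / L         # sqrt(2/t) stepsize, rescaled by L
        x = x * np.exp(-gamma * gx); x /= x.sum()
        y = y * np.exp(-gamma * gy); y /= y.sum()   # ascent in y via -gy
        x_sum += x; y_sum += y
    x_bar, y_bar = x_sum / n_steps, y_sum / n_steps
    gap = (M.T @ x_bar).max() - (M @ y_bar).min()
    return x_bar, y_bar, gap

# Example usage on a random payoff matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((30, 40))
    _, _, gap = spmd_matrix_game(M, n_steps=20000)
    print(f"duality gap of averaged iterates: {gap:.4f}")
```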

Fenchel duality

We finish this introduction with the classical Fenchel duality result, which provides us with a way to switch from convex problems to saddle-point problems. Let us first define the notion of the Fenchel conjugate of a convex function $f$:


Definition 1.7. The function $f^*(y)$ given by $f^\star(y) := \sup\{\langle y, x \rangle - f(x) \,|\, x \in \mathbb{R}^n\}$ is called the Fenchel conjugate of the function $f$.

Now we can formulate the theorem [see e.g. Borwein and Lewis, 2010]:

Theorem 1.7 (Fenchel's duality theorem). Let $f : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \to \mathbb{R}$ be convex functions and $A : \mathcal{X} \to \mathcal{Y}$ be a linear map. Then
$$\inf_{x \in \mathcal{X}} \big\{ f(x) + g(Ax) \big\} = \sup_{y \in \mathcal{Y}} \big\{ -f^*(A^\top y) - g^*(-y) \big\}.$$

Finally, we provide the way to switch between primal, dual and saddle point formulations of the convex problem:
$$\text{Primal problem:} \quad \inf_{x} \big\{ f(x) + g(Ax) \big\}.$$
$$\text{Dual problem:} \quad \sup_{y} \big\{ -f^*(A^\top y) - g^*(-y) \big\}.$$
$$\text{Saddle problem:} \quad \inf_{x} \sup_{y} \big\{ f(x) - g^*(-y) + y^\top A x \big\}.$$

Machine learning motivation. These optimization problems are motivated by machine learning applications, where $x$ is the parameter to estimate, the matrix $A$ the data, $g$ the loss and $f$ the regularizer (note the similarity with regularized empirical risk minimization (1.1.1)). Dual or saddle problem formulations help to switch to an equivalent task which in some sense has a simpler structure, such as a bilinear saddle-point problem with composite terms, and which moreover allows one to control the duality gap.
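For instance (an illustration not taken from the thesis), take the square loss $g(z) = \frac{1}{2}\|z - b\|_2^2$ and the $\ell_1$ regularizer $f(x) = \lambda \|x\|_1$. Their conjugates are $g^*(y) = \frac{1}{2}\|y\|_2^2 + b^\top y$ and $f^*(u) = 0$ if $\|u\|_\infty \le \lambda$ and $+\infty$ otherwise, so Theorem 1.7 turns the Lasso-type primal into a quadratic dual over a polytope:
$$\inf_{x} \Big\{ \lambda \|x\|_1 + \tfrac{1}{2} \|Ax - b\|_2^2 \Big\} = \sup_{\|A^\top y\|_\infty \le \lambda} \Big\{ b^\top y - \tfrac{1}{2} \|y\|_2^2 \Big\}.$$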


Chapter 2

Sliced inverse regression with score functions

Abstract

We consider non-linear regression problems where we assume that the response depends non-linearly on a linear projection of the covariates. We propose score function extensions to sliced inverse regression problems, both for the first-order and second-order score functions. We show that they provably improve estimation in the population case over the non-sliced versions, and we study finite sample estimators and their consistency given the exact score functions. We also propose to learn the score function as well, either in two steps, i.e., first learning the score function and then learning the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm. We illustrate our results on a series of experiments.

This chapter is based on the journal article: Slice inverse regression with score functions, D. Babichev, F. Bach, in Electronic Journal of Statistics [Babichev and Bach, 2018b].

2.1 Introduction

Non-linear regression and related problems such as non-linear classification are important core tasks in machine learning and statistics. In this chapter, we consider a random vector $x \in \mathbb{R}^d$, a random response $y \in \mathbb{R}$, and a regression model of the form
$$y = f(x) + \varepsilon, \qquad (2.1.1)$$
which we want to estimate from $n$ independent and identically distributed (i.i.d.) observations $(x_i, y_i)$, $i = 1, \dots, n$. Our goal is to estimate the function $f$ from these data. A traditional key difficulty in this general regression problem is the lack of parametric assumptions regarding the functional form of $f$, leading to a problem of non-parametric regression. This is often tackled by searching implicitly or explicitly for


a function ๐‘“ within an infinite-dimensional vector space.While several techniques exist to estimate such a function, e.g., kernel methods,

local-averaging, or neural networks [see, e.g., Gyรถrfi et al., 2002, Tsybakov, 2009],they also suffer from the curse of dimensionality, that is, the rate of convergence ofthe estimated function to the true function (with any relevant performance measure)can only decrease as a small power of ๐‘›, and this power cannot be larger than aconstant divided by ๐‘‘. In other words, the number ๐‘› of observations for any level ofprecision is exponential in dimension.

A classical way of bypassing the curse of dimensionality is to make extra assumptions regarding the function to estimate, such as its dependence on an unknown low-dimensional subspace, as is done by projection pursuit or neural networks. More precisely, throughout the chapter, we make the following assumption:

(A1) For all $x \in \mathbb{R}^d$, we have $f(x) = g(w^\top x)$ for a certain matrix $w \in \mathbb{R}^{d \times k}$ and a function $g : \mathbb{R}^k \to \mathbb{R}$. Moreover, $y = f(x) + \varepsilon$ with $\varepsilon$ independent of $x$, with zero mean and finite variance.

The subspace of $\mathbb{R}^d$ spanned by the $k$ columns $w_1, \dots, w_k \in \mathbb{R}^d$ of $w$ has dimension less than or equal to $k$, and is often called the effective dimension reduction (e.d.r.) space. The model above is often referred to as a multiple-index model [Yuan, 2011]. We will always make the assumption that the e.d.r. space has exactly rank $k$, that is, the matrix $w$ has rank $k$ (which implies that $k \le d$).

Given ๐‘ค, estimating ๐‘” may be done by any technique in non-parametric regression,with a convergence rate which requires a number of observations ๐‘› to be exponentialin ๐‘˜, with methods based on local averaging (e.g., Nadaraya-Watson estimators) oron least-squares regression [see, e.g., Gyรถrfi et al., 2002, Tsybakov, 2009]. Given thenon-linear function ๐‘”, estimating ๐‘ค is computationally difficult because the resultingoptimization problem may not be convex and thus leads to several local minima. Thedifficulty is often even stronger since one often wants to estimate both the function ๐‘”and the matrix ๐‘ค.

Our main goal in this chapter is to estimate the matrix $w$, with the hope of obtaining a convergence rate where the inverse power of $n$ is now proportional to $k$ and not $d$. Note that the matrix $w$ is only identifiable up to a (right) linear transform, since only the subspace spanned by its columns is characteristic.

Method of moments vs. optimization. This multiple-index problem and the goal of estimating $w$ only can be tackled from two points of view: (a) the method of moments, where certain moments are built so that the effect of the unknown function $g$ disappears [Brillinger, 1982, Li and Duan, 1989], a method that we follow here and describe in more detail below. These methods rely heavily on the model being correct and, in the instances that we consider here, lead to provably polynomial-time algorithms (most often linear in the number of observations, since only moments are computed). In contrast, (b) optimization-based methods use implicitly or explicitly non-parametric estimation, e.g., using local averaging methods to design an objective function that can be minimized to obtain an estimate of $w$ [Xia et al., 2002a, Fukumizu et al., 2009]. The objective function is usually non-convex and gradient


descent techniques are used to obtain a local minimum. While these procedures offer no theoretical guarantees due to the potentially unknown difficulty of the optimization problem, they often work well in practice, and we have observed this in our experiments.

In this chapter, we consider and improve a specific instantiation of the method of moments, which partially circumvents the difficulty of joint estimation by estimating $w$ directly without the knowledge of $g$. The starting point for this method is the work by Brillinger [1982], which shows, as a simple consequence of Stein's lemma [Stein, 1981], that if the distribution of $x$ is Gaussian, (A1) is satisfied with $k = 1$ (e.d.r. of dimension one, i.e., a single-index model), and the input data have zero mean and identity covariance matrix, then the expectation $\mathbb{E}(y x)$ is proportional to $w$. Thus, a certain expectation, which can be easily approximated given i.i.d. observations, simultaneously eliminates $g$ and reveals $w$.

While the result above provides a very simple algorithm to recover $w$, it has several strong limitations: (a) it only applies to normally distributed data $x$, or more generally to elliptically symmetric distributions [Cambanis et al., 1981], (b) it only applies to $k = 1$, and (c) in many situations with symmetries, the proportionality constant is equal to zero and thus we cannot recover the vector $w$. This has led to several extensions in the statistical literature, which we now present.

Using score functions. The use of Stein's lemma with a Gaussian random variable can be directly extended using the score function $\mathcal{S}_1(x)$, defined as the negative gradient of the log-density, that is, $\mathcal{S}_1(x) = -\nabla \log p(x) = -\frac{1}{p(x)} \nabla p(x)$, which leads to the following assumption:

(A2) The distribution of $x$ has a strictly positive density $p(x)$ which is differentiable with respect to the Lebesgue measure, and such that $p(x) \to 0$ when $\|x\| \to +\infty$.

We will need the score to be sub-Gaussian to obtain consistency results. Given Assumption (A2), Stoker [1986] showed, as a simple consequence of integration by parts, that for $k = 1$, if Assumption (A1) is satisfied, then $\mathbb{E}(y \mathcal{S}_1(x))$ is proportional to $w$ for all differentiable functions $g$, with a proportionality constant that depends on $w$ and $\nabla g$. This leads to the "average derivative method" (ADE) and thus replaces the Gaussian assumption by the existence of a differentiable log-density, which is much weaker. This however does not remove the restriction $k = 1$, which can be done in two ways which we now present.

Sliced inverse regression. Given a normalized Gaussian distribution for $x$ (or any elliptically symmetric distribution), then, if (A1) is satisfied, almost surely in $y$ the conditional expectation $\mathbb{E}(x|y)$ happens to belong to the e.d.r. subspace. Given several distinct values of $y$, the vectors $\mathbb{E}(x|y)$, or any estimates thereof, will hopefully span the entire e.d.r. space and we can recover the entire matrix $w$, leading to "sliced inverse regression" (SIR), originally proposed by Li and Duan [1989], Duan and Li [1991], Li [1991]. This allows estimation with $k > 1$, but is still restricted to Gaussian data. In this chapter, we propose to extend SIR by the use of score functions


to go beyond elliptically symmetric distributions, and we show that the new method combining SIR and score functions is formally better than the plain ADE method.

From first-order to second-order moments. Another line of extension of the simple method of Brillinger [1982] is to consider higher-order moments, namely the matrix $\mathbb{E}(y x x^\top) \in \mathbb{R}^{d \times d}$, which, with normally distributed input data $x$ and if (A1) is satisfied, will be proportional (in a particular form to be described in Section 2.2.2) to the Hessian of the function $g$, leading to the method of "principal Hessian directions" (PHD) from Li [1992]. Again, $k > 1$ is allowed (more than a single projection), but the method is limited to elliptically symmetric data. However, Janzamin et al. [2014] proposed to use second-order score functions to go beyond this assumption. In order to define this new method, we consider the following assumption:

(A3) The distribution of $x$ has a strictly positive density $p(x)$ which is twice differentiable with respect to the Lebesgue measure, and such that $p(x)$ and $\|\nabla p(x)\| \to 0$ when $\|x\| \to +\infty$.

Given (A1) and (A3), one can show [Janzamin et al., 2014] that $\mathbb{E}(y \mathcal{S}_2(x))$ will be proportional to the Hessian of the function $g$, where $\mathcal{S}_2(x) = \nabla^2 \log p(x) + \mathcal{S}_1(x) \mathcal{S}_1(x)^\top = \frac{1}{p(x)} \nabla^2 p(x)$, thus extending the Gaussian situation above where $\mathcal{S}_1$ was a linear function and $\mathcal{S}_2(x)$ was, up to linear terms, proportional to $x x^\top$.

In this chapter, we propose to extend the method above to allow an SIR estimator for the second-order score functions, where we condition on $y$, and we show that the new method is formally better than the plain method of Janzamin et al. [2014].

Learning score functions through score matching. Relying on score functions immediately raises the following question: is estimating the score function (when it is not available) really simpler than our original problem of non-parametric regression? Fortunately, a recent line of work [Hyvärinen, 2005] has considered this exact problem and formulated the task of density estimation directly in terms of score functions, which is particularly useful in our context. We may then use the data first to learn the score, and then use the novel score-based moments to estimate $w$. We will also consider a direct approach that jointly estimates the score function and the e.d.r. subspace, by regularizing by a sparsity-inducing norm.

Fighting the curse of dimensionality. Learning the score function is still a non-parametric problem, with the associated curse of dimensionality. If we first learn the score function (through score matching) and then learn the matrix $w$, we will not escape that curse, while our direct approach is empirically more robust.

Note that Hristache, Juditsky and Spokoiny [Hristache et al., 2001] suggested iterative improvements of the ADE method, using elliptic windows which shrink in the directions of the columns of $w$, stretch in all other directions and tend to flat layers orthogonal to $w$. Dalalyan, Juditsky and Spokoiny [Dalalyan et al., 2008] generalized the algorithm to multi-index models and proved $\sqrt{n}$-consistency of the proposed procedure in the case when the structural dimension is not larger than 4,


and a weaker dependence for $d > 4$. In particular, they provably avoid the curse of dimensionality. Such extensions are outside the scope of this chapter.

Contributions. In this chapter, we make the following contributions:
– We propose score function extensions to sliced inverse regression problems, both for the first-order and second-order score functions. We consider the infinite sample case in Section 2.2 and the finite sample case in Section 2.3. They provably improve estimation in the population case over the non-sliced versions, while we study in Section 2.3 finite sample estimators and their consistency given the exact score functions.
– We propose in Section 2.4 to learn the score function as well, either in two steps, i.e., first learning the score function and then learning the e.d.r. space parameterized by $w$, or directly, by solving a convex optimization problem regularized by the nuclear norm.
– We illustrate our results in Section 2.5 on a series of experiments.

2.2 Estimation with infinite sample size

In this section, we focus on the population situation, where we can compute expectations and conditional expectations exactly. We focus on finite sample estimators with known score functions in Section 2.3, with consistency results in Section 2.3.3, and with learned score functions in Section 2.4.

2.2.1 SADE: Sliced average derivative estimation

Before presenting our new moments, which will lead to the novel SADE method, we consider the non-sliced method, which is based on Assumptions (A1) and (A2) and score functions (the method based on the Gaussian assumption will be derived later as corollaries). The ADE method is based on the following lemma:

Lemma 2.1 (ADE moment [Stoker, 1986]). Assume (A1), (A2), the differentiability of $g$ and the existence of the expectation $\mathbb{E}(g'(w^\top x))$. Then $\mathbb{E}(\mathcal{S}_1(x) y)$ is in the e.d.r. subspace.

Proof. Since ๐‘ฆ = ๐‘“(๐‘ฅ) + ๐œ€, and ๐œ€ is independent of ๐‘ฅ with zero mean, we have

E(๐’ฎ1(๐‘ฅ)๐‘ฆ) = E(๐’ฎ1(๐‘ฅ)๐‘“(๐‘ฅ)) =

โˆซR๐‘‘

โˆ’โˆ‡๐‘(๐‘ฅ)

๐‘(๐‘ฅ)๐‘“(๐‘ฅ)๐‘(๐‘ฅ)๐‘‘๐‘ฅ

= โˆ’โˆซR๐‘‘

โˆ‡๐‘(๐‘ฅ)๐‘“(๐‘ฅ)๐‘‘๐‘ฅ =

โˆซR๐‘‘

๐‘(๐‘ฅ)โˆ‡๐‘“(๐‘ฅ)๐‘‘๐‘ฅ by integration by parts,

= ๐‘ค ยท E(๐‘”โ€ฒ(๐‘คโŠค๐‘ฅ)),

which leads to the desired result. Note that in the integration by parts above, thedecay of ๐‘(๐‘ฅ) to zero for โ€–๐‘ฅโ€– โ†’ +โˆž is needed.


The ADE moment above only provides a single vector in the e.d.r. subspace, which can only potentially lead to recovery for $k = 1$, and only if $\mathbb{E}(g'(w^\top x)) \neq 0$, which may not be satisfied, e.g., if $x$ has a symmetric distribution and $g$ is even.

We can now present our first new lemma, the proof of which relies on similar arguments as for SIR [Li and Duan, 1989], but extended to score functions. Note that we do not require the differentiability of the function $g$.

Lemma 2.2 (SADE moment). Assume (A1) and (A2). Then $\mathbb{E}(\mathcal{S}_1(x)|y)$ is in the e.d.r. subspace almost surely (in $y$).

Proof. We consider any vector $b \in \mathbb{R}^d$ in the orthogonal complement of the subspace $\mathrm{Span}\{w_1, \dots, w_k\}$. We need to show that $b^\top \mathbb{E}(\mathcal{S}_1(x)|y) = 0$ with probability 1. By the law of total expectation,
$$b^\top \mathbb{E}\big(\mathcal{S}_1(x)\,\big|\,y\big) = \mathbb{E}\Big(\mathbb{E}\big(b^\top \mathcal{S}_1(x)\,\big|\,w_1^\top x, \dots, w_k^\top x, y\big)\,\Big|\,y\Big).$$
Because of Assumption (A1), we have $y = g(w^\top x) + \varepsilon$ with $\varepsilon$ independent of $x$, and thus
$$\mathbb{E}\big(b^\top \mathcal{S}_1(x)\,\big|\,w^\top x, y\big) = \mathbb{E}\big(b^\top \mathcal{S}_1(x)\,\big|\,w^\top x, \varepsilon\big) = \mathbb{E}\big(b^\top \mathcal{S}_1(x)\,\big|\,w^\top x\big) \quad \text{almost surely.}$$
This leads to
$$b^\top \mathbb{E}\big(\mathcal{S}_1(x)\,\big|\,y\big) = \mathbb{E}\Big[\mathbb{E}\big(b^\top \mathcal{S}_1(x)\,\big|\,w^\top x\big)\,\Big|\,y\Big].$$
We now prove that almost surely $\mathbb{E}\big(b^\top \mathcal{S}_1(x)\,\big|\,w^\top x\big) = 0$, which will be sufficient to prove Lemma 2.2. We consider the linear transformation of coordinates $\tilde{x} = \tilde{w}^\top x \in \mathbb{R}^d$, where $\tilde{w} = (w_1, \dots, w_k, w_{k+1}, \dots, w_d)$ is a square matrix with full rank, obtained by adding a basis of the subspace orthogonal to the span of the $k$ columns of $w$. Then, if $\tilde{p}$ is the density of $\tilde{x}$, we have $p(x) = (\det \tilde{w}) \cdot \tilde{p}(\tilde{x})$ and thus $\nabla p(x) = (\det \tilde{w}) \cdot \tilde{w} \cdot \nabla \tilde{p}(\tilde{x})$, and $\tilde{b} = \tilde{w}^\top b = (0, \dots, 0, \tilde{b}_{k+1}, \dots, \tilde{b}_d) \in \mathbb{R}^d$ (because $b \perp \mathrm{Span}\{w_1, \dots, w_k\}$). The desired conditional expectation equals $\mathbb{E}\big(\tilde{b}^\top \tilde{\mathcal{S}}_1(\tilde{x})\,\big|\,\tilde{x}_1, \dots, \tilde{x}_k\big)$, since $\tilde{w}\, \tilde{\mathcal{S}}_1(\tilde{x}) = -\frac{\tilde{w}\, \nabla \tilde{p}(\tilde{x})}{\tilde{p}(\tilde{x})} = -\frac{\nabla p(x)}{p(x)} = \mathcal{S}_1(x)$, and hence $b^\top \mathcal{S}_1(x) = b^\top \tilde{w}\, \tilde{\mathcal{S}}_1(\tilde{x}) = \tilde{b}^\top \tilde{\mathcal{S}}_1(\tilde{x})$.

It is thus sufficient to show that $\int_{\mathbb{R}^{d-k}} \tilde{b}^\top \tilde{\mathcal{S}}_1(\tilde{x})\, \tilde{p}(\tilde{x}_1, \dots, \tilde{x}_d)\, d\tilde{x}_{k+1} \cdots d\tilde{x}_d = 0$ for all $\tilde{x}_1, \dots, \tilde{x}_k$. We have
$$\int_{\mathbb{R}^{d-k}} \tilde{b}^\top \tilde{\mathcal{S}}_1(\tilde{x})\, \tilde{p}(\tilde{x}_1, \dots, \tilde{x}_d)\, d\tilde{x}_{k+1} \cdots d\tilde{x}_d = -\tilde{b}^\top \int_{\mathbb{R}^{d-k}} \nabla \tilde{p}(\tilde{x})\, d\tilde{x}_{k+1} \cdots d\tilde{x}_d = -\sum_{j=k+1}^{d} \tilde{b}_j \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{\partial \tilde{p}(\tilde{x})}{\partial \tilde{x}_j}\, d\tilde{x}_j \prod_{\substack{k+1 \le t \le d \\ t \neq j}} d\tilde{x}_t = 0,$$
because for any $j \in \{k+1, \dots, d\}$, $\int_{-\infty}^{\infty} \frac{\partial \tilde{p}(\tilde{x})}{\partial \tilde{x}_j}\, d\tilde{x}_j = 0$ by Assumption (A2). This leads to the desired result.

The key differences are now that:
– Unlike ADE, by conditioning on different values of $y$, we have access to several vectors $\mathbb{E}(\mathcal{S}_1(x)|y) \in \mathbb{R}^d$.
– Unlike SIR, SADE does not require the linearity condition from [Li, 1991] anymore and can be used with a smooth enough probability density.
In the population case, we will consider the following matrix (using the fact that $\mathbb{E}(\mathcal{S}_1(x)) = 0$):

$$\mathcal{V}_{1,\mathrm{cov}} = \mathbb{E}\Big[\mathbb{E}(\mathcal{S}_1(x)|y)\, \mathbb{E}(\mathcal{S}_1(x)|y)^\top\Big] = \mathrm{Cov}\big[\mathbb{E}(\mathcal{S}_1(x)|y)\big] \in \mathbb{R}^{d \times d},$$
which we will also denote $\mathbb{E}\big[\mathbb{E}(\mathcal{S}_1(x)|y)^{\otimes 2}\big]$, where for any matrix $a$, $a^{\otimes 2}$ denotes $a a^\top$.

The matrix above is positive semi-definite, and its column space is included in the e.d.r. space. If it has rank $k$, then we can exactly recover the entire subspace by an eigenvalue decomposition. When $k = 1$, which is the only case where ADE may be used, the following proposition shows that if ADE allows one to recover $w$, then so does SADE.

We will also consider the other matrix (note the presence of the extra term $y^2$)
$$\mathcal{V}'_{1,\mathrm{cov}} = \mathbb{E}\Big[y^2\, \mathbb{E}(\mathcal{S}_1(x)|y)\, \mathbb{E}(\mathcal{S}_1(x)|y)^\top\Big],$$
because of its direct link with the non-sliced version. Note that we make the weak assumption of existence of the matrices $\mathcal{V}_{1,\mathrm{cov}}$ and $\mathcal{V}'_{1,\mathrm{cov}}$, which is satisfied for the majority of problems.

Proposition 2.1. Assume (A1) and (A2), with $k = 1$, as well as differentiability of $g$ and existence of the expectation $\mathbb{E}\, g'(w^\top x)$. The vector $w$ may be recovered from the ADE moment (up to scale) if and only if $\mathbb{E}\, g'(w^\top x) \neq 0$. If this condition is satisfied, then SADE also recovers $w$ up to scale (i.e., $\mathcal{V}_{1,\mathrm{cov}}$ and $\mathcal{V}'_{1,\mathrm{cov}}$ are different from zero).

Proof. The first statement is a consequence of the proof of Lemma 2.1. If SADE fails, that is, for almost all $y$, $\mathbb{E}(\mathcal{S}_1(x)|y) = 0$, then $\mathbb{E}(\mathcal{S}_1(x) y|y) = 0$, which implies that $\mathbb{E}(\mathcal{S}_1(x) y) = 0$ and thus ADE fails. Moreover, we have, using operator convexity [Donoghue, 1974]:
$$\mathcal{V}'_{1,\mathrm{cov}} = \mathbb{E}\Big[\mathbb{E}(y \mathcal{S}_1(x)|y)\, \mathbb{E}(y \mathcal{S}_1(x)|y)^\top\Big] \succcurlyeq \big[\mathbb{E}\big(\mathbb{E}(y \mathcal{S}_1(x)|y)\big)\big]\big[\mathbb{E}\big(\mathbb{E}(y \mathcal{S}_1(x)|y)\big)\big]^\top = \big[\mathbb{E}(y \mathcal{S}_1(x))\big]\big[\mathbb{E}(y \mathcal{S}_1(x))\big]^\top,$$


showing that the new moment dominates the ADE moment, which provides an alternative proof that the rank of $\mathcal{V}'_{1,\mathrm{cov}}$ is larger than or equal to one if $\mathbb{E}(y \mathcal{S}_1(x)) \neq 0$.

Elliptically symmetric distributions. If ๐‘ฅ is normally distributed with meanvector ๐œ‡ and covariance matrix ฮฃ, then we have ๐’ฎ1(๐‘ฅ) = ฮฃโˆ’1(๐‘ฅโˆ’ ๐œ‡) and we recoverthe result from Li and Duan [1989]. Note that the lemma then extends to all ellipticaldistributions of the form ๐œ™(1

2(๐‘ฅโˆ’๐œ‡)โŠคฮฃโˆ’1(๐‘ฅโˆ’๐œ‡)), for a certain function ๐œ™ : R+ โ†’ R.

See Li and Duan [1989] for more details.

Failure modes. In some cases, sliced inverse regression does not span the entire e.d.r. space, because the inverse regression curve $\mathbb{E}(\mathcal{S}_1(x)|y)$ is degenerate. For example, this can occur if $k = 1$, $y = h(w_1^\top x) + \varepsilon$, $h$ is an even function and $w_1^\top x$ has a symmetric distribution around 0. Then $\mathbb{E}(\mathcal{S}_1(x)|y) \equiv 0$, and thus it is a poor estimation of the desired e.d.r. directions [Cook and Weisberg, 1991].

The second drawback of SIR occurs when we have a classification task, for example, $y \in \{0, 1\}$. In this case, we have only two slices (i.e., possible values of $y$) and SIR can recover only one direction in the e.d.r. space [Cook and Lee, 1999].

Li [1992] suggested another way to estimate the e.d.r. space which can handle such symmetric cases: principal Hessian directions (PHD). However, this method uses the normality of the vector $x$. As in the SIR case, we can extend this method using score functions so that it applies to any distribution; we will refer to the result as SPHD, which we now present.

2.2.2 SPHD: Sliced principal Hessian directions

Before presenting our new moment, which will lead to the SPHD method, we consider the non-sliced method, which is based on Assumptions (A1) and (A3) and score functions (the method based on the Gaussian assumption will be derived later as corollaries). The method of Janzamin et al. [2014], which we refer to as "PHD+", is based on the following lemma (we reproduce the proof for readability):

Lemma 2.3 (second-order score moment (PHD+) [Janzamin et al., 2014]). Assume (A1), (A3), twice differentiability of the function $g$ and existence of the expectation $\mathbb{E}(\nabla^2 g(w^\top x))$. Then $\mathbb{E}(\mathcal{S}_2(x) y)$ has a column space included in the e.d.r. subspace.

Proof. Since ๐‘ฆ = ๐‘“(๐‘ฅ) + ๐œ€, and ๐œ€ is independent of ๐‘ฅ, we have

E(๐‘ฆ๐’ฎ2(๐‘ฅ)) = E

(๐‘“(๐‘ฅ)๐’ฎ2(๐‘ฅ)

)=

โˆซโˆ‡2๐‘(๐‘ฅ)

๐‘(๐‘ฅ)๐‘(๐‘ฅ)๐‘“(๐‘ฅ)๐‘‘๐‘ฅ =

=

โˆซโˆ‡2๐‘(๐‘ฅ) ยท ๐‘“(๐‘ฅ)๐‘‘๐‘ฅ =

โˆซ๐‘(๐‘ฅ) ยท โˆ‡2๐‘“(๐‘ฅ)๐‘‘๐‘ฅ = E

[โˆ‡2๐‘“(๐‘ฅ)

],

using integration by parts and the decay of ๐‘(๐‘ฅ) and โˆ‡๐‘(๐‘ฅ) for โ€–๐‘ฅโ€– โ†’ โˆž. This leadsto the desired result since โˆ‡2๐‘“(๐‘ฅ) = ๐‘คโˆ‡2๐‘”(๐‘คโŠค๐‘ฅ)๐‘คโŠค. This was proved by Li [1992]for normal distributions.


Failure modes. The method does not work properly if $\mathrm{rank}\big(\mathbb{E}(\nabla^2 g(w^\top x))\big) < k$. For example, if $g$ is a linear function, $\mathbb{E}(\nabla^2 g(w^\top x)) \equiv 0$ and the estimated e.d.r. space is degenerate. Moreover, the method fails in symmetric cases: for example, if $g$ is an odd function with respect to any variable and $p(x)$ is an even function, then $\mathrm{rank}\big(\mathbb{E}(\nabla^2 g(w^\top x))\big) < k$.

We can now present our second new lemma, the proof of which relies on similar arguments as for PHD [Li, 1992] but extended to score functions (again, no differentiability is assumed on $g$):

Lemma 2.4 (SPHD moment). Assume (A1) and (A3). Then $\mathbb{E}(\mathcal{S}_2(x)|y)$ has a column space within the e.d.r. subspace almost surely.

Proof. We consider any ๐‘Ž โˆˆ R๐‘‘ and ๐‘ โˆˆ R๐‘‘ orthogonal to the e.d.r. subspace, andprove, that ๐‘ŽโŠคE

(๐’ฎ2(๐‘ฅ)|๐‘ฆ

)๐‘ = 0. We use the same transform of coordinates as in

the proof of Lemma 2.2: ๐‘ฅ = ๐‘คโŠค๐‘ฅ โˆˆ R๐‘‘. Then โˆ‡2๐‘(๐‘ฅ) = det( ๐‘ค) ยท ๐‘คโˆ‡2๐‘(๐‘ฅ) ๐‘คโŠค,๐‘ = ๐‘คโŠค๐‘ = (0, . . . , 0,๐‘๐‘˜+1, . . . ,๐‘๐‘‘) and we will prove, that E(๐‘ŽโŠค๐’ฎ2(๐‘ฅ)๐‘|๐‘คโŠค๐‘ฅ

)= 0

almost surely, and ๐‘คโŠค ๐’ฎ2(๐‘ฅ) ๐‘ค =๐‘คโŠคโˆ‡2๐‘(๐‘ฅ) ๐‘ค๐‘(๐‘ฅ)

=โˆ‡2๐‘(๐‘ฅ)

๐‘(๐‘ฅ)= ๐’ฎ2(๐‘ฅ). It is sufficient to

show, that for all ๐‘ฅ1, . . . , ๐‘ฅ๐‘˜:โˆซR๐‘‘โˆ’๐‘˜

๐‘ŽโŠค ยท๐’ฎ2(๐‘ฅ) ยท๐‘ ยท ๐‘(๐‘ฅ1, . . . , ๐‘ฅ๐‘‘)๐‘‘๐‘ฅ๐‘˜+1 . . . ๐‘‘๐‘ฅ๐‘‘ = 0.

We have:โˆซR๐‘‘โˆ’๐‘˜

๐‘ŽโŠค ยท ๐’ฎ2(๐‘ฅ) ยท๐‘ ยท ๐‘(๐‘ฅ1, . . . , ๐‘ฅ๐‘‘)๐‘‘๐‘ฅ๐‘˜+1 . . . ๐‘‘๐‘ฅ๐‘‘ = ๐‘ŽโŠค ยท [ โˆซR๐‘‘โˆ’๐‘˜

โˆ‡2๐‘(๐‘ฅ) ยท ๐‘‘๐‘ฅ๐‘˜+1 ยท ยท ยท ๐‘‘๐‘ฅ๐‘‘] ยท๐‘

=โˆ‘16๐‘–6๐‘‘

๐‘˜+16๐‘—6๐‘‘

๐‘Ž๐‘–๐‘๐‘— โˆžโˆซโˆ’โˆž

โˆžโˆซโˆ’โˆž

ยท ยท ยท

[ โˆžโˆซโˆ’โˆž

๐œ•2 ๐‘ƒ (๐‘ฅ)

๐œ•๐‘ฅ๐‘–๐œ•๐‘ฅ๐‘— ยท ๐‘‘๐‘ฅ๐‘—]ยท

โˆ๐‘˜+16๐‘ก6๐‘‘=๐‘—

๐‘‘๐‘ฅ๐‘ก = 0,

because for any ๐‘— โˆˆ ๐‘˜ + 1, . . . , ๐‘›:โˆžโˆซ

โˆ’โˆž

๐œ•2 ๐‘ƒ (๐‘ฅ)

๐œ•๐‘ฅ๐‘–๐œ•๐‘ฅ๐‘— ยท ๐‘‘๐‘ฅ๐‘— = 0 due to Assumption (A3),

which leads to the desired result.

In order to be able to use several values of $y$, we will estimate the matrix $\mathcal{V}_2 = \mathbb{E}\big([\mathbb{E}(\mathcal{S}_2(x)|y)]^2\big)$, which is the expectation with respect to $y$ of the square (in the matrix multiplication sense) of the conditional expectation from Lemma 2.4, as well as $\mathcal{V}'_2 = \mathbb{E}\big(y^2 [\mathbb{E}(\mathcal{S}_2(x)|y)]^2\big)$, and consider the $k$ largest eigenvectors (we make the weak assumption of existence of the matrices $\mathcal{V}_2$ and $\mathcal{V}'_2$, which is satisfied for the majority of problems). From the lemma above, this matrix has a column space included in


the e.d.r. subspace; thus, if it has rank $k$, we get exact recovery by an eigenvalue decomposition.

Effect of slicing. We now study the effect of slicing and show that in the population case it is superior to the non-sliced version, as it recovers the true e.d.r. subspace in more situations.

Proposition 2.2. Assume (A1) and (A3). The matrix $w$ may be recovered from the moment in Lemma 2.3 (up to a right linear transform) if and only if $\mathbb{E}[\nabla^2 g(w^\top x)]$ has full rank. If this condition is satisfied, then SPHD also recovers $w$ up to scale.

Proof. The first statement is a consequence of the proof of Lemma 2.3. Moreover, using the Löwner–Heinz theorem on operator convexity [Donoghue, 1974]:
$$\mathcal{V}'_2 = \mathbb{E}\big[\mathbb{E}(y \mathcal{S}_2(x)|y)^2\big] \succcurlyeq \big[\mathbb{E}[\mathbb{E}(y \mathcal{S}_2(x)|y)]\big]^2 = \big[\mathbb{E}(y \mathcal{S}_2(x))\big]^2,$$
showing that the new moment dominates the PHD moment, thus implying that
$$\mathrm{rank}[\mathcal{V}'_2] \ge \mathrm{rank}\big[\mathbb{E}(y \mathcal{S}_2(x))\big].$$
Therefore, if $\mathrm{rank}\big[\mathbb{E}(y \mathcal{S}_2(x))\big] = k$, then $\mathrm{rank}[\mathcal{V}'_2] = k$ (note that there is no such simple proof for $\mathcal{V}_2$).

Elliptically symmetric distributions. When $x$ is a standard Gaussian random variable, then $\mathcal{S}_2(x) = -I + x x^\top$, and thus $\mathbb{E}\big([\mathbb{E}(\mathcal{S}_2(x)|y)]^2\big) = \mathbb{E}\big([I - \mathrm{Cov}(x|y)]^2\big)$, and we recover the sliced average variance estimation (SAVE) method by Cook [2000]. However, our method applies to all distributions (with known score functions).

2.2.3 Relationship between first and second order methods

All considered methods have their own failure modes. The simplest one, ADE, works only in the single-index model and has quite a simple working condition: $\mathbb{E}[g'(w^\top x)] \neq 0$. The sliced improvement of this algorithm (SADE) has better performance; however, it still suffers from symmetric cases, when the inverse regression curve is partly degenerate. PHD+ cannot work properly in linear models and symmetric cases. SPHD is stronger than PHD+ and potentially has the widest application area among the four described methods. See the summary in Table 2.1.

Our conditions rely on the full possible rank of certain expected covariance matrices. When the function $g$ is selected randomly from all potential functions from $\mathbb{R}^k$ to $\mathbb{R}$, rank deficiencies typically do not occur, and it would be interesting to show that they indeed appear with probability zero for certain random function models.

2.3 Estimation from finite sample

In this section, we consider finite sample estimators for the moments we have defined in Section 2.2. Since our extensions are combinations of existing techniques


| method | main equation | score | sliced |
| ADE  | $\mathbb{E}(\mathcal{S}_1(x)\, y)$ | first | no |
| SADE | $\mathbb{E}_y\big[\mathbb{E}(\mathcal{S}_1(x)|y)^{\otimes 2}\big]$ | first | yes |
| PHD+ | $\mathbb{E}(\mathcal{S}_2(x)\, y)$ | second | no |
| SPHD | $\mathbb{E}_y\big[\mathbb{E}(\mathcal{S}_2(x)|y)^2\big]$ | second | yes |

| method | single-index | multi-index |
| ADE  | $\mathbb{E}[g'(w^\top x)] \neq 0$ | does not work |
| SADE | $\mathbb{E}_y\big[\|\mathbb{E}(\mathcal{S}_1(x)|y)\|^2\big] > 0$ | $\mathrm{rank}\, \mathbb{E}_y\big[\mathbb{E}(\mathcal{S}_1(x)|y)^{\otimes 2}\big] = k$ |
| PHD+ | $\mathbb{E}[g''(w^\top x)] \neq 0$ | $\mathrm{rank}\big[\mathbb{E}[\nabla^2 g(w^\top x)]\big] = k$ |
| SPHD | $\mathbb{E}_y\big[\mathrm{tr}\, \mathbb{E}(\mathcal{S}_2(x)|y)^2\big] > 0$ | $\mathrm{rank}\, \mathbb{E}_y\big[\mathbb{E}(\mathcal{S}_2(x)|y)^2\big] = k$ |

Table 2.1 – Comparison of the different methods using score functions.

(using score functions and slicing), our finite-sample estimators naturally rely on existing work [Hsing and Carroll, 1992, Zhu and Ng, 1995].

In this section, we assume that the score function is known. We consider learning the score function in Section 2.4.

2.3.1 Estimator and algorithm for SADE

Our goal is to provide an estimator for $\mathcal{V}_{1,\mathrm{cov}} = \mathbb{E}\big[\mathbb{E}(\mathcal{S}_1(x)|y)\, \mathbb{E}(\mathcal{S}_1(x)|y)^\top\big] = \mathrm{Cov}\big[\mathbb{E}(\mathcal{S}_1(x)|y)\big]$ given a finite sample $(x_i, y_i)$, $i = 1, \dots, n$. A similar estimator for $\mathcal{V}'_{1,\mathrm{cov}}$ could be derived. In order to estimate $\mathcal{V}_{1,\mathrm{cov}} = \mathrm{Cov}\big[\mathbb{E}(\mathcal{S}_1(x)|y)\big]$, we will use the identity
$$\mathrm{Cov}\big[\mathcal{S}_1(x)\big] = \mathrm{Cov}\big[\mathbb{E}(\mathcal{S}_1(x)|y)\big] + \mathbb{E}\big[\mathrm{Cov}(\mathcal{S}_1(x)|y)\big].$$

We use the natural consistent estimator $\frac{1}{n} \sum_{i=1}^{n} \mathcal{S}_1(x_i) \mathcal{S}_1(x_i)^\top$ of $\mathrm{Cov}\big[\mathcal{S}_1(x)\big]$.

In order to obtain an estimator of $\mathbb{E}\big[\mathrm{Cov}(\mathcal{S}_1(x)|y)\big]$, we consider slicing the real numbers into $H$ different slices $I_1, \dots, I_H$, which are contiguous intervals that form a partition of $\mathbb{R}$ (or of the range of all $y_i$, $i = 1, \dots, n$). We then compute an estimator of the conditional expectation $(\mathcal{S}_1)_h = \mathbb{E}(\mathcal{S}_1(x)|y \in I_h)$ with empirical averages: denoting by $\hat{p}_h$ the empirical proportion of the $y_i$, $i = 1, \dots, n$, that fall in the slice $I_h$ (which is assumed to be strictly positive), we estimate $\mathbb{E}(\mathcal{S}_1(x)|y \in I_h)$ by

$$(\widehat{\mathcal{S}_1})_h = \frac{1}{n \hat{p}_h} \sum_{i=1}^{n} 1_{y_i \in I_h}\, \mathcal{S}_1(x_i).$$


We then estimate $\mathrm{Cov}\big(\mathcal{S}_1(x)|y \in I_h\big)$ by
$$(\widehat{\mathcal{S}_1})_{\mathrm{cov},h} = \frac{1}{n \hat{p}_h - 1} \sum_{i=1}^{n} 1_{y_i \in I_h}\, \big(\mathcal{S}_1(x_i) - (\widehat{\mathcal{S}_1})_h\big)\big(\mathcal{S}_1(x_i) - (\widehat{\mathcal{S}_1})_h\big)^\top.$$

Note that it is important here to normalize the covariance computation by $\frac{1}{n \hat{p}_h - 1}$ (the usual unbiased normalization of the variance) and not $\frac{1}{n \hat{p}_h}$, to allow consistent estimation even when the number of elements per slice is small (e.g., equal to 2).

We finally use the following estimator of $\mathcal{V}_{1,\mathrm{cov}}$:

$$\widehat{\mathcal{V}}_{1,\mathrm{cov}} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{S}_1(x_i) \mathcal{S}_1(x_i)^\top - \sum_{h=1}^{H} \hat{p}_h \cdot (\widehat{\mathcal{S}_1})_{\mathrm{cov},h}.$$

The final SADE algorithm is thus the following:
– Divide the range of $y_1, \dots, y_n$ into $H$ slices $I_1, \dots, I_H$. Let $\hat{p}_h > 0$ be the proportion of the $y_i$, $i = 1, \dots, n$, that fall in slice $I_h$.
– For each slice $I_h$, compute the sample mean $(\widehat{\mathcal{S}_1})_h$ and covariance $(\widehat{\mathcal{S}_1})_{\mathrm{cov},h}$:
$$(\widehat{\mathcal{S}_1})_h = \frac{1}{n \hat{p}_h} \sum_{i=1}^{n} 1_{y_i \in I_h}\, \mathcal{S}_1(x_i) \quad \text{and} \quad (\widehat{\mathcal{S}_1})_{\mathrm{cov},h} = \frac{1}{n \hat{p}_h - 1} \sum_{i=1}^{n} 1_{y_i \in I_h}\, \big(\mathcal{S}_1(x_i) - (\widehat{\mathcal{S}_1})_h\big)\big(\mathcal{S}_1(x_i) - (\widehat{\mathcal{S}_1})_h\big)^\top.$$
– Compute $\widehat{\mathcal{V}}_{1,\mathrm{cov}} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{S}_1(x_i) \mathcal{S}_1(x_i)^\top - \sum_{h=1}^{H} \hat{p}_h \cdot (\widehat{\mathcal{S}_1})_{\mathrm{cov},h}$.
– Find the $k$ largest eigenvalues and let $\hat{w}_1, \dots, \hat{w}_k$ be eigenvectors in $\mathbb{R}^d$ corresponding to these eigenvalues.

The schematic graphical representation of this method is given in Figure 2-1.
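For concreteness, here is a minimal NumPy sketch of the SADE estimator above (an illustration, not the thesis code; all names are hypothetical), assuming the first-order scores $\mathcal{S}_1(x_i)$ are available as an $n \times d$ array.

```python
import numpy as np

def sade(scores, y, H, k):
    """SADE with known first-order scores.

    scores : (n, d) array with rows S1(x_i); y : (n,) responses;
    H : number of slices (equal-count slices on y); k : e.d.r. dimension.
    Returns (V1_hat, W_hat) with W_hat the d x k matrix of top eigenvectors.
    """
    n, d = scores.shape
    # Equal-count slices from the empirical quantiles of y.
    edges = np.quantile(y, np.linspace(0, 1, H + 1))
    slice_id = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, H - 1)

    V1 = scores.T @ scores / n          # estimator of Cov[S1(x)]
    for h in range(H):
        idx = np.flatnonzero(slice_id == h)
        n_h = idx.size
        if n_h < 2:
            continue
        centered = scores[idx] - scores[idx].mean(axis=0)
        cov_h = centered.T @ centered / (n_h - 1)   # within-slice covariance
        V1 -= (n_h / n) * cov_h                     # subtract p_h * (S1)_cov,h
    # k leading eigenvectors of the symmetric matrix V1.
    eigval, eigvec = np.linalg.eigh(V1)
    W_hat = eigvec[:, np.argsort(eigval)[::-1][:k]]
    return V1, W_hat
```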

Choice of slices. There are different ways to choose the slices $I_1, \dots, I_H$:
– all slices have the same length, that is, we choose the maximum and the minimum of $y_1, \dots, y_n$ and divide the range of $y$ into $H$ equal slices (for simplicity, we assume that $n$ is a multiple of $H$);
– we can also use the distribution of $y$ to ensure a balanced distribution of observations in each slice, and choose $I_h = \big(\hat{F}_y^{-1}((h-1)/H), \hat{F}_y^{-1}(h/H)\big]$, where $\hat{F}_y(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{y_i \le t}$ is the empirical distribution function of $y$. If $n$ is a multiple of $H$, there are exactly $c = n/H$ observations per slice.
Later on, in our experiments, we use the second way, where every slice has $c = n/H$ points, and our consistency result applies to this situation as well. Note that there is a variation of SADE which uses standardized data. If we denote the standardized


P(๐‘ฆ โˆˆ ๐ผ1) P(๐‘ฆ โˆˆ ๐ผ๐ป)๐‘1 ๐‘๐ป

I1 I2 IHy

x

E(๐’ฎ1(๐‘ฅ)|๐‘ฆ โˆˆ ๐ผ1) E(๐’ฎ1(๐‘ฅ)|๐‘ฆ โˆˆ ๐ผ๐ป)

(๐‘†1)1 (๐‘†1)๐ป

Figure 2-1 โ€“ Graphical explanation of SADE: firstly we divide ๐‘ฆ into ๐ป slices, let ๐‘โ„Žbe a proportion of ๐‘ฆ๐‘– in slice ๐ผโ„Ž. This is an empirical estimator of P(๐‘ฆ โˆˆ ๐ผโ„Ž). Thenwe evaluate the empirical estimator (๐’ฎ1)โ„Ž of E(๐’ฎ1(๐‘ฅ)|๐‘ฆ โˆˆ ๐ผโ„Ž). Finally, we evaluateweighted covariance matrix and find the ๐‘˜ largest eigenvectors.

data as $\widetilde{\mathcal{S}}_1(x_i) = \Big[\frac{1}{n} \sum_{j=1}^{n} \mathcal{S}_1(x_j) \mathcal{S}_1(x_j)^\top\Big]^{-1/2} \mathcal{S}_1(x_i)$ (Remark 5.3 in [Li, 1991]).

Computational complexity. The first step of the algorithm requires $O(n)$ elementary operations, the second $O(n d^2)$ operations, the third $O(H d^2)$ and the fourth $O(k d^2)$ operations. The overall dependence on the dimension $d$ is quadratic, while the dependence on the number of observations is linear in $n$, as is common in moment-matching methods.

Estimating the number of components. Our estimation method does not depend on $k$, up to the last step where the $k$ largest eigenvectors are selected. A simple heuristic to select $k$, similar to the selection of the number of components in principal component analysis, would select the largest $k$ such that the gap between the $k$-th and $(k+1)$-th eigenvalues is large enough. This could be made more formal using the technique of Li [1991] for sliced inverse regression.

2.3.2 Estimator and algorithm for SPHD

We follow the same approach as for the SADE algorithm above, leading to the following algorithm, which estimates $\mathcal{V}_2 = \mathbb{E}\big([\mathbb{E}(\mathcal{S}_2(x)|y)]^2\big)$ and computes its principal eigenvectors. Note that $\mathbb{E}\big[\mathbb{E}(\mathcal{S}_2(x)|y \in I_h)\big] = \mathbb{E}\big[\mathcal{S}_2(x)\big] = 0$.


– Divide the range of $y_1, \dots, y_n$ into $H$ slices $I_1, \dots, I_H$. Let $\hat{p}_h > 0$ be the proportion of the $y_i$, $i = 1, \dots, n$, that fall in slice $I_h$.
– For each slice, compute the sample mean $(\widehat{\mathcal{S}_2})_h$ of $\mathcal{S}_2(x)$: $(\widehat{\mathcal{S}_2})_h = \frac{1}{n \hat{p}_h} \sum_{i=1}^{n} 1_{y_i \in I_h}\, \mathcal{S}_2(x_i)$.
– Compute the weighted covariance matrix $\widehat{\mathcal{V}}_2 = \sum_{h=1}^{H} \hat{p}_h (\widehat{\mathcal{S}_2})_h^2$, find the $k$ largest eigenvalues and let $\hat{w}_1, \dots, \hat{w}_k$ be the eigenvectors corresponding to these eigenvalues.

The matrix ๐’ฑ2 is then an estimator of E(

[E(๐’ฎ2(๐‘ฅ)|๐‘ฆ)]2

).

Computational complexity. The first step of the algorithm requires $O(n)$ elementary operations, the second $O(n d^2)$ operations, the third $O(H d^3)$ and the fourth $O(k d^2)$ operations. The overall dependence on the dimension $d$ is cubic, hence the method is slower than SADE (but still linear in the number of observations $n$).
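A corresponding NumPy sketch of the SPHD estimator (again an illustration with hypothetical names, not the thesis code), assuming the second-order scores $\mathcal{S}_2(x_i)$ are available as an $n \times d \times d$ array:

```python
import numpy as np

def sphd(scores2, y, H, k):
    """SPHD with known second-order scores.

    scores2 : (n, d, d) array with scores2[i] = S2(x_i); y : (n,) responses.
    Returns (V2_hat, W_hat), where V2_hat = sum_h p_h * (mean_h S2)^2 and
    W_hat holds the k leading eigenvectors of V2_hat.
    """
    n, d, _ = scores2.shape
    edges = np.quantile(y, np.linspace(0, 1, H + 1))
    slice_id = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, H - 1)

    V2 = np.zeros((d, d))
    for h in range(H):
        idx = np.flatnonzero(slice_id == h)
        if idx.size == 0:
            continue
        S2_h = scores2[idx].mean(axis=0)        # within-slice mean of S2
        V2 += (idx.size / n) * (S2_h @ S2_h)    # matrix square, weighted by p_h
    eigval, eigvec = np.linalg.eigh(V2)
    W_hat = eigvec[:, np.argsort(eigval)[::-1][:k]]
    return V2, W_hat
```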

2.3.3 Consistency for the SADE estimator and algorithm

In this section, we prove the consistency of the SADE moment estimator and the resulting algorithm when the score function is known. Following Hsing and Carroll [1992] and Zhu and Ng [1995], we can get $\sqrt{n}$-consistency for the SADE algorithm under very broad assumptions on the problem.

In this section, we focus on the simplest set of assumptions, to pave the way to the analysis with the nuclear norm in future work. The key novelty compared to Hsing and Carroll [1992], Zhu and Ng [1995] is a precise non-asymptotic analysis with explicit constants.

We make the following assumptions:
(L1) The function $m : \mathbb{R} \to \mathbb{R}^d$ such that $\mathbb{E}(\ell(x)|y) = m(y)$ is $L$-Lipschitz-continuous.
(L2) The random variable $y \in \mathbb{R}$ is sub-Gaussian, i.e., such that $\mathbb{E}\, e^{t(y - \mathbb{E} y)} \le e^{\tau_y^2 t^2 / 2}$, for some $\tau_y > 0$.
(L3) The random variables $\ell_j(x) \in \mathbb{R}$ are sub-Gaussian, i.e., such that $\mathbb{E}\, e^{t \ell_j(x)} \le e^{\tau_\ell^2 t^2 / 2}$ for each component $j \in \{1, \dots, d\}$, for some $\tau_\ell > 0$.
(L4) The random variables $\eta_j = \ell_j(x) - m_j(y) \in \mathbb{R}$ are sub-Gaussian, i.e., such that $\mathbb{E}\, e^{t \eta_j} \le e^{\tau_\eta^2 t^2 / 2}$ for each component $j \in \{1, \dots, d\}$, for some $\tau_\eta > 0$.
Now we formulate and prove the main theorem, where $\|\cdot\|_*$ is the nuclear norm, defined as $\|A\|_* = \mathrm{tr}\big(\sqrt{A^\top A}\big)$:

Theorem 2.1. Under Assumptions (L1)–(L4), we get the following bound on $\|\widehat{\mathcal{V}}_{1,\mathrm{cov}} - \mathcal{V}_{1,\mathrm{cov}}\|_*$: for any $\delta < \frac{1}{n}$, with probability not less than $1 - \delta$,
$$\|\widehat{\mathcal{V}}_{1,\mathrm{cov}} - \mathcal{V}_{1,\mathrm{cov}}\|_* \le \frac{d\sqrt{d}\,(195 \tau_\eta^2 + 2 \tau_\ell^2)}{\sqrt{n}} \sqrt{\log \frac{24 d^2}{\delta}} + \frac{8 L^2 \tau_y^2 + 16 \tau_\eta \tau_y L \sqrt{d} + (157 \tau_\eta^2 + 2 \tau_\ell^2)\, d\sqrt{d}}{n} \log^2 \frac{32 d^2 n}{\delta}.$$

The proof of the theorem can be found in Appendix 2.7.2. The leading term is proportional to $\frac{d\sqrt{d}\, \tau_\eta^2}{\sqrt{n}}$, with a tail which is sub-Gaussian. We thus get a $\sqrt{n}$-consistent estimator of $\mathcal{V}_{1,\mathrm{cov}}$. The dependence on $d$ could probably be improved, in particular when using slices of sizes $c$ that tend to infinity (as done in [Lin et al., 2018]).

2.4 Learning score functions

All the previous methods can work only if we know the score function of the first or second order. In practice, we do not have such information, and we have to learn score functions from sampled data. In this section, we only consider the first-order score function $\ell(x) = \mathcal{S}_1(x) = -\nabla \log p(x)$.

We first present the score matching approach of Hyvärinen [2005], and then apply it to our problem.

2.4.1 Score matching to estimate score from data

Given the true score function $\ell(x) = \mathcal{S}_1(x) = -\nabla \log p(x)$ and some i.i.d. data generated from $p(x)$, score matching aims at estimating the parameters of a model $\hat{\ell}(x)$ for the score function, by minimizing an empirical quantity aiming to estimate
$$\mathcal{R}_{\mathrm{score}}(\hat{\ell}) = \frac{1}{2} \int_{\mathbb{R}^d} p(x) \|\hat{\ell}(x) - \ell(x)\|^2\, dx.$$

As is, the quantity above leads to consistent estimation (i.e., pushing $\hat{\ell}$ close to $\ell$), but seems to require the knowledge of the true score $\ell(x)$. A key insight from Hyvärinen [2005] is to use integration by parts to get (assuming the integrals exist):
$$\mathcal{R}_{\mathrm{score}}(\hat{\ell}) = \frac{1}{2} \int_{\mathbb{R}^d} p(x) \big[\|\hat{\ell}(x)\|^2 + \|\ell(x)\|^2 + 2 \hat{\ell}(x)^\top \nabla \log p(x)\big]\, dx$$
$$= \frac{1}{2} \int_{\mathbb{R}^d} p(x) \|\hat{\ell}(x)\|^2\, dx + \frac{1}{2} \int_{\mathbb{R}^d} p(x) \|\ell(x)\|^2\, dx + \int_{\mathbb{R}^d} \hat{\ell}(x)^\top \nabla p(x)\, dx$$
$$= \frac{1}{2} \int_{\mathbb{R}^d} p(x) \|\hat{\ell}(x)\|^2\, dx + \frac{1}{2} \int_{\mathbb{R}^d} p(x) \|\ell(x)\|^2\, dx - \int_{\mathbb{R}^d} (\nabla \cdot \hat{\ell})(x)\, p(x)\, dx,$$
by integration by parts, where $(\nabla \cdot \hat{\ell})(x) = \sum_{i=1}^{d} \frac{\partial \hat{\ell}_i}{\partial x_i}(x)$ is the divergence of $\hat{\ell}$ at $x$ [Arfken, 1985].

The first part of the last right-hand side does not depend on $\hat{\ell}$, while the two other parts are expectations under $p(x)$ of quantities that only depend on $\hat{\ell}$. Thus it can


be well approximated, up to a constant, by
$$\widehat{\mathcal{R}}_{\mathrm{score}}(\hat{\ell}) = \frac{1}{2n} \sum_{i=1}^{n} \|\hat{\ell}(x_i)\|^2 - \frac{1}{n} \sum_{i=1}^{n} (\nabla \cdot \hat{\ell})(x_i).$$

Parametric assumption. If we assume that the score is a linear combination of finitely many basis functions, we will get a consistent estimator of these parameters. That is, we make the following assumption:

(A4) The score function $\ell(x)$ is a linear combination of known basis functions $\psi_j(x)$, $j = 1, \dots, m$, where $\psi_j : \mathbb{R}^d \to \mathbb{R}^d$, that is, $\ell(x) = \sum_{j=1}^{m} \psi_j(x)\, \theta_j^*$ for some $\theta^* \in \mathbb{R}^m$. We assume that the score function and its derivatives are square-integrable with respect to $p(x)$.

In this chapter, we consider for simplicity a parametric assumption for the score. In order to go towards non-parametric estimation, we would need to let the number $m$ of basis functions grow with $n$ (exponentially with no added assumptions), and this is an interesting avenue for future work. In our simulations in Section 2.5, we consider a simple set of basis functions which are localized functions around observations; these can approximate reasonably well in practice most densities and led to good estimation of the e.d.r. subspace. Moreover, if we have the additional knowledge that the components of $x$ are statistically independent (potentially after linearly transforming them using independent component analysis [Hyvärinen et al., 2004]), we can use separable functions for the scores.

We introduce the notation
$$\Psi(x) = \begin{pmatrix} \psi_1^1(x) & \cdots & \psi_d^1(x) \\ \vdots & \ddots & \vdots \\ \psi_1^m(x) & \cdots & \psi_d^m(x) \end{pmatrix} \in \mathbb{R}^{m \times d},$$
so that the score function $\ell$ we wish to estimate has the form $\ell(x) = \Psi(x)^\top \theta \in \mathbb{R}^d$. We also introduce the notation
$$(\nabla \cdot \Psi)(x) = \begin{pmatrix} (\nabla \cdot \Psi^1)(x) \\ \vdots \\ (\nabla \cdot \Psi^m)(x) \end{pmatrix} \in \mathbb{R}^m,$$
so that $(\nabla \cdot \ell)(x) = \theta^\top (\nabla \cdot \Psi)(x)$.

The empirical score matching objective may then be written as
$$\widehat{\mathcal{R}}_{\mathrm{score}}(\theta) = \frac{1}{2} \theta^\top \Big(\frac{1}{n} \sum_{i=1}^{n} \Psi(x_i) \Psi(x_i)^\top\Big) \theta - \theta^\top \Big(\frac{1}{n} \sum_{i=1}^{n} (\nabla \cdot \Psi)(x_i)\Big), \qquad (2.4.1)$$
which is a quadratic function of $\theta$ and can thus be minimized by solving a linear system in running time $O(m^3 + m^2 d n)$ (to form the matrix and to solve the system).
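As an illustration of this step (hypothetical code, not from the thesis), the minimizer of (2.4.1) solves the linear system $\big(\frac{1}{n}\sum_i \Psi(x_i)\Psi(x_i)^\top\big)\theta = \frac{1}{n}\sum_i (\nabla\cdot\Psi)(x_i)$:

```python
import numpy as np

def score_matching_theta(Psi, divPsi):
    """Parametric score matching: minimize (2.4.1) over theta.

    Psi    : (n, m, d) array, Psi[i] = Psi(x_i);
    divPsi : (n, m) array, divPsi[i] = (div . Psi)(x_i).
    Returns theta_hat in R^m solving G theta = b with
    G = (1/n) sum_i Psi(x_i) Psi(x_i)^T and b = (1/n) sum_i (div . Psi)(x_i).
    """
    n = Psi.shape[0]
    G = np.einsum("nmd,nkd->mk", Psi, Psi) / n   # (m, m) Gram matrix
    b = divPsi.mean(axis=0)                      # (m,)
    # Least-squares solve is a little safer than a plain solve if G is singular.
    theta_hat, *_ = np.linalg.lstsq(G, b, rcond=None)
    return theta_hat
```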


Given standard results regarding the convergence of $\frac{1}{n} \sum_{i=1}^{n} \Psi(x_i) \Psi(x_i)^\top \in \mathbb{R}^{m \times m}$ to its expectation and of $\frac{1}{n} \sum_{i=1}^{n} (\nabla \cdot \Psi)(x_i) \in \mathbb{R}^m$ to its expectation, we get a $\sqrt{n}$-consistent estimate of $\theta^*$ under simple assumptions (see Theorem 2.2).

2.4.2 Score matching for sliced inverse regression: two-step approach

We can now combine our linear parametrization of the score with the SIR approach outlined in Section 2.3. The true conditional expectation is

E(โ„“(๐‘ฅ)|๐‘ฆ

)=

๐‘šโˆ‘๐‘—=1

E(๐œ“๐‘—(๐‘ฅ)|๐‘ฆ)๐œƒ*๐‘— ,

and belongs to the e.d.r. subspace. We consider ๐ป different slices ๐ผ1, . . . , ๐ผ๐ป , and thefollowing estimator, which simply replaces the true score by the estimated score (i.e.,๐œƒ* by ๐œƒ), and highlights the dependence in ๐œƒ.

The estimator ๐’ฑ1,cov can be rewritten as

๐’ฑ1,cov =๐‘›โˆ‘๐‘–=1

๐‘›โˆ‘๐‘—=1

๐›ผ๐‘–,๐‘—๐‘›(|๐ผโ„Ž(๐‘–, ๐‘—)| โˆ’ 1)

โ„“(๐‘ฅ๐‘–)โ„“(๐‘ฅ๐‘—)โŠค,

where

๐›ผ๐‘–,๐‘— =

1 if ๐‘– = ๐‘— in the same slice0 otherwise

Using the linearity $\ell(x) = \Psi(x)^\top \theta$:
$$\widehat{\mathcal{V}}_{1,\mathrm{cov}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\alpha_{i,j}}{|I_{h(i,j)}| - 1}\, \Psi(x_i)^\top \theta\, \theta^\top \Psi(x_j). \qquad (2.4.2)$$

In the two-step approach, we first solve the score matching optimization problem to obtain an estimate of the optimal parameters $\theta^*$ and then use it to get the $k$ largest eigenvectors of the covariance matrix $\widehat{\mathcal{V}}_{1,\mathrm{cov}}$. This approach works well in low dimensions, when the score function can be well approximated; otherwise, we may suffer from the curse of dimensionality: if we want a good approximation of the score function, we need an exponential number of basis functions. In Section 2.4.3, we consider a direct approach aiming at improving robustness.
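The two-step procedure itself is short; a sketch (hypothetical names, reusing the `score_matching_theta` and `sade` sketches above, so it is an illustration rather than the thesis implementation) could look as follows.

```python
import numpy as np

def two_step_sade(X, y, Psi_fn, divPsi_fn, H, k):
    """Two-step approach: (1) score matching, (2) SADE with the learned score.

    Psi_fn(x) returns the (m, d) matrix Psi(x); divPsi_fn(x) the (m,) vector
    of divergences. Relies on score_matching_theta() and sade() defined above.
    """
    Psi = np.stack([Psi_fn(x) for x in X])          # (n, m, d)
    divPsi = np.stack([divPsi_fn(x) for x in X])    # (n, m)
    theta_hat = score_matching_theta(Psi, divPsi)   # step 1: learn the score
    scores = np.einsum("nmd,m->nd", Psi, theta_hat) # estimated S1(x_i) = Psi(x_i)^T theta
    _, W_hat = sade(scores, y, H, k)                # step 2: SADE on learned scores
    return W_hat, theta_hat
```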

Consistency. Let us also provide a consistency result in the case of an unknown score function, under Assumption (A4), for the two-step algorithm. We make the following additional assumptions:
(M1) The random variables $\Psi_i^j(x) \in \mathbb{R}$ are sub-Gaussian, i.e., such that $\mathbb{E}\, e^{t(\Psi_i^j(x) - \mathbb{E} \Psi_i^j(x))} \le e^{\tau_\Psi^2 t^2 / 2}$, for some $\tau_\Psi > 0$.
(M2) The random variables $(\nabla \cdot \Psi^i)(x)$ are sub-Gaussian, i.e., such that $\mathbb{E}\, e^{t((\nabla \cdot \Psi^i)(x) - \mathbb{E} (\nabla \cdot \Psi^i)(x))} \le e^{\tau_{\nabla\Psi}^2 t^2 / 2}$, for some $\tau_{\nabla\Psi} > 0$.


(M3) The matrix $\mathbb{E}\big[\Psi(x) \Psi(x)^\top\big]$ is not degenerate, and we let $\lambda_0$ denote its minimal eigenvalue.

Theorem 2.2. Let $\hat{\theta}$ be the estimate of $\theta$ obtained in the first step of the algorithm. Under Assumptions (A1), (A2), (A4), (L1)–(L4) and (M1), (M2), (M3), for $\delta \le 1/n$ and $n$ large enough, that is, $n \ge c_1 + c_2 \log \frac{1}{\delta}$ for some positive constants $c_1$ and $c_2$ not depending on $n$ and $\delta$:
$$\|\widehat{\mathcal{V}}_{1,\mathrm{cov}}(\theta^*) - \widehat{\mathcal{V}}_{1,\mathrm{cov}}(\hat{\theta})\|_* \le \frac{1}{\sqrt{n}} \sqrt{2 \log \frac{48 m^2 d}{\delta}} \times \Big[d\sqrt{d}\,(195 \tau_\eta^2 + 2 \tau_\ell^2) + \frac{9m}{2}\, \mathbb{E}\|\Psi(x)\|_2^2 \cdot \|\theta^*\| \Big(\frac{2 \tau_{\nabla\Psi} \sqrt{m}}{\lambda_0} + \frac{4 \|b\|\, \tau_\Psi^2 \sqrt{m^3 d}}{\lambda_0}\Big)\Big] + O\Big(\frac{1}{n}\Big),$$
with probability not less than $1 - \delta$.

Now, let us formulate and prove a non-asymptotic result relating the true and the estimated e.d.r. spaces. We need to define a notion of distance between subspaces spanned by two sets of vectors. We use the square trace error $R^2(w, \hat{w})$ [Hooper, 1959]:
$$R^2(w, \hat{w}) = 1 - \frac{1}{k} \mathrm{tr}\big[w^\top \hat{w}\, \hat{w}^\top w\big], \qquad (2.4.3)$$
where $w \in \mathbb{R}^{d \times k}$ and $\hat{w} \in \mathbb{R}^{d \times k}$ both have orthonormal columns. It is always between zero and one, and equal to zero if and only if $w$ and $\hat{w}$ span the same subspace. This distance is closely related to the notion of principal angles:
$$\sin \Theta(\hat{w}^\top w) = \mathrm{diag}\big(\sin(\cos^{-1} \sigma_1), \dots, \sin(\cos^{-1} \sigma_k)\big),$$
where $\sigma_1, \dots, \sigma_k$ are the singular values of the matrix $\hat{w}^\top w$. In fact, $R(w, \hat{w}) \cdot \sqrt{k} = \|\sin \Theta(\hat{w}^\top w)\|_F$.

We use the Davis–Kahan "sin θ" theorem [Stewart and Sun, 1990, Theorem V.3.6] in the following form [Yu et al., 2015, Theorem 2]:

Theorem 2.3. Let $\Sigma$ and $\widehat{\Sigma} \in \mathbb{R}^{d \times d}$ be symmetric, with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$ and $\hat{\lambda}_1 \ge \dots \ge \hat{\lambda}_d$ respectively. Fix $1 \le r \le s \le d$, let $k = s - r + 1$, and let $U = (u_r, u_{r+1}, \dots, u_s) \in \mathbb{R}^{d \times k}$ and $\widehat{U} = (\hat{u}_r, \hat{u}_{r+1}, \dots, \hat{u}_s) \in \mathbb{R}^{d \times k}$ have orthonormal columns satisfying $\Sigma u_j = \lambda_j u_j$ and $\widehat{\Sigma} \hat{u}_j = \hat{\lambda}_j \hat{u}_j$ for $j = r, \dots, s$. Let $\delta = \inf\{|\hat{\lambda} - \lambda| : \lambda \in [\lambda_s, \lambda_r],\ \hat{\lambda} \in (-\infty, \hat{\lambda}_{s+1}] \cup [\hat{\lambda}_{r-1}, +\infty)\}$, where $\hat{\lambda}_0 = +\infty$ and $\hat{\lambda}_{d+1} = -\infty$. Assume that $\delta > 0$; then
$$R(U, \widehat{U}) \le \frac{\|\widehat{\Sigma} - \Sigma\|_F}{\delta \sqrt{k}}. \qquad (2.4.4)$$

Using Corollary 4.1 of Vu and Lei [2013] and its discussion, we can derive that
$$R(w, \hat{w}) \le \frac{\|\mathcal{V}_1(\theta^*) - \mathcal{V}_1(\hat{\theta})\|_F}{|\lambda_k - \hat{\lambda}_{k+1}|\, \sqrt{k}},$$
where $w$ and $\hat{w}$ span the true and the estimated e.d.r. spaces respectively. Using Weyl's inequality [Stewart and Sun, 1990] (if $\|\mathcal{V}_1(\theta^*) - \mathcal{V}_1(\hat{\theta})\|_2 < \varepsilon$ then $|\hat{\lambda}_i - \lambda_i| < \varepsilon$ for all $i = 1, \dots, d$) and taking $\varepsilon = (\lambda_k - \lambda_{k+1})/2 = \lambda_k/2$, we get
$$R(w, \hat{w}) \le \frac{2 \|\mathcal{V}_1(\theta^*) - \mathcal{V}_1(\hat{\theta})\|_*}{\lambda_k \sqrt{k}}, \quad \text{if } \|\mathcal{V}_1(\theta^*) - \mathcal{V}_1(\hat{\theta})\|_F < \lambda_k / 2,$$
because $\|\cdot\|_F \le \|\cdot\|_*$. We can now formulate our main theorem for the analysis of the SADE algorithm:

Theorem 2.4. Consider Assumptions (A1), (A2), (A4), (L1)–(L4), (M1), (M2), (M3) and:
(K1) The matrix $\mathcal{V}_1$ has rank $k$: $\lambda_k > 0$.
For $\delta \le 1/n$ and $n$ large enough, that is, $n \ge c_1 + c_2 \log \frac{1}{\delta}$ for some positive constants $c_1$ and $c_2$ not depending on $n$ and $\delta$:

$$R(\hat{w}, w) \le \frac{2}{\sqrt{n}\, \lambda_k \sqrt{k}} \sqrt{2 \log \frac{48 m^2 d}{\delta}} \times \Big[d\sqrt{d}\,(195 \tau_\eta^2 + 2 \tau_\ell^2) + \frac{9m}{2}\, \mathbb{E}\|\Psi(x)\|_2^2 \cdot \|\theta^*\| \Big(\frac{2 \tau_{\nabla\Psi} \sqrt{m}}{\lambda_0} + \frac{4 \|b\|\, \tau_\Psi^2 \sqrt{m^3 d}}{\lambda_0}\Big)\Big] + O\Big(\frac{1}{n}\Big),$$
with probability not less than $1 - \delta$.

We thus obtain a convergence rate in $1/\sqrt{n}$, with an explicit dependence on all constants of the problem. The dependence on the dimension $d$ and the number of basis functions $m$ is of the order (forgetting logarithmic terms) $O\big(\frac{d^{3/2}}{n^{1/2}} + \frac{m^{5/2}}{n^{1/2}} \|\theta^*\|\big)$. For $k > 1$, our dependence on the sample size $n$ is improved compared to existing work such as [Dalalyan et al., 2008] (while it matches the dependence for $k = 1$ of [Hristache et al., 2001]), but this comes at the expense of assuming that the score functions satisfy a parametric model.

For a fixed number of basis functions $m$, we thus escape the curse of dimensionality, as we get a polynomial dependence on $d$ (which we believe could be improved). However, when no assumptions are made on the score functions, the number $m$ and potentially the norm $\|\theta^*\|$ have to grow with the number of observations $n$. We are indeed faced with a traditional non-parametric estimation problem, and the number $m$ will need to grow as $n$ grows, depending on the smoothness assumptions we are willing to make on the score function, with an effect on $\theta^*$ (and probably $\lambda_0$, which we neglect in the discussion below). While a precise analysis is out of the scope of this chapter and left for future work, we can make an informal argument as follows: in order to approximate the score with precision $\varepsilon$ with a set of basis functions where $\|\theta^*\|$ is bounded, we need $m(\varepsilon)$ basis functions. Thus, we end up with two sources of error, an approximation error $\varepsilon$ and an estimation error of order $O(m(\varepsilon)^{5/2}/n^{1/2})$. Typically, if we assume that the score has a number of bounded derivatives proportional to the dimension, or if we assume that the input variables are independent and we can thus estimate the score independently for each dimension, $m(\varepsilon)$ is of the order $1/\varepsilon^{1/r}$, where $r$ is independent of the dimension, leading to an overall rate which is independent of the dimension.


2.4.3 Score matching for SIR: direct approach

We can also combine these two steps to try to avoid the "curse of dimensionality". Our estimation of the score, i.e., of the parameter $\theta$, is done only to be used within the SIR approach, where we expect the matrix $\mathcal{V}_{1,\mathrm{cov}}$ to have rank $k$. Thus, when estimating $\theta$ by minimizing $\widehat{\mathcal{R}}_{\mathrm{score}}(\theta)$, we may add a regularization that penalizes large ranks of $\widehat{\mathcal{V}}_{1,\mathrm{cov}}(\theta) = \frac{1}{n} \sum_{i,j=1}^{n} \frac{\alpha_{i,j}}{|I_{h(i,j)}| - 1} \Psi(x_i)^\top \theta\, \theta^\top \Psi(x_j)$, where we highlight the dependence on $\theta \in \mathbb{R}^m$. By enforcing the low-rank constraint, our aim is to circumvent a potentially poor estimation of the score function, which could be enough for the task of estimating the e.d.r. space (we see a better behavior in our simulations in Section 2.5).

Introduce the matrix $\mathcal{L}(\theta) = \big(\Psi(x_1)^\top \theta, \dots, \Psi(x_n)^\top \theta\big) \in \mathbb{R}^{d \times n}$ and $A \in \mathbb{R}^{n \times n}$ with $A_{i,j} = \frac{\alpha_{i,j}}{n \cdot (|I_{h(i,j)}| - 1)}$, and let $\mathcal{A}(\theta) = \mathcal{L}(\theta) \cdot A^{1/2} \in \mathbb{R}^{d \times n}$.

We may then penalize the nuclear norm of $\mathcal{A}(\theta)$, or potentially consider norms that take into account that we look for a rank $k$ (e.g., the $k$-support norm on the spectrum of $\mathcal{A}(\theta)$ [McDonald et al., 2014]). We have
$$\|\mathcal{A}(\theta)\|_* = \mathrm{tr}\big(\mathcal{A}(\theta) \mathcal{A}(\theta)^\top\big)^{1/2} = \mathrm{tr}\big[\widehat{\mathcal{V}}_{1,\mathrm{cov}}(\theta)^{1/2}\big].$$

Combining the two terms, we obtain a convex optimization task:
$$\widehat{\mathcal{R}}(\theta) = \widehat{\mathcal{R}}_{\mathrm{score}}(\theta) + \lambda \cdot \mathrm{tr}\big[\widehat{\mathcal{V}}_{1,\mathrm{cov}}(\theta)^{1/2}\big]. \qquad (2.4.5)$$

Efficient algorithm. Following Argyriou et al. [2008], we consider reweighted least-squares algorithms. The trace norm admits the variational form
$$\|W\|_* = \frac{1}{2} \inf_{D \succ 0} \mathrm{tr}\big(W^\top D^{-1} W + D\big).$$

The optimization problem (2.4.5) can be reformulated in the following way:
$$\theta \leftarrow \arg\min_{\theta}\; \widehat{\mathcal{R}}_{\mathrm{score}}(\theta) + \frac{\lambda}{2} \mathrm{tr}\big(\mathcal{A}(\theta)^\top D^{-1} \mathcal{A}(\theta)\big) \quad \text{and} \quad D \leftarrow \big(\mathcal{A}(\theta) \mathcal{A}(\theta)^\top + \varepsilon I_d\big)^{1/2}.$$

Note that the objective function is a quadratic function of $\theta$. Decompose the matrix $\mathcal{A}$ in the form
$$\mathcal{A}(\theta) = \sum_{k=1}^{m} \theta_k \mathcal{A}_k, \quad \text{where } \mathcal{A}_k \in \mathbb{R}^{d \times n}.$$

Rearrange the regularizer term:

tr(𝒜(θ)^⊤ D^{−1} 𝒜(θ)) = ∑_{k=1}^{m} ∑_{l=1}^{m} tr[θ_k 𝒜_k^⊤ D^{−1} 𝒜_l θ_l] = θ^⊤ 𝒴 θ,   where 𝒴_{k,l} = tr[𝒜_k^⊤ D^{−1} 𝒜_l].

Introduce the notation

𝒜 = (vect(𝒜₁), . . . , vect(𝒜_m)) ∈ R^{dn×m},   ℬ = (vect(D^{−1}𝒜₁), . . . , vect(D^{−1}𝒜_m)) ∈ R^{dn×m},

then 𝒴 = 𝒜^⊤ ℬ.

We can now estimate the complexity of this algorithm. Firstly, we need to evaluate the quadratic form in ℛ_score(θ): it is a summation of n multiplications of m × d and d × m matrices, so the complexity of this step is O(nm²d). Secondly, we need to evaluate the matrices D^{−1}, 𝒜 and ℬ, using O(d³), O(dHm) and O(md²H) operations respectively. Next, we need to evaluate the matrix 𝒴, using O(dnm²) operations. Finally, evaluating 𝒜(θ) requires O(dnm) operations, 𝒜(θ)𝒜(θ)^⊤ requires O(d²n) operations and the evaluation of D requires O(d³) operations. Combining complexities, we get O(md²n + d³ + nm²d) operations, which is still linear in n.
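The alternating scheme above can be written compactly. Below is a minimal sketch in Python/NumPy, assuming the quadratic score-matching objective is available as ℛ_score(θ) = ½θ^⊤Qθ + q^⊤θ and that the matrices 𝒜_k have been precomputed; the names Q, q, A_list and reweighted_least_squares are ours and not from the chapter.

```python
import numpy as np

def reweighted_least_squares(Q, q, A_list, lam, eps=1e-6, n_iter=50):
    """Alternate between the theta-step and the D-step of the reweighted
    least-squares scheme for R(theta) = 1/2 theta'Q theta + q'theta
    + lam * ||A(theta)||_*, with A(theta) = sum_k theta_k A_k."""
    m, d = len(A_list), A_list[0].shape[0]
    theta, D = np.zeros(m), np.eye(d)
    for _ in range(n_iter):
        Dinv = np.linalg.inv(D)
        B = [Dinv @ Ak for Ak in A_list]
        # Y_{k,l} = tr(A_k' D^{-1} A_l): quadratic form of the reweighted penalty
        Y = np.array([[np.sum(A_list[k] * B[l]) for l in range(m)] for k in range(m)])
        # theta-step: minimize a quadratic, i.e. solve (Q + lam Y) theta = -q
        theta = np.linalg.solve(Q + lam * Y, -q)
        # D-step: D = (A(theta) A(theta)' + eps I_d)^{1/2} via eigendecomposition
        A_theta = sum(t * Ak for t, Ak in zip(theta, A_list))
        w, V = np.linalg.eigh(A_theta @ A_theta.T + eps * np.eye(d))
        D = (V * np.sqrt(np.maximum(w, 0.0))) @ V.T
    return theta
```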

2.5 Experiments

In this section we provide numerical experiments for SADE, PHD+ and SPHD on different functions f. We denote the true and estimated e.d.r. subspaces as ℰ and ℰ̂ respectively, defined from w and ŵ.

2.5.1 Known score functions

Consider a Gaussian mixture model with 2 components in R^d:

p(x) = ∑_{i=1}^{2} θ_i · 1/((2π)^{d/2} |Σ_i|^{1/2}) · exp(−½ (X − μ_i)^⊤ Σ_i^{−1} (X − μ_i)),   (2.5.1)

where θ = (6/10, 4/10), μ₁ = (−1, . . . , −1) ∈ R^d, μ₂ = (1, . . . , 1) ∈ R^d, Σ₁ = I_d, Σ₂ = 2 · I_d.

Contour lines of this distribution, when ๐‘‘ = 2 are shown in Figure 2-2.

The error ε has a standard normal distribution. To estimate the effectiveness of an estimated e.d.r. subspace, we use the square trace error R²(w, ŵ):

R²(w, ŵ) = 1 − (1/k) tr[P · P̂],

where w and ŵ are the real and the estimated e.d.r. vectors respectively and P and P̂ are the projectors corresponding to these matrices. Note that (2.4.3) is the special case of this formula with orthonormal matrices.
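As a reference for this metric, here is a small sketch (our own helper, not from the thesis) computing the square trace error from two bases of the true and estimated e.d.r. subspaces.

```python
import numpy as np

def square_trace_error(W_true, W_est):
    """Square trace error R^2 between two k-dimensional subspaces of R^d,
    given (d, k) matrices whose columns span them."""
    def projector(W):
        Q, _ = np.linalg.qr(W)     # orthonormal basis of the column space
        return Q @ Q.T             # orthogonal projector onto span(W)
    k = W_true.shape[1]
    return 1.0 - np.trace(projector(W_true) @ projector(W_est)) / k
```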

Figure 2-2 – Contour lines of the Gaussian mixture pdf in the 2D case.

Figure 2-3 – Mean and standard deviation of R²(ℰ, ℰ̂) divided by 10 for the rational function (2.5.2).

To show the dominance of SADE over SIR (which should only work for elliptically symmetric distributions), we consider the rational multi-index model of the form

d = 10;   y = x₁ / (1/2 + (x₂ + 2)²) + σ · ε.   (2.5.2)

Here the real e.d.r. space is generated by 2 specific vectors: (1, 0, 0, . . . , 0) and (0, 1, 0, . . . , 0). The number of slices is H = 10, we consider numbers of observations n from 2¹⁰ to 2¹⁶ equally spaced in logarithmic scale, and we conduct 100 replicates to obtain means and standard deviations of the logarithms of square trace errors divided by √100 = 10 (to assess significance of differences), as shown in Figure 2-3. Even in this simple model, the ordinary SIR algorithm does not work properly, because the distribution of the inputs x has no elliptical symmetry. When n → ∞, the squared trace error tends to some nonzero constant depending on the properties of the density function, whereas SADE shows good performance with slope −1 (corresponding to a √n-consistent estimator).

Now, we compare the moment methods SADE, PHD+ and SPHD. Although the goal of this chapter is to compare moment-matching techniques, we also compare them with the state-of-the-art MAVE method [Xia et al., 2002b, Wang and Xia, 2007, 2008]. It is worth noting the following properties, which we have already discussed concerning these methods:

1. Sliced methods have a wider application area than unsliced ones: SADE isstronger than ADE and SPHD is stronger than PHD+.

2. SADE can not recover the entire e.d.r. space in several cases, for example, aclassification task or symmetric cases. For those cases, we should use second-order methods (i.e., methods based on ๐’ฎ2).

3. MAVE works better, but the goal of this chapter is to compare moment-matching techniques. In Figure 2-12, we provide examples where MAVE suffersfrom the curse of dimensionality and performs worse than moment-matchingmethods.


We conduct 3 experiments, where H = 10, d = 10, the error term ε has a standard normal distribution, the numbers of observations n range from 2¹⁰ to 2¹⁶ equally spaced in logarithmic scale, and we made 10 replicates to evaluate sample means and variations:

— Rational model of the form

y = ∑_{i=1}^{k} tanh(8x_i − 16) · i + tanh(8x_i + 16) · (k + 1 − i) + ε/4,   (2.5.3)

where the effective reduction subspace dimension is k = 2. Results are shown in Figure 2-4: the first-order method SADE works better than the second-order SPHD and PHD+, and SPHD works better than PHD+, that is, slicing makes the method more robust.

— Classification problem of the form

y = 1_{x₁² + 2x₂² > 4} + ε/4.   (2.5.4)

We can see that the error of SADE is close to 0.5 (Figure 2-5). This means that the method finds only one direction in the e.d.r. space. Moreover, SPHD gave better results than PHD+.

— Quadratic model of the form

d = 10;   y = x₁(x₁ + x₂ − 3) + ε.   (2.5.5)

Both SADE and SPHD show good performance (Figure 2-6), while PHD+ cannot recover the desired projection due to the linearity of the function g.

Figure 2-4 – Mean and standard deviation of R²(ℰ, ℰ̂) divided by √10 for the rational function (2.5.3) with σ = 1/4.

Figure 2-5 – Mean and standard deviation of R²(ℰ, ℰ̂) divided by √10 for the classification problem (2.5.4).

Figure 2-6 – Mean and standard deviation of R²(ℰ, ℰ̂) divided by √10 for the quadratic model (2.5.5) with σ = 1/4.

2.5.2 Unknown score functions

Now, we conduct numerical experiments for SADE with unknown score functions. We consider a “toy” experiment and first examine the results of the 2-step algorithm. We consider again a Gaussian mixture model (2.5.1) with 2 components in R², where θ = (0.6, 0.4), μ₁ = (−0.5, 0.5), μ₂ = (0.5, 0.5), Σ₁ = 0.3·I₂, Σ₂ = 0.4·I₂, with y = sin(x₁ + x₂) + ε, where the error ε has a standard normal distribution. We choose these parameters of the Gaussian mixture to make sure that the probability of the vector X being in the square [−2, 2]² is close to 1.

We chose 100 Gaussian kernels as basis functions:

ψ_{i,j}(x) = ∇ exp(−‖X − X_{i,j}‖² / (2h²)),   1 ≤ i, j ≤ 10,

where the X_{i,j} form a uniform grid on the square [−2, 2]² (note that this does not imply any notion of Gaussianity for the underlying density, and the Gaussian kernel here could be replaced by any differentiable local function).

In practice, Ψ in (2.4.1) is close to being degenerate, and we use a regularized estimator θ* = −(Ψ + αI)^{−1} · Φ instead. We choose α = 0.01, σ = 1 and conduct 100 replicates with numbers of observations n from 2¹⁰ to 2¹⁶ equally spaced in logarithmic scale. The results of the experiments are presented in Figure 2-7: the square trace error tends to zero as n increases.
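For concreteness, the regularized solve above amounts to a single ridge-type linear system; a minimal sketch (our own naming, assuming the empirical quantities from (2.4.1) are available as arrays) is:

```python
import numpy as np

def regularized_score_coefficients(Psi_emp, Phi_emp, alpha=0.01):
    """Regularized estimator theta = -(Psi + alpha I)^{-1} Phi, where
    Psi_emp is the (m, m) empirical matrix and Phi_emp the (m,) empirical
    vector appearing in the score-matching objective (2.4.1)."""
    m = Psi_emp.shape[0]
    return -np.linalg.solve(Psi_emp + alpha * np.eye(m), Phi_emp)
```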

In high dimensions, we cannot use a uniform grid because of the curse of dimensionality. Instead, we use n Gaussian kernels centered at the sample points X_i. Note that we can then only recover an approximate score function, but our goal is the estimation of the e.d.r. subspace.

We use our reweighted least-squares algorithm to solve the convex problem (2.4.5).We conduct several experiments on functions from the previous section.


Figure 2-7 – Square trace error R²(ℰ, ℰ̂) for different sample sizes.

Figure 2-8 – Quadratic model, d = 10, n = 1000.

Figure 2-9 – Rational model, d = 10, n = 1000.

Figure 2-10 – Rational model, d = 20, n = 2000.

We plot the performance as a function of the kernel bandwidth h to assess the robustness of the methods. In Figure 2-8 we provide the relationship between the square trace error and h for the quadratic model (2.5.5) for d = 10 and n = 1000. In Figures 2-9, 2-10 and 2-11, we provide the relationship between the square trace error and h for the rational model (2.5.2) for d = 10, n = 1000; d = 20, n = 2000; and d = 50, n = 5000. We see that for large d, the one-step algorithm is more robust than the two-step algorithm. Moreover, the experiments show that even with weakly estimated score functions, correct estimation of the e.d.r. space can be performed.

Comparison with MAVE. We consider the rational model (2.5.3) with n = 10000, k = 10 and the dimension d of the data in the range [20, 150]. The probability density p(x) has independent components, each of which is a mixture of 2 Gaussians with weights (6/10, 4/10), means (−1, 1) and standard deviations (1, 2).


Figure 2-11 – Rational model, d = 50, n = 5000.

Figure 2-12 – Increasing dimensions, with the rational model (2.5.3), n = 10000, k = 10, d = 20 to 150.

The results for MAVE, SADE with known score and SADE with unknown score are shown in Figure 2-12. We can see that both SADE with known and with unknown scores lose in the case of low data dimension, but are more resistant to the curse of dimensionality (as expected, since MAVE relies on non-parametric estimation in a space of dimension k, which is here larger). Moreover, the complexity of our moment-matching technique is linear in the number of observations n, while MAVE is superlinear.

2.6 Conclusion

In this chapter we considered a general non-linear regression model under the assumption that the response depends on an unknown k-dimensional subspace. Our goal was direct estimation of this unknown k-dimensional space, which is often called the effective dimension reduction or e.d.r. space. We proposed new approaches (SADE and SPHD), combining two existing techniques (sliced inverse regression (SIR) and score function-based estimation). We obtained consistent estimation for k > 1 using only the first-order score and proposed explicit approaches to learn the score from data.

It would be interesting to extend our sliced approaches to learning neural networks: indeed, our work focused on the subspace spanned by w and cannot identify individual columns, while Janzamin et al. [2015] showed that by using proper tensor decomposition algorithms and second-order score functions, the columns of w can be consistently estimated (in polynomial time).


2.7 Appendix. Proofs

In this appendix, we provide proofs which were omitted in the chapter.

2.7.1 Probabilistic lemma

Let us first formulate and prove an auxiliary lemma about tail inequalities.

Lemma 2.5. Let X be a non-negative random variable such that, for some positive constants A and B and all p,

E X^p ≤ (A√p + B n^{1/p} p²)^p.

Then, for t ≥ (log n)/2,

P(X ≥ 3A√t + 3B n^{1/t} t²) ≤ 3e^{−t}.

Proof. By Markov’s inequality, for every non-negative integer p:

P(X ≥ 3A√p + 3B n^{1/p} p²) = P(X^p ≥ [3A√p + 3B n^{1/p} p²]^p) ≤ [A√p + B n^{1/p} p²]^p / [3A√p + 3B n^{1/p} p²]^p ≤ e^{−p}.

Consider any t ≥ (log n)/2 and p = ⌊t⌋; then

P(X ≥ 3A√t + 3B n^{1/t} t²) ≤ P(X ≥ 3A√p + 3B n^{1/p} p²) ≤ e^{−p} ≤ 3e^{−t},

because the function f(t) = 3A√t + 3B n^{1/t} t² increases on [(log n)/2; +∞).

2.7.2 Proof of Theorem 2.1

Proof. Let (y_(i), x_(i)), i = 1, . . . , n, be the ordered data set, where y_(1) ≤ y_(2) ≤ · · · ≤ y_(n). The x_(i) are called the concomitants of order statistics by Yang [1977]. Introduce double subscripts: ℓ_(h,1) = ℓ(x_(2h−1)), ℓ_(h,2) = ℓ(x_(2h)), y_(h,1) = y_(2h−1) and y_(h,2) = y_(2h).

Introduce first the alternative matrix (under the weak assumption of its existence)

𝒱_{1,E} = E[Cov(𝒮₁(x) | y)].

We have the following estimator for this matrix, for p = 2 and H = n/p = n/2:

𝒱̂_{1,E} = (1/n) ∑_{h=1}^{H} (ℓ_(h,1) − ℓ_(h,2))(ℓ_(h,1) − ℓ_(h,2))^⊤.

Firstly, we estimate the norm of 𝒱̂_{1,E} − 𝒱_{1,E} and afterwards the norm of 𝒱̂_{1,cov} − 𝒱_{1,cov}.


Thus, the deviation from the population version may be split into four terms as follows:

𝒱̂_{1,E} − 𝒱_{1,E} = (1/n) ∑_{h=1}^{H} (ℓ_(h,1) − ℓ_(h,2))(ℓ_(h,1) − ℓ_(h,2))^⊤ − E ηη^⊤

= [(1/n) ∑_{h=1}^{H} (η_(h,1) − η_(h,2))(η_(h,1) − η_(h,2))^⊤ − E ηη^⊤]
+ (1/n) ∑_{h=1}^{H} (m(y_(h,1)) − m(y_(h,2)))(η_(h,1) − η_(h,2))^⊤
+ (1/n) ∑_{h=1}^{H} (η_(h,1) − η_(h,2))(m(y_(h,1)) − m(y_(h,2)))^⊤
+ (1/n) ∑_{h=1}^{H} (m(y_(h,1)) − m(y_(h,2)))(m(y_(h,1)) − m(y_(h,2)))^⊤

= T₄ + T₃ + T₂ + T₁.

We now bound each term separately.

Bounding T₁. We have

T₁ ≼ (1/n) ∑_{h=1}^{H} L² (y_(h,1) − y_(h,2))(y_(h,1) − y_(h,2))^⊤,

tr T₁ = ‖T₁‖_* ≤ (1/n) ∑_{h=1}^{H} L² · diameter(y₁, . . . , y_n) |y_(h,1) − y_(h,2)| ≤ (1/n) L² · diameter(y₁, . . . , y_n)².

The range cannot grow too much, i.e., it grows at most logarithmically in n. Indeed, assuming without loss of generality that Ey = 0, the event {max{y₁, . . . , y_n} ≤ u/2 and min{y₁, . . . , y_n} ≥ −u/2} implies that the range is less than u, and thus P(diameter(y₁, . . . , y_n) ≥ u) ≤ P(max{y₁, . . . , y_n} ≥ u/2) + P(min{y₁, . . . , y_n} ≤ −u/2) ≤ nP(y ≥ u/2) + nP(y < −u/2) ≤ 2n exp(−u²/(8τ_y²)) by using sub-Gaussianity. Then, by selecting u²/(8τ_y²) = log(2n) + log(8/δ), we get with probability greater than 1 − δ/8 that

diameter(y₁, . . . , y_n) ≤ 2√2 τ_y √(log(2n) + log(8/δ)).

Bounding T₂ and T₃. We also have

max{‖T₂‖_*, ‖T₃‖_*} ≤ (1/n) ∑_{h=1}^{H} L |y_(h,1) − y_(h,2)| · diameter(η₁, . . . , η_n) ≤ (1/n) L · diameter(y₁, . . . , y_n) · diameter(η₁, . . . , η_n).


As for T₁, the ranges cannot grow too much, i.e., they grow at most logarithmically in n. Similarly,

P(diameter((η_j)₁, . . . , (η_j)_n) ≥ u) ≤ nP((η_j) ≥ u/2) + nP((η_j) < −u/2) ≤ 2n exp(−u²/(8τ_η²)).

We thus get with probability greater than 1 − δ/(8d) that

max_{j∈{1,...,d}} diameter((η_j)₁, . . . , (η_j)_n) ≤ 2√2 τ_η √(log(2n) + log(8d/δ)).

Thus, combining the two terms above, with probability greater than 1 − δ/4,

‖T₁‖_* + ‖T₂‖_* + ‖T₃‖_* ≤ [8L(Lτ_y² + 2τ_η τ_y √d)/n] (log(2n) + log(8d) + log(1/δ)).

Note the term in √d, which corresponds to the definition of diameter(η₁, . . . , η_n) in terms of the ℓ₂-norm.

Bounding T₄. We have:

T₄ = (1/n) ∑_{h=1}^{H} [η_(h,1)η_(h,1)^⊤ + η_(h,2)η_(h,2)^⊤ − η_(h,2)η_(h,1)^⊤ − η_(h,1)η_(h,2)^⊤] − E ηη^⊤

= (1/n) ∑_{h=1}^{H} [−η_(h,2)η_(h,1)^⊤ − η_(h,1)η_(h,2)^⊤] + [(1/n) ∑_{i=1}^{n} η_i η_i^⊤ − E ηη^⊤] = T_{4,1} + T_{4,2}.

For the second term T_{4,2} above, the element indexed by (a, b) is

(1/n) ∑_{i=1}^{n} (η_i)_a (η_i)_b − E η_a η_b.

Using [Boucheron et al., 2013, Theorem 2.1], we get

E[(η_i)_a (η_i)_b]² ≤ √(E(η_i)_a⁴ · E(η_i)_b⁴) ≤ 4(2τ_η²)² = 16τ_η⁴,

and

E[|(η_i)_a (η_i)_b|^q] ≤ √(E(η_i)_a^{2q} · E(η_i)_b^{2q}) ≤ 2q!(2τ_η²)^q = (q!/2)(2τ_η²)^{q−2} · 16τ_η⁴.

We can then use Bernstein’s inequality [Boucheron et al., 2013, Theorem 2.10] to get that, with probability at most e^{−t},

(1/n) ∑_{i=1}^{n} (η_i)_a (η_i)_b − E η_a η_b ≥ 2τ_η² t/n + √(32τ_η⁴) √t/√n.

We can also get an upper bound for this quantity, using −η_a instead of η_a. Thus, with t = log(8d²/δ), we get that all d(d+1)/2 absolute deviations are less than

2τ_η² (log(8d²/δ)/n + √(2 log(8d²/δ)/n)),

with probability greater than 1 − δ/4. This implies that the nuclear norm of the second term is less than

2d√d τ_η² (log(8d²/δ)/n + √(2 log(8d²/δ)/n)),

because for any matrix K ∈ R^{d×d}: ‖K‖_* ≤ d√d ‖K‖_∞, where ‖K‖_∞ = max_{j,k} |K_{jk}|.

For the first term, we consider Z = ∑_{h=1}^{H} (η_(h,2))_a (η_(h,1))_b, and we condition on Y = (y₁, . . . , y_n). A key result from the theory of order statistics is that the n random variables η_(h,2), η_(h,1), h ∈ {1, . . . , n/2}, are independent given Y [Yang, 1977]. This allows us to compute expectations.

Using Rosenthal’s inequality [Boucheron et al., 2013, Theorem 15.11] conditioned on Y, for which we have E((η_(h,2))_a (η_(h,1))_b | Y) = 0, we get:

[E(|Z|^p | Y)]^{1/p} ≤ √(8p) [∑_h E((η_(h,2))_a² (η_(h,1))_b² | Y)]^{1/2} + 2p [E max_h ((η_(h,2))_a^p (η_(h,1))_b^p | Y)]^{1/p}

≤ √(8p) [∑_h E((η_(h,2))_a² (η_(h,1))_b² | Y)]^{1/2} + 2p [∑_h E((η_(h,2))_a^p (η_(h,1))_b^p | Y)]^{1/p}.

By taking the p-th power, we get:

E(|Z|^p | Y) ≤ 2^{p−1} √(8p)^p [∑_h E((η_(h,2))_a² (η_(h,1))_b² | Y)]^{p/2} + 2^{p−1} p^p 2^p ∑_h E((η_(h,2))_a^p (η_(h,1))_b^p | Y).

By now taking expectations with respect to Y and using Jensen’s inequality, we get:

E|Z|^p ≤ 2^{p−1} √(8p)^p E([∑_h (η_(h,2))_a² (η_(h,1))_b²]^{p/2}) + 2^{p−1} p^p 2^p ∑_h E[(η_(h,2))_a^p (η_(h,1))_b^p]

≤ 2^{p−1} √(8p)^p E([½ ∑_h ((η_(h,2))_a⁴ + (η_(h,1))_b⁴)]^{p/2}) + 2^{p−1} p^p 2^p · ½ ∑_h E[(η_(h,2))_a^{2p} + (η_(h,1))_b^{2p}]

≤ 2^{p−1} √(8p)^p E([∑_i ((η_i)_a⁴ + (η_i)_b⁴)]^{p/2}) + 2^{p−1} p^p 2^p ∑_i E[(η_i)_a^{2p} + (η_i)_b^{2p}],

because summing over all order statistics is equivalent to summing over all elements. Thus, using the bound on the moments of (η_i)^{2p}, we get:

E|Z|^p ≤ 2^{p−1} √(8p)^p 2^{p/2−1} E([∑_i (η_i)_a⁴]^{p/2}) + 2^{p−1} √(8p)^p 2^{p/2−1} E([∑_i (η_i)_b⁴]^{p/2}) + 2^{p−1} p^p 2^p n · 4p!(2τ_η²)^p.

We can now use [Boucheron et al., 2013, Theorem 15.10] to get

[E([∑_i (η_i)_a⁴]^{p/2})]^{2/p} ≤ 2E[∑_i (η_i)_a⁴] + (p/2) (E[max_i ((η_i)_a⁴)^{p/2}])^{2/p}

≤ 2E[∑_i (η_i)_a⁴] + (p/2) (E[∑_i ((η_i)_a⁴)^{p/2}])^{2/p}

≤ 2n · 4(2τ_η²)² + (p/2) n^{2/p} (E(η_i)_a^{2p})^{2/p}

≤ (32n + n^{2/p} (p/2)(2p!)^{2/p}) τ_η⁴

≤ (32n + n^{2/p} p³) τ_η⁴.

Thus

E|Z|^p ≤ 2^p √(8p)^p 2^{p/2−1} (32n + n^{2/p} p³)^{p/2} τ_η^{2p} + 2^{p−1} p^p 2^p n · 4p!(2τ_η²)^p

≤ 2^{3p−1} p^{p/2} 2^{p/2−1} ((32n)^{p/2} + n p^{3p/2}) τ_η^{2p} + 2^{3p+1} τ_η^{2p} n p^{2p}

≤ (2^{6p−2} p^{p/2} n^{p/2} + n p^{2p} [2^{7p/2−2} + 2^{3p+1}]) τ_η^{2p}

≤ (64^p p^{p/2} n^{p/2} + 19^p n p^{2p}) τ_η^{2p},

and thus

(E|Z|^p)^{1/p} ≤ (64 √p n^{1/2} + 19 n^{1/p} p²) τ_η².

Thus, for any δ ≤ 1/n, using Lemma 2.5 for the random variable Z/n with t = log(12d²/δ) ≥ (log n)/2, we obtain:

P[Z/n ≥ (192τ_η²/√n) √t + (57τ_η²/n) n^{1/t} t²] ≤ 3e^{−t} ⇒

P[Z/n ≥ (192τ_η²/√n) √(log(12d²/δ)) + (57τ_η²/n) n^{1/log(12d²/δ)} log²(12d²/δ)] ≤ δ/(4d²) ⇒

P[Z/n ≥ (192τ_η²/√n) √(log(12d²/δ)) + (155τ_η²/n) log²(12d²/δ)] ≤ δ/(4d²).

Combining all terms T₁, T₂, T₃, T_{4,1} and T_{4,2}, we get with probability not less than 1 − δ:

‖𝒱̂_{1,E} − 𝒱_{1,E}‖_* ≤ (1/n) · [8L(Lτ_y² + 2τ_η τ_y √d) · log(16dn/δ) + 2d√d τ_η² log(8d²/δ) + 155τ_η² d log²(12d²/δ)]
+ (1/√n) · [2d√d τ_η² √(2 log(8d²/δ)) + 192τ_η² d √(log(12d²/δ))].

Rearranging terms and replacing δ by δ/2, with probability not less than 1 − δ/2:

‖𝒱̂_{1,E} − 𝒱_{1,E}‖_* ≤ (195 d√d τ_η² / √n) √(log(24d²/δ)) + [(8L²τ_y² + 16τ_η τ_y L√d + 157τ_η² d√d) / n] log²(32d²n/δ).

Using the expression

𝒱_{1,cov} + 𝒱_{1,E} = cov[ℓ(x)],

we can suggest the estimator 𝒱̂_{1,cov} = (1/n) ∑_{i=1}^{n} ℓ(x_i)ℓ(x_i)^⊤ − 𝒱̂_{1,E}.

Applying the triangle inequality:

‖𝒱̂_{1,cov} − 𝒱_{1,cov}‖_* ≤ ‖𝒱̂_{1,E} − 𝒱_{1,E}‖_* + ‖(1/n) ∑_{i=1}^{n} ℓ(x_i)ℓ(x_i)^⊤ − E(ℓ(x)ℓ(x)^⊤)‖_*.

To estimate the second term, we can use the same arguments as for bounding T_{4,2}: with probability greater than 1 − δ/2, the norm ‖(1/n) ∑_{i=1}^{n} ℓ(x_i)ℓ(x_i)^⊤ − E(ℓ(x)ℓ(x)^⊤)‖_* is less than

2d√d τ_ℓ² (log(4d²/δ)/n + √(2 log(4d²/δ)/n)).

Finally, combining this bound with the bound for ‖𝒱̂_{1,E} − 𝒱_{1,E}‖_*, we obtain

‖𝒱̂_{1,cov} − 𝒱_{1,cov}‖_* ≤ (d√d (195τ_η² + 2τ_ℓ²) / √n) √(log(24d²/δ)) + [(8L²τ_y² + 16τ_η τ_y L√d + (157τ_η² + 2τ_ℓ²) d√d) / n] log²(32d²n/δ).

2.7.3 Proof of Theorem 2.2

Proof. Using the triangle inequality, we get:

‖𝒱_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂)‖_* ≤ ‖𝒱_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ*)‖_* + ‖𝒱̂_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂)‖_* = ℱ₁ + ℱ₂.   (2.7.1)


Theorem 2.1 supplies us with a non-asymptotic analysis of the ℱ₁ term: with probability not less than 1 − δ/2,

‖𝒱_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ*)‖_* ≤ (d√d (195τ_η² + 2τ_ℓ²) / √n) √(log(48d²/δ)) + [(8L²τ_y² + 16τ_η τ_y L√d + (157τ_η² + 2τ_ℓ²) d√d) / n] log²(64d²n/δ).   (2.7.2)

For the second term, let us firstly analyse the norm ‖θ̂ − θ*‖. For simplicity, introduce the notation Ĉ = (1/n) ∑_{i=1}^{n} Ψ(x_i)Ψ(x_i)^⊤ ∈ R^{m×m}, C = E[Ψ(x)Ψ(x)^⊤] ∈ R^{m×m}, b̂ = (1/n) ∑_{i=1}^{n} (∇·Ψ)(x_i) ∈ R^m and b = E[(∇·Ψ)(x)] ∈ R^m.

Let us estimate ‖Ĉ − C‖_F and ‖b̂ − b‖. Introduce the notation

C̃^c_{a,b} = (1/n) ∑_{i=1}^{n} Ψ^a_c(x_i) · Ψ^b_c(x_i) − E[Ψ^a_c(x) · Ψ^b_c(x)].

It can be shown, as in the bounding of the T_{4,2} term in the proof of Theorem 2.1, using [Boucheron et al., 2013, Theorem 2.1] and Bernstein’s inequality [Boucheron et al., 2013, Theorem 2.10], that with probability at most e^{−t}:

C̃^c_{a,b} ≥ 2τ_Ψ² t/n + √(32τ_Ψ⁴) √t/√n.

Taking t = log(2m²d/δ), we show that

P[‖Ĉ − C‖_F ≥ 2τ_Ψ² √(m³d) (log(2m²d/δ)/n + √(2 log(2m²d/δ)/n))] < δ.

According to assumption (M2) and the Hoeffding bound,

P[|b̂_i − b_i| ≥ t/n] ≤ exp(−t² / (2τ²_{∇Ψ} n));

combining these inequalities for all m components of b, we have

P[‖b − b̂‖ ≥ τ_{∇Ψ} (√m/√n) √(log(m²/δ²))] ≤ δ.

Now, let us estimate ‖θ̂ − θ*‖:

θ̂ − θ* = Ĉ^{−1}b̂ − C^{−1}b = Ĉ^{−1}(b̂ − b) + (Ĉ^{−1} − C^{−1})b ⇒

‖θ̂ − θ*‖ ≤ ‖b̂ − b‖ / λ_min(Ĉ) + ‖b‖ · ‖Ĉ − C‖_op / (λ_min(Ĉ) · λ_min(C)).

For n large enough (that is, n ≥ c₁ + c₂ log(1/δ) for some constants c₁ and c₂ not depending on n and δ), ‖Ĉ − C‖ ≤ λ_min(C)/2 with probability more than 1 − δ/12, hence for such n we have λ_min(Ĉ) ≥ λ_min(C)/2. Hence, combining the three estimates for ‖b̂ − b‖, ‖Ĉ − C‖ and λ_min(Ĉ), we obtain an estimate for ‖θ̂ − θ*‖: with probability not less than 1 − δ/4,

‖θ̂ − θ*‖ ≤ (2τ_{∇Ψ} √m / (√n λ_min(C))) √(2 log(12m/δ)) + (2‖b‖ / λ²_min(C)) · 2τ_Ψ² √(m³d) (log(24m²d/δ)/n + √(2 log(24m²d/δ)/n)).   (2.7.3)

Consider now the (a, b) element of 𝒱̂_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂) (using (2.4.2)):

[𝒱̂_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂)]_{a,b} ≤ (1/n) ∑_{i,j=1}^{n} α_{i,j} / (|I_h(i,j)| − 1) ∑_{α,β=1}^{m} |Ψ(x_i)^α_a Ψ(x_j)^β_b| · |θ*_α θ*_β − θ̂_α θ̂_β|

≤ (1/(2n)) ∑_{i,j=1}^{n} α_{i,j} / (|I_h(i,j)| − 1) ∑_{α,β=1}^{m} [(Ψ(x_i)^α_a)² + (Ψ(x_j)^β_b)²] · |θ*_α θ*_β − θ̂_α θ̂_β|

≤ ∑_{α,β=1}^{m} (1/(2n)) [∑_{i=1}^{n} (Ψ(x_i)^α_a)² + ∑_{i=1}^{n} (Ψ(x_i)^β_b)²] · |θ*_α θ*_β − θ̂_α θ̂_β|,

because every row of the binary matrix {α_{i,j}}_{i,j=1,...,n} has exactly |I_h(i,j)| − 1 non-zero elements.

Now, let us estimate the desired norm:

‖𝒱̂_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂)‖_* ≤ m · ∑_{k=1}^{m} ∑_{l=1}^{d} ∑_{i=1}^{n} |Ψ^k_l(x_i)|²/n · (2‖θ*‖ + ‖θ̂ − θ*‖) · ‖θ̂ − θ*‖.   (2.7.4)

Using again [Boucheron et al., 2013, Theorem 2.1] and Bernstein’s inequality [Boucheron et al., 2013, Theorem 2.10], with probability not less than 1 − δ/8:

∑_{k=1}^{m} ∑_{l=1}^{d} ∑_{i=1}^{n} |Ψ^k_l(x_i)|²/n ≤ E‖Ψ(x)‖₂² + 2mdτ_ψ² (log(16md/δ)/n + √(2 log(16md/δ)/n)).   (2.7.5)

For n large enough (that is, n ≥ c₁ + c₂ log(1/δ) for some constants c₁ and c₂ not depending on n and δ):

∑_{k=1}^{m} ∑_{l=1}^{d} ∑_{i=1}^{n} |Ψ^k_l(x_i)|²/n ≤ (3/2) E‖Ψ(x)‖₂²   with probability no less than 1 − δ/8,

and

(2‖θ*‖ + ‖θ̂ − θ*‖) ≤ 3‖θ*‖   with probability no less than 1 − δ/8,

so that, with probability not less than 1 − δ/2,

‖𝒱̂_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂)‖_* ≤ m · (9/2) E‖Ψ(x)‖₂² · ‖θ*‖ · [(2τ_{∇Ψ} √m / (√n λ_min(C))) √(2 log(12m/δ)) + (2‖b‖ / λ²_min(C)) · 2τ_Ψ² √(m³d) (log(24m²d/δ)/n + √(2 log(24m²d/δ)/n))].

Combining and simplifying the obtained inequality and (2.7.2), we obtain the final inequality: with probability not less than 1 − δ,

‖𝒱_{1,cov}(θ*) − 𝒱̂_{1,cov}(θ̂)‖_* ≤ (1/√n) √(2 log(48m²d/δ)) × [d√d (195τ_η² + 2τ_ℓ²) + (9m/2) E‖Ψ(x)‖₂² · ‖θ*‖ (2τ_{∇Ψ}√m/λ₀ + 4‖b‖τ_Ψ²√(m³d)/λ₀)] + O(1/n).


Chapter 3

Constant step-size Stochastic Gradient Descent for probabilistic modeling

Abstract

Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step-sizes (learning rates) have been shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, they do not lead to convergent algorithms in general. In this chapter, we consider generalized linear models, that is, conditional models based on exponential families. We propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent. For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model. For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does. We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.

This chapter is based on a conference paper published at the Conference on Uncertainty in Artificial Intelligence (UAI) 2018, which was accepted as an oral presentation [Babichev and Bach, 2018a].

3.1 Introduction

Faced with large amounts of data, efficient parameter estimation remains one ofthe key bottlenecks in the application of probabilistic models. Once cast as an opti-mization problem, for example through the maximum likelihood principle, difficultiesmay arise from the size of the model, the number of observations, or the potentialnon-convexity of the objective functions, and often all three [Koller and Friedman,2009, Murphy, 2012].


In this chapter we focus primarily on situations where the number of observationsis large; in this context, stochastic gradient descent (SGD) methods which look at onesample at a time are usually favored for their cheap iteration cost. However, findingthe correct step-size (sometimes referred to as the learning rate) remains a practicaland theoretical challenge, for probabilistic modeling but also in most other situationsbeyond maximum likelihood [Bottou et al., 2016].

In order to preserve convergence, the step size γ_n at the n-th iteration typically has to decay with the number of gradient steps (here equal to the number of data points which are processed), typically as C/n^α for α ∈ [1/2, 1] [see, e.g., Bach and Moulines, 2011, Bottou et al., 2016]. However, this often leads to slow convergence, and the choice of α and C is difficult in practice. More recently, constant step-sizes have been advocated for their fast convergence towards a neighborhood of the optimal solution [Bach and Moulines, 2013], and they are a standard practice in many areas [Goodfellow et al., 2016]. However, this scheme is not convergent in general and thus small step-sizes are still needed to converge to a decent estimator.

Constant step-sizes can however be made to converge in one situation. When thefunctions to optimize are quadratic, like for least-squares regression, using a constantstep-size combined with an averaging of all estimators along the algorithm can beshown to converge to the global solution with the optimal convergence rates [Bachand Moulines, 2013, Dieuleveut and Bach, 2016].

The goal of this chapter is to explore the possibility of such global convergence with a constant step-size in the context of probabilistic modeling with exponential families, e.g., for logistic regression or Poisson regression [McCullagh, 1984]. This would lead to the possibility of using probabilistic models (thus with a principled quantification of uncertainty) with rapidly converging algorithms. Our main novel idea is to replace the averaging of the natural parameters of the exponential family by the averaging of the moment parameters, which can also be formulated as averaging predictions instead of estimators. For example, in the context of predicting binary outcomes in {0, 1} through a Bernoulli distribution, the moment parameter is the probability p ∈ [0, 1] that the variable is equal to one, while the natural parameter is the “log odds ratio” log(p/(1−p)), which is unconstrained. This lack of constraint is often seen as a benefit for optimization; it turns out that for stochastic gradient methods, the moment parameter is better suited to averaging. Note that for least-squares, which corresponds to modeling with the Gaussian distribution with fixed variance, moment and natural parameters are equal, so it does not make a difference.

More precisely, our main contributions are:
— For generalized linear models, we propose in Section 3.4 averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent.
— For finite-dimensional models, we show in Section 3.5 that this can sometimes (and surprisingly) lead to better predictions than the best linear model.
— For infinite-dimensional models, we show in Section 3.6 that it always converges to optimal predictions, while averaging natural parameters never does.
— We illustrate our findings in Section 3.7 with simulations on synthetic data and classical benchmarks with many observations.


Figure 3-1 – Convergence of the iterates θ_n and the averaged iterates θ̄_n to the mean θ̄_γ under the stationary distribution π_γ: θ_n − θ̄_γ = O_p(γ^{1/2}), θ̄_n − θ̄_γ = O_p(n^{−1/2}), θ* − θ̄_γ = O(γ).

3.2 Constant step size stochastic gradient descent

In this section, we present the main intuitions behind stochastic gradient descent (SGD) with a constant step-size. For more details, see Dieuleveut et al. [2017]. We consider a real-valued function F defined on the Euclidean space R^d (this can be generalized to any Hilbert space, as done in Section 3.6 when considering Gaussian processes and positive-definite kernels), and a sequence of random functions (f_n)_{n≥1} which are independent and identically distributed and such that E f_n(θ) = F(θ) for all θ ∈ R^d. Typically, F will be the expected negative log-likelihood on unseen data, while f_n will be the negative log-likelihood for a single observation. Since we require independent random functions, we assume that we make a single pass over the data, and thus the number of iterations is equal to the number of observations.

Starting from an initial θ₀ ∈ R^d, SGD performs the following recursion, from n = 1 to the total number of observations:

๐œƒ๐‘› = ๐œƒ๐‘›โˆ’1 โˆ’ ๐›พ๐‘›โˆ‡๐‘“๐‘›(๐œƒ๐‘›โˆ’1). (3.2.1)

Since the functions ๐‘“๐‘› are independent, the iterates (๐œƒ๐‘›)๐‘› form a Markov chain. Whenthe step-size ๐›พ๐‘› is constant (equal to ๐›พ) and the functions ๐‘“๐‘› are identically dis-tributed, the Markov chain is homogeneous. Thus, under additional assumptions [see,e.g., Dieuleveut et al., 2017, Meyn and Tweedie, 1993], it converges in distributionto a stationary distribution, which we refer to as ๐œ‹๐›พ. These additional assumptionsinclude that ๐›พ is not too large (otherwise the algorithm diverges) and in the tradi-tional analysis of step-sizes for gradient descent techniques, we analyze the situationof small ๐›พโ€™s (and thus perform asymptotic expansions around ๐›พ = 0).
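For intuition, here is a minimal sketch (a hypothetical least-squares setup, not an experiment from this chapter) of the recursion (3.2.1) with a constant step-size: the raw iterates keep fluctuating around their stationary mean, while the running average stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma = 5, 100_000, 0.1
theta_star = rng.standard_normal(d)      # assumed ground-truth parameter

theta = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(1, n + 1):
    x = rng.standard_normal(d)
    y = x @ theta_star + rng.standard_normal()       # least-squares observation
    grad = (x @ theta - y) * x                       # gradient of f_t at theta
    theta = theta - gamma * grad                     # constant step-size SGD step (3.2.1)
    theta_bar += (theta - theta_bar) / t             # running average of the iterates

print(np.linalg.norm(theta - theta_star))        # fluctuates at scale O(gamma^{1/2})
print(np.linalg.norm(theta_bar - theta_star))    # much closer after averaging
```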

The distribution π_γ is in general not equal to a Dirac mass, and thus constant-step-size SGD is not convergent. However, averaging along the path of the Markov chain has some interesting properties. Indeed, several versions of the “ergodic theorem” [see, e.g., Meyn and Tweedie, 1993] show that for functions g from R^d to any vector space, the empirical average (1/n) ∑_{i=1}^{n} g(θ_i) converges in probability to the expectation ∫ g(θ) dπ_γ(θ) of g under the stationary distribution π_γ. This convergence can also be quantified by a central limit theorem, with an error which tends to a normal distribution with variance equal to a constant times 1/n.

Thus, if we denote θ̄_n = (1/(n+1)) ∑_{i=0}^{n} θ_i, applying the previous result to the identity function g, we immediately obtain that θ̄_n converges to θ̄_γ = ∫ θ dπ_γ(θ), with a squared error converging in O(1/n). The key question is the relationship between θ̄_γ and the global optimizer θ* of F, as this characterizes the performance of the algorithm with an infinite number of observations.

By taking expectations in Eq. (3.2.1), and taking the limit as n tends to infinity, we obtain that

∫ ∇F(θ) dπ_γ(θ) = 0,   (3.2.2)

that is, under the stationary distribution π_γ, the average gradient is zero. When the gradient is a linear function (like for a quadratic objective F), this leads to

∇F(∫ θ dπ_γ(θ)) = ∇F(θ̄_γ) = 0,

and thus θ̄_γ is a stationary point of F (and hence the global minimizer if F is convex). However, this is not true in general.

As shown by Dieuleveut et al. [2017], the deviation ๐œƒ๐›พ โˆ’ ๐œƒ* is of order ๐›พ, which isan improvement on the non-averaged recursion, which is at average distance ๐‘‚(๐›พ1/2)(see an illustration in Figure 3-1); thus, small or decaying step-sizes are needed. Inthis chapter, we explore alternatives which are not averaging the estimators ๐œƒ1, . . . , ๐œƒ๐‘›,and rely instead on the specific structure of our cost functions, namely negative log-likelihoods.

3.3 Warm-up: exponential families

In order to highlight the benefits of averaging moment parameters, we first consider unconditional exponential families. We thus consider the standard exponential family

q(x|θ) = h(x) exp(θ^⊤T(x) − A(θ)),

where h(x) is the base measure, T(x) ∈ R^d is the sufficient statistic and A the log-partition function. The function A is always convex [see, e.g., Koller and Friedman, 2009, Murphy, 2012]. Note that we do not assume that the data distribution p(x) comes from this exponential family. The expected (with respect to the input distribution p(x)) negative log-likelihood is equal to

F(θ) = −E_{p(x)} log q(x|θ) = A(θ) − θ^⊤ E_{p(x)}T(x) − E_{p(x)} log h(x).

๐œƒ๐‘› = ๐œƒ๐‘›โˆ’1 โˆ’ ๐›พ[โˆ‡๐ด(๐œƒ๐‘›โˆ’1)โˆ’ ๐‘‡ (๐‘ฅ๐‘›)

],

while the stationarity equation in Eq. (3.2.2) becomesโˆซ [โˆ‡๐ด(๐œƒ)โˆ’ E๐‘(๐‘ฅ)๐‘‡ (๐‘ฅ)]๐‘‘๐œ‹๐›พ(๐œƒ) = 0,

which leads to โˆซโˆ‡๐ด(๐œƒ)๐‘‘๐œ‹๐›พ(๐œƒ) = E๐‘(๐‘ฅ)๐‘‡ (๐‘ฅ) = โˆ‡๐ด(๐œƒ*).

Thus, averaging โˆ‡๐ด(๐œƒ๐‘›) will converge to โˆ‡๐ด(๐œƒ*), while averaging ๐œƒ๐‘› will notconverge to ๐œƒ*. This simple observation is the basis of our work.
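A tiny numerical illustration of this observation, for a hypothetical unconditional Bernoulli family (so A(θ) = log(1 + e^θ) and ∇A = σ, the sigmoid); the setup and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, gamma, n = 0.3, 0.5, 200_000
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))       # moment map grad A for the Bernoulli family

theta, avg_theta, avg_moment = 0.0, 0.0, 0.0
for t in range(1, n + 1):
    x = rng.binomial(1, p_true)
    theta -= gamma * (sigma(theta) - x)            # SGD on the negative log-likelihood
    avg_theta += (theta - avg_theta) / t           # averaged natural parameters
    avg_moment += (sigma(theta) - avg_moment) / t  # averaged moment parameters

print(avg_moment - p_true)         # averaging grad A(theta_n): converges to E[x]
print(sigma(avg_theta) - p_true)   # averaging theta_n: does not match exactly for constant gamma
```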

Note that in this context of unconditional models, a simpler estimator exists: we can simply compute the empirical average (1/n) ∑_{i=1}^{n} T(x_i), which converges to ∇A(θ*). Nevertheless, this shows that averaging moment parameters ∇A(θ) rather than natural parameters θ can bring convergence benefits. We now turn to conditional models, for which no closed-form solutions exist.

3.4 Conditional exponential families

Now we consider the conditional exponential family

q(y|x, θ) = h(y) exp(y · η_θ(x) − a(η_θ(x))).

For simplicity we consider only one-dimensional families where y ∈ R — but our framework would also extend to more complex models such as conditional random fields [Lafferty et al., 2001]. We will also assume that h(y) = 1 for all y to avoid carrying constant terms in log-likelihoods. We consider functions of the form η_θ(x) = θ^⊤Φ(x), which are linear in a feature vector Φ(x), where Φ : 𝒳 → R^d can be defined on an arbitrary input set 𝒳. Calculating the negative log-likelihood, one obtains

f_n(θ) = −log q(y_n|x_n, θ) = −y_n Φ(x_n)^⊤θ + a(Φ(x_n)^⊤θ),

and, for any distribution p(x, y), for which p(y|x) may not be a member of the conditional exponential family,

F(θ) = E_{p(x_n,y_n)} f_n(θ) = E_{p(x_n,y_n)}[−y_n Φ(x_n)^⊤θ + a(Φ(x_n)^⊤θ)].


The goal of estimation in such generalized linear models is to find the unknown parameter θ given n observations (x_i, y_i)_{i=1,...,n}:

θ* = arg min_{θ∈R^d} F(θ).   (3.4.1)

3.4.1 From estimators to prediction functions

Another point of view is to consider that an estimator θ ∈ R^d in fact defines a function η : 𝒳 → R, whose value is a natural parameter for the exponential family q(y) = exp(ηy − a(η)). This particular choice of function η_θ is linear in Φ(x), and we have, by decomposing the joint probability p(x_n, y_n) in two (and dropping the dependence on n since we have assumed i.i.d. data):

F(θ) = E_{p(x)}(E_{p(y|x)}[−yΦ(x)^⊤θ + a(Φ(x)^⊤θ)]) = E_{p(x)}(−E_{p(y|x)}y · Φ(x)^⊤θ + a(Φ(x)^⊤θ)) = ℱ(η_θ),

where ℱ(η) = E_{p(x)}(−E_{p(y|x)}y · η(x) + a(η(x))) is the performance measure defined for a function η : 𝒳 → R. By definition, F(θ) = ℱ(η_θ) = ℱ(θ^⊤Φ(·)).

However, the global minimizer of ℱ(η) over all functions η : 𝒳 → R may not be attained at a linear function in Φ(x) (this can only be the case if the linear model is well-specified or if the feature vector Φ(x) is flexible enough). Indeed, the global minimizer of ℱ is the function η** : x ↦ (a′)^{−1}(E_{p(y|x)}y); starting from ℱ(η) = ∫ [a(η(x)) − E_{p(y|x)}y · η(x)] p(x) dx and writing down the Euler-Lagrange equation, ∂ℱ/∂η − (d/dx)(∂ℱ/∂η′) = 0 ⇔ [a′(η) − E_{p(y|x)}y] p(x) = 0, we finally get η** : x ↦ (a′)^{−1}(E_{p(y|x)}y), which is typically not a linear function in Φ(x) (note here that p(y|x) is the conditional data-generating distribution).

The function η corresponds to the natural parameter of the exponential family, and it is often more intuitive to consider the moment parameter, that is, to define functions μ : 𝒳 → R that now correspond to moments of the outputs y; we will refer to them as prediction functions. Going from natural to moment parameters is known to be done through the gradient of the log-partition function, and we thus consider, for η a function from 𝒳 to R, μ(·) = a′(η(·)), and this leads to the performance measure

𝒢(μ) = ℱ((a′)^{−1}(μ(·))).

Note now that the global minimum of 𝒢 is reached at

μ**(x) = E_{p(y|x)}y.

We also introduce the prediction function μ*(x) corresponding to the best η which is linear in Φ(x), that is,

μ*(x) = a′(θ*^⊤Φ(x)).

Figure 3-2 – Graphical representation of the reparametrization: first we expand the class of functions, replacing the parameter θ with the function η(·) = Φ(·)^⊤θ, and then we do one more reparametrization: μ(·) = a′(η(·)). The best linear prediction μ* is constructed using θ* and the global minimizer of 𝒢 is μ**. The model is well-specified if and only if μ* = μ**.

We say that the model is well-specified when μ* = μ**, and for these models, inf_θ F(θ) = inf_μ 𝒢(μ). However, in general, we only have inf_θ F(θ) ≥ inf_μ 𝒢(μ) and (very often) the inequality is strict (see examples in our simulations).

To make the further developments more concrete, we now present two classicalexamples: logistic regression and Poisson regression.

Logistic regression. A special case of the conditional family is logistic regression, where y ∈ {0, 1}, a(t) = log(1 + e^t) and a′(t) = σ(t) = 1/(1 + e^{−t}) is the sigmoid function; the probability mass function is given by p(y|η) = exp(ηy − log(1 + e^η)).

Poisson regression. One more special case is Poisson regression with ๐‘ฆ โˆˆ N, ๐‘Ž(๐‘ก) =exp(๐‘ก) and the response variable ๐‘ฆ has a Poisson distribution. The probability massfunction is given by ๐‘(๐‘ฆ|๐œ‚) = exp(๐œ‚๐‘ฆ โˆ’ ๐‘’๐œ‚ โˆ’ log(๐‘ฆ!)). Poisson regression may beappropriate when the dependent variable is a count, for example in genomics, networkpacket analysis, crime rate analysis, fluorescence microscopy, etc. [Hilbe, 2011].
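For reference, the log-partition functions and their derivatives (the moment maps) for these two examples can be written as follows; this is a small helper sketch with our own function names:

```python
import numpy as np

def a_logistic(t):
    return np.logaddexp(0.0, t)          # log(1 + e^t), computed stably

def a_prime_logistic(t):
    return 1.0 / (1.0 + np.exp(-t))      # sigmoid: moment parameter in [0, 1]

def a_poisson(t):
    return np.exp(t)                     # log-partition of the Poisson family

def a_prime_poisson(t):
    return np.exp(t)                     # moment parameter: the Poisson mean
```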

3.4.2 Averaging predictions

Recall from Section 3.2 that π_γ is the stationary distribution of θ. Taking expectations of both sides of Eq. (3.2.1), we get, by using the fact that π_γ is the limiting distribution of θ_n and θ_{n−1}:

E_{π_γ(θ_n)}θ_n = E_{π_γ(θ_{n−1})}θ_{n−1} − γ E_{π_γ(θ_{n−1})} E_{p(x_n,y_n)} f′_n(θ_{n−1}),


which leads to E_{π_γ(θ)} E_{p(x_n,y_n)} ∇f_n(θ) = 0, that is, now removing the dependence on n (the data (x, y) are i.i.d.):

E_{π_γ(θ)} E_{p(x,y)}[−yΦ(x) + a′(Φ(x)^⊤θ)Φ(x)] = 0,

which finally leads to

E_{p(x)}[E_{π_γ(θ)} a′(Φ(x)^⊤θ) − μ**(x)]Φ(x) = 0.   (3.4.2)

This is the core equation our method relies on. It does not imply that b(x) = E_{π_γ(θ)} a′(Φ(x)^⊤θ) − μ**(x) is uniformly equal to zero (which is what we want), but only that E_{p(x)}Φ(x)b(x) = 0, i.e., b(x) is uncorrelated with Φ(x). If the feature vector Φ(x) is “large enough”, then this is equivalent to b = 0.¹ For example, when Φ(x) is composed of an orthonormal basis of the space of integrable functions (like for kernels in Section 3.6), then this is exactly true. Thus, in this situation,

μ**(x) = E_{π_γ(θ)} a′(Φ(x)^⊤θ),   (3.4.3)

and averaging the predictions a′(Φ(x)^⊤θ_n) along the path (θ_n) of the Markov chain should exactly converge to the optimal prediction.

This exact convergence is weaker (it requires high-dimensional features) than for the unconditional family in Section 3.3, but it can still bring surprising benefits even when Φ is not large enough, as we present in Section 3.5 and Section 3.6.

3.4.3 Two types of averaging

Now we can introduce two possible ways to estimate the prediction function ๐œ‡(๐‘ฅ).

Averaging estimators. The first one is the usual way: we first estimate the parameter θ, using Ruppert-Polyak averaging [Polyak and Juditsky, 1992], θ̄_n = (1/(n+1)) ∑_{i=0}^{n} θ_i, and then we denote by

μ̂_n(x) = a′(Φ(x)^⊤θ̄_n) = a′(Φ(x)^⊤ (1/(n+1)) ∑_{i=0}^{n} θ_i)

the corresponding prediction. As discussed in Section 3.2, it converges to μ̂_γ : x ↦ a′(Φ(x)^⊤θ̄_γ), which is not in general equal to a′(Φ(x)^⊤θ*), where θ* is the optimal parameter in R^d. Since, as presented at the end of Section 3.2, θ̄_γ − θ* is of order O(γ), F(θ̄_γ) − F(θ*) is of order O(γ²) (because ∇F(θ*) = 0), and thus an error of O(γ²) is added to the usual convergence rates in O(1/n).

Note that we are limited here to prediction functions which correspond to linear functions in Φ(x) in the natural parameterization, and thus F(θ*) ≥ 𝒢(μ**), and the inequality is often strict.

1. Let Φ(x) = (φ₁(x), . . . , φ_n(x))^⊤ be an orthogonal basis and b(x) = ∑_{i=1}^{n} b_i φ_i(x) + ε(x), where ε(x) is small if the basis is big enough. Then E_{p(x)}Φ(x)b(x) = 0 ⇔ E φ_i(x)[∑_{j=1}^{n} b_j φ_j(x) + ε(x)] = 0 for every i, and due to the orthogonality of the basis and the smallness of ε(x): b_i · E_{p(x)} φ_i²(x) ≈ 0, hence b_i ≈ 0 and thus b(x) ≈ 0.

Averaging predictions. We propose the new estimator

μ̄_n(x) = (1/(n+1)) ∑_{i=0}^{n} a′(θ_i^⊤Φ(x)).

In general, 𝒢(μ̄_n) − 𝒢(μ**) does not converge to zero either (unless the feature vector Φ is large enough and Eq. (3.4.3) is satisfied). Thus, on top of the usual convergence in O(1/n) with respect to the number of iterations, we have an extra term that depends only on γ, which we will study in Section 3.5 and Section 3.6.

We denote by μ̄_γ(x) the limit when n → ∞, that is, using properties of converging Markov chains, μ̄_γ(x) = E_{π_γ(θ)} a′(Φ(x)^⊤θ).

Rewriting Eq. (3.4.2) using our new notation, we get:

E[(μ**(x) − μ̄_γ(x))Φ(x)] = 0.

When Φ : 𝒳 → R^d is high-dimensional, this leads to μ** = μ̄_γ; in contrast to μ̂_γ, averaging predictions thus potentially converges to the optimal prediction.
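The two estimators can be compared on a small synthetic logistic-regression run; this is a hypothetical sketch (our own setup and names), not one of the experiments of Section 3.7:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, gamma = 5, 50_000, 0.2
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
theta_true = rng.standard_normal(d)
x_test = rng.standard_normal(d)

theta = np.zeros(d)
theta_bar = np.zeros(d)   # running average of the iterates (averaging estimators)
mu_bar = 0.0              # running average of predictions at x_test (averaging predictions)
for t in range(1, n + 1):
    x = rng.standard_normal(d)
    y = rng.binomial(1, sigmoid(x @ theta_true))      # well-specified Bernoulli output
    theta -= gamma * (sigmoid(x @ theta) - y) * x      # constant step-size SGD step
    theta_bar += (theta - theta_bar) / t
    mu_bar += (sigmoid(x_test @ theta) - mu_bar) / t

print(sigmoid(x_test @ theta_bar))     # mu_hat_n(x_test): averaging estimators
print(mu_bar)                          # mu_bar_n(x_test): averaging predictions
print(sigmoid(x_test @ theta_true))    # mu**(x_test) in this well-specified setting
```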

Graphical representation. We propose a schematic graphical representation of averaging estimators and averaging predictions in Figure 3-3.

Computational complexity. The usual averaging of estimators [Polyak and Juditsky, 1992], to compute

μ̂_n(x) = a′(Φ(x)^⊤θ̄_n),

is simple to implement, as we can simply update the average θ̄_n with essentially no extra cost on top of the O(nd) complexity of the SGD recursion. Given the number n of training data points and the number m of testing data points, the overall complexity is O(d(n + m)).

Averaging prediction functions is more challenging, as we have to store all iterates θ_i, i = 1, . . . , n, and for each testing point x compute

μ̄_n(x) = (1/(n+1)) ∑_{i=0}^{n} a′(θ_i^⊤Φ(x)).

Thus the overall complexity is O(dn + mnd), which could be too costly with many test points (i.e., m large).

Figure 3-3 – Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators in red vs. averaging predictions in green. μ* is the optimal linear predictor and μ** is the global optimum.

There are several ways to alleviate this extra cost: (a) using sketching techniques [Woodruff, 2014], (b) using summary statistics as done in applications of MCMC [Gilks et al., 1995], or (c) leveraging the fact that all iterates θ_i will end up being close to θ̄_γ and using a Taylor expansion of a′(θ^⊤Φ(x)) around θ̄_γ. This expansion is equal to:

a′(Φ(x)^⊤θ̄_γ) + (θ − θ̄_γ)^⊤Φ(x) · a″(Φ(x)^⊤θ̄_γ) + ½((θ − θ̄_γ)^⊤Φ(x))² · a‴(Φ(x)^⊤θ̄_γ) + O(‖θ − θ̄_γ‖³).

Taking expectations on both sides above leads to:

μ̄_γ(x) ≈ μ̂_γ(x) + ½ Φ(x)^⊤ cov(θ) Φ(x) · a‴(θ̄_γ^⊤Φ(x)),

where cov(θ) is the covariance matrix of θ under π_γ. This provides a simple correction to μ̂_γ, and leads to an approximation of μ̄_n(x) as

μ̂_n(x) + ½ Φ(x)^⊤ cov_n(θ) Φ(x) · a‴(θ̄_n^⊤Φ(x)),

where cov_n(θ) is the empirical covariance matrix of the iterates (θ_i). The computational complexity now becomes O(nd² + md²), which is an improvement when the number of testing points m is large. In all of our experiments, we used this approximation.

3.5 Finite-dimensional models

In this section we study the behavior of Ā(γ) = 𝒢(μ̄_γ) − 𝒢(μ*) for finite-dimensional models, for which it is usually not equal to zero. We know that our estimators μ̄_n converge to μ̄_γ, and our goal is to compare it to Â(γ) = 𝒢(μ̂_γ) − 𝒢(μ*) = F(θ̄_γ) − F(θ*), which is what averaging estimators tends to. We also consider for completeness the non-averaged performance A(γ) = E_{π_γ(θ)}[F(θ) − F(θ*)].

Note that Â(γ) and A(γ) must be non-negative, because we compare the negative log-likelihood performances to that of the best linear prediction (in the natural parameter), while Ā(γ) could potentially be negative (it will be in certain situations), because the corresponding natural parameters may not be linear in Φ(x).

We consider the same standard assumptions as Dieuleveut et al. [2017], namelysmoothness of the negative log-likelihoods ๐‘“๐‘›(๐œƒ) and strong convexity of the expectednegative log-likelihood ๐น (๐œƒ). We first recall the results from Dieuleveut et al. [2017].See detailed explicit formulas in the supplementary material.

3.5.1 Earlier work

Without averaging. We have that A(γ) = γB + O(γ^{3/2}), that is, A(γ) is linear in γ, with B non-negative.

Averaging estimators. We have that Â(γ) = γ²B̂ + O(γ^{5/2}), that is, Â is quadratic in γ, with B̂ non-negative. Averaging does provably bring some benefits because the order in γ is higher (we assume γ small).

3.5.2 Averaging predictions

We are now ready to analyze the behavior of our new framework of averagingpredictions. The following result is shown in the supplementary material.

Proposition 3.1. Under the assumptions on the negative log-likelihoods f_n of each observation from Dieuleveut et al. [2017]:
— In the case of well-specified data, that is, when there exists θ* such that for all (x, y), p(y|x) = q(y|x, θ*), we have Ā ∼ γ² B̄_well, where B̄_well is a positive constant.
— In the general case of potentially mis-specified data, Ā = γ B̄_mis + O(γ²), where B̄_mis is a constant which may be positive or negative.

Note that, in contrast to averaging parameters, the constant B̄_mis can be negative. It means that we can obtain an estimator better than the optimal linear estimator, which is the limit of what averaging parameters can achieve. In our simulations, we show examples for which B̄_mis is positive, and examples for which it is negative. Thus, in general, for low-dimensional models, averaging predictions can be worse or better than averaging parameters. However, as we show in the next section, for infinite-dimensional models, we always get convergence.


Figure 3-4 โ€“ Visualisation of the space of parameters and its transformation to thespace of predictions. Averaging estimators in red vs averaging predictions in green.Global optimizer coincides with the best linear: ๐œ‡* = ๐œ‡**.

3.6 Infinite-dimensional models

Recall that we have the following objective function to minimize:

F(θ) = E_{x,y}[−y · η_θ(x) + a(η_θ(x))],   (3.6.1)

where until now we considered functions η_θ(x) which were linear in Φ(x) with Φ(x) ∈ R^d, leading to a complexity in O(dn).

We now consider infinite-dimensional features, by considering that Φ(x) ∈ ℋ, where ℋ is a Hilbert space. Note that this corresponds to modeling the function η_θ as a Gaussian process [Rasmussen and Williams, 2006].

This is computationally feasible through the usual “kernel trick” [Scholkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004], where we assume that the kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩ is easy to compute. Indeed, following Bordes et al. [2005] and Dieuleveut and Bach [2016], starting from θ_0, each iterate of constant-step-size SGD is of the form θ_n = ∑_{t=1}^{n} α_t Φ(x_t), and the gradient descent recursion θ_n = θ_{n−1} − γ[a′(η_{θ_{n−1}}(x_n)) − y_n]Φ(x_n) leads to the following recursion on the α_t’s:

α_n = −γ[a′(∑_{t=1}^{n−1} α_t⟨Φ(x_t), Φ(x_n)⟩) − y_n] = −γ[a′(∑_{t=1}^{n−1} α_t k(x_t, x_n)) − y_n].


This leads to η_{θ_n}(x) = ⟨Φ(x), θ_n⟩ and μ_{θ_n}(x) = a′(η_{θ_n}(x)) with

η_{θ_n}(x) = ∑_{t=1}^{n} α_t⟨Φ(x), Φ(x_t)⟩ = ∑_{t=1}^{n} α_t k(x, x_t),

and finally we can express ¯μ_n(x) in kernel form as:

¯μ_n(x) = 1/(n+1) ∑_{t=0}^{n} a′[∑_{l=1}^{t} α_l · k(x, x_l)].

There is also a straightforward estimator for averaging parameters, i.e., μ_{¯θ_n}(x) = a′(1/(n+1) ∑_{t=0}^{n} ∑_{l=1}^{t} α_l k(x, x_l)). If we assume that the kernel function is universal, that is, ℋ is dense in the space of square-integrable functions, then it is known that if E_x p(x)Φ(x) = 0, then p = 0 [Sriperumbudur et al., 2008]. This implies that we must have ¯μ_γ − μ** = 0, and thus averaging predictions does always converge to the global optimum (note that in this setting, we must have a well-specified model because we are in a non-parametric setting).
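For concreteness, here is a small sketch (ours, not from the chapter) of the kernel recursion above and of the two kernel-form estimators; the logistic link and the naive O(n²) implementation are assumptions made for readability.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kernel_sgd(X, y, gamma, kernel):
    """Constant-step-size SGD in the RKHS: theta_n = sum_{t<=n} alpha_t Phi(x_t)."""
    n = len(X)
    alpha = np.zeros(n)
    for i in range(n):
        eta_i = sum(alpha[t] * kernel(X[t], X[i]) for t in range(i))  # eta_{theta_{i-1}}(x_i)
        alpha[i] = -gamma * (sigmoid(eta_i) - y[i])
    return alpha

def predict(alpha, X, x, kernel, averaging="predictions"):
    """Averaging predictions: (1/(n+1)) sum_t a'(eta_{theta_t}(x));
       averaging parameters : a'((1/(n+1)) sum_t eta_{theta_t}(x))."""
    k_x = np.array([kernel(xt, x) for xt in X])
    etas = np.concatenate(([0.0], np.cumsum(alpha * k_x)))   # eta_{theta_t}(x) for t = 0..n
    if averaging == "parameters":
        return sigmoid(etas.mean())
    return sigmoid(etas).mean()
```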

Graphical representation. We propose a schematic graphical representation of averaging estimators and averaging predictions for the Hilbert space setup in Figure 3-4.

Column sampling. Because of the usual high running-time complexity of kernel methods in O(n²), we consider a “column-sampling approximation” [Williams and Seeger, 2001]. We thus choose a small subset I = (x_1, . . . , x_m) of samples and construct a new finite m-dimensional feature map Φ(x) = K(I, I)^{−1/2} K(I, x) ∈ R^m, where K(I, I) is the m × m kernel matrix of the m points and K(I, x) the vector composed of kernel evaluations k(x_i, x). This allows a running-time complexity in O(m²n). In theory and practice, m can be chosen small [Bach, 2013, Rudi et al., 2017].
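A possible implementation of this feature map is sketched below (ours; the small jitter and the pseudo-inverse square root are assumptions used to handle a possibly rank-deficient K(I, I)).

```python
import numpy as np
from scipy.linalg import sqrtm, pinv

def column_sampling_features(landmarks, kernel):
    """Return the map x -> K(I, I)^{-1/2} K(I, x) built from m landmark points."""
    m = len(landmarks)
    K_II = np.array([[kernel(a, b) for b in landmarks] for a in landmarks])
    K_inv_sqrt = pinv(np.real(sqrtm(K_II + 1e-10 * np.eye(m))))
    def phi(x):
        K_Ix = np.array([kernel(a, x) for a in landmarks])
        return K_inv_sqrt @ K_Ix                # m-dimensional feature vector
    return phi
```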

Regularized learning with kernels. Although we can use an unregularized recursion with good convergence properties [Dieuleveut and Bach, 2016], adding a regularisation by the squared Hilbertian norm is easier to analyze and more stable with limited amounts of data. We thus consider the recursion (in the Hilbert space), with λ small:

θ_n = θ_{n−1} − γ[f′_n(θ_{n−1}) + λθ_{n−1}] = θ_{n−1} + γ(y_n − a′(⟨Φ(x_n), θ_{n−1}⟩))Φ(x_n) − γλθ_{n−1}.

This recursion can also be computed efficiently as above using the kernel trick and column sampling approximations; a sketch in terms of the coefficients α_t is given below.
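In coefficient form, the regularized recursion simply shrinks all previous coefficients and appends a new one; a short sketch (ours, with assumed names and the logistic link) follows.

```python
import numpy as np

def regularized_kernel_sgd_step(alpha, k_row, y_n, gamma, lam):
    """One step of theta_n = (1 - gamma*lam) theta_{n-1} + gamma (y_n - a'(eta_{n-1}(x_n))) Phi(x_n),
    written on the coefficients alpha_t of theta = sum_t alpha_t Phi(x_t)."""
    eta = k_row @ alpha                       # eta_{theta_{n-1}}(x_n) = sum_t alpha_t k(x_t, x_n)
    mu = 1.0 / (1.0 + np.exp(-eta))           # a'(eta) for the logistic link (an assumption)
    alpha = (1.0 - gamma * lam) * alpha       # shrinkage coming from the -gamma*lam*theta term
    return np.append(alpha, gamma * (y_n - mu))
```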

In terms of convergence, the best we can hope for is to converge to the minimizer θ_{*,λ} of the regularized expected negative log-likelihood F(θ) + λ/2 ‖θ‖² (which we assume to exist). When λ tends to zero, then θ_{*,λ} converges to θ*. Averaging parameters will tend to a limit ¯θ_{γ,λ} which is O(γ)-close to θ_{*,λ}, thus leading to predictions which deviate from the optimal predictions for two reasons: because of regularization and because of the constant step-size. Since λ should decrease as we get more data, the first effect will vanish, while the second will not.

When averaging predictions, the two effects will vanish as λ tends to zero. Indeed, by taking limits of the gradient equation, and denoting by ¯μ_{γ,λ} the limit of ¯μ_n, we have

E[(μ**(x) − ¯μ_{γ,λ}(x))Φ(x)] = λ¯θ_{γ,λ}.   (3.6.2)

Given that ¯θ_{γ,λ} is O(γ)-away from θ*, if we assume that θ* corresponds to a sufficiently regular element of the Hilbert space ℋ (see footnote 2), then the L2-norm of the deviation satisfies ‖μ** − ¯μ_{γ,λ}‖ = O(λ), and thus as the regularization parameter λ tends to zero, our predictions tend to the optimal one.

3.7 Experiments

In this section, we compare the two types of averaging (estimators and predictions) on a variety of problems, both on synthetic data and on standard benchmarks. When averaging predictions, we always consider the Taylor expansion approximation presented at the end of Section 3.4.3.

3.7.1 Synthetic data

Finite-dimensional models. We consider the following logistic regression model:

q(y|x, θ) = exp(y · η_θ(x) − a(η_θ(x))),

where we consider a linear model η_θ(x) = θ⊤x in x (i.e., Φ(x) = x), the link function a(t) = log(1 + e^t), and a′(t) = σ(t) is the sigmoid function. Let x be distributed as a standard normal random variable in dimension d = 2, y ∈ {0, 1} and P(y = 1|x) = μ**(x) = σ(η**(x)), where we consider two different settings:

— Model 1: η**(x) = sin x_1 + sin x_2,
— Model 2: η**(x) = x_1^3 + x_2^3.

The global minimum ℱ** of the corresponding optimization problem can be found as

ℱ** = E_{p(x)}[−μ**(x) · η**(x) + a(η**(x))].

We also introduce the performance measure ℱ(η):

ℱ(η) = E_{p(x)}[−μ**(x) · η(x) + a(η(x))],   (3.7.1)

2. While our reasoning is informal here, it can be made more precise by considering so-called “source conditions” commonly used in the analysis of kernel methods [Caponnetto and De Vito, 2007], but this is out of the scope of this chapter.


which can be evaluated directly in the case of synthetic data. Note that in our situation, the model is misspecified because η**(x) is not linear in Φ(x) = x, so that inf_θ F(θ) > ℱ**, and thus our performance measures ℱ(μ_n) − ℱ** for various estimators μ_n will not converge to zero.
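For example, ℱ(η) − ℱ** can be estimated by Monte Carlo for Model 1; the short sketch below is ours (the sample size and the test predictor θ are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100_000, 2))            # x ~ N(0, I_2), d = 2

def a(t):                                        # link function a(t) = log(1 + e^t)
    return np.logaddexp(0.0, t)

eta_ss = np.sin(x[:, 0]) + np.sin(x[:, 1])       # eta**(x), Model 1
mu_ss = 1.0 / (1.0 + np.exp(-eta_ss))            # mu**(x) = sigmoid(eta**(x))
F_ss = np.mean(-mu_ss * eta_ss + a(eta_ss))      # Monte Carlo estimate of F**

def excess_performance(eta_values):
    """Monte Carlo estimate of F(eta) - F**, cf. Eq. (3.7.1)."""
    return np.mean(-mu_ss * eta_values + a(eta_values)) - F_ss

theta = np.array([0.8, 0.8])                     # some linear predictor eta_theta(x) = theta^T x
print(excess_performance(x @ theta))
```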

The results of averaging 10 replications are shown in Fig. 3-5 and Fig. 3-6. We first observed that constant step-size SGD without averaging leads to a bad performance.

Moreover, we can see that in the first case (Fig. 3-5) averaging predictions beats averaging parameters, and moreover beats the best linear model: if we use the best linear error ℱ* instead of ℱ**, at some moment ℱ(η_n) − ℱ* becomes negative. However, in the second case (Fig. 3-6), averaging predictions is not superior to averaging parameters. Moreover, by looking at the final differences between performances with different values of γ, we can see the dependency of the final performance in γ for averaging predictions, instead of γ² for averaging parameters (as suggested by our theoretical results in Section 3.5). In particular, in Fig. 3-5, we can observe the surprising behavior of a larger step-size leading to a better performance (note that we cannot increase it too much, otherwise the algorithm would diverge).

Figure 3-5 – Synthetic data for the linear model η_θ(x) = θ⊤x and η**(x) = sin x_1 + sin x_2. Excess prediction performance log(ℱ(η_n) − ℱ**) vs. number of iterations (both in log-scale). Curves: averaging predictions and averaging parameters for γ = 1/2 and γ = 1/4, averaging parameters with decaying γ_n, and no averaging for γ = 1/2 and γ = 1/4.

Figure 3-6 – Synthetic data for the linear model η_θ(x) = θ⊤x and η**(x) = x_1^3 + x_2^3. Excess prediction performance log(ℱ(η_n) − ℱ**) vs. number of iterations (both in log-scale). Curves: averaging predictions and averaging parameters for γ = 1/2 and γ = 1/4, averaging parameters with decaying γ_n, and no averaging for γ = 1/2 and γ = 1/4.

Infinite-dimensional models. Here we consider the kernel setup described in Section 3.6. We consider Laplacian kernels k(s, t) = exp(−‖s − t‖_1/σ) with σ = 50, dimension d = 5 and generating log-odds ratio η**(x) = 5/(5 + x⊤x). We also use a squared norm regularization with several values of λ and column sampling with m = 100 points. We use the exact value of ℱ**, which we can compute directly for synthetic data. The results are shown in Fig. 3-7, where averaging predictions leads to a better performance than averaging estimators.


Figure 3-7 – Synthetic data for the Laplacian kernel with η**(x) = 5/(5 + x⊤x). Excess prediction performance vs. number of iterations (both in log-scale). Curves: averaging predictions and averaging parameters for λ = 10^{−3.0} and λ = 10^{−3.5}, and no averaging for λ = 10^{−3.0}.

3.7.2 Real data

Note that in the case of real data, one does not have access to the unknown μ**(x), and computing the performance measure in Eq. (3.7.1) is not possible. Instead we use its sampled version on held-out data:

ℱ(η) = −∑_{i: y_i=1} log(μ(x_i)) − ∑_{j: y_j=0} log(1 − μ(x_j)).
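A direct implementation of this held-out measure is given below (a sketch, ours; the clipping of predicted probabilities is an added numerical safeguard).

```python
import numpy as np

def held_out_log_loss(mu_pred, y):
    """-(sum_{i: y_i=1} log mu(x_i) + sum_{j: y_j=0} log(1 - mu(x_j))) on held-out data."""
    mu_pred = np.clip(mu_pred, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return -(np.sum(np.log(mu_pred[y == 1])) + np.sum(np.log(1.0 - mu_pred[y == 0])))
```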

We use two datasets, with d not too large and n large, from [Lichman, 2013]: the “MiniBooNE particle identification” dataset (d = 50, n = 130 064) and the “Covertype” dataset (d = 54, n = 581 012).

We use two different approaches for each of them: a linear model η_θ(x) = θ⊤x and a kernel approach with the Laplacian kernel k(s, t) = exp(−‖s − t‖_1/σ), where σ = d.

The results are shown in Figures 3-8 to 3-11. Note that for linear models we use ℱ*, the estimator of the best performance among linear models (learned on the test set, and hence not reachable by learning on the training data), and for kernels we use ℱ** (same definition as ℱ* but with the kernelized model); that is why the graphs are not comparable (but, as shown below, the value of ℱ** is lower than the value of ℱ* because using kernels corresponds to a larger feature space).

For the “MiniBooNE particle identification” dataset, ℱ* = 0.35 and ℱ** = 0.21; for the “Covertype” dataset, ℱ* = 0.46 and ℱ** = 0.39. We can see from the four plots that, especially in the kernel setting, averaging predictions also shows better performance than averaging parameters.


Figure 3-8 – MiniBooNE dataset, dimension d = 50, linear model. Excess prediction performance vs. number of iterations (both in log-scale). Curves: averaging parameters and averaging predictions for γ = 2^{−8}, and averaging parameters with decaying step-size, γ_0 = 2^{−2}.

Figure 3-9 – MiniBooNE dataset, dimension d = 50, kernel approach, column sampling m = 200. Excess prediction performance vs. number of iterations (both in log-scale). Curves: averaging parameters and averaging predictions for γ = 2^3 and γ = 2^2, and averaging parameters with decaying step-size, γ_0 = 2^2 and γ_0 = 2^1.

3.8 Conclusion

In this chapter, we have explored how averaging procedures in stochastic gradient descent, which are crucial for fast convergence, could be improved by looking at the specifics of probabilistic modeling. Namely, averaging in the moment parameterization can have better properties than averaging in the natural parameterization.

While we have provided some theoretical arguments (asymptotic expansion in the finite-dimensional case, convergence to optimal predictions in the infinite-dimensional case), a detailed theoretical analysis with explicit convergence rates would provide a better understanding of the benefits of averaging predictions.


Figure 3-10 – CoverType dataset, dimension d = 54, linear model. Excess prediction performance vs. number of iterations (both in log-scale). Curves: averaging parameters and averaging predictions for γ = 2^{−4}, and averaging parameters with decaying step-size, γ_0 = 2^0.

Figure 3-11 – CoverType dataset, dimension d = 54, kernel approach, column sampling m = 200. Excess prediction performance vs. number of iterations (both in log-scale). Curves: averaging parameters and averaging predictions for γ = 2^1, and averaging parameters with decaying step-size, γ_0 = 2^3.


3.9 Appendix. Explicit form of B, ¯B, ¯B_w and ¯B_m

In this appendix we provide explicit expressions for the asymptotic expansions from the chapter. All assumptions from Dieuleveut et al. [2017] are reused, namely, beyond the usual sampling assumptions, smoothness of the cost functions.

We have, even for mis-specified models:

ℱ(η) − ℱ(η*) = 𝒢(μ) − 𝒢(μ*)
= E[−E_{p(y_n|x_n)}y_n · η(x_n) + a(η(x_n)) + E_{p(y_n|x_n)}y_n · η*(x_n) − a(η*(x_n))]
= E[a(η(x_n)) − a(η*(x_n)) − a′(η*(x_n))(η(x_n) − η*(x_n))] + E[(a′(η*(x_n)) − E_{p(y_n|x_n)}y_n) · (η(x_n) − η*(x_n))]
= E[D_a(η(x_n)‖η*(x_n))] + E[(μ*(x) − μ**(x)) · (η(x_n) − η*(x_n))]
= E[D_{a*}(μ(x_n)‖μ*(x_n))] + E[(μ*(x) − μ**(x)) · η(x_n)],   (3.9.1)

for D_a the Bregman divergence associated to a, and D_{a*} the one associated to a*. We also use the optimality condition for the predictor η*(x), E[(a′(η*(x)) − E_{p(y|x)}y) η*(x)] = 0, in the last step.

When the model is well-specified, we have a′(η*(x_n)) = E(y_n|x_n) and thus ℱ(η) − ℱ(η*) = E[D_{a*}(μ*(x_n)‖μ(x_n))]. If η is linear in Φ(x), and even if the model is mis-specified, then we also have ℱ(η) − ℱ(η*) = E[D_{a*}(μ*(x_n)‖μ(x_n))].

Using asymptotic expansions of moments of the averaged SGD iterate with zero-mean statistically independent noise f_n(θ) = F(θ) + ε_n(θ) from Dieuleveut and Bach [2016, Theorem 2], one obtains:

๐œƒ๐›พ = E๐œ‹๐›พ (๐œƒ) = ๐œƒ* + ๐›พโˆ† +๐‘‚(๐›พ2), (3.9.2)

E๐œ‹๐›พ (๐œƒ โˆ’ ๐œƒ*)(๐œƒ โˆ’ ๐œƒ*)โŠค = ๐›พ๐ถ +๐‘‚(๐›พ2), (3.9.3)

where๐ถ =

[๐น โ€ฒโ€ฒ(๐œƒ*)โŠ— ๐ผ + ๐ผ โŠ— ๐น โ€ฒโ€ฒ(๐œƒ*)

]โˆ’1ฮฃ.

and ฮฃ =โˆซR๐‘‘

๐œ€(๐œƒ)โŠ—2๐œ‹๐›พ(๐‘‘๐œƒ) โˆˆ R๐‘‘ร—๐‘‘.

The โ€driftโ€ ๐œƒ๐›พ โˆ’ ๐œƒ* is linear in ๐›พ and can be interpreted as an additional error dueto the function is not being quadratic and step sizes are not decaying to zero.

The connection between Δ and C can be easily obtained using θ_n = θ_{n−1} − γ[F′(θ_{n−1}) + ε_n]. Taking expectations on both sides and using a Taylor expansion up to the second order:

F″(θ*)(¯θ_γ − θ*) = −1/2 F‴(θ*)E_{π_γ}(θ − θ*)^{⊗2}  ⇒  F″(θ*)Δ = −1/2 F‴(θ*)C.   (3.9.4)


3.9.1 Estimation without averaging

We start with the simplest estimator of the prediction function, μ_0(x) = a′(Φ(x)⊤θ_n), where we do not use any averaging:

𝒢(μ_n) − 𝒢(μ*) = f(θ_n) − f(θ*) = f′(θ*)(θ_n − θ*) + 1/2 f″(θ*)(θ_n − θ*)^{⊗2} + 1/6 f‴(θ*)(θ_n − θ*)^{⊗3} + O(γ^{3/2}).

Taking the expectation of both sides when n → ∞ and using Eq. (3.9.3), one obtains:

˜A(γ) = E_{π_γ}f(θ_n) − f(θ*) = 1/2 tr f″(θ*)γC + O(γ^{3/2}).

So we have a linear dependence in γ, with B = 1/2 tr f″(θ*)C.

3.9.2 Estimation with averaging parameters

Now, let us estimate A(γ):

𝒢(μ_{¯θ_n}) − 𝒢(μ*) = f(¯θ_n) − f(θ*) = f′(θ*)(¯θ_n − θ*) + 1/2 (¯θ_n − θ*)⊤f″(θ*)(¯θ_n − θ*) + O(γ³).

Taking the expectation of both sides when n → ∞:

𝒢(μ_{¯θ_γ}) − 𝒢(μ*) = f(¯θ_γ) − f(θ*) = 1/2 tr f″(θ*)(¯θ_γ − θ*)^{⊗2} + O(γ³) = 1/2 tr f″(θ*)γ²Δ^{⊗2} + O(γ³).

Finally, we have a quadratic dependence in γ:

A(γ) = 1/2 tr f″(θ*)γ²Δ^{⊗2} + O(γ³),

and the coefficient ¯B = 1/2 tr f″(θ*)Δ^{⊗2}.

3.9.3 Estimation with averaging predictions

Recall that, by definition, ¯A(γ) = 𝒢(¯μ_γ) − 𝒢(μ*), where ¯μ_γ(x) = E_{π_γ}a′(θ⊤Φ(x)). We again use a Taylor expansion of a′(θ⊤Φ(x)) at θ*:

a′(θ⊤Φ(x)) = a′(θ*⊤Φ(x)) + a″(θ*⊤Φ(x))(θ − θ*)⊤Φ(x) + 1/2 a‴(θ*⊤Φ(x)) · ((θ − θ*)⊤Φ(x))² + O(γ^{3/2}).


Taking the expectation of both sides:

¯μ_γ(x) = μ*(x) + a″(θ*⊤Φ(x))(¯θ_γ − θ*)⊤Φ(x) + 1/2 a‴(θ*⊤Φ(x)) tr[Φ(x)Φ(x)⊤ E(θ − θ*)^{⊗2}] + O(γ^{3/2})
= μ*(x) + a″(θ*⊤Φ(x)) γΔ⊤Φ(x) + 1/2 a‴(θ*⊤Φ(x)) tr[Φ(x)Φ(x)⊤ γC] + O(γ^{3/2}).

Finally, we showed that:

¯μ_γ(x) − μ*(x) = O(γ^{3/2}) + γ[a″(η*(x))Δ⊤Φ(x) + 1/2 a‴(η*(x)) tr[Φ(x)^{⊗2}C]].

Now we use the Bregman divergence notation of Eq. (3.9.1):

¯A(γ) = 𝒢(¯μ_γ) − 𝒢(μ*) = 𝒢_1 + 𝒢_2.

As mentioned above, the term 𝒢_2 vanishes if the model is well-specified or if η is linear in Φ(x). Note that in the case of A(γ), the predictor is indeed linear in Φ(x).

Estimation of 𝒢_1.

By definition, D_{a*}(μ*(x)‖μ(x)) = 1/2 (μ*(x) − μ(x))(a*)″(μ*(x))(μ*(x) − μ(x)), and

𝒢_1 = 1/2 E[(μ*(x) − ¯μ_γ(x))² / a″(θ*⊤Φ(x))] = γ²/2 E[(a″(η*(x))Δ⊤Φ(x) + 1/2 a‴(η*(x)) tr[Φ(x)^{⊗2}C])² / a″(θ*⊤Φ(x))].

Since

E_x[a″(θ*⊤Φ(x))(Δ⊤Φ(x))²] = Δ⊤f″(θ*)Δ

and

E_x[Δ⊤a‴(θ*⊤Φ(x))Φ(x)^{⊗3}C] = Δ⊤f‴(θ*)C = −2Δ⊤f″(θ*)Δ,

we get

𝒢_1 = γ²[−1/2 Δ⊤f″(θ*)Δ + 1/8 E[a‴(η*(x))²/a″(η*(x)) · (tr[Φ(x)^{⊗2}C])²]] + O(γ³),

and the coefficient ¯B_w = −1/2 Δ⊤f″(θ*)Δ + 1/8 E[a‴(η*(x))²/a″(η*(x)) · (tr[Φ(x)^{⊗2}C])²].

Estimation of 𝒢_2.

𝒢_2 = E[(μ*(x) − μ**(x)) · (¯η(x_n) − η*(x_n))],

and, using properties of conjugate functions,

𝒢_2 = E[((a*)′(¯μ(x)) − (a*)′(μ*(x))) · (μ*(x) − μ**(x))]
= E[(a*)″(μ*(x))(¯μ(x) − μ*(x)) · (μ*(x) − μ**(x))] + O(γ²)
= E[(¯μ(x) − μ*(x))/a″(η*(x)) · (μ*(x) − μ**(x))] + O(γ²)
= γ · E[(Δ⊤Φ(x) + a‴(η*(x))/(2a″(η*(x))) tr[Φ(x)^{⊗2}C]) · (μ*(x) − μ**(x))] + O(γ²).

And the coefficient ¯B_m = E[(Δ⊤Φ(x) + a‴(η*(x))/(2a″(η*(x))) tr[Φ(x)^{⊗2}C]) · (μ*(x) − μ**(x))].


Chapter 4

Sublinear Primal-Dual Algorithms for

Large-Scale Multiclass Classification

Abstract

In this chapter, we study multiclass classification problems with a potentially large number of classes k, number of features d, and sample size n. Our goal is to develop efficient algorithms to train linear classifiers with losses of a certain type, allowing to address, in particular, multiclass support vector machines (SVM) and softmax regression. The developed algorithms are based on first-order proximal primal-dual schemes – Mirror Descent and Mirror Prox. In particular, for the multiclass SVM with elementwise ℓ1-regularization we propose a sublinear algorithm with the numerical complexity of one iteration being only O(d + n + k). This result relies on the combination of three ideas: (i) passing to the saddle-point problem with a quasi-bilinear objective; (ii) an ad-hoc variance reduction technique based on a non-uniform sampling of matrix multiplication; (iii) the proper choice of the proximal setup in Mirror Descent type schemes, leading to a balance between the stochastic and deterministic error terms.

The material presented in this chapter is a joint work with Dmitrii M. Ostrovskii and Francis Bach. It is submitted to the ICML 2019 conference.

4.1 Introduction

We focus on efficient training of multiclass linear classifiers, potentially with very large numbers of classes and dimensionality of the feature space, and with losses of a certain type. Formally, consider a dataset comprised of n pairs (x_i, y_i), i = 1, ..., n, where x_i ∈ R^d is the feature vector of the i-th training example, and y_i ∈ {e_1, ..., e_k} is the label vector encoding one of the k possible classes, {e_1, ..., e_k} ⊂ R^k being the standard basis vectors (we assume there are no ambiguities in the class assignment). Given such data, we would like to find a linear classifier that minimizes the regularized empirical risk. This can be formalized as the following minimization problem:

min_{U∈R^{d×k}} 1/n ∑_{i=1}^{n} ℓ(U⊤x_i, y_i) + λ‖U‖_U.   (4.1.1)

Here, U ∈ R^{d×k} is the matrix whose columns specify the parameter vectors for each of the k classes; ℓ(U⊤x, y), with ℓ : R^k × R^k → R, is the loss corresponding to the margins U⊤x ∈ R^k assigned to x when its true class happens to be y; finally, λ‖U‖_U, with λ > 0, is the regularization term corresponding to some norm ‖·‖_U on R^{d×k} which will be specified later on (cf. Section 4.2). For now, we only assume that ‖·‖_U is “simple” – intuitively, one can think of ‖U‖_U as being quasi-separable in the elements of U, for example, belonging to the class of row-wise mixed ℓ_p × ℓ_q-norms, cf. Section 4.2.

Our focus is on the so-called Fenchel-Young losses [Blondel et al., 2018], which can be expressed in the form

ℓ(U⊤x, y) = max_{v∈Δ_k} {−f(v, y) + (v − y)⊤U⊤x},   (4.1.2)

where Δ_k is the probability simplex in R^k, and the function f : Δ_k → R is convex and “simple” (i.e., quasi-separable in v), implying that the maximization in (4.1.2) can be performed in running time O(k). In particular, this allows us to encompass the two most commonly used multiclass losses (a small numerical sketch is given after the list):

— the multiclass logistic (or softmax) loss

ℓ(U⊤x, y) = log(∑_{j=1}^{k} exp(U_j⊤x)) − y⊤U⊤x,   (4.1.3)

where U_j is the j-th column of U so that U_j⊤x is the j-th element of U⊤x; this loss corresponds to (4.1.2) with the negative entropy f(v, y) = ∑_{i=1}^{k} v_i log v_i;

— the multiclass hinge loss, given by

ℓ(U⊤x, y) = max_{j∈{1,...,k}} {1[e_j ≠ y] + U_j⊤x} − y⊤U⊤x,   (4.1.4)

and used in multiclass support vector machines (SVM), which reduces to (4.1.2) by setting f(v, y) = v⊤y − 1 (see Appendix 4.7.1 for details).
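The following small sketch (ours, not from the chapter) evaluates both losses and checks numerically that the multiclass hinge loss coincides with the maximum in (4.1.2) for f(v, y) = v⊤y − 1; since the objective is then linear in v, the maximum over the simplex is attained at one of its vertices.

```python
import numpy as np

def softmax_loss(scores, y):
    """Multiclass logistic loss (4.1.3); scores = U^T x, y is a one-hot label vector."""
    return np.log(np.sum(np.exp(scores))) - y @ scores

def multiclass_hinge_loss(scores, y):
    """Multiclass hinge loss (4.1.4): max_j {1[e_j != y] + scores_j} - y^T scores."""
    return np.max((1.0 - y) + scores) - y @ scores

def hinge_via_fenchel_young(scores, y):
    """The same loss through (4.1.2) with f(v, y) = v^T y - 1, maximized over vertices."""
    return max(-(e @ y - 1.0) + (e - y) @ scores for e in np.eye(len(scores)))

scores = np.array([0.2, -1.0, 0.7])
y = np.array([0.0, 0.0, 1.0])
assert np.isclose(multiclass_hinge_loss(scores, y), hinge_via_fenchel_young(scores, y))
print(softmax_loss(scores, y), multiclass_hinge_loss(scores, y))
```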

Additional examples of Fenchel-Young losses are described by Blondel et al. [2018].

Arranging the feature vectors into the design matrix X ∈ R^{n×d}, and the label vectors into the matrix Y ∈ R^{n×k}, and using the Fenchel dual representation (4.1.2) of the loss, we recast the initial minimization problem (4.1.1) as a convex-concave saddle-point problem of the following form:

min_{U∈R^{d×k}} max_{V∈𝒱} −ℱ(V, Y) + 1/n tr[(V − Y)⊤XU] + λ‖U‖_U.   (4.1.5)


Here we denote

ℱ(V, Y) := 1/n ∑_{i=1}^{n} f(v_i, y_i),   (4.1.6)

where v_i, y_i ∈ Δ_k are the i-th rows of V and Y, whereas

𝒱 := [Δ_k^{⊗n}]⊤ ⊂ R^{n×k}   (4.1.7)

is the Cartesian product of probability simplices comprising all matrices in R^{n×k} with rows belonging to Δ_k. Taking into account the availability of the Fenchel dual representation (4.1.2), recasting the convex minimization problem (4.1.1) in the form (4.1.5) is quite natural. Indeed, while the objective in (4.1.1) can be non-smooth, the “non-trivial” part of the objective in (4.1.5), given by

Φ(U, V − Y) := 1/n tr[(V − Y)⊤XU],   (4.1.8)

is necessarily bilinear. On the other hand, the presence of the dual constraints, as given by (4.1.7), also does not seem problematic since 𝒱 allows for a computationally cheap projection oracle. Finally, the primal-dual formulation (4.1.5) allows one to control the duality gap, thus providing online accuracy certificates for the initial problem (see Nemirovski and Onn [2010]).

In this chapter, we develop efficient algorithms for solving (4.1.1) via the associated saddle-point problem (4.1.5), building upon the two basic schemes for saddle-point optimization: Mirror Descent and Mirror Prox (see Juditsky and Nemirovski [2011a,b], Nesterov and Nemirovski [2013] and references therein). When applied to the class of general convex-concave saddle-point problems with available first-order information given by the partial (sub-)gradients of the objective, these basic schemes are well suited for obtaining medium-accuracy solutions – which is not a limitation in the context of empirical risk minimization, where the ultimate goal is to minimize the true (expected) risk.

Moreover, these basic schemes must be able, at least in principle, to capture the specific structure of quasi-bilinear saddle-point problems of the form (4.1.5). First, they rely on general Bregman divergences, rather than the standard Euclidean distance, to measure proximity of the points, which allows one to adjust to the particular non-Euclidean geometry associated to the set 𝒱 and the norm ‖·‖_U. Second, Mirror Descent and Mirror Prox retain their favorable convergence guarantees when the partial gradients are replaced with their unbiased estimates. To see why this circumstance is so important, recall that the computation of the partial gradients ∇_U[Φ(U, V − Y)] and ∇_V[Φ(U, V − Y)], cf. (4.1.8), is reduced to the computation of the matrix products XU and X⊤(V − Y), and thus requires O(dnk) operations. Given the partial gradients, the proximal steps associated to the U and V variables, assuming the “simplicity” of ‖·‖_U in the aforementioned sense, can be computed in O(dk + nk). Thus, in the regime interesting to us, where both d and k can be very large, the computation of these matrix products becomes the main bottleneck. On the other hand, unbiased estimates of these matrix products, with reduced complexity of computation, can be obtained by randomized subsampling of the rows and columns of U, V, Y, and X. While this approach has been extensively studied by Juditsky and Nemirovski [2011b] in the case of bilinear saddle-point problems with vector-valued variables arising in sparse signal recovery, its extension to problems of the type (4.1.5) is non-trivial. In fact, for a sampling scheme to be deemed “good”, it clearly has to satisfy two concurrent requirements:

(a) On one hand, one must control the stochastic variability of the estimates in the chosen sampling scheme. Ideally, the variance – or its suitable analogue in the case of a non-Euclidean geometry – should be comparable to the squared norm of X, whose choice depends on the proximal setup.

(b) On the other hand, the estimates must be cheap to compute, much cheaper than the O(dnk) required for the full gradient computation. While the apparent goal is O(dk + nk) per iteration (the cost of the primal-dual proximal mapping for the full gradients), one could even want to go beyond that, to O(d + n + k) per iteration, possibly paying a larger price once, during pre/post-processing.

Devising a sampling scheme satisfying these requirements is a delicate task. To solve it, one should exploit the geometric structure of (4.1.5) associated to 𝒱 and ‖·‖_U.

Contributions and outline. We identify “good” sampling schemes satisfying the above requirements, rigorously analyze their statistical properties, carefully implement the basic algorithms equipped with these sampling schemes, and analyze the total computational complexity of the resulting algorithms. Namely, we consider the situation where ‖·‖_U is the entrywise ℓ1-norm, which corresponds to the implicit assumption that both the classes and the features are sparse. Moreover, we choose the proximal setup adapted to the geometry of the saddle-point problem (4.1.5), thus achieving favorable accuracy guarantees for both of the basic algorithms without subsampling (cf. Sections 4.2–4.2.1). In this setup, we propose two sampling schemes with various levels of “aggressiveness”, and study their statistical properties, showing that both of the basic primal-dual algorithms preserve their favorable accuracy bounds when combined with the proposed sampling schemes. At the same time, equipping the algorithms with these sampling schemes drastically reduces the total numerical complexity (cf. (b)), with the improvement depending on the sampling scheme:

— The Partial Sampling Scheme, described in Section 4.3.1, is applicable to any problem of the form (4.1.5). Here the main idea is to sample only one row of U and V at a time, with probabilities nearly minimizing the variance proxy of the current estimates of U and V. After careful implementation of the primal-dual proximal step, this leads to the cost O(dk + nk + dn) of one iteration.

— In the more aggressive Full Sampling Scheme, described in Section 4.3.2, sampling of the rows of U and V is augmented with column sampling. Applying this scheme to the multiclass hinge loss (4.1.4), we are able to carry out the updates in only O(d + k + n) arithmetic operations, with additional cost

O(dn + (d + n) · min(k, T))

of pre- and post-processing, T being the number of iterations. Note that these numerical complexity estimates can be justified from the viewpoint of statistical learning theory: according to it, T = O(n) iterations of a stochastic first-order algorithm must be sufficient to extract most of the statistical information contained in the sample, since at each iteration we effectively select a single training example by subsampling a row of V. We see that in the high-dimensional regime d, k ≫ n, this criterion is satisfied since the price of pre/post-processing becomes O(dn).

We conclude the chapter with numerical experiments on synthetic data presented in Section 4.5.

4.1.1 Related work

While we have passed to the saddle-point problem (4.1.5), there is still an option to solve the original problem (4.1.1). The complexity of the deterministic approach for it is O(dnk) and the complexity of the stochastic approach is O(dk), and because of the structure of the problem, the ideas of the Full Sampling Scheme cannot be used. Due to the easy access to noisy gradients, the classical approach is stochastic gradient descent (SGD). Algorithms such as SVRG [Johnson and Zhang, 2013], SDCA [Shalev-Shwartz and Zhang, 2013], SAGA [Defazio et al., 2014], SAG [Schmidt et al., 2017], large-scale linearly convergent algorithms [Palaniappan and Bach, 2016], and Breg-SVRG [Shi et al., 2017] use variance-reduction techniques to obtain accelerated convergence rates, but all have O(dk) or O(dk + nk) complexity. Note, however, that our variance reduction technique is different from all these methods, and none of these approaches is sublinear.

Sublinear algorithms. Formally, we call an algorithm sublinear if the computational complexity of one iteration has a linear (or linear up to a logarithmic factor) dependence on the natural dimensions of the data – such as the sample size, the number of features and the number of classes. For the biclass setting, several results can be found in the literature. Probably the first result, for bilinear problems, was obtained by Grigoriadis and Khachiyan [1995], and was later revisited by Juditsky and Nemirovski [2011b] and Xiao et al. [2017]. A sublinear algorithm for SVM was considered by Hazan et al. [2011], one for semidefinite programs by Garber and Hazan [2011, 2016], and more general results were given by Clarkson et al. [2012]. However, none of these approaches can be easily extended to the multiclass setting without an extra O(k) factor appearing in the numerical complexity bounds.

4.2 Choice of geometry and basic routines

Preliminaries. We focus on the convex-concave saddle-point problem of the form:

min_{U∈R^{d×k}} max_{V∈(Δ_k⊤)^{⊗n}} −ℱ(V, Y) + Φ(U, V − Y) + λ‖U‖_U,   (4.2.1)


with the bilinear part Φ(U, V − Y) of the objective being given by (4.1.8), and the composite term ℱ(V, Y) by (4.1.6). From now on, we make the following mild assumption:

Assumption 1. The ‖·‖_U-radius of some optimal solution U* to (4.2.1) is known:

‖U*‖_U = R_𝒰.

Remark 1. All the algorithms presented later on preserve their accuracy bounds when R_𝒰 is an upper bound on ‖U*‖_U instead of being its exact value. However, in this case they become suboptimal by factors proportional to the looseness of this bound, hence we are interested in this upper bound being as tight as possible. Note that since U = 0 is a feasible solution, in the case of positive losses R_𝒰 ≤ 1/λ, but this bound is usually very loose, since λ must decrease with the number of observations as dictated by statistical learning theory. A better approach is to solve a series of problems, starting with a small radius, and increasing it exponentially until the obtained solution leaves the boundary of the feasible set. This strategy works when the solution set is bounded. Since the complexity bounds that we obtain grow linearly with R_𝒰, this method enjoys the same overall complexity bound, up to a constant factor, as in the case when Assumption 1 is precisely satisfied.

With Assumption 1, we can recast the problem (4.2.1) in the constrained form:

min_{U∈𝒰} max_{V∈𝒱} −ℱ(V, Y) + Φ(U, V − Y) + λ‖U‖_U,   (4.2.2)

where 𝒰 := {U ∈ R^{d×k} : ‖U‖_U ≤ R_𝒰} is a ‖·‖_U-norm ball, and 𝒱 := [Δ_k^{⊗n}]⊤ is the direct product of n probability simplices in R^k. This problem has a very particular structure associated to the sets 𝒰 and 𝒱, and ideally, this structure should be exploited by the optimization algorithm in order to obtain efficiency estimates scaling with the “correct” norm of the optimal primal-dual solution, see, e.g., Juditsky and Nemirovski [2011a], Nesterov and Nemirovski [2013], Beck and Teboulle [2003], Shi et al. [2017]. This issue can be addressed in the framework of proximal algorithms, in which one starts by choosing the norms that capture the inherent geometry of the problem, and then replaces the usual (projected) gradient step with the proximal step, i.e., uses the general Bregman divergence corresponding to some potential (also called distance-generating function) instead of the squared Euclidean distance. The potential must be compatible with the chosen norm in the sense of Juditsky and Nemirovski [2011a] (essentially, grow as the squared norm while being 1-strongly convex with respect to the norm, see Section 4.2.2), and at the same time allow for an efficient implementation of the proximal step. We will discuss the question of choosing the potentials in the next section, when describing Mirror Descent [see e.g. Nemirovsky and Yudin, 1983], the simplest general proximal algorithm applicable to (4.2.2). Before that, let us specify the choice of the norms themselves.

Choice of the norms. A natural strategy for choosing the norm in each variable is by ensuring, whenever possible, that its ball of a certain radius is simply the convex hull of the symmetrization of the feasible set (note that scaling is not important since first-order algorithms are invariant with respect to it), see, e.g., Juditsky and Nemirovski [2011a]. Following this strategy, we must choose ‖·‖_U itself for U, whatever the norm ‖·‖_U is. On the other hand, the norm ‖·‖_U has not yet been defined at this point; correspondingly, the “implicit” geometry of the problem in U has not been specified. For some reasons explained in Section 4.4, we choose the elementwise ℓ1-norm for U:

โ€–๐‘ˆโ€–U = โ€–๐‘ˆโ€–1ร—1 =๐‘‘โˆ‘๐‘–=1

๐‘˜โˆ‘๐‘—=1

|๐‘ˆ(๐‘–, ๐‘—)|, (4.2.3)

where we use the โ€œMatlabโ€ indexing notation, and define the mixed โ„“๐‘ ร— โ„“๐‘ž norms by

โ€–๐ดโ€–๐‘ร—๐‘ž :=

(โˆ‘๐‘–

โ€–๐ด(๐‘–, :)โ€–๐‘๐‘ž

)1/๐‘

, ๐‘, ๐‘ž โ‰ฅ 1, (4.2.4)

i.e., the โ„“๐‘-norm of the vector of โ„“๐‘ž-norms of the individual rows of the matrix. Notethat this choice of โ€– ยท โ€–U favors solutions that are sparse in terms of both featuresand classes.

As for the norm on 𝒱, one could choose the ℓ1-norm in the case n = 1, and the ℓ∞ × ℓ1-norm in the general case (recall that 𝒱 is the direct product of simplices). However, it is known (see Juditsky and Nemirovski [2011a]) that for the ℓ∞-norm there is no compatible potential in the sense of Juditsky and Nemirovski [2011a] (the existence of such a potential would contradict the known worst-case complexity lower bounds [Nemirovsky and Yudin, 1983]). The remedy, leading to near-optimal accuracy estimates, is to replace the ℓ∞-norm with the ℓ2-norm, leading us to the ℓ2 × ℓ1-norm for V:

‖V‖_V = ‖V‖_{2×1} = (∑_{i=1}^{n} ‖V(i, :)‖_1^2)^{1/2}.   (4.2.5)
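Both mixed norms are straightforward to evaluate; a small sketch (ours) is given for reference.

```python
import numpy as np

def mixed_norm(A, p, q):
    """||A||_{p x q}: the l_p-norm of the vector of row-wise l_q-norms, cf. (4.2.4)."""
    return np.linalg.norm(np.linalg.norm(A, ord=q, axis=1), ord=p)

A = np.arange(6.0).reshape(3, 2)
norm_U = mixed_norm(A, 1, 1)       # || . ||_{1x1}, used for U, cf. (4.2.3)
norm_V = mixed_norm(A, 2, 1)       # || . ||_{2x1}, used for V, cf. (4.2.5)
```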

Alternative choices for both norms will be briefly discussed in Section 4.4. However, we will see that this choice of norms, when equipped with the properly chosen potentials, is the only one in a broad class of those using the mixed ℓ_p × ℓ_q norms for which we can achieve the goal stated in Section 4.1, that is, to combine near-optimal complexity estimates with a numerically efficient implementation of the proximal step.

4.2.1 Basic schemes: Mirror Descent and Mirror Prox

Basic schemes in minimization problems. The classical Mirror Descent scheme was introduced by Nemirovsky and Yudin [1983], and generalizes the standard Projected Subgradient Descent to the case of non-Euclidean distance measures. The general presentation of Mirror Descent is described by Beck and Teboulle [2003]; here we only provide a brief exposition. When minimizing a convex function f(u) on a domain 𝒰, the Mirror Descent scheme amounts to choosing a potential φ_𝒰(u) that generalizes the squared Euclidean distance 1/2 ‖u‖_2^2, and a sequence of stepsizes γ_t, starting with the initial point u^0 = arg min_{u∈𝒰} φ_𝒰(u), called the prox-center, and then performing iterations of the form

starting with the initial point ๐‘ข0 = min๐‘ขโˆˆ๐’ฐ ๐œ‘๐’ฐ(๐‘ข) called the prox-center, and thenperforming iterations of the form

๐‘ข๐‘ก+1 = arg min๐‘ขโˆˆ๐’ฐ

โŸจโˆ‡๐‘“(๐‘ข๐‘ก), ๐‘ขโˆ’ ๐‘ข๐‘กโŸฉ+

1

๐›พ๐‘ก๐ท๐œ‘๐’ฐ (๐‘ข, ๐‘ข๐‘ก)

,

where๐ท๐œ‘(๐‘ข, ๐‘ข๐‘ก) = ๐œ‘(๐‘ข)โˆ’ ๐œ‘(๐‘ข๐‘กโˆ’1)โˆ’ โŸจโˆ‡๐œ‘(๐‘ข๐‘ก), ๐‘ขโˆ’ ๐‘ข๐‘กโŸฉ

is the Bregman divergence between the candidate point u and the previous point u^t that corresponds to the chosen potential φ(·) = φ_𝒰(·) and replaces the squared Euclidean distance 1/2 ‖u − u^t‖_2^2. After computing T iterates, the scheme outputs the averaged point ¯u^T as the candidate solution; one can choose the uniform averaging ¯u^T = 1/T ∑_{t=0}^{T−1} u^t or more complex averaging schedules (for example, with the weights proportional to the stepsizes); for simplicity, we will focus on the uniform averaging.
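For intuition, here is a minimal sketch (ours) of this scheme with the negative-entropy potential on the probability simplex, for which the proximal step takes the familiar multiplicative (exponentiated-gradient) form.

```python
import numpy as np

def mirror_descent_simplex(grad, d, T, gamma):
    """Minimize a convex f over the probability simplex from a (sub)gradient oracle.

    With phi(u) = sum_i u_i log u_i, the step
        u^{t+1} = argmin_u { <grad f(u^t), u - u^t> + (1/gamma) D_phi(u, u^t) }
    has the closed form u^{t+1} proportional to u^t * exp(-gamma * grad f(u^t))."""
    u = np.full(d, 1.0 / d)               # prox-center of the entropy on the simplex
    u_bar = np.zeros(d)
    for _ in range(T):
        u_bar += u / T                    # uniform averaging of the iterates u^0 .. u^{T-1}
        w = u * np.exp(-gamma * grad(u))
        u = w / w.sum()
    return u_bar

# usage sketch: minimize f(u) = 0.5 * ||u - c||_2^2 over the simplex
c = np.array([0.7, 0.1, 0.2, 0.0])
u_bar = mirror_descent_simplex(lambda u: u - c, d=4, T=500, gamma=0.1)
```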

The counterpart of the Mirror Descent scheme, called Mirror Prox, was introduced by Nemirovski [2004], and combines the use of Bregman divergences with an extrapolation step, first proposed in the Euclidean setting by Korpelevich [1977]. In the context of minimization problems, the difference is that Mirror Prox iterates as

u^{t+1/2} = arg min_{u∈𝒰} {⟨∇f(u^t), u − u^t⟩ + 1/γ_t D_{φ_𝒰}(u, u^t)},
u^{t+1} = arg min_{u∈𝒰} {⟨∇f(u^{t+1/2}), u − u^{t+1/2}⟩ + 1/γ_t D_{φ_𝒰}(u, u^t)};

in other words, one first performs the proximal step from the current point u^t to obtain the auxiliary point u^{t+1/2}, and then performs the extragradient step from u^t, i.e., the proximal step in which the gradient at u^t is replaced with that at u^{t+1/2}.

Basic schemes for composite saddle-point problems. Both approaches can be extended to composite minimization and, most importantly in our context, to composite saddle-point problems such as (4.2.2). In particular, introducing the combined variable W = (U, V) ∈ 𝒲 [:= 𝒰 × 𝒱], the (Composite) Mirror Descent scheme for the saddle-point problem (4.2.2) first constructs the joint potential φ_𝒲(W) from the initial potentials φ_𝒰 and φ_𝒱, in the way specified later on, and then performs iterations

W^0 = arg min_{W∈𝒲} φ_𝒲(W);
W^{t+1} = arg min_{W∈𝒲} {h(W) + ⟨G(W^t), W⟩ + 1/γ_t D_{φ_𝒲}(W, W^t)},   t ≥ 1,   (4.2.6)

where h(W) = ℱ(V, Y) + λ‖U‖_𝒰 is the combined composite term, cf. (4.1.6), and G(W) is the vector field of the partial gradients of Φ(U, V − Y), cf. (4.1.8):

๐บ(๐‘Š ) := (โˆ‡๐‘ˆฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ ),โˆ’โˆ‡๐‘‰ ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ )) = (๐‘‹โŠค(๐‘‰ โˆ’ ๐‘Œ ),โˆ’๐‘‹๐‘ˆ). (4.2.7)


Accordingly, the (Composite) Mirror Prox scheme for (4.2.2) performs iterations

W^0 = arg min_{W∈𝒲} φ_𝒲(W);
W^{t+1/2} = arg min_{W∈𝒲} {h(W) + ⟨G(W^t), W⟩ + 1/γ_t D_{φ_𝒲}(W, W^t)},
W^{t+1} = arg min_{W∈𝒲} {h(W) + ⟨G(W^{t+1/2}), W⟩ + 1/γ_t D_{φ_𝒲}(W, W^t)},   t ≥ 1.   (4.2.8)

Note that the joint minimization in (4.2.6) and (4.2.8) can be performed separately in U and V as long as the joint potential is a linear combination of φ_𝒰 and φ_𝒱. To specify the accuracy guarantees for both schemes, we need to introduce two objects: the potential differences Ω_𝒰 and Ω_𝒱, and the “cross” Lipschitz constant ℒ_{U,V}. The choice of the joint potential will be given once we give the definitions of these objects.

Differences of potentials. The potential differences (called “omega-radii” in the literature co-authored by A. Nemirovski) are defined by

Ω_𝒰 = max_{U∈𝒰} φ_𝒰(U) − min_{U∈𝒰} φ_𝒰(U),   Ω_𝒱 = max_{V∈𝒱} φ_𝒱(V) − min_{V∈𝒱} φ_𝒱(V);   (4.2.9)

note that all maxima and minima are attained when the potentials are continuous, and 𝒰, 𝒱 are compact. When the potentials φ_𝒰, φ_𝒱 are compatible with the corresponding norms ‖·‖_𝒰, ‖·‖_𝒱 in the sense of Juditsky and Nemirovski [2011a] (see Section 4.2.2), the potential differences grow as the squared radii of 𝒰 and 𝒱 in the corresponding norms, up to factors logarithmic in the dimensions of 𝒰, 𝒱, see, e.g., Nemirovsky and Yudin [1983], Shapiro et al. [2009]; in particular, this holds for the choice of φ_𝒰 and φ_𝒱 discussed later on.

โ€œCrossโ€ Lipschitz constant. Given a smooth convex-concave function ฮฆ(๐‘ˆ, ๐‘‰ ) =ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ ), the โ€œcrossโ€ Lipschitz constant โ„’U ,V of the field

๐บ(๐‘Š ) := (โˆ‡๐‘ˆฮจ(๐‘ˆ, ๐‘‰ ),โˆ’โˆ‡๐‘‰ ฮจ(๐‘ˆ, ๐‘‰ ))

with respect to the pair of norms โ€–ยทโ€–U , โ€–ยทโ€–V is defined as โ„’U ,V = max(โ„’U โ†’V ,โ„’V โ†’U ),where โ„’U โ†’V , โ„’V โ†’U deliver tight inequalities of the form

โ€–โˆ‡๐‘ˆฮจ(๐‘ˆ โ€ฒ, ๐‘‰ )โˆ’โˆ‡๐‘ˆฮจ(๐‘ˆ, ๐‘‰ )โ€–V * 6 โ„’U โ†’V โ€–๐‘ˆ โ€ฒ โˆ’ ๐‘ˆโ€–U ,โ€–โˆ‡๐‘‰ ฮจ(๐‘ˆ, ๐‘‰ โ€ฒ)โˆ’โˆ‡๐‘‰ ฮจ(๐‘ˆ, ๐‘‰ )โ€–U * 6 โ„’V โ†’U โ€–๐‘‰ โ€ฒ โˆ’ ๐‘‰ โ€–V ,

uniformly over ๐’ฐ ร— ๐’ฑ , where โ€– ยท โ€–U * , โ€– ยท โ€–V * are the dual norms to โ€– ยท โ€–U , โ€– ยท โ€–V .For the bilinear function ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ ) given by (4.1.8), โ„’U โ†’V and โ„’V โ†’U are equal,and we can express โ„’U ,V as a norm of the linear operator acting on R๐‘‘ร—๐‘˜ โ†’ R๐‘›ร—๐‘˜

as ๐‘ˆ โ†ฆโ†’ 1๐‘›๐‘‹๐‘ˆ :

โ„’U ,V =1

๐‘›

[โ€–๐‘‹โ€–U โ†’V := sup

โ€–๐‘ˆโ€–U โ‰ค1

โ€–๐‘‹๐‘ˆโ€–V *

]. (4.2.10)


Furthermore, for the chosen norms ‖·‖_U = ‖·‖_{1×1} and ‖·‖_V = ‖·‖_{2×1}, we can express ℒ_{U,V} as a certain mixed norm (cf. (4.2.3)–(4.2.5)), as stated by the following proposition, proved in the Appendix:

Proposition 4.1. The “cross” Lipschitz constant can be expressed as

ℒ_{U,V} = ‖X⊤‖_{∞×2} / n.   (4.2.11)

Constructing the joint potential. Given the partial potentials φ_𝒰(·) and φ_𝒱(·), the natural way to construct the joint potential φ_𝒲 is by taking a weighted average of φ_𝒰(·) and φ_𝒱(·), since this leads to φ_𝒲 being separable in U and V, and allows one to exploit the structure of the partial potentials to obtain accuracy guarantees. In fact, one can show that the simplistic choice φ_𝒲(W) = (φ_𝒰(U) + φ_𝒱(V))/2 leads to an accuracy guarantee proportional to ℒ_{U,V}(Ω_𝒰 + Ω_𝒱), where ℒ_{U,V} is the “cross” Lipschitz constant, and Ω_𝒰, Ω_𝒱 are the potential differences defined above. On the other hand, if Ω_𝒰 and Ω_𝒱 are known, one can consider, following Juditsky and Nemirovski [2011b] and Ostrovskii and Harchaoui [2018], the “balanced” joint potential

φ_𝒲(W) = φ_𝒰(U)/(2Ω_𝒰) + φ_𝒱(V)/(2Ω_𝒱),   (4.2.12)

for which one can achieve the better accuracy guarantee proportional to ℒ_{U,V} √(Ω_𝒰Ω_𝒱), cf. Theorem 4.3 in the Appendix. Moreover, this construction is, in a sense, “robust”: if the ratio of the weights in (4.2.12) is multiplied with a constant factor, the accuracy estimate is preserved up to a constant factor. Besides, note that for the choice of φ_𝒱 considered below Ω_𝒱 is known; for the choice of φ_𝒰 considered below Ω_𝒰 is known when Assumption 1 is satisfied, and is known up to a constant factor when one allows R_𝒰 to be an upper bound for ‖U*‖_U as explained in Remark 1. For all these reasons, we choose the joint potential according to (4.2.12).

4.2.2 Choice of the partial potentials

We now discuss the construction of the partial potentials for the chosen geometry as given by the norms ‖·‖_U, ‖·‖_V, cf. (4.2.3), (4.2.5). The choice of potentials for the alternative geometries is discussed in Section 4.4.

Potential for the dual variable. Since V ∈ (Δ_k⊤)^{⊗n} is the direct product of probability simplices, the natural choice for the dual potential φ_𝒱(·) is the sum of negative entropies [Beck and Teboulle, 2003]: denoting log(·) the natural logarithm,

φ_𝒱(V) = ∑_{i=1}^{n} ∑_{j=1}^{k} V_{ij} log(V_{ij}).   (4.2.13)

This choice reflects the fact that V ∈ 𝒱 corresponds to the marginals of an n-fold product distribution for which the entropy is the sum of the marginal entropies. By Pinsker’s inequality [Kemperman, 1969], φ_𝒱(V) is 1-strongly convex on 𝒱. On the other hand,

Ω_𝒱 = n log k,   (4.2.14)

whereas the squared ‖·‖_V-norm of any feasible solution V to (4.2.2) is precisely n, cf. (4.2.5). Finally, φ_𝒱(V) is clearly continuously differentiable in the interior of 𝒱. As such, we see that the potential (4.2.13) is compatible with 𝒱 and ‖·‖_V, i.e., the triple (𝒱, ‖·‖_V, φ_𝒱(·)) comprises a valid proximal setup in the sense of Juditsky and Nemirovski [2011a], and grows nearly as the squared ‖·‖_V-radius of the feasible set 𝒱. Note that the Bregman divergence corresponding to (4.2.13) is the sum of the Kullback-Leibler divergences between the individual rows:

D_{φ_𝒱}(V, V^t) = ∑_{i=1}^{n} D_KL(V(i, :)‖V^t(i, :)),   (4.2.15)

where D_KL(p, q) := ∑_j p_j log(p_j/q_j) for two discrete measures p, q (not necessarily normalized to one).

Final reduction and the primal potential. Recall that we chose the elementwise norm ‖·‖_U = ‖·‖_{1×1}, and the primal feasible set of the problem (4.2.2) corresponds to the ball with radius R_𝒰 in this norm. In this situation, the standard option is the power potential φ(U) = C_{d,k}‖U‖_p^2, where p = 1 + 1/log(dk), and the constant C_{d,k} can be chosen, depending on d and k, so that φ(·) is 1-strongly convex on the whole space R^{d×k}, see Nesterov and Nemirovski [2013]. This leads to a compatible proximal setup in the sense defined above, and the correct scaling Ω_𝒰 = O(log(dk)R_𝒰^2).

However, it turns out that the specific algebraic structure of this potential does not allow for a “sublinear” numerical complexity in the full sampling scheme which we will present later on in Section 4.3.2. Hence, we advocate an alternative approach: first transform the ℓ1-constraint into the “solid” simplex constraint (borrowing the idea from Juditsky and Nemirovski [2011b]), and then construct a compatible potential for the new problem based on the unnormalized negative entropy, using the fact that R_𝒰 is known, cf. Assumption 1. To this end, let ˆ𝒰 be the “solid” simplex in R^{2d×k}:

ˆ𝒰 := {ˆU ∈ R^{2d×k} : ˆU_{ij} ≥ 0, tr[1⊤_{2d×k} ˆU] ≤ R_𝒰},   (4.2.16)

where 1_{2d×k} ∈ R^{2d×k} is the matrix of all ones, so that tr[1⊤_{2d×k} ˆU] = ∑_{i,j} ˆU_{ij} = ‖ˆU‖_{1×1} whenever ˆU ∈ ˆ𝒰. Consider the following saddle-point problem:

min_{ˆU∈ˆ𝒰} max_{V∈𝒱} −ℱ(V, Y) + ˆΦ(ˆU, V − Y) + λ tr[1⊤_{2d×k} ˆU],   (4.2.17)

where ℱ(V, Y) and 𝒱 are the same as before (cf. (4.1.6)–(4.1.7)), and

ˆΦ(ˆU, V − Y) := 1/n tr[(V − Y)⊤ ˆX ˆU],   where ˆX := [X, −X] ∈ R^{n×2d}.   (4.2.18)


It can be easily verified that the new saddle-point problem (4.2.17) is equivalent to (4.2.2) in the following sense: any ε-accurate (in the sense of the duality gap, cf. Section 4.2.4) solution (ˆU, V) to (4.2.17) with ˆU = [U_1; U_2] provides an ε-accurate solution (U, V) to (4.2.2) by taking U = U_1 − U_2.¹ Clearly, the quantities ℒ_{U,V}, Ω_𝒰, Ω_𝒱 for the new problem remain the same as their counterparts for the problem (4.2.2); the only difference is a slight change of Ω_𝒰 (see below). On the other hand, the primal feasible set of the new problem (4.2.17) – the “solid” simplex (4.2.16) – admits a compatible entropy-type potential. Indeed, consider first the “unit solid” simplex given by (4.2.16) with R_𝒰 = 1. On this set, the unnormalized negative entropy

ℋ(ˆU) = ∑_{i,j} ˆU_{ij} log ˆU_{ij} − ˆU_{ij}   (4.2.19)

is 1-strongly convex (see Yu [2013] for an elementary proof), and is clearly continuously differentiable in its interior; thus, ℋ(·) is a compatible potential on the “unit solid” simplex in the sense of Juditsky and Nemirovski [2011a]. Finally, using Assumption 1, we can construct a compatible (i.e., 1-strongly convex and continuously differentiable in the interior) potential on the initial “solid” simplex (4.2.16) with arbitrary radius by scaling the argument of ℋ(·) and then renormalizing it as follows:

ˆφ_𝒰(ˆU) := R_𝒰^2 · ℋ(ˆU/R_𝒰) = R_𝒰 [∑_{i=1}^{2d} ∑_{j=1}^{k} ˆU_{ij} log(ˆU_{ij}/R_𝒰) − ˆU_{ij}].   (4.2.20)

Note that the potential difference in this case is

Ω_𝒰 = R_𝒰^2 log(2dk),   (4.2.21)

and the corresponding Bregman divergence is given by

D_{ˆφ_𝒰}(ˆU, ˆU^t) = R_𝒰^2 D_KL(ˆU/R_𝒰 ‖ ˆU^t/R_𝒰) − R_𝒰 tr[1⊤_{2d×k} ˆU] + R_𝒰 tr[1⊤_{2d×k} ˆU^t],   (4.2.22)

where the last term does not depend on ˆU.

4.2.3 Recap of the deterministic algorithms

We can now formulate the iterations of the Mirror Descent and Mirror Prox schemes applied to the reformulation (4.2.17) of the saddle-point problem (4.2.2)

1. The idea of this reduction is that for any optimal primal solution ˆU to (4.2.17), only one of the blocks U_1, U_2 is non-zero, and these blocks can then be interpreted as the positive and negative parts of the corresponding optimal solution to (4.2.2).


with the specified choice of geometry. Both of them are initialized with

๐‘‰ 0 =1๐‘›ร—๐‘˜

๐‘˜, ๐‘ˆ0 =

๐‘…๐’ฐ12๐‘‘ร—๐‘˜

2๐‘‘๐‘˜, (Init)

corresponding to the uniform distributions. Mirror Descent then iterates (๐‘ก โ‰ฅ 1):

V^{t+1} = arg min_{V∈𝒱} {ℱ(V, Y) − ˆΦ(ˆU^t, V) + D_{φ_𝒱}(V, V^t)/(2γ_tΩ_𝒱)},
ˆU^{t+1} = arg min_{ˆU∈ˆ𝒰} {λ tr[1⊤_{2d×k} ˆU] + ˆΦ(ˆU, V^t − Y) + D_{ˆφ_𝒰}(ˆU, ˆU^t)/(2γ_tΩ_𝒰)},   (MD)

where ℱ(V, Y), ˆΦ(·, ·), 𝒱, ˆ𝒰, D_{φ_𝒱}(V, V^t), D_{ˆφ_𝒰}(ˆU, ˆU^t), Ω_𝒱, Ω_𝒰 were defined above. On the other hand, Mirror Prox performs the iterations

V^{t+1/2} = arg min_{V∈𝒱} {ℱ(V, Y) − ˆΦ(ˆU^t, V) + D_{φ_𝒱}(V, V^t)/(2γ_tΩ_𝒱)},
ˆU^{t+1/2} = arg min_{ˆU∈ˆ𝒰} {λ tr[1⊤_{2d×k} ˆU] + ˆΦ(ˆU, V^t − Y) + D_{ˆφ_𝒰}(ˆU, ˆU^t)/(2γ_tΩ_𝒰)};
V^{t+1} = arg min_{V∈𝒱} {ℱ(V, Y) − ˆΦ(ˆU^{t+1/2}, V) + D_{φ_𝒱}(V, V^t)/(2γ_tΩ_𝒱)},
ˆU^{t+1} = arg min_{ˆU∈ˆ𝒰} {λ tr[1⊤_{2d×k} ˆU] + ˆΦ(ˆU, V^{t+1/2} − Y) + D_{ˆφ_𝒰}(ˆU, ˆU^t)/(2γ_tΩ_𝒰)}.   (MP)

Complexity of one iteration. Note that the primal updates in both algorithmscan be expressed in closed form: using Lemma 4.1 in Appendix 4.7.3 we can verifythat the primal update in (MD) amounts to

๐‘ˆ ๐‘ก+1๐‘–๐‘— = ๐‘ˆ ๐‘ก

๐‘–๐‘— ยท exp(โˆ’2๐›พ๐‘ก๐‘†๐‘ก๐‘–๐‘—๐‘…๐’ฐ log(2๐‘‘๐‘˜)) ยทmin

exp(โˆ’2๐›พ๐‘ก๐œ†๐‘…๐’ฐ log(2๐‘‘๐‘˜)),

๐‘…๐’ฐ

๐‘€

,

where ๐‘†๐‘ก =1

๐‘›๐‘‹โŠค(๐‘‰ ๐‘ก โˆ’ ๐‘Œ ), and๐‘€ =

2๐‘‘โˆ‘๐‘–=1

๐‘˜โˆ‘๐‘—=1

๐‘ˆ ๐‘ก๐‘–๐‘— ยท exp(โˆ’2๐›พ๐‘ก๐‘†

๐‘ก๐‘–๐‘—๐‘…๐’ฐ log(2๐‘‘๐‘˜)).

(4.2.23)On the other hand, the primal updates depend on the expression for f(๐‘ฃ, ๐‘ฆ) in thedefinition (4.1.2) of Fenchel-Young losses (cf. also (4.1.6)). Crucially, when f(๐‘ฃ, ๐‘ฆ) isseparable in ๐‘ฃ โ€“ which is the case for the Fenchel-Young losses including the multiclasslogistic (4.1.3) and hinge (4.1.4) loss โ€“ we can perform the update for ๐‘‰ as follows:

โ€” pass to the Lagrange dual formulation of the proximal step for ๐‘‰ ;โ€” minimize the Lagrangian for the given value of the multiplier by solving ๐‘‚(๐‘›๐‘˜)

one-dimensional problems exactly (when possible) or to numerical tolerance;โ€” on top of that, find the optimal Lagrange multiplier via one-dimensional search.

91

Page 107: On efficient methods for high-dimensional statistical

See Ostrovskii and Harchaoui [2018, Supp. Mat.] for an illustration of this technique.Note also that in the case of multiclass SVM (cf. (4.1.4)), we have an explicit formula:

๐‘‰ ๐‘ก+1๐‘–๐‘— =

๐‘‰ ๐‘ก๐‘–๐‘— exp

(2๐›พ๐‘ก๐‘„

๐‘ก๐‘–๐‘— log(๐‘˜)

)๐‘˜โˆ‘๐‘™=1

๐‘‰ ๐‘ก๐‘–๐‘™ exp (2๐›พ๐‘ก๐‘„๐‘ก

๐‘–๐‘™ log(๐‘˜))

, where ๐‘„๐‘ก = ๐‘‹ ๐‘ˆ ๐‘กโˆ’1 โˆ’ ๐‘Œ. (4.2.24)

Overall, we see that the computational bottleneck in the deterministic schemes is thematrix-matrix multiplication which costs ๐‘‚(๐‘‘๐‘›๐‘˜) arithmetic operations; the remain-ing part of the combined proximal step takes only ๐‘‚(๐‘‘๐‘˜ + ๐‘›๐‘˜) operations.

We now present accuracy guarantees that follow from the general analysis of theproximal saddle-point optimization schemes (see Appendix 4.7.2) combined with theobtained expressions for โ„’U ,V , ฮฉ๐’ฐ , and ฮฉ๐’ฑ .

4.2.4 Accuracy bounds for the deterministic algorithms

Preliminaries. In what follows, we denote by ๐‘‚(1) generic constants. Recall thatthe accuracy of a candidate solution ( , ๐‘‰ ) to a saddle-point problem

min๐‘ˆโˆˆ๐’ฐ

max๐‘‰ โˆˆ๐’ฑ

๐‘“(๐‘ˆ, ๐‘‰ ), (4.2.25)

assuming continuity of ๐‘“ and compactness of ๐’ฐ and ๐’ฑ , can be quantified via theduality gap

โˆ†๐‘“ ( , ๐‘‰ ) := max๐‘‰ โˆˆ๐’ฑ

๐‘“( , ๐‘‰ )โˆ’min๐‘ˆโˆˆ๐’ฐ

๐‘“(๐‘ˆ, ๐‘‰ ). (4.2.26)

In particular, under Slaterโ€™s conditions (which holds for (4.2.17) in particular), theproblem (4.2.25) possesses an optimal solution ๐‘Š * = (๐‘ˆ*, ๐‘‰ *), called a saddle point,for which it holds ๐‘“(๐‘ˆ*, ๐‘‰ *) = max๐‘‰ โˆˆ๐’ฑ ๐‘“(๐‘ˆ*, ๐‘‰ ) = min๐‘ˆโˆˆ๐’ฐ ๐‘“(๐‘ˆ, ๐‘‰ *) โ€“ that is, ๐‘ˆ*

(resp. ๐‘‰ *) is optimal in the primal problem of minimizing ๐‘“prim(๐‘ˆ) := max๐‘‰ โˆˆ๐’ฑ ๐‘“(๐‘ˆ, ๐‘‰ )in ๐‘ˆ (resp. the dual problem of maximizaing ๐‘“dual(๐‘‰ ) := min๐‘ˆโˆˆ๐’ฐ ๐‘“(๐‘ˆ, ๐‘‰ ) in ๐‘‰ ).Hence, in this case โˆ†๐‘“ ( , ๐‘‰ ) is the sum of the primal and dual accuracies, and boundsfrom above the primal accuracy, which is of main interest in the initial problem (4.1.1).

We can derive the following accuracy guarantee for the composite saddle-pointMirror Descent (MD) applied to the saddle-point problem (4.2.17). 2 To simplifythe exposition, we only consider constant stepsize and simple averaging of the iterates;empirically we observe similar accuracy for the sample sizes descreasing as โˆ 1/

โˆš๐‘ก.

2. Surprisingly, we could not find accuracy guarantees for the composite-objective variants ofMirror Descent and Mirror Prox when applied to saddle-point problems. Note that the resultsof Duchi et al. [2010] only hold for Mirror Descent in a composite minimization problem, and thoseof Nesterov and Nemirovski [2013] use a different (in fact, more robust) formulation of the algorithms.

92

Page 108: On efficient methods for high-dimensional statistical

Theorem 4.1. Let (๐‘‡ , ๐‘‰ ๐‘‡ ) = 1๐‘‡

โˆ‘๐‘‡โˆ’1๐‘ก=0 (๐‘ˆ ๐‘ก, ๐‘‰ ๐‘ก) be the average of the first ๐‘‡ iterates

of Mirror Descent (MD) with initialization (Init) and constant stepsize

๐›พ๐‘ก โ‰ก1

โ„’U ,V

โˆš5๐‘‡ฮฉ๐’ฐฮฉ๐’ฑ

with the values of โ„’U ,V , ฮฉ๐’ฑ , ฮฉ๐’ฐ given by (4.2.11), (4.2.14), (4.2.21). Then it holds

โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ )

[6

2โˆš

5โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑโˆš

๐‘‡+๐น (๐‘‰ 0, ๐‘Œ )โˆ’min๐‘‰ โˆˆ๐’ฑ ๐น (๐‘‰, ๐‘Œ )

๐‘‡

]

62โˆš

5โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

ยท log(2๐‘‘๐‘˜)โ€–๐‘ˆ*โ€–1ร—1โˆš๐‘‡

+r

๐‘‡,

(4.2.27)

where r = max๐‘ฆโˆˆฮ”๐‘˜f(1/๐‘˜, ๐‘ฆ)โˆ’min๐‘ฃโˆˆฮ”๐‘˜

f(๐‘ฃ, ๐‘ฆ). Moreover, for the multiclass hingeloss (4.1.4) the ๐‘‚(1/๐‘‡ ) term vanishes from the brackets, and we can put r = 0.

Proof. The bracketed bound in (4.2.27) follows from the general accuracy boundfor the composite saddle-point Mirror Descent scheme (Theorem 4.3 in Appendix),

combined with the observation that the primal โ€œsimpleโ€ term ๐œ†tr[1โŠค2๐‘‘ร—๐‘˜

๐‘ˆ ] of the ob-jective is linear, whence the corresponding error term is not present. In the caseof the hinge loss (4.1.4), the same holds for the primal โ€œsimpleโ€ term โ„ฑ(๐‘‰, ๐‘Œ ),hence the final remark in the claim. To obtain the right-hand side of (4.2.27), weuse (4.2.11), (4.2.14), (4.2.21), and bound ๐น (๐‘‰ 0, ๐‘Œ )โˆ’min๐‘‰ โˆˆ๐’ฑ ๐น (๐‘‰, ๐‘Œ ) from above.

This result merits some discussion.

Remark 2. Denoting ๐‘‹(:, ๐‘—) โˆˆ R๐‘› the ๐‘—-th column of ๐‘‹ for 1 โ‰ค ๐‘— โ‰ค ๐‘‘, we have

โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

= max๐‘—6๐‘‘

โˆšโ€–๐‘‹(:, ๐‘—)โ€–22

๐‘›. (4.2.28)

Note that ๐‘‹(:, ๐‘—) represents the empirical distribution of the ๐‘—-th feature ๐œ™๐‘—. Hence,for i.i.d. data, โ€–๐‘‹โŠคโ€–โˆž,2/

โˆš๐‘› almost surely converges to the largest ๐ฟ2-norm of an

individual feature:โ€–๐‘‹โŠคโ€–โˆžร—2โˆš

๐‘›

a.s.โˆ’โ†’ max๐‘—6๐‘‘

(E๐œ™2๐‘—)

1/2, (4.2.29)

In the non-asymptotic regime, (4.2.28) is the largest empirical ๐ฟ2-moment of an indi-vidual feature, and can be controlled if the features are bounded, or, more generally, ifthey are sufficiently light-tailed, via standard concentration inequalities. In particular,

โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

6 ๐ต

if the features are uniformly bounded by ๐ต. Also, using the standard ๐œ’2-bound,

93

Page 109: On efficient methods for high-dimensional statistical

see Laurent and Massart [2000, Lemma 1], with probability at least 1โˆ’ ๐›ฟ it holds

โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

6 ๐‘‚(1)๐œŽ

(1 +

โˆšlog(๐‘‘/๐›ฟ)

๐‘›

)(4.2.30)

if the features are zero-mean and Gaussian with variances uniformly bounded by ๐œŽ2.

Remark 3. The accuracy bound in Theorem 4.1 includes the remainder term r/๐‘‡ dueto the presence of โ„ฑ(๐‘‰, ๐‘Œ ). Note that r does not depend on ๐‘‘; moreover, we can expectthat r = ๐‘‚(log(๐‘˜)) in the case when f(๐‘ฃ, ๐‘ฆ) corresponds to some notion of entropyon โˆ†๐‘˜ as is the case, in particular, for the multiclass logistic loss (4.1.3). Finally, asstated in the theorem, one can set r = 0 for the multiclass hinge loss (4.1.4). Thus,overall we see that the additional term is relatively small, and can be neglected.

Remark 4. Finally, note that the ๐‘‚(1/โˆš๐‘‡ ) can be improved to the ๐‘‚(1/๐‘‡ ) one if

instead of Mirror Descent we use Mirror Prox. While here we do not state the preciseaccuracy bound for the โ€œdirectโ€ version of the algorithm (MP), such results are knownfor the more flexible โ€œepigraphโ€ version, in which the โ€œsimpleโ€ terms are moved into theconstraints (which allows to address a more general class of semi-separable problems),see, e.g., He et al. [2015]. Also, in the case of multiclass hinge loss (4.1.4), bothโ€œsimpleโ€ terms are linear, and can be formally absorbed into ฮจ(๐‘ˆ, ๐‘‰ ) = ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ )without changing โ„’U ,V . This would result in the bound

โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ ) 6

๐‘‚(1)โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑ

๐‘‡6๐‘‚๐‘‘,๐‘˜(1)โ€–๐‘ˆ*โ€–1ร—1

๐‘‡ยท โ€–๐‘‹

โŠคโ€–โˆžร—2โˆš๐‘›

, (4.2.31)

where ๐‘‚๐‘‘,๐‘˜(1) is a logarithmic factor in ๐‘‘ and ๐‘˜, see Juditsky and Nemirovski [2011b].While the ๐‘‚(1/๐‘‡ ) rate is not preserved for Mirror Prox with stochastic oracle, thisresult is still useful if we are willing to use the mini-batching technique.

4.3 Sampling schemes

As prevously noted, the main drawback of the deterministic approach is the highnumerical complexity ๐‘‚(๐‘‘๐‘›๐‘˜) of operations due to the cost of matrix multiplications

when computing the partial gradients ๐‘‹ ๐‘ˆ and ๐‘‹โŠค(๐‘‰ โˆ’ ๐‘Œ ). Following Juditsky andNemirovski [2011b], the natural approach to accelerate the computation of thesematrix products is by sampling the matrices ๐‘ˆ and ๐‘‰ โˆ’ ๐‘Œ . Denoting ๐œ‰๐‘ˆ and ๐œ‚๐‘‰,๐‘Œunbiased estimates of the matrix products ๐‘‹ ๐‘ˆ and ๐‘‹โŠค(๐‘‰ โˆ’๐‘Œ ) obtained in this way,we arrive at the following version of Stochastic Mirror Descent scheme (cf. (MD)):

๐‘‰ ๐‘ก+1 = arg min๐‘‰ โˆˆ๐’ฑ

โ„ฑ(๐‘‰, ๐‘Œ )โˆ’ 1

๐‘›tr[๐œ‰โŠค๐‘ˆ๐‘ก๐‘‰

]+๐ท๐œ‘๐’ฑ (๐‘‰, ๐‘‰ ๐‘ก)

2๐›พ๐‘กฮฉ๐’ฑ

,

๐‘ˆ ๐‘ก+1 = arg min๐‘ˆโˆˆ๐’ฐ๐œ† tr[1โŠค

2๐‘‘ร—๐‘˜๐‘ˆ ] +

1

๐‘›tr[๐œ‚๐‘‰ ๐‘ก,๐‘Œ

๐‘ˆ]+๐ท๐œ‘ ๐’ฐ (๐‘ˆ, ๐‘ˆ ๐‘ก)

2๐›พ๐‘กฮฉ๐’ฐ,

(S-MD)

94

Page 110: On efficient methods for high-dimensional statistical

and its counterpart in the case of Mirror Prox, which can be formulated analogously.The immediate question is how to obtain ๐œ‰๐‘ˆ and ๐œ‚๐‘‰,๐‘Œ . For example, we might

sample one row of ๐‘ˆ and ๐‘‰ โˆ’ ๐‘Œ at a time, obtaining unbiased estimates of the truepartial gradients that can be computed in just ๐‘‚(๐‘‘๐‘˜ + ๐‘›๐‘˜) operations โ€“ comparablewith the remaining volume of computations for proximal mapping itself. In fact, onecan try to go even further, sampling a single element at a time, and using explicitexpressions for the updates available in the case of SVM, cf. (4.2.23)โ€“(4.2.24), allowingto reduce the complexity even further โ€“ to ๐‘‚(๐‘‘+๐‘›+๐‘˜) as we show below. Yet, thereis a price to pay: the gradient oracle becomes stochastic, and the accuracy boundsshould be augmented with an extra term that reflects stochastic variability of thegradients. In fact, the effect of the gradient noise on the accuracy bounds for bothschemes is known: we get the extra term

๐‘‚

โŽ›โŽโˆš

ฮฉ๐’ฐ๐œŽ2๐’ฑ

+ ฮฉ๐’ฑ๐œŽ2๐’ฐโˆš

๐‘‡

โŽžโŽ  , (4.3.1)

where ๐œŽ2๐’ฐ and ๐œŽ2๐’ฑ are โ€œvariance proxiesโ€ โ€“ expected squared dual norms of the noise in

the gradients (see Juditsky and Nemirovski [2011b, Sec. 2.5.1]) and Appendix 4.7.2):

๐œŽ2๐’ฐ =1

๐‘›2sup๐‘ˆโˆˆ๐’ฐ E

[ ๐‘‹ ๐‘ˆ โˆ’ ๐œ‰๐‘ˆ 2V *

], ๐œŽ2

๐’ฑ =1

๐‘›2sup

(๐‘‰,๐‘Œ )โˆˆ๐’ฑร—๐’ฑE[ ๐‘‹โŠค(๐‘‰ โˆ’ ๐‘Œ )โˆ’ ๐œ‚๐‘‰,๐‘Œ

2U *

],

(4.3.2)In this section, we consider two sampling schemes for the estimates ๐œ‰๐‘ˆ and ๐œ‚๐‘‰,๐‘Œ : thepartial scheme, in which the estimates are obtained by sampling (non-uniformly) therows of ๐‘ˆ and ๐‘‰ โˆ’ ๐‘Œ ; and the full scheme, in which one also samples the columnsof these matrices. In both cases, we derive the probabilities that nearly minimizethe variance proxies. As it turns out, for the choice of the norms and potentialsdone in Section 4.2, and under weak assumptions on the feature distribution, thesechoices of probabilities force the additional term (4.3.1) to be of the same order ofmagnitude as the accuracy bound in Theorem 4.1 for the deterministic Mirror Descent(cf. Theorems 4.2โ€“1 below). Due to the reduced cost of iterations, this leads to thedrastic reductions in the overall numerical complexity.

4.3.1 Partial sampling

In the partial sampling scheme, we draw the rows of ๐‘ˆ and ๐‘‰ with non-uniformprobabilities. In other words, we choose a pair of distributions ๐‘ = (๐‘1, . . . , ๐‘2๐‘‘) โˆˆ โˆ†2๐‘‘

and ๐‘ž = (๐‘ž1, . . . , ๐‘ž๐‘›) โˆˆ โˆ†๐‘›, and sample ๐œ‰๐’ฐ = ๐œ‰๐‘ˆ(๐‘) and ๐œ‚๐’ฑ,๐‘Œ = ๐œ‚๐‘‰,๐‘Œ (๐‘ž) according to

๐œ‰๐‘ˆ(๐‘) = ๐‘‹๐‘’๐‘–๐‘’โŠค๐‘–

๐‘๐‘–๐‘ˆ, ๐œ‚๐‘‰,๐‘Œ (๐‘ž) = ๐‘‹โŠค ๐‘’๐‘—๐‘’

โŠค๐‘—

๐‘ž๐‘—(๐‘‰ โˆ’ ๐‘Œ ) (Part-SS)

where ๐‘’๐‘– โˆˆ R2๐‘‘ and ๐‘’๐‘— โˆˆ R๐‘› are the standard basis vectors, and ๐‘– โˆˆ [2๐‘‘] := 1, ..., 2๐‘‘and ๐‘— โˆˆ [๐‘›] have distributions ๐‘ and ๐‘ž correspondingly (clearly, this guarantees un-

95

Page 111: On efficient methods for high-dimensional statistical

biasedness). Thus, in this scheme we sample the features and the training examples,but not the classes. The main challenge is to correctly choose the distributions ๐‘, ๐‘ž.This can be done in a data-dependent manner to approximately minimize the vari-ance proxies ๐œŽ2๐’ฐ , ๐œŽ2

๐’ฑ defined in (4.3.2). Note that minimization of the variance prox-ies can only be done explicitly in the Euclidean case, when โ€– ยท โ€–U * , โ€– ยท โ€–V * are theFrobenius norms. Instead, we minimize the expected squared norm of the gradientestimate itself (corresponding to the second moment in the Euclidean case), that is,

choose ๐‘* = ๐‘*( ๐‘‹, ๐‘ˆ) and ๐‘ž* = ๐‘ž*( ๐‘‹,๐‘‰, ๐‘Œ ) according to

๐‘*( ๐‘‹, ๐‘ˆ) โˆˆ arg min๐‘โˆˆฮ”2๐‘‘

E[โ€–๐œ‰๐‘ˆ(๐‘)โ€–2V *

],

๐‘ž*( ๐‘‹,๐‘‰, ๐‘Œ ) โˆˆ arg min๐‘žโˆˆฮ”๐‘›

E[โ€–๐œ‚๐‘‰,๐‘Œ (๐‘ž)โ€–2U *

].

(4.3.3)

As we show next in Proposition 4.2, these minimization problems can indeed be solvedexplicitly. On the other hand, we can then easily bound the variance proxies for ๐‘*

and ๐‘ž* as well via the triangle inequality. The approach is justified by the observationthat these bounds result in the stochastic term (4.3.1) of the same order as the termsalready present in the accuracy bound of Theorem 4.1 (cf. Theorem 4.2).

Remark 5. For the stochastic Mirror Descent it is sufficient to control the secondmoment proxies directly, and passing to the variance proxies only deteriorates theconstants. Nonetheless, we choose to provide bounds for the variance proxies: suchbounds are needed in the analysis of Stochastic Mirror Prox.

The next result gives the optimal sampling distributions (4.3.3) and upper boundsfor their variance proxies.

Proposition 4.2. Consider the choice of norms โ€– ยท โ€–U = โ€– ยท โ€–1ร—1, โ€– ยท โ€–V = โ€– ยท โ€–2ร—1.

Then, optimal solutions ๐‘* = ๐‘*( ๐‘‹, ๐‘ˆ) and ๐‘ž* = ๐‘ž*( ๐‘‹,๐‘‰, ๐‘Œ ) to (4.3.3) are given by

๐‘*๐‘– =โ€– ๐‘‹(:, ๐‘–)โ€–2 ยท โ€–๐‘ˆ(๐‘–, :)โ€–โˆžโˆ‘2๐‘‘๐šค=1 โ€– ๐‘‹(:, ๐šค)โ€–2 ยท โ€–๐‘ˆ(๐šค, :)โ€–โˆž

, ๐‘ž*๐‘— =โ€– ๐‘‹(๐‘—, :)โ€–โˆž ยท โ€–๐‘‰ (๐‘—, :)โˆ’ ๐‘Œ (๐‘—, :)โ€–โˆžโˆ‘๐‘›๐šฅ=1 โ€– ๐‘‹(๐šฅ, :)โ€–โˆž ยท โ€–๐‘‰ (๐šฅ, :)โˆ’ ๐‘Œ (๐šฅ, :)โ€–โˆž

.

(4.3.4)Moreover, their respective variance proxies satisfy the bounds

๐œŽ2๐’ฐ(๐‘*) 64๐‘…2

๐’ฐโ€–๐‘‹โŠคโ€–2โˆžร—2

๐‘›2, ๐œŽ2

๐’ฑ(๐‘ž*) 68โ€–๐‘‹โŠคโ€–2โˆžร—2

๐‘›+

8โ€–๐‘‹โ€–21ร—โˆž

๐‘›2. (4.3.5)

Proof. See Appendix 4.7.5 for the proof of a generalized result with mixed norms.

Combining Proposition 4.2 with the general accuracy bound for Stochastic MirrorDescent (see Theorem 4.4 in Appendix 4.7.2), we arrive at the following result:

Theorem 4.2. Let (๐‘‡ , ๐‘‰ ๐‘‡ ) = 1๐‘‡

โˆ‘๐‘‡โˆ’1๐‘ก=0 (๐‘ˆ ๐‘ก, ๐‘‰ ๐‘ก) be the average of the first ๐‘‡ iter-

ates of Stochastic Mirror Descent (S-MD) with initialization (Init), equipped with

96

Page 112: On efficient methods for high-dimensional statistical

sampling scheme (Part-SS) with distributions given by (4.3.4), and constant stepsize

๐›พ๐‘ก โ‰ก1โˆš๐‘‡

min

โŽงโŽจโŽฉ 1โˆš10โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑ

,1

โˆš2โˆš

ฮฉ๐’ฐ 2๐’ฑ + ฮฉ๐’ฑ

2๐’ฐ

โŽซโŽฌโŽญ , (4.3.6)

with the values of โ„’U ,V , ฮฉ๐’ฑ , ฮฉ๐’ฐ given by (4.2.11), (4.2.14), (4.2.21), and theupper bounds 2๐’ฐ , 2

๐’ฑ on the variance proxies given by (4.3.5). Then it holds

E[โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ )][

62โˆš

10โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑโˆš

๐‘‡+

2โˆš

2โˆš

ฮฉ๐’ฐ 2๐’ฑ + ฮฉ๐’ฑ 2

๐’ฐโˆš๐‘‡

+๐น (๐‘‰ 0, ๐‘Œ )โˆ’min๐‘‰ โˆˆ๐’ฑ ๐น (๐‘‰, ๐‘Œ )

๐‘‡

]

6

((4โˆš

6 + 2โˆš

10)โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

+8โ€–๐‘‹โ€–1ร—โˆž

๐‘›

)ยท log(2๐‘‘๐‘˜)โ€–๐‘ˆ*โ€–1ร—1โˆš

๐‘‡+

r

๐‘‡, (4.3.7)

where E[ยท] is the expectation over the randomness of the algorithm, and r is thesame as in Theorem 4.1.

Discussion. The main message of Theorem 4.2 (cf. Theorem 4.1) is as follows: forthe chosen geometry of โ€– ยท โ€–U and potentials, partial sampling scheme (Part-SS)does not result in any growth of computational complexity, in terms of the numberof iterations to guarantee a given value of the duality gap, as long as โ€–๐‘‹โ€–1ร—โˆž/๐‘› doesnot dominate โ€–๐‘‹โŠคโ€–โˆžร—2/

โˆš๐‘›. This is a reasonable assumption as soon as the data has

a light-tailed distribution. Indeed, recall that โ€–๐‘‹โŠคโ€–โˆžร—2/โˆš๐‘› is the largest ๐ฟ2-norm of

an individual feature ๐œ™๐‘— (cf. Remark 2); on the other hand, โ€–๐‘‹โ€–1ร—โˆž/๐‘› is the sampleaverage of max๐‘—โˆˆ[๐‘‘] |๐œ™๐‘—|, and thus has the finite limit Emax๐‘—6๐‘‘ |๐œ™๐‘—| when ๐‘› โ†’ โˆž.

If ๐œ™๐‘— are subgaussian, Emax๐‘—6๐‘‘ |๐œ™๐‘—| 6 ๐‘‚๐‘‘(1) max๐‘—6๐‘‘E|๐œ™๐‘—| 6 ๐‘‚๐‘‘(1) max๐‘—6๐‘‘(E๐œ™2๐‘—)

1/2,whence (cf. (4.2.29)):

lim๐‘›โ†’โˆž

โ€–๐‘‹โ€–1ร—โˆž

๐‘›6 ๐‘‚๐‘‘(1) lim

๐‘›โ†’โˆž

โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

.

Similar observations can be made in the finite-sample regime. In particular, bothterms admit the same bound in terms of the uniform a.s. bound on the features. Also,if ๐œ™๐‘— โˆผ ๐’ฉ (0, ๐œŽ2

๐‘— ) with ๐œŽ๐‘— โ‰ค ๐œŽ for any ๐‘— โˆˆ [๐‘‘], then with probability at least 1โˆ’ ๐›ฟ,

โ€–๐‘‹โ€–1ร—โˆž

๐‘›6 ๐œŽ

โˆšlog(๐‘‘๐‘›/๐›ฟ)

with a similar bound for โ€–๐‘‹โŠคโ€–โˆžร—2/โˆš๐‘›, cf. (4.2.30).

Computational complexity. Computation of ๐œ‰๐‘ˆ(๐‘*) and ๐œ‚๐‘‰,๐‘Œ (๐‘ž*) requires ๐‘‚(๐‘‘๐‘›+๐‘‘๐‘˜+๐‘›๐‘˜) operations; note that this includes computing the optimal sampling probabil-

97

Page 113: On efficient methods for high-dimensional statistical

ities (4.3.4). The combined complexity of the proximal steps, given these estimates,is ๐‘‚(๐‘‘๐‘˜ + ๐‘›๐‘˜); in particular, in the case of SVM we have explicit formulae both forthe primal and dual updates, and in the general case the dual updates are reduced tothe root search on top of an explicit ๐‘‚(๐‘›๐‘˜) computation โ€“ cf. (4.2.23)โ€“(4.2.24) andthe accompanying discussion.

4.3.2 Full sampling

In the full sampling scheme, the sampling of the rows of ๐‘ˆ and ๐‘‰ โˆ’๐‘Œ is augmentedwith the subsequent column sampling. This can be formalized (see Figure 4-1) as

๐œ‰๐‘ˆ(๐‘, ๐‘ƒ ) = ๐‘‹๐‘’๐‘–๐‘’โŠค๐‘–

๐‘๐‘–๐‘ˆ ๐‘’๐‘™๐‘’โŠค๐‘™๐‘ƒ๐‘–๐‘™

, ๐œ‚๐‘‰,๐‘Œ (๐‘ž,๐‘„) = ๐‘‹โŠค ๐‘’๐‘—๐‘’โŠค๐‘—

๐‘ž๐‘—(๐‘‰ โˆ’ ๐‘Œ )

๐‘’๐‘™๐‘’โŠค๐‘™

๐‘„๐‘—๐‘™

, (Full-SS)

where the indices ๐‘– โˆˆ [2๐‘‘] and ๐‘— โˆˆ [๐‘›] are sampled with distributions ๐‘ โˆˆ โˆ†2๐‘‘ and ๐‘ž โˆˆโˆ†๐‘› as before, and the rows of the stochastic matrices ๐‘ƒ โˆˆ (โˆ†โŠค

๐‘˜ )โŠ—2๐‘‘ and ๐‘„ โˆˆ (โˆ†โŠค๐‘˜ )โŠ—๐‘›

specify the conditional sampling distribution of the class ๐‘™ โˆˆ [๐‘˜], provided that wedrew the ๐‘–-th feature (in the primal) and ๐‘—-th example (in the dual). Clearly, thisgives us unbiased estimates of the matrix products.

Next we derive the optimal sampling distributions and bound their variance prox-ies (see Appendix 4.7.6 for the proof).

Proposition 4.3. Consider the choice of norms โ€– ยท โ€–U = โ€– ยท โ€–1ร—1, โ€– ยท โ€–V = โ€– ยท โ€–2ร—1.Optimal solutions (๐‘*, ๐‘ƒ *), (๐‘ž*, ๐‘„*) to the optimization problems

min๐‘โˆˆฮ”2๐‘‘,๐‘ƒโˆˆ(ฮ”โŠค

๐‘˜ )โŠ—2๐‘‘Eโ€–๐œ‰๐‘ˆ(๐‘, ๐‘ƒ )โ€–2V * , min

๐‘žโˆˆฮ”๐‘›,๐‘„โˆˆ(ฮ”โŠค๐‘˜ )โŠ—๐‘›

Eโ€–๐œ‚๐‘‰,๐‘Œ (๐‘ž,๐‘„)โ€–2U *

are given by

๐‘*๐‘– =โ€– ๐‘‹(:, ๐‘–)โ€–2 ยท โ€–๐‘ˆ(๐‘–, :)โ€–1โˆ‘2๐‘‘๐šค=1 โ€– ๐‘‹(:, ๐šค)โ€–2 ยท โ€–๐‘ˆ(๐šค, :)โ€–1

, ๐‘ƒ *๐‘–๐‘™ =

|๐‘ˆ๐‘–๐‘™|โ€–๐‘ˆ(๐‘–, :)โ€–1

,

๐‘ž*๐‘— =โ€– ๐‘‹(๐‘—, :)โ€–โˆž ยท โ€–๐‘‰ (๐‘—, :)โˆ’ ๐‘Œ (๐‘—, :)โ€–1โˆ‘๐‘›๐šฅ=1 โ€– ๐‘‹(๐šฅ, :)โ€–โˆž ยท โ€–๐‘‰ (๐šฅ, :)โˆ’ ๐‘Œ (๐šฅ, :)โ€–1

, ๐‘„*๐‘—๐‘™ =

|๐‘‰๐‘—๐‘™ โˆ’ ๐‘Œ๐‘—๐‘™|โ€–๐‘‰ (๐‘—, :)โˆ’ ๐‘Œ (๐‘—, :)โ€–1

.

(4.3.8)

The variance proxies ๐œŽ2๐’ฐ(๐‘*, ๐‘ƒ *), ๐œŽ2๐’ฑ(๐‘ž*, ๐‘„*) admit the same upper bounds as in (4.3.5).

Combining Proposition 4.3 with the general bound used in Theorem 4.2, we obtain

Corollary 1. The accuracy bound (4.3.7) of Theorem 4.2 remains true when wereplace the sampling scheme (Part-SS) with the scheme (Full-SS) with parameterschosen according to (4.3.8).

Computational complexity. In the next section, we describe efficient implemen-tation for the multiclass hinge loss (4.1.4) which has ๐‘‚(๐‘‘+ ๐‘˜ + ๐‘›) complexity of one

98

Page 114: On efficient methods for high-dimensional statistical

๐‘™ โˆˆ [๐‘˜]โŽ›โŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽ...

โŽžโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽ โŸ โž

๐œ‰๐‘ˆ โˆˆ R๐‘›ร—๐‘˜

=

๐‘– โˆˆ [2๐‘‘]โŽ›โŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽ...

โŽžโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽ โŸ โž ๐‘‹ โˆˆ R๐‘›ร—2๐‘‘

ร—

โŽ›โŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽ. . . ๐‘™ . . .

โŽžโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽ โŸ โž ๐‘ˆ โˆˆ R2๐‘‘ร—๐‘˜

๐‘–

๐‘™ โˆˆ [๐‘˜]โŽ›โŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽ...

โŽžโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽ โŸ โž

๐œ‚๐‘‰,๐‘Œ โˆˆ R2๐‘‘ร—๐‘˜

=

๐‘— โˆˆ [๐‘›]โŽ›โŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽ...

โŽžโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽ โŸ โž ๐‘‹โŠค โˆˆ R2๐‘‘ร—๐‘›

ร—

โŽ›โŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽœโŽ. . . ๐‘™ . . .

โŽžโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽŸโŽ โŸ โž ๐‘‰ โˆ’ ๐‘Œ โˆˆ R๐‘›ร—๐‘˜

๐‘—

Figure 4-1 โ€“ Depiction of the Full Sampling Scheme (Full-SS).

99

Page 115: On efficient methods for high-dimensional statistical

iteration, plus pre- and postprocessing of

๐‘‚(๐‘‘๐‘›+ (๐‘‘+ ๐‘›) ยทmin(๐‘˜, ๐‘‡ ))

arithmetic operations (a.o.โ€™s). While the detailed analysis is deferred to the nextsection, let us briefly present the main ideas on which it relies.

โ€” For the optimal sampling distributions ๐‘* and ๐‘ž*, we need to compute thenorms ๐œŽ๐‘– = โ€– ๐‘‹(:, ๐‘–)โ€–2 and ๐œ๐‘— = โ€– ๐‘‹(๐‘—, :)โ€–โˆž for all ๐‘– โˆˆ [2๐‘‘] and ๐‘— โˆˆ [๐‘›]. Thisrequires ๐‘‚(๐‘‘๐‘›) iterations but only once, during preprocessing.

โ€” We also need to calculate ๐‘‚(๐‘‘) quantities ๐œ‹๐‘– = โ€–๐‘ˆ(๐‘–, :)โ€–1 and ๐‘‚(๐‘›) quanti-ties ๐œŒ๐‘— = โ€–๐‘‰ (๐‘—, :) โˆ’ ๐‘Œ (๐‘—, :)โ€–1. While the direct calculation of each of themtakes ๐‘‚(๐‘˜) a.o.โ€™s, we can update them dynamically in ๐‘‚(1) in the case of thehinge loss. Examining the explicit formulae for the updates in the case of hingeloss (cf. (4.2.23)โ€“(4.2.24)), we notice that in the case of full sampling, but one

elements of each row of ๐‘ˆ and ๐‘‰ are simply scaled by the same coefficient, andthus one can maintain each of the quantities โ€–๐‘ˆ(๐‘–, :)โ€–1 and โ€–๐‘‰ (๐‘—, :)โˆ’ ๐‘Œ (๐‘—, :)โ€–1in ๐‘‚(1) a.o.โ€™s, leading to ๐‘‚(๐‘‘+๐‘›) complexity of this step. This result is fragile,requiring the specific combination of hinge loss (for which f(๐‘ฃ, ๐‘ฆ) is linear) andentropic potentials.

โ€” We do not need to compute the full matrices ๐‘ƒ,๐‘„. Instead, we can computeonly one row of each of them, corresponding to the drawn pair ๐‘–, ๐‘—, whichrequires ๐‘‚(๐‘˜) a.o.โ€™s. Hence we can implement sampling in ๐‘‚(๐‘‘+ ๐‘›+ ๐‘˜) a.o.โ€™s

โ€” Note that updates of the matrices ๐‘ˆ and ๐‘‰ are not dense even in the case of thefull sampling scheme, so implementing them naively would result in ๐‘‚(๐‘‘๐‘˜+๐‘›๐‘˜)a.o.โ€™s per iteration. Instead, we employ โ€œlazy updatesโ€: for each ๐‘– โˆˆ [2๐‘‘] and ๐‘— โˆˆ[๐‘›], instead of ๐‘ˆ(๐‘–, :) and ๐‘‰ (๐‘—, :) we maintain the pairs (๐‘ˆ(๐‘–, :), ๐›ผ๐‘–), (๐‘‰ (๐‘—, :), ๐›ฝ๐‘—)such that ๐‘ˆ(๐‘–, :) = ๐›ผ๐‘– ๐‘ˆ(๐‘–, :), ๐‘‰ (๐‘—, :) = ๐›ฝ๐‘— ๐‘‰ (๐‘—, :).

Recalling that at each iteration all but one elements of ๐‘ˆ(๐‘–, :) are scaled bythe same coefficient, we can update any such pair in ๐‘‚(1), with the totalcomplexity of ๐‘‚(๐‘‘+ ๐‘›) a.o.โ€™s.

โ€” It can be shown that given the pairs (๐‘ˆ(๐‘–, :), ๐›ผ๐‘–) and (๐‘‰ (๐‘—, :), ๐›ฝ๐‘—), the compu-tation of the averages ๐‘‡ , ๐‘‰ ๐‘‡ after ๐‘‡ iterations requires ๐‘‚(๐‘‘ + ๐‘›) a.o.โ€™s periteration plus the post-processing of ๐‘‚((๐‘‘ + ๐‘›) ยท min(๐‘˜, ๐‘‡ )), resulting in theoutlined complexity estimate. Note that the computation of the duality gap(which can be used as the online stopping criterion, see, e.g., Ostrovskii andHarchaoui [2018]) also has the same complexity. Hence, in practice it is rea-sonable to employ the โ€œdoublingโ€ technique, computing the averages and theduality gap at iterations ๐‘‡๐‘š = 2๐‘š with increasing values of ๐‘š. This will resultin the same overall complexity while allowing for an online stopping criterion.

Remark 6. The way to avoid averaging of iterations is to consider averaging ofstochastic gradients as done by Juditsky and Nemirovski [2011b]. Stochastic gradi-ents ๐œ‰๐‘ˆ and ๐œ‚๐‘‰ have only ๐‘‚(๐‘›) and ๐‘‚(๐‘‘) non-zero elements respectively, and theirrunning averages can be computed in ๐‘‚(๐‘‘ + ๐‘›) a.o.โ€™s. However, in this case we lose

100

Page 116: On efficient methods for high-dimensional statistical

the duality gap guarantee (and thus an online stopping criterion), and only have aguarantee on the primal accuracy. 3

Remark 7. Unfortunately, we were not able to achieve the ๐‘‚(๐‘‘+ ๐‘›+ ๐‘˜) complexityfor the multiclass logistic loss (4.1.3). The reason is that in this case the compos-ite term โ„ฑ(๐‘‰, ๐‘Œ ) is the sum of negative entropies (same as the potential (4.2.13)),and the corresponding proximal updates are not reduced to rescalings of rows (mod-ulo ๐‘‚(1) elements). However, due to the sparsity of stochastic gradients, the updateshave a special form, and we believe that ๐‘‚(๐‘‘+๐‘›+๐‘˜) complexity is possible to achievein this case as well. We are planning to investigate it in the future.

4.3.3 Efficient implementation of SVM with Full Sampling

In this section we consider updates for fully stochastic case and show that oneiteration complexity is indeed ๐‘‚(๐‘‘ + ๐‘› + ๐‘˜). We will show, that it is possible to thespecial form of โ€œlazyโ€ updates for scaling coefficients. Recall, that one iteration of thefully stochastic mirror descent is written as S-MD:

๐‘‰ ๐‘ก+1 = arg min๐‘‰ โˆˆ๐’ฑ

โ„ฑ(๐‘‰, ๐‘Œ )โˆ’ 1

๐‘›tr[๐œ‰โŠค๐‘ˆ๐‘ก๐‘‰

]+๐ท๐œ‘๐’ฑ (๐‘‰, ๐‘‰ ๐‘ก)

2๐›พ๐‘กฮฉ๐’ฑ

,

๐‘ˆ ๐‘ก+1 = arg min๐‘ˆโˆˆ๐’ฐ๐œ† tr[1โŠค

2๐‘‘ร—๐‘˜๐‘ˆ ] +

1

๐‘›tr[๐œ‚๐‘‰ ๐‘ก,๐‘Œ

๐‘ˆ]+๐ท๐œ‘ ๐’ฐ (๐‘ˆ, ๐‘ˆ ๐‘ก)

2๐›พ๐‘กฮฉ๐’ฐ,

(S-MD)

where ๐œ‰๐‘ˆ๐‘ก and ๐œ‚๐‘‰ ๐‘ก,๐‘Œ are stochastic gradients.

Lazy Updates.

Note that although the estimates ๐œ‰๐‘ˆ , ๐œ‚๐‘‰,๐‘Œ produced in (Full-SS) are sparse (eachcontains a single non-zero column), the updates in (S-MD), which can be expressedas (4.2.23)โ€“(4.2.24) with ๐œ‰๐‘ˆ , ๐œ‚๐‘‰,๐‘Œ instead of the corresponding matrix products, aredense, and implementing them naively costs ๐‘‚(๐‘‘๐‘˜ + ๐‘›๐‘˜) a.o.โ€™s. Fortunately, these

updates have a special form: all elements in each row of ๐‘ˆ ๐‘ก and ๐‘‰ ๐‘ก are simply rescaledwith the same factor โ€“ except for at most two elements corresponding to a single non-zero element of ๐œ‚๐‘‰ ๐‘ก,๐‘Œ and at most two non-zero elements of ๐œ‰๐‘ˆ๐‘ก โˆ’ ๐‘Œ in this row.To exploit this fact, we perform โ€œlazyโ€ updates: instead of explicitly computing theactual iterates (๐‘ˆ ๐‘ก, ๐‘‰ ๐‘ก), we maintain the quadruple (๐‘ˆ, ๐›ผ, ๐‘‰ , ๐›ฝ), where ๐‘ˆ, ๐‘‰ have thesame dimensions as ๐‘ˆ, ๐‘‰ , while ๐›ผ โˆˆ R2๐‘‘ and ๐›ฝ โˆˆ R๐‘› are the โ€œscaling vectorsโ€, so thatat any iteration ๐‘ก it holds

๐‘ˆ ๐‘ก(๐‘–, :) = ๐‘ˆ(๐‘–, :) ยท ๐›ผ(๐‘–), ๐‘‰ ๐‘ก(๐‘—, :) = ๐‘‰ (๐‘—, :) ยท ๐›ฝ(๐‘—) (4.3.9)

for any row of ๐‘ˆ ๐‘ก and ๐‘‰ ๐‘ก. Initializing with (๐‘ˆ, ๐‘‰ ) = (๐‘ˆ, ๐‘‰ ), ๐›ผ = 12๐‘‘, ๐›ฝ = 1๐‘›, wecan update the whole quadruple, while maintaining (4.3.9), by updating at most two

3. This situation corresponds to the โ€œgeneral caseโ€ in the terminology [Juditsky and Nemirovski,2011b, Sec. 2.5.2.1 and Prop. 2.6(ii)].

101

Page 117: On efficient methods for high-dimensional statistical

elements in each row of ๐‘ˆ and ๐‘‰ , and encapsulating the overall scaling of rows in ๐›ผand ๐›ฝ. Clearly, this update requires only ๐‘‚(๐‘‘ + ๐‘›) operations once ๐œ‰๐‘ˆ๐‘ก , ๐œ‚๐‘‰ ๐‘ก,๐‘Œ havebeen drawn.

Sampling.

Computing the distributions ๐‘*, ๐‘ž* from (4.3.8) requires the knowledge of โ€– ๐‘‹(:, ๐‘–)โ€–2and โ€– ๐‘‹(๐‘—, :)โ€–โˆž which can be precomputed in ๐‘‚(๐‘‘๐‘›) a.o.โ€™s, and maintaining ๐‘‚(๐‘‘+๐‘›)

norms ๐œ‹๐‘–, ๐œŒ๐‘— of the rows of ๐‘ˆ ๐‘ก and ๐‘‰ ๐‘ก โˆ’ ๐‘Œ that can maintained in ๐‘‚(1) a.o.โ€™s eachusing (4.3.9). Thus, ๐‘* and ๐‘ž* can be updated in ๐‘‚(๐‘‘ + ๐‘›). Once it is done, we cansample ๐‘–๐‘ก โˆผ ๐‘* and ๐‘—๐‘ก โˆผ ๐‘ž*, and then sample the class from ๐‘ƒ * and ๐‘„*, cf. (4.3.8),by computing only the ๐‘–๐‘ก-th row of ๐‘ƒ * and the ๐‘—๐‘ก-th row of ๐‘„*, both in ๐‘‚(๐‘˜) a.o.โ€™s.Thus, the total complexity of producing ๐œ‰๐‘ˆ๐‘ก , ๐œ‚๐‘‰ ๐‘ก,๐‘Œ is ๐‘‚(๐‘‘+ ๐‘›+ ๐‘˜).

Tracking the Averages.

Similar โ€œlazyโ€ updates can be performed for the running averages of the iterates.Omitting the details, this requires ๐‘‚(๐‘‘+ ๐‘›) a.o.โ€™s per iteration, plus post-processingof ๐‘‚(๐‘‘๐‘˜ + ๐‘›๐‘˜) a.o.โ€™s.

The above ideas are implemented in Algorithm 1 whose correctness is formallyshown in Appendix, Sec. 4.7.7 (see also Sec. 4.7.8 for an additional discussion). Itsclose inspection shows the iteration cost of ๐‘‚(๐‘‘ + ๐‘› + ๐‘˜) a.o.โ€™s, plus ๐‘‚(๐‘‘๐‘› + ๐‘‘๐‘˜ +๐‘›๐‘˜) a.o.โ€™s for pre/post-processing, and the memory complexiy of ๐‘‚(๐‘‘๐‘› + ๐‘‘๐‘˜ + ๐‘›๐‘˜).Moreover, the term ๐‘‚(๐‘‘๐‘˜), which dominates in high-dimensional and highly multiclassproblems, can be removed if one exploits sparsity of the corresponding primal solutionto the โ„“1-constrained problem (4.2.2), and outputs it directly, bypassing the explicit

storage of ๐‘ˆ (see Appendix Sec. 4.7.7 for details). Note that when ๐‘› = ๐‘‚(min(๐‘‘, ๐‘˜)),the resulting algorithm enters the sublinear regime after as few as ๐‘‚(๐‘›) iterations.

4.4 Discussion of Alternative Geometries

Here we consider alternative choices of the proximal geometry in mirror descentapplied to the saddle-point formulation of the CCSPP (4.1.1), possibly with otherchoices of regularization than the entrywise โ„“1-norm. The goal is to show that thegeometry chosen in Sec. 4.2 is the only one for which we can obtain favorable accuracyguarantees for stochastic mirror descent (S-MD).

Given the structure of the primal and dual feasible sets, it is reasonable to considergeneral mixed norms of the type (4.2.4):

โ€– ยท โ€–U = โ€– ยท โ€–๐‘1๐‘ˆร—๐‘2๐‘ˆ , โ€– ยท โ€–V = โ€– ยท โ€–๐‘1๐‘‰ ร—๐‘2๐‘‰ ,

where ๐‘1,2๐‘ˆ , ๐‘1,2๐‘‰ โ‰ฅ 1 (in the case of โ€–ยทโ€–U , we also assume the same norm for regulariza-tion). Note that their dual norms can be easily computed: the dual norm of โ€– ยท โ€–๐‘1ร—๐‘2is โ€– ยท โ€–๐‘ž1ร—๐‘ž2 , where ๐‘ž1,2 are the corresponding conjugates to ๐‘1,2, i.e., 1/๐‘๐‘– + 1/๐‘ž๐‘– = 1(see, e.g., Lemma 3 in Sra [2012]). Moreover, it makes sense to fix ๐‘1๐‘‰ = 2 for

102

Page 118: On efficient methods for high-dimensional statistical

Algorithm 1 Sublinear Multiclass โ„“1-Regularized SVM

Require: ๐‘‹ โˆˆ R๐‘›ร—๐‘‘, ๐‘ฆ โˆˆ [๐‘˜]โŠ—๐‘›, ๐œ†, ๐‘…1, ๐‘‡ > 1, ๐›พ๐‘ก๐‘‡โˆ’1๐‘ก=0

1: Obtain ๐‘Œ โˆˆ โˆ†โŠ—๐‘›๐‘˜ from the labels ๐‘ฆ; ๐‘‹ โ‰ก [๐‘‹,โˆ’๐‘‹]

2: ๐›ผโ† 12๐‘‘; ๐‘ˆ โ† ๐‘…*12๐‘‘ร—๐‘˜

2๐‘‘๐‘˜; ๐›ฝ โ† 1๐‘›; ๐‘‰ โ† 1๐‘›ร—๐‘˜

๐‘˜3: for ๐šค = 1 to 2๐‘‘ do4: ๐œŽ(๐šค) โ‰ก โ€– ๐‘‹(:, ๐šค)โ€–2; ๐œ‹(๐šค)โ† โ€–๐‘ˆ(๐šค, :)โ€–15: for ๐šฅ = 1 to ๐‘› do6: ๐œ(๐šฅ) โ‰ก โ€– ๐‘‹(๐šฅ, :)โ€–โˆž; ๐œŒ(๐šฅ)โ† โ€–๐‘‰ (๐šฅ, :)โˆ’ ๐‘Œ (๐šฅ, :)โ€–1# Initialize machinery to track the cumulative sums7: ๐‘ˆฮฃ โ† 02๐‘‘ร—๐‘˜; ๐‘‰ฮฃ โ† 0๐‘›ร—๐‘˜ # Cumulative sums8: ๐ดโ† 02๐‘‘; ๐ต โ† 0๐‘›; ๐ดpr โ† 02๐‘‘ร—๐‘˜; ๐ตpr โ† 0๐‘›ร—๐‘˜9: for ๐‘ก = 0 to ๐‘‡ โˆ’ 1 do # (S-MD) iterations10: Draw ๐‘— โˆผ ๐œ โˆ˜ ๐œŒ # โˆ˜ is the elementwise product11: Draw ๐‘™ โˆผ |๐‘‰ (๐‘—, :) ยท ๐›ฝ๐‘— โˆ’ ๐‘Œ (๐‘—, :)|12: [๐‘ˆฮฃ, ๐ดpr, ๐ด]โ† TrackPrimal(๐‘ˆ,๐‘ˆฮฃ, ๐ดpr, ๐ด, ๐›ผ, ๐‘™)# The only non-zero column of ๐œ‚๐‘‰ ๐‘ก,๐‘Œ , cf. (Full-SS):

13: ๐œ‚ โ† ๐‘‹(๐‘—, :) ยทโˆ‘๐‘›

๐šฅ=1 ๐œ(๐šฅ) ยท ๐œŒ(๐šฅ) ยท sgn[๐›ฝ๐šฅ ยท ๐‘‰ (๐šฅ, ๐‘™)โˆ’ ๐‘Œ (๐šฅ, ๐‘™)]

๐œ(๐‘—)

14: [๐‘ˆ, ๐›ผ, ๐œ‹]โ† UpdatePrimal(๐‘ˆ, ๐›ผ, ๐œ‹, ๐œ‚, ๐‘™, ๐›พ๐‘ก, ๐œ†, ๐‘…*)15: Draw ๐‘– โˆผ ๐œŽ โˆ˜ ๐œ‹16: Draw โ„“ โˆผ ๐‘ˆ(๐‘–, :)

17: [๐‘‰ฮฃ, ๐ตpr, ๐ต]โ† TrackDual(๐‘‰ , ๐‘‰ฮฃ, ๐ตpr, ๐ต, ๐›ฝ, โ„“, ๐‘ฆ)# The only non-zero column of ๐œ‰๐‘ˆ๐‘ก, cf. (Full-SS):

18: ๐œ‰ โ† ๐‘‹(:, ๐‘–) ยทโˆ‘2๐‘‘

๐šค=1 ๐œŽ(๐šค) ยท ๐œ‹(๐šค)

๐œŽ(๐‘–)

19: [๐‘‰ , ๐›ฝ, ๐œŒ]โ† UpdateDual(๐‘‰ , ๐‘Œ, ๐›ฝ, ๐œŒ, ๐œ‰, โ„“, ๐‘ฆ, ๐›พ๐‘ก)20: for ๐‘™ = 1 to ๐‘˜ do # Postprocessing of cumulative sums21: ๐‘ˆฮฃ(:, ๐‘™)โ† ๐‘ˆฮฃ(:, ๐‘™) + ๐‘ˆ(:, ๐‘™) โˆ˜ (๐›ผ + ๐ดโˆ’ ๐ดpr(:, ๐‘™))

22: ๐‘‰ฮฃ(:, ๐‘™)โ† ๐‘‰ฮฃ(:, ๐‘™) + ๐‘‰ (:, ๐‘™) โˆ˜ (๐›ฝ +๐ต โˆ’๐ตpr(:, ๐‘™))

Ensure:1

๐‘‡ + 1๐‘ˆฮฃ,

1

๐‘‡ + 1๐‘‰ฮฃ # Averages (๐‘‡+1, ๐‘‰ ๐‘‡+1)

103

Page 119: On efficient methods for high-dimensional statistical

Procedure 1 UpdatePrimal

Require: ๐‘ˆ โˆˆ R2๐‘‘ร—๐‘˜, ๐›ผ, ๐œ‹, ๐œ‚ โˆˆ R2๐‘‘, ๐‘™ โˆˆ [๐‘˜], ๐›พ, ๐œ†, ๐‘…1

1: ๐ฟ โ‰ก log(2๐‘‘๐‘˜)2: for ๐‘– = 1 to 2๐‘‘ do3: ๐œ‡๐‘– = ๐œ‹๐‘– โˆ’ ๐›ผ๐‘– ยท ๐‘ˆ(๐‘–, ๐‘™) ยท (1โˆ’ ๐‘’โˆ’2๐›พ๐ฟ๐‘…*๐œ‚๐‘–/๐‘›)

4: ๐‘€ =โˆ‘2๐‘‘

๐‘–=1 ๐œ‡๐‘–5: ๐œˆ = min๐‘’โˆ’2๐›พ๐ฟ๐‘…*๐œ†, ๐‘…*/๐‘€6: for ๐‘– = 1 to 2๐‘‘ do7: ๐‘ˆ(๐‘–, ๐‘™)โ† ๐‘ˆ(๐‘–, ๐‘™) ยท ๐‘’โˆ’2๐›พ๐ฟ๐‘…*๐œ‚๐‘–/๐‘›

8: ๐›ผ+๐‘– = ๐œˆ ยท ๐›ผ๐‘–

9: ๐œ‹+๐‘– = ๐œˆ ยท ๐œ‡๐‘–

Ensure: ๐‘ˆ, ๐›ผ+, ๐œ‹+

Procedure 2 UpdateDual

Require: ๐‘‰ , ๐‘Œ โˆˆ R๐‘›ร—๐‘˜, ๐›ฝ, ๐œŒ, ๐œ‰ โˆˆ R๐‘›, โ„“ โˆˆ [๐‘˜], ๐‘ฆ โˆˆ [๐‘˜]โŠ—๐‘›, ๐›พ1: ๐œƒ = ๐‘’โˆ’2๐›พ log(๐‘˜)

2: for ๐‘— = 1 to ๐‘› do3: ๐œ”๐‘— = ๐‘’2๐›พ log(๐‘˜)๐œ‰๐‘—

4: ๐œ€๐‘— = ๐‘’โˆ’2๐›พ log(๐‘˜)๐‘Œ (๐‘—,โ„“)

5: ๐œ’๐‘— = 1โˆ’ ๐›ฝ๐‘— ยท ๐‘‰ (๐‘—, โ„“) ยท (1โˆ’ ๐œ”๐‘— ยท ๐œ€๐‘—)6: if โ„“ = ๐‘ฆ๐‘— then # not the actual class of ๐‘— drawn

7: ๐œ’๐‘— โ† ๐œ’๐‘— โˆ’ ๐›ฝ๐‘— ยท ๐‘‰ (๐‘—, ๐‘ฆ๐‘—) ยท (1โˆ’ ๐œƒ)8: ๐›ฝ+

๐‘— = ๐›ฝ๐‘—/๐œ’๐‘—

9: ๐‘‰ (๐‘—, โ„“)โ† ๐‘‰ (๐‘—, โ„“) ยท ๐œ”๐‘— ยท ๐œ€๐‘—10: ๐‘‰ (๐‘—, ๐‘ฆ๐‘—)โ† ๐‘‰ (๐‘—, ๐‘ฆ๐‘—) ยท ๐œ”๐‘— ยท ๐œƒ11: ๐œŒ+๐‘— = 2โˆ’ 2๐›ฝ+

๐‘— ยท ๐‘‰ (๐‘—, ๐‘ฆ๐‘—)

Ensure: ๐‘‰ , ๐›ฝ+, ๐œŒ+

Procedure 3 TrackPrimal

Require: ๐‘ˆ,๐‘ˆฮฃ, ๐ดpr โˆˆ R2๐‘‘ร—๐‘˜, ๐ด,๐›ผ โˆˆ R2๐‘‘, ๐‘™ โˆˆ [๐‘˜]1: for ๐‘– = 1 to 2๐‘‘ do2: ๐‘ˆฮฃ(๐‘–, ๐‘™)โ† ๐‘ˆฮฃ(๐‘–, ๐‘™) + ๐‘ˆ(๐‘–, ๐‘™) ยท (๐ด๐‘– + ๐›ผ๐‘– โˆ’ ๐ดpr(๐‘–, ๐‘™))3: ๐ดpr(๐‘–, ๐‘™)โ† ๐ด๐‘– + ๐›ผ๐‘–4: ๐ด๐‘– โ† ๐ด๐‘– + ๐›ผ๐‘–Ensure: ๐‘ˆฮฃ, ๐ดpr, ๐ด

104

Page 120: On efficient methods for high-dimensional statistical

Procedure 4 TrackDual

Require: ๐‘‰ , ๐‘‰ฮฃ, ๐ตpr โˆˆ R๐‘›ร—๐‘˜, ๐ต, ๐›ฝ โˆˆ R๐‘›, โ„“ โˆˆ [๐‘˜], ๐‘ฆ โˆˆ [๐‘˜]โŠ—๐‘›

1: for ๐‘— = 1 to ๐‘› do2: for ๐‘™ โˆˆ โ„“, ๐‘ฆ๐‘— do # โ„“, ๐‘ฆ๐‘— has 1 or 2 elements

3: ๐‘‰ฮฃ(๐‘—, ๐‘™)โ† ๐‘‰ฮฃ(๐‘—, ๐‘™) + ๐‘‰ (๐‘—, ๐‘™) ยท (๐ต๐‘— + ๐›ฝ๐‘— โˆ’๐ตpr(๐‘—, ๐‘™))4: ๐ตpr(๐‘—, ๐‘™)โ† ๐ต๐‘— + ๐›ฝ๐‘—

5: ๐ต๐‘— โ† ๐ต๐‘— + ๐›ฝ๐‘—

Ensure: ๐‘‰ฮฃ, ๐ตpr, ๐ต

the reasons discussed in Section 4.2. This leaves us with the obvious choices ๐‘2๐‘‰ โˆˆ1, 2, ๐‘2๐‘ˆ โˆˆ 1, 2 which corresponds to the sparsity-inducing or the standard Eu-clidean geometry of the classes in the dual/primal; ๐‘1๐‘ˆ โˆˆ 1, 2 which corresponds tothe sparsity-inducing or Euclidean geometry of the features. Finally, the choice ๐‘1๐‘ˆ = 2(i.e., the Euclidean geometry in the features) can also be excluded: its combinationwith ๐‘1๐‘‰ = 2 is known to lead to the large variance term in the biclass case. 4 Thisleaves us with the possibilities

๐‘2๐‘ˆ , ๐‘2๐‘‰ โˆˆ 1, 2 ร— 1, 2. (4.4.1)

In all these cases, the quantity โ„’๐’ฐ ,๐’ฑ defined in (4.2.10) can be controlled by extendingProposition 4.1:

Proposition 4.4. For any ๐›ผ > 1 and ๐›ฝ > 1 such that ๐›ฝ > ๐›ผ it holds:

โ„’๐’ฐ ,๐’ฑ :=โ€–๐‘‹โ€–1ร—๐›ผ, 2ร—๐›ฝ

๐‘›=โ€–๐‘‹โŠคโ€–โˆžร—2

๐‘›.

The proof of this proposition follows the steps in the proof of Proposition 4.1, andis omitted.

Finally, the corresponding partial potentials could be constructed by combiningthe Euclidean and an entropy-type potential in a way similar to the one describedin Sec. 4.2 for the dual variable; alternatively, one could use the power potentialof Nesterov and Nemirovski [2013] that results in the same rates up to a constantfactor.

Using Proposition 4.4, we can also compute the potential differences for the fourremaining setups (4.4.1). The results are shown in Table 4.1. Up to logarithmicfactors, we have equivalent results for all four geometries, with the radius ๐‘…* evaluatedin the corresponding norm โ€– ยท โ€–1ร—2 or โ€– ยท โ€–1ร—1 = โ€– ยท โ€–1.

As a result, for the deterministic Mirror Descent (with balanced potentials) we

4. Note that in the biclass case, our variance estimate for the partial sampling scheme (cf. The-orem 4.2) reduces to those in [Juditsky and Nemirovski, 2011b, Section 2.5.2.3]. They consider thecases of โ„“1/โ„“1 and โ„“1/โ„“2 geometries for the primal/dual, and omit the case of โ„“2/โ„“2-geometry, inwhich the sampling variance โ€œexplodesโ€.

105

Page 121: On efficient methods for high-dimensional statistical

Norm for ๐‘‰ โˆˆ R๐‘›ร—๐‘˜

2ร— 1 2ร— 2

Norm for ๐‘ˆ โˆˆ R๐‘‘ร—๐‘˜

1ร— 2 ฮฉ๐’ฐ = โ€–๐‘ˆ*โ€–21ร—2 log ๐‘‘ฮฉ๐’ฑ = ๐‘› log ๐‘˜

ฮฉ๐’ฐ = โ€–๐‘ˆ*โ€–21ร—2 log ๐‘‘ฮฉ๐’ฑ = ๐‘›

1ร— 1 ฮฉ๐’ฐ = โ€–๐‘ˆ*โ€–21 log(๐‘‘๐‘˜)ฮฉ๐’ฑ = ๐‘› log ๐‘˜

ฮฉ๐’ฐ = โ€–๐‘ˆ*โ€–21 log(๐‘‘๐‘˜)ฮฉ๐’ฑ = ๐‘›

Table 4.1 โ€“ Comparison of the potential differences for the norms correspondingto (4.4.1).

obtain the accuracy bound (cf. (4.2.27)):

โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ ) 6

๐‘‚(1)โ„’๐’ฐ ,๐’ฑโˆš

ฮฉ๐’ฐฮฉ๐’ฑโˆš๐‘‡

6๐‘‚๐‘‘,๐‘˜(1)๐‘…*โˆš

๐‘‡

โ€–๐‘‹โŠคโ€–โˆžร—2โˆš๐‘›

+r

๐‘‡.

in all four cases, where ๐‘‚๐‘‘,๐‘˜(1) is a logarithmic factor in ๐‘‘ and ๐‘˜, and ๐‘…* = โ€–๐‘ˆ*โ€–1ร—2

or ๐‘…* = โ€–๐‘ˆ*โ€–1 depending on ๐‘2๐‘ˆ โˆˆ 1, 2. In other words, the deterministic accuracybound of Theorem 4.1 is essentially preserved for all four geometries in (4.4.1). On theother hand, using Proposition 4.5, we obtain that in the case of (Part-SS), the extrapart of the accuracy bound due to sampling (cf. (4.3.7)) is also essentially preserved:

E[โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ )] 6

๐‘‚๐‘‘,๐‘˜(1)๐‘…*โˆš๐‘‡

(โ€–๐‘‹โŠคโ€–โˆžร—2โˆš

๐‘›+โ€–๐‘‹โ€–1ร—โˆž

๐‘›

)+

r

๐‘‡.

However, if we consider full sampling, the situation changes: in the case ๐‘2๐‘ˆ = 2the variance bound that holds for (Part-SS) is not preserved for (Full-SS). Thisis because our argument to control the variance of the full sampling scheme alwaysrequires that ๐‘2๐‘ˆ 6 1 (see the proof of Proposition 4.3 in Appendix 4.7.6 for details;note that for ๐‘ž2๐‘‰ we do not have such a restriction since the variance proxy ๐œŽ2

๐’ฑ iscontrolled on the set ๐’ฑ given by (4.1.7) that has โ„“โˆžร— โ„“1-type geometry regardless ofthe norm โ€– ยท โ€–V . This leaves us with the final choice between the โ€– ยท โ€–2ร—1 and โ€– ยท โ€–2ร—2

norm in the dual, as we have to use the elementwise โ€– ยท โ€–1-norm in the primal. Bothchoices result in essentially the same accuracy bound (note that this choice onlyinfluences the algorithm but not the saddle-point problem itself). We have focusedon the โ€– ยท โ€–2ร—1 norm because of the algorithmic considerations: with this norm, wehave multiplicative updates in the case of the multiclass hinge loss, which allows fora sublinear algorithm presented in Section 4.3.3.

106

Page 122: On efficient methods for high-dimensional statistical

4.5 Experiments

The natural way to estimate the performance measure for saddle-point problemsis the so-called duality gap:

โˆ†๐‘“ (๐‘ˆ, ๐‘‰ ) = max๐‘‰ โˆˆ๐’ฑ

๐‘“(๐‘ˆ, ๐‘‰ )โˆ’min๐‘ˆโˆˆ๐’ฐ

๐‘“(๐‘ˆ, ๐‘‰ ).

In the case of the multi-class SVM formulation:

minโ€–๐‘ˆโ€–16โ„›1๐‘ˆโˆˆR2๐‘‘ร—๐‘˜

+

max๐‘‰ โˆˆ(ฮ”โŠค

๐‘˜ )โŠ—๐‘›โˆ’ 1

๐‘›tr[๐‘‰ โŠค๐‘Œ ] +

1

๐‘›tr[(๐‘‰ โˆ’ ๐‘Œ )โŠค ๐‘‹ ๐‘ˆ]+ ๐œ†โ€–๐‘ˆโ€–1,

the solution can be found in closed form:

โˆ†๐‘‘๐‘ข๐‘Ž๐‘™(๐‘ˆ, ๐‘‰ ) = โˆ’ 1

๐‘›tr[๐‘Œ โŠค ๐‘‹ ๐‘ˆ ] + ๐œ†โ€–๐‘ˆโ€–1 +

1

๐‘›

๐‘›โˆ‘๐‘–=1

max๐‘—

[( ๐‘‹ ๐‘ˆ โˆ’ ๐‘Œ )(๐‘–, ๐‘—)

]+

+1

๐‘›tr[๐‘‰ โŠค๐‘Œ ] +

[max๐‘–,๐‘—

[(โˆ’ 1

๐‘›๐‘‹โŠค(๐‘‰ โˆ’ ๐‘Œ )โˆ’ ๐œ†๐ผ

)(๐‘–, ๐‘—)

]ยท โ„›1

]+

Sublinear Runtime.

To illustrate the sublinear iteration cost of Algorithm 1, we consider the follow-ing experiment. Fixing ๐‘› = ๐‘‘ = ๐‘˜, we generate ๐‘‹ with i.i.d. standard Gaussianentries, take ๐‘ˆ ๐‘œ to be the identity matrix (thus very sparse), and generate the labelsby arg max๐‘™โˆˆ[๐‘˜] ๐‘ฅ๐‘—๐‘ˆ

๐‘œ๐‘™ + 1โˆš

๐‘‘๐’ฉ (0, ๐ผ๐‘‘), where ๐‘ฅ๐‘—โ€™s are the rows of ๐‘‹, and ๐‘ˆ ๐‘œ

๐‘™ โ€™s are thecolumns of ๐‘ˆ ๐‘œ. This is repeated 10 times with ๐‘› = ๐‘‘ = ๐‘˜ increasing by a constantfactor ๐œ…; each time we run Algorithm 1 for a fixed (large) number of iterations todominate the cost of pre/post-processing, with ๐‘…* = โ€–๐‘ˆ ๐‘œโ€–1 and ๐œ† = 10โˆ’3, and mea-sure its runtime. We observe (see Tab. 4.2) that the runtime is proportional to ๐œ…, asexpected.

๐‘› = ๐‘‘ = ๐‘˜ 200 400 800 1600 3200 6400๐‘‡ = 104 0.80 1.17 2.07 4.27 7.55 15.56๐‘‡ = 2 ยท 104 1.63 2.47 4.27 8.74 14.65 30.77

Table 4.2 โ€“ Runtime (in seconds) of Algorithm 1 on synthetic data.

Synthetic Data Experiment.

We compare Algorithm 1 with two competitors: โ€–ยทโ€–1-composite stochastic subgra-dient method (SSM) for the primal problem (4.1.1), in which one uniformly samplesone training example at a time Shalev-Shwartz et al. [2011], leading to ๐‘‚(๐‘‘๐‘˜) iter-ation cost; deterministic saddle-point Mirror Prox (MP) with geometry chosen as

107

Page 123: On efficient methods for high-dimensional statistical

in Algorithm 1, for which we have ๐‘‚(๐‘‘๐‘›๐‘˜) cost of iterations but ๐‘‚(1/๐‘‡ ) conver-gence in terms of the number of iterations. We generate data as in the previousexperiment, fixing ๐‘› = ๐‘‘ = ๐‘˜ = 103. The randomized algorithms are run 10 timesfor ๐‘‡ โˆˆ 10๐‘š/2,๐‘š = 1, ..., 12 iterations with constant stepsize (we use stepsize (4.3.6)in Algorithm 1, choose the one recommended in Theorem 4.1 for MP, and use the the-oretical stepsize for SSM, explicitly computing the variance of subgradients and theLipschitz constant). Each time we compute the duality gap and the primal accuracy,and measure the runtime (see Fig. 4-2). We see that Algorithm 1 outperforms SSM,which might be the combined effect of sublinearity and our choice of geometry. Italso outmatches MP up to high accuracy due to the sublinear effect (MP eventuallyโ€œwinsโ€ because of its ๐‘‚(1/๐‘‡ ) rate). 5

0 100 200 300 400 500

1

2

3

4

5

6Full-SMD, Gap

Full-SMD, Acc.

SSM, Acc.

MP, Gap

MP, Acc.

101

102

103

10-1

10-0.5

1

100.5

10

Full-SMD, Gap

Full-SMD, Acc.

SSM, Acc.

MP, Gap

MP, Acc.

Figure 4-2 โ€“ Primal accuracy and duality gap (when available) for Algorithm 1,stochastic subgradient method (SSM), and Mirror Prox (MP) with exact gradients,on a synthetic data benchmark.

Practical issue: โ€œexplosionโ€ of rescaling coefficients.

When running the fully stochastic algorithm, we observed one practical issue: thescaling coefficients ๐›ผ and ๐›ฝ tend to rapidly increase or decrease, causing occasionalarithmetic over/underflows. We suggest two solutions for this problem:

โ€” The immediate solution is to use arbitrary-precision arithmetic. In theory, thisresults in the ๐‘‚(๐‘‘+๐‘›+๐‘˜) cost of an iteration being inflated by the allowed limitof digits. The resulting complexity can still be beneficial, vis-ร -vis the partialsampling scheme, when ๐‘˜ is large enough. This solution is not practical, sinceone has to use external libraries or implement arbitrary-precision arithmetic.

โ€” A better solution, actually used in our codes, is to perform โ€œurgentโ€ rescales inthe case of an over/underflow. Hopefully, it will be required not too often, andthe more seldom the closer we are to the optimal solution. For example, in theabove experiment the โ€œurgentโ€ rescaling was required in about 10 iterations.

5. We provide the Matlab codes of our experiments in Supp. Mat.

108

Page 124: On efficient methods for high-dimensional statistical

4.6 Conclusion and perspectives

In this chapter we considered sublinear algorithms for the saddle-points problemsconstructed for certain type Fenchel-Young losses. The main contribution of this workis the algorighm for the โ„“1-regularized multiclass SVM with numerical cost ๐‘‚(๐‘‘+๐‘›+๐‘˜)of one iteration, where ๐‘‘ is the number of features, ๐‘› the sample size, and ๐‘˜ the numberof classes. This was possible due to the right choice of the proximal setup, and ad-hocsampling techniques for matrix multiplication.

We envision the following directions for future research:

โ€” Extension to the multi-class logistic regression (softmax) model, which iswidely used in Natural Language Processing (NLP) problems, where ๐‘‘, ๐‘› and๐‘˜ are on order of millions or even billions [Chelba et al., 2013, Partalas et al.,2015].

โ€” Provide more experiments with larger dimensions to highlight the advantagesof the algorithm; use real data as well.

โ€” Implement more flexible stepsizes, including the online stepsize search in thevein of Juditsky and Nemirovski [2011a].

109

Page 125: On efficient methods for high-dimensional statistical

4.7 Appendix

4.7.1 Motivation for the multiclass hinge loss

We justify the multiclass extension (4.1.4) of the hinge loss due to Shalev-Shwartzand Ben-David [2014]. In the binary case, the soft-margin SVM objective is

1

๐‘›

๐‘›โˆ‘๐‘–=1

[max(0, 1โˆ’ ๐‘ฆ๐‘–๐‘ขโŠค๐‘ฅ๐‘–)]+ ๐œ†โ€–๐‘ขโ€–,

where ๐‘ข โˆˆ R๐‘‘, โ€– ยท โ€– is some norm, and ๐‘ฆ๐‘– โˆˆ โˆ’1, 1. Introducing ๐‘ฆ = ๐‘’๐‘ฆ โˆˆ ๐‘’โˆ’1, ๐‘’1where ๐‘’๐‘— is the ๐‘—-th standard basis vector (the dimension of space are symbolicallyindexed in โˆ’1, 1), and putting ๐‘ข1 = โˆ’๐‘ขโˆ’1 = ๐‘ข

2, we can rewrite the loss as

max(0, 1โˆ’ ๐‘ฆ๐‘ขโŠค๐‘ฅ) = max๐‘˜โˆˆ1,โˆ’1

1๐‘’๐‘˜ = ๐‘ฆ+ ๐‘ขโŠค๐‘˜ ๐‘ฅโˆ’ ๐‘ขโŠค๐‘ฆ ๐‘ฅ .

The advantage of this reformulation is that we can naturally pass to the multiclasscase, by replacing the set โˆ’1, 1 with 1, ..., ๐พ and introducing ๐‘ข1, ..., ๐‘ข๐พ โˆˆ R๐‘‘

without any restrictions:

max๐‘˜โˆˆ1,...,๐พ

1๐‘’๐‘˜ = ๐‘ฆ+ ๐‘ขโŠค๐‘˜ ๐‘ฅโˆ’ ๐‘ขโŠค๐‘ฆ ๐‘ฅ = max

๐‘ฃโˆˆ๐‘’1,...,๐‘’๐พ

1๐‘ฃ = ๐‘ฆ+

๐พโˆ‘๐‘—=1

(๐‘ฃ[๐‘—]โˆ’ ๐‘ฆ[๐‘—])๐‘ขโŠค๐‘— ๐‘ฅ

= max๐‘ฃโˆˆ๐‘’1,...,๐‘’๐พ

1๐‘ฃ = ๐‘ฆ+

๐พโˆ‘๐‘—=1

(๐‘ฃ โˆ’ ๐‘ฆ)โŠค๐‘ˆโŠค๐‘ฅ

=: โ„“(๐‘ˆ, (๐‘ฅ, ๐‘ฆ)),

where ยท[๐‘—] denotes the ๐‘—-th element of a vector, and ๐‘ˆ โˆˆ R๐‘‘ร—๐พ has ๐‘ข๐‘— as its ๐‘—-thcolumn. Finally, we can rewrite โ„“(๐‘ˆ, (๐‘ฅ, ๐‘ฆ)) as follows:

โ„“(๐‘ˆ, (๐‘ฅ, ๐‘ฆ)) = max๐‘ฃโˆˆฮ”๐พ

1โˆ’ ๐‘ฃโŠค๐‘ฆ +

๐พโˆ‘๐‘—=1

(๐‘ฃ โˆ’ ๐‘ฆ)โŠค๐‘ˆโŠค๐‘ฅ

.

This is because we maximize an affine function of ๐‘ฃ, and 1โˆ’ ๐‘ฃโŠค๐‘ฆ = 1๐‘ฃ = ๐‘ฆ at thevertices. Thus, we arrive at the announced saddle-point problem

min๐‘ˆโˆˆR๐‘‘ร—๐‘˜

max๐‘‰ โˆˆ(ฮ”โŠค

๐‘˜ )โŠ—๐‘›1โˆ’ 1

๐‘›tr[๐‘‰ โŠค๐‘Œ ] +

1

๐‘›tr[(๐‘‰ โˆ’ ๐‘Œ )โŠค๐‘‹๐‘ˆ

]+ ๐œ†โ€–๐‘ˆโ€–U .

4.7.2 General accuracy bounds for the composite saddle-point

Mirror Descent

Deterministic case. Here we provide general accuracy bounds which are instan-tiated in Theorems 4.1โ€“1. Below we outline the general setting that encompasses, inparticular, the case of (4.2.17) solved via (MD) or (MP) with initialization (Init).

110

Page 126: On efficient methods for high-dimensional statistical

โ€” We consider a convex-concave saddle-point problem

min๐‘ˆโˆˆ๐’ฐ

max๐‘‰ โˆˆ๐’ฑ

๐‘“(๐‘ˆ, ๐‘‰ )

with a composite objective,

๐‘“(๐‘ˆ, ๐‘‰ ) = ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ ) + ฮฅ(๐‘ˆ)โˆ’โ„ฑ(๐‘‰ ),

where

ฮฆ(๐‘ˆ, ๐‘‰ ) =1

๐‘›๐‘‰ โŠค๐‘‹๐‘ˆ

is a bilinear function, and ฮฅ(๐‘ˆ),โ„ฑ(๐‘‰ ) are convex โ€œsimpleโ€ terms. Moreover,we assume that the primal feasible set ๐’ฐ belongs to the โ€– ยท โ€–U -norm ball withradius ๐‘…๐’ฐ , the dual constraint set ๐’ฑ belongs to the โ€– ยท โ€–V -norm ball withradius ๐‘…๐’ฑ , and โ€–๐‘Œ โ€–V โ‰ค ๐‘…๐’ฑ .

6 To simplify the results, we make the mildassumption (satisfied in all known to us situations):

ฮฉ๐’ฐ > ๐‘…2๐’ฐ , ฮฉ๐’ฑ > ๐‘…2

๐’ฑ . (4.7.1)

โ€” Recall that the vector field of partial gradients of ฮจ(๐‘ˆ, ๐‘‰ ) := ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ ) is

๐บ(๐‘Š ) : = (โˆ‡๐‘ˆฮจ(๐‘ˆ, ๐‘‰ ),โˆ’โˆ‡๐‘‰ ฮจ(๐‘ˆ, ๐‘‰ ))

=1

๐‘›(๐‘‹โŠค(๐‘‰ โˆ’ ๐‘Œ ),โˆ’๐‘‹๐‘ˆ)

(4.7.2)

โ€” Given the partial proximal setups (โ€–ยทโ€–U , ๐œ‘๐’ฐ(ยท)) and (โ€–ยทโ€–V , ๐œ‘๐’ฑ(ยท)), we run Com-posite Mirror Descent (4.2.6) or Mirror Prox (4.2.8) on the vector field ๐บ(๐‘Š )with the joint penalty term

โ„Ž(๐‘Š ) = ฮฅ(๐‘ˆ) + โ„ฑ(๐‘‰ ),

the โ€œbalancedโ€ joint potential given by (4.2.12), and stepsizes ๐›พ๐‘ก.We now provide the convergence analysis of the Mirror Descent scheme, extending

the argument of [Duchi et al., 2010, Lemma 1] to composite saddle-point optimization.

Theorem 4.3. In the above setting, let (๐‘‡ , ๐‘‰ ๐‘‡ ) = 1๐‘‡

โˆ‘๐‘‡โˆ’1๐‘ก=0 (๐‘ˆ ๐‘ก, ๐‘‰ ๐‘ก) be the average

of the first ๐‘‡ iterates of the composite Mirror Descent (4.2.6) with constant stepsize

๐›พ๐‘ก โ‰ก1

โ„’U ,V

โˆš5๐‘‡ฮฉ๐’ฐฮฉ๐’ฑ

,

where

โ„’U ,V :=1

๐‘›sup

โ€–๐‘ˆโ€–U โ‰ค1

โ€–๐‘‹๐‘ˆโ€–V * .

6. Note that the linear term 1๐‘›๐‘Œ

โŠค๐‘‹๐‘ˆ can be absorbed into the simple term ฯ’(๐‘ˆ), which willslightly improve the bound in Theorem 4.3. However, this improvement is impossible in the stochasticversion of the algorithm, in which we sample the linear form ๐‘Œ โŠค๐‘‹ but not the gradient of ฯ’(๐‘ˆ).

111

Page 127: On efficient methods for high-dimensional statistical

Then we have the following guarantee for the duality gap:

โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ ) 6

2โˆš

5โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑโˆš

๐‘‡+

ฮฅ(๐‘ˆ0)โˆ’min๐‘ˆโˆˆ๐’ฐ ฮฅ(๐‘ˆ)

๐‘‡+โ„ฑ(๐‘‰ 0)โˆ’min๐‘‰ โˆˆ๐’ฑ โ„ฑ(๐‘‰ )

๐‘‡.

Moreover, if one of the functions ฮฅ(๐‘ˆ), โ„ฑ(๐‘‰ ) is affine, the corresponding ๐‘‚(1/๐‘‡ )error term vanishes from the bound.

Proof. 1๐‘œ. We begin by introducing the norm for ๐‘Š = (๐‘ˆ, ๐‘‰ ):

โ€–๐‘Šโ€–W =

โˆšโ€–๐‘ˆโ€–2U2ฮฉ๐’ฐ

+โ€–๐‘‰ โ€–2V2ฮฉ๐’ฑ

(4.7.3)

and its dual norm defined for ๐บ = (๐บ๐‘ˆ , ๐บ๐‘‰ ) with ๐บ๐‘ˆ โˆˆ R๐‘‘ร—๐‘˜ and ๐บ๐‘‰ โˆˆ R๐‘›ร—๐‘˜:

โ€–๐บโ€–W * =โˆš

2ฮฉ๐’ฐโ€–๐บ๐‘ˆโ€–2U * + 2ฮฉ๐’ฑโ€–๐บ๐‘‰ โ€–2V * , (4.7.4)

where โ€– ยท โ€–U * and โ€– ยท โ€–V * are the dual norms for โ€– ยท โ€–U and โ€– ยท โ€–V correspondingly.We now make a few observations. First, the joint potential ๐œ‘๐’ฒ(๐‘Š ) given by (4.2.12)is 1-strongly convex with respect to the norm โ€– ยท โ€–W . Second, we can compute thepotential difference corresponding to ๐œ‘๐’ฒ :

ฮฉ๐’ฒ := max๐‘Šโˆˆ๐’ฒ

๐œ‘๐’ฒ(๐‘Š )โˆ’ min๐‘Šโˆˆ๐’ฒ

๐œ‘๐’ฒ(๐‘Š ) = 1 (4.7.5)

Finally, by (4.7.2) and (4.2.10) we have

max๐‘Šโˆˆ๐’ฒ

โ€–๐บ๐‘ˆ(๐‘Š )โ€–U * 6 2โ„’U ,V ๐‘…๐’ฑ , max๐‘Šโˆˆ๐’ฒ

โ€–๐บ๐‘‰ (๐‘Š )โ€–V * 6 โ„’U ,V ๐‘…๐’ฐ ,

combining which with (4.7.1) we bound the โ€– ยท โ€–W *-norm of ๐บ(๐‘Š ) on ๐’ฒ :

max๐‘Šโˆˆ๐’ฒ

โ€–๐บ(๐‘Š )โ€–W * 6โˆš

10โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑ . (4.7.6)

2๐‘œ.We now follow the classical convergence analysis of composite Mirror Descent,see Duchi et al. [2010], extending it for convex-concave objectives. By the convexityproperties of ฮจ(๐‘ˆ, ๐‘‰ ) = ฮฆ(๐‘ˆ, ๐‘‰ โˆ’ ๐‘Œ ), for any ( , ๐‘‰ ) โˆˆ ๐’ฒ and (๐‘ˆ, ๐‘‰ ) โˆˆ ๐’ฒ it holds

ฮจ( , ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ ) = ฮจ( , ๐‘‰ )โˆ’ฮจ( , ๐‘‰ ) + ฮจ( , ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ )

6โŸจโˆ‡๐‘ˆฮจ( , ๐‘‰ ), โˆ’ ๐‘ˆ

โŸฉโˆ’โŸจโˆ‡๐‘‰ ฮจ( , ๐‘‰ ), ๐‘‰ โˆ’ ๐‘‰

โŸฉ=โŸจ๐บ( ), โˆ’๐‘Š

โŸฉ,

Let ๐‘Š ๐‘ก = (๐‘ˆ ๐‘ก, ๐‘‰ ๐‘ก) be the ๐‘ก-th iterate of (4.2.6) for ๐‘ก โ‰ฅ 1. By convexity of ฮฅ(๐‘ˆ)and โ„ฑ(๐‘‰ ), and denoting โ„Ž(๐‘Š ) = ฮฅ(๐‘ˆ) + โ„ฑ(๐‘‰ ), we have, for any ๐‘Š = [๐‘ˆ, ๐‘‰ ], that

ฮจ(๐‘ˆ ๐‘กโˆ’1, ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ ๐‘กโˆ’1) + โ„Ž(๐‘Š ๐‘ก)โˆ’ โ„Ž(๐‘Š )

6โŸจ๐บ(๐‘Š ๐‘กโˆ’1),๐‘Š ๐‘กโˆ’1 โˆ’๐‘Š

โŸฉ+โŸจ๐œ•โ„Ž(๐‘Š ๐‘ก),๐‘Š ๐‘ก โˆ’๐‘Š

โŸฉ.

(4.7.7)

112

Page 128: On efficient methods for high-dimensional statistical

Let us now bound the right-hand side. Note that the first-order optimality conditionfor (4.2.6) (denoting ๐œ‘(ยท) := ๐œ‘๐’ฒ(ยท) the joint potential) writes 7โŸจ

๐›พ๐‘ก[๐บ(๐‘Š ๐‘กโˆ’1) + ๐œ•โ„Ž(๐‘Š ๐‘ก)] +โˆ‡๐œ‘(๐‘Š ๐‘ก)โˆ’โˆ‡๐œ‘(๐‘Š ๐‘กโˆ’1),๐‘Š ๐‘ก โˆ’๐‘ŠโŸฉ6 0. (4.7.8)

Combining this with (4.7.7), we get

๐›พ๐‘ก[ฮจ(๐‘ˆ ๐‘กโˆ’1, ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ ๐‘กโˆ’1) + โ„Ž(๐‘Š ๐‘ก)โˆ’ โ„Ž(๐‘Š )] 6โŸจโˆ‡๐œ‘(๐‘Š ๐‘กโˆ’1)โˆ’โˆ‡๐œ‘(๐‘Š ๐‘ก),๐‘Š ๐‘ก โˆ’๐‘Š

โŸฉ+ ๐›พ๐‘ก

โŸจ๐บ(๐‘Š ๐‘กโˆ’1),๐‘Š ๐‘กโˆ’1 โˆ’๐‘Š ๐‘ก

โŸฉ.(4.7.9)

By the well-known identity,โŸจโˆ‡๐œ‘(๐‘Š ๐‘กโˆ’1)โˆ’โˆ‡๐œ‘(๐‘Š ๐‘ก),๐‘Š ๐‘ก โˆ’๐‘Š

โŸฉ= ๐ท๐œ‘(๐‘Š,๐‘Š ๐‘กโˆ’1)โˆ’๐ท๐œ‘(๐‘Š,๐‘Š ๐‘ก)โˆ’๐ท๐œ‘(๐‘Š ๐‘ก,๐‘Š ๐‘กโˆ’1),

(4.7.10)

see, e.g., [Beck and Teboulle, 2003, Lemma 4.1]. On the other hand, by the Fenchel-Young inequality we have

๐›พ๐‘กโŸจ๐บ(๐‘Š ๐‘กโˆ’1),๐‘Š ๐‘กโˆ’1 โˆ’๐‘Š ๐‘ก

โŸฉ6๐›พ2๐‘ก โ€–๐บ(๐‘Š ๐‘กโˆ’1)โ€–2W *

2+โ€–๐‘Š ๐‘กโˆ’1 โˆ’๐‘Š ๐‘กโ€–2W

26 5๐›พ2๐‘กโ„’2

U ,V ฮฉ๐’ฐฮฉ๐’ฑ +๐ท๐œ‘(๐‘Š ๐‘ก,๐‘Š ๐‘กโˆ’1),(4.7.11)

where we used (4.7.6) and 1-strong convexity of ๐œ‘(ยท) with respect to โ€– ยท โ€–W . Thus, weobtain

๐›พ๐‘ก[ฮจ(๐‘ˆ ๐‘กโˆ’1, ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ ๐‘กโˆ’1) + โ„Ž(๐‘Š ๐‘ก)โˆ’ โ„Ž(๐‘Š )] โ‰ค๐ท๐œ‘(๐‘Š,๐‘Š ๐‘กโˆ’1)โˆ’๐ท๐œ‘(๐‘Š,๐‘Š ๐‘ก)

+ 5๐›พ2๐‘กโ„’2U ,V ฮฉ๐’ฐฮฉ๐’ฑ .

(4.7.12)

3๐‘œ. Now, assuming the constant stepsize, by the convexity properties of ฮจ(ยท, ยท)and โ„Ž(ยท) we obtain

๐‘“(๐‘‡ , ๐‘‰ )โˆ’ ๐‘“(๐‘ˆ, ๐‘‰ ๐‘‡ ) = ฮจ(๐‘‡ , ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ ๐‘‡ ) + โ„Ž( ๐‘‡ )โˆ’ โ„Ž(๐‘Š )

61

๐‘‡

๐‘‡โˆ‘๐‘ก=1

ฮจ(๐‘ˆ ๐‘กโˆ’1, ๐‘‰ )โˆ’ฮจ(๐‘ˆ, ๐‘‰ ๐‘กโˆ’1) + โ„Ž(๐‘Š ๐‘กโˆ’1)โˆ’ โ„Ž(๐‘Š )

61

๐‘‡

(โ„Ž(๐‘Š 0)โˆ’ โ„Ž(๐‘Š ๐‘‡ ) +

๐ท๐œ‘(๐‘Š,๐‘Š 0)

๐›พ+ 5๐‘‡๐›พโ„’2

U ,V ฮฉ๐’ฐฮฉ๐’ฑ

)6

1

๐‘‡

(โ„Ž(๐‘Š 0)โˆ’ min

๐‘Šโˆˆ๐’ฒโ„Ž(๐‘Š ) +

1

๐›พ+ 5๐‘‡๐›พโ„’2

U ,V ฮฉ๐’ฐฮฉ๐’ฑ

).

(4.7.13)where for the third line we substituted (4.7.12), simplified the telescoping sum, andused that ๐ท(๐‘Š,๐‘Š ๐‘‡ ) > 0, and in the last line we used ๐ท(๐‘Š,๐‘Š 0) 6 ฮฉ๐’ฒ 6 1,cf. (4.7.5). The choice

๐›พ =1

โ„’U ,V

โˆš5๐‘‡ฮฉ๐’ฐฮฉ๐’ฑ

,

113

Page 129: On efficient methods for high-dimensional statistical

results in the accuracy bound from the premise of the theorem:

โˆ†๐‘“ (๐‘‡ , ๐‘‰ ๐‘‡ ) 6

2โˆš

5โ„’U ,V

โˆšฮฉ๐’ฐฮฉ๐’ฑโˆš

๐‘‡+โ„Ž(๐‘Š 0)โˆ’min๐‘Šโˆˆ๐’ฒ โ„Ž(๐‘Š )

๐‘‡.

Finally, assume that one of the terms ฮฅ(๐‘ˆ), โ„ฑ(๐‘‰ ) is affine โ€“ w.l.o.g. let it be ฮฅ(๐‘ˆ).Then, since โˆ‡ฮฅ(๐‘ˆ) is constant, ๐œ•โ„Ž(๐‘Š ๐‘ก) = (โˆ‡ฮฅ(๐‘ˆ ๐‘ก), ๐œ•โ„ฑ(๐‘‰ ๐‘ก)) in (4.7.8) can be re-placed with (โˆ‡ฮฅ(๐‘ˆ ๐‘กโˆ’1), ๐œ•โ„ฑ(๐‘‰ ๐‘ก)). Then in (4.7.12) we can replace โ„Ž(๐‘Š ๐‘ก) โˆ’ โ„Ž(๐‘Š 0)with ฮฅ(๐‘ˆ ๐‘กโˆ’1)โˆ’ฮฅ(๐‘ˆ)+โ„ฑ(๐‘‰ ๐‘ก)โˆ’โ„ฑ(๐‘‰ ), implying that the term โ„Ž(๐‘Š 0)โˆ’โ„Ž(๐‘Š ๐‘‡ ) in theright-hand side of (4.7.13) gets replaced with โ„ฑ(๐‘‰ 0)โˆ’โ„ฑ(๐‘‰ ๐‘ก). The claim is proved.

Stochastic Mirror Descent. We now consider the stochastic setting, which allows us to encompass (S-MD). Stochastic Mirror Descent is given by
\[
\begin{aligned}
W^0 &= \operatorname*{arg\,min}_{W \in \mathcal{W}} \varphi_{\mathcal{W}}(W); \\
W^t &= \operatorname*{arg\,min}_{W \in \mathcal{W}} \Big\{ h(W) + \langle \Xi(W^{t-1}), W \rangle + \frac{1}{\gamma_t} D_{\varphi_{\mathcal{W}}}(W, W^{t-1}) \Big\}, \quad t \ge 1,
\end{aligned}
\tag{4.7.14}
\]
where
\[
\Xi(W) := \frac{1}{n} \big( \eta_{V,Y},\, -\xi_U \big)
\]
is the unbiased estimate of the first-order oracle $G(W) = \frac{1}{n}\big( X^\top(V - Y),\, -XU \big)$. Let us introduce the corresponding variance proxies (refer to the preamble of Section 4.3 for the discussion):
\[
\sigma^2_{\mathcal{U}} = \frac{1}{n^2} \sup_{U \in \mathcal{U}} \mathbf{E}\big[ \|XU - \xi_U\|^2_{V^*} \big],
\qquad
\sigma^2_{\mathcal{V}} = \frac{1}{n^2} \sup_{(V,Y) \in \mathcal{V} \times \mathcal{V}} \mathbf{E}\big[ \|X^\top(V - Y) - \eta_{V,Y}\|^2_{U^*} \big].
\tag{4.7.15}
\]
We assume that the noises $G(W^{t-1}) - \Xi(W^{t-1})$ are independent along the iterations of (4.7.14). In this setting, we prove the following generalization of Theorem 4.3:

Theorem 4.4. Let $(\widehat{U}^T, \widehat{V}^T) = \frac{1}{T}\sum_{t=0}^{T-1}(U^t, V^t)$ be the average of the first $T$ iterates of the Stochastic Composite Mirror Descent (4.7.14) with constant stepsize
\[
\gamma_t \equiv \frac{1}{\sqrt{T}} \min\left\{
\frac{1}{\sqrt{10}\, \mathcal{L}_{U,V} \sqrt{\Omega_{\mathcal{U}} \Omega_{\mathcal{V}}}},\;
\frac{1}{\sqrt{2}\, \sqrt{\Omega_{\mathcal{U}} \bar\sigma^2_{\mathcal{V}} + \Omega_{\mathcal{V}} \bar\sigma^2_{\mathcal{U}}}}
\right\},
\]
where $\mathcal{L}_{U,V}$, $\Omega_{\mathcal{U}}$, $\Omega_{\mathcal{V}}$ are the same as in Theorem 4.3, and $\bar\sigma^2_{\mathcal{U}}$, $\bar\sigma^2_{\mathcal{V}}$ are the upper bounds for $\sigma^2_{\mathcal{U}}$, $\sigma^2_{\mathcal{V}}$, cf. (4.7.15). Then it holds
\[
\mathbf{E}\big[ \Delta_f(\widehat{U}^T, \widehat{V}^T) \big] \le
\frac{2\sqrt{10}\, \mathcal{L}_{U,V} \sqrt{\Omega_{\mathcal{U}} \Omega_{\mathcal{V}}}}{\sqrt{T}}
+ \frac{2\sqrt{2}\, \sqrt{\Omega_{\mathcal{U}} \bar\sigma^2_{\mathcal{V}} + \Omega_{\mathcal{V}} \bar\sigma^2_{\mathcal{U}}}}{\sqrt{T}}
+ \frac{\Upsilon(U^0) - \min_{U \in \mathcal{U}} \Upsilon(U)}{T}
+ \frac{\mathcal{F}(V^0) - \min_{V \in \mathcal{V}} \mathcal{F}(V)}{T},
\]
where $\mathbf{E}[\cdot]$ is the expectation over the randomness in (4.7.14). Moreover, if one of the functions $\Upsilon(U)$, $\mathcal{F}(V)$ is affine, the corresponding $O(1/T)$ term can be discarded.

7. Note that $\varphi(W)$ is continuously differentiable in the interior of $\mathcal{W}$, and $\nabla\varphi$ diverges on the boundary of $\mathcal{W}$; hence the iterates are guaranteed to stay in the interior of $\mathcal{W}$ [Beck and Teboulle, 2003].

Proof. The proof closely follows that of Theorem 4.3. First, $1^o$ remains unchanged. Then, in the first-order condition (4.7.8) one must replace $G(W^{t-1})$ with $\Xi(W^{t-1})$, which results in replacing (4.7.9) with
\[
\begin{aligned}
\gamma_t \big[ \Psi(U^{t-1}, V) - \Psi(U, V^{t-1}) + h(W^t) - h(W) \big]
&\le \langle \nabla\varphi(W^{t-1}) - \nabla\varphi(W^t),\, W^t - W \rangle
+ \gamma_t \langle \Xi(W^{t-1}),\, W^{t-1} - W^t \rangle \\
&\quad + \gamma_t \langle G(W^{t-1}) - \Xi(W^{t-1}),\, W^{t-1} - W \rangle,
\end{aligned}
\]
where the last term has zero mean. The term $\gamma_t \langle \Xi(W^{t-1}), W^{t-1} - W^t \rangle$ can be bounded using Young's inequality and 1-strong convexity of $\varphi(\cdot)$, cf. (4.7.11):
\[
\gamma_t \langle \Xi(W^{t-1}),\, W^{t-1} - W^t \rangle
\le \frac{\gamma_t^2 \|\Xi(W^{t-1})\|_{W^*}^2}{2} + \frac{\|W^{t-1} - W^t\|_W^2}{2}
\le \gamma_t^2 \|G(W^{t-1})\|_{W^*}^2 + \gamma_t^2 \|\Xi(W^{t-1}) - G(W^{t-1})\|_{W^*}^2 + D_\varphi(W^t, W^{t-1}).
\]
Combining (4.7.6), (4.7.4), and (4.7.15), this implies
\[
\gamma_t \langle \Xi(W^{t-1}),\, W^{t-1} - W^t \rangle
\le 10 \gamma_t^2 \mathcal{L}_{U,V}^2 \Omega_{\mathcal{U}} \Omega_{\mathcal{V}}
+ 2 \gamma_t^2 \big( \Omega_{\mathcal{U}} \bar\sigma^2_{\mathcal{V}} + \Omega_{\mathcal{V}} \bar\sigma^2_{\mathcal{U}} \big)
+ D_\varphi(W^t, W^{t-1}).
\]
Using (4.7.10) and (4.7.5), this results in
\[
\mathbf{E}\big[ \Delta_f(\widehat{U}^T, \widehat{V}^T) \big]
= \mathbf{E}\Big[ \max_{(U,V) \in \mathcal{W}} f(\widehat{U}^T, V) - f(U, \widehat{V}^T) \Big]
\le \frac{1}{T} \Big[ h(W^0) - \min_{W \in \mathcal{W}} h(W) + \frac{1}{\gamma}
+ 2 T \gamma \big( 5 \mathcal{L}_{U,V}^2 \Omega_{\mathcal{U}} \Omega_{\mathcal{V}} + \Omega_{\mathcal{U}} \bar\sigma^2_{\mathcal{V}} + \Omega_{\mathcal{V}} \bar\sigma^2_{\mathcal{U}} \big) \Big];
\]
note that the maximization on the left is under the expectation (and not vice versa) because the right-hand side is independent from $W = (U, V)$. Choosing $\gamma$ to balance the terms, we arrive at the desired bound. Finally, the improvement in the case of affine $\Upsilon(U)$, $\mathcal{F}(V)$ is obtained in the same way as in Theorem 4.3.
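In code, the constant stepsize prescribed by Theorem 4.4 is a one-line computation; the following Python helper is a minimal sketch under our own naming (all arguments are assumed to be the precomputed constants appearing in the theorem), and is not part of the actual implementation of Algorithm 1.

    import math

    def smd_constant_stepsize(T, L_UV, Omega_U, Omega_V, sigma2_U_bar, sigma2_V_bar):
        """Constant stepsize of Theorem 4.4 (sketch; hypothetical variable names).

        T              -- number of iterations
        L_UV           -- the constant L_{U,V}
        Omega_U/V      -- the potential ranges Omega_U, Omega_V
        sigma2_*_bar   -- upper bounds on the variance proxies (4.7.15)
        """
        deterministic = 1.0 / (math.sqrt(10) * L_UV * math.sqrt(Omega_U * Omega_V))
        stochastic = 1.0 / (math.sqrt(2) * math.sqrt(Omega_U * sigma2_V_bar
                                                     + Omega_V * sigma2_U_bar))
        return min(deterministic, stochastic) / math.sqrt(T)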


4.7.3 Auxiliary lemmas

Lemma 4.1. Let $X^0 \in \mathbf{R}^n_+$ and
\[
X^1 = \operatorname*{arg\,min}_{\substack{X \in \mathbf{R}^n_+ \\ \|X\|_1 \le R}}
\; C_1 \|X\|_1 + \langle S, X \rangle + C_2 \sum_{i=1}^n X_i \log \frac{X_i}{X^0_i}.
\]
Then,
\[
X^1_i = \rho \cdot \frac{X^0_i \cdot \exp(-S_i/C_2)}{M},
\]
where $M = \sum_{j=1}^n X^0_j \cdot \exp(-S_j/C_2)$ and $\rho = \min\big( M \cdot e^{-\frac{C_1 + C_2}{C_2}},\, R \big)$.

Proof. Clearly, we have
\[
X^1 = \operatorname*{arg\,min}_{r \le R} \; \min_{\substack{X \in \mathbf{R}^n_+ \\ \|X\|_1 = r}}
\; C_1 r + \langle S, X \rangle + C_2 \sum_{i=1}^n X_i \log \frac{X_i}{X^0_i}.
\]
Let us first do the internal minimization. By simple algebra, the first-order optimality condition for the Lagrangian dual problem (with constraint $\|X\|_1 = r$) amounts to
\[
S_i + C_2 + C_2 \log X^1_i - C_2 \log X^0_i + \kappa = 0,
\]
where $\kappa$ is the Lagrange multiplier, and $\sum_{i=1}^n X^1_i = r$. Equivalently,
\[
X^1_i = X^0_i \cdot \exp\Big( -\frac{\kappa + C_2 + S_i}{C_2} \Big),
\]
that is,
\[
X^1_i = r \cdot \frac{X^0_i \cdot \exp(-S_i/C_2)}{\sum_j X^0_j \cdot \exp(-S_j/C_2)}.
\]
Denoting $D_j = \exp(-S_j/C_2)$ and $M = \sum_j X^0_j \cdot D_j$ and substituting for $X^1$ in the external minimization problem, we arrive at
\[
\rho = \operatorname*{arg\,min}_{r \le R} \; C_1 r + r \sum_i \frac{X^0_i D_i S_i}{M}
+ C_2 \cdot r \sum_i \frac{X^0_i D_i}{M} \cdot \log\Big[ \frac{r \cdot D_i}{M} \Big].
\]
One can easily verify that the counterpart of this minimization problem with $R = \infty$ has a unique stationary point $r^* = M \cdot e^{-\frac{C_1 + C_2}{C_2}} > 0$. As the minimized function is convex, the minimum is attained at the point $\rho = \min(r^*, R)$.
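Since the closed form of Lemma 4.1 is used repeatedly inside the lazy updates, it is convenient to have it as a standalone routine; the following NumPy sketch (our own illustration, with hypothetical names) implements exactly the formula of the lemma and can be cross-checked against a generic solver on small instances.

    import numpy as np

    def entropic_prox_l1_cap(X0, S, C1, C2, R):
        """Minimizer of C1*||X||_1 + <S, X> + C2 * sum_i X_i*log(X_i/X0_i)
        over X in R^n_+ with ||X||_1 <= R, following Lemma 4.1 (sketch)."""
        D = np.exp(-S / C2)                      # D_i = exp(-S_i / C2)
        M = np.dot(X0, D)                        # M = sum_j X0_j * D_j
        r_star = M * np.exp(-(C1 + C2) / C2)     # stationary point of the R = inf problem
        rho = min(r_star, R)                     # cap the radius at R
        return rho * X0 * D / M                  # X1_i = rho * X0_i * D_i / M

    # tiny usage example
    X1 = entropic_prox_l1_cap(np.array([1.0, 2.0, 0.5]),
                              np.array([0.3, -0.1, 0.2]), C1=0.1, C2=1.0, R=1.0)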

Lemma 4.2. Given $X \in \mathbf{R}^{n \times d}$ and mixed norms (cf. (4.2.4)) $\|\cdot\|_{p_1^U \times p_2^U}$ on $\mathbf{R}^{d \times k}$ and $\|\cdot\|_{q_1^V \times q_2^V}$ on $\mathbf{R}^{n \times k}$ with $p_2^U \le q_2^V$, one has
\[
\sup_{\|U\|_{p_1^U \times p_2^U} \le 1} \; \sum_{i=1}^d \|X(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V}
= \|X^\top\|_{q_1^U \times q_1^V},
\]
where $q_1^U$ is the conjugate of $p_1^U$, i.e., $1/p_1^U + 1/q_1^U = 1$.

Proof. First assume $p_2^U = q_2^V$. Let $a_i = \|X(:, i)\|_{q_1^V}$, $u_i = \|U(i, :)\|_{q_2^V}$, $1 \le i \le d$. Then,
\[
\sup_{\|U\|_{p_1^U \times p_2^U} \le 1} \sum_{i=1}^d \|X(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V}
= \sup_{\|u\|_{p_1^U} \le 1} \sum_{i=1}^d a_i u_i
= \|a\|_{q_1^U}
= \|X^\top\|_{q_1^U \times q_1^V}.
\]
Now let $p_2^U < q_2^V$. Then, for any $i \le d$ one has $\|U(i, :)\|_{q_2^V} < \|U(i, :)\|_{p_2^U}$ unless $U(i, :)$ has a single non-zero element, in which case $\|U(i, :)\|_{q_2^V} = \|U(i, :)\|_{p_2^U}$. Hence, the supremum must be attained on such $U$, for which the previous argument applies.

Lemma 4.3. In the setting of Lemma 4.2, for any $q_1^U \ge 1$ and $q_2^U \ge 1$ one has
\[
\sup_{\|V\|_{\infty \times 1} \le 1} \; \sum_{i=1}^n \|X^\top(:, i)\|_{q_1^U} \cdot \|V(i, :)\|_{q_2^U}
= \|X\|_{1 \times q_1^U}.
\]
Proof. The claim follows by instantiating Lemma 4.2.

4.7.4 Proof of Proposition 4.1

By (4.2.10), and verifying that the dual norm to $\|\cdot\|_{2 \times 1}$ is $\|\cdot\|_{2 \times \infty}$, we have
\[
\mathcal{L}_{U,V} = \sup_{\|U\|_{1 \times 1} \le 1} \|XU\|_{2 \times \infty}.
\]
The maximization over the unit ball $\|U\|_{1 \times 1} \le 1$ can be replaced with the maximization over its extremal points, which are the matrices $U$ with zeroes in all positions except for one, in which there is $1$. Let $(i, j)$ be this position; then for every such $U$ we have
\[
\|XU\|_{2 \times \infty}
= \sqrt{\sum_{l=1}^n \sup_j |X(l, :)\, U(:, j)|^2}
= \sqrt{\sum_{l=1}^n |X(l, i)|^2}
= \sqrt{\sum_{l=1}^n |X^\top(i, l)|^2}.
\]
As a result,
\[
\sup_{\|U\|_{1 \times 1} \le 1} \|XU\|_{2 \times \infty}
= \max_{i} \sqrt{\sum_{l=1}^n |X^\top(i, l)|^2}
= \|X^\top\|_{\infty \times 2}.
\]

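In code, $\|X^\top\|_{\infty \times 2}$ is simply the largest Euclidean norm among the columns of $X$; a minimal NumPy sketch of this computation (our own illustration, not taken from the thesis code):

    import numpy as np

    def mixed_inf_2_norm_of_XT(X):
        """||X^T||_{inf x 2}: the maximal l2-norm over the rows of X^T,
        i.e. over the columns of X (illustrates Proposition 4.1)."""
        return np.max(np.linalg.norm(X, axis=0))   # l2-norm of each column, then max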

4.7.5 Proof of Proposition 4.2

We prove an extended result that holds when $\|\cdot\|_U$ and $\|\cdot\|_V$ are more general mixed $(\ell_p \times \ell_q)$-norms, cf. (4.2.4).

Proposition 4.5. Let $\|\cdot\|_U = \|\cdot\|_{p_1^U \times p_2^U}$ on $\mathbf{R}^{2d \times k}$, and $\|\cdot\|_V = \|\cdot\|_{p_1^V \times p_2^V}$ on $\mathbf{R}^{n \times k}$. Then, the optimal solutions $p^* = p^*(\widetilde{X}, U)$ and $q^* = q^*(\widetilde{X}, V, Y)$ to (4.3.3) are given by
\[
p^*_i = \frac{\|\widetilde{X}(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V}}
{\sum_{\imath=1}^{2d} \|\widetilde{X}(:, \imath)\|_{q_1^V} \cdot \|U(\imath, :)\|_{q_2^V}},
\qquad
q^*_j = \frac{\|\widetilde{X}(j, :)\|_{q_1^U} \cdot \|V(j, :) - Y(j, :)\|_{q_2^U}}
{\sum_{\jmath=1}^{n} \|\widetilde{X}(\jmath, :)\|_{q_1^U} \cdot \|V(\jmath, :) - Y(\jmath, :)\|_{q_2^U}}.
\]
Moreover, we can bound their respective variance proxies (cf. (4.3.2)): introducing
\[
\mathcal{L}^2_{U,V} = \frac{1}{n^2} \sup_{\|U\|_U \le 1} \|\widetilde{X} U\|^2_{V^*},
\]
we have, as long as $p_2^U \le q_2^V$,
\[
\sigma^2_{\mathcal{U}}(p^*) \le 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V} + \frac{2}{n^2} R^2_{\mathcal{U}} \|\widetilde{X}^\top\|^2_{q_1^U \times q_1^V},
\]
and, as long as $p_1^V \ge 2$,
\[
\sigma^2_{\mathcal{V}}(q^*) \le 8n\, \mathcal{L}^2_{U,V} + \frac{8}{n^2} \|\widetilde{X}\|^2_{1 \times q_1^U}.
\]

Proof. Note that the dual norms to $\|\cdot\|_{p_1^U \times p_2^U}$ and $\|\cdot\|_{p_1^V \times p_2^V}$ are given by $\|\cdot\|_{q_1^U \times q_2^U}$ and $\|\cdot\|_{q_1^V \times q_2^V}$ respectively, see, e.g., Sra [2012].

$1^o$. For $\mathbf{E}\big[\|\xi_U(p)\|^2_{V^*}\big]$ we have:
\[
\mathbf{E}\big[\|\xi_U(p)\|^2_{V^*}\big]
= \sum_{i=1}^{2d} p_i \Big\| \frac{\widetilde{X} e_i e_i^\top}{p_i}\, U \Big\|^2_{q_1^V \times q_2^V}
= \sum_{i=1}^{2d} \frac{1}{p_i} \big\| \widetilde{X}(:, i) \cdot U(i, :) \big\|^2_{q_1^V \times q_2^V}
= \sum_{i=1}^{2d} \frac{1}{p_i} \|\widetilde{X}(:, i)\|^2_{q_1^V} \cdot \|U(i, :)\|^2_{q_2^V},
\]
where the last transition can be verified directly. The right-hand side can easily be minimized on $\Delta_{2d}$ explicitly, which results in
\[
p^*_i = \frac{\|\widetilde{X}(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V}}
{\sum_{\imath=1}^{2d} \|\widetilde{X}(:, \imath)\|_{q_1^V} \cdot \|U(\imath, :)\|_{q_2^V}}
\quad \text{and} \quad
\mathbf{E}\big[\|\xi_U(p^*)\|^2_{V^*}\big]
= \Big[ \sum_{i=1}^{2d} \|\widetilde{X}(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V} \Big]^2.
\]
Now we can bound $\sigma^2_{\mathcal{U}}(p^*)$ via the triangle inequality:
\[
\begin{aligned}
\sigma^2_{\mathcal{U}}(p^*)
&\le \frac{2}{n^2} \sup_{U \in \mathcal{U}} \|\widetilde{X} U\|^2_{V^*}
+ \frac{2}{n^2} \sup_{U \in \mathcal{U}} \mathbf{E}\big[\|\xi_U(p^*)\|^2_{V^*}\big]
= 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V}
+ \frac{2}{n^2} \sup_{U \in \mathcal{U}} \Big[ \sum_{i=1}^{2d} \|\widetilde{X}(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V} \Big]^2 \\
&= 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V}
+ \frac{2}{n^2} \sup_{\|U\|_U \le R_{\mathcal{U}}} \Big[ \sum_{i=1}^{2d} \|\widetilde{X}(:, i)\|_{q_1^V} \cdot \|U(i, :)\|_{q_2^V} \Big]^2
= 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V}
+ \frac{2}{n^2} R^2_{\mathcal{U}} \|\widetilde{X}^\top\|^2_{q_1^U \times q_1^V},
\end{aligned}
\]
where we used Lemma 4.2 (see Appendix 4.7.3) in the last transition.

$2^o$. We now deal with $\mathbf{E}\big[\|\eta_{V,Y}(q)\|^2_{U^*}\big]$. As previously, we can explicitly compute
\[
q^*_j = \frac{\|\widetilde{X}(j, :)\|_{q_1^U} \cdot \|V(j, :) - Y(j, :)\|_{q_2^U}}
{\sum_{\jmath=1}^{n} \|\widetilde{X}(\jmath, :)\|_{q_1^U} \cdot \|V(\jmath, :) - Y(\jmath, :)\|_{q_2^U}}
\quad \text{and} \quad
\mathbf{E}\big[\|\eta_{V,Y}(q^*)\|^2_{U^*}\big]
= \Big[ \sum_{j=1}^{n} \|\widetilde{X}(j, :)\|_{q_1^U} \cdot \|V(j, :) - Y(j, :)\|_{q_2^U} \Big]^2.
\]
Thus, by the triangle inequality,
\[
\begin{aligned}
\sigma^2_{\mathcal{V}}(q^*)
&\le \frac{2}{n^2} \sup_{(V,Y) \in \mathcal{V} \times \mathcal{V}} \|\widetilde{X}^\top (V - Y)\|^2_{U^*}
+ \frac{2}{n^2} \sup_{(V,Y) \in \mathcal{V} \times \mathcal{V}} \mathbf{E}\big[\|\eta_{V,Y}(q^*)\|^2_{U^*}\big] \\
&\le \frac{2}{n^2} \sup_{\|V\|_{\infty \times 1} \le 2} \|\widetilde{X}^\top V\|^2_{U^*}
+ \frac{2}{n^2} \sup_{\|V\|_{\infty \times 1} \le 2} \Big[ \sum_{j=1}^{n} \|\widetilde{X}(j, :)\|_{q_1^U} \cdot \|V(j, :)\|_{q_2^U} \Big]^2 \\
&= \frac{8}{n} \sup_{\|V\|_{2 \times 1} \le 1} \|\widetilde{X}^\top V\|^2_{U^*}
+ \frac{8}{n^2} \|\widetilde{X}\|^2_{1 \times q_1^U}
\;\le\; \frac{8}{n} \sup_{\|V\|_{p_1^V \times p_2^V} \le 1} \|\widetilde{X}^\top V\|^2_{U^*}
+ \frac{8}{n^2} \|\widetilde{X}\|^2_{1 \times q_1^U}
= 8n\, \mathcal{L}^2_{U,V} + \frac{8}{n^2} \|\widetilde{X}\|^2_{1 \times q_1^U}.
\end{aligned}
\]
Here in the second line we used that the Minkowski sum $\Delta_k + (-\Delta_k)$ belongs to the $\ell_1$-ball with radius 2 (whence $\mathcal{V} + (-\mathcal{V})$ belongs to the $(\ell_\infty \times \ell_1)$-ball with radius 2); in the third line we used Lemma 4.3 (see Appendix 4.7.3) and the relation on $\mathbf{R}^{n \times k}$
\[
\|\cdot\|_{2 \times 1} \le \sqrt{n}\, \|\cdot\|_{\infty \times 1};
\]
lastly, we used that $p_1^V \ge 2$ and that $\|\cdot\|_{p_1^V \times p_2^V}$ is non-increasing in $p_1^V, p_2^V \ge 1$.
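The optimal distributions of Proposition 4.5 are cheap to form once the mixed row norms are available; the following NumPy sketch is our own illustration (X_ext plays the role of the extended matrix, the exponents q1V, q2V, q1U, q2U are passed explicitly, and we assume no all-zero rows, in which case the corresponding weights could be set arbitrarily):

    import numpy as np

    def row_mixed_norm(A, p):
        """l_p norm of each row of A (p may be np.inf)."""
        return np.linalg.norm(A, ord=p, axis=1)

    def optimal_p(X_ext, U, q1V, q2V):
        """p*_i proportional to ||X_ext(:, i)||_{q1V} * ||U(i, :)||_{q2V} (Prop. 4.5)."""
        w = row_mixed_norm(X_ext.T, q1V) * row_mixed_norm(U, q2V)
        return w / w.sum()

    def optimal_q(X_ext, V, Y, q1U, q2U):
        """q*_j proportional to ||X_ext(j, :)||_{q1U} * ||V(j, :) - Y(j, :)||_{q2U}."""
        w = row_mixed_norm(X_ext, q1U) * row_mixed_norm(V - Y, q2U)
        return w / w.sum()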


Proof of Proposition 4.2. We instantiate Proposition 4.5 with $\|\cdot\|_U = \|\cdot\|_{1 \times 1}$ and $\|\cdot\|_V = \|\cdot\|_{2 \times 1}$, and observe, using Proposition 4.1, that for $\widetilde{X} = [X, -X] \in \mathbf{R}^{n \times 2d}$ it holds
\[
\mathcal{L}_{U,V} = \frac{1}{n} \|\widetilde{X}^\top\|_{\infty \times 2} = \frac{1}{n} \|X^\top\|_{\infty \times 2},
\qquad
\|\widetilde{X}\|_{1 \times \infty} = \|X\|_{1 \times \infty}.
\]

4.7.6 Proof of Proposition 4.3

We have
\[
\min_{\substack{p \in \Delta_{2d}, \\ P \in (\Delta_k^\top)^{\otimes 2d}}}
\mathbf{E} \|\xi_U(p, P)\|^2_{2 \times \infty}
= \min_{\substack{p \in \Delta_{2d}, \\ P \in (\Delta_k^\top)^{\otimes 2d}}}
\sum_{i=1}^{2d} \frac{1}{p_i} \|\widetilde{X}(:, i)\|^2_2 \cdot
\Big[ \sum_{l=1}^{k} \frac{1}{P_{il}} \cdot |U(i, l)|^2 \Big]
= \min_{p \in \Delta_{2d}}
\sum_{i=1}^{2d} \frac{1}{p_i} \|\widetilde{X}(:, i)\|^2_2 \cdot \|U(i, :)\|^2_1,
\]
where we carried out the internal minimization explicitly, obtaining
\[
P^*_{il} = \frac{|U_{il}|}{\|U(i, :)\|_1}.
\]
Optimization in $p$ gives:
\[
p^*_i = \frac{\|\widetilde{X}(:, i)\|_2 \cdot \|U(i, :)\|_1}
{\sum_{\imath=1}^{2d} \|\widetilde{X}(:, \imath)\|_2 \cdot \|U(\imath, :)\|_1},
\qquad
\mathbf{E} \|\xi_U(p^*, P^*)\|^2_{2 \times \infty}
= \Big[ \sum_{i=1}^{2d} \|\widetilde{X}(:, i)\|_2 \cdot \|U(i, :)\|_1 \Big]^2.
\]

Defining $\mathcal{L}_{U,V} = \frac{1}{n} \sup_{\|U\|_U \le 1} \|\widetilde{X} U\|_{V^*}$ and proceeding as in the proof of Proposition 4.5, we get
\[
\begin{aligned}
\sigma^2_{\mathcal{U}}(p^*, P^*)
&\le \frac{2}{n^2} \sup_{U \in \mathcal{U}} \|\widetilde{X} U\|^2_{2 \times \infty}
+ \frac{2}{n^2} \sup_{U \in \mathcal{U}} \mathbf{E}\big[\|\xi_U(p^*, P^*)\|^2_{2 \times \infty}\big]
= 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V}
+ \frac{2}{n^2} \sup_{U \in \mathcal{U}} \Big[ \sum_{i=1}^{2d} \|\widetilde{X}(:, i)\|_2 \cdot \|U(i, :)\|_1 \Big]^2 \\
&= 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V}
+ \frac{2 R^2_{\mathcal{U}}}{n^2} \sup_{\|U\|_{1 \times 1} \le 1} \Big[ \sum_{i=1}^{2d} \|\widetilde{X}(:, i)\|_2 \cdot \|U(i, :)\|_1 \Big]^2
= 2 R^2_{\mathcal{U}} \mathcal{L}^2_{U,V}
+ \frac{2}{n^2} R^2_{\mathcal{U}} \|\widetilde{X}^\top\|^2_{2 \times 1}
= \frac{4}{n^2} R^2_{\mathcal{U}} \|X^\top\|^2_{2 \times 1},
\end{aligned}
\]
where in the last two transitions we used Lemma 4.2 and Proposition 4.1 (note that $\|\widetilde{X}^\top\|_{2 \times 1} = \|X^\top\|_{2 \times 1}$). Note that the last transition requires that $\|\cdot\|_U$ has $\ell_1$-geometry in the classes; otherwise, Lemma 4.2 cannot be applied.

To obtain $(q^*, Q^*)$ we proceed in a similar way:
\[
\min_{\substack{q \in \Delta_n, \\ Q \in (\Delta_k^\top)^{\otimes n}}}
\mathbf{E} \|\eta_{V,Y}(q, Q)\|^2_{\infty \times \infty}
= \min_{\substack{q \in \Delta_n, \\ Q \in (\Delta_k^\top)^{\otimes n}}}
\sum_{j=1}^{n} \frac{1}{q_j} \|\widetilde{X}(j, :)\|^2_\infty \cdot
\Big[ \sum_{l=1}^{k} \frac{1}{Q_{jl}} \cdot |V(j, l) - Y(j, l)|^2 \Big]
= \min_{q \in \Delta_n}
\sum_{j=1}^{n} \frac{1}{q_j} \|\widetilde{X}(j, :)\|^2_\infty \cdot \|V(j, :) - Y(j, :)\|^2_1,
\]
which results in
\[
q^*_j = \frac{\|\widetilde{X}(j, :)\|_\infty \cdot \|V(j, :) - Y(j, :)\|_1}
{\sum_{\jmath=1}^{n} \|\widetilde{X}(\jmath, :)\|_\infty \cdot \|V(\jmath, :) - Y(\jmath, :)\|_1},
\qquad
Q^*_{jl} = \frac{|V_{jl} - Y_{jl}|}{\|V(j, :) - Y(j, :)\|_1}.
\]
The corresponding variance proxy can then be bounded in the same way as in the proof of Proposition 4.5.
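For these norms the optimal distributions reduce to simple column and row statistics; a NumPy sketch of $(p^*, P^*)$ and $(q^*, Q^*)$ under our own naming (X_ext denotes the extended matrix; we assume no all-zero rows, in which case the corresponding weights could be chosen arbitrarily):

    import numpy as np

    def full_ss_primal_distributions(X_ext, U):
        """(p*, P*) from the proof of Proposition 4.3 (sketch)."""
        col_norms = np.linalg.norm(X_ext, axis=0)     # ||X_ext(:, i)||_2
        row_l1 = np.abs(U).sum(axis=1)                # ||U(i, :)||_1
        p = col_norms * row_l1
        p /= p.sum()
        P = np.abs(U) / row_l1[:, None]               # P*_{il} = |U_il| / ||U(i, :)||_1
        return p, P

    def full_ss_dual_distributions(X_ext, V, Y):
        """(q*, Q*) from the proof of Proposition 4.3 (sketch)."""
        row_inf = np.abs(X_ext).max(axis=1)           # ||X_ext(j, :)||_inf
        diff = np.abs(V - Y)
        diff_l1 = diff.sum(axis=1)                    # ||V(j, :) - Y(j, :)||_1
        q = row_inf * diff_l1
        q /= q.sum()
        Q = diff / diff_l1[:, None]                   # Q*_{jl} = |V_jl - Y_jl| / ||...||_1
        return q, Q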

4.7.7 Correctness of subroutines in Algorithm 1

In this section, we recall the subroutines used in Algorithm 1, namely those performing the lazy updates and tracking the running averages, and demonstrate their correctness.

Procedure 1 UpdatePrimal
Require: $U \in \mathbf{R}^{2d \times k}$, $\alpha, \pi, \eta \in \mathbf{R}^{2d}$, $l \in [k]$, $\gamma$, $\lambda$, $R_1$
1: $L \equiv \log(2dk)$
2: for $i = 1$ to $2d$ do
3:   $\mu_i = \pi_i - \alpha_i \cdot U(i, l) \cdot (1 - e^{-2\gamma L R_* \eta_i / n})$
4: $M = \sum_{i=1}^{2d} \mu_i$
5: $\nu = \min\{ e^{-2\gamma L R_* \lambda},\, R_*/M \}$
6: for $i = 1$ to $2d$ do
7:   $U(i, l) \leftarrow U(i, l) \cdot e^{-2\gamma L R_* \eta_i / n}$
8:   $\alpha^+_i = \nu \cdot \alpha_i$
9:   $\pi^+_i = \nu \cdot \mu_i$
Ensure: $U$, $\alpha^+$, $\pi^+$

Primal updates (Procedure 1). To demonstrate the correctness of Procedure 1, we prove the following result:

Lemma 4.4. Suppose that at the $t$-th iteration of Algorithm 1, Procedure 1 was fed with $U = \bar{U}^t$, $\alpha = \alpha^t$, $\pi = \pi^t$, $\eta = \eta^t$, $l = l_t$, for which one had
\[
\bar{U}^t(:, l) \circ \alpha^t = U^t(:, l), \quad \forall l \in [k],
\tag{4.7.16}
\]
where $U^t$ is the $t$-th primal iterate of (S-MD) equipped with (Full-SS) with the optimal sampling distributions (4.3.8), $\bar{U}^t$ is the matrix stored by the algorithm in the lazy representation, and $\eta^t$ was the only non-zero column $\eta_{V^t,Y}(:, l_t)$ of $\eta_{V^t,Y}$. Moreover, suppose also that
\[
\pi^t(\imath) = \|U^t(\imath, :)\|_1 = \sum_{l \in [k]} U^t(\imath, l), \quad \imath \in [2d],
\tag{4.7.17}
\]
were the correct norms at the $t$-th step. Then Procedure 1 will output $\bar{U}^{t+1}$, $\alpha^{t+1}$, $\pi^{t+1}$ such that
\[
\bar{U}^{t+1}(:, l) \circ \alpha^{t+1} = U^{t+1}(:, l), \quad \forall l \in [k],
\]
and
\[
\pi^{t+1}(\imath) = \sum_{l \in [k]} U^{t+1}(\imath, l), \quad \imath \in [2d].
\]

Proof. Recall that the matrix $\eta_{V^t,Y}$ produced in (Full-SS) has a single non-zero column $\eta^t = \eta_{V^t,Y}(:, l_t)$, and according to (S-MD), the primal update $U^t \to U^{t+1}$ writes as (cf. (4.2.23)):
\[
U^{t+1}_{il} = U^t_{il} \cdot e^{-2\gamma_t R_* L\, \eta_{V^t,Y}(i,l)/n} \cdot \min\{ e^{-2\gamma_t R_* L \lambda},\, R_*/M^t \},
\]
where
\[
L := \log(2dk), \qquad
M^t := \sum_{i=1}^{2d} \sum_{l=1}^{k} U^t_{il} \cdot e^{-2\gamma_t R_* L\, \eta_{V^t,Y}(i,l)/n}.
\]
This can be rewritten as
\[
U^{t+1}_{il} =
\begin{cases}
U^t_{il} \cdot \nu \cdot q^t_i, & l = l_t, \\
U^t_{il} \cdot \nu, & l \ne l_t,
\end{cases}
\tag{4.7.18}
\]
where
\[
q^t_i = e^{-2\gamma_t L R_* \eta^t_i / n}, \qquad
\nu = \min\{ e^{-2\gamma_t L R_* \lambda},\, R_*/M^t \}, \qquad
M^t = \sum_{i \in [2d]} U^t_{i, l_t} \cdot q^t_i + \sum_{i \in [2d]} \sum_{l \in [k] \setminus \{l_t\}} U^t_{il}.
\]
Thus, $M^t$ and $\nu$ can be expressed via $\pi^t(i) = \sum_{l \in [k]} U^t(i, l)$, cf. (4.7.17):
\[
M^t = \sum_{i \in [2d]} \Big[ \pi^t_i - \underbrace{\alpha^t_i\, \bar{U}^t_{i, l_t}}_{U^t_{i, l_t}} (1 - q^t_i) \Big],
\tag{4.7.19}
\]
where we used the premise (4.7.16). Now we can see that the lazy updates of $\bar{U}$ can be expressed as
\[
\alpha^{t+1}_i = \nu \cdot \alpha^t_i, \qquad
\bar{U}^{t+1}_{i, l_t} = \bar{U}^t_{i, l_t} \cdot q^t_i,
\tag{4.7.20}
\]
and the updates for the norms $\pi^{t+1}$ as
\[
\pi^{t+1}_i = \nu \Big[ \pi^t_i + \alpha^t_i\, \bar{U}^t_{i, l_t} (q^t_i - 1) \Big].
\tag{4.7.21}
\]
One can immediately verify that this is exactly the update produced in the call of Procedure 1 in line 13 of Algorithm 1.
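For concreteness, the lazy update (4.7.18)--(4.7.21) can be written in a few lines of NumPy; the sketch below mirrors Procedure 1 under our own naming (U_bar is the lazily stored matrix, alpha the common scaling, pi the cached row sums) and is only an illustration, not the actual implementation used in Algorithm 1. Note that it touches only one column of U_bar and two length-$2d$ vectors, hence $O(d)$ work per call.

    import numpy as np

    def update_primal_lazy(U_bar, alpha, pi, eta, l, gamma, lam, R_star, n, d, k):
        """One lazy primal update: the true iterate is U^t(i, :) = alpha_i * U_bar(i, :)."""
        L = np.log(2 * d * k)
        q = np.exp(-2.0 * gamma * L * R_star * eta / n)    # q_i, cf. (4.7.18)
        mu = pi - alpha * U_bar[:, l] * (1.0 - q)          # mu_i, as in Procedure 1
        M = mu.sum()
        nu = min(np.exp(-2.0 * gamma * L * R_star * lam), R_star / M)
        U_bar[:, l] = U_bar[:, l] * q                      # only the sampled column changes
        alpha_new = nu * alpha                             # global rescaling
        pi_new = nu * mu                                   # updated row sums
        return U_bar, alpha_new, pi_new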

Procedure 2 UpdateDual
Require: $V, Y \in \mathbf{R}^{n \times k}$, $\beta, \rho, \xi \in \mathbf{R}^n$, $\ell \in [k]$, $y \in [k]^{\otimes n}$, $\gamma$
1: $\theta = e^{-2\gamma \log(k)}$
2: for $j = 1$ to $n$ do
3:   $\omega_j = e^{2\gamma \log(k)\, \xi_j}$
4:   $\varepsilon_j = e^{-2\gamma \log(k)\, Y(j, \ell)}$
5:   $\chi_j = 1 - \beta_j \cdot V(j, \ell) \cdot (1 - \omega_j \cdot \varepsilon_j)$
6:   if $\ell \ne y_j$ then   # not the actual class of $j$ drawn
7:     $\chi_j \leftarrow \chi_j - \beta_j \cdot V(j, y_j) \cdot (1 - \theta)$
8:   $\beta^+_j = \beta_j / \chi_j$
9:   $V(j, \ell) \leftarrow V(j, \ell) \cdot \omega_j \cdot \varepsilon_j$
10:  $V(j, y_j) \leftarrow V(j, y_j) \cdot \omega_j \cdot \theta$
11:  $\rho^+_j = 2 - 2\beta^+_j \cdot V(j, y_j)$
Ensure: $V$, $\beta^+$, $\rho^+$

Dual updates (Procedure 2). To demonstrate the correctness of Procedure 2, we prove the following result:

Lemma 4.5. Suppose that at the $t$-th iteration of Algorithm 1, Procedure 2 was fed with $V = \bar{V}^t$, $\beta = \beta^t$, $\rho = \rho^t$, $\xi = \xi^t$, $\ell = \ell_t$, for which one had
\[
\bar{V}^t(:, l) \circ \beta^t = V^t(:, l), \quad \forall l \in [k],
\tag{4.7.22}
\]
where $V^t$ is the $t$-th dual iterate of (S-MD) equipped with (Full-SS) with the optimal sampling distributions (4.3.8), $\bar{V}^t$ is the matrix stored by the algorithm in the lazy representation, and $\xi^t$ was the only non-zero column $\xi_{U^t}(:, \ell_t)$ of $\xi_{U^t}$. Moreover, suppose also that
\[
\rho^t(\jmath) = \|V^t(\jmath, :) - Y(\jmath, :)\|_1, \quad \jmath \in [n],
\tag{4.7.23}
\]
were the correct norms at the $t$-th step. Then Procedure 2 will output $\bar{V}^{t+1}$, $\beta^{t+1}$, $\rho^{t+1}$ such that
\[
\bar{V}^{t+1}(:, l) \circ \beta^{t+1} = V^{t+1}(:, l), \quad \forall l \in [k],
\]
and
\[
\rho^{t+1}(\jmath) = \|V^{t+1}(\jmath, :) - Y(\jmath, :)\|_1, \quad \jmath \in [n].
\]

Proof. Recall that the random matrix $\xi_{U^t}$ has a single non-zero column $\xi^t := \xi_{U^t}(:, \ell_t)$, and according to (S-MD), the update $V^t \to V^{t+1}$ writes as (cf. (4.2.24)):
\[
V^{t+1}_{jl} = \frac{V^t_{jl} \cdot \exp\big[ 2\gamma_t \log(k) \cdot (\xi_{U^t}(j, l) - Y(j, l)) \big]}
{\sum_{\ell=1}^{k} V^t_{j\ell} \cdot \exp\big[ 2\gamma_t \log(k) \cdot (\xi_{U^t}(j, \ell) - Y(j, \ell)) \big]}.
\tag{4.7.24}
\]
Note that in each row $j$, all entries of the matrix $\xi_{U^t} - Y$ are zero, except for at most two entries in the columns $\ell_t$ and $y_j$, where $y_j$ is the actual label of the $j$-th training example, that is, the only $l \in [k]$ for which $Y(j, l) = 1$; hence the corresponding exponential factors in (4.7.24) are equal to $1$. Recall also that $\sum_{\ell \in [k]} V^t(j, \ell) = 1$ for any $j \in [n]$. Thus, introducing
\[
\omega_j = e^{2\gamma_t \log(k)\, \xi^t_j}, \qquad
\varepsilon_j = e^{-2\gamma_t \log(k)\, Y(j, \ell_t)}
\]
as defined in Procedure 2, we can express the denominator in (4.7.24) as
\[
\chi_j =
\begin{cases}
1 - \underbrace{\beta^t_j \cdot \bar{V}^t_{j, \ell_t}}_{V^t_{j, \ell_t}} \cdot (1 - \omega_j \cdot \varepsilon_j), & \text{if } \ell_t = y_j, \\[6pt]
1 - \underbrace{\beta^t_j \cdot \bar{V}^t_{j, \ell_t}}_{V^t_{j, \ell_t}} \cdot (1 - \omega_j \cdot \varepsilon_j)
- \underbrace{\beta^t_j \cdot \bar{V}^t_{j, y_j}}_{V^t_{j, y_j}} \cdot (1 - e^{-2\gamma_t \log(k)}), & \text{if } \ell_t \ne y_j,
\end{cases}
\tag{4.7.25}
\]
where we used the premise (4.7.22). One can verify that this corresponds to the value of $\chi_j$ produced by line 7 of Procedure 2. Then, examining the numerator in (4.7.24), we can verify that lines 8--10 guarantee that
\[
\bar{V}^{t+1}_{j, l} \cdot \beta^{t+1}_j = V^{t+1}_{j, l}, \quad \forall l \in [k]
\]
holds for the updated values. To verify the second invariant, we combine this result with the premise (4.7.23). This gives
\[
\rho^{t+1}_j = 2 - 2 V^{t+1}_{j, y_j} = 2 - 2\beta^{t+1}_j \cdot \bar{V}^{t+1}_{j, y_j},
\tag{4.7.26}
\]
which indeed corresponds to the update in line 11 of the procedure.
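For intuition, the exact (non-lazy) dual update (4.7.24) is a row-wise exponentiated, softmax-style step, which the lazy Procedure 2 reproduces in $O(n)$ time per iteration. A dense NumPy sketch of (4.7.24) itself, purely as our own illustration:

    import numpy as np

    def dual_update_dense(V, Xi_U, Y, gamma, k):
        """Exact dual update (4.7.24): multiplicative step followed by row renormalization."""
        W = V * np.exp(2.0 * gamma * np.log(k) * (Xi_U - Y))   # unnormalized numerators
        return W / W.sum(axis=1, keepdims=True)                # each row sums to one again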

Correctness of tracking the cumulative sums.

We only consider the primal variables (Procedure 3 and line 21 of Algorithm 1); the complementary case can be treated analogously. Note that, due to the previous two lemmas, at any iteration $t$ of Algorithm 1, Procedure 3 is fed with $l = l_t$, $U = \bar{U}^t$, $\alpha = \alpha^t$ for which it holds that $\bar{U}^t(:, l) \circ \alpha^t = U^t(:, l)$ for all $l \in [k]$. Now, assume that all previous input values $A^\tau$, $\tau \le t$, of the variable $A$, and the current inputs $A^t_{\mathrm{pr}}$, $U^t_\Sigma$ of the variables $A_{\mathrm{pr}}$, $U_\Sigma$, satisfy the following:
\[
A^\tau = \sum_{s=0}^{\tau-1} \alpha^s, \quad \forall \tau \le t,
\tag{4.7.27}
\]
\[
A^t_{\mathrm{pr}}(i, l) = A^{\tau^t(i,l)}_i,
\tag{4.7.28}
\]
\[
U^t_\Sigma(i, l) = \sum_{s=0}^{\tau^t(i,l)} U^s(i, l),
\tag{4.7.29}
\]
where $0 \le \tau^t(i, l) \le t - 1$ is the latest moment $s$, strictly before $t$, when the sampled $l_s \in [k]$ coincided with the given $l$:
\[
\tau^t(i, l) = \operatorname*{arg\,max}_{s \le t-1} \{ s : l_s = l \}.
\tag{4.7.30}
\]

Let us show that this invariant will be preserved after the call of Procedure 3; in other words, that (4.7.27)–(4.7.30) hold for $t + 1$, i.e., for the output values $U^{t+1}_\Sigma$, $A^{t+1}_{\mathrm{pr}}$, $A^{t+1}$ (note that the variables $U_\Sigma$, $A_{\mathrm{pr}}$, $A$ only change within Procedure 3, so their output values are also the input values at the next iteration).

Proof. Indeed, it is clear that (4.7.27) will be preserved (cf. line 4 of Procedure 3). To verify (4.7.28), note that $A_{\mathrm{pr}}(i, l)$ only gets updated when $l = l_t$ (cf. line 3), and in this case we will have $\tau^{t+1}(i, l) = t$, and otherwise $\tau^{t+1}(i, l) = \tau^t(i, l)$, cf. (4.7.30).

Thus, it only remains to verify the validity of (4.7.29) after the update. To this end, note that by (4.7.30) we know that the value $U^s(i, l)$ of the variable $U(i, l)$ remained constant for $\tau^t(i, l) \le s < t$, and it will not change after the call at the $t$-th iteration unless $l_t = l$, that is, unless $\tau^{t+1}(i, l) = t$. This is exactly when line 2 is invoked, and it ensures (4.7.29) for $t + 1$.

Finally, invoking (4.7.27)–(4.7.30) at $t = T$, we see that line 21 results in the correct final value $\sum_{t=0}^{T} U^t$ of the cumulative sum $U_\Sigma$. Thus, the correctness of Algorithm 1 is verified.

4.7.8 Additional remarks on Algorithm 1

Removing the $O(dk)$ complexity term. In fact, the extra term $O(dk)$ in the runtime and memory complexities of Algorithm 1 can easily be avoided. To see this, recall that when solving the simplex-constrained CCSPP (4.2.17), we are foremost interested in solving the $\ell_1$-constrained CCSPP (4.2.2), and an $\varepsilon$-accurate solution $U = [U_1; U_2] \in \mathbf{R}^{2d \times k}$ to (4.2.17) yields an $\varepsilon$-accurate solution $U_1 - U_2 \in \mathbf{R}^{d \times k}$ to (4.2.2). Recall that we initialize Algorithm 1 with $\bar{U}^0 = 1_{2d \times k}$ and $\alpha^0 = 1_{2d}$, which corresponds to the zero initial candidate solution to (4.2.2). Moreover, at any iteration we change a single column of $\bar{U}^t$ and scale the whole scaling vector $\alpha^t$ by a constant (in fact, all entries of $\alpha^t$ are always equal to each other; we omitted this fact in the main text to simplify the presentation, since the entries of $\beta$ generally have different values). Hence, the final candidate solution $\widehat{U}^T_1 - \widehat{U}^T_2$ to the $\ell_1$-constrained problem will actually have at most $O(dT)$ non-zero entries, which correspond to the entries of $\bar{U}$ that were changed in the course of the algorithm. To exploit this, we can modify Algorithm 1 as follows (a sketch of the resulting bookkeeping in code is given below):

— Instead of explicitly initializing and storing the whole matrices $\bar{U}$, $U_\Sigma$, and $A_{\mathrm{pr}}$, we can hard-code the "default" value $\bar{U}(i, l) = 1$ (cf. line 2 of Algorithm 1), and use a bit mask to flag the entries $\bar{U}(i, l)$ that have already been changed at least once. This mask can be stored as a list Changed of pairs $(i, l)$, i.e., in a sparse form.

— When post-processing the cumulative sum $U_\Sigma$ (see line 21 of Algorithm 1), instead of post-processing all entries of $U_\Sigma$, we can process only those in the list Changed and ignore the remaining ones, since the corresponding entries of the candidate solution to (4.2.2) will have zero values. We can then directly output this candidate solution in a sparse form.

It is clear that such a modification of Algorithm 1 replaces the $O(dk)$ term in the runtime complexity with $O(dT)$ (which is always an improvement, since $O(d)$ arithmetic operations are done anyway in each iteration); moreover, the memory complexity changes from $O(dn + nk + dk)$ to $O(dn + nk + d\min(T, k))$.
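A minimal Python sketch of this sparse bookkeeping (our own illustration; the dictionary keys play the role of the list Changed, and the hard-coded default value 1 is kept implicit):

    class SparsePrimalBuffer:
        """Stores only the entries of the buffer that were ever modified;
        untouched entries keep the hard-coded default value 1 (sketch)."""

        def __init__(self, default=1.0):
            self.default = default
            self.changed = {}        # (i, l) -> current stored value

        def get(self, i, l):
            return self.changed.get((i, l), self.default)

        def set(self, i, l, value):
            self.changed[(i, l)] = value     # flags (i, l) as changed

        def postprocess(self, fn):
            """Apply fn only to touched entries, as in line 21 of Algorithm 1;
            untouched entries of the final candidate solution are zero anyway."""
            return {key: fn(val) for key, val in self.changed.items()}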

Infeasibility of the noisy dual iterates. Note that when we generate an estimate of the primal gradient $X^\top(V^t - Y)$ according to (Full-SS) or (Part-SS), we also obtain an unbiased estimate of the dual iterate $V^t$, and vice versa. In the setup with vector variables, Juditsky and Nemirovski [2011b] propose to average such noisy iterates instead of the actual iterates $(U^t, V^t)$ as we do in (S-MD). Averaging of noisy iterates is easier to implement since they are sparse (one does not need to track the cumulative sums), and one could show similar guarantees for the primal accuracy of their running averages. However, in the case of the dual variable its noisy counterpart is infeasible (see Juditsky and Nemirovski [2011b, Sec. 2.5.1]); as a result, one loses the guarantee for the duality gap. Hence, we prefer to track the averages of the actual iterates of (S-MD), as we do in Algorithm 1.


Chapter 5

Conclusion and Future Work

In this thesis we considered several parameter estimation problems: the extension of the moment-matching technique to non-linear regression, a special type of averaging for generalized linear models, and sublinear algorithms for multiclass Fenchel-Young losses. We obtained theoretical guarantees for the corresponding convex and saddle-point minimization problems.

In Chapter 2 we considered a general non-linear regression model under the assumption that the response depends on the covariates only through an unknown $k$-dimensional subspace. Our goal was the direct estimation of this unknown $k$-dimensional space, which is often called the effective dimension reduction or e.d.r. space. We proposed new approaches (SADE and SPHD), combining two existing techniques (sliced inverse regression (SIR) and score function-based estimation). We obtained consistent estimation for $k > 1$ using only the first-order score, and proposed explicit approaches to learn the score from data.

It would be interesting to extend our sliced approaches to learning neural networks: indeed, our work focused on the subspace spanned by $w$ and cannot identify individual columns, while Janzamin et al. [2015] used tensor approaches to estimate the columns of $w$ in polynomial time. Another direction is to use smarter score matching techniques to estimate score functions from the data, that is, to learn only the relevant directions. Indeed, our current approach for learning score functions relies on a parametric assumption, namely that the score is a linear combination of $m$ basis functions, and this number grows with the number of observations. This disadvantage can be avoided if we manage to learn scores only in the relevant directions.

In Chapter 3 we have explored how averaging procedures in stochastic gradient descent, which are crucial for fast convergence, could be improved by looking at the specifics of probabilistic modeling. Namely, averaging in the moment parameterization can have better properties than averaging in the natural parameterization.

While we have provided some theoretical arguments (asymptotic expansion in the finite-dimensional case, convergence to optimal predictions in the infinite-dimensional case), a detailed theoretical analysis with explicit convergence rates would provide a better understanding of the benefits of averaging predictions.


In Chapter 4 we considered Fenchel-Young losses and the corresponding saddle-point problems for multi-class classification. We developed a sublinear algorithm, i.e., one whose computational complexity per iteration is $O(d + n + k)$, where $d$ is the number of features, $n$ is the sample size, and $k$ is the number of classes.

Though our approach has sublinear rates for the classical SVM setup, it is interesting to investigate the efficiency estimates for more general geometries and losses (which is actually partly done). The multinomial logit loss, or softmax loss, is widely used in Natural Language Processing (NLP) and recommendation systems, where $n$, $d$ and $k$ are on the order of millions or even billions (Chelba et al. [2013], Partalas et al. [2015]), and developing an $O(n + d + k)$ approach would be an important development in this area. Even though we provided simple experiments in this chapter, larger experiments with more pronounced convergence behavior would be a good illustration of our approach. We are planning to put the code online, since a full implementation of Algorithm 1 is time-consuming. Another direction of research is to develop the approach with flexible step-sizes, which could be learned from the data as well.


Bibliography

H. Akaike. Information theory and an extension of the maximum likelihood principle.In Selected papers of hirotugu akaike, pages 199โ€“213. Springer, 1998.

G. Arfken. Divergence. In Mathematical Methods for Physicists, chapter 1.7, pages37โ€“42. Academic Press, Orlando, FL, 1985.

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Ma-chine Learning, 73(3):243โ€“272, 2008.

D. Babichev and F. Bach. Constant step size stochastic gradient descent for proba-bilistic modeling. Proceedings in Uncertainty in Artificial Intelligence, pages 219โ€“228, 2018a.

D. Babichev and F. Bach. Slice inverse regression with score functions. ElectronicJournal of Statistics, 12(1):1507โ€“1543, 2018b.

F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference onLearning Theory, pages 185โ€“209, 2013.

F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algo-rithms for machine learning. In Adv. NIPS, 2011.

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximationwith convergence rate ๐‘‚(1/๐‘›). Advances in Neural Information Processing Systems(NIPS), 2013.

H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond lipschitz gra-dient continuity: first-order methods revisited and applications. Mathematics ofOperations Research, 42(2):330โ€“348, 2016.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient meth-ods for convex optimization. Operations Research Letters, 31(3):167โ€“175, 2003.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linearinverse problems. SIAM journal on imaging sciences, 2(1):183โ€“202, 2009.

S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately max-imizing agreements. Journal of Computer and System Sciences, 66(3):496โ€“514,2003.

129

Page 145: On efficient methods for high-dimensional statistical

D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.

C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

M. Blondel, A. F. T. Martins, and V. Niculae. Learning classifiers with fenchel-young losses: Generalized entropies, margins, and algorithms. arXiv preprintarXiv:1805.09717, 2018.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online andactive learning. Journal of Machine Learning Research, 6(Sep):1579โ€“1619, 2005.

J. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization: theory andexamples. Springer Science & Business Media, 2010.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machinelearning. Technical Report 1606.04838, arXiv, 2016.

S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymp-totic theory of independence. Oxford University Press, 2013.

D. R. Brillinger. A Generalized Linear Model with โ€™Gaussianโ€™ Regressor Variables.In K.A. Doksum P.J. Bickel and J.L. Hodges, editors, A Festschrift for Erich L.Lehmann. Woodsworth International Group, Belmont, California, 1982.

S. Bubeck. Convex optimization: Algorithms and complexity. Foundations andTrendsยฎ in Machine Learning, 8(3-4):231โ€“357, 2015.

S. Cambanis, S Huang, and G Simons. On the Theory of Elliptically ContouredDistributions. Journal of Multivariate Analysis, 11(3):368โ€“385, 1981.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algo-rithm. Foundations of Computational Mathematics, 7(3):331โ€“368, 2007.

G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove,CA, 2002.

C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machinelearning. Journal of the ACM (JACM), 59(5):23, 2012.

R. D. Cook. Save: a method for dimension reduction and graphics in regression.Communications in Statistics - Theory and Methods, 29:2109โ€“2121, 2000.

R. D. Cook and H. Lee. Dimension Reduction in Binary Response Regression. Journalof the American Statistical Association, 94:1187โ€“1200, 1999.

130

Page 146: On efficient methods for high-dimensional statistical

R. D. Cook and S. Weisberg. Discussion of โ€™Sliced Inverse Regressionโ€™ by K. C. Li.Journal of the American Statistical Association, 86:328โ€“332, 1991.

C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273โ€“297,1995.

A. S. Dalalyan, A. Juditsky, and V. Spokoiny. A new algorithm for estimating theeffective dimension-reduction subspace. Journal of Machine Learning Research, 9:1647โ€“1678, 2008.

A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient methodwith support for non-strongly convex composite objectives. In Advances in neuralinformation processing systems, pages 1646โ€“1654, 2014.

A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Ann. Statist., 44(4):1363โ€“1399, 08 2016.

A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant stepsize stochastic gradient descent and markov chains. Technical Report 1707.06386,arXiv, 2017.

W.F. Donoghue, Jr. Monotone Matrix Functions and Analytic Continuation.Springer, 1974.

N. Duan and K.-C. Li. Slicing regression: a link-free regression method. The Annalsof Statistics, 19:505โ€“530, 1991.

J. C Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirrordescent. In COLT, pages 14โ€“26, 2010.

V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of mono-mials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558โ€“1590, 2012.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression.The Annals of Statistics, 37(4):1871โ€“1905, 2009.

D. Garber and E. Hazan. Approximating semidefinite programs in sublinear time. InAdvances in Neural Information Processing Systems, pages 1080โ€“1088, 2011.

D. Garber and E. Hazan. Sublinear time algorithms for approximate semidefiniteprogramming. Mathematical Programming, 158(1-2):329โ€“361, 2016.

W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo inpractice. CRC press, 1995.

A. S. Goldberger. Econometric Theory. John Wiley & Sons, 1964.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

131

Page 147: On efficient methods for high-dimensional statistical

M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximationalgorithm for matrix games. Operations Research Letters, 18(2):53โ€“58, 1995.

L. Gyรถrfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of non-parametric regression. Springer series in statistics. Springer, New York, 2002.

L. P. Hansen. Large sample properties of generalized method of moments estimators.Econometrica: Journal of the Econometric Society, pages 1029โ€“1054, 1982.

E. Hazan, T. Koren, and N. Srebro. Beating SGD: Learning SVMs in sublinear time.In Advances in Neural Information Processing Systems, pages 1233โ€“1241, 2011.

Niao He, Anatoli Juditsky, and Arkadi Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Opti-mization and Applications, 61(2):275โ€“319, 2015.

J. M. Hilbe. Negative binomial regression. Cambridge University Press, 2011.

J. Hooper. Simultaneous Equations and Canonical Correlation Theory. Econometrica,27:245โ€“256, 1959.

J. L. Horowitz. Semiparametric methods in econometrics, volume 131. SpringerScience & Business Media, 2012.

M. Hristache, A. Juditsky, and V. Spokoiny. Direct estimation of the index coefficientin a single index model. The Annals of Statistics, 29(3):595โ€“623, 2001.

T. Hsing and R. J. Carroll. An asymptotic theory for sliced inverse regression. TheAnnals of Statistics, 20(2):1040โ€“1061, 1992.

A. Hyvรคrinen. Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6:695โ€“709, 2005.

A. Hyvรคrinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46.John Wiley & Sons, 2004.

M. Janzamin, H. Sedghi, and A. Anandkumar. Score function features for discrim-inative learning: Matrix and tensor framework. arXiv preprint arXiv:1412.2863,2014.

M. Janzamin, H. Sedghi, and A. Anandkumar. Generalization Bounds for NeuralNetworks through Tensor Factorization. arXiv preprint arXiv:1412.2863, 2015.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictivevariance reduction. In Advances in Neural Information Processing Systems, pages315โ€“323, 2013.

A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scaleoptimization, I: General purpose methods. Optimization for Machine Learning,pages 121โ€“148, 2011a.

132

Page 148: On efficient methods for high-dimensional statistical

A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: utilizing problem's structure. Optimization for Machine Learning, pages 149–183, 2011b.

A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities withstochastic mirror-prox algorithm. Stochastic Systems, 1(1):17โ€“58, 2011.

Johannes HB Kemperman. On the optimum rate of transmitting information. InProbability and information theory, pages 126โ€“169. Springer, 1969.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques- Adaptive Computation and Machine Learning. The MIT Press, 2009.

G. M. Korpelevich. Extragradient method for finding saddle points and other prob-lems. Matekon, 13(4):35โ€“49, 1977.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilisticmodels for segmenting and labeling sequence data. In Proc. ICML, 2001.

G. Lan. An optimal method for stochastic composite optimization. MathematicalProgramming, 133(1-2):365โ€“397, 2012.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by modelselection. The Annals of Statistics, 28(5):1302โ€“1338, 2000.

L. LeCam. On some asymptotic properties of maximum likelihood estimates andrelated bayes estimates. Univ. California Pub. Statist., 1:277โ€“330, 1953.

E. L. Lehmann and G. Casella. Theory of point estimation. Springer Science &Business Media, 2006.

K.-C. Li. Sliced Inverse Regression for Dimensional Reduction. Journal of the Amer-ican Statistical Association, 86:316โ€“327, 1991.

K.-C. Li. On Principal Hessian Directions for Data Visualization and DimensionReduction: Another Application of Steinโ€™s Lemma. Journal of the American Sta-tistical Association, 87:1025โ€“1039, 1992.

K.-C. Li and N. Duan. Regression analysis under link violation. The Annals ofStatistics, 17:1009โ€“1052, 1989.

M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.

uci.edu/ml.

Q. Lin, Z. Zhao, and J. S. Liu. On consistency and sparsity for sliced inverse regressionin high dimensions. The Annals of Statistics, 46(2):580โ€“610, 2018.

P. McCullagh. Generalized linear models. European Journal of Operational Research,16(3):285โ€“292, 1984.

133

Page 149: On efficient methods for high-dimensional statistical

P. McCullagh and J. A. Nelder. Generalized linear models, volume 37. CRC press,1989.

A. M. McDonald, M. Pontil, and S. Stamos. Spectral ๐‘˜-support norm regularization.In Advances in Neural Information Processing Systems, 2014.

S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer-VerlagInc, Berlin; New York, 1993.

J.-J. Moreau. Proximitรฉ et dualitรฉ dans un espace hilbertien. Bull. Soc. Math. France,93(2):273โ€“299, 1965.

K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

A. Nemirovski. Prox-method with rate of convergence ๐‘‚(1/๐‘ก) for variational inequali-ties with lipschitz continuous monotone operators and smooth convex-concave sad-dle point problems. SIAM Journal on Optimization, 15(1):229โ€“251, 2004.

A. Nemirovski, S. Onn, and U. G. Rothblum. Accuracy certificates for computational problems with convex structure. Mathematics of Operations Research, 35(1):52–78, 2010.

A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester, 1983.

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical program-ming, 103(1):127โ€“152, 2005.

Y Nesterov. Gradient methods for minimizing composite objective function, 2007.

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87.Springer Science & Business Media, 2013.

Y. Nesterov and A. Nemirovski. On first-order algorithms for ๐‘™1/nuclear norm mini-mization. Acta Numerica, 22:509โ€“575, 2013.

Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.

D. Ostrovskii and Z. Harchaoui. Efficient first-order algorithms for adaptive signaldenoising. In Proceedings of the 35th ICML conference, volume 80, pages 3946โ€“3955, 2018.

B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-pointproblems. In Advances in Neural Information Processing Systems, pages 1416โ€“1424,2016.

134

Page 150: On efficient methods for high-dimensional statistical

I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras, E. Gaussier,I. Androutsopoulos, M.-R. Amini, and P. Galinari. LSHTC: A benchmark forlarge-scale text classification. arXiv preprint arXiv:1503.08581, 2015.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by aver-aging. SIAM Journal on Control and Optimization, 30(4):838โ€“855, 1992.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.MIT Press, 2006.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.

R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAMjournal on control and optimization, 14(5):877โ€“898, 1976.

R. T. Rockafellar. Convex analysis. Princeton university press, 2015.

A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method.In Advances in Neural Information Processing Systems, pages 3891โ€“3901, 2017.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochasticaverage gradient. Mathematical Programming, 162(1-2):83โ€“112, 2017.

B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,Regularization, Optimization, and beyond. MIT press, 2001.

S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theoryto algorithms. Cambridge university press, 2014.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods forregularized loss minimization. Journal of Machine Learning Research, 14(Feb):567โ€“599, 2013.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimatedsub-gradient solver for SVM. Mathematical programming, 127(1):3โ€“30, 2011.

A. Shapiro, D. Dentcheva, and A. Ruszczyล„ski. Lectures on stochastic programming:modeling and theory. SIAM, 2009.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridgeuniversity press, 2004.

Z. Shi, X. Zhang, and Y. Yu. Bregman divergence for stochastic variance reduc-tion: saddle-point and adversarial prediction. In Advances in Neural InformationProcessing Systems, pages 6031โ€“6041, 2017.

S. Sra. Fast projections onto mixed-norm balls with applications. Data Mining andKnowledge Discovery, 25(2):358โ€“377, 2012.

135

Page 151: On efficient methods for high-dimensional statistical

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schรถlkopf.Injective hilbert space embeddings of probability measures. In Proc. COLT, 2008.

C.M. Stein. Estimation of the Mean of a Multivariate Normal Distribution. TheAnnals of Statistics, 9:1135โ€“1151, 1981.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press, 1990.

T.M. Stoker. Consistent estimation of scaled coefficients. Econometrica, 54:1461โ€“1481, 1986.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

V. Q. Vu and J. Lei. Minimax sparse principal subspace estimation in high dimensions.The Annals of Statistics, 41(6):2905โ€“2947, 2013.

H. Wang and Y. Xia. On directional regression for dimension reduction. In J. Amer.Statist. Ass. Citeseer, 2007.

H. Wang and Y. Xia. Sliced regression for dimension reduction. Journal of theAmerican Statistical Association, 103(482):811โ€“821, 2008.

C. K. I. Williams and M. Seeger. Using the nystrรถm method to speed up kernelmachines. In Advances in neural information processing systems, pages 682โ€“688,2001.

D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations andTrendsยฎ in Theoretical Computer Science, 10(1โ€“2):1โ€“157, 2014.

Y. Xia, H. Tong, W. K. Li, and L.-X. Zhu. An adaptive estimation of dimensionreduction space. Journal of the Royal Statistical Society: Series B (StatisticalMethodology), 64(3):363โ€“410, 2002a.

Y. Xia, H. Tong, W.K. Li, and L.-X. Zhu. An adaptive estimation of dimensionreduction space. Journal of the Royal Statistical Society: Series B (StatisticalMethodology), 64(3):363โ€“410, 2002b.

L. Xiao. Dual averaging methods for regularized stochastic learning and online opti-mization. Journal of Machine Learning Research, 11(Oct):2543โ€“2596, 2010.

L. Xiao, A. W. Yu, Q. Lin, and W. Chen. Dscovr: Randomized primal-dual blockcoordinate algorithms for asynchronous distributed optimization. arXiv preprintarXiv:1710.05080, 2017.

S. S. Yang. General distribution theory of the concomitants of order statistics. TheAnnals of Statistics, 5:996โ€“1002, 1977.

Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the davisโ€“kahan theoremfor statisticians. Biometrika, 102(2):315โ€“323, 2015.

136

Page 152: On efficient methods for high-dimensional statistical

Y.-L. Yu. The strong convexity of von Neumannโ€™s entropy. Unpublished note, June2013. URL http://www.cs.cmu.edu/~yaoliang/mynotes/sc.pdf.

M. Yuan. On the identifiability of additive index models. Statistica Sinica, 21(4):1901โ€“1911, 2011.

L.-X. Zhu and K. W. Ng. Asymptotics of sliced inverse regression. Statistica Sinica,5:727โ€“736, 1995.


List of Figures

1-1 Graphical representation of classical loss functions.
1-2 Graphical representation of Lipschitz smoothness and strong convexity.
1-3 Graphical representation of projected gradient descent: firstly we do a gradient step and then project onto $\mathcal{X}$.
2-1 Graphical explanation of SADE: firstly we divide $y$ into $H$ slices; let $p_h$ be the proportion of $y_i$ in slice $I_h$. This is an empirical estimator of $\mathbf{P}(y \in I_h)$. Then we evaluate the empirical estimator $(\mathcal{S}_1)_h$ of $\mathbf{E}(\mathcal{S}_1(x) \mid y \in I_h)$. Finally, we evaluate the weighted covariance matrix and find the $k$ largest eigenvectors.
2-2 Contour lines of Gaussian mixture pdf in the 2D case.
2-3 Mean and standard deviation of $R^2(\widehat{\mathcal{E}}, \mathcal{E})$ divided by 10 for the rational function (2.5.2).
2-4 Mean and standard deviation of $R^2(\widehat{\mathcal{E}}, \mathcal{E})$ divided by $\sqrt{10}$ for the rational function (2.5.3) with $\sigma = 1/4$.
2-5 Mean and standard deviation of $R^2(\widehat{\mathcal{E}}, \mathcal{E})$ divided by $\sqrt{10}$ for the classification problem (2.5.4).
2-6 Mean and standard deviation of $R^2(\widehat{\mathcal{E}}, \mathcal{E})$ divided by $\sqrt{10}$ for the model (2.5.2) with $\sigma = 1/4$.
2-7 Square trace error $R(\widehat{\mathcal{E}}, \mathcal{E})$ for different sample sizes.
2-8 Quadratic model, $d = 10$, $n = 1000$.
2-9 Rational model, $d = 10$, $n = 1000$.
2-10 Rational model, $d = 20$, $n = 2000$.
2-11 Rational model, $d = 50$, $n = 5000$.
2-12 Increasing dimensions, with rational model (2.5.3), $n = 10000$, $k = 10$, $d = 20$ to $150$.
3-1 Convergence of iterates $\theta_n$ and averaged iterates $\bar\theta_n$ to the mean $\bar\theta_\gamma$ under the stationary distribution $\pi_\gamma$.
3-2 Graphical representation of reparametrization: firstly we expand the class of functions, replacing the parameter $\theta$ with the function $\eta(\cdot) = \Phi^\top(\cdot)\theta$, and then we do one more reparametrization: $\mu(\cdot) = a'(\eta(\cdot))$. The best linear prediction $\mu^*$ is constructed using $\theta^*$ and the global minimizer of $\mathcal{G}$ is $\mu^{**}$. The model is well-specified if and only if $\mu^* = \mu^{**}$.
3-3 Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators in red vs averaging predictions in green. $\mu^*$ is the optimal linear predictor and $\mu^{**}$ is the global optimum.
3-4 Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators in red vs averaging predictions in green. The global optimizer coincides with the best linear one.
3-5 Synthetic data for the linear model $\eta_\theta(x) = \theta^\top x$ and $\eta^{**}(x) = \sin x_1 + \sin x_2$. Excess prediction performance vs. number of iterations (both in log-scale).
3-6 Synthetic data for the linear model $\eta_\theta(x) = \theta^\top x$ and $\eta^{**}(x) = x_1^3 + x_2^3$. Excess prediction performance vs. number of iterations (both in log-scale).
3-7 Synthetic data for the Laplacian kernel for $\eta^{**}(x) = \frac{5}{5 + x^\top x}$. Excess prediction performance vs. number of iterations (both in log-scale).
3-8 MiniBooNE dataset, dimension $d = 50$, linear model. Excess prediction performance vs. number of iterations (both in log-scale).
3-9 MiniBooNE dataset, dimension $d = 50$, kernel approach, column sampling $m = 200$. Excess prediction performance vs. number of iterations (both in log-scale).
3-10 CoverType dataset, dimension $d = 54$, linear model. Excess prediction performance vs. number of iterations (both in log-scale).
3-11 CoverType dataset, dimension $d = 54$, kernel approach, column sampling $m = 200$. Excess prediction performance vs. number of iterations (both in log-scale).
4-1 Depiction of the Full Sampling Scheme (Full-SS).
4-2 Primal accuracy and duality gap (when available) for Algorithm 1, stochastic subgradient method (SSM), and Mirror Prox (MP) with exact gradients, on a synthetic data benchmark.


List of Tables

2.1 Comparison of different methods using score functions.
4.1 Comparison of the potential differences for the norms corresponding to (4.4.1).
4.2 Runtime (in seconds) of Algorithm 1 on synthetic data.
