HAL Id: tel-02433016 (https://tel.archives-ouvertes.fr/tel-02433016v2)
Submitted on 18 Jun 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
On efficient methods for high-dimensional statistical estimation
Dmitry Babichev
To cite this version: Dmitry Babichev. On efficient methods for high-dimensional statistical estimation. Machine Learning [stat.ML]. Université Paris sciences et lettres, 2019. English. NNT: 2019PSLEE032. tel-02433016v2
The best angle from which to
approach any problem is the
try-angle
Author Unknown
Abstract
In this thesis we consider several aspects of parameter estimation for statistics and machine learning, and optimization techniques applicable to these problems. The goal of parameter estimation is to find the unknown hidden parameters which govern the data, for example the parameters of an unknown probability density. The construction of estimators through optimization problems is only one side of the coin: finding the optimal value of the parameter is often itself an optimization problem that needs to be solved using various optimization techniques. Fortunately, these optimization problems are convex for a wide class of problems, and we can exploit their structure to get fast convergence rates.
The first main contribution of the thesis is to develop moment-matching techniques for multi-index non-linear regression problems. We consider the classical non-linear regression problem, which is infeasible in high dimensions due to the curse of dimensionality; that is why we assume a model which states that the response is in fact a non-linear function of several linear projections of the data. We combine two existing techniques, the average derivative estimator (ADE) and Sliced Inverse Regression (SIR), to develop a hybrid method (SADE) that avoids some of the weaknesses of its parents: it works both in multi-index models and under weak assumptions on the data distribution. We also extend this method to higher-order moments. We provide theoretical analysis of the constructed estimators both in the finite sample and population cases.
In the second main contribution we use a special type of averaging for stochastic gradient descent. We consider generalized linear models for conditional exponential families (such as logistic regression), where the goal is to find the unknown value of the parameter. Classical approaches, such as Stochastic Gradient Descent (SGD) with constant step-size, are known to converge only to some neighborhood of the optimal value of the parameter, even with Ruppert-Polyak averaging. We propose instead the averaging of moment parameters, which we call prediction functions. For finite-dimensional models this type of averaging can surprisingly lead to negative error, i.e., this approach provides us with an estimator better than any linear estimator can ever achieve. For infinite-dimensional models our approach converges to the optimal prediction, while parameter averaging never does.
The third main contribution of this thesis deals with Fenchel-Young losses. We consider multi-class linear classifiers with losses of a certain type, such that their dual conjugate has a direct product of simplices as support. The corresponding convex-concave saddle-point formulation has a special form with a bilinear matrix term, and classical approaches suffer from the time-consuming multiplication of matrices. We show that for multi-class SVM losses, under a mild regularity assumption and with smart matrix-multiplication sampling techniques, our approach has an iteration complexity which is sublinear in the size of the data. This means that to do one iteration we only need to pay the price O(k + d + n), for k the number of classes, d the number of features and n the number of samples, whereas all existing techniques use at least one of kd, kn or dn arithmetical operations per iteration. This is possible due to the right choice of geometries and the use of a mirror descent approach.
Keywords: parameter estimation, moment-matching, constant step-size SGD, conditional exponential family, Fenchel-Young loss, mirror descent.
Résumé

In this thesis, we examine several aspects of parameter estimation for statistics and machine learning, as well as optimization methods applicable to these problems. The goal of parameter estimation is to find the unknown hidden parameters that govern the data, for example the parameters of an unknown probability density. The construction of estimators through optimization problems is only one part of the problem: finding the optimal value of the parameter is often itself an optimization problem that must be solved using various techniques. These optimization problems are often convex for a wide class of problems, and we can exploit their structure to obtain fast convergence rates.

The first main contribution of the thesis is to develop moment-matching techniques for multi-index non-linear regression problems. We consider the classical non-linear regression problem, which is infeasible in high dimensions due to the curse of dimensionality, which is why we assume that the response is in fact a non-linear function of several linear projections of the data. We combine two existing techniques, the average derivative estimator (ADE) and Sliced Inverse Regression (SIR), to develop a hybrid method (SADE) without some of the weaknesses of its parents: it works both in multi-index models and under weak assumptions on the data distribution. We also extend this method to higher-order moments. We provide a theoretical analysis of the constructed estimators both in the finite sample and population cases.

In the second main contribution, we use a particular type of averaging for stochastic gradient descent. We consider generalized linear models for conditional exponential families (such as logistic regression), where the goal is to find the unknown value of the parameter. Classical approaches, such as Stochastic Gradient Descent (SGD) with a constant step size, converge only to some neighborhood of the optimal value of the parameter, even with Ruppert-Polyak averaging. We propose averaging the moment parameters, which we call prediction functions. For finite-dimensional models, this type of averaging can surprisingly lead to negative error, that is, this approach provides us with an estimator better than any linear estimator can ever achieve. For infinite-dimensional models, our approach converges to the optimal prediction, while parameter averaging never does.

The third main contribution of this thesis deals with Fenchel-Young losses. We consider multi-class linear classifiers with losses of a certain type, such that their dual conjugate has a direct product of simplices as support. The corresponding convex-concave saddle-point formulation has a special form with a bilinear matrix term, and classical approaches suffer from the time-consuming multiplication of matrices. We show that for multi-class SVM losses, under a mild regularity assumption and with efficient sampling techniques, our approach has an iteration complexity which is sublinear in the size of the data. This means that to do one iteration we only need to pay the price O(k + d + n), for k the number of classes, d the number of features and n the number of samples, whereas all existing techniques use at least one of the kd, kn or dn arithmetical operations per iteration. This is possible thanks to the right choice of geometries and the use of a mirror descent approach.

Keywords: parameter estimation, method of moments, constant step-size SGD, conditional exponential family, Fenchel-Young loss, mirror descent.
Acknowledgements
First of all, I would like to thank my supervisor Francis Bach. His guidance and his ability to explain difficult concepts in simple words helped me through all my Ph.D. His open-door policy, whereby anyone could come and ask questions whenever his door was open, was a great support. I would also like to thank my co-advisor Anatoli Juditsky, with whom I had several very fruitful discussions and who instilled statistical thinking in me.
This thesis has received funding from the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no. 642685 MacSeNet. I would like to thank Helen Cooper for all the training, summer schools and workshops she helped to organize as part of this framework. I am grateful to Pierre Vandergheynst, with whom I had the opportunity to work during my academic internship at EPFL in Lausanne. I am also grateful to Dave Betts, who was my mentor in the field of audio restoration during my industrial internship in Cambridge.
I would also like to thank Arnak Dalalyan and Stéphane Chrétien for accepting to review my thesis, and to express my sincere gratitude to Olivier Cappé and Franck Iutzeler for accepting to be part of the jury for my thesis.
I would like to thank all the members of the Willow and Sierra teams, old and new, for all the group meetings, thanks to which I expanded my scientific outlook. I especially want to thank my friend and colleague Dmitrii Ostrovskii, with whom I worked on the third project, and who was able to explain a large amount of information to me in a short time.
I am also grateful to all my friends, with whom I spent many joyful hours and days and who share my hobbies. I would like to thank Andrey, Dasha, Kolya, Tolik, Anya and Katya, my teammates in the intellectual game "What? Where? When?". I would like to thank Marwa, who introduced me to the world of climbing. Thank you, Vitalic, for our almost daily talks. I also thank my friends Denis, Dima and Petr for their support.
I would like to thank my parents, who opened up the wonderful world of mathematics for me, for their faith in me. I thank my sister Tanka for all her sisterly care and her readiness to come to the rescue at any moment.
Last but not least, I am very grateful to my girlfriend Tatiana Shpakova for her love and for always being there in easy and difficult times.
Contents
1 Introduction
  1.1 Principles for parameter estimation
    1.1.1 Moment matching
    1.1.2 Maximum likelihood
    1.1.3 Loss functions
  1.2 Convex optimization
    1.2.1 Euclidean geometry
    1.2.2 Non-Euclidean geometry
    1.2.3 Stochastic setup
  1.3 Saddle point optimization

2 Sliced inverse regression with score functions
  2.1 Introduction
  2.2 Estimation with infinite sample size
    2.2.1 SADE: Sliced average derivative estimation
    2.2.2 SPHD: Sliced principal Hessian directions
    2.2.3 Relationship between first and second order methods
  2.3 Estimation from finite sample
    2.3.1 Estimator and algorithm for SADE
    2.3.2 Estimator and algorithm for SPHD
    2.3.3 Consistency for the SADE estimator and algorithm
  2.4 Learning score functions
    2.4.1 Score matching to estimate score from data
    2.4.2 Score matching for sliced inverse regression: two-step approach
    2.4.3 Score matching for SIR: direct approach
  2.5 Experiments
    2.5.1 Known score functions
    2.5.2 Unknown score functions
  2.6 Conclusion
  2.7 Appendix. Proofs
    2.7.1 Probabilistic lemma
    2.7.2 Proof of Theorem 2.1
    2.7.3 Proof of Theorem 2.2

3 Constant step-size SGD for probabilistic modeling
  3.1 Introduction
  3.2 Constant step size stochastic gradient descent
  3.3 Warm-up: exponential families
  3.4 Conditional exponential families
    3.4.1 From estimators to prediction functions
    3.4.2 Averaging predictions
    3.4.3 Two types of averaging
  3.5 Finite-dimensional models
    3.5.1 Earlier work
    3.5.2 Averaging predictions
  3.6 Infinite-dimensional models
  3.7 Experiments
    3.7.1 Synthetic data
    3.7.2 Real data
  3.8 Conclusion
  3.9 Appendix. Explicit form of B, B̄_w and B̄_p
    3.9.1 Estimation without averaging
    3.9.2 Estimation with averaging parameters
    3.9.3 Estimation with averaging predictions

4 Sublinear Primal-Dual Algorithms for Large-Scale Classification
  4.1 Introduction
    4.1.1 Related work
  4.2 Choice of geometry and basic routines
    4.2.1 Basic schemes: Mirror Descent and Mirror Prox
    4.2.2 Choice of the partial potentials
    4.2.3 Recap of the deterministic algorithms
    4.2.4 Accuracy bounds for the deterministic algorithms
  4.3 Sampling schemes
    4.3.1 Partial sampling
    4.3.2 Full sampling
    4.3.3 Efficient implementation of SVM with Full Sampling
  4.4 Discussion of Alternative Geometries
  4.5 Experiments
  4.6 Conclusion and perspectives
  4.7 Appendix
    4.7.1 Motivation for the multiclass hinge loss
    4.7.2 General accuracy bounds for the composite saddle-point MD
    4.7.3 Auxiliary lemmas
    4.7.4 Proof of Proposition 4.1
    4.7.5 Proof of Proposition 4.2
    4.7.6 Proof of Proposition 4.3
    4.7.7 Correctness of subroutines in Algorithm 1
    4.7.8 Additional remarks on Algorithm 1

5 Conclusion and Future Work
Contributions and thesis outline
In Chapter 1 we give a brief overview of parameter estimation, covering the method of moments, maximum likelihood and loss minimization problems.
Chapter 2 is dedicated to a special application of the method of moments to non-linear regression. This chapter is based on the journal article: Slice inverse regression with score functions, D. Babichev, F. Bach, in Electronic Journal of Statistics [Babichev and Bach, 2018b]. The main contributions of this chapter are as follows:
• We propose score function extensions to sliced inverse regression problems, both for first-order and second-order score functions.
• We consider the infinite sample case and show that in the population case our estimators are superior to the non-sliced versions.
• We also consider the finite sample case and show the consistency of our estimators given the exact score functions. We provide non-asymptotic bounds under sub-Gaussian assumptions.
• We propose to learn the score function as well, either in two steps, i.e., first learning the score function and then learning the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm.
• We illustrate our results on a series of experiments.
In Chapter 3 we consider a special type of averaging for constant step-size Stochastic Gradient Descent (SGD) applied to generalized linear models. This chapter is based on the conference paper published at UAI 2018 and accepted as an oral presentation: Constant step size stochastic gradient descent for probabilistic modeling, D. Babichev, F. Bach, Proceedings in Uncertainty in Artificial Intelligence [Babichev and Bach, 2018a]. The main contributions of this chapter are:
• For generalized linear models, we propose averaging moment parameters instead of natural parameters for constant step-size stochastic gradient descent.
• For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model.
• For infinite-dimensional models, we show that it always converges to the optimal predictions, while averaging natural parameters never does.
• We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.
In Chapter 4 we develop a sublinear method for Fenchel-Young losses; this is joint work with Dmitrii Ostrovskii. We have submitted this work to ICML 2019 under the title Sublinear-time training of multiclass classifiers with Fenchel-Young losses, Dmitry Babichev, Dmitrii Ostrovskii and Francis Bach. The main contributions of this chapter are:
• We develop efficient algorithms for solving regularized empirical risk minimization problems via the associated saddle-point problem, using (i) sampling for the computationally heavy matrix multiplications and (ii) the right choice of geometry for mirror descent type algorithms.
• The less aggressive partial sampling scheme is applicable to any loss minimization problem such that its dual conjugate has a direct product of simplices as support. This leads to a cost of O(d(k + n)) per iteration.
• The more aggressive full sampling scheme, applied to the multiclass hinge loss, leads to a sublinear cost of O(k + d + n) per iteration.
• We conclude the chapter with numerical experiments.
Chapter 1
Introduction
In this chapter we give a brief overview of the two main pillars of this thesis: parameter estimation, and optimization (mostly convex minimization and convex-concave saddle point problems).
1.1 Principles for parameter estimation
Parameter estimation is a branch of statistics and machine learning that solves the problem of estimating an unknown set of parameters given some observations. There is a wide variety of methods and models, and we consider the three probably most famous of them: moment matching, maximum likelihood and risk minimization. The main application of parameter estimation is density and conditional density estimation (and more generally model estimation), which in turn are used in regression, classification and clustering problems.
1.1.1 Moment matching
Moment matching is a technique for finding the values of parameters using several moments of the distribution, i.e., expectations of powers of a random variable. The method of moments was introduced at least by Chebyshev and Pearson in the late 1800s (see for example [Casella and Berger, 2002] for a discussion). Suppose that we have a real-valued random variable X drawn from a family of distributions {p(· | θ) | θ ∈ Θ}.
Given a sample (x_1, ..., x_n), the goal is to estimate the true value of the parameter θ*. The classical approach is to consider the first k moments m_j = E[X^j] = g_j(θ), j = 1, ..., k, and solve the non-linear system of equations to find the estimator θ̂:

\[
\begin{cases}
\hat{m}_1 = \dfrac{1}{n} \sum_{i=1}^{n} x_i = g_1(\hat{\theta}), \\
\;\;\vdots \\
\hat{m}_k = \dfrac{1}{n} \sum_{i=1}^{n} x_i^k = g_k(\hat{\theta}).
\end{cases}
\]
Even though the method is called moment matching, it is not necessary to use moments of the random variable: in general it can use any functions h_j(X), as long as the expectations ∫ h_j(x) p(x | θ) dx can be easily computed. This approach can be extended to the case of random vectors and cross-moments [Hansen, 1982]. We now discuss some strong and weak points of this approach, and illustrate the method on several examples, starting from toy ones and finishing with state-of-the-art methods applicable to non-linear regression.
Disadvantages and advantages
The main advantage of this approach is that it is quite simple and the estimators are consistent under mild assumptions [Hansen, 1982]. Also, in some cases the solutions can be found in closed form, where the maximum likelihood approach may require a large computational effort.
However, in some sense this approach is inferior to the maximum likelihood approach, and the estimators are often biased. In some cases, especially for small samples, the results can fall outside of the parameter space. Also, the non-linear system of equations may be hard to solve.
Uniform distribution example
Let us start with a simple example, where we need to estimate the parameters of the one-dimensional uniform distribution X ∼ U[a, b]. The first and second moments are m_1 = E[X] = (a + b)/2 and m_2 = E[X²] = (a² + ab + b²)/3. Solving the system of these two equations, given a sample (x_1, ..., x_n) and using the sample moments m̂_1 and m̂_2 instead of the true ones, we get a formula for estimating the parameters a and b:

\[
(\hat{a}, \hat{b}) = \Big( \hat{m}_1 - \sqrt{3(\hat{m}_2 - \hat{m}_1^2)}, \;\; \hat{m}_1 + \sqrt{3(\hat{m}_2 - \hat{m}_1^2)} \Big).
\]
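As a quick numerical check, the closed-form estimator above can be coded in a few lines (a sketch in Python; the parameter values and sample size below are arbitrary):

```python
import numpy as np

def uniform_moment_estimator(x):
    """Method-of-moments estimate of (a, b) for X ~ U[a, b]."""
    x = np.asarray(x, dtype=float)
    m1 = x.mean()             # sample first moment
    m2 = (x ** 2).mean()      # sample second moment
    half_width = np.sqrt(3.0 * (m2 - m1 ** 2))
    return m1 - half_width, m1 + half_width

rng = np.random.default_rng(0)
sample = rng.uniform(2.0, 5.0, size=100_000)
a_hat, b_hat = uniform_moment_estimator(sample)
print(a_hat, b_hat)  # close to the true parameters (2, 5)
```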
Linear regression example
Consider now a simple linear regression model y = x^⊤θ + ε, where y ∈ R, the vectors x, θ ∈ R^d, and the error ε has zero expectation. The goal is to estimate the unknown vector of parameters θ. Consider the cross-moments E(x_j y) = E(x_j x^⊤θ) for j = 1, ..., d, where x_j is the j-th component of x. Replacing the expectations with empirical ones for the sample (x_i, y_i), we get the equation:

\[
\frac{y_1 x_1 + \dots + y_n x_n}{n} = \frac{x_1 x_1^\top + \dots + x_n x_n^\top}{n} \cdot \hat{\theta},
\]

which is the traditional normal equation [Goldberger, 1964]. Finally, arranging the vectors into a matrix X ∈ R^{n×d} and a vector y ∈ R^n, we recover θ̂ = (X^⊤X)^{−1} X^⊤ y. Hence, in this particular formulation, the moment-matching estimator coincides with the ordinary least squares estimator.
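The normal equation translates directly into code; a minimal sketch on synthetic data (the dimensions and noise level are arbitrary) confirms that it reproduces ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Moment matching / normal equation: theta_hat solves (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: it coincides with the ordinary least squares solution.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)  # close to theta_true
```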
Exponential families
Note that for exponential families, with probability density given by p(x | θ) = h(x) exp(θ^⊤T(x) − A(θ)), moment matching is equivalent to maximum likelihood estimation [Lehmann and Casella, 2006]. We discuss this in more detail in the next section.
Score functions
Consider the general non-linear regression problem

\[
y = f(x) + \varepsilon, \qquad x \in \mathbb{R}^d, \; y \in \mathbb{R},
\]

with the error ε independent of the data and E[ε] = 0.
The ambitious goal is to estimate the unknown function f, given samples (x_i, y_i). However, it is impossible to solve the problem in this loose formulation, due to the curse of dimensionality in this non-parametric regression setup. Indeed, classical non-parametric estimation results show that convergence rates with any relevant performance measure can decrease as n^{−C/d} [Tsybakov, 2009, Györfi et al., 2002]. This means that the number of sample points n needed to reach some level of precision is exponential in the dimension d. A classical way to circumvent the curse of dimensionality is to impose an additional condition: that the data depend only on some hidden lower-dimensional structure. Let us start with a simple assumption:
• x is normal and f(x) = g(w^⊤x), with a matrix w ∈ R^{d×k}.
If k = 1, this model is called a single-index model, and a multi-index model otherwise [Horowitz, 2012]. In this formulation the goal is to estimate the unknown matrix of parameters w ∈ R^{d×k}, and we can use moment-matching techniques. We again use cross-moments, and it is not difficult to show for a single-index model, using Stein's lemma, that E(yx) ∼ w_1 [Stein, 1981, Brillinger, 1982]. Indeed, using the independence of the noise, the Gaussian probability density p(x) ∼ exp(−‖x‖²/2), for which ∇p(x) = −x p(x), and integration by parts:

\[
\mathbb{E}(yx) = \mathbb{E}\big((g(w_1^\top x) + \varepsilon)\,x\big) = \mathbb{E}\big(g(w_1^\top x)\,x\big)
= \int g(w_1^\top x)\, x\, p(x)\, dx
= -\int g(w_1^\top x)\, \nabla p(x)\, dx
= \int \nabla\big[g(w_1^\top x)\big]\, p(x)\, dx
= \mathbb{E}\big[g'(w_1^\top x)\big]\, w_1 \sim w_1.
\]
The straightforward extension of this approach uses the notion of score function:

\[
\mathcal{S}_1(x) = -\nabla \log p(x),
\]

where p(x) is the probability density of the data x. We use a moment-matching technique in the form E(S_1(x) y) ∼ w_1, as is done in [Stoker, 1986]. The most recent approaches for multi-index models, by [Janzamin et al., 2014] and [Janzamin et al., 2015], use the notion of higher-order scores

\[
\mathcal{S}_m(x) = (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}
\]

(which are tensors) and the cross-moments E[y · S_m(x)] to train neural networks.
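The identity E(yx) ∼ w_1 is easy to verify by simulation for a single-index model (a sketch; the link function tanh, the direction w_1 and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 400_000
w1 = np.array([3.0, 0.0, 4.0, 0.0, 0.0]) / 5.0   # unit-norm index direction
x = rng.normal(size=(n, d))                       # standard Gaussian design
y = np.tanh(x @ w1) + 0.1 * rng.normal(size=n)    # y = g(w1^T x) + noise

# For standard Gaussian x, the first-order score is S1(x) = x,
# so the cross-moment E(y x) is proportional to w1.
m = (y[:, None] * x).mean(axis=0)
cosine = abs(m @ w1) / np.linalg.norm(m)          # w1 has unit norm
print(cosine)  # close to 1: the direction w1 is recovered
```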
Contribution of this thesis
One more extension of the method of moments uses the conditional moments E(x|y), known as Sliced Inverse Regression (SIR, [Li, 1991]), and the second-order moments E(xx^⊤|y) (Principal Hessian Directions, PHD, [Li, 1992]), both of which rely on a normal distribution assumption. We develop a new method combining the strong sides of Stein's lemma and SIR in Chapter 2 of this thesis. We propose new approaches (SADE and SPHD) and develop an analysis for both the population and finite sample cases.
1.1.2 Maximum likelihood
Maximum likelihood estimation (MLE) is another classical approach to estimate the unknown parameters by maximizing the likelihood function: intuitively, the selected parameter makes the observed data most probable. More formally, let S = (x_1, ..., x_n) be a random sample from a family of distributions {p(· | θ) | θ ∈ Θ}; then

\[
\hat{\theta} \in \arg\max_{\theta \in \Theta} \mathcal{L}(S; \theta),
\]

where L(S; θ) = p_n(S | θ) is the so-called likelihood function: that is, the joint probability density for the given realization of the sample.
In practice, it is often convenient to work with the negative natural logarithm of the likelihood function, called the negative log-likelihood: ℓ(S; θ) = − ln L(S; θ). If the data (x_1, ..., x_n) are independent and identically distributed, then the joint distribution density can be written as a product of the densities of the single x_i, and the average negative log-likelihood minimization problem takes the form:

\[
\arg\min_{\theta \in \Theta} \bar{\ell}(S; \theta) = \arg\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^{n} \ln p(x_i \mid \theta),
\]

where p(x_i | θ) is the value of the probability density at the point x_i.
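As a concrete instance, for the exponential density p(x | θ) = θ e^{−θx} the average negative log-likelihood is −log θ + θ x̄, whose minimizer is the closed-form MLE θ̂ = 1/x̄; a crude numerical minimization (a sketch using a grid search; the true parameter is arbitrary) recovers the same value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=50_000)  # true theta = 1 / scale = 0.5

def avg_nll(theta):
    """Average negative log-likelihood of p(x|theta) = theta * exp(-theta x)."""
    return -np.log(theta) + theta * x.mean()

grid = np.linspace(0.01, 2.0, 20_000)
theta_hat = grid[np.argmin(avg_nll(grid))]

print(theta_hat, 1.0 / x.mean())  # numerical and closed-form MLE agree
```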
Advantages and disadvantages
Under mild assumptions, the maximum likelihood estimator is consistent, asymptotically efficient and asymptotically normal [LeCam, 1953, Akaike, 1998]. Even though for simple models solutions can be found in closed form, for more advanced models optimization methods must be used to obtain the solution. Fortunately, the optimization problem is convex for a wide class of likelihood estimators, such as exponential families and conditional exponential families [Koller and Friedman, 2009, Murphy, 2012]. On the other hand, the optimization problem may be non-convex if we consider, for example, mixture models. Moreover, maximum likelihood estimators are robust to mis-specified models: if the real data distribution p* does not come from the model {p(· | θ) | θ ∈ Θ}, we can still use the approach, and the solution with infinite data will be the projection (in the Kullback-Leibler sense) of p* onto the set of allowed distributions.
Example: Exponential families
The probability density for an exponential family can be written in the followingform:
\[
p(x \mid \theta) = h(x) \exp\big( \theta^\top T(x) - A(\theta) \big),
\]
where h(x) is the base measure, T(x) ∈ R^k is the sufficient statistic and A is the log-partition function, which is always convex. Note that we do not assume that the data distribution p(x) comes from this exponential family. Then the average negative log-likelihood is equal to
\[
\bar{\ell}(\theta; x) = \frac{1}{n} \sum_{i=1}^{n} \big[ A(\theta) - \theta^\top T(x_i) \big],
\]
which is a convex problem and can be solved using any convex minimization approach. Another view of this problem is moment matching: the solution can be found as
\[
A'(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} T(x_i),
\]
where we consider the moment of the sufficient statistic T(x). Note that in fact a large variety of classical distributions can be represented in this form: Bernoulli, normal, Poisson, exponential, Gamma, Beta, Dirichlet and many others. However, there are also a few which cannot, for example Student's distribution; and mixtures of classical distributions are not in this family either.
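For instance, writing the Bernoulli distribution in this form, T(x) = x and A(θ) = log(1 + e^θ), so A′(θ) is the sigmoid and the moment-matching equation A′(θ̂) = x̄ is solved by the logit of the sample mean (a sketch; the parameter p = 0.3 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=20_000)  # Bernoulli sample with p = 0.3

# Exponential family form: p(x|theta) = exp(theta * x - A(theta)),
# with A(theta) = log(1 + exp(theta)), so A'(theta) = sigmoid(theta).
x_bar = x.mean()
theta_hat = np.log(x_bar / (1.0 - x_bar))   # solves sigmoid(theta) = x_bar

p_hat = 1.0 / (1.0 + np.exp(-theta_hat))    # back to the mean parameter
print(p_hat, x_bar)  # moment matching here coincides with the MLE x_bar
```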
Example: Conditional Exponential families
Now, let us consider the classical conditional exponential families:
\[
p(y \mid x, \theta) = h(y) \cdot \exp\big( y \cdot \eta_\theta(x) - a(\eta_\theta(x)) \big).
\]
Again, writing down the average negative log-likelihood, the goal is to minimize

\[
\frac{1}{n} \sum_{i=1}^{n} \big[ -y_i \cdot \eta_\theta(x_i) + a(\eta_\theta(x_i)) \big].
\]
These families are used for regression and classification problems, and are closely related to generalized linear models [McCullagh and Nelder, 1989] when the natural parameter is a linear combination in a known basis: η_θ(x) = θ^⊤Φ(x). One of the most popular choices for the function a′(·) is the sigmoid function a′(t) = σ(t) = 1/(1 + e^{−t}), which leads to the logistic regression model. Let us illustrate the connection of logistic regression with the Bernoulli distribution. Let p be the parameter of a Bernoulli distribution with x ∈ {0, 1}; then the probability density can be written as:

\[
p(x \mid p) = \exp\big( x \log p + (1 - x) \log(1 - p) \big)
= \exp\Big( x \log \tfrac{p}{1-p} + \log(1 - p) \Big)
= \exp\big( x \cdot \theta - \log(1 + e^{\theta}) \big),
\]

where θ = log(p/(1−p)); hence the logistic model is an unconstrained reformulation of the Bernoulli model, and the lack of constraints is often seen as a benefit for optimization.
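A minimal sketch of fitting this logistic model by gradient descent on the average negative log-likelihood, with Φ(x) = x (the data, step size and iteration count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2_000, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -1.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ theta_true))))  # labels in {0,1}

# Gradient of the average negative log-likelihood of the logistic model:
# (1/n) X^T (sigmoid(X theta) - y); the problem is convex in theta.
theta = np.zeros(d)
step = 0.5
for _ in range(2_000):
    grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ theta))) - y) / n
    theta -= step * grad

print(theta)  # close to theta_true up to statistical error
```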
Another common choice is Poisson regression [McCullagh, 1984], where a(t) = exp(t), which is used when the dependent variable is a count.
Contribution of this thesis
In Chapter 3 of this thesis we consider constant step-size stochastic gradient descent for conditional exponential families. Instead of averaging the parameters θ, we propose averaging the so-called prediction functions μ = a′(θ^⊤Φ(x)), which leads to better convergence to the optimal prediction, especially for infinite-dimensional models.
1.1.3 Loss functions
One more view to the parameter estimation problem is risk minimization, whichgoes beyond the maximum likelihood approach. A non-negative real-valued loss func-tion โ(๐ฆ, ๐ฆ) measures the performance for classification or regression problem: i.e.,the difference between prediction ๐ฆ of a and the true outcome ๐ฆ. The expected loss๐ (๐) = E[โ(๐(๐ฅ|๐), ๐ฆ)] is called the risk. However, in practice the true distribution๐ (๐ฅ, ๐ฆ) is inaccessible and the empirical risk used instead. Hence, the classical regu-larized empirical risk minimization problem is written as:
min_θ (1/n) ∑_{i=1}^n ℓ( f(x_i|θ), y_i ) + λΩ(θ), (1.1.1)
where Ω(θ) is a regularizer, which is used to avoid overfitting. There are several classical choices of loss function, and we start with, in some sense, the most intuitive ones:
0−1 loss
This loss, which is also called the misclassification loss, is used in classification problems, where the response y lies in a finite set. The definition speaks for itself: the loss is zero in the case of correct classification and one otherwise:
ℓ(f(x_i|θ), y_i) = 1 ⟺ f(x_i|θ) ≠ y_i.
It is known to lead to NP-hard problems, even for linear classifiers [Feldman et al., 2012, Ben-David et al., 2003]. That is why in practice one uses a convex relaxation of the 0−1 loss function (see below).
Quadratic loss
Another classical choice of loss function in the case of regression problems is the quadratic loss:
ℓ(f(x_i|θ), y_i) = ( f(x_i|θ) − y_i )².
It has a simple form, is smooth (in contrast to the ℓ1 loss) and convex. Thus, the empirical risk minimization problem becomes mean squared error minimization.
The connection with likelihood estimators can be shown using the linear model with Gaussian noise:
y = x^⊤θ + ε, where ε ∼ 𝒩(0, σ²).
Then, the negative log-likelihood is
l(y; x, θ) ∼ (x^⊤θ − y)²,
and the empirical likelihood minimization problem for this model is equivalent to mean squared error minimization (but the application of least squares is not limited to Gaussian noise).
Let us also illustrate a connection of the quadratic loss with regression: suppose we are looking for a function f such that y = f(x) + ε, where x ∈ R^d and y ∈ R. Then the expected loss (risk) can be written as
R(f) = ∫ ℓ(f(x), y) p(x, y) dx dy.
Solving this functional minimization problem for the quadratic loss, we get the solution f(x) = E[y|x], which is the conditional expectation of y given x and is known as the regression function.
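The optimality of the conditional expectation can be illustrated by a small Monte Carlo experiment (a sketch under an assumed toy model y = sin(x) + ε; the empirical quadratic risk of the regression function is smaller than that of perturbed predictors):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)  # so E[y|x] = sin(x)

def risk(f):
    # Monte Carlo estimate of the expected quadratic loss E[(f(x) - y)^2]
    return np.mean((f(x) - y) ** 2)

# the regression function attains (up to sampling noise) the smallest risk
best = risk(np.sin)
for g in (lambda t: np.sin(t) + 0.2, lambda t: 0.8 * np.sin(t), np.cos):
    assert best < risk(g)
```

Any deviation from the conditional mean adds its squared distance to the irreducible noise variance, which is exactly the bias-variance decomposition of the quadratic risk.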
Convex surrogates
Here we consider the main examples of convex surrogates of the 0−1 loss, which make the computation more tractable.
Hinge loss. It is written as
ℓ(f(x_i|θ), y_i) = max(0, 1 − f(x_i|θ) y_i),
and is used in the soft-margin support vector machine (SVM) approach, introduced in its modern form by [Cortes and Vapnik, 1995]. The method tries to find a hyperplane which separates the data. In practice, the regularized problem is usually considered, due to non-robustness, especially for separable data:
[ (1/n) ∑_{i=1}^n max(0, 1 − y_i θ^⊤x_i) ] + λ‖θ‖².
The classical way of solving the minimization problem is to switch to a quadratic programming problem (originally [Cortes and Vapnik, 1995], see also [Bishop, 2006]). However, nowadays more recent approaches are used, such as gradient descent and stochastic gradient descent methods [Shalev-Shwartz et al., 2011].
Logistic loss and exponential loss. The logistic loss is defined as ℓ(f(x_i|θ), y_i) = log(1 + exp(−y_i · f(x_i|θ))) and the exponential loss is defined as ℓ(f(x_i|θ), y_i) = exp(−y_i · f(x_i|θ)). In fact, these two losses are dictated by maximum log-likelihood formulations: the logistic loss takes its origin in logistic regression and the exponential loss is used in Poisson regression. Moreover, every negative log-likelihood minimization problem is equivalent to a loss minimization problem, if we introduce the corresponding log-likelihood loss. The opposite is typically not true (for example for the hinge loss).
Graphical representation
We summarize the discussed losses in Figure 1-1, where we renormalize some of them to pass through the point (1, 0). The dashed line represents the regression formulation and the solid ones are for classification problems.
[Plot of the exponential, hinge, 0−1, logistic and square losses as functions of the margin.]
Figure 1-1 โ Graphical representation of classical loss functions.
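The losses of Figure 1-1 are easy to tabulate as functions of the margin m = y·f(x); the following sketch (our own, illustrative) also checks that the hinge and exponential losses upper-bound the 0−1 loss, which is what makes them valid convex surrogates:

```python
import numpy as np

m = np.linspace(-2, 2, 401)            # margin m = y * f(x)

zero_one = (m <= 0).astype(float)      # misclassification loss
hinge = np.maximum(0.0, 1.0 - m)
logistic = np.log1p(np.exp(-m))
exponential = np.exp(-m)
square = (1.0 - m) ** 2                # quadratic loss written in the margin

# hinge and exponential losses upper-bound the 0-1 loss everywhere
assert np.all(hinge >= zero_one)
assert np.all(exponential >= zero_one)
# all surrogates here are convex in the margin (nonnegative second differences)
for loss in (hinge, logistic, exponential, square):
    assert np.all(np.diff(loss, 2) >= -1e-9)
```

Note that the plain logistic loss equals log 2 at m = 0, which is why Figure 1-1 renormalizes some curves before plotting.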
Contribution of this thesis
In Chapter 4 of this thesis we consider so-called Fenchel-Young losses for multiclass linear classifiers, which extend convex surrogates to the multiclass setting. This leads to convex-concave saddle-point problems with expensive matrix multiplications. Using stochastic optimization methods for the non-Euclidean setup (mirror descent) and specific variance reduction techniques, we are able to reach sublinear (in the natural dimensionality of the problem) running-time complexity per iteration.
1.2 Convex optimization
In this section we discuss the basics of convex optimization. Convexity is a prevailing setup for optimization problems, due to the theoretical guarantees available for this class of functions. The classical convex minimization problem is the following:
min_{x∈𝒳} f(x),
where 𝒳 ⊆ R^d is a convex set: for any two points x, y ∈ 𝒳 and every α ∈ [0, 1] the point αx + (1 − α)y is also in 𝒳. This means that for any two points inside the set, the whole segment connecting these points is also in the set. The function f is convex as well: for any x, y ∈ 𝒳 and α ∈ [0, 1], f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y); this means that the epigraph of the function f is a convex set: every chord lies above the graph of the function. It is known for convex minimization problems that:
— If a local minimum exists, it is also a global minimum.
— The set of all global minima is convex (note that the global minimum is not necessarily unique).
Consider the unconstrained minimization problem, where 𝒳 = R^d. The first property provides us with a criterion of global optimality: if the function f is differentiable, then
∇f(x*) = 0 ⟺ x* is a global minimum.
If the function f is not differentiable, we can still use the criterion, with the gradient of the function replaced by a sub-gradient: a generalization of the gradient for convex functions which defines the cone of directions in which the function increases [Rockafellar, 2015]:
∂f(x*) ∋ 0 ⟺ x* is a global minimum.
However, in practice the solution can be found in closed form only for simple problems. For the majority of convex problems iterative computational methods are used, and the starting point for them is gradient descent, or steepest descent:
x_{t+1} = x_t − γ_t ∇f(x_t).
The idea is straightforward: we find the direction in which the function increases and then descend in the opposite direction. Classical choices for the stepsize are constant and decaying: γ_t = C t^{−α}, where α ∈ [0, 1].
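The iteration above can be sketched in a few lines on a toy convex quadratic (illustrative; the matrix A and the stepsize 1/L, where L is the largest eigenvalue of A, are our own choices):

```python
import numpy as np

# minimize f(x) = 0.5 x^T A x - b^T x (convex quadratic), gradient A x - b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)      # closed-form minimizer for reference

L = max(np.linalg.eigvalsh(A))      # smoothness constant = largest eigenvalue
x = np.zeros(2)
for t in range(200):
    x = x - (1.0 / L) * (A @ x - b)  # gradient step with constant stepsize 1/L

assert np.allclose(x, x_star, atol=1e-8)
```

On this strongly convex problem the iterates contract geometrically, which anticipates the rates stated below.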
We also consider classical assumptions, such as bounded gradients, smoothness (which requires differentiability) and strong convexity (for more details see [Nesterov, 2013, Bubeck, 2015]). The first definition is the weakest one: bounded gradients, which does not require differentiability:
Definition 1.1. The function f has bounded gradients (subgradients) if for any x ∈ 𝒳 and any g(x) ∈ ∂f(x):
‖g(x)‖ ≤ B.
Definition 1.2. The function f is called smooth with Lipschitz constant L if, for all x, y ∈ 𝒳:
f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + (L/2)‖y − x‖²,
which is equivalent, when f is convex, to
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
Definition 1.3. The function f is called strongly convex with constant μ if, for all x, y ∈ 𝒳:
f(y) ≥ f(x) + ∇f(x)^⊤(y − x) + (μ/2)‖y − x‖².
The intuitive meaning of Lipschitz smoothness is that at every point the function f can be bounded above by a quadratic function with coefficient L/2; strong convexity means that at every point the function f is bounded below by a quadratic function with coefficient μ/2. A graphical representation of the one-dimensional case can be found in Figure 1-2.
Figure 1-2 โ Graphical representation of Lipschitz smoothness and strong convexity.
1.2.1 Euclidean geometry
Note that we did not specify the norms in the definitions of smoothness and strong convexity. If we take them to be the standard Euclidean norms, we recover the so-called Euclidean geometry, and the constants L and μ correspond to the largest and smallest eigenvalues of the Hessian, if the function is twice differentiable.
Projected gradient descent
The gradient descent method for constrained problems is called projected gradient descent (see [Bertsekas, 1999] and references therein), and one iteration is the following:
x_{t+1} = Π_𝒳( x_t − γ_t g(x_t) ), with g(x_t) ∈ ∂f(x_t),
where Π_𝒳(x) = argmin_{y∈𝒳} ‖x − y‖ is the Euclidean projection of the point x onto the set 𝒳. We illustrate this approach in Figure 1-3. We can also look at this equation through the prism of proximal approaches [Moreau, 1965, Rockafellar, 1976], as done for example in [Beck and Teboulle, 2003]:
x_{t+1} = argmin_{x∈𝒳} { ⟨x, g(x_t)⟩ + (1/γ_t)‖x − x_t‖² }, where g(x_t) ∈ ∂f(x_t).
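A minimal sketch of projected gradient descent with a closed-form Euclidean projection (here onto the ℓ2 ball; the objective and all constants are our own toy choices):

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto the l2 ball {x : ||x|| <= radius}
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

# minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball;
# the constrained minimizer is c / ||c||
c = np.array([3.0, 4.0])
x = np.zeros(2)
for _ in range(100):
    grad = 2.0 * (x - c)
    x = project_ball(x - 0.25 * grad)   # gradient step, then projection

assert np.allclose(x, c / np.linalg.norm(c), atol=1e-8)
```

The ball is one of the few sets whose projection is this cheap; for general 𝒳 the projection itself is an optimization problem, which is the point of the remark below.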
Figure 1-3 — Graphical representation of projected gradient descent: first we do a gradient step and then project onto 𝒳.
Note that the proximal operator should be simple, i.e., the structure of the set 𝒳 should be simple enough to compute it either in closed form or with a number of iterations commensurate with the computation of the gradient.
Now we state the main results for functions under different assumptions (short proofs can be found for example in [Bubeck, 2015]):
Theorem 1.1. Let f be convex with gradients bounded by B, and let the radius of the set 𝒳 be R, i.e., sup_{x,y∈𝒳} ‖x − y‖ = 2R. Then projected gradient descent with decaying stepsize γ_t = R/(B√t) satisfies:
f(x̄_t) − f(x*) ≤ RB/√t, where x̄_t = (1/t) ∑_{s=1}^t x_s.
Hence in the case of bounded gradients we get a 1/√t convergence rate.
Theorem 1.2. Let f be convex and L-smooth on 𝒳. Then projected gradient descent with constant stepsize γ_t = 1/L satisfies:
f(x_t) − f(x*) ≤ 4L‖x_1 − x*‖²/t.
Hence in the case of an L-smooth function we get a 1/t convergence rate. If we add the assumption of strong convexity of the function f, we recover an exponential rate:
Theorem 1.3. Let f be μ-strongly convex and L-smooth on 𝒳. Then projected gradient descent with constant stepsize γ_t = 1/L satisfies:
‖x_{t+1} − x*‖² ≤ exp(−t · μ/L) ‖x_1 − x*‖².
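The exponential rate of Theorem 1.3 can be observed numerically on a simple quadratic (a sketch; we take f(x) = ½ x^⊤Ax with A diagonal, so that μ and L are read off directly):

```python
import numpy as np

A = np.diag([10.0, 1.0])           # f(x) = 0.5 x^T A x, so mu = 1, L = 10
mu, L = 1.0, 10.0
x_star = np.zeros(2)               # global minimum

x1 = np.array([5.0, -3.0])
x = x1.copy()
for t in range(1, 51):
    x = x - (1.0 / L) * (A @ x)    # gradient step with stepsize 1/L
    lhs = np.sum((x - x_star) ** 2)
    rhs = np.exp(-t * mu / L) * np.sum((x1 - x_star) ** 2)
    assert lhs <= rhs              # bound of Theorem 1.3 holds at every step
```

The contraction per step is (1 − μ/L) in each eigendirection, which is dominated by the exp(−μ/L) factor of the theorem.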
Note also that the acceleration techniques proposed by Nesterov can be used to improve the convergence rates. The first work in this direction [Nesterov, 1983] dealt with smooth functions in the unconstrained setup; it improved the rate from 1/t to 1/t². The case of non-smooth functions was considered in [Nesterov, 2005], improving the rate from 1/√t to 1/t using a special smoothing technique, which can be applied to functions with an explicit max-structure. Constrained problems were considered in [Nesterov, 2007] and [Beck and Teboulle, 2009], with the same convergence rates.
Note that Theorems 1.1, 1.2 and 1.3 use the Euclidean geometry: the definitions of smoothness and strong convexity are given with respect to the usual Euclidean norm. In the next section we extend these results to a more general setup.
1.2.2 Non-Euclidean geometry
The idea of the non-Euclidean approach is to exploit the geometry of the set 𝒳 and achieve the rates of projected gradient descent with constants adapting to this geometry. Let us fix an arbitrary norm ‖·‖_𝒳 and a compact set 𝒳 ⊆ R^d. We introduce the definition of the dual norm:
Definition 1.4. The norm ‖·‖_{𝒳*} is called the dual norm if ‖g‖_{𝒳*} = sup_{x∈R^d: ‖x‖_𝒳 ≤ 1} g^⊤x.
Now we need to adapt the definitions of bounded gradients, smoothness and strong convexity (see for example [Bubeck, 2015, Beck and Teboulle, 2003]): (i) we say that the function f(x) has bounded gradients if
‖g‖_{𝒳*} ≤ B for any g(x) ∈ ∂f(x);
(ii) we say that the function f(x) is L-smooth with respect to the norm ‖·‖_𝒳 if
‖∇f(x) − ∇f(y)‖_{𝒳*} ≤ L‖x − y‖_𝒳,
where ‖·‖_{𝒳*} is the dual norm; (iii) f is μ-strongly convex if
f(y) ≥ f(x) + ∇f(x)^⊤(y − x) + (μ/2)‖y − x‖²_𝒳.
The idea of the mirror approach is to consider a so-called potential function Φ(x) (also called a DGF, distance generating function, or mirror map), switch to the mirror space (using the gradient of the potential), perform the gradient step in this space, return to the primal space and project the point onto the set 𝒳.
Mirror descent
More formally, one step of mirror descent [Nemirovsky and Yudin, 1983] is
∇Φ(x_{t+1/2}) = ∇Φ(x_t) − γ_t ∇f(x_t),
x_{t+1} ∈ Π^Φ_𝒳(x_{t+1/2}),
where Π^Φ_𝒳(y) = argmin_{x∈𝒳} B_Φ(x, y) is the projection in the Bregman divergence sense: B_Φ(x, y) = Φ(x) − Φ(y) − ∇Φ(y)^⊤(x − y). However, this notation can be difficult to embrace, and [Beck and Teboulle, 2003] introduced an equivalent definition without switching to the mirror space; one step of mirror descent can then be written as:
x_{t+1} = argmin_{x∈𝒳} { ⟨x, ∇f(x_t)⟩ + (1/γ_t) B_Φ(x, x_t) }. (Mirror descent)
Note the similarity with projected gradient descent. In practice, the potential Φ(x) should be a strictly convex and differentiable function with the following properties:
— Φ(x) is 1-strongly convex on 𝒳 with respect to ‖·‖_𝒳.
— The effective squared radius of the set 𝒳, defined as Ω = sup_{x∈𝒳} Φ(x) − inf_{x∈𝒳} Φ(x), should be small.
— The proximal step above should be feasible.
Finally, we can state convergence rates for the mirror descent algorithm in the non-smooth case (a short proof can be found for example in [Bubeck, 2015]):
Theorem 1.4. Let Φ be 1-strongly convex with respect to ‖·‖_𝒳 and f be convex with B-bounded gradients with respect to ‖·‖_𝒳. Then mirror descent with γ_t = √(2Ω)/(B√t) satisfies:
f(x̄_t) − f(x*) ≤ B√(2Ω/t), where x̄_t = (1/t) ∑_{s=1}^t x_s.
Note that there are various extensions of mirror descent, such as mirror prox [Nemirovski, 2004], dual averaging [Nesterov, 2007] and extensions to saddle-point minimax problems [Juditsky and Nemirovski, 2011a,b, Nesterov and Nemirovski, 2013].
One more direction is the NoLips approach, where the idea is to get rid of the Lipschitz-continuous gradient assumption, using a convexity condition which captures the geometry of the constraints [Bauschke et al., 2016].
Examples of geometries
ℓ2 geometry. This is the simplest geometry, which corresponds to the Euclidean case. Indeed, in this case Φ(x) = (1/2)‖x‖₂² is 1-strongly convex with respect to the ℓ2 norm on the whole of R^d. The Bregman divergence associated with this potential is B_Φ(x, y) = (1/2)‖x − y‖₂², and one step of mirror descent reduces to one step of projected gradient descent. Note that Theorem 1.4 reduces to Theorem 1.1 in this case.
ℓ1 geometry. This is a more interesting choice of geometry, which is also called the simplex setup. If we define the potential as the negative entropy
Φ(x) = ∑_{i=1}^d x_i log x_i,
then, using Pinsker's inequality, we can show that Φ is 1-strongly convex with respect to the ℓ1 norm on the simplex Δ_d = { x ∈ R^d_+ : ∑_{i=1}^d x_i = 1 }. The Bregman divergence associated with the negative entropy is the so-called Kullback-Leibler divergence:
B_Φ(x, y) = D_KL(x‖y) = ∑_{i=1}^d x_i log(x_i/y_i).
The effective squared radius of Δ_d is Ω = log d, and moreover the solution of one step of mirror descent can be found in closed form, leading to so-called multiplicative updates. This implies that if we minimize a function f on the simplex Δ_d such that ‖∇f‖_∞ is bounded, the right choice of geometry gives rates of order O(√(log d / t)), whereas the Euclidean geometry gives rates only of order O(√(d / t)).
One more classical setup is often used: the ℓ1 ball can be handled via the simplex setup by doubling the number of variables. Instead of d real values (positive or negative), we consider 2d nonnegative values and transform the d-dimensional ball into the 2d-dimensional simplex.
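The multiplicative updates mentioned above admit a very short implementation; the sketch below (our own toy linear objective on the simplex) performs entropic mirror descent, whose one-step solution is exactly an exponentiated update followed by renormalization:

```python
import numpy as np

def mirror_step(x, grad, gamma):
    # one entropic mirror-descent step on the simplex: the closed-form
    # solution is a multiplicative update followed by renormalization
    z = x * np.exp(-gamma * grad)
    return z / z.sum()

# minimize f(x) = <c, x> over the simplex; the minimum sits on the
# coordinate with the smallest cost
c = np.array([0.9, 0.5, 0.1, 0.7])
x = np.full(4, 0.25)                 # uniform starting point
for t in range(1, 301):
    x = mirror_step(x, c, gamma=1.0 / np.sqrt(t))

assert np.argmax(x) == np.argmin(c)  # mass concentrates on the cheapest coordinate
assert abs(x.sum() - 1.0) < 1e-12    # iterates stay on the simplex
```

The iterate after t steps is proportional to exp(−(∑_s γ_s) c), so no explicit projection is ever needed.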
1.2.3 Stochastic setup
In this section we consider stochastic approaches, going back to [Robbins and Monro, 1951], where we do not use the gradient ∇f(x) but evaluate a noisy version of it: namely a stochastic oracle g(x) such that E g(x) = ∇f(x). The classical setting in machine learning is when the objective function f is the sample mean of observations f_i (possibly with some regularization term):
f(x) = (1/n) ∑_{i=1}^n f_i(x) + λΩ(x),
and we can choose the oracle as ∇f_i(x), where i is chosen uniformly from the n points. In fact, all negative log-likelihood minimization and loss minimization problems have this form. This also applies to the situation of single-pass SGD, where the bounds are then on the generalization error. To define the quality of an oracle g(x), we assume the existence of the second moment
B² = sup_{x∈𝒳} E‖g(x)‖²_{𝒳*}
in the non-smooth case, and assume the existence of the variance in the smooth case:
σ² = sup_{x∈𝒳} E‖g(x) − ∇f(x)‖²_{𝒳*},
where the norm depends on the geometry. Let us consider the general stochastic mirror descent approach:
x_{t+1} = argmin_{x∈𝒳} { ⟨x, g(x_t)⟩ + (1/γ_t) B_Φ(x, x_t) }. (Stochastic Mirror Descent)
Now we can state the theorem [Juditsky et al., 2011, Lan, 2012, Xiao, 2010]:
Theorem 1.5. Let Φ be 1-strongly convex with respect to ‖·‖_𝒳 and let B² be the second moment of the oracle g(x). Then stochastic mirror descent with stepsize γ_t = √(2Ω)/(B√t) satisfies:
E f(x̄_t) − f(x*) ≤ B√(2Ω/t), where x̄_t = (1/t) ∑_{s=1}^t x_s.
Note the similarity of this result with Theorem 1.4: the only difference is that bounded gradients are replaced by the bounded second moment of the oracle.
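A minimal stochastic gradient sketch in this setting (a least-squares objective; the oracle picks one observation uniformly at random, and we average the iterates; all constants below are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, 30_001):
    i = rng.integers(n)                  # unbiased oracle: one random sample
    g = (X[i] @ theta - y[i]) * X[i]     # stochastic gradient of f_i
    theta -= 0.1 / np.sqrt(t) * g        # decaying stepsize
    avg += (theta - avg) / t             # running average of the iterates

theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.linalg.norm(avg - theta_ls) < 0.1   # close to the empirical minimizer
```

Averaging smooths out the oracle noise, in line with the averaged iterate x̄_t appearing in Theorem 1.5.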
1.3 Saddle point optimization
In this section we consider saddle-point optimization. Consider two convex and compact sets 𝒳 ⊆ R^{d₁} and 𝒴 ⊆ R^{d₂}. Let φ(x, y) : 𝒳 × 𝒴 → R be a function such that φ(·, y) is convex and φ(x, ·) is concave. The goal is to find
min_{x∈𝒳} max_{y∈𝒴} φ(x, y).
A classical example is obtained from Fenchel duality below. We present here results from [Juditsky and Nemirovski, 2011a,b, Nesterov and Nemirovski, 2013] concerning the mirror descent approach.
Introduce the (sub)gradient field G(x, y) = ( ∂_x φ(x, y), −∂_y φ(x, y) ), the analogue of (sub)gradients for saddle-point problems. Then subgradients are given by
( g_𝒳(x, y), g_𝒴(x, y) ) ∈ G(x, y).
The analogue of bounded gradients is written as:
Definition 1.5. The function φ(x, y) has bounded gradients if ‖g_𝒳(x, y)‖_{𝒳*} ≤ ℬ_𝒳 and ‖g_𝒴(x, y)‖_{𝒴*} ≤ ℬ_𝒴 for any (x, y) ∈ 𝒳 × 𝒴.
To evaluate the quality of a point (x, y) ∈ 𝒳 × 𝒴, we introduce the notion of the so-called duality gap:
Definition 1.6. The duality gap is Δ_dual(x, y) = max_{y′∈𝒴} φ(x, y′) − min_{x′∈𝒳} φ(x′, y).
Observe that the duality gap is the sum of the primal gap max_{y′∈𝒴} φ(x, y′) − φ(x*, y*) and the dual gap φ(x*, y*) − min_{x′∈𝒳} φ(x′, y). Introduce the variable z = (x, y) and the set 𝒵 = 𝒳 × 𝒴. The main motivation for the duality gap is that it can be controlled in a way similar to convex optimization:
Δ_dual(z) ≤ sup_{z′∈𝒵} g(z)^⊤(z − z′),
where g(z) is in the gradient field G(z); for more details see Bertsekas [1999].
Recall that in order to apply mirror descent, we first need to construct a potential and choose geometries. Let Φ_𝒳(x) be a potential defined for the variable x, such that it is 1-strongly convex with respect to a norm ‖·‖_𝒳 on 𝒳. Similarly, let Φ_𝒴(y) be a potential defined for the variable y, such that it is 1-strongly convex with respect to a norm ‖·‖_𝒴 on 𝒴. Let Ω_𝒳 and Ω_𝒴 be the effective squared radii of the sets 𝒳 and 𝒴 with respect to the corresponding norms.
Let us construct the composite potential Φ_𝒵(z) = (ℬ_𝒳/√Ω_𝒳) Φ_𝒳(x) + (ℬ_𝒴/√Ω_𝒴) Φ_𝒴(y); then one step of saddle-point mirror descent (SP-MD) is the following:
z_{t+1} ∈ argmin_{z∈𝒵} { ⟨g_t, z⟩ + (1/γ_t) B_{Φ_𝒵}(z, z_t) }, g_t ∈ G(x_t, y_t). (SP-MD)
Finally we can formulate the convergence rates:
Theorem 1.6. Let the function φ(x, y) have bounded gradients with constants ℬ_𝒳 and ℬ_𝒴. Then SP-MD with stepsize γ_t = √(2/t) satisfies:
max_{y∈𝒴} φ(x̄_t, y) − min_{x∈𝒳} φ(x, ȳ_t) ≤ ( √Ω_𝒳 ℬ_𝒳 + √Ω_𝒴 ℬ_𝒴 ) √(2/t).
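For a bilinear toy example (a small matrix game over two simplices, with entropic potentials on both blocks), one can sketch SP-MD and watch the duality gap of the averaged iterates shrink; the game, stepsizes and horizon below are our own illustrative choices:

```python
import numpy as np

M = np.array([[1.0, -1.0], [-1.0, 1.0]])    # matching-pennies payoff matrix
x = np.full(2, 0.5)                          # min player, on the simplex
y = np.full(2, 0.5)                          # max player, on the simplex
x_avg = np.zeros(2)
y_avg = np.zeros(2)

T = 20_000
for t in range(1, T + 1):
    gx = M @ y              # partial gradient in x of phi(x, y) = x^T M y
    gy = M.T @ x            # partial gradient in y (ascent direction)
    gamma = np.sqrt(2.0 / t)
    x = x * np.exp(-gamma * gx); x /= x.sum()   # entropic step, min player
    y = y * np.exp(+gamma * gy); y /= y.sum()   # entropic step, max player
    x_avg += (x - x_avg) / t
    y_avg += (y - y_avg) / t

# duality gap of the averaged iterates:
# max_y phi(x_avg, y) - min_x phi(x, y_avg), both attained at vertices
gap = np.max(M.T @ x_avg) - np.min(M @ y_avg)
assert gap < 0.1
```

The individual iterates keep cycling around the equilibrium; it is the averaged pair (x̄_t, ȳ_t) whose gap decreases, exactly as Theorem 1.6 predicts.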
Fenchel duality
We finish this introduction with the classical Fenchel duality result, which provides us with a way to switch from convex problems to saddle-point problems. Let us first define the notion of the Fenchel conjugate of a convex function f:
Definition 1.7. The function f*(y) given by f*(y) := sup{ ⟨y, x⟩ − f(x) : x ∈ R^d } is called the Fenchel conjugate of the function f.
Now we can formulate the theorem [see e.g. Borwein and Lewis, 2010]:
Theorem 1.7 (Fenchel's duality theorem). Let f : 𝒳 → R and g : 𝒴 → R be convex functions and A : 𝒳 → 𝒴 be a linear map. Then
inf_{x∈𝒳} { f(x) + g(Ax) } = sup_{y∈𝒴} { −f*(A^⊤y) − g*(−y) }.
Finally, we provide a way to switch between the primal, dual and saddle-point formulations of the convex problem:
Primal problem: inf_x { f(x) + g(Ax) }.
Dual problem: sup_y { −f*(A^⊤y) − g*(−y) }.
Saddle problem: inf_x sup_y { f(x) − g*(−y) + y^⊤Ax }.
Machine learning motivation. These optimization problems are motivated by machine learning applications, where x is the parameter to estimate, the matrix A the data, g the loss and f the regularizer (note the similarity with regularized empirical risk minimization (1.1.1)). Dual or saddle-point formulations help to switch to an equivalent task which in some sense has a simpler structure, like a bilinear saddle-point problem with composite terms, and moreover allow one to control the duality gap.
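Fenchel duality can be verified numerically on a toy quadratic pair f(x) = ½‖x‖², g(u) = ½‖u − b‖² (our own choice), for which both sides have closed forms; the sketch checks that the primal and dual optimal values coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
b = rng.normal(size=6)

# primal: inf_x f(x) + g(Ax), with f(x) = 0.5||x||^2 and g(u) = 0.5||u - b||^2
x = np.linalg.solve(A.T @ A + np.eye(4), A.T @ b)
primal = 0.5 * x @ x + 0.5 * np.sum((A @ x - b) ** 2)

# dual: sup_y -f*(A^T y) - g*(-y); here f*(v) = 0.5||v||^2 and
# g*(v) = 0.5||v||^2 + v^T b, so g*(-y) = 0.5||y||^2 - y^T b
y = np.linalg.solve(A @ A.T + np.eye(6), b)
v = A.T @ y
dual = -0.5 * v @ v - 0.5 * y @ y + y @ b

assert abs(primal - dual) < 1e-10   # strong duality holds for this pair
```

Both optimizers come from stationarity of the respective quadratic objectives, so the check reduces to two linear solves.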
Chapter 2
Sliced inverse regression with score functions
Abstract
We consider non-linear regression problems where we assume that the response depends non-linearly on a linear projection of the covariates. We propose score function extensions to sliced inverse regression problems, both for the first-order and second-order score functions. We show that they provably improve estimation in the population case over the non-sliced versions, and we study finite sample estimators and their consistency given the exact score functions. We also propose to learn the score function as well, either in two steps, i.e., first learning the score function and then learning the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm. We illustrate our results on a series of experiments.
This chapter is based on the journal article: Slice inverse regression with score functions, D. Babichev, F. Bach, in Electronic Journal of Statistics [Babichev and Bach, 2018b].
2.1 Introduction
Non-linear regression and related problems such as non-linear classification are core tasks in machine learning and statistics. In this chapter, we consider a random vector x ∈ R^d, a random response y ∈ R, and a regression model of the form
y = f(x) + ε, (2.1.1)
which we want to estimate from n independent and identically distributed (i.i.d.) observations (x_i, y_i), i = 1, …, n. Our goal is to estimate the function f from these data. A traditional key difficulty in this general regression problem is the lack of parametric assumptions regarding the functional form of f, leading to a problem of non-parametric regression. This is often tackled by searching implicitly or explicitly for a function f within an infinite-dimensional vector space.
While several techniques exist to estimate such a function, e.g., kernel methods, local averaging, or neural networks [see, e.g., Györfi et al., 2002, Tsybakov, 2009], they also suffer from the curse of dimensionality; that is, the rate of convergence of the estimated function to the true function (with any relevant performance measure) can only decrease as a small power of n, and this power cannot be larger than a constant divided by d. In other words, the number n of observations needed for any level of precision is exponential in the dimension.
A classical way of by-passing the curse of dimensionality is to make extra assumptions regarding the function to estimate, such as dependence on an unknown low-dimensional subspace, as done by projection pursuit or neural networks. More precisely, throughout the chapter, we make the following assumption:
(A1) For all x ∈ R^d, we have f(x) = g(w^⊤x) for a certain matrix w ∈ R^{d×k} and a function g : R^k → R. Moreover, y = f(x) + ε with ε independent of x, with zero mean and finite variance.
The subspace of R^d spanned by the k columns w_1, …, w_k ∈ R^d of w has dimension less than or equal to k, and is often called the effective dimension reduction (e.d.r.) space. The model above is often referred to as a multiple-index model [Yuan, 2011]. We will always make the assumption that the e.d.r. space has exactly rank k, that is, the matrix w has rank k (which implies that k ≤ d).
Given w, estimating g may be done by any technique in non-parametric regression, with a convergence rate which requires the number of observations n to be exponential in k, with methods based on local averaging (e.g., Nadaraya-Watson estimators) or on least-squares regression [see, e.g., Györfi et al., 2002, Tsybakov, 2009]. Given the non-linear function g, estimating w is computationally difficult because the resulting optimization problem may not be convex and thus may have several local minima. The difficulty is often even stronger, since one often wants to estimate both the function g and the matrix w.
Our main goal in this chapter is to estimate the matrix w, with the hope of obtaining a convergence rate where the inverse power of n will now be proportional to k and not d. Note that the matrix w is only identifiable up to a (right) linear transform, since only the subspace spanned by its columns is characteristic.
Method of moments vs. optimization. This multiple-index problem and the goal of estimating w only can be tackled from two points of view: (a) the method of moments, where certain moments are built so that the effect of the unknown function g disappears [Brillinger, 1982, Li and Duan, 1989], a method that we follow here and describe in more detail below. These methods rely heavily on the model being correct, and in the instances that we consider here lead to provably polynomial-time algorithms (most often linear in the number of observations, since only moments are computed). In contrast, (b) optimization-based methods use implicitly or explicitly non-parametric estimation, e.g., using local averaging methods to design an objective function that can be minimized to obtain an estimate of w [Xia et al., 2002a, Fukumizu et al., 2009]. The objective function is usually non-convex and gradient descent techniques are used to obtain a local minimum. While these procedures offer no theoretical guarantees, due to the potential unknown difficulty of the optimization problem, they often work well in practice, and we have observed this in our experiments.
In this chapter, we consider and improve a specific instantiation of the method of moments, which partially circumvents the difficulty of joint estimation by estimating w directly without the knowledge of g. The starting point for this method is the work by Brillinger [1982], which shows, as a simple consequence of Stein's lemma [Stein, 1981], that if the distribution of x is Gaussian, (A1) is satisfied with k = 1 (e.d.r. space of dimension one, i.e., a single-index model), and the input data have zero mean and identity covariance matrix, then the expectation E(yx) is proportional to w. Thus, a certain expectation, which can easily be approximated given i.i.d. observations, simultaneously eliminates g and reveals w.
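Brillinger's observation is easy to check by simulation (a sketch under an assumed single-index model with Gaussian inputs; the direction w and the link function tanh are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000
w = np.array([1.0, 2.0, 0.0, -1.0, 0.5])        # hypothetical e.d.r. direction
X = rng.normal(size=(n, d))                      # zero mean, identity covariance
y = np.tanh(X @ w) + 0.1 * rng.normal(size=n)    # single-index model, k = 1

m = X.T @ y / n                                  # empirical estimate of E(y x)
cos = abs(m @ w) / (np.linalg.norm(m) * np.linalg.norm(w))
assert cos > 0.99                                # the moment aligns with w
```

By Stein's lemma the proportionality constant here is E[g′(w^⊤x)], which is strictly positive for the monotone link tanh; for an even link it would vanish, which is limitation (c) below.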
While the result above provides a very simple algorithm to recover w, it has several strong limitations: (a) it only applies to normally distributed data x, or more generally to elliptically symmetric distributions [Cambanis et al., 1981]; (b) it only applies to k = 1; and (c) in many situations with symmetries, the proportionality constant is equal to zero and thus we cannot recover the vector w. This has led to several extensions in the statistical literature, which we now present.
Using score functions. The use of Stein's lemma with a Gaussian random variable can be directly extended using the score function 𝒮₁(x), defined as the negative gradient of the log-density, that is, 𝒮₁(x) = −∇ log p(x) = −(1/p(x)) ∇p(x), which leads to the following assumption:
(A2) The distribution of x has a strictly positive density p(x) which is differentiable with respect to the Lebesgue measure, and such that p(x) → 0 when ‖x‖ → +∞.
We will need the score to be sub-Gaussian to obtain consistency results. Given Assumption (A2), Stoker [1986] showed, as a simple consequence of integration by parts, that for k = 1, and if Assumption (A1) is satisfied, then E(y𝒮₁(x)) is proportional to w, for all differentiable functions g, with a proportionality constant that depends on w and ∇g. This leads to the "average derivative method" (ADE) and thus replaces the Gaussian assumption by the existence of a differentiable log-density, which is much weaker. This however does not remove the restriction k = 1, which can be lifted in two ways that we now present.
Sliced inverse regression. Given a normalized Gaussian distribution for x (or any elliptically symmetric distribution), if (A1) is satisfied, then, almost surely in y, the conditional expectation E(x|y) happens to belong to the e.d.r. subspace. Given several distinct values of y, the vectors E(x|y), or any estimates thereof, will hopefully span the entire e.d.r. space and we can recover the entire matrix w, leading to "sliced inverse regression" (SIR), originally proposed by Li and Duan [1989], Duan and Li [1991], Li [1991]. This allows estimation with k > 1, but is still restricted to Gaussian data. In this chapter, we propose to extend SIR by the use of score functions to go beyond elliptically symmetric distributions, and we show that the new method combining SIR and score functions is formally better than the plain ADE method.
From first-order to second-order moments. Another line of extension of the simple method of Brillinger [1982] is to consider higher-order moments, namely the matrix E(yxx^⊤) ∈ R^{d×d}, which, for normally distributed input data x, and if (A1) is satisfied, will be proportional (in a particular form to be described in Section 2.2.2) to the Hessian of the function g, leading to the method of "principal Hessian directions" (PHD) from Li [1992]. Again, k > 1 is allowed (more than a single projection), but this is limited to elliptically symmetric data. However, Janzamin et al. [2014] proposed to use second-order score functions to go beyond this assumption. In order to define this new method, we consider the following assumption:
(A3) The distribution of x has a strictly positive density p(x) which is twice differentiable with respect to the Lebesgue measure, and such that p(x) and ‖∇p(x)‖ → 0 when ‖x‖ → +∞.
Given (A1) and (A3), one can show [Janzamin et al., 2014] that E(y𝒮₂(x)) will be proportional to the Hessian of the function g, where 𝒮₂(x) = ∇² log p(x) + 𝒮₁(x)𝒮₁(x)^⊤ = (1/p(x)) ∇²p(x), thus extending the Gaussian situation above, where 𝒮₁ was a linear function and 𝒮₂(x), up to linear terms, proportional to xx^⊤.
In this chapter, we propose to extend the method above to allow an SIR estimator for the second-order score functions, where we condition on y, and we show that the new method is formally better than the plain method of Janzamin et al. [2014].
Learning score functions through score matching. Relying on score functions immediately raises the following question: is estimating the score function (when not available) really simpler than our original problem of non-parametric regression? Fortunately, a recent line of work [Hyvärinen, 2005] has considered this exact problem and formulated the task of density estimation directly in terms of score functions, which is particularly useful in our context. We may then use the data first to learn the score, and then use the novel score-based moments to estimate w. We will also consider a direct approach that jointly estimates the score function and the e.d.r. subspace, by regularizing with a sparsity-inducing norm.
Fighting the curse of dimensionality. Learning the score function is still a non-parametric problem, with the associated curse of dimensionality. If we first learn the score function (through score matching) and then learn the matrix w, we will not escape that curse, while our direct approach is empirically more robust.
Note that Hristache, Juditsky and Spokoiny [Hristache et al., 2001] suggested iterative improvements of the ADE method, using elliptic windows which shrink in the directions of the columns of w, stretch in all other directions and tend to flat layers orthogonal to w. Dalalyan, Juditsky and Spokoiny [Dalalyan et al., 2008] generalized the algorithm to multi-index models and proved √n-consistency of the proposed procedure in the case when the structural dimension is not larger than 4, with weaker guarantees for k > 4. In particular, they provably avoid the curse of dimensionality. Such extensions are outside the scope of this chapter.
Contributions. In this chapter, we make the following contributions:
— We propose score function extensions to sliced inverse regression problems, both for the first-order and second-order score functions. We consider the infinite sample case in Section 2.2 and the finite sample case in Section 2.3. They provably improve estimation in the population case over the non-sliced versions, while we study in Section 2.3 finite sample estimators and their consistency given the exact score functions.
— We propose in Section 2.4 to learn the score function as well, in two steps, i.e., first learning the score function and then learning the e.d.r. space parameterized by w, or directly, by solving a convex optimization problem regularized by the nuclear norm.
— We illustrate our results in Section 2.5 on a series of experiments.
2.2 Estimation with infinite sample size
In this section, we focus on the population situation, where we can compute expectations and conditional expectations exactly; we consider finite sample estimators with known score functions in Section 2.3, with consistency results in Section 2.3.3, and with learned score functions in Section 2.4.
2.2.1 SADE: Sliced average derivative estimation
Before presenting our new moments which will lead to the novel SADE method, we consider the non-sliced method, which is based on Assumptions (A1) and (A2) and score functions (the method based on the Gaussian assumption will be derived later as corollaries). The ADE method is based on the following lemma:

Lemma 2.1 (ADE moment [Stoker, 1986]). Assume (A1), (A2), the differentiability of f and the existence of the expectation E(f′(wᵀx)). Then E(S₁(x)y) is in the e.d.r. subspace.
Proof. Since y = f(x) + ε, and ε is independent of x with zero mean, we have

E(S₁(x)y) = E(S₁(x)f(x)) = ∫_{ℝᵈ} (−∇p(x)/p(x)) f(x) p(x) dx = −∫_{ℝᵈ} ∇p(x) f(x) dx = ∫_{ℝᵈ} p(x) ∇f(x) dx, by integration by parts, = w · E(f′(wᵀx)),

which leads to the desired result. Note that in the integration by parts above, the decay of p(x) to zero for ‖x‖ → +∞ is needed.
The ADE moment above only provides a single vector in the e.d.r. subspace, which can only potentially lead to recovery for k = 1, and only if E(f′(wᵀx)) ≠ 0, which may not be satisfied, e.g., if x has a symmetric distribution and f is even.
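As a quick illustration of the ADE moment (not part of the original development), here is a Monte Carlo sketch for standard Gaussian inputs, where S₁(x) = x: the empirical moment E(S₁(x)y) should align with w whenever E(f′(wᵀx)) ≠ 0. The link function f(t) = t³ + t and all dimensions are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100_000
w = np.zeros(d)
w[0] = 1.0                                      # hypothetical e.d.r. direction
x = rng.standard_normal((n, d))                 # standard Gaussian inputs: S1(x) = x
t = x @ w
y = t ** 3 + t + 0.1 * rng.standard_normal(n)   # f(t) = t^3 + t, so E f'(w^T x) = 4 != 0

g_hat = (x * y[:, None]).mean(axis=0)           # empirical ADE moment E(S1(x) y)
cosine = abs(g_hat @ w) / np.linalg.norm(g_hat)
assert cosine > 0.99
```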
We can now present our first new lemma, the proof of which relies on similar arguments as for SIR [Li and Duan, 1989], but extended to score functions. Note that we do not require the differentiability of the function f.

Lemma 2.2 (SADE moment). Assume (A1) and (A2). Then E(S₁(x)|y) is in the e.d.r. subspace almost surely (in y).
Proof. We consider any vector v ∈ ℝᵈ in the orthogonal complement of the subspace Span(w₁, …, w_k). We need to show that vᵀE(S₁(x)|y) = 0 with probability 1. We have, by the law of total expectation,

vᵀE(S₁(x)|y) = E( E(vᵀS₁(x) | w₁ᵀx, …, w_kᵀx, y) | y ).

Because of Assumption (A1), we have y = f(wᵀx) + ε with ε independent of x, and thus

E(vᵀS₁(x) | wᵀx, y) = E(vᵀS₁(x) | wᵀx, ε) = E(vᵀS₁(x) | wᵀx) almost surely.

This leads to

vᵀE(S₁(x)|y) = E[ E(vᵀS₁(x) | wᵀx) | y ].

We now prove that almost surely E(vᵀS₁(x)|wᵀx) = 0, which will be sufficient to prove Lemma 2.2. We consider the linear transformation of coordinates x̃ = w̃ᵀx ∈ ℝᵈ, where w̃ = (w₁, …, w_k, w_{k+1}, …, w_d) is a square matrix with full rank obtained by adding a basis of the subspace orthogonal to the span of the k columns of w. Then, if p̃ is the density of x̃, we have p(x) = (det w̃) · p̃(x̃) and thus ∇p(x) = (det w̃) · w̃ · ∇p̃(x̃), and ṽ = w̃ᵀv = (0, …, 0, ṽ_{k+1}, …, ṽ_d) ∈ ℝᵈ (because v ⊥ Span(w₁, …, w_k)). The desired conditional expectation equals

E( ṽᵀ S̃₁(x̃) | x̃₁, …, x̃_k ),

since w̃ S̃₁(x̃) = −w̃ ∇p̃(x̃)/p̃(x̃) = −∇p(x)/p(x) = S₁(x), and hence vᵀS₁(x) = vᵀ w̃ S̃₁(x̃) = ṽᵀ S̃₁(x̃). It is thus sufficient to show that

∫_{ℝ^{d−k}} ṽᵀ S̃₁(x̃) p̃(x̃) dx̃_{k+1} … dx̃_d = 0, for all x̃₁, …, x̃_k.

We have

∫_{ℝ^{d−k}} ṽᵀ S̃₁(x̃) p̃(x̃) dx̃_{k+1} … dx̃_d = −ṽᵀ ∫_{ℝ^{d−k}} ∇p̃(x̃) dx̃_{k+1} … dx̃_d = −∑_{j=k+1}^{d} ṽ_j ∫ ⋯ ∫ ( ∫_{−∞}^{+∞} (∂p̃(x̃)/∂x̃_j) dx̃_j ) ∏_{k+1 ≤ t ≤ d, t ≠ j} dx̃_t = 0,

because for any j ∈ {k+1, …, d}, ∫_{−∞}^{+∞} (∂p̃(x̃)/∂x̃_j) dx̃_j = 0 by Assumption (A2). This leads to the desired result.
The key differences are now that:
— Unlike ADE, by conditioning on different values of y, we have access to several vectors E(S₁(x)|y) ∈ ℝᵈ.
— Unlike SIR, SADE does not require the linearity condition from [Li, 1991] anymore, and can be used with any smooth enough probability density.
In the population case, we will consider the following matrix (using the fact that E(S₁(x)) = 0):

V₁,cov = E[ E(S₁(x)|y) E(S₁(x)|y)ᵀ ] = Cov[ E(S₁(x)|y) ] ∈ ℝ^{d×d},

which we will also denote E[E(S₁(x)|y)^{⊗2}], where for any matrix M, M^{⊗2} denotes MMᵀ. The matrix above is positive semi-definite, and its column space is included in the e.d.r. space. If it has rank k, then we can exactly recover the entire subspace by an eigenvalue decomposition. When k = 1, which is the only case where ADE may be used, the following proposition shows that if ADE recovers w, then so does SADE.
We will also consider the other matrix (note the presence of the extra term y²)

V′₁,cov = E[ y² E(S₁(x)|y) E(S₁(x)|y)ᵀ ],

because of its direct link with the non-sliced version. Note that we make the weak assumption that the matrices V₁,cov and V′₁,cov exist, which is satisfied for the majority of problems.
Proposition 2.1. Assume (A1) and (A2), with k = 1, as well as differentiability of f and existence of the expectation E f′(wᵀx). The vector w may be recovered from the ADE moment (up to scale) if and only if E f′(wᵀx) ≠ 0. If this condition is satisfied, then SADE also recovers w up to scale (i.e., V₁,cov and V′₁,cov are different from zero).
Proof. The first statement is a consequence of the proof of Lemma 2.1. If SADE fails, that is, for almost all y, E(S₁(x)|y) = 0, then E(S₁(x)y|y) = 0, which implies that E(S₁(x)y) = 0 and thus ADE fails. Moreover, we have, using operator convexity [Donoghue, 1974]:

V′₁,cov = E[ E(yS₁(x)|y) E(yS₁(x)|y)ᵀ ] ≽ [ E(E(yS₁(x)|y)) ][ E(E(yS₁(x)|y)) ]ᵀ = [ E(yS₁(x)) ][ E(yS₁(x)) ]ᵀ,

showing that the new moment dominates the ADE moment, which provides an alternative proof that the rank of V′₁,cov is at least one if E(yS₁(x)) ≠ 0.
Elliptically symmetric distributions. If x is normally distributed with mean vector μ and covariance matrix Σ, then we have S₁(x) = Σ⁻¹(x − μ), and we recover the result from Li and Duan [1989]. Note that the lemma then extends to all elliptical distributions of the form g(½(x−μ)ᵀΣ⁻¹(x−μ)), for a certain function g: ℝ₊ → ℝ. See Li and Duan [1989] for more details.
Failure modes. In some cases, sliced inverse regression does not span the entire e.d.r. space, because the inverse regression curve E(S₁(x)|y) is degenerate. For example, this can occur if k = 1, y = h(w₁ᵀx) + ε, h is an even function and w₁ᵀx has a symmetric distribution around 0. Then E(S₁(x)|y) ≡ 0, and thus it is a poor estimate of the desired e.d.r. directions [Cook and Weisberg, 1991].
The second drawback of SIR occurs when we have a classification task, for example y ∈ {0, 1}. In this case, we have only two slices (i.e., possible values of y), and SIR can recover only one direction in the e.d.r. space [Cook and Lee, 1999].
Li [1992] suggested another way to estimate the e.d.r. space which can handle such symmetric cases: principal Hessian directions (PHD). However, this method relies on the normality of the vector x. As in the SIR case, we can extend this method using score functions so that it applies to any distribution; we refer to the resulting method as SPHD, which we now present.
2.2.2 SPHD: Sliced principal Hessian directions
Before presenting our new moment which will lead to the SPHD method, we consider the non-sliced method, which is based on Assumptions (A1) and (A3) and score functions (the method based on the Gaussian assumption will be derived later as corollaries). The method of Janzamin et al. [2014], which we refer to as "PHD+", is based on the following lemma (we reproduce the proof for readability):

Lemma 2.3 (second-order score moment (PHD+) [Janzamin et al., 2014]). Assume (A1), (A3), twice differentiability of the function f and existence of the expectation E(∇²f(wᵀx)). Then E(S₂(x)y) has a column space included in the e.d.r. subspace.
Proof. Since y = f(x) + ε, and ε is independent of x, we have

E(yS₂(x)) = E(f(x)S₂(x)) = ∫ (∇²p(x)/p(x)) f(x) p(x) dx = ∫ ∇²p(x) · f(x) dx = ∫ p(x) · ∇²f(x) dx = E[∇²f(x)],

using integration by parts and the decay of p(x) and ∇p(x) for ‖x‖ → ∞. This leads to the desired result since ∇²f(x) = w ∇²f(wᵀx) wᵀ. This was proved by Li [1992] for normal distributions.
Failure modes. The method does not work properly if rank(E(∇²f(wᵀx))) < k. For example, if f is a linear function, then E(∇²f(wᵀx)) ≡ 0 and the estimated e.d.r. space is degenerate. Moreover, the method fails in symmetric cases: for example, if f is an odd function with respect to any variable and p(x) is an even function, then rank(E(∇²f(wᵀx))) < k.
We can now present our second new lemma, the proof of which relies on similar arguments as for PHD [Li, 1992] but extended to score functions (again, no differentiability is assumed on f):

Lemma 2.4 (SPHD moment). Assume (A1) and (A3). Then E(S₂(x)|y) has a column space within the e.d.r. subspace almost surely.
Proof. We consider any z ∈ ℝᵈ and any v ∈ ℝᵈ orthogonal to the e.d.r. subspace, and prove that vᵀE(S₂(x)|y)z = 0. We use the same transformation of coordinates as in the proof of Lemma 2.2: x̃ = w̃ᵀx ∈ ℝᵈ. Then ∇²p(x) = (det w̃) · w̃ ∇²p̃(x̃) w̃ᵀ, ṽ = w̃ᵀv = (0, …, 0, ṽ_{k+1}, …, ṽ_d), and we will prove that E(vᵀS₂(x)z | wᵀx) = 0 almost surely; moreover, w̃ S̃₂(x̃) w̃ᵀ = w̃ ∇²p̃(x̃) w̃ᵀ / p̃(x̃) = ∇²p(x)/p(x) = S₂(x), so that vᵀS₂(x)z = ṽᵀ S̃₂(x̃) z̃ with z̃ = w̃ᵀz. It is sufficient to show that, for all x̃₁, …, x̃_k:

∫_{ℝ^{d−k}} ṽᵀ S̃₂(x̃) z̃ · p̃(x̃) dx̃_{k+1} … dx̃_d = 0.

We have:

∫_{ℝ^{d−k}} ṽᵀ S̃₂(x̃) z̃ · p̃(x̃) dx̃_{k+1} … dx̃_d = ṽᵀ [ ∫_{ℝ^{d−k}} ∇²p̃(x̃) dx̃_{k+1} ⋯ dx̃_d ] z̃ = ∑_{1≤i≤d, k+1≤j≤d} ṽ_j z̃_i ∫ ⋯ ∫ [ ∫_{−∞}^{+∞} (∂²p̃(x̃)/∂x̃_i∂x̃_j) dx̃_j ] ∏_{k+1≤t≤d, t≠j} dx̃_t = 0,

because for any j ∈ {k+1, …, d}: ∫_{−∞}^{+∞} (∂²p̃(x̃)/∂x̃_i∂x̃_j) dx̃_j = 0 due to Assumption (A3), which leads to the desired result.
In order to be able to use several values of y, we will estimate the matrix V₂ = E([E(S₂(x)|y)]²), which is the expectation with respect to y of the square (in the matrix multiplication sense) of the conditional expectation from Lemma 2.4, as well as V′₂ = E(y²[E(S₂(x)|y)]²), and consider the k largest eigenvectors (we make the weak assumption that the matrices V₂ and V′₂ exist, which is satisfied for the majority of problems). From the lemma above, this matrix has a column space included in the e.d.r. subspace; thus, if it has rank k, we get exact recovery by an eigenvalue decomposition.
Effect of slicing. We now study the effect of slicing and show that in the population case it is superior to the non-sliced version, as it recovers the true e.d.r. subspace in more situations.

Proposition 2.2. Assume (A1) and (A3). The matrix w may be recovered from the moment in Lemma 2.3 (up to right linear transform) if and only if E[∇²f(wᵀx)] has full rank. If this condition is satisfied, then SPHD also recovers w up to scale.
Proof. The first statement is a consequence of the proof of Lemma 2.3. Moreover, using the Löwner-Heinz theorem about operator convexity [Donoghue, 1974]:

V′₂ = E[ E(yS₂(x)|y)² ] ≽ [ E[E(yS₂(x)|y)] ]² = [ E(yS₂(x)) ]²,

showing that the new moment dominates the PHD moment, thus implying that

rank[V′₂] ≥ rank[ E(yS₂(x)) ].

Therefore, if rank[E(yS₂(x))] = k, then rank[V′₂] = k (note that there is no such simple proof for V₂).
Elliptically symmetric distributions. When x is a standard Gaussian random variable, then S₂(x) = −I + xxᵀ, and thus E([E(S₂(x)|y)]²) = E([I − Cov(x|y)]²), and we recover the sliced average variance estimation (SAVE) method by Cook [2000]. However, our method applies to all distributions (with known score functions).
2.2.3 Relationship between first and second order methods
All considered methods have their own failure modes. The simplest one, ADE, works only in the single-index model and has quite a simple working condition: E[f′(wᵀx)] ≠ 0. The sliced improvement of this algorithm (SADE) has better performance; however, it still suffers in symmetric cases, when the inverse regression curve is partly degenerate. PHD+ cannot work properly in linear models and symmetric cases. SPHD is stronger than PHD+ and potentially has the widest application area among the four described methods. See the summary in Table 2.1.
Our conditions rely on the full possible rank of certain expected covariance matrices. When the function f is selected randomly among all potential functions from ℝᵏ to ℝ, rank deficiencies typically do not occur, and it would be interesting to show that they indeed appear with probability zero for certain random function models.
2.3 Estimation from finite sample
In this section, we consider finite sample estimators for the moments we have defined in Section 2.2. Since our extensions are combinations of existing techniques (using score functions and slicing), our finite-sample estimators naturally rely on existing work [Hsing and Carroll, 1992, Zhu and Ng, 1995].

method | main equation | score | sliced
ADE | E(S₁(x)y) | first | no
SADE | E_y[E(S₁(x)|y)^{⊗2}] | first | yes
PHD+ | E(S₂(x)y) | second | no
SPHD | E_y[(E(S₂(x)|y))²] | second | yes

method | single-index | multi-index
ADE | E[f′(wᵀx)] ≠ 0 | does not work
SADE | E_y[‖E(S₁(x)|y)‖²] > 0 | rank E_y[E(S₁(x)|y)^{⊗2}] = k
PHD+ | E[f″(wᵀx)] ≠ 0 | rank[E[∇²f(x)]] = k
SPHD | E_y[tr((E(S₂(x)|y))²)] > 0 | rank E_y[(E(S₂(x)|y))²] = k

Table 2.1 — Comparison of different methods using score functions.
In this section, we assume that the score function is known. We consider learning the score function in Section 2.4.
2.3.1 Estimator and algorithm for SADE
Our goal is to provide an estimator for V₁,cov = E[E(S₁(x)|y)E(S₁(x)|y)ᵀ] = Cov[E(S₁(x)|y)] given a finite sample (xᵢ, yᵢ), i = 1, …, n. A similar estimator for V′₁,cov could be derived. In order to estimate V₁,cov = Cov[E(S₁(x)|y)], we will use the identity

Cov[S₁(x)] = Cov[E(S₁(x)|y)] + E[Cov(S₁(x)|y)].

We use the natural consistent estimator (1/n) ∑_{i=1}^n S₁(xᵢ)S₁(xᵢ)ᵀ of Cov[S₁(x)].

In order to obtain an estimator of E[Cov(S₁(x)|y)], we consider slicing the real numbers into H different slices I₁, …, I_H, which are contiguous intervals that form a partition of ℝ (or of the range of all yᵢ, i = 1, …, n). We then compute an estimator of the conditional expectation (S₁)_h = E(S₁(x)|y ∈ I_h) with empirical averages: denoting p̂_h the empirical proportion of yᵢ, i = 1, …, n, that fall in the slice I_h (which is assumed to be strictly positive), we estimate E(S₁(x)|y ∈ I_h) by

(Ŝ₁)_h = (1/(np̂_h)) ∑_{i=1}^n 1_{yᵢ∈I_h} S₁(xᵢ).

We then estimate Cov(S₁(x)|y ∈ I_h) by

(Ŝ₁)_{cov,h} = (1/(np̂_h − 1)) ∑_{i=1}^n 1_{yᵢ∈I_h} (S₁(xᵢ) − (Ŝ₁)_h)(S₁(xᵢ) − (Ŝ₁)_h)ᵀ.

Note that it is important here to normalize the covariance computation by 1/(np̂_h − 1) (the usual unbiased normalization of the variance) and not 1/(np̂_h), to allow consistent estimation even when the number of elements per slice is small (e.g., equal to 2).

We finally use the following estimator of V₁,cov:

V̂₁,cov = (1/n) ∑_{i=1}^n S₁(xᵢ)S₁(xᵢ)ᵀ − ∑_{h=1}^H p̂_h · (Ŝ₁)_{cov,h}.
The final SADE algorithm is thus the following:
— Divide the range of y₁, …, y_n into H slices I₁, …, I_H. Let p̂_h > 0 be the proportion of yᵢ, i = 1, …, n, that fall in slice I_h.
— For each slice I_h, compute the sample mean (Ŝ₁)_h and covariance (Ŝ₁)_{cov,h}:

(Ŝ₁)_h = (1/(np̂_h)) ∑_{i=1}^n 1_{yᵢ∈I_h} S₁(xᵢ) and (Ŝ₁)_{cov,h} = (1/(np̂_h − 1)) ∑_{i=1}^n 1_{yᵢ∈I_h} (S₁(xᵢ) − (Ŝ₁)_h)(S₁(xᵢ) − (Ŝ₁)_h)ᵀ.

— Compute V̂₁,cov = (1/n) ∑_{i=1}^n S₁(xᵢ)S₁(xᵢ)ᵀ − ∑_{h=1}^H p̂_h · (Ŝ₁)_{cov,h}.
— Find the k largest eigenvalues and let ŵ₁, …, ŵ_k be eigenvectors in ℝᵈ corresponding to these eigenvalues.
A schematic graphical representation of this method is given in Figure 2-1.
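A minimal NumPy sketch of the steps above, assuming standard Gaussian inputs so that S₁(x) = x is available in closed form (the two-index model used for the check is an arbitrary illustration):

```python
import numpy as np

def sade(s1, y, k, H=20):
    """Eigenvectors of the SADE matrix hat-V_{1,cov}, from score values s1 (n x d) and responses y (n,)."""
    n, d = s1.shape
    total = s1.T @ s1 / n                         # estimator of Cov[S1(x)]  (E S1(x) = 0)
    within = np.zeros((d, d))
    for idx in np.array_split(np.argsort(y), H):  # quantile slicing: H slices of ~n/H points
        p_h = len(idx) / n
        within += p_h * np.cov(s1[idx].T)         # unbiased within-slice covariance (ddof = 1)
    _, vecs = np.linalg.eigh(total - within)
    return vecs[:, -k:]                           # k leading eigenvectors

# Toy check with a hypothetical 2-index model and standard Gaussian inputs (S1(x) = x).
rng = np.random.default_rng(0)
d, n = 6, 50_000
x = rng.standard_normal((n, d))
w = np.eye(d)[:, :2]                              # true e.d.r. subspace: first two coordinates
y = np.exp(x[:, 0]) - np.exp(-x[:, 1]) + 0.1 * rng.standard_normal(n)
W_hat = sade(x, y, k=2)
R2 = 1 - np.trace(w @ w.T @ W_hat @ W_hat.T) / 2  # square trace error, cf. eq. (2.4.3)
assert R2 < 0.1
```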
Choice of slices. There are different ways to choose the slices I₁, …, I_H:
— all slices have the same length, that is, we choose the maximum and the minimum of y₁, …, y_n, and divide the range of y into H equal slices (for simplicity, we assume that n is a multiple of H);
— we can also use the distribution of y to ensure a balanced number of observations in each slice, and choose I_h = (F̂_y⁻¹((h−1)/H), F̂_y⁻¹(h/H)], where F̂_y(t) = (1/n) ∑_{i=1}^n 1_{yᵢ≤t} is the empirical distribution function of y. If n is a multiple of H, there are exactly m = n/H observations per slice.
Later on, in our experiments, we use the second way, where every slice has m = n/H points, and our consistency result applies to this situation as well. Note that there is a variation of SADE which uses standardized data, where the standardized data is defined as S̄₁(xᵢ) = [(1/n) ∑_{j=1}^n S₁(x_j)S₁(x_j)ᵀ]^{−1/2} S₁(xᵢ) (Remark 5.3 in [Li, 1991]).

Figure 2-1 — Graphical explanation of SADE: first we divide y into H slices; let p̂_h be the proportion of the yᵢ in slice I_h; this is an empirical estimator of P(y ∈ I_h). Then we evaluate the empirical estimator (Ŝ₁)_h of E(S₁(x)|y ∈ I_h). Finally, we evaluate the weighted covariance matrix and find the k largest eigenvectors.
Computational complexity. The first step of the algorithm requires O(n) elementary operations, the second O(nd²) operations, the third O(Hd²) and the fourth O(kd²) operations. The overall dependence on the dimension d is quadratic, while the dependence on the number of observations is linear in n, as is common in moment-matching methods.

Estimating the number of components. Our estimation method does not depend on k, up to the last step where the k largest eigenvectors are selected. A simple heuristic to select k, similar to the selection of the number of components in principal component analysis, would select the largest k such that the gap between the k-th and (k+1)-th eigenvalues is large enough. This could be made more formal using the technique of Li [1991] for sliced inverse regression.
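The eigengap heuristic above can be sketched as follows (a simple illustration, not the formal procedure of Li [1991]):

```python
import numpy as np

def choose_k(eigvals):
    """Select k at the largest gap between consecutive eigenvalues, sorted in decreasing order."""
    lam = np.sort(np.asarray(eigvals))[::-1]
    return int(np.argmax(lam[:-1] - lam[1:])) + 1

# two large eigenvalues, then a clear drop: the heuristic picks k = 2
assert choose_k([5.1, 4.9, 0.2, 0.1, 0.05]) == 2
```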
2.3.2 Estimator and algorithm for SPHD
We follow the same approach as for the SADE algorithm above, leading to the following algorithm, which estimates V₂ = E([E(S₂(x)|y)]²) and computes its principal eigenvectors. Note that E[E(S₂(x)|y ∈ I_h)] = E[S₂(x)] = 0.
— Divide the range of y₁, …, y_n into H slices I₁, …, I_H. Let p̂_h > 0 be the proportion of yᵢ, i = 1, …, n, that fall in slice I_h.
— For each slice, compute the sample mean (Ŝ₂)_h of S₂(x): (Ŝ₂)_h = (1/(np̂_h)) ∑_{i=1}^n 1_{yᵢ∈I_h} S₂(xᵢ).
— Compute the weighted covariance matrix V̂₂ = ∑_{h=1}^H p̂_h (Ŝ₂)²_h, find the k largest eigenvalues and let ŵ₁, …, ŵ_k be eigenvectors corresponding to these eigenvalues.
The matrix V̂₂ is then an estimator of E([E(S₂(x)|y)]²).
Computational complexity. The first step of the algorithm requires O(n) elementary operations, the second O(nd²) operations, the third O(Hd³) and the fourth O(kd²) operations. The overall dependence on the dimension d is cubic, hence the method is slower than SADE (but still linear in the number of observations n).
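A minimal sketch of the SPHD steps, again assuming standard Gaussian inputs so that S₂(x) = xxᵀ − I is available in closed form; the even link y = (wᵀx)² is exactly a case where the first-order (ADE/SADE) moments vanish but SPHD succeeds:

```python
import numpy as np

def sphd(x, y, k, H=10):
    """SPHD sketch for standard Gaussian inputs, where S2(x) = x x^T - I."""
    n, d = x.shape
    V2 = np.zeros((d, d))
    for idx in np.array_split(np.argsort(y), H):   # quantile slices
        p_h = len(idx) / n
        xs = x[idx]
        S2_h = xs.T @ xs / len(idx) - np.eye(d)    # slice mean of S2(x)
        V2 += p_h * (S2_h @ S2_h)                  # weighted matrix square
    _, vecs = np.linalg.eigh(V2)
    return vecs[:, -k:]

rng = np.random.default_rng(0)
d, n = 6, 50_000
x = rng.standard_normal((n, d))
w = np.eye(d)[:, 0]                                # hypothetical single-index direction
y = (x @ w) ** 2 + 0.1 * rng.standard_normal(n)    # even link: E(S1(x)|y) is degenerate
v = sphd(x, y, k=1)[:, 0]
assert abs(v @ w) > 0.95
```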
2.3.3 Consistency for the SADE estimator and algorithm
In this section, we prove the consistency of the SADE moment estimator and the resulting algorithm, when the score function is known. Following Hsing and Carroll [1992] and Zhu and Ng [1995], we can get √n-consistency for the SADE algorithm under very broad assumptions on the problem. Here we focus on the simplest set of assumptions, to pave the way for the analysis with the nuclear norm in future work. The key novelty compared to Hsing and Carroll [1992], Zhu and Ng [1995] is a precise non-asymptotic analysis with explicit constants.
We make the following assumptions:
(L1) The function m: ℝ → ℝᵈ such that E(h(x)|y) = m(y) is L-Lipschitz-continuous.
(L2) The random variable y ∈ ℝ is sub-Gaussian, i.e., such that E e^{t(y−Ey)} ≤ e^{σ_y²t²/2}, for some σ_y > 0.
(L3) The random variables h_j(x) ∈ ℝ are sub-Gaussian, i.e., such that E e^{t h_j(x)} ≤ e^{σ_h²t²/2} for each component j ∈ {1, …, d}, for some σ_h > 0.
(L4) The random variables ε_j = h_j(x) − m_j(y) ∈ ℝ are sub-Gaussian, i.e., such that E e^{t ε_j} ≤ e^{σ_ε²t²/2} for each component j ∈ {1, …, d}, for some σ_ε > 0.
Now we formulate and prove the main theorem, where ‖·‖_* is the nuclear norm, defined as ‖A‖_* = tr(√(AᵀA)):
Theorem 2.1. Under assumptions (L1)–(L4), we get the following bound on ‖V̂₁,cov − V₁,cov‖_*: for any δ < 1/e, with probability not less than 1 − δ:

‖V̂₁,cov − V₁,cov‖_* ≤ (d√d (195σ_ε² + 2σ_h²)/√n) √(log(24d²/δ)) + (8L²σ_y² + 16dσ_ε σ_y L)/√n + (157σ_ε² + 2σ_h²)(d√d/n) log²(32d²n/δ).
The proof of the theorem can be found in Appendix 2.7.2. The leading term is proportional to d√d σ_ε²/√n, with a tail which is sub-Gaussian. We thus get a √n-consistent estimator of V₁,cov. The dependency in d could probably be improved, in particular when using slices of sizes m that tend to infinity (as done in [Lin et al., 2018]).
2.4 Learning score functions
All previous methods can work only if we know the score function of first or second order. In practice, we do not have such information, and we have to learn the score functions from sampled data. In this section, we only consider the first-order score function h(x) = S₁(x) = −∇ log p(x).
We first present the score matching approach of Hyvรคrinen [2005], and then applyit to our problem.
2.4.1 Score matching to estimate score from data
Given the true score function h(x) = S₁(x) = −∇ log p(x) and some i.i.d. data generated from p(x), score matching aims at estimating the parameters of a model ĥ(x) for the score function h(x), by minimizing an empirical quantity aiming to estimate

J_score(ĥ) = ½ ∫_{ℝᵈ} p(x) ‖ĥ(x) − h(x)‖² dx.
As is, the quantity above leads to consistent estimation (i.e., pushing ĥ close to h), but seems to need the knowledge of the true score h(x). A key insight from Hyvärinen [2005] is to use integration by parts to get (assuming the integrals exist):

J_score(ĥ) = ½ ∫_{ℝᵈ} p(x) [ ‖h(x)‖² + ‖ĥ(x)‖² + 2ĥ(x)ᵀ∇ log p(x) ] dx
= ½ ∫_{ℝᵈ} p(x)‖h(x)‖² dx + ½ ∫_{ℝᵈ} p(x)‖ĥ(x)‖² dx + ∫_{ℝᵈ} ĥ(x)ᵀ∇p(x) dx
= ½ ∫_{ℝᵈ} p(x)‖h(x)‖² dx + ½ ∫_{ℝᵈ} p(x)‖ĥ(x)‖² dx − ∫_{ℝᵈ} (∇·ĥ)(x) p(x) dx,

by integration by parts, where (∇·ĥ)(x) = ∑_{j=1}^d (∂ĥ_j/∂x_j)(x) is the divergence of ĥ at x [Arfken, 1985].

The first part of the last right-hand side does not depend on ĥ, while the two other parts are expectations under p(x) of quantities that only depend on ĥ. Thus it can be well approximated, up to a constant, by:

Ĵ_score(ĥ) = (1/(2n)) ∑_{i=1}^n ‖ĥ(xᵢ)‖² − (1/n) ∑_{i=1}^n (∇·ĥ)(xᵢ).
Parametric assumption. If we assume that the score is a linear combination of finitely many basis functions, we will get a consistent estimator of these parameters. That is, we make the following assumption:
(A4) The score function h(x) is a linear combination of known basis functions φ_j(x), j = 1, …, M, where φ_j: ℝᵈ → ℝᵈ; that is, h(x) = ∑_{j=1}^M φ_j(x) θ*_j for some θ* ∈ ℝ^M. We assume that the score function and its derivatives are square-integrable with respect to p(x).
In this chapter, we consider for simplicity a parametric assumption for the score. In order to go towards non-parametric estimation, we would need to let the number M of basis functions grow with n (exponentially with no added assumptions), and this is an interesting avenue for future work. In simulations in 2.5, we consider a simple set of basis functions which are localized functions around observations; these can approximate reasonably well in practice most densities, and led to good estimation of the e.d.r. subspace. Moreover, if we have the additional knowledge that the components of x are statistically independent (potentially after linearly transforming them using independent component analysis [Hyvärinen et al., 2004]), we can use separable functions for the scores.
We introduce the notation Ψ(x) ∈ ℝ^{M×d}, the matrix whose (j, ℓ)-th entry is the ℓ-th coordinate of the j-th basis function φ_j(x), so that the score function ĥ we wish to estimate has the form

ĥ(x) = Ψ(x)ᵀθ ∈ ℝᵈ.

We also introduce the notation (∇·Ψ)(x) = ((∇·Ψ₁)(x), …, (∇·Ψ_M)(x))ᵀ ∈ ℝ^M, where Ψ_j denotes the j-th row of Ψ, so that (∇·ĥ)(x) = θᵀ(∇·Ψ)(x). The empirical score matching objective may then be written as:

Ĵ_score(θ) = ½ θᵀ( (1/n) ∑_{i=1}^n Ψ(xᵢ)Ψ(xᵢ)ᵀ )θ − θᵀ( (1/n) ∑_{i=1}^n (∇·Ψ)(xᵢ) ),  (2.4.1)

which is a quadratic function of θ and can thus be minimized by solving a linear system in running time O(M³ + M²nd) (to form the matrix and to solve the system).
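A small sketch of this linear-system view of score matching, with the toy basis φ_j(x) = x_j e_j (so M = d, Ψ(x) = diag(x) and (∇·Ψ)(x) = (1, …, 1)); for standard Gaussian data the true score is h(x) = x, i.e., θ* = (1, …, 1). The basis is an arbitrary choice for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10_000
x = rng.standard_normal((n, d))

G = np.zeros((d, d))                      # (1/n) sum Psi(x_i) Psi(x_i)^T
for xi in x:
    Psi = np.diag(xi)                     # Psi(x) in R^{M x d}, here M = d
    G += Psi @ Psi.T / n
b = np.ones(d)                            # (1/n) sum (div Psi)(x_i) = (1, ..., 1)
theta = np.linalg.solve(G, b)             # minimizer of the quadratic objective (2.4.1)
assert np.allclose(theta, np.ones(d), atol=0.1)
```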
Given standard results regarding the convergence of (1/n) ∑_{i=1}^n Ψ(xᵢ)Ψ(xᵢ)ᵀ ∈ ℝ^{M×M} to its expectation and of (1/n) ∑_{i=1}^n (∇·Ψ)(xᵢ) ∈ ℝ^M to its expectation, we get a √n-consistent estimate of θ* under simple assumptions (see Theorem 2.2).
2.4.2 Score matching for sliced inverse regression: two-step approach
We can now combine our linear parametrization of the score with the SIR approach outlined in Section 2.3. The true conditional expectation is

E(h(x)|y) = ∑_{j=1}^M E(φ_j(x)|y) θ*_j,

and belongs to the e.d.r. subspace. We consider H different slices I₁, …, I_H, and the following estimator, which simply replaces the true score by the estimated score (i.e., θ* by θ̂), and highlights the dependence on θ.
The estimator V̂₁,cov can be rewritten as

V̂₁,cov = ∑_{i=1}^n ∑_{j=1}^n ( α_{i,j} / (n(|I_{ℓ(i,j)}| − 1)) ) h(xᵢ)h(x_j)ᵀ,

where |I_{ℓ(i,j)}| is the number of observations in the common slice of yᵢ and y_j, and

α_{i,j} = 1 if i ≠ j and yᵢ, y_j are in the same slice, and 0 otherwise.

Using the linear form ĥ(x) = Ψ(x)ᵀθ:

V̂₁,cov(θ) = (1/n) ∑_{i=1}^n ∑_{j=1}^n ( α_{i,j} / (|I_{ℓ(i,j)}| − 1) ) Ψ(xᵢ)ᵀθθᵀΨ(x_j).  (2.4.2)
In the two-step approach, we first solve the score matching optimization problem to obtain an estimate θ̂ of the optimal parameters θ*, and then use it to get the k largest eigenvectors of the covariance matrix V̂₁,cov(θ̂). This approach works well in low dimensions, when the score function can be well approximated; otherwise, we may suffer from the curse of dimensionality: if we want a good approximation of the score function, we need an exponential number of basis functions. In Section 2.4.3, we consider a direct approach aiming at improving robustness.
Consistency. Let us also provide a consistency result in the case of an unknown score function, under Assumption (A4), for the two-step algorithm. We make the following additional assumptions:
(M1) The entries Ψ_{j,ℓ}(x) ∈ ℝ are sub-Gaussian, i.e., such that E e^{t(Ψ_{j,ℓ}(x) − EΨ_{j,ℓ}(x))} ≤ e^{σ_Ψ²t²/2}, for some σ_Ψ > 0.
(M2) The random variables (∇·Ψ_j)(x) are sub-Gaussian, i.e., such that E e^{t((∇·Ψ_j)(x) − E(∇·Ψ_j)(x))} ≤ e^{σ_∇Ψ²t²/2}, for some σ_∇Ψ > 0.
(M3) The matrix E[Ψ(x)Ψ(x)ᵀ] is non-degenerate, and we let μ₀ denote its minimal eigenvalue.
Theorem 2.2. Let θ̂ be the estimate of θ obtained in the first step of the algorithm. Under Assumptions (A1), (A2), (A4), (L1)–(L4) and (M1), (M2), (M3), for δ ≤ 1/e and n large enough, that is, n ≥ C₁ + C₂ log(1/δ) for some positive constants C₁ and C₂ not depending on n and δ, with probability not less than 1 − δ:

‖V̂₁,cov(θ*) − V̂₁,cov(θ̂)‖_* ≤ (1/√n) √(2 log(48d²n/δ)) × [ d√d(195σ_ε² + 2σ_h²) + (9d/2) E‖Ψ(x)‖₂² · ‖θ*‖ ( 2σ_∇Ψ√M/μ₀ + 4√M d σ_Ψ² √(3M)/μ₀ ) ] + O(1/n).
Now, let us formulate and prove a non-asymptotic result relating the true and the estimated e.d.r. spaces. We need to define a notion of distance between subspaces spanned by two sets of vectors. We use the square trace error R²(w, ŵ) [Hooper, 1959]:

R²(w, ŵ) = 1 − (1/k) tr[ wwᵀ · ŵŵᵀ ],  (2.4.3)

where w ∈ ℝ^{d×k} and ŵ ∈ ℝ^{d×k} both have orthonormal columns. It is always between zero and one, and equal to zero if and only if Span(w) = Span(ŵ). This distance is closely related to the notion of principal angles:

sin Θ(ŵᵀw) = diag( sin(cos⁻¹ σ₁), …, sin(cos⁻¹ σ_k) ),

where σ₁, …, σ_k are the singular values of the matrix ŵᵀw. In fact, R(w, ŵ) · √k = ‖sin Θ(ŵᵀw)‖_F. We use the Davis-Kahan "sin θ theorem" [Stewart and Sun, 1990, Theorem V.3.6] in the following form [Yu et al., 2015, Theorem 2]:

Theorem 2.3. Let Σ and Σ̂ ∈ ℝ^{d×d} be symmetric, with eigenvalues λ₁ ≥ ⋯ ≥ λ_d and λ̂₁ ≥ ⋯ ≥ λ̂_d respectively. Fix 1 ≤ r ≤ s ≤ d, let k = s − r + 1, and let U = (u_r, u_{r+1}, …, u_s) ∈ ℝ^{d×k} and Û = (û_r, û_{r+1}, …, û_s) ∈ ℝ^{d×k} have orthonormal columns satisfying Σu_j = λ_j u_j and Σ̂û_j = λ̂_j û_j for j = r, …, s. Let δ = inf{ |λ̂ − λ| : λ ∈ [λ_s, λ_r], λ̂ ∈ (−∞, λ̂_{s+1}] ∪ [λ̂_{r−1}, +∞) }, where λ̂₀ = +∞ and λ̂_{d+1} = −∞. Assume that δ > 0; then:

R(U, Û) ≤ ‖Σ̂ − Σ‖_F / (δ√k).  (2.4.4)
Using Corollary 4.1 and its discussion by Vu and Lei [2013], we can derive that

R(w, ŵ) ≤ ‖V̂₁,cov(θ*) − V̂₁,cov(θ̂)‖_F / ( |λ_k − λ_{k+1}| √k ),

where w and ŵ span the real and the estimated e.d.r. spaces respectively. Using Weyl's inequality [Stewart and Sun, 1990] (if ‖V̂₁,cov(θ*) − V̂₁,cov(θ̂)‖₂ < c, then |λ̂_j − λ_j| < c for all j = 1, …, d) and taking c = (λ_k − λ_{k+1})/2 = λ_k/2, we get:

R(w, ŵ) ≤ 2‖V̂₁,cov(θ*) − V̂₁,cov(θ̂)‖_* / (λ_k √k), if ‖V̂₁,cov(θ*) − V̂₁,cov(θ̂)‖_F < λ_k/2,

because ‖·‖_F ≤ ‖·‖_*. We can now formulate our main theorem for the analysis of the SADE algorithm:
Theorem 2.4. Consider assumptions (A1), (A2), (A4), (L1)–(L4), (M1), (M2), (M3) and:
(K1) The matrix V₁,cov has rank k: λ_k > 0.
For δ ≤ 1/e and n large enough, that is, n ≥ C₁ + C₂ log(1/δ) for some positive constants C₁ and C₂ not depending on n and δ, with probability not less than 1 − δ:

R(ŵ, w) ≤ (2/(√k λ_k √n)) √(2 log(48d²n/δ)) × [ d√d(195σ_ε² + 2σ_h²) + (9d/2) E‖Ψ(x)‖₂² · ‖θ*‖ ( 2σ_∇Ψ√M/μ₀ + 4√M d σ_Ψ² √(3M)/μ₀ ) ] + O(1/n).
We thus obtain a convergence rate in 1/√n, with an explicit dependence on all constants of the problem. The dependence on the dimension d and the number of basis functions M is of the order (forgetting logarithmic terms) O( d^{3/2}/n^{1/2} + (M^{5/2}/n^{1/2}) ‖θ*‖ ). For k > 1, our dependence on the sample size n is improved compared to existing work such as [Dalalyan et al., 2008] (while it matches the dependence of [Hristache et al., 2001] for k = 1), but this comes at the expense of assuming that the score functions satisfy a parametric model.
For a fixed number of basis functions M, we thus escape the curse of dimensionality, as we get a polynomial dependence on d (which we believe could be improved). However, when no assumptions are made on the score functions, the number M and potentially the norm ‖θ*‖ have to grow with the number of observations n. We are indeed faced with a traditional non-parametric estimation problem, and the number M will need to grow with n depending on the smoothness assumptions we are willing to make on the score function, with an effect on θ* (and probably μ₀, which we will neglect in the discussion below). While a precise analysis is out of the scope of this chapter and left for future work, we can make an informal argument as follows: in order to approximate the score with precision ε with a set of basis functions where ‖θ*‖ is bounded, we need M(ε) basis functions. Thus, we end up with two sources of error: an approximation error ε and an estimation error of order O(M(ε)^{5/2}/n^{1/2}). Typically, if we assume that the score has a number of bounded derivatives proportional to the dimension, or if we assume that the input variables are independent and we can thus estimate the score independently for each dimension, M(ε) is of the order 1/ε^{1/α}, where α is independent of the dimension, leading to an overall rate which is independent of the dimension.
2.4.3 Score matching for SIR: direct approach
We can also try to combine these two steps to avoid the "curse of dimensionality". Our estimation of the score, i.e., of the parameter θ, is done only to be used within the SIR approach, where we expect the matrix V̂₁,cov to have rank k. Thus, when estimating θ by minimizing Ĵ_score(θ), we may add a regularization that penalizes large ranks for

V̂₁,cov(θ) = (1/n) ∑_{i,j=1}^n ( α_{i,j} / (|I_{ℓ(i,j)}| − 1) ) Ψ(xᵢ)ᵀθθᵀΨ(x_j),

where we highlight the dependence on θ ∈ ℝ^M. By enforcing the low-rank constraint, our aim is to circumvent a potentially poor estimation of the score function, which could still be good enough for the task of estimating the e.d.r. space (we see a better behavior in our simulations in 2.5).
Introduce the matrix H(η) = (Ψ(x_1)^⊤η, …, Ψ(x_n)^⊤η) ∈ R^{d×n} and A ∈ R^{n×n} with A_{i,j} = α_{i,j}/(n·(|I∩(i,j)| − 1)), and let M(η) = H(η)·A^{1/2} ∈ R^{d×n}.
We may then penalize the nuclear norm of M(η), or potentially consider norms that take into account that we look for a rank k (e.g., the k-support norm on the spectrum of M(η) [McDonald et al., 2014]). We have

‖M(η)‖_* = tr[(M(η)M(η)^⊤)^{1/2}] = tr[V̂_{1,cov}(η)^{1/2}].
Combining the two penalties, we have a convex optimization task:

L(η) = L_score(η) + λ·tr[V̂_{1,cov}(η)^{1/2}].  (2.4.5)
Efficient algorithm. Following Argyriou et al. [2008], we consider reweighted least-squares algorithms. The trace norm admits the variational form:

‖M‖_* = (1/2) inf_{D≻0} tr(M^⊤ D^{−1} M + D).
The optimization problem (2.4.5) can be reformulated in the following way:

η ∈ arg min_η L_score(η) + (λ/2)·tr(M(η)^⊤ D^{−1} M(η))   and   D ← (M(η)M(η)^⊤ + εI_d)^{1/2}.
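The alternating scheme above can be sketched on a toy objective where the quadratic step has a closed form; here (1/2)‖M − M0‖_F² + λ‖M‖_* stands in for the chapter's score-matching problem, and the exact solution is singular-value soft-thresholding of M0, which the iteration should recover.

```python
import numpy as np

# Sketch of the reweighted least-squares scheme of Argyriou et al. on the
# toy objective (1/2)||M - M0||_F^2 + lam*||M||_*, whose exact solution is
# singular-value soft-thresholding of M0. M0 and lam are illustrative
# stand-ins for the chapter's score-matching objective.
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V0, _ = np.linalg.qr(rng.standard_normal((6, 6)))
s0 = np.array([3.0, 2.0, 1.0, 0.05])
M0 = U0 @ np.diag(s0) @ V0[:4, :]          # known singular values s0
lam, eps = 0.5, 1e-10

def sqrtm_psd(S):
    w, U = np.linalg.eigh(S)
    return U @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ U.T

D = np.eye(4)
for _ in range(300):
    # M-step: argmin_M (1/2)||M - M0||^2 + (lam/2) tr(M^T D^{-1} M)
    M = np.linalg.solve(np.eye(4) + lam * np.linalg.inv(D), M0)
    # D-step: D <- (M M^T + eps I)^{1/2}
    D = sqrtm_psd(M @ M.T + eps * np.eye(4))

M_star = U0 @ np.diag(np.maximum(s0 - lam, 0.0)) @ V0[:4, :]
print(np.linalg.norm(M - M_star))  # tiny: the scheme matches soft-thresholding
```

The smoothing εI keeps D invertible when singular values are thresholded to zero, at the price of a perturbation of order √ε in those directions.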
Note that the objective function is a quadratic function of η. Decompose the matrix M in the form:

M(η) = Σ_{i=1}^k η_i M_i,  where M_i ∈ R^{d×n}.
Rearrange the regularizer term:

tr(M(η)^⊤ D^{−1} M(η)) = Σ_{i=1}^k Σ_{j=1}^k tr[η_i M_i^⊤ D^{−1} M_j η_j] = η^⊤ A η,

where A_{i,j} = tr[M_i^⊤ D^{−1} M_j].
Introduce the notation:

𝒜 = (vect(M_1), …, vect(M_k)) ∈ R^{dn×k},
ℬ = (vect(D^{−1}M_1), …, vect(D^{−1}M_k)) ∈ R^{dn×k},

then A = 𝒜^⊤ℬ.
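The identity tr(M(η)^⊤D^{−1}M(η)) = η^⊤(𝒜^⊤ℬ)η can be checked numerically; all matrices below are random placeholders rather than the chapter's actual M_i and D.

```python
import numpy as np

# Check that the regularizer is the quadratic form eta^T A eta with
# A = Avec^T Bvec, where Avec stacks vect(M_i) and Bvec stacks
# vect(D^{-1} M_i). All matrices are random placeholders.
rng = np.random.default_rng(2)
d, n, k = 4, 7, 3
Ms = [rng.standard_normal((d, n)) for _ in range(k)]
eta = rng.standard_normal(k)
S = rng.standard_normal((d, d))
D = S @ S.T + np.eye(d)                      # positive definite
Dinv = np.linalg.inv(D)

M_eta = sum(e * M for e, M in zip(eta, Ms))
lhs = np.trace(M_eta.T @ Dinv @ M_eta)

Avec = np.column_stack([M.ravel() for M in Ms])           # (dn, k)
Bvec = np.column_stack([(Dinv @ M).ravel() for M in Ms])  # (dn, k)
A = Avec.T @ Bvec
rhs = eta @ A @ eta
print(abs(lhs - rhs))  # ~ 0
```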
We can now estimate the complexity of this algorithm. Firstly, we need to evaluate the quadratic form in L_score(η): it is a summation of n multiplications of k×d and d×k matrices, so the complexity of this step is O(nk²d). Secondly, we need to evaluate the matrices D^{−1}, 𝒜 and ℬ, using O(d³), O(kHd) and O(kd²H) operations respectively. Next, we need to evaluate the matrix A, using O(dnk²) operations. Finally, evaluating M(η) requires O(kdn) operations, M(η)M(η)^⊤ requires O(d²n) operations, and the evaluation of D requires O(d³) operations. Combining these complexities, we get O(nk²d + d³ + nd²k) operations, which is still linear in n.
2.5 Experiments
In this section we provide numerical experiments for SADE, PHD+ and SPHD on different functions f. We denote the true and estimated e.d.r. subspaces by E and Ê respectively, defined from w and ŵ.
2.5.1 Known score functions
Consider a Gaussian mixture model with 2 components in R๐:
p(x) = Σ_{i=1}^2 ν_i · [1/((2π)^{d/2}|Σ_i|^{1/2})] · exp(−(1/2)(x − μ_i)^⊤ Σ_i^{−1} (x − μ_i)),  (2.5.1)

where ν = (6/10, 4/10), μ_1 = (−1, …, −1), μ_2 = (1, …, 1), Σ_1 = I_d, Σ_2 = 2·I_d.
Contour lines of this distribution, when d = 2, are shown in Figure 2-2.
The error ε has a standard normal distribution. To estimate the effectiveness of an estimated e.d.r. subspace, we use the square trace error R²(w, ŵ):

R²(w, ŵ) = 1 − (1/k)·tr[P·P̂],

where w and ŵ are the real and the estimated e.d.r. vectors respectively, and P and P̂ are the projectors corresponding to these matrices. Note that (2.4.3) is the special case of this formula with orthonormal matrices.
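A minimal sketch of this error measure, using orthogonal projectors onto the column spans of W and W_hat (the function name is illustrative, not from the chapter):

```python
import numpy as np

# Square trace error between two e.d.r. subspaces, via the orthogonal
# projectors onto the column spans of W and W_hat.
def subspace_error(W, W_hat):
    P = W @ np.linalg.solve(W.T @ W, W.T)
    P_hat = W_hat @ np.linalg.solve(W_hat.T @ W_hat, W_hat.T)
    k = W.shape[1]
    return 1.0 - np.trace(P @ P_hat) / k

d = 10
W = np.eye(d)[:, :2]                         # span{e1, e2}
same = subspace_error(W, 2.0 * W)            # same subspace -> 0
orth = subspace_error(W, np.eye(d)[:, 2:4])  # orthogonal subspace -> 1
print(same, orth)  # 0.0 1.0
```

Since the error only depends on the projectors, any bases spanning the same subspaces give the same value.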
To show the dominance of SADE over SIR (which should only work for elliptically
Figure 2-2 – Contour lines of the Gaussian mixture pdf in the 2D case.
Figure 2-3 – Mean and standard deviation of R²(Ê, E) divided by 10 for the rational function (2.5.2), for SIR and SADE (log2 R² against log2 n).
symmetric distributions), we consider the rational multi-index model of the form
d = 10;   y = x_1/(1/2 + (x_2 + 2)²) + σ·ε.  (2.5.2)
Here the real e.d.r. space is generated by 2 specific vectors: (1, 0, 0, …, 0) and (0, 1, 0, …, 0). The number of slices is H = 10, we consider numbers of observations n from 2^10 to 2^16 equally spaced in logarithmic scale, and we conduct 100 replicates to obtain means and standard deviations of the logarithms of the square trace errors divided by √100 = 10 (to assess significance of differences), as shown in Figure 2-3. Even in this simple model, the ordinary SIR algorithm does not work properly, because the distribution of the inputs x has no elliptical symmetry. When n → ∞, the squared trace error tends to some nonzero constant depending on the properties of the density function, whereas SADE shows good performance with slope −1 (corresponding to a √n-consistent estimator).
Now, we compare the moment methods SADE, PHD+ and SPHD. Although the goal of this chapter is to compare moment-matching techniques, we also compare them with the state-of-the-art MAVE method [Xia et al., 2002b, Wang and Xia, 2007, 2008]. It is worth noting a few properties which we have already discussed concerning these methods:
1. Sliced methods have a wider application area than unsliced ones: SADE is stronger than ADE and SPHD is stronger than PHD+.
2. SADE cannot recover the entire e.d.r. space in several cases, for example, a classification task or symmetric cases. For those cases, we should use second-order methods (i.e., methods based on S_2).
3. MAVE works better, but the goal of this chapter is to compare moment-matching techniques. In Figure 2-12, we provide examples where MAVE suffers from the curse of dimensionality and performs worse than moment-matching methods.
We conduct 3 experiments, where H = 10, d = 10, the error term ε has a standard normal distribution, the numbers of observations n range from 2^10 to 2^16 equally spaced in logarithmic scale, and we made 10 replicates to evaluate sample means and variations:
— Rational model of the form:

y = Σ_{i=1}^d [tanh(8x_i − 16)·i + tanh(8x_i + 16)·(d + 1 − i)] + ε/4,  (2.5.3)

where the effective reduction subspace dimension is k = 2. Results are shown in Figure 2-4, and we can see that the first-order method SADE works better than the second-order SPHD, and SPHD works better than PHD+; that is, slicing makes the method more robust.
— Classification problem of the form

y = 1_{x_1² + 2x_2² > 4} + ε/4.  (2.5.4)

We can see that the error of SADE is close to 0.5 (Figure 2-5). This means that the method finds only one direction in the e.d.r. space. Moreover, SPHD gives better results than PHD+.
— Quadratic model of the form

d = 10;   y = x_1(x_1 + x_2 − 3) + ε.  (2.5.5)

Both SADE and SPHD show a good performance (Figure 2-6), while PHD+ cannot recover the desired projection due to the linearity of the function f.
Figure 2-4 – Mean and standard deviation of R²(Ê, E) divided by √10 for the rational function (2.5.3) with σ = 1/4.

Figure 2-5 – Mean and standard deviation of R²(Ê, E) divided by √10 for the classification problem (2.5.4).
Figure 2-6 – Mean and standard deviation of R²(Ê, E) divided by √10 for the quadratic model (2.5.5) with σ = 1/4.
2.5.2 Unknown score functions
Now, we conduct numerical experiments for SADE with unknown score functions. We consider a "toy" experiment and first examine the results of the 2-step algorithm. We consider again a Gaussian mixture model (2.5.1) with 2 components in R², where ν = (0.6, 0.4), μ_1 = (−0.5, 0.5), μ_2 = (0.5, 0.5), Σ_1 = 0.3·I_2, Σ_2 = 0.4·I_2, with y = sin(x_1 + x_2) + ε, where the error ε has a standard normal distribution. We choose these parameters of the Gaussian mixture to make sure that the probability of the vector x being in the square [−2, 2]² is close to 1.
We chose 100 Gaussian kernels as basis functions:

ψ_{i,j}(x) = exp(−‖x − z_{i,j}‖²/(2h²)),   1 ≤ i, j ≤ 10,

where the z_{i,j} form a uniform grid on the square [−2, 2]² (note that this does not imply any notion of Gaussianity for the underlying density, and the Gaussian kernel here could be replaced by any differentiable local function).
In practice, Ψ̂ in (2.4.1) is close to being degenerate, and we use a regularized estimator η* = −(Ψ̂ + αI)^{−1}·Φ̂ instead. We choose α = 0.01, h = 1 and conduct 100 replicates with numbers of observations n from 2^10 to 2^16 equally spaced in logarithmic scale. The results of the experiments are presented in Figure 2-7: the square trace error tends to zero as n increases.
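The regularized solve can be sketched as follows; Ψ̂ and Φ̂ below are random placeholders for the matrices assembled from the kernel basis functions, so this only illustrates the linear-algebra step, not the chapter's full estimator.

```python
import numpy as np

# Regularized score-matching solve: eta = -(Psi + alpha I)^{-1} Phi.
# Psi and Phi are random placeholders for the chapter's matrices.
rng = np.random.default_rng(3)
p = 100                                   # number of basis functions
G = rng.standard_normal((p, p))
Psi = G @ G.T / p                         # symmetric PSD, near-degenerate
Phi = rng.standard_normal(p)
alpha = 0.01

eta = -np.linalg.solve(Psi + alpha * np.eye(p), Phi)

# The regularized normal equations hold up to round-off.
residual = np.linalg.norm((Psi + alpha * np.eye(p)) @ eta + Phi)
print(residual)  # ~ 0
```

The ridge term αI bounds the condition number of the system, which is what makes the solve stable when Ψ̂ is nearly singular.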
In high dimensions, we cannot use a uniform grid because of the curse of dimensionality. Instead, we use n Gaussian kernels, centered at the sample points x_i. Note here that we can only recover an approximate score function, but our goal is the estimation of the e.d.r. subspace.
We use our reweighted least-squares algorithm to solve the convex problem (2.4.5). We conduct several experiments on functions from the previous section.
Figure 2-7 – Square trace error R²(Ê, E) for different sample sizes (log2 R² against log2 n).
Figure 2-8 – Quadratic model, d = 10, n = 1000: R²(Ê, E) against the bandwidth h, for SADE with the true score, the 1-step algorithm and the 2-step algorithm.
Figure 2-9 – Rational model, d = 10, n = 1000.
Figure 2-10 – Rational model, d = 20, n = 2000.
We plot the performance as a function of the kernel bandwidth h to assess the robustness of the methods. In Figure 2-8 we provide the relationship between the square trace error and h for the quadratic model (2.5.5) for d = 10 and n = 1000. In Figures 2-9, 2-10 and 2-11, we provide the relationship between the square trace error and h for the rational model (2.5.2) for d = 10, n = 1000; d = 20, n = 2000; and d = 50, n = 5000. We see that for large d, the one-step algorithm is more robust than the two-step algorithm. Moreover, the experiments show that even with roughly estimated score functions, correct estimation of the e.d.r. space can be performed.
Comparison with MAVE. We consider the rational model (2.5.3) with n = 10000, H = 10 and the dimension d of the data in the range [20, 150]. The probability density p(x) has independent components, each of which is a mixture of 2 Gaussians with weights (6/10, 4/10), means (−1, 1) and standard deviations (1, 2). The results
Figure 2-11 – Rational model, d = 50, n = 5000.

Figure 2-12 – Increasing dimensions, with the rational model (2.5.3), n = 10000, H = 10, d = 20 to 150.
for MAVE, SADE with known score and SADE with unknown score are shown in Figure 2-12. We can see that SADE with both known and unknown scores loses in the case of low data dimension, but is more resistant to the curse of dimensionality (as expected, since MAVE relies on non-parametric estimation in a space of dimension d, which is here larger). Moreover, the complexity of our moment-matching technique is linear in the number of observations n, while MAVE is superlinear.
2.6 Conclusion
In this chapter we considered a general non-linear regression model under the assumption of dependence on an unknown k-dimensional subspace. Our goal was the direct estimation of this unknown k-dimensional space, which is often called the effective dimension reduction or e.d.r. space. We proposed new approaches (SADE and SPHD), combining two existing techniques (sliced inverse regression (SIR) and score function-based estimation). We obtained consistent estimation for k > 1 using only the first-order score, and proposed explicit approaches to learn the score from data.
It would be interesting to extend our sliced extensions to learning neural networks: indeed, our work focused on the subspace spanned by w and cannot identify individual columns, while Janzamin et al. [2015] showed that by using proper tensor decomposition algorithms and second-order score functions, columns of w can be consistently estimated (in polynomial time).
2.7 Appendix. Proofs
In this appendix, we provide proofs which were omitted in the chapter.
2.7.1 Probabilistic lemma
Let us first formulate and prove an auxiliary lemma about tail inequalities:
Lemma 2.5. Let X be a non-negative random variable such that, for some positive constants A and B, and all q:

E X^q ≤ (A√q + B n^{1/q} q²)^q.

Then, for t ≥ (log n)/2:

P(X ≥ 3A√t + 3B n^{1/t} t²) ≤ 3e^{−t}.
Proof. By Markov's inequality, for every non-negative integer q:

P(X ≥ 3A√q + 3B n^{1/q} q²) = P(X^q ≥ [3A√q + 3B n^{1/q} q²]^q) ≤ [A√q + B n^{1/q} q²]^q / [3A√q + 3B n^{1/q} q²]^q ≤ 3^{−q}.

Consider any t ≥ (log n)/2 and q = ⌊t⌋; then:

P(X ≥ 3A√t + 3B n^{1/t} t²) ≤ P(X ≥ 3A√q + 3B n^{1/q} q²) ≤ 3^{−q} ≤ 3e^{−t},

because the function g(t) = 3A√t + 3B n^{1/t} t² increases on [(log n)/2, +∞).
2.7.2 Proof of Theorem 2.1
Proof. Let (y_(i), x_(i)), i = 1, …, n be the ordered data set, where y_(1) ≤ y_(2) ≤ ⋯ ≤ y_(n). The x_(i) are called the concomitants of the order statistics by Yang [1977]. Introduce double subscripts: h_(l,1) = h(x_(2l−1)), h_(l,2) = h(x_(2l)), y_(l,1) = y_(2l−1) and y_(l,2) = y_(2l).

First introduce an alternative matrix (under the weak assumption of its existence):

V_{1,E} = E[Cov(S_1(x)|y)].

We have the following estimator for this matrix, for m = 2 and H = n/m = n/2:

V̂_{1,E} = (1/n) Σ_{l=1}^H (h_(l,1) − h_(l,2))(h_(l,1) − h_(l,2))^⊤.

Firstly, we estimate the norm of V̂_{1,E} − V_{1,E}, and afterwards the norm of V̂_{1,cov} − V_{1,cov}.
Thus, the deviation from the population version may be split into four terms as follows:

V̂_{1,E} − V_{1,E} = (1/n) Σ_{l=1}^H (h_(l,1) − h_(l,2))(h_(l,1) − h_(l,2))^⊤ − E εε^⊤
= (1/n) Σ_{l=1}^H (ε_(l,1) − ε_(l,2))(ε_(l,1) − ε_(l,2))^⊤ − E εε^⊤
+ (1/n) Σ_{l=1}^H (φ(y_(l,1)) − φ(y_(l,2)))(ε_(l,1) − ε_(l,2))^⊤
+ (1/n) Σ_{l=1}^H (ε_(l,1) − ε_(l,2))(φ(y_(l,1)) − φ(y_(l,2)))^⊤
+ (1/n) Σ_{l=1}^H (φ(y_(l,1)) − φ(y_(l,2)))(φ(y_(l,1)) − φ(y_(l,2)))^⊤
= W_4 + W_3 + W_2 + W_1.
We now bound each term separately.
Bounding W_1. Since φ is L-Lipschitz, we have

W_1 ≼ (1/n) Σ_{l=1}^H L²(y_(l,1) − y_(l,2))² I,

tr W_1 = ‖W_1‖_* ≤ (1/n) Σ_{l=1}^H L²·diameter(y_1, …, y_n)·|y_(l,1) − y_(l,2)| ≤ (1/n) L²·diameter(y_1, …, y_n)²,

where the last step uses that the pairs (y_(2l−1), y_(2l)) are consecutive order statistics, so their gaps sum to at most the range.
The range cannot grow too much, i.e., it grows as log n. Indeed, assuming without loss of generality that E y = 0, we have that max{y_1, …, y_n} ≤ u/2 and min{y_1, …, y_n} ≥ −u/2 imply that the range is less than u, and thus P(diameter(y_1, …, y_n) ≥ u) ≤ P(max{y_1, …, y_n} ≥ u/2) + P(min{y_1, …, y_n} ≤ −u/2) ≤ nP(y ≥ u/2) + nP(y ≤ −u/2) ≤ 2n exp(−u²/8τ_y²) by using sub-Gaussianity. Then, by selecting u²/8τ_y² = log(2n) + log(8/δ), we get with probability greater than 1 − δ/8 that

diameter(y_1, …, y_n) ≤ 2√2 τ_y √(log(2n) + log(8/δ)).
Bounding W_2 and W_3. We also have

max{‖W_2‖_*, ‖W_3‖_*} ≤ (1/n) Σ_{l=1}^H L|y_(l,1) − y_(l,2)|·diameter(ε_1, …, ε_n) ≤ (1/n) L·diameter(y_1, …, y_n)·diameter(ε_1, …, ε_n).
As for W_1, the ranges cannot grow too much, i.e., they grow as log n. Similarly,

P(diameter((ε_1)_j, …, (ε_n)_j) ≥ u) ≤ nP((ε)_j ≥ u/2) + nP((ε)_j ≤ −u/2) ≤ 2n exp(−u²/8τ_ε²).

We thus get with probability greater than 1 − δ/(8d) that

max_{j∈{1,…,d}} diameter((ε_1)_j, …, (ε_n)_j) ≤ 2√2 τ_ε √(log(2n) + log(8d/δ)).

Thus, combining the two terms above, with probability greater than 1 − δ/4,

‖W_1‖_* + ‖W_2‖_* + ‖W_3‖_* ≤ [8L(Lτ_y² + 2τ_ετ_y√d)/n]·(log(2n) + log(8d) + log(1/δ)).

Note the term in √d, which corresponds to the definition of the diameter(ε_1, …, ε_n) in terms of the ℓ_2-norm.
Bounding W_4. We have:

W_4 = (1/n) Σ_{l=1}^H [ε_(l,1)ε_(l,1)^⊤ + ε_(l,2)ε_(l,2)^⊤ − ε_(l,2)ε_(l,1)^⊤ − ε_(l,1)ε_(l,2)^⊤] − E εε^⊤
= (1/n) Σ_{l=1}^H [−ε_(l,2)ε_(l,1)^⊤ − ε_(l,1)ε_(l,2)^⊤] + (1/n) Σ_{i=1}^n ε_iε_i^⊤ − E εε^⊤ = W_{4,1} + W_{4,2}.

For the second term W_{4,2} above, if we select any element indexed by (a, b), then it equals

(1/n) Σ_{i=1}^n (ε_i)_a(ε_i)_b − E ε_aε_b.
Using [Boucheron et al., 2013, Theorem 2.1], we get

E[(ε)_a(ε)_b]² ≤ √(E(ε)_a⁴ E(ε)_b⁴) ≤ 4(2τ_ε²)² = 16τ_ε⁴,

and

E[|(ε)_a(ε)_b|^q] ≤ √(E(ε)_a^{2q} E(ε)_b^{2q}) ≤ 2q!(2τ_ε²)^q = (q!/2)(2τ_ε²)^{q−2}·16τ_ε⁴.
We can then use Bernstein's inequality [Boucheron et al., 2013, Theorem 2.10] to get that, with probability less than e^{−t},

(1/n) Σ_{i=1}^n (ε_i)_a(ε_i)_b − E ε_aε_b ≥ 2τ_ε² t/n + √(32τ_ε⁴ t)/√n.
We can also get an upper bound for this quantity, using −ε_i instead of ε_i. Thus, with t = log(8d²/δ), we get that all d(d+1)/2 absolute deviations are less than

2τ_ε²·(log(8d²/δ)/n + √(2 log(8d²/δ)/n)),

with probability greater than 1 − δ/4. This implies that the nuclear norm of the second term is less than

2d√d τ_ε²·(log(8d²/δ)/n + √(2 log(8d²/δ)/n)),

because for any matrix K ∈ R^{d×d}: ‖K‖_* ≤ d√d‖K‖_∞, where ‖K‖_∞ = max_{a,b}|K_{ab}|.
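As an aside, the matrix-norm inequality used here (which follows from ‖K‖_* ≤ √d‖K‖_F ≤ d√d‖K‖_∞) is easy to sanity-check numerically:

```python
import numpy as np

# Sanity check of ||K||_* <= d*sqrt(d)*||K||_inf (entrywise max norm)
# on random matrices.
rng = np.random.default_rng(4)
d = 6
ok = True
for _ in range(100):
    K = rng.standard_normal((d, d))
    nuc = np.linalg.svd(K, compute_uv=False).sum()
    ok &= nuc <= d * np.sqrt(d) * np.abs(K).max() + 1e-12
print(ok)  # True
```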
For the first term, we consider Z = Σ_{l=1}^H (ε_(l,2))_a(ε_(l,1))_b, and consider conditioning on Y = (y_1, …, y_n). A key result from the theory of order statistics is that the n random variables ε_(l,2), ε_(l,1), l ∈ {1, …, n/2} are independent given Y [Yang, 1977]. This allows us to compute expectations.

Using Rosenthal's inequality [Boucheron et al., 2013, Theorem 15.11] conditioned on Y, for which we have E((ε_(l,2))_a(ε_(l,1))_b | Y) = 0, we get:

[E(|Z|^q | Y)]^{1/q} ≤ √(8q)·[Σ_l E((ε_(l,2))²_a(ε_(l,1))²_b | Y)]^{1/2} + 2q·[E max_l ((ε_(l,2))^q_a(ε_(l,1))^q_b) | Y]^{1/q}
≤ √(8q)·[Σ_l E((ε_(l,2))²_a(ε_(l,1))²_b | Y)]^{1/2} + 2q·[Σ_l E((ε_(l,2))^q_a(ε_(l,1))^q_b | Y)]^{1/q}.
By taking the q-th power, we get:

E(|Z|^q | Y) ≤ 2^{q−1}√(8q)^q·[Σ_l E((ε_(l,2))²_a(ε_(l,1))²_b | Y)]^{q/2} + 2^{q−1}q^q·2^q·Σ_l E((ε_(l,2))^q_a(ε_(l,1))^q_b | Y).
By now taking expectations with respect to Y, we get, using Jensen's inequality:

E|Z|^q ≤ 2^{q−1}√(8q)^q·E([Σ_l (ε_(l,2))²_a(ε_(l,1))²_b]^{q/2}) + 2^{q−1}q^q·2^q·Σ_l E[(ε_(l,2))^q_a(ε_(l,1))^q_b]
≤ 2^{q−1}√(8q)^q·E([(1/2) Σ_l ((ε_(l,2))⁴_a + (ε_(l,1))⁴_b)]^{q/2}) + 2^{q−1}q^q·2^q·(1/2) Σ_l E[(ε_(l,2))^{2q}_a + (ε_(l,1))^{2q}_b]
≤ 2^{q−1}√(8q)^q·E([Σ_i ((ε_i)⁴_a + (ε_i)⁴_b)]^{q/2}) + 2^{q−1}q^q·2^q·Σ_i E[(ε_i)^{2q}_a + (ε_i)^{2q}_b],
because summing over all order statistics is equivalent to summing over all elements. Thus, using the bound on the moments of (ε)²_a, we get:

E|Z|^q ≤ 2^{q−1}√(8q)^q·2^{q/2−1}·E([Σ_i (ε_i)⁴_a]^{q/2}) + 2^{q−1}√(8q)^q·2^{q/2−1}·E([Σ_i (ε_i)⁴_b]^{q/2}) + 2^{q−1}q^q·2^q·n·4q!(2τ_ε²)^q.
We can now use [Boucheron et al., 2013, Theorem 15.10] to get

[E([Σ_i (ε_i)⁴_a]^{q/2})]^{2/q} ≤ 2E[Σ_i (ε_i)⁴_a] + (q/2)·(E[max_i ((ε_i)⁴_a)^{q/2}])^{2/q}
≤ 2E[Σ_i (ε_i)⁴_a] + (q/2)·(E[Σ_i ((ε_i)⁴_a)^{q/2}])^{2/q}
≤ 2n×4(2τ_ε²)² + (q/2)·n^{2/q}·(E ε_a^{2q})^{2/q}
≤ (32n + n^{2/q}·2q·(2q!)^{2/q})·τ_ε⁴
≤ (32n + n^{2/q}q³)·τ_ε⁴.
Thus

E|Z|^q ≤ 2^q√(8q)^q·2^{q/2−1}·(32n + n^{2/q}q³)^{q/2}·τ_ε^{2q} + 2^{q−1}q^q·2^q·n·4q!(2τ_ε²)^q
≤ 2^{3q−1}q^{q/2}·2^{q/2−1}·((32n)^{q/2} + n·q^{3q/2})·τ_ε^{2q} + 2^{3q+1}·n·q^{2q}·τ_ε^{2q}
≤ (2^{6q−2}n^{q/2}q^{q/2} + n·q^{2q}[2^{7q/2−2} + 2^{3q+1}])·τ_ε^{2q}
≤ (64^q·n^{q/2}q^{q/2} + 19^q·n·q^{2q})·τ_ε^{2q}.

Thus

(E|Z|^q)^{1/q} ≤ (64·√q·n^{1/2} + 19·n^{1/q}·q²)·τ_ε².
Thus, for any δ ≤ 1/n, using Lemma 2.5 for the random variable Z/n with t = log(12d²/δ) ≥ (log n)/2, we obtain:

P[Z/n ≥ (192τ_ε²/√n)·√t + (57τ_ε²/n)·n^{1/t}t²] ≤ 3e^{−t}
⟹ P[Z/n ≥ (192τ_ε²/√n)·√(log(12d²/δ)) + (57τ_ε²/n)·n^{1/log(12d²/δ)}·log²(12d²/δ)] ≤ δ/(4d²)
⟹ P[Z/n ≥ (192τ_ε²/√n)·√(log(12d²/δ)) + (155τ_ε²/n)·log²(12d²/δ)] ≤ δ/(4d²).
Combining all the terms W_1, W_2, W_3, W_{4,1} and W_{4,2}, we get with probability not less than 1 − δ:

‖V̂_{1,E} − V_{1,E}‖_* ≤ (1/n)·[8L(Lτ_y² + 2√d τ_ετ_y)·log(16nd/δ) + 2d√d τ_ε²·log(8d²/δ) + 155τ_ε² d√d·log²(12d²/δ)]
+ (1/√n)·[2d√d τ_ε²·√(2 log(8d²/δ)) + 192τ_ε² d√d·√(log(12d²/δ))].
Rearranging terms and replacing δ by δ/2, with probability not less than 1 − δ/2:

‖V̂_{1,E} − V_{1,E}‖_* ≤ (195d√d τ_ε²/√n)·√(log(24d²/δ)) + [(8L²τ_y² + 16Lτ_ετ_y√d + 157τ_ε² d√d)/n]·log²(32d²n/δ).
Using the expression

V_{1,cov} + V_{1,E} = Cov[h(x)],

we can suggest the estimator V̂_{1,cov} = (1/n) Σ_{i=1}^n h(x_i)h(x_i)^⊤ − V̂_{1,E}.

Applying the triangle inequality:

‖V̂_{1,cov} − V_{1,cov}‖_* ≤ ‖V̂_{1,E} − V_{1,E}‖_* + ‖(1/n) Σ_{i=1}^n h(x_i)h(x_i)^⊤ − E(h(x)h(x)^⊤)‖_*.
To estimate the second term, we can use the same arguments as for bounding W_{4,2}: with probability greater than 1 − δ/2, ‖(1/n) Σ_{i=1}^n h(x_i)h(x_i)^⊤ − E(h(x)h(x)^⊤)‖_* is less than

2d√d τ_h²·(log(4d²/δ)/n + √(2 log(4d²/δ)/n)).

Finally, combining this bound with the bound for ‖V̂_{1,E} − V_{1,E}‖_*:
‖V̂_{1,cov} − V_{1,cov}‖_* ≤ (d√d(195τ_ε² + 2τ_h²)/√n)·√(log(24d²/δ)) + [(8L²τ_y² + 16Lτ_ετ_y√d + (157τ_ε² + 2τ_h²)d√d)/n]·log²(32d²n/δ).
2.7.3 Proof of Theorem 2.2
Proof. Using the triangle inequality, we get:

‖V_{1,cov}(η*) − V̂_{1,cov}(η̂)‖_* ≤ ‖V_{1,cov}(η*) − V̂_{1,cov}(η*)‖_* + ‖V̂_{1,cov}(η*) − V̂_{1,cov}(η̂)‖_* = F_1 + F_2.  (2.7.1)
Theorem 2.1 supplies us with a non-asymptotic analysis for the F_1 term: with probability not less than 1 − δ/2:
‖V_{1,cov}(η*) − V̂_{1,cov}(η*)‖_* ≤ (d√d(195τ_ε² + 2τ_h²)/√n)·√(log(48d²/δ)) + [(8L²τ_y² + 16Lτ_ετ_y√d + (157τ_ε² + 2τ_h²)d√d)/n]·log²(64d²n/δ).  (2.7.2)
For the second term, let us first analyse the norm ‖η* − η̂‖. For simplicity, introduce the notation: Ĉ = (1/n) Σ_{i=1}^n Ψ(x_i)Ψ(x_i)^⊤ ∈ R^{k×k}, C = E[Ψ(x)Ψ(x)^⊤] ∈ R^{k×k}, Φ̂ = (1/n) Σ_{i=1}^n (∇·Ψ)(x_i) ∈ R^k and Φ = E[(∇·Ψ)(x)] ∈ R^k.

Let us estimate ‖Ĉ − C‖_F and ‖Φ̂ − Φ‖. Introduce the notation:

Ĉ^m_{a,b} = (1/n) Σ_{i=1}^n Ψ^m_a(x_i)·Ψ^m_b(x_i) − E[Ψ^m_a(x)·Ψ^m_b(x)].
It can be shown, as in the bounding of the W_{4,2} term in the proof of Theorem 2.1, using [Boucheron et al., 2013, Theorem 2.1] and Bernstein's inequality [Boucheron et al., 2013, Theorem 2.10], that with probability less than e^{−t}:

Ĉ^m_{a,b} ≥ 2τ_Ψ² t/n + √(32τ_Ψ⁴ t)/√n.

Taking t = log(2k²d/δ), we show that:

P[‖Ĉ − C‖_F ≥ 2τ_Ψ²√(k³d)·(log(2k²d/δ)/n + √(2 log(2k²d/δ)/n))] < δ.
According to assumption (M2) and the Hoeffding bound:

P[|Φ̂_a − Φ_a| ≥ t/√n] ≤ e^{−t²/(2τ²_{∇Ψ})};

combining the inequalities for all k components of Φ, we have:

P[‖Φ̂ − Φ‖ ≥ τ_{∇Ψ}·(√k/√n)·√(log(k²/δ²))] ≤ δ.
Now, let us estimate ‖η* − η̂‖:

η* − η̂ = Ĉ^{−1}Φ̂ − C^{−1}Φ = Ĉ^{−1}(Φ̂ − Φ) + (Ĉ^{−1} − C^{−1})Φ
⟹ ‖η* − η̂‖ ≤ ‖Φ̂ − Φ‖/λ_min(Ĉ) + ‖Φ‖_2·‖Ĉ − C‖_op/(λ_min(Ĉ)·λ_min(C)).
For n large enough (for some constants C_1 and C_2 not depending on n and δ: n ≥ C_1 + C_2 log(1/δ)), ‖Ĉ − C‖ ≤ λ_min(C)/2 with probability more than 1 − δ/12; hence for such n, λ_min(Ĉ) ≥ λ_min(C)/2. Hence, combining the 3 estimates for ‖Φ̂ − Φ‖, ‖Ĉ − C‖ and λ_min(Ĉ), we obtain an estimate for ‖η* − η̂‖: with probability not less than 1 − δ/4:

‖η* − η̂‖ ≤ (2τ_{∇Ψ}√k/(√n·λ_min(C)))·√(2 log(12k/δ)) + (2‖Φ‖/λ²_min(C))·2τ_Ψ²√(k³d)·(log(24k²d/δ)/n + √(2 log(24k²d/δ)/n)).  (2.7.3)
Consider now the (a, b) element of V̂_{1,cov}(η*) − V̂_{1,cov}(η̂) (using (2.4.2)):

[V̂_{1,cov}(η*) − V̂_{1,cov}(η̂)]_{a,b} = (1/n) Σ_{i,j=1}^n [α_{i,j}/(|I∩(i, j)| − 1)] Σ_{α,β=1}^k Ψ(x_i)_{αa}Ψ(x_j)_{βb}·(η*_α η*_β − η̂_α η̂_β)
≤ (1/2n) Σ_{i,j=1}^n [α_{i,j}/(|I∩(i, j)| − 1)] Σ_{α,β=1}^k [(Ψ(x_i)_{αa})² + (Ψ(x_j)_{βb})²]·|η*_α η*_β − η̂_α η̂_β|
≤ Σ_{α,β=1}^k (1/2n)·[Σ_{i=1}^n (Ψ(x_i)_{αa})² + Σ_{j=1}^n (Ψ(x_j)_{βb})²]·|η*_α η*_β − η̂_α η̂_β|,

because every row of the binary matrix {α_{i,j}}_{i,j=1}^n has exactly |I∩(i, j)| − 1 non-zero elements.
Now, let us estimate the desired norm:

‖V̂_{1,cov}(η*) − V̂_{1,cov}(η̂)‖_* ≤ d·[Σ_{i=1}^n Σ_{m=1}^d Σ_{a=1}^k |Ψ^m_a(x_i)|²/n]·(2‖η*‖ + ‖η̂ − η*‖)·‖η̂ − η*‖.  (2.7.4)
Using again [Boucheron et al., 2013, Theorem 2.1] and Bernstein's inequality [Boucheron et al., 2013, Theorem 2.10], with probability not less than 1 − δ/8:

Σ_{i=1}^n Σ_{m=1}^d Σ_{a=1}^k |Ψ^m_a(x_i)|²/n ≤ E‖Ψ(x)‖_2² + 2kdτ_Ψ²·(log(16kd/δ)/n + √(2 log(16kd/δ)/n)).  (2.7.5)
For n large enough (for some constants C_1 and C_2 not depending on n and δ: n ≥ C_1 + C_2 log(1/δ)):

Σ_{i=1}^n Σ_{m=1}^d Σ_{a=1}^k |Ψ^m_a(x_i)|²/n ≤ (3/2)·E‖Ψ(x)‖_2²   with probability no less than 1 − δ/8,

and (2‖η*‖ + ‖η̂ − η*‖) ≤ 3‖η*‖ with probability no less than 1 − δ/8, so that

‖V̂_{1,cov}(η*) − V̂_{1,cov}(η̂)‖_* ≤ d·(9/2)·E‖Ψ(x)‖_2²·‖η*‖·[(2τ_{∇Ψ}√k/(√n·λ_min(C)))·√(2 log(12k/δ)) + (2‖Φ‖/λ²_min(C))·2τ_Ψ²√(k³d)·(log(24k²d/δ)/n + √(2 log(24k²d/δ)/n))]

with probability not less than 1 − δ/2. Combining and simplifying the obtained inequality and (2.7.2), we obtain the final inequality: with probability not less than 1 − δ:
‖V_{1,cov}(η*) − V̂_{1,cov}(η̂)‖_* ≤ (1/√n)·√(2 log(48d²k/δ)) × [d√d(195τ_ε² + 2τ_h²) + (9d/2)·E‖Ψ(x)‖_2²·‖η*‖·(2τ_{∇Ψ}√k/λ_0 + 4‖Φ‖τ_Ψ²√(k³d)/λ_0²)] + O(1/n).
Chapter 3
Constant step-size Stochastic
Gradient Descent for probabilistic
modeling
Abstract
Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step-sizes (learning rates) have been shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, they do not lead to convergent algorithms in general. In this chapter, we consider generalized linear models, that is, conditional models based on exponential families. We propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent. For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model. For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does. We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.
This chapter is based on the conference paper published at Uncertainty in Artificial Intelligence (UAI) 2018, which was accepted as an oral presentation [Babichev and Bach, 2018a].
3.1 Introduction
Faced with large amounts of data, efficient parameter estimation remains one of the key bottlenecks in the application of probabilistic models. Once cast as an optimization problem, for example through the maximum likelihood principle, difficulties may arise from the size of the model, the number of observations, or the potential non-convexity of the objective functions, and often all three [Koller and Friedman, 2009, Murphy, 2012].
In this chapter we focus primarily on situations where the number of observations is large; in this context, stochastic gradient descent (SGD) methods, which look at one sample at a time, are usually favored for their cheap iteration cost. However, finding the correct step-size (sometimes referred to as the learning rate) remains a practical and theoretical challenge, for probabilistic modeling but also in most other situations beyond maximum likelihood [Bottou et al., 2016].
In order to preserve convergence, the step-size γ_n at the n-th iteration typically has to decay with the number of gradient steps (here equal to the number of data points which are processed), typically as C/n^α for α ∈ [1/2, 1] [see, e.g., Bach and Moulines, 2011, Bottou et al., 2016]. However, this often leads to slow convergence, and the choice of α and C is difficult in practice. More recently, constant step-sizes have been advocated for their fast convergence towards a neighborhood of the optimal solution [Bach and Moulines, 2013], and they are standard practice in many areas [Goodfellow et al., 2016]. However, constant-step-size SGD is not convergent in general, and thus small step-sizes are still needed to converge to a decent estimator.
Constant step-sizes can however be made to converge in one situation. When the functions to optimize are quadratic, like for least-squares regression, using a constant step-size combined with an averaging of all estimators along the algorithm can be shown to converge to the global solution with the optimal convergence rates [Bach and Moulines, 2013, Dieuleveut and Bach, 2016].
The goal of this chapter is to explore the possibility of such global convergence with a constant step-size in the context of probabilistic modeling with exponential families, e.g., for logistic regression or Poisson regression [McCullagh, 1984]. This would lead to the possibility of using probabilistic models (thus with a principled quantification of uncertainty) with rapidly converging algorithms. Our main novel idea is to replace the averaging of the natural parameters of the exponential family by the averaging of the moment parameters, which can also be formulated as averaging predictions instead of estimators. For example, in the context of predicting binary outcomes in {0, 1} through a Bernoulli distribution, the moment parameter is the probability μ ∈ [0, 1] that the variable is equal to one, while the natural parameter is the "log-odds ratio" log(μ/(1 − μ)), which is unconstrained. This lack of constraint is often seen as a benefit for optimization; it turns out that for stochastic gradient methods, the moment parameter is better suited to averaging. Note that for least-squares, which corresponds to modeling with the Gaussian distribution with fixed variance, moment and natural parameters are equal, so it does not make a difference.
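The Bernoulli correspondence between the two parametrizations can be sketched as follows (the function names are illustrative):

```python
import math

# Bernoulli model: moment parameter mu in [0, 1] vs. unconstrained
# natural parameter theta = log(mu / (1 - mu)); the sigmoid inverts the map.
def natural_from_moment(mu):
    return math.log(mu / (1.0 - mu))

def moment_from_natural(theta):
    return 1.0 / (1.0 + math.exp(-theta))

mu = 0.8
theta = natural_from_moment(mu)          # log-odds ratio, ~1.386
print(moment_from_natural(theta))        # back to ~0.8
```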
More precisely, our main contributions are:
— For generalized linear models, we propose in Section 3.4 averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent.
— For finite-dimensional models, we show in Section 3.5 that this can sometimes (and surprisingly) lead to better predictions than the best linear model.
— For infinite-dimensional models, we show in Section 3.6 that it always converges to optimal predictions, while averaging natural parameters never does.
— We illustrate our findings in Section 3.7 with simulations on synthetic data and classical benchmarks with many observations.
Figure 3-1 – Convergence of the iterates θ_n and the averaged iterates θ̄_n to the mean θ̄_γ under the stationary distribution π_γ: θ_n − θ̄_γ = O_p(γ^{1/2}), θ̄_n − θ̄_γ = O_p(n^{−1/2}) and θ* − θ̄_γ = O(γ).
3.2 Constant step size stochastic gradient descent
In this section, we present the main intuitions behind stochastic gradient descent (SGD) with a constant step-size. For more details, see Dieuleveut et al. [2017]. We consider a real-valued function F defined on the Euclidean space R^d (this can be generalized to any Hilbert space, as done in Section 3.6 when considering Gaussian processes and positive-definite kernels), and a sequence of random functions (f_n)_{n≥1} which are independent and identically distributed and such that E f_n(θ) = F(θ) for all θ ∈ R^d. Typically, F will be the expected negative log-likelihood on unseen data, while f_n will be the negative log-likelihood for a single observation. Since we require independent random functions, we assume that we make a single pass over the data, and thus the number of iterations is equal to the number of observations.
Starting from an initial θ_0 ∈ R^d, SGD performs the following recursion, from n = 1 to the total number of observations:
θ_n = θ_{n−1} − γ_n ∇f_n(θ_{n−1}).  (3.2.1)
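Recursion (3.2.1) can be written in a few lines; as a minimal smoke test (not one of the chapter's experiments), feeding it identical quadratic gradients reduces it to plain gradient descent.

```python
# Minimal implementation of the SGD recursion (3.2.1),
#   theta_n = theta_{n-1} - gamma * grad f_n(theta_{n-1}),
# written for scalar theta; `grads` is any stream of gradient functions.
def sgd(theta0, grads, gamma):
    theta = theta0
    history = [theta]
    for g in grads:
        theta = theta - gamma * g(theta)
        history.append(theta)
    return history

# Smoke test on identical quadratics f_n(t) = (t - 3)^2 / 2: with a
# constant step-size this is plain gradient descent and converges to 3.
hist = sgd(0.0, [lambda t: t - 3.0 for _ in range(200)], gamma=0.1)
print(round(hist[-1], 6))  # 3.0
```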
Since the functions f_n are independent, the iterates (θ_n)_n form a Markov chain. When the step-size γ_n is constant (equal to γ) and the functions f_n are identically distributed, the Markov chain is homogeneous. Thus, under additional assumptions [see, e.g., Dieuleveut et al., 2017, Meyn and Tweedie, 1993], it converges in distribution to a stationary distribution, which we refer to as π_γ. These additional assumptions include that γ is not too large (otherwise the algorithm diverges); in the traditional analysis of step-sizes for gradient descent techniques, we analyze the situation of small γ's (and thus perform asymptotic expansions around γ = 0).
The distribution π_γ is in general not equal to a Dirac mass, and thus constant-step-size SGD is not convergent. However, averaging along the path of the Markov chain has some interesting properties. Indeed, several versions of the "ergodic theorem" [see, e.g., Meyn and Tweedie, 1993] show that for functions φ from R^d to any vector space, the empirical average (1/n) Σ_{i=1}^n φ(θ_i) converges in probability to the expectation ∫φ(θ)dπ_γ(θ) of φ under the stationary distribution π_γ. This convergence can also be quantified by a central limit theorem, with an error which tends to a normal distribution with variance equal to a constant times 1/n.
Thus, if we denote θ̄_n = (1/(n+1)) Σ_{i=0}^n θ_i, applying the previous result to the identity function φ, we immediately obtain that θ̄_n converges to θ̄_γ = ∫θ dπ_γ(θ), with a squared error converging in O(1/n). The key question is the relationship between θ̄_γ and the global optimizer θ* of F, as this characterizes the performance of the algorithm with an infinite number of observations.
By taking expectations in Eq. (3.2.1), and taking the limit as n tends to infinity, we obtain that

∫∇F(θ)dπ_γ(θ) = 0,  (3.2.2)

that is, under the stationary distribution π_γ, the average gradient is zero. When the gradient is a linear function (like for a quadratic objective F), this leads to

∇F(∫θ dπ_γ(θ)) = ∇F(θ̄_γ) = 0,

and thus θ̄_γ is a stationary point of F (and hence the global minimizer if F is convex). However, this is not true in general.
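This quadratic case is easy to simulate; the toy stream below (a least-squares objective on Gaussian data, my choice of example) shows the averaged iterate landing on θ* even with a large constant step-size, while the last iterates keep oscillating.

```python
import random

# Quadratic objective: stream f_n(t) = (x_n - t)^2 / 2 with x_n ~ N(2, 1),
# so theta* = 2 and the gradient is linear in theta. With a constant
# step-size, the iterates oscillate but their average targets theta*.
random.seed(0)
gamma, theta, n = 0.5, 0.0, 200_000
iterates = []
for _ in range(n):
    x = random.gauss(2.0, 1.0)
    theta -= gamma * (theta - x)     # SGD step on f_n
    iterates.append(theta)

theta_bar = sum(iterates) / n
tail = iterates[-1000:]
spread = max(tail) - min(tail)
print(abs(theta_bar - 2.0) < 0.05, spread > 0.5)  # True True
```

Here the stationary distribution has standard deviation of order √γ, which is why the tail of the chain keeps a visible spread while the average concentrates at rate 1/√n.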
As shown by Dieuleveut et al. [2017], the deviation θ̄_γ − θ* is of order γ, which is an improvement on the non-averaged recursion, which is at average distance O(γ^{1/2}) (see an illustration in Figure 3-1); thus, small or decaying step-sizes are needed. In this chapter, we explore alternatives which do not average the estimators θ_1, …, θ_n, and rely instead on the specific structure of our cost functions, namely negative log-likelihoods.
3.3 Warm-up: exponential families
In order to highlight the benefits of averaging moment parameters, we first consider unconditional exponential families. We thus consider the standard exponential family
$$p(x|\theta) = h(x) \exp\big(\theta^\top T(x) - A(\theta)\big),$$
where $h(x)$ is the base measure, $T(x) \in \mathbb{R}^d$ is the sufficient statistics and $A$ the log-partition function. The function $A$ is always convex [see, e.g., Koller and Friedman, 2009, Murphy, 2012]. Note that we do not assume that the data distribution $p(x)$ comes from this exponential family. The expected (with respect to the input distribution $p(x)$) negative log-likelihood is equal to
$$F(\theta) = -\mathbb{E}_{p(x)} \log p(x|\theta) = A(\theta) - \theta^\top \mathbb{E}_{p(x)} T(x) - \mathbb{E}_{p(x)} \log h(x).$$
It is known to be minimized by $\theta_*$ such that $\nabla A(\theta_*) = \mathbb{E}_{p(x)} T(x)$. Given i.i.d. data $(x_i)_{i \geq 1}$ sampled from $p(x)$, the SGD recursion from Eq. (3.2.1) becomes:
$$\theta_n = \theta_{n-1} - \gamma \big[ \nabla A(\theta_{n-1}) - T(x_n) \big],$$
while the stationarity equation in Eq. (3.2.2) becomes $\int \big[ \nabla A(\theta) - \mathbb{E}_{p(x)} T(x) \big]\, d\pi_\gamma(\theta) = 0$, which leads to
$$\int \nabla A(\theta)\, d\pi_\gamma(\theta) = \mathbb{E}_{p(x)} T(x) = \nabla A(\theta_*).$$
Thus, averaging $\nabla A(\theta_n)$ will converge to $\nabla A(\theta_*)$, while averaging $\theta_n$ will not converge to $\theta_*$. This simple observation is the basis of our work.
Note that in this context of unconditional models, a simpler estimator exists: we can simply compute the empirical average $\frac{1}{n}\sum_{i=1}^n T(x_i)$, which converges to $\nabla A(\theta_*)$. Nevertheless, this shows that averaging moment parameters $\nabla A(\theta)$ rather than natural parameters $\theta$ can bring convergence benefits. We now turn to conditional models, for which no closed-form solutions exist.
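To make this observation concrete, the following sketch (my own illustration, not code from the thesis) runs constant step-size SGD on a Bernoulli exponential family, where $A(\theta) = \log(1+e^\theta)$ and $T(x) = x$, so that $\nabla A(\theta) = \sigma(\theta)$ is the moment parameter. The time-average of the moment parameters $\sigma(\theta_i)$ recovers $\mathbb{E}\, T(x)$, while the natural-parameter average carries an $O(\gamma)$ bias.

```python
import numpy as np

# Constant step-size SGD on a Bernoulli family p(x|theta) = exp(theta*x - A(theta)),
# A(theta) = log(1 + e^theta).  Data: x ~ Bernoulli(0.3) (need not come from the model).
rng = np.random.default_rng(0)
p_true, gamma, n = 0.3, 0.5, 100_000
x = (rng.random(n) < p_true).astype(float)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0
thetas = np.empty(n)
for i in range(n):
    theta -= gamma * (sigmoid(theta) - x[i])   # SGD step: grad A(theta) - T(x_n)
    thetas[i] = theta

# Averaging moment parameters: converges to grad A(theta*) = E[T(x)] = p_true.
moment_avg = sigmoid(thetas).mean()
# Averaging natural parameters first, then mapping: biased for constant gamma.
natural_avg = sigmoid(thetas.mean())
print(moment_avg, natural_avg)
```

With a fixed seed, `moment_avg` lands close to the data mean 0.3, as predicted by the stationarity equation above.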
3.4 Conditional exponential families
Now we consider the conditional exponential family
$$p(y|x, \theta) = h(y) \exp\big(y \cdot \eta_\theta(x) - a(\eta_\theta(x))\big).$$
For simplicity we consider only one-dimensional families where $y \in \mathbb{R}$, but our framework would also extend to more complex models such as conditional random fields [Lafferty et al., 2001]. We will also assume that $h(y) = 1$ for all $y$, to avoid carrying constant terms in log-likelihoods. We consider functions of the form $\eta_\theta(x) = \theta^\top \Phi(x)$, which are linear in a feature vector $\Phi(x)$, where $\Phi : \mathcal{X} \to \mathbb{R}^d$ can be defined on an arbitrary input set $\mathcal{X}$. Calculating the negative log-likelihood, one obtains:
$$f_n(\theta) = -\log p(y_n|x_n, \theta) = -y_n \Phi(x_n)^\top \theta + a\big(\Phi(x_n)^\top \theta\big),$$
and, for any distribution $p(x, y)$ (for which $p(y|x)$ may not be a member of the conditional exponential family),
$$F(\theta) = \mathbb{E}_{p(x_n, y_n)} f_n(\theta) = \mathbb{E}_{p(x_n, y_n)} \big[ -y_n \Phi(x_n)^\top \theta + a\big(\Phi(x_n)^\top \theta\big) \big].$$
The goal of estimation in such generalized linear models is to find an unknown parameter $\theta$ given $n$ observations $(x_i, y_i)_{i=1,\ldots,n}$:
$$\theta_* = \arg\min_{\theta \in \mathbb{R}^d} F(\theta). \qquad (3.4.1)$$
3.4.1 From estimators to prediction functions
Another point of view is to consider that an estimator $\theta \in \mathbb{R}^d$ in fact defines a function $\eta : \mathcal{X} \to \mathbb{R}$, whose value is a natural parameter for the exponential family $p(y) = \exp(\eta y - a(\eta))$. This particular choice of function $\eta_\theta$ is linear in $\Phi(x)$, and we have, by decomposing the joint probability $p(x_n, y_n)$ in two (and dropping the dependence on $n$ since we have assumed i.i.d. data):
$$F(\theta) = \mathbb{E}_{p(x)} \Big( \mathbb{E}_{p(y|x)} \big[ -y \Phi(x)^\top \theta + a(\Phi(x)^\top \theta) \big] \Big) = \mathbb{E}_{p(x)} \Big( -\mathbb{E}_{p(y|x)} y \cdot \Phi(x)^\top \theta + a(\Phi(x)^\top \theta) \Big) = \mathcal{F}(\eta_\theta),$$
where $\mathcal{F}(\eta) = \mathbb{E}_{p(x)} \big( -\mathbb{E}_{p(y|x)} y \cdot \eta(x) + a(\eta(x)) \big)$ is the performance measure defined for a function $\eta : \mathcal{X} \to \mathbb{R}$. By definition, $F(\theta) = \mathcal{F}(\eta_\theta) = \mathcal{F}(\theta^\top \Phi(\cdot))$.

However, the global minimizer of $\mathcal{F}(\eta)$ over all functions $\eta : \mathcal{X} \to \mathbb{R}$ may not be attained at a function linear in $\Phi(x)$ (this can only be the case if the linear model is well-specified or if the feature vector $\Phi(x)$ is flexible enough). Indeed, the global minimizer of $\mathcal{F}$ is the function $\eta_{**} : x \mapsto (a')^{-1}\big(\mathbb{E}_{p(y|x)} y\big)$: starting from $\mathcal{F}(\eta) = \int \big[ a(\eta(x)) - \mathbb{E}_{p(y|x)} y \cdot \eta(x) \big] p(x)\, dx$ and minimizing pointwise in $\eta(x)$, the optimality condition is $\big[ a'(\eta(x)) - \mathbb{E}_{p(y|x)} y \big] p(x) = 0$, and finally $\eta(x) = (a')^{-1}\big(\mathbb{E}_{p(y|x)} y\big)$. This minimizer is typically not a linear function of $\Phi(x)$ (note here that $p(y|x)$ is the conditional data-generating distribution).
The function $\eta$ corresponds to the natural parameter of the exponential family, and it is often more intuitive to consider the moment parameter, that is, to define functions $\mu : \mathcal{X} \to \mathbb{R}$ that now correspond to moments of the outputs $y$; we will refer to them as prediction functions. Going from natural to moment parameter is known to be done through the gradient of the log-partition function, and we thus consider, for $\eta$ a function from $\mathcal{X}$ to $\mathbb{R}$, $\mu(\cdot) = a'(\eta(\cdot))$; this leads to the performance measure
$$\mathcal{G}(\mu) = \mathcal{F}\big( (a')^{-1}(\mu(\cdot)) \big).$$
Note now that the global minimum of $\mathcal{G}$ is reached at
$$\mu_{**}(x) = \mathbb{E}_{p(y|x)}\, y.$$
We also introduce the prediction function $\mu_*(x)$ corresponding to the best $\eta$ which is linear in $\Phi(x)$, that is:
$$\mu_*(x) = a'\big( \theta_*^\top \Phi(x) \big).$$
We say that the model is well-specified when $\mu_* = \mu_{**}$; for these models, $\inf_\theta F(\theta) = \inf_\mu \mathcal{G}(\mu)$. However, in general, we only have $\inf_\theta F(\theta) \geq \inf_\mu \mathcal{G}(\mu)$, and (very often) the inequality is strict (see examples in our simulations).

Figure 3-2 – Graphical representation of the reparametrization: first we expand the class of functions, replacing the parameter $\theta$ with the function $\eta(\cdot) = \Phi(\cdot)^\top \theta$; then we do one more reparametrization, $\mu(\cdot) = a'(\eta(\cdot))$. The best linear prediction $\mu_*$ is constructed using $\theta_*$, and the global minimizer of $\mathcal{G}$ is $\mu_{**}$. The model is well-specified if and only if $\mu_* = \mu_{**}$.
To make the further developments more concrete, we now present two classical examples: logistic regression and Poisson regression.

Logistic regression. A special case of the conditional family is logistic regression, where $y \in \{0, 1\}$, $a(t) = \log(1 + e^t)$ and $a'(t) = \sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function; the probability mass function is given by $p(y|\eta) = \exp\big(\eta y - \log(1 + e^\eta)\big)$.

Poisson regression. Another special case is Poisson regression, with $y \in \mathbb{N}$ and $a(t) = \exp(t)$, where the response variable $y$ has a Poisson distribution. The probability mass function is given by $p(y|\eta) = \exp\big(\eta y - e^\eta - \log(y!)\big)$. Poisson regression may be appropriate when the dependent variable is a count, for example in genomics, network packet analysis, crime rate analysis, fluorescence microscopy, etc. [Hilbe, 2011].
3.4.2 Averaging predictions
Recall from Section 3.2 that $\pi_\gamma$ is the stationary distribution of $\theta$. Taking expectations of both sides of Eq. (3.2.1), and using the fact that $\pi_\gamma$ is the limiting distribution of both $\theta_n$ and $\theta_{n-1}$, we get
$$\mathbb{E}_{\pi_\gamma(\theta_n)} \theta_n = \mathbb{E}_{\pi_\gamma(\theta_{n-1})} \theta_{n-1} - \gamma\, \mathbb{E}_{\pi_\gamma(\theta_{n-1})} \mathbb{E}_{p(x_n, y_n)} f'_n(\theta_{n-1}),$$
which leads to $\mathbb{E}_{\pi_\gamma(\theta)} \mathbb{E}_{p(x_n, y_n)} \nabla f_n(\theta) = 0$, that is, now removing the dependence on $n$ (the data $(x, y)$ are i.i.d.):
$$\mathbb{E}_{\pi_\gamma(\theta)} \mathbb{E}_{p(x, y)} \big[ -y \Phi(x) + a'\big(\Phi(x)^\top \theta\big) \Phi(x) \big] = 0,$$
which finally leads to
$$\mathbb{E}_{p(x)} \big[ \mathbb{E}_{\pi_\gamma(\theta)} a'\big(\Phi(x)^\top \theta\big) - \mu_{**}(x) \big] \Phi(x) = 0. \qquad (3.4.2)$$
This is the core equation our method relies on. It does not imply that $M(x) = \mathbb{E}_{\pi_\gamma(\theta)} a'\big(\Phi(x)^\top \theta\big) - \mu_{**}(x)$ is uniformly equal to zero (which is what we want), but only that $\mathbb{E}_{p(x)} \Phi(x) M(x) = 0$, i.e., that $M(x)$ is uncorrelated with $\Phi(x)$. If the feature vector $\Phi(x)$ is "large enough", then this is equivalent to $M = 0$.¹ For example, when $\Phi(x)$ is composed of an orthonormal basis of the space of integrable functions (like for kernels in Section 3.6), then this is exactly true. Thus, in this situation,
$$\mu_{**}(x) = \mathbb{E}_{\pi_\gamma(\theta)} a'\big(\Phi(x)^\top \theta\big), \qquad (3.4.3)$$
and averaging the predictions $a'\big(\Phi(x)^\top \theta_i\big)$ along the path $(\theta_i)$ of the Markov chain should converge exactly to the optimal prediction. This exact convergence is weaker (it requires high-dimensional features) than for the unconditional family in Section 3.3, but it can still bring surprising benefits even when $\Phi$ is not large enough, as we present in Section 3.5 and Section 3.6.
3.4.3 Two types of averaging
Now we can introduce two possible ways to estimate the prediction function $\mu(x)$.
Averaging estimators. The first one is the usual way: we first estimate the parameter $\theta$, using Ruppert-Polyak averaging [Polyak and Juditsky, 1992], $\bar\theta_n = \frac{1}{n+1}\sum_{i=0}^n \theta_i$, and then we denote by
$$\hat\mu(x) = a'\big( \Phi(x)^\top \bar\theta_n \big) = a'\Big( \Phi(x)^\top \tfrac{1}{n+1} \textstyle\sum_{i=0}^n \theta_i \Big)$$
the corresponding prediction. As discussed in Section 3.2, it converges to $\hat\mu_\gamma : x \mapsto a'\big(\Phi(x)^\top \bar\theta_\gamma\big)$, which is in general not equal to $a'\big(\Phi(x)^\top \theta_*\big)$, where $\theta_*$ is the optimal parameter in $\mathbb{R}^d$. Since, as presented at the end of Section 3.2, $\bar\theta_\gamma - \theta_*$ is of order $O(\gamma)$, $F(\bar\theta_\gamma) - F(\theta_*)$ is of order $O(\gamma^2)$ (because $\nabla F(\theta_*) = 0$), and thus an error of $O(\gamma^2)$ is added to the usual convergence rates in $O(1/n)$.
Note that we are limited here to prediction functions that correspond to linear functions of $\Phi(x)$ in the natural parameterization, and thus $F(\theta_*) \geq \mathcal{G}(\mu_{**})$; the inequality is often strict.

1. Let $\Phi(x) = (\varphi_1(x), \ldots, \varphi_d(x))^\top$ be an orthogonal basis and $M(x) = \sum_{i=1}^d m_i \varphi_i(x) + \varepsilon(x)$, where $\varepsilon(x)$ is small if the basis is big enough. Then $\mathbb{E}_{p(x)} \Phi(x) M(x) = 0$ implies $\mathbb{E}\, \varphi_j(x) \big[ \sum_{i=1}^d m_i \varphi_i(x) + \varepsilon(x) \big] = 0$ for every $j$, and due to the orthogonality of the basis and the smallness of $\varepsilon(x)$, $m_j \cdot \mathbb{E}_{p(x)} \varphi_j^2(x) \approx 0$, hence $m_j \approx 0$ and thus $M(x) \approx 0$.
Averaging predictions. We propose the new estimator
$$\bar\mu_n(x) = \frac{1}{n+1} \sum_{i=0}^n a'\big( \theta_i^\top \Phi(x) \big).$$
In general, $\mathcal{G}(\bar\mu_n) - \mathcal{G}(\mu_{**})$ does not converge to zero either (unless the feature vector $\Phi$ is large enough and Eq. (3.4.3) is satisfied). Thus, on top of the usual convergence in $O(1/n)$ with respect to the number of iterations, we have an extra term that depends only on $\gamma$, which we will study in Section 3.5 and Section 3.6.
We denote by $\bar\mu_\gamma(x)$ the limit when $n \to \infty$, that is, using properties of converging Markov chains, $\bar\mu_\gamma(x) = \mathbb{E}_{\pi_\gamma(\theta)} a'\big(\Phi(x)^\top \theta\big)$. Rewriting Eq. (3.4.2) using this new notation, we get:
$$\mathbb{E}\big[ \big( \mu_{**}(x) - \bar\mu_\gamma(x) \big) \Phi(x) \big] = 0.$$
When $\Phi : \mathcal{X} \to \mathbb{R}^d$ is high-dimensional, this leads to $\mu_{**} = \bar\mu_\gamma$; in contrast to $\hat\mu_\gamma$, averaging predictions thus potentially converges to the optimal prediction.
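As an illustration of the two estimators (a sketch under assumed settings, not the thesis code), the snippet below runs constant step-size SGD for logistic regression on misspecified data with $\eta_{**}(x) = \sin x_1 + \sin x_2$, and compares the prediction from the averaged parameters with the averaged predictions on held-out points.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

d, n, gamma = 2, 20_000, 0.5
X = rng.standard_normal((n, d))
# Misspecified data: P(y=1|x) = sigmoid(sin x1 + sin x2) is not linear in x.
y = (rng.random(n) < sigmoid(np.sin(X[:, 0]) + np.sin(X[:, 1]))).astype(float)

Xt = rng.standard_normal((500, d))                       # held-out points
mu_star2 = sigmoid(np.sin(Xt[:, 0]) + np.sin(Xt[:, 1]))  # mu**(x) on test set

theta = np.zeros(d)
theta_sum = np.zeros(d)        # running sum for averaging estimators
pred_sum = np.zeros(len(Xt))   # running sum for averaging predictions
for i in range(n):
    theta -= gamma * (sigmoid(X[i] @ theta) - y[i]) * X[i]   # SGD step
    theta_sum += theta
    pred_sum += sigmoid(Xt @ theta)

mu_from_params = sigmoid(Xt @ (theta_sum / n))  # a'(Phi(x)^T theta_bar)
mu_from_preds = pred_sum / n                    # (1/n) sum_i a'(Phi(x)^T theta_i)

err_params = np.mean((mu_from_params - mu_star2) ** 2)
err_preds = np.mean((mu_from_preds - mu_star2) ** 2)
print(err_params, err_preds)
```

Both estimators produce valid probabilities; the experiments of Section 3.7 study which of the two errors is smaller in which regime.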
Graphical representation. We propose a schematic graphical representation of averaging estimators and averaging predictions in Figure 3-3.
Computational complexity. The usual averaging of estimators [Polyak and Juditsky, 1992] to compute
$$\hat\mu(x) = a'\big( \Phi(x)^\top \bar\theta_n \big)$$
is simple to implement, as we can update the average $\bar\theta_n$ with essentially no extra cost on top of the complexity $O(dn)$ of the SGD recursion. Given the number $n$ of training data points and the number $m$ of testing data points, the overall complexity is $O(d(n + m))$.
Averaging prediction functions is more challenging, as we have to store all iterates $\theta_i$, $i = 1, \ldots, n$, and for each testing point $x$ compute
$$\bar\mu_n(x) = \frac{1}{n+1} \sum_{i=0}^n a'\big( \theta_i^\top \Phi(x) \big).$$
Thus the overall complexity is $O(dn + dnm)$, which could be too costly with many test points (i.e., $m$ large).
There are several ways to alleviate this extra cost: (a) using sketching techniques [Woodruff, 2014]; (b) using summary statistics, as done in applications of MCMC [Gilks et al., 1995]; or (c) leveraging the fact that all iterates $\theta_i$ will end up being close to $\bar\theta_\gamma$, and using a Taylor expansion of $a'\big(\theta^\top \Phi(x)\big)$ around $\bar\theta_\gamma$. This expansion is equal to:
$$a'\big(\Phi(x)^\top \bar\theta_\gamma\big) + (\theta - \bar\theta_\gamma)^\top \Phi(x) \cdot a''\big(\Phi(x)^\top \bar\theta_\gamma\big) + \frac{1}{2} \big( (\theta - \bar\theta_\gamma)^\top \Phi(x) \big)^2 \cdot a'''\big(\Phi(x)^\top \bar\theta_\gamma\big) + O\big( \|\theta - \bar\theta_\gamma\|^3 \big).$$

Figure 3-3 – Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators in red vs. averaging predictions in green. $\mu_*$ is the optimal linear predictor and $\mu_{**}$ is the global optimum.
Taking expectations on both sides leads to:
$$\bar\mu_\gamma(x) \approx \hat\mu_\gamma(x) + \frac{1}{2} \Phi(x)^\top \mathrm{cov}(\theta)\, \Phi(x) \cdot a'''\big( \bar\theta_\gamma^\top \Phi(x) \big),$$
where $\mathrm{cov}(\theta)$ is the covariance matrix of $\theta$ under $\pi_\gamma$. This provides a simple correction to $\hat\mu_\gamma$, and leads to an approximation of $\bar\mu_n(x)$ as
$$\hat\mu(x) + \frac{1}{2} \Phi(x)^\top \mathrm{cov}_n(\theta)\, \Phi(x) \cdot a'''\big( \bar\theta_n^\top \Phi(x) \big),$$
where $\mathrm{cov}_n(\theta)$ is the empirical covariance matrix of the iterates $(\theta_i)$. The computational complexity now becomes $O(nd^2 + md^2)$, which is an improvement when the number $m$ of testing points is large. In all of our experiments, we used this approximation.
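A minimal sketch of this Taylor-expansion approximation (my own illustration, assuming a logistic model so that $a'(t) = \sigma(t)$ and $a'''(t) = \sigma(t)(1-\sigma(t))(1-2\sigma(t))$): only the mean and the empirical covariance of the iterates are needed, giving $O(d^2)$ work per test point instead of storing all iterates.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def averaged_prediction_taylor(theta_bar, cov_n, Phi):
    """a'(Phi^T theta_bar) + 0.5 * Phi^T cov_n Phi * a'''(Phi^T theta_bar)."""
    s = sigmoid(Phi @ theta_bar)
    a3 = s * (1.0 - s) * (1.0 - 2.0 * s)              # a''' for a(t)=log(1+e^t)
    quad = np.einsum("nd,de,ne->n", Phi, cov_n, Phi)  # Phi(x)^T cov Phi(x)
    return s + 0.5 * quad * a3

# Synthetic "iterates" concentrated around a mean, mimicking the SGD chain.
rng = np.random.default_rng(2)
thetas = rng.standard_normal((1000, 3)) * 0.1 + np.array([1.0, -0.5, 0.2])
Phi = rng.standard_normal((50, 3))                    # 50 test feature vectors

exact = sigmoid(Phi @ thetas.T).mean(axis=1)          # exact averaged predictions
approx = averaged_prediction_taylor(thetas.mean(0), np.cov(thetas.T), Phi)
print(np.max(np.abs(exact - approx)))
```

For iterates spread on the scale used here, the correction matches the exact average of the predictions to a few thousandths.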
3.5 Finite-dimensional models
In this section we study the behavior of $\bar A(\gamma) = \mathcal{G}(\bar\mu_\gamma) - \mathcal{G}(\mu_*)$ for finite-dimensional models, for which it is usually not equal to zero. We know that our estimators $\bar\mu_n$ will converge to $\bar\mu_\gamma$, and our goal is to compare this quantity to $\hat A(\gamma) = \mathcal{G}(\hat\mu_\gamma) - \mathcal{G}(\mu_*) = F(\bar\theta_\gamma) - F(\theta_*)$, which is what averaging estimators tends to. For completeness, we also consider the non-averaged performance $A(\gamma) = \mathbb{E}_{\pi_\gamma(\theta)} \big[ F(\theta) - F(\theta_*) \big]$.
Note that $A(\gamma)$ and $\hat A(\gamma)$ must be non-negative, because we compare the negative log-likelihood performances to that of the best linear prediction (in the natural parameter), while $\bar A(\gamma)$ could potentially be negative (and it will be in certain situations), because the corresponding natural parameters may not be linear in $\Phi(x)$.
We consider the same standard assumptions as Dieuleveut et al. [2017], namely smoothness of the negative log-likelihoods $f_n(\theta)$ and strong convexity of the expected negative log-likelihood $F(\theta)$. We first recall the results from Dieuleveut et al. [2017]; see detailed explicit formulas in the supplementary material.
3.5.1 Earlier work
Without averaging. We have that $A(\gamma) = \gamma B + O(\gamma^{3/2})$, that is, $A$ is linear in $\gamma$, with $B$ non-negative.
Averaging estimators. We have that $\hat A(\gamma) = \gamma^2 \hat B + O(\gamma^{5/2})$, that is, $\hat A$ is quadratic in $\gamma$, with $\hat B$ non-negative. Averaging does provably bring some benefits, because the order in $\gamma$ is higher (we assume $\gamma$ small).
3.5.2 Averaging predictions
We are now ready to analyze the behavior of our new framework of averaging predictions. The following result is shown in the supplementary material.
Proposition 3.1. Under the assumptions on the negative log-likelihoods $f_n$ of each observation from Dieuleveut et al. [2017]:
– In the case of well-specified data, that is, when there exists $\theta_*$ such that $p(y|x) = p(y|x, \theta_*)$ for all $(x, y)$, we have $\bar A(\gamma) \sim \gamma^2 \bar B_{\mathrm{well}}$, where $\bar B_{\mathrm{well}}$ is a positive constant.
– In the general case of potentially mis-specified data, $\bar A(\gamma) = \gamma \bar B_{\mathrm{mis}} + O(\gamma^2)$, where $\bar B_{\mathrm{mis}}$ is a constant which may be positive or negative.
Note that, in contrast to averaging parameters, the constant $\bar B_{\mathrm{mis}}$ can be negative. This means that we can obtain an estimator better than the optimal linear estimator, which is the limit of what averaging parameters can achieve. In our simulations, we show examples for which $\bar B_{\mathrm{mis}}$ is positive, and examples for which it is negative. Thus, in general, for low-dimensional models, averaging predictions can be worse or better than averaging parameters. However, as we show in the next section, for infinite-dimensional models, we always get convergence.
Figure 3-4 – Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators in red vs. averaging predictions in green. The global optimizer coincides with the best linear predictor: $\mu_* = \mu_{**}$.
3.6 Infinite-dimensional models
Recall that we have the following objective function to minimize:
$$F(\theta) = \mathbb{E}_{x, y} \big[ -y \cdot \eta_\theta(x) + a\big( \eta_\theta(x) \big) \big], \qquad (3.6.1)$$
where up to this point we considered unknown functions $\eta_\theta(x)$ which were linear in $\Phi(x)$ with $\Phi(x) \in \mathbb{R}^d$, leading to a complexity in $O(dn)$.
We now consider infinite-dimensional features, by considering that $\Phi(x) \in \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space. Note that this corresponds to modeling the function $\eta_\theta$ as a Gaussian process [Rasmussen and Williams, 2006].
This is computationally feasible through the usual "kernel trick" [Scholkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004], where we assume that the kernel function $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ is easy to compute. Indeed, following Bordes et al. [2005] and Dieuleveut and Bach [2016], starting from $\theta_0$, each iterate of constant-step-size SGD is of the form $\theta_n = \sum_{t=1}^n \alpha_t \Phi(x_t)$, and the gradient descent recursion $\theta_n = \theta_{n-1} - \gamma \big[ a'\big( \eta_{\theta_{n-1}}(x_n) \big) - y_n \big] \Phi(x_n)$ leads to the following recursion on the $\alpha_t$'s:
$$\alpha_n = -\gamma \Big[ a'\Big( \sum_{t=1}^{n-1} \alpha_t \langle \Phi(x_t), \Phi(x_n) \rangle \Big) - y_n \Big] = -\gamma \Big[ a'\Big( \sum_{t=1}^{n-1} \alpha_t\, k(x_t, x_n) \Big) - y_n \Big].$$
This leads to $\eta_{\theta_n}(x) = \langle \Phi(x), \theta_n \rangle$ and $\mu_{\theta_n}(x) = a'\big( \eta_{\theta_n}(x) \big)$, with
$$\eta_{\theta_n}(x) = \sum_{t=1}^n \alpha_t \langle \Phi(x), \Phi(x_t) \rangle = \sum_{t=1}^n \alpha_t\, k(x, x_t),$$
and finally we can express $\bar\mu_n(x)$ in kernel form as:
$$\bar\mu_n(x) = \frac{1}{n+1} \sum_{t=0}^n a'\Big[ \sum_{s=1}^t \alpha_s \cdot k(x, x_s) \Big].$$
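The recursion above can be sketched as follows (my own illustration with a Gaussian kernel and synthetic one-dimensional data; the thesis experiments use a Laplacian kernel). The natural parameter at a test point is updated online as each $\alpha_n$ is produced, so averaging predictions costs $O(1)$ extra per iteration and per test point.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
kernel = lambda A, b: np.exp(-np.sum((A - b) ** 2, axis=1))  # k(., b), rows of A

n, gamma = 500, 0.25
X = rng.standard_normal((n, 1))
y = (rng.random(n) < sigmoid(np.sin(2 * X[:, 0]))).astype(float)

alpha = np.zeros(n)
x_test = np.array([0.3])
k_test = kernel(X, x_test)           # k(x_test, x_t) for all training points
pred_sum, eta_test = 0.0, 0.0
for i in range(n):
    # eta_{theta_{i-1}}(x_i) = sum_{t<i} alpha_t k(x_t, x_i)
    eta_i = alpha[:i] @ kernel(X[:i], X[i]) if i > 0 else 0.0
    alpha[i] = -gamma * (sigmoid(eta_i) - y[i])   # new coefficient alpha_n
    eta_test += alpha[i] * k_test[i]              # eta_{theta_i}(x_test), online
    pred_sum += sigmoid(eta_test)                 # accumulate a'(eta) for averaging

mu_bar = pred_sum / n
print(mu_bar)
```

The per-iteration cost is dominated by the $O(n)$ kernel evaluations, giving the $O(n^2)$ total cost that motivates the column-sampling approximation below.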
There is also a straightforward estimator for averaging parameters, i.e., $\hat\mu(x) = a'\Big( \frac{1}{n+1} \sum_{t=0}^n \sum_{s=1}^t \alpha_s\, k(x, x_s) \Big)$. If we assume that the kernel function is universal, that is, that $\mathcal{H}$ is dense in the space of squared integrable functions, then it is known that if $\mathbb{E}_x M(x) \Phi(x) = 0$, then $M = 0$ [Sriperumbudur et al., 2008]. This implies that we must have $\bar\mu_\gamma = \mu_{**}$, and thus averaging predictions does always converge to the global optimum (note that in this setting we must have a well-specified model, because we are in a non-parametric setting).
Graphical representation. We propose a schematic graphical representation of averaging estimators and averaging predictions for the Hilbert-space setup in Figure 3-4.
Column sampling. Because of the usual high running-time complexity of kernel methods in $O(n^2)$, we consider a "column-sampling approximation" [Williams and Seeger, 2001]. We thus choose a small subset $I = (x_1, \ldots, x_m)$ of samples and construct a new finite $m$-dimensional feature map $\Phi(x) = K(I, I)^{-1/2} K(I, x) \in \mathbb{R}^m$, where $K(I, I)$ is the $m \times m$ kernel matrix of the $m$ points and $K(I, x)$ is the vector composed of the kernel evaluations $k(x_i, x)$. This allows a running-time complexity in $O(m^2 n)$. In theory and in practice, $m$ can be chosen small [Bach, 2013, Rudi et al., 2017].
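A sketch of this feature map (my own illustration, with a Gaussian kernel; any positive-definite kernel works the same way): on the sampled points themselves, the approximation $\Phi(x)^\top \Phi(x') \approx k(x, x')$ is exact up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_kernel(A, B):
    # k(s, t) = exp(-||s - t||^2), for all pairs of rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

X = rng.standard_normal((300, 2))
I = X[:20]                                    # m = 20 sampled points
w, V = np.linalg.eigh(gaussian_kernel(I, I))  # K(I, I) is symmetric PSD
K_inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T

def features(x):
    # Phi(x) = K(I, I)^{-1/2} K(I, x) in R^20
    return K_inv_sqrt @ gaussian_kernel(I, x[None, :])[:, 0]

# On sampled points, the approximate kernel matches the exact kernel.
approx = features(I[3]) @ features(I[7])
exact = gaussian_kernel(I[3:4], I[7:8])[0, 0]
print(approx, exact)
```

With these explicit $m$-dimensional features, the finite-dimensional SGD recursions of the previous sections apply unchanged.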
Regularized learning with kernels. Although we can use an unregularized recursion with good convergence properties [Dieuleveut and Bach, 2016], adding a regularisation by the squared Hilbertian norm is easier to analyze and more stable with limited amounts of data. We thus consider the recursion (in the Hilbert space), with $\lambda$ small:
$$\theta_n = \theta_{n-1} - \gamma \big[ f'_n(\theta_{n-1}) + \lambda \theta_{n-1} \big] = \theta_{n-1} + \gamma \big( y_n - a'(\langle \Phi(x_n), \theta_{n-1} \rangle) \big) \Phi(x_n) - \gamma \lambda \theta_{n-1}.$$
This recursion can also be computed efficiently as above, using the kernel trick and column-sampling approximations.
In terms of convergence, the best we can hope for is to converge to the minimizer $\theta_{*,\lambda}$ of the regularized expected negative log-likelihood $F(\theta) + \frac{\lambda}{2} \|\theta\|^2$ (which we assume to exist). When $\lambda$ tends to zero, $\theta_{*,\lambda}$ converges to $\theta_*$. Averaging parameters will tend to a limit $\bar\theta_{\gamma,\lambda}$ which is $O(\gamma)$-close to $\theta_{*,\lambda}$, thus leading to predictions which deviate from the optimal predictions for two reasons: because of the regularization and because of the constant step-size. Since $\lambda$ should decrease as we get more data, the first effect will vanish, while the second will not.
When averaging predictions, the two effects will vanish as $\lambda$ tends to zero. Indeed, by taking limits of the gradient equation, and denoting by $\bar\mu_{\gamma,\lambda}$ the limit of $\bar\mu_n$, we have
$$\mathbb{E}\big[ \big( \mu_{**}(x) - \bar\mu_{\gamma,\lambda}(x) \big) \Phi(x) \big] = \lambda \bar\theta_{\gamma,\lambda}. \qquad (3.6.2)$$
Given that $\bar\theta_{\gamma,\lambda}$ is $O(\gamma)$-away from $\theta_{*,\lambda}$, if we assume that $\theta_*$ corresponds to a sufficiently regular² element of the Hilbert space $\mathcal{H}$, then the $L_2$-norm of the deviation satisfies $\| \mu_{**} - \bar\mu_{\gamma,\lambda} \| = O(\lambda)$, and thus, as the regularization parameter $\lambda$ tends to zero, our predictions tend to the optimal one.
3.7 Experiments
In this section, we compare the two types of averaging (estimators and predic-tions) on a variety of problems, both on synthetic data and on standard benchmarks.When averaging predictions, we always consider the Taylor expansion approximationpresented at the end of Section 3.4.3.
3.7.1 Synthetic data
Finite-dimensional models. We consider the following logistic regression model:
$$p(y|x, \theta) = \exp\big( y \cdot \eta_\theta(x) - a(\eta_\theta(x)) \big),$$
where we consider a linear model $\eta_\theta(x) = \theta^\top x$ in $x$ (i.e., $\Phi(x) = x$), the link function $a(t) = \log(1 + e^t)$, and $a'(t) = \sigma(t)$ the sigmoid function. Let $x$ be distributed as a standard normal random variable in dimension $d = 2$, $y \in \{0, 1\}$ and $\mathbb{P}(y = 1|x) = \mu_{**}(x) = \sigma\big( \eta_{**}(x) \big)$, where we consider two different settings:
– Model 1: $\eta_{**}(x) = \sin x_1 + \sin x_2$,
– Model 2: $\eta_{**}(x) = x_1^3 + x_2^3$.
The global minimum $\mathcal{F}_{**}$ of the corresponding optimization problem can be found as
$$\mathcal{F}_{**} = \mathbb{E}_{p(x)} \big[ -\mu_{**}(x) \cdot \eta_{**}(x) + a\big( \eta_{**}(x) \big) \big].$$
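For Model 1, $\mathcal{F}_{**}$ can be estimated by straightforward Monte Carlo (a sketch; the value is simply the expected binary entropy of $\mu_{**}(x)$ under the input distribution):

```python
import numpy as np

# Monte Carlo estimate of F** for the assumed Model 1 setup:
# eta**(x) = sin x1 + sin x2, x standard normal in dimension 2, logistic link.
rng = np.random.default_rng(5)
X = rng.standard_normal((200_000, 2))
eta = np.sin(X[:, 0]) + np.sin(X[:, 1])
mu = 1.0 / (1.0 + np.exp(-eta))                 # mu**(x) = sigmoid(eta**(x))
f_star_star = np.mean(-mu * eta + np.log1p(np.exp(eta)))
print(f_star_star)
```

Since $|\eta_{**}| \le 2$ here, the integrand stays between the entropies of $\sigma(\pm 2)$ and $\sigma(0)$, so the estimate lands a bit below $\log 2 \approx 0.693$.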
We also introduce the performance measure
$$\mathcal{F}(\eta) = \mathbb{E}_{p(x)} \big[ -\mu_{**}(x) \cdot \eta(x) + a\big( \eta(x) \big) \big], \qquad (3.7.1)$$

2. While our reasoning is informal here, it can be made more precise by considering the so-called "source conditions" commonly used in the analysis of kernel methods [Caponnetto and De Vito, 2007]; this is out of the scope of this chapter.
which can be evaluated directly in the case of synthetic data. Note that, in our situation, the model is misspecified because $\eta_{**}(x)$ is not linear in $\Phi(x) = x$; thus $\inf_\theta F(\theta) > \mathcal{F}_{**}$, and our performance measures $\mathcal{F}(\eta_n) - \mathcal{F}_{**}$ for the various estimators $\eta_n$ will not converge to zero.

The results, averaged over 10 replications, are shown in Fig. 3-5 and Fig. 3-6. We first observe that constant step-size SGD without averaging leads to a bad performance. Moreover, in the first case (Fig. 3-5), averaging predictions beats averaging parameters, and even beats the best linear model: if we use the best linear error $\mathcal{F}_*$ instead of $\mathcal{F}_{**}$, at some point $\mathcal{F}(\eta_n) - \mathcal{F}_*$ becomes negative. However, in the second case (Fig. 3-6), averaging predictions is not superior to averaging parameters. Moreover, by looking at the final differences between performances for different values of $\gamma$, we can see that the final performance depends linearly on $\gamma$ for averaging predictions, instead of $\gamma^2$ for averaging parameters (as suggested by our theoretical results in Section 3.5). In particular, in Fig. 3-5 we can observe the surprising behavior that a larger step-size leads to a better performance (note that we cannot increase it too much, otherwise the algorithm would diverge).
[Figure 3-5: plot.] Figure 3-5 – Synthetic data for the linear model $\eta_\theta(x) = \theta^\top x$ and $\eta_{**}(x) = \sin x_1 + \sin x_2$. Excess prediction performance vs. number of iterations (both in log-scale); curves shown for averaging predictions and averaging parameters with $\gamma \in \{1/2, 1/4\}$, averaging parameters with decaying $\gamma_n$, and no averaging with $\gamma \in \{1/2, 1/4\}$.
[Figure 3-6: plot.] Figure 3-6 – Synthetic data for the linear model $\eta_\theta(x) = \theta^\top x$ and $\eta_{**}(x) = x_1^3 + x_2^3$. Excess prediction performance vs. number of iterations (both in log-scale); same curves as in Figure 3-5.
Infinite-dimensional models. Here we consider the kernel setup described in Section 3.6. We consider Laplacian kernels $k(s, t) = \exp\big( -\|s - t\|_1 / \sigma \big)$ with $\sigma = 50$, dimension $d = 5$, and generating log-odds ratio $\eta_{**}(x) = \frac{5}{5 + x^\top x}$. We also use a squared-norm regularization with several values of $\lambda$, and column sampling with $m = 100$ points. We use the exact value of $\mathcal{F}_{**}$, which we can compute directly for synthetic data. The results are shown in Fig. 3-7, where averaging predictions leads to a better performance than averaging estimators.
[Figure 3-7: plot.] Figure 3-7 – Synthetic data for the Laplacian kernel with $\eta_{**}(x) = \frac{5}{5 + x^\top x}$. Excess prediction performance vs. number of iterations (both in log-scale); averaging predictions and averaging parameters with $\lambda \in \{10^{-3.0}, 10^{-3.5}\}$, and no averaging with $\lambda = 10^{-3.0}$.
3.7.2 Real data
Note that in the case of real data one does not have access to the unknown $\mu_{**}(x)$, so computing the performance measure in Eq. (3.7.1) is not possible. Instead, we use its sampled version on held-out data:
$$\mathcal{F}(\mu) = -\sum_{i : y_i = 1} \log\big( \mu(x_i) \big) - \sum_{i : y_i = 0} \log\big( 1 - \mu(x_i) \big).$$
We use two datasets, with $d$ not too large and $n$ large, from [Lichman, 2013]: the "MiniBooNE particle identification" dataset ($d = 50$, $n = 130\,064$) and the "Covertype" dataset ($d = 54$, $n = 581\,012$).
We use two different approaches for each of them: a linear model $\eta_\theta(x) = \theta^\top x$, and a kernel approach with the Laplacian kernel $k(s, t) = \exp\big( -\|s - t\|_1 / \sigma \big)$, where $\sigma = d$.
The results are shown in Figures 3-8 to 3-11. Note that for linear models we use $\mathcal{F}_*$, the estimate of the best performance among linear models (learned on the test set, and hence not reachable by learning on the training data), while for kernels we use $\mathcal{F}_{**}$ (defined like $\mathcal{F}_*$ but with the kernelized model); that is why the graphs are not comparable (but, as shown below, the value of $\mathcal{F}_{**}$ is lower than the value of $\mathcal{F}_*$, because using kernels corresponds to a larger feature space).

For the "MiniBooNE particle identification" dataset, $\mathcal{F}_* = 0.35$ and $\mathcal{F}_{**} = 0.21$; for the "Covertype" dataset, $\mathcal{F}_* = 0.46$ and $\mathcal{F}_{**} = 0.39$. We can see from the four plots that, especially in the kernel setting, averaging predictions shows better performance than averaging parameters.
[Figure 3-8: plot.] Figure 3-8 – MiniBooNE dataset, dimension $d = 50$, linear model. Excess prediction performance vs. number of iterations (both in log-scale); averaging parameters and averaging predictions with $\gamma = 2^{-8}$, and averaging parameters with decaying step-size ($\gamma_0 = 2^{-2}$).
[Figure 3-9: plot.] Figure 3-9 – MiniBooNE dataset, dimension $d = 50$, kernel approach, column sampling $m = 200$. Excess prediction performance vs. number of iterations (both in log-scale); averaging parameters and averaging predictions with $\gamma \in \{2^2, 2^3\}$, and averaging parameters with decaying step-size ($\gamma_0 \in \{2^1, 2^2\}$).
3.8 Conclusion
In this chapter, we have explored how averaging procedures in stochastic gradient descent, which are crucial for fast convergence, can be improved by looking at the specifics of probabilistic modeling: averaging in the moment parameterization can have better properties than averaging in the natural parameterization.

While we have provided some theoretical arguments (an asymptotic expansion in the finite-dimensional case, and convergence to optimal predictions in the infinite-dimensional case), a detailed theoretical analysis with explicit convergence rates would provide a better understanding of the benefits of averaging predictions.
[Figure 3-10: plot.] Figure 3-10 – CoverType dataset, dimension $d = 54$, linear model. Excess prediction performance vs. number of iterations (both in log-scale); averaging parameters and averaging predictions with $\gamma = 2^{-4}$, and averaging parameters with decaying step-size ($\gamma_0 = 2^0$).
[Figure 3-11: plot.] Figure 3-11 – CoverType dataset, dimension $d = 54$, kernel approach, column sampling $m = 200$. Excess prediction performance vs. number of iterations (both in log-scale); averaging parameters and averaging predictions with $\gamma = 2^1$, and averaging parameters with decaying step-size ($\gamma_0 = 2^3$).
3.9 Appendix. Explicit form of $B$, $\hat B$, $\bar B_{\mathrm{well}}$ and $\bar B_{\mathrm{mis}}$
In this appendix we provide explicit expressions for the asymptotic expansions from the chapter. All assumptions from Dieuleveut et al. [2017] are reused, namely, beyond the usual sampling assumptions, smoothness of the cost functions.
We have, even for mis-specified models:
$$\mathcal{F}(\eta) - \mathcal{F}(\eta_*) = \mathcal{G}(\mu) - \mathcal{G}(\mu_*)$$
$$= \mathbb{E}\big[ -\mathbb{E}_{p(y|x)} y \cdot \eta(x) + a(\eta(x)) + \mathbb{E}_{p(y|x)} y \cdot \eta_*(x) - a(\eta_*(x)) \big]$$
$$= \mathbb{E}\big[ a(\eta(x)) - a(\eta_*(x)) - a'(\eta_*(x)) \big( \eta(x) - \eta_*(x) \big) \big] + \mathbb{E}\big[ \big( a'(\eta_*(x)) - \mathbb{E}_{p(y|x)} y \big) \cdot \big( \eta(x) - \eta_*(x) \big) \big]$$
$$= \mathbb{E}\big[ D_a\big( \eta(x) \,\|\, \eta_*(x) \big) \big] + \mathbb{E}\big[ \big( \mu_*(x) - \mu_{**}(x) \big) \cdot \big( \eta(x) - \eta_*(x) \big) \big]$$
$$= \mathbb{E}\big[ D_{a^*}\big( \mu_*(x) \,\|\, \mu(x) \big) \big] + \mathbb{E}\big[ \big( \mu_*(x) - \mu_{**}(x) \big) \cdot \eta(x) \big], \qquad (3.9.1)$$
for $D_a$ the Bregman divergence associated with $a$, and $D_{a^*}$ the one associated with the Fenchel conjugate $a^*$. In the last step we used the optimality condition for the predictor $\eta_*(x)$, namely $\mathbb{E}\big[ \big( a'(\eta_*(x)) - \mathbb{E}_{p(y|x)} y \big) \Phi(x) \big] = 0$, which implies $\mathbb{E}\big[ \big( \mu_*(x) - \mu_{**}(x) \big) \eta_*(x) \big] = 0$ since $\eta_*$ is linear in $\Phi(x)$.
When the model is well-specified, we have $a'(\eta_*(x_n)) = \mathbb{E}(y_n|x_n)$, and thus $F(\theta) - F(\theta_*) = \mathbb{E}\big[ D_{a^*}\big( \mu_*(x_n) \,\|\, \mu(x_n) \big) \big]$. If $\eta$ is linear in $\Phi(x)$, then, even if the model is mis-specified, we also have $F(\theta) - F(\theta_*) = \mathbb{E}\big[ D_{a^*}\big( \mu_*(x_n) \,\|\, \mu(x_n) \big) \big]$.
Using asymptotic expansions of moments of the averaged SGD iterates with zero-mean statistically independent noise $f_n(\theta) = F(\theta) + \varepsilon_n(\theta)$, from Dieuleveut and Bach [2016, Theorem 2] one obtains:
$$\bar\theta_\gamma = \mathbb{E}_{\pi_\gamma}(\theta) = \theta_* + \gamma \Delta + O(\gamma^2), \qquad (3.9.2)$$
$$\mathbb{E}_{\pi_\gamma} (\theta - \theta_*)(\theta - \theta_*)^\top = \gamma C + O(\gamma^2), \qquad (3.9.3)$$
where
$$C = \big[ F''(\theta_*) \otimes I + I \otimes F''(\theta_*) \big]^{-1} \Sigma \quad \text{and} \quad \Sigma = \int_{\mathbb{R}^d} \varepsilon(\theta)^{\otimes 2}\, d\pi_\gamma(\theta) \in \mathbb{R}^{d \times d}.$$
The "drift" $\bar\theta_\gamma - \theta_*$ is linear in $\gamma$ and can be interpreted as an additional error due to the function not being quadratic and the step-sizes not decaying to zero.
The connection between $\Delta$ and $C$ can easily be obtained from $\theta_n = \theta_{n-1} - \gamma \big[ F'(\theta_{n-1}) + \varepsilon_n \big]$. Taking expectations of both sides and using a Taylor expansion up to second order:
$$F''(\theta_*)\big( \bar\theta_\gamma - \theta_* \big) = -\frac{1}{2} F'''(\theta_*)\, \mathbb{E}_{\pi_\gamma} (\theta - \theta_*)^{\otimes 2} \;\Longrightarrow\; F''(\theta_*) \Delta = -\frac{1}{2} F'''(\theta_*) C. \qquad (3.9.4)$$
3.9.1 Estimation without averaging

We start with the simplest estimator of the prediction function, $\mu_{\theta_n}(x) = a'\big( \Phi(x)^\top \theta_n \big)$, where we do not use any averaging:
$$\mathcal{G}(\mu_{\theta_n}) - \mathcal{G}(\mu_*) = F(\theta_n) - F(\theta_*) = F'(\theta_*)(\theta_n - \theta_*) + \frac{1}{2} F''(\theta_*)(\theta_n - \theta_*)^{\otimes 2} + \frac{1}{6} F'''(\theta_*)(\theta_n - \theta_*)^{\otimes 3} + O(\gamma^{3/2}).$$
Taking expectations of both sides, letting $n \to \infty$ and using Eq. (3.9.3), one obtains:
$$A(\gamma) = \mathbb{E}_{\pi_\gamma} F(\theta) - F(\theta_*) = \frac{1}{2} \mathrm{tr}\big[ F''(\theta_*)\, \gamma C \big] + O(\gamma^{3/2}).$$
So we have a linear dependence on $\gamma$, with $B = \frac{1}{2} \mathrm{tr}\big[ F''(\theta_*)\, C \big]$.
3.9.2 Estimation with averaging parameters

Now, let us estimate $\hat A(\gamma)$:
$$\mathcal{G}(\hat\mu) - \mathcal{G}(\mu_*) = F(\bar\theta_n) - F(\theta_*) = F'(\theta_*)(\bar\theta_n - \theta_*) + \frac{1}{2} (\bar\theta_n - \theta_*)^\top F''(\theta_*) (\bar\theta_n - \theta_*) + O(\gamma^3).$$
Taking expectations of both sides, as $n \to \infty$:
$$\mathcal{G}(\hat\mu_\gamma) - \mathcal{G}(\mu_*) = F(\bar\theta_\gamma) - F(\theta_*) = \frac{1}{2} \mathrm{tr}\big[ F''(\theta_*) (\bar\theta_\gamma - \theta_*)^{\otimes 2} \big] + O(\gamma^3) = \frac{1}{2} \mathrm{tr}\big[ F''(\theta_*)\, \gamma^2 \Delta^{\otimes 2} \big] + O(\gamma^3).$$
Finally, we have a quadratic dependence on $\gamma$:
$$\hat A(\gamma) = \frac{1}{2} \mathrm{tr}\big[ F''(\theta_*)\, \gamma^2 \Delta^{\otimes 2} \big] + O(\gamma^3),$$
and the coefficient is $\hat B = \frac{1}{2} \mathrm{tr}\big[ F''(\theta_*) \Delta^{\otimes 2} \big]$.
3.9.3 Estimation with averaging predictions

Recall that, by definition, $\bar A(\gamma) = \mathcal{G}(\bar\mu_\gamma) - \mathcal{G}(\mu_*)$, where $\bar\mu_\gamma(x) = \mathbb{E}_{\pi_\gamma} a'\big( \theta^\top \Phi(x) \big)$. We again use a Taylor expansion of $a'\big( \theta^\top \Phi(x) \big)$ at $\theta_*$:
$$a'\big( \theta^\top \Phi(x) \big) = a'\big( \theta_*^\top \Phi(x) \big) + a''\big( \theta_*^\top \Phi(x) \big) (\theta - \theta_*)^\top \Phi(x) + \frac{1}{2} a'''\big( \theta_*^\top \Phi(x) \big) \cdot \big( (\theta - \theta_*)^\top \Phi(x) \big)^2 + O(\gamma^{3/2}).$$
Taking expectations of both sides:
$$\bar\mu_\gamma(x) = \mu_*(x) + a''\big( \theta_*^\top \Phi(x) \big) (\bar\theta_\gamma - \theta_*)^\top \Phi(x) + \frac{1}{2} a'''\big( \theta_*^\top \Phi(x) \big)\, \mathrm{tr}\big[ \Phi(x) \Phi(x)^\top \mathbb{E} (\theta - \theta_*)^{\otimes 2} \big] + O(\gamma^{3/2})$$
$$= \mu_*(x) + a''\big( \theta_*^\top \Phi(x) \big)\, \gamma \Delta^\top \Phi(x) + \frac{1}{2} a'''\big( \theta_*^\top \Phi(x) \big)\, \mathrm{tr}\big[ \Phi(x) \Phi(x)^\top \gamma C \big] + O(\gamma^{3/2}).$$
Finally, we have shown that:
$$\bar\mu_\gamma(x) - \mu_*(x) = \gamma \Big[ a''\big( \eta_*(x) \big) \Delta^\top \Phi(x) + \frac{1}{2} a'''\big( \eta_*(x) \big)\, \mathrm{tr}\big[ \Phi(x)^{\otimes 2} C \big] \Big] + O(\gamma^{3/2}).$$
Now we use the Bregman divergence decomposition of Eq. (3.9.1):
$$\bar A(\gamma) = \mathcal{G}(\bar\mu_\gamma) - \mathcal{G}(\mu_*) = \mathcal{G}_1 + \mathcal{G}_2.$$
As mentioned above, the term $\mathcal{G}_2$ vanishes if the model is well-specified or if $\eta$ is linear in $\Phi(x)$; note that in the case of $\hat A(\gamma)$, the natural parameter is indeed linear in $\Phi(x)$.
Estimation of $\mathcal{G}_1$.

By definition, $D_{a^*}\big( \mu_*(x) \,\|\, \mu(x) \big) = \frac{1}{2} \big( \mu_*(x) - \mu(x) \big) (a^*)''\big( \mu_*(x) \big) \big( \mu_*(x) - \mu(x) \big) + \ldots$, and, since $(a^*)''(\mu_*(x)) = 1 / a''(\eta_*(x))$,
$$\mathcal{G}_1 = \frac{1}{2} \mathbb{E} \frac{ \big( \mu_*(x) - \bar\mu_\gamma(x) \big)^2 }{ a''\big( \theta_*^\top \Phi(x) \big) } = \frac{\gamma^2}{2} \mathbb{E} \Bigg[ \frac{ \Big( a''\big( \eta_*(x) \big) \Delta^\top \Phi(x) + \frac{1}{2} a'''\big( \eta_*(x) \big)\, \mathrm{tr}\big[ \Phi(x)^{\otimes 2} C \big] \Big)^2 }{ a''\big( \theta_*^\top \Phi(x) \big) } \Bigg].$$
Since
$$\mathbb{E}_x \Big[ a''\big( \eta_*(x) \big) \big( \Delta^\top \Phi(x) \big)^2 \Big] = \Delta^\top F''(\theta_*) \Delta$$
and
$$\mathbb{E}_x \Big[ \Delta^\top a'''\big( \eta_*(x) \big) \Phi(x)^{\otimes 3} C \Big] = \Delta^\top F'''(\theta_*) C = -2 \Delta^\top F''(\theta_*) \Delta$$
(using Eq. (3.9.4)), we get
$$\mathcal{G}_1 = \gamma^2 \Bigg[ -\frac{1}{2} \Delta^\top F''(\theta_*) \Delta + \frac{1}{8} \mathbb{E} \bigg[ \frac{ a'''\big( \eta_*(x) \big)^2 }{ a''\big( \eta_*(x) \big) } \cdot \Big( \mathrm{tr}\big[ \Phi(x)^{\otimes 2} C \big] \Big)^2 \bigg] \Bigg] + O(\gamma^3),$$
and the coefficient is $\bar B_{\mathrm{well}} = -\frac{1}{2} \Delta^\top F''(\theta_*) \Delta + \frac{1}{8} \mathbb{E} \Big[ \frac{ a'''( \eta_*(x) )^2 }{ a''( \eta_*(x) ) } \cdot \big( \mathrm{tr}\big[ \Phi(x)^{\otimes 2} C \big] \big)^2 \Big]$.
Estimation of $\mathcal{G}_2$.

We have
$$\mathcal{G}_2 = \mathbb{E}\big[ \big( \mu_*(x) - \mu_{**}(x) \big) \cdot \big( \bar\eta(x) - \eta_*(x) \big) \big],$$
and, using properties of conjugate functions,
$$\mathcal{G}_2 = \mathbb{E}\big[ \big( (a^*)'(\bar\mu(x)) - (a^*)'(\mu_*(x)) \big) \cdot \big( \mu_*(x) - \mu_{**}(x) \big) \big]$$
$$= \mathbb{E}\big[ (a^*)''\big( \mu_*(x) \big) \big( \bar\mu(x) - \mu_*(x) \big) \cdot \big( \mu_*(x) - \mu_{**}(x) \big) \big] + O(\gamma^2)$$
$$= \mathbb{E}\, \frac{ \bar\mu(x) - \mu_*(x) }{ a''\big( \eta_*(x) \big) } \cdot \big( \mu_*(x) - \mu_{**}(x) \big) + O(\gamma^2)$$
$$= \gamma \cdot \mathbb{E} \bigg[ \Big( \Delta^\top \Phi(x) + \frac{ a'''\big( \eta_*(x) \big) }{ 2\, a''\big( \eta_*(x) \big) }\, \mathrm{tr}\big[ \Phi(x)^{\otimes 2} C \big] \Big) \cdot \big( \mu_*(x) - \mu_{**}(x) \big) \bigg] + O(\gamma^2).$$
And the coefficient is $\bar B_{\mathrm{mis}} = \mathbb{E} \Big[ \Big( \Delta^\top \Phi(x) + \frac{ a'''( \eta_*(x) ) }{ 2\, a''( \eta_*(x) ) }\, \mathrm{tr}\big[ \Phi(x)^{\otimes 2} C \big] \Big) \cdot \big( \mu_*(x) - \mu_{**}(x) \big) \Big]$.
Chapter 4
Sublinear Primal-Dual Algorithms for
Large-Scale Multiclass Classification
Abstract
In this chapter, we study multiclass classification problems with a potentially large number of classes $k$, number of features $d$, and sample size $n$. Our goal is to develop efficient algorithms to train linear classifiers with losses of a certain type, allowing us to address, in particular, multiclass support vector machines (SVM) and softmax regression. The developed algorithms are based on first-order proximal primal-dual schemes, Mirror Descent and Mirror Prox. In particular, for the multiclass SVM with elementwise $\ell_1$-regularization, we propose a sublinear algorithm whose numerical complexity per iteration is only $O(k + d + n)$. This result relies on the combination of three ideas: (i) passing to a saddle-point problem with a quasi-bilinear objective; (ii) an ad-hoc variance reduction technique based on non-uniform sampling of matrix multiplications; (iii) a proper choice of the proximal setup in Mirror Descent type schemes, leading to a balance between the stochastic and deterministic error terms.
The material presented in this chapter is joint work with Dmitrii M. Ostrovskii and Francis Bach. It is submitted to the ICML 2019 conference.
4.1 Introduction
We focus on efficient training of multiclass linear classifiers, potentially with very large numbers of classes and dimensionality of the feature space, and with losses of a certain type. Formally, consider a dataset comprised of 𝑛 pairs (𝑥𝑖, 𝑦𝑖), 𝑖 = 1, ..., 𝑛, where 𝑥𝑖 ∈ R𝑑 is the feature vector of the 𝑖-th training example, and 𝑦𝑖 ∈ {𝑒1, ..., 𝑒𝑘} is the label vector encoding one of the 𝑘 possible classes, 𝑒1, ..., 𝑒𝑘 ∈ R𝑘 being the standard basis vectors (we assume there are no ambiguities in the class assignment). Given such data, we would like to find a linear classifier that minimizes the regularized
empirical risk. This can be formalized as the following minimization problem:
min๐โR๐ร๐
1
๐
๐โ๐=1
โ(๐โค๐ฅ๐, ๐ฆ๐) + ๐โ๐โU . (4.1.1)
Here, ๐ โ R๐ร๐ is the matrix whose columns specify the parameter vectors for eachof the ๐ classes; โ(๐โค๐ฅ, ๐ฆ), with โ : R๐ ร R๐ โ R, is the loss corresponding to themargins ๐โค๐ฅ โ R๐ assigned to ๐ฅ when its true class happens to be ๐ฆ; finally, ๐โ๐โU ,with ๐ > 0, is the regularization term corresponding to some norm โ ยท โU on R๐ร๐
which will be specified later on (cf. Section 4.2). For now, we only assume that โยทโU isโsimpleโ โ intuitively, one can think of โ๐โU as being quasi-separable in the elementsof ๐ , for example, belong to the class of row-wise mixed โ๐รโ๐-norms, cf. Section 4.2.
Our focus is on the so-called Fenchel-Young losses [Blondel et al., 2018], which can be expressed in the form
$$\ell(\theta^\top x, y) = \max_{v\in\triangle_k}\big\{-\mathrm{f}(v, y) + (v - y)^\top\theta^\top x\big\}, \qquad (4.1.2)$$
where โ๐ is the probability simplex in R๐, and the function f : โ๐ โ R is convex andโsimpleโ (i.e., quasi-separable in ๐ฃ) implying that the maximization in (4.1.2) can beperformed in running time ๐(๐). In particular, this allows to encompass two mostcommonly used multiclass losses:
โ the multiclass logistic (or softmax) loss
โ(๐โค๐ฅ, ๐ฆ) = log
(๐โ๐=1
exp(๐โค๐ ๐ฅ)
)โ ๐ฆโค๐โค๐ฅ, (4.1.3)
where ๐๐ is the ๐-th column of ๐ so that ๐โค๐ ๐ฅ is the ๐-th element of ๐โค๐ฅ; this
loss corresponds to (4.1.2) with the negative entropy f(๐ฃ, ๐ฆ) =โ๐
๐=1 ๐ฃ๐ log ๐ฃ๐ ;โ the multiclass hinge loss, given by
โ(๐โค๐ฅ, ๐ฆ) = max๐โ1,...,๐
1[๐๐ = ๐ฆ] + ๐โค
๐ ๐ฅโ ๐ฆโค๐โค๐ฅ, (4.1.4)
and used in multiclass support vector machines (SVM), reduces to (4.1.2) bysetting f(๐ฃ, ๐ฆ) = ๐ฃโค๐ฆ โ 1 (see Appendix 4.7.1 for details).
Additional examples of Fenchel-Young losses are described by Blondel et al. [2018].
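As a numerical illustration of the Fenchel-Young representation, the sketch below (dimensions and helper names are ours) compares the direct softmax loss (4.1.3) with the value of the inner maximum in (4.1.2) for the entropic f; for this f, the maximizer over the simplex is the softmax map, a standard conjugacy fact:

```python
import numpy as np

def softmax_loss_direct(theta, x, y):
    """Multiclass logistic loss (4.1.3): log-sum-exp of the margins minus the true margin."""
    z = theta.T @ x                        # margins, shape (k,)
    return np.log(np.sum(np.exp(z))) - y @ z

def softmax_loss_fenchel_young(theta, x, y):
    """Loss via (4.1.2) with f(v, y) = sum_j v_j log v_j: the inner maximum
    over the simplex is attained at v* = softmax(z), plugged in explicitly."""
    z = theta.T @ x
    v = np.exp(z - z.max()); v /= v.sum()  # maximizer of -f(v) + (v - y)^T z
    f = np.sum(v * np.log(v))
    return -f + (v - y) @ z

rng = np.random.default_rng(0)
d, k = 5, 4
theta = rng.standard_normal((d, k))
x = rng.standard_normal(d)
y = np.eye(k)[2]                           # one-hot label e_3

print(abs(softmax_loss_direct(theta, x, y)
          - softmax_loss_fenchel_young(theta, x, y)) < 1e-10)
```

For the hinge loss (4.1.4), the inner maximum is instead attained at a vertex of the simplex, so the analogous check reduces to a maximum over the 𝑘 basis vectors.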
Arranging the feature vectors into the design matrix 𝑋 ∈ R𝑛×𝑑, and the label vectors into the matrix 𝑌 ∈ R𝑛×𝑘, and using the Fenchel dual representation (4.1.2) of the loss, we recast the initial minimization problem (4.1.1) as a convex-concave saddle-point problem of the following form:
min๐โR๐ร๐
max๐ โ๐ฑ
โโฑ(๐, ๐ ) +1
๐tr[(๐ โ ๐ )โค๐๐
]+ ๐โ๐โU . (4.1.5)
Here we denote
$$\mathcal{F}(V, Y) := \frac{1}{n}\sum_{i=1}^{n}\mathrm{f}(v_i, y_i), \qquad (4.1.6)$$
where 𝑣𝑖, 𝑦𝑖 ∈ △𝑘 are the 𝑖-th rows of 𝑉 and 𝑌, whereas
$$\mathcal{V} := [\triangle_k^\top]^{\otimes n} \subseteq \mathbb{R}^{n\times k} \qquad (4.1.7)$$
is the Cartesian product of probability simplices comprising all matrices in R𝑛×𝑘 with rows belonging to △𝑘. Taking into account the availability of the Fenchel dual representation (4.1.2), recasting the convex minimization problem (4.1.1) in the form (4.1.5) is quite natural. Indeed, while the objective in (4.1.1) can be non-smooth, the “non-trivial” part of the objective in (4.1.5), given by
ฮฆ(๐, ๐ โ ๐ ) :=1
๐tr[(๐ โ ๐ )โค๐๐
], (4.1.8)
is necessarily bilinear. On the other hand, the presence of the dual constraints, as given by (4.1.7), also does not seem problematic, since 𝒱 allows for a computationally cheap projection oracle. Finally, the primal-dual formulation (4.1.5) allows one to control the duality gap, thus providing online accuracy certificates for the initial problem (see Nemirovski and Onn [2010]).
In this chapter, we develop efficient algorithms for solving (4.1.1) via the associated saddle-point problem (4.1.5), building upon the two basic schemes for saddle-point optimization: Mirror Descent and Mirror Prox (see Juditsky and Nemirovski [2011a,b], Nesterov and Nemirovski [2013] and references therein). When applied to the class of general convex-concave saddle-point problems with available first-order information given by the partial (sub-)gradients of the objective, these basic schemes are well suited for obtaining medium-accuracy solutions – which is not a limitation in the context of empirical risk minimization, where the ultimate goal is to minimize the true (expected) risk.
Moreover, these basic schemes must be able, at least in principle, to capture the specific structure of quasi-bilinear saddle-point problems of the form (4.1.5). First, they rely on general Bregman divergences, rather than the standard Euclidean distance, to measure proximity of the points, which allows one to adjust to the particular non-Euclidean geometry associated to the set 𝒱 and the norm ‖ · ‖𝒰. Second, Mirror Descent and Mirror Prox retain their favorable convergence guarantees when the partial gradients are replaced with their unbiased estimates. To see why this circumstance is so important, recall that the computation of the partial gradients ∇𝜃[Φ(𝜃, 𝑉 − 𝑌)] and ∇𝑉 [Φ(𝜃, 𝑉 − 𝑌)], cf. (4.1.8), reduces to the computation of the matrix products 𝑋𝜃 and 𝑋⊤(𝑉 − 𝑌), and thus requires 𝑂(𝑑𝑛𝑘) operations. Given the partial gradients, the proximal steps associated to the 𝜃 and 𝑉 variables, assuming the “simplicity” of ‖ · ‖𝒰 in the aforementioned sense, can be computed in 𝑂(𝑑𝑘 + 𝑛𝑘) operations. Thus, in the regime interesting to us, where both 𝑑 and 𝑘 can be very large, the computation of these matrix products becomes the main bottleneck. On the other hand, unbiased estimates of these matrix products, with reduced complexity of computation, can be
obtained by randomized subsampling of the rows and columns of 𝑋, 𝑉, 𝑌, and 𝜃. While this approach has been extensively studied by Juditsky and Nemirovski [2011b] in the case of bilinear saddle-point problems with vector-valued variables arising in sparse signal recovery, its extension to problems of the type (4.1.5) is non-trivial. In fact, for a sampling scheme to be deemed “good”, it clearly has to satisfy two concurrent requirements:
(a) On one hand, one must control the stochastic variability of the estimates in the chosen sampling scheme. Ideally, the variance – or its suitable analogue in the case of a non-Euclidean geometry – should be comparable to the squared norm of 𝐺, whose choice depends on the proximal setup.

(b) On the other hand, the estimates must be cheap to compute, much cheaper than the 𝑂(𝑑𝑛𝑘) required for the full gradient computation. While the apparent goal is 𝑂(𝑑𝑘 + 𝑛𝑘) per iteration (the cost of the primal-dual proximal mapping for the full gradients), one could even want to go beyond that, to 𝑂(𝑑 + 𝑛 + 𝑘) per iteration, possibly paying a larger price once, during pre/post-processing.
Devising a sampling scheme satisfying these requirements is a delicate task. To solve it, one should exploit the geometric structure of (4.1.5) associated to 𝒱 and ‖ · ‖𝒰.
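To make the kind of randomized estimate involved concrete, here is a minimal sketch: the product 𝑋𝜃 is a sum of 𝑑 rank-one terms, one of which is sampled with non-uniform probabilities and reweighted by 1/𝑝𝑖, which makes the estimator unbiased. The norm-proportional probabilities below are only an illustrative choice (helper names are ours); the schemes actually used in this chapter are specified in Section 4.3.

```python
import numpy as np

def sampled_product_distribution(X, theta):
    """All outcomes of a one-term estimator of X @ theta: pick row index i of
    theta (column i of X) with probability p_i, return the rank-one term scaled by 1/p_i."""
    d = theta.shape[0]
    # illustrative non-uniform probabilities: proportional to ||X(:,i)||_2 * ||theta(i,:)||_2
    w = np.linalg.norm(X, axis=0) * np.linalg.norm(theta, axis=1)
    p = w / w.sum() if w.sum() > 0 else np.full(d, 1.0 / d)
    outcomes = [np.outer(X[:, i], theta[i, :]) / p[i] for i in range(d)]
    return p, outcomes

rng = np.random.default_rng(1)
n, d, k = 6, 5, 3
X = rng.standard_normal((n, d))
theta = rng.standard_normal((d, k))

p, outcomes = sampled_product_distribution(X, theta)
expectation = sum(pi * Gi for pi, Gi in zip(p, outcomes))
print(np.allclose(expectation, X @ theta))  # unbiased: E[estimator] = X @ theta
```

Each sampled outcome costs 𝑂(𝑛𝑘) to form instead of the 𝑂(𝑛𝑑𝑘) of the full product; controlling its variance is exactly requirement (a) above.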
Contributions and outline. We identify “good” sampling schemes satisfying the above requirements, rigorously analyze their statistical properties, carefully implement the basic algorithms equipped with these sampling schemes, and analyze the total computational complexity of the resulting algorithms. Namely, we consider the situation where ‖ · ‖𝒰 is the entrywise ℓ1-norm, which corresponds to the implicit assumption that both the classes and the features are sparse. Moreover, we choose the proximal setup adapted to the geometry of the saddle-point problem (4.1.5), thus achieving favorable accuracy guarantees for both of the basic algorithms without subsampling (cf. Sections 4.2–4.2.1). In this setup, we propose two sampling schemes with various levels of “aggressiveness”, and study their statistical properties, showing that both of the basic primal-dual algorithms preserve their favorable accuracy bounds when combined with the proposed sampling schemes. At the same time, equipping the algorithms with these sampling schemes drastically reduces the total numerical complexity (cf. (b)), with the improvement depending on the sampling scheme:
โ The Partial Sampling Scheme, described in Section 4.3.1, is applicable to anyproblem of the form (4.1.5). Here the main idea is to sample only one row of ๐and ๐ at a time, with probabilities nearly minimizing the variance proxy of thecurrent estimates of ๐ and ๐ . After careful implementation of the primal-dualproximal step, this leads to the cost ๐(๐๐ + ๐๐ + ๐๐) of one iteration.
โ In the more aggressive Full Sampling Scheme, described in Section 4.3.2, sam-pling of the rows of ๐ and ๐ is augmented with column sampling. Applyingthis scheme to multiclass hinge loss (4.1.4), we are able to carry out the updatesin only ๐(๐+ ๐ + ๐) arithmetic operations, with additional cost
๐(๐๐+ (๐+ ๐) ยทmin(๐, ๐ ))
of pre- and post-processing, 𝑇 being the number of iterations. Note that these numerical complexity estimates can be justified from the viewpoint of statistical learning theory: according to it, 𝑇 = 𝑂(𝑛) iterations of a stochastic first-order algorithm must be sufficient to extract most of the statistical information contained in the sample, since at each iteration we effectively select a single training example by subsampling a row of 𝑋. We see that in the high-dimensional regime 𝑑, 𝑘 ≫ 𝑛, this criterion is satisfied since the price of pre/post-processing becomes 𝑂(𝑛𝑑).
We conclude the chapter with numerical experiments on synthetic data presented in Section 4.5.
4.1.1 Related work
While we have passed to the saddle-point problem (4.1.5), there is still an option to solve the original problem (4.1.1). The complexity of the deterministic approach for it is 𝑂(𝑑𝑛𝑘), and the complexity of the stochastic approach is 𝑂(𝑑𝑘); because of the structure of the problem, the ideas of the Full Sampling Scheme cannot be used. Due to the easy access to noisy gradients, the classical approach is stochastic gradient descent (SGD). Algorithms such as SVRG [Johnson and Zhang, 2013], SDCA [Shalev-Shwartz and Zhang, 2013], SAGA [Defazio et al., 2014], SAG [Schmidt et al., 2017], large-scale linearly convergent algorithms [Palaniappan and Bach, 2016], and Breg-SVRG [Shi et al., 2017] use variance-reduction techniques to obtain accelerated convergence rates, but all have 𝑂(𝑑𝑘) or 𝑂(𝑑𝑘 + 𝑛𝑘) complexity. Note, however, that our variance reduction technique is different from all these methods, and none of these approaches is sublinear.
Sublinear algorithms. Formally, we call an algorithm sublinear if the computational complexity of one iteration has a linear (or linear up to a logarithmic factor) dependence on the natural dimensions of the data – such as the sample size, the number of features, and the number of classes. For the biclass setting, several results can be found in the literature. Probably the first result, for bilinear problems, was obtained by Grigoriadis and Khachiyan [1995]; this setting was also considered by Juditsky and Nemirovski [2011b] and Xiao et al. [2017]. A sublinear algorithm for SVM was proposed by Hazan et al. [2011], sublinear algorithms for semidefinite programs by Garber and Hazan [2011, 2016], and more general results were given by Clarkson et al. [2012]. However, none of these approaches can be easily extended to the multiclass setting without an extra 𝑂(𝑘) factor appearing in the numerical complexity bounds.
4.2 Choice of geometry and basic routines
Preliminaries. We focus on the convex-concave saddle-point problem of the form:
min๐โR๐ร๐
max๐ โ(ฮโค
๐ )โ๐โโฑ(๐, ๐ ) + ฮฆ(๐, ๐ โ ๐ ) + ๐โ๐โU , (4.2.1)
with the bilinear part Φ(𝜃, 𝑉 − 𝑌) of the objective given by (4.1.8), and the composite term ℱ(𝑉, 𝑌) by (4.1.6). From now on, we make the following mild assumption:
Assumption 1. The ‖ · ‖𝒰-radius of some optimal solution 𝜃* to (4.2.1) is known: ‖𝜃*‖𝒰 = 𝑅𝒰.
Remark 1. All the algorithms presented later on preserve their accuracy bounds when 𝑅𝒰 is an upper bound on ‖𝜃*‖𝒰 instead of being its exact value. However, in this case they become suboptimal by factors proportional to the looseness of this bound, hence we are interested in this upper bound being as tight as possible. Note that since 𝜃 = 0 is a feasible solution, in the case of positive losses 𝑅𝒰 ≤ 1/𝜆, but this bound is usually very loose, since 𝜆 must decrease with the number of observations as dictated by statistical learning theory. A better approach is to solve a series of problems, starting with a small radius, and increasing it exponentially until the obtained solution leaves the boundary of the feasible set. This strategy works when the solution set is bounded. Since the complexity bounds that we obtain grow linearly with 𝑅𝒰, this method enjoys the same overall complexity bound, up to a constant factor, as in the case when Assumption 1 is precisely satisfied.
With Assumption 1, we can recast the problem (4.2.1) in the constrained form:
min๐โ๐ฐ
max๐ โ๐ฑโโฑ(๐, ๐ ) + ฮฆ(๐, ๐ โ ๐ ) + ๐โ๐โU , (4.2.2)
where ๐ฐ := ๐ โ R๐ร๐ : โ๐โU โค ๐ ๐ฐ is a โ ยท โU -norm ball, and ๐ฑ := [โโ๐๐ ]โค
is the direct product of 𝑛 probability simplices in R𝑘. This problem has a very particular structure associated to the sets 𝒰 and 𝒱, and ideally, this structure should be exploited by the optimization algorithm in order to obtain efficiency estimates scaling with the “correct” norm of the optimal primal-dual solution, see, e.g., Juditsky and Nemirovski [2011a], Nesterov and Nemirovski [2013], Beck and Teboulle [2003], Shi et al. [2017]. This issue can be addressed in the framework of proximal algorithms, in which one starts by choosing the norms that capture the inherent geometry of the problem, and then replaces the usual (projected) gradient step with the proximal step, i.e., uses the general Bregman divergence corresponding to some potential (also called a distance-generating function) instead of the squared Euclidean distance. The potential must be compatible with the chosen norm in the sense of Juditsky and Nemirovski [2011a] (essentially, grow as the squared norm while being 1-strongly convex with respect to the norm, see Section 4.2.2), and at the same time allow for an efficient implementation of the proximal step. We will discuss the question of choosing the potentials in the next section, when describing Mirror Descent [see, e.g., Nemirovsky and Yudin, 1983], the simplest general proximal algorithm applicable to (4.2.2). Before that, let us specify the choice of the norms themselves.
Choice of the norms. A natural strategy for choosing the norm in each variable is by ensuring, whenever possible, that its ball of a certain radius is simply the convex
hull of the symmetrization of the feasible set (note that scaling is not important since first-order algorithms are invariant with respect to it), see, e.g., Juditsky and Nemirovski [2011a]. Following this strategy, we must choose ‖ · ‖𝒰 itself for 𝜃, whatever is the norm ‖ · ‖𝒰. On the other hand, the norm ‖ · ‖𝒰 has not yet been defined at this point; correspondingly, the “implicit” geometry of the problem in 𝜃 has not been specified. For reasons explained in Section 4.4, we choose the elementwise ℓ1-norm for 𝜃:
โ๐โU = โ๐โ1ร1 =๐โ๐=1
๐โ๐=1
|๐(๐, ๐)|, (4.2.3)
where we use the โMatlabโ indexing notation, and define the mixed โ๐ ร โ๐ norms by
โ๐ดโ๐ร๐ :=
(โ๐
โ๐ด(๐, :)โ๐๐
)1/๐
, ๐, ๐ โฅ 1, (4.2.4)
i.e., the โ๐-norm of the vector of โ๐-norms of the individual rows of the matrix. Notethat this choice of โ ยท โU favors solutions that are sparse in terms of both featuresand classes.
As for the norm on 𝒱, one could choose the ℓ1-norm in the case 𝑛 = 1, and the ℓ∞ × ℓ1-norm in the general case (recall that 𝒱 is the direct product of simplices). However, it is known (see Juditsky and Nemirovski [2011a]) that for the ℓ∞-norm there is no compatible potential in the sense of Juditsky and Nemirovski [2011a] (the existence of such a potential would contradict the known worst-case complexity lower bounds [Nemirovsky and Yudin, 1983]). The remedy, leading to near-optimal accuracy estimates, is to replace the ℓ∞-norm with the ℓ2-norm, leading us to the ℓ2 × ℓ1-norm for 𝑉:
โ๐ โV = โ๐ โ2ร1 =
(๐โ๐=1
โ๐ (๐, :)โ21
)1/2
. (4.2.5)
Alternative choices for both norms will be briefly discussed in Section 4.4. However, we will see that this choice of norms, when equipped with the properly chosen potentials, is the only one in a broad class of those using the mixed ℓ𝑝 × ℓ𝑞 norms for which we can achieve the goal stated in Section 4.1, that is, to combine near-optimal complexity estimates with a numerically efficient implementation of the proximal step.
4.2.1 Basic schemes: Mirror Descent and Mirror Prox
Basic schemes in minimization problems. The classical Mirror Descent scheme was introduced by Nemirovsky and Yudin [1983], and generalizes the standard Projected Subgradient Descent to the case of non-Euclidean distance measures. The general presentation of Mirror Descent is described by Beck and Teboulle [2003]; here we only provide a brief exposition. When minimizing a convex function 𝑓(𝑢) on a domain 𝒰, the Mirror Descent scheme amounts to choosing the potential 𝜑𝒰(𝑢) that
generalizes the squared Euclidean distance $\frac{1}{2}\|u\|_2^2$, and the sequence of stepsizes 𝛾𝑡,
starting with the initial point $u_0 = \arg\min_{u\in\mathcal{U}}\varphi_\mathcal{U}(u)$, called the prox-center, and then performing iterations of the form
๐ข๐ก+1 = arg min๐ขโ๐ฐ
โจโ๐(๐ข๐ก), ๐ขโ ๐ข๐กโฉ+
1
๐พ๐ก๐ท๐๐ฐ (๐ข, ๐ข๐ก)
,
where๐ท๐(๐ข, ๐ข๐ก) = ๐(๐ข)โ ๐(๐ข๐กโ1)โ โจโ๐(๐ข๐ก), ๐ขโ ๐ข๐กโฉ
is the Bregman divergence between the candidate point ๐ข and the previous point ๐ข๐ก
that corresponds to the chosen potential ๐(ยท) = ๐๐ฐ(ยท) and replaces the squared Eu-clidean distance 1
2โ๐ข โ ๐ข๐กโ22. After computing ๐ iterates, the scheme outputs the
averaged point ๐ as the candidate solution; one can choose the uniform averaging๐ = 1
๐
โ๐โ1๐ก=0 ๐ข
๐ก or more complex averaging schedules (for example, with the weightsproportional to the stepsizes); for simplicity, we will focus on the uniform averaging.
The counterpart of the Mirror Descent scheme, called Mirror Prox, was introduced by Nemirovski [2004], and combines the use of Bregman divergences with an extrapolation step, first proposed in the Euclidean setting by Korpelevich [1977]. In the context of minimization problems, the difference is that Mirror Prox iterates as
๐ข๐ก+1/2 = arg min๐ขโ๐ฐ
โจโ๐(๐ข๐ก), ๐ขโ ๐ข๐กโฉ+
1
๐พ๐ก๐ท๐๐ฐ (๐ข, ๐ข๐ก)
,
๐ข๐ก+1 = arg min๐ขโ๐ฐ
โจโ๐(๐ข๐ก+1/2), ๐ขโ ๐ข๐ก+1/2โฉ+
1
๐พ๐ก๐ท๐๐ฐ (๐ข, ๐ข๐ก)
;
in other words, one first performs the proximal step from the current point 𝑢𝑡 to obtain the auxiliary point 𝑢𝑡+1/2, and then performs the extragradient step from 𝑢𝑡, i.e., the proximal step in which the gradient at 𝑢𝑡 is replaced with that at 𝑢𝑡+1/2.
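As an illustration of the extrapolation step, the sketch below runs Mirror Prox with entropic proximal steps on a small bilinear saddle-point problem (a matrix game over two simplices; the data and stepsize are illustrative, not the setup of this chapter) and measures the duality gap of the averaged extrapolation points:

```python
import numpy as np

def mirror_prox_game(A, T, gamma):
    """Mirror Prox for the matrix game min_x max_y x^T A y over two simplices,
    with entropic proximal steps (extrapolation + extragradient update)."""
    m, n = A.shape
    x, y = np.full(m, 1 / m), np.full(n, 1 / n)
    xs, ys = [], []
    for _ in range(T):
        # extrapolation step from (x, y)
        xh = x * np.exp(-gamma * (A @ y));   xh /= xh.sum()
        yh = y * np.exp( gamma * (A.T @ x)); yh /= yh.sum()
        # extragradient step: gradients taken at the extrapolated point
        x = x * np.exp(-gamma * (A @ yh));   x /= x.sum()
        y = y * np.exp( gamma * (A.T @ xh)); y /= y.sum()
        xs.append(xh); ys.append(yh)
    return np.mean(xs, axis=0), np.mean(ys, axis=0)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 5))
x_bar, y_bar = mirror_prox_game(A, T=500, gamma=0.1)
gap = np.max(A.T @ x_bar) - np.min(A @ y_bar)   # duality gap of the averages
print(gap)
```

For such bilinear problems the averaged Mirror Prox iterates enjoy an 𝑂(1/𝑇) duality gap, versus 𝑂(1/√𝑇) for plain Mirror Descent.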
Basic schemes for composite saddle-point problems. Both approaches can be extended to composite minimization and, most importantly in our context, to composite saddle-point problems such as (4.2.2). In particular, introducing the combined variable 𝑊 = (𝜃, 𝑉) ∈ 𝒲 [:= 𝒰 × 𝒱], the (Composite) Mirror Descent scheme for the saddle-point problem (4.2.2) first constructs the joint potential 𝜑𝒲(𝑊) from the initial potentials 𝜑𝒰 and 𝜑𝒱, in the way specified later on, and then performs the iterations
๐ 0 = min๐โ๐ฒ
๐๐ฒ(๐ );
๐ ๐ก+1 = arg min๐โ๐ฒ
โ(๐ ) + โจ๐บ(๐ ๐ก),๐ โฉ+
1
๐พ๐ก๐ท๐๐ฒ (๐,๐ ๐ก)
, ๐ก โฅ 1
(4.2.6)
where โ(๐ ) = โฑ(๐, ๐ ) + ๐โ๐โ๐ฐ is the combined composite term, cf. (4.1.6), and๐บ(๐ ) is the vector field of the partial gradients of ฮฆ(๐, ๐ โ ๐ ), cf. (4.1.8):
๐บ(๐ ) := (โ๐ฮฆ(๐, ๐ โ ๐ ),โโ๐ ฮฆ(๐, ๐ โ ๐ )) = (๐โค(๐ โ ๐ ),โ๐๐). (4.2.7)
Accordingly, the (Composite) Mirror Prox scheme for (4.2.2) performs the iterations
$$W_0 = \operatorname*{arg\,min}_{W\in\mathcal{W}}\varphi_\mathcal{W}(W);$$
$$W_{t+1/2} = \operatorname*{arg\,min}_{W\in\mathcal{W}}\Big\{h(W) + \langle G(W_t),\, W\rangle + \frac{1}{\gamma_t}D_{\varphi_\mathcal{W}}(W, W_t)\Big\},$$
$$W_{t+1} = \operatorname*{arg\,min}_{W\in\mathcal{W}}\Big\{h(W) + \langle G(W_{t+1/2}),\, W\rangle + \frac{1}{\gamma_t}D_{\varphi_\mathcal{W}}(W, W_t)\Big\}, \quad t \ge 1. \qquad (4.2.8)$$
Note that the joint minimization in (4.2.6) and (4.2.8) can be performed separately in 𝜃 and 𝑉 as long as the joint potential is a linear combination of 𝜑𝒰 and 𝜑𝒱. To specify the accuracy guarantees for both schemes, we need to introduce two objects: the potential differences Ω𝒰 and Ω𝒱, and the “cross” Lipschitz constant ℒ𝒰,𝒱. The choice of the joint potential will be given once we give the definitions of these objects.
Differences of potentials. The potential differences (called “omega-radii” in the literature co-authored by A. Nemirovski) are defined by
$$\Omega_\mathcal{U} = \max_{\theta\in\mathcal{U}}\varphi_\mathcal{U}(\theta) - \min_{\theta\in\mathcal{U}}\varphi_\mathcal{U}(\theta), \qquad \Omega_\mathcal{V} = \max_{V\in\mathcal{V}}\varphi_\mathcal{V}(V) - \min_{V\in\mathcal{V}}\varphi_\mathcal{V}(V); \qquad (4.2.9)$$
note that all maxima and minima are attained when the potentials are continuous, and 𝒰, 𝒱 are compact. When the potentials 𝜑𝒰, 𝜑𝒱 are compatible with the corresponding norms ‖ · ‖𝒰, ‖ · ‖𝒱 in the sense of Juditsky and Nemirovski [2011a] (see Section 4.2.2), the potential differences grow as the squared radii of 𝒰 and 𝒱 in the corresponding norms, up to factors logarithmic in the dimensions of 𝒰, 𝒱, see, e.g., Nemirovsky and Yudin [1983], Shapiro et al. [2009]; in particular, this holds for the choice of 𝜑𝒰 and 𝜑𝒱 discussed later on.
โCrossโ Lipschitz constant. Given a smooth convex-concave function ฮฆ(๐, ๐ ) =ฮฆ(๐, ๐ โ ๐ ), the โcrossโ Lipschitz constant โU ,V of the field
๐บ(๐ ) := (โ๐ฮจ(๐, ๐ ),โโ๐ ฮจ(๐, ๐ ))
with respect to the pair of norms โยทโU , โยทโV is defined as โU ,V = max(โU โV ,โV โU ),where โU โV , โV โU deliver tight inequalities of the form
โโ๐ฮจ(๐ โฒ, ๐ )โโ๐ฮจ(๐, ๐ )โV * 6 โU โV โ๐ โฒ โ ๐โU ,โโ๐ ฮจ(๐, ๐ โฒ)โโ๐ ฮจ(๐, ๐ )โU * 6 โV โU โ๐ โฒ โ ๐ โV ,
uniformly over ๐ฐ ร ๐ฑ , where โ ยท โU * , โ ยท โV * are the dual norms to โ ยท โU , โ ยท โV .For the bilinear function ฮฆ(๐, ๐ โ ๐ ) given by (4.1.8), โU โV and โV โU are equal,and we can express โU ,V as a norm of the linear operator acting on R๐ร๐ โ R๐ร๐
as ๐ โฆโ 1๐๐๐ :
โU ,V =1
๐
[โ๐โU โV := sup
โ๐โU โค1
โ๐๐โV *
]. (4.2.10)
Furthermore, for the chosen norms ‖ · ‖𝒰 = ‖ · ‖1×1 and ‖ · ‖𝒱 = ‖ · ‖2×1, we can express ℒ𝒰,𝒱 as a certain mixed norm (cf. (4.2.3)–(4.2.5)), as stated by the following proposition, proved in the Appendix:
Proposition 4.1. The “cross” Lipschitz constant can be expressed as
$$\mathcal{L}_{\mathcal{U},\mathcal{V}} = \frac{\|X^\top\|_{\infty\times 2}}{n}. \qquad (4.2.11)$$
Constructing the joint potential. Given the partial potentials 𝜑𝒰(·) and 𝜑𝒱(·), the natural way to construct the joint potential 𝜑𝒲 is by taking a weighted average of 𝜑𝒰(·) and 𝜑𝒱(·), since this leads to 𝜑𝒲 being separable in 𝜃 and 𝑉, and allows one to exploit the structure of the partial potentials to obtain accuracy guarantees. In fact, one can show that the simplistic choice 𝜑𝒲(𝑊) = (𝜑𝒰(𝜃) + 𝜑𝒱(𝑉))/2 leads to an accuracy guarantee proportional to ℒ𝒰,𝒱(Ω𝒰 + Ω𝒱), where ℒ𝒰,𝒱 is the “cross” Lipschitz constant, and Ω𝒰, Ω𝒱 are the potential differences defined above. On the other hand, if Ω𝒰 and Ω𝒱 are known, one can consider, following Juditsky and Nemirovski [2011b] and Ostrovskii and Harchaoui [2018], the “balanced” joint potential
$$\varphi_\mathcal{W}(W) = \frac{\varphi_\mathcal{U}(\theta)}{2\Omega_\mathcal{U}} + \frac{\varphi_\mathcal{V}(V)}{2\Omega_\mathcal{V}}, \qquad (4.2.12)$$
for which one can achieve the better accuracy guarantee proportional to $\mathcal{L}_{\mathcal{U},\mathcal{V}}\sqrt{\Omega_\mathcal{U}\Omega_\mathcal{V}}$,
cf. Theorem 4.3 in the Appendix. Moreover, this construction is, in a sense, “robust”: if the ratio of the weights in (4.2.12) is multiplied by a constant factor, the accuracy estimate is preserved up to a constant factor. Besides, note that for the choice of 𝜑𝒱 considered below, Ω𝒱 is known; for the choice of 𝜑𝒰 considered below, Ω𝒰 is known when Assumption 1 is satisfied, and is known up to a constant factor when one allows 𝑅𝒰 to be an upper bound for ‖𝜃*‖𝒰, as explained in Remark 1. For all these reasons, we choose the joint potential according to (4.2.12).
4.2.2 Choice of the partial potentials
We now discuss the construction of the partial potentials for the chosen geometry as given by the norms ‖ · ‖𝒰, ‖ · ‖𝒱, cf. (4.2.3), (4.2.5). The choice of potentials for the alternative geometries is discussed in Section 4.4.
Potential for the dual variable. Since $\mathcal{V} = (\triangle_k^\top)^{\otimes n}$ is the direct product of probability simplices, the natural choice for the dual potential 𝜑𝒱(·) is the sum of negative entropies [Beck and Teboulle, 2003]: denoting by log(·) the natural logarithm,
$$\varphi_\mathcal{V}(V) = \sum_{i=1}^{n}\sum_{j=1}^{k} v_{ij}\log(v_{ij}). \qquad (4.2.13)$$
This choice reflects the fact that 𝑉 ∈ 𝒱 corresponds to the marginals of an 𝑛-fold product distribution for which the entropy is the sum of the marginal entropies. By
Pinsker’s inequality [Kemperman, 1969], 𝜑𝒱(𝑉) is 1-strongly convex on 𝒱. On the other hand,
$$\Omega_\mathcal{V} = n\log k, \qquad (4.2.14)$$
whereas the squared ‖ · ‖𝒱-norm of any feasible solution 𝑉 to (4.2.2) is precisely 𝑛, cf. (4.2.5). Finally, 𝜑𝒱(𝑉) is clearly continuously differentiable in the interior of 𝒱. As such, we see that the potential (4.2.13) is compatible with 𝒱 and ‖ · ‖𝒱, i.e., the triple (𝒱, ‖ · ‖𝒱, 𝜑𝒱(·)) comprises a valid proximal setup in the sense of Juditsky and Nemirovski [2011a], and grows nearly as the squared ‖ · ‖𝒱-radius of the feasible set 𝒱. Note that the Bregman divergence corresponding to (4.2.13) is the sum of the Kullback-Leibler divergences between the individual rows:
๐ท๐๐ฑ (๐, ๐ ๐ก) =๐โ๐=1
๐ทKL(๐ (๐, :)โ๐ ๐ก(๐, :)), (4.2.15)
where ๐ท๐ทKL(๐, ๐) :=โ
๐ ๐๐ log(๐๐/๐๐) for two discrete measures ๐, ๐ (not necessarilynormalized to one).
Final reduction and the primal potential. Recall that we chose the elementwise norm ‖𝜃‖𝒰 = ‖𝜃‖1×1, and the primal feasible set of the problem (4.2.2) corresponds to the ball with radius 𝑅𝒰 in this norm. In this situation, the standard option is the power potential $\varphi(\theta) = C_{d,k}\|\theta\|_p^2$, where 𝑝 = 1 + 1/log(𝑑𝑘), and the constant 𝐶𝑑,𝑘 can be chosen, depending on 𝑑 and 𝑘, so that 𝜑(·) is 1-strongly convex on the whole space R𝑑×𝑘, see Nesterov and Nemirovski [2013]. This leads to a compatible proximal setup in the sense defined above, and the correct scaling $\Omega_\mathcal{U} = O(\log(dk)R_\mathcal{U}^2)$.
However, it turns out that the specific algebraic structure of this potential does not allow for “sublinear” numerical complexity in the full sampling scheme which we will present later on in Section 4.3.2. Hence, we advocate an alternative approach: first transform the ℓ1-constraint into the “solid” simplex constraint (borrowing the idea from Juditsky and Nemirovski [2011b]), and then construct a compatible potential for the new problem based on the unnormalized negative entropy, using the fact that 𝑅𝒰 is known, cf. Assumption 1. To this end, let 𝒰 be the “solid” simplex in R2𝑑×𝑘:
๐ฐ :=๐ โ R2๐ร๐ : ๐๐๐ > 0, tr[1โค
2๐ร๐๐ ] 6 ๐ ๐ฐ
, (4.2.16)
where $\mathbf{1}_{2d\times k}\in\mathbb{R}^{2d\times k}$ is the matrix of all ones, so that $\operatorname{tr}[\mathbf{1}_{2d\times k}^\top\theta] = \sum_{i,j}\theta_{ij} = \|\theta\|_{1\times 1}$ whenever 𝜃 ∈ 𝒰. Consider the following saddle-point problem:
min๐โ๐ฐ max๐ โ๐ฑโโฑ(๐, ๐ ) + ฮฆ(๐, ๐ โ ๐ ) + ๐ tr[1โค
2๐ร๐๐ ], (4.2.17)
where โฑ(๐, ๐ ) and ๐ฑ are the same as before (cf. (4.1.6)โ(4.1.7)), and
ฮฆ(๐, ๐ โ ๐ ) :=1
๐tr[(๐ โ ๐ )โค ๐ ๐] , where ๐ := [๐,โ๐] โ R๐ร2๐. (4.2.18)
It can be easily verified that the new saddle-point problem (4.2.17) is equivalent to (4.2.2) in the following sense: any 𝜀-accurate (in the sense of duality gap, cf. Section 4.2.4) solution (𝜃, 𝑉) to (4.2.17) with 𝜃 = [𝜃1; 𝜃2] provides an 𝜀-accurate solution (𝜃1 − 𝜃2, 𝑉) to (4.2.2).¹ Clearly, the quantities ℒ𝒰,𝒱, Ω𝒰, Ω𝒱 for the new problem remain the same as their counterparts for the problem (4.2.2); the only difference is a slight change of Ω𝒰 (see below). On the other hand, the primal feasible set of the new problem (4.2.17) – the “solid” simplex (4.2.16) – admits a compatible entropy-type potential. Indeed, consider first the “unit solid” simplex given by (4.2.16) with 𝑅𝒰 = 1. On this set, the unnormalized negative entropy
โ(๐) =โ๐,๐
๐๐๐ log ๐๐๐ โ ๐๐๐ (4.2.19)
is 1-strongly convex (see Yu [2013] for an elementary proof), and clearly is continuously differentiable in its interior; thus, ℎ(·) is a compatible potential on the “unit solid” simplex in the sense of Juditsky and Nemirovski [2011a]. Finally, using Assumption 1, we can construct a compatible (i.e., 1-strongly convex and continuously differentiable in the interior) potential on the initial “solid” simplex (4.2.16) with arbitrary radius by scaling the argument of ℎ(·) and then renormalizing it as follows:
๐๐ฐ(๐) := ๐ 2๐ฐ ยท โ(๐/๐ ๐ฐ)
= ๐ ๐ฐ
[2๐โ๐=1
๐โ๐=1
๐๐๐ log
( ๐๐๐๐ ๐ฐ
)โ ๐๐๐] . (4.2.20)
Note that the potential difference in this case is
ฮฉ๐ฐ = ๐ 2๐ฐ log(2๐๐), (4.2.21)
and the corresponding Bregman divergence is given by
๐ท๐ ๐ฐ (๐, ๐ ๐ก) = ๐ 2๐ฐ๐ทKL
( ๐๐ ๐ฐ
๐ ๐ก
๐ ๐ฐ
)โ๐ ๐ฐ tr[1
โค2๐ร๐
๐ ] +๐ ๐ฐ tr[1โค2๐ร๐
๐ ๐ก], (4.2.22)
where the last term does not depend on 𝜃.

4.2.3 Recap of the deterministic algorithms
We can now formulate the iterations of the Mirror Descent and Mirror Prox schemes applied to the reformulation (4.2.17) of the saddle-point problem (4.2.2)
1. The idea of this reduction is that for any optimal primal solution 𝜃 to (4.2.17), only one of the blocks 𝜃1, 𝜃2 is non-zero, and these blocks can then be interpreted as the positive and negative parts of the corresponding optimal solution to (4.2.2).
with the specified choice of geometry. Both of them are initialized with
๐ 0 =1๐ร๐
๐, ๐0 =
๐ ๐ฐ12๐ร๐
2๐๐, (Init)
corresponding to the uniform distributions. Mirror Descent then iterates (๐ก โฅ 1):
๐ ๐ก+1 = arg min๐ โ๐ฑ
โฑ(๐, ๐ )โ ฮฆ(๐ ๐ก, ๐ ) +
๐ท๐๐ฑ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฑ
,
๐ ๐ก+1 = arg min๐โ๐ฐ๐ tr[1โค
2๐ร๐๐ ] + ฮฆ(๐, ๐ ๐ก โ ๐ ) +
๐ท๐ ๐ฐ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฐ,
(MD)
where โฑ(๐, ๐ ), ฮฆ(ยท, ยท), ๐ฑ , ๐ฐ , ๐ท๐๐ฑ (๐, ๐ ๐ก), ๐ท๐ ๐ฐ (๐, ๐ ๐ก), ฮฉ๐ฑ , ฮฉ๐ฐ were defined above. Onthe other hand, Mirror Prox performs iterations
๐ ๐ก+1/2 = arg min๐ โ๐ฑ
โฑ(๐, ๐ )โ ฮฆ(๐ ๐ก, ๐ ) +
๐ท๐๐ฑ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฑ
,
๐ ๐ก+1/2 = arg min๐โ๐ฐ๐ tr[1โค
2๐ร๐๐ ] + ฮฆ(๐, ๐ ๐ก โ ๐ ) +
๐ท๐ ๐ฐ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฐ
;
๐ ๐ก+1 = arg min๐ โ๐ฑ
โฑ(๐, ๐ )โ ฮฆ(๐ ๐ก+1/2, ๐ ) +
๐ท๐๐ฑ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฑ
,
๐ ๐ก+1 = arg min๐โ๐ฐ๐ tr[1โค
2๐ร๐๐ ] + ฮฆ(๐, ๐ ๐ก+1/2 โ ๐ ) +
๐ท๐ ๐ฐ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฐ.
(MP)
Complexity of one iteration. Note that the primal updates in both algorithms can be expressed in closed form: using Lemma 4.1 in Appendix 4.7.3, we can verify that the primal update in (MD) amounts to
๐ ๐ก+1๐๐ = ๐ ๐ก
๐๐ ยท exp(โ2๐พ๐ก๐๐ก๐๐๐ ๐ฐ log(2๐๐)) ยทmin
exp(โ2๐พ๐ก๐๐ ๐ฐ log(2๐๐)),
๐ ๐ฐ
๐
,
where ๐๐ก =1
๐๐โค(๐ ๐ก โ ๐ ), and๐ =
2๐โ๐=1
๐โ๐=1
๐ ๐ก๐๐ ยท exp(โ2๐พ๐ก๐
๐ก๐๐๐ ๐ฐ log(2๐๐)).
(4.2.23)

On the other hand, the dual updates depend on the expression for f(𝑣, 𝑦) in the definition (4.1.2) of Fenchel-Young losses (cf. also (4.1.6)). Crucially, when f(𝑣, 𝑦) is separable in 𝑣 – which is the case for the Fenchel-Young losses including the multiclass logistic (4.1.3) and hinge (4.1.4) loss – we can perform the update for 𝑉 as follows:
โ pass to the Lagrange dual formulation of the proximal step for ๐ ;โ minimize the Lagrangian for the given value of the multiplier by solving ๐(๐๐)
one-dimensional problems exactly (when possible) or to numerical tolerance;โ on top of that, find the optimal Lagrange multiplier via one-dimensional search.
See Ostrovskii and Harchaoui [2018, Supp. Mat.] for an illustration of this technique. Note also that in the case of multiclass SVM (cf. (4.1.4)), we have an explicit formula:
๐ ๐ก+1๐๐ =
๐ ๐ก๐๐ exp
(2๐พ๐ก๐
๐ก๐๐ log(๐)
)๐โ๐=1
๐ ๐ก๐๐ exp (2๐พ๐ก๐๐ก
๐๐ log(๐))
, where ๐๐ก = ๐ ๐ ๐กโ1 โ ๐. (4.2.24)
Overall, we see that the computational bottleneck in the deterministic schemes is the matrix-matrix multiplication, which costs 𝑂(𝑛𝑘𝑑) arithmetic operations; the remaining part of the combined proximal step takes only 𝑂(𝑛𝑘 + 𝑑𝑘) operations.
We now present accuracy guarantees that follow from the general analysis of the proximal saddle-point optimization schemes (see Appendix 4.7.2) combined with the obtained expressions for ℒ𝒰,𝒱, Ω𝒰, and Ω𝒱.
4.2.4 Accuracy bounds for the deterministic algorithms
Preliminaries. In what follows, we denote by 𝑂(1) generic constants. Recall that the accuracy of a candidate solution $(\bar\theta, \bar V)$ to a saddle-point problem
min๐โ๐ฐ
max๐ โ๐ฑ
๐(๐, ๐ ), (4.2.25)
assuming continuity of 𝜙 and compactness of 𝒰 and 𝒱, can be quantified via the duality gap
โ๐ ( , ๐ ) := max๐ โ๐ฑ
๐( , ๐ )โmin๐โ๐ฐ
๐(๐, ๐ ). (4.2.26)
In particular, under Slaterโs conditions (which holds for (4.2.17) in particular), theproblem (4.2.25) possesses an optimal solution ๐ * = (๐*, ๐ *), called a saddle point,for which it holds ๐(๐*, ๐ *) = max๐ โ๐ฑ ๐(๐*, ๐ ) = min๐โ๐ฐ ๐(๐, ๐ *) โ that is, ๐*
(resp. ๐ *) is optimal in the primal problem of minimizing ๐prim(๐) := max๐ โ๐ฑ ๐(๐, ๐ )in ๐ (resp. the dual problem of maximizaing ๐dual(๐ ) := min๐โ๐ฐ ๐(๐, ๐ ) in ๐ ).Hence, in this case โ๐ ( , ๐ ) is the sum of the primal and dual accuracies, and boundsfrom above the primal accuracy, which is of main interest in the initial problem (4.1.1).
We can derive the following accuracy guarantee for the composite saddle-point Mirror Descent (MD) applied to the saddle-point problem (4.2.17).² To simplify the exposition, we only consider constant stepsize and simple averaging of the iterates; empirically we observe similar accuracy for stepsizes decreasing as ∼ 1/√t.
2. Surprisingly, we could not find accuracy guarantees for the composite-objective variants ofMirror Descent and Mirror Prox when applied to saddle-point problems. Note that the resultsof Duchi et al. [2010] only hold for Mirror Descent in a composite minimization problem, and thoseof Nesterov and Nemirovski [2013] use a different (in fact, more robust) formulation of the algorithms.
Theorem 4.1. Let (μ̂^N, V̂^N) = (1/N) Σ_{t=0}^{N−1} (μ^t, V^t) be the average of the first N iterates of Mirror Descent (MD) with initialization (Init) and constant stepsize

γ_t ≡ 1 / ( ℒ_{U,V} √(5N Ω_𝒰 Ω_𝒱) ),

with the values of ℒ_{U,V}, Ω_𝒱, Ω_𝒰 given by (4.2.11), (4.2.14), (4.2.21). Then it holds

Δφ(μ̂^N, V̂^N) ≤ [ 2√5 ℒ_{U,V} √(Ω_𝒰 Ω_𝒱) / √N + ( F(V^0, Y) − min_{V∈𝒱} F(V, Y) ) / N ]
 ≤ 2√5 ( ‖X^⊤‖_{∞×2}/√n ) · log(2dk) ‖μ*‖_{1×1} / √N + r/N,   (4.2.27)

where r = max_{y∈Δ_k} [ f(1_k/k, y) − min_{v∈Δ_k} f(v, y) ]. Moreover, for the multiclass hinge loss (4.1.4) the O(1/N) term vanishes from the brackets, and we can put r = 0.
Proof. The bracketed bound in (4.2.27) follows from the general accuracy bound for the composite saddle-point Mirror Descent scheme (Theorem 4.3 in Appendix), combined with the observation that the primal "simple" term λ tr[1^⊤_{2d×k} μ] of the objective is linear, whence the corresponding error term is not present. In the case of the hinge loss (4.1.4), the same holds for the dual "simple" term ℱ(Y, V), hence the final remark in the claim. To obtain the right-hand side of (4.2.27), we use (4.2.11), (4.2.14), (4.2.21), and bound F(V^0, Y) − min_{V∈𝒱} F(V, Y) from above.
This result merits some discussion.
Remark 2. Denoting X(:, j) ∈ R^n the j-th column of X for 1 ≤ j ≤ d, we have

‖X^⊤‖_{∞×2} / √n = max_{j≤d} √( ‖X(:, j)‖²₂ / n ).   (4.2.28)

Note that X(:, j) represents the empirical distribution of the j-th feature x_j. Hence, for i.i.d. data, ‖X^⊤‖_{∞,2}/√n almost surely converges to the largest L₂-norm of an individual feature:

‖X^⊤‖_{∞×2} / √n →^{a.s.} max_{j≤d} (E x_j²)^{1/2}.   (4.2.29)
In the non-asymptotic regime, (4.2.28) is the largest empirical L₂-moment of an individual feature, and can be controlled if the features are bounded, or, more generally, if they are sufficiently light-tailed, via standard concentration inequalities. In particular,

‖X^⊤‖_{∞×2} / √n ≤ B

if the features are uniformly bounded by B. Also, using the standard χ²-bound, see Laurent and Massart [2000, Lemma 1], with probability at least 1 − δ it holds

‖X^⊤‖_{∞×2} / √n ≤ O(1) σ ( 1 + √( log(d/δ) / n ) )   (4.2.30)

if the features are zero-mean and Gaussian with variances uniformly bounded by σ².
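As a quick illustration of this concentration effect (a hypothetical numerical sketch, not part of the thesis; the sample size and dimension are arbitrary), one can check that for standard Gaussian features the quantity ‖X^⊤‖_{∞×2}/√n stays close to σ:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 2000, 50, 1.0
X = sigma * rng.standard_normal((n, d))

# largest empirical L2-moment of a feature: ||X^T||_{inf x 2} / sqrt(n)
m = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(n)

# for zero-mean Gaussian features with log(d) << n it concentrates near sigma,
# in line with the chi-squared bound (4.2.30)
assert abs(m - sigma) < 0.2
```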
Remark 3. The accuracy bound in Theorem 4.1 includes the remainder term r/๐ dueto the presence of โฑ(๐, ๐ ). Note that r does not depend on ๐; moreover, we can expectthat r = ๐(log(๐)) in the case when f(๐ฃ, ๐ฆ) corresponds to some notion of entropyon โ๐ as is the case, in particular, for the multiclass logistic loss (4.1.3). Finally, asstated in the theorem, one can set r = 0 for the multiclass hinge loss (4.1.4). Thus,overall we see that the additional term is relatively small, and can be neglected.
Remark 4. Finally, note that the O(1/√N) rate can be improved to O(1/N) if instead of Mirror Descent we use Mirror Prox. While here we do not state the precise accuracy bound for the "direct" version of the algorithm (MP), such results are known for the more flexible "epigraph" version, in which the "simple" terms are moved into the constraints (which allows one to address a more general class of semi-separable problems), see, e.g., He et al. [2015]. Also, in the case of the multiclass hinge loss (4.1.4), both "simple" terms are linear, and can be formally absorbed into Ψ(μ, V) = Φ(μ, V − Y) without changing ℒ_{U,V}. This would result in the bound
โ๐ (๐ , ๐ ๐ ) 6
๐(1)โU ,V
โฮฉ๐ฐฮฉ๐ฑ
๐6๐๐,๐(1)โ๐*โ1ร1
๐ยท โ๐
โคโโร2โ๐
, (4.2.31)
where ๐๐,๐(1) is a logarithmic factor in ๐ and ๐, see Juditsky and Nemirovski [2011b].While the ๐(1/๐ ) rate is not preserved for Mirror Prox with stochastic oracle, thisresult is still useful if we are willing to use the mini-batching technique.
4.3 Sampling schemes
As previously noted, the main drawback of the deterministic approach is the high numerical complexity O(dkn) of operations due to the cost of matrix multiplications when computing the partial gradients Xμ and X^⊤(V − Y). Following Juditsky and Nemirovski [2011b], the natural approach to accelerate the computation of these matrix products is to sample the matrices μ and V − Y. Denoting ξ_μ and ξ_{V,Y} the unbiased estimates of the matrix products Xμ and X^⊤(V − Y) obtained in this way, we arrive at the following version of the Stochastic Mirror Descent scheme (cf. (MD)):
๐ ๐ก+1 = arg min๐ โ๐ฑ
โฑ(๐, ๐ )โ 1
๐tr[๐โค๐๐ก๐
]+๐ท๐๐ฑ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฑ
,
๐ ๐ก+1 = arg min๐โ๐ฐ๐ tr[1โค
2๐ร๐๐ ] +
1
๐tr[๐๐ ๐ก,๐
๐]+๐ท๐ ๐ฐ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฐ,
(S-MD)
and its counterpart in the case of Mirror Prox, which can be formulated analogously.The immediate question is how to obtain ๐๐ and ๐๐,๐ . For example, we might
sample one row of ๐ and ๐ โ ๐ at a time, obtaining unbiased estimates of the truepartial gradients that can be computed in just ๐(๐๐ + ๐๐) operations โ comparablewith the remaining volume of computations for proximal mapping itself. In fact, onecan try to go even further, sampling a single element at a time, and using explicitexpressions for the updates available in the case of SVM, cf. (4.2.23)โ(4.2.24), allowingto reduce the complexity even further โ to ๐(๐+๐+๐) as we show below. Yet, thereis a price to pay: the gradient oracle becomes stochastic, and the accuracy boundsshould be augmented with an extra term that reflects stochastic variability of thegradients. In fact, the effect of the gradient noise on the accuracy bounds for bothschemes is known: we get the extra term
๐
โโโ
ฮฉ๐ฐ๐2๐ฑ
+ ฮฉ๐ฑ๐2๐ฐโ
๐
โโ , (4.3.1)
where ๐2๐ฐ and ๐2๐ฑ are โvariance proxiesโ โ expected squared dual norms of the noise in
the gradients (see Juditsky and Nemirovski [2011b, Sec. 2.5.1]) and Appendix 4.7.2):
๐2๐ฐ =1
๐2sup๐โ๐ฐ E
[ ๐ ๐ โ ๐๐ 2V *
], ๐2
๐ฑ =1
๐2sup
(๐,๐ )โ๐ฑร๐ฑE[ ๐โค(๐ โ ๐ )โ ๐๐,๐
2U *
],
(4.3.2)In this section, we consider two sampling schemes for the estimates ๐๐ and ๐๐,๐ : thepartial scheme, in which the estimates are obtained by sampling (non-uniformly) therows of ๐ and ๐ โ ๐ ; and the full scheme, in which one also samples the columnsof these matrices. In both cases, we derive the probabilities that nearly minimizethe variance proxies. As it turns out, for the choice of the norms and potentialsdone in Section 4.2, and under weak assumptions on the feature distribution, thesechoices of probabilities force the additional term (4.3.1) to be of the same order ofmagnitude as the accuracy bound in Theorem 4.1 for the deterministic Mirror Descent(cf. Theorems 4.2โ1 below). Due to the reduced cost of iterations, this leads to thedrastic reductions in the overall numerical complexity.
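The row-sampling idea common to both schemes can be sketched as follows (illustrative code, not the thesis implementation; the matrix W stands in for V − Y, and the uniform distribution q is only one admissible choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 8, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((n, k))   # plays the role of V - Y

q = np.full(n, 1/n)               # any sampling distribution with q_j > 0

def xi(j):
    # rank-one unbiased estimate of X^T W: pick row j, reweight by 1/q_j
    return np.outer(X[j], W[j]) / q[j]

# unbiasedness: E[xi] = sum_j q_j * (X[j]^T W[j] / q_j) = X^T W
E = sum(q[j] * xi(j) for j in range(n))
assert np.allclose(E, X.T @ W)
```

Each draw costs O(dk) instead of the O(dkn) full product; the schemes below refine the choice of q (and add column sampling) to keep the variance under control.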
4.3.1 Partial sampling
In the partial sampling scheme, we draw the rows of μ and V with non-uniform probabilities. In other words, we choose a pair of distributions p = (p₁, ..., p_{2d}) ∈ Δ_{2d} and q = (q₁, ..., q_n) ∈ Δ_n, and sample ξ_𝒰 = ξ_μ(i) and ξ_{𝒱,Y} = ξ_{V,Y}(j) according to

ξ_μ(i) = X e_i e_i^⊤ μ / p_i,   ξ_{V,Y}(j) = X^⊤ e_j e_j^⊤ (V − Y) / q_j,   (Part-SS)

where e_i ∈ R^{2d} and e_j ∈ R^n are the standard basis vectors, and i ∈ [2d] := {1, ..., 2d} and j ∈ [n] have distributions p and q correspondingly (clearly, this guarantees unbiasedness). Thus, in this scheme we sample the features and the training examples, but not the classes. The main challenge is to correctly choose the distributions p, q. This can be done in a data-dependent manner to approximately minimize the variance proxies σ²_𝒰, σ²_𝒱 defined in (4.3.2). Note that minimization of the variance proxies can only be done explicitly in the Euclidean case, when ‖·‖_{U*}, ‖·‖_{V*} are the Frobenius norms. Instead, we minimize the expected squared norm of the gradient estimate itself (corresponding to the second moment in the Euclidean case), that is, choose p* = p*(X, μ) and q* = q*(X, V, Y) according to

p*(X, μ) ∈ argmin_{p∈Δ_{2d}} E[ ‖ξ_μ(i)‖²_{V*} ],   q*(X, V, Y) ∈ argmin_{q∈Δ_n} E[ ‖ξ_{V,Y}(j)‖²_{U*} ].   (4.3.3)
As we show next in Proposition 4.2, these minimization problems can indeed be solvedexplicitly. On the other hand, we can then easily bound the variance proxies for ๐*
and ๐* as well via the triangle inequality. The approach is justified by the observationthat these bounds result in the stochastic term (4.3.1) of the same order as the termsalready present in the accuracy bound of Theorem 4.1 (cf. Theorem 4.2).
Remark 5. For the stochastic Mirror Descent it is sufficient to control the secondmoment proxies directly, and passing to the variance proxies only deteriorates theconstants. Nonetheless, we choose to provide bounds for the variance proxies: suchbounds are needed in the analysis of Stochastic Mirror Prox.
The next result gives the optimal sampling distributions (4.3.3) and upper boundsfor their variance proxies.
Proposition 4.2. Consider the choice of norms ‖·‖_U = ‖·‖_{1×1}, ‖·‖_V = ‖·‖_{2×1}. Then, optimal solutions p* = p*(X, μ) and q* = q*(X, V, Y) to (4.3.3) are given by

p*_i = ‖X(:, i)‖₂ · ‖μ(i, :)‖_∞ / Σ_{w=1}^{2d} ‖X(:, w)‖₂ · ‖μ(w, :)‖_∞,
q*_j = ‖X(j, :)‖_∞ · ‖V(j, :) − Y(j, :)‖_∞ / Σ_{x=1}^{n} ‖X(x, :)‖_∞ · ‖V(x, :) − Y(x, :)‖_∞.   (4.3.4)

Moreover, their respective variance proxies satisfy the bounds

σ²_𝒰(p*) ≤ 4R²_𝒰 ‖X^⊤‖²_{∞×2} / n²,   σ²_𝒱(q*) ≤ 8‖X^⊤‖²_{∞×2} / n + 8‖X‖²_{1×∞} / n².   (4.3.5)
Proof. See Appendix 4.7.5 for the proof of a generalized result with mixed norms.
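A hypothetical numerical sketch of the primal part of (4.3.4) (the weights w, distribution p and estimator xi_mu are our illustrative names; unbiasedness holds for any positive p, while the particular weights mimic the importance-sampling choice of the proposition):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 40, 10, 3
X = rng.standard_normal((n, d))
mu = np.abs(rng.standard_normal((d, k)))

# importance weights mimicking (4.3.4): column norm of X times the sup-norm
# of the corresponding row of mu
w = np.linalg.norm(X, axis=0) * np.max(np.abs(mu), axis=1)
p = w / w.sum()

def xi_mu(i):
    return np.outer(X[:, i], mu[i]) / p[i]   # rank-one estimate of X @ mu

E = sum(p[i] * xi_mu(i) for i in range(d))
assert np.allclose(E, X @ mu)                # unbiased under p*
assert abs(p.sum() - 1) < 1e-12              # p* is a valid distribution
```

Rows with a larger product of norms are sampled more often, which is exactly what shrinks the second moment of the estimate relative to uniform sampling.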
Combining Proposition 4.2 with the general accuracy bound for Stochastic MirrorDescent (see Theorem 4.4 in Appendix 4.7.2), we arrive at the following result:
Theorem 4.2. Let (μ̂^N, V̂^N) = (1/N) Σ_{t=0}^{N−1} (μ^t, V^t) be the average of the first N iterates of Stochastic Mirror Descent (S-MD) with initialization (Init), equipped with the sampling scheme (Part-SS) with distributions given by (4.3.4), and constant stepsize

γ_t ≡ (1/√N) · min{ 1 / ( √10 ℒ_{U,V} √(Ω_𝒰 Ω_𝒱) ),  1 / ( √2 √(Ω_𝒰 σ̄²_𝒱 + Ω_𝒱 σ̄²_𝒰) ) },   (4.3.6)

with the values of ℒ_{U,V}, Ω_𝒱, Ω_𝒰 given by (4.2.11), (4.2.14), (4.2.21), and the upper bounds σ̄²_𝒰, σ̄²_𝒱 on the variance proxies given by (4.3.5). Then it holds

E[Δφ(μ̂^N, V̂^N)] ≤ [ 2√10 ℒ_{U,V} √(Ω_𝒰 Ω_𝒱) / √N + 2√2 √(Ω_𝒰 σ̄²_𝒱 + Ω_𝒱 σ̄²_𝒰) / √N + ( F(V^0, Y) − min_{V∈𝒱} F(V, Y) ) / N ]
 ≤ ( (4√6 + 2√10) ‖X^⊤‖_{∞×2}/√n + 8‖X‖_{1×∞}/n ) · log(2dk) ‖μ*‖_{1×1} / √N + r/N,   (4.3.7)
where E[ยท] is the expectation over the randomness of the algorithm, and r is thesame as in Theorem 4.1.
Discussion. The main message of Theorem 4.2 (cf. Theorem 4.1) is as follows: for the chosen geometry of ‖·‖_U and potentials, the partial sampling scheme (Part-SS) does not result in any growth of computational complexity, in terms of the number of iterations to guarantee a given value of the duality gap, as long as ‖X‖_{1×∞}/n does not dominate ‖X^⊤‖_{∞×2}/√n. This is a reasonable assumption as soon as the data has a light-tailed distribution. Indeed, recall that ‖X^⊤‖_{∞×2}/√n is the largest L₂-norm of an individual feature x_j (cf. Remark 2); on the other hand, ‖X‖_{1×∞}/n is the sample average of max_{j∈[d]} |x_j|, and thus has the finite limit E max_{j≤d} |x_j| when n → ∞. If the x_j are subgaussian, E max_{j≤d} |x_j| ≤ C_d(1) max_{j≤d} E|x_j| ≤ C_d(1) max_{j≤d} (E x_j²)^{1/2}, whence (cf. (4.2.29)):

lim_{n→∞} ‖X‖_{1×∞} / n ≤ C_d(1) lim_{n→∞} ‖X^⊤‖_{∞×2} / √n.

Similar observations can be made in the finite-sample regime. In particular, both terms admit the same bound in terms of the uniform a.s. bound on the features. Also, if x_j ∼ 𝒩(0, σ²_j) with σ_j ≤ σ for any j ∈ [d], then with probability at least 1 − δ,

‖X‖_{1×∞} / n ≤ σ √( log(nd/δ) ),

with a similar bound for ‖X^⊤‖_{∞×2}/√n, cf. (4.2.30).
Computational complexity. Computation of ξ_μ(p*) and ξ_{V,Y}(q*) requires O(dk + nk + dn) operations; note that this includes computing the optimal sampling probabilities (4.3.4). The combined complexity of the proximal steps, given these estimates, is O(dk + nk); in particular, in the case of SVM we have explicit formulae both for the primal and dual updates, and in the general case the dual updates are reduced to the root search on top of an explicit O(nk) computation — cf. (4.2.23)–(4.2.24) and the accompanying discussion.
4.3.2 Full sampling
In the full sampling scheme, the sampling of the rows of μ and V − Y is augmented with subsequent column sampling. This can be formalized (see Figure 4-1) as

ξ_μ(i, J) = X e_i e_i^⊤ μ e_l e_l^⊤ / (p_i J_{il}),   ξ_{V,Y}(j, Q) = X^⊤ e_j e_j^⊤ (V − Y) e_l e_l^⊤ / (q_j Q_{jl}),   (Full-SS)

where the indices i ∈ [2d] and j ∈ [n] are sampled with distributions p ∈ Δ_{2d} and q ∈ Δ_n as before, e_l ∈ R^k is a standard basis vector, and the rows of the stochastic matrices J ∈ (Δ_k^⊤)^{⊗2d} and Q ∈ (Δ_k^⊤)^{⊗n} specify the conditional sampling distribution of the class l ∈ [k], provided that we drew the i-th feature (in the primal) and the j-th example (in the dual). Clearly, this gives us unbiased estimates of the matrix products.
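The unbiasedness claim can be checked on a small example (a sketch under simplifying assumptions: uniform row distribution, nonnegative μ, and our own names p, J, xi_full):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 30, 6, 4
X = rng.standard_normal((n, d))
mu = np.abs(rng.standard_normal((d, k)))

p = np.full(d, 1/d)                      # row distribution (uniform here)
J = mu / mu.sum(axis=1, keepdims=True)   # conditional class distribution per row

def xi_full(i, l):
    # a single entry mu[i, l], reweighted: rank-one, single-column estimate
    out = np.zeros((n, k))
    out[:, l] = X[:, i] * mu[i, l] / (p[i] * J[i, l])
    return out

E = sum(p[i] * J[i, l] * xi_full(i, l) for i in range(d) for l in range(k))
assert np.allclose(E, X @ mu)            # unbiased estimate of X @ mu
```

Note that each draw touches a single column of the estimate, which is what makes the O(d + n + k) iteration cost of Section 4.3.3 possible.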
Next we derive the optimal sampling distributions and bound their variance prox-ies (see Appendix 4.7.6 for the proof).
Proposition 4.3. Consider the choice of norms ‖·‖_U = ‖·‖_{1×1}, ‖·‖_V = ‖·‖_{2×1}. Optimal solutions (p*, J*), (q*, Q*) to the optimization problems

min_{p∈Δ_{2d}, J∈(Δ_k^⊤)^{⊗2d}} E‖ξ_μ(i, J)‖²_{V*},   min_{q∈Δ_n, Q∈(Δ_k^⊤)^{⊗n}} E‖ξ_{V,Y}(j, Q)‖²_{U*}

are given by

p*_i = ‖X(:, i)‖₂ · ‖μ(i, :)‖₁ / Σ_{w=1}^{2d} ‖X(:, w)‖₂ · ‖μ(w, :)‖₁,   J*_{il} = |μ_{il}| / ‖μ(i, :)‖₁,
q*_j = ‖X(j, :)‖_∞ · ‖V(j, :) − Y(j, :)‖₁ / Σ_{x=1}^{n} ‖X(x, :)‖_∞ · ‖V(x, :) − Y(x, :)‖₁,   Q*_{jl} = |V_{jl} − Y_{jl}| / ‖V(j, :) − Y(j, :)‖₁.   (4.3.8)

The variance proxies σ²_𝒰(p*, J*), σ²_𝒱(q*, Q*) admit the same upper bounds as in (4.3.5).
Combining Proposition 4.3 with the general bound used in Theorem 4.2, we obtain
Corollary 1. The accuracy bound (4.3.7) of Theorem 4.2 remains true when wereplace the sampling scheme (Part-SS) with the scheme (Full-SS) with parameterschosen according to (4.3.8).
Computational complexity. In the next section, we describe an efficient implementation for the multiclass hinge loss (4.1.4) which has O(d + n + k) complexity of one iteration, plus pre- and postprocessing of

O( dn + (d + n) · min(k, N) )

arithmetic operations (a.o.'s). While the detailed analysis is deferred to the next section, let us briefly present the main ideas on which it relies.

[Figure: schematic of the sampled matrix products — ξ_μ ∈ R^{n×k} formed from a column of X ∈ R^{n×2d} and a row of μ ∈ R^{2d×k}, and ξ_{V,Y} ∈ R^{2d×k} formed from a column of X^⊤ ∈ R^{2d×n} and a row of V − Y ∈ R^{n×k}.]

Figure 4-1 — Depiction of the Full Sampling Scheme (Full-SS).
โ For the optimal sampling distributions ๐* and ๐*, we need to compute thenorms ๐๐ = โ ๐(:, ๐)โ2 and ๐๐ = โ ๐(๐, :)โโ for all ๐ โ [2๐] and ๐ โ [๐]. Thisrequires ๐(๐๐) iterations but only once, during preprocessing.
โ We also need to calculate ๐(๐) quantities ๐๐ = โ๐(๐, :)โ1 and ๐(๐) quanti-ties ๐๐ = โ๐ (๐, :) โ ๐ (๐, :)โ1. While the direct calculation of each of themtakes ๐(๐) a.o.โs, we can update them dynamically in ๐(1) in the case of thehinge loss. Examining the explicit formulae for the updates in the case of hingeloss (cf. (4.2.23)โ(4.2.24)), we notice that in the case of full sampling, but one
elements of each row of ๐ and ๐ are simply scaled by the same coefficient, andthus one can maintain each of the quantities โ๐(๐, :)โ1 and โ๐ (๐, :)โ ๐ (๐, :)โ1in ๐(1) a.o.โs, leading to ๐(๐+๐) complexity of this step. This result is fragile,requiring the specific combination of hinge loss (for which f(๐ฃ, ๐ฆ) is linear) andentropic potentials.
โ We do not need to compute the full matrices ๐,๐. Instead, we can computeonly one row of each of them, corresponding to the drawn pair ๐, ๐, whichrequires ๐(๐) a.o.โs. Hence we can implement sampling in ๐(๐+ ๐+ ๐) a.o.โs
โ Note that updates of the matrices ๐ and ๐ are not dense even in the case of thefull sampling scheme, so implementing them naively would result in ๐(๐๐+๐๐)a.o.โs per iteration. Instead, we employ โlazy updatesโ: for each ๐ โ [2๐] and ๐ โ[๐], instead of ๐(๐, :) and ๐ (๐, :) we maintain the pairs (๐(๐, :), ๐ผ๐), (๐ (๐, :), ๐ฝ๐)such that ๐(๐, :) = ๐ผ๐ ๐(๐, :), ๐ (๐, :) = ๐ฝ๐ ๐ (๐, :).
Recalling that at each iteration all but one elements of ๐(๐, :) are scaled bythe same coefficient, we can update any such pair in ๐(1), with the totalcomplexity of ๐(๐+ ๐) a.o.โs.
โ It can be shown that given the pairs (๐(๐, :), ๐ผ๐) and (๐ (๐, :), ๐ฝ๐), the compu-tation of the averages ๐ , ๐ ๐ after ๐ iterations requires ๐(๐ + ๐) a.o.โs periteration plus the post-processing of ๐((๐ + ๐) ยท min(๐, ๐ )), resulting in theoutlined complexity estimate. Note that the computation of the duality gap(which can be used as the online stopping criterion, see, e.g., Ostrovskii andHarchaoui [2018]) also has the same complexity. Hence, in practice it is rea-sonable to employ the โdoublingโ technique, computing the averages and theduality gap at iterations ๐๐ = 2๐ with increasing values of ๐. This will resultin the same overall complexity while allowing for an online stopping criterion.
Remark 6. The way to avoid averaging of iterations is to consider averaging ofstochastic gradients as done by Juditsky and Nemirovski [2011b]. Stochastic gradi-ents ๐๐ and ๐๐ have only ๐(๐) and ๐(๐) non-zero elements respectively, and theirrunning averages can be computed in ๐(๐ + ๐) a.o.โs. However, in this case we lose
the duality gap guarantee (and thus an online stopping criterion), and only have aguarantee on the primal accuracy. 3
Remark 7. Unfortunately, we were not able to achieve the ๐(๐+ ๐+ ๐) complexityfor the multiclass logistic loss (4.1.3). The reason is that in this case the compos-ite term โฑ(๐, ๐ ) is the sum of negative entropies (same as the potential (4.2.13)),and the corresponding proximal updates are not reduced to rescalings of rows (mod-ulo ๐(1) elements). However, due to the sparsity of stochastic gradients, the updateshave a special form, and we believe that ๐(๐+๐+๐) complexity is possible to achievein this case as well. We are planning to investigate it in the future.
4.3.3 Efficient implementation of SVM with Full Sampling
In this section we consider the updates for the fully stochastic case and show that the complexity of one iteration is indeed O(d + n + k). We will show that it is possible to use a special form of "lazy" updates for the scaling coefficients. Recall that one iteration of the fully stochastic mirror descent is written as (S-MD):
๐ ๐ก+1 = arg min๐ โ๐ฑ
โฑ(๐, ๐ )โ 1
๐tr[๐โค๐๐ก๐
]+๐ท๐๐ฑ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฑ
,
๐ ๐ก+1 = arg min๐โ๐ฐ๐ tr[1โค
2๐ร๐๐ ] +
1
๐tr[๐๐ ๐ก,๐
๐]+๐ท๐ ๐ฐ (๐, ๐ ๐ก)
2๐พ๐กฮฉ๐ฐ,
(S-MD)
where ๐๐๐ก and ๐๐ ๐ก,๐ are stochastic gradients.
Lazy Updates.
Note that although the estimates ๐๐ , ๐๐,๐ produced in (Full-SS) are sparse (eachcontains a single non-zero column), the updates in (S-MD), which can be expressedas (4.2.23)โ(4.2.24) with ๐๐ , ๐๐,๐ instead of the corresponding matrix products, aredense, and implementing them naively costs ๐(๐๐ + ๐๐) a.o.โs. Fortunately, these
updates have a special form: all elements in each row of ๐ ๐ก and ๐ ๐ก are simply rescaledwith the same factor โ except for at most two elements corresponding to a single non-zero element of ๐๐ ๐ก,๐ and at most two non-zero elements of ๐๐๐ก โ ๐ in this row.To exploit this fact, we perform โlazyโ updates: instead of explicitly computing theactual iterates (๐ ๐ก, ๐ ๐ก), we maintain the quadruple (๐, ๐ผ, ๐ , ๐ฝ), where ๐, ๐ have thesame dimensions as ๐, ๐ , while ๐ผ โ R2๐ and ๐ฝ โ R๐ are the โscaling vectorsโ, so thatat any iteration ๐ก it holds
๐ ๐ก(๐, :) = ๐(๐, :) ยท ๐ผ(๐), ๐ ๐ก(๐, :) = ๐ (๐, :) ยท ๐ฝ(๐) (4.3.9)
for any row of μ^t and V^t. Initializing with (M, N) = (μ, V), α = 1_{2d}, β = 1_n, we can update the whole quadruple, while maintaining (4.3.9), by updating at most two
3. This situation corresponds to the โgeneral caseโ in the terminology [Juditsky and Nemirovski,2011b, Sec. 2.5.2.1 and Prop. 2.6(ii)].
elements in each row of ๐ and ๐ , and encapsulating the overall scaling of rows in ๐ผand ๐ฝ. Clearly, this update requires only ๐(๐ + ๐) operations once ๐๐๐ก , ๐๐ ๐ก,๐ havebeen drawn.
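The mechanics of such a lazy update can be sketched as follows (illustrative code; the scalars c, b and the index (i0, l0) are hypothetical stand-ins for the common row factor and the single "special" element appearing in (4.2.23)–(4.2.24)):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 5, 4
M = np.abs(rng.standard_normal((d, k)))  # stored matrix
alpha = np.ones(d)                        # per-row scalings: mu = alpha[:,None]*M
mu = alpha[:, None] * M                   # dense reference copy (for checking)

# one "multiplicative" iteration: every entry is scaled by c, and a single
# element (i0, l0) gets an extra factor b -- dense cost would be O(dk)
c, b, i0, l0 = 0.9, 1.3, 2, 1
mu = c * mu
mu[i0, l0] *= b

alpha *= c                                # lazy: fold c into the scalings, O(d)
M[i0, l0] *= b                            # touch only the special element, O(1)

assert np.allclose(alpha[:, None] * M, mu)
```

The invariant alpha[:, None] * M == mu is exactly relation (4.3.9), maintained in O(d) instead of O(dk) per iteration.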
Sampling.
Computing the distributions p*, q* from (4.3.8) requires the knowledge of ‖X̄(:, i)‖₂ and ‖X̄(j, :)‖_∞, which can be precomputed in O(dn) a.o.'s, and maintaining the O(d + n) norms L_i, R_j of the rows of μ^t and V^t − Y, each of which can be maintained in O(1) a.o.'s using (4.3.9). Thus, p* and q* can be updated in O(d + n) a.o.'s. Once this is done, we can sample i_t ∼ p* and j_t ∼ q*, and then sample the class from J* and Q*, cf. (4.3.8), by computing only the i_t-th row of J* and the j_t-th row of Q*, both in O(k) a.o.'s. Thus, the total complexity of producing ξ_{μ^t}, ξ_{V^t,Y} is O(d + n + k).
Tracking the Averages.
Similar โlazyโ updates can be performed for the running averages of the iterates.Omitting the details, this requires ๐(๐+ ๐) a.o.โs per iteration, plus post-processingof ๐(๐๐ + ๐๐) a.o.โs.
The above ideas are implemented in Algorithm 1, whose correctness is formally shown in Appendix, Sec. 4.7.7 (see also Sec. 4.7.8 for an additional discussion). Its close inspection shows the iteration cost of O(d + n + k) a.o.'s, plus O(dn + dk + nk) a.o.'s for pre/post-processing, and the memory complexity of O(dn + dk + nk). Moreover, the term O(dk), which dominates in high-dimensional and highly multiclass problems, can be removed if one exploits the sparsity of the corresponding primal solution to the ℓ1-constrained problem (4.2.2), and outputs it directly, bypassing the explicit storage of μ (see Appendix Sec. 4.7.7 for details). Note that when k = O(min(d, n)), the resulting algorithm enters the sublinear regime after as few as O(k) iterations.
4.4 Discussion of Alternative Geometries
Here we consider alternative choices of the proximal geometry in mirror descentapplied to the saddle-point formulation of the CCSPP (4.1.1), possibly with otherchoices of regularization than the entrywise โ1-norm. The goal is to show that thegeometry chosen in Sec. 4.2 is the only one for which we can obtain favorable accuracyguarantees for stochastic mirror descent (S-MD).
Given the structure of the primal and dual feasible sets, it is reasonable to considergeneral mixed norms of the type (4.2.4):
โ ยท โU = โ ยท โ๐1๐ร๐2๐ , โ ยท โV = โ ยท โ๐1๐ ร๐2๐ ,
where ๐1,2๐ , ๐1,2๐ โฅ 1 (in the case of โยทโU , we also assume the same norm for regulariza-tion). Note that their dual norms can be easily computed: the dual norm of โ ยท โ๐1ร๐2is โ ยท โ๐1ร๐2 , where ๐1,2 are the corresponding conjugates to ๐1,2, i.e., 1/๐๐ + 1/๐๐ = 1(see, e.g., Lemma 3 in Sra [2012]). Moreover, it makes sense to fix ๐1๐ = 2 for
Algorithm 1 Sublinear Multiclass ℓ1-Regularized SVM
Require: X ∈ R^{n×d}, y ∈ [k]^{⊗n}, λ, R* ≥ 1, k > 1, {γ_t}_{t=0}^{N−1}
 1: Obtain Y ∈ ℰ_k^{⊗n} from the labels y;  X̄ ≡ [X, −X]
 2: α ← 1_{2d};  M ← R* · 1_{2d×k} / (2dk);  β ← 1_n;  N ← 1_{n×k} / k
 3: for w = 1 to 2d do
 4:   S(w) ≡ ‖X̄(:, w)‖₂;  L(w) ← ‖M(w, :)‖₁
 5: for x = 1 to n do
 6:   T(x) ≡ ‖X̄(x, :)‖_∞;  R(x) ← ‖N(x, :) − Y(x, :)‖₁
    # Initialize machinery to track the cumulative sums
 7: M_Σ ← 0_{2d×k};  N_Σ ← 0_{n×k}   # cumulative sums
 8: A ← 0_{2d};  B ← 0_n;  A_pr ← 0_{2d×k};  B_pr ← 0_{n×k}
 9: for t = 0 to N − 1 do   # (S-MD) iterations
10:   Draw j ∼ T ⊙ R   # ⊙ is the elementwise product
11:   Draw l ∼ |N(j, :) · β_j − Y(j, :)|
12:   [M_Σ, A_pr, A] ← TrackPrimal(M, M_Σ, A_pr, A, α, l)
    # The only non-zero column of ξ_{V^t,Y}, cf. (Full-SS):
13:   g ← X̄(j, :)^⊤ · ( Σ_{x=1}^{n} T(x) · R(x) ) · sgn[β_j · N(j, l) − Y(j, l)] / T(j)
14:   [M, α, L] ← UpdatePrimal(M, α, L, g, l, γ_t, λ, R*)
15:   Draw i ∼ S ⊙ L
16:   Draw h ∼ M(i, :)
17:   [N_Σ, B_pr, B] ← TrackDual(N, N_Σ, B_pr, B, β, h, y)
    # The only non-zero column of ξ_{μ^t}, cf. (Full-SS):
18:   g ← X̄(:, i) · ( Σ_{w=1}^{2d} S(w) · L(w) ) / S(i)
19:   [N, β, R] ← UpdateDual(N, Y, β, R, g, h, y, γ_t)
20: for l = 1 to k do   # postprocessing of cumulative sums
21:   M_Σ(:, l) ← M_Σ(:, l) + M(:, l) ⊙ (α + A − A_pr(:, l))
22:   N_Σ(:, l) ← N_Σ(:, l) + N(:, l) ⊙ (β + B − B_pr(:, l))
Ensure: M_Σ / (N + 1),  N_Σ / (N + 1)   # averages (μ̂^{N+1}, V̂^{N+1})
Procedure 1 UpdatePrimal
Require: M ∈ R^{2d×k}, α, L, g ∈ R^{2d}, l ∈ [k], γ, λ, R* ≥ 1
 1: δ ≡ log(2dk)
 2: for i = 1 to 2d do
 3:   Z_i = L_i − α_i · M(i, l) · (1 − e^{−2γδR*g_i/n})
 4: Z = Σ_{i=1}^{2d} Z_i
 5: c = min{ e^{−2γδR*λ}, R*/Z }
 6: for i = 1 to 2d do
 7:   M(i, l) ← M(i, l) · e^{−2γδR*g_i/n}
 8:   α⁺_i = c · α_i
 9:   L⁺_i = c · Z_i
Ensure: M, α⁺, L⁺
Procedure 2 UpdateDual
Require: N, Y ∈ R^{n×k}, β, R, g ∈ R^n, h ∈ [k], y ∈ [k]^{⊗n}, γ
 1: a = e^{−2γ log(k)}
 2: for j = 1 to n do
 3:   b_j = e^{2γ log(k) g_j}
 4:   c_j = e^{−2γ log(k) Y(j, h)}
 5:   Z_j = 1 − β_j · N(j, h) · (1 − b_j · c_j)
 6:   if h ≠ y_j then   # not the actual class of the drawn example j
 7:     Z_j ← Z_j − β_j · N(j, y_j) · (1 − a)
 8:     N(j, y_j) ← N(j, y_j) · a
 9:   β⁺_j = β_j / Z_j
10:   N(j, h) ← N(j, h) · b_j · c_j
11:   R⁺_j = 2 − 2β⁺_j · N(j, y_j)
Ensure: N, β⁺, R⁺
Procedure 3 TrackPrimal
Require: M, M_Σ, A_pr ∈ R^{2d×k}, A, α ∈ R^{2d}, l ∈ [k]
 1: for i = 1 to 2d do
 2:   M_Σ(i, l) ← M_Σ(i, l) + M(i, l) · (A_i + α_i − A_pr(i, l))
 3:   A_pr(i, l) ← A_i + α_i
 4:   A_i ← A_i + α_i
Ensure: M_Σ, A_pr, A
Procedure 4 TrackDual
Require: N, N_Σ, B_pr ∈ R^{n×k}, B, β ∈ R^n, h ∈ [k], y ∈ [k]^{⊗n}
 1: for j = 1 to n do
 2:   for l ∈ {h, y_j} do   # {h, y_j} has 1 or 2 elements
 3:     N_Σ(j, l) ← N_Σ(j, l) + N(j, l) · (B_j + β_j − B_pr(j, l))
 4:     B_pr(j, l) ← B_j + β_j
 5:   B_j ← B_j + β_j
Ensure: N_Σ, B_pr, B
the reasons discussed in Section 4.2. This leaves us with the obvious choices p²_μ ∈ {1, 2}, p²_V ∈ {1, 2}, which correspond to the sparsity-inducing or the standard Euclidean geometry of the classes in the primal/dual; and p¹_μ ∈ {1, 2}, which corresponds to the sparsity-inducing or Euclidean geometry of the features. Finally, the choice p¹_μ = 2 (i.e., the Euclidean geometry in the features) can also be excluded: its combination with p¹_V = 2 is known to lead to a large variance term in the biclass case.⁴ This leaves us with the possibilities

(p²_μ, p²_V) ∈ {1, 2} × {1, 2}.   (4.4.1)
In all these cases, the quantity โ๐ฐ ,๐ฑ defined in (4.2.10) can be controlled by extendingProposition 4.1:
Proposition 4.4. For any α ≥ 1 and β ≥ 1 such that β ≥ α it holds:

ℒ_{𝒰,𝒱} := ‖X‖_{1×α, 2×β} / n = ‖X^⊤‖_{∞×2} / n.
The proof of this proposition follows the steps in the proof of Proposition 4.1, andis omitted.
Finally, the corresponding partial potentials could be constructed by combiningthe Euclidean and an entropy-type potential in a way similar to the one describedin Sec. 4.2 for the dual variable; alternatively, one could use the power potentialof Nesterov and Nemirovski [2013] that results in the same rates up to a constantfactor.
Using Proposition 4.4, we can also compute the potential differences for the fourremaining setups (4.4.1). The results are shown in Table 4.1. Up to logarithmicfactors, we have equivalent results for all four geometries, with the radius ๐ * evaluatedin the corresponding norm โ ยท โ1ร2 or โ ยท โ1ร1 = โ ยท โ1.
As a result, for the deterministic Mirror Descent (with balanced potentials) we
4. Note that in the biclass case, our variance estimate for the partial sampling scheme (cf. The-orem 4.2) reduces to those in [Juditsky and Nemirovski, 2011b, Section 2.5.2.3]. They consider thecases of โ1/โ1 and โ1/โ2 geometries for the primal/dual, and omit the case of โ2/โ2-geometry, inwhich the sampling variance โexplodesโ.
105
Norm for ๐ โ R๐ร๐
2ร 1 2ร 2
Norm for ๐ โ R๐ร๐
1ร 2 ฮฉ๐ฐ = โ๐*โ21ร2 log ๐ฮฉ๐ฑ = ๐ log ๐
ฮฉ๐ฐ = โ๐*โ21ร2 log ๐ฮฉ๐ฑ = ๐
1ร 1 ฮฉ๐ฐ = โ๐*โ21 log(๐๐)ฮฉ๐ฑ = ๐ log ๐
ฮฉ๐ฐ = โ๐*โ21 log(๐๐)ฮฉ๐ฑ = ๐
Table 4.1 โ Comparison of the potential differences for the norms correspondingto (4.4.1).
obtain the accuracy bound (cf. (4.2.27)):

Δφ(μ̂^N, V̂^N) ≤ O(1) ℒ_{𝒰,𝒱} √(Ω_𝒰 Ω_𝒱) / √N ≤ C_{d,k}(1) ( R*/√N ) · ‖X^⊤‖_{∞×2}/√n + r/N

in all four cases, where C_{d,k}(1) is a logarithmic factor in d and k, and R* = ‖μ*‖_{1×2} or R* = ‖μ*‖₁ depending on p²_μ ∈ {1, 2}. In other words, the deterministic accuracy bound of Theorem 4.1 is essentially preserved for all four geometries in (4.4.1). On the other hand, using Proposition 4.5, we obtain that in the case of (Part-SS), the extra part of the accuracy bound due to sampling (cf. (4.3.7)) is also essentially preserved:

E[Δφ(μ̂^N, V̂^N)] ≤ C_{d,k}(1) ( R*/√N ) · ( ‖X^⊤‖_{∞×2}/√n + ‖X‖_{1×∞}/n ) + r/N.

However, if we consider full sampling, the situation changes: in the case p²_V = 2 the variance bound that holds for (Part-SS) is not preserved for (Full-SS). This is because our argument to control the variance of the full sampling scheme always requires that p²_μ ≤ 1 (see the proof of Proposition 4.3 in Appendix 4.7.6 for details; note that for p²_V we do not have such a restriction since the variance proxy σ²_𝒱 is controlled on the set 𝒱 given by (4.1.7) that has ℓ_∞ × ℓ₁-type geometry regardless of the norm ‖·‖_V). This leaves us with the final choice between the ‖·‖_{2×1} and ‖·‖_{2×2} norm in the dual, as we have to use the elementwise ‖·‖₁-norm in the primal. Both choices result in essentially the same accuracy bound (note that this choice only influences the algorithm but not the saddle-point problem itself). We have focused on the ‖·‖_{2×1} norm because of the algorithmic considerations: with this norm, we have multiplicative updates in the case of the multiclass hinge loss, which allows for the sublinear algorithm presented in Section 4.3.3.
106
4.5 Experiments
The natural way to measure performance for saddle-point problems is the so-called duality gap:

Δφ(μ, V) = max_{V′∈𝒱} φ(μ, V′) − min_{μ′∈𝒰} φ(μ′, V).
In the case of the multiclass SVM formulation

min_{μ∈R^{2d×k}_+, ‖μ‖₁≤R*}  max_{V∈(Δ_k^⊤)^{⊗n}}  { −(1/n) tr[Y^⊤ V] + (1/n) tr[(V − Y)^⊤ X̄ μ] } + λ‖μ‖₁,

both inner optimization problems can be solved in closed form, which gives:

Δ_gap(μ, V) = −(1/n) tr[Y^⊤ X̄ μ] + λ‖μ‖₁ + (1/n) Σ_{j=1}^{n} max_l [ (X̄μ − Y)(j, l) ]
 + (1/n) tr[Y^⊤ V] + [ max_{i,l} [ ( −(1/n) X̄^⊤(V − Y) − λ·1_{2d×k} )(i, l) ] · R* ]₊.
Sublinear Runtime.
To illustrate the sublinear iteration cost of Algorithm 1, we consider the following experiment. Fixing d = n = k, we generate X with i.i.d. standard Gaussian entries, take μ^λ to be the identity matrix (thus very sparse), and generate the labels by arg max_{l∈[k]} x_i^⊤ μ^λ_l + (1/√d) 𝒩(0, 1), where the x_i are the rows of X, and the μ^λ_l are the columns of μ^λ. This is repeated 10 times with d = n = k increasing by a constant factor; each time we run Algorithm 1 for a fixed (large) number of iterations (to dominate the cost of pre/post-processing), with R* = ‖μ^λ‖₁ and λ = 10⁻³, and measure its runtime. We observe (see Tab. 4.2) that the runtime grows proportionally to d = n = k, as expected.
๐ = ๐ = ๐ 200 400 800 1600 3200 6400๐ = 104 0.80 1.17 2.07 4.27 7.55 15.56๐ = 2 ยท 104 1.63 2.47 4.27 8.74 14.65 30.77
Table 4.2 โ Runtime (in seconds) of Algorithm 1 on synthetic data.
Synthetic Data Experiment.
We compare Algorithm 1 with two competitors: the ‖·‖₁-composite stochastic subgradient method (SSM) for the primal problem (4.1.1), in which one uniformly samples one training example at a time [Shalev-Shwartz et al., 2011], leading to an O(dk) iteration cost; and the deterministic saddle-point Mirror Prox (MP) with the geometry chosen as in Algorithm 1, for which we have an O(dkn) cost of iterations but O(1/N) convergence in terms of the number of iterations. We generate data as in the previous experiment, fixing d = n = k = 10³. The randomized algorithms are run 10 times for N ∈ {10^{m/2}, m = 1, ..., 12} iterations with constant stepsize (we use stepsize (4.3.6) in Algorithm 1, choose the one recommended in Theorem 4.1 for MP, and use the theoretical stepsize for SSM, explicitly computing the variance of subgradients and the Lipschitz constant). Each time we compute the duality gap and the primal accuracy, and measure the runtime (see Fig. 4-2). We see that Algorithm 1 outperforms SSM, which might be the combined effect of sublinearity and our choice of geometry. It also outmatches MP up to high accuracy due to the sublinear effect (MP eventually "wins" because of its O(1/N) rate).⁵
[Plots: curves Full-SMD (Gap, Acc.), SSM (Acc.), MP (Gap, Acc.).]
Figure 4-2 — Primal accuracy and duality gap (when available) for Algorithm 1, stochastic subgradient method (SSM), and Mirror Prox (MP) with exact gradients, on a synthetic data benchmark.
Practical issue: โexplosionโ of rescaling coefficients.
When running the fully stochastic algorithm, we observed one practical issue: the scaling coefficients α and β tend to rapidly increase or decrease, causing occasional arithmetic over- or underflows. We suggest two solutions to this problem:
— The immediate solution is to use arbitrary-precision arithmetic. In theory, this inflates the O(d + n + k) cost of an iteration by the allowed number of digits. The resulting complexity can still be beneficial, vis-à-vis the partial sampling scheme, when k is large enough. However, this solution is not practical, since one has to rely on external libraries or implement arbitrary-precision arithmetic oneself.
— A better solution, actually used in our code, is to perform an "urgent" rescaling whenever an over- or underflow occurs. Hopefully this is required only rarely, and the more seldom the closer we are to the optimal solution. For example, in the above experiment the "urgent" rescaling was required in about 10 iterations.
5. We provide the Matlab code of our experiments in the supplementary material.
4.6 Conclusion and perspectives
In this chapter we considered sublinear algorithms for the saddle-point problems associated with a certain type of Fenchel–Young loss. The main contribution of this work is an algorithm for the ℓ₁-regularized multiclass SVM with a numerical cost of O(d + n + k) per iteration, where d is the number of features, n the sample size, and k the number of classes. This was possible due to the right choice of proximal setup, and ad-hoc sampling techniques for matrix multiplication.
We envision the following directions for future research:
— Extension to the multiclass logistic regression (softmax) model, which is widely used in Natural Language Processing (NLP) problems, where d, n and k are on the order of millions or even billions [Chelba et al., 2013, Partalas et al., 2015].
— Providing more experiments with larger dimensions to highlight the advantages of the algorithm, including experiments on real data.
— Implementing more flexible stepsizes, including the online stepsize search in the vein of Juditsky and Nemirovski [2011a].
4.7 Appendix
4.7.1 Motivation for the multiclass hinge loss
We justify the multiclass extension (4.1.4) of the hinge loss, following Shalev-Shwartz and Ben-David [2014]. In the binary case, the soft-margin SVM objective is

(1/n)·Σ_{i=1}^{n} max(0, 1 − y_i·u^⊤x_i) + λ‖u‖,

where u ∈ R^d, ‖·‖ is some norm, and y_i ∈ {−1, 1}. Introducing Y = e_y ∈ {e_{−1}, e_1}, where e_k is the k-th standard basis vector (the dimensions of the space being symbolically indexed by {−1, 1}), and putting u_1 = −u_{−1} = u/2, we can rewrite the loss as

max(0, 1 − y·u^⊤x) = max_{k∈{1,−1}} { 1{k ≠ y} + u_k^⊤x − u_y^⊤x }.
The advantage of this reformulation is that we can naturally pass to the multiclass case, by replacing the set {−1, 1} with {1, ..., K} and introducing u_1, ..., u_K ∈ R^d without any restrictions:

max_{k∈{1,...,K}} { 1{k ≠ y} + u_k^⊤x − u_y^⊤x } = max_{v∈{e_1,...,e_K}} { 1{v ≠ Y} + Σ_{k=1}^{K} (v[k] − Y[k])·u_k^⊤x }
 = max_{v∈{e_1,...,e_K}} { 1{v ≠ Y} + (v − Y)^⊤U^⊤x } =: ℓ(U, (x, Y)),
where ยท[๐] denotes the ๐-th element of a vector, and ๐ โ R๐ร๐พ has ๐ข๐ as its ๐-thcolumn. Finally, we can rewrite โ(๐, (๐ฅ, ๐ฆ)) as follows:
โ(๐, (๐ฅ, ๐ฆ)) = max๐ฃโฮ๐พ
1โ ๐ฃโค๐ฆ +
๐พโ๐=1
(๐ฃ โ ๐ฆ)โค๐โค๐ฅ
.
This is because we maximize an affine function of ๐ฃ, and 1โ ๐ฃโค๐ฆ = 1๐ฃ = ๐ฆ at thevertices. Thus, we arrive at the announced saddle-point problem
min๐โR๐ร๐
max๐ โ(ฮโค
๐ )โ๐1โ 1
๐tr[๐ โค๐ ] +
1
๐tr[(๐ โ ๐ )โค๐๐
]+ ๐โ๐โU .
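As a sanity check, the multiclass hinge loss derived above can be evaluated by enumerating the simplex vertices. The helper below is a sketch (names are ours); in the binary case with u₁ = −u₋₁ = u/2 it reduces to the usual hinge loss max(0, 1 − y·u^⊤x):

```python
import numpy as np

def multiclass_hinge(U, x, y):
    """l(U, (x, y)) = max_k [ 1{k != y} + u_k^T x ] - u_y^T x,
    i.e. the maximum over the simplex vertices e_1, ..., e_K."""
    scores = U.T @ x                           # u_k^T x for all classes k
    return max(float(k != y) + scores[k] - scores[y]
               for k in range(len(scores)))
```

With K = 2 and columns u/2, −u/2, this recovers the binary soft-margin hinge loss.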
4.7.2 General accuracy bounds for the composite saddle-point Mirror Descent

Deterministic case. Here we provide general accuracy bounds, which are instantiated in Theorems 4.1–4.2. Below we outline the general setting that encompasses, in particular, the case of (4.2.17) solved via (MD) or (MP) with initialization (Init).
— We consider a convex-concave saddle-point problem

min_{U∈𝒰} max_{V∈𝒱} f(U, V)

with a composite objective

f(U, V) = Φ(U, V − Y) + Υ(U) − ℱ(V),

where

Φ(U, V) = (1/n)·tr[V^⊤XU]

is a bilinear function, and Υ(U), ℱ(V) are convex "simple" terms. Moreover, we assume that the primal feasible set 𝒰 belongs to the ‖·‖_𝒰-norm ball with radius R_𝒰, the dual constraint set 𝒱 belongs to the ‖·‖_𝒱-norm ball with radius R_𝒱, and ‖Y‖_𝒱 ≤ R_𝒱.⁶ To simplify the results, we make the mild assumption (satisfied in all situations known to us) that

Ω_𝒰 ≥ R²_𝒰,  Ω_𝒱 ≥ R²_𝒱. (4.7.1)
— Recall that the vector field of partial gradients of Ψ(U, V) := Φ(U, V − Y) is

G(W) := (∇_U Ψ(U, V), −∇_V Ψ(U, V)) = (1/n)·(X^⊤(V − Y), −XU). (4.7.2)
— Given the partial proximal setups (‖·‖_𝒰, φ_𝒰(·)) and (‖·‖_𝒱, φ_𝒱(·)), we run Composite Mirror Descent (4.2.6) or Mirror Prox (4.2.8) on the vector field G(W) with the joint penalty term

h(W) = Υ(U) + ℱ(V),

the "balanced" joint potential given by (4.2.12), and stepsizes γ_t.

We now provide the convergence analysis of the Mirror Descent scheme, extending the argument of [Duchi et al., 2010, Lemma 1] to composite saddle-point optimization.
Theorem 4.3. In the above setting, let (Û_T, V̂_T) = (1/T)·Σ_{t=0}^{T−1} (U_t, V_t) be the average of the first T iterates of the composite Mirror Descent (4.2.6) with constant stepsize

γ_t ≡ 1 / (𝒜_{𝒰,𝒱}·√(5T·Ω_𝒰Ω_𝒱)),

where

𝒜_{𝒰,𝒱} := (1/n)·sup_{‖U‖_𝒰 ≤ 1} ‖XU‖_{𝒱*}.
6. Note that the linear term (1/n)·tr[Y^⊤XU] can be absorbed into the simple term Υ(U), which would slightly improve the bound in Theorem 4.3. However, this improvement is impossible in the stochastic version of the algorithm, in which we sample the linear form Y^⊤XU but not the gradient of Υ(U).
Then we have the following guarantee for the duality gap:

Δ_f(Û_T, V̂_T) ≤ 2√5·𝒜_{𝒰,𝒱}·√(Ω_𝒰Ω_𝒱)/√T + [Υ(U_0) − min_{U∈𝒰} Υ(U)]/T + [ℱ(V_0) − min_{V∈𝒱} ℱ(V)]/T.
Moreover, if one of the functions Υ(U), ℱ(V) is affine, the corresponding O(1/T) error term vanishes from the bound.
Proof. 1º. We begin by introducing the norm for W = (U, V):

‖W‖_𝒲 = √( ‖U‖²_𝒰/(2Ω_𝒰) + ‖V‖²_𝒱/(2Ω_𝒱) ) (4.7.3)

and its dual norm, defined for G = (G_U, G_V) with G_U ∈ R^{d×k} and G_V ∈ R^{n×k}:

‖G‖_{𝒲*} = √( 2Ω_𝒰·‖G_U‖²_{𝒰*} + 2Ω_𝒱·‖G_V‖²_{𝒱*} ), (4.7.4)

where ‖·‖_{𝒰*} and ‖·‖_{𝒱*} are the dual norms of ‖·‖_𝒰 and ‖·‖_𝒱, respectively. We now make a few observations. First, the joint potential φ_𝒲(W) given by (4.2.12) is 1-strongly convex with respect to the norm ‖·‖_𝒲. Second, we can compute the potential difference corresponding to φ_𝒲:

Ω_𝒲 := max_{W∈𝒲} φ_𝒲(W) − min_{W∈𝒲} φ_𝒲(W) = 1. (4.7.5)
Finally, by (4.7.2) and (4.2.10) we have

max_{W∈𝒲} ‖G_U(W)‖_{𝒰*} ≤ 2𝒜_{𝒰,𝒱}·R_𝒱,  max_{W∈𝒲} ‖G_V(W)‖_{𝒱*} ≤ 𝒜_{𝒰,𝒱}·R_𝒰,

combining which with (4.7.1) we bound the ‖·‖_{𝒲*}-norm of G(W) on 𝒲:

max_{W∈𝒲} ‖G(W)‖_{𝒲*} ≤ √10·𝒜_{𝒰,𝒱}·√(Ω_𝒰Ω_𝒱). (4.7.6)
2º. We now follow the classical convergence analysis of composite Mirror Descent, see Duchi et al. [2010], extending it to convex-concave objectives. By the convexity properties of Ψ(U, V) = Φ(U, V − Y), for any W̃ = (Ũ, Ṽ) ∈ 𝒲 and W = (U, V) ∈ 𝒲 it holds that

Ψ(Ũ, V) − Ψ(U, Ṽ) = Ψ(Ũ, V) − Ψ(Ũ, Ṽ) + Ψ(Ũ, Ṽ) − Ψ(U, Ṽ)
 ≤ ⟨∇_U Ψ(Ũ, Ṽ), Ũ − U⟩ − ⟨∇_V Ψ(Ũ, Ṽ), Ṽ − V⟩ = ⟨G(W̃), W̃ − W⟩.
Let W_t = (U_t, V_t) be the t-th iterate of (4.2.6) for t ≥ 1. By the convexity of Υ(U) and ℱ(V), and denoting h(W) = Υ(U) + ℱ(V), we have, for any W = (U, V), that

Ψ(U_{t−1}, V) − Ψ(U, V_{t−1}) + h(W_t) − h(W) ≤ ⟨G(W_{t−1}), W_{t−1} − W⟩ + ⟨h'(W_t), W_t − W⟩. (4.7.7)
Let us now bound the right-hand side. Note that the first-order optimality condition for (4.2.6) (denoting by φ(·) := φ_𝒲(·) the joint potential) writes⁷

⟨γ_t·[G(W_{t−1}) + h'(W_t)] + ∇φ(W_t) − ∇φ(W_{t−1}), W_t − W⟩ ≤ 0. (4.7.8)
Combining this with (4.7.7), we get

γ_t·[Ψ(U_{t−1}, V) − Ψ(U, V_{t−1}) + h(W_t) − h(W)] ≤ ⟨∇φ(W_{t−1}) − ∇φ(W_t), W_t − W⟩ + γ_t·⟨G(W_{t−1}), W_{t−1} − W_t⟩. (4.7.9)
By the well-known identity

⟨∇φ(W_{t−1}) − ∇φ(W_t), W_t − W⟩ = D_φ(W, W_{t−1}) − D_φ(W, W_t) − D_φ(W_t, W_{t−1}), (4.7.10)

see, e.g., [Beck and Teboulle, 2003, Lemma 4.1]. On the other hand, by the Fenchel–Young inequality we have

γ_t·⟨G(W_{t−1}), W_{t−1} − W_t⟩ ≤ γ_t²·‖G(W_{t−1})‖²_{𝒲*}/2 + ‖W_{t−1} − W_t‖²_𝒲/2 ≤ 5γ_t²·𝒜²_{𝒰,𝒱}·Ω_𝒰Ω_𝒱 + D_φ(W_t, W_{t−1}), (4.7.11)
where we used (4.7.6) and the 1-strong convexity of φ(·) with respect to ‖·‖_𝒲. Thus, we obtain

γ_t·[Ψ(U_{t−1}, V) − Ψ(U, V_{t−1}) + h(W_t) − h(W)] ≤ D_φ(W, W_{t−1}) − D_φ(W, W_t) + 5γ_t²·𝒜²_{𝒰,𝒱}·Ω_𝒰Ω_𝒱. (4.7.12)
3º. Now, assuming a constant stepsize γ_t ≡ γ, by the convexity properties of Ψ(·, ·) and h(·) we obtain

f(Û_T, V) − f(U, V̂_T) = Ψ(Û_T, V) − Ψ(U, V̂_T) + h(Ŵ_T) − h(W)
 ≤ (1/T)·Σ_{t=1}^{T} [ Ψ(U_{t−1}, V) − Ψ(U, V_{t−1}) + h(W_{t−1}) − h(W) ]
 ≤ (1/T)·[ h(W_0) − h(W_T) + D_φ(W, W_0)/γ + 5Tγ·𝒜²_{𝒰,𝒱}·Ω_𝒰Ω_𝒱 ]
 ≤ (1/T)·[ h(W_0) − min_{W∈𝒲} h(W) + 1/γ + 5Tγ·𝒜²_{𝒰,𝒱}·Ω_𝒰Ω_𝒱 ], (4.7.13)

where in the third line we substituted (4.7.12), simplified the telescoping sum, and used that D_φ(W, W_T) ≥ 0, and in the last line we used D_φ(W, W_0) ≤ Ω_𝒲 ≤ 1, cf. (4.7.5). The choice

γ = 1 / (𝒜_{𝒰,𝒱}·√(5T·Ω_𝒰Ω_𝒱))
results in the accuracy bound from the premise of the theorem:

Δ_f(Û_T, V̂_T) ≤ 2√5·𝒜_{𝒰,𝒱}·√(Ω_𝒰Ω_𝒱)/√T + [h(W_0) − min_{W∈𝒲} h(W)]/T.
Finally, assume that one of the terms Υ(U), ℱ(V) is affine — w.l.o.g. let it be Υ(U). Then, since ∇Υ(U) is constant, h'(W_t) = (Υ'(U_t), ℱ'(V_t)) in (4.7.8) can be replaced with (Υ'(U_{t−1}), ℱ'(V_t)). Then in (4.7.12) we can replace h(W_t) − h(W) with Υ(U_{t−1}) − Υ(U) + ℱ(V_t) − ℱ(V), implying that the term h(W_0) − h(W_T) on the right-hand side of (4.7.13) gets replaced with ℱ(V_0) − ℱ(V_T). The claim is proved.
Stochastic Mirror Descent. We now consider the stochastic setting, which allows us to encompass (S-MD). Stochastic Mirror Descent is given by

W_0 = arg min_{W∈𝒲} φ_𝒲(W);
W_t = arg min_{W∈𝒲} { h(W) + ⟨Θ(W_{t−1}), W⟩ + (1/γ_t)·D_{φ_𝒲}(W, W_{t−1}) }, t ≥ 1, (4.7.14)

where

Θ(W) := (1/n)·(ξ_{V,Y}, −η_U)

is an unbiased estimate of the first-order oracle G(W) = (1/n)·(X^⊤(V − Y), −XU). Let us introduce the corresponding variance proxies (refer to the preamble of Section 4.3 for the discussion):

σ²_𝒰 = (1/n²)·sup_{U∈𝒰} E[‖η_U − XU‖²_{𝒱*}],  σ²_𝒱 = (1/n²)·sup_{(V,Y)∈𝒱×𝒱} E[‖X^⊤(V − Y) − ξ_{V,Y}‖²_{𝒰*}]. (4.7.15)

We assume that the noises G(W_{t−1}) − Θ(W_{t−1}) are independent along the iterations of (4.7.14). In this setting, we prove the following generalization of Theorem 4.3:
Theorem 4.4. Let (Û_T, V̂_T) = (1/T)·Σ_{t=0}^{T−1} (U_t, V_t) be the average of the first T iterates of the Stochastic Composite Mirror Descent (4.7.14) with constant stepsize

γ_t ≡ (1/√T) · min{ 1/(√10·𝒜_{𝒰,𝒱}·√(Ω_𝒰Ω_𝒱)),  1/(√2·√(Ω_𝒰·σ̄²_𝒱 + Ω_𝒱·σ̄²_𝒰)) },

where 𝒜_{𝒰,𝒱}, Ω_𝒰, Ω_𝒱 are the same as in Theorem 4.3, and σ̄²_𝒰, σ̄²_𝒱 are the upper bounds
7. Note that φ(W) is continuously differentiable in the interior of 𝒲, and ∇φ diverges on the boundary of 𝒲; hence the iterates are guaranteed to stay in the interior of 𝒲 Beck and Teboulle [2003].
for σ²_𝒰, σ²_𝒱, cf. (4.7.15). Then it holds

E[Δ_f(Û_T, V̂_T)] ≤ 2√10·𝒜_{𝒰,𝒱}·√(Ω_𝒰Ω_𝒱)/√T + 2√2·√(Ω_𝒰·σ̄²_𝒱 + Ω_𝒱·σ̄²_𝒰)/√T + [Υ(U_0) − min_{U∈𝒰} Υ(U)]/T + [ℱ(V_0) − min_{V∈𝒱} ℱ(V)]/T,

where E[·] is the expectation over the randomness in (4.7.14). Moreover, if one of the functions Υ(U), ℱ(V) is affine, the corresponding O(1/T) term can be discarded.
Proof. The proof closely follows that of Theorem 4.3. First, step 1º remains unchanged. Then, in the first-order condition (4.7.8) one must replace G(W_{t−1}) with Θ(W_{t−1}), which results in replacing (4.7.9) with

γ_t·[Ψ(U_{t−1}, V) − Ψ(U, V_{t−1}) + h(W_t) − h(W)]
 ≤ ⟨∇φ(W_{t−1}) − ∇φ(W_t), W_t − W⟩ + γ_t·⟨Θ(W_{t−1}), W_{t−1} − W_t⟩ + γ_t·⟨G(W_{t−1}) − Θ(W_{t−1}), W_{t−1} − W⟩,
where the last term has zero mean. The term γ_t·⟨Θ(W_{t−1}), W_{t−1} − W_t⟩ can be bounded using Young's inequality and the 1-strong convexity of φ(·), cf. (4.7.11):

γ_t·⟨Θ(W_{t−1}), W_{t−1} − W_t⟩ ≤ γ_t²·‖Θ(W_{t−1})‖²_{𝒲*}/2 + ‖W_{t−1} − W_t‖²_𝒲/2
 ≤ γ_t²·‖G(W_{t−1})‖²_{𝒲*} + γ_t²·‖Θ(W_{t−1}) − G(W_{t−1})‖²_{𝒲*} + D_φ(W_t, W_{t−1}).
Combining (4.7.6), (4.7.4), and (4.7.15), this implies

γ_t·⟨Θ(W_{t−1}), W_{t−1} − W_t⟩ ≤ 10γ_t²·𝒜²_{𝒰,𝒱}·Ω_𝒰Ω_𝒱 + 2γ_t²·(Ω_𝒰·σ̄²_𝒱 + Ω_𝒱·σ̄²_𝒰) + D_φ(W_t, W_{t−1}).
Using (4.7.10) and (4.7.5), this results in

E[Δ_f(Û_T, V̂_T)] = E[ max_{(U,V)∈𝒲} f(Û_T, V) − f(U, V̂_T) ]
 ≤ (1/T)·[ h(W_0) − min_{W∈𝒲} h(W) + 1/γ + 2Tγ·(5𝒜²_{𝒰,𝒱}·Ω_𝒰Ω_𝒱 + Ω_𝒰·σ̄²_𝒱 + Ω_𝒱·σ̄²_𝒰) ];

note that the maximization on the left-hand side is under the expectation (and not vice versa) because the right-hand side does not depend on W = (U, V). Choosing γ to balance the terms, we arrive at the desired bound. Finally, the improvement in the case of an affine Υ(U) or ℱ(V) is obtained in the same way as in Theorem 4.3.
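The multiplicative structure of the updates under an entropy potential can be illustrated in isolation. The following is a generic entropic mirror-descent step on the probability simplex — a simplified sketch, not the exact composite update analyzed above:

```python
import numpy as np

def entropic_md_step(v, g, gamma):
    """One mirror-descent step on the simplex with the entropy potential:
    the new point is proportional to v * exp(-gamma * g)."""
    w = v * np.exp(-gamma * g)
    return w / w.sum()
```

Coordinates with larger (sub)gradient values lose mass multiplicatively, which is the mechanism exploited by the lazy updates of Algorithm 1.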
4.7.3 Auxiliary lemmas
Lemma 4.1. Let p⁰ ∈ R^m₊ and

p¹ = arg min_{p∈R^m₊, ‖p‖₁≤R} { C₁‖p‖₁ + ⟨a, p⟩ + C₂·Σ_{i=1}^{m} p_i log(p_i/p⁰_i) }.

Then

p¹_i = r · p⁰_i · exp(−a_i/C₂) / Z,

where Z = Σ_{i=1}^{m} p⁰_i · exp(−a_i/C₂) and r = min(Z·e^{−(C₁+C₂)/C₂}, R).
Proof. Clearly, we have

p¹ = arg min_{s≤R} min_{p∈R^m₊, ‖p‖₁=s} { C₁s + ⟨a, p⟩ + C₂·Σ_{i=1}^{m} p_i log(p_i/p⁰_i) }.

Let us first carry out the internal minimization. By simple algebra, the first-order optimality condition for the Lagrangian of the problem with constraint ‖p‖₁ = s amounts to

a_i + C₂ + C₂ log p¹_i − C₂ log p⁰_i + μ = 0,

where μ is the Lagrange multiplier, and Σ_{i=1}^{m} p¹_i = s. Equivalently,

p¹_i = p⁰_i · exp(−(μ + C₂ + a_i)/C₂),

that is,

p¹_i = s · p⁰_i · exp(−a_i/C₂) / Σ_j p⁰_j · exp(−a_j/C₂).

Denoting D_i = exp(−a_i/C₂) and Z = Σ_i p⁰_i·D_i, and substituting for p¹ in the external minimization problem, we arrive at

r = arg min_{s≤R} { C₁s + s·Σ_i (p⁰_i D_i a_i)/Z + C₂·s·Σ_i (p⁰_i D_i/Z)·log[s·D_i/Z] }.

One can easily verify that the counterpart of this minimization problem with R = ∞ has a unique stationary point s* = Z·e^{−(C₁+C₂)/C₂} > 0. As the minimized function is convex, the minimum is attained at the point s = min(s*, R).
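The closed form of Lemma 4.1 is straightforward to implement. The sketch below (our own naming) can be checked numerically by comparing the objective at the formula's output with the objective at random feasible points:

```python
import numpy as np

def entropy_prox(p0, a, C1, C2, R):
    """Minimizer of C1*||p||_1 + <a, p> + C2 * sum_i p_i*log(p_i/p0_i)
    over p >= 0 with ||p||_1 <= R, following Lemma 4.1."""
    w = p0 * np.exp(-a / C2)                  # unnormalized p0_i * exp(-a_i/C2)
    Z = w.sum()
    r = min(Z * np.exp(-(C1 + C2) / C2), R)   # optimal total mass min(s*, R)
    return r * w / Z
```

The inner minimization fixes the shape of p (a reweighted p⁰), and the outer one only tunes the total mass r, which is why the solution costs a single pass over the coordinates.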
Lemma 4.2. Given A ∈ R^{m×n} and mixed norms (cf. (4.2.4)) ‖·‖_{p_1^u×p_2^u} on R^{m×k} and ‖·‖_{p_1^v×p_2^v} on R^{n×k} with p_2^v ≤ p_2^u, one has

sup_{‖V‖_{p_1^v×p_2^v} ≤ 1} Σ_{i=1}^{n} ‖A(:, i)‖_{p_1^u} · ‖V(i, :)‖_{p_2^u} = ‖A^⊤‖_{q_1^v×p_1^u},

where q_1^v is the conjugate of p_1^v, i.e., 1/p_1^v + 1/q_1^v = 1.
Proof. First assume p_2^v = p_2^u. Let a_i = ‖A(:, i)‖_{p_1^u} and u_i = ‖V(i, :)‖_{p_2^v} for 1 ≤ i ≤ n. Then

sup_{‖V‖_{p_1^v×p_2^v} ≤ 1} Σ_{i=1}^{n} ‖A(:, i)‖_{p_1^u}·‖V(i, :)‖_{p_2^u} = sup_{‖u‖_{p_1^v} ≤ 1} Σ_{i=1}^{n} a_i u_i = ‖a‖_{q_1^v} = ‖A^⊤‖_{q_1^v×p_1^u}.

Now let p_2^v < p_2^u. Then, for any i ∈ [n], one has ‖V(i, :)‖_{p_2^u} < ‖V(i, :)‖_{p_2^v} unless V(i, :) has a single non-zero element, in which case the two norms coincide. Hence, the supremum must be attained on such V, for which the previous argument applies.
Lemma 4.3. In the setting of Lemma 4.2, for any p_1^u ≥ 1 and p_2^u ≥ 1 one has

sup_{‖V‖_{∞×1} ≤ 1} Σ_{i=1}^{n} ‖A^⊤(:, i)‖_{p_1^u} · ‖V(i, :)‖_{p_2^u} = ‖A‖_{1×p_1^u}.

Proof. The claim follows by instantiating Lemma 4.2.
4.7.4 Proof of Proposition 4.1
By (4.2.10), and verifying that the dual norm to ‖·‖_{2×1} is ‖·‖_{2×∞}, we have

𝒜_{𝒰,𝒱} = (1/n)·sup_{‖U‖_{1×1} ≤ 1} ‖XU‖_{2×∞}.

The maximization over the unit ball ‖U‖_{1×1} ≤ 1 can be replaced with maximization over its extreme points, which are the matrices U with zeroes in all positions except for one, containing ±1. Let (j, l) be this position; then for every such U we have

‖XU‖_{2×∞} = √( Σ_{i=1}^{n} sup_l |X(i, :)U(:, l)|² ) = √( Σ_{i=1}^{n} |X(i, j)|² ) = √( Σ_{i=1}^{n} |X^⊤(j, i)|² ).

As a result,

sup_{‖U‖_{1×1} ≤ 1} ‖XU‖_{2×∞} = sup_{1≤j≤d} √( Σ_{i=1}^{n} |X^⊤(j, i)|² ) = ‖X^⊤‖_{∞×2}.
4.7.5 Proof of Proposition 4.2
We prove an extended result that holds when ‖·‖_𝒰 and ‖·‖_𝒱 are more general mixed (ℓ_p × ℓ_q)-norms, cf. (4.2.4).

Proposition 4.5. Let ‖·‖_𝒰 = ‖·‖_{p_1^u×p_2^u} on R^{2d×k} and ‖·‖_𝒱 = ‖·‖_{p_1^v×p_2^v} on R^{n×k}, with conjugate exponents q_1^u, q_2^u, q_1^v, q_2^v. Then the optimal solutions q* = q*(X̃, U) and p* = p*(X̃, V, Y) to (4.3.3) are given by

q*_j = ‖X̃(:, j)‖_{q_1^v} · ‖U(j, :)‖_{q_2^v} / Σ_{w=1}^{2d} ‖X̃(:, w)‖_{q_1^v} · ‖U(w, :)‖_{q_2^v},

p*_i = ‖X̃(i, :)‖_{q_1^u} · ‖V(i, :) − Y(i, :)‖_{q_2^u} / Σ_{x=1}^{n} ‖X̃(x, :)‖_{q_1^u} · ‖V(x, :) − Y(x, :)‖_{q_2^u}.

Moreover, we can bound their respective variance proxies (cf. (4.3.2)): introducing

𝒜²_{𝒰,𝒱} = (1/n²)·sup_{‖U‖_𝒰 ≤ 1} ‖X̃U‖²_{𝒱*},

we have, as long as p_2^u ≤ q_2^v,

σ²_𝒰(q*) ≤ 2R²_𝒰·𝒜²_{𝒰,𝒱} + (2/n²)·R²_𝒰·‖X̃^⊤‖²_{q_1^u×q_1^v},

and, as long as p_1^v ≥ 2,

σ²_𝒱(p*) ≤ 8n·𝒜²_{𝒰,𝒱} + (8/n²)·‖X̃‖²_{1×q_1^u}.
Proof. Note that the dual norms to ‖·‖_{p_1^u×p_2^u} and ‖·‖_{p_1^v×p_2^v} are given by ‖·‖_{q_1^u×q_2^u} and ‖·‖_{q_1^v×q_2^v}, respectively; see, e.g., Sra [2012].

1º. For E[‖η_U(q)‖²_{𝒱*}] we have:

E[‖η_U(q)‖²_{𝒱*}] = Σ_{j=1}^{2d} q_j · ‖X̃(:, j)·U(j, :)/q_j‖²_{q_1^v×q_2^v} = Σ_{j=1}^{2d} (1/q_j)·‖X̃(:, j)‖²_{q_1^v}·‖U(j, :)‖²_{q_2^v},

where the last transition can be verified directly. The right-hand side can be easily minimized over q ∈ Δ_{2d} explicitly, which results in

q*_j = ‖X̃(:, j)‖_{q_1^v}·‖U(j, :)‖_{q_2^v} / Σ_{w=1}^{2d} ‖X̃(:, w)‖_{q_1^v}·‖U(w, :)‖_{q_2^v}

and

E[‖η_U(q*)‖²_{𝒱*}] = [ Σ_{j=1}^{2d} ‖X̃(:, j)‖_{q_1^v}·‖U(j, :)‖_{q_2^v} ]².
Now we can bound σ²_𝒰(q*) via the triangle inequality:

σ²_𝒰(q*) ≤ (2/n²)·sup_{U∈𝒰} ‖X̃U‖²_{𝒱*} + (2/n²)·sup_{U∈𝒰} E[‖η_U(q*)‖²_{𝒱*}]
 = 2R²_𝒰·𝒜²_{𝒰,𝒱} + (2/n²)·sup_{‖U‖_𝒰 ≤ R_𝒰} [ Σ_{j=1}^{2d} ‖X̃(:, j)‖_{q_1^v}·‖U(j, :)‖_{q_2^v} ]²
 = 2R²_𝒰·𝒜²_{𝒰,𝒱} + (2/n²)·R²_𝒰·‖X̃^⊤‖²_{q_1^u×q_1^v},

where we used Lemma 4.2 (see Appendix 4.7.3) in the last transition.
2º. We now deal with E[‖ξ_{V,Y}(p)‖²_{𝒰*}]. As previously, we can explicitly compute

p*_i = ‖X̃(i, :)‖_{q_1^u}·‖V(i, :) − Y(i, :)‖_{q_2^u} / Σ_{x=1}^{n} ‖X̃(x, :)‖_{q_1^u}·‖V(x, :) − Y(x, :)‖_{q_2^u}

and

E[‖ξ_{V,Y}(p*)‖²_{𝒰*}] = [ Σ_{i=1}^{n} ‖X̃(i, :)‖_{q_1^u}·‖V(i, :) − Y(i, :)‖_{q_2^u} ]².
Thus, by the triangle inequality,

σ²_𝒱(p*) ≤ (2/n²)·sup_{(V,Y)∈𝒱×𝒱} ‖X̃^⊤(V − Y)‖²_{𝒰*} + (2/n²)·sup_{(V,Y)∈𝒱×𝒱} E[‖ξ_{V,Y}(p*)‖²_{𝒰*}]
 ≤ (2/n²)·sup_{‖V‖_{∞×1} ≤ 2} ‖X̃^⊤V‖²_{𝒰*} + (2/n²)·sup_{‖V‖_{∞×1} ≤ 2} [ Σ_{i=1}^{n} ‖X̃(i, :)‖_{q_1^u}·‖V(i, :)‖_{q_2^u} ]²
 = (8/n)·sup_{‖V‖_{2×1} ≤ 1} ‖X̃^⊤V‖²_{𝒰*} + (8/n²)·‖X̃‖²_{1×q_1^u}
 ≤ (8/n)·sup_{‖V‖_{p_1^v×p_2^v} ≤ 1} ‖X̃^⊤V‖²_{𝒰*} + (8/n²)·‖X̃‖²_{1×q_1^u}
 = 8n·𝒜²_{𝒰,𝒱} + (8/n²)·‖X̃‖²_{1×q_1^u}.

Here in the second line we used that the Minkowski sum Δ_k + (−Δ_k) belongs to the ℓ₁-ball with radius 2 (whence 𝒱 + (−𝒱) belongs to the (ℓ_∞ × ℓ₁)-ball with radius 2); in the third line we used Lemma 4.3 (see Appendix 4.7.3) and the relation, on R^{n×k},

‖·‖_{2×1} ≤ √n·‖·‖_{∞×1};

lastly, we used that p_1^v ≥ 2 and that ‖·‖_{p_1^v×p_2^v} is non-increasing in p_1^v, p_2^v ≥ 1.
Proof of Proposition 4.2. We instantiate Proposition 4.5 with ‖·‖_𝒰 = ‖·‖_{1×1} and ‖·‖_𝒱 = ‖·‖_{2×1}, and observe, using Proposition 4.1, that for X̃ = [X, −X] ∈ R^{n×2d} it holds

𝒜_{𝒰,𝒱} = (1/n)·‖X̃^⊤‖_{∞×2} = (1/n)·‖X^⊤‖_{∞×2},  ‖X̃‖_{1×∞} = ‖X‖_{1×∞}.
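Under our reading of the mixed-norm convention (4.2.4) — the inner norm is taken within each row and the outer norm across rows — these quantities are easy to compute numerically. The helper below is a sketch with names of our own choosing:

```python
import numpy as np

def mixed_norm(M, outer, inner):
    """Mixed (outer x inner)-norm of a matrix M: the `inner` norm of each
    row of M, aggregated by the `outer` norm (np.inf allowed for either)."""
    row_norms = np.linalg.norm(M, ord=inner, axis=1)
    return np.linalg.norm(row_norms, ord=outer)
```

For instance, mixed_norm(X.T, np.inf, 2) is the largest ℓ₂ norm among the columns of X, i.e., our reading of ‖X^⊤‖_{∞×2} in Proposition 4.1.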
4.7.6 Proof of Proposition 4.3
We have

min_{q∈Δ_{2d}, Q∈(Δ_k^⊤)^{⊗2d}} E‖η_U(q, Q)‖²_{2×∞}
 = min_{q∈Δ_{2d}, Q∈(Δ_k^⊤)^{⊗2d}} Σ_{j=1}^{2d} (1/q_j)·‖X̃(:, j)‖²₂ · [ Σ_{l=1}^{k} (1/Q_{jl})·|U(j, l)|² ]
 = min_{q∈Δ_{2d}} Σ_{j=1}^{2d} (1/q_j)·‖X̃(:, j)‖²₂·‖U(j, :)‖²₁,

where we carried out the internal minimization explicitly, obtaining

Q*_{jl} = |U_{jl}| / ‖U(j, :)‖₁.

Optimization over q gives:

q*_j = ‖X̃(:, j)‖₂·‖U(j, :)‖₁ / Σ_{w=1}^{2d} ‖X̃(:, w)‖₂·‖U(w, :)‖₁,  E‖η_U(q*, Q*)‖²_{2×∞} = [ Σ_{j=1}^{2d} ‖X̃(:, j)‖₂·‖U(j, :)‖₁ ]².
Defining 𝒜_{𝒰,𝒱} = (1/n)·sup_{‖U‖_𝒰 ≤ 1} ‖X̃U‖_{𝒱*} and proceeding as in the proof of Proposition 4.5, we get

σ²_𝒰(q*, Q*) ≤ (2/n²)·sup_{U∈𝒰} ‖X̃U‖²_{2×∞} + (2/n²)·sup_{U∈𝒰} E[‖η_U(q*, Q*)‖²_{2×∞}]
 = 2R²_𝒰·𝒜²_{𝒰,𝒱} + (2/n²)·R²_𝒰·sup_{‖U‖_{1×1} ≤ 1} [ Σ_{j=1}^{2d} ‖X̃(:, j)‖₂·‖U(j, :)‖₁ ]²
 = 2R²_𝒰·𝒜²_{𝒰,𝒱} + (2/n²)·R²_𝒰·‖X̃^⊤‖²_{∞×2}
 = (4/n²)·R²_𝒰·‖X^⊤‖²_{∞×2},

where in the last two transitions we used Lemma 4.2 and Proposition 4.1 (note that ‖X̃^⊤‖_{∞×2} = ‖X^⊤‖_{∞×2}). Note that the last transition requires ‖·‖_𝒰 to have ℓ₁-geometry over the classes — otherwise, Lemma 4.2 cannot be applied.
To obtain (p*, P*) we proceed in a similar way:

min_{p∈Δ_n, P∈(Δ_k^⊤)^{⊗n}} E‖ξ_{V,Y}(p, P)‖²_{∞×∞}
 = min_{p∈Δ_n, P∈(Δ_k^⊤)^{⊗n}} Σ_{i=1}^{n} (1/p_i)·‖X̃(i, :)‖²_∞ · [ Σ_{l=1}^{k} (1/P_{il})·|V(i, l) − Y(i, l)|² ]
 = min_{p∈Δ_n} Σ_{i=1}^{n} (1/p_i)·‖X̃(i, :)‖²_∞·‖V(i, :) − Y(i, :)‖²₁,

which results in

p*_i = ‖X̃(i, :)‖_∞·‖V(i, :) − Y(i, :)‖₁ / Σ_{x=1}^{n} ‖X̃(x, :)‖_∞·‖V(x, :) − Y(x, :)‖₁,
P*_{il} = |V_{il} − Y_{il}| / ‖V(i, :) − Y(i, :)‖₁.

The corresponding variance proxy can then be bounded in the same way as in the proof of Proposition 4.5.
4.7.7 Correctness of subroutines in Algorithm 1
In this section, we recall the subroutines used in Algorithm 1 — those performing the lazy updates and tracking the running averages — and demonstrate their correctness.
Procedure 1 UpdatePrimal
Require: U ∈ R^{2d×k}, α, ν, ζ ∈ R^{2d}, j ∈ [k], γ, λ, R*
1: L ≡ log(2dk)
2: for w = 1 to 2d do
3:   ν_w ← ν_w − α_w · U(w, j) · (1 − e^{−2γLR*ζ_w/n})
4: S = Σ_{w=1}^{2d} ν_w
5: ρ = min{S · e^{−2γLR*λ}, R*}/S
6: for w = 1 to 2d do
7:   U(w, j) ← U(w, j) · e^{−2γLR*ζ_w/n}
8: α⁺_w = ρ · α_w
9: ν⁺_w = ρ · ν_w
Ensure: U, α⁺, ν⁺
Primal updates (Procedure 1). To demonstrate the correctness of Procedure 1,we prove the following result:
Lemma 4.4. Suppose that at the t-th iteration of Algorithm 1, Procedure 1 was fed with U = Ũ^t, α = α^t, ν = ν^t, ζ = ζ^t, j = j_t, for which one had

Ũ^t(:, j) ∘ α^t = U^t(:, j), ∀j ∈ [k], (4.7.16)

where U^t is the t-th primal iterate of (S-MD) equipped with (Full-SS) and the optimal sampling distributions (4.3.8), and ζ^t was the only non-zero column ξ_{V^t,Y}(:, j_t) of ξ_{V^t,Y}. Moreover, suppose also that

ν^t(w) = ‖U^t(w, :)‖₁ = Σ_{j∈[k]} U^t(w, j), w ∈ [2d], (4.7.17)

were the correct norms at the t-th step. Then Procedure 1 will output Ũ^{t+1}, α^{t+1}, ν^{t+1} such that

Ũ^{t+1}(:, j) ∘ α^{t+1} = U^{t+1}(:, j), ∀j ∈ [k],

and

ν^{t+1}(w) = Σ_{j∈[k]} U^{t+1}(w, j), w ∈ [2d].
Proof. Recall that the matrix ξ^t = ξ_{V^t,Y} produced by (Full-SS) has a single non-zero column ζ^t = ξ^t(:, j_t), and according to (S-MD), the primal update U^t → U^{t+1} writes as (cf. (4.2.23)):

U^{t+1}_{wj} = U^t_{wj} · e^{−2γ_t R* L ξ^t(w,j)/n} · min{S_t · e^{−2γ_t R* L λ}, R*}/S_t,

where

L := log(2dk),  S_t := Σ_{w=1}^{2d} Σ_{j=1}^{k} U^t_{wj} · e^{−2γ_t R* L ξ^t(w,j)/n}.

Since ξ^t has the single non-zero column ζ^t = ξ^t(:, j_t), this can be rewritten as

U^{t+1}_{wj} = { U^t_{wj} · ρ · r^t_w,  j = j_t;   U^t_{wj} · ρ,  j ≠ j_t, } (4.7.18)

where

r^t_w = e^{−2γ_t L R* ζ^t_w/n},  ρ = min{S_t · e^{−2γ_t L R* λ}, R*}/S_t,
S_t = Σ_{w∈[2d]} U^t_{w,j_t} · r^t_w + Σ_{w∈[2d]} Σ_{j∈[k]∖{j_t}} U^t_{wj}.

Thus, ρ and S_t can be expressed via ν^t(w) = Σ_{j∈[k]} U^t(w, j), cf. (4.7.17):

S_t = Σ_{w∈[2d]} [ ν^t_w − α^t_w · Ũ^t_{w,j_t} · (1 − r^t_w) ], (4.7.19)
where we used the premise (4.7.16). Now we can see that the lazy updates of U can be expressed as

α^{t+1}_w = ρ · α^t_w,  Ũ^{t+1}_{w,j_t} = Ũ^t_{w,j_t} · r^t_w, (4.7.20)

and the updates for the norms ν^{t+1} as

ν^{t+1}_w = ρ · [ ν^t_w + α^t_w · Ũ^t_{w,j_t} · (r^t_w − 1) ]. (4.7.21)

One can immediately verify that this is exactly the update produced by the call of Procedure 1 in line 13 of Algorithm 1.
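The scaled-representation invariant behind these lazy updates can be illustrated by a self-contained toy (our own naming, not the actual Algorithm 1 code): a global multiplicative rescale touches only the O(d)-sized scaling vector, while the sampled column is still updated entrywise:

```python
import numpy as np

def lazy_update_demo():
    """Maintain the invariant W = U_stored * alpha[:, None]: rescale the whole
    matrix via alpha only, update a single column entrywise, then check that
    the invariant still holds against a dense ground-truth copy."""
    rng = np.random.default_rng(0)
    U = rng.random((4, 3))            # stored matrix
    alpha = np.ones(4)                # shared per-row scaling
    W = U * alpha[:, None]            # dense ground truth (for checking only)

    rho = 0.5                         # global rescale: only alpha is touched
    alpha *= rho
    W *= rho

    j, r = 1, rng.random(4)           # entrywise update of the sampled column
    U[:, j] *= r
    W[:, j] *= r

    return np.allclose(U * alpha[:, None], W)
```

This is why one iteration costs O(d) for the primal block instead of O(dk), at the price of the occasional "urgent" rescaling discussed earlier.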
Procedure 2 UpdateDual
Require: V, Y ∈ R^{n×k}, β, μ, σ ∈ R^n, l ∈ [k], y ∈ [k]^{⊗n}, γ
1: c = e^{−2γ log(k)}
2: for i = 1 to n do
3:   a_i = e^{2γ log(k)·μ_i}
4:   b_i = e^{−2γ log(k)·Y(i, l)}
5:   d_i = 1 − β_i · V(i, l) · (1 − a_i·b_i)
6:   if l ≠ y_i then   # the sampled class l is not the actual class of example i
7:     d_i ← d_i − β_i · V(i, y_i) · (1 − c)
8:   β⁺_i = β_i/d_i
9:   V(i, l) ← V(i, l) · a_i · b_i
10:  if l ≠ y_i then V(i, y_i) ← V(i, y_i) · c
11:  σ⁺_i = 2 − 2β⁺_i · V(i, y_i)
Ensure: V, β⁺, σ⁺
Dual updates (Procedure 2). To demonstrate the correctness of Procedure 2,we prove the following result:
Lemma 4.5. Suppose that at the t-th iteration of Algorithm 1, Procedure 2 was fed with V = Ṽ^t, β = β^t, μ = μ^t, σ = σ^t, l = l_t, for which one had

Ṽ^t(:, j) ∘ β^t = V^t(:, j), ∀j ∈ [k], (4.7.22)

where V^t is the t-th dual iterate of (S-MD) equipped with (Full-SS) and the optimal sampling distributions (4.3.8), and μ^t was the only non-zero column η_{U^t}(:, l_t) of η_{U^t}. Moreover, suppose also that

σ^t(i) = ‖V^t(i, :) − Y(i, :)‖₁, i ∈ [n], (4.7.23)

were the correct norms at the t-th step. Then Procedure 2 will output Ṽ^{t+1}, β^{t+1}, σ^{t+1} such that

Ṽ^{t+1}(:, j) ∘ β^{t+1} = V^{t+1}(:, j), ∀j ∈ [k],
and σ^{t+1}(i) = ‖V^{t+1}(i, :) − Y(i, :)‖₁, i ∈ [n].
Proof. Recall that the random matrix η_{U^t} has a single non-zero column μ^t := η_{U^t}(:, l_t), and according to (S-MD), the update V^t → V^{t+1} writes as (cf. (4.2.24)):

V^{t+1}_{ij} = V^t_{ij} · exp[2γ_t log(k)·(η_{U^t}(i, j) − Y(i, j))] / Σ_{l=1}^{k} V^t_{il} · exp[2γ_t log(k)·(η_{U^t}(i, l) − Y(i, l))]. (4.7.24)

Note that in each row i, all entries of the matrix η_{U^t} − Y vanish, except for at most two, in the columns l_t and y_i, where y_i is the actual label of the i-th training example, that is, the only j ∈ [k] for which Y(i, j) = 1. Recall also that Σ_{l∈[k]} V^t(i, l) = 1 for any i ∈ [n]. Thus, introducing

a_i = e^{2γ_t log(k)·μ^t_i},  b_i = e^{−2γ_t log(k)·Y(i, l_t)}

as defined in Procedure 2, we can express the denominator in (4.7.24) as

d_i = { 1 − β^t_i · Ṽ^t_{i,l_t} · (1 − a_i·b_i),  if l_t = y_i;
        1 − β^t_i · Ṽ^t_{i,l_t} · (1 − a_i·b_i) − β^t_i · Ṽ^t_{i,y_i} · (1 − e^{−2γ_t log(k)}),  if l_t ≠ y_i, } (4.7.25)

where we used the premise (4.7.22). One can verify that this corresponds to the value of d_i produced by line 7 of Procedure 2. Then, examining the numerator in (4.7.24), we can verify that lines 8–10 guarantee that

Ṽ^{t+1}_{i,j} · β^{t+1}_i = V^{t+1}_{i,j}, ∀j ∈ [k],

holds for the updated values. To verify the second invariant, we combine this result with the premise (4.7.23). This gives

σ^{t+1}_i = 2 − 2V^{t+1}_{i,y_i} = 2 − 2β^{t+1}_i · Ṽ^{t+1}_{i,y_i}, (4.7.26)

which indeed corresponds to the update in line 11 of the procedure.
Correctness of tracking the cumulative sums.
We only consider the primal variables (Procedure 3 and line 21 of Algorithm 1); the complementary case can be treated analogously. Note that, due to the previous two lemmas, at any iteration t of Algorithm 1, Procedure 3 is fed with j = j_t, U = Ũ^t, α = α^t for which it holds that Ũ^t(:, j) ∘ α^t = U^t(:, j) for all j. Now, assume that all previous input values A^s, s ≤ t, of the variable A, and the current inputs A^t_pr, Ũ^t_Σ of the variables A_pr, U_Σ, satisfy the following:

A^s = Σ_{τ=0}^{s−1} α^τ, ∀s ≤ t, (4.7.27)
A^t_pr(w, j) = A^{s_t(w,j)}, (4.7.28)
Ũ^t_Σ(w, j) = Σ_{τ=0}^{s_t(w,j)} U^τ(w, j), (4.7.29)

where 0 ≤ s_t(w, j) ≤ t − 1 is the latest moment s, strictly before t, at which the sampled j_s ∈ [k] coincided with the given j:

s_t(w, j) = arg max_{s≤t−1} { s : j_s = j }. (4.7.30)

Let us show that this invariant is preserved after the call of Procedure 3 — in other words, that (4.7.27)–(4.7.30) hold for t + 1, i.e., for the output values Ũ^{t+1}_Σ, A^{t+1}_pr, A^{t+1} (note that the variables U_Σ, A_pr, A only change within Procedure 3, so their output values are also the input values at the next iteration).
Proof. Indeed, it is clear that (4.7.27) will be preserved (cf. line 4 of Procedure 3). To verify (4.7.28), note that A_pr(w, j) only gets updated when j = j_t (cf. line 3); in this case we will have s_{t+1}(w, j) = t, and otherwise s_{t+1}(w, j) = s_t(w, j), cf. (4.7.30).

Thus, it only remains to verify the validity of (4.7.29) after the update. To this end, note that by (4.7.30) we know that the value Ũ^s(w, j) of the variable Ũ(w, j) remained constant for s_t(w, j) ≤ s < t, and it will not change after the call at the t-th iteration unless j_t = j, that is, unless s_{t+1}(w, j) = t. This is exactly when line 2 is invoked, and it ensures (4.7.29) for t + 1.

Finally, invoking (4.7.27)–(4.7.30) at t = T, we see that line 21 results in the correct final value Σ_{t=0}^{T} U^t of the cumulative sum U_Σ. Thus, the correctness of Algorithm 1 is verified.
4.7.8 Additional remarks on Algorithm 1
Removing the O(dk) complexity term. In fact, the extra term O(dk) in the runtime and memory complexities of Algorithm 1 can easily be avoided. To see this, recall that when solving the simplex-constrained CCSPP (4.2.17), we are foremost interested in solving the ℓ₁-constrained CCSPP (4.2.2), and an ε-accurate solution Ũ = [Ũ₁; Ũ₂] ∈ R^{2d×k} to (4.2.17) yields an ε-accurate solution U = Ũ₁ − Ũ₂ ∈ R^{d×k} to (4.2.2). Recall that we initialize Algorithm 1 with Ũ⁰ = 1_{2d×k} and α⁰ = 1_{2d}, which corresponds to U⁰ = 0_{d×k}. Moreover, at any iteration we change a single column of Ũ^t and scale the whole scaling vector α^t by a constant (in fact, all entries of α^t are always equal to each other; we omitted this fact in the main text to simplify the presentation, since the entries of β generally have different values). Hence, the final candidate solution U^T = Ũ^T₁ − Ũ^T₂ of the ℓ₁-constrained problem will actually have at most O(dT) non-zero entries, corresponding to the entries of Ũ that were changed in the course of the algorithm. To exploit this, we can modify Algorithm 1 as follows:

— Instead of explicitly initializing and storing the whole matrices Ũ, Ũ_Σ, and A_pr, we can hard-code the "default" value Ũ(w, j) = 1 (cf. line 2 of Algorithm 1), and use a bit mask to flag the entries Ũ(w, j) that have already been changed at least once. This mask can be stored as a list Changed of pairs (w, j), i.e., in sparse form.

— When post-processing the cumulative sum Ũ_Σ (see line 21 of Algorithm 1), instead of post-processing all entries of Ũ_Σ, we can process only those in the list Changed and ignore the remaining ones, since the corresponding entries of U^T (a candidate solution to (4.2.2)) will be zero. We can then directly output U^T in sparse form.

It is clear that such a modification of Algorithm 1 replaces the O(dk) term in the runtime complexity with O(dT) (which is always an improvement, since O(d) arithmetic operations are performed anyway in each iteration); moreover, the memory complexity changes from O(dn + nk + dk) to O(dn + nk + d·min(T, k)).
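The first modification — a hard-coded default value plus a record of changed entries — can be sketched as follows (illustrative only; all names are ours):

```python
class LazyMatrix:
    """Matrix with a hard-coded default value; only entries that were changed
    at least once are stored explicitly, i.e. in sparse form."""

    def __init__(self, default=1.0):
        self.default = default
        self.changed = {}            # the "Changed" record: (row, col) -> value

    def get(self, w, j):
        return self.changed.get((w, j), self.default)

    def set(self, w, j, value):
        self.changed[(w, j)] = value

    def changed_entries(self):
        # The only candidates for non-zeros in the final sparse solution.
        return list(self.changed)
```

Post-processing then iterates over `changed_entries()` only, which is what replaces the O(dk) term with O(dT).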
Infeasibility of the noisy dual iterates. Note that when we generate an estimate of the primal gradient X̃^⊤(V^t − Y) according to (Full-SS) or (Part-SS), we also obtain an unbiased estimate of the dual iterate V^t, and vice versa. In the setup with vector variables, Juditsky and Nemirovski [2011b] propose to average such noisy iterates instead of the actual iterates (U^t, V^t), as we do in (S-MD). Averaging the noisy iterates is easier to implement, since they are sparse (one does not need to track the cumulative sums), and one can show similar guarantees for the primal accuracy of their running averages. However, in the case of the dual variable, the noisy counterpart is infeasible (see Juditsky and Nemirovski [2011b, Sec. 2.5.1]); as a result, one loses the guarantee on the duality gap. Hence, we prefer to track the averages of the actual iterates of (S-MD), as we do in Algorithm 1.
Chapter 5
Conclusion and Future Work
In this thesis we considered several parameter estimation problems: an extension of the moment-matching technique for non-linear regression, a special type of averaging for generalized linear models, and sublinear algorithms for multiclass Fenchel–Young losses. We obtained theoretical guarantees for the corresponding convex and saddle-point minimization problems.
In Chapter 2 we considered a general non-linear regression model and the de-pendence on an unknown ๐-dimensional subspace assumption. Our goal was directestimation of this unknown ๐-dimensional space, which is often called the effective di-mension reduction or e.d.r. space. We proposed new approaches (SADE and SPHD),combining two existing techniques (sliced inverse regression (SIR) and score function-based estimation). We obtained consistent estimation for ๐ > 1 only using the first-order score and proposed explicit approaches to learn the score from data.
It would be interesting to extend our sliced extensions to learning neural net-works: indeed, our work focused on the subspace spanned by ๐ค and cannot identifyindividual columns, while Janzamin et al. [2015] used tensor approaches to to estimatecolumns of ๐ค in polynomial time. Another direction is to use smarter score matchingtechniques to estimate score functions from the data: try to learn only relevant di-rections. Indeed, our current approach for learning score functions uses a parametricassumption, i.e., that the score is the linear combination of ๐ basis functions and thisnumber grows with the number of observations. This disadvantage can be avoided ifwe somehow manage to learn scores only in relevant directions.
In Chapter 3 we have explored how averaging procedures in stochastic gradientdescent, which are crucial for fast convergence, could be improved by looking at thespecifics of probabilistic modeling. Namely, averaging in the moment parameteriza-tion can have better properties than averaging in the natural parameterization.
While we provided some theoretical arguments (an asymptotic expansion in the finite-dimensional case and convergence to optimal predictions in the infinite-dimensional case), a detailed theoretical analysis with explicit convergence rates would give a better understanding of the benefits of averaging predictions.
In Chapter 4 we considered Fenchel-Young losses and the corresponding saddle-point problems for multi-class classification. We developed a sublinear algorithm, i.e., one with a per-iteration computational complexity of O(n + d + k), where d is the number of features, n is the sample size, and k is the number of classes.
Though our approach achieves sublinear rates for the classical SVM setup, it would be interesting to investigate efficiency estimates for more general geometries and losses (which is in fact partly done). The multinomial logit, or softmax, loss is widely used in natural language processing (NLP) and recommender systems, where n, d and k are on the order of millions or even billions (Chelba et al. [2013], Partalas et al. [2015]), and an O(n + d + k) approach would be an important development in this area. Although we provided simple experiments in this chapter, larger experiments with more pronounced convergence behavior would better illustrate our approach. We are planning to put the code online, since a full implementation of Algorithm 1 is time-consuming. Another direction of research is to develop the approach with flexible step-sizes, which could be learned from the data as well.
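The reason one iteration can cost O(n + d + k) rather than O(ndk) is importance sampling: a full matrix-vector product is replaced by an unbiased one-coordinate estimate whose cost is a single column. The snippet below illustrates only this sampling idea in isolation (the sizes and the sampling distribution are our choices); it is not Algorithm 1 itself.

```python
import numpy as np

# Sublinear building block: instead of computing A @ w at cost O(n*d),
# sample a coordinate j with probability p_j proportional to |w_j| and use
# the unbiased estimate A[:, j] * w_j / p_j, at cost O(n) per sample.
rng = np.random.default_rng(2)
n, d, n_samples = 200, 300, 20000
A = rng.standard_normal((n, d))
w = rng.random(d)
p = np.abs(w) / np.abs(w).sum()
estimate = np.zeros(n)
for t in range(1, n_samples + 1):
    j = rng.choice(d, p=p)
    estimate += (A[:, j] * w[j] / p[j] - estimate) / t  # running mean of samples

exact = A @ w
print(np.linalg.norm(estimate - exact) / np.linalg.norm(exact))  # small relative error
```

Since E[A[:, j] w_j / p_j] = A @ w, the running mean is a consistent estimator, and each sample touches only one column of A.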
Bibliography
H. Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
G. Arfken. Divergence. In Mathematical Methods for Physicists, chapter 1.7, pages 37–42. Academic Press, Orlando, FL, 1985.
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
D. Babichev and F. Bach. Constant step size stochastic gradient descent for probabilistic modeling. Proceedings in Uncertainty in Artificial Intelligence, pages 219–228, 2018a.
D. Babichev and F. Bach. Slice inverse regression with score functions. Electronic Journal of Statistics, 12(1):1507–1543, 2018b.
F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209, 2013.
F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Adv. NIPS, 2011.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS), 2013.
H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
M. Blondel, A. F. T. Martins, and V. Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. arXiv preprint arXiv:1805.09717, 2018.
A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.
J. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media, 2010.
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. Technical Report 1606.04838, arXiv, 2016.
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
D. R. Brillinger. A generalized linear model with "Gaussian" regressor variables. In P. J. Bickel, K. A. Doksum, and J. L. Hodges, editors, A Festschrift for Erich L. Lehmann. Woodsworth International Group, Belmont, California, 1982.
S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
S. Cambanis, S. Huang, and G. Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM (JACM), 59(5):23, 2012.
R. D. Cook. SAVE: a method for dimension reduction and graphics in regression. Communications in Statistics - Theory and Methods, 29:2109–2121, 2000.
R. D. Cook and H. Lee. Dimension reduction in binary response regression. Journal of the American Statistical Association, 94:1187–1200, 1999.
R. D. Cook and S. Weisberg. Discussion of "Sliced Inverse Regression" by K. C. Li. Journal of the American Statistical Association, 86:328–332, 1991.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
A. S. Dalalyan, A. Juditsky, and V. Spokoiny. A new algorithm for estimating the effective dimension-reduction subspace. Journal of Machine Learning Research, 9:1647–1678, 2008.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Technical Report 1707.06386, arXiv, 2017.
W. F. Donoghue, Jr. Monotone Matrix Functions and Analytic Continuation. Springer, 1974.
N. Duan and K.-C. Li. Slicing regression: a link-free regression method. The Annals of Statistics, 19:505–530, 1991.
J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.
V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871–1905, 2009.
D. Garber and E. Hazan. Approximating semidefinite programs in sublinear time. In Advances in Neural Information Processing Systems, pages 1080–1088, 2011.
D. Garber and E. Hazan. Sublinear time algorithms for approximate semidefinite programming. Mathematical Programming, 158(1-2):329–361, 2016.
W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. CRC Press, 1995.
A. S. Goldberger. Econometric Theory. John Wiley & Sons, 1964.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995.
L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, New York, 2002.
L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.
E. Hazan, T. Koren, and N. Srebro. Beating SGD: Learning SVMs in sublinear time. In Advances in Neural Information Processing Systems, pages 1233–1241, 2011.
N. He, A. Juditsky, and A. Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Optimization and Applications, 61(2):275–319, 2015.
J. M. Hilbe. Negative Binomial Regression. Cambridge University Press, 2011.
J. Hooper. Simultaneous equations and canonical correlation theory. Econometrica, 27:245–256, 1959.
J. L. Horowitz. Semiparametric Methods in Econometrics, volume 131. Springer Science & Business Media, 2012.
M. Hristache, A. Juditsky, and V. Spokoiny. Direct estimation of the index coefficient in a single index model. The Annals of Statistics, 29(3):595–623, 2001.
T. Hsing and R. J. Carroll. An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20(2):1040–1061, 1992.
A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.
M. Janzamin, H. Sedghi, and A. Anandkumar. Score function features for discriminative learning: Matrix and tensor framework. arXiv preprint arXiv:1412.2863, 2014.
M. Janzamin, H. Sedghi, and A. Anandkumar. Generalization bounds for neural networks through tensor factorization. arXiv preprint arXiv:1412.2863, 2015.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. Optimization for Machine Learning, pages 121–148, 2011a.
A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problems structure. Optimization for Machine Learning, pages 149–183, 2011b.
A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
J. H. B. Kemperman. On the optimum rate of transmitting information. In Probability and Information Theory, pages 126–169. Springer, 1969.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. The MIT Press, 2009.
G. M. Korpelevich. Extragradient method for finding saddle points and other problems. Matekon, 13(4):35–49, 1977.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.
G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
L. LeCam. On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. Univ. California Pub. Statist., 1:277–330, 1953.
E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer Science & Business Media, 2006.
K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86:316–327, 1991.
K.-C. Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87:1025–1039, 1992.
K.-C. Li and N. Duan. Regression analysis under link violation. The Annals of Statistics, 17:1009–1052, 1989.
M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
Q. Lin, Z. Zhao, and J. S. Liu. On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2):580–610, 2018.
P. McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.
P. McCullagh and J. A. Nelder. Generalized Linear Models, volume 37. CRC Press, 1989.
A. M. McDonald, M. Pontil, and S. Stamos. Spectral k-support norm regularization. In Advances in Neural Information Processing Systems, 2014.
S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, Berlin; New York, 1993.
J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93(2):273–299, 1965.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
A. Nemirovski, S. Onn, and U. G. Rothblum. Accuracy certificates for computational problems with convex structure. Mathematics of Operations Research, 35(1):52–78, 2010.
A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Chichester, 1983.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Y. Nesterov. Gradient methods for minimizing composite objective function, 2007.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
Y. Nesterov and A. Nemirovski. On first-order algorithms for l1/nuclear norm minimization. Acta Numerica, 22:509–575, 2013.
Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
D. Ostrovskii and Z. Harchaoui. Efficient first-order algorithms for adaptive signal denoising. In Proceedings of the 35th ICML Conference, volume 80, pages 3946–3955, 2018.
B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.
I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras, E. Gaussier, I. Androutsopoulos, M.-R. Amini, and P. Galinari. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
A. Rudi, L. Carratino, and L. Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.
M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Z. Shi, X. Zhang, and Y. Yu. Bregman divergence for stochastic variance reduction: saddle-point and adversarial prediction. In Advances in Neural Information Processing Systems, pages 6031–6041, 2017.
S. Sra. Fast projections onto mixed-norm balls with applications. Data Mining and Knowledge Discovery, 25(2):358–377, 2012.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proc. COLT, 2008.
C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9:1135–1151, 1981.
G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory (Computer Science and Scientific Computing), 1990.
T. M. Stoker. Consistent estimation of scaled coefficients. Econometrica, 54:1461–1481, 1986.
A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
V. Q. Vu and J. Lei. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947, 2013.
H. Wang and Y. Xia. On directional regression for dimension reduction. In J. Amer. Statist. Ass. Citeseer, 2007.
H. Wang and Y. Xia. Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482):811–821, 2008.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1-2):1–157, 2014.
Y. Xia, H. Tong, W. K. Li, and L.-X. Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002a.
Y. Xia, H. Tong, W. K. Li, and L.-X. Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002b.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
L. Xiao, A. W. Yu, Q. Lin, and W. Chen. DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. arXiv preprint arXiv:1710.05080, 2017.
S. S. Yang. General distribution theory of the concomitants of order statistics. The Annals of Statistics, 5:996–1002, 1977.
Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
Y.-L. Yu. The strong convexity of von Neumann's entropy. Unpublished note, June 2013. URL http://www.cs.cmu.edu/~yaoliang/mynotes/sc.pdf.
M. Yuan. On the identifiability of additive index models. Statistica Sinica, 21(4):1901–1911, 2011.
L.-X. Zhu and K. W. Ng. Asymptotics of sliced inverse regression. Statistica Sinica, 5:727–736, 1995.
List of Figures
1-1 Graphical representation of classical loss functions.
1-2 Graphical representation of Lipschitz smoothness and strong convexity.
1-3 Graphical representation of projected gradient descent: first we do a gradient step and then project onto the constraint set X.
2-1 Graphical explanation of SADE: first we divide y into H slices; let p_h be the proportion of the y_i in slice I_h, an empirical estimator of P(y in I_h). Then we evaluate the empirical estimator of E(S_1(x) | y in I_h). Finally, we evaluate the weighted covariance matrix and find its k largest eigenvectors.
2-2 Contour lines of a Gaussian mixture pdf in the 2D case.
2-3 Mean and standard deviation of R^2(E, Ê) divided by 10 for the rational function (2.5.2).
2-4 Mean and standard deviation of R^2(E, Ê) divided by √10 for the rational function (2.5.3) with ρ = 1/4.
2-5 Mean and standard deviation of R^2(E, Ê) divided by √10 for the classification problem (2.5.4).
2-6 Mean and standard deviation of R^2(E, Ê) divided by √10 for the model (2.5.2) with ρ = 1/4.
2-7 Square trace error R(E, Ê) for different sample sizes.
2-8 Quadratic model, d = 10, n = 1000.
2-9 Rational model, d = 10, n = 1000.
2-10 Rational model, d = 20, n = 2000.
2-11 Rational model, d = 50, n = 5000.
2-12 Increasing dimensions, with rational model (2.5.3), n = 10000, k = 10, d = 20 to 150.
3-1 Convergence of the iterates θ_n and the averaged iterates θ̄_n to the mean θ̄_γ under the stationary distribution π_γ.
3-2 Graphical representation of reparametrization: first we expand the class of functions, replacing the parameter θ by a function f(·) = Φ^T(·)θ, and then we perform one more reparametrization: g(·) = a′(f(·)). The best linear prediction g* is constructed using f*, and the global minimizer of the risk is g**. The model is well-specified if and only if g* = g**.
3-3 Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators (in red) vs. averaging predictions (in green). g* is the optimal linear predictor and g** is the global optimum.
3-4 Visualisation of the space of parameters and its transformation to the space of predictions. Averaging estimators (in red) vs. averaging predictions (in green). The global optimizer coincides with the best linear predictor.
3-5 Synthetic data for the linear model f_θ(x) = θ^T x and g**(x) = sin x_1 + sin x_2. Excess prediction performance vs. number of iterations (both in log scale).
3-6 Synthetic data for the linear model f_θ(x) = θ^T x and g**(x) = x_1^3 + x_2^3. Excess prediction performance vs. number of iterations (both in log scale).
3-7 Synthetic data for the Laplacian kernel with g**(x) = 5/(5 + x^T x). Excess prediction performance vs. number of iterations (both in log scale).
3-8 MiniBooNE dataset, dimension d = 50, linear model. Excess prediction performance vs. number of iterations (both in log scale).
3-9 MiniBooNE dataset, dimension d = 50, kernel approach, column sampling m = 200. Excess prediction performance vs. number of iterations (both in log scale).
3-10 CoverType dataset, dimension d = 54, linear model. Excess prediction performance vs. number of iterations (both in log scale).
3-11 CoverType dataset, dimension d = 54, kernel approach, column sampling m = 200. Excess prediction performance vs. number of iterations (both in log scale).
4-1 Depiction of the Full Sampling Scheme (Full-SS).
4-2 Primal accuracy and duality gap (when available) for Algorithm 1, stochastic subgradient method (SSM), and Mirror Prox (MP) with exact gradients, on a synthetic data benchmark.
List of Tables
2.1 Comparison of different methods using score functions.
4.1 Comparison of the potential differences for the norms corresponding to (4.4.1).
4.2 Runtime (in seconds) of Algorithm 1 on synthetic data.