
Journal of Machine Learning Research 18 (2017) 1-46 Submitted 8/14; Revised 10/17; Published 11/17

Generalized Conditional Gradient for Sparse Estimation

Yaoliang Yu [email protected]
School of Computer Science
University of Waterloo
Waterloo, ON, N2L 3G1, Canada

Xinhua Zhang [email protected]
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607

Dale Schuurmans [email protected]

Department of Computing Science

University of Alberta

Edmonton, Alberta T6G 2E8, Canada

Editor: Koby Crammer

Abstract

Sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis, however it also creates significant challenges for efficient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving sparse optimization problems—demonstrating that, with some enhancements, it can provide a more efficient alternative to current state of the art approaches. After studying the convergence properties of GCG for general convex composite problems, we develop efficient methods for evaluating polar operators, a subroutine that is required in each GCG iteration. In particular, we show how the polar operator can be efficiently evaluated in learning low-rank matrices, instantiated with detailed examples on matrix completion and dictionary learning. A further improvement is achieved by interleaving GCG with fixed-rank local subspace optimization. A series of experiments on matrix completion, multi-class classification, and multi-view dictionary learning shows that the proposed method can significantly reduce the training cost of current alternatives.

Keywords: generalized conditional gradient, frank-wolfe, dictionary learning, matrix completion, multi-view learning, sparse estimation

©2017 Yaoliang Yu, Xinhua Zhang, and Dale Schuurmans.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/14-348.html.

1. Introduction

Sparsity is an important concept in high-dimensional statistics (Bühlmann and van de Geer, 2011) and signal processing (Eldar and Kutyniok, 2012), which has led to important application successes by reducing model complexity and improving interpretability of the results. Although it is common to promote sparsity by adding appropriate regularizers, such as the ℓ1 norm, sparsity can also present itself in more sophisticated forms, such as group sparsity (Yuan and Lin, 2006), graph sparsity (Kim and Xing, 2009; Jacob et al., 2009), matrix rank sparsity (Candès and Recht, 2009), and other combinatorial sparsity patterns (Obozinski and Bach, 2012). These recent notions of structured sparsity (Bach et al., 2012; Micchelli et al., 2013) have greatly enhanced the ability to model structural relationships in data; however, they also create significant computational challenges.

A currently popular optimization scheme for training sparse models is accelerated proximal gradient (APG) (Beck and Teboulle, 2009; Nesterov, 2013), which enjoys an optimal rate of convergence among black-box first-order procedures (Nesterov, 2013). An advantage of APG is that each iteration consists solely of computing a proximal update (PU), which for simple regularizers can have a very low complexity. For example, under ℓ1 norm regularization each iterate of APG reduces to a soft-shrinkage operator that can be computed in linear time, which partly explains its popularity for such problems. Unfortunately, for matrix low-rank regularizers the PU soon becomes a computational bottleneck. For example, the trace norm is often used to promote low rank solutions in matrix variable problems such as matrix completion (Candès and Recht, 2009); but here the associated PU requires a full singular value decomposition (SVD) of the gradient matrix on each iteration, which prevents APG from being applied to large problems. Not only does the PU require nontrivial computation for matrix regularizers, it also yields dense intermediate iterates.

To address the primary shortcomings with APG, we investigate the generalized conditional gradient (GCG) strategy for sparse optimization, which has been motivated by the promise of recent sparse approximation methods (Hazan, 2008; Clarkson, 2010). A special case of GCG, known as the conditional gradient (CG), was originally proposed by Frank and Wolfe (1956) and has received significant renewed interest (Bach, 2015; Clarkson, 2010; Freund and Grigas, 2016; Hazan, 2008; Jaggi and Sulovsky, 2010; Jaggi, 2013; Shalev-Shwartz et al., 2010; Tewari et al., 2011; Yuan and Yan, 2013). The key advantage of GCG for sparse estimation is that it need only compute the polar of the regularizer in each iteration, which sometimes admits a much more efficient update than the PU in APG. For example, under trace norm regularization, GCG only requires the spectral norm of the gradient to be computed in each iteration, which is an order of magnitude cheaper than evaluating the full SVD as required by APG. Furthermore, the greedy nature of GCG affords explicit control over the sparsity of intermediate iterates, which is not available in APG. Although existing work on GCG has generally been restricted to constrained optimization (i.e. enforcing an upper bound on the sparsity-inducing regularizer), in this paper we consider the more general regularized version of the problem. Despite their theoretical equivalence, the regularized form allows more efficient local optimization to be interleaved with the primary update, which provides a significant acceleration in practice.

This paper is divided into two major parts: first we present a general treatment of GCG and establish its convergence properties for convex optimization in Section 3, with a special focus on convex gauge regularizers; then we apply the method to important case studies to demonstrate its effectiveness on large-scale, sparse estimation problems in Section 4, with experiments presented in Section 5.

In particular, in the first part, after establishing notation and briefly discussing relevant optimization schemes (Section 2), we present our first main contribution (Section 3): the convergence properties of GCG for general convex composite optimization. The algorithm is then specialized to convex gauge regularizers with refined analysis (Section 3.2). Indeed we prove that, under standard assumptions, GCG converges to the global optimum at the rate of O(1/t)—which is a significant improvement over previous bounds (Dudik et al., 2012). Although the convergence rate for GCG remains inferior to that of APG, these two algorithms demonstrate alternative trade-offs between the number of iterations required and the per-step complexity. As shown in our experiments, GCG can be significantly faster than APG in terms of overall computation time for large problems. The results in Section 3 are new.

Equipped with the technical results of GCG on convex optimization, the second part of the paper then applies the algorithm to the important application of low rank learning (Section 4). Since imposing a hard bound on matrix rank generally leads to an intractable problem, in Section 4.2 we first present a generic relaxation strategy based on convex gauges (Chandrasekaran et al., 2012; Tewari et al., 2011) that yields a convex program (Bach et al., 2008; Bradley and Bagnell, 2009; Zhang et al., 2012). Conveniently, the resulting problem can be easily optimized using the GCG algorithm from Section 3.2. To further reduce computation time, we introduce in Section 4.3 an auxiliary fixed rank subspace optimization within each iteration of GCG. Although similar hybrid approaches have been previously suggested (Burer and Monteiro, 2005; Laue, 2012; Mishra et al., 2013), we propose an efficient new alternative that, instead of locally optimizing a constrained fixed rank problem, optimizes an unconstrained surrogate objective based on a variational representation of matrix norms. This alternative strategy allows a far more efficient local optimization without compromising GCG's convergence properties. In Section 4.4 we show that the approach can be applied to dictionary learning and matrix completion problems, as well as a non-trivial multi-view form of dictionary learning (White et al., 2012).

Finally, in Section 5 we provide an extensive experimental evaluation that compares the performance of GCG (augmented with interleaved local search) to state-of-the-art optimization strategies, across a range of problems, including matrix completion, multi-class classification, and multi-view dictionary learning.

1.1 Extensions over Previously Published Work

Some of the results in this article have appeared in a preliminary form in two conference papers (Zhang et al., 2012; White et al., 2012); however, several specific extensions have been added in this paper, including:

• A more extensive and unified treatment of GCG in general Banach spaces.

• A more refined analysis of the convergence properties of GCG.

• A new application of GCG to multi-view dictionary learning.

• New experimental evaluations on larger data sets, including new results for matrix completion and image multi-class classification.

• Open source Matlab code released at https://www.cs.uic.edu/~zhangx/GCG.

2. Preliminaries

We first establish some of the definitions and notation that will be used throughout the paper, before providing a brief background on the relevant optimization methods that establish a context for discussing GCG.


2.1 Notation and Definitions

Throughout this paper we assume that the underlying space, unless otherwise stated, is a real Banach space B with norm ‖·‖; for example, B could be the Euclidean space Rd with any norm. The dual space (the set of all real-valued continuous linear functions on B) is denoted B∗ and is equipped with the dual norm ‖g‖◦ = sup{g(w) : ‖w‖ ≤ 1}. Note that for a differentiable function f : B → R̄ := R ∪ {∞}, its (Fréchet) derivative ∇f is in the dual space B∗.

We use bold lowercase letters to denote vectors, where the i-th component of a vector w is denoted wi, while wt denotes some other vector. We use the shorthand 〈w,g〉 := g(w) for any g ∈ B∗ and w ∈ B. We call a function f : B → R̄ σ-strongly convex (w.r.t. the norm ‖·‖) if there exists some σ ≥ 0 such that for all 0 ≤ λ ≤ 1 and w, z ∈ B,

\[
f(\lambda \mathbf{w} + (1-\lambda)\mathbf{z}) + \tfrac{\sigma}{2}\,\lambda(1-\lambda)\,\|\mathbf{w}-\mathbf{z}\|^2 \le \lambda f(\mathbf{w}) + (1-\lambda) f(\mathbf{z}). \tag{1}
\]

In the case where σ = 0 we simply say f is convex. A set C ⊆ B is called convex if w, z ∈ C implies λw + (1−λ)z ∈ C for all 0 ≤ λ ≤ 1. It follows from the definition that the domain of a convex function f, denoted as dom f := {w : f(w) < ∞}, is a convex set. The function f is called proper if dom f ≠ ∅, and closed if all sublevel sets {w : f(w) ≤ α} are closed for all α ∈ R. The subdifferential of a convex function f : B → R̄ at w is defined as the set ∂f(w) = {g ∈ B∗ : f(z) ≥ f(w) + 〈z − w, g〉, ∀z ∈ B}.

Key to our subsequent development are two particular functions associated with a nonempty set in B: the indicator function

\[
\iota_C(\mathbf{w}) = \begin{cases} 0, & \text{if } \mathbf{w} \in C, \\ \infty, & \text{otherwise,} \end{cases} \tag{2}
\]

and the gauge function

\[
\kappa(\mathbf{w}) = \kappa_A(\mathbf{w}) := \inf\{\rho \ge 0 : \mathbf{w} \in \rho A\}, \tag{3}
\]

where ρA := {ρw : w ∈ A}, and the infimum is taken to be infinity over an empty set (i.e. when the condition w ∈ ρA is not satisfied for any ρ ≥ 0). Intuitively, κ(w) is the smallest rescaling of the set A to "barely" contain w; see Figure 1. Note that we always have κ(0) = 0, and the gauge function is always positively homogeneous (i.e., κ(tw) = tκ(w) for all t ≥ 0). Let B = Bκ := {w : κ(w) ≤ 1} denote the "unit ball" of κ, and then κB = κA. Bκ is obviously intrinsic to κ, while different choices of A may yield the same gauge κA. Clearly, the indicator function ιC is convex iff C is convex, while the gauge function κ is convex iff its unit ball is convex. Moreover, the gauge κ is a semi-norm iff its unit ball B is convex and symmetric (i.e. B = −B). Specifically, all semi-norms are convex gauge functions.

In practice, it is natural to engineer κA from a general "atomic set" A, which does not have to be convex (and can even be discrete) or contain the origin. Examples include the set of permutation matrices (Chandrasekaran et al., 2012), and the set of xx⊤ with ‖x‖2 = 1 and xi ≥ 0 for low-rank nonnegative matrix factorization (x⊤ is the transpose of x). In this paper, we will work only with convex closed gauge functions, for which a sufficient but not necessary condition is that A is convex, closed, and containing the origin (an assumption made by many textbooks). For practical flexibility, we intentionally avoid making the latter assumptions in the first place, and the resulting technical complication is minimal.

Figure 1: Defining a convex gauge via the Minkowski functional of a convex set C.

A related function that underpins our algorithm is the polar of a gauge function κA, defined for all g ∈ B∗ as

\begin{align}
\kappa^\circ(\mathbf{g}) = \kappa^\circ_A(\mathbf{g}) &:= \inf\{\mu \ge 0 : \langle \mathbf{w}, \mathbf{g} \rangle \le \mu\,\kappa(\mathbf{w}),\ \forall \mathbf{w} \in B\} \tag{4}\\
&= \sup_{\mathbf{w} \in B_\kappa} \langle \mathbf{w}, \mathbf{g} \rangle = \sup_{\mathbf{w} \in A \cup \{0\}} \langle \mathbf{w}, \mathbf{g} \rangle. \tag{5}
\end{align}

As the notation suggests, the polar of a norm (a bona fide convex gauge) is its dual norm.

Lastly, a differentiable function ℓ : B → R is said to be L-smooth (w.r.t. the norm ‖·‖) if for all w, z ∈ B,

\[
\ell(\mathbf{w}) \le \tilde{\ell}_L(\mathbf{w}; \mathbf{z}) := \ell(\mathbf{z}) + \langle \mathbf{w} - \mathbf{z}, \nabla\ell(\mathbf{z}) \rangle + \tfrac{L}{2}\|\mathbf{w} - \mathbf{z}\|^2, \tag{6}
\]

i.e., ℓ is upper bounded by the quadratic approximation around z. It is well-known that if the gradient ∇ℓ is L-Lipschitz continuous, i.e., ‖∇ℓ(w) − ∇ℓ(z)‖◦ ≤ L‖w − z‖ for all w, z, then ℓ is L-smooth. The converse is also true if ℓ is additionally convex. We will make repeated use of this quadratic upper bound in our analysis below.

2.2 Composite Minimization Problem

Many machine learning problems, particularly those formulated in terms of regularized risk minimization, can be cast as

\[
\inf_{\mathbf{w} \in B} F(\mathbf{w}), \quad \text{such that } F(\mathbf{w}) := \ell(\mathbf{w}) + f(\mathbf{w}), \tag{7}
\]

where f is convex and ℓ is convex and continuously differentiable. Typically, ℓ is a loss function and f is a regularizer, although their roles can be reversed in some common examples, such as support vector machines and Lasso.

Due to the importance of this problem in machine learning, significant effort has been devoted to designing efficient algorithms for solving (7). Standard methods are generally based on computing an update sequence {wt} that converges to a global minimizer of F when F is convex, or converges to a stationary point otherwise. A popular example is the mirror descent (MD) algorithm (Beck and Teboulle, 2003), which takes an arbitrary subgradient gt ∈ ∂F(wt) and successively updates by

\[
\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \Big\{ \langle \mathbf{w}, \mathbf{g}_t \rangle + \tfrac{1}{2\eta_t} D(\mathbf{w}, \mathbf{w}_t) \Big\} \tag{8}
\]

with a suitable step size ηt ≥ 0; here D(w, z) := d(w) − d(z) − 〈w − z, ∇d(z)〉 denotes a Bregman divergence induced by some differentiable 1-strongly convex function d : B → R. Under mild assumptions, MD converges (in terms of the function value) at a rate of O(1/√t) (Beck and Teboulle, 2003). However, this algorithm ignores the composite structure in (7) and can be impractically slow for many applications.

Another widely used algorithm is the proximal gradient (PG) (e.g. Fukushima and Mine, 1981), also known as the forward-backward splitting procedure, which iteratively performs the proximal update (PU)

\[
\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \Big\{ \langle \mathbf{w}, \nabla\ell(\mathbf{w}_t) \rangle + f(\mathbf{w}) + \tfrac{1}{2\eta_t} D(\mathbf{w}, \mathbf{w}_t) \Big\}. \tag{9}
\]

PU differs from MD updates in that it does not linearize the regularizer f. The downside is that (9) is usually harder to solve than (8) for each iteration, with an upside that a faster O(1/t) rate of convergence can be achieved (Tseng, 2010). This rate can be further improved to O(1/t²) with the extrapolation variant known as accelerated proximal gradient (APG) (Beck and Teboulle, 2009; Nesterov, 2013). In particular, suppose B = Rd, D(w, z) = (1/2)‖w − z‖₂², and f(w) = ‖w‖₁, where the ℓp norm is defined as ‖w‖_p := (∑_{i=1}^d |w_i|^p)^{1/p} for p ≥ 1. Then PU recovers the soft-shrinkage operator, which plays an important role in sparse estimation:

\[
\mathbf{z}_t = \mathbf{w}_t - \eta_t \nabla\ell(\mathbf{w}_t), \quad \text{and} \quad \mathbf{w}_{t+1} = (1 - \eta_t/|\mathbf{z}_t|)_+\, \mathbf{z}_t. \tag{10}
\]
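As a concrete illustration (not taken from the paper's released code), a minimal NumPy sketch of one PG iteration with the soft-shrinkage update (10), instantiated for a toy least-squares loss, could look as follows:

```python
import numpy as np

def soft_shrink(z, tau):
    # Componentwise soft-thresholding: (1 - tau/|z|)_+ * z = sign(z) * max(|z| - tau, 0).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pg_step_l1(w, grad, eta):
    # One proximal-gradient step for f(w) = ||w||_1, cf. (10):
    # a gradient step followed by soft-shrinkage with threshold eta.
    z = w - eta * grad
    return soft_shrink(z, eta)

# Toy example: least-squares loss l(w) = 0.5 * ||A w - b||^2 (A, b are placeholders).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
w = np.zeros(50)
eta = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L for this loss
for _ in range(100):
    w = pg_step_l1(w, A.T @ (A @ w - b), eta)
```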

Here the algebraic operations in the latter formula are performed componentwise. Similarly, when B = Rm×n, D(W, Z) = (1/2)‖W − Z‖²F (Frobenius norm) and f(W) = ‖W‖tr (trace norm, i.e. sum of singular values), one recovers the matrix analogue of soft-shrinkage used for matrix completion (Candès and Recht, 2009): here the same gradient update of Zt in (10) is applied, but the shrinkage operator on Wt+1 in (10) is applied to the singular values of Zt while keeping the singular vectors intact. Unfortunately, such a PU can become prohibitively expensive for large matrices, since it requires a full singular value decomposition (SVD) at each iteration, at a cubic time cost. We will see in Section 3.2 that an alternative algorithm only requires computing the dual of the trace norm (namely the spectral norm ‖·‖sp, i.e. the greatest singular value), which requires only quadratic time per iteration.
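To make the cost comparison concrete, the following sketch (an illustration under the assumption of dense NumPy matrices, not code from the paper) contrasts the trace-norm PU, which needs a full SVD, with the polar operation, which only needs the top singular pair and can be approximated by power iteration:

```python
import numpy as np

def trace_norm_prox(Z, tau):
    # PU for the trace norm: soft-threshold the singular values of Z (full SVD, cubic cost).
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def spectral_polar(G, n_iter=50):
    # Polar of the trace norm: top singular pair of G via power iteration (roughly quadratic cost).
    v = np.random.default_rng(0).standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ G @ v          # approximately the spectral norm ||G||_sp
    return sigma, u, v         # -u v^T then (approximately) maximizes <a, -G> over unit rank-one atoms
```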

3. Generalized Conditional Gradient for Convex Optimization

In this section we present the generalized conditional gradient (GCG) algorithm, which has recently been receiving renewed attention (Bredies and Lorenz, 2008; Hazan, 2008; Jaggi, 2013; Shalev-Shwartz et al., 2010; Yuan and Yan, 2013; Zhang et al., 2012). Our goal in this section will be to develop efficient solution methods for (7), where f is convex and ℓ is convex and L-smooth (cf. (6)). In particular, we will refine our algorithms to convex gauge regularized problems where special computational savings can be achieved through its polar. These are clearly among the most important settings in practice, and after some enhancements the algorithm and convergence guarantees we establish in this section will be applied to dictionary learning in Section 4.

3.1 General Setting with Convex and Smooth ℓ

We first develop the framework of the GCG algorithm and its analysis by focusing on general convex functions f. Formally, we make the following assumption throughout this Section 3.

Assumption 1 ℓ is convex and L-smooth, and f is convex.

Protocol: To avoid unnecessary pathologies, we tacitly assume all functions considered in this work are proper and closed.

To derive the optimization algorithm, first observe that given the above assumption, F = ℓ + f is convex and so any w ∈ B is globally optimal for (7) if, and only if,

\[
0 \in \partial F(\mathbf{w}) = \nabla\ell(\mathbf{w}) + \partial f(\mathbf{w}). \tag{11}
\]

Let f∗(g) := sup_w 〈w,g〉 − f(w) denote the Fenchel conjugate of f, and using the fact that g ∈ ∂f(w) iff w ∈ ∂f∗(g) when f is closed (Borwein and Vanderwerff, 2010), one can observe that the necessary and sufficient condition for w to be globally optimal can also be expressed as

\[
(11) \iff \mathbf{w} \in \partial f^*(-\nabla\ell(\mathbf{w})) \iff \mathbf{w} \in (1-\eta)\mathbf{w} + \eta\,\partial f^*(-\nabla\ell(\mathbf{w})), \tag{12}
\]

where η ∈ (0, 1). Then, by the definition of f∗, we have

\[
\mathbf{d} \in \partial f^*(-\nabla\ell(\mathbf{w})) \iff \mathbf{d} \in \arg\min_{\mathbf{d}} \{\langle \mathbf{d}, \nabla\ell(\mathbf{w}) \rangle + f(\mathbf{d})\}. \tag{13}
\]

Thus, the condition for a global minimum can be characterized by the fixed-point inclusion

\[
\mathbf{w} \in (1-\eta)\mathbf{w} + \eta\,\mathbf{d} \quad \text{for some } \mathbf{d} \in \arg\min_{\mathbf{d}} \{\langle \mathbf{d}, \nabla\ell(\mathbf{w}) \rangle + f(\mathbf{d})\}. \tag{14}
\]

This particular fixed-point condition immediately suggests an update that provides the foundation for the generalized conditional gradient algorithm (GCG) outlined in Algorithm 1. Each iteration of this procedure involves linearizing the smooth loss ℓ, solving the subproblem (13), selecting a step size, taking a convex combination, then conducting some form of local improvement. The GCG update based on (14) naturally generalizes the original conditional gradient update, which was first studied by Frank and Wolfe (1956) for the case when f = ιC for a polyhedral set C.

Algorithm 1 Generalized Conditional Gradient (GCG).
1: Initialize: w0 ∈ dom f.
2: for t = 0, 1, . . . do
3:   Compute dt ∈ arg min_d {〈d, gt〉 + f(d)}, where gt = ∇ℓ(wt).
4:   Choose step size ηt ∈ [0, 1] and perform update w̄t+1 = (1 − ηt)wt + ηt dt.
5:   wt+1 = Improve(w̄t+1, ℓ, f)   ▷ Subroutine, see Definition 1
6: end for

Note that the subproblem (13) shares some similarity with the proximal update (9): both choose to leave the potentially nonsmooth function f intact, but here the smooth loss ℓ is replaced by its plain linearization rather than its linearization plus a (strictly convex) proximal regularizer such as the quadratic upper bound ℓ̃_L(·) in (6). Consequently, there might be multiple solutions, in which case we simply adopt any one, or no solution (e.g. divergence), in which case we will impose extra assumptions to ensure boundedness (details below, especially Assumption 2 and Proposition 2).
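For concreteness, here is a minimal Python sketch of Algorithm 1 with the open-loop step size ηt = 2/(t + 2) analyzed below; `polar_subproblem` and `improve` stand for user-supplied routines (the subproblem solver of Line 3 and an optional Improve step) and are assumptions of this sketch rather than code from the paper.

```python
def gcg(w0, grad_loss, polar_subproblem, improve=None, num_iters=100):
    """Sketch of generalized conditional gradient (Algorithm 1).

    grad_loss(w)        -> gradient of the smooth loss l at w (a NumPy array).
    polar_subproblem(g) -> some d in argmin_d <d, g> + f(d)   (Line 3).
    improve(w)          -> locally improved iterate (Line 5); identity if None.
    """
    w = w0
    for t in range(num_iters):
        g = grad_loss(w)
        d = polar_subproblem(g)
        eta = 2.0 / (t + 2.0)             # open-loop step size rule
        w = (1.0 - eta) * w + eta * d     # convex combination (Line 4)
        if improve is not None:
            w = improve(w)                # heuristic local improvement (Line 5)
    return w
```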

An important component of Algorithm 1 is the final step, Improve, in Line 5, which is particularly important in practice: it allows for heuristic local improvement to be conducted on the iterates without sacrificing any theoretical convergence properties of the overall procedure (see Section 4.3). A precursor of Improve appeared in (Meyer, 1974, Algorithm 2). Since Improve has access to both ℓ and f it can be very powerful in principle—we will consider the following variants that have respective consequences for practice and subsequent analysis.

Definition 1 The subroutine Improve in Algorithm 1 is called Null if for all t, wt+1 = w̄t+1; Descent if F(wt+1) ≤ F(w̄t+1); Monotone if F(wt+1) ≤ F(wt); and Relaxed if

\[
F(\mathbf{w}_{t+1}) \le \tilde{\ell}_L(\bar{\mathbf{w}}_{t+1}; \mathbf{w}_t) + (1-\eta_t) f(\mathbf{w}_t) + \eta_t f(\mathbf{d}_t). \tag{15}
\]

Obviously, Relaxed can be easily morphed to further satisfy Descent and/or Monotone by comparing F(wt+1) with F(w̄t+1) and/or F(wt), whereas Null and Descent must be Relaxed (since ℓ is L-smooth).

Our main goal remains to establish that Algorithm 1 indeed converges to a global optimum under general conditions given in Assumption 1. In order to address the aforementioned issue of bounding dt, we next introduce one more assumption.

Assumption 2 The sequences {dt} and {wt} generated by the algorithm are bounded for any choice of ηt ∈ [0, 1].

Although this assumption is more stringent, it can fortunately be achieved under various conditions, which we summarize as follows. The proof is given in Appendix A.1.

Proposition 2 Let Assumption 1 hold. Then Assumption 2 is satisfied if any of the following holds.

(a) The subroutine Improve is Monotone; the sublevel set {w ∈ dom f : F(w) ≤ F(w0)} is compact; and f is cofinite, i.e., its Fenchel conjugate f∗ has full domain.

(b) The subroutine Improve is Monotone; the sublevel set {w ∈ dom f : F(w) ≤ F(w0)} is bounded; and f is super-coercive, i.e., f(w)/‖w‖ → ∞ as ‖w‖ → ∞.

(c) dom f is bounded.


In particular, under any of the conditions in Proposition 2, Line 3 of Algorithm 1 is well-defined. Note that if B = Rd, then condition (a) is in fact equivalent to condition (b). On the other hand, if f is a convex gauge, then none of the three conditions above holds, and we will deal with this in Section 3.2.

Finally, for each w ∈ B we define a quantity, referred to as the duality gap, that will be useful in understanding the convergence properties of GCG:

\begin{align}
G(\mathbf{w}) &:= F(\mathbf{w}) - \inf_{\mathbf{d}}\{\ell(\mathbf{w}) + \langle \mathbf{d}-\mathbf{w}, \nabla\ell(\mathbf{w})\rangle + f(\mathbf{d})\} && \text{(linearizing } \ell \text{ at } \mathbf{w}\text{)} \tag{16}\\
&= \langle \mathbf{w}, \nabla\ell(\mathbf{w})\rangle + f(\mathbf{w}) - \inf_{\mathbf{d}}\{\langle \mathbf{d}, \nabla\ell(\mathbf{w})\rangle + f(\mathbf{d})\} \nonumber\\
&= \langle \mathbf{w}, \nabla\ell(\mathbf{w})\rangle + f(\mathbf{w}) + f^*(-\nabla\ell(\mathbf{w})). && \text{(definition of Fenchel dual)} \tag{17}
\end{align}

Since f and ℓ are convex in Assumption 1, the duality gap provides an upper bound on the suboptimality of any search point, as established in the following proposition.

Proposition 3 Let Assumption 1 hold. Then G(w) ≥ F(w) − inf_d F(d) ≥ 0 for all w ∈ B. Furthermore, G(w) = 0 iff w is globally optimal, i.e. w satisfies the condition (11).

Proof As ℓ is convex, (16) implies G(w) ≥ F(w) − inf_d{ℓ(d) + f(d)} = F(w) − inf_d F(d) ≥ 0. By the Fenchel-Young inequality, (17) implies G(w) = 0 iff w ∈ ∂f∗(−∇ℓ(w)), i.e. (11) holds.

Since G(wt) is an upper bound on the suboptimality of wt, the duality gap gives a natural stopping criterion for (7) that can be easily computed in each iteration of Algorithm 1.
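For the classical constrained special case f = ιC, the gap (16) reduces to the Frank-Wolfe gap 〈w − dt, ∇ℓ(w)〉, which is available essentially for free once the subproblem in Line 3 has been solved. A minimal NumPy sketch for C being an ℓ1-ball of radius tau (the radius and data are illustrative assumptions):

```python
import numpy as np

def l1_ball_lmo(g, tau):
    # Linear minimization oracle for C = {d : ||d||_1 <= tau}:
    # <d, g> is minimized by putting -tau * sign(g_i) on a largest-|g| coordinate.
    i = int(np.argmax(np.abs(g)))
    d = np.zeros_like(g)
    d[i] = -tau * np.sign(g[i])
    return d

def fw_duality_gap(w, g, tau):
    # G(w) = <w - d, g> for f = indicator of C, assuming w is feasible (||w||_1 <= tau).
    return float((w - l1_ball_lmo(g, tau)) @ g)
```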

We are now ready for the first convergence result regarding Algorithm 1.

Theorem 4 Let Assumptions 1 and 2 hold. Also assume the subroutine is Relaxed, and the subproblem (13) is solved up to some additive error εt ≥ 0. Then, for any w ∈ dom F and t ≥ 0, Algorithm 1 yields

\[
F(\mathbf{w}_{t+1}) \le F(\mathbf{w}) + \pi_t(1-\eta_0)\big(F(\mathbf{w}_0) - F(\mathbf{w})\big) + \sum_{s=0}^{t} \frac{\pi_t}{\pi_s}\,\eta_s^2\Big(\varepsilon_s/\eta_s + \tfrac{L}{2}\|\mathbf{d}_s - \mathbf{w}_s\|^2\Big), \tag{18}
\]

where π_t := ∏_{s=1}^{t}(1 − η_s) with π_0 = 1. Furthermore, for all t ≥ k ≥ 0, the minimal duality gap G_k^t := min_{k≤s≤t} G(w_s) satisfies

\[
G_k^t \le \frac{1}{\sum_{s=k}^{t}\eta_s}\Big[F(\mathbf{w}_k) - F(\mathbf{w}_{t+1}) + \sum_{s=k}^{t}\eta_s^2\Big(\varepsilon_s/\eta_s + \tfrac{L}{2}\|\mathbf{d}_s - \mathbf{w}_s\|^2\Big)\Big]. \tag{19}
\]

From Theorem 4 we derive the following concrete rates of convergence.

Corollary 5 Under the same assumptions as Theorem 4, also let ηt = 2/(t+2), εt ≤ δηt/2 for some δ ≥ 0, and L_F := sup_t L‖dt − wt‖². Then, for all t ≥ 1 and w, Algorithm 1 yields

\[
F(\mathbf{w}_t) \le F(\mathbf{w}) + \frac{2(\delta + L_F)}{t+3}, \quad \text{and} \quad G_1^t \le \frac{3(\delta + L_F)}{t \ln 2} \le \frac{4.5(\delta + L_F)}{t}. \tag{20}
\]

The proofs of Theorem 4 and Corollary 5 can be found in Appendix A.2 and Appendix A.3 respectively. Note that the simple step size rule ηt = 2/(t + 2) already leads to an O(1/t) bound on the objective value attained. Of course, it is possible to use other step size rules. For instance, both ηs = 1/(s + 1) and the constant rule ηs = 1 − (t + 1)^{−1/t} lead to an O((1 + log t)/(t + 1)) rate; see (Freund and Grigas, 2016) for detailed calculations.

We emphasize that GCG with the open-loop step size rule ηt = 2/(t + 2) requires neither the Lipschitz constant L nor the norm ‖·‖ to be known. We can therefore freely adopt the best choices for the problem at hand in the analysis. Note that, even though the rate does not depend on the initial point w0 provided η0 = 1, by letting η0 ≠ 1 the bound can be slightly improved (Freund and Grigas, 2016).

Remark 6 The observation that an O(1/t) rate can be obtained by an open-loop step size rule such as ηt = O(1/t) appears to have been first made by Dunn and Harshbarger (1978). The same rate can be achieved by the closed-loop step size rule where ηt is selected to minimize ℓ((1 − ηt)wt + ηt dt) or its quadratic upper bound ℓ̃_L((1 − ηt)wt + ηt dt; wt) (Frank and Wolfe, 1956; Levitin and Polyak, 1966; Dem'yanov and Rubinov, 1967). A similar rate on the minimal duality gap was also given in (Clarkson, 2010), and extended by (Jaggi, 2013). However, all these previous works focused on the special case where f = ιC for some compact set C. Fukushima and Mine (1981) were the first to consider a general convex function f (and possibly a nonconvex ℓ), but they only established convergence for the algorithm. Bredies and Lorenz (2008) proved the O(1/t) rate by using a closed-loop step size rule. More recently, Bach (2015) considered the special case where f is strongly convex, and identified the equivalence between GCG and MD, hence also establishing the O(1/t) rate for this special case.

3.2 Improved Algorithm and Refined Analysis when f is a Convex Gauge

Although the results in Theorem 4 and Corollary 5 hold for general convex functions f, they require {dt} and {wt} to be bounded as in Assumption 2. Some sufficient conditions were provided by Proposition 2. Unfortunately, these conditions cannot be satisfied by an important class of regularizers in machine learning: norms, semi-norms, and more generally convex gauge functions, cf. Section 2 for the detailed definition. Examples include the ℓp norms (p ≥ 1), the total variation semi-norm, the trace norm for matrices, etc. Indeed, the subproblem (13) in Line 3 of Algorithm 1 might diverge to −∞, leaving the update undefined. In this subsection, we develop a specialized form of GCG that circumvents this problem and exploits properties of convex gauges to significantly reduce computational overhead. The resulting GCG formulation, which is one of our main contributions, will be used to achieve efficient algorithms in important case studies in the second part of the paper, such as matrix completion under trace norm regularization (Section 4.2.1 below).

In the literature, two main approaches have been used to bypass the unboundedness in solving (13); unfortunately, both remain unsatisfactory in different ways. First, one obvious way to restore boundedness to (13) is to reformulate the regularized problem (7) in its equivalent constrained form

\[
\inf_{\mathbf{w}}\ \ell(\mathbf{w}) \quad \text{s.t.}\ f(\mathbf{w}) \le \zeta, \tag{21}
\]

where it is well-known that a correspondence between (7) and (21) can be achieved if the constant ζ is chosen appropriately. In this case, Algorithm 1 can be directly applied as long as (21) has a bounded domain. Much recent work in machine learning has investigated this variant (Hazan, 2008; Jaggi and Sulovsky, 2010; Jaggi, 2013; Shalev-Shwartz et al., 2010; Tewari et al., 2011; Yuan and Yan, 2013). However, finding the appropriate value of ζ is costly in general, and often one cannot avoid considering the penalized formulation (e.g. when it is nested in another problem). Furthermore, the constraints in (21) preclude the application of many efficient local Improve techniques designed for unconstrained problems.

A second, alternative, approach is to square f, which ensures that it becomes super-coercive (Bradley and Bagnell, 2009). However, this modification is somewhat arbitrary, in the sense that super-coerciveness can be achieved by raising the norm to the pth power for any p > 1 (when B = Rd). Moreover, the insertion of a local Improve step in such an approach requires the regularizer f to be evaluated at all iterates, which is expensive for applications such as matrix completion.

3.2.1 Generalized Convex Gauge Regularization

To address the issue of ensuring well defined iterates, while also reducing their cost, we develop an alternative modification of GCG. The algorithm we develop also allows a mild generalization of convex gauge regularization by introducing a composition with an increasing convex function. In particular, let κ denote a convex gauge induced by a convex, closed set C containing the origin (cf. (3)) and let h : R+ → R denote an increasing convex function over R+ := [0,∞]. Then regularizing by f(·) = h(κ(·)) yields the particular form of (7) that we will address with this modified approach:

\[
\inf_{\mathbf{w}}\ \ell(\mathbf{w}) + h(\kappa(\mathbf{w})). \tag{22}
\]

For the structured sparse regularizers typically considered in machine learning, the set C used to construct the gauge function κ via (3) has additional structure that admits efficient computation (Chandrasekaran et al., 2012). In particular, C is often constructed by taking the convex hull of some compact set of "atoms" A specific to the problem; i.e., C = conv A. Such a structured formulation allows one to re-express the gauge as the "ℓ1 norm" over an appropriate "atomic decomposition." In detail:

\begin{align}
\kappa_C(\mathbf{w}) &= \inf\Big\{\sum_{i=1}^{r}\lambda_i : \mathbf{w} = \sum_{i=1}^{r}\lambda_i \mathbf{a}_i,\ \mathbf{a}_i \in A,\ \lambda_i \ge 0\Big\} \tag{23}\\
\kappa^\circ_C(\mathbf{g}) &= \sup_{\mathbf{w}\in A\cup\{0\}} \langle \mathbf{w}, \mathbf{g}\rangle. \tag{24}
\end{align}

That is, if we decompose w as a conic combination of the "atoms" in A, then κ(w) is the least ℓ1 norm of the decomposition coefficients. Determining the convex gauge κC directly from the set C is not always easy, and the following duality result may come to our aid (Rockafellar and Wets, 1998, p. 491):

\[
C \text{ is closed convex and } 0 \in C \implies \kappa_C = (\kappa^\circ_C)^\circ. \tag{25}
\]

Since A is compact, the polar κ◦C has full domain. Intuitively, κ◦ has a much simpler form than κ if the structure of A is "simple." In such cases, it is advantageous to design an optimization algorithm that circumvents evaluation of κ, by instead using κ◦ which is generally far less expensive to evaluate. This will be the underpinning motivation of our algorithm design.
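Two standard instantiations, included here as an illustrative aside, make (23)-(25) concrete:

\[
A = \{\pm\mathbf{e}_1,\dots,\pm\mathbf{e}_d\} \subset \mathbb{R}^d: \quad \kappa_C(\mathbf{w}) = \|\mathbf{w}\|_1, \qquad \kappa^\circ_C(\mathbf{g}) = \max_i |g_i| = \|\mathbf{g}\|_\infty;
\]
\[
A = \{\mathbf{u}\mathbf{v}^\top : \|\mathbf{u}\|_2 = \|\mathbf{v}\|_2 = 1\} \subset \mathbb{R}^{n\times m}: \quad \kappa_C(W) = \|W\|_{\mathrm{tr}}, \qquad \kappa^\circ_C(G) = \|G\|_{\mathrm{sp}}.
\]

For the trace norm in particular, the polar (a single spectral norm computation) is far cheaper than the gauge itself (a full SVD), which is exactly the situation the algorithm developed next is designed to exploit.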

3.2.2 Modified GCG Algorithm for Generalized Gauge Regularization

The key to the modified algorithm is to avoid the possibly unbounded solution to (13) while also avoiding explicit evaluation of the gauge κ. First, to avoid unboundedness, we adopt the technique of moving the regularizer to the constraint, as in (21), but rather than applying the modification directly to (22) we only use it in the subproblem (13) via:

\[
\mathbf{d}_t \in \arg\min_{\mathbf{d}: h(\kappa(\mathbf{d})) \le \zeta} \langle \mathbf{d}, \nabla\ell(\mathbf{w}_t)\rangle. \tag{26}
\]

Then, to bypass the complication of dealing with h and the unknown bound ζ, we decompose dt into its normalized direction at and scale θt (i.e., such that dt = at θt/ηt), determining each separately. In particular, from this decomposition, the update from Line 4 of Algorithm 1 can be equivalently expressed as

\[
\bar{\mathbf{w}}_{t+1} = (1-\eta_t)\mathbf{w}_t + \eta_t\mathbf{d}_t = (1-\eta_t)\mathbf{w}_t + \theta_t\mathbf{a}_t, \tag{27}
\]

hence the modified algorithm need only work with at and θt directly.

Determining the direction. First, the normalized direction, at, can be recovered via

\begin{align}
\mathbf{a}_t \in \arg\min_{\mathbf{a}: \kappa(\mathbf{a})\le 1} \langle \mathbf{a}, \nabla\ell(\mathbf{w}_t)\rangle &\iff \mathbf{a}_t \in \arg\min_{\mathbf{a}\in A\cup\{0\}} \langle \mathbf{a}, \nabla\ell(\mathbf{w}_t)\rangle \quad \text{(by (4))} \tag{28}\\
&\iff \mathbf{a}_t \in \arg\max_{\mathbf{a}\in A\cup\{0\}} \langle \mathbf{a}, -\nabla\ell(\mathbf{w}_t)\rangle. \tag{29}
\end{align}

Observe that (29) effectively involves computation of the polar κ◦(−∇ℓ(w)) only, by (4). Since this optimization might still be challenging, we further allow it to be solved only approximately to within an additive error εt ≥ 0 and a multiplicative factor αt ∈ (0, 1]; that is, by relaxing (28) the modified algorithm only requires an at ∈ A to be found that satisfies

\[
\langle \mathbf{a}_t, \nabla\ell(\mathbf{w}_t)\rangle \le \alpha_t\Big(\varepsilon_t + \min_{\mathbf{a}\in A\cup\{0\}} \langle \mathbf{a}, \nabla\ell(\mathbf{w}_t)\rangle\Big) = \alpha_t\big(\varepsilon_t - \kappa^\circ(-\nabla\ell(\mathbf{w}_t))\big). \tag{30}
\]

Intuitively, this formulation computes the normalized direction that has (approximately) the greatest negative "correlation" with the gradient ∇ℓ, yielding the steepest local decrease in ℓ. Importantly, this formulation does not require evaluation of the gauge function κ.

Determining the scale. Next, to choose the scale θt, it is natural to consider minimizing a simple upper bound on the objective that is obtained by replacing ℓ(·) with its quadratic upper approximation ℓ̃_L(·; wt) given in (6):

\[
\min_{\theta\ge 0}\ \tilde{\ell}_L((1-\eta_t)\mathbf{w}_t + \theta\mathbf{a}_t; \mathbf{w}_t) + h(\kappa((1-\eta_t)\mathbf{w}_t + \theta\mathbf{a}_t)). \tag{31}
\]

Unfortunately, κ still participates in (31), so a further approximation is required. Here is where the decomposition of dt into at and θt is particularly advantageous. Observe that

\begin{align}
h(\kappa((1-\eta_t)\mathbf{w}_t + \theta\mathbf{a}_t)) &\le h((1-\eta_t)\kappa(\mathbf{w}_t) + \eta_t\,\kappa(\theta\mathbf{a}_t/\eta_t)) \nonumber\\
&\le (1-\eta_t)\,h(\kappa(\mathbf{w}_t)) + \eta_t\,h(\kappa(\theta\mathbf{a}_t/\eta_t)) \tag{32}\\
&\le (1-\eta_t)\,h(\kappa(\mathbf{w}_t)) + \eta_t\,h(\theta/\eta_t), \tag{33}
\end{align}

where the first inequality follows from the convexity of κ and the fact that h is increasing, the second inequality follows from the convexity of h, and (33) follows from the fact that κ(at) ≤ 1 by construction (at ∈ A in (4)). Although κ still participates in (33), we can now bypass evaluation of κ by maintaining a simple upper bound ρt ≥ κ(wt), yielding the relaxed update

\[
\theta_t = \arg\min_{\theta\ge 0}\ \Big\{\tilde{\ell}_L((1-\eta_t)\mathbf{w}_t + \theta\mathbf{a}_t; \mathbf{w}_t) + (1-\eta_t)\,h(\rho_t) + \eta_t\,h(\theta/\eta_t)\Big\}. \tag{34}
\]

Crucially, the upper bound ρt ≥ κ(wt) can be maintained by the simple update ρ̄t+1 = (1 − ηt)ρt + θt, which is sufficient to ensure that ρ̄t+1 upper bounds κ(w̄t+1):

\[
\bar{\rho}_{t+1} = (1-\eta_t)\rho_t + \theta_t \ge (1-\eta_t)\kappa(\mathbf{w}_t) + \theta_t\,\kappa(\mathbf{a}_t) \ge \kappa((1-\eta_t)\mathbf{w}_t + \theta_t\mathbf{a}_t) \ge \kappa(\bar{\mathbf{w}}_{t+1}). \tag{35}
\]

From these components we obtain the final modified procedure, Algorithm 2: a variant of GCG (Algorithm 1) that avoids the potential unboundedness of the subproblem (13) by replacing it with (30), while also completely avoiding explicit evaluation of the gauge κ. The Improve subroutine can be adapted by mirroring Definition 1.

Algorithm 2 GCG for positively homogeneous regularizers.
Require: The set A whose convex hull C defines the gauge κ.
1: Initialize w0 and ρ0 ≥ κ(w0).
2: for t = 0, 1, . . . do
3:   Choose a normalized direction at that satisfies (30).
4:   Choose step size ηt ∈ [0, 1] and set the scaling θt ≥ 0 by (34).
5:   w̄t+1 = (1 − ηt)wt + θt at, and ρ̄t+1 = (1 − ηt)ρt + θt.
6:   (wt+1, ρt+1) = Improve(w̄t+1, ρ̄t+1, ℓ, f)   ▷ Subroutine, see Theorem 8
7: end for
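A minimal NumPy-style sketch of Algorithm 2 for the special case h(x) = x with the Euclidean norm and a Null Improve step is given below; it uses the closed-form scaling of Remark 13, and `polar_direction` stands for a user-supplied (possibly approximate) solver of (30), an assumption of this sketch rather than code from the paper.

```python
import numpy as np

def gcg_gauge(w0, rho0, grad_loss, polar_direction, L, num_iters=100):
    """Sketch of Algorithm 2 with h = identity, Euclidean norm, and Null Improve.

    rho0               -> any upper bound on kappa(w0).
    grad_loss(w)       -> gradient of the smooth loss l at w.
    polar_direction(g) -> an atom a in A that (approximately) minimizes <a, g>, cf. (30).
    L                  -> Lipschitz constant of grad_loss, used by the closed form for (34).
    """
    w, rho = w0.copy(), float(rho0)
    for t in range(num_iters):
        g = grad_loss(w)
        a = polar_direction(g)                  # Line 3: direction via the polar
        eta = 2.0 / (t + 2.0)                   # Line 4: open-loop step size
        # Closed-form minimizer of (34) for h(x) = x (Remark 13):
        theta = max((np.vdot(a, L * eta * w - g) - 1.0) / (L * np.vdot(a, a)), 0.0)
        w = (1.0 - eta) * w + theta * a         # Line 5: update the iterate
        rho = (1.0 - eta) * rho + theta         #         and the upper bound on kappa(w)
    return w, rho
```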

Definition 7 The subroutine Improve in Algorithm 2 is called Null if for all t, wt+1 = w̄t+1 and ρt+1 = ρ̄t+1. Provided that ρt+1 ≥ κ(wt+1), Improve is called Descent if ℓ(wt+1) + h(ρt+1) ≤ ℓ(w̄t+1) + h(ρ̄t+1); Monotone if ℓ(wt+1) + h(ρt+1) ≤ ℓ(wt) + h(ρt); and Relaxed if

\[
\ell(\mathbf{w}_{t+1}) + h(\rho_{t+1}) \le \tilde{\ell}_L(\bar{\mathbf{w}}_{t+1}; \mathbf{w}_t) + (1-\eta_t)\,h(\rho_t) + \eta_t\,h(\theta_t/\eta_t). \tag{36}
\]

Similar to Definition 1, Relaxed can be easily morphed to further satisfy Descent and/or Monotone by comparing (wt+1, ρt+1) with (w̄t+1, ρ̄t+1) and/or (wt, ρt) over ℓ(w) + h(ρ). As before, Null and Descent are specializations of Relaxed.

For problems of the form (22), Algorithm 2 can provide significantly more efficient updates than Algorithm 1. Despite the relaxations introduced in Algorithm 2 to achieve faster iterates, the following theorem establishes that the procedure is still sound, preserving essentially the same convergence guarantees as the more expensive Algorithm 1.

Theorem 8 Let h : R+ → R be an increasing convex function, κ be a convex gauge induced by the convex hull of a compact atomic set A ∪ {0}, and ℓ be L-smooth and convex. Let F = ℓ + h ◦ κ and assume Algorithm 2 generates a bounded sequence {wt}. Let αt > 0, 0 ≤ ηt ≤ 1, and the subroutine Improve be Relaxed. Then for any w ∈ dom F and t ≥ 0, we have

\[
F(\mathbf{w}_{t+1}) \le F(\mathbf{w}) + \pi_t(1-\eta_0)\big(F(\mathbf{w}_0) - F(\mathbf{w})\big) + \sum_{s=0}^{t}\frac{\pi_t}{\pi_s}\,\eta_s^2\Big(\big(\rho\varepsilon_s + h(\rho/\alpha_s) - h(\rho)\big)/\eta_s + \tfrac{L}{2}\Big\|\tfrac{\rho}{\alpha_s}\mathbf{a}_s - \mathbf{w}_s\Big\|^2\Big), \tag{37}
\]

where ρ := κ(w) and π_t := ∏_{s=1}^{t}(1 − η_s) with π_0 = 1.

The proof is given in Appendix A.4. From this theorem, we obtain the following rate of convergence by pursuing a similar argument as in the proof of Corollary 5.

Corollary 9 Under the same setting as in Theorem 8, let ηt = 2/(t + 2), εt ≤ δηt/2 for some absolute constant δ > 0, αt = α > 0, ρ = κ(w), and L_F := L sup_t ‖(ρ/α) at − wt‖². Then for all t ≥ 1 and w, we have

\[
F(\mathbf{w}_t) \le F(\mathbf{w}) + \frac{2(\rho\delta + L_F)}{t+3} + h(\rho/\alpha) - h(\rho). \tag{38}
\]

Moreover, if ℓ ≥ 0 and h is the identity map, then

\[
F(\mathbf{w}_t) \le \frac{1}{\alpha}F(\mathbf{w}) + \frac{2(\rho\delta + L_F)}{t+3}. \tag{39}
\]

Some remarks concerning this algorithm and its convergence properties are in order.

Remark 10 When αt = 1 for all t, and h is the indicator ι[0,ζ] for some ζ > 0, Corollary 9 implies the same convergence rate for the constrained problem (21), which immediately recovers (some of) the more specialized results discussed in (Clarkson, 2010; Hazan, 2008; Jaggi, 2013; Shalev-Shwartz et al., 2010; Tewari et al., 2011; Yuan and Yan, 2013). For the special case where h is the identity map, a similar O(1/t) rate achieved by a similar algorithm appeared in Harchaoui et al. (2015), whose workshop version appeared around the same time as our conference version (Zhang et al., 2012). Harchaoui et al. (2015) considered the slightly more restrictive setting where the convex gauge κ is the restriction of a norm to a convex cone, and they did not consider (multiplicative) approximate polar computations. Their algorithm also needs to evaluate κ explicitly, while the key advantage of our method is to avoid this expensive computation by an upper bound. On the other hand, Harchaoui et al. (2015) investigated the memory based acceleration in (Holloway, 1974; Meyer, 1974) while we develop in the next section a fixed-rank local improvement step.

Remark 11 The additional factor ρ in the above bounds is necessary because the approximation in (30) is not invariant to scaling, requiring some compensation. Note also that α ≤ 1, since the right-hand side of (30) must become negative as εt → 0. The result in (39) states roughly that an α-approximate subroutine (for computing the polar of κ) leads to an α-approximate "minimum", also at the rate of O(1/t). This observation was independently made in (Bach, 2013), where the multiplicative factor ρ is introduced to solve approximately some matrix factorization problems. In a similar vein, Cheng et al. (2016) leveraged the multiplicative guarantee to solve some tensor problems.

Remark 12 The assumption that {wt} is bounded can be satisfied easily, which in turn ensures L_F < ∞. The simplest way is to postulate that the sublevel set {w ∈ dom F : F(w) ≤ ℓ(w0) + h(ρ0)} is bounded. In this case {wt} is trivially bounded if Monotone is used. When Relaxed is used instead (which includes Null and Descent), we can strengthen it into Monotone via the roll-back trick in Definition 7. At first sight, the algorithm appears to be stuck, but in fact the diminishing step size ηt will eventually lead to progress as guaranteed by Corollary 9. Computationally, keeping track of ℓ(wt) + h(ρt) does not incur any overhead. We will provide concrete values/bounds of L_F for the case studies in Section 4.2 and Section 4.4.

Remark 13 The convergence results in Theorem 8 and Corollary 9 obviously hold when Null or Descent is employed for Improve, because they are both specializations of Relaxed. The update for θ in (34) requires knowledge of the Lipschitz constant L and the norm ‖·‖. If the norm ‖·‖ is Hilbertian and h is the identity map, we have the formula

\[
\theta_t = \frac{1}{L\|\mathbf{a}_t\|^2}\max\big\{\langle \mathbf{a}_t,\ L\eta_t\mathbf{w}_t - \nabla\ell(\mathbf{w}_t)\rangle - 1,\ 0\big\}.
\]
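As a quick numerical sanity check (not from the paper), the closed form above can be compared against a brute-force grid minimization of (34) for the Euclidean norm, h(x) = x, and random low-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(1)
L, eta = 2.0, 0.5
w, a, g = rng.standard_normal(5), rng.standard_normal(5), rng.standard_normal(5)

def surrogate(theta):
    # The theta-dependent part of (34): quadratic upper bound (6) around w plus eta*h(theta/eta) = theta.
    delta = theta * a - eta * w        # ((1 - eta) * w + theta * a) - w
    return delta @ g + 0.5 * L * delta @ delta + theta

theta_closed = max((a @ (L * eta * w - g) - 1.0) / (L * (a @ a)), 0.0)
grid = np.linspace(0.0, 20.0, 200001)
theta_grid = grid[np.argmin([surrogate(th) for th in grid])]
print(theta_closed, theta_grid)        # the two values should agree up to the grid resolution
```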

However, it is also easy to devise other scaling and step size updates that do not require the knowledge of L or the norm ‖·‖, such as

\[
(\eta_t^*, \theta_t^*) \in \arg\min_{1\ge\eta\ge 0,\ \theta\ge 0}\ \ell((1-\eta)\mathbf{w}_t + \theta\mathbf{a}_t) + (1-\eta)\,h(\rho_t) + \eta\,h(\theta/\eta), \tag{40}
\]

followed by the Null step. Note that (40) is jointly convex in η and θ, since ηh(θ/η) is a perspective function. Evidently this update can be treated as a Relaxed subroutine, since for any (ηt, θt) chosen in step 4 of Algorithm 2, it follows that

\begin{align}
\ell(\mathbf{w}_{t+1}) + h(\rho_{t+1}) &\le \ell((1-\eta_t^*)\mathbf{w}_t + \theta_t^*\mathbf{a}_t) + (1-\eta_t^*)\,h(\rho_t) + \eta_t^*\,h(\theta_t^*/\eta_t^*) \nonumber\\
&\le \ell((1-\eta_t)\mathbf{w}_t + \theta_t\mathbf{a}_t) + (1-\eta_t)\,h(\rho_t) + \eta_t\,h(\theta_t/\eta_t) && \text{(by (40))} \nonumber\\
&\le \tilde{\ell}_L(\bar{\mathbf{w}}_{t+1}; \mathbf{w}_t) + (1-\eta_t)\,h(\rho_t) + \eta_t\,h(\theta_t/\eta_t). \nonumber
\end{align}

The upper bound ρt+1 ≥ κ(wt+1) can be verified in the same way as for (35); hence the update (40) enjoys the same convergence guarantee as in Corollary 9.

Remark 14 An alternative workaround for solving (7) with gauge regularization would be to impose an upper bound ζ ≥ κ(w) and consider min_{w: κ(w)≤ζ} {ℓ(w) + κ(w)}, while dynamically adjusting ζ. A standard GCG approach, similar to (Harchaoui et al., 2015), would then require solving

\[
\mathbf{d}_t \in \arg\min_{\mathbf{d}: \kappa(\mathbf{d})\le\zeta} \{\langle \mathbf{d}, \nabla\ell(\mathbf{w}_t)\rangle + \kappa(\mathbf{d})\} \tag{41}
\]

as a subproblem, which can be as easy to solve as (30). If, however, one wished to include a local improvement step Improve (which is essential in practice), it is not clear how to efficiently maintain the constraint κ(w) ≤ ζ. Moreover, solving (41) up to some multiplicative factor might not be as easy as (30).

3.3 Additional Discussions and Related Work

In this section we discuss some further properties and extensions of the GCG algorithm.

Smoothing: The convergence rates we have established require the convex loss ℓ to be L-smooth. Although this holds for a variety of losses such as the square and logistic loss, it does not hold for the hinge loss used in support vector machines. However, one can always approximate a nonsmooth (Lipschitz) loss by a smooth function (Nesterov, 2005) before applying GCG, at the expense of getting a slower O(1/√t) rate of convergence.

Totally Corrective: As given, Algorithm 2 adds a single atom in each iteration, with a conic combination between the new atom and the previous aggregate. We will refer to it as vanilla GCG if Improve is further set to Null. If f = ιC for some compact and convex set C, we simply call it vanilla CG for disambiguation. An even more aggressive scheme is to re-optimize the coefficients of all (or a subset of) existing atoms in each iteration. Such a procedure was first studied in (Meyer, 1974; Holloway, 1974), which is often known as the "totally (or fully) corrective update" in the boosting literature and as the "restricted simplicial decomposition" in optimization. More generally, for the convex gauge κ, each iteration of a totally corrective procedure would involve solving

\[
\min_{\sigma\ge 0}\ \ell\Big(\sum_{\tau\in A_t}\sigma_\tau \mathbf{b}_\tau\Big) + \sum_{\tau\in A_t}\sigma_\tau, \tag{42}
\]

where A_t is a set of atoms accumulated up to the t-th iteration. When f = ιC, the totally corrective CG solves min_σ ℓ(∑_{τ∈A_t} σ_τ b_τ) over σ ≥ 0 and 1⊤σ = 1. As pointed out by Holloway (1974), as long as {wt, at} ⊆ A_t, the totally corrective CG will converge at least as fast as the vanilla CG. Much faster convergence is usually observed in practice, although this advantage must be countered by the extra effort spent in solving (42) and the like, which itself need not be trivial. In a finite dimensional setting, provided that the atoms are linearly independent and some restricted strong convexity is present, it is possible to prove that the totally corrective CG converges at a linear rate; see (Shalev-Shwartz et al., 2010; Yuan and Yan, 2013). Recent work of Lacoste-Julien and Jaggi (2015) has further relaxed the requirement of linear independence by making a weaker assumption that the number of atoms is finite (i.e. the feasible region is a polytope).
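As an illustration of the totally corrective step (42) for a smooth loss, the sketch below re-optimizes the coefficients σ ≥ 0 of the accumulated atoms by plain projected gradient descent; the step size and iteration count are placeholders, and any solver for this small convex problem could be substituted.

```python
import numpy as np

def totally_corrective(atoms, grad_loss, n_steps=200, lr=0.01):
    """Approximately solve (42): min_{sigma >= 0} l(sum_tau sigma_tau b_tau) + sum_tau sigma_tau.

    atoms        -> list of atoms b_tau (NumPy arrays of identical shape).
    grad_loss(w) -> gradient of the smooth loss l at w.
    """
    B = np.stack(atoms)                          # shape (k, ...) holding the k accumulated atoms
    sigma = np.zeros(len(atoms))
    for _ in range(n_steps):
        w = np.tensordot(sigma, B, axes=1)       # current combination sum_tau sigma_tau b_tau
        g = grad_loss(w)
        grad_sigma = np.tensordot(B, g, axes=g.ndim) + 1.0   # chain rule plus the linear penalty
        sigma = np.maximum(sigma - lr * grad_sigma, 0.0)     # projected gradient step onto sigma >= 0
    return sigma
```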

GCG vs. PG: The convergence rate established for GCG is on par with PG. When f = ιC, Canon and Cullum (1968) showed that the rate of vanilla CG cannot be improved in general even when ℓ is strongly convex.¹ In this sense, vanilla GCG (which subsumes vanilla CG) is slower than an "optimal" algorithm like APG. GCG's advantage, however, is that it only needs to solve a linear subproblem (13) in each iteration (i.e., computing the polar κ◦), whereas APG (or PG) needs to compute the proximal update, which is a quadratic problem for the least square prox-function. The two approaches are complementary in the sense that the polar can be easier to compute in some cases, whereas in other cases the proximal update can be computed analytically. An additional advantage GCG has over APG, however, is its greedy sparse nature: each iteration of GCG adds a single atom, hence the number of atoms never exceeds the total number of iterations for GCG. By contrast, APG can produce a dense update in each iteration, although in later stages the estimates may become sparser due to the shrinkage effect of the proximal update. Moreover, GCG is "robust" with respect to α-approximate atom selection (cf. (39) in Corollary 9), whereas similar results do not appear to be available for PG or APG when the proximal update is computed up to a multiplicative factor.

1. Improvements in specific cases are still possible; see the next paragraph. Indeed the counter-example of Canon and Cullum (1968) only applies when the solution lies at the boundary of a polyhedral set.

Faster Rates: Faster rates of convergence for CG (with a closed-loop step size rule), under the restriction f = ιC, have been considered by a number of works. Levitin and Polyak (1966) first showed that if C is strongly convex and the gradient ∇ℓ is bounded away from 0, then a linear rate of convergence can be achieved by CG. Guelat and Marcotte (1986) showed that if there is a minimizer in the interior of C, then again a linear rate of convergence can be obtained. Wolfe (1970) proposed an away step modification of CG, and a variant of it designed by Guelat and Marcotte (1986) attained a linear rate of convergence when C is polyhedral. The analysis was further improved into a global linear rate in (Lacoste-Julien and Jaggi, 2015; Beck and Shtern, 2015). Dunn (1979) proved o(1/t) rate, linear rate and finite convergence for CG, depending on how "continuous" the polar operation is. Lastly, Garber and Hazan (2015) showed that for polytope C, a modification of CG achieves the linear rate if ℓ is of quadratic growth (in particular, strongly convex), and Garber and Hazan (2016) proved the O(1/t²) rate if ℓ is of quadratic growth and C is strongly convex. Note that these faster rates are possible by putting restrictive assumptions on the constraint set C (polyhedral or strongly convex). Interestingly, results on faster rates for general convex functions f are scarce, although establishing it for GCG is conceivably not hard. For example, in (Zhang et al., 2012) we proved a linear rate for the totally corrective GCG when ℓ is strongly convex (in a slightly stronger sense), f is a convex gauge, and B = Rn.

4. Application to Low Rank Learning

As a first application of GCG, we consider the problem of optimizing low rank matrices; a problem that forms the cornerstone of many machine learning tasks such as dictionary learning and matrix completion. Since imposing a bound on matrix rank generally leads to an intractable problem, much recent work has investigated approaches that relax the hard rank constraint with carefully designed regularization penalties that indirectly encourage low rank structure (Bach et al., 2008; Bradley and Bagnell, 2009). In Section 4.2 we first formulate a generic convex relaxation scheme for low rank problems arising in matrix approximation tasks, including dictionary learning and matrix completion. Conveniently, the modified GCG algorithm developed in Section 3.2, Algorithm 2, provides a general solution method for the resulting problems. Next, in Section 4.3 we show how the local improvement component of the modified GCG algorithm can be provided in this case by a procedure that optimizes an unconstrained surrogate (instead of a constrained problem) to obtain greater efficiency. This modification significantly improves the practical efficiency of GCG without affecting its convergence properties. We illustrate these contributions by highlighting the example of matrix completion under trace norm regularization, where the improved GCG algorithm demonstrates state of the art performance. Finally, we demonstrate how the convex relaxation can be generalized to handle latent multi-view models (White et al., 2012) in Section 4.4, showing how the improved GCG algorithm can still be applied once the major challenge of efficiently computing the polar κ◦ has been solved.

4.1 Low Rank Learning Problems

Many machine learning tasks, such as dictionary learning and matrix completion, can be formulated as seeking low rank matrix approximations of a given data matrix. In these problems, one is presented with an n×m matrix X (perhaps only partially observed) where each column corresponds to a training example and each row corresponds to a feature across examples; the goal is to recover a low rank matrix W that approximates (the observed entries of) X. To achieve this, we assume we are given a convex loss function ℓ(W) := L(W, X) that measures how well W approximates X.² If we impose a hard bound, t, on the rank of W, the core training problem can then be expressed as

\[
\inf_{W: \operatorname{rank}(W)\le t} \ell(W) = \inf_{U\in\mathbb{R}^{n\times t}}\ \inf_{V\in\mathbb{R}^{t\times m}} \ell(UV). \tag{43}
\]

Unfortunately, this problem is not convex, and except for special cases (such as SVD) is believed to be intractable.

2. Note that the matrix W need not reconstruct the entries of X directly: a nonlinear transfer could be interposed while maintaining convexity of ℓ; for example, by using an appropriate Bregman divergence.

The two forms of this problem that we will explicitly illustrate in this paper are dictionary learning and matrix completion. First, for dictionary learning, the n×m data matrix X is fully observed, the n×t matrix U is interpreted as a "dictionary" consisting of t basis vectors, and the t×m matrix V is interpreted as the "coefficients" consisting of m code vectors. For this problem, the dictionary and coefficient matrices need to be inferred simultaneously, in contrast to traditional signal approximation schemes where the dictionary (e.g., the Fourier or a wavelet basis) is fixed a priori. Unfortunately, learning the dictionary with the coefficients creates a significant computational challenge since the formulation is no longer jointly convex in the variables U and V, even when the loss ℓ is convex. Indeed, for a fixed dictionary size t, the problem is known to be NP-hard (Vavasis, 2010) except for very special cases (Yu and Schuurmans, 2011).

The second learning problem we illustrate is low rank matrix completion. Here, one observes a small number of entries in X ∈ R^{n×m} and the task is to infer the remaining entries. In this case, the optimization objective can be expressed as ℓ(W) := L(P(W − X)), where L is a reconstruction loss (such as the squared Frobenius norm) and P : R^{n×m} → R^{n×m} is a "mask" operator that simply fills the entries corresponding to unobserved indices in X with zero. Various recommendation problems, such as the Netflix prize,³ can be cast as matrix completion. Due to the ill-posed nature of the problem it is standard to assume that the matrix X can be well approximated by a low rank reconstruction W. Unfortunately, the rank function is not convex, hence hard to minimize in general. Conveniently, both the dictionary learning and matrix completion problems are subsumed by the general formulation we consider in (43).

4.2 General Convex Relaxation and Solution via GCG

To develop a convex relaxation of (43), we first consider the factored formulation expressed in terms of U and V on the right hand side. Although it only involves an unconstrained minimization, which can be convenient, care is needed to handle the scale invariance introduced by the factored form: since U can always be scaled and V counter-scaled to preserve the product UV, some form of penalty or constraint is required to restore well-posedness.

2. Note that the matrix W need not reconstruct the entries of X directly: a nonlinear transfer could be interposed while maintaining convexity of ℓ; for example, by using an appropriate Bregman divergence.

3. http://www.netflixprize.com//community/viewtopic.php?id=1537


A standard strategy is to constrain U, for which a particularly common choice is to constrain its columns to have unit length; i.e., ‖U:i‖c ≤ 1 for all i, where U:i stands for the i-th column of U, and c simply denotes some norm on the columns. In dictionary learning, for example, this amounts to constraining the basis vectors in the dictionary U to lie within the unit ball of ‖·‖c. A similar normalization can be used to represent the rows Vi:, with the magnitude encoded by σi ≥ 0. In consequence, the unconstrained optimization over ℓ(UV) in (43) can be reformulated as

min_{U ∈ R^{n×t}, V ∈ R^{t×m}, σ ∈ R^t} ℓ(U diag(σ) V),   s.t. ‖U:i‖c ≤ 1, ‖Vi:‖r ≤ 1, and σi ≥ 0.    (44)

Here σ := (σ1, σ2, . . . , σt)^⊤, and diag(σ) is a diagonal matrix with the i-th diagonal entry being σi. The specific form of the row norm ‖·‖r provides additional flexibility in promoting different structures; for example, the l1 norm leads to sparse solutions, the l2 norm yields low rank solutions, and block structured norms generate group sparsity. The specific form of the column norm ‖·‖c also has an effect, as discussed in Section 4.4 below.

Of course the problem remains non-convex. However, it has recently been observed that a simple relaxation of the rank constraint in (43) allows a convex reformulation to be achieved (Argyriou et al., 2008; Bach et al., 2008; Bradley and Bagnell, 2009; Zhang et al., 2011). Observe that replacing the rank constraint with a regularization penalty on the magnitude σi of each basis pair (U:i, Vi:) yields a relaxed form of (43):

inf_{U: ‖U:i‖c ≤ 1}  inf_{V: ‖Vi:‖r ≤ 1}  inf_{σ: σi ≥ 0}  ℓ(U diag(σ) V) + λ ‖σ‖1,    (45)

where the number of columns in U (and also the number of rows in V) is not restricted. Similar to the Lasso, the l1 norm on the "singular values" (combination weights) σ is used as a penalty that serves as a convex proxy for rank. By the shrinkage effect of the l1 norm, many σi will become zero, implying that the corresponding columns of U (and rows of V) can be dropped. As a result, the rank of the solution, which is indeed unknown a priori, can be chosen adaptively via the trade-off parameter λ ≥ 0.

Importantly, although the problem (45) is still not jointly convex in the factors U, V and σ, it can be exactly reformulated as a convex optimization as long as ℓ is convex:

(45) = min_W ℓ(W) + λ inf { Σi σi : σ ≥ 0, W = Σi σi U:i Vi:, ‖U:i‖c ≤ 1, ‖Vi:‖r ≤ 1 }    (46)
     = min_W ℓ(W) + λ κ(W),    (47)

where the gauge κ is induced by the following set A, which has uncountably many elements:

A := {uv^⊤ : u ∈ R^n, v ∈ R^m, ‖u‖c ≤ 1, ‖v‖r ≤ 1}.    (48)

The resulting convex formulation (47) has been observed, in various forms, by, e.g., Argyriou et al. (2008); Bach et al. (2008); Zhang et al. (2011); White et al. (2012). However, our derivation, first presented in (Zhang et al., 2012), provides a much simpler and more flexible view based on convex gauges, which also establishes a direct connection to the GCG method


outlined in Section 3.2, yielding an efficient solution method with no additional effort. The reformulation (47) reveals that the relaxed learning problem (45) is essentially considering a rank-one decomposition of W, which generalizes the singular value decomposition. Furthermore, it provides a clearer understanding of the computational properties of the regularizer in (47) via its polar:

κ◦(G) = sup_{A ∈ A} ⟨A, G⟩ = sup_{‖u‖c ≤ 1, ‖v‖r ≤ 1} u^⊤ G v    (49)
       = sup_{‖v‖r ≤ 1} ‖Gv‖◦c = sup_{‖u‖c ≤ 1} ‖G^⊤u‖◦r.    (50)

Recall that for gauge regularized problems in the form (47), Algorithm 2 only requires approximate computation of the polar (30) in Line 3; hence an efficient method for (approximately) solving (49) immediately yields an efficient implementation. However, this simplicity must be countered by the fact that the polar is not always tractable, since it involves maximizing a norm subject to a different norm constraint.⁴ Nevertheless, there exist important cases where the polar can be efficiently evaluated, which we illustrate below.

The polar also allows one to gain insight into the structure of the original gauge regularizer, since by duality κ = (κ◦)◦ yields an explicit formula for κ. For example, consider the important special case where ‖·‖r = ‖·‖c = ‖·‖2. In this case, using (49) one can infer that the polar κ◦(·) = ‖·‖sp is the spectral norm, and therefore the corresponding gauge regularizer κ(·) = ‖·‖tr is the trace norm. In this way, the trace norm regularizer, which is known to provide a convex relaxation of the matrix rank (Chandrasekaran et al., 2012), can be viewed from the perspective of dictionary learning as an induced norm that arises from a simple 2-norm constraint on the magnitude of the dictionary entries U:i and a simple 2-norm penalty on the magnitudes of the code vectors Vi:.
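To make this concrete, the following sketch (ours, not code from the paper) evaluates the polar κ◦(G) = ‖G‖sp in this l2/l2 case, together with the maximizing rank-one atom uv^⊤; the function name and the use of scipy.sparse.linalg.svds are our choices. In Line 3 of Algorithm 3 below, one would apply it to −∇ℓ(Wt).

    import numpy as np
    from scipy.sparse.linalg import svds

    def polar_l2(G):
        # kappa°(G) = ||G||_sp with the maximizing atom u v^T, for the case
        # ||.||_r = ||.||_c = ||.||_2 (trace norm regularizer).  For a sparse G
        # the top singular pair costs far less than a full SVD.
        u, s, vt = svds(G, k=1)
        return s[0], u[:, 0], vt[0, :]

    # Tiny usage check on a dense random matrix.
    G = np.random.randn(50, 40)
    val, u, v = polar_l2(G)
    print(val, np.linalg.norm(G, 2))   # the two values agree (spectral norm)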

This convex relaxation strategy based on convex gauges is quite flexible and has recently been studied for some structured sparse problems (Chandrasekaran et al., 2012; Tewari et al., 2011; Zhang et al., 2012; Bach, 2013). Importantly, the generality of the column and row norms in (45) offers considerable flexibility to finely characterize the structures sought by low rank learning methods. We will exploit this flexibility in Section 4.4 below to achieve a novel extension of this framework to multi-view dictionary learning (White et al., 2012).

4.2.1 Application to Matrix Completion

We end this subsection with a brief illustration of how the framework can be applied to matrix completion. Recall that in this case one is given a partially observed data matrix X, and the goal is to find a low rank approximator W that minimizes the reconstruction error on the observed entries. In particular, consider the following standard specialization of (47):

min_{W ∈ R^{n×m}}  (1/(2‖P‖0)) ‖P(X − W)‖²F + λ ‖W‖tr,    (51)

where, as noted above, P : R^{n×m} → R^{n×m} is the mask operator that fills the entries where X is unobserved with zero, and ‖P‖0 is the number of observed entries.

4. In this regard, the α-approximate polar oracle introduced in (30) and Theorem 9 might be useful, although we shall not develop this idea further here. See the work of Bach (2013) for some applications of this idea to matrix factorization and the work of Cheng et al. (2016) to tensor problems.


Surprisingly, Candès and Recht (2009) have proved that, assuming X is (approximately) low rank, the solution of the convex relaxation (51) will recover the true matrix X with high probability, even when only a small random portion of X is observed.

Note that, since (51) is a specialization of (47), the modified GCG algorithm, Algorithm 2, can be immediately applied. Interestingly, such an approach contrasts with the most popular algorithm favored in the current literature: PG (or its accelerated variant APG; cf. Section 2). The two strategies, GCG versus PG, exhibit an interesting trade-off. Each iteration of PG (or APG) involves solving the PU (9), which in this case requires a full singular value decomposition (SVD) of the current approximant W. The modified GCG method, Algorithm 2, by contrast, only requires the polar of the gradient matrix ∇ℓ(W) to be (approximately) computed in each iteration (i.e., the dual of the trace norm, namely the spectral norm), which is an order of magnitude cheaper to evaluate than a full SVD (cubic versus squared worst case time).⁵ On the other hand, GCG has a slower theoretical rate than APG, O(1/t) versus O(1/t²), in terms of the number of iterations required to achieve a given accuracy. Once the enhancements in the next section have been incorporated, our experiments in Section 5.1 will demonstrate that GCG is far more efficient than APG for solving the matrix completion problem (51) to tolerances used in practice.

Given the specific objective (51), we may derive a concrete bound on LF used in Theorem 9. Obviously, the smoothness constant is L = 1/‖P‖0. Suppose we use Monotone with W0 = 0. Then all Wt (and the optimal W) need to satisfy λ‖Wt‖tr ≤ C, where C = ‖PX‖²F/(2‖P‖0). Due to the normalization, C can be considered a universal constant. All matrices A ∈ A in (48) satisfy ‖A‖F ≤ 1, and obviously (1/n)‖Wt‖²tr ≤ ‖Wt‖²F ≤ ‖Wt‖²tr. So

LF ≤ 2L sup_t { (ρ²/α²)‖At‖²F + ‖Wt‖²F } ≤ 2L sup_t { ρ²/α² + ‖Wt‖²tr } ≤ (2/‖P‖0) (1/α² + 1) C²/λ².    (52)

If λ needs to be less than 1/‖P‖0, then the upper bound on LF will be O(‖P‖0). Notice that under Monotone, the expression of LF in Theorem 5 is on par with the condition number Lf D∗² in (Harchaoui et al., 2015, Theorem 3).

4.3 Fixed-rank Local Optimization

Although the modified GCG algorithm, Algorithm 2, can be readily applied to the reformulated low rank learning problem (47), the sublinear rate of convergence established in Theorem 9 is still too slow in practice. Nevertheless, an effective local improvement scheme can be developed that satisfies the properties of a Relaxed subroutine in Algorithm 2, which allows empirical convergence to be greatly accelerated. Recall that in Algorithm 2, the intermediate iterate Wt+1 is determined by a linear combination of the previous iterate Wt and the newly added atom At (we changed the notation from wt+1 to Wt+1 since the optimization variables are now matrices).

5. The per-iteration cost of computing the spectral norm can be further reduced for the sparse matrices occurring naturally in matrix completion, since the gradient matrix in (51) contains at most k nonzero elements for k equal to the number of observed entries in X. In such cases, an ε accurate estimate of the squared spectral norm can be obtained in O(k log(n+m)/√ε) time (Jaggi, 2013; Kuczynski and Wozniakowski, 1992), offering a significant speed up over the general case. Although some speed ups might also be possible when computing the SVD for sparse matrices, the gap between the computational cost for evaluating the spectral norm versus the trace norm for sparse matrices remains significant.


We will next demonstrate that Wt+1 can be further improved by solving a reformulated local problem, and then show how this can still preserve the convergence guarantees of Algorithm 2.

The key reformulation relies on the following fact.

Proposition 15 The convex gauge κ induced by the set A in (48) can be re-expressed as

κ(W) = min { Σ_{i=1}^t ‖U:i‖c ‖Vi:‖r : UV = W } = min { (1/2) Σ_{i=1}^t (‖U:i‖²c + ‖Vi:‖²r) : UV = W },

for U ∈ R^{n×t} and V ∈ R^{t×m}, as long as t ≥ mn.

In Appendix B.1 we provide a direct proof that allows us to bound the number of terms involved in the summation. The above lower bound mn on t may not be sharp. For example, if ‖·‖r = ‖·‖c = ‖·‖2, then κ is the trace norm, where Σ_{i=1}^t (‖U:i‖²c + ‖Vi:‖²r) is simply ‖U‖²F + ‖V‖²F, the sum of the squared Frobenius norms; that is, Proposition 15 yields the well-known variational form of the trace norm in this case (Srebro et al., 2005). Observe that t in this case can be as low as the rank of W. Theorem 15 also appeared in (Bach et al., 2008; Bach, 2013), although without the explicit bound t ≥ mn.
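As a quick numerical sanity check of this variational form (our own illustration, not part of the paper), the balanced factorization U = A√Σ, V = √Σ B^⊤ obtained from the SVD W = AΣB^⊤ attains (1/2)(‖U‖²F + ‖V‖²F) = ‖W‖tr:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 20))   # low rank test matrix

    A, s, Bt = np.linalg.svd(W, full_matrices=False)
    U = A * np.sqrt(s)              # scale columns of A by sqrt of singular values
    V = np.sqrt(s)[:, None] * Bt    # scale rows of B^T likewise
    assert np.allclose(U @ V, W)

    trace_norm = s.sum()
    variational = 0.5 * (np.sum(U ** 2) + np.sum(V ** 2))
    print(trace_norm, variational)  # the two values coincide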

Based on Theorem 15, the objective in (47) can then be approximated at iteration t as

Ft+1(U, V) := ℓ(UV) + (λ/2) Σ_{i=1}^{t+1} (‖U:i‖²c + ‖Vi:‖²r).    (53)

Our key intuition is to first factorize Wt+1 into Uinit Vinit, and then find an improvement to (U, V) with respect to Ft+1 (not the original F). This results in Ut+1 and Vt+1 satisfying Ft+1(Ut+1, Vt+1) ≤ Ft+1(Uinit, Vinit). Finally we restore the iterates based on Theorem 15:

Wt+1 = Ut+1 Vt+1   and   ρt+1 = (1/2) Σ_{i=1}^{t+1} (‖(Ut+1):i‖²c + ‖(Vt+1)i:‖²r).    (54)

Using Ft+1 as a surrogate objective for local improvement is particularly advantageous in computation, because the problem is now unconstrained and the gauge function has been replaced by much simpler forms such as ‖U:i‖²c and ‖Vi:‖²r. This allows efficient unconstrained minimization routines (e.g., L-BFGS) to be applied directly, improving the quality of the current iterate without increasing the size of the dictionary. Note however that (53) is not jointly convex in (U, V) since the size t + 1 (the iteration counter) is fixed. When t is sufficiently large, any local minimizer of (53) globally solves the original problem (47) (Burer and Monteiro, 2005),⁶ but we locally reduce (53) in each iteration even for small t.
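For illustration, here is a minimal sketch (ours) of feeding the surrogate (53) and its gradient to an off-the-shelf L-BFGS routine in the l2/l2 (trace norm) case; the helper name and the use of scipy.optimize.minimize are our choices, and loss_grad is an assumed callable returning the loss value and gradient at W = UV.

    import numpy as np
    from scipy.optimize import minimize

    def reduce_lbfgs(loss_grad, lam, U0, V0, maxiter=20):
        # Locally reduce F(U, V) = loss(UV) + lam/2 (||U||_F^2 + ||V||_F^2),
        # i.e. the surrogate (53) for the l2/l2 case, starting from (U0, V0).
        n, k = U0.shape
        _, m = V0.shape

        def fg(x):
            U, V = x[:n * k].reshape(n, k), x[n * k:].reshape(k, m)
            f, G = loss_grad(U @ V)
            f += 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
            gU = G @ V.T + lam * U          # gradient w.r.t. U
            gV = U.T @ G + lam * V          # gradient w.r.t. V
            return f, np.concatenate([gU.ravel(), gV.ravel()])

        x0 = np.concatenate([U0.ravel(), V0.ravel()])
        res = minimize(fg, x0, jac=True, method='L-BFGS-B',
                       options={'maxiter': maxiter})
        return res.x[:n * k].reshape(n, k), res.x[n * k:].reshape(k, m)

    # Example usage with a simple squared loss to a fixed target T (illustration only).
    T = np.random.randn(30, 20)
    def loss_grad(W):
        R = W - T
        return 0.5 * np.sum(R ** 2), R
    U, V = reduce_lbfgs(loss_grad, lam=0.1,
                        U0=np.random.randn(30, 4), V0=np.random.randn(4, 20))

Since L-BFGS only accepts steps that decrease the objective, the returned pair does not increase Ft+1 relative to the initialization, which is essentially all that Proposition 16 below requires.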

Once this modification is added to Algorithm 2, we obtain the final variant, Algorithm 3.

6. The catch here is that seeking a local minimizer can still be highly non-trivial; sometimes it is even hard to certify local optimality.


Algorithm 3 GCG variant for low rank learning.

Require: The atomic set A.
1: Initialize W0 = 0, ρ0 = 0, U0 = V0 = Λ0 = ∅.
2: for t = 0, 1, . . . do
3:    (ut, vt) ← arg min_{uv^⊤ ∈ A} ⟨∇ℓ(Wt), uv^⊤⟩
4:    (ηt, θt) ← arg min_{0 ≤ η ≤ 1, θ ≥ 0} ℓ((1 − η)Wt + θ ut vt^⊤) + λ((1 − η)ρt + θ)
5:    Uinit ← (√(1 − ηt) Ut, √θt ut),   Vinit ← (√(1 − ηt) Vt^⊤, √θt vt)^⊤
6:    (Ut+1, Vt+1) ← Reduce(Ft+1, Uinit, Vinit) s.t. Ft+1(Ut+1, Vt+1) ≤ Ft+1(Uinit, Vinit),
      e.g., find a local minimum of Ft+1 with U and V initialized to Uinit and Vinit respectively
7:    Wt+1 ← Ut+1 Vt+1      (nominal, as we only maintain (Ut, Vt) in practice)
8:    ρt+1 ← (1/2) Σ_{i=1}^{t+1} (‖(Ut+1):i‖²c + ‖(Vt+1)i:‖²r)
9: end for

In this algorithm, Line 3 remains equivalent to before, and Line 4 adopts the slightly more sophisticated update rule in (40), although one can also use (34) with ηt = 2/(t + 2). Lines 5 to 8 collectively implement the Improve subroutine: Line 5 splits the iterate into the two factors Uinit and Vinit, while Line 6 further reduces the reformulated local objective (53) from this designated initialization (e.g., by finding a local minimum). We refer to this step as Reduce in order to distinguish it from Improve. The last two steps restore the GCG iterate via (54).
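Putting the pieces together, the following self-contained sketch (ours) mimics Algorithm 3 for trace-norm-regularized matrix completion with a squared loss on observed entries; the coarse grid search in Line 4 and the simple backtracking Reduce are our simplifications of the exact line search and the L-BFGS subroutine used in the paper.

    import numpy as np

    def gcg_low_rank(X, mask, lam, iters=20):
        n, m = X.shape
        n_obs = mask.sum()

        def loss(W):
            return 0.5 * np.sum((mask * (W - X)) ** 2) / n_obs

        def grad(W):
            return mask * (W - X) / n_obs

        def surrogate(U, V):                       # F_{t+1}(U, V) in (53)
            return loss(U @ V) + 0.5 * lam * (np.sum(U**2) + np.sum(V**2))

        U, V, rho = np.zeros((n, 0)), np.zeros((0, m)), 0.0
        for t in range(iters):
            W = U @ V
            # Line 3: polar step, i.e. the top singular pair of -grad(W).
            A, s, Bt = np.linalg.svd(-grad(W))
            u, v = A[:, :1], Bt[:1, :]
            # Line 4: coarse grid search over (eta, theta).
            best, eta, th = np.inf, 0.0, 0.0
            for e in np.linspace(0.0, 1.0, 11):
                for h in np.linspace(0.0, 2.0, 21):
                    val = loss((1 - e) * W + h * (u @ v)) + lam * ((1 - e) * rho + h)
                    if val < best:
                        best, eta, th = val, e, h
            # Line 5: split the iterate into an initial factorization.
            U = np.hstack([np.sqrt(1 - eta) * U, np.sqrt(th) * u])
            V = np.vstack([np.sqrt(1 - eta) * V, np.sqrt(th) * v])
            # Line 6 (Reduce): monotone backtracking gradient steps on (53).
            f, lr = surrogate(U, V), 1.0
            for _ in range(30):
                G = grad(U @ V)
                dU, dV = G @ V.T + lam * U, U.T @ G + lam * V
                while lr > 1e-12:
                    U2, V2 = U - lr * dU, V - lr * dV
                    f2 = surrogate(U2, V2)
                    if f2 <= f:
                        U, V, f, lr = U2, V2, f2, lr * 2.0
                        break
                    lr *= 0.5
            # Lines 7-8: restore W_{t+1} and rho_{t+1} via (54).
            rho = 0.5 * (np.sum(U**2) + np.sum(V**2))
        return U @ V

    # Tiny usage example on a random low rank matrix with 50% observed entries.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 30))
    mask = (rng.random(X.shape) < 0.5).astype(float)
    W = gcg_low_rank(X, mask, lam=1e-3)

Because Reduce only accepts objective-decreasing steps, the sketch respects the requirement of Proposition 16 below; in practice the L-BFGS version sketched earlier would be used instead.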

The only major challenge left for inserting a local optimization step in Algorithm 2 is that the Relaxed condition (36) in Theorem 7 must be maintained. This is the condition under which Algorithm 3 retains the O(1/t) rate of convergence established in Theorem 9. Fortunately, this holds true, as stated in Theorem 16; its proof is given in Appendix B.2.

Proposition 16 (Uinit, Vinit) in Line 5 is a factorization of the intermediate iterate Wt+1 = (1 − ηt)Wt + θt ut vt^⊤. Let (Ut+1, Vt+1) improve upon it in the sense that Ft+1(Ut+1, Vt+1) ≤ Ft+1(Uinit, Vinit). Then the Wt+1 and ρt+1 recovered by (54) satisfy the Relaxed condition.

Although the Reduce subroutine is not guaranteed to achieve strict improvement in every iteration, we observe in our experiments in Section 5 below that Algorithm 3 is almost always significantly faster than the Null version of Algorithm 2. In other words, local acceleration appears to make a big difference in practice.

Remark 17 Interleaving local improvement with a globally convergent procedure has been considered by Mishra et al. (2013) and Laue (2012). In particular, Laue (2012) considered the constrained problem (21), which unfortunately leads to considering a constrained minimization for local improvement that can be less efficient than the unconstrained approach considered in (53). In addition, Laue (2012) used the original objective F directly in local optimization. This leads to significant challenges when the regularizer κ is complicated, such as the trace norm; we avoid this issue by using a smooth surrogate objective Ft+1.

Focusing on trace norm regularization only, Mishra et al. (2013) proposed a trust-region procedure that locally minimizes the original objective on the Stiefel manifold and the positive semidefinite cone, which unfortunately also requires constrained minimization and dynamic maintenance of the singular value decomposition of a small matrix. The rate of convergence of this procedure has not been analyzed.

Remark 18 We remark that local minimization of (53) constitutes the primary bottleneck of Algorithm 3, consuming over 50% of the computation.


Since (53) is the same objective considered by the standard fixed rank approach to matrix factorization, the overall performance of Algorithm 3 can be substantially improved by leveraging the techniques that have been developed for this specific objective. These improvements include system level optimizations such as stochastic solvers on distributed, parallel and asynchronous infrastructures (Zhuang et al., 2013; Yun et al., 2014). In the experiments conducted in Section 5 below, we merely solve (53) using a generic L-BFGS solver to highlight the efficiency of the GCG framework itself. Nevertheless, such a study leaves significant room for further straightforward acceleration.

Remark 19 Much of our development in this section was inspired by the work of Bach et al. (2008), who first demonstrated, although in a slightly less clear way, that the convex relaxation in (47) is possible, but did not observe the computational connection to GCG. Our contribution here is to present this result succinctly using the notion of convex gauges, connect it naturally to the GCG algorithm we developed in Section 3, and propose an efficient fixed-rank local improvement subroutine Reduce that does not affect the theoretical convergence guarantees. After our conference version (Zhang et al., 2012), Bach (2013) also independently extended the previous work (Bach et al., 2008) using convex gauges, and discussed a multiplicative approximate GCG variant similar to our Algorithm 2. However, the possibility of interleaving fixed-rank local subroutines based on Theorem 15 with GCG, and its practical computational consequences, were not observed in (Bach, 2013).

4.4 Multi-view Dictionary Learning

Finally, in this section we show how low rank learning can be generalized to a multi-view setting where, once again, the modified GCG algorithm can efficiently provide optimal solutions. The key challenge in this case is to develop column and row norms that can capture the more complex problem structure.

Consider a two-view learning task where one is given m paired observations {[xj; yj]} consisting of an x-view and a y-view with lengths n1 and n2 respectively. The goal is to infer a set of latent representations, hj (of dimension t < min(n1, n2)), and reconstruction models parameterized by the matrices A and B, such that Ahj ≈ xj and Bhj ≈ yj for all j. In particular, let X denote the n1 × m matrix of x-view observations, Y the n2 × m matrix of y-view observations, and Z = [X; Y] the concatenated (n1 + n2) × m matrix (stacking X on top of Y). The problem can then be expressed as recovering an (n1 + n2) × t matrix of model parameters, C = [A; B], and a t × m matrix of latent representations, H, such that Z ≈ CH.

The key assumption behind multi-view learning is that each of the two views, xj and yj, is conditionally independent given the shared latent representation, hj. Although multi-view data can always be concatenated and hence treated as a single view, explicitly representing multiple views enables more accurate identification of the latent representation (as we will see), when the conditional independence assumption holds. To better motivate this model, it is enlightening to first reinterpret the classical formulation of multi-view dictionary learning, which is given by canonical correlation analysis (CCA). Typically, it is expressed as the problem of projecting two views so that the correlation between them is maximized. Assuming the data is centered (i.e., X1 = 0 and Y1 = 0), the sample covariances of X and Y are given by XX^⊤/m and YY^⊤/m respectively. CCA can then be expressed as an optimization over matrix variables

max_{U,V} tr(U^⊤ X Y^⊤ V)   s.t.   U^⊤ X X^⊤ U = V^⊤ Y Y^⊤ V = I    (55)

for U ∈ R^{n1×t}, V ∈ R^{n2×t} (De Bie et al., 2005). Here I is the identity matrix whose size is clear from the context. Although this classical formulation (55) does not make the shared latent representation explicit, CCA can be expressed by a generative model: given a latent representation, hj, the observations xj = Ahj + εj and yj = Bhj + νj are generated by a linear mapping plus independent zero mean Gaussian noise, ε ∼ N(0, Σx), ν ∼ N(0, Σy) (Bach and Jordan, 2006). Interestingly, one can show that the classical CCA problem (55) is equivalent to the following multi-view dictionary learning problem.

Proposition 20 (White et al. 2012, Proposition 1) Fix t, let

(A, B, H) = arg min_{A: ‖A:i‖2 ≤ 1}  min_{B: ‖B:i‖2 ≤ 1}  min_H  ‖ [ (XX^⊤)^{−1/2} X ; (YY^⊤)^{−1/2} Y ] − [A; B] H ‖²F.    (56)

Then U = (XX^⊤)^{−1/2} A and V = (YY^⊤)^{−1/2} B provide an optimal solution to (55).

Proposition 20 demonstrates how formulation (56) respects the conditional independence of the separate views: given a latent representation hj, the reconstruction losses on the two views, xj and yj, cannot influence each other, since the reconstruction models A and B are individually constrained. By contrast, in single-view dictionary learning (i.e., principal component analysis) A and B are concatenated in the larger variable C = [A; B], where C as a whole is constrained but A and B are not. A and B must then compete against each other to acquire magnitude to explain their respective "views" given hj (i.e., conditional independence is not enforced). Such sharing can be detrimental if the two views really are conditionally independent given hj.
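For reference, the classical CCA solution to (55) can be computed from the singular vectors of the whitened cross-covariance, as in the following sketch (ours, not from the paper; the small ridge eps is our addition to keep the inverse square roots well defined):

    import numpy as np

    def inv_sqrt(M, eps=1e-8):
        w, Q = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    def cca(X, Y, t):
        Wx, Wy = inv_sqrt(X @ X.T), inv_sqrt(Y @ Y.T)
        P, s, Qt = np.linalg.svd(Wx @ X @ Y.T @ Wy)
        U = Wx @ P[:, :t]            # satisfies U^T X X^T U ≈ I
        V = Wy @ Qt[:t, :].T         # satisfies V^T Y Y^T V ≈ I
        return U, V, s[:t]           # s[:t] ≈ the canonical correlations

    # Usage on centered data matrices X (n1 x m) and Y (n2 x m).
    rng = np.random.default_rng(0)
    H = rng.standard_normal((3, 100))
    X = rng.standard_normal((8, 3)) @ H + 0.1 * rng.standard_normal((8, 100))
    Y = rng.standard_normal((6, 3)) @ H + 0.1 * rng.standard_normal((6, 100))
    X -= X.mean(axis=1, keepdims=True); Y -= Y.mean(axis=1, keepdims=True)
    U, V, corr = cca(X, Y, t=3)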

This matrix factorization viewpoint has recently allowed dictionary learning to be extended from the single-view setting to the multi-view setting (Jia et al., 2010). However, despite its elegance, the computational tractability of CCA hinges on its restriction to squared loss under a particular normalization, precluding other important losses. To combat these limitations, we apply the same convex relaxation principle as in Section 4.2. In particular, we reformulate (56) by first incorporating an arbitrary loss function ℓ that is convex in its first argument, and then relaxing the rank constraint by replacing it with the l1 norm of the "singular values". This amounts to the following training problem that extends (Jia et al., 2010):

min_{A,B,H,σ}  ℓ([A; B] diag(σ) H; Z) + λ ‖σ‖1,   s.t. ‖Hi:‖2 ≤ 1, σi ≥ 0, [A:i; B:i] ∈ D ∀ i,

where D := { [a; b] : ‖a‖2 ≤ β, ‖b‖2 ≤ γ }.    (57)

The l2 norm on the rows of H is a regularizer that encourages the rows to become sparse, hence reducing the size of the learned representation (Argyriou et al., 2008). Since the size of the latent representation (the number of rows of H) is not fixed, our results in Section 4.2 immediately allow the problem to be solved globally and efficiently for A, B and H. This considerably improves on the previous local solutions in (Jia et al., 2010), which fixed the latent dimension and approached the non-convex problem by alternating block coordinate descent between A, B and diag(σ)H.

Indeed, (57) fits exactly in the framework of (45), with ‖·‖r being the l2 norm and

‖c‖c = max( (1/β)‖a‖2, (1/γ)‖b‖2 )   for   c = [a; b].    (58)

Next we will show how to efficiently compute the polar operator (50) that underpins Algorithm 3 in this multi-view model.

4.4.1 Efficient Polar Operator

The polar operator (50) for problem (57) turns out to admit an efficient but nontrivial solution. For simplicity, we assume β = γ = 1, while the more general case can be dealt with in exactly the same way. The proof of the following key theorem is in Appendix B.3.

Proposition 21 Let I1 := diag([1; 0]) and I2 := diag([0; 1]) be the identity matrices on the subspaces spanned by the x-view and the y-view, respectively. For any µ > 0, denote Dµ = √(1 + µ) I1 + √(1 + 1/µ) I2 as a diagonal scaling of the two identity matrices. Then, for the row norm ‖·‖r = ‖·‖2 and column norm ‖·‖c defined as in (58), the corresponding polar (49) is given by

κ◦(G) = min_{µ ≥ 0} ‖Dµ G‖sp.    (59)

Despite its variational form, the polar (59) can still be efficiently computed by considering its squared form and expanding:

‖Dµ G‖²sp = ‖G^⊤ Dµ Dµ G‖sp = ‖G^⊤ [(1 + µ)I1 + (1 + 1/µ)I2] G‖sp.

Clearly, the term inside the spectral norm ‖·‖sp is convex in µ on R+, while the spectral norm is convex and increasing on the cone of positive semidefinite matrices; hence, by the usual rules of convex composition, the left hand side above is a convex function of µ. Consequently we can use a subgradient algorithm, or even golden section search, to find the optimal µ in (59). Given an optimal µ, an optimal pair of "atoms" a and b can be recovered via the KKT conditions; see Appendix B.4 for details. Thus, by using (59), Algorithm 3 is able to solve the resulting optimization problem far more efficiently than the initial procedure offered by White et al. (2012).
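A direct way to evaluate (59) numerically is a one-dimensional search over µ, as in the following sketch (ours); the log-parameterization, the search bracket, and the choice of scipy.optimize.minimize_scalar are our assumptions.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def multiview_polar(G, n1):
        # G has n1 rows for the x-view followed by n2 rows for the y-view;
        # evaluate kappa°(G) = min_{mu > 0} ||D_mu G||_sp for beta = gamma = 1.
        def spec_norm(mu):
            D = np.concatenate([np.full(n1, np.sqrt(1 + mu)),
                                np.full(G.shape[0] - n1, np.sqrt(1 + 1 / mu))])
            return np.linalg.norm(D[:, None] * G, 2)   # spectral norm of D_mu G

        res = minimize_scalar(lambda s: spec_norm(np.exp(s)),
                              bounds=(-12.0, 12.0), method='bounded')
        mu = np.exp(res.x)
        return spec_norm(mu), mu

    # Tiny usage example: x-view with 8 rows, y-view with 6 rows.
    G = np.random.randn(8 + 6, 40)
    val, mu = multiview_polar(G, n1=8)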

Similar to the case of matrix completion for the single view in Section 4.2.1, the LF in Theorem 5 for this multi-view setting can be bounded in exactly the same way as in (52). The only difference is that instead of ‖W‖tr ≥ ‖W‖F, we now have κ(W) ≥ (1/√2)‖W‖F. To see it, just set µ = 1 in (59), and we immediately obtain κ◦(G) ≤ √2 ‖G‖sp. Dualizing both sides leads to κ(W) ≥ (1/√2)‖W‖tr ≥ (1/√2)‖W‖F. Therefore we can proceed in analogy to (52):

LF ≤ 2L sup_t { (ρ²/α²)‖At‖²F + ‖Wt‖²F } ≤ 2L sup_t { 2ρ²/α² + 2κ(Wt)² } ≤ (4/‖P‖0) (1/α² + 1) C²/λ².    (60)


4.4.2 A Modified Power Iteration

In practice, one can compute the polar (59) much more efficiently by deploying a heuristic variant of the power iteration method, whose motivation arises from the KKT conditions in the proof of Theorem 21 (see (76) in Appendix B.4). In particular, two normalized vectors a and b are optimal for the polar (50) if and only if there exist nonnegative scalars µ1 and µ2 such that

[µ1 a; µ2 b] = GG^⊤ [a; b],   and   µ1 I1 + µ2 I2 ⪰ GG^⊤.    (61)

Here A ⪰ B asserts that A − B is positive semidefinite. Therefore, after randomly initializing a and b with unit norm, it is natural to extend the power iteration method by repeating the update:

[s; t] ← GG^⊤ [a; b],   then   [a_new; b_new] ← [s/‖s‖2; t/‖t‖2].

This procedure usually converges to a solution where the Lagrange multipliers µ1 and µ2, recovered from the first condition in (61), also satisfy the second, hence confirming their optimality. While we leave the investigation of the theoretical guarantees of this power iteration heuristic for future work, let us mention that it works very well in practice (as in the single view case), and when the occasional failure occurs we revert to (59) and follow the recovery rule detailed in Appendix B.4.
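The heuristic update can be sketched as follows (our code; the iteration count and random initialization are arbitrary choices), with the multipliers recovered from the first condition in (61):

    import numpy as np

    def modified_power_iteration(G, n1, iters=200, seed=0):
        # s and t are the x-view and y-view blocks of GG^T [a; b],
        # renormalized separately at every step.
        rng = np.random.default_rng(seed)
        a = rng.standard_normal(n1);                a /= np.linalg.norm(a)
        b = rng.standard_normal(G.shape[0] - n1);   b /= np.linalg.norm(b)
        M = G @ G.T
        for _ in range(iters):
            z = M @ np.concatenate([a, b])
            s, t = z[:n1], z[n1:]
            a, b = s / np.linalg.norm(s), t / np.linalg.norm(t)
        # Multipliers from the first condition in (61); optimality would additionally
        # require mu1*I1 + mu2*I2 - GG^T to be positive semidefinite.
        z = M @ np.concatenate([a, b])
        mu1, mu2 = a @ z[:n1], b @ z[n1:]
        return a, b, mu1, mu2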

5. Experimental Evaluation

We evaluate the extended GCG approach developed above on three sparse learning problems: matrix completion, multi-class classification, and multi-view dictionary learning. All algorithms were implemented in Matlab unless otherwise noted.⁷ All experiments were run on a single core of a cluster housed at National ICT Australia with AMD 6-Core Opteron 4184 (2.8GHz, 3M L2/6M L3 Cache, 95W TDP, 64 GB memory).

5.1 Matrix Completion with Trace Norm Regularization

Our first set of experiments is on matrix completion (51), where the division by ‖P‖0 is omitted to be consistent with existing solvers. Five data sets are used and their statistics are given in Table 1. The data sets Yahoo252M (Dror et al., 2012) and Yahoo60M⁸ provide music ratings, and the data sets Netflix (Dror et al., 2012), MovieLens1M, and MovieLens10M⁹ all provide movie ratings. For MovieLens1M and Yahoo60M, we randomly sampled 75% of the ratings as training data, using the rest for testing. The other three data sets came with a training and testing partition.

We compared the extended GCG approach (Algorithm 3) with three state-of-the-art solvers for trace norm regularized objectives: MMBS¹⁰ (Mishra et al., 2013), DHM (Dudik et al., 2012), and JS (Jaggi and Sulovsky, 2010).

7. All code is available for download at https://www.cs.uic.edu/~zhangx/GCG.
8. Yahoo60M is the R1 data set at http://webscope.sandbox.yahoo.com/catalog.php?datatype=r.
9. Both MovieLens1M and MovieLens10M were downloaded from http://www.grouplens.org/datasets/movielens.
10. Downloaded from http://www.montefiore.ulg.ac.be/~mishra/softwares/traceNorm.html.


data set        n (#user)    m (#product)   #train         #test         λ
MovieLens1M     6,040        3,952          750,070        250,024       50
MovieLens10M    71,567       65,133         9,301,174      698,599       100
Netflix         2,549,429    17,770         99,072,112     1,408,395     1
Yahoo60M        98,130       1,738,442      60,021,577     20,872,219    300
Yahoo252M       1,000,990    624,961        252,800,275    4,003,960     100

Table 1: Summary of the data sets used in matrix completion experiments. Here n is the total number of users, m is the number of products (movies or music), and #train and #test are the number of ratings used for training and testing respectively.

The local optimization in GCG was performed by L-BFGS with the maximum number of iterations set to 20.¹¹ Traditional solvers such as proximal methods (Toh and Yun, 2010; Pong et al., 2010) were not included in this comparison because they are much slower.

MMBS is confined to trace norm regularized problems. At each iteration t, they solve

min_{U ∈ R^{n×t}, B ∈ R^{t×t}, V ∈ R^{t×m}} ℓ(UBV) + λ tr(B),   s.t. U^⊤U = V V^⊤ = I, and B ⪰ 0.    (62)

After obtaining a stationary point, the gradient of ℓ is computed, its top left and right singular vectors are appended to U and V respectively, and the process repeats. Unfortunately, the optimization in (62) is highly challenging on the Stiefel manifold, and their algorithm relies on the directional Hessian, which becomes rather complicated to compute for a general ℓ. By contrast, the local surrogate (53) used in our approach is substantially easier to optimize.

DHM is a special case of GCG, with Improve instantiated by a line search procedure (Dudik et al., 2012). The details of the line search are intricate, but in practice it is slower than the totally corrective updates (42) solved by L-BFGS. To further streamline the optimization of (42), we buffered the quantities ⟨P(ut vt^⊤), P(us vs^⊤)⟩ and ⟨P(ut vt^⊤), P(X)⟩, where (ut, vt) is the result of the polar operator at iteration t (Line 3 of Algorithm 3). This allowed the objective (42) to be evaluated efficiently without having to recompute matrix inner products.

JS was proposed for solving the constrained problem minW ℓ(W) s.t. ‖W‖tr ≤ ζ, which makes it hard to directly compare with solvers for the penalized problem (51). As a workaround, given λ we first found the optimal solution W* for the penalized problem, and then we set ζ = ‖W*‖tr. To solve the constrained problem, JS lifts it to an SDP by observing that ‖W‖tr ≤ t/2 iff there exist symmetric matrices A and B such that Z := (A W; W^⊤ B) ⪰ 0 and tr(A) + tr(B) = t. Then the objective ℓ(W) can be rewritten in terms of Z and solved by the conventional conditional gradient algorithm for constrained problems.

Figures 2 to 6 show how quickly the various algorithms drive the training objective down, while also providing the root mean square error (RMSE) on test data (i.e., comparing the reconstructed matrix at each iteration to the ground truth). JS is comparable only on RMSE. The regularization constant λ, given in Table 1, was chosen from {1, 10, 50, 100, 200, 300, 400, 500, 1000} to minimize the test RMSE.

11. Downloaded from http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html. Recall that a suboptimal solution is sufficient for the fixed-rank local optimization in Relaxed.


Figure 2: Matrix completion on MovieLens1M. (a) Objective value vs CPU time; (b) test RMSE vs CPU time; (c) objective value vs number of iterations.

Figure 3: Matrix completion on MovieLens10M. (a) Objective value vs CPU time; (b) test RMSE vs CPU time; (c) objective value vs number of iterations.

On the MovieLens1M and MovieLens10M data sets, GCG and DHM exhibit similar efficiency in reducing the objective value with respect to CPU time (see Figures 2(a) and 3(a)). However, GCG achieves comparable test RMSE an order of magnitude faster than DHM, as shown in Figures 2(b) and 3(b). Interestingly, this discrepancy can be explained by investigating the rank of the solution, i.e., the number of outer iterations taken. In Figures 2(c) and 3(c), DHM clearly converges much more slowly in terms of the number of iterations, which is expected because it does not re-optimize the basis at each iteration, unlike GCG and MMBS. Therefore, although the overall computational cost for DHM appears on par with that of GCG, its solution has a much higher rank, resulting in a much higher test RMSE in these cases.

MMBS takes nearly the same number of iterations as GCG to achieve the same objective value (see Figures 2(c) and 3(c)). However, its local search at each iteration is conducted on a constrained manifold, making it much more expensive than locally optimizing the unconstrained smooth surrogate (53). As a result, the overall computational complexity of MMBS is an order of magnitude greater than that of GCG, with respect to optimizing both the objective value and the test RMSE.

The observed discrepancies between the relative efficiency of optimizing the objective value and the test RMSE become much smaller on the other three data sets, which are much larger in size. In Figures 4 to 6, it is clear that GCG takes significantly less CPU time to optimize both criteria.


Figure 4: Matrix completion on Netflix. (a) Objective value vs CPU time; (b) test RMSE vs CPU time.

Figure 5: Matrix completion on Yahoo100m. (a) Objective value vs CPU time; (b) test RMSE vs CPU time.

Figure 6: Matrix completion on Yahoo252m. (a) Objective value vs CPU time; (b) test RMSE vs CPU time.

The declining trend of the objective value has a similar shape to that of the test RMSE. Here we observe that DHM is substantially slower than GCG and MMBS, confirming the importance of optimizing the bases at the same time as optimizing their weights. JS, which does not exploit the penalized form of the objective, is considerably slower with respect to reducing the test RMSE.

5.2 Multi-class Classification with Trace Norm Regularization

Next we compared the four algorithms on multi-class classification problems, also in the context of trace norm regularization. Here the task is to predict the class membership of input instances, xi ∈ R^n, where each instance is associated with a unique label yi ∈ {1, . . . , C}. In particular, we consider a standard linear classification model where each class c is associated with a weight vector in R^n, stacked in a model matrix W ∈ R^{n×C}. Then for each training example (xi, yi), the individual logistic loss L(W; xi, yi) is given by

L(W; xi, yi) = − log [ exp(xi^⊤ W:,yi) / Σ_{c=1}^C exp(xi^⊤ W:c) ],

and the complete loss is given by averaging over the training set, ℓ(W) := (1/m) Σ_{i=1}^m L(W; xi, yi).

data set     n         C     #train          #test           λ
Fungus10     4,096     10    10 per class    10 per class    10⁻²
Fungus134    4,096     134   50 per class    50 per class    10⁻³
k1024        131,072   50    50 per class    50 per class    10⁻³

Table 2: Summary of the data sets used in multi-class classification experiments. Here n is the number of features, C is the number of classes, and #train and #test are the number of training and test images respectively.

It is reasonable to assume that W has low rank; i.e., the weight vectors of all classes lie in a low dimensional subspace (Akata et al., 2014), which motivates the introduction of a trace norm regularizer on W, yielding a training problem in the form of (47).
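For concreteness, here is a sketch (ours, not the paper's Matlab implementation) of the multi-class logistic loss ℓ(W) and its gradient, which together with the spectral-norm polar is all that GCG requires in this setting; instances are stored as rows of X here.

    import numpy as np

    def multiclass_logistic_loss(W, X, y):
        # W: (n, C) weights; X: (m, n) instances as rows; y: (m,) labels in {0,...,C-1}.
        m = X.shape[0]
        scores = X @ W                                   # (m, C)
        scores -= scores.max(axis=1, keepdims=True)      # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)                # softmax probabilities
        loss = -np.mean(np.log(P[np.arange(m), y]))
        G = P.copy()
        G[np.arange(m), y] -= 1.0
        grad = X.T @ G / m                               # (n, C) gradient of the average loss
        return loss, grad

The trace norm polar needed in Line 3 of Algorithm 3 is then just the top singular pair of the negative gradient, as in the earlier sketch.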

We conducted experiments on three data sets extracted from the ImageNet repository (Deng et al., 2009), with characteristics shown in Table 2 (Akata et al., 2014).¹² k1024 is from the ILSVRC2011 benchmark, and Fungus10 and Fungus134 are from the Fungus group. All images are represented by the Fisher descriptor.

As in the matrix completion case above, we again compared how fast the algorithms reduce the training objective and improve the test accuracy. Here the value of λ was chosen from {10⁻⁴, 10⁻³, 10⁻², 10⁻¹} to maximize test accuracy. Figures 7 to 9 show that the models produced by GCG achieve comparable (or better) training objectives and test accuracies in far less time than MMBS, DHM and JS. Although DHM is often more efficient than GCG early on, it is soon overtaken. This observation could be related to the diminishing returns of adding bases, as shown in Figures 7(c), 8(c), and 9(c), since the addition of the first few bases yields a much steeper decrease in objective value than that achieved by later bases. Hence, after the initial stage, the computational efficiency of a solver turns out to be dominated by better exploitation of the bases. That is, using added bases as an initialization for local improvement (as in GCG) appears to be more effective than simply finding an optimal conic combination of added bases (as in DHM).

Also notice from Figures 7(c), 8(c) and 9(c) that MMBS and GCG are almost identical in terms of the number of iterations required to achieve a given objective value. However, the manifold based local optimization employed by MMBS at each iteration makes its overall computational time worse than that of DHM and JS. This is a different outcome from the matrix completion experiments we saw earlier, and seems to suggest that the manifold approach in MMBS is more effective for the least squares loss than for the logistic loss.

5.3 Multi-view Dictionary Learning

Our third set of experiments focuses on optimizing the objective for the multi-view dictionary learning model (57) presented in Section 4.4. Here we compared the GCG approach outlined in Section 4.4 with a straightforward local solver based on block coordinate descent (BCD).

12. The three data sets were all downloaded from http://lear.inrialpes.fr/src/jsgd/.


Figure 7: Multi-class classification on Fungus10. (a) Objective value vs CPU time; (b) test accuracy vs CPU time; (c) objective value vs number of iterations.

Figure 8: Multi-class classification on Fungus134. (a) Objective value vs CPU time; (b) test accuracy vs CPU time; (c) objective value vs number of iterations.

Figure 9: Multi-class classification on k1024. (a) Objective value vs CPU time; (b) test accuracy vs CPU time; (c) objective value vs number of iterations.

The BCD solver alternates between (a) fixing H and optimizing A and B (with norm ball constraints), and (b) fixing A and B and optimizing H (with the nonsmooth regularizer). Both steps were solved by FISTA (Beck and Teboulle, 2009), initialized with the solution from the previous iteration. We tested several values of t, the dimensionality of the latent subspace of the solution, in these experiments. The GCG approach used Algorithm 3 with the local objective (53) instantiated by the norm expression in (58) and solved by L-BFGS with the maximum number of iterations set to 20. We initialized the U and V matrices with 20 columns and rows respectively, each drawn i.i.d. from a uniform distribution on the unit sphere.


Based on the noise model we consider in these experiments, we started with a base loss ℓ set to l1(Z, Z*) = Σij |Zij − Z*ij|, where Z* is the true underlying matrix; however, following (Becker et al., 2009, Eq. 3.1), we smoothed this l1 loss by adding a quadratic prox-function with strong convexity modulus 10⁻⁴. Under the resulting loss, the optimization over A and B is decoupled given H. We terminated all algorithms once a maximum running time (3,600 seconds) was reached.

5.3.1 Synthetic Data

We first constructed experiments on a synthetic data set that fulfills the conditional independence assumption (57) of the multi-view model (as discussed in Section 4.4). Here the x- and y-views have basis matrices A* ∈ R^{n1×t*} and B* ∈ R^{n2×t*} respectively, where all columns of A* were drawn i.i.d. from the l2 sphere of radius β = 1, and all columns of B* were drawn i.i.d. from the l2 sphere of radius γ = 5. We set n1 = 800, n2 = 1000, and t* = 10 (so that the dictionary size is indeed small). The latent representation matrix H* ∈ R^{t*×m} has all elements drawn i.i.d. from the zero-mean unit-variance Gaussian distribution. The number of training examples is set to m = 200, which is much lower than the numbers of features n1 and n2. Then the clean versions of the x-view and y-view were generated by X* = A*H* and Y* = B*H* respectively, while the noisy versions, X and Y, were obtained by adding i.i.d. noise to 15% of the entries of X* and Y* selected uniformly at random. The noise is uniformly distributed in [0, 10]. Given the noisy observations, the denoising task is to find reconstructions X̂ and Ŷ for the two views, such that the error ‖X̂ − X*‖²F + ‖Ŷ − Y*‖²F is small. The composite training problem is formulated in (57).
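For reproducibility, the synthetic two-view data described above could be generated as in the following sketch (ours; the random seed is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2, t_true, m = 800, 1000, 10, 200
    beta, gamma = 1.0, 5.0

    def random_cols_on_sphere(n, t, radius):
        # Columns drawn i.i.d. uniformly from the l2 sphere of the given radius.
        A = rng.standard_normal((n, t))
        return radius * A / np.linalg.norm(A, axis=0, keepdims=True)

    A_true = random_cols_on_sphere(n1, t_true, beta)
    B_true = random_cols_on_sphere(n2, t_true, gamma)
    H_true = rng.standard_normal((t_true, m))
    X_clean, Y_clean = A_true @ H_true, B_true @ H_true

    def corrupt(M, frac=0.15, high=10.0):
        # Add uniform [0, high] noise to a random 15% of the entries.
        M = M.copy()
        idx = rng.random(M.shape) < frac
        M[idx] += rng.uniform(0.0, high, size=idx.sum())
        return M

    X, Y = corrupt(X_clean), corrupt(Y_clean)
    Z = np.vstack([X, Y])        # the concatenated observation matrix used in (57)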

The results of comparing the modified GCG method with local optimization are presented in Figures 10 to 12, where the regularization parameter λ has been varied among {80, 60, 40}. In each case, GCG optimizes the objective value significantly faster than BCD for all values of t chosen for BCD. The differences between methods are even more evident when considering the reconstruction error. In particular, although the local optimization in GCG already achieves a low objective value in the first outer iteration (see Figures 10(a) and 11(a)), the reconstruction error remains high, thus requiring further effort to reduce it.

When t = 10, which equals the true rank t* = 10, BCD is more likely to get stuck in a poor local minimum when λ is large, for example as shown in Figure 10(a) for λ = 80. This occurs because when optimizing H, the strong shrinkage in the proximal update stifles major changes in H (and in A and B consequently). This problem is exacerbated when the rank of the solution is overly restricted, leading to even worse local optima. On the other hand, under-regularization does indeed cause overfitting in this case; for example, as shown in Figure 12 for λ = 40. Interestingly, BCD gets trapped in local optima for all t, while GCG escapes suboptimal solutions and eventually finds a way to considerably reduce the objective value. However, this simply creates a dramatic increase in the test reconstruction error, a typical phenomenon of overfitting.


Figure 10: Denoising on synthetic data, λ = 80. (a) Objective value vs CPU time; (b) reconstruction error vs CPU time.

Figure 11: Denoising on synthetic data, λ = 60. (a) Objective value vs CPU time; (b) reconstruction error vs CPU time.

Figure 12: Denoising on synthetic data, λ = 40. (a) Objective value vs CPU time; (b) reconstruction error vs CPU time.

5.3.2 Image Denoising

We then conducted multi-view dictionary learning experiments on natural image data. The data set we considered is based on the Extended Yale Face Database B¹³ (Georghiades et al., 2001). In particular, the data set we used consists of grey level face images of 28 human subjects, each with 9 poses and 64 lighting conditions. To construct the data set, we set the x-view to a fixed lighting (+000E+00) and the y-view to a different fixed lighting (+000E+20). We obtained paired views by randomly drawing a subject and a pose under the two lighting conditions. The underlying assumption is that each lighting has its own set of bases (A and B) and each (person, pose) pair has the same latent representation for the two lighting conditions. All images were down-sampled to 50-by-50, meaning n1 = n2 = 2500, and we randomly selected t = 50 pairs of (person, pose) for training. The x-view was kept clean, while pixel errors were added to the y-view by randomly setting 10% of the pixel values to 1. This noise model mimics natural phenomena such as occlusion and loss of pixel information from image transfer.

13. Data downloaded from http://vision.ucsd.edu/content/extended-yale-face-database-b-b.


Figure 13: Denoising on face data with λ = 40. (a) Objective value vs CPU time; (b) reconstruction error vs CPU time.

Figure 14: Denoising on face data with λ = 30. (a) Objective value vs CPU time; (b) reconstruction error vs CPU time.

Figure 15: Denoising on face data with λ = 20. (a) Objective value vs CPU time; (b) reconstruction error vs CPU time.

The goal is again to enable appropriate reconstruction of a noisy image using the other view. Naturally, we set β = γ = 1 since the data lies in the [0, 1] interval.

Figures 13 to 15 show the results comparing GCG with the local BCD for different choices of t for BCD. Clearly the value of λ that yields the lowest reconstruction error is 40. However, here BCD simply gets stuck at the trivial local minimum H = 0, as shown in Figure 13(a). This occurs because under the random initialization of A and B, the proximal update simply sets H to 0 when the regularization is too strong. When λ is decreased to 30, as shown in Figure 14(a), BCD can escape the trivial H = 0 solution for t = 100, but again it becomes trapped in another local minimum that remains much worse than the globally optimal solution found by GCG. When λ is further reduced to 20, the BCD algorithms finally escape the trivial H = 0 point for all t, but still fail to find an acceptable solution. Although GCG again converges to a much better global solution in terms of the objective value, overfitting can still occur: in Figure 15(b) the reconstruction error of GCG increases significantly after an initial decline.


6. Conclusion

We have presented a unified treatment of the generalized conditional gradient (GCG) algorithm that covers several useful extensions. After a thorough treatment of its convergence properties, we illustrated how an extended form of GCG can efficiently handle a convex gauge regularizer in particular. We illustrated the application of these GCG methods in learning low rank matrices, instantiated with examples on matrix completion, dictionary learning, and multi-view learning. In each case, we considered a generic convex relaxation procedure and focused on the efficient computation of the polar operator, one of the key steps in obtaining an efficient implementation of GCG. To further accelerate its convergence, we interleaved the standard GCG update with a fixed rank subspace optimization, which greatly improves performance in practice without affecting the theoretical convergence properties. Extensive experiments on matrix completion, multi-class classification, and multi-view learning confirmed the superiority of the modified GCG approach in practice.

For future work, several interesting extensions are worth investigating. The current form of GCG is inherently framed as a batch method, where each iteration requires the gradient of the loss to be computed over all training data. When the training data is large, it is preferable to only use a random subset, or to distribute the training data across several processors. Another interesting direction is to interpolate between the polar operator and proximal updates; that is, one could incorporate a mini-batch of atoms at each iteration. For example, for trace norm regularization, one can consider adding the k largest singular vector pairs instead of just the single largest. It remains an open question whether such a "mini-batch" of atoms can provably accelerate the overall optimization, but some recent work (Hsieh and Olsen, 2014) has demonstrated promise in practice.

Acknowledgments

This research was supported by the Alberta Innovates Center for Machine Learning (AICML), the Canada Research Chairs program, and NSERC. Most of this work was carried out when Y. Yu and X. Zhang were with the University of Alberta and AICML. X. Zhang would like to thank the support of NICTA, where he worked as a researcher from Oct 2012 to Oct 2015 and conducted 50% of his work on this paper. We thank Csaba Szepesvari, the action editor, and the reviewers at various stages of this work for the many constructive comments.

Appendix A. Proofs for Section 3

A.1 Proof of Theorem 2

Since ∇ℓ is Lipschitz continuous, it must be uniformly continuous on bounded sets.

(a): Let C be the closure of the sequence {wt}t. Due to the compactness assumption on the sublevel set and the monotonicity of F(wt), C is compact. Moreover C ⊆ dom f, since for every cluster point, say w, of wt we have from the closedness of F that F(w) ≤ lim inf F(wtk) ≤ F(w0) < ∞. Since −∇ℓ is continuous, −∇ℓ(C) is a compact subset of int(dom f*), the interior of dom f*. Note that f* is continuous on the interior of its domain, therefore its subdifferential is locally bounded on −∇ℓ(C); see, e.g., (Borwein and Vanderwerff, 2010, Proposition 4.1.26).


A standard compactness argument then establishes the boundedness of (∂f*)(−∇ℓ(C)). Thus {dt}t is bounded.

(b): Note first that the boundedness of {wt}t follows immediately from the boundedness assumption on the sublevel set, thanks to the monotonic property of F(wt). Since ∇ℓ is uniformly continuous, the set {−∇ℓ(wt)}t is again bounded. On the other hand, we know from (Borwein and Vanderwerff, 2010, Theorem 4.4.13, Proposition 4.1.25) that f is supercoercive iff ∂f* maps bounded sets into bounded sets. Therefore {dt}t is again bounded.

(c): This is clear. Note that the convexity of $\ell$ is not used in this proof.

A.2 Proof of Theorem 4

Since $\ell$ is $L$-smooth, for the sequences $\{w_t\}$ and $\{d_t\}$ generated by Algorithm 1, and for all $\eta \in [0, 1]$, we have
\[
\ell(w_t + \eta(d_t - w_t)) \le \ell(w_t) + \eta\,\langle d_t - w_t, \nabla\ell(w_t)\rangle + \frac{L\eta^2}{2}\,\|d_t - w_t\|^2. \tag{63}
\]

Since the subroutine is Relaxed and the subproblem (13) is solved up to some $\varepsilon_t \ge 0$,
\begin{align*}
F(w_{t+1}) &\le \ell(w_t) + \eta_t\,\langle d_t - w_t, \nabla\ell(w_t)\rangle + \frac{L\eta_t^2}{2}\|d_t - w_t\|^2 + (1-\eta_t) f(w_t) + \eta_t f(d_t)\\
&\le F(w_t) - \eta_t G(w_t) + \frac{L\eta_t^2}{2}\|d_t - w_t\|^2 + \eta_t \varepsilon_t\\
&= F(w_t) - \eta_t G(w_t) + \eta_t^2\big(\varepsilon_t/\eta_t + \tfrac{L}{2}\|d_t - w_t\|^2\big).
\end{align*}
Define $\Delta_t := F(w_t) - F(w)$ and $G_t := G(w_t)$. Thus
\[
\Delta_{t+1} \le \Delta_t - \eta_t G_t + \eta_t^2\big(\varepsilon_t/\eta_t + \tfrac{L}{2}\|d_t - w_t\|^2\big), \quad\text{and}\quad \Delta_t \le G_t. \tag{64}
\]

Plug the latter into the former and expand:
\[
\Delta_{t+1} \le \pi_t (1-\eta_0)\Delta_0 + \sum_{s=0}^{t} \frac{\pi_t}{\pi_s}\,\eta_s^2\big(\varepsilon_s/\eta_s + \tfrac{L}{2}\|d_s - w_s\|^2\big). \tag{65}
\]

To prove the second claim, we have from (64) that $\eta_t G_t \le \Delta_t - \Delta_{t+1} + \eta_t^2(\varepsilon_t/\eta_t + \tfrac{L}{2}\|d_t - w_t\|^2)$. Summing from $k$ to $t$:
\[
\Big(\min_{k \le s \le t} G_s\Big) \sum_{s=k}^{t} \eta_s \;\le\; \sum_{s=k}^{t} \eta_s G_s \;\le\; F(w_k) - F(w_{t+1}) + \sum_{s=k}^{t} \eta_s^2\big(\varepsilon_s/\eta_s + \tfrac{L}{2}\|d_s - w_s\|^2\big).
\]

Rearranging completes the proof.

A.3 Proof of Theorem 5

As $\eta_t = \frac{2}{t+2}$, we have $\eta_0 = 1$ and $\pi_t = \frac{2}{(t+1)(t+2)}$. For the first claim (upper bound on $F(w_t)$) all we need to verify is that, by induction, $\frac{1}{(t+1)(t+2)} \sum_{s=0}^{t} \frac{s+1}{s+2} \le \frac{1}{t+4}$. This is elementary.
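One way to verify this elementary claim (a sketch; the claim is equivalent to $\sum_{s=0}^{t}\frac{s+1}{s+2} \le \frac{(t+1)(t+2)}{t+4}$) is by induction. The base case $t = 0$ holds with equality ($\tfrac{1}{2} = \tfrac{1\cdot 2}{4}$), and for the inductive step,
\begin{align*}
\sum_{s=0}^{t+1}\frac{s+1}{s+2} \;\le\; \frac{(t+1)(t+2)}{t+4} + \frac{t+2}{t+3} \;=\; \frac{(t+2)(t^2+5t+7)}{(t+3)(t+4)} \;\le\; \frac{(t+2)(t+3)}{t+5},
\end{align*}
where the last inequality reduces, after clearing denominators and expanding, to $32t + 35 \le 33t + 36$, which holds for all $t \ge 0$.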

We prove the second claim by a sequence of calculations similar to that in (Freund and Grigas, 2016). First, using (19) and the bound on $F(w_t)$ in (20) with $t = 1$ and $w = w_2$:
\[
G_1^1 \le \frac{1}{\eta_1}\Big(F(w_1) - F(w_2) + \frac{\eta_1^2}{2}(\delta + L_F)\Big) \le \frac{3}{2}\Big(\frac{1}{2}(\delta + L_F) + \frac{2}{9}(\delta + L_F)\Big) = \frac{13}{12}(\delta + L_F),
\]
proving the second claim on $G_1^t$ for $t = 1, 2$ (note that $G_1^2 \le G_1^1$ by definition). For $t \ge 3$, we consider $k = t/2 - 1$ if $t$ is even and $k = (t+1)/2 - 1$ otherwise. Clearly $k \ge 1$, hence

\begin{align*}
\sum_{s=k}^{t} \eta_s &= 2\sum_{s=k}^{t} \frac{1}{s+2} \;\ge\; 2\int_{k-1}^{t} \frac{1}{s+2}\,ds \;=\; 2\ln\frac{t+2}{k+1} \;\ge\; 2\ln 2,\\
\sum_{s=k}^{t} \eta_s^2/2 &= 2\sum_{s=k}^{t} \frac{1}{(s+2)^2} \;\le\; 2\int_{k}^{t+1} \frac{1}{(s+2)^2}\,ds \;=\; 2\Big(\frac{1}{k+2} - \frac{1}{t+3}\Big).
\end{align*}

Using again (19), and the bound on $F(w_t)$ in (20) with $t = k$ and $w = w_{t+1}$:
\[
G_1^t \le G_k^t \le \frac{\delta + L_F}{2\ln 2}\Big(\frac{2}{k+3} + \frac{2}{k+2} - \frac{2}{t+3}\Big) \le \frac{\delta + L_F}{\ln 2}\Big(\frac{2}{t+2} + \frac{1}{t+3}\Big) \le \frac{3(\delta + L_F)}{t\ln 2}.
\]

A.4 Proof of Theorem 8

A proof can be based on the simple observation that the objective value $F^\star$ satisfies
\[
F^\star := \inf_{w}\{\ell(w) + f(w)\} = \inf_{w,\rho:\kappa(w)\le\rho} \ell(w) + h(\rho). \tag{66}
\]
Note that if $\rho$ were known, the theorem could have been proved as before. The key idea behind the step size choice (34) is to ensure that Algorithm 2 performs almost the same as if it knew the unknown but fixed constant $\rho = \kappa(w)$ beforehand.

Consider an arbitrary $w$ and let $\rho = \kappa(w)$. We also use the shorthand $F_t := \ell(w_t) + h(\rho_t) \ge \ell(w_t) + f(w_t) = F(w_t)$. Then, the following chain of inequalities can be verified:
\begin{align*}
F_{t+1} &:= \ell(w_{t+1}) + h(\rho_{t+1})\\
&\le F_t + \langle \theta_t a_t - \eta_t w_t, \nabla\ell(w_t)\rangle + \tfrac{L}{2}\|\theta_t a_t - \eta_t w_t\|^2 - \eta_t h(\rho_t) + \eta_t h(\theta_t/\eta_t)\\
&\le F_t + \eta_t\Big\langle \tfrac{\rho}{\alpha_t} a_t - w_t, \nabla\ell(w_t)\Big\rangle + \tfrac{L\eta_t^2}{2}\Big\|\tfrac{\rho}{\alpha_t} a_t - w_t\Big\|^2 - \eta_t h(\rho_t) + \eta_t h(\rho/\alpha_t)\\
&\le F_t + \min_{z:\kappa(z)\le\rho} \eta_t\langle z - w_t, \nabla\ell(w_t)\rangle + \eta_t\rho\varepsilon_t + \tfrac{L\eta_t^2}{2}\Big\|\tfrac{\rho}{\alpha_t} a_t - w_t\Big\|^2 - \eta_t h(\rho_t) + \eta_t h(\rho/\alpha_t)\\
&= F_t + \min_{z:\kappa(z)\le\rho} \eta_t\langle z - w_t, \nabla\ell(w_t)\rangle - \eta_t h(\rho_t) + \eta_t h(\rho) + \eta_t^2\underbrace{\Big(\tfrac{L}{2}\Big\|\tfrac{\rho}{\alpha_t} a_t - w_t\Big\|^2 + \big(\rho\varepsilon_t + h(\rho/\alpha_t) - h(\rho)\big)/\eta_t\Big)}_{:=\delta_t}\\
&= F_t + \eta_t^2\delta_t - \eta_t\underbrace{\Big[\langle w_t, \nabla\ell(w_t)\rangle + h(\rho_t) - \min_{z:\kappa(z)\le\rho}\big(\langle z, \nabla\ell(w_t)\rangle + h(\rho)\big)\Big]}_{:=G(w_t)},
\end{align*}
where the first inequality is because the subroutine is Relaxed, the second inequality follows from the minimality of $\theta_t$ in (34), and the third inequality is due to the choice of $a_t$ in Line 3 of Algorithm 2.


Recall that $\rho = \kappa(w)$. Moreover, due to the convexity of $\ell$,
\begin{align*}
G(w_t) &= \langle w_t, \nabla\ell(w_t)\rangle + h(\rho_t) - \min_{z:\kappa(z)\le\rho}\big(\langle z, \nabla\ell(w_t)\rangle + h(\rho)\big)\\
&= h(\rho_t) - h(\kappa(w)) + \max_{z:\kappa(z)\le\kappa(w)} \langle w_t - z, \nabla\ell(w_t)\rangle\\
&\ge h(\rho_t) - h(\kappa(w)) + \max_{z:\kappa(z)\le\kappa(w)} \ell(w_t) - \ell(z)\\
&\ge F_t - F(w).
\end{align*}
Thus we have retrieved the recursion:
\[
F_{t+1} - F(w) \le F_t - F(w) - \eta_t G(w_t) + \eta_t^2\delta_t, \quad\text{and}\quad F_t - F(w) \le G(w_t).
\]
Summing the indices as in the proof of Theorem 4 and noting that $F(w_t) \le F_t$ for all $t$ completes the proof.

Appendix B. Proofs for Section 4

B.1 Proof of Theorem 15

The first equivalence in Theorem 15 is well-known since Grothendieck's work on tensor products of Banach spaces (where it is usually called the projective tensor norm). The second equality follows directly from the arithmetic-geometric mean inequality.

We first note that the atomic set $A$ in (48) is compact, and so is its convex hull $\operatorname{conv} A$. It is also easy to see that $A$ is a connected set (by considering the continuous map $(u, v) \mapsto uv^\top$).

By (4),
\begin{align*}
\kappa(W) &= \inf\Big\{\rho \ge 0 : W = \rho\sum_i \sigma_i a_i,\ \sigma_i \ge 0,\ \sum_i \sigma_i = 1,\ a_i \in A\Big\} \tag{67}\\
&= \inf\{\rho \ge 0 : W \in \rho\operatorname{conv} A\}.
\end{align*}

Since $\operatorname{conv} A$ is compact with $0 \in \operatorname{int}(\operatorname{conv} A)$ we know the infimum above is attained. Thus there exist $\rho \ge 0$ and $C \in \operatorname{conv} A$ so that $W = \rho C$ and $\kappa(W) = \rho$. Applying Caratheodory's theorem for connected sets (Rockafellar and Wets, 1998, Theorem 2.29) we know $C = \sum_{i=1}^{t}\sigma_i u_i v_i^\top$ for some $\sigma_i \ge 0$, $\sum_i \sigma_i = 1$, $u_i v_i^\top \in A$ and $t \le mn$. Let $U = \sqrt{\rho}\,[\sqrt{\sigma_1}u_1, \ldots, \sqrt{\sigma_t}u_t]$, and $V = \sqrt{\rho}\,[\sqrt{\sigma_1}v_1, \ldots, \sqrt{\sigma_t}v_t]^\top$. We then have $W = UV$ and $\kappa(W) = \rho \ge \frac{1}{2}\sum_{i=1}^{t}\big(\|U_{:i}\|_c^2 + \|V_{i:}\|_r^2\big)$, which proves one side of Theorem 15.

On the other hand, consider any $U$ and $V$ that satisfy
\[
W = UV = \Big(\sum_{j=1}^{t}\|U_{:j}\|_c\|V_{j:}\|_r\Big)\sum_{i=1}^{t}\frac{\|U_{:i}\|_c\|V_{i:}\|_r}{\sum_{j=1}^{t}\|U_{:j}\|_c\|V_{j:}\|_r}\,\frac{U_{:i}}{\|U_{:i}\|_c}\frac{V_{i:}}{\|V_{i:}\|_r},
\]
assuming w.l.o.g. that $\|U_{:i}\|_c\|V_{i:}\|_r \ne 0$ for all $1 \le i \le t$. This, together with the definition (67), gives the other half of the equality: $\kappa(W) \le \sum_{i=1}^{t}\|U_{:i}\|_c\|V_{i:}\|_r$. The proof is completed by making the elementary observation that $\sum_{i=1}^{t}\|U_{:i}\|_c\|V_{i:}\|_r \le \frac{1}{2}\sum_{i=1}^{t}\big(\|U_{:i}\|_c^2 + \|V_{i:}\|_r^2\big)$.
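As a quick numerical illustration (not part of the proof; it assumes the special case where both $\|\cdot\|_c$ and $\|\cdot\|_r$ are Euclidean norms, in which case the characterization above reduces to the familiar variational form of the trace norm), the balanced factorization built from the SVD attains the minimum, while unbalanced factorizations can only do worse:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# Balanced factorization built from the SVD of W.
U0, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U0 * np.sqrt(s)            # scale columns by sqrt(singular values)
V = np.sqrt(s)[:, None] * Vt   # scale rows by sqrt(singular values)

nuclear = s.sum()
assert np.allclose(U @ V, W)
assert np.isclose(0.5 * (np.sum(U**2) + np.sum(V**2)), nuclear)

# An unbalanced factorization of the same W gives a larger value.
d = np.array([2.0, 0.5, 1.0, 1.0])
U2, V2 = U * d, (1.0 / d)[:, None] * V
assert np.allclose(U2 @ V2, W)
assert 0.5 * (np.sum(U2**2) + np.sum(V2**2)) >= nuclear - 1e-12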


B.2 Proof of Theorem 16

Since $U_{\mathrm{init}} = (\sqrt{1-\eta_t}\,U_t, \sqrt{\theta_t}\,u_t)$ and $V_{\mathrm{init}} = (\sqrt{1-\eta_t}\,V_t^\top, \sqrt{\theta_t}\,v_t)^\top$, it is obvious that $W_{t+1} = U_{\mathrm{init}}V_{\mathrm{init}}$. Note $\|u_t\|_c \le 1$ and $\|v_t\|_r \le 1$. By construction
\begin{align*}
\ell(W_{t+1}) + \lambda\rho_{t+1} &= \ell(U_{t+1}V_{t+1}) + \frac{\lambda}{2}\sum_{i=1}^{t+1}\big(\|(U_{t+1})_{:i}\|_c^2 + \|(V_{t+1})_{i:}\|_r^2\big)\\
&= F_{t+1}(U_{t+1}, V_{t+1}) \le F_{t+1}(U_{\mathrm{init}}, V_{\mathrm{init}})\\
&\le \ell\big((1-\eta_t)U_tV_t + \theta_t u_tv_t^\top\big) + \lambda\theta_t + \lambda\frac{1-\eta_t}{2}\sum_{i=1}^{t}\big(\|(U_t)_{:i}\|_c^2 + \|(V_t)_{i:}\|_r^2\big)\\
&= \ell\big((1-\eta_t)W_t + \theta_t u_tv_t^\top\big) + \lambda(1-\eta_t)\rho_t + \lambda\theta_t, \tag{68}
\end{align*}
where the inequality is by the definition of $U_{\mathrm{init}}$ and $V_{\mathrm{init}}$, and the last equality is by the definition of $\rho_t$ in (54). In addition, $\kappa(W_{t+1}) \le \rho_{t+1}$ follows from (54) and Theorem 15.
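For concreteness, a minimal sketch (assumptions: NumPy arrays with $U_t \in \mathbb{R}^{m\times t}$, $V_t \in \mathbb{R}^{t\times n}$, and the atom $(u_t, v_t)$ returned by the polar oracle) of the warm start used in the proof above is:

import numpy as np

def warm_start(U_t, V_t, u_t, v_t, eta_t, theta_t):
    # Append the new atom with weight theta_t while shrinking the old factors,
    # so that U_init @ V_init == (1 - eta_t) * U_t @ V_t + theta_t * u_t v_t^T.
    U_init = np.hstack([np.sqrt(1.0 - eta_t) * U_t, np.sqrt(theta_t) * u_t.reshape(-1, 1)])
    V_init = np.vstack([np.sqrt(1.0 - eta_t) * V_t, np.sqrt(theta_t) * v_t.reshape(1, -1)])
    return U_init, V_init

The local solver can then decrease $F_{t+1}$ starting from $(U_{\mathrm{init}}, V_{\mathrm{init}})$, and the chain of inequalities above shows that doing so never performs worse than the plain GCG update.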

B.3 Proof of Theorem 21

By (50), the square of the polar is
\[
(\kappa^\circ(G))^2 = \max\Big\{c^\top ZZ^\top c : \|a\|_2 \le 1,\ \|b\|_2 \le 1\Big\}, \quad\text{where } c = \begin{bmatrix} a\\ b \end{bmatrix}. \tag{69}
\]
Since it maximizes a convex quadratic function over a convex set, there must be a maximizer attained at an extreme point of the feasible region. Therefore the problem is equivalent to
\begin{align*}
(\kappa^\circ(G))^2 &= \max\Big\{c^\top ZZ^\top c : \|a\|_2 = 1,\ \|b\|_2 = 1\Big\} \tag{70}\\
&= \max\Big\{\operatorname{tr}(SZZ^\top) : \operatorname{tr}(SI_1) = 1,\ \operatorname{tr}(SI_2) = 1,\ S \succeq 0,\ \operatorname{rank}(S) = 1\Big\}, \tag{71}
\end{align*}

where $S = cc^\top$. Key to this proof is the fact that the rank one constraint can be dropped without breaking the equivalence. To see why, notice that without this rank constraint the feasible region is the intersection of the positive semidefinite cone with two hyperplanes, hence the rank of all its extreme points must be upper bounded by one (Pataki, 1998). Furthermore the linearity of the objective implies that there must be a maximizing solution attained at an extreme point of the feasible region. Therefore we can continue by

\begin{align*}
(\kappa^\circ(G))^2 &= \max\Big\{\operatorname{tr}(SZZ^\top) : \operatorname{tr}(SI_1) = 1,\ \operatorname{tr}(SI_2) = 1,\ S \succeq 0\Big\}\\
&= \max_{S\succeq 0}\ \min_{\mu_1,\mu_2\in\mathbb{R}}\ \operatorname{tr}(SGG^\top) - \mu_1(\operatorname{tr}(SI_1) - 1) - \mu_2(\operatorname{tr}(SI_2) - 1)\\
&= \min_{\mu_1,\mu_2\in\mathbb{R}}\ \max_{S\succeq 0}\ \operatorname{tr}(SGG^\top) - \mu_1(\operatorname{tr}(SI_1) - 1) - \mu_2(\operatorname{tr}(SI_2) - 1)\\
&= \min\Big\{\mu_1 + \mu_2 : \mu_1, \mu_2 \in \mathbb{R},\ GG^\top \preceq \mu_1 I_1 + \mu_2 I_2\Big\} \tag{72}\\
&= \min\Big\{\mu_1 + \mu_2 : \mu_1 \ge 0,\ \mu_2 \ge 0,\ \|D_{\mu_2/\mu_1}G\|_{\mathrm{sp}}^2 \le \mu_1 + \mu_2\Big\} \tag{73}\\
&= \min\Big\{\|D_\mu G\|_{\mathrm{sp}}^2 : \mu \ge 0\Big\}, \tag{74}
\end{align*}


where the third equality is due to the Lagrangian duality, the fifth equality follows from the equivalence $\|D_{\mu_2/\mu_1}G\|_{\mathrm{sp}}^2 \le \mu_1 + \mu_2 \iff D_{\mu_2/\mu_1}GG^\top D_{\mu_2/\mu_1} \preceq (\mu_1+\mu_2)I \iff GG^\top \preceq (\mu_1+\mu_2)D_{\mu_2/\mu_1}^{-1}D_{\mu_2/\mu_1}^{-1} = \mu_1 I_1 + \mu_2 I_2$, and the last equality is obtained through the re-parameterization $\mu = \mu_2/\mu_1$, $\nu = \mu_1 + \mu_2$ and the elimination of $\nu$.

B.4 Solving the multi-view polar

Here we show how to efficiently compute the polar (dual norm) in the multi-view setting. Let us first backtrack in (73), which produces the optimal $\mu_1$ and $\mu_2$ via
\[
\mu_1 + \mu_2 = \|D_\mu G\|_{\mathrm{sp}}^2, \quad\text{and}\quad \frac{\mu_2}{\mu_1} = \mu. \tag{75}
\]
Then the KKT condition for the step from (72) to (73) can be written as
\[
M \succeq 0, \quad Mc = 0, \quad \|a\|_2 = \|b\|_2 = 1, \quad\text{where } M := \mu_1 I_1 + \mu_2 I_2 - GG^\top. \tag{76}
\]

By duality, this is a sufficient and necessary condition for $a$ and $b$ to be the solution of the polar operator. Given the optimal $\mu_1$ and $\mu_2$, the first condition $M \succeq 0$ must have been satisfied already. Let the null space of $M$ be spanned by an orthonormal basis $\{q_1, \ldots, q_k\}$. Then we can parameterize $c$ as $c = Q\alpha$ where $Q = [q_1, \ldots, q_k] =: \begin{bmatrix} Q_1\\ Q_2 \end{bmatrix}$. By (76),
\[
0 = \|a\|_2^2 - \|b\|_2^2 = \alpha^\top\big(Q_1^\top Q_1 - Q_2^\top Q_2\big)\alpha. \tag{77}
\]
Let $P\Sigma P^\top$ be the eigen-decomposition of $Q_1^\top Q_1 - Q_2^\top Q_2$ with $\Sigma$ being diagonal and $P$ being unitary. Let $v = P^\top\alpha$. Then $c = QPv$ and (77) simply becomes $v^\top\Sigma v = 0$. In addition, (76) also implies $2 = \|c\|_2^2 = v^\top P^\top Q^\top QPv = v^\top v$. So finally the optimal solution to the polar problem (70) can be completely characterized by
\[
\Big\{\begin{bmatrix} a\\ b \end{bmatrix} = QPv : v^\top\Sigma v = 0,\ \|v\|_2^2 = 2\Big\}, \tag{78}
\]
which is simply a linear system after a straightforward change of variable. The major computational cost for recovery is to find the null space of $M$, which can be achieved by QR decomposition. The complexity of eigen-decomposition for $Q_1^\top Q_1 - Q_2^\top Q_2$ depends on the dimension of the null space of $M$, which is usually low in practice.
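The following is a minimal sketch of this recovery step (an assumption-laden illustration rather than the paper's implementation: it takes the optimal $\mu_1, \mu_2$ as given, uses a symmetric eigen-decomposition instead of QR to obtain the null space, and assumes the mixed-sign case $\Sigma_{\min} \le 0 \le \Sigma_{\max}$ that holds at optimality):

import numpy as np

def recover_atom(G, I1, I2, mu1, mu2, d1, tol=1e-8):
    # Recover (a, b) from the null space of M := mu1*I1 + mu2*I2 - G G^T,
    # following (76)-(78); d1 is the dimension of the first view.
    M = mu1 * I1 + mu2 * I2 - G @ G.T
    evals, evecs = np.linalg.eigh(M)
    Q = evecs[:, evals < tol]                          # orthonormal basis of null(M)
    Q1, Q2 = Q[:d1, :], Q[d1:, :]
    Sigma, P = np.linalg.eigh(Q1.T @ Q1 - Q2.T @ Q2)   # ascending eigenvalues
    lo, hi = Sigma[0], Sigma[-1]
    v = np.zeros(len(Sigma))
    if abs(lo) < tol or abs(hi) < tol:      # a null direction already balances the two views
        v[0 if abs(lo) < tol else -1] = np.sqrt(2.0)
    else:                                   # mix extreme directions so that v' Sigma v = 0
        w = hi / (hi - lo)                  # weight on the most negative direction
        v[0], v[-1] = np.sqrt(2.0 * w), np.sqrt(2.0 * (1.0 - w))
    c = Q @ (P @ v)                         # c = Q P v with ||c||^2 = 2
    return c[:d1], c[d1:]                   # a and b, each of unit Euclidean norm

The dominant cost is forming the null space of $M$, which, as noted above, could equally be obtained from a QR decomposition.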

References

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Good practice in large-scale learning for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):507–520, 2014.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

Francis Bach. Convex relaxations of structured matrix factorizations. HAL:00861118, 2013.


Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129, 2015.

Francis Bach and Michael Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2006.

Francis Bach, Julien Mairal, and Jean Ponce. Convex sparse matrix factorizations. arXiv:0812.1869v1, 2008.

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.

Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions, 2015.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Stephen Becker, Jerome Bobin, and Emmanuel J. Candès. NESTA: A fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1–39, 2009.

Jonathan M. Borwein and John D. Vanderwerff. Convex Functions: Constructions, Characterizations and Counterexamples. Cambridge University Press, 2010.

David M. Bradley and James Andrew Bagnell. Convex coding. In Conf. on Uncertainty in Artificial Intelligence, 2009.

Kristian Bredies and Dirk A. Lorenz. Iterated hard shrinkage for minimization problems with sparsity constraints. SIAM Journal on Scientific Computing, 30(2):657–683, 2008.

Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data. Springer, 2011.

Samuel Burer and Renato D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.

Michael D. Canon and Clifton D. Cullum. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6:509–516, 1968.

Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.


H. Cheng, Y. Yu, X. Zhang, E. Xing, and D. Schuurmans. Scalable and sound low-rank tensor learning. In Proc. Intl. Conf. on Artificial Intelligence and Statistics, 2016.

Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):1–30, 2010.

Tijl De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–170, 2005.

Vladimir Fedorovich Dem'yanov and Alexander M. Rubinov. The minimization of a smooth convex functional on a convex set. SIAM Journal on Control, 5:280–294, 1967.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.

Gideon Dror, Noam Koenigstein, Yehuda Koren, and Markus Weimer. The Yahoo! music dataset and KDD-Cup'11. JMLR Workshop and Conference Proceedings: Proceedings of KDD Cup 2011 Competition, 18:3–18, 2012.

Miroslav Dudik, Zaid Harchaoui, and Jerome Malick. Lifted coordinate descent for learning with trace-norm regularizations. In Proc. Intl. Conf. on Artificial Intelligence and Statistics, 2012.

J. Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17:187–211, 1979.

J. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62:432–444, 1978.

Yonina C. Eldar and Gitta Kutyniok, editors. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Robert Freund and Paul Grigas. New analysis and results for the Frank-Wolfe method. Mathematical Programming, 155(1):199–230, 2016.

Masao Fukushima and Hisashi Mine. A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000, 1981.

Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proc. Intl. Conf. on Machine Learning, 2015.

Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.


Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:643–660, 2001.

J. Guelat and P. Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35:110–119, 1986.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152:75–112, 2015.

Elad Hazan. Sparse approximate solutions to semidefinite programs. In Latin American Conference on Theoretical Informatics, 2008.

Charles A. Holloway. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6(1):14–27, 1974.

Cho-Jui Hsieh and Peder A. Olsen. Nuclear norm minimization via active subspace selection. In Proc. Intl. Conf. on Machine Learning, 2014.

Laurent Jacob, Guillaume Obozinski, and Jean P. Vert. Group lasso with overlap and graph lasso. In Proc. Intl. Conf. on Machine Learning, 2009.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. Intl. Conf. on Machine Learning, 2013.

Martin Jaggi and Marek Sulovsky. A simple algorithm for nuclear norm regularized problems. In Proc. Intl. Conf. on Machine Learning, 2010.

Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Factorized latent spaces with structured sparsity. In Advances in Neural Information Processing Systems, pages 982–990, 2010.

Seyoung Kim and Eric P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):1–18, 2009.

Jacek Kuczynski and Henryk Wozniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.

S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.

Soren Laue. A hybrid algorithm for convex semidefinite optimization. In Proc. Intl. Conf. on Machine Learning, 2012.

Evgeny S. Levitin and Boris T. Polyak. Constrained minimization problems. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

Gerard Meyer. Accelerated Frank–Wolfe algorithms. SIAM Journal on Control, 12:655–655, 1974.


Charles A. Micchelli, Jean M. Morales, and Massimiliano Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.

Bamdev Mishra, Gilles Meyer, Francis Bach, and Rodolphe Sepulchre. Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23(4):2124–2149, 2013.

Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, Series B, 140:125–161, 2013.

Guillaume Obozinski and Francis Bach. Convex relaxation for combinatorial penalties. Technical Report HAL 00694765, 2012.

Gabor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.

Ting Kei Pong, Paul Tseng, Shuiwang Ji, and Jieping Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010.

Ralph Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Springer, 1998.

Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.

Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2005.

Ambuj Tewari, Pradeep Ravikumar, and Inderjit S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, 2011.

Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6:615–640, 2010.

Paul Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, Series B, 125:263–295, 2010.

Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2010.

Martha White, Yaoliang Yu, Xinhua Zhang, and Dale Schuurmans. Convex multi-view subspace learning. In Advances in Neural Information Processing Systems, 2012.

P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming. North-Holland, 1970.


Yaoliang Yu and Dale Schuurmans. Rank/norm regularization with closed-form solutions: Application to subspace clustering. In Conf. on Uncertainty in Artificial Intelligence, 2011.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

Xiaotong Yuan and Shuicheng Yan. Forward basis selection for pursuing sparse representations over a dictionary. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):3025–3036, 2013.

Hyokun Yun, Hsiang-Fu Yu, Cho-Jui Hsieh, S. V. N. Vishwanathan, and Inderjit Dhillon. NOMAD: Nonlocking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In Proc. Intl. Conf. on Very Large Data Bases, 2014.

Xinhua Zhang, Yaoliang Yu, Martha White, Ruitong Huang, and Dale Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extension. In Proc. Nat'l. Conf. Artificial Intelligence, 2011.

Xinhua Zhang, Yaoliang Yu, and Dale Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems, 2012.

Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proc. ACM Recommender System Conference, 2013.
