
A short introduction to parsimony and a complexity analysis of the Lasso regularization path

Julien Mairal

Inria, LEAR team, Grenoble

Toulouse, October 2014

Big data

An ill-defined concept

a “buzz” word: regardless of rationality, you may get some funding and become famous by making extensive use of “big data”;

replacing “thinking” by “data” and hoping for the best;

a means to make money from your personal data.


A scientific utopia

converting data into scientific knowledge;

better understanding the world by observing it as much as we can.


In French: mégadonnées.

Collaborator

The analysis of the regularization path is joint work with Bin Yu, from UC Berkeley.

Reference

J. Mairal and B. Yu. Complexity analysis of the Lasso regularization path. ICML, 2012.

Advertisement

The introduction to parsimony is based on material from the upcoming monograph, which will be freely available on arXiv mid-October:

J. Mairal, F. Bach and J. Ponce. Sparse Modeling for Image and Vision Processing. 2014.

Part I: A Short Introduction to Parsimony

1 A short introduction to parsimony

Early thoughts

Sparsity in the statistics literature from the 60's and 70's

Wavelet thresholding in signal processing from the 90's

The modern parsimony and the ℓ1-norm

Structured sparsity

Early thoughts

(a) Dorothy Wrinch, 1894–1980. (b) Harold Jeffreys, 1891–1989.

The existence of simple laws is, then, apparently, to be regarded as a quality of nature; and accordingly we may infer that it is justifiable to prefer a simple law to a more complex one that fits our observations slightly better.

[Wrinch and Jeffreys, 1921]. Philosophical Magazine Series 6.

Historical overview of parsimony

14th century: Ockham’s razor;

1921: Wrinch and Jeffreys’ simplicity principle;

1952: Markowitz’s portfolio selection;

60's and 70's: best subset selection in statistics;

70’s: use of the ℓ1-norm for signal recovery in geophysics;

90’s: wavelet thresholding in signal processing;

1996: Olshausen and Field’s dictionary learning;

1996–1999: Lasso (statistics) and basis pursuit (signal processing);

2006–now: compressed sensing (signal processing) and Lasso consistency (statistics); applications in various scientific fields such as image processing, bioinformatics, neuroscience, computer vision...

Sparsity in the statistics literature from the 60’s and 70’s

Given some observed data points z1, . . . , zn that are assumed to be independent samples from a statistical model with parameters θ in R^p, maximum likelihood estimation (MLE) consists of minimizing

min_{θ∈R^p} [ L(θ) ≜ −∑_{i=1}^n log P_θ(z_i) ].

Example: ordinary least squares

Observations z_i = (y_i, x_i), with y_i in R. Linear model: y_i = x_i⊤θ + ε_i, with ε_i ∼ N(0, 1). Then MLE reduces to

min_{θ∈R^p} ∑_{i=1}^n (1/2)(y_i − x_i⊤θ)².


Motivation for finding a sparse solution:

removing irrelevant variables from the model;

obtaining an easier interpretation;

preventing overfitting.


Why this is highly relevant in a modern big data context:

large n allows learning (better) complex models with large p;

large p leads to poor interpretation and irrelevant variables;

large p and large n lead to high computational cost.

Two questions, once one decides to keep only a small number k of variables:

1 how to choose k?

2 how to find the best subset of k variables?

How to choose k?

Mallows’s Cp statistics [Mallows, 1964, 1966];

Akaike information criterion (AIC) [Akaike, 1973];

Bayesian information criterion (BIC) [Schwarz, 1978];

Minimum description length (MDL) [Rissanen, 1978].

These approaches lead to penalized problems

min_{θ∈R^p} L(θ) + λ‖θ‖₀,

with different choices of λ depending on the chosen criterion.

How to solve the best k-subset selection problem?

Unfortunately...

...the problem is NP-hard [Natarajan, 1995].

Two strategies

combinatorial exploration with branch-and-bound techniques [Furnival and Wilson, 1974], known as leaps and bounds: an exact algorithm, but with exponential complexity;

greedy approach: forward selection [Efroymson, 1960] (originally developed for observing intermediate solutions), which already contains all the ideas of matching pursuit algorithms.

Important reference: [Hocking, 1976]. The analysis and selection of variables in linear regression. Biometrics.

Wavelet thresholding in signal processing from the 90’s

A wavelet basis represents a set of functions φ1, φ2, . . . that are essentially dilated and shifted versions of each other [see Mallat, 2008].

Concept of parsimony with wavelets

When a signal f is “smooth”, it is close to an expansion ∑_i α_i φ_i where only a few coefficients α_i are non-zero.

Figure: (a) Meyer's wavelet; (b) Morlet's wavelet.

Wavelets were the topic of a long quest for representing natural images:

2D-Gabors [Daugman, 1985];

steerable wavelets [Simoncelli et al., 1992];

curvelets [Candes and Donoho, 2002];

contourlets [Do and Vetterli, 2003];

bandelets [Le Pennec and Mallat, 2005];

⋆-lets.

(a) 2D Gabor filter. (b) With shifted phase. (c) With rotation.


The theory of wavelets is well developed for continuous signals, e.g., in L²(R), but also for discrete signals x in R^n.

Given an orthogonal wavelet basis D = [d1, . . . , dn] in R^{n×n}, the wavelet decomposition of x in R^n is simply

β = D⊤x, and we have x = Dβ.

The k-sparse approximation problem

min_{α∈R^n} (1/2)‖x − Dα‖₂²  s.t.  ‖α‖₀ ≤ k

is not NP-hard here: since D is orthogonal, it is equivalent to

min_{α∈R^n} (1/2)‖β − α‖₂²  s.t.  ‖α‖₀ ≤ k.

The solution is obtained by hard-thresholding:

α^ht[j] = δ_{|β[j]| ≥ µ} β[j] = β[j] if |β[j]| ≥ µ, and 0 otherwise,

where µ is the k-th largest value among the set {|β[1]|, . . . , |β[n]|}.
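To make this concrete, here is a minimal numpy sketch of the k-sparse approximation in an orthogonal basis (an illustration added here, not code from the talk; the identity matrix is only a stand-in for a real wavelet transform):

```python
import numpy as np

def hard_threshold_k(beta, k):
    """Keep the k largest coefficients in magnitude, zero out the rest."""
    mu = np.partition(np.abs(beta), -k)[-k]          # k-th largest |beta[j]|
    return np.where(np.abs(beta) >= mu, beta, 0.0)

D = np.eye(5)                                        # stand-in orthogonal basis
x = np.array([0.1, -2.0, 0.3, 1.5, -0.05])
beta = D.T @ x                                       # wavelet decomposition
alpha_ht = hard_threshold_k(beta, k=2)               # -> [0, -2, 0, 1.5, 0]
x_approx = D @ alpha_ht                              # best 2-sparse approximation
```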

Another key operator introduced by Donoho and Johnstone [1994] is the soft-thresholding operator:

α^st[j] ≜ sign(β[j]) max(|β[j]| − λ, 0) = β[j] − λ if β[j] ≥ λ, β[j] + λ if β[j] ≤ −λ, and 0 otherwise,

where λ is a parameter playing the same role as µ previously.

With β ≜ D⊤x and D orthogonal, it provides the solution of the following sparse reconstruction problem:

min_{α∈R^n} (1/2)‖x − Dα‖₂² + λ‖α‖₁,

which will be of high importance later.
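Again as an added illustration: with D orthogonal, this ℓ1 problem is solved in closed form by soft-thresholding the decomposition coefficients β = D⊤x:

```python
import numpy as np

def soft_threshold(beta, lam):
    """Elementwise soft-thresholding: sign(b) * max(|b| - lam, 0)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

D = np.eye(5)                              # stand-in orthogonal basis
x = np.array([0.1, -2.0, 0.3, 1.5, -0.05])
alpha_st = soft_threshold(D.T @ x, 0.5)    # solves min 0.5*||x - D a||^2 + 0.5*||a||_1
```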

Figure: soft-thresholding operator α^st = sign(β) max(|β| − λ, 0) and hard-thresholding operator α^ht = δ_{|β| ≥ µ} β, commonly used for signal estimation with an orthogonal wavelet basis.

Various works have tried to exploit the structure of wavelet coefficients.

Figure: illustration of a wavelet tree with four scales (coefficients α1, . . . , α15) for one-dimensional signals, together with the zero-tree coding scheme [Shapiro, 1993].

The modern parsimony and the ℓ1-norm

Sparse linear models in signal processing

Let x in R^n be a signal, and let D = [d1, . . . , dp] ∈ R^{n×p} be a set of elementary signals, which we call a dictionary.

D is “adapted” to x if it can represent it with a few elements—that is, if there exists a sparse vector α in R^p such that x ≈ Dα. We call α the sparse code:

x ≈ Dα, with x ∈ R^n, D ∈ R^{n×p}, and α ∈ R^p sparse.

Sparse linear models: machine learning/statistics point of view

Let (y_i, x_i), i = 1, . . . , n, be a training set, where the vectors x_i are in R^p and are called features. The scalars y_i are in

{−1, +1} for binary classification problems,

R for regression problems.

We assume there exists a relation y ≈ β⊤x, and solve

min_{β∈R^p} (1/n) ∑_{i=1}^n L(y_i, β⊤x_i) + λ ψ(β),

where the sum is the empirical risk and ψ(β) is the regularization.

A few examples:

Ridge regression: min_{β∈R^p} (1/n) ∑_{i=1}^n (1/2)(y_i − β⊤x_i)² + λ‖β‖₂².

Linear SVM: min_{β∈R^p} (1/n) ∑_{i=1}^n max(0, 1 − y_i β⊤x_i) + λ‖β‖₂².

Logistic regression: min_{β∈R^p} (1/n) ∑_{i=1}^n log(1 + e^{−y_i β⊤x_i}) + λ‖β‖₂².

The squared ℓ2-norm induces “smoothness” in β. When one knows in advance that β should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm [Tibshirani, 1996, Chen et al., 1999].

Originally used to induce sparsity in geophysics [Claerbout and Muir, 1973, Taylor et al., 1979], the ℓ1-norm became popular in statistics with the Lasso [Tibshirani, 1996] and in signal processing with the basis pursuit [Chen et al., 1999].

Three “equivalent” formulations:

1 min_{α∈R^p} (1/2)‖x − Dα‖₂² + λ‖α‖₁;

2 min_{α∈R^p} (1/2)‖x − Dα‖₂²  s.t.  ‖α‖₁ ≤ µ;

3 min_{α∈R^p} ‖α‖₁  s.t.  ‖x − Dα‖₂² ≤ ε.
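As an aside (my sketch, not part of the talk), formulation 1 is routinely solved with the classical iterative soft-thresholding algorithm (ISTA), a proximal gradient method:

```python
import numpy as np

def ista(D, x, lam, n_iter=500):
    """Proximal gradient for min_a 0.5*||x - D a||_2^2 + lam*||a||_1."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1/L, L: Lipschitz const. of gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = alpha - step * (D.T @ (D @ alpha - x))   # gradient step, quadratic part
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # l1 prox
    return alpha
```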


And some variants...

For noiseless problems:

min_{α∈R^p} ‖α‖₁  s.t.  x = Dα.

Beyond least squares:

min_{α∈R^p} f(α) + λ‖α‖₁,

where f : R^p → R is convex.

An important question remains:

why does the ℓ1-norm induce sparsity?


Can we get some intuition from the simplest isotropic case?

α(λ) = argmin_{α∈R^p} (1/2)‖x − α‖₂² + λ‖α‖₁,

or equivalently the Euclidean projection onto the ℓ1-ball:

α(µ) = argmin_{α∈R^p} (1/2)‖x − α‖₂²  s.t.  ‖α‖₁ ≤ µ.

“Equivalent” means that for all λ ≥ 0, there exists µ ≥ 0 such that α(µ) = α(λ), and vice versa; the relation between µ and λ is unknown a priori.

Regularizing with the ℓ1-norm

Figure: Euclidean projection onto the ℓ1-ball ‖α‖₁ ≤ µ in the (α[1], α[2]) plane.

The projection onto a convex set is “biased” towards singularities.

Regularizing with the ℓ2-norm

Figure: Euclidean projection onto the ℓ2-ball ‖α‖₂ ≤ µ.

The ℓ2-norm is isotropic.

In 3-D (images produced by G. Obozinski).

Regularizing with the ℓ∞-norm

Figure: Euclidean projection onto the ℓ∞-ball ‖α‖∞ ≤ µ.

The ℓ∞-norm encourages |α[1]| = |α[2]|.

Analytical point of view: the 1D case

min_{α∈R} (1/2)(x − α)² + λ|α|

This objective is piecewise quadratic with a kink at zero. Its derivative at 0⁺ is g⁺ = −x + λ, and at 0⁻ it is g⁻ = −x − λ.

Optimality conditions: α is optimal iff either

|α| > 0 and −(x − α) + λ sign(α) = 0, or

α = 0, g⁺ ≥ 0 and g⁻ ≤ 0.

The solution is a soft-thresholding:

α⋆ = sign(x)(|x| − λ)₊.

Comparison with ℓ2-regularization in 1D

For ψ(α) = α²: ψ′(α) = 2α.

For ψ(α) = |α|: ψ′₋(α) = −1 and ψ′₊(α) = 1.

The gradient of the ℓ2-penalty vanishes when α gets close to 0. On its differentiable part, the gradient of the ℓ1-norm has constant magnitude.

Physical illustration

Attach a spring with elastic energy E1 = (k1/2)(x − α)² to an object, and oppose it either with a second spring, E2 = (k2/2)α² (the ℓ2-analogue), or with a weight subject to gravity, E2 = mgα (the ℓ1-analogue). With two springs, minimizing E1 + E2 leaves a residual displacement α ≠ 0, whereas with the weight the equilibrium is reached exactly at α = 0!

Why does the ℓ1-norm induce sparsity?

Figure: the regularization path of the Lasso, coefficient values w1, . . . , w5 as functions of λ, for

min_{α∈R^p} (1/2)‖x − Dα‖₂² + λ‖α‖₁.

Non-convex sparsity-inducing penalties

Exploiting concave functions with a kink at zero, ψ(α) = ∑_{j=1}^p φ(|α[j]|):

the ℓq-penalty, with 0 < q < 1: ψ(α) ≜ ∑_{j=1}^p |α[j]|^q [Frank and Friedman, 1993];

the log penalty: ψ(α) ≜ ∑_{j=1}^p log(|α[j]| + ε) [Candes et al., 2008].

Here φ is any concave function of |α| with a kink at zero.

Figure: open balls in 2-D corresponding to several ℓq-norms and pseudo-norms: (a) ℓ0.5-ball, (b) ℓ1-ball, (c) ℓ2-ball; and the ℓq-ball ‖α‖_q ≤ µ with q < 1.

Elastic-net

The elastic-net, introduced by Zou and Hastie [2005]:

ψ(α) = ‖α‖₁ + γ‖α‖₂².

The penalty provides more stable (but less sparse) solutions.

Figure: (a) ℓ1-ball, (b) elastic-net ball, (c) ℓ2-ball, in 2-D.

The elastic-net vs other penalties

Figures in 2-D: the ℓq-ball ‖α‖_q ≤ µ with q < 1; the ℓ1-ball ‖α‖₁ ≤ µ; the elastic-net ball (1 − γ)‖α‖₁ + γ‖α‖₂² ≤ µ; and the ℓ2-ball ‖α‖₂² ≤ µ.

Total variation and fused Lasso

The anisotropic total variation [Rudin et al., 1992],

ψ(α) = ∑_{j=1}^{p−1} |α[j + 1] − α[j]|,

is called the fused Lasso in statistics [Tibshirani et al., 2005]. The penalty encourages piecewise-constant signals (and can be extended to images).

(Image borrowed from a talk of J.-P. Vert, representing DNA copy numbers.)

Group Lasso and mixed norms

[Grandvalet and Canu, 1999, Bakin, 1999, Turlach et al., 2005, Yuan and Lin, 2006, Zhao et al., 2009]

The ℓ1/ℓq-norm: ψ(α) = ∑_{g∈G} ‖α[g]‖_q, where

G is a partition of {1, . . . , p};

q = 2 or q = ∞ in practice;

it can be interpreted as the ℓ1-norm of the vector [‖α[g]‖_q]_{g∈G}.

Example: ψ(α) = ‖α[{1, 2}]‖₂ + |α[3]|.
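For concreteness, here is an added sketch evaluating the penalties met so far on a toy vector (assuming numpy and 0-based indexing):

```python
import numpy as np

alpha = np.array([0.0, 1.2, 1.2, 0.0, -0.7])

l1   = np.abs(alpha).sum()                            # ||alpha||_1
enet = l1 + 0.5 * (alpha ** 2).sum()                  # elastic-net with gamma = 0.5
tv   = np.abs(np.diff(alpha)).sum()                   # anisotropic total variation
groups = [[0, 1], [2, 3, 4]]                          # a partition of {0,...,4}
gl12 = sum(np.linalg.norm(alpha[g]) for g in groups)  # l1/l2 group-Lasso norm
```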


Spectral sparsity

[Fazel et al., 2001, Srebro et al., 2005]

A natural regularization function for matrices is the rank,

rank(A) ≜ |{j : s_j(A) ≠ 0}| = ‖s(A)‖₀,

where s_j is the j-th singular value and s(A) is the spectrum of A.

A successful convex relaxation of the rank is the sum of singular values,

‖A‖* ≜ ∑_{j=1}^p s_j(A) = ‖s(A)‖₁,

for A in R^{p×k} with k ≥ p. The resulting function is a norm, called the trace or nuclear norm.
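Both quantities fall out of a singular value decomposition; a small numpy sketch (added here for illustration; the rank tolerance is an arbitrary choice):

```python
import numpy as np

A = np.random.randn(8, 5)
s = np.linalg.svd(A, compute_uv=False)   # the spectrum s(A): singular values
rank_A  = int((s > 1e-10).sum())         # rank(A) = ||s(A)||_0, up to tolerance
nuclear = s.sum()                        # trace/nuclear norm ||A||_* = ||s(A)||_1
```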


Structured sparsity

(Images produced by G. Obozinski; figures: the metabolic network of the budding yeast, from Rapaport et al. [2007].)

Warning: under the name “structured sparsity” appear in fact significantly different formulations!

1 non-convex:

zero-tree wavelets [Shapiro, 1993];
predefined collection of sparsity patterns [Baraniuk et al., 2010];
select a union of groups [Huang et al., 2009];
structure via Markov random fields [Cevher et al., 2008].

2 convex (norms):

tree-structure [Zhao et al., 2009];
select a union of groups [Jacob et al., 2009];
zero-pattern is a union of groups [Jenatton et al., 2011];
other norms [Micchelli et al., 2013].

Group Lasso with overlapping groups [Jenatton et al., 2011]

ψ(α) = ∑_{g∈G} ‖α[g]‖_q.

What happens when the groups overlap?

the pattern of non-zero variables is an intersection of groups;

the zero pattern is a union of groups.

Example: ψ(α) = ‖α‖₂ + |α[2]| + |α[3]|.

Hierarchical norms [Zhao et al., 2009].

Figure: (d) sparsity; (e) group sparsity; (f) hierarchical sparsity.

Some thoughts from Hocking [1976]:

The problem of selecting a subset of independent or predictor variables is usually described in an idealized setting. That is, it is assumed that (a) the analyst has data on a large number of potential variables which include all relevant variables and appropriate functions of them plus, possibly, some other extraneous variables and variable functions, and (b) the analyst has available “good” data on which to base the eventual conclusions. In practice, the lack of satisfaction of these assumptions may make a detailed subset selection analysis a meaningless exercise.

Part II: Complexity Analysis of the Lasso Regularization Path

What this work is about

another paper about the Lasso/basis pursuit [Tibshirani, 1996, Chen et al., 1999]:

min_{w∈R^p} (1/2)‖y − Xw‖₂² + λ‖w‖₁;  (1)

the first complexity analysis of the homotopy method [Ritter, 1962, Osborne et al., 2000, Efron et al., 2004] for solving (1);

a robust homotopy algorithm.

A main message reminiscent of

the simplex algorithm [Klee and Minty, 1972];

the SVM regularization path [Gartner et al., 2010].


The Lasso Regularization Path and the Homotopy

When it exists, the regularization path is piecewise linear:

Figure: coefficient values w1, . . . , w5 as piecewise-linear functions of λ.

Our Main Results

Theorem (worst-case analysis)

In the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments.

Proposition (approximate analysis)

There exists an ε-approximate path with O(1/√ε) linear segments.

Brief Introduction to the Homotopy Algorithm

Optimality conditions of the Lasso

w⋆ in R^p is a solution of Eq. (1) if and only if for all j in {1, . . . , p},

x_j⊤(y − Xw⋆) = λ sign(w⋆_j) if w⋆_j ≠ 0,

|x_j⊤(y − Xw⋆)| ≤ λ otherwise.

Uniqueness of the solution

Define J ≜ {j ∈ {1, . . . , p} : |x_j⊤(y − Xw⋆)| = λ}. If the matrix X_J⊤X_J is invertible, the solution is unique and

w⋆_J = (X_J⊤X_J)⁻¹(X_J⊤y − λη_J) = A + λB,

where η ≜ sign(X⊤(y − Xw⋆)).
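These conditions are easy to verify numerically; the following small sketch (an illustration, not from the paper) certifies a candidate solution:

```python
import numpy as np

def check_lasso_kkt(X, y, w, lam, tol=1e-6):
    """Check the Lasso optimality conditions at a candidate w."""
    c = X.T @ (y - X @ w)            # correlations with the residual
    active = w != 0
    ok_active = np.all(np.abs(c[active] - lam * np.sign(w[active])) <= tol)
    ok_inactive = np.all(np.abs(c[~active]) <= lam + tol)
    return bool(ok_active and ok_inactive)
```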

Piecewise linearity

Under uniqueness assumptions on the Lasso solution, the regularization path λ ↦ w⋆(λ) is continuous and piecewise linear.

Recipe of the homotopy method - main ideas

1 find the trivial solution w⋆(λ∞) = 0, with λ∞ = ‖X⊤y‖∞;

2 compute the direction of the current linear segment of the path;

3 follow the direction of the path by decreasing λ;

4 stop at the next “kink” and go back to 2.
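A compact numpy sketch of this recipe follows (my illustration, not the authors' implementation; it assumes X_J⊤X_J stays invertible along the path and only handles the generic add/drop events — the caveats below are exactly what a robust version must address):

```python
import numpy as np

def lasso_homotopy(X, y, lam_min=1e-8, max_kinks=50):
    """Follow the piecewise-linear Lasso path downwards from lambda_infty."""
    p = X.shape[1]
    c = X.T @ y
    lam = np.abs(c).max()                     # lambda_infty: w*(lam) = 0
    j0 = int(np.abs(c).argmax())
    J, eta = [j0], [float(np.sign(c[j0]))]    # active set and its signs
    path = [(lam, np.zeros(p))]
    for _ in range(max_kinks):
        XJ, etaJ = X[:, J], np.array(eta)
        G = np.linalg.inv(XJ.T @ XJ)          # assumed invertible
        A, B = G @ (XJ.T @ y), -G @ etaJ      # w_J(t) = A + t*B on this segment
        a = X.T @ (y - XJ @ A)                # inactive correlations are linear:
        b = -X.T @ (XJ @ B)                   #   c_j(t) = a_j + t*b_j
        cands = []                            # candidate kinks: (t, event, index)
        for i in range(len(J)):               # (i) an active coefficient hits 0
            if B[i] != 0 and 0 < -A[i] / B[i] < lam:
                cands.append((-A[i] / B[i], "drop", i))
        for j in set(range(p)) - set(J):      # (ii) |c_j(t)| reaches the bound t
            for s in (1.0, -1.0):
                if s != b[j] and 0 < a[j] / (s - b[j]) < lam:
                    cands.append((a[j] / (s - b[j]), s, j))
        if not cands:
            break                             # path is complete
        t, event, idx = max(cands, key=lambda u: u[0])   # the next kink
        w = np.zeros(p)
        w[J] = A + t * B
        path.append((t, w))
        if event == "drop":
            del J[idx], eta[idx]
        else:
            J.append(idx)
            eta.append(event)                 # variable idx enters with sign s
        lam = t
        if lam <= lam_min:
            break
    return path                               # list of (kink location, w at kink)
```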


Caveats - questions

kinks can be very close to each other;

X_J⊤X_J can be ill-conditioned;

what is the complexity?


Worst case analysis

Theorem (worst-case analysis)

In the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments.

Figure: a pathological regularization path for p = 6, coefficient values (log scale) against the kinks.

Consider a Lasso problem (y ∈ R^n, X ∈ R^{n×p}). Define the vector ỹ in R^{n+1} and the matrix X̃ in R^{(n+1)×(p+1)} as follows:

ỹ ≜ [y; y_{n+1}],   X̃ ≜ [X, 2αy; 0, αy_{n+1}],

where y_{n+1} ≠ 0 and 0 < α < λ₁/(2y⊤y + y²_{n+1}).

Adversarial strategy

If the regularization path of the Lasso (y, X) has k linear segments, the path of (ỹ, X̃) has 3k − 1 linear segments.
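A sketch of the construction (added illustration; lam1 stands for the smallest kink λ₁ of the current path, and margin < 1 keeps α strictly inside the required interval):

```python
import numpy as np

def adversarial_augment(X, y, y_next, lam1, margin=0.5):
    """Build (y_tilde, X_tilde) whose path has 3k - 1 segments instead of k."""
    alpha = margin * lam1 / (2 * y @ y + y_next ** 2)   # 0 < alpha < the bound
    y_t = np.append(y, y_next)
    new_col = np.append(2 * alpha * y, alpha * y_next)  # the (p+1)-th column
    X_t = np.column_stack([np.vstack([X, np.zeros(X.shape[1])]), new_col])
    return X_t, y_t
```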


Recall ỹ ≜ [y; y_{n+1}] and X̃ ≜ [X, 2αy; 0, αy_{n+1}].

Let us denote by {η1, . . . , ηk} the sequence of k sparsity patterns in {−1, 0, 1}^p encountered along the path of the Lasso (y, X). The new sequence of sparsity patterns for (ỹ, X̃) is:

first k patterns: [η1 = 0; 0], [η2; 0], . . . , [ηk; 0];

middle k patterns: [ηk; 1], [ηk−1; 1], . . . , [η1 = 0; 1];

last k − 1 patterns: [−η2; 1], [−η3; 1], . . . , [−ηk; 1].

Some intuition why this is true:

1 the patterns of the new path must be of the form [η_i⊤, 0]⊤ or [±η_i⊤, 1]⊤;

2 the factor α ensures that the (p + 1)-th variable enters the path late;

3 after the first k kinks, we have y ≈ Xw⋆(λ) and thus

X̃ [w⋆(λ); 0] + [0; y_{n+1}] ≈ ỹ ≈ X̃ [−w⋆(λ); 1/α].

We are now in a position to build a pathological path with (3^p + 1)/2 linear segments. Note that this lower-bound complexity is optimal.

Approximate Complexity

Strong duality

Figure: a primal curve f(w) and a dual curve g(κ) touching at the optima w⋆ and κ⋆. Strong duality means that max_κ g(κ) = min_w f(w).

Duality gaps

Figure: the gap δ(w, κ) between f(w) and g(κ) at an arbitrary primal-dual pair (w, κ). The duality gap guarantees that 0 ≤ f(w) − f(w⋆) ≤ δ(w, κ).

min_w { f_λ(w) ≜ (1/2)‖y − Xw‖₂² + λ‖w‖₁ },  (primal)

max_κ { g_λ(κ) ≜ −(1/2)κ⊤κ − κ⊤y  s.t.  ‖X⊤κ‖∞ ≤ λ }.  (dual)

ε-approximate solution

w satisfies APPROX_λ(ε) when there exists a dual variable κ s.t.

δ_λ(w, κ) = f_λ(w) − g_λ(κ) ≤ ε f_λ(w).

ε-approximate path

A path P : λ ↦ w(λ) is an ε-approximate path if it always contains ε-approximate solutions.

(See Giesen et al. [2010] for generic results.)
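In practice the gap is computed from a dual-feasible point built out of the residual; a small sketch (my illustration; the rescaling trick is standard but not spelled out in the talk):

```python
import numpy as np

def lasso_duality_gap(X, y, w, lam):
    """Return delta_lambda(w, kappa) for a dual-feasible kappa derived from w."""
    r = X @ w - y
    f = 0.5 * (r @ r) + lam * np.abs(w).sum()               # primal f_lambda(w)
    scale = min(1.0, lam / max(np.abs(X.T @ r).max(), 1e-12))
    kappa = scale * r                                       # ||X^T kappa||_inf <= lam
    g = -0.5 * (kappa @ kappa) - kappa @ y                  # dual g_lambda(kappa)
    return f - g                                            # bounds f(w) - f(w*)
```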


Optimality conditions

w in R^p is a solution of (1) if and only if for all j in {1, . . . , p},

x_j⊤(y − Xw) = λ sign(w_j) if w_j ≠ 0,

|x_j⊤(y − Xw)| ≤ λ otherwise.  (exact)

(ε1, ε2)-approximate optimality conditions

w in R^p satisfies OPT_λ(ε1, ε2) if and only if for all j in {1, . . . , p},

λ(1 − ε2) ≤ x_j⊤(y − Xw) sign(w_j) ≤ λ(1 + ε1) if w_j ≠ 0,

|x_j⊤(y − Xw)| ≤ λ(1 + ε1) otherwise.

Relations between OPT_λ and APPROX_λ:

APPROX_λ(0) ⟹ OPT_λ(0, 0) ⟹ OPT_{λ(1−√ε)}(√ε/(1 − √ε), √ε/(1 − √ε)) ⟹ APPROX_{λ(1−√ε)}(ε).

Proposition (approximate analysis)

There exists an ε-approximate path with at most ⌈log(λ∞/λ₁)/√ε⌉ segments.

Approximate Homotopy

Recipe - main ideas/features

Maintain OPT_λ(ε/2, ε/2) instead of OPT_λ(0, 0);

make steps in λ greater than or equal to λ(1 − θ√ε);

when the kinks are too close to each other, make a large step and use a first-order method instead;

between λ∞ and λ₁, the number of iterations is upper-bounded by ⌈log(λ∞/λ₁)/(θ√ε)⌉.

Conclusion

A few messages

Despite an exponential worst-case complexity, the homotopy algorithm remains extremely powerful in practice;

the main issue of the homotopy algorithm might be its numerical stability;

when one does not care about high precision, the worst-case complexity of the path can be significantly reduced.

Advertisement: SPAMS toolbox (open-source)

C++ interfaced with Matlab, R, Python.

proximal gradient methods for ℓ0, ℓ1, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-ℓ0, sparse group-Lasso, overlapping group-Lasso...

...for square, logistic, multi-class logistic loss functions.

handles sparse matrices, provides duality gaps.

fast implementations of OMP and LARS - homotopy.

dictionary learning and matrix factorization (NMF, sparse PCA).

coordinate descent, block coordinate descent algorithms.

fast projections onto some convex sets.

Try it! http://www.di.ens.fr/willow/SPAMS/


References

H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267–281, 1973.

S. Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, 1999.

R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2006.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

E. Candes and D. L. Donoho. Recovering edges in ill-posed inverse problems: optimality of curvelet frames. Annals of Statistics, 30(3):784–842, 2002.

E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.

V. Cevher, M. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems (NIPS), 2008.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

J. F. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38(5):826–844, 1973.

J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7):1160–1169, 1985.

M. Do and M. Vetterli. Contourlets. In Beyond Wavelets. Academic Press, 2003.

D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

M. A. Efroymson. Multiple regression analysis. Mathematical Methods for Digital Computers, 9(1):191–203, 1960.

M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, volume 6, pages 4734–4739, 2001.

I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.

G. M. Furnival and R. W. Wilson. Regressions by leaps and bounds. Technometrics, 16(4):499–511, 1974.

B. Gartner, M. Jaggi, and C. Maria. An exponential lower bound on the complexity of regularization paths. Preprint arXiv:0903.4817v2, 2010.

J. Giesen, M. Jaggi, and S. Laue. Approximating parameterized convex optimization problems. In Algorithms - ESA, Lecture Notes in Computer Science, 2010.

Y. Grandvalet and S. Canu. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In Advances in Neural Information Processing Systems (NIPS), 1999.

R. R. Hocking. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32:1–49, 1976.

J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011.

V. Klee and G. J. Minty. How good is the simplex algorithm? In O. Shisha, editor, Inequalities, volume III, pages 159–175. Academic Press, New York, 1972.

E. Le Pennec and S. Mallat. Sparse geometric image representations with bandelets. IEEE Transactions on Image Processing, 14(4):423–438, 2005.

S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 3rd edition, 2008.

C. L. Mallows. Choosing variables in a linear regression: a graphical aid. Unpublished paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, 1964.

C. L. Mallows. Choosing a subset regression. Unpublished paper presented at the Joint Statistical Meeting, Los Angeles, California, 1966.

C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.

F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8(1):35, 2007.

J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

K. Ritter. Ein Verfahren zur Lösung parameterabhängiger, nichtlinearer Maximum-Probleme. Mathematical Methods of Operations Research, 6(4):149–166, 1962.

L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.

E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587–607, 1992.

N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2005.

H. L. Taylor, S. C. Banks, and J. F. McCoy. Deconvolution with the ℓ1 norm. Geophysics, 44(1):39–52, 1979.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society Series B, 67(1):91–108, 2005.

B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.

D. Wrinch and H. Jeffreys. XLII. On certain fundamental principles of scientific inquiry. Philosophical Magazine Series 6, 42(249):369–390, 1921.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49–67, 2006.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67(2):301–320, 2005.

Appendix


Basic convex optimization tools: subgradients

Figure: gradients and subgradients for (g) a smooth function and (h) a non-smooth function.

∂f(α) ≜ {κ ∈ R^p | f(α) + κ⊤(α′ − α) ≤ f(α′) for all α′ ∈ R^p}.

Some nice properties

∂f(α) = {g} iff f is differentiable at α and g = ∇f(α).

many calculus rules: ∂(γf + µg) = γ∂f + µ∂g for γ, µ > 0.

For more details, see Boyd and Vandenberghe [2004], Bertsekas [1999], Borwein and Lewis [2006], and S. Boyd's course at Stanford.

Optimality conditions

For g : R^p → R convex:

g differentiable: α⋆ minimizes g iff ∇g(α⋆) = 0.

g nondifferentiable: α⋆ minimizes g iff 0 ∈ ∂g(α⋆).

Careful: the concept of subgradient requires a function to be above its tangents; it only makes sense for convex functions!

Basic convex optimization tools: dual-norm

Definition

Let κ be in R^p. Then

‖κ‖* ≜ max_{α∈R^p : ‖α‖≤1} α⊤κ.

Exercises

‖α‖** = ‖α‖ (true in finite dimension);

ℓ2 is dual to itself;

ℓ1 and ℓ∞ are dual to each other;

ℓq and ℓq′ are dual to each other if 1/q + 1/q′ = 1;

similar relations hold for spectral norms on matrices;

∂‖α‖ = {κ ∈ R^p s.t. ‖κ‖* ≤ 1 and κ⊤α = ‖α‖}.
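As a worked solution to the third exercise (my addition): a linear form over the ℓ1-ball is maximized at a vertex ±e_j, so

```latex
\|\kappa\|_* \;=\; \max_{\|\alpha\|_1 \le 1} \alpha^\top \kappa
            \;=\; \max_{1 \le j \le p} \;\max_{t \in \{-1,+1\}} \; t\,\kappa_j
            \;=\; \max_j |\kappa_j| \;=\; \|\kappa\|_\infty .
```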


Optimality conditions

Let f : R^p → R be convex and differentiable, and let ‖·‖ be any norm. Consider

min_{α∈R^p} f(α) + λ‖α‖.

α is a solution if and only if

0 ∈ ∂(f(α) + λ‖α‖) = ∇f(α) + λ∂‖α‖.

Since ∂‖α‖ = {κ ∈ R^p s.t. ‖κ‖* ≤ 1 and κ⊤α = ‖α‖}, we obtain the general optimality conditions:

‖∇f(α)‖* ≤ λ and −∇f(α)⊤α = λ‖α‖.

Convex Duality

Strong duality

Figure: a primal curve f(α) and a dual curve g(κ) touching at the optima α⋆ and κ⋆. Strong duality means that max_κ g(κ) = min_α f(α).

Duality gaps

Figure: the gap δ(α, κ) between f(α) and g(κ) at an arbitrary primal-dual pair (α, κ). The duality gap guarantees that 0 ≤ f(α) − f(α⋆) ≤ δ(α, κ).