

Convex relaxation for Combinatorial Penalties

Guillaume Obozinski

Équipe Imagine

Laboratoire d'Informatique Gaspard Monge

École des Ponts - ParisTech

Joint work with Francis Bach

Fête Parisienne in Computation, Inference and Optimization

IHES - March 20th 2013

Convex relaxation for Combinatorial Penalties 1/33

From sparsity...

Empirical risk: for w ∈ R^d,

L(w) = (1/2n) ∑_{i=1}^n (y_i − x_i^⊤ w)^2

Support of the model:

Supp(w) = {i | w_i ≠ 0}.

Penalization for variable selection

min_{w∈R^d} L(w) + λ |Supp(w)|

Lasso

min_{w∈R^d} L(w) + λ ‖w‖_1

|Supp(w)| = ∑_{i=1}^d 1_{w_i ≠ 0}

Convex relaxation for Combinatorial Penalties 2/33
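The contrast between the combinatorial penalty and its Lasso relaxation is visible coordinate-wise. A minimal pure-Python sketch (not from the slides; for the simple case L(w) = ½‖w − x‖² the penalized minimization separates per coordinate): penalizing λ|Supp(w)| hard-thresholds each entry, while λ‖w‖_1 soft-thresholds it.

```python
import math

def prox_l0(x, lam):
    # prox of lam * |Supp(w)|, coordinate-wise: keep x_i if x_i^2 / 2 > lam
    # (hard thresholding), otherwise set it to zero.
    return [xi if xi * xi > 2 * lam else 0.0 for xi in x]

def prox_l1(x, lam):
    # prox of lam * ||w||_1: soft thresholding shrinks every coordinate.
    return [math.copysign(max(abs(xi) - lam, 0.0), xi) for xi in x]

x = [3.0, 0.5, -2.0, 0.1]
hard = prox_l0(x, lam=0.5)   # [3.0, 0.0, -2.0, 0.0]
soft = prox_l1(x, lam=0.5)   # [2.5, 0.0, -1.5, 0.0]
```

Both operators zero out the small coordinates; the convex one additionally shrinks the surviving ones, which is the price of convexity.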


... to Structured Sparsity

The support is not only sparse but, in addition, we have prior information about its structure.

Examples

The variables should be selected in groups.

The variables lie in a hierarchy.

The variables lie on a graph or network, and the support should be localized or densely connected on the graph.

Convex relaxation for Combinatorial Penalties 3/33


Difficult inverse problem in Brain Imaging

[Figure: brain activation maps, left/right hemisphere views at slices y=-84, x=17, z=-13, color scale ±5.00e-02, "Scale 6 - Fold 9"]

Jenatton et al. (2012)

Convex relaxation for Combinatorial Penalties 4/33


Hierarchical Dictionary Learning

Figure: Hierarchical dictionary of image patches. Figure: Hierarchical topic model.

Mairal, Jenatton, Obozinski and Bach (2010)

Convex relaxation for Combinatorial Penalties 5/33


Ideas in structured sparsity

Convex relaxation for Combinatorial Penalties 6/33


Group Lasso and `1/`p norm (Yuan and Lin, 2006)

Group Lasso

Given G = {A_1, ..., A_m}, a partition of V := {1, ..., d}, consider

‖w‖_{ℓ1/ℓp} = ∑_{A∈G} δ_A ‖w_A‖_p

Overlapping groups: direct extension of Jenatton et al. (2011).

Interesting induced structures

→ Induce patterns of rooted subtrees

→ Induce “convex” patterns on a grid

Convex relaxation for Combinatorial Penalties 7/33
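For a partition, the ℓ1/ℓp norm is just a weighted sum of per-group ℓp norms. A minimal pure-Python sketch (the uniform weights δ_A = 1 and the example vector are illustrative assumptions):

```python
def group_norm(w, groups, p=2.0, weights=None):
    # ||w||_{l1/lp} = sum over groups A of delta_A * ||w_A||_p,
    # for groups A forming a partition of the coordinates.
    weights = weights or [1.0] * len(groups)
    return sum(d * sum(abs(w[i]) ** p for i in A) ** (1.0 / p)
               for d, A in zip(weights, groups))

w = [3.0, 4.0, 0.0, 0.0, 5.0]
partition = [[0, 1], [2, 3], [4]]
val = group_norm(w, partition, p=2)   # ||(3,4)||_2 + ||(0,0)||_2 + ||(5)||_2 = 10
```

Because the second group is entirely zero, it contributes nothing: the norm behaves like an ℓ1 norm across groups, driving whole groups to zero at once.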


Hierarchical Norms (Zhao et al., 2009; Bach, 2008)

[Figure: a rooted tree with six numbered nodes, node 1 at the root]

(Jenatton, Mairal, Obozinski and Bach, 2010a)

A covariate can only be selected after its ancestors

Structure on parameters w

Hierarchical penalization: Ω(w) = ∑_{g∈G} ‖w_g‖_2, where the groups g in G are the sets of descendants of the nodes of a tree.

Convex relaxation for Combinatorial Penalties 8/33
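The descendant-group penalty can be sketched in a few lines of pure Python. The 6-node tree below is an assumption chosen for illustration (the slide only shows nodes 1-6); the mechanism is the point: zeroing a node's group zeroes all its descendants, so nonzero patterns are rooted subtrees.

```python
import math

# Hypothetical tree on nodes 0..5 (root 0); the exact edges are an
# assumption for illustration, not taken from the slide.
children = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [], 5: []}

def descendants(node):
    # group g = the node together with all of its descendants
    group = [node]
    for c in children[node]:
        group += descendants(c)
    return group

def hierarchical_penalty(w):
    # Omega(w) = sum over nodes v of ||w_g||_2 with g = descendants(v)
    return sum(math.sqrt(sum(w[i] ** 2 for i in descendants(v)))
               for v in children)

# Support {0, 2, 5} is a rooted subtree, so only the groups rooted at
# 0, 2 and 5 contribute: 3 + 2*sqrt(2) + 2.
w = [1.0, 0.0, 2.0, 0.0, 0.0, 2.0]
val = hierarchical_penalty(w)
```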


A new approach based on combinatorial functions

Convex relaxation for Combinatorial Penalties 9/33


General framework

Let V = {1, ..., d}. Given a set function F : 2^V → R_+,

consider

min_{w∈R^d} L(w) + F(Supp(w))

Examples of combinatorial functions

Use recursion or counts of structures (e.g., trees) with dynamic programming

Block-coding (Huang et al., 2011)

G(A) = min_{B_1,...,B_k} F(B_1) + ... + F(B_k) s.t. B_1 ∪ ... ∪ B_k ⊇ A

Submodular functions (Work on convex relaxations by Bach (2010))

Convex relaxation for Combinatorial Penalties 10/33
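For small d, the block-coding function G can be computed exactly by dynamic programming over subsets. A sketch (the cost F below is an assumed example, not from the slides):

```python
from itertools import combinations

def subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def block_coding(F, V):
    # G(A) = min of F(B_1) + ... + F(B_k) over blocks whose union covers A,
    # computed by dynamic programming over subsets (feasible for small d).
    blocks = [B for B in subsets(V) if B]
    G = {frozenset(): 0.0}
    for A in sorted((S for S in subsets(V) if S), key=len):
        # peel off one block B meeting A, then cover the rest optimally
        G[A] = min(F(B) + G[A - B] for B in blocks if B & A)
    return G

# Illustrative cost (an assumed example): one unit per block plus its size,
# so one large block is cheaper than several singletons.
V = frozenset({0, 1, 2})
F = lambda B: 1.0 + len(B)
G = block_coding(F, V)
```

With this F, covering {0, 1, 2} by the single block V costs 4, beating any split into smaller blocks (e.g. three singletons cost 6), so G concentrates supports into few blocks.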


A relaxation for F ...?

How to solve

min_{w∈R^d} L(w) + F(Supp(w)) ?

→ Greedy algorithms

→ Non-convex methods

→ Relaxation

|A|: L(w) + λ |Supp(w)| → L(w) + λ ‖w‖_1

F(A): L(w) + λ F(Supp(w)) → L(w) + λ ...?...

Convex relaxation for Combinatorial Penalties 11/33


Previous relaxation result

Bach (2010) showed that if F is a submodular function, it is possible to construct the "tightest" convex relaxation of the penalty F for vectors w ∈ R^d such that ‖w‖_∞ ≤ 1.

Limitations and open issues:

The relaxation is defined on the unit ℓ_∞ ball.

It seems to implicitly assume that the w to be estimated lies in a fixed ℓ_∞ ball.

The choice of ℓ_∞ seems arbitrary.

The ℓ_∞ relaxation induces undesirable clustering artifacts in the coefficients' absolute values.

What happens in the non-submodular case?

Convex relaxation for Combinatorial Penalties 12/33


Penalizing and regularizing...

Given a function F : 2^V → R_+, consider for µ, ν > 0 the combined penalty:

pen(w) = µ F(Supp(w)) + ν ‖w‖_p^p.

Motivations

Compromise between variable selection and smooth regularization

Required for F allowing large supports, such as A ↦ 1_{A≠∅}

Leads to a penalty which is coercive, so that a convex relaxation on R^d will not be trivial

Convex relaxation for Combinatorial Penalties 13/33


A convex and homogeneous relaxation

Looking for a convex relaxation of pen(w).

Require as well that it be positively homogeneous → scale invariance.

Definition (Homogeneous extension of a function g)

g^h : x ↦ inf_{λ>0} (1/λ) g(λx).

Proposition

The tightest convex positively homogeneous lower bound of a function g is the convex envelope of g^h.

Leads us to consider:

pen^h(w) = inf_{λ>0} (1/λ) (µ F(Supp(λw)) + ν ‖λw‖_p^p) ∝ Θ(w) := ‖w‖_p F(Supp(w))^{1/q}, with 1/p + 1/q = 1.

Convex relaxation for Combinatorial Penalties 14/33
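The proportionality pen^h(w) ∝ Θ(w) can be checked numerically with a grid over λ (a sketch, not from the slides). Since Supp(λw) = Supp(w) for λ > 0, the set-function term is constant in λ, and for p = q = 2 the infimum works out in closed form to 2√(µν F) ‖w‖_2, i.e. a ratio of 2 against Θ(w) when µ = ν = 1.

```python
def combined_penalty_inf(w, F_supp, mu=1.0, nu=1.0, p=2):
    # pen_h(w) = inf_{lam>0} (1/lam)(mu*F(Supp(lam*w)) + nu*||lam*w||_p^p);
    # Supp(lam*w) = Supp(w) for lam > 0, so F_supp is a constant here.
    norm_p = sum(abs(x) ** p for x in w) ** (1.0 / p)
    lams = [1e-3 * 1.002 ** k for k in range(12000)]   # fine log-grid
    return min((mu * F_supp + nu * (lam * norm_p) ** p) / lam for lam in lams)

# Ratio against Theta(w) = ||w||_2 * F^(1/2) should be the constant 2
# for p = q = 2 and mu = nu = 1, whatever w and F(Supp(w)) are.
ratios = []
for w, F_supp in [([1.0, 2.0], 2.0), ([0.5, 0.0, 3.0], 5.0)]:
    theta = sum(x * x for x in w) ** 0.5 * F_supp ** 0.5
    ratios.append(combined_penalty_inf(w, F_supp) / theta)
```

The constant ratio across different w is exactly why the homogenized penalty depends on w only through Θ(w).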


Envelope of the homogeneous penalty Θ

Consider Ω_p with dual norm

Ω*_p(s) = max_{A⊂V, A≠∅} ‖s_A‖_q / F(A)^{1/q},

with unit ball: B_{Ω*_p} := {s ∈ R^d | ∀A ⊂ V, ‖s_A‖_q^q ≤ F(A)}

Proposition

The norm Ω_p is the convex envelope (tightest convex lower bound) of the function w ↦ ‖w‖_p F(Supp(w))^{1/q}.

Proof.

Denote Θ(w) = ‖w‖_p F(Supp(w))^{1/q}:

Θ*(s) = max_{w∈R^d} w^⊤s − ‖w‖_p F(Supp(w))^{1/q}
      = max_{A⊂V} max_{w_A∈R^A} w_A^⊤ s_A − ‖w_A‖_p F(A)^{1/q}
      = max_{A⊂V} ι_{‖s_A‖_q ≤ F(A)^{1/q}} = ι_{Ω*_p(s) ≤ 1}

Convex relaxation for Combinatorial Penalties 15/33
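For small d, the two descriptions of the dual unit ball (the max formula for Ω*_p and the subset-wise constraints) can be compared by brute force over all subsets. A sketch with an assumed set function F(A) = √|A| and q = 2; the test points are illustrative.

```python
from itertools import combinations

def nonempty_subsets(d):
    return [A for r in range(1, d + 1) for A in combinations(range(d), r)]

def dual_norm(s, F, q):
    # Omega*_p(s) = max over nonempty A of ||s_A||_q / F(A)^(1/q)
    return max(sum(abs(s[i]) ** q for i in A) ** (1 / q) / F(A) ** (1 / q)
               for A in nonempty_subsets(len(s)))

def in_unit_ball(s, F, q):
    # unit-ball description: ||s_A||_q^q <= F(A) for every subset A
    return all(sum(abs(s[i]) ** q for i in A) <= F(A) + 1e-12
               for A in nonempty_subsets(len(s)))

F = lambda A: len(A) ** 0.5   # assumed example set function
points = [[0.5, 0.5, 0.5], [1.0, 0.2, 0.0], [0.9, 0.9, 0.0], [0.3, -0.4, 0.2]]
agree = [(dual_norm(s, F, 2) <= 1.0 + 1e-9) == in_unit_ball(s, F, 2)
         for s in points]
```

Both checks cost O(2^d) here; they are sanity checks for the formulas, not an algorithm for large d.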


Graphs of the different penalties for w ∈ R²

[Figure: surface plots of F(Supp(w)) and of pen(w) = μ F(Supp(w)) + ν ‖w‖₂²]

Convex relaxation for Combinatorial Penalties 16/33

Graphs of the different penalties for w ∈ R²

[Figure: surface plots of Θ(w) = √F(Supp(w)) ‖w‖₂ and of its convex envelope Ω_F(w)]

Convex relaxation for Combinatorial Penalties 17/33

A large latent group Lasso (Jacob et al., 2009)

Over the set of latent decompositions
    𝒱 = { v = (v^A)_{A⊂V} ∈ (R^V)^{2^V}  s.t.  Supp(v^A) ⊂ A },
the norm admits the variational form
    Ω_p(w) = min_{v ∈ 𝒱}  Σ_{A⊂V} F(A)^{1/q} ‖v^A‖_p   s.t.   w = Σ_{A⊂V} v^A.

[Figure: w decomposed as a sum of latent vectors v^{1} + v^{2} + v^{1,2} + v^{1,2,3,4} + ...]

Convex relaxation for Combinatorial Penalties 18/33

Some simple examples

    F                                                          Ω_p
    |A|                                                        ‖w‖₁
    1_{A≠∅}                                                    ‖w‖_p
    Σ_{B∈G} 1_{A∩B≠∅}  (G a partition of {1, . . . , d})       Σ_{B∈G} ‖w_B‖_p

When p = ∞ and F is submodular, our relaxation coincides with that of Bach (2010).

However, when G is not a partition and p < ∞, Ω_p is in general not an ℓ1/ℓp-norm!

→ New norms... e.g. the k-support norm of Argyriou et al. (2012).

Convex relaxation for Combinatorial Penalties 19/33
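The rows of this table can be checked numerically on the dual side: for a partition G, the dual of the group-Lasso penalty Σ_B ‖w_B‖_p is max_B ‖s_B‖_q. A brute-force sketch under that assumption (helper names are illustrative):

```python
import itertools
import numpy as np

def dual_norm(s, F, q):
    # Omega_p^*(s) = max over nonempty A of ||s_A||_q / F(A)^{1/q} (brute force)
    d = len(s)
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(d), r) for r in range(1, d + 1))
    return max(np.linalg.norm(s[list(A)], q) / F(set(A)) ** (1.0 / q)
               for A in subsets)

groups = [{0, 1}, {2, 3, 4}]                    # a partition of {0,...,4}
F = lambda A: sum(1 for B in groups if A & B)   # F(A) = #{B in G : A ∩ B != ∅}

rng = np.random.default_rng(0)
s = rng.standard_normal(5)
q = 2.0
lhs = dual_norm(s, F, q)                                  # combinatorial dual norm
rhs = max(np.linalg.norm(s[list(B)], q) for B in groups)  # group-Lasso dual norm
```

The two values agree, confirming that this F recovers the ℓ1/ℓp group norm on partitions.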

Example

Consider V = {1, 2, 3} and G = {{1, 2}, {1, 3}, {2, 3}} with
    F({1, 2}) = F({1, 3}) = F({2, 3}) = 1,
and F(A) = ∞ otherwise, or defined by block-coding.

Convex relaxation for Combinatorial Penalties 20/33

How tight is the relaxation? Example: the range function

Consider V = {1, . . . , p} and the function
    F(A) = range(A) = max(A) − min(A) + 1.
→ Leads to the selection of interval patterns.

What is its convex relaxation?

⇒ Ω_p^F(w) = ‖w‖₁

The relaxation fails!

The new concept of lower combinatorial envelope provides a tool to analyze the tightness of the relaxation.

Convex relaxation for Combinatorial Penalties 21/33
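That the relaxation collapses to the ℓ1 norm can be seen on the dual side: for F = range, the dual norm reduces to ‖s‖_∞, the dual of ℓ1. A brute-force check (helper names are illustrative, not from the slides):

```python
import itertools
import numpy as np

def dual_norm(s, F, q):
    # brute-force Omega_p^*(s) = max over nonempty A of ||s_A||_q / F(A)^{1/q}
    d = len(s)
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(d), r) for r in range(1, d + 1))
    return max(np.linalg.norm(s[list(A)], q) / F(A) ** (1.0 / q)
               for A in subsets)

def frange(A):
    # range(A) = max(A) - min(A) + 1 (indices of A as a subset of {0,...,d-1})
    return max(A) - min(A) + 1

rng = np.random.default_rng(1)
s = rng.standard_normal(7)
val = dual_norm(s, frange, q=2.0)
# Since |A| <= range(A), every ratio is bounded by ||s||_inf, with equality
# on singletons: the dual ball is the l_inf ball, so the relaxation is l_1.
```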

Submodular penalties

A function F : 2^V → R is submodular if
    ∀A, B ⊂ V,  F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B).    (1)

For these functions Ω_∞^F(w) = f(|w|), where f is the Lovász extension of F.

Properties of submodular functions:
- f is computed efficiently (via the so-called "greedy" algorithm)
- decomposition ("weak" separability) properties
- F and f can be minimized in polynomial time.

... leading to properties of the corresponding submodular norms:
- regularized empirical risk minimization problems solved efficiently
- statistical guarantees in terms of consistency and support recovery.

Convex relaxation for Combinatorial Penalties 22/33
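The greedy algorithm mentioned above is short to state: sort the coordinates of w in decreasing order and pay the marginal gains of F along the induced chain of sets. A minimal sketch, with a small covering-type submodular F chosen purely for illustration:

```python
import itertools

def lovasz_extension(w, F):
    """Lovasz extension via the greedy algorithm: sort coordinates of w in
    decreasing order and accumulate marginal gains of F along the chain."""
    order = sorted(range(len(w)), key=lambda i: -w[i])
    val, prefix, prev = 0.0, set(), 0.0
    for i in order:
        prefix = prefix | {i}
        cur = F(prefix)
        val += w[i] * (cur - prev)
        prev = cur
    return val

# Example submodular F: number of groups intersected (a coverage function).
groups = [{0, 1}, {1, 2}, {3}]
F = lambda A: sum(1 for B in groups if A & B)

# On indicator vectors the extension interpolates F: f(1_A) = F(A).
d = 4
for r in range(d + 1):
    for A in itertools.combinations(range(d), r):
        ind = [1.0 if i in A else 0.0 for i in range(d)]
        assert abs(lovasz_extension(ind, F) - F(set(A))) < 1e-12
```

Evaluating f on |w| then gives Ω_∞^F(w), per the identity stated on the slide.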

Consistency for the Lasso (Bickel et al., 2009)

Assume that y = Xw* + σε, with ε ∼ N(0, Id_n).
Let Q = (1/n) X^⊤X ∈ R^{d×d}.
Denote J = Supp(w*).
Assume the ℓ1-Restricted Eigenvalue condition:
    ∀∆ s.t. ‖∆_{J^c}‖₁ ≤ 3 ‖∆_J‖₁,   ∆^⊤Q∆ ≥ κ ‖∆_J‖₂².

Then the Lasso estimator ŵ satisfies
    (1/n) ‖Xŵ − Xw*‖₂² ≤ (72 |J| σ²/κ) · (2 log p + t²)/n,
with probability larger than 1 − exp(−t²).

Convex relaxation for Combinatorial Penalties 23/33

Support Recovery for the Lasso (Wainwright, 2009)

Assume y = Xw* + σε, with ε ∼ N(0, Id_n).
Let Q = (1/n) X^⊤X ∈ R^{d×d}.
Denote J = Supp(w*).
Define ν = min_{j : w*_j ≠ 0} |w*_j| > 0.
Assume κ = λ_min(Q_JJ) > 0.
Assume the Irrepresentability Condition, i.e., that for some η > 0,
    |||Q_JJ^{-1} Q_{J J^c}|||_{∞,∞} ≤ 1 − η.

Then, if (2/η) √(2σ² log(p)/n) < λ < κν/|J|, the minimizer ŵ is unique and has support equal to J, with probability larger than 1 − 4 exp(−c₁ n λ²).

Convex relaxation for Combinatorial Penalties 24/33
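The Irrepresentability Condition is easy to check numerically for a given design. A sketch, assuming the usual max-row-ℓ1 form of the condition (the helper `irrepresentability_margin` is hypothetical, not from the slides):

```python
import numpy as np

def irrepresentability_margin(X, J):
    """eta = 1 - max_{j in Jc} || Q_{jJ} Q_JJ^{-1} ||_1  with  Q = X^T X / n.
    A positive margin eta is what the support-recovery result requires."""
    n, d = X.shape
    Q = X.T @ X / n
    Jc = [j for j in range(d) if j not in J]
    M = Q[np.ix_(Jc, J)] @ np.linalg.inv(Q[np.ix_(J, J)])
    return 1.0 - np.abs(M).sum(axis=1).max()

# Orthogonal design: the condition holds with the best possible margin.
print(irrepresentability_margin(2.0 * np.eye(4), [0, 1]))   # -> 1.0

# Equicorrelated design: the margin shrinks (and can go negative, i.e. IC fails).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6)) @ np.linalg.cholesky(
    0.5 * np.eye(6) + 0.5 * np.ones((6, 6)))
print(irrepresentability_margin(X, [0, 1, 2]))
```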

An example: penalizing the range

Structured prior on support (Jenatton et al., 2011): the support is an interval of {1, . . . , p}.

Natural associated penalization:
    F(A) = range(A) = i_max(A) − i_min(A) + 1.
→ F is not submodular...
→ its lower combinatorial envelope is G(A) = |A|.

But F(A) := d − 1 + range(A) is submodular!

In fact F(A) = Σ_{B∈G} 1_{A∩B≠∅} for groups B of the form: [figure not reproduced]

Jenatton et al. (2011) considered Ω(w) = Σ_{B∈G} ‖d_B ∘ w_B‖₂ for suitable weights d_B.

Convex relaxation for Combinatorial Penalties 25/33
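The slide's figure showing the groups B is not recoverable here. One family for which the covering identity Σ_{B∈G} 1_{A∩B≠∅} = d − 1 + range(A) does hold, and which we can verify exhaustively, is all d prefixes {1, . . . , k} together with the d − 1 suffixes {k, . . . , d} for k ≥ 2 (this specific family is an assumption of the sketch, 0-indexed in code):

```python
import itertools

d = 6
# Candidate group family (assumption -- the slide's figure is not reproduced):
# all prefixes {0,...,k} and all proper suffixes {k,...,d-1} for k >= 1.
groups = [set(range(k + 1)) for k in range(d)] \
       + [set(range(k, d)) for k in range(1, d)]

def cover(A):
    # number of groups intersected by A
    return sum(1 for B in groups if A & B)

def f_modified(A):
    # the modified penalty d - 1 + range(A)
    return d - 1 + (max(A) - min(A) + 1)

# Exhaustive check of the identity over all nonempty subsets.
for r in range(1, d + 1):
    for A in itertools.combinations(range(d), r):
        assert cover(set(A)) == f_modified(set(A))
print("identity holds for all nonempty A, d =", d)
```

A prefix is hit exactly when it reaches min(A) and a suffix exactly when it starts before max(A), which is where the d − min(A) + max(A) count comes from.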

Experiments

[Figure: the five test signals]
- S1: constant
- S2: triangular shape
- S3: x ↦ |sin(x) sin(5x)|
- S4: a slope pattern
- S5: i.i.d. Gaussian pattern

Compare:
- Lasso
- Elastic Net
- Naive ℓ2 group-Lasso
- Ω2 for F(A) = d − 1 + range(A)
- Ω∞ for F(A) = d − 1 + range(A)
- the weighted ℓ2 group-Lasso of Jenatton et al. (2011)

Convex relaxation for Combinatorial Penalties 26/33

Constant signal

[Figure: best Hamming distance vs. sample size n for EN, GL+w, GL, L1, Sub p=∞, Sub p=2; d = 256, k = 160, σ = 0.5]

Convex relaxation for Combinatorial Penalties 27/33

Triangular signal

[Figure: best Hamming distance vs. sample size n for EN, GL+w, GL, L1, Sub p=∞, Sub p=2; d = 256, k = 160, σ = 0.5, signal S2, cov = id]

Convex relaxation for Combinatorial Penalties 28/33

(x1, x2) ↦ |sin(x1) sin(5x1) sin(x2) sin(5x2)| signal in 2D

[Figure: best Hamming distance vs. sample size n for EN, GL+w, GL, L1, Sub p=∞, Sub p=2; d = 256, k = 160, σ = 1.0]

Convex relaxation for Combinatorial Penalties 29/33

i.i.d. random signal in 2D

[Figure: best Hamming distance vs. sample size n for EN, GL+w, GL, L1, Sub p=∞, Sub p=2; d = 256, k = 160, σ = 1.0]

Convex relaxation for Combinatorial Penalties 30/33

Summary

A convex relaxation for functions penalizing:
(a) the support via a general set function;
(b) the ℓp norm of the parameter vector w.

Principled construction of:
- known norms like the group Lasso or ℓ1/ℓp-norms
- many new sparsity-inducing norms

Caveat: the relaxation can fail to capture the structure (e.g. the range function).

For submodular functions we can obtain efficient algorithms, and theoretical results such as consistency and support recovery guarantees.

Convex relaxation for Combinatorial Penalties 31/33

References I

Argyriou, A., Foygel, R., and Srebro, N. (2012). Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466–1474.

Bach, F. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems.

Bach, F. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems.

Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732.

Huang, J., Zhang, T., and Metaxas, D. (2011). Learning with structured sparsity. Journal of Machine Learning Research, 12:3371–3412.

Jacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the International Conference on Machine Learning.

Jenatton, R., Audibert, J.-Y., and Bach, F. (2011). Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824.

Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F., and Thirion, B. (2012). Multiscale mining of fMRI data with hierarchical structured sparsity. SIAM Journal on Imaging Sciences, 5(3):835–856.

Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2011). Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research, 12:2681–2720.

Convex relaxation for Combinatorial Penalties 32/33

References II

Wainwright, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183–2202.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67.

Zhao, P., Rocha, G., and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497.

Convex relaxation for Combinatorial Penalties 33/33