OWL to the rescue of LASSO
IISc IBM Day 2018
Joint work with R. Sankaran and Francis Bach (AISTATS '17)
Chiranjib Bhattacharyya
Professor, Department of Computer Science and Automation, Indian Institute of Science, Bangalore
March 7, 2018
Identifying groups of strongly correlated variables through Smoothed Ordered Weighted L1-norms
LASSO: The method of choice for feature selection
Ordered Weighted L1 (OWL) norm and submodular penalties
ΩS : Smoothed OWL (SOWL)
LASSO: The method of choice for feature selection
Linear Regression in High dimension
Question?
Let $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, and
$$y = w^{*\top}x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$
From $D = \{(x_i, y_i) \mid i \in [n]\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, can we find $w^*$?
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad y \in \mathbb{R}^n.$$
Linear Regression in High dimension
Least Squares Linear Regression
$X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$:
$$w_{LS} = \underset{w \in \mathbb{R}^d}{\arg\min}\ \frac{1}{2}\|Xw - y\|_2^2$$

Assumptions
- Labels centered: $\sum_{j=1}^n y_j = 0$.
- Features normalized: $x_i \in \mathbb{R}^d$, $\|x_i\|_2 = 1$, $x_i^\top \mathbf{1}_d = 0$.
Linear Regression in High dimension
Least Squares Linear Regression
$$w_{LS} = \left(X^\top X\right)^{-1} X^\top y, \qquad E(w_{LS}) = w^*$$
Variance of the predictive error: $\frac{1}{n} E\left(\|X(w_{LS} - w^*)\|^2\right) = \sigma^2 \frac{d}{n}$
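As a sanity check, the closed-form estimate and the $\sigma^2 d/n$ error rate can be simulated in a few lines of numpy (the constants below are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 2000, 10, 0.5

# Ground-truth linear model y = X w* + eps
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=n)

# w_LS = (X^T X)^{-1} X^T y; lstsq is the numerically stable equivalent
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# (1/n)||X(w_LS - w*)||^2 concentrates around sigma^2 * d / n
pred_err = np.mean((X @ (w_ls - w_star)) ** 2)
print(pred_err, sigma**2 * d / n)
```

With $n \gg d$ the estimate is close to $w^*$ and the in-sample predictive error is on the order of $\sigma^2 d/n$, matching the slide.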
Linear Regression in High dimension
rank(X) = d
- $w_{LS}$ is unique.
- Poor predictive performance when $d$ is close to $n$.

rank(X) < d
- $d > n$.
- $w_{LS}$ is not unique.
Regularized Linear Regression
Regularized Linear Regression: $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\Omega: \mathbb{R}^d \to \mathbb{R}$:
$$\min_{w \in \mathbb{R}^d}\ \frac{1}{2}\|Xw - y\|_2^2 \quad \text{s.t. } \Omega(w) \le t$$

Regularizer $\Omega(w)$:
- Non-negative.
- Convex, typically a norm.
- Possibly non-differentiable.
Lasso regression[Tibshirani, 1994]
Ridge: $\Omega(w) = \|w\|_2^2$
- Does not promote sparsity.
- Closed-form solution.

Lasso: $\Omega(w) = \|w\|_1$
- Encourages sparse solutions.
- Requires solving a convex optimization problem.
Regularized Linear Regression
Regularized Linear Regression: $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\Omega: \mathbb{R}^d \to \mathbb{R}$:
$$\min_{w \in \mathbb{R}^d}\ \frac{1}{2}\|Xw - y\|_2^2 + \lambda\,\Omega(w)$$
- Equivalent to the constrained version.
- Unconstrained.
Lasso: Properties at a glance
Computational
- Proximal methods: IST, FISTA, Chambolle-Pock [Chambolle and Pock, 2011, Beck and Teboulle, 2009].
- Convergence rate: $O(1/T^2)$ in $T$ iterations.
- Assumption: availability of the proximal operator of $\Omega$ (easy for $\ell_1$).

Statistical properties [Wainwright, 2009]
- Support recovery (will it recover the true support?).
- Sample complexity (how many samples are needed?).
- Prediction error (what is the expected error in prediction?).
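The $\ell_1$ proximal operator mentioned above is soft-thresholding; a minimal ISTA sketch follows (the $O(1/T^2)$ rate requires FISTA's momentum step, omitted here; problem sizes and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrink each coordinate toward 0 by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = ista(X, y, lam=1.0)
```

On this well-conditioned random design, ISTA recovers the sparse support up to small shrinkage bias.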
Lasso Model Recovery[Wainwright, 2009, Theorem 1]
Setup
$$y = Xw^* + \varepsilon, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad X \in \mathbb{R}^{n \times d}$$
Support of $w^*$: $S = \{j \mid w^*_j \ne 0\}$, $j \in [d]$.

Lasso
$$\min_{w \in \mathbb{R}^d}\ \frac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$$
Lasso Model Recovery[Wainwright, 2009, Theorem 1]
Conditions¹:
1. Incoherence: $\left\|X_{S^c}^\top X_S \left(X_S^\top X_S\right)^{-1}\right\|_\infty \le 1 - \gamma$, with $\gamma \in (0, 1]$.
2. $\Lambda_{\min}\left(\frac{1}{n} X_S^\top X_S\right) \ge C_{\min}$.
3. $\lambda > \lambda_0 \triangleq \frac{2}{\gamma}\sqrt{\frac{2\sigma^2 \log d}{n}}$.

W.h.p., the following holds:
$$\|w_S - w^*_S\|_\infty \le \lambda\left(\left\|\left(X_S^\top X_S / n\right)^{-1}\right\|_\infty + 4\sigma/\sqrt{C_{\min}}\right)$$

¹Define $\|M\|_\infty = \max_i \sum_j |M_{ij}|$.
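For a concrete design, the incoherence slack $\gamma$ and $C_{\min}$ are directly computable; a small numpy sketch (the Gaussian design and the support are illustrative assumptions, not from the talk):

```python
import numpy as np

def lasso_recovery_constants(X, S):
    """Compute the incoherence slack gamma and C_min from the conditions above."""
    n = X.shape[0]
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS, XSc = X[:, S], X[:, Sc]
    # ||M||_inf = max_i sum_j |M_ij| (maximum absolute row sum)
    M = XSc.T @ XS @ np.linalg.inv(XS.T @ XS)
    gamma = 1.0 - np.max(np.sum(np.abs(M), axis=1))
    C_min = np.min(np.linalg.eigvalsh(XS.T @ XS / n))
    return gamma, C_min

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 50)) / np.sqrt(500)   # columns roughly unit-norm
gamma, C_min = lasso_recovery_constants(X, S=np.arange(5))
print(gamma, C_min)
```

For a well-spread random design both constants are strictly positive, so the theorem applies; the next slides show how correlation drives them toward zero.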
Lasso Model Recovery: Special cases
Case: $X_{S^c}^\top X_S = 0$.
- The incoherence condition holds trivially, with $\gamma = 1$.
- The threshold $\lambda_0$ is smaller ⇒ the recovery error is smaller.

Case: $X_S^\top X_S = I$.
- $C_{\min} = 1/n$, the largest possible for a given $n$.
- Larger $C_{\min}$ ⇒ smaller recovery error.

When does Lasso work well?
- Lasso prefers low correlation between support and non-support columns.
- Low correlation among columns within the support leads to better recovery.
Lasso Model Recovery: Implications
Setting: strongly correlated columns in $X$.
- Correlation between feature $i$ and feature $j$: $\rho_{ij} \approx x_i^\top x_j$.
- Large correlation between $X_S$ and $X_{S^c}$ ⇒ $\gamma$ is small.
- Large correlation within $X_S$ ⇒ $C_{\min}$ is small.
- The r.h.s. of the bound is large (loose bound).
- Hence, w.h.p., Lasso fails in model recovery.

In other words:
- Lasso solutions differ with the solver used.
- The solution is typically not unique.
- The prediction error may not be as bad, though [Hebiri and Lederer, 2013].

Requirements:
- Consistent estimates, independent of the solver.
- Preferably select all the correlated variables as a group.
Illustration: Lasso under correlation [Zeng and Figueiredo, 2015]

Setting: strongly correlated features.
- $\{1, \dots, d\} = G_1 \cup \dots \cup G_k$, $G_m \cap G_l = \emptyset$ for all $l \ne m$.
- $\rho_{ij} \equiv |x_i^\top x_j|$ very high ($\approx 1$) for pairs $i, j \in G_m$.

Toy example
- $d = 40$, $k = 4$.
- $G_1 = [1:10]$, $G_2 = [11:20]$, $G_3 = [21:30]$, $G_4 = [31:40]$.

[Figure: Original signal]

Lasso: $\Omega(w) = \|w\|_1$.
- Sparse recovery.
- Arbitrarily selects the variables within a group.

[Figure: Recovered signal]
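A design of this kind is easy to simulate; the sketch below builds $k$ groups of near-duplicate features (the latent-factor construction and noise level are illustrative assumptions, not the talk's setup):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200, 40, 4
groups = [np.arange(10 * j, 10 * (j + 1)) for j in range(k)]   # G1..G4

# One latent factor per group; within-group features are noisy copies of it
Z = rng.normal(size=(n, k))
X = np.empty((n, d))
for j, G in enumerate(groups):
    X[:, G] = Z[:, [j]] + 0.05 * rng.normal(size=(n, len(G)))
X /= np.linalg.norm(X, axis=0)          # normalize columns

R = np.abs(X.T @ X)                     # |rho_ij| matrix
within = np.mean([R[np.ix_(G, G)].mean() for G in groups])
across = R[np.ix_(groups[0], groups[1])].mean()
print(within, across)
```

Within-group correlations come out near 1 while cross-group correlations stay small, which is exactly the regime where Lasso's within-group selection becomes arbitrary.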
Possible solutions: 2-stage procedures
Cluster Group Lasso [Buhlmann et al., 2013]
- Identify strongly correlated groups $\mathcal{G} = \{G_1, \dots, G_k\}$ via canonical correlation.
- Group selection: $\Omega(w) = \sum_{j=1}^k \alpha_j \|w_{G_j}\|_2$.
- Selects all or none of the variables from each group.

Goal: learn $\mathcal{G}$ and $w$ simultaneously?
Ordered Weighted `1 (OWL) norms
OSCAR² [Bondell and Reich, 2008]
$$\Omega_O(w) = \sum_{i=1}^d c_i |w|_{(i)}, \qquad c_i = c_0 + (d - i)\mu, \quad c_0, \mu, c_d > 0.$$

[Figure: Recovered: OWL]

OWL [Figueiredo and Nowak, 2016]:
- $c_1 \ge \dots \ge c_d \ge 0$.

²Notation: $|w|_{(i)}$ is the $i$-th largest entry of $|w|$.
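Since $\Omega_O$ only involves sorting, it is cheap to evaluate; a minimal sketch with OSCAR weights (the numeric values are illustrative choices):

```python
import numpy as np

def owl_norm(w, c):
    """Ordered weighted l1: sum_i c_i |w|_(i), |w|_(i) the i-th largest |w_j|."""
    c = np.asarray(c)
    assert np.all(np.diff(c) <= 0) and c[-1] >= 0   # c1 >= ... >= cd >= 0
    return np.sort(np.abs(w))[::-1] @ c

def oscar_weights(d, c0, mu):
    # OSCAR: c_i = c0 + (d - i) * mu for i = 1..d (non-increasing by design)
    return c0 + (d - np.arange(1, d + 1)) * mu

w = np.array([3.0, -1.0, 2.0])
c = oscar_weights(3, c0=1.0, mu=0.5)    # weights [2.0, 1.5, 1.0]
print(owl_norm(w, c))                    # 2*3 + 1.5*2 + 1*1 = 10
```

The two boundary choices of $c$ recover familiar norms: $c = \mathbf{1}$ gives $\|w\|_1$ and $c = e_1$ gives $\|w\|_\infty$.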
Oscar: Sparsity Illustrated
Figure: Examples of solutions [Bondell and Reich, 2008]
- Solutions are encouraged towards vertices.
- Encourages blockwise-constant solutions.
- See [Bondell and Reich, 2008].
OWL-Properties
Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]
$$|w_i| = |w_j| \quad \text{if } \lambda \ge \lambda^0_{ij}.$$
- $\lambda^0_{ij} \propto \sqrt{1 - \rho_{ij}^2}$.
- Strongly correlated pairs are grouped early in the regularization path.
- Groups: $G_j = \{i \mid |w_i| = \alpha_j\}$.
OWL-Issues
[Figure: True model] [Figure: Recovered: OWL]

- Bias towards piecewise-constant $w$.
  - Easily understood through the norm balls.
  - Requires more samples for consistent estimation.
- Lack of interpretation for choosing $c$.
Ordered Weighted L1 (OWL) norm and submodular penalties
Preliminaries: Penalties on the Support
Goal: encourage $w$ to have a desired support structure.
$$\mathrm{supp}(w) = \{i \mid w_i \ne 0\}$$

Idea: penalty on the support [Obozinski and Bach, 2012]:
$$\mathrm{pen}(w) = F(\mathrm{supp}(w)) + \|w\|_p^p, \quad p \in [1, \infty].$$

Relaxation of $\mathrm{pen}(w)$: $\Omega^F_p(w)$, the tightest positively homogeneous, convex lower bound.

Example: $F(\mathrm{supp}(w)) = |\mathrm{supp}(w)| \Rightarrow \Omega^F_p(w) = \|w\|_1$. Familiar!
Message: the cardinality function always relaxes to the $\ell_1$ norm.
Nondecreasing Submodular Penalties on Cardinality
Assumptions. Write $F \in \mathcal{F}$ if $F: A \subseteq \{1, \dots, d\} \to \mathbb{R}$ is:
1. Submodular [Bach, 2011].
   - $\forall A \subseteq B$: $F(A \cup \{k\}) - F(A) \ge F(B \cup \{k\}) - F(B)$.
   - Lovász extension $f: \mathbb{R}^d \to \mathbb{R}$ (convex extension of $F$ to $\mathbb{R}^d$).
2. Cardinality-based.
   - $F(A) = g(|A|)$ (invariant to permutations).
3. Non-decreasing.
   - $g(0) = 0$, $g(x) \ge g(x - 1)$.

Implication: $F \in \mathcal{F}$ ⇒ $F$ is completely specified through $g$.
Example: let $V = \{1, \dots, d\}$ and define $F(A) = |A| \cdot |V \setminus A|$. Then $f(w) = \sum_{i<j} |w_i - w_j|$.
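The cut-function example above can be checked numerically with the standard sort-and-telescope formula for the Lovász extension (a generic sketch, not the talk's code):

```python
import numpy as np
from itertools import combinations

def lovasz_extension(F, v):
    """Lovasz extension: sort v descending and telescope F over the top-k sets."""
    order = np.argsort(-v)
    total, prev = 0.0, 0.0
    for k in range(1, len(v) + 1):
        Fk = F(order[:k])
        total += v[order[k - 1]] * (Fk - prev)
        prev = Fk
    return total

d = 5
F_cut = lambda A: len(A) * (d - len(A))   # F(A) = |A| |V \ A|, submodular

rng = np.random.default_rng(4)
w = rng.normal(size=d)
lhs = lovasz_extension(F_cut, w)
rhs = sum(abs(w[i] - w[j]) for i, j in combinations(range(d), 2))
print(lhs, rhs)
```

The two values agree to machine precision, confirming $f(w) = \sum_{i<j} |w_i - w_j|$ for this $F$.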
ΩF∞ and Lovasz extension
Result, case $p = \infty$ [Bach, 2010]:
$$\Omega^F_\infty(w) = f(|w|)$$
- The $\ell_\infty$ relaxation coincides with the Lovász extension on the positive orthant.
- To work with $\Omega^F_\infty$, one may use existing results on submodular function minimization.
- $\Omega^F_p$ is not known in closed form for $p < \infty$.
Equivalence of OWL and Lovasz extensions: Statement
Proposition [Sankaran et al., 2017]: $F \in \mathcal{F}$, $\Omega^F_\infty(w) \Leftrightarrow \Omega_O(w)$
1. Given $F(A) = f(|A|)$, $\Omega^F_\infty(w) = \Omega_O(w)$ with $c_i = f(i) - f(i-1)$.
2. Given $c_1 \ge \dots \ge c_d \ge 0$, $\Omega_O(w) = \Omega^F_\infty(w)$ with $f(i) = c_1 + \dots + c_i$.

Interpretations
- Gives alternate interpretations for OWL.
- $\Omega^F_\infty$ has undesired extreme points [Bach, 2011], which explains the piecewise-constant solutions of OWL.
- Motivates $\Omega^F_p(w)$ for $p < \infty$.
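Direction 1 can be verified numerically: build a cardinality-based $F$ from a concave non-decreasing $g$, take $c$ as its first differences, and compare $\Omega_O(w)$ with the Lovász extension at $|w|$ (the choice $g = \sqrt{\cdot}$ is illustrative):

```python
import numpy as np

def owl_norm(w, c):
    # Omega_O(w) = sum_i c_i |w|_(i)
    return np.sort(np.abs(w))[::-1] @ np.asarray(c)

def lovasz_extension(F, v):
    # Generic Lovasz extension: sort descending, telescope F over top-k sets
    order = np.argsort(-v)
    total, prev = 0.0, 0.0
    for k in range(1, len(v) + 1):
        Fk = F(order[:k])
        total += v[order[k - 1]] * (Fk - prev)
        prev = Fk
    return total

d = 6
g = np.sqrt(np.arange(0, d + 1))          # g(0) = 0, concave, non-decreasing
F = lambda A: g[len(A)]                   # cardinality-based: F(A) = g(|A|)
c = np.diff(g)                            # c_i = g(i) - g(i-1), non-increasing

rng = np.random.default_rng(5)
w = rng.normal(size=d)
lhs = owl_norm(w, c)
rhs = lovasz_extension(F, np.abs(w))      # Omega_inf^F(w) = f(|w|)
print(lhs, rhs)
```

Concavity of $g$ makes the differences $c_i$ non-increasing, so any such $F \in \mathcal{F}$ yields valid OWL weights.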
ΩS : Smoothed OWL (SOWL)
SOWL: Definition

Smoothed OWL:
$$\Omega_S(w) := \Omega^F_2(w).$$
Variational form of $\Omega^F_2$ [Obozinski and Bach, 2012]:
$$\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \frac{1}{2}\left(\sum_{i=1}^d \frac{w_i^2}{\eta_i} + f(\eta)\right).$$
Using the OWL equivalence, $f(|\eta|) = \Omega^F_\infty(\eta) = \sum_{i=1}^d c_i |\eta|_{(i)}$:
$$\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \underbrace{\frac{1}{2}\sum_{i=1}^d \left(\frac{w_i^2}{\eta_i} + c_i \eta_{(i)}\right)}_{\Psi(w, \eta)}. \tag{SOWL}$$
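The variational form can be sanity-checked by brute force in $d = 2$: grid over $\eta \in \mathbb{R}^2_+$ and take the minimum of $\Psi$ (grid range and resolution are arbitrary choices):

```python
import numpy as np

def sowl_bruteforce(w, c, grid):
    """Brute-force the variational form in d = 2:
       min over eta > 0 of 0.5 * sum_i (w_i^2 / eta_i + c_i * eta_(i))."""
    e1, e2 = np.meshgrid(grid, grid, indexing="ij")
    top, bot = np.maximum(e1, e2), np.minimum(e1, e2)   # eta_(1) >= eta_(2)
    psi = 0.5 * (w[0]**2 / e1 + w[1]**2 / e2 + c[0] * top + c[1] * bot)
    return psi.min()

w = np.array([3.0, 4.0])
grid = np.linspace(0.01, 10.0, 1000)

# c = (1, 1) should give ||w||_1 = 7; c = (1, 0) should give ||w||_2 = 5
v_l1 = sowl_bruteforce(w, [1.0, 1.0], grid)
v_l2 = sowl_bruteforce(w, [1.0, 0.0], grid)
print(v_l1, v_l2)
```

For $c = (1,1)$ the minimizer is $\eta = |w|$, giving $\|w\|_1$; for $c = (1,0)$ both coordinates of $\eta$ pool at $\|w\|_2$, giving $\|w\|_2$, matching the special cases on the next slide.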
OWL vs SOWL
Case: $c = \mathbf{1}_d$.
- $\Omega_S(w) = \|w\|_1$.
- $\Omega_O(w) = \|w\|_1$.

Case: $c = [1, \underbrace{0, \dots, 0}_{d-1}]^\top$.
- $\Omega_S(w) = \|w\|_2$.
- $\Omega_O(w) = \|w\|_\infty$.

[Figure: Norm balls for OWL and SOWL, for different values of $c$]
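Both special cases can be reproduced by evaluating $\Omega_S$ directly: sort $|w|$, pool adjacent entries into blocks sharing one $\eta$ level whenever monotonicity of $\eta$ would be violated, with each block $B$ contributing $\|w_B\|_2 \sqrt{\sum_{i \in B} c_i}$ (a PAV-style sketch; the implementation is an illustration, not the talk's code):

```python
import numpy as np

def sowl(w, c):
    """Evaluate Omega_S(w) by pooling sorted |w| into constant-eta blocks."""
    a = np.sort(np.abs(w))[::-1]
    blocks = []                      # each block: [sum of a_i^2, sum of c_i]
    for ai, ci in zip(a, c):
        blocks.append([ai ** 2, ci])
        # enforce non-increasing eta levels by merging adjacent violators
        while len(blocks) > 1:
            sq2, cs2 = blocks[-1]
            sq1, cs1 = blocks[-2]
            t1 = sq1 / cs1 if cs1 > 0 else np.inf   # squared block level
            t2 = sq2 / cs2 if cs2 > 0 else np.inf
            if t2 <= t1:
                break
            blocks[-2] = [sq1 + sq2, cs1 + cs2]
            blocks.pop()
    # block value ||w_B||_2 * sqrt(sum_{i in B} c_i)
    return sum(np.sqrt(sq * cs) for sq, cs in blocks)

w = np.array([3.0, -4.0, 1.0])
print(sowl(w, np.array([1.0, 1.0, 1.0])))   # ||w||_1 = 8
print(sowl(w, np.array([1.0, 0.0, 0.0])))   # ||w||_2 = sqrt(26)
```

With $c = \mathbf{1}$ no pooling occurs and each entry is its own block ($\ell_1$); with $c = e_1$ all entries merge into one block ($\ell_2$).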
Group Lasso and ΩS
SOWL objective (eliminating $\eta$):
$$\Omega_S(w) = \sum_{j=1}^k \|w_{G_j}\|_2 \sqrt{\sum_{i \in G_j} c_i}.$$
Let $\eta_w$ denote the optimal $\eta$ for a given $w$.

Key differences:
- Groups are defined through $\eta_w = [\underbrace{\delta_1, \dots, \delta_1}_{G_1}, \dots, \underbrace{\delta_k, \dots, \delta_k}_{G_k}]$.
- Influenced by the choice of $c$.
Open Questions:

1. Does ΩS promote grouping of correlated variables, as ΩO does?
   - Are there any benefits over ΩO?
2. Is using ΩS computationally feasible?
3. What are the theoretical properties of ΩS vs. Group Lasso?
Grouping property of ΩS: Statement

Learning Problem (LS-SOWL):

min_{w∈R^d, η∈R^d_+}  Γ^(λ)(w, η) := (1/2n)‖Xw − y‖2² + (λ/2) Σ_{i=1}^{d} (w_i²/η_i + c_i η_(i)).

Theorem [Sankaran et al., 2017]
Define the following:
- (w^(λ), η^(λ)) = argmin_{w,η} Γ^(λ)(w, η).
- ρij = x_i^⊤ x_j.
- c̄ = min_i (c_i − c_{i+1}).

Then there exists 0 ≤ λ0 ≤ (‖y‖2/√c̄) (4 − 4ρij²)^{1/4} such that for all λ > λ0, η_i^(λ) = η_j^(λ).
Grouping property of ΩS: Interpretation

1. Variables ηi, ηj are grouped if ρij ≈ 1 (even for small λ).
   - Similar to Figueiredo and Nowak [2016, Theorem 1], which holds for the absolute values of w.
2. SOWL differentiates the grouping variable η from the model variable w.
3. Allows model variation within a group:
   - w_i^(λ) ≠ w_j^(λ) as long as c has distinct values.
Illustration: Group Discovery using SOWL

Aim: Illustrate group discovery by SOWL.
- Consider z ∈ R^d; compute prox_{λΩS}(z).
- Study the regularization path.

Observations:
- Early group discovery.
- Model variation.

Figure: Original signal, and regularization paths (x-axis: λ, y-axis: w) for Lasso w, OWL w, SOWL w, and SOWL η.
Proximal Methods: A brief overview

Proximal operator:

prox_Ω(z) = argmin_w (1/2)‖w − z‖2² + Ω(w)

- Easy to evaluate for many simple norms.
- prox_{λℓ1}(z) = sign(z) (|z| − λ)+ (elementwise soft thresholding).
- A generalization of Projected Gradient Descent.
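The ℓ1 prox above is simple enough to verify directly; a minimal sketch (function name illustrative):

```python
import numpy as np

def prox_l1(z, lam):
    """Soft thresholding: proximal operator of lam * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 1.5])
w = prox_l1(z, 1.0)  # -> [2.0, 0.0, 0.5]
```

Entries with |z_i| ≤ λ are zeroed out, which is exactly how the lasso produces sparsity.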
FISTA [Beck and Teboulle, 2009]

Initialization:
- t^(1) = 1, w^(1) = x^(1) = 0.

Steps, for k > 1:
- x^(k) = prox_Ω(w^(k−1) − (1/L) ∇f(w^(k−1))).
- t^(k) = (1 + √(1 + 4 (t^(k−1))²)) / 2.
- w^(k) = x^(k) + ((t^(k−1) − 1)/t^(k)) (x^(k) − x^(k−1)).

Guarantee:
- Convergence rate O(1/T²).
- No additional assumptions beyond ISTA.
- Known to be optimal for this class of minimization problems.
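The steps above can be sketched for the lasso case, f(w) = (1/2n)‖Xw − y‖², Ω = λ‖·‖1 (the data and problem sizes below are purely illustrative, and the prox is applied with step size 1/L):

```python
import numpy as np

def prox_l1(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista_lasso(X, y, lam, iters=500):
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of grad f
    x_prev = w = np.zeros(d)
    t = 1.0
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        x = prox_l1(w - grad / L, lam / L)       # proximal gradient step
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        w = x + ((t - 1) / t_next) * (x - x_prev)  # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 1.5]
y = X @ w_true                                   # noiseless for illustration
w_hat = fista_lasso(X, y, lam=1e-3)
```

With a small λ and noiseless data, w_hat should land close to w_true.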
Computing prox_{ΩS}

Problem:

prox_{λΩ}(z) = argmin_w (1/2)‖w − z‖2² + λΩ(w).

w^(λ) = prox_{λΩS}(z),  η_w^(λ) = argmin_η Ψ(w^(λ), η).

Key idea: the ordering of η_w^(λ) remains the same for all λ.

Figure: paths of η_w^(λ) plotted against log(λ) and against λ.

- η_w^(λ) = (η_z − λ)+.
- Same complexity as computing the norm ΩS (O(d log d)).
- True for all cardinality-based F.
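For intuition only: once the partition implied by η is fixed, ΩS takes the weighted group-lasso form from earlier, whose prox is the standard blockwise soft thresholding. The sketch below assumes the groups are already known; the full ΩS prox must additionally discover the grouping (via η_z), which this sketch does not attempt:

```python
import numpy as np

def prox_weighted_group_lasso(z, groups, c, lam):
    """Block soft thresholding: prox of lam * sum_j sqrt(sum_{i in G_j} c_i) ||w_{G_j}||_2."""
    w = np.zeros_like(z)
    for g in groups:
        g = np.asarray(g)
        thresh = lam * np.sqrt(c[g].sum())
        norm = np.linalg.norm(z[g])
        if norm > thresh:
            w[g] = (1 - thresh / norm) * z[g]   # shrink the whole block
        # else: the whole block is zeroed out together
    return w

z = np.array([3.0, 4.0, 0.1, 0.2])
w = prox_weighted_group_lasso(z, [[0, 1], [2, 3]], np.array([1.0, 0.0, 1.0, 0.0]), lam=1.0)
# first block: ||z_G|| = 5 > 1, shrunk by factor 0.8; second block: ||z_G|| < 1, zeroed
```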
Random Design

Problem setting: LS-SOWL.
- True model: y = Xw* + ε, with rows (X^⊤)_i ∼ N(µ, Σ) and ε_i ∼ N(0, σ²).
- Notation: J = {i : w*_i ≠ 0}; η_{w*} = [δ*_1, . . . , δ*_1, . . . , δ*_k, . . . , δ*_k], with δ*_j repeated over the indices of Gj.

Assumptions:
- Σ_{J,J} is invertible, λ → 0, and λ√n → ∞.

Irrepresentability conditions:
1. δ*_k = 0 if J^c ≠ ∅.
2. ‖Σ_{J^c,J} (Σ_{J,J})^{−1} D w*_J‖2 ≤ β < 1, where D is a diagonal matrix defined using the groups G1, . . . , Gk and w*.

Result [Sankaran et al., 2017]:
1. ŵ →_p w*.
2. P(Ĵ = J) → 1.

- Similar to Group Lasso [Bach, 2008, Theorem 2].
- Learns the weights without explicit group information.
Quantitative Simulation: Predictive Accuracy

Aim: Learn ŵ using LS-SOWL and evaluate the prediction error.

Generate samples:
- x ∼ N(0, Σ), ε ∼ N(0, σ²),
- y = x^⊤ w* + ε.

Metric: E[‖x^⊤(w* − ŵ)‖²] = (w* − ŵ)^⊤ Σ (w* − ŵ).

Data: w* = [0_10^⊤, 2_10^⊤, 0_10^⊤, 2_10^⊤]^⊤, following the setup of Bondell and Reich [2008].
- n = 100, σ = 15, and Σ_{i,j} = 0.5 if i ≠ j, 1 if i = j.

Models with group variance:
- Measure E[‖x^⊤(w̃* − ŵ)‖²], where w̃* = w* + ε̃, ε̃ ∼ U[−τ, τ], τ ∈ {0, 0.2, 0.4}.
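The data-generating process and the metric above are straightforward to reproduce; a sketch of the setup only (no SOWL solver is included, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 40, 100, 15.0

# w* = [0_10, 2_10, 0_10, 2_10]
w_star = np.concatenate([np.zeros(10), 2 * np.ones(10), np.zeros(10), 2 * np.ones(10)])

# Sigma: 1 on the diagonal, 0.5 everywhere else
Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ w_star + sigma * rng.standard_normal(n)

def prediction_error(w_hat):
    """E[(x^T (w* - w_hat))^2] = (w* - w_hat)^T Sigma (w* - w_hat)."""
    diff = w_star - w_hat
    return diff @ Sigma @ diff
```

For example, the trivial estimate ŵ = 0 has error (w*)^⊤ Σ w* = 0.5·(Σᵢ w*_i)² + 0.5·‖w*‖² = 840 under this Σ.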
Predictive accuracy results

| Algorithm | Med. MSE | MSE (10th Perc.) | MSE (90th Perc.) |
|---|---|---|---|
| LASSO | 46.1 / 45.2 / 45.5 | 32.8 / 32.7 / 33.2 | 60.0 / 61.5 / 61.4 |
| OWL | 27.6 / 27.0 / 26.4 | 19.8 / 19.2 / 19.2 | 42.7 / 40.4 / 39.2 |
| El. Net | 30.8 / 30.7 / 30.6 | 21.9 / 22.6 / 23.0 | 42.4 / 43.0 / 41.4 |
| ΩS | 23.9 / 23.3 / 23.4 | 16.9 / 16.8 / 16.8 | 35.2 / 35.4 / 33.2 |

Table: Each cell reports numbers for τ = 0 / 0.2 / 0.4.
Summary

1. Proposed a new family of norms, ΩS.
2. Properties:
   - Equivalent to OWL in group identification.
   - Efficient computational tools.
   - Equivalences to Group Lasso.
3. Illustrated performance through simulations.
Questions?

Thank you!
References

F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

F. Bach. Structured sparsity-inducing norms through submodular functions. In Neural Information Processing Systems, 2010.

F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2–3):145–373, 2011.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, March 2009.

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64:115–123, 2008.

P. Bühlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: Clustering and sparse estimation. Journal of Statistical Planning and Inference, 143:1835–1858, 2013.

A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, May 2011.

M. A. T. Figueiredo and R. D. Nowak. Ordered weighted ℓ1 regularized regression with strongly correlated covariates: Theoretical aspects. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

M. Hebiri and J. Lederer. How correlations influence lasso prediction. IEEE Transactions on Information Theory, 59(3):1846–1854, 2013.

G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. Technical Report 00694765, HAL, 2012.

R. Sankaran, F. Bach, and C. Bhattacharyya. Identifying groups of strongly correlated variables through smoothed ordered weighted ℓ1-norms. In Artificial Intelligence and Statistics, pages 1123–1131, 2017.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

X. Zeng and M. A. T. Figueiredo. The ordered weighted ℓ1 norm: Atomic formulation, projections, and algorithms. arXiv preprint:1409.4271v5, 2015.