

  • Sparse regularization path by differential inclusion

    Wotao Yin (UCLA Math)

    joint with: Stanley Osher, Ming Yan (UCLA)

    Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U)

    ICERM Approximation, Integration, and Optimization Workshop

    October 2, 2014

    1 / 42

  • Background

    • Assume vector x∗ ∈ Rn is sparse, unknown

    • Goal: Recover x∗ from

b = Ax∗ + ε,

    where A ∈ Rm×n, b ∈ Rm, and ε is unknown noise.

    • Consider m ≪ n, the under-determined case

    2 / 42

  • Background: Regularization by optimization

    Examples:

• convex: LASSO: minimize λ‖x‖1 + (1/2m)‖Ax − b‖₂²

    • nonconvex: SCAD, ℓp-seminorm minimization, p ∈ (0, 1)

    Optimization approach:

    • convex penalty: avoids overfitting, tractable, but leads to bias

    • nonconvex penalty: may work better, but performance is unpredictable

    They have a tuning parameter, the best choice of which is often unknown

    So, we need model selection: vary the parameter values, solve many instances,

    and then pick the best one

    3 / 42

  • Background: Regularization path by an algorithm

    Algorithmic regularization:

• An algorithm generates a regularization path. (Points on the path may not

    minimize an energy function.)

    • Model selection is done by deciding when to stop: at a time (for a continuous

    dynamic) or at an iteration (for a discrete update)

    Examples:

    • LASSO/LARS: solve parameterized LASSO KKT conditions

    • xMP family

    4 / 42

  • This talk introduces a continuous regularization path by differential

    inclusions, with

    • recovery guarantees

    • fast implementation

and generalization to other structured solutions

    5 / 42

  • Introduced: two Inverse Scale Space (ISS) Dynamics

• Let x(t), p(t) ∈ Rⁿ be the primal–dual regularization path; t is time.

    • Bregman ISS dynamic: {x(t), p(t)}t≥0 is governed by

    ṗ(t) = (1/m) Aᵀ(b − Ax(t)),

    p(t) ∈ ∂‖x(t)‖1.

    Initial solution x(0) = p(0) = 0.

    • Linearized Bregman ISS dynamic: {x(t), p(t)}t≥0 is governed by

    ṗ(t) + (1/κ) ẋ(t) = (1/m) Aᵀ(b − Ax(t)),

    p(t) ∈ ∂‖x(t)‖1.

    Initial solution x(0) = p(0) = 0.

    • For a well-defined (and unique) path, make technical assumptions:

    • p(t) is right continuously differentiable, and
    • x(t) is right continuous

    6 / 42

  • Generalization

    Given any convex optimization model:

minimize_x r(x) + t ∙ f(x)

    one can generate the related Bregman ISS model:

    ṗ(t) = −f′(x(t)),

    p(t) ∈ ∂r(x(t)),

    where

• r is a convex regularizer: weighted ℓ1, ℓ1,2, nuclear norm, and so on;

    it can incorporate nonnegativity or box constraints as indicator functions

    • f is a convex fitting term: square loss, logistic loss, etc.

    Linearized Bregman ISS: add a strongly convex function to r.

    7 / 42

• Major claims for Bregman ISS applied to ℓ1

    The solution path {x(t), p(t)}t≥0 :

    • x(t) is sparse w.h.p., if p ∈ ∂‖x‖1 ∩ R(Aᵀ) and A is fat

    • x(t) is less biased than LASSO, better than LASSO+debiasing

    • the path can be computed piece-wise very quickly

    • sign-consistency: sign(x(t)) = sign(x∗) at some t under conditions

    In less technical language, the new method

    • recovers sparse nonzero elements like ℓ1 but avoids its bias

    • can generate a regularization path much more quickly than ℓ1

    • wherever ℓ1 extends, it does too

    8 / 42

• Background: ℓ1 subgradient

    • Subdifferential of a convex function f:

    ∂f(y) = {p : f(x) ≥ f(y) + ⟨p, x − y⟩, ∀x ∈ dom f}.

    Each p ∈ ∂f(y) is a subgradient of f at y.

    • Subdifferential of | ∙ |:

    ∂|xi| = {1} if xi > 0; [−1, 1] if xi = 0; {−1} if xi < 0.

    ⇒ let pi ∈ ∂|xi|; then

    xi ≥ 0 if pi = 1; xi = 0 if pi ∈ (−1, 1); xi ≤ 0 if pi = −1.

    9 / 42
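The three-position behavior above is easy to tabulate. A minimal NumPy sketch (the function name and the (lo, hi) interval encoding are mine, not from the slides):

```python
import numpy as np

def l1_subdifferential(x, tol=1e-12):
    """Bounds (lo, hi) of the subdifferential of |.| at each entry of x:
    {1} when x_i > 0, [-1, 1] when x_i = 0, {-1} when x_i < 0."""
    x = np.asarray(x, dtype=float)
    lo = np.where(x > tol, 1.0, -1.0)   # lower bound of admissible p_i
    hi = np.where(x < -tol, -1.0, 1.0)  # upper bound of admissible p_i
    return lo, hi

lo, hi = l1_subdifferential([2.0, 0.0, -3.0])
# x_i > 0 pins p_i = 1; x_i = 0 leaves p_i free in [-1, 1]; x_i < 0 pins p_i = -1
```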

  • Sparsity and `1 subgradient

Although the map x ↔ p is not one-to-one, it is in some cases. When xi is nonzero, pi must equal its sign; when pi ∈ (−1, 1), xi has to be zero.

    p is like an array of 3-position switches: −1, (−1, 1), +1

    10 / 42

  • Toy example 1

    Consider:

    b = ax + �,

    where b, a, x are strictly positive scalars.

    Bregman ISS:

    • start: x(0) = 0 and p(0) = 0

    • stage 1: p evolves before reaching 1, meanwhile x stays 0.

    ṗ = a(b − ax) = ab ⇒ p(t) = (ab)t

• stage 2: p reaches 1 at t = 1/(ab) but cannot exceed 1, so ṗ(t) ≤ 0 and

    thus x(t) ≠ 0. The right-continuity assumption makes ṗ(t) < 0 impossible, as

    it would immediately force x(t+) = 0. Therefore,

    for t ≥ 1/(ab): 0 = ṗ(t) = a(b − ax(t)) ⇒ x(t) = b/a, p(t) = 1.

    Once p(t) = 1 “switch is on”, the signal x(t) = b/a immediately pops out!

    11 / 42
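The two stages above can be checked numerically. A small sketch of the closed-form scalar path (noiseless case; the function name is mine):

```python
def bregman_iss_scalar(a, b, t):
    """Closed-form Bregman ISS path for the scalar model b = a*x with a, b > 0.
    Stage 1: p(t) = (a*b)*t grows while x(t) = 0.
    Stage 2: at t >= 1/(a*b), p sticks at 1 and x(t) = b/a pops out, unbiased."""
    t_star = 1.0 / (a * b)
    if t < t_star:
        return (a * b) * t, 0.0     # (p(t), x(t)) during stage 1
    return 1.0, b / a               # (p(t), x(t)) during stage 2

p_early, x_early = bregman_iss_scalar(2.0, 4.0, 0.05)   # t < t* = 1/8
p_late, x_late = bregman_iss_scalar(2.0, 4.0, 1.0)      # t > t*
```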

• LASSO:

    x(t) = arg min_x |x| + (t/2)|ax − b|²

    ⇒ optimality condition:

    0 = p + ta(ax − b), p ∈ ∂|x|

    ⇒ solution:

    p(t) = (ab)t for t ∈ [0, 1/(ab)), and p(t) = 1 for t ∈ [1/(ab), ∞);

    x(t) = 0 for t ∈ [0, 1/(ab)), and x(t) = b/a − 1/(ta²) for t ∈ [1/(ab), ∞).

    In this example, LASSO has the same p(t) path but a different x(t) path.

    LASSO's x(t) path has reduced signal strength.

    12 / 42

• ℓ1 subgradient and sparsity

    Faces of ∂‖x‖1

    • ℓ1 subdifferential: ∂‖x‖1 = ∂|x1| × ∙ ∙ ∙ × ∂|xn|.

    • The image of ∂‖x‖1 is [−1, 1]ⁿ

    • Let p ∈ ∂‖x‖1. For xi ≠ 0, pi must equal ±1 and is thus exposed.

    • More pi exposed ⇔ p lies on a lower-dimensional face of [−1, 1]ⁿ

    Observation:

    vector x is sparse ⇔ few pi = ±1 ⇔ p is not on a low-dim face of [−1, 1]ⁿ

    13 / 42

• • If matrix A is fat (or Aᵀ is thin), then R(Aᵀ) is a small subspace

    • If A is random and p ∈ ∂‖x‖1 ∩ R(Aᵀ), then

    ⇒ p is unlikely to lie on a low-dim face of [−1, 1]ⁿ

    ⇒ very few pi = ±1

    ⇒ sparse x

    Bregman ISS update:

    ṗ(t) = (1/m) Aᵀ(b − Ax(t)), p(t) ∈ ∂‖x(t)‖1,

    ⇒ p ∈ ∂‖x‖1 ∩ R(Aᵀ)

    Conclusion: if A is fat, then x(t) is typically sparse.

    14 / 42

  • Toy example 2

• x ∈ Rⁿ, measurement b is a scalar:

    b = aᵀx + ε ∈ R

    Suppose a1 = 1 > a2 ≥ . . . ≥ an > 0 and b > 0, w.l.o.g.

    • Bregman ISS solution:

    x1(t) = 0 for t < 1/b, and x1(t) = b for t ≥ 1/b;

    x2(t) = ∙ ∙ ∙ = xn(t) = 0, t ≥ 0.

    15 / 42

• • LASSO solution:

    x1(t) = 0 for t < 1/b, and x1(t) = b − 1/t for t ≥ 1/b;

    x2(t) = ∙ ∙ ∙ = xn(t) = 0, t ≥ 0.

    • Both solutions are sparse. Like before, the LASSO solution is a

    strength-reduced signal, which is not good.

    16 / 42

  • Oracle estimator

• The unknown support S := supp(x∗) is disclosed by an oracle

    • The oracle estimator is the least-squares solution restricted to S:

    x̃∗_S = arg min { (1/2m)‖Ax − b‖₂² : supp(x) = S }

    • Define the submatrix A_S of A and Σ_m := (1/m) A_Sᵀ A_S. The oracle estimate

    x̃∗_S = Σ_m⁻¹ ((1/m) A_Sᵀ b) = x∗_S + (1/m) Σ_m⁻¹ A_Sᵀ ε

    has the oracle properties:

    • consistency: supp(x̃∗_S) = S

    • normality: x̃∗_S ∼ N(x∗_S, (σ²/m) Σ_m⁻¹). In particular, E[x̃∗_S] = x∗_S: unbiased.

    17 / 42
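The oracle estimator is just least squares on the columns indexed by S. A NumPy sketch with made-up data (the dimensions, support, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 200
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
S = [3, 10, 42]                       # support disclosed by the "oracle"
x_true[S] = [1.5, -2.0, 3.0]
b = A @ x_true + 0.01 * rng.standard_normal(m)

# Least squares restricted to S gives the oracle estimate
x_oracle = np.zeros(n)
x_oracle[S], *_ = np.linalg.lstsq(A[:, S], b, rcond=None)

err = np.max(np.abs(x_oracle - x_true))   # small: the estimate is unbiased
```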

  • LASSO fails to have oracle properties

    Tibshirani’96 (LASSO) and Chen-Donoho-Saunders’96 (BPDN):

minimize ‖x‖1 + (t/2m)‖Ax − b‖₂²

    Optimality conditions:

    p = (t/m) Aᵀ(b − Ax), p ∈ ∂‖x‖1.

    Pros:

    • p ∈ ∂‖x‖1 ∩ R(Aᵀ), so x is sparse

    • efficient solvers for fixed t

    • sign-consistency under conditions

    Cons:

    • x(t) is always biased!

    • computing for many values of t is slow or inaccurate

    18 / 42

  • The LASSO bias

At some t, suppose supp(x̃^LASSO) = supp(x∗) =: S.

    Then,

    x̃^LASSO_S = [x∗_S + (1/m) Σ_m⁻¹ A_Sᵀ ε]  (oracle estimate)  −  [(1/t) Σ_m⁻¹ sign(x̃^LASSO_S)]  (bias).

    The bias is caused by the part of the ℓ1-norm applied to x_S.

    LASSO's ℓ1 minimization enforces x_{Sᶜ} = 0 but hurts the signals in x_S!

    19 / 42

  • Debias LASSO

    Two approaches:

• Exact debias: add (1/t) Σ_m⁻¹ sign(x̃^LASSO_S) to x̃^LASSO_S

    • Pseudo debias:

    minimize_x ‖Ax − b‖² subject to supp(x) = supp(x̃^LASSO)

    It's "pseudo" since the debiased solution may have changed signs.

    Issues:

    • extra computation

    • the bias has a negative effect on the signs of x̃^LASSO, which debiasing does not

    remove; therefore:

    x̃^LASSO_S often misses small signals, which are not recovered by debiasing.

    • does not work for problems with "continuous support" (e.g., in low-rank

    matrix recovery)

    20 / 42

  • Bregman ISS: a “debiasing” interpretation

    • LASSO optimality condition:

p = (t/m) Aᵀ(b − Ax)

    • Differentiate w.r.t. t ⇒

    ṗ = (1/m) Aᵀ(b − A(tẋ + x))

    • Important: recognize that tẋ + x is the debiased LASSO solution!

    • Idea: replace tẋ + x by x ⇒ Bregman ISS:

    ṗ = (1/m) Aᵀ(b − Ax)

    • No bias is ever introduced!

    • Note: Bregman ISS ≠ LASSO+debiasing. Bregman ISS is better and faster.

    21 / 42

  • Compute the Bregman ISS path

    Theorem

    The solution path to

ṗ₊(t) = (1/m) Aᵀ(b − Ax(t)), p(t) ∈ ∂‖x(t)‖1

    with initial conditions t0 = 0, p(0) = 0, x(0) = 0, is given piece-wise by:

    • for k = 1, 2, . . . , K

    • p(t) is piece-wise linear:

    p(t) = p(tk−1) + ((t − tk−1)/m) Aᵀ(b − Ax(tk−1)), t ∈ [tk−1, tk],

    where tk := sup{t > tk−1 : p(t) ∈ ∂‖x(tk−1)‖1}.

    • x(t) = x(tk−1) is piece-wise constant for t ∈ [tk−1, tk); if tk ≠ ∞,

    x(tk) = arg min_u ‖Au − b‖₂² subject to ui ≥ 0 where pi(tk) = 1; ui = 0 where pi(tk) ∈ (−1, 1); ui ≤ 0 where pi(tk) = −1.

    22 / 42

  • Faster alternative: Linearized Bregman ISS

ṗ(t) + (1/κ) ẋ(t) = (1/m) Aᵀ(b − Ax(t)),

    p(t) ∈ ∂‖x(t)‖1.

    • The solution is piece-wise smooth, in closed form.

    • It approximates Bregman ISS, converging to the Bregman ISS solution

    exponentially fast in κ

    • It reduces to one nonlinear ODE:

    ż(t) = (1/m) Aᵀ(b − κ A shrink(z(t))).

    Insight: the mapping z(t) = p(t) + (1/κ) x(t) is one-to-one. Given z(t), recover

    x(t) = κ shrink(z(t)), p(t) = z(t) − (1/κ) x(t),

    where shrink(u) = prox_{‖∙‖1}(u) = arg min_y ‖y‖1 + (1/2)‖y − u‖₂².

    23 / 42
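The one-to-one map between z(t) and the pair (x(t), p(t)) can be sketched directly; shrink is unit-threshold soft-thresholding (the z values below are illustrative):

```python
import numpy as np

def shrink(u):
    """prox of the l1 norm with unit threshold (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

kappa = 5.0
z = np.array([2.0, 0.3, -1.4])   # z = p + x/kappa

x = kappa * shrink(z)            # recover x(t):  [5.0, 0.0, -2.0]
p = z - x / kappa                # recover p(t):  [1.0, 0.3, -1.0]
# p lands in the l1 subdifferential of x: p_i = sign(x_i) where x_i != 0,
# and |p_i| < 1 exactly where x_i = 0.
```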

  • Discrete Linearized Bregman Iteration

    • Nonlinear ODE (from last slide):

ż = (1/m) Aᵀ(b − κ A shrink(z(t))).

    • Forward Euler:

    z^(k+1) = z^k + (α_k/m) Aᵀ(b − A x^k), where x^k := κ shrink(z^k)

    • Easy to parallelize for very large datasets. For example:

    A = [A1 A2 ∙ ∙ ∙ AL], where Aℓ is distributed

    Distributed implementation:

    for ℓ = 1, . . . , L in parallel:

    zℓ^(k+1) = zℓ^k + (α_k/m) Aℓᵀ(b − w^k)

    wℓ^(k+1) = κ Aℓ shrink(zℓ^(k+1))

    all-reduce sum: w^(k+1) = Σ_{ℓ=1}^{L} wℓ^(k+1).

    24 / 42
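A minimal NumPy sketch of the (serial) forward-Euler iteration above; the problem sizes, κ, step-size rule, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def shrink(u):
    return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

def linearized_bregman(A, b, kappa, n_iter):
    m, n = A.shape
    alpha = m / (kappa * np.linalg.norm(A, 2) ** 2)   # conservative step
    z = np.zeros(n)
    for _ in range(n_iter):
        x = kappa * shrink(z)                  # x^k stays sparse throughout
        z = z + (alpha / m) * (A.T @ (b - A @ x))
    return kappa * shrink(z)

rng = np.random.default_rng(1)
m, n = 30, 60
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[5, 17, 49]] = [3.0, -2.0, 4.0]
b = A @ x_true                                 # noiseless for the sketch

x_hat = linearized_bregman(A, b, kappa=20.0, n_iter=20000)
rel_res = np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b)
```

Note the intermediate x^k is sparse at every iteration, so the iterates themselves trace out a regularization path.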

  • Comparison to ISTA iteration for LASSO

• Linearized Bregman (LB) iteration:

    z^(k+1) = z^k − (α_k/m) Aᵀ(A(κ shrink(z^k)) − b)

    • ISTA (forward-backward splitting, FPC, SpaRSA, ...) iteration:

    x^(k+1) = shrink(x^k − (α_k/m) Aᵀ(Ax^k − b), λ)

    Comparison:

    • ISTA: the intermediate x^k is dense; solves LASSO for fixed λ as k → ∞

    • LB: the intermediate x^k is sparse (useful as a regularization path);

    as k → ∞, it solves:

    minimize ‖x‖1 + (1/2κ)‖x‖₂² subject to Ax = b,

    with the exact penalty property: sufficiently large κ gives an ‖x‖1 minimizer

    25 / 42
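For contrast, an ISTA sketch for the fixed-λ LASSO objective λ‖x‖1 + (1/2m)‖Ax − b‖₂²; the step-size rule, λ, and iteration count are illustrative:

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def ista(A, b, lam, n_iter=5000):
    m, n = A.shape
    step = m / np.linalg.norm(A, 2) ** 2     # 1/L for the smooth term
    x = np.zeros(n)
    for _ in range(n_iter):
        # gradient step on the quadratic, then prox of lam*||.||_1
        x = soft(x - (step / m) * (A.T @ (A @ x - b)), step * lam)
    return x

rng = np.random.default_rng(2)
m, n = 30, 60
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[5, 17, 49]] = [3.0, -2.0, 4.0]
b = A @ x_true

x_ista = ista(A, b, lam=0.05)
obj = lambda v: np.linalg.norm(A @ v - b) ** 2 / (2 * m) + 0.05 * np.abs(v).sum()
```

Unlike the LB iterates, the intermediate x here is generally dense until it converges for the one fixed λ.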

  • Comparison to orthogonal matching pursuit (OMP)1

OMP: start with index set S = ∅ and vector x = 0;

    iterate:

    1. compute the residual correlations A∗(b − Ax); add the index of the largest-magnitude entry to S

    2. set x ← arg min ‖b − Ax‖₂² subject to xi = 0 ∀i ∉ S.

    Differences:

    • OMP: grows the index set S (OMP variants evolve S in other ways)

    • ISS: evolves p ∈ ∂‖x‖1, which encodes more information

    ¹Mallat-Zhang'93, Tropp-Gilbert'07

    26 / 42

  • Generalization (once again)

    Bregman ISS model:

ṗ(t) = −f′(x(t)),

    p(t) ∈ ∂r(x(t)),

    where

    • r is a convex regularizer: weighted ℓ1, ℓ1,2, nuclear norm, etc.

    • f is a convex fitting term: square loss, logistic loss, etc.

    Linearized Bregman ISS model: add a strongly convex function to r.

    27 / 42

  • Next: Numerical examples

    28 / 42

  • 20-Dimensional Example

[Figure: the x(t) path (top panel, values in [−100, 100]) and the p(t) path (bottom panel, values in [−1, 1]) versus t, on a scale of ×10⁻⁶.]

    29 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    30 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    31 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    32 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    33 / 42

  • Predict prostate tumor size

    • given 8 clinical features, select predictors for prostate tumor size

    • data: 67 training cases + 30 testing cases

    Predictor LS Subset LASSO ISS

    Intercept 2.452 2.466 2.481 2.476

    lcavol 0.716 0.667 0.622 0.554

    lweight 0.293 0.366 0.289 0.279

    age -0.143 0 -0.096 0

    lbph 0.212 0 0.188 0.198

    svi 0.310 0.268 0.262 0.238

    lcp -0.289 -0.291 -0.164 0

    gleason -0.021 0 0 0

    pgg45 0.277 0.227 0.187 0.122

    #Features 8 5 7 5

    Test Error 0.586 0.587 0.543 0.541

    LS = least squares, Subset = best subset regression, LASSO solved by glmnet

Bregman ISS achieves the least test error with the fewest features!

    34 / 42

  • Relation to discrete Bregman iteration

• Euler discretization of ṗ = (1/m) Aᵀ(b − Ax):

    p^(k+1) = p^k + (δ/m) Aᵀ(b − Ax^(k+1)),

    which is the first-order optimality condition of

    x^(k+1) ← arg min_x D_{‖∙‖1}(x; x^k) + (δ/2m)‖Ax − b‖²,

    where D_{‖∙‖1}(x; x^k) := ‖x‖1 − ‖x^k‖1 − ⟨p^k, x − x^k⟩.

    • By a change of variable, the "add-back-the-residual" iteration:

    x^(k+1) ← arg min_x ‖x‖1 + (δ/2m)‖Ax − b^k‖²,

    b^(k+1) ← b^k + (b − Ax^(k+1)).

    This is still true if ‖∙‖1 is replaced by any convex regularizer.

    • Message: keep the existing solver, use a small δ, and "add back the residual"

    35 / 42
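The "keep your existing solver" recipe is a three-line wrapper around any LASSO solver. A sketch that uses a simple ISTA-based solver as the black box (the solver choice, δ, and iteration counts are illustrative):

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso(A, b, lam, n_iter=3000):
    """Black-box solver for lam*||x||_1 + (1/2m)||Ax - b||^2, here via ISTA."""
    m, n = A.shape
    step = m / np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for _ in range(n_iter):
        x = soft(x - (step / m) * (A.T @ (A @ x - b)), step * lam)
    return x

def bregman(A, b, delta, n_outer=15):
    """Add back the residual: reuse the solver unchanged, only update b^k."""
    bk = b.copy()
    x = np.zeros(A.shape[1])
    for _ in range(n_outer):
        x = lasso(A, bk, lam=1.0 / delta)   # existing solver, small delta
        bk = bk + (b - A @ x)               # add back the residual
    return x

rng = np.random.default_rng(4)
m, n = 40, 80
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 30, 71]] = [3.0, -2.0, 4.0]
b = A @ x_true

x_single = lasso(A, b, lam=0.2)          # one LASSO solve: biased
x_breg = bregman(A, b, delta=5.0)        # Bregman iteration: bias removed
err_single = np.max(np.abs(x_single - x_true))
err_breg = np.max(np.abs(x_breg - x_true))
```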

  • Test with noisy measurements and tiny signals

[Figure, left: true signal vs. BPDN recovery and hand-tuned LASSO. Right: true signal vs. Bregman recovery at the 5th iteration. Signal amplitudes in [−2, 2] over 250 indices.]

    36 / 42

  • Related observation

    YALL1 paper (Yang-Zhang’08): tested different algorithms for LASSO

min ‖u‖1 + (t/2n)‖Au − b‖₂².

    Strange observation: ADM algorithms do better than the model itself!

    37 / 42

  • Theory: path consistency

Question: does there exist t so that the solution x(t) has the following properties?

    • no false positive: if x∗i = 0, then xi(t) = 0

    • no false negative: if x∗i ≠ 0, then xi(t) ≠ 0

    • sign consistency: furthermore, sign(x∗) = sign(x(t)).

    Theorem

    Under the assumptions

    • Gaussian noise: ω ∼ N(0, σ²I),

    • normalized columns: (1/n) max_j ‖Aj‖² ≤ 1,

    and under irrepresentable and strong-signal conditions, Bregman ISS reaches

    sign consistency and gives an unbiased estimate of x∗.

    Proof is based on the next two lemmas.

    38 / 42

  • No false positive

Define the true support S := supp(x∗), and let T := Sᶜ.

    Lemma

    Under assumptions, if A_S has full column rank and

    max_{j∈T} ‖Ajᵀ A_S (A_Sᵀ A_S)⁻¹‖1 ≤ 1 − η

    for some η ∈ (0, 1), then with high probability

    supp(x(s)) ⊆ S, ∀s ≤ t̄ := O( (1/σ) √(m / log n) ).

    Proof uses: (i) a concentration inequality and (ii) if supp(x(s)) ⊆ S for s ≤ t, then

    p_T(s) = A_Tᵀ A_S (A_Sᵀ A_S)⁻¹ p_S(s) + (s/m) A_Tᵀ P_{A_S}^⊥ w, s ≤ t.

    39 / 42

  • No false negative / sign consistency

    Lemma

Under assumptions, if A_Sᵀ A_S ⪰ γI and the smallest signal magnitude u_min := min_{i∈S} |x∗i| satisfies

    u_min ≥ max{ O( (σ/√γ) √(log |S| / m) ), O( ((σ log |S|)/(ηγ)) √(log n / m) ) },

    then there exists t∗ (which can be given explicitly) so that with high probability

    sign(x(t∗)) = sign(x∗)

    and x(t∗) = x∗_S − (A_Sᵀ A_S)⁻¹ A_Sᵀ ω obeys

    ‖x(t∗) − x∗‖∞ ≤ u_min/2.

    • the first term in the max ensures ‖(A_Sᵀ A_S)⁻¹ A_Sᵀ ω‖∞ ≤ u_min/2

    • the second term ensures: inf{t : sign(x_S(t)) = sign(x∗_S)} ≤ t̄.

    40 / 42

  • Related work

    Discrete:

• Bregman iteration for imaging (TV) and compressed sensing ℓ1:

    Osher-Burger-Goldfarb-Xu-Y'06, Y-Osher-Goldfarb-Darbon'08

    • Linearized Bregman on ℓ1: Y-Osher-Goldfarb-Darbon'08, Y'10, Lai-Y'13

    • Matrix completion SVT on ‖X‖∗: Cai-Candès-Shen'10

    • Extension and analysis: Zhang'13, Zhang'14

    Continuous:

    • Inverse scale space (ISS) on TV: Burger-Gilboa-Osher-Xu'06

    • Adaptive ISS on ℓ1: Burger-Möller-Benning-Osher'11

    • Greedy ISS on ℓ1: Möller-Zhang'13

    41 / 42

  • Summary

Instead of minimizing r(x) + t ∙ f(x), just try

    ṗ(t) = −f′(x), p ∈ ∂r(x).

    It will

    • keep solution structure

    • remove bias

    • give a solution path efficiently

    Even simpler for you: keep your existing solver, apply “add back the residual”

    42 / 42