

  • Sparse regularization path by differential inclusion

    Wotao Yin (UCLA Math)

    joint with: Stanley Osher, Ming Yan (UCLA)

    Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U)

    ICERM Approximation, Integration, and Optimization Workshop

    October 2, 2014

    1 / 42

  • Background

    • Assume vector x∗ ∈ Rn is sparse, unknown

    • Goal: Recover x∗ from

b = Ax∗ + ε,

    where A ∈ Rm×n, b ∈ Rm, and ε is unknown noise.

    • Consider m ≪ n, the under-determined case

    2 / 42

  • Background: Regularization by optimization

    Examples:

• convex: LASSO: minimize λ‖x‖1 + (1/2m)‖Ax − b‖₂²

    • nonconvex: SCAD, ℓp-seminorm minimization, p ∈ (0, 1)

    Optimization approach:

    • convex penalty: avoids overfitting, tractable, but leads to bias

    • nonconvex penalty: may work better, but performance is unpredictable

    They have a tuning parameter, the best choice of which is often unknown

    So, we need model selection: vary the parameter values, solve many instances,

    and then pick the best one

    3 / 42

  • Background: Regularization path by an algorithm

    Algorithmic regularization:

• An algorithm generates a regularization path. (Points on the path may not

    minimize an energy function.)

    • Model selection is done by deciding when to stop: at a time (for a continuous

    dynamic) or at an iteration (for a discrete update)

    Examples:

    • LASSO/LARS: solve parameterized LASSO KKT conditions

    • xMP family

    4 / 42

  • This talk introduces a continuous regularization path by differential

    inclusions, with

    • recovery guarantees

    • fast implementation

and generalization to other structured solutions

    5 / 42

  • Introduced: two Inverse Scale Space (ISS) Dynamics

• Let x(t), p(t) ∈ Rⁿ be the primal–dual regularization path; t is time.

    • Bregman ISS dynamic: {x(t), p(t)}t≥0 is governed by

    ṗ(t) = (1/m) Aᵀ(b − Ax(t)),

    p(t) ∈ ∂‖x(t)‖1.

    Initial solution x(0) = p(0) = 0.

    • Linearized Bregman ISS dynamic: {x(t), p(t)}t≥0 is governed by

    ṗ(t) + (1/κ) ẋ(t) = (1/m) Aᵀ(b − Ax(t)),

    p(t) ∈ ∂‖x(t)‖1.

    Initial solution x(0) = p(0) = 0.

    • For a well-defined (and unique) path, make technical assumptions:

    • p(t) is right continuously differentiable, and
    • x(t) is right continuous

    6 / 42

  • Generalization

    Given any convex optimization model:

minimize_x r(x) + t ∙ f(x)

    one can generate the related Bregman ISS model:

    ṗ(t) = −f′(x(t)),

    p(t) ∈ ∂r(x(t)),

    where

• r is a convex regularizer: weighted ℓ1, ℓ1,2, nuclear norm, and so on;

    it can incorporate nonnegativity or box constraints as indicator functions

    • f is a convex fitting term: square loss, logistic loss, etc.

    Linearized Bregman ISS: add a strongly convex function to r.

    7 / 42

• Major claims for Bregman ISS applied to ℓ1

    The solution path {x(t), p(t)}t≥0 :

    • x(t) is sparse w.h.p., if p ∈ ∂‖x‖1 ∩ R(Aᵀ) and A is fat

    • x(t) is less biased than LASSO, better than LASSO+debiasing

    • the path can be computed piece-wise very quickly

    • sign-consistency: sign(x(t)) = sign(x∗) at some t under conditions

    In less technical language, the new method

    • recovers sparse nonzero elements like ℓ1 but avoids its bias

    • can generate a regularization path much more quickly than ℓ1

    • wherever ℓ1 extends, it does too

    8 / 42

• Background: ℓ1 subgradient

    • Subdifferential of a convex function f:

    ∂f(y) = {p : f(x) ≥ f(y) + ⟨p, x − y⟩, ∀x ∈ dom f}.

    Each p ∈ ∂f(y) is a subgradient of f at y.

    • Subdifferential of | ∙ |:

    ∂|xi| = {1} if xi > 0; [−1, 1] if xi = 0; {−1} if xi < 0.

    ⇒ let pi ∈ ∂|xi|; then

    xi ≥ 0 if pi = 1; xi = 0 if pi ∈ (−1, 1); xi ≤ 0 if pi = −1.

    9 / 42
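The three-position behavior above is easy to tabulate. A minimal NumPy sketch (the function name and the (lo, hi) interval encoding are mine, not from the slides):

```python
import numpy as np

def l1_subdifferential(x, tol=1e-12):
    """Bounds (lo, hi) of the subdifferential of |.| at each entry of x:
    {1} when x_i > 0, [-1, 1] when x_i = 0, {-1} when x_i < 0."""
    x = np.asarray(x, dtype=float)
    lo = np.where(x > tol, 1.0, -1.0)   # lower bound of admissible p_i
    hi = np.where(x < -tol, -1.0, 1.0)  # upper bound of admissible p_i
    return lo, hi

lo, hi = l1_subdifferential([2.0, 0.0, -3.0])
# x_i > 0 pins p_i = 1; x_i = 0 leaves p_i free in [-1, 1]; x_i < 0 pins p_i = -1
```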

  • Sparsity and `1 subgradient

Although the map x ↔ p is not one-to-one, it is in some cases. When xi is nonzero, pi must equal its sign; when pi ∈ (−1, 1), xi has to be zero.

    p is like an array of 3-position switches: −1, (−1, 1), +1

    10 / 42

  • Toy example 1

    Consider:

    b = ax + �,

    where b, a, x are strictly positive scalars.

    Bregman ISS:

    • start: x(0) = 0 and p(0) = 0

    • stage 1: p evolves before reaching 1, meanwhile x stays 0.

    ṗ = a(b − ax) = ab ⇒ p(t) = (ab)t

• stage 2: p reaches 1 at t = 1/(ab) but cannot exceed 1, so ṗ(t) ≤ 0 and

    thus x(t) ≠ 0. The right-continuity assumption makes ṗ(t) < 0 impossible, as

    it would immediately force x(t+) = 0. Therefore,

    for t ≥ 1/(ab): 0 = ṗ(t) = a(b − ax(t)) ⇒ x(t) = b/a, p(t) = 1.

    Once p(t) = 1 “switch is on”, the signal x(t) = b/a immediately pops out!

    11 / 42
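The two stages above can be checked numerically. A small sketch of the closed-form scalar path (noiseless case; the function name is mine):

```python
def bregman_iss_scalar(a, b, t):
    """Closed-form Bregman ISS path for the scalar model b = a*x with a, b > 0.
    Stage 1: p(t) = (a*b)*t grows while x(t) = 0.
    Stage 2: at t >= 1/(a*b), p sticks at 1 and x(t) = b/a pops out, unbiased."""
    t_star = 1.0 / (a * b)
    if t < t_star:
        return (a * b) * t, 0.0     # (p(t), x(t)) during stage 1
    return 1.0, b / a               # (p(t), x(t)) during stage 2

p_early, x_early = bregman_iss_scalar(2.0, 4.0, 0.05)   # t < t* = 1/8
p_late, x_late = bregman_iss_scalar(2.0, 4.0, 1.0)      # t > t*
```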

• LASSO:

    x(t) = arg min_x |x| + (t/2)|ax − b|²

    ⇒ optimality condition:

    0 = p + ta(ax − b), p ∈ ∂|x|

    ⇒ solution:

    p(t) = (ab)t for t ∈ [0, 1/(ab)), and p(t) = 1 for t ∈ [1/(ab), ∞);

    x(t) = 0 for t ∈ [0, 1/(ab)), and x(t) = b/a − 1/(ta²) for t ∈ [1/(ab), ∞).

    In this example, LASSO has the same p(t) path but a different x(t) path.

    LASSO's x(t) path has reduced signal strength.

    12 / 42

• ℓ1 subgradient and sparsity

    Faces of ∂‖x‖1

    • ℓ1 subdifferential: ∂‖x‖1 = ∂|x1| × ∙ ∙ ∙ × ∂|xn|.

    • The image of ∂‖x‖1 is [−1, 1]ⁿ

    • Let p ∈ ∂‖x‖1. For xi ≠ 0, pi must equal ±1 and is thus exposed.

    • More pi exposed ⇔ p lies on a lower-dimensional face of [−1, 1]ⁿ

    Observation:

    vector x is sparse ⇔ few pi = ±1 ⇔ p is not on a low-dim face of [−1, 1]ⁿ

    13 / 42

• • If matrix A is fat (or Aᵀ is thin), then R(Aᵀ) is a small subspace

    • If A is random and p ∈ ∂‖x‖1 ∩ R(Aᵀ), then

    ⇒ p is unlikely to lie on a low-dim face of [−1, 1]ⁿ

    ⇒ very few pi = ±1

    ⇒ sparse x

    Bregman ISS update:

    ṗ(t) = (1/m) Aᵀ(b − Ax(t)), p(t) ∈ ∂‖x(t)‖1,

    ⇒ p ∈ ∂‖x‖1 ∩ R(Aᵀ)

    Conclusion: if A is fat, then x(t) is typically sparse.

    14 / 42

  • Toy example 2

• x ∈ Rⁿ, measurement b is a scalar:

    b = aᵀx + ε ∈ R

    Suppose a1 = 1 > a2 ≥ . . . ≥ an > 0 and b > 0, w.l.o.g.

    • Bregman ISS solution:

    x1(t) = 0 for t < 1/b, and x1(t) = b for t ≥ 1/b;

    x2(t) = ∙ ∙ ∙ = xn(t) = 0, t ≥ 0.

    15 / 42

• • LASSO solution:

    x1(t) = 0 for t < 1/b, and x1(t) = b − 1/t for t ≥ 1/b;

    x2(t) = ∙ ∙ ∙ = xn(t) = 0, t ≥ 0.

    • Both solutions are sparse. Like before, the LASSO solution is a

    strength-reduced signal, which is not good.

    16 / 42

  • Oracle estimator

• The unknown support S := supp(x∗) is disclosed by an oracle

    • The oracle estimator is the least-squares solution restricted to S:

    x̃∗_S = arg min { (1/2m)‖Ax − b‖₂² : supp(x) = S }

    • Define the submatrix A_S of A and Σ_m := (1/m) A_Sᵀ A_S. The oracle estimate

    x̃∗_S = Σ_m⁻¹ ((1/m) A_Sᵀ b) = x∗_S + (1/m) Σ_m⁻¹ A_Sᵀ ε

    has the oracle properties:

    • consistency: supp(x̃∗_S) = S

    • normality: x̃∗_S ∼ N(x∗_S, (σ²/m) Σ_m⁻¹). In particular, E[x̃∗_S] = x∗_S: unbiased.

    17 / 42
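The oracle estimator is just least squares on the columns indexed by S. A NumPy sketch with made-up data (the dimensions, support, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 200
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
S = [3, 10, 42]                       # support disclosed by the "oracle"
x_true[S] = [1.5, -2.0, 3.0]
b = A @ x_true + 0.01 * rng.standard_normal(m)

# Least squares restricted to S gives the oracle estimate
x_oracle = np.zeros(n)
x_oracle[S], *_ = np.linalg.lstsq(A[:, S], b, rcond=None)

err = np.max(np.abs(x_oracle - x_true))   # small: the estimate is unbiased
```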

  • LASSO fails to have oracle properties

    Tibshirani’96 (LASSO) and Chen-Donoho-Saunders’96 (BPDN):

minimize ‖x‖1 + (t/2m)‖Ax − b‖₂²

    Optimality conditions:

    p = (t/m) Aᵀ(b − Ax), p ∈ ∂‖x‖1.

    Pros:

    • p ∈ ∂‖x‖1 ∩ R(Aᵀ), so x is sparse

    • efficient solvers for fixed t

    • sign-consistency under conditions

    Cons:

    • x(t) is always biased!

    • computing for many values of t is slow or inaccurate

    18 / 42

  • The LASSO bias

At some t, suppose supp(x̃^LASSO) = supp(x∗) =: S.

    Then,

    x̃^LASSO_S = [x∗_S + (1/m) Σ_m⁻¹ A_Sᵀ ε]  (oracle estimate)  −  [(1/t) Σ_m⁻¹ sign(x̃^LASSO_S)]  (bias).

    The bias is caused by the part of the ℓ1-norm applied to x_S.

    LASSO's ℓ1 minimization enforces x_{Sᶜ} = 0 but hurts the signals in x_S!

    19 / 42

  • Debias LASSO

    Two approaches:

• Exact debias: add (1/t) Σ_m⁻¹ sign(x̃^LASSO_S) to x̃^LASSO_S

    • Pseudo debias:

    minimize_x ‖Ax − b‖² subject to supp(x) = supp(x̃^LASSO)

    It's "pseudo" since the debiased solution may have changed signs.

    Issues:

    • extra computation

    • the bias has a negative effect on the signs of x̃^LASSO, which debiasing does not

    remove; therefore:

    x̃^LASSO_S often misses small signals, which are not recovered by debiasing.

    • does not work for problems with "continuous support" (e.g., in low-rank

    matrix recovery)

    20 / 42

  • Bregman ISS: a “debiasing” interpretation

    • LASSO optimality condition:

p = (t/m) Aᵀ(b − Ax)

    • Differentiate w.r.t. t ⇒

    ṗ = (1/m) Aᵀ(b − A(tẋ + x))

    • Important: recognize that tẋ + x is the debiased LASSO solution!

    • Idea: replace tẋ + x by x ⇒ Bregman ISS:

    ṗ = (1/m) Aᵀ(b − Ax)

    • No bias is ever introduced!

    • Note: Bregman ISS ≠ LASSO+debiasing. Bregman ISS is better and faster.

    21 / 42

  • Compute the Bregman ISS path

    Theorem

    The solution path to

ṗ₊(t) = (1/m) Aᵀ(b − Ax(t)), p(t) ∈ ∂‖x(t)‖1

    with initial conditions t0 = 0, p(0) = 0, x(0) = 0, is given piece-wise by:

    • for k = 1, 2, . . . , K

    • p(t) is piece-wise linear:

    p(t) = p(tk−1) + ((t − tk−1)/m) Aᵀ(b − Ax(tk−1)), t ∈ [tk−1, tk],

    where tk := sup{t > tk−1 : p(t) ∈ ∂‖x(tk−1)‖1}.

    • x(t) = x(tk−1) is piece-wise constant for t ∈ [tk−1, tk); if tk ≠ ∞,

    x(tk) = arg min_u ‖Au − b‖₂² subject to ui ≥ 0 where pi(tk) = 1; ui = 0 where pi(tk) ∈ (−1, 1); ui ≤ 0 where pi(tk) = −1.

    22 / 42

  • Faster alternative: Linearized Bregman ISS

ṗ(t) + (1/κ) ẋ(t) = (1/m) Aᵀ(b − Ax(t)),

    p(t) ∈ ∂‖x(t)‖1.

    • The solution is piece-wise smooth, in closed form.

    • It approximates Bregman ISS, converging to the Bregman ISS solution

    exponentially fast in κ

    • It reduces to one nonlinear ODE:

    ż(t) = (1/m) Aᵀ(b − κ A shrink(z(t))).

    Insight: the mapping z(t) = p(t) + (1/κ) x(t) is one-to-one. Given z(t), recover

    x(t) = κ shrink(z(t)), p(t) = z(t) − (1/κ) x(t),

    where shrink(u) = prox_{‖∙‖1}(u) = arg min_y ‖y‖1 + (1/2)‖y − u‖₂².

    23 / 42
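The one-to-one map between z(t) and the pair (x(t), p(t)) can be sketched directly; shrink is unit-threshold soft-thresholding (the z values below are illustrative):

```python
import numpy as np

def shrink(u):
    """prox of the l1 norm with unit threshold (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

kappa = 5.0
z = np.array([2.0, 0.3, -1.4])   # z = p + x/kappa

x = kappa * shrink(z)            # recover x(t):  [5.0, 0.0, -2.0]
p = z - x / kappa                # recover p(t):  [1.0, 0.3, -1.0]
# p lands in the l1 subdifferential of x: p_i = sign(x_i) where x_i != 0,
# and |p_i| < 1 exactly where x_i = 0.
```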

  • Discrete Linearized Bregman Iteration

    • Nonlinear ODE (from last slide):

ż = (1/m) Aᵀ(b − κ A shrink(z(t))).

    • Forward Euler:

    z^(k+1) = z^k + (α_k/m) Aᵀ(b − A x^k), where x^k := κ shrink(z^k)

    • Easy to parallelize for very large datasets. For example:

    A = [A1 A2 ∙ ∙ ∙ AL], where Aℓ is distributed

    Distributed implementation:

    for ℓ = 1, . . . , L in parallel:

    zℓ^(k+1) = zℓ^k + (α_k/m) Aℓᵀ(b − w^k)

    wℓ^(k+1) = κ Aℓ shrink(zℓ^(k+1))

    all-reduce sum: w^(k+1) = Σ_{ℓ=1}^{L} wℓ^(k+1).

    24 / 42
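A minimal NumPy sketch of the (serial) forward-Euler iteration above; the problem sizes, κ, step-size rule, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def shrink(u):
    return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

def linearized_bregman(A, b, kappa, n_iter):
    m, n = A.shape
    alpha = m / (kappa * np.linalg.norm(A, 2) ** 2)   # conservative step
    z = np.zeros(n)
    for _ in range(n_iter):
        x = kappa * shrink(z)                  # x^k stays sparse throughout
        z = z + (alpha / m) * (A.T @ (b - A @ x))
    return kappa * shrink(z)

rng = np.random.default_rng(1)
m, n = 30, 60
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[5, 17, 49]] = [3.0, -2.0, 4.0]
b = A @ x_true                                 # noiseless for the sketch

x_hat = linearized_bregman(A, b, kappa=20.0, n_iter=20000)
rel_res = np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b)
```

Note the intermediate x^k is sparse at every iteration, so the iterates themselves trace out a regularization path.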

  • Comparison to ISTA iteration for LASSO

• Linearized Bregman (LB) iteration:

    z^(k+1) = z^k − (α_k/m) Aᵀ(A(κ shrink(z^k)) − b)

    • ISTA (forward-backward splitting, FPC, SpaRSA, ...) iteration:

    x^(k+1) = shrink(x^k − (α_k/m) Aᵀ(Ax^k − b), λ)

    Comparison:

    • ISTA: the intermediate x^k is dense; solves LASSO for fixed λ as k → ∞

    • LB: the intermediate x^k is sparse (useful as a regularization path);

    as k → ∞, it solves:

    minimize ‖x‖1 + (1/2κ)‖x‖₂² subject to Ax = b,

    with the exact penalty property: sufficiently large κ gives an ‖x‖1 minimizer

    25 / 42
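For contrast, an ISTA sketch for the fixed-λ LASSO objective λ‖x‖1 + (1/2m)‖Ax − b‖₂²; the step-size rule, λ, and iteration count are illustrative:

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def ista(A, b, lam, n_iter=5000):
    m, n = A.shape
    step = m / np.linalg.norm(A, 2) ** 2     # 1/L for the smooth term
    x = np.zeros(n)
    for _ in range(n_iter):
        # gradient step on the quadratic, then prox of lam*||.||_1
        x = soft(x - (step / m) * (A.T @ (A @ x - b)), step * lam)
    return x

rng = np.random.default_rng(2)
m, n = 30, 60
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[5, 17, 49]] = [3.0, -2.0, 4.0]
b = A @ x_true

x_ista = ista(A, b, lam=0.05)
obj = lambda v: np.linalg.norm(A @ v - b) ** 2 / (2 * m) + 0.05 * np.abs(v).sum()
```

Unlike the LB iterates, the intermediate x here is generally dense until it converges for the one fixed λ.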

  • Comparison to orthogonal matching pursuit (OMP)1

OMP: start with index set S = ∅ and vector x = 0;

    iterate:

    1. compute the residual correlations A∗(b − Ax); add the index of the largest-magnitude entry to S

    2. set x ← arg min ‖b − Ax‖₂² subject to xi = 0 ∀i ∉ S.

    Differences:

    • OMP: grows the index set S (OMP variants evolve S in other ways)

    • ISS: evolves p ∈ ∂‖x‖1, which encodes more information

    ¹Mallat-Zhang'93, Tropp-Gilbert'07

    26 / 42

  • Generalization (once again)

    Bregman ISS model:

ṗ(t) = −f′(x(t)),

    p(t) ∈ ∂r(x(t)),

    where

    • r is a convex regularizer: weighted ℓ1, ℓ1,2, nuclear norm, etc.

    • f is a convex fitting term: square loss, logistic loss, etc.

    Linearized Bregman ISS model: add a strongly convex function to r.

    27 / 42

  • Next: Numerical examples

    28 / 42

  • 20-Dimensional Example

[Figure: the x(t) path (top panel, values in [−100, 100]) and the p(t) path (bottom panel, values in [−1, 1]) versus t, on a scale of ×10⁻⁶.]

    29 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    30 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    31 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    32 / 42

  • Example

[Figure: snapshot of the x(t) (top, [−100, 100]) and p(t) (bottom, [−1, 1]) paths versus t (×10⁻⁶).]

    33 / 42

  • Predict prostate tumor size

    • given 8 clinical features, select predictors for prostate tumor size

    • data: 67 training cases + 30 testing cases

    Predictor LS Subset LASSO ISS

    Intercept 2.452 2.466 2.481 2.476

    lcavol 0.716 0.667 0.622 0.554

    lweight 0.293 0.366 0.289 0.279

    age -0.143 0 -0.096 0

    lbph 0.212 0 0.188 0.198

    svi 0.310 0.268 0.262 0.238

    lcp -0.289 -0.291 -0.164 0

    gleason -0.021 0 0 0

    pgg45 0.277 0.227 0.187 0.122

    #Features 8 5 7 5

    Test Error 0.586 0.587 0.543 0.541

    LS = least squares, Subset = best subset regression, LASSO solved by glmnet

Bregman ISS achieves the least test error with the fewest features!

    34 / 42

  • Relation to discrete Bregman iteration

• Euler discretization of ṗ = (1/m) Aᵀ(b − Ax):

    p^(k+1) = p^k + (δ/m) Aᵀ(b − Ax^(k+1)),

    which is the first-order optimality condition of

    x^(k+1) ← arg min_x D_{‖∙‖1}(x; x^k) + (δ/2m)‖Ax − b‖²,

    where D_{‖∙‖1}(x; x^k) := ‖x‖1 − ‖x^k‖1 − ⟨p^k, x − x^k⟩.

    • By a change of variable, the "add-back-the-residual" iteration:

    x^(k+1) ← arg min_x ‖x‖1 + (δ/2m)‖Ax − b^k‖²,

    b^(k+1) ← b^k + (b − Ax^(k+1)).

    This is still true if ‖∙‖1 is replaced by any convex regularizer.

    • Message: keep the existing solver, use a small δ, and "add back the residual"

    35 / 42
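The "keep your existing solver" recipe is a three-line wrapper around any LASSO solver. A sketch that uses a simple ISTA-based solver as the black box (the solver choice, δ, and iteration counts are illustrative):

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso(A, b, lam, n_iter=3000):
    """Black-box solver for lam*||x||_1 + (1/2m)||Ax - b||^2, here via ISTA."""
    m, n = A.shape
    step = m / np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for _ in range(n_iter):
        x = soft(x - (step / m) * (A.T @ (A @ x - b)), step * lam)
    return x

def bregman(A, b, delta, n_outer=15):
    """Add back the residual: reuse the solver unchanged, only update b^k."""
    bk = b.copy()
    x = np.zeros(A.shape[1])
    for _ in range(n_outer):
        x = lasso(A, bk, lam=1.0 / delta)   # existing solver, small delta
        bk = bk + (b - A @ x)               # add back the residual
    return x

rng = np.random.default_rng(4)
m, n = 40, 80
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 30, 71]] = [3.0, -2.0, 4.0]
b = A @ x_true

x_single = lasso(A, b, lam=0.2)          # one LASSO solve: biased
x_breg = bregman(A, b, delta=5.0)        # Bregman iteration: bias removed
err_single = np.max(np.abs(x_single - x_true))
err_breg = np.max(np.abs(x_breg - x_true))
```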

  • Test with noisy measurements and tiny signals

[Figure, left: true signal vs. BPDN recovery and hand-tuned LASSO. Right: true signal vs. Bregman recovery at the 5th iteration. Signal amplitudes in [−2, 2] over 250 indices.]

    36 / 42

  • Related observation

    YALL1 paper (Yang-Zhang’08): tested different algorithms for LASSO

min ‖u‖1 + (t/2n)‖Au − b‖₂².

    Strange observation: ADM algorithms do better than the model itself!

    37 / 42

  • Theory: path consistency

Question: does there exist t so that the solution x(t) has the following properties?

    • no false positive: if x∗i = 0, then xi(t) = 0

    • no false negative: if x∗i ≠ 0, then xi(t) ≠ 0

    • sign consistency: furthermore, sign(x∗) = sign(x(t)).

    Theorem

    Under the assumptions

    • Gaussian noise: ω ∼ N(0, σ²I),

    • normalized columns: (1/n) max_j ‖Aj‖² ≤ 1,

    and under irrepresentable and strong-signal conditions, Bregman ISS reaches

    sign consistency and gives an unbiased estimate of x∗.

    Proof is based on the next two lemmas.

    38 / 42

  • No false positive

Define the true support S := supp(x∗), and let T := Sᶜ.

    Lemma

    Under assumptions, if A_S has full column rank and

    max_{j∈T} ‖Ajᵀ A_S (A_Sᵀ A_S)⁻¹‖1 ≤ 1 − η

    for some η ∈ (0, 1), then with high probability

    supp(x(s)) ⊆ S, ∀s ≤ t̄ := O( (1/σ) √(m / log n) ).

    Proof uses: (i) a concentration inequality and (ii) if supp(x(s)) ⊆ S for s ≤ t, then

    p_T(s) = A_Tᵀ A_S (A_Sᵀ A_S)⁻¹ p_S(s) + (s/m) A_Tᵀ P_{A_S}^⊥ w, s ≤ t.

    39 / 42

  • No false negative / sign consistency

    Lemma

Under assumptions, if A_Sᵀ A_S ⪰ γI and the smallest signal magnitude u_min := min_{i∈S} |x∗i| satisfies

    u_min ≥ max{ O( (σ/√γ) √(log |S| / m) ), O( ((σ log |S|)/(ηγ)) √(log n / m) ) },

    then there exists t∗ (which can be given explicitly) so that with high probability

    sign(x(t∗)) = sign(x∗)

    and x(t∗) = x∗_S − (A_Sᵀ A_S)⁻¹ A_Sᵀ ω obeys

    ‖x(t∗) − x∗‖∞ ≤ u_min/2.

    • the first term in the max ensures ‖(A_Sᵀ A_S)⁻¹ A_Sᵀ ω‖∞ ≤ u_min/2

    • the second term ensures: inf{t : sign(x_S(t)) = sign(x∗_S)} ≤ t̄.

    40 / 42

  • Related work

    Discrete:

• Bregman iteration for imaging (TV) and compressed sensing ℓ1:

    Osher-Burger-Goldfarb-Xu-Y'06, Y-Osher-Goldfarb-Darbon'08

    • Linearized Bregman on ℓ1: Y-Osher-Goldfarb-Darbon'08, Y'10, Lai-Y'13

    • Matrix completion SVT on ‖X‖∗: Cai-Candès-Shen'10

    • Extension and analysis: Zhang'13, Zhang'14

    Continuous:

    • Inverse scale space (ISS) on TV: Burger-Gilboa-Osher-Xu'06

    • Adaptive ISS on ℓ1: Burger-Möller-Benning-Osher'11

    • Greedy ISS on ℓ1: Möller-Zhang'13

    41 / 42

  • Summary

Instead of minimizing r(x) + t ∙ f(x), just try

    ṗ(t) = −f′(x), p ∈ ∂r(x).

    It will

    • keep solution structure

    • remove bias

    • give a solution path efficiently

    Even simpler for you: keep your existing solver, apply “add back the residual”

    42 / 42