Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning

Upload: ryota-tomioka

Post on 30-May-2018


TRANSCRIPT


    Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning

    Ryota Tomioka¹, Taiji Suzuki¹, and Masashi Sugiyama²

    ¹ University of Tokyo   ² Tokyo Institute of Technology

    2009-12-12 @ NIPS workshop OPT09

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 1 / 22


    Introduction

    Objective

    Develop an optimization algorithm for the problem:

        \min_{w \in R^n}  f(Aw) + \phi_\lambda(w)
                          (loss)   (regularizer)

    For example, lasso:

        \min_{w \in R^n}  \frac{1}{2}\|Aw - y\|^2 + \lambda\|w\|_1 .

    A \in R^{m \times n}: design matrix (m: #observations, n: #unknowns).
    f is convex and twice differentiable.
    \phi_\lambda(w) is convex but possibly non-differentiable, e.g. \phi_\lambda(w) = \lambda\|w\|_1.
    We are interested in algorithms for general f and \phi_\lambda (unlike LARS).
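    As a concrete reference point, the following is a minimal NumPy sketch of the lasso objective above; the matrix A, the vector y, and the value of lambda are made-up illustration values, not settings from the talk.

        import numpy as np

        def lasso_objective(w, A, y, lam):
            """Evaluate 0.5 * ||A w - y||^2 + lam * ||w||_1."""
            residual = A @ w - y
            return 0.5 * residual @ residual + lam * np.abs(w).sum()

        rng = np.random.default_rng(0)
        A = rng.standard_normal((20, 50))   # m = 20 observations, n = 50 unknowns
        y = rng.standard_normal(20)
        print(lasso_objective(np.zeros(50), A, y, lam=0.1))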

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 2 / 22


    Introduction

    Where does the difficulty come from?

    Conventional view: the non-differentiability of \phi_\lambda(w).

    Upper-bound the regularizer from above with a differentiable function:
      FOCUSS (Rao & Kreutz-Delgado, 99)
      Majorization-Minimization (Figueiredo et al., 07)
      Iteratively reweighted least squares (IRLS)

    Explicitly handle the non-differentiability:
      Sub-gradient L-BFGS (Andrew & Gao, 07; Yu et al., 08)

    Our view: the coupling between variables introduced by A.

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 3 / 22


    Introduction

    Where does the difficulty come from?

    Our view: the coupling between variables introduced by A.

    In fact, when A = I_n,

        \min_{w \in R^n} \frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1
          = \sum_{j=1}^n \min_{w_j \in R} \Big( \frac{1}{2}(y_j - w_j)^2 + \lambda|w_j| \Big),

    with

        w_j^* = ST_\lambda(y_j) =
          \begin{cases}
            y_j - \lambda & (y_j > \lambda), \\
            0             & (-\lambda \le y_j \le \lambda), \\
            y_j + \lambda & (y_j < -\lambda).
          \end{cases}

    The minimum is obtained analytically!

    We focus on regularizers \phi_\lambda for which the above minimum can be obtained analytically.
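    A small NumPy sketch of the soft-thresholding operator ST_\lambda in the closed form above, vectorized over the coordinates; the test vector is a made-up illustration.

        import numpy as np

        def soft_threshold(y, lam):
            """Coordinate-wise ST_lam: shrink y toward zero by lam, and zero out [-lam, lam]."""
            return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

        y = np.array([2.0, 0.3, -0.7, -0.05])
        print(soft_threshold(y, lam=0.5))   # [ 1.5  0.  -0.2  0. ]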

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 4 / 22


    Introduction

    Proximation wrt \phi_\lambda can be computed analytically

    Assumption

    The proximation wrt \phi_\lambda (soft-thresholding)

        ST_\lambda(y) = \argmin_{w \in R^n} \Big( \phi_\lambda(w) + \frac{1}{2}\|y - w\|_2^2 \Big)

    can be computed analytically.
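    As a quick sanity check of this assumption in the \ell_1 case, the sketch below compares the closed-form soft-thresholding with a generic numerical minimizer of \phi_\lambda(w) + \frac{1}{2}\|y - w\|^2. It is only an illustration and assumes SciPy is available.

        import numpy as np
        from scipy.optimize import minimize

        def soft_threshold(y, lam):
            return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

        def prox_objective(w, y, lam):
            return lam * np.abs(w).sum() + 0.5 * np.sum((y - w) ** 2)

        rng = np.random.default_rng(1)
        y, lam = rng.standard_normal(3), 0.4
        res = minimize(prox_objective, x0=np.zeros_like(y), args=(y, lam),
                       method="Nelder-Mead", options={"xatol": 1e-9, "fatol": 1e-12})
        print(np.allclose(res.x, soft_threshold(y, lam), atol=1e-4))   # expected: True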

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 5 / 22


    Introduction

    Outline

    1. Introduction
       Sparse regularized learning. Why is it difficult? (not the non-differentiability)
    2. Methods
       Iterative shrinkage-thresholding (IST)
       Dual Augmented Lagrangian (proposed)
    3. Theoretical results: super-linear convergence
       Exact inner minimization
       Approximate inner minimization
    4. Empirical results
       Comparison against OWLQN, SpaRSA, and FISTA
    5. Summary

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 6 / 22


    Methods

    Iterative Shrinkage/Thresholding (IST) Algorithm (Figueiredo & Nowak, 03; Daubechies et al., 04; ...)

    1. Choose an initial solution w^0.
    2. Repeat until some stopping criterion is satisfied:

        w^{t+1} \leftarrow \underbrace{ST_{\lambda\eta_t}}_{\text{shrink}}\Big( \underbrace{w^t - \eta_t A^\top \nabla f(Aw^t)}_{\text{gradient step}} \Big).

    Pro: easy to implement.
    Con: bad for poorly conditioned A.

    Also known as:
      Forward-Backward Splitting [Combettes & Wajs, 05]
      Thresholded Landweber Iteration [Daubechies et al., 04]
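    A minimal NumPy sketch of this iteration, specialized to the lasso from the introduction; the constant step size, iteration count, and synthetic data are illustrative choices, not the talk's settings.

        import numpy as np

        def soft_threshold(y, lam):
            return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

        def ist_lasso(A, y, lam, n_iter=500):
            """IST for 0.5*||A w - y||^2 + lam*||w||_1 (here grad f(z) = z - y)."""
            eta = 1.0 / np.linalg.norm(A, 2) ** 2       # constant step 1/L with L = ||A||_2^2
            w = np.zeros(A.shape[1])
            for _ in range(n_iter):
                grad = A.T @ (A @ w - y)                # A^T grad f(A w)
                w = soft_threshold(w - eta * grad, lam * eta)   # shrink after the gradient step
            return w

        rng = np.random.default_rng(0)
        A = rng.standard_normal((50, 200))
        w_true = np.zeros(200); w_true[:5] = 3.0
        y = A @ w_true + 0.01 * rng.standard_normal(50)
        print(np.flatnonzero(ist_lasso(A, y, lam=1.0))[:10])   # support should include indices 0..4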

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 7 / 22



    Methods

    Dual Augmented Lagrangian (DAL) method

    Primal problem

        \min_w  f(Aw) + \phi_\lambda(w)  =:  F(w)

    Proximal minimization:

        w^{t+1} = \argmin_w \Big( F(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \Big)
        (0 < \eta_0 \le \eta_1 \le \cdots)

    Easy to analyze:

        F(w^{t+1}) + \frac{1}{2\eta_t}\|w^{t+1} - w^t\|^2 \le F(w^t).

    Not practical! (as difficult as the original problem)

    Dual problem

        \max_{\alpha, v}  -f^*(-\alpha) - \phi_\lambda^*(v)   s.t.   v = A^\top\alpha

    Augmented Lagrangian (Tomioka & Sugiyama, 09):

        w^{t+1} = ST_{\lambda\eta_t}(w^t + \eta_t A^\top \alpha^t),   \alpha^t = \argmin_\alpha \varphi_t(\alpha).

    Minimization of \varphi_t(\alpha) is easy (smooth).
    Step size \eta_t is increased.
    See Rockafellar 76 for the equivalence.
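    To make the two boxes concrete, here is a minimal sketch of DAL specialized to the lasso, with the inner objective \varphi_t(\alpha) = f^*(-\alpha) + \frac{1}{2\eta_t}\|ST_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha)\|^2 minimized by L-BFGS from SciPy. For the squared loss, f^*(-\alpha) = \frac{1}{2}\|\alpha\|^2 - \alpha^\top y. This is an illustration written under those assumptions, not the authors' implementation; the doubling schedule for \eta_t and the inner tolerance are arbitrary choices.

        import numpy as np
        from scipy.optimize import minimize

        def soft_threshold(u, c):
            return np.sign(u) * np.maximum(np.abs(u) - c, 0.0)

        def dal_lasso(A, y, lam, eta0=1.0, n_outer=10):
            """DAL sketch for 0.5*||A w - y||^2 + lam*||w||_1 (squared loss only)."""
            m, n = A.shape
            w, alpha, eta = np.zeros(n), np.zeros(m), eta0
            for _ in range(n_outer):
                def inner_obj(a):
                    p = soft_threshold(w + eta * (A.T @ a), lam * eta)
                    # f*(-a) = 0.5*||a||^2 - a^T y for the squared loss
                    return 0.5 * a @ a - a @ y + 0.5 / eta * p @ p
                def inner_grad(a):
                    p = soft_threshold(w + eta * (A.T @ a), lam * eta)
                    return a - y + A @ p
                alpha = minimize(inner_obj, alpha, jac=inner_grad,
                                 method="L-BFGS-B", options={"gtol": 1e-8}).x
                w = soft_threshold(w + eta * (A.T @ alpha), lam * eta)   # outer (primal) update
                eta *= 2.0                                               # increase the step size
            return w

    Doubling \eta_t each outer iteration mirrors the "step size is increased" remark above; warm-starting the inner solver at the previous \alpha^t keeps the inner minimization cheap.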

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 8 / 22


    Methods

    Difference: How do we get rid of the couplings?

    Proximation wrt F is hard:

        w^{t+1} = \argmin_w \Big( \underbrace{f(Aw)}_{\text{variables are coupled}} + \phi_\lambda(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \Big).

    IST linearly approximates the loss term:

        f(Aw) \approx f(Aw^t) + (w - w^t)^\top A^\top \nabla f(Aw^t),

    which is tightest at the current point w^t.

    DAL (proposed) linearly lower-bounds the loss term:

        f(Aw) = \max_{\alpha \in R^m} \big( -f^*(-\alpha) - w^\top A^\top\alpha \big),

    which is tightest at the next point w^{t+1}.
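    To connect this lower bound with the update on the previous slide, a sketch of the derivation in the notation above (a reconstruction, not verbatim from the talk); the min-max exchange is justified for convex f and \phi_\lambda, and the last identity uses \phi_\lambda = \lambda\|\cdot\|_1 (or any norm-type regularizer):

        \min_w \Big( f(Aw) + \phi_\lambda(w) + \tfrac{1}{2\eta_t}\|w - w^t\|^2 \Big)
          = \max_\alpha \Big( -f^*(-\alpha) + \min_w \big( \phi_\lambda(w) + \tfrac{1}{2\eta_t}\|w - w^t\|^2 - w^\top A^\top\alpha \big) \Big).

    Completing the square with u := w^t + \eta_t A^\top\alpha,

        \tfrac{1}{2\eta_t}\|w - w^t\|^2 - w^\top A^\top\alpha
          = \tfrac{1}{2\eta_t}\|w - u\|^2 - \tfrac{1}{2\eta_t}\|u\|^2 + \tfrac{1}{2\eta_t}\|w^t\|^2,

    so the inner minimization over w is exactly the proximation from the assumption slide, attained at w = ST_{\lambda\eta_t}(u) with value \tfrac{1}{2\eta_t}\big(\|u\|^2 - \|ST_{\lambda\eta_t}(u)\|^2\big). Substituting, the \|u\|^2 terms cancel and

        \alpha^t = \argmin_\alpha \Big( f^*(-\alpha) + \tfrac{1}{2\eta_t}\|ST_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha)\|^2 \Big),
        \qquad w^{t+1} = ST_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha^t),

    which is the DAL update.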

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 9 / 22


    Numerical examples

    DAL is better when A is poorly conditioned.

    [Figure: iterate trajectories on a two-dimensional example, panels labelled IST and DAL; both axes range from -2 to 2.]

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 10 / 22

    Theoretical results


    Theorem 1 (exact minimization)

    Definition

    w^t: the sequence generated by the DAL algorithm with \epsilon_t(\alpha^t) = 0 (exact inner minimization).
    w^*: the unique minimizer of the objective F.

    Assumption

    There is a constant \sigma > 0 such that

        F(w^{t+1}) - F(w^*) \ge \sigma \|w^{t+1} - w^*\|^2   (t = 0, 1, 2, \ldots).

    Theorem 1

        \|w^{t+1} - w^*\| \le \frac{1}{1 + \sigma\eta_t}\, \|w^t - w^*\|.

    I.e., w^t converges super-linearly to w^* if \eta_t is increasing.

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 11 / 22


    Theorem 2 (approximate minimization)

    Definition

    w^t: the sequence generated by the DAL algorithm with

        \epsilon_t(\alpha^t) \le \sqrt{\gamma/\eta_t}\, \|w^{t+1} - w^t\|

    (1/\gamma: Lipschitz constant of \nabla f).

    Theorem 2

    Under the same assumption as in Theorem 1,

        \|w^{t+1} - w^*\| \le \frac{1}{\sqrt{1 + 2\sigma\eta_t}}\, \|w^t - w^*\|.

    I.e., w^t converges super-linearly to w^* if \eta_t is increasing.

    Note

    Convergence is slower than in the exact case (\epsilon_t(\alpha^t) = 0).
    A faster rate can be obtained if we choose \epsilon_t(\alpha^t) \le \|w^{t+1} - w^t\| \cdot O(1/\eta_t).
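    A sketch of how this criterion could drive the inner loop of the lasso DAL sketch from the Methods section, assuming \epsilon_t(\alpha^t) is measured by the norm of the inner gradient \nabla\varphi_t(\alpha) (that identification is an assumption made here, not stated on the slide). For the squared loss, \nabla f is 1-Lipschitz, so \gamma = 1.

        import numpy as np

        def inner_stop(grad_phi, w_candidate, w_t, eta, gamma=1.0):
            """Theorem-2-style test: ||grad varphi_t(alpha)|| <= sqrt(gamma/eta_t) * ||w^{t+1} - w^t||."""
            return np.linalg.norm(grad_phi) <= np.sqrt(gamma / eta) * np.linalg.norm(w_candidate - w_t)

    Inside the inner solver one would evaluate this after every iteration, taking w_candidate = ST_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha) at the current \alpha, and stop as soon as the test holds.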

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 12 / 22



    Proof (in essence) of Theorem 1

    Since w^{t+1} = \argmin_w \big( F(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \big),

        (w^t - w^{t+1})/\eta_t \in \partial F(w^{t+1})   (it is a subgradient of F),

    i.e.,

        F(w) - F(w^{t+1}) \ge \big\langle (w^t - w^{t+1})/\eta_t,\; w - w^{t+1} \big\rangle.

    (inspired by Beck & Teboulle 09)

    [Figure: the objective F with the points w^{t+1} and w marked, illustrating the linear lower bound through (w^{t+1}, F(w^{t+1})).]
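    One way to finish the argument from here (the appendix outline does it via a \delta-parameterized inequality; Cauchy-Schwarz gives the same bound): plug in w = w^*, use the assumption F(w^{t+1}) - F(w^*) \ge \sigma\|w^{t+1} - w^*\|^2, and rearrange:

        \sigma\|w^{t+1} - w^*\|^2 \le F(w^{t+1}) - F(w^*)
          \le \tfrac{1}{\eta_t}\big\langle w^t - w^{t+1},\; w^{t+1} - w^* \big\rangle
          = \tfrac{1}{\eta_t}\Big( \big\langle w^t - w^*,\; w^{t+1} - w^* \big\rangle - \|w^{t+1} - w^*\|^2 \Big),

    so that

        (1 + \sigma\eta_t)\,\|w^{t+1} - w^*\|^2 \le \big\langle w^t - w^*,\; w^{t+1} - w^* \big\rangle \le \|w^t - w^*\|\,\|w^{t+1} - w^*\|,

    which yields \|w^{t+1} - w^*\| \le \frac{1}{1+\sigma\eta_t}\|w^t - w^*\| (Theorem 1).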

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 13 / 22


    Proof (in essence) of Theorem 2

    The subgradient inequality now carries the cost of the approximate inner minimization:

        F(w) - F(w^{t+1}) \ge \big\langle (w^t - w^{t+1})/\eta_t,\; w - w^{t+1} \big\rangle
          - \underbrace{\tfrac{1}{2\gamma}\,\epsilon_t(\alpha^t)^2}_{\text{cost of approximate minimization}}

    (1/\gamma: Lipschitz constant of \nabla f).

    [Figure: the objective F with the points w^{t+1} and w marked, as on the previous slide.]

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 14 / 22

    Empirical results


    Empirical results: \ell_1-logistic regression

    #samples = 1,024, #unknowns = 16,384.

    Compared methods:
      FISTA: two-step IST (Beck & Teboulle 09)
      OWLQN: orthant-wise L-BFGS (Andrew & Gao 07)
      SpaRSA: step-size improved IST (Wright et al. 09)

    [Figure: \|w^t - w^*\| and F(w^t) - F(w^*) plotted against the number of iterations and against CPU time (sec) for DAL, FISTA, OWLQN, and SpaRSA; the DAL panels also show the bounds of Theorems 1 and 2 ("DAL (thm1)", "DAL (thm2)").]
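    For reference, a minimal NumPy sketch of FISTA (the two-step IST baseline above), specialized to the lasso rather than to the \ell_1-logistic regression benchmark of this experiment; data would be supplied as in the IST sketch earlier.

        import numpy as np

        def soft_threshold(y, lam):
            return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

        def fista_lasso(A, y, lam, n_iter=500):
            """FISTA (Beck & Teboulle 09) for 0.5*||A w - y||^2 + lam*||w||_1."""
            L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part's gradient
            w = np.zeros(A.shape[1])
            z, s = w.copy(), 1.0
            for _ in range(n_iter):
                w_new = soft_threshold(z - A.T @ (A @ z - y) / L, lam / L)
                s_new = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
                z = w_new + ((s - 1.0) / s_new) * (w_new - w)   # momentum (two-step) update
                w, s = w_new, s_new
            return w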

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 15 / 22


    Summary

    Why is sparse learning difficult to optimize? The couplings introduced by A;
    the non-differentiability is not the real difficulty.
    Cost of the inner minimization is O(m^2 n^+) (n^+: number of active variables):
    sparsity makes the inner minimization efficient.

    How do we get rid of the couplings?
    Use a linear parametric lower bound instead of a linear approximation.
    Super-linear convergence for exact/approximate inner minimization.
    Improved a classic result in optimization (Rockafellar 76) by specializing the setting
    to sparse learning, i.e., proximation wrt \phi_\lambda can be performed analytically.

    Empirical results are promising:
    faster than OWLQN, SpaRSA, and FISTA, with the potential to be generalized further.

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 16 / 22


    Appendix


    (1) Proximation wrt \phi_\lambda is analytic (though non-smooth):

        w^{t+1} = ST_{\lambda\eta_t}\big( w^t + \eta_t A^\top \alpha^t \big).

    (2) Inner minimization is smooth:

        \alpha^t = \argmin_{\alpha \in R^m} \varphi_t(\alpha),
        \qquad \varphi_t(\alpha) = \underbrace{f^*(-\alpha)}_{\text{independent of } A}
          + \frac{1}{2\eta_t} \underbrace{\big\| ST_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha) \big\|_2^2}_{\text{linear in the number of active variables}} .
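    Because \varphi_t is smooth, its gradient can be checked numerically. For the squared loss, \nabla\varphi_t(\alpha) = \alpha - y + A\, ST_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha) (the expression used in the DAL sketch in the Methods section); the finite-difference comparison below is only an illustration with made-up sizes.

        import numpy as np

        def soft_threshold(u, c):
            return np.sign(u) * np.maximum(np.abs(u) - c, 0.0)

        rng = np.random.default_rng(2)
        m, n, lam, eta = 8, 20, 0.3, 1.5
        A, y = rng.standard_normal((m, n)), rng.standard_normal(m)
        w_t, alpha = rng.standard_normal(n), rng.standard_normal(m)

        phi = lambda a: 0.5 * a @ a - a @ y + 0.5 / eta * np.sum(
            soft_threshold(w_t + eta * (A.T @ a), lam * eta) ** 2)
        grad = alpha - y + A @ soft_threshold(w_t + eta * (A.T @ alpha), lam * eta)

        fd = np.array([(phi(alpha + 1e-6 * e) - phi(alpha - 1e-6 * e)) / 2e-6 for e in np.eye(m)])
        print(np.max(np.abs(fd - grad)))   # should be on the order of 1e-7 or smaller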

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 17 / 22


    Comparison to other algorithms

    DAL (this talk):
        \|w^k - w^*\| = O(\exp(-k))
    SpaRSA (step-size improved IST):
        convergence shown, but no rate given (Wright et al. 09)
    OWLQN (orthant-wise L-BFGS):
        convergence shown, but no rate given (Andrew & Gao 07)
    IST (iterative soft-thresholding):
        F(w^k) - F(w^*) = O(1/k)   (Beck & Teboulle 09)
    FISTA (two-step IST):
        F(w^k) - F(w^*) = O(1/k^2)   (Beck & Teboulle 09)

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 18 / 22


    Comparison to Rockafellar 76

    Assumption

    The multifunction (\partial F)^{-1} is (locally) Lipschitz continuous at the origin:

        \|(\partial F)^{-1}(\beta) - (\partial F)^{-1}(0)\| \le L\|\beta\|   (\|\beta\| \le \epsilon).

    This implies our assumption with \sigma = \frac{1}{2}\min\big(1/L,\; \epsilon/\|w^0 - w^*\|\big).

    Convergence (exact minimization): comparable to Theorem 1,

        \|w^{t+1} - w^*\| \le \frac{1}{\sqrt{1 + (\eta_t/L)^2}}\, \|w^t - w^*\|.

    Convergence (approximate minimization): much worse than Theorem 2,

        \|w^{t+1} - w^*\| \le \frac{\theta_t + \delta_t}{1 - \delta_t}\, \|w^t - w^*\|,
        \qquad \theta_t = \frac{1}{\sqrt{1 + (\eta_t/L)^2}}

    (assuming \epsilon_t(\alpha^t) \le (\delta_t/\eta_t)\,\|w^{t+1} - w^t\|).

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 19 / 22


    Outline of proof of Theorem 1

    1. Since w^{t+1} = \argmin_w \big( F(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \big),
       (w^t - w^{t+1})/\eta_t is a subgradient of F at w^{t+1}, i.e.,

           F(w) - F(w^{t+1}) \ge \big\langle (w^t - w^{t+1})/\eta_t,\; w - w^{t+1} \big\rangle.

    2. For any \delta > 0,

           \big\langle w^{t+1} - w^*,\; w^t - w^* \big\rangle \le \frac{\delta}{2}\|w^* - w^{t+1}\|^2 + \frac{1}{2\delta}\|w^t - w^*\|^2.

    3. Combining 1 & 2 and using F(w^{t+1}) - F(w^*) \ge \sigma\|w^{t+1} - w^*\|^2,

           \frac{1}{2\delta}\|w^t - w^*\|^2 \ge \Big( (1 + \sigma\eta_t) - \frac{\delta}{2} \Big)\|w^{t+1} - w^*\|^2.

    4. Choose \delta to make the bound tightest (\delta = 1 + \sigma\eta_t), which gives Theorem 1.

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 20 / 22


    Outline of proof of Theorem 2

    1. Let \epsilon_t := \epsilon_t(\alpha^t). Then

           F(w) - F(w^{t+1}) \ge \big\langle (w^t - w^{t+1})/\eta_t,\; w - w^{t+1} \big\rangle - \frac{1}{2\gamma}\epsilon_t^2.

    2. By assumption, F(w^{t+1}) - F(w^*) \ge \sigma\|w^{t+1} - w^*\|^2, and

           \epsilon_t^2 \le \frac{\gamma}{\eta_t}\|w^{t+1} - w^t\|^2.

    3. Combining 1 & 2,

           \frac{1}{2}\|w^* - w^t\|^2 \ge \Big( \sigma\eta_t + \frac{1}{2} \Big)\|w^* - w^{t+1}\|^2,

       which gives Theorem 2.

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 21 / 22


    EEG problem: P300 visual speller dataset (subject A)

    Number of samples m = 2550. 6-class classification.
    w \in R^{37 \times 64}. Trace-norm regularization.

    [Figure: objective value vs. time (s) for DAL, SubLBFGS, and ProjGrad.]
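    The trace-norm case fits the same framework because its proximation is also "analytic" up to an SVD: soft-threshold the singular values. A minimal NumPy sketch (the matrix shape matches the slide; the regularization value is a made-up illustration):

        import numpy as np

        def prox_trace_norm(W, lam):
            """argmin_X  lam*||X||_tr + 0.5*||X - W||_F^2 : soft-threshold the singular values of W."""
            U, s, Vt = np.linalg.svd(W, full_matrices=False)
            return (U * np.maximum(s - lam, 0.0)) @ Vt

        W = np.random.default_rng(3).standard_normal((37, 64))
        print(np.linalg.matrix_rank(prox_trace_norm(W, lam=5.0)))   # rank is reduced by the thresholding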

    Ryota Tomioka (Univ Tokyo) DAL 2009-12-12 22 / 22