ols with and regularization - duke...

OLS with `1 and `2 regularizationCEE 629. System Identification

Duke University, Fall 2017

`1 regularization

• The `1 norm of a vector v ∈ Rn is given by ||v||1 = ∑ |vi|The gradient of ||v||1 is not defined if an element of v is zero.

• In `1 regularization, the objective J(a) = ||y − f(y; a)||22 is penalizedwith a term α||a||1, where α is called the regularization parameter, or thepenalty factor.

• The effect of `1 regularization is to force some of the model parameters,ai, to zero (exactly). If the parameters are coefficients for bases of themodel, then `1 regularization is a means to remove un-important basesof the model. It is a form of model reduction

• It is applicable to linear and nonlinear models.

• It has similar but not identical effects as `2 regularization and truncatedSVD solutions

• a.k.a. LASSO, Compressed Sensing, Support Vector Machine

But `1 is not smooth at the optimum point.What can one say about necessary conditions for optimality?

`1 regularization can be recast as a Quadratic Program (QP)

J(a) = ||y −Xa||22 + α||a||1 = ||y −Xa||22 + α∑i

|ai| (1)⇔

ai ≡ pi − qi (2)|ai| ≡ pi + qi (3)

J(p, q) = ||y −Xp+Xq||22 + α∑i

(pi + qi) (4)

such that pi ≥ 0 and qi ≥ 0 ∀ i

2 CEE 629 – System Identification – Duke University – Fall 2017 – H.P. Gavin

Five conditions to show that `1 is a QP

1. pi − qi = ai

2. pi + qi = |ai|

3. pi ≥ 0 qi ≥ 0

4. ai > 0⇒ pi = ai , qi = 0ai < 0⇒ pi = 0 , qi = −aiai = 0⇒ pi = 0 , qi = 0

5. min(pi + qi) = ai

• (1)&(2)⇒ (3)&(4)

• (1)&(3)⇒\ (2)&(4)

• (1)&(3)⇒ (4)&(5)

Solving the `1 regularization QP

J(p, q) = ||y −Xp+Xq||22 + α∑i

(pi + qi)

such that pi ≥ 0 and qi ≥ 0 ∀ i

But which elements of pi and which elements of qi should be set to zero?... (the “active” constraints)This question does not have an answer in closed form.So we solve the problem incrementally.

Incremental formulation

p(k+1) = p(k) + u(k) , p(k) + u(k) ≥ 0 (5)q(k+1) = q(k) + v(k) , q(k) + v(k) ≥ 0 (6)

Lagrangian

L(p, q, µ, ν) = J(p, q) + µ(k)Ti∈Au

(p(k) + u(k))i∈Au+ ν

(k)Ti∈Av

(q(k) + v(k))i∈Av

• k is the iteration number

• Au is the set of active equality constraints pi = 0

• Av is the set of active equality constraints qi = 0

• µ(k)i∈Au

is the set of Lagrange multipliers for the active constraint set Au

• ν(k)i∈Av

is the set of Lagrange multipliers for the active constraint set Av

CC BY-NC-ND January 19, 2019, H.P. Gavin

http://creativecommons.org/licenses/by-nc-nd/3.0/

OLS with `1 and `2 regularization 3

Minimize the Lagrangian w.r.t. (p, q),Maximize the Lagrangian w.r.t. (µ, ν)

L = yTy −2(p + u)TXTy + pTα

+2(q + v)TXTy + qTα

+(p + u)TXTX(p + u)−2(q + v)TXTX(p + u)+µT

i∈Au(p + u)i∈Au+νT

i∈Av(q + v)i∈Av

The necessary conditions for optimality ...

∂L

∂u= 0 : −2XTy + α + µi∈Au

+ 2XTXp+ 2XTXu− 2XTXq − 2XTXv = 0

∂L

∂v= 0 : +2XTy + α + νi∈Av

− 2XTXp− 2XTXu+ 2XTXq + 2XTXv = 0

∂L

∂µi∈Au

= 0 : (p+ u)i∈Au= 0

∂L

∂νi∈Av

= 0 : (q + v)i∈Av= 0

... are linear in our unknowns u, v, µ, ν ...

2XTX −2XTX IT

i∈Au

−2XTX 2XTX ITi∈Av

Ii∈Au

Ii∈Av

uv

µν

=

2XTy − 2XTXp+ 2XTXq − α−2XTy + 2XTXp− 2XTXq − α

−pi∈Au

−qi∈Av

The number of columns of the matrix Ii∈Auis the number of model parameters.

The number of rows of Ii∈Auis the number of active constraints on p. If index

i is the element j in the set of active constraints, Au, then the i, j element ofIi∈Au

equals 1.




A trick to block an incremental step from going negative.

Define the index j s.t. pj + uj = min(p+ u) ... (pj ≥ 0)

If pj + uj < 0, then define δu = −pj/uj (δu > 0)and adjust the increment by the factor δu

p(k+1) = p(k) + δuu(k)

Likewise ...

Define the index j s.t. qj + vj = min(q + v) ... (qj ≥ 0)

If qj + vj < 0, then define δv = −qj/vj (δv > 0)and adjust the increment by the factor δv

q(k+1) = q(k) + δvv(k)

Implementation in L1 fit.m




Example 1, 2-term power polynomial

y(x; a1, a2) = a1x+ a2x2 y = y + η ση = 0.5

(a1 = 4, a2 = 8)OLS criterion mina ||y −Xa||22`1 regularization criterion mina ||y −Xa||22 + α||a||1`2 regularization criterion mina ||y −Xa||22 + β||a||2

-5

0

5

10

15

0 0.2 0.4 0.6 0.8 1

true

msmnt

OLS

L2, β = 1

L1, α = 20

-5

0

5

10

15

0 0.2 0.4 0.6 0.8 1

true

msmnt

OLS

L2, β = 70

L1, α = 70

1

2

3

4

5

6

7

8

9

2 4 6 8 10

a2

a1

OLS

L1

L1

L2

L2

α = 20.0

α = 70.0

β = 1.0

β = 70.0




Example 2, 100-term Prony Series

A Prony series a model for the behavior dynamic system expressed in termsof a convolution of the system forcing with a kernel, in which the kernel is aseries of exponentials.

y(t) =∫ t

0h(r)u(t− r) dr

where, for example, the kernel can be parameterized as

h(r) = k0 +n∑i=1

ki exp[−r/τi]

Prony series are linear in the coefficients ki but are highly nonlinear in thetime constants τi.

In the frequency domain, the complex modulus is obtained from the Fouriertransform of the convolution kernel,

H(ω) = k0 +n∑i=1

kiiωτi

1 + iωτiThe real part of H(ω) is called the storage modulus and represents the energy-conserving elasticity of the system. The imaginary part of H(ω) is called theloss modulus and relates to the energy dissipation aspects. For linear visco-elastic materials the loss modulus approaches zero as ω approaches zero.

Using `1 regularization, a Prony series with very large set of time constantscan be fit to complex-modulus data. By tuning the `1 penalty factor α tobe large enough, a large number of coefficients in the Prony series will beforced to zero. Since negative values of the coefficients are not permissible,this problem is a perfect candidate for `1 regularization.

The system model is then given by

H(ω; k) = T (τ)k

where the i-th column of the matrix T (τ) is iωτi/(1 + iωτi) and the vector kcontains the n+ 1 coefficients k0, ..., kn.

The regularized objective function is

J(k) = ||H(ω)− T (τ)k||22 + α∑i

k s.t. k ≥ 0

This is a special (more simple) case of the `1 objective analyzed in the previoussection. The solution is implemented in PronyFit.m .




02468

10

10-4 10-3 10-2 10-1 100 101 102 103 104

sto

rag

e m

od

ulu

s

00.5

11.5

2

10-4 10-3 10-2 10-1 100 101 102 103 104loss m

od

ulu

s

00.05

0.10.15

0.20.25

0.30.35

0.4

10-4 10-3 10-2 10-1 100 101 102 103 104

tan δ

frequency, f, Hz

0

0.2

0.4

0.6

0.8

1

10-3 10-2 10-1 100 101 102 103

basis

functi

ons

Tk(ω

) =

i ω

τk /

(i ωτ

k +

1)

frequency, f, Hz

-100

-50

0

50

100

1 2 3 4 5 6

para

mete

rs:

'o';

lag

rang

e m

ult

ipliers

: 'x

'

iteration

convergence history

01234567

10-3 10-2 10-1 100 101 102 103 104

sto

rag

e m

od

ulu

s,

G'(ω

)

G'(ω) meas.

G'(ω) fit

0

0.5

1

1.5

2

10-3 10-2 10-1 100 101 102 103 104loss m

od

ulu

s,

G"(ω

)

ω, rad/s

G"(ω) meas.

G"(ω) fit

-1

0

1

2

3

4

10-5 10-4 10-3 10-2 10-1 100 101 102

coeffi

cient,

kk

relaxation time, τk

relaxation spectrum

true

fit




L1 reg example.m1 % L1 reg example .m2 % HPG, Duke Univ . 2013−09, 2017−0934 % generate some data ( x , y ) , noisy in y56 m = 100; % number o f measurement p o i n t s7 x = [ 1:m]’/m; % independent v a r i a b l e s − noise−f r e e89 a = [ 4 8 ]’; % ” true ” parameter c o e f f i c i e n t s

10 n = length(a);1112 X = zeros(m,n);13 for i=1:n14 X(:,i) = x.ˆi;15 end1617 [eigvec , eigval ] = eig (X ’*X); % p r i n c i p a l d i r e c t i o n s o f parameter covar iance1819 y = X*a;2021 alpha_set = [ 20.0 70.0 ]; % f o r L1 r e g u l a r i z a t i o n22 beta_set = [ 1.0 70.0 ]; % f o r L2 r e g u l a r i z a t i o n2324 N_MCS = 100;25 N_R = length( alpha_set );2627 a_hat_ols_rec = zeros(n, N_MCS );28 a_hat_alpha_rec = zeros(n,N_R , N_MCS );29 a_hat_beta_rec = zeros(n,N_R , N_MCS );303132 for j=1: N_MCS ; % . . . s t a r t s imu la t ion loop3334 y_dat = y + 0.5*randn(m ,1); % add noise to t rue data3536 a_hat_ols = X \ y_dat ; % ordinary l e a s t squares37 y_hat_ols = X* a_hat_ols ;3839 a_hat_ols_rec (:,j) = a_hat_ols ;4041 for k=1: N_R4243 alpha = alpha_set (k);44 beta = beta_set (k);4546 a_hat_alpha = L1_fit ( X, y_dat , alpha ); % L 1 r e g u l a r i z a t i o n47 y_hat_alpha = X * a_hat_alpha ;48 a_hat_alpha_rec (:,j,k) = a_hat_alpha ;4950 a_hat_beta = [ X ’*X + beta*eye(n) ] \ X ’* y_dat ; % L 2 r e g u l a r i z a t i o n51 y_hat_beta = X* a_hat_beta ;52 a_hat_beta_rec (:,j,k) = a_hat_beta ;5354 figure (1)55 c l f56 plot ( x,y,’-k’, x,y_dat ,’ok ’, ...57 x,y_hat_ols ,’-k’, x, y_hat_beta ,’-b’, x, y_hat_alpha ,’-r’ )58 legend(’true ’,’msmnt ’,’OLS ’, ...59 sprintf (’L_2 , \\ beta = %2.0f’,beta), ...60 sprintf (’L_1 , \\ alpha = %2.0f’,alpha ),’location ’,’northwest ’);61 legend(’boxoff ’)62 drawnow6364 end6566 end % . . . end s imu la t ion loop




L1 fit.m1 function [a, mu , nu , cvg_hst ] = L1_fit ( X, y, alpha )2 % [ a , mu, nu , c v g h s t ] = L 1 f i t (X, y , a lpha )3 %4 % f i t model parameters , a , in the model ˆy = X∗a to data , y , wi th5 % L1 r e g u l a r i z a i o n o f the parameters .6 % J = | | y − X∗a | | 2 ˆ2 + alpha ∗ | | a | | 17 % <=>8 % J = | | y − X(p−q ) | | 2 ˆ2 + alpha ∗ sum( p+q ) such t h a t p>=0, q>=09 % ( a = p−q ; | a | = p+q )

10 %11 % INPUT: X . . . the des ign matrix ( b a s i s o f the model ) , m x n12 % y . . . the ve c t o r o f data to be f i t to the model , m x 113 % alpha . . . l 1 r e g u l a r i z a t i o n f a c t o r f o r Prony s e r i e s c o e f f i c i e n t s14 %15 % OUTPUT: a . . . the model c o e f f i c i e n t s16 % mu, nu . . . Lagrange m u l t i p l i e r s f o r p and q17 % c v g h s t . . . convergence h i s t o r y18 %1920 %{21 The main idea behind casting L1 as a QP is that the parameter22 vector {a} is replaced by the difference of two vectors {a}={p}-{q}23 that are constrained to be non - negative : p_i >=0 for a l l i and q_i >=024 for a l l i.25 If a_i > 0, then p_i=a_i and q_i =0;26 If a_i < 0, then p_i= 0 and q_i = -a_i.27 With this constrained re - parameterization , |a_i| = p_i + q_i.28 Note that the dimension of the parameter space doubles , but the KTT equations29 for the QP are simple and have analytical Hessians and gradients .30 %}313233 [m,n] = s ize (X);3435 t = [1:m];3637 XtX = 2*X ’*X; Xty = 2*X ’*y;3839 % good i n i t i a l guess f o r p and q from non−r e g u l a r i z e d l i n e a r l e a s t squares40 a = XtX \ Xty;41 p = zeros(n ,1); p ( find (a > 2*eps)) = a( find (a > 2*eps));42 q = zeros(n ,1); q ( find (a < -2*eps)) = -a( find (a < -2*eps));4344 MaxIter = 19;4546 cvg_hst = zeros ( 5*n, MaxIter ); % convergence h i s t o r y4748 for iter = 1: MaxIter4950 Au = find (p <= 2*eps); lp = length(Au ); % a c t i v e s e t f o r update u51 Av = find (q <= 2*eps); lq = length(Av ); % a c t i v e s e t f o r update v5253 Ip = zeros(lp ,n); for i=1:lp , Ip(i,Au(i)) = 1; end % c o n s t r a i n t g r a d i e n t u54 Iq = zeros(lq ,n); for i=1:lq , Iq(i,Av(i)) = 1; end % c o n s t r a i n t g r a d i e n t v5556 mu = zeros(n ,1); nu = zeros(n ,1);57 % KKT equat ions58 XTX = [ XtX -XtX Ip ’ zeros(n,lq) ; % dL/du = 059 -XtX XtX zeros(n,lp) Iq ’ ; % dL/dv = 060 Ip zeros(lp ,n+lp+lq) ; % dL/d mu = 061 zeros(lq ,n) Iq zeros(lq ,lp+lq) ]; % dL/d nu = 06263 % r i g h t−hand−s i d e v e c t or64 XTY = [ Xty - XtX*p + XtX*q - alpha ; % dL/du = 065 -Xty + XtX*p - XtX*q - alpha ; % dL/dv = 066 -p(Au) ; % dL/d mu = 067 -q(Av) ]; % dL/d nu = 06869 u_v_mu_nu = XTX \ XTY; % s o l v e the system




7071 %

7273 u = u_v_mu_nu (1:n); % update f o r p74 v = u_v_mu_nu (n +1:2* n); % update f o r q75 mu(Au) = u_v_mu_nu (2*n +1:2* n+lp ); % what to use mu f o r ?76 nu(Av) = u_v_mu_nu (2*n+lp +1:2* n+lp+lq ); % what to use nu f o r ?7778 % i f an element o f p+u becomes nege t i ve , reduce the l e n g t h o f s t e p u79 du = 1;80 [p_min , j] = min(p+u);81 i f p_min < 082 du = -p(j)/u(j);83 end8485 % i f an element o f q+v becomes nege t i ve , reduce the l e n g t h o f s t e p v86 dv = 1;87 [q_min , j] = min(q+v);88 i f q_min < 089 dv = -q(j)/v(j);90 end9192 p = p + du*u;93 q = q + dv*v;94 a = p - q;9596 cvg_hst (:, iter) = [ a ; p ; q ; mu ; nu ]; % convergence h i s t o r y9798 % convergence check99 i f ( norm(u) <= norm(p)/1 e3 && min(p) > -5*eps && ...

100 norm(v) <= norm(q)/1 e3 && min(q) > -5*eps )101 break;102 end103104 % f i g u r e (101) ; c l f ; p l o t ( t , y , ’ o ’ , t ,X∗a ) ; drawnow ;105106 end107108 cvg_hst = cvg_hst (: ,1: iter );109110 % f i g u r e (102) ; c l f ; p l o t ( [ 1 : i t e r ] , c v g h s t ( 1 : n , : ) ’ , ’−o ’ ) ; drawnow ; %pause ( 1 ) ;111112 % −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− L 1 f i t113 % H.P. Gavin , 2013−10−04




PronyFit test.m1 % −−− P r o n y F i t t e s t .m2 % t e s t prony s e r i e s f i t t i n g with L1 r e g u l a r i z a t i o n34 clear a l l56 % simula ted noisy data78 % a r e l a t i v e l y smal l number o f ’ true ’ time l a g s9 tau = 1e -2*[ 0.01 0.1 1 10 100 1000 ]’;

10 ko = 1; % s t a t i c s t i f f n e s s11 k = 0.5 ./ tau .ˆ0.2; % ’ true ’ Prony s e r i e s c o e f f i c i e n t s1213 points = 100;14 f_dat = logspace(-3,3, points )’;15 iw = 2* pi* f_dat * sqrt ( -1.0);1617 % generate complex modulus data with some a d d i t i v e noise and f r i c t i o n e f f e c t s18 T = [ ones(points ,1) , iw*tau ’ ./ (iw*tau ’ + 1.0) ];19 G_dat = T * [ ko ; k ] + 5e -2*(randn(points ,1) + i*randn(points ,1));2021 friction = 0.1; % d e f i n a t e l y NOT l i n e a r v i s c o e l a s t i c i t y !22 G_dat = real ( G_dat ) + sqrt ( -1)*( friction + imag( G_dat ));2324 % p l o t the s imula ted data25 figure (1)26 c l f27 % semi logx ( f , r e a l (G) , ’ or ’ , f , imag (G) , ’ ob ’ )28 subplot (311)29 semilogx(f_dat , real ( G_dat ),’or ’)30 ylabel (’storage modulus ’)31 subplot (312)32 semilogx(f_dat ,imag( G_dat ),’ob ’)33 ylabel (’loss modulus ’)34 subplot (313)35 semilogx(f_dat ,imag( G_dat )./ real ( G_dat ),’ok ’)36 ylabel (’tan \ delta ’)37 xlabel (’frequency , f_{dat}, Hz ’)3839 % number o f terms in the Prony s e r i e s40 nID = 97;4142 tau_id = logspace ( -4.5 ,2 , nID )’; % s p e c i f i e d s e t o f time lags , tau4344 alpha = 0.50; % l 1 r e g u l a r i z a t i o n parameter4546 [ko_id ,k_id , cvg_hst ] = PronyFit (G_dat ,f_dat ,tau_id , alpha ); % do i t !4748 f_fit = logspace ( -4 ,4 ,100);49 iw_fit = 2* pi* f_fit * sqrt ( -1.0);50 G_fit = ko_id + sum( (k_id .* tau_id * iw_fit ) ./ ( tau_id * iw_fit + 1) );




PronyFit.m1 function [ko ,k, cvg_hst ] = PronyFit (G_dat ,f_dat ,tau , alpha )2 % [ ko , k , c v g h s t ] = p r o n y f i t ( G dat , f d a t , tau , a lpha )3 %4 % Fit a prony s e r i e s to frequency domain complex modulus data5 %6 % INPUT: G dat . . . complex modulus data Mx1 complex v e c t o r7 % f d a t . . . f r e q u e n c i e s where the measured frequency response data i s8 % ev a lu a te d (Hz) , ( f >0) Mx1 r e a l v e c t o r9 % tau . . . s p e c i f i e d s e t o f r e l a x a t i o n times , Nx1 r e a l v e c t o r

10 % alpha . . . l 1 r e g u l a r i z a t i o n f a c t o r f o r Prony s e r i e s c o e f f i c i e n t s11 %12 % OUTPUT: ko , k . . . Prony s e r i e s c o e f f i c i e n t s13 % c v g h s t . . . convergence h i s t o r y141516 i = sqrt ( -1.0);1718 m = length( f_dat ); % number o f data p o i n t s19 w = 2* pi * f_dat ; % one−s ided frequency2021 T = [ ones(m ,1) , i*w*tau ’ ./ (i*w*tau ’ + 1.0) ]; % des ign matrix2223 % p l o t the b a s i s f u n c t i o n s ( columns o f T) with s i n g l e−s ided log−s c a l e f r e q u e n c i e s24 figure (101)25 semilogx(f_dat , real (T), ’-r’, f_dat , imag(T),’-b’)26 ylabel (’basis functions T_k (\ omega ) = i \ omega \ tau_k / (i \ omega \ tau_k + 1) ’)27 xlabel (’frequency , f, Hz ’)28 axis ( [ 0.5*min( f_dat ) 1.5*max( f_dat ) -0.05 1.05 ] )2930 [m,n] = s ize (T); % m = number o f data p o i n t s ; n = number o f parameters3132 TtT = 2* real (T ’*T); % t a k i n g r e a l par t i s l i k e adding complex conjugate33 TtG = 2* real (T ’* G_dat );3435 k = TtT \ TtG; % O.L. S . f i t i s a b e t t e r i n i t i a l guess , even though i n f e a s i b l e , k<03637 MaxIter = 20; % u s u a l l y enough38 cvg_hst = zeros (2*n, MaxIter ); % convergence o f c o e f f i c i e n t s and m u l t i p l i e r s3940 for iter = 1: MaxIter % −−− S t a r t Main Loop4142 A = find (k < 2*eps); l = length(A); % a c t i v e s e t f o r update c o n s t r a i n t s43 % Ia i s the c o n s t r a i n t g r a d i e n t w. r . t . the s t e p v e c t o r h44 Ia = zeros(l,n); for i=1:l, Ia(i,A(i)) = 1; end4546 % KKT equat ions47 lambda = zeros(n ,1); % Lagrange m u l t i p l i e r48 XTX = [ TtT Ia ’ ; % dL / dh49 Ia zeros(l,l) ]; % dL / d lambda5051 XTY = [ -TtT*k + TtG - alpha ; % dL / dh52 -k(A) ]; % dL / d lambda5354 h_lambda = XTX \ XTY; % s o l v e the system5556 %

5758 h = h_lambda (1:n); % the s t e p59 lambda (A) = h_lambda (n+1:end); % the non−zero m u l t i p l i e r s6061 % i f an element o f k+h becomes nege t i ve , reduce the l e n g t h o f s t e p h to dh∗h62 dh = 1;63 [ h_test_min , idx] = min(k+h);64 i f h_test_min < 065 dh = -k(idx )/h(idx );66 end67




68 k = (k + dh * h); % update the Prony c o e f f i c i e n t s6970 figure (102) % p l o t msmnt p o i n t s and f i t71 G_hat = T*k;72 c l f73 subplot (211)74 semilogx ( w, real ( G_dat ),’or ’, w, real ( G_hat ),’-k’)75 ylabel (’storage modulus , G’’(\ omega ) ’)76 legend(’G’’(\ omega ) meas.’,’G’’(\ omega ) fit ’,’location ’,’northwest ’)77 subplot (212)78 semilogx ( w,imag( G_dat ),’ob ’, w,imag( G_hat ),’-k’)79 xlabel (’\omega , rad/s’)80 ylabel (’loss modulus , G"(\ omega ) ’)81 legend(’G"(\ omega ) meas.’,’G"(\ omega ) fit ’,’location ’,’northwest ’)82 drawnow8384 cvg_hst (:, iter) = [ k ; lambda ]; % convergence h i s t o r y85 i f ( norm(h) < norm(k)/1 e3 && min(k) > -5*eps ) % convergence check86 break;87 end8889 end % −−− End Main Loop9091 cvg_hst = cvg_hst (: ,1: iter );9293 % asymptot ic standard e r r o r s o f the Prony c o e f f i c i e n t s ???94 % k s t d e r r = s q r t (norm( G dat−T∗k )/(m−n+1) ∗ diag ( r e a l ( inv (TtTT ) ) ) ) ;9596 ko = k(1);97 k = k(2:n);9899 % −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− PRONY FIT

100 % HP Gavin , 2013−10−04



ols with and regularization - duke...

Documents