Neural Networks
-
The LMS Algorithm
The LMS Objective Function.
Global solution.
Pseudoinverse of a matrix.
Optimization (learning) by gradient descent.
LMS or Widrow-Hoff Algorithms.
Convergence of the Batch LMS Rule.
Compiled by LaTeX at 3:50 PM on February 27, 2013.
-
LMS Objective Function
Again we consider the problem of programming a linear threshold function
[Figure: a linear threshold unit with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, bias $w_0$, and output $y$.]
\[
y = \operatorname{sgn}(w^T x + w_0) =
\begin{cases}
+1, & \text{if } w^T x + w_0 > 0,\\
-1, & \text{if } w^T x + w_0 \le 0,
\end{cases}
\]
so that it agrees with a given dichotomy of $m$ feature vectors,
\[
X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},
\]
where $x_i \in \mathbb{R}^n$ and $\ell_i \in \{-1, +1\}$, for $i = 1, 2, \ldots, m$.
-
LMS Objective Function (cont.)

Thus, given a dichotomy
\[
X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},
\]
we seek a solution weight vector $w$ and bias $w_0$ such that
\[
\operatorname{sgn}(w^T x_i + w_0) = \ell_i,
\]
for $i = 1, 2, \ldots, m$. Equivalently, using homogeneous coordinates (or augmented feature vectors),
\[
\operatorname{sgn}(\hat{w}^T \hat{x}_i) = \ell_i,
\]
for $i = 1, 2, \ldots, m$, where $\hat{x}_i = (1, x_i^T)^T$. Using normalized coordinates,
\[
\operatorname{sgn}(\hat{w}^T \hat{x}'_i) = 1, \quad \text{or} \quad \hat{w}^T \hat{x}'_i > 0,
\]
for $i = 1, 2, \ldots, m$, where $\hat{x}'_i = \ell_i \hat{x}_i$.
-
LMS Objective Function (cont.)
Alternatively, let $b \in \mathbb{R}^m$ satisfy $b_i > 0$ for $i = 1, 2, \ldots, m$. We call $b$ a margin vector. Frequently we will assume that $b_i = 1$. Our learning criterion is certainly satisfied if
\[
\hat{w}^T \hat{x}'_i = b_i
\]
for $i = 1, 2, \ldots, m$. (Note that this condition is sufficient, but not necessary.) Equivalently, the above is satisfied if the expression
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{w}^T \hat{x}'_i \right)^2
\]
equals zero. We call this expression the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize the above by dividing the right side by $m$.)
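As a concrete illustration, here is a minimal NumPy sketch of evaluating this objective. The function name and array layout are our own conventions, not from the slides: the rows of X_prime are assumed to hold the normalized, augmented feature vectors $\hat{x}'_i$.

import numpy as np

def lms_objective(w_hat, X_prime, b):
    # E(w) = sum_i (b_i - w^T x'_i)^2, with each x'_i stored as a row of X_prime.
    residuals = b - X_prime @ w_hat
    return float(residuals @ residuals)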
-
LMS Objective Function: Remarks
Converting the original problem of classification (satisfying a system of inequalities) into one of optimization is somewhat ad hoc, and there is no guarantee that we can find a $\hat{w}^\star \in \mathbb{R}^{n+1}$ that satisfies $E(\hat{w}^\star) = 0$. However, this conversion can lead to practical compromises if the original inequalities possess inconsistencies.
Also, there is no unique objective function. The LMS expression,
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{w}^T \hat{x}'_i \right)^2,
\]
can be replaced by numerous candidates, e.g.
\[
\sum_{i=1}^{m} \left| b_i - \hat{w}^T \hat{x}'_i \right|, \qquad
\sum_{i=1}^{m} \left( 1 - \operatorname{sgn}(\hat{w}^T \hat{x}'_i) \right), \qquad
\sum_{i=1}^{m} \left( 1 - \operatorname{sgn}(\hat{w}^T \hat{x}'_i) \right)^2, \quad \text{etc.}
\]
However, minimizing $E(\hat{w})$ to find $\hat{w}^\star$ is generally an easy task.
-
Minimizing the LMS Objective Function

Inspection suggests that the LMS objective function
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{w}^T \hat{x}'_i \right)^2
\]
describes a parabolic function. It may have a unique global minimum, or an infinite number of global minima which occupy a connected linear set. (The latter can occur if $m < n + 1$.) Letting
\[
X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix}
  = \begin{pmatrix}
      \ell_1 & \hat{x}'_{1,1} & \hat{x}'_{1,2} & \cdots & \hat{x}'_{1,n} \\
      \ell_2 & \hat{x}'_{2,1} & \hat{x}'_{2,2} & \cdots & \hat{x}'_{2,n} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      \ell_m & \hat{x}'_{m,1} & \hat{x}'_{m,2} & \cdots & \hat{x}'_{m,n}
    \end{pmatrix} \in \mathbb{R}^{m \times (n+1)},
\]
then
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{x}'^T_i \hat{w} \right)^2
= \left\| \begin{pmatrix} b_1 - \hat{x}'^T_1 \hat{w} \\ b_2 - \hat{x}'^T_2 \hat{w} \\ \vdots \\ b_m - \hat{x}'^T_m \hat{w} \end{pmatrix} \right\|^2
= \left\| b - \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} \hat{w} \right\|^2
= \| b - X \hat{w} \|^2.
\]
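To make the construction of $X$ concrete, here is a short NumPy sketch; build_X is a hypothetical helper of our own naming, not something the slides define.

import numpy as np

def build_X(features, labels):
    # features: (m, n) array of the x_i; labels: (m,) array of l_i in {-1, +1}.
    # Row i of the result is l_i * (1, x_i^T), so X has shape (m, n+1).
    m = features.shape[0]
    augmented = np.hstack([np.ones((m, 1)), features])
    return labels[:, None] * augmented

# E(w) = ||b - X w||^2 is then an ordinary least-squares residual:
# np.sum((b - X @ w_hat) ** 2)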
-
Minimizing the LMS Objective Function (cont.)
\begin{align*}
E(\hat{w}) &= \| b - X\hat{w} \|^2 \\
&= (b - X\hat{w})^T (b - X\hat{w}) \\
&= (b^T - \hat{w}^T X^T)(b - X\hat{w}) \\
&= \hat{w}^T X^T X \hat{w} - \hat{w}^T X^T b - b^T X \hat{w} + b^T b \\
&= \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \| b \|^2.
\end{align*}
As an aside, note that
\[
X^T X = (\hat{x}'_1, \ldots, \hat{x}'_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix}
= \sum_{i=1}^{m} \hat{x}'_i \hat{x}'^T_i \in \mathbb{R}^{(n+1) \times (n+1)},
\]
\[
b^T X = (b_1, \ldots, b_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix}
= \sum_{i=1}^{m} b_i \hat{x}'^T_i \in \mathbb{R}^{n+1}.
\]
-
Minimizing the LMS Objective Function (cont.)

Okay, so how do we minimize
\[
E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \| b \|^2?
\]
Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$ and algebraically determine a value $\hat{w}^\star$ which makes each component vanish. That is, solve
\[
\nabla E(\hat{w}) = \begin{pmatrix} \dfrac{\partial E}{\partial \hat{w}_0} \\[1ex] \dfrac{\partial E}{\partial \hat{w}_1} \\ \vdots \\ \dfrac{\partial E}{\partial \hat{w}_n} \end{pmatrix} = 0.
\]
It is straightforward to show that
\[
\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b.
\]
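This gradient formula is easy to sanity-check numerically. The sketch below compares it against central finite differences on small random data; the function names and the test data are our own, not from the slides.

import numpy as np

def lms_gradient(w_hat, X, b):
    # grad E = 2 X^T X w - 2 X^T b = 2 X^T (X w - b).
    return 2.0 * X.T @ (X @ w_hat - b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
b = np.ones(5)
w = rng.normal(size=3)

E = lambda v: float(np.sum((b - X @ v) ** 2))
eps = 1e-6
numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, lms_gradient(w, X, b), atol=1e-5)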
-
Minimizing the LMS Objective Function (cont.)
Thus,
\[
\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b = 0
\]
if
\[
\hat{w}^\star = (X^T X)^{-1} X^T b = X^\dagger b,
\]
where the matrix
\[
X^\dagger \stackrel{\text{def}}{=} (X^T X)^{-1} X^T \in \mathbb{R}^{(n+1) \times m}
\]
is called the pseudoinverse of $X$. If $X^T X$ is singular, one defines
\[
X^\dagger \stackrel{\text{def}}{=} \lim_{\epsilon \to 0} (X^T X + \epsilon I)^{-1} X^T.
\]
Observe that if $X^T X$ is nonsingular,
\[
X^\dagger X = (X^T X)^{-1} X^T X = I.
\]
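In NumPy this global minimizer can be sketched via the normal equations, with the epsilon-regularized variant mirroring the singular-case definition above; the function name is our own. Alternatively, np.linalg.pinv(X) @ b computes the pseudoinverse solution via the SVD.

import numpy as np

def lms_global_minimizer(X, b, eps=0.0):
    # w* = (X^T X + eps I)^{-1} X^T b; a small eps > 0 handles singular X^T X.
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(k), X.T @ b)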
-
Example
The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241. Given the dichotomy
\[
X_4 = \left\{ \big( (1,2)^T, 1 \big),\; \big( (2,0)^T, 1 \big),\; \big( (3,1)^T, -1 \big),\; \big( (2,3)^T, -1 \big) \right\},
\]
we obtain
\[
X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{pmatrix}
  = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}.
\]
Whence,
\[
X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix},
\quad \text{and} \quad
X^\dagger = (X^T X)^{-1} X^T
= \begin{pmatrix}
    \tfrac{5}{4} & \tfrac{13}{12} & \tfrac{3}{4} & \tfrac{7}{12} \\[0.5ex]
    -\tfrac{1}{2} & -\tfrac{1}{6} & -\tfrac{1}{2} & -\tfrac{1}{6} \\[0.5ex]
    0 & -\tfrac{1}{3} & 0 & -\tfrac{1}{3}
  \end{pmatrix}.
\]
-
Example (cont.)
Letting $b = (1, 1, 1, 1)^T$, then
\[
\hat{w} = X^\dagger b
= \begin{pmatrix}
    \tfrac{5}{4} & \tfrac{13}{12} & \tfrac{3}{4} & \tfrac{7}{12} \\[0.5ex]
    -\tfrac{1}{2} & -\tfrac{1}{6} & -\tfrac{1}{2} & -\tfrac{1}{6} \\[0.5ex]
    0 & -\tfrac{1}{3} & 0 & -\tfrac{1}{3}
  \end{pmatrix}
  \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} \tfrac{11}{3} \\[0.5ex] -\tfrac{4}{3} \\[0.5ex] -\tfrac{2}{3} \end{pmatrix}.
\]
Whence,
\[
w_0 = \tfrac{11}{3} \quad \text{and} \quad w = \left( -\tfrac{4}{3},\; -\tfrac{2}{3} \right)^T.
\]
[Figure: the four feature vectors plotted in the $(x_1, x_2)$ plane, together with the resulting decision boundary $w^T x + w_0 = 0$.]
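The worked example is easy to reproduce numerically; the short NumPy check below is our own script, not part of the original slides.

import numpy as np

X = np.array([[ 1.,  1.,  2.],    # l = +1, x = (1, 2)
              [ 1.,  2.,  0.],    # l = +1, x = (2, 0)
              [-1., -3., -1.],    # l = -1, x = (3, 1)
              [-1., -2., -3.]])   # l = -1, x = (2, 3)
b = np.ones(4)

w_hat = np.linalg.pinv(X) @ b
print(w_hat)        # approximately (11/3, -4/3, -2/3)
print(X @ w_hat)    # equals b exactly here, so E(w_hat) = 0
                    # and every sample is classified correctly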
-
Method of Steepest Descent
An alternative approach is the method of steepest descent.
We begin by recalling Taylor's theorem for functions of more than one variable: let $x \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$, so
\[
f(x) = f(x_1, x_2, \ldots, x_n) \in \mathbb{R}.
\]
Now let $\Delta x \in \mathbb{R}^n$, and consider
\[
f(x + \Delta x) = f(x_1 + \Delta x_1, \ldots, x_n + \Delta x_n).
\]
Define $F : \mathbb{R} \to \mathbb{R}$, such that
\[
F(s) = f(x + s\, \Delta x).
\]
Thus,
\[
F(0) = f(x), \quad \text{and} \quad F(1) = f(x + \Delta x).
\]
-
Method of Steepest Descent (cont.)

Taylor's theorem for a single variable (Math 21/22):
\[
F(s) = F(0) + \frac{1}{1!} F'(0)\, s + \frac{1}{2!} F''(0)\, s^2 + \frac{1}{3!} F'''(0)\, s^3 + \cdots.
\]
Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \Delta x)$, $F(0)$ by $f(x)$, etc. To evaluate $F'(0)$ we will invoke the multivariate chain rule, e.g.,
\[
\frac{d}{ds} f\big( u(s), v(s) \big) = \frac{\partial f}{\partial u}(u, v)\, u'(s) + \frac{\partial f}{\partial v}(u, v)\, v'(s).
\]
Thus,
\begin{align*}
F'(s) = \frac{dF}{ds}(s)
&= \frac{d}{ds} f(x_1 + s\, \Delta x_1, \ldots, x_n + s\, \Delta x_n) \\
&= \frac{\partial f}{\partial x_1}(x + s\, \Delta x)\, \frac{d}{ds}(x_1 + s\, \Delta x_1) + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \Delta x)\, \frac{d}{ds}(x_n + s\, \Delta x_n) \\
&= \frac{\partial f}{\partial x_1}(x + s\, \Delta x)\, \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \Delta x)\, \Delta x_n.
\end{align*}
-
Method of Steepest Descent (cont.)
Thus,
\[
F'(0) = \frac{\partial f}{\partial x_1}(x)\, \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\, \Delta x_n = \nabla f(x)^T \Delta x.
\]
-
Method of Steepest Descent (cont.)
Thus, it is possible to show
\[
f(x + \Delta x) = f(x) + \nabla f(x)^T \Delta x + O\big( \| \Delta x \|^2 \big)
= f(x) + \| \nabla f(x) \| \, \| \Delta x \| \cos\theta + O\big( \| \Delta x \|^2 \big),
\]
where $\theta$ defines the angle between $\nabla f(x)$ and $\Delta x$. If $\| \Delta x \| \ll 1$, then
\[
\Delta f = f(x + \Delta x) - f(x) \approx \| \nabla f(x) \| \, \| \Delta x \| \cos\theta.
\]
Thus, the greatest reduction $\Delta f$ occurs if $\cos\theta = -1$, that is, if $\Delta x = -\eta\, \nabla f$, where $\eta > 0$. We thus seek a local minimum of the LMS objective function by taking a sequence of steps
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big( \hat{w}(t) \big).
\]
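A generic sketch of this iteration in NumPy, under our own naming and with a fixed step count rather than a convergence test (in practice one would also monitor the objective):

import numpy as np

def steepest_descent(grad, w0, eta, steps):
    # Iterate w(t+1) = w(t) - eta * grad(w(t)) for a fixed number of steps.
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = w - eta * grad(w)
    return w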
-
Training an LTU using Steepest Descent

We now return to our original problem. Given a dichotomy
\[
X_m = \{ (x_1, \ell_1), \ldots, (x_m, \ell_m) \}
\]
of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, 1\}$ for $i = 1, \ldots, m$, we construct the set of normalized, augmented feature vectors
\[
\hat{X}'_m = \{ (\ell_i, \ell_i x_{i,1}, \ldots, \ell_i x_{i,n})^T \in \mathbb{R}^{n+1} \mid i = 1, \ldots, m \}.
\]
Given a margin vector $b \in \mathbb{R}^m$, with $b_i > 0$ for $i = 1, \ldots, m$, we construct the LMS objective function
\[
E(\hat{w}) = \frac{1}{2} \sum_{i=1}^{m} \left( \hat{w}^T \hat{x}'_i - b_i \right)^2 = \frac{1}{2} \| X \hat{w} - b \|^2
\]
(the factor of $\frac{1}{2}$ is added with some foresight), and then evaluate its gradient
\[
\nabla E(\hat{w}) = \sum_{i=1}^{m} \left( \hat{w}^T \hat{x}'_i - b_i \right) \hat{x}'_i = X^T ( X \hat{w} - b ).
\]
-
Training an LTU using Steepest Descent (cont.)
Substitution into the steepest descent update rule,
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big( \hat{w}(t) \big),
\]
yields the batch LMS update rule,
\[
\hat{w}(t+1) = \hat{w}(t) + \eta \sum_{i=1}^{m} \big( b_i - \hat{w}(t)^T \hat{x}'_i \big)\, \hat{x}'_i.
\]
Alternatively, one can abstract the sequential LMS, or Widrow-Hoff, rule from the above:
\[
\hat{w}(t+1) = \hat{w}(t) + \eta \big( b - \hat{w}(t)^T \hat{x}'(t) \big)\, \hat{x}'(t),
\]
where $\hat{x}'(t) \in \hat{X}'_m$ is the element of the dichotomy that is presented to the LTU at epoch $t$. (Here, we assume that $b$ is fixed; otherwise, replace it by $b(t)$.)
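Both rules are straightforward to implement. A minimal NumPy sketch, with function names of our own choosing (eta, the epoch counts, and the zero initialization are assumptions, not prescribed by the slides):

import numpy as np

def batch_lms(X, b, eta, epochs):
    # Batch rule: w += eta * sum_i (b_i - w^T x'_i) x'_i = eta * X^T (b - X w).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w += eta * X.T @ (b - X @ w)
    return w

def sequential_lms(X, b, eta, epochs):
    # Widrow-Hoff rule: update after each presented sample x'(t).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, b_i in zip(X, b):
            w += eta * (b_i - w @ x_i) * x_i
    return w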
-
Sequential LMS Rule
The sequential LMS rule
\[
\hat{w}(t+1) = \hat{w}(t) + \eta \big[ b - \hat{w}(t)^T \hat{x}'(t) \big] \hat{x}'(t),
\]
resembles the sequential perceptron rule,
\[
\hat{w}(t+1) = \hat{w}(t) + \frac{\eta}{2} \big[ 1 - \operatorname{sgn}\big( \hat{w}(t)^T \hat{x}'(t) \big) \big] \hat{x}'(t).
\]
Sequential rules are well suited to real-time implementations, as only the current values of the weights, i.e., the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.
-
Convergence of the batch LMS rule
Recall, the LMS objective function in the form
\[
E(\hat{w}) = \frac{1}{2} \| X \hat{w} - b \|^2
\]
has as its gradient
\[
\nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T ( X \hat{w} - b ).
\]
Substitution into the rule of steepest descent,
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big( \hat{w}(t) \big),
\]
yields
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, X^T \big( X \hat{w}(t) - b \big).
\]
-
Convergence of the batch LMS rule (cont.)
The algorithm is said to converge to a fixed point $\hat{w}^\star$ if, for every finite initial value $\hat{w}(0)$, $\hat{w}(t) \to \hat{w}^\star$ as $t \to \infty$. Note that $\hat{w}^\star = X^\dagger b$ is a fixed point of the update rule, since $\nabla E(\hat{w}^\star) = X^T ( X \hat{w}^\star - b ) = 0$. Subtracting $\hat{w}^\star$ from both sides of the update rule and writing $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star$ gives
\[
\Delta\hat{w}(t+1) = \big( I - \eta\, X^T X \big)\, \Delta\hat{w}(t).
\]
-
Convergence of the batch LMS rule (cont.)
Convergence $\hat{w}(t) \to \hat{w}^\star$ occurs if $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star \to 0$. Thus we require that
\[
\| \Delta\hat{w}(t+1) \| < \| \Delta\hat{w}(t) \|.
\]
Inspecting the update rule,
\[
\Delta\hat{w}(t+1) = \big( I - \eta\, X^T X \big)\, \Delta\hat{w}(t),
\]
this reduces to the condition that all the eigenvalues of $I - \eta\, X^T X$ have magnitudes less than 1. We will now evaluate the eigenvalues of the above matrix.
-
Convergence of the batch LMS rule (cont.)
Let $S \in \mathbb{R}^{(n+1) \times (n+1)}$ denote the similarity transform that reduces the symmetric matrix $X^T X$ to a diagonal matrix $\Lambda \in \mathbb{R}^{(n+1) \times (n+1)}$. Thus, $S^T S = S S^T = I$, and
\[
S\, (X^T X)\, S^T = \Lambda = \operatorname{diag}(\lambda_0, \lambda_1, \ldots, \lambda_n).
\]
The numbers $\lambda_0, \lambda_1, \ldots, \lambda_n$ represent the eigenvalues of $X^T X$. Note that $0 \le \lambda_i$ for $i = 0, 1, \ldots, n$. Thus,
\[
S\, \Delta\hat{w}(t+1) = S \big( I - \eta\, X^T X \big) S^T S\, \Delta\hat{w}(t) = \big( I - \eta\, \Lambda \big)\, S\, \Delta\hat{w}(t).
\]
Thus convergence occurs if
\[
\| S\, \Delta\hat{w}(t+1) \| < \| S\, \Delta\hat{w}(t) \|,
\]
which occurs if the eigenvalues of $I - \eta\, \Lambda$ all have magnitudes less than one.
-
Convergence of the batch LMS rule (cont.)
The eigenvalues of $I - \eta\, \Lambda$ equal $1 - \eta\, \lambda_i$, for $i = 0, 1, \ldots, n$. (These are in fact the eigenvalues of $I - \eta\, X^T X$.) Thus, we require that
\[
-1 < 1 - \eta\, \lambda_i < 1, \quad \text{or} \quad 0 < \eta\, \lambda_i < 2,
\]
for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that
\[
0 < \eta < \frac{2}{\lambda_{\max}}.
\]
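For the running example, this bound is easy to compute; the check below is our own, with numpy.linalg.eigvalsh chosen because $X^T X$ is symmetric.

import numpy as np

X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])

lam_max = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of X^T X
print(2.0 / lam_max)   # batch LMS converges for any eta below this value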