Neural Networks
-
The LMS Algorithm
The LMS Objective Function.
Global solution.
Pseudoinverse of a matrix.
Optimization (learning) by gradient descent.
LMS or Widrow-Hoff Algorithms.
Convergence of the Batch LMS Rule.
Compiled by LaTeX at 3:50 PM on February 27, 2013.
-
LMS Objective Function
Again we consider the problem of programming a linear threshold function
[Figure: a linear threshold unit with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, bias $w_0$, and output $y$.]
\[
y = \operatorname{sgn}(w^T x + w_0) =
\begin{cases}
+1, & \text{if } w^T x + w_0 > 0,\\
-1, & \text{if } w^T x + w_0 \le 0,
\end{cases}
\]
so that it agrees with a given dichotomy of $m$ feature vectors,
\[
X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},
\]
where $x_i \in \mathbb{R}^n$ and $\ell_i \in \{-1, +1\}$, for $i = 1, 2, \ldots, m$.
-
LMS Objective Function (cont.)

Thus, given a dichotomy
\[
X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},
\]
we seek a solution weight vector $w$ and bias $w_0$ such that
\[
\operatorname{sgn}(w^T x_i + w_0) = \ell_i,
\]
for $i = 1, 2, \ldots, m$. Equivalently, using homogeneous coordinates (or augmented feature vectors),
\[
\operatorname{sgn}(\hat{w}^T \hat{x}_i) = \ell_i,
\]
for $i = 1, 2, \ldots, m$, where $\hat{x}_i = (1, x_i^T)^T$. Using normalized coordinates,
\[
\operatorname{sgn}(\hat{w}^T \hat{x}'_i) = 1, \quad \text{or} \quad \hat{w}^T \hat{x}'_i > 0,
\]
for $i = 1, 2, \ldots, m$, where $\hat{x}'_i = \ell_i \hat{x}_i$.
-
LMS Objective Function (cont.)
Alternatively, let $b \in \mathbb{R}^m$ satisfy $b_i > 0$ for $i = 1, 2, \ldots, m$. We call $b$ a margin vector. Frequently we will assume that $b_i = 1$. Our learning criterion is certainly satisfied if
\[
\hat{w}^T \hat{x}'_i = b_i
\]
for $i = 1, 2, \ldots, m$. (Note that this condition is sufficient, but not necessary.) Equivalently, the above is satisfied if the expression
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{w}^T \hat{x}'_i \right)^2
\]
equals zero. We call this expression the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize the above by dividing the right side by $m$.)
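As a concrete illustration, here is a minimal NumPy sketch of evaluating this objective. The function name and array layout are our own conventions, not from the slides: the rows of X_prime are assumed to hold the normalized, augmented feature vectors $\hat{x}'_i$.

import numpy as np

def lms_objective(w_hat, X_prime, b):
    # E(w) = sum_i (b_i - w^T x'_i)^2, with each x'_i stored as a row of X_prime.
    residuals = b - X_prime @ w_hat
    return float(residuals @ residuals)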
-
LMS Objective Function: Remarks
Converting the original problem of classification (satisfying a system of inequalities) into one of optimization is somewhat ad hoc, and there is no guarantee that we can find a $\hat{w}^\star \in \mathbb{R}^{n+1}$ that satisfies $E(\hat{w}^\star) = 0$. However, this conversion can lead to practical compromises if the original inequalities possess inconsistencies.
Also, there is no unique objective function. The LMS expression,
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{w}^T \hat{x}'_i \right)^2,
\]
can be replaced by numerous candidates, e.g.
\[
\sum_{i=1}^{m} \left| b_i - \hat{w}^T \hat{x}'_i \right|, \qquad
\sum_{i=1}^{m} \left( 1 - \operatorname{sgn}(\hat{w}^T \hat{x}'_i) \right), \qquad
\sum_{i=1}^{m} \left( 1 - \operatorname{sgn}(\hat{w}^T \hat{x}'_i) \right)^2, \quad \text{etc.}
\]
However, minimizing $E(\hat{w})$ to find $\hat{w}^\star$ is generally an easy task.
-
Minimizing the LMS Objective Function

Inspection suggests that the LMS objective function
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{w}^T \hat{x}'_i \right)^2
\]
describes a parabolic function. It may have a unique global minimum, or an infinite number of global minima which occupy a connected linear set. (The latter can occur if $m < n + 1$.) Letting
\[
X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix}
  = \begin{pmatrix}
      \ell_1 & \hat{x}'_{1,1} & \hat{x}'_{1,2} & \cdots & \hat{x}'_{1,n} \\
      \ell_2 & \hat{x}'_{2,1} & \hat{x}'_{2,2} & \cdots & \hat{x}'_{2,n} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      \ell_m & \hat{x}'_{m,1} & \hat{x}'_{m,2} & \cdots & \hat{x}'_{m,n}
    \end{pmatrix} \in \mathbb{R}^{m \times (n+1)},
\]
then
\[
E(\hat{w}) = \sum_{i=1}^{m} \left( b_i - \hat{x}'^T_i \hat{w} \right)^2
= \left\| \begin{pmatrix} b_1 - \hat{x}'^T_1 \hat{w} \\ b_2 - \hat{x}'^T_2 \hat{w} \\ \vdots \\ b_m - \hat{x}'^T_m \hat{w} \end{pmatrix} \right\|^2
= \left\| b - \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} \hat{w} \right\|^2
= \| b - X \hat{w} \|^2.
\]
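To make the construction of $X$ concrete, here is a short NumPy sketch; build_X is a hypothetical helper of our own naming, not something the slides define.

import numpy as np

def build_X(features, labels):
    # features: (m, n) array of the x_i; labels: (m,) array of l_i in {-1, +1}.
    # Row i of the result is l_i * (1, x_i^T), so X has shape (m, n+1).
    m = features.shape[0]
    augmented = np.hstack([np.ones((m, 1)), features])
    return labels[:, None] * augmented

# E(w) = ||b - X w||^2 is then an ordinary least-squares residual:
# np.sum((b - X @ w_hat) ** 2)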
-
Minimizing the LMS Objective Function (cont.)
\begin{align*}
E(\hat{w}) &= \| b - X\hat{w} \|^2 \\
&= (b - X\hat{w})^T (b - X\hat{w}) \\
&= (b^T - \hat{w}^T X^T)(b - X\hat{w}) \\
&= \hat{w}^T X^T X \hat{w} - \hat{w}^T X^T b - b^T X \hat{w} + b^T b \\
&= \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \| b \|^2.
\end{align*}
As an aside, note that
\[
X^T X = (\hat{x}'_1, \ldots, \hat{x}'_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix}
= \sum_{i=1}^{m} \hat{x}'_i \hat{x}'^T_i \in \mathbb{R}^{(n+1) \times (n+1)},
\]
\[
b^T X = (b_1, \ldots, b_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix}
= \sum_{i=1}^{m} b_i \hat{x}'^T_i \in \mathbb{R}^{n+1}.
\]
-
Minimizing the LMS Objective Function (cont.)

Okay, so how do we minimize
\[
E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \| b \|^2?
\]
Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$ and algebraically determine a value $\hat{w}^\star$ which makes each component vanish. That is, solve
\[
\nabla E(\hat{w}) = \begin{pmatrix} \dfrac{\partial E}{\partial \hat{w}_0} \\[1ex] \dfrac{\partial E}{\partial \hat{w}_1} \\ \vdots \\ \dfrac{\partial E}{\partial \hat{w}_n} \end{pmatrix} = 0.
\]
It is straightforward to show that
\[
\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b.
\]
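This gradient formula is easy to sanity-check numerically. The sketch below compares it against central finite differences on small random data; the function names and the test data are our own, not from the slides.

import numpy as np

def lms_gradient(w_hat, X, b):
    # grad E = 2 X^T X w - 2 X^T b = 2 X^T (X w - b).
    return 2.0 * X.T @ (X @ w_hat - b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
b = np.ones(5)
w = rng.normal(size=3)

E = lambda v: float(np.sum((b - X @ v) ** 2))
eps = 1e-6
numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, lms_gradient(w, X, b), atol=1e-5)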
-
Minimizing the LMS Objective Function (cont.)
Thus,
\[
\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b = 0
\]
if
\[
\hat{w}^\star = (X^T X)^{-1} X^T b = X^\dagger b,
\]
where the matrix
\[
X^\dagger \stackrel{\text{def}}{=} (X^T X)^{-1} X^T \in \mathbb{R}^{(n+1) \times m}
\]
is called the pseudoinverse of $X$. If $X^T X$ is singular, one defines
\[
X^\dagger \stackrel{\text{def}}{=} \lim_{\epsilon \to 0} (X^T X + \epsilon I)^{-1} X^T.
\]
Observe that if $X^T X$ is nonsingular,
\[
X^\dagger X = (X^T X)^{-1} X^T X = I.
\]
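In NumPy this global minimizer can be sketched via the normal equations, with the epsilon-regularized variant mirroring the singular-case definition above; the function name is our own. Alternatively, np.linalg.pinv(X) @ b computes the pseudoinverse solution via the SVD.

import numpy as np

def lms_global_minimizer(X, b, eps=0.0):
    # w* = (X^T X + eps I)^{-1} X^T b; a small eps > 0 handles singular X^T X.
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(k), X.T @ b)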
-
Example
The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241. Given the dichotomy
\[
X_4 = \left\{ \big( (1,2)^T, 1 \big),\; \big( (2,0)^T, 1 \big),\; \big( (3,1)^T, -1 \big),\; \big( (2,3)^T, -1 \big) \right\},
\]
we obtain
\[
X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{pmatrix}
  = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}.
\]
Whence,
\[
X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix},
\quad \text{and} \quad
X^\dagger = (X^T X)^{-1} X^T
= \begin{pmatrix}
    \tfrac{5}{4} & \tfrac{13}{12} & \tfrac{3}{4} & \tfrac{7}{12} \\[0.5ex]
    -\tfrac{1}{2} & -\tfrac{1}{6} & -\tfrac{1}{2} & -\tfrac{1}{6} \\[0.5ex]
    0 & -\tfrac{1}{3} & 0 & -\tfrac{1}{3}
  \end{pmatrix}.
\]
-
Example (cont.)
Letting $b = (1, 1, 1, 1)^T$, then
\[
\hat{w} = X^\dagger b
= \begin{pmatrix}
    \tfrac{5}{4} & \tfrac{13}{12} & \tfrac{3}{4} & \tfrac{7}{12} \\[0.5ex]
    -\tfrac{1}{2} & -\tfrac{1}{6} & -\tfrac{1}{2} & -\tfrac{1}{6} \\[0.5ex]
    0 & -\tfrac{1}{3} & 0 & -\tfrac{1}{3}
  \end{pmatrix}
  \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} \tfrac{11}{3} \\[0.5ex] -\tfrac{4}{3} \\[0.5ex] -\tfrac{2}{3} \end{pmatrix}.
\]
Whence,
\[
w_0 = \tfrac{11}{3} \quad \text{and} \quad w = \left( -\tfrac{4}{3},\; -\tfrac{2}{3} \right)^T.
\]
[Figure: the four feature vectors plotted in the $(x_1, x_2)$ plane, together with the resulting decision boundary $w^T x + w_0 = 0$.]
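The worked example is easy to reproduce numerically; the short NumPy check below is our own script, not part of the original slides.

import numpy as np

X = np.array([[ 1.,  1.,  2.],    # l = +1, x = (1, 2)
              [ 1.,  2.,  0.],    # l = +1, x = (2, 0)
              [-1., -3., -1.],    # l = -1, x = (3, 1)
              [-1., -2., -3.]])   # l = -1, x = (2, 3)
b = np.ones(4)

w_hat = np.linalg.pinv(X) @ b
print(w_hat)        # approximately (11/3, -4/3, -2/3)
print(X @ w_hat)    # equals b exactly here, so E(w_hat) = 0
                    # and every sample is classified correctly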
-
Method of Steepest Descent
An alternative approach is the method of steepest descent.
We begin by recalling Taylor's theorem for functions of more than one variable: let $x \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$, so
\[
f(x) = f(x_1, x_2, \ldots, x_n) \in \mathbb{R}.
\]
Now let $\Delta x \in \mathbb{R}^n$, and consider
\[
f(x + \Delta x) = f(x_1 + \Delta x_1, \ldots, x_n + \Delta x_n).
\]
Define $F : \mathbb{R} \to \mathbb{R}$, such that
\[
F(s) = f(x + s\, \Delta x).
\]
Thus,
\[
F(0) = f(x), \quad \text{and} \quad F(1) = f(x + \Delta x).
\]
-
Method of Steepest Descent (cont.)

Taylor's theorem for a single variable (Math 21/22):
\[
F(s) = F(0) + \frac{1}{1!} F'(0)\, s + \frac{1}{2!} F''(0)\, s^2 + \frac{1}{3!} F'''(0)\, s^3 + \cdots.
\]
Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \Delta x)$, $F(0)$ by $f(x)$, etc. To evaluate $F'(0)$ we will invoke the multivariate chain rule, e.g.,
\[
\frac{d}{ds} f\big( u(s), v(s) \big) = \frac{\partial f}{\partial u}(u, v)\, u'(s) + \frac{\partial f}{\partial v}(u, v)\, v'(s).
\]
Thus,
\begin{align*}
F'(s) = \frac{dF}{ds}(s)
&= \frac{d}{ds} f(x_1 + s\, \Delta x_1, \ldots, x_n + s\, \Delta x_n) \\
&= \frac{\partial f}{\partial x_1}(x + s\, \Delta x)\, \frac{d}{ds}(x_1 + s\, \Delta x_1) + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \Delta x)\, \frac{d}{ds}(x_n + s\, \Delta x_n) \\
&= \frac{\partial f}{\partial x_1}(x + s\, \Delta x)\, \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \Delta x)\, \Delta x_n.
\end{align*}
-
Method of Steepest Descent (cont.)
Thus,
\[
F'(0) = \frac{\partial f}{\partial x_1}(x)\, \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\, \Delta x_n = \nabla f(x)^T \Delta x.
\]
-
Method of Steepest Descent (cont.)
Thus, it is possible to show
\[
f(x + \Delta x) = f(x) + \nabla f(x)^T \Delta x + O\big( \| \Delta x \|^2 \big)
= f(x) + \| \nabla f(x) \| \, \| \Delta x \| \cos\theta + O\big( \| \Delta x \|^2 \big),
\]
where $\theta$ defines the angle between $\nabla f(x)$ and $\Delta x$. If $\| \Delta x \| \ll 1$, then
\[
\Delta f = f(x + \Delta x) - f(x) \approx \| \nabla f(x) \| \, \| \Delta x \| \cos\theta.
\]
Thus, the greatest reduction $\Delta f$ occurs if $\cos\theta = -1$, that is, if $\Delta x = -\eta\, \nabla f$, where $\eta > 0$. We thus seek a local minimum of the LMS objective function by taking a sequence of steps
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big( \hat{w}(t) \big).
\]
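A generic sketch of this iteration in NumPy, under our own naming and with a fixed step count rather than a convergence test (in practice one would also monitor the objective):

import numpy as np

def steepest_descent(grad, w0, eta, steps):
    # Iterate w(t+1) = w(t) - eta * grad(w(t)) for a fixed number of steps.
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = w - eta * grad(w)
    return w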
-
Training an LTU using Steepest Descent

We now return to our original problem. Given a dichotomy
\[
X_m = \{ (x_1, \ell_1), \ldots, (x_m, \ell_m) \}
\]
of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, 1\}$ for $i = 1, \ldots, m$, we construct the set of normalized, augmented feature vectors
\[
\hat{X}'_m = \{ (\ell_i, \ell_i x_{i,1}, \ldots, \ell_i x_{i,n})^T \in \mathbb{R}^{n+1} \mid i = 1, \ldots, m \}.
\]
Given a margin vector $b \in \mathbb{R}^m$, with $b_i > 0$ for $i = 1, \ldots, m$, we construct the LMS objective function
\[
E(\hat{w}) = \frac{1}{2} \sum_{i=1}^{m} \left( \hat{w}^T \hat{x}'_i - b_i \right)^2 = \frac{1}{2} \| X \hat{w} - b \|^2
\]
(the factor of $\frac{1}{2}$ is added with some foresight), and then evaluate its gradient
\[
\nabla E(\hat{w}) = \sum_{i=1}^{m} \left( \hat{w}^T \hat{x}'_i - b_i \right) \hat{x}'_i = X^T ( X \hat{w} - b ).
\]
-
Training an LTU using Steepest Descent (cont.)
Substitution into the steepest descent update rule,
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big( \hat{w}(t) \big),
\]
yields the batch LMS update rule,
\[
\hat{w}(t+1) = \hat{w}(t) + \eta \sum_{i=1}^{m} \big( b_i - \hat{w}(t)^T \hat{x}'_i \big)\, \hat{x}'_i.
\]
Alternatively, one can abstract the sequential LMS, or Widrow-Hoff, rule from the above:
\[
\hat{w}(t+1) = \hat{w}(t) + \eta \big( b - \hat{w}(t)^T \hat{x}'(t) \big)\, \hat{x}'(t),
\]
where $\hat{x}'(t) \in \hat{X}'_m$ is the element of the dichotomy that is presented to the LTU at epoch $t$. (Here, we assume that $b$ is fixed; otherwise, replace it by $b(t)$.)
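Both rules are straightforward to implement. A minimal NumPy sketch, with function names of our own choosing (eta, the epoch counts, and the zero initialization are assumptions, not prescribed by the slides):

import numpy as np

def batch_lms(X, b, eta, epochs):
    # Batch rule: w += eta * sum_i (b_i - w^T x'_i) x'_i = eta * X^T (b - X w).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w += eta * X.T @ (b - X @ w)
    return w

def sequential_lms(X, b, eta, epochs):
    # Widrow-Hoff rule: update after each presented sample x'(t).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, b_i in zip(X, b):
            w += eta * (b_i - w @ x_i) * x_i
    return w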
-
Sequential LMS Rule
The sequential LMS rule
\[
\hat{w}(t+1) = \hat{w}(t) + \eta \big[ b - \hat{w}(t)^T \hat{x}'(t) \big] \hat{x}'(t),
\]
resembles the sequential perceptron rule,
\[
\hat{w}(t+1) = \hat{w}(t) + \frac{\eta}{2} \big[ 1 - \operatorname{sgn}\big( \hat{w}(t)^T \hat{x}'(t) \big) \big] \hat{x}'(t).
\]
Sequential rules are well suited to real-time implementations, as only the current values of the weights, i.e., the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.
-
Convergence of the batch LMS rule
Recall, the LMS objective function in the form
\[
E(\hat{w}) = \frac{1}{2} \| X \hat{w} - b \|^2
\]
has as its gradient
\[
\nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T ( X \hat{w} - b ).
\]
Substitution into the rule of steepest descent,
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\big( \hat{w}(t) \big),
\]
yields
\[
\hat{w}(t+1) = \hat{w}(t) - \eta\, X^T \big( X \hat{w}(t) - b \big).
\]
-
Convergence of the batch LMS rule (cont.)
The algorithm is said to converge to a fixed point $\hat{w}^\star$ if, for every finite initial value $\hat{w}(0)$, $\hat{w}(t) \to \hat{w}^\star$ as $t \to \infty$. Note that $\hat{w}^\star = X^\dagger b$ is a fixed point of the update rule, since $\nabla E(\hat{w}^\star) = X^T ( X \hat{w}^\star - b ) = 0$. Subtracting $\hat{w}^\star$ from both sides of the update rule and writing $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star$ gives
\[
\Delta\hat{w}(t+1) = \big( I - \eta\, X^T X \big)\, \Delta\hat{w}(t).
\]
-
Convergence of the batch LMS rule (cont.)
Convergence $\hat{w}(t) \to \hat{w}^\star$ occurs if $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star \to 0$. Thus we require that
\[
\| \Delta\hat{w}(t+1) \| < \| \Delta\hat{w}(t) \|.
\]
Inspecting the update rule,
\[
\Delta\hat{w}(t+1) = \big( I - \eta\, X^T X \big)\, \Delta\hat{w}(t),
\]
this reduces to the condition that all the eigenvalues of $I - \eta\, X^T X$ have magnitudes less than 1. We will now evaluate the eigenvalues of the above matrix.
-
Convergence of the batch LMS rule (cont.)
Let $S \in \mathbb{R}^{(n+1) \times (n+1)}$ denote the similarity transform that reduces the symmetric matrix $X^T X$ to a diagonal matrix $\Lambda \in \mathbb{R}^{(n+1) \times (n+1)}$. Thus, $S^T S = S S^T = I$, and
\[
S\, (X^T X)\, S^T = \Lambda = \operatorname{diag}(\lambda_0, \lambda_1, \ldots, \lambda_n).
\]
The numbers $\lambda_0, \lambda_1, \ldots, \lambda_n$ represent the eigenvalues of $X^T X$. Note that $0 \le \lambda_i$ for $i = 0, 1, \ldots, n$. Thus,
\[
S\, \Delta\hat{w}(t+1) = S \big( I - \eta\, X^T X \big) S^T S\, \Delta\hat{w}(t) = \big( I - \eta\, \Lambda \big)\, S\, \Delta\hat{w}(t).
\]
Thus convergence occurs if
\[
\| S\, \Delta\hat{w}(t+1) \| < \| S\, \Delta\hat{w}(t) \|,
\]
which occurs if the eigenvalues of $I - \eta\, \Lambda$ all have magnitudes less than one.
-
Convergence of the batch LMS rule (cont.)
The eigenvalues of $I - \eta\, \Lambda$ equal $1 - \eta\, \lambda_i$, for $i = 0, 1, \ldots, n$. (These are in fact the eigenvalues of $I - \eta\, X^T X$.) Thus, we require that
\[
-1 < 1 - \eta\, \lambda_i < 1, \quad \text{or} \quad 0 < \eta\, \lambda_i < 2,
\]
for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that
\[
0 < \eta < \frac{2}{\lambda_{\max}}.
\]
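For the running example, this bound is easy to compute; the check below is our own, with numpy.linalg.eigvalsh chosen because $X^T X$ is symmetric.

import numpy as np

X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])

lam_max = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of X^T X
print(2.0 / lam_max)   # batch LMS converges for any eta below this value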