8 Performance Surfaces and Optimum Points
Optimization process consists of
• Defining performance index
• Searching for optimal parameters in the index
8.1 Taylor Series
$F(x) = F(x^*) + \left.\frac{dF(x)}{dx}\right|_{x=x^*}(x-x^*) + \frac{1}{2}\left.\frac{d^2F(x)}{dx^2}\right|_{x=x^*}(x-x^*)^2 + \cdots + \frac{1}{n!}\left.\frac{d^nF(x)}{dx^n}\right|_{x=x^*}(x-x^*)^n + \cdots$ (8.1-1)
8.1.1 Vector Case
$F(\mathbf{x}) = F(x_1, x_2, \ldots, x_n)$ (8.1.1-1)
$F(\mathbf{x}) = F(\mathbf{x}^*) + \left.\frac{\partial F(\mathbf{x})}{\partial x_1}\right|_{\mathbf{x}=\mathbf{x}^*}(x_1-x_1^*) + \left.\frac{\partial F(\mathbf{x})}{\partial x_2}\right|_{\mathbf{x}=\mathbf{x}^*}(x_2-x_2^*) + \cdots + \left.\frac{\partial F(\mathbf{x})}{\partial x_n}\right|_{\mathbf{x}=\mathbf{x}^*}(x_n-x_n^*) + \frac{1}{2}\left.\frac{\partial^2 F(\mathbf{x})}{\partial x_1^2}\right|_{\mathbf{x}=\mathbf{x}^*}(x_1-x_1^*)^2 + \frac{1}{2}\left.\frac{\partial^2 F(\mathbf{x})}{\partial x_1\,\partial x_2}\right|_{\mathbf{x}=\mathbf{x}^*}(x_1-x_1^*)(x_2-x_2^*) + \cdots$ (8.1.1-2)
In matrix form,
$F(\mathbf{x}) = F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x}-\mathbf{x}^*) + \frac{1}{2}(\mathbf{x}-\mathbf{x}^*)^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x}-\mathbf{x}^*) + \cdots$ (8.1.1-3)
∇F(x): Gradient,
$\nabla F(\mathbf{x}) = \left[\frac{\partial F(\mathbf{x})}{\partial x_1}\ \ \frac{\partial F(\mathbf{x})}{\partial x_2}\ \ \cdots\ \ \frac{\partial F(\mathbf{x})}{\partial x_n}\right]^T$ (8.1.1-4)
∇²F(x): Hessian,
$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 F(\mathbf{x})}{\partial x_1^2} & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_1\,\partial x_n} \\ \dfrac{\partial^2 F(\mathbf{x})}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_2^2} & \cdots & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_2\,\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 F(\mathbf{x})}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_n^2} \end{bmatrix}$ (8.1.1-5)
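As a quick numerical sanity check of the gradient (8.1.1-4) and Hessian (8.1.1-5), both can be approximated by central finite differences. This is a minimal sketch assuming NumPy; the test function F and helper names are illustrative, not from the lecture.

import numpy as np

def F(x):
    # illustrative test function F(x) = x1^2 + 2*x2^2 (reused in Section 8.2)
    return x[0]**2 + 2.0 * x[1]**2

def num_gradient(F, x, eps=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (F(x + e) - F(x - e)) / (2 * eps)   # central difference
    return g

def num_hessian(F, x, eps=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(x); e[i] = eps
        H[:, i] = (num_gradient(F, x + e) - num_gradient(F, x - e)) / (2 * eps)
    return H

x = np.array([0.5, 0.5])
print(num_gradient(F, x))   # ~ [1, 2], matching (8.2-5) below
print(num_hessian(F, x))    # ~ [[2, 0], [0, 4]]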
8.2 Directional Derivatives
Directional derivative along p: (+ : increasing, - : decreasing, 0 : constant)
$\dfrac{\mathbf{p}^T\,\nabla F(\mathbf{x})}{\|\mathbf{p}\|}$ (8.2-1)
Second derivative along p: (+ : convex, - : concave, 0 : straight)
$\dfrac{\mathbf{p}^T\,\nabla^2 F(\mathbf{x})\,\mathbf{p}}{\|\mathbf{p}\|^2}$ (8.2-2)
Example: $F(\mathbf{x}) = x_1^2 + 2x_2^2$ (8.2-3)
$\nabla F(\mathbf{x}) = \begin{bmatrix} 2x_1 \\ 4x_2 \end{bmatrix}$ (8.2-4)
At $\mathbf{x} = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$, $\nabla F(\mathbf{x}) = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ (8.2-5)
For $\mathbf{p} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, parallel to the gradient,
Directional derivative $\dfrac{\mathbf{p}^T\,\nabla F(\mathbf{x})}{\|\mathbf{p}\|} = \dfrac{[1\ \ 2]\begin{bmatrix} 1 \\ 2 \end{bmatrix}}{\sqrt{1^2 + 2^2}} = \dfrac{5}{\sqrt{5}} = \sqrt{5}$ (8.2-6)
For $\mathbf{p} = \begin{bmatrix} 2 \\ -1 \end{bmatrix}$, orthogonal to the gradient,
Directional derivative $\dfrac{\mathbf{p}^T\,\nabla F(\mathbf{x})}{\|\mathbf{p}\|} = \dfrac{[2\ \ {-1}]\begin{bmatrix} 1 \\ 2 \end{bmatrix}}{\sqrt{2^2 + (-1)^2}} = \dfrac{0}{\sqrt{5}} = 0$ (8.2-7)
Figure 8.2-1 Quadratic Function and Directional Derivatives
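The two directional derivatives above, and the curvature along each direction, are easy to reproduce numerically. A minimal sketch assuming NumPy; the gradient is hard-coded from (8.2-4).

import numpy as np

A = np.array([[2.0, 0.0], [0.0, 4.0]])     # Hessian of F(x) = x1^2 + 2 x2^2
grad = lambda x: A @ x                      # gradient (8.2-4)

x = np.array([0.5, 0.5])
for p in (np.array([1.0, 2.0]), np.array([2.0, -1.0])):
    slope = grad(x) @ p / np.linalg.norm(p)      # (8.2-1)
    curvature = p @ A @ p / (p @ p)              # (8.2-2)
    print(p, slope, curvature)
# p = [1, 2]  -> slope sqrt(5) (steepest ascent), curvature 18/5
# p = [2, -1] -> slope 0 (along a contour), curvature 12/5 (still convex)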
8.3 Minima
Definition: The point x* is a strong minimum of F(x) if a scalar δ > 0 exists, such that F(x*) < F(x*+Δx) for all Δx such
that δ > ||Δx|| > 0.
Definition: The point x* is a unique global minimum of F(x) if F(x*) < F(x*+Δx) for all Δx≠0.
Definition: The point x* is a weak minimum of F(x) if it is not a strong minimum, and a scalar δ > 0 exists, such that
F(x*) ≤ F(x*+Δx) for all Δx such that δ > ||Δx|| > 0.
Examples:
$F(x) = 3x^4 - 7x^2 - \frac{1}{2}x + 6$ (8.3-1)
Figure 8.3-1 Scalar Example of Local and Global Minima
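The stationary points of (8.3-1) can be located by finding the roots of $F'(x) = 12x^3 - 14x - \frac{1}{2}$ and classifying each with $F''(x)$. A sketch assuming NumPy; the printed values are approximate.

import numpy as np

roots = np.roots([12.0, 0.0, -14.0, -0.5])    # coefficients of F'(x)
F = lambda x: 3*x**4 - 7*x**2 - 0.5*x + 6
Fpp = lambda x: 36*x**2 - 14                   # F''(x) classifies each point
for r in np.sort(roots.real):
    kind = "minimum" if Fpp(r) > 0 else "maximum"
    print(f"x = {r:+.3f}  F = {F(r):.3f}  ({kind})")
# Two minima (a local one near x = -1.06 and the global one near x = +1.10)
# separated by a local maximum near x = 0.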
$F(\mathbf{x}) = (x_2 - x_1)^4 + 8x_1x_2 - x_1 + x_2 + 3$ (8.3-2)
Figure 8.3-2 Vector Example of Minima and Saddle Point
$F(\mathbf{x}) = (x_1^2 - 1.5x_1x_2 + 2x_2^2)\,x_1^2$ (8.3-3)
Figure 8.3-3 Weak Minimum Example
8.4 Necessary Conditions for Optimality
8.4.1 First-Order Conditions
Approximation by Taylor’s series
$F(\mathbf{x}) = F(\mathbf{x}^* + \Delta\mathbf{x}) \approx F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x}$ (8.4.1-1)
If x* is a minimum point, the second term must be nonnegative,
$\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x} \geq 0$ (8.4.1-2)
Suppose the second term were positive for some Δx,
$\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x} > 0$ (8.4.1-3)
Then a step in the opposite direction would decrease the function,
$F(\mathbf{x}^* - \Delta\mathbf{x}) \approx F(\mathbf{x}^*) - \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x} < F(\mathbf{x}^*)$ (8.4.1-4)
which contradicts x* being a minimum. Hence the second term must be zero for every Δx,
$\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x} = 0$ (8.4.1-5)
$\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*} = \mathbf{0}$ (8.4.1-6)
• The gradient must be zero at a minimum point.
• This is a first-order, necessary (but not sufficient) condition for x* to be a local minimum point.
• Any points that have zero gradient are called stationary points.
8.4.2 Second-Order Conditions
Approximation by Taylor’s series
$F(\mathbf{x}^* + \Delta\mathbf{x}) = F(\mathbf{x}^*) + \frac{1}{2}\Delta\mathbf{x}^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x} + \cdots$ (8.4.2-1)
(the first-order term vanishes because the gradient is zero at x*). If x* is a minimum point, the second term must be nonnegative,
$\Delta\mathbf{x}^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}\,\Delta\mathbf{x} \geq 0$ (8.4.2-2)
A matrix A is positive definite (all eigenvalues are positive) if, for any vector z ≠ 0,
$\mathbf{z}^T\mathbf{A}\mathbf{z} > 0$ (8.4.2-3)
A matrix A is positive semidefinite (all eigenvalues are nonnegative) if, for any vector z,
$\mathbf{z}^T\mathbf{A}\mathbf{z} \geq 0$ (8.4.2-4)
• A positive definite Hessian matrix is a second-order, sufficient condition for a strong minimum to exist.
• It is not a necessary condition. A minimum can still be strong if the second-order term of the Taylor series is zero, but
the third-order term is positive.
• The second-order, necessary condition for a strong minimum is that the Hessian matrix be positive semidefinite.
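Since the Hessian is symmetric, its definiteness can be tested directly from its eigenvalues, exactly as (8.4.2-3) and (8.4.2-4) suggest. A sketch assuming NumPy; the tolerance and classifier name are illustrative.

import numpy as np

def classify(H, tol=1e-10):
    lam = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    if np.all(lam > tol):   return "positive definite: strong minimum"
    if np.all(lam >= -tol): return "positive semidefinite: weak or strong minimum"
    if np.all(lam < -tol):  return "negative definite: strong maximum"
    return "indefinite: saddle point"

print(classify(np.array([[2.0, 0.0], [0.0, 4.0]])))       # minimum
print(classify(np.array([[-0.5, -1.5], [-1.5, -0.5]])))   # the saddle of Section 8.5.1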
8.5 Quadratic Functions
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$ (8.5-1)
where the matrix A is symmetric.
$\nabla(\mathbf{h}^T\mathbf{x}) = \nabla(\mathbf{x}^T\mathbf{h}) = \mathbf{h}$ (8.5-2)
where h is a constant vector,
$\nabla\,\mathbf{x}^T\mathbf{Q}\mathbf{x} = \mathbf{Q}\mathbf{x} + \mathbf{Q}^T\mathbf{x} = 2\mathbf{Q}\mathbf{x}$ (for symmetric Q) (8.5-3)
$\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$ (8.5-4)
$\nabla^2 F(\mathbf{x}) = \mathbf{A}$ (8.5-5)
• All higher derivatives of the quadratic function are zero.
• The first three terms of the Taylor series expansion give an exact representation of the function.
• All analytic functions behave like quadratics over a small neighborhood.
8.5.1 Eigensystem of the Hessian
Consider a quadratic function that has a stationary point at the origin, and whose value there is zero:
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$ (8.5.1-1)
The basis vectors are changed to the basis vectors of eigenvectors of the Hessian matrix, A.
• Since A is symmetric, its normalized eigenvectors are mutually orthogonal.
$\mathbf{B} = [\mathbf{z}_1\ \ \mathbf{z}_2\ \ \cdots\ \ \mathbf{z}_n]$, $\mathbf{B}^{-1} = \mathbf{B}^T$ (8.5.1-2)
$\mathbf{A}' = \mathbf{B}^T\mathbf{A}\mathbf{B} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = \mathbf{\Lambda}$ (8.5.1-3)
$\mathbf{A} = \mathbf{B}\mathbf{\Lambda}\mathbf{B}^T$ (8.5.1-4)
The second derivative of a function F(x) in the direction of a vector p,
$\dfrac{\mathbf{p}^T\,\nabla^2 F(\mathbf{x})\,\mathbf{p}}{\|\mathbf{p}\|^2} = \dfrac{\mathbf{p}^T\mathbf{A}\mathbf{p}}{\|\mathbf{p}\|^2}$ (8.5.1-5)
$\mathbf{p} = \mathbf{B}\mathbf{c}$ (8.5.1-6)
where c is the representation of the vector p with respect to the eigenvectors of A.
$\dfrac{\mathbf{p}^T\mathbf{A}\mathbf{p}}{\|\mathbf{p}\|^2} = \dfrac{\mathbf{c}^T\mathbf{B}^T(\mathbf{B}\mathbf{\Lambda}\mathbf{B}^T)\mathbf{B}\mathbf{c}}{\mathbf{c}^T\mathbf{B}^T\mathbf{B}\mathbf{c}} = \dfrac{\mathbf{c}^T\mathbf{\Lambda}\mathbf{c}}{\mathbf{c}^T\mathbf{c}} = \dfrac{\sum_{i=1}^{n}\lambda_i c_i^2}{\sum_{i=1}^{n} c_i^2}$ (8.5.1-7)
• The second derivative is a weighted average of the eigenvalues.
• The second derivative can never be larger than the largest eigenvalue, or smaller than the smallest eigenvalue.
$\lambda_{\min} \leq \dfrac{\mathbf{p}^T\mathbf{A}\mathbf{p}}{\|\mathbf{p}\|^2} \leq \lambda_{\max}$ (8.5.1-8)
• The maximum second derivative occurs in the direction of the eigenvector that corresponds to the largest eigenvalue.
• In each of the eigenvector directions the second derivatives will be equal to the corresponding eigenvalue.
• In other directions, the second derivative will be a weighted average of the eigenvalues.
• The eigenvectors define a new coordinate system in which the quadratic cross terms vanish.
• The eigenvectors are known as the principal axes of the function contours.
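The weighted-average identity (8.5.1-7) and the bound (8.5.1-8) can be verified directly on the elliptic example that follows. A sketch assuming NumPy; the direction p is arbitrary.

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # Hessian of the elliptic hollow below
lam, B = np.linalg.eigh(A)                 # columns of B are eigenvectors z_i

p = np.array([1.0, 0.0])                   # any test direction
c = B.T @ p                                # p = B c, so c = B^T p
second_deriv = (p @ A @ p) / (p @ p)
weighted_avg = np.sum(lam * c**2) / np.sum(c**2)   # (8.5.1-7)
print(second_deriv, weighted_avg)                  # identical values
print(lam.min() <= second_deriv <= lam.max())      # (8.5.1-8): True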
Figure 8.5.1-1 The Principal Axes of the Function Contours
Example:
$F(\mathbf{x}) = x_1^2 + x_2^2 = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\mathbf{x}$ (8.5.1-9)
$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$, $\lambda_1 = 2$ with $\mathbf{z}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $\lambda_2 = 2$ with $\mathbf{z}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ (8.5.1-10)
Figure 8.5.1-2 Circular Hollow
Example:
$F(\mathbf{x}) = x_1^2 + x_1x_2 + x_2^2 = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\mathbf{x}$ (8.5.1-11)
$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, $\lambda_1 = 1$ with $\mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $\lambda_2 = 3$ with $\mathbf{z}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ (8.5.1-12)
Figure 8.5.1-3 Elliptic Hollow
Example:
$F(\mathbf{x}) = -\frac{1}{4}x_1^2 - \frac{3}{2}x_1x_2 - \frac{1}{4}x_2^2 = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} -0.5 & -1.5 \\ -1.5 & -0.5 \end{bmatrix}\mathbf{x}$ (8.5.1-13)
$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} -0.5 & -1.5 \\ -1.5 & -0.5 \end{bmatrix}$, $\lambda_1 = 1$ with $\mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $\lambda_2 = -2$ with $\mathbf{z}_2 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$ (8.5.1-14)
Figure 8.5.1-4 Elongated Saddle
Example:
$F(\mathbf{x}) = \frac{1}{2}x_1^2 - x_1x_2 + \frac{1}{2}x_2^2 = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}\mathbf{x}$ (8.5.1-16)
$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$, $\lambda_1 = 2$ with $\mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $\lambda_2 = 0$ with $\mathbf{z}_2 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$ (8.5.1-17)
A weak minimum exists along the line
$x_1 = x_2$ (8.5.1-18)
Figure 8.5.1-5 Stationary Valley
Characteristics of the quadratic function
1. If the eigenvalues of the Hessian matrix are all positive, the function will have a single strong minimum.
2. If the eigenvalues are all negative, the function will have a single strong maximum.
3. If some eigenvalues are positive and other eigenvalues are negative, the function will have a single saddle point.
4. If the eigenvalues are all nonnegative, but some eigenvalues are zero, then the function will either have a weak
minimum or will have no stationary point.
5. If the eigenvalues are all nonpositive, but some eigenvalues are zero, then the function will either have a weak
maximum or will have no stationary point.
Stationary point of a quadratic function,
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$ (8.5.1-19)
$\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$ (8.5.1-20)
$\mathbf{x}^* = -\mathbf{A}^{-1}\mathbf{d}$ (8.5.1-21)
• If A is not invertible (has some zero eigenvalues) and d is nonzero then no stationary points will exist.
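Computing the stationary point (8.5.1-21) is a one-line linear solve. A sketch assuming NumPy; the linear term d here is an illustrative choice, not from the lecture.

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
d = np.array([-1.0, 1.0])                 # illustrative linear term
x_star = -np.linalg.solve(A, d)           # x* = -A^{-1} d, without forming A^{-1}
print(x_star)
print(A @ x_star + d)                     # the gradient (8.5.1-20) is zero at x*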
9 Performance Optimization
9.1 Steepest Descent
Updating to minimum point,
$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k$ (9.1-1)
$F(\mathbf{x}_{k+1}) < F(\mathbf{x}_k)$ (9.1-2)
$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T\,\Delta\mathbf{x}_k$ (9.1-3)
$\mathbf{g}_k \equiv \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$ (9.1-4)
$\mathbf{g}_k^T\,\Delta\mathbf{x}_k = \alpha_k\,\mathbf{g}_k^T\mathbf{p}_k < 0$ (9.1-5)
$\mathbf{g}_k^T\mathbf{p}_k < 0$ (9.1-6)
The direction of steepest descent,
$\mathbf{g}_k^T\mathbf{p}_k$: most negative (9.1-7)
$\mathbf{p}_k = -\mathbf{g}_k$ (9.1-8)
The method of steepest descent,
$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\mathbf{g}_k$ (9.1-9)
Learning rate determination, αk,
1. Stable learning rate: Using fixed learning rate or predetermined learning rate
2. Minimizing along a line: Minimize the performance index F(x) with respect to αk at each iteration
$F(\mathbf{x}_k - \alpha_k\mathbf{g}_k)$ (9.1-10)
Consider $F(\mathbf{x}) = x_1^2 + 25x_2^2$ (9.1-11)
$\mathbf{x}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$ (9.1-12)
$\nabla F(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial F(\mathbf{x})}{\partial x_1} \\ \dfrac{\partial F(\mathbf{x})}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 50x_2 \end{bmatrix}$ (9.1-13)
$\mathbf{g}_0 = \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} 1 \\ 25 \end{bmatrix}$ (9.1-14)
Fixed learning rate of α = 0.01,
$\mathbf{x}_1 = \mathbf{x}_0 - \alpha\mathbf{g}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - 0.01\begin{bmatrix} 1 \\ 25 \end{bmatrix} = \begin{bmatrix} 0.49 \\ 0.25 \end{bmatrix}$ (9.1-15)
$\mathbf{x}_2 = \mathbf{x}_1 - \alpha\mathbf{g}_1 = \begin{bmatrix} 0.49 \\ 0.25 \end{bmatrix} - 0.01\begin{bmatrix} 0.98 \\ 12.5 \end{bmatrix} = \begin{bmatrix} 0.4802 \\ 0.125 \end{bmatrix}$ (9.1-16)
Figure 9.1-1 Trajectory for Steepest Descent with α = 0.01 and 0.035
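The iterates (9.1-15) and (9.1-16) come from a three-line loop. A sketch assuming NumPy, reproducing the fixed-rate trajectory.

import numpy as np

grad = lambda x: np.array([2*x[0], 50*x[1]])   # gradient (9.1-13)
x = np.array([0.5, 0.5])
alpha = 0.01
for k in range(3):
    x = x - alpha * grad(x)                    # steepest descent (9.1-9)
    print(k + 1, x)
# k=1: [0.49, 0.25]; k=2: [0.4802, 0.125]; x2 shrinks quickly, x1 very slowly,
# reflecting the spread between the eigenvalues 2 and 50.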
9.1.1 Stable Learning Rates
For quadratic performance index,
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$ (9.1.1-1)
$\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$ (9.1.1-2)
$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\mathbf{g}_k = \mathbf{x}_k - \alpha(\mathbf{A}\mathbf{x}_k + \mathbf{d})$ (9.1.1-3)
$\mathbf{x}_{k+1} = [\mathbf{I} - \alpha\mathbf{A}]\mathbf{x}_k - \alpha\mathbf{d}$ (9.1.1-4)
• This linear dynamic system is stable if the eigenvalues of the matrix [I − αA] are less than one in magnitude.
{λ1, λ2, …, λn}: eigenvalues of the Hessian matrix, {z1, z2, …, zn}: eigenvectors of the Hessian matrix,
$[\mathbf{I} - \alpha\mathbf{A}]\mathbf{z}_i = \mathbf{z}_i - \alpha\mathbf{A}\mathbf{z}_i = \mathbf{z}_i - \alpha\lambda_i\mathbf{z}_i = (1 - \alpha\lambda_i)\mathbf{z}_i$ (9.1.1-5)
• The eigenvectors of [I − αA] are the same as the eigenvectors of A, and the eigenvalues of [I − αA] are (1 − αλi).
The condition for the stability of the steepest descent algorithm,
$|1 - \alpha\lambda_i| < 1$ (9.1.1-6)
$0 < \alpha < \dfrac{2}{\lambda_i}$ (9.1.1-7)
$\alpha < \dfrac{2}{\lambda_{\max}}$ (9.1.1-8)
Example:
$F(\mathbf{x}) = x_1^2 + 25x_2^2 = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}\mathbf{x}$, $\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}$ (9.1.1-9)
$\lambda_1 = 2$ with $\mathbf{z}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $\lambda_2 = 50$ with $\mathbf{z}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ (9.1.1-10)
$\alpha < \dfrac{2}{\lambda_{\max}} = \dfrac{2}{50} = 0.04$ (9.1.1-11)
Figure 9.1.1-1 Trajectories for α = 0.039 (Left), and α = 0.041 (Right)
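Stepping just below and just above the bound (9.1.1-11) shows the stability threshold directly. A sketch assuming NumPy.

import numpy as np

grad = lambda x: np.array([2*x[0], 50*x[1]])
for alpha in (0.039, 0.041):
    x = np.array([0.5, 0.5])
    for _ in range(50):
        x = x - alpha * grad(x)
    print(alpha, x)
# alpha = 0.039: |1 - 0.039*50| = 0.95 < 1, the x2 component decays (oscillating);
# alpha = 0.041: |1 - 0.041*50| = 1.05 > 1, the x2 component grows without bound.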
9.1.2 Minimizing Along a Line
$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \alpha_k\mathbf{p}_k) \approx F(\mathbf{x}_k) + \alpha_k\,\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}_k}\,\mathbf{p}_k + \frac{\alpha_k^2}{2}\,\mathbf{p}_k^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}\,\mathbf{p}_k$ (9.1.2-1)
$\dfrac{d}{d\alpha_k}F(\mathbf{x}_k + \alpha_k\mathbf{p}_k) = \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}_k}\,\mathbf{p}_k + \alpha_k\,\mathbf{p}_k^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}\,\mathbf{p}_k$ (9.1.2-2)
Setting this derivative to zero gives
$\alpha_k = -\dfrac{\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}_k}\,\mathbf{p}_k}{\mathbf{p}_k^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}\,\mathbf{p}_k} = -\dfrac{\mathbf{g}_k^T\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{A}_k\mathbf{p}_k}$ (9.1.2-3)
$\mathbf{A}_k = \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$ (9.1.2-4)
Example:
$F(\mathbf{x}) = x_1^2 + x_1x_2 + x_2^2 = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\mathbf{x}$ (9.1.2-5)
$\mathbf{x}_0 = \begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix}$ (9.1.2-6)
$\nabla F(\mathbf{x}) = \begin{bmatrix} 2x_1 + x_2 \\ x_1 + 2x_2 \end{bmatrix}$ (9.1.2-7)
$\mathbf{p}_0 = -\mathbf{g}_0 = -\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}$ (9.1.2-8)
$\alpha_0 = -\dfrac{[1.35\ \ 0.3]\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}}{[-1.35\ \ {-0.3}]\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}} = 0.413$ (9.1.2-9)
$\mathbf{x}_1 = \mathbf{x}_0 - \alpha_0\mathbf{g}_0 = \begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix} - 0.413\begin{bmatrix} 1.35 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.24 \\ -0.37 \end{bmatrix}$ (9.1.2-10)
Figure 9.1.2-1 Steepest Descent with Minimization Along a Line
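For a quadratic the exact line search (9.1.2-3) is one vector expression per iteration. A sketch assuming NumPy, reproducing α₀ = 0.413 and x₁ above.

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # constant Hessian of the quadratic
x = np.array([0.8, -0.25])
for k in range(2):
    g = A @ x                               # gradient of F(x) = 1/2 x^T A x
    p = -g                                  # steepest-descent direction
    alpha = -(g @ p) / (p @ A @ p)          # exact line search (9.1.2-3)
    x = x + alpha * p
    print(k, alpha, x)
# Consecutive steps are orthogonal, giving the zig-zag of Figure 9.1.2-1.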
9.2 Newton's Method
$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T\,\Delta\mathbf{x}_k + \frac{1}{2}\Delta\mathbf{x}_k^T\mathbf{A}_k\,\Delta\mathbf{x}_k$ (9.2-1)
For a quadratic function,
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$ (9.2-2)
$\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$ (9.2-3)
Setting the gradient of (9.2-1) with respect to Δxk to zero,
$\mathbf{g}_k + \mathbf{A}_k\,\Delta\mathbf{x}_k = \mathbf{0}$ (9.2-4)
$\Delta\mathbf{x}_k = -\mathbf{A}_k^{-1}\mathbf{g}_k$ (9.2-5)
$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{A}_k^{-1}\mathbf{g}_k$ (9.2-6)
Example: $F(\mathbf{x}) = x_1^2 + 25x_2^2$ (9.2-7)
$\nabla F(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial F(\mathbf{x})}{\partial x_1} \\ \dfrac{\partial F(\mathbf{x})}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 50x_2 \end{bmatrix}$, $\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}$ (9.2-8)
$\mathbf{x}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$ (9.2-9)
$\mathbf{x}_1 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}^{-1}\begin{bmatrix} 1 \\ 25 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ (9.2-10)
Figure 9.2-1 Trajectory for Newton’s Method
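One Newton step (9.2-6) reaches the minimum of any quadratic because the Hessian is exact and constant. A sketch assuming NumPy, reproducing (9.2-10).

import numpy as np

grad = lambda x: np.array([2*x[0], 50*x[1]])
A = np.array([[2.0, 0.0], [0.0, 50.0]])      # Hessian from (9.2-8)
x0 = np.array([0.5, 0.5])
x1 = x0 - np.linalg.solve(A, grad(x0))       # x1 = x0 - A^{-1} g0, via a solve
print(x1)                                    # [0, 0] in a single step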
• If the function F(x) is not quadratic, then Newton's method will not generally converge in one step.
Example:
$F(\mathbf{x}) = (x_2 - x_1)^4 + 8x_1x_2 - x_1 + x_2 + 3$ (9.2-11)
Three stationary points:
$\mathbf{x}^1 = \begin{bmatrix} -0.42 \\ 0.42 \end{bmatrix}$, $\mathbf{x}^2 = \begin{bmatrix} -0.13 \\ 0.13 \end{bmatrix}$, and $\mathbf{x}^3 = \begin{bmatrix} 0.55 \\ -0.55 \end{bmatrix}$ (9.2-12)
Figure 9.2-2 One Iteration of Newton’s Method from x0 = [1.5 0]T
Figure 9.2-3 One Iteration of Newton’s Method from x0 = [-1.5 0]T
Figure 9.2-4 One Iteration of Newton’s Method from x0 = [0.75 0.75]T
9.3 Conjugate Gradient
For a quadratic function,
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$ (9.3-1)
Definition: A set of vectors {pk} is mutually conjugate with respect to a positive definite Hessian matrix A if and only if
$\mathbf{p}_k^T\mathbf{A}\mathbf{p}_j = 0$ for $k \neq j$ (9.3-2)
• If we make a sequence of exact linear searches along any set of conjugate directions {p1, p2, …, pn}, then the exact minimum of any quadratic function, with n parameters, will be reached in at most n searches.
The change in the gradient at iteration k+1 for quadratic functions,
$\Delta\mathbf{g}_k = \mathbf{g}_{k+1} - \mathbf{g}_k = (\mathbf{A}\mathbf{x}_{k+1} + \mathbf{d}) - (\mathbf{A}\mathbf{x}_k + \mathbf{d}) = \mathbf{A}\,\Delta\mathbf{x}_k$ (9.3-3)
$\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k = \alpha_k\mathbf{p}_k$ (9.3-4)
$\alpha_k\,\mathbf{p}_k^T\mathbf{A}\mathbf{p}_j = \Delta\mathbf{x}_k^T\mathbf{A}\mathbf{p}_j = \Delta\mathbf{g}_k^T\mathbf{p}_j = 0$ for $k \neq j$ (9.3-5)
• The search directions will be conjugate if they are orthogonal to the changes in the gradient.
$\mathbf{p}_0 = -\mathbf{g}_0$ (9.3-6)
$\mathbf{p}_k = -\mathbf{g}_k + \beta_k\mathbf{p}_{k-1}$ (9.3-7)
By Hestenes and Stiefel,
$\beta_k = \dfrac{\Delta\mathbf{g}_{k-1}^T\mathbf{g}_k}{\Delta\mathbf{g}_{k-1}^T\mathbf{p}_{k-1}}$ (9.3-8)
Fletcher and Reeves,
$\beta_k = \dfrac{\mathbf{g}_k^T\mathbf{g}_k}{\mathbf{g}_{k-1}^T\mathbf{g}_{k-1}}$ (9.3-9)
Polak and Ribière,
$\beta_k = \dfrac{\Delta\mathbf{g}_{k-1}^T\mathbf{g}_k}{\mathbf{g}_{k-1}^T\mathbf{g}_{k-1}}$ (9.3-10)
The conjugate gradient method consists of the following steps:
1. Select the first search direction to be the negative of the gradient, $\mathbf{p}_0 = -\mathbf{g}_0$.
2. Take a step according to $\Delta\mathbf{x}_k = \alpha_k\mathbf{p}_k$, selecting the learning rate αk to minimize the function along the search direction.
3. Select the next search direction according to $\mathbf{p}_k = -\mathbf{g}_k + \beta_k\mathbf{p}_{k-1}$, using (9.3-8), (9.3-9), or (9.3-10) to calculate βk.
4. If the algorithm has not converged, return to step 2.
Example:
$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\mathbf{x}$ (9.3-11)
$\mathbf{x}_0 = \begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix}$ (9.3-12)
$\nabla F(\mathbf{x}) = \begin{bmatrix} 2x_1 + x_2 \\ x_1 + 2x_2 \end{bmatrix}$ (9.3-13)
$\mathbf{p}_0 = -\mathbf{g}_0 = -\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}$ (9.3-14)
$\alpha_0 = -\dfrac{[1.35\ \ 0.3]\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}}{[-1.35\ \ {-0.3}]\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}} = 0.413$ (9.3-15)
$\mathbf{x}_1 = \mathbf{x}_0 + \alpha_0\mathbf{p}_0 = \begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix} + 0.413\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix} = \begin{bmatrix} 0.24 \\ -0.37 \end{bmatrix}$ (9.3-16)
$\mathbf{g}_1 = \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_1} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 0.24 \\ -0.37 \end{bmatrix} = \begin{bmatrix} 0.11 \\ -0.5 \end{bmatrix}$ (9.3-17)
Using the method of Fletcher and Reeves,
$\beta_1 = \dfrac{\mathbf{g}_1^T\mathbf{g}_1}{\mathbf{g}_0^T\mathbf{g}_0} = \dfrac{[0.11\ \ {-0.5}]\begin{bmatrix} 0.11 \\ -0.5 \end{bmatrix}}{[1.35\ \ 0.3]\begin{bmatrix} 1.35 \\ 0.3 \end{bmatrix}} = \dfrac{0.2621}{1.9125} = 0.137$ (9.3-18)
$\mathbf{p}_1 = -\mathbf{g}_1 + \beta_1\mathbf{p}_0 = \begin{bmatrix} -0.11 \\ 0.5 \end{bmatrix} + 0.137\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix} = \begin{bmatrix} -0.295 \\ 0.459 \end{bmatrix}$ (9.3-19)
$\alpha_1 = -\dfrac{[0.11\ \ {-0.5}]\begin{bmatrix} -0.295 \\ 0.459 \end{bmatrix}}{[-0.295\ \ 0.459]\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} -0.295 \\ 0.459 \end{bmatrix}} = 0.807$ (9.3-20)
$\mathbf{x}_2 = \mathbf{x}_1 + \alpha_1\mathbf{p}_1 = \begin{bmatrix} 0.24 \\ -0.37 \end{bmatrix} + 0.807\begin{bmatrix} -0.295 \\ 0.459 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ (9.3-21)
Figure 9.3-1 Conjugate Gradient Algorithm
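The whole example collapses to a short loop. A sketch assuming NumPy, using the Fletcher-Reeves beta (9.3-9) with exact line searches.

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
x = np.array([0.8, -0.25])
g = A @ x
p = -g                                       # first direction (9.3-6)
for k in range(2):
    alpha = -(g @ p) / (p @ A @ p)           # exact line search
    x = x + alpha * p
    g_new = A @ x
    beta = (g_new @ g_new) / (g @ g)         # Fletcher-Reeves (9.3-9)
    p = -g_new + beta * p                    # next direction (9.3-7)
    g = g_new
    print(k, x)
# x reaches [0, 0] after n = 2 searches, as the conjugacy argument predicts.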
10 Widrow-Hoff Learning
10.1 ADALINE (ADAptive LInear NEuron) Network
Figure 10.1-1 ADALINE Network
$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$ (10.1-1)
$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i\mathbf{w}^T\mathbf{p} + b_i) = {}_i\mathbf{w}^T\mathbf{p} + b_i$ (10.1-2)
${}_i\mathbf{w} = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$ (10.1-3)
10.1.1 Single ADALINE
Figure 10.1.1-1 Two-Input Linear Neuron
$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$ (10.1.1-1)
10.2 Mean Square Error
LMS algorithm: supervised training
{p1, t1}, {p2, t2}, …, {pQ, tQ} (10.2-1)
$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}$ (10.2-2)
$\mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}$ (10.2-3)
$a = {}_1\mathbf{w}^T\mathbf{p} + b$ (10.2-4)
$a = \mathbf{x}^T\mathbf{z}$ (10.2-5)
$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$ (10.2-6)
$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}] = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\mathbf{x}$ (10.2-7)
$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$ (10.2-8)
where
$c = E[t^2]$, $\mathbf{h} = E[t\mathbf{z}]$, and $\mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$ (10.2-9)
General form of the quadratic function,
$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$ (10.2-10)
$\mathbf{d} = -2\mathbf{h}$ and $\mathbf{A} = 2\mathbf{R}$ (10.2-11)
The gradient,
$\nabla F(\mathbf{x}) = \nabla\!\left(c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}\right) = \mathbf{d} + \mathbf{A}\mathbf{x} = -2\mathbf{h} + 2\mathbf{R}\mathbf{x}$ (10.2-12)
The stationary point of the performance index,
$-2\mathbf{h} + 2\mathbf{R}\mathbf{x} = \mathbf{0}$ (10.2-13)
$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$ (10.2-14)
Example: $\left\{\mathbf{z}_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\ t_1 = 1\right\} \to 50\%$, $\left\{\mathbf{z}_2 = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\ t_2 = -1\right\} \to 50\%$
$c = E[t^2] = 0.5(1)^2 + 0.5(-1)^2 = 1$ (10.2-15)
$\mathbf{h} = E[t\mathbf{z}] = 0.5(1)\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 0.5(-1)\begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$ (10.2-16)
$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T] = 0.5\begin{bmatrix} 1 \\ 2 \end{bmatrix}[1\ \ 2] + 0.5\begin{bmatrix} 3 \\ 4 \end{bmatrix}[3\ \ 4] = \begin{bmatrix} 5 & 7 \\ 7 & 10 \end{bmatrix}$ (10.2-17)
$F(\mathbf{x}) = 1 - 2\mathbf{x}^T\begin{bmatrix} -1 \\ -1 \end{bmatrix} + \mathbf{x}^T\begin{bmatrix} 5 & 7 \\ 7 & 10 \end{bmatrix}\mathbf{x}$ (10.2-18)
$\nabla F(\mathbf{x}) = 2\begin{bmatrix} 5 & 7 \\ 7 & 10 \end{bmatrix}\mathbf{x} - 2\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \mathbf{0}$ (10.2-19)
$\mathbf{x}^* = \begin{bmatrix} w_{1,1} \\ w_{1,2} \end{bmatrix} = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 5 & 7 \\ 7 & 10 \end{bmatrix}^{-1}\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} -3 \\ 2 \end{bmatrix}$ (10.2-20)
Checking the two patterns:
$\mathbf{z}_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $t_1 = 1$: $a_1 = [-3\ \ 2]\begin{bmatrix} 1 \\ 2 \end{bmatrix} = 1 = t_1$ (10.2-21)
$\mathbf{z}_2 = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$, $t_2 = -1$: $a_2 = [-3\ \ 2]\begin{bmatrix} 3 \\ 4 \end{bmatrix} = -1 = t_2$ (10.2-22)
Figure 10.2-1 Network
Figure 10.2-2 Weight Vector and Decision Boundary
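The optimal weight vector (10.2-20) and the verification in (10.2-21)/(10.2-22) take a few lines. A sketch assuming NumPy.

import numpy as np

z = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
t = [1.0, -1.0]
c = 0.5 * sum(ti**2 for ti in t)                 # E[t^2] = 1
h = 0.5 * sum(ti * zi for ti, zi in zip(t, z))   # E[tz] = [-1, -1]
R = 0.5 * sum(np.outer(zi, zi) for zi in z)      # E[zz^T] = [[5, 7], [7, 10]]
x_star = np.linalg.solve(R, h)                   # x* = R^{-1} h  (10.2-14)
print(x_star)                                    # [-3, 2]
print([x_star @ zi for zi in z])                 # [1, -1]: reproduces t1, t2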
10.3 LMS Algorithm
In the Widrow-Hoff learning rule, the mean square error F(x) is estimated by the squared error at iteration k,
$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$ (10.3-1)
$\nabla\hat{F}(\mathbf{x}) = \nabla e^2(k)$ (10.3-2)
$[\nabla e^2(k)]_j = \dfrac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\dfrac{\partial e(k)}{\partial w_{1,j}}$ for j = 1, 2, …, R, (10.3-3)
$[\nabla e^2(k)]_{R+1} = \dfrac{\partial e^2(k)}{\partial b} = 2e(k)\dfrac{\partial e(k)}{\partial b}$ (10.3-4)
$\dfrac{\partial e(k)}{\partial w_{1,j}} = \dfrac{\partial[t(k) - a(k)]}{\partial w_{1,j}} = \dfrac{\partial}{\partial w_{1,j}}\!\left[t(k) - ({}_1\mathbf{w}^T\mathbf{p}(k) + b)\right] = \dfrac{\partial}{\partial w_{1,j}}\!\left[t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)\right]$ (10.3-5)
$\dfrac{\partial e(k)}{\partial w_{1,j}} = -p_j(k)$ (10.3-6)
$\dfrac{\partial e(k)}{\partial b} = -1$ (10.3-7)
$\nabla\hat{F}(\mathbf{x}) = \nabla e^2(k) = -2e(k)\,\mathbf{z}(k)$ (10.3-8)
$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$ (10.3-9)
$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\,\mathbf{z}(k)$ (10.3-10)
${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha e(k)\,\mathbf{p}(k)$ (10.3-11)
$b(k+1) = b(k) + 2\alpha e(k)$ (10.3-12)
• The Widrow-Hoff learning algorithm is also called the delta rule.
For a network with multiple outputs,
${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha e_i(k)\,\mathbf{p}(k)$ (10.3-13)
$b_i(k+1) = b_i(k) + 2\alpha e_i(k)$ (10.3-14)
In matrix form,
$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$ (10.3-15)
$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$ (10.3-16)
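The matrix-form updates (10.3-15)/(10.3-16) translate directly into a training loop. This is a minimal sketch assuming NumPy; the function name and epoch count are illustrative.

import numpy as np

def lms(patterns, targets, alpha, epochs=50, n_out=1):
    R = patterns[0].size
    W = np.zeros((n_out, R))
    b = np.zeros(n_out)
    for _ in range(epochs):
        for p, t in zip(patterns, targets):
            a = W @ p + b                          # purelin output
            e = t - a                              # error
            W = W + 2 * alpha * np.outer(e, p)     # (10.3-15)
            b = b + 2 * alpha * e                  # (10.3-16)
    return W, b

The apple/orange example in the next section follows exactly this loop, with α = 0.2 and the bias omitted.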
10.4 Example on Apple/Orange Recognition
For simplicity, a zero bias is used in the ADALINE network.
$\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix},\ t_1 = -1\right\} \to 50\%$, $\left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix},\ t_2 = 1\right\} \to 50\%$ (10.4-1)
$c = E[t^2] = 0.5(-1)^2 + 0.5(1)^2 = 1$ (10.4-2)
With zero bias, z = p, so
$\mathbf{h} = E[t\mathbf{z}] = 0.5(-1)\begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} + 0.5(1)\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$ (10.4-3)
$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \frac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \frac{1}{2}\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}$ (10.4-4)
The eigenvalues of R,
λ1 = 1.0, λ2 = 0.0, λ3 = 2.0
Since the Hessian of the mean square error is A = 2R, the stable learning rates satisfy
$\alpha < \dfrac{1}{\lambda_{\max}} = \dfrac{1}{2.0} = 0.5$
Select α = 0.2,
Presenting the orange, $\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix},\ t_1 = -1\right\}$,
$a(0) = \mathbf{W}(0)\mathbf{p}(0) = [0\ \ 0\ \ 0]\begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = 0$ (10.4-5)
$e(0) = t(0) - a(0) = -1 - 0 = -1$ (10.4-6)
$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\,\mathbf{p}^T(0) = [0\ \ 0\ \ 0] + 2(0.2)(-1)[1\ \ {-1}\ \ {-1}] = [-0.4\ \ 0.4\ \ 0.4]$ (10.4-7)
Presenting the apple, $\left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix},\ t_2 = 1\right\}$,
$a(1) = \mathbf{W}(1)\mathbf{p}(1) = [-0.4\ \ 0.4\ \ 0.4]\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$ (10.4-8)
$e(1) = t(1) - a(1) = 1 - (-0.4) = 1.4$ (10.4-9)
$\mathbf{W}(2) = \mathbf{W}(1) + 2\alpha e(1)\,\mathbf{p}^T(1) = [-0.4\ \ 0.4\ \ 0.4] + 2(0.2)(1.4)[1\ \ 1\ \ {-1}] = [0.16\ \ 0.96\ \ {-0.16}]$ (10.4-10)
If we continue this procedure, the algorithm converges to
$\mathbf{W}(\infty) = [0\ \ 1\ \ 0]$ (10.4-11)
Figure 10.4-1 Prototype Vectors and Decision Boundary
• The decision boundary generated by ADALINE falls halfway between the two reference patterns.
• The perceptron rule does not produce a halfway decision boundary, since the perceptron rule stops as soon as the patterns are correctly classified.
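Replaying the iterations (10.4-5)-(10.4-10) confirms the limit (10.4-11). A sketch assuming NumPy, with the same alternating presentation order.

import numpy as np

p = [np.array([1.0, -1.0, -1.0]),    # orange, t = -1
     np.array([1.0,  1.0, -1.0])]    # apple,  t = +1
t = [-1.0, 1.0]
W = np.zeros(3)
alpha = 0.2
for k in range(40):                  # alternate orange, apple, orange, ...
    i = k % 2
    e = t[i] - W @ p[i]              # error for the presented pattern
    W = W + 2 * alpha * e * p[i]     # the update pattern of (10.4-7)/(10.4-10)
print(W)                             # approaches [0, 1, 0], as in (10.4-11)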
10.5 Example on Linear Regression
An ADALINE network is used to determine the parameters $a_0$, $a_1$, and $a_2$ of the quadratic relation $y = a_2x^2 + a_1x + a_0$ when the data of x and y are as follows.
x: -3 -2 -1 0 1 2 3
y: 6 3 2 3 6 11 18
By the LMS algorithm,
$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\mathbf{x} = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$ (10.5-1)
$c = E[t^2] = \frac{1}{7}(6^2 + 3^2 + 2^2 + 3^2 + 6^2 + 11^2 + 18^2) = \frac{539}{7} = 77$ (10.5-2)
(Network diagram: inputs $x$ and $x^2$ with weights $a_1$ and $a_2$, bias $a_0$ on the constant input 1, summed and passed through purelin to give $y$.)
With $\mathbf{z} = [x^2\ \ x\ \ 1]^T$,
$\mathbf{h} = E[t\mathbf{z}] = \frac{1}{7}\left(6\begin{bmatrix} 9 \\ -3 \\ 1 \end{bmatrix} + 3\begin{bmatrix} 4 \\ -2 \\ 1 \end{bmatrix} + 2\begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} + 3\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} + 6\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + 11\begin{bmatrix} 4 \\ 2 \\ 1 \end{bmatrix} + 18\begin{bmatrix} 9 \\ 3 \\ 1 \end{bmatrix}\right) = \frac{1}{7}\begin{bmatrix} 280 \\ 56 \\ 49 \end{bmatrix} = \begin{bmatrix} 40 \\ 8 \\ 7 \end{bmatrix}$ (10.5-3)
Summing $\mathbf{z}\mathbf{z}^T$ over the seven patterns,
$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T] = \frac{1}{7}\begin{bmatrix} 196 & 0 & 28 \\ 0 & 28 & 0 \\ 28 & 0 & 7 \end{bmatrix} = \begin{bmatrix} 28 & 0 & 4 \\ 0 & 4 & 0 \\ 4 & 0 & 1 \end{bmatrix}$ (10.5-4)
$\mathbf{x} = \begin{bmatrix} a_2 \\ a_1 \\ a_0 \end{bmatrix} = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 28 & 0 & 4 \\ 0 & 4 & 0 \\ 4 & 0 & 1 \end{bmatrix}^{-1}\begin{bmatrix} 40 \\ 8 \\ 7 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ (10.5-5)
so the fitted relation is $y = x^2 + 2x + 3$.
By the Widrow-Hoff learning rule with a 0.02 learning rate,
$\mathbf{W}_{new} = \mathbf{W}_{old} + 2\alpha e\,\mathbf{z}^T$ (10.5-6)
Present (-3, 6),
$y = 0(-3)^2 + 0(-3) + 0 = 0$; $e = 6 - 0 = 6$ (10.5-7)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} + 2(0.02)(6)\begin{bmatrix} (-3)^2 \\ -3 \\ 1 \end{bmatrix} = \begin{bmatrix} 2.16 \\ -0.72 \\ 0.24 \end{bmatrix}$ (10.5-8)
Present (-2, 3),
$y = 2.16(-2)^2 - 0.72(-2) + 0.24 = 10.32$; $e = 3 - 10.32 = -7.32$ (10.5-9)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 2.16 \\ -0.72 \\ 0.24 \end{bmatrix} + 2(0.02)(-7.32)\begin{bmatrix} (-2)^2 \\ -2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.99 \\ -0.13 \\ -0.05 \end{bmatrix}$ (10.5-10)
Present (-1, 2),
$y = 0.99(-1)^2 - 0.13(-1) - 0.05 = 1.07$; $e = 2 - 1.07 = 0.93$ (10.5-11)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 0.99 \\ -0.13 \\ -0.05 \end{bmatrix} + 2(0.02)(0.93)\begin{bmatrix} (-1)^2 \\ -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.03 \\ -0.17 \\ -0.01 \end{bmatrix}$ (10.5-12)
Present (0, 3),
$y = 1.03(0)^2 - 0.17(0) - 0.01 = -0.01$; $e = 3 - (-0.01) = 3.01$ (10.5-13)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 1.03 \\ -0.17 \\ -0.01 \end{bmatrix} + 2(0.02)(3.01)\begin{bmatrix} 0^2 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.03 \\ -0.17 \\ 0.11 \end{bmatrix}$ (10.5-14)
Present (1, 6),
$y = 1.03(1)^2 - 0.17(1) + 0.11 = 0.97$; $e = 6 - 0.97 = 5.03$ (10.5-15)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 1.03 \\ -0.17 \\ 0.11 \end{bmatrix} + 2(0.02)(5.03)\begin{bmatrix} 1^2 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.23 \\ 0.03 \\ 0.31 \end{bmatrix}$ (10.5-16)
Present (2, 11),
$y = 1.23(2)^2 + 0.03(2) + 0.31 = 5.29$; $e = 11 - 5.29 = 5.71$ (10.5-17)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 1.23 \\ 0.03 \\ 0.31 \end{bmatrix} + 2(0.02)(5.71)\begin{bmatrix} 2^2 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 2.14 \\ 0.49 \\ 0.54 \end{bmatrix}$ (10.5-18)
Present (3, 18),
$y = 2.14(3)^2 + 0.49(3) + 0.54 = 21.27$; $e = 18 - 21.27 = -3.27$ (10.5-19)
$\begin{bmatrix} w_{1,1} \\ w_{1,2} \\ b \end{bmatrix} = \begin{bmatrix} 2.14 \\ 0.49 \\ 0.54 \end{bmatrix} + 2(0.02)(-3.27)\begin{bmatrix} 3^2 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.96 \\ 0.10 \\ 0.41 \end{bmatrix}$ (10.5-20)
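Both routes to the fit can be checked in a few lines: the analytic solution (10.5-5), and the Widrow-Hoff updates cycled over the data for many passes. A sketch assuming NumPy; the pass count is an illustrative choice.

import numpy as np

xs = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = np.array([6.0, 3.0, 2.0, 3.0, 6.0, 11.0, 18.0])
Z = np.stack([xs**2, xs, np.ones_like(xs)], axis=1)   # z = [x^2, x, 1]

# Analytic: x* = R^{-1} h with R = E[zz^T], h = E[yz], as in (10.5-3)-(10.5-5)
R = Z.T @ Z / len(xs)
h = Z.T @ ys / len(xs)
print(np.linalg.solve(R, h))          # [1, 2, 3] -> y = x^2 + 2x + 3

# Iterative: Widrow-Hoff (10.5-6) with alpha = 0.02, many passes over the data
w = np.zeros(3)
for _ in range(500):
    for z, y in zip(Z, ys):
        e = y - w @ z
        w = w + 2 * 0.02 * e * z
print(w)                              # approaches [1, 2, 3]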