
LECTURE 7: Smoothing of data and the method of least squares

November 6, 2012

1 Introduction

Suppose that the surface tension S of a particular liquid has been measured at eight different temperatures T. The surface tension is known to be a linear function of temperature, but the measured values do not fall exactly on a straight line.

Figure 1: A set of measured data (temperature T versus surface tension S(T)) and the corresponding line fit S = aT + b.

Using statistical methods, we can determine the most probable values of the constants a and b in the equation

S = aT + b


2 Method of least squares

2.1 Linear least squares

An experiment or survey often produces a mass of data which may fall approximately on a straight line. There might also be theoretical reasons for assuming that the underlying function is linear, and the failure of the points to fall precisely on a straight line is due to experimental errors.

Assuming that the data can be described by the relation

y = ax + b

we want to determine the coefficients a and b such that the resulting line represents the data as well as possible. For given values of a and b, the magnitude of the discrepancy or error between the kth data point and the line is

e_k = |a x_k + b − y_k|

The total absolute error for m+1 data points is

e_tot = ∑_{k=0}^{m} |a x_k + b − y_k|

This is a function of a and b, and it would be reasonable to choose a and b so that this function assumes its minimum value.

In practice, it is common to minimize a different error function of a and b:

ϕ(a,b) = ∑_{k=0}^{m} (a x_k + b − y_k)²

This function is suitable because of statistical considerations.

If the errors of the data follow a normal distribution, then minimization of ϕ(a,b) produces the best estimate of a and b.


Finding the minimum of ϕ

The following conditions,

∂ϕ/∂a = 0,   ∂ϕ/∂b = 0

are necessary at the minimum of ϕ(a,b). Taking the derivatives of ϕ gives

∑_{k=0}^{m} 2(a x_k + b − y_k) x_k = 0

∑_{k=0}^{m} 2(a x_k + b − y_k) = 0

These are called the normal equations and can be written as

(∑_{k=0}^{m} x_k²) a + (∑_{k=0}^{m} x_k) b = ∑_{k=0}^{m} y_k x_k

(∑_{k=0}^{m} x_k) a + (m + 1) b = ∑_{k=0}^{m} y_k

The explicit solution of these equations is

a = (1/d) [ (m + 1) ∑_{k=0}^{m} y_k x_k − (∑_{k=0}^{m} x_k)(∑_{k=0}^{m} y_k) ]

b = (1/d) [ (∑_{k=0}^{m} x_k²)(∑_{k=0}^{m} y_k) − (∑_{k=0}^{m} x_k)(∑_{k=0}^{m} y_k x_k) ]

where

d = (m + 1) ∑_{k=0}^{m} x_k² − (∑_{k=0}^{m} x_k)²
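As an illustration, a minimal Matlab sketch of these closed-form formulas is given below; the vector names xk and yk for the m + 1 data points are chosen here only for illustration.

xk_sum  = sum(xk);        yk_sum  = sum(yk);
xx_sum  = sum(xk.^2);     xy_sum  = sum(xk.*yk);
m1 = numel(xk);                        % m + 1 data points
d  = m1*xx_sum - xk_sum^2;
a  = (m1*xy_sum - xk_sum*yk_sum)/d;    % slope
b  = (xx_sum*yk_sum - xk_sum*xy_sum)/d; % intercept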


Example. For the following table of data we get the following system of two equations:

 x    y
 0   68.1
10   67.0
20   66.5
30   65.7
40   64.4
80   61.7
90   61.1
95   60.3

26525 a + 365 b = 22710
  365 a +   8 b = 514.8

The solution is a = −0.079 and b = 67.94. The corresponding value of ϕ(a,b) is 0.3 (in the case of a perfect fit, ϕ(a,b) = 0).

As a check, we can use Matlab to do the least-squares fit and plot the resulting line together with the data points. The following commands can be used:

d = load('example1.dat');         % load data
p = polyfit(d(:,1),d(:,2),1)      % linear least squares fit
x = 0:0.1:100;                    % x range for plot
y = polyval(p,x);                 % evaluation of polynomial
plot(d(:,1),d(:,2),'o',x,y)       % plot the fit and data points
xlabel('x')
ylabel('f(x)')
dd = p(1)*d(:,1) + p(2) - d(:,2);
phi = sum(dd.*dd);                % value of phi(a,b)

Figure 2: Data set and line fit; ϕ(a,b) = 0.3.


2.2 Nonpolynomial example

The method of least squares is not restricted to polynomials or to any specific functional form. Suppose that we want to fit a table of values (x_k, y_k), where k = 0, 1, ..., m, by a function of the form

f = a ln x + b cos x + c e^x

in the least-squares sense. The unknowns are the three coefficients a, b and c.

We consider the function

ϕ(a,b,c) = ∑_{k=0}^{m} (a ln x_k + b cos x_k + c e^(x_k) − y_k)²

and set ∂ϕ/∂a = 0, ∂ϕ/∂b = 0 and ∂ϕ/∂c = 0. This gives us a system of three normal equations.

For the following table of data

  x      y
 0.24   0.23
 0.65  -0.26
 0.95  -1.10
 1.24  -0.45
 1.73   0.27
 2.01   0.10
 2.23  -0.29
 2.52   0.24
 2.77   0.56
 2.99   1.00

the solution is a = −1.04103, b = −1.26132 and c = 0.03073. The corresponding value of ϕ(a,b,c) is 0.92557.

We can use the following Matlab commands to carry out the fitting procedure and to plot the results:

d = load('example2.dat');                   % load data
x = [log(d(:,1)) cos(d(:,1)) exp(d(:,1))];
z = x\d(:,2);                               % least-squares solution of x*z = y
dd = z(1)*log(d(:,1)) + z(2)*cos(d(:,1)) ...
     + z(3)*exp(d(:,1)) - d(:,2);
phi = sum(dd.*dd);                          % value of phi(a,b,c)
xi = 0.24:0.1:2.99;                         % x range for plot
yi = z(1)*log(xi) + z(2)*cos(xi) + z(3)*exp(xi);
plot(d(:,1),d(:,2),'o',xi,yi)
xlabel('x')
ylabel('f(x)')


Figure 3: Data set and the corresponding nonpolynomial least-squares fit.

2.3 Basis functions

The principle of least squares can be extended to general linear families of functions without involving any new ideas. Suppose that the data are thought to conform to a relationship like

y = ∑_{j=0}^{n} c_j g_j(x)

where the basis functions g_0, g_1, ..., g_n are known and held fixed. The coefficients c_0, c_1, ..., c_n are to be determined according to the principle of least squares.

We define the expression

ϕ(c_0, c_1, ..., c_n) = ∑_{k=0}^{m} [ ∑_{j=0}^{n} c_j g_j(x_k) − y_k ]²

which is the sum of the squares of the errors associated with each entry (x_k, y_k) in a given table of data. The coefficients are now selected such that this expression is minimized. We write down the following n + 1 equations as the necessary conditions for the minimum:

∂ϕ/∂c_i = 0   (0 ≤ i ≤ n)

These partial derivatives are given by

∂ϕ/∂c_i = ∑_{k=0}^{m} 2 [ ∑_{j=0}^{n} c_j g_j(x_k) − y_k ] g_i(x_k)   (0 ≤ i ≤ n)

When these are set equal to zero, we obtain the normal equations:

∑_{j=0}^{n} [ ∑_{k=0}^{m} g_i(x_k) g_j(x_k) ] c_j = ∑_{k=0}^{m} y_k g_i(x_k)   (0 ≤ i ≤ n)


The normal equations are linear in c_0, c_1, ..., c_n, and thus, in principle, they can be solved by the method of Gaussian elimination. In practice, care must be taken in choosing the basis functions g_i, since otherwise the normal equations may be very difficult to solve. The set {g_0, g_1, ..., g_n} should be linearly independent, which means that no nontrivial linear combination ∑_{i=0}^{n} c_i g_i can be the zero function.

The basis functions should also be appropriate for the problem at hand. Finally, the set of basis functions should be well conditioned for numerical work (see Lecture 5 for more details on ill-conditioned systems of linear equations).
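To make the general setup concrete, the following minimal Matlab sketch assembles and solves the normal equations for an arbitrary set of basis functions. The names xk, yk and gfun are chosen here for illustration, and the basis shown is only an example:

gfun = {@(x) ones(size(x)), @(x) x, @(x) x.^2};  % example basis g_0, g_1, g_2
G = zeros(numel(xk), numel(gfun));
for j = 1:numel(gfun)
    G(:,j) = gfun{j}(xk(:));                     % G(k,j) = g_{j-1}(x_k)
end
c = (G'*G) \ (G'*yk(:));                         % solve the normal equations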

3 Orthogonal systems and Chebyshev polynomials

3.1 Orthonormal basis functions

We concentrate now on the problem of choosing the set of basis functions {g_0, g_1, ..., g_n} in order to describe the given data using the equation

y = ∑_{j=0}^{n} c_j g_j(x)

where the coefficients c_0, c_1, ..., c_n are determined according to the principle of least squares.

Once the basis functions have been chosen, the least-squares problem can be interpreted as follows: the set of all functions g that can be expressed as linear combinations of g_0, g_1, ..., g_n is a vector space G. In symbols,

G = { g : there exist c_0, c_1, ..., c_n such that g(x) = ∑_{j=0}^{n} c_j g_j(x) }

The function we are looking for in the least-squares problem is thus an element of the vector space G.

The functions g_0, g_1, ..., g_n form a basis for the vector space G if they are not linearly dependent. A given vector space can have many different bases, which may differ drastically in their numerical properties.

What basis should be chosen for numerical work? In the least-squares problem, the main task is to solve the normal equations:

∑_{j=0}^{n} [ ∑_{k=0}^{m} g_i(x_k) g_j(x_k) ] c_j = ∑_{k=0}^{m} y_k g_i(x_k)   (0 ≤ i ≤ n)

The nature of this problem depends on the basis {g_0, g_1, ..., g_n}. We want to be able to solve these equations both easily and accurately.


In an ideal situation, the basis {g0,g1, . . . ,gn} has the property of orthonormality:

∑_{k=0}^{m} g_i(x_k) g_j(x_k) = δ_ij = { 1,  i = j;  0,  i ≠ j }

In this case, the coefficient matrix is the identity matrix and the normal equations simplify dramatically to

c_j = ∑_{k=0}^{m} y_k g_j(x_k)   (0 ≤ j ≤ n)

which is an explicit formula for the coefficients c j.

A procedure known as the Gram-Schmidt process can be used to obtain such an orthonormal basis. In certain situations the effort is justified, but often simpler procedures suffice. Next, we give a quick review of the Gram-Schmidt process and then introduce a simpler alternative.

Gram-Schmidt process

In linear algebra, the Gram-Schmidt process is a method of orthogonalizing a set of vectors in an inner product space, most commonly the Euclidean space R^n.

Orthogonalization in this context means the following: we start with vectors v_1, ..., v_k which are linearly independent, and we want to find mutually orthogonal vectors u_1, ..., u_k which generate the same subspace as the vectors v_1, ..., v_k.

We denote the inner (dot) product of two vectors u and v by (u · v). The Gram-Schmidt process works as follows:

u_1 = v_1
u_2 = v_2 − [(u_1 · v_2)/(u_1 · u_1)] u_1
u_3 = v_3 − [(u_1 · v_3)/(u_1 · u_1)] u_1 − [(u_2 · v_3)/(u_2 · u_2)] u_2
...
u_k = v_k − [(u_1 · v_k)/(u_1 · u_1)] u_1 − [(u_2 · v_k)/(u_2 · u_2)] u_2 − ... − [(u_{k−1} · v_k)/(u_{k−1} · u_{k−1})] u_{k−1}

To check that these formulas work, first compute (u_1 · u_2) by substituting the above formula for u_2: you will get zero. Then use this to compute (u_1 · u_3), again by substituting the formula for u_3: you will get zero. The general proof proceeds by mathematical induction.

Geometrically, this method proceeds as follows: to compute u_i, it projects v_i orthogonally onto the subspace U generated by u_1, ..., u_{i−1}, which is the same as the subspace generated by v_1, ..., v_{i−1}. Then u_i is defined to be the difference between v_i and this projection, which is guaranteed to be orthogonal to all of the vectors in the subspace U.

If one is interested in an orthonormal system u_1, ..., u_k (i.e. the vectors are mutually orthogonal and all have norm 1), then one can divide each u_i by its norm √(u_i · u_i).
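A minimal Matlab sketch of the classical Gram-Schmidt process is given below; it assumes the vectors v_1, ..., v_k are the (linearly independent) columns of a matrix V, and the function name gram_schmidt is chosen here for illustration:

function U = gram_schmidt(V)
    % Classical Gram-Schmidt: orthogonalize the columns of V.
    U = zeros(size(V));
    for i = 1:size(V,2)
        u = V(:,i);
        for j = 1:i-1
            % subtract the projection of v_i onto u_j
            u = u - (U(:,j)'*V(:,i))/(U(:,j)'*U(:,j)) * U(:,j);
        end
        U(:,i) = u;
        % for an orthonormal system, use instead: U(:,i) = u/norm(u);
    end
end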

When performing orthogonalization on a computer, the Householder transformation is usually preferred over the Gram-Schmidt process since it is numerically more stable, i.e. rounding errors tend to have less serious effects. The Householder transformation is used in a frequently applied algorithm for QR decomposition, which means decomposing a matrix A into a much simpler form consisting of an orthogonal matrix Q and an upper triangular matrix R. The QR decomposition can be used to easily solve the least-squares problem Ax = b.


Polynomials as basis functions

We now discuss an alternative method for choosing the set of basis functions in such a way that they are suitable for numerical solution of the least-squares problem. Consider the vector space G that consists of all polynomials of degree ≤ n. The simplest choice would be to use the following n + 1 functions as a basis for G:

g_0(x) = 1,   g_1(x) = x,   g_2(x) = x²,   ...,   g_n(x) = x^n

Using this basis, a typical element of the vector space G has the form

g(x) = ∑_{j=0}^{n} c_j g_j(x) = c_0 + c_1 x + c_2 x² + ... + c_n x^n

This basis, however, is almost always a poor choice for numerical work. The reason is that the simple polynomials x^k are too much alike. It is difficult to determine the coefficients c_j precisely for a function given as a linear combination of the polynomials x^k: g(x) = ∑_{j=0}^{n} c_j x^j. Figure 4 shows a plot of x, x², x³, x⁴ and x⁵.

Figure 4: The polynomials x, x², x³, x⁴ and x⁵.

Chebyshev polynomials

For many purposes, a good basis can be obtained using the Chebyshev polynomials. For simplicity, assume that in our least-squares problem all the data points lie within the interval [−1, 1]. The Chebyshev polynomials for the interval [−1, 1] are given by

T_0(x) = 1,   T_1(x) = x,   T_2(x) = 2x² − 1,
T_3(x) = 4x³ − 3x,   T_4(x) = 8x⁴ − 8x² + 1,   etc.

The formal definition of the Chebyshev polynomials is given by the following recursive formula

T_j(x) = 2x T_{j−1}(x) − T_{j−2}(x)   (j ≥ 2)

with the starting values T_0(x) = 1 and T_1(x) = x.


Alternatively, we can write

T_k(x) = cos(k arccos x)

Figure 5 shows a plot of T_1, T_2, T_3, T_4 and T_5. Note that these functions are quite different from one another and therefore better suited for numerical work.

Figure 5: The Chebyshev polynomials T_1, T_2, T_3, T_4 and T_5.

Nested multiplication algorithm

Linear combinations of Chebyshev polynomials are easy to evaluate because a special nested multiplication algorithm applies. Consider an arbitrary linear combination of T_0, T_1, T_2, ..., T_n:

g(x) = ∑_{j=0}^{n} c_j T_j(x)

The following algorithm can be used to compute g(x) for any x:

w_{n+2} = w_{n+1} = 0
w_j = c_j + 2x w_{j+1} − w_{j+2}   (j = n, n−1, ..., 0)
g(x) = w_0 − x w_1
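A minimal Matlab sketch of this nested multiplication is shown below; the function name chebeval and the argument names are chosen here for illustration:

function g = chebeval(c, x)
    % Evaluate g(x) = sum_{j=0}^{n} c(j+1)*T_j(x) by nested multiplication.
    % c holds the coefficients c_0,...,c_n; x may be a scalar or an array.
    n = length(c) - 1;
    wjp2 = zeros(size(x));                 % w_{j+2}
    wjp1 = zeros(size(x));                 % w_{j+1}
    for j = n:-1:0
        wj = c(j+1) + 2*x.*wjp1 - wjp2;    % w_j = c_j + 2x w_{j+1} - w_{j+2}
        wjp2 = wjp1;
        wjp1 = wj;
    end
    g = wjp1 - x.*wjp2;                    % g(x) = w_0 - x w_1
end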


To see that this really works, we can write down the series for g(x):

g(x) = ∑_{j=0}^{n} c_j T_j(x)
     = ∑_{j=0}^{n} (w_j − 2x w_{j+1} + w_{j+2}) T_j
     = ∑_{j=0}^{n} w_j T_j − 2x ∑_{j=0}^{n} w_{j+1} T_j + ∑_{j=0}^{n} w_{j+2} T_j
     = ∑_{j=0}^{n} w_j T_j − 2x ∑_{j=1}^{n+1} w_j T_{j−1} + ∑_{j=2}^{n+2} w_j T_{j−2}          (w_{n+2} = w_{n+1} = 0)
     = ∑_{j=0}^{n} w_j T_j − 2x ∑_{j=1}^{n} w_j T_{j−1} + ∑_{j=2}^{n} w_j T_{j−2}
     = w_0 T_0 + w_1 T_1 + ∑_{j=2}^{n} w_j T_j − 2x w_1 T_0 − 2x ∑_{j=2}^{n} w_j T_{j−1} + ∑_{j=2}^{n} w_j T_{j−2}          (T_0 = 1, T_1 = x)
     = w_0 + x w_1 − 2x w_1 + ∑_{j=2}^{n} w_j (T_j − 2x T_{j−1} + T_{j−2})          (T_j = 2x T_{j−1} − T_{j−2})
     = w_0 − x w_1

The normal equations should now be reasonably well conditioned if the first few Chebyshev polynomials are used to form the basis. This means that Gaussian elimination with pivoting produces an accurate solution to the normal equations.

If the original data do not satisfy min{x_k} = −1 and max{x_k} = 1 but lie instead in another interval [a, b], then the change of variable

x = (1/2)(b − a) z + (1/2)(a + b)

produces a variable z that traverses [−1,1] as x traverses [a,b].


3.2 Algorithm for polynomial fitting

The following procedure produces a polynomial of degree ≤ n that best fits a given table of values (x_k, y_k) (0 ≤ k ≤ m). Here m is usually much greater than n. The Chebyshev polynomials are used as the basis.

Algorithm.

1. Find the smallest interval [a, b] that contains all the x_k: let a = min{x_k} and b = max{x_k}.

2. Make a transformation to the interval [−1, 1] by defining

   z_k = (2x_k − a − b) / (b − a)

3. Decide the value of n to be used (8 or 10 are large values in this situation).

4. Using the Chebyshev polynomials as the basis, generate the (n+1)×(n+1) system of normal equations (see below for details)

   ∑_{j=0}^{n} [ ∑_{k=0}^{m} T_i(z_k) T_j(z_k) ] c_j = ∑_{k=0}^{m} y_k T_i(z_k)   (0 ≤ i ≤ n)

5. Use an equation-solving routine to solve the normal equations for the coefficients c_0, c_1, ..., c_n in the function

   g(x) = ∑_{j=0}^{n} c_j T_j(x)

6. The polynomial sought is

   g( (2x − a − b) / (b − a) )


The details of step 4 are as follows. Begin by introducing a double-subscripted variable:

t_{jk} = T_j(z_k)   (0 ≤ k ≤ m, 0 ≤ j ≤ n)

The matrix T = (t_{jk}) can be computed efficiently by using the recursive definition of the Chebyshev polynomials, as in the following pseudocode:

real array (t_{jk})_{0:n×0:m}, (z_k)_{0:m}
integer j, k, m, n
for k = 0 to m do
    t_{0k} ← 1
    t_{1k} ← z_k
    for j = 2 to n do
        t_{jk} ← 2 z_k t_{j−1,k} − t_{j−2,k}
    end for
end for

The normal equations have a coefficient matrix A = (a_{ij})_{0:n×0:n} and a right-hand side b = (b_i)_{0:n}. These are given by

real array (a_{ij})_{0:n×0:n}, (b_i)_{0:n}, (t_{jk})_{0:n×0:m}, (y_k)_{0:m}
integer i, j, k, m, n
real s
for i = 0 to n do
    s ← 0
    for k = 0 to m do
        s ← s + y_k t_{ik}
    end for
    b_i ← s
    for j = i to n do
        s ← 0
        for k = 0 to m do
            s ← s + t_{ik} t_{jk}
        end for
        a_{ij} ← s
        a_{ji} ← s
    end for
end for
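The whole fitting procedure of this section can also be summarized in a few lines of Matlab. The sketch below only illustrates the steps above (the function name chebfit_ls and the variable names are chosen here, not taken from the lecture); the returned coefficient vector can be evaluated with the nested multiplication routine sketched earlier, applied to the transformed variable z:

function [c, a, b] = chebfit_ls(xd, yd, n)
    % Least-squares fit with Chebyshev basis T_0,...,T_n (steps 1-5 above).
    a = min(xd);  b = max(xd);
    z = (2*xd(:) - a - b)/(b - a);        % transform data to [-1,1]
    m = length(xd) - 1;
    T = ones(m+1, n+1);                   % T(k+1,j+1) = T_j(z_k)
    if n >= 1, T(:,2) = z; end
    for j = 2:n
        T(:,j+1) = 2*z.*T(:,j) - T(:,j-1);
    end
    A   = T'*T;                           % normal-equations matrix
    rhs = T'*yd(:);                       % right-hand side
    c   = A\rhs;                          % coefficients c_0,...,c_n
end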


3.3 Smoothing of data: Polynomial regression

An important application of the least-squares procedure is smoothing of data, i.e. removing experimental errors to whatever degree possible.

If it is known that the data should conform to a specific type of function (e.g. straight line, polynomial, exponential, etc.), then the least-squares procedure can be used to compute any unknown parameters in the function. On the other hand, if the specific form of the function is not the important factor but one only wishes to smooth the data by fitting them to any convenient function, then polynomials of increasing degree can be used to obtain a reasonable balance between a good fit and smoothness.

Example. Suppose that we have the following set of 20 data points.

Figure 6: Example data set.

Of course, we can determine a polynomial of degree 19 that passes through all the data points. But if the data points contain experimental errors, then it is much more desirable to find a function that fits the data approximately in the least-squares sense. A lower-order polynomial can be used for the fitting procedure. In statistical terms, this is called curvilinear regression.

Good software libraries contain a procedure for the polynomial fitting of empirical data in the least-squares sense. Such codes are highly optimized to give high precision with minimum computing effort.

For example, the following Matlab commands fit two polynomials, of degrees 8 and 13, to the given set of 20 data points.

d = load('reg_ex1.dat');          % load data
p8 = polyfit(d(:,1),d(:,2),8);    % polynomial of degree 8
p13 = polyfit(d(:,1),d(:,2),13);  % polynomial of degree 13
x = -1:0.01:1;                    % x range for plot
y8 = polyval(p8,x);               % evaluation of polynomials
y13 = polyval(p13,x);
plot(d(:,1),d(:,2),'ok',x,y8,'-b',x,y13,'--r')
xlabel('x')
ylabel('y')
axis([-1 1 -10 18])


Figure 7: Polynomial fits of degree 8 and 13 to a set of 20 data points.

The lower-degree polynomial produces a much smoother fit without any oscillations. A similar fit can be obtained using the techniques which we have already discussed: e.g. using the Chebyshev polynomials as a basis, we can set up and solve the normal equations to obtain polynomial fits such as those shown in the figure above. This, however, is not a very well optimized approach.

Polynomial regression

An efficient procedure for polynomial regression is now explained. This method uses a system of orthogonal polynomials tailor-made for the problem in question. We begin with a given table of experimental values:

x | x_0  x_1  ...  x_m
y | y_0  y_1  ...  y_m

The goal is to replace the table by a suitable polynomial of modest degree with the errors in the data suppressed. We do not know what degree of polynomial should be used.

Suppose that the trend of the table is represented by a polynomial

p_N(x) = ∑_{i=0}^{N} a_i x^i

and that the given tabular values obey the equation

y_i = p_N(x_i) + ε_i   (0 ≤ i ≤ m)

Here ε_i represents an observational error present in y_i. A further reasonable hypothesis is that these errors are independent random variables that are normally distributed.

We have already discussed a polynomial fitting procedure which can be used to obtain p_n by the method of least squares. Using this approach, we can set up a system of normal equations to determine the coefficients of p_n. Once these are known, a quantity called the variance can be computed from the formula

σ_n² = (1 / (m − n)) ∑_{i=0}^{m} [y_i − p_n(x_i)]²   (m > n)


We know from statistics that if the trend of the table is truly a polynomial of degree N (but infected by noise), then

σ_0² > σ_1² > ... > σ_N² = σ_{N+1}² = σ_{N+2}² = ... = σ_{m−1}²

For the case in which N is not known, the following strategy can be used: compute σ_0², σ_1², ... in succession. As long as these are decreasing, continue the calculation. When an integer N is reached for which σ_N² ≈ σ_{N+1}² ≈ σ_{N+2}² ≈ ..., stop and declare p_N to be the polynomial sought.

If we were to compute σ_0², σ_1², ... directly from the definition of the variance given above, this would involve determining all the polynomials p_0, p_1, .... Instead, we can use a more intelligent procedure which avoids the determination of all but the one desired polynomial. This procedure is described next.

Inner product

We begin with a definition of a quantity called the inner product of two functions f and g. Assume in the following that the abscissas x_i are distinct and held fixed. If f and g are two functions whose domains include the points {x_0, x_1, ..., x_m}, then the following notation is used for the inner product of f and g:

〈f, g〉 = ∑_{i=0}^{m} f(x_i) g(x_i)

Much of the discussion does not depend on the exact form of the inner product but only on certain of its properties.

An inner product 〈 , 〉 has the following properties:

1. 〈f, g〉 = 〈g, f〉.

2. 〈f, f〉 > 0 unless f(x_i) = 0 for all i.

3. 〈a f, g〉 = a 〈f, g〉 for any a ∈ R.

4. 〈f, g + h〉 = 〈f, g〉 + 〈f, h〉.

A set of functions is said to be orthogonal if 〈f, g〉 = 0 for any two different functions f and g in that set.


Recursive formula for generating an orthogonal set of polynomials

An orthogonal set of polynomials can be generated recursively by the following formulas:

q_0(x) = 1
q_1(x) = x − α_0
q_{n+1}(x) = x q_n(x) − α_n q_n(x) − β_n q_{n−1}(x)   (n ≥ 1)

where

α_n = 〈x q_n, q_n〉 / 〈q_n, q_n〉,   β_n = 〈x q_n, q_{n−1}〉 / 〈q_{n−1}, q_{n−1}〉

Here x q_n denotes the function whose value at x is x q_n(x).

Proof. In order to prove that the definition above leads to an orthogonal system, we can first consider the following case:

〈q_1, q_0〉 = 〈x − α_0, q_0〉 = 〈x q_0 − α_0 q_0, q_0〉 = 〈x q_0, q_0〉 − α_0 〈q_0, q_0〉 = 0

Another similar case is

〈q_2, q_1〉 = 〈x q_1 − α_1 q_1 − β_1 q_0, q_1〉 = 〈x q_1, q_1〉 − α_1 〈q_1, q_1〉 − β_1 〈q_0, q_1〉 = 0

The next step of the proof would be to verify that 〈q_2, q_0〉 = 0. Finally, an inductive argument can be used to complete the proof. (These two remaining steps are left as an exercise for the student.)

Linear combination of orthogonal polynomials

The system of orthogonal polynomials {q_0, q_1, ..., q_{m−1}} is a basis for the vector space Π_{m−1} of all polynomials of degree at most m − 1. If it is desired to express a given polynomial p_n (n ≤ m − 1) as a linear combination of {q_0, q_1, ..., q_n}, this can be done as follows.

Set

p = ∑_{i=0}^{n} a_i q_i

Only one of the summands on the right-hand side, namely a_n q_n, contains the highest term x^n. The coefficient a_n is chosen in such a way that the x^n term on the right is equal to the corresponding term in p. We write

p − a_n q_n = ∑_{i=0}^{n−1} a_i q_i

We can now choose a_{n−1} in the same way, making the x^{n−1} terms equal on both sides. By continuing in this way, we can discover the unique values for all the coefficients a_i. This establishes that {q_0, q_1, ..., q_n} is a basis for Π_n for n = 0, 1, 2, ..., m − 1.


Another way of determining the coefficients a_i is to take the inner product of both sides of the equation for the given polynomial with q_j. This gives

〈p, q_j〉 = ∑_{i=0}^{n} a_i 〈q_i, q_j〉   (0 ≤ j ≤ n)

Since the set q_0, q_1, ..., q_n is orthogonal, 〈q_i, q_j〉 = 0 for i ≠ j. Hence,

〈p, q_j〉 = a_j 〈q_j, q_j〉

and consequently

a_j = 〈p, q_j〉 / 〈q_j, q_j〉

Solution of the least-squares problem

Now we return to the least-squares problem. Let F be a function that we wish to fit by a polynomial p_n of degree n. According to the principle of least squares, we want to find the p_n that minimizes the expression

∑_{i=0}^{m} [F(x_i) − p_n(x_i)]²

The solution is given by

p_n(x) = ∑_{i=0}^{n} c_i q_i(x),   c_i = 〈F, q_i〉 / 〈q_i, q_i〉

Note especially that c_i does not depend on n. This means that the various polynomials p_0, p_1, ... can be obtained by truncating one series, namely ∑_{i=0}^{m−1} c_i q_i.

Proof. To see that the solution given above really is the correct solution to our problem, we return to the normal equations. The basis functions are now q_0, q_1, ..., q_n and the normal equations are given by

∑_{j=0}^{n} [ ∑_{k=0}^{m} q_i(x_k) q_j(x_k) ] c_j = ∑_{k=0}^{m} y_k q_i(x_k)   (0 ≤ i ≤ n)

Using the inner product notation, we get

∑_{j=0}^{n} 〈q_i, q_j〉 c_j = 〈F, q_i〉   (0 ≤ i ≤ n)

where F is some function such that F(x_k) = y_k for 0 ≤ k ≤ m (these are the given data points).

Next, the orthogonality property, 〈q_i, q_j〉 = 0 when i ≠ j, is applied. This gives

〈q_i, q_i〉 c_i = 〈F, q_i〉   (0 ≤ i ≤ n)

Thus we get

c_i = 〈F, q_i〉 / 〈q_i, q_i〉

which is the same as given above, and thus the solution is correct.


Algorithm for calculating the variances

Now we return to the variance numbers σ_0², σ_1², .... Remember that our task was to find an efficient procedure for calculating these in a successive manner until an integer N is reached for which σ_N² ≈ σ_{N+1}² ≈ σ_{N+2}² ≈ .... Then p_N is the polynomial we are seeking.

The variance numbers can in fact be computed easily. First, an observation: the set {q_0, q_1, ..., q_n, F − p_n} is orthogonal. To check this, verify that 〈F − p_n, q_i〉 = 0 for 0 ≤ i ≤ n. Since p_n is a linear combination of q_0, q_1, ..., q_n, it follows that

〈F − p_n, p_n〉 = 0

Now recall the definition of the variance σ_n²:

σ_n² = ρ_n / (m − n),   ρ_n = ∑_{i=0}^{m} [y_i − p_n(x_i)]²

These quantities ρ_n can be written in another way, using F(x_i) = y_i and the definition of the inner product:

ρ_n = ∑_{i=0}^{m} [y_i − p_n(x_i)]²
    = 〈F − p_n, F − p_n〉
    = 〈F − p_n, F〉                              [since 〈F − p_n, p_n〉 = 0]
    = 〈F, F〉 − 〈F, p_n〉
    = 〈F, F〉 − ∑_{i=0}^{n} c_i 〈F, q_i〉
    = 〈F, F〉 − ∑_{i=0}^{n} 〈F, q_i〉² / 〈q_i, q_i〉

The numbers ρ_0, ρ_1, ... can now be generated recursively by the algorithm

ρ_0 = 〈F, F〉 − 〈F, q_0〉² / 〈q_0, q_0〉
ρ_n = ρ_{n−1} − 〈F, q_n〉² / 〈q_n, q_n〉   (n ≥ 1)


3.4 Summary: Algorithm for polynomial regression

We now have all the necessary ingredients for constructing an efficient algorithm for polynomial regression. As input, we need a table of data values {x_i, F(x_i)} (0 ≤ i ≤ m) which are assumed to contain experimental errors. The underlying trend of the data is assumed to conform to some functional form. The objective of polynomial regression is to reduce the errors by fitting the data with a polynomial of modest degree. Thus the most important assumption we are making is that the underlying function can be approximated well by a polynomial.

Constructing a polynomial regression procedure:

• Write a procedure for calculating inner products. The following segment of pseudocode can be used for this:

  real array (f_i)_{0:m}, (g_i)_{0:m}
  sum ← 0
  for i = 0 to m do
      sum ← sum + f_i g_i
  end for

• Input to the polynomial regression procedure:
  m + 1 = number of data points
  (x_{0:m}) = array containing the x values of the data
  (F_{0:m}) = array containing the y values of the data
  (α_{0:m}) = array for the α values of the basis polynomials
  (β_{0:m}) = array for the β values of the basis polynomials
  (c_{0:m}) = array for the coefficients of the sought polynomial p_n

• Allocate memory for the following arrays:
  (σ²_{0:m}) = values of the variances σ_i²
  (qOld_{0:m}) = values of the basis polynomial q_{i−2}
  (qPrev_{0:m}) = values of the basis polynomial q_{i−1}
  (qCurr_{0:m}) = values of the basis polynomial q_i
  (xq_{0:m}) = help array for values of x·q_i(x) (useful in C, not needed in Fortran)

• Two variables are needed for the values of ρ_i and ρ_{i−1}.

• Calculate the following initial values:
  the basis functions q_0 and q_1 (stored in the arrays qOld and qPrev),
  the value of α_0,
  the values of ρ_0 and ρ_1,
  the values of σ_0² and σ_1²,
  the values of c_0 and c_1.


• Set n = 1 (degree of polynomial).

• Do a loop to calculate ρ_2, ρ_3, ... in succession to obtain σ_2², σ_3², .... Continue until σ_{n−1}² ≈ σ_n².

• In practice, use two stopping criteria within the loop: stop if σ_n² > σ_{n−1}², or if |σ_{n−1}² − σ_n²| < ε (where e.g. ε = 0.1).

• In the end, print out the values of α_i, β_i, σ_i² and c_i for 0 ≤ i ≤ n.

• Finally, the polynomial p_n can be evaluated at any point x = t in the interval from min{x_i} to max{x_i} by the formula

  p_n(t) = ∑_{i=0}^{n} c_i q_i(t)

  where the coefficients c_i are now known. (A Matlab sketch of the whole procedure is given after this list.)
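The Matlab sketch below collects the above steps into one routine. It is only an outline under the stated assumptions (the function name polyreg and all variable names are chosen here for illustration); x and y are the data vectors and tol plays the role of the stopping tolerance ε:

function [c, alpha, beta, sigma2, n] = polyreg(x, y, tol)
    % Polynomial regression with the orthogonal polynomials q_i defined above.
    x = x(:);  y = y(:);
    m = numel(x) - 1;
    ip = @(f,g) sum(f.*g);                        % discrete inner product <f,g>
    qPrev = ones(m+1,1);                          % q_0 at the data points
    alpha(1) = ip(x.*qPrev,qPrev)/ip(qPrev,qPrev);
    qCurr = x - alpha(1);                         % q_1 at the data points
    beta(1) = 0;
    c(1) = ip(y,qPrev)/ip(qPrev,qPrev);
    c(2) = ip(y,qCurr)/ip(qCurr,qCurr);
    rho = ip(y,y) - c(1)^2*ip(qPrev,qPrev);       % rho_0
    sigma2(1) = rho/m;                            % sigma_0^2
    rho = rho - c(2)^2*ip(qCurr,qCurr);           % rho_1
    sigma2(2) = rho/(m-1);                        % sigma_1^2
    n = 1;
    while n < m - 1
        alpha(n+1) = ip(x.*qCurr,qCurr)/ip(qCurr,qCurr);
        beta(n+1)  = ip(x.*qCurr,qPrev)/ip(qPrev,qPrev);
        qNew = (x - alpha(n+1)).*qCurr - beta(n+1)*qPrev;   % q_{n+1}
        c(n+2) = ip(y,qNew)/ip(qNew,qNew);
        rho = rho - c(n+2)^2*ip(qNew,qNew);                 % rho_{n+1}
        sigma2(n+2) = rho/(m-n-1);                          % sigma_{n+1}^2
        if sigma2(n+2) > sigma2(n+1) || abs(sigma2(n+1)-sigma2(n+2)) < tol
            break                                 % variances have levelled off
        end
        qPrev = qCurr;  qCurr = qNew;  n = n + 1;
    end
end

When the loop stops, p_n with the coefficients c(1:n+1) is the polynomial declared as the answer; it can be evaluated at any point with the three-term recurrence for the q_i, as described in the last bullet above.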


Example: Polynomial regression

Suppose that we have the following set of 20 data points. The trend of the data suggests that the underlying function could be a polynomial.

Figure 8: Example data set (measured data).

Using a polynomial regression program, we can try to smooth the data by fitting a low-order polynomial. The assumption which we make here is that the fluctuations are indeed due to experimental errors and not true variations of the underlying function.

The program gives the following output:

i    alpha        beta         variance     c
0   -0.105193     0.000000    22.009486    1.772073
1    0.005345    36.715295     5.943165    0.667597
2   -0.138884    28.731086     3.722647    0.046650
3    0.195848    28.251617     0.319753    0.010164

Thus the resulting polynomial pn is of degree three (n = 3).

Figure 9 shows a graph of the data and the polynomial p_3 evaluated at 100 points in the interval [−10, 10].

Figure 9: Data points (0.01x³ + 0.05x² + 0.5 cos(x) + noise) and the polynomial p_3 obtained by polynomial regression.


The polynomial seems to be a nice smooth fit which represents the data well. However, if the small fluctuations in the data points are not just random noise but due to the true form of the underlying function, then these features are lost in the fitting procedure. Figure 10 shows what the graph of the true underlying function looks like (the data points were obtained by taking the values of this function at various points and adding a small noise term).

Figure 10: The true underlying function f(x) = 0.01x³ + 0.05x² + 0.5 cos(x) and the data points with noise.

Example 2: What can go wrong

In the previous example, the overall trend of the data could be described well by a third-order polynomial. The governing assumption is that all deviations in the data points are due to experimental errors (random noise). If the underlying function is in fact more complicated and cannot be well approximated by a low-order polynomial, then the smoothing of the data may result in an approximation which is an oversimplification of the underlying function. (E.g. in the previous example, the oscillations due to the cosine term were lost in the fitting procedure.)

The polynomial regression procedure may also fail in the attempt to find a smooth polynomial which would give a good approximation of the data. Figure 11 shows an example of such a case: the trend of the data resembles that of a third-order polynomial, but the regression program has given a linear fit.

Figure 11: Example of a problematic case: data generated from 0.01x³ + 0.5 cos(x) + noise, and the linear polynomial fit returned by the program.


The reason is that a second-order polynomial does not approximate the data any better (the variances are almost equal), so the program stops and does not continue to higher-order polynomials. In such a case, we can of course modify the code by hand and calculate the fit using higher-order polynomials. We notice that for n = 3 we obtain an excellent fit.

Figure 12: Data points and polynomial fits of degree 1, 2 and 3; the degree-3 polynomial gives an excellent fit.

The corresponding output values are:

i    alpha        beta         variance     c
0   -0.105193     0.000000    19.634211   -0.008996
1    0.005345    36.715295     3.484323    0.666793
2   -0.138884    28.731086     3.676964   -0.000892

The first- and second-degree fits are almost equal (the second-degree polynomial is very close to a straight line). Figure 13 shows the difference between the first- and second-order polynomials.

Figure 13: Difference between the first- and second-order polynomials.


4 Other examples of the least-squares principle

4.1 Solving inconsistent systems of linear equations

As another application of the least-squares principle, we can consider solving an inconsistent system of linear equations of the form

∑_{j=0}^{n} a_{kj} x_j = b_k   (0 ≤ k ≤ m)

in which m > n. Thus we have m + 1 equations but only n + 1 unknowns. In matrix form, the system is written as

[ a_00  a_01  a_02  a_03  ...  a_0n ]             [ b_0 ]
[ a_10  a_11  a_12  a_13  ...  a_1n ]  [ x_0 ]    [ b_1 ]
[ a_20  a_21  a_22  a_23  ...  a_2n ]  [ x_1 ]    [ b_2 ]
[ a_30  a_31  a_32  a_33  ...  a_3n ]  [ x_2 ]    [ b_3 ]
[  ...   ...   ...   ...        ... ]  [ ... ]  = [ ... ]
[ a_n0  a_n1  a_n2  a_n3  ...  a_nn ]  [ x_n ]    [ b_n ]
[  ...   ...   ...   ...        ... ]             [ ... ]
[ a_m0  a_m1  a_m2  a_m3  ...  a_mn ]             [ b_m ]

If a given set of values (x_0, x_1, ..., x_n) is substituted into the equations, we obtain a residual vector. The kth element of this vector is termed the kth residual, and it describes the discrepancy between the two sides of the kth equation. Ideally, all the residuals should be zero.

If it is not possible to select (x_0, x_1, ..., x_n) such that all the residuals are zero, the system is called inconsistent or incompatible. In this case, an alternative is to minimize the residuals. This means minimizing the expression

ϕ(x_0, x_1, ..., x_n) = ∑_{k=0}^{m} ( ∑_{j=0}^{n} a_{kj} x_j − b_k )²

by making an appropriate choice of (x0,x1, . . . ,xn).

We take the partial derivatives with respect to x_i and set them equal to zero. This gives us the normal equations

∑_{j=0}^{n} ( ∑_{k=0}^{m} a_{ki} a_{kj} ) x_j = ∑_{k=0}^{m} b_k a_{ki}   (0 ≤ i ≤ n)

This is a linear system of n + 1 equations involving the n + 1 unknowns x_0, x_1, ..., x_n. It can be shown that this system is consistent, provided that the column vectors in the original coefficient array are linearly independent. This system can be solved, for example, by Gaussian elimination. The solution is then an approximate solution of the original system in the least-squares sense.
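In Matlab this could look as follows (a minimal sketch; A and b are assumed to hold the coefficient matrix and right-hand side):

x_ls = (A'*A)\(A'*b);   % form and solve the normal equations
% Equivalently, x_ls = A\b computes the same least-squares solution via a
% QR factorization, which is numerically preferable for ill-conditioned A.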


4.2 Weight function

Another important application of the least-squares principle occurs in fitting or approximating functions on intervals rather than on discrete sets of points.

For example, a given function f defined on an interval [a, b] may have to be approximated by a function like

g(x) = ∑_{j=0}^{n} c_j g_j(x)

In such a case, we attempt to minimize the expression

ϕ(c_0, c_1, ..., c_n) = ∫_a^b [g(x) − f(x)]² dx

by choosing the coefficients appropriately.

In some applications, it is desirable to force the functions f and g into better agreement on certain parts of the interval than elsewhere. For this purpose, we can include a positive weight function w(x) which can emphasize certain parts of the interval. If all parts are equally important, then w(x) = 1. The resulting expression is

ϕ(c_0, c_1, ..., c_n) = ∫_a^b [g(x) − f(x)]² w(x) dx

The minimum of this expression is again sought by differentiating with respect to c_i. The result is the following system of normal equations:

∑_{j=0}^{n} [ ∫_a^b g_i(x) g_j(x) w(x) dx ] c_j = ∫_a^b f(x) g_i(x) w(x) dx   (0 ≤ i ≤ n)

This is a system of n + 1 equations involving the n + 1 unknowns c_0, c_1, ..., c_n, and it can be solved by Gaussian elimination.
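As a sketch, the integrals in these normal equations can be approximated numerically; the Matlab fragment below (illustrative names only) uses a dense grid and the trapezoidal rule, assuming f and w are function handles and g is a cell array of basis-function handles on [a, b]:

t = linspace(a, b, 2000).';                 % quadrature grid on [a,b]
nb = numel(g);
A = zeros(nb);  rhs = zeros(nb,1);
for i = 1:nb
    for j = 1:nb
        A(i,j) = trapz(t, g{i}(t).*g{j}(t).*w(t));   % integral of g_i g_j w
    end
    rhs(i) = trapz(t, f(t).*g{i}(t).*w(t));          % integral of f g_i w
end
c = A\rhs;                                  % coefficients c_0,...,c_n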

4.3 Nonlinear least squares

As an example of how the least-squares principle is applied in a nonlinear case, suppose that a table of points (x_k, y_k) is to be fitted by a function of the form

y = e^(cx)

The least-squares approach leads to the minimization of the function

φ(c) = ∑_{k=0}^{m} (e^(c x_k) − y_k)²

Differentiating with respect to c gives

∂φ/∂c = ∑_{k=0}^{m} 2 (e^(c x_k) − y_k) e^(c x_k) x_k = 0

This equation is nonlinear in c. We could consider using a root-finding method such as Newton's method or the secant method (introduced in Lecture 3). However, this is not a very safe approach, because there can be multiple roots in the normal equation. On the other hand, we could think about a direct minimization of the function φ, but this may also turn out to be difficult if there are local minima in φ.

These difficulties are typical in nonlinear least-squares problems. Consequently, other methods of curve fitting are often preferred if the unknown parameters do not occur linearly in the problem.

In this particular case, we can linearize the problem by the change of variables z = ln y and by fitting a function of the form

z = cx

The minimization task now involves the function

φ(c) = ∑_{k=0}^{m} (c x_k − z_k)²   (z_k = ln y_k)

This is easily solved and leads to

c = ( ∑_{k=0}^{m} z_k x_k ) / ( ∑_{k=0}^{m} x_k² )

Note that this value of c is not the solution of the original least-squares problem, but it may be satisfactory, depending on the application.
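A minimal Matlab sketch of this linearized fit (assuming the data are in vectors xk and yk with all yk > 0 so that the logarithm is defined):

zk = log(yk);                     % change of variables z = ln y
c  = sum(zk.*xk)/sum(xk.^2);      % closed-form least-squares slope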

Example. Figure 14 shows a set of data which seems to follow a relation of the form

y = e^(cx)

Figure 14: Example data set.


Let us try to linearize the problem by setting z_i = ln(y_i) and using the least-squares method to fit a straight line through the set {x_i, z_i}.

Figure 15: Linearized fit (x versus ln y); the slope of the fitted line is c = 0.4951.

The result is shown in Figure 15. The slope of the line is c = 0.4951.

This problem is sufficiently simple that we can also use a nonlinear least-squares method to solve the original problem directly. For example, the Matlab function nlinfit can be used for a nonlinear least-squares fit.

Figure 16 shows a comparison of the two fits, one obtained by linearizing the problem and the other by solving the nonlinear problem directly. The two values of the unknown coefficient are c_1 = 0.4951 (from linearization) and c_2 = 0.4999 (nonlinear).

Figure 16: Comparison of the solutions obtained by linearization (c_1 = 0.4951) and by a direct nonlinear fit (c_2 = 0.4999) to data generated from exp(0.5x) + noise.

The difference between the two curves is not very large, so for many applications the solution of the linearized model is sufficient.

Note that the data points were obtained by taking 11 equidistant points from the curve f(x) = exp(0.5x) and adding random noise (distributed uniformly in [−1, 1]). Without any noise in the data, the linearized solution gives exactly c = 0.5.


4.4 Linear / nonlinear least squares

As a final example, we consider a case which combines elements of the linear and nonlinear theory. Suppose that the given data seem to follow a relationship like

y = a sin(bx)

Can the least-squares principle be used in this case to find the appropriate values of a and b?

Here the parameter b enters the relationship in a nonlinear way, which creates some difficulty. According to the principle of least squares, the parameters should be chosen such that they minimize the function

φ(a,b) = ∑_{k=0}^{m} [a sin(b x_k) − y_k]²

The minimum is again sought by differentiating this expression with respect to a and b, thus giving

∑_{k=0}^{m} 2 [a sin(b x_k) − y_k] sin(b x_k) = 0

∑_{k=0}^{m} 2 [a sin(b x_k) − y_k] a x_k cos(b x_k) = 0

If we knew b, then a could be obtained easily from either of these equations. The correct value of b is such that the two corresponding values of a obtained from these equations are equal. So we should solve both of these equations with respect to a and set the results equal. This gives

( ∑_{k=0}^{m} y_k sin(b x_k) ) / ( ∑_{k=0}^{m} sin²(b x_k) ) = ( ∑_{k=0}^{m} x_k y_k cos(b x_k) ) / ( ∑_{k=0}^{m} x_k sin(b x_k) cos(b x_k) )

This can now be solved for b using, for example, the bisection or the secant method. Then a is obtained by evaluating either side of this equation.
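A minimal Matlab sketch of this approach is given below; the data vectors xk, yk and the initial guess b0 for the root search are assumptions made here for illustration:

aEq1 = @(b) sum(yk.*sin(b*xk))./sum(sin(b*xk).^2);        % a from the first equation
aEq2 = @(b) sum(xk.*yk.*cos(b*xk))./sum(xk.*sin(b*xk).*cos(b*xk));  % a from the second
b = fzero(@(b) aEq1(b) - aEq2(b), b0);    % root of the difference of the two sides
a = aEq1(b);                              % evaluate either side to obtain a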
