Correlation and Regression Analysis

• Many engineering design and analysis problems involve factors that are interrelated and dependent. E.g., (1) runoff volume and rainfall; (2) evaporation, temperature, and wind speed; (3) peak discharge, drainage area, and rainfall intensity; (4) crop yield, irrigated water, and fertilizer.
• Due to the inherent complexity of system behavior and a lack of full understanding of the processes involved, the relationships among the various relevant factors or variables are established empirically or semi-empirically.
• Regression analysis is a useful and widely used statistical tool for investigating the relationship between two or more variables that are related in a non-deterministic fashion.
• If a variable Y is related to several variables X1, X2, …, XK, their relationship can, in general, be expressed as

Y = g(X1, X2, …, XK)

where g(·) = general expression for a function;
Y = dependent (or response) variable;
X1, X2, …, XK = independent (or explanatory) variables.
Correlation

• When a problem involves two dependent random variables, the degree of linear dependence between the two can be measured by the correlation coefficient ρ(X,Y), which is defined as

ρ(X,Y) = Cov(X,Y) / (σX σY)

where Cov(X,Y) is the covariance between random variables X and Y, defined as

Cov(X,Y) = E[(X − μX)(Y − μY)]

with −σX σY ≤ Cov(X,Y) ≤ σX σY and −1 ≤ ρ(X,Y) ≤ 1.
• Various correlation coefficients have been developed in statistics for measuring the degree of association between random variables. The one defined above is called the Pearson product-moment correlation coefficient, or simply the correlation coefficient.
• If the two random variables X and Y are independent, then ρ(X,Y) = Cov(X,Y) = 0. However, the reverse statement is not necessarily true.
Cases of Correlation

[Figure: four scatter-plot panels]
– Perfectly linearly correlated in opposite direction
– Strongly and positively correlated in linear fashion
– Perfectly correlated in nonlinear fashion, but uncorrelated linearly
– Uncorrelated in linear fashion
Calculation of Correlation Coefficient

• Given a set of n paired sample observations of two random variables (xi, yi), the sample correlation coefficient (r) can be calculated as

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]
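The sample correlation coefficient above can be transcribed directly into code. A minimal sketch in plain Python (the function name `sample_correlation` and the data are my own, chosen for illustration; the y-values roughly follow y = 2x, so r should come out close to +1):

```python
# Direct transcription of
#   r = sum((xi - xbar)(yi - ybar)) / sqrt(sum((xi - xbar)^2) * sum((yi - ybar)^2))
def sample_correlation(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data, roughly y = 2x
print(round(sample_correlation(xs, ys), 4))  # near +1 for this data
```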
Auto-correlation

• Consider the following daily stream flows (in 1000 m³) in June 2001 at Chung Mei Upper Station (610 ha), located upstream of a river feeding the Plover Cove Reservoir. Determine the 1-day auto-correlation coefficient, i.e., ρ(Qt, Qt+1).
• 29 pairs: {(Qt, Qt+1)} = {(Q1, Q2), (Q2, Q3), …, (Q29, Q30)}.

Day (t)  Flow Q(t)   Day (t)  Flow Q(t)   Day (t)  Flow Q(t)
1        8.35        11       313.89      21       20.06
2        6.78        12       480.88      22       17.52
3        6.32        13       151.28      23       116.13
4        17.36       14       83.92       24       68.25
5        191.62      15       44.58       25       280.22
6        82.33       16       36.58       26       347.53
7        524.45      17       33.65       27       771.30
8        196.77      18       26.39       28       124.20
9        785.09      19       22.98       29       58.00
10       562.05      20       21.92       30       44.08

• Relevant sample statistics: n = 29; Q̄t = 186.22; SQt = 230.06; Q̄t+1 = 187.45; SQt+1 = 229.17.
• The 1-day auto-correlation is 0.439.
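The lag-1 coefficient can be checked numerically. A sketch in plain Python using the 30 daily flows from the table (the slide reports 0.439; the exact value depends slightly on which autocorrelation estimator is used, so a straightforward Pearson r on the 29 pairs lands within a few hundredths of it):

```python
# June 2001 daily flows (1000 m^3) at Chung Mei Upper, days 1-30.
flows = [8.35, 6.78, 6.32, 17.36, 191.62, 82.33, 524.45, 196.77, 785.09,
         562.05, 313.89, 480.88, 151.28, 83.92, 44.58, 36.58, 33.65,
         26.39, 22.98, 21.92, 20.06, 17.52, 116.13, 68.25, 280.22,
         347.53, 771.30, 124.20, 58.00, 44.08]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Lag-1 pairs: (Q1, Q2), (Q2, Q3), ..., (Q29, Q30)
qt, qt1 = flows[:-1], flows[1:]
print(round(pearson_r(qt, qt1), 3))
```

Running this also reproduces the sample statistics on the slide (Q̄t = 186.22, Q̄t+1 = 187.45).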
[Figure: Chung Mei Upper daily flow, June 2001 — Day (1–30) vs. Flow (1000 cubic meters)]

[Figure: Autocorrelation for June 2001 daily flows at Chung Mei Upper, HK — autocorrelation vs. time lag (days)]

[Figure: scatter plot of Q(t) vs. Q(t+1), both in 1000 m³]
Regression Models

• Due to the presence of uncertainties, a deterministic functional relationship generally is not very appropriate or realistic.
• The deterministic model form can be modified to account for uncertainties in the model as

Y = g(X1, X2, …, XK) + ε

where ε = model error term with E(ε) = 0 and Var(ε) = σ².
• In engineering applications, functional forms commonly used for establishing empirical relationships are
– Additive: Y = β0 + β1X1 + β2X2 + … + βKXK + ε
– Multiplicative: Y = β0 X1^β1 X2^β2 … XK^βK
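The multiplicative form becomes additive after taking logarithms (ln Y = ln β0 + β1 ln X1 + …), so the same linear LS machinery applies to the log-transformed data. A minimal one-variable sketch with made-up data generated from y = 3·x^0.5:

```python
import math

# Made-up data following y = 3 * x^0.5 exactly, so the log-space fit
# should recover intercept exp(ln 3) = 3 and slope 0.5.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 0.5 for x in xs]

# Log-transform, then fit a straight line by ordinary least squares.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
n = len(lx)
xbar, ybar = sum(lx) / n, sum(ly) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(lx, ly))
      / sum((x - xbar) ** 2 for x in lx))
b0 = math.exp(ybar - b1 * xbar)  # back-transform the intercept
print(round(b0, 6), round(b1, 6))  # recovers 3.0 and 0.5
```

Note that LS on logged data minimizes errors in log space, not in the original units, which is a common and accepted approximation for this model form.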
Least Square Method
Suppose that there are n pairs of data, {(xi, yi)}, i = 1, 2, …, n, and a plot of these data appears as a scatter of points in the (x, y) plane.

[Figure: scatter plot of the data, x vs. y]

What is a plausible mathematical model describing the relation between x and y?
Least Square Method
Consider an arbitrary straight line, y = β0 + β1x, to be fitted through these data points. The question is: which line is the most representative?

[Figure: data points with a fitted line ŷ = β̂0 + β̂1x; the residual for point i is ei = yi − ŷi]
Least Square Criterion

• What are the values of β0 and β1 such that the resulting line “best” fits the data points?
• But, wait!!! What goodness-of-fit criterion should be used to choose among all possible combinations of β0 and β1?
• The least squares (LS) criterion states that the sum of the squares of the errors (or residuals, deviations) is minimized. Mathematically, the LS criterion can be written as:

minimize D = Σi=1..n (yi − β0 − β1xi)²

• Any other criteria that can be used?
Normal Equations for LS Criterion

• The necessary conditions for the minimum value of D are:

∂D/∂β0 = 0 and ∂D/∂β1 = 0

• Expanding the above equations:

∂D/∂β0 = −2 Σi (yi − β0 − β1xi) = 0
∂D/∂β1 = −2 Σi (yi − β0 − β1xi) xi = 0

• Normal equations:

Σ yi = nβ0 + β1 Σ xi
Σ xiyi = β0 Σ xi + β1 Σ xi²
LS Solution (2 Unknowns)

β̂1 = (Σ xiyi − n x̄ȳ) / (Σ xi² − n x̄²)

β̂0 = ȳ − β̂1 x̄

where x̄ = (1/n) Σ xi and ȳ = (1/n) Σ yi.
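The two-unknown solution can be sketched directly from those closed-form expressions (the function name `fit_line` and the data are illustrative; the points lie exactly on y = 1 + 2x, so the fit should recover those coefficients):

```python
# Closed-form LS solution for y = b0 + b1*x:
#   b1 = (sum(xi*yi) - n*xbar*ybar) / (sum(xi^2) - n*xbar^2)
#   b0 = ybar - b1*xbar
def fit_line(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # 1.0 2.0
```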
Fitting a Polynomial Eq. by LS Method

Model: yi = β0 + β1xi + β2xi² + … + βkxi^k + εi,  i = 1, 2, …, n

LS criterion: minimize D = Σi=1..n (yi − β0 − β1xi − β2xi² − … − βkxi^k)²

Set ∂D/∂βj = 0 for j = 0, 1, 2, …, k.

Normal equations:

nβ0 + β1 Σ xi + … + βk Σ xi^k = Σ yi
β0 Σ xi + β1 Σ xi² + … + βk Σ xi^(k+1) = Σ xiyi
⋮
β0 Σ xi^k + β1 Σ xi^(k+1) + … + βk Σ xi^(2k) = Σ xi^k yi
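The polynomial normal equations form a (k+1)×(k+1) linear system in the βj, which any linear solver can handle. A self-contained sketch (function name and data are mine; the points lie exactly on y = 1 + x², so a quadratic fit should recover coefficients 1, 0, 1):

```python
# Build and solve the polynomial normal equations:
#   sum_j beta_j * sum(x^(i+j)) = sum(x^i * y),  i = 0..k
def polyfit_ls(xs, ys, k):
    m = k + 1
    # Normal-equation matrix A and right-hand side t.
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    t = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        t[col], t[piv] = t[piv], t[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            t[r] -= f * t[col]
    # Back substitution.
    beta = [0.0] * m
    for i in range(m - 1, -1, -1):
        beta[i] = (t[i] - sum(A[i][j] * beta[j] for j in range(i + 1, m))) / A[i][i]
    return beta

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 5.0, 10.0, 17.0]  # exactly y = 1 + x^2
print(polyfit_ls(xs, ys, 2))
```

In practice the normal equations become ill-conditioned for high polynomial degrees, which is why library routines prefer QR-based solvers.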
Fitting a Linear Function of Several Variables

Model: y = β0 + β1x1 + β2x2 + … + βkxk + ε

LS criterion: minimize D = Σi=1..n (yi − β0 − β1x1i − β2x2i − … − βkxki)²

Set ∂D/∂βj = 0 for j = 0, 1, 2, …, k.

Normal equations:

nβ0 + β1 Σ x1i + … + βk Σ xki = Σ yi
β0 Σ x1i + β1 Σ x1i² + … + βk Σ x1i xki = Σ x1iyi
⋮
β0 Σ xki + β1 Σ xki x1i + … + βk Σ xki² = Σ xkiyi
Matrix Form of Multiple Regression by LS

In matrix form the model is y = Xβ + ε, where

y = [y1, y2, …, yn]′
X = the n×(k+1) matrix whose ith row is [1, xi1, xi2, …, xik]
β = [β0, β1, …, βk]′
ε = [ε1, ε2, …, εn]′

(Note: xij = ith observation of the jth independent variable.)

LS criterion: minimize D = Σi=1..n εi² = (y − Xβ)′(y − Xβ)

Set ∂D/∂β = 0, which results in X′(y − Xβ̂) = 0.

The LS solution is: β̂ = (X′X)⁻¹ X′y
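The matrix-form solution is a one-liner with a linear-algebra library. A sketch using NumPy with made-up data generated from y = 2 + 3x1 − x2 (solving the normal equations X′Xβ = X′y is numerically preferable to forming the explicit inverse):

```python
import numpy as np

# Made-up data following y = 2 + 3*x1 - x2 exactly.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix with a leading column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# beta_hat = (X'X)^{-1} X'y, computed via a linear solve.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta, 6))  # recovers [2, 3, -1]
```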
Measure of Goodness-of-Fit

R² = Coefficient of Determination:

R² = 1 − Σi=1..n εi² / Σi=1..n (yi − ȳ)²

= 1 − % of variation in the dependent variable, y, unexplained by the regression equation
= % of variation in the dependent variable, y, explained by the regression equation.
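The coefficient of determination follows directly from the residuals and the total sum of squares. A minimal sketch (the function name `r_squared`, the observations, and the fitted values are illustrative):

```python
# R^2 = 1 - sum(e_i^2) / sum((y_i - ybar)^2),
# where e_i = y_i - yhat_i are the residuals of the fitted model.
def r_squared(ys, yhats):
    ybar = sum(ys) / len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

ys = [2.0, 4.0, 5.0, 8.0]       # made-up observations
yhats = [2.5, 3.5, 5.5, 7.5]    # predictions from some fitted model
print(round(r_squared(ys, yhats), 4))  # 0.9467
```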
Example 1 (LS Method)
LS Example
LS Example (Matrix Approach)
LS Example (by Minitab w/ β0)
LS Example (by Minitab w/o β0)
LS Example (Output Plots)