enve3502. environmental monitoring, measurements & data ...nurban/classes/ce3502/... · class...


ENVE3502. Environmental Monitoring, Measurements & Data Analysis

Regression and Correlation Analysis

Points from previous lecture

• “Noise” in environmental data can obscure trends;

• Smoothing is one mechanism for removing noise;

• Smoothing can help reveal trends and cyclic features in data;

• Two common means of smoothing are moving averages and exponentially-weighted moving averages;
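The two smoothing methods named above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the window length, the smoothing constant `alpha`, and the sample series are all assumptions made for the example, not values from the course.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def ewma(values, alpha):
    """Exponentially-weighted moving average.

    Each smoothed value is alpha * (new point) + (1 - alpha) * (previous
    smoothed value); the first smoothed value is the first data point.
    """
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical noisy series (not course data)
series = [3.1, 2.7, 3.6, 3.0, 4.2, 3.8, 4.9, 4.4]
print(moving_average(series, window=3))
print(ewma(series, alpha=0.5))
```

A larger window (or a smaller `alpha`) smooths more strongly but responds more slowly to real changes in the data.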


Grades to date

2007
• Presentations: 87 ± 2.7
• Lab Reports:
  • Lab 1: 65 ± 9%
  • Lab 2: 72 ± 14%
  • Lab 3: 73 ± 12%
  • Lab 4: 80 ± 8%
  • Lab 5: 89 ± 13%

2010
• 86 ± 3%
• Reports:
  • 74 ± 9.7%
  • 78 ± 10%

2012
• 87 ± 1.9%
• Reports:
  • 73 ± 9%
  • 82%

Class Schedule

Week | Monday                    | Wednesday                            | Friday
4    | 1/30 Presentations, Lab 2 | 2/1 Lecture: Regression, Correlation | 2/3 Lab 4: Wastewater SRP – memo
5    | 2/6 Presentations, Lab 3  | 2/8 Lec. CI, percentiles             | 2/10 Winter Carnival
6    |                           | 2/15 Lec. Detection Limit            | 2/17 Lab 5: Snow water content – memo; Lab 4 memo
7    | 2/20 Lecture              | 2/22 Exam review                     | 2/24 Lab 6: Lake DO, full report
8    |                           | 2/29 EXAM                            | 3/2 Presentations, Lab 6


Motivation: Regression & Correlation

1. Frequently, scientists and engineers need to determine if two factors are statistically associated with each other (e.g., illness and exposure to a pollutant, CO2 emissions and climate warming); Correlation analysis

2. Frequently, engineers and scientists need to know the quantitative relationship between two variables (e.g., rainfall intensity and runoff); Regression analysis {empirical models}


Rainfall Intensity-Duration-Frequency Curves for the State of Michigan

David Watkins & Dennis Johnson

Great Lakes Comparison: Phosphorus

Figure taken from Great Lakes Atlas, Canadian Gov't & U.S. EPA, 3rd Ed., 1995


Limnology & Oceanography 1974, 19(5): 767-773

Historical changes in Lake Superior?

[Figure: Total phosphorus conc. (mg m-3) vs. year, 1950-2000]

Correlation: P and Chlorophyll in Lake Superior

[Figure: Annual mean chlorophyll conc. (mg m-3) vs. annual mean TP conc. (mg m-3)]

r2 = 0.78, P < 0.05; r2 = 0.38, P < 0.01

Nitrate (NO3-) in Lake Superior

[Figure: NO3-N (ppb) vs. year, 1942-2002]

Predictive Models

[Figure: NO3-N (ppb) vs. year, 1942-2002, with linear fit y = 3.09x - 5803.192, R2 = 0.88]

[Figure: NO3- (ppb) vs. year, 1945-2005; model concentration (ppb) and measured concentration (ppb)]

Theory: Least squares regression

1. Consider two variables (x, y) that may be linearly related through an expression such as:

y = A + Bx

where A and B are constants whose values are unknown;

2. The method used to determine values of A and B, and hence to define the straight line that best fits the data (x1,y1), (x2,y2) … (xn,yn), is called linear regression, and the technique used most frequently is least-squares fitting.


Theory: Least Squares

[Figure: Y variable vs. X variable, showing the data points, the fitted line, and a residual]

Data: xi, yi

Equation: yi’ = A + Bxi

Residual (error): ε = yi – yi’ = yi – (A + Bxi)

Sum of Squares of Errors: SSE = Σ(ε2)

Least-squares: Minimizes SSE
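The idea that least squares picks the line with the smallest SSE can be checked numerically. The sketch below uses the rain duration and intensity data from the Application slide of these notes, together with the slope and intercept reported on the Residuals slide; the perturbation sizes are arbitrary choices for illustration.

```python
def sse(x, y, a, b):
    """Sum of squared residuals for the line y' = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Rain data from the Application slide
duration = [50, 10, 5, 20, 40, 30, 60]          # min
intensity = [2.0, 5.4, 6.1, 4.0, 2.4, 3.2, 1.5]  # in/hr

# Fitted constants as reported on the Residuals slide
A_FIT, B_FIT = 6.061048, -0.08292

best = sse(duration, intensity, A_FIT, B_FIT)

# Nudge A and B away from the fitted values (arbitrary step sizes)
perturbed = [
    sse(duration, intensity, A_FIT + da, B_FIT + db)
    for da in (-0.2, 0.2)
    for db in (-0.005, 0.005)
]

print(f"SSE at fitted line: {best:.4f}")
print("SSE at perturbed lines:", [round(p, 4) for p in perturbed])
```

Every perturbed line gives a larger SSE than the fitted line, which is what "least squares" means.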

Theory: Least Squares

Calculation of A and B:

$$ A = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} $$

$$ B = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} $$

Note: xi, yi are pairs of corresponding measurements made at the same time and location (e.g., x = time, y = [NO3-])
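The summation formulas for A and B translate directly into code. The following is a minimal sketch; the function name `fit_line` is an assumption for illustration, and the data are the rain duration and intensity values from the Application slide of these notes.

```python
def fit_line(x, y):
    """Least-squares intercept A and slope B from the summation formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    denom = n * sum_xx - sum_x ** 2          # shared denominator
    a = (sum_xx * sum_y - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Rain data from the Application slide
duration = [50, 10, 5, 20, 40, 30, 60]          # min
intensity = [2.0, 5.4, 6.1, 4.0, 2.4, 3.2, 1.5]  # in/hr

A, B = fit_line(duration, intensity)
print(f"intensity = {A:.3f} + ({B:.4f}) * duration")
```

The result should agree with the slope (-0.08292) and intercept (6.061048) reported on the Residuals slide.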


Regression in Excel: 1. Trendline

[Figure: Chlorophyll (mg/m3) vs. Total P (mg/m3), with trendline y = 0.262x - 0.093, R2 = 0.383]

Chart > Add trendline

Options: Show equation on chart; Show r2 on chart

Regression in Excel: 2. Data Analysis

Year  Chlorophyll (mg/m3)  T.P. (mg/m3)

1973.542 1.2885 4.5

1973.625 1.2984 5.2

1973.792 1.5585 4.9

1973.875 1.0379 5.3

1983.375 1.1188 2.9

1983.458 0.9957 3.3

1983.542 0.9188 2.9

1983.708 1.0262 3.6

1983.792 0.9111 3.2

1987.375 0.9132 3.6

1989.375 0.5509 2.8

1990.375 0.6442 2.6

1991.375 0.4945 3.7

1992.375 0.7508 3.1

1992.708 0.8623 3.5

1996.375 0.2464 3.7

1996.625 0.44 3.2

1997.375 0.6452 3.2

1997.625 0.2875 2.6

Tools > Data Analysis > Regression


Theory: Correlation

1. How closely are two variables associated with one another?
2. How good is the fit of the regression equation to the data?

The correlation coefficient, r, answers both questions.

$$ r = \frac{s_{xy}}{s_x s_y} $$

where s_xy is the covariance of x and y.

Covariance is the average product of the deviations from the means:

$$ s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$

where (y_i - \bar{y}) is the deviation of y from its mean.

Theory: Correlation

Combining expressions for sxy, sy, and sx:

$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\left[\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2\right]^{1/2}} $$

Note:
• r will be between –1 and 1;
• If r is close to ±1 then the x and y data lie close to a straight line (defined by A and B) and are highly correlated;
• If r is close to zero then the points do not lie along a line, and x and y are not correlated;
• r2 is called the coefficient of determination and represents the fraction of the variance in y explained by the independent variable.
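As a worked check of the combined expression for r, the sketch below applies it to the chlorophyll and total phosphorus values tabulated on the "Regression in Excel: 2. Data Analysis" slide. The function name `correlation` is an assumption for illustration.

```python
import math


def correlation(x, y):
    """Correlation coefficient r from sums of deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Lake Superior data from the Excel Data Analysis slide (mg/m3)
total_p = [4.5, 5.2, 4.9, 5.3, 2.9, 3.3, 2.9, 3.6, 3.2, 3.6,
           2.8, 2.6, 3.7, 3.1, 3.5, 3.7, 3.2, 3.2, 2.6]
chlorophyll = [1.2885, 1.2984, 1.5585, 1.0379, 1.1188, 0.9957, 0.9188,
               1.0262, 0.9111, 0.9132, 0.5509, 0.6442, 0.4945, 0.7508,
               0.8623, 0.2464, 0.44, 0.6452, 0.2875]

r = correlation(total_p, chlorophyll)
print(f"r = {r:.3f}, r2 = {r * r:.3f}")
```

The values should come out close to the r = 0.618 and r2 = 0.383 quoted in Example 1 of these notes.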


What is good enough?

Table 1. The probability, P (%), that for n samples of two uncorrelated variables (x,y) a value of r greater than r0 would occur.

N    r0: 0    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
3    100      94    87    81    74    67    59    51    41    29    -
7    100      83    67    51    37    25    15    8     3.1   0.6   -
10   100      78    58    40    25    14    7     2     0.5   -     -
20   100      67    40    20    8     2     0.5   0.1   -     -     -
50   100      49    16    3     0.4   -     -     -     -     -     -
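One way to see where the entries in Table 1 come from is to simulate them. The sketch below draws many pairs of uncorrelated samples and counts how often |r| reaches r0 by chance. It assumes the table is two-sided (it counts |r| rather than r) and uses normally distributed random numbers; the function names, number of trials, and seed are arbitrary choices for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def chance_probability(n, r0, trials=20000, seed=1):
    """Percent of trials in which uncorrelated samples give |r| >= r0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(x, y)) >= r0:
            hits += 1
    return 100 * hits / trials


# Compare with the N = 10, r0 = 0.5 entry of Table 1 (14%)
p = chance_probability(10, 0.5)
print(f"Simulated P(|r| >= 0.5) for N = 10: {p:.1f}%")
```

Because the estimate is random, it should land near the tabulated value rather than match it exactly.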

Example 1: Is correlation good enough?

[Figure: Chlorophyll (mg/m3) vs. Total P (mg/m3), with fit y = 0.262x - 0.093, R2 = 0.383]

N = 19, r2 = 0.383, r = 0.618

P < 0.5% that this correlation resulted from chance. Therefore, regression is significant at > 99.5% confidence level.


Is the slope significantly different from zero?

Slope = 0.26 ± 0.17 (± 95% CI)

0.09 < Slope < 0.43

YES!
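The slope interval quoted above can be reproduced from the chlorophyll and total phosphorus data on the Excel Data Analysis slide. This sketch uses the standard-error formula for B given in the "Error in B" slide of these notes and assumes the 95% interval is t × sB with n – 2 = 17 degrees of freedom; the critical value 2.110 is the standard two-tailed t value for 17 degrees of freedom, typed in by hand rather than computed.

```python
import math

# Lake Superior data from the Excel Data Analysis slide (mg/m3)
total_p = [4.5, 5.2, 4.9, 5.3, 2.9, 3.3, 2.9, 3.6, 3.2, 3.6,
           2.8, 2.6, 3.7, 3.1, 3.5, 3.7, 3.2, 3.2, 2.6]
chlorophyll = [1.2885, 1.2984, 1.5585, 1.0379, 1.1188, 0.9957, 0.9188,
               1.0262, 0.9111, 0.9132, 0.5509, 0.6442, 0.4945, 0.7508,
               0.8623, 0.2464, 0.44, 0.6452, 0.2875]

n = len(total_p)
sum_x = sum(total_p)
sum_y = sum(chlorophyll)
sum_xx = sum(x * x for x in total_p)
sum_xy = sum(x * y for x, y in zip(total_p, chlorophyll))
denom = n * sum_xx - sum_x ** 2

slope = (n * sum_xy - sum_x * sum_y) / denom
intercept = (sum_xx * sum_y - sum_x * sum_xy) / denom

# Variance of y about the fitted line, then standard error of the slope
sse = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(total_p, chlorophyll))
s_y2 = sse / (n - 2)
s_b = math.sqrt(n * s_y2 / denom)

T_CRIT = 2.110  # assumed: two-tailed 95% t value for 17 degrees of freedom
half_width = T_CRIT * s_b
low, high = slope - half_width, slope + half_width

print(f"slope = {slope:.2f} ± {half_width:.2f}")
print(f"{low:.2f} < slope < {high:.2f}")
```

The interval excludes zero, which is the basis for the "YES!" on the slide.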

Application: Regression & Correlation

Rain duration and intensity were measured at a weather station in central Illinois. Perform a least-squares fit on the data and assess the correlation.

Rain event         1     2     3     4     5     6     7
Duration (min)     50    10    5     20    40    30    60
Intensity (in/hr)  2.0   5.4   6.1   4.0   2.4   3.2   1.5


Is it linear?

[Figure: Intensity (in/hr) vs. Duration (min.), with linear fit y = -0.083x + 6.061, R2 = 0.954]

r = -0.98 (|r| = 0.98), n = 7

D.F. = n – 2 = 5

Statistics table: P < 0.6% that this is chance

Regression is significant at > 99.4% confidence level

Pearson Product-Moment Correlation Coefficient: Table of Critical Values

df = N - 2, where N = number of pairs of data

      Level of significance for two-tailed test
df    .10     .05     .02     .01
1     .988    .997    .9995   .9999
2     .900    .950    .980    .990
3     .805    .878    .934    .959
4     .729    .811    .882    .917
5     .669    .754    .833    .874
6     .622    .707    .789    .834
7     .582    .666    .750    .798
8     .549    .632    .716    .765
9     .521    .602    .685    .735
10    .497    .576    .658    .708
11    .476    .553    .634    .684
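Using the critical-value table amounts to a lookup followed by a comparison of |r| with the tabulated value. A minimal sketch, with the table typed in from these notes and hypothetical helper names `critical_r` and `is_significant`:

```python
# Significance levels (two-tailed) in the column order of the table
ALPHAS = (0.10, 0.05, 0.02, 0.01)

# Critical values of r, keyed by degrees of freedom (df = N - 2)
CRITICAL_R = {
    1: (0.988, 0.997, 0.9995, 0.9999),
    2: (0.900, 0.950, 0.980, 0.990),
    3: (0.805, 0.878, 0.934, 0.959),
    4: (0.729, 0.811, 0.882, 0.917),
    5: (0.669, 0.754, 0.833, 0.874),
    6: (0.622, 0.707, 0.789, 0.834),
    7: (0.582, 0.666, 0.750, 0.798),
    8: (0.549, 0.632, 0.716, 0.765),
    9: (0.521, 0.602, 0.685, 0.735),
    10: (0.497, 0.576, 0.658, 0.708),
    11: (0.476, 0.553, 0.634, 0.684),
}


def critical_r(df, alpha):
    """Tabulated critical value of r for the given df and significance level."""
    return CRITICAL_R[df][ALPHAS.index(alpha)]


def is_significant(r, n_pairs, alpha=0.05):
    """True if |r| meets or exceeds the critical value for df = n_pairs - 2."""
    return abs(r) >= critical_r(n_pairs - 2, alpha)


# Rain example from these notes: |r| = 0.98 with n = 7 pairs
rain_significant = is_significant(-0.98, 7, alpha=0.01)
print("Rain regression significant at the .01 level:", rain_significant)
```

The lookup only covers the degrees of freedom listed in the table; larger data sets need a fuller table.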


Residuals

slope      -0.08292
Intercept  6.061048

Duration (min)  Intensity (in/hr)  Residuals
50              2                  -0.08484
10              5.4                -0.16813
5               6.1                -0.45354
20              4                  0.402691
40              2.4                0.344334
30              3.2                0.373513
60              1.5                -0.41402

[Figure: Error in predicted intensity vs. storm duration (min)]

Residuals must be randomly distributed. The systematic pattern shown here violates an assumption of regression analysis and implies that the regression is not valid, particularly outside the range in which data were collected.
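The systematic pattern can be seen without a plot by listing the residuals in order of storm duration. The sketch below uses the slope and intercept from the Residuals slide and the definition ε = yi – yi’ from the least-squares theory slide. Note that the tabulated residuals appear to use the opposite sign (predicted minus measured), so the numbers printed here are the negatives of the slide's values.

```python
# Rain data from the Application slide
duration = [50, 10, 5, 20, 40, 30, 60]          # min
intensity = [2.0, 5.4, 6.1, 4.0, 2.4, 3.2, 1.5]  # in/hr

# Fitted constants as reported on the Residuals slide
INTERCEPT, SLOPE = 6.061048, -0.08292

# Residual = measured - predicted, keyed by storm duration
residuals = {
    x: y - (INTERCEPT + SLOPE * x)
    for x, y in zip(duration, intensity)
}

# Sort by duration and record the sign of each residual
ordered = sorted(residuals.items())
signs = [1 if eps > 0 else -1 for _, eps in ordered]

for x, eps in ordered:
    print(f"{x:3d} min  residual = {eps:+.3f}")
print("Sign pattern:", signs)
```

Randomly scattered residuals would show no run structure; here the signs run positive, negative, positive as duration increases, which suggests the true relationship is curved rather than linear.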


Theory: Least Squares

Uncertainty in A and B:
• A and B are ESTIMATES for constants that make the best linear fit for the data;
• Just as each value of x and y has uncertainties due to systematic (bias) and random (precision) errors, so also do A and B have uncertainties;
• The uncertainty in A and B will lead to an uncertainty in any value of y predicted with the regression equation.

Assume: measured values of x have less error than y. Then:

$$ s_{y'}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left[ y_i - (A + Bx_i) \right]^2 $$

Theory: Least Squares

Error in A (A ± sA):

$$ s_A^2 = \frac{s_{y'}^2 \sum x_i^2}{n \sum x_i^2 - \left(\sum x_i\right)^2} $$

Error in B (B ± sB):

$$ s_B^2 = \frac{n \, s_{y'}^2}{n \sum x_i^2 - \left(\sum x_i\right)^2} $$
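The uncertainty formulas can be evaluated step by step: first the scatter of y about the fitted line, then the standard errors of A and B. The sketch below applies them to the rain data from the Application slide; the function name `fit_with_errors` is an assumption for illustration.

```python
import math


def fit_with_errors(x, y):
    """Least-squares A and B plus s_y', s_A, s_B from the summation formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    denom = n * sum_xx - sum_x ** 2

    a = (sum_xx * sum_y - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom

    # Variance of y about the line, with n - 2 degrees of freedom
    s_y2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

    s_a = math.sqrt(s_y2 * sum_xx / denom)
    s_b = math.sqrt(n * s_y2 / denom)
    return a, b, math.sqrt(s_y2), s_a, s_b


# Rain data from the Application slide
duration = [50, 10, 5, 20, 40, 30, 60]          # min
intensity = [2.0, 5.4, 6.1, 4.0, 2.4, 3.2, 1.5]  # in/hr

A, B, s_y, s_A, s_B = fit_with_errors(duration, intensity)
print(f"A = {A:.3f} ± {s_A:.3f}")
print(f"B = {B:.4f} ± {s_B:.4f}")
print(f"s_y' = {s_y:.3f}")
```

These are one-standard-error uncertainties; a confidence interval would multiply them by the appropriate t value for n – 2 degrees of freedom.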


Excel Output – error tabulation

Error in calculated X

What if we want to use the regression line to calculate unknown x values? Can we use the uncertainty in y’, A and B to estimate the uncertainty in x’?

[Figure: Absorbance (au) vs. Conc. (mg/L), with standard curve y = 0.054x + 0.022, R² = 0.995]


Goal: Calculate uncertainty in x’

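Information available: A ± εa, B ± εb, sy’ or spred, y

Approach:

$$ \varepsilon_x^2 = \left(\frac{\partial x}{\partial A}\right)^2 \varepsilon_a^2 + \left(\frac{\partial x}{\partial B}\right)^2 \varepsilon_b^2 + \left(\frac{\partial x}{\partial y}\right)^2 \varepsilon_y^2 $$

Given:

$$ x' = \frac{y - A}{B} $$

What is ∂x’/∂A? What is ∂x’/∂B? What is ∂x’/∂y?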
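Once the three partial derivatives have been worked out, the propagation formula is easy to evaluate numerically. The sketch below differentiates x’ = (y – A)/B to get ∂x’/∂A = –1/B, ∂x’/∂B = –(y – A)/B², and ∂x’/∂y = 1/B. A and B are taken from the standard curve shown in these notes, while the absorbance reading and all three uncertainties are hypothetical values chosen for illustration. Like the approach formula, the sketch treats the errors in A, B and y as independent and ignores any covariance between A and B.

```python
import math


def concentration_with_uncertainty(y, a, b, eps_a, eps_b, eps_y):
    """Invert y = a + b*x for x and propagate the uncertainties into x."""
    x = (y - a) / b

    # Partial derivatives of x' = (y - a) / b
    dx_da = -1.0 / b
    dx_db = -(y - a) / b ** 2
    dx_dy = 1.0 / b

    eps_x = math.sqrt(
        (dx_da * eps_a) ** 2 + (dx_db * eps_b) ** 2 + (dx_dy * eps_y) ** 2
    )
    return x, eps_x


# Standard curve from these notes: absorbance = 0.054 * conc + 0.022
A, B = 0.022, 0.054

# Hypothetical reading and uncertainties (not from the notes)
x_unknown, eps_x = concentration_with_uncertainty(
    y=0.562, a=A, b=B, eps_a=0.005, eps_b=0.001, eps_y=0.01
)
print(f"conc = {x_unknown:.2f} ± {eps_x:.2f} mg/L")
```

With these example inputs the uncertainty in y and in the slope dominate; which term dominates in practice depends on the actual regression output.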

Using the regression equation for the standard curve and the absorbance values of the unknowns, determine the concentrations of the unknowns. Use the error propagation technique outlined in the lab handout to determine the uncertainty in your calculated concentrations for the unknowns. Calculate the mean and standard deviation for the concentrations for each sample based on the four measurements by different groups. How does the uncertainty (standard error) that you estimated above compare with the standard deviation that you just calculated?
