ENVE3502. Environmental Monitoring, Measurements &
Data Analysis
Regression and Correlation Analysis
Points from previous lecture
• “Noise” in environmental data can obscure trends;
• Smoothing is one mechanism for removing noise;
• Smoothing can help reveal trends and cyclic features in data;
• Two common means of smoothing are moving averages and exponentially-weighted moving averages.
Grades to date
• 2007
  • Presentations: 87 ± 2.7
  • Lab Reports: Lab 1: 65 ± 9%; Lab 2: 72 ± 14%; Lab 3: 73 ± 12%; Lab 4: 80 ± 8%; Lab 5: 89 ± 13%
• 2010
  • Presentations: 86 ± 3%
  • Reports: 74 ± 9.7%; 78 ± 10%
• 2012
  • Presentations: 87 ± 1.9%
  • Reports: 73 ± 9%; 82%
Class Schedule
Week 4: Mon 1/30, Presentations Lab 2; Wed 2/1, Lecture: Regression, Correlation; Fri 2/3, Lab 4: Wastewater SRP (memo)
Week 5: Mon 2/6, Presentations Lab 3; Wed 2/8, Lec.: CI, percentiles; Fri 2/10, Winter Carnival
Week 6: Mon 2/13, Presentations Lab 4; Wed 2/15, Lec.: Detection Limit; Fri 2/17, Lab 5: Snow water content (memo), Lab 4 memo
Week 7: Mon 2/20, Lecture; Wed 2/22, Exam review; Fri 2/24, Lab 6: Lake DO (full report)
Week 8: Mon 2/27, Presentations Lab 5; Wed 2/29, EXAM; Fri 3/2, Presentations Lab 6
Motivation: Regression & Correlation
1. Frequently, scientists and engineers need to determine whether two factors are statistically associated with each other (e.g., illness and exposure to a pollutant, CO2 emissions and climate warming): Correlation analysis
2. Frequently, engineers and scientists need to know the quantitative relationship between two variables (e.g., rainfall intensity and runoff): Regression analysis {empirical models}
Rainfall Intensity-Duration-Frequency Curves for the State of Michigan
David Watkins & Dennis Johnson
Great Lakes Comparison: Phosphorus
Figure taken from Great Lakes Atlas, Canadian Gov't & U.S. EPA, 3rd Ed., 1995
Limnology & Oceanography 1974, 19(5): 767-773
Historical changes in Lake Superior?
[Figure: total phosphorus conc. (mg m-3) versus year, 1950 to 2000]
Correlation: P and Chlorophyll in Lake Superior
[Figure: annual mean chlorophyll conc. (mg m-3) versus annual mean TP conc. (mg m-3); annotations: r2 = 0.78, P < 0.05; r2 = 0.38, P < 0.01]
Nitrate (NO3-) in Lake Superior
[Figure: NO3-N (ppb) versus year, 1942 to 2002]
Predictive Models
[Figure: NO3-N (ppb) versus year, 1942 to 2002, with linear fit y = 3.09x - 5803.192, R2 = 0.88]
[Figure: model concentration and measured concentration of NO3- (ppb) versus year, 1945 to 2005]
Theory: Least squares regression
1. Consider two variables (x, y) that may be linearly related through an expression such as:
y = A + Bx
where A and B are constants whose values are unknown;
2. The method used to determine values of A and B, and hence to define the straight line that best fits the data (x1,y1), (x2,y2) … (xn,yn), is called linear regression, and the technique used most frequently is least-squares fitting.
Theory: Least Squares
[Figure: Y variable versus X variable, showing the data points, the fitted line, and one residual]
Data: xi, yi
Equation: yi’ = A + Bxi
Residual (error): εi = yi – yi’ = yi – (A + Bxi)
Sum of squares of errors: SSE = Σ(εi^2)
Least-squares: minimizes SSE
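As a sketch of where the least-squares estimates come from (a standard derivation, not spelled out on the slide): SSE is treated as a function of A and B, and both partial derivatives are set to zero.

```latex
\begin{align*}
\mathrm{SSE} &= \sum_{i=1}^{n} \bigl(y_i - A - Bx_i\bigr)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial A} &= -2\sum_{i} \bigl(y_i - A - Bx_i\bigr) = 0
  \;\Rightarrow\; nA + B\sum_i x_i = \sum_i y_i \\
\frac{\partial\,\mathrm{SSE}}{\partial B} &= -2\sum_{i} x_i\bigl(y_i - A - Bx_i\bigr) = 0
  \;\Rightarrow\; A\sum_i x_i + B\sum_i x_i^2 = \sum_i x_i y_i
\end{align*}
```

Solving these two linear equations simultaneously gives closed-form expressions for the intercept A and the slope B.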
Theory: Least Squares
Calculation of A and B:
\[ A = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \]
\[ B = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \]
Note: xi, yi are pairs of corresponding measurements made at the same time and location (e.g., x = time, y = [NO3-]).
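The summation formulas for A and B translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `least_squares` and the demo points are made up for illustration and are not part of the lecture.

```python
def least_squares(x, y):
    """Return (A, B) for the line y = A + B*x using the summation formulas."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    denom = n * sum_x2 - sum_x ** 2  # zero when all x values are identical
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    A = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    B = (n * sum_xy - sum_x * sum_y) / denom
    return A, B


# Hypothetical points lying exactly on y = 1 + 2x, so the fit should recover A = 1, B = 2.
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [1.0, 3.0, 5.0, 7.0]
A, B = least_squares(x_demo, y_demo)
print(f"A = {A:.3f}, B = {B:.3f}")
```

Because the demo points are exact, any departure from A = 1 and B = 2 would indicate a transcription error in the formulas.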
Regression in Excel: 1. Trendline
[Figure: chlorophyll (mg/m3) versus total P (mg/m3) with trendline y = 0.262x - 0.093, R2 = 0.383]
Chart > Add trendline
Options: Show equation on chart; Show r2 on chart
Regression in Excel: 2. Data Analysis
Year Chlorophyll T.P.
mg/m3 mg/m3
1973.542 1.2885 4.5
1973.625 1.2984 5.2
1973.792 1.5585 4.9
1973.875 1.0379 5.3
1983.375 1.1188 2.9
1983.458 0.9957 3.3
1983.542 0.9188 2.9
1983.708 1.0262 3.6
1983.792 0.9111 3.2
1987.375 0.9132 3.6
1989.375 0.5509 2.8
1990.375 0.6442 2.6
1991.375 0.4945 3.7
1992.375 0.7508 3.1
1992.708 0.8623 3.5
1996.375 0.2464 3.7
1996.625 0.44 3.2
1997.375 0.6452 3.2
1997.625 0.2875 2.6
Tools > Data Analysis > Regression
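For readers working outside Excel, the same kind of summary can be sketched in plain Python. The data are the 19 chlorophyll and total P pairs from the table; the function name `regression_summary` is hypothetical, and the t value of 2.110 (two-sided 95%, 17 degrees of freedom) is an assumption taken from a standard t table rather than from the slides.

```python
import math

# Total P (x) and chlorophyll (y), both in mg/m3, copied from the table.
total_p = [4.5, 5.2, 4.9, 5.3, 2.9, 3.3, 2.9, 3.6, 3.2, 3.6,
           2.8, 2.6, 3.7, 3.1, 3.5, 3.7, 3.2, 3.2, 2.6]
chlorophyll = [1.2885, 1.2984, 1.5585, 1.0379, 1.1188, 0.9957, 0.9188,
               1.0262, 0.9111, 0.9132, 0.5509, 0.6442, 0.4945, 0.7508,
               0.8623, 0.2464, 0.44, 0.6452, 0.2875]


def regression_summary(x, y, t_crit):
    """Slope, intercept, r^2 and a slope confidence half-width for y = A + B*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = syy - slope * sxy                    # sum of squared residuals
    s_slope = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": sxy ** 2 / (sxx * syy),
        "half_width": t_crit * s_slope,        # confidence half-width on the slope
    }


# ASSUMPTION: t = 2.110 for a two-sided 95% interval with n - 2 = 17 degrees of freedom.
summary = regression_summary(total_p, chlorophyll, t_crit=2.110)
for name, value in summary.items():
    print(f"{name:>10s} = {value:.3f}")
```

The slope, intercept and r2 printed here can be compared directly with the trendline equation reported for the chlorophyll versus total P chart.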
Theory: Correlation
1. How closely are two variables associated with one another?
2. How good is the fit of the regression equation to the data?
The correlation coefficient, r, answers both questions.
\[ r = \frac{s_{xy}}{s_x s_y} \]
where s_xy is the covariance of x and y, and s_x and s_y are the standard deviations of x and y.
Covariance is the mean product of the deviations of x and y from their means:
\[ s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
where (y_i - \bar{y}) is the deviation of y from its mean.
Theory: Correlation
Combining expressions for s_xy, s_y, and s_x:
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2 \right]^{1/2}} \]
Note:
• r will be between –1 and 1;
• If r is close to ±1 then the x and y data lie close to a straight line (defined by A and B) and are highly correlated;
• If r is close to zero then the points do not lie along a line, and x and y are not linearly correlated;
• r2 is called the coefficient of determination and represents the fraction of the variance in y explained by the independent variable.
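The combined expression for r can be checked on small data sets that hit the limiting cases in the notes. This is a minimal sketch; the function name `correlation_coefficient` and the data are invented for illustration.

```python
import math


def correlation_coefficient(x, y):
    """Pearson r: sum of deviation products over the root of the summed squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data chosen to illustrate the three cases in the notes.
x = [1.0, 2.0, 3.0, 4.0]
rising = correlation_coefficient(x, [2.0, 4.0, 6.0, 8.0])   # points on a rising line
falling = correlation_coefficient(x, [8.0, 6.0, 4.0, 2.0])  # points on a falling line
hump = correlation_coefficient(x, [1.0, 3.0, 3.0, 1.0])     # symmetric hump, no linear trend
print(rising, falling, hump)
```

The third case is worth noting: the y values clearly depend on x, yet r is zero because the dependence is not linear.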
What is good enough?
Table 1. The probability, P (%), that for N samples of two uncorrelated variables (x, y) a value of |r| greater than or equal to r0 would occur.
N \ r0   0    0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
3        100  94   87   81   74   67   59   51   41   29   -
7        100  83   67   51   37   25   15   8    3.1  0.6  -
10       100  78   58   40   25   14   7    2    0.5  -    -
20       100  67   40   20   8    2    0.5  0.1  -    -    -
50       100  49   16   3    0.4  -    -    -    -    -    -
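Entries like those in Table 1 can be approximated numerically. The sketch below assumes that both variables are normally distributed and uncorrelated, in which case r has a density proportional to (1 - r^2)^((N - 4)/2); the function name `prob_r_exceeds` and the choice of Simpson's rule are illustrative, not the method used to build the table.

```python
import math


def prob_r_exceeds(n, r0, steps=2000):
    """P(|r| >= r0) for n pairs of uncorrelated, normally distributed variables.

    Integrates the null density of r, proportional to (1 - r^2)^((n - 4) / 2),
    using the substitution r = sin(theta) and Simpson's rule (steps must be even).
    """
    if n < 3:
        raise ValueError("need at least three pairs")
    # Normalizing constant Beta(1/2, (n - 2) / 2), written with gamma functions.
    beta = math.gamma(0.5) * math.gamma((n - 2) / 2) / math.gamma((n - 1) / 2)
    lo, hi = math.asin(r0), math.pi / 2
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * math.cos(lo + k * h) ** (n - 3)
    return 2 * (total * h / 3) / beta


# Print one row per sample size, as percentages.
r0_values = [0.1, 0.3, 0.5, 0.7, 0.9]
for n in (3, 7, 10, 20, 50):
    row = "  ".join(f"{100 * prob_r_exceeds(n, r0):6.2f}" for r0 in r0_values)
    print(f"N = {n:2d}: {row}")
```

For N = 3 the integral has a closed form, P = 1 - (2/π) asin(r0), which gives about 67% at r0 = 0.5 and provides a quick check on the numerical result.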
Example 1: Is correlation good enough?
[Figure: chlorophyll (mg/m3) versus total P (mg/m3) with trendline y = 0.262x - 0.093, R2 = 0.383]
N = 19, r2 = 0.383, r = 0.618
P < 0.5% that this correlation resulted from chance.
Therefore, the regression is significant at the > 99.5% confidence level.
Is the slope significantly different from zero?
0.09 < Slope < 0.43
Slope = 0.26 ± 0.17 (± 95% CI)
YES!
Application: Regression & Correlation
Rain duration and intensity were measured at a weather station in central Illinois. Perform a least-squares fit on the data and assess the correlation.
Rain event         1    2    3    4    5    6    7
Duration (min)     50   10   5    20   40   30   60
Intensity (in/hr)  2.0  5.4  6.1  4.0  2.4  3.2  1.5
Is it linear?
[Figure: intensity (in/hr) versus duration (min.) with trendline y = -0.083x + 6.061, R2 = 0.954]
|r| = 0.98 (r is negative because the slope is negative), n = 7
D.F. = n – 2 = 5
Statistics table: P < 0.6% that this is chance
Regression is significant at > 99.4% confidence level
Pearson Product-Moment Correlation Coefficient: Table of Critical Values
df = N - 2, where N = number of pairs of data
Level of significance for two-tailed test
df   .10    .05    .02    .01
1    .988   .997   .9995  .9999
2    .900   .950   .980   .990
3    .805   .878   .934   .959
4    .729   .811   .882   .917
5    .669   .754   .833   .874
6    .622   .707   .789   .834
7    .582   .666   .750   .798
8    .549   .632   .716   .765
9    .521   .602   .685   .735
10   .497   .576   .658   .708
11   .476   .553   .634   .684
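Reading the critical-value table amounts to a lookup: find the row for df = N - 2 and compare |r| with each column. A minimal sketch, with the table transcribed into a dictionary; the names `CRITICAL_R` and `strictest_level` are hypothetical.

```python
# Critical values of |r| from the table, keyed by degrees of freedom (df = N - 2).
# Columns are the two-tailed significance levels 0.10, 0.05, 0.02, 0.01.
LEVELS = (0.10, 0.05, 0.02, 0.01)
CRITICAL_R = {
    1: (0.988, 0.997, 0.9995, 0.9999),
    2: (0.900, 0.950, 0.980, 0.990),
    3: (0.805, 0.878, 0.934, 0.959),
    4: (0.729, 0.811, 0.882, 0.917),
    5: (0.669, 0.754, 0.833, 0.874),
    6: (0.622, 0.707, 0.789, 0.834),
    7: (0.582, 0.666, 0.750, 0.798),
    8: (0.549, 0.632, 0.716, 0.765),
    9: (0.521, 0.602, 0.685, 0.735),
    10: (0.497, 0.576, 0.658, 0.708),
    11: (0.476, 0.553, 0.634, 0.684),
}


def strictest_level(r, n_pairs):
    """Smallest tabulated level at which |r| exceeds the critical value, else None."""
    df = n_pairs - 2
    if df not in CRITICAL_R:
        raise ValueError(f"df = {df} is outside the tabulated range 1 to 11")
    passed = [lvl for lvl, crit in zip(LEVELS, CRITICAL_R[df]) if abs(r) > crit]
    return min(passed) if passed else None


# Rain example: |r| = 0.98 with n = 7 pairs, so df = 5.
print(strictest_level(-0.98, 7))
```

The absolute value is used because the table is two-tailed, so a strongly negative r is treated the same as a strongly positive one.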
Residuals
slope      -0.08292
Intercept  6.061048
Duration (min)  Intensity (in/hr)  Residual
50              2                  -0.08484
10              5.4                -0.16813
5               6.1                -0.45354
20              4                  0.402691
40              2.4                0.344334
30              3.2                0.373513
60              1.5                -0.41402
[Figure: error in predicted intensity versus storm duration (min)]
Residuals must be randomly distributed. The systematic pattern shown here violates an assumption of regression analysis and implies that the regression is not valid, particularly outside the range in which data were collected.
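The systematic pattern can be seen by recomputing the residuals and listing them in order of storm duration. This sketch uses the rain data from the application example; note that the tabulated values correspond to predicted minus measured intensity, which is the opposite sign from the earlier definition ε = yi – yi’.

```python
# Rain data from the application example: duration (min) and intensity (in/hr).
duration = [50, 10, 5, 20, 40, 30, 60]
intensity = [2.0, 5.4, 6.1, 4.0, 2.4, 3.2, 1.5]

n = len(duration)
mean_x = sum(duration) / n
mean_y = sum(intensity) / n
sxx = sum((x - mean_x) ** 2 for x in duration)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(duration, intensity))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Error in predicted intensity = predicted - measured, the sign convention
# that reproduces the tabulated residuals.
errors = [(intercept + slope * x) - y for x, y in zip(duration, intensity)]

# Sorting by duration makes any run of same-sign errors easy to spot.
by_duration = sorted(zip(duration, errors))
for x, e in by_duration:
    print(f"{x:3d} min  {e:+.3f}")
```

Randomly scattered residuals would change sign irregularly; here the errors are negative at short durations, positive in the middle, and negative again at long durations, which suggests the underlying relationship is curved.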
Theory: Least Squares
Uncertainty in A and B:
• A and B are ESTIMATES for constants that make the best linear fit for the data;
• Just as each value of x and y has uncertainties due to systematic (bias) and random (precision) errors, so also do A and B have uncertainties;
• The uncertainty in A and B will lead to an uncertainty in any value of y predicted with the regression equation.
Assume: measured values of x have less error than y. Then:
\[ s_{y'}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left[ y_i - (A + Bx_i) \right]^2 \]
Theory: Least Squares
Error in A (A ± s_A):
\[ s_A^2 = \frac{s_{y'}^2 \sum x_i^2}{n \sum x_i^2 - \left(\sum x_i\right)^2} \]
Error in B (B ± s_B):
\[ s_B^2 = \frac{n\, s_{y'}^2}{n \sum x_i^2 - \left(\sum x_i\right)^2} \]
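As an illustration of the uncertainty formulas, the sketch below applies them to the rain duration and intensity data from the application example. The variable names are arbitrary, and the results are one-standard-deviation uncertainties, not confidence intervals.

```python
import math

duration = [50, 10, 5, 20, 40, 30, 60]           # x, minutes
intensity = [2.0, 5.4, 6.1, 4.0, 2.4, 3.2, 1.5]  # y, in/hr

n = len(duration)
sum_x = sum(duration)
sum_y = sum(intensity)
sum_x2 = sum(x * x for x in duration)
sum_xy = sum(x * y for x, y in zip(duration, intensity))
denom = n * sum_x2 - sum_x ** 2

# Least-squares intercept and slope.
A = (sum_x2 * sum_y - sum_x * sum_xy) / denom
B = (n * sum_xy - sum_x * sum_y) / denom

# Variance of the measured y values about the fitted line (n - 2 degrees of freedom).
s_y2 = sum((y - (A + B * x)) ** 2 for x, y in zip(duration, intensity)) / (n - 2)

# Standard uncertainties of the intercept and slope.
s_A = math.sqrt(s_y2 * sum_x2 / denom)
s_B = math.sqrt(n * s_y2 / denom)

print(f"A = {A:.3f} +/- {s_A:.3f} in/hr")
print(f"B = {B:.4f} +/- {s_B:.4f} in/hr per min")
```

Multiplying s_A or s_B by the appropriate t value for n - 2 degrees of freedom would convert these into confidence intervals.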
Excel Output – error tabulation
Error in calculated X
What if we want to use the regression line to calculate unknown x values? Can we use the uncertainty in y’, A and B to estimate the uncertainty in x’?
[Figure: absorbance (au) versus conc. (mg/L) with trendline y = 0.054x + 0.022, R² = 0.995]
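Goal: Calculate uncertainty in x’
Information available: A ± εa, B ± εb, sy’ or spred, y
Approach:
\[ \varepsilon_x^2 = \left(\frac{\partial x}{\partial A}\right)^2 \varepsilon_a^2 + \left(\frac{\partial x}{\partial B}\right)^2 \varepsilon_b^2 + \left(\frac{\partial x}{\partial y}\right)^2 \varepsilon_y^2 \]
Given:
\[ x' = \frac{y - A}{B} \]
what is ∂x’/∂A? What is ∂x’/∂B? What is ∂x’/∂y?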
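One way to work through the approach: differentiating x’ = (y - A)/B gives ∂x’/∂A = -1/B, ∂x’/∂B = -(y - A)/B^2, and ∂x’/∂y = 1/B. The sketch below plugs these into the propagation formula. The function name `x_uncertainty` is hypothetical, the slope and intercept come from the absorbance standard curve, and the absorbance reading and all three uncertainties are invented for illustration. Like the formula on the slide, it ignores any covariance between A and B.

```python
import math


def x_uncertainty(y, A, B, eps_a, eps_b, eps_y):
    """Return (x, eps_x) for x = (y - A) / B, ignoring covariance between A and B."""
    x = (y - A) / B
    dx_dA = -1.0 / B            # partial derivative of x with respect to A
    dx_dB = -(y - A) / B ** 2   # partial derivative of x with respect to B
    dx_dy = 1.0 / B             # partial derivative of x with respect to y
    eps_x = math.sqrt((dx_dA * eps_a) ** 2
                      + (dx_dB * eps_b) ** 2
                      + (dx_dy * eps_y) ** 2)
    return x, eps_x


# A and B are from the standard curve y = 0.054x + 0.022.
# HYPOTHETICAL: the absorbance y and the uncertainties eps_a, eps_b, eps_y.
conc, eps_conc = x_uncertainty(y=0.562, A=0.022, B=0.054,
                               eps_a=0.008, eps_b=0.001, eps_y=0.01)
print(f"x = {conc:.2f} +/- {eps_conc:.2f} mg/L")
```

Because every term is divided by B or B^2, a shallow calibration slope inflates the uncertainty in the calculated concentration.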
Using the regression equation for the standard curve and the absorbance values of the unknowns, determine the concentrations of the unknowns. Use the error propagation technique outlined in the lab handout to determine the uncertainty in your calculated concentrations for the unknowns. Calculate the mean and standard deviation of the concentrations for each sample based on the four measurements by different groups. How does the uncertainty (standard error) that you estimated above compare with the standard deviation that you just calculated?