Statistics in WR: Lecture 1
• Key Themes
– Knowledge discovery in hydrology
– Introduction to probability and statistics
– Definition of random variables
• Reading: Helsel and Hirsch, Chapter 1
How is new knowledge discovered?
After completing the Handbook of Hydrology in 1993, I asked myself the question: how is new knowledge discovered in hydrology?
I concluded:
• By deduction from existing knowledge
• By experiment in a laboratory
• By observation of the natural environment
Deduction – Isaac Newton
• Deduction is the classical path of mathematical physics
– Given a set of axioms
– Then by a logical process
– Derive a new principle or equation
• In hydrology, the St Venant equations for open channel flow and Richards’ equation for unsaturated flow in soils were derived in this way.
(1687) Three laws of motion and law of gravitation
http://en.wikipedia.org/wiki/Isaac_Newton
Experiment – Louis Pasteur
• Experiment is the classical path of laboratory science – a simplified view of the natural world is replicated under controlled conditions
• In hydrology, Darcy’s law for flow in a porous medium was found this way.
Pasteur showed that microorganisms cause disease and developed vaccines against rabies and anthrax
Foundations of scientific medicine http://en.wikipedia.org/wiki/Louis_Pasteur
Observation – Charles Darwin
• Observation – direct viewing and characterization of patterns and phenomena in the natural environment
• In hydrology, Horton discovered stream scaling laws by interpretation of stream maps
On the Origin of Species, published Nov 24, 1859. Most accessible book of great scientific imagination ever written
Mean Annual Flow
Is there a relation between flow and water quality?
Total Nitrogen in water
Are Annual Flows Correlated?
CE 397 Statistics in Water Resources, Lecture 2, 2009
David R. Maidment, Dept of Civil Engineering
University of Texas at Austin
Key Themes
• Statistics
– Parametric and non-parametric approach
• Data Visualization
• Distribution of data and the distribution of statistics of those data
• Reading: Helsel and Hirsch p. 17-51 (Sections 2.1 to 2.3)
• Slides from Helsel and Hirsch (2002) “Techniques of water resources investigations of the USGS”, Book 4, Chapter A3
Characteristics of Water Resources Data
• Lower bound of zero
• Presence of “outliers”
• Positive skewness
• Non-normal distribution of data
• Data measured with thresholds (e.g. detection limits)
• Seasonal and diurnal patterns
• Autocorrelation – consecutive measurements are not independent
• Dependence on other uncontrolled variables, e.g. chemical concentration is related to discharge
Normal Distribution
From Helsel and Hirsch (2002)
Lognormal Distribution
From Helsel and Hirsch (2002)
Method of Moments
From Helsel and Hirsch (2002)
Statistical measures
• Location (Central Tendency)
– Mean
– Median
– Geometric mean
• Spread (Dispersion)
– Variance
– Standard deviation
– Interquartile range
• Skewness (Symmetry)
– Coefficient of skewness
• Kurtosis (Flatness)
– Coefficient of kurtosis
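The measures listed above can be computed directly. The following is a minimal sketch using only the Python standard library; the sample values are hypothetical (not from the lecture), the function name `describe` is illustrative, and the quartile convention used by `statistics.quantiles` is only one of several in common use.

```python
import statistics

def describe(data):
    """Return location, spread, skewness and kurtosis measures for a sample."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles, default 'exclusive' method
    # Moment-based coefficients built from the n-divisor central moments
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "median": statistics.median(data),
        "geometric_mean": statistics.geometric_mean(data),
        "variance": s ** 2,
        "std_dev": s,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # equals 3 for a normal distribution
    }

flows = [12.0, 15.0, 18.0, 22.0, 25.0, 31.0, 40.0, 95.0]  # hypothetical values
summary = describe(flows)
```

With one large value in the sample, the mean sits above the median and the skewness is positive, which is the pattern described earlier as typical of water resources data.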
Histogram
From Helsel and Hirsch (2002)
Annual Streamflow for the Licking River at Catawba, Kentucky (03253500)
Quantile Plot
From Helsel and Hirsch (2002)
Plotting positions
i = rank of the data, where i = 1 is the lowest
n = number of data
p = cumulative probability or “quantile” of the data value (its percentile value)
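The plotting-position formula itself did not survive in this transcript. As a hedged illustration, the sketch below assumes the general form p = (i − a)/(n + 1 − 2a), with a = 0.4 (the Cunnane formula, which Helsel and Hirsch adopt); the function name and the flow values are hypothetical.

```python
def plotting_positions(values, a=0.4):
    """Return (value, p) pairs in ascending order, with p = (i - a) / (n + 1 - 2a).

    a = 0.4 gives the Cunnane formula p = (i - 0.4) / (n + 0.2);
    a = 0 gives the Weibull formula p = i / (n + 1).
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i - a) / (n + 1 - 2 * a)) for i, x in enumerate(ordered, start=1)]

annual_flows = [310.0, 120.0, 450.0, 205.0, 275.0]  # hypothetical values
pairs = plotting_positions(annual_flows)
```

Plotting each value against its p gives a quantile plot; converting p to a standard normal quantile gives a normal probability plot.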
Normal Distribution Quantile Plot
From Helsel and Hirsch (2002)
Probability Plot with Normal Quantiles (Z values)
$q = \bar{q} + s_q z$
From Helsel and Hirsch (2002)
Annual Flows From HydroExcel
Annual Flows produced using Pivot Tables in Excel
CE 397 Statistics in Water Resources, Lecture 3, 2009
Key Themes
• Using HydroExcel for accessing water resources data using web services
• Descriptive statistics and histograms using Excel Analysis Toolpak
• Reading: Chapter 11 of Applied Hydrology by Chow, Maidment and Mays
CE 397 Statistics in Water Resources, Lecture 4, 2009
Key Themes
• Frequency and probability functions
• Fitting methods
• Typical distributions
• Reading: Chapter 4 of Helsel and Hirsch pp. 97-116 on Hypothesis tests
Method of Moments
Maximum Likelihood
CE 397 Statistics in Water Resources, Lecture 5, 2009
Key Themes
• Using Excel to fit frequency and probability distributions
• Chi Square test and probability plotting
• Beginning hypothesis testing
• Reading: Chapter 3 of Helsel and Hirsch pp. 65-97 on Describing Uncertainty
• Slides from Helsel and Hirsch Chap. 4
Statistics in Water Resources, Lecture 6
• Key theme
– T-distribution for distributions where standard deviation is unknown
– Hypothesis testing
– Comparing two sets of data to see if they are different
• Reading: Helsel and Hirsch, Chapter 6 Matched Pair Tests
Chi-Square Distribution
http://en.wikipedia.org/wiki/Chi-square_distribution
t-, z and ChiSquare
Source: http://en.wikipedia.org/wiki/Student's_t-distribution
Normal and t-distributions
Curves shown: Normal; t-dist for ν = 1, 2, 3, 5, 10 and 30
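Standard Normal and Student-t
• Standard Normal z
– X1, … , Xn are independently and normally distributed (μ, σ), with sample mean $\bar{X}_n = (X_1 + \dots + X_n)/n$
– then $z = (\bar{X}_n - \mu)/(\sigma/\sqrt{n})$ is normally distributed with mean 0 and std dev 1
• Student’s t-distribution
– Applies to the case where the true standard deviation σ is unknown and is replaced by its sample estimate Sn, giving $t = (\bar{X}_n - \mu)/(S_n/\sqrt{n})$ with ν = n − 1 degrees of freedom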
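As a rough illustration of the difference between the two statistics, the sketch below standardises a sample mean first with a known σ (giving z) and then with the sample standard deviation (giving t). The function names and the sample values are hypothetical.

```python
import math
import statistics

def z_statistic(sample, mu, sigma):
    """z = (mean - mu) / (sigma / sqrt(n)), for a known population sigma."""
    n = len(sample)
    return (statistics.fmean(sample) - mu) / (sigma / math.sqrt(n))

def t_statistic(sample, mu):
    """t = (mean - mu) / (s / sqrt(n)), with s the sample std dev; returns (t, nu)."""
    n = len(sample)
    s = statistics.stdev(sample)  # n - 1 divisor
    return (statistics.fmean(sample) - mu) / (s / math.sqrt(n)), n - 1

sample = [4.0, 6.0, 5.0, 7.0, 8.0]  # hypothetical measurements
z = z_statistic(sample, mu=5.0, sigma=2.0)
t, nu = t_statistic(sample, mu=5.0)
```

The t value is then compared against a t-distribution with ν = n − 1 degrees of freedom, which has heavier tails than the normal distribution when n is small.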
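p-value is the probability of obtaining a value of the test statistic at least as extreme as the one observed, if the null hypothesis (Ho) is true
If p-value is smaller than the significance level α (commonly 0.05, or 0.025 in each tail of a two-sided test) then reject Ho
If p-value is larger than α then do not reject Ho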
One-sided test
Two-sided test
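A minimal sketch of the one-sided and two-sided decision, assuming a z statistic so that the standard library's `NormalDist` can supply the tail probabilities (a t statistic would need a t-distribution instead). The function name and the value z = 1.96 are illustrative only.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return (one-sided upper-tail p-value, two-sided p-value) for a z statistic."""
    upper = 1.0 - NormalDist().cdf(z)
    two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return upper, two_sided

alpha = 0.05  # significance level
p_one, p_two = z_test_p_values(1.96)
reject_one_sided = p_one < alpha  # reject Ho when the p-value is below alpha
reject_two_sided = p_two < alpha
```

For a positive z the two-sided p-value is twice the one-sided value, so a result can be significant in a one-sided test and not in a two-sided test at the same α.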
Statistics in WR: Lecture 7
• Key Themes
– Statistics for populations and samples
– Suspended sediment sampling
– Testing for differences in means and variances
• Reading: Helsel and Hirsch Chapter 8 Correlation
Estimators of the Variance
Maximum Likelihood Estimate for Population variance
Unbiased estimate from a sample
http://en.wikipedia.org/wiki/Variance
Bias in the Variance
Common sense would suggest applying the population formula to the sample as well. The result is biased because the sample mean is generally somewhat closer to the observations in the sample than the population mean is to these observations. This is so because the sample mean is by definition in the middle of the sample, while the population mean may even lie outside the sample. So the deviations from the sample mean will often be smaller than the deviations from the population mean, and so, if the same formula is applied to both, then this variance estimate will on average be somewhat smaller in the sample than in the population.
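The bias can be seen in a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions (normal data, true variance σ² = 4, samples of size n = 5, a fixed random seed); it averages the n-divisor and (n − 1)-divisor estimates over many replicate samples. The function name is hypothetical.

```python
import random

def average_variance_estimates(n=5, replicates=20000, sigma=2.0, seed=1):
    """Average the n-divisor and (n - 1)-divisor variance estimates over many samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(replicates):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # squared deviations from the sample mean
        biased_total += ss / n          # population formula applied to a sample
        unbiased_total += ss / (n - 1)  # unbiased sample estimate
    return biased_total / replicates, unbiased_total / replicates

biased, unbiased = average_variance_estimates()
```

The n-divisor average comes out near (n − 1)/n × σ² = 3.2, while the (n − 1)-divisor average comes out near the true value of 4.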
Suspended Sediment Sampling
http://pubs.usgs.gov/sir/2005/5077/
T-test with same variances
T-test with different variances
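The two slide titles above correspond to two forms of the two-sample t statistic. The sketch below assumes the unequal-variance case refers to Welch's form with the Welch-Satterthwaite degrees of freedom; function names and data are hypothetical.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    """Two-sample t statistic without the equal-variance assumption; returns (t, approx. df)."""
    nx, ny = len(x), len(y)
    ax, ay = statistics.variance(x) / nx, statistics.variance(y) / ny
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(ax + ay)
    df = (ax + ay) ** 2 / (ax ** 2 / (nx - 1) + ay ** 2 / (ny - 1))  # Welch-Satterthwaite
    return t, df

site_a = [2.0, 4.0, 6.0, 8.0]  # hypothetical
site_b = [1.0, 2.0, 3.0, 4.0]  # hypothetical
t_pooled, df_pooled = pooled_t(site_a, site_b)
t_welch, df_welch = welch_t(site_a, site_b)
```

With equal sample sizes the two t values coincide; the difference shows up in the degrees of freedom, which are lower for the unequal-variance form.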
Statistics in WR: Lecture 8
• Key Themes
– Replication in Monte Carlo experiments
– Testing paired differences and analysis of variance
– Correlation
• Reading: Helsel and Hirsch Chapter 9 Simple Regression
Statistics of Mean of Replicated Series
Patterns of data that all have correlation between x and y of 0.7
Monotonic nonlinear correlation
Linear correlation
Non-monotonic correlation
Statistics in WR: Lecture 9
• Key Themes
– Using SAS to compute cross-correlation between two data series
– Using Excel to compute autocorrelation of a single data series
– Correlation length and influence of data interval on that
– Lagged Cross-correlation between rainfall and flow
• Reading: Helsel and Hirsch Chapter 12 Trend Analysis
Correlation
• Correlation (or cross-correlation) measures the association between two sets of data (x, y)
• Autocorrelation measures the correlation of a dataset with lagged or displaced values of itself (either in time or space), e.g. x(t) with x(t – L) where L is the lag time
• Lagged cross-correlation measures the association between one series y(t), and lagged values of another series x(t – L)
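The three definitions above can be sketched with one Pearson correlation function applied to shifted copies of the series. This is a simplified illustration: it recomputes the means over the overlapping portion at each lag, which differs slightly from estimators that use the full-series mean. The rainfall and flow values are hypothetical, constructed so that flow responds to the previous step's rainfall.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_cross_correlation(x, y, lag):
    """Correlation of y(t) with x(t - lag), for a non-negative integer lag."""
    if lag == 0:
        return correlation(x, y)
    return correlation(x[:-lag], y[lag:])

def autocorrelation(x, lag):
    """Correlation of x(t) with x(t - lag)."""
    return lagged_cross_correlation(x, x, lag)

rain = [0.0, 5.0, 1.0, 0.0, 8.0, 2.0, 0.0, 3.0]    # hypothetical
flow = [1.0, 1.0, 11.0, 3.0, 1.0, 17.0, 5.0, 1.0]  # hypothetical: 1 + 2 * rain one step earlier
r_lag0 = correlation(rain, flow)
r_lag1 = lagged_cross_correlation(rain, flow, 1)
```

Here the lag-1 cross-correlation is much stronger than the lag-0 value, which is the kind of signal used to estimate the delay between rainfall and flow.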
Statistics in WR: Lecture 10
• Key Themes
– Trend analysis using Simple Linear Regression
– Characterization of outliers
– Multiple Linear Regression
• Reading: Helsel and Hirsch Chapter 11 Multiple Linear Regression
• Slides are from Helsel and Hirsch, Chapter 9
H&H p.222
H&H p.226
Regression Formulas
H&H p.227
Regression Formulas
Statistics in WR: Lecture 11
• Key Themes
– Simple Linear Regression
– Derivation of the normal equations
– Multiple Linear Regression
• Reading: Helsel and Hirsch Chapter 7 Comparing several independent groups
• Reading: Barnett, Environmental Statistics Chapter 10 Time series methods
• Slides are from Helsel and Hirsch, Chapter 9
Regression Assumptions
Formulas used in the derivation of the normal equations
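For simple linear regression y = b0 + b1·x, minimising the sum of squared residuals leads to two normal equations in b0 and b1. The sketch below solves them directly; the function name and the data are hypothetical, with the data constructed to lie exactly on a line so the result is easy to check.

```python
def fit_line(x, y):
    """Solve the two normal equations of y = b0 + b1 * x for (b0, b1)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # Normal equations:  sy  = n * b0  + b1 * sx
    #                    sxy = b0 * sx + b1 * sxx
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = (sy - b1 * sx) / n
    return b0, b1

log_q = [1.0, 2.0, 3.0, 4.0]  # hypothetical log-discharge values
tds = [9.0, 7.0, 5.0, 3.0]    # hypothetical TDS values, exactly 11 - 2 * log_q
b0, b1 = fit_line(log_q, tds)
```

Multiple linear regression follows the same pattern with one normal equation per coefficient, usually written and solved in matrix form.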
(1a) Plot the Data: TDS vs LogQ
(2) Interpret Regression Statistics
A good set of Residuals
Multiple Linear Regression
Simple vs Complex regression models
F-distribution
http://en.wikipedia.org/wiki/F-test
“If U is a Chisquare random variable with m degrees of freedom, V is a Chisquare random variable with n degrees of freedom, and if U and V are independent, then the ratio (U/m)/(V/n) has an F-distribution with (m, n) degrees of freedom.” Haan, Statistical Methods in Hydrology, p.122
The values of the F-statistic are tabulated at:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm
Statistics in WR: Lecture 12
• Key Themes
– Regression y|x and x|y
– Adjusted R2
– Time series and seasonal variations
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.950344
R Square             0.903154    0.903154347
Adjusted R Square    0.898543    0.89854265
Standard Error       159033.1
Observations         23
ANOVA
                     df    SS             MS             F           Significance F
Regression           1     4.95309E+12    4.95309E+12    195.8399    4.07E-12
Residual (error)     21    5.31122E+11    25291521454
Total (y)            22    5.48421E+12
$\mathrm{Adj}\,R^2 = 1 - \dfrac{SSE/(n-p)}{SSy/(n-1)}$
$R^2 = 1 - \dfrac{SSE}{SSy}$
R2 and Adjusted R2
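As a check, both quantities can be recomputed from the sums of squares in the ANOVA table of the summary output. The sketch below uses R² = 1 − SSE/SSy and Adj R² = 1 − [SSE/(n − p)]/[SSy/(n − 1)]; taking p = 2 (intercept plus slope) is an assumption inferred from the degrees of freedom of 1 and 21.

```python
def r_squared(sse, ssy):
    """R^2 = 1 - SSE / SSy."""
    return 1.0 - sse / ssy

def adjusted_r_squared(sse, ssy, n, p):
    """Adj R^2 = 1 - [SSE / (n - p)] / [SSy / (n - 1)], p = number of fitted coefficients."""
    return 1.0 - (sse / (n - p)) / (ssy / (n - 1))

# Values taken from the ANOVA table in the summary output
sse = 5.31122e11  # residual (error) sum of squares
ssy = 5.48421e12  # total sum of squares
n, p = 23, 2      # observations; p = 2 is assumed (intercept + slope)

r2 = r_squared(sse, ssy)
adj_r2 = adjusted_r_squared(sse, ssy, n, p)
```

Both values agree with the R Square and Adjusted R Square entries reported in the summary output to the precision of the tabulated sums of squares.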
Time Series Trend: Tide Levels at San Diego
http://tidesandcurrents.noaa.gov/sltrends/sltrends_station.shtml?stnid=9410170%20San%20Diego,%20CA
One harmonic
Five harmonics
http://en.wikipedia.org/wiki/Fourier_series
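A minimal sketch of fitting one or more harmonics to an evenly sampled periodic series, using the standard discrete Fourier coefficient sums. The function names are illustrative and the monthly series is synthetic (not the San Diego tide data); the formulas hold for harmonics k below half the number of samples per period.

```python
import math

def fourier_coefficients(y, harmonics):
    """Mean and (a_k, b_k) for k = 1..harmonics of an evenly sampled periodic series."""
    n = len(y)
    mean = sum(y) / n
    coeffs = []
    for k in range(1, harmonics + 1):
        a = 2.0 / n * sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(y))
        b = 2.0 / n * sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(y))
        coeffs.append((a, b))
    return mean, coeffs

def fourier_fit(mean, coeffs, t, n):
    """Evaluate the truncated Fourier series at time index t for a period of n samples."""
    return mean + sum(
        a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
        for k, (a, b) in enumerate(coeffs, start=1)
    )

# Synthetic monthly series: mean 10 plus one annual harmonic with a = 3, b = -2
n = 12
series = [10 + 3 * math.cos(2 * math.pi * t / n) - 2 * math.sin(2 * math.pi * t / n) for t in range(n)]
mean, coeffs = fourier_coefficients(series, harmonics=1)
```

Raising `harmonics` adds higher-frequency terms, which is the difference between a one-harmonic and a five-harmonic fit.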
Statistics in WR: Lecture 13
• Key Themes
– ANOVA for sediment data
– Fourier series for diurnal cycles
– Fourier series for seasonal cycles
Analysis of Variance (ANOVA)
Assumptions
There are several variants (one factor, two factor, two factor with replication). We will deal just with One Factor ANOVA
Single Factor ANOVA
Single Factor ANOVA
ANOVA Formulas
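The formulas on this slide are not preserved in the transcript. As a hedged sketch, the standard single-factor ANOVA splits the total sum of squares into a between-group and a within-group part and forms F from their mean squares. The function name and the three groups below are hypothetical, not the lecture's sediment data.

```python
def one_factor_anova(groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means about the overall mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: observations about their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # hypothetical
f_stat, df_b, df_w = one_factor_anova(groups)
```

The resulting F is compared with the F-distribution for (k − 1, N − k) degrees of freedom; a large value indicates that the group means differ by more than the within-group scatter would explain.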
Single Factor ANOVA
Groups of Sediment Load Data (Ex3)
TWDB Mean: 189,000 Ton/yr
USGS2 Mean: 97,000 Ton/yr
USGS1 Mean: 218,000 Ton/yr
Overall Mean: 183,000 Ton/yr
Scale marks: Zero; 480,000; 3.5 x 10^6; 5.5 x 10^6