
Statistics in WR: Lecture 1

• Key Themes
– Knowledge discovery in hydrology
– Introduction to probability and statistics
– Definition of random variables

• Reading: Helsel and Hirsch, Chapter 1

How is new knowledge discovered?

• By deduction from existing knowledge

• By experiment in a laboratory

• By observation of the natural environment

After completing the Handbook of Hydrology in 1993, I asked myself the question: how is new knowledge discovered in hydrology?

I concluded:

Deduction – Isaac Newton

• Deduction is the classical path of mathematical physics
– Given a set of axioms
– Then by a logical process
– Derive a new principle or equation

• In hydrology, the Saint-Venant equations for open channel flow and the Richards equation for unsaturated flow in soils were derived in this way.

(1687) Three laws of motion and law of gravitation

http://en.wikipedia.org/wiki/Isaac_Newton

Experiment – Louis Pasteur

• Experiment is the classical path of laboratory science – a simplified view of the natural world is replicated under controlled conditions

• In hydrology, Darcy’s law for flow in a porous medium was found this way.

Pasteur showed that microorganisms cause disease & discovered vaccination

Foundations of scientific medicine http://en.wikipedia.org/wiki/Louis_Pasteur

Observation – Charles Darwin

• Observation – direct viewing and characterization of patterns and phenomena in the natural environment

• In hydrology, Horton discovered stream scaling laws by interpretation of stream maps

Published Nov 24, 1859

Most accessible book of great scientific imagination ever written

Mean Annual Flow

Is there a relation between flow and water quality?

Total Nitrogen in water

Are Annual Flows Correlated?

CE 397 Statistics in Water Resources, Lecture 2, 2009

David R. Maidment
Dept of Civil Engineering
University of Texas at Austin


Key Themes
• Statistics
– Parametric and non-parametric approach
• Data Visualization
• Distribution of data and the distribution of statistics of those data
• Reading: Helsel and Hirsch p. 17-51 (Sections 2.1 to 2.3)
• Slides from Helsel and Hirsch (2002) “Techniques of water resources investigations of the USGS, Book 4, Chapter A3”


Characteristics of Water Resources Data

• Lower bound of zero
• Presence of “outliers”
• Positive skewness
• Non-normal distribution of data
• Data measured with thresholds (e.g. detection limits)

• Seasonal and diurnal patterns

• Autocorrelation – consecutive measurements are not independent

• Dependence on other uncontrolled variables e.g. chemical concentration is related to discharge


Normal Distribution

From Helsel and Hirsch (2002)

Lognormal Distribution


Method of Moments


Statistical measures

• Location (Central Tendency)
– Mean
– Median
– Geometric mean

• Spread (Dispersion)
– Variance
– Standard deviation
– Interquartile range

• Skewness (Symmetry)
– Coefficient of skewness

• Kurtosis (Flatness)
– Coefficient of kurtosis
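The measures listed above can be computed with Python's standard library. This is a minimal sketch: the flow values are made up for illustration, and the skewness and kurtosis coefficients use the simple moment-ratio definitions, which is only one of several conventions (Helsel and Hirsch and Excel apply small-sample adjustments that are not reproduced here).

```python
import statistics


def describe(data):
    """Return common measures of location, spread, skewness and kurtosis."""
    n = len(data)
    mean = statistics.fmean(data)
    # Quartiles from the standard library (default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(data, n=4)
    # Central moments with a 1/n divisor, used for the moment-ratio coefficients.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "median": statistics.median(data),
        "geometric_mean": statistics.geometric_mean(data),
        "variance": statistics.variance(data),   # sample variance, n - 1 divisor
        "std_dev": statistics.stdev(data),       # sample standard deviation
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical annual flows; one large value produces positive skewness.
flows = [120.0, 85.0, 310.0, 95.0, 150.0, 60.0, 720.0, 110.0]
summary = describe(flows)
for name, value in summary.items():
    print(f"{name:>15s}: {value:.3f}")
```

For positively skewed data such as these, the median and geometric mean fall below the arithmetic mean, which is why they are often preferred as measures of location for water resources data.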


Histogram

From Helsel and Hirsch (2002)


Annual Streamflow for the Licking River at Catawba, Kentucky (03253500)

Quantile Plot


Plotting positions

i = rank of the data, with i = 1 the lowest
n = number of data
p = cumulative probability or “quantile” of the data value (its percentile value)
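The plotting-position formula itself did not survive on this slide, so the sketch below assumes the general form p = (i - a) / (n + 1 - 2a). With a = 0.4 this is the Cunnane formula, p = (i - 0.4) / (n + 0.2); with a = 0 it is the Weibull formula, p = i / (n + 1). The function name and the sample values are hypothetical.

```python
def plotting_positions(data, a=0.4):
    """Return (value, p) pairs with p = (i - a) / (n + 1 - 2a).

    i is the rank of the value (i = 1 is the lowest) and n is the number
    of data. a = 0.4 is the Cunnane choice and is an assumption here.
    """
    n = len(data)
    ranked = sorted(data)
    return [(x, (i - a) / (n + 1 - 2 * a)) for i, x in enumerate(ranked, start=1)]


# Made-up data values for illustration.
flows = [12.0, 30.0, 18.0, 25.0, 21.0]
pairs = plotting_positions(flows)
for value, p in pairs:
    print(f"value = {value:6.1f}   p = {p:.4f}")
```

Plotting the sorted values against p gives a quantile plot; none of the p values reaches 0 or 1, so the extremes of the sample are not treated as the extremes of the population.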


Normal Distribution Quantile Plot


Probability Plot with Normal Quantiles (Z values)

$q = \bar{q} + s_q \, z$

[Plot axes: data value q versus standard normal quantile z]
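A probability plot with normal quantiles pairs each sorted data value with the standard normal quantile z of its plotting position. The sketch below assumes Cunnane plotting positions and uses the sample mean and standard deviation as the intercept and slope of the reference line q = mean + s * z; the function names and data are hypothetical.

```python
from statistics import NormalDist, fmean, stdev


def normal_probability_points(data, a=0.4):
    """Return (z, q) pairs: standard normal quantile versus sorted data value."""
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, q in enumerate(sorted(data), start=1):
        p = (i - a) / (n + 1 - 2 * a)             # assumed plotting position
        points.append((std_normal.inv_cdf(p), q))  # z such that P(Z <= z) = p
    return points


def reference_line(data):
    """Intercept and slope of the straight line q = mean + s * z."""
    return fmean(data), stdev(data)


# Made-up data values for illustration.
flows = [12.0, 30.0, 18.0, 25.0, 21.0]
points = normal_probability_points(flows)
intercept, slope = reference_line(flows)
for z, q in points:
    print(f"z = {z:+.3f}  q = {q:5.1f}  line = {intercept + slope * z:5.1f}")
```

If the data are close to normally distributed, the plotted points fall near the reference line; systematic curvature suggests skewness.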


Annual Flows From HydroExcel


Annual Flows produced using Pivot Tables in Excel





Key Themes

• Using HydroExcel for accessing water resources data using web services

• Descriptive statistics and histograms using Excel Analysis Toolpak

• Reading: Chapter 11 of Applied Hydrology by Chow, Maidment and Mays






Key Themes

• Frequency and probability functions
• Fitting methods
• Typical distributions
• Reading: Chapter 4 of Helsel and Hirsch pp. 97-116 on Hypothesis tests


Method of Moments


Maximum Likelihood
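The two fitting methods named on these slides can be compared on a lognormal distribution. This is a sketch under stated assumptions: the slides do not say which distribution was fitted, so the lognormal is chosen here only because it appears earlier in the course, and the data are made up. The method of moments matches the sample mean and variance of the raw data; maximum likelihood for the lognormal reduces to the mean and the 1/n standard deviation of the logarithms.

```python
import math
from statistics import fmean, pvariance, variance


def lognormal_moments(data):
    """Method of moments: match the mean and variance of the raw data."""
    m = fmean(data)
    v = variance(data)
    sigma2 = math.log(1.0 + v / m ** 2)   # variance of ln(x)
    mu = math.log(m) - 0.5 * sigma2       # mean of ln(x)
    return mu, math.sqrt(sigma2)


def lognormal_mle(data):
    """Maximum likelihood: mean and 1/n standard deviation of ln(x)."""
    logs = [math.log(x) for x in data]
    mu = fmean(logs)
    return mu, math.sqrt(pvariance(logs, mu))


# Hypothetical positively skewed flows.
flows = [120.0, 85.0, 310.0, 95.0, 150.0, 60.0, 720.0, 110.0]
mu_mom, sigma_mom = lognormal_moments(flows)
mu_mle, sigma_mle = lognormal_mle(flows)
print(f"moments:    mu = {mu_mom:.3f}, sigma = {sigma_mom:.3f}")
print(f"likelihood: mu = {mu_mle:.3f}, sigma = {sigma_mle:.3f}")
```

The two methods generally give different parameter values for the same data, because the method of moments is sensitive to large values in real space while maximum likelihood works on the logarithms.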






Key Themes

• Using Excel to fit frequency and probability distributions
• Chi-Square test and probability plotting
• Beginning hypothesis testing
• Reading: Chapter 3 of Helsel and Hirsch pp. 65-97 on Describing Uncertainty
• Slides from Helsel and Hirsch Chap. 4


Statistics in Water Resources, Lecture 6

• Key theme
– T-distribution for distributions where standard deviation is unknown
– Hypothesis testing
– Comparing two sets of data to see if they are different

• Reading: Helsel and Hirsch, Chapter 6 Matched Pair Tests

Chi-Square Distribution

http://en.wikipedia.org/wiki/Chi-square_distribution

t, z and Chi-Square

Source: http://en.wikipedia.org/wiki/Student's_t-distribution

Normal and t-distributions

[Figure legend: Normal; t-dist for ν = 1, 2, 3, 5, 10, 30]

• Standard Normal z
– If X1, … , Xn are independent and normally distributed with mean μ and standard deviation σ,
– then $z = (\bar{X}_n - \mu) / (\sigma / \sqrt{n})$ is normally distributed with mean 0 and std dev 1

Standard Normal and Student - t

• Student’s t-distribution
– Applies to the case where the true standard deviation σ is unknown and is replaced by its sample estimate Sn, giving $t = (\bar{X}_n - \mu) / (S_n / \sqrt{n})$ with n − 1 degrees of freedom
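The difference between the two statistics is only in which standard deviation is used. The sketch below computes both for a small made-up sample; the hypothesized mean and the "known" σ are illustrative assumptions, not values from the lecture.

```python
import math
from statistics import fmean, stdev


def z_statistic(sample, mu, sigma):
    """z = (sample mean - mu) / (sigma / sqrt(n)), with sigma known."""
    n = len(sample)
    return (fmean(sample) - mu) / (sigma / math.sqrt(n))


def t_statistic(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s estimated from the sample.

    Returns the statistic and its degrees of freedom, n - 1.
    """
    n = len(sample)
    t = (fmean(sample) - mu) / (stdev(sample) / math.sqrt(n))
    return t, n - 1


sample = [9.0, 10.0, 11.0, 12.0, 13.0]   # made-up observations
z = z_statistic(sample, mu=10.0, sigma=2.0)
t, dof = t_statistic(sample, mu=10.0)
print(f"z = {z:.4f}")
print(f"t = {t:.4f} with {dof} degrees of freedom")
```

The t statistic is compared against a t-distribution with n - 1 degrees of freedom, whose heavier tails account for the extra uncertainty in estimating the standard deviation from a small sample.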


The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis (Ho) is true

If the p-value is smaller than the chosen significance level α (e.g. 0.05 or 0.025), then reject Ho

If the p-value is larger than α, then do not reject Ho

One-sided test

Two-sided test
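For a test statistic that follows the standard normal distribution, one-sided and two-sided p-values can be read from the normal cumulative distribution function. This is a minimal sketch for the z case only; the function name and the `alternative` labels are hypothetical, and a t-based test would need the t-distribution instead, which the Python standard library does not provide.

```python
from statistics import NormalDist


def p_value_z(z, alternative="two-sided"):
    """p-value of a z statistic under the null hypothesis."""
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        return 2.0 * (1.0 - cdf(abs(z)))   # both tails
    if alternative == "greater":
        return 1.0 - cdf(z)                # upper tail only
    if alternative == "less":
        return cdf(z)                      # lower tail only
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


alpha = 0.05
z = 2.1   # illustrative value of the test statistic
for alt in ("two-sided", "greater"):
    p = p_value_z(z, alt)
    decision = "reject Ho" if p < alpha else "do not reject Ho"
    print(f"{alt:>9s}: p = {p:.4f} -> {decision}")
```

For the same statistic, the two-sided p-value is twice the one-sided value in the direction of the observed effect, so a result can be significant in a one-sided test and not in a two-sided one.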


• Key Themes
– Statistics for populations and samples
– Suspended sediment sampling
– Testing for differences in means and variances

• Reading: Helsel and Hirsch Chapter 8 Correlation

Estimators of the Variance

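Maximum Likelihood Estimate for Population variance: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Unbiased estimate from a sample: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$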

http://en.wikipedia.org/wiki/Variance

Bias in the Variance

Common sense would suggest applying the population formula to the sample as well. That estimate is biased because the sample mean is generally somewhat closer to the observations in the sample than the population mean is. The sample mean is by definition in the middle of the sample, while the population mean may even lie outside the sample. The deviations from the sample mean will therefore often be smaller than the deviations from the population mean, so if the same formula is applied to both, the variance estimate will on average be somewhat smaller in the sample than in the population.
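The bias described above can be checked with a small Monte Carlo experiment. The sketch below draws many samples of size n from a normal population with known variance and averages the two estimators; the sample size, number of trials, population parameters and seed are all arbitrary choices for illustration.

```python
import random
from statistics import fmean


def variance_estimates(sample):
    """Return (1/n estimate, 1/(n-1) estimate) of the variance."""
    n = len(sample)
    mean = fmean(sample)
    ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations
    return ss / n, ss / (n - 1)


def simulate(n=5, trials=20000, sigma=2.0, seed=1):
    """Average both estimators over many samples from N(0, sigma^2)."""
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        b, u = variance_estimates(sample)
        biased.append(b)
        unbiased.append(u)
    return fmean(biased), fmean(unbiased)


mean_biased, mean_unbiased = simulate()
print(f"true variance         : {2.0 ** 2:.3f}")
print(f"average 1/n estimate  : {mean_biased:.3f}")
print(f"average 1/(n-1) est.  : {mean_unbiased:.3f}")
```

With n = 5 the 1/n estimator averages about (n - 1)/n = 0.8 of the true variance, while the 1/(n - 1) estimator averages close to the true value.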

Suspended Sediment Sampling

http://pubs.usgs.gov/sir/2005/5077/

T-test with same variances

T-test with different variances
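The formulas for these two slides were not preserved, so the sketch below uses the standard textbook forms as an assumption: the pooled-variance t statistic when the two groups are taken to share a variance, and the Welch statistic with Welch-Satterthwaite degrees of freedom when they are not. The data are made up.

```python
import math
from statistics import fmean, variance


def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (fmean(x) - fmean(y)) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Two-sample t statistic allowing different variances."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2
    t = (fmean(x) - fmean(y)) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, dof


x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical group 1
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical group 2, larger spread
t_pooled, dof_pooled = pooled_t(x, y)
t_welch, dof_welch = welch_t(x, y)
print(f"pooled: t = {t_pooled:.4f}, dof = {dof_pooled}")
print(f"Welch : t = {t_welch:.4f}, dof = {dof_welch:.2f}")
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances the statistics themselves diverge.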


• Key Themes
– Replication in Monte Carlo experiments
– Testing paired differences and analysis of variance
– Correlation

• Reading: Helsel and Hirsch Chapter 9 Simple Regression

Statistics of Mean of Replicated Series

Patterns of data that all have correlation between x and y of 0.7

Monotonic nonlinear correlation

Linear correlation

Non-monotonic correlation


• Key Themes
– Using SAS to compute cross-correlation between two data series
– Using Excel to compute autocorrelation of a single data series
– Correlation length and influence of data interval on that
– Lagged Cross-correlation between rainfall and flow

• Reading: Helsel and Hirsch Chapter 12 Trend Analysis

Correlation

• Correlation (or cross-correlation) measures the association between two sets of data (x, y)

• Autocorrelation measures the correlation of a dataset with lagged or displaced values of itself (either in time or space), e.g. x(t) with x(t – L) where L is the lag time

• Lagged cross-correlation measures the association between one series y(t), and lagged values of another series x(t – L)
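The three definitions above differ only in which pairs of values are correlated. The sketch below implements them with the Pearson coefficient; the rainfall and flow series are invented so that flow responds to rainfall one time step later. It is one simple estimator among several: each lagged coefficient is computed from the means of the overlapping sub-series, which differs slightly from the form used in some time series texts.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_cross_correlation(x, y, lag):
    """Correlation of y(t) with x(t - lag), for lag >= 0."""
    if lag < 0 or lag >= len(x) - 1:
        raise ValueError("lag must be between 0 and len(x) - 2")
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])   # pair x(t - lag) with y(t)


def autocorrelation(x, lag):
    """Correlation of x(t) with x(t - lag)."""
    return lagged_cross_correlation(x, x, lag)


# Hypothetical series: flow at time t is a linear function of rain at t - 1.
rain = [0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 0.0, 3.0, 0.0, 0.0]
flow = [1.0, 1.0, 11.0, 1.0, 1.0, 17.0, 1.0, 1.0, 7.0, 1.0]
for lag in range(3):
    r = lagged_cross_correlation(rain, flow, lag)
    print(f"lag {lag}: r = {r:+.3f}")
```

In this constructed example the cross-correlation peaks at a lag of one step, which is how a lagged cross-correlation reveals the response time between rainfall and flow.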


• Key Themes
– Trend analysis using Simple Linear Regression
– Characterization of outliers
– Multiple Linear Regression

• Reading: Helsel and Hirsch Chapter 11 Multiple Linear Regression

• Slides are from Helsel and Hirsch, Chapter 9

H&H p.222

H&H p.226

Regression Formulas

H&H p.227

Regression Formulas


• Key Themes
– Simple Linear Regression
– Derivation of the normal equations
– Multiple Linear Regression

• Reading: Helsel and Hirsch Chapter 7 Comparing several independent groups

• Reading: Barnett, Environmental Statistics Chapter 10 Time series methods

• Slides are from Helsel and Hirsch, Chapter 9

Regression Assumptions

Formulas used in the derivation of the normal equations
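For simple linear regression y = b0 + b1 x, minimizing the sum of squared residuals leads to two normal equations in b0 and b1. The sketch below solves them directly from the sums of x, y, x squared and xy. The slide's own derivation was not preserved, so this is the standard textbook form, and the data are made up.

```python
def fit_line(x, y):
    """Solve the normal equations for y = b0 + b1 * x.

    Normal equations:
        n  * b0 + Sx  * b1 = Sy
        Sx * b0 + Sxx * b1 = Sxy
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
    b0 = (sy - b1 * sx) / n                          # intercept
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The same approach extends to multiple linear regression, where the normal equations become a linear system with one equation per coefficient.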

(1a) Plot the Data: TDS vs LogQ

(2) Interpret Regression Statistics

A good set of Residuals

Multiple Linear Regression

Simple vs Complex regression models

F-distribution

http://en.wikipedia.org/wiki/F-test

“If U is a Chisquare random variable with m degrees of freedom, V is a Chisquare random variable with n degrees of freedom, and if U and V are independent, then the ratio (U/m)/(V/n) has an F-distribution with (m, n) degrees of freedom.” Haan, Statistical Methods in Hydrology, p.122

The values of the F-statistic are tabulated at:

http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm


• Key Themes
– Regression y|x and x|y
– Adjusted R²
– Time series and seasonal variations

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.950344
R Square             0.903154      0.903154347
Adjusted R Square    0.898543      0.89854265
Standard Error       159033.1
Observations         23

ANOVA
                    df    SS             MS             F           Significance F
Regression          1     4.95309E+12    4.95309E+12    195.8399    4.07E-12
Residual (error)    21    5.31122E+11    25291521454
Total (y)           22    5.48421E+12

R² and Adjusted R²

$R^2 = 1 - \dfrac{SSE}{SS_y}$

$\text{Adj } R^2 = 1 - \dfrac{SSE / (n - p)}{SS_y / (n - 1)}$
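The regression statistics in the Excel summary output can be reproduced from the ANOVA sums of squares. The sketch below uses the residual and total sums of squares and the number of observations from that output, and assumes p = 2 fitted parameters (intercept and slope), which is consistent with the 1 and 21 degrees of freedom shown there. Because the sums of squares are quoted to six significant figures, the results agree with Excel only to about that precision.

```python
import math

# Values taken from the ANOVA table of the summary output.
sse = 5.31122e11   # residual (error) sum of squares
ssy = 5.48421e12   # total sum of squares of y
n = 23             # observations
p = 2              # fitted parameters: intercept and slope (assumption)

r2 = 1.0 - sse / ssy
adj_r2 = 1.0 - (sse / (n - p)) / (ssy / (n - 1))

ss_reg = ssy - sse                           # regression sum of squares
f_stat = (ss_reg / (p - 1)) / (sse / (n - p))
std_error = math.sqrt(sse / (n - p))         # standard error of the regression

print(f"R Square          = {r2:.6f}")
print(f"Adjusted R Square = {adj_r2:.6f}")
print(f"F                 = {f_stat:.2f}")
print(f"Standard Error    = {std_error:.1f}")
```

Adjusted R² is always a little lower than R² because it penalizes the number of fitted parameters, which makes it the better basis for comparing simple and more complex regression models.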

Time Series Trend: Tide Levels at San Diego

http://tidesandcurrents.noaa.gov/sltrends/sltrends_station.shtml?stnid=9410170%20San%20Diego,%20CA

One harmonic

Five harmonics

http://en.wikipedia.org/wiki/Fourier_series
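A periodic cycle can be approximated by a mean plus a small number of harmonics. The sketch below estimates the Fourier coefficients from evenly spaced values that cover exactly one period and then evaluates the fitted series; the hourly data are synthetic, and the function names are hypothetical. The coefficient formulas used here are valid for harmonic numbers below half the number of data points.

```python
import math


def fourier_coefficients(y, harmonics):
    """Mean a0 and (a_k, b_k) pairs for evenly spaced y over one full period."""
    n = len(y)
    a0 = sum(y) / n
    coeffs = []
    for k in range(1, harmonics + 1):
        ak = 2.0 / n * sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(y))
        bk = 2.0 / n * sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(y))
        coeffs.append((ak, bk))
    return a0, coeffs


def fourier_series(a0, coeffs, t, n):
    """Evaluate the fitted series at time index t for a period of n steps."""
    total = a0
    for k, (ak, bk) in enumerate(coeffs, start=1):
        angle = 2 * math.pi * k * t / n
        total += ak * math.cos(angle) + bk * math.sin(angle)
    return total


# Synthetic diurnal cycle: 24 hourly values with one known harmonic.
n = 24
y = [10.0 + 3.0 * math.cos(2 * math.pi * t / n) + 2.0 * math.sin(2 * math.pi * t / n)
     for t in range(n)]
a0, coeffs = fourier_coefficients(y, harmonics=1)
print(f"a0 = {a0:.3f}, a1 = {coeffs[0][0]:.3f}, b1 = {coeffs[0][1]:.3f}")
print(f"fitted value at t = 6: {fourier_series(a0, coeffs, 6, n):.3f}")
```

Adding harmonics lets the fitted curve follow sharper features of the cycle, at the cost of two more coefficients per harmonic.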


• Key Themes
– ANOVA for sediment data
– Fourier series for diurnal cycles
– Fourier series for seasonal cycles

Analysis of Variance (ANOVA)

Assumptions

There are several variants (one factor, two factor, two factor with replication). We will deal just with One Factor ANOVA.

Single Factor ANOVA

ANOVA Formulas
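The slide's formulas were not preserved, so the sketch below uses the standard single-factor ANOVA decomposition as an assumption: the total sum of squares is split into a between-group part and a within-group part, and the F statistic is the ratio of their mean squares. The three groups are made-up numbers, not the sediment load data from the lecture.

```python
from statistics import fmean


def one_way_anova(groups):
    """Single-factor ANOVA: sums of squares, degrees of freedom and F."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = fmean(all_values)
    # Between groups: squared distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: squared distance of each value from its own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical groups for illustration.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
result = one_way_anova(groups)
for name, value in result.items():
    print(f"{name:>11s}: {value:.3f}")
```

A large F means the group means differ by more than the scatter within groups would explain; the value is compared against the F-distribution with (k - 1, n - k) degrees of freedom.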

Single Factor ANOVA

Groups of Sediment Load Data (Ex3)

• TWDB mean: 189,000 ton/yr
• USGS2 mean: 97,000 ton/yr
• USGS1 mean: 218,000 ton/yr
• Overall mean: 183,000 ton/yr

Other values shown on the figure: zero, 3.5 x 10^6, 5.5 x 10^6, 480,000
