k-nearest neighbor resampling technique (weather generation and water quality applications) balaji...

Post on 26-Dec-2015


K-Nearest Neighbor Resampling Technique

(Weather Generation and Water Quality Applications)

Balaji Rajagopalan

Somkiat Apipattanavis & Erin Towler

Department of Civil, Environmental and Architectural Engineering

University of Colorado

Boulder, CO

Denver Water

February 2007

“Translation” of Climate Info

• Users most interested in sectoral outcomes (streamflows, crop yields, risk of disease X)

Climate Forecast / Projection → Translation → Process Models → Distribution of Outcomes

Translation

Historical Data → Synthetic series → Process model → Frequency distribution of outcomes

(Diagram shows sample series values: 28.5 … 12.4; 23.1 … 10.2; 29.1 … 11.4; 25.8 … 9.7)

Why Simulation?

• Limited historical data
– cannot capture the full range of variability

• Need a tool to generate ‘scenarios’ that capture the historical statistical properties

• Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques, etc.)
– These are cumbersome and restrictive (in their assumptions)

• Re-sampling techniques are simple and robust
– Unconditional and conditional bootstrap and the K-nearest neighbor (K-NN) bootstrap offer attractive alternatives.


Re-sampling Techniques

• Drawing cards from a well shuffled deck
– Selecting a (single or a set of) historical years from the record with equal chance
– Unconditional bootstrap, Index Sequential Method

• Drawing cards from a biased deck
– Selecting a (single or a set of) historical years with unequal chance, e.g., selecting only El Nino years
– Conditional bootstrap

• K-Nearest Neighbor Bootstrap – “pattern matching”
– Select ‘K’ nearest neighbors (e.g., years) to the current ‘feature’
– Select one of the K neighbors at random
– Repeat to produce an ensemble

Examples

• Ensemble Weather Generation (Argentina - Pampas Region)
– Scenario generation
– Forecast

• Water Quality Modeling (Boulder Water Utility)

Two Step Weather Generator

Generated Precipitation State time series

1 0 0 1 1 0 0 0 1 0 0 - - - - -

Probability of Dry and Wet Days

Dry day      Wet day
0.60 (pd)    0.40 (pw)

Transition Prob (pij)

           Dry day       Wet day
Dry day    0.70 (pdd)    0.30 (pdw)
Wet day    0.80 (pwd)    0.20 (pww)

• Estimate Transition (wet to dry, etc.) Probabilities of the Markov Chain order-1 from historical data – for each month

• Generate Precipitation State time series using Markov Chain
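The first step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the coding 0 = dry and 1 = wet is assumed, and the transition probabilities in the usage line are simply the example values from the table above.

```python
import random

def estimate_transitions(states):
    """Estimate p_dw and p_ww by counting transitions in an observed 0/1 series."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for today, tomorrow in zip(states, states[1:]):
        counts[(today, tomorrow)] += 1
    dry_total = counts[(0, 0)] + counts[(0, 1)]
    wet_total = counts[(1, 0)] + counts[(1, 1)]
    p_dw = counts[(0, 1)] / dry_total if dry_total else 0.0
    p_ww = counts[(1, 1)] / wet_total if wet_total else 0.0
    return p_dw, p_ww

def simulate_states(n_days, p_dw, p_ww, first_state=0, seed=None):
    """Generate a dry/wet (0/1) series from a first-order Markov chain.

    p_dw: probability that a dry day is followed by a wet day
    p_ww: probability that a wet day is followed by a wet day
    """
    rng = random.Random(seed)
    states = [first_state]
    for _ in range(n_days - 1):
        p_wet = p_ww if states[-1] == 1 else p_dw
        states.append(1 if rng.random() < p_wet else 0)
    return states

# Example values taken from the transition table (assumed to apply to one month).
series = simulate_states(31, p_dw=0.30, p_ww=0.20, seed=1)
```

In practice the probabilities would be estimated separately for each month, as the slide states, and the matching pair would be used on each simulated day.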

• Suppose we need weather simulation for January 5th - January 4th is a wet day

• Get Neighbors from a 7-day window (7*50) centered on January 4th

• Screen days using the Precipitation state [(1,0), days in blue] – i.e., “Potential Neighbors”

• Calculate the distances between weather variables of current day feature vector and the potential neighbors

• Select the K-nearest neighbors • Assign them weights

Year     January (days 1, 2, 3, ...)    February (days 1, 2, 3, 4, ...)
1        2 0 0 3 0 2 0 0 - -            x x x x - -
2        0 3 2 0 0 0 4 0 - -            x x x x - -
3        3 0 0 2 0 3 0 0 - -            x x x x - -
4        0 0 6 0 0 0 0 0 - -            x x x x - -
...
(last)   0 2 0 3 0 0 2 3 - -            x x x x - -

• Pick a day from the k-NN using the weight function – say, Jan 1st 1953

• The simulated weather for Jan 5th is the weather of the following day, Jan 2nd 1953.

• Repeat

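Weight function, where j is the rank of the neighbor (j = 1 is the nearest):

$$K(j) = \frac{1/j}{\sum_{i=1}^{k} 1/i}, \qquad k = \sqrt{n}$$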
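The neighbor selection step can be sketched as follows. It is a minimal illustration under stated assumptions: distances are plain Euclidean distances on the raw weather variables (the slides do not say how the variables are scaled), n is taken to be the number of screened potential neighbors, and all function and variable names are hypothetical.

```python
import math
import random

def knn_weights(k):
    """Weight of the j-th nearest neighbor: (1/j) / sum_{i=1..k} (1/i)."""
    norm = sum(1.0 / i for i in range(1, k + 1))
    return [(1.0 / j) / norm for j in range(1, k + 1)]

def pick_neighbor(current, candidates, rng):
    """Pick one potential neighbor by weighted K-NN resampling.

    current:    feature vector of the current day, e.g. (precip, tmax, tmin)
    candidates: list of (label, feature_vector) pairs that passed the
                precipitation-state screening
    """
    k = max(1, round(math.sqrt(len(candidates))))  # k = sqrt(n) rule from the slide
    ranked = sorted(candidates, key=lambda c: math.dist(current, c[1]))[:k]
    return rng.choices(ranked, weights=knn_weights(k), k=1)[0][0]

# Hypothetical potential neighbors for a wet Jan 4th (label, (precip, tmax, tmin)).
candidates = [
    ("Jan 1 1953", (5.0, 28.0, 17.0)),
    ("Jan 3 1961", (12.0, 25.0, 15.0)),
    ("Jan 6 1947", (1.0, 31.0, 19.0)),
    ("Jan 2 1970", (6.0, 27.5, 16.5)),
]
picked = pick_neighbor((5.5, 28.0, 17.0), candidates, random.Random(0))
```

The simulated weather is then taken from the day following the picked neighbor, and the procedure repeats from that new day.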

Single Site Simulation

• Pergamino, Argentina
– Daily weather variables 1931-2003: precipitation, max. temperature, min. temperature

• 100 simulations of 73 year length (as length of record)

• Statistics of simulated and historical data are compared

Spell Properties

Pergamino, Argentina

wet and dry spell statistics

Moments (wet month - Jan)

Moments (dry month - July)

Conditional K-NN Re-sampling

• Conditioned on IRI seasonal forecast

• Get the prediction (A:N:B = 40:35:25, i.e., probabilities of above-normal, normal and below-normal)

• Divide historical (seasonal) totals into 3 tercile categories

• Bootstrap 40, 35 and 25 samples of historical years from the wet, normal and dry categories, respectively

• Apply the two-step weather generator on this sample.
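The conditional sampling of years can be sketched as below. This is a rough illustration, not the authors' implementation: the seasonal totals are made-up numbers, the function names are hypothetical, and the terciles are formed by simply ranking the years and cutting the list into thirds.

```python
import random

def tercile_categories(seasonal_totals):
    """Split years into dry / normal / wet thirds by ranked seasonal total."""
    ranked = sorted(seasonal_totals, key=seasonal_totals.get)
    third = len(ranked) // 3
    return {
        "dry": ranked[:third],
        "normal": ranked[third:len(ranked) - third],
        "wet": ranked[len(ranked) - third:],
    }

def conditional_sample(seasonal_totals, forecast, rng):
    """Draw years with replacement; the forecast gives the count per category."""
    cats = tercile_categories(seasonal_totals)
    sample = []
    for name, count in forecast.items():
        sample.extend(rng.choices(cats[name], k=count))
    return sample

# Hypothetical seasonal precipitation totals (mm) for nine years.
totals = {1991: 310.0, 1992: 120.0, 1993: 240.0, 1994: 95.0, 1995: 400.0,
          1996: 180.0, 1997: 350.0, 1998: 210.0, 1999: 150.0}
forecast = {"wet": 40, "normal": 35, "dry": 25}  # A:N:B = 40:35:25
years = conditional_sample(totals, forecast, random.Random(0))
```

The two-step weather generator would then be run on the resampled set of years instead of the full record.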

Conditional Weather Generation (results)

Multi-site extension

• Same procedure as single site is used but
– Calculate the average time series – “single site virtual weather data”
– Apply the two-step generator
– Select the weather at all the locations on the picked day – to obtain multi-site simulation

• Stations in the Pampas region, Argentina
– Pergamino
– Junin
– Nueve de Julio

wet and dry spell Statistics

Pergamino, Argentina

Multisite Case

Basic Distribution Properties

Spatial Correlation

[Figure: two panels showing the probability density function of sw_avg (x-axis 0 to 10)]

Motivation

[Figure: an input distribution passes through the water treatment plant (WTP) to give an output distribution, split into “Comply” and “Non-Compliance” regions]

Uncertainty helps us to understand the risk of non-compliance with a given regulation

• Monitoring effort mandated by USEPA

• Large public water systems

• Water quality and operating data

- Disinfection by-products (DBPs) and microorganisms to support rulemakings

• Most comprehensive view of large drinking water systems to date

Data Set: Information Collection Rule (ICR)

• 18 months (Jul. 1997 – Dec. 1998)

• 458 continental US locations

ICR Database

• Water Quality
– Influent
– Intermediate
– Finished
– Distribution system

• Chemical Additions

Influent water quality has significant variability due to

- climate

- geology

- water management practices

Characterize Variability

Source Water

• TOC

• TSUVA

• Alkalinity

• pH

• Turbidity

• Temperature

• Total Hardness

• Examine influent water quality for surface waters (SWs)
– Spatial variability
– Temporal variability

• Focus on total organic carbon (TOC)
– TOC is a precursor in formation of DBPs
– Methods extend to other water quality parameters

Variability

Spatial Variability

Variability

• Local polynomial approach

• Find best K and P combination

• Contour estimates

$$\mathrm{TOC}_{annual\_average} = f(\mathrm{Latitude}, \mathrm{Longitude})$$

Spatial Variability SW Average Annual TOC (mg/L)

Variability

2,30. P

Spatial Variability

Similar spatial patterns found for

• Finished water TOC (lower)

• Distribution system DBPs
– TTHM (total trihalomethanes)
– HAA5 (five haloacetic acids)

Spatial Variability

Spatial patterns consistent with previous research for other influent water quality variables

• Alkalinity

• Bromide

Temporal Variability

[Figure: monthly influent TOC (mg/L), January to December 1998]

City of Boulder’s Betasso Water Treatment Plant (CO)

Variability

Temporal Variability

• Some locations exhibited seasonal trends, others did not

• Month to month variations should be considered

• Inherent variability in water quality contributes to uncertainty

• How can we quantify uncertainty?

Variability

Simulate “ensembles” of influent water quality (Monte Carlo)

Quantify Uncertainty

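Observed data

$$\begin{bmatrix} \mathrm{TOC}_{1} & \dots & \mathrm{TOC}_{12} \end{bmatrix}$$

Ensembles

$$\begin{bmatrix} \mathrm{TOC}_{S1\_1} & \dots & \mathrm{TOC}_{S1\_12} \\ \vdots & & \vdots \\ \mathrm{TOC}_{S100\_1} & \dots & \mathrm{TOC}_{S100\_12} \end{bmatrix}$$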

Normal

Lognormal

• Fit a probability density function (pdf) to the data-Normal, Lognormal, etc.

• Simulate from pdf

Quantify

Traditional Method

Limitations - What if the pdf is not a good fit?

- What if you don’t have enough data to make the pdf?

ex. 18 months/location in ICR database

[Figure: histogram of May; x-axis: May (1000 to 6000), y-axis: density (0 to 4e-04)]

Quantify

• Skip fitting a pdf to the data

• Simulate by bootstrapping
– Randomly sample data with replacement

• Expand bootstrapping pool to include “similar” locations (nearest neighbors)

• What is limited in time is available in space

Space-Time Bootstrapping Method
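The basic idea can be illustrated with a few lines of Python. This is only a sketch with made-up numbers and a hypothetical function name: one month's values at a site are pooled with values from "similar" locations, and the pool is resampled with replacement.

```python
import random

def bootstrap_month(pool, n_draws, rng):
    """Resample one month's values with replacement from a pooled set of observations."""
    return [rng.choice(pool) for _ in range(n_draws)]

# Hypothetical May TOC values (mg/L): two from the site itself,
# the rest borrowed from "similar" locations (nearest neighbors).
own_site = [2.1, 2.4]
similar_sites = [1.9, 2.8, 2.2, 3.0]
pool = own_site + similar_sites
ensemble = bootstrap_month(pool, 100, random.Random(0))
```

With only the two values from the site itself, every draw would repeat one of those two numbers; pooling across space is what gives the ensemble its spread.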

Quantify

• Find nearest neighbors (locations) in terms of a feature vector that includes variables of interest

• Feature vector includes:
- Average Annual Concentration
- Latitude
- Longitude

$$\mathrm{FeatureVector} = (\mathrm{TOC}_{average}, \mathrm{Lat}, \mathrm{Lon})$$

Average annual concentration helps find neighbors that are similar but may not be geographically nearby.

Average annual TOC (mg/L) for Ohio surface waters

Geographically close, but not good “neighbors” for bootstrapping

$$\mathrm{FeatureVector} = (\mathrm{TOC}_{average}, \mathrm{Lat}, \mathrm{Lon})$$

• Sample monthly TOC values based on feature vector

• Conditional probability

$$f(\mathrm{TOC}_{monthly} \mid \mathrm{FeatureVector})$$

Simulation Algorithm

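$$x_{user} = \begin{bmatrix} \mathrm{TOC}_{user} & \mathrm{Lat}_{user} & \mathrm{Lon}_{user} \end{bmatrix}$$

$$x_{ICR} = \begin{bmatrix} \mathrm{TOC}_{1} & \mathrm{Lat}_{1} & \mathrm{Lon}_{1} \\ \vdots & \vdots & \vdots \\ \mathrm{TOC}_{i} & \mathrm{Lat}_{i} & \mathrm{Lon}_{i} \\ \vdots & \vdots & \vdots \\ \mathrm{TOC}_{m} & \mathrm{Lat}_{m} & \mathrm{Lon}_{m} \end{bmatrix}$$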

1) User inputs their location and their average annual TOC concentration

2) The ICR database is queried for all eligible entries

Quantify

Algorithm- cont.

3) Calculate distances, d, between the x_user vector and the x_ICR vector

$$d = \lVert x_{user} - x_{ICR} \rVert$$

Algorithm- cont.

3) Calculate distances using weighted Mahalanobis equation

$$d_i = \sqrt{\left(W(x_{user} - x_{ICR\_i})\right)^{T} S^{-1} \left(W(x_{user} - x_{ICR\_i})\right)}$$

Algorithm- cont.

Without the weights (W) and the covariance matrix (S), the equation reduces to the Euclidean distance

Algorithm- cont.

Because the covariance matrix S is included, the components of the feature vector do not have to be scaled (Davis 1986)

Algorithm- cont.

Weights are assigned as

$$W = \begin{bmatrix} W_{TOC} & W_{Lat} & W_{Lon} \end{bmatrix}$$

Four settings are considered: three in which a single weight equals 1 and the other two equal 0, and one in which $W_{TOC} = W_{Lat} = W_{Lon} = 1$.

Quantify

Weights offer flexibility in neighbor selection

[Figure: four panels, (a) to (d)]
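The distance calculation in step 3 can be sketched in plain Python. Several points are assumptions rather than details from the slides: W is treated as a diagonal matrix built from the three weights, S is taken to be the sample covariance of the ICR feature vectors, and the helper names are hypothetical. Instead of inverting S explicitly, the sketch solves S y = v with a small Gaussian elimination.

```python
import math

def covariance(rows):
    """Sample covariance matrix of the feature vectors (one row per location)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (M[i][n] - sum(M[i][c] * y[c] for c in range(i + 1, n))) / M[i][i]
    return y

def weighted_mahalanobis(x_user, x_i, weights, S):
    """d_i = sqrt(v^T S^-1 v), with v = W (x_user - x_i) and W = diag(weights)."""
    v = [w * (a - b) for w, a, b in zip(weights, x_user, x_i)]
    y = solve(S, v)  # y = S^-1 v
    return math.sqrt(sum(vi * yi for vi, yi in zip(v, y)))

# With S set to the identity and unit weights this is the Euclidean distance.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
d_all = weighted_mahalanobis((3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (1, 1, 1), identity)
d_toc = weighted_mahalanobis((3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (1, 0, 0), identity)
```

Setting a weight to 0 removes that component from the neighbor search, which is how the weights give flexibility in neighbor selection. A singular covariance matrix would make `solve` fail with a division by zero, so a real implementation would need to guard against that case.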

4) Obtain observed monthly data for each nearest neighbor

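$$x_{NN} = \begin{bmatrix} \mathrm{TOC}_{1\_Jan} & \dots & \mathrm{TOC}_{1\_Dec} \\ \vdots & & \vdots \\ \mathrm{TOC}_{i\_Jan} & \dots & \mathrm{TOC}_{i\_Dec} \\ \vdots & & \vdots \\ \mathrm{TOC}_{k\_Jan} & \dots & \mathrm{TOC}_{k\_Dec} \end{bmatrix}$$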

Algorithm- cont.

5) Bootstrap x_NN using a weight function

$$p_j = \frac{1/j}{\sum_{i=1}^{k} 1/i}$$

Increases likelihood of picking nearer neighbors
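Steps 4 and 5 can be sketched as follows. This is a minimal illustration with hypothetical names and data. One point is an assumption: a neighbor is drawn independently for each month of each simulation, which the slides do not state explicitly.

```python
import random

def simulate_ensembles(x_nn, n_sims, rng):
    """Build n_sims synthetic 12-month series from the k nearest neighbors.

    x_nn: list of k rows ordered nearest first, each holding 12 monthly values.
    For each month, a neighbor j is drawn with probability
    p_j = (1/j) / sum_{i=1..k} (1/i) and its value for that month is used.
    """
    k = len(x_nn)
    norm = sum(1.0 / i for i in range(1, k + 1))
    p = [(1.0 / j) / norm for j in range(1, k + 1)]
    sims = []
    for _ in range(n_sims):
        picks = rng.choices(range(k), weights=p, k=12)  # one neighbor per month
        sims.append([x_nn[j][m] for m, j in enumerate(picks)])
    return sims

# Hypothetical monthly TOC (mg/L) for three neighbors, nearest first.
x_nn = [
    [2.0 + 0.1 * m for m in range(12)],
    [2.5 + 0.1 * m for m in range(12)],
    [3.0 + 0.1 * m for m in range(12)],
]
sims = simulate_ensembles(x_nn, 100, random.Random(0))
may_ensemble = [s[4] for s in sims]  # 100 simulated values for May
```

Each column of `sims` is one monthly ensemble, which can be summarized with a box plot.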

Apply algorithm to quantify uncertainty in influent TOC concentration

City of Boulder’s Betasso Water Treatment Plant (CO)

Boulder

SWs only, N = 334

Quantify

Red dot is the Boulder plant being simulated

Empty black dots are the “neighbors” to be bootstrapped

Identify nearest neighbors

- Include Boulder in pool for bootstrapping

$$W_{TOC} = W_{Lat} = W_{Lon} = 1$$

[Figure: box plots of simulated influent TOC (mg/L) by month (J to D) and annual (Ann)]

Box plot each monthly bootstrap ensemble (100 values)

Box plot elements: median, 5th, 25th, 75th and 95th percentiles, outliers

Uncertainty quantified for Boulder

[Figure: box plots of simulated influent TOC (mg/L) by month (J to D) and annual (Ann), with 1998 observations]

• Simulates seasonal trends

• Provides rich variety of uncertainty

Overlay recent data

• Simulations capture recent data

[Figure: box plots of simulated influent TOC (mg/L) by month (J to D) and annual (Ann), with data from 1997, 1998, 2003, 2004 and 2005 overlaid]

Portable Across Locations

City of Birmingham’s Carson Filter Plant (AL)

[Figure: monthly and annual (Ann) influent TOC (mg/L): 1998 observations, box plots of the simulations, and an overlay of data from 1997, 1998, 2003, 2004 and 2005]

Applies to Other Variables

New Jersey American Water Swimming River Treatment Plant (NJ)

[Figure: monthly and annual (Ann) influent alkalinity (mg/L as CaCO3): 1998 observations, box plots of the simulations, and an overlay of data from 1997, 1998, 2002, 2003, 2004 and 2005]

Summary & Conclusions

• K-NN resampling technique provides a simple and robust alternative for generating ‘scenarios’.
– Quantify uncertainty
– Ensemble forecast

• Very general – can be easily applied to a variety of situations.
– Weather generation
– Water quality
– Streamflow (Colorado River Basin)

• Can readily be extended to generate ‘scenarios’ under climate change or decadal variability
– Modify the ‘feature vector’ to include the climate variability information

• Rajagopalan and Lall (1999); Yates et al. (2003), Apipattanavis et al. (2007) - all papers in Water Resources Research

• balajir@colorado.edu

AwwaRF project 3115

“Decision Tool to Help Utilities Develop Simultaneous Compliance Strategies”

Utilities

City of Boulder’s Betasso Water Treatment Plant (CO)

City of Birmingham’s Carson Filter Plant (AL)

New Jersey American Water Swimming River Treatment Plant (NJ)

Greater Cincinnati (OH) Water Works Richard Miller Water Treatment Plant

Acknowledgements

Questions

“It is better to be roughly right than precisely wrong.”

-John Maynard Keynes (1883-1946)
