benchmark dataset processing p. Štěpánek, p. zahradníček czech hydrometeorological institute...

Benchmark dataset processing

P. Štěpánek, P. Zahradníček

Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic,

COST-ESO601 meeting, Tarragona, 9-11 March 2009

E-mail: [email protected]

OutlineOutline

Outliers (daily data)Detection (monthly data) + correction

(daily data)

(description of methodology)Benchmark dataset (monthly data)

Processing before any data analysisProcessing before any data analysis

Software

AnClim,

ProClimDB

Data Data QQuality uality CControl ontrol FFinding inding OOutliersutliers

Two main approaches: Using limits derived from interquartile

ranges (time series)

comparing values to values of neighbouring stations (spatial analysis)

-4.0

-2.0

0.0

2.0

4.0

6.0

8.0

10.0

1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000

-4.0

-2.0

0.0

2.0

4.0

6.0

8.0

10.0

1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000

-4.0

-2.0

0.0

2.0

4.0

6.0

8.0

10.0

1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000

Example of outputs for outliers assessmentExample of outputs for outliers assessment

Altitudes

and distances of neighbours

List of neighbours

Neighbour stations valuesExpected value

Suspicious values

Quality controlQuality control

Run for period 1961-2007, daily data (measured values in observation hours)

All stations (200 climatological stations, 800 precipitation stations) All meteorological elements (T, TMA, TMI, TPM, SRA, SCE,

SNO, E, RV, H, F) – parameters set individually

Historical records will follow now

Air temperature, Air temperature, number of outliers 1961-2007, number of outliers 1961-2007, from from 33..431431..000000 station-days station-days

0

200

400

600

800

1000

1200

T_07:00 T_14:00 T_21:00 T_AVG TMA TMI TPM

T – air temperature at obs. hour, TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.


Temperature

0

50

100

150

200

250

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

07:00 14:00 21:00 AVG

Air temperature at obs. hour, AVG – daily average temp.


0

50

100

150

200

250


TMA TMI TPM

v

Max, min temperature, ground minimum

TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.

Air temperature, Air temperature, number of outliers 1961-2007, number of outliers 1961-2007,

Number of outliers per one station (all observation hours, AVG)

0.000

0.020

0.040

0.060

0.080

0.100

0.120

1961 1964 1967 1970 1973 1976 1979 1982 1985 1988 1991 1994 1997 2000 2003 2006

temperature

Spatial distribution of precipitation stationsSpatial distribution of precipitation stations

period 1961-2007 600 stations mean minimum distance: 7.5 km

Problematic detections Problematic detections (heavy rainfall)(heavy rainfall)

Problematic detections Problematic detections (heavy rainfall), Radar information(heavy rainfall), Radar information

Precipitation, Precipitation, number of outliers 1961-2007, from number of outliers 1961-2007, from 1313..724724..000000 station-daysstation-days

.

0

100

200

300

400

500

600

700

800

900

1000


precipitation

Precipitation, Precipitation, number of outliers 1961-2007, number of outliers 1961-2007,

Number of outliers per one station

0.000

0.050

0.100

0.150

0.200

0.250

0.300

0.350

0.400

0.450

1961 1964 1967 1970 1973 1976 1979 1982 1985 1988 1991 1994 1997 2000 2003 2006

precipitation

ConclusionsConclusions

Only combination of several methods for outliers detection leads to satisfying results (“real” outliers detection, supressing fault detection -> Emsemble approach)

Parameters (settings) has to be found individually for each meteorological element, maybe also region (terrain complexity) and part of a year (noticeable annual cycle in number of outliers)

Similar to homogenization of time series, it is important to use measured value (e.g. from observation hours) - outliers are masked in daily average (and even more in monthly or annual ones)

Errors found in all elements and investigated countries (AT, CZ, SK, HU)

Experiences with homogenization in the Czech Experiences with homogenization in the Czech RepublicRepublic

HomogenizationHomogenization Change of measuring conditions

inhomogeneities

DetectingDetecting Inhomogeneities Inhomogeneities by SNHT by SNHT (p=0.05, 950 series)(p=0.05, 950 series)

0

20

40

60

80

100

120

.1 .2 .3 .4 .5 .6 .7 .8 .9 1.0

Amount of change in level /°C

Inh

om

og

en

eit

ies

de

tec

ted

/ %

>2

2

1

0

Error/years

most of metadata incomplete

we depend upon statistical tests results

uncertainty in test results - right inhomogeneity detection is problematic (for smaller amount of change)

Assessing Homogeneity - Assessing Homogeneity - ProblemsProblems

Proposed solutionProposed solution

most of metadata incomplete uncertainty in test results - the right inhomogeneity

detection is problematic

„Ensemble“ approach - processing of big amount of test

results for each individual series

To get as many test results for each candidate series as possible

D a ta P ro ce ss ing

In te rq ua rtile R an ge C o m p ar ing to N e igh b ou rs

A le xa n de rsso n te st B iva r ia te Te st t- te st M a nn -W h itn ey -P e tt it

fro m C o r re la tio ns fro m D is tan ces

Filling M iss. Values

Adjusting Data

Hom. Assessm ent

Reference Series

Hom ogeneity T esting

Com bining Near Stations

Q uality Control - Outliers

Monthly, Seasonal and Annual Averages

Several Iterations

Probability

Days, Months, seasons, year

How to increase number of test resultsHow to increase number of test results

for monthly, daily data (each month individually)

weighted/unweighted mean from neighbouring stations criterions used for stations selection (or combination of it):

– best correlated / nearest neighbours (correlations – from the first differenced series)

– limit correlation, limit distance– limit difference in altitudes

neighbouring stations series should be standardized to test series AVG and / or STD

(temperature - elevation, precipitation - variance)

- missing data are not so big problem then

Creating RCreating Reference eference SSerieseries

Relative homogeneity testingRelative homogeneity testing

Available tests:– Alexandersson SNHT– Bivariate test of Maronna and Yohai– Mann – Whitney – Pettit test– t-test– Easterling and Peterson test– Vincent method– …

20 year parts of the daily series (40 for monthly series with 10 years overlap),

in SNHT splitting into subperiods in position of detected significant changepoint

(30-40 years per one inhomogeneity)

Homogeneity assessmentHomogeneity assessment

Homogeneity assessmentHomogeneity assessment Quality control Homogenization Data Analysis

Test Ref I II III IV V VI VII VIII IX X XI XII Win Spr Sum Aut Year

A avg 1927 1929 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927A 1930A corr 1927 1927 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927A 1939 1938 1939 1940 1922 1937 1937 1935A dist 1927 1928 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927A 1930 1940 1918B avg 1927 1928 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927B 1922B corr 1927 1927 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927B 1936 1938 1939 1944 1922 1935 1937 1937 1935B 1937B dist 1927 1928 1927 1927 1927 1928 1927 1926 1926 1926 1926 1926 1927 1927 1927 1926 1927B 1930 1940 1931 1913 1918V corr 1927 1926V 1937 1922 1935V 1937V dist 1927 1927 1927V 1918

Output example: Station Čáslav, 3rd segment, 1911-1950, n=40


Begin End LengthInHomogen

eityNumber

% detected inhom

% possible inhom

EndMissin

g

1911 1950 40 140 100 120

1927 60 43 511926 37 26 32

1928 9 6 8 4

1937 7 5 61922 4 3 31935 4 3 31918 3 2 31930 3 2 31939 3 2 31940 3 2 3 21938 2 1 21913 1 1 1 3 31929 1 1 11931 1 1 11936 1 1 11944 1 1 1

1926 1927 2 97 69 831926 1931 6 111 79 951935 1940 6 20 14 17

1911 1920 10 4 3 31921 1930 10 114 81 97

1931 1940 10 21 15 181941 1950 10 1 1 1

Homogeneity assessmentHomogeneity assessment, , Output II example:

Summed numbers of detections for individual years



ID ELEMYEAR_INHOMBEGINEND YEAR_COUNTY_POSSIBL YEAR_ENDMISSVALSX_BEGIN_DAX_END_DATEX_BEGINX_ENDLATITUDELONGITUDEALTITUDEB_FULLNAMEREMARKC_OBSERVERC_IDx B1BOJK01 x 1985 41 14.24 12 23.3.1984 31.3.2003 # # Bojkovicechange

B1BOJK01 x 1985 41 14.24 12 23.3.1984 31.12.9999 # # obs Vladimˇr Maz lekB1BOJK01B1BYSH01 x 1978 37 12.85

? B1BYSH01 x 1979 33 11.46? B1BYSH01 x 1980 43 14.93? B1HLHO01 x 1965 31 10.76 4 1

B1HOLE01 x 1976 33 11.46B1KROM01 x 1977 1978 31 10.76

x B1RADE01 x 1994 44 15.28 2 1.1.1994 31.12.9999 # # RadýjovchangeB1RADE01 x 1994 44 15.28 2 1.1.1994 31.12.9999 # # obs Josef Pˇ§aB1RADE01

x B1RYCH01 x 1973 49 17.01 1.5.1973 28.2.1991 # # VyÜkov, Rychtß°ov, Ŕ.157changeB1RYCH01 x 1973 49 17.01 1.9.1972 28.2.1991 # # obs Marie Hor kov B1RYCH01

xx? B1STRZ01 x 1987 53 18.40B1STRZ01 x 1988 30 10.42B1UHBR01 x 1983 31 10.76 18.2.1984 31.1.1999 # # Uherskř Brod, MoŔidla 354changeB1UHBR01 x 1983 31 10.76 18.2.1984 12.5.1993 # # obs Josef KudelaB1UHBR01

x B1UHBR01 x 1984 77 26.74 18.2.1984 31.1.1999 # # Uherskř Brod, MoŔidla 354changeB1UHBR01 x 1984 77 26.74 18.2.1984 12.5.1993 # # obs Josef KudelaB1UHBR01B1VELI01 x 1978 31 10.76

? B1VELI01 x 1977 1978 44 15.28? B1VKLO01 x 1984 29 10.07x B1VYSK01 x 1999 32 11.11 -1 1.4.1998 31.12.9999 # # VyÜkov, Dukelskß 12change

B1VYSK01 x 1999 32 11.11 -1 1.4.1998 31.12.9999 # # obs VojtŘch Sur kB1VYSK01B2BOSK01_rx 1968 33 11.46B2BREC01 x 1968 35 12.15B2BRUM01 x 1989 51 17.71 1.2.1989 31.3.1994 # # BrumovchangeB2BRUM01 x 1989 51 17.71 1.2.1989 31.3.1994 # # obs Marta Paýˇzkov B2BRUM01

-1.0

-0.8

-0.6

-0.4

-0.2

0.0

0.2

0.4

0.6

0.8

1911 1915 1919 1923 1927 1931 1935 1939 1943 1947

combining several outputs (sums of detections in individual years, metadata, graphs of differences/ratios, …)

Adjusting monthly dataAdjusting monthly data using reference series based on correlations adjustment: from differences/ratios 20 years before and after a

change, monhtly

smoothing monthly adjustments (low-pass filter for adjacent values)

I II III IV V VI VII VIII IX X XI XII

Example:

Adjusting values - evaluation

Iterative homogeneity testingIterative homogeneity testing

several iteration of testing and results evaluation– several iterations of homogeneity testing and

series adjusting (3 iterations should be sufficient)

– question of homogeneity of reference series is thus solved:

• possible inhomogeneities should be eliminated by using averages of several neighbouring stations

• if this is not true: in next iteration neighbours should be already homogenized

Filling missing valuesFilling missing values

Before homogenization: influence on right inhomogeneity detection

After homogenization: more precise - data are not influenced by possible shifts in the series

Dependence of tested series on reference series

#

#

Prague

Brno

HomogenizationHomogenization of the series of the series in the Czech Republicin the Czech Republic

Correlations between tested and reference series Air temperature

Boxplots:

- Median

- Upper and lower quartiles

(for 200 testes series)

1 - 07h2 - 14h3 - 21h4 - AVG

0.80

0.82

0.84

0.86

0.88

0.90

0.92

0.94

0.96

0.98

1.00


4

0.80

0.82

0.84

0.86

0.88

0.90

0.92

0.94

0.96

0.98

1.00


1

0.80

0.82

0.84

0.86

0.88

0.90

0.92

0.94

0.96

0.98

1.00


3

0.80

0.82

0.84

0.86

0.88

0.90

0.92

0.94

0.96

0.98

1.00


2

Correlations between tested and reference series Precipitation, snow depth, new snow

Boxplots:

- Median



1 - 07h precip.2 - 14h snow depth3 - 21h new snow

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


1

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


3

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


2

Correlations between tested and reference series Wind speed

Boxplots:

- Median



1 - 07h2 - 14h3 - 21h4 - AVG

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


4

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


1

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


3

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


2

Number of significant inhomogeneities (0.05) detected by used tests

(A, B tests, c and d reference series, alltogether)

0

100

200

300

400

500

600

700

800

900

1000


T_07

T_14

T_21

T_AVG

Air temperature

Homogeneity testing resultsHomogeneity testing results Air temperatureAir temperature

Amount of adjustments, averages of absolute values, T_AVG

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0


°C

T_07 T_14 T_21 T_AVG

Air temperature

Homogeneity testing resultsHomogeneity testing resultsAir temperatureAir temperature

Homogeneity testing resultsHomogeneity testing resultsPrecipitationPrecipitation

4 tests, 4 reference series, 12 months + 4 seasons and year Number of detected inhomogeneities (significant)

0

1000

2000

3000

4000

5000

6000

I II III IV V VI VII VIII IX X XI XIIMonth

Nu

mb

er

of

de

tec

tio

ns

Amount of change (ratios – standardized to be >1.0), precipitation(reference series calculation based on correlations)

Boxplots:

- Median



-0.005

0.000

0.005

0.010

0.015

0.020

0.025


Co

rrel

atio

n in

crea

se

1.000

1.050

1.100

1.150

1.200

1.250

1.300


Am

ou

nt o

f ch

an

ge

(st

an

da

rdiz

ed

)

Correlation improvement

HomogenizationHomogenization Final rFinal remarks, recommendationsemarks, recommendations 1/2 1/2

data quality control before homogenization is of very importance (if it is not part of it)

Using series of observation hours (complementarily to daily

AVG) is highly recommended (different manifestation of breaks)

be aware of annual cycle of inhomogeneities, adjustments, …

to know behavior of spatial correlations (of element being processed) to be able to create reference series of sufficient quality …

HomogenizationHomogenization Final rFinal remarks, recommendations 2/2emarks, recommendations 2/2

Because of Noise in the time series it makes sense: - „Ensemble“ approach to homogenization (combining

information from different statistical tests, time frames, overlapping periods, reference series, meteorological elements, …)

- more information for inhomogeneities assessment – higher quality of homogenization in case metadata are incomplete

Software used for data processingSoftware used for data processing

LoadData - application for downloading data from central database (e.g. Oracle)

ProClimDB software for processing whole dataset (finding outliers, combining series, creating reference series, preparing data for homogeneity testing, extreme value analysis, RCM outputs validation, correction, …)

AnClim software for homogeneity testing

http://www.http://www.cclimahom.limahom.eueu

AnClim softwareAnClim software

AnClim softwareAnClim software

ProcDataProcData software software

ProProClimDBClimDB software software

Testing of benchmark datasetTesting of benchmark dataset

Fully automated (just for one click), but it should not work like this in reality:

detection phase can be fully automated, but desicion about breaks to be corrected should be man-made (comparison with metadata, plots of differences – ratios, …)

Fully automatic detection phase in Fully automatic detection phase in ProProClimDBClimDB software softwareselecting neighbours for reference seriesselecting neighbours for reference series

Fully automatic detection phase in Fully automatic detection phase in ProProClimDBClimDB software softwarereference series calculation and launching AnClimreference series calculation and launching AnClim

Automation in AnClim, tests can be selected in advanceAutomation in AnClim, tests can be selected in advance

Test results processed back in ProClimDBTest results processed back in ProClimDB

Non-automated desicion about breaks before correctionNon-automated desicion about breaks before correction

http://www.http://www.cclimahom.limahom.eueu

benchmark dataset processing p. Štěpánek, p. zahradníček czech hydrometeorological institute...

Documents

tmi daily

daily average

outliers assessmentaltitudes

number of outlierssimilar

portion of number

observation hours outliers

tpm daily ground minimum

stationdayst air temperature