The identification of exceptional values in the ESPON database
DESCRIPTION
Paul Harris, Martin Charlton. National Centre for Geocomputation, NUIM Maynooth, Ireland. Madrid seminar - 10/6/10. The identification of exceptional values in the ESPON database. ESPON DB data. Identifying exceptional values. Case study 1 (detecting logical input errors).
![Page 1: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/1.jpg)
The identification of exceptional values in the ESPON database
Paul Harris, Martin Charlton
National Centre for Geocomputation, NUIM Maynooth, Ireland
Madrid seminar - 10/6/10
![Page 2: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/2.jpg)
Outline
1. ESPON DB data
2. Identifying exceptional values
3. Case study 1 (detecting logical input errors)
4. Case study 2 (detecting statistical outliers)
5. Next things to do...
![Page 3: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/3.jpg)
1. ESPON DB data
Socio-economic, land cover, …
Continuous, categorical, nominal, ordinal, …
Spatial support: area units, NUTS 0/1/2/23/3 (whose boundaries may also change over time)
Temporal support: commonly, yearly units (with only a short time series)
![Page 4: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/4.jpg)
2. Identifying exceptional values
Define two types:
1. Logical input errors (e.g. a negative unemployment rate)
2. Statistical outliers (e.g. an unusually high unemployment rate)
Two-stage identification algorithm:
Stage 1: identify input errors via mechanical techniques
Stage 2: identify outliers via statistical techniques
![Page 5: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/5.jpg)
Stage 1:
Identify logical input errors
![Page 6: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/6.jpg)
Logical input errors…
Usually detected using some logical, mathematical approach
Statistical detection may also help…
Typical input errors:
Impossible values (e.g. negatives, fractions…)
Repeated data for different variables
Data displaced between or within columns
Data swapped between or within columns
Wrong NUTS code or name
Wrong NUTS regions used (e.g. for 1999 instead of 2006)
Missing value code (e.g. 9999 treated as a true value)
Etc.
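Several of the checks listed above are mechanical enough to sketch in a few lines. The following is a minimal illustration only, not the authors' implementation: the record keys (`nuts`, `unemployed`, `active`), the region codes, the list of valid codes and the set of missing-value codes are all hypothetical.

```python
from collections import Counter

MISSING_CODES = {9999, -9999}  # hypothetical missing-value codes


def flag_input_errors(records, valid_codes):
    """Return {row index: [reasons]} for rows that look like input errors.

    Each record is a dict with the hypothetical keys 'nuts', 'unemployed'
    and 'active' (both counts of people).
    """
    flags = {}
    code_counts = Counter(r["nuts"] for r in records)
    for i, r in enumerate(records):
        reasons = []
        if r["nuts"] not in valid_codes:
            reasons.append("unknown NUTS code")
        if code_counts[r["nuts"]] > 1:
            reasons.append("duplicated NUTS code")
        for key in ("unemployed", "active"):
            value = r[key]
            if value in MISSING_CODES:
                reasons.append(f"{key}: missing-value code")
            elif value < 0:
                reasons.append(f"{key}: negative count")
            elif value != int(value):
                reasons.append(f"{key}: fractional count")
        # A count of unemployed above the active population suggests
        # swapped or displaced columns.
        if r["unemployed"] > r["active"]:
            reasons.append("unemployed exceeds active (possible swap)")
        if reasons:
            flags[i] = reasons
    return flags


# Made-up rows and codes, for illustration only.
valid_codes = {"XX001", "XX002", "XX003", "XX004"}
records = [
    {"nuts": "XX001", "unemployed": 5000, "active": 60000},
    {"nuts": "XX002", "unemployed": -120, "active": 45000},
    {"nuts": "XX003", "unemployed": 9999, "active": 52000},
    {"nuts": "XX004", "unemployed": 70000, "active": 6500},
    {"nuts": "XX9Z9", "unemployed": 3100.5, "active": 41000},
]
flags = flag_input_errors(records, valid_codes)
```

Flagged rows are only candidates: as the slides go on to say, correction usually needs an expert on the data.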
![Page 7: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/7.jpg)
Our approach…
Detect input errors mathematically (& statistically)
Flag observations if they are likely input errors
If possible - correct them
More likely - consult an expert on the data
Once happy - go to stage 2 - assume data is error-free
![Page 8: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/8.jpg)
Stage 2:
Identify statistical outliers
![Page 9: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/9.jpg)
Types of outliers…
![Page 10: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/10.jpg)
Our approach…
There is no single ‘best’ outlier detection technique, so…
Apply a representative selection of outlier detection techniques (which are simple & robust)
Flag an observation if it is a likely outlier according to each technique
Build up a weight of evidence for the likelihood of a given observation being statistically outlying
Suggest what type of outlier it is likely to be - aspatial, spatial, temporal, relationship, some mixture…
Consult an expert on the data to decide on the appropriate course of action
Here’s an example using nine techniques & three observations…
![Page 11: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/11.jpg)
| Identification technique | Identification type | Flags across Obs. 1, Obs. 2, Obs. 3 |
| --- | --- | --- |
| 1. Boxplot statistics | Aspatial & univariate | Yes, Yes |
| 2. Hawkins’ spatial test statistic | Spatial & univariate | Yes |
| 3. Time series statistics | Temporal & univariate | Yes, Yes, Yes |
| 4. Large residuals from multiple linear regression* | Aspatial & multivariate, linear relationships | Yes, Yes, Yes |
| 5. Large residuals from locally weighted regression* | Aspatial & multivariate, nonlinear relationships | Yes |
| 6. Large residuals from geographically weighted regression* | Spatial & multivariate, nonlinear relationships | Yes |
| 7. Principal component analysis* | Aspatial & multivariate, linear relationships | Yes |
| 8. Locally weighted principal component analysis* | Aspatial & multivariate, nonlinear relationships | Yes |
| 9. Geographically weighted principal component analysis* | Spatial & multivariate, nonlinear relationships | Yes |

(The transcript keeps each "Yes" but not which of the three observation columns it sat in.)

* Can have a spatial, univariate form if the coordinate data are used as variables
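The weight-of-evidence idea can be sketched as a simple tally: each technique contributes a set of flagged observations, and an observation's score is the number of techniques that flagged it. This is an illustrative sketch only. The observation names and flags below are made up (they are not the values from the table), only four of the nine techniques are shown, and the mapping from technique to outlier type is an assumption based on the "identification type" column.

```python
# Assumed mapping from technique to the kind of outlier it points to.
TECHNIQUE_TYPE = {
    "boxplot": "aspatial",
    "hawkins": "spatial",
    "time_series": "temporal",
    "mlr_residuals": "relationship",
}


def weight_of_evidence(flags_by_technique):
    """Map each flagged observation to the sorted list of techniques that flagged it."""
    evidence = {}
    for technique, flagged in flags_by_technique.items():
        for obs in flagged:
            evidence.setdefault(obs, []).append(technique)
    return {obs: sorted(techniques) for obs, techniques in evidence.items()}


def likely_types(techniques):
    """Suggest outlier types from the techniques that raised a flag."""
    return sorted({TECHNIQUE_TYPE[t] for t in techniques})


# Hypothetical flags for three made-up observations.
flags_by_technique = {
    "boxplot": ["obs_a", "obs_b"],
    "hawkins": ["obs_b"],
    "time_series": ["obs_b", "obs_c"],
    "mlr_residuals": ["obs_c"],
}

evidence = weight_of_evidence(flags_by_technique)

# Rank observations by how many techniques agree, strongest evidence first.
ranking = sorted(
    ((obs, len(techniques)) for obs, techniques in evidence.items()),
    key=lambda item: (-item[1], item[0]),
)
```

A plain count treats every technique as equally informative; weighting techniques differently would be a design choice the slides do not discuss.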
![Page 12: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/12.jpg)
3. Case study 1 (detecting logical input errors)
Data
Data at NUTS3 level (1351 observations/regions)
Variables: GDP evolution (2000 to 2005) (%), calculated using 4 other variables:
[Equation: GDP evolution E, computed from GDP 2000, GDP 2005, POP 2000 and POP 2005]
205 logical input errors deliberately introduced to NUTS codes & the 4 variables used to calculate GDP evolution only
~ 15% of data infected
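Because GDP evolution is derived from four stored variables, one mechanical check is to recompute it and flag rows where the stored and recomputed values disagree. The sketch below illustrates that idea only. The exact formula is not legible in this transcript, so the percentage change in GDP per head used here is an assumption, as are the column names and the tolerance.

```python
def gdp_evolution(gdp_2000, gdp_2005, pop_2000, pop_2005):
    """ASSUMED formula: percentage change in GDP per head, 2000 to 2005."""
    per_head_2000 = gdp_2000 / pop_2000
    per_head_2005 = gdp_2005 / pop_2005
    return 100.0 * (per_head_2005 - per_head_2000) / per_head_2000


def inconsistent_rows(rows, tolerance=0.01):
    """Return indices of rows whose stored evolution cannot be reproduced."""
    flagged = []
    for i, row in enumerate(rows):
        try:
            expected = gdp_evolution(
                row["gdp00"], row["gdp05"], row["pop00"], row["pop05"]
            )
        except ZeroDivisionError:
            # A zero population or zero GDP per head is itself suspicious.
            flagged.append(i)
            continue
        if abs(expected - row["evolution"]) > tolerance:
            flagged.append(i)
    return flagged


# Made-up rows with hypothetical column names.
rows = [
    {"gdp00": 1000, "gdp05": 1320, "pop00": 100, "pop05": 110, "evolution": 20.0},
    # pop05 has lost a digit (20 instead of 200), so the stored value no longer matches.
    {"gdp00": 2000, "gdp05": 2200, "pop00": 200, "pop05": 20, "evolution": 10.0},
    {"gdp00": 1500, "gdp05": 1600, "pop00": 0, "pop05": 150, "evolution": 5.0},
]
flagged = inconsistent_rows(rows)
```

A disagreement shows that something in the row is wrong, not which of the five values is at fault.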
![Page 13: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/13.jpg)
Performance results
False negatives - 13.2% (e.g. in Italy)
False positives - 2.0% (e.g. in Spain)
Overall misclassification rate - 3.7%
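The three rates can be computed from a list of true error labels and a list of flags, as sketched below. The slide does not state the denominators, so treating false negatives as a share of the introduced errors and false positives as a share of the clean observations is an assumption. With the 205 errors and 1351 observations from the data slide, that reading does reproduce the reported 3.7% overall rate. The toy labels are made up.

```python
def misclassification_rates(truth, flagged):
    """truth and flagged are parallel lists of booleans (True = input error)."""
    false_neg = sum(t and not f for t, f in zip(truth, flagged))
    false_pos = sum(f and not t for t, f in zip(truth, flagged))
    n_errors = sum(truth)
    n_clean = len(truth) - n_errors
    return {
        "false_negative_rate": false_neg / n_errors,  # missed errors / all errors
        "false_positive_rate": false_pos / n_clean,   # wrongly flagged / all clean
        "overall": (false_neg + false_pos) / len(truth),
    }


# Toy example: 4 true errors among 20 observations, one missed, one false alarm.
truth = [True] * 4 + [False] * 16
flagged = [True, True, True, False] + [True] + [False] * 15
rates = misclassification_rates(truth, flagged)

# Arithmetic check of the slide's figures under the assumed denominators.
n_obs, n_errors = 1351, 205
overall_pct = 100 * (0.132 * n_errors + 0.020 * (n_obs - n_errors)) / n_obs
```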
![Page 14: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/14.jpg)
Consequences if we had ignored input errors…
![Page 15: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/15.jpg)
4. Case study 2 (detecting statistical outliers)
Data
Data at NUTS23 level for eight years: 2000-2007
For each year - ‘unemployment rate’ calculated [(Unemployed population)/(Active population)]
8 variables at each of 790 regions = 6320 obs.
Data checked for input errors - i.e. stage 1 done
![Page 16: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/16.jpg)
![Page 17: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/17.jpg)
Presentation of results…
For brevity…
Let's say - we only need at least one of 8 time-specific unemployment values in a region to be outlying…
(But we can identify outliers by year too)
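The region-level rule described here amounts to an "any" over the eight yearly flags, with the per-year flags kept for closer inspection. A minimal sketch, using made-up region names and flags:

```python
YEARS = list(range(2000, 2008))  # the eight years named in the case study

# Hypothetical per-year outlier flags for three made-up regions.
flags_by_region = {
    "region_a": [False] * 8,
    "region_b": [False, False, True, False, False, False, False, False],
    "region_c": [True, False, False, False, False, False, True, True],
}

# Region-level summary: outlying if at least one yearly value is flagged.
outlying_regions = sorted(
    region for region, flags in flags_by_region.items() if any(flags)
)

# Year-level detail is still available when needed.
outlying_years = {
    region: [year for year, flag in zip(YEARS, flags) if flag]
    for region, flags in flags_by_region.items()
}
```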
![Page 18: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/18.jpg)
Results: 1 boxplot statistics (aspatial & univariate)
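A boxplot-based flag is usually taken to mean Tukey's fences: values more than 1.5 interquartile ranges beyond the quartiles. The slides do not state the multiplier or the quartile definition used, so both are assumptions in this sketch, and the rates are made-up values.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up unemployment rates (%) for eight regions in one year.
rates = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 21.5]
flagged = boxplot_outliers(rates)
```

The rule ignores both location and time, which is why the table classes it as aspatial and univariate.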
![Page 19: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/19.jpg)
Results: 2 Hawkins’ test (spatial & univariate)
![Page 20: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/20.jpg)
Results: 3 time series statistics (temporal & univariate)
![Page 21: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/21.jpg)
Results: 4 MLR residuals (aspatial linear relationships)
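The residual-based idea can be shown with the simplest case: fit a least-squares line and flag observations whose residual is large relative to the residual standard deviation. This sketch uses one predictor rather than the multiple regression named on the slide. The threshold of two standard deviations and the data are assumptions.

```python
from math import sqrt
from statistics import fmean


def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def large_residuals(x, y, threshold=2.0):
    """Return indices whose |residual| exceeds threshold * residual std. dev."""
    intercept, slope = ols_fit(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]


# Made-up data: y = 2x everywhere except one discrepant observation.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 25, 12, 14, 16, 18, 20]
flagged_rows = large_residuals(x, y)
```

Because the outlier itself pulls the fitted line and inflates the residual spread, an ordinary least-squares fit can mask outliers, which is one motivation for robust variants.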
![Page 22: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/22.jpg)
Results: 5 LWR residuals (aspatial nonlinear relationships)
![Page 23: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/23.jpg)
Results: 6 GWR residuals (spatial nonlinear relationships)
![Page 24: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/24.jpg)
Results: 7 PCA residuals (aspatial linear relationships & model-free)
![Page 25: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/25.jpg)
Results: 8 LWPCA residuals (aspatial nonlinear relationships & model-free)
![Page 26: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/26.jpg)
Results: 9 GWPCA residuals (spatial nonlinear relationships & model-free)
![Page 27: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/27.jpg)
Summary of results: weight of evidence
![Page 28: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/28.jpg)
Preliminary performance results
Infected ~ 5% of the data with ‘outliers’ & repeated the analysis on this ‘infected’ data…
False negatives: 10.3%
False positives: 34.3%
Overall misclassification rate: 26.1%
Problems:
Difficult to guarantee that our infections actually produce outliers…
The data already contains outliers (as shown)
![Page 29: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/29.jpg)
5. Next things to do…
1. Other ways of performance testing our approach: simulated data with known properties? Statistical theory (or properties)?
2. Refining each of our nine chosen techniques: robust extensions
![Page 30: The identification of exceptional values in the ESPON database](https://reader036.vdocument.in/reader036/viewer/2022081516/568147b3550346895db4f70a/html5/thumbnails/30.jpg)
Thank You!