Introduction in the Data hyperdimension



Page 1: Introduction in the Data hyperdimension...–In the Social Statistical Database* it was found (in 2000) that: ‐0,3% of all persons in admin. data sources used had an invalid Citizen

01/10/2014

1

Saskia Ossen, and Piet Daas

Introduction in the Data hyperdimension

Purpose of the module

- Introduction in Data hyperdimension

- Introduction of indicators for data evaluation (implemented in R software package)

• Developed within European BLUE ETS project
• Theory and practical examples
• Group exercise in which groups determine whether a source should be used based on the results for the Data hyperdimension.

- Introduction of Quality Report Card


Data: quality of the input

– Input quality of administrative data
• After evaluation of Source and Metadata hyperdimension

– Data hyperdimension studies
• Quality of the facts (values) in the source
• Data are part of every delivery!
• Time needed for evaluation is a serious issue
• Evaluate every delivery thoroughly?
• Evaluation may differ depending on the use intended (output)
• Relation with process (availability and quality of other data sources)

Essential pre-requisites and considerations

– Evaluation of the data quality of input sources needs to be efficient

– Focus on essential quality components
• What are the essential dimensions of input data quality?
• What are the essential indicators for those dimensions?
• For objects (units/events) and variables

– Purely input or also with output in mind?
• Data Source Quality (admin. data quality per se)
• Input oriented Output Quality (guesstimate of expected effect on output)


Essential dimensions of input data quality

– Five essential quality dimensions identified for input data of administrative sources:

1. Technical checks
• Technical accessibility, IT-part
2. Accuracy
• Correctness, validity, error-freeness
3. Completeness
• Coverage of units, missing variable data
4. Time-related dimension
• Timeliness, punctuality, period covered
5. Integrability
• Ease of integration and consistency of data between sources

Technical checks: Theory

Indicators | Description
1. Technical checks | Technical usability of the file and data in the file
1.1 Readability | Accessibility of the file and data in the file
1.2 File declaration | Compliance of the data in the file with the metadata agreements
1.3 Convertibility | Conversion of the file to the NSI-standard format

Technical checks dimension


Technical checks: Examples

– Very important for new sources, becomes somewhat less essential later on
‐ Corrupt files
‐ Encoded files for which the decoding password is missing
‐ Files in which the data do not comply with the metadata description
‐ Files with errors during/after conversion

Technical checks: File declaration compliance

– Simple frequency distributions are very helpful
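A frequency distribution of this kind can be sketched in a few lines of Python. This is an illustration only, not the indicator code from the BLUE ETS R package: the variable `gender`, its declared code list, and the sample values are all hypothetical.

```python
from collections import Counter

def frequency_check(values, declared_codes):
    """Count every value and collect those absent from the declared code list."""
    counts = Counter(values)
    undeclared = {v: n for v, n in counts.items() if v not in declared_codes}
    return counts, undeclared

# Hypothetical delivery: the file declaration allows "M", "F" and "U" only.
DECLARED = {"M", "F", "U"}
gender = ["M", "F", "F", "m", "U", "F", "9", "M"]

counts, undeclared = frequency_check(gender, DECLARED)

# Print the frequency distribution, most frequent value first.
for value, n in counts.most_common():
    marker = "" if value in DECLARED else "  <- not in file declaration"
    print(f"{value!r}: {n}{marker}")
```

Values such as a lower-case code or a stray digit stand out immediately in the distribution, which is why this check is cheap enough to run on every delivery.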


Technical checks: File declaration compliance

Accuracy: Theory

Indicators | Description
2. Accuracy | The extent to which data are correct, reliable, and certified

Objects
2.1 Authenticity | Legitimacy of objects
2.2 Inconsistent objects | Extent of erroneous objects in source
2.3 Dubious objects | Presence of untrustworthy objects

Variables
2.4 Measurement errors | Deviation of actual data value from ideal error-free measurements
2.5 Inconsistent values | Extent of inconsistent combinations of variable values
2.6 Dubious values | Presence of implausible values or combinations of values for variables

Accuracy dimension


– Objects with incorrect identification numbers (IDs)

– In the Netherlands all people have a Citizen Service Number
‐ 9-digit number (e.g. 123456782)
‐ Number has a feasibility check: the last digit is a check digit
‐ Rule used: sum = 9*n1 + 8*n2 + 7*n3 + 6*n4 + 5*n5 + 4*n6 + 3*n7 + 2*n8 – 1*n9; the remainder of sum/11 should be 0

– In the Social Statistical Database* it was found (in 2000) that:
‐ 0.3% of all persons in the admin. data sources used had an invalid Citizen Service Number

*Set of integrated admin. data sources and surveys (then ~100 million admin. records). Arts et al. (2000) Netherlands Official Statistics 15, pp. 16-22.

Accuracy example: Authenticity (1) % of objects with a syntactically incorrect identification key
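The check-digit rule quoted on this slide is easy to express in code. The sketch below applies exactly that weighted sum to a list of identifiers and reports the percentage that fail; the function names and the sample list are illustrative assumptions, and the check is purely syntactic (it says nothing about whether a number was ever issued).

```python
WEIGHTS = [9, 8, 7, 6, 5, 4, 3, 2, -1]  # weights from the rule on the slide

def is_valid_csn(number: str) -> bool:
    """Return True if a 9-digit string passes the weighted-sum check (sum % 11 == 0)."""
    if len(number) != 9 or not (number.isascii() and number.isdigit()):
        return False
    total = sum(w * int(d) for w, d in zip(WEIGHTS, number))
    return total % 11 == 0

def share_invalid(ids) -> float:
    """Percentage of identifiers that are syntactically incorrect."""
    invalid = sum(not is_valid_csn(i) for i in ids)
    return 100.0 * invalid / len(ids)

# Hypothetical sample; "123456782" is the example number from the slide.
sample = ["123456782", "123456789", "12345678", "111222333"]
for csn in sample:
    print(csn, "valid" if is_valid_csn(csn) else "INVALID")
print(f"{share_invalid(sample):.1f}% invalid")
```

For the slide's example, the weighted sum is 154, which is 14 times 11, so the number passes.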

Accuracy example: Authenticity (2) % of objects for which the source contains information that contradicts the information in a reference list for those objects

– Studies reveal significant differences between findings for ‘educational attainment’ obtained from a survey and from linked administrative data sources.

More in: Bakker (2011) Estimating the Validity of Administrative Variables. ISI-paper session IPS030, Dublin, Ireland.


Accuracy example: Authenticity (3) % of objects for which the source contains information that contradicts the information in a reference list for those objects

Accuracy example: Inconsistent objects

Rule: a person is part of exactly one household
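A rule like this can be checked mechanically. The following sketch counts, per person, the number of distinct households they are registered in and reports everyone for whom that count is not exactly one. The record layout (pairs of person and household identifiers) and all identifiers are hypothetical.

```python
from collections import defaultdict

def inconsistent_persons(memberships, population):
    """Map each person violating the rule to their number of distinct households."""
    households = defaultdict(set)
    for person_id, household_id in memberships:
        households[person_id].add(household_id)
    violations = {}
    for person_id in population:
        n = len(households[person_id]) if person_id in households else 0
        if n != 1:
            violations[person_id] = n
    return violations

# Hypothetical data: p2 is in two households, p4 is in none,
# p3 is registered twice in the same household (redundant, but not inconsistent).
memberships = [("p1", "h1"), ("p2", "h1"), ("p2", "h2"), ("p3", "h3"), ("p3", "h3")]
population = ["p1", "p2", "p3", "p4"]

violations = inconsistent_persons(memberships, population)
pct_inconsistent = 100.0 * len(violations) / len(population)
print(violations, f"{pct_inconsistent:.1f}% inconsistent")
```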


Accuracy example: Dubious values

Cross tabulation of the variable “Current activity status” versus age group
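As a rough illustration of how such a cross tabulation exposes dubious values, the sketch below counts records per combination of activity status and age group and then lists the non-empty cells that a plausibility rule marks as suspect. The category labels, the records, and the set of implausible combinations are all invented for the example.

```python
from collections import Counter

def cross_tab(records, row_key, col_key):
    """Count records per (row value, column value) combination."""
    return Counter((rec[row_key], rec[col_key]) for rec in records)

def dubious_cells(table, implausible):
    """Return the non-empty cells that the plausibility rule marks as suspect."""
    return {cell: n for cell, n in table.items() if cell in implausible}

# Hypothetical records and a hypothetical plausibility rule.
records = [
    {"age_group": "0-14", "status": "pupil"},
    {"age_group": "0-14", "status": "retired"},
    {"age_group": "15-64", "status": "employed"},
    {"age_group": "15-64", "status": "employed"},
    {"age_group": "65+", "status": "retired"},
    {"age_group": "65+", "status": "pupil"},
]
IMPLAUSIBLE = {("retired", "0-14"), ("pupil", "65+")}

table = cross_tab(records, "status", "age_group")
suspect = dubious_cells(table, IMPLAUSIBLE)
for (status, age_group), n in sorted(suspect.items()):
    print(f"dubious: status={status}, age group={age_group}, count={n}")
```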

Completeness: Theory

Indicators | Description
3. Completeness | Degree to which a data source includes data describing the corresponding set of real-world objects and variables

Objects
3.1 Undercoverage | Absence of target objects (missing objects) in the source
3.2 Overcoverage | Presence of non-target objects in the source
3.3 Selectivity | Statistical coverage and representativity of objects
3.4 Redundancy | Presence of multiple registrations of objects

Variables
3.5 Missing values | Absent values for (key) variables
3.6 Imputed values | Presence of values resulting from imputation actions by data source holder

Completeness dimension
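Several of the completeness indicators reduce to set comparisons between the objects in the source and a target population. The sketch below is one possible reading, not the definitions used in the BLUE ETS package: the choice of denominators, the function names, and the sample identifiers are assumptions made for illustration.

```python
from collections import Counter

def completeness_indicators(source_ids, target_ids):
    """Undercoverage, overcoverage and redundancy as percentages (assumed denominators)."""
    source_counts = Counter(source_ids)
    source_set = set(source_counts)
    target_set = set(target_ids)
    return {
        # target objects missing from the source, relative to the target population
        "undercoverage_pct": 100.0 * len(target_set - source_set) / len(target_set),
        # non-target objects in the source, relative to distinct source objects
        "overcoverage_pct": 100.0 * len(source_set - target_set) / len(source_set),
        # surplus registrations, relative to all source records
        "redundancy_pct": 100.0 * sum(n - 1 for n in source_counts.values()) / len(source_ids),
    }

def missing_pct(values):
    """Percentage of absent (None) values for one variable."""
    return 100.0 * sum(v is None for v in values) / len(values)

# Hypothetical source with one duplicate ("b") and one non-target object ("e").
target = ["a", "b", "c", "d"]
source = ["a", "b", "b", "c", "e"]
result = completeness_indicators(source, target)
print(result, missing_pct([1, None, 3, None]))
```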


Completeness example: Selectivity (1)

Completeness example: Selectivity (2)

The education register has age-related undercoverage of educational attainment (56.3% is missing)

Explanation:
1) Children under 15 have a known level of education
2) Level of education of young adults is usually stored in recently created admin. data sources
3) Information on ‘middle-aged’ people is obtained from the LFS survey (small compared to the admin. data)
4) Information on ‘elderly’ people (≥65 years) is almost completely missing (not surveyed and hardly registered)


Pre-evaluation and input quality of administrative data sources (Part 2)

Completeness example: Selectivity (3)

Time related: Theory

Indicators | Description
4. Time-related dimension | Indicators that are time and/or stability related
4.1 Timeliness | Lapse of time between the end of the reference period and the moment of receipt of the data source
4.2 Punctuality | Possible time lag between the actual delivery date of the source and the date it should have been delivered
4.3 Overall time lag | Overall time difference between the end of the reference period and the moment it is concluded that it can definitely be used
4.4 Delay | Extent of delays in registration

Objects
4.5 Dynamics of objects | Changes in the population of objects (new and dead objects) over time

Variables
4.6 Stability of variables | Changes of variables or values over time

Time-related dimension
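The first three indicators in this table are plain date differences. A minimal sketch, assuming that the moment of receipt equals the actual delivery date and that results are expressed in days; the parameter names and the dates are hypothetical.

```python
from datetime import date

def time_indicators(reference_period_end, agreed_delivery, actual_delivery, approved_for_use):
    """Return timeliness, punctuality and overall time lag in days."""
    return {
        # end of reference period -> receipt of the source
        "timeliness_days": (actual_delivery - reference_period_end).days,
        # agreed delivery date -> actual delivery date (positive means late)
        "punctuality_days": (actual_delivery - agreed_delivery).days,
        # end of reference period -> decision that the source can be used
        "overall_time_lag_days": (approved_for_use - reference_period_end).days,
    }

# Hypothetical delivery covering reference year 2013.
lags = time_indicators(
    reference_period_end=date(2013, 12, 31),
    agreed_delivery=date(2014, 2, 1),
    actual_delivery=date(2014, 2, 15),
    approved_for_use=date(2014, 3, 1),
)
print(lags)
```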


Time-related example: Delay

– Events recorded some time after they have occurred
• Events are missing (or erroneously recorded)
• Particularly important for sources used immediately

– Examples:
• Marriages contracted in immigrants’ country of origin are sometimes recorded two or three years after the event (Bakker et al. AIOS-paper 2008)
• Part of VAT-data is reported later than is needed for monthly estimates (Vlag, ISI-paper 2011)

Time-related example: Stability of variables (1)

Type of comparison used in the Dutch Short term Statistics


Time-series for a single company

Time-related example: Stability of variables (2)

Integrability: Theory

Indicators | Description
5. Integrability | Extent to which the data source is capable of undergoing integration or of being integrated.

Objects
5.1 Comparability of objects | Similarity of objects in source (at the proper level of detail) with the objects used by NSI
5.2 Alignment of objects | Linking-ability (align-ability) of objects in source with those of NSI

Variables
5.3 Linking variable | Usefulness of linking variables (keys) in source
5.4 Comparability of variables | Proximity (closeness) of variables

Integrability dimension


Integrability example: Alignment of objects

[Figure: two panels, export and import; axes: VAT-turnover (€) and ICP-turnover (€)]

Finding:
- Differences between two admin. data sources (ICP and VAT), both used for International trade statistics
- Export aligns well but import is much more problematic!

Explanation:
- ICP import units are difficult to identify and can therefore not always be linked correctly
- ICP export data can be integrated well.

Quality Report Card: Step 1 Indicator level

– Step 1: Determine one score per indicator


Quality Report Card: Step 2 Dimensional level

– Step 2: Determine one score per dimension

Quality Report Card: Step 3 General level

– Step 3: Determine a general score
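The three steps amount to rolling scores up from indicators to dimensions to a single overall score. The slides do not state the scoring scale or the aggregation rule, so the sketch below assumes both: a hypothetical three-level scale ("+", "o", "-") and a "worst score wins" rule. Treat it as one possible scheme, not as the Quality Report Card method itself.

```python
# Hypothetical scale, ordered from worst to best.
ORDER = {"-": 0, "o": 1, "+": 2}

def worst(scores):
    """Assumed aggregation rule: the lowest score determines the result."""
    return min(scores, key=ORDER.__getitem__)

# Step 1: one score per indicator (invented values).
indicator_scores = {
    "Technical checks": {"Readability": "+", "File declaration": "+", "Convertibility": "o"},
    "Accuracy": {"Authenticity": "+", "Inconsistent objects": "-", "Dubious values": "o"},
    "Completeness": {"Undercoverage": "o", "Missing values": "+"},
}

# Step 2: one score per dimension.
dimension_scores = {dim: worst(scores.values()) for dim, scores in indicator_scores.items()}

# Step 3: one general score.
general_score = worst(dimension_scores.values())

print(dimension_scores, general_score)
```

Other rules, such as a weighted average or a majority vote, would fit the same three-step structure; which one is appropriate depends on the intended use of the source.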


Questions?

Any questions or comments?

Exercise

– Let’s try to interpret some data quality findings!

– To ease the exercise, every indicator has a single score


Group exercise

– Participants will be split into groups and each group is provided with:
‐ The Source, Metadata and Data results for the administrative data source discussed in the previous exercise
‐ An intended use

– Each group will be asked to discuss:
‐ Whether the data in the source could be used for the purpose intended
• If yes, why is everything OK?
• If not, what is the problem that prevents its use and how can it be solved?