model of transformation administrative data to statistical data

Model of transformation administrative data to statistical data Data used in Population and Housing Census 2011 – examples Janusz Dygaszewicz and Paweł Murawski Central Statistical Office POLAND


Post on 25-Feb-2016




Page 1: Model of transformation administrative data to statistical data

Model of transformation administrative data to statistical data

Data used in Population and Housing Census 2011 – examples

Janusz Dygaszewicz and Paweł Murawski

Central Statistical Office

POLAND

Page 2: Model of transformation administrative data to statistical data

Outline

1. Purpose of the work on administrative sources
2. Data quality
3. Extract data
4. Transform data
5. Summary

Page 3: Model of transformation administrative data to statistical data

Registers - data acquisition

Data owners:
• Ministry of Finance,
• Ministry of Interior and Administration,
• Ministry of Justice,
• Agricultural Social Insurance Fund,
• National Health Fund,
• Agency for Restructuring and Modernisation of Agriculture,
• Agricultural and Food Quality Inspection,
• Agency for Geodesy and Cartography,
• State Fund for Rehabilitation of Disabled Persons,
• County Offices,
• Commune Offices,
• Regional Offices,
• Telecoms,
• Energy Suppliers,
• Office For Foreigners,
• Social Insurance Institution,
• Housing Managers.

Page 4: Model of transformation administrative data to statistical data

Purpose of the work on administrative data

Obtaining a sufficiently complete data set: subjective and objective completeness that corresponds to classification standards, definitions and basic categories, so that administrative data can be used effectively.

Page 5: Model of transformation administrative data to statistical data

Data quality - measures -

1. Measuring the quality of administrative registers
– timeliness of data
– methodological compatibility
– completeness
– identification standards used in the registry
– usefulness
– compatibility of data in administrative sources with data obtained in the study/survey

2. Measuring the quality in processing of data registers
– excessive coverage error rate
– incomplete coverage error rate
– subjective indicator of completeness
– objective indicator of completeness
– imputation rate
– data correction index
– index of data integration from various sources
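As an illustration of the two coverage error rates, the sketch below compares the identifiers held in a register with those of a reference population. The slides give no formulas, so the definitions used here are assumptions: excessive coverage is taken as the share of register units that fall outside the target population, and incomplete coverage as the share of target units missing from the register. The function name and the sample identifiers are hypothetical.

```python
def coverage_rates(register_ids, target_ids):
    """Return (excessive_coverage_rate, incomplete_coverage_rate).

    Assumed definitions, not taken from the slides:
    - excessive coverage: register units not in the target population
    - incomplete coverage: target units absent from the register
    """
    register, target = set(register_ids), set(target_ids)
    excessive = len(register - target) / len(register) if register else 0.0
    incomplete = len(target - register) / len(target) if target else 0.0
    return excessive, incomplete


# Hypothetical identifiers: "x" is only in the register, "d" and "e" are missing from it.
excessive, incomplete = coverage_rates(["a", "b", "c", "x"], ["a", "b", "c", "d", "e"])
```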

Page 6: Model of transformation administrative data to statistical data

Extract data

• consolidation of data from various source systems with different data formats,
• extraction of data into the production environment based on SAS software,
• conversion of data into one format suitable for processing: SAS tables,
• validation of the imported data structure, which is an integral part of this process.
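The structure check that accompanies extraction can be sketched as follows. The production environment described here is SAS; the Python below is only an illustration of the idea. The column layout, the `;` delimiter and the function name are hypothetical and do not describe the real register extracts.

```python
import csv
import io

# Hypothetical layout of a TXT extract; the real register layouts are not given in the slides.
EXPECTED_COLUMNS = ["pesel", "imie1", "nazwisko1", "plec"]


def load_txt_extract(handle, expected=EXPECTED_COLUMNS, delimiter=";"):
    """Read a delimited text extract and verify its header before accepting any rows."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [c for c in expected if c not in header]
    unexpected = [c for c in header if c not in expected]
    if missing or unexpected:
        raise ValueError(f"structure mismatch: missing={missing}, unexpected={unexpected}")
    return list(reader)


sample = io.StringIO("pesel;imie1;nazwisko1;plec\n00000000001;JAN;KOWALSKI;M\n")
rows = load_txt_extract(sample)
```

Rejecting the whole file on a header mismatch, rather than skipping columns silently, keeps structural problems visible before any transformation step runs.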

Page 7: Model of transformation administrative data to statistical data

Extract data - examples -

Register/System | Name | Central Y/N | Relational Y/N | Data format
PESEL | General electronic system of Population Register | Y | N | TXT
KEP | Register of the National Taxpayers | Y | N | TXT
GZM | Community Registers of Residence | N | Y | SQL Server
PIT | Personal income tax register | Y | N | TXT
SI MS | Ministry of Justice system | Y | N | XLS
POBYT | Foreigners evidence system | Y | Y | SQL Server
ZUS CRPS | Central Register of Contribution Payers | Y | N | TXT
ZUS CRU | Central Register of Insured Persons | Y | N | TXT
ZUS SER | Pension insurance system | Y | N | TXT
KRUS | Agricultural Social Insurance Fund System | Y | N | XML
CWU NFZ | Central Register of Insured | Y | N | TXT
PFRON | State Fund for Rehabilitation of Disabled Persons | Y | N | TXT
ARiMR | Agency for Restructuring and Modernisation of Agriculture system | Y | Y | XLS
EPN | Property Tax Records | N | Y | SQL Server

Page 8: Model of transformation administrative data to statistical data

Transform data

Data processing in the production environment consists of:
• profiling – creating a report on the data quality,
• unification/standardization of data,
• parsing (separation) or combining variables,
• standardization with schemes,
• conversion,
• validation,
• deduplication,
• data integration.

Page 9: Model of transformation administrative data to statistical data

Transform data- profiling-

Page 10: Model of transformation administrative data to statistical data

Transform data - standardization and parsing examples -

Incorrect data format | Format after standardization
1985-02-21 | 19850221
1985.02.21 | 19850221
1985 02 21 | 19850221

Before standardization and parsing:

Voivodeship | City | Street | Place of birth
MAZOWQIECKIE | WARZSWA | ul. DŁUGA | LONDYN - ANGLIA
MAZPWOECKIE | WARS-AWA | Ulica DŁUUGA | LONDYN – WLK BRYTANIA
ZAZOWIEVCKIE | AWRSZAWA | DLUGAA | LONDYN/CHELSEA
MZAOWIECIE | WARSZAAAWA | DŁUGA (ul.) | LONDYN BRIDGE

After standardization and parsing:

Voivodeship | City | Prefix | Street | Place of birth
MAZOWIECKIE | WARSZAWA | UL | DŁUGA | LONDYN
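The three operations in these examples (unifying date formats, separating the street prefix, and correcting misspelled names against a dictionary) can be sketched in Python. This is a minimal illustration and not the SAS procedure used in the census. The reference list `CITY_DICTIONARY`, the similarity cutoff and all function names are assumptions, and fuzzy matching with `difflib` is a swapped-in technique chosen only for the example.

```python
import difflib
import re
from datetime import datetime

CITY_DICTIONARY = ["WARSZAWA", "KRAKÓW", "GDAŃSK"]  # hypothetical reference list
PREFIX_RE = re.compile(r"\(?\b(?:ULICA|UL)\b\.?\)?")  # "UL", "UL.", "ULICA", "(UL.)"


def standardize_date(raw):
    """Strip separators and return the date as YYYYMMDD."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 8:
        raise ValueError(f"unexpected date: {raw!r}")
    datetime.strptime(digits, "%Y%m%d")  # raises ValueError for impossible dates
    return digits


def parse_street(raw):
    """Split a street field into (prefix, street name)."""
    text = raw.strip().upper()
    match = PREFIX_RE.search(text)
    if match is None:
        return "", text
    street = (text[:match.start()] + text[match.end():]).strip()
    return "UL", street


def standardize_name(raw, dictionary, cutoff=0.6):
    """Return the closest dictionary entry, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(raw.strip().upper(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


dates = [standardize_date(d) for d in ("1985-02-21", "1985.02.21", "1985 02 21")]
street = parse_street("DŁUGA (ul.)")
city = standardize_name("WARZSWA", CITY_DICTIONARY)
```

Returning `None` for names without a close match leaves those records for manual review instead of forcing a wrong correction.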

Page 11: Model of transformation administrative data to statistical data

Transform data- schemes examples-

Page 12: Model of transformation administrative data to statistical data

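Transform data - examples: data cleaning report -

Group of variables | Variable | Before cleaning: total | Before cleaning: incorrect | Before cleaning: in % | After cleaning: total | After cleaning: incorrect | After cleaning: in %
Address of permanent residence | COMMUNITY | 4320724 | 428469 | 9.92% | 4316061 | 72797 | 1.69%
Address of permanent residence | CITY | 4353209 | 207399 | 4.77% | 4352983 | 43086 | 0.99%
Address of permanent residence | STREET | 3514154 | 573899 | 16.34% | 3440932 | 125392 | 3.65%
Address of permanent residence | PREFIX | 0 | 0 | - | 108551 | 0 | 0.00%
Address of residence | COMMUNITY | 739088 | 100282 | 13.57% | 738717 | 11666 | 1.58%
Address of residence | CITY | 742388 | 30644 | 4.13% | 742336 | 6344 | 0.86%
Address of residence | STREET | 607939 | 102725 | 16.90% | 593370 | 21012 | 3.55%
Address of residence | PREFIX | 0 | 0 | - | 18416 | 0 | 0.00%
Correspondence address | COMMUNITY | 2005 | 132 | 6.59% | 2005 | 30 | 1.50%
Correspondence address | CITY | 448791 | 21678 | 4.84% | 448704 | 4796 | 1.07%
Correspondence address | STREET | 377849 | 64871 | 17.17% | 374220 | 20575 | 5.50%
Correspondence address | PREFIX | 0 | 0 | - | 11192 | 0 | 0.00%
Personal data | NAME | 4355764 | 7208 | 0.17% | 4355757 | 5534 | 0.13%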
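The "in %" columns of the report are the share of incorrect values in the total. A short sketch that recomputes the COMMUNITY row for the address of permanent residence; the function name is hypothetical, and returning `None` for an empty variable (shown as "-" in the report) is an assumption about how that case is handled.

```python
def incorrect_share(total, incorrect):
    """Percentage of incorrect values, rounded to two decimals; None when the variable is empty."""
    if total == 0:
        return None
    return round(100 * incorrect / total, 2)


# COMMUNITY, address of permanent residence: figures taken from the report above.
before_cleaning = incorrect_share(4320724, 428469)
after_cleaning = incorrect_share(4316061, 72797)
```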

Page 13: Model of transformation administrative data to statistical data

Transform data - conversion: gender variable -

[Diagram: conversion between the gender codings "female" / "male", "F" / "M" and "1" / "2".]

Page 14: Model of transformation administrative data to statistical data

Transform data - conversion: marital status variable -

Statistical standard:
1 bachelor
2 unmarried woman
3 married (M)
4 married (F)
5 divorced (M)
6 divorced (F)
7 widower
8 widow
9 unidentified
10 separated (M)
11 separated (F)
12 non-formalized relationship (M)
13 non-formalized relationship (F)
14 single (M)
15 single (F)

Family benefits system (source code -> statistical standard):
501 – unmarried woman -> 2
502 – bachelor -> 1
503 – married (M) -> 3
504 – married (F) -> 4
505 – divorced (M) -> 5
506 – divorced (F) -> 6
507 – widow -> 8
508 – widower -> 7
509 – non-formalized (M) -> 12
510 – non-formalized (F) -> 13
511 – separated -> 10/11

Foreigners evidence system (source code -> statistical standard):
$BD – no data -> 9
KWR – bachelor -> 1
MTA – married (F) -> 4
PNA – unmarried woman -> 2
RWA – divorced (F) -> 6
RWY – divorced (M) -> 5
WDA – widow -> 8
WDC – widower -> 7
WNA – single (F) -> 15
WNY – single (M) -> 14
ZNY – married (M) -> 3

Community Registers of Residence (source code -> statistical standard):
1 – bachelor -> 1
2 – unmarried woman -> 2
3 – married (M) -> 3
4 – married (F) -> 4
5 – divorced (M) -> 5
6 – divorced (F) -> 6
7 – widower -> 7
8 – widow -> 8
no data – empty field -> 9
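The conversion of the three source codings to the statistical standard can be sketched as a set of lookup tables. The target codes follow the mapping labels on this slide. The source keys, the function name, the use of a `sex` argument to choose between 10 and 11 for code 511, and the fallback to 9 (unidentified) for unknown codes are assumptions made for the illustration.

```python
UNIDENTIFIED = 9

# Source code -> statistical standard code, per register.
FAMILY_BENEFITS = {"501": 2, "502": 1, "503": 3, "504": 4, "505": 5,
                   "506": 6, "507": 8, "508": 7, "509": 12, "510": 13}
FOREIGNERS = {"$BD": 9, "KWR": 1, "MTA": 4, "PNA": 2, "RWA": 6, "RWY": 5,
              "WDA": 8, "WDC": 7, "WNA": 15, "WNY": 14, "ZNY": 3}
RESIDENCE = {str(code): code for code in range(1, 9)}  # codes 1-8 map to themselves

TABLES = {"family_benefits": FAMILY_BENEFITS,
          "foreigners": FOREIGNERS,
          "residence": RESIDENCE}


def to_statistical_standard(source, code, sex=None):
    """Convert a register-specific marital status code to the statistical standard."""
    code = (code or "").strip()
    if source == "family_benefits" and code == "511":
        # "separated" is split by sex in the standard: 10 (M) or 11 (F).
        return {"M": 10, "F": 11}.get(sex, UNIDENTIFIED)
    # Empty fields and unknown codes fall back to 9 (assumption for unknown codes).
    return TABLES[source].get(code, UNIDENTIFIED)


examples = [
    to_statistical_standard("family_benefits", "507"),
    to_statistical_standard("family_benefits", "511", sex="F"),
    to_statistical_standard("foreigners", "WNY"),
    to_statistical_standard("residence", ""),
]
```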

Page 15: Model of transformation administrative data to statistical data

Transform data - validation -

• checking the data,
• correcting abnormal values according to the algorithms prepared by methodologists,
• where necessary, excluding from further processing records that cannot be corrected.
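The slides do not list the validation rules prepared by the methodologists. As one possible example of such a rule, the sketch below checks the PESEL check digit (weights 1, 3, 7, 9 repeated over the first ten digits) and separates records that pass from records to be excluded. The function names and the record layout are hypothetical.

```python
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_is_valid(pesel):
    """Check length, digits and the check digit of a PESEL number."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        return False
    checksum = sum(w * int(d) for w, d in zip(PESEL_WEIGHTS, pesel))
    return (10 - checksum % 10) % 10 == int(pesel[10])


def split_valid(records, field="pesel"):
    """Return (valid, rejected) lists according to the PESEL rule."""
    valid = [r for r in records if pesel_is_valid(r[field])]
    rejected = [r for r in records if not pesel_is_valid(r[field])]
    return valid, rejected


valid, rejected = split_valid([{"pesel": "44051401359"}, {"pesel": "44051401358"}])
```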

Page 16: Model of transformation administrative data to statistical data

Transform data - deduplication -

• removal of repeated units,
• requires detailed analysis, including analysis of the legal acts specific to each register,
• result of deduplication: one record with all the possible and unique information.

Page 17: Model of transformation administrative data to statistical data

Transform data - example of deduplication process -

Column names: imie1 = first name, nazwisko1 = surname, plec = sex (K = female, M = male), pesel = PESEL number, adr_ulica = street, adr_nr_dom = house number, adr_nr_lok = flat number, data_zam_od = resident since.

Input records (GZM):

imie1_GZM | nazwisko1_GZM | plec_GZM | pesel_GZM | adr_ulica_GZM | adr_nr_dom_GZM | adr_nr_lok_GZM | data_zam_od_GZM
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 20070303
ANNA | MALINOWSKA | K | 00000000002 | ANDERSA | 7 | | 20010205
ADAM | PIOTROWSKI | M | 00000000003 | FILTROWA | 2 | 45 | 20090101
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 19840712

The two JAN KOWALSKI records share the same pesel_GZM and differ only in data_zam_od_GZM.

Result after deduplication:

imie1_GZM | nazwisko1_GZM | plec_GZM | pesel_GZM | adr_ulica_GZM | adr_nr_dom_GZM | adr_nr_lok_GZM | data_zam_od_GZM
ANNA | MALINOWSKA | K | 00000000002 | ANDERSA | 7 | | 20010205
ADAM | PIOTROWSKI | M | 00000000003 | FILTROWA | 2 | 45 | 20090101
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 20070303
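A minimal sketch of this deduplication step, assuming that records are grouped by `pesel_GZM` and that the record with the latest `data_zam_od_GZM` is retained, which matches the result shown for JAN KOWALSKI. The function name is hypothetical, only a subset of the columns is used, and the sketch keeps a whole record instead of merging fields from all duplicates.

```python
RECORDS = [
    {"imie1_GZM": "JAN", "nazwisko1_GZM": "KOWALSKI", "pesel_GZM": "00000000001", "data_zam_od_GZM": "20070303"},
    {"imie1_GZM": "ANNA", "nazwisko1_GZM": "MALINOWSKA", "pesel_GZM": "00000000002", "data_zam_od_GZM": "20010205"},
    {"imie1_GZM": "ADAM", "nazwisko1_GZM": "PIOTROWSKI", "pesel_GZM": "00000000003", "data_zam_od_GZM": "20090101"},
    {"imie1_GZM": "JAN", "nazwisko1_GZM": "KOWALSKI", "pesel_GZM": "00000000001", "data_zam_od_GZM": "19840712"},
]


def deduplicate(records, key="pesel_GZM", date_field="data_zam_od_GZM"):
    """Keep one record per key: the one with the latest date (assumed rule)."""
    best = {}
    for record in records:
        kept = best.get(record[key])
        # YYYYMMDD strings sort chronologically, so plain string comparison is enough.
        if kept is None or record[date_field] > kept[date_field]:
            best[record[key]] = record
    return list(best.values())


unique = deduplicate(RECORDS)
```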

Page 18: Model of transformation administrative data to statistical data

Transform data - data integration -

• the process of selecting the best, most current and correct value from several or even a dozen registers,
• used to create a statistical register, which is made available to analysts.

Page 19: Model of transformation administrative data to statistical data

Transform data - integration process: scheme -

[Diagram labels]
Inputs: A Register, B Register, C Register; register of reference
Linking: one ID; multiple identifiers; alternative linking keys
Selecting: algorithms; selecting the best values; data completeness
Output of data integration: statistical register

Page 20: Model of transformation administrative data to statistical data

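Transform data - data integration: example of algorithm -

[Flowchart: target variable kraj_ur_kod. Three checks are shown: "kraj_ur_kod_KEP not null", "msce_ur_kod_POBYT not null" and "kraj_ur_kod_GZM not null". On TRUE, the value is selected from the corresponding register (select kraj_ur_kod_KEP, select kraj_ur_kod_POBYT, select kraj_ur_kod_GZM); on FALSE, the next check is evaluated.]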
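The selection rule in this flowchart amounts to taking the first non-empty value from a prioritized list of register variables. A minimal sketch follows; the order of the checks cannot be read from the transcript, so the order in `PRIORITY` is an assumption, and the function name and the sample value "PL" are hypothetical.

```python
def select_first_not_null(record, candidates):
    """Return the first candidate variable that holds a value, or None if all are empty."""
    for name in candidates:
        value = record.get(name)
        if value not in (None, ""):
            return value
    return None


# Assumed priority order; the actual order is defined by the methodologists.
PRIORITY = ["kraj_ur_kod_GZM", "kraj_ur_kod_POBYT", "kraj_ur_kod_KEP"]

person = {"kraj_ur_kod_GZM": None, "kraj_ur_kod_POBYT": "", "kraj_ur_kod_KEP": "PL"}
kraj_ur_kod = select_first_not_null(person, PRIORITY)
```

Keeping the priority list as data instead of nested conditions makes it easy to apply a different order per variable.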

Page 21: Model of transformation administrative data to statistical data

Data integration-example of process-

Page 22: Model of transformation administrative data to statistical data

Summary

Common difficulties:
- poor quality data, missing values, duplicates,
- conflicting data,
- technical: size of the registers, time-consuming process.

Benefits:
- obtaining relevant, useful, accurate data,
- improving the quality of the output data,
- selection of the best variables from multiple registers.

Page 23: Model of transformation administrative data to statistical data

Thank you for your attention

www.stat.gov.pl