statistical data editing

of 45 /45
Statistical Data Statistical Data Editing Editing Anders Norberg, Statistics Sweden (SCB) 2011-11-11

Author: early

Post on 04-Feb-2016

90 views

Category:

Documents


0 download

Embed Size (px)

DESCRIPTION

Statistical Data Editing. Anders Norberg, Statistics Sweden (SCB) 2011-11-11. Editing. Editing is an activity of detecting, resolving and understanding errors in data and produced statistics. Selective data editing. - PowerPoint PPT Presentation

TRANSCRIPT

  • Statistical Data EditingAnders Norberg, Statistics Sweden (SCB)2011-11-11

  • EditingEditing is an activity of detecting, resolving and understanding errors in data and produced statistics*

  • Selective data editingThe purpose of selective data editing is to reduce cost for the statistical agency as well as for the respondents, without significant decrease of the quality of the output statistics.

    *

  • Papers by my colleague Leopold GranquistGranquist (1984). On the role of editing. Statistical Review 2Granquist (1997). The New View on Editing. International Statistical ReviewGranquist and Kovar (1997). Editing of Survey Data: How Much is Enough? In Survey Measurement and Process Quality. Wiley.

  • If we only want information from businesses that we know they have,and we ask for that information so they understand,and we motivate them to deliver as good quality in data as possible,and we help them to avoid accidental errors in answering questionnaires,then editing would be a minor process!*

  • Where errors are introducedErrors in raw data delivered by respondents to the statistical agency are typically non-response and measurement errorsErrors in data transmissions The statistics production process is a mixture of many activities with risks of introducing errors*

  • The role of editingQuality Control of the measurement processFind errors (efficient controls)Consider every identified error as a problem for the respondent to deliver correct data by our collection instrumentIdentify sources of error (process data)Analyse process data communicate with cognitive specialistsContribute to quality declarationAdjust (change/correct) significant errors

    3

  • Types of errorsObvious errors / Fatal errors Non-valid values Item non-response Data structure- or model errors, totalsum of components Contradictions

    Suspected data valuesDeviation errors (Outliers)Suspiciously high/low values, data outside of predetermined limitsDefinition errors (Inliers)Many respondent miss-understand a question in the same wayMany respondents fetch data from info-systems with other definitions

  • Suspected data valuesDeviation errorsManual follow-up takes time and is expensiveFew deviation errors have impact on output statistics (low hit-rate, many changes in data have very little impact)

    Editing must have impact on the output!Remember response burdon !

  • Suspected data valuesDefinition errors (Inliers)Difficult to findWays to find them:Combined editing for several surveysDeep interviews in focus groups Use statistics from FEQ and from re-contacts with respondentsHigh proportions of item non-responseGraphical editingGood examples

  • The Process PerspectiveAudit and improve data collection (measurement instrument and collection process)and the editing process itself

    Un-edited data must be saved in order to produced important process indicators, as hit-rate and impact on output!

  • Process dataSources of errors = problem for the respondentsSuspicionsError codesManuel actions (accept / amended values) Automatic actions

  • Process indicatorsSources of errors (problem for the respondents)Prop. of flagged units and variablesProp. of manually and automatically reviewed units and variablesProp. of amended values and impact of the changes, per variableHit-rate for edits

  • *

    Establish needs

    1

    Plan and design

    2

    Create and test

    3

    Collect 4

    Prepare and process

    5

    Analyse

    6

    Report and communicate

    7

    Statistical Production Process

    survey statistical needs1.1

    Affirm customer needs1.2

    Develop table plan1.3

    Identify data sources1.4

    Examine disclosure1.5

    Carry out market process1.6

    Prepare data and statistical values for dissemination7.1

    produce final output

    7.2

    Report and communicate final output7.3

    handleinquiries7.4

    Marketfinal output7.5

    Classify and code micro data5.1

    Check micro data5.2

    Impute for Non-response5.3

    Complement Data set5.4

    Calculate weights5.5

    Establish final observation register5.6

    Create frame and draw sample4.1Handle respondentissues4.2

    Prepare data collection4.3

    Carry out data collection4.4

    Transfer and storedata electronically4.5

    Plan and design table plan2.1

    Plan and design data collection2.3

    Plan and design data processing2.4

    Plan and design analysis and reporting2.5

    Plan production flow2.6

    Design production system2.7

    Create and testMeasurem. instrument3.1Develop existing and building newproduction tools3.2

    Assure communication between production tools3.3

    Test production system3.4

    Carry outpilot test3.5

    Implementproduction tools3.6

    Implement production system3.7

    Producestatistical values6.1Quality assuranceof produced statistics6.2

    Interpret andexplain6.3Prepare contents for reporting andcommunication6.4Establish contentsfor reporting andcommunication6.5

    Plan and design frame/population and sample2.2

  • *

    Establish needs

    1

    Plan and design

    2

    Create and test

    3

    Collect 4

    Prepare and process

    5

    Analyse

    6

    Report and communicate

    7

    Statistical Production Process

    survey statistical needs1.1

    Affirm customer needs1.2

    Develop table plan1.3

    Identify data sources1.4

    Examine disclosure1.5

    Carry out market process1.6

    Prepare data and statistical values for dissemination7.1

    produce final output

    7.2

    Report and communicate final output7.3

    handleinquiries7.4

    Marketfinal output7.5

    Classify and code micro data5.1

    Check micro data5.2

    Impute for Non-response5.3

    Complement Data set5.4

    Calculate weights5.5

    Establish final observation register5.6

    Create frame and draw sample4.1Handle respondentissues4.2

    Prepare data collection4.3

    Carry out data collection4.4

    Transfer and storedata electronically4.5

    Plan and design table plan2.1

    Plan and design data collection2.3

    Plan and design data processing2.4

    Plan and design analysis and reporting2.5

    Plan production flow2.6

    Design production system2.7

    Create and testMeasurem. instrument3.1Develop existing and building newproduction tools3.2

    Assure communication between production tools3.3

    Test production system3.4

    Carry outpilot test3.5

    Implementproduction tools3.6

    Implement production system3.7

    Producestatistical values6.1Quality assuranceof produced statistics6.2

    Interpret andexplain6.3Prepare contents for reporting andcommunication6.4Establish contentsfor reporting andcommunication6.5

    Plan and design frame/population and sample2.2

  • *

    Establish needs

    1

    Plan and design

    2

    Create and test

    3

    Collect 4

    Prepare and process

    5

    Analyse

    6

    Report and communicate

    7

    Statistical Production Process

    survey statistical needs1.1

    Affirm customer needs1.2

    Develop table plan1.3

    Identify data sources1.4

    Examine disclosure1.5

    Carry out market process1.6

    Prepare data and statistical values for dissemination7.1

    produce final output

    7.2

    Report and communicate final output7.3

    handleinquiries7.4

    Marketfinal output7.5

    Classify and code micro data5.1

    Check micro data5.2

    Impute for Non-response5.3

    Complement Data set5.4

    Calculate weights5.5

    Establish final observation register5.6

    Create frame and draw sample4.1Handle respondentissues4.2

    Prepare data collection4.3

    Carry out data collection4.4

    Transfer and storedata electronically4.5

    Plan and design table plan2.1

    Plan and design data collection2.3

    Plan and design data processing2.4

    Plan and design analysis and reporting2.5

    Plan production flow2.6

    Design production system2.7

    Create and testMeasurem. instrument3.1Develop existing and building newproduction tools3.2

    Assure communication between production tools3.3

    Test production system3.4

    Carry outpilot test3.5

    Implementproduction tools3.6

    Implement production system3.7

    Producestatistical values6.1Quality assuranceof produced statistics6.2

    Interpret andexplain6.3Prepare contents for reporting andcommunication6.4Establish contentsfor reporting andcommunication6.5

    Plan and design frame/population and sample2.2

  • Average proportions of costs of sub-processes 2004*Process Proportion of total cost (%) All products Short-period Annual surveys and periodic Respondent service 3.3 3.3 3.4 Manual pre-editing 4.4 3.9 5.1 Data-registration editing 5.6 5.1 6.5 Production editing 15.3 12.7 18.9 Output editing 3.9 3.4 4.8 Total editing cost 32.6 28.3 38.6

  • Web data collectionDemands:High hit-rate in electronic questionnairesSystem that can measure hit-rate?Question:Can it be a goal for us to move all editing to electronic data collection?

  • Expectation on the production editing process at Stat. SwedenGeneric IT-toolsLess IT-maintenanceEasier planning of work and personnel at Data collection unitsBetter working environmentsMethodology studiesEfficient editing methodsSelective/significance editingBetter working environmentsLess response burdenCollection and analysis of process dataContinuous improvement of data collection and editing processesInformation for quality declaration of statistics

  • Input, throughput, output

  • ImpactActual impact = w ( y_une y_edi) for observation k is the impact on domain-total T if y_une is kept instead of making a review to find y_edi.Potential impact = w (y_edi y_pred) is a proxy for actual impact to be used in practice, as y_edi will not be known until review. y_pred is a prediction (expected value) for y_edi.Anticipated (expected) impact (per domain, variable, observation) is the product of suspicion and potential impact.

  • Suspicion: Traditional edits

    Utkast/Version

    STATISTISKA CENTRALBYRNDOKUMENTTYP

    1(1)

    Avd/Enhet/Projekt/Arbetsgrupp, etc20xx-xx-xx

    Handlggare/Frfattare

    Finding acceptance limits: Data from previous survey rounds

    Hourly wage distributed by SNI code at one-digit level.

    NORMAL.DOT

    10-11-22 08.57

  • Selective data editingConstruct a score function for prioritizing variables and records. Alternatives:1) Potential impact on statistics for records flagged by at least one traditional edit2) Sum of expected impact on statistics for variable values flagged as suspected by edits

    Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden

    *

  • Selective data editingPotentialimpactSuspicion01FlaggedA procedure which targets only some of the micro data variables or records for review by prioritizing the manual work.*

  • 1

  • 5

  • Predicted (expected) valuesData / predictorTime seriesPrevious valueForecastCross sectionMean/standard errorMedian/quartileEdit groups 21

  • Suspicion

    R=

    Suspicion=R/(TAU+R)Susp

  • Score functionLocal score, by domain d, variable j & observed unit k,l is the anticipated impact related to an appropriate measure of size for the domain/variable, say standard error of estimate.VIOLINj = weights for variables (j) CLARINETc = weights for domains by classifications d(c)OBOEj = adjustment for size of estimated total or its standard error (j)LScored,j,k,l = Suspicionj,k,l x(Potential impact) x CELLOd(c),j

  • Score functionGlobal scores are aggregated local scores by (5) domains, (4) variables, second stage units (3) to one score for each primary unit (2) and finally to (1) respondent. Methods: sum, sum of squares, sum of local scores truncated by local thresholds, maximum etc.

    28

  • Cut-off or probability sampling?Say that 821 of the total sample (n=4 000) have a score >0. There are two options for manual review:Cut-off sampling: Score2 >Threshold2, assuming the remaining bias is smallTwo-phase sampling: ps-sampling and design-based estimation of measurement errors to subtract from initial estimates

    Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory and Methodology, Vilnius, Lithuania, August 23-27, 2010*

  • EvaluationRelative pseudo-bias is a measure of error in output due to incomplete data review

  • EvaluationPsedobias for PPI relative to the overall price index. Observation units ordered in descending order of impact.

  • Editing remaining methodology issuesFatal errorsClassifying variablesSurvey variablesConfidence (respondents and clients)New and old respondentsEdited in earlier processesWeb-questionnairesScanned paper questionnairesData and methods for computing predicted values etc.Homogenous edit groupsPriorities; variables, domains (from the clients perspective)Score functionsHow to decide threshold valuesSampling below thresholdInferenceData for evaluation

  • SELEKT 1.1Survey specific cold adapter (SAS code)Data preparationSAS data setPRE-SELEKTParameter specifications,Analysis of cold data

    AUTOSELEKTScore calculation & record flagging

    Records to FOLLOW-UPProcess data and reports

    Input (hot) survey dataRecords to IMPUTATIONRaw+edited past (cold) survey dataSurvey specific hot adapter (SAS code)Data preparationSAS data setTable of ParametersTable of EstimatesAccepted recordsCLAN estimation softwareSNOWDON-X analysis of editsEdits*

  • SELEKT Parameters (in)Give AUTO-SELEKT the parameters:

    %let PATH_sys=C:\SELEKT\1.0;%let PATH_app=C:\SELEKT\Prod\Demo1_Enterprise;%let EDIT_parms=Ent_Parms1;%let EDIT_data=Demo1_Adap_Ent_Inflowdata; %let EDIT_T1_Value=2009;%let EDIT_T2_Value=1;

    By the parameter table &EDIT_parms AUTO-SELEKT knows what to do.

  • SELEKT Error list (out)Identification: Column name = VariableId1 = Identity for respondent[optional]Id2 = Identity for primary sampling unit (PeOrgnr, CfarNr etc.)Id3 = Identity for observational unit (Social security number, CN8 for products)EditNumber = Edit identification, if the edit flags for suspicions or obvious error.Timestamp = Time when the questionnaire passes SELEKTProcess data:EditFlag = 0 = accepted, 1-5 = error flaggedEditSuspicion = Suspicion generated by continuous edits.Score1 = (Local) Score for respondent[optional]Score2 = (Local/Global) Score for primary sampling unit [optional]Score3 = (Global) Score for observational unit [optional]N_Obs = Number of observations, which have gone through the edit round. N_Obs_Flagg = Number of error flagged observations in the PSU, on this listN_PSU = Number of PSU for the respondent, which have passed the edit round N_PSU_Flagg = Number of error flagged primary sampling units, on this list

  • EditsEDIT GROUP AND ACCEPTANCE REGION Edit identification Edit group Acceptance rangeEDIT Edit identificationType of editActiveSectionInternal error messageExternal error messageInstruction for data review

    Un-edited test variableError flagKEY Edit identificationSurvey variableIMPACT ON STATISTICS Survey variablePotential impact on statistics

    543FLAGGINGOBJECTSEDIT PRACTICAL SUPPORT Edit identificationStandard edit ruleEdited test variableSuspicion probability value produced by the SELEKT system

    21

    - Vid terkommande underskningar- I aktuell unders.

    kval. i ingende o utgende data*Svra validitetsfel: identitetsvariabler, yrkeskoder som sedan skall in i yrkesregistret

    Definitionsfel allvarligast: svrt att id med enbart kontroller. - x-tra frg i blk - graf. mnster

    EX. Uppg. lmnare ger samma svar som frra und.*F avvikelsefel betydelse - anpassa acc.gr till detta.

    *F avvikelsefel betydelse - anpassa acc.gr till detta.

    *Avsnitt 7.1 cbm*Avsnitt 7.1 cbm