The Application for Statistical Processing at SURS
Andreja Smukavec, SURS
Rudi Seljak, SURS
UNECE Statistical Data Confidentiality Work Session
Helsinki, 5 – 7 October 2015
Old system
• Stove-pipe oriented production
– Ad-hoc solutions were developed for a particular survey
• Survey methodologists' drive for improvement was crucial
– "Our data are not confidential"
• Process metadata were not organized
– Difficulties when a survey methodologist resigns
Renovation
• An internal project started in 2012
– IT, General Methodology and subject-matter specialists
– Build a global solution appropriate for most of the surveys
– Solution which covers most of the parts of statistical production:
   • Data validation
   • Data editing and imputation
   • Aggregation and standard error estimation
   • Statistical disclosure control for tabular data
   • Tabulation
Renewed system
• Generalised metadata driven application
– Database of process metadata
   • MS Access -> ORACLE
   • For each survey instance
– General SAS code
– GUI for process metadata
– Different microdata environments allowed, just some basic rules for the structure of microdata databases
   • Ad hoc SAS program for preparation of microdata
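The idea of a metadata driven application can be sketched in a few lines: the rules live in a table of process metadata, and one general program applies whatever rules it finds for a survey. This is only an illustration in Python, not the SAS code used at SURS; the rule layout, the survey code "DEMO" and the variable names are hypothetical.

```python
import operator

# Hypothetical process metadata: one validation rule per record.
# In the real system such rules are stored in a database, not in code.
RULES = [
    {"survey": "DEMO", "variable": "turnover", "op": ">=", "value": 0},
    {"survey": "DEMO", "variable": "employees", "op": ">", "value": 0},
]

# Comparison operators that a rule may refer to.
OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def validate(record, rules, survey):
    """Return the names of the variables that fail a rule for the given survey."""
    failed = []
    for rule in rules:
        if rule["survey"] != survey:
            continue  # rule belongs to another survey
        check = OPS[rule["op"]]
        if not check(record[rule["variable"]], rule["value"]):
            failed.append(rule["variable"])
    return failed


print(validate({"turnover": -5, "employees": 3}, RULES, "DEMO"))
```

Changing a rule then means editing a metadata record, not the general program.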
Schematic presentation of the renewed system
[Diagram; box labels: Different microdata databases, Ad-hoc program, General SAS program, Database of process metadata, Metadata repository, Data on tables and variables, Application for management, Different kind of output]
Tabular data protection
1. Calculation of primary sensitivity for seven types of statistics: number, total, share, ratio, average…
– Threshold, p%-rule, (n,k)-dominance rule
– "Holding rule" + sampling weights
– Zeroes unsafe
2. Secondary suppression applied in case of sensitive statistics (number and total)
– SAS-Tool (Excel file with metadata, Tau Argus, SAS macros)
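The three primary sensitivity rules named above can be sketched for a single cell as follows. This is a minimal Python illustration using the usual textbook definitions, not the SAS implementation used at SURS; the default parameter values (3 contributors, n = 1, k = 75, p = 10) are assumptions, and the holding rule and sampling weights are left out.

```python
from typing import Sequence


def threshold_unsafe(contributions: Sequence[float], min_contributors: int = 3) -> bool:
    """Threshold rule: unsafe if the cell has fewer contributors than the minimum."""
    return len(contributions) < min_contributors


def dominance_unsafe(contributions: Sequence[float], n: int = 1, k: float = 75.0) -> bool:
    """(n,k)-dominance rule: unsafe if the n largest contributions exceed k% of the total."""
    total = sum(contributions)
    if total <= 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n > k / 100.0 * total


def p_percent_unsafe(contributions: Sequence[float], p: float = 10.0) -> bool:
    """p%-rule: unsafe if the total minus the two largest contributions
    is less than p% of the largest contribution."""
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0.0
    remainder = sum(ordered) - largest - second
    return remainder < p / 100.0 * largest


cell = [60, 38, 1, 1]  # made-up contributions of four units to one cell
print(threshold_unsafe(cell), dominance_unsafe(cell), p_percent_unsafe(cell))
```

The made-up cell shows that the rules can disagree: it passes the threshold and dominance rules under these parameters but fails the p%-rule, because the second largest contributor could estimate the largest one too closely.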
Tabular data protection
• Results for each survey instance saved in the database with statistics (ORACLE)
– Statuses for lower precision
– Confidentiality flags for the type of primary and secondary suppression
• 3 types of tabulation (codelists)
– Excel format (the most user-friendly)
– plain text format (.tab, .hrc) for Tau-Argus
– plain text format (.csv) for PX-Edit (SURS's publication tool)
Tabulation & Tabular Data Protection
[Diagram; box labels: Different microdata databases, Ad-hoc program, General SAS program, Database of process metadata, Calculation of statistics, Database with statistics, Tabulation, Tabular protection, Output tables]
Parameters for SDC in MetaSOP
Tabulation in MetaSOP
Processing in MetaSOP
Example of 3-dimensional table
After aggregation
                Dim_3
CC_SI  Dim_2    TOT            F           O
TOT    TOT      1209943548     1.09E+09    1.23E+08
TOT    1        37700934.42    35625442    2075493
TOT    11       47110694.48    46417660    693034.1
TOT    2        733763444.2    6.62E+08    71456295
TOT    21       517712620.1    4.8E+08     37489998
TOT    22       161044502.5    1.1E+08     50837088
TOT    23       37903335.85    37783060    120275.8
TOT    24       343495995.1    2.86E+08    57438583
11     TOT      59283130.99    56199883    3083248
11     1        64428657.15    62453677    1974980
11     11       21989840.69    21609892    379948.2
11     2        69502173.33    67377101    2125073
11     21       13959568.67    13959569    -
11     22       338148.7639    338148.8    z
11     23       7911125.122    7911125     -
11     24       27886089.54    26016025    1870064
12     TOT      215349659.2    2.04E+08    11792968
12     1        5993635.356    5993635     -
12     11       2035728.954    2035729     -
12     2        55635358.28    54430511    1204847
12     21       146242216.3    1.43E+08    2783876
12     22       4164502.417    3872003     292499.2
12     23       38774447.75    34931862    3842585
12     24       42332750.72    37447112    4885639
21     TOT      176972728      1.76E+08    1323998
21     1        2248602.352    2248602     z
21     11       166013.5624    166013.6    z
21     2        372993785.9    3.69E+08    4134769
21     21       418831917.8    4.08E+08    10337323
21     22       29411096.08    29411096    z
21     23       56581.5975     56581.6     z
21     24       88244091.34    86483431    1760660
After use of SAS-Tool
                Dim_3
CC_SI  Dim_2    TOT            F           O
TOT    TOT      1209943548     1.09E+09    1.23E+08
TOT    1        37700934.42    35625442    2075493
TOT    11       47110694.48    46417660    693034.1
TOT    2        733763444.2    6.62E+08    71456295
TOT    21       517712620.1    4.8E+08     37489998
TOT    22       161044502.5    1.1E+08     50837088
TOT    23       37903335.85    37783060    120275.8
TOT    24       343495995.1    2.86E+08    57438583
11     TOT      59283130.99    56199883    3083248
11     1        64428657.15    z           z
11     11       21989840.69    z           z
11     2        69502173.33    z           z
11     21       13959568.67    13959569    -
11     22       338148.763     z           z
11     23       7911125.122    7911125     -
11     24       27886089.54    z           z
12     TOT      215349659.2    2.04E+08    11792968
12     1        5993635.356    5993635     -
12     11       2035728.954    2035729     -
12     2        55635358.28    54430511    1204847
12     21       146242216.3    1.43E+08    2783876
12     22       4164502.417    z           z
12     23       38774447.75    z           z
12     24       42332750.72    z           z
21     TOT      176972728      1.76E+08    1323998
21     1        z              z           z
21     11       z              z           z
21     2        z              z           z
21     21       418831917.8    4.08E+08    10337323
21     22       29411096.08    z           z
21     23       z              z           z
21     24       88244091.34    z           z
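Why secondary suppression is needed can be shown on one row of the example (CC_SI = 11, Dim_2 = 22). The Python sketch below assumes that the TOT column is the sum of F and O and that "z" marks a suppressed cell. It only tests a necessary condition within one additive relation and is not the method used by Tau-Argus or the SAS-Tool; the function name is hypothetical.

```python
SUPPRESSED = "z"  # assumed marker for a suppressed cell, as in the example tables


def recoverable_by_subtraction(total, parts):
    """True if exactly one cell of the relation total = sum(parts) is suppressed,
    so that the hidden value follows from the published ones."""
    cells = [total, *parts]
    return sum(1 for cell in cells if cell == SUPPRESSED) == 1


# Row CC_SI = 11, Dim_2 = 22: (TOT, [F, O]) before and after the SAS-Tool.
before = ("338148.7639", ["338148.8", "z"])
after = ("338148.763", ["z", "z"])

print(recoverable_by_subtraction(*before))
print(recoverable_by_subtraction(*after))
```

With only the primary suppression the hidden cell could be recomputed as TOT minus F; once a second cell in the same relation is suppressed, this simple subtraction no longer works.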
New organization
• Old system:
– Every survey had its own programmer and its own general methodologist
• Renewed system:
– General methodologist and IT expert ("support team") help the subject-matter specialist to
   • insert and edit the process metadata (except for SDC) into the application
   • run particular parts of the statistical process
Advantages
• The subject-matter personnel's skills improve (higher quality of data)
• The process metadata can be changed easily and the procedure can be repeated in a short time (flexibility)
• The rules for data processing are gathered in one place (transparency)
Drawbacks
• High risk of syntax errors in the process of the insertion of metadata expressions
• Subject-matter personnel have to learn some new skills (SAS expressions)
• An error during the execution can cause a problem if the support team is busy or not available
Challenges for the future
• Introduce the application successfully into the production
– Adjusting to changes by the subject-matter specialists
– Building a qualified support team
• Adding new functionalities
– Indices
– Secondary suppression for other types of statistics
– GUI instead of the Excel file for the SAS-Tool
Thank you for your attention.