Charles (Chuck) Humphrey, University of Alberta, August 2007
Progressing from Safe Data to Safe Output to Safe Metadata
Managing the risk of disclosure
2
[Figure: Research Data Centres. Labels: Victoria, Sudbury, Laval, McGill, Sherbrooke, UQAM, COOL, Queen’s, Western, Manitoba, BCIRDC, Prairie, Alberta, SWORDC, McMaster, Toronto, CRISP, CIQSS, Atlantic. Legend: Centres in first CFI application; Provincial funds leveraged by first CFI application]
3
Cumulative Growth of Projects up to April 13, 2007
[Chart: vertical axis Count (0 to 1400), horizontal axis year; series: Incomplete/Withdrawn, Active/Dormant, Proposal in evaluation, Completed]

                          2000  2001  2002  2003  2004  2005  2006  2007
Active/Dormant              31   122   251   350   481   600   754   534
Proposal in evaluation       1     7     0    19    14    18    30    72
Completed                    0     0     0     2    47   131   233   448
Incomplete/Withdrawn: 196 (the only value shown for this series)

Source: Gustave Goldmann, RDC Manager’s Report, April 2007
4
Data & Knowledge: an Old Problem
How does one maximise the creation of new knowledge from confidential data?
For data producers, the problem is expressed as the risk of disclosure. What level of access can be permitted while minimising the risk of disclosure?
From the researcher’s perspective, the problem is one of having the fullness of data to create new knowledge. Do all the variables exist to explain the phenomena?
6
The Canadian Context
Two microdata dissemination mechanisms
  30 years of Public Use Microdata Files (PUMFs)
  7 years of Research Data Centres (RDCs)
25-year gap between PUMFs and RDCs
  3 computing paradigm shifts
  Statistics Canada introduces exorbitant data prices
  No longitudinal PUMFs coming out of the 1990s
7
The Canadian Context
The context coming out of the 1990s
  Downsizing of the public service, especially policy analyst positions.
  A crisis of the national health care system.
  The introduction of a new wave of longitudinal surveys.
  The rise of evidence-based policy making and its demands for policy-relevant knowledge.
Recently, microdata from administrative records and the Census
8
Approaches to Managing Risk
Public Use Microdata
  Created at the tail end of a survey’s production stream, with a focus on satisfying the Data Release Committee’s concerns about confidentiality
Confidential Data
  Access is created on a project-by-project basis in a secure environment controlled by Statistics Canada
  Focus is on disclosure analysis and ‘safe’ output
9
Adding More Information
Adding more organised information in the form of metadata.
Metadata consist of information describing other information in a format that is web-actionable, can be migrated easily with changes in technology and can be preserved.
In this instance, what is being described is the information produced within the separate stages of the research life cycle.
10
The RDC Research Life Cycle
Stages in the life cycle
  Project Application
  Project Approval
  Project Creation
  Access to Data
  Generate Analysis Files
  Output Disclosure Analysis
  Research Communications
12
Metadata in the Life Cycle
Statistics Canada Master Files
  Generate DDI metadata to 1.0/2.0 standards in a retrospective conversion project
  Develop tools to convert DDI 1.0/2.0 to DDI 3.0 and incorporate the Questionnaire module
Access to master files
Analysis through Multiple Working Files
  Repurpose master files: Subset, Recode, Compute, Merge
  Multiple Versions
13
Metadata in the Life Cycle
Access to master files
Analysis through Multiple Working Files
  Repurpose master files
  Multiple Versions
Generate DDI metadata for working files using tools that read statistical system and syntax/log files. These metadata files can be used to document the working files, to produce products from the metadata (e.g., a codebook listing or PowerPoint slides), to be linked to research communications, and to recreate the working file from its master data file.
A validation tool will compare a working file against a virtually generated working file.
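The idea of recreating a working file from its master file and then validating it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names, and the step vocabulary (`subset`, `recode`, `compute`) are hypothetical stand-ins for processing metadata captured from syntax or log files. They are not DDI elements and not the RDC Network's actual tools.

```python
import operator

# Hypothetical master file: a few made-up records.
MASTER = [
    {"id": 1, "age": 34, "sex": 1, "income": 52000},
    {"id": 2, "age": 17, "sex": 2, "income": 0},
    {"id": 3, "age": 61, "sex": 2, "income": 38000},
]

# Comparison operators that a subset step is allowed to name.
COMPARATORS = {">=": operator.ge, "<": operator.lt, "==": operator.eq}

# Hypothetical processing metadata, as it might be derived from a syntax file.
STEPS = [
    {"op": "subset", "keep": ["id", "age", "sex"], "where": ("age", ">=", 18)},
    {"op": "recode", "var": "sex", "map": {1: "M", 2: "F"}},
    {"op": "compute", "var": "age_group", "source": "age", "cut": 45,
     "labels": ("under 45", "45 and over")},
]


def recreate(master, steps):
    """Replay the recorded steps on a copy of the master file."""
    rows = [dict(r) for r in master]  # never modify the master file itself
    for step in steps:
        if step["op"] == "subset":
            var, cmp_name, value = step["where"]
            test = COMPARATORS[cmp_name]
            rows = [{k: r[k] for k in step["keep"]}
                    for r in rows if test(r[var], value)]
        elif step["op"] == "recode":
            for r in rows:
                r[step["var"]] = step["map"].get(r[step["var"]], r[step["var"]])
        elif step["op"] == "compute":
            low, high = step["labels"]
            for r in rows:
                r[step["var"]] = low if r[step["source"]] < step["cut"] else high
        else:
            raise ValueError(f"unknown step: {step['op']}")
    return rows


def validate(working, master, steps):
    """List differences between an actual working file and the virtual one."""
    virtual = recreate(master, steps)
    diffs = []
    if len(working) != len(virtual):
        diffs.append(f"case count: working={len(working)}, virtual={len(virtual)}")
    for i, (w, v) in enumerate(zip(working, virtual)):
        for var in sorted(set(w) | set(v)):
            if w.get(var) != v.get(var):
                diffs.append(
                    f"row {i}, {var}: working={w.get(var)!r}, virtual={v.get(var)!r}")
    return diffs
```

An empty list from `validate` means the working file matches what the metadata says it should be. Any entry points to an undocumented change between the master file and the working file.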
15
Metadata in the Life Cycle
Analysis through Working Files
  Working Data Files
  Metadata
Output from Working Files
  Disclosure Analysis
  Tables
  Reports
Research Communications
  Website
  Journals
  Conferences
  Repository
  Supporting metadata
16
A Third Method of Managing Risk
Metadata-driven risk assessment
  Introduces the researcher’s data needs as the central focus
  Uses metadata to drive the process of determining the access mechanism
  Requires standard metadata across all surveys
  Must be a service outside the survey production stream
17
New Role for Research Data Centres
Research Data Centres are not part of a survey production stream and provide the only continuous dissemination service for confidential data in Statistics Canada.
  Canada Foundation for Innovation funding is providing infrastructure that allows a rationalisation of RDC operations
Research Data Centre Network is creating DDI-compliant metadata for all confidential data files in its Centres.
  Canada Foundation for Innovation is funding metadata creation and tools development projects
18
User-driven Continuum of Risk Management
Researcher
  Searches metadata for public and confidential microdata and identifies a subset of variables and cases.
  Submits subset request to RDCN. Request is automatically in metadata format.
RDCN
  Request received in metadata format.
  Risk scores calculated and threshold determined.
Continuum of Access
  RDC application prepared and returned to researcher
  Confidential subset created in RDC; data sent pending contract
  Metadata setup generated for extracting subset from PUMF
Researcher
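A minimal sketch of how a metadata-format request might be scored and routed along this continuum of access follows. Everything specific in it is an assumption made for illustration: the per-variable risk levels, the additive scoring rule, the two thresholds, and the request layout are all hypothetical. The slides do not specify any of them.

```python
# Hypothetical risk level per variable (higher means more identifying).
VARIABLE_RISK = {
    "age_group": 1,
    "sex": 1,
    "province": 2,
    "income_exact": 4,
    "occupation_4digit": 4,
    "postal_code": 5,
}

# Hypothetical thresholds on the overall score.
PUMF_MAX = 4      # at or below: subset can be extracted from the PUMF
CONTRACT_MAX = 8  # at or below: confidential subset sent pending contract


def risk_score(variables):
    """Overall score for a request; assumed here to be a simple sum."""
    unknown = [v for v in variables if v not in VARIABLE_RISK]
    if unknown:
        raise ValueError(f"no risk level recorded for: {', '.join(unknown)}")
    return sum(VARIABLE_RISK[v] for v in variables)


def access_mechanism(request):
    """Map a subset request to a point on the continuum of access."""
    score = risk_score(request["variables"])
    if score <= PUMF_MAX:
        return score, "PUMF extract"
    if score <= CONTRACT_MAX:
        return score, "confidential subset under contract"
    return score, "RDC application"


# A request as it might arrive in metadata format (layout is hypothetical).
request = {
    "survey": "EXAMPLE",
    "variables": ["age_group", "province", "income_exact"],
    "cases": "age >= 18",
}
score, mechanism = access_mechanism(request)
print(f"{score} -> {mechanism}")
```

A simple sum is used only to keep the sketch short. A real measure would probably also need to account for combinations of variables that are identifying together.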
19
Future Work
Develop risk-level measures for individual variables and ascertain thresholds of overall scores.
  Make use of survey managers’ experiences with preparing public use microdata files for the Data Release Committee
  Build a database of RDC disclosure decisions and incorporate in project metadata
Through metadata we can improve access to detailed microdata while safeguarding the confidentiality of respondents.