Protecting Confidentiality in a Virtual Data Centre

Computational Informatics
Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan, CSIRO
Tim Churches, Sax Institute
28 October 2012




Page 2: Protecting Confidentiality

Overview

• Introduction to the problem
• Virtual Data Centres
• Proposed solution


Confidentiality in Virtual Data Centres | Christine O’Keefe

Page 3: Protecting Confidentiality

Population Health Research Network*

• Provides access to linkable de-identified health data for research
  – Improving outcomes
  – Improving policy
• Traditionally: supplies linkable de-identified health data directly to researchers
• Loss of control over data heightens risk of:
  – External attack on datasets
  – Accidental or inadvertent actions by researcher
  – Deliberate attack by trusted researcher



Page 4: Protecting Confidentiality

Secure Unified Research Environment*

• Secure remote access to virtual workstations and network in a data centre


*Sax Institute SURE User Guide v1.2

Page 5: Protecting Confidentiality

Confidentiality Protection for Health Data

• Governance
  – Comply with privacy legislation and regulation
  – Honour assurances to data providers
• Restrict access to approved researchers
• Information security measures
• Restrict amount and detail of data available
• Apply statistical disclosure control methods before releasing data to researcher
  – No further confidentiality measures
• Enable access via secure on-line system
  – Manual checking for confidentiality issues in statistical analysis outputs
  – “…developing valid output checking processes that are automated is an open research question” (Duncan, Elliot, Salazar-González 2012)


Page 6: Protecting Confidentiality

Conceptual Model for online access

• Remote Analysis: researcher cannot see data itself, only “Output for publication”
• Virtual Data Centre: researcher authorised to see data and “Output” as well as “Output for publication”




Page 7: Protecting Confidentiality

Virtual Data Centre

Assumptions
• Custodian prepares data to comply with legislation, regulation and assurances
• Researcher complies with applicable researcher agreements
• Researcher authorised to see data itself
  – Do not need to protect dataset records from researcher
  – Do not need to protect against malicious attacks by researcher
  – Data transformations and analyses are unrestricted
  – Confidentiality issues with respect to readers of academic literature
  – Confidentiality issues with respect to outputs of genuine queries


Page 8: Protecting Confidentiality

Main Disclosure Risks in Statistical Output

• Individual values
• Small cells/samples … threshold
• Dominance
• Differencing
• Linear or other algebraic relationships in data
• Precision


Page 9: Protecting Confidentiality

Confidentiality Protection in a Virtual Data Centre – two stage process

1. Dataset preparation – by Custodian
2. Confidentialisation of statistical analysis output for publication – by Researcher




Similarities to:
• ESSNet SDC Guidelines for checking output based on microdata research … Hundepool, Domingo-Ferrer, Franconi, Giessing, Nordholt, Spicer, de Wolf 2012
• Statistics New Zealand Data Lab Output Guide

Page 10: Protecting Confidentiality

Dataset preparation – by Custodian

Custodian
1. Removes obvious identifiers
2. Ensures dataset has sufficient records
3. Ensures published datasets differ by sufficiently many records
4. Ensures variables and combinations of variables have sufficiently many records
5. Reduces detail in data using aggregation (especially dates, locations)
6. Other measures as needed – statistical disclosure control
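
Steps 4 and 5 of the Custodian's list can be illustrated with a minimal Python sketch. The record layout, the field names (`dob`, `postcode`, `sex`), the choice to keep only year of birth and a two-digit postcode prefix, and the minimum count of 3 are all assumptions made for illustration; none of them comes from the presentation.

```python
from collections import Counter
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"dob": date(1970, 3, 14), "postcode": "2600", "sex": "F"},
    {"dob": date(1970, 7, 2), "postcode": "2601", "sex": "F"},
    {"dob": date(1970, 11, 30), "postcode": "2602", "sex": "F"},
    {"dob": date(1985, 1, 9), "postcode": "2913", "sex": "M"},
]

def aggregate(record):
    """Reduce detail: keep year of birth and a 2-digit postcode prefix."""
    return (record["dob"].year, record["postcode"][:2], record["sex"])

def rare_combinations(records, min_count):
    """Return aggregated combinations shared by fewer than min_count records."""
    counts = Counter(aggregate(r) for r in records)
    return {combo: c for combo, c in counts.items() if c < min_count}

# min_count=3 is an assumed value for this sketch.
flagged = rare_combinations(records, min_count=3)
```

Here the three 1970 records collapse into one combination after aggregation, while the single 1985 record is still flagged and would need further aggregation or another measure.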



Page 11: Protecting Confidentiality


Confidentialisation of statistical analysis output for publication – by Researcher

Researcher
1. Uses Checklist of tests to identify outputs that fail one or more tests
2. Considers context and interactions of outputs to identify potential disclosure risks
3. Applies treatments from Checklist to reduce potential disclosure risk
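
As a rough illustration of steps 1 and 3 for a table of counts, the sketch below flags cells that fail a threshold test and then applies two treatments named in the Checklist, suppression and rounding. The function name, the threshold of 5 and the rounding base of 5 are assumptions for this example, not values from the presentation.

```python
def confidentialise_counts(table, n, base):
    """Suppress cells with count below n, round the others to a multiple of base."""
    treated = {}
    for cell, count in table.items():
        if count < n:
            treated[cell] = None  # suppressed: fails the threshold test
        else:
            # Integer rounding to the nearest multiple of base (halves round up).
            treated[cell] = (count + base // 2) // base * base
    return treated

# Illustrative counts only.
table = {"A": 3, "B": 12, "C": 23}
treated = confidentialise_counts(table, n=5, base=5)
```

Step 2 still needs judgement: if the exact table total were published alongside exact values for the other cells, a single suppressed cell could be recovered by subtraction.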


Page 12: Protecting Confidentiality

Checklist of Tests

• Individual value: an individual data value is directly revealed
• Threshold n: A cell or statistic is calculated on fewer than n data values
• Threshold p%: A cell contains more than p% of the values in a table margin
• Dominance (n,k): Amongst the records used to calculate a cell value or statistic, the n largest account for at least k% of the value
• Dominance p%: Amongst the records used to calculate a cell value or statistic, the total minus the two largest values is less than p% of the largest value
• Differencing: A statistic is calculated on populations that differ in fewer than n records
• Relationships: The statistic involves linear or other algebraic relationships
• Precision: The output involves a high level of precision in terms of significant figures and/or decimal places
• Degrees of Freedom: The model output has fewer than n degrees of freedom
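
The numeric tests in the Checklist can be written as simple predicates. The following Python sketch is one possible reading of the definitions above; the function names are hypothetical, contributions are assumed to be non-negative, and the parameter values in the example are illustrative rather than recommended settings.

```python
def fails_threshold_n(values, n):
    """Threshold n: the statistic is calculated on fewer than n data values."""
    return len(values) < n

def fails_threshold_p(cell_count, margin_total, p):
    """Threshold p%: the cell holds more than p% of the values in its margin."""
    return margin_total > 0 and 100.0 * cell_count / margin_total > p

def fails_dominance_nk(values, n, k):
    """Dominance (n,k): the n largest values account for at least k% of the total."""
    total = sum(values)
    if total <= 0:
        return False
    top = sum(sorted(values, reverse=True)[:n])
    return 100.0 * top / total >= k

def fails_dominance_p(values, p):
    """Dominance p%: total minus the two largest is less than p% of the largest."""
    ordered = sorted(values, reverse=True)
    if len(ordered) < 2:
        return True  # assumption: fewer than two contributors counts as a failure
    remainder = sum(ordered) - ordered[0] - ordered[1]
    return remainder < p / 100.0 * ordered[0]

def fails_differencing(pop_a, pop_b, n):
    """Differencing: the two populations differ in fewer than n records.

    Assumption: identical populations (zero differing records) are not flagged.
    """
    differing = len(set(pop_a) ^ set(pop_b))
    return 0 < differing < n

# Illustrative contributions to a single cell.
contributions = [900, 50, 30, 20]
results = {
    "threshold_n": fails_threshold_n(contributions, 5),
    "dominance_nk": fails_dominance_nk(contributions, 1, 75),
    "dominance_p": fails_dominance_p(contributions, 15),
}
```

In this example the cell fails all three tests: it has only four contributors, the largest contributes 90% of the total, and the remainder after removing the two largest (50) is less than 15% of the largest (135).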


Page 13: Protecting Confidentiality

Checklist - examples

Statistic: Number (e.g. sample size)
  Test: Threshold n
  Treatments: Try to get more data; suppress value
  Notes: If this test is failed, the study is probably unreliable due to the small sample size.

Statistic: Mean
  Tests: Threshold n; Dominance (n,k); Dominance p%
  Treatments: Recode variable; round reported value; suppress denominator; suppress value
  Test: Differencing
  Treatment: Redefine one or both populations
  Notes: The tests and treatments are only necessary if the denominator is known so the sum can be inferred. The mean has a strong algebraic relationship with the sum so is potentially disclosive.

Statistic: Ratios and percentages
  Test: Individual values
  Treatment: Suppress individual values
  Tests: Threshold n; Threshold p%; Dominance (n,k); Dominance p%
  Treatments: Recode variables; round reported values; suppress values
  Test: Differencing
  Treatment: Redefine one or both populations
  Test: Relationships
  Treatment: Round reported values
  Test: Precision
  Treatment: Round reported values
  Notes: For a ratio, the tests and treatments are only necessary if one of the terms is known so the other can be inferred (this is an example of the relationship test).
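
The notes on the mean can be made concrete with a small worked example. The Python sketch below assumes an attacker knows the two denominators and sees two exact means for populations that differ by one record; the function names and data values are hypothetical.

```python
def infer_sum(mean, count):
    """A mean with a known denominator gives back the sum."""
    return mean * count

def infer_single_value(mean_all, n_all, mean_subset, n_subset):
    """Recover the one value that is in the larger population but not the subset."""
    if n_all - n_subset != 1:
        raise ValueError("populations must differ by exactly one record")
    return infer_sum(mean_all, n_all) - infer_sum(mean_subset, n_subset)

# Illustrative values only.
values = [40.0, 50.0, 60.0, 90.0]
mean_all = sum(values) / len(values)   # mean over all four records
mean_subset = sum(values[:3]) / 3      # mean with the last record excluded
leaked = infer_single_value(mean_all, 4, mean_subset, 3)
```

The recovered value equals the excluded record, which is why the differencing test asks that the populations be redefined so they differ by more records.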


Page 14: Protecting Confidentiality

Checklist - examples


  Test: Precision
  Treatment: Round reported values

Statistic: Relative risk
  Test: Threshold n
  Treatment: Recode variables
  Test: Precision
  Treatment: Round reported value
  Notes: In some cases data might be reconstructed from sample size and relative risk value alone. If so, the data would need to be checked for disclosure risk, and treatments applied if necessary.

Statistic: Confidence interval
  Test: Degrees of freedom
  Treatment: Change model or data groups to increase degrees of freedom
  Test: Threshold n
  Treatment: Recode variables
  Test: Precision
  Treatment: Round reported values
  Notes: A confidence interval based on a normal distribution reveals a mean and standard error. These might be disclosive; see the tests and treatments under Summary Statistics. Note that in a regression context it is claimed they can be used to reconstruct the fitted values.

Statistic: p value of a test
  Test: Precision
  Treatment: Round reported value
  Notes: A p value can reveal the value of a test statistic, which might be disclosive in combination with other reported information; see the 1st note on Confidence Intervals.

Statistic: Kaplan-Meier plot; other cumulative distribution plots
  Test: Individual value
  Treatment: Do not show individual values. This can be done by either smoothing the plot or recoding variables.
  Tests: Threshold n; Threshold p%; Dominance (n,k); Dominance p%
  Treatment: Recode variables
  Notes: There exists software that can read data values from a pdf version of a plot. The threshold and dominance tests are only relevant if data already grouped in plot.
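
The notes on confidence intervals and p values can be illustrated by inverting the usual formulas. This Python sketch assumes a symmetric normal-based interval and a two-sided z test; the function names and example numbers are hypothetical and the result only shows what a reader of the published output could recompute.

```python
from statistics import NormalDist

def mean_and_se_from_ci(lower, upper, level=0.95):
    """Recover the mean and standard error from a normal-based confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    mean = (lower + upper) / 2.0
    se = (upper - lower) / (2.0 * z)
    return mean, se

def abs_z_from_two_sided_p(p_value):
    """Recover |z| from the two-sided p value of a normal (z) test."""
    return NormalDist().inv_cdf(1.0 - p_value / 2.0)

# Illustrative published values only.
mean, se = mean_and_se_from_ci(8.04, 11.96)
z = abs_z_from_two_sided_p(0.05)
```

The more decimal places a published interval or p value carries, the more precisely the underlying mean, standard error or test statistic can be recovered, which is why the treatment listed is rounding.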

Page 15: Protecting Confidentiality

Virtual Data Centres
• Becoming more popular
• Manual checking of outputs for confidentiality risk not sustainable
• Automated methods for confidentiality protection in statistical analysis outputs still under development

Interim Solution
1. Dataset preparation by Custodian
2. Researchers confidentialise their own outputs for publication
  – Training
  – Checklist of tests and confidentiality treatments



Page 16: Protecting Confidentiality

Thank you

Computational Informatics
Dr Christine O’Keefe
Research Program Leader, Decision and User Science
t +61 2 6216 7021
e [email protected]