genetic data protection
TRANSCRIPT
-
7/27/2019 Genetic Data Protection
1/25
Review
Routes for breaching and protectinggenetic privacy
Yaniv Erlich1,* and Arvind Narayanan2
1 Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA USA 021422 Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ USA 08540* Correspondence to Y.E ([email protected])
Abstract
We are entering the era of ubiquitous genetic information for research, clinical care,and personal curiosity. Sharing these datasets is vital for rapid progress inunderstanding the genetic basis of human diseases. However, one growing concernis the ability to protect the genetic privacy of the data originators. Here, wetechnically map threats to genetic privacy and discuss potential mitigationstrategies for privacy-preserving dissemination of genetic data.
About the Authors
Yaniv Erlich is a Fellow at the Whitehead Institute for Biomedical Research. Erlich
received his Ph.D. from Cold Spring Harbor Laboratory in 2010 and B.Sc. from Tel-
Aviv University in 2006. Prior to that, Erlich worked in computer security and wasresponsible for conducting penetration tests on financial institutes and commercialcompanies. Dr. Erlichs research involves developing new algorithms for
computational human genetics.
Arvind Narayanan is an Assistant Professor in the Department of Computer Scienceand the Center for Information Technology and Policy at Princeton. He studies
information privacy and security. His research has shown that data anonymizationis broken in fundamental ways, for which he jointly received the 2008 Privacy
Enhancing Technologies Award. His current research interests include building a
platform for privacy-preserving data sharing.
Draft
1
-
7/27/2019 Genetic Data Protection
2/25
Summary
Broad data dissemination is essential for advancements in genetics, but also
brings to light concerns regarding privacy.
Privacy breaching techniques work by cross-referencing two or more pieces of
information to gain new, potentially undesirable knowledge on individuals or
their families.
Broadly speaking, the main routes to breach privacy are identity tracing, attribute
disclosure, and completion of sensitive DNA information.
Identity tracing exploits quasi-identifiers in the DNA data or metadata to uncover
the identity of an unknown genetic dataset.
Attribute disclosure techniques work on known DNA datasets. They use the
DNA information to link the identity of a person with a sensitive phenotype.
Completion techniques also work on known DNA data. They try to uncover
sensitive genomic areas that were masked to protect the participant.
In the last few years, we have witnessed a rapid growth in the range of techniques
and tools to conduct these privacy-breaching attacks. Currently, most of the
techniques are beyond the reach of the general public, but can be executed by
trained persons with varying degrees of effort.
There is considerable debate regarding risk management. One camp supports a
pragmatic, ad-hoc approach of privacy by obscurity and the other supports a
systematic, mathematically-backed approach of privacy by design.
Privacy by design algorithms include access control, differential privacy, and
cryptographic techniques. So far, data custodians of genetic databases mainly
adopted access control as a mitigation strategy.
New developments in cryptographic techniques may usher in an additional
arsenal of security by design techniques.
2
-
7/27/2019 Genetic Data Protection
3/25
INTRODUCTION
We produce genetic information for research, clinical care, and genealogy at
exponential rates. Sequencing studies with thousands of individuals have become a
reality1,2 and new projects aim to sequence hundreds of thousands to millions ofindividuals3. Some geneticists envision whole genome sequencing of every person
as part of routine health care4,5.
Sharing genetic findings is vital for accelerating the pace of biomedical discoveries
and fully realizing the promises of the genetic revolution6. Recent studies suggest
that robust predictions of genetic predispositions to complex traits from genetic
data will require the analysis of millions of samples7,8. Clearly, collecting cohorts at
such scales are typically beyond the reach of individual investigators and cannot be
achieved without combining different sources. In addition, broad dissemination of
genetic data promotes serendipitous discoveries through secondary analysis, which
is necessary to maximize its utility for patients and the general public9.
One of the key issues of broad dissemination is an adequate balance of dataprivacy10. Prospective participants of scientific studies have ranked privacy of
sensitive information as one their top concerns and a major determinant if to
participate in the study11-13. Protecting personally identifiable information is also a
demand of an array of regulatory statutes in United States and the European
Union14. Data de-identifying, the removing of the person identifier, has been
suggested as a potential path to reconcile data sharing and privacy demands15. But is
this technically feasible for genetic data?
This review characterizes privacy breaching techniques of genetic information and
maps potential counter-measures. We first categorize privacy-breaching strategies,
discuss their underlying technical concepts, and evaluate their performance and
limitations. Then, we present privacy-preserving technologies, group themaccording to their methodological approaches, and discuss their relevance to genetic
information. As a general theme, we focus only on breaching techniques that involve
data mining and fusing distinct resources to gain private information relevant to
DNA data. Data custodians should be aware that security threats can be much
broader. They can include cracking weak database passwords, classical computer
hacking techniques of the server that holds the data, stealing of storage devices due
to poor physical security, and intentional misconduct of data custodians16-18. We do
not include these threats since they are not unique to genetic information and havebeen extensively studied by the computer security field19. In addition, this review
does not cover the potential implications of loss of privacy, which heavily depend oncultural, legal, and socio-economical context and were covered in part by the broad
privacy literature20,21
.
3
-
7/27/2019 Genetic Data Protection
4/25
PRIVACY BREACHING OF GENETIC DATA
Genetic privacy breaching techniques fall into three categories: Identity Tracing,
Attribute Disclosure Attacks via DNA (ADAD), and Completion Techniques (Figure
1). The shared concept of these techniques is gaining a new piece of private
potentially sensitive information about the target or his family by exploiting DNA
data. The three categories are distinct in the type of sensitive information that they
reveal. The aim of identity tracing is to link between an unknown genome and theconcealed identity of the data originator. In ADAD, the adversary already has access
to the identified DNA sample of the target and to a database that links DNA-derived
data to sensitive attributes without explicit identifiers, for example a public
database of the genetic study of drug abuse. The ADAD techniques match the DNA
data and associate the identity of the target with the sensitive attribute. In
completion techniques, the adversary also knows the identity of a genomic dataset
but has access only to a sanitized version without sensitive loci. The aim here is to
4
-
7/27/2019 Genetic Data Protection
5/25
expose the sensitive loci that are not part of the original data. Table 1 summarizes
all privacy breaching techniques that are presented in this section.
Table 1 | Categorization of techniques for breaching genetic privacy
Technique Maturation
Level
Technical
complexity
Auxiliary
information
Identity TracingSurname Inference Intermediate-
Good
DNA Phenotyping Poor
Demographic identifiers Good
Pedigree structure Poor
Side channel leakage Varies
Attribute Disclosure Attacks via DNA
N=1 Good
Genotype frequencies Good
Linkage disequilibrium Intermediate
Effect sizes Good
Trait inference GoodGene expression data Poor
Completion Attacks
Imputation of a masked marker Good
Genealogical imputation Poor
Maturation level: Working principles established with simulated data. Small scale proof of concept with
real data in a controlled environment (typically only one dataset). Large scale experiments in controlled
environments with real data (typically more than one dataset). Breach of privacy was reported in a real
scenario.
Technical complexity: no knowledge in genetics or special tools is required. Require genetic knowledge;
computation can reasonably be done on a regular computer. Existing tools are available Require genetic
knowledge, intermediate scale processing of data and/or molecular techniques. Require genetic knowledge;
large scale processing of data is a prerequisite; may also require molecular techniques.
Auxiliary information: this column refers to the level of existing reference databases for the US population in
public resources. For identity tracing, it refers to the availability of organized lists that link identities and
extract pieces of information. For ADAD and completion techniques, it refers to the existence of supporting
reference datasets that are necessary to complete the attack. Poor supporting data is highly fragmented and
not amenable to searching. Intermediate supporting data is harmonized and searchable but requires some
pre-processing. Great supporting data is ready to use using existing tools or minimal pre-processing.
IDENTITY TRACING ATTACKS
The goal of identity tracing attacks is to uniquely identify the data originator fromthe population despite the absence of explicit identifiers such as the name and exact
address in the published dataset. The idea is to accumulate quasi-identifiers --residual pieces of information that are embedded in the dataset -- and to gradually
narrow down the possible individuals that match the combination of these quasi-
5
-
7/27/2019 Genetic Data Protection
6/25
identifiers to the point that the data originator is the only match. The success of the
attack depends on the information content that the adversary can obtain from these
quasi-identifiers relative to size of the base population (Box 1).
IDENTITY DISCLOSURE BY META-DATA
6
-
7/27/2019 Genetic Data Protection
7/25
Genetic datasets are typically published with additional metadata, such as basic
demographic details, inclusion/exclusion criteria, pedigree structure, and health
conditions that are critical to understand the study and for secondary analysis.
Unrestricted demographic information conveys substantial power for identity
tracing. It has been estimated that the combination of date of birth, sex, and 5 digitzip code uniquely identifies more than 60% of US individuals22,23. In addition, there
are extensive public resources with broad population coverage and searchinterfaces that link demographic quasi-identifiers and individuals, including voter
registries, public record search engines such as People- Finders.com, data brokers,and social media. In one of the pioneering studies of identity tracing using metadata,
Sweeny reported the successful tracing of the medical record of the Governor of
Massachusetts using demographic identifiers24. At that time, the MassachusettsGroup Insurance Commission released hospital discharge information with five digit
zip codes, sex, and date of birth. By searching the voter registry, Sweeny was able touniquely match the hospital discharge of the Governor. A more recent study
reported the identification of 30% of Personal Genome Project (PGP) participants bydemographic profiling that included zip code and exact birthdates that are found in
PGP profiles25.
Since the inception of the HIPAA Rule in 2003, demographic identifiers are thesubjects of tight regulation in the US health care system26. The SAFE HARBOR
provision requires that the maximal resolution of any date field, including birth andhospital admissions, will be in years. In addition, the maximal resolution of a
geographical subdivision is the first three digits of a zip code (as long as there are
more than 20,000 living in the regions that correspond to the three digit zip codes).
Statistical analyses of the census data have found that the Safe Harbor provision
provides reasonable immunity against identity tracing assuming that the adversary
has access only to demographic identifiers. The combination of sex, age, ethnic
group, and state is unique to less than 0.25% of the populations across all states27.
An empirical study evaluated the re-identification of 15,000 records of Hispanic
patients in the Chicago area that included year of birth, 3-digit zip code, and maritalstatus (married/unmarried) by comparison to voting registry data28. The authors
reported the correct identification of 2 out of the 15,000 records and estimated that
less of 0.22% the population is exposed with this set of quasi-identifiers. These
studies show that with access to only HIPAA redacted demographic quasi-
identifiers, identity tracing is extremely hard.
Pedigree structures are another piece of metadata that are included in many genetic
studies. These structures contain rich information, especially when large kinships
are available29. The number of offspring, their birth order, and other familial events
such as remarriage, create unique combinations of quasi-identifiers that quickly
narrow down the search space. A systematic study analyzed the distribution of
2,500 two-generation family pedigrees that were sampled from obituaries of a town
of 60,000 individuals30. The pedigrees were unsorted, meaning that only the number
of male and female individuals in each generation was available. Despite this limitedinformation, about 30% of the pedigree structures were unique, demonstrating the
large information content that can be obtained from such data. Another feature ofpedigrees for identity tracing is the combination of quasi-identifiers across records.
For example, it is quite rare that a surname alone can identity an individual.However, the surname combination of a couple prior to their marriage is an
7
-
7/27/2019 Genetic Data Protection
8/25
extremely strong identifier. In addition, once a single individual in a pedigree is
identified, it is easy to link the identities of the other relatives and their genetic
datasets. The main limitation of identity tracing using pedigree structures is their
low searchability. Perhaps one notable exception is Israel, where the entire
population registry was leaked to the web in 2006 and allows the construction ofmulti-generation family trees of all Israeli citizens31. But in general due to their low
searchability, the value of family trees for re-identification is mostly limited tomanual verification of the potential identity of the target rather than a starting point
of the process.
IDENTITY TRACING BY GENEALOGICAL TRIANGULATION
Genetic genealogy attracts millions of individuals interested in their ancestry and
discovering distant relatives. To that end, the community has developed impressiveonline platforms to search for genetic matches and connect individuals. These online
resources can be exploited to triangulate the identity of an unknown genome.
One potential route of identity tracing is surname inference from Y-chromosomedata32,33. In most societies, surnames are passed from father to son, creating a
transient correlation with specific Y chromosome HAPLOTYPES34,35. The adversarycan take advantage of the Y chromosome-surname correlation and compare the Y
haplotype of the unknown genome to haplotype records in recreational geneticgenealogy databases. A close match with a relatively short time to the most common
recent ancestor (MRCA) would signal that the unknown genome likely has the samesurname as the record in the database.
The power of surname inference stems from exploiting information from distant
patrilineal relatives of the unknown genome. The association between surnamesand Y-chromosomes usually spans dozens of generations, implying that every
record in a genealogical database is capable of revealing the surnames of hundreds
to thousands of males. A recent empirical study estimated that 10-14% of USCaucasian males from the middle and upper classes are subject to surname
inference based on scanning the two largest Y-chromosome genealogical websiteswith a built-in search engine33.
An inferred surname has tremendous power for identity tracing. Individual
surnames are relatively rare in the population and in most cases a single surname is
shared by less than 40,000 US males33, which is equivalent to 12 bits of information.
In terms of identification, successful surname recovery is very close to determiningan individuals zip code. Another feature of surname inference is that surnames are
highly searchable. From public record search engines to social networks, numerousonline resources offer surname query interfaces, simplifying the adversarys efforts
to complete the triangulation.Surname inference has been utilized to breach genetic privacy in the past36-39.Several sperm donor conceived individuals and adoptees successfully used this
technique on their own DNA to reveal the surnames of their ancestors, whicheventually lead to the exposure of their biological families. This technique could also
be applied to whole genome sequencing datasets. A recent study reported five
successful surname inferences from Illumina datasets of three large families that
8
-
7/27/2019 Genetic Data Protection
9/25
were part of the 1000 Genomes project, which eventually exposed the identity of
close to fifty research participants33.
The main limitation of surname inference is that haplotype matching relies on
comparing Y chromosome Short Tandem Repeats (Y-STRs). Currently, most
sequencing studies do not routinely report these markers and the adversary would
have to process large-scale raw sequencing files with a specialized tool, which isboth time and resource consuming and requires bioinformatics experience 40.
Another complication is false identification of surnames and inference of surnames
with spelling variants compared to the original surname. Eliminating incorrect
surname hits necessitates access to additional quasi-identifiers such as pedigreestructure and typically requires a few hours of manual work. Finally, the
performance of surname inference varies between different socio-ethnic groupsbased on non-paternity rates, sociological norms of surname inheritance, and access
of the group to recreational genealogy.
An open research question is the utility of non Y chromosome markers forgenealogical triangulation. Websites such as Mitosearch.org and GedMatch.com run
open searchable databases for matching mitochondrial and autosomal genotypes,
respectively. Our expectation is that mitochondrial data will not be very informativefor tracing identities. The resolution of mitochondrial searches is much lower due to
its smaller size and the absence of highly polymorphic markers like Y-STRs, meaningthat a large number of individuals would share the same mitochondrial haplotypes.
In addition, most human societies do not exercise maternally inherited identifiers,reducing the utility of such searches. Autosomal searches on the other hand might
be quite powerful. Genetic genealogy companies have started to market services fordense genome-wide arrays that enable relatively sufficient accuracy to identify
distant relatives on the order of 3rd to 4th cousins41. These hits would reduce the
search space to no more than a few thousand individuals42. The main challenge of
this approach would be translating the genealogical match to a list of potential
people. But with the growing interest in genealogy, this technique might be easier in
the future and should be taken into consideration.
IDENTITY TRACING BY PHENOTYPIC PREDICTION
Several reports on genetic privacy envisioned that phenotypic predictions from
genetic data could serve as quasi-identifiers for identity tracing43,44. Twin studieshave estimated high heritabilities for various visible traits such as height45and facial
morphology46. In addition, recent studies showed that age prediction is possible
from DNA specimens derived from blood samples47,48. But the applicability of these
DNA-derived quasi-identifiers for identity tracing has yet to be demonstrated.
The major limitation of phenotypic prediction is the fast decay of the identificationpower with small inference errors (Box 1). Current genetic knowledge explains only
a small extent of the phenotypic variability of most visible traits, such as height49,
BMI50, and face morphology51, significantly limiting their utility for identification.
For example, perfect knowledge about height at one-centimeter resolution conveys
5 bits of information. However, with current genetic knowledge that explains 10%
of height variability49, the adversary learns only 0.15 bits of information. Predictions
of most of the other visible traits are even worse, implying that their utility as quasi-
9
-
7/27/2019 Genetic Data Protection
10/25
identifiers would be quite low. The exceptions in visible traits are eye color52 and
age prediction47. Recent studies showed a prediction accuracy of 75%-90% of the
phenotypic variability of these traits. But even these successes translate to
approximately 3-4 bits of information. Another challenge for phenotypic prediction
is the low searchability of most of these traits. There are no population-basedregistries of height, eye color, or face morphology and the adversary would have to
invest substantial efforts to compile such a registry. However, with the advent ofnew types of social media, this barrier might be less significant in the future.
IDENTITY TRACING BY SIDE-CHANNEL LEAKS
Side channel attacks exploit quasi-identifiers that are unintentionally encoded in the
database building blocks and structure rather than the actual data that is meant to
be public. A good example for such leaks is the exposure of the full names of PGPparticipants from 23andMe filenames25. The PGP allowed participants to upload
23andMe genotyping files to their public profile webpages. The default conventionof these 23andMe files includes the first and last name of the user. As part of the
upload process, the PGP website automatically compressed the file, named it withthe PGP identifier of the user, and presented a link that showed the new file name
that does not include the first and last names. However, after downloading and
decompressing the 23andMe file, the original filename appeared. Since most of theusers did not change the default naming convention, it was possible to trace the
identity of a large number of PGP profiles. Based on this experience, the PGP nowforces the participant to rename files before uploading and warns them that the file
may contain hidden information that can expose their identities.
Rich data files embed multiple layers of hidden information that provide ampleopportunities for leakage of quasi-identifiers. Photo files typically embed
Exchangeable Image File Format (EXIF) fields that can include GPS data about the
location of the photo or the serial number of the camera53. This information couldconvey potential leads even if the photo itself does not disclose any sensitiveinformation. Microsoft Office products typically embed the author name and contain
previous revisions of the document that show deleted text54. In general, flat text files
are the most immune format to these types of leaks of unintentional content.
The mechanism to generate database accession numbers can also leak personal
information. Ideally, these numbers should be completely random but experience
has highlighted that sometimes these numbers unintentionally reveal residual
information due to non-random assignments. For example, in several top medial
data mining contests, the accession numbers unintentionally revealed the disease
status of the patient, which was the aim of the contest55. Another example is the
non-random assignment of Social Security Numbers (SSN) in the US. Pattern
analysis of a large amount of public data revealed temporal and spatial
commonalities in the assignment system that allowed predictions of the SSN from
quasi-identifiers56. Some suggested the assignment of accession numbers by
applyingCRYPTOGRAPHIC HASHINGto the participant identifiers such as name or
social security number57. However, this technique is extremely vulnerable toDICTIONARY ATTACKS due to the relatively low search space of the input. In
10
-
7/27/2019 Genetic Data Protection
11/25
general, it is advisable to add some sort of randomization to procedures that
generate accession numbers in order to prevent misuse.
ATTRIBUTE DISCLOSURE ATTACKS VIA DNA (ADAD)
In ADAD, the adversary creates a statistical bridge that uses DNA data to link
sensitive attributes with the identity of a person. The first piece of information is aDNA sample from an identified target. This can be achieved by successful
completion of an identity tracing attack, exploiting identified DNA data in projects
such as OpenSNP, gaining internal access to restricted databases, or simply by
obtaining a DNA sample directly from the target. The second piece of information is
DNA derived data that is associated with sensitive information, such as disease,
personality traits, or socio-economic status, which does not otherwise contain
explicit identifiers. The main difference between the ADAD attacks is the type of
DNA derived data that is associated with the sensitive attribute.
ADAD: THE N=1 SCENARIO
The simplest scenario of ADAD is when the sensitive attribute is associated with the
genotype data of the individual. The adversary can simply match the genotype data
that is associated with the identity of the individual and the genotype data that is
associated with the attribute. Such an attack requires only a small number of
autosomal SNPs. Empirical data showed that a carefully chosen set of 45 SNPs is
sufficient to provide matches with a TYPE I ERROR of 10-15 for most of the major
populations across the globe58. Moreover, it is expected that random subsets of
approximately 300 common SNPs would yield sufficient information to uniquely
match any person59.
With the low number of SNPs required for matching, individual level genotypes-phenotype records in genome-wide association studies (GWAS) are highly
vulnerable to ADAD. In order to address this issue, several organizations, including
the NIH, adopted a two tier access system for GWAS datasets: a restricted accessarea that stores individual level genotypes and phenotypes and a public access area
for high level data summary statistics of allele frequencies of all cases and controls60.The premise of this distinction was that summary statistics enable secondary data
usage for meta-GWAS analysis while it was thought that this type of data is immune
to ADAD.
ADAD: THE SUMMARY STATISTIC SCENARIO
A landmark work by Homer et al. reported the possibility of ADAD on GWAS
datasets that only consists of the allele frequencies of the study participants61. The
underlying concept of their approach is that, with the target genotypes in the casegroup, the average allele frequencies will be positively biased towards the target
genotypes compared to the estimated MAF from the general population. Conversely,when the target is not part of the study, the average allele frequencies will be
11
-
7/27/2019 Genetic Data Protection
12/25
negatively biased compared to the target genotypes. A good illustration of this
concept is considering an extremely rare variation in the subjects genome. Non-
zero allele frequency of this variation in a small-scale study increases the likelihood
that the target was part of the study, whereas zero allele frequency strongly reduces
this likelihood. Homer et al. showed that by integrating the slight biases in the allelefrequencies over a large number of SNPs it is also possible to conduct ADAD with
the common variations that are analyzed in GWAS.
Subsequent studies extended the range of exploitations for summary statistics. Oneline of studies improved the test statistic in the original Homer et al. work and
analyzed its mathematical properties62-64. Under the assumption of common SNPs inLINKAGE EQUILIBRIUM, their improved test statistic is mathematically guaranteed
to yield maximalPOWERfor anySPECIFICITYlevel. Wang et al. went beyond allelefrequencies and demonstrated that it is possible to exploit local LD structures for
ADAD65. Their test statistic scores the co-appearance of two SNP alleles in the targetgenome with the bias of LD structure in a GWAS study versus the general
population. The power of this approach stems from scavenging for the co-
occurrence of two mildly uncommon alleles in different haplotype blocks that
together create a rare event. They reported a power of 80% and specificity of 95%for ADAD on a GWAS with 200 samples that exploited the LD structure of 174common SNPs in the FGFR2 locus. With the same number of SNPs, ADAD methods
that use only allele frequencies yield an expected power of 24% for the same
specificity level under the most optimal scenario. Im et al. developed a method to
exploit the EFFECT SIZES of GWAS studies of quantitative traits to detect the
presence of the target66. Different from ADAD with allele frequencies, the detection
performance is better for participants with extreme phenotypes and worse for
participants with average phenotypes. A powerful development of this approach is
exploiting GWAS studies that utilize the same cohort for multiple phenotypes. The
adversary repeats the identification process of the target with the effect sizes of
each phenotype and integrates them to boost the identification performance. After
determining the presence of the target in a quantitative trait study, the adversarycan further exploit the GWAS data to predict the phenotypes with high accuracy67.
This method works by simply correlating the DNA of the target with the effect sizes
and takes advantage of the spurious associations when regressing a large number
on markers with a single phenotype.
The theoretical performance of ADAD is a complex function of the size of the study
and the general population68,69. On one hand, in any of the techniques above, studies
with smaller numbers of participants generate more apparent biases in their
summary statistics, which increases the power and specificity of the ADAD
discrimination (Figure 2A). On the other hand, a target drawn randomly from the
general population has a lower a-priori probability of having participated in a study
with a smaller number of participants. This means that ADAD on smaller studies
needs to work with higher specificity to achieve the same PRECISION of larger
studies, reducing the power of the attack and the number of people at risk ( Figure
2B). In any case, the performance and risk increase when the base population is
smaller, such as the Amish or Hutterite populations, or when the meta-informationenables stratification of the general population (Figure 2C).
The actual risk of ADAD on summary data has been the subject of debate. Following
the original Homer et al. study, the NIH and other data custodians moved their
12
-
7/27/2019 Genetic Data Protection
13/25
GWAS summary statistic data from public databases to access controlled databases
such as dbGAP70. A retrospective analysis found that significantly fewer GWAS
studies publicly released their summary statistic data71. Most of the studies publish
summary statistic data on 10-500 SNPs, which is compatible with one suggested
guideline to manage risk67. Some warned that these policies are too harsh72. Thereare several practical complications that the adversary needs to overcome to launch
a successful attack, such as access to the targets DNA data73, access to a largereference database to assess the general population frequency data, and accurate
matching between the ancestries of the target with those listed in the referencedatabase74. Failure to address any of these prerequisites can severely impact the
performance of the ADAD. In addition, for a range of GWAS studies, the associated
attributes are not sensitive or private (e.g. height). Thus, even if ADAD occurs, theimpact on the participant should be minimal. A recent NIH workshop proposed the
release of summary statistics as the default policy and developing an exemptionmechanism for studies with increased risk due to the sensitivity of the attribute or
the vulnerability level of the summary data75.
13
-
7/27/2019 Genetic Data Protection
14/25
ADAD: THE EXPRESSION DATA SCENARIO
Public databases such as GEO hold hundreds of thousands of gene expression
profiles of individuals that are linked to a range of medical attributes. Schadt et al.
proposed a potential route to exploit these profiles for ADAD76. The method starts
with a training step that employs a standard EXPRESSION QUANTITATIVE TRAITLOCI (eQTL) analysis with a reference dataset. The goal of this step is to identify
several hundred strong eQTLs and to learn the distributions of the expression level
for each genotype. Then, the algorithm scans the public expression profiles and
calculates the probability distributions of the genotypes of the eQTLs. Last, the
algorithm matches the targets genotype with the inferred allelic distributions of
each expression profile and tests the hypothesis that the match is random. If the null
hypothesis is rejected, the algorithm links the identity of the target to the medical
14
-
7/27/2019 Genetic Data Protection
15/25
attribute in the gene expression experiment. This ADAD technique has the potential
for relatively high accuracy in ideal conditions. The method perfectly matched 580
individuals with their expression profiles when the training was conducted on a
distinct dataset. Based on large-scale simulations, they further predicted that the
method can reach a type I error of 1x10 -5 with a power of 85% when tested on anexpression database using the entire US population.
There are several practical limitations to ADAD via expression data. While the
training step and inference steps are capable of working with expression profilesfrom different tissues, the method reaches its maximal power when the training and
inference utilize eQTL from the same tissue. Moreover, there is a significant loss ofaccuracy when the expression data in the training phase is collected using a
different technology than the expression data in the inference phase. Anothercomplication is that in order to fully execute the technique on a large database such
as GEO, the adversary will need to manage and process large-scale expression data.
Due to these practical barriers, the NIH did not issue any changes to their policiesregarding sharing expression data from human subjects.
COMPLETION ATTACKS
Completion of genetic information from partial data is a well-studied task in genetic
studies, called genotype imputation77. This method takes advantage of theLINKAGE
DISEQUILIBRIUM between markers and uses reference panels with complete
genetic information to restore missing genotype values in the data of interest. The
very same strategies enable the adversary to expose certain regions of interest
where only partial access to the DNA data is available. One publicized case of a
completion attack was the inference of Jim Watsons risk for Alzheimer's disease.
Watson opted to publish his entire identified genome sequence except data from his
ApoE gene, which is associated with Alzheimers disease78. Nyholt et al. restored the
ApoE status using imputation with markers that are 15Kb away from the maskedsite79. As a result of the study, a 2Mb segment around the ApoE gene was removed
from Watsons published genome.
In some cases, completion techniques also enable the prediction of genomic
sequences when there is no access to the DNA of the target. This technique is
possible when the reference panel is combined with genealogical information80. Thealgorithm finds relatives of the target that donated their DNA to the reference panel
and that reside on a unique path that includes the target, for example a pair of half-first cousins when the target is their grandfather. A shared DNA segment between
the relatives indicates that the target had the same segment. By scanning more pairsof relatives that are connected through the target, it is possible to infer the two
copies of autosomal loci and collect more genomic information on the target withoutany access to its DNA. Building on the deep genealogical records in Iceland, deCode
Genetics was able to leverage their large reference panel to infer genetic variants of
an additional 200,000 living individuals who never donated their DNA to thecompany. While this technique is mostly relevant to targets with a large number of
decedents and can be executed in only a narrow range of scenarios, it emphasizesthe complexities of genetic privacy. In May 2013, Iceland's Data Protection
15
-
7/27/2019 Genetic Data Protection
16/25
Authority prohibited the use of this technique until consent can be obtained from
the individuals who are not part of the original reference panel81.
MITIGATION TECHNIQUES
Most of the genetic privacy breaches presented above are quite sophisticated. Theyrequire a background in genetics and statistics and -- importantly -- a motivatedadversary. One school of thought posits that these practical complexities almosteliminate the probability of an adverse event and therefore attenuate the risk tonegligible levels for most studies82,83. According to this approach, an appropriatemitigation strategy is just removing very obvious identifiers from the datasetsbefore publicly sharing the information. In the field of computer security, this riskmanagement strategy is called security by obscurity. This approach is simple toimplement and poses minimal burden on data dissemination. The opponents ofsecurity by obscurity posit that risk management schemes based on the probabilityof an adverse event are fragile and short lasting. According to their views,technologies only get better and what is technically challenging but possible today
will be much easier in the future. Therefore, the probabilities of adverse events arenon-computable and irrelevant84. Known in cryptography as Shannons maxim85,this school of thought assumes that the adversary exists and is equipped with theknowledge and means to execute the breach. Robust data protection, therefore, isachieved by explicit design of the data access protocol rather than by the actualchance of a breach86. This section surveys the main security by design schemes andtheir relevance to protecting genetic data.
AC CE SS CO NTRO L
One approach to mitigate the chance of a privacy breach is to place the sensitive
data in a secure location and screen the legitimacy of the applicants and theirresearch projects. Once approval is made, the applicants are allowed to download
the data under the conditions that they will store it in a secure location and will not
attempt to identify individuals. In addition, the applicants should be required to file
periodic reports about the data usage and any adverse events. This approach is thecornerstone of the access-controlled dbGAP60. Based on periodic reports of the
users, a retrospective analysis of dbGAP access control has identified 8 datamanagement incidents in close to 750 studies, mostly non-adherence to the
technical regulations, and no reports of breaching the privacy of participants87.Despite the absence of privacy breaches thus far, some have criticized the fact that
access control creates an illusion of security88. Once the data is in the hand of the
applicant, there is no real oversight of how it is being stored, the actual work, andwhat exactly is published. To address these limitations, an alternative approach is
the trust-but-verify model, where the user cannot download the raw data but mayexecute certain types of queries that are recorded and monitored by the system89.
Supporters of this model state that monitoring has the potential to deter malicioususers from accessing the data and facilitates early detection. Another development
based on this approach is enforcing the users and data custodians to have skin in
the game90, by adding penalties beyond denying access to the resource in case of
16
-
7/27/2019 Genetic Data Protection
17/25
misuse. The main downside of access control is that any of the models listed above
require constant management of the resource and create administrative burden to
both data custodians and users.
DATA ANONYMIZATION AND AGGREGATION
The premise of anonymity is the ability to be lost in the crowd. One line of studies
suggested restoring anonymity by restricting the granularity of the quasi-identifiers
to the point that each record in the database is not unique. A popular heuristic is k-anonymity91. Using this approach, the quasi-identifiers are binned such that each
subjects record is identical to that of at least k-1 records from other individuals inthe dataset. To maximize the utility of the data for subsequent analysis, the binning
process is adaptive. Certain records will have a lower resolution depending on thedistribution of the other records and certain data categories that are too unique are
suppressed entirely. There is a strong trade-off in the selection of the value of k;
high values increase the size of the background crowd but at the same time reducethe utility of the data. As a rule of thumb, it was recommended to setk5(92). More
recent work showed that while k-anonymity protects against identity tracingtechniques, it is vulnerable to attribute disclosure, especially when the adversary
has a certain level of prior knowledge about the presence of the target in thedatabase93. Subsequent studies developed more elaborative redaction techniques to
address these issues93,94. These anonymization techniques have been mainly
successful in safeguarding demographic identifiers in medical research. However,
attempts to adopt these techniques to DNA research are yet to be practical95. The
high dimensionality of DNA data dictates that most of the records will be unique and
it is not clear how the data can be redacted without destroying its value for
secondary analysis.
Differential privacy offers a distinct approach to restore anonymity by producing
summary statistics after sophisticated data perturbation96.
It aims to ensure thatsummary statistics of two datasets that differ by exactly one individuals record areextremely close to each other. This way, the adversary cannot be sure whether the
target was part of the dataset or not and therefore cannot learn sensitive attributes.
The challenge in differential privacy algorithms is to minimize the perturbation
while satisfying the privacy property so that the summary statistic will still convey
useful information on the population as a whole. Differential privacy has gained
popularity in computer science and statistics as a very vibrant research area and the
US Census Bureau uses this technique for their OnTheMap tool97. Early attempts
have made progress towards protecting GWAS data using this approach98,99.
Currently, the main limitation is that the amount of perturbation that needs to be
added to the summary statistic grows linearly with the number of exposed SNPs,
which quickly abolishes the ability to detect fine associations in meta-analysis.Whether or not there is a way to add much smaller amounts of noise in a way that
still maintaining privacy for GWAS datasets remains an open question.
17
-
7/27/2019 Genetic Data Protection
18/25
CRYPTOGRAPHIC SOLUTIONS
Modern cryptography brought new advancements for data dissemination beyond
the traditional usage of encrypting sensitive files and distributing the key toauthorized users. Secure multiparty computation (SMC) allows two or more entities
who each have some private data to execute a computation on these private inputs
without revealing the input to each other or disclosing it to a third party. In one
classical example of SMC,ALICE and BOBcan determine who is richer without either
one revealing their actual wealth to the other. Researchers have constructed SMC
protocols in various domainsfrom voting100to location-based services101.
In the area of genetic data, one line of work has developed SMC algorithms for
genetic matching. Bruekers et al. presented a privacy-preserving algorithm to match
STR profiles between two parties without exposing the actual genetic data102.
Bohannon et al. suggested searchable genetic databases for forensic purposes that
allow only going from genetic data to identity but not from identity to geneticdata103. In their scheme, the records in the databases are encrypted with the
individuals genotype as the key. To tolerate genotyping errors or missing data, they
utilize a fuzzy encryption scheme that can use a key that only approximately
matches the original one. This way, only access to the genotype information can
reveal the identity but not the opposite. Along similar lines, Cristofaro et al.
constructed cryptographic protocols for privacy-preserving paternity tests and
genetic compatibility tests104, albeit for molecular techniques that are no longer in
18
-
7/27/2019 Genetic Data Protection
19/25
use, such as RFLP. They also presented a smartphone-based implementation of
these protocols105. The performance varies dramatically between tasks that examine
only a few loci and those that depend on the whole genome. The former complete in
under a second and the latter take days of computation and gigabytes of bandwidth,
rendering them impractical at the current time.
In another direction, Kamm et al suggested a secure multi-center GWAS analysis106.In their protocol, each center deploys a secret sharing scheme on its own collection
of subjects phenotypes and genotypes that divides the data into small shares, each
of which reveals nothing about the original values on its own. The shares are then
sent to the other centers, which store them in dedicated servers. The servers have
an interface that allows outsiders to initiate a GWAS study on phenotypes and
genotypes of interest. Upon request, the servers coordinate to perform the
association without reconstructing the original genotypes or phenotypes and onlyreport in plain text the significant SNPs. A potential shortcoming of their approach is
that, at least theoretically, the end product plain text is still vulnerable to ADAD onsummary statistic data, rendering the solution far from complete.
Another line of cryptographic work looks at privacy-preserving outsourcing ofcomputations on genetic information using homomorphic encryption107 (Box 2).
The concept of this approach is that, with advent of ubiquitous usage of genetic data,
users (or physicians) will interact with a variety of genetic interpretation services
(e.g. promethease.com) throughout their lives, which increases the chance of a
genetic privacy breach. Under this cryptographic work, users send an encrypted
version of their genome to the cloud. The interpretation service can access the clouddata but does not have the key and therefore cannot read the plain genotype values.
Instead, the interpretation service executes the algebraic operations of its geneticrisk prediction algorithm on the encrypted genotypes without inspecting the
plaintext. After completing the algorithm, the user grabs the cyphertext results fromthe cloud. Due to the special mathematical properties of the underlying
cryptosystem, the user simply decrypts the results to obtain his risk prediction. Thisway the user does not expose any of his genotypes or disease susceptibility to theservice provider. The current scope of risk prediction models is still limited but this
approach is quite amenable to future improvements.
CONCLUSION
The invention of asymmetric cryptography in the 1970s led to a revolution in
secure communication. Today, a wide variety of Internet transactions build upon
these security measures in ways that are completely transparent to the average
user. Data privacy still awaits a similar breakthrough. The status quo has greatlyshifted in the last few years, with a torrent of studies showing that a motivated,
technically-sophisticated adversary is capable of exploiting a wide range of genetic
data for unintended purposes. With the constant innovation in genetics and theexplosion of online information, we can expect that new privacy breaching
techniques will be discovered in the next few years. Restoring the status quo with
technical means will necessitate large strides in the theory and implementation of
mitigation algorithms. Some of the approaches, particularly access control, have
19
-
7/27/2019 Genetic Data Protection
20/25
been quite useful. But so far, mitigation schemes are resource and time consuming
for both the data custodian and users. Due to both technical and human factors108,
the privacy field has yet to come up with a set of methodologies of comparable
impact to communication security.
Successful balancing of privacy demands and data sharing is not restricted to
technical means109. Balanced informed consents outlining both benefits and risksare key ingredients for maintaining long-lasting credibility in genetic research. With
the active engagements of a wide range of stakeholders from the broad genetics
community and the general public, we as a society could develop social and ethical
norms, legal frameworks, and educational programs to reduce the chance of misuse
of genetic data despite the inability to theoretically prevent privacy breaches.
GLOSSARY
SAFE HARBOR: A standard in the HIPAA Rule for de-identification of protected health
information by removing 18 types of quasi-identifiers.HAPLOTYPES: A set of alleles along the same chromosome.CRYPTOGRAPHIC HASHING: A procedure that yields a fixed length output from any size of
input in a way that is hard to determine the input from the output.
DICTIONARY ATTACKS: A brute force approach to reverse cryptographic hashing by scanning
the relatively small input space.
TYPE I ERROR: The probability to obtain a positive answer from a negative item.
LINKAGE EQUILIBRIUM: Absence of correlation between the alleles in two loci.
POWER: The probability to obtain a positive answer for a positive item.
SPECIFICITY: The probability to obtain a negative answer for a negative item.
EFFECT SIZES: In quantitative traits, the contribution of a certain allele to the value of the trait.
EXPRESSION QUANTITATIVE TRAIT LOCI: Genetic variants associated with variability in
gene expression.
LINKAGE DISEQUILIBRIUM: The correlation between alleles in two loci.
ALICE AND BOB: Common placeholders in cryptography to denote party A and party B.
ACKNOWLEDGEMENTS
YE is an Andria and Paul Heafy Family Fellow and holds a Career Award at the
Scientific Interface from the Burroughs Wellcome Fund. This study was alsosupported by a gift from Cathy and Jim Stone. The authors thank Dina Zielinski and
Melissa Gymrek for useful comments and Shriram Sankararaman for his niceintroduction between the authors.
COMPETING INTERESTS STATEMENT
None.
REFERENCES
20
-
7/27/2019 Genetic Data Protection
21/25
1 Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human
protein-coding variants. Nature 493, 216-220, doi:10.1038/nature11690 (2013).2 Genomes Project, C. et al. An integrated map of genetic variation from 1,092 human
genomes. Nature 491, 56-65, doi:10.1038/nature11632 (2012).
3 Roberts, J. P. Million veterans sequenced. Nat Biotech 31, 470-470,doi:10.1038/nbt0613-470 (2013).
4 Drmanac, R. Medicine. The ultimate genetic test. Science 336, 1110-1112,doi:10.1126/science.1221037 (2012).5 Burn, J. Should we sequence everyone's genome? Yes. Bmj 346, f3133,
doi:10.1136/bmj.f3133 (2013).
6 Kaye, J., Heeney, C., Hawkins, N., de Vries, J. & Boddington, P. Data sharing ingenomics--re-shaping scientific practice. Nat Rev Genet 10, 331-335,
doi:10.1038/nrg2573 (2009).7 Park, J. H. et al. Estimation of effect size distribution from genome-wide association
studies and implications for future discoveries. Nat Genet 42, 570-575,doi:10.1038/ng.610 (2010).
8 Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic
analyses of genome-wide association studies. Nat Genet 45, 400-405, 405e401-403,doi:10.1038/ng.2579 (2013).
9 Friend, S. H. & Norman, T. C. Metcalfe's law and the biology information commons.Nature biotechnology 31, 297-303, doi:10.1038/nbt.2555 (2013).
10 Rodriguez, L. L., Brooks, L. D., Greenberg, J. H. & Green, E. D. Research ethics. The
complexities of genomic identifiability. Science 339, 275-276,doi:10.1126/science.1234593 (2013).
11 Care, I. o. M. U. R. o. V. S.-D. H. in Clinical Data as the Basic Staple of Health Learning:Creating and Protecting a Public Good: Workshop Summary The National
Academies Collection: Reports funded by National Institutes of Health (2010).12 McGuire, A. L. et al. To share or not to share: a randomized trial of consent for data
sharing in genome research. Genetics in medicine : official journal of the American
College of Medical Genetics 13, 948-955, doi:10.1097/GIM.0b013e3182227589(2011).
13 Oliver, J. M. et al. Balancing the risks and benefits of genomic data sharing: genome
research participants' perspectives. Public Health Genomics 15, 106-114,
doi:10.1159/000334718 (2012).14 Schwartz, P. M. & Solove, D. J. Reconciling Personal Information in the United States
and European Union. SSRN Electronic Journal, doi:10.2139/ssrn.2271442 (2013).
15 El Emam, K. Heuristics for De-identifying Health Data. Security & Privacy, IEEE 6,
58-61, doi:10.1109/MSP.2008.84 (2008).16 Lunshof, J. E., Chadwick, R., Vorhaus, D. B. & Church, G. M. From genetic privacy to
open consent. Nat Rev Genet 9, 406-411, doi:10.1038/nrg2360 (2008).
17 Brenner, S. E. Be prepared for the big genome leak. Nature 498, 139,doi:10.1038/498139a (2013).
18
19 Scambray, J. M. S. K. G. Hacking exposed network security secrets & solutions, (2001).
20 Solove, D. J. A Taxonomy of Privacy. University of Pennsylvania Law Review 154,
477 (2006).21 Ohm, P. Broken Promises of Privacy: Responding to the Surprising Failure of
Anonymization. UCLA Law Review 57 (2010).
22 Golle, P. in Proceedings of the 5th ACM workshop on Privacy in electronic society77-80 (ACM, Alexandria, Virginia, USA, 2006).
23 Sweeney, L. A. Simple Demographics Often Identify People Uniquely. (2000).
21
http://www.privacyrights.org/data-breachhttp://www.privacyrights.org/data-breachhttp://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568http://www.privacyrights.org/data-breach -
7/27/2019 Genetic Data Protection
22/25
24 Greely, H. T. The uneasy ethical and legal underpinnings of large-scale genomic
biobanks. Annual review of genomics and human genetics 8, 343-364,doi:10.1146/annurev.genom.7.080505.115721 (2007).
25 Sweeney, L. A., Abu, A. & Winn, J. Identifying Participants in the Personal Genome
Project by Name (2013). .26 United States. General Accounting Office. & United States. (U.S. General Accounting
Office, Washington, D.C., 2002).27 Benitez, K. & Malin, B. Evaluating re-identification risks with respect to the HIPAAprivacy rule. Journal of the American Medical Informatics Association : JAMIA 17,169-177, doi:10.1136/jamia.2009.000026 (2010).
28 Kwok, P., Davern, M., Hair, E. & Lafky, D. in NORC at The University of Chicago(Chicago 2011).
29 Bennett, R. L. et al. Recommendations for standardized human pedigree
nomenclature. Pedigree Standardization Task Force of the National Society of
Genetic Counselors. Am J Hum Genet 56, 745-752 (1995).30 Malin, B. Re-identification of familial database records. AMIA ... Annual Symposium
proceedings / AMIA Symposium. AMIA Symposium, 524-528 (2006).
31 Israel vs. Shalom Bilik, Avraham Adam, Yosef Vitman, Haim Aharon, MosheMoshkowitz and Meir Liver (In Hebrew) Verdict 24441-05-12
32 Gitschier, J. Inferential genotyping of Y chromosomes in Latter-Day Saints foundersand comparison to Utah samples in the HapMap project. Am J Hum Genet 84, 251-258, doi:S0002-9297(09)00025-1 [pii] 10.1016/j.ajhg.2009.01.018 (2009).
33 Gymrek, M., McGuire, A. L., Golan, D., Halperin, E. & Erlich, Y. Identifying personalgenomes by surname inference. Science 339, 321-324,
doi:10.1126/science.1229566 (2013).34 King, T. E. & Jobling, M. A. What's in a name? Y chromosomes, surnames and the
genetic genealogy revolution. Trends Genet 25, 351-360, doi:S0168-9525(09)00133-4 [pii] 10.1016/j.tig.2009.06.003 (2009).
35 King, T. E. & Jobling, M. A. Founders, drift, and infidelity: the relationship between Y
chromosome diversity and patrilineal surnames. Mol Biol Evol 26, 1093-1102,doi:msp022 [pii] 10.1093/molbev/msp022 (2009).
36 Motluk, A. Anonymous sperm donor traced on internet. New Sci 188, 2 (2005).
37 Stein, R. Found on the Web, With DNA: a Boy's Father. Washington Post, 1 (2005).
38 Naik, G. Family Secrets: An Adopted Man's 26-Year Quest for His Father Wall StreetJournal (2009).
39 Lehmann-Haupt, R. Are Sperm Donors Really Anonymous Anymore? Slate (2010).
40 Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profilerfor personal genomes Genome research, doi: (2012).
41 Huff, C. D. et al. Maximum-likelihood estimation of recent shared ancestry (ERSA).
Genome research 21, 768-774, doi:10.1101/gr.115972.110 (2011).
42 Henn, B. M. et al. Cryptic distant relatives are common in both isolated andcosmopolitan genetic samples. PLoS One 7, e34267,
doi:10.1371/journal.pone.0034267 (2012).
43 Lowrance, W. W. & Collins, F. S. Ethics. Identifiability in genomic research. Science317, 600-602, doi:10.1126/science.1147699 (2007).
44 Kayser, M. & de Knijff, P. Improving human forensics through advances in genetics,
genomics and molecular biology. Nat Rev Genet 12, 179-192, doi:10.1038/nrg2952
(2011).45 Silventoinen, K. et al. Heritability of adult body height: a comparative study of twin
cohorts in eight countries. Twin research : the official journal of the International
Society for Twin Studies 6, 399-408, doi:10.1375/136905203770326402 (2003).46 Kohn, L. A. P. The Role of Genetics in Craniofacial Morphology and Growth. Annu Rev
Anthropol. 20, 261-278 (1991).
22
http://dataprivacylab.org/projects/pgp/1021-1.pdf%3ehttp://dataprivacylab.org/projects/pgp/1021-1.pdf%3ehttp://dataprivacylab.org/projects/pgp/1021-1.pdf%3e -
7/27/2019 Genetic Data Protection
23/25
47 Zubakov, D. et al. Estimating human age from T-cell DNA rearrangements. Curr Biol
20, R970-971, doi:10.1016/j.cub.2010.10.022 (2010).48 Ou, X. L. et al. Predicting human age with bloodstains by sjTREC quantification. PLoS
One 7, e42412, doi:10.1371/journal.pone.0042412 (2012).
49 Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biologicalpathways affect human height. Nature 467, 832-838, doi:10.1038/nature09410
(2010).50 Manning, A. K. et al. A genome-wide approach accounting for body mass indexidentifies genetic variants influencing fasting glycemic traits and insulin resistance.Nat Genet 44, 659-669, doi:10.1038/ng.2274 (2012).
51 Liu, F. et al. A Genome-Wide Association Study Identifies Five Loci Influencing FacialMorphology in Europeans. PLoS Genet 8, e1002932,
doi:10.1371/journal.pgen.1002932 (2012).52 Walsh, S. et al. IrisPlex: a sensitive DNA tool for accurate prediction of blue and
brown eye colour in the absence of ancestry information. Forensic Sci Int Genet 5,170-180, doi:10.1016/j.fsigen.2010.02.004 (2011).
53 CIPA. Vol. DC-008-2010 (Camera & Imaging Product Association, 2010).
54 Byers, S. Information leakage caused by hidden data in published documents.Security & Privacy, IEEE 2, 23-27, doi:10.1109/MSECP.2004.1281241 (2004).
55 Kaufman, S., Rosset, S. & Perlich, C. in Proceedings of the 17th ACM SIGKDDinternational conference on Knowledge discovery and data mining 556-563 (ACM,San Diego, California, USA, 2011).
56 Acquisti, A. & Gross, R. Predicting Social Security numbers from public data. ProcNatl Acad Sci U S A 106, 10975-10980, doi:10.1073/pnas.0904891106 (2009).
57 Noumeir, R., Lemay, A. & Lina, J. M. Pseudonymization of radiology data for researchpurposes. Journal of digital imaging 20, 284-295, doi:10.1007/s10278-006-1051-4
(2007).58 Pakstis, A. J. et al. SNPs for a universal individual identification panel. Hum Genet
127, 315-324, doi:10.1007/s00439-009-0771-1 (2010).
59 Lin, Z., Owen, A. B. & Altman, R. B. Genetics. Genomic research and human subjectprivacy. Science 305, 183, doi:10.1126/science.1095019 (2004).
60 Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat
Genet 39, 1181-1186, doi:10.1038/ng1007-1181 (2007).
61 Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highlycomplex mixtures using high-density SNP genotyping microarrays. PLoS Genet 4,e1000167, doi:10.1371/journal.pgen.1000167 (2008).
62 Halperin, E. & Stephan, D. A. SNP imputation in association studies. Nature
biotechnology 27, 349-351, doi:10.1038/nbt0409-349 (2009).63 Jacobs, K. B. et al. A new statistic and its power to infer membership in a genome-
wide association study using genotype frequencies. Nat Genet 41, 1253-1257,
doi:ng.455 [pii] 10.1038/ng.455 (2009).64 Visscher, P. M. & Hill, W. G. The Limits of Individual Identification from Sample Allele
Frequencies: Theory and Statistical Analysis. PLoS Genet 5, e1000628,
doi:10.1371/journal.pgen.1000628 (2009).65 Wang, R., Li, Y. F., Wang, X., Haixu, T. & Zhou, X. in CCS09 (Chicago, IL, USA, 2009).66 Im, H. K., Gamazon, E. R., Nicolae, D. L. & Cox, N. J. On Sharing Quantitative Trait
GWAS Results in an Era of Multiple-omics Data and the Limits of Genomic Privacy.
Am J Hum Genet 90, 591-598, doi:S0002-9297(12)00093-6 [pii]10.1016/j.ajhg.2012.02.008 (2012).
67 Lumley, T. Potential for Revealing Individual-Level Information in Genome-wide
Association Studies. JAMA 303, 659, doi:10.1001/jama.2010.120 (2010).68 Craig, D. W. et al. Assessing and managing risk when sharing aggregate genetic
variant data. Nat Rev Genet 12, 730-736, doi:10.1038/nrg3067 (2011).
23
-
7/27/2019 Genetic Data Protection
24/25
69 Braun, R., Rowe, W., Schaefer, C., Zhang, J. & Buetow, K. Needles in the Haystack:
Identifying Individuals Present in Pooled Genomic Data. PLoS Genet 5, e1000668,doi:10.1371/journal.pgen.1000668 (2009).
70 Zerhouni, E. A. & Nabel, E. G. Protecting aggregate genomic data. Science 322, 44,
doi:10.1126/science.1165490 (2008).71 Johnson, A. D., Leslie, R. & O'Donnell, C. J. Temporal trends in results availability
from genome-wide association studies. PLoS Genet 7, e1002269,doi:10.1371/journal.pgen.1002269 (2011).72 Gilbert, N. Researchers criticize genetic data restrictions. Nature,
doi:10.1038/news.2008.1083 (2008).
73 Malin, B., Karp, D. & Scheuermann, R. H. Technical and policy approaches tobalancing patient privacy and data sharing in clinical and translational research.
Journal of investigative medicine : the official publication of the AmericanFederation for Clinical Research 58, 11-18, doi:10.231/JIM.0b013e3181c9b2ea
(2010).74 Clayton, D. On inferring presence of an individual in a mixture: a Bayesian approach.
Biostatistics 11, 661-673, doi:10.1093/biostatistics/kxq035 (2010).
75 Workshop on Establishing a Central Resource of Data from Genome SequencingProjects (2012)..76 Schadt, E. E., Woo, S. & Hao, K. Bayesian method to predict individual SNP genotypes
from gene expression data. Nat Genet 44, 603-608, doi:10.1038/ng.2248 (2012).
77 Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies.
Nat Rev Genet 11, 499-511, doi:10.1038/nrg2796 (2010).78 Check, E. James Watsons genome sequenced. Nature (2007).
79 Nyholt, D. R., Yu, C. E. & Visscher, P. M. On Jim Watson's APOE status: genetic
information is hard to hide. European journal of human genetics : EJHG 17, 147-149,doi:10.1038/ejhg.2008.198 (2009).
80 Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype
imputation. Nat Genet 40, 1068-1075, doi:10.1038/ng.216 (2008).81 Kaiser, J. Human genetics. Agency nixes deCODE's new data-mining plan. Science
340, 1388-1389, doi:10.1126/science.340.6139.1388 (2013).82 Bambauer, J. R. Tragedy of the Data Commons. Harvard Journal of Law and
Technology 25, doi:http://dx.doi.org/10.2139/ssrn.1789749(2011).83 Hartzog, W. & Stutzman, F. The Case for Online Obscurity. California Law Review
101, 1, doi:http://dx.doi.org/10.2139/ssrn.159774(2013).
84 Taleb, N. N. The black swan : the impact of the highly improbable. (Random House,2007).
85 Shannon, C. Communication Theory of Secrecy Systems". Bell System Technical
Journal 28, 656715 (1949).86 Cavoukian, A. Privacy by Design. (2009).
.
87 Ramos, E. M. et al. A mechanism for controlled access to GWAS data: experience ofthe GAIN Data Access Committee. Am J Hum Genet 92, 479-488,doi:10.1016/j.ajhg.2012.08.034 (2013).
88 Church, G. et al. Public access to genome-wide data: five views on balancing research
with privacy and protection. PLoS Genet 5, e1000665,doi:10.1371/journal.pgen.1000665 (2009).
89 Creating a Global Alliance to Enable Responsible Sharing of Genomic and Clincal
Data. (2013)..
90 Sandis, C. & Taleb, N. N. Skin in the Game as a Required Heuristic for Acting Under
Uncertainty. Available at SSRN 2298292 (2013).
24
http://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf%3ehttp://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf%3ehttp://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf%3ehttp://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf%3ehttp://dx.doi.org/10.2139/ssrn.1789749http://dx.doi.org/10.2139/ssrn.1789749http://dx.doi.org/10.2139/ssrn.1789749http://dx.doi.org/10.2139/ssrn.159774http://dx.doi.org/10.2139/ssrn.159774http://dx.doi.org/10.2139/ssrn.159774http://www.ipc.on.ca/images/Resources/privacybydesign.pdf%3ehttp://www.ipc.on.ca/images/Resources/privacybydesign.pdf%3ehttp://www.ipc.on.ca/images/Resources/privacybydesign.pdf%3ehttp://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf%3ehttp://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf%3ehttp://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf%3ehttp://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf%3ehttp://www.ipc.on.ca/images/Resources/privacybydesign.pdf%3ehttp://dx.doi.org/10.2139/ssrn.159774http://dx.doi.org/10.2139/ssrn.1789749http://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf%3ehttp://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf%3e -
7/27/2019 Genetic Data Protection
25/25
91 Sweeney, L. k-anonymity: a model for protecting privacy. International journal of
uncertainty, fuzziness, and knowledge-based systems 10, 557-570 (2002).92 El Emam, K. & Dankar, F. K. Protecting privacy using k-anonymity. Journal of the
American Medical Informatics Association : JAMIA 15, 627-637,
doi:10.1197/jamia.M2716 (2008).93 Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M. L-diversity. ACM
Trans. Knowl. Discov. Data 1, 3-es, doi:10.1145/1217299.1217302 (2007).94 Ninghui, L., Tiancheng, L. & Venkatasubramanian, S. in Data Engineering, 2007. ICDE2007. IEEE 23rd International Conference on. 106-115.
95 Malin, B. A. Protecting genomic sequence anonymity with generalization lattices.
Methods of information in medicine 44, 687-692 (2005).96 Dwork, C. Differential Privacy. in ICALP. 1-12 (2007).
97 Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. & Vilhuber, L. in Data Engineering,2008. ICDE 2008. IEEE 24th International Conference on. 277-286.
98 Uhler, C., Slavkovic, A. B. & Fienberg, S. E. Privacy-Preserving Data Sharing forGenome-Wide Association. CoRR abs/1205.0739 (2012).
99 Johnson, A. & Shmatikov, V. in Proceedings of the 19th ACM SIGKDD international
conference on Knowledge discovery and data mining 1079-1087 (ACM, Chicago,Illinois, USA, 2013).
100 Cao, G. in Computer Science and Computational Technology, 2008. ISCSCT '08.International Symposium on. 292-294.
101 Narayanan, A., Thiagarajan, N., Lakhani, M., Hamburg, M. & Boneh, D. (NDSS,
2011).102 Bruekers. F., Stefan, K., Klaus, K. & Pim, T. Privacy-Preserving Matching of DNA
Profiles. 2008 (2008).103 Bohannon, P., Jakobsson, M. & Srikwan, S. in Public Key Cryptography Vol. 1751
Lecture Notes in Computer Science (eds Hideki Imai & Yuliang Zheng) Ch. 25, 373-390 (Springer Berlin Heidelberg, 2000).
104 Baldi, P., Baronio, R., Cristofaro, E. D., Gasti, P. & Tsudik, G. in Proceedings of the 18th
ACM conference on Computer and communications security 691-702 (ACM,Chicago, Illinois, USA, 2011).
105 Cristofaro, E. D., Faber, S., Gasti, P. & Tsudik, G. in Proceedings of the 2012 ACM
workshop on Privacy in the electronic society 97-108 (ACM, Raleigh, North
Carolina, USA, 2012).106 Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale
genome-wide association studies. Bioinformatics 29, 886-893,
doi:10.1093/bioinformatics/btt066 (2013).
107 Ayday, E., Raisaro, J. L. & Hubaux, J. P. Privacy-Enhancing Technologies for MedicalTests Using Genomic Data. Technical Report (2013).
.
108 Narayanan, A. What Happend to the Crypto Dream? Security & Privacy, IEEE 11, 75-76 (2013).
109 Presidential Commission for the Study of Bioethical Issues, Privacy and Progress in
Whole Genome Sequencing. Privacy and Progress in Whole Genome Sequencing(2012).
110 Paillier, P. in Advances in Cryptology EUROCRYPT 99 Vol. 1592 Lecture Notes in
Computer Science (ed Jacques Stern) Ch. 16, 223-238 (Springer Berlin Heidelberg,
1999).111 Gentry, C. A fully homomorphic encryption scheme.
doi:papers2://publication/uuid/E389BFF9-B17D-45A9-BB67-0B586EE445F8
(2009).
25
http://infoscience.epfl.ch/record/182897/files/CS_version_technical_report.pdf%3ehttp://infoscience.epfl.ch/record/182897/files/CS_version_technical_report.pdf%3ehttp://infoscience.epfl.ch/record/182897/files/CS_version_technical_report.pdf%3ehttp://infoscience.epfl.ch/record/182897/files/CS_version_technical_report.pdf%3e