genetic data protection

Upload: mossad-news

Post on 14-Apr-2018


  • 7/27/2019 Genetic Data Protection

    1/25

    Review

    Routes for breaching and protecting genetic privacy

    Yaniv Erlich1,* and Arvind Narayanan2

    1 Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA USA 02142

    2 Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ USA 08540

    * Correspondence to Y.E. ([email protected])

    Abstract

    We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.

    About the Authors

    Yaniv Erlich is a Fellow at the Whitehead Institute for Biomedical Research. Erlich received his Ph.D. from Cold Spring Harbor Laboratory in 2010 and B.Sc. from Tel-Aviv University in 2006. Prior to that, Erlich worked in computer security and was responsible for conducting penetration tests on financial institutions and commercial companies. Dr. Erlich's research involves developing new algorithms for computational human genetics.

    Arvind Narayanan is an Assistant Professor in the Department of Computer Science and the Center for Information Technology and Policy at Princeton. He studies information privacy and security. His research has shown that data anonymization is broken in fundamental ways, for which he jointly received the 2008 Privacy Enhancing Technologies Award. His current research interests include building a platform for privacy-preserving data sharing.

    Summary

    Broad data dissemination is essential for advancements in genetics, but also

    brings to light concerns regarding privacy.

    Privacy breaching techniques work by cross-referencing two or more pieces of

    information to gain new, potentially undesirable knowledge on individuals or

    their families.

    Broadly speaking, the main routes to breach privacy are identity tracing, attribute

    disclosure, and completion of sensitive DNA information.

    Identity tracing exploits quasi-identifiers in the DNA data or metadata to uncover

    the identity of an unknown genetic dataset.

    Attribute disclosure techniques work on known DNA datasets. They use the

    DNA information to link the identity of a person with a sensitive phenotype.

    Completion techniques also work on known DNA data. They try to uncover

    sensitive genomic areas that were masked to protect the participant.

    In the last few years, we have witnessed a rapid growth in the range of techniques

    and tools to conduct these privacy-breaching attacks. Currently, most of the

    techniques are beyond the reach of the general public, but can be executed by

    trained persons with varying degrees of effort.

    There is considerable debate regarding risk management. One camp supports a

    pragmatic, ad-hoc approach of privacy by obscurity and the other supports a

    systematic, mathematically-backed approach of privacy by design.

    Privacy by design algorithms include access control, differential privacy, and

    cryptographic techniques. So far, data custodians of genetic databases mainly

    adopted access control as a mitigation strategy.

    New developments in cryptographic techniques may usher in an additional

    arsenal of security by design techniques.

    INTRODUCTION

    We produce genetic information for research, clinical care, and genealogy at

    exponential rates. Sequencing studies with thousands of individuals have become a reality1,2 and new projects aim to sequence hundreds of thousands to millions of individuals3. Some geneticists envision whole genome sequencing of every person as part of routine health care4,5.

    Sharing genetic findings is vital for accelerating the pace of biomedical discoveries

    and fully realizing the promises of the genetic revolution6. Recent studies suggest

    that robust predictions of genetic predispositions to complex traits from genetic

    data will require the analysis of millions of samples7,8. Clearly, collecting cohorts at such scales is typically beyond the reach of individual investigators and cannot be

    achieved without combining different sources. In addition, broad dissemination of

    genetic data promotes serendipitous discoveries through secondary analysis, which

    is necessary to maximize its utility for patients and the general public9.

    One of the key issues of broad dissemination is an adequate balance of data privacy10. Prospective participants of scientific studies have ranked privacy of sensitive information as one of their top concerns and a major determinant of whether to participate in a study11-13. Protecting personally identifiable information is also a demand of an array of regulatory statutes in the United States and the European Union14. Data de-identification, the removal of personal identifiers, has been suggested as a potential path to reconcile data sharing and privacy demands15. But is this technically feasible for genetic data?

    This review characterizes privacy breaching techniques of genetic information and

    maps potential counter-measures. We first categorize privacy-breaching strategies,

    discuss their underlying technical concepts, and evaluate their performance and

    limitations. Then, we present privacy-preserving technologies, group them according to their methodological approaches, and discuss their relevance to genetic

    information. As a general theme, we focus only on breaching techniques that involve

    data mining and fusing distinct resources to gain private information relevant to

    DNA data. Data custodians should be aware that security threats can be much

    broader. They can include cracking weak database passwords, classical computer

    hacking techniques of the server that holds the data, stealing of storage devices due

    to poor physical security, and intentional misconduct of data custodians16-18. We do

    not include these threats since they are not unique to genetic information and have been extensively studied by the computer security field19. In addition, this review does not cover the potential implications of loss of privacy, which heavily depend on cultural, legal, and socio-economic context and were covered in part by the broad privacy literature20,21.

    PRIVACY BREACHING OF GENETIC DATA

    Genetic privacy breaching techniques fall into three categories: Identity Tracing,

    Attribute Disclosure Attacks via DNA (ADAD), and Completion Techniques (Figure

    1). The shared concept of these techniques is gaining a new piece of private

    potentially sensitive information about the target or his family by exploiting DNA

    data. The three categories are distinct in the type of sensitive information that they

    reveal. The aim of identity tracing is to link an unknown genome to the concealed identity of the data originator. In ADAD, the adversary already has access

    to the identified DNA sample of the target and to a database that links DNA-derived

    data to sensitive attributes without explicit identifiers, for example a public

    database of the genetic study of drug abuse. The ADAD techniques match the DNA

    data and associate the identity of the target with the sensitive attribute. In

    completion techniques, the adversary also knows the identity of a genomic dataset

    but has access only to a sanitized version without sensitive loci. The aim here is to

    expose the sensitive loci that are not part of the original data. Table 1 summarizes

    all privacy breaching techniques that are presented in this section.

    Table 1 | Categorization of techniques for breaching genetic privacy

    Technique                               Auxiliary information

    Identity Tracing
      Surname inference                     Intermediate-Good
      DNA phenotyping                       Poor
      Demographic identifiers               Good
      Pedigree structure                    Poor
      Side channel leakage                  Varies

    Attribute Disclosure Attacks via DNA
      N=1                                   Good
      Genotype frequencies                  Good
      Linkage disequilibrium                Intermediate
      Effect sizes                          Good
      Trait inference                       Good
      Gene expression data                  Poor

    Completion Attacks
      Imputation of a masked marker         Good
      Genealogical imputation               Poor

    [The Maturation level and Technical complexity columns were shown as symbols and are not recoverable from this transcript.]

    Maturation level (four levels): working principles established with simulated data; small-scale proof of concept with real data in a controlled environment (typically only one dataset); large-scale experiments in controlled environments with real data (typically more than one dataset); breach of privacy reported in a real scenario.

    Technical complexity (four levels): no knowledge in genetics or special tools is required; requires genetic knowledge, computation can reasonably be done on a regular computer, and existing tools are available; requires genetic knowledge and intermediate-scale processing of data and/or molecular techniques; requires genetic knowledge, large-scale processing of data is a prerequisite, and may also require molecular techniques.

    Auxiliary information: this column refers to the level of existing reference databases for the US population in public resources. For identity tracing, it refers to the availability of organized lists that link identities and extract pieces of information. For ADAD and completion techniques, it refers to the existence of supporting reference datasets that are necessary to complete the attack. Poor: supporting data is highly fragmented and not amenable to searching. Intermediate: supporting data is harmonized and searchable but requires some pre-processing. Good: supporting data is ready to use with existing tools or minimal pre-processing.

    IDENTITY TRACING ATTACKS

    The goal of identity tracing attacks is to uniquely identify the data originator from the population despite the absence of explicit identifiers such as the name and exact address in the published dataset. The idea is to accumulate quasi-identifiers -- residual pieces of information that are embedded in the dataset -- and to gradually narrow down the possible individuals that match the combination of these quasi-identifiers to the point that the data originator is the only match. The success of the attack depends on the information content that the adversary can obtain from these quasi-identifiers relative to the size of the base population (Box 1).
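    The relationship between information content and population size can be sketched with the usual base-2 logarithm accounting. Box 1 is not reproduced in this transcript, so the following is only an illustration of the general idea; the population figure and the function names are assumptions chosen for the example.

```python
import math

def bits_to_identify(population_size: int) -> float:
    """Bits of information needed to single out one person among population_size."""
    return math.log2(population_size)

def remaining_candidates(population_size: int, bits_learned: float) -> float:
    """Expected size of the candidate pool after learning bits_learned bits,
    assuming the quasi-identifiers split the population evenly."""
    return population_size / (2 ** bits_learned)

US_POPULATION = 320_000_000  # assumed round figure, for illustration only

needed = bits_to_identify(US_POPULATION)
print(f"bits needed to single out one person: {needed:.1f}")
print(f"candidates left after 20 bits: {remaining_candidates(US_POPULATION, 20):.0f}")
```

    Under these assumptions, each additional bit of quasi-identifier information halves the candidate pool, which is why combining several weak identifiers can be enough to isolate a single person.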

    IDENTITY DISCLOSURE BY META-DATA

    Genetic datasets are typically published with additional metadata, such as basic

    demographic details, inclusion/exclusion criteria, pedigree structure, and health

    conditions that are critical to understand the study and for secondary analysis.

    Unrestricted demographic information conveys substantial power for identity tracing. It has been estimated that the combination of date of birth, sex, and 5-digit zip code uniquely identifies more than 60% of US individuals22,23. In addition, there are extensive public resources with broad population coverage and search interfaces that link demographic quasi-identifiers and individuals, including voter registries, public record search engines such as PeopleFinders.com, data brokers, and social media. In one of the pioneering studies of identity tracing using metadata, Sweeney reported the successful tracing of the medical record of the Governor of Massachusetts using demographic identifiers24. At that time, the Massachusetts Group Insurance Commission released hospital discharge information with five-digit zip codes, sex, and date of birth. By searching the voter registry, Sweeney was able to uniquely match the hospital discharge record of the Governor. A more recent study reported the identification of 30% of Personal Genome Project (PGP) participants by demographic profiling that included zip code and exact birthdates that are found in PGP profiles25.
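    The cross-referencing step behind this kind of demographic tracing can be sketched as a join on the quasi-identifier triple. The records, names, and field layout below are entirely hypothetical toy data, not taken from any of the cited studies.

```python
from collections import defaultdict

# Hypothetical toy data: a "de-identified" release and a public registry.
released_records = [
    {"dob": "1950-03-14", "sex": "M", "zip": "02138", "diagnosis": "condition A"},
    {"dob": "1972-11-02", "sex": "F", "zip": "02139", "diagnosis": "condition B"},
]
voter_registry = [
    {"name": "Person One", "dob": "1950-03-14", "sex": "M", "zip": "02138"},
    {"name": "Person Two", "dob": "1972-11-02", "sex": "F", "zip": "02139"},
    {"name": "Person Three", "dob": "1972-11-02", "sex": "F", "zip": "02139"},
]

QUASI_IDENTIFIERS = ("dob", "sex", "zip")

def link(records, registry, keys=QUASI_IDENTIFIERS):
    """Return {record index: name} for records matching exactly one registry entry."""
    index = defaultdict(list)
    for person in registry:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = {}
    for i, record in enumerate(records):
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # only a unique match re-identifies the record
            matches[i] = candidates[0]
    return matches

reidentified = link(released_records, voter_registry)
print(reidentified)
```

    In this toy example the first record is re-identified because its triple is unique in the registry, while the second stays ambiguous because two registry entries share the same triple.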

    Since the inception of the HIPAA Privacy Rule in 2003, demographic identifiers are the subject of tight regulation in the US health care system26. The SAFE HARBOR provision requires that the maximal resolution of any date field, including birth and hospital admissions, be in years. In addition, the maximal resolution of a geographical subdivision is the first three digits of a zip code (as long as more than 20,000 people live in the regions that correspond to the three-digit zip codes). Statistical analyses of the census data have found that the Safe Harbor provision provides reasonable immunity against identity tracing assuming that the adversary has access only to demographic identifiers. The combination of sex, age, ethnic group, and state is unique to less than 0.25% of the populations across all states27. An empirical study evaluated the re-identification of 15,000 records of Hispanic patients in the Chicago area that included year of birth, 3-digit zip code, and marital status (married/unmarried) by comparison to voting registry data28. The authors reported the correct identification of 2 out of the 15,000 records and estimated that less than 0.22% of the population is exposed with this set of quasi-identifiers. These studies show that with access to only HIPAA-redacted demographic quasi-identifiers, identity tracing is extremely hard.
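    The two coarsening rules described above (dates reduced to years, zip codes reduced to three digits subject to a population threshold) can be sketched as follows. This covers only the two fields discussed in the text, not the full provision; the record layout, the ISO date format, the `zip3_population` lookup, and the "000" placeholder for sparsely populated prefixes are assumptions made for the example.

```python
def safe_harbor_generalize(record, zip3_population, threshold=20_000):
    """Coarsen the date and ZIP fields of one record.

    record: dict with 'dob' as 'YYYY-MM-DD', plus 'sex' and 'zip' (assumed layout).
    zip3_population: hypothetical mapping from 3-digit ZIP prefix to population.
    """
    zip3 = record["zip"][:3]
    if zip3_population.get(zip3, 0) <= threshold:
        zip3 = "000"  # assumption: small regions are replaced by a placeholder
    return {
        "birth_year": record["dob"][:4],  # keep the year only
        "sex": record["sex"],
        "zip3": zip3,
    }

record = {"dob": "1950-03-14", "sex": "M", "zip": "02138"}
print(safe_harbor_generalize(record, {"021": 1_000_000}))
print(safe_harbor_generalize(record, {"021": 5_000}))
```

    The generalized record carries far fewer bits than the original triple, which is consistent with the low re-identification rates reported above.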

    Pedigree structures are another type of metadata included in many genetic

    studies. These structures contain rich information, especially when large kinships

    are available29. The number of offspring, their birth order, and other familial events

    such as remarriage, create unique combinations of quasi-identifiers that quickly

    narrow down the search space. A systematic study analyzed the distribution of

    2,500 two-generation family pedigrees that were sampled from obituaries of a town

    of 60,000 individuals30. The pedigrees were unsorted, meaning that only the number

    of male and female individuals in each generation was available. Despite this limited information, about 30% of the pedigree structures were unique, demonstrating the large information content that can be obtained from such data. Another feature of pedigrees for identity tracing is the combination of quasi-identifiers across records. For example, it is quite rare that a surname alone can identify an individual. However, the surname combination of a couple prior to their marriage is an

    extremely strong identifier. In addition, once a single individual in a pedigree is

    identified, it is easy to link the identities of the other relatives and their genetic

    datasets. The main limitation of identity tracing using pedigree structures is their

    low searchability. Perhaps one notable exception is Israel, where the entire population registry was leaked to the web in 2006 and allows the construction of multi-generation family trees of all Israeli citizens31. But in general, due to their low searchability, the value of family trees for re-identification is mostly limited to manual verification of the potential identity of the target rather than a starting point of the process.

    IDENTITY TRACING BY GENEALOGICAL TRIANGULATION

    Genetic genealogy attracts millions of individuals interested in their ancestry and

    discovering distant relatives. To that end, the community has developed impressive online platforms to search for genetic matches and connect individuals. These online

    resources can be exploited to triangulate the identity of an unknown genome.

    One potential route of identity tracing is surname inference from Y-chromosome data32,33. In most societies, surnames are passed from father to son, creating a transient correlation with specific Y chromosome HAPLOTYPES34,35. The adversary can take advantage of the Y chromosome-surname correlation and compare the Y haplotype of the unknown genome to haplotype records in recreational genetic genealogy databases. A close match with a relatively short time to the most recent common ancestor (MRCA) would signal that the unknown genome likely has the same surname as the record in the database.
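    The haplotype comparison described above can be illustrated with a toy search over Y-STR profiles. This is a simplified sketch: the marker names are real Y-STR loci, but the repeat counts, surnames, and the `max_distance` cutoff are hypothetical, and the summed repeat-count difference is used here only as a crude stand-in for a proper time-to-MRCA estimate based on a mutation model.

```python
# Hypothetical Y-STR profiles: marker name -> repeat count.
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}

database = [
    {"surname": "Alpha",
     "profile": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}},
    {"surname": "Beta",
     "profile": {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS392": 11, "DYS393": 13}},
    {"surname": "Gamma",
     "profile": {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13, "DYS393": 13}},
]

def repeat_distance(a, b):
    """Sum of absolute repeat-count differences over the markers both profiles share."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared), len(shared)

def rank_surnames(query, database, max_distance=1):
    """Return (distance, surname) pairs for close matches, closest first."""
    hits = []
    for record in database:
        distance, n_shared = repeat_distance(query, record["profile"])
        if n_shared and distance <= max_distance:
            hits.append((distance, record["surname"]))
    return sorted(hits)

candidate_surnames = rank_surnames(query, database)
print(candidate_surnames)
```

    A smaller distance loosely corresponds to a more recent common patrilineal ancestor, so the top-ranked surnames are the ones an adversary would try first.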

    The power of surname inference stems from exploiting information from distant

    patrilineal relatives of the unknown genome. The association between surnames and Y-chromosomes usually spans dozens of generations, implying that every record in a genealogical database is capable of revealing the surnames of hundreds to thousands of males. A recent empirical study estimated that 10-14% of US Caucasian males from the middle and upper classes are subject to surname inference based on scanning the two largest Y-chromosome genealogical websites with a built-in search engine33.

    An inferred surname has tremendous power for identity tracing. Individual

    surnames are relatively rare in the population and in most cases a single surname is

    shared by fewer than 40,000 US males33, which is equivalent to 12 bits of information. In terms of identification, successful surname recovery is very close to determining an individual's zip code. Another feature of surname inference is that surnames are highly searchable. From public record search engines to social networks, numerous online resources offer surname query interfaces, simplifying the adversary's efforts to complete the triangulation.

    Surname inference has been utilized to breach genetic privacy in the past36-39. Several sperm-donor-conceived individuals and adoptees successfully used this technique on their own DNA to reveal the surnames of their ancestors, which eventually led to the exposure of their biological families. This technique could also

    be applied to whole genome sequencing datasets. A recent study reported five

    successful surname inferences from Illumina datasets of three large families that

    were part of the 1000 Genomes project, which eventually exposed the identity of

    close to fifty research participants33.
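    The 12-bit figure quoted earlier in this section for an inferred surname can be checked with a short calculation. The size of the US male population used here is an assumed round number and is not taken from the source.

```python
import math

US_MALES = 160_000_000      # assumed round figure for the US male population
SURNAME_CARRIERS = 40_000   # upper bound on carriers of a single surname, as quoted in the text

# Learning the surname shrinks the pool from US_MALES to SURNAME_CARRIERS candidates.
surname_bits = math.log2(US_MALES / SURNAME_CARRIERS)
print(f"information from a surname: {surname_bits:.2f} bits")
```

    With these inputs the result is just under 12 bits, in line with the figure in the text; a rarer surname would convey more.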

    The main limitation of surname inference is that haplotype matching relies on

    comparing Y chromosome Short Tandem Repeats (Y-STRs). Currently, most

    sequencing studies do not routinely report these markers and the adversary would

    have to process large-scale raw sequencing files with a specialized tool, which is both time and resource consuming and requires bioinformatics experience40. Another complication is false identification of surnames and inference of surnames with spelling variants compared to the original surname. Eliminating incorrect surname hits necessitates access to additional quasi-identifiers such as pedigree structure and typically requires a few hours of manual work. Finally, the performance of surname inference varies between different socio-ethnic groups based on non-paternity rates, sociological norms of surname inheritance, and access

    of the group to recreational genealogy.

    An open research question is the utility of non-Y-chromosome markers for genealogical triangulation. Websites such as Mitosearch.org and GedMatch.com run open searchable databases for matching mitochondrial and autosomal genotypes, respectively. Our expectation is that mitochondrial data will not be very informative for tracing identities. The resolution of mitochondrial searches is much lower due to the smaller size of the mitochondrial genome and the absence of highly polymorphic markers like Y-STRs, meaning that a large number of individuals would share the same mitochondrial haplotypes. In addition, most human societies do not use maternally inherited identifiers, reducing the utility of such searches. Autosomal searches on the other hand might be quite powerful. Genetic genealogy companies have started to market services for dense genome-wide arrays that provide sufficient accuracy to identify distant relatives on the order of 3rd to 4th cousins41. These hits would reduce the

    search space to no more than a few thousand individuals42. The main challenge of

    this approach would be translating the genealogical match to a list of potential

    people. But with the growing interest in genealogy, this technique might be easier in

    the future and should be taken into consideration.

    IDENTITY TRACING BY PHENOTYPIC PREDICTION

    Several reports on genetic privacy envisioned that phenotypic predictions from

    genetic data could serve as quasi-identifiers for identity tracing43,44. Twin studies have estimated high heritabilities for various visible traits such as height45 and facial

    morphology46. In addition, recent studies showed that age prediction is possible

    from DNA specimens derived from blood samples47,48. But the applicability of these

    DNA-derived quasi-identifiers for identity tracing has yet to be demonstrated.

    The major limitation of phenotypic prediction is the fast decay of the identification power with small inference errors (Box 1). Current genetic knowledge explains only

    a small extent of the phenotypic variability of most visible traits, such as height49,

    BMI50, and face morphology51, significantly limiting their utility for identification.

    For example, perfect knowledge about height at one-centimeter resolution conveys

    5 bits of information. However, with current genetic knowledge that explains 10%

    of height variability49, the adversary learns only 0.15 bits of information. Predictions

    of most of the other visible traits are even worse, implying that their utility as quasi-identifiers would be quite low. The exceptions in visible traits are eye color52 and age prediction47. Recent studies reported predictions that account for 75%-90% of the phenotypic variability of these traits. But even these successes translate to approximately 3-4 bits of information. Another challenge for phenotypic prediction is the low searchability of most of these traits. There are no population-based registries of height, eye color, or face morphology and the adversary would have to invest substantial efforts to compile such a registry. However, with the advent of new types of social media, this barrier might be less significant in the future.
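    The height example earlier in this section can be approximated with a simple normal model. This is a rough sketch and not the calculation from Box 1, which is not reproduced in this transcript: the 7 cm standard deviation and the joint-normal assumption are choices made for the example. Under them, perfect knowledge at one-centimeter resolution gives close to 5 bits, while a predictor explaining 10% of the variance gives under 0.1 bits, the same order of magnitude as (though not identical to) the 0.15 bits quoted in the text.

```python
import math

def discretized_gaussian_bits(sd: float, resolution: float) -> float:
    """Approximate entropy in bits of a normal trait measured at a given resolution."""
    return 0.5 * math.log2(2 * math.pi * math.e * sd ** 2) - math.log2(resolution)

def bits_from_predictor(variance_explained: float) -> float:
    """Mutual information in bits between a trait and a predictor that explains
    a fraction variance_explained of its variance, under a joint-normal model."""
    return -0.5 * math.log2(1.0 - variance_explained)

HEIGHT_SD_CM = 7.0  # assumed standard deviation of height within one sex

perfect_bits = discretized_gaussian_bits(HEIGHT_SD_CM, 1.0)
predicted_bits = bits_from_predictor(0.10)
print(f"perfect height knowledge at 1 cm: {perfect_bits:.2f} bits")
print(f"predictor explaining 10% of variance: {predicted_bits:.3f} bits")
```

    The steep drop between the two numbers illustrates the fast decay of identification power that the text attributes to inference errors.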

    IDENTITY TRACING BY SIDE-CHANNEL LEAKS

    Side channel attacks exploit quasi-identifiers that are unintentionally encoded in the

    database building blocks and structure rather than the actual data that is meant to

    be public. A good example of such leaks is the exposure of the full names of PGP participants from 23andMe filenames25. The PGP allowed participants to upload 23andMe genotyping files to their public profile webpages. The default naming convention of these 23andMe files includes the first and last name of the user. As part of the upload process, the PGP website automatically compressed the file, named it with the PGP identifier of the user, and presented a link that showed the new file name that does not include the first and last names. However, after downloading and decompressing the 23andMe file, the original filename appeared. Since most of the users did not change the default naming convention, it was possible to trace the identity of a large number of PGP profiles. Based on this experience, the PGP now forces participants to rename files before uploading and warns them that the file

    may contain hidden information that can expose their identities.
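    The filename leak described above can be reproduced in miniature. The sketch below assumes the compressed container is a ZIP archive (the text only says the file was compressed), and both file names are hypothetical. It shows that renaming the archive does not change the member name stored inside it.

```python
import io
import zipfile

original_name = "genome_Jane_Doe_Full.txt"  # hypothetical export name containing a real name
public_archive_name = "hu000000.zip"        # hypothetical pseudonymous name shown on the website

# Build the archive in memory; the member keeps its original filename.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(original_name, "rsid\tchromosome\tposition\tgenotype\n")

# Whoever downloads the archive sees the stored member names on extraction.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    leaked_names = archive.namelist()

print(public_archive_name, "contains", leaked_names)
```

    Renaming the inner file before compression, as the PGP now requires, removes this particular channel.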

    Rich data files embed multiple layers of hidden information that provide ample opportunities for leakage of quasi-identifiers. Photo files typically embed Exchangeable Image File Format (EXIF) fields that can include GPS data about the location of the photo or the serial number of the camera53. This information could convey potential leads even if the photo itself does not disclose any sensitive information. Microsoft Office products typically embed the author name and contain

    previous revisions of the document that show deleted text54. In general, flat text files

    are the most immune format to these types of leaks of unintentional content.

    The mechanism to generate database accession numbers can also leak personal

    information. Ideally, these numbers should be completely random but experience

    has highlighted that sometimes these numbers unintentionally reveal residual

    information due to non-random assignments. For example, in several top medical

    data mining contests, the accession numbers unintentionally revealed the disease

    status of the patient, which was the aim of the contest55. Another example is the

    non-random assignment of Social Security Numbers (SSN) in the US. Pattern

    analysis of a large amount of public data revealed temporal and spatial

    commonalities in the assignment system that allowed predictions of the SSN from

    quasi-identifiers56. Some suggested the assignment of accession numbers by applying CRYPTOGRAPHIC HASHING to participant identifiers such as name or social security number57. However, this technique is extremely vulnerable to DICTIONARY ATTACKS due to the relatively small search space of the input. In

    general, it is advisable to add some sort of randomization to procedures that

    generate accession numbers in order to prevent misuse.
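    The weakness of hashed identifiers, and the randomized alternative, can be illustrated with a toy example. The 4-digit identifier space below is a deliberately tiny stand-in (a nine-digit number such as an SSN has at most a billion values, which is still small enough to enumerate); the function names and the choice of SHA-256 are assumptions for the sketch.

```python
import hashlib
import secrets

def hashed_accession(identifier: str) -> str:
    """Accession number derived by hashing the identifier with no secret input."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

def dictionary_attack(target_hash: str, candidates):
    """Hash every candidate identifier and return the one that matches, if any."""
    for candidate in candidates:
        if hashed_accession(candidate) == target_hash:
            return candidate
    return None

def random_accession(n_bytes: int = 16) -> str:
    """Accession number drawn at random, so it is independent of the identifier."""
    return secrets.token_hex(n_bytes)

# Toy identifier space: all 4-digit strings.
published = hashed_accession("4821")
recovered = dictionary_attack(published, (f"{i:04d}" for i in range(10_000)))
print("recovered identifier:", recovered)
print("random accession:", random_accession())
```

    A random accession number requires the custodian to keep a private lookup table, but it leaks nothing about the participant to someone who only sees the published number.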

    ATTRIBUTE DISCLOSURE ATTACKS VIA DNA (ADAD)

    In ADAD, the adversary creates a statistical bridge that uses DNA data to link sensitive attributes with the identity of a person. The first piece of information is a DNA sample from an identified target. This can be achieved by successful

    completion of an identity tracing attack, exploiting identified DNA data in projects

    such as OpenSNP, gaining internal access to restricted databases, or simply by

    obtaining a DNA sample directly from the target. The second piece of information is

    DNA derived data that is associated with sensitive information, such as disease,

    personality traits, or socio-economic status, which does not otherwise contain

    explicit identifiers. The main difference between the ADAD attacks is the type of

    DNA derived data that is associated with the sensitive attribute.

    ADAD: THE N=1 SCENARIO

    The simplest scenario of ADAD is when the sensitive attribute is associated with the

    genotype data of the individual. The adversary can simply match the genotype data

    that is associated with the identity of the individual and the genotype data that is

    associated with the attribute. Such an attack requires only a small number of

    autosomal SNPs. Empirical data showed that a carefully chosen set of 45 SNPs is sufficient to provide matches with a TYPE I ERROR of 10^-15 for most of the major

    populations across the globe58. Moreover, it is expected that random subsets of

    approximately 300 common SNPs would yield sufficient information to uniquely

    match any person59.
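    A back-of-the-envelope calculation shows why so few SNPs suffice. The sketch below makes idealized assumptions that are not taken from the cited studies: biallelic SNPs in Hardy-Weinberg equilibrium, independence between SNPs, unrelated individuals, and a minor allele frequency of 0.5 for every marker. The published 10^-15 figure reflects real marker panels and populations, so this only shows that the order of magnitude is plausible.

```python
def genotype_match_probability(maf: float) -> float:
    """Probability that two unrelated people share the same genotype at one SNP,
    assuming Hardy-Weinberg proportions."""
    p, q = maf, 1.0 - maf
    genotype_freqs = (q * q, 2 * p * q, p * p)
    return sum(f * f for f in genotype_freqs)

def panel_match_probability(mafs) -> float:
    """Probability of a chance match across all SNPs, assuming independence."""
    probability = 1.0
    for maf in mafs:
        probability *= genotype_match_probability(maf)
    return probability

panel_45 = panel_match_probability([0.5] * 45)
print(f"chance match over 45 idealized SNPs: {panel_45:.1e}")
```

    Each additional independent SNP multiplies the chance-match probability by a factor below one, so the probability shrinks geometrically with panel size.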

    With the low number of SNPs required for matching, individual-level genotype-phenotype records in genome-wide association studies (GWAS) are highly vulnerable to ADAD. In order to address this issue, several organizations, including the NIH, adopted a two-tier access system for GWAS datasets: a restricted access area that stores individual-level genotypes and phenotypes and a public access area for high-level summary statistics of allele frequencies of all cases and controls60. The premise of this distinction was that summary statistics enable secondary data usage for meta-GWAS analysis, while this type of data was thought to be immune to ADAD.

    ADAD: THE SUMMARY STATISTIC SCENARIO

    A landmark work by Homer et al. reported the possibility of ADAD on GWAS datasets that consist only of the allele frequencies of the study participants61. The underlying concept of their approach is that, with the target genotypes in the case group, the average allele frequencies will be positively biased towards the target genotypes compared to the estimated minor allele frequencies (MAF) from the general population. Conversely, when the target is not part of the study, the average allele frequencies will be

    negatively biased compared to the target genotypes. A good illustration of this concept is considering an extremely rare variation in the subject's genome. Non-zero allele frequency of this variation in a small-scale study increases the likelihood that the target was part of the study, whereas zero allele frequency strongly reduces this likelihood. Homer et al. showed that by integrating the slight biases in the allele frequencies over a large number of SNPs it is also possible to conduct ADAD with the common variations that are analyzed in GWAS.
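    The bias-integration idea can be demonstrated on simulated data. The sketch below uses a per-SNP distance difference, |y - reference| - |y - study|, summed over SNPs, which is the form in which the Homer et al. statistic is commonly stated; it omits the significance test used in practice. All genotypes are simulated, and the cohort size, SNP count, and frequency range are arbitrary choices for the example.

```python
import random

def simulate_genotype(freqs, rng):
    """One diploid individual; each value is the allele dosage scaled to 0, 0.5 or 1."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]

def homer_statistic(target, study_freqs, reference_freqs):
    """Sum over SNPs of |y - reference| - |y - study|; larger values suggest membership."""
    return sum(abs(y - r) - abs(y - m)
               for y, m, r in zip(target, study_freqs, reference_freqs))

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_snps, n_participants = 5000, 50

# Allele frequencies in the general (reference) population.
reference = [rng.uniform(0.1, 0.5) for _ in range(n_snps)]

# The study cohort and its published summary statistic (mean allele frequency per SNP).
cohort = [simulate_genotype(reference, rng) for _ in range(n_participants)]
study = [sum(column) / n_participants for column in zip(*cohort)]

member = cohort[0]                              # a person who took part in the study
outsider = simulate_genotype(reference, rng)    # a person who did not

score_member = homer_statistic(member, study, reference)
score_outsider = homer_statistic(outsider, study, reference)
print(f"member: {score_member:.1f}, outsider: {score_outsider:.1f}")
```

    Each SNP contributes only a tiny shift for the member, but the shifts accumulate across thousands of SNPs while the noise grows more slowly, which separates the two scores.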

    Subsequent studies extended the range of exploitations for summary statistics. One line of studies improved the test statistic in the original Homer et al. work and analyzed its mathematical properties62-64. Under the assumption of common SNPs in LINKAGE EQUILIBRIUM, their improved test statistic is mathematically guaranteed to yield maximal POWER for any SPECIFICITY level. Wang et al. went beyond allele frequencies and demonstrated that it is possible to exploit local LD structures for ADAD65. Their test statistic scores the co-appearance of two SNP alleles in the target genome with the bias of LD structure in a GWAS study versus the general population. The power of this approach stems from scavenging for the co-occurrence of two mildly uncommon alleles in different haplotype blocks that together create a rare event. They reported a power of 80% and specificity of 95% for ADAD on a GWAS with 200 samples that exploited the LD structure of 174 common SNPs in the FGFR2 locus. With the same number of SNPs, ADAD methods

    that use only allele frequencies yield an expected power of 24% for the same

    specificity level under the most optimal scenario. Im et al. developed a method to

    exploit the EFFECT SIZES of GWAS studies of quantitative traits to detect the

    presence of the target66. Different from ADAD with allele frequencies, the detection

    performance is better for participants with extreme phenotypes and worse for

    participants with average phenotypes. A powerful development of this approach is

    exploiting GWAS studies that utilize the same cohort for multiple phenotypes. The

    adversary repeats the identification process of the target with the effect sizes of

    each phenotype and integrates them to boost the identification performance. After

    determining the presence of the target in a quantitative trait study, the adversary can further exploit the GWAS data to predict the phenotypes with high accuracy67.

    This method works by simply correlating the DNA of the target with the effect sizes

    and takes advantage of the spurious associations when regressing a large number

    on markers with a single phenotype.
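
    A minimal sketch of the likelihood-ratio idea behind such membership tests may
    make the discussion more concrete. It assumes independent SNPs in Hardy-Weinberg
    proportions; the function names, the toy frequencies, and the clipping constant are
    illustrative assumptions, and this is not the exact statistic of the cited studies.

```python
import math


def genotype_log_prob(g, f):
    """Log-probability of genotype g (0, 1 or 2 copies of an allele) under
    Hardy-Weinberg proportions with allele frequency f."""
    f = min(max(f, 1e-6), 1 - 1e-6)  # clip to avoid log(0); an assumption
    coeff = (1, 2, 1)[g]
    return math.log(coeff) + g * math.log(f) + (2 - g) * math.log(1 - f)


def membership_llr(genotypes, study_freqs, pop_freqs):
    """Sum of per-SNP log-likelihood ratios (study vs. general population),
    assuming the SNPs are independent (linkage equilibrium)."""
    return sum(
        genotype_log_prob(g, fs) - genotype_log_prob(g, fp)
        for g, fs, fp in zip(genotypes, study_freqs, pop_freqs)
    )


# Toy example: the target carries the alleles that are enriched in the study.
target = [2, 0, 2]
study_freqs = [0.6, 0.2, 0.7]   # published summary statistics (hypothetical)
pop_freqs = [0.5, 0.3, 0.5]     # reference population (hypothetical)
score = membership_llr(target, study_freqs, pop_freqs)
print(f"log-likelihood ratio: {score:.3f}")  # positive values favour membership
```

    A positive score means that the study frequencies explain the target's genotypes
    better than the reference frequencies do. An adversary would compare the score with
    a threshold chosen for the desired specificity.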

    The theoretical performance of ADAD is a complex function of the size of the study
    and the general population68,69. On one hand, in any of the techniques above, studies
    with smaller numbers of participants generate more apparent biases in their
    summary statistics, which increases the power and specificity of the ADAD
    discrimination (Figure 2A). On the other hand, a target drawn randomly from the
    general population has a lower a priori probability of having participated in a study
    with a smaller number of participants. This means that ADAD on smaller studies
    needs to work with higher specificity to achieve the same PRECISION as larger
    studies, reducing the power of the attack and the number of people at risk (Figure
    2B). In any case, the performance and risk increase when the base population is
    smaller, such as the Amish or Hutterite populations, or when the meta-information
    enables stratification of the general population (Figure 2C).
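
    The interplay between power, specificity, and the a priori participation probability
    can be illustrated with Bayes' rule. The sketch below is an assumption-laden toy: it
    treats the prior as study size divided by base population size, and the function name
    and example numbers are hypothetical.

```python
def attack_precision(power, specificity, study_size, population_size):
    """Probability that a person flagged by the attack really is in the study."""
    prior = study_size / population_size          # a priori participation probability
    true_positives = power * prior
    false_positives = (1 - specificity) * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Same test quality, two different base populations (hypothetical numbers).
large_base = attack_precision(0.8, 0.95, study_size=200, population_size=1_000_000)
small_base = attack_precision(0.8, 0.95, study_size=200, population_size=2_000)
print(f"{large_base:.4f}")  # about 0.003: almost every flagged person is a false alarm
print(f"{small_base:.2f}")  # about 0.64: a small base population raises the risk
```

    With a fixed power and specificity, shrinking the base population is what moves the
    precision, which is the effect described for isolated populations.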

    The actual risk of ADAD on summary data has been the subject of debate. Following

    the original Homer et al. study, the NIH and other data custodians moved their

    GWAS summary statistic data from public databases to access-controlled databases
    such as dbGaP70. A retrospective analysis found that significantly fewer GWAS
    studies publicly released their summary statistic data71. Most of the studies publish
    summary statistic data on 10-500 SNPs, which is compatible with one suggested
    guideline to manage risk67. Some warned that these policies are too harsh72. There
    are several practical complications that the adversary needs to overcome to launch
    a successful attack, such as access to the target's DNA data73, access to a large
    reference database to assess the general population frequency data, and accurate
    matching of the ancestry of the target to those listed in the reference
    database74. Failure to address any of these prerequisites can severely impact the
    performance of the ADAD. In addition, for a range of GWAS studies, the associated
    attributes are not sensitive or private (e.g. height). Thus, even if ADAD occurs, the
    impact on the participant should be minimal. A recent NIH workshop proposed
    releasing summary statistics as the default policy and developing an exemption
    mechanism for studies with increased risk due to the sensitivity of the attribute or
    the vulnerability level of the summary data75.

    ADAD: THE EXPRESSION DATA SCENARIO

    Public databases such as GEO hold hundreds of thousands of gene expression
    profiles of individuals that are linked to a range of medical attributes. Schadt et al.
    proposed a potential route to exploit these profiles for ADAD76. The method starts
    with a training step that employs a standard EXPRESSION QUANTITATIVE TRAIT
    LOCI (eQTL) analysis with a reference dataset. The goal of this step is to identify
    several hundred strong eQTLs and to learn the distributions of the expression level
    for each genotype. Then, the algorithm scans the public expression profiles and
    calculates the probability distributions of the genotypes of the eQTLs. Last, the
    algorithm matches the target's genotype with the inferred allelic distributions of
    each expression profile and tests the hypothesis that the match is random. If the null
    hypothesis is rejected, the algorithm links the identity of the target to the medical
    attribute in the gene expression experiment. This ADAD technique has the potential
    for relatively high accuracy in ideal conditions. The method perfectly matched 580
    individuals with their expression profiles when the training was conducted on a
    distinct dataset. Based on large-scale simulations, they further predicted that the
    method can reach a type I error of 1x10^-5 with a power of 85% when tested on an
    expression database using the entire US population.
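
    The genotype-inference and matching steps can be sketched roughly as follows. This
    is a simplified illustration, not the published method: it assumes that expression at
    each eQTL is Gaussian within a genotype class, that eQTLs are independent, and that
    the per-genotype means, the standard deviation, and the genotype priors were already
    learned in the training step. All names and numbers are hypothetical, and the final
    hypothesis test is omitted.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def genotype_posterior(expr, means, sd, priors):
    """P(genotype | expression) for genotypes 0/1/2 at one eQTL (Bayes' rule)."""
    weights = [p * normal_pdf(expr, m, sd) for p, m in zip(priors, means)]
    total = sum(weights)
    return [w / total for w in weights]


def match_score(target_genotypes, profile, eqtl_models):
    """Log-probability of the target's genotypes given one expression profile."""
    score = 0.0
    for g, expr, (means, sd, priors) in zip(target_genotypes, profile, eqtl_models):
        score += math.log(genotype_posterior(expr, means, sd, priors)[g])
    return score


# Three toy eQTLs sharing one trained model: (means per genotype, sd, priors).
models = [((0.0, 1.0, 2.0), 0.3, (0.25, 0.5, 0.25))] * 3
target = [0, 2, 1]                      # the target's known genotypes
profiles = {"profile_a": [0.1, 1.9, 1.0], "profile_b": [2.0, 0.0, 0.1]}
best = max(profiles, key=lambda name: match_score(target, profiles[name], models))
print(best)  # the profile whose expression levels best fit the target's genotypes
```

    In a full attack the best score would still have to be tested against the null
    hypothesis of a random match before any link is declared.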

    There are several practical limitations to ADAD via expression data. While the
    training and inference steps are capable of working with expression profiles
    from different tissues, the method reaches its maximal power when the training and
    inference utilize eQTLs from the same tissue. Moreover, there is a significant loss of
    accuracy when the expression data in the training phase is collected using a
    different technology than the expression data in the inference phase. Another
    complication is that in order to fully execute the technique on a large database such
    as GEO, the adversary will need to manage and process large-scale expression data.
    Due to these practical barriers, the NIH did not issue any changes to their policies
    regarding sharing expression data from human subjects.

    COMPLETION ATTACKS

    Completion of genetic information from partial data is a well-studied task in genetic
    studies, called genotype imputation77. This method takes advantage of the LINKAGE
    DISEQUILIBRIUM between markers and uses reference panels with complete
    genetic information to restore missing genotype values in the data of interest. The
    very same strategies enable the adversary to expose certain regions of interest
    where only partial access to the DNA data is available. One publicized case of a
    completion attack was the inference of Jim Watson's risk for Alzheimer's disease.
    Watson opted to publish his entire identified genome sequence except data from his
    ApoE gene, which is associated with Alzheimer's disease78. Nyholt et al. restored the
    ApoE status using imputation with markers that are 15Kb away from the masked
    site79. As a result of the study, a 2Mb segment around the ApoE gene was removed
    from Watson's published genome.
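
    A toy illustration of why masking a single site offers little protection: when
    flanking markers are in linkage disequilibrium with the masked site, reference
    haplotypes that agree with the target at the flanking markers vote on the hidden
    allele. Real imputation tools use probabilistic haplotype models; the exact-match
    voting, the function name, and the data below are simplifying assumptions.

```python
from collections import Counter


def impute_masked_allele(reference_haplotypes, observed, masked_index):
    """Guess the allele at masked_index from reference haplotypes that match
    the target at every observed marker.

    observed maps marker index -> allele. Returns (allele, support fraction).
    """
    matches = [
        hap for hap in reference_haplotypes
        if all(hap[i] == allele for i, allele in observed.items())
    ]
    if not matches:
        return None, 0.0
    counts = Counter(hap[masked_index] for hap in matches)
    allele, n = counts.most_common(1)[0]
    return allele, n / len(matches)


# Hypothetical reference panel of three-marker haplotypes; index 1 is masked.
reference = ["AGT", "AGT", "ACT", "GCA", "GCA"]
print(impute_masked_allele(reference, {0: "G", 2: "A"}, masked_index=1))
```

    The support fraction gives a crude confidence measure; strong LD pushes it towards
    1, which is the situation exploited in the case described above.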

    In some cases, completion techniques also enable the prediction of genomic
    sequences when there is no access to the DNA of the target. This technique is
    possible when the reference panel is combined with genealogical information80. The
    algorithm finds relatives of the target who donated their DNA to the reference panel
    and who reside on a unique path that includes the target, for example a pair of
    half-first cousins when the target is their grandfather. A shared DNA segment between
    the relatives indicates that the target had the same segment. By scanning more pairs
    of relatives that are connected through the target, it is possible to infer the two
    copies of autosomal loci and collect more genomic information on the target without
    any access to the target's DNA. Building on the deep genealogical records in Iceland, deCode
    Genetics was able to leverage their large reference panel to infer genetic variants of
    an additional 200,000 living individuals who never donated their DNA to the
    company. While this technique is mostly relevant to targets with a large number of
    descendants and can be executed in only a narrow range of scenarios, it emphasizes
    the complexities of genetic privacy. In May 2013, Iceland's Data Protection
    Authority prohibited the use of this technique until consent can be obtained from

    the individuals who are not part of the original reference panel81.

    MITIGATION TECHNIQUES

    Most of the genetic privacy breaches presented above are quite sophisticated. They
    require a background in genetics and statistics and, importantly, a motivated
    adversary. One school of thought posits that these practical complexities almost
    eliminate the probability of an adverse event and therefore attenuate the risk to
    negligible levels for most studies82,83. According to this approach, an appropriate
    mitigation strategy is just removing very obvious identifiers from the datasets
    before publicly sharing the information. In the field of computer security, this risk
    management strategy is called security by obscurity. This approach is simple to
    implement and poses minimal burden on data dissemination. The opponents of
    security by obscurity posit that risk management schemes based on the probability
    of an adverse event are fragile and short lasting. According to their views,
    technologies only get better and what is technically challenging but possible today
    will be much easier in the future. Therefore, the probabilities of adverse events are
    non-computable and irrelevant84. Known in cryptography as Shannon's maxim85,
    this school of thought assumes that the adversary exists and is equipped with the
    knowledge and means to execute the breach. Robust data protection, therefore, is
    achieved by explicit design of the data access protocol rather than by the actual
    chance of a breach86. This section surveys the main security by design schemes and
    their relevance to protecting genetic data.

    ACCESS CONTROL

    One approach to mitigate the chance of a privacy breach is to place the sensitive
    data in a secure location and screen the legitimacy of the applicants and their
    research projects. Once approval is granted, the applicants are allowed to download
    the data under the conditions that they will store it in a secure location and will not
    attempt to identify individuals. In addition, the applicants should be required to file
    periodic reports about the data usage and any adverse events. This approach is the
    cornerstone of the access-controlled dbGaP60. Based on periodic reports of the
    users, a retrospective analysis of dbGaP access control has identified 8 data
    management incidents in close to 750 studies, mostly non-adherence to the
    technical regulations, and no reports of breaching the privacy of participants87.
    Despite the absence of privacy breaches thus far, some have argued that
    access control creates an illusion of security88. Once the data is in the hands of the
    applicant, there is no real oversight of how it is stored, what work is actually done
    with it, and what exactly is published. To address these limitations, an alternative approach is
    the trust-but-verify model, where the user cannot download the raw data but may
    execute certain types of queries that are recorded and monitored by the system89.
    Supporters of this model state that monitoring has the potential to deter malicious
    users from accessing the data and facilitates early detection. Another development
    based on this approach is requiring the users and data custodians to have skin in
    the game90, by adding penalties beyond denying access to the resource in case of
    misuse. The main downside of access control is that all of the models listed above
    require constant management of the resource and create an administrative burden for
    both data custodians and users.
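
    The trust-but-verify idea can be pictured as a thin gateway that answers only
    aggregate queries and records every request for later review. The sketch below is
    purely illustrative: the class, its methods, and the data layout are hypothetical and
    do not describe any existing system.

```python
import time


class AuditedDataset:
    """Holds genotypes privately and exposes only logged aggregate queries."""

    def __init__(self, genotypes):
        # {sample_id: {snp_id: 0, 1 or 2}}; never returned to the user directly
        self._genotypes = genotypes
        self.audit_log = []

    def allele_frequency(self, user, snp_id):
        # Record who asked for what before answering.
        self.audit_log.append((time.time(), user, "allele_frequency", snp_id))
        calls = [g[snp_id] for g in self._genotypes.values() if snp_id in g]
        return sum(calls) / (2 * len(calls)) if calls else None

    def queries_by(self, user):
        """Let the data custodian review everything a given user has asked."""
        return [entry for entry in self.audit_log if entry[1] == user]


dataset = AuditedDataset({"s1": {"rs1": 2}, "s2": {"rs1": 0}, "s3": {"rs1": 1}})
print(dataset.allele_frequency("alice", "rs1"))  # aggregate answer only
print(len(dataset.queries_by("alice")))          # the query has been recorded
```

    Such a log is what enables monitoring and early detection, but it is also part of the
    ongoing management cost noted above.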

    DATA ANONYMIZATION AND AGGREGATION

    The premise of anonymity is the ability to be lost in the crowd. One line of studies
    suggested restoring anonymity by restricting the granularity of the quasi-identifiers
    to the point that each record in the database is not unique. A popular heuristic is
    k-anonymity91. Using this approach, the quasi-identifiers are binned such that each
    subject's record is identical to at least k-1 records from other individuals in
    the dataset. To maximize the utility of the data for subsequent analysis, the binning
    process is adaptive. Certain records will have a lower resolution depending on the
    distribution of the other records, and certain data categories that are too distinctive are
    suppressed entirely. There is a strong trade-off in the selection of the value of k;
    high values increase the size of the background crowd but at the same time reduce
    the utility of the data. As a rule of thumb, it was recommended to set k ≥ 5 (92). More
    recent work showed that while k-anonymity protects against identity tracing
    techniques, it is vulnerable to attribute disclosure, especially when the adversary
    has a certain level of prior knowledge about the presence of the target in the
    database93. Subsequent studies developed more elaborate redaction techniques to
    address these issues93,94. These anonymization techniques have been mainly
    successful in safeguarding demographic identifiers in medical research. However,
    attempts to adapt these techniques to DNA research are yet to be practical95. The
    high dimensionality of DNA data dictates that most of the records will be unique and
    it is not clear how the data can be redacted without destroying its value for
    secondary analysis.
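
    As a small illustration of the binning idea, the sketch below coarsens three
    demographic quasi-identifiers and then checks whether every record is shared by at
    least k individuals. The fixed bins (ten-year age bands, three-digit ZIP prefixes) and
    the function names are assumptions made for the example; adaptive binning and
    suppression, as described above, are not implemented.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: ten-year age band and three-digit ZIP prefix."""
    age, zip_code, sex = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**", sex)


def is_k_anonymous(records, k):
    """True if every combination of quasi-identifiers occurs at least k times."""
    counts = Counter(records)
    return all(c >= k for c in counts.values())


records = [
    (34, "02142", "F"), (37, "02139", "F"), (31, "02141", "F"),
    (52, "08540", "M"), (58, "08544", "M"), (55, "08542", "M"),
]
coarse = [generalize(r) for r in records]
print(is_k_anonymous(records, 2))  # False: every raw record is unique
print(is_k_anonymous(coarse, 3))   # True: each binned record appears three times
```

    The same check applied to genotype vectors would almost always fail unless nearly
    all markers were suppressed, which is the dimensionality problem noted above.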

    Differential privacy offers a distinct approach to restore anonymity by producing
    summary statistics after sophisticated data perturbation96. It aims to ensure that
    summary statistics of two datasets that differ by exactly one individual's record are
    extremely close to each other. This way, the adversary cannot be sure whether the
    target was part of the dataset or not and therefore cannot learn sensitive attributes.
    The challenge in differential privacy algorithms is to minimize the perturbation
    while satisfying the privacy property so that the summary statistic will still convey
    useful information on the population as a whole. Differential privacy has gained
    popularity in computer science and statistics as a very vibrant research area and the
    US Census Bureau uses this technique for their OnTheMap tool97. Early attempts
    have made progress towards protecting GWAS data using this approach98,99.
    Currently, the main limitation is that the amount of perturbation that needs to be
    added to the summary statistic grows linearly with the number of exposed SNPs,
    which quickly abolishes the ability to detect fine associations in meta-analysis.
    Whether or not there is a way to add much smaller amounts of noise while
    still maintaining privacy for GWAS datasets remains an open question.
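
    The linear growth of the noise can be seen in the basic Laplace mechanism, sketched
    here for allele counts. This is a generic textbook construction, not necessarily the
    algorithm of the cited GWAS studies. It assumes that replacing one individual changes
    each SNP's allele count by at most 2, so the L1 sensitivity over m SNPs is 2m; the
    function names and counts are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noise_scale(num_snps, epsilon):
    """Laplace scale for releasing allele counts of num_snps SNPs.

    Each person contributes at most 2 alleles per SNP, so the L1
    sensitivity of the whole release is 2 * num_snps.
    """
    return 2 * num_snps / epsilon


def private_allele_counts(counts, epsilon, rng=None):
    rng = rng or random.Random()
    scale = noise_scale(len(counts), epsilon)
    return [c + laplace_noise(scale, rng) for c in counts]


print(noise_scale(10, epsilon=1.0))     # scale for 10 SNPs
print(noise_scale(1000, epsilon=1.0))   # 100 times larger for 1000 SNPs
print(private_allele_counts([120, 85, 40], epsilon=0.5, rng=random.Random(0)))
```

    For a fixed privacy budget epsilon, releasing more SNPs forces proportionally more
    noise on every count, which is why fine associations are quickly lost.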

    CRYPTOGRAPHIC SOLUTIONS

    Modern cryptography brought new advancements for data dissemination beyond
    the traditional usage of encrypting sensitive files and distributing the key to
    authorized users. Secure multiparty computation (SMC) allows two or more entities
    who each have some private data to execute a computation on these private inputs
    without revealing the input to each other or disclosing it to a third party. In one
    classical example of SMC, ALICE and BOB can determine who is richer without either
    one revealing their actual wealth to the other. Researchers have constructed SMC
    protocols in various domains, from voting100 to location-based services101.

    In the area of genetic data, one line of work has developed SMC algorithms for
    genetic matching. Bruekers et al. presented a privacy-preserving algorithm to match
    STR profiles between two parties without exposing the actual genetic data102.
    Bohannon et al. suggested searchable genetic databases for forensic purposes that
    allow only going from genetic data to identity but not from identity to genetic
    data103. In their scheme, the records in the databases are encrypted with the
    individual's genotype as the key. To tolerate genotyping errors or missing data, they
    utilize a fuzzy encryption scheme that can use a key that only approximately
    matches the original one. This way, the genotype information can reveal the
    identity, but the identity cannot reveal the genotype. Along similar lines, Cristofaro et al.
    constructed cryptographic protocols for privacy-preserving paternity tests and
    genetic compatibility tests104, albeit for molecular techniques that are no longer in
    use, such as RFLP. They also presented a smartphone-based implementation of

    these protocols105. The performance varies dramatically between tasks that examine

    only a few loci and those that depend on the whole genome. The former complete in

    under a second and the latter take days of computation and gigabytes of bandwidth,

    rendering them impractical at the current time.

    In another direction, Kamm et al. suggested a secure multi-center GWAS analysis106.
    In their protocol, each center deploys a secret sharing scheme on its own collection
    of subjects' phenotypes and genotypes that divides the data into small shares, each
    of which reveals nothing about the original values on its own. The shares are then
    sent to the other centers, which store them in dedicated servers. The servers have
    an interface that allows outsiders to initiate a GWAS study on phenotypes and
    genotypes of interest. Upon request, the servers coordinate to perform the
    association without reconstructing the original genotypes or phenotypes and only
    report in plain text the significant SNPs. A potential shortcoming of their approach is
    that, at least theoretically, the plain text end product is still vulnerable to ADAD on
    summary statistic data, rendering the solution far from complete.
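
    The secret sharing step can be illustrated with additive shares over a prime field.
    The sketch below only shows how servers can add up private allele counts without any
    of them seeing an individual input; it is an assumption-based toy rather than the
    protocol of Kamm et al., and the modulus, function names, and counts are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # field modulus; any prime larger than the true totals would do


def share(value, n_servers):
    """Split value into n_servers random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    return sum(shares) % PRIME


def secure_sum(private_values, n_servers=3):
    """Each server adds the shares it receives; only the total is reconstructed."""
    per_server = [0] * n_servers
    for value in private_values:
        for i, s in enumerate(share(value, n_servers)):
            per_server[i] = (per_server[i] + s) % PRIME
    return reconstruct(per_server)


center_allele_counts = [120, 85, 40]     # one private count per center
print(secure_sum(center_allele_counts))  # 245, without revealing any single count
```

    Any subset of fewer than all shares is uniformly random, so a single server learns
    nothing about a center's data. The reconstructed total, however, is exactly the kind
    of plain text summary statistic that remains exposed to ADAD.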

    Another line of cryptographic work looks at privacy-preserving outsourcing of
    computations on genetic information using homomorphic encryption107 (Box 2).
    The concept of this approach is that, with the advent of ubiquitous usage of genetic data,
    users (or physicians) will interact with a variety of genetic interpretation services
    (e.g. promethease.com) throughout their lives, which increases the chance of a
    genetic privacy breach. Under this scheme, users send an encrypted
    version of their genome to the cloud. The interpretation service can access the cloud
    data but does not have the key and therefore cannot read the plain genotype values.
    Instead, the interpretation service executes the algebraic operations of its genetic
    risk prediction algorithm on the encrypted genotypes without inspecting the
    plaintext. After the algorithm completes, the user retrieves the ciphertext results from
    the cloud. Due to the special mathematical properties of the underlying
    cryptosystem, the user simply decrypts the results to obtain his risk prediction. This
    way the user does not expose any of his genotypes or disease susceptibility to the
    service provider. The current scope of risk prediction models is still limited but this
    approach is quite amenable to future improvements.
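
    To show what executing algebraic operations on encrypted genotypes can look like,
    here is a toy additively homomorphic scheme in the style of Paillier. The choice of
    Paillier is an assumption made for illustration, the primes are far too small to be
    secure, and the integer risk weights are hypothetical. The code needs Python 3.8 or
    later for the modular inverse via pow.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy key pair from two small primes (insecure; for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                               # modular inverse of lam
    return n, (lam, mu)                                # public key, private key


def encrypt(n, m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


def encrypted_risk_score(n, enc_genotypes, weights):
    """Run by the service: returns an encryption of sum(w * g), no key needed."""
    acc = 1  # a trivial encryption of zero
    for c, w in zip(enc_genotypes, weights):
        acc = acc * pow(c, w, n * n) % (n * n)  # multiply ciphertexts = add plaintexts
    return acc


n, private_key = keygen()
genotypes = [2, 0, 1]                            # user's risk-allele counts
uploaded = [encrypt(n, g) for g in genotypes]    # what the cloud stores
result = encrypted_risk_score(n, uploaded, weights=[3, 5, 2])
print(decrypt(n, private_key, result))           # 8 = 3*2 + 5*0 + 2*1
```

    The service only ever handles ciphertexts, and the user alone can decrypt the score.
    Models that need more than weighted sums would require richer homomorphic schemes.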

    CONCLUSION

    The invention of asymmetric cryptography in the 1970s led to a revolution in

    secure communication. Today, a wide variety of Internet transactions build upon

    these security measures in ways that are completely transparent to the average

    user. Data privacy still awaits a similar breakthrough. The status quo has greatly

    shifted in the last few years, with a torrent of studies showing that a motivated,

    technically-sophisticated adversary is capable of exploiting a wide range of genetic

    data for unintended purposes. With the constant innovation in genetics and the

    explosion of online information, we can expect that new privacy breaching

    techniques will be discovered in the next few years. Restoring the status quo with

    technical means will necessitate large strides in the theory and implementation of

    mitigation algorithms. Some of the approaches, particularly access control, have

    been quite useful. But so far, mitigation schemes are resource and time consuming

    for both the data custodian and users. Due to both technical and human factors108,

    the privacy field has yet to come up with a set of methodologies of comparable

    impact to communication security.

    Successful balancing of privacy demands and data sharing is not restricted to

    technical means109. Balanced informed consents outlining both benefits and risks

    are key ingredients for maintaining long-lasting credibility in genetic research. With

    the active engagements of a wide range of stakeholders from the broad genetics

    community and the general public, we as a society could develop social and ethical

    norms, legal frameworks, and educational programs to reduce the chance of misuse

    of genetic data despite the inability to theoretically prevent privacy breaches.

    GLOSSARY

    SAFE HARBOR: A standard in the HIPAA Rule for de-identification of protected health information by removing 18 types of quasi-identifiers.

    HAPLOTYPES: A set of alleles along the same chromosome.

    CRYPTOGRAPHIC HASHING: A procedure that yields a fixed length output from any size of input in a way that makes it hard to determine the input from the output.

    DICTIONARY ATTACKS: A brute force approach to reverse cryptographic hashing by scanning the relatively small input space.

    TYPE I ERROR: The probability of obtaining a positive answer for a negative item.

    LINKAGE EQUILIBRIUM: Absence of correlation between the alleles in two loci.

    POWER: The probability of obtaining a positive answer for a positive item.

    SPECIFICITY: The probability of obtaining a negative answer for a negative item.

    EFFECT SIZES: In quantitative traits, the contribution of a certain allele to the value of the trait.

    EXPRESSION QUANTITATIVE TRAIT LOCI: Genetic variants associated with variability in gene expression.

    LINKAGE DISEQUILIBRIUM: The correlation between alleles in two loci.

    ALICE AND BOB: Common placeholders in cryptography to denote party A and party B.

    ACKNOWLEDGEMENTS

    YE is an Andria and Paul Heafy Family Fellow and holds a Career Award at the
    Scientific Interface from the Burroughs Wellcome Fund. This study was also
    supported by a gift from Cathy and Jim Stone. The authors thank Dina Zielinski and
    Melissa Gymrek for useful comments and Shriram Sankararaman for his nice
    introduction between the authors.

    COMPETING INTERESTS STATEMENT

    None.

    REFERENCES

    1 Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216-220, doi:10.1038/nature11690 (2013).
    2 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65, doi:10.1038/nature11632 (2012).
    3 Roberts, J. P. Million veterans sequenced. Nat Biotech 31, 470-470, doi:10.1038/nbt0613-470 (2013).
    4 Drmanac, R. Medicine. The ultimate genetic test. Science 336, 1110-1112, doi:10.1126/science.1221037 (2012).
    5 Burn, J. Should we sequence everyone's genome? Yes. Bmj 346, f3133, doi:10.1136/bmj.f3133 (2013).
    6 Kaye, J., Heeney, C., Hawkins, N., de Vries, J. & Boddington, P. Data sharing in genomics--re-shaping scientific practice. Nat Rev Genet 10, 331-335, doi:10.1038/nrg2573 (2009).
    7 Park, J. H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 42, 570-575, doi:10.1038/ng.610 (2010).
    8 Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet 45, 400-405, 405e401-403, doi:10.1038/ng.2579 (2013).
    9 Friend, S. H. & Norman, T. C. Metcalfe's law and the biology information commons. Nature biotechnology 31, 297-303, doi:10.1038/nbt.2555 (2013).
    10 Rodriguez, L. L., Brooks, L. D., Greenberg, J. H. & Green, E. D. Research ethics. The complexities of genomic identifiability. Science 339, 275-276, doi:10.1126/science.1234593 (2013).
    11 Institute of Medicine (US) Roundtable on Value & Science-Driven Health Care. in Clinical Data as the Basic Staple of Health Learning: Creating and Protecting a Public Good: Workshop Summary. The National Academies Collection: Reports funded by National Institutes of Health (2010).
    12 McGuire, A. L. et al. To share or not to share: a randomized trial of consent for data sharing in genome research. Genetics in medicine : official journal of the American College of Medical Genetics 13, 948-955, doi:10.1097/GIM.0b013e3182227589 (2011).
    13 Oliver, J. M. et al. Balancing the risks and benefits of genomic data sharing: genome research participants' perspectives. Public Health Genomics 15, 106-114, doi:10.1159/000334718 (2012).
    14 Schwartz, P. M. & Solove, D. J. Reconciling Personal Information in the United States and European Union. SSRN Electronic Journal, doi:10.2139/ssrn.2271442 (2013).
    15 El Emam, K. Heuristics for De-identifying Health Data. Security & Privacy, IEEE 6, 58-61, doi:10.1109/MSP.2008.84 (2008).
    16 Lunshof, J. E., Chadwick, R., Vorhaus, D. B. & Church, G. M. From genetic privacy to open consent. Nat Rev Genet 9, 406-411, doi:10.1038/nrg2360 (2008).
    17 Brenner, S. E. Be prepared for the big genome leak. Nature 498, 139, doi:10.1038/498139a (2013).
    18 <http://www.privacyrights.org/data-breach>
    19 Scambray, J. M. S. K. G. Hacking exposed network security secrets & solutions, (2001). <http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=70568>
    20 Solove, D. J. A Taxonomy of Privacy. University of Pennsylvania Law Review 154, 477 (2006).
    21 Ohm, P. Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. UCLA Law Review 57 (2010).
    22 Golle, P. in Proceedings of the 5th ACM workshop on Privacy in electronic society 77-80 (ACM, Alexandria, Virginia, USA, 2006).
    23 Sweeney, L. A. Simple Demographics Often Identify People Uniquely. (2000).

    24 Greely, H. T. The uneasy ethical and legal underpinnings of large-scale genomic biobanks. Annual review of genomics and human genetics 8, 343-364, doi:10.1146/annurev.genom.7.080505.115721 (2007).
    25 Sweeney, L. A., Abu, A. & Winn, J. Identifying Participants in the Personal Genome Project by Name (2013). <http://dataprivacylab.org/projects/pgp/1021-1.pdf>.
    26 United States. General Accounting Office. & United States. (U.S. General Accounting Office, Washington, D.C., 2002).
    27 Benitez, K. & Malin, B. Evaluating re-identification risks with respect to the HIPAA privacy rule. Journal of the American Medical Informatics Association : JAMIA 17, 169-177, doi:10.1136/jamia.2009.000026 (2010).
    28 Kwok, P., Davern, M., Hair, E. & Lafky, D. in NORC at The University of Chicago (Chicago 2011).
    29 Bennett, R. L. et al. Recommendations for standardized human pedigree nomenclature. Pedigree Standardization Task Force of the National Society of Genetic Counselors. Am J Hum Genet 56, 745-752 (1995).
    30 Malin, B. Re-identification of familial database records. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, 524-528 (2006).
    31 Israel vs. Shalom Bilik, Avraham Adam, Yosef Vitman, Haim Aharon, Moshe Moshkowitz and Meir Liver (In Hebrew) Verdict 24441-05-12
    32 Gitschier, J. Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project. Am J Hum Genet 84, 251-258, doi:10.1016/j.ajhg.2009.01.018 (2009).
    33 Gymrek, M., McGuire, A. L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321-324, doi:10.1126/science.1229566 (2013).
    34 King, T. E. & Jobling, M. A. What's in a name? Y chromosomes, surnames and the genetic genealogy revolution. Trends Genet 25, 351-360, doi:10.1016/j.tig.2009.06.003 (2009).
    35 King, T. E. & Jobling, M. A. Founders, drift, and infidelity: the relationship between Y chromosome diversity and patrilineal surnames. Mol Biol Evol 26, 1093-1102, doi:10.1093/molbev/msp022 (2009).
    36 Motluk, A. Anonymous sperm donor traced on internet. New Sci 188, 2 (2005).
    37 Stein, R. Found on the Web, With DNA: a Boy's Father. Washington Post, 1 (2005).
    38 Naik, G. Family Secrets: An Adopted Man's 26-Year Quest for His Father. Wall Street Journal (2009).
    39 Lehmann-Haupt, R. Are Sperm Donors Really Anonymous Anymore? Slate (2010).
    40 Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome research, doi: (2012).
    41 Huff, C. D. et al. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome research 21, 768-774, doi:10.1101/gr.115972.110 (2011).
    42 Henn, B. M. et al. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS One 7, e34267, doi:10.1371/journal.pone.0034267 (2012).
    43 Lowrance, W. W. & Collins, F. S. Ethics. Identifiability in genomic research. Science 317, 600-602, doi:10.1126/science.1147699 (2007).
    44 Kayser, M. & de Knijff, P. Improving human forensics through advances in genetics, genomics and molecular biology. Nat Rev Genet 12, 179-192, doi:10.1038/nrg2952 (2011).
    45 Silventoinen, K. et al. Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin research : the official journal of the International Society for Twin Studies 6, 399-408, doi:10.1375/136905203770326402 (2003).
    46 Kohn, L. A. P. The Role of Genetics in Craniofacial Morphology and Growth. Annu Rev Anthropol. 20, 261-278 (1991).

    47 Zubakov, D. et al. Estimating human age from T-cell DNA rearrangements. Curr Biol

    20, R970-971, doi:10.1016/j.cub.2010.10.022 (2010).48 Ou, X. L. et al. Predicting human age with bloodstains by sjTREC quantification. PLoS

    One 7, e42412, doi:10.1371/journal.pone.0042412 (2012).

    49 Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biologicalpathways affect human height. Nature 467, 832-838, doi:10.1038/nature09410

    (2010).50 Manning, A. K. et al. A genome-wide approach accounting for body mass indexidentifies genetic variants influencing fasting glycemic traits and insulin resistance.Nat Genet 44, 659-669, doi:10.1038/ng.2274 (2012).

    51 Liu, F. et al. A Genome-Wide Association Study Identifies Five Loci Influencing FacialMorphology in Europeans. PLoS Genet 8, e1002932,

    doi:10.1371/journal.pgen.1002932 (2012).52 Walsh, S. et al. IrisPlex: a sensitive DNA tool for accurate prediction of blue and

    brown eye colour in the absence of ancestry information. Forensic Sci Int Genet 5,170-180, doi:10.1016/j.fsigen.2010.02.004 (2011).

    53 CIPA. Vol. DC-008-2010 (Camera & Imaging Product Association, 2010).

    54 Byers, S. Information leakage caused by hidden data in published documents. Security & Privacy, IEEE 2, 23-27, doi:10.1109/MSECP.2004.1281241 (2004).

    55 Kaufman, S., Rosset, S. & Perlich, C. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining 556-563 (ACM, San Diego, California, USA, 2011).

    56 Acquisti, A. & Gross, R. Predicting Social Security numbers from public data. Proc Natl Acad Sci U S A 106, 10975-10980, doi:10.1073/pnas.0904891106 (2009).

    57 Noumeir, R., Lemay, A. & Lina, J. M. Pseudonymization of radiology data for research purposes. Journal of digital imaging 20, 284-295, doi:10.1007/s10278-006-1051-4 (2007).

    58 Pakstis, A. J. et al. SNPs for a universal individual identification panel. Hum Genet 127, 315-324, doi:10.1007/s00439-009-0771-1 (2010).

    59 Lin, Z., Owen, A. B. & Altman, R. B. Genetics. Genomic research and human subject privacy. Science 305, 183, doi:10.1126/science.1095019 (2004).

    60 Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 39, 1181-1186, doi:10.1038/ng1007-1181 (2007).

    61 Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 4, e1000167, doi:10.1371/journal.pgen.1000167 (2008).

    62 Halperin, E. & Stephan, D. A. SNP imputation in association studies. Nature biotechnology 27, 349-351, doi:10.1038/nbt0409-349 (2009).

    63 Jacobs, K. B. et al. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nat Genet 41, 1253-1257, doi:10.1038/ng.455 (2009).

    64 Visscher, P. M. & Hill, W. G. The Limits of Individual Identification from Sample Allele Frequencies: Theory and Statistical Analysis. PLoS Genet 5, e1000628, doi:10.1371/journal.pgen.1000628 (2009).

    65 Wang, R., Li, Y. F., Wang, X., Haixu, T. & Zhou, X. in CCS '09 (Chicago, IL, USA, 2009).

    66 Im, H. K., Gamazon, E. R., Nicolae, D. L. & Cox, N. J. On Sharing Quantitative Trait GWAS Results in an Era of Multiple-omics Data and the Limits of Genomic Privacy. Am J Hum Genet 90, 591-598, doi:10.1016/j.ajhg.2012.02.008 (2012).

    67 Lumley, T. Potential for Revealing Individual-Level Information in Genome-wide Association Studies. JAMA 303, 659, doi:10.1001/jama.2010.120 (2010).

    68 Craig, D. W. et al. Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet 12, 730-736, doi:10.1038/nrg3067 (2011).


    69 Braun, R., Rowe, W., Schaefer, C., Zhang, J. & Buetow, K. Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data. PLoS Genet 5, e1000668, doi:10.1371/journal.pgen.1000668 (2009).

    70 Zerhouni, E. A. & Nabel, E. G. Protecting aggregate genomic data. Science 322, 44, doi:10.1126/science.1165490 (2008).

    71 Johnson, A. D., Leslie, R. & O'Donnell, C. J. Temporal trends in results availability from genome-wide association studies. PLoS Genet 7, e1002269, doi:10.1371/journal.pgen.1002269 (2011).

    72 Gilbert, N. Researchers criticize genetic data restrictions. Nature, doi:10.1038/news.2008.1083 (2008).

    73 Malin, B., Karp, D. & Scheuermann, R. H. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. Journal of investigative medicine : the official publication of the American Federation for Clinical Research 58, 11-18, doi:10.2310/JIM.0b013e3181c9b2ea (2010).

    74 Clayton, D. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics 11, 661-673, doi:10.1093/biostatistics/kxq035 (2010).

    75 Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects (2012). <http://www.genome.gov/Pages/Research/DER/GVP/Data_Aggregation_Workshop_Summary.pdf>.

    76 Schadt, E. E., Woo, S. & Hao, K. Bayesian method to predict individual SNP genotypes from gene expression data. Nat Genet 44, 603-608, doi:10.1038/ng.2248 (2012).

    77 Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat Rev Genet 11, 499-511, doi:10.1038/nrg2796 (2010).

    78 Check, E. James Watson's genome sequenced. Nature (2007).

    79 Nyholt, D. R., Yu, C. E. & Visscher, P. M. On Jim Watson's APOE status: genetic information is hard to hide. European journal of human genetics : EJHG 17, 147-149, doi:10.1038/ejhg.2008.198 (2009).

    80 Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet 40, 1068-1075, doi:10.1038/ng.216 (2008).

    81 Kaiser, J. Human genetics. Agency nixes deCODE's new data-mining plan. Science 340, 1388-1389, doi:10.1126/science.340.6139.1388 (2013).

    82 Bambauer, J. R. Tragedy of the Data Commons. Harvard Journal of Law and Technology 25, doi:10.2139/ssrn.1789749 (2011).

    83 Hartzog, W. & Stutzman, F. The Case for Online Obscurity. California Law Review 101, 1, doi:10.2139/ssrn.159774 (2013).

    84 Taleb, N. N. The black swan : the impact of the highly improbable. (Random House, 2007).

    85 Shannon, C. Communication Theory of Secrecy Systems. Bell System Technical Journal 28, 656-715 (1949).

    86 Cavoukian, A. Privacy by Design. (2009). <http://www.ipc.on.ca/images/Resources/privacybydesign.pdf>.

    87 Ramos, E. M. et al. A mechanism for controlled access to GWAS data: experience of the GAIN Data Access Committee. Am J Hum Genet 92, 479-488, doi:10.1016/j.ajhg.2012.08.034 (2013).

    88 Church, G. et al. Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet 5, e1000665, doi:10.1371/journal.pgen.1000665 (2009).

    89 Creating a Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data. (2013). <http://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf>.

    90 Sandis, C. & Taleb, N. N. Skin in the Game as a Required Heuristic for Acting Under Uncertainty. Available at SSRN 2298292 (2013).


    91 Sweeney, L. k-anonymity: a model for protecting privacy. International journal of uncertainty, fuzziness, and knowledge-based systems 10, 557-570 (2002).

    92 El Emam, K. & Dankar, F. K. Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association : JAMIA 15, 627-637, doi:10.1197/jamia.M2716 (2008).

    93 Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M. L-diversity. ACM Trans. Knowl. Discov. Data 1, 3-es, doi:10.1145/1217299.1217302 (2007).

    94 Ninghui, L., Tiancheng, L. & Venkatasubramanian, S. in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. 106-115.

    95 Malin, B. A. Protecting genomic sequence anonymity with generalization lattices. Methods of information in medicine 44, 687-692 (2005).

    96 Dwork, C. Differential Privacy. in ICALP. 1-12 (2007).

    97 Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. & Vilhuber, L. in Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. 277-286.

    98 Uhler, C., Slavkovic, A. B. & Fienberg, S. E. Privacy-Preserving Data Sharing for Genome-Wide Association. CoRR abs/1205.0739 (2012).

    99 Johnson, A. & Shmatikov, V. in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining 1079-1087 (ACM, Chicago, Illinois, USA, 2013).

    100 Cao, G. in Computer Science and Computational Technology, 2008. ISCSCT '08. International Symposium on. 292-294.

    101 Narayanan, A., Thiagarajan, N., Lakhani, M., Hamburg, M. & Boneh, D. (NDSS, 2011).

    102 Bruekers, F., Stefan, K., Klaus, K. & Pim, T. Privacy-Preserving Matching of DNA Profiles. 2008 (2008).

    103 Bohannon, P., Jakobsson, M. & Srikwan, S. in Public Key Cryptography Vol. 1751 Lecture Notes in Computer Science (eds Hideki Imai & Yuliang Zheng) Ch. 25, 373-390 (Springer Berlin Heidelberg, 2000).

    104 Baldi, P., Baronio, R., Cristofaro, E. D., Gasti, P. & Tsudik, G. in Proceedings of the 18th ACM conference on Computer and communications security 691-702 (ACM, Chicago, Illinois, USA, 2011).

    105 Cristofaro, E. D., Faber, S., Gasti, P. & Tsudik, G. in Proceedings of the 2012 ACM workshop on Privacy in the electronic society 97-108 (ACM, Raleigh, North Carolina, USA, 2012).

    106 Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886-893, doi:10.1093/bioinformatics/btt066 (2013).

    107 Ayday, E., Raisaro, J. L. & Hubaux, J. P. Privacy-Enhancing Technologies for Medical Tests Using Genomic Data. Technical Report (2013). <http://infoscience.epfl.ch/record/182897/files/CS_version_technical_report.pdf>.

    108 Narayanan, A. What Happened to the Crypto Dream? Security & Privacy, IEEE 11, 75-76 (2013).

    109 Presidential Commission for the Study of Bioethical Issues, Privacy and Progress in Whole Genome Sequencing. Privacy and Progress in Whole Genome Sequencing (2012).

    110 Paillier, P. in Advances in Cryptology EUROCRYPT '99 Vol. 1592 Lecture Notes in Computer Science (ed Jacques Stern) Ch. 16, 223-238 (Springer Berlin Heidelberg, 1999).

    111 Gentry, C. A fully homomorphic encryption scheme. (2009).
