genomic privacy - institute for mathematics and its …...dtc and genomic privacy from the 23andme...

Genomic Privacy: Limits of Individual Detection in a Pool Sriram Sankararaman a , Guillaume Obozinski b , Michael I. Jordan c , Eran Halperin d a Harvard Medical School b INRIA c UC Berkeley d Tel Aviv University


Page 1: Genomic Privacy - Institute for Mathematics and its …...DTC and genomic privacy From the 23andMe website: 23andMe may collaborate with external parties. Under this informed consent,

Genomic Privacy: Limits of Individual Detection in a Pool

Sriram Sankararaman (a), Guillaume Obozinski (b), Michael I. Jordan (c), Eran Halperin (d)

(a) Harvard Medical School, (b) INRIA, (c) UC Berkeley, (d) Tel Aviv University


GWAS: Genome-wide Association Studies

Cases (rows: individuals, columns: SNPs)

0 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 0 0 0 0 1 0 1 0
0 1 1 1 0 0 1 0 0 1 0 0
1 1 0 0 0 0 1 1 0 0 0 1

Controls (rows: individuals, columns: SNPs)

0 1 0 0 0 0 1 0 0 1 1 0
0 1 0 1 0 0 0 0 1 1 0 0
0 1 0 1 1 1 1 1 0 0 0 1
0 1 0 0 1 0 1 1 0 0 0 1


GWAS facts

Looking for common SNPs

Frequency above 1%

Chosen to be correlated to unobserved causal variants.

Most of these SNPs have low effect sizes.

Testing about a million SNPs

Bottom line: Need a large number of samples to have sufficient power.


GWAS so far

600 studies covering around 150 traits (Manolio, 2010)

Power can be increased by combining data from multiple studies.

Tens to hundreds of thousands of participants are common.

Rheumatoid arthritis (5K cases, 17K controls), Alzheimer's (7K, 14K), lipid levels and cholesterol (~100K).

This has led to the setting up of central data-sharing repositories such as dbGaP, EGA and WTCCC.

What about individual privacy?


Some views on privacy and sharing


Give up privacy assurances, e.g. PGP (Personal Genome Project).

Have streamlined procedures to regulate access to data.

The middle ground?

Separate individual-level and summary data.

Make summary data public.


DTC and genomic privacy

From the 23andMe website:

23andMe may collaborate with external parties. Under this informed consent, external parties will only have access to pooled data stripped of identifying information. 23andMe will never release your individual-level data to any third party without asking for and receiving your explicit authorization to do so.


Do these measures guarantee the privacy of participants?


Individual Detection in a Pool

Cases (rows: individuals, columns: SNPs)

0 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 0 0 0 0 1 0 1 0
0 1 1 1 0 0 1 0 0 1 0 0
1 1 0 0 0 0 1 1 0 0 0 1

Pool frequencies (cases): 0.25 1 0.75 0.5 0 ... 0.5 0.25 0.5

Controls (rows: individuals, columns: SNPs)

0 1 0 0 0 0 1 0 0 1 1 0
0 1 0 1 0 0 0 0 1 1 0 0
0 1 0 1 1 1 1 1 0 0 0 1
0 1 0 0 1 0 1 1 0 0 0 1

Pool frequencies (controls): 0 1 0 0.5 0.5 ... 0.5 0.25 0.5

Query genotype: 0 1 1 1 0 0 0 0 1 0 1 0. Is this individual among the cases?
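To make the setup concrete, here is a minimal sketch that recomputes the case frequencies from the toy genotype matrix on this slide and states the membership question in its naive form. The names `CASES`, `QUERY`, `pool_frequencies` and `is_exact_member` are illustrative and not from the talk. In the scenario of the talk the attacker sees only the frequency vector and the query genotype, never the individual rows.

```python
# Toy case matrix from the slide: 4 individuals (rows) by 12 SNPs (columns).
CASES = [
    [0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
]

# Genotype whose membership in the case pool is in question.
QUERY = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]


def pool_frequencies(genotypes):
    """Per-SNP allele frequency of a pool: the mean of each column."""
    n = len(genotypes)
    return [sum(column) / n for column in zip(*genotypes)]


def is_exact_member(query, genotypes):
    """Ground truth check, available only to someone holding the rows."""
    return any(row == query for row in genotypes)


case_freqs = pool_frequencies(CASES)
print(case_freqs)
print(is_exact_member(QUERY, CASES))
```

In this toy example the query coincides with the second case row, which is exactly the fact a detection test would try to infer from the frequency vector alone.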


High-density SNP arrays can be used to resolve DNA mixtures

Homer et al., PLoS Genetics, 2008


Identification in Pools


NIH and others removed summary data.

Need a mathematical model of privacy.


Forensics vs Privacy

Forensics: Given data, choose a procedure to maximize power.

Privacy: Select data to expose such that the maximum power attained by an adversary is small. Bounds matter.


Limits of Individual Detection

Formulate individual detection in a pool as a hypothesis testing problem.

Likelihood-Ratio test (LR-test) is optimal for this hypothesis test (Neyman-Pearson lemma)

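Reject the null when $L(x) = \frac{\Pr(x \mid H_1)}{\Pr(x \mid H_0)} \ge t(\alpha)$, where the threshold $t(\alpha)$ satisfies $\Pr(L(x) \ge t(\alpha) \mid H_0) = \alpha$.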


The power of the LR-test provides an upper bound on the power of any method.


Limits of Individual Detection

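[Figure: graphical models for the null and alternative hypotheses over the tested genotype $x$, the pool genotypes $X_i$, the population frequencies $p$ and the pool frequencies $\hat{p}$; one model has a plate of $n - 1$ pool genotypes and the other a plate of $n$.]

Likelihood-ratio test

$$L = \sum_{j=1}^{m} \left[ x_j \log\frac{\hat{p}_j}{p_j} + (1 - x_j) \log\frac{1 - \hat{p}_j}{1 - p_j} \right]$$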
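The statistic can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: `lr_statistic` is a hypothetical name, genotypes are taken to be 0/1 alleles as in the toy matrices earlier in the talk, and the clipping constant `eps` is added here only to avoid taking the logarithm of zero when a frequency is exactly 0 or 1.

```python
import math


def lr_statistic(x, pool_freq, pop_freq, eps=1e-9):
    """Sum over SNPs of x_j*log(phat_j/p_j) + (1-x_j)*log((1-phat_j)/(1-p_j)).

    x         -- tested genotype, one 0/1 allele per SNP
    pool_freq -- allele frequencies published for the pool (phat_j)
    pop_freq  -- population allele frequencies (p_j)
    eps       -- clipping bound (an assumption of this sketch)
    """
    total = 0.0
    for xj, phat, p in zip(x, pool_freq, pop_freq):
        # Keep both frequencies strictly inside (0, 1) so the logs are finite.
        phat = min(max(phat, eps), 1.0 - eps)
        p = min(max(p, eps), 1.0 - eps)
        total += xj * math.log(phat / p)
        total += (1 - xj) * math.log((1.0 - phat) / (1.0 - p))
    return total
```

A SNP at which the pool frequency is closer to the tested allele than the population frequency contributes a positive term, so large values of the statistic point towards membership.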


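What happens for large pools?

For a single SNP, writing $\hat{p} - p \approx \sqrt{p(1 - p)/n}\, Z$ with $Z$ approximately standard normal:

$$L = x \log\frac{\hat{p}}{p} + (1 - x) \log\frac{1 - \hat{p}}{1 - p}$$

$$\approx \frac{(x - p)(\hat{p} - p)}{p(1 - p)} - \frac{1}{2}\, \frac{(x - p)^2 (\hat{p} - p)^2}{p^2 (1 - p)^2}$$

$$\approx \frac{1}{\sqrt{n}}\, \frac{x - p}{\sqrt{p(1 - p)}}\, Z - \frac{1}{2n}\, \frac{(x - p)^2}{p(1 - p)}\, Z^2.$$

$E_0[L] = -\frac{1}{2n}$, $V_0(L) \approx \frac{1}{n}$

$E_1[L] \approx +\frac{1}{2n}$, $V_1(L) \approx \frac{1}{n}$

Need SNPs to be common: $a < p < 1 - a$, $a > 0$.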
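One way to see how these per-SNP moments turn into a sample-size relation is sketched below. It assumes $m$ independent SNPs, so that means and variances add, and a normal approximation to the summed statistic. The threshold symbol $t$ is introduced here for illustration, and $z_\alpha \sigma_0$ and $z_{1-\beta} \sigma_1$ denote the distances from the null and alternative means to that threshold.

```latex
\begin{align*}
\mu_0 &\approx -\frac{m}{2n}, \qquad
\mu_1 \approx +\frac{m}{2n}, \qquad
\sigma_0^2 \approx \sigma_1^2 \approx \frac{m}{n}, \\
\mu_0 + z_\alpha \sigma_0 &= t = \mu_1 - z_{1-\beta} \sigma_1, \\
z_\alpha + z_{1-\beta} &\approx \frac{\mu_1 - \mu_0}{\sqrt{m/n}}
  = \frac{m/n}{\sqrt{m/n}} = \sqrt{\frac{m}{n}}.
\end{align*}
```

Under these assumptions the achievable trade-off between false positive rate and power depends on the data only through the ratio $m/n$.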


Main Result

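$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}}$$

[Figure: distributions of the test statistic under the null and the alternative, with labels $1 - \alpha$, $\beta$, $\mu_0$, $\mu_1$, $z_\alpha \sigma_0$ and $z_{1-\beta} \sigma_1$.]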


Values of $m/n$ for $\beta = 10\alpha$:

log10 α    m/n
-2         1.0916
-3         0.5835
-4         0.3954
-5         0.2980
-6         0.2387
-7         0.1988
-8         0.1703
-9         0.1488
-10        0.1322
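The tabulated ratios can be reproduced from the relation $z_\alpha + z_{1-\beta} \approx \sqrt{m/n}$ with the standard library. The sketch below rests on one assumption inferred from the numbers rather than stated in the slides: $z_x$ is read as the upper $x$-quantile of the standard normal, so that $z_\alpha = \Phi^{-1}(1 - \alpha)$ and $z_{1-\beta} = \Phi^{-1}(\beta)$, with $\beta$ playing the role of the power. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def upper_quantile(prob):
    """Value that a standard normal variable exceeds with probability prob."""
    return -_STD_NORMAL.inv_cdf(prob)


def snp_ratio(alpha, power):
    """Ratio m/n at which the relation gives this power at false positive rate alpha."""
    z_sum = upper_quantile(alpha) + _STD_NORMAL.inv_cdf(power)
    if z_sum < 0:
        raise ValueError("power below alpha needs no SNPs under this relation")
    return z_sum ** 2


def power_at(m, n, alpha):
    """Power implied by the same relation for m SNPs and a pool of size n."""
    return _STD_NORMAL.cdf(math.sqrt(m / n) - upper_quantile(alpha))


# One row per log10(alpha) from -2 to -10, with power equal to 10 * alpha.
for k in range(2, 11):
    alpha = 10.0 ** -k
    print(-k, round(snp_ratio(alpha, 10.0 * alpha), 4))
```

Read this way, each row gives the number of independent SNPs per pooled individual at which the best possible test reaches a power ten times its false positive rate.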


Can we apply the LR-test in practice?

Requires an estimate of the population allele frequencies.

Use an independent reference dataset.

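Drop in power:

$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}\left(1 - \frac{n}{\tilde{n}}\right)}$$

where $\tilde{n}$ is the size of the reference dataset.

Use a leave-one-out procedure on a dataset to obtain empirical power estimates.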
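An empirical power estimate of this kind can be sketched as follows, assuming the test statistic has already been computed twice: once for individuals known to be outside the pool (null scores) and once for individuals known to be inside it (alternative scores). The function name `empirical_power` and the thresholding convention are choices made for this sketch and are not taken from the talk.

```python
def empirical_power(null_scores, alt_scores, alpha):
    """Fraction of pool members detected at an empirical false positive rate <= alpha.

    The threshold is set from the null scores so that at most
    floor(alpha * len(null_scores)) of them lie strictly above it.
    """
    if not null_scores or not alt_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(null_scores, reverse=True)
    # Number of false positives tolerated; the tiny offset guards against
    # floating point results such as 0.9999999999 instead of 1.
    allowed = int(alpha * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    detected = sum(1 for score in alt_scores if score > threshold)
    return detected / len(alt_scores)


# Toy scores: ten non-members and four pool members.
null_scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alt_scores = [7.5, 8.5, 9.5, 10.5]
print(empirical_power(null_scores, alt_scores, 0.1))
```

With these toy scores one null score (9) lies above the threshold of 8, giving a false positive rate of 0.1, and three of the four pool members are detected.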


Analysis and empirical estimates agree for large pools.

[Figure: power versus false positive rate (log base 10) in two panels, WTCCC and simulated data, comparing the LR test, LR theory and Homer et al. for m = 1000, m = 10000 and m = 33138.]


Why does our optimal test have lower power than Homer et al?

Alternative hypothesis is the same: Tested individual is present in pool.

Nulls differ.

Our null: Tested individual is sampled from the population and is not part of the reference dataset.

Null tested in Homer et al.: Tested individual is part of the reference dataset.


Does this difference in the nulls matter?

Population has 10 individuals of which 5 are in the pool and rest in the reference.

Easy to detect individual in pool or reference.

Population has 1 million individuals of which 5 are in the pool.

Harder to detect in reference.

Even harder if only 5 out of these 1 million are available in a reference dataset.

Null tested in Homer et al. more appropriate for forensics. Our null more appropriate for privacy.


The other null is indeed easier to test.

[Figure: power versus false positive rate in two panels, each comparing Homer et al., the LR test and LR theory.]

Our null requires 4 times more independent SNPs to achieve the same power.


Related questions

Dependent SNPs: Slight decrease in power. Haplotype-based test can be more powerful.

Genotyping errors: Reduces power.

Relatives: Requires more SNPs.

Population-independent.

1!2


An alternative framework

$X_i \in \mathcal{X}$, $\mathcal{X} = \{0, 1\}^n$

$(X_1, \ldots, X_m) \to f(X_1, \ldots, X_m)$


Release a noisy version of $f$: $\tilde{f} = f + \eta$, where $\eta$ is random noise.

Must still be useful for a non-attacker.

An attacker cannot use this sanitized $f$ to learn about $X$.


Differential Privacy

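For all datasets $X$ and $Y$ that differ in one individual, written $d(X, Y) = 1$, and every set of outputs $S$:

$$\frac{\Pr(f(X) \in S)}{\Pr(f(Y) \in S)} \le \exp(\epsilon)$$

Relates to the LR test.

Given a test with false positive rate $\alpha$: power at most $\alpha \exp(\epsilon)$.

Dwork et al., 2006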
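A short sketch of why this guarantee caps the power of any membership test. It assumes $X$ is a dataset containing the tested individual, $Y$ is the neighboring dataset without that individual, and $S$ is the set of released outputs on which the test declares the individual present; this reading of the symbols is an assumption of the sketch.

```latex
\begin{align*}
\alpha &= \Pr(f(Y) \in S)
  && \text{(false positive rate: individual absent)} \\
\text{power} &= \Pr(f(X) \in S)
  && \text{(individual present)} \\
  &\le \exp(\epsilon)\, \Pr(f(Y) \in S) = \alpha \exp(\epsilon).
\end{align*}
```

For small $\epsilon$ the power is therefore barely above the false positive rate, whatever test the attacker chooses.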


Exponential mechanism

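Add noise $\eta$ with density

$$p(\eta) \propto \exp\left(-\frac{\epsilon \lVert \eta \rVert_1}{S(f)}\right)$$

where the sensitivity is

$$S(f) = \sup_{\{x, y \,:\, d(x, y) = 1\}} \lVert f(x) - f(y) \rVert_1$$

What is f? Say, the mean allele frequencies.

What is S(f)? O(number of SNPs).

Bad news: The standard deviation of the noise is proportional to the number of SNPs.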
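The scaling problem can be illustrated with a small sketch that adds independent Laplace noise of scale $S(f)/\epsilon$ to each frequency, which is the noise distribution commonly associated with Dwork et al. (2006) for L1 sensitivity. Two things are assumptions of this sketch and not statements from the slides: the sensitivity is taken to be $m/n$ (one individual can move each of $m$ mean frequencies by at most $1/n$), and all function names are hypothetical.

```python
import math
import random


def l1_sensitivity(m, n):
    """Assumed L1 sensitivity of m mean allele frequencies over n individuals."""
    return m / n


def noise_std(m, n, epsilon):
    """Standard deviation of Laplace noise with scale S(f) / epsilon."""
    return math.sqrt(2.0) * l1_sensitivity(m, n) / epsilon


def laplace_noise(scale, rng):
    """One Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_noisy_frequencies(freqs, n, epsilon, rng):
    """Add independent Laplace noise to every frequency in the vector."""
    scale = l1_sensitivity(len(freqs), n) / epsilon
    return [p + laplace_noise(scale, rng) for p in freqs]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
noisy = release_noisy_frequencies([0.25, 1.0, 0.75, 0.5], n=4, epsilon=1.0, rng=rng)
print(noisy)
print(noise_std(m=1000, n=100, epsilon=1.0), noise_std(m=2000, n=100, epsilon=1.0))
```

Doubling the number of SNPs doubles the noise standard deviation, while each frequency stays between 0 and 1, which is why releasing many SNPs this way quickly destroys the signal.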


Conclusions

A statistical framework to analyze the limits of genotype detection in pools.

Provides guidelines on data sharing to researchers.

The analytical bound is valid for large pools and common SNPs.

Use in conjunction with the empirical test.


Future Directions

Identity

Phenotype

Genotype