Resolving membership in a study in shared aggregate genetics data

David W. Craig, Ph.D., Investigator & Associate Director, Neurogenomics [email protected]

Genome-wide Association Studies

Nature Reviews Genetics

Genome-wide Association Studies (GWAS) genotype millions of Single Nucleotide Polymorphisms (SNPs) across thousands of individuals.

SNPs are typically biallelic and diploid: genotypes CC/CT/TT, encoded as 00/01/11.

Due to ancestral meiotic recombination, SNPs are not independent from neighboring variants. They are often in linkage disequilibrium.

The concept of LD means that a SNP may be associated with disease, due to underlying correlation with a different functional variant.

Summary stats for a SNP across hundreds/thousands of individuals:

33% C / 77% T for cases and 45% C / 55% T for controls, P = 10^-8

CC = 508 / CT = 250 / TT = 108, OR = 1.8
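As a rough illustration of how such summary statistics are derived, the sketch below computes allele frequencies from genotype counts and an allelic odds ratio from a 2x2 table. The function names are illustrative, and the case/control allele counts passed to the odds-ratio function in the usage note are hypothetical, not taken from the slide.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (freq_C, freq_T) from diploid genotype counts."""
    total_alleles = 2 * (n_cc + n_ct + n_tt)  # two alleles per individual
    c_alleles = 2 * n_cc + n_ct               # CC carries two C, CT carries one
    freq_c = c_alleles / total_alleles
    return freq_c, 1.0 - freq_c


def allelic_odds_ratio(case_a, case_b, control_a, control_b):
    """Odds ratio for allele a versus allele b, cases against controls."""
    return (case_a * control_b) / (case_b * control_a)
```

With the genotype counts shown above, `allele_frequencies(508, 250, 108)` gives a C allele frequency of about 0.73. A call such as `allelic_odds_ratio(300, 700, 200, 800)` uses made-up allele counts purely to show the arithmetic.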

Resolving Identity from aggregate genetics data

GWAS are expensive, requiring genotyping of thousands of individuals.

Often require consortiums of consortiums.

Sharing individual-level data was and is a challenge.

Sharing meta-data is a reasonable option.

In 2007, summary allele frequency and genotype counts were routinely placed on the web for all SNPs.

In 2008, after broad deliberation with the scientific community, we published a forensics paper showing that individuals could still be resolved even from crude estimates of allele frequency.

"Resolve" is the term we purposely use. "Identify" has multiple meanings, particularly in GWAS studies.

Example Aggregate Data

SNP         % A allele (~500 cases)   % A allele (~500 controls)
rs903252    25%                       26%
rs232323    15%                       15%
rs323555    29%                       29%
rs232343    73%                       75%
rs233432    21%                       22%
rs234312    5.1%                      5.1%
rs163232    3.1%                      2.8%
rs8392731   15%                       16%
rs238764    7.3%                      7.1%
rs383745    45%                       54%

Other SNP Aggregate Data Types: Genotypes, odds ratios, p-values, etc.

Visual example (SNP data as visualized)

Genotype encoded as pixel intensity: AA = 1.0, AB = 0.5, BB = 0

250,000 pixels

Merge 96 independent data images equally

After merging, individual images still resolvable

[Figure panels: No Adjustment; Auto Contrast & Smooth Filter]
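The merging idea can be tried numerically. The sketch below is a toy simulation under stated assumptions, not the procedure behind the slide's images: pixel values are drawn uniformly from the three genotype levels (real genotype frequencies are not uniform), the pixel count defaults to 20,000 rather than 250,000 to keep pure-Python run time short, and Pearson correlation is used as a simple stand-in for "resolvable". The names `simulate` and `pearson` are illustrative.

```python
import random
from math import sqrt

LEVELS = (0.0, 0.5, 1.0)  # BB, AB, AA encoded as pixel intensities


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def simulate(n_images=96, n_pixels=20_000, seed=0):
    """Merge n_images random images equally, then compare a member image
    and a non-member image against the merged image."""
    rng = random.Random(seed)
    images = [rng.choices(LEVELS, k=n_pixels) for _ in range(n_images)]
    merged = [sum(column) / n_images for column in zip(*images)]
    outsider = rng.choices(LEVELS, k=n_pixels)  # never part of the merge
    return pearson(images[0], merged), pearson(outsider, merged)


if __name__ == "__main__":
    member_r, outsider_r = simulate()
    print(f"member correlation:   {member_r:.3f}")
    print(f"outsider correlation: {outsider_r:.3f}")
```

Under these assumptions a member image correlates with the merged image at roughly 1/sqrt(96), about 0.1, while a non-member image correlates at about zero. With many pixels that small gap becomes easy to detect, which is the intuition carried over to SNP data.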

Conceptual Approach

SNP        Reference Data Set   Data Set of Question   Person Of Interest   Directional score
rs903252   25%                  35%                    100%                 +10
rs232323   15%                  13%                    50%                  -2
rs323555   29%                  39%                    100%                 +10
rs232343   73%                  51%                    0%                   +22
rs233432   21%                  32%                    100%                 +11
rs234312   5%                   15%                    50%                  +10
rs163232   3%                   0%                     0%                   +3
…..        …..                  …..                    …..                  …..

Equations (one approach of many!!)
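One formulation that reproduces the directional scores in the conceptual table is, for each SNP, the person's distance from the reference data set minus their distance from the data set in question. The sketch below applies it to the seven rows shown. The function names are illustrative, and combining the scores as a mean divided by its standard error is an assumption made here for illustration, one approach of many.

```python
from math import sqrt
from statistics import mean, stdev

# (SNP, reference %, data set in question %, person of interest %)
ROWS = [
    ("rs903252", 25, 35, 100),
    ("rs232323", 15, 13, 50),
    ("rs323555", 29, 39, 100),
    ("rs232343", 73, 51, 0),
    ("rs233432", 21, 32, 100),
    ("rs234312", 5, 15, 50),
    ("rs163232", 3, 0, 0),
]


def directional_score(reference, mixture, person):
    """Positive when the person is closer to the data set in question
    than to the reference data set."""
    return abs(person - reference) - abs(person - mixture)


def summary_statistic(scores):
    """Mean score divided by its standard error (assumed summary)."""
    return mean(scores) / (stdev(scores) / sqrt(len(scores)))


scores = [directional_score(ref, mix, person) for _, ref, mix, person in ROWS]
```

Here `scores` matches the table column (+10, -2, +10, +22, +11, +10, +3). A consistently positive score across many SNPs suggests the person of interest is closer to the data set in question than to the reference; seven SNPs are far too few for a real conclusion.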

Resolving Individuals in Aggregate Data Sets

Results on pooled samples

Impact

NIH policy was changed.

Summary-level data is no longer freely available on the web in a distributed, unrestricted manner.

Additional papers refined the math and described limitations.

Managing Risk

Distributing results of studies on human subjects inherently increases the risk of a person being identifiable.

Context is important. The concept of Positive Predictive Value (PPV) can provide a measure.

PPV can also account for ‘at-risk’ populations.

Currently working with NIH on guidance for measuring risk with a given dataset.

The approaches leveraged a critical concept of directionality, specific to genotype data and frequency tables.

P-values represent a fundamentally different datatype with low information content.
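To see why context matters for PPV, the sketch below uses the standard definition of positive predictive value. The parameter names are illustrative: `prior` stands for the fraction of people being tested who really are in the study, which is where an 'at-risk' population enters. The numbers in the usage note are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability that a person flagged as a study member really is one."""
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)
```

For a hypothetical test with 99% sensitivity and 99% specificity, `positive_predictive_value(0.99, 0.99, 0.001)` is about 0.09 when only 1 in 1,000 people tested is a true member, but `positive_predictive_value(0.99, 0.99, 0.5)` is 0.99 when half of those tested are members. The same method therefore carries very different risk depending on who is being tested.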

A new era

The era of whole-genome sequencing is approaching.

SNPs are common variants, usually defined as having a minor allele frequency greater than 1%.

Whole-genome sequencing and exome sequencing inherently measure rare variants.

Rare variants can be highly informative, particularly in combination.

Approaches need to be explored for summarizing results without revealing identity.
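A back-of-the-envelope sketch of why rare variants are informative in combination: assuming the variants are inherited independently and follow Hardy-Weinberg proportions (both simplifying assumptions), the chance that a random person carries every variant in a set shrinks multiplicatively. The function names and the frequencies in the usage note are hypothetical.

```python
def carrier_probability(allele_freq):
    """Chance a diploid individual carries at least one copy of the allele."""
    return 1.0 - (1.0 - allele_freq) ** 2


def joint_carrier_probability(allele_freqs):
    """Chance of carrying every listed allele, assuming independence."""
    probability = 1.0
    for freq in allele_freqs:
        probability *= carrier_probability(freq)
    return probability


def expected_matches(allele_freqs, population_size):
    """Expected number of people in a population carrying all listed alleles."""
    return population_size * joint_carrier_probability(allele_freqs)
```

For three hypothetical variants, each at 0.5% allele frequency, `expected_matches([0.005] * 3, 1_000_000)` is just under 1: on average fewer than one person per million would carry all three.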

Acknowledgements

Lab: Jennifer Dinh; Szabolcs Szelinger; Holly Benson; Meredith Sanchez-Castillo; Brooke Hjelm

Informatics: Nils Homer, Ph.D.; Tyler Izatt; Jessica Aldrich; Alexis Christoforides; Ahmet Kurdoglu; James Long; Shripad Sinari

Funding: NINDS U24NS051872; State of Arizona; NHGRI U01HG005210; This work: ENDGAME (NHLBI U01 HL086528)

Thank you
