Resolving membership in a study in shared aggregate genetics data
David W. Craig, Ph.D., Investigator & Associate Director, Neurogenomics [email protected]
Genome-wide Association Studies
Nature Reviews Genetics
Genome-wide Association Studies (GWAS) genotype millions of Single Nucleotide Polymorphisms (SNPs) across thousands of individuals.
SNPs are typically biallelic and diploid: genotypes CC/CT/TT can be coded as 00/01/11.
Due to ancestral meiotic recombination, SNPs are not independent from neighboring variants. They are often in linkage disequilibrium.
The concept of LD means that a SNP may be associated with disease, due to underlying correlation with a different functional variant.
Summary stats for a SNP across hundreds/thousands of individuals:
Allele frequencies: 33% C / 77% T for cases and 45% C / 55% T for controls, P = 10^-8
Genotype counts: CC=508 / CT=250 / TT=108, OR=1.8
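As a rough illustration of how such summary statistics are derived, the sketch below converts genotype counts into allele frequencies and computes an allelic odds ratio. The function names are hypothetical, and the case/control allele counts passed to the odds-ratio function are made-up numbers, not values from the slide.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (freq_C, freq_T) from diploid genotype counts."""
    total_alleles = 2 * (n_cc + n_ct + n_tt)  # two alleles per person
    c_alleles = 2 * n_cc + n_ct               # each CC carries two C, each CT one
    freq_c = c_alleles / total_alleles
    return freq_c, 1.0 - freq_c


def allelic_odds_ratio(case_c, case_t, control_c, control_t):
    """Odds of carrying C among case alleles divided by the same odds in controls."""
    return (case_c / case_t) / (control_c / control_t)


# Genotype counts as shown on the slide.
freq_c, freq_t = allele_frequencies(508, 250, 108)
print(f"C allele: {freq_c:.3f}, T allele: {freq_t:.3f}")

# Hypothetical allele counts, for illustration only.
example_or = allelic_odds_ratio(60, 40, 50, 50)
print(f"Allelic odds ratio: {example_or:.2f}")
```

Only the two frequencies (or the three genotype counts) per SNP are needed to publish this kind of table, which is why it was considered a safe form of sharing.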
Resolving Identity from aggregate genetics data
GWAS are expensive, requiring genotyping of thousands of individuals.
They often require consortiums of consortiums.
Sharing individual-level data was and is a challenge.
Sharing meta-data is a reasonable option.
In 2007, summary allele frequencies and genotype counts were routinely placed on the web for all SNPs.
In 2008, after broad deliberation with the scientific community, we published a forensics paper showing that one could have only crude estimates of allele frequency, yet still resolve individuals.
"Resolve" is the term we purposely use. "Identify" has multiple meanings, particularly in GWAS studies.
Example Aggregate Data
SNP          % A allele (~500 cases)   % A allele (~500 controls)
rs903252     25%                       26%
rs232323     15%                       15%
rs323555     29%                       29%
rs232343     73%                       75%
rs233432     21%                       22%
rs234312     5.1%                      5.1%
rs163232     3.1%                      2.8%
rs8392731    15%                       16%
rs238764     7.3%                      7.1%
rs383745     45%                       54%
Other SNP Aggregate Data Types: Genotypes, odds ratios, p-values, etc.
Visual example (SNP data as visualized)
Pixel values: AA = 1.0, AB = 0.5, BB = 0
250,000 pixels
Merge 96 independent data images equally
After merging, individual images still resolvable
Panels: No Adjustment; Auto Contrast & Smooth Filter
Conceptual Approach
SNP        Reference Data Set   Data Set of Question   Person Of Interest   Directional score
rs903252   25%                  35%                    100%                 +10
rs232323   15%                  13%                    50%                  -2
rs323555   29%                  39%                    100%                 +10
rs232343   73%                  51%                    0%                   +22
rs233432   21%                  32%                    100%                 +11
rs234312   5%                   15%                    50%                  +10
rs163232   3%                   0%                     0%                   +3
…..        …..                  …..                    …..                  …..
Equations (one approach of many!!)
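The equations themselves are not present in this text. As a hedged sketch, the directional scores in the Conceptual Approach table are consistent with D = |Y - Ref| - |Y - Mix| per SNP, where Y is the person's allele percentage, Ref the reference data set, and Mix the data set of question. Summarizing D over SNPs with a one-sample t-type statistic, T = mean(D) / (sd(D) / sqrt(s)), is an assumption made here for illustration. All function and variable names are hypothetical.

```python
import math
import statistics

# Rows from the Conceptual Approach table, in percent:
# (reference data set, data set of question, person of interest)
ROWS = [
    (25, 35, 100),
    (15, 13, 50),
    (29, 39, 100),
    (73, 51, 0),
    (21, 32, 100),
    (5, 15, 50),
    (3, 0, 0),
]


def directional_score(reference, mixture, person):
    """Positive when the person sits closer to the mixture than to the reference."""
    return abs(person - reference) - abs(person - mixture)


def t_statistic(scores):
    """One-sample t-type statistic testing whether the mean score differs from 0."""
    return statistics.mean(scores) / (statistics.stdev(scores) / math.sqrt(len(scores)))


scores = [directional_score(ref, mix, person) for ref, mix, person in ROWS]
t_value = t_statistic(scores)
print("Directional scores:", scores)
print(f"T over {len(scores)} SNPs: {t_value:.2f}")
```

Seven SNPs are far too few to draw a conclusion. The point is that each SNP contributes a small directional shift, and the shifts accumulate across hundreds of thousands of SNPs.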
Resolving Individuals in Aggregate Data Sets
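A small simulation can illustrate why individuals remain resolvable. It is a sketch under strong simplifying assumptions that are not from the talk: SNPs are independent (no LD), genotypes follow Hardy-Weinberg proportions, and the pool and reference are drawn from the same population. It uses a per-SNP score |Y - Ref| - |Y - Mix| summarized by a t-type statistic. The pool size, SNP count, and all names are hypothetical.

```python
import math
import random
import statistics


def simulate_genotype(p, rng):
    """Draw one diploid genotype as an allele fraction: 0, 0.5, or 1."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def t_statistic(person, reference, mixture):
    """T-type statistic over per-SNP scores |Y - Ref| - |Y - Mix|."""
    d = [abs(y - r) - abs(y - m) for y, r, m in zip(person, reference, mixture)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def run_simulation(n_snps=10000, pool_size=20, seed=1):
    rng = random.Random(seed)
    # True population allele frequency for each SNP.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def cohort(n):
        return [[simulate_genotype(p, rng) for p in freqs] for _ in range(n)]

    def aggregate(people):
        # Per-SNP allele frequency: the only data that would be shared.
        return [sum(column) / len(people) for column in zip(*people)]

    pool = cohort(pool_size)        # people in the data set of question
    reference = cohort(pool_size)   # independent reference people
    outsider = cohort(1)[0]         # someone in neither group

    mixture_freqs = aggregate(pool)
    reference_freqs = aggregate(reference)
    member_t = t_statistic(pool[0], reference_freqs, mixture_freqs)
    outsider_t = t_statistic(outsider, reference_freqs, mixture_freqs)
    return member_t, outsider_t


member_t, outsider_t = run_simulation()
print(f"Pool member T: {member_t:.1f}")
print(f"Non-member T:  {outsider_t:.1f}")
```

Under these assumptions the pool member's statistic should sit well above zero while the non-member's stays near zero. Real data violate the independence assumption, which is one of the limitations later papers examined.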
Results on pooled samples
Impact
NIH policy was changed.
Summary-level data is no longer freely available on the web in a distributed, unrestricted manner.
Additional papers refined the math and described limitations.
Managing Risk
Distributing results of studies on human subjects inherently increases the risk of a person being identifiable.
Context is important. The concept of Positive Predictive Value (PPV) can provide a measure.
PPV can also account for "at-risk" populations.
Currently working with NIH on guidance for measuring risk with a given dataset.
The approaches leveraged a critical concept of directionality, specific to genotype data and frequency tables.
P-values represent a fundamentally different datatype with low information content.
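To make the PPV point concrete, the sketch below applies the standard definition of positive predictive value to a membership test. The sensitivity, specificity, and prior values are hypothetical and chosen only to show how strongly context (the prior chance that a person is in the study) drives the result.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability that a positive membership call is correct.

    prior is the chance, before testing, that the person is in the study.
    """
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics, for illustration only.
SENSITIVITY = 0.99
SPECIFICITY = 0.999

# General population: 1 in 1,000 people tested is actually a participant.
ppv_general = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.001)

# "At-risk" context: half of the people tested are actually participants.
ppv_at_risk = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.5)

print(f"PPV, general population: {ppv_general:.2f}")
print(f"PPV, at-risk population: {ppv_at_risk:.3f}")
```

The same test gives a far less trustworthy positive call in the general population than in a narrow at-risk group, which is why a single accuracy figure does not capture the risk of a released dataset.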
A new era
The era of whole-genome sequencing is approaching
SNPs are common variants, usually defined as having an allele frequency greater than 1%.
Whole-genome sequencing and exome sequencing inherently measure rare variants.
Rare variants can be highly informative, particularly in combination.
Approaches need to be explored for summarizing results without revealing identity.
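A back-of-the-envelope calculation shows why rare variants are informative in combination. The sketch assumes independent variants and Hardy-Weinberg proportions, and the allele frequencies and population size are hypothetical. It estimates how many people in a population would be expected to carry every variant in a given set.

```python
import math


def expected_carriers(allele_freqs, population_size):
    """Expected number of people carrying at least one copy of every variant.

    Assumes the variants are independent of one another (an assumption).
    """
    carrier_probs = [1.0 - (1.0 - f) ** 2 for f in allele_freqs]
    return population_size * math.prod(carrier_probs)


POPULATION = 300_000_000  # hypothetical population size
RARE_FREQ = 0.005         # hypothetical allele frequency of 0.5%

one_variant = expected_carriers([RARE_FREQ], POPULATION)
five_variants = expected_carriers([RARE_FREQ] * 5, POPULATION)

print(f"Expected carriers of one variant:     {one_variant:,.0f}")
print(f"Expected carriers of all five variants: {five_variants:.3f}")
```

A single rare variant is shared by millions of people, but under these assumptions a specific combination of five is expected in fewer than one person, so the combination alone can single out an individual.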
Acknowledgements
Lab: Jennifer Dinh; Szabolcs Szelinger; Holly Benson; Meredith Sanchez-Castillo; Brooke Hjelm
Informatics: Nils Homer, Ph.D.; Tyler Izatt; Jessica Aldrich; Alexis Christoforides; Ahmet Kurdoglu; James Long; Shripad Sinari
Funding: NINDS U24NS051872; State of Arizona; NHGRI U01HG005210. This work: ENDGAME (NHLBI U01 HL086528)
Thank you