Future directions in genetic epidemiology, impact on IT and Data requirements
Nic Timpson
MRC CAiTE Centre, Department of Social Medicine
“Step change”
Larsen
Why?
-Technology
-Paradigm shift
-Genomic properties
EUCCONET Data Management Workshop
Raw data → Clinical meaning ???????
Two of the driving technologies:
Chip based genotyping
Next Generation Sequencing (NGS)
Basic flat Illumina output…
Derivation of flat-file data from image-based intensity reads:
Sources: HapMap, Illumina, Affymetrix
Files: CHR16_HAPMAP.recode.map, CHR16_HAPMAP.recode.ped, genetic_map_chr16.txt, red_test_run_assoc.txt
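To make the flat-file idea concrete, here is a minimal sketch of reading a .map/.ped pair. It assumes the files follow the PLINK text layout, which the ".recode" names suggest but the slide does not state. The marker names and genotypes below are invented toy data, not taken from the files on the slide.

```python
import io


def read_map(handle):
    """Parse PLINK-style .map lines: chromosome, marker ID, genetic distance (cM), bp position."""
    markers = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, marker_id, cm, bp = fields[:4]
        markers.append({"chrom": chrom, "id": marker_id, "cm": float(cm), "bp": int(bp)})
    return markers


def read_ped(handle, n_markers):
    """Parse PLINK-style .ped lines: six leading columns, then two alleles per marker."""
    samples = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        alleles = fields[6:]
        if len(alleles) != 2 * n_markers:
            raise ValueError("allele count does not match the number of markers")
        genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
        samples.append({
            "family": fields[0],
            "individual": fields[1],
            "sex": fields[4],
            "phenotype": fields[5],
            "genotypes": genotypes,
        })
    return samples


# Hypothetical toy content (two markers, two individuals); not real data.
MAP_TEXT = "16 rs_example_1 0 50000000\n16 rs_example_2 0 50010000\n"
PED_TEXT = "FAM1 IND1 0 0 1 2 A A G T\nFAM2 IND2 0 0 2 1 A C G G\n"

markers = read_map(io.StringIO(MAP_TEXT))
samples = read_ped(io.StringIO(PED_TEXT), len(markers))
for sample in samples:
    print(sample["individual"], sample["genotypes"])
```

Each row of the .ped file grows by two columns per marker, which is why chip-scale data quickly becomes awkward to handle as plain text.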
NOD2 Crohn’s association (plot, x-axis: Position (Mb))
Individual        Platform          Read length (bp)  Base coverage (fold)  Genomic coverage (%)  Cost ($US)
J. Craig Venter   Automated Sanger  800               7.5                   N/A                   70,000,000
James D. Watson   Roche/454         250               7.4                   95                    1,000,000
Yoruban male      Illumina/Solexa   35                40.6                  99.9                  250,000
Yoruban male      Life/APG          50                17.9                  98.6                  60,000
Data (bytes) by data type, based on n~5000:
1- Candidate
2- CHIP (designer)
3- Affy 500
4- Intensity data
5- NGS data (*LC)
Scale markers: ~2Mb, ~10Gb, ~20Tb
Cost per genome: HGP ~$5+ billion; Venter & Watson ~$70 million and ~$1 million; NGS ~$60 000
Consequent shifting budgets…
Based on the storage of re-sequence data, one can consider storage requirements for a next generation sequencing effort:
Assuming a storage cost of about 1.5 bytes per bp of sequence reads for a low-coverage study: ~2000 samples (as per UK10K, for example) x 3 billion bp x 1.5 bytes = 9 x 10^12 bytes, or roughly 10 terabytes. That doesn't include any subsequent parsed data.
Double this just to have the data in all formats one might be able to use meaningfully.
Yields ~20Tb
“20 Tb is pretty small these days”: if buying new storage capacity just for this, one may be better off budgeting for up to 50-100Tb.
Cost: service costs can be as high as £1500 per Tb
An NGS project on some 2000 individuals can cost as much as 40-50k on computing alone.
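The arithmetic above can be checked with a short sketch. The inputs are the figures quoted on the slide; the decimal definition of a terabyte (10^12 bytes) and the variable names are assumptions made for illustration.

```python
# Figures quoted on the slide.
BYTES_PER_BP = 1.5          # storage cost per base pair of sequence reads
N_SAMPLES = 2000            # low-coverage study size (UK10K-like)
GENOME_BP = 3_000_000_000   # approximate human genome length in bp
FORMAT_MULTIPLIER = 2       # "double this" to keep data in all useful formats
COST_PER_TB_GBP = 1500      # upper end of quoted service cost per Tb

# Assumption: decimal terabytes (10**12 bytes).
BYTES_PER_TB = 10**12


def raw_read_storage_tb(n_samples, genome_bp, bytes_per_bp):
    """Storage for raw sequence reads, in terabytes."""
    return n_samples * genome_bp * bytes_per_bp / BYTES_PER_TB


raw_tb = raw_read_storage_tb(N_SAMPLES, GENOME_BP, BYTES_PER_BP)
total_tb = raw_tb * FORMAT_MULTIPLIER
storage_cost_gbp = total_tb * COST_PER_TB_GBP

print(f"raw reads: {raw_tb:.1f} Tb")
print(f"all formats: {total_tb:.1f} Tb")
print(f"storage at the quoted ceiling rate: £{storage_cost_gbp:,.0f}")
```

The unrounded results are 9 Tb and 18 Tb, which the slide rounds to ~10 and ~20 Tb. The storage-only cost comes out below the 40-50k computing figure, which presumably covers more than disk space.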
Also receiving data on:
Copy number variation across the genome
Expression data (e.g. records of messenger RNA to track gene activity)
Methylome (markers of the epigenome)
Not to mention phenotype data (a retrospective effort and an ever increasing pool)
Raises the issue of linkage and data USE…
Not just storage…
Varying matrix properties and overlaid ribbon plots (here MAF):
Male vs Female
D’ vs r^2
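For readers less familiar with the two linkage disequilibrium measures named in the D’ vs r^2 panel, here is a minimal sketch using the standard textbook definitions. The function name and the example frequencies are made up for illustration and are not taken from the plotted data.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # signed; many tools report |D'|
    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical frequencies: a rare allele A that always travels with a common allele B.
d, d_prime, r2 = ld_measures(p_ab=0.1, p_a=0.1, p_b=0.5)
print(f"D={d:.3f}  D'={d_prime:.3f}  r^2={r2:.3f}")
```

With these made-up frequencies D’ is at its maximum while r^2 stays low, which is one reason the two matrices can look quite different for the same region.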
CDKAL
Combinations of data processing/visualisation methods:
e.g. follow-up of the dissection of the TCF2 locus and the opposing results for T2D and prostate cancer. Other T2D loci?
See: Amundadottir et al Nature Genetics 2007
Not to mention iterative approaches!
Generation of empirical distributions for the purpose of comparison, e.g. expression data
Gene x gene (and possibly environment) interaction analysis, which may span the genome
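One common way to generate an empirical distribution for comparison is by permutation. The sketch below is an illustration under that assumption, not necessarily the method used in the work described. The expression values and group labels are invented toy numbers.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Compare an observed difference in means with an empirical null distribution
    built by repeatedly shuffling the group labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null_stats = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        null_stats.append(abs(mean(pooled[:n_a]) - mean(pooled[n_a:])))
    exceed = sum(1 for stat in null_stats if stat >= observed)
    # Add-one correction keeps the estimate away from an impossible p of exactly 0.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical expression levels for carriers and non-carriers of a variant.
carriers = [5.1, 5.4, 5.0, 5.6, 5.3]
non_carriers = [4.2, 4.0, 4.5, 4.1, 4.3]

observed, p_value = permutation_p_value(carriers, non_carriers)
print(f"observed difference: {observed:.2f}, empirical p: {p_value:.4f}")
```

Every permutation repeats the full calculation, so running this per gene or per pair of loci across the genome multiplies the compute load by the number of permutations. That is the point of the slide: iteration, not storage, becomes the bottleneck.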
Overall
As one would expect, data requirements are increasing
Genetic epidemiology has been boosted into a realm of real findings and exciting capability by the existence of new technology
Increases may (or may not) be more rapid than once thought
Storage and manipulation of large data sets present new challenges
A new breed of analysts is emerging
The computer scientist with a passion for biology
Perhaps Windows is dead…