Future directions in genetic epidemiology, impact on IT and Data requirements Nic Timpson MRC CAiTE Centre Department of Social Medicine

Post on 26-Mar-2015

Page 2

“Step change”

Larsen

Page 3

Why?

-Technology

-Paradigm shift

-Genomic properties

EUCCONET Data Management Workshop

Page 4

Raw data

Clinical meaning ???????

Page 5

Two of the driving technologies:

Chip based genotyping

Next Generation Sequencing (NGS)

Page 6

Basic flat Illumina output…

Page 7

Derivation of flat file data from image based intensity reads:

Page 8

HAPMAP - Illumina - Affymetrix

CHR16_HAPMAP.recode.map

CHR16_HAPMAP.recode.ped

red_test_run_assoc.txt

genetic_map_chr16.txt
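
The .map and .ped file names on this slide look like a PLINK-style text pair, though the slide does not say so, and that is an assumption here. Under that assumption, a minimal Python sketch of reading such files might look like the following. It takes a .map line to hold chromosome, SNP identifier, genetic distance and base-pair position, and a .ped line to hold six leading fields followed by two alleles per marker. The function names and the example rows are made up for illustration and are not taken from the talk.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    chrom: str
    snp_id: str
    genetic_distance: float
    position_bp: int


def parse_map_line(line: str) -> Marker:
    """Parse one .map line: chromosome, SNP id, genetic distance, bp position."""
    chrom, snp_id, dist, pos = line.split()
    return Marker(chrom, snp_id, float(dist), int(pos))


def parse_ped_line(line: str, n_markers: int) -> dict:
    """Parse one .ped line: six leading fields, then two alleles per marker."""
    fields = line.split()
    if len(fields) != 6 + 2 * n_markers:
        raise ValueError("unexpected number of fields in .ped line")
    alleles = fields[6:]
    # Pair consecutive alleles into one genotype per marker.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family_id": fields[0],
        "individual_id": fields[1],
        "father_id": fields[2],
        "mother_id": fields[3],
        "sex": fields[4],
        "phenotype": fields[5],
        "genotypes": genotypes,
    }


# Hypothetical example rows, invented for illustration only.
example_map = ["16 rs0000001 0 50000", "16 rs0000002 0 50500"]
example_ped = "FAM1 IND1 0 0 1 2 A A G T"

markers = [parse_map_line(row) for row in example_map]
person = parse_ped_line(example_ped, n_markers=len(markers))
```

Because every individual carries two allele columns per marker, the .ped file grows with samples times markers, which is one reason flat text files of this kind become unwieldy at genome-wide scale.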

Page 10

[Figure: NOD2 Crohn’s association; x-axis: Position (Mb)]

Page 11

Individual        Platform          Read length (bp)  Base coverage (fold)  Genomic coverage (%)  Cost ($US)
J. Craig Venter   Automated Sanger  800               7.5                   N/A                   70,000,000
James D. Watson   Roche/454         250               7.4                   95                    1,000,000
Yoruban male      Illumina/Solexa   35                40.6                  99.9                  250,000
Yoruban male      Life/APG          50                17.9                  98.6                  60,000

Page 12

[Figure. Y-axis: Data (bytes). Based on n~5000.]

Approaches shown:

1 - Candidate
2 - CHIP (designer)
3 - Affy 500
4 - Intensity data
5 - NGS data (*LC)

Data volumes marked on the figure: ~2Mb, ~10Gb, ~20Tb

Cost per genome: HGP ~$5+ billion; Venter & Watson ~$70 million and ~$1 million; NGS ~$60 000

Consequent shifting budgets…

Page 13

Based on the storage of re-sequence data, one can consider storage requirements for a next generation sequencing effort:

Assuming a storage cost of about 1.5 bytes per bp of sequence reads, a low coverage effort on ~2000 samples (as per UK10K, for example) needs 2000 x 3 billion bp x 1.5 bytes = 9 x 10^12 bytes, or roughly 10 terabytes. That does not include any subsequent parsed data.

Double this just to have the data in all formats one might be able to use meaningfully.

Yields ~20Tb

“20 Tb is pretty small these days”, so if buying new storage capacity just for this, one may be better off budgeting for 50-100Tb when buying bespoke.

Cost – service costs can be as high as £1500 per Tb

An NGS project on some 2000 individuals can cost as much as 40-50k on computing alone.
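
The arithmetic on this slide can be checked with a short Python sketch. It uses only the slide's own inputs (about 2000 samples, 3 billion bp, 1.5 bytes per bp, a doubling for alternative formats, and up to £1500 per terabyte). The function name and the use of decimal terabytes (10^12 bytes) are assumptions made for illustration. The exact products come to 9 and 18 terabytes, which the slide rounds to roughly 10 and 20.

```python
TERABYTE = 1e12  # decimal terabyte; an assumption, the slide does not specify


def sequence_storage_bytes(n_samples: int, genome_bp: float,
                           bytes_per_bp: float,
                           format_multiplier: float = 1.0) -> float:
    """Rough storage estimate: samples x genome length x bytes per base pair."""
    return n_samples * genome_bp * bytes_per_bp * format_multiplier


# Inputs quoted on the slide.
raw_tb = sequence_storage_bytes(2000, 3e9, 1.5) / TERABYTE
all_formats_tb = sequence_storage_bytes(2000, 3e9, 1.5, format_multiplier=2.0) / TERABYTE

# Upper-end service cost quoted on the slide: 1500 pounds per terabyte.
service_cost_gbp = all_formats_tb * 1500

print(f"raw reads: {raw_tb:.0f} TB, all formats: {all_formats_tb:.0f} TB, "
      f"service cost up to about {service_cost_gbp:,.0f} GBP")
```

Keeping the inputs as parameters makes it easy to see how the total scales if sample size, per-base storage cost or the number of derived formats changes.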

Page 14

Also receiving data on:

Copy number variation across the genome

Expression data (e.g. records of messenger RNA to track gene activity)

Methylome (markers of the epigenome)

Not to mention phenotype data (a retrospective effort and an ever increasing pool)

Raises the issue of linkage and data USE…

Page 15

Not just storage…

Page 18

Varying matrix properties and overlaid ribbon plots (here MAF):

Male vs Female

D’ vs r^2
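
For readers less familiar with the quantities named on this slide, the following Python sketch computes the standard pairwise linkage disequilibrium measures D, D’ and r^2 for two biallelic markers, along with minor allele frequency (MAF). It is a minimal illustration under textbook definitions, not the code used to draw the plots in the talk. The function names and the example frequencies are hypothetical.

```python
from typing import Tuple


def minor_allele_frequency(allele_freq: float) -> float:
    """MAF is the frequency of the rarer of the two alleles."""
    return min(allele_freq, 1.0 - allele_freq)


def ld_measures(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float, float]:
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b
    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical frequencies: a rare allele always found on one common background.
d, d_prime, r_squared = ld_measures(p_ab=0.1, p_a=0.1, p_b=0.5)
```

With these example frequencies D’ reaches its maximum while r^2 stays low, which is why the two measures can give quite different pictures of the same region.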

Page 19

CDKAL

Combinations of data processing/visualisation methods:

e.g. follow-up of the dissection of the TCF2 locus and the counter results for T2D and prostate cancer - other T2D loci?

See: Amundadottir et al Nature Genetics 2007

Page 20

Not to mention iterative approaches!

Generation of empirical distributions for the purpose of comparison, e.g. expression data

Gene x gene (and possibly environment) interaction analysis, which may span the genome
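
One common way to generate an empirical distribution for comparison is a permutation test, sketched below in Python. This is an illustration of the general idea rather than the method used in the talk: group labels are shuffled many times to build a null distribution for a difference in means, and the observed difference is compared against it. The function names and the expression values are invented for the example.

```python
import random
from statistics import mean


def permutation_null(group_a, group_b, n_permutations=10000, seed=0):
    """Build an empirical null distribution for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between label and value
        null.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    return null


def empirical_p_value(observed, null):
    """Two-sided empirical p-value with the usual +1 correction."""
    extreme = sum(1 for value in null if abs(value) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)


# Hypothetical expression values for carriers and non-carriers of a variant.
carriers = [5.1, 4.8, 5.6, 5.3, 5.9, 5.4]
non_carriers = [4.2, 4.5, 4.1, 4.7, 4.4, 4.0]

observed = mean(carriers) - mean(non_carriers)
null = permutation_null(carriers, non_carriers, n_permutations=2000)
p_value = empirical_p_value(observed, null)
```

The cost is the point of the slide: every test needs thousands of reshuffles, so repeating this across many expression traits or across genome-wide marker pairs multiplies the computing load very quickly.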

Page 21

Overall

As one would expect, data requirements are increasing

Genetic epidemiology has been boosted into a realm of real findings and exciting capability by the existence of new technology

Increases may (or may not) be more rapid than once thought

Storage and manipulation of large data sets present new challenges

A new breed of analysts is emerging

The computer scientist with a passion for biology

Perhaps Windows is dead…