Beyond the Tsunami: Dealing with Life Sciences Data

Upload: chris-southan

Post on 10-May-2015

DESCRIPTION

Microsoft E Science 2009

TRANSCRIPT

Page 1: Beyond the Tsunami: Dealing with Life Sciences Data

[1]

Beyond the Tsunami: Developing the Infrastructure to Deal with Life Sciences Data

Christopher Southan and Graham Cameron, EMBL-European Bioinformatics Institute (EBI), Cambridge, U.K.

Page 2: Beyond the Tsunami: Dealing with Life Sciences Data

[2]

EBI and Sanger at Hinxton: Engaging with the Data Challenges

• Technology for sequence data generation and reduction
• Repositories, storage, archiving
• Databases, entity linking, infrastructure and utility
• Biocuration, annotation, standards, ontologies
• Experimental biological data from research groups
• Data exploitation, mining and visualisation
• Biological hypothesis iteration

Page 3: Beyond the Tsunami: Dealing with Life Sciences Data

[3]

EMBL-Bank

[Chart: y-axis from 0 to 3E+11]

Release 101, Aug 2009, 163 million entries, 283 billion bases

Page 4: Beyond the Tsunami: Dealing with Life Sciences Data

[4]

10 years of Rapid Growth

GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP.
08-OCT-2009 (Rel. 102, Created)
08-OCT-2009 (Rel. 102, Last updated, Version 1)
Influenza A virus (A/Chengdu/03/2009(H1N1)) segment 4 hemagglutinin (HA)
Jiang T., Qin C., Li X., Zhao H., Yu M., Deng Y., Yu X., Han J., Qin E., Zhu Q.;
"A community transmission of influenza A (H1N1) virus in a boarding school in China, 22-27 July 2009"

*******************************************************************************************

AF177758; SV 1; linear; mRNA; STD; HUM; 1868 BP.
10-SEP-1999 (Rel. 61, Created)
07-OCT-2008 (Rel. 97, Last updated, Version 6)
Homo sapiens ubiquitin specific protease 16 (USP16) mRNA, complete cds.
PUBMED; 10786635.
Smith T.S., Southan C.;
"Sequencing, tissue distribution and chromosomal assignment of a novel ubiquitin-specific protease USP23";
Biochim. Biophys. Acta 1490(1-2):184-88(2000).
Ensembl-Gn; ENSG00000143258; Homo_sapiens.
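The first line of each record on this slide follows the EMBL flat-file ID-line layout: accession; sequence version; topology; molecule type; data class; taxonomic division; sequence length. As an illustration only, here is a minimal Python sketch that splits such a line into named fields. The names `IdLine` and `parse_id_line` are hypothetical and not part of any EBI tool. The sketch assumes exactly seven semicolon-separated fields, as shown on the slide.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    """Fields of an EMBL flat-file ID line (hypothetical container)."""
    accession: str
    version: int
    topology: str
    molecule_type: str
    data_class: str
    division: str
    length_bp: int


def parse_id_line(line: str) -> IdLine:
    """Split an ID line such as the ones on the slide into typed fields."""
    text = line.strip()
    # Full flat files prefix the line with "ID"; the slide omits it.
    if text.startswith("ID "):
        text = text[2:].strip()
    parts = [part.strip() for part in text.split(";")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")
    accession, sv, topology, molecule, data_class, division, length = parts
    version = int(sv.split()[1])                     # "SV 1" -> 1
    length_bp = int(length.rstrip(".").split()[0])   # "1701 BP." -> 1701
    return IdLine(accession, version, topology, molecule,
                  data_class, division, length_bp)


record = parse_id_line("GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP.")
print(record.accession, record.division, record.length_bp)
```

A production parser would read the two-letter line codes of the full flat file rather than relying on field position alone.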

Page 5: Beyond the Tsunami: Dealing with Life Sciences Data

[5]

New Technology > New Data Archives

Volume (TB): 1.9, 70, 35

Categories: Assembled sequence, Capillary traces, Next-gen reads

European Nucleotide Archive Snapshot March 2009

Page 6: Beyond the Tsunami: Dealing with Life Sciences Data

[6]

Accelerating Genome Coverage

Jan 2009, 4370 projects

Page 7: Beyond the Tsunami: Dealing with Life Sciences Data

[7]

from EBI/Sanger

Page 8: Beyond the Tsunami: Dealing with Life Sciences Data

[8]

The 1000 Genomes Project: Cataloging Human Genetic Variation

• Initial human genome: 10 years and 40 gigabases
• Over the next two years the equivalent of two human genomes will be produced every 24 hours
• Completed dataset will be 6 trillion DNA bases, 500 TB
• 60-fold more than 28 years of EMBL-Bank
• Expected to cover 1200 genomes
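The figures on this slide imply some simple averages. The sketch below works them out in Python from the numbers quoted on the slide (6 trillion bases, 500 TB, 1200 genomes). It assumes decimal terabytes (10^12 bytes), which the slide does not specify, and the function name `implied_averages` is hypothetical. The results are averages only; the slide does not say how bases or storage are distributed across genomes.

```python
def implied_averages(total_bases: int, total_bytes: int, genomes: int):
    """Return (bases per genome, bytes stored per base) as plain averages."""
    bases_per_genome = total_bases / genomes
    bytes_per_base = total_bytes / total_bases
    return bases_per_genome, bytes_per_base


TOTAL_BASES = 6 * 10**12      # "6 trillion DNA bases" (from the slide)
TOTAL_BYTES = 500 * 10**12    # "500 TB", assuming 1 TB = 10**12 bytes
GENOMES = 1200                # "Expected to cover 1200 genomes"

per_genome, per_base = implied_averages(TOTAL_BASES, TOTAL_BYTES, GENOMES)
print(f"{per_genome:.2e} bases per genome on average")
print(f"{per_base:.1f} bytes stored per base on average")
```

Under these assumptions the averages come to about 5 billion bases per genome and roughly 83 bytes of storage per base.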

Page 9: Beyond the Tsunami: Dealing with Life Sciences Data

[9]

Data Exploitation: EBI Accesses

Last 4 years of hit-rates for web pages and web services

[Chart: y-axis from 0 to 1,200,000; series: CGI, API]

Page 10: Beyond the Tsunami: Dealing with Life Sciences Data

[10]

• Genomes
• Nucleotide sequence
• Expression
• Proteomes
• Protein families, and domains
• Protein structure
• Protein interactions
• Chemical entities
• Pathways
• Systems
• Literature, ontologies

Towards a sustainable infrastructure for biological information in Europe, to support life science, translation to medicine, the environment, bio-industries and society.

Page 11: Beyond the Tsunami: Dealing with Life Sciences Data

[11]

Conclusions

• The International Nucleotide Sequence Database Collaboration will exceed 300 billion bases in 2009.
• Storage at the EBI has doubled annually and is now 5 petabytes.
• Next-generation sequencing is increasing data production ~10-fold.
• By 2010 the full genomic variation in over 1000 people will be revealed and genomes from over 1000 species completed.
• An increase in data mining is needed to facilitate conversion into knowledge.
• The European ELIXIR project and other global initiatives to enhance the sustainable infrastructure for biological databases are essential.
• The impact of data-intensive computing on the Life Sciences will be profound and transforming.
• Exploitation will bring major benefits for biology, medicine, agriculture, biofuels and environmental science.