Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling



Post on 27-Jan-2015


DESCRIPTION

Scott Edmunds' talk on GigaScience: Big Data, Data Citation and Future Data Handling, given at the International Conference of Genomics on 15 November 2011.

TRANSCRIPT

Page 1: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Scott Edmunds

GigaScience: Big Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

cc Flickr allan*

William Gibson: "Information is the currency of the future world"

Page 2: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Data Tsunami?

Flickr cc: opensourceway

Page 3: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Data Bonanza?

Page 4: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

[Chart: rice v wheat by year, 1996 to 2008; vertical axis 0 to 700]

Rice v Wheat: consequences of publicly available genome data.

Page 5: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Sharing aids everyone…

Sharing Detailed Research Data Is Associated with Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Every 10 datasets collected contributes to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Page 6: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Page 7: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

“Trans-Omics”

Objective: to integrate data from:

• Genomics
• Transcriptomics
• Proteomics
• Metabolomics

Page 8: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Problems?

Flickr cc: opensourceway

Page 9: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

[Chart: sequencing cost ($ per Mbp), with curves labelled "Moore’s Law" and "Sequencing"; annotated "~100,000X". Source: E Lander/Broad]

Page 10: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Sequencing Output

Data

Moore’s/Kryder’s Law

Storage

Page 11: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Sequencing Output

Data

Dissemination?

Publication

Page 12: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Potential sequencing capacity

1 Illumina HiSeq 2000 (+ TruSeq upgrade) = 600 Gb/run (12 days)

× 128 HiSeq = 6 Tb/day = >2 Pb/year

= ~2000 human genomes/day
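The capacity figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a rough illustration: the run yield, run length, and instrument count come from the slide, while the genome size (3.2 Gb) and 1x coverage are assumptions added here, since the slide does not say how its genomes-per-day figure was derived. All variable and function names are hypothetical.

```python
# Back-of-the-envelope check of the slide's capacity figures.
GB_PER_RUN = 600     # Gb per HiSeq 2000 run (from the slide)
DAYS_PER_RUN = 12    # run length in days (from the slide)
N_MACHINES = 128     # number of HiSeq instruments (from the slide)
GENOME_GB = 3.2      # ASSUMPTION: approximate human genome size in Gb
COVERAGE = 1         # ASSUMPTION: 1x coverage per genome


def daily_output_gb(gb_per_run, days_per_run, n_machines):
    """Aggregate daily output in Gb for a fleet of identical instruments."""
    return gb_per_run / days_per_run * n_machines


per_day_gb = daily_output_gb(GB_PER_RUN, DAYS_PER_RUN, N_MACHINES)
per_day_tb = per_day_gb / 1000            # Gb -> Tb
per_year_pb = per_day_tb * 365 / 1000     # Tb/day -> Pb/year
genomes_per_day = per_day_gb / (GENOME_GB * COVERAGE)

print(f"{per_day_tb:.1f} Tb/day")
print(f"{per_year_pb:.2f} Pb/year")
print(f"{genomes_per_day:.0f} genome-equivalents/day at {COVERAGE}x coverage")
```

Under these assumptions the totals come to 6.4 Tb/day and about 2.3 Pb/year, consistent with the slide's rounded figures. The genome count scales inversely with coverage, so it falls proportionally at higher sequencing depth.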

Page 13: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Flickr cc: opensourceway

Difficulties keeping up…

Page 14: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Flickr cc: opensourceway

Do we have models for long-term funding?

Human Gene Mutation Database

?

Kyoto Encyclopedia of Genes and Genomes

Page 15: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

?

Are there now too many hurdles?

Page 16: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

?

Are there now too many hurdles?

Technical: too large volumes, too heterogeneous, no home for many data types, too time consuming

Economic: too expensive, no long-term funding

Cultural: inertia, no incentives to share, unaware of how

Page 17: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Potential solutions?

Page 18: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Incentives/credit

Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)

Prepublication data sharing (Toronto International Data Release Workshop):
“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)

Page 19: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Data citation: DataCite and DOIs

Digital Object Identifiers (DOIs) offer a solution

Most widely used identifier for scientific articles

Researchers, authors, publishers know how to use them

Put datasets on the same playing field as articles

Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
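As a rough illustration of how a dataset DOI can be treated like an article DOI, the sketch below turns the DOI from the slide into a resolvable link and assembles a citation string. The helper names `doi_url` and `cite_dataset` are hypothetical, and the citation layout simply mirrors the example on the slide; it is not an official DataCite format.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def doi_url(doi: str) -> str:
    """Turn a DOI such as 'doi:10.1594/PANGAEA.587840' into a resolvable URL."""
    doi = doi.strip()
    # Accept the 'doi:' prefix used on the slide.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")


def cite_dataset(authors, year, title, publisher, doi):
    """Hypothetical helper: format a dataset citation in the slide's layout."""
    return f"{authors} ({year}). {title}. {publisher}. {doi_url(doi)}"


citation = cite_dataset(
    "Yancheva et al",
    2007,
    "Analyses on sediment of Lake Maar",
    "PANGAEA",
    "doi:10.1594/PANGAEA.587840",
)
print(citation)
```

Because the identifier resolves in the same way as an article DOI, the same string can sit in a reference list alongside journal citations.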

Page 20: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Data citation: DataCite and DOIs

>1 million DOIs since Dec 2009

Central metadata repository to link with WoS/ISI

- finally can track and credit use!

Page 21: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Large-Scale Data Journal/Database

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD

In conjunction with:

Now taking submissions…

Page 22: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Now taking submissions…

Page 23: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Editorial Board: International

Stephan Beck, UK
Alvis Brazma, UK
Ann-Shyn Chiang, Taiwan
Richard Durbin, UK
Paul Flicek, UK
Robert Hanner, Canada
Yoshihide Hayashizaki, Japan
Henning Hermjakob, UK
Wolfgang Huber, Germany
Gary King, USA
Tin-Lap Lee, Hong Kong
Donald Moerman, Canada
Karen Nelson, USA
Francis Ouellette, Canada
Stephen O'Brien, USA
Hanchuan Peng, USA
Russell Poldrack, USA
Ming Qi, China/USA
Susanna-Assunta Sansone, UK
Michael Schatz, USA
David Schwartz, USA
Fritz Sommer, USA
Lincoln Stein, Canada
Sumio Sugano, Japan
Thomas Wachtler, Germany
Jun Wang, China
Alistair Young, New Zealand
Zang Yufeng, China
Marie Zins, France

Page 24: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Editorial Board: International

Stephan Beck, Epigenomics
Alvis Brazma, Transcriptomics
Ann-Shyn Chiang, Neuroscience
Richard Durbin, Genetics/Genomics
Paul Flicek, Genomics
Robert Hanner, DNA Barcoding/Ecology
Yoshihide Hayashizaki, Genomics
Henning Hermjakob, Proteomics
Wolfgang Huber, Functional Genomics
Gary King, Medicine
Tin-Lap Lee, Genomics
Donald Moerman, Functional Genomics
Karen Nelson, Metagenomics
Francis Ouellette, Genomics
Stephen O'Brien, Genomics
Hanchuan Peng, Imaging/Neuro
Russell Poldrack, Neuroscience
Ming Qi, Genetics
Susanna-Assunta Sansone, Standards
Michael Schatz, Cloud Computing
David Schwartz, Optical Mapping
Fritz Sommer, Neuroscience
Lincoln Stein, Cloud Computing
Sumio Sugano, Genomics
Thomas Wachtler, Neuroscience
Jun Wang, Genomics
Alistair Young, Medical Imaging
Zang Yufeng, Neuroscience
Marie Zins, Medicine

Page 25: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Criteria and Focus of Journal/Database

• Reproducibility/Reuse
• Utility/Usability
• Standards/Searchability/Scale/Sharing
• Data publishing/DOI

Page 26: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Use of Data = Importance + Usability

easier to assess

subjective?

Page 27: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Reproducibility/Reuse

• BGI Cloud Computing resources for handling and analyzing large-scale data.
• Integrated tools to promote more widespread access, viewing, and analysis of data.
• Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).

Page 28: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Special Series/Hub for cloud-based tools

• Technical notes: test tools in the BGI-Cloud.
• Tools + Test Data (BGI or user) in one place.
• Aids reproducibility.
• Aids reviewers (free)
• Aids authors: visibility (PubMed, etc.), hosting (included/free offers)

– contact us: [email protected]

Oledoe flickr cc

Page 29: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Standards/Searchability/Sharing

• ISA-Tab compatibility to aid and promote best practice in metadata reporting.
• All supporting data must be publicly available.
• Ask for MIBBI compliance and use of reporting checklists.
• Part of the BioSharing network.

Page 30: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

Data publishing/DOI

• New journal format combines standard manuscript publication with an extensive database to host all associated data.
• Data hosting will follow standard funding agency and community guidelines.
• DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.

Page 31: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

of data use/release?

Page 32: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

The era of the data consumer?

Page 33: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

The era of the data consumer?

?

Page 34: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

The era of the data consumer?

?

Free access to data – but analysis hubs/nodes will form around it

Page 35: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

GDSAP: Genomic Data Submission and Analytical Platform

Big data from the “Sequencing Farm”

Data, Data, Data…

• Data Modeling
• Pipeline design
• Validation
• Commercial applications

“Apps”

Tin-Lap Lee, CUHK

Page 36: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigaDB.org

New Database

Page 37: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigaDB.org

New Database

Page 38: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

BGI Datasets Get DOI®s

doi:10.5524/100004

Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum

Microbe: E. coli O104:H4 TY-2482

Cell line: Chinese Hamster Ovary

H: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA, coming soon (Saqqaq Eskimo, Aboriginal Australian)

Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope

Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm

Page 39: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

BGI Datasets Get DOI®s

doi:10.5524/100004

Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum

Microbe: E. coli O104:H4 TY-2482

Cell line: Chinese Hamster Ovary

H: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA, coming soon (Saqqaq Eskimo, Aboriginal Australian)

Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope

Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm

Many unpublished…

Page 40: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Page 41: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Data also submitted to NCBI (including SV data to dbVar)

Complemented by citable form, and data-types including:

• Assemblies of 3 strains
• Raw Data
• SNPs
• InDels
• CNVs
• SV

Coming soon…

Page 42: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Our first DOI:

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Page 43: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Page 44: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Page 45: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

“The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.”

“German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”

Page 46: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Page 47: Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

www.gigasciencejournal.com

We want your data!

[email protected]

[email protected]

@gigascience

facebook.com/GigaScience

blogs.openaccesscentral.com/blogs/gigablog/