Nicole Nogoy, eResearch NZ, 2 July 2014. Open-Data, Open-Source, Open-Access: Improving data sharing, integration and reproducibility

Upload: gigascience-bgi-hong-kong

Post on 28-Jan-2015


DESCRIPTION

Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility, July 2nd 2014

TRANSCRIPT

Page 1: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Nicole Nogoy
eResearch NZ, 2 July 2014

Open-Data, Open-Source, Open-Access: Improving data sharing, integration and reproducibility

Page 2: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Open-Review, Open-Access, Open-Data, Open-Source

What can be achieved?

Page 3: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Take-home message: It's all about the re-use

To do this, everything needs to be free and accessible to be read by humans & machines*

* See: http://www.biomedcentral.com/about/datamining

Page 4: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Challenges/Opportunities in the Data-Driven Era

Enables:

Quick response to climate change, food security & disease outbreaks

Using networking power of the internet to tackle problems

Can ask new questions & find hidden patterns & connections

Build on each other's efforts quicker & more efficiently

More collaborations across more disciplines

Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding

Enabled by: Removing silos, standards/formats, open-access/data

Challenges:

Page 5: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Not enabled by: paywalls, silos, dead trees

1812, 1665, 1869

• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication

• If there is interest in data, it is only to monetise & repackage

Page 6: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Problem: growing replication gap

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Page 7: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Growing issue: increasing number of retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

At the current rate of increase, by 2045 as many papers will be retracted as are published!

Page 8: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

How?

Page 9: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

GigaSolution: Deconstructing the paper

www.gigadb.org
www.gigasciencejournal.com

Utilizes big-data infrastructure and expertise from:

Combines and integrates:

Open-access journal

Data Publishing Platform

Data Analysis Platform

Page 10: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

New incentives/credit

• Data
• Software
• Review
• Re-use…

= Credit

Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)

Page 11: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Anatomy of a Publication

Data

Idea

Study

Analysis

Answer

Metadata

Page 12: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Anatomy of a Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Page 13: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Submission Workflow

1. Submitter logs in to GigaDB website and uploads Excel submission file.
2. Validation checks:
   Fail – submitter is provided error report.
   Pass – dataset is uploaded to GigaDB.
3. Curator review.
4. DOI assigned. Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues).
5. Files: submitter provides files by ftp or Aspera.
6. DataCite XML file is generated and registered with DataCite.
7. Curator makes dataset public (can be set as future date if required).

Public GigaDB dataset, e.g. DOI 10.5524/100003: Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)

See: http://database.oxfordjournals.org/content/2014/bau018.abstract
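
The submission workflow on this slide can be read as a short sequence of states. The Python sketch below is illustrative only: the names `Submission`, `validate` and `process`, the required-field check, the dataset number 999999, and the exact ordering of the curator steps are all assumptions made for the example, not GigaDB's actual implementation. Only the DOI prefix 10.5524 is taken from the DOIs quoted in the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical required columns; the real Excel template is not described in the slides.
REQUIRED_FIELDS = ("title", "submitter", "description")
# DOI prefix as it appears in the GigaDB DOIs quoted in this talk.
GIGADB_DOI_PREFIX = "10.5524"


@dataclass
class Submission:
    metadata: dict
    history: List[str] = field(default_factory=list)  # states passed through, in order
    errors: List[str] = field(default_factory=list)   # error report for the submitter
    doi: Optional[str] = None


def validate(sub: Submission) -> bool:
    """Validation checks: build an error report; return True when it is empty."""
    sub.errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not sub.metadata.get(name)
    ]
    return not sub.errors


def process(sub: Submission, dataset_number: int) -> Submission:
    """Walk one submission through the workflow (step order is an assumption)."""
    sub.history.append("excel_uploaded")
    if not validate(sub):
        # Fail: the submitter is provided with the error report.
        sub.history.append("failed_error_report_sent")
        return sub
    # Pass: the dataset is uploaded, reviewed, given a DOI, and made public.
    sub.history.append("uploaded_to_gigadb")
    sub.history.append("curator_review")
    sub.doi = f"{GIGADB_DOI_PREFIX}/{dataset_number}"
    sub.history.append("doi_assigned")
    sub.history.append("files_transferred")        # by ftp or Aspera
    sub.history.append("datacite_xml_registered")
    sub.history.append("public")
    return sub


good = process(Submission({"title": "t", "submitter": "s", "description": "d"}), 999999)
print(good.doi, good.history[-1])
bad = process(Submission({"title": "t"}), 999999)
print(bad.errors)
```

A failed submission stops after the error report and never receives a DOI, which mirrors the Fail/Pass branch on the slide.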

Page 14: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

GigaScience Data Publishing Platform

Currently 120 datasets & 50TB data

Page 15: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

• ~50 TB of data from: BGI, ACRG, G10K, Bird10K, 3K Rice Genomes
• Provide curation & integration with other DBs (INSDC databases)

Page 16: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Many data types…

Page 17: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

BGI Datasets Get DOIs

Plants: Chinese cabbage, Cucumber, Foxtail millet, Pigeonpea, Potato, Sorghum, Wheat A+B, Rice

Microbe/metagenomics: E. coli O104:H4 TY-2482, T2D gut metagenome, Bulk pooled insects, T. tengcongensis proteome

Cell-Lines: Chinese Hamster Ovary, Mouse methylomes, Cancer quantitative proteomics

H: Asian individual (YH) (DNA Methylome, Genome Assembly v1+2, Transcriptome), Cancer (14TB), Single cell bladder cancer, HBV infected exomes, Ancient DNA (Saqqaq Eskimo, Aboriginal Australian)

Vertebrates: Darwin’s Finch, Giant panda, Macaque (Chinese rhesus, Crab-eating), Mini-Pig, Naked mole rat, Parrot (Puerto Rican), Penguin (Emperor penguin, Adelie penguin), Pigeon (domestic), Polar bear, DA and F344 rats, Sheep, Tibetan antelope

Other: fMRI & Retinal waves

Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant), Roundworm, Schistosoma, Silkworm, Parasitic nematode, Pacific oyster

Legend: Released pre-publication; Paper Published in GigaScience

Page 18: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Cloud solutions?

Reward better handling of metadata…

Novel tools/formats for data interoperability/handling.

Page 19: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Examples

Page 20: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Our first DOI:

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
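
The citation on this slide gives the dataset both as a `doi:` string and as a resolver URL. As a minimal sketch of how one maps to the other, the hypothetical helper `doi_to_url` below (not part of any GigaDB or DataCite tooling) prepends the `http://dx.doi.org/` resolver shown on the slide. It only builds the string and makes no network request.

```python
def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Turn 'doi:10.5524/100001' or '10.5524/100001' into a resolver URL."""
    doi = doi.strip()
    # Drop an optional 'doi:' label, as used in the citation on the slide.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Very light sanity check: DOIs start with '10.' and have a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return resolver + doi


print(doi_to_url("doi:10.5524/100001"))  # http://dx.doi.org/10.5524/100001
```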

Page 21: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility
Page 22: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

IRRI GALAXY

Beneficiaries of the genomics revolution?

Rice 3K project: 3,000 rice genomes, 13.4TB public data

Page 23: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

NO

Collaborations with Pensoft & PLOS

Cyber-centipedes & virtual worms

Page 24: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

SOURCE

USE/REUSE

PUBLISH

INTEGRATION WITH DOMAIN-SPECIFIC DATABASES VIA ISA-TOOLS

NARRATIVE DATA

(SOCIAL) MEDIA

DATA PRODUCTION

Sneddon, T.P., Zhe, X.S., Edmunds, S.C., et al. GigaDB: promoting data dissemination and reproducibility. Database (2014) Vol. 2014: article ID bau018; doi:10.1093/database/bau018

Page 25: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Disseminating new types of data

Page 26: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility
Page 27: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

How are we supporting data reproducibility?

Data sets: linked to DOI

Analyses: linked to DOI

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18

~21,000 accesses

Open-Code

8 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines, Open-Workflows

DOI:10.5524/100038

Open-Data

78GB CC0 data

Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/

~21,000 downloads

Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2

Page 28: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

New & more transparent peer-review: The GigaScience way:

8 referees downloaded & tested data, then signed reports

Page 29: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

New & more transparent peer-review: The GigaScience way:

Real-time open-review = paper in arXiv + blogged reviews

Page 30: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Implement workflows in a community-accepted format

http://galaxyproject.org

Over 36,000 main Galaxy server users

Over 1000 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

Page 31: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Visualizations & DOIs for workflows

Page 32: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk

Implemented entire workflow in our Galaxy server, inc.:

• 3 pre-processing steps

• 4 SOAPdenovo modules

• 1 post-processing step

• Evaluation and visualization tools

Also will be available to download by >36K Galaxy users in

Page 33: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

GigaGalaxy & Metabolomics

Tool list, Tool parameterisation, Results panel

Page 34: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Rewarding and aiding reproducibility

OMERO: providing access to imaging data…

Page 35: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Changing the way we publish:

Page 36: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

“Deconstructed” Journal

“Regular” Journal

“Conscientious” Online Journal

Page 37: Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

Thanks to:

Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

Peter Li, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Rob Davidson, Amye Kenall (BMC)

Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/

www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com

CBIIT

Funding from:

Our collaborators:

team:

Case study: