cshl minseqe 2013_ouellette

You are free to:

Copy, share, adapt, or re-mix;

Photograph, film, or broadcast;

Blog, live-blog, or post video of;

This presentation. Provided that:

You attribute the work to its author and respect the rights and licenses associated with its components.

Slide concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only: ccZero. Social media icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at: http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

http://creativecommons.org/publicdomain/zero/1.0/

Disclaimer

• I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention in this presentation.

Data availability and re‐usability in the transition from microarray to next‐generation sequencing: can we do better?

B.F. Francis Ouellette (@bffo)

• Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON

• Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON

on

• Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette, Alvis Brazma and the Functional Genomics Data Society http://fged.org

• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK – MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN

FGED’s mission:

To be a positive agent of change in the effective sharing and reproducibility of functional genomic data

fged.org

Poster # 142 (Friday)

I come here wearing many hats!

• Officer of FGED

• Data submitter to a large international cancer genomics initiative

• Receiving and curating data from that same initiative from 67 cancer genome projects

• Editor in an #openaccess journal where we are just now rewriting the data submission policy to ensure reproducibility

• Associate Editor of an #OA DATABASE journal

• Also on the SAB of Galaxy and Genomespace

What do we do with this?

FGED (Functional Genomics Data Society) was MGED (Microarray Gene Expression Data Society)

we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling. (…) We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. Repeatability of published microarray studies is apparently limited. More strict publication rules enforcing public data availability and explicit description of data processing and analysis should be considered.

Does it matter?

• Ioannidis et al. (2009) were not saying that the papers were wrong.

• But there were problems:
– missing data (38%)
– missing software, hardware details (50%)
– missing method, processing details (66%)

"… forensic bioinformatics [was needed] to infer what was done to obtain the results" - Keith Baggerly

Does it matter?

• In both cases the supporting data WERE deposited in GEO or ArrayExpress

• Forensic bioinformatics was needed and more often than not failed

• Maybe just depositing is not quite enough?

What was in MIAME?

1. The raw data
2. The final processed (normalised) data
3. The essential sample annotation and experimental variables
4. Sample data relationships
5. Array annotation (e.g., probe oligonucleotide sequences)
6. The laboratory and data processing protocols

Did it work? The glass half empty…

• Where were the hiccups? MIAME was asking too much!

• However, some now say that MIAME is much too little to ask! (e.g., publishing fully documented code with instructions on how to run it)

• What does ‘sufficient data processing protocols’ mean?

• Even when data and protocols were deposited, would the reviewers check these? Probably not.

• So does it help at all?

Did it work? The glass half full …

• ArrayExpress and GEO have data from well over 6 million high throughput assays from some 30,000 functional genomics studies

• The MIAME compliance has been increasing over time

• Many studies have shown the reusability of these data

• We can have an informed discussion about the reproducibility rather than forensics

Standards for content vs standards for format

• Developing a usable format is challenging
– If it’s too ‘flexible’, with too much free text, it’s no longer a standard and no software can reasonably parse it
– If it’s too rigid, too granular, it can’t handle new types of data, and people end up putting things in fields that don’t work

• Human-readable formats are useful, but machine readability is essential!

A simple human readable format for Functional genomics experiment metadata

• Sample-Data Relationship File (SDRF)
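To make the machine-readability point concrete, here is a minimal sketch of reading a tab-delimited, SDRF-style table with Python's standard library. The sample rows, the file names and the helper functions `read_sdrf` and `factor_columns` are hypothetical and were not part of the talk; the column headers only imitate the MAGE-TAB naming style.

```python
import csv
import io

# Hypothetical, much-reduced SDRF-style table: tab-delimited, one row per sample.
# Column headers imitate the MAGE-TAB naming style; the values are invented.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tArray Data File\n"
    "sample_1\tHomo sapiens\tcontrol\tsample_1.CEL\n"
    "sample_2\tHomo sapiens\tdrug\tsample_2.CEL\n"
)


def read_sdrf(text):
    """Return one dict per sample row, keyed by the column headers."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def factor_columns(rows):
    """List the columns that hold experimental factors."""
    if not rows:
        return []
    return [name for name in rows[0] if name.startswith("Factor Value[")]


rows = read_sdrf(SDRF_TEXT)
for row in rows:
    # Link each sample to its experimental factor and its data file.
    print(row["Source Name"], row["Factor Value[treatment]"], row["Array Data File"])
```

Because the same table opens in a spreadsheet and parses with a few lines of code, it stays readable for people and for software.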

Lessons learned

• Keep it simple, keep it simple, keep it simple!

• Perils of designing standards by a committee vs advantages of community agreement

• Successful formats are mostly defined by successful software, e.g., GFF in the UCSC Genome Browser or Bioconductor's gene_set

• The attraction and perils of perfection
– the last few steps of full automation cost the most effort
– A human may be a cheap broker between two pieces of software (again, the Bioconductor example)

What does it mean for HTS?

• (RNASeq – ChIPSeq)

• The metadata for functional genomics HTS experiments are not so different from microarray experiments – replace CEL files with BAM files

1. A general description of the aim of the experiment;
2. The submitter contact details;
3. Essential sample annotation and the experimental factors;
4. An ‘experiment’ or ‘run’ date, which may be important for identifying batch effects;
5. Sufficient information to correctly identify biological and technical replicates;
6. Experimental and data processing protocols;
7. Raw sequencing reads location; and processed data.

MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
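The checklist above lends itself to an automated completeness check at submission time. The following is a minimal sketch, not an official MINSEQE validator: the field names, the example record and the `missing_fields` helper are all hypothetical, and the last checklist item is split into two fields (raw reads location, processed data).

```python
# Hypothetical field names, one per element of the checklist above.
# These are illustrative only and are not defined by MINSEQE.
REQUIRED_FIELDS = [
    "description",         # aim of the experiment
    "contact",             # submitter contact details
    "sample_annotation",   # sample annotation and experimental factors
    "run_date",            # experiment or run date (batch effects)
    "replicates",          # biological / technical replicate labels
    "protocols",           # experimental and data processing protocols
    "raw_reads_location",  # where the raw sequencing reads are stored
    "processed_data",      # where the processed data are stored
]


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# Invented example submission that lacks a run date and processed data.
example = {
    "description": "RNA-Seq of treated vs untreated cells",
    "contact": "submitter@example.org",
    "sample_annotation": "cell line; treatment",
    "run_date": "",
    "replicates": "3 biological, 2 technical",
    "protocols": "library prep; alignment; quantification",
    "raw_reads_location": "archive accession goes here",
}

print(missing_fields(example))
```

A check like this only confirms that something was entered for each element; whether a protocol description is sufficient still needs a human reader.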

Percentage of publications from 2012 containing new gene expression data

Data type     Number of PMIDs with new data     % of data in SRA/ArrayExpress/GEO
Microarray    347                               49
RNA-Seq       334                               61

Percentage of RNA-Seq studies providing metadata (1/2)

Original database            ArrayExpress    GEO    SRA
Experimental description     95              100    100
Contact                      100             100    0
Sample & factor info         100             100    60
Experiment or run date       0               0      60

Percentage of RNA-Seq studies providing metadata (2/2)

Original database                    ArrayExpress    GEO          SRA
Biological and tech replicates       Yes             Sometimes    Yes
Exp and data processing protocol     60              100          0
Raw reads                            100             100          100
Processed data                       35              90           0
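As a rough illustration of how the survey numbers above can be read, the sketch below lists, for each database, the metadata elements provided by fewer than 100% of the RNA-Seq studies. It assumes the three values in each row are in the order ArrayExpress, GEO, SRA, and it leaves out the replicates row because that row is not numeric. The `gaps` helper is hypothetical.

```python
# Percentages transcribed from the two metadata tables above.
# Assumption: the values in each row are ordered ArrayExpress, GEO, SRA.
SURVEY = {
    "Experimental description": {"ArrayExpress": 95, "GEO": 100, "SRA": 100},
    "Contact": {"ArrayExpress": 100, "GEO": 100, "SRA": 0},
    "Sample & factor info": {"ArrayExpress": 100, "GEO": 100, "SRA": 60},
    "Experiment or run date": {"ArrayExpress": 0, "GEO": 0, "SRA": 60},
    "Exp and data processing protocol": {"ArrayExpress": 60, "GEO": 100, "SRA": 0},
    "Raw reads": {"ArrayExpress": 100, "GEO": 100, "SRA": 100},
    "Processed data": {"ArrayExpress": 35, "GEO": 90, "SRA": 0},
}


def gaps(survey, database, threshold=100):
    """Return the metadata elements provided by fewer than `threshold` percent of studies."""
    return [item for item, pct in survey.items() if pct[database] < threshold]


for database in ("ArrayExpress", "GEO", "SRA"):
    print(database, gaps(SURVEY, database))
```

Under that reading, raw reads are the only element provided by every study in all three databases.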

Things we still need to do:

• Involve folks from NCBI

• Compare methods and metrics over time (2009-2012)

• Compare methods with ENCODE, ICGC, EGA and the databases we presented here

• Look for shared metadata and seek to mate what is best and core to all

• Make sure it aligns with large funders’ current requirements

• Share and publish this information

Take home messages

• Archiving just something is not the same as making data available and useful
– metadata, analysis code, usable format, …
– Storing metadata doesn’t cost too much; extracting them from data generators does!

• Minimising the human mediation in moving data between the LIMS, archives and analysis tools is a more realistic goal than eliminating it – the need for brokerage

• The main source of variability in RNA-Seq interpretation seems to be the alignments – we don’t know how to do this well yet. Getting the short reads for RNA-Seq is a beginning.

• FGED: The Functional Genomics Data Society is a very open society, and we welcome feedback and input!

– http://fged.org
– Twitter: @fged

• Gabriella Rustici, Eleanor Williams, Alvis Brazma and the Functional Genomics Data Society http://fged.org

• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK – MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN

Acknowledgements:
