cshl minseqe 2013_ouellette
DESCRIPTION
2013 Genome Informatics presentation by Francis Ouellette at the Wednesday Oct 30 evening session
TRANSCRIPT
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at: http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
Disclaimer
• I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention in this presentation.
Data availability and re‐usability in the transition from microarray to next‐generation sequencing: can we do better?
B.F. Francis Ouellette
• Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON
• Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON
@bffo
on
• Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette, Alvis Brazma and the Functional Genomics Data Society http://fged.org
• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK – MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN
FGED’s mission:
To be a positive agent of change in the effective sharing and reproducibility of functional genomic data
fged.org
Poster # 142 (Friday)
I come here wearing many hats!
• Officer of FGED
• Data submitter to a large international cancer genomics initiative
• Receiving and curating data from that same initiative from 67 cancer genome projects.
• Editor in an #openaccess journal where we are just now rewriting the data submission policy to ensure reproducibility
• Associate Editor of an #OA DATABASE journal
• Also on the SAB of Galaxy and Genomespace
What do we do with this?
FGED (Functional Genomics Data Society) was MGED (Microarray Gene Expression Data Society)
we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling. (…) We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. Repeatability of published microarray studies is apparently limited. More strict publication rules enforcing public data availability and explicit description of data processing and analysis should be considered.
Does it matter?
• In Ioannidis et al (2009), they were not saying that the papers were wrong.
• But there were problems
– missing data (38%)
– missing software, hardware details (50%)
– missing method, processing details (66%)
… forensic bioinformatics [was needed] to infer what was done to obtain the results
- Keith Baggerly
Does it matter?
• In both cases the supporting data WERE deposited in GEO or ArrayExpress
• Forensic bioinformatics was needed and more often than not failed
• Maybe just depositing is not quite enough?
What was in MIAME?
1. The raw data
2. The final processed (normalised) data
3. The essential sample annotation and experimental variables
4. Sample data relationships
5. Array annotation (e.g., probe oligonucleotide sequences)
6. The laboratory and data processing protocols
Did it work? The glass half empty…
• Where were the hiccups? MIAME was asking too much!
• However, some now say that MIAME is much too little to ask! (e.g., publishing fully documented code with instructions how to run it)
• What does ‘sufficient data processing protocols’ mean?
• Even when data and protocols were deposited, would the reviewers check these? Probably not.
• So does it help at all?
Did it work? The glass half full …
• ArrayExpress and GEO have data from well over 6 million high throughput assays from some 30,000 functional genomics studies
• The MIAME compliance has been increasing over time
• Many studies have shown the reusability of these data
• We can have an informed discussion about the reproducibility rather than forensics
Standards for content vs standards for format
• Developing a usable format is challenging
– If it’s too ‘flexible’, with too much free text, it’s no longer a standard, and no software can reasonably parse it
– If it’s too rigid, too granular, it can’t handle new types of data, and people end up putting things in fields that don’t work
• Human-readable formats are useful, but machine readability is essential!
A simple human readable format for Functional genomics experiment metadata
• Sample-Data Relationship File (SDRF)
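As a rough illustration of how a tab-delimited SDRF can be both human readable and machine readable, the sketch below parses a tiny SDRF-like table with Python's standard csv module. The column headers, sample names and file names are illustrative assumptions, not taken from the talk or from a real submission.

```python
import csv
import io

# Hypothetical SDRF-like content: one row per sample-to-data-file path.
# Column names and values are invented for illustration only.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tArray Data File\n"
    "sample_1\tHomo sapiens\tcontrol\tsample_1.CEL\n"
    "sample_2\tHomo sapiens\ttreated\tsample_2.CEL\n"
)


def read_sdrf(text):
    """Return the rows of a tab-delimited SDRF-like table as dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[treatment]", "Array Data File")
print(groups)
```

The same text opens cleanly in a spreadsheet, which is the point of keeping the format simple: a person can edit it by hand and software can still parse it without guessing.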
Lessons learned
• Keep it simple, keep it simple, keep it simple!
• Perils of designing standards by a committee vs advantages of community agreement
• Successful formats are mostly defined by successful software, e.g., GFF in UCSC GB or Bioconductor’s gene_set
• The attraction and perils of perfection
– the last few steps of full automation cost the most effort
– A human may be a cheap broker between two pieces of software (again – Bioconductor example)
What does it mean for HTS?
• (RNASeq – ChIPSeq)
• The metadata for functional genomics HTS experiments are not so different from microarray experiments – replace CEL files with BAM files
1. A general description of the aim of the experiment;
2. The submitter contact details;
3. Essential sample annotation and the experimental factors;
4. An ‘experiment’ or ‘run’ date, which may be important for identifying batch effects;
5. Sufficient information to correctly identify bio & tech reps;
6. Experimental and data processing protocols
7. Raw sequencing reads location; and processed data.
MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
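The seven items listed above lend themselves to a simple completeness check. The sketch below is one possible way to express that, not an official MINSEQE validator: the dictionary keys and the example record are hypothetical names chosen for illustration, and item 7 is split into two keys (raw reads location and processed data) so each can be checked separately.

```python
# Keys are hypothetical field names chosen for this sketch; the descriptions
# paraphrase the items on the slide. Item 7 is split into two entries.
REQUIRED_ITEMS = {
    "description": "general description of the aim of the experiment",
    "contact": "submitter contact details",
    "samples_and_factors": "essential sample annotation and experimental factors",
    "run_date": "experiment or run date",
    "replicates": "information identifying biological and technical replicates",
    "protocols": "experimental and data processing protocols",
    "raw_reads_location": "location of the raw sequencing reads",
    "processed_data": "processed data",
}


def missing_items(metadata):
    """Return the checklist keys that are absent or empty in a metadata record."""
    return [key for key in REQUIRED_ITEMS if not metadata.get(key)]


# Invented example record with three items left out on purpose.
record = {
    "description": "RNA-Seq of treated vs control cells",
    "contact": "submitter@example.org",
    "samples_and_factors": {"sample_1": "control", "sample_2": "treated"},
    "replicates": {"sample_1": "bio_rep_1", "sample_2": "bio_rep_1"},
    "raw_reads_location": "reads/sample_1.bam",
}

for key in missing_items(record):
    print(f"missing: {key} ({REQUIRED_ITEMS[key]})")
```

A check like this only confirms that a field is filled in; whether a protocol description is actually sufficient to reproduce the analysis still needs a human reader.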
Percentage of publications from 2012 containing new gene expression data
Data type     Number of PMIDs with new data    % of data in SRA/ArrayExpress/GEO
Microarray    347                              49
RNA-Seq       334                              61
Percentage of RNA-Seq studies providing metadata (1/2)
Original database           ArrayExpress    GEO    SRA
Experimental description    95              100    100
Contact                     100             100    0
Sample & factor info        100             100    60
Experiment or run date      0               0      60
Percentage of RNA-Seq studies providing metadata (2/2)
Original database                   ArrayExpress    GEO          SRA
Biological and tech replicates      Yes             Sometimes    Yes
Exp and data processing protocol    60              100          0
Raw reads                           100             100          100
Processed data                      35              90           0
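To make the gaps easier to see, the RNA-Seq metadata percentages reported above can be held in a small lookup table and queried per database. This is only a sketch: it assumes the column order ArrayExpress, GEO, SRA as read from the slides, and it leaves out the replicates row because that row is reported as Yes/Sometimes rather than as a percentage.

```python
# Percentages copied from the two RNA-Seq metadata slides above.
# Assumed column order: ArrayExpress, GEO, SRA.
DATABASES = ("ArrayExpress", "GEO", "SRA")
PERCENT_PROVIDED = {
    "Experimental description": (95, 100, 100),
    "Contact": (100, 100, 0),
    "Sample & factor info": (100, 100, 60),
    "Experiment or run date": (0, 0, 60),
    "Exp and data processing protocol": (60, 100, 0),
    "Raw reads": (100, 100, 100),
    "Processed data": (35, 90, 0),
}


def gaps(database, threshold=100):
    """Return the items a database provides for fewer than `threshold` % of studies."""
    index = DATABASES.index(database)
    return {
        item: values[index]
        for item, values in PERCENT_PROVIDED.items()
        if values[index] < threshold
    }


for name in DATABASES:
    print(name, gaps(name))
```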
Things we still need to do:
• Involve folks from NCBI
• Compare methods and metrics over time (2009-2012)
• Compare methods with ENCODE, ICGC, EGA and the databases we presented here.
• Look for shared metadata and seek to mate what is best and core to all.
• Make sure it aligns with large funders’ current requirements.
• Share and publish this information
Take home messages
• Archiving just something is not the same as making data available and useful – metadata, analysis code, usable format, …
– Storing metadata doesn’t cost too much, extracting them from data generators does!
• Minimising the human mediation in moving data between the LIMS, archives and analysis tools is a more realistic goal than eliminating it – the need for brokerage
• The main source of variability in RNA-Seq interpretation seems to be the alignments – we don’t know how to do this well yet. Getting the short reads for RNA-Seq is a beginning.
• FGED: The Functional Genomics Data Society is a very open society, and we welcome feedback and input!
– http://fged.org
– Twitter: @fged
• Gabriella Rustici, Eleanor Williams, Alvis Brazma and the Functional Genomics Data Society http://fged.org
• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK – MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN
Acknowledgements: