The importance (and absence) of annotation in Next Generation Sequencing data
Hugh Shanahan & Jamie Alnasir
@hughshanahan
Results to be published in GigaScience
It was the best of times
• Many exciting experiments based on gathering huge amounts of data.
• 100,000 Genomes in the UK, many others
• Elixir - Exabytes of biomedical data in the next decade
• Large experiments - SKA, LHC
• Opening up of Government data
• Up ahead - Sensor networks and Monitoring Cities
• Machine Learning is now a widely accepted tool in analysing data and in making decisions.
• Evidence-based policy becoming the norm.
It was the worst of times
• Leaks appearing in the Scientific process.
• In domains with many possible relationships, most published results are wrong (Ioannidis, PLoS Medicine, 2005).
• Only about 1/4 of 67 published experiments on drug targets could be reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)
• Only 39% of key Psychology experiments could be reproduced (Nature News, 2015).
Poor statistics?
• Naive use of p-value calculations across fields.
• Null Hypothesis Significance Testing Procedure banned in Basic and Applied Social Psychology (Trafimow and Marks, BASP, 2015)
• Not the end of the story…more like the tip of the iceberg (Leek and Peng, Nature 2015)
Lessons learnt
• Results from individual experiments are probably wrong.
• Bias in your data means your conclusions are even more likely to be wrong.
• Meta-analyses help.
• Understand how you got the data you have.
Sequence Read Archive
• Central repository of sequence data.
• Nearly 30,000 genomic and transcriptomics experiments stored and freely available.
• 2 x 10^15 nucleotides stored
• Based on Next Generation Sequencing
• Step reduction in cost of sequencing
• ~$thousands for a human genome
• Potentially an enormous resource
• But how do you get that data?
Good news
• SRA data is open
• Stored in a sensible way (uses SQL)
• API and documentation to access it
Mucky business
• Data stored in SRA are short reads.
• ~100 nucleotide-long fragments which are then assembled.
• Very long pipeline to get from a sample to this step.
• Pipeline (Protocol in their lingo) is VARIABLE
Obvious question
• Is there any evidence of bias in the data due to varying the protocol?
Even More Obvious Question
• Where is the metadata on the pipeline (protocol)?
Only 4% of experiments describe all of the steps
What’s more…
• Metadata are stored as text fields.
• Hugely difficult task to parse.
• Submitters are not obliged to fill this data in.
• Confusion about what level to enter data in.
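To see why free-text metadata is so hard to work with, here is a minimal sketch of the kind of keyword scan you end up writing. The step names and trigger phrases are hypothetical, chosen only for illustration; they are not taken from the SRA schema or from the analysis in the talk.

```python
import re

# Hypothetical protocol steps and phrases that might signal them.
# Illustrative assumptions only, not the SRA's own field names.
STEP_PATTERNS = {
    "fragmentation": re.compile(r"\b(fragment\w*|sonicat\w*|shear\w*)\b", re.I),
    "amplification": re.compile(r"\b(pcr|amplif\w*)\b", re.I),
    "size_selection": re.compile(r"\bsize[- ]select\w*\b", re.I),
}


def steps_described(free_text):
    """Return the set of hypothetical steps mentioned in a free-text field."""
    if not free_text:
        # Submitters are not obliged to fill the field in at all.
        return set()
    return {step for step, pattern in STEP_PATTERNS.items()
            if pattern.search(free_text)}


def describes_all_steps(free_text):
    """True only if every hypothetical step is mentioned somewhere."""
    return steps_described(free_text) == set(STEP_PATTERNS)
```

Even this toy version has to guess at synonyms and spelling variants, and it will miss anything phrased in a way nobody anticipated. That is the core of the problem with text fields.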
Bottom line
• For much of the SRA data, there is a “known unknown” about biases due to preparation.
• It’s very unlikely we’ll ever be able to figure that out.
Why should you be paying attention?
• As a member of the public - it’s your money down the drain ($10^8-$10^9)
• As a researcher - all of this undermines confidence in Science as a whole.
• If you work with big (and more particularly) complex data - the same issues will crop up for you.
Answers?
• Understand how you got your data - even if it’s a step for modelling.
• Metadata is crucial.
• Organising your data is crucial.
• Use Ontologies
• Use discrete keywords
• Get people to use it
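As a rough illustration of what "discrete keywords" buys you, the sketch below checks a metadata record against a small controlled vocabulary at submission time. The field names and allowed values are made up for this example; a real vocabulary would come from an agreed ontology.

```python
# Hypothetical controlled vocabulary; field names and terms are assumptions
# for illustration, not drawn from any published ontology.
ALLOWED = {
    "fragmentation_method": {"sonication", "enzymatic", "nebulization"},
    "amplification": {"pcr", "none"},
}


def validate(record):
    """Return a list of problems with a record; an empty list means it passes."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif value not in allowed:
            problems.append(f"{field}: '{value}' not in vocabulary")
    return problems


if __name__ == "__main__":
    print(validate({"fragmentation_method": "we sonicated it"}))
```

With discrete terms, a missing or free-form entry is caught when the data is deposited, rather than years later by someone trying to parse it.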
In summary
We want to do all the clever stuff...
Most of the time we need to deal with a ton of pitchblende to find the milligram of radium.