The importance (and absence) of annotation in Next Generation Sequencing data

Hugh Shanahan & Jamie Alnasir

@hughshanahan

Results to be published in GigaScience

It was the best of times

• Many exciting experiments based on gathering huge amounts of data.

• 100,000 Genomes in the UK, many others

• Elixir - Exabytes of biomedical data in the next decade

• Large experiments - SKA, LHC

• Opening up of Government data

• Up ahead - Sensor networks and Monitoring Cities

• Machine Learning is now a widely accepted tool in analysing data and in making decisions.

• Evidence-based policy becoming the norm.

It was the worst of times

• Leaks appearing in the Scientific process.

• In domains with many possible relationships, most published results are wrong (Ioannidis, PLoS Medicine, 2005).

• Only 1/4 of 67 published experiments on drug targets could be reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)

• Only 39% of key Psychology experiments could be reproduced (Nature News, 2015).

Poor statistics?

• Naive use of p-value calculations across fields.

• Banning of the Null Hypothesis Significance Testing Procedure in Basic and Applied Social Psychology (Trafimow and Marks, BASP, 2015)

• Not the end of the story…more like the tip of the iceberg (Leek and Peng, Nature 2015)
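
One way to see why naive p-value use goes wrong is to ask how often at least one test comes up "significant" by chance alone. The sketch below is illustrative only: it assumes independent tests, a 0.05 threshold, and that every null hypothesis is true. None of these figures come from the talk.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests
    when every null hypothesis is actually true (illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # The chance of a spurious "discovery" grows quickly with the number of tests.
    for m in (1, 10, 20, 100):
        print(f"{m:>3} tests: {familywise_error_rate(m):.2f}")
```

With 20 such tests the chance of at least one false positive is already about 64%.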

Lessons learnt

• Results from individual experiments are probably wrong.

• Bias in your data means your conclusions are even more likely to be wrong.

• Meta-analyses help.

• Understand how you got the data you have.

Sequence Read Archive

• Central repository of sequence data.

• Nearly 30,000 genomic and transcriptomics experiments stored and freely available.

• 2 x 10^15 nucleotides stored

• Based on Next Generation Sequencing

• Step reduction in cost of sequencing

• ~$thousands for a human genome

• Potentially an enormous resource

• But how do you get that data?

Good news

• SRA data is open

• Stored in a sensible way (uses SQL)

• API and documentation to access it
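
Because the metadata can be queried with SQL, a question such as "how many experiments say anything at all about their protocol?" takes only a few lines. The sketch below uses an in-memory SQLite database with a hypothetical, heavily simplified table; the table name, column names and accessions are invented for illustration and are not the real SRA schema.

```python
import sqlite3

# Hypothetical toy schema and rows; the real SRA metadata tables differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment ("
    "accession TEXT PRIMARY KEY, library_strategy TEXT, protocol_text TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?)",
    [
        ("EXP0001", "RNA-Seq", "Poly-A selection, random hexamer priming"),
        ("EXP0002", "RNA-Seq", None),  # field left empty by the submitter
        ("EXP0003", "WGS", ""),        # field present but blank
    ],
)


def count_with_protocol(connection):
    """Return (experiments with non-empty protocol text, total experiments)."""
    row = connection.execute(
        "SELECT SUM(CASE WHEN TRIM(COALESCE(protocol_text, '')) <> '' "
        "THEN 1 ELSE 0 END), COUNT(*) FROM experiment"
    ).fetchone()
    return row[0], row[1]


described, total = count_with_protocol(conn)
print(f"{described} of {total} experiments have any protocol text")
```

A count like this only shows whether a field was filled in, not whether what was written is usable.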

Mucky business

• Data stored in SRA are short reads.

• ~100 nucleotide-long fragments which are then assembled.

• Very long pipeline to get from a sample to this step.

• Pipeline (Protocol in their lingo) is VARIABLE

Obvious question

• Is there any evidence of bias in the data due to varying the protocol?

Even More Obvious Question

• Where is the metadata on the pipeline (protocol)?

4% of experiments describe all of the steps

What’s more…

• Metadata are stored as text fields.

• Hugely difficult task to parse.

• Submitters are not obliged to fill this data in.

• Confusion about what level to enter data in.
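
To give a feel for why free-text metadata is so hard to parse, here is a minimal keyword-matching sketch. The step names and regular expressions are hypothetical, chosen only for illustration; they are not the method used in the study.

```python
import re

# Hypothetical protocol steps and patterns, for illustration only.
STEP_PATTERNS = {
    "fragmentation": re.compile(r"fragment|sonicat|shear", re.IGNORECASE),
    "priming": re.compile(r"hexamer|oligo[- ]?dT|prim(?:er|ing)", re.IGNORECASE),
    "amplification": re.compile(r"\bPCR\b|amplif", re.IGNORECASE),
}


def steps_mentioned(protocol_text):
    """Return the set of step names whose pattern appears in the free text."""
    if not protocol_text:
        return set()
    return {
        step
        for step, pattern in STEP_PATTERNS.items()
        if pattern.search(protocol_text)
    }


def describes_all_steps(protocol_text):
    """True only if every step in STEP_PATTERNS is mentioned at least once."""
    return steps_mentioned(protocol_text) == set(STEP_PATTERNS)


if __name__ == "__main__":
    examples = [
        "RNA was sheared by sonication, primed with random hexamers "
        "and amplified with 12 cycles of PCR.",
        "Standard Illumina protocol",
        "PCR-free library",
    ]
    for text in examples:
        print(sorted(steps_mentioned(text)), "<-", text)
```

Note that "PCR-free library" is flagged as mentioning amplification even though it says the opposite. Simple keyword matching cannot tell the two apart, which is part of what makes text fields so difficult to work with.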

Bottom line

• For much of the SRA data, there is a “known unknown” about biases due to preparation.

• It’s very unlikely we’ll ever be able to figure that out.

Why should you be paying attention?

• As a member of the public - it’s your money down the drain ($10^8-$10^9)

• As a researcher - all of this undermines confidence in Science as a whole.

• If you work with big (and, more particularly, complex) data - the same issues will crop up for you.

Answers?

• Understand how you got your data - even if it’s a step for modelling.

• Metadata is crucial.

• Organising your data is crucial.

• Use Ontologies

• Use discrete keywords

• Get people to use it
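
As a rough illustration of what discrete keywords buy you, the sketch below validates a submitted record against a small controlled vocabulary instead of accepting free text. The field names and permitted terms are hypothetical; in practice they would come from an agreed ontology.

```python
from dataclasses import dataclass
from enum import Enum


# Hypothetical controlled vocabularies; a real ontology would supply the terms.
class Fragmentation(Enum):
    SONICATION = "sonication"
    ENZYMATIC = "enzymatic"
    NONE = "none"


class Priming(Enum):
    RANDOM_HEXAMER = "random_hexamer"
    OLIGO_DT = "oligo_dt"


@dataclass(frozen=True)
class ProtocolAnnotation:
    fragmentation: Fragmentation
    priming: Priming


def parse_annotation(record: dict) -> ProtocolAnnotation:
    """Build an annotation from a submitted record.

    Raises KeyError if a field is missing and ValueError if a value
    is not one of the permitted keywords.
    """
    return ProtocolAnnotation(
        fragmentation=Fragmentation(record["fragmentation"]),
        priming=Priming(record["priming"]),
    )


if __name__ == "__main__":
    ok = parse_annotation({"fragmentation": "sonication", "priming": "oligo_dt"})
    print(ok)
    try:
        parse_annotation({"fragmentation": "we sheared it a bit", "priming": "oligo_dt"})
    except ValueError as err:
        print("rejected:", err)
```

Rejecting a submission at entry time is far cheaper than trying to interpret a free-text field years later.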

In summary: we want to do all the clever stuff…

Most of the time we need to deal with a ton of pitchblende to find the milligram of radium.
