on the importance (and absence) of annotation in next generation sequencing data

The importance (and absence) of annotation in the Next Generation Sequence Data. Hugh Shanahan & Jamie Alnasir, @hughshanahan. Results to be published in GigaScience.


Post on 21-Jul-2015


TRANSCRIPT

Page 1: On the importance (and absence) of annotation in Next Generation Sequencing Data

The importance (and absence) of annotation in the Next Generation Sequence Data

Hugh Shanahan & Jamie Alnasir

@hughshanahan

Results to be published in GigaScience

Page 2: On the importance (and absence) of annotation in Next Generation Sequencing Data

It was the best of times

• Many exciting experiments based on gathering huge amounts of data.

• 100,000 Genomes in the UK, many others

• Elixir - Exabytes of biomedical data in the next decade

• Large experiments - SKA, LHC

• Opening up of Government data

• Up ahead - Sensor networks and Monitoring Cities

• Machine Learning is now a widely accepted tool in analysing data and in making decisions.

• Evidence-based policy becoming the norm.

Page 3: On the importance (and absence) of annotation in Next Generation Sequencing Data

It was the worst of times

• Leaks appearing in the Scientific process.

• In domains with many possible relationships, most published results are wrong (Ioannidis, PLoS Medicine, 2005).

• Only about 1/4 of 67 published experiments on drug targets could be reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)

• Only 39% of key Psychology experiments could be reproduced (Nature News, 2015).
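The Ioannidis point can be made concrete with the positive predictive value formula from that paper, PPV = (1 - β)R / (R - βR + α), where R is the ratio of true to false relationships among those tested. The following is a minimal sketch; the function name and the example values of R, α and power are illustrative choices, not figures from these slides.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.8):
    """Probability that a statistically significant finding is actually true.

    prior_odds: R, the ratio of true to false relationships being tested.
    alpha:      type I error rate.
    power:      1 - beta, the chance of detecting a true relationship.
    """
    true_positives = power * prior_odds   # (1 - beta) * R
    false_positives = alpha               # alpha * 1 (per false relationship)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative settings only: a well-powered test of a plausible
    # hypothesis versus an underpowered search among many candidates.
    for odds, power in [(1.0, 0.8), (0.01, 0.2)]:
        ppv = positive_predictive_value(odds, power=power)
        print(f"R={odds}, power={power}: PPV={ppv:.3f}")
```

When few of the tested relationships are true (small R) and power is low, the PPV drops well below one half, which is the sense in which most published findings in such domains are expected to be wrong.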

Page 4: On the importance (and absence) of annotation in Next Generation Sequencing Data

Poor statistics?

• Naive use of p-value calculations across fields.

• Use of the Null Hypothesis Significance Testing Procedure banned in Basic and Applied Social Psychology (Trafimow and Marks, BASP, 2015)

• Not the end of the story…more like the tip of the iceberg (Leek and Peng, Nature 2015)

Page 5: On the importance (and absence) of annotation in Next Generation Sequencing Data

Lessons learnt

• Results from individual experiments are probably wrong.

• Bias in your data means your conclusions are even more likely to be wrong.

• Meta-analyses help.

• Understand how you got the data you have.

Page 6: On the importance (and absence) of annotation in Next Generation Sequencing Data

Sequence Read Archive

• Central repository of sequence data.

• Nearly 30,000 genomic and transcriptomics experiments stored and freely available.

• 2 x 10^15 nucleotides stored

Page 7: On the importance (and absence) of annotation in Next Generation Sequencing Data
Page 8: On the importance (and absence) of annotation in Next Generation Sequencing Data

• Based on Next Generation Sequencing

• Step reduction in cost of sequencing

• ~$thousands for a human genome

• Potentially an enormous resource

• But how do you get that data?

Page 9: On the importance (and absence) of annotation in Next Generation Sequencing Data

Good news

• SRA data is open

• Stored in a sensible way (uses SQL)

• API and documentation to access it
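Because the metadata sit in SQL tables, a first check on annotation can be a single query. The sketch below uses an in-memory SQLite database with a toy `experiment` table; the table layout, column names and accessions are hypothetical stand-ins and should not be read as the real SRA schema.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical experiment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment ("
        " accession TEXT PRIMARY KEY,"
        " library_strategy TEXT,"
        " protocol_text TEXT)"  # free-text protocol description (assumed)
    )
    rows = [  # made-up records, not real SRA accessions
        ("EXP0001", "RNA-Seq", "Fragmented by sonication, random hexamer priming."),
        ("EXP0002", "RNA-Seq", None),
        ("EXP0003", "WGS", "   "),
        ("EXP0004", "RNA-Seq", "Standard protocol."),
    ]
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)", rows)
    return conn


def protocol_coverage(conn):
    """Return (annotated, total): experiments with non-blank protocol text."""
    total, annotated = conn.execute(
        "SELECT COUNT(*),"
        " SUM(CASE WHEN protocol_text IS NOT NULL"
        "          AND TRIM(protocol_text) <> '' THEN 1 ELSE 0 END)"
        " FROM experiment"
    ).fetchone()
    return (annotated or 0), total


if __name__ == "__main__":
    annotated, total = protocol_coverage(build_toy_db())
    print(f"{annotated} of {total} toy experiments have any protocol text")
```

A non-blank field is only a lower bar: it says nothing about whether the text actually describes the preparation steps.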

Page 10: On the importance (and absence) of annotation in Next Generation Sequencing Data

Mucky business

• Data stored in SRA are short reads.

• ~100 nucleotide-long fragments which are then assembled.

• Very long pipeline to get from a sample to this step.

• Pipeline (Protocol in their lingo) is VARIABLE

Page 11: On the importance (and absence) of annotation in Next Generation Sequencing Data
Page 12: On the importance (and absence) of annotation in Next Generation Sequencing Data

Obvious question

• Is there any evidence of bias in the data due to varying the protocol?

Page 13: On the importance (and absence) of annotation in Next Generation Sequencing Data

Even More Obvious Question

• Where is the metadata on the pipeline (protocol)?

Page 14: On the importance (and absence) of annotation in Next Generation Sequencing Data

4% of experiments describe all of the steps

Page 15: On the importance (and absence) of annotation in Next Generation Sequencing Data

What’s more…

• Metadata are stored as text fields.

• Hugely difficult task to parse.

• Submitters are not obliged to fill this data in.

• Confusion about what level to enter data in.
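To illustrate why free-text metadata are so hard to parse, here is a rough sketch of the kind of keyword matching one is reduced to. The step names and regular expressions are assumptions made for illustration; they are not the method used to obtain the 4% figure, and real submissions would defeat patterns this simple.

```python
import re

# Hypothetical protocol steps and crude patterns that might signal them.
STEP_PATTERNS = {
    "fragmentation": re.compile(r"fragment|sonicat|shear", re.IGNORECASE),
    "priming": re.compile(r"random hexamer|oligo[- ]?\(?dT", re.IGNORECASE),
    "amplification": re.compile(r"\bPCR\b|amplif", re.IGNORECASE),
    "size_selection": re.compile(r"size[- ]select|gel[- ]purif", re.IGNORECASE),
}


def steps_mentioned(text):
    """Return the set of steps whose pattern appears in a free-text field."""
    if not text:
        return set()
    return {step for step, pattern in STEP_PATTERNS.items() if pattern.search(text)}


def fraction_fully_described(texts):
    """Fraction of records whose text mentions every step in STEP_PATTERNS."""
    texts = list(texts)
    if not texts:
        return 0.0
    all_steps = set(STEP_PATTERNS)
    complete = sum(1 for text in texts if steps_mentioned(text) == all_steps)
    return complete / len(texts)


if __name__ == "__main__":
    examples = [  # invented free-text entries
        "RNA was fragmented, primed with random hexamers, "
        "amplified by PCR and size-selected on a gel.",
        "Standard Illumina protocol.",
        None,
    ]
    print(f"fully described: {fraction_fully_described(examples):.2f}")
```

Every synonym, abbreviation or misspelling needs its own pattern, and a phrase such as "standard protocol" carries no recoverable information at all.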

Page 16: On the importance (and absence) of annotation in Next Generation Sequencing Data

Bottom line

• For much of the SRA data, there is a “known unknown” about biases due to preparation.

• It’s very unlikely we’ll ever be able to figure that out.

Page 17: On the importance (and absence) of annotation in Next Generation Sequencing Data

Why should you be paying attention?

• As a member of the public - it’s your money down the drain ($10^8-$10^9)

• As a researcher - all of this undermines confidence in Science as a whole.

• If you work with big (and more particularly) complex data - the same issues will crop up for you.

Page 18: On the importance (and absence) of annotation in Next Generation Sequencing Data

Answers?

• Understand how you got your data - even if it’s a step for modelling.

• Metadata is crucial.

• Organising your data is crucial.

• Use Ontologies

• Use discrete keywords

• Get people to use it
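A minimal sketch of what discrete keywords buy you: if each protocol step must be chosen from a fixed vocabulary, a submission can be checked mechanically. The field names and terms below are hypothetical placeholders, not drawn from any real ontology or from the SRA submission format.

```python
from enum import Enum


class FragmentationMethod(Enum):  # placeholder vocabulary
    SONICATION = "sonication"
    ENZYMATIC = "enzymatic"
    CHEMICAL = "chemical"
    NONE = "none"


class PrimingMethod(Enum):  # placeholder vocabulary
    RANDOM_HEXAMER = "random_hexamer"
    OLIGO_DT = "oligo_dT"


# Every submission must supply one term per field (assumed policy).
REQUIRED_FIELDS = {
    "fragmentation": FragmentationMethod,
    "priming": PrimingMethod,
}


def validate_protocol(record):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, vocabulary in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing required field: {field}")
            continue
        try:
            vocabulary(value)  # lookup by value; raises ValueError if unknown
        except ValueError:
            allowed = ", ".join(member.value for member in vocabulary)
            errors.append(f"{field}={value!r} is not one of: {allowed}")
    return errors


if __name__ == "__main__":
    print(validate_protocol({"fragmentation": "sonication", "priming": "oligo_dT"}))
    print(validate_protocol({"fragmentation": "we sonicated it"}))
```

Rejecting a submission at entry time is what addresses the last bullet: the vocabulary only helps if people are required to use it.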

Page 19: On the importance (and absence) of annotation in Next Generation Sequencing Data

In summary: we want to do all the clever stuff…

Page 20: On the importance (and absence) of annotation in Next Generation Sequencing Data

Most of the time we need to deal with a ton of pitchblende to find the milligram of radium.