university of manchester symposium 2012: extraction and representation of in silico biological...

Extraction and Representation of in silico Biological Methods from the Literature

Geraint Duck

Supervisors: Robert Stevens, Goran Nenadic and David Robertson

Advisor: Joshua Knowles

School of Computer Science, University of Manchester

Importance of Method in Science

•  Understanding
   – Key part of research, central to science
   – Reproducibility and replication
   – What? Why? Where? How? When?
   – Extension

•  Advise/evaluate
   – “Current Approach”
   – “Best Practice”

Background

•  In silico: performed on a computer, or through computer simulation

•  Bioinformatics is a resource-focused domain
   – Numerous resources appearing
   – Literature is growing rapidly

•  Resource availability and usage is central to biological research

•  Current attempts are often manually curated and/or incomplete

The Method to Obtain a Method

1.  Extraction
    – Automatically extract resource and task mentions from the bioinformatics literature
       •  This presentation focuses on this step

2.  Representation and Analysis
    – Evaluate the extracted mentions for patterns of representation

3.  Exploration
    – Provide a means of exploring the methods extracted to aid other research/researchers

Key Hypothesis: Resource ordering implies method

•  An analogy – baking a cake:
   – Ingredients: butter, eggs, flour, sugar, etc…
   – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30 mins…


Example: Lagerström et al. (2006)

… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.


Example: Lagerström et al. (2006)

Extracted mentions, in order of appearance (resource, task):

•  GenBank
•  BLAT, aligned
•  BLAST, searched
•  ClustalW, aligned
•  SEQBOOT (Phylip), bootstrapped
•  TreePuzzle, estimated
•  ClustalW, aligned
•  infoalign (EMBOSS), scored
•  MiniTab, statistics
•  MS Excel, graphs plotted
•  MiniTab, graphs plotted

Method stages labelled on the slide: Tree Construction; Sequence and Tree Analysis; Result Visualisation; Sequence Alignment
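As a rough illustration of the key hypothesis, the ordered list of resources in the Lagerström excerpt could be recovered by a plain dictionary match that remembers where each hit occurs. This is a minimal sketch, not the bioNerDS implementation: the `RESOURCES` list and the `ordered_mentions` helper are assumptions made for this example, and matching is exact and case-sensitive.

```python
import re

# Hypothetical mini-dictionary; a real resource dictionary is far larger.
RESOURCES = ["BLAT", "GenBank", "BLAST", "ClustalW", "SEQBOOT", "Phylip",
             "TreePuzzle", "infoalign", "EMBOSS", "MiniTab", "Microsoft Excel"]

def ordered_mentions(text, names=RESOURCES):
    """Return (offset, name) pairs for every dictionary hit, in text order."""
    hits = []
    for name in names:
        # Word boundaries stop a short name matching inside a longer token.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), name))
    return sorted(hits)

excerpt = ("all sequences were aligned using BLAT 3.0, in which case the "
           "GenBank sequence was used; divided by BLAST searches; combined "
           "into a FASTA file and aligned using ClustalW 1.82")

# The order of first appearance is the candidate "method" ordering.
resource_order = [name for _, name in ordered_mentions(excerpt)]
print(resource_order)
```

Note that "FASTA" is deliberately left out of the dictionary here: in the excerpt it names a file format rather than a tool, which is the kind of ambiguity a plain lookup cannot resolve.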

Example…

•  Multiple methods
   – Usage counts
   – Recentness of use
   – “Best practice”

Challenges - Ambiguity

•  leg
•  white
•  cab

•  HIV
   – Human immunodeficiency virus
   – Human immunovirus

•  analysis
•  Network
•  graph

•  DIP
   – distal interphalangeal
   – Database of Interacting Proteins

Challenges - Variability

•  Orthographics
   – Swiss Prot
   – SWISS-PROT
   – SwissProt

•  Misspellings and typos
   – One paper, same resource, spelt 3 different ways

•  Abbreviations
   – Different authors can use different acronyms for the same thing
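One common way to cope with orthographic variation is to compare names on a normalised key, and to fall back on approximate string matching for typos. The sketch below is an assumption about how this might be done, not a description of bioNerDS itself; `normalise` is a hypothetical helper and the 0.8 cut-off is arbitrary.

```python
import difflib
import re

def normalise(name):
    """Collapse case, spaces, hyphens and underscores into one comparison key."""
    return re.sub(r"[\s\-_]+", "", name).lower()

variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
keys = {normalise(v) for v in variants}
print(keys)  # all three variants share a single key

# Misspellings need approximate matching against the known keys.
known_keys = sorted(keys)
typo_match = difflib.get_close_matches(normalise("SwisProt"), known_keys,
                                       n=1, cutoff=0.8)
print(typo_match)
```

Normalisation handles spacing and case cheaply; the fuzzy step is looser and can introduce false positives for short names, so the cut-off would need tuning on real data.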


Name Composition

•  Majority are single nouns
   – includes acronyms

•  6% lowercase common nouns
   – affy, bioconductor

•  A few contained numbers
   – S4, t2prhd

•  A few misclassified as verbs
   – …each query protein is first BLASTed with…
   – …held near their equilibrium values using SHAKE.
   – …graphical representations were achieved using dot v1.10…

Name Composition

•  Longest names (most tokens)
   – Corpus: 5 tokens (Gene Expression Profile Analysis Suite)
   – Dictionary: 12 tokens (Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences)

•  Evaluated token frequencies within our dictionary
   – Long-tail curve
   – 87% used only once
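The token statistics above (longest name, token frequencies, share of tokens used only once) are simple to compute once the dictionary is available as a list of names. The following is a minimal sketch on a made-up five-entry list; the `names` data is hypothetical and does not reproduce the figures reported on the slide.

```python
from collections import Counter

# Hypothetical stand-in for the resource-name dictionary.
names = ["Gene Expression Profile Analysis Suite", "Protein Data Bank",
         "Gene Ontology", "Genome Browser", "Protein Sorting Signals"]

# Frequency of each lower-cased, whitespace-delimited token across all names.
token_counts = Counter(tok.lower() for name in names for tok in name.split())

# Tokens that occur exactly once form the long tail of the distribution.
singletons = [tok for tok, n in token_counts.items() if n == 1]
share_once = len(singletons) / len(token_counts)

# Longest name measured in tokens.
longest = max(names, key=lambda n: len(n.split()))

print(token_counts.most_common(3))
print(f"{share_once:.0%} of distinct tokens are used only once")
print(longest, len(longest.split()))
```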

[Figure: Token Frequency within Dictionary Database and Software Names. Bar chart of token frequency (y-axis, 0 to 120) against the top tokens (words) (x-axis, 0 to 150). Labelled tokens: Genome, Protein, Gene, Sequence, Human.]

Named Entity Recognition (NER)

•  Variety of NER uses
   – Species
   – Gene/protein names
   – Chemical names

•  Variety of NER accuracy
   – 95% F-score species (LINNAEUS)
   – 73% F-score (strict) gene name (ABNER)
   – Over 70% F-score chemical names (OSCAR3)

bioNerDS

•  Automatically matches database and software names in the literature
   – Uses dictionary, rules and clues

•  F-scores between 63 and 91%
   – Mixed results depending on corpus
   – Issues of multiple mentions of a single resource in one paper
   – Ambiguity and variability…

http://bionerds.sourceforge.net/

System Overview

•  Pre-processing
   – Stanford parser
   – libstemmer

•  bioNerDS
   – Dictionary lookup (against the dictionary)
   – Find unknown mentions
   – Filter likely FPs
   – Combine the scores
   – Second pass with mentions above the threshold
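The overview names five stages: dictionary lookup, finding unknown mentions, filtering likely false positives, combining the scores, and a second pass with the mentions above the threshold. The toy pipeline below shows one way such stages could fit together. It is an illustrative sketch only: the dictionary, clue words, stop list, score values, threshold and the reading of the "second pass" are all assumptions made for this example, not details taken from bioNerDS.

```python
import re

# All names and values below are hypothetical, for illustration only.
DICTIONARY = {"BLAST", "ClustalW", "GenBank"}
CLUES = ("using", "database", "software", "package")
LIKELY_FALSE_POSITIVES = {"DNA", "PCR"}
THRESHOLD = 1.0

def score_sentence(sentence):
    """Score candidate names in one sentence."""
    has_clue = any(clue in sentence.lower() for clue in CLUES)
    scores = {}
    for token in re.findall(r"[A-Za-z][A-Za-z0-9-]+", sentence):
        if token in LIKELY_FALSE_POSITIVES:
            continue                  # filter likely false positives
        if token in DICTIONARY:
            scores[token] = 1.0       # dictionary lookup
        elif has_clue and token[1:] != token[1:].lower():
            scores[token] = 0.5       # unknown name: inner capitals plus a clue
    return scores

def extract(sentences):
    """Combine per-sentence scores, then re-scan for the accepted names."""
    combined = {}
    for sentence in sentences:
        for name, score in score_sentence(sentence).items():
            combined[name] = combined.get(name, 0.0) + score
    accepted = sorted(n for n, s in combined.items() if s >= THRESHOLD)
    # Second pass: report every occurrence of each accepted name.
    return [(i, name) for i, sentence in enumerate(sentences)
            for name in accepted
            if re.search(r"\b" + re.escape(name) + r"\b", sentence)]

sentences = ["Sequences were aligned using ClustalW.",
             "DNA was compared with TreePuzzle using default settings.",
             "TreePuzzle software produced the trees.",
             "ClustalW results agreed."]
mentions = extract(sentences)
print(mentions)
```

In this sketch an unknown name such as TreePuzzle only passes the threshold because its evidence accumulates over two sentences, which is one motivation for scoring at document level rather than per mention.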

Preliminary Analysis of Resource Usage

•  Used bioNerDS to extract name mentions from two journals:
   – Genome Biology
   – BMC Bioinformatics

•  Analysed differences

bioNerDS: Results

•  Over 36,000 mentions in BMC Bioinformatics

•  Over 15,000 mentions in Genome Biology

•  78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention

•  The most-mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB

•  The general trend across both journals shows most major resources declining in usage
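A usage trend of this kind can be expressed as each resource's share of all mentions in a given year. The sketch below assumes the extractor's output has been reduced to (year, resource) pairs; the `mentions` data is invented for illustration and does not reflect the reported results.

```python
from collections import Counter, defaultdict

# Hypothetical (year, resource) mentions, standing in for extractor output.
mentions = [(2001, "BLAST"), (2001, "BLAST"), (2001, "GenBank"), (2001, "R"),
            (2011, "BLAST"), (2011, "R"), (2011, "R"), (2011, "R"),
            (2011, "GenBank")]

def relative_usage(mentions):
    """Return {year: {resource: share of that year's mentions}}."""
    per_year = defaultdict(Counter)
    for year, resource in mentions:
        per_year[year][resource] += 1
    return {year: {res: n / sum(counts.values()) for res, n in counts.items()}
            for year, counts in per_year.items()}

usage = relative_usage(mentions)
for year in sorted(usage):
    print(year, usage[year])
```

Using shares rather than raw counts matters because the number of published papers grows each year, so a resource can gain mentions in absolute terms while still declining relative to the others.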

[Figure: Relative Usage within the Top 50, by year from 2001 to 2011. Two panels: Genome Biology and BMC Bioinformatics. Series: BLAST, Bioconductor, ClustalW, Ensembl, GenBank, Gene Ontology, R, Swiss-Prot.]

bioNerDS: Full PMC Set

•  Run on full open-access PMC set
   – ~230,000 full-text articles
   – ~1000 different journals
   – Extracted ~1.8M mentions

•  Method?
•  Method fingerprints

•  Trying to extract (data-mine):
   – Ordering
   – Patterns
   – Co-occurrence
   – Relationships
   – Association rules
   – Frequent subsets
   – “Networks”
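Co-occurrence is the simplest of the listed mining targets: count how many papers mention each pair of resources together. The sketch below assumes each paper has already been reduced to its list of extracted resource names; the `papers` data and the `co_occurrence` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-paper resource lists, standing in for extractor output.
papers = {
    "paper_a": ["BLAST", "ClustalW", "Phylip"],
    "paper_b": ["BLAST", "ClustalW", "R"],
    "paper_c": ["R", "Bioconductor", "GEO"],
}

def co_occurrence(papers):
    """Count, for each unordered pair of resources, the papers mentioning both."""
    pairs = Counter()
    for resources in papers.values():
        # Sorting the de-duplicated names gives each pair one canonical key.
        for a, b in combinations(sorted(set(resources)), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = co_occurrence(papers)
print(pairs.most_common(3))
```

Pair counts like these are the usual starting point for association rules and frequent subsets; they ignore mention order, so ordering would need the mention offsets kept alongside the names.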


Method Analysis and Exploration

•  Mining “best practice”: Metrics
   – Most common
   – Newest
   – Who uses it
   – What resources it is comprised of

•  Challenges
   – Scientific discourse – provenance information
   – Mention order does not imply order of use

•  Clustering and associations
•  Fingerprints

Conclusion

•  Literature mining of bioinformatics in silico methods

•  Developed bioNerDS: automated resource name extraction

•  Extracting and analysing patterns of resource usage
   – Full PMC corpus

•  Provided a way to extract method for any resource-based domain
   – Applied this to bioinformatics

Thank you

•  Acknowledgements
   – Supervisors:
      •  Robert Stevens
      •  Goran Nenadic
      •  David Robertson
   – Funding:

Resource Mentions per Journal

Journal                                                        Total Articles  Total Mentions  Ratio (mentions per article)
Nucleic Acids Research                                                  7,192         200,339  27.8558
PLoS One                                                               15,791         168,624  10.6785
BMC Bioinformatics                                                      3,982         149,668  37.5861
BMC Genomics                                                            3,203          90,396  28.2223
Genome Biology                                                          2,321          48,976  21.1012
Acta Crystallographica. Section E, Structure Reports Online            11,834          41,383  3.497
BMC Evolutionary Biology                                                1,570          31,222  19.8866
PLoS Computational Biology                                              1,613          30,185  18.7136
PLoS Genetics                                                           1,876          29,734  15.8497
PLoS Pathogens                                                          1,691          20,661  12.2182




•  Evaluating NER
   – True positives, false positives, false negatives
      •  tp: Correct
      •  fp: Returned incorrect
      •  fn: Missed
   – Precision: tp / (tp + fp)
      •  How accurate are the results we obtained
   – Recall: tp / (tp + fn)
      •  How many of the total correct results did we obtain
   – F-score: 2 × P × R / (P + R)
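The three measures translate directly into code. The sketch below applies the definitions above to made-up counts; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-score from raw counts."""
    # Guard against empty denominators (no results returned, or nothing to find).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: 80 correct mentions, 20 spurious, 40 missed.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

Because the F-score is a harmonic mean, it sits closer to the lower of the two values, so a system cannot score well by maximising precision or recall alone.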

