University of Manchester Symposium 2012: Extraction and Representation of in silico Biological...
TRANSCRIPT
Extraction and Representation of in silico Biological Methods from the Literature
Geraint Duck
Supervisors: Robert Stevens, Goran Nenadic and David Robertson
Advisor: Joshua Knowles
School of Computer Science, University of Manchester
Importance of Method in Science
• Understanding – Key part of research, central to science – Reproducibility and replication – What? Why? Where? How? When? – Extension
• Advise/evaluate – “Current Approach” – “Best Practice”
Background
• In silico: performed on a computer, or through computer simulation
• Bioinformatics is a resource-focused domain – Numerous resources appearing – Literature is growing rapidly
• Resource availability and usage is central to biological research
• Current attempts often manually curated and/or incomplete
The Method to Obtain a Method
1. Extraction – Automatically extract resource and task mentions from the bioinformatics literature • This presentation focuses on this step
2. Representation and Analysis – Evaluate the extracted mentions for patterns of representation
3. Exploration – Provide a means of exploring the methods extracted to aid other research/researchers
Key Hypothesis: Resource ordering implies method
• An analogy – baking a cake: – Ingredients: butter, eggs, flour, sugar, etc…
– Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30mins…
Example: Lagerström et al. (2006) … all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
Example: Lagerström et al. (2006)
Key: Resource, task
• Sequence Alignment – GenBank – BLAT, aligned – BLAST, searched – ClustalW, aligned
• Tree Construction – SEQBOOT (Phylip), bootstrapped – TreePuzzle, estimated
• Sequence and Tree Analysis – ClustalW, aligned – infoalign (EMBOSS), scored
• Result Visualisation – MiniTab, statistics – MS Excel, graphs plotted – MiniTab, graphs plotted
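The key hypothesis, that resource ordering implies method, can be illustrated with a minimal sketch. It assumes a small hand-written list of resource names (`RESOURCES`, hypothetical and not the bioNerDS dictionary) and simply reports the order in which those names are first mentioned in a condensed version of the Lagerström et al. excerpt. It is not the extraction method used in the talk.

```python
import re

# Hypothetical mini-dictionary of resource names (an assumption for illustration).
RESOURCES = ["GenBank", "BLAT", "BLAST", "ClustalW", "SEQBOOT",
             "TreePuzzle", "infoalign", "MiniTab"]


def ordered_resources(text, names=RESOURCES):
    """Return the known resource names in order of first mention in the text."""
    # Try longer names first so a short name never pre-empts a longer one.
    alternatives = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in alternatives) + r")\b")
    seen, order = set(), []
    for match in pattern.finditer(text):
        name = match.group(1)
        if name not in seen:
            seen.add(name)
            order.append(name)
    return order


# Condensed from the excerpt quoted on the slide.
excerpt = ("all sequences were aligned using BLAT 3.0 in which case the GenBank "
           "sequence was used, divided by BLAST searches, aligned using ClustalW 1.82, "
           "bootstrapped using SEQBOOT, branch lengths were estimated in TreePuzzle")

print(ordered_resources(excerpt))
```

In this excerpt BLAT is mentioned before GenBank, so mention order alone is only a first approximation of the order in which resources were actually used.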
Example…
• Multiple methods – Usage counts – Recentness of use – “best-practice”
Challenges - Ambiguity
• leg • white • cab
• HIV – Human immunodeficiency virus
– Human immunovirus
• analysis • Network • graph
• DIP – distal interphalangeal – Database of Interacting Proteins
Challenges - Variability
• Orthographics – Swiss Prot – SWISS-PROT – SwissProt
• Misspellings and typos – One paper, same resource, spelt 3 different ways
• Abbreviations – Different authors can use different acronyms for the same thing
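One way to picture the variability problem is a small normalisation sketch. It assumes that orthographic variants differ only in case, spacing and hyphenation, and uses the standard-library `difflib.SequenceMatcher` as a stand-in for typo tolerance. The function names and the 0.85 threshold are illustrative assumptions, not part of bioNerDS.

```python
import re
from difflib import SequenceMatcher


def normalise_name(name):
    """Collapse case, whitespace and hyphens so orthographic variants compare equal."""
    return re.sub(r"[\s\-\u2010\u2011]+", "", name).lower()


def probably_same(a, b, threshold=0.85):
    """Treat two names as the same resource if their normalised forms are similar.

    The threshold is an arbitrary illustrative value.
    """
    ratio = SequenceMatcher(None, normalise_name(a), normalise_name(b)).ratio()
    return ratio >= threshold


variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
print({normalise_name(v) for v in variants})   # a single normalised form
print(probably_same("SwisProt", "Swiss-Prot"))  # tolerates a one-letter typo
```

Abbreviations are not handled by this kind of string matching; they would need a separate mapping from acronym to full name.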
Name Composition
• Majority are single nouns – includes acronyms
• 6% lowercase common nouns – affy, bioconductor
• A few contained numbers – S4, t2prhd
• A few misclassified as verbs – …each query protein is first BLASTed with… – …held near their equilibrium values using SHAKE. – …graphical representations were achieved using dot v1.10…
• Longest Names (most tokens) – Corpus: 5 – Gene Expression Profile Analysis Suite – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences
• Evaluated token frequencies within our dictionary – Long-tail curve – 87% used only once
[Figure: Token Frequency within Dictionary, Database and Software Names. x-axis: Top 150 Tokens (Words), 0 to 150; y-axis: Token Frequency, 0 to 120; labelled tokens: Genome, Protein, Gene, Sequence, Human.]
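The token-frequency evaluation can be sketched in a few lines. The list of names below is a tiny hypothetical sample, not the bioNerDS dictionary, so the proportions it yields are illustrative and will not match the 87% figure reported for the real dictionary.

```python
from collections import Counter

# Hypothetical dictionary entries (illustrative only).
names = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Human Protein Reference Database",
    "GenBank",
]

# Split each name on whitespace and count lower-cased tokens.
tokens = [tok.lower() for name in names for tok in name.split()]
freq = Counter(tokens)

# Tokens that occur exactly once form the long tail.
singletons = sum(1 for count in freq.values() if count == 1)
share_once = singletons / len(freq)

# Longest name measured in tokens.
longest = max(names, key=lambda n: len(n.split()))

print(freq.most_common(3))
print(f"{share_once:.0%} of distinct tokens occur once")
print(longest, len(longest.split()))
```

Sorting `freq.most_common()` by count and plotting it would give the long-tail curve described on the slide.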
Named Entity Recognition (NER)
• Variety of NER uses – Species – Gene/protein names – Chemical names
• Variety of NER accuracy – 95% F-score species (LINNAEUS) – 73% F-score (strict) gene name (ABNER) – Over 70% F-score chemical names (OSCAR3)
bioNerDS
• Automatically matches database and software names in the literature – Uses dictionary, rules and clues
• F-scores between 63 and 91% – Mixed results depending on corpus – Issues of multiple mentions of a single resource in one paper
– Ambiguity and variability…
http://bionerds.sourceforge.net/
System Overview
[Diagram: bioNerDS pipeline. Components: Stanford parser, libstemmer; Dictionary. Steps: Dictionary look-up; Find unknown mentions; Filter likely FPs; Combine the scores; Second pass with mentions above the threshold.]
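bioNerDS is described as using a dictionary, rules and clues. The toy sketch below shows how such signals could be combined into a score with a threshold and a second pass. Everything in it is an assumption made for illustration: the dictionary entries, the stop list, the single "using X" clue and the weights are invented, and the real system relies on components such as a parser and a stemmer that are not modelled here.

```python
import re

DICTIONARY = {"blast", "clustalw", "genbank"}     # hypothetical dictionary entries
STOPLIST = {"analysis", "network", "graph"}       # hypothetical ambiguous common words
CLUE = re.compile(r"\busing\s+([A-Za-z][\w-]*)")  # hypothetical contextual clue
TOKEN = re.compile(r"[A-Za-z][\w-]*")


def score_mentions(text):
    """Score candidate names: +1.0 per dictionary hit, +0.5 per clue hit (made-up weights)."""
    scores = {}
    for token in TOKEN.findall(text):
        if token.lower() in DICTIONARY:
            scores[token] = scores.get(token, 0.0) + 1.0
    for candidate in CLUE.findall(text):
        scores[candidate] = scores.get(candidate, 0.0) + 0.5
    # Drop candidates that are likely false positives.
    return {name: s for name, s in scores.items() if name.lower() not in STOPLIST}


def extract(text, threshold=1.0):
    """Second pass: return every occurrence of names whose combined score meets the threshold."""
    scores = score_mentions(text)
    accepted = {name for name, s in scores.items() if s >= threshold}
    return [m.group(0) for m in TOKEN.finditer(text) if m.group(0) in accepted]


text = ("Sequences were aligned using ClustalW. Trees were built using analysis "
        "of ClustalW output and using Foo.")
print(score_mentions(text))
print(extract(text))
```

With these invented weights, a name seen only once after "using" stays below the threshold, which is one simple way to trade recall for precision.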
Preliminary Analysis of Resource Usage
• Used bioNerDS to extract name mentions from two journals: – Genome Biology – BMC Bioinformatics
• Analysed differences
bioNerDS: Results
• Over 36,000 mentions in BMC Bioinformatics
• Over 15,000 mentions in Genome Biology
• 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention
• The most frequently mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB
• The general trend across both journals is that most major resources are declining in usage
[Figure: Relative Usage within the Top 50, 2001 to 2011. Panels: Genome Biology; BMC Bioinformatics. Series: BLAST, Bioconductor, ClustalW, Ensembl, GenBank, Gene Ontology, R, Swiss-Prot.]
bioNerDS: Full PMC Set
• Run on full open-access PMC set – ~230,000 full-text articles
– ~1000 different journals – Extracted ~1.8M mentions
• Method? • Method fingerprints
• Trying to extract (data-mine): – Ordering – Patterns – Co-occurrence – Relationships – Association rules – Frequent subsets – “Networks”
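Two of the patterns listed, co-occurrence and ordering, can be sketched with plain counting. The per-article resource lists below are hypothetical, and the minimum support of 2 is an arbitrary choice; this is only meant to show the shape of the data-mining step, not the analysis actually run on the PMC set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-article resource lists, in order of first mention.
articles = [
    ["BLAST", "ClustalW", "Phylip"],
    ["BLAST", "ClustalW", "R"],
    ["GenBank", "BLAST", "ClustalW"],
]

# Co-occurrence: unordered pairs of resources appearing in the same article.
pair_counts = Counter()
for resources in articles:
    for pair in combinations(sorted(set(resources)), 2):
        pair_counts[pair] += 1

# Ordering: adjacent pairs, keeping the order of mention.
ordered_counts = Counter()
for resources in articles:
    for first, second in zip(resources, resources[1:]):
        ordered_counts[(first, second)] += 1

# Frequent subsets of size two, with an arbitrary minimum support of 2 articles.
frequent_pairs = {pair for pair, n in pair_counts.items() if n >= 2}

print(pair_counts.most_common(3))
print(ordered_counts.most_common(3))
print(frequent_pairs)
```

Larger frequent subsets and association rules would follow the same idea with longer combinations and conditional counts.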
Method Analysis and Exploration
• Mining “best-practice”: Metrics – Most common – Newest – Who uses it – Which resources it comprises
• Challenges – Scientific discourse – provenance information – Mention order does not imply order of use
• Clustering and associations • Fingerprints
Conclusion
• Literature mining bioinformatics in silico methods
• Developed bioNerDS: automated resource name extraction
• Extracting and analysing patterns of resource usage – Full PMC corpus
• Provided a way to extract method for any resource-based domain – Applied this to bioinformatics
Thank you
• Acknowledgements – Supervisors:
• Robert Stevens • Goran Nenadic • David Robertson
– Funding:
Resource Mentions per Journal
Journal                                                      Total Articles  Total Mentions  Ratio (Mentions per Article)
Nucleic Acids Research                                                7,192         200,339  27.8558
PLoS One                                                             15,791         168,624  10.6785
BMC Bioinformatics                                                    3,982         149,668  37.5861
BMC Genomics                                                          3,203          90,396  28.2223
Genome Biology                                                        2,321          48,976  21.1012
Acta Crystallographica. Section E, Structure Reports Online          11,834          41,383   3.4970
BMC Evolutionary Biology                                              1,570          31,222  19.8866
PLoS Computational Biology                                            1,613          30,185  18.7136
PLoS Genetics                                                         1,876          29,734  15.8497
PLoS Pathogens                                                        1,691          20,661  12.2182
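The ratio column is total mentions divided by total articles. A short check on the first three rows, using the figures quoted in the table, shows the arithmetic; the variable names are illustrative.

```python
# (journal, total articles, total mentions), copied from the table.
rows = [
    ("Nucleic Acids Research", 7192, 200339),
    ("PLoS One", 15791, 168624),
    ("BMC Bioinformatics", 3982, 149668),
]

# Mentions per article for each journal.
ratios = {journal: mentions / articles for journal, articles, mentions in rows}

for journal, ratio in ratios.items():
    print(f"{journal}: {ratio:.4f}")
```

Ranking by this ratio rather than by raw mention counts puts BMC Bioinformatics ahead of Nucleic Acids Research, since it has fewer articles but more mentions per article.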
Named Entity Recognition (NER)
• Evaluating NER – True positives, false positives, false negatives
• tp: Correct • fp: Returned incorrect • fn: Missed
– Precision: tp / ( tp + fp ) • How accurate are the results we obtained
– Recall: tp / ( tp + fn ) • How many of the total correct results did we obtain
– F-score: 2 x P x R / ( P + R )
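The three measures can be written as a small function. The counts in the usage example are invented for illustration and are not results from bioNerDS or any of the tools named in this talk.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented counts: 80 correct mentions returned, 20 incorrect, 40 missed.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

Because the F-score is a harmonic mean, it stays closer to the lower of precision and recall than a simple average would.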