![Page 1: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/1.jpg)
Text Mining for Biomedicine: Techniques & tools
Sophia Ananiadou
School of Computer Science
National Centre for Text Mining
![Page 2: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/2.jpg)
Outline
• Challenges / objectives of TM in biomedicine
• Terminology processing
  – Term extraction, term variation, named entity recognition
• Resources for TM in biomedicine
• Information Extraction approaches
• Biological Annotation and Event Recognition
• Biomedical text mining services and systems @ NaCTeM
  – TerMine, AcroMine, FACTA
  – Medie, InfoPubMed, KLEIO, PathText
![Page 3: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/3.jpg)
Material
• Further background on TM for Biology: Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Boston, MA: Artech House
• Numerous papers online from bibliography
• See BLIMP http://blimp.cs.queensu.ca/
  – Biomedical Literature (and text) mining publications
![Page 4: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/4.jpg)
Text Mining in biomedicine
• Why biomedicine?
  – Consider just MEDLINE: 17,000,000 references, 40,000 added per month
  – Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) constantly created
  – Impossible to manage such an information overload
![Page 5: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/5.jpg)
Text mining aims
• Extract and discover knowledge hidden in text
• Aid domain experts by automatically:
  – identifying concepts
  – extracting facts/relations
  – discovering implicit links
  – generating hypotheses (based on integration of heterogeneous knowledge sources)
![Page 6: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/6.jpg)
Need for text mining
• Increased availability of full text
  – Information overload
  – Information retrieval is an insufficient solution
• Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of information
• Most information is in textual form (unstructured data)
• Automated aids are needed
![Page 7: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/7.jpg)
From Text to Knowledge: tackling the data deluge through text mining
[Diagram labels: Unstructured Text (implicit knowledge); Structured content (explicit knowledge); Information Retrieval; Information extraction; Semantic metadata; Advanced Information Retrieval; Knowledge Discovery]
![Page 8: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/8.jpg)
Information deluge
• Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of information
• Linking text to databases and ontologies
  – Curators struggling to process scientific literature
  – Discovery of facts and events crucial for gaining insights in biosciences: need for text mining
![Page 9: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/9.jpg)
Medline searches over time
[Chart: MEDLINE searches over time. x-axis: month/year, Jan-97 to Oct-05; y-axis: searches (millions), 0 to 90]
![Page 10: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/10.jpg)
A solution: Text Mining www.nactem.ac.uk
• Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk
• First publicly funded text mining centre in the world
• Focus: biology, medicine, social sciences…
![Page 11: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/11.jpg)
We don’t just press a button…
• TM involves
  – Many components (converters, analysers, miners, visualisers, ...)
  – Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs)
  – Many combinations of components and resources for different applications
  – Many different user requirements and scenarios, training needs
• The best solutions are customised
![Page 12: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/12.jpg)
What NaCTeM is building:
• Resources: ontologies, lexicons, terminologies, thesauri, grammars, annotated corpora
  – BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
• Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers
• NaCTeM is also providing services
• Our related bio-text mining projects
  – REFINE (Representing Evidence For Interacting Network Elements)
  – ONDEX (data integration, workflows, text mining)
  – PathText (from text to pathways)
![Page 13: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/13.jpg)
Individual tools for user data
• Splitters, taggers, chunkers, parsers, NER, term extractors
• Modes of use
  – Demonstrators: for small-scale online use
  – Batch mode: upload data, get email with link to download site when job done
  – Web Services
• Some services are compositions of tools
![Page 14: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/14.jpg)
Aims
• Text mining: discover & extract unstructured knowledge hidden in text
  – Hearst (1999)
• Text mining helps to construct hypotheses from associations derived from text
  – protein-protein interactions
  – associations between genes and phenotypes
  – functional relationships among genes
![Page 15: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/15.jpg)
Impact of text mining
• Extraction of named entities (genes, proteins, metabolites, etc)
• Discovery of concepts allows semantic annotation of documents
  – Improves information access by going beyond index terms, enabling semantic querying
• Construction of concept networks from text
  – Allows clustering, classification of documents
  – Visualisation of concept maps
![Page 16: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/16.jpg)
Impact of TM
• Extraction of relationships (events and facts) for knowledge discovery
  – Information extraction, more sophisticated annotation of texts (event annotation)
  – Beyond named entities: facts, events
  – Enables even more advanced semantic querying
![Page 17: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/17.jpg)
Hypothesis generation from literature
• Swanson experiments (1986) influenced conceptual biology
  – rapid ‘mining’ of candidate hypotheses from the literature
  – migraine and magnesium deficiency (Swanson, 1988)
  – indomethacin and Alzheimer’s disease (Swanson and Smalheiser 1994)
  – Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus 2004)
  – thalidomide for treating a series of diseases such as acute pancreatitis, chronic hepatitis C (Weeber M, Rein et al. 2003)
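Swanson-style hypothesis generation is commonly described as an "ABC" pattern: if A co-occurs with B in some papers and B with C in others, while A and C are never mentioned together, the A to C link is a candidate hypothesis. The sketch below illustrates that idea only; the function name and the co-occurrence pairs are hypothetical toy data, not results taken from the cited studies.

```python
from collections import defaultdict

def candidate_links(cooccurrences, start):
    """Return {C: set of linking B terms} for terms C that share
    intermediate terms with `start` but never co-occur with it directly."""
    neighbours = defaultdict(set)
    for a, b in cooccurrences:
        neighbours[a].add(b)
        neighbours[b].add(a)

    direct = neighbours[start]          # the B terms
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            # keep only C terms with no direct link to the start term
            if c != start and c not in direct:
                candidates[c].add(b)
    return dict(candidates)

# Toy co-occurrence pairs (illustrative assumption, not real literature counts)
toy_pairs = [
    ("migraine", "spreading depression"),
    ("migraine", "platelet aggregation"),
    ("migraine", "headache"),
    ("magnesium", "spreading depression"),
    ("magnesium", "platelet aggregation"),
]

links = candidate_links(toy_pairs, "migraine")
print(links)
```

In a real setting the pairs would come from term co-occurrence over a large collection such as MEDLINE, and candidates would be ranked rather than simply listed.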
![Page 18: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/18.jpg)
Text mining steps
• Information Retrieval yields all relevant texts
  – Gathers, selects, filters documents that may prove useful
  – Finds what is known
• Information Extraction extracts facts & events of interest to user
  – Finds relevant concepts, facts about concepts
  – Finds only what we are looking for
• Data Mining discovers unsuspected associations
  – Combines & links facts and events
  – Discovers new knowledge, finds new associations
![Page 19: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/19.jpg)
From Text to Knowledge: NLP and Knowledge Extraction
[Diagram labels: Text; Annotation Tools; Lexicons and ontologies; Knowledge Extraction Tools; Structured Knowledge]
![Page 20: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/20.jpg)
Challenge: the resource bottleneck
• Lack of large-scale, richly annotated corpora
  – Support training of ML algorithms
  – Development of computational grammars
  – Evaluation of text mining components
• Lack of knowledge resources: lexica, terminologies, ontologies
![Page 21: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/21.jpg)
Text semantic annotation
• Annotation of events and involved named entities
  – Example: “Regulation of Transcription events”
  – BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
• Two different types of annotation levels
  – linguistic annotation levels
  – biological annotation level, in charge of marking the biological knowledge contained in the text
• Linking text with biological knowledge
![Page 22: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/22.jpg)
Events and variables
• Biological events can be centred on:
  – verbs, e.g. activate
  – nouns with verb-like meanings (nominalised verbs), e.g. transcription
• Different parts of a sentence correspond to different types of variables in the event, e.g.
  – What caused the event
    • The narL gene product activates the nitrate reductase operon
  – What was affected by the event
    • Analysis of mutants …
  – Where the event took place
    • These fusions were formed on plasmid cloning vectors
![Page 23: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/23.jpg)
Bio-event incremental annotation
![Page 24: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/24.jpg)
Verb Frame Example
“The narL gene product activates the nitrate reductase operon”
• Verb: activate
• Agent characteristics: protein
• Theme characteristics: operon
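The verb frame above can be pictured as a small data structure: a trigger verb plus arguments, each with a semantic role and an entity type. This is a minimal sketch for illustration; the class and field names are assumptions and do not reflect the annotation format actually used in the project.

```python
from dataclasses import dataclass, field

@dataclass
class Argument:
    role: str          # semantic role, e.g. "AGENT" or "THEME"
    text: str          # text span filling the role
    entity_type: str   # named entity class of the span

@dataclass
class Event:
    trigger: str                               # verb or nominalised verb
    arguments: list = field(default_factory=list)

    def by_role(self, role):
        """Return all arguments annotated with the given role."""
        return [a for a in self.arguments if a.role == role]

# "The narL gene product activates the nitrate reductase operon"
event = Event(
    trigger="activates",
    arguments=[
        Argument("AGENT", "The narL gene product", "protein"),
        Argument("THEME", "the nitrate reductase operon", "operon"),
    ],
)

for arg in event.arguments:
    print(f"{arg.role}: {arg.text} [{arg.entity_type}]")
```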
![Page 25: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/25.jpg)
| Role Name | Description | Phrase Type(s) | Clues | Example |
|---|---|---|---|---|
| AGENT | Drives or instigates event | Entity or event | Typically subject of verb; follows "by" in passives | The narL gene product activates the nitrate reductase operon |
| THEME | Affected by or results from event | Entity or event | Typically object of verb, subject in passives | recA protein was induced by UV radiation |
| MANNER | Method or way in which event is carried out | Event (process), adverb, direction, in vitro, in vivo etc | by, through, via, using | cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR |
![Page 26: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/26.jpg)
| Role Name | Description | Phrase Type(s) | Clues | Example |
|---|---|---|---|---|
| INSTRUMENT | Used to carry out event | Entity | with, with the aid of, via, by, through, using | EnvZ functions through OmpR to control porin gene expression in Escherichia coli K-12 |
| LOCATION | Location of event | Entity | in, on, near, etc | Phosphorylation of OmpR by the osmosensor EnvZ modulates expression of the ompF and ompC genes in Escherichia coli |
| SOURCE | Start point of event | Entity | from | A transducing lambda phage carrying glpD''lacZ, glpR, and malT was isolated from a strain harbouring a glpD''lacZ fusion |
| DESTINATION | End point of event | Entity | to, into | Transcription of gntT is activated by binding of the cyclic AMP (cAMP)-cAMP receptor protein (CRP) complex to a CRP binding site |
![Page 27: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/27.jpg)
| Role Name | Description | Phrase Type(s) | Clues | Example |
|---|---|---|---|---|
| TEMPORAL | Situates event in time or with respect to another event | Normally an event or time interval | during, before or after | The Alp protease activity is detected in cells after introduction of plasmids carrying the alpA gene |
| DESCRIPTIVE | Descriptive information about other entity | Entity | as | It is likely that HyfR acts as a formate-dependent regulator of the hyf operon |
| CONDITION | Environmental conditions or changes in conditions | Entity, event or adverb | in the presence of, in response to | Strains carrying a mutation in the crp structural gene fail to repress ODC and ADC activities in response to increased cAMP |
![Page 28: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/28.jpg)
Named Entity Types

| NE class | Definition |
|---|---|
| DNA | Entities chiefly composed of nucleic acids and their structural or positional references. This includes the physical structure of all DNA-based entities and the functional roles associated with regions thereof. |
| PROTEIN | Entities chiefly composed of amino acids and their positional references. This includes the physical structure and functional roles associated with each type. |
| EXPERIMENTAL | Both physical and methodological entities, either used, consumed or required for a reaction to take place. |
| ORGANISMS | Entities representing individuals or collections of living things and their component parts. |
| PROCESSES | A set of event classes used to label biological processes described in text. |
![Page 29: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/29.jpg)
Example 1
• Agent: The narL gene product (protein)
• Verb: activates
• Theme (what is acted upon): the nitrate reductase operon (operon)
![Page 30: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/30.jpg)
Linguistically Annotated Corpora
• GENIA
  – Domain
    • MeSH terms: Human, Blood Cells, and Transcription Factors
  – Annotation: POS, named entity, parse tree
• Penn BioIE
  – Domain
    • the molecular genetics of oncology
    • the inhibition of enzymes of the CYP450 class
  – Annotation: POS, named entity, parse tree
• Yapex
• GENETAG: a corpus of 20K MEDLINE® sentences for gene/protein NER
![Page 31: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/31.jpg)
Annotation of GENIA corpus: Term & POS
• Term (entity) annotation: 2000+400 abstracts
• Part-of-speech annotation: 2,000 abstracts
![Page 32: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/32.jpg)
The GENIA annotation
• Linguistic annotation: reveals linguistic structures behind the text
  – Part-of-speech annotation: annotates the syntactic category of each word
  – Syntactic tree annotation: annotates the syntactic structure of sentences
• Semantic annotation (ontology-driven annotation): reveals knowledge pieces delivered by the text
  – Term annotation: annotates domain-specific terms
  – Event annotation: annotates events on biological entities
![Page 33: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/33.jpg)
Annotation Tool
• WordFreak http://wordfreak.sourceforge.net/
• Java-based linguistic annotation tool developed at University of Pennsylvania
• Extensible to new tasks and domains
• Customised visualisation and annotation specification
  – Allows annotation process to be made as simple as possible
![Page 34: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/34.jpg)
WordFreak Tool
![Page 35: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/35.jpg)
Resources
![Page 36: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/36.jpg)
What about existing resources?
• Ontologies are important for knowledge discovery
  – They form the link between terms in texts and biological databases
  – Can be used to add meaning, semantic annotation of texts
![Page 37: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/37.jpg)
Link between text and ontologies
[Diagram: text and ontological resources (GO, UMLS, GENIA, KEGG), connected by the labels "Supporting semantics" and "Adding new knowledge"]
![Page 38: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/38.jpg)
Bridging the Gap: Integrating data, text and knowledge
[Diagram: text and ontological resources (GO, UMLS, GENIA, KEGG), connected by the labels "Supporting semantics" and "Adding new knowledge"; Databases (semantic interpretation of data); Mathematical Models (semantic interpretation of models in Systems Biology)]
![Page 39: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/39.jpg)
Resources for Bio-Text Mining
• Lexical / terminological resources
  – SPECIALIST lexicon, Metathesaurus (UMLS)
  – Lists of terms / lexical entries (hierarchical relations)
• Ontological resources
  – Metathesaurus, Semantic Network, GO, SNOMED CT, etc
  – Encode relations among entities
Bodenreider, O. “Lexical, Terminological, and Ontological Resources for Biological Text Mining”, Chapter 3, Text Mining for Biology and Biomedicine, pp. 43-66
![Page 40: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/40.jpg)
SPECIALIST lexicon
• UMLS SPECIALIST lexicon http://SPECIALIST.nlm.nih.gov
• Each lexical entry contains morphological (e.g. cauterize, cauterizes, cauterized, cauterizing), syntactic (e.g. complementation patterns for verbs, nouns, adjectives) and orthographic information (e.g. esophagus, oesophagus)
• General language lexicon with many biomedical terms (over 180,000 records)
• Lexical programs include variation (spelling), base form, inflection, acronyms
![Page 41: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/41.jpg)
Lexicon record
{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}
Kaposi’s sarcoma
Kaposi’s sarcomas
Kaposi’s sarcomata
Kaposi sarcoma
Kaposi sarcomas
Kaposi sarcomata
Source: Browne, A. C., Divita, G. and Lu, C. (2002) The SPECIALIST Lexicon and Lexical Tools. NLM Associates Presentation, 12/03/2002, Bethesda, MD
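A lexicon record like the one above is a set of `key=value` fields in which a key such as `variants` may repeat. The following is a minimal sketch of reading such a record into a dictionary of lists; it assumes one field per line inside the braces, and the function name is hypothetical rather than part of the NLM lexical tools.

```python
def parse_lexicon_record(text):
    """Parse one brace-delimited record into {key: [values, ...]}.

    Assumes one key=value field per line; repeated keys are collected in order.
    """
    record = {}
    body = text.strip().strip("{}")      # drop the enclosing braces
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        record.setdefault(key, []).append(value)
    return record

RECORD = """{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}"""

record = parse_lexicon_record(RECORD)
print(record["base"], record["variants"])
```

Keeping every value in a list avoids losing the repeated `variants` fields, which together license the inflected forms listed on the slide.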
![Page 42: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/42.jpg)
Normalisation (lexical tools)
• Hodgkin Disease
• HODGKIN DISEASE
• Hodgkin’s Disease
• Hodgkin’s disease
• Disease, Hodgkin
• ...
normalise → disease hodgkin
![Page 43: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/43.jpg)
Steps of Norm
Input: Hodgkin’s Diseases
1. Remove genitive → Hodgkin Diseases
2. Replace punctuation with spaces → Hodgkin Diseases
3. Remove stop words → Hodgkin Diseases
4. Lowercase → hodgkin diseases
5. Uninflect each word → hodgkin disease
6. Word order sort → disease hodgkin
Lexical tools of the UMLS http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
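The Norm steps above can be approximated in a few lines. This is a rough sketch, not the NLM implementation: the stop-word set is an assumed sample, and `uninflect` is a naive plural stripper standing in for the lexicon-driven uninflection that the real tool performs.

```python
import re

# Assumed sample of stop words; the real tool uses its own list.
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}

def uninflect(word):
    """Naive singulariser used here as a stand-in for lexicon-based uninflection."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def normalise(term):
    term = re.sub(r"[’']s\b", "", term)          # 1. remove genitive
    term = re.sub(r"[^\w\s]", " ", term)          # 2. punctuation -> spaces
    words = [w for w in term.split()
             if w.lower() not in STOP_WORDS]      # 3. remove stop words
    words = [w.lower() for w in words]            # 4. lowercase
    words = [uninflect(w) for w in words]         # 5. uninflect each word
    return " ".join(sorted(words))                # 6. word order sort

variants = ["Hodgkin Disease", "HODGKIN DISEASE", "Hodgkin’s Disease",
            "Hodgkin’s Diseases", "Disease, Hodgkin"]
for v in variants:
    print(v, "->", normalise(v))
```

Because every variant maps to the same key, the normalised string can serve as an index for matching term occurrences in text against vocabulary entries.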
![Page 44: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/44.jpg)
The Gene Ontology (GO)
• Controlled vocabulary for the annotation of gene products
• http://www.geneontology.org/
• 19,468 terms, 95.3% with definitions
  – 10,391 biological_process
  – 1,681 cellular_component
  – 7,396 molecular_function
![Page 45: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/45.jpg)
Gene Ontology
• GOA database (http://www.ebi.ac.uk/GOA/) assigns gene products to the Gene Ontology
• GO terms follow certain conventions of creation and have synonyms, such as:
  – ornithine cycle is an exact synonym of urea cycle
  – cell division is a broad synonym of cytokinesis
  – cytochrome bc1 complex is a related synonym of ubiquinol-cytochrome-c reductase activity
![Page 46: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/46.jpg)
GO terms, definitions and ontologies in OBO
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
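An OBO term stanza like the one above is a list of `tag: value` lines, where text after ` ! ` is a human-readable comment. The sketch below shows one way to read a single stanza into a dictionary; the function name is hypothetical, and it handles only the tags shown here rather than the full OBO format.

```python
def parse_obo_stanza(text):
    """Parse one OBO stanza into {tag: [values, ...]}.

    Lines such as "[Term]" that open a stanza in a full OBO file are skipped.
    """
    term = {}
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line or line.startswith("["):
            continue
        key, _, value = line.partition(": ")
        if key != "def":
            # drop the trailing "! comment" that names the referenced term
            value = value.split(" ! ", 1)[0]
        term.setdefault(key, []).append(value.strip())
    return term

STANZA = """
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
"""

term = parse_obo_stanza(STANZA)
print(term["id"][0], term["name"][0], "is_a", term["is_a"])
```

Values are kept in lists because tags such as `is_a` and `synonym` can occur more than once per term.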
![Page 47: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/47.jpg)
Metathesaurus
• organised by concept
  – 5M names, 1M concepts, 16M relations
• built from 134 electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms
  – the "source vocabularies"
• common representation
![Page 48: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/48.jpg)
Are the existing knowledge resources sufficient for TM?
No! Why?
1. Limited lexical & terminological coverage of biological sub-domains
2. Resources focused on human specialists
   – GO, UMLS, UniProt
   – ontology concept names frequently confused with terms
![Page 49: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/49.jpg)
Naming conventions
3. Update and curation of resources
   – FlyBase gene name coverage 31% (abstracts) to 84% (full texts)
4. Naming conventions and representation in heterogeneous resources
   – Term formation guidelines from formal bodies, e.g. HUGO, IPI, not uniformly used
   – Problems with integration of resources
     • dystrophin used for 18 gene products
     • “Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, …” (HUGO)
![Page 50: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/50.jpg)
Term variation
5. Terminological variation and complexity of names
   – High correlation between degree of term variation and dynamic nature of biomedicine
   – Variation occurs in both controlled vocabularies and texts, but the variants found in the two differ
   – Exact match methods fail to associate term occurrences in texts with databases
![Page 51: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/51.jpg)
What’s in a name?
Terms, named entities in biology
![Page 52: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/52.jpg)
What’s in a name?
• Breast cancer 1 (BRCA1)
• p53
• Ribosomal protein S27
• Heat shock protein 110
• Mitogen activated protein kinase 15
• Mitogen activated protein kinase kinase kinase 5
From K. Cohen, NAACL 2007
![Page 53: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/53.jpg)
Worst gene names
• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
K. Cohen NAACL 2007
![Page 55: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/55.jpg)
Worst gene names
• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
• SEMA5A
K. Cohen NAACL 2007
![Page 56: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/56.jpg)
Worst gene names
• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
• SEMA5A
• Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
• tie
K. Cohen NAACL 2007
![Page 57: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/57.jpg)
Term ambiguity
NF2
• Neurofibromatosis 2 [disease]
• Neurofibromin 2 [protein]
• Neurofibromatosis 2 gene [gene]
O. Bodenreider, MIE 2005 tutorial
http://www.nactem.ac.uk/
![Page 58: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/58.jpg)
Term ambiguity
– Gene terms may also be common English words
• BAD: human gene encoding a BCL-2 family protein (cf. bad news, bad prediction)
– Gene names are often used to denote gene products (proteins)
• suppressor of sable is used ambiguously to refer to either the gene or the protein
– Existing resources lack information that can support term disambiguation
– Difficult to establish equivalences between term forms and concepts
![Page 59: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/59.jpg)
Homologues
• Cyclin-dependent kinase inhibitor was first introduced to represent a protein family (p27)
– But it is used interchangeably with p27 or p27kip1, as the name of the individual protein and not as the name of the protein family (Morgan 2003).
• NFKB2 denotes the name of a family of 2 individual proteins with separate IDs in Swiss-Prot.
– These proteins are homologues belonging to different species, Homo sapiens & chicken.
![Page 60: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/60.jpg)
Terms
– Term: linguistic realisation of specialised concepts, e.g. genes, proteins, diseases
– Terminology: structured collection of terms (hierarchy) denoting relationships among concepts: part-whole, is-a, specific, generic, etc.
– Terms link text and ontologies
– Mapping is not trivial (main challenge)
![Page 61: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/61.jpg)
Term variation and ambiguity
[Figure: Term1, Term2, Term3 (TEXT) linked to Concept1, Concept2, Concept3 (ONTOLOGY); panels labelled “Term ambiguity” and “Term variation”]
![Page 62: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/62.jpg)
Term mining steps
• Term recognition: Tp53
• Term classification: Gene
• Term mapping: Genome Database, IARC TP53 Mutation Database
![Page 63: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/63.jpg)
Term recognition techniques
• ATR (automatic term recognition) extracts terms (variants) from a collection of documents
• Distinguishes terms vs non-terms
• In NER the steps of recognition and classification are merged; a classified terminological instance is a named entity
• The tasks of ATR and NER share techniques but their ultimate goals are different
– ATR for resource building, lexica & ontologies
– NER first step of IE, text mining
![Page 64: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/64.jpg)
Overview papers
1. S. Ananiadou & G. Nenadic (2006) Automatic Terminology Management in Biomedicine, Text Mining for Biology and Biomedicine, pp. 67-97.
2. M. Krauthammer & G. Nenadic (2004) Term identification in the biomedical literature, JBI 37 (2004) 512-526
3. J.C. Park & J. Kim (2006) Named Entity Recognition, Text Mining for Biology and Biomedicine, pp. 121-142
Detailed bibliography in Bio-Text Mining
1. BLIMP http://blimp.cs.queensu.ca/
2. http://www.ccs.neu.edu/home/futrelle/bionlp/
Book on BioText Mining
1. S. Ananiadou & J. McNaught (eds) (2006) Text Mining for Biology and Biomedicine, Artech House.
Other Bio-Text Mining tutorials
Kevin Cohen (NAACL 2007 tutorial) U. Colorado
![Page 65: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/65.jpg)
Main ATR approaches
ATR
Dictionary based
Rule based
Machine learning
![Page 66: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/66.jpg)
Dictionary NER (1)
• Use terminological resources to locate term occurrences in text
– NCBI http://www.ncbi.nlm.nih.gov/
– EBI http://www.ebi.ac.uk/
– neologisms, variations, ambiguity problematic for simple dictionary look-up
– Ambiguous words e.g. an, for, can …
– spelling variants, punctuation, word order variations
• estrogen / oestrogen
• NF kappa B / NF kB
![Page 67: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/67.jpg)
Dictionary NER (3)
– Tsuruoka & Tsujii (2003) suggest a probabilistic generator of spelling variants, based on edit distance operations (delete, substitute, insert)
• Terms with ED ≤ 1 considered spelling variants
• Used a dictionary of protein terms
– Support query expansion
– Augment dictionaries with variation
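The ED ≤ 1 criterion can be illustrated with a minimal Python sketch. This is a plain edit-distance test, not the probabilistic variant generator of Tsuruoka & Tsujii; the function names and the toy dictionary are assumptions made for illustration.

```python
def within_edit_distance_one(a: str, b: str) -> bool:
    """True if b is reachable from a by at most one insert, delete or substitute."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Make `a` the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: the rest after the mismatch must agree.
        return a[i + 1:] == b[i + 1:]
    # One insertion into the shorter string.
    return a[i:] == b[i + 1:]


def lookup_variants(token, dictionary):
    """Return dictionary terms that are spelling variants (ED <= 1) of token."""
    return [t for t in dictionary
            if within_edit_distance_one(token.lower(), t.lower())]


# Hypothetical toy dictionary, for illustration only.
print(lookup_variants("oestrogen", ["estrogen", "androgen"]))
```

A linear scan over the dictionary is used for clarity; a real dictionary would need an index to make this practical.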
![Page 68: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/68.jpg)
Rule NER (2)
Rule based
PROPER (Fukuda, 1998); Yapex (Franzen, 2002)
![Page 69: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/69.jpg)
Rule based (1)
• Use orthographic, morpho-syntactic features of terms
– Rules that make use of internal term formation patterns (tagging, morphological analysers) e.g. affixes, combining forms
– Do not take into account contextual features
– Dictionaries of constituents e.g. affixes, neoclassical forms included
• Portability to different domains?
![Page 70: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/70.jpg)
Rule-based
• Fukuda (1998) used lexical, orthographic features for protein name recognition e.g. upper case characters, numerals etc.
• PROPER: core and feature elements
– Core: meaning bearing elements
– Feature: function elements
– Example: SAP (core) kinase (feature)
• Core elements extended to feature elements based on concatenation rules (based on POS tags)
![Page 71: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/71.jpg)
Rule-based
• Inspired by PROPER, Yapex uses Swiss-Prot to add core term elements
http://www.sics.se/humle/projects/prothalt/yapex.cgi
• Hou (2003) used Yapex with context information (collocations) appearing with protein names
• Rule based approaches construct rules and patterns manually or automatically
• Difficult to tune to different domains
![Page 72: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/72.jpg)
Machine learning systems
• Learn features from training data for term recognition and classification
• Most ML systems combine recognition and classification
Challenges
– Feature selection and optimisation
– Availability of training data
– Detection of term boundaries
![Page 73: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/73.jpg)
Overview of ML-based NER
• Training phase: manually tagged texts → detecting features, learning model → learned model
• Testing phase: raw texts → tag annotator with model → tagged texts
![Page 74: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/74.jpg)
ML (1)
• Nobata et al. (1999) used a decision tree for NER
• Decision tree: one of the methods to classify a case using training data
– Node: specifies some condition with a subtree
– Leaf: indicates a class
• Features:
– Part-of-speech information
– Orthographic information
– Term lists
![Page 75: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/75.jpg)
Example of a decision tree
Each node has one condition:
• Is the current word in the Protein term list? (Yes / No)
• What is the next word’s POS? (Noun / Verb / …)
• Does the previous word have figures? (Yes / No)
Each leaf has one class:
• PROTEIN, DNA, RNA, Unknown, …
![Page 76: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/76.jpg)
ML (2)
• Collier (2000) used HMM, orthographic features for term recognition
– HMM looks for most likely sequence of classes corresponding to a word sequence e.g. interleukin-2 protein/DNA
– To find similarities between known words (training set) and unknown words, use character features

| Feature | Examples |
|---|---|
| DigitNumber | [2] protein, [3] DNA |
| GreekLetter | [alpha] protein |
| TwoCaps | [RelB] protein, [TAR] RNA |
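As a rough illustration of such character features, the Python sketch below assigns one of the three feature names from the table to a token. The exact definitions are assumptions: in particular, TwoCaps is taken here to mean "at least two capital letters" (which covers both RelB and TAR), and the Greek-letter list is a small hypothetical sample.

```python
# Hypothetical, incomplete list of spelled-out Greek letters.
GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "epsilon",
                 "kappa", "lambda", "sigma", "omega"}


def character_feature(token: str) -> str:
    """Map a token to a single character feature (first matching rule wins)."""
    if token.isdigit():
        return "DigitNumber"
    if token.lower() in GREEK_LETTERS:
        return "GreekLetter"
    # Assumption: TwoCaps means two or more upper-case letters.
    if sum(ch.isupper() for ch in token) >= 2:
        return "TwoCaps"
    return "Other"


for tok in ["2", "alpha", "RelB", "TAR", "protein"]:
    print(tok, character_feature(tok))
```

Features of this kind let a model treat an unseen token such as a new gene symbol like known tokens with the same surface shape.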
![Page 77: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/77.jpg)
ML (2)
• Use of GENIA resources as training data
– Results depend on training data
• Morgan (2004) used FlyBase to construct a training corpus automatically
– Pattern matching for gene name recognition, yielding a noisily annotated corpus
– HMM was trained on that corpus for gene name recognition
![Page 78: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/78.jpg)
Support Vector Machines (1)
• Kazama trained multi-class SVMs on the GENIA corpus
• Corpus annotated with B-I-O tags
– B tags denote words at beginning of term
– I tags inside term
– O tags outside term
– B-protein tag: word at the beginning of a protein name
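The B-I-O scheme can be made concrete with a short Python sketch that converts annotated term spans into per-token tags. The function name, the span format (token indices, end exclusive) and the example sentence are assumptions made for illustration.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) term spans into B-I-O tags, one per token.

    `start` and `end` are token indices, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words inside the term
    return tags


# Illustrative token sequence with two protein names.
tokens = ["IL-2", "receptor", "activates", "NF", "kappa", "B"]
spans = [(0, 2, "protein"), (3, 6, "protein")]
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(token, tag)
```

Once a corpus is encoded this way, term recognition and classification reduce to assigning one tag per token, which is what a multi-class classifier needs.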
![Page 79: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/79.jpg)
SVMs for NER (2)
• Yamamoto used a combination of features for protein name recognition:
– Morphological, lexical, boundary, syntactic (head noun), domain specific (if term exists in biomedical database)
• Lee used different features for recognition and classification
– Orthographic, prefix, suffix
– Contextual information
![Page 80: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/80.jpg)
Challenges of D-NER
1. IL-2-mediated activation of
2. IL-2 receptor activates
3. IL2-mediated activation of
4. Interleukin 2-mediated activation of
• We use a 3-stage strategy:
1. Use character based tagging, which integrates the tagging process with the dictionary consultation process
2. Use CRF to treat broader term formation patterns
3. Term normalisation in the lexicon, which treats all spelling variants and maps extracted terms to semantic UniProt IDs
![Page 81: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/81.jpg)
Dictionary based NER
![Page 82: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/82.jpg)
Gene/protein recognition steps
[Figure: Text → POS tagging → token sequences → sequential labelling (CRF labelling model) → gene/protein names. Features: word, orthographic, POS, PROTEIN]
1. Analyze a sentence using a dictionary-based POS tagger
2. Add features to tokens
3. Identify gene/protein names by CRF-based sequential labeling based on the standard IOB labeling
![Page 83: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/83.jpg)
Features used for CRFs
• CRFs find the best sequence of labels based on
– state feature: features for each token
– edge feature: features between two adjacent tokens
• State (token) features used
– Word: surface word form of the token
– Orthographic
– POS: the POS of the token
– PROTEIN: whether the token is a known protein name
• Edge features used
– All combinations of state features of two adjacent tokens
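The feature set above can be sketched in Python as plain dictionaries and strings, independent of any particular CRF toolkit. The orthographic encoding (A/a/0/- per character), the function names and the toy protein lexicon are assumptions for illustration, not the features of the actual system.

```python
from itertools import product


def shape(word: str) -> str:
    """Hypothetical orthographic feature: one symbol per character."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append("-")
    return "".join(out)


def state_features(word, pos, protein_lexicon):
    """State (token) features: Word, Orthographic, POS, PROTEIN."""
    return {
        "word": word.lower(),
        "orth": shape(word),
        "pos": pos,
        "protein": str(word.lower() in protein_lexicon),
    }


def edge_features(prev_state, cur_state):
    """Edge features: all combinations of state features of two adjacent tokens."""
    return [f"{k1}={v1}|{k2}={v2}"
            for (k1, v1), (k2, v2) in product(prev_state.items(), cur_state.items())]


lexicon = {"il-2"}  # toy lexicon, for illustration only
s1 = state_features("IL-2", "NN", lexicon)
s2 = state_features("receptor", "NN", lexicon)
print(s1)
print(len(edge_features(s1, s2)))  # 4 x 4 combinations
```

With four state features per token, each pair of adjacent tokens yields sixteen edge features, which is why the edge feature space grows quickly as state features are added.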
![Page 84: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/84.jpg)
Hybrid approaches
• Combine rules, statistics, resources
Hybrid ATR / NER
ABGene (Tanabe & Wilbur)
ARBITER (Rindflesch)
C/NC-value (Frantzi & Ananiadou)
![Page 85: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/85.jpg)
Hybrid (1)
• ABGene: protein and gene name tagger
– Combines ML, transformation rules, dictionaries with statistics
– Protein tagger trained on MEDLINE abstracts by adapting Brill’s tagger
– Transformation rules for recognition of gene, protein names
– Used GO, LocusLink lists of genes, proteins to recover false negative tags
![Page 86: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/86.jpg)
Hybrid (2)
– ARBITER (Assess and Retrieve Binding Terminology) uses
• UMLS Metathesaurus and GenBank to map NPs (binding terms)
• morphological features
• lexical information (head noun)
– EDGAR recognises gene, cell, drug names using co-occurrences of cell, clone, expression
![Page 87: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/87.jpg)
Hybrid (3)
• C/NC value (Frantzi & Ananiadou, 1999)
• C-value
• Linguistic filters
• total frequency of occurrence of string in corpus
• frequency of string as part of longer candidate terms (nested terms)
• number of these longer candidate terms
• length of string
– Output: automatically ranked terms (TerMine)
![Page 88: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/88.jpg)
C-value
• C-value measure extracts multi-word, nested terms
[adenoid [cystic [basal [cell carcinoma]]]]
cystic basal cell carcinoma
ulcerated basal cell carcinoma
recurrent basal cell carcinoma
basal cell carcinoma
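A minimal Python sketch of the commonly cited C-value formula follows. For a candidate a with frequency f(a) and length |a| in words, the score is log2|a| · f(a) if a is not nested, and log2|a| · (f(a) − mean frequency of the longer candidates containing a) otherwise. The frequencies below are invented for illustration, and linguistic filtering is assumed to have happened already.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """freq maps candidate terms (tuples of words) to corpus frequency."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of longer candidate terms in which `a` is nested.
        nesting = [f_b for b, f_b in freq.items()
                   if len(b) > len(a) and contains(b, a)]
        if nesting:
            adjusted = f_a - sum(nesting) / len(nesting)
        else:
            adjusted = f_a
        # Note: log2(1) = 0, so single-word candidates score 0 in this form.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Hypothetical counts, for illustration only.
freq = {
    ("basal", "cell", "carcinoma"): 10,
    ("cystic", "basal", "cell", "carcinoma"): 3,
    ("ulcerated", "basal", "cell", "carcinoma"): 2,
    ("recurrent", "basal", "cell", "carcinoma"): 1,
}
for term, score in sorted(c_value(freq).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

The nested term "basal cell carcinoma" keeps a high score because it occurs often on its own and inside several different longer terms, which is the behaviour the slide describes.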
![Page 89: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/89.jpg)
Term variation
• variation recognition as part of ATR (Nenadic, Ananiadou)
• recognise term forms and link them into equivalence classes
• important if ATR is based on statistics (e.g. frequency of occurrence)
– corpus-based measures are distributed across different variants
– conflation of various surface representations of a given term should improve ATR
![Page 90: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/90.jpg)
Simple variation
• orthographic
– hyphens, slashes (amino acid and amino-acid)
– lower/upper cases (NF-KB and NF-kb)
– spelling variations (tumour and tumor)
– transliterations (oestrogen and estrogen)
• morphological
– inflectional phenomena (plural, possessives)
• lexical
– genuine synonyms (carcinoma and cancer)
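The orthographic cases above can be conflated with a small normalisation step, sketched below in Python. The spelling table is a hypothetical two-entry sample; morphological and lexical variation (plurals, synonyms) are deliberately left out of this sketch.

```python
import re
from collections import defaultdict

# Hypothetical spelling/transliteration table, for illustration only.
SPELLING = {"tumour": "tumor", "oestrogen": "estrogen"}


def normalise(term: str) -> str:
    """Map simple orthographic variants of a term to one canonical key."""
    t = term.lower()                 # lower/upper case
    t = re.sub(r"[-/]", " ", t)      # hyphens and slashes
    words = [SPELLING.get(w, w) for w in t.split()]
    return " ".join(words)


def equivalence_classes(terms):
    """Group term forms that share the same normalised key."""
    classes = defaultdict(list)
    for term in terms:
        classes[normalise(term)].append(term)
    return dict(classes)


print(equivalence_classes(
    ["amino acid", "amino-acid", "NF-KB", "NF-kb", "tumour", "tumor"]))
```

Summing corpus frequencies per equivalence class, rather than per surface form, is one way to keep statistics from being split across variants.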
![Page 93: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/93.jpg)
Biomedical IE/IR Systems
• iHOP
– http://www.ihop-net.org/UniPub/iHOP/
• EBIMed
– http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
• GoPubMed
– http://www.gopubmed.org/
• PubFinder
– http://www.glycosciences.de/tools/PubFinder
• Textpresso
– http://www.textpresso.org/
![Page 94: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/94.jpg)
Acronyms
• Very productive type of term variation
• Acronym variation (synonymy)
– NF kappa B / NF kB / nuclear factor kappa B
• Acronym ambiguity (polysemy) even in controlled vocabularies
– GR: glucocorticoid receptor, glutathione reductase
![Page 95: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/95.jpg)
Acronym recognition
• Schwartz, A. & Hearst, M. (2003) A simple algorithm for identifying abbreviation definitions in biomedical text, PSB 2003, 8, 451-462
• Adar, E. (2004) SaRAD: a simple and robust abbreviation dictionary, Bioinformatics, 20(4) 527-533
• Chang, J.T. & Schütze, H. (2006) Abbreviations in biomedical text, Text Mining for Biology and Biomedicine, pp. 99-119, Artech
• Tsuruoka, Y., Ananiadou, S. & Tsujii, J. (2005) A Machine learning approach to automatic acronym generation, ISMB, BioLink SIG, 25-31
• Okazaki, N. & Ananiadou, S. (2006) Acronym recognition based on term identification, Bioinformatics
![Page 96: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/96.jpg)
The importance of acronym recognition
• Acronyms are among the most productive types of term variation
– 64,242 new acronyms were introduced in 2004 [Chang and Schütze 06]
• Acronyms are used more frequently than full terms
– 5,477 documents could be retrieved by using the acronym JNK, while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase [Wren et al. 05]
• No rules or exact patterns for the creation of acronyms from their full form
![Page 97: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/97.jpg)
Recognition
• Extracting pairs of short and long forms <acronym, long form>
– Distinguishing acronyms from parenthetical expressions
– Search for parentheses in text; single or more words; e.g. Ab (antibody)
– Limit context around ( ); limit number of words according to number of letters in acronym
![Page 98: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/98.jpg)
Letter matching
– Alignment: find all matches between letters of acronyms and their long forms and calculate likelihood (Chang & Schütze)
• Solves problem of acronyms containing letters not occurring in LF
• Choose best alignment based on features, e.g. position of letter etc.
• Finding the optimal weight for each feature is a challenge
http://abbreviation.stanford.edu/
![Page 99: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/99.jpg)
Acronym Recognition
Okazaki, N., Ananiadou, S. (2006) Building an abbreviation dictionary using a term recognition approach. Bioinformatics.
![Page 100: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/100.jpg)
A simple algorithm – Schwartz and Hearst (2003)
• Uses parenthetical expressions as a marker of a short form
… long-form ‘(‘ short-form ‘)’ …
• All letters and digits in a short form must appear in the corresponding long form in the same order
– We used hidden markov model (HMM) to …
– Early repolarization (ER) is an enigma.
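The matching rule can be sketched in Python as a right-to-left scan, in the spirit of Schwartz and Hearst's algorithm: each letter or digit of the short form must be found in the candidate text in order, and the first character must start a word. The function names are assumptions, and the candidate window of min(|SF| + 5, 2·|SF|) words before the parenthesis is used here as an assumed setting.

```python
import re


def find_long_form(short_form: str, candidate: str):
    """Return the shortest suffix of `candidate` matching `short_form`, or None."""
    l = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        c = short_form[s].lower()
        if not c.isalnum():
            continue  # only letters and digits have to match
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while l >= 0 and (candidate[l].lower() != c or
                          (s == 0 and l > 0 and candidate[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None
        l -= 1
    # Extend back to the beginning of the word holding the first match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence: str):
    """Find <short form, long form> pairs using parentheses as the marker."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        words = sentence[:m.start()].split()
        window = min(len(short) + 5, len(short) * 2)  # assumed window size
        long_form = find_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("We used hidden markov model (HMM) to tag names."))
print(extract_pairs("Early repolarization (ER) is an enigma."))
```

Because the scan relies only on letter order, it accepts whatever wording happens to precede the parenthesis, which makes it sensitive to how the long form is phrased in the text.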
![Page 101: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/101.jpg)
Problems of letter-matching approach
• Highly dependent on the expressions in the target text
– o acquired immuno deficiency syndrome (AIDS)
– x acquired syndrome (AIDS)
– x a patient with human immunodeficiency syndrome (AIDS)
– ? magnetic resonance imaging unit (MRI)
– ! beta 2 adrenergic receptor (ADRB2)
– ! gamma interferon (IFN-GAMMA)
(These examples are obtained from actual MEDLINE abstracts)
• Naive with respect to term variations
![Page 102: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/102.jpg)
AcroMine’s approach
• Extract a word or word sequence:
– Co-occurring frequently with an acronym (e.g., TTF-1)
• 1, factor 1, transcription factor 1, thyroid transcription factor 1
– Does not co-occur with other surrounding words
• thyroid transcription factor 1
• Not necessarily based on letter-matching
– Note that this is a difficult case for the letter-matching algorithm
• Prune unlikely candidates
– Nested candidates: transcription factor 1
– Expansions: expression of thyroid transcription factor 1
– Insertions: thyroid specific transcription factor 1
![Page 103: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/103.jpg)
Short-form mining
• Enumerate all short forms in a target text
– Using parentheses as a clue: … ‘(‘ short-form ‘)’ …
– Validation rules for identifying acronyms [Schwartz and Hearst 03]
• It consists of at most two words
• Its length is between two and ten characters
• It contains at least one alphabetic letter
• The first character is alphanumeric
The present system consists of a hidden Markov model (HMM) based automatic speech recognizer (ASR), with a keyword spotting system to capture the machine sensitive words (registered in a dictionary) from the running utterances.
The contextual sentence of HMM and ASR.
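The four validation rules translate directly into a short Python check, sketched here with assumed function names and a simple regular expression for non-nested parentheses.

```python
import re


def is_valid_short_form(s: str) -> bool:
    """Apply the four validation rules to a parenthesised string."""
    s = s.strip()
    return (len(s.split()) <= 2                  # at most two words
            and 2 <= len(s) <= 10                # two to ten characters
            and any(c.isalpha() for c in s)      # at least one alphabetic letter
            and s[0].isalnum())                  # first character is alphanumeric


def enumerate_short_forms(text: str):
    """Collect parenthesised expressions that pass the validation rules."""
    return [m.strip() for m in re.findall(r"\(([^()]*)\)", text)
            if is_valid_short_form(m)]


sentence = ("The present system consists of a hidden Markov model (HMM) based "
            "automatic speech recognizer (ASR), with a keyword spotting system "
            "to capture the machine sensitive words (registered in a dictionary) "
            "from the running utterances.")
print(enumerate_short_forms(sentence))
```

On the example sentence, the parenthetical "registered in a dictionary" is rejected by the word-count rule, leaving HMM and ASR.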
![Page 104: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/104.jpg)
Enumerating long-form candidates for an acronym
• Tokenize a contextual sentence by non-alphanumeric characters (e.g., space, hyphen, etc.)
• Apply Porter’s stemming algorithm [Porter 80]
• Extract terms that match the following pattern: [:WORD:].*$ (.* matches an empty string or words of any length)
Example: We studied the expression of thyroid transcription factor-1 (TTF-1).
– 1
– factor 1
– transcript factor 1
– thyroid transcript factor 1
– of thyroid transcript factor 1
– expression of thyroid transcript factor 1
– studi the expression of thyroid transcript factor 1
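The enumeration step can be sketched in Python as listing every word sequence that ends immediately before the parenthesised acronym. Porter stemming is omitted here to keep the sketch free of external dependencies, so candidates keep their surface forms; the function name is an assumption.

```python
import re


def long_form_candidates(sentence: str, short_form: str):
    """List word sequences ending just before '(short_form)', shortest first."""
    idx = sentence.find("(" + short_form + ")")
    if idx < 0:
        return []
    context = sentence[:idx]
    # Tokenize by non-alphanumeric characters (space, hyphen, etc.).
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", context) if t]
    # Every suffix of the token sequence is a candidate.
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]


sentence = "We studied the expression of thyroid transcription factor-1 (TTF-1)."
for candidate in long_form_candidates(sentence, "TTF-1"):
    print(candidate)
```

Counting how often each candidate occurs across all contextual sentences of the same acronym then provides the frequencies used for scoring.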
![Page 105: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/105.jpg)
Expansions for TTF-1
![Page 106: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/106.jpg)
Top 20 acronyms in MEDLINE
![Page 107: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/107.jpg)
Long-form candidates for acronym ADM
| Candidate | Length | Frequency | Score | Validity |
|---|---|---|---|---|
| adriamycin | 1 | 727 | 721.4 | o |
| adrenomedullin | 1 | 247 | 241.7 | o |
| abductor digiti minimi | 3 | 78 | 74.9 | o |
| doxorubicin | 1 | 56 | 54.6 | x |
| effect of adriamycin | 3 | 25 | 23.6 | Expansion |
| adrenodemedullated | 1 | 19 | 17.7 | o |
| acellular dermal matrix | 3 | 17 | 15.9 | o |
| peptide adrenomedullin | 2 | 17 | 15.1 | Expansion |
| effects of adrenomedullin | 3 | 15 | 13.2 | Expansion |
| resistance to adriamycin | 3 | 15 | 13.2 | Expansion |
| amyopathic dermatomyositis | 2 | 14 | 12.8 | o |
| brevis and abductor digiti minimi | 5 | 11 | 9.8 | Expansion |
| minimi | 1 | 83 | 5.8 | Nested |
| digiti minimi | 2 | 80 | 3.9 | Nested |
| automated digital microscopy | 3 | 1 | 0.0 | match |
| adrenomedullin concentration | 2 | 1 | 0.0 | Nested |
![Page 108: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/108.jpg)
Long-form extraction
•
Long-form candidates are sorted by score in descending order
•
A long-form candidate is considered valid if:
–
It has a score greater than 2.0
–
The words in the long form can be rearranged so that all alphanumeric letters appear in the same order as the short form
–
It is not nested in, or an expansion of, a previously chosen long form
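The validity rules above can be sketched as a small filter. This is an illustrative simplification under stated assumptions: the helper names are hypothetical, the letter-order rule is approximated by a plain subsequence test without rearranging words, and "nested" or "expansion" is approximated by one word sequence containing the other.

```python
def letters_in_order(short_form, long_form):
    """True if the short form's alphanumeric characters occur, in order, in the long form."""
    remaining = iter(c for c in long_form.lower() if c.isalnum())
    return all(c in remaining for c in short_form.lower() if c.isalnum())

def contains_words(outer, inner):
    """True if the word sequence `inner` occurs contiguously inside `outer`."""
    o, i = outer.split(), inner.split()
    return any(o[k:k + len(i)] == i for k in range(len(o) - len(i) + 1))

def select_long_forms(short_form, candidates, threshold=2.0):
    """candidates: list of (long_form, score) pairs; returns the accepted long forms."""
    chosen = []
    for text, score in sorted(candidates, key=lambda pair: -pair[1]):
        if score <= threshold:
            continue  # rule 1: score must exceed the threshold
        if not letters_in_order(short_form, text):
            continue  # rule 2 (simplified): letters must match the short form
        if any(contains_words(text, c) or contains_words(c, text) for c in chosen):
            continue  # rule 3: nested in, or expansion of, a chosen long form
        chosen.append(text)
    return chosen

adm = [("adriamycin", 721.4), ("adrenomedullin", 241.7),
       ("abductor digiti minimi", 74.9), ("doxorubicin", 54.6),
       ("effect of adriamycin", 23.6), ("automated digital microscopy", 0.0)]
print(select_long_forms("ADM", adm))
```

On this subset of the ADM candidates the sketch accepts the rows marked "o" and rejects doxorubicin (letter mismatch), "effect of adriamycin" (expansion) and the zero-score candidate.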
![Page 110: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/110.jpg)
Acronym disambiguation
Sample text: Considerations in the identification of functional RNA structural elements in genomic alignments (Tomas Babak et al.)
http://www.biomedcentral.com/1471-2105/8/33
![Page 111: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/111.jpg)
KLEIO
•
Semantically enriched information retrieval system for biology
•
Offers textual and metadata searches across MEDLINE
•
Provides enhanced searching functionality by leveraging terminology management technologies
http://nactem4.mc.man.ac.uk:8080/Kleio
![Page 112: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/112.jpg)
KLEIO architecture
![Page 113: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/113.jpg)
Text mining modules
1. Acronym recognition and disambiguation
2. Normalisation of biology terms
3. Named entity recognition for gene/protein names
4. Indexing of terms
![Page 114: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/114.jpg)
1. Acronym recognition and disambiguation
•
Recognises acronyms and their definitions from MEDLINE
•
Disambiguates isolated acronyms using their context
•
Maps acronyms into corresponding definitions
Okazaki, N. and Ananiadou, S. (2006) Building an Abbreviation Dictionary using a Term Recognition Approach, in Bioinformatics
![Page 115: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/115.jpg)
Sample text:
Transcription and protein levels of extracellular matrix (ECM) related genes were evaluated in the rat retina after intravitreal (VEGF) injection by polymerase chain reaction, Western blot analysis, and immunohistochemistry.

ECM ► AcroMine results selected, NER ignored
Proposed AcroMine candidate
Acronym: ECM
Definition: extracellular matrix
Term variants: extracellular matrix, extracellular matrices, ...
Proposed NER candidates
NE: ECM
Type: Gene
Full name: Multimerin, Multimerin 1

VEGF ► Equivalent results merged
Proposed AcroMine candidate
Acronym: VEGF
Definition: vascular endothelial growth factor
Term variants: vascular endothelial growth factor, vascular epidermal growth factor, antivascular endothelial growth factor
Proposed NER candidates
NE: VEGF
Type: Gene
Full name: c-fos induced growth factor, vascular endothelial growth factor B, ...
![Page 116: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/116.jpg)
2. Normalisation of biology terms
•
Based on a combination of exact and soft string matching methods
•
Permits efficient look-up and the discovery of ambiguous and variant terms in the resources
•
Uses existing resources to learn term variation patterns automatically
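A combination of exact and soft string matching can be sketched as follows. This is not the KLEIO method: the similarity measure (Python's standard `difflib`), the `cutoff` value, the function names and the tiny dictionary are all assumptions chosen for illustration.

```python
import difflib
import re

def normalise_key(term):
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()

def look_up(term, dictionary, cutoff=0.85):
    """Return (concept, match_kind), trying an exact match before a soft match."""
    index = {normalise_key(name): concept for name, concept in dictionary.items()}
    key = normalise_key(term)
    if key in index:
        return index[key], "exact"
    # Soft match: the closest dictionary entry above the similarity cutoff.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "soft"
    return None, "none"

# Hypothetical two-entry term dictionary.
terms = {"extracellular matrix": "ECM",
         "vascular endothelial growth factor": "VEGF"}
print(look_up("Extracellular-Matrix", terms))
print(look_up("extracellular matrices", terms))
```

The exact step handles case and punctuation variants cheaply, while the soft step catches inflectional variants such as plurals at the cost of possible false matches when the cutoff is set too low.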
![Page 117: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/117.jpg)
3. Named entity recognition for gene/protein names
•
Allows users to specify the entity type they want to retrieve (e.g. gene/protein)
•
Combination of conditional random fields and maximum entropy models to filter out false positives
![Page 118: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/118.jpg)
![Page 119: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/119.jpg)
Fewer documents with a more precise query
![Page 120: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/120.jpg)
Mining associations from MEDLINE
•
FACTA: Finding Associated Concepts with Text Analysis
–
What diseases are related to a particular chemical?
–
What proteins are related to a particular disease?
–
etc.
•
EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
•
PubMatrix http://pubmatrix.grc.nia.nih.gov/
:
•
FACTA http://text0.mib.man.ac.uk/software/facta/
–
Quick and interactive
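The kind of question listed above ("what diseases are related to a particular chemical?") can be approximated by counting concept co-occurrences. The sketch below is plain co-occurrence counting over pre-annotated documents; it is not a description of how FACTA ranks concepts, and the function name and sample data are hypothetical.

```python
from collections import Counter

def associated_concepts(query, documents):
    """Rank concepts by the number of documents in which they co-occur with `query`.

    documents: iterable of sets of concept names found in each document.
    """
    counts = Counter()
    for concepts in documents:
        if query in concepts:
            counts.update(c for c in set(concepts) if c != query)
    return counts.most_common()

# Hypothetical concept annotations for three abstracts.
abstracts = [
    {"nicotine", "Alzheimer's disease", "schizophrenia"},
    {"nicotine", "Alzheimer's disease"},
    {"aspirin", "stroke"},
]
print(associated_concepts("nicotine", abstracts))
```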
![Page 121: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/121.jpg)
Query
![Page 122: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/122.jpg)
Click!
![Page 123: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/123.jpg)
… Alzheimer's disease and schizophrenia. Interestingly, nicotine and similar compounds have been shown to enhance memory function and increase the expression of nAChRs and therefore, could have a therapeutic role in the aforementioned diseases.
![Page 124: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/124.jpg)
Biological Annotation and Event Recognition
![Page 125: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/125.jpg)
Text Annotation
•
Task-oriented Annotation
–
Bio-Creative annotated text
–
System development
–
Defined by specific tasks
•
Specific curation tasks in specific environments
•
Mapping of Protein names to database IDs in specific text types
•
Specific event types such as Protein-Protein Interaction, in specific text types
•
Disease-Gene Association of specific sets of diseases
•
Task-neutral Annotation
–
GENIA Corpus [U-Tokyo, NaCTeM]
–
Development of generic tools
–
Defined by theories
•
Linguistics
–
Tokens
–
POS
–
Phrase Structure
–
Dependency Structure
–
Deep Syntax (PAS)
•
Biology
–
Named Entities of various semantic types
–
Events
•
Linguistics + Biology
–
Co-references
Interoperable Tools
![Page 126: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/126.jpg)
![Page 127: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/127.jpg)
Annotation of GENIA corpus – Annotation Tool
![Page 128: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/128.jpg)
Annotation of GENIA corpus – Term & POS
Term (entity) annotation: 2000+400 abstracts
![Page 129: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/129.jpg)
Annotation of GENIA corpus – Term & POS
Part-of-speech annotation: 2,000 abstracts
![Page 130: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/130.jpg)
Annotation of GENIA corpus – Term & POS
![Page 131: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/131.jpg)
Annotation of GENIA corpus – Process & Tree
Tree annotation: 2000 abstracts
Process annotation:
500 abstracts by May 2006
1000 abstracts by Dec. 2006
![Page 132: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/132.jpg)
Example of Co-references (Institute for Infocomm Research, Singapore)
<s>Consistent with <COREF ID="35" REF="34" MIN="t1/2" TYPE="IDENT"> this short t1/2 </COREF>, accumulation of <COREF ID="19" REF="8" TYPE="IDENT"> 1,25(OH)2D3 receptor RNA </COREF> increased in <COREF ID="36"> cells </COREF> as <COREF ID="37" REF="36" TYPE="PRON"> their </COREF> protein synthesis was inhibited.</s>
IDs are assigned to all NPs and are used to represent co-referential relationships:
•
Pronouns
•
Relative clauses
•
Noun phrases with/without definite determiners
•
Appositions
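Annotation in this style can be read with a standard XML parser. The sketch below is only an illustration: the function name is hypothetical, and the sample uses straight quotes so that it parses as XML. It lists each co-referring NP with the ID of its antecedent.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<s>Consistent with <COREF ID="35" REF="34" MIN="t1/2" TYPE="IDENT">'
    'this short t1/2</COREF>, accumulation of '
    '<COREF ID="19" REF="8" TYPE="IDENT">1,25(OH)2D3 receptor RNA</COREF> '
    'increased in <COREF ID="36">cells</COREF> as '
    '<COREF ID="37" REF="36" TYPE="PRON">their</COREF> '
    'protein synthesis was inhibited.</s>'
)

def coref_links(xml_sentence):
    """Return (id, antecedent_id, type, text) for every NP that refers back to another NP."""
    root = ET.fromstring(xml_sentence)
    links = []
    for node in root.iter("COREF"):
        antecedent = node.get("REF")
        if antecedent is not None:  # NPs without REF start a new chain
            text = "".join(node.itertext()).strip()
            links.append((node.get("ID"), antecedent, node.get("TYPE"), text))
    return links

for link in coref_links(SAMPLE):
    print(link)
```

In the sample, "their" (ID 37, type PRON) points back to "cells" (ID 36), which itself carries no REF attribute.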
![Page 133: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/133.jpg)
Adaptation of Tools for the Biology Domain
![Page 134: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/134.jpg)
Why Linguistic Annotation?
•
Tool Development and Adaptation
–
Training, Development, Test
–
New Research: Domain Adaptation
Crucial for the success of text mining in specific domains
![Page 135: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/135.jpg)
Tool 1: POS Tagger
•
General-purpose POS taggers, trained on WSJ
–
Brill’s tagger, TnT tagger, MXPOST, etc.
–
97% accuracy
•
General-purpose POS taggers do not work well for MEDLINE abstracts
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …
![Page 136: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/136.jpg)
Errors seen in TnT tagger (Brants 2000)
A/DT chromosomal/JJ translocation/NN in/IN …
… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN
![Page 137: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/137.jpg)
Performance of GENIA Tagger
Accuracy (%): rows are training corpora, columns are test corpora

• GENIA tagger

| Training corpus | WSJ | GENIA |
|---|---|---|
| WSJ | 97.0 | 84.3 |
| GENIA | 75.2 | 98.1 |
| WSJ+GENIA | 96.9 | 98.1 |

No degradation of the tagger trained on the mixed corpus

• (Ref.) TnT tagger

| Training corpus | WSJ | GENIA |
|---|---|---|
| WSJ | 96.7 | 84.3 |
| GENIA | 80.1 | 97.9 |
| WSJ+GENIA | 96.5 | 97.5 |

Some degradations (0.2 ~ 0.4) were observed, compared with the taggers trained on “pure” corpora
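Figures like those above are usually per-token tagging accuracies. A minimal sketch of that computation is shown here, assuming the reported numbers are token-level accuracy; the function name is hypothetical and the gold tags in the example are an assumption (the tagger output is the "mitogen binding" example from the TnT error slide).

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)

# "... and membrane potential after mitogen binding"
gold = ["CC", "NN", "NN", "IN", "NN", "NN"]       # assumed gold standard
predicted = ["CC", "NN", "NN", "IN", "NN", "JJ"]  # tagger output with one error
print(round(tagging_accuracy(gold, predicted), 1))
```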
![Page 138: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/138.jpg)
Semantic structure
[Figure: parse tree with predicate-argument relations (ARG1, ARG2, MOD) for the sentence "A normal serum CRP measurement does not exclude deep vein thrombosis"]
![Page 139: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/139.jpg)
Predicate-argument structure Parser based on Probabilistic HPSG (Enju)
Example sentence: p53 has been shown to directly activate the Bcl-2 protein
[Figure: HPSG parse tree (S, NP, VP, ADVP nodes) annotated with predicate-argument links arg1, arg2, arg3]
![Page 140: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/140.jpg)
Event Annotation
![Page 141: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/141.jpg)
GENIA event annotation
•
Target of GENIA event annotation
–
Corpus
•
Part of the GENIA corpus, which is taken from PubMed using the MeSH terms Human, Blood Cells and Transcription Factors.
–
Ontology
•
From the Gene Ontology, concepts required for describing the NFkB pathway have been selected (34 terms).
•
3 additional concepts have been defined
–
Gene_expression
–
Artificial_process
–
Correlation
![Page 142: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/142.jpg)
GENIA event annotation – Stat (1/2)
•
Annotation
–
5 annotators + 1 manager with biology background
–
using the XConc annotation tool
•
1,000 abstracts have been annotated
–
# of sentences: 8,981
–
# of sentences with events: 8,265
•
92.0%
–
# of events: 34,065
•
Avg. 4.15 events/sentence
![Page 143: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/143.jpg)
GENIA event annotation – Stat (2/2)
•
Correlation
–
meaning ‘some’ relation between events.
•
Artificial_process
–
Artificially performed processes.
–
Transfection, treatment, …
•
Gene_expression
–
Transcription + Translation
[Figure: event ontology with annotation counts per event class]
![Page 144: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/144.jpg)
GENIA Event Annotation - example
–
For an identified event in the given sentence,
•
classify the type of the event and record the text span giving the clue to it (ClueType).
•
identify the theme of the event and record the text span linking the theme to the event (LinkTheme).
•
identify the cause of the event and record the text span linking the cause to the event (LinkCause).
•
record the environment (location, time) of the event (ClueLoc, ClueTime).
[Figure: example sentence with ClueType, LinkTheme, LinkCause and ClueLoc spans marked]
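The annotation scheme on this slide can be pictured as one record per event. The sketch below is a hypothetical in-memory representation, not the GENIA file format: the class and field names are assumptions that mirror the slide's labels, and the example values are taken from the localization statistics (a "translocation" clue, a Protein theme, and the location clue "to the nucleus").

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EventAnnotation:
    event_type: str                    # ontology class of the event
    clue_type: str                     # text span that signals the event type (ClueType)
    theme: str                         # entity or event affected
    link_theme: Optional[str] = None   # span linking the theme to the event (LinkTheme)
    cause: Optional[str] = None        # entity or event causing it
    link_cause: Optional[str] = None   # span linking the cause to the event (LinkCause)
    clue_loc: Optional[str] = None     # location clue (ClueLoc)
    clue_time: Optional[str] = None    # time clue (ClueTime)

    def has_environment(self):
        """True if a location or a time clue was recorded."""
        return self.clue_loc is not None or self.clue_time is not None

example = EventAnnotation(event_type="Localization", clue_type="translocation",
                          theme="Protein", clue_loc="to the nucleus")
print(example.has_environment())
```

Making everything except the type, its clue and the theme optional reflects that many annotated events have no recorded cause or environment (for instance, ClueLoc is NONE for 241 localization events).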
![Page 145: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/145.jpg)
Event Annotation - Example
![Page 146: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/146.jpg)
Localization
•
Theme patterns observed (730)
–
Protein 608
–
Lipid 31
–
Atom 29
–
Other_organic_compound 14
–
DNA 12
–
Virus 5
–
Carbohydrate 5
–
RNA 4
–
Inorganic 4
–
Peptide 3
•
ClueLoc
–
NONE 241
–
nuclear 140
–
to the nucleus 12
–
into the nucleus 11
–
Cytoplasmic 8
–
in the cytoplasm 7
–
macrophages 5
–
nuclear … in t lymphocytes 4
–
monocytes 4
–
in the nucleus 4
–
in the cytosol 4
–
in colostrum 4
–
from the cytoplasm to the nucleus 4
![Page 147: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/147.jpg)
Localization
•
Keywords and Locations
–
translocation (166)
•
nuclear 108
•
NONE 38
•
…
–
secretion (100)
•
NONE 57
•
name_of_cells 43
–
release (80)
•
NONE 51
•
name_of_cells 19
•
…
–
localization (30)
•
nuclear 25
•
intracellular 3
–
uptake (24)
•
NONE 14
•
name_of_cells 20
•
Keywords and Themes
–
translocation (166)
•
Protein 161
•
Virus 4
•
RNA 1
–
secretion (100)
•
Protein 98
•
Lipid 1
•
Peptide 1
–
release (80)
•
Protein 67
•
Other_organic_compound 6
•
Lipid 3
–
localization (30)
•
Protein 30
–
uptake (24)
•
Lipid 15
•
Carbohydrate 5
•
Protein 4
![Page 148: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/148.jpg)
Event Recognition
![Page 149: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/149.jpg)
Our Policy
•
Distinguish a domain-independent part from a domain-specific part.
IE System
Domain-independent: a full parser, which normalizes sentences into PASs (Machine Learning)
Domain-specific: extraction rules on PASs (Machine Learning)
PAS = Predicate-Argument Structure
![Page 150: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/150.jpg)
An Advantage of Using Full Parsing
•
Normalization of syntactic variations into PASs
We can construct more general extraction rules: fewer extraction rules, smaller training corpus.
Entity1 activates Entity2
Entity2 is activated by Entity1
Entity1 cooperate to activate Entity2
Entity1 play key roles by activating Entity2
All four variants normalize to the same PAS:
activate: ARG1 = Entity1, ARG2 = Entity2
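The benefit of normalization can be illustrated with a toy example: once different surface forms map to the same predicate-argument structure, one extraction rule covers them all. In the sketch below the PAS values are written by hand as an assumed parser output (no parser is called), and the `PAS` type and function name are hypothetical.

```python
from typing import NamedTuple

class PAS(NamedTuple):
    predicate: str
    arg1: str
    arg2: str

# Assumed parser output: active and passive sentences share one PAS.
PARSED = {
    "Entity1 activates Entity2": PAS("activate", "Entity1", "Entity2"),
    "Entity2 is activated by Entity1": PAS("activate", "Entity1", "Entity2"),
}

def extract_interactions(pas_list, predicates=frozenset({"activate"})):
    """Apply a single rule: any PAS whose predicate is listed yields (agent, predicate, target)."""
    return [(p.arg1, p.predicate, p.arg2) for p in pas_list if p.predicate in predicates]

print(extract_interactions(PARSED.values()))
```

A rule written against surface strings would need one pattern per variant, whereas the rule on PASs only needs the predicate name.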
![Page 151: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/151.jpg)
Target Domain/Task
•
Extraction of protein-protein interactions from MEDLINE abstracts
•
A source corpus of extraction rules: AIMed [Bunescu et al., 2004]
–
MEDLINE abstracts obtained from the Database of Interacting Proteins (DIP)
–
Protein names and the interactions between them are tagged
![Page 152: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/152.jpg)
Automatic Construction of Extraction Rules (PAS Patterns)
[Figure: Text annotated with desired info. → Pattern Constructor (Full Parser, Pattern Extraction, Pattern Division, Pattern Filtering) → PAS Patterns, ex.) Entity1 activates Entity2]
PAS patterns for protein-protein interactions: What properties? How to construct?
![Page 153: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/153.jpg)
Required Patterns
Classes
(1) Entity-Verb(-Preposition)-Entity
(2) Other Patterns with a Single Verb
(3) Patterns with More than One Verb
(4) Noun Patterns
(5) Adjective Patterns
![Page 154: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/154.jpg)
(1) Entity-Verb(-Preposition)-Entity
This demonstrates that Entity1 recognizes Entity2.
We found Entity1, interacted with Entity2.
•
Straightforward
•
Easy to extract
![Page 155: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/155.jpg)
(2) Other Patterns with a Single Verb
(2a) With Nouns That Cannot Be Omitted
Entity1 formed complexes with Entity2.
(2b) With Nouns That Can Be Omitted
Entity1 protein interacts with Entity2.
•
Can be divided into verbal components and nominal components
•
In (2b), every combination of verbal and nominal components can be used as a pattern
![Page 156: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/156.jpg)
(3) Patterns with More than One Verb
Entity1 recognizes one FGFR isoform known as Entity2.
Entity1 contains this site as well as a region that restricts interaction with Entity2.
•
Combinations of general verbs and domain-specific verbs: can they be divided?
Not divided at present
![Page 157: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/157.jpg)
(4) Noun Patterns (1/2)
(4a) Coordinates with Nouns of Interacting Substances
Entity1 receptor ( Entity2 )
(4b) Nouns Representing Interaction
interaction of Entity1 with Entity2
(4c) Nouns and Modifiers Representing Interaction
Entity1 binding domain on Entity2
![Page 158: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/158.jpg)
(4) Noun Patterns (2/2)
(4c) Nouns and Modifiers Representing Interaction
Entity1 binding domain on Entity2
•
Difficult Problems:
–
Distinction of these modifiers from general modifiers
–
Decision on whether modifiers are needed for proper patterns
specific Entity1 ligand ( Entity2 )
Not supported at present
![Page 159: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/159.jpg)
(5) Adjective Patterns
dimeric Entity1
Entity1 is a homodimeric protein.
•
Similar to (4c)
Not supported at present
![Page 160: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/160.jpg)
Class of Required Patterns
Supported Patterns at present
(1) Entity-Verb(-Preposition)-Entity
(2) Other Patterns with Only 1 Verb
(3) Patterns with More than 1 Verb
(4) Noun Patterns (Partially)
(5) Adjective Patterns
![Page 161: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/161.jpg)
Automatic Construction of Extraction Rules (PAS Patterns)
[Figure: Text annotated with desired info. → Pattern Constructor (Pattern Extraction, Pattern Division, Pattern Filtering) → PAS Patterns]
![Page 162: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/162.jpg)
Automatic Construction of PAS Patterns
Text annotated with desired info.
Pattern Extraction: generalization by parsing
Pattern Division: generalization by dividing into components
Pattern Filtering: raising accuracy by deleting inappropriate patterns
PAS Patterns
![Page 163: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/163.jpg)
Pattern Extraction
•
Convert a sentence annotated with desired information to PASs by parsing.
•
Extract the smallest PAS set.
Example: Entity1 coreceptor interacts with non-polymorphic regions of the Entity2.
[Figure: extracted PAS set linking Entity1 (MOD), coreceptor, interact (ARG1), with (ARG1, ARG2), region, of (ARG1, ARG2) and Entity2]
![Page 164: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/164.jpg)
Pattern Division
•
Divide one-verb patterns into verb (+preposition) components and noun components.
•
Treat all combinations of verb and noun components as patterns.
[Figure: the extracted pattern split into a Verbal Component (Xnoun1 interact with Xnoun2) and Nominal Components (Entity1 coreceptor as Xnoun; region of Entity2 as Xnoun)]
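Treating every combination of verbal and nominal components as a pattern amounts to a Cartesian product. The sketch below illustrates this with plain string templates rather than real PAS structures; the template syntax, slot names and component lists are assumptions made for the example.

```python
from itertools import product

def combine_components(verbal, left_nouns, right_nouns):
    """Fill the two noun slots of each verbal component with every pair of nominal components."""
    return [v.format(X1=left, X2=right)
            for v, left, right in product(verbal, left_nouns, right_nouns)]

# Hypothetical components obtained by dividing one-verb patterns.
verbal = ["{X1} interacts with {X2}", "{X1} recognizes {X2}"]
left_nouns = ["Entity1", "Entity1 coreceptor"]
right_nouns = ["Entity2", "region of Entity2"]

for pattern in combine_components(verbal, left_nouns, right_nouns):
    print(pattern)
```

Two verbal components and two nominal components per slot already give eight patterns, which is how division generalizes beyond the combinations actually seen in the annotated corpus.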
![Page 165: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/165.jpg)
Application Systems with Event Recognition
![Page 166: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/166.jpg)
MEDIE
• An interactive, intelligent IR system that retrieves events
• Performs a semantic search
• System components
– GENIA tagger
– Enju (HPSG parser)
– Dictionary-based named entity recognition
Demo
![Page 167: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/167.jpg)
MEDIE system overview
Off-line: input textbase → deep parser and entity recognizer → semantically annotated textbase
On-line: query → region algebra search engine → search results
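The on-line side can be pictured with a toy region-algebra query. The sketch below is an assumption-laden illustration, not MEDIE's actual engine or query language: the `Region` type, the `span` and `containing` helpers, and the example text and annotations are all hypothetical. It shows the core idea that annotations are stored as text extents and that a query is answered by testing containment between extents.

```python
from typing import NamedTuple

class Region(NamedTuple):
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    label: str        # annotation type, e.g. "sentence" or "protein"
    value: str = ""   # normalised value, e.g. a verb's base form

def span(text, surface, label, value=""):
    """Build a Region for the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return Region(start, start + len(surface), label, value)

def containing(outers, inners):
    """Region-algebra 'containing': outers that enclose at least one inner."""
    return [
        o for o in outers
        if any(o.start <= i.start and i.end <= o.end for i in inners)
    ]

# Off-line: a hypothetical semantically annotated textbase.
text = "p53 activates MDM2. The cells were cultured overnight."
regions = [
    span(text, "p53 activates MDM2.", "sentence"),
    span(text, "The cells were cultured overnight.", "sentence"),
    span(text, "p53", "protein"),
    span(text, "MDM2", "protein"),
    span(text, "activates", "verb", "activate"),
    span(text, "cultured", "verb", "culture"),
]

# On-line: sentences containing a protein and the verb "activate".
sentences = [r for r in regions if r.label == "sentence"]
proteins = [r for r in regions if r.label == "protein"]
activate = [r for r in regions if r.label == "verb" and r.value == "activate"]
hits = containing(containing(sentences, proteins), activate)
results = [text[h.start:h.end] for h in hits]
print(results)
```

Because the expensive parsing and entity recognition happen off-line, the on-line query only compares offsets.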
![Page 168: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/168.jpg)
![Page 169: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/169.jpg)
Info-PubMed
• An interactive IE system and an efficient PubMed search tool that helps users find information about biomedical entities, such as genes and proteins, and the interactions between them
• System components
– MEDIE
– Extraction of protein-protein interactions
– Multi-window interface in a browser
Demo
![Page 170: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/170.jpg)
Info-PubMed
![Page 171: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/171.jpg)
Information Extraction
![Page 172: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/172.jpg)
IE in Biology
Pattern-matching
Context-free grammar approaches
Full parsing approaches
Sublanguage-driven IE
Ontology-driven IE
McNaught, J. & Black, W. (2006) Information Extraction. In: Text Mining for Biology and Biomedicine. Artech House, pp. 143-177
![Page 173: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/173.jpg)
Pattern-matching IE
– Usual limitations that come from not including semantic processing
– Large number of surface grammatical structures = too many patterns (Zipf’s law)
– Cannot exploit syntactic generalisations (active, passive voice)
– Systems extract phrases or entire sentences that match patterns; restricted usefulness for subsequent mining
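The voice problem is easy to reproduce. In the sketch below, the regular expressions, the example sentences and the `extract` helper are invented for illustration and are not taken from any of the systems discussed. A surface pattern written for the active voice misses the passive paraphrase of the same fact, so a second pattern has to be added by hand.

```python
import re

PROTEIN = r"[A-Za-z0-9-]+"

# One surface pattern per construction (hypothetical example patterns).
ACTIVE = re.compile(
    rf"(?P<agent>{PROTEIN}) phosphorylates (?P<theme>{PROTEIN})"
)
PASSIVE = re.compile(
    rf"(?P<theme>{PROTEIN}) is phosphorylated by (?P<agent>{PROTEIN})"
)

def extract(sentence, patterns):
    """Return (agent, theme) from the first matching pattern, else None."""
    for pattern in patterns:
        match = pattern.search(sentence)
        if match:
            return match.group("agent"), match.group("theme")
    return None

active_sentence = "MAPK phosphorylates Elk-1 in vitro."
passive_sentence = "Elk-1 is phosphorylated by MAPK in vitro."

print(extract(active_sentence, [ACTIVE]))            # found
print(extract(passive_sentence, [ACTIVE]))           # missed: None
print(extract(passive_sentence, [ACTIVE, PASSIVE]))  # found with 2nd pattern
```

Every further construction (nominalisations, relative clauses, intervening adverbs) needs yet another pattern, which is the pattern explosion the slide describes.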
![Page 174: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/174.jpg)
Pattern-matching systems (1)
BioIE uses patterns to extract sentences about protein families, structures, functions, etc.
Presents the user with relevant information; an improvement on classic IR
BioRAT uses “deeper” analysis: tagging, regular expressions applied over POS tags, stemming, gazetteer categories, etc.
Templates are applied to extract matching phrases, with primitive filters (verbs are not proteins, etc.)
![Page 175: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/175.jpg)
Pattern-matching systems (2)
RLIMS-P (Hu) extracts protein phosphorylation information by looking for enzymes, substrates and sites, which are assigned to the agent, theme and site roles of phosphorylation relations
POS tagger trained on newswire, chunking, semantic typing of chunks, identification of relations using pattern-matching rules
Semantic typing of NPs: uses a combination of clue words, suffixes, acronyms, etc.
Semantically typed sentences are matched with rules
Patterns target sentences containing phosphorylate
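Semantic typing of noun phrases from clue words, suffixes and acronym shape might look roughly like the following. The rules, word lists and the `semantic_type` function are invented for illustration and are not RLIMS-P's actual rules. The sketch only shows how several weak surface cues can be combined into one type decision on the head noun.

```python
import re

# Hypothetical cues; a real system would use far richer lists.
SITE_RE = re.compile(r"^(ser|thr|tyr)-?\d+$", re.IGNORECASE)
PROTEIN_CLUES = {"kinase", "protein", "receptor", "phosphatase"}
PROTEIN_SUFFIXES = ("ase", "in")

def semantic_type(noun_phrase):
    """Assign a coarse semantic type based on the last word of the NP."""
    head = noun_phrase.split()[-1]
    if SITE_RE.match(head):
        return "site"                      # residue pattern, e.g. Ser-15
    if head.lower() in PROTEIN_CLUES:
        return "protein"                   # clue word
    if head.lower().endswith(PROTEIN_SUFFIXES):
        return "protein"                   # suffix cue
    if head.isupper() and len(head) >= 2:
        return "protein"                   # acronym-like token
    return "other"

for np in ["casein kinase", "ATM", "Ser-15", "the nucleus"]:
    print(np, "->", semantic_type(np))
```

Typed chunks such as these are what the pattern-matching rules then operate on, instead of raw words.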
![Page 176: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/176.jpg)
Full parsing approaches
• Link Grammar applied to protein-protein interactions; general English grammar adapted to bio-text
• Link Grammar finds all possible linkages according to its grammar
• Number of analyses reduced by random sampling and heuristics; processing constraints relaxed
– 10,000 results permitted per sentence
– 60% of protein interactions extracted
– Problems: missing possessive markers & determiners, coordination of compound noun modifiers
![Page 177: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/177.jpg)
Full parsing IE (2)
• Not all parsing strategies are suitable for bio-text mining
• Text type, abstracts, “ungrammaticality” related to sublanguage characteristics?
• Ambiguity and full parsing; fragmentary phrases (titles, headings, text in table cells, etc.)
• CADERIGE project used Link Grammar, but in shallow parsing mode
• Kim & Park (BioIE) use combinatory categorial grammar, annotated with GO concepts, to extract general biological interactions
• 1,300 patterns applied to find instances of patterns with keywords
![Page 178: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/178.jpg)
Full parsing (3)
• Keywords indicate basic biological interactions
• Patterns find potential arguments of the interaction keywords (verbs or nominalisations)
– Validated arguments mapped into GO concepts
– Difficult to generalise interaction keyword patterns
• BioIE’s syntactic parsing performance improved after adding subcategorisation frames on verbal interaction keywords
![Page 179: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/179.jpg)
Full parsing (4)
– Daraselia et al. (2004) use full parsing and a domain-specific filter to extract protein interactions
1. All syntactic analyses discovered using CFG and a variant of LFG
2. Each alternative parse mapped to its corresponding semantic representation
3. Output = set of semantic trees, with lexemes linked by relations indicating thematic or attributive roles
4. Apply custom-built, frame-based ontology to filter the representations of each sentence
5. Preference mechanism controls construction of the frame tree; high precision, low recall (21%)
![Page 180: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/180.jpg)
Sublanguage-driven IE (1)
• Language of a special community (e.g. biology)
• Particular set of constraints with respect to general language (GL)
• Constraints operate at all linguistic levels
– Special vocabulary (terms)
– Specialised term formation rules
– Sublanguage syntactic patterns
– Sublanguage semantics
• These constraints give rise to the informational structure of the domain (Z. Harris)
• See JBI 35(4) Special Issue on Sublanguage
![Page 181: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/181.jpg)
GENIES system
• Employs a sublanguage (SL) approach to extract biomolecular interactions
• Uses hybrid syntactic-semantic rules
– Syntactic and semantic constraints referred to in one rule
• Able to cope with complex sentences
• Frame-based representation
– Embedded frames
• Domain-specific ontology covers both entities and events
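A frame-based representation with embedded frames can be pictured as a recursive data structure. The sketch below is only an illustration: the class names, slot names and the example event are assumptions, not the actual GENIES output format. An argument slot may hold either an entity or another frame, which is how a complex sentence maps onto a nested event.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Entity:
    name: str
    type: str

@dataclass(frozen=True)
class Frame:
    action: str
    agent: Union[Entity, "Frame"]
    target: Union[Entity, "Frame"]

def actions(arg):
    """List the actions of a frame and its embedded frames, outermost first."""
    if isinstance(arg, Entity):
        return []
    return [arg.action] + actions(arg.agent) + actions(arg.target)

# Hypothetical reading of "A activates the phosphorylation of C by B".
inner = Frame("phosphorylate", Entity("B", "protein"), Entity("C", "protein"))
outer = Frame("activate", Entity("A", "protein"), inner)

print(actions(outer))
```

Keeping the inner event as a frame, rather than flattening it to a string, lets later processing query either event on its own.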
![Page 182: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/182.jpg)
GENIES system
• Default strategy: full parsing
– Robust due to sublanguage constraints
– Much ambiguity excluded
• If full parse fails, partial parsing invoked
– Maintains good level of recall
• Precision: 96%, Recall: 63%
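The full-then-partial strategy amounts to a simple fallback in the control flow. In this sketch the two regular expressions merely stand in for a full and a partial parser, and all names and sentences are invented. GENIES itself uses a sublanguage grammar, not regexes. The point is only that a strict analysis is tried first and a looser one is used when it fails, which keeps recall from collapsing.

```python
import re

# Stand-ins for real parsers (assumption: regexes are used only for brevity).
FULL = re.compile(r"^(\w+) (activates|inhibits) (\w+)\.$")
PARTIAL = re.compile(r"(\w+) (activates|inhibits) (\w+)")

def full_parse(sentence):
    """Succeeds only if the whole sentence fits the strict pattern."""
    match = FULL.match(sentence)
    return match.groups() if match else None

def partial_parse(sentence):
    """Succeeds if any fragment of the sentence fits the pattern."""
    match = PARTIAL.search(sentence)
    return match.groups() if match else None

def extract(sentence):
    """Try the full parse first; fall back to partial parsing on failure."""
    result = full_parse(sentence)
    if result is not None:
        return result, "full"
    return partial_parse(sentence), "partial"

print(extract("Raf activates MEK."))
print(extract("As shown previously, Raf activates MEK in these cells."))
```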
![Page 183: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/183.jpg)
Ontology-driven IE
• Until recently, most rule-based IE systems have used neither linguistic lexica nor ontologies
– Reliance on gazetteers
– Small number of semantic categories
• Gazetteer approach not well suited to bioIE
• Ontology-based vs ontology-driven
– Passive use of ontologies: map discovered entity to concept
– Active use: ontology guides and constrains analysis, fewer rules
• Examples: PASTA, GenIE (not SL)
• GENIES (SL and ontology-driven)
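The difference between passive and active use can be made concrete with a toy example. Everything in the sketch below (the tiny is-a hierarchy, the event frame definition and the function names) is hypothetical. In the active, ontology-driven case, the event's concept frame decides which candidate arguments are acceptable, so one generic rule plus the ontology replaces many entity-specific rules.

```python
# Hypothetical toy ontology: child concept -> parent concept.
IS_A = {
    "kinase": "protein",
    "protein": "molecule",
    "gene": "molecule",
    "cell": "anatomy",
}

# Hypothetical event concept frame: role -> required concept.
EVENT_FRAMES = {
    "phosphorylate": {"agent": "kinase", "theme": "protein"},
}

def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or descends from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

def fill_frame(event, candidates):
    """Fill an event frame, rejecting arguments the ontology disallows.

    `candidates` maps each role to a (mention, concept) pair.
    """
    frame = {"event": event}
    for role, required in EVENT_FRAMES[event].items():
        mention, concept = candidates[role]
        if not is_a(concept, required):
            return None
        frame[role] = mention
    return frame

good = fill_frame("phosphorylate",
                  {"agent": ("ATM", "kinase"), "theme": ("p53", "protein")})
bad = fill_frame("phosphorylate",
                 {"agent": ("HeLa", "cell"), "theme": ("p53", "protein")})
print(good)
print(bad)
```

A passive, ontology-based system would run only the `is_a` lookup after extraction, to label what it had already found.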
![Page 184: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/184.jpg)
Summary: simple pattern matching
Over text strings: many patterns required, no generalisation possible
Over POS: some generalisation, but sentence structure ignored
POS tagging, chunking, semantic pattern matching, typing: limited generalisation, some account taken of structure, limited consideration of SL patterns
![Page 185: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/185.jpg)
Summary: full parsing
Full parsing on its own, or parsing in combination with other techniques (chunking, partial parsing, heuristics) to reduce ambiguity and filter out implausible readings
General-language (GL) theories not appropriate
Difficult to specialise for biotext
Many analyses per sentence
Missing information due to sublanguage meaning
![Page 186: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/186.jpg)
Summary: sublanguage approach
Exploits a rich SL lexicon
Describes SL verbs in detail
Syntactic-semantic grammar
Current systems would benefit from adopting an ontology-driven approach
![Page 187: Text Mining for Biomedicine: Techniques & toolsnactem.ac.uk/dtc/DTC-Ananiadou.pdfText Mining for Biomedicine: Techniques & tools Sophia Ananiadou School of Computer Science National](https://reader036.vdocument.in/reader036/viewer/2022070809/5f07be417e708231d41e84b5/html5/thumbnails/187.jpg)
Ontology-driven
Uses event concept frames to guide processing
Integration of extracted information
Current systems would benefit from also adopting an SL approach