Kevin.cohen@gmail.com
http://compbio.ucdenver.edu/Hunter_lab/Cohen
Research Opportunities in Biomedical Text Mining
Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead
Information extraction
•Also known as “relation extraction”
•Limited to one or a small number of types of facts
– Contrast information retrieval or question-answering
Information extraction
Information extraction: relationships between things
BINDING_EVENT
Binder:
Bound:
Information extraction
Met28 binds to DNA.
BINDING_EVENT
Binder: Met28
Bound: DNA
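As a minimal sketch of what filling this frame could look like, the Python below uses one hand-written regular expression that covers only the "X binds to Y" phrasing. The names `BindingEvent`, `BINDS_TO`, and `extract_binding` are illustrative assumptions and do not come from any system discussed in these slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """Frame with the two slots shown on the slide."""
    binder: str
    bound: str


# Hypothetical pattern: covers only the "<binder> binds to <bound>" phrasing.
BINDS_TO = re.compile(r"\b(?P<binder>\w+) binds to (?P<bound>\w+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    """Return a filled BINDING_EVENT frame, or None if the pattern does not match."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(binder=match.group("binder"), bound=match.group("bound"))


print(extract_binding("Met28 binds to DNA."))
print(extract_binding("binding of Met28 to DNA"))
```

The second call returns `None`: a single surface pattern misses even a close paraphrase of the same fact.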
Why text mining is difficult
•Variability
•Pervasive ambiguity at every level of analysis
Why text mining is difficult
Met28 binds to DNA
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…
Why text mining is difficult
…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…
Why text mining is difficult
…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…
Why text mining is difficult
NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)
NACT:
•neoadjuvant chemotherapy (PMID 8898170)
•N-acetyltransferase (PMID 10725313)
•Na+-coupled citrate transporter (PMID 12177002)
Why text mining is difficult
NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)
•(liver), (testis) and (brain in rat)
•liver, (testis and brain in rat)
•(liver, testis and brain in rat)
Why text mining is difficult
NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)
•shows preference for (citrate over dicarboxylates)
•shows preference (for citrate) (over dicarboxylates)
Why text mining is difficult
regulation of cell migration and proliferation (PMID …)
serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)
! proliferation and regulation of cell migration
! regulation of proliferation and cell migration
regulation of cell migration and regulation of cell proliferation
Why text mining is difficult
regulation of cell migration and proliferation (PMID …)
serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)
! degradation of IRS-1, translocation, and serine phosphorylation
! serine phosphorylation, serine translocation, and serine degradation (of IRS-1)
2.5 types of solutions
•Rule-based
– Patterns
– Grammars
•Statistical/machine learning
– Labelled training data
– Noisy training data
•Hybrid statistical/rule-based
Classic work in molecular biology information extraction: pattern-based
•Blaschke et al. (1999): The beginning of biologists working in BioNLP
– Gene names assumed to be known a priori
– Patterns assume two gene names and an “action word”
proteinA action_word proteinB
– Action words: acetylate, acetylates, acetylated, acetylation, etc.
– Not traditionally evaluated
Classic work in molecular biology information extraction: pattern-based
•Blaschke et al. (2002): Biologists begin to be aware of linguistics
•Proteins assumed to be known a priori
[proteins] (0-5) [verbs] (6-10) [proteins]
•(Why not 0-5 twice? Different weight of rule)
•P 0.45, R 0.40 (traditional evaluation)
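A rough sketch of how a rule of the form `[proteins] (0-5) [verbs] (6-10) [proteins]` might be applied, assuming the parenthesized ranges are the minimum and maximum number of intervening words. That reading, the function name `match_gap_pattern`, and the toy sentence are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import List, Set, Tuple


def match_gap_pattern(
    tokens: List[str],
    proteins: Set[str],
    verbs: Set[str],
    gap1: Tuple[int, int] = (0, 5),
    gap2: Tuple[int, int] = (6, 10),
) -> List[Tuple[str, str, str]]:
    """Find (protein, verb, protein) triples whose word gaps fall in the given ranges."""
    hits = []
    for i, first in enumerate(tokens):
        if first not in proteins:
            continue
        for j in range(i + 1, len(tokens)):
            # Number of words strictly between the first protein and the verb.
            if tokens[j] not in verbs or not gap1[0] <= j - i - 1 <= gap1[1]:
                continue
            for k in range(j + 1, len(tokens)):
                # Number of words strictly between the verb and the second protein.
                if tokens[k] in proteins and gap2[0] <= k - j - 1 <= gap2[1]:
                    hits.append((first, tokens[j], tokens[k]))
    return hits


tokens = "proteinA directly acetylates proteinB".split()
names = {"proteinA", "proteinB"}
print(match_gap_pattern(tokens, names, {"acetylates"}, gap2=(0, 5)))
print(match_gap_pattern(tokens, names, {"acetylates"}))
```

With the default `(6, 10)` second gap, the adjacent pair is not matched. Under the assumed reading, each gap range would be a separate rule carrying its own weight.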
The Colorado solution: OpenDMAP
Classic DMAP
•Direct Memory Access Parser (Riesbeck, 1986; Martin, 1991; Fitzgerald, 1995)
– Belonging to the conceptual parser family
– Going as directly as possible from lexical input to concepts in memory
– Mostly toy prototype implementations with no real evaluation
Slide from Zhiyong Lu
New Features in OpenDMAP
•Open Source
– Implemented in Java
– Available at www.sourceforge.net
•OpenDMAP patterns are
– Richer (capable of using external information such as protein names and linguistic analyses rather than just strings and concepts)
– More flexible in terms of concept ordering
•First time in biomedical domain
– Well constructed ontologies
– Open Biomedical Ontologies (e.g. Gene Ontology)
Slide from Zhiyong Lu
Frame-based Representations
•Common representation for ontologies
•A unique name that refers to a concept
•A list of attributes (slots) with admissible values
•Frame slots describe logical relations between frames
Concept: Protein Transport
Slots:
[transported entity]: protein or molecular complex
[transporting entity]: protein or molecular complex
[transport origin]: cellular component
[transport destination]: cellular component
Phrasal-patterns: [transported entity] translocation to [transport destination]
Slide from Zhiyong Lu
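One way to picture the Protein Transport frame as a data structure is sketched below in Python. The `Frame` class and its `admits` method are hypothetical names chosen for illustration. They are not the Protégé or OpenDMAP representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Frame:
    """A named concept whose slots each list the concept types they admit."""
    name: str
    slots: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
    phrasal_patterns: List[str] = field(default_factory=list)

    def admits(self, slot: str, concept_type: str) -> bool:
        """True if `concept_type` is an admissible value for `slot`."""
        return concept_type in self.slots.get(slot, ())


protein_transport = Frame(
    name="Protein Transport",
    slots={
        "transported entity": ("protein", "molecular complex"),
        "transporting entity": ("protein", "molecular complex"),
        "transport origin": ("cellular component",),
        "transport destination": ("cellular component",),
    },
    phrasal_patterns=["[transported entity] translocation to [transport destination]"],
)

print(protein_transport.admits("transport destination", "cellular component"))
print(protein_transport.admits("transport origin", "protein"))
```

The slot constraints are what turn a string pattern into a typed relation: the same phrasal pattern is rejected when a filler has the wrong concept type.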
Transport Frame in Protégé
Slide from Zhiyong Lu
Slots Define Logical Relations Between Concepts
Concept: Protein Transport
Slots:
[transported entity]: protein or molecular complex
[transporting entity]: protein or molecular complex
[transport origin]: cellular component
[transport destination]: cellular component
Phrasal-patterns: [transported entity] translocation to [transport destination]
Concept: protein
Phrasal-patterns: none
Concept: molecular complex
Phrasal-patterns: none
Relation link
Slide from Zhiyong Lu
Slots Define Logical Relations Between Concepts
Concept: Protein Transport
Slots:
[transport origin]: cellular component
[transport destination]: cellular component
Phrasal-patterns: [transported entity] translocation to [transport destination]
Concept: protein
Phrasal-patterns: none
Concept: molecular complex
Phrasal-patterns: none
Concept: cellular component
Phrasal-patterns: none
Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear
Concept: mitochondrion
Phrasal-patterns: := mitochondria := mitochondrion := mitochondrial
Subsumption link
Relation link
Slide from Zhiyong Lu
Patterns for Cellular Locations
•Names and synonyms from Gene Ontology terms
•Linguistic variations
Concept: cellular component
Concept: mitochondrion
Phrasal-patterns: := mitochondrion := mitochondria := mitochondrial
Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear
Slide from Zhiyong Lu
Pattern Matching Process
Bax translocation to mitochondria
Concept: Protein Transport
Phrasal-patterns: [transported entity] translocation to [transport destination]
Concept: protein
Phrasal-patterns: none
Concept: molecular complex
Phrasal-patterns: none
Concept: cellular component
Phrasal-patterns: none
Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear
Concept: mitochondrion
Phrasal-patterns: := mitochondria := mitochondrion := mitochondrial
Subsumption link
Relation link
Awaiting
Recognized
Active
Slide from Zhiyong Lu
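The matching of "Bax translocation to mitochondria" can be sketched in a few lines of Python, assuming a one-to-one token match and a lexicon entry for Bax standing in for an external protein-name recognizer. The names `IS_A`, `LEXICON`, `SLOT_TYPES`, `subsumed_by`, and `match` are hypothetical. The sketch shows only the role of subsumption links, not OpenDMAP's actual algorithm or its Awaiting/Recognized/Active bookkeeping.

```python
from typing import Dict, List, Optional, Set

# Subsumption links (child -> parent), as drawn on the slide.
IS_A = {"nucleus": "cellular component", "mitochondrion": "cellular component"}

# Phrasal patterns for leaf concepts. The "Bax" entry is an assumption standing
# in for an external protein-name recognizer.
LEXICON = {
    "nucleus": "nucleus", "nuclei": "nucleus", "nuclear": "nucleus",
    "mitochondria": "mitochondrion", "mitochondrion": "mitochondrion",
    "mitochondrial": "mitochondrion",
    "Bax": "protein",
}

# Concept types admitted by the two slots used in this phrasal pattern.
SLOT_TYPES: Dict[str, Set[str]] = {
    "transported entity": {"protein", "molecular complex"},
    "transport destination": {"cellular component"},
}
PATTERN = ["[transported entity]", "translocation", "to", "[transport destination]"]


def subsumed_by(concept: Optional[str], allowed: Set[str]) -> bool:
    """Walk subsumption links upward until an allowed concept is reached."""
    while concept is not None:
        if concept in allowed:
            return True
        concept = IS_A.get(concept)
    return False


def match(tokens: List[str]) -> Optional[Dict[str, str]]:
    """Match tokens against PATTERN one-to-one; return the filled slots or None."""
    if len(tokens) != len(PATTERN):
        return None
    filled: Dict[str, str] = {}
    for token, element in zip(tokens, PATTERN):
        if element.startswith("["):
            slot = element.strip("[]")
            concept = LEXICON.get(token)
            if concept is None or not subsumed_by(concept, SLOT_TYPES[slot]):
                return None
            filled[slot] = concept
        elif token != element:
            return None
    return filled


print(match("Bax translocation to mitochondria".split()))
```

Here "mitochondria" fills the destination slot only because mitochondrion is subsumed by cellular component; the slot itself never mentions mitochondria.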
New Features in OpenDMAP Patterns
Data demonstrate that TFII-I, through a Src-dependent mechanism, translocates reversibly from the cytoplasm to the nucleus, leading to the transcription activation of growth-regulated genes.
[transported entity dep:x] _ [action c-action-transport head:x]
(by the? [transporting entity])? @ (to the? [transport destination])
@ (from the? [transport origin])
_ wildcard
dep:x/head:x placement of linguistic constraints
? optional concept match
@ optional concept ordering + match
Slide from Zhiyong Lu
Other Transport Patterns
Pattern: [transport destination] [action c-action-transport] _ (of the? [transported entity])? (by the? [transporting entity])?
GeneRIF: … nuclear translocation of the NF-kappaB (p65/p50) heterodimers
Pattern: [transported entity dep:x]? _ [transport destination] [action c-action-transport head:x] (by the? [transporting entity])?
GeneRIF: … is sufficient to degrade the AHR and that nuclear translocation
Pattern: [transported entity] (is|are|was|were) [action c-action-transport-passive] @ (by the? [transporting entity]) @ (from the? [transport origin]) @ (to the? [transport destination])
GeneRIF: the YY1 factor is translocated to the cytoplasm …
Slide from Zhiyong Lu
Evaluation of information extraction (and many other NLP tasks)
•Standard paradigm: “corpus”—body of texts with “gold standard” answers marked
•“Weakly annotated” data: publications with metadata only
•Test suites—see previous lectures
Evaluation of NLP systems
•Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them.
•Against a “gold standard” of human-generated representations of texts
– Humans don’t always agree, therefore calculate inter-annotator agreement
•Post-hoc judgments (particularly of IR relevance)
•“Shared task” paradigm
– TREC Genomics (IR)
– BioCreative (IE)
Evaluation of NLP systems
•Precision:
– True positives / (True positives + False positives)
•Recall:
– True positives / (True positives + False negatives)
•F-measure: “harmonic mean” of precision and recall
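The definitions above can be sketched as a small Python function that scores a set of extracted facts against a gold-standard set. The tuple layout of a fact, the function name `precision_recall_f1`, and the example facts are assumptions made for illustration.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (relation, argument 1, argument 2)


def precision_recall_f1(predicted: Set[Fact], gold: Set[Fact]) -> Tuple[float, float, float]:
    """Score extracted facts against gold-standard facts by exact match."""
    tp = len(predicted & gold)   # extracted and in the gold standard
    fp = len(predicted - gold)   # extracted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented example: one correct extraction, one wrong one, two gold facts missed.
gold = {
    ("binding", "Met28", "DNA"),
    ("transport", "Bax", "mitochondrion"),
    ("transport", "YY1", "cytoplasm"),
}
predicted = {
    ("binding", "Met28", "DNA"),
    ("transport", "Bax", "nucleus"),
}
print(precision_recall_f1(predicted, gold))
```

Exact-match scoring is the strictest choice. Shared tasks often define partial-credit variants, which this sketch does not attempt.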
Evaluation of NLP systems
•Formal definition:
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$
•Typical definition: β = 1, so…
Evaluation of NLP systems
•Typical definition:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
•…or just F: β is usually assumed to be 1
Evaluation of NLP systems
•β allows you to weight precision and recall differently
– Increasing β weights recall more highly
– Decreasing β weights precision more highly
•Rarely used, but designated by value of β, e.g. F0.5 or F2
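A quick numeric check of how β shifts the balance, using the Fβ formula from the formal-definition slide. The function name `f_beta` and the precision/recall values are chosen only for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# A high-precision, low-recall system (illustrative values only).
p, r = 0.9, 0.3
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")
```

For this high-precision, low-recall input the score falls as β grows, because larger β gives more weight to recall. F0.5 rewards the precision and F2 penalizes the weak recall.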
OpenDMAP performance
•Performance of any rule-based information extraction system is a function of two things:
– Overall architecture and abilities of the system
– Quality of the rules
OpenDMAP performance
•Protein transport
– Complete frame filled: P 0.75, R 0.49, F 0.59
– Incomplete frames: P 0.75, R 0.67, F 0.71
– Gold standard gene names: complete frame P 0.77, R 0.67, F 0.72, incomplete frame P 0.75, R 0.85, F 0.81
•Cell-type-specific gene expression
– Without gold standard gene names: P 0.64, R 0.16, F 0.26
– With gold standard gene names: P 0.85, R 0.36, F 0.51
OpenDMAP performance
•Protein-protein interactions
– BioCreative II shared task: Placed 1st
•F 0.29, ten percent higher than #2 system, more than 3 standard deviations above the mean; similar recall to others, but precision of 0.39 more than 20% higher than #2 system
•BioCreative II.5:
– Another team placed 1st using OpenDMAP and a much larger set of (automatically learned) rules
OpenDMAP performance
•“Event” recognition– E.g. phosphorylation, expression,
binding, localization (weird definition of “event”)
– Ranked 19 out of 24 groups , but…
![Page 54: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/54.jpg)
OpenDMAP performance
•“Event” recognition
– E.g. phosphorylation, expression, binding, localization (weird definition of “event”)
– Ranked 19 out of 24 groups, but… had the highest precision (0.71-0.72)
![Page 55: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/55.jpg)
Paths to improving OpenDMAP performance
•Increase recall
– Pattern learning? See Haibin’s lecture
•Increase precision
– Leverage what we know about biology
– Huge knowledge-base construction effort underway here over the course of the past two years
![Page 56: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/56.jpg)
Adding even small amounts of knowledge to the system helps
•Livingston (2011): Gene activation task
•Original system: enzymes and substrates both allowed to be of type protein
•Enhancement:
– Gene Ontology annotations
– Potential enzymes must have annotation catalytic activity
– Potential substrates must have annotation receptor activity
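The enhancement described above amounts to a type filter on candidate enzyme/substrate pairs. The sketch below shows the idea only. The protein names, the `GO_ANNOTATIONS` table, and the `filter_candidates` function are all hypothetical and invented for illustration; this is not the code from Livingston (2011). It also checks for an exact annotation string, whereas a real system would probably also accept more specific descendant terms in the ontology.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical Gene Ontology annotations, invented for illustration only.
GO_ANNOTATIONS: Dict[str, Set[str]] = {
    "ProtA": {"catalytic activity"},
    "ProtB": {"receptor activity"},
    "ProtC": {"structural molecule activity"},
}


def filter_candidates(
    pairs: List[Tuple[str, str]],
    annotations: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Keep only (enzyme, substrate) pairs whose annotations fit the roles."""
    kept = []
    for enzyme, substrate in pairs:
        enzyme_ok = "catalytic activity" in annotations.get(enzyme, set())
        substrate_ok = "receptor activity" in annotations.get(substrate, set())
        if enzyme_ok and substrate_ok:
            kept.append((enzyme, substrate))
    return kept


candidates = [("ProtA", "ProtB"), ("ProtC", "ProtB"), ("ProtA", "ProtC")]
print(filter_candidates(candidates, GO_ANNOTATIONS))  # [('ProtA', 'ProtB')]
```

A filter like this can only discard candidates, so it would be expected to raise precision and possibly lower recall.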
![Page 57: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/57.jpg)
Adding even small amounts of knowledge to the system helps

| | Original | Added knowledge | Difference |
|---|---|---|---|
| Precision | 0.16 | 0.36 | 0.20 |
| Recall | 0.24 | 0.18 | -0.06 |
| F-measure | 0.19 | 0.24 | 0.05 |
![Page 58: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/58.jpg)
Nominalization
•Nominalization: noun derived from a verb
– Verbal nominalization: activation, inhibition, induction
– Argument nominalization: activator, inhibitor, inducer, mutant
![Page 59: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/59.jpg)
Nominalizations are dominant in biomedical texts

| Predicate | Nominalization | All verb forms |
|---|---|---|
| Express | 2,909 | 1,233 |
| Develop | 1,408 | 597 |
| Analyze | 1,565 | 364 |
| Observe | 185 | 809 |
| Differentiate | 737 | 166 |
| Describe | 10 | 621 |
| Compare | 185 | 668 |
| Lose | 556 | 74 |
| Perform | 86 | 599 |
| Form | 533 | 511 |

Data from CRAFT corpus
![Page 60: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/60.jpg)
Relevant points for text mining
•Nominalizations are an obvious route for scaling up recall
•Nominalizations are more difficult to handle than verbs…
•…but can yield higher precision (Cohen et al. 2008)
![Page 61: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/61.jpg)
Alternations of nominalizations: positions of arguments
•Any combination of the set of positions for each argument of a nominalization
– Pre-nominal: phenobarbital induction, trkA expression
– Post-nominal: increases of oxygen
– No argument present: Induction followed a slower kinetic…
– Noun-phrase-external: this enzyme can undergo activation
![Page 62: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/62.jpg)
Result 1: attested alternations are extraordinarily diverse
•Inhibition, a 3-argument predicate (Arguments 0 and 1 only shown)
![Page 63: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/63.jpg)
Implications for system-building
•Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful
•Pre-nominal arguments are undergoers by a ratio of 2.5:1
•For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well
![Page 64: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/64.jpg)
What can be done?
•External arguments:
– semantic role labelling approach
•…but, very important to recognize the absent/external distinction, especially with machine learning
– pattern-based approach
•…but, approaches to external arguments (RLIMS-P) are so far very predicate-specific
![Page 65: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/65.jpg)
What can be done?
•Pre-nominal arguments:
– apply heuristic that we have identified based on distributional characteristics
– for most frequent nominalizations, manual encoding may be tractable
![Page 66: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/66.jpg)
So, how do you do text mining?
Two approaches that are not coexisting peacefully
![Page 67: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/67.jpg)
Two approaches to NLP
Knowledge-based Statistical/machine learning
![Page 68: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/68.jpg)
First approach to NLP
•Rule-based
•AI, linguistics
•Ontologies
•Knowledge bases
•Patterns (regular, context-free…)
•Procedures
![Page 69: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/69.jpg)
K-based: procedural
•Patterns (regular, context-free, …)
•Procedures
if (currentWordEndsWith-ing) {
if (previousWordIsThe) {
if (nextWordIsOf) {
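The slide's fragment stops before stating what happens when all three tests succeed. The sketch below is a runnable version under one assumption, labeled as such: that the rule is meant to flag an -ing word sitting between "the" and "of" (as in "the binding of") as a nominal use. The function name `ing_nominals` and the example sentence are illustrative, not from the slides.

```python
from typing import List


def ing_nominals(tokens: List[str]) -> List[int]:
    """Return indices of -ing words found between 'the' and 'of'.

    Assumption: the rule's consequence (elided on the slide) is to treat
    such words as nominal rather than verbal.
    """
    hits = []
    # Skip the first and last tokens: they lack a neighbor on one side.
    for i in range(1, len(tokens) - 1):
        current_ends_with_ing = tokens[i].lower().endswith("ing")
        previous_is_the = tokens[i - 1].lower() == "the"
        next_is_of = tokens[i + 1].lower() == "of"
        if current_ends_with_ing and previous_is_the and next_is_of:
            hits.append(i)
    return hits


tokens = "We measured the binding of Met28 to DNA".split()
print(ing_nominals(tokens))  # [3]
```

Flattening the nested conditions into one test keeps the logic identical while making each condition easy to name and check.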
![Page 70: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/70.jpg)
K-based: regex
•Patterns (regular, context-free, …)
•Procedures
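$geneName = "[A-Za-z]+-?[0-9]+";  # letters, optional hyphen, then one or more digits (e.g. Met28)
if ($input =~ /interaction of ($geneName) with ($geneName)/) {
    # $1 and $2 are only meaningful after a successful match
    $interactionAssertion->setGene1($1);
    $interactionAssertion->setGene2($2);
}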
![Page 71: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/71.jpg)
K-based: CFGs
•Patterns (regular, context-free, …)
•Procedures
NounPhrase -> NounPhrase+ Conjunction NounPhrase
NounPhrase -> Predeterminer Determiner+ Adjective+ Noun
![Page 72: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/72.jpg)
Knowledge-based approaches
Why they work
•Patterns are real
– Psychologically
– Formally adequate (mostly)
•Intuition works
•No need for training data
![Page 73: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/73.jpg)
Knowledge-based approaches
Why they’re hard
•Knowledge takes time to get
•Process of developing large rule sets can be slow
– Consider English syntax…
![Page 74: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/74.jpg)
Second approach to NLP
•Mosteller & Wallace
•Bayesian
•Other machine learning techniques
![Page 75: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/75.jpg)
Statistical/ML approaches
•Frame the NLP task as a series of classification problems
– Which POS is this?
– Which word meaning?
– Which phrasal grouping?
![Page 76: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/76.jpg)
Statistical approaches
Why they work
•Statistics can be proxy for knowledge
•Some interesting stuff is frequent enough to be tractable
![Page 77: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/77.jpg)
Statistical approaches
Why they’re hard
•Problem: sparse data
[Figure: frequency plotted against rank]
![Page 78: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/78.jpg)
Statistical approaches
Why they’re hard
•Solutions: smoothing, back-off
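To make the two remedies concrete, here is a minimal sketch on a toy bigram model. The function names and the toy sentence are illustrative assumptions. The smoothing shown is add-one (Laplace) smoothing, and the back-off is a deliberately naive version that falls back to unigram relative frequency without discounting, so unlike Katz back-off it does not yield a properly normalized distribution.

```python
from collections import Counter
from typing import List, Tuple


def train(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and adjacent-word bigrams."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def add_one_prob(prev: str, word: str, unigrams: Counter, bigrams: Counter) -> float:
    """Add-one smoothing: unseen bigrams get a small non-zero probability."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def backoff_prob(prev: str, word: str, unigrams: Counter, bigrams: Counter) -> float:
    """Naive back-off: use the bigram estimate if seen, else the unigram one."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return unigrams[word] / sum(unigrams.values())


tokens = "met28 binds dna and met4 binds met28".split()
unigrams, bigrams = train(tokens)

# "binds dna" was seen once; "binds met4" was never seen.
print(add_one_prob("binds", "dna", unigrams, bigrams))
print(add_one_prob("binds", "met4", unigrams, bigrams))
print(backoff_prob("binds", "met4", unigrams, bigrams))
```

Both functions give the unseen bigram a non-zero estimate, which is the point: without them, any sentence containing one unseen pair would receive probability zero.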
![Page 79: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/79.jpg)
Statistical approaches
Why they’re hard
•Problem: labelled training data is expensive
![Page 80: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/80.jpg)
Statistical approaches
Why they’re hard
•Solutions:
– spend money
– figure out how to use other people’s
– “weakly labelled” data
![Page 81: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/81.jpg)
Knowledge-based or statistical: what to do??
![Page 82: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/82.jpg)
Knowledge-based vs. statistical approaches
•Pragmatic answer #1: if you must pick one...
– Is it cheaper to label more training data, or to put time into developing patterns?
![Page 83: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/83.jpg)
Knowledge-based vs. statistical approaches
•Researcher’s answer:
– Use one as the baseline for the other
![Page 84: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/84.jpg)
Knowledge-based vs. statistical approaches
•Pragmatic answer #2: combine them
– Do both together/iteratively
– Statistical solution first, then rule-based post-processing
the 2.5th approach
“Natural language processing is never pure and rarely simple.”
![Page 85: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/85.jpg)
Which works better?
Pestian et al. (2007)
![Page 86: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/86.jpg)
A rapprochement
![Page 87: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/87.jpg)
Conceptual features for information retrieval
•Task: retrieve sentences that contain mentions of mutations.
•Keyword approach: 1,092
•Recognize mutation mentions: additional 2,171
![Page 88: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/88.jpg)
Conceptual features in document classification
Caporaso et al. (2005)
![Page 89: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/89.jpg)
Conceptual features in document classification
Caporaso et al. (2005)
![Page 90: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/90.jpg)
Untapped conceptual types
![Page 91: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/91.jpg)
Malignancies (F = 0.84)
Jin et al. (2006)
![Page 92: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/92.jpg)
Mouse strains
•CAST/EiJ
•C57BL
•SJL/J
•SEG
•C3H/He
•RIII
•DBA/1
Caporaso et al. (2005)
![Page 93: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/93.jpg)
Mutations
•Ala64->Gly
•Ala64Gly
•A376G
Caporaso et al. (2007)
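The three surface forms listed above can be covered by a small regular expression. The sketch below is illustrative only: the pattern, the names `MUTATION` and `find_mutations`, and the example sentence are assumptions made here, not the patterns used by Caporaso et al. (2007). Note that the one-letter form is ambiguous; a string such as a cell-line name with the same letter-digits-letter shape would also match, so a real recognizer needs more context.

```python
import re
from typing import List, Tuple

# Three-letter and one-letter codes for the 20 standard amino acids.
THREE = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
ONE = "[ACDEFGHIKLMNPQRSTVWY]"

MUTATION = re.compile(
    rf"\b(?:"
    rf"(?P<wt3>{THREE})(?P<pos3>[1-9][0-9]*)(?:->)?(?P<mut3>{THREE})"  # Ala64->Gly, Ala64Gly
    rf"|"
    rf"(?P<wt1>{ONE})(?P<pos1>[1-9][0-9]*)(?P<mut1>{ONE})"  # A376G
    rf")\b"
)


def find_mutations(text: str) -> List[Tuple[str, int, str]]:
    """Return (wild-type residue, position, mutant residue) for each mention."""
    results = []
    for match in MUTATION.finditer(text):
        wild = match.group("wt3") or match.group("wt1")
        position = match.group("pos3") or match.group("pos1")
        mutant = match.group("mut3") or match.group("mut1")
        results.append((wild, int(position), mutant))
    return results


print(find_mutations("Ala64->Gly, Ala64Gly and A376G were tested"))
# [('Ala', 64, 'Gly'), ('Ala', 64, 'Gly'), ('A', 376, 'G')]
```

Normalizing all three spellings to the same tuple is what lets a retrieval system treat them as one concept rather than three unrelated keywords.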
![Page 94: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/94.jpg)
Point/Counterpoint
![Page 95: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/95.jpg)
Contradictory findings
•TREC 2003: “...searching in the MeSH and substance name fields, along with filtering for species, accounted for the best performance” (Hersh and Bhupatiraju 2003, Caporaso et al. 2005)
•TREC 2004: “Approaches that attempted to map to controlled vocabulary terms did not fare as well” (Hersh et al. 2004)
![Page 96: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/96.jpg)
Understanding the TREC 2004 results
• Poor choice of concepts
– MeSH terms only, which are known to have problems even if manually indexed
• “Conceptual” systems weren’t very good (or didn’t try very hard) at concept recognition
– Even synonymy not detected well (1 case)
– Methods not described, so presumably not a focus of the work (2 cases)
• Hersh et al. (2004) overstate role of concepts in these systems
– Synonym source only (1 case)
– Only one of several features (1 case)
![Page 97: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/97.jpg)
I’m convinced in theory, but will it scale?
•Jin et al. (2006): for malignancy mentions, relatively small amount of training data sufficed
•Caporaso et al. (2007): mutation patterns were learnable with small person-hour investment
![Page 98: Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen Biomedical Text](https://reader037.vdocument.in/reader037/viewer/2022110405/56649eca5503460f94bd8dba/html5/thumbnails/98.jpg)
Conclusion
•Statistical and conceptual approaches to text mining can coëxist peacefully
– Statistical and rule-based concept recognizers can work well
– Concepts are good features for statistical systems