Towards an Automated Analysis of Biomedical Abstracts
Barbara Gawronska, Björn Erlendsson, Björn Olsson
School of Humanities and Informatics, University of Skövde, Sweden
The goal of the project: text analysis for candidate path extraction
Scored & ranked path alignments
Path extraction from model database
Path extraction from text
Path alignment
Parameter settings
GO graph
Organism annotation database
Model pathway database
PubMed
Lexical databases, grammar
GO term probability calculation
The characteristics of the language of biomedical texts
A typical PubMed abstract (PMID: 16301995):
The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on human chromosome 17p13.3, is frequently silenced in cancer by epigenetic mechanisms. Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses and zinc-finger family of transcription factors and acts by repressing target gene expression. It has been shown that enforced p53 expression leads to increased HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in tumorigenesis. In order to elucidate the regulation of HIC1 expression, we have analysed the HIC1 promoter region for p53-dependent induction of gene expression. (…) Other members of the p53 family, notably TAp73beta and DeltaNp63alpha, can also act through this HIC1.PRE to induce transcription of HIC1, and finally, hypermethylation of the HIC1 promoter attenuates inducibility by p53.
Results of POS-tagging of two large corpora (30 million words each): 1) texts on stem cell research, and 2) general English prose
[Bar chart: share of each part of speech in the two corpora, on a 0% to 40% scale. Categories: adjective, noun, preposition, verb, determiner, conjunction, pronoun, auxiliary, modal, adverb. Light bars: Stem Cell; dark bars: Prose.]
Results of POS-tagging of a smaller sample corpus of biomedical abstracts
[Pie chart. Word classes: Proper nouns, Nouns, Verbs, Closed Class Words, Adjectives+Adverbs. Segment values, in the order extracted: 30%, 36%, 19%, 3%, 12%.]
The general architecture of the Information Extraction system
Biomedical abstracts
Normalization
Identification of proper nouns, acronyms, semantic and syntactic tagging
Identification of relevant text parts
Syntactic parsing
Extraction of biological relations from parse trees
Named Entity Recognition
Linking acronyms to full names of biological objects
Matching input words against previously stored tags
Identification and classification of remaining words and symbols
Domain-specific acronym and name patterns
WordNet
Specialized verb lexicons
Tag Memory Database
Closed Class Word List
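As a rough illustration of how the knowledge sources listed above might be consulted for a single word, here is a minimal Python sketch. The lookup order, the function name tag_lemma, and the three miniature dictionaries are assumptions made for this example; the slides do not specify them. The tag values are borrowed from the sample tags shown later in the talk.

```python
# Hypothetical miniature resources; the real lexicons are not given in the slides.
TAG_MEMORY = {"cell": "cell"}  # stands in for tags learned from a training set
CLOSED_CLASS = {"the": "det", "from": "prep", "after": "prep", "our": "poss"}
BIOVERBS = {"release": "production", "activate": "activation"}


def tag_lemma(lemma, is_proper_name=False):
    """Return a (source, tag) pair for one lemma, trying each resource in turn.

    The order of the checks is an assumption, not the authors' documented order.
    """
    if is_proper_name:  # assumed to be decided earlier by the NER patterns
        return ("ner", "propername")
    if lemma in TAG_MEMORY:
        return ("tag_memory", TAG_MEMORY[lemma])
    if lemma in CLOSED_CLASS:
        return ("closed_class", CLOSED_CLASS[lemma])
    if lemma in BIOVERBS:
        return ("verb_lexicon", "bioverb:" + BIOVERBS[lemma])
    return ("unknown", None)  # left for morphology or WordNet in a fuller system
```

For instance, tag_lemma("release") yields a bioverb tag of type production, while an unlisted lemma such as "mesangium" falls through to the unknown case.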
Patterns for domain-specific Named Entity Recognition
Pattern 1: n lower case chars (n>=1) + m integers (m >=2) + optionally: any character (p53, cdc25C, bcl2)
Pattern 2: n lower case chars (n>=1) + m upper case chars (m>=1) + k integers (k>=0) (mRNA)
Pattern 3: integer + lower case + n integers (n>=0) (1alpha)
Pattern 4: n integers (n>=1) + m upper case (m >=1) (7BL)
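The four patterns above translate fairly directly into regular expressions. The sketch below is one possible reading, not the authors' implementation: "integer" is taken to mean a digit, the leading "integer" of Pattern 3 is allowed to be more than one digit, and the names NAME_PATTERNS and match_name_pattern are invented for this example. Read literally (m >= 2), Pattern 1 does not cover the slide's own example bcl2, which has only one digit; lowering the bound to one digit would admit it.

```python
import re

# Regexes written from the four slide patterns; "integer" is read as a digit.
NAME_PATTERNS = {
    1: re.compile(r"[a-z]+[0-9]{2,}.?"),   # p53, cdc25C
    2: re.compile(r"[a-z]+[A-Z]+[0-9]*"),  # mRNA
    3: re.compile(r"[0-9]+[a-z]+[0-9]*"),  # 1alpha
    4: re.compile(r"[0-9]+[A-Z]+"),        # 7BL
}


def match_name_pattern(token):
    """Return the number of the first pattern matching the whole token, or None."""
    for number, pattern in NAME_PATTERNS.items():
        if pattern.fullmatch(token):
            return number
    return None
```

Using fullmatch rather than search keeps an ordinary word that merely contains a digit sequence from being flagged by accident.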
Linking acronyms to full names of biological objects
From previous procedure: place pointer at the first word in the sentence.
Find next acronym A.
Found? No: to next procedure (other parts of the NER module).
Found? Yes: L1 := first letter of A; N := number of letters in A.
Is A within (…)? Yes: find the N:th word beginning in L1 to the left of the '(', link that word and its right context to A.
Is A within (…)? No: is A followed by '(' and L1*? Yes: mark the words inside the (…), link to A. No: find next acronym.
Examples:
There are also tumor-related genes like NF2 (neurofibromatose of type 2).
p16INK4a belongs to a group cell cycle regulator called cyclin dependent kinase inhibitors (CDKI).
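The two branches of the flowchart can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the acronym shape in ACRONYM, the function name link_acronyms, and the use of a regex scan instead of a word pointer are all choices made for this example, and "the N:th word" is read as the word N positions to the left of the opening parenthesis.

```python
import re

# Assumed acronym shape: an upper-case letter followed somewhere by another
# upper-case letter or a digit (NF2, CDKI). Not taken from the slides.
ACRONYM = r"[A-Z][A-Za-z]*[A-Z0-9][A-Za-z0-9]*"


def link_acronyms(sentence):
    """Map each acronym in the sentence to the full name found next to it."""
    links = {}
    # Branch 1: the acronym stands inside parentheses, the full name to its left.
    for m in re.finditer(r"\((" + ACRONYM + r")\)", sentence):
        acronym = m.group(1)
        n = sum(ch.isalpha() for ch in acronym)  # N := number of letters in A
        left_words = sentence[:m.start()].split()
        if len(left_words) >= n and left_words[-n][0].lower() == acronym[0].lower():
            links[acronym] = " ".join(left_words[-n:])
    # Branch 2: the acronym is followed by '(' and a word starting with L1.
    for m in re.finditer(r"\b(" + ACRONYM + r")\s*\(([^)]*)\)", sentence):
        acronym, inner = m.group(1), m.group(2).strip()
        if inner and inner[0].lower() == acronym[0].lower():
            links[acronym] = inner
    return links
```

Applied to the two example sentences, the sketch links CDKI to "cyclin dependent kinase inhibitors" and NF2 to "neurofibromatose of type 2".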
Sample semantico-syntactic tags
Our finding implicates that TNF-alpha released from the mesangium after IgA deposition activates renal tubular cells.
[semcat('Our',our,[[],poss([])]),
semcat(finding,find,[wnn,[]]),
semcat(implicates,implicate,[[],[speech_act_verb([1])]),
semcat(that,that,[[],rel([])]),
semcat('TNF',[propername]),
semcat(alpha,alpha,[wnn,[]]),
semcat(released,release,[[],bioverb([[],production])]),
semcat(from,from,[[],prep([])]),
semcat(the,the,[[],det([])]),
semcat(mesangium,mesangium,[[],[]]),
semcat(after,after,[[],prep([])]),
semcat('IgA',[propername]),
semcat(deposition,deposition,[wnn,[]]),
semcat(activates,activate,[[],bioverb([[],activation])]),
semcat(renal,renal,[adj,[]]),
semcat(tubular,tubular,[adj,[]]),
semcat(cells,cell,[[],cell([])]),
semcat('.',[[],[]])]
Tags (occurrences) in the test set in relation to knowledge sources
[Bar chart: number of tag occurrences per category, on a 0 to 9000 scale. Categories: Proper Nouns, N, CCV, "Bioverbs", V, Adj+Adv, Closed Class. Series: tags learned from the training set; tags from NER, lexicons and morphology.]
The next step: finding background and foreground in abstracts
Textual delimiters
Background/foreground in abstracts
ID: 16284406. The transcription factors dehydration-responsive element-binding protein 1s (DREB1s)/C-repeat-binding factors (CBFs) specifically interact with the DRE/CRT cis-acting element and control the expression of many stress-inducible genes in Arabidopsis. The genes for DREB1 orthologs, OsDREB1A and OsDREB1B from rice, are induced by cold stress, and overexpression of DREB1 or OsDREB1 induced strong expression of stress-responsive genes in transgenic Arabidopsis plants, resulting in increased tolerance to high-salt and freezing stresses. In this study, we generated transgenic rice plants overexpressing the OsDREB1 or DREB1 genes. These transgenic rice plants showed not only growth retardation under normal growth conditions but also improved tolerance to drought, high-salt and low-temperature stresses like the transgenic Arabidopsis plants overexpressing OsDREB1 or DREB1. We also detected elevated contents of osmoprotectants such as free proline and various soluble sugars in the transgenic rice as in the transgenic Arabidopsis plants. (…)
Retrieval of Relevant Text Parts
Presence of the string this study/current study/present study/our study or synonyms of study in the same context (work, research, investigation)
Presence of the pronoun we preceded by or followed by a verb denoting an event in the world of the researcher (i.e., a cognition, communication, or manipulation verb) and not combined with a time adverb referring to past time, such as previously, earlier
Presence of the string our goal/our aim
Presence of a cognition/communication verb combined with the adverb now, presently or here.
Tense shift from present to past.
Success rate: 92.5%
Retrieval of Relevant Text Parts (2)
if Foreground < 6 and word is in [study, work, research, investigation] and word-1 is in [this, current, present, our] -> Foreground = 6
else if Foreground < 5 and word is a CCverb and foundWe = 1 and found{"previously", "earlier"} = 0 -> Foreground = 5
else if Foreground < 4 and foundCCverb = 1 and foundWe = 1 and word is not in [previously, earlier] -> Foreground = 4
else if Foreground < 3 and word is in [goal, aim] and word-1 is [our] -> Foreground = 3
else if Foreground < 2 and word is a CCverb -> if found{"now", "presently", "here"} = 1 -> Foreground = 2, else foundCCverb = 1
if Foreground < 2 and word is in [now, presently, here] -> if foundCCverb = 1 -> Foreground = 2, else found{"now", "presently", "here"} = 1
if word indicates tense shift from present to past -> Foreground = 1
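The rule cascade above can be written as a small Python function. This is a sketch under several assumptions that the slide leaves open: the CC_VERBS set is a hypothetical mini-lexicon of cognition/communication verbs, the foundWe and past-adverb flags are set as soon as the cue word is seen, the tense-shift test is supplied by the caller as a hypothetical callback, and the last rule is guarded so that it never lowers a score already assigned.

```python
STUDY_NOUNS = {"study", "work", "research", "investigation"}
STUDY_DETS = {"this", "current", "present", "our"}
PAST_ADVERBS = {"previously", "earlier"}
NOW_ADVERBS = {"now", "presently", "here"}
# Hypothetical mini-lexicon of cognition/communication (CC) verbs.
CC_VERBS = {"hypothesize", "report", "demonstrate", "show", "suggest"}


def foreground_score(words, cc_verbs=CC_VERBS, tense_shift=None):
    """Score one sentence (a list of words) from 0 to 6 with the slide's rules."""
    fg = 0
    found_we = found_cc_verb = found_past_adv = found_now_adv = False
    prev = None
    for i, raw in enumerate(words):
        word = raw.lower()
        # Assumption: these flags are set as soon as the cue word is seen.
        found_we = found_we or word == "we"
        found_past_adv = found_past_adv or word in PAST_ADVERBS
        is_cc = word in cc_verbs
        if fg < 6 and word in STUDY_NOUNS and prev in STUDY_DETS:
            fg = 6
        elif fg < 5 and is_cc and found_we and not found_past_adv:
            fg = 5
        elif fg < 4 and found_cc_verb and found_we and word not in PAST_ADVERBS:
            fg = 4
        elif fg < 3 and word in {"goal", "aim"} and prev == "our":
            fg = 3
        elif fg < 2 and is_cc:
            if found_now_adv:
                fg = 2
            else:
                found_cc_verb = True
        if fg < 2 and word in NOW_ADVERBS:
            if found_cc_verb:
                fg = 2
            else:
                found_now_adv = True
        # Assumption: guarded with fg < 1 so a tense shift never lowers a score.
        if fg < 1 and tense_shift is not None and tense_shift(i, words):
            fg = 1
        prev = word
    return fg
```

A sentence such as "In this study we generated transgenic rice plants" receives the top score of 6 because "study" follows "this", whereas "We previously report" stays at 0 because the past-time adverb blocks the we-plus-verb rule.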
Extracting relations from syntactic trees
[Figure: syntactic tree of the sentence below. Node labels: S, Sdsent, Relcl, NUX i, NUX j, subj, pred, obj, advl (agent), P, NP; Ref i = mediators, Ref j = HMC. Predicates: hypothesize; release, pass; trigger, pass; may lead to activation.]
Entities in the figure: mediators, HMC, IgA deposition, PTEC
Relation labels in the figure:
Hypothesis: KEGG relation: activation, v: activate
KEGG relation: activation, v: trigger
relation type: production, v: release
We hypothesise that mediators released from human mesangial cells (HMC) triggered by IgA deposition may lead to activation of proximal tubular epithelial cells (PTEC)
Allelic loss at TP53 seems to arise independently of LOH at the RB1 gene in carcinomas of the uterine corpus in humans
The syntactic tree after application of the tree search algorithm
A possible graphical representation of the compressed tree
[Figure: compressed tree. Node labels: Hypothesis; Allelic loss, place: TP53; appearance; No relation; LOH.]
Results
Test corpus: about 15 000 words selected from PubMed using p53 as keyword
Tagging: 95.2% recall
Retrieval of relevant text parts: success rate 92.5%
Syntactic parsing: 79% recall, 86% precision
Relation retrieval: tested only manually, success rate about 94%
Current and Future Work
A revised tagging procedure; tagging using a smaller lexicon and domain-specific prefix list
Parsing improvements
Implementation of the tree search algorithm
The question of the final output format