Towards an Automated Analysis of Biomedical Abstracts Barbara Gawronska, Björn Erlendsson, Björn Olsson School of Humanities and Informatics, University of Skövde, Sweden


Post on 16-Dec-2015


Towards an Automated Analysis of Biomedical

Abstracts

Barbara Gawronska, Björn Erlendsson, Björn Olsson

School of Humanities and Informatics, University of Skövde, Sweden


The goal of the project: text analysis for candidate path extraction

Scored & ranked path alignments

Path extraction from model database

Path extraction from text

Path alignment

Parameter settings

GO graph

Organism annotation database

Model pathway database

PubMed

Lexical databases, grammar

GO term probability calculation


The characteristics of the language of biomedical texts

A typical PubMed abstract (PMID: 16301995):

The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on human chromosome 17p13.3, is frequently silenced in cancer by epigenetic mechanisms. Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses and zinc-finger family of transcription factors and acts by repressing target gene expression. It has been shown that enforced p53 expression leads to increased HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in tumorigenesis. In order to elucidate the regulation of HIC1 expression, we have analysed the HIC1 promoter region for p53-dependent induction of gene expression. (…) Other members of the p53 family, notably TAp73beta and DeltaNp63alpha, can also act through this HIC1.PRE to induce transcription of HIC1, and finally, hypermethylation of the HIC1 promoter attenuates inducibility by p53.


Results of POS-tagging of two large corpora (30 million words each): 1) texts on stem cell research, and 2) general English prose

[Bar chart. Vertical axis: percentage, 0% to 40%. Categories: adjective, noun, preposition, verb, determiner, conjunction, pronoun, auxiliary, modal, adverb. Light bars: stem cell corpus; dark bars: prose corpus.]


Results of POS-tagging of a smaller sample corpus of biomedical abstracts

[Pie chart. Categories: proper nouns, nouns, verbs, closed class words, adjectives + adverbs. Slice labels in extraction order: 30%, 36%, 19%, 3%, 12%.]


The general architecture of the Information Extraction system

Processing pipeline:

Biomedical abstracts

Normalization

Identification of proper nouns, acronyms, semantic and syntactic tagging

Identification of relevant text parts

Syntactic parsing

Extraction of biological relations from parse trees

Components of the tagging step:

Named Entity Recognition

Linking acronyms to full names of biological objects

Matching input words against previously stored tags

Identification and classification of remaining words and symbols

Knowledge sources:

Domain-specific acronym and name patterns

WordNet

Specialized verb lexicons

Tag Memory Database

Closed Class Word List


Patterns for domain-specific Named Entity Recognition

Pattern 1: n lower case chars (n>=1) + m integers (m >=2) + optionally: any character (p53, cdc25C, bcl2)

Pattern 2: n lower case chars (n>=1) + m upper case chars (m>=1) + k integers (k>=0) (mRNA)

Pattern 3: integer + lower case + n integers (n>=0) (1alpha)

Pattern 4: n integers (n>=1) + m upper case (m >=1) (7BL)
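The four patterns can be read as regular expressions over single tokens. The following is a minimal sketch of that reading, not the authors' implementation; the names `PATTERNS` and `match_entity_pattern` are hypothetical, and "any character" in Pattern 1 is assumed to mean at most one trailing non-space character. Read literally (m >= 2), Pattern 1 does not cover the slide's own example bcl2, which has a single digit, so the original bound may have been looser.

```python
import re

# Hypothetical regex reading of the four slide patterns (whole-token match).
PATTERNS = [
    ("pattern1", re.compile(r"[a-z]+\d{2,}\S?")),   # p53, cdc25C
    ("pattern2", re.compile(r"[a-z]+[A-Z]+\d*")),   # mRNA
    ("pattern3", re.compile(r"\d[a-z]+\d*")),       # 1alpha
    ("pattern4", re.compile(r"\d+[A-Z]+")),         # 7BL
]


def match_entity_pattern(token):
    """Return the name of the first pattern matching the whole token, else None."""
    for name, regex in PATTERNS:
        if regex.fullmatch(token):
            return name
    return None


if __name__ == "__main__":
    for tok in ["p53", "cdc25C", "mRNA", "1alpha", "7BL", "cells"]:
        print(tok, match_entity_pattern(tok))
```

Using `fullmatch` rather than `search` keeps ordinary words that merely contain a digit sequence from being tagged.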


Linking acronyms to full names of biological objects

[Flowchart, read top to bottom]

From previous procedure

Place pointer at the first word in the sentence

Find next acronym A

Found? No: to next procedure (other parts of the NER module). Yes: L1 := first letter of A; N := number of letters in A

Within (…)? Yes: find the N-th word beginning with L1 to the left of the '(', link that word and its right context to A

No: is A followed by '(' and L1*? Yes: mark the words inside the (…), link to A. No: find next acronym

Example sentences:

There are also tumor-related genes like NF2 (neurofibromatose of type 2).

p16INK4a belongs to a group cell cycle regulator called cyclin dependent kinase inhibitors (CDKI).
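A rough sketch of the acronym-linking procedure follows, under stated assumptions. `is_acronym` is a hypothetical stand-in for the real acronym detector, and "the N-th word beginning with L1 to the left of the '('" is interpreted here as: step N words to the left of the opening parenthesis and accept the link only if that word starts with L1. Both are guesses at the intended behaviour.

```python
import re


def is_acronym(token):
    """Hypothetical test: a token with at least two upper-case letters."""
    return sum(ch.isupper() for ch in token) >= 2


def link_acronyms(sentence):
    """Return {acronym: linked full name} for one sentence."""
    # Parentheses become separate tokens; everything else splits on whitespace.
    tokens = re.findall(r"[()]|[^\s()]+", sentence)
    links = {}
    for i, tok in enumerate(tokens):
        if not is_acronym(tok):
            continue
        first = tok[0].lower()                   # L1
        n = sum(ch.isalpha() for ch in tok)      # N
        inside = (0 < i < len(tokens) - 1
                  and tokens[i - 1] == "(" and tokens[i + 1] == ")")
        if inside:
            # Acronym within (...): the full name is the N words before '('.
            start = i - 1 - n
            if start >= 0 and tokens[start].lower().startswith(first):
                links[tok] = " ".join(tokens[start:i - 1])
        elif (i + 2 < len(tokens) and tokens[i + 1] == "("
              and tokens[i + 2].lower().startswith(first)):
            # Acronym followed by '(' and a word starting with L1.
            try:
                close = tokens.index(")", i + 2)
            except ValueError:
                continue
            links[tok] = " ".join(tokens[i + 2:close])
    return links
```

Applied to the two example sentences on the slide, this sketch links NF2 to "neurofibromatose of type 2" and CDKI to "cyclin dependent kinase inhibitors", while p16INK4a stays unlinked because no parenthesis is adjacent to it.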


Sample semantico-syntactic tags

Our finding implicates that TNF-alpha released from the mesangium after IgA deposition activates renal tubular cells.

[semcat('Our',our,[[],poss([])]),
semcat(finding,find,[wnn,[]]),
semcat(implicates,implicate,[[],[speech_act_verb([1])]),
semcat(that,that,[[],rel([])]),
semcat('TNF',[propername]),
semcat(alpha,alpha,[wnn,[]]),
semcat(released,release,[[],bioverb([[],production])]),
semcat(from,from,[[],prep([])]),
semcat(the,the,[[],det([])]),
semcat(mesangium,mesangium,[[],[]]),
semcat(after,after,[[],prep([])]),
semcat('IgA',[propername]),
semcat(deposition,deposition,[wnn,[]]),
semcat(activates,activate,[[],bioverb([[],activation])]),
semcat(renal,renal,[adj,[]]),
semcat(tubular,tubular,[adj,[]]),
semcat(cells,cell,[[],cell([])]),
semcat('.',[[],[]])]


Tags (occurrences) in the test set in relation to knowledge sources

[Bar chart. Vertical axis: tag occurrences, 0 to 9000. Categories: proper nouns, N, CCV, "bioverbs", V, Adj+Adv, closed class. Two series: tags learned from the training set; tags from NER, lexicons and morphology. Bar labels as extracted: 277, 2989, 1964, 1450, 59 70420, 61580, 287742, 444, 8905, 257.]


The next step: finding background and foreground in abstracts

Biomedical abstracts

Normalization

Identification of proper nouns, acronyms,

semantic and syntactic tagging

Identification of relevant text parts

Syntactic parsing

Extraction of biological relations from parse trees


Textual delimiters


Background/foreground in abstracts

ID: 16284406. The transcription factors dehydration-responsive element-binding protein 1s (DREB1s)/C-repeat-binding factors (CBFs) specifically interact with the DRE/CRT cis-acting element and control the expression of many stress-inducible genes in Arabidopsis. The genes for DREB1 orthologs, OsDREB1A and OsDREB1B from rice, are induced by cold stress, and overexpression of DREB1 or OsDREB1 induced strong expression of stress-responsive genes in transgenic Arabidopsis plants, resulting in increased tolerance to high-salt and freezing stresses. In this study, we generated transgenic rice plants overexpressing the OsDREB1 or DREB1 genes. These transgenic rice plants showed not only growth retardation under normal growth conditions but also improved tolerance to drought, high-salt and low-temperature stresses like the transgenic Arabidopsis plants overexpressing OsDREB1 or DREB1. We also detected elevated contents of osmoprotectants such as free proline and various soluble sugars in the transgenic rice as in the transgenic Arabidopsis plants. (…)


Retrieval of Relevant Text Parts

Presence of the string this study/current study/present study/our study, or of a synonym of study in the same context (work, research, investigation)

Presence of the pronoun we preceded or followed by a verb denoting an event in the world of the researcher (i.e., a cognition, communication, or manipulation verb) and not combined with a time adverb referring to past time, such as previously or earlier

Presence of the string our goal/our aim

Presence of a cognition/communication verb combined with the adverb now, presently or here

Tense shift from present to past

Success rate: 92.5%


Retrieval of Relevant Text Parts (2)

if Foreground < 6 and word is in [study, work, research, investigation] and word-1 is in [this, current, present, our] -> Foreground = 6

else if Foreground < 5 and word is a CCverb and foundWe = 1 and set_found{"previously", "earlier"} = 0 -> Foreground = 5

else if Foreground < 4 and foundCCverb = 1 and (foundWe = 1 and word is not in [previously, earlier]) -> Foreground = 4

else if Foreground < 3 and word is in [goal, aim] and word-1 is [our] -> Foreground = 3

else if Foreground < 2 and word is a CCverb -> if set_found{"now", "presently", "here"} = 1 -> Foreground = 2 else foundCCverb = 1

if Foreground < 2 and word is in [now, presently, here] -> if foundCCverb = 1 -> Foreground = 2 else set_found{"now", "presently", "here"} = 1

if word indicates tense shift from present to past -> Foreground = 1
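The rule cascade can be approximated in runnable form. The sketch below is a loose reading rather than the authors' code: `CC_VERBS` is a tiny hypothetical stand-in for the real verb lexicon, the tense-shift test is passed in as a flag because it needs a tagger, and scores are combined with `max` so that a weaker cue never overwrites a stronger one (an assumption; the slide's last rule is unconditional).

```python
import re

STUDY_NOUNS = {"study", "work", "research", "investigation"}
STUDY_DETERMINERS = {"this", "current", "present", "our"}
GOAL_NOUNS = {"goal", "aim"}
PAST_ADVERBS = {"previously", "earlier"}
NOW_ADVERBS = {"now", "presently", "here"}
# Hypothetical stand-in for the lexicon of cognition/communication (CC) verbs.
CC_VERBS = {"show", "shown", "report", "reported", "hypothesize",
            "hypothesise", "analysed", "generated", "detected"}


def foreground_score(sentence, cc_verbs=CC_VERBS, tense_shift=False):
    """Score one sentence from 0 (no cue) to 6 (strongest foreground cue)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    score = 1 if tense_shift else 0
    found_we = found_cc = found_now = found_past = False
    prev = None
    for word in words:
        is_cc = word in cc_verbs
        if word == "we":
            found_we = True
        if word in PAST_ADVERBS:
            found_past = True
        if word in STUDY_NOUNS and prev in STUDY_DETERMINERS:
            score = max(score, 6)   # "this study", "our work", ...
        elif is_cc and found_we and not found_past:
            score = max(score, 5)   # "we" followed by a CC verb
        elif word == "we" and found_cc and not found_past:
            score = max(score, 4)   # "we" preceded by a CC verb
        elif word in GOAL_NOUNS and prev == "our":
            score = max(score, 3)   # "our goal", "our aim"
        elif (is_cc and found_now) or (word in NOW_ADVERBS and found_cc):
            score = max(score, 2)   # CC verb combined with now/presently/here
        found_cc = found_cc or is_cc
        found_now = found_now or word in NOW_ADVERBS
        prev = word
    return score
```

For instance, "In this study, we generated transgenic rice plants." scores 6 through the "this study" cue, whereas "We have previously shown ..." scores 0 because the past-time adverb blocks the "we + verb" cue.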


Extracting relations from syntactic trees

Input sentence: We hypothesise that mediators released from human mesangial cells (HMC) triggered by IgA deposition may lead to activation of proximal tubular epithelial cells (PTEC)

[Syntactic tree. Node labels: S, Sdsent, subj, pred, obj, advl, P, NP, NUX i, NUX j, Relcl, Ref i, Ref j. Main clause: subj we, pred hypothesize, obj an embedded clause with subj mediators (NUX i), pred may lead to activation, obj PTEC. Relative clause on mediators (subj: Ref i): pred release (passive), advl from HMC (NUX j). Relative clause on HMC (subj: Ref j): pred trigger (passive), advl: agent IgA deposition.]

Extracted relations:

IgA deposition -> HMC (v: trigger; KEGG relation: activation)

HMC -> mediators (v: release; relation type: production)

mediators -> PTEC (v: activate; KEGG relation: activation; status: hypothesis)
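One way to picture the output of this step is as typed edges built from (agent, verb, patient) triples read off the parse tree. The sketch below only illustrates that data shape; `Relation`, `BIOVERB_RELATION` and `to_relation` are hypothetical names, and the verb-to-relation table contains just the three verbs shown on the slides.

```python
from dataclasses import dataclass
from typing import Optional

# Verb lemma -> relation type, limited to the verbs shown on the slides.
BIOVERB_RELATION = {
    "release": "production",
    "activate": "activation",
    "trigger": "activation",
}


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    verb: str
    relation_type: str
    hypothesis: bool = False


def to_relation(agent: str, verb: str, patient: str,
                hypothesis: bool = False) -> Optional[Relation]:
    """Build a typed relation, or return None for verbs outside the lexicon."""
    relation_type = BIOVERB_RELATION.get(verb)
    if relation_type is None:
        return None
    return Relation(agent, patient, verb, relation_type, hypothesis)


if __name__ == "__main__":
    triples = [
        ("IgA deposition", "trigger", "HMC", False),
        ("HMC", "release", "mediators", False),
        ("mediators", "activate", "PTEC", True),  # under "we hypothesise"
    ]
    for agent, verb, patient, hypo in triples:
        print(to_relation(agent, verb, patient, hypo))
```

Keeping the hypothesis flag on the edge preserves the distinction between asserted and hypothesised relations in the sentence.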


Allelic loss at TP53 seems to arise independently of LOH at the RB1 gene in carcinomas of the uterine corpus in humans


The syntactic tree after application of the tree search algorithm


A possible graphical representation of the compressed tree

LOH

appearance

No relation

Allelic loss

place: TP53

Hypothesis


Results

Test corpus: about 15 000 words selected from PubMed using p53 as keyword

Tagging: 95.2% recall

Retrieval of relevant text parts: success rate 92.5%

Syntactic parsing: 79% recall, 86% precision

Relation retrieval: tested only manually, success rate about 94%
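The slides report recall and precision for parsing separately. For readers who want a single figure, the balanced F-measure can be derived from those two numbers; it is computed here from the reported values and is not a result given in the presentation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    return 2 * precision * recall / (precision + recall)


# Reported parsing figures: 86% precision, 79% recall.
parsing_f1 = f1_score(precision=0.86, recall=0.79)
print(f"Parsing F1: {parsing_f1:.3f}")
```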


Current and Future Work

A revised tagging procedure; tagging using a smaller lexicon and domain-specific prefix list

Parsing improvements

Implementation of the tree search algorithm

The question of the final output format
