bridget mcinnes ted pedersen ying liu genevieve b. melton serguei pakhomov

Post on 23-Feb-2016

DESCRIPTION

Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. Bridget McInnes Ted Pedersen Ying Liu Genevieve B. Melton Serguei Pakhomov. Objective of this work. Develop and evaluate a method that can - PowerPoint PPT Presentation

TRANSCRIPT

1

KNOWLEDGE-BASED METHOD FOR DETERMINING THE MEANING OF AMBIGUOUS BIOMEDICAL TERMS USING INFORMATION CONTENT MEASURES OF SIMILARITY

Bridget McInnes, Ted Pedersen, Ying Liu, Genevieve B. Melton, Serguei Pakhomov

2

OBJECTIVE OF THIS WORK

Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical Language System (UMLS)

Evaluate the efficacy of Information Content-based similarity measures over path-based similarity measures for Word Sense Disambiguation (WSD)

3

WORD SENSE DISAMBIGUATION

Word sense disambiguation is the task of determining the appropriate sense of a term given the context in which it is used.

TERM: tolerance

Drug Tolerance

Immune Tolerance

4

WORD SENSE DISAMBIGUATION

Buspirone attenuates tolerance to morphine in mice with skin cancer

Drug Tolerance

Immune Tolerance

5

SENSE INVENTORY: UNIFIED MEDICAL LANGUAGE SYSTEM

Unified Medical Language System (UMLS) knowledge sources:

Semantic Network

Metathesaurus
  ~1.7 million biomedical and clinical concepts; integrated semi-automatically
  CUIs (Concept Unique Identifiers), linked by relations:
    Hierarchical: PAR/CHD and RB/RN
    Non-hierarchical: SIB, RO
  Sources viewed together or independently, e.g. Medical Subject Headings (MSH)

SPECIALIST Lexicon
  Biomedical and clinical terms, including variants

6

WORD SENSE DISAMBIGUATION

Buspirone attenuates tolerance to morphine in mice with skin cancer

Drug Tolerance: C0013220

Immune Tolerance: C0020963

CUIs: Concept Unique Identifiers

7

SENSERELATE ALGORITHM

Each possible sense of a target word is assigned a score: the sum of the similarities between that sense and the surrounding terms

Assign the target word the sense with the highest score

Proposed by Patwardhan and Pedersen (2003) using WordNet

UMLS::SenseRelate is a modification of this algorithm using information from the UMLS

NEXT UP: an example

14

SENSERELATE EXAMPLE

Buspirone attenuates tolerance to morphine in mice with skin cancer

Candidate senses of "tolerance":
  Drug Tolerance: C0013220
  Immune Tolerance: C0020963

Surrounding terms:
  Buspirone: C0006462
  Morphine: C0026549
  Mice: C0026809
  Skin cancer: C0007114

Similarities shown in the diagram:
  Drug Tolerance: 0.09, 0.09, 0.16, 0.11
  Immune Tolerance: 0.09, 0.09, 0.05, 0.04

Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45

Immune Tolerance Score = 0.09 + 0.09 + 0.05 + 0.04 = 0.27
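The scoring in this example can be sketched in a few lines of Python. This is only an illustration, not the UMLS::SenseRelate implementation: the function name `sense_relate` and the dictionary layout are assumptions, and the similarity values are copied from the slide. The Immune Tolerance values are taken as 0.09, 0.09, 0.05 and 0.04, the values shown in the diagram, which sum to the stated 0.27.

```python
# Minimal sketch of SenseRelate scoring (hypothetical names, not the package API).
# Each candidate sense gets the sum of its similarities to the surrounding terms;
# the sense with the highest total is assigned to the target word.

def sense_relate(similarities):
    """similarities: {sense: [similarity to each surrounding term, ...]}"""
    totals = {sense: sum(values) for sense, values in similarities.items()}
    best = max(totals, key=totals.get)
    return best, totals

# Values copied from the slide. The slide does not say which value belongs to
# which surrounding term, so they are kept as plain lists.
similarities = {
    "Drug Tolerance (C0013220)": [0.09, 0.09, 0.16, 0.11],
    "Immune Tolerance (C0020963)": [0.09, 0.09, 0.05, 0.04],
}

best, totals = sense_relate(similarities)
for sense, total in totals.items():
    print(f"{sense}: {total:.2f}")
print("Assigned sense:", best)
```

With these inputs the Drug Tolerance sense receives the higher total, so it is the sense assigned to "tolerance".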

16

SENSERELATE ASSUMPTION

An ambiguous word is often used in the sense that is most similar to the sense of the terms that surround it

17

SENSERELATE COMPONENTS

Identifying the concepts of surrounding terms

Calculating semantic similarity

21

IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS

Use the SPECIALIST Lexicon to identify the terms, then map each term to a concept by a string match against the MRCONSO table in the UMLS

Buspirone attenuates tolerance to morphine in mice with skin cancer

SPECIALIST Lexicon:
  ...
  skin cancer
  skin grafting
  skin disease
  ...

MRCONSO:
  ...
  skin cancer    C0007114
  skin grafting  C0037297
  skin disease   C0037274
  ...
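The two-step lookup described on this slide can be sketched as follows. This is a sketch under stated assumptions: the small `LEXICON` set and `MRCONSO` dictionary are toy stand-ins for the real resources, the greedy longest-match strategy and the `MAX_TERM_WORDS` cap are illustrative choices rather than something the slides specify, and the function names are hypothetical. The CUIs are the ones given in the slides.

```python
# Toy stand-ins for the SPECIALIST Lexicon and the MRCONSO table (assumptions:
# the real resources are far larger and are not plain Python containers).
LEXICON = {"buspirone", "morphine", "mice",
           "skin cancer", "skin grafting", "skin disease"}
MRCONSO = {
    "buspirone": "C0006462",
    "morphine": "C0026549",
    "mice": "C0026809",
    "skin cancer": "C0007114",
    "skin grafting": "C0037297",
    "skin disease": "C0037274",
}
MAX_TERM_WORDS = 3  # hypothetical cap on the number of words in a term


def identify_terms(sentence):
    """Greedy longest-match segmentation of a sentence into lexicon terms."""
    words = sentence.lower().split()
    terms, i = [], 0
    while i < len(words):
        # Try the longest candidate first so "skin cancer" beats "skin".
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in LEXICON:
                terms.append(candidate)
                i += n
                break
        else:
            i += 1  # no lexicon term starts at this word; skip it
    return terms


def map_to_cuis(terms):
    """Exact string match of each term against the (toy) MRCONSO table."""
    return {term: MRCONSO[term] for term in terms if term in MRCONSO}


sentence = "Buspirone attenuates tolerance to morphine in mice with skin cancer"
print(map_to_cuis(identify_terms(sentence)))
```

Words with no lexicon entry, such as "attenuates", are skipped. The target word "tolerance" is left out of the toy lexicon because its senses are the ones being scored.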

22

SEMANTIC SIMILARITY MEASURES

Path-based measures
  Path
  Wu and Palmer
  Leacock and Chodorow
  Nguyen and Al-Mubaid

Information content (IC)-based measures
  Resnik
  Lin
  Jiang and Conrath

27

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Path measure
  sim(c1,c2) = 1 / minpath(c1,c2)
  where minpath is the length of the shortest path between the two concepts

Leacock and Chodorow, 1998
  sim(c1,c2) = -log( minpath(c1,c2) / (2D) )
  where D is the total depth of the taxonomy

Wu and Palmer, 1994
  sim(c1,c2) = (2 * depth(LCS(c1,c2))) / (depth(c1) + depth(c2))
  where LCS is the least common subsumer of the two concepts

Nguyen and Al-Mubaid, 2006
  sim(c1,c2) = log( 2 + (minpath(c1,c2) - 1) * (D - depth(LCS(c1,c2))) )

28

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

[Taxonomy diagram with the following concepts]
  Disease: C0012634
  Drug Related Disorder: C0277579
  Drug Tolerance: C0013220
  Neoplasm: C1302761
  Neoplastic Disease: C1882062
  Malignant Neoplasm: C0006826
  Skin cancer: C0007114
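The path-based measures can be tried on a toy version of this taxonomy. The sketch below rests on several assumptions that the slides do not state: the child-to-parent edges are guessed from the diagram (Disease as the root of two chains), path length is counted in nodes so that a concept has a path of 1 to itself, the root has depth 1, and D is the depth of this toy tree. The Nguyen and Al-Mubaid measure is written as log(2 + (minpath - 1) * (D - depth(LCS))), which is the form assumed here. None of the function names come from UMLS::Similarity.

```python
import math

# Child -> parent links, assumed from the diagram (the slide does not list edges).
PARENT = {
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(c):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(c))


def lcs(c1, c2):
    """Least common subsumer: the deepest concept that subsumes both."""
    seen = set(ancestors(c1))
    for a in ancestors(c2):  # walks upward from c2, so the first hit is deepest
        if a in seen:
            return a
    raise ValueError("concepts share no ancestor")


def minpath(c1, c2):
    """Shortest path between two concepts, counted in nodes."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


# Total depth of this toy taxonomy.
D = max(depth(c) for c in set(PARENT) | set(PARENT.values()))


def sim_path(c1, c2):
    return 1.0 / minpath(c1, c2)


def sim_lch(c1, c2):
    return -math.log(minpath(c1, c2) / (2.0 * D))


def sim_wup(c1, c2):
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def sim_nam(c1, c2):
    return math.log(2 + (minpath(c1, c2) - 1) * (D - depth(lcs(c1, c2))))


a, b = "Drug Tolerance", "Skin cancer"
for name, fn in [("path", sim_path), ("lch", sim_lch),
                 ("wup", sim_wup), ("nam", sim_nam)]:
    print(f"{name}: {fn(a, b):.3f}")
```

Under these assumptions the two concepts meet only at the root, so the path between them has 7 nodes. The numbers are not expected to match the similarity values in the SenseRelate example, which come from the full MSH hierarchy.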

30

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept))

P(concept)
  Calculated by summing the probability of the concept and the probability of its descendants
  Probabilities are obtained from an external corpus
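The calculation of P(concept) and IC described above can be sketched like this. The taxonomy fragment and the corpus frequencies are made up for illustration, and the function names are hypothetical. The sketch also assumes every concept has a non-zero propagated count; a real implementation would need some form of smoothing for unseen concepts.

```python
import math

# Hypothetical taxonomy fragment (child -> parent) and made-up corpus counts.
PARENT = {
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Skin cancer": "Neoplasm",
}
FREQ = {
    "Disease": 10,
    "Drug Related Disorder": 5,
    "Drug Tolerance": 20,
    "Neoplasm": 40,
    "Skin cancer": 25,
}


def propagated_counts(freq, parent):
    """Credit each concept's frequency to itself and to all of its ancestors,
    so a concept's count covers the concept plus its descendants."""
    counts = dict.fromkeys(freq, 0)
    for concept, n in freq.items():
        node = concept
        while node is not None:
            counts[node] += n
            node = parent.get(node)
    return counts


def information_content(freq, parent):
    """IC(c) = -log(P(c)), computed as log(total / count(c))."""
    counts = propagated_counts(freq, parent)
    total = sum(freq.values())
    return {c: math.log(total / counts[c]) for c in counts}


IC = information_content(FREQ, PARENT)
for concept, value in IC.items():
    print(f"{concept}: {value:.3f}")
```

The root covers every occurrence, so its probability is 1 and its IC is 0. More specific concepts are rarer and receive higher IC.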

33

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept))

Resnik, 1995
  sim(c1,c2) = IC(LCS(c1,c2))

Jiang and Conrath, 1997
  sim(c1,c2) = 1 / (IC(c1) + IC(c2) - 2 * IC(LCS(c1,c2)))

Lin, 1998
  sim(c1,c2) = (2 * IC(LCS(c1,c2))) / (IC(c1) + IC(c2))
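Given the IC of two concepts and of their least common subsumer, the three measures reduce to a few lines of arithmetic. The sketch below uses made-up IC values and hypothetical function names. The handling of the degenerate cases (a zero denominator) is an assumption, since the slides do not say how they are treated.

```python
def sim_res(ic_lcs):
    """Resnik: the IC of the least common subsumer."""
    return ic_lcs


def sim_jcn(ic1, ic2, ic_lcs):
    """Jiang and Conrath: inverse of the IC distance between the concepts."""
    distance = ic1 + ic2 - 2.0 * ic_lcs
    # Assumption: identical concepts (distance 0) get infinite similarity.
    return 1.0 / distance if distance > 0 else float("inf")


def sim_lin(ic1, ic2, ic_lcs):
    """Lin: twice the shared IC relative to the summed IC of both concepts."""
    denominator = ic1 + ic2
    # Assumption: return 0 when both concepts carry no information.
    return 2.0 * ic_lcs / denominator if denominator > 0 else 0.0


# Made-up IC values for two concepts and their least common subsumer.
ic_c1, ic_c2, ic_lcs = 4.0, 5.0, 3.0
print("res:", sim_res(ic_lcs))
print("jcn:", round(sim_jcn(ic_c1, ic_c2, ic_lcs), 3))
print("lin:", round(sim_lin(ic_c1, ic_c2, ic_lcs), 3))
```

Resnik depends only on the shared subsumer, while Jiang and Conrath and Lin also take the IC of the two concepts themselves into account.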

34

IC-BASED SIMILARITY MEASURES

[Diagram: PATH INFORMATION + PROBABILITY OF CONCEPTS, with the probabilities obtained from an EXTERNAL CORPUS. The taxonomy shown contains the following concepts]
  Disease: C0012634
  Drug Related Disorder: C0277579
  Drug Tolerance: C0013220
  Neoplasm: C1302761
  Neoplastic Disease: C1882062
  Malignant Neoplasm: C0006826
  Skin cancer: C0007114

35

EXPERIMENTAL FRAMEWORK

Use the open-source UMLS::Similarity package to obtain the similarity between the terms and possible senses in the SenseRelate algorithm

Path information: parent/child relations in the MSH source

Information content: calculated using the UMLSonMedline dataset created by NLM
  Consists of concepts from the 2009AB UMLS and the frequency with which they occurred in Medline, obtained using the Essie Search Engine (Ide et al 2007)
  Medline: database of citations of biomedical/clinical articles

36

EVALUATION DATA: MSH WSD

MSH-WSD dataset (Jimeno-Yepes, et al 2011)

203 target words (ambiguous words) from Medline
  106 terms, e.g. tolerance
  88 acronyms, e.g. CA (calcium, California)
  9 mixtures, e.g. bat (brown adipose tissue)

Each target word has ~187 instances (Medline abstracts)
  abstract = ~500 words

Each target word in the instances is assigned a concept from MSH by exploiting the MSH concepts manually assigned to the abstract

Average of 2.08 possible senses per target word

Majority sense over all the target words is 54.5%

37

RESULTS

[Bar chart: accuracy (axis 0 to 0.8) for the baseline, the path-based measures (path, lch, wup, nam) and the IC-based measures (res, jcn, lin)]

Data labels in transcript order: 0.55, 0.72, 0.69, 0.70, 0.72, 0.73, 0.74, 0.74

38

COMPARISON ACROSS SUBSETS OF MSH-WSD

[Bar chart: accuracy by subset]

              Terms   Acronyms   Mixture   Overall
Baseline      0.55    0.54       0.53      0.55
SenseRelate   0.67    0.80       0.73      0.74
MRD           0.71    0.87       0.88      0.80
2-MRD         0.67    0.85       0.93      0.78

43

WINDOW SIZES

Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70

Buspirone attenuates tolerance to morphine in mice with skin_cancer

WINDOW SIZE = 2
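Selecting the surrounding terms for a given window size can be sketched as below. The sketch assumes the window counts terms on each side of the target word, which the slide suggests but does not state outright, and `window_terms` is a hypothetical helper name.

```python
def window_terms(terms, target_index, size):
    """Return up to `size` terms on each side of the target, excluding the target."""
    left = terms[max(0, target_index - size):target_index]
    right = terms[target_index + 1:target_index + 1 + size]
    return left + right


# Multi-word terms are joined with an underscore, as on the slide.
terms = ["Buspirone", "attenuates", "tolerance", "to", "morphine",
         "in", "mice", "with", "skin_cancer"]
target = terms.index("tolerance")

print(window_terms(terms, target, 2))
# ['Buspirone', 'attenuates', 'to', 'morphine']
```

A window larger than the sentence simply returns every term other than the target, so the larger window sizes behave the same on short inputs.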

44

COMPARISON OF WINDOW SIZES FOR LIN

[Line chart: accuracy of lin by window size]

window size   accuracy
0             0.50
1             0.53
2             0.65
5             0.69
10            0.71
25            0.74
50            0.74
60            0.74
70            0.74

45

SURROUNDING TERMS

Not all terms have a concept in the UMLS

therefore

Not all surrounding terms in the window are mapped to CUIs

46

WINDOW SIZES VERSUS MAPPED TERMS

[Line chart: number of mappings for lin by window size]

window size   number of mappings
0             0
1             0.27
2             0.79
5             1.85
10            3.49
25            7.6
50            12.96
60            14.28
70            15.64

47

FUTURE WORK: MAPPING TERMS

Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap

Obtain the terms from MetaMap and do a dictionary look-up in MRCONSO
  Hypothesis: the terms obtained by MetaMap are more accurate than those obtained using the SPECIALIST Lexicon

Obtain the CUIs from MetaMap
  Hypothesis: the CUIs obtained by MetaMap will be more accurate than the dictionary look-up

48

OBJECTIVE #1

Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the UMLS

UMLS::SenseRelate achieves statistically significantly higher disambiguation accuracy than the baseline

On par with previous unsupervised methods for terms

49

OBJECTIVE #2

Evaluate the efficacy of IC-based similarity measures over path-based measures on a secondary task

There is no statistically significant difference between the accuracies obtained by the IC-based measures

There is a statistically significant difference between the IC-based measures and the path-based measures

50

TAKE HOME MESSAGE:

An ambiguous word is often used in the sense that is most similar to the sense of the concepts of the terms that surround it

51

RESOURCES

Software:
  UMLS::SenseRelate  http://search.cpan.org/dist/UMLS-SenseRelate/
  UMLS::Similarity   http://search.cpan.org/dist/UMLS-Similarity/

Data:
  MSH-WSD  http://wsd.nlm.nih.gov/collaboration.shtml

52

THANK YOU

53

QUESTIONS?
