
Page 1: Free-text Medical Document Retrieval via Phrase-based Vector Space Model

Wenlei Mao, MS and Wesley W. Chu, PhD

wenlei@cs.ucla.edu and wwc@cs.ucla.edu

Computer Science Department
University of California, Los Angeles

Page 2: Outline

11/9-13/2002 AMIA 2002

Vector space model (VSM) in document retrieval
Stem-based VSM
Concept-based VSM
Conceptual similarity
Phrase-based VSM
Retrieval effectiveness comparison
Conclusion

Page 3: Document Retrieval

Find free-text documents to answer queries like:

“Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.”

Page 4: Vector Space Model (VSM)

Words as terms

[Figure: document vector d and query vector q in a term space whose axes are “Hyperthermia” and “Leukocytosis”]

$$\mathrm{sim}(\vec{q}, \vec{d}) = \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\sqrt{\vec{q} \cdot \vec{q}}\,\sqrt{\vec{d} \cdot \vec{d}}}$$
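As a minimal sketch of the cosine measure on this slide, the Python below scores a query against a document when single words are the terms. The raw term-count weighting, the example texts, and the names `cosine_similarity`, `query` and `document` are assumptions for illustration; the slide does not specify a weighting scheme.

```python
import math
from collections import Counter

def cosine_similarity(q, d):
    """Cosine of the angle between two sparse term vectors (term -> weight)."""
    dot = sum(weight * d.get(term, 0.0) for term, weight in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)

# Assumed weighting: raw term counts over hypothetical example texts.
query = Counter(["hyperthermia", "leukocytosis"])
document = Counter(["hyperthermia", "infection"])
score = cosine_similarity(query, document)
```

The two texts share one of two words each, so the score sits halfway between unrelated (0) and identical (1).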

Page 5: Stem-based VSM

Morphological variants bear similar content
  E.g., “edema” and “edemas”
Use stemmer to extract stems
  Lovins stemmer and Porter stemmer

Query: “Hyperthermia, leukocytosis, increased intracranial pressure”…
Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”…

Baseline of comparison
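To illustrate what stemming does to the query above, here is a toy suffix stripper. It is neither the Lovins nor the Porter algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are hypothetical, chosen only so that the slide's examples come out as shown.

```python
# Hypothetical suffix list (longest first); a real stemmer uses far richer rules.
SUFFIXES = sorted(
    ["ative", "ial", "ia", "is", "ed", "al", "as", "e", "a", "s"],
    key=len,
    reverse=True,
)
MIN_STEM_LENGTH = 3  # assumed guard so short words are not stripped to nothing
VOWELS = set("aeiou")

def toy_stem(word):
    """Strip the longest matching suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]  # e.g. "small" -> "smal"
    return word

query_words = ["Hyperthermia", "leukocytosis", "increased", "intracranial", "pressure"]
stems = [toy_stem(w) for w in query_words]
```

With this toy rule set, “edema” and “edemas” map to the same stem, which is the point of the morphological-variant example.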

Page 6: Shortcomings of Stem-based VSM

Inability to capture multi-word concepts:
  1. “Increased intracranial pressure”
Inability to utilize the relations between concepts:
  2. Synonyms: “hyperthermia” and “fever”
  3. IS-A relation: “hyperthermia” and “body temperature elevation”

Page 7: Concept-based VSM

Uses concepts in knowledge base (KB) as terms
  KB: Metathesaurus in UMLS
Captures multi-word concepts
Captures synonyms

Query: “Hyperthermia, leukocytosis, increased intracranial pressure”…
CUIs: (C0015967), (C0023518), (C0151740)…
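The mapping from query text to CUIs can be sketched with a greedy longest-match lookup. The dictionary `TOY_KB` is a hypothetical stand-in for the UMLS Metathesaurus: only the three CUIs come from the slide, the “fever” entry is an assumed synonym row, and the function name `map_to_cuis` is illustrative.

```python
import re

# Toy stand-in for a Metathesaurus lookup (word tuple -> CUI).
TOY_KB = {
    ("hyperthermia",): "C0015967",
    ("fever",): "C0015967",  # assumed synonym entry, following the synonym example
    ("leukocytosis",): "C0023518",
    ("increased", "intracranial", "pressure"): "C0151740",
}
MAX_WORDS = max(len(key) for key in TOY_KB)

def map_to_cuis(text):
    """Return the CUIs found in text, preferring the longest match at each position."""
    tokens = re.findall(r"[a-z]+", text.lower())
    cuis = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            cui = TOY_KB.get(tuple(tokens[i:i + n]))
            if cui is not None:
                cuis.append(cui)
                i += n
                break
        else:
            i += 1  # no concept starts at this word; skip it
    return cuis

cuis = map_to_cuis("Hyperthermia, leukocytosis, increased intracranial pressure")
```

Because synonyms share one identifier, a query containing “fever” and a document containing “hyperthermia” would match on the same term.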

Page 8: Shortcomings of Concept-based VSM

Concepts may be related:
  E.g., “hyperthermia” and “body temperature elevation” are not identical but related concepts
  Need to quantify conceptual relations
Knowledge bases are often incomplete, which reduces the retrieval effectiveness

Page 9: Conceptual Similarity Evaluation

[Figure: hypernym hierarchy with four nodes, c1 to c4; the node labels are “Disease”, “Animal disease”, “Body temperature elevation” and “Hyperthermia”]

$$s(c_i, c_j) = \frac{1}{d(c_i, c_j)\,\log_2\bigl(D(c_i) + D(c_j) + 1\bigr)}$$

Node distance: d(c3,c4) = 1
Descendant count: D(c3) = 2, D(c4) = 0
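As a worked example of the slide's numbers, the sketch below assumes the similarity formula reads s(ci,cj) = 1 / (d(ci,cj) · log2(D(ci) + D(cj) + 1)), with d the node distance and D the descendant count. The function name and the input checks are illustrative additions, not part of the original.

```python
import math

def conceptual_similarity(distance, descendants_i, descendants_j):
    """s = 1 / (d * log2(D_i + D_j + 1)), under the assumed reading of the slide."""
    if distance < 1:
        raise ValueError("node distance must be at least 1 for distinct concepts")
    total = descendants_i + descendants_j
    if total < 1:
        # log2(1) = 0 would divide by zero; the formula is undefined here.
        raise ValueError("undefined when neither concept has descendants")
    return 1.0 / (distance * math.log2(total + 1))

# Values from the slide: d(c3,c4) = 1, D(c3) = 2, D(c4) = 0.
s_c3_c4 = conceptual_similarity(distance=1, descendants_i=2, descendants_j=0)
```

Under this reading, the similarity falls as the two concepts move further apart or as they become more general (more descendants).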

Page 10: Deriving Conceptual Similarity From Hypernym Hierarchy

[Figure: hypernym hierarchy with four nodes, c1 to c4; the node labels are “Disease”, “Animal disease”, “Body temperature elevation” and “Hyperthermia”]

$$s(c_i, c_j) = \frac{1}{d(c_i, c_j)\,\log_2\bigl(D(c_i) + D(c_j) + 1\bigr)}$$
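The two quantities that feed the similarity measure, node distance and descendant count, can be read off a hypernym hierarchy. The sketch below uses a made-up hierarchy (not the one in the figure) and assumes that distance is the number of IS-A links on the shortest path, with links treated as undirected; all names are hypothetical.

```python
from collections import deque

# Hypothetical hierarchy, parent -> children. Not the hierarchy on the slide.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}

def descendant_count(node):
    """Number of distinct concepts strictly below node."""
    seen = set()
    stack = list(CHILDREN.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(CHILDREN.get(current, []))
    return len(seen)

def node_distance(start, goal):
    """Links on the shortest path between two concepts (breadth-first search)."""
    neighbours = {}
    for parent, kids in CHILDREN.items():
        for kid in kids:
            neighbours.setdefault(parent, set()).add(kid)
            neighbours.setdefault(kid, set()).add(parent)
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        current, dist = queue.popleft()
        if current == goal:
            return dist
        for nxt in neighbours.get(current, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected
```

In this toy hierarchy, "a" and its child "a1" reproduce the pattern of the slide's c3 and c4: distance 1, descendant counts 2 and 0.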

Page 11: Shortcomings of Concept-based VSM

Concepts may be related:
  The conceptual similarity measure, s(ci,cj), quantifies relations between concepts.
Knowledge bases are often incomplete, which reduces the retrieval effectiveness.

Page 12: Incompleteness of the Knowledge Bases

Missing concepts in KB, e.g., “Infiltrative small bowel process”:
  (), (C0021852), ()
Missing links between related concepts, e.g.:
  (cerebral edema) (cerebral lesion)
In general, concept-based VSM cannot outperform stem-based VSM

Page 13: Phrase-based Indexing Examples

“Infiltrative small bowel process”
  [(); “infiltr”] [(C0021852); “smal”, “bowel”] [(); “proces”]

Query: “Cerebral edema”
  [(C0699725); “cerebr”, “edem”]
Document: “Cerebral lesion”
  [(C0221505); “cerebr”, “lesion”]

Query: “Hyperthermia, leukocytosis, increased intracranial pressure…”
Phrases: [(C0015967); “hypertherm”] [(C0023518); “leukocytos”] [(C0151740); “increas”, “intracran”, “pressur”]…
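A phrase pairs an optional concept with the stems of its words, so text with no matching concept still contributes stems. The sketch below shows one way this could be represented. `TOY_KB` and `TOY_STEMS` are hypothetical lookup tables standing in for UMLS and a real stemmer, and `Phrase` and `index_phrases` are illustrative names.

```python
import re
from typing import NamedTuple, Optional, Tuple

class Phrase(NamedTuple):
    cui: Optional[str]      # None when the KB has no concept for the words
    stems: Tuple[str, ...]

# Toy stand-ins: only the CUI and the stems themselves come from the slide.
TOY_KB = {("small", "bowel"): "C0021852"}
TOY_STEMS = {"infiltrative": "infiltr", "small": "smal",
             "bowel": "bowel", "process": "proces"}

def index_phrases(text):
    """Split text into phrases, preferring the longest KB match at each position."""
    tokens = re.findall(r"[a-z]+", text.lower())
    max_words = max(len(key) for key in TOY_KB)
    phrases = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            words = tuple(tokens[i:i + n])
            if words in TOY_KB:
                break
        else:
            words = (tokens[i],)  # no concept here: a one-word, concept-less phrase
        stems = tuple(TOY_STEMS.get(w, w) for w in words)
        phrases.append(Phrase(TOY_KB.get(words), stems))
        i += len(words)
    return phrases

phrases = index_phrases("Infiltrative small bowel process")
```

The result mirrors the first example on the slide: two concept-less phrases around one phrase that carries both a CUI and its stems.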

Page 14: Evaluate Phrase-based Document Similarity

$$\mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\sqrt{\vec{q} \cdot \vec{q}}\,\sqrt{\vec{d} \cdot \vec{d}}}$$

$$\vec{q} \cdot \vec{d} = \sum_{p_q \in P_q} \sum_{p_d \in P_d} \max\bigl(S_c(p_q, p_d),\ S_s(p_q, p_d)\bigr)$$

S_c(pq, pd): due to the conceptual similarity s(ci,cj) between concepts in pq and pd
S_s(pq, pd): due to the stem overlap in pq and pd
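The following sketch puts the two contributions together for the “Cerebral edema” versus “Cerebral lesion” example. The slide does not define the concept and stem contributions in detail, so both definitions here are assumptions: the concept contribution is 1 for identical concepts and otherwise a looked-up s(ci,cj), and the stem contribution is the Jaccard overlap of the stem sets. Term weights are omitted and all function names are hypothetical.

```python
import math

# A phrase is a pair: (cui or None, tuple of stems).

def s_concept(pq, pd, concept_sim):
    """Assumed concept contribution: 1 if same concept, else a looked-up s(ci,cj)."""
    cq, cd = pq[0], pd[0]
    if cq is None or cd is None:
        return 0.0
    if cq == cd:
        return 1.0
    return concept_sim.get(frozenset((cq, cd)), 0.0)

def s_stem(pq, pd):
    """Assumed stem contribution: Jaccard overlap of the two stem sets."""
    sq, sd = set(pq[1]), set(pd[1])
    union = sq | sd
    return len(sq & sd) / len(union) if union else 0.0

def dot(q, d, concept_sim):
    """Sum over all phrase pairs of the larger of the two contributions."""
    return sum(max(s_concept(pq, pd, concept_sim), s_stem(pq, pd))
               for pq in q for pd in d)

def phrase_similarity(q, d, concept_sim):
    denom = math.sqrt(dot(q, q, concept_sim) * dot(d, d, concept_sim))
    return dot(q, d, concept_sim) / denom if denom else 0.0

query = [("C0699725", ("cerebr", "edem"))]
document = [("C0221505", ("cerebr", "lesion"))]
# Empty table: the KB has no link between the two concepts.
score_without_link = phrase_similarity(query, document, concept_sim={})
```

Even with no link between the two concepts, the shared stem “cerebr” keeps the score above zero, which is how stems compensate for an incomplete knowledge base.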

Page 15: To Compare Retrieval Effectiveness

The test set: OHSUMED
  106 queries, 14K documents
  Expert relevance judgment: R or N
Retrieval effectiveness:
  Recall: the percentage of relevant documents retrieved so far
  Precision: the percentage of retrieved documents that are relevant
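Recall and precision "so far" can be computed at every position of a ranked result list. The sketch below does this for a hypothetical ranking; the document IDs and the function name are illustrative, and values are returned as fractions rather than percentages.

```python
def recall_precision_at_ranks(ranked_ids, relevant_ids):
    """Return a (recall, precision) pair after each retrieved document."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points

# Hypothetical ranking: d1 and d3 are judged relevant (R), the rest not (N).
points = recall_precision_at_ranks(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Plotting precision against recall over such points, averaged across queries, gives the kind of curve used to compare the indexing schemes.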

Page 16: Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS)

[Figure: precision-recall curves; x-axis: recall (0 to 1); y-axis: average precision over 105 queries (0 to 1). The curves are added one at a time: Stems; Concepts Unrelated; Concepts Related; Phrases, Concepts Unrelated; Phrases, Concepts Related. Annotation on the final plot: “16%, 100 queries vs. 5%, 50 queries”.]

Page 17: Stem and Concept Similarity Contribution Weights

Without weights:

$$\vec{q} \cdot \vec{d} = \sum_{p_q \in P_q} \sum_{p_d \in P_d} \max\bigl(S_c(p_q, p_d),\ S_s(p_q, p_d)\bigr)$$

With weights:

$$\vec{q} \cdot \vec{d} = \sum_{p_q \in P_q} \sum_{p_d \in P_d} \max\bigl(f_c\,S_c(p_q, p_d),\ f_s\,S_s(p_q, p_d)\bigr)$$

f_c: similarity contribution weight for concepts
f_s: similarity contribution weight for stems
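A short sketch of the weighted form, assuming the concept and stem contributions are available as functions of a phrase pair. The scores in `S_C` and `S_S`, the phrase labels, and all names are hypothetical; the point is only how the weights f_c and f_s shift which contribution wins the max.

```python
def weighted_dot(q, d, s_c, s_s, f_c=1.0, f_s=1.0):
    """Sum over phrase pairs of max(f_c * S_c, f_s * S_s)."""
    return sum(max(f_c * s_c(pq, pd), f_s * s_s(pq, pd))
               for pq in q for pd in d)

# Hypothetical contribution scores for a single phrase pair.
S_C = {("pq1", "pd1"): 0.6}
S_S = {("pq1", "pd1"): 0.5}

def s_c(pq, pd):
    return S_C.get((pq, pd), 0.0)

def s_s(pq, pd):
    return S_S.get((pq, pd), 0.0)

both = weighted_dot(["pq1"], ["pd1"], s_c, s_s)                      # equal weights
stems_only = weighted_dot(["pq1"], ["pd1"], s_c, s_s, f_c=0.0)       # (fs, fc) = (1, 0)
concepts_only = weighted_dot(["pq1"], ["pd1"], s_c, s_s, f_s=0.0)    # (fs, fc) = (0, 1)
damped = weighted_dot(["pq1"], ["pd1"], s_c, s_s, f_c=0.8)           # (fs, fc) = (1, 0.8)
```

Setting one weight to zero recovers a purely stem-based or purely concept-based score, so the weighted form spans both earlier models.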

Page 18: Sensitivity of Retrieval Effectiveness to fs and fc

[Figure: 11-point average precision (y-axis, 0.37 to 0.44) plotted against (fs, fc), the (stem similarity contribution weight, concept similarity contribution weight) pair, at (1, 0), (1, 0.4), (1, 0.8), (1, 1), (0.8, 1), (0.4, 1) and (0, 1). Labels on the plot: “Stems”, “Concepts”, “Optimal region”.]
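The y-axis metric can be sketched as follows, assuming the common interpolated definition of 11-point average precision: at each recall level 0.0, 0.1, ..., 1.0, take the highest precision reached at that recall or beyond, then average the eleven values. The slides do not state which variant was used, and the ranking and names here are hypothetical.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("undefined without relevant documents")
    points = []  # (recall, precision) at each rank where a relevant document appears
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for step in range(11):
        level = step / 10
        # Small tolerance guards against floating-point error in the comparison.
        reachable = [p for r, p in points if r >= level - 1e-12]
        total += max(reachable, default=0.0)
    return total / 11

# Hypothetical ranking: d1 and d3 are relevant.
ap11 = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

For this ranking the interpolated precision is 1.0 up to recall 0.5 and 2/3 beyond it, so the average is (6 · 1 + 5 · 2/3) / 11 = 28/33.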

Page 19: Computation Complexity Using Phrase-based VSM

Data reorganization:
  Build separate indexes on stems and concepts
  With each concept ci, keep a list of its related concepts cj and their conceptual similarities s(ci,cj)
Time complexities of document similarity calculation (same order of magnitude):
  Stem-based VSM: $O(L_s \log L_s)$
  Phrase-based VSM: $O(L_s \log L_s + N_{cr} L_c \log L_c)$
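The data reorganization can be sketched as two inverted indexes plus a per-concept table of related concepts. This is an illustrative layout under stated assumptions, not the authors' implementation: the function names, the `related` table, and the placeholder identifiers `CUI_A` and `CUI_B` are hypothetical, and a hash-based index has different lookup costs from the complexities quoted on the slide.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: [(cui_or_None, [stems...]), ...]} -> (stem_index, concept_index)."""
    stem_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, phrases in docs.items():
        for cui, stems in phrases:
            if cui is not None:
                concept_index[cui].add(doc_id)
            for stem in stems:
                stem_index[stem].add(doc_id)
    return stem_index, concept_index

def candidate_docs(query_phrases, stem_index, concept_index, related):
    """Documents sharing a stem, a concept, or a related concept with the query.

    related: {cui: {related_cui: similarity}}, the list kept with each concept.
    """
    found = set()
    for cui, stems in query_phrases:
        for stem in stems:
            found |= stem_index.get(stem, set())
        if cui is not None:
            found |= concept_index.get(cui, set())
            for other in related.get(cui, {}):
                found |= concept_index.get(other, set())
    return found

docs = {
    "doc1": [("C0221505", ["cerebr", "lesion"])],  # "Cerebral lesion" from the slides
    "doc2": [("CUI_B", ["stemb"])],                # placeholder concept and stem
}
stem_index, concept_index = build_indexes(docs)
by_stem = candidate_docs([("C0699725", ["cerebr", "edem"])], stem_index, concept_index, {})
```

Keeping the related-concept list next to each concept means a query concept can reach documents indexed under a related concept without scanning the hierarchy at query time.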

Page 20: Conclusion

A new document indexing paradigm based on phrases is proposed
Use phrases (concept and its word stems) as terms
Document similarity is derived from both the stem and the concept contributions
Conceptual similarity quantifies the concept relations and improves retrieval effectiveness
Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts)
Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM

Page 21: Acknowledgement

This research is supported in part by NIC/NIH Grant #4442511-33780