Free-text Medical Document Retrieval via Phrase-based Vector Space Model
Wenlei Mao, MS and Wesley W. Chu, PhD
Computer Science Department, University of California, Los Angeles
11/9-13/2002 AMIA 2002 2
Outline
  Vector space model (VSM) in document retrieval
  Stem-based VSM
  Concept-based VSM
  Conceptual similarity
  Phrase-based VSM
  Retrieval effectiveness comparison
  Conclusion
Document Retrieval
  Find free-text documents to answer queries like “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.”
Vector Space Model (VSM)
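  Words as terms
  [Figure: query vector q and document vector d in a term space with axes “Hyperthermia” and “Leukocytosis”]
  $$\mathrm{sim}(\vec q, \vec d) = \cos(\vec q, \vec d) = \frac{\vec q \cdot \vec d}{\sqrt{\vec q \cdot \vec q}\,\sqrt{\vec d \cdot \vec d}}$$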
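The cosine measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cosine_sim`, the sparse dict representation, and the use of raw word counts as term weights are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine_sim(q, d):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)

# Words as terms; raw counts stand in for term weights (an assumption).
query = Counter("hyperthermia leukocytosis".split())
doc = Counter("leukocytosis with hyperthermia and leukocytosis".split())
score = cosine_sim(query, doc)
```

A real system would normally replace the raw counts with a weighting scheme such as tf-idf; the slides do not say which weighting was used.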
Stem-based VSM
  Morphological variants bear similar content
    E.g., “edema” and “edemas”
  Use stemmer to extract stems
    Lovins stemmer and Porter stemmer
  Query: “Hyperthermia, leukocytosis, increased intracranial pressure”…
  Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”…
  Baseline of comparison
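To make the stemming step concrete, here is a deliberately tiny suffix stripper. It is not the Lovins or the Porter algorithm: the suffix list is a hypothetical one, chosen only so that the example query words reduce to the stems shown on the slide, and the names `toy_stem` and `stem_terms` are made up for this sketch.

```python
import re

# Hypothetical suffix list, longest first; NOT the Lovins or Porter rule set.
SUFFIXES = ("ial", "as", "ed", "ia", "is", "a", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def stem_terms(text):
    """Split on non-letters and stem every token."""
    return [toy_stem(tok) for tok in re.findall(r"[A-Za-z]+", text)]

stems = stem_terms("Hyperthermia, leukocytosis, increased intracranial pressure")
```

With this toy rule set, “edema” and “edemas” both reduce to the same stem, which is the property the stem-based VSM relies on. A production system would use an established stemmer instead.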
Shortcomings of Stem-based VSM
  Inability to capture multi-word concepts
    1. “Increased intracranial pressure”
  Inability to utilize the relations between concepts:
    2. Synonyms: “hyperthermia” and “fever”
    3. IS-A relation: “hyperthermia” and “body temperature elevation”
Concept-based VSM
  Uses concepts in knowledge base (KB) as terms
    KB: Metathesaurus in UMLS
  Captures multi-word concepts
  Captures synonyms
  Query: “Hyperthermia, leukocytosis, increased intracranial pressure”…
  CUIs: (C0015967), (C0023518), (C0151740)…
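One way to picture the mapping from text to concept identifiers is a greedy longest-match lookup. The sketch below is only an illustration: the three CUIs come from the slide, but the tiny `KB` dictionary, the `fever` synonym entry, and the function `to_cuis` are assumptions and do not reflect how the UMLS Metathesaurus is actually queried.

```python
import re

# Toy knowledge base: lowercase word tuples -> CUI (a stand-in for a UMLS lookup).
KB = {
    ("hyperthermia",): "C0015967",
    ("fever",): "C0015967",  # assumed synonym entry, for illustration
    ("leukocytosis",): "C0023518",
    ("increased", "intracranial", "pressure"): "C0151740",
}
MAX_LEN = max(len(key) for key in KB)

def to_cuis(text):
    """Greedy longest-match scan; words with no KB entry are skipped."""
    words = re.findall(r"[a-z]+", text.lower())
    cuis, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            cui = KB.get(tuple(words[i:i + n]))
            if cui is not None:
                cuis.append(cui)
                i += n
                break
        else:
            i += 1  # no concept starts at this word
    return cuis

cuis = to_cuis("Hyperthermia, leukocytosis, increased intracranial pressure")
```

Because synonyms share one identifier, a query containing “fever” and a document containing “hyperthermia” would match on the same term, and the three-word phrase is indexed as a single concept.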
Shortcomings of Concept-based VSM
  Concepts may be related:
    E.g., “hyperthermia” and “body temperature elevation” are not identical but related concepts
    Need to quantify conceptual relations
  Knowledge bases are often incomplete, which reduces the retrieval effectiveness
Conceptual Similarity Evaluation
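  [Figure: hypernym hierarchy with nodes c1, c2, c3, and c4; node labels include “Disease”, “Animal disease”, “Body temperature elevation”, and “Hyperthermia”]
  $$s(c_i, c_j) = \frac{1}{d(c_i, c_j)\,\log_2\bigl(D(c_i) + D(c_j) + 1\bigr)}$$
  Node distance: d(c3, c4) = 1
  Descendant count: D(c3) = 2, D(c4) = 0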
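The slide's example values can be plugged into the similarity measure with a short function. This sketch assumes the measure reads s(ci, cj) = 1 / (d(ci, cj) · log2(D(ci) + D(cj) + 1)), with d the node distance and D the descendant count; the function name and the error handling are illustrative choices, not part of the original.

```python
import math

def conceptual_similarity(node_distance, desc_i, desc_j):
    """s = 1 / (d * log2(D_i + D_j + 1)) under the assumed reading."""
    if node_distance <= 0:
        raise ValueError("node distance must be positive")
    spread = math.log2(desc_i + desc_j + 1)
    if spread == 0.0:
        raise ValueError("undefined when neither concept has descendants")
    return 1.0 / (node_distance * spread)

# Slide example: d(c3, c4) = 1, D(c3) = 2, D(c4) = 0.
s_34 = conceptual_similarity(1, 2, 0)  # equals 1 / log2(3)
```

Under this reading, the similarity drops as the two concepts move further apart in the hierarchy and as the more general concept covers more descendants.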
Deriving Conceptual Similarity From Hypernym Hierarchy
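  [Figure: hypernym hierarchy with nodes c1, c2, c3, and c4; node labels include “Disease”, “Animal disease”, “Body temperature elevation”, and “Hyperthermia”]
  $$s(c_i, c_j) = \frac{1}{d(c_i, c_j)\,\log_2\bigl(D(c_i) + D(c_j) + 1\bigr)}$$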
Shortcomings of Concept-based VSM
  Concepts may be related: the conceptual similarity measure, s(ci, cj), quantifies relations between concepts.
  Knowledge bases are often incomplete, which reduces the retrieval effectiveness.
Incompleteness of the Knowledge Bases
  Missing concepts in KB, e.g., “Infiltrative small bowel process”
    (), (C0021852), ()
  Missing links between related concepts, e.g.,
    (cerebral edema) and (cerebral lesion)
  In general, concept-based VSM cannot outperform stem-based VSM
Phrase-based Indexing Examples
  “Infiltrative small bowel process”
    [(); “infiltr”] [(C0021852); “smal”, “bowel”] [(); “proces”]
  Query: “Cerebral edema”; Document: “Cerebral lesion”
    Query phrase: [(C0699725); “cerebr”, “edem”]
    Document phrase: [(C0221505); “cerebr”, “lesion”]
  Query: “Hyperthermia, leukocytosis, increased intracranial pressure…”
    Phrases: [(C0015967); “hypertherm”] [(C0023518); “leukocytos”] [(C0151740); “increas”, “intracran”, “pressur”]…
Evaluate Phrase-based Document Similarity
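  $$\mathrm{sim}(\vec q, \vec d) = \frac{\vec q \cdot \vec d}{\sqrt{\vec q \cdot \vec q}\,\sqrt{\vec d \cdot \vec d}}$$
  $$\vec q \cdot \vec d = \sum_{p_q \in P_q} \sum_{p_d \in P_d} \max\bigl(S_s(p_q, p_d),\, S_c(p_q, p_d)\bigr)$$
  S_c(p_q, p_d): due to the conceptual similarity s(ci, cj) between concepts in p_q and p_d
  S_s(p_q, p_d): due to the stem overlap in p_q and p_d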
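A rough sketch of phrase-based similarity follows, using the “Cerebral edema” query and “Cerebral lesion” document from the indexing examples. The slides do not spell out how the stem term S_s and the concept term S_c are computed, so both definitions here are assumptions: `stem_sim` is a cosine overlap of the two stem sets, `concept_sim` is 1 for identical concepts and otherwise a looked-up s(ci, cj), and term weights are ignored. The `Phrase` type and all function names are hypothetical.

```python
import math
from typing import FrozenSet, NamedTuple, Optional

class Phrase(NamedTuple):
    cui: Optional[str]      # None when the KB has no matching concept
    stems: FrozenSet[str]

def stem_sim(p, q):
    """Assumed S_s: cosine overlap of the two binary stem sets."""
    if not p.stems or not q.stems:
        return 0.0
    return len(p.stems & q.stems) / math.sqrt(len(p.stems) * len(q.stems))

def concept_sim(p, q, related):
    """Assumed S_c: 1 for the same concept, else a looked-up s(ci, cj)."""
    if p.cui is None or q.cui is None:
        return 0.0
    if p.cui == q.cui:
        return 1.0
    return related.get(frozenset((p.cui, q.cui)), 0.0)

def dot(ps, qs, related):
    """Sum over all phrase pairs of max(S_s, S_c)."""
    return sum(max(stem_sim(p, q), concept_sim(p, q, related))
               for p in ps for q in qs)

def phrase_sim(query, doc, related):
    denom = math.sqrt(dot(query, query, related) * dot(doc, doc, related))
    return dot(query, doc, related) / denom if denom else 0.0

query = [Phrase("C0699725", frozenset({"cerebr", "edem"}))]
doc = [Phrase("C0221505", frozenset({"cerebr", "lesion"}))]
score = phrase_sim(query, doc, related={})  # no link between the two concepts
```

With an empty `related` table the concept term contributes nothing, yet the shared stem “cerebr” still yields a nonzero score. That is the behavior the phrase representation is meant to provide when the knowledge base lacks a link.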
To Compare Retrieval Effectiveness
  The test set: OHSUMED
    106 queries, 14K documents
    Expert relevance judgment: R or N
  Retrieval effectiveness:
    Recall: the percentage of all relevant documents that have been retrieved so far
    Precision: the percentage of retrieved documents that are relevant
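Both measures can be computed at any cutoff in a ranked result list. The sketch below uses made-up document IDs and relevance judgments; the function name `recall_precision_at` is an assumption for this example.

```python
def recall_precision_at(ranked, relevant, k):
    """Recall and precision after the top-k documents have been retrieved."""
    retrieved = ranked[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

ranked = ["d3", "d1", "d7", "d2", "d9"]   # made-up ranking
relevant = {"d1", "d2", "d5"}             # made-up documents judged R
r, p = recall_precision_at(ranked, relevant, 4)
```

Here two of the three relevant documents appear in the top four, so recall is 2/3 and precision is 2/4. Sweeping the cutoff down the ranking traces out a recall-precision curve.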
[Figure: Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS). x-axis: recall (0 to 1); y-axis: average precision over 105 queries (0 to 1). Curves: Stems; Concepts Unrelated; Concepts Related; Phrases, Concepts Unrelated; Phrases, Concepts Related. Annotation: 16%, 100 queries vs. 5%, 50 queries]
Stem and Concept Similarity Contribution Weights
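  $$\vec q \cdot \vec d = \sum_{p_q \in P_q} \sum_{p_d \in P_d} \max\bigl(S_s(p_q, p_d),\, S_c(p_q, p_d)\bigr)$$
  becomes
  $$\vec q \cdot \vec d = \sum_{p_q \in P_q} \sum_{p_d \in P_d} \max\bigl(f_s\, S_s(p_q, p_d),\, f_c\, S_c(p_q, p_d)\bigr)$$
  f_c: similarity contribution weight for concepts
  f_s: similarity contribution weight for stems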
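The effect of the two contribution weights can be seen on a handful of phrase pairs. The sketch assumes each pair has already been scored with a stem term S_s and a concept term S_c; the values below are made up, and `weighted_dot` is a hypothetical helper name.

```python
def weighted_dot(pair_sims, f_s=1.0, f_c=1.0):
    """Sum over phrase pairs of max(f_s * S_s, f_c * S_c).

    `pair_sims` is an iterable of (S_s, S_c) tuples, one per phrase pair.
    """
    return sum(max(f_s * s_s, f_c * s_c) for s_s, s_c in pair_sims)

pairs = [(0.5, 0.0), (0.2, 0.8), (1.0, 1.0)]  # made-up (S_s, S_c) values
stems_only = weighted_dot(pairs, f_s=1.0, f_c=0.0)     # 0.5 + 0.2 + 1.0
concepts_only = weighted_dot(pairs, f_s=0.0, f_c=1.0)  # 0.0 + 0.8 + 1.0
both = weighted_dot(pairs)                             # 0.5 + 0.8 + 1.0
```

Setting (f_s, f_c) to (1, 0) uses only the stem contribution and (0, 1) only the concept contribution, while intermediate settings let each pair draw on whichever weighted term is larger.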
[Figure: Sensitivity of Retrieval Effectiveness to fs and fc. x-axis: (fs, fc), i.e. (stem similarity contribution weight, concept similarity contribution weight), at the settings (1, 0), (1, 0.4), (1, 0.8), (1, 1), (0.8, 1), (0.4, 1), (0, 1); y-axis: 11-point average precision (0.37 to 0.44). Labels on the plot: Stems, Concepts, Optimal region]
Computational Complexity Using Phrase-based VSM
  Data reorganization:
    Build separate indexes on stems and concepts
    Keep a list of related concepts cj and their conceptual similarities s(ci, cj) with each concept ci
  Time complexities of document similarity calculation are of the same order of magnitude:
    Stem-based VSM: $O(L_s \log L_s)$
    Phrase-based VSM: $O(L_s \log L_s + N_{cr}\, L_c \log L_c)$
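The data reorganization described on this slide can be pictured as two inverted indexes plus a per-concept list of related concepts. The sketch below is one possible layout, not the authors' data structures: the function names, the document IDs, and the `related` entry (including its similarity value) are hypothetical and used for illustration only.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: [(cui_or_None, [stems...]), ...]} -> two inverted indexes."""
    stem_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, phrases in docs.items():
        for cui, stems in phrases:
            if cui is not None:
                concept_index[cui].add(doc_id)
            for stem in stems:
                stem_index[stem].add(doc_id)
    return stem_index, concept_index

def candidates(query_phrases, stem_index, concept_index, related):
    """Documents sharing a stem, a concept, or a related concept with the query."""
    found = set()
    for cui, stems in query_phrases:
        for stem in stems:
            found |= stem_index.get(stem, set())
        if cui is not None:
            found |= concept_index.get(cui, set())
            for other_cui, _similarity in related.get(cui, []):
                found |= concept_index.get(other_cui, set())
    return found

docs = {
    "doc1": [("C0221505", ["cerebr", "lesion"])],
    "doc2": [(None, ["infiltr"]), ("C0021852", ["smal", "bowel"])],
}
stem_index, concept_index = build_indexes(docs)
# Hypothetical link and value, for illustration only.
related = {"C0699725": [("C0221505", 0.5)]}
hits = candidates([("C0699725", ["edem"])], stem_index, concept_index, related)
```

Keeping the related-concept list next to each concept means a query concept only has to probe the concept index once per related entry, rather than comparing against every concept in the collection.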
Conclusion
  A new document indexing paradigm based on phrases is proposed
    Use phrases (a concept and its word stems) as terms
    Document similarity is derived from both the stem and the concept contributions
  Conceptual similarity quantifies the concept relations and improves retrieval effectiveness
  Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts)
  Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM
Acknowledgement
  This research is supported in part by NIC/NIH Grant #4442511-33780