identifying disease diagnosis factors by proximity-based mining of medical texts rey-long liu *,...
TRANSCRIPT
Identifying Disease Diagnosis Factors by
Proximity-based Mining of Medical Texts
Rey-Long Liu*, Shu-Yu Tung, and Yun-Ling Lu*Dept. of Medical Informatics
Tzu Chi University
Taiwan, R.O.C.
Outline
• Research background
• Problem definition
• The proposed approach: PDFI
• Empirical evaluation
• Conclusion
PDFI@ACIIDS2011 2
Diagnosis Knowledge Map: Fundamental of Diagnosis
Support & Education
PDFI@ACIIDS2011 4
r5
r4
r3
r2
r1
d3
d2
d1
Symptoms & Signs (and examinations & tests)DiseasesRisk Factors
m1
m2
m3
m4
m5
Basic Properties
• Diagnosis factors of a disease– Risk factors, symptoms, and signs of the disease
• A diagnosis knowledge map consist of many-to-many relationships between diseases and their diagnosis factors– May have different capability of discriminating
the diseases, and may evolve
• Construction of a diagnosis knowledge map is essential but costly
PDFI@ACIIDS2011 5
Goal• Explore how the identification of the diagnosis
factors may be supported by text mining
• Develop a technique PDFI (Proximity-based Diagnosis Factors Identifier) that – Employs term proximity to improve diagnosis
factors identifiers– Serves as a supplement to improve existing
identifiers
PDFI@ACIIDS2011 7
Related Work• Extract relationships by parsing or template
matching– Weakness: Relationships between diseases and
diagnosis factors are seldom expressed in individual sentences
• Select key features by text classification– Weakness: Term proximity is NOT considered
• Proximity-based retrieval– Weakness: NOT applicable to diagnosis factor
identificationPDFI@ACIIDS2011 8
Basic Observation
• In a medical text talking about the diagnosis of a disease, the diagnosis factors often appear in a nearby area of the text
PDFI@ACIIDS2011 10
The Approach
• For a candidate diagnosis factor u, PDFI– Measures how other candidate diagnosis factors
appear in the areas near to u in the medical texts, and then
– Encodes the term proximity information into the discriminating capability of u measured by the underlying discriminative factors identifiers.
PDFI@ACIIDS2011 11
System Overview
PDFI@ACIIDS2011 12
Encode term proximity contexts to revise the strengths of candidate factors
Measure discriminating strengths of candidate factors
Underlying identifierPDFI
Ranked factors for individual diseases
Texts about individual diseases
Discriminating strengths of candidate factors
Scoring for a Candidate Factor
PDFI@ACIIDS2011 13
||1
1
),(Pr ;)( ,
Ne
cureoximitySco unNnMinDist nu
),(1
1),(
cuRankcuScoreIdentifier
MinDistu,c= Minimum distance between u and n in the texts about disease c, and α is set to 30
• For a candidate diagnosis factor u for disease c
Rank(u,c) = Rank of u w.r.t. c by the underlying identifier
Finalscore(u, c) = ProximityScore(u,c)+IdentifierScore(u,c)
Experimental Data• Medical dictionary: from MeSH
– Each MeSH term and its retrieval equivalence terms, resulting in a dictionary of 164,354 medical terms
• Medical texts for disease: from MedlinePlus– All the diseases for which MedlinePlus tags
diagnosis/symptoms texts, resulting in a text database of 420 medical texts for 131 diseases
– Each medical text is manually read and cross-checked to extract target diagnosis factor terms from the texts, resulting in 2,797 target terms
PDFI@ACIIDS2011 15
Underlying Diagnosis Factor Identifier
• The chi-square feature scoring technique – Produces a discriminating strength for each
feature (candidate factor) with respect to each disease, and
– For each disease, all positively-correlated features are sent to PDFI for re-ranking
PDFI@ACIIDS2011 16
Evaluation Criteria
• Mean average precision (MAP)– Measuring how target diagnosis factors are ranked
high for the medical expert to check and validate
– Example• Targets ranked 1st, 3rd, 5th AP=(1/1+2/3+3/5)/3=0.76
• Targets ranked 1st, 2nd, 3rd AP=(1/1+2/2+3/3)/3=1.00
PDFI@ACIIDS2011 17
k
jfj
iAPC
iAPMAP
m
j i
C
i
1
||
1 )()(,
||
)(
An Example• Parasitic diseases
– AP: chi-square:0.3003; chi-square+PDFI:0.3448• PDFI promotes the ranks of several target
diagnosis factors (e.g., parasite, antigen, diarrhea, and MRI scan)
– They appear at some place(s) where more other candidate terms occur in a nearby area
• PDFI lowers the ranks of a few target diagnosis factors (e.g., serology)
– Serology only appears at one place where the author used lots of words to explain serology
PDFI@ACIIDS2011 19
• Diagnosis factors to discriminate diseases are the fundamental basis for – Diagnosis decision support, diagnosis skill
training, medical research, & health education
• Text mining is a good way to identify and maintain the huge amount of diagnosis factors for diseases
• By encoding term proximity information, PDFI may be a good supplement to existing technique to identify the diagnosis factors for individual diseases
PDFI@ACIIDS2011 21