identifying disease diagnosis factors by proximity-based mining of medical texts rey-long liu *,...

Identifying Disease Diagnosis Factors by

Proximity-based Mining of Medical Texts

Rey-Long Liu*, Shu-Yu Tung, and Yun-Ling Lu*Dept. of Medical Informatics

Tzu Chi University

Taiwan, R.O.C.

Outline

• Research background

• Problem definition

• The proposed approach: PDFI

• Empirical evaluation

• Conclusion

PDFI@ACIIDS2011 2

Research Background

PDFI@ACIIDS2011 3

Diagnosis Knowledge Map: Fundamental of Diagnosis

Support & Education

PDFI@ACIIDS2011 4

r5

r4

r3

r2

r1

d3

d2

d1

Symptoms & Signs (and examinations & tests)DiseasesRisk Factors

m1

m2

m3

m4

m5

Basic Properties

• Diagnosis factors of a disease– Risk factors, symptoms, and signs of the disease

• A diagnosis knowledge map consist of many-to-many relationships between diseases and their diagnosis factors– May have different capability of discriminating

the diseases, and may evolve

• Construction of a diagnosis knowledge map is essential but costly

PDFI@ACIIDS2011 5

Problem Definition

PDFI@ACIIDS2011 6

Goal• Explore how the identification of the diagnosis

factors may be supported by text mining

• Develop a technique PDFI (Proximity-based Diagnosis Factors Identifier) that – Employs term proximity to improve diagnosis

factors identifiers– Serves as a supplement to improve existing

identifiers

PDFI@ACIIDS2011 7

Related Work• Extract relationships by parsing or template

matching– Weakness: Relationships between diseases and

diagnosis factors are seldom expressed in individual sentences

• Select key features by text classification– Weakness: Term proximity is NOT considered

• Proximity-based retrieval– Weakness: NOT applicable to diagnosis factor

identificationPDFI@ACIIDS2011 8

The Proposed Approach: PDFI

PDFI@ACIIDS2011 9

Basic Observation

• In a medical text talking about the diagnosis of a disease, the diagnosis factors often appear in a nearby area of the text

PDFI@ACIIDS2011 10

The Approach

• For a candidate diagnosis factor u, PDFI– Measures how other candidate diagnosis factors

appear in the areas near to u in the medical texts, and then

– Encodes the term proximity information into the discriminating capability of u measured by the underlying discriminative factors identifiers.

PDFI@ACIIDS2011 11

System Overview

PDFI@ACIIDS2011 12

Encode term proximity contexts to revise the strengths of candidate factors

Measure discriminating strengths of candidate factors

Underlying identifierPDFI

Ranked factors for individual diseases

Texts about individual diseases

Discriminating strengths of candidate factors

Scoring for a Candidate Factor

PDFI@ACIIDS2011 13

||1

1

),(Pr ;)( ,

Ne

cureoximitySco unNnMinDist nu

),(1

1),(

cuRankcuScoreIdentifier

MinDistu,c= Minimum distance between u and n in the texts about disease c, and α is set to 30

• For a candidate diagnosis factor u for disease c

Rank(u,c) = Rank of u w.r.t. c by the underlying identifier

Finalscore(u, c) = ProximityScore(u,c)+IdentifierScore(u,c)

Empirical Evaluation

PDFI@ACIIDS2011 14

Experimental Data• Medical dictionary: from MeSH

– Each MeSH term and its retrieval equivalence terms, resulting in a dictionary of 164,354 medical terms

• Medical texts for disease: from MedlinePlus– All the diseases for which MedlinePlus tags

diagnosis/symptoms texts, resulting in a text database of 420 medical texts for 131 diseases

– Each medical text is manually read and cross-checked to extract target diagnosis factor terms from the texts, resulting in 2,797 target terms

PDFI@ACIIDS2011 15

Underlying Diagnosis Factor Identifier

• The chi-square feature scoring technique – Produces a discriminating strength for each

feature (candidate factor) with respect to each disease, and

– For each disease, all positively-correlated features are sent to PDFI for re-ranking

PDFI@ACIIDS2011 16

Evaluation Criteria

• Mean average precision (MAP)– Measuring how target diagnosis factors are ranked

high for the medical expert to check and validate

– Example• Targets ranked 1st, 3rd, 5th AP=(1/1+2/3+3/5)/3=0.76

• Targets ranked 1st, 2nd, 3rd AP=(1/1+2/2+3/3)/3=1.00

PDFI@ACIIDS2011 17

k

jfj

iAPC

iAPMAP

m

j i

C

i

1

||

1 )()(,

||

)(

Results• MAP: chi-square: 0.2136; chi-square+PDFI: 0.2996

PDFI@ACIIDS2011 18

An Example• Parasitic diseases

– AP: chi-square:0.3003; chi-square+PDFI:0.3448• PDFI promotes the ranks of several target

diagnosis factors (e.g., parasite, antigen, diarrhea, and MRI scan)

– They appear at some place(s) where more other candidate terms occur in a nearby area

• PDFI lowers the ranks of a few target diagnosis factors (e.g., serology)

– Serology only appears at one place where the author used lots of words to explain serology

PDFI@ACIIDS2011 19

Conclusion

PDFI@ACIIDS2011 20

• Diagnosis factors to discriminate diseases are the fundamental basis for – Diagnosis decision support, diagnosis skill

training, medical research, & health education

• Text mining is a good way to identify and maintain the huge amount of diagnosis factors for diseases

• By encoding term proximity information, PDFI may be a good supplement to existing technique to identify the diagnosis factors for individual diseases

PDFI@ACIIDS2011 21

identifying disease diagnosis factors by proximity-based mining of medical texts rey-long liu *,...

Documents

candidate diagnosis

disease diagnosis factors

diagnosis factorsmay

candidate diagnosis

target diagnosis factor

diseaserisk factors

disease c ranku

medical termsmedical