swat4 ls fca_slides
Post on 15-Jul-2015
Refining Health Outcomes of Interest using Formal Concept Analysis and Semantic Query Expansion
Olivier Curé (1), Henri Maurer (2), Paea Le Pendu (3), Nigam Shah (3)
(1) CNRS LIGM lab, UPEM, France
(2) Edinburgh University, UK
(3) BMIR lab, Stanford University, USA
2
Problem setting
● Applications need to select, extract, compare and analyze groups of patients using Electronic Health Records (EHRs)
● This requires defining Health Outcomes of Interest (HOIs), e.g. myocardial infarction, chronic obstructive pulmonary disease.
● With clinical text, these definitions should capture variations of terms and ensure good precision and recall of the text-mining process.
3
Problem setting (2)
● It is not practical to define these HOIs precisely with concept identifiers, e.g. UMLS CUIs.
● We provide a solution that produces and refines HOI definitions from terms provided by the end-user.
● Our solution aims to propose sound and complete definitions in a best-effort way.
4
Approach overview
[Figure: pipeline diagram. Labels: Diseases, Procedures, Drugs, Devices; BioPortal - Knowledge; terms, concepts; Semantic Query Expansion; Terminology3 DB; Formal Concept Analysis; Statistics-Based Pruning]
5
SQE
● Improve search results by expanding queries with the transitive closure of the subsumption relationship of ontology concepts.
● Queries can be generalized (resp. specialized) via expansions with ancestors (resp. descendants).
● Ex: expanding a query with 'neoplasm' or 'tumor' when searching for 'cancer'.
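
A minimal sketch of this kind of expansion, assuming a toy is-a hierarchy (the `PARENTS` mapping and its concept names are illustrative only and do not come from a real ontology). Generalizing follows the transitive closure toward ancestors, specializing follows it toward descendants:

```python
from collections import defaultdict

# Hypothetical toy is-a hierarchy: child -> direct parents.
PARENTS = {
    "lung cancer": ["cancer"],
    "cancer": ["neoplasm"],
    "neoplasm": ["disease"],
}

def invert(parents):
    """Build parent -> direct children from child -> direct parents."""
    children = defaultdict(list)
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].append(child)
    return children

def closure(start, edges):
    """All nodes reachable from `start` by following `edges` transitively."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def expand(term, generalize=True):
    """Return the query term plus its ancestors (or its descendants)."""
    edges = PARENTS if generalize else invert(PARENTS)
    return {term} | closure(term, edges)

print(expand("cancer"))                    # generalized query
print(expand("cancer", generalize=False))  # specialized query
```

In a real setting the hierarchy would come from the ontologies themselves rather than from a hand-written dictionary.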
6
FCA
● Abstract conceptual descriptions from a set of objects described by some attributes.
● Used in machine learning and knowledge management.
● A formal context is a triple (G, M, I): a set of objects G, a set of attributes M, and a binary relation I between G and M.
● A formal context can be represented as a matrix.
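
To make the matrix view concrete, here is a small sketch with a made-up context (the object and attribute names are placeholders). It also shows the two derivation operators usually defined on a formal context: the attributes shared by a set of objects, and the objects sharing a set of attributes.

```python
# Toy formal context (G, M, I); all names are illustrative only.
G = ["g1", "g2", "g3"]   # objects
M = ["a", "b", "c"]      # attributes
I = [                    # I[i][j] is True when object G[i] has attribute M[j]
    [True, True, False],
    [True, False, True],
    [True, True, True],
]

def intent(objects):
    """Attributes shared by every object in `objects`."""
    rows = [I[G.index(g)] for g in objects]
    return {m for j, m in enumerate(M) if all(row[j] for row in rows)}

def extent(attributes):
    """Objects that have every attribute in `attributes`."""
    cols = [M.index(m) for m in attributes]
    return {g for i, g in enumerate(G) if all(I[i][j] for j in cols)}

print(intent({"g1", "g2"}))  # attributes common to g1 and g2
print(extent({"a", "b"}))    # objects having both a and b
```

A pair (A, B) with `intent(A) == B` and `extent(B) == A` is a formal concept of the context.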
7
FCA (2)
{1,2}-{CF1,F1,CF2,F2}
{3}-{CF1,F1,MF2,F2}
{6}-{BLF1,F1,MF2,F2}
{4,5}-{BLF1,F1,BLF2,F2}
{1,2,3}-{CF1,F1,F2}
{3,6}-{MF2,F1,F2}
{4,5,6}-{BLF1,F1,F2}
{1,2,3,4,5,6}-{F1,F2}
⊥
⊤
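
The nodes listed above can be reproduced with a naive enumeration. This sketch assumes that each object's description is the intent of the smallest node containing it; that reading is an assumption, since the underlying context is not shown on the slide. It closes the object intents under intersection, which is adequate only for tiny contexts:

```python
from itertools import combinations

# Object descriptions read off the lattice nodes (an assumption: each
# object gets the intent of the smallest node that contains it).
CONTEXT = {
    1: {"CF1", "F1", "CF2", "F2"},
    2: {"CF1", "F1", "CF2", "F2"},
    3: {"CF1", "F1", "MF2", "F2"},
    4: {"BLF1", "F1", "BLF2", "F2"},
    5: {"BLF1", "F1", "BLF2", "F2"},
    6: {"BLF1", "F1", "MF2", "F2"},
}
ALL_ATTRS = frozenset().union(*CONTEXT.values())

def extent(attrs):
    """Objects whose description contains every attribute in `attrs`."""
    return frozenset(g for g, desc in CONTEXT.items() if attrs <= desc)

def formal_concepts():
    """Naive enumeration: close the object intents under intersection."""
    intents = {ALL_ATTRS} | {frozenset(d) for d in CONTEXT.values()}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(intents), 2):
            if a & b not in intents:
                intents.add(a & b)
                changed = True
    return {(extent(i), i) for i in intents}

# Print from the largest extent (top) down to the empty extent (bottom).
for ext, itt in sorted(formal_concepts(), key=lambda c: -len(c[0])):
    print(sorted(ext), sorted(itt))
```

Dedicated FCA algorithms avoid the repeated pairwise intersections used here.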
8
Method
● SQE: relational database approach
– We are using the ontologies stored in Stanford's DB and its materialization of concept subsumption (almost 14 million entries).
● FCA: objects and attributes of the formal context are UMLS concept identifiers (CUIs).
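
A sketch of how a materialized subsumption closure can be queried from a relational store. The table name `closure`, its two columns, and the identifiers are hypothetical placeholders; the actual schema of the database mentioned above is not described in the slides. SQLite is used here only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (ancestor, descendant) pair of the
# materialized transitive closure of the subsumption relation.
conn.execute("CREATE TABLE closure (ancestor TEXT, descendant TEXT)")
conn.executemany(
    "INSERT INTO closure VALUES (?, ?)",
    [
        ("C_disease", "C_lipid"),   # placeholder identifiers, not real CUIs
        ("C_disease", "C_chol"),
        ("C_lipid", "C_chol"),
    ],
)

def ancestors(concept_id):
    """All concepts that subsume `concept_id`, in a single lookup."""
    rows = conn.execute(
        "SELECT ancestor FROM closure WHERE descendant = ?", (concept_id,)
    )
    return {row[0] for row in rows}

def descendants(concept_id):
    """All concepts subsumed by `concept_id`, in a single lookup."""
    rows = conn.execute(
        "SELECT descendant FROM closure WHERE ancestor = ?", (concept_id,)
    )
    return {row[0] for row in rows}

print(ancestors("C_chol"))
print(descendants("C_disease"))
```

Because the closure is precomputed, expansion is a plain indexed lookup rather than a recursive traversal at query time.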
10
Method (3)
● To improve relevance, an FCA-based pruning approach identifies potential concepts among the discovered ones.
● The formal context is composed of matching concepts as objects and candidate concepts as attributes.
● Thus the binary relation corresponds to the subsumption relationship.
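
As an illustration of this construction, the sketch below builds the cross table of such a context from a made-up mapping of matching concepts to the candidate concepts that subsume them (all identifiers are placeholders):

```python
# Hypothetical input: for each matching concept (object), the set of
# candidate concepts (attributes) that subsume it.
SUBSUMERS = {
    "m1": {"c1", "c2"},
    "m2": {"c2", "c3"},
    "m3": {"c2"},
}

objects = sorted(SUBSUMERS)
attributes = sorted(set().union(*SUBSUMERS.values()))

# Cross table: row = matching concept, column = candidate concept,
# cell is 1 when the candidate subsumes the matching concept.
table = [[int(a in SUBSUMERS[o]) for a in attributes] for o in objects]

print("   " + " ".join(attributes))
for o, row in zip(objects, table):
    print(o, " ".join(f"{v:>2}" for v in row))
```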
11
Method (4)
● Ex: 10365: “hyperlipoproteinemia type iv” and 740154: “disease, disorder or finding”
● Standard FCA algorithms are used to define the FCA lattice.
12
Method (5)
● Qualifying a discovered concept is performed using a top-down navigation of the FCA lattice.
● For each formal concept <Ai,Bi>, we compute the transitive closure of the sub-concepts of Ai (resp. Bi), denoted LAi (resp. LBi).
● If |LBi ∩ LAi| / |LBi| ≥ Θ, where Θ is a predefined pruning threshold, then Bi is a potential concept.
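
The threshold test can be sketched as follows. The two sets are made-up closures for a single formal concept, and returning 0.0 for an empty LBi is an assumption of this sketch, not something stated in the slides:

```python
def coverage(l_a, l_b):
    """|L_B ∩ L_A| / |L_B|: share of the intent's closure covered by the extent's closure."""
    return len(l_b & l_a) / len(l_b) if l_b else 0.0

def is_potential(l_a, l_b, theta):
    """True when the coverage reaches the pruning threshold `theta`."""
    return coverage(l_a, l_b) >= theta

# Hypothetical closures of sub-concepts for one formal concept <A_i, B_i>.
L_A = {"c1", "c2", "c3"}
L_B = {"c1", "c2", "c3", "c4"}

print(coverage(L_A, L_B))            # 3 of the 4 concepts in L_B are covered
print(is_potential(L_A, L_B, 0.75))
print(is_potential(L_A, L_B, 0.8))
```

A higher threshold keeps fewer candidates and so reduces what the end-user has to validate, at the risk of pruning relevant concepts.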
13
Method (6)
● Concept sets:
– M: matching
– D: discovered
– P: potential
– C: other concepts
14
Example
● A search for hypercholesterolemia over 18 ontologies provides:
– 20 matching concepts (i.e., FCA objects)
– 102 discovered concepts (i.e., FCA attributes)
● This generates an FCA lattice with 67 formal concepts.
● The first formal concept satisfying a Θ = 0.75 pruning threshold is at the 4th level of the lattice: only 4 of the 16 concepts in LBi are covered by LAi.
● These 4 concepts have the following preferred labels: “hypercholesterolemia”, “cholesterolosis”, “secondary hypercholesterolemia” and “hyperlipidemia”.
15
Method (7)
● We include interactions with the end-user to validate our potential discoveries.
● Hence the domain expert has the final decision on acceptance/rejection of a proposition.
● Important issue: trade-off between user interactions and precision/recall of results.
● The end-user can validate whenever she wants.
● Interactions are performed in a web interface providing additional information on the search (clinical text snippets, number of patients).
16
Evaluation
● i2b2 obesity NLP reference set used as an evaluation data set
● The gold standard is the set of results from a previous experiment conducted at Stanford.
● Evaluation in terms of specificity, sensitivity and duration of computation (on commodity hardware)
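
For reference, sensitivity and specificity can be computed from a predicted set and a gold-standard set as sketched below. The item identifiers are invented for illustration and are unrelated to the i2b2 data:

```python
def confusion(predicted, gold, universe):
    """Return (tp, fp, fn, tn) for set-valued predictions over `universe`."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(universe - predicted - gold)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """True positive rate: tp / (tp + fn)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: tn / (tn + fp)."""
    return tn / (tn + fp)

# Hypothetical items, for illustration only.
universe = {"p1", "p2", "p3", "p4", "p5", "p6"}
predicted = {"p1", "p2", "p3"}
gold = {"p1", "p2", "p4"}

tp, fp, fn, tn = confusion(predicted, gold, universe)
print(sensitivity(tp, fn), specificity(tn, fp))
```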
17
Evaluation (2)
● An improvement of 2% in sensitivity and 3% in specificity.
● Computation takes on the order of seconds on a standard laptop.
18
Evaluation (3)
● More interesting is that some of our false negatives seem to be relevant to the search.
● Some of these false negatives come from the matching approach and also from the potential (i.e. FCA-based) approach:
● Matching example:
– “Sitosterolemia” for hypercholesterolemia
● Potential examples:
– “h/o: raised blood, familial hyperlipoproteinemia” and “fh: raised blood lipids” for hypercholesterolemia, while the gold standard contains concepts such as “hyperlipoproteinemia type ii”, which confirms the relevance of using a semantic approach.
● Note that among our true positives, depending on the use case, a significant number of items have been retrieved from the potential concept set, i.e., using our FCA statistical approach.
19
Conclusion
● We have proposed a semi-automatic solution for defining HOIs.
● Approach uses SQE and FCA enriched with a statistical approach.
● Our results are comparable to state-of-the-art methods.
● It refines HOI definitions efficiently with relevant terms/concepts.
20
Future work
● Conduct user-driven evaluations with clinicians and researchers.
● Analyze acceptance/rejection of end-users in practical scenarios.
● Use active learning over past query refinements to improve future queries.
● Study our method's impact on mining EHR clinical notes and on cohort-building tools.
21
Thanks
Questions?
ocure@univ-mlv.fr