D. 15.3 A multimodal case- and ontology-based
retrieval service, powered with relevance feedback MD-Paedigree - FP7-ICT-2011-9 (600932)
Model Driven Paediatric European Digital Repository
Call identifier: FP7-ICT-2011-9 - Grant agreement no: 600932
Thematic Priority: ICT - ICT-2011.5.2: Virtual Physiological Human
Deliverable 15.3
A multimodal case- and ontology-based retrieval service, powered with relevance feedback
Due date of delivery: 31-08-2016
Actual submission date: Sept 8, 2016
Start of the project: 1st March 2013
Ending Date: 28th February 2017
Partner responsible for this deliverable: HES-SO
Version: 1.1
Dissemination Level: RE
Document Classification: R
Title A multimodal case- and ontology-based retrieval
service, powered with relevance feedback
Deliverable 15.3
Reporting Period 4
Authors HES-SO
Work Package WP15
Security Private
Nature Report
Keyword(s) Case-based retrieval
Document History
Name Remark Version Date
E. Pasche, P. Ruch 1.1 July 2016
List of Contributors
Name Affiliation
Emilie Pasche HES-SO
Patrick Ruch HES-SO
List of reviewers
Name Affiliation
Omiros Metaxas ATHENA
Abbreviations
Table of Contents
1. Introduction
2. Data description
3. Functional specifications
4. Architecture of services
5. Functionalities
5.1 Assignment of ontological descriptors
5.2 Search for similar cases in electronic health records
5.3 Relevance feedback
5.4 Search for similar cases in literature
5.5 Search for similar images in literature
6. Graphical User Interface
7. Conclusion
8. Bibliography
1. Introduction
Physicians facing complex diseases show great interest in finding populations of patients similar to their
own patients. Such populations let them observe the response to a particular treatment and learn about the
outcomes at different points in time in a given clinical pathway. In this context, a case-based retrieval
service has been developed.
The case-based retrieval (CBR) engine, developed by HES-SO, aims to find similar episodes of care based on
several modalities: unstructured data (e.g. clinical syntheses), structured data (e.g. age or gender) and
ontological resources (e.g. MeSH terminology). Similar episodes of care are extracted from clinical data
stored in the MD-Paedigree infostructure. Moreover, the engine can expand the search to similar cases from
the literature through two modes, a multilingual search and an image-based search, by leveraging the
developments of a related EU project whose focus was on literature search.
A first version of the CBR engine was presented in deliverable D15.1. We present here the second version,
which includes major improvements. While the first version of the CBR aimed to find similar patients, the
second version targets a finer granularity: it aims to find episodes of care similar to a given episode of
care. This document presents the different aspects of the second version of the CBR engine.
2. Data description
The second version of the CBR is based on a set of 47,433 episodes of care, corresponding to 33,674 distinct
patients. Only episodes of care that contain a medical event of type “conclusions” are selected. Such events
will be called “clinical syntheses” in the following. The patients are consulting for cardiac pathologies. The
source data originate from the OPBG hospital (Ospedale Pediatrico Bambino Gesù) and from the Taormina
hospital. Therefore, all textual contents are in Italian. Data were obtained using the secured PCDR API
developed by GNúBILA within WP14. The secured channel is the first step of the integration within the
MD-Paedigree infostructure. Data extracted from the MD-Paedigree infostructure are presented in Table 1.
Field               Description                                       Type
Patient identifier  Unique identifier for the patient                 String
Event identifier    Unique identifier for the episode of care         String
Gender              Gender of the patient                             Char
Birth date          Birth date of the patient                         Date (milliseconds)
Medical bag date    Date and time when the episode of care occurred   Date (milliseconds)
Conclusions         Clinical syntheses of the episode of care         Narrative
Table 1 Fields extracted from the clinical cases
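As an illustration of how the fields of Table 1 could be held in memory, the following minimal sketch models one episode of care and derives the age of the patient at the time of the episode. The attribute names, the gender coding and the interpretation of the dates as milliseconds since the Unix epoch are assumptions made for this example; they are not taken from the PCDR API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25


@dataclass(frozen=True)
class EpisodeOfCare:
    """One record with the fields of Table 1 (attribute names are illustrative)."""
    patient_id: str
    event_id: str
    gender: str          # single character, e.g. "F" or "M" (coding assumed)
    birth_date_ms: int   # assumed: milliseconds since the Unix epoch
    event_date_ms: int   # "medical bag date", same unit
    conclusions: str     # clinical synthesis, free text in Italian

    def age_in_years(self) -> float:
        """Age of the patient when the episode of care occurred."""
        return (self.event_date_ms - self.birth_date_ms) / MS_PER_YEAR


def ms_to_datetime(ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

Deriving the age from the two dates, rather than storing it, keeps the value consistent with the episode it belongs to when a patient has several episodes of care.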
3. Functional specifications
In order to provide a CBR service that fulfills the needs of clinicians, the development relies on the
requirements analysis performed within WP13. Surveys targeting stakeholders from the four disease areas
of MD-Paedigree were conducted and made it possible, among other things, to identify the features most
useful to clinicians. Details about the surveys are presented in deliverables D13.1 and following. We
focus here on the specifications concerning decision support (either through internal or external data).
Several criteria were reported as highly important for searching internal data: pathology (1a),
keywords (1b), age (1c), gender (1d) and anatomical structure (1e). All these criteria can be used with the
CBR. Specific fields are provided for age and gender: the user can enter the age and gender of the patient
and/or limit the search to a given gender and a range of ages. Pathologies, keywords and anatomical
structures can be addressed with free-text queries.
We also address the specification “Support for search in multiple languages” (1h). While the clinical data
currently used by the CBR are strictly in Italian, an expansion mode makes it possible to search from
Italian cases in two different external sources, using English keywords automatically assigned to the
Italian clinical syntheses.
The new version of the CBR also addresses the specification “Access to online search engines used often by
clinicians” (4c), concerning the search through external information. Indeed, we now offer the possibility to
search for similar cases in Europe PubMed Central (Europe PMC), an online database providing free
access to a large collection of biomedical literature.
4. Architecture of services
Figure 1 illustrates the global architecture of the system.
The system harvests electronic health records (EHR) from the MD-Paedigree infostructure through a secured
access in order to mirror and update the collection of cases. Normalized descriptors (e.g. MeSH) are
automatically assigned to each case, and cases are indexed using Apache Solr (version 4.4.0).
Symmetrically, at query time, MeSH descriptors are assigned to the query (represented in purple in Figure
1). The Solr retrieval engine outputs similar cases in EHR (represented in red in Figure 1). The user can refine
the query using relevance feedback (represented in orange in Figure 1). Alternatively, the user
can expand the search to external sources, such as Europe PMC through the literature search mode
(represented in blue in Figure 1), or search for similar images in PubMed using the Shambala search engine
(represented in green in Figure 1).
Figure 1 Architecture of the Case-Based Retrieval service
5. Functionalities
In this section, we present the different functionalities embedded within the CBR service: the assignment
of ontological descriptors (section 5.1), the search for similar cases in EHR (section 5.2), the relevance
feedback possibilities (section 5.3), the search for similar cases in literature (section 5.4), and finally the
search for similar images in literature (section 5.5).
5.1 Assignment of ontological descriptors
The automatic assignment of ontological descriptors [1] is based on MHIta, a service developed by HES-SO
to normalize clinical texts written in Italian with MeSH descriptors. This web service is freely accessible at
the following URL: http://eagl.unige.ch/MHita/. Given a textual input, the system returns a relevance-ranked
list of MeSH terms. A basic cleaning of the suggestions is performed in order to remove concepts that are not
relevant in a cardiology context. This approach is used at two different levels: 1) at data preparation time
and 2) at query execution time.
At data preparation time, MeSH descriptors are assigned to clinical syntheses (i.e. discharge summary,
diagnosis reports…) before the upload of data in the search engine index. We submit clinical syntheses to the
webservice and we retrieve a set of MeSH terms. Previously, the top-3 MeSH terms were selected. We now
propose a more advanced strategy, based on a dynamic threshold to define the number of MeSH terms to
select. The list of suggested MeSH terms is mapped with exact matching strategies to the input text (i.e. the
clinical syntheses). The last exact match found defines the lowest threshold score, and all terms ranked higher
than this threshold are indexed, while those below are ignored.
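The dynamic threshold described above can be sketched as follows. This is a minimal illustration, not the production code: the function name, the case-insensitive substring test used as "exact matching", the inclusion of the last matching term itself, and the fallback to the top-3 terms when nothing matches are all assumptions.

```python
def select_mesh_terms(ranked_terms, synthesis, fallback_top_k=3):
    """Select the MeSH terms to index for one clinical synthesis.

    ranked_terms: list of (term, score) pairs sorted by decreasing score,
                  as a normalisation service might return them.
    synthesis:    the input text that was normalised.
    """
    text = synthesis.lower()
    threshold = None
    for term, score in ranked_terms:
        # Assumed matching strategy: the term occurs verbatim in the text.
        if term.lower() in text:
            threshold = score  # the last match found gives the lowest score kept
    if threshold is None:
        # Assumption: fall back to the previous top-k strategy when no term matches.
        return [term for term, _ in ranked_terms[:fallback_top_k]]
    return [term for term, score in ranked_terms if score >= threshold]
```

With this rule, a synthesis that literally mentions many of the suggested terms keeps a long list of descriptors, while a synthesis whose only literal match is ranked first keeps a single one.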
At query execution time, MeSH descriptors are assigned to the query keywords. The top-20 MeSH terms
returned by MHIta are displayed on the screen, and the top-3 are pre-selected by default. The user can
then select or unselect any suggested MeSH term. Adding MeSH concepts to a query makes it possible to
refine the query and retrieve more relevant results. An extrinsic evaluation, also called evaluation in use,
has been performed with medical experts. The query-time MeSH normalisation triggered strong interest from
the audience. Indeed, for all the queries, the evaluators were ready to spend a few seconds choosing the
appropriate MeSH descriptors. However, it was also noted that MHIta sometimes failed to suggest an
existing relevant descriptor. More details about the evaluation of this feature have been presented in
deliverable D17.4.
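The query-time behaviour (20 terms displayed, 3 pre-selected, free selection by the user) could be represented along these lines. The function names and the dictionary layout are hypothetical and only serve the illustration.

```python
def build_suggestions(ranked_terms, displayed=20, preselected=3):
    """Keep the first `displayed` terms and pre-select the first `preselected`."""
    return [
        {"term": term, "selected": rank < preselected}
        for rank, term in enumerate(ranked_terms[:displayed])
    ]


def toggle(suggestions, term):
    """Flip the selection state of one suggested term (user click)."""
    for suggestion in suggestions:
        if suggestion["term"] == term:
            suggestion["selected"] = not suggestion["selected"]


def selected_terms(suggestions):
    """Terms that will be added to the query."""
    return [s["term"] for s in suggestions if s["selected"]]
```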
5.2 Search for similar cases in electronic health records
The retrieval of similar episodes of care relies on the Solr search engine, based on a weighting schema tuned
on a literature collection with similar distribution (average document length and average deviation) as well
as qualitative and quantitative expert-based evaluations of the system. Query parameters are the following:
1) Clinical syntheses (i.e. free text in Italian);
2) Gender of the patient;
3) Age of the patient;
4) MeSH descriptors.
Three different modes are tested:
A. All query parameters are equally weighted;
B. Gender and age are not used for the query;
C. Clinical syntheses and MeSH descriptors get a weight 1000 times higher than the gender and the age.
The evaluation of the three settings is based on a qualitative assessment of the CBR by medical
experts. The first tuning model (A), giving equal importance to each parameter, did not convince the
evaluators. Indeed, retrieving patients of the same age and gender but with a different diagnosis made no
sense. Given this first observation, two other tuning models were tested: first, we ignored the age
and gender in the query (B); second, we gave a very small weight to the age and gender (C). The last option
(C) performed best qualitatively. In addition, it is possible to restrict results to a certain age range
and a given gender. Such a filter is a better way to inject an age component into the model: while for
babies a small age difference can be of major importance regarding medical outcome, a larger difference
would be of less importance for teenagers.
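Tuning model (C) and the optional age and gender restriction can be sketched as Solr request parameters. Solr accepts Lucene-style boosts (`^`) in the `q` parameter and filter queries through `fq`. The field names (`conclusions`, `mesh`, `gender`, `age`) and the function itself are assumptions for illustration, since the actual index schema is not described here.

```python
import re

# Characters with a special meaning in the Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape(text):
    """Backslash-escape Lucene special characters in user-supplied text."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def build_solr_params(synthesis, mesh_terms, gender=None, age=None,
                      gender_filter=None, age_range=None,
                      content_boost=1000, rows=10):
    """Build `q`, `fq` and `rows` for a Solr select request (field names assumed)."""
    clauses = []
    if synthesis.strip():
        clauses.append(f"conclusions:({escape(synthesis)})^{content_boost}")
    if mesh_terms:
        quoted = " ".join(f'"{escape(term)}"' for term in mesh_terms)
        clauses.append(f"mesh:({quoted})^{content_boost}")
    # Age and gender keep the default weight of 1, i.e. 1000 times less (mode C).
    if gender is not None:
        clauses.append(f"gender:{escape(gender)}")
    if age is not None:
        clauses.append(f"age:{int(age)}")

    filters = []  # hard restrictions, applied independently of the ranking
    if gender_filter is not None:
        filters.append(f"gender:{escape(gender_filter)}")
    if age_range is not None:
        low, high = age_range
        filters.append(f"age:[{int(low)} TO {int(high)}]")
    return {"q": " ".join(clauses), "fq": filters, "rows": rows}
```

In this sketch, mode (A) corresponds to `content_boost=1` and mode (B) to leaving `gender` and `age` unset.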
As a future evolution of the CBR, letting the user manually weight the elements of the query (e.g. to
increase the weight of the primary diagnosis or to decrease the weight of the age) should make it possible
to retrieve more relevant results.
Based on 38 queries manually assessed by medical experts, we showed that the CBR was able to suggest a
similar episode of care at first rank for half of all queries (0.5), and for nearly two thirds (0.63) of
the 30 queries that have at least one relevant case (Table 2). The observed precisions are a bit lower than
for the first version of the CBR. However, the dataset is larger and the task is more challenging: to find
a similar episode of care and not just a related patient. See deliverable D17.4 for more details about the
quantitative evaluation.
Parameter   All queries (38)   Queries with at least a relevant case (30)
P0          0.5                0.63
P5          0.44               0.55
P10         0.42               0.54
Table 2 Evaluation of the second version of the CBR engine
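The two columns of Table 2 differ only in the set of queries over which precision is averaged. The sketch below shows one way to compute such figures from per-query relevance judgements. It assumes that P0 denotes precision at the first rank and that P5 and P10 denote precision over the first 5 and 10 results; the function names are illustrative.

```python
def precision_at(judgements, k):
    """Fraction of relevant results among the first k results of one query.

    judgements: list of booleans in ranked order (True = judged relevant).
    """
    return sum(judgements[:k]) / k


def mean_precision(runs, k, only_answerable=False):
    """Average precision at k over queries.

    only_answerable=True restricts the average to queries with at least
    one relevant result, as in the second column of Table 2.
    """
    if only_answerable:
        runs = [run for run in runs if any(run)]
    return sum(precision_at(run, k) for run in runs) / len(runs)
```

Under this reading, the first row of Table 2 is consistent with 19 queries having a relevant case at first rank: 19/38 = 0.5 and 19/30 ≈ 0.63.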
5.3 Relevance feedback
The relevance feedback functionality is based on the assessment of the retrieved episodes of care: the
physician reports each of them as relevant or not relevant. These judgements are used to reformulate the
query with additional keywords and thus refine the results. The physician can iterate until the results are
satisfactory. Figure 2 describes this process.
The relevance feedback is composed of three steps. First, cases judged as not relevant are excluded from
future result sets and no longer appear in the results for the session. Second, keywords are suggested
based on a Rocchio algorithm [2]. Suggested keywords are the most frequent words extracted from the
clinical syntheses of the episodes of care judged as similar. Third, an update of the MeSH normalization
is proposed, based not only on the query but also on the clinical syntheses of the episodes of care judged
as similar.
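The three steps can be sketched as a small session object. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, the keyword suggestion is reduced to a plain frequency count over the syntheses judged as similar, and the short Italian stop list is only a placeholder.

```python
import re
from collections import Counter

# Placeholder stop list; a real deployment would need a fuller Italian list.
ITALIAN_STOPWORDS = frozenset(
    {"il", "la", "le", "di", "del", "della", "e", "con", "per", "in", "un", "una"}
)


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


class FeedbackSession:
    def __init__(self, query):
        self.query = query
        self.excluded_ids = set()
        self.relevant_syntheses = []

    def judge(self, event_id, synthesis, relevant):
        """Record the physician's judgement for one retrieved episode of care."""
        if relevant:
            self.relevant_syntheses.append(synthesis)
        else:
            self.excluded_ids.add(event_id)

    def filter_results(self, ranked_event_ids):
        """Step 1: drop episodes judged as not relevant during this session."""
        return [e for e in ranked_event_ids if e not in self.excluded_ids]

    def suggest_keywords(self, top_n=5):
        """Step 2: most frequent new words in the syntheses judged as similar."""
        known = set(tokenize(self.query)) | ITALIAN_STOPWORDS
        counts = Counter(
            word
            for synthesis in self.relevant_syntheses
            for word in tokenize(synthesis)
            if word not in known
        )
        return [word for word, _ in counts.most_common(top_n)]

    def normalisation_input(self):
        """Step 3: text to resubmit for MeSH normalisation (query + similar cases)."""
        return " ".join([self.query, *self.relevant_syntheses])
```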
As reported within D17.4, the Rocchio relevance feedback feature showed some limitations during the
evaluation session. The evaluators perceived the suggested terms as too general (i.e. common Italian words)
or not clinically relevant. However, data analysis showed that for more than 90% of the queries, they
nevertheless selected a few terms. This feature was at an early stage of development and needed to be
improved. A light cleaning of the terms has since been performed. We also plan to improve it further by
filtering the list of suggestions to clinical terms only. Moreover, negative feedback could be used to
remove from the suggestions terms that have a high frequency in episodes of care judged as not relevant.
Figure 2 Relevance feedback process
5.4 Search for similar cases in literature
Europe PMC (https://europepmc.org/) [3] is a freely accessible online database providing access to a
large collection of biomedical articles. Developed by the European Molecular Biology Laboratory – European
Bioinformatics Institute (EMBL-EBI), Europe PMC is supported by 27 organisations. Currently, this database
contains 31.3 million abstracts and 3.8 million full-text articles. Thanks to its large coverage, it is
a premier resource for expanding CBR searches.
Expanding CBR searches to the literature makes it possible to retrieve more similar cases. Moreover, while
the clinical syntheses stored in the PCDR are for the moment solely in Italian, Europe PMC provides a
collection in English. We can thus investigate the multilingual search capacity of the system.
Figure 3 describes the methodology used. The search in Europe PMC consists in retrieving cases in the
literature that are similar to one or several episodes of care. The physician selects one or several episodes
of care considered similar to the patient described in the query. The clinical syntheses of these episodes of
care are extracted and automatically normalized with the MeSH terminology, using MHIta (see section 5.1). The
top-10 suggested MeSH concepts are selected, and their corresponding main terms in English are retrieved.
We then query the Europe PMC API with the 10 English MeSH terms. If no result is retrieved, the last
suggested MeSH concept is removed and the shortened query is tested again. MeSH concepts are removed
one by one until the service retrieves at least one similar case.
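The back-off strategy described above can be sketched as a loop that drops the lowest-ranked term until the search returns something. The way terms are combined is not specified in this document, so the conjunction with `AND` is an assumption, as are the function names. The default search function calls the public Europe PMC REST search endpoint and assumes its JSON response layout (`resultList.result`); any other callable can be passed instead.

```python
import json
import urllib.parse
import urllib.request

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def europe_pmc_search(query, page_size=10):
    """Query the Europe PMC REST search service and return the list of hits."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "json", "pageSize": page_size}
    )
    with urllib.request.urlopen(f"{EUROPE_PMC_SEARCH}?{params}", timeout=30) as response:
        payload = json.load(response)
    return payload.get("resultList", {}).get("result", [])


def search_with_backoff(mesh_terms, search=europe_pmc_search):
    """Search with all terms, then drop the last term until results are found.

    mesh_terms: English MeSH terms ordered by decreasing relevance.
    Returns the query that succeeded and its results, or ("", []) if none did.
    """
    terms = list(mesh_terms)
    while terms:
        # Assumption: terms are combined as a conjunction of quoted phrases.
        query = " AND ".join(f'"{term}"' for term in terms)
        results = search(query)
        if results:
            return query, results
        terms.pop()  # remove the lowest-ranked remaining concept
    return "", []
```

Passing the search function as a parameter keeps the back-off logic testable without network access.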
Figure 3 Methodology used to retrieve similar cases in EuropePMC
5.5 Search for similar images in literature
Shambala [4], developed by HES-SO within the Khresmoi project, is a web-based search interface for
content-based image retrieval. Shambala’s image retrieval is based on the ParaDISE retrieval system.
ParaDISE first extracts local visual features from the image and provides a global
representation of the image through descriptors. This information is stored in indexes. Searches are based
on a fusor, which combines results from multiple lists (see [5] for more details about ParaDISE’s algorithm).
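To give an idea of what combining results from multiple ranked lists involves, the sketch below uses reciprocal rank fusion, a generic and widely used fusion rule. It is a stand-in chosen for illustration and is not ParaDISE's fusion algorithm, which is described in [5]; the function name and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of image identifiers into one ranking.

    Each list contributes 1 / (k + rank) to the score of every image it
    contains, so images ranked high in several lists come first.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, image_id in enumerate(ranked, start=1):
            scores[image_id] += 1.0 / (k + rank)
    # Sort by decreasing score; ties are broken by identifier for determinism.
    return sorted(scores, key=lambda image_id: (-scores[image_id], image_id))
```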
As in the methodology described in section 5.4, the first step consists in selecting one
or several episodes of care that the physician considers similar to the patient described in the
initial query. The clinical syntheses of these episodes of care are extracted and automatically normalized with
the MeSH terminology, using MHIta (see section 5.1). The top-3 suggested MeSH concepts of the category
“Disorder” are selected, and their corresponding main terms in English are retrieved. We then query the
Shambala website with the three English MeSH terms, separated by “OR”.
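The construction of the image search query can be sketched as follows. The input layout (pairs of English term and category) and the function name are assumptions; the terms are joined with "OR" as described above, without any additional quoting, since the query syntax accepted by Shambala is not detailed here.

```python
def build_image_query(suggestions, category="Disorder", top_n=3):
    """Build the text query sent to the image search engine.

    suggestions: ranked list of (english_term, category) pairs.
    Keeps the first `top_n` terms of the requested category.
    """
    picked = [term for term, cat in suggestions if cat == category][:top_n]
    return " OR ".join(picked)
```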
6. Graphical User Interface
A graphical user interface (GUI) has been developed for the case-based retrieval service. Special attention
was given to making the service user-friendly, with the aim of reducing the extra workload for users. It is
accessible at the following URL: http://casimir.hesge.ch/MDPaedigree/CBR.jsp. The service is also part of the
MD-Paedigree portal and can be accessed directly using the following link:
https://pcdr.gnubila.fr/web/md-paedigree/case-based-retrieval. However, the latest developments have not yet
been integrated within the MD-Paedigree portal.
The case-based retrieval service is a 5-step process (Figure 4). The user goes from step 1 to step 4 and can
then either iterate (steps 2 to 4) to refine the query and thus obtain more relevant results, or expand the
query to external resources (step 5).
Figure 4 Workflow of the CBR
The first step is the query section, in which the physician describes the patient. There are two ways to do
so, depending on whether the patient’s EHR is stored in the PCDR. If the patient’s EHR is in the PCDR
(Figure 5), the physician fills in the patient identifier field and the system automatically loads the
existing clinical syntheses for this patient. The user can then select any of them (if there are several).
If the patient’s EHR is not in the PCDR (Figure 6), the physician fills in a form with information about the
patient, describing the case in free text and optionally adding the age and gender of the patient.
Figure 5 Query section: example of a query based on a patient identifier
Figure 6 Query section: example of a query based on free text
The second step is the refinement section (Figure 7). It proposes additional keywords to add to the query in
order to narrow it down. Two refinements are proposed: one based on MeSH terms and one based on the
Rocchio algorithm. The MeSH refinement consists in automatically normalizing the clinical synthesis (i.e. the
query). Up to 20 MeSH terms are suggested, and the top-3 are pre-selected by default. The physician can
select or unselect the MeSH terms to add to the query. When the user iterates (i.e. goes back from
step 4 to step 2), the MeSH refinement is based not only on the query, but also on the episodes of
care judged as similar by the physician. The Rocchio refinement is based solely on the episodes of care judged
as similar by the physician. Therefore, it only appears after the first iteration. The clinical syntheses of
these episodes of care are taken and the Rocchio service retrieves the most frequent words in these texts.
Figure 7 Query refinement section
The third step is the filter section (Figure 8). Two actions are available: 1) modification of the query and
2) filtering of the results. The modification of the query enables the physician to visualize the final query
and optionally remove any part of it (e.g. the age, the gender, one of the additional keywords,
etc.). The filtering of the results enables the physician to define filters for the results (e.g. show only
patients that are girls, or show only boys from 3 to 10 years old, etc.).
Figure 8 Filter section
The fourth step is the results section (Figure 9). The similar episodes of care are displayed, ranked by
relevance. To facilitate the review by the physician, the following information is displayed: demographic
information (i.e. gender and age), the MeSH terms automatically assigned to the clinical synthesis, the
clinical synthesis, a relevance score and a link to the full patient history. The displayed clinical
synthesis is an automatically generated abstract. The physician can also access the complete clinical
synthesis, as well as the later clinical syntheses of the same patient. In addition, a radio button composed
of a green and a red smiley is provided: the physician checks the green smiley if the episode of care is
similar (i.e. relevant), and the red smiley if the episode of care is not similar (i.e. not relevant). This
information is used for the refinement step. The physician can then click on the “next” button to refine the
query (step 2).
Figure 9 Results section
Finally, a fifth step proposes an expansion of the results to external resources. Based on the selection of
similar episodes of care in step 4, MeSH terms are automatically assigned to the selected episodes of care.
Two different expansions are proposed, using the MeSH terms as query parameters: a search for similar
images and a search for similar literature. The search for similar images (Figure 10) queries the Shambala
web service and displays similar images. A link to the service is also provided. The search for similar
literature (Figure 11) queries the Europe PMC website and displays the publications concerning similar cases.
A link to the Europe PMC website is also provided. In both search modes, the physician can manually modify
the request if needed.
Figure 10 Expansion section: an example of similar images search
Figure 11 Expansion section: an example of similar cases in literature
7. Conclusion
We have thus developed a case-based retrieval service dealing with several modalities (e.g. structured data,
unstructured data, ontologies, images) and proposing various functionalities to search for similar cases (e.g.
search in EHR, search in literature, relevance feedback, etc.). This tool meets most of the specifications
defined for decision support within WP13. A qualitative and quantitative evaluation of the tool shows
encouraging results: the CBR is able to suggest a similar episode of care at first rank for half of all
queries, and for nearly two thirds of the queries that have at least one relevant case. With the improved
relevance feedback strategy, we can expect an improvement of the precision. However, some improvements are
still required.
Efforts mainly focused on relevance feedback functionalities, giving the user the possibility to reformulate
and refine the query in order to retrieve a more focused set of results. For now, the system proposes
relevance feedback functionalities based on three axes (i.e. exclusion of non-relevant episodes of care, MeSH
refinement and Rocchio-based refinement). Before the final release of the CBR, due in month 48, efforts will
continue to improve this functionality. We will attempt to filter the terms suggested by the Rocchio
algorithm to clinical terms only. We will also investigate negative feedback with the Rocchio algorithm.
As an alternative to Rocchio, other feedback features are being investigated, such as latent semantic
indexing (LSI), in cooperation with UTBV.
Finally, efforts to fully integrate the service within the MD-Paedigree portal have started, with the
integration of a preliminary version.
8. Bibliography
[1] Ruch P. Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics
(2006), 22(6).
[2] Ruch P, Tbahriti I, Gobeill J, Aronson AR. Argumentative feedback: a linguistically-motivated term
expansion for information retrieval. In Proceedings of COLING-ACL ’06 (2006).
[3] McEntyre J et al. Europe PMC: a full-text literature database for the life sciences and platform for
innovation. Nucleic Acids Res (2015), 43(Database issue).
[4] Schaer R, Müller H. A Modern Web Interface for Medical Image Retrieval. Swiss Medical Informatics
(2014), 30.
[5] Schaer R, Markonis D, Müller H. Architecture and applications of the Parallel Distributed Image Search
Engine (ParaDISE). GI-Jahrestagung (2014).