Extracting And Analysing Information From Patient Records In Order To Track Disease Progression Over
Time
A Dissertation Submitted To The University Of Manchester For The Degree Of Master Of Science In The Faculty Of Engineering And Physical Sciences
2015
By Aleksandra Ivaylova Nacheva School of Computer Science
Contents
Contents 1
List of Figures 3
List of Tables 4
1 Introduction 9
1.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.1 Learning objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.2 Deliverable objectives . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Report outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Background 13
2.1 Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 General architecture of Text Mining systems . . . . . . . . . . . . 14
2.1.2 Text Mining stages . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2.1 Pre-processing stage . . . . . . . . . . . . . . . . . . . . . 15
2.1.2.2 Information Extraction stage . . . . . . . . . . . . . . . . 16
2.1.3 Named Entity Recognition Approaches . . . . . . . . . . . . . . . . 18
2.1.3.1 Dictionary–based approaches . . . . . . . . . . . . . . . . 18
2.1.3.2 Rule–based approaches . . . . . . . . . . . . . . . . . . . 19
2.1.3.3 Machine Learning–based approach . . . . . . . . . . . . . 19
2.1.3.4 Hybrid approaches . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Clinical Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Evaluation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Sentiment analysis of clinical data . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Classification approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Specification and Design 33
3.1 Project scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Challenges for developing the project . . . . . . . . . . . . . . . . . . . . . 33
3.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 System requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Initial analysis of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.1 Identification of disease factors that need to be extracted . . . . . 38
3.5.2 Lexical profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Method overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.7.1 Modeling the system boundaries . . . . . . . . . . . . . . . . . . . 44
3.7.2 Modeling the system components . . . . . . . . . . . . . . . . . . . 46
3.7.3 Modeling the system interactions . . . . . . . . . . . . . . . . . . . 51
3.7.4 Modeling the system workflows . . . . . . . . . . . . . . . . . . . . 53
3.7.5 Classification rules design . . . . . . . . . . . . . . . . . . . . . . . 58
3.7.6 Quality attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.8 Database Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8.1 Types of information that needs to be stored . . . . . . . . . . . . 62
3.8.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Implementation 68
4.1 Development environment preparation . . . . . . . . . . . . . . . . . . . . 68
4.1.1 Ethical approvals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.2 Hospital environment preparation . . . . . . . . . . . . . . . . . . 68
4.1.3 Development Language Choice . . . . . . . . . . . . . . . . . . . . 69
4.1.4 Database Platform and Type Choice . . . . . . . . . . . . . . . . . 70
4.1.5 Preparation of Feature Extraction tools . . . . . . . . . . . . . . . 70
4.2 Classifier creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Software Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 Regression Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Integration Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 Tests Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5 Evaluation 77
5.1 Evaluation methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Description of Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 SentiStrength Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Classification approaches evaluation . . . . . . . . . . . . . . . . . . . . . 80
5.4.1 Rule-based approach evaluation . . . . . . . . . . . . . . . . . . . . 80
5.4.2 Machine learning approach evaluation . . . . . . . . . . . . . . . . 81
5.4.3 Comparison between approaches . . . . . . . . . . . . . . . . . . . 83
6 Conclusion and Future work 84
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.1.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.1 Integrate application within hospital environment . . . . . . . . . . 87
6.2.2 Expand Information Extraction functionality . . . . . . . . . . . . 87
6.2.3 Improve run time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2.4 Extending the system functionality . . . . . . . . . . . . . . . . . . 88
List of Figures
2.1 General Architecture of TM systems . . . . . . . . . . . . . . . . . . . . . 14
2.2 IE task example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Example of system architecture (Spasic et al, 2010) . . . . . . . . . . . . . 25
2.5 DNorm architecture (Leaman et al, 2013) . . . . . . . . . . . . . . . . . . 27
2.6 MedEx architecture (Xu et al, 2010) . . . . . . . . . . . . . . . . . . . . . 28
2.7 CliNER architecture (Kovacevic et al, 2013) . . . . . . . . . . . . . . . . . 29
2.8 Weka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.9 Orange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Explicit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Implicit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 2-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 3-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 4-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Negative sentiment note example . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Positive sentiment note example . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 System workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.9 Context Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10 Context diagram for Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.11 Component Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.12 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.13 System Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.14 Metastatic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.15 Non-metastatic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.16 Quality Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.17 Database Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.18 Classifier Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
List of Tables
2.1 Pre-processing tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Comparison between NER approaches . . . . . . . . . . . . . . . . . . . . 20
2.3 Comparison between clinical Text Mining systems . . . . . . . . . . . . . 24
3.1 Table columns description . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Disease factor identified . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Work environment set up . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Functionality Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Notes statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 SentiStrength accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Confusion Matrix for Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Results for Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Confusion Matrix for ML . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Results for ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Abstract
Cancer is a disease that has caused the death of many people around the world, which
makes it a threat that needs to be addressed.
Even though Electronic Health Records (EHRs) are a rich source of information
regarding cancer, the data stored in them is mainly in free text form, which makes it
difficult to analyse. Thus, a lot of research in the clinical domain goes into creating
methods for analysing the data available in EHRs in order to improve the services
provided in hospitals.
In collaboration with The Christie hospital in Manchester, this project aims to provide
an automatic method for identifying the metastatic breast cancer patients that need to
be contacted for immediate support. For this purpose, a clinical text mining system for
extracting and analysing information from patient records (free text) has been created.
The project consists of two main parts. The first part focuses on analysing, extracting
and structuring the features relevant to metastatic patients by using text mining
techniques. The second part uses a classification method in order to identify the
metastatic patients.
Sentiment analysis has also been performed on the data, since clinical narratives are
written by clinicians who very often express emotions in the notes. The sentiments
associated with the notes can help identify metastasis more easily.
For the classification two approaches are used: a rule-based approach and a machine
learning approach. The results from the rule-based approach are 83% recall and 89%
precision. The machine learning approach has been evaluated with and without sentiment
rankings and produced its better result with sentiments included: 76% recall and
85% precision. The rule-based approach has been considered the more suitable method
for the purposes of the project. Further to that, it has been found that sentiments in
clinical notes help to find hidden valuable information.
Declaration
No portion of the work referred to in this dissertation has been submitted in support of
an application for another degree or qualification of this or any other university or other
institute of learning.
Copyright
1. The author of this thesis (including any appendices and/or schedules to this the-
sis) owns certain copyright or related rights in it (the “Copyright”) and s/he has
given The University of Manchester certain rights to use such Copyright, including
for administrative purposes.
2. Copies of this thesis, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents
Act 1988 (as amended) and regulations issued under it or, where appropriate, in
accordance with licensing agreements which the University has from time to time.
This page must form part of any such copies made.
3. The ownership of certain Copyright, patents, designs, trade marks and other intellectual
property (the “Intellectual Property”) and any reproductions of copyright
works in the thesis, for example graphs and tables (“Reproductions”), which may
be described in this thesis, may not be owned by the author and may be owned by
third parties. Such Intellectual Property and Reproductions cannot and must not
be made available for use without the prior written permission of the owner(s) of
the relevant Intellectual Property and/or Reproductions.
4. Further information on the conditions under which disclosure, publication and
commercialisation of this thesis, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any
relevant Thesis restriction declarations deposited in the University Library, The
University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations)
and in The University’s policy on presentation of Theses.
Acknowledgments
This section acknowledges and thanks all those who helped make this project possible.
I would like to give special thanks to the following people:
Dr Goran Nenadic For academic and moral support
Mr Tom Liptrot For academic support
Mrs Claire Gaskell For academic support
Christie Hospital For use of their facilities and academic support
Mr Thomas Edwards For moral support
Mrs Jeni Nacheva For moral support
1. Introduction
Cancer is a disease that has caused the death of many people around the world. The
recent statistics released for cancer research in the UK show that more than 331,000
people were diagnosed with cancer in 2011 (All Cancers Combined, 2015). As a result
of this there has been a lot of research into understanding this disease, its causes, and
its cures.
Electronic Health Records (EHRs) are a rich source of valuable cancer related data,
which could be used for the creation of automated applications for improving the health
care services and clinical research. However, the creation of such applications is chal-
lenging since most of the clinical information is available in clinical narratives (notes,
letters and reports) that are in free text form (Friedman et al, 2004). Further to that,
there are two other problems associated with the creation of clinical automated systems
related to cancer data (Spasic et al., 2014):
First, there are many types of cancer, each with different causes, symptoms and
treatments. Thus, existing systems and approaches for one type of cancer might not be
applicable to another type of cancer.
Another problem concerns the ethical and legal issues associated with patients’ data.
Therefore, most systems are developed on local data, and testing them on a larger sample
or comparing them to other methods is difficult.
There is an ongoing collaboration between the School of Computer Science at Manchester
University and the Christie hospital to extract important information from clinical
records. The hospital has been collecting cancer patient data for over 30 years; this
data includes free text in the form of notes and letters that have been sent to GPs.
Previous information extraction tasks performed on the hospital’s clinical data include
extraction of cancer outcomes from clinical notes (Liptrot et al., 2015) and identifying
heart disease factors in clinical notes (Karystianis et al, 2015).
One of the tasks the hospital is currently pursuing is tracking disease progression (cancer
in the patient becoming more advanced due to growth and/or spread of a tumour) over
time. However, this is not a straightforward task since the data is mainly in free text
and very often the disease is not mentioned explicitly but needs to be deducted from the
context. Therefore, it is di�cult to identify the patients whose condition has deteriorated
and need to be contacted for additional support. Currently a nurse goes through the
records manually in order to identify those patients, which is a time consuming and
subjective process.
1.1 Aim
The aim of the project is to create a system (software suite) that takes raw medical
notes of breast cancer patients as input and generates as output the patients diagnosed
with metastasis at the last date of their visit. The project will help in two main
ways: first, by supporting the work of the nurses and the hospital in providing better
treatments and support, and secondly, by providing faster identification of metastatic
patients. Further to this, the system will ease the workload of the nurses since they
would not need to go through the notes manually. The project needs to address the
following problems: terminological variability and ambiguity, negation, highly condensed
text, and abbreviations with multiple meanings. The system will need to be evaluated in
order to know whether it is able to identify the metastatic patients. For this purpose,
we will compare the output of the system to a Gold standard (manually annotated notes)
provided by a nurse from the hospital.
1.2 Objectives
1.2.1 Learning objectives
• Investigate and understand text mining approaches and techniques
• Investigate and understand how similar clinical text mining systems work and how
they can be used in the project
• Identify tools that can be re-used for the implementation of the different stages
• Investigate different algorithms and software that can be used for the classification
stage of the project
1.2.2 Deliverable objectives
• Investigate and discuss with the nurse the type of information that needs to be
extracted
• Develop an application that performs an initial analysis of the data in order to
understand the type of data that is stored in the notes
• Discuss and outline the main requirements of the system with the nurse
• Develop an application that pre-processes the notes and extracts the information
needed
• Design and implement a classifier to be included into the application
• Develop a database that stores the extracted information and the output from the
classifier
• Evaluate the performance of the classifier against a given Gold standard from the
nurse
1.3 Report outline
This report is structured as follows: Chapter 2 presents the background related to Text
mining approaches and techniques, as well as classification and evaluation methods.
Further to this, the chapter presents an overview of existing clinical text mining
systems that can be of help for the completion of the project. Chapter 3 explains the
methodology used for the system creation as well as the system design, and provides
justification of the choices made. Chapter 4 presents the main points of the
implementation process and the tests performed in order to measure system quality. The
evaluation of the classification methods used is explained in Chapter 5, while in
Chapter 6 conclusions and future work plans are presented.
2. Background
2.1 Text Mining
Text mining (TM) is used for processing large amounts of unstructured text in free
form in order to find useful information. Its goal is to identify implicit knowledge that
hides in unstructured text and present it in an explicit form. It is an interdisciplinary
field, which employs many computational technologies, such as Machine Learning (ML),
Natural Language Processing (NLP), statistics, information technologies and pattern
recognition (Zhu et al, 2007).
The most natural and common way of storing data is text, so text mining finds
applications in many fields. For instance, in marketing it is used in survey research
for analysing open-ended surveys (DELL, 2015). Another application of text mining
techniques is the automatic classification of text, for example of emails in order to
filter out junk messages (DELL, 2015). Other examples of its usage include the mining of
patents and research articles by the pharmacological industry in order to improve drug
discovery, and academic research for finding new knowledge and trends (McDonald and
Kelly, 2015).
2.1.1 General architecture of Text Mining systems
At an abstract level, a text mining system takes in input (raw documents) and generates
various types of output (patterns, trends, maps of connections) (Feldman and Sanger,
2007).
The general architecture of a TM system is given in Figure 2.1.
Figure 2.1: General Architecture of TM systems
As can be seen from Figure 2.1, a TM system usually consists of four main stages:
1. Information Retrieval (IR): This is the first task to be performed. It is concerned
with extracting relevant documents that answer a query. The IR systems can
also be called search engines (Ananiadou and McNaught, 2006). It is usually
domain–independent and returns a number of documents that satisfy a given query
(Spasic et al, 2014).
2. Pre-processing of data: It involves methods that prepare data for the information
extraction tasks.
3. Information Extraction (IE): IE task is usually applied after IR and its goal is to
extract specific information from text documents (Hotho et al, 2005). It converts
free text into structured information (Spasic et al, 2014).
4. Data Mining: This stage involves operations that help find patterns of interest,
trend analysis, etc. Examples of data mining techniques that are usually performed
are classification and association rules. For instance, a classification can be
performed after all relevant data is extracted from the text in order to classify
patients as metastatic or non-metastatic.
2.1.2 Text Mining stages
In this section we will revisit the above-mentioned text mining stages and outline the
main tasks they include, in the order in which they are usually performed. However, we
will exclude the Information Retrieval stage because it is not relevant to the project:
we do not need to retrieve text documents related to breast cancer, since the hospital
will provide the specific data that we need to analyse. Instead, we need to extract
specific terms of interest from the given medical records.
2.1.2.1 Pre-processing stage
The pre-processing stage usually includes tasks that are not problem-specific and are
more generic operations (Nadkarni et al, 2011). The main pre-processing tasks are:
sentence boundary detection, tokenization, part-of-speech tagging (POS tagging), mor-
phological analysis, and syntax analysis.
The main pre-processing tasks usually performed are explained in Table 2.1. We are not
focusing on the problems associated with each of them since they can be implemented
using existing tools and frameworks.
Table 2.1: Pre-processing tasks
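The first two pre-processing tasks can be illustrated with a minimal sketch, written here in Python purely for illustration (the example note and regular expressions are simplified assumptions, not the actual tools used in the project):

```python
import re

def split_sentences(text):
    # Naive sentence boundary detection: split after ., ! or ? when
    # followed by whitespace and a capital letter. Real clinical
    # pre-processing must also handle abbreviations such as "Dr." or "e.g.".
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def tokenize(sentence):
    # Simple tokenizer: keeps words (including hyphenated terms) together
    # and splits off punctuation as separate tokens.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

note = "Patient reviewed today. Stable disease, no new symptoms."
sentences = split_sentences(note)
tokens = [tokenize(s) for s in sentences]
```

In practice these tasks would be delegated to the existing tools and frameworks mentioned above rather than implemented from scratch.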
2.1.2.2 Information Extraction stage
The IE stage includes tasks that are problem–specific and are built on top of the
pre-processing tasks (Nadkarni et al, 2011; Ben-Dov and Feldman, 2010). They mainly
include Named Entity Recognition (NER), Relations extraction, Temporal IE and Term
Classification. These techniques enable us to obtain formal, structured representations
of documents.
The IE tasks are key for the project since they will help to structure the patient notes
and extract the useful information based on which the classification model will be built.
An example of how IE tasks would work for the purposes of the project (identifying
metastatic patients) is given in Figure 2.2.
Figure 2.2: IE task example
As Figure 2.2 shows, the input to the IE task is an anonymised patient note. The IE
task extracts terms that indicate whether a patient is metastatic or not, and
normalises and structures them in a format that allows further analysis. In this
specific case, the term “stable disease” indicates that the patient’s condition is not
deteriorating and thus the disease is not progressing.
The main IE tasks are:
• NER: NER task (Nadkarni et al, 2011) identifies specific terms and relations be-
tween them that could be of interest such as treatments, drugs, and diagnoses.
Problems (Nadkarni et al, 2011) related to medical data that need to be solved dur-
ing this stage include: ambiguity of abbreviations and terms, grammar and spelling
mistakes, term variability and synonyms, heavy use of domain specific terminology
(only internally known terminology), typographical variants, word/phrase order
variation, and negation and uncertainty identification. This task will be re-visited
in the next section as it is considered important for the project.
• Relations extraction: Relationships (Nadkarni et al, 2011) between entities also
need to be extracted. Relations that show relevance and connections between terms
can reveal more hidden information in the text.
• Temporal IE: Temporal data (Kovacevic et al, 2013) such as dates, times, frequencies,
etc. have different characteristics and are associated with problems different from
those typical for named entities. Therefore, temporal data might need to be
extracted separately.
• Term Classification (Term Categorization): The recognized named entities can
be further classified into a number of pre-defined classes (Spasic et al, 2014). For
instance, if a named entity denotes a medication, then it would be classified into
the class of medication names. This provides more structure and metadata, and thus
helps in creating a classification method that takes into account the presence of
specific combinations of terms from different classes and, based on that, builds a
classifier.
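Term Classification can be sketched as a simple lookup that groups recognised entities by class. The term-to-class mapping below is hypothetical (the drug names and class labels are illustrative assumptions, not the classes used in this project):

```python
# Hypothetical class assignments for recognised entities; the real classes
# would be agreed with the clinical team.
TERM_CLASSES = {
    "tamoxifen": "medication",
    "letrozole": "medication",
    "stable disease": "disease status",
    "bone metastasis": "diagnosis",
}

def classify_terms(entities):
    """Group recognised entities by their pre-defined class."""
    classified = {}
    for entity in entities:
        label = TERM_CLASSES.get(entity.lower(), "unknown")
        classified.setdefault(label, []).append(entity)
    return classified

classified = classify_terms(["Tamoxifen", "stable disease", "bone metastasis"])
```

The resulting groups supply the class-level metadata that a downstream classifier can use as features.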
2.1.3 Named Entity Recognition Approaches
The four approaches that can be used for performing NER are:
2.1.3.1 Dictionary–based approaches
Dictionary–based approach (Spasic et al., 2014) involves the creation of dictionaries that
include all synonyms and variations of the terms that need to be extracted and thus help
locate term occurrences in text. The explicit use of already created dictionaries or the
use of specialised well–known biomedical databases is not feasible as very often the
clinical notes consist of terms that are not available in the more general dictionaries.
Therefore, there is a need to tune existing dictionaries to the particular case or to
create a new one manually with the help of official sources.
A drawback of this approach is that it can only recognize entities if the names by
which they are denoted in text are part of the dictionaries. Further to that, the
biomedical domain is changing constantly and consequently the terms used within a text
can change often. Thus, the dictionaries created for the specific text must be updated
constantly.
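A minimal sketch of dictionary-based lookup is shown below. The mini-dictionary and the example note are hypothetical; a real dictionary would be tuned from official sources, as described above, and matching would respect token boundaries:

```python
# Hypothetical mini-dictionary mapping surface forms to a canonical term;
# a real system would derive these entries from curated sources such as UMLS.
DICTIONARY = {
    "mets": "metastasis",
    "metastases": "metastasis",
    "metastatic disease": "metastasis",
    "stable disease": "stable disease",
}

def dictionary_ner(text):
    """Return (matched phrase, canonical term) pairs found in the text.

    Longer dictionary entries are tried first so that multi-word terms
    take precedence over their single-word components.
    """
    text_lower = text.lower()
    found = []
    for phrase in sorted(DICTIONARY, key=len, reverse=True):
        if phrase in text_lower:
            found.append((phrase, DICTIONARY[phrase]))
    return found

note = "Widespread metastatic disease, liver mets noted."
entities = dictionary_ner(note)
```

The sketch also exposes the drawback discussed above: a phrase absent from the dictionary (for example a new abbreviation) is simply never recognised.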
2.1.3.2 Rule–based approaches
Rule–based approach (Appelt and Israel, 1999) is also called the Knowledge engineering
approach. In this approach careful analysis of the text needs to be performed in order
to find common patterns. Then, sets of rules are written manually in order to capture
these patterns and are used to extract useful information from the text. This approach
is a time–consuming iterative process because it requires rules to be tested and changed
many times until they give the desired results. Another drawback of this approach is
that it cannot be generalized and applied to other cases. However, the creation of a
system that can be applied to different clinical cases is out of the scope of this project
and thus we are not concerned with this problem.
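The rule-based approach can be sketched as a set of hand-written patterns combined with a simple negation check. The patterns below are illustrative assumptions only, not the rule set developed in this project, and the negation cue is a much simplified version of NegEx-style triggers:

```python
import re

# Illustrative hand-written patterns for metastasis-related phrases.
METASTATIC_PATTERNS = [
    re.compile(r"\bdisease\s+progression\b", re.IGNORECASE),
    re.compile(r"\bspread\s+to\s+(?:the\s+)?\w+", re.IGNORECASE),
]
# Very simple negation cue; real systems use a full trigger list.
NEGATION = re.compile(r"\bno\b", re.IGNORECASE)

def matches_metastatic_rule(sentence):
    """Return True if a metastatic pattern matches and is not negated."""
    for pattern in METASTATIC_PATTERNS:
        match = pattern.search(sentence)
        if match:
            # Look for a negation cue in the clause preceding the match.
            preceding_clause = sentence[:match.start()].split(",")[-1]
            if NEGATION.search(preceding_clause):
                continue
            return True
    return False

hit = matches_metastatic_rule("CT scan shows spread to the liver.")
miss = matches_metastatic_rule("No evidence of disease progression.")
```

The iterative cost mentioned above shows up directly here: each false positive or missed phrase in the evaluation set prompts another round of editing and re-testing the patterns.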
2.1.3.3 Machine Learning–based approach
ML approach (Appelt and Israel, 1999) is also called the Automatic approach. In this
approach a machine learning algorithm creates the rules for extracting information
automatically. This method requires a large amount of training data related to the
specific domain, which can be difficult to obtain. An advantage of this approach is that
it is domain–independent and can be applied to different cases as long as a corpus of
domain–dependent texts is available. Possible ML algorithms that can be used for the
implementation of this approach are: decision trees, hidden Markov models (HMMs),
conditional random fields (CRFs) and maximum entropy models (Appelt and Israel, 1999).
2.1.3.4 Hybrid approaches
This approach (Appelt and Israel, 1999) combines rules and machine learning algorithms
for the extraction of relevant information. A combination of all three approaches, or a
mixture of rule–based and dictionary–based, is also possible. The main advantage of this
approach is that it can combine the advantages of the three main approaches and gives
more flexibility in the implementation of the NER.
A comparison between the first three approaches is given in Table 2.2.
Table 2.2: Comparison between NER approaches
From Table 2.2 it can be concluded that the main advantage of the dictionary–based and
rule–based approaches over the ML approach is that they do not require training data. In
turn, the main advantage of the ML approach over the other two is that it can be applied
to different hospital cases with minimal changes and can cope with term changes over
time more easily.
2.2 Clinical Text Mining
One of the areas in which the importance of text mining techniques is constantly growing
is clinical research. The reason for this is that medical records are valuable sources
of information but are difficult to analyse, since most of the information is available
in clinical narratives in free–text form with rich contextual meaning. Further to that,
the amount of data available in the narratives is growing and therefore becoming more
difficult to process manually (Zhu et al, 2007).
Due to the ability of text mining techniques to process large amounts of unstructured text
automatically, text mining is believed to bridge the gap between unstructured clinical
notes and structured data representation (Kovacevic, 2013).
2.2.1 Evaluation methods
Evaluation will be performed on two of the stages of the system development: the text
mining stage, for evaluating the performance of the Information Extraction system, and
the classification stage. We can use the same evaluation approach for both of them.
The evaluation method should be able to test how good the classifier is at predicting
the class label of a tuple, and also how good the Information Extraction system is at
extracting the correct terms from the text.
We can use measures based on a confusion matrix (an example is given in Figure 2.3),
where each column of the matrix represents the instances in a predicted class and each
row represents the instances of an actual class. For a binary classification problem
there are four possible outcomes of a single prediction:
• true positives (TP): These refer to the positive tuples correctly labeled by the
classifier
• true negatives (TN): These are negative tuples that were correctly labeled by the
classifier
• false positives (FP): These are the negative tuples that were incorrectly labeled as
positive
• false negatives (FN): These are the positive tuples that were incorrectly labeled as
negative
Figure 2.3: Confusion matrix
Based on the counts summarised in the confusion matrix, there are four main measures
used to evaluate the performance of a classifier:
• Accuracy: The accuracy of a classifier on a given test set is the percentage of test
set tuples that are correctly classified by the classifier. In other words, it reflects
how well the classifier recognises tuples of various classes. That is,
accuracy = (TP + TN)/(P+N)
• Precision: The precision metric represents the percentage of instances labelled as
positive that are actually positive. It is used to measure the exactness of the
classifier (what percentage of tuples labeled as positive are actually such). That is,
precision = TP/(TP + FP)
• Recall: Recall is a measure of completeness (what percentage of positive tuples
are labeled as such). That is,
recall = TP/(TP + FN)
• F-measure: Precision and recall are naturally opposed, in the sense that if the precision
measure is increased, the recall measure is typically decreased, and vice versa.
Thus, the two measures need to be balanced. This is done using the F-measure, which
is the harmonic mean of precision and recall and is calculated with the formula:
2PR/(P + R) (Spasic et al, 2014).
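The four measures above follow directly from the confusion-matrix counts. As an illustration (this sketch is not part of the system and uses invented counts), the calculations can be written as:

```python
def evaluate(tp, tn, fp, fn):
    """Compute the four classifier evaluation measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # correctly classified / all tuples
    precision = tp / (tp + fp)                   # exactness
    recall = tp / (tp + fn)                      # completeness (coverage)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f_measure

# Invented example: 80 TP, 90 TN, 20 FP, 10 FN
acc, p, r, f = evaluate(80, 90, 20, 10)
```

Note that the F-measure, being a harmonic mean, is always pulled towards the lower of precision and recall, which is why it rewards balanced classifiers.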
2.2.2 Related work
Many projects focus on automatically extracting useful patient information from medical
records using a variety of TM approaches and methods. For example, Goryachev et
al (2008) extracted family history information from clinical reports using a rule-based
approach, while Patrick et al (2010) extracted medication information using a hybrid
approach (machine learning and rule-based algorithms).
Table 2.3 describes four different systems which extract different types of information
from clinical notes. Their approaches and design choices will be used as examples for
the creation of our system.
The first two papers use mainly rule–based approaches and the other two use hybrid
approaches. The similarities between the systems are the following:
Table 2.3: Comparison between clinical Text Mining systems
• All of them use a three-stage implementation process, always starting with a pre-processing
stage. For the pre-processing stage they mainly use general frameworks.
• Both of the rule-based systems use the same software for rule creation, Mixup,
and both use dictionaries. They also use the Unified Medical Language System
(UMLS), a generic knowledge representation system that brings together many
health and biomedical vocabularies, to support dictionary creation.
• Both of the hybrid systems use NegEx in the pre-processing stage in order to
detect negation. NegEx uses regular expressions and a list of terms to determine
whether clinical conditions are negated in a sentence (Negex, 2009). Further to
that, both of them create different modules, which process data depending on
its characteristics. The first one creates a module for processing notes where a
disease status is mentioned explicitly and a module for processing notes where a
disease status needs to be derived from the context. The second system performs
the temporal extraction and the entity extraction separately. Further to that, the
second system uses JAPE for creating rules for the TE module.
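The NegEx idea of combining a list of negation trigger terms with regular expressions can be illustrated with a small sketch. The trigger list and the five-token window used here are simplified assumptions for illustration, not the actual NegEx implementation:

```python
import re

# Simplified negation triggers; the real NegEx uses a much larger curated list.
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of", "negative for"]

def is_negated(sentence, term):
    """Return True if `term` appears within five tokens after a negation trigger."""
    pattern = (r"\b(?:" + "|".join(re.escape(t) for t in NEGATION_TRIGGERS) + r")\b"
               r"(?:\W+\w+){0,5}?\W+" + re.escape(term))
    return re.search(pattern, sentence.lower()) is not None
```

For example, `is_negated("There is no evidence of metastatic disease.", "metastatic")` returns True, whereas the same term in an affirmative sentence is not flagged.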
The clinical text mining system used as the main example for our project is the medication
information extraction system proposed by Spasic et al (2010), and thus it will be described
in more depth in this section. The system has an F-measure of 81% (with 86% precision
and 77% recall). The system architecture is given in Figure 2.4.
Figure 2.4: Example of system architecture (Spasic et al, 2010)
The system aims to extract medication information from patient medical reports using
a rule-based approach. In this approach only explicitly mentioned information is
extracted, without the need to map it to standardised terminology or to interpret it
semantically.
The method consists of three main steps:
1. Linguistic preprocessing: It includes the preprocessing tasks: sentence splitting,
POS tagging, shallow parsing. The output is stored in an XML file, in which the
XML tags were used to mark up sentences, POS categories of individual tokens
and syntactic categories of word chunks. The XML document is stored in the
database where the original notes reside.
2. Dictionary and pattern matching: The annotated XML document from the first
step is used as input for this step. Rules exploit morphological, lexical, and syntactic
properties to support dictionary creation and to acquire slot filler labels. The rules
were implemented using Mixup. The approach for medication name recognition is
mainly dictionary-based. Three types of dictionaries were created: dictionaries for
medication recognition, dictionaries to help recognise generic medication types, and
dictionaries that store reason-related terms in order to support the recognition of
the reason for medication. The obtained slot filler labels are stored in the database.
3. Template filling: The slot filler labels extracted in the second step are combined to
fill the information extraction template with the slots: medication, dosage, mode,
frequency, duration, and reason. XML tags mark the slot fillers.
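The template-filling step can be pictured as combining slot filler labels into one fixed record per mention. The following sketch is illustrative only; the slot names follow the description above, while the example values are invented:

```python
# The template slots named in the description above.
SLOTS = ("medication", "dosage", "mode", "frequency", "duration", "reason")

def fill_template(slot_filler_labels):
    """Combine extracted (slot, value) labels into one template; missing slots stay None."""
    template = {slot: None for slot in SLOTS}
    for slot, value in slot_filler_labels:
        if slot in template:
            template[slot] = value
    return template

# Invented example labels for a single medication mention.
record = fill_template([("medication", "tamoxifen"),
                        ("dosage", "20 mg"),
                        ("frequency", "once daily")])
```

Slots with no extracted filler remain empty, which mirrors the fact that clinical notes rarely state every attribute of a medication.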
The Information Extraction stage is the most time-consuming part of the project to
implement. Thus, in this section we will investigate existing Information
Extraction systems that can be used for extracting metastasis-related data in order to
ease the workload for the implementation of this stage.
DNorm
DNorm (Disease name Normalisation Information Extraction system) is an information
extraction system (Leaman et al, 2013) which automatically determines diseases mentioned
in text. It uses machine learning methods to normalise disease names in biomedical
text and relies on dictionary lookup techniques and various string matching
algorithms to account for term variation (Leaman et al, 2013). The algorithm achieves
an 80.9% F-measure, with 82.8% precision and 80.9% recall. An example of the basic
architecture of DNorm is given in Figure 2.5.
Figure 2.5: DNorm architecture (Leaman et al, 2013)
From Figure 2.5 it can be seen that DNorm works on sentence-level and that the main steps
it performs are locating disease names, abbreviation resolution, and normalisation. The
output is a set of concepts found within each sentence of the input text, together with
the ID of each term and its position within the sentence.
MedEx
Medication Extraction system (MedEx) (Xu et al, 2010) was developed to extract medi-
cation information from clinical notes and represent it in a structured way. An evaluation
using a data set of 50 discharge summaries showed it performed well on identifying not
only drug names (F-measure 93.2%), but also signature information, such as strength,
route, and frequency, with F-measures of 94.5%, 93.9%, and 96.0% respectively (Xu et
al, 2010). It uses a rule-based approach. An overview of the architecture of MedEx is
given in Figure 2.6.
Figure 2.6: MedEx architecture (Xu et al, 2010)
Similar to DNorm, MedEx works on sentence-level and outputs the medication
information found to a text file, including an ID for each concept and the position of the
term in the sentence.
CliNER
CliNER (Kovacevic et al, 2013) is a command line tool for the identification of mentions of
four categories of clinically relevant events: Problems, Tests, Treatments, and Clinical
Departments. It also recognises and normalises clinical temporal expressions. The system
combines rule-based and machine learning approaches for extracting events from clinical
narratives. The system achieves F-scores of 90% for the extraction of temporal expressions
and 87% for clinical event extraction. The system architecture is given in Figure 2.7.
Figure 2.7: CliNER architecture (Kovacevic et al, 2013)
2.3 Sentiment analysis of clinical data
Clinical narratives are usually written in natural language by clinicians, who express
emotions in the notes. Thus, patient notes contain sentiments that can reveal whether
a patient's condition is worsening or improving and can therefore help to gain more
information about the patient. Currently, sentiment analysis is rarely applied to clinical
data, and consequently no sentiment analysis systems exist specifically for the clinical
domain.
SentiStrength (SentiStrength website, no date) is a system that performs automatic
sentiment analysis of mainly unstructured informal text. It performs sentiment strength
analysis, which means that it predicts the strength of positive or negative sentiment
within a text. It has mainly been used for analysing Twitter data; however, it can be used
in the clinical sphere as well. SentiStrength estimates the positive and negative sentiment
in text. It can work on sentence-level or note-level and reports two sentiment strengths:
-1 (not negative) to -5 (extremely negative) and 1 (not positive) to 5 (extremely positive).
The reason for producing two scores for a given text input is that a psychology study
has revealed that we process positive and negative sentiment in parallel, hence mixed
emotions. SentiStrength uses a lexicon of 2310 sentiment words and word stems
that have been assigned weights, and it is based on a dictionary lookup approach. The
performance of SentiStrength on Twitter data is 70% for positive ranking and 75.4%
for negative ranking. Its performance on MySpace data is 63% for positive ranking and
77.3% for negative ranking (Thelwall, no date).
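The dual-score, lexicon-lookup idea behind SentiStrength can be sketched as follows. The miniature lexicon and weights here are invented for illustration; the real tool uses its 2310 weighted words and stems plus additional heuristics:

```python
# Invented miniature lexicon: word -> strength (positive 2..5, negative -2..-5).
LEXICON = {"well": 2, "excellent": 4, "unfortunately": -3, "worse": -3, "pain": -2}

def sentiment_strengths(sentence):
    """Return (positive, negative) strengths for a sentence: the strongest positive
    word weight (minimum 1) and the strongest negative word weight (maximum -1),
    mirroring SentiStrength's two parallel scales."""
    pos, neg = 1, -1
    for word in sentence.lower().split():
        weight = LEXICON.get(word.strip(".,;:!?"), 0)
        if weight > 0:
            pos = max(pos, weight)
        elif weight < 0:
            neg = min(neg, weight)
    return pos, neg
```

A sentence with no lexicon hits keeps the neutral pair (1, -1), while mixed sentences can legitimately score on both scales at once, reflecting the parallel processing of positive and negative sentiment described above.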
2.4 Classification approaches
The classification method that will be used for the classification of patient data will depend
on the output of the text mining stage. However, in this section we will present the two
main classification methods that might be used: the decision tree algorithm, a machine
learning, data-driven approach, and rule-based classification, a knowledge-based approach.
Decision tree induction (Han et al, 2012) is a top-down recursive tree induction algorithm
which uses an attribute selection measure to select the attribute tested at each
non-leaf node in the tree. The decision tree approach is data-driven and based on
machine-learning principles. For this particular case, the main class labels can be
metastatic and non-metastatic, and each branch will represent a specific combination of
attribute values of the disease factors identified in the text mining stage. The decision
tree creation process consists of two steps: a learning step, during which the classifier is
built using previously annotated data, and a testing step, in which the classifier's
performance is measured by applying it to unseen cases. An advantage of decision trees
is that they do not require any domain-specific knowledge or parameter settings, and
they are therefore appropriate for exploratory knowledge discovery. The learning and
classification steps of decision tree induction are simple and fast, and in general decision
tree classifiers have good accuracy.
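The attribute selection measure at the heart of decision tree induction can be illustrated with information gain, one common choice. This sketch is illustrative only and uses invented toy data, not the project's patient records:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, attr, label="class"):
    """Information gain of splitting `rows` (a list of dicts) on attribute `attr`."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Invented toy data: one disease factor perfectly separates the two classes.
rows = [{"factor": "present", "class": "metastatic"},
        {"factor": "present", "class": "metastatic"},
        {"factor": "absent", "class": "non-metastatic"},
        {"factor": "absent", "class": "non-metastatic"}]
gain = info_gain(rows, "factor")
```

The induction algorithm would select the attribute with the highest gain at each non-leaf node; here the perfectly separating factor attains the maximum gain of 1 bit.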
Rules (Han et al, 2012) are another good way of representing information. Manually
created rules are not derived using machine learning methods but are written by a
human, which requires more knowledge of the data. A rule-based classifier uses
a set of IF-THEN rules for classification. The IF part of a rule is known as the rule
antecedent or precondition, and the THEN part is the rule consequent (it contains the
class prediction). For this specific case, an example of an IF-THEN rule would be:
if disease factor1 = value1 and disease factor2 = value2, then class = metastatic |
non-metastatic
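A rule-based classifier of this form can be sketched as an ordered list of IF-THEN rules applied to the extracted disease factors. The factor names, values, and rules below are illustrative placeholders, not the project's actual rule set:

```python
# Each rule: (condition on the extracted disease factors, predicted class).
# Factor names and values are invented placeholders for illustration.
RULES = [
    (lambda f: f.get("disease_name") == "metastasis" and not f.get("negated"),
     "metastatic"),
    (lambda f: f.get("treatment") == "palliative", "metastatic"),
]

def classify(factors, default="non-metastatic"):
    """Return the class of the first matching rule; fall back to the default class."""
    for condition, label in RULES:
        if condition(factors):
            return label
    return default
```

The default class plays the role of an implicit final rule, so every note receives a prediction even when no explicit rule fires.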
In comparison with a decision tree, IF-THEN rules may be easier for humans to
understand, particularly if the decision tree is very large. Problems (Han et al, 2012) that
can be encountered during the classification stage are: data overfitting (the classification
model fits the training data too well and is unable to make predictions for unseen cases)
and class imbalance (the main class of interest, e.g., metastatic patients, is represented
by only a few tuples).
Two of the frameworks that can be used for implementing the classification stage are
Weka (Hall et al, 2009) and Orange. Both provide a collection of machine learning
algorithms for data mining tasks. Further to that, they can be applied directly to a
dataset from their visual panels (given in Figure 2.8 and Figure 2.9) or can be called
from a development environment. However, Weka is Java-based and can be called from
Java code, while Orange can be called from Python code or used through visual
scripting.
Figure 2.8: Weka Figure 2.9: Orange
2.5 Summary
In summary, this chapter has provided an overview of the literature research done so
far. It first identified what Text Mining is and how Text Mining approaches can help
in clinical research. It also explained some of the main features of existing clinical text
mining systems and the main steps they implement. These are:
• Pre-processing stage
• Information Extraction stage
• Post-processing
• Data mining
Further to this, several Information Extraction systems suitable for use in the clinical
domain have been identified: DNorm for disease mentions, MedEx for medication names,
CliNER for treatments, and SentiStrength for sentiment analysis. Finally, various
classification methods and frameworks that can be used for the implementation of the
system's classifier have been reviewed, as well as an evaluation method, based on the
confusion matrix, that can be used for evaluating the performance of the classifier.
3. Specification and Design
3.1 Project scope
This project is concerned with the creation of software that extracts relevant data
from clinical notes, structures the extracted data, and classifies it in order to identify
which patients were diagnosed as metastatic and which as non-metastatic at the last
date of their hospital visit.
The project will be limited to the mining of EHRs that include data exclusively on
breast cancer and no other types of cancer. The project is not concerned with the
creation of a user interface or the integration of the software into the hospital system.
The implementation stage will not be concerned with the development of any security
mechanisms, since the data, and respectively the system (when it is using the data), must
not be exported and used outside the hospital boundaries. It is therefore the hospital's
concern to protect the security of the system and the results.
3.2 Challenges for developing the project
Firstly, the challenges of building a clinical text mining system need to be taken into
account, as some factors apply to it that do not exist for standalone software. Among
the most important factors are data protection and data sensitivity. Due to the fact
that the project is dealing with patients' data, the Data Protection Act and the hospital
restrictions need to be taken into account. The system needs to be implemented mainly
in the hospital, since patient data cannot be taken outside. Security measures and
policies should also be taken into account and discussed with hospital staff.
Another challenge is that the project requires medical knowledge in order to know what
type of data indicates whether a patient is metastatic or not. Therefore, a collaboration
with the nurse from the hospital needs to be established in order to discuss with her the
medical aspects of the project.
The third challenge for building a clinical text mining system is that it needs to be
created in the hospital environment in the presence of a person working in the hospital.
Therefore, it is important to carry out initial research on limitations of the environment
and take them into account when creating the system. Further to this, the fact that the
project needs to be supervised by a person in the hospital who is not always available
can cause problems regarding the time frame provided for the project. In order to solve
this problem, careful planning is needed, and a higher priority should be given to the
most important requirements.
3.3 Data Description
Before outlining the system requirements it is important to look at the patient data that
will need to be classified. In this way we will understand the needs of the project better.
The data is available in an SQL database. The records of interest are presented in a
table that has the following structure:
A description of the type of information that is included in each column is given in Table
3.1.
Table 3.1: Table columns description
The main sources of information are the patient notes, stored in the 'Note' column, and
the date on which the event described in the note happened (available in the
'DateOfEvent' column). The patient notes can be split into two types: doctors' annotations
and letters. The doctors' annotations are free text only and do not have any metadata
such as History, Diagnosis, or Treatments sections. The letters (sent to other specialists)
include specified sections for diagnosis, treatments performed, etc., which makes them
easier to analyse. Also, notes include HTML tags that might need to be removed when
performing initial analysis of the data.
The column ‘DateOfEvent’ is also important since it provides structured representation
of the date on which the examination has happened.
It also can be observed that the condition of a patient (metastatic or non-metastatic) is
given in two ways:
• Explicitly – it is clearly mentioned in the notes that a patient is metastatic or the
disease has progressed
• Implicitly – the condition of the patient's progression needs to be deduced from the
context
Examples of two notes, one explicitly mentioning that a patient is metastatic and one
implicitly showing progression, are given in Figure 3.1 and Figure 3.2. Both are presented
in their original form as they appear in the hospital database (i.e. without removing
HTML tags or structuring them).
Figure 3.1: Explicit
Figure 3.2: Implicit
3.4 System requirements
The system requirements have been discussed with the nurse currently involved in the
manual identification of the metastatic patients. The functional requirements can be
split into two groups: essential and desirable.
Essential Functional requirements
1. The system must be able to extract the relevant information indicating whether a
patient is metastatic or not
2. The system must be able to identify the patients with metastasis (both curable
and incurable)
3. The system must return only patients diagnosed as metastatic at the last date of
their visit to the hospital
4. The system must return the patient id and whether they have been diagnosed as
metastatic or not.
5. The system must store the results in a structured format
Desirable Functional requirements
1. The system may return additional information regarding the patient's condition,
such as: a justification for classifying the patient as metastatic, the date of diagnosis,
and whether the metastatic condition is curable or incurable
2. The system may return patients that are likely to be metastatic but have not been
diagnosed as such yet
3. The system may store the results in the hospital database
4. The system may display the results on the online system of the hospital
Non-functional requirements
1. Portability: The system must be able to work on the machines in the hospital.
2. Reliability: The system must return almost all metastatic patients with high cov-
erage (recall)
3. Performance: The system must return the results in a timely manner
3.5 Initial analysis of data
This section will be split into two parts. In the first one, we will identify the main types
of disease factors that show a patient is metastatic and thus need to be extracted. In
the second part we will perform lexical analysis, which will give us more understanding
of the type of data stored in the notes.
3.5.1 Identification of disease factors that need to be extracted
In order to identify the type of information which proves a patient is metastatic and thus
needs to be extracted, another meeting with the nurse who is currently identifying the
metastatic patients was organised. During this second meeting, the main disease
factor classes which indicate that a person is metastatic were identified. Six main classes
were identified; these are given in Table 3.2.
Table 3.2: Disease factors identified
The identification of the disease factors will assist in the implementation of the entity
extraction module of the system. The main types of information considered to identify
whether a patient is metastatic or not are the disease names, drug names, and symptoms.
Thus, these factors will be the first priority in the creation of the information extraction
module.
3.5.2 Lexical profiling
In order to understand what terms/phrases need to be extracted from the notes, we not
only need to know the disease factors proving metastasis but we also need to know the
key information residing in them. For this purpose we need to perform lexical profiling.
This process will be split into two steps: data preparation and lexical profiling. The
analysis is performed on a sample of 200 patients (104 029 notes).
The data preparation step converts the data into a suitable format. For this purpose a
Java program has been created that takes a CSV file with the patient notes as input
and removes the HTML tags and punctuation. The output is then used for performing
lexical profiling.
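In outline, the cleaning performed by the data preparation program amounts to the following (shown here as an illustrative Python sketch; the actual program is written in Java, and the example note is invented):

```python
import re
import string

def clean_note(note):
    """Strip HTML tags and punctuation from a note, collapsing extra whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", note)                       # remove HTML tags
    no_punct = no_tags.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())                             # normalise spaces

cleaned = clean_note("<p>Patient remains well; no new symptoms.</p>")
# cleaned == "Patient remains well no new symptoms"
```

Removing tags before punctuation matters here, since otherwise the angle brackets of the HTML markup would be stripped and the tag names would leak into the token stream.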
Lexical profiling is performed using an n-gram model. This model allows feature
extraction of an object by describing it in terms of its subsequences. An n-gram
is a subsequence of length n (Tauriz, no date). For a given sequence of tokens, an n-gram
of the sequence is an n-long subsequence of consecutive tokens (Tauriz, no date).
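The n-gram definition above can be made concrete with a short sketch that enumerates and counts the n-grams of a token sequence (AntConc performs the same counting at corpus scale; the token sequence here is invented):

```python
from collections import Counter

def ngrams(tokens, n):
    """All consecutive n-long subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented example token sequence.
tokens = "treatment summary protocol treatment summary".split()
bigram_counts = Counter(ngrams(tokens, 2))
```

Ranking the resulting counts is what surfaces the frequent phrases, such as "treatment summary", discussed below.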
The produced n-grams will help us to identify the types of information most frequently
residing in the notes and whether this information could be useful in identifying the
disease progression and thus needs to be considered for extraction. For the creation
of n-grams we use a corpus analysis toolkit called AntConc (Anthony, 2015). We have
generated 2-grams, 3-grams, and 4-grams, presented in Figure 3.3, Figure 3.4, and Figure
3.5. For anonymity purposes some of the results are hidden.
Figure 3.3: 2-grams
Figure 3.4: 3-grams
Figure 3.5: 4-grams
From the figures above it can be noticed that some of the most frequent types of information
mentioned in the notes are treatment-based (e.g. treatment summary, protocol treatment,
oncology summary, etc.) or dosage-based (i.e. medication information). Therefore, it can
be judged that we need to focus on extracting treatment information as well, not only
disease names, drugs, and symptoms. Further to that, it can be noticed that the notes
include sentiment terms such as 'very well', which means that performing sentiment
analysis on the notes can also help to extract hidden information that will help in
identifying metastasis.
3.6 Method overview
The general idea underlying our approach is to identify the text snippets in the separate
notes that contain evidence to support a judgment for a metastatic disease, and then to
integrate evidence gathered at the notes level to make a prediction at document (patient)
level as to whether a patient is metastatic or not. Temporal extraction for dates on which
patients have been diagnosed as metastatic is not needed, since the dates are given in
the 'DateOfEvent' column in a structured format. Therefore, entity extraction will be
the main focus for the project. For the implementation of the NER stage we will use
existing IE systems since the time frame for the project will not allow for the creation
of an entirely new IE system. The benefits of this approach are:
• It does not force us to ”re-invent the wheel” as the main types of data we are
interested in can be extracted with existing tools
• It will save time
• The finished application can be applied to various hospital cases, since the IE tools
that it uses are applicable to different types of data
From the term lists provided by the nurse and the initial analysis conducted on the
data it can be concluded that the main types of information needing to be extracted
are: disease names, treatments, medications, and symptoms, and thus the Information
Extraction tools that will be used to extract these types of entities are DNorm, CliNER,
and MedEx.
An important feature of the approach chosen is that we will also perform sentiment anal-
ysis of the data. This can be beneficial for identifying metastatic patients more easily
since very often nurses and doctors express their judgments and observations towards a
patient’s health status via sentiment phrases (Deng et al, 2014). The identification of
whether there are positive or negative sentiments used within a note can show whether
a patient is metastatic or not. In Figure 3.6, it can be seen that the negative sentiment
phrase "Unfortunately" is used in a note mentioning that the patient's condition is
progressing, while in Figure 3.7 the positive sentiment phrase "remains well" shows
that the patient remains stable.
Figure 3.6: Negative sentiment note example
Figure 3.7: Positive sentiment note example
The classification is also considered an important aspect of the project, since the project
requires almost all metastatic patients to be identified (i.e. the system must have high
coverage). Therefore, two approaches will be used separately for the creation of the
classifier. The first approach is knowledge-based, i.e. based on manually created rules,
and the second approach is data-based, i.e. based on a machine learning algorithm
that derives rules automatically from the data.
The rule-based approach requires a good understanding of the data and partial medi-
cal knowledge but does not require training data while the machine learning approach
requires a training set but does not require understanding of the data or a medical
background. Since the two methods have different characteristics and advantages, it will be
useful to use both and then, based on the results, judge which approach is more suitable
for the project.
The system workflow, given in Figure 3.8, will consist of four main stages:
1. Pre-processing of notes: Includes the performance of low–level pre-processing tasks
that will prepare the data for performing IE tasks
2. Information Extraction: Includes the extraction of relevant features, including
NER sentiments from the notes
3. Post-processing: Includes the performance of post-processing steps that will pre-
pare the extracted data for classification
4. Classification: As the good performance of the classifier stage is important, two
classification approaches will be used for identifying the metastatic patients. The
first approach will consist of manually created rules, while the second will consist
of a machine learning algorithm. The implementation of two different classification
methods will help us to make a comparison and choose the one that best suits the
project's purposes.
Figure 3.8: System workflow
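The four-stage workflow above can be outlined as a simple sequential pipeline in which each stage feeds the next. The stage functions in this sketch are placeholders standing in for the modules described above, not actual implementations:

```python
def run_pipeline(notes, preprocess, extract, postprocess, classify):
    """Run the notes through the four workflow stages in order."""
    cleaned = [preprocess(n) for n in notes]    # 1. pre-processing of notes
    features = [extract(n) for n in cleaned]    # 2. information extraction
    combined = postprocess(features)            # 3. post-processing
    return classify(combined)                   # 4. classification
```

For example, `run_pipeline(["a", "b"], str.upper, len, sum, lambda x: x)` exercises the pipeline end to end with trivial placeholder stages.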
3.7 System Design
This section shows explicit details of the process of design of the software based on the
methodology and the requirements outlined earlier.
3.7.1 Modeling the system boundaries
The first step in the design process consists of modelling the context diagram (or level-0
diagram), with the final goal of identifying the boundaries of the system and the way
it interacts with external components. Figure 3.9 shows that the system will consist of
four main modules, each accomplishing a step described in the 'Method overview' section.
The system will interact with two external entities.
First, the system will take as input a CSV file with a set of the patient notes. The
reason for not creating a direct connection to the hospital database is that the connection
to the hospital server could be lost and access to the database thus denied. By using
CSV instead, we speed up the implementation process and eliminate the possibility of
unexpected crashes due to loss of connection to the server.
Secondly, the system will store the results produced by the first two modules in a
database different from the hospital database. These results will be used for further
processing and classification by the "Post-processing" module and the "Classifier". The
results from the classifier will be stored in the same database. The reason for not using
the hospital database is that no write permissions to this database are provided for the
project.
Figure 3.9: Context Diagram
Figure 3.10 extends the case of Figure 3.9 by showing how the machine learning classifier
will interact with the local database. In order to build the classifier we need to have
a learning set of data that will be used for building the classifier and a testing set
of data, which will help to evaluate the classifier. Both the learning and the testing sets
will be saved in the local database. As the classifier needs to be able to identify both
metastatic and non-metastatic patients, the learning set needs to include notes of
metastatic and non-metastatic patients. In comparison to the machine learning
approach, the rule-based approach will not need to go through a learning stage.
Figure 3.10: Context diagram for Classifier
3.7.2 Modeling the system components
Figure 3.11 shows a level-1 diagram giving more detail about the components that
interact in the software. The system consists of four main components: a Pre-processing
component, an Information Extraction component, a Post-processing component, and a
Classification component. Each of them consists of sub-components, described below:
1. Pre-processing module: This component will perform tasks that prepare the notes
for extraction
• HTML Tags stripper: It removes the HTML tags from the notes.
• Sentence splitter: It splits the notes into sentences since some of the sub-
modules of the Information Extraction component will be implemented on
sentence-level.
2. Information Extraction module: This component consists mainly of sub-modules
that extract the main types of information outlined earlier as important for finding
the metastatic patients (i.e. treatment and symptom information, disease names,
medication names, and sentiments). For each type of data a different Information
Extraction system will be integrated into the component.
• CliNER sub-module: It is responsible for the extraction of treatment and symptom
information. This module will extract information on note-level. The reason for
working on note-level is that treatment and symptom information is very often
spread over several sentences, and thus if this kind of information were extracted
on sentence-level, the meaning of the extracted features might be lost.
• XML parser: The CliNER output is in XML format, and thus an XML parser
needs to be created in order to strip the XML tags and structure the output
in a format that matches the database structure.
• DNorm sub-module: It is responsible for disease name extraction. Since we only
need to extract disease names, which cannot be spread over multiple sentences,
this module will work on sentence-level.
• MedEx sub-module: It is responsible for medication name extraction. It works on
sentence-level for the same reasons as the previous module.
• NegEx: It is used to identify whether the extracted features are negated. This is
an important feature, since if the disease name extraction module outputs disease
progression but the term is negated, the meaning changes from identifying
metastasis to identifying non-metastasis. Only the results produced by DNorm
and MedEx are checked for negation by NegEx. CliNER
results are not checked for negation, as CliNER has an integrated function for
checking negation.
• SentiStrength sub-module: This module will perform sentiment analysis on the
data on sentence-level by giving each sentence a positive sentiment rank and a
negative sentiment rank. The reason for performing sentiment analysis on
sentence-level is that it will give us the opportunity at a later stage to see
how the sentiment in a note changes over time and thus to get more information
about the note and the feelings expressed in it. If sentiment analysis were
performed only on note-level, this would give us only a single average rank for the
entire note. SentiStrength will give each sentence a positive rank between
1 and 5, where 1 denotes neutral and 5 denotes very positive, and a negative
rank between -1 and -5, where -1 denotes neutral and -5 denotes very
negative.
3. Post-processing module: This module will clear the inappropriate results and
combine them on note-level where necessary. More will be said about this module
in a later section.
• Combination of results on note-level: The results produced by some of
the Information Extraction sub-modules are on sentence-level. However, the
classifier works on note-level, and thus the results produced on sentence-level
need to be combined on note-level.
• Post-processing rules: This includes rules that clear the results produced by
the Information Extraction module and prepare them for classification.
4. Classification module: This module will consist of the two classification approaches
that will be created.
• Classification rules: This will include a set of manually created rules that can
classify the notes as metastatic or non-metastatic.
• ML algorithm: This component integrates an ML algorithm for classifying the
notes as metastatic or non-metastatic.
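The negation check performed by NegEx on the DNorm and MedEx output can be illustrated with a minimal trigger-based sketch. The trigger list and method names below are illustrative assumptions; the real NegEx rule set is considerably larger and also handles trigger scope:

```java
import java.util.List;

public class NegationCheck {
    // Illustrative negation triggers; NegEx's actual rule set is far larger.
    static final List<String> TRIGGERS = List.of(
            "no evidence of", "no sign of", "denies", "without", "negative for");

    // Returns true when the extracted term is preceded by a negation
    // trigger within the same sentence.
    static boolean isNegated(String sentence, String term) {
        String s = sentence.toLowerCase();
        int termPos = s.indexOf(term.toLowerCase());
        if (termPos < 0) {
            return false;  // term not present in this sentence
        }
        for (String trigger : TRIGGERS) {
            int triggerPos = s.indexOf(trigger);
            if (triggerPos >= 0 && triggerPos < termPos) {
                return true;
            }
        }
        return false;
    }
}
```

With such a check, an extracted term like 'disease progression' in 'there is no evidence of disease progression' would be flagged as negated rather than treated as a positive finding.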
The components presented in Figure 3.11 will be the classes implemented for the pro-
gram.
3.7.3 Modeling the system interactions
One of the last steps of the design process is the sequence diagram which presents the
order of interactions between the main components of the system. Figure 3.12 focuses
mainly on the order in which the Information Extraction components (DNorm, CliNER,
SentiStrength, MedEx, and NegEx) are interacting with the other modules of the system
(Pre-processing, Post-processing, Classification, and Database).
The Information Extraction components execute and write to the database in a
sequential way. Once all the results from the Information Extraction module are stored
in the database, they are passed to the Post-processing module, where the results are
prepared for classification and then stored back in the database. The Classifier module
uses the results from the Post-processing module, which are already cleaned and combined
on note-level, in order to apply the classification approaches to them. The output of the
Classifier module is stored back in the database.
3.7.4 Modeling the system workflows
The system workflow will be presented using an Activity diagram, given in Figure 3.13.
While the previous section focused mainly on the Information Extraction module,
this section focuses on two other main aspects of the system. The first one is
the Post-processing module and the second one is the combination of the results from
sentence-level to note-level and from note-level to patient-level.
The Post-processing module consists of two sub-modules (Combination of results on
note-level and Post-processing rules). The first sub-module's role is to combine the results
produced on sentence-level to note-level. The role of the Post-processing rules is to clear
out the inappropriate results that should not be passed to the Classification module. This
will be done by using lists that contain terms given by the nurse, together with some other
terms that have been found by observing the results from the Information Extraction sub-
modules and are considered important. The rules created will be mainly term-matching
based: for each term produced by one of the Information Extraction sub-modules, it
will be checked whether the term is in the list, and if it is not, it will be deleted from
the database. The reasons why we have decided to use Information Extraction systems
for extracting the important features instead of creating a dictionary-based approach in
the first place are as follows:
• The Information Extraction systems perform normalisation of the terms so the
results produced are in a basic form, which makes the matching to the list of terms
easier. In a dictionary-based approach we would need to do the normalisation or
extend the dictionaries used in order to include all variations of terms, which is a
time-consuming process.
• The lists of terms given by the nurse are not exhaustive and thus the dictionaries
created based on them would not be sufficient for extracting all features that could
be helpful in identifying metastasis. The Information Extraction systems used can
extract important information that we could not know about and thus help in
extracting more features that can participate in the classification.
In the Post-processing module, the types of rules needing to be created are split into two
groups: 'Combination of results on note-level' and 'Post-processing rules'. The first set
of rules takes as input the results produced by SentiStrength, DNorm, and MedEx on
sentence-level in order to combine the results on note-level and clear inappropriate results.
The steps involved in this part of the module are as follows:
1. First, the sub-module takes the SentiStrength results on sentence-level and per-
forms the following steps:
• It calculates statistical information about each note that can help in the clas-
sification stage later on, including the following:
– Average rank for the note: This will give us knowledge about the overall
sentiment in the note (i.e. whether the note expresses positive, negative
or neutral feelings)
– Maximum negative and positive result for the note: This is the highest
positive or negative score for a note. It can tell us whether there are very
negative or positive sentiments in a note, in which case it can indicate
that the note definitely mentions that a patient has disease progression
or a stable disease.
– Keeping the results for the first sentence and the last sentence in a
note: This can give us information about how the sentiment changes
throughout the note. For instance, a note that starts with a very negative
score but changes in a positive direction towards the end is likely to show
an improvement in the patient's condition, while a note that starts with
positive sentiment but finishes with negative sentiment can indicate that
the patient's condition is deteriorating.
• The calculated statistics from the SentiStrength results are saved on note-
level.
2. Secondly, the sub-module takes the terms produced by DNorm and MedEx on
sentence-level and performs the following steps:
• It checks whether the term is the only one found for the note. If so, it outputs
the term on note-level.
• If the term is not the only one found for the note, it checks whether the found
terms are in the list given by the nurse and outputs on note-level only the
terms that are relevant.
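The note-level statistics derived from the per-sentence SentiStrength ranks (average rank, maximum positive rank, maximum negative rank) can be sketched as follows; the first- and last-sentence results are simply the first and last array entries. The method names are illustrative:

```java
public class SentimentAggregator {
    // Average combined (positive + negative) rank over all sentences of a note.
    static double average(int[] pos, int[] neg) {
        double sum = 0;
        for (int i = 0; i < pos.length; i++) {
            sum += pos[i] + neg[i];
        }
        return sum / pos.length;
    }

    // Highest positive rank (1..5) found in the note.
    static int maxPositive(int[] pos) {
        int max = pos[0];
        for (int p : pos) {
            max = Math.max(max, p);
        }
        return max;
    }

    // Most negative rank (-1..-5) found in the note.
    static int maxNegative(int[] neg) {
        int min = neg[0];
        for (int n : neg) {
            min = Math.min(min, n);
        }
        return min;
    }
}
```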
The second sub-module ”Post-processing rules” of the Post-processing component con-
sists of two steps as follows:
1. First, it takes as input the DNorm, MedEx, and CliNER results on note-level and
applies post-processing rules in order to remove inappropriate terms from the results.
The results that do not appear in the lists of terms are deleted.
2. The second step involves the preparation of the post-processed data for classifica-
tion. This includes the mapping of the extracted data on note-level into a format
that is suitable for classification. More information about this will be given in the
Database design section.
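The term-matching deletion described above amounts to keeping only those extracted terms that appear in a list; a minimal sketch, with illustrative names (the real matching would happen against the database rather than in memory):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TermFilter {
    // Keeps only the extracted terms that appear in the allowed list
    // (e.g. the terms given by the nurse); everything else is dropped.
    static List<String> filter(List<String> extracted, Set<String> allowed) {
        return extracted.stream()
                .map(String::toLowerCase)      // normalise case before matching
                .filter(allowed::contains)
                .collect(Collectors.toList());
    }
}
```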
As can be seen from Figure 4.1, another important aspect of the system is the com-
bination of results from sentence-level to note-level and from note-level to patient-level.
This aggregation is needed because some of the Information Extraction sub-modules
need to work on sentence-level, the Classifier module needs to make judgments on
note-level, and the requirement for the system is to show results for a specific patient
rather than on note-level.
This is done in the following way:
First, the Information Extraction sub-modules DNorm, MedEx, and SentiStrength pro-
duce results on sentence-level, which are combined on note-level by the Post-processing
sub-module and saved to the database.
Then, the classifier takes the results aggregated on note-level and assigns a class (metastatic
or non-metastatic) to each note. In order to combine the results on patient-level, we will
search for the latest note for each patient (i.e. the note that was generated on the last
date of visit for the patient) and report it on patient-level. The reason for choosing the
note at the last date of visit is that one of the requirements is to find the condition
of the patient at her last visit to the hospital.
If the outcome of the last note is that the patient is non-metastatic, we will also search
for the latest note which says the patient is metastatic and print its date. Since cases
in which a patient has been incorrectly identified as non-metastatic can be dangerous,
we need a system that helps the nurse judge these false cases, and printing the date of
the latest metastatic note ensures that the nurse can track the patient's condition over
time.
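The patient-level aggregation described above can be sketched as follows. The ClassifiedNote holder and its field names are hypothetical, not the actual database schema; dates are assumed to be in ISO format so that they sort lexicographically:

```java
import java.util.List;

public class PatientLevelAggregator {
    // Hypothetical holder for a classified note; field names are illustrative.
    static class ClassifiedNote {
        final String patientId;
        final String visitDate;   // ISO format, so dates sort lexicographically
        final boolean metastatic;
        ClassifiedNote(String patientId, String visitDate, boolean metastatic) {
            this.patientId = patientId;
            this.visitDate = visitDate;
            this.metastatic = metastatic;
        }
    }

    // The patient's status is taken from the note at the last date of visit.
    static ClassifiedNote latestNote(List<ClassifiedNote> notes) {
        ClassifiedNote latest = notes.get(0);
        for (ClassifiedNote n : notes) {
            if (n.visitDate.compareTo(latest.visitDate) > 0) {
                latest = n;
            }
        }
        return latest;
    }

    // If the latest note is non-metastatic, report the date of the most recent
    // metastatic note so the nurse can judge possible false negatives.
    static String latestMetastaticDate(List<ClassifiedNote> notes) {
        String latest = null;
        for (ClassifiedNote n : notes) {
            if (n.metastatic && (latest == null || n.visitDate.compareTo(latest) > 0)) {
                latest = n.visitDate;
            }
        }
        return latest;
    }
}
```

If latestNote(...) reports non-metastatic, the date returned by latestMetastaticDate(...) (or null if there is none) is what would be shown to the nurse.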
3.7.5 Classification rules design
As mentioned before, for the classification of the notes we will use two approaches: a rule-
based approach and a machine learning approach. The main advantage of using both is
that we will be able to compare their outcomes and decide which one is more suitable
for the given data. In the rest of the section we will present the design of the rule-based
approach.
The rules can be split into two groups. The first group of rules is used to find
indications of metastasis in the notes, while the second group is used to find
indications of non-metastasis.
Metastatic Rules
The rules used to identify the metastatic notes are given in Figure 3.14.
There are six rules, explained below:
• The first rule uses only SentiStrength results and its goal is to catch the cases of
incurable metastasis.
From observations it can be seen that notes showing incurable metastasis
have very negative sentiment ranks. Thus, it is enough to classify those
cases with a rule whose threshold for the sentiment ranking is -4.
• The second rule uses DNorm and SentiStrength results and its main goal is to
catch cases of curable and incurable metastasis where sentiments are not strongly
present in the note.
As the sentiment rankings of some notes that show curable metastasis are not very
negative but at the same time the diagnosis is mentioned, we use both DNorm
and SentiStrength to catch both of these cases. DNorm cannot be used by itself,
as in some cases diagnoses are mentioned but are part of a patient's history,
and thus using only these terms by themselves can lead to incorrect classification.
• The third rule uses DNorm and MedEx results, trying to catch the cases of curable
metastasis in which the sentiment ranking is neutral.
In some cases (according to the nurse), the combination of certain diagnoses
and medications means that a patient is metastatic.
• The lists (diagnosis list, medication list, treatment list, symptoms list, and scans
list) used in the creation of the rules are built around the main groups of terms
the nurse has provided and that have been found important (diagnoses, treatments,
symptoms, and medications). Depending on the type of terms extracted by each
of the tools, they are associated with a different list type. For instance, DNorm
extracts diagnoses, and thus the rules including DNorm check whether the terms
extracted are in the diagnosis list. For the CliNER results, which are of type
treatment, the rules check whether the terms are in the treatment list.
• The last three rules use results from CliNER and SentiStrength in order to try to
catch curable and incurable metastasis in the cases where diagnoses and medica-
tions have not been mentioned and the sentiment rankings are mainly negative.
• For each of the rules where DNorm, MedEx, and CliNER results have been used,
we have also checked that the terms are not negated.
• The options for first and last sentence rankings have not been used in the rules
since it is difficult to judge only by observations how those rankings affect the
classification. They will be used in the ML approach.
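As an illustration, the first two metastatic rules might be coded as follows. The -4 threshold in rule 1 is taken from the description above; the -2 threshold and the predicate names in rule 2 are invented placeholders, and the terms passed in are assumed to have already passed the negation check:

```java
import java.util.Set;

public class MetastaticRules {
    // Rule 1: a very negative sentiment rank alone indicates incurable
    // metastasis (threshold -4, as described above).
    static boolean rule1(int maxNegativeRank) {
        return maxNegativeRank <= -4;
    }

    // Rule 2: a non-negated DNorm diagnosis from the diagnosis list combined
    // with a moderately negative sentiment (illustrative threshold of -2).
    static boolean rule2(Set<String> dnormTerms, Set<String> diagnosisList,
                         int maxNegativeRank) {
        boolean diagnosisFound = false;
        for (String term : dnormTerms) {
            if (diagnosisList.contains(term)) {
                diagnosisFound = true;
                break;
            }
        }
        return diagnosisFound && maxNegativeRank <= -2;
    }

    // A note is classified as metastatic if any rule fires.
    static boolean isMetastatic(Set<String> dnormTerms, Set<String> diagnosisList,
                                int maxNegativeRank) {
        return rule1(maxNegativeRank)
                || rule2(dnormTerms, diagnosisList, maxNegativeRank);
    }
}
```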
Figure 3.14: Metastatic rules
Non-metastatic Rules
The rules used to identify the non-metastatic notes are given in Figure 3.15.
There are only two non-metastatic rules, and they focus on finding notes that are either
very positive or include diagnoses such as stable disease and response to treatment and
have a positive ranking higher than the negative one. Thus, these rules combine results
from DNorm and SentiStrength.
Figure 3.15: Non-metastatic rules
3.7.6 Quality attributes
The quality attributes (given in Figure 3.16) taken into account in the design of the
system are as follows:
• Portability – The system needs to work on the main computer platforms, mainly
Windows and Linux, as these are the main platforms used within the hospital
• Reliability – The system needs to be able to identify almost all metastatic patients.
This is one of the most important quality attributes the system needs to have.
• Performance (Response Time) – Even though it is not a first priority for the system
to return results very quickly, the response time of the system should still be fast
enough not to affect the work of the nurses negatively
• Interoperability – The system must be easy to integrate within the hospital database
• Reusability – The system components must be easily replaced or extended in order
to support improvements and changes in future
Figure 3.16: Quality Factors
3.8 Database Design
A well-structured database is a key component of the project, as it will store the data
extracted during the IE stage, provide structure to the extracted data and thus support
its classification. This section is split into two parts. In the first part, we will identify
the types of information that need to be stored, and then, based on that, in the second
part the actual design of the database will be presented.
3.8.1 Types of information that need to be stored
The database needs to store each type of information extracted by the different compo-
nents of the program, the information prepared to be passed to the classifier, as well as
the results from the classifier. The main groups of data that the database needs to store
are explained below:
1. Pre-processed Notes: These are the notes that have been cleaned from HTML
tags by the HTML Stripper sub-module as well as the sentences returned by the
Sentence Splitter module.
2. DNorm results on sentence level: These are the results returned from DNorm
on sentence-level including information on whether they are negated or not
3. MedEx results on sentence level: These are the results returned from MedEx
including whether they are negated or not
4. SentiStrength results: These are the results returned from SentiStrength on
sentence-level
5. CliNER results: These are the results returned from CliNER on note-level
6. Results from Post-processing module: These are the results returned by
DNorm, MedEx, and SentiStrength aggregated on note-level, as well as CliNER
results cleaned and prepared for classification
7. Classifier data: This is the data produced by the Post-Processing module struc-
tured in a way so that it is appropriate to be passed to the machine learning
classification approach. The rule-based approach uses the results produced by
the Post-processing module without additional re-structuring
8. Classifier results: These are the results returned from both of the classification
approaches.
3.8.2 Design
Figure 3.17 illustrates the database design, taking into account the types of information
that need to be stored. Following the information presented in the section above, we can
derive the following tables:
• Data Source: It includes the notes with unique IDs, cleaned from HTML tags.
It also stores the results from the classifier on note-level.
• Data Sentence: It includes the sentences with unique IDs for each note. This
table also includes the position of the sentence in the note (start and end character).
• CliNER Event results: It includes the results produced from CliNER regarding
the events extracted from the notes.
• CliNER Timex results: It includes the results produced from CliNER regard-
ing temporal data extracted from the notes.
• DNorm SentLevel: It includes the results from DNorm on sentence-level.
Further to that, it includes information on whether the results produced by DNorm
are negated or not and stores the position of the term in the sentence.
• MedEx SentLevel: It includes the results from MedEx on sentence-level. Fur-
ther to that, it includes information on whether the results are negated or not and
stores the position of the term in the sentence.
• SentiStrength SentLevel: It includes the SentiStrength positive and negative
ranking for each sentence.
• DNorm NoteLevel: It includes the DNorm results aggregated on note-level.
These results have also been cleaned from inappropriate terms.
• MedEx NotesLevel: It includes the MedEx results aggregated on note-level.
These results have also been cleaned from inappropriate terms.
• SentiStrength NoteLevel: It includes statistical information about the positive
and negative rankings within the note, based on the aggregation performed on the
SentiStrength results on sentence-level.
• Classifier Rules: It includes the results produced by the rule-based classification
approach.
• Classifier ML: It includes all the data from SentiStrength, DNorm, MedEx, and
CliNER, combined and structured in a way that is suitable for the machine
learning-based classification approach. It also includes the results produced by the
classifier. More about the creation of the classification template that will be passed
to the ML algorithm will be given below.
• Patient Diagnose: This table stores the final results for a patient (i.e. whether
she is metastatic or not).
Figure 3.17: Database Design
Assumptions and Justifications for design
• Generation of the IDs for notes and sentences: These are generated in a way that
makes it easy to derive the patient associated with the note, the date on which
the note has been created, and the position of the sentence in the note
– The IDs for the notes are formed using the patient id and a number generated
depending on the date on which the note has been written.
– The sentence IDs are formed with the note id and the index of the sentence
in the note.
• The results produced by DNorm, MedEx, and CliNER also hold the positions
of the extracted terms, in case they need to be backtracked to the original notes.
• SentiStrength information on note-level: The sentiment information saved on note-
level is the following: the average (combined positive and negative) score for the
note, the maximum positive/negative score in the note, and the scores of the first
and the last sentence
• Classifier ML table information: This table serves as a classifier template that
helps to structure the results in the most suitable format for performing ML-based
classification. It is given in Figure 3.18.
– The columns of the template are represented by the features that indicate
whether a note shows metastasis or non-metastasis. The features can be split
into 4 groups: DrugName List, DiseaseName List, TreatmentsName List,
and SentiStr Values. Each group of features represents a type of information
that has been extracted from the notes, and thus each column will represent
a term whose presence or absence in a note makes the note more likely to
show either metastasis or non-metastasis.
– The columns representing terms from the groups DrugName List, Disease-
Name List, and TreatmentsName List will be of binary type and will be able
to take only values 1 or 0 where 1 indicates that the term is present in the
note and 0 indicates that the term is not present. The columns represent-
ing the SentiStrength results will hold for each note the average ranking, the
maximum positive and negative ranking, the ranking for the first and last
sentence for a note.
– An example of a note represented in the template: a note can have the value 1
for column 'stable disease', the value 0 for column 'adjuvant metastasis', and
the value 3 for column 'SentiStr avg'. This could mean that the note indicates
that the patient is non-metastatic.
– The reason for creating a classification template instead of using the post-
processed data directly is that it provides a unified way of representing the
data in each note and thus ensures that the classifier is passed the same type
of data with the same structure. Further to that, the values of the classifier
template are numerical, most of them binary, and a classifier is more easily
built on binary numeric values than on other types of data.
– This table will also include the columns Class Actual, which will store the
actual class for a note and Class Predicted, which will store the class given
by the classifier.
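The ID scheme described in the first bullet can be sketched as follows; the exact separator and date encoding are assumptions, the point being only that the patient, the date and the sentence position are recoverable from the IDs:

```java
public class IdGenerator {
    // Note ID: the patient id plus a number derived from the date on which
    // the note was written (here simply the ISO date with dashes removed).
    static String noteId(String patientId, String isoDate) {
        return patientId + "-" + isoDate.replace("-", "");
    }

    // Sentence ID: the note id plus the index of the sentence in the note.
    static String sentenceId(String noteId, int sentenceIndex) {
        return noteId + "-" + sentenceIndex;
    }
}
```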
Figure 3.18: Classifier Table
4. Implementation
4.1 Development environment preparation
4.1.1 Ethical approvals
Due to the fact that the project is concerned with processing and analysing patient data,
there are some data security and ethical aspects that needed to be considered.
First, an ethical approval from the hospital has been obtained. To accomplish this, an
online application was submitted and approval was received. Secondly, two Information
Governance (IG) and Data Protection Act declarations needed to be filled in within
the hospital and training needed to be completed. All this has been done, and access
to the data and the hospital has been permitted.
4.1.2 Hospital environment preparation
Due to the sensitivity of the EHRs, the hospital has a set of restrictions, which have
been taken into account, and appropriate actions have been performed. Table 4.1
presents the restrictions and the associated actions performed.
Table 4.1: Work environment set up
4.1.3 Development Language Choice
Based on the quality attributes described in the previous chapter, the programming
language selected to build the software is Java. Due to the fact that Java runs on top of
the JVM, it follows a 'write once, run anywhere' style of application and thus satisfies
the portability quality attribute. Due to its object-oriented programming paradigm,
many architectural patterns, design patterns and best-practice rules can be applied to
reach the performance, interoperability and reusability attributes. Other advantages of
using Java as the language for developing the system are:
• Compatibility with SQLite:
As the project uses a database, the chosen language needs to be able to
communicate with SQL databases. Java gives us this ability through the use of
well-established projects
• Timeliness:
The code needs to be fast and responsive. Java performs at speeds comparable to
native code such as C and C++, but it is much more flexible and easier to use
• Security and Reliability:
Java is an open-source project that is constantly being updated and should not
cause security issues. As the JVM runs the code in protected memory, other
languages would not offer any significant reliability benefits over Java
• Maintainability:
Java is a widely used language. Therefore, it is feasible that other developers
could continue the project in the future
4.1.4 Database Platform and Type Choice
A Relational Database Management System (RDBMS) is the most suitable option for
the purposes of the project, since this is the database type used in the hospital. Further
to that, it is vendor- and community-supported and it has capabilities to deal with
data semantics, consistency and redundancy problems. It also provides a simple logical
structure for storing data. The database platform that will be used is SQLite. Even
though the platform used in the hospital is SQL, we do not have write permissions to it
and thus it cannot be used for storing data. However, the SQLite operations that will be
used have equivalents in SQL, and thus the integration of the system within the hospital
will not be a problem. An advantage of SQLite is that it requires no server and needs
little or no configuration.
4.1.5 Preparation of Feature Extraction tools
In this section we will present in more depth the systems/tools that will be used for the
implementation of the Feature Extraction module of the system. These are: DNorm,
MedEx, CliNER, NegEx, and SentiStrength. Even though they have been presented in
previous sections of the report, we will revisit them in order to describe their specific
requirements and how they have been prepared so that they can be integrated into the
application. In the rest of this section we will present each tool with its requirements
and the preparation actions taken in order to integrate it later on:
1. DNorm
• It requires 10GB of RAM: In order to reduce the memory required for this
tool, we will remove some of the dictionaries that it uses and are irrelevant
to cancer disease.
• It is initially suited to run on Linux: Since it is preferable for the system to
work on Windows, we will include the necessary packages that are needed in
order for DNorm to work on Windows.
• It takes a text file as input and produces a text file as output: The program is
changed in order to take variables as input and store the results in variables
2. MedEx
• It is initially suited to run on Linux: Since it is preferable for the system to
work on Windows, we will include the necessary packages that are needed in
order for MedEx to work on Windows.
• It prints out not only medication names but also other information concerning
medication intake: Since the only information we are interested in is the
medication names, the rest of the data produced by MedEx is ignored
• It takes a text file as input and produces a text file as output: The program is
changed in order to take variables as input and store the results in variables
3. NegEx
• It is working on all platforms: No actions needed
• It is associated with a GUI: The tool's GUI has been removed and the tool
adapted to be included in the application.
4. CliNER
• It is initially suited to run on Linux: Since it is preferable for the system to
work on Windows, we will include the necessary packages that are needed in
order for CliNER to work on Windows.
• It provides the output as an XML file: In order to transform the output into a
format suitable to be saved in the database, we need to create an XML parser
• It takes text files as input: The program will be adapted to take variables
as input instead
5. SentiStrength
• It is working on all platforms: No actions are needed.
• It provides the output on the command line: The tool has been transformed
to store the output in variables.
• It uses modules specific for social media (emoticons, dictionaries with words
that cannot be seen in clinical text): These modules are removed
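The XML parsing needed for the CliNER output can be done with the JDK's built-in DOM parser, so no external library is required. The EVENT tag name and the input fragment below are assumptions used for illustration, not CliNER's exact output schema:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ClinerXmlParser {
    // Strips the XML tags from a CliNER-style output fragment and collects
    // the text content of every EVENT element (tag name is an assumption).
    static List<String> extractEvents(String xml) {
        List<String> result = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList events = doc.getElementsByTagName("EVENT");
            for (int i = 0; i < events.getLength(); i++) {
                result.add(events.item(i).getTextContent());
            }
        } catch (Exception e) {
            // Malformed XML: return whatever was collected (here, nothing).
        }
        return result;
    }
}
```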
4.2 Classifier creation
To create the classifier module of the system, the following steps have been performed:
1. Step 1: Installation of classification platform
The platform chosen for implementing the classifier is Weka. The reasons for
choosing it are the wide use and maturity of the software, as well as the large amount
of help information available.
2. Transfer of classifier data to a suitable format
The most suitable format is the default format for Weka, ARFF. Even though Weka can
read CSV files, the advantages of using the ARFF format are the following:
• CSV files cannot be read incrementally (Weka, 2015): In order to determine
whether a column is numeric or nominal, all the rows need to be inspected
first. ARFF files contain a header defining the attributes, i.e. the internal
data structures can be set up correctly before reading the actual data.
• Train and test sets may not be compatible with CSV files (Weka, 2015): Since
CSV files do not contain any information about the attributes, Weka needs
to determine the labels for nominal attributes itself. Not only does the order
of appearance of these labels create different nominal attributes ("1,2,3"
vs "1,3,2"), it also does not guarantee that all the labels that appear in
the train set also appear in the test set ("1,2,3,4" vs "1,3,4") and vice versa.
3. Using a decision tree method (i.e. J48) for classification
The reason for choosing a decision tree method is that it is easy to understand,
it makes no prior assumptions about the nature of the data, and it can classify
both categorical and numerical data.
4. Evaluate by calculating a confusion matrix
The classifier will be evaluated by calculating precision and recall. Recall is con-
sidered the more important measure, since it is important that most of the
metastatic patients are found.
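As a hedged illustration of the ARFF format, a classifier template using the example columns from the design chapter might look like the fragment below; the attribute names and the data rows are invented:

```text
@relation notes

@attribute stable_disease {0,1}
@attribute adjuvant_metastasis {0,1}
@attribute sentistr_avg numeric
@attribute class {metastatic,non-metastatic}

@data
1,0,3,non-metastatic
0,1,-4,metastatic
```

The @attribute header is exactly what lets Weka set up its internal data structures before reading the data, which is the incremental-reading advantage mentioned in the first point.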
4.3 Software Testing
Software testing is one of the core practices of software engineering. In order to prove
that the system will have a positive impact on the intended users and will support
the successful identification of metastatic patients we must test the system. Testing
needs to be done on an ongoing basis and not just at the end when the software has
been finalised. In this way, we can find bugs and issues more easily and earlier, which
also makes the debugging much easier. Therefore, the system has been tested both
during development and after its finalisation.
The main goals of software testing are to determine the system and information
quality. The system quality will be determined by whether the system provides all
the functions necessary to extract information that proves a patient is metastatic
and then to classify the patients correctly. The information quality will be judged
on whether or not the data extracted is accurate and discrepancy-free.
To accomplish the stated goals, two main types of testing have been performed:
Regression Testing and Integration Testing.
4.3.1 Regression Testing
The objective of regression testing is to check that old functionality has not been broken
by new functionality or changes made in the application (Narayan Singh, 2013). This
type of testing has been done throughout the implementation of the system, in the
form of test cases run every time a new functionality is added, in order to check
that the old modules still work. This has been particularly useful during the integration
of the Information Extraction tools (DNorm, MedEx, and CliNER).
4.3.2 Integration Testing
The objective of Integration testing is to ensure that aggregates of units perform ac-
curately together (Narayan Singh, 2013). Integration testing is useful since it helps to
identify problems that occur when units are combined. In order for the testing to be
successful, a test plan needs to be created that follows a bottom-up approach, in which
we first test individual units and then test them combined. In this way we know that
any errors discovered when combining units are likely related to the interface between
units. This method reduces the number of possibilities to a far simpler level of analysis.
For instance, we first need to test whether CliNER works well individually and extracts
data, and then test it together with the XML Parser. For this stage of testing we have
also created test cases before the implementation.
4.3.3 Tests Summary
In this section the Regression and Integration tests are summarised. They have been
performed on Mint 12 (Linux) with Java version 1.6.0. The tests, together with their
descriptions, are given in Table 4.1.
5. Evaluation
In this chapter we will describe the data set used to train and evaluate the classifier.
Further to that, we will evaluate the performance of SentiStrength. Both classi-
fication approaches (rule-based and ML-based) will be evaluated and compared, and
we will also conduct an error analysis.
5.1 Evaluation methodology
The evaluation method (Spasic et al., 2014) should be able to test whether the system
is able to correctly identify the metastatic patients and, most importantly, whether the
system is able to find all metastatic patients (i.e. whether it has good coverage, or
recall). The methodology we will use for evaluating the two classification approaches
will be based on a confusion matrix, as explained in the Background section. In this
particular case, the positives will be the metastatic patients and the negatives the
patients without metastasis. The precision is the fraction of the patients identified as
metastatic who are truly metastatic, and the recall is the fraction of all metastatic
patients who are correctly identified. We are mainly interested in the coverage of the
system (all patients with metastasis are found) and thus we want the recall of the
system to be high. Due to this fact, in the calculation of the F-measure the recall will
be given a higher importance by including a non-negative real number (i.e. β) in the calculation
77
78
of F-measure, which will give a preference to the recall. This version of F-measure is
called F2 measure with as it weights recall twice as much as precision.
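In general, Fβ = (1 + β²)·P·R / (β²·P + R), so β = 2 yields the F2 measure used here. A minimal Java sketch of these calculations from confusion-matrix counts (the counts below are illustrative only, not the project's actual figures):

```java
// Precision, recall and F-beta computed from confusion-matrix counts.
// F2 (beta = 2) weights recall twice as highly as precision, as required here.
class EvalMetrics {

    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }

    static double recall(int tp, int fn) { return tp / (double) (tp + fn); }

    // General F-beta: (1 + b^2) * P * R / (b^2 * P + R)
    static double fBeta(int tp, int fp, int fn, double beta) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return (1 + beta * beta) * p * r / (beta * beta * p + r);
    }

    public static void main(String[] args) {
        // Illustrative counts: 10 true positives, 5 false positives, 2 false negatives.
        int tp = 10, fp = 5, fn = 2;
        System.out.printf("P = %.3f, R = %.3f, F2 = %.3f%n",
                precision(tp, fp), recall(tp, fn), fBeta(tp, fp, fn, 2.0));
    }
}
```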
5.2 Description of Data set
We will use 50 notes for both testing and training the classifier. Around 20 of these
notes were annotated by the nurse, while the rest were annotated specifically for the
project by a non-medical person. Statistical information about the notes is given in
Table 5.1.
Table 5.1: Notes statistics
The notes have been chosen so that they are representative of the main types of data
that need to be classified: curable metastasis, non-curable metastasis, and non-metastatic
patients. Each group has almost the same number of representative notes. In this way we
try to avoid class imbalance, in which a particular group of notes has no examples, so
that the classifier cannot be trained to recognise those cases and fails to classify
unseen ones.
5.3 SentiStrength Evaluation
SentiStrength is the only tool integrated into the Information Extraction module that
was not specifically created for clinical data. Thus, its performance needs to be
evaluated before its results are included as attributes for classification. The data set
used for its evaluation is the same as the data set that will be used for the evaluation
of the classifier. For this purpose, we manually assign positive and negative sentiment
ranks to each sentence stored in the SQLite database for which SentiStrength has
performed sentiment analysis. This is done without looking at the output from
SentiStrength, in order to avoid bias. Then, we compare the manually assigned rankings
with the SentiStrength rankings. Table 5.2 presents the overall number of incorrectly
ranked sentences, the numbers of sentences given incorrect positive and negative ranks
separately, the overall accuracy, and the accuracy of the positive and negative rankings
respectively.
Table 5.2: SentiStrength accuracy
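The comparison described above reduces to counting, separately on the positive and negative scales, how often the tool's rank matches the manual gold rank. A minimal sketch (the rankings below are made up for illustration; SentiStrength reports a +1..+5 positive and a -1..-5 negative score per sentence):

```java
// Accuracy of tool-assigned sentiment rankings against manual gold annotations,
// computed separately for the positive and the negative sentiment scale.
class RankAccuracy {

    // Fraction of sentences where the tool's rank matches the manual rank.
    static double accuracy(int[] manual, int[] tool) {
        int correct = 0;
        for (int i = 0; i < manual.length; i++)
            if (manual[i] == tool[i]) correct++;
        return correct / (double) manual.length;
    }

    public static void main(String[] args) {
        // Illustrative rankings for four sentences, per scale.
        int[] manualPos = {1, 3, 1, 2}, toolPos = {1, 3, 1, 1};
        int[] manualNeg = {-4, -1, -2, -1}, toolNeg = {-4, -2, -2, -1};
        System.out.println("positive-scale accuracy: " + accuracy(manualPos, toolPos));
        System.out.println("negative-scale accuracy: " + accuracy(manualNeg, toolNeg));
    }
}
```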
From the results shown in the table above it can be concluded that SentiStrength has
sufficiently high accuracy to be included as a factor in the classification. It can
also be seen that the positive rankings are slightly more accurate than the negative
rankings. Further to that, after inspecting the ranked sentences, we made the
following observations, which can be useful during the evaluation of the classifier:
1. SentiStrength could be used by itself to find the cases of non-curable cancer
The sentences showing that a patient's condition is non-curable contain very strong
negative sentiment words (e.g. "unfortunately", "very unlikely") which are included
in the SentiStrength dictionaries and are easy to identify. Based on these observations,
the accuracy of SentiStrength on cases of non-curable cancer alone would be higher.
However, cases with curable cancer are more difficult for SentiStrength to notice,
since they usually lack sentiment. For example, the sentence "She had mammogram
and ultrasound and was found to have a tumour in the left breast" shows that
a patient has metastasis but does not include sentiment, and thus the sentiment
rankings are neutral.
2. In some cases SentiStrength produces rankings that can affect the classification
negatively
In some cases SentiStrength gives rankings that do not correspond to the real sentiment
of the sentence. Examples include "No ovarian cancer", with ranking +1
and -4, and "On Examination: Height 169.6cm, weight 62.65kg, no red flag
symptoms including paraesthesia or weakness in the spine", with ranking +1 and
-2. Both cases carry positive information but are ranked as negative.
However, there are no cases in which a sentence with negative sentiment
is ranked as positive. This means that SentiStrength might affect the precision
of the classifier but not the recall, which is the more important metric of the two.
5.4 Classification approaches evaluation
In this section we will evaluate the two approaches, perform error analysis on the
results, and compare their performance.
5.4.1 Rule-based approach evaluation
Table 5.3 shows the confusion matrix produced for evaluating the rule-based approach,
in which the positives are the metastatic patients and the negatives the non-metastatic
patients.
Table 5.3: Confusion Matrix for Rules
Recall, precision, accuracy and F-measure are calculated from the values given in the
confusion matrix. When calculating the F-measure, preference is given to recall over
precision, as recall is of higher importance in this scenario. The results are given in
Table 5.4.
Table 5.4: Results for Rules
From Table 5.4 above it can be seen that the approach performs above the threshold
required for it to be considered successful, with a recall of 83%. The problems that
might have caused the wrong classification of some of the metastatic and non-metastatic
patients are as follows:
1. A reason for non-metastatic patients being classified as metastatic is that
SentiStrength has given negative rankings to sentences with positive sentiment,
and thus some of the rules that use SentiStrength gave wrong results
2. A reason for cases of metastasis not being found is that the CliNER results are not
sufficient in isolation for metastasis to be correctly identified
5.4.2 Machine learning approach evaluation
The machine learning approach will be evaluated using all features that have been
extracted, and with some of them removed, in order to judge which version gives better
results. The versions that will be tested are:
1. First, the original version, in which all features are kept, is evaluated (the
confusion matrix for it is given in Table 5.5)
2. Secondly, a version without sentiment analysis is evaluated, in order to see whether
sentiments actually help in the classification
3. Finally, a version without the CliNER results is evaluated. The reason for removing
CliNER is that, in the rule-based approach, the rules using CliNER were observed not
to work correctly, so a classifier that does not consider these results is likely to
perform better
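The three versions above differ only in which columns of the feature matrix are passed to the classifier. The column-dropping step can be sketched as follows (the feature layout and indices are hypothetical; in the project the resulting matrices would be fed to the WEKA-based classifier):

```java
import java.util.Arrays;

// Building the ablated feature matrices for the three evaluated versions:
// all features, without the sentiment columns, and without the CliNER columns.
class FeatureAblation {

    // Return a copy of the matrix with the given column indices removed.
    // toDrop must be sorted ascending (required by Arrays.binarySearch).
    static double[][] dropColumns(double[][] matrix, int... toDrop) {
        double[][] result = new double[matrix.length][];
        for (int r = 0; r < matrix.length; r++) {
            double[] row = new double[matrix[r].length - toDrop.length];
            int k = 0;
            for (int c = 0; c < matrix[r].length; c++) {
                if (Arrays.binarySearch(toDrop, c) < 0) row[k++] = matrix[r][c];
            }
            result[r] = row;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical layout: columns 0-1 CliNER, 2-3 sentiment, 4 other features.
        double[][] all = { {1, 0, 2, -1, 5}, {0, 1, 1, -3, 2} };
        double[][] noSentiment = dropColumns(all, 2, 3); // version 2
        double[][] noCliner = dropColumns(all, 0, 1);    // version 3
        System.out.println(Arrays.deepToString(noSentiment));
        System.out.println(Arrays.deepToString(noCliner));
    }
}
```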
The measures produced for each version are given in Table 5.6. It can be seen that the
best performance of the algorithm is achieved when the CliNER results do not participate
in the classification process. The version without sentiment analysis gives slightly
higher precision; however, its recall is much lower. This result might be due to the fact
that SentiStrength is good at capturing incurable metastasis but makes mistakes on
cases of non-metastasis.
Since recall is the most important measure, as it shows whether the classifier is able
to find the metastatic patients, the best version is the one that does not take the
CliNER results into account, with a recall of 76%.
Table 5.5: Confusion Matrix for ML
Table 5.6: Results for ML
5.4.3 Comparison between approaches
The rule-based approach gives much higher recall (83%), while the machine learning
approach achieves at best 76% recall. From this it can be concluded that the
rule-based approach works better for the given case. However, a problem associated
with the rule-based approach is overfitting, and a problem with the machine learning
approach is that the data set used is very small.
6. Conclusion and Future work
6.1 Conclusion
6.1.1 Summary
The project’s aim was to develop a software that takes as an input, patient notes and
output whether a patient is with metastatic breast cancer or not. In order to accomplish
this aim, four main components were implemented: Pre-processing component, which
prepares the notes for performing Information Extraction approaches on them; Infor-
mation Extraction component which integrates various existing Information Extraction
tools for extracting relevant data from the notes; Post-processing component which pre-
pares the results produced from the previous module for classification and Classification
component which implements two classification approaches (rule-based and machine
learning-based). After evaluation has been performed the rule-based approach has been
considered better as it has a recall of 83%.
Learning and deliverable objectives were established, and they were used as a guide
for developing the project. The learning objectives were reached because we completed
the following activities:
• Background research was carried out in order to understand previous studies on this
subject and the current state of the topic. General research on Text Mining concepts
and techniques has been done.
• Clinical text mining systems and techniques relevant to the project have been
reviewed and studied. Further to that, existing Information Extraction tools that can
be integrated within the system have been investigated
• Existing classification algorithms and evaluation methods have been reviewed and
understood
The deliverable objectives were reached because with this project we are delivering the
following artefacts:
• An initial analysis of the data has been performed, establishing the main types of
data that need to be extracted
• The requirements of the system have been identified and discussed with people
from the hospital
• A system that meets the requirements has been developed; it pre-processes the
patient notes, extracts the important types of information, and classifies the
patients as metastatic or non-metastatic
• The system also satisfies the quality attributes stated in the following way:
– Portability: The system has been tested on Linux. However, as it is a Java
application, portability is inherited from the JVM, which is able to run across
most platforms. Additionally, the tools used in the creation of the system work
on Windows and Mac OS, so the system should be executable on all main
platforms.
– Reliability: The system recall is above the threshold given at the beginning
– Performance: The response time of the system for 50 notes is 3 hours, which
is less than the given threshold
– Interoperability: Since the SQLite operations used are also available in SQL
databases, the system can easily be adapted to work with the hospital database
– Reusability: The system modules can easily be expanded or replaced
6.1.2 Limitations
The main limitations of the system, and of the conduct of the project as a whole, that
need to be taken into account in future plans are:
• Lack of evaluation of Information Extraction systems
The only Information Extraction system that was evaluated is SentiStrength, the
main reason being lack of time. We have tried to ensure the correctness of the results
produced by the other systems by creating post-processing rules that remove irrelevant
results, and also during software testing, when we manually inspected the produced
results and checked whether they are relevant to the given data. However, a clearer
overview of the performance and suitability of these systems would be gained by
evaluating them.
• Lack of annotated data
Another drawback is the inability to collect more annotated notes from the nurse
in order to test the system on a larger sample. This would give us a better overview
of the performance of the system and the quality of the results produced. However,
the samples used were chosen to be the most representative of the data.
• Lack of testing on all computer platforms
Due to limited time, the software testing has been performed only on Linux. However,
the tools integrated in the system work on Windows, and the implementation language
is Java, so the system should also work on Windows.
6.2 Future work
Next stages of the research could deal with some of the limitations that were found
during the execution of the project. The main areas in which the project could be
continued are as follows:
6.2.1 Integrate application within hospital environment
Currently the system uses an SQLite database for storing the results. In future, the
system needs to be integrated to work with the hospital database and system.
Integrating the software with the hospital database would not be a challenge, since
the SQLite operations used within the software are the same as those available in
SQL.
6.2.2 Expand Information Extraction functionality
Due to time restrictions, the Information Extraction module extracts only some of the
types of information that were discussed as important with the nurse. Even though the
features extracted are enough at this stage for identifying the metastatic patients, it
would be better in future to expand the Information Extraction functionality of the
software in order to extract more types of information that can help in the
identification of the metastatic patients, such as relations between terms and notes,
more disease symptoms, and scan information. This can also improve the accuracy of the
software.
6.2.3 Improve run time
Due to the use of existing IE systems in the implementation of the Feature Extraction
component, the run time of the system is quite slow. This problem can be solved in
future by replacing the more time-consuming tools with a rule-based approach for
extracting information from the notes.
6.2.4 Extending the system functionality
Currently, the system accomplishes only the essential requirements. However, in future
the system can be extended to satisfy the desirable requirements, such as displaying
the reason for identifying a patient as metastatic and whether the patient's condition
is curable or not.
7. References
[1] Ananiadou, and McNaught, 2006. Text Mining for Biology and Biomedicine. United
States of America: Artech House.
[2] Anthony, 2015. AntConc software.
Available from: http://www.laurenceanthony.net/software.html.
[3] Appelt and Israel, 1999. Introduction to Information Extraction Technology.
Available from: http://www.dfki.de/~neumann/esslli04/reader/overview/IJCAI99.pdf.
[4] Ben-Dov and Feldman, 2010. Text Mining and Information Extraction.
In Data Mining and Knowledge Discovery Handbook. 2nd ed. Springer Science.
pp.809-835.
[5] Cancer Research UK, 2015. All cancers combined Key Stats.
[6] Cancer Research UK, Stages of cancer.
[7] Cunningham, et al, 2013. Getting More Out of Biomedical Documents with GATE’s
Full Lifecycle Open Source Text Analytics, PLOS Computational Biology.
[8] Dehghan, TERN: TEmporal expressions Recognizer and Normalizer.
Available from: http://gnode1.mib.man.ac.uk/hecta.html#tools
[9] Dehghan, TEid: TEmporal expressions IDentifier.
Available from: http://gnode1.mib.man.ac.uk/hecta.html#tools
[10] DeLone and McLean, 2003. The DeLone and McLean Model of Information Systems
Success: A Ten-Year Update. Journal of Management Information Systems, 19 (4),
pp. 9–30.
[11] Deng et al, 2014. Retrieving Attitudes: Sentiment Analysis from Clinical Narratives.
MedIR, p. 12-15.
[12] DELL, 2015. Text Mining. Available from:
http://documents.software.dell.com/Statistics/Textbook/Text-Mining#applications
[13] Feldman and Sanger, 2007. The Text Mining Handbook: Advanced Approaches in
Analysing Unstructured Data. 1st ed. United States of America: Cambridge Univer-
sity Press.
[14] Filannino, Clinical NorMA: temporal expression normaliser.
Available from: http://gnode1.mib.man.ac.uk/hecta.html#tools
[15] Friedman et al, 2004. Automated Encoding of Clinical Documents Based on Natural
Language Processing. Journal of the American Medical Informatics Association, pp.
392-402.
[16] Goryachev et al, 2008. Identification and Extraction of Family History Information
from Clinical Reports. Journal of the American Medical Informatics Association,
pp. 247-251.
[17] Han et al, 2012. Classification: Basic Concepts. In Data Mining: Concepts and
Techniques. 3rd ed. United States of America: Morgan Kaufmann. pp. 327-389.
[18] Hotho et al, 2005. A Brief Survey of Text Mining. 13 May.
[19] Impact of Information Systems. Cardiff School of Computer Science and
Informatics.
[20] Ipalakova, M., 2010. Information Extraction. MSc Manchester.
[21] Jakob Nielsen, 2000. Why You Only Need to Test with 5 Users. Nielsen Norman
Group, 19 March.
[22] Karystianis et al, 2014. Using Local Lexicalized Rules to Identify Heart Disease
Factors in Clinical Notes.
[23] Kovacevic, A., et. al, 2013. Combining rules and machine learning for extraction of
temporal expressions and events from clinical narratives. Journal of the American
Medical Informatics Association, 20 (5), pp. 859–866.
[24] Kovacevic A, 2012. HECTA - Healthcare Text Analytics @ gnTEAM.
[25] Leaman et al, 2013. DNorm: disease name normalization with pairwise learning to
rank. Bioinformatics, 29 (22).
[26] Litrot, T. et al, 2015. Using natural language processing to automatically extract
cancer outcomes data from clinical notes.
[27] Mansouri et al, 2008. Named Entity Recognition Approaches. International Journal
of Computer Science and Network Security, 8 (2).
Nadkarni et al, 2011. Natural language processing: an introduction. Journal of the
American Medical Informatics Association, 18 (5), pp. 544–551.
[28] Marafino, B., et al, 2015. Efficient and sparse feature selection for biomedical
text classification via the elastic net: Application to ICU risk stratification from
nursing notes. Biomedical Informatics.
[29] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H.,
2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11 (1).
[30] McDonald and Kelly, 2015. Value and benefits of text mining. Jisc.
[31] Microsoft,2015.Chapter 16:Quality Attributes.
Available from: https://msdn.microsoft.com/en-gb/library/ee658094.aspx
[32] Narayan Singh, 2013. Test Plan for Mobile Application Testing. [Software Testing
Concepts]
[33] Savova, G., et. al, 2010. Mayo clinical Text Analysis and Knowledge Extraction
System (cTAKES): architecture, component evaluation and applications. Journal of
the American Medical Informatics Association, 17 (5), pp. 507–513.
[34] SentiStrength. Available from: http://sentistrength.wlv.ac.uk
[35] Spasic et al, 2014. Text mining of cancer-related information: Review of current
status and future directions. International Journal of Medical Informatics, 83 (9),
pp. 605-623.
[36] Spasic et al, 2010. Medication information extraction with linguistic pattern
matching and semantic rules. Journal of the American Medical Informatics Association,
17 (5), pp. 532-535.
[37] Thelwall, M. (in press). Heart and soul: Sentiment strength detection in the social
web with SentiStrength (summary book chapter)
[38] University of Ljubljana, Data Mining - Fruitful and Fun. Available from:
http://orange.biolab.si
[39] Witte, R, 2006. Introduction to Text Mining. Universitat Karlsruhe, Germany.
[40] Xu et al, 2010. MedEx: a medication information extraction system for clinical
narratives. Journal of the American Medical Informatics Association, 17 (1).
[41] Yang et al, 2009. A Text Mining Approach to the Prediction of Disease Status
from Clinical Discharge Summaries. Journal of the American Medical Informatics
Association, 16 (4), 7 April, pp.596 - 600.
[42] Zhou, X. and Han, H., 2006. Approaches to Text Mining for Clinical Medical
Records. CiteSeer, April.
[43] Zhu, F., et al, 2013. Biomedical text mining and its applications in cancer research.
Biomedical Informatics, 46 (2), pp. 200–211.