VISUALISATION AND MANIPULATION OF STRUCTURED
EPIDEMIOLOGICAL INFORMATION
A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER
FOR THE DEGREE OF MASTER OF SCIENCE
IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES
Stefan Lesnjakovic
September 2014
Table of Contents
Abstract
Declaration
Intellectual Property Statement
Acknowledgements
1. Introduction
1.1 Aims and Objectives
1.2 Deliverables
1.3 Report Outline
2. Background and Literature Review
2.1 Medical Literature
2.2 Epidemiological Literature
2.3 Medical Information Gathering
2.4 Visualisation of Information
2.5 Related Work
2.6 Summary
3. Requirements and Design
3.1 Requirements
3.2 Software Design
3.2.1 Challenges of Designing a Website
3.2.2 Security
3.3.3 Architecture
3.3.4 Domain Diagram
3.3.5 Sequence Diagram
3.3 User Interface Design
3.4 Database
4. Implementation
4.1 Tools
4.1.1 Current state of the art Web Technology
4.1.2 Google Web Toolkit
4.1.3 Relational vs. Non-Relational Database Technologies
4.1.4 Jenny and JDBC
4.2 Core Implementations
4.3 Client-Server Interaction in GWT
4.4 Client Implementation
4.4.1 Results Table Widget
4.4.2 Advanced Search Feature
4.4.3 Details Page Widget with Curation
4.4.4 Future Client Improvements
4.5 Server Implementation
4.5.1 Database Connectors
4.5.2 Services
4.5.3 Query Services
4.5.4 Details Page Service
4.5.5 Curation Service
4.6 Possible Future Implementation Ideas
4.7 Software Testing
4.7.1 Regression Testing
4.7.2 Integration Testing
4.7.3 Security Testing
4.7.4 Unit Testing
5. Results and Evaluation
5.1 Overview of the Application
5.2 Evaluation
6. Conclusion and Future
6.1 Summary
6.2 Future Work
6.3 Concluding Remarks
7. Bibliography
8. Appendix
A. SCRUM principles
B. Full Release Plan for the Project
C. Full Backlog for the Project
D. Full Sequence Diagram with Details Request
E. Anti-Pattern GWT
F. More about Java RPC Calls in GWT
G. Testing Done as result of a Testing Plan
H. Evaluation Interview Script
I. Evaluation Interview Questionnaire with Discussion Questions
J. Evaluation Interview Highlights Table
Table of Figures
Figure 1: Number of Articles about Smoking on PubMed
Figure 2: General overview of feature extraction
Figure 3: Example clinician UI in a table
Figure 4: Sample Tag Cloud
Figure 5: Sample return of EpiTeM
Figure 6: Sample of EdVic, a tool created by Elise Hahn
Figure 7: Agile development life cycle
Figure 8: Digitalised User Story Sample I
Figure 9: Digitalised User Story Sample II
Figure 10: Architecture overview
Figure 11: Domain Class Diagram
Figure 12: Basic Search Sequence Diagram
Figure 13: Curation Sequence Diagram
Figure 14: Main Page User Interface Drawing
Figure 15: Details Page User Interface Drawing
Figure 16: Application Specific Database Design Diagram
Figure 17: The evolution of the Web in the past 15 years
Figure 18: GWT Application Structure Tree
Figure 19: General Simplified Application Class Diagram
Figure 20: Sample Test Code
Figure 21: Standard Search Feature
Figure 22: Sample Results Table
Figure 23: Details Page Feature
Figure 24: Curation Feature
Figure 25: Advanced Search Feature
Figure 26: Sample Burndown Chart
Table of Tables
Table 1: Simplified initial release plan
Table 2: Simplified Product Backlog
Table 3: Non Functional Requirements
Table 4: Comparison of Web Development Tools
Table 5: Simplified Testing Table
Table 6: Evaluation Questionnaire and Discussion Highlights
Table 7: Set Goals vs. Achievements
Words: 18,430
Abstract
Visualisation and Manipulation of Structured Epidemiological Information
Medical papers are published daily in vast numbers all over the world. In fact, the
numbers are now so high that it has become impossible for a single person to read, within
a lifetime, all the papers published in one year. It is therefore important to provide a
tool that gives easy and flexible access to the information needed. This is especially
useful in the field of Epidemiology, as it aids in the identification of patterns of
diseases and their related factors. Text mining technologies exist to utilise the existing
data; however, the platforms and environments used to access these technologies are often
crude or unsuitable for the average user.
The aim of this project is to provide such a tool, one which adapts to the ever-expanding
advancements in the field of Public Health. This was achieved using a text mining
technology developed in Manchester, which extracted relevant information from abstracts
of published papers according to six key characteristics used in Epidemiology. A web
application has been developed to enable browsing of epidemiological data according to
these six dimensions. The application also visualises some statistics of the extracted
data and provides means to manipulate the data. The background research, development and
evaluation of the application are described in this dissertation.
Author
Stefan Lesnjakovic
Supervisor
Goran Nenadic
September 2014
Declaration
No portion of the work referred to in the dissertation has been submitted in support of
an application for another degree or qualification of this or any other university or other
institute of learning.
Intellectual Property Statement
i. The author of this dissertation (including any appendices and/or schedules to
this dissertation) owns certain copyright or related rights in it (the
“Copyright”) and s/he has given The University of Manchester certain rights to
use such Copyright, including for administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright, Designs
and Patents Act 1988 (as amended) and regulations issued under it or, where
appropriate, in accordance with licensing agreements which the University has
entered into. This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trademarks and other
intellectual property (the “Intellectual Property”) and any reproductions of
copyright works in the dissertation, for example graphs and tables
(“Reproductions”), which may be described in this dissertation, may not be
owned by the author and may be owned by third parties. Such Intellectual
Property and Reproductions cannot and must not be made available for use
without the prior written permission of the owner(s) of the relevant
Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual
Property and/or Reproductions described in it may take place is available in
the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant
Dissertation restriction declarations deposited in the University Library, The
University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
Acknowledgements
I would especially like to thank my supervisor Dr Goran Nenadic, as without his consistent
guidance and presence throughout this project, it would not have been possible.
I would also like to thank Dr George Karystianis for his support throughout and for
always being there and open to questions, as without his foundational work this project
would also not have been possible.
Lastly, I would like to thank Dr Jenny Newman for her feedback and input to the project,
as it contributed positively to the progression and quality of the system.
1. Introduction
The accumulated wealth of medical literature is indispensable to the field of
Epidemiology and for keeping track of medical advancements. According to Last (1988),
Epidemiology is “the study of the distribution and determinants of health-related states
or events in specified populations, and the application of this study to the control of
health problems” [1], or in other words the study of patterns, causes and effects of
diseases in a given population. The volume of information relevant to individual diseases
keeps growing to such an extent that it has become impossible for an individual to keep
track of it. Although there is a wealth of online citation indices, there is a need for a
more sophisticated way of querying these papers in a structured manner in order to realise
their hidden potential. Text mining approaches exist which utilise these databases by
extracting data in a structured manner; however, user interfaces to these text mining
tools are often inefficient, lack the needed functionality or are missing altogether. The
aim of this project is to provide a platform which enables the user to query and
manipulate already extracted medical information and, as a result, makes it possible to
use the queried data for purposes such as decision making. The project also aims to
provide a new way of querying the data, which may lead to new discoveries in the field of
Epidemiology. To be helpful, however, the platform has to be well designed and efficient,
which is an additional challenge.
Making this kind of information publicly accessible has been a hot topic in the past
few years. For example, the Farr Institute has recently been established with a
multi-million pound investment [2] in order to develop informatics support for research on
various health data, including epidemiological information. There are also websites
which provide their own databases for querying such papers. One example is MEDLINE, the
primary component of the well-known PubMed database [3]. It holds over 21 million journal
articles regarding the life sciences, and an average of 4000 references are added every
day. This project uses a data structure developed by George Karystianis at the University
of Manchester in order to query already structured data extracted from the MEDLINE
database. This system extracts the data according to six key characteristics (or
dimensions): exposure, outcome, covariate, study design, population and effect size type,
which are commonly used in the field of Epidemiology. A closer look at these
characteristics will be taken in Chapter 2 of this dissertation.
1.1 Aims and Objectives
The aim of the project is to provide a system which enables the user to query and
manipulate already extracted medical information from the MEDLINE database. This means
making it possible to use the queried data, for example for decision making, and to
manipulate the data so that incorrectly extracted entries can be corrected.
The main objectives are as follows:
• To develop a search function and implement an extended query model utilising the six
dimensions needed for the epidemiological data.
• To develop an extended search function which enables the user to query the six
dimensions combined in any way.
• To be able to manipulate automatically extracted epidemiological data in order to
correct errors.
• To provide relevant information about retrieved data, such as statistics.
• To enable only registered users to manipulate information.
• To provide all the mentioned facilities on an online, web-based platform with high
availability.
• To provide a simple to use but informative GUI with a good user experience.
1.2 Deliverables
There are several deliverables associated with this project:
• Progress Report: provides an overview of the progress of the project.
• This Dissertation: the final report of the project.
• The Web Application: produced as part of this MSc dissertation. It is online and
can be found at: http://gnode1.mib.man.ac.uk/epidemiology/home.html
1.3 Report Outline
Chapter 2 Background and Literature Review: This chapter focuses on the background
research done. It starts by discussing the importance of medical literature and how it is
used nowadays, and gives an example using smoking as a case study. It then moves on to how
the literature is utilised in IT through text mining and how the discoveries are useful
and may aid in, for example, decision support.
Chapter 3 Requirements and Design: This chapter provides an overview of the
requirements gathering methodology as well as the actual requirements which were
identified and elaborated before the implementation could be started. It explains how
they were gathered, i.e. what questions were asked to arrive at the given results.
It also discusses the thought process behind the designs and provides visuals such as
diagrams to support these.
Chapter 4 Implementation: This chapter starts by discussing the choice of tools
and the reasoning behind it. It then elaborates on the actual implementation of
the product and presents the challenges as well as how they were overcome. Note that
this is an online platform, meaning that there are some special issues regarding security
as well as the client-server architecture used for this project. Lastly, it describes
how the software has been tested.
Chapter 5 Results and Evaluation: This chapter briefly takes the reader through the
final state of the application and explains the features implemented. It then focuses on
evaluation techniques which verify whether the application has been built correctly and
is useful.
Chapter 6 Conclusion and Future: This chapter summarises what has been done and
gives an overview of where the project stands in the end. It also summarises possible
future implementations and provides ideas on how the application may be extended in
future.
2. Background and Literature Review
This chapter elaborates upon the literature research done for the project. First, it
explains more general concepts, such as how medical literature is used nowadays and the
issues that come with it. It then moves on to more specific uses and features such as
text mining and explains how these technologies may aid in decision support and are
generally useful. Lastly, it looks at similar or related work.
2.1 Medical Literature
Even today, with all the advancements in technology, the most commonly used basis for
scientists to communicate knowledge is still textual [4], in the form of papers and
citations. As already mentioned in the introduction, vast numbers of papers are published
daily thanks to the technology and ease available nowadays [5] and, as a result, the
denominator for any specific search is growing too. The numbers of publications and the
volumes of citations have grown to such an extent that it has become impossible for one
person to keep track of it all [6]. Figure 1 shows the number of articles about smoking
published per year. Nevertheless, it is important that the number keeps growing, as this
may also lead to faster and better advancements in given fields, such as Epidemiology in
our case.
Figure 1: Number of Articles about Smoking on PubMed
[Bar chart. X-axis: Year of Publication, 1985 to 2013. Y-axis: Number of Articles Published, 0 to 14,000.]
On top of the large amounts of data, another factor contributing to the inability to
handle the data is that it is mostly unstructured [6] by nature. It mainly consists of
unlabelled free text, which can make it extremely hard to find the right items using a
simple word search. Searching for a specific word can return millions of results
nowadays, of which a majority may be unrelated to what is actually wanted. Categorisation
is often sparse and, for specific uses such as Epidemiology, simply not sufficient. Text
mining approaches are used to address this problem; they are discussed in more detail in
Section 2.3.
Several databases store this information. The most popular one is MEDLINE (Medical
Literature Analysis and Retrieval System Online), which is accessed through the online
service PubMed. MEDLINE is run by the US National Library of Medicine and stores
information about citations and abstracts in the fields of medicine, nursing, dentistry,
veterinary medicine, health care systems, and preclinical sciences [7]. It is linked to a
database called MeSH, which provides a controlled vocabulary thesaurus designed to aid
querying and to make it easier for users to obtain better results [8] when querying for
articles on the PubMed website. This is especially useful when the user is a
non-professional or does not know how to spell a certain term: any entered word is
automatically looked up in that database before querying, to ensure the best result and
minimise the number of queries that return no results.
2.2 Epidemiological Literature
Between 2000 and 4000 epidemiological references are added every day as medical
literature to the MEDLINE database. In 2013 alone, over 700,000 references were added,
and as of today there are over 23 million entries in total. From these facts it is
obvious that one-dimensional search tags are insufficient for a complex field such as
Epidemiology. Six dimensions have therefore been identified [9] which are appropriate
for the search feature used in this project. They are as follows:
• Exposure: A risk factor a person may have been exposed to, for example smoking or
stress.
• Outcome: A consequential condition or incident arising from the determinant in the
population that has been studied, for example lung cancer or hair loss.
• Covariate: A factor that may have an influence on the development or outcome
and is therefore important to that specific study. Examples would be gender or
age.
• Study Design: The type of study used on a given population. A number of established,
predefined study designs exist, for example observational studies, prevalence studies
or randomised controlled trials (e.g. involving placebos).
• Population: The group of people the study was applied to. This may be characterised
by gender, ethnicity or even nationality.
• Effect Size Type: A numerical measure of an attribute within an epidemiological
study. This means one can look for numbers within a study, such as a “risk factor”.
These six entities have been used to introduce structure and extract information [9] from
numerous abstracts of the MEDLINE database and are therefore ideal for the purposes of
this project. Figure 2 gives a general overview of how the extraction has been done.
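To make the six-dimensional structure more concrete, the following minimal Python sketch
shows one way such a record, and a query combining several dimensions, could be
represented. Only the six dimension names are taken from the text above; the class name
AbstractRecord, the functions matches and search, and the sample data are assumptions made
purely for illustration and do not describe the data structure of the actual system.

```python
from dataclasses import dataclass, field

# The six dimension names come from the text; everything else here
# (class name, field layout, sample data) is a hypothetical illustration.
DIMENSIONS = (
    "exposure", "outcome", "covariate",
    "study_design", "population", "effect_size_type",
)


@dataclass
class AbstractRecord:
    """One abstract with the concepts extracted for each dimension."""
    pmid: str
    concepts: dict = field(default_factory=dict)  # dimension -> set of terms

    def matches(self, query: dict) -> bool:
        """True if every queried dimension contains every requested term."""
        for dimension, wanted in query.items():
            if dimension not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {dimension}")
            found = self.concepts.get(dimension, set())
            if not set(wanted) <= found:
                return False
        return True


def search(records, query):
    """Return the identifiers of all records that satisfy the query."""
    return [r.pmid for r in records if r.matches(query)]


# Invented sample data, for illustration only.
records = [
    AbstractRecord("A1", {"exposure": {"smoking"}, "outcome": {"lung cancer"},
                          "covariate": {"age"}}),
    AbstractRecord("A2", {"exposure": {"stress"}, "outcome": {"hair loss"}}),
    AbstractRecord("A3", {"exposure": {"smoking"}, "outcome": {"hair loss"},
                          "covariate": {"gender"}}),
]

print(search(records, {"exposure": ["smoking"]}))
print(search(records, {"exposure": ["smoking"], "outcome": ["hair loss"]}))
```

Under these assumptions, querying for the exposure “smoking” alone returns A1 and A3,
while adding the outcome “hair loss” narrows the result to A3. Any subset of the six
dimensions can be combined in the same way.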
2.3 Medical Information Gathering
Text mining, or text data mining [10], refers to the process of extracting interesting
and non-trivial patterns or knowledge from unstructured text documents [11]. It is
estimated that about 80% of a company’s knowledge is hidden in text rather than in
structured data. The situation is similar in the field of Epidemiology, since most
advances are published in the form of papers. In biomedical science, natural language
processing has become a popular way of making sense of this natural-language data; this
is referred to as BioNLP [5]. The need to automatically extract information and introduce
a logical structure seems to be becoming gradually more important [12]. Such NLP
processes have shown promising results in the past few years when extracting key
biomedical information from relevant literature [13].
Another idea is to identify connections and similarities between articles, leading to
new discoveries that are invisible to the naked eye, so-called “undiscovered public
knowledge”. This approach is also known as Knowledge Discovery in Databases (KDD) [14].
For example, a previously unknown connection between magnesium and migraine was
discovered just by carefully looking over several papers [15]. Whereas this discovery
was made manually, KDD approaches could have led to the same result faster and more
easily. KDD is therefore appropriate for exploratory research [16]. These KDD methods are
merely an extension of current statistical ones, exploiting today’s resources of
computational power, machine learning and artificial intelligence algorithms [17].
As already mentioned, text mining may provide structure for unstructured data and
therefore make it easier to access. It does so, for example, by introducing tags [18]. These
tags are similar to the search tags commonly used on platforms such as YouTube
or Twitter. They are created either by entering such words manually or by scanning, in our
case, the abstract of a paper and extracting all keywords. A simple search for one of
these words could then return the abstract with the best match. In recent years,
collaborative tagging has become very popular [13]. Collaborative tagging means that
people may add or introduce their own tags as they go along. This feature may be useful
in the field of Epidemiology, as specialists could simply highlight and introduce their own
tags while going through a paper or patient history, thereby making the extracted
key information more useful. However, tagging in the same way as, for example, Twitter
is too one-dimensional for our needs. A more sophisticated query model is therefore
required.
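The simple keyword search described above can be sketched as follows. This is only a minimal illustration, assuming that tags are the non-trivial words of an abstract and that the "best match" is the abstract mentioning the search word most often; the stop-word list, the function names and the sample abstracts are hypothetical and not taken from the project.

```python
import re
from collections import Counter
from typing import Dict, Optional

# Hypothetical stop-word list; a real system would use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "with", "to", "for", "on"}


def extract_tags(abstract: str) -> Counter:
    """Count every non-stop-word in the abstract; each distinct word acts as a tag."""
    words = re.findall(r"[a-z]+", abstract.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def best_match(query: str, abstracts: Dict[str, str]) -> Optional[str]:
    """Return the id of the abstract that mentions the query word most often."""
    query = query.lower()
    scores = {pid: extract_tags(text)[query] for pid, text in abstracts.items()}
    best = max(scores, key=scores.get, default=None)
    # No abstract contains the word at all: report no match.
    return best if best is not None and scores[best] > 0 else None


# Invented sample data for illustration only.
abstracts = {
    "paper-1": "Obesity is associated with depression in adults.",
    "paper-2": "Magnesium levels and migraine: migraine patients showed low magnesium.",
}
print(best_match("migraine", abstracts))
```

A search of this kind only looks at one word at a time, which illustrates why it is too one-dimensional for the purposes of this project.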
Another useful way to extract information is to identify patterns within the text
and provide references to them. This is also known as indexing. It can provide
structure and open up the information contained in large spans of text. This approach,
however, uses a lot more resources, as the text first has to be scanned for all keywords
according to a pattern, using for example natural language processing algorithms. This can be
very lengthy and costly. Another downside is that the error rate might be high, as such an
algorithm could easily misinterpret special characters such as punctuation within the texts.
This is mainly due to ambiguity and the different writing styles of individuals across the world
[14].
Figure 2: General overview of feature extraction [9]
The documents, or in our case the abstracts of documents, are taken as input. The input is
pre-processed, for example by filtering out certain words or by tokenising the text so that it
is easier for the machine to read, which may lead to higher accuracy, depending on the input. Data
is then extracted according to dictionaries, machine learning algorithms or rules
[9], leading to the six concepts being used for the project. Examples of how these
extracted concepts may be used can be found in the related work section (2.5), where
several pieces of software are described which utilise these principles.
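The steps of Figure 2 can be illustrated with a small dictionary-based sketch. It is not the extraction tool referred to in this report; the dictionaries below are invented, cover only three of the six dimensions, and the function names are assumptions made for the example.

```python
import re

# Hypothetical dictionaries: dimension name -> known terms.
DICTIONARIES = {
    "study_design": {"cohort", "case-control", "cross-sectional"},
    "outcome": {"depression", "migraine"},
    "exposure": {"obesity", "smoking"},
}


def tokenise(text):
    """Pre-processing: lower-case the text and split it into (hyphenated) words."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def extract_concepts(abstract):
    """Look every token up in the dictionaries and group the hits by dimension."""
    found = {dimension: [] for dimension in DICTIONARIES}
    for token in tokenise(abstract):
        for dimension, terms in DICTIONARIES.items():
            if token in terms and token not in found[dimension]:
                found[dimension].append(token)
    return found


print(extract_concepts("A cross-sectional study of obesity and depression."))
```

A rule-based or machine learning extractor would replace the dictionary lookup, while the pre-processing step and the grouping into dimensions stay the same.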
Using the text mining approaches for processing epidemiological literature
developed at the University of Manchester by George Karystianis, the abstracts of numerous
articles have been scanned, and the results have been normalised and put into tables of six
dimensions widely used in the field of medicine, which are discussed in more detail later.
Normalisation means that the extracted data is mapped onto its descriptive attributes,
making it easier to identify the right occurrence later [9]. It is important to stress that
it is extremely hard for text mining approaches to reach 100% accuracy. Therefore there will
always be mistakes or errors propagating through the extracted data. The approach
mentioned has been thoroughly tested, though, and an accuracy of over 81% has been
reported, which suggests that it is reliable [19]. Such promising results are worth
making more accessible and maybe even improving, for example by providing error
correction in the extracted tables and making them accessible online. These features could
be supported by this project and are in fact part of its scope.
2.4 Visualisation of Information
From the previous sections it is clear that the techniques used to visualise such an amount
of information are crucial in these types of applications (CDS and structural information
representation). Therefore it is important to discuss some of the issues in this section.
Human Computer Interaction is the study of how computer technology influences
human work and activities [20]. The subject has been studied since the early 1960s;
however, it has only become a key point of computing since the emergence of the personal
computer. In the 1980s, Xerox released its so-called Star user interface,
which was the first to introduce a window-icon-menu-pointer (WIMP) interface and as a result
revolutionised the way humans interacted with computers. They also established five
golden key principles [21] which are still used today as a pointer to how to design a user
interface well:
Familiar Conceptual Models
Universal Commands
Consistency
Simplicity
User Tailor-ability (or customisability)
Although these concepts were originally applied to an operating system, they are just
as valid for standalone applications. A Familiar Conceptual Model is about giving users
what they expect to see and keeping the application coherent with familiar designs.
Consistency is about keeping all modules and messages of an application coherent. The
interface itself should never change drastically unless the user specifically wants to apply
a different design.
Especially for medical applications it is crucial that the user interface is well designed
and coherent, because human wellbeing is closely tied to the interaction with it.
It is clear that an interface like YouTube or Google will not suffice for the complexity
needed in a medical environment. Many errors that happen in
medicine have been attributed to poor interface design rather than human error on its own [22].
It is therefore important that the information is clear and messages are understandable for
the given range of users. A study conducted by Patel et al. (1998) has shown that,
especially in a medical environment, the perception of what is important varies
greatly between users [23]. As a result, not only the phrasing but also the design
and placement of the output have to differ for each user group, and
appropriate options have to be provided for each group. The user has to always be aware that
he is in control and should be able to decide when to start or stop a given procedure.
Warnings should not be designed in such a way that the user has to rely on them or that
they stop the flow of the program. It is especially crucial that human error in
interaction with the system is not fatal and that actions can easily be reverted, changed or
reconstructed [24]. Other than that, there is a large number of features that have to be
reachable within a few clicks and, as a result, a lot more elements need to be on the main
page than for other online platforms. Figure 3 below shows an example of a medical GUI
developed for a clinical status application for patients’ families. The number
of elements on it is considerably higher than on a standard platform. It may be argued that
the layout is rather clear; however, the warnings for the individual patients only seem to
appear in the second column, and it could be argued that this is not clear enough. As mentioned
above, what is important differs for each user, especially in medicine, and a larger
popup may be needed.
Figure 3: Example clinician UI in a table [25]
Depending on what the system is used for, there are several methods of displaying
outputs which are appropriate. It is important to clearly define how complex future
interaction may be in order to get the visualisations right. There are different methods
useful for different amounts of features. There is a big difference for example between
exploratory research and close research or decision support.
Tables and lists are probably the most classic and reliable way of representing
information as shown in the example above. They open a very broad spectrum of
potential usages and a high amount of information may be displayed in an ordered
manner across the screen. This kind of interaction is especially useful when the extraction
or manipulation of the information has to be particularly specific. Another major upside is
that tables are usually sortable by selecting a column header which can make browsing
the data much more convenient. The only downside may be that the information
displayed is too much and can therefore be overwhelming to the user. There are ways to
improve this situation, for example by restricting the maximal number of rows and splitting
the remaining rows off into other pages. This, however, can be rather obstructive if the page
size is set too small or too large for the resolution of the screen used.
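Sorting by a column and splitting the rows into pages, as described above, can be sketched in a few lines. The row format (a list of dictionaries), the function name and the sample data are assumptions made purely for illustration.

```python
def paginate(rows, sort_key, page, page_size=10, descending=False):
    """Sort rows by one column and return (rows of the requested page, total pages)."""
    ordered = sorted(rows, key=lambda row: row[sort_key], reverse=descending)
    start = (page - 1) * page_size                       # pages are counted from 1
    total_pages = max(1, -(-len(ordered) // page_size))  # ceiling division
    return ordered[start:start + page_size], total_pages


# Invented sample rows.
rows = [
    {"title": "A", "year": 2010},
    {"title": "B", "year": 2014},
    {"title": "C", "year": 2012},
    {"title": "D", "year": 2009},
    {"title": "E", "year": 2013},
]
first_page, total = paginate(rows, "year", page=1, page_size=2, descending=True)
print([row["title"] for row in first_page], total)
```

The page size is left as a parameter because, as noted above, a fixed value may be too small or too large depending on the screen used.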
Charts and graphs have many different forms, some more appropriate than others.
With modern software packages these can be created on the fly as information is
extracted. They have restrictions though, that the information has to be within certain
ranges and clearly defined in order to have significance. These kinds of visualisations are
usually only used for extraction and not manipulation of information.
Figure 4: Sample Tag Cloud
A type of visualisation that has emerged rather recently is the so-called word or tag
cloud [26]. Basically, it is a table of words in which the colour, order or size may indicate
varying significance. An example made in Google Web Toolkit can be seen above (Figure
4). Tag clouds are more flexible than charts and can be compiled over a more varied data
set rather efficiently. They are especially useful for exploratory research, as they can
compile the most significant information from a wide variety of input into a comparatively
small space. As a result, another major advantage of this kind of representation is that it
is appropriate for decision support [23]. This especially applies to clinicians: for
example, such a cloud could represent a set of treatment options, with the format in which
the words appear indicating the success rate of these treatment options. Selecting one
of these options could then reveal more information about the given method.
Nevertheless, tag clouds are rather unsuitable when it comes to manipulation of information,
due to the data coming from a lot of different places.
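How the size of a word may indicate its significance can be shown with a short sketch. It assumes that significance is simply the number of occurrences and that sizes are scaled linearly between a minimum and a maximum font size; both the scaling and the pixel values are illustrative choices, not those of any tool mentioned in this report.

```python
from collections import Counter


def tag_cloud_sizes(terms, min_px=12, max_px=36):
    """Map each distinct term to a font size that grows linearly with its frequency."""
    counts = Counter(terms)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = (highest - lowest) or 1  # avoid division by zero when all counts are equal
    return {
        term: round(min_px + (count - lowest) / span * (max_px - min_px))
        for term, count in counts.items()
    }


# Invented sample of extracted keywords.
terms = ["obesity"] * 3 + ["depression"] * 2 + ["smoking"]
print(tag_cloud_sizes(terms))
```

Colour or order could be derived from the same counts in the same manner.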
When it comes to evaluating the usability of such a product, ISO 9241 defines
three key principles [27] that need to be taken into consideration:
Effectiveness
Efficiency
Satisfaction
Effectiveness asks the question whether the user can achieve his goal. Efficiency is
about how quickly he can achieve it and lastly satisfaction is about how getting there
made the user feel, or in other words, the user experience. A wide variety of studies are
available which measure these factors. Evaluation frameworks exist [28] where users are
given a certain set of tasks to do and then without further instructions are exposed to a
system with which they have to complete these tasks. In a proper study there would be a
number of different versions of the system running in order to assess which one is best. If
the users are being watched and are questioned while or after they complete these tasks,
it is called an observational study. The data will be recorded and evaluated. If the scores
are above a certain threshold the system will be deemed as acceptable. More than one
variety, if multiple exist, might pass this threshold; in that case the one with the best
score in all three aspects should be chosen.
2.5 Related Work
There has been similar work done here in Manchester. One significant piece of work,
directly related to this project has been done by George Karystianis as part of his PhD
thesis. First, he extracted information from about 20,000 MEDLINE article abstracts
using a text mining tool he also developed as part of his studies, and put them into a
database, structured in the six epidemiological key characteristics he identified, described
in section 2.2 of this report, leading to almost 100,000 entries. Then, using Java, he wrote
a powerful tool called “EpiTeM” (Epidemiological Text Miner) to explore that data in a
structured manner. As input query data it can take up to six variables which represent
these characteristics.
Figure 5: Sample return of EpiTeM
The main purpose of this system is information retrieval and information extraction.
Figure 5 above shows the outcome of a search for “obesity” as exposure and
“depression” as outcome. It can be seen that it lists all the articles found, ordered by
highest significance of hits. The table at the bottom displays the data of the extracted
concepts of the five highest matches, structured into the six attributes defined. Clicking
on one of the papers will display the abstract of the given article and extracted words are
highlighted in different colours, one for each of the six attributes. Of course it also
provides a link to the full article on PubMed in case the user wants to read it. This tool is
rather powerful in the sense that it can search through large amounts of data and compile
the results in a readable manner. Data extraction is not flawless, however, and there are
errors within the data, which are also highlighted in the abstracts. Since the tool is for
retrieval only, there are no means of correcting or changing that data. It has no advanced
visuals such as graphs either, as these are not needed for such functionality. This
project aims to improve on the downsides of the pieces of software described in this section,
as one of its goals is to also implement data manipulation.
Another example is the work of a former student, Elise Hein, also here at the
University of Manchester. Her third year project was to visualise extracted structured
data online, for a more convenient browsing experience. A system was implemented
using Flask for Python, which gave a graphical representation in the form of word clouds
based on user input.
Figure 6: Sample of EdVic, a tool created by Elise Hahn
As seen from Figure 6 above, a platform has been created which allows the user to
query a pre extracted database in the form of the six variables, described earlier. Here a
search has been defined for “Exposure= watching television”, as displayed in the bottom
bar. The left shows the papers where the extracted information came from and the top
bar shows how many words of each key characteristic have been extracted from these. In
the middle a word cloud has been implemented, representing that extracted information
in the form of keywords, where the words mentioned most often in these articles appear more
prominently. The words can be clicked and added to the search as another search variable
or, if wanted, removed or even excluded from the search. A stream graph [29] can also be
dynamically created to show the significance of the displayed words over time. This
platform is excellent for a casual informative search, which could be used by a user, for
example, to brainstorm certain epidemiological concepts. It is also adequate for some
exploratory research, as associated keywords can easily be visualised in an aesthetically
pleasing way and articles can be accessed. Nevertheless there are some major
disadvantages with this representation. Firstly, it lacks detail as only words are displayed
and more information can only be given by hovering over one or opening the whole
article and there are no means of personalisation or storing such a search. Secondly,
means of manipulation are very sparse if any. Local filters can be defined in order to add
or ignore certain words to or from the search. The extracted data is taken as it is and no
means for error discovery or correction can be provided in such a format.
2.6 Summary
To sum up, this chapter has provided an overview of the literature research done so far. It
first identified the importance of medical literature at the current day and how
computing techniques can be used as an approach to overcome the overwhelming
amounts that exist and are published today. It then explained some of the techniques
used to provide some structure into this information using text mining and natural
language processing as well as explained how they relate to this project and what
features will be needed in order to utilise these efficiently. The three main functionalities
were identified as:
Querying the information – A query structure has been identified using six key
terms in the field of Epidemiology in order to search through the data.
Visualising – A method to output the queried data in order to be useful for
exploratory research as well as a passive decision support tool.
Manipulating – Being able to correct errors that are introduced by the
automatic extraction of data, as well as letting users add entries themselves.
Implementation methods for the visuals have been described and lastly, already existing
software more or less directly related to the functionalities of this project have been
given and analysed.
The next chapter will be about the requirements and the design of the application in
order to meet the goals set. This includes software engineering principles as well as the
results of actual requirements gathering with stakeholders involved. It will also provide all
the visuals that aided along the implementation of the software and elaborate upon the
choices as well as provide reason for some of the actions. This includes several different
types of diagrams including UML diagrams.
3. Requirements and Design
This chapter is about the requirements of the project and how they have been
established. It will first go into methodology of gathering as well as general Software
engineering principles and then move on to the actual established requirements. Then it
will describe the design derived from these.
The project will be implemented following agile software engineering guidelines.
The Waterfall method suggests gathering requirements and doing research only at the
beginning of the project, and setting fixed deliverables and milestones for the rest of the
project according to that research. It is not possible to change these or go back and edit the
requirements. While this may seem suitable for a project which has clear requirements, it
has to be noted that in the real world things often change. New discoveries are made
every day and people find better methods or designs for doing things, or simply just
change their mind. It is therefore important to embrace that change and be open-minded
towards it. As this is one of the core principles of agile software engineering it is more
suitable for this project.
The agile approach, however, suggests iterative and incremental development [30]. It
is divided into pre-defined time-boxed iterations of a few weeks
each. Time-boxed means that if the objectives of an iteration cannot be met in the assigned
time, the plan is restructured and objectives are changed or moved around, rather than the
iteration being extended. Agile development suggests minimising ceremony, meaning
minimising paperwork such as documentation during development and only
producing the parts that help the developer to understand the system. All other
documentation is deemed optional or can be done after implementation if wanted by the
customer. The main advantage of agile is, as seen from Figure 7, that it is always
possible to go back and change things in the requirements or design when needed, as part
of embracing change in an ever-changing environment.
Figure 7: Agile development life cycle [57]
The Agile Manifesto [31] is a set of core principles on which most practices are based.
It consists of four points:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
These points clearly support what was explained above and will be kept in mind during
the whole development process of the project. Working software will be prioritised over
documentation in order to show results as soon as possible.
There are several development methods based on the agile principles. These methods
suggest extra guidelines and artefacts in order to aid development. The one being used
for this project is called SCRUM [32]. More about SCRUM can be found in Appendix A.
A release plan for this project has been created according to these principles (Table 1).
Table 1: Simplified initial release plan
Sprint | Start | Weeks | End | Description
1 | 1.7. | 2 | 15.7. | Create the basic website with all its login and managerial features.
2 | 16.7. | 2 | 31.7. | Implement all query functions for all tables and data involved.
3 | 1.8. | 2 | 15.8. | Implement visualisations and dynamic creation of statistics.
4 | 15.8. | 1 | 22.8. | Implement alteration of data; this may include pattern detection.
The simplified release plan given above is based on the identified requirements,
which will be provided in the next section, which will also cover design and
implementation. The full version can be found in Appendix B. First, the basic website
has to be implemented, which will be done in the first sprint. It can then utilise the
database functions. It will be incrementally improved, first adding the query functions
and then more complex functionality on top of that. In the third sprint visualisation
techniques are added which will help display things such as statistics. The fourth iteration
is about implementing alteration and curation of data. As mentioned in the literature
research chapter, text mining extracts are not fully accurate and contain errors. It
has become apparent that many of these errors are similar and propagate through the data. In
this iteration a way for the user to curate these kinds of errors will be implemented. Methods
and algorithms which can help identify such patterns from one incorrect data entry,
and then find similar mistakes within the extracted data, will be taken into account.
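One simple way of finding similar mistakes from a single incorrect entry is sketched below. It assumes that an error "propagates" when the same value, ignoring case and spacing, was extracted for the same dimension in other articles; this rule, the entry format and the sample data are assumptions for illustration and not the method eventually implemented.

```python
def normalise(value):
    """Lower-case the value and collapse repeated whitespace."""
    return " ".join(value.lower().split())


def find_similar_errors(entries, wrong_entry):
    """Return all other entries with the same dimension and the same normalised value."""
    target = (wrong_entry["dimension"], normalise(wrong_entry["value"]))
    return [
        entry
        for entry in entries
        if entry is not wrong_entry
        and (entry["dimension"], normalise(entry["value"])) == target
    ]


# Invented sample of extracted entries.
entries = [
    {"id": 1, "article": "a1", "dimension": "population", "value": "Results"},
    {"id": 2, "article": "a2", "dimension": "population", "value": "results "},
    {"id": 3, "article": "a3", "dimension": "outcome", "value": "results"},
    {"id": 4, "article": "a4", "dimension": "population", "value": "adults"},
]
print([entry["id"] for entry in find_similar_errors(entries, entries[0])])
```

The curator could then be offered the list of candidates and decide which of them to correct in one step.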
3.1 Requirements
The initial requirements have been gathered in the form of user stories, a full list of which
can be found in Appendix C as a backlog. In this type of user interaction the user is
questioned with the intent of eventually formulating each requirement as one
sentence which is understandable to both the stakeholder and the developer.
These “stories” are then written down onto cards, which can easily be flicked through or
thoughts can be added. The general format is as follows: “As a ‘Person X’ I want to ‘Action
Y’ so I can ‘Goal Z’”. The back of the card usually holds more information useful to the
developer, associated with this story. This is a format widely used in industry as part of
agile software engineering practices as suggested by Ambler et al. (2012) [33].
Figure 8: Digitalised User Story Sample I
There are two sample user story cards from this project shown in Figures 8 and 9. The
first card shows the story which helped identify the fifth entry in the following backlog.
The front of the card shows what the developer has deduced from the interview with the
customer as well as the priority of this function. The back of the card, shown on the right
is reserved for important notes which may be implied by that feature. The second
example shows the feature identified as number eleven in the backlog below. It is about
an advanced query model, where the users can build their own queries consisting of any
combination of the six previously identified query models. The back of the card lists
a number of dependencies as well as the availability of that feature. More on the realisation
of this feature will follow in the design and implementation parts of this dissertation.
The stories have then been compiled into a product backlog, and assigned with
priorities. The SCRUM convention also suggests assigning an estimated value to each
identified function according to how complex the task of implementing this function is.
The following table (Table 2) shows a simplified version of this product backlog.
Figure 9: Digitalised User Story Sample II
ID | As a/an | I want to | Done criteria
1 | user | view the website created | Made a website that contains all the features a standard website has: Intro page, tutorial page, about page, etc.
2 | user | login to access curation feature | A login function that disables the curation of data unless authorised
3 | admin | manage user access on the website | An admin account that decide to restrict access to certain/all users
4 | user | be able to access epidemiological data | Implemented database access
5 | user | query epidemiological data by Exposure | Query for given type of epidemiological dimension is possible
6 | user | query epidemiological data by Outcome | Query for given type of epidemiological dimension is possible
7 | user | query epidemiological data by Covariate | Query for given type of epidemiological dimension is possible
8 | user | query epidemiological data by Study Design | Query for given type of epidemiological dimension is possible
9 | user | query epidemiological data by Population | Query for given type of epidemiological dimension is possible
10 | user | query epidemiological data by Effect Size Type | Query for given type of epidemiological dimension is possible
11 | user | be able to query for epidemiological data using any combination of the mentioned above categories | Query for any combination possible
12 | admin | insert new highlighted key words for a study | Logged alteration of data is possible
13 | admin | alter a highlighted key word in a study | Logged alteration of data is possible
14 | admin | delete a highlighted key word in a study | Logged alteration of data is possible
15 | admin | identify commonly extracted keywords | Auto suggestion upon alteration of data
Table 2: Simplified Product Backlog
Looking at the table it is obvious that basic features have to be realised first, in order
to be able to base more sophisticated and complex ones upon them. The first three
features in this backlog are about creating the website and adding basic administrative
and interactive features. Once these have been established, more complex features such as
interacting with a database and querying can be implemented. Features 5 to 10
inclusive are about implementing the query functions for the six dimensions identified
earlier. The eleventh feature combines the query methods into one dynamically created
query where the user can combine any of the six dimensions at his convenience. From
the interviews it was established that there was a database already in place; however, in
order to be able to utilise the data more closely or even alter it, it is clear from the
requirements that a custom database needs to be created to act as a kind of mirror
image of the pre-existing one, as well as to enhance it and extend the functionality and data
stored.
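The dynamically created query of feature 11 can be sketched as follows. The sketch assumes a single hypothetical table called extracts with one text column per dimension, which is a simplification and not the schema of the actual database; values are passed as bound parameters so that user input never becomes part of the SQL text.

```python
import sqlite3

# The six dimensions from the backlog; column and table names are hypothetical.
DIMENSIONS = ("exposure", "outcome", "covariate",
              "study_design", "population", "effect_size_type")


def build_query(criteria):
    """Build a parameterised SELECT for any combination of the six dimensions."""
    clauses, params = [], []
    for dimension, value in criteria.items():
        if dimension not in DIMENSIONS:  # only whitelisted column names reach the SQL
            raise ValueError(f"unknown dimension: {dimension}")
        clauses.append(f"{dimension} LIKE ?")
        params.append(f"%{value}%")
    where = " AND ".join(clauses) or "1=1"  # no criteria: return everything
    return f"SELECT article_id FROM extracts WHERE {where} ORDER BY article_id", params


def run_query(conn, criteria):
    sql, params = build_query(criteria)
    return [row[0] for row in conn.execute(sql, params)]


# In-memory demo database with invented rows.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{d} TEXT" for d in DIMENSIONS)
conn.execute(f"CREATE TABLE extracts (article_id INTEGER, {columns})")
conn.executemany(
    "INSERT INTO extracts VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "obesity", "depression", "age", "cohort", "adults", "odds ratio"),
        (2, "smoking", "migraine", "sex", "case-control", "women", "relative risk"),
    ],
)
print(run_query(conn, {"exposure": "obesity", "outcome": "depression"}))
```

Because each dimension simply adds one more condition, the single-dimension queries of features 5 to 10 are special cases of the same function.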
The last three identified requirements are about the curation of one or more entries
within the database. As already mentioned, text mining
algorithms do not always work flawlessly and errors made during extraction may
propagate through the database. Therefore these mistakes need to be corrected as they
are encountered. The entries in the database, however, are normalised, meaning that
they have been identified as belonging to certain categories or groups of words. Another
restriction is that they correspond to the text within the extracted abstract. As a result it
is important to restrict the curation process accordingly: if the newly entered value
varies too much from the abstract or even from the old
extracted value, the curation is denied. For such cases a feature to entirely delete and/or
add a newly identified extract is needed as well.
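A possible form of this restriction is sketched below. It is an assumed rule, not the one used in the project: a new value is accepted if it occurs verbatim in the abstract, or if it is sufficiently similar to the old extracted value according to the similarity ratio of Python's difflib; the threshold of 0.6 is an arbitrary illustrative choice.

```python
from difflib import SequenceMatcher


def curation_allowed(new_value, old_value, abstract, threshold=0.6):
    """Accept a correction only if it stays close to the abstract or the old value."""
    new_norm = new_value.lower().strip()
    if not new_norm:
        return False
    if new_norm in abstract.lower():  # the new value is taken from the abstract itself
        return True
    similarity = SequenceMatcher(None, new_norm, old_value.lower().strip()).ratio()
    return similarity >= threshold    # otherwise it must resemble the old extract


abstract = "Children watching television daily showed higher rates of obesity."
print(curation_allowed("watching television", "watching", abstract))
print(curation_allowed("smoking", "obesity", abstract))
```

A value that fails this check would have to be entered through the separate delete and add features instead of being edited in place.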
Note that the table also covers the functional requirements.
Nevertheless, non-functional requirements are still to be identified. While doing so, it has
to be taken into account that this project is intended to be web-based, and therefore has a
slightly wider set of non-functional requirements than offline projects. The most
significant ones are listed in the following table (Table 3):
ID | Requirement | Description | Priority
1 | Manual | Clear instructions | high
2 | Ease of use | Simple and effective UI | high
3 | Security | Privacy | high
4 | Performance | Searches and general | medium
5 | User access levels | More than 2 access levels | medium
6 | Customisability | Of interface | low
7 | Language support | Changing languages | low
Table 3: Non Functional Requirements
The first two non-functional requirements are closely related, as they both have to do
with the usability of the product. Instructions need to be clear and the application easy to
use; however, there still needs to be a good explanation of what is going on, for example
to explain the six dimensions. Since the application is online, security is important. That
goes for all applications which may require a password. Since it is known that the search
databases tend to be rather large, performance has only been assigned a medium
priority. User access levels are also important, as data may be altered. It is necessary to
ensure that only registered users are allowed to perform alterations on the data.
Another example of a potential security hole is the interaction with the database. This
must not be done directly from the client application, as access credentials might be read
out on the client side, which may give an attacker master access to all entries in that
database. How this issue may be resolved will be discussed in the design part of the dissertation.
Although customisability and language support play a major role in the usability of the
application, they have been assigned low priorities, as most features are assumed to be
straightforward.
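The idea that access levels are checked on the server, so that the client never holds database credentials, can be sketched as follows. The three role names, the session dictionary and the in-memory store are hypothetical stand-ins chosen for the example.

```python
# Hypothetical access levels; Table 3 only requires that there are more than two.
ROLE_LEVELS = {"guest": 0, "user": 1, "admin": 2}


class PermissionDenied(Exception):
    """Raised when a session lacks the access level required for an action."""


def require_role(session, minimum):
    """Check the role stored in the server-side session against the required level."""
    role = session.get("role", "guest")
    if ROLE_LEVELS.get(role, 0) < ROLE_LEVELS[minimum]:
        raise PermissionDenied(f"{role!r} may not perform an action requiring {minimum!r}")


def handle_curation_request(session, entry_id, new_value, store):
    """Server-side handler: the client only sends the request, never credentials."""
    require_role(session, "admin")
    store[entry_id] = new_value  # stands in for the real database update
    return "updated"


store = {7: "obesety"}
print(handle_curation_request({"role": "admin"}, 7, "obesity", store))
```

Since the check and the database access both live on the server, reading the client code reveals neither the credentials nor a way around the access levels.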
3.2 Software Design
Here the gathered requirements from the previous section will be put into practice in
order to create a design for the software that is going to be built. It is important to
include a wide set of factors when doing this, in order to make sure that not only the
right software is built but that the software is built right. This was done by sticking to the
so-called GRASP principles as suggested by Larman (2005) as closely as possible. These
principles dictate ways of assigning responsibilities to classes, which is a key skill in
object-oriented programming. It has been shown that even strong programmers often lack this
skill [34], which leads to a weak design.
3.2.1 Challenges of Designing a Website
Firstly, the challenges of building a website need to be taken into account, as there are some
factors that do not exist for standalone software. One of the most important ones is
security. A website always has to be secure, even when it does not store personal
information, as it has to be safe against session hijacking attacks, which might
re-route users to malicious websites without them being aware of it.
A login is needed to protect higher computational features of the website as well as
personal information that may be entered by users. Sophisticated login algorithms exist
which provide the website with secure handshakes between user and server. It was
debated whether to implement the login within the GWT framework, as there are more
security threats to Java and its associated web technologies [35] than there are to, say,
PHP. Nevertheless, a suitable library has been found [36] and put in place, utilising jBcrypt
for Java, a password hashing library which is deemed safe. Apart from that, the application
will not store any other sensitive data in unencrypted form, and this approach has been
taken so far.
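The principle of storing only a salted one-way hash of a password can be illustrated with a short sketch. Note that it does not use bcrypt, which the project relies on through jBcrypt; it substitutes PBKDF2 from Python's standard library as a stand-in, and the iteration count and function names are illustrative assumptions.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest); only these are stored, never the raw password."""
    salt = salt or os.urandom(16)  # a fresh random salt for every new password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected, iterations=100_000):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

As with bcrypt, the stored digest cannot be turned back into the password, so a leaked user table does not directly expose the login of the administrator.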
Throughout the development of the application, however, it became apparent that the login
itself would only help to protect one feature, which is the curation and alteration of data.
It has been established that this feature should in fact not be accessible to anyone
who signs up, as it is rather powerful and could easily render the whole database
unusable with very little effort. This feature should only be accessible to the
administrator himself, or other people within the domain. Therefore the login itself was
designed a lot simpler than first intended and is only really needed when accessing this
feature.
Another challenge of building a public website is that the design of the user interface
has to be suited to the general public. Especially when the application itself is very
domain specific, such as this one, which is focused on Epidemiology, it can contain
complex features and terminology that might not be easily understandable to an ordinary
user who is not involved in the domain. It is therefore important to design the
application appropriately, in order to be manageable by a wide set of users from different
backgrounds. This can be done by either adding manuals or enough help descriptions, or
by making the design itself more obvious so that it appeals to a broad spectrum of users.
In this particular case one way would be, for example, by implementing a standard search
which just queries one of the six given search dimensions to make it look closer to a
standard search like it is implemented on common websites nowadays.
3.2.2 Security
As already mentioned, security is an important factor, especially when designing a
healthcare web application. As it will be designed using a client-server architecture which
is accessible from anywhere in the world, it is prone to attacks. It would be critical, not
only to the application itself but to the whole institution, to unintentionally leave parts of the
application exposed. There are several measures which can be taken in order to design
the system more securely. Firstly, it is important not to give access to any feature, or even
to the page which holds the feature, to any user who is not authorised to use that feature.
As a result potential attackers are not provided with a template for a so-called replay
attack, where such an attacker could reverse engineer an authorised request and send
that to the server.
It should also be mentioned that the programmer has to carefully select the platform and
architecture on which the application is developed and run, and be aware of its
risks and exposures. This is critical, as no platform can be assumed flawless or
impenetrable, even if it has been in use for a long time. A current example of this fact
is the recently found “Heartbleed Bug” in SSL encryption [37], which had been
present for years but went unnoticed until recently. SSL is commonly used for “secure”
communications on the web, for example for emails or instant messaging, as well as for
connecting to Virtual Private Networks (VPN). This weakness was found in a software
library used to implement SSL. It made it possible to read out information
which was meant to be protected by SSL, without leaving any kind of trace on the
server itself.
The application should be designed keeping such issues in mind. Leaking critical
information can be avoided by not sending it in the first place. This can be achieved by
hashing the passwords sent for authentication using a one-way function, meaning that the
raw password cannot in any way be reverse engineered from its hashed state. It is therefore
sensible to implement such hashing on the client side, so that none of the
communication channels are ever exposed to unencrypted or un-hashed critical
information.
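The one-way property described above can be illustrated with a minimal sketch. It uses PBKDF2 from the standard Java library in place of the jBCrypt library mentioned earlier, purely so that it runs without extra dependencies. The class name, the iteration count and the key length are assumptions made for illustration and are not taken from the application. It is plain JVM code and would need an equivalent that compiles to JavaScript before it could run in a GWT client.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.security.spec.InvalidKeySpecException;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

/** Hypothetical helper: salted one-way hashing of a password (PBKDF2 stand-in for jBCrypt). */
final class PasswordHasher {
    private static final int ITERATIONS = 10000; // assumed work factor
    private static final int KEY_BITS = 256;     // assumed hash length

    /** Creates a random salt so that equal passwords do not produce equal hashes. */
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    /** Derives the hash; the raw password cannot be recovered from the result. */
    static String hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            return Base64.getEncoder().encodeToString(factory.generateSecret(spec).getEncoded());
        } catch (NoSuchAlgorithmException | InvalidKeySpecException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Re-hashes the candidate and compares it with the stored hash in constant time. */
    static boolean verify(char[] candidate, byte[] salt, String storedHash) {
        byte[] a = hash(candidate, salt).getBytes(StandardCharsets.UTF_8);
        byte[] b = storedHash.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(a, b);
    }
}
```

Only the salt and the resulting hash would need to be stored; verification repeats the derivation and compares the two hashes, so the raw password is never kept.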
3.3.3 Architecture
A diagram to depict the general architecture of the application has been designed in
order to give a map of what is going on within. It gives a rough overview of all major parts
of the program and how most of them interconnect. It has been used as a guideline on
how to implement the system and generally provides a useful simplified overview.
Figure 10 below clearly indicates a split of the application into three tiers. The top tier consists of
the databases external to the system, which will be interacted with. Note that there are
three of them, which adds a rather high level of complexity, since some operations
may need to interact with more than one database at the same time. The
first database, labelled as “Database of Extracted MEDLINE Concepts”, contains all the
pre-extracted data from MEDLINE abstracts as a result of a text mining algorithm. It stores
them in several tables, one for each dimension. The second database, labelled as
“Database of MEDLINE Titles and Abstracts”, holds information about the abstracts and
data itself, which has been used for text mining. After a query is completed, more
detailed information about the results will have to be taken from this database. The third
and last database is labelled as “Application Database”. It holds information specific to
the application itself. This may include login data for higher level access, logs for the
curation of data to keep track of what has been changed, as well as copies of the first
database or general indexing which is used to speed up queries.
The second tier and central part of the system is the application server itself. Each
large box here can be translated into a package which may be implemented. In the top part
of the server the database connectors can be seen, which will be used to access and query
the databases involved. Note that there may be more than one connector for each
database: for one request only a single column may be needed, while for another a large
proportion of columns may need to be selected. There are different ways of retrieving
these entries efficiently, depending on the request. In the bottom half of the server an
example of the services that are provided to the client can be seen. The query service
itself is connected to the database connectors and knows how to use them in order to
retrieve the needed data. The server also knows how to pack the retrieved result data
into given data types which can easily be accessed and identified by the client. These data
types are stored in a model package. Note that there should only be one implementation
of the data types within the model package in order to ensure consistency between client
and server, hence the dashed line. This can easily be achieved by establishing the data
types upon deployment on both sides. The last two services in the diagram are the login
and the curation service, which should only be used in combination. It should be noted
that both of these need some kind of database access; however, the connections are left
out of the diagram for simplicity's sake.
Figure 10: Architecture overview
The third tier in the diagram is the client application. It will connect and talk to the
server via an established internet connection. It has to implement the necessary client
side service equivalents in order to be able to access the right implementations on the
server. Therefore the query methods have to be implemented on the client so it can send
the right query information to the server. It also has knowledge of the models used and is
able to pack the queries into the right format for the server to read, as well as make
proper use of the responses sent by the server. The client also has user interface classes,
which hold information about interaction functionality with the user, for the parts which
need direct server interaction, i.e. tables and forms. This client application is intended to
run on top of some kind of HTML implementation so it can be read out by any common
browser. More about the tools which will be used to achieve this can be found in the
fourth chapter of this dissertation.
3.3.4 Domain Diagram
A simple domain diagram was created in order to make it clear what was needed and
bring the idea closer to an implementation.
Figure 11: Domain Class Diagram
Looking at the diagram above (Figure 11) it can be seen that classes have been circled
and grouped into a certain pattern in order to highlight which part of the application they
belong to. Starting from the bottom, the part of the application which belongs to the
client has been outlined. The Model-View-Controller separation within the classes
defined is clearly visible. The user interface provides the views, which are connected to a
controller that determines which parts of the implementation are accessed. In the
requirements section it has been established that a certain number of query methods are
needed for the application: one for each of the six epidemiological dimensions, as well as
another one, which is able to combine any of these dimensions. A strategy pattern [34],
as suggested by Larman (2006), has been chosen to implement the different parts of the
search required. An interface (Search Strategy) defines which methods need to be
implemented for each of the search strategies. The main difference between the
different strategies is how the query for the database itself is assembled. Therefore
this method needs to be re-implemented by every strategy
implementing this interface. Here “Basic Search” is the class that handles a query where one
or more of the six dimensions can be queried for, whereas “Advanced Search” is able to
dynamically handle any number of entries for any of the six dimensions. How the query is
assembled in detail will be discussed in the Implementation chapter of this dissertation.
The next class in the client section is the “Result Formatter”, which is responsible for putting the
results or responses from the server into a readable format, such as a table or other
visualisations. It then passes them on to the user interface through the controller. The
“Curator” is responsible for handling the client-side curation requests that a user might
make.
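The strategy pattern described above can be sketched as follows. This is a minimal illustration rather than the application's actual code. The method name, the textual query format and the way terms are combined (AND between dimensions, OR between alternatives within a dimension) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical interface: every strategy assembles the query in its own way. */
interface SearchStrategy {
    String assembleQuery(Map<String, List<String>> termsByDimension);
}

/** Basic search: uses at most one term per dimension. */
final class BasicSearch implements SearchStrategy {
    @Override
    public String assembleQuery(Map<String, List<String>> termsByDimension) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : termsByDimension.entrySet()) {
            if (!e.getValue().isEmpty()) {
                parts.add(e.getKey() + "=" + e.getValue().get(0));
            }
        }
        return String.join(" AND ", parts);
    }
}

/** Advanced search: accepts any number of terms for any dimension. */
final class AdvancedSearch implements SearchStrategy {
    @Override
    public String assembleQuery(Map<String, List<String>> termsByDimension) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : termsByDimension.entrySet()) {
            List<String> alternatives = new ArrayList<>();
            for (String term : e.getValue()) {
                alternatives.add(e.getKey() + "=" + term);
            }
            if (!alternatives.isEmpty()) {
                parts.add("(" + String.join(" OR ", alternatives) + ")");
            }
        }
        return String.join(" AND ", parts);
    }
}

/** The controller depends only on the interface, so strategies can be swapped at run time. */
final class SearchController {
    private SearchStrategy strategy;

    SearchController(SearchStrategy strategy) {
        this.strategy = strategy;
    }

    void setStrategy(SearchStrategy strategy) {
        this.strategy = strategy;
    }

    String buildQuery(Map<String, List<String>> terms) {
        return strategy.assembleQuery(terms);
    }
}
```

Because the controller holds only a reference to the interface, a further search type could be added later as one more implementing class without changing the controller.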
The Model defines the data models sent between server and client and other models
which may be used within the application. A query is assembled by the different query
strategies defined and then sent to the server. The server would then need to
disassemble that query and reply. The responses are formed using a Result model, which
contains whatever a query may have produced. The result data is then assembled on the
client side in whichever way it may be needed. The third model depicted holds the
information for a curation request. This would hold information such as unique IDs, the
old word and the new one which will be replacing it, and whether the user wants to
apply any special algorithm to find entries similar to the one being
replaced.
The top of the diagram shows the main parts of the server, such as the “Query
Service”, which is responsible for forwarding a query to the right databases if needed. This
will be done by a query engine which it implements. It should forward its request to the
right database connectors, which are not shown in this diagram, since it is a domain diagram,
whose classes exist in a purely conceptual way. The next part is the “Results Parser”, which
translates the response from the database into something that is simpler to handle and
access from within the system. The last piece is the “Curation Service”, which handles
curation requests from the client. It implements an algorithm to identify entries similar to
the word or entity which is specified to be wrong and therefore needs to be changed
within the database. Note that this algorithm is meant to be optionally pluggable, and if
not enabled, only one entry may be changed.
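The optionally pluggable detection algorithm can be sketched as below. All names are hypothetical, the in-memory list merely stands in for one dimension table, and the case-insensitive comparison is only a placeholder for whatever similarity measure is actually plugged in.

```java
import java.util.List;

/** Hypothetical model of a curation request sent from the client to the server. */
final class CurationRequest {
    final int entryId;         // position of the selected entry (stand-in for a unique ID)
    final String oldWord;      // the extracted word considered wrong
    final String newWord;      // the replacement entered by the user
    final boolean findSimilar; // whether the detection algorithm should be applied

    CurationRequest(int entryId, String oldWord, String newWord, boolean findSimilar) {
        this.entryId = entryId;
        this.oldWord = oldWord;
        this.newWord = newWord;
        this.findSimilar = findSimilar;
    }
}

/** Pluggable detection algorithm. */
interface SimilarityDetector {
    boolean isSimilar(String target, String candidate);
}

/** Placeholder algorithm: ignores case and surrounding whitespace. */
final class CaseInsensitiveDetector implements SimilarityDetector {
    @Override
    public boolean isSimilar(String target, String candidate) {
        return target.trim().equalsIgnoreCase(candidate.trim());
    }
}

final class CurationService {
    private final SimilarityDetector detector; // null means no algorithm is plugged in

    CurationService(SimilarityDetector detector) {
        this.detector = detector;
    }

    /** Applies the request to the entries and returns how many of them were changed. */
    int apply(CurationRequest request, List<String> entries) {
        int changed = 0;
        for (int i = 0; i < entries.size(); i++) {
            boolean selected = (i == request.entryId);
            boolean similar = request.findSimilar && detector != null
                    && detector.isSimilar(request.oldWord, entries.get(i));
            if (selected || similar) {
                entries.set(i, request.newWord);
                changed++;
            }
        }
        return changed;
    }
}
```

With no detector supplied, or with `findSimilar` set to false, only the selected entry is replaced, which corresponds to the single-entry behaviour described above.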
It needs to be mentioned that user-specific curations have also been considered while
planning and designing; however, they are rather complex and would require too
much time to be properly realised, and have therefore been left as a
future improvement. The idea behind user-specific curations is to have some kind of
database entry for each registered user recording the curations they have made. These curations would
only be visible to that individual user. However, such curations may take up an exorbitant
amount of space, and storing them in a separate table would make the query process a lot
more complicated, since the process would now be different for each user.
Lastly, note that login functionality is not included in the diagram to keep it simple;
however, it should be mentioned that it is designed only for pre-registered
administrators, since the curation function itself, which is the main reason a
login is needed, can cause errors in large parts of the database and render it unusable if misused.
The database can always be rolled back; however, that would cause all the previously
applied curations to be lost if they are not backed up separately. It is therefore important to
only enable this feature within a trusted circle and to prompt for a password when this
feature is to be used. It may be extended in future so that only the detection algorithm
is restricted to administrators, while single curations of one entry can still be done by
automatically registered users. Individual registration will not be implemented as such;
however, a basic form of authentication which can later be extended is planned and
part of the domain. Sessions will be implemented; however, they expire after one
curation in order to prevent accidental curations.
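A session that expires after a single curation can be sketched as a one-time token. The class and method names are hypothetical, and the password check that would precede `open()` is left out.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Hypothetical registry of single-use curation sessions. */
final class CurationSessions {
    private final Set<String> active = new HashSet<>();

    /** Called after a successful password check; returns a fresh session token. */
    String open() {
        String token = UUID.randomUUID().toString();
        active.add(token);
        return token;
    }

    /** Consumes the token: returns true exactly once per opened session. */
    boolean consume(String token) {
        return active.remove(token);
    }
}
```

The server would call `consume` before executing a curation, so a second curation with the same token is rejected and the password has to be entered again.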
3.3.5 Sequence Diagram
Sequence diagrams have been created in order to make the most complex parts of the
application clearer and to depict how parts of the application are intended to interact
once assembled. These diagrams are especially useful since they show the intention of
the internal workings of the program. Two of these diagrams have been provided here.
The first diagram (Figure 12) shows how a standard query is done in the system.
Figure 12: Basic Search Sequence Diagram
First the user types what he wants to search for into a query field. The controller
forwards this information to the right function within the client, in this case the “Basic
Search” function. A Query object is compiled and sent to the server. On the server side
the query is disassembled and forwarded to the actual database(s) needed. The
connections between server and database as well as client are closed and the user is
notified that the query has been submitted, for example by displaying a loading screen
until the results find their way back to the client-side interface. First the raw result
information is sent back to the server, which compiles a Result object and sends it back to the
client. On the client side the results are put into the right format for display,
e.g. a table or diagram, depending on what has been requested. This is forwarded to the
controller and interface and the display is updated for the user to see.
The next sequence diagram below (Figure 13) shows how a curation is done within
the system. Note that in order to do curations, the details of a found article have to be
requested first. This is not shown in this sequence diagram, however, a full version can be
found in the Appendix D. After requesting the details, the abstract and some statistics will
be displayed and the user can see which words within the abstract belong to which
search dimension. If one of the extracted words seems erroneous, the user can choose to
change this word. A prompt will appear to enter what the extracted word should be, or an
option to remove this extraction altogether. After the details are entered the request is
sent to the Curator on the client side, which compiles a Curation Request object and
sends it to the server. While entering the details of the curation, the user will
have been asked whether he or she wants to replace all occurrences of that word within that
search dimension in the table, for example in the form of a check box. If this was ticked, an
algorithm will be run which will identify similar entries within the database. The Curation
Service on the server will then compile and send a request to update a log, where all
curations are kept, as well as execute the given curation on the database. Once this was
successful, a confirmation will be sent to the user. The user can then verify whether the
curation was successful by refreshing the details of the given found article.
Figure 13: Curation Sequence Diagram
3.3 User Interface Design
Note that before designing the user interface, tools have been researched and taken into
account when creating the design, in order to see what is viable within the researched
technologies that were up for selection and what is not. More about the decision about
which tools have been used can be found in the next chapter. The available interface
elements for the selected technology have been looked at and arranged accordingly in
initial design drawings. This section will quickly explain two of these concept drawings
done.
Figure 14: Main Page User Interface Drawing
The first drawing above (Figure 14) shows the main design of the website.
Note that three of the five points elaborated in Section 2.4 of this dissertation have
especially been taken into account when designing the website. These were: Familiar
Conceptual Models, Consistency and Simplicity. Firstly, the features of the website had to
be easy to find for the common user. This is done by putting a navigation bar on top,
which is a familiar concept adopted by most websites nowadays. The buttons in this bar
will direct to the implemented search functions or help pages as well as a welcome page.
Above the menu bar the logo and the title of the website are located. The basic search
feature is also shown. One of the dimensions will be displayed as standard, and can
optionally be expanded to show the fields for all six of them. The middle part of the
drawing shows a table that holds the results of a search. Note that there is a division into
pages at the bottom of the table, so that not all the results are displayed at once and the
user can browse the results more easily. The part to the right of the table is reserved for
certain statistics to be displayed, which may be useful to the user. The bottom of the
pages depicts a concept feature, where the user might do more than one search in
parallel and can switch between the results shown. It is intended that the bar also
displays whether a query has completed or is still running. This feature is theoretically
supported by the technology chosen. The bottom of the page will give some general
information about the creator, the institution as well as server status of the database
used. Also note that there is padding to the left and the right of that drawing in the
browser, as this makes the website look spaced out better and less cluttered.
An entry of the results table can be selected, which will open a pop-up holding
details about the found paper. The design of this pop-up can be seen in the drawing below
(Figure 15). The top shows the title of the paper and provides a link to it on
PubMed. Collaborators of the paper, as well as the abstract itself, will be displayed. The
extracted key words within the abstract can be highlighted by selecting the button of a
dimension on the right. Below the abstract there is some space for statistics about this
specific paper. The button for curation is located at the bottom. A help button is
provided which displays a manual for the pop-up. Lastly, the close button on the
bottom right will take the user back to the main page containing the results table.
Figure 15: Details Page User Interface Drawing
3.4 Database
Figure 16: Application Specific Database Design Diagram
The diagram above (Figure 16) shows the database created specifically for the
application. It can be seen that one table has been created for each of the main search
dimensions: covariate, effect size, exposure, outcome, population and study design. Each
entry in one of the tables is linked to the abstract from which it has been extracted
via the so-called “PMID” column. It is a unique identifier for each abstract and links
to the PMID table, where it is a primary key. This table also holds the year each abstract
is from. Note that in the six tables for the dimensions there may be multiple entries with
the same PMID, as more than one word belonging to the same category may have been
extracted from that abstract. The highlights table simply stores the character ranges
to highlight within an abstract, together with the dimension to which the marked word
belongs. This is useful for the highlighting feature on the details pop-up discussed earlier,
as it facilitates highlighting the extracted words by providing a sort of mapping, as
well as putting them into categories, in case the dimensions need to be highlighted in
different colours.
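How the stored character ranges could be applied to an abstract is sketched below. The field names, the half-open offsets and the bracket markers are assumptions for illustration. In the client, the markers would more likely be replaced by HTML elements styled per dimension.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical row of the highlights table: a character range and its dimension. */
final class Highlight {
    final int start;        // inclusive character offset within the abstract
    final int end;          // exclusive character offset
    final String dimension; // e.g. "exposure" or "outcome"

    Highlight(int start, int end, String dimension) {
        this.start = start;
        this.end = end;
        this.dimension = dimension;
    }
}

final class Highlighter {
    /** Wraps every range in [dimension]...[/dimension] markers; ranges must not overlap. */
    static String mark(String abstractText, List<Highlight> highlights) {
        List<Highlight> sorted = new ArrayList<>(highlights);
        sorted.sort(Comparator.comparingInt((Highlight h) -> h.start));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Highlight h : sorted) {
            out.append(abstractText, cursor, h.start);   // unmarked text before the range
            out.append('[').append(h.dimension).append(']');
            out.append(abstractText, h.start, h.end);    // the extracted word itself
            out.append("[/").append(h.dimension).append(']');
            cursor = h.end;
        }
        out.append(abstractText.substring(cursor));      // remaining unmarked text
        return out.toString();
    }
}
```

Because the dimension is stored with each range, the same mapping can drive a different colour per dimension without searching the abstract text again.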
There are two more tables. The first one holds information about which entries in
the six tables that hold the extracted information have been changed by user curations.
The old as well as the new values are stored. It is important to keep track of these
alterations, as they may have been made by mistake and these entries may need to be
rolled back. Note that this will also affect the highlights; therefore the values in there
have to be changed back to the old values as well on recovery. The user who performed the
curation is also kept track of. This keeps the design extensible, as
more users may be added as the system expands. The last table keeps track of the user
data itself. This includes entries such as user names, password hashes and email
addresses. The user names are used as a foreign key for the curation table.
There is potential to index these tables and as a result speed up querying. In a
technology, such as SQL however, for proper indexing to take place it is a requirement
that each entry in one of the columns holds a unique value, so that a tree like structure
can be built from the data internally [38]. This is true for the PMID table as these are
stored there in such a way that they never repeat. This state, however, does not apply to
any of the six tables which hold the data themselves. SQL allows the user to define the
indexing upon two columns [39] which combined could make up a unique value. This
however, cannot be done without defining a new arbitrary type of data column, and is
therefore not recommended as it would introduce a large amount of clutter within the
data.
Note that there are two more databases not shown in the diagram. The first one is
the database which is used by the feature extraction process. The database for this
application is merely a copy of that. This is due to the extraction process being costly and
the curation feature being a rather powerful modifier. It makes sense to protect the
original data by not altering it. The third database which is not shown in the diagram is
holding the abstracts of each paper itself as a string and is only queried if details for a
found entry are requested. This database is updated regularly as new papers are being
published. Therefore it does not make sense to copy the values of this database into the
application specific one as it would make the whole application harder to maintain.
4. Implementation
This chapter is about the implementation of the application itself. First it will talk about
the tools available to achieve the given goals. It will then discuss which tools have been
chosen to get the implementation done and elaborate on why they are fitting for this
domain. Next it will focus on the implementation itself, stating what has been
created and how. It will explain how the code has been extended in order to achieve the
goals and deliver a working product which is flexible enough for future changes and
extensions.
4.1 Tools
This part will discuss the available tools that have been researched for the project. It will
first give an overview of the researched web development tools that are available and
elaborate on the choices taken as well as give reasons for these. Next it will present the different
types of database technologies available for use and give reasons for the choice made. Lastly, it
will introduce a tool used for database interaction with the chosen technology.
4.1.1 Current state of the art Web Technology
Since the invention of the World Wide Web in 1989 by Tim Berners-Lee [40], it has been
ever evolving. Particularly in the past six years, more technology was added than ever
before since its invention. The types of interactions with a browser have never been so
versatile and interactive. The following graph shows available technologies since 1991,
where each line represents such a technology, and where a line overlaps with a browser,
that browser adopted support for that technology. Note that the top three timelines
represent Netscape, Internet Explorer and Opera respectively but are cut off before 2000.
Figure 17: The evolution of the Web in the past 15 years [59]
A number of web development tools have been researched in order
to determine which one is most appropriate, not only for the project, but also for the author’s
level of experience. The following table (Table 4) gives an overview of these tools.
Tool name: Aptana Studio
Features: HTML, CSS and JavaScript code assist; deployment wizard; integrated debugger; Git integration; available as plugin for Eclipse.
Notes: Standalone editor or as plugin. A wide set of helpful tools and good community support. 100% free and open source.

Tool name: Google Web Toolkit
Features: SDK provides Java APIs and widgets; full web support including AJAX and JavaScript; automatic deployment of apps, working on all major browsers and platforms; debug in any IDE; available as plugin for Eclipse.
Notes: Open source, created by Google. Widely used even by companies. Wide set of extensions and APIs. Integrates seamlessly with most IDEs. Automatic deployment of server and client side application using Java and JavaScript.

Tool name: ASP.NET
Features: Provides full framework for apps that require client and server side computation; for MS Visual Studio only, therefore support for all major programming languages; Git, debugger, etc. integrate in MS Visual Studio.
Notes: Tool by Microsoft. Widely used. Good support. Fitted for applications with heavy or complex computation and interactions. Development only supported on Windows.

Tool name: Java on the Server - NetBeans
Features: Official Java EE and web development plugin for NetBeans IDE; HTML, CSS, JavaScript and PHP code assist; Git and debugger included in IDE.
Notes: Official open source plugin for NetBeans developed by the community. Good community support.

Tool name: Adobe Dreamweaver
Features: Supports all major web technologies; longest on the market; provides wide set of templates and integrated features to design websites.
Notes: Probably the oldest tool, however mostly disliked by the community due to not being open source for what it provides nowadays.

Tool name: IntelliJ IDEA by JetBrains
Features: Very good tool for Java and JavaScript development; HTML, PHP support included (CSS paid); integrated database tools for SQL (paid); UML diagram designer; auto refactoring; Git support.
Notes: A very popular and intelligent tool. Comes with a lot of helpful tools built in. However, most special features require a paid version.

Table 4: Comparison of Web Development Tools
4.1.2 Google Web Toolkit
Google Web Toolkit [41] has been chosen to build the software, as it is one of the more
widespread development tools and there is substantial documentation available, which
sets it ahead of some of the newer tools mentioned above. It provides a full client-server
framework for web applications using Java. Services are implemented on the server side
and can be accessed in the client code using standard Java conventions, which is
perfectly suitable for the scope of this project. The client is implemented in conventional
Java as well and a full custom API for visualisations is provided [42]. This is a positive
feature as the developer is fairly comfortable with Java. On compilation, the client code is
converted to JavaScript, which is supported by all conventional browsers and platforms
including mobiles. This platform-wide flexibility is one of the main reasons in favour of
the choice of Google Web Toolkit, as it can easily be deployed and set up for a wide set of
users. Calls to the services that the client accesses will be automatically converted into Java
Remote Procedure Calls [43], so that the client can communicate with the server. GWT
has an official plugin for the Eclipse IDE [44], which will be used for this project and is
one of the Integrated Development Environments the author has a good amount of
experience with. It provides integrated support for version control, which will be useful
for the development of this application, as progress can be tracked and rolled back if
needed.
4.1.3 Relational vs. Non-Relational Database Technologies
Non-relational databases provide efficient query techniques if there is no direct relation
from one dataset or table to another. Examples of such non-relational data stores would
be CSV files or general spreadsheets. There are some more sophisticated non-relational
databases with online availability, such as NoSQL [45], whose goal is to maximise
simplicity within the design or provide better scaling as datasets become larger.
Nevertheless, non-relational databases are not really suited for this type of application,
since the main part of the data which is used is spread over six tables, which are all
related. Therefore a relational implementation was chosen. A wide set of technologies
was available; however, due to the dependency of this application on external databases,
MySQL was the obvious choice. This is mainly due to compatibility with the already
existing databases, as no additional wrappers or bridges would be needed to realise
the application specific database. Not only does it provide all the features needed to
satisfy the domain of the project, but the author is also fairly confident in using that
technology.
4.1.4 Jenny and JDBC
Java Database Connectivity (or JDBC) is the native API in Java that allows it to connect to and
interact with external databases [46], for example ones using MySQL. It basically
functions by compiling MySQL commands, which can then be fired at a server, given that a
connection has been established through the API first. Jenny is a tool developed by the
JavaRanch community [47]. It is able to read out a database structure and generate
corresponding code to access and manipulate each table individually a bit more efficiently.
However, most of the time one instruction will not be enough. Therefore this code can be
put together in order to fulfil more complex instructions on the data and utilise the
information properly.
It needs to be noted that establishing a connection to the database in order to run
one query is quite a costly process. Jenny
makes interaction for single result sets with one query more efficient. However, if a query
for a wide set of results is needed which accesses several tables or databases, and the
query can be defined within one MySQL command, it is more efficient to do that directly
via the JDBC connector, since Jenny would need to fire a separate command for each
table or database within the query. Depending on what is needed, it is therefore more efficient to
use one of the two technologies or to mix them. One example
is the single curation, which is more efficiently done with Jenny. However, if a
query is compiled for a whole set of entries, that query might as well be fired directly.
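Firing one combined command directly through JDBC, rather than one command per table, might look like the following sketch. The table names passed in and the `value` column are hypothetical, and only the `PMID` column is taken from the database design. Table names are concatenated into the statement, so they must come from a fixed list and never from user input, while the search terms are bound as parameters.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that queries several dimension tables in one statement. */
final class JoinedQuery {

    /** Builds a single SELECT that joins the given dimension tables on PMID. */
    static String sql(List<String> tables) {
        if (tables.isEmpty()) {
            throw new IllegalArgumentException("at least one table required");
        }
        String first = tables.get(0);
        StringBuilder sb = new StringBuilder("SELECT DISTINCT " + first + ".PMID FROM " + first);
        for (String t : tables.subList(1, tables.size())) {
            sb.append(" JOIN ").append(t)
              .append(" ON ").append(t).append(".PMID = ").append(first).append(".PMID");
        }
        String separator = " WHERE ";
        for (String t : tables) {
            sb.append(separator).append(t).append(".value = ?"); // assumed column name
            separator = " AND ";
        }
        return sb.toString();
    }

    /** Runs the statement over an already open connection: one round trip for all tables. */
    static List<Integer> run(Connection conn, List<String> tables, List<String> terms)
            throws SQLException {
        List<Integer> pmids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql(tables))) {
            for (int i = 0; i < terms.size(); i++) {
                ps.setString(i + 1, terms.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    pmids.add(rs.getInt("PMID"));
                }
            }
        }
        return pmids;
    }
}
```

The `run` method expects one term per table, in the same order as the tables, and leaves opening and closing the connection to the caller so that a connection can be reused across queries.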
4.2 Core Implementations
To get started with GWT, an IDE has to be prepared and all needed plugins have to be
installed. Then a new project can be created. GWT for Eclipse, as used for this
project, will automatically produce all the needed packages and folders and put them in
the proper structure. The picture below (Figure 18) shows the tree generated by such a
GWT project.
Figure 18: GWT Application Structure Tree
The “src” folder contains the source code implemented, divided into packages. Note
that packages have been added and extended in order to provide a reasonable structure
to all the classes created. A test folder is also generated and was later populated with
tests. A number of imported Java libraries which were needed for the implementation are
located in the middle of the tree. The last important part that should be pointed out is
the “war” folder, as it contains all HTML parts and XML link tables that are created during
runtime and loaded by the client. An entry point for the application is defined, which is
then called inside the HTML host page which will load the application into the browser. Note
that CSS is also contained in that folder. More about GWT and how the implementation
corresponds to the design can be found in Appendix E.
It needs to be mentioned that the User Interface in GWT is created on a modular
basis, called “Widgets” rather than objective. GWT is able to identify and forward the
information retrieved to the needed module automatically. The module itself can then
decide how it wants to display the data, and how it is laid out. The routing is achieved via
so called Data Listeners, which the result data from the server is forwarded to.
4.3 Client-Server Interaction in GWT
Google Web Toolkit wraps all interaction between the client and the server in so-called
Remote Procedure Calls (RPC), a mechanism that is also generally used for inter-process
communication and method invocation [48]. The interaction can be implemented by
following a certain set of rules, which can be found within the service package of the
code. Figure 19 below gives an overview of the interfaces and classes implemented in
order to achieve this, where the dotted boxes represent interfaces and the solid ones
represent one or more classes. More about how this is done can be found in
Appendix F.
Figure 19: General Simplified Application Class Diagram
4.4 Client Implementation
The client side implementation has been done on a modular basis, meaning that
modules, also called “Widgets”, have been constructed and arranged accordingly. The
highest level module is the menu, which enables the user to switch between the different
features implemented. Apart from the textual parts of the application, which include the
home page, about page and help pages, there are several more complex features that
required additional functionality to process the server’s response on the client side. The
first feature is the standard search. It consists of six fields, one for each of the search
dimensions, and a search button. The fields have been arranged in a table, to which a
so-called CSS tag has been added, so that it can be styled further using the CSS
implementation of the project. The words to be searched for can be entered into the
fields. When the search button is clicked, the module compiles a query request object,
wrapping the entered words inside objects, and sends it to the server.
4.4.1 Results Table Widget
A widget to display the results has also been created. It is added onto the standard search
module once a response is received. This widget defines how the results are displayed
and arranged. It is implemented using the “GWT Cell Table”, which requires the data to
be linked directly. The table displays the PMID, year, name and, if applicable, type of
study of each abstract found. The maximum number of rows per page has been set to
25 and the pages are made accessible using a Pager. This is efficient, as pages can
be browsed without reloading the whole page. Each column within the table can be made
sortable; however, since the data contained in the table is retrieved from the server,
sorting is done through a data handler and had to be implemented on the server side.
This is because there is more than one page: a local sort would only sort the page
currently displayed. The major advantage of implementing the result table as a separate
widget is its reusability. Other search features can simply create their own instance of
such a table and populate it on their own page.
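The reason for sorting on the server can be illustrated with a minimal plain-Java sketch. It is not the project's data handler: the class names `ResultRow` and `PagedResults`, the field names and the method `page` are assumptions made for illustration. Only the page size of 25 is taken from the text. The complete result set is sorted first and a single page is cut out afterwards, so the ordering holds across all pages.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical row of the results table; the field names are assumptions. */
final class ResultRow {
    final int pmid;
    final int year;
    final String title;

    ResultRow(int pmid, int year, String title) {
        this.pmid = pmid;
        this.year = year;
        this.title = title;
    }
}

final class PagedResults {
    static final int PAGE_SIZE = 25; // rows per page, as stated in the text

    /** Sorts the complete result set by year, then returns one page of it. */
    static List<ResultRow> page(List<ResultRow> all, int pageIndex, boolean ascending) {
        List<ResultRow> sorted = new ArrayList<>(all);
        Comparator<ResultRow> byYear = Comparator.comparingInt(r -> r.year);
        sorted.sort(ascending ? byYear : byYear.reversed());

        // Clamp the window so that a page index past the end yields an empty page.
        int from = Math.min(pageIndex * PAGE_SIZE, sorted.size());
        int to = Math.min(from + PAGE_SIZE, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

Sorting only the 25 rows currently held by the client would leave rows on other pages out of order, which is why the sort request is passed back to the server.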
4.4.2 Advanced Search Feature
The results table widget has been reused for the dynamic search feature. The goal of this
feature was to be able to create a query in which all six dimensions can be combined in
any way. This includes stacking more than one word for the same dimension, using the
operators “AND”, “OR” or “NOT”. This has been realised using a GWT Flex Table, which
can be extended dynamically. Each row provides the user with a choice of operator,
dimension and a field for the word itself. Rows can be added to or removed from the
query dynamically. When the search button is clicked, the current state of the table is
analysed and read out accordingly. A query object can then be created and sent to the
server. It should also be noted that a loading animation is shown while
the client is waiting for a server response.
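Reading out the table state can be sketched as follows. This is a plain-Java illustration under stated assumptions, not the project's widget code: `AdvancedQuerySketch`, `Row` and `collect` are hypothetical names, and the dimension is represented as a simple string. Rows without a word are skipped, in line with the behaviour described for the finished application.

```java
import java.util.ArrayList;
import java.util.List;

final class AdvancedQuerySketch {
    enum Operator { AND, OR, NOT }

    /** One row of the search form: operator, dimension and the word itself. */
    static final class Row {
        final Operator operator;
        final String dimension;
        final String word;

        Row(Operator operator, String dimension, String word) {
            this.operator = operator;
            this.dimension = dimension;
            this.word = word;
        }
    }

    /** Keeps only rows that contain a word and trims surrounding whitespace. */
    static List<Row> collect(List<Row> formRows) {
        List<Row> query = new ArrayList<>();
        for (Row row : formRows) {
            if (row.word != null && !row.word.trim().isEmpty()) {
                query.add(new Row(row.operator, row.dimension, row.word.trim()));
            }
        }
        return query;
    }
}
```

The resulting list preserves the order of the rows, so the server can apply the operators in the sequence chosen by the user.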
4.4.3 Details Page Widget with Curation
The table itself has been given a listener, which enables each individual row to be clicked.
This is needed when the user wants to access the details of a certain paper that has
been found. When a row is selected, a pop-up appears. This is realised using a GWT
Dialog Box, which opens as a kind of overlay and greys out the rest of the page in the
background. A detail request is formed and sent to the server, containing
information about the selected row. The details page itself looks as shown in the design
drawing (Figure 15), with title and collaborators on top, abstract and highlighters in the
middle and a statistics table at the bottom. The table shows all features extracted from
this abstract as well as the dimension to which they belong. A series of buttons used for
curation is located underneath the statistics table. An entry in the table can be curated
by selecting that entry and then clicking the “Curate Selected Extract” button. This
replaces the table with a form in which the user can specify the information with which
that entry is to be replaced. Alternatively, there is a checkbox for the case that the entire
extract is wrong and needs to be deleted. Another checkbox enables the option to run a
detection algorithm on the server which will curate similar entries. Fields for username
and password are also provided. Note that the password is not displayed when entered
and that it is hashed before it is sent to the server as part of the curation request. Two
other types of curation have also been made possible. The first one is adding new
extracts from the current abstract. When the “Add Extract” button is clicked, the user can
select a dimension and enter a word which is to be added to the database. Before a
request is formed, a check is done as to whether the entered word is found within the
abstract. The option to enter some extra information about this abstract is also
provided.
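The two client-side checks described above, hashing the password and verifying that a new extract occurs in the abstract, can be sketched in plain Java. The text does not name the hash function or the matching rule, so SHA-256 and a case-insensitive substring match are assumptions, as are the names `CurationChecks`, `sha256Hex` and `occursInAbstract`. Whether `MessageDigest` is available in GWT client code depends on the JRE emulation of the GWT version in use.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class CurationChecks {

    /** Returns the lower-case hex SHA-256 digest of the password (algorithm is an assumption). */
    static String sha256Hex(String password) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Checks, ignoring case, that the word to be added appears in the abstract text. */
    static boolean occursInAbstract(String word, String abstractText) {
        String needle = word.trim().toLowerCase(Locale.ROOT);
        return !needle.isEmpty()
                && abstractText.toLowerCase(Locale.ROOT).contains(needle);
    }
}
```

A plain, unsalted hash mainly keeps the clear-text password out of the request; it is not a substitute for an encrypted connection or for salted password storage on the server.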
4.4.4 Future Client Improvements
It needs to be mentioned that the design of this system is open to future
additions, as new modules or widgets can simply be added without affecting
existing ones. One addition that needs to be pointed out is the ability to add more
statistics to the search pages themselves. This can be done, for example, by using
diagrams. Data can be requested from the server and laid out according to the needs of
the diagram library used with GWT. It should also be mentioned that most widgets and
HTML variables of the application have been given a CSS tag so that an extended design
can be specified within the project’s CSS file. The main advantage of this is that parts of
the application can be visually redesigned flexibly, without affecting functionality.
4.5 Server Implementation
It was established in the design chapter that all database interaction has to be handled
through the server for security reasons. Therefore all database connectors are
implemented on the server side. There are a total of three databases that the server
needs to interact with. The first database is the database that contains all the extracted
data from the abstracts as a result of text mining. The second database is the application
specific database. Most tables from the first database are copied into the application
specific database so that they can be edited without affecting the original data. This
database also holds information about the curation done as well as user accounts. Note
that this also reduces the number of databases that need to be accessed for a query and
is therefore more efficient. The third database contains information about the abstracts
themselves. This includes titles, collaborators and year of publication.
4.5.1 Database Connectors
The server can connect to each database in two different ways, the first one being the
database connector created by Jenny. It provides a class for each table in each database
and an efficient way to query for each row individually. The second database
connector was made using the Java Database Connectivity (JDBC) library directly. Provided
with the login credentials, it can connect to any of the three databases. A SQL query can
be passed to it directly as a string, which is then executed on the database itself.
JDBC then returns a Java ResultSet. This set can be iterated sequentially and
provides a string for each column of each result row.
It should be mentioned that a slight inefficiency in the way that the server
retrieves the result table data for a search was identified during implementation. This was
due to the titles of the abstracts being stored in a different database rather than the
application specific one. For every match found, a query was passed to the textual database
to retrieve the title of the found abstract. This implies that the more results are found in a
search, the more queries have to be sent to the additional database and the slower the
retrieval of results will be. This process was optimised by moving the titles themselves
into the application specific database. However, it needs to be mentioned that this
made the application specific database grow considerably in size, as sufficient
space had to be allocated per title in order to make sure that all titles fit within the assigned
table.
4.5.2 Services
The server side implementation that is able to take requests from the client consists of a
series of methods, which can be invoked in code directly from the client side. The
previous section has mentioned three main functions that needed substantial server side
implementation: Standard Search, Advanced Search and Details Request.
4.5.3 Query Services
When a request for a standard search is sent to the server, the object is first analysed to
check which fields actually contain a string to search for. Next, a SQL query is built
accordingly using the Java StringBuilder. The query is assembled dynamically
and the entries from the fields are connected via the “AND” operator. The
query is then sent to the database using a database connector. The result from the database is
turned into a so-called “Result Set” by the database connector. This result set is
disassembled by the server and packed inside an object which can be identified on the
client side. It is then sent back to the client, where its type is identified and it is
routed accordingly. It needs to be noted that the request for a dynamic search is
implemented very similarly. The only significant difference is the disassembly of the
request coming from the client, since the operators between the entered search
dimensions need to be assigned dynamically.
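The assembly step for a standard search can be sketched as follows. This is a simplified illustration and not the project's query builder: the table name `extracts`, the single-table layout and the names `StandardQuerySketch` and `build` are assumptions (the real schema uses one table per dimension). The sketch also emits `?` placeholders and collects the words separately for use with a JDBC `PreparedStatement`, which is a deliberate substitution for passing the words inside the SQL string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StandardQuerySketch {

    /**
     * Builds a parameterised query from dimension -> word pairs.
     * Blank fields are skipped; the remaining conditions are joined with AND.
     * The words are appended to params in placeholder order.
     * The map keys are used as column names and must therefore come from the
     * application itself, never from user input.
     */
    static String build(Map<String, String> fields, List<String> params) {
        StringBuilder sql = new StringBuilder("SELECT pmid FROM extracts");
        String glue = " WHERE ";
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String word = field.getValue();
            if (word == null || word.trim().isEmpty()) {
                continue; // this search field was left empty
            }
            sql.append(glue).append(field.getKey()).append(" = ?");
            params.add(word.trim());
            glue = " AND ";
        }
        return sql.toString();
    }
}
```

For a dynamic search the same loop would apply, except that the value of `glue` would be taken from the operator chosen in each row instead of being fixed to “AND”.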
4.5.4 Details Page Service
In a request for details of an article, the client request mainly consists of a PMID being
passed. The third database, which stores information about the abstracts themselves, is
queried for the title, the collaborators and the abstract itself. Then the
associated extracted features are gathered from the six tables, and the ranges for
the highlights within the abstract are gathered from the application specific database.
They are put in a details response object, which can be correctly identified on the client.
4.5.5 Curation Service
When a curation request reaches the server, it first checks whether the passed user
credentials are valid. This is done by comparing the user and password entries with
what is stored in the user table of the application specific database. If
the credentials are wrong, a response object is compiled which contains
a message that the curation has failed and the reason. If the credentials are valid, however, the
parameters of the curation are checked. First the server checks whether the checkbox
for identifying similar entries was ticked. If so, it searches the database for similar
entries and gets a list of PMIDs in that same dimensional table. It then proceeds to
change the entry or entries for the given feature. If
the deletion box was ticked, however, the entry is simply deleted from this table of the
database. There are also two other types of curation requests. The first one is adding a
new extract from the article to the database. A client side check is done as to whether the
entered word occurs in the abstract itself; the request is then sent to the server, where
it is pushed through to the database, provided that the authorisation
completed successfully. The last operation is to change the highlights of an extract,
which changes values in the highlights table.
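The decision sequence for editing or deleting an entry can be summarised in a small in-memory sketch. It is not the project's service class: `CurationServiceSketch`, its `curate` method and the response strings are hypothetical, a `Map` stands in for one dimensional table (PMID to extracted word), and "similar" is reduced to equality ignoring case because the text does not specify the detection algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CurationServiceSketch {
    private final Map<String, String> storedHashes; // user name -> stored password hash
    private final Map<Integer, String> table;       // one dimension table: PMID -> word

    CurationServiceSketch(Map<String, String> storedHashes, Map<Integer, String> table) {
        this.storedHashes = storedHashes;
        this.table = table;
    }

    /** Applies one curation request and returns a message for the response object. */
    String curate(String user, String passwordHash, int pmid,
                  String newWord, boolean delete, boolean includeSimilar) {
        // Step 1: credentials must match the user table.
        if (passwordHash == null || !passwordHash.equals(storedHashes.get(user))) {
            return "Curation failed: invalid credentials";
        }
        String oldWord = table.get(pmid);
        if (oldWord == null) {
            return "Curation failed: no such entry";
        }
        // Step 2: deletion only removes the selected entry.
        if (delete) {
            table.remove(pmid);
            return "Entry deleted";
        }
        // Step 3: optionally collect the PMIDs of similar entries in the same table.
        List<Integer> targets = new ArrayList<>();
        targets.add(pmid);
        if (includeSimilar) {
            for (Map.Entry<Integer, String> entry : table.entrySet()) {
                if (entry.getKey() != pmid && entry.getValue().equalsIgnoreCase(oldWord)) {
                    targets.add(entry.getKey());
                }
            }
        }
        // Step 4: replace the word in every collected entry.
        for (int target : targets) {
            table.put(target, newWord);
        }
        return "Curated " + targets.size() + " entries";
    }
}
```

In the real service each step would be a database statement, and every successful curation would additionally be logged against the user who performed it.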
4.6 Possible Future Implementation Ideas
There are two ideas which were considered during implementation and which should be
mentioned. The first one is the possible use of the Levenshtein Distance algorithm [49] in
order to identify similar words that need curation within the database and
make the detection generally more effective. This algorithm compares two
strings and calculates the number of single-character edits needed to turn one into the
other; the better the match, the lower the value. This could be done when searching through the
database, comparing all the words within the table where a string that needs curation has
been identified and calculating the Levenshtein Distance between that old string and
the entries in the table. If the distance is below a certain threshold for an entry
and the other values of that row roughly correspond to the same category as the
word that needs curation, it could be assumed that this entry is similar enough to the
wrong word and may be curated as well. This is one approach to implementing some kind
of “smart detection” of data.
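As an illustration of this idea, the following is a minimal sketch of the standard dynamic-programming formulation of the Levenshtein distance. The class name, the helper `similar` and the threshold parameter are assumptions; the project itself did not implement this. The raw distance counts insertions, deletions and substitutions, so more similar strings yield a smaller number. A score that grows with the quality of the match could be derived from it, for example as 1 - distance / max(length).

```java
final class Levenshtein {

    /** Returns the minimum number of single-character edits turning a into b. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Hypothetical helper: treats two entries as similar if few edits separate them. */
    static boolean similar(String a, String b, int maxDistance) {
        return distance(a, b) <= maxDistance;
    }
}
```

With a threshold of 2, for instance, `similar("hypertention", "hypertension", 2)` holds because a single substitution separates the two strings, whereas unrelated terms of similar length would be rejected.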
Another possible future addition is to enable parallel searches, since data and
communication within GWT are handled asynchronously. Instead of switching between
features when cycling through the tabs of the application, a new instance of that widget
as well as a reference to the client access service could be created each time. This would need
to be displayed to the user and kept track of in some kind of task bar. As a result, the user
would be able to open two basic searches at the same time. GWT would know
where to route the information coming back from the server, since each request would
have been sent from a different client service. While this feature seems useful,
especially when the user wishes to do a wide set of searches, it also has to be mentioned
that it would take up considerably more resources and as a result make the application slower. It
could also be considered redundant, since the user could just open the application twice
in different tabs of the browser and achieve the same goal.
4.7 Software Testing
Software testing is one of the core practices of software engineering. Here the question
asked is whether the software was built right. It is important to treat software testing as an
ongoing process and not just as a single phase at the end of development, in order to spot
mistakes and bugs early and prevent them from propagating through the system. The less
software is tested, the harder it is to fix individual bugs, as they may start affecting each
other. Therefore the application was tested thoroughly during implementation as
well as after implementation was finished. Several testing methodologies exist, a number of
which have been applied to the project. It needs to be noted that acceptance testing was
done as part of the evaluation, which is covered in the next chapter.
4.7.1 Regression Testing
Regression testing was done throughout development of the application. It is done to
make sure that the already existing parts of the application still work as new parts are
added. This is especially useful when widgets have been nested, as functionality may be
directly affected. A regression testing plan was created, which was executed and updated
every time the implementation of a major feature was finished. Note that this plan
grew as the implementation went along, starting out with only one test. A
simplified version of the plan can be seen below.
4.7.2 Integration Testing
Integration testing was done to test how the nested features behaved individually. Test
cases were created before implementation. These cases state expected behaviour upon
merge of widgets as well as all possible alterations of that feature. A bottom up approach
has been taken on integration testing, which means that lower level merged features
were tested first and the tests were then slowly expanded in order to integrate some of
the more complex features.
When the feature pages were first merged with the menu switcher this was
considered the first integration test. This continued to the addition of the results table on
top of the search features. Next, the details page for each was tested after it was merged
onto the results table feature and so on.
4.7.3 Security Testing
It should also be mentioned that some security testing was done. This covered
three properties: confidentiality, integrity and authentication. Firstly, confidentiality was
already ensured in the design of the application, by making sure that no confidential
information is stored within the system. Secondly, integrity is ensured as part of integration
testing. It is generally checked by making sure that what is created on the client side
corresponds with what is received on the server side. Thirdly, authentication is ensured by
always having some way of identifying the people involved. This is done by only creating user
accounts for trusted people as well as logging all critical actions on the server against the
corresponding user.
4.7.4 Unit Testing
Unit testing has also been done in order to be able to test parts of the application
individually. Since GWT is programmed in Java, it can be tested using standard JUnit test
cases [50]. In fact, it can be seen in the application tree above that GWT automatically
creates a package in which tests are located. Skeleton test classes were created for each
class and were then populated by the developer. A testing plan was first drawn up and
unit tests for each method on the server as well as the client were created
accordingly. The general style of these tests is to set up the so-called “pre-conditions”,
which are all the variables needed to run a function. Note that these are usually stubs and
only the part of the function that needs to be tested is populated with reasonable data.
The expected outcome is defined and asserted against the actual outcome after processing the
data through the given function. A test then passes depending on whether the assertion
evaluates as it was meant to. The code below shows a sample unit test for the query
builder on the server. A query request object containing the words to be queried is
passed in. The expected result is the fully built query.
Figure 20: Sample Test Code
Note that the database connectors implemented on the server also had to be
tested individually for each database. Skeleton test classes were already created for each
of the database connectors. It needs to be pointed out that tests were only written for
tables that are used by the application. The table below (Table 5) is a general
realisation of the testing plan and shows a simplified summary of the unit tests done for
the application. It indicates which test classes were written per package as well as the
number of tests per class. A short summary of what was tested has also been
added, as well as an indicator of whether the tests passed.
Package | Test Class | Description | #Tests | Outcome
Client | AdvancedSearchTest | Various tests assembling a search request object, testing different combinations of operators and dimensions. | 10 | Pass
Client | StandardSearchTest | Tests combining different search dimensions | 12 | Pass
Client | ResultTableTest | Several tests for each column in the results table | 8 | Pass
Client | DetailsPageTest | Several tests for populating each | 16 | Pass
Client.model | AdvancedQueryTest | Testing assignments and reads of model class | 12 | Pass
Client.model | CurationRequestTest | Testing assignments etc. | 22 | Pass
Client.model | DetailedArticleTest | Testing assignments etc. | 16 | Pass
Client.model | DetailStatsTest | Testing assignments etc. | 8 | Pass
Client.model | StandardQueryTest | Testing assignments etc. | 12 | Pass
Client.model | QueryResultTest | Testing assignments etc. | 8 | Pass
Client.service | ClientServiceImplTest | Testing forwarding of data to client module objects | 3 | Pass
Server | ServerAccessImplTest | Testing server routing of possible actions | 4 | Pass
Server | ServerQueryBuilderTest | Testing query builder for each dimension and some combination of dimensions | 10 | Pass
Server | ServerResultTest | Testing server generated response object creation | 8 | Pass
Server.dbmysql | DBMYSQLConnectorTest | Mainly tests with ResultSet alteration; some connection creation tests. | 7 | Pass
Table 5: Simplified Testing Table
5. Results and Evaluation
This chapter presents the individual parts of the application that have been
produced. It explains how each of them is used and how they relate. Next it
focuses on the evaluation done for the project. This was done in the form of interviews,
which are discussed and whose results are presented.
5.1 Overview of the Application
The following picture (Figure 21) shows the Standard Search feature. The six search
dimensions as well as the search button which will initialise and send a query with the
entered data to the server can be seen.
Figure 21: Standard Search Feature
Note that the exposure and the outcome field have been populated according to
the evaluation task. Once the search button is pressed a loading screen will appear while
the result data is being fetched on the server. Once the client receives the results, the
loading screen is replaced with the result table. The picture below (Figure 22) shows the
results for the query above: 22 entries have been found. It should be mentioned that
pages are set to be separated every 25 entries. Also note that there are two pagers, one
at the top and another at the bottom of the page. The table itself provides the user with
year, title and study design of the found papers. It is initially sorted by year in ascending
order and can be re-sorted by clicking the year header in the table.
Figure 22: Sample Results Table
When one of the rows of the table above is clicked, a details page will appear. This is
shown in the picture below (Figure 23).
Figure 23: Details Page Feature
Note that the rest of the page is shaded while a details page is being displayed. The
pop up itself gives more information about the paper including title, collaborators and
the abstract itself; a link to the actual paper is also provided. The markers on the right will
highlight all extracts of a selected dimension within the abstract. The table at the bottom
shows the extracted features and some information about them. It is set to display five
results per page. Clicking one of the entries in the table and then selecting “Curate
Selected Extract” will replace the table with the curation menu seen below.
Figure 24: Curation Feature
As shown above, hypertension has been selected from the table. A new word can be
added which will replace the entry for hypertension within the database. Since extracts
are normalised they generally correspond to the information in the abstract itself and
only such words may be added to the database. The option to delete an extract has also
been provided, as well as detecting similar entries within the database and curating these
as well. A user name and password also need to be provided in order to be able to do a
curation.
Lastly, the advanced search feature also needs to be shown. The picture below
(Figure 25) makes the main advantages of this feature clear. The six search dimensions
can be combined in any way wanted. Two words for the same dimension can be added.
The available operators include “AND”, “OR” and “NOT”, so that certain extracts may be
included or excluded as wanted. Rows can be added or removed at will and rows with no word
provided will be ignored in the search.
Figure 25: Advanced Search Feature
5.2 Evaluation
Evaluation and acceptance testing of the application has been undertaken in the form of
evaluation interviews. An interview script has been designed, a copy of which can be
found in Appendix H. Tasks have been set out for the person being interviewed to
complete. The first task is done together with the developer, whereas the last two tasks are
to be done by the interviewee alone. A short questionnaire to be used after the interview has also
been designed and can be found in Appendix I. All questions asked have been
answered as well as discussed during these interviews. Two people working in the field
have been interviewed as part of the evaluation. The first person was Dr Jenny Newman,
an epidemiologist and medically trained doctor. The second person was Dr George
Karystians, who developed the text mining algorithm used to gather the extracts in this
project. The main highlights of these interviews are summarised in the table below (Table
6). A full version in which the answers are mapped to their corresponding questions can be
found in Appendix J.
Person | Discussion
Jenny | A second pager on top of the results table would be useful, as it would eliminate the need to scroll on smaller resolution machines.
Jenny | As the second task from the evaluation showed, after entering 3 search terms, only two papers remained in the result set. This is already a good example for showing gaps in epidemiological research.
Jenny | It needs to be pointed out that an application like this could be easily misunderstood by users who are not as educated in the field and as a result wrong conclusions could be drawn about what the data means. Help pages or tooltips could improve this situation; nevertheless, this scenario should always be considered a risk.
Jenny | It is useful as it can provide a preliminary examination of previous work, for example when writing a grant application or preparing undergraduate projects.
Jenny | Especially liked the positioning of close buttons for the details page, as they were easy to find and conveniently placed.
George | Add more information about the current curation.
George | The curation feature is rather powerful. The users curating the data have to take responsibility for what they curate as this could lead to problems within the application. It is generally a good practice that curations are tracked.
George | By pointing out gaps between current research and the one represented.
George | Add a column to the result set which shows the type of study that the found paper has investigated.
George | Generally more statistics about the current result set.
George, Jenny | Adding the number of submissions about a certain topic per year as a statistic on the search result.
George, Jenny | Some statistic that adds a denominator to what proportion of submissions is about a certain topic this year compared to last year.
Table 6: Evaluation Questionnaire and Discussion Highlights
The general feedback given was positive and constructive. Each question was
discussed and the people interviewed never got lost or stuck while completing the
tasks they were given. The simplicity and the layout of
the results were generally liked. General concerns about the application and its use have been
identified and taken into account.
There are some future changes and extensions implied by the results above. The
second pager above the results table has already been realised, as can be seen in the
walkthrough. The same is true for the study design column being displayed for the result
set of a search, as suggested by the feedback. A future addition that was identified
by both persons interviewed was more statistics on the currently displayed
result set. This may include a simple display of the numbers needed as well as tables and
graphs. Due to the modular layout of the application, this can be realised in future
without the need for refactoring.
6. Conclusion and Future
This chapter concludes the dissertation by first providing a summary of what has been
achieved. It looks at the goals set out at the start and states whether they have been
met or not and why. It then summarises all possible future work mentioned and gives
an outlook on how this could be done. It closes the dissertation by discussing
the skills acquired and lessons learnt.
6.1 Summary
The aims and objectives were set out for this project. Research was conducted into
the domain as well as the text mining process behind the application that is needed in
order to realise the goals. This included the data structures and databases already in
place. The requirements were set out according to the aims as well as the stakeholders
involved, such as supervisors. A design was created according to the identified
domain. A need for a database was identified and the database was designed accordingly. It was
identified that the application needed to be online; therefore an appropriate interface
was designed as well. Tools to implement the website were thoroughly researched
and a decision was made to use Google Web Toolkit. The following table (Table 7)
shows the objectives set out at the start and states whether they have been met in the
resulting application.
Goal | Achieved
To develop a search function and implement an extended query model utilising the six dimensions needed for the epidemiological data. | This aim has been met as a query function has been implemented which covers all six search dimensions.
To develop an extended search function to enable the user to query the six dimensions combined in any way. | This has been achieved as the advanced query functionality enables the user to do just that.
To be able to manipulate automatically extracted epidemiological data in order to correct errors. | This feature was implemented as curation, as entries can be added, edited as well as deleted.
To provide relevant information about retrieved data, such as statistics. | Statistics are mainly shown for details of a selected paper in the application, therefore this goal has been met. However, more statistics are viable as future implementation.
To enable only registered users to manipulate information. | This has been done as part of the login required on curation.
To provide all the mentioned facilities on an online, web based, platform with high availability. | This goal has been met as the application has been implemented using a web platform.
Table 7: Set Goals vs. Achievements
After implementation, an interview questionnaire was designed and used
to evaluate the produced results with people more closely involved with the domain.
Generally positive feedback was received and a good set of ideas for future
additions was gained.
6.2 Future Work
Several future additions and changes are mentioned throughout the dissertation, which
are summarised here. During implementation it was mentioned that
the Levenshtein Distance algorithm could be used in future in order to identify similar entries in
the database as part of the curation process. This is useful as such errors usually
propagate through the database and this algorithm could be used to identify them
more efficiently. Another future change that could be realised utilising this algorithm concerns
the curation of a single entry. The new entry could be compared to the old one in order to
measure how similar they are. This could be used to aid the user and make sure that the curation
done is not too drastic, as entries in the database are normalised and should not differ too
much from the original text.
Other future additions include more displays and visualisations for statistics. More
general statistics could be displayed for the result set of a search. This could be
done by extending the general results table by a few more columns. However, it could be
argued that this would unnecessarily clutter the results display and make it unclear at
first glance. Alternatively, more information about the results could be added
underneath the table, in the form of further tables or even diagrams. This is feasible
with GWT because on compilation all client-side code is converted into JavaScript. As a
result, JavaScript libraries can be incorporated to draw and visualise these
statistics.
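As one example of such a statistic, the number of papers per publication year (a suggestion recorded in the evaluation highlights in Appendix J) could be derived from a result set roughly as sketched below. The class name and the assumption that each result exposes its publication year as an integer are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative aggregation; not part of the implemented application. */
final class YearlyCounts {
    private YearlyCounts() {}

    /** Counts how many papers in the result set were published in each year, sorted by year. */
    static Map<Integer, Integer> papersPerYear(List<Integer> publicationYears) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int year : publicationYears) {
            counts.merge(year, 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting map could then be rendered as an additional table or passed to a charting library on the client.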
In the longer term, the application could be extended to include a full user account
management system. This would have a number of advantages; however, it has not been
realised as part of this project because the complexity and effort are too high
compared to the gain in useful features. Such a system would enable users to define
and store custom searches on their account. It could also enable the application to
have private curations, meaning that alterations to the database would only affect
the user who entered them.
6.3 Concluding Remarks
A web based workbench for epidemiologists has been developed in this project. It
provides a search feature which ultimately enables the user to access a large amount of
information. The product has been evaluated by people involved in the field. A number of
possible future improvements have been identified. Nevertheless, feedback gained was
positive and the project was deemed a success.
7. Bibliography
[1] Last JM, ed. Dictionary of Epidemiology, 2nd ed., New York: Oxford U. Press, 1988.
[2] G. Hill, “University of Dundee External Relations,” Press Office, July 2013. [Online]. Available:
http://app.dundee.ac.uk/pressreleases/2013/july13/institute.htm. [Accessed August 2014].
[3] “MEDLINE factsheet,” 21 July 2014. [Online]. Available:
http://www.nlm.nih.gov/pubs/factsheets/medline.html. [Accessed April 2014].
[4] Spasić I, Sarafraz F, Keane AJ, Nenadić G., “Medication information extraction with linguistic
pattern matching and semantic rules,” Journal of the American Medical Informatics Association,
vol. 17, no. 5, pp. 532-535, 2010.
[5] Chapman WW, Cohen KB, “Current issues in biomedical text mining and natural language
processing,” Journal of Biomedical Informatics, vol. 42, no. 5, pp. 757–759, October 2009.
[6] Aarts S, Vos R, van Boxtel MP, Verhey FRJ, Metsemakers JF, van den Akker M, “Exploring
medical data to generate new hypotheses: an introduction to data and text mining,” 2012.
[7] “PUBMED factsheet,” 21 July 2014. [Online]. Available:
http://www.nlm.nih.gov/pubs/factsheets/pubmed.html. [Accessed April 2014].
[8] “PubMed MESH factsheet,” 21 July 2014. [Online]. Available:
http://www.nlm.nih.gov/pubs/factsheets/mesh.html. [Accessed April 2014].
[9] Karystianis, G, “Extraction and representation of key characteristics from epidemiological
literature,” University of Manchester, School of Computer Science., 2013.
[10] Hearst, M. A., “Text data mining: Issues, techniques, and the relationship to information
access,” Presentation notes for UW/MS workshop on data mining, July 1997.
[11] Tan, Ah-Hwee, “Text mining: The state of the art and the challenges.,” Proceedings of the
PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, 1999.
[12] Korhonen, A., Séaghdha, D. Ó., Silins, I., Sun, L., Högberg, J., & Stenius, U., “Text mining for
literature review and knowledge discovery in cancer risk assessment and research,” PloS one,
vol. 7, no. 4, 2012.
[13] Zweigenbaum, P., Demner-Fushman, D., Yu, H., & Cohen, K. B., “Frontiers of biomedical text
mining: current progress,” Briefings in bioinformatics, vol. 8, no. 5, pp. 358-375, 2007.
[14] Hotho, A., Nürnberger, A., & Paaß, G., “A Brief Survey of Text Mining,” In Ldv Forum, vol. 20,
no. 1, pp. 19-62, May 2005.
[15] R. Rodriguez-Esteban, “Biomedical text mining and its applications,” PLoS Computational
Biology, vol. 5, no. 12, 2009.
[16] Imberman, S.p., “Effective use of the KDD process and data mining for computer
performance professionals.,” Journal of Computing Resources, no. 107, pp. 68-77, 2002.
[17] Berger, A.M. and Berger C.R., “Data mining as a tool for research and knowledge development
in nursing,” Computers, Informatics, Nursing, vol. 22, no. 3, p. 123–131, 2004.
[18] Bischoff, K., Firan, C. S., Nejdl, W., & Paiu, R., “Can all tags be used for search?,” In
Proceedings of the 17th ACM conference on Information and knowledge management,
ACM, pp. 193-202, October 2008.
[19] Aronson, A. R., & Lang, F. M., “An overview of MetaMap: historical perspective and recent
advances.,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229-
236, 2010.
[20] Dix, A., “Human-computer interaction,” Springer US, pp. 1327-1331, 2009.
[21] Smith, D. C., Irby, C., Kimball, R., Verplank, W. L., & Harslem, E., “Designing the Star user
interface. In Human-computer interaction,” Morgan Kaufmann Publishers Inc., pp. 653-661,
1987, December.
[22] Fairbanks, R. J., & Caplan, S., “Poor interface design and lack of usability testing facilitate
medical error,” Joint Commission Journal on Quality and Patient Safety, vol. 30, no. 10, pp.
579-584, 2004.
[23] Patel, V. L., & Kushniruk, A. W., “Interface design for health care environments: the role of
cognitive science, In Proceedings of the AMIA Symposium (p. 29),” American Medical
Informatics Association., 1998.
[24] Weinger, M. B., Wiklund, M. E., & Gardner-Bonneau, D. J. (Eds.)., Handbook of human factors
in medical device design, CRC Press, 2011.
[25] Zaninzinato, Zaninzinato design, 2013. [Online]. Available:
http://www.zanzinato.com/work/avation-health/. [Accessed May 2014].
[26] Gottron, T., “Document word clouds: Visualising web documents as tag clouds to aid users in
relevance decisions. In Research and Advanced Technology for Digital Libraries,” Springer
Berlin Heidelberg, pp. 94-105, 2009.
[27] “Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs): Part 11:
Guidance on Usability,” ISO 9241-11, 1998.
[28] Charfi, S., Ezzedine, H., & Kolski, C., “RITA: A framework based on multi-evaluation
techniques for user interface evaluation: Application to a transport network supervision
system. In Advanced Logistics and Transport (ICALT),” International Conference on IEEE, pp.
263-268, 2013, May.
[29] Byron, L., & Wattenberg, M., “Stacked Graphs-Geometry & Aesthetics,” IEEE Trans. Vis.
Comput. Graph., vol. 14, no. 6, pp. 1245-1252, 2008.
[30] Cockburn, A., “Agile software development,” Boston: Addison-Wesley., vol. 2006, 2002.
[31] Beck, K., Beedle, M., Van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., ... &
Thomas, D., “Manifesto for agile software development.,” 2001.
[32] Schwaber, K. and Beedle, M., Agile Software Development with SCRUM, Upper Saddle River,
NJ: Prentice-Hall, 2002.
[33] Ambler, Scott W., and Mark Lines, “Disciplined agile delivery: A practitioner's guide to agile
software delivery in the enterprise,” IBM Press, 2012.
[34] C. Larman, Applying UML and Patterns. An Introduction to Object-Oriented Analysis and
Design and Iterative Development, 2006.
[35] G. Project, “Known vulnerabilities to GWT,” 2014. [Online]. Available:
http://www.gwtproject.org/articles/security_for_gwt_applications.html.
[36] Code-Google, “GWT incubator,” 2014. [Online]. Available:
https://code.google.com/p/google-web-toolkit-incubator/wiki/LoginSecurityFAQ. [Accessed
March 2014].
[37] “Heartbleed Bug,” Codenomicon Ltd., 29 April 2014. [Online]. Available:
http://heartbleed.com/. [Accessed August 2014].
[38] P. Brodersen, “How MySQL Uses Indexes,” Oracle MySQL, June 2004. [Online].
Available: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html. [Accessed August
2014].
[39] Y. Z. Low, “Multiple Column Indexes,” Oracle MySQL, 25 March 2012. [Online]. Available:
http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html. [Accessed August
2014].
[40] Berners-Lee, T., Cailliau, R., Groff, J. F., & Pollermann, B., “World-Wide Web: the information
universe.,” Internet Research, vol. 2, no. 1, pp. 52-58, 1992.
[41] Burnette, E., “Google Web Toolkit.,” 2006.
[42] Dewsbury, R., Google web toolkit applications, Pearson Education, 2007.
[43] “GWT Using RPC,” GWTProject, 2012. [Online]. Available:
http://www.gwtproject.org/doc/latest/tutorial/RPC.html. [Accessed August 2014].
[44] Google, “GWT Plugin for Eclipse,” Google Developers, December 2013. [Online]. Available:
https://developers.google.com/eclipse/. [Accessed April 2014].
[45] Katarina Grolinger, Wilson A. Higashino, Abhinav Tiwari, and Miriam AM Capretz, “Data
management in cloud environments: NoSQL and NewSQL data stores,” Journal of Cloud
Computing: Advances, Systems and Applications, vol. 1, no. 2, p. 22, 2013.
[46] Oracle, “JDBC Overview,” 2014. [Online]. Available:
http://www.oracle.com/technetwork/java/overview-141217.html. [Accessed May 2014].
[47] Javaranch, “Jenny the DB code generator,” 2013. [Online]. Available:
http://www.javaranch.com/jenny.jsp. [Accessed May 2014].
[48] J. Waldo, “Remote procedure calls and java remote method invocation,” IEEE Concurrency,
vol. 3, no. 6, pp. 5-7, 1998.
[49] M. Gilleland, “Levenshtein distance, in three flavors,” Merriam Park Software, 2009. [Online].
Available: http://www.merriampark.com/ld.htm.
[50] JUnit, “A programmer-oriented testing framework for Java,” 2014. [Online]. Available:
junit.org. [Accessed May 2014].
[51] Z. Grossbart, “Hackito Ergo Sum,” 21 September 2010. [Online]. Available:
http://www.zackgrossbart.com/hackito/antiptrn-gwt/. [Accessed April 2014].
[52] Google, “GWT Ajax Communication,” Google Developer Guide, 2012. [Online]. Available:
http://www.gwtproject.org/doc/latest/DevGuideServerCommunication.html. [Accessed
August 2014].
[53] Osheroff, J. A., Teich, J. M., Middleton, B., Steen, E. B., Wright, A., & Detmer, D. E., “A
roadmap for national action on clinical decision support,” Journal of the American medical
informatics association, vol. 14, no. 2, pp. 141-145, 2007.
[54] Berner, E. S., “Clinical Decision Support Systems,” Springer Science+ Business Media, LLC,
2007.
[55] Demner-Fushman, D., Chapman, W. W., & McDonald, C. J., “What can natural language
processing do for clinical decision support?,” Journal of biomedical informatics, vol. 42, no. 5,
pp. 760-772, 2009.
[56] Garg AX, Adhikari NK, McDonald H, Rosas-Arellano MP, Devereaux PJ, Beyene J, et al.,
“Effects of computerized clinical decision support systems on practitioner performance and
patient outcomes: a systematic review,” JAMA, vol. 293, no. 10, p. 1223–38, 2005.
[57] Infotech, P2C, “Software Development Life Cycles (SDLC),” 2011. [Online]. Available:
http://www.p2cinfotech.com/software-development-life-cycle/. [Accessed May 2014].
[58] Sittig, D. F., Wright, A., Osheroff, J. A., Middleton, B., Teich, J. M., Ash, J. S., ... & Bates, D. W.,
“Grand challenges in clinical decision support,” Journal of biomedical informatics, vol. 42, no.
2, pp. 387-392, 2008.
[59] G. C. Team, “Evolution of the Web.,” 2014. [Online]. Available:
http://www.evolutionoftheweb.com/. [Accessed April 2014].
[60] Edwards, S., “History of Processor Performance.,” University of Columbia, 2012.
8. Appendix
A. SCRUM principles
In SCRUM, the requirements first need to be written down in the form of user stories,
which are then collected into the so-called “Product Backlog”. This backlog holds all
information about the features to implement and may grow as the project progresses, as
things change or are added. SCRUM suggests covering the whole set of functionality, but
only in a basic manner at first. This functionality is then improved and extended in the
following iterations, which underlines the agile principle of working code first. After
the backlog is done, a so-called “Release Plan” is usually produced. It records which
stories will be implemented in which iteration, up to a first feasible release. This
release plan has to be flexible and leave room for additions. At the end of each sprint,
the team reflects on what has been done in order to keep track of how productive the
development process has been so far. A so-called burn-down chart may be created, as
shown in Figure 26 below. It visualises the planned remaining work (blue line) and
compares it against the work done (red line). Once the red line drops below the blue
line, the project is ahead of schedule; conversely, the red line being above the blue
line means the project is behind.
Figure 26: Sample Burndown Chart
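The comparison shown in the chart can also be expressed numerically. The sketch below assumes that the planned (blue) line burns down linearly and that the red line represents the points actually remaining; both are simplifying assumptions, and the class and method names are hypothetical.

```java
/** Illustrative burn-down arithmetic under a linear-plan assumption. */
final class Burndown {
    private Burndown() {}

    /** Points that should remain after the given day if work burns down linearly. */
    static double plannedRemaining(int totalPoints, int totalDays, int day) {
        return totalPoints * (1.0 - (double) day / totalDays);
    }

    /** Negative: ahead of schedule. Positive: behind schedule. Zero: exactly on plan. */
    static double deviation(int totalPoints, int totalDays, int day, int actualRemaining) {
        return actualRemaining - plannedRemaining(totalPoints, totalDays, day);
    }
}
```

For instance, summing the sprint totals of the release plan in Appendix B gives 82 points over 7 weeks (49 days). If 63 points remained after day 14, `deviation(82, 49, 14, 63)` would be positive (roughly 4.4 points), indicating that the project is slightly behind the linear plan.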
B. Full Release Plan for the Project
Sprint | Start | Weeks | End | Points total | Status | Goal | Description
1 | 1.7.2014 | 2 | 15.7.2014 | 19 | planned | Stories 1 to 3 | Create the basic website with all its login and managerial features.
2 | 16.7.2014 | 2 | 31.7.2014 | 29 | planned | Stories 5 to 10 | Implement all query functions for all tables and data involved.
3 | 1.8.2014 | 2 | 15.8.2014 | 24 | planned | Story 11 | Implement visualisations and dynamic creation of statistics.
4 | 16.8.2014 | 1 | 6.7.2014 | 10 | planned | Stories 12 to 15 | Implement alteration of data; this may include pattern detection.
C. Full Backlog for the Project
ID | As a/an … | I want to… | so that… | Done criteria | Est. | Priority
1 | user | view the website created | I can see who made it and eventually make out how to use it | Made a website that contains all the features a standard website has: Intro page, tutorial page, about page, etc. | 8 | high
2 | user | login to access curation feature | A login function that disables the curation of data unless authorised | A login function that hides all functionality until authorised | 5 | medium
3 | admin | manage user access on the website | An admin account that decide to restrict access to certain/all users | An admin account that can manage all entries of users etc. | 6 | medium
4 | user | be able to access epidemiological data | I can browse data if needed or use it to query it | Implemented database access | 10 | high
5 | user | query epidemiological data by Exposure | I can get an overview of research to that related health care problem | Query for given type of epidemiological dimension is possible | 4 | high
6 | user | query epidemiological data by Outcome | I can get an overview of research to that related health care problem | Query for given type of epidemiological dimension is possible | 4 | high
7 | user | query epidemiological data by Covariate | I can get an overview of research to that related health care problem | Query for given type of epidemiological dimension is possible | 4 | high
8 | user | query epidemiological data by Study Design | I can get an overview of research to these related studies | Query for given type of epidemiological dimension is possible | 4 | high
9 | user | query epidemiological data by Population | I can get an overview of related research to health care problems related to that population | Query for given type of epidemiological dimension is possible | 4 | high
10 | user | query epidemiological data by Effect Size Type | I can get an overview of research having that certain effect type and value | Query for given type of epidemiological dimension is possible | 4 | high
11 | user | be able to query for epidemiological data using any combination of the mentioned above categories | I can get more specific search queries on data | Query for any combination possible | 5 | high
12 | admin or high level user | insert new highlighted key words for a study | I can add a spotted keyword for an abstract that hasn't been spotted | Logged alteration of data is possible | 8 | medium
13 | admin or high level user | alter a highlighted key word in a study | I can alter a spotted highlighted keyword that may not be correctly identified | Logged alteration of data is possible | 8 | medium
14 | admin or high level user | delete a highlighted key word in a study | I can get rid of a wrongly identified keyword | Logged alteration of data is possible | 8 | medium
15 | admin or high level user | identify commonly extracted keywords | I can change or delete a wide range of possibly wrongly identified key words | Auto suggestion upon alteration of data | 10 | low
D. Full Sequence Diagram with Details Request
E. Anti-Pattern GWT
It has to be mentioned that GWT has been described as an anti-pattern [51]. This means
that it goes against some of the object-oriented software engineering principles
explained earlier in this dissertation. As a consequence, the design of the application,
although it accurately depicts what is needed and what happens within the application,
does not translate into code quite so simply. The best example is that the user
interface in GWT is built from modules, called “Widgets”, rather than from a
conventional object model. This has some major advantages: a new module can be added
simply by creating a new class for it. This is a good example of Protected Variation in
software engineering, as a new module can be added without affecting the functionality
of any existing ones. Another major advantage is that the modules are nestable, meaning
that they are able to invoke as well as contain each other. This leads to great
reusability of a single module. The main disadvantage, however, is that a conventional
model-view-controller structure is hard to achieve, as some of these modules require the
data to be handled as well as altered within them. For our design, this means that the
controller is mostly automated within the user interface and therefore becomes redundant
as a separate class. Regarding the sequence diagrams from Chapter 3, most of the actions
are automated within GWT or nested within the way the data is processed inside the user
interface. This mainly affects the “Result Formatter” and the “Controller”.
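The idea of nestable modules can be illustrated in plain Java as sketched below. The sketch deliberately does not use GWT's own widget classes, so that it runs on its own; UiModule, TextModule and ContainerModule are hypothetical names, and rendering to a string merely stands in for drawing a user interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a user interface module. */
interface UiModule {
    String render();
}

/** Leaf module that simply shows a piece of text. */
final class TextModule implements UiModule {
    private final String text;

    TextModule(String text) {
        this.text = text;
    }

    @Override
    public String render() {
        return text;
    }
}

/** Module that contains other modules, including further containers. */
final class ContainerModule implements UiModule {
    private final String name;
    private final List<UiModule> children = new ArrayList<>();

    ContainerModule(String name) {
        this.name = name;
    }

    /** Adds a child and returns this container so that calls can be chained. */
    ContainerModule add(UiModule child) {
        children.add(child);
        return this;
    }

    @Override
    public String render() {
        StringBuilder out = new StringBuilder(name).append('[');
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                out.append(", ");
            }
            out.append(children.get(i).render()); // invoke the nested module
        }
        return out.append(']').toString();
    }
}
```

A new kind of module only needs to implement the interface; existing containers can hold it without modification, which is the Protected Variation property described above.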
F. More about Java RPC Calls in GWT
Looking at Figure 19, the following can be seen. The Service interface specifies which
methods (or services) have to be implemented on the client and server side. It also
specifies the type of the response object coming from the server. The “Service Asynch”
interface must be implemented in order to receive asynchronous callbacks from remote
procedure calls. Each callable method within this interface corresponds to a method
from the Service interface and additionally takes an “AsyncCallback<T>” object as a
parameter, where the generic type “T” is the type of the object returned asynchronously
within GWT. Note that the Client interface here is optional and has been implemented in
order to specify which services are available to this client implementation. This is
useful, as different levels of clients may be implemented in the future, which will
need access to different services that should not be accessible to other client
implementations.
Once the server response comes in, it is received by the client implementation class,
which has to implement “AsyncCallback”. The type of the response can then be identified
and forwarded to the appropriate user interface implementation. Note that these
responses are handled asynchronously: once a request has been sent to the server, the
client does not have to interrupt its current activity and wait for the server
response. The response may arrive whenever it is ready and can then be processed in the
background.
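The relationship between the two interfaces can be sketched as follows. To keep the example runnable without GWT, AsyncCallback is declared locally as a stand-in that mirrors the two methods of GWT's own callback interface; the service name, its method and the queue-based fake transport are all hypothetical and only imitate the delayed arrival of a response.

```java
import java.util.ArrayList;
import java.util.List;

/** Local stand-in for GWT's AsyncCallback<T>, declared here so the sketch runs without GWT. */
interface AsyncCallback<T> {
    void onSuccess(T result);
    void onFailure(Throwable caught);
}

/** Synchronous contract: defines the method and its response type (hypothetical names). */
interface SearchService {
    List<String> search(String exposure);
}

/** Asynchronous twin: same method, no return value, plus a callback parameter. */
interface SearchServiceAsync {
    void search(String exposure, AsyncCallback<List<String>> callback);
}

/** Fake transport: queues each response and delivers it later, imitating a remote call. */
final class QueuedSearchService implements SearchServiceAsync {
    private final SearchService server;
    private final List<Runnable> pending = new ArrayList<>();

    QueuedSearchService(SearchService server) {
        this.server = server;
    }

    @Override
    public void search(String exposure, AsyncCallback<List<String>> callback) {
        // Return immediately; the caller carries on while the "response" is pending.
        pending.add(() -> {
            List<String> result;
            try {
                result = server.search(exposure);
            } catch (RuntimeException e) {
                callback.onFailure(e);
                return;
            }
            callback.onSuccess(result);
        });
    }

    /** Delivers all queued responses, as if they had just arrived from the server. */
    void deliverPending() {
        List<Runnable> batch = new ArrayList<>(pending);
        pending.clear();
        for (Runnable response : batch) {
            response.run();
        }
    }
}
```

The caller of `search` is not blocked: the callback only fires once `deliverPending` is invoked, which corresponds to the moment the server response arrives in a real deployment.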
There are several restrictions to GWT as well as to the RPC method used within it. The
most important one is that Java RPC calls depend on the types passed between client and
server being “Serializable” [52]. This means that the object used to implement the
responses has to be made up entirely of primitive types or of other objects which
implement Serializable, so that they can be disassembled into a data stream and
recovered properly. Another restriction between client and server implementation is
that all types and methods used in the client have to be available in, or compatible
with, JavaScript, as the client runs as JavaScript after deployment.
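The requirement that a response object consist only of serialisable parts can be illustrated with standard Java serialisation, as sketched below. This is an analogy only: GWT's RPC mechanism uses its own wire format and imposes further requirements that are not shown here. The QueryResult fields and the round-trip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical response object built only from a primitive and serialisable types. */
final class QueryResult implements Serializable {
    private static final long serialVersionUID = 1L;

    final int pmid;                    // primitive
    final String title;                // String implements Serializable
    final ArrayList<String> exposures; // ArrayList implements Serializable

    QueryResult(int pmid, String title, List<String> exposures) {
        this.pmid = pmid;
        this.title = title;
        this.exposures = new ArrayList<>(exposures);
    }
}

/** Disassembles an object into bytes and recovers a copy from them. */
final class SerializationRoundTrip {
    private SerializationRoundTrip() {}

    static QueryResult roundTrip(QueryResult value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (QueryResult) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            // A field of a non-serialisable type would surface here as NotSerializableException.
            throw new IllegalStateException("Object could not be serialised", e);
        }
    }
}
```

If any field held an object of a type that does not implement Serializable, writing the object would fail, which is the plain-Java counterpart of the restriction described above.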
G. Testing Done as result of a Testing Plan
Package | Test Class | Description | # Tests | Outcome
Client | AdvancedSearchTest | Various tests assembling a search request object, testing different combinations of operators and dimensions. | 10 | Pass
Client | StandardSearchTest | Tests combining different search dimensions | 12 | Pass
Client | ResultTableTest | Several Tests for each column in the results table | 8 | Pass
Client | DetailsPageTest | Several tests for populating each | 16 | Pass
Client.model | AdvancedQueryTest | Testing assignments and reads of model class | 12 | Pass
Client.model | CurationRequestTest | Testing assignments etc. | 22 | Pass
Client.model | DetailedArticleTest | Testing assignments etc. | 16 | Pass
Client.model | DetailStatsTest | Testing assignments etc. | 8 | Pass
Client.model | StandardQueryTest | Testing assignments etc. | 12 | Pass
Client.model | QueryResultTest | Testing assignments etc. | 8 | Pass
Client.service | ClientServiceImplTest | Testing forwarding of client module objects | 3 | Pass
Server | ServerAccessImplTest | Testing server routing of possible actions | 4 | Pass
Server | ServerQueryBuilderTest | Testing query builder for each dimension and some combination of dimensions | 10 | Pass
Server | ServerResultTest | Testing server generated response object creation | 8 | Pass
Server.dbmysql | DBMYSQLConnectorTest | Mainly tests with ResultSet alteration; Some connection creation tests. | 7 | Pass
Server.owndb | CovariateTableTest | 2 tests for each column in that table. | 12 | Pass
Server.owndb | CurationTableTest | 2 tests for each column | 18 | Pass
Server.owndb | Effect_sizeTableTest | 2 tests for each column | 12 | Pass
Server.owndb | ExposureTableTest | 2 tests for each column | 12 | Pass
Server.owndb | HighlightsTableTest | 2 tests for each column | 12 | Pass
Server.owndb | OutcomeTableTest | 2 tests for each column | 12 | Pass
Server.owndb | UserTableTest | 2 tests for each column | 8 | Pass
Server.owndb | PmidTableTest | 2 tests for each column | 8 | Pass
Server.owndb | PopulationTableTest | 2 tests for each column | 8 | Pass
Server.owndb | Study_designTableTest | 2 tests for each column | 38 | Pass
Server.epidb | CovariateTableTest | 2 tests for each column | 12 | Pass
Server.epidb | Effect_sizeTableTest | 2 tests for each column | 12 | Pass
Server.epidb | ExposureTableTest | 2 tests for each column | 12 | Pass
Server.epidb | HighlightsTableTest | 2 tests for each column | 12 | Pass
Server.epidb | OutcomeTableTest | 2 tests for each column | 12 | Pass
Server.epidb | PmidTableTest | 2 tests for each column | 6 | Pass
Server.epidb | PopulationTableTest | 2 tests for each column | 38 | Pass
Server.shareddb | ArticlesMedline2014TableTest | 2 tests for each column | 36 | Pass
Total Number of Tests = 436 (149 without DB connector generation)
H. Evaluation Interview Script
Evaluation Interview Script
Visualisation of Structured Epidemiological Information
Short Description: This application provides a search engine for epidemiological data. Key information has been extracted from numerous articles’ abstracts reaching back to the 1970s and put into six dimensions: covariate, effect size, exposure, outcome, population and study design. The extracted information can now be queried according to these six dimensions.
P1 - Browse together
Let’s look through the website together. Notice there is a navigation bar on top of the page. This is where we browse through the main features of the website. The static search provides six fields for the six search dimensions which we can query for. They consist of: Covariate, Effect Size, Exposure, Outcome, Population and Study Design. To test it, let’s search for “cancer” as an exposure. We should be able to get 97 results. Each result should be clickable and bring up the details of this paper. On top we can see the title and the collaborators of that paper. The middle section provides an abstract in which the extracted results can be highlighted by dimension. The bottom of the details window shows a summary of all extracted features. The marked features within the text are changeable using the curation button at the bottom, in case they highlight the wrong extracted words.
P2 - Find the Following
Let’s search for something! Find how many papers there exist for “married” as exposure and “obesity” as outcome. Let’s search for “adiposity” as an exposure and note how many results were found. Let’s add “diabetes” as an outcome and note the drastic change in found search results. Let’s tighten the circle even further by adding “smoking” as a covariate. There should now only be two papers left. Open details about the most recent one and take note of the other two covariates within it.
P3 - Advanced Search
Let’s try the advanced search! Note that here you can combine any of the six dimensions in any way you like. Let’s give it a go, and try searching for “lifestyle” as exposure, as well as “gender” as exposure and see how many results you get.
I. Evaluation Interview Questionnaire with Discussion Questions
Questionnaire:
Questions Yes No N/A
1. Was the application easily understandable?
2. Was the application easy to navigate?
3. Were all the buttons in the position you expected them to be?
4. Would you change any of them?
5. Were you able to solve the tasks easily?
6. Was it ever unclear where to look next in order to solve a task?
7. Was the information being represented accurately?
8. Was there sufficient Epidemiological Information represented?
9. Do you feel that this application could be useful for further epidemiological investigation?
10. Do you think this application could point out gaps in epidemiological research?
11. Do you think this application is useful for browsing epidemiological data in general?
12. Do you think an application like this would be more useful if it would be publicly accessible?
General Discussion:
13. How do you think an application like this would be useful?
14. What would you change or add in order to make it more useful?
15. Overall was there anything that you especially liked or disliked about the application?
J. Evaluation Interview Highlights Table
Question | Person | Answer | Discussion
4 | Jenny | Yes | A second pager on top of the results table would be useful, as it would eliminate the need to scroll on smaller resolution machines.
10 | Jenny | Yes | As the second task from the evaluation showed, after entering 3 search terms, only two papers remained in the result set. This is already a good example for showing gaps in epidemiological research.
12 | Jenny | Yes | It needs to be pointed out that an application like this could be easily misunderstood by users who are not as educated in the field and as a result wrong conclusions could be drawn about what the data means. Help pages or tooltips could improve this situation; Nevertheless, this scenario should always be considered a risk.
13 | Jenny | Discuss | It is useful as it can provide a preliminary examination of previous work for example when writing a grant application or preparing undergraduate projects.
3/15 | Jenny | Yes | Especially liked the positioning of close buttons for the details page, as they were easy to find and conveniently placed.
4 | George | Yes | Add more information about the current curation.
12 | George | Yes | The curation feature is rather powerful. The users curating the data have to take responsibility for what they curate as this could lead to problems within the application. It is generally a good practice that curations are tracked.
13 | George | Discuss | By pointing out gaps between current research and the one represented.
14 | George | Discuss | Add a column to the result set which shows the type of study that the found paper has investigated.
14 | George | Discuss | Generally more statistics about the current result set.
14 | George, Jenny | Discuss | Adding the number of submissions about a certain topic per year as a statistic on the search result.
14 | George, Jenny | Discuss | Some statistic that adds a denominator to what proportion of submissions is about a certain topic this year compared to last year.