VISUALISATION AND MANIPULATION OF STRUCTURED

EPIDEMIOLOGICAL INFORMATION

A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER

FOR THE DEGREE OF MASTER OF SCIENCE

IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

Stefan Lesnjakovic

September 2014

Table of Contents

Abstract
Declaration
Intellectual Property Statement
Acknowledgements
1. Introduction
1.1 Aims and Objectives
1.2 Deliverables
1.3 Report Outline
2. Background and Literature Review
2.1 Medical Literature
2.2 Epidemiological Literature
2.3 Medical Information Gathering
2.4 Visualisation of Information
2.5 Related Work
2.6 Summary
3. Requirements and Design
3.1 Requirements
3.2 Software Design
3.2.1 Challenges of Designing a Website
3.2.2 Security
3.3.3 Architecture
3.3.4 Domain Diagram
3.3.5 Sequence Diagram
3.3 User Interface Design
3.4 Database
4. Implementation
4.1 Tools
4.1.1 Current state of the art Web Technology
4.1.2 Google Web Toolkit
4.1.3 Relational vs. Non-Relational Database Technologies
4.1.4 Jenny and JDBC
4.2 Core Implementations
4.3 Client-Server Interaction in GWT
4.4 Client Implementation
4.4.1 Results Table Widget
4.4.2 Advanced Search Feature
4.4.3 Details Page Widget with Curation
4.4.4 Future Client Improvements
4.5 Server Implementation
4.5.1 Database Connectors
4.5.2 Services
4.5.3 Query Services
4.5.4 Details Page Service
4.5.5 Curation Service
4.6 Possible Future Implementation Ideas
4.7 Software Testing
4.7.1 Regression Testing
4.7.2 Integration Testing
4.7.3 Security Testing
4.7.4 Unit Testing
5. Results and Evaluation
5.1 Overview of the Application
5.2 Evaluation
6. Conclusion and Future
6.1 Summary
6.2 Future Work
6.3 Concluding Remarks
7. Bibliography
8. Appendix
A. SCRUM principles
B. Full Release Plan for the Project
C. Full Backlog for the Project
D. Full Sequence Diagram with Details Request
E. Anti-Pattern GWT
F. More about Java RPC Calls in GWT
G. Testing Done as result of a Testing Plan
H. Evaluation Interview Script
I. Evaluation Interview Questionnaire with Discussion Questions
J. Evaluation Interview Highlights Table

Table of Figures

Figure 1: Number of Articles about Smoking on PubMed
Figure 2: General overview of feature extraction
Figure 3: Example clinician UI in a table
Figure 4: Sample Tag Cloud
Figure 5: Sample return of EpiTeM
Figure 6: Sample of EdVic, a tool created by Elise Hahn
Figure 7: Agile development life cycle
Figure 8: Digitalised User Story Sample I
Figure 9: Digitalised User Story Sample II
Figure 10: Architecture overview
Figure 11: Domain Class Diagram
Figure 12: Basic Search Sequence Diagram
Figure 13: Curation Sequence Diagram
Figure 14: Main Page User Interface Drawing
Figure 15: Details Page User Interface Drawing
Figure 16: Application Specific Database Design Diagram
Figure 17: The evolution of the Web in the past 15 years
Figure 18: GWT Application Structure Tree
Figure 19: General Simplified Application Class Diagram
Figure 20: Sample Test Code
Figure 21: Standard Search Feature
Figure 22: Sample Results Table
Figure 23: Details Page Feature
Figure 24: Curation Feature
Figure 25: Advanced Search Feature
Figure 26: Sample Burndown Chart

Table of Tables

Table 1: Simplified initial release plan
Table 2: Simplified Product Backlog
Table 3: Non Functional Requirements
Table 4: Comparison of Web Development Tools
Table 5: Simplified Testing Table
Table 6: Evaluation Questionnaire and Discussion Highlights
Table 7: Set Goals vs. Achievements

Words: 18,430

Abstract

Visualisation and Manipulation of Structured Epidemiological Information

Medical papers are published daily in vast numbers all over the world. In fact, the volume is now so high that a single person could not read all the papers published in one year within their lifetime. It is therefore important to provide a tool that gives easy and flexible access to the information needed. This is especially useful in the field of Epidemiology, as it aids in identifying patterns of diseases and their related factors. Text mining technologies exist to exploit the existing data; however, the platforms and environments used to access these technologies are often crude or unsuitable for the average user.

The aim of this project is to provide such a tool, one which adapts to the ever-growing body of advancements in the field of Public Health. This was achieved using a text mining technology developed in Manchester, which extracts relevant information from the abstracts of published papers according to six key characteristics used in Epidemiology. A web application has been developed to enable browsing of epidemiological data along these six dimensions. The application also visualises some statistics of the extracted data and provides means to manipulate the data. The background research, development and evaluation of the application are described in this dissertation.

Author

Stefan Lesnjakovic

Supervisor

Goran Nenadic

September 2014

Declaration

No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Intellectual Property Statement

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trademarks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the dissertation, for example graphs and tables (“Reproductions”), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s Guidance for the Presentation of Dissertations.

Acknowledgements

I would especially like to thank my supervisor, Dr Goran Nenadic; without his consistent guidance and presence throughout this project, it would not have been possible.

I would also like to thank Dr George Karystianis for his support throughout and for always being available and open to questions; without his foundational work this project would also not have been possible.

Lastly, I would like to thank Dr Jenny Newman for her feedback and input to the project, which contributed positively to the progression and quality of the system.

1. Introduction

The accumulated wealth of medical literature is an indispensable resource for the field of Epidemiology and for keeping track of medical advancements. According to Last (1988), Epidemiology is “the study of the distribution and determinants of health-related states or events in specified populations, and the application of this study to the control of health problems” [1]; in other words, it is the study of the patterns, causes and effects of diseases in a given population. The volume of information relevant to individual diseases keeps growing, to such an extent that it has become impossible for an individual to keep track of it. Although there is a wealth of online citation indices, a more sophisticated way of querying these papers in a structured manner is needed in order to exploit their hidden potential. Text mining approaches exist which exploit these databases by extracting data in a structured manner; however, the user interfaces to these text mining tools are often inefficient, lack the required functionality or are missing altogether. The aim of this project is to provide a platform which enables the user to query and manipulate pre-extracted medical information, so that the queried data can be used, for example, for decision making. The project also aims to provide a new way of querying the data, which may lead to new discoveries in the field of Epidemiology. To be helpful, however, the platform has to be well designed and efficient, which is an additional challenge.

Making this kind of information publicly accessible has been a hot topic in the past few years. For example, the Farr Institute has recently been established with a multi-million pound investment [2] in order to develop informatics support for research on various health data, including epidemiological information. There are also websites which provide their own databases for querying such papers. One example is MEDLINE, the primary component of the well-known PubMed database [3]. It holds over 21 million journal articles in the life sciences, and an average of 4000 references are added every day. This project uses a data structure developed by George Karystianis at the University of Manchester in order to query structured data that has already been extracted from the MEDLINE database. This system extracts the data according to six key characteristics (or dimensions) which are commonly used in the field of Epidemiology: exposure, outcome, covariate, study design, population and effect size type. These characteristics are examined more closely in Chapter 2 of this dissertation.

1.1 Aims and Objectives

The aim of the project is to provide a system which enables the user to query and manipulate pre-extracted medical information from the MEDLINE database. This means making the queried data usable, for example for decision making, and allowing the data to be edited so that incorrectly extracted entries can be corrected.

The main objectives are as follows:

To develop a search function and implement an extended query model utilising the six dimensions needed for the epidemiological data.

To develop an extended search function which enables the user to query the six dimensions combined in any way.

To be able to manipulate automatically extracted epidemiological data in order to correct errors.

To provide relevant information about retrieved data, such as statistics.

To enable only registered users to manipulate information.

To provide all the mentioned facilities on an online, web-based platform with high availability.

To provide a simple-to-use but informative GUI with a good user experience.

1.2 Deliverables

There are several deliverables associated with this project:

Progress Report: provides an overview of the progress of the project.

This Dissertation: the final report of the project.

The Web Application: produced as part of this MSc dissertation. It is online and can be found at: http://gnode1.mib.man.ac.uk/epidemiology/home.html

1.3 Report Outline

Chapter 2 Background and Literature Review: This chapter focuses on the background research. It starts by discussing the importance of medical literature and how it is used nowadays, using smoking as a case study. It then moves on to how this literature is exploited in IT through text mining, and how the resulting discoveries are useful and may aid in, for example, decision support.

Chapter 3 Requirements and Design: This chapter provides an overview of the requirements gathering methodology and of the actual requirements which were identified and elaborated before implementation could start. It explains how they were gathered, i.e. which questions were asked to arrive at the given results. It also discusses the thought process behind the designs and provides diagrams to support them.

Chapter 4 Implementation: This chapter starts by discussing the choice of tools and the reasoning behind it. It then elaborates on the actual implementation of the product, the challenges encountered and how they were overcome. Note that this is an online platform, so there are some special issues regarding security as well as the client-server architecture used for this project. Lastly, it describes how the software was tested.

Chapter 5 Results and Evaluation: This chapter briefly takes the reader through the final state of the application and explains the features implemented. It then focuses on the evaluation techniques used to verify whether the application has been built correctly and is useful.

Chapter 6 Conclusion and Future: This chapter summarises what has been done and gives an overview of where the project stands at the end. It also summarises possible future implementations and provides ideas for how the application may be extended.

2. Background and Literature Review

This chapter elaborates upon the literature research carried out for the project. First, it explains more general concepts, such as how medical literature is used nowadays and the issues that come with it. It then moves on to more specific uses and techniques, such as text mining, and explains how these technologies may aid in decision support and are generally useful. Lastly, it looks at similar or related work.

2.1 Medical Literature

Even today, with all the advancements in technology, the most common medium through which scientists communicate knowledge is still textual [4], in the form of papers and citations. As already mentioned in the introduction, vast numbers of papers are published daily thanks to the technology and ease of publication available nowadays [5], and as a result the pool of documents that any specific search has to cover is growing too. The number of publications and the volume of citations have grown to such an extent that it has become impossible for one person to keep track of it all [6]. Figure 1 shows the number of articles about smoking published per year. Nevertheless, it is important that the number keeps growing, as this may also lead to faster and better advancements in a given field, such as Epidemiology in our case.

Figure 1: Number of Articles about Smoking on PubMed

[Bar chart. x-axis: Year of Publication (1985 to 2013); y-axis: Number of Articles Published (0 to 14,000).]

On top of the large amounts of data, another factor contributing to the inability to handle it is that it is mostly unstructured by nature [6]. It consists mainly of unlabelled free text, which can make it extremely hard to find the right things using a simple word search. Searching for a specific word can nowadays return millions of results, the majority of which may be unrelated to what is actually wanted. Categorisation is often sparse and, for specific uses such as Epidemiology, simply not sufficient. Text mining approaches address this problem and are discussed in more detail in Section 2.3.

There are several databases which store this information. The most popular one is MEDLINE (Medical Literature Analysis and Retrieval System Online), which is accessed through the online service PubMed. MEDLINE is run by the US National Library of Medicine and stores citations and abstracts in the fields of medicine, nursing, dentistry, veterinary medicine, health care systems and preclinical sciences [7]. It is linked to MeSH (Medical Subject Headings), a controlled vocabulary thesaurus designed to aid querying and to help the user find better results more easily [8] when querying for articles on the service's own website. This is especially useful when the user is a non-professional or does not know how to spell a certain term: any entered word is automatically looked up in MeSH before the query is run, to ensure the best results and to minimise the number of queries that return nothing.
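The look-up step described above can be sketched as follows. This is a minimal illustration in Python, not the actual MEDLINE or MeSH implementation: the tiny `THESAURUS` mapping, the `normalise_term` function and the use of approximate string matching from the standard library are all assumptions made for the example.

```python
import difflib

# Hypothetical miniature controlled vocabulary: preferred term -> synonyms.
# The real MeSH thesaurus is far larger and is not stored as a Python dict.
THESAURUS = {
    "neoplasms": ["cancer", "tumour", "tumor"],
    "smoking": ["cigarette smoking", "tobacco smoking"],
    "myocardial infarction": ["heart attack"],
}

# Reverse index: every known spelling or synonym -> preferred term.
LOOKUP = {term: term for term in THESAURUS}
for preferred, synonyms in THESAURUS.items():
    for synonym in synonyms:
        LOOKUP[synonym] = preferred


def normalise_term(user_input, cutoff=0.8):
    """Map free text to a preferred term, tolerating small spelling errors.

    Returns None when nothing in the vocabulary is close enough.
    """
    query = user_input.strip().lower()
    if query in LOOKUP:
        return LOOKUP[query]
    # Fall back to approximate matching for misspelt input.
    close = difflib.get_close_matches(query, list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[close[0]] if close else None


print(normalise_term("Heart Attack"))
print(normalise_term("smokin"))
```

Resolving the entered word to a preferred term before the query runs means that synonyms and near-miss spellings retrieve the same set of articles, which is what reduces the number of empty result pages.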

2.2 Epidemiological Literature

Between 2000 and 4000 epidemiological references are added to the MEDLINE database every day. In 2013 alone, over 700,000 references were added, and as of today there are over 23 million entries in total. Given these numbers, one-dimensional search tags are clearly insufficient for a complex field such as Epidemiology. Six dimensions have therefore been identified [9] which are appropriate for a search feature and which are used in this project. They are as follows:

Exposure: A risk factor a person may have been exposed to, for example smoking or stress.

Outcome: A condition or incident in the studied population that results from the determinant, for example lung cancer or hair loss.

Covariate: A factor that may influence the development or outcome and is therefore important to that specific study, for example gender or age.

Study Design: The type of study carried out on a given population. A number of established, predefined study designs exist, for example observational studies, prevalence studies and randomised controlled trials (e.g. involving placebos).

Population: The group of people the study was carried out on. This may be characterised by gender, ethnicity or even nationality.

Effect Size Type: A numerical measure of an attribute within an epidemiological study. This means one can look for numbers within a study, such as “risk factor”.

These six entities have been used to introduce structure and extract information [9] from numerous abstracts of the MEDLINE database and are therefore ideal for the purposes of this project. Figure 2 gives a general overview of how the extraction has been done.
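To make the six dimensions concrete, one possible way of representing an extracted abstract and querying it across any combination of dimensions is sketched below. The `EpiRecord` class, the field names and the sample records are hypothetical and chosen only for illustration; they are not the schema used by the extraction system in [9].

```python
from dataclasses import dataclass, field

DIMENSIONS = ("exposure", "outcome", "covariate",
              "study_design", "population", "effect_size_type")


@dataclass
class EpiRecord:
    """One abstract with the terms extracted for each of the six dimensions."""
    record_id: str  # made-up identifier, not a real PubMed ID
    exposure: set = field(default_factory=set)
    outcome: set = field(default_factory=set)
    covariate: set = field(default_factory=set)
    study_design: set = field(default_factory=set)
    population: set = field(default_factory=set)
    effect_size_type: set = field(default_factory=set)


def search(records, **criteria):
    """Return records matching every given dimension=term pair (logical AND)."""
    unknown = set(criteria) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    return [
        r for r in records
        if all(term.lower() in getattr(r, dim) for dim, term in criteria.items())
    ]


# Made-up sample data for illustration only.
records = [
    EpiRecord("A1", exposure={"smoking"}, outcome={"lung cancer"},
              covariate={"age"}, study_design={"cohort study"}),
    EpiRecord("A2", exposure={"smoking"}, outcome={"hair loss"},
              covariate={"gender"}, study_design={"prevalence study"}),
    EpiRecord("A3", exposure={"stress"}, outcome={"hair loss"}),
]

hits = search(records, exposure="smoking", outcome="hair loss")
print([r.record_id for r in hits])  # ['A2']
```

A plain keyword search for "smoking" and "hair loss" could not tell which term is the exposure and which the outcome; keeping one field per dimension is what allows the two to be constrained independently.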

2.3 Medical Information Gathering

Text mining, or text data mining [10], refers to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [11]. It is estimated that about 80% of a company’s knowledge is hidden in text rather than in structured data. The situation is similar in the field of Epidemiology, since most advancements are published in the form of papers. In biomedical science, natural language processing (NLP) has become a popular way of making sense of this natural-language data; applied to biomedical text, it is referred to as BioNLP [5]. The need to extract information automatically and to introduce a logical structure is becoming gradually more important [12]. In the past few years, such NLP processes have shown promising results in extracting key biomedical information from relevant literature [13].

Another idea is to identify connections and similarities between articles, leading to new discoveries that are invisible to the naked eye, so-called “undiscovered public knowledge”. This approach is also known as Knowledge Discovery in Databases (KDD) [14]. For example, a previously unknown connection between magnesium and migraine was discovered just by carefully looking over several papers [15]. Whereas this discovery was made manually, KDD approaches could have led to the same result faster and more easily. KDD is therefore appropriate for exploratory research [16]. These KDD methods are essentially an extension of current statistical ones, exploiting today’s computational power together with machine learning and artificial intelligence algorithms [17].
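One simple way in which such a hidden connection could be surfaced automatically is co-occurrence analysis: if term A appears together with an intermediate term B in some papers, and B appears together with term C in others, but A and C never appear together, then A and C form a candidate link worth investigating. The sketch below illustrates this idea only. It is not the method used in [15] or [17], and the paper identifiers and keyword sets are invented toy data.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy data: paper id -> keywords found in its abstract.
papers = {
    "p1": {"magnesium", "vascular tone"},
    "p2": {"vascular tone", "migraine"},
    "p3": {"magnesium", "platelet aggregation"},
    "p4": {"platelet aggregation", "migraine"},
    "p5": {"migraine", "stress"},
}


def cooccurrence(papers):
    """Map each term to the set of terms it shares at least one paper with."""
    linked = defaultdict(set)
    for terms in papers.values():
        for a, b in combinations(sorted(terms), 2):
            linked[a].add(b)
            linked[b].add(a)
    return linked


def hidden_links(papers, start):
    """Find terms never co-mentioned with `start` but reachable through an
    intermediate term, together with the intermediates that bridge them."""
    linked = cooccurrence(papers)
    candidates = defaultdict(set)
    for bridge in linked[start]:
        for target in linked[bridge]:
            if target != start and target not in linked[start]:
                candidates[target].add(bridge)
    return dict(candidates)


print(hidden_links(papers, "magnesium"))
```

In this toy corpus no single paper mentions both magnesium and migraine, yet the two are connected through two different intermediate terms, so migraine is returned as a candidate together with its bridging terms.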

As already mentioned text mining may provide structure for unstructured data and

therefore makes it easier to access. It does so, for example, by introducing tags [18]. The

tags are similar to search tags as these are commonly used on platforms such as YouTube

or Twitter. This is realised by either manually entering such words or scanning, in our

case, an abstract of a paper and extracting all keywords. Doing a simple search for one of

these words could then return the abstract with the best match. In recent years,

collaborative tagging has become very popular [13]. Collaborative tagging means that

people may add or introduce their own tags as they go along. This feature may be useful

in the field of Epidemiology as specialists could simply highlight and introduce their own

tags while going through a paper or patient history, thereby making the extracted

key information more useful. However, doing it in the same way as, for example, Twitter

is too one-dimensional for our needs. A more sophisticated query model is therefore

required.
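
The tag-based lookup described above can be sketched in a few lines of Python. This is only an illustration: the sample abstracts, the stop-word list and all function names are assumptions made for the example and are not taken from the project.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def extract_tags(abstract):
    """Return a Counter of lower-cased keywords found in an abstract."""
    words = re.findall(r"[a-z]+", abstract.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def build_index(abstracts):
    """Map each tag to a dict of {abstract_id: occurrence count}."""
    index = defaultdict(dict)
    for abstract_id, text in abstracts.items():
        for tag, count in extract_tags(text).items():
            index[tag][abstract_id] = count
    return index


def add_user_tag(index, abstract_id, tag):
    """Collaborative tagging: let a user attach their own tag to an abstract."""
    hits = index[tag.lower()]
    hits[abstract_id] = hits.get(abstract_id, 0) + 1


def best_match(index, tag):
    """Return the id of the abstract that carries the tag most often, or None."""
    hits = index.get(tag.lower(), {})
    if not hits:
        return None
    return max(hits, key=hits.get)


# Invented sample data, for illustration only.
ABSTRACTS = {
    "a1": "Obesity is associated with depression in adults. Obesity rates rise.",
    "a2": "Depression and sleep in adolescents.",
}
INDEX = build_index(ABSTRACTS)
```

A lookup such as `best_match(INDEX, "obesity")` only ever considers a single tag, which illustrates why this kind of search is too one-dimensional for the project.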

Another useful way to extract information is by identifying patterns within the text

and providing references to them. This is also known as indexing. It can provide

structure and open up the information in large spans of text. This approach, however, uses

a lot more resources, as the text first has to be scanned for all keywords according to a

pattern, for example using natural language processing algorithms. This can be very lengthy

and costly. Another downside is that the error rate might be high, as such an algorithm

could easily misinterpret special characters such as punctuation within the texts. This is

mainly due to ambiguity and the different writing styles of individuals around the world

[14].

Figure 2: General overview of feature extraction [9]

First, the documents, or in our case the abstracts of documents, are taken as input. The input is

pre-processed, for example by filtering out certain words or tokenising the text to make it easier

for the machine to read, which may lead to higher accuracy, depending on the input. Data

is then extracted according to dictionaries, certain machine learning algorithms or rules

[9], leading to the six concepts being used for the project. Examples of how these

extracted concepts may be used can be found in the related work section (2.5), where

several pieces of software are described which utilise these principles.
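
The steps of Figure 2 can be illustrated with a minimal Python sketch of pre-processing followed by dictionary-based extraction and normalisation. The dictionary entries, the stop-word list and the function names are hypothetical and were chosen only for this example. The actual extraction tool referred to in this report may work quite differently.

```python
import re

# Hypothetical dictionary: surface term -> (dimension, normalised form).
DICTIONARY = {
    "obesity": ("exposure", "obesity"),
    "obese": ("exposure", "obesity"),
    "depression": ("outcome", "depression"),
    "cohort": ("study_design", "cohort study"),
}

# Hypothetical stop-word list used for filtering.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "was", "with"}


def preprocess(text):
    """Lower-case the text, tokenise it and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def extract(text):
    """Return unique (dimension, normalised term) pairs in order of appearance."""
    found = []
    for token in preprocess(text):
        concept = DICTIONARY.get(token)
        if concept is not None and concept not in found:
            found.append(concept)
    return found
```

Here both "obese" and "obesity" are mapped onto the same normalised exposure, which is the kind of mapping that normalisation refers to.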

Using the text mining approach for processing epidemiological literature

developed at the University of Manchester by George Karystianis, abstracts of numerous

articles have been scanned, and the results have been normalised and put in tables of six

dimensions widely used in the field of medicine, which will be discussed in more detail later.

Normalisation means that the extracted data is mapped onto its descriptive attributes,

making it easier to identify the right occurrence later [9]. It is important to stress that

it is extremely hard for text mining approaches to reach 100% accuracy. Therefore there will

always be mistakes or errors propagating through the extracted data. The approach

mentioned has been thoroughly tested, though, and an accuracy of over 81% has been

reported, which suggests that it is reliable [19]. Such promising results are worth

making more accessible and perhaps even improving, for example by providing error

correction in the extracted tables and making them accessible online. These features could

be supported by this project and are in fact part of the scope.

2.4 Visualisation of Information

From the previous sections it is clear that visualisation techniques for such an amount

of information are crucial in these types of applications (CDS and structural information

representation). Therefore it is important to discuss some of the issues in this section.

Human-Computer Interaction (HCI) is the study of how computer technology influences

human work and activities [20]. The subject has been studied since the early 1960s;

however, only since the emergence of the personal computer has it become a key aspect of

computing. In the 1980s, Xerox released its so-called Star user interface,

which was the first to introduce a window-icon-menu-pointer (WIMP) design and as a result

revolutionised the way humans interacted with computers. Xerox also established five

golden principles [21] which are still used today as a guide to designing a user

interface well:

Familiar Conceptual Models

Universal Commands

Consistency

Simplicity

User Tailor-ability (or customisability)

Although these concepts were originally applied to an operating system, they are just

as valid for standalone applications. A Familiar Conceptual Model is about giving the user

what they expect to see and keeping the application consistent with familiar designs.

Consistency is keeping all modules and messages of an application coherent. The

interface itself should never drastically change unless the user specifically wants to apply

a different design.

Especially for medical applications it is crucial that the user interface is well designed

and coherent, because human wellbeing depends closely on the interaction with them.

An interface like that of YouTube or Google will not suffice for the complexity

needed in a medical environment. Many errors in

medicine have been attributed to poor interface design rather than human error on its own [22]. It is

therefore important that the information is clear and messages are understandable for

the given range of users. A study conducted by Patel et al. (1998) has shown that,

especially in a medical environment, the perception of what is important to a user varies

greatly [23]. As a result, not only the phrasing but also the design

and placement of the output has to differ for each user group, and

appropriate options have to be provided for each group. The user always has to be aware that

he is in control and should be able to decide when to start or stop a given procedure.

Warnings should not be designed in a way so that the user has to rely on them or that

they would stop the flow of the program. It is especially crucial that human error in

interaction with the system is not fatal and that actions can be easily reverted, changed or

reconstructed [24]. Other than that, there is a wide range of features that have to be

reachable within a few clicks and, as a result, a lot more elements need to be on the main

page than on other online platforms. Figure 3 below shows an example of a medical GUI

developed for a clinical status application for patients’ families. The number

of elements on it is considerably higher than on a typical online platform. It may be argued that

the layout is rather clear. However, the warnings for the individual patients only seem to appear in

the second column, and it could be argued that this is not prominent enough: as mentioned

above, what is important differs from user to user, especially in medicine, and a larger

popup may be needed.

Figure 3: Example clinician UI in a table [25]

Depending on what the system is used for, there are several methods of displaying

outputs which are appropriate. It is important to clearly define how complex future

interaction may be in order to get the visualisations right. There are different methods

useful for different amounts of features. There is a big difference, for example, between

exploratory research and close research or decision support.

Tables and lists are probably the most classic and reliable way of representing

information as shown in the example above. They open a very broad spectrum of

potential usages and a high amount of information may be displayed in an ordered

manner across the screen. This kind of interaction is especially useful when the extraction

or manipulation of the information has to be particularly specific. Another major upside is

that tables are usually sortable by selecting a column header which can make browsing

the data much more convenient. The main downside is that too much information may be

displayed, which can be overwhelming to the user. There are ways to

improve this situation, for example by restricting the maximum number of rows and splitting

the remaining rows off into other pages. This, however, can be rather obstructive if the page size is

set too small or too large for the resolution of the screen used.
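
Sorting by a column and splitting rows into pages, as described above, can be sketched as follows. The row layout (a list of dictionaries) and the function names are assumptions made for this illustration only.

```python
def sort_rows(rows, column, descending=False):
    """Return the rows sorted by the value stored under the given column."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


def paginate(rows, page, page_size):
    """Return (rows on the requested 1-based page, total number of pages)."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # Ceiling division; an empty table still has one (empty) page.
    total_pages = max(1, -(-len(rows) // page_size))
    # Clamp the requested page into the valid range.
    page = min(max(page, 1), total_pages)
    start = (page - 1) * page_size
    return rows[start:start + page_size], total_pages
```

The `page_size` parameter is the value that, as noted above, becomes obstructive when it is set too small or too large for the screen in use.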

Charts and graphs have many different forms, some more appropriate than others.

With modern software packages these can be created on the fly as information is

extracted. They have restrictions, though: the information has to be within certain

ranges and clearly defined in order to be meaningful. These kinds of visualisations are

usually only used for extraction and not manipulation of information.

Figure 4: Sample Tag Cloud

A type of visualisation that has emerged rather recently is the so-called word or tag

cloud [26]. Basically it is a table of words, in which the colour, order or size may indicate

variable significance. An example made in Google Web Toolkit can be seen above (Figure

4). They are more flexible than charts and they can be compiled over a more varying data

set rather efficiently. They are especially useful for exploratory research as they can

compile the most significant information of a wide variety of input into a comparably

small space. As a result, another major advantage of this kind of representation is that it

is appropriate for decision support [23]. This applies especially to clinicians: for

example, such a cloud could represent a set of treatment options, with the visual

appearance of each word indicating the success rate of that treatment option. Selecting one

of these options could then reveal more information about the given method.

Nevertheless they are rather unsuitable when it comes to manipulation of information

due to data coming from a lot of different places.
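
One simple way in which the size of a word could indicate its significance is linear scaling between a minimum and a maximum font size. The following Python sketch assumes that significance is given as a plain count per word; the pixel range and the function name are illustrative choices and not those of any particular tag cloud library.

```python
def font_sizes(counts, min_px=12, max_px=36):
    """Linearly scale word counts to font sizes in pixels."""
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for word, count in counts.items():
        # When every count is equal, all words get the same mid-range size.
        weight = (count - lowest) / span if span else 0.5
        sizes[word] = round(min_px + weight * (max_px - min_px))
    return sizes
```

The same weight could just as well drive colour or ordering instead of size.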

When it comes to evaluating the usability of such a product, according to ISO 9241

there are three key principles [27] that need to be taken into consideration:

Effectiveness

Efficiency

Satisfaction

Effectiveness asks the question whether the user can achieve his goal. Efficiency is

about how quickly he can achieve it and lastly satisfaction is about how getting there

made the user feel, or in other words, the user experience. A wide variety of studies are

available which measure these factors. Evaluation frameworks exist [28] where users are

given a certain set of tasks to do and then without further instructions are exposed to a

system with which they have to complete these tasks. In a proper study there would be a

number of different versions of the system running in order to assess which one is best. If

the users are being watched and are questioned while or after they complete these tasks,

it is called an observational study. The data will be recorded and evaluated. If the scores

are above a certain threshold, the system will be deemed acceptable. More than one

version, if multiple exist, might pass this threshold; in that case the one with the best

score in all three aspects should be chosen.

2.5 Related Work

There has been similar work done here in Manchester. One significant piece of work,

directly related to this project has been done by George Karystianis as part of his PhD

thesis. First, he extracted information from about 20,000 MEDLINE article abstracts

using a text mining tool he also developed as part of his studies, and put them into a

database, structured in the six epidemiological key characteristics he identified, described

in section 2.2 of this report, leading to almost 100,000 entries. Then, using Java, he wrote

a powerful tool called “EpiTeM” (Epidemiological Text Miner) to explore that data in a

structured manner. As input query data it can take up to six variables which represent

these characteristics.

Figure 5: Sample return of EpiTeM

The main purpose of this system is information retrieval and information extraction.

Figure 5 above shows the outcome of a search for “obesity” as exposure and

“depression” as outcome. It lists all the articles found, ordered by

significance of the hits, highest first. The table at the bottom displays the extracted

concepts of the five best matches, structured into the six attributes defined. Clicking

on one of the papers displays the abstract of the given article, with the extracted words

highlighted in different colours, one for each of the six attributes. It also

provides a link to the full article on PubMed in case the user wants to read it. This tool is

rather powerful in the sense that it can search through large amounts of data and compile

the results in a readable manner. Data extraction is not flawless however and there are

errors within the data, which would be highlighted in the abstracts. Since it is retrieval

only there are no means of correcting or changing that data. It has no advanced visuals

such as graphs either, as these are not needed for such functionality. This project

aims to improve on the downsides of the pieces of software discussed in this section,

as one of its goals is to also implement data manipulation.

Another example is the work of a former student, Elise Hein, also at the

University of Manchester. Her third-year project was to visualise extracted structured

data online, for a more convenient browsing experience. A system was implemented

using Flask for Python, which gave a graphical representation in the form of word clouds

based on user input.

Figure 6: Sample of EdVic, a tool created by Elise Hein

As seen from Figure 6 above, a platform has been created which allows the user to

query a pre-extracted database in the form of the six variables described earlier. Here a

search has been defined for “Exposure = watching television”, as displayed in the bottom

bar. The left side shows the papers the extracted information came from, and the top

bar shows how many words of each key characteristic have been extracted from these. In

the middle, a word cloud represents the extracted information

in the form of keywords, where the words mentioned most in these articles appear more

prominent. The words can be clicked and added to the search as another search variable

or, if desired, removed or even excluded from this search. A stream graph [29] can also be

dynamically created to show the significance of the words displayed over time. This

platform is excellent for a casual informative search, which could be used by a user for

example to brainstorm certain epidemiological concepts. It is also adequate for some

exploratory research, as associated keywords can easily be visualised in an aesthetically

pleasing way and articles can be accessed. Nevertheless there are some major

disadvantages with this representation. Firstly, it lacks detail as only words are displayed

and more information can only be given by hovering over one or opening the whole

article and there are no means of personalisation or storing such a search. Secondly,

means of manipulation are very sparse if any. Local filters can be defined in order to add

or ignore certain words to or from the search. The extracted data is taken as it is and no

means for error discovery or correction can be provided in such a format.

2.6 Summary

To sum up, this chapter has provided an overview of the literature research done so far. It

first identified the importance of medical literature at the current day and how

computing techniques can be used as an approach to overcome the overwhelming

amounts that exist and are published today. It then explained some of the techniques

used to provide some structure into this information using text mining and natural

language processing as well as explained how they relate to this project and what

features will be needed in order to utilise these efficiently. The three main functionalities

were identified as:

Querying the information – A query structure has been identified using six key

terms in the field of Epidemiology in order to search through the data.

Visualising – A method to output the queried data in order to be useful for

exploratory research as well as a passive decision support tool.

Manipulating – Being able to correct errors that are introduced by the

automatic extraction of data, as well as letting users add entries themselves.

Implementation methods for the visuals have been described and, lastly, existing

software more or less directly related to the functionalities of this project has been

presented and analysed.

The next chapter will be about the requirements and the design of the application in

order to meet the goals set. This includes software engineering principles as well as the

results of actual requirements gathering with stakeholders involved. It will also provide all

the visuals that aided along the implementation of the software and elaborate upon the

choices as well as provide reason for some of the actions. This includes several different

types of diagrams including UML diagrams.

3. Requirements and Design

This chapter is about the requirements of the project and how they have been

established. It will first go into the methodology of requirements gathering as well as general software

engineering principles and then move on to the actual established requirements. Then it

will describe the design derived from these.

The project will be implemented following agile software engineering guidelines.

The Waterfall method suggests gathering requirements and doing research only at the

beginning of the project and setting fixed deliverables and milestones according to that

research for the remainder of the project. It is then not possible to change these or go back and edit the

requirements. While this may seem suitable for a project which has clear requirements, it

has to be noted that in the real world things often change. New discoveries are made

every day and people find better methods or designs for doing things, or simply just

change their mind. It is therefore important to embrace that change and be open-minded

towards it. As this is one of the core principles of agile software engineering it is more

suitable for this project.

The agile approach, by contrast, suggests iterative and incremental development [30]. It

is divided into pre-defined, time-boxed iterations of a few weeks

each. Time-boxed means that if the time assigned to an iteration cannot be met,

the plan is restructured rather than the iteration extended, and objectives are

changed or moved around. Agile development suggests minimising ceremony, meaning

to minimise the paperwork such as documentation during development, and only

produce the parts that are useful to the developer to understand.

Figure 7: Agile development life cycle [57]

All other documentation is deemed optional or can be done after implementation if wanted by the

customer. The main advantage of agile is, as seen from Figure 7 above, that it is always

possible to go back and change things in the requirements or design when needed as part

of embracing change in an ever changing environment.

The Agile Manifesto [31] is a set of core values on which most agile practices are based.

It consists of four points:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

These points clearly support what was explained above and will be kept in mind during

the whole development process of the project. Working software will be prioritised over

documentation in order to show results as soon as possible.

There are several development methods based on the agile principles. These methods

suggest extra guidelines and artefacts in order to aid development. The one being used

for this project is called SCRUM [32]. More about SCRUM can be found in Appendix A.

A release plan for this project has been created according to these principles (Table 1).

Table 1: Simplified initial release plan

| Sprint | Start | Weeks | End | Description |
|---|---|---|---|---|
| 1 | 1.7. | 2 | 15.7. | Create the basic website with all its login and managerial features. |
| 2 | 16.7. | 2 | 31.7. | Implement all query functions for all tables and data involved. |
| 3 | 1.8. | 2 | 15.8. | Implement visualisations and dynamic creation of statistics. |
| 4 | 15.8. | 1 | 22.8. | Implement alteration of data; this may include pattern detection. |

The simplified release plan given above is based on the identified requirements,

which are provided in the next section, followed by the design and

implementation. The full version can be found in Appendix B. First, the basic website

has to be implemented, which will be done in the first sprint. It can then utilise the

database functions. It will be incrementally improved, first adding the query functions

and then more complex functionality on top of that. In the third sprint, visualisation

techniques are added which will help display things such as statistics. The fourth iteration

is about implementing alteration and curation of data. As mentioned in the literature

research chapter, text mining extracts are not fully accurate and contain errors. It

has become apparent that a lot of these errors are similar and propagate through the data. In this

iteration a way for the user to curate these kinds of errors will be implemented. Methods

and algorithms which can help identify such patterns based on one wrong data set,

and then find similar mistakes within the extracted data, will be taken into account.
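
As a very simple illustration of propagating one correction to similar mistakes, the following Python sketch looks for entries that carry the same wrong value once case and whitespace are ignored, and then applies a correction to the entries confirmed by the curator. The data layout (a dictionary from entry id to extracted value) and the matching rule are assumptions made for this example. The pattern detection considered for the project may be considerably more elaborate.

```python
def normalise(value):
    """Lower-case a value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def find_same_error(entries, wrong_value):
    """Return the sorted ids of entries whose value matches a known wrong value."""
    target = normalise(wrong_value)
    return sorted(eid for eid, value in entries.items()
                  if normalise(value) == target)


def apply_correction(entries, ids, corrected):
    """Return a copy of the entries with the confirmed ids set to the corrected value."""
    confirmed = set(ids)
    return {eid: (corrected if eid in confirmed else value)
            for eid, value in entries.items()}
```

Returning a copy rather than changing the entries in place keeps the original extraction available, for example for logging the alteration.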

3.1 Requirements

The initial requirements have been gathered in the form of user stories, a full list of which can

be found in Appendix C as a backlog. In this type of user interaction the user is

questioned with the intent of eventually formulating each requirement as one

sentence which is understandable to both the stakeholder and the developer.

These “stories” are then written down onto cards, which can easily be flicked through or

thoughts can be added. The general format is as follows: “As a ‘Person X’ I want to ‘Action

Y’ so I can ‘Goal Z’”. The back of the card usually holds more information useful to the

developer, associated with this story. This is a format widely used in industry as part of

agile software engineering practices as suggested by Ambler et al. (2012) [33].

Figure 8: Digitalised User Story Sample I

There are two sample user story cards from this project shown in Figures 8 and 9. The

first card shows the story which helped identify the fifth entry in the following backlog.

The front of the card shows what the developer has deduced from the interview with the

customer as well as the priority of this function. The back of the card, shown on the right

is reserved for important notes which may be implied by that feature. The second

example shows the feature identified as number eleven in the backlog below. It is about

an advanced query model, where the users can build their own queries consisting of any

combination of the six previously identified query dimensions. The back of the card lists

a number of dependencies as well as the availability of that feature. More on the realisation

of this feature will follow in the design and implementation parts of this dissertation.

The stories have then been compiled into a product backlog and assigned

priorities. The SCRUM convention also suggests assigning an estimated value to each

identified function according to how complex the task of implementing this function is.

The following table (Table 2) shows a simplified version of this product backlog.

Figure 9: Digitalised User Story Sample II

| ID | As a/an | I want to | Done criteria |
|---|---|---|---|
| 1 | user | view the website created | Made a website that contains all the features a standard website has: Intro page, tutorial page, about page, etc. |
| 2 | user | login to access curation feature | A login function that disables the curation of data unless authorised |
| 3 | admin | manage user access on the website | An admin account that can decide to restrict access to certain/all users |
| 4 | user | be able to access epidemiological data | Implemented database access |
| 5 | user | query epidemiological data by Exposure | Query for given type of epidemiological dimension is possible |
| 6 | user | query epidemiological data by Outcome | |
| 7 | user | query epidemiological data by Covariate | |
| 8 | user | query epidemiological data by Study Design | |
| 9 | user | query epidemiological data by Population | |
| 10 | user | query epidemiological data by Effect Size Type | |
| 11 | user | be able to query for epidemiological data using any combination of the mentioned above categories | Query for any combination possible |
| 12 | admin | insert new highlighted key words for a study | Logged alteration of data is possible |
| 13 | admin | alter a highlighted key word in a study | Logged alteration of data is possible |
| 14 | admin | delete a highlighted key word in a study | |
| 15 | admin | identify commonly extracted keywords | Auto suggestion upon alteration of data |

Table 2: Simplified Product Backlog

Looking at the table it is obvious that basic features have to be realised first, in order

to be able to base more sophisticated and complex ones upon them. The first three

features in this backlog are about creating the website and adding basic administrative

and interactive features. Once this has been established, more complex features such as

interacting with a database and querying can be implemented. Features 5 to 10

inclusively are about implementing the query functions for the six dimensions identified

earlier. The eleventh feature combines the query methods into one dynamically created

query where the user can combine any of the six dimensions at their convenience. From

the interviews it was established that there was a database already in place. However, in

order to be able to utilise the data more closely or even alter it, it is clear from the

requirements that a custom database needs to be created to act as a kind of mirror

image of the pre-existing one, as well as enhance it and extend the functionality and data

stored.
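
The dynamically created query of feature 11 can be sketched with Python's standard `sqlite3` module. The table name `extracts`, the column names and the one-row-per-article layout are assumptions made for this illustration; they are not the schema of the project database. Column names are checked against a fixed list and the values are bound as parameters, so that user input never becomes part of the SQL text.

```python
import sqlite3

# The six dimensions from the backlog; the column names are assumptions.
DIMENSIONS = ("exposure", "outcome", "covariate",
              "study_design", "population", "effect_size_type")


def build_query(criteria):
    """Build a parameterised SELECT from any combination of the six dimensions."""
    clauses, params = [], []
    for dimension, value in criteria.items():
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        # Column names come from the whitelist; values are bound as parameters.
        clauses.append(f"{dimension} = ?")
        params.append(value)
    sql = "SELECT article_id FROM extracts"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY article_id", params


def demo_connection():
    """Create an in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE extracts (article_id INTEGER, exposure TEXT, outcome TEXT, "
        "covariate TEXT, study_design TEXT, population TEXT, effect_size_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO extracts VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "obesity", "depression", "age", "cohort", "adults", "odds ratio"),
            (2, "obesity", "diabetes", "sex", "case-control", "adults", "relative risk"),
            (3, "smoking", "depression", "age", "cohort", "adolescents", "odds ratio"),
        ],
    )
    return conn
```

A search for obesity as exposure and depression as outcome would then be `conn.execute(*build_query({"exposure": "obesity", "outcome": "depression"}))`.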

The last three identified requirements are about the curation of one or more entries

within the database. What this means is that, as already mentioned, text mining

algorithms do not always work flawlessly and errors made during extraction may

propagate through the database. Therefore these mistakes need to be corrected as they

are encountered. The entries in the database, however, are normalised, meaning that

they have been identified to belong to certain categories or groups of words. Another

restriction is that they correspond to the text within the extracted abstract. As a result it

is important to restrict the curation process accordingly:

if the newly entered value varies too much from the abstract or even from the old

extracted value, the curation is denied. Therefore a feature to entirely delete and/or

add a newly identified extract is needed as well.
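
One possible reading of this restriction is sketched below using the standard `difflib` module. The acceptance rule (the new value must either occur in the abstract or stay close to the old extracted value) and the threshold of 0.6 are assumptions made for this illustration and are not the rule implemented in the project.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def curation_allowed(new_value, old_value, abstract, threshold=0.6):
    """Decide whether a curated value may replace the old extracted value."""
    new_value = new_value.strip()
    if not new_value:
        return False
    # A value that appears verbatim in the abstract is always acceptable.
    if new_value.lower() in abstract.lower():
        return True
    # Otherwise it must not vary too much from the old extracted value.
    return similarity(new_value, old_value) >= threshold
```

A value rejected by such a check would have to be handled through the separate delete and add features instead.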

Note that the table above covers the functional requirements.

Nevertheless, non-functional requirements are still to be identified. While doing so, it has

to be taken into account that this project is intended to be web-based, and therefore has a

slightly wider set of non-functional requirements than offline projects. The most

significant ones are in the following table (Table 3):

| ID | Requirement | Description | Priority |
|---|---|---|---|
| 1 | Manual | Clear instructions | high |
| 2 | Ease of use | Simple and effective UI | high |
| 3 | Security | Privacy | high |
| 4 | Performance | Searches and general | medium |
| 5 | User access levels | More than 2 access levels | medium |
| 6 | Customisability | Of interface | low |
| 7 | Language support | Changing languages | low |

Table 3: Non Functional Requirements

The first two non-functional requirements are closely related as they both have to do

with the usability of the product. Instructions need to be clear and the application easy to

use; however, there still needs to be a good explanation of what is going on, for example

to explain the six dimensions. Since the application is online, security is important. That

goes for all applications which may require a password. Since it is known that the search

databases tend to be rather large, performance has only been assigned a medium

priority. User access levels are also important, as data may be altered. It is necessary to

ensure that only registered users are allowed to perform alterations on the data.

Another example of a potential security hole is the interaction with the database. This

must not be done directly from the client application, as access credentials might be read

out on the client side, which may give an attacker master access to all entries in that database. How

this issue may be resolved will be discussed in the design part of the dissertation.

Although customisability and language support play a major role in the usability of the

application, they have been assigned low priorities as most features are assumed to be

straightforward.

3.2 Software Design

Here the gathered requirements from the previous section will be put into practice in

order to create a design for the software that is going to be built. It is important to

include a wide set of factors when doing this, in order to make sure that not only the

right software is built but that the software is built right. This was done by sticking to the

so-called GRASP principles, as suggested by Larman (2005), as closely as possible. These

principles dictate ways of assigning responsibilities to classes, which is a key skill in

object-oriented programming. It has been shown that even strong programmers often lack this

key skill [34], which as a result leads to a weak design.

3.2.1 Challenges of Designing a Website

Firstly, the challenges of building a website need to be taken into account, as there are some

factors that do not exist for standalone software. One of the most important ones is

security. A website always has to be secure, even when it does not store personal

information, as it has to be safe against attacks such as session hijacking, which might

re-route users to malicious websites without them being aware of it.

A login is needed to protect higher computational features of the website as well as

personal information that may be entered by users. Sophisticated login algorithms exist

which provide the website with secure handshakes between user and server. It was

debated whether to implement the login within the GWT framework, as there are more

security threats to Java and its associated web technologies [35] than there are to, say,

PHP. Nevertheless, a suitable library has been found [36] and put in place, utilising jBcrypt

for Java, a password-hashing tool which is deemed safe. Apart from that, the application will

not store any other sensitive data unencrypted, and this approach has been

taken so far.
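
The principle behind such a library is that the server stores only a salted, deliberately slow hash of the password and never the password itself. The sketch below illustrates this in Python. Note that it uses PBKDF2 from the standard library as a stand-in, since bcrypt is not part of the Python standard library; it is therefore not the jBcrypt-based implementation used in the project, and the iteration count and function names are assumptions.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; higher values make guessing slower.
ITERATIONS = 100_000


def hash_password(password, salt=None):
    """Return (salt, digest) for a password, generating a random salt if needed."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Check a login attempt against the stored salt and digest."""
    _, digest = hash_password(password, salt)
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(digest, expected_digest)
```

Only the salt and the digest would be stored in the database, so a leaked table does not directly reveal any password.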

Throughout the development of the application, however, it became apparent that the login

itself would only help to protect one feature, which is the curation and alteration of data.

It has been established that this feature should in fact not be accessible to anyone

who signs up, as it is rather powerful and could easily render the whole database

unusable with very little effort. This feature should only be accessible to the

administrator himself, or other people within the domain. Therefore the login itself was

designed a lot simpler than first intended and is only really needed when accessing this

feature.

Another challenge of building a public website is that the design of the user interface

has to suit the general public. Especially when the application itself is very

domain-specific, such as this one, which is focused on Epidemiology, it can contain

complex features and terminology that might not be easily understandable to an ordinary

user who is not involved in the domain. It is therefore important to design the

application appropriately, so that it is manageable by a wide set of users from different

backgrounds. This can be done by either adding manuals or enough help descriptions, or

by making the design itself more obvious so that it appeals to a broad spectrum of users.

In this particular case one way would be, for example, by implementing a standard search

which just queries one of the six given search dimensions to make it look closer to a

standard search like it is implemented on common websites nowadays.

3.2.2 Security

As already mentioned, security is an important factor, especially when designing a healthcare web application. As it will be designed using a client-server architecture which is accessible from anywhere in the world, it is prone to attacks. Unintentionally leaving parts of the application exposed would be critical not only to the application itself, but to the whole institution. There are several measures which can be taken in order to design the system more securely. Firstly, it is important not to give access to any feature, or even the page which holds the feature, to any user who is not authorised to use that feature. As a result potential attackers are not provided with a template for a so-called replay attack, where such an attacker could reverse engineer an authorised request and send that to the server.

It should also be mentioned that the programmer has to carefully select the platform and architecture on which the application is developed and run, and be aware of their risks and exposures. This is critical, as no platform can be assumed flawless or impenetrable, even if it has been in use for a long time. A current example of this would be the recently found “Heartbleed Bug” in SSL [37], which had been present for years but went unnoticed until recently. SSL is commonly used for “secure” communications on the web, for example for emails or instant messaging as well as for connecting to Virtual Private Networks (VPN). The weakness was found in OpenSSL, a software library used to implement SSL. It made it possible to read out parts of the memory of an affected system, including information that SSL was meant to protect, without leaving any kind of trace on the server itself.

The application should be designed keeping such issues in mind. Leaking critical information can be avoided by not sending it in the first place. This can be achieved by hashing the passwords sent for authentication with a one-way function, meaning that the raw password cannot in any way be reverse engineered from its hashed state. It is therefore sensible to implement such hashing client-side, so that none of the communication channels are ever exposed to unencrypted or un-hashed critical information.
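The one-way hashing described above can be illustrated with a minimal sketch. It assumes PBKDF2 from the standard Java library as a stand-in for the jBcrypt library the application relies on [36]; the class name, iteration count and key length are illustrative assumptions and not taken from the application. The sketch targets a regular JVM and does not address whether the same code could run inside GWT client code.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical helper, not part of the application: salted one-way password hashing.
final class PasswordHashSketch {
    private static final int ITERATIONS = 100_000; // assumed work factor
    private static final int KEY_BITS = 256;       // assumed hash length

    // Generates a fresh random salt so that equal passwords yield different hashes.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Derives a Base64-encoded hash from the password and salt.
    static String hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            return Base64.getEncoder().encodeToString(factory.generateSecret(spec).getEncoded());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available on this JVM", e);
        }
    }

    // Re-hashes the candidate with the stored salt and compares the two hashes.
    static boolean verify(char[] candidate, byte[] salt, String storedHash) {
        byte[] expected = Base64.getDecoder().decode(storedHash);
        byte[] actual = Base64.getDecoder().decode(hash(candidate, salt));
        return MessageDigest.isEqual(expected, actual);
    }
}
```

Only the salt and the derived hash would need to be stored; the raw password is never kept.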

3.3.3 Architecture

A diagram to depict the general architecture of the application has been designed in

order to give a map of what is going on within. It gives a rough overview of all major parts

of the program and how most of them interconnect. It has been used as a guideline on

how to implement the system and generally provides a useful simplified overview.

Figure 10 below clearly indicates a split of the application into three tiers. The top tier consists of the databases external to the system, which will be interacted with. Note that there are three of them, which adds a rather high level of complexity, since some operations may need to interact with more than one database at the same time. The first database, labelled “Database of Extracted MEDLINE Concepts”, contains all the pre-extracted data from MEDLINE abstracts as a result of a text mining algorithm. It stores them in several tables, one for each dimension. The second database, labelled “Database of MEDLINE Titles and Abstracts”, holds information about the abstracts and the data itself, which has been used for text mining. After a query is completed, more detailed information about the results will have to be taken from this database. The third and last database is labelled “Application Database”. It holds information specific to the application itself. This may include login data for higher level access, logs for the curation of data to keep track of what has been changed, as well as copies of the first database or general indexing which is used to speed up queries.

The second tier and central part of the system is the application server itself. Each

large box here can be translated into a package which may be implemented. In the top part

of the server the database connectors can be seen, which will be used to access and query

the databases involved. Note that there may be more than one connector for each

database. On the one hand only one column may be needed and on the other a large

proportion of columns may need to be selected. There are different ways of retrieving

these entries efficiently, depending on the request. On the bottom half of the server an

example of the services that are provided to the client can be seen. The query service

itself is connected to the database connectors and knows how to use them in order to

retrieve the needed data. The server also knows how to pack the retrieved result data

into given data types which can easily be accessed and identified by the client. These data

types are stored in a model package. Note that there should only be one implementation

of the data types within the model package in order to ensure consistency between client

and server, hence the dashed line. This can easily be achieved by establishing the data

types upon deployment on both sides. The last two services in the diagram are the login

and the curation service, which should only be used in combination. It should be noted

that both of these need some kind of database access; however, the connections are left

out of the diagram for simplicity's sake.

Figure 10: Architecture overview

The third tier in the diagram is the client application. It will connect and talk to the

server via an established internet connection. It has to implement the necessary client

side service equivalents in order to be able to access the right implementations on the

server. Therefore the query methods have to be implemented on the client so it can send

the right query information to the server. It also has knowledge of the models used and is

able to pack the queries into the right format for the server to read, as well as make

proper use of the responses sent by the server. The client also has user interface classes,

which hold information about interaction functionality with the user, for the parts which

need direct server interaction, i.e. tables and forms. This client application is intended to

run on top of some kind of HTML implementation so it can be read out by any common

browser. More about the tools which will be used to achieve this can be found in the

fourth chapter of this dissertation.

3.3.4 Domain Diagram

A simple domain diagram was created in order to make it clear what was needed and

bring the idea closer to an implementation.

Figure 11: Domain Class Diagram

Looking at the diagram above (Figure 11) it can be seen that classes have been circled and grouped into a certain pattern in order to highlight which part of the application they belong to. Starting from the bottom, the part of the application which belongs to the client has been outlined. The Model-View-Controller separation within the classes defined is clearly visible. The user interface provides the views and is connected to a controller, which controls which parts of the implementation are accessed. In the requirements section it has been established that a certain number of query methods are needed for the application: one for each of the six epidemiological dimensions, as well as another one, which is able to combine any of these dimensions. A strategy pattern [34], as suggested by Larman (2006), has been chosen to implement the different parts of the search required. An interface (Search Strategy) defines which methods need to be implemented for each of the search strategies. The main difference between the different strategies is how the query for the database itself is assembled. Therefore this method needs to be re-implemented by every strategy implementing this interface. Here “Basic Search” is the class that handles a query where one or more of the six dimensions can be queried for, whereas “Advanced Search” is able to dynamically handle any number of entries for any of the six dimensions. How the query is assembled in detail will be discussed in the Implementation chapter of this dissertation. The next item in the client section is the “Result Formatter”, which is responsible for putting the results or responses from the server into a readable format, such as a table or other visualisations. It then passes them on to the user interface through the controller. The “Curator” is responsible for handling the client-side curation requests that a user might make.
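The strategy pattern used for the search can be sketched as follows. This is a minimal illustration, not the application's code: the method name `assembleQuery`, the `SearchController` class and the way conditions are combined (AND across dimensions, OR between several entries of one dimension) are assumptions made for the example, and the output is only a condition string with `?` placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Common interface: each strategy decides how the query condition is assembled.
interface SearchStrategy {
    String assembleQuery(Map<String, List<String>> termsByDimension);
}

// Basic search: at most one term per dimension, dimensions combined with AND.
final class BasicSearch implements SearchStrategy {
    @Override
    public String assembleQuery(Map<String, List<String>> termsByDimension) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : termsByDimension.entrySet()) {
            if (e.getValue().size() > 1) {
                throw new IllegalArgumentException("basic search takes one term per dimension");
            }
            if (!e.getValue().isEmpty()) {
                parts.add(e.getKey() + " = ?");
            }
        }
        return String.join(" AND ", parts);
    }
}

// Advanced search: any number of terms per dimension (assumed: OR within a dimension).
final class AdvancedSearch implements SearchStrategy {
    @Override
    public String assembleQuery(Map<String, List<String>> termsByDimension) {
        List<String> groups = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : termsByDimension.entrySet()) {
            List<String> alternatives = new ArrayList<>();
            for (int i = 0; i < e.getValue().size(); i++) {
                alternatives.add(e.getKey() + " = ?");
            }
            if (!alternatives.isEmpty()) {
                groups.add("(" + String.join(" OR ", alternatives) + ")");
            }
        }
        return String.join(" AND ", groups);
    }
}

// The caller depends only on the interface, so strategies can be swapped freely.
final class SearchController {
    private final SearchStrategy strategy;

    SearchController(SearchStrategy strategy) {
        this.strategy = strategy;
    }

    String buildCondition(Map<String, List<String>> terms) {
        return strategy.assembleQuery(terms);
    }
}
```

Because only the assembly step differs, a further strategy could be added without touching the controller or the user interface.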

The Model defines the data models sent between server and client and other models

which may be used within the application. A query is assembled by the different query

strategies defined and then sent to the server. The server would then need to

disassemble that query and reply. The responses are formed using a Result model and it

contains whatever a query may have produced. The result data is then assembled on the

client side in whichever way it may be needed. The third model depicted holds the

information for a curation request. This would hold information such as unique IDs, the

old word and the new one which will be replacing it and whether the user wants to

specify use of any special algorithms to find more entries similar to the one being

replaced.

The top of the diagram shows the main parts of the server such as the “Query

Service” which is responsible for forwarding a query to the right databases if needed. This

will be done by a query engine which it implements. It should forward its request to the

right database connectors, which are not shown in this diagram since a domain diagram

only depicts conceptual classes. The next part is the “Results Parser” which

translates the response from the database into something that is simpler to handle and

access from within the system. The last piece is the “Curation Service” which facilitates

curation requests from the client. It implements an algorithm to identify similar entries to

the word or entity which is specified to be wrong and therefore needs to be changed

within the database. Note that this algorithm is meant to be optionally pluggable, and if

not enabled, only one entry may be changed.

It needs to be mentioned that user-specific curations have also been considered while planning and designing; however, they are rather complex and would require too much time to be properly realised, and have therefore been left as a future improvement idea. The idea behind user-specific curations is to have some kind of database entry for each registered user holding the curations they have done. These curations would only be visible to the individual user. However, such curations may take up an exorbitant amount of space, and storing them in a separate table would make the query process a lot more complicated, since the process would now be different for each user.

Lastly, note that login functionality is not included in the diagram in order to keep it simple. It should be mentioned, however, that it is designed only for pre-registered administrators, since the curation function itself, which is the main reason a login is needed, can cause errors in large parts of the database and render it unusable if misused. The database can always be rolled back; however, that would cause all the previously applied curations to be lost if they are not backed up separately. It is therefore important to only enable this feature within a trusted circle and to prompt for a password when this feature is to be used. It may be extended in future so that the detection algorithm may only be used by administrators, while single curations of one entry can still be done by automatically registered users. Individual registration will not be implemented as such; however, a basic form of authentication which can be extended later is planned and part of the domain. Sessions will be implemented; however, they expire after one curation in order to prevent accidental curations.
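The idea of a session that expires after a single curation can be sketched as below. The class `OneShotSessionStore` and its method names are hypothetical; the sketch keeps tokens in memory and leaves out the password check that would precede `open()`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical session store: a token authorises exactly one curation.
final class OneShotSessionStore {
    private final Set<String> activeTokens = new HashSet<>();

    // Called after a successful password prompt; returns a fresh session token.
    synchronized String open() {
        String token = UUID.randomUUID().toString();
        activeTokens.add(token);
        return token;
    }

    // Called by the curation service. The token is removed on first use,
    // so a second curation with the same token is rejected.
    synchronized boolean consume(String token) {
        return activeTokens.remove(token);
    }
}
```

With such a store, every further curation forces a new password prompt, which is what guards against accidental repeated changes.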

3.3.5 Sequence Diagram

Sequence diagrams have been created in order to make the most complex parts of the application clearer and to depict how parts of the application are intended to interact once assembled. These diagrams are especially useful since they show the intended internal workings of the program. Two of these diagrams are provided here.

The first diagram (Figure 12) shows how a standard query is done in the system.

Figure 12: Basic Search Sequence Diagram

First the user types what they want to search for into a query field. The controller forwards this information to the right function within the client, in this case the “Basic Search” function. A Query object is compiled and sent to the server. On the server side the query is disassembled and forwarded to the actual database(s) needed. The connections between server and database as well as client are closed and the user is notified that the query has been submitted, for example by displaying a loading screen until the results find their way back to the client-side interface. First the raw result information is sent back to the server, which compiles a Result object and sends it back to the client. On the client side the results are put into the right format for displaying, i.e. a table or diagram, depending on what has been requested. This is forwarded to the controller and interface, and the display is updated for the user to see.

The next sequence diagram below (Figure 13) shows how a curation is done within

the system. Note that in order to do curations, the details of a found article have to be

requested first. This is not shown in this sequence diagram; however, a full version can be

found in Appendix D. After requesting the details, the abstract and some statistics will

be displayed and the user can see which words within the abstract belong to which

search dimension. If one of the extracted words seems erroneous, the user can choose to

change this word. A prompt will appear to enter what the extracted word should be or an

option to get rid of this extraction as a whole. After the details are entered the request is

sent to the Curator on the client side, which compiles a Curation Request object and

sends it to the server. Now, while entering the details of the curation, the user would

have been asked whether they want to replace all occurrences of that word within that

search dimension in the table, for example in form of a check box. If this was ticked, an

algorithm will be run which will identify similar entries within the database. The Curation

Service on the server will then compile and send a request to update a log, where all

curations are kept, as well as execute the given curation on the database. Once this has been

successful, a confirmation will be sent to the user. The user can then verify whether the

curation was successful by refreshing the details of the given found article.

Figure 13: Curation Sequence Diagram
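The server-side part of the curation sequence can be sketched as follows. All class, field and method names are assumptions made for illustration. An in-memory map stands in for one dimension table, and a case-insensitive comparison stands in for the similarity algorithm, which the dissertation does not specify at this point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Pluggable algorithm that finds entries similar to the word being replaced.
interface SimilarityFinder {
    List<Integer> findSimilar(Map<Integer, String> entries, String word);
}

// Hypothetical example finder: treats words as similar if they differ only in case.
final class CaseInsensitiveFinder implements SimilarityFinder {
    @Override
    public List<Integer> findSimilar(Map<Integer, String> entries, String word) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> e : entries.entrySet()) {
            if (e.getValue().equalsIgnoreCase(word)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }
}

// Data sent from the client-side Curator to the server.
final class CurationRequest {
    final int entryId;
    final String oldWord;
    final String newWord;
    final boolean replaceSimilar; // the check box mentioned above

    CurationRequest(int entryId, String oldWord, String newWord, boolean replaceSimilar) {
        this.entryId = entryId;
        this.oldWord = oldWord;
        this.newWord = newWord;
        this.replaceSimilar = replaceSimilar;
    }
}

final class CurationServiceSketch {
    private final Map<Integer, String> table; // stand-in for one dimension table
    private final SimilarityFinder finder;    // null when no algorithm is plugged in
    final List<String> log = new ArrayList<>();

    CurationServiceSketch(Map<Integer, String> table, SimilarityFinder finder) {
        this.table = table;
        this.finder = finder;
    }

    // Logs and applies the curation; returns the ids of all changed entries.
    List<Integer> apply(CurationRequest request) {
        if (!table.containsKey(request.entryId)) {
            throw new IllegalArgumentException("unknown entry " + request.entryId);
        }
        List<Integer> targets = new ArrayList<>();
        targets.add(request.entryId);
        if (request.replaceSimilar && finder != null) {
            for (Integer id : finder.findSimilar(table, request.oldWord)) {
                if (!targets.contains(id)) {
                    targets.add(id);
                }
            }
        }
        for (Integer id : targets) {
            // Keep the old value so that the change can be rolled back later.
            log.add(id + ": " + table.get(id) + " -> " + request.newWord);
            table.put(id, request.newWord);
        }
        return targets;
    }
}
```

Passing `null` as the finder corresponds to the algorithm not being enabled, in which case only the selected entry is changed.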

3.3 User Interface Design

Before designing the user interface, the candidate tools were researched in order to see what is and is not viable within the technologies that were up for selection, and this was taken into account when creating the design. More about which tools have been chosen can be found in the next chapter. The available interface elements for the selected technology have been looked at and arranged accordingly in initial design drawings. This section will briefly explain two of these concept drawings.

Figure 14: Main Page User Interface Drawing

The first drawing above (Figure 14) shows the main design of the website.

Note that three of the five points elaborated in Section 2.4 of this dissertation have especially been taken into account when designing the website. These were: Familiar Conceptual Models, Consistency and Simplicity. Firstly, the features of the website have to be easy to find for the common user. This is done by putting a navigation bar on top, which is a familiar concept adopted by most websites nowadays. The buttons in this bar will direct to the implemented search functions or help pages as well as a welcome page. Above the menu bar the logo and the title of the website are located. The basic search feature is also shown. One of the dimensions will be displayed as standard, and can optionally be expanded to show the fields for all six of them. The middle part of the drawing shows a table that holds the results of a search. Note that there is a division into pages at the bottom of the table, so that not all the results are displayed at once and the user can browse the results more easily. The part to the right of the table is reserved for certain statistics to be displayed, which may be useful to the user. The lower part of the page depicts a concept feature, where the user might run more than one search in parallel and can switch between the results shown. It is intended that the bar also displays whether a query has completed or is still running. This feature is theoretically supported by the technology chosen. The bottom of the page will give some general information about the creator, the institution as well as the server status of the database used. Also note that there is padding to the left and the right of the drawing in the browser, as this makes the website look better spaced out and less cluttered.

An entry of the results table can be selected, which will open a pop-up holding details about the found paper. The design of this pop-up can be seen in the drawing below (Figure 15). The top shows the title of the paper and provides a link to it on PubMed. Collaborators of the paper, as well as the abstract itself, will be displayed. The extracted key words within the abstract can be highlighted by selecting the button of a dimension on the right. Below the abstract there is some space for statistics about this specific paper. The button for curation is located at the bottom. A help button is provided which displays a manual for the pop-up. Lastly, the close button on the bottom right will take the user back to the main page containing the results table.

Figure 15: Details Page User Interface Drawing

3.4 Database

Figure 16: Application Specific Database Design Diagram

The diagram above (Figure 16) shows the database created specifically for the

application. It can be seen that one table has been created for each of the main search

dimensions: covariate, effect size, exposure, outcome, population and study design. Each

entry in one of the tables is linked to the actual abstract from which it has been extracted

via the so-called “PMID” column. It is a unique identifier for each abstract and links

to the PMID table, where it is a primary key. This table also holds which year this abstract

is from. Note that in the six tables for the dimensions there may be multiple entries with

the same PMID, as more than one word belonging to the same category may have been

extracted from that abstract. The highlights table simply stores the ranges in character

count, to highlight within that abstract and to which dimension the marked word

belongs. This is useful for the highlighting feature on the details pop-up discussed earlier,

as it facilitates highlighting the extracted words easily by providing a sort of mapping, as

well as putting them into categories, in case the dimensions need to be highlighted in

different colours.
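How the stored character ranges could drive the highlighting can be sketched as below. The class names are hypothetical, and the convention that a range has an inclusive start and an exclusive end offset is an assumption; the sketch wraps each range in a textual `[dimension:...]` marker where the real interface would apply a colour.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// One row of the highlights table: a character range and the dimension it belongs to.
final class HighlightRange {
    final int start;        // inclusive character offset (assumed convention)
    final int end;          // exclusive character offset (assumed convention)
    final String dimension;

    HighlightRange(int start, int end, String dimension) {
        this.start = start;
        this.end = end;
        this.dimension = dimension;
    }
}

final class HighlighterSketch {
    // Wraps every stored range in a [dimension:...] marker; ranges must not overlap.
    static String mark(String abstractText, List<HighlightRange> ranges) {
        List<HighlightRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((HighlightRange r) -> r.start));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (HighlightRange r : sorted) {
            if (r.start < cursor || r.end < r.start || r.end > abstractText.length()) {
                throw new IllegalArgumentException("overlapping or out-of-bounds range");
            }
            out.append(abstractText, cursor, r.start); // unmarked text before the range
            out.append('[').append(r.dimension).append(':')
               .append(abstractText, r.start, r.end).append(']');
            cursor = r.end;
        }
        out.append(abstractText, cursor, abstractText.length()); // remaining text
        return out.toString();
    }
}
```

Since every range carries its dimension, the same loop could choose a different colour per dimension instead of a textual marker.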

There are two more tables. The first one holds information about which entries in the six tables that hold the extracted information have been changed by user curations. The old as well as the new values are stored. It is important to keep track of these alterations, as they may have been made by mistake and the entries may need to be rolled back. Note that this will also affect the highlights; therefore the values in there have to be changed back to the old values as well on recovery. The user who has done the curation is also kept track of. This keeps the design extensible, as more users may be added as the system expands. The last table keeps track of the user data itself. This includes entries such as user names, password hashes and email addresses. The user names are used as a foreign key for the curation table.

There is potential to index these tables and as a result speed up querying. In a technology such as SQL, however, a primary key or unique index requires that each entry in the indexed column holds a unique value; the index itself is built internally as a tree-like structure from the data [38]. This is true for the PMID table, as the identifiers are stored there in such a way that they never repeat. It does not, however, apply to any of the six tables which hold the extracted data themselves. SQL allows the user to define an index upon two columns [39], which combined could make up a unique value. This, however, cannot be done without defining a new arbitrary data column, and is therefore not recommended as it would introduce a large amount of clutter within the data.

Note that there are two more databases not shown in the diagram. The first one is

the database which is used by the feature extraction process. The database for this

application is merely a copy of that. This is due to the extraction process being costly and

the curation feature being a rather powerful modifier. It makes sense to protect the

original data by not altering it. The third database which is not shown in the diagram is

holding the abstracts of each paper itself as a string and is only queried if details for a

found entry are requested. This database is updated regularly as new papers are being

published. Therefore it does not make sense to copy the values of this database into the

application specific one as it would make the whole application harder to maintain.

4. Implementation

This chapter is about the implementation of the application itself. First it will talk about the tools available to achieve the given goals. It will then discuss which tools have been chosen to get the implementation done and elaborate on why they are fitting for the domain. Next it will focus on the implementation itself, stating what has been created and how. It will explain how the code has been extended in order to achieve the goals and deliver a working product which is flexible with regard to future changes and extensions.

4.1 Tools

This part will discuss the available tools that have been researched for the project. It will

first give an overview of the researched web development tools that are available and

elaborate on the choices taken as well as give reasons for these. Next it will present the different

types of database technologies available for use and justify the choice made. Lastly, it

will introduce a tool used for database interaction with the chosen technology.

4.1.1 Current state of the art Web Technology

Since the invention of the World Wide Web in 1989 by Tim Berners-Lee [40], it has been ever evolving. Particularly in the past six years, more technology has been added than ever before. The types of interactions with a browser have never been so versatile and interactive. The following graph shows available technologies since 1991, where each line represents such a technology, and where a line overlaps with a browser, that browser adopted support for that technology. Note that the top three timelines represent Netscape, Internet Explorer and Opera respectively but are cut off before 2000.

Figure 17: The evolution of the Web in the past 15 years [59]

A number of web development tools have been researched and looked into in order

to determine which one is most appropriate, not only for the project, but to the author’s

level of experience. The following table (Table 4) will give an overview of these tools.

| Tool name | Features | Notes |
|---|---|---|
| Aptana Studio | HTML, CSS, JavaScript; code assist; deployment wizard; integrated debugger; Git integration; available as plugin for Eclipse | Standalone editor or plugin. A wide set of helpful tools and good community support. 100% free and open source. |
| Google Web Toolkit | SDK provides Java APIs and widgets; full web support including AJAX and JavaScript; automatic deployment of apps, working on all major browsers and platforms; debug in any IDE; available as plugin for Eclipse | Open source, created by Google. Widely used even by companies. Wide set of extensions and APIs. Integrates seamlessly with most IDEs. Automatic deployment of server and client side application using Java and JavaScript. |
| ASP.NET | Provides full framework for apps that require client and server side computation; for MS Visual Studio only, therefore support for all major programming languages; Git, debugger, etc. integrate in MS VS | Tool by Microsoft. Widely used. Good support. Fitted for applications with heavy or complex computation and interactions. Development only supported on Windows. |
| Java on the Server - NetBeans | Official Java EE and web development plugin for NetBeans IDE; HTML, CSS, JavaScript and PHP; code assist; Git, debugger included in IDE | Official open source plugin for NetBeans developed by the community. Good community support. |
| Adobe Dreamweaver | Supports all major web technologies; longest on the market; provides wide set of templates and integrated features to design websites | Probably the oldest tool, however mostly disliked by the community, as it is not open source given what it provides nowadays. |
| IntelliJ IDEA (JetBrains) | Very good tool for Java and JavaScript development; HTML, PHP support included (CSS paid); integrated database tools for SQL (paid); UML diagram designer; auto refactoring; Git support | A very popular and intelligent tool. Comes with a lot of helpful tools built in. However, most special features require a paid version. |

Table 4: Comparison of Web Development Tools

4.1.2 Google Web Toolkit

Google Web Toolkit (GWT) [41] has been chosen to build the software, as it is one of the more widespread development tools and there is substantial documentation available, which sets it ahead of some of the newer tools mentioned above. It provides a full client-server framework for web applications using Java. Services are implemented on the server side and can be accessed in the client code using standard Java conventions, which is perfectly suitable for the scope of this project. The client is implemented in conventional Java as well, and a full custom API for visualisations is provided [42]. This is a positive feature as the developer is fairly comfortable with Java. On compilation, the client code is converted to JavaScript, which is supported by all conventional browsers and platforms including mobiles. This platform-wide flexibility is one of the main reasons in favour of Google Web Toolkit, as it can easily be deployed and set up for a wide set of users. The services that the client accesses will be automatically converted into Java Remote Procedure Calls [43], so that the client can communicate with the server. GWT has an official plugin for the Eclipse IDE [44], one of the Integrated Development Environments the author has a good amount of experience with, which will be used for this project. It provides integrated support for version control, which will be useful for the development of this application, as progress can be tracked and rolled back if needed.

4.1.3 Relational vs. Non-Relational Database Technologies

Non-relational databases provide efficient query techniques if there is no direct relation from one dataset or table to another. Examples of such non-relational data stores would be CSV files or general spreadsheets. There are some more sophisticated non-relational databases with online availability, such as NoSQL systems [45], whose goal it is to maximise simplicity within the design or to provide better scaling as datasets become larger. Nevertheless, non-relational databases are not really suited for this type of application, since the main part of the data which is used is spread over six tables, which are all related. Therefore a relational implementation was chosen. A wide set of technologies was available; however, due to the dependency of this application on external databases, MySQL was the obvious choice. This is mainly due to compatibility with the already existing databases, as no additional wrappers or bridges would be needed to realise the application-specific database. Not only does it provide all the features needed to satisfy the domain of the project, but the author is also fairly confident in using that technology.

4.1.4 Jenny and JDBC

Java Database Connectivity (or JDBC) is the native API in Java that allows it to connect to and interact with external databases [46], for example MySQL. It basically functions by compiling MySQL commands, which can then be sent to a server, given that a connection has been established through the API first. Jenny is a tool developed by the JavaRanch community [47]. It is able to read out a database structure and generate corresponding code to access and manipulate each table individually a bit more efficiently. However, most of the time one instruction will not be enough. Therefore this code can be put together in order to fulfil more complex instructions on the data and utilise the information properly.

It needs to be noted that establishing a connection to the database in order to run one query is quite a costly process. Jenny makes interaction more efficient when a single query returns a single result set. However, if a query for a wide set of results is needed which accesses several tables or databases, and the query can be defined within one MySQL command, it is more efficient to send it directly via the JDBC connector, since Jenny would need to fire a separate command for each table or database within the query. It is therefore more efficient to use one of the two technologies in some cases and to mix them in others, depending on what is needed. One example would be the single curation, which is more efficiently done with Jenny. However, if a query is compiled for a whole set of entries, that query might as well be fired directly.
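The case in which one command sent directly over JDBC is preferable can be sketched as follows. The table names with underscores, the `pmid` table alias and the `value` column are assumptions made for the example and may differ from the actual schema; only the standard `java.sql` API is used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class JoinedQuerySketch {
    // Assumed table names for the six dimensions; the real schema may differ.
    private static final List<String> DIMENSION_TABLES = Arrays.asList(
        "covariate", "effect_size", "exposure", "outcome", "population", "study_design");

    // Builds one statement that joins every requested dimension table on the PMID.
    static String buildSql(List<String> tables) {
        StringBuilder sql = new StringBuilder("SELECT DISTINCT p.pmid FROM pmid p");
        for (int i = 0; i < tables.size(); i++) {
            String table = tables.get(i);
            if (!DIMENSION_TABLES.contains(table)) {
                throw new IllegalArgumentException("unknown dimension table: " + table);
            }
            String alias = "t" + i;
            sql.append(" JOIN ").append(table).append(' ').append(alias)
               .append(" ON ").append(alias).append(".pmid = p.pmid AND ")
               .append(alias).append(".value = ?");
        }
        return sql.toString();
    }

    // Runs the joined statement over one open connection, i.e. a single command
    // instead of one command per table.
    static List<Long> findPmids(Connection connection, List<String> tables, List<String> terms)
            throws SQLException {
        if (tables.size() != terms.size()) {
            throw new IllegalArgumentException("one search term per table is required");
        }
        List<Long> pmids = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(buildSql(tables))) {
            for (int i = 0; i < terms.size(); i++) {
                statement.setString(i + 1, terms.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    pmids.add(rows.getLong(1));
                }
            }
        }
        return pmids;
    }
}
```

Table names are checked against a fixed list because they cannot be passed as statement parameters, while the search terms themselves are bound through placeholders.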

4.2 Core Implementations

To get started with GWT an IDE has to be prepared and all needed plugins have to be

installed. Then a new project can be created. GWT for Eclipse, as used for this

project, will automatically produce all the needed packages and folders and put them in

the proper structure. The picture below (Figure 18) shows the tree generated by such a

GWT project.

Figure 18: GWT Application Structure Tree

The “src” folder contains the implemented source code, divided into packages. Note that packages have been added and extended in order to give a reasonable structure to all the classes created. A test folder is also generated and was later populated with tests. A number of imported Java libraries needed for the implementation are located in the middle of the tree. The last important part to point out is the “war” folder, as it contains all HTML parts and XML link tables that are created during runtime and loaded by the client. An entry point for the application is defined, which is then called inside the HTML class that loads the application into the browser. Note that the CSS is also contained in that folder. More about GWT and how the implementation corresponds to the design can be found in Appendix E.

It needs to be mentioned that the user interface in GWT is built on a modular basis, from modules called “Widgets”, rather than an object-based one. GWT is able to identify the module that needs the retrieved information and forward it there automatically. The module itself can then decide how it displays the data and how it is laid out. The routing is achieved via so-called Data Listeners, to which the result data from the server is forwarded.

4.3 Client-Server Interaction in GWT

Google Web Toolkit wraps all interaction between the client and the server into so-called Java Remote Procedure Calls, which are also generally used for inter-process communication and method invocation [48]. The interaction is implemented by following a certain set of rules; the resulting code can be found within the service package. Figure 19 below gives an overview of the interfaces and classes implemented in order to achieve this, where the dotted boxes represent interfaces and the solid ones represent one or more classes. More about how this is done can be found in Appendix F.

Figure 19: General Simplified Application Class Diagram

4.4 Client Implementation

The client-side implementation has been done on a modular basis, meaning that modules, also called “Widgets”, have been constructed and arranged accordingly. The highest-level module is the menu, which enables the user to switch between the different features implemented. Apart from the textual parts of the application, which include the home page, about page and help pages, there are several computationally complex features that required more complex functionality to process the server’s response on the client side. The first feature is the standard search. It consists of six fields, one for each of the search dimensions, and a search button. The fields have been arranged in a table, to which a so-called CSS-tag has been added, so that it can be styled further using the CSS implementation of the project. The words to be searched for can be entered into the fields. When the search button is clicked, the module compiles a query request object, wrapping the entered words inside objects, and sends it to the server.

4.4.1 Results Table Widget

A widget to display the results has also been created. It is added onto the standard search module once a response is received. This widget defines how the results are displayed and arranged. It is implemented using the “GWT Cell Table”, which requires the data to be linked directly. The table displays the PMID, year, name and, if applicable, type of study of the found abstract. The maximum number of rows per page has been set to 25 and the pages are made accessible using a Pager. This is very efficient, and pages can be browsed through in real time without needing to refresh the whole page. Each column within the table can be made sortable. However, since the data contained in the table is retrieved from the server, sorting is done through a data handler and needed to be implemented on the server side. This is because there is more than one page: a local sort would only sort the page currently displayed. The major advantage of implementing the result table as a separate widget is its reusability. Other search features can simply create their own instance of such a table and populate it on their own page.
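The reason for sorting on the server can be illustrated with a minimal plain-Java sketch. It is not the project's code: the class names `ResultRow` and `ResultPager` and their fields are assumptions made for illustration, and only the page size of 25 is taken from the text. The complete result list is sorted first and only then cut into pages, so that the order holds across all pages.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical result row; the field names are assumptions for illustration. */
final class ResultRow {
    final int pmid;
    final int year;
    final String title;

    ResultRow(int pmid, int year, String title) {
        this.pmid = pmid;
        this.year = year;
        this.title = title;
    }
}

final class ResultPager {
    static final int PAGE_SIZE = 25; // rows per page, as stated in the text

    /** Sorts the complete result list by year, then returns one page of it. */
    static List<ResultRow> sortedPage(List<ResultRow> allRows, int pageIndex, boolean ascending) {
        List<ResultRow> sorted = new ArrayList<>(allRows);
        Comparator<ResultRow> byYear = Comparator.comparingInt(r -> r.year);
        sorted.sort(ascending ? byYear : byYear.reversed());

        // Clamp the requested window to the bounds of the list.
        int from = Math.min(Math.max(0, pageIndex) * PAGE_SIZE, sorted.size());
        int to = Math.min(from + PAGE_SIZE, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

A call such as `ResultPager.sortedPage(rows, 1, true)` would return the second page of the globally sorted list, whereas sorting only the 25 rows currently held by the client would leave the other pages in their old order.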

4.4.2 Advanced Search Feature

The results table widget has been reused for the dynamic search feature. The goal of this feature was to be able to create a query in which all six dimensions can be combined in any way. This includes stacking more than one word for the same dimension, using the operators “AND”, “OR” or “NOT”. This has been realised using a GWT Flex Table, which can be extended dynamically. Each row provides the user with a choice of operator and dimension, and a field for the word itself. Rows can be added to or removed from the query dynamically. When the search button is clicked, the current state of the table is read out and analysed. A query object is then created and sent to the server. It should also be noted that a loading animation is shown while the client is waiting for a server response.
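How the state of such a table might be read into a query object can be sketched in plain Java, without the GWT widgets. The names `Operator`, `QueryRow` and `AdvancedQuery` are hypothetical and are not the model classes of the project; the dimension is kept as free text because the exact representation used is not stated here. Rows without a word are skipped, which matches the behaviour described for the advanced search in the results chapter.

```java
import java.util.ArrayList;
import java.util.List;

/** Operators offered per row, as named in the text. */
enum Operator { AND, OR, NOT }

/** One row of the search form (hypothetical model class). */
final class QueryRow {
    final Operator operator;
    final String dimension;
    final String word;

    QueryRow(Operator operator, String dimension, String word) {
        this.operator = operator;
        this.dimension = dimension;
        this.word = word;
    }
}

/** Query object assembled from the rows of the form (hypothetical). */
final class AdvancedQuery {
    private final List<QueryRow> rows = new ArrayList<>();

    /** Adds a row unless its word is missing or blank; such rows are ignored. */
    void addRow(Operator operator, String dimension, String word) {
        if (word == null || word.trim().isEmpty()) {
            return;
        }
        rows.add(new QueryRow(operator, dimension, word.trim()));
    }

    /** Returns a copy so callers cannot change the query's internal state. */
    List<QueryRow> rows() {
        return new ArrayList<>(rows);
    }
}
```

Keeping the operator on each row, rather than between rows, means that adding or removing a row never requires the remaining rows to be rewritten.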

4.4.3 Details Page Widget with Curation

The table itself has been given a listener, which enables each individual row to be clicked. This is needed when the user wants to access the details of a certain paper that has been found. When a row is selected, a pop-up appears. This is realised using a GWT Dialog Box, which opens as a kind of overlay and greys out the rest of the page in the background. A detail request containing information about the selected row is formed and sent to the server. The details page itself looks as shown in the design drawing (Figure 15), with title and collaborators on top, abstract and highlighters in the middle and a statistics table at the bottom. The table shows all features extracted from this abstract as well as the dimension to which they belong.

A series of buttons used for curation is located underneath the statistics table. An entry in the table can be curated by selecting that entry and then hitting the “Curate Selected Extract” button. This replaces the table with a form in which the user can specify the information with which that entry should be replaced. Alternatively, there is a checkbox for the case that the entire extract is wrong and needs to be deleted. Another checkbox enables the option to run a detection algorithm on the server, which will also curate similar entries. Fields for username and password are also provided. Note that the password is not displayed when entered and is hashed before it is sent to the server as part of the curation request. Two other types of curation have also been made possible. The first one is adding new extracts from the current abstract. When the “Add Extract” button is clicked, the user can select a dimension and enter a word which is to be added to the database. Before a request is formed, a check is done whether the entered word is found within the abstract. The option to enter some extra information about this abstract is also provided.
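The two client-side checks described above, hashing the password and verifying that a new extract occurs in the abstract, can be sketched as follows. This is an illustration only: the class `CurationChecks` is hypothetical, the text does not state which hash function was used, so SHA-256 is an assumption, and the case-insensitive comparison is likewise an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class CurationChecks {

    /** Hex-encoded digest of the password; the choice of SHA-256 is an assumption. */
    static String hashPassword(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not occur.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Case-insensitive check that the entered word occurs in the abstract text. */
    static boolean occursInAbstract(String abstractText, String word) {
        if (abstractText == null || word == null || word.trim().isEmpty()) {
            return false;
        }
        String haystack = abstractText.toLowerCase(Locale.ROOT);
        String needle = word.trim().toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }
}
```

Hashing on the client only keeps the plain password out of the request; a salted, deliberately slow hash on the server would be the stronger design if the account system were extended.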

4.4.4 Future Client Improvements

It needs to be mentioned that the design of this system is open to future additions, as new modules or widgets can simply be added without affecting existing ones. One addition that needs to be pointed out is the ability to add more statistics to the search pages themselves. This can be done, for example, by using diagrams. Data can be requested from the server and laid out according to the needs of the diagram library used with GWT. It should also be mentioned that most widgets and HTML variables of the application have been given a CSS-tag so that an extended design can be specified within the project’s CSS file. The main advantage of this is that parts of the application can be flexibly redesigned visually without affecting functionality.

4.5 Server Implementation

It was established in the design chapter that all database interaction has to be handled through the server for security reasons. Therefore all database connectors are implemented on the server side. There are a total of three databases that the server needs to interact with. The first contains all the data extracted from the abstracts as a result of text mining. The second is the application-specific database. Most tables from the first database are copied into the application-specific database so that they can be edited without affecting the original data. This database also holds information about the curation done as well as user accounts. Note that this reduces the number of databases that need to be accessed per query and is therefore more efficient. The third database contains information about the abstracts themselves. This includes titles, collaborators and year of publication.

4.5.1 Database Connectors

The server can connect to each database in two different ways, the first being the database connector created by Jenny. It provides a class for each table in each database and an efficient way to query for each row individually. The second database connector was made using the Java Database Connectivity (JDBC) library directly. Provided with the login credentials, it can connect to any of the three databases. An SQL query can be passed to it directly as a string, which is then executed on the database itself. JDBC then assembles a Java ResultSet. This set can be iterated sequentially and contains a string for each result within a column of a row.

It should be mentioned that a slight inefficiency in the way the server retrieves the result table data for a search was identified during implementation. It was caused by the titles of the abstracts being stored in a different database from the application-specific one. For every match found, a query was passed to the textual database to retrieve the title of the found abstract. This implies that the more results a search finds, the more queries have to be sent to the additional database and the slower the retrieval of results becomes. This process was optimised by moving the titles themselves into the application-specific database. However, this made the application-specific database grow considerably in size, as sufficient space had to be allocated per title to make sure that all titles fit within the assigned table.
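The difference between the two retrieval strategies can be made concrete with a small sketch that only builds the SQL text and does not touch a database. All table and column names (`abstracts`, `extracts`, `titles`, `pmid`, `title`, `word`) are hypothetical and do not reflect the actual schema; the point is merely that the first approach issues one query per match, while the second needs a single query once the titles live in the same database.

```java
import java.util.ArrayList;
import java.util.List;

final class TitleLookup {

    /** Before: one title query per match, each sent to the separate textual database. */
    static List<String> perMatchQueries(List<Integer> pmids) {
        List<String> queries = new ArrayList<>();
        for (int pmid : pmids) {
            // Hypothetical table and column names.
            queries.add("SELECT title FROM abstracts WHERE pmid = " + pmid);
        }
        return queries;
    }

    /** After: titles are stored next to the extracts, so one joined query suffices. */
    static String singleJoinedQuery() {
        // Hypothetical schema; '?' is a placeholder for the search word.
        return "SELECT e.pmid, t.title FROM extracts e "
             + "JOIN titles t ON t.pmid = e.pmid WHERE e.word = ?";
    }
}
```

For a search with 200 matches the first method yields 200 statements, each with its own round trip, whereas the second always yields one; the price is the additional storage mentioned above.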

4.5.2 Services

The server side implementation that is able to take requests from the client consists of a

series of methods, which can be invoked in code directly from the client side. The

previous section has mentioned three main functions that needed substantial server side

implementation: Standard Search, Advanced Search and Details Request.

4.5.3 Query Services

When a request for a standard search is sent to the server, the object is first analysed to check which fields actually contain a string to search for. Next, an SQL query is built accordingly using the efficient Java StringBuilder. The query is assembled dynamically and the entries from the fields are connected via the “AND” operator. The query is then sent to the database using a database connector, which turns the result from the database into a so-called “Result Set”. This result set is disassembled by the server and packed inside an object which can be identified on the client side. It is then sent back to the client, where its type is identified and it is redirected accordingly. The request for a dynamic search is implemented very similarly. The only significant difference is the disassembly of the request coming from the client, since the operators between the entered search dimensions need to be assigned dynamically.
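The assembly of the standard search query can be sketched as below. This is a minimal illustration under stated assumptions, not the query builder of the project: the class `StandardQueryBuilder`, the table name `extracts` and the use of column names as map keys are hypothetical, and the `?` placeholders (intended for a JDBC `PreparedStatement`) are an addition that the text does not mention. Empty fields are skipped and the remaining conditions are joined with “AND”, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StandardQueryBuilder {

    /** Holds the SQL text and the values to bind to its '?' placeholders. */
    static final class BuiltQuery {
        final String sql;
        final List<String> parameters;

        BuiltQuery(String sql, List<String> parameters) {
            this.sql = sql;
            this.parameters = parameters;
        }
    }

    /**
     * Builds a query from a map of column name to search term.
     * The keys must come from a fixed, server-side list of dimensions,
     * never from user input, because they are placed into the SQL text.
     */
    static BuiltQuery build(Map<String, String> termsByColumn) {
        StringBuilder sql = new StringBuilder("SELECT pmid FROM extracts"); // hypothetical table
        List<String> parameters = new ArrayList<>();
        for (Map.Entry<String, String> entry : termsByColumn.entrySet()) {
            String term = entry.getValue();
            if (term == null || term.trim().isEmpty()) {
                continue; // field left empty in the search form
            }
            sql.append(parameters.isEmpty() ? " WHERE " : " AND ");
            sql.append(entry.getKey()).append(" = ?");
            parameters.add(term.trim());
        }
        return new BuiltQuery(sql.toString(), parameters);
    }
}
```

With an insertion-ordered map holding `exposure` and `outcome` terms, the sketch produces `SELECT pmid FROM extracts WHERE exposure = ? AND outcome = ?`. Binding the terms as parameters instead of concatenating them keeps user input out of the SQL text.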

4.5.4 Details Page Service

In a request for details of an article, the client request mainly consists of a PMID. The third database, which stores information about the abstracts themselves, is queried for the title, the collaborators and the abstract itself. Then the associated extracted features are gathered from the six tables, and the ranges for the highlights within the abstract are gathered from the application-specific database. They are put in a details response object, which can be correctly identified on the client.

4.5.5 Curation Service

When a curation request reaches the server, it first checks whether the passed user credentials are valid. This is done by checking whether the user and password entries are consistent with what is stored in the user table of the application-specific database. If the credentials are wrong, a response object is compiled containing a message that the curation has failed and the reason. If the credentials are valid, the parameters of the curation are checked. First the server looks at whether the checkbox for identifying similar entries was ticked. If so, it searches the database for similar entries and gets a list of PMIDs in that same dimensional table. It then proceeds to change the entry or entries for the given feature. If the deletion box was ticked, however, the entry is simply deleted from this table of the database. There are also two other types of curation requests. The first one is adding a new extract from the article to the database. A client-side check is done whether the entered word occurs in the abstract itself; the request is then sent to the server, where it is pushed through to the database, provided the authorisation completed successfully. The last operation is to change the highlights of an extract, which changes values in the highlights table.
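The control flow of the main curation request can be summarised in a self-contained sketch. It is not the server code of the project: the class `CurationService` is hypothetical, the user table and one dimensional table are replaced by in-memory maps, the response object is reduced to a message string, and "similar" is simplified to "identical word", which is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class CurationService {
    /** Stand-in for the user table: user name -> stored password hash. */
    private final Map<String, String> userTable = new HashMap<>();
    /** Stand-in for one dimensional table: PMID -> extracted word. */
    private final Map<Integer, String> dimensionTable = new HashMap<>();

    void addUser(String user, String passwordHash) {
        userTable.put(user, passwordHash);
    }

    void addExtract(int pmid, String word) {
        dimensionTable.put(pmid, word);
    }

    String extractFor(int pmid) {
        return dimensionTable.get(pmid);
    }

    /** Returns a message describing the outcome, standing in for the response object. */
    String curate(String user, String passwordHash, int pmid,
                  String replacement, boolean delete, boolean curateSimilar) {
        // Step 1: the credentials must match the stored entry.
        String stored = userTable.get(user);
        if (stored == null || !stored.equals(passwordHash)) {
            return "Curation failed: invalid credentials";
        }
        String oldWord = dimensionTable.get(pmid);
        if (oldWord == null) {
            return "Curation failed: no such extract";
        }
        // Step 2: deletion removes only the selected entry.
        if (delete) {
            dimensionTable.remove(pmid);
            return "Extract deleted";
        }
        // Step 3: change the selected entry and, if requested, all similar ones.
        int changed = 0;
        for (Map.Entry<Integer, String> entry : dimensionTable.entrySet()) {
            boolean selected = entry.getKey() == pmid;
            boolean similar = curateSimilar && entry.getValue().equals(oldWord);
            if (selected || similar) {
                entry.setValue(replacement);
                changed++;
            }
        }
        return "Curated " + changed + " entries";
    }
}
```

Checking the credentials before any table is read means that a failed login can never reveal whether a given extract exists.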

4.6 Possible Future Implementation Ideas

Two ideas which have been considered during implementation should be mentioned. The first one is the possible use of the Levenshtein distance algorithm [49] in order to identify similar words that need curation within the database and make the detection algorithm generally more effective. This algorithm compares two strings and calculates the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other; the better the match, the lower the value. It could be applied when searching through the database, by calculating the Levenshtein distance between the old string that needs curation and every entry in the table where that string has been identified. If the distance for an entry is below a certain threshold and the other values of that row roughly correspond to the same category as the word that needs curation, it could be assumed that this entry is similar enough to the wrong word and may be curated as well. This is one approach to implementing some kind of “smart detection” of data.
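Since this idea was not implemented, the following is only a sketch of what such a comparison might look like. It uses the standard dynamic-programming formulation of the Levenshtein distance with two rolling rows; the class name and the normalised `similarity` helper are assumptions added for illustration. The helper maps the distance to a value between 0 and 1, where higher means more similar, which is convenient when a single threshold has to work for words of different lengths.

```java
final class Levenshtein {

    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means the strings are identical. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }
}
```

For example, `Levenshtein.distance("hypertension", "hypertention")` is 1, so a misspelt extract of this kind would fall well within a small distance threshold.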

Another possible future addition is to enable parallel searches, since data and communication within GWT are handled asynchronously. Instead of switching between features when cycling through the tabs of the application, a new instance of that widget as well as a reference to the client access service could be created each time. This would need to be displayed to the user and kept track of in some kind of task bar. As a result, the user would be able to open two basic searches at the same time. GWT would know where to route the information coming back from the server, since each request would have been sent from a different client service. While this feature seems useful, especially when the user wishes to do a wide set of searches, it would also take up considerably more resources and as a result make the application slower. It could also be considered redundant, since the user could just open the application twice in different tabs of the browser and achieve the same goal.

4.7 Software Testing

Software testing is one of the core practices of software engineering. Here the question is asked whether the software was built right. It is important to treat software testing as an ongoing process and not just as a single phase at the end of development, in order to spot mistakes and bugs early and prevent them from propagating through the system. The less software is tested, the harder it is to fix individual bugs, as they may start affecting each other. Therefore the application has been tested thoroughly during implementation as well as after implementation was finished. Several of the existing testing methodologies have been applied to the project. Note that acceptance testing has been done as part of the evaluation, which is covered in the next chapter.

4.7.1 Regression Testing

Regression testing was done throughout development of the application. It makes sure that the already existing parts of the application still work as new parts are added. This is especially useful when widgets have been nested, as functionality may be directly affected. A regression testing plan was created, which was executed and updated every time the implementation of a major feature was finished. Note that this plan grew as the implementation went along, starting out with only one test. A simplified version of the plan can be seen below.

4.7.2 Integration Testing

Integration testing was done to test how the individual features behaved once they were nested. Test cases were created before implementation. These cases state the expected behaviour upon the merge of widgets as well as all possible variations of that feature. A bottom-up approach has been taken to integration testing, which means that lower-level merged features were tested first and the tests were then gradually expanded to integrate some of the more complex features.

The merge of the feature pages with the menu switcher was considered the first integration test. This continued with the addition of the results table on top of the search features. Next, the details page for each search feature was tested after it was merged onto the results table feature, and so on.

4.7.3 Security Testing

It should also be mentioned that some security testing was done. This included testing three properties: confidentiality, integrity and authentication. Firstly, confidentiality was already ensured in the design of the application, by making sure that no confidential information is stored within the system. Secondly, integrity is ensured as part of integration testing, generally by making sure that what is created on the client side corresponds with what is received on the server side. Thirdly, authentication is ensured by always having some way of identifying the people involved. This is done by only creating user accounts for trusted people as well as logging all critical actions on the server against the corresponding user.

4.7.4 Unit Testing

Unit testing has also been done in order to test parts of the application individually. Since GWT is programmed in Java, it can be tested using standard JUnit test cases [50]. In fact, it can be seen in the application tree above that GWT automatically creates a package where tests are located. Skeleton classes were created for each class, which were then populated by the developer. A testing plan was first drawn up, and unit tests for each method on the server as well as the client were created accordingly. The general style of these tests is to set up the so-called “pre-conditions”, which are all the variables needed to run a function. Note that these are usually stubs, and only the part of the function that needs to be tested is populated with reasonable data. The expected outcome is defined and compared, by assertion, with the actual outcome after processing the data through the given function. A test then passes depending on whether the assertion evaluates as intended. The code below shows a sample unit test for the query builder on the server. A query request object containing the words to be queried is passed in. The expected result is the fully built query.

Figure 20: Sample Test Code

Note that the database connectors implemented on the server also had to be tested individually for each database. Skeleton test classes were already created for each of the database connectors. It needs to be pointed out that only the tables that are used by the application have been tested. The table below (Table 5) is a general realisation of the testing plan and shows a simplified summary of the unit tests done for the application. It indicates which test classes were written per package as well as the number of tests. A small summary of what was tested has also been added, as well as an indicator of whether the tests passed.

| Package | Test Class | Description | #Tests | Outcome |
|---|---|---|---|---|
| Client | AdvancedSearchTest | Various tests assembling a search request object, testing different combinations of operators and dimensions. | 10 | Pass |
| Client | StandardSearchTest | Tests combining different search dimensions | 12 | Pass |
| Client | ResultTableTest | Several tests for each column in the results table | 8 | Pass |
| Client | DetailsPageTest | Several tests for populating each | 16 | Pass |
| Client.model | AdvancedQueryTest | Testing assignments and reads of model class | 12 | Pass |
| Client.model | CurationRequestTest | Testing assignments etc. | 22 | Pass |
| Client.model | DetailedArticleTest | Testing assignments etc. | 16 | Pass |
| Client.model | DetailStatsTest | Testing assignments etc. | 8 | Pass |
| Client.model | StandardQueryTest | Testing assignments etc. | 12 | Pass |
| Client.model | QueryResultTest | Testing assignments etc. | 8 | Pass |
| Client.service | ClientServiceImplTest | Testing forwarding of data to client module objects | 3 | Pass |
| Server | ServerAccessImplTest | Testing server routing of possible actions | 4 | Pass |
| Server | ServerQueryBuilderTest | Testing query builder for each dimension and some combination of dimensions | 10 | Pass |
| Server | ServerResultTest | Testing server generated response object creation | 8 | Pass |
| Server.dbmysql | DBMYSQLConnectorTest | Mainly tests with ResultSet alteration; some connection creation tests. | 7 | Pass |

Table 5: Simplified Testing Table

5. Results and Evaluation

This chapter presents the individual parts of the application that have been produced. It explains how each of them is used and how they relate. It then focuses on the evaluation done for the project. This has been done in the form of interviews, which are discussed and whose results are presented.

5.1 Overview of the Application

The following picture (Figure 21) shows the Standard Search feature. The six search

dimensions as well as the search button which will initialise and send a query with the

entered data to the server can be seen.

Figure 21: Standard Search Feature

Note that the exposure and the outcome field have been populated according to

the evaluation task. Once the search button is pressed a loading screen will appear while

the result data is being fetched on the server. Once the client receives the results, the

loading screen is replaced with the result table. The picture below (Figure 22) shows the

results for the query above: 22 entries have been found. It should be mentioned that

pages are set to be separated every 25 entries. Also note that there are two pagers, one

at the top and another at the bottom of the page. The table itself provides the user with

year, title and study design of the found papers. It is initially sorted by year in ascending

order and can be re-sorted by clicking the year header in the table.

Figure 22: Sample Results Table

When one of the rows of the table above is clicked, a details page will appear. This is shown in the picture below (Figure 23).

Figure 23: Details Page Feature

Note that the rest of the page is shaded while a details page is being displayed. The

pop up itself gives more information about the paper including title, collaborators and

the abstract itself; a link to the actual paper is also provided. The markers on the right will

highlight all extracts of a selected dimension within the abstract. The table at the bottom

shows the extracted features and some information about them. It is set to display five

results per page. Clicking one of the entries in the table and then selecting “Curate

Selected Extract” will replace the table with the curation menu seen below.

Figure 24: Curation Feature

As shown above, hypertension has been selected from the table. A new word can be

added which will replace the entry for hypertension within the database. Since extracts

are normalised they generally correspond to the information in the abstract itself and

only such words may be added to the database. The option to delete an extract has also

been provided, as well as detecting similar entries within the database and curating these

as well. A user name and password also need to be provided in order to be able to do a

curation.

Lastly, the advanced search feature is shown. The picture below (Figure 25) makes the main advantages of this feature clear. The six search dimensions can be combined in any way wanted. Two words for the same dimension can be added. The available operators are “AND”, “OR” and “NOT”, so that certain extracts may be included or excluded as wanted. Rows can be added or removed at will, and rows with no word provided are ignored in the search.

Figure 25: Advanced Search Feature

5.2 Evaluation

Evaluation and acceptance testing of the application has been undertaken in the form of evaluation interviews. An interview script has been designed, a copy of which can be found in Appendix H. It sets out tasks for the person being interviewed to complete. The first task is done together with the developer, whereas the last two tasks are to be done by the interviewee alone. A short questionnaire to be used after the interview has also been designed and can be found in Appendix I. All questions asked have been answered as well as discussed during these interviews. Two people working in the field have been interviewed as part of the evaluation. The first person was Dr Jenny Newman, an epidemiologist and medically trained doctor. The second person was Dr George Karystians, who developed the text mining algorithm used to gather the extracts for this project. The main highlights of these interviews are summarised in the table below (Table 6). A full version in which the answers are mapped to their corresponding questions can be found in Appendix J.

| Person | Discussion |
|---|---|
| Jenny | A second pager on top of the results table would be useful, as it would eliminate the need to scroll on smaller resolution machines. |
| Jenny | As the second task from the evaluation showed, after entering 3 search terms, only two papers remained in the result set. This is already a good example for showing gaps in epidemiological research. |
| Jenny | It needs to be pointed out that an application like this could be easily misunderstood by users who are not as educated in the field and as a result wrong conclusions could be drawn about what the data means. Help pages or tooltips could improve this situation; nevertheless, this scenario should always be considered a risk. |
| Jenny | It is useful as it can provide a preliminary examination of previous work, for example when writing a grant application or preparing undergraduate projects. |
| Jenny | Especially liked the positioning of close buttons for the details page, as they were easy to find and conveniently placed. |
| George | Add more information about the current curation. |
| George | The curation feature is rather powerful. The users curating the data have to take responsibility for what they curate as this could lead to problems within the application. It is generally a good practice that curations are tracked. |
| George | By pointing out gaps between current research and the one represented. |
| George | Add a column to the result set which shows the type of study that the found paper has investigated. |
| George | Generally more statistics about the current result set. |
| George, Jenny | Adding the number of submissions about a certain topic per year as a statistic on the search result. |
| George, Jenny | Some statistic that adds a denominator to what proportion of submissions is about a certain topic this year compared to last year. |

Table 6: Evaluation Questionnaire and Discussion Highlights

The general feedback given was positive and constructive. Each question has also been discussed, and the people interviewed never got lost or stuck while completing the tasks they were given. The simplicity and the layout of the results were generally liked. General concerns about the application and its use have been identified and taken into account.

The results above imply some future changes and extensions. The second pager above the results table has already been realised, as can be seen in the walkthrough. The same is true for the study design column being displayed for the result set of a search, as suggested by the feedback. A future addition identified by both people interviewed was more statistics about the currently displayed result set. This may include a simple display of the numbers needed as well as tables and graphs. Due to the modular layout of the application, this can be realised in future without the need for refactoring.
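One of the statistics suggested in the interviews, the number of papers per year within a result set, could be computed along the following lines. This is a sketch of a possible future addition, not existing functionality; the class `YearStatistics` is hypothetical and the input is assumed to be the publication years of the papers in the current result set.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class YearStatistics {

    /** Counts how many papers of a result set were published in each year. */
    static Map<Integer, Integer> papersPerYear(List<Integer> publicationYears) {
        // A TreeMap keeps the years in ascending order, ready for a table or chart.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int year : publicationYears) {
            counts.merge(year, 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting map could be sent to the client as part of the response object and rendered either as an additional table or through a charting library.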

6. Conclusion and Future

This chapter concludes the dissertation by first providing a summary of what has been achieved. It looks at the goals set out at the start and states whether they have been met and why. It then summarises all possible future work mentioned and gives an outlook on how this could be done. It closes by discussing the skills acquired and lessons learnt.

6.1 Summary

Aims and objectives were set out for this project. Research was carried out into the domain as well as the text mining process behind the application, which is needed in order to realise the goals. This included the data structures and databases already in place. The requirements were set out according to the aims as well as the stakeholders involved, such as supervisors. A design was created according to the identified domain. A need for a database was identified, and it was designed accordingly. It was identified that the application needed to be online; therefore an appropriate interface was designed as well. Tools to implement the website were thoroughly researched and a decision was made to use Google Web Toolkit. The following table (Table 7) shows the objectives set out at the start and states whether they have been met in the resulting application.

Goal Achieved

To develop a search function and implement an extended query model utilising the six dimensions needed for the epidemiological data.

This aim has been met as a query function has been implemented which covers all six search dimensions.

To develop an extended search function to enable the user to query the six dimensions combined in any way.

This has been achieved as the advanced query functionality enables the user to do just that.

To be able to manipulate automatically extracted epidemiological data in order to correct errors.

This feature was implemented as curation, as entries can be added, edited as well as deleted.

To provide relevant information about retrieved data, such as statistics.

Statistics are mainly shown for details of a selected paper in the application therefore this goal has been met. However more statistics are viable as future implementation.

To enable only registered users to manipulate information.

This has been done as part of the login required on curation.

To provide all the mentioned facilities on an online, web based, platform with high availability.

This goal has been met as the application has been implemented using a web platform.

After implementation, an interview questionnaire was designed and used to evaluate the results with people closely involved with the domain. The feedback received was generally positive, and a good set of ideas for future additions was gained.

6.2 Future Work

Several future additions and changes are mentioned throughout the dissertation and are summarised here. During implementation it was noted that a Levenshtein distance algorithm could be added to identify similar entries in the database as part of the curation process. This is useful because errors in extracted entries usually propagate through the database, and the algorithm could be used to identify such errors more efficiently. The same algorithm could also support the curation of a single entry: the new entry could be compared with the old one to measure how similar they are. This would aid the user and ensure that a curation is not too drastic, as entries in the database are normalised and should not differ too much from the original text.
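The comparison described above can be sketched as follows. This is a minimal illustration and not part of the implemented application: the class name `LevenshteinSketch`, the method `isTooDrastic` and the ratio threshold are assumptions made for this example.

```java
/** Hypothetical helper, not part of the described application. */
final class LevenshteinSketch {

    /** Returns the minimum number of single-character insertions,
     *  deletions and substitutions needed to turn a into b. */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Assumed rule: a curation is "too drastic" when the edit distance,
     *  relative to the length of the original entry, exceeds maxRatio. */
    static boolean isTooDrastic(String original, String curated, double maxRatio) {
        if (original.isEmpty()) {
            return !curated.isEmpty();
        }
        return (double) distance(original, curated) / original.length() > maxRatio;
    }
}
```

A call such as `LevenshteinSketch.isTooDrastic(oldEntry, newEntry, 0.5)` could then be used to warn the user before a curation is stored. A suitable threshold would have to be determined on real data.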

Table 7: Set Goals vs. Achievements


Other future additions include more displays and visualisations for statistics. More general statistics could be displayed for the result set of a search. This could be done by extending the general results table by a few more columns. However, it could be argued that this would unnecessarily clutter the results display and make it unclear at first glance. Alternatively, more information about the results could be added underneath the table, in the form of further tables or even diagrams. This is possible with GWT because, on compilation, all client-side code is converted into JavaScript. As a result, JavaScript libraries can be incorporated to draw and visualise these statistics.
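One statistic suggested in the evaluation interviews (Appendix J) is the number of papers per year within a search result. The following sketch shows how such a count could be derived before it is handed to a table or chart. It assumes that the publication years of the result set are available as a list; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of the described application. */
final class ResultStatisticsSketch {

    /** Counts how many papers of a result set were published in each year.
     *  A TreeMap keeps the years in ascending order for display. */
    static Map<Integer, Integer> papersPerYear(List<Integer> publicationYears) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int year : publicationYears) {
            counts.merge(year, 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting map could be rendered as an additional table underneath the results or passed to a charting library.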

For the far future, the application could be extended to include a full user account management system. This would have a number of advantages; however, it has not been realised as part of this project because the complexity and effort are too high compared to the gain in useful features. Such a system would enable users to define and store custom searches in their account. It could also enable private curations, meaning that alterations to the database would only affect the user who entered them.

6.3 Concluding Remarks

A web based workbench for epidemiologists has been developed in this project. It

provides a search feature which ultimately enables the user to access a large amount of

information. The product has been evaluated by people involved in the field. A number of

possible future improvements have been identified. Nevertheless, feedback gained was

positive and the project was deemed a success.


7. Bibliography

[1] Last JM, ed. Dictionary of Epidemiology, 2nd ed., New York: Oxford U. Press, 1988.

[2] G. Hill, “University of Dundee External Relations,” Press Office, July 2013. [Online]. Available:

http://app.dundee.ac.uk/pressreleases/2013/july13/institute.htm. [Accessed August 2014].

[3] “MEDLINE factsheet,” 21 July 2014. [Online]. Available:

http://www.nlm.nih.gov/pubs/factsheets/medline.html. [Accessed April 2014].

[4] Spasić I, Sarafraz F, Keane AJ, Nenadić G., “Medication information extraction with linguistic pattern matching and semantic rules,” Journal of the American Medical Informatics Association, vol. 17, no. 5, pp. 532-535, 2010.

[5] Chapman WW, Cohen KB, “Current issues in biomedical text mining and natural language processing,” Journal of Biomedical Informatics, vol. 42, no. 5, pp. 757-759, October 2009.

[6] Aarts S, Vos R, van Boxtel MP, Verhey FRJ, Metsemakers JF, van den Akker M, “Exploring

medical data to generate new hypotheses: an introduction to data and text mining,” 2012.

[7] “PUBMED factsheet,” 21 July 2014. [Online]. Available:

http://www.nlm.nih.gov/pubs/factsheets/pubmed.html. [Accessed April 2014].

[8] “PubMed MESH factsheet,” 21 July 2014. [Online]. Available:

http://www.nlm.nih.gov/pubs/factsheets/mesh.html. [Accessed April 2014].

[9] Karystianis, G, “Extraction and representation of key characteristics from epidemiological

literature,” University of Manchester, School of Computer Science., 2013.

[10] Hearst, M. A., “Text data mining: Issues, techniques, and the relationship to information

access,” Presentation notes for UW/MS workshop on data mining, July 1997.

[11] Tan, Ah-Hwee, “Text mining: The state of the art and the challenges,” Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, 1999.

[12] Korhonen, A., Séaghdha, D. Ó., Silins, I., Sun, L., Högberg, J., & Stenius, U., “Text mining for

literature review and knowledge discovery in cancer risk assessment and research,” PloS one,

vol. 7, no. 4, 2012.

[13] Zweigenbaum, P., Demner-Fushman, D., Yu, H., & Cohen, K. B., “Frontiers of biomedical text

mining: current progress,” Briefings in bioinformatics, vol. 8, no. 5, pp. 358-375, 2007.

[14] Hotho, A., Nürnberger, A., & Paaß, G., “A Brief Survey of Text Mining,” In Ldv Forum, vol. 20,

no. 1, pp. 19-62, May 2005.

[15] R. Rodriguez-Esteban, “Biomedical text mining and its applications,” PLoS Computational

Biology, vol. 5, no. 12, 2009.

[16] Imberman, S.p., “Effective use of the KDD process and data mining for computer

performance professionals.,” Journal of Computing Resources, no. 107, pp. 68-77, 2002.

[17] Berger, A.M. and Berger C.R., “Data mining as a tool for research and knowledge development in nursing,” Computers, Informatics, Nursing, vol. 22, no. 3, pp. 123-131, 2004.

[18] Bischoff, K., Firan, C. S., Nejdl, W., & Paiu, R., “Can all tags be used for search?,” In

Proceedings of the 17th ACM conference on Information and knowledge management, vol.

ACM, pp. 193-202, October, 2008.

[19] Aronson, A. R., & Lang, F. M., “An overview of MetaMap: historical perspective and recent advances,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229-236, 2010.

[20] Dix, A., “Human-computer interaction,” Springer US, pp. 1327-1331, 2009.

[21] Smith, D. C., Irby, C., Kimball, R., Verplank, W. L., & Harslem, E., “Designing the Star user

interface. In Human-computer interaction,” Morgan Kaufmann Publishers Inc., pp. 653-661,

1987, December.

[22] Fairbanks, R. J., & Caplan, S., “Poor interface design and lack of usability testing facilitate

medical error,” Joint Commission Journal on Quality and Patient Safety, vol. 30, no. 10, pp.

579-584, 2004.

[23] Patel, V. L., & Kushniruk, A. W., “Interface design for health care environments: the role of

cognitive science, In Proceedings of the AMIA Symposium (p. 29),” American Medical

Informatics Association., 1998.

[24] Weinger, M. B., Wiklund, M. E., & Gardner-Bonneau, D. J. (Eds.)., Handbook of human factors

in medical device design, CRC Press, 2011.

[25] Zaninzinato, Zaninzinato design, 2013. [Online]. Available:

http://www.zanzinato.com/work/avation-health/. [Accessed May 2014].

[26] Gottron, T., “Document word clouds: Visualising web documents as tag clouds to aid users in

relevance decisions. In Research and Advanced Technology for Digital Libraries,” Springer

Berlin Heidelberg, pp. 94-105, 2009.

[27] “Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs): Part 11:

Guidance on Usability,” ISO 9241-11, 1998.

[28] Charfi, S., Ezzedine, H., & Kolski, C., “RITA: A framework based on multi-evaluation

techniques for user interface evaluation: Application to a transport network supervision

system. In Advanced Logistics and Transport (ICALT),” International Conference on IEEE, pp.

263-268, 2013, May.

[29] Byron, L., & Wattenberg, M., “Stacked Graphs-Geometry & Aesthetics,” IEEE Trans. Vis.

Comput. Graph., vol. 14, no. 6, pp. 1245-1252, 2008.

[30] Cockburn, A., “Agile software development,” Boston: Addison-Wesley., vol. 2006, 2002.

[31] Beck, K., Beedle, M., Van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., ... &

Thomas, D., “Manifesto for agile software development.,” 2001.

[32] Schwaber, K. and Beedle, M., Agile Software Development with SCRUM, Upper Saddle River,

NJ: Prentice-Hall, 2002.

[33] Ambler, Scott W., and Mark Lines, “Disciplined agile delivery: A practitioner's guide to agile

software delivery in the enterprise,” IBM Press, 2012.

[34] C. Larman, Applying UML and Patterns. An Introduction to Object-Oriented Analysis and

Design and Iterative Development, 2006.

[35] G. Project, “Known vulnerabilities to GWT,” 2014. [Online]. Available:

http://www.gwtproject.org/articles/security_for_gwt_applications.html.

[36] Code-Google, “GWT incubator,” 2014. [Online]. Available:

https://code.google.com/p/google-web-toolkit-incubator/wiki/LoginSecurityFAQ. [Accessed

March 2014].

[37] “Heartbleed Bug,” Codenomicon Ltd., 29 April 2014. [Online]. Available: http://heartbleed.com/. [Accessed August 2014].

[38] P. Brodersen, “How MySQL Uses Indexes,” Oracle MySQL, June 2004. [Online]. Available: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html. [Accessed August 2014].

[39] Y. Z. Low, “Multiple Column Indexes,” Oracle MySQL, 25 March 2012. [Online]. Available:

http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html. [Accessed August

2014].

[40] Berners-Lee, T., Cailliau, R., Groff, J. F., & Pollermann, B., “World-Wide Web: the information

universe.,” Internet Research, vol. 2, no. 1, pp. 52-58, 1992.

[41] Burnette, E., “Google Web Toolkit.,” 2006.

[42] Dewsbury, R., Google web toolkit applications, Pearson Education, 2007.

[43] “GWT Using RPC,” GWTProject, 2012. [Online]. Available:

http://www.gwtproject.org/doc/latest/tutorial/RPC.html. [Accessed August 2014].

[44] Google, “GWT Plugin for Eclipse,” Google Developers, December 2013. [Online]. Available:

https://developers.google.com/eclipse/. [Accessed April 2014].

[45] Katarina Grolinger, Wilson A. Higashino, Abhinav Tiwari, and Miriam AM Capretz, “Data

management in cloud environments: NoSQL and NewSQL data stores,” Journal of Cloud

Computing: Advances, Systems and Applications, vol. 1, no. 2, p. 22, 2013.

[46] Oracle, “JDBC Overview,” 2014. [Online]. Available:

http://www.oracle.com/technetwork/java/overview-141217.html. [Accessed May 2014].

[47] Javaranch, “Jenny the DB code generator,” 2013. [Online]. Available:

http://www.javaranch.com/jenny.jsp. [Accessed May 2014].

[48] J. Waldo, “Remote procedure calls and java remote method invocation,” IEEE Concurrency,

vol. 3, no. 6, pp. 5-7, 1998.

[49] M. Gilleland, “Levenshtein distance, in three flavors,” Merriam Park Software, 2009. [Online]. Available: http://www.merriampark.com/ld.htm.

[50] JUnit, “A programmer-oriented testing framework for Java,” 2014. [Online]. Available:

junit.org. [Accessed May 2014].

[51] Z. Grossbart, “Hackito Ergo Sum,” 21 September 2010. [Online]. Available:

http://www.zackgrossbart.com/hackito/antiptrn-gwt/. [Accessed April 2014].

[52] Google, “GWT Ajax Communication,” Google Developer Guide, 2012. [Online]. Available:

http://www.gwtproject.org/doc/latest/DevGuideServerCommunication.html. [Accessed

August 2014].

[53] Osheroff, J. A., Teich, J. M., Middleton, B., Steen, E. B., Wright, A., & Detmer, D. E., “A

roadmap for national action on clinical decision support,” Journal of the American medical

informatics association, vol. 14, no. 2, pp. 141-145, 2007.

[54] Berner, E. S., “Clinical Decision Support Systems,” Springer Science+ Business Media, LLC,

2007.

[55] Demner-Fushman, D., Chapman, W. W., & McDonald, C. J., “What can natural language

processing do for clinical decision support?,” Journal of biomedical informatics, vol. 42, no. 5,

pp. 760-772, 2009.


[56] Garg AX, Adhikari NK, McDonald H, Rosas-Arellano MP, Devereaux PJ, Beyene J, et al.,

“Effects of computerized clinical decision support systems on practitioner performance and

patient outcomes: a systematic review,” JAMA, vol. 293, no. 10, p. 1223–38, 2005.

[57] Infotech, P2C, “Software Development Life Cycles (SDLC),” 2011. [Online]. Available:

http://www.p2cinfotech.com/software-development-life-cycle/. [Accessed May 2014].

[58] Sittig, D. F., Wright, A., Osheroff, J. A., Middleton, B., Teich, J. M., Ash, J. S., ... & Bates, D. W.,

“Grand challenges in clinical decision support,” Journal of biomedical informatics, vol. 42, no.

2, pp. 387-392, 2008.

[59] G. C. Team, “Evolution of the Web.,” 2014. [Online]. Available:

http://www.evolutionoftheweb.com/. [Accessed April 2014].

[60] Edwards, S., “History of Processor Performance.,” University of Columbia, 2012.


8. Appendix

A. SCRUM principles

In SCRUM the requirements are first written down as user stories, which are then collected in the so-called “Product Backlog”. This backlog holds all information about the features to implement and may grow as the project progresses, as requirements change or are added. SCRUM suggests covering the whole set of functionality from the start, but only in a basic manner at first. This functionality is then improved and extended in the following iterations, which underlines the agile principle of working code first. After the backlog is complete, a so-called “Release Plan” is usually produced. It records which stories will be implemented in which iteration, up to a first feasible release. The release plan has to be flexible and leave room for additions. At the end of each sprint, the completed work is reviewed in order to keep track of how productive the development process has been so far. A so-called burn down chart may be created, as shown in Figure 26 below. It visualises the planned remaining work (blue line) and compares it against the work remaining after accounting for the work done (red line). Once the red line drops below the blue line, the project is ahead of schedule. The same principle applies the other way around: red being above blue means the project is behind schedule.

Figure 26: Sample Burndown Chart
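The two lines of a burn down chart can be computed as in the following sketch. It assumes a linear plan over a fixed number of days and a known number of story points completed per day; the class name, method names and example figures are hypothetical and do not reproduce the data behind Figure 26.

```java
/** Hypothetical helper illustrating the two lines of a burn down chart. */
final class BurndownSketch {

    /** Planned remaining points at the end of each day, assuming a linear plan
     *  (the blue line). Index 0 is the start of the sprint. */
    static double[] plannedRemaining(int totalPoints, int days) {
        double[] planned = new double[days + 1];
        for (int day = 0; day <= days; day++) {
            planned[day] = totalPoints - (double) totalPoints * day / days;
        }
        return planned;
    }

    /** Actual remaining points after subtracting the points completed
     *  on each day (the red line). Index 0 is the start of the sprint. */
    static double[] actualRemaining(int totalPoints, int[] completedPerDay) {
        double[] actual = new double[completedPerDay.length + 1];
        actual[0] = totalPoints;
        for (int day = 1; day <= completedPerDay.length; day++) {
            actual[day] = actual[day - 1] - completedPerDay[day - 1];
        }
        return actual;
    }

    /** Red below blue means ahead of schedule; red above blue means behind. */
    static String status(double planned, double actual) {
        if (actual < planned) {
            return "ahead";
        }
        if (actual > planned) {
            return "behind";
        }
        return "on schedule";
    }
}
```

For example, with 20 points planned over 10 days and 12 points completed after 5 days, 8 points remain against a planned 10, so the sprint would be ahead of schedule.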


B. Full Release Plan for the Project

| Sprint | Start | Weeks | End | Points total | Status | Goal | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.7.2014 | 2 | 15.7.2014 | 19 | planned | Stories 1 to 3 | Create the basic website with all its login and managerial features. |
| 2 | 16.7.2014 | 2 | 31.7.2014 | 29 | planned | Stories 5 to 10 | Implement all query functions for all tables and data involved. |
| 3 | 1.8.2014 | 2 | 15.8.2014 | 24 | planned | Story 11 | Implement visualisations and dynamic creation of statistics. |
| 4 | 16.8.2014 | 1 | 6.7.2014 | 10 | planned | Stories 12 to 15 | Implement alteration of data; this may include pattern detection. |


C. Full Backlog for the Project

| ID | As a/an … | I want to… | so that…. | Done criteria | Est. | Priority |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | user | view the website created | I can see who made it and eventually make out how to use it | Made a website that contains all the features a standard website has: Intro page, tutorial page, about page, etc. | 8 | high |
| 2 | user | login to access curation feature | A login function that disables the curation of data unless authorised | A login function that hides all functionality until authorised | 5 | medium |
| 3 | admin | manage user access on the website | An admin account that decide to restrict access to certain/all users | An admin account that can manage all entries of users etc. | 6 | medium |
| 4 | user | be able to access epidemiological data | I can browse data if needed or use it to query it | Implemented database access | 10 | high |
| 5 | user | query epidemiological data by Exposure | I can get an overview of research to that related health care problem | | 4 | high |
| 6 | user | query epidemiological data by Outcome | | | 4 | high |
| 7 | user | query epidemiological data by Covariate | | | 4 | high |
| 8 | user | query epidemiological data by Study Design | I can get an overview of research to these related studies | | 4 | high |
| 9 | user | query epidemiological data by Population | I can get an overview of related research to health care problems related to that population | | 4 | high |
| 10 | user | query epidemiological data by Effect Size Type | I can get an overview of research having that certain effect type and value | | 4 | high |
| 11 | user | be able to query for epidemiological data using any combination of the mentioned above categories | I can get more specific search queries on data | Query for any combination possible | 5 | high |
| 12 | admin or high level user | insert new highlighted key words for a study | I can add a spotted keyword for an abstract that hasn't been spotted | | 8 | medium |
| 13 | | alter a highlighted key word in a study | I can alter a spotted highlighted keyword that may not be correctly identified | | 8 | medium |
| 14 | | delete a highlighted key word in a study | I can get rid of a wrongly identified keyword | | 8 | medium |
| 15 | | identify commonly extracted keywords | I can change or delete a wide range of possibly wrongly identified key words | Auto suggestion upon alteration of data | 10 | low |


D. Full Sequence Diagram with Details Request


E. Anti-Pattern GWT

It should be mentioned that GWT has been described as an anti-pattern [51]. This means that it goes against some of the object-oriented software engineering principles explained earlier in this dissertation. As a consequence, the design of the application, although it accurately depicts what is needed and what happens within the application, does not translate into code quite so simply. The best example is that the user interface in GWT is built on a modular basis, from so-called “Widgets”, rather than on an object-oriented basis. This has some major advantages: a new module can be added simply by creating a new class for it. This is a good example of Protected Variation in software engineering, as a new module can be added without affecting the functionality of any existing ones. Another major advantage is that the modules are nestable, meaning that they can invoke as well as contain each other. This leads to great reusability of a single module, as will be pointed out later. The main disadvantage, however, is that a conventional model-view-controller structure is hard to achieve, as some of these modules require the data to be handled as well as altered within them. For our design this means that the controller is mostly automated within the user interface and therefore becomes redundant as a separate class. Regarding the sequence diagrams from Chapter 3, most of the actions are automated within GWT or nested within the way the data is processed inside the user interface. This mainly affects the “Result Formatter” and the “Controller”.


F. More about Java RPC Calls in GWT

Figure 19 shows the following: the Service interface specifies which methods (or services) have to be implemented on the client and server side. It also specifies the type of the response object coming from the server. The “Service Asynch” interface must be implemented in order to receive asynchronous callbacks from remote procedure calls. Each callable method in this interface corresponds to a method of the Service interface and additionally takes an “AsyncCallback<T>” object as a parameter, where the generic type “T” is the type of the object returned asynchronously within GWT. Note that the Client interface here is optional and has been implemented in order to specify which services are available to this client implementation. This is useful because different levels of client may be implemented in the future, which will need access to different services that should not be accessible to other client implementations.

Once the server response comes in, it is received by the client implementation class, which has to implement “AsyncCallback”. The type of the response can then be identified and the response forwarded to the appropriate user interface implementation. Note that these responses are handled asynchronously: once a request has been sent to the server, the client does not have to interrupt its current activity and wait for the server response. The response may arrive whenever it is ready and can then be processed in the background.
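The relationship between a service interface, its asynchronous counterpart and the callback can be sketched in plain Java as follows. This is an illustration only and not the code of the application: `AsyncCallback<T>` is declared locally as a stand-in with the same two methods as GWT's callback interface so that the example runs without GWT, and `SearchService`, `SearchServiceAsync` and `simulate` are hypothetical names. The server call is simulated with a background task instead of a real remote call.

```java
import java.util.concurrent.CompletableFuture;

/** Plain-Java illustration of the asynchronous callback pattern. */
final class RpcSketch {

    /** Local stand-in with the same two methods as GWT's AsyncCallback<T>. */
    interface AsyncCallback<T> {
        void onFailure(Throwable caught);
        void onSuccess(T result);
    }

    /** Hypothetical service interface: declares what the server offers. */
    interface SearchService {
        String search(String exposure);
    }

    /** Asynchronous counterpart: same method name, no return value,
     *  and an additional callback parameter that receives the result. */
    interface SearchServiceAsync {
        void search(String exposure, AsyncCallback<String> callback);
    }

    /** Simulates the remote call: the server method runs in the background
     *  and the callback is invoked once the result (or an error) is ready. */
    static SearchServiceAsync simulate(SearchService server) {
        return (exposure, callback) -> CompletableFuture
                .supplyAsync(() -> server.search(exposure))
                .whenComplete((result, error) -> {
                    if (error != null) {
                        callback.onFailure(error);
                    } else {
                        callback.onSuccess(result);
                    }
                });
    }
}
```

The caller of `search` returns immediately and continues with its current activity; the result is delivered to `onSuccess` later, which mirrors the behaviour described above.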

There are several restrictions on GWT as well as on the RPC method used within it. The most important one is that Java RPC calls depend on the types passed between client and server being “Serializable” [52]. This means that the object used to implement the responses has to be made up entirely of primitive types or other objects which implement Serializable, meaning that they can be disassembled into a binary data stream and recovered properly. Another restriction between client and server implementation is that all types and methods used in the client have to be available in, or compatible with, JavaScript, as the client runs as JavaScript after deployment.
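The Serializable requirement can be illustrated with standard Java serialization, as in the sketch below. This is an analogy under stated assumptions: GWT RPC uses its own serialization mechanism rather than `java.io` object streams, and the class `QueryResult` with its fields is a hypothetical response type, not the one used in the application.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Illustration of a response type built only from serializable parts. */
final class SerializableSketch {

    /** Hypothetical response object: a String, a primitive and a list of
     *  Strings, all of which are serializable. */
    static final class QueryResult implements Serializable {
        private static final long serialVersionUID = 1L;

        String pmid;
        int year;
        ArrayList<String> covariates = new ArrayList<>();

        /** No-argument constructor, which GWT RPC expects of serializable types. */
        QueryResult() {
        }

        QueryResult(String pmid, int year, List<String> covariates) {
            this.pmid = pmid;
            this.year = year;
            this.covariates = new ArrayList<>(covariates);
        }
    }

    /** Writes the object to a byte stream and reads it back,
     *  returning the recovered copy. */
    static QueryResult roundTrip(QueryResult original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (QueryResult) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

If a field of a non-serializable type were added to `QueryResult`, writing the object would fail, which is the kind of restriction described above.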


G. Testing Done as result of a Testing Plan

| Package | Test Class | Description | # Tests | Outcome |
| --- | --- | --- | --- | --- |
| Client | AdvancedSearchTest | Various tests assembling a search request object, testing different combinations of operators and dimensions. | 10 | Pass |
| | StandardSearchTest | Tests combining different search dimensions | 12 | Pass |
| | ResultTableTest | Several Tests for each column in the results table | 8 | Pass |
| | DetailsPageTest | Several tests for populating each | 16 | Pass |
| Client.model | AdvancedQueryTest | Testing assignments and reads of model class | 12 | Pass |
| | CurationRequestTest | Testing assignments etc. | 22 | Pass |
| | DetailedArticleTest | Testing assignments etc. | 16 | Pass |
| | DetailStatsTest | Testing assignments etc. | 8 | Pass |
| | StandardQueryTest | Testing assignments etc. | 12 | Pass |
| | QueryResultTest | Testing assignments etc. | 8 | Pass |
| Client.service | ClientServiceImplTest | Testing forwarding of client module objects | 3 | Pass |
| Server | ServerAccessImplTest | Testing server routing of possible actions | 4 | Pass |
| | ServerQueryBuilderTest | Testing query builder for each dimension and some combination of dimensions | 10 | Pass |
| | ServerResultTest | Testing server generated response object creation | 8 | Pass |
| Server.dbmysql | DBMYSQLConnectorTest | Mainly tests with ResultSet alteration; Some connection creation tests. | 7 | Pass |
| Server.owndb | CovariateTableTest | 2 tests for each column in that table. | 12 | Pass |
| | CurationTableTest | 2 tests for each column | 18 | Pass |
| | Effect_sizeTableTest | 2 tests for each column | 12 | Pass |
| | ExposureTableTest | 2 tests for each column | 12 | Pass |
| | HighlightsTableTest | 2 tests for each column | 12 | Pass |
| | OutcomeTableTest | 2 tests for each column | 12 | Pass |
| | UserTableTest | 2 tests for each column | 8 | Pass |
| | PmidTableTest | 2 tests for each column | 8 | Pass |
| | PopulationTableTest | 2 tests for each column | 8 | Pass |
| | Study_designTableTest | 2 tests for each column | 38 | Pass |
| Server.epidb | CovariateTableTest | 2 tests for each column | 12 | Pass |
| | Effect_sizeTableTest | 2 tests for each column | 12 | Pass |
| | ExposureTableTest | 2 tests for each column | 12 | Pass |
| | HighlightsTableTest | 2 tests for each column | 12 | Pass |
| | OutcomeTableTest | 2 tests for each column | 12 | Pass |
| | PmidTableTest | 2 tests for each column | 6 | Pass |
| | PopulationTableTest | 2 tests for each column | 38 | Pass |
| Server.shareddb | ArticlesMedline2014TableTest | 2 tests for each column | 36 | Pass |

Total Number of Tests = 436 (149 without DB connector generation)


H. Evaluation Interview Script

Evaluation Interview Script

Visualisation of Structured Epidemiological Information

Short Description: This application provides a search engine for epidemiological data. Key information has been extracted from numerous articles’ abstracts reaching back to the 1970s and put into six dimensions: covariate, effect size, exposure, outcome, population and study design. The extracted information can now be queried according to these six dimensions.

P1 - Browse together

Let’s look through the website together. Notice there is a navigation bar on top of the page. This is where we browse through the main features of the website. The static search provides six fields for the six search dimensions which we can query for. They consist of: Covariate, Effect Size, Exposure, Outcome, Population and Study Design. To test it, let’s search for “cancer” as an exposure. We should be able to get 97 results. Each result should be clickable and bring up the details of this paper. On top we can see the title and the collaborators of that paper. The middle section provides an abstract in which the extracted results can be highlighted by dimension. The bottom of the details window shows a summary of all extracted features. The marked features within the text are changeable using the curation button at the bottom, in case they highlight the wrong extracted words.

P2 - Find the Following

Let’s search for something! Find how many papers there exist for “married” as exposure and “obesity” as outcome. Let’s search for “adiposity” as an exposure and note how many results were found. Let’s add “diabetes” as an outcome and note the drastic change in found search results. Let’s tighten the circle even further by adding “smoking” as a covariate. There should now only be two papers left. Open details about the most recent one and take note of the other two covariates within it.

P3 - Advanced Search

Let’s try the advanced search! Note that here you can combine any of the six dimensions in any way you like. Let’s give it a go, and try searching for “lifestyle” as exposure, as well as “gender” as exposure and see how many results you get.


I. Evaluation Interview Questionnaire with Discussion Questions

Questionnaire:

Questions Yes No N/A

1. Was the application easily understandable?

2. Was the application easy to navigate?

3. Were all the buttons in the position you expected them to be?

4. Would you change any of them?

5. Were you able to solve the tasks easily?

6. Was it ever unclear where to look next in order to solve a task?

7. Was the information being represented accurately?

8. Was there sufficient Epidemiological Information represented?

9. Do you feel that this application could be useful for further epidemiological investigation?

10. Do you think this application could point out gaps in epidemiological research?

11. Do you think this application is useful for browsing epidemiological data in general?

12. Do you think an application like this would be more useful if it would be publicly accessible?

General Discussion:

13. How do you think an application like this would be useful?

14. What would you change or add in order to make it more useful?

15. Overall was there anything that you especially liked or disliked about the application?


J. Evaluation Interview Highlights Table

| Question | Person | Answer | Discussion |
| --- | --- | --- | --- |
| 4 | Jenny | Yes | A second pager on top of the results table would be useful, as it would eliminate the need to scroll on smaller resolution machines. |
| 10 | Jenny | Yes | As the second task from the evaluation showed, after entering 3 search terms, only two papers remained in the result set. This is already a good example for showing gaps in epidemiological research. |
| 12 | Jenny | Yes | It needs to be pointed out that an application like this could be easily misunderstood by users who are not as educated in the field and as a result wrong conclusions could be drawn about what the data means. Help pages or tooltips could improve this situation; Nevertheless, this scenario should always be considered a risk. |
| 13 | Jenny | Discuss | It is useful as it can provide a preliminary examination of previous work for example when writing a grant application or preparing undergraduate projects. |
| 3/15 | Jenny | Yes | Especially liked the positioning of close buttons for the details page, as they were easy to find and conveniently placed. |
| 4 | George | Yes | Add more information about the current curation. |
| 12 | George | Yes | The curation feature is rather powerful. The users curating the data have to take responsibility for what they curate as this could lead to problems within the application. It is generally a good practice that curations are tracked. |
| 13 | George | Discuss | By pointing out gaps between current research and the one represented. |
| 14 | George | Discuss | Add a column to the result set which shows the type of study that the found paper has investigated. |
| 14 | George | Discuss | Generally more statistics about the current result set. |
| 14 | George, Jenny | Discuss | Adding the number of submissions about a certain topic per year as a statistic on the search result. |
| 14 | George, Jenny | Discuss | Some statistic that adds a denominator to what proportion of submissions is about a certain topic this year compared to last year. |
