Degree: Bachelor of Computer Science, 180 hp
Major: Information Systems
Programme: Information Systems
Supervisor(s): Céline Fernandez, Annabella Loconsole
Examiner: Bengt J. Nilsson
Date of exam: 2012-09-20
Technology and Society
Computer Science
Investigation of Pathway Analysis Tools for mapping omics data to pathways - Focus on lipidomics and genomics data
Undersökning av analysverktyg för att kartlägga omikdata till relationsvägar – Fokus på data av typen lipidomik och genomik
Author: Attila Konrád
An education isn't how much you have committed to memory, or even how much you know. It's being able to differentiate between what you know and what you don't. /Anatole France
Acknowledgements
I would like to say thank you to everyone who helped me with my thesis. To my supervisors, I thank you for your patience, guidance and all the good feedback.
Abstract
This thesis examines pathway analysis tools (PATs) from a multidisciplinary view. Many PATs exist today, each analyzing a specific type of omics data; we therefore investigate them and what they can do. By defining specific requirements, such as how many omics data types a PAT can handle and how accurate it is, the most suitable PAT for mapping omics data to pathways can be identified. Results from software testing show that no PAT found today fulfills the specific set of requirements or the main goal. The Ingenuity PAT comes closest to fulfilling the requirements. At the request of the end user, two PATs were tested in combination to see whether together they could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, and the result was not successful, since the combination of the two PATs is no better than the Ingenuity PAT. Focus then turned to an alternative combination, a homepage called NCBI, whose search engine is connected to several freely available PATs and thereby fulfills the requirements. Through the search engine, “omics” data can be combined and more than one input can be taken at a time. Since technology is moving forward rapidly, the need for new tools for data interpretation also grows. This means that in the near future we may be able to find a PAT that fulfills the requirements of the end users.
Keywords: Biochemistry, Cardiovascular disease, Database, Genomics, Lipids, Lipidomics, Metabolomics, PAT, Technology
Sammanfattning (summary in Swedish, translated to English)
This thesis examines analysis tools from an interdisciplinary perspective. There are many different analysis tools today that analyze specific types of omics data, and we therefore investigate how many there are and what they can do. By defining a number of specific requirements, such as how many types of omics data a tool can handle and the accuracy of its analysis, one can see which analysis tools are most suitable for mapping omics data. The results show that no analysis tool available today fulfills the specified requirements or the main purpose when the software is tested. The Ingenuity analysis tool is the closest we can come to the requirements we seek. At the request of the end user, two analysis tools were tested to see whether a combination of them could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, but the result was not successful, since the combination of these tools is no better than the Ingenuity analysis tool. Focus then turned to an alternative combination, a homepage called NCBI. The homepage has a search engine connected to several different analysis tools that are free to use. Through the search engine, “omics” data can be combined and more than one input value can be handled at a time. Since technology is advancing quickly, new analysis tools are needed for data handling, and in the near future we may have an analysis tool that fulfills the requirements of the end users.
Keywords (translated from the Swedish "Nyckelord"): Biochemistry, Cardiovascular disease, Database, Genomics, Lipids, Lipidomics, Metabolomics, Analysis tool, Technology
Contents
Abstract
Sammanfattning
1. Introduction
  1.1. Purpose
  1.2. Problem definitions and Aims
  1.3. Problem discussion
  1.4. Related work with PAT
2. Methods
  2.1. Model in use
    2.1.1. Requirement collection, documentation and validation
    2.1.2. Requirement processing and test case creation
    2.1.3. Objective
    2.1.4. Underlying objectives
  2.2. Alternative research methods
3. Biomedical background
  3.1. Genetics
    3.1.1. Gene
    3.1.2. SNP
  3.2. Biochemistry of Lipids
    3.2.1. Lipid definition
    3.2.2. Classes of Lipids
    3.2.3. Enzymes involved in the synthesis of lipids
    3.2.4. Lipoproteins
  3.3. Genomics
  3.4. Metabolomics
  3.5. Lipidomics
  3.6. Cardiovascular diseases
4. Computer Science background
  4.1. Databases, Data mining and Knowledge discovery
  4.2. PAT
5. Requirements and Test elicitation
  5.1. Requirements
  5.2. Testing
  5.3. Test cases
6. Result
  6.1. Finding the PATs
  6.2. Sorting the PATs
  6.3. Testing the PATs
  6.4. Evaluating the PATs
  6.5. Final evaluation of the PATs
  6.6. The best PAT from the ranked list
  6.7. Combining PATs
  6.8. Functionalities
  6.9. Quality
7. Discussion
  7.1. Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combine them with genetic information?
  7.2. What are the functionalities offered by the available analysis tools?
  7.3. What are the qualities of these tools and how to evaluate them?
  7.4. Why not Ingenuity and why Uniprot with FEvER?
8. Future Value
9. References
Appendix 1 – Test Cases
Appendix 2 – Lipid, MI SNP and Metabo SNP data sheet
Appendix 3 – Requirements Matrixes
Appendix 4 – Response Times
1. Introduction
A vast amount of research is done in lipidomics and genomics, making computers, the Internet and various analysis tools very common today, in both simple and advanced forms. As an example, a simple calculation can be performed on one computer and transferred or copied to another if needed. More advanced operations sometimes require a software tool that can perform a certain task on a given set of data in order to give a certain result. The result, in turn, is usually not logically ordered, and a visual presentation is needed. This is where a pathway analysis tool (PAT) comes in. A PAT is an advanced tool that processes given data, compares it with data stored in a database and presents the results visually. A company tends to hire a programmer to develop a PAT in order to integrate it within the organization [34]. One of the main groups of scientific users is researchers in the fields of bioinformatics, genetics, genomics and metabolomics. Researchers depend on these pathway analysis tools in their scientific work. In some scientific fields, such as genomics and metabolomics, there are too many PATs, doing all kinds of different tasks. Too many PATs in a specific field can confuse researchers who do not have enough knowledge of technology [5]. This makes it difficult to decide which PATs are suited to certain data and within which scientific field. Since technology is also moving forward extremely fast, people with multidisciplinary knowledge are needed more and more [20]. For researchers who work within the biomedical fields of metabolomics and genomics there are specific analysis tools. The purpose of these PATs is to help users in their work by letting them visualize data, which may lead to new scientific discoveries. Technology and information sharing have taken a big step forward and have helped substantially in different areas around the world, such as health care and medicine.
1.1. Purpose
Finding reliable pathway analysis tools (referred to as PATs from now on) that can do all the necessary data computations and present the results visually was requested by Céline Fernandez from the Clinical Research Center (CRC) in Malmö (referred to as the end user from now on). CRC works on discovering new medicines, diagnostic tools and improved treatments in order to improve health worldwide.
1.2. Problem definitions and Aims
Since there are many PATs available, with lots of information, the following research questions are defined in this thesis:
- Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combines them with genetic information?
- What are the functionalities offered by the available analysis tools?
- What are the qualities of these tools and how to evaluate them?
The objectives are defined in order to help answer the three research questions. The main aim of this thesis is the following:
To find a PAT that can process a combination of data inputs of the “omics” type, i.e. lipidomics/metabolomics and genomics data.
In order to reach the main aim, several underlying objectives are needed. These are the following:
1) Find PATs that are able to map pathways for the following types of data:
   a) Overall metabolomics data
   b) Lipidomics data
   c) Genomics data
2) Evaluate the selected PATs and their functions. Test the current accuracy of the existing PATs in order to determine whether the output from these tools shows the “correct” results.
3) Evaluate the selected PATs according to specific requirements given by the end user; see section 1.3 for the specific requirements.
After the evaluation of the PATs according to the requirements, two options are possible:
Option 1: One or more PATs pass steps 2 and 3 and are delivered to the end user.
Option 2: No PAT fulfilling the requirements is found. Alternative solutions are then to see whether it is possible to adapt any of the evaluated analysis tools, to combine more than one, or to develop a PAT in-house that meets the requirements of the end user.
1.3. Problem discussion
In order to solve the problem we must consider what PATs are, how complex they are and what they can do. The functionalities of a PAT need to be tested [28] to see whether they fulfill the specific requirements (see Table 1).
Table 1. Eight specific requirements that need to be fulfilled by a PAT.
Requirement ID | Requirement description
1 | User is able to see and select on the PAT what type of data it must process (whether the input field is for metabolomic, lipidomic or genomic data)
2 | User must be able to check whether the result obtained from the PAT is valid according to literature, Internet or laboratory results
3 | The user must receive results from the PAT within a certain time
4 | The user can navigate between the start of the search (input data) and the end of the search (results obtained)
5 | The user can get a visual presentation of metabolomics, lipidomics and genomics data from the PAT
6 | The user can zoom in and out, expanding the view to neighboring possible results to see connected pathways in the results received from the PAT
7 | The user can input a specific type of data into the PAT (metabolomic, lipidomic or genomic)
8 | The user can input combined omics data and then map them to pathways
Acquiring knowledge from the literature gives us information about the complexity of a PAT [27]. The functionalities of a PAT can be established with the help of software testing of data inputs [9], and in this way we can check whether the PAT satisfies the requirements of the potential users. Quality is a broader concept and harder to define, since quality has different meanings to different people [36]. The quality of a PAT is considered acceptable if it fulfills all the requirements [36] in a set of requirement specifications. We will use the requirement specifications in Table 1. Homepages associated with PATs also need to be quality checked, and five selected aspects are used: Accuracy and Correctness (how trustworthy is the information provided on the homepages), Completeness (are the homepages complete or under construction), Relevance (how relevant is the content or information on a homepage to the PAT), Time and Punctuality (how fast can a homepage be found when searching), and Traceability (is the information provided on the homepages traceable to its original source).
1.4. Related work with PAT
Most PATs today are made specifically with a focus on metabolomics and genomics. This is due to the research work in metabolic engineering, cellular metabolism and toxicogenomics [16, 25]. Companies spend vast amounts of money developing PATs while trying to compete with each other [8, 15]. This competition involves building, adapting and evaluating each other's PATs, and explaining why their own PAT is better than the others [8, 15, 33]. Since each PAT is developed specifically for a biomedical field [41], no full-scale analysis of the entire set of PATs exists yet. Our study is a first attempt at such an analysis of a complete set of PATs.
2. Methods
This section describes the scientific methods used to evaluate the different PATs. It starts with the selected method in use, then describes how the information is gathered, and gives details on the objective and the underlying objectives.
2.1. Model in use
The main purpose of the project (to find a PAT that can process a combination of data inputs of the “omics” type, i.e. lipidomics/metabolomics and genomics data) was divided into four underlying objectives, each with its specific goal. The methods are based on an empirical model, with a study of PATs in order to test and analyze each PAT and its homepage. Test cases are designed based on the requirements gathered from the end user at the first interview. The requirements are rechecked with the end user a few weeks later in a second interview. Once they are acknowledged, the software testing begins with the requirements and test cases, in order to see whether the data going into the PAT matches the data coming out. The input data is a gene name (e.g. NPPA), a reference SNP accession ID (an rs number such as rs5068) or a lipid class name (such as lipoproteins). The output data from the PAT is verified for relevance by comparing the received results with information found in the literature. A ranked list is made with the best PAT first, based on how many requirements are met. If no PAT meets all the requirements, the end user has requested that two specifically selected tools, which the end user is already familiar with, be adapted or combined, in which case the ranked list is discarded.
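The ranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tool names and the sets of fulfilled requirements are placeholders, not results from this study, and the helper names `rank_pats` and `fully_compliant` are hypothetical.

```python
# Hypothetical sketch: tool names and fulfilled requirements are placeholders,
# not findings of this study.
ALL_REQUIREMENTS = frozenset(range(1, 9))  # requirement IDs 1-8


def rank_pats(met_by_pat):
    """Return (name, number of requirements met) pairs, best PAT first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    scored = [
        (name, len(ALL_REQUIREMENTS & set(met)))
        for name, met in met_by_pat.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def fully_compliant(met_by_pat):
    """Return the names of PATs that meet every requirement."""
    return [
        name for name, met in met_by_pat.items()
        if ALL_REQUIREMENTS <= set(met)
    ]


example = {
    "Tool A": {1, 3, 4},
    "Tool B": {1, 2, 3, 4, 5, 6, 7},
    "Tool C": {2, 5},
}
ranking = rank_pats(example)
# If no tool is fully compliant, the fallback (combining two tools) applies.
fallback_needed = not fully_compliant(example)
```

In this sketch the ranked list is only used when at least one tool is fully compliant; otherwise `fallback_needed` signals that the combination of two selected tools should be examined instead.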
2.1.1. Requirement collection, documentation and validation
Five meetings are booked at the Clinical Research Center (CRC) in order to conduct interviews. All participants (researchers, including the end user) discuss the problem that needs to be solved. The discussion focuses on PATs in general and on the specific functions that the researchers want a PAT to have. Requirements connected to these functions are written, and a new meeting is booked. During each meeting everything is written down and documented. After each meeting, the requirements are collected to be sorted and processed in order to make test cases. Later, a checkup takes place at the same location to see whether everything is on the right track.
2.1.2. Requirement processing and test case creation
The requirements are processed and formulated. They are also cut down from 15 to eight requirements covering the most important things that a PAT must do. Each requirement is given an identification number. Test case templates are sought, and one template is selected, downloaded and then customized (Fig 1). Specific test cases are designed to suit the requirements and are linked to their respective requirement (see Table 2). Each test case is designed by stating the goal of the test along with the events needed to achieve the goal. Lastly, the expected response is written, describing what results we should expect from following the events. The whole process starts by carefully checking a requirement from the list to see whether it can be covered by a single test case. If that is not possible, several test cases are needed. If we look at the first requirement in Table 1, we see that three different data types need to be tested. We therefore have to split the requirement into more than one test case, since not every PAT may be able to process all three data types. We decide to start with the first data type, metabolomic input data. We also select a data input that we know should give a response and present some results. From this we can write down the events in the test case: making an input and then getting a response. We can then also state the expected response. In our case it is that the metabolomic data type gives information related to the data input that we made. The next two test cases are similar, with the small difference of having a different input data type. The same approach is applied to the rest of the test cases: each requirement is checked and evaluated to see whether it can be covered by one test or must be split into several test cases, and then the events and the expected response are written.
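As a minimal sketch of the splitting step described above, the following Python fragment turns one requirement into one test case per data type. The record fields mirror the goal, events and expected response mentioned in the text; the class name `PatTestCase`, the function name and the wording of the generated strings are assumptions made for illustration and are not taken from the actual test case template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatTestCase:
    """One test case: a goal, the events to perform, and the expected response."""
    test_id: int
    requirement_id: int
    goal: str
    events: tuple
    expected_response: str


def split_by_data_type(requirement_id, data_types, first_test_id=1):
    """Create one test case per data type for a single requirement."""
    cases = []
    for offset, data_type in enumerate(data_types):
        cases.append(PatTestCase(
            test_id=first_test_id + offset,
            requirement_id=requirement_id,
            goal=f"Check that the PAT accepts {data_type} input",
            events=(f"Enter a known {data_type} value", "Submit the search"),
            expected_response=(
                f"The PAT returns information related to the {data_type} input"
            ),
        ))
    return cases


# Requirement 1 covers three data types, so it becomes three test cases.
cases = split_by_data_type(1, ["metabolomic", "lipidomic", "genomic"])
```

Keeping the requirement ID on every test case preserves the link between requirement and test, which is what a requirement-to-test-case table needs.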
Figure 1. A test case template used in this study.
Table 2. Requirement IDs with descriptions, linked to specific test case IDs.
ID | Requirement description | Type | Linked with Test Case ID
1 | User is able to see and select on the PAT what type of data it must process (whether the input field is for metabolomic, lipidomic or genomic data) | Functional | 1, 2 and 3
2 | User must be able to check whether the results obtained from the PAT are valid according to literature or laboratory results | Non-functional | 4
3 | The user must receive results from the PAT within a certain time | Non-functional | 5
4 | The user should navigate between the start of the search (input data) and the end of the search (results obtained) | Non-functional | 6
5 | The user should get a visual presentation of metabolomics, lipidomics and genomics data from the PAT | Non-functional | 7
6 | The user must be able to zoom in and out, expanding the view to neighboring possible results to see connected pathways in the results received from the PAT | Functional | 8
7 | The user must input a specific type of data into the PAT (metabolomic, lipidomic or genomic) | Functional | 9
8 | The user must be able to input combined omics data and then map them to pathways | Functional | 10
2.1.3. Objective
The main purpose is achieved by acquiring knowledge from literature, such as books and articles, and by doing software testing. The results obtained from the tests are then compared with the requirements made by the potential users of the PAT.
2.1.4. Underlying objectives
Objective 1:
Gather information by searching books and articles, find as many PATs as possible and establish what data each can process. Download the PATs, if possible, to analyze them.
Objective 2:
Evaluate the selected PATs, with their functions and methods, by going through each tool, clicking around and inputting data. Test cases are designed from the given requirements. Tests on the PATs are based upon:
a) Metabolomics, lipidomics and genetic pathways and correlations known from the literature
b) Comparison between results obtained from the literature and from the PAT
c) Comparison between existing laboratory results and the PAT
d) How long it takes the analysis tool to process data
Correct results are considered to be those that come from scientific articles, books or laboratory results verified by scientists. Pathways and correlations for metabolomics, lipidomics and genetics are tested against results known from the literature. Results obtained from the PAT are compared first against results from articles and books, and afterwards against the existing laboratory results. The accuracy of a PAT is judged from its output data: the results either accurately match all the data or they do not. A simple timer is used to record the processing time of a PAT. Finally, a list of PATs shows which PATs passed, which failed, and why they failed our examination.
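The timing and comparison steps of Objective 2 could be sketched as follows. This is a hedged illustration only: the study describes manual testing with a simple timer, whereas this fragment assumes a PAT can be called as a function. `fake_pat` is a stand-in, and the pathway label is a placeholder rather than real pathway data.

```python
import time


def timed_query(query_pat, input_value):
    """Run one query and return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = query_pat(input_value)
    elapsed = time.perf_counter() - start
    return output, elapsed


def matches_literature(pat_output, literature_result):
    """All-or-nothing accuracy check: the output either matches all data or not."""
    return set(pat_output) == set(literature_result)


def fake_pat(input_value):
    """Stand-in for a real PAT; returns placeholder labels, not real pathways."""
    known = {"NPPA": ["placeholder pathway"]}
    return known.get(input_value, [])


output, elapsed = timed_query(fake_pat, "NPPA")
passed = matches_literature(output, ["placeholder pathway"])
```

The all-or-nothing comparison follows the text's statement that results either match all data or do not; a graded similarity score would be a different design choice.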
Objective 3:
In order to have a satisfied end user, a specific set of requirements is needed that must be fulfilled in a final evaluation. The requirements are collected at an early stage through interviews with researchers and the end user, who also represent other potential users. The most desired and important requirements were discussed and identified to be the following. The selected analysis tool must be able to:
a) Navigate between data and results
b) Make a visual presentation of obtained metabolomics, lipidomics or genomics data
c) Zoom in and out, expanding the view to neighboring possible results connected to pathways in the results obtained
d) Process more than one type of data (metabolomic, lipidomic or genomic)
e) Combine omics data and then map their pathways
Navigation is tested by going from the output data (the results obtained) back to the input data (where the data was first inserted). Data is inserted in the required fields, while traceability or clickable tracking views are sought when obtaining results. Any visual presentation of the obtained results is accepted, but a detailed view of pathway combinations and correlations is preferred. On the output data, zoom functions are sought, i.e. a small magnifying glass with a plus or minus sign in the PAT. To test how many types of data (metabolomic, lipidomic or genomic) the PAT can process, one of each data type is selected. All three data types together (metabolomic, lipidomic and genomic) are tested first, two data types together (metabolomic with lipidomic, metabolomic with genomic, or lipidomic with genomic) are tested second, and lastly each type is input one by one (metabolomic, lipidomic, genomic). If a PAT passes all aims after evaluation, all results and test material are intended to be turned over to the end user. Further support will be provided in the form of answering questions about the specific PAT. If no PAT is found that fulfills the requirements, tests on the PATs Uniprot and FEvER will be carried out.
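The test order for combined inputs described above (all three data types first, then pairs, then single types) can be enumerated with a short Python sketch. The function name is hypothetical; only the three data type names come from the text.

```python
from itertools import combinations

DATA_TYPES = ("metabolomic", "lipidomic", "genomic")


def input_combinations(data_types=DATA_TYPES):
    """List the input combinations to test, largest combination first."""
    ordered = []
    for size in range(len(data_types), 0, -1):
        # combinations() yields each unordered group of `size` types once.
        ordered.extend(combinations(data_types, size))
    return ordered


test_order = input_combinations()
```

With three data types this gives seven combinations in total: one triple, three pairs and three single inputs.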
2.2. Alternative research methods
There are alternative methods to conduct this study, but they would involve working in a biochemistry laboratory to observe, interview and obtain results from experiments, and afterwards designing and building a complete PAT. Another method is to make a homepage and connect it to a PAT that is being used in the laboratory. The method selected in section 2.1 and described further in section 4 was chosen for reasons of result quality, time savings and efficiency.
3. Biomedical background
This section contains the background information needed in order to understand the biomedical part. The information gathered is about genetics, lipids and their biochemistry, metabolomics, genomics and cardiovascular disease.
3.1. Genetics
Genetics is the study of genes, with their structures, sequences and role in heredity. It is a way to try to explain how genes work, what they are and what they can do [32]. Genetics involves scientific studies of genes and their effects, which lead to variation in living organisms [32], that is, how certain traits or conditions are passed down from one generation to the next. It also covers how genes are units of heredity that carry instructions for making proteins, which direct the activities in cells and the functions of our bodies. An example is inherited disorders leading to diseases [32]. Such disorders have been detected thanks to the large number of laboratory experiments and advances in technology; stored data makes PATs useful, providing functions to search for genes and match them with each other.
3.1.1. Gene
Genes are small molecular units that carry the heredity of living organisms. A gene holds the information to build and maintain an organism. Eukaryotic cells have a nucleus, which contains tightly packed and well-protected DNA [5]. The main building blocks of a gene are covalently linked nitrogen bases A, T, C and G. The structures are then strengthened by carbon and hydrogen bonds. This makes a sequence that in the end forms a long double-helix DNA chain. The DNA chain is tightly packed together with histones, which are proteins, to form an organized structure. This organized structure is called a chromosome [11]. All the chromosomes are well protected within the nucleus (Fig 2). The DNA chain in turn codes for many functions of living organisms [5]. Genetic information and traits are also passed on to the offspring when mating. In our genome there are structural genes which, when read, tell us what materials are needed in order to build up a cell or an organism. This is our genotype. Which structural genes we are going to use is determined in combination with the environment, and this is called our phenotype. The phenotype is also affected by the environment of earlier generations, and this is called epigenetics [5]. Examples of phenotypes are eye color and blood type. The genotypes of all human individuals are identical up to about 99 percent. The remaining 1 percent varies from person to person, creating the features that make us all unique. Tiny differences in the genome sequences distinguish one individual from another [5]. These tiny differences, changes in single bases, arise through reproduction, when two individuals create an offspring, and through Single Nucleotide Polymorphisms (SNPs), described further in the text below. Keeping track of tiny differences is hard unless you have an analysis tool at your disposal, and some of these tiny genetic variations are important because of susceptibility to certain diseases (such as asthma, diabetes, sclerosis and cancer) [5].
Figure 2. A schematic presentation of human DNA assembled into a chromosome.
3.1.2. SNP
SNP is short for Single Nucleotide Polymorphism, a sequence variation in DNA.
It means that one nitrogen base in a gene sequence differs in one individual
while the rest of the sequence is identical to that of another individual [5].
For example, the gene sequence ATAGGC is almost the same as the gene sequence
ATCGGC, except that the second A has been replaced by a C. Such changes of one
nucleotide in the sequence of our genes occur throughout the whole genome [3].
SNPs occur in all species, lead to genetic variation and may result in
different phenotypes of the organism. The research results in [4] show how
differentiation has occurred. The genetic changes are driven by natural
selection towards the most favorable adaptation of the genes [3]. Some SNPs are
even specific to one ethnic group and missing in another. According to [32],
both the coding and the non-coding regions of the DNA can be affected. SNPs are
involved in susceptibility to diseases, as mentioned at the end of section
2.1.1. The following scenario describes why SNPs are important [32]. A couple
registers for a health check and gives blood to be analyzed in order to detect
how healthy they are. The blood is treated so that only small sequences of
nucleotides are left. The sequence of one individual is the following:
“GCCAGTATTGTCGATTTCACAAGTGCCTTTCTGTCGGGATGTCACACAACGG”.
The other person has the following sequence:
“GCCAGTATTGTCGATTTCACAAGTGCGTTTCTGTCGGGATGTCACACAACGG”.
The sequences from both individuals code for a protein that controls the uptake
of fat and sugar in the human body. The single base that differs between these
two individuals (C versus G) is marked with a color. One of them has a high
risk of getting diabetes. With the help of today’s technology, SNP analyses are
used to determine disease susceptibility [32]. The analysis reveals terrible
news for the couple: the individual with the single base changed to G has to
start injecting insulin unless food habits change within a year or two.
Scenarios like the one described above are very common in health care today,
and health care is not the only area that exploits genetic variation. In
forensic science, genetic variations are exploited during DNA fingerprinting
[32].
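The comparison in the scenario above can be sketched in a few lines of Python. This is only an illustration of the idea of a single-base difference, assuming two already aligned sequences of equal length; the function name find_snps is hypothetical and is not taken from any PAT.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for every mismatch.

    Assumes both sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Short example from the text: the second A is replaced by a C.
print(find_snps("ATAGGC", "ATCGGC"))

# The two sequences from the health-check scenario.
person_1 = "GCCAGTATTGTCGATTTCACAAGTGCCTTTCTGTCGGGATGTCACACAACGG"
person_2 = "GCCAGTATTGTCGATTTCACAAGTGCGTTTCTGTCGGGATGTCACACAACGG"
print(find_snps(person_1, person_2))
```

A real SNP analysis works on far longer sequences and must also handle insertions and deletions, which this position-by-position comparison does not.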
3.2. Biochemistry of Lipids
Biochemistry, also called biological chemistry, is the study of chemical
processes in living organisms. Biochemical processes regulate and govern all
living processes within all living organisms [5]. This occurs through
biochemical signaling. The signaling is a kind of information flow, like
sending a message from one place to another. Signals flow through every part of
an organism, regulating the metabolism. Metabolism refers to the processes by
which living organisms sustain life and reproduce themselves. One important
part of biochemistry is the lipids. Lipids are important components of a cell:
they form cell membranes and vital tissues and serve as an energy source for
the organism [1]. Lipids are stored as energy reserves within the organism and
used when needed. Lipids help maintain the electrochemical balance of a cell
and take part in cell signaling and in trafficking, that is, controlling what
goes into or out of the cell [11]. Lipids usually consist of a polar head and a
hydrophobic tail. Lipids bind to each other because the hydrophobic part tends
to stay in contact with other hydrophobic molecules [3]. The distribution
between the hydrophobic and polar parts of the lipids directs the
three-dimensional structure of the molecules [7]: with a relatively large polar
part the lipids form micelles, while a more equal distribution leads to the
formation of double layers known as membranes (Fig 3).
Figure 3. Picture of lipids with hydrophobic tails bound together and with
other components forming the membrane (modified picture taken from Human Cell
Biology ref. [43]).
3.2.1. Lipid definition
Chemists, biochemists and other analysts who work with lipids have a firm
understanding of the term lipid, according to [19]. However, there is no widely
accepted definition today, and lipids are said to be a group of naturally
occurring compounds. According to [44] and [53], thousands of different lipid
molecules can be found in an organism, and lipids can be categorized into six
main categories (Fig 4). They all have low solubility in water and high
solubility in organic solvents.
3.2.2. Classes of Lipids
Recently a new nomenclature system was proposed by [26] due to the diversity of
lipids in human plasma, separating lipids into eight classes or categories, of
which six are considered main classes. Each class can be further divided into
subclasses and individual molecular species (Fig 3).
The first category is the fatty acyls, also called fatty acids. They occur in
three forms: fatty acids, octadecanoids and eicosanoids. They are the most
common building blocks of structurally more complex lipids and can be saturated
or unsaturated. Cells use these lipids to form the various membranes found in a
cell, to store energy and to adjust the membrane fluidity. [43, 53]
The second category is the glycerolipids, which occur in three forms: mono-,
di- and triacylglycerols. Their main function is energy storage, and in animals
they are stored in the tissue as fat. [43, 53]
The third category is the glycerophospholipids, usually called phospholipids.
The main forms are phosphatidylcholine (PC), phosphatidylethanolamine (PE) and
phosphatidic acid (PA). The glycerophospholipids are the only class with a
phosphate bond, and they are the key component of bilayers. [43, 53]
The fourth category consists of the sphingolipids. The main forms are
sphingomyelin and ceramides. Sphingolipids have a polar head and two non-polar
tails. Sphingomyelin acts as protection by forming a myelin sheath around
nerves. [43, 53]
The fifth category is the sterol lipids, which are various forms of alcohols.
Sterol lipids are an important component in biological roles. Sterols act as
regulating hormones and as signaling molecules. [43, 53]
The last category is the prenols, which form terpenes and act as precursor
molecules of vitamins such as vitamins A, E and K. [43, 53]
3.2.3. Enzymes involved in the synthesis of lipids
This section gives a deeper insight into lipids and their synthesis, to give a
better understanding of the amount of information a PAT must be able to
process, from the data input (a lipid name connected to glycerolipids) to the
results obtained.
Some lipid chains are very long or complex while others are short. It would
take a long time to synthesize the lipids chemically; with the help of enzymes,
however, it is much faster, as [37] presents. Lipids occur in numerous forms,
and several enzymes are needed. In [46], a systems biology view presents the
enzymes needed by use of a PAT. For example, the synthesis of fatty acids
occurs in the cytoplasm, and the key enzymes involved are acetyl-CoA
carboxylase (ACC) and malonyl-CoA carboxylase (MCC), as stated in [51]. Another
group of enzymes, the acyl-CoA:cholesterol acyltransferases (ACAT), works on
cholesterol [51]. This is supported by [45], which gives a clear view in
pictures. There are many fatty acids, and they can be saturated or unsaturated;
for this purpose designated symbols are used [31] to keep track of the carbon
atoms and their bonds. A symbol consists of two numbers separated by a colon
(:) [31]. The first number gives the carbon chain length of the fatty acid and
the second number the state of saturation. A fatty acid with several
unsaturated bonds has a higher second value (see Table 3). Synthesis of fatty
acids beyond a length of 16 carbons goes through a two-carbon elongation
process carried out, according to [31], by enzymes in the endoplasmic reticulum
(ER). Desaturation also occurs in the ER, using four enzymes named delta-4,
delta-5, delta-6 and delta-9 desaturase. The number in each name denotes the
position in the fatty acid carbon chain at which the desaturation occurs [31].
The main desaturase is delta-9, called stearoyl-CoA desaturase-1. The
desaturation requires oxygen (O2), the coenzyme nicotinamide adenine
dinucleotide in its reduced form (NADH) and an electron-transporting
hemoprotein called cytochrome b5 [47]. In fatty acid desaturation two hydrogen
atoms are removed from the fatty acid, oxidizing both the fatty acid and NADH.
This creates a double bond between carbons in the fatty acid chain.
Table 3. The main fatty acids in organisms (modified table taken from Cyberlipid
Center ref. [31] and Virginia web education ref. [10]).
Carbons  Name               Systematic name                   Symbol  Structure
Saturated fatty acids
12       Lauric acid        Dodecanoic acid                   12:0    CH3(CH2)10COOH
14       Myristic acid      Tetradecanoic acid                14:0    CH3(CH2)12COOH
16       Palmitic acid      Hexadecanoic acid                 16:0    CH3(CH2)14COOH
18       Stearic acid       Octadecanoic acid                 18:0    CH3(CH2)16COOH
20       Arachidic acid     Eicosanoic acid                   20:0    CH3(CH2)18COOH
22       Behenic acid       Docosanoic acid                   22:0    CH3(CH2)20COOH
24       Lignoceric acid    Tetracosanoic acid                24:0    CH3(CH2)22COOH
Unsaturated fatty acids
16       Palmitoleic acid   9-Hexadecenoic acid               16:1    CH3(CH2)5CH=CH(CH2)7COOH
18       Oleic acid         9-Octadecenoic acid               18:1    CH3(CH2)7CH=CH(CH2)7COOH
18       Linoleic acid      9,12-Octadecadienoic acid         18:2    CH3(CH2)4(CH=CHCH2)2(CH2)6COOH
18       α-Linolenic acid   9,12,15-Octadecatrienoic acid     18:3    CH3CH2(CH=CHCH2)3(CH2)6COOH
18       γ-Linolenic acid   6,9,12-Octadecatrienoic acid      18:3    CH3(CH2)4(CH=CHCH2)3(CH2)3COOH
20       Arachidonic acid   5,8,11,14-Eicosatetraenoic acid   20:4    CH3(CH2)4(CH=CHCH2)4(CH2)2COOH
24       Nervonic acid      15-Tetracosenoic acid             24:1    CH3(CH2)7CH=CH(CH2)13COOH
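The symbol notation used in Table 3 is simple enough to read programmatically. The following is a minimal sketch, assuming symbols of the plain form "carbons:double bonds"; the names FattyAcidSymbol, parse_symbol, is_saturated and saturated_structure are hypothetical helpers introduced for illustration only.

```python
from typing import NamedTuple


class FattyAcidSymbol(NamedTuple):
    carbons: int       # length of the carbon chain (first number)
    double_bonds: int  # number of carbon-carbon double bonds (second number)


def parse_symbol(symbol: str) -> FattyAcidSymbol:
    """Split a symbol such as '18:2' into chain length and double-bond count."""
    carbons_text, bonds_text = symbol.strip().split(":")
    return FattyAcidSymbol(int(carbons_text), int(bonds_text))


def is_saturated(symbol: str) -> bool:
    """A fatty acid is saturated when the second number is zero."""
    return parse_symbol(symbol).double_bonds == 0


def saturated_structure(carbons: int) -> str:
    """Condensed formula of a straight-chain saturated fatty acid.

    Follows the pattern of the saturated rows in Table 3: one CH3 group,
    (carbons - 2) CH2 groups and one COOH group.
    """
    if carbons < 3:
        raise ValueError("pattern only applies to chains of three or more carbons")
    return f"CH3(CH2){carbons - 2}COOH"


print(parse_symbol("18:2"))      # linoleic acid
print(is_saturated("16:0"))      # palmitic acid
print(saturated_structure(16))   # structure of palmitic acid
```

The symbol alone does not say where the double bonds sit, which is why α-linolenic and γ-linolenic acid share the symbol 18:3 and need the systematic name to be told apart.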
Complex lipids have a longer biosynthetic pathway, and two main pathways are
known according to [47]: the sn-glycerol-3-phosphate pathway (also known as the
Kennedy pathway) and the monoacylglycerol pathway (Fig 5 and 6). Synthesis by
the Kennedy pathway occurs in the liver and adipose tissues, while the
monoacylglycerol pathway takes place in the intestine, as confirmed in [42].
Both start with the catabolism of glucose (glycolysis), resulting in the
biosynthesis of glycerol; however, new evidence in [47] indicates that some
glycerol is synthesized anew (de novo) from single molecules by a process
called glyceroneogenesis. The following reactions occur in the endoplasmic
reticulum (ER) of mammalian organisms [47]. sn-Glycerol-3-phosphate is
esterified by a fatty acid coenzyme in a reaction catalyzed by the enzyme
glycerol-3-phosphate acyltransferase (GPAT) at the sn- position to form
lysophosphatidic acid. Lysophosphatidic acid is then acylated to form
phosphatidic acid, an intermediate product in the synthesis of all
glycerolipids [47]. During synthesis of triacyl-sn-glycerol, the phosphate
group is removed by a family of enzymes called lipid phosphate phosphatases
(PAP), as stated by [47], forming 1,2-diacyl-sn-glycerols, which are further
acylated by diacylglycerol acyltransferase (DGAT) into triacyl-sn-glycerol
(Fig 5). During synthesis of the glycerophospholipids phosphatidylcholine (PC),
phosphatidylethanolamine (PE) and phosphatidylserine, the phosphate group is
not removed from phosphatidic acid, as stated by [6]. Instead, phosphatidic
acid is used as a precursor molecule in the synthesis of glycerolipids (Fig 6).
The synthesis by the monoacylglycerol pathway is less complex and involves only
a few enzymes belonging to an acylglycerol acyltransferase family to form the
triacylglycerols in the intestine [47].
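To show how such a pathway can be represented as data, the triacylglycerol branch of the Kennedy pathway described above can be written as a list of reaction steps. This is a simplified sketch and not the data model of any particular PAT; the structure KENNEDY_PATHWAY and the function trace are hypothetical, and the enzyme of the second acylation is left empty because the text does not name it.

```python
# Each step: (substrate, enzyme abbreviation from the text or None, product).
KENNEDY_PATHWAY = [
    ("sn-glycerol-3-phosphate", "GPAT", "lysophosphatidic acid"),
    ("lysophosphatidic acid", None, "phosphatidic acid"),  # enzyme not named in the text
    ("phosphatidic acid", "PAP", "1,2-diacyl-sn-glycerol"),
    ("1,2-diacyl-sn-glycerol", "DGAT", "triacyl-sn-glycerol"),
]


def trace(start, target, steps=KENNEDY_PATHWAY):
    """Follow the linear pathway from start to target and return the steps taken."""
    by_substrate = {substrate: (enzyme, product) for substrate, enzyme, product in steps}
    route = []
    current = start
    while current != target:
        if current not in by_substrate:
            raise LookupError(f"no reaction starts from {current!r}")
        enzyme, product = by_substrate[current]
        route.append((current, enzyme, product))
        current = product
    return route


for substrate, enzyme, product in trace("sn-glycerol-3-phosphate", "triacyl-sn-glycerol"):
    print(f"{substrate} --[{enzyme or 'acylation'}]--> {product}")
```

A branching pathway, such as the link to the glycerophospholipids in Fig 6, would need a graph with several products per substrate instead of this linear lookup.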
Figure 4. Seven lipid classes and how they interact biosynthetically (modified
picture taken from Molecular biochemistry ref. [45]).
Figure 5. The Kennedy pathway synthesis in mammals (modified picture taken
from The AOCS Lipid Library – Triacylglycerols ref. [47]).
Figure 6. Synthesis of acylglycerides and glycerophospholipids showing a link
between the two pathways (modified picture taken from Lipid Library ref. [35]).
3.2.4. Lipoproteins
Lipids are almost insoluble; however, there are ways for them to be transported
through the blood circulation [5]. Lipoproteins allow the lipids to be
transported through the blood circulation in order to reach different tissues
[5]. Lipoproteins are assembled so that they contain both proteins and lipids.
The protein part serves as an emulsifier for the lipids [11]. There are five
major classes of lipoproteins, two of which are considered very important [1]:
high density lipoproteins (HDL) and low density lipoproteins (LDL). The
remaining lipoproteins are intermediate density lipoproteins (IDL), very low
density lipoproteins (VLDL) and chylomicrons [11]. Both HDL and LDL carry
lipids such as cholesterol; LDL is sometimes referred to as the bad cholesterol
and HDL as the good cholesterol. Problems can occur during the oxidation of LDL
which, according to [11], leads to almost unstoppable chain reactions. The
effect of these chain reactions is atherosclerosis many years later [11].
3.3. Genomics
Genomics is a discipline within genetics that focuses on the study of the
genomes of all organisms [32]. The purpose of this field of research is to
determine the entire DNA sequence of organisms and to make a scaled mapping of
all the genes [32]. This also includes mapping what a gene does and its
association with processes within an organism, e.g. metabolomics or lipidomics
[7]. During the process of mapping and association each gene gets a designated
name and number (as an ID tag), with all the necessary information about that
specific gene provided in a database [32]. Information can be retrieved from
these databases with a PAT when needed.
3.4. Metabolomics
Metabolomics is the study of chemical processes involving metabolites, and
sophisticated analytical technologies are used to make systematic studies [8,
18]. A systematic study consists of target analysis, metabolite profiling and
metabolic fingerprinting. Metabolites are found in all biological cells and all
have unique chemical signatures, like a fingerprint [46]. The unique
fingerprints are the end products of cellular processes and can be used to see
how specific chemical processes have occurred [8, 46]. The chemical processes
examined can come from a living organism, cells, tissues or even an organ. The
research field of metabolomics consists of many subfields, and the one we focus
on is lipidomics (lipids/fatty acids) [40].
3.5. Lipidomics
Lipidomics describes the complete profile of lipids in cells, tissues or
organisms [47]. Lipidomics is a subfield of metabolomics and a newly emerged
research field that has been driven forward by rapid advances in technology
[53]. Such technologies are e.g. mass spectrometry (MS), fluorescence
spectroscopy (FS) [24] and nuclear magnetic resonance (NMR) [39]. These
technologies store large amounts of data in databases, giving PATs the
possibility of a new method for data analysis [18].
3.6. Cardiovascular diseases
There are many diseases around the world. One class of diseases involves the
heart or the vessels that transport blood (arteries and veins); these are
called cardiovascular diseases [46]. Cardiovascular diseases include the
following: aneurysm (abnormal bulge in an artery), angina (chest pain due to
lack of blood to the heart muscle), atherosclerosis (plaque build-up inside the
arteries), cerebrovascular accident (stroke), congestive heart failure,
coronary artery disease and myocardial infarction (heart attack). Several
researchers in [12] claim that known factors such as lipid or fat content can
affect the cardiovascular system, pointing out that high density lipoproteins
(HDL) and low density lipoproteins (LDL) are regarded as factors behind
cardiovascular diseases. Once the disease is detected it has usually progressed
for years, leading to the need for an operation or even to death. Another group
of researchers in [29] have shown that there are several SNPs associated with
the plasma levels of HDL and LDL, which in turn are associated with myocardial
infarction. After testing individuals from five different ethnic groups with
respect to eight SNPs from three genes associated with cholesterol and
lipoprotein synthesis, [2] obtained a clear correlation with myocardial
infarction. A study on risks for myocardial infarction in [48] strengthens this
theory.
4. Computer Science background
This section contains background information needed in order to understand
the IT part and how a PAT works.
4.1. Databases, Data mining and Knowledge discovery
A database consists of tables with many columns and rows that hold a vast
collection of data in the form of information. The information is stored at a
specific place and is usually well organized. Depending on what is inserted
into the database, the information can be about genetics, lipidomics or
something else entirely. Databases often require some form of tool in order to
read and retrieve specific information quickly among the vast amount of data.
This is where data mining comes in: specific information is sought and gathered
by a tool and then presented as results. From the presented results, knowledge
can be gained. The gained knowledge takes many forms; it can for example be
used to improve experiments, to confirm experimental results or to change the
approach to solving a scientific problem [13].
4.2. PAT
As mentioned earlier, all kinds of IT-based tools have been developed in
various biomedical fields [21]. This has been done in order to keep track of
all the necessary scientific information obtained during the past years [20].
The PATs that we are working with process information about genes, SNPs and
lipid metabolism. The PATs therefore play an important role in lipidomics and
genomics research. The tools can be either web-based software or downloadable
programs that take data in the form of a gene name, an SNP accession ID (rs
number) or a class name of a lipid. The PAT then processes the given data by
searching local or remote databases. The databases consist of many tables with
different information related to genes and lipids. What the PAT does is perform
many join operations between the tables and put the information together. When
a search occurs by data mining, the information from these tables is gathered
and presented as results based on the input parameters [9]. The input
parameters are the texts entered in the search field. In some cases the tools
fail to provide results for certain inputs. Finally, the analysis tool shows
the results, telling whether any results were found, where the lipids or genes
are used in the metabolism and how they are connected to other lipids or genes
in the metabolism. The researchers working with the PAT can further analyze the
results in order to gain new knowledge. With new knowledge, new discoveries can
be made in e.g. lipidomics or genomics, which leads to a demand for updates to
the databases and the PATs (Fig 8). With the help of PATs, new discoveries
about diseases like atherosclerosis can be made [24], or new pathway links in
the metabolic system responses can be found [52].
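The join-based lookup described above can be illustrated with a small in-memory database. This is a minimal sketch using Python's standard sqlite3 module; the table layout, the column names and the functions build_demo_database and search are assumptions made for illustration and do not reflect the schema of any real PAT. The only data used are the gene APOE, its two protein accession numbers and its link to lipoproteins, as used later in section 6.6.

```python
import sqlite3


def build_demo_database() -> sqlite3.Connection:
    """Create a tiny, hypothetical gene/protein/lipid schema in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT UNIQUE);
        CREATE TABLE protein (accession TEXT PRIMARY KEY, gene_id INTEGER);
        CREATE TABLE lipid_class (class_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE gene_lipid (gene_id INTEGER, class_id INTEGER);
        INSERT INTO gene VALUES (1, 'APOE');
        INSERT INTO protein VALUES ('NP_000032', 1), ('P02649', 1);
        INSERT INTO lipid_class VALUES (1, 'lipoproteins');
        INSERT INTO gene_lipid VALUES (1, 1);
        """
    )
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[tuple[str, str]]:
    """Join the tables and return (gene symbol, lipid class) pairs for a search term."""
    query = """
        SELECT DISTINCT g.symbol, l.name
        FROM gene AS g
        LEFT JOIN protein AS p ON p.gene_id = g.gene_id
        JOIN gene_lipid AS gl ON gl.gene_id = g.gene_id
        JOIN lipid_class AS l ON l.class_id = gl.class_id
        WHERE g.symbol = ? OR p.accession = ?
    """
    return conn.execute(query, (term, term)).fetchall()


conn = build_demo_database()
print(search(conn, "APOE"))      # search by gene name
print(search(conn, "P02649"))    # search by protein accession number
print(search(conn, "UNKNOWN"))   # an input for which no result is found
```

The empty list returned for the last search corresponds to the case where a tool fails to provide results for a certain input.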
Figure 8. A view on how everything is connected (both virtually and
physically) to the PAT.
5. Requirements and Test elicitation
The following section contains information about requirement specification,
test plan, test case, and software testing.
5.1. Requirements
Many software programs or tools require months of testing to see if they
function correctly, and they need very thorough sets of specific requirements
in order to be considered good functional tools, as stated in [39]. Specific
sets of requirements can be obtained through interviews with people of
importance, such as stakeholders or individuals with a key role. We use an
empirical method for software testing by gathering requirements, partially
based on the requirements phase of one of the waterfall models used by Ian
Sommerville [50]. In the requirements phase, sometimes also called the
requirements engineering phase, brainstorming, research and analysis are
conducted on the software that will be either developed or tested. We intend to
do the testing part without developing any new PAT, and therefore only the
requirements methodology is adapted and applied in our study. Brainstorming,
research and analysis are the most important parts of requirements gathering.
Basic requirements are defined and set for the uses that the software must
support [49]. During this phase, in-depth studies are made of current working
processes and of how the problems can be addressed or solved. Without an
understanding of the given requirements, delivering a successful system or
software is unlikely. Requirements elicitation involves many interviews with
key individuals. Requirements are analyzed and evaluated and can be of three
kinds. Those with the highest priority are the must have, those with second
priority are the should have and those with the lowest priority are the nice to
have. The requirements are then specified and validated, and the individuals
with a key role or the stakeholders accept the requirements. Requirements are
therefore needed in order to define what a software program must or should do,
the features it has and the quality that it must fulfill [39]. Requirements are
divided into functional and non-functional requirements. Functional
requirements are always defined with shall or must. Non-functional requirements
are properties or qualities a product must have and are used to describe the
software’s usability, reliability and performance [17]. Quality is not
something that can be measured easily, as mentioned in section 1.3, but one way
to measure quality according to [23] is by using metrics or statistics on
properties that can be measured and that are associated with quality.
5.2. Testing
In order to test well, test plans are developed in the form of a document so
that systematic approaches can be used in software testing. Systematic
approaches are fast and time-saving [30]. A test plan describes the testing
phases (e.g. unit testing, integration testing, acceptance testing), the needed
requirements, all activities, the needed resources and the documentation notes.
In this work, we use acceptance testing to determine whether a set of
requirements is met. The acceptance tests are run with specific data. Once the
results are obtained, they are compared with known expected results. Depending
on whether the result matches, a pass or no pass is given. There are many ways
to make tests, and most of the time they differ between companies [30]. All
test plans are created prior to any testing. A test plan usually has a suitable
name related to the tests performed. Once a test plan is selected, it is
performed according to a template [30] in which the test plan document is
divided into two parts. The first part consists of general information about
the tests that will be performed: participants, test strategies, specifications
and the functions or features that will be used. The second part contains the
procedure, in the form of scenarios (also called cases), for how the tests will
be done. The creation of the test plan, requirements and test cases leads to
the actual software testing. The purpose of the software testing is to
investigate and gain information on how the selected software program works
[39]. Software properties are evaluated during the software testing, and bugs
and errors that occur are also eliminated [23]. Once the software program or
application meets the requirements, it is considered to be working well and to
satisfy the needs of those who requested it.
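The pass or no pass decision of an acceptance test can be sketched as a comparison between the actual and the expected result. The example below is only an illustration: the class AcceptanceTest, the function run_acceptance_test and the stand-in fake_tool are hypothetical, the IDs are placeholders, and a real PAT would be operated through its own interface rather than called as a Python function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AcceptanceTest:
    test_case_id: int     # placeholder ID, not taken from the appendix
    requirement_id: int   # placeholder ID of the requirement being tested
    input_data: str       # the specific data the test is run with
    expected: str         # the known expected result


def run_acceptance_test(test: AcceptanceTest, tool: Callable[[str], str]) -> str:
    """Run the tool on the test input and compare with the expected result."""
    actual = tool(test.input_data)
    return "pass" if actual == test.expected else "no pass"


def fake_tool(term: str) -> str:
    """Hypothetical stand-in for a PAT: maps a search term to a result label."""
    known_results = {"APOE": "lipoproteins"}
    return known_results.get(term, "no result")


print(run_acceptance_test(AcceptanceTest(1, 1, "APOE", "lipoproteins"), fake_tool))
print(run_acceptance_test(AcceptanceTest(2, 1, "XYZ", "lipoproteins"), fake_tool))
```

Keeping the expected result inside the test record means the verdict depends only on recorded data, which makes the outcome repeatable between testers.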
5.3. Test cases
If software is going to be tested, a set of test conditions has to be written
from the functional requirements [17]. The test conditions are referred to as
test cases and are performed in a certain sequence that must be followed during
testing. First the goal of the test is described together with the event or
execution steps, followed by the expected response and the actual response from
the test. Test cases are often created, with a set of conditions from the given
requirements as mentioned above, in order to reduce ambiguity in the software
to a minimum [21]. The number of test cases depends on the number of
requirements given. To perform the testing phase, testers are selected to
examine, discover and determine whether the software is working correctly
during testing. To keep track of a tester's work, traceability matrices are
used, linking requirements to specific test cases (Fig 8). A test case consists
of many steps, starting with an input based on the requirement to be tested and
ending once all steps have been completed. The action or execution events
describe how to do the test, with descriptions of the expected response or
outcome. The actual results obtained are written down once the test case is
complete.
Figure 1. A test case template used in this study.
6. Result
6.1. Finding the PATs
Many PATs were found by searching the Internet, and most of them had their own
homepages. Searches were made with the Google search engine using the search
term “PAT”. 46 homepages were found, describing analysis tools that had to be
evaluated. The homepages stated what type of data the PATs could process, which
was confirmed by downloading and testing the PATs. The downloaded PATs were
evaluated with test cases 1 to 3, which require the tool to be able to input
and process lipidomics, metabolomics and genomics data as described in appendix
1. The genomic data used, with known correlations with lipid synthesis and
metabolism, are shown in appendix 2. Out of the 46 PATs, 23 analysis tools
fulfilled the criteria of being able to process metabolomic, lipidomic and
genomic data.
6.2. Sorting the PATs
The PATs are sorted by ranking them according to their ability to process more
than one type of data. The Ingenuity PAT is the only one that processes all
types of data. The majority of the selected PATs process only genomics data (13
out of 23), whereas 9 of the 23 PATs can process both genomic and protein data
(Fig 9).
Figure 9. A ranking list of PATs based on the number of data types they can
process and the ID number of the requirement each passed.
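The ranking principle, ordering tools by how many types of data they can process, can be sketched as follows. Apart from the Ingenuity PAT, the tool names are hypothetical placeholders, and the exact set of data types listed for each tool is an assumption made for illustration; the actual ranking is the one shown in Fig 9.

```python
def rank_tools(supported: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank tools by number of supported data types (highest first, then by name)."""
    counts = ((name, len(data_types)) for name, data_types in supported.items())
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Illustrative input only: "Tool A" and "Tool B" are placeholders, not real PATs.
supported_types = {
    "Tool B": {"genomics"},
    "Ingenuity": {"genomics", "protein", "lipidomics", "metabolomics"},
    "Tool A": {"genomics", "protein"},
}

for rank, (name, count) in enumerate(rank_tools(supported_types), start=1):
    print(rank, name, count)
```

Sorting by name as a second key only makes the order of equally ranked tools deterministic; it carries no meaning for the evaluation.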
6.3. Testing the PATs
The PATs go through testing with test cases 4 and 5, which are linked to
requirement IDs 2 and 3, according to sub-aim 2 in section 1.2. Very little
information can be found in the literature when searching for metabolomics,
lipidomics and genetic pathways or correlations with PATs. The books found
describe basic information about metabolomics, lipidomics and genetics
themselves but not much about their pathways or correlations [5, 31 and 3].
Searches on the Internet give few results in the form of articles and journals
(such as [10, 34, 41 and 46]).
For the comparison with the results obtained from the PATs, three references
can be found, two of which are related to lipids and their synthesis pathway
[1, 5]. The third literature reference is about the correlation of genetics to
pathways [20]. The information in these references correlates with the results
obtained by the PATs.
Since few results are found in the literature and only a handful by the Google
search engine, the work focuses on laboratory results obtained by the end user,
which are known to be valid. 35 out of 45 PATs are compared and show successful
results, correlating with the laboratory results.
To measure the time a PAT takes to process data, a timer is used (requirement
ID 3). The timer starts from zero and counts upwards until the processing by
the PAT is complete. Three times are recorded for each PAT to obtain a more
accurate measurement. All PATs respond within a maximum of 7 seconds.
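The timing procedure can be mirrored in code. This is a hedged sketch rather than the procedure used in the study, where a timer was operated while the tool was running: here the work to be timed is any Python callable, the names time_runs and meets_time_limit are hypothetical, and using the observed maximum of 7 seconds as a limit is an assumption.

```python
import time
from typing import Callable


def time_runs(task: Callable[[], object], runs: int = 3) -> list[float]:
    """Run the task several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # timer starts from zero for each run
        task()
        durations.append(time.perf_counter() - start)
    return durations


def meets_time_limit(durations: list[float], limit_seconds: float = 7.0) -> bool:
    """Check that even the slowest recorded run stays within the limit."""
    return max(durations) <= limit_seconds


# Stand-in workload; a real measurement would wrap a query sent to a PAT.
durations = time_runs(lambda: sum(range(100_000)))
print(len(durations), meets_time_limit(durations))
```

Recording three runs and looking at the slowest one is a conservative reading of the measurements; the mean could be reported alongside it.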
6.4. Evaluating the PATs
Evaluation of the PATs according to sub-aim 2 results in a total of 14 analysis
tools passing the criteria (Fig 10). These tools fulfill the requirements of
being fast in data processing and show correlation with the results reported in
the literature and obtained at the laboratory.
Figure 10. The PATs that passed basic evaluation according to sub-aim 2.
6.5. Final evaluation of the PATs
The PATs are further evaluated against the end user requirements, and the
result shows that no analysis tool fulfills the requirements (Fig 10). Test
cases 6 to 10 (linked to requirement IDs 4 to 8) are used for the final
evaluation.
Figure 10. We see the PATs that passed certain requirements.
No PATs passed sub-aim 3. The tools failed on the zoom in and out function (to
expand the view to neighboring possible results) connected to pathways when
results were returned, and on the combination of more than one type of omics
data as input. A complete view of the results and the reasons for exclusion is
shown in appendix 3.
6.6. The best PAT from the ranked list
The Ingenuity PAT is best ranked in our list cont a in ing a ll of today’s ava ilable
ana lysis tools. The Ingenu ity PAT can process and fu lfill a lmost a ll of the
requirements given by the end user . The tool is even capable of t aking input of
more than one omics da ta and dur ing the test ing stage of the tool no input
limit s a re found when combining da ta . The only setback is the missing zoom
in and out funct ion (to expand the view) connected to pa thways upon
obta ining resu lt s tha t the program could not process. A gene name (APOE)
and two protein accession numbers (NP_000032 and P02649) is selected in
order to test the tool. Selected gene and proteins a re known to be involved in
the synthesis and format ion of lipoproteins . Therefore t he expected resu lt
from the PAT is to find informat ion rela ted to lipoproteins. Th e gene name
inser ted in the sea rch field and the returned resu lt shows 1 match found for
lipoproteins (Fig 11). F irst selected protein accession number (NP_000032)
returned 1 match for lipoproteins linking it to the gene APOE (Fig 12). Second
selected protein accession number (P02649) returned 1 match for lipoproteins
linking it a lso to the gene APOE (Fig 13).
Figure 11. Returned result from the Ingenuity PAT after input of the gene
APOE.
Figure 12. Search results after input of a protein with the accession number
(NP_000032) showing the protein to be linked to lipoproteins and the gene
APOE.
Figure 13. Search results after input of a protein with the accession number
(P02649) showing the protein to be linked to lipoproteins and the gene APOE.
6.7. Combining PATs
No PAT passes the examination outlined in sub-aim 3, since even the Ingenuity
PAT, which scored well in our testing, failed due to the missing zoom
functions. Instead we have to investigate whether it is possible to develop a
PAT that meets the requirements of the end user, or whether a combination of 2
or 3 PATs can fulfill the requirements of the end user (see section 3.1.2).
The decision after a meeting with the end user is that developing a PAT
ourselves is not an option due to limited time and insufficient manpower. The
end user requests that the pathway analysis program Uniprot batch converter be
tested primarily against a program called FEvER, but also against all other
combinations of PATs that can be found. The reasoning is that Uniprot handles
lipidomics while FEvER handles metabolomics, so the two PATs have functions
that can complement each other. After input of genomic data, Uniprot gave
either a blank page showing that nothing was found or a list of possible
matches to the protein encoded by the gene. Tests using the Uniprot batch
converter succeeded in converting some of the gene names into proteins.
Inputting the results from Uniprot into FEvER works well and the PAT starts;
however, it always ends up with 0 results. The combination of the PATs Uniprot
and FEvER is therefore not successful.
In our search for combinations of PATs we turn our focus to the homepage of
the National Center for Biotechnology Information (NCBI), where a combination
of pathway analysis programs can be found. The NCBI homepage has a series of
PATs available without the need for downloads or payments and contains a vast
amount of information in molecular biology collected worldwide. The NCBI
homepage is also funded by the U.S. government. On the NCBI homepage an online
tool can be found with a very useful search engine, capable of taking more
than one input in order to forward the search to the specific databases to
which the NCBI homepage is connected. Depending on how much input is inserted
in the search field, results are returned accordingly. The gene named ENAC was
selected, which codes for a protein that affects the sodium channels in
biological organisms. The expectation is to obtain substantial results on
information related to ENAC and the sodium channel from all databases
connected to the NCBI homepage. The returned results show that 26 of 37
different databases found matches containing information about articles,
genes, proteins, SNPs and nucleotide sequences related to ENAC and the sodium
channel (Fig 14). A second gene name, NPPA, is selected, coding for the
protein natriuretic peptide A, which regulates water and sodium balance in
biological organisms. The selected gene names (ENAC and NPPA) are put into the
search field as “ENAC, NPPA”. No combinations or relations are expected to be
found between these gene names, and no results are expected to be returned
from the databases. The results obtained exceed the expectations, with matches
in 8 databases containing some information about articles, genes, molecular
interactions and data mapping relating ENAC and NPPA to each other (Fig 15).
With more databases connected to each other, the chance of finding combined
results is therefore increased.
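The idea of forwarding one search to several connected databases can also be sketched programmatically. The sketch below only builds query URLs for NCBI's public E-utilities `esearch` endpoint and sends nothing over the network. The database list is a small illustrative subset, and joining the gene names with AND is an assumption of this sketch, not the exact search that was typed into the homepage.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint of NCBI
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Illustrative subset of Entrez databases (assumption; the homepage covers more)
DATABASES = ["pubmed", "gene", "protein", "snp", "nuccore"]


def build_term(gene_names):
    """Combine gene names so that a hit has to mention all of them."""
    return " AND ".join(gene_names)


def build_queries(gene_names, databases=DATABASES):
    """Return one esearch URL per database for the combined search term."""
    term = build_term(gene_names)
    return {
        db: ESEARCH + "?" + urlencode({"db": db, "term": term})
        for db in databases
    }


if __name__ == "__main__":
    for db, url in build_queries(["ENAC", "NPPA"]).items():
        print(db, url)
```

Each URL could then be fetched and the hit counts compared per database, which mirrors the per-database match counts reported above.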
Figure 14. A search on NCBI’s homepage across all databases it is connected
with. The gene ENAC is used as data in order to investigate the returned
search results.
Figure 15. A search on NCBI’s homepage across all databases it is connected
with. The genes ENAC and NPPA are used together as data in order to
investigate the combined search result returned.
6.8. Functionalities
A variety of functionalities are found in the PATs through software testing.
The most common functionalities are the following: the search input field, the
input text area field, dropdown lists and a response window showing how many
seconds it takes to receive the results (Fig 16).
Figure 16. Picture of the search input field marked blue, input text area field
marked yellow and push button marked green from 2 different websites.
Some tools give a nice visual presentation showing how results such as genes
or lipids are connected (Fig 17). Two tools have a 3D view, but half the time
during testing it gives only a blank white page or crashes the program.
Figure 17. A gene name was searched and the PAT visually showed the result.
The searched gene is colored red and the figure shows how it relates to other
genes, proteins, signaling receptors and biological cell processes.
No PAT fulfills all the requirements according to sub-aim 3. Of all the
investigated PATs only 2 have easy navigation functions according to our
requirements and only 5 make visual pathway presentations. None of the PATs
have any zoom in or out functions (to expand the view) connected to pathways
on results obtained.
6.9. Quality
Quality is defined as follows: if all requirements are fulfilled by a PAT,
then its quality is good. As the results show, no PAT fulfills the
requirements. Therefore the quality is not acceptable for any of the PATs. The
following aspects are used to evaluate the quality of the homepages: Accuracy
and Correctness, Completeness, Relevance, Time and Punctuality, Traceability.
All homepages provide accurate information with no errors or misleading
information, although many homepages belong to companies that consider their
product to be the best. Also, five homepages state that they are a work in
progress, making them less appealing by design. No companies state when or
where their PATs were developed or how long they have been in existence. The
Google search engine uses a page ranking system that ranks the companies'
pages high. This makes them fast to find, taking about 3 seconds at maximum;
full results are shown in appendix 4. Companies rarely give references for the
information they put on their homepages, making it hard to trace the
information they provide. In total, only one homepage provides some references
and information that can be traced.
7. Discussion
7.1. Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combines them with genetic information?
Looking at the results we see that 46 PATs are found but only 23 fulfilled the
criteria for being able to process metabolomic, lipidomic or genomic data. The
PATs are sorted and ranked based on how many types of data they are capable of
handling and further evaluated according to sub-aim 2. A total of 14 PATs
passed sub-aim 2. This leaves only 1 available PAT that can process all types
of omics data, which is the Ingenuity PAT. It looks promising as a PAT and is
number one on the ranking list; however, the tool fails the end user
requirements on the zoom in and out functionality (to expand the view to
neighboring possible results) connected to pathways upon received results.
The Ingenuity PAT itself was easy to use and the results returned are also
understandable. If the requirements had been slightly different, this would be
a good PAT to use. The Ingenuity PAT belongs to a company, so payment is
required in order to use it after a few days of trial. 495 U.S. dollars is a
high price to pay for an individual person, but for a larger group of at least
six working researchers that might use the Ingenuity PAT during a period of
two years, the price is acceptable.
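The ranking step used in this section can be sketched as a simple sort. The tool names and capability sets below are hypothetical placeholders and not the data of this thesis; only the rule itself (rank by the number of omics data types a tool can handle) is taken from the text.

```python
# The three omics data types considered in this thesis
OMICS_TYPES = {"metabolomics", "lipidomics", "genomics"}


def rank_pats(capabilities):
    """Drop tools that handle no omics type; rank the rest, most types first."""
    usable = {name: types & OMICS_TYPES for name, types in capabilities.items()}
    usable = {name: types for name, types in usable.items() if types}
    # Ties are broken alphabetically so that the ranking is reproducible
    return sorted(usable, key=lambda name: (-len(usable[name]), name))


# Hypothetical example data (assumption, for illustration only)
example = {
    "ToolA": {"genomics"},
    "ToolB": {"metabolomics", "lipidomics", "genomics"},
    "ToolC": set(),
    "ToolD": {"lipidomics", "genomics"},
}

if __name__ == "__main__":
    print(rank_pats(example))
```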
7.2. What are the functionalities offered by the available analysis tools?
Throughout software testing, going through the test cases, all functionalities
of the PATs are tested. The PATs have a variety of functionalities. As the
results show, the most commonly used functions are the input fields, text area
input fields and a dynamic dropdown list. This applies to both downloaded and
web-based PATs. Some PATs are more unique and offer extra functionalities such
as file upload or links related to the input inserted in the search field. All
PATs have some form of visual presentation but only a few give the desired
effect, like the KEGG PAT. KEGG has the best type of visual presentation, with
arrows and different color markings. Next in line is the Ingenuity PAT (in its
current state on 2012-05-20 this tool only has a test version, but it still
looks good), showing a visual presentation with possible correlations.
7.3. What are the qualities of these tools and how to evaluate them?
Quality (mentioned in section 1.3) is interpreted differently by different
people, and everyone has their own point of view on what quality is. Our view
of good quality is that if all requirements are fulfilled by the PATs, they
also fulfill the needs of the potential users. However, what is good quality
for someone may not be good for someone else, which speaks for taking a more
general view of quality. We nevertheless chose 8 specific requirements that
need to be fulfilled by a PAT, as seen in section 1.3, and this is accepted by
the end user as good quality. Quality definitions are hard to make in general
for software programs or applications, even when making broad and general
definitions, as pointed out in [36], while [38] strengthens the reasons.
According to [27], homepages can be examined thoroughly by certain aspects to
determine their quality. The examined homepages of the respective PATs have
good standards, fulfilling 4 of 5 aspects. In [38] requirements are also a way
to define quality, and in [14] approaches are shown more thoroughly. We follow
the definition of quality given by [14], showing that all requirements need to
be fulfilled in order to be considered good quality, which is easily seen in a
traceability matrix, as [38] states. Looking at the matrices afterwards for
traceability [22], appendix 3 shows that not all requirements are fulfilled;
therefore we cannot say that the quality of the analyzed PATs is good.
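The quality rule used here, that a PAT is of good quality only if every requirement is fulfilled, can be read directly off a traceability matrix. The following is a minimal sketch, assuming the matrix is stored as a mapping from tool to pass/fail per requirement ID; the two rows are hypothetical and do not reproduce appendix 3.

```python
# The 8 end user requirements, identified by ID 1 to 8
REQUIREMENT_IDS = range(1, 9)


def failed_requirements(row):
    """List the requirement IDs that are missing or not fulfilled for one tool."""
    return [rid for rid in REQUIREMENT_IDS if not row.get(rid, False)]


def quality_is_good(row):
    """Quality is good only when no requirement fails."""
    return not failed_requirements(row)


# Hypothetical traceability matrix (assumption, for illustration only)
matrix = {
    "ToolA": {rid: True for rid in REQUIREMENT_IDS},
    "ToolB": {**{rid: True for rid in REQUIREMENT_IDS}, 6: False},
}

if __name__ == "__main__":
    for tool, row in matrix.items():
        print(tool, quality_is_good(row), failed_requirements(row))
```

Listing the failed IDs, instead of returning only a yes or no, keeps the reason for exclusion traceable to a single requirement.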
If we look at quality for downloaded PATs versus the NCBI homepage, we might
ask the following question: what are the pros and cons? Downloaded PATs seem
to be more customized by companies to a specific group of users in the
biomedical field, thereby fulfilling the quality needs of that specific group.
Monthly fees are required or the tool can be bought, and additional payments
are needed in order to get support. The NCBI homepage has free services and
does not cost anything to use. Results obtained from the downloaded PATs are
the same as on the NCBI homepage; however, NCBI is based on scientific results
directly linked to scientific articles and can therefore be seen as more
valid. Information gained from all PATs is meant to help researchers in their
biomedical field (whether it be lipidomics, genomics or any other field) to
make new discoveries.
Figure 18. Small figure with +/- on downloaded PATs versus NCBI
7.4. Why not Ingenuity and why Uniprot with FEvER?
As the results show, the Ingenuity PAT had promising features and was very
good at fulfilling almost all requirements. However, during discussion the end
user decided to discard it, since it is a new tool and a lot of time would be
needed to learn it. The end user was more familiar with the PAT named Uniprot
and the in-house developed PAT FEvER. Since Uniprot handles lipidomics while
FEvER handles metabolomics, the two PATs have functions that can complement
each other. For this reason the two tools were analyzed instead. As the
results show, the two tools did not work well together. Either the conversion
went well but no results were shown, or it simply gave a blank page, meaning
no results were presented. In the end this combination was also discarded.
8. Future Value
It all started with the question of how to carry out a multidisciplinary
thesis, and this thesis shows one way in which it can be done. The testing
method used on the pathway analysis tools encountered no problems, and other
users can use the same method with ease. The hard part is when no tools are
found that meet the requirements. We also found that even with a small group
of programmers, months are needed to develop a PAT, due to the vast amount of
information required to be included in the program. Not all PATs can combine
or take more than one input of omics data. The search for an alternative
solution leads to a homepage named NCBI that has several collected PATs that
are free to use. As discussed regarding the pros and cons of downloaded PATs
versus free ones, in the future there will probably be free resources.
Homepages like NCBI grow in popularity, attracting many users. Freely
available PATs are preferred since they are free to use and their quality is
as good as that of the downloaded ones. Today we already see a glimpse of the
beginning of this process. Companies will try to match the demands of users
and start to run either longer free trials or even make their tools free to
use, sponsored by advertisements.
9. References
1. Albertsson-Erlanson C, (1991), Medicinsk och fysiologisk kemi – en
introduktion, Lund: Studentlitteratur
2. Anand S. S, Xia C, Paré G, Montpetit A, Rangarajan S, McQueen J. M,
Cordell J. H, Keavney B, Yusuf S, Hudson J. T, Engert C. J, (2009),
Genetic Variants Associated With Myocardial Infarction Risk Factors in
Over 8000 Individuals From Five Ethnic Groups, Circulation:
Cardiovascular Genetics Volume 23
3. Atkins P. W, Jones L. L, (2008), Chemical Principles: The quest for insight,
New York: W. H Freeman & Company, 527-534.
4. Barreiro B. L, Laval G, Quach H, Patin E, Quintana-Murci L, (2008),
Natural selection has driven population differentiation in modern humans,
Nature Genetics Volume 40, 340-345.
5. Becker M. W, Bertoni P. G, Hardin J, Kleinsmith J. L, (2009), The World of
the Cell, San Francisco: Pearson Benjamin Cummings Education Inc, 508-
520; 526; 527-534; 346-347.
6. Biochemistry, (2012) Synthesis of membrane Lipids and Triglycerides >
http://www.uky.edu/~dhild/biochem/20/lect20.html < 2012-06-14
7. Buhman K. K, Chen C. H, Farese Jr V. R, (2001), The Enzyme of Neutral
Lipid Synthesis, The Journal of Biological Chemistry Volume 276, Number
44.
8. Cong T. T, Wlaschin A, Srienc F, (2009), Elementary mode analysis: a
useful metabolic PAT for characterizing cellular metabolism, SpringerLink
Volume 81 Number 5.
9. Curriculum Proposal, (2012) Information gathering by data mining >
http://www.sigkdd.org/curriculum.php < 2012-07-11
10. Cyberlipid center, (2012) Fatty Acids >
http://www.cyberlipid.org/fa/acid0001.htm < 2012-06-26
11. Devlin M. T, (2006), Text Book of Biochemistry: With Clinical Correlations,
New Jersey: Wiley-Liss Inc, 24-29; 666; 711-713; 716-717.
12. Fahy E, Subramaniam S, Brown H. A, Glass K. C, Merrill Jr H. A, Murphy
C. R, Raetz H. R. C, Russell W. D, Seyama Y, Shaw W, Shirmizu T, Spener
F, Meer G, VanNieuwenhze S. M, White H. S, Witztum L. J, Dennis A. E,
(2005), Lipidomics reveals a remarkable diversity of lipids in human
plasma, Journal of Lipid Research Volume 46.
13. Fayyad U, Piatetsky-Shapiro G, Smyth P, (1996), From Data Mining to
Knowledge Discovery in Databases, AI Magazine Volume 17 Number 3.
14. Firesmith D, (2003), Using Quality Models to Engineer Quality
Requirements, Journal of Object Technology Volume 2 Number 5.
15. Ganter B, Giroux CN, (2008), Emerging applications of network and
pathway analysis in drug discovery and development, PubMed central
Volume 11 Issue 1.
16. Ganter B, Zidek N, Hewitt R. P, Müller D, Vladimirova A, (2008), PATs
and toxicogenomics reference databases for risk assessment, Future
Medicine Volume 9 Number 1.
17. Gutiérrez J. J, Escalona J. M, Mejías M, Torres J, (2012), Generation of test
cases from functional requirements, Department of System Information at
University of Seville with 4:th Workshop on System Information.
18. Han X, (2007), Neurolipidomics: Challenges and development, Frontiers of
Bioscience Volume 12
19. Human Cell Biology – BIO3I5F, (2012) The cell membrane >
http://www.erin.utoronto.ca/~w3bio315/lecture2.htm < 2012-04-29
20. Ignacimuthu S, (2008), Biotechnology: An Introduction, Oxford: Alpha
science International Ltd, 1-10.
21. iSixSigma- Tools and Templates, (2012) Importance of Test Plans or Test
Protocols >
http://www.isixsigma.com/tools-templates/design-of-experiments-doe/importance-test-planstest-protocol-template/
< 2012-08-20
22. Jordan W. K, Nordenstam J, Lauwers Y. G, Rothenberger A. D, Alavi K,
Garwood M, Cheng L. L, (2009), Metabolomic Characterization of Human
Rectal Adenocarcinoma with Intact Tissue Magnetic Resonance
Spectroscopy, Diseases of the Colon and Rectum Volume 52 Issue 3
23. Kannenberg A, Saiedian H, (2009), Why Software Requirements
Traceability Remains a Challenge, CrossTalk: The Journal of Defense
Software Engineering Volume 22 Number 5.
24. King Y. J, Ferrara R, Tabibiazar R, Spin M. J, Chen M. M, Kuchinsky A,
Vailaya A, Kincaid R, Tsalenko A, Deng X-F. D, Connolly A, Zhang P,
Yang E, Watt C, Yakhini Z, Ben-Dor A, Adler A, Bruhn L, Tsao P,
Quertermous T, Ashley A. E, (2005), Pathway analysis of coronary
atherosclerosis, Research Article Physiological Genomics Volume 23
Number 1
25. Klamt S, Stelling J, (2002), Two approaches for metabolic pathway
analysis, Trends in biotechnology Volume 21 Issue 2.
26. LipidMaps Nature – Lipidomicsgateway, (2012) Lipid classification
system >
http://www.lipidmaps.org/data/classification/LM_classification_exp.php <
2012-06-14
27. Lundh D, (2011), Information Quality and Security, Skövde University
28. Luo L, (2001), Software Testing Techniques – Technology Maturation and
Research Strategy, Carnegie Mellon University
29. Meer G, (2005), Cellular Lipidomics, The EMBO Journals members review
Volume 24
30. Mogyorodi G.E, (2005), Requirements-Based Testing – Ambiguity Reviews,
Software Testing Services Number 1.
31. Molecular biochemistry, (2012) Fatty acid synthesis >
http://www.rpi.edu/dept/bcbp/molbiochem/MBWeb/mb2/part1/fasynthesis.htm
< 2012-06-14
32. National Human Genome Research Institute: Genetic and Genomic
Science, (2012) Genetic and genomic science >
http://www.genome.gov/19016904 < 2012-04-20
33. Network Science – NetSci, (2012) Welcome to NetSci’s Lists of Software for
Bioinformatics: PATs >
http://www.netsci.org/Resources/Software/Bioinform/pathwayanalysis.html
< 2012-02-27
34. Olson L. D, Kesharwani S, (2010), Enterprise Information Systems:
Contemporary Trends and Issues, Singapore: World Scientific Publishing
Co. Pte. Ltd, 7-23.
35. Phosphatidic acid, lysophosphatidic acid and related lipids: structure,
occurrence, biochemistry and analysis, (2012) Phosphatidic acid –
Occurrence and Biosynthesis >
http://lipidlibrary.aocs.org/Lipids/pa/index.htm < 2012-07-11
36. Quality, (2012) Quality >
http://www.qualitydigest.com/html/qualitydef.html < 2012-04-09
37. Quehenberger O, Armando M. A, Brown H. A, Milne B. S, Myers S. D,
Merrill H. A, Bandyopadhyay S, Jones N. K, Kelly S, Shaner L. R, Sullards
M. C, Wang E, Murphy C. R, Barkley M. R, Leiker J. T, Raetz H. R. C,
Guan Z, Laird M. G, Six A. D, Russell W. D, McDonald G. J, Subramaniam
S, Fahy E, Dennis A. E, (2010), Lipidomics reveals a remarkable diversity
of lipids in human plasma, Journal of Lipid Research Volume 51.
38. Reeves A. C, Bednar A. D, (1994), Defining Quality: Alternatives and
Implications, Academy of Management Review Volume 19 Number 3.
39. Rosenberg L.H, Hammer F. T, Huffman L. L, (1998), Requirements,
Testing and Metrics, CiteSeer 15:th Annual Pacific Northwest Software
Quality Conference.
40. Schilling H. C, Letscher D, Palsson Ø. B, (2000), Theory for the Systemic
Definition of Metabolic Pathways and their use in Interpreting Metabolic
Function from a Pathway-Oriented Perspective, Journal of Theoretical
Biology Volume 203 Issue 3.
41. Schuster S, Dandekar T, Fell A.D, (1999), Detection of elementary flux
modes in biochemical networks: a promising tool for pathway analysis and
metabolic engineering, Trends in biotechnology Volume 17 Issue 2.
42. The AOCS Lipid Library – Triacylglycerols, (2012) Biosynthesis and
metabolism > http://lipidlibrary.aocs.org/lipids/tag2/index.htm <
2012-06-14
43. The lipid chronicles, (2012) Lipidomics >
http://www.samuelfurse.com/2011/12/lipidomics/ < 2012-03-19
44. The Lipid Library, (2012) Lipid synthesis >
http://lipidlibrary.aocs.org/index.html < 2012-04-15
45. The Medical Biochemistry Page, (2012) Lipid synthesis >
http://themedicalbiochemistrypage.org/lipid-synthesis.php < 2012-03-09
46. Vance E. D, Vance E.J, (2008), Biochemistry of Lipids, Lipoproteines and
Membranes 5th edition, Amsterdam: Elsevier, 278-279; 583-588.
47. Virginia web education, (2012) Lipids >
http://web.virginia.edu/Heidi/chapter8/chp8.htm < 2012-06-26
48. Voight et al, (2012), Plasma HDL cholesterol and risk of myocardial
infarction: a mendelian randomization study, The Lancet Volume 380
Issue 9841
49. Waterfall Model, (2012), All about the waterfall model >
http://www.waterfall-model.com/ < 2012-08-18
50. Sommerville Ian, (1996), Software process models, Journal of ACM
Computing surveys Volume 28 Issue 1, p269-271
51. Watson D. A, (2006), Lipidomics: A global approach to lipid analysis in
biological systems, Journal of Lipid Research Volume 47.
52. Watson D. A, (2006), Thematic review series: Systems Biology Approaches
to Metabolic and Cardiovascular Disorders, Journal of Lipid Research
Volume 47.
53. Wenk MR, (2005), The emerging field of lipidomics, Nature Reviews Drug
Discovery Volume 46.
54. William W. C, Xianlin H, (2010), Lipid Analysis: Isolation, Separation,
Identification and Lipidomic Analysis, Bridgwater: The Oily Press
Appendix 1 – Test Cases
Test Case
Test case: a document which describes INPUT, ACTION, EVENT and
EXPECTED RESPONSE to determine whether a feature of an application is working
correctly or not. A set of inputs, execution preconditions, and expected outcomes
developed for a particular objective, such as to exercise a particular program path
or to verify compliance with a specific requirement. (Comp Software testing,
(2010), Test Case formats > http://www.faqs.org/qa/qa-4044.html < 2010-12-18)
Test Case 1 “Metabolomic type of data input” (Requirement ID 1):
Goal:
See if the PAT can process metabolomic data.
Event:
(Presumption made that the PAT is already running).
1. Input metabolomic data type (such as “NPPA”).
2. Get result related to the metabolism from the PAT.
Expected response:
The user can input a metabolomic data type on the selected PAT and obtain
information related to “NPPA”.
Test Case 2 “Lipidomic type of data input” (Requirement ID 1):
Goal:
See if the PAT can process lipidomic data.
Event:
(Presumption made that the PAT is already running).
1. Input lipidomic data type (such as “rs4420638”).
2. Get result related to lipids from the PAT.
Expected response:
The PAT processes the lipidomic data type and returns information related to
“rs4420638”.
Test Case 3 “Genomic type of data input” (Requirement ID 1):
Goal:
See if the PAT can process genomic data.
Event:
(Presumption made that the PAT is already started).
1. Input genomic data type (such as “APOE”).
2. Get genetically related result from the PAT.
Expected response:
The PAT processes the genomic data type and returns information related to “APOE”.
Test Case 4 “Verifying the result” (Requirement ID 2):
Goal:
Get correct and valid results returned by the PAT.
Event:
(Presumption made that the PAT is already started).
1. Input omics (metabolomics, lipidomics and genomics) data.
2. Get result from the PAT.
3. Check the results obtained against valid results from books, articles or
laboratory results.
4. Confirm that the received result is valid.
Expected response:
The result returned from the PAT is valid according to literature or laboratory results.
Test Case 5 “Time effectiveness” (Requirement ID 3):
Goal:
Information about how fast the PAT returns a result.
Event:
(Presumption made that the PAT is already started).
1. Input omics (metabolomics, lipidomics and genomics) data.
2. Start the timer.
3. Get result from the PAT.
4. Stop the timer.
5. Record time.
Expected response:
The result returned from the PAT took less than 5 seconds and is displayed.
Test Case 6 “Navigation” (Requirement ID 4):
Goal:
See the navigation capabilities between the data inserted and the result given by
the selected PAT.
Event:
(Presumption made that the PAT is already started).
1. Input omics (metabolomics, lipidomics and genomics) data.
2. Get result from the PAT.
3. Try navigating between the start of the inputted data and the results obtained
by scrolling in the result window.
4. See a path leading from start (input data) to end (result).
Expected response:
It was possible to navigate between the inserted data and the result obtained.
Test Case 7 “Visual presentation” (Requirement ID 5):
Goal:
See if the result can be visually presented/displayed by using the PAT.
Event:
(Presumption made that the PAT is already started).
1. Input omics (metabolomics, lipidomics and genomics) data.
2. Get result from the PAT.
3. Get a visual presentation where you can map results to each other.
Expected response:
There was a visual presentation when using the PAT.
Test Case 8 “Zooming” (Requirement ID 6):
Goal:
See if the PAT displays any zoom functions.
Event:
(Presumption made that the PAT is already started).
1. Input omics (metabolomics, lipidomics and genomics) data.
2. Get result from the PAT that is connected to pathways.
3. Search for a small magnifying glass with a plus or minus sign.
4. Search for a specific function with arrows for zooming.
5. Try zooming in to narrow the view.
6. Try zooming out to expand the view.
Expected response:
The PAT has zoom functions.
Test Case 9 “Specific data input type” (Requirement ID 7):
Goal:
See if the PAT can take a specific type of data as an input.
Event:
(Presumption made that the PAT is already started).
1. Input specific type of data (such as accession number: AC_0088966, rs number:
rs896530).
2. Get result from the PAT.
Expected response:
The PAT could process the specific type of data.
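Test Case 9 depends on telling the identifier types apart. This can be sketched with regular expressions, as below. The patterns are simplified assumptions: the UniProt pattern follows the accession format published by UniProt, the other two are loose approximations, and anything that is not recognized is simply assumed to be a gene name.

```python
import re

# Simplified patterns (assumptions for illustration, not a validated parser)
PATTERNS = [
    ("rs number", re.compile(r"^rs\d+$")),
    ("RefSeq-style accession", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("UniProt accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
]


def classify(identifier):
    """Return a label for the identifier type, or assume it is a gene name."""
    token = identifier.strip()
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "gene name (assumed)"


if __name__ == "__main__":
    for example in ["APOE", "rs896530", "NP_000032", "P02649", "AC_0088966"]:
        print(example, "->", classify(example))
```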
Test Case 10 “Mapping combined data types” (Requirement ID 8):
Goal:
See if the PAT can combine omics data and map them to related pathways.
Event:
(Presumption made that the PAT is already started).
1. Input more than one type of omics (metabolomics, lipidomics and genomics)
data combined.
2. Get result from the PAT.
3. See if the result is combined with the other data types, showing a mapped view
with the other data types.
Expected response:
The PAT could combine and map the data for the user.
Appendix 2 – Lipid, MI SNP and Metabo SNP da ta sheet Lipid and MI SNP Metabo SNP
Gene
Name
rs
(number)
SNP
funct ion
Gene Name rs
(number)
SNP
funct ion
FADS123 174546 Lipid
SNP
ENAC Metabo
SNP
UBE2L3 181362 Lipid
SNP
ENAC Metabo
SNP
LILRA3 386000 Lipid
SNP
NPPA 5068 Metabo
SNP
APOE 439401 Lipid
SNP
BDNF 6265 Metabo
SNP
KLHL8 442177 Lipid
SNP
SLC35F1 89107 Metabo
SNP
TTC39B 581080 Lipid
SNP
NPPA 198358 Metabo
SNP
CITED2 605066 Lipid
SNP
BCL11A 243021 Metabo
SNP
SORT1 629301 Lipid
SNP
MDS1 419076 Metabo
SNP
MSL2L1 645040 Lipid
SNP
579459 Metabo
SNP
LOC55908 737337 Lipid
SNP
NPPB 632793 Metabo
SNP
SCARB1 838880 Lipid
SNP
TMEM133 633185 Metabo
SNP
APOA1 964184 Lipid
SNP
PNPLA3 738409 Metabo
SNP
APOB 1042034 Lipid
SNP
PLCE1 932764 Metabo
SNP
LDLR 1122608 Lipid
SNP
TFAP2B 987237 Metabo
SNP
GCKR 1260326 Lipid
SNP
CHRNA3 1051730 Metabo
SNP
NAT2 1495741 Lipid
SNP
SGK1 1057293 Metabo
SNP
LIPC 1532085 Lipid
SNP
C5orf174 1173771 Metabo
SNP
LPA 1564348 Lipid
SNP
Glasgow 1230297 Metabo
SNP
ZNF648 1689800 Lipid
SNP
J AG1 1327235 Metabo
SNP
HNF4A 1800961 Lipid
SNP
LOC10018 1329650 Metabo
SNP
ABCA1 1883025 Lipid CYP1A1 1378942 Metabo
P a g e | 4 7
SNP SNP
CYP26A1 2068888 Lipid
SNP
C10orf10 1530440 Metabo
SNP
ANGPTL3 2131925 Lipid
SNP
Glasgow2 1703492 Metabo
SNP
TRPS1 2293889 Lipid
SNP
STC1 1731274 Metabo
SNP
CAPN3 2412710 Lipid
SNP
SGK1 1743966 Metabo
SNP
PCSK9 2479409 Lipid
SNP
HFE 1799945 Metabo
SNP
LACTB 2652834 Lipid
SNP
MDS1 1918974 Metabo
SNP
C6orf106 2814944 Lipid
SNP
ENAC 2228576 Metabo
SNP
AMPD3 2923084 Lipid
SNP
ESR1 2234693 Metabo
SNP
CMIP 2925979 Lipid
SNP
HNF1a 2259816 Metabo
SNP
FRMD5 2929282 Lipid
SNP
NEDD4L 2288774 Metabo
SNP
TRIB1 2954029 Lipid
SNP
HCCA2 2334499 Metabo
SNP
IRS1 2972146 Lipid
SNP
FES 2521501 Metabo
SNP
LRP4 3136441 Lipid
SNP
LYPLAL1 2605100 Metabo
SNP
MYLIP 3757354 Lipid
SNP
MOV10 2932538 Metabo
SNP
CETP 3764261 Lipid
SNP
2987983 Metabo
SNP
PGS1 4129767 Lipid
SNP
DBH 3025343 Metabo
SNP
ABCA8 4148008 Lipid
SNP
SLC22A2 3127573 Metabo
SNP
ABCG58 4299376 Lipid
SNP
EGLN2 3733829 Metabo
SNP
APOE 4420638 Lipid
SNP
3741913 Metabo
SNP
PABPC4 4660293 Lipid
SNP
ULK4 3774372 Metabo
SNP
KLF14 4731702 Lipid
SNP
CACNB2 4373814 Metabo
SNP
ZNF664 4765127 Lipid
SNP
ZBED3 4457053 Metabo
SNP
1LNT2 4846914 Lipid
SNP
GCK 4607517 Metabo
SNP
TOP1 6029526 Lipid
SNP
SLC7A9 4805834 Metabo
SNP
PLTP 6065906 Lipid
SNP
JAG1 6040055 Metabo
SNP
LDLR 6511720 Lipid
SNP
6983267 Metabo
SNP
PDE3A 7134375 Lipid
SNP
GLIS3 7034200 Metabo
SNP
MVK 7134594 Lipid
SNP
ADM 7129220 Metabo
SNP
LIPG 7241918 Lipid
SNP
KIAA1486 7578326 Metabo
SNP
ANGPTL4 7255436 Lipid
SNP
RBMS1ITG 7593730 Metabo
SNP
NYNRIN 8017377 Lipid
SNP
MSRA 7826222 Metabo
SNP
ABO 9411489 Lipid
SNP
7931342 Metabo
SNP
MAP3K1 9686661 Lipid
SNP
MEIS2 8031633 Metabo
SNP
COBLL1 10195252 Lipid
SNP
TBX2 8068318 Metabo
SNP
PLEC1 11136341 Lipid
SNP
IL6 10242595 Metabo
SNP
LRP1 11613352 Lipid
SNP
10993994 Metabo
SNP
PINX1 11776767 Lipid
SNP
C6orf204 11153768 Metabo
SNP
STARD3 11869286 Lipid
SNP
NPPA 11191548 Metabo
SNP
COBLL1 12328675 Lipid
SNP
EBF1 11953630 Metabo
SNP
LPL 12678919 Lipid
SNP
ZNF652 12940887 Metabo
SNP
MC4R 12967135 Lipid
SNP
PLCD3 12946454 Metabo
SNP
SLC39A8 13107325 Lipid
SNP
CST3CST9 13038305 Metabo
SNP
TYW1B 13238203 Lipid
SNP
SLC4A7 13082711 Metabo
SNP
LCAT 16942887 Lipid
SNP
GUCY1A3 13139571 Metabo
SNP
MLXIPL 17145738 Lipid
SNP
ZNF652 16948048 Metabo
SNP
APOB 1367117 Lipid
SNP
FGF5 16998073 Metabo
SNP
HLA 2247056 Lipid
SNP
ATP2B1 17249754 Metabo
SNP
PLA2G6 5756931 Lipid
SNP
VPS13C 17271305 Metabo
SNP
OSBPL7 7206971 Lipid
SNP
SHROOM3 17319721 Metabo
SNP
LRPPP1R3 9987289 Lipid
SNP
MTHFR 17367504 Metabo
SNP
JMJD1C 10761731 Lipid
SNP
GOSR2 17608766 Metabo
SNP
STK39 35929607 Metabo
SNP
SMG6 216172 MI SNP
SORT1 629301 MI SNP
SORT1 646776 MI SNP
APOA1 964184 MI SNP
LDLR 1122608 MI SNP
LPA 1564348 MI SNP
CXCL12 1746048 MI SNP
KIAA1822 2895811 MI SNP
LPA 3798220 MI SNP
ADAMTS7 3825807 MI SNP
LDLR 6511720 MI SNP
PHACTR 9349379 MI SNP
ABO 9411489 MI SNP
MRAS 9818870 MI SNP
KCNE2 9982601 MI SNP
PCSK9 11206510 MI SNP
ZC3HC1 11556924 MI SNP
TCF21 12190287 MI SNP
CNNM2 12413409 MI SNP
PPAP2B 17114036 MI SNP
MIA3 17465637 MI SNP
ANKS1A 17609940 MI SNP
SH2B3 3184504 MI SNP
CDKN2A 4977574 MI SNP
WDR12 6725887 MI SNP
RALI 12936587 MI SNP
Appendix 3 – Requirements Matrixes
Appendix 4 – Response Times