
Degree: Bachelor of Computer Science, 180 hp
Major: Information Systems
Programme: Information Systems
Supervisor(s): Céline Fernandez, Annabella Loconsole
Examiner: Bengt J. Nilsson
Date of exam: 2012-09-20

Technology and Society

Computer Science

Investigation of Pathway Analysis Tools for mapping omics data to pathways - Focus on lipidomics and genomics data

Undersökning av analysverktyg för att kartlägga omikdata till relationsvägar – Fokus på data av typen lipidomik och genomik

Author: Attila Konrád


An education isn't how much you have committed to memory, or even how much you know. It's being able to differentiate between what you know and what you don't. /Anatole France

Acknowledgements

I would like to say thank you to everyone who helped me with my thesis. To my supervisors: I thank you for your patience, guidance and all the good feedback.


Abstract

This thesis examines PATs from a multidisciplinary view. There are many PATs today that analyze specific types of omics data; we therefore investigate them and what they can do. By defining specific requirements, such as how many omics data types a PAT can handle and how accurate it is, the most suitable PAT for mapping omics data to pathways can be identified. Results from software testing show that no PAT found today fulfills the specific set of requirements or the main goal. The Ingenuity PAT comes closest to fulfilling the requirements. At the request of the end user, two PATs were tested in combination to see whether together they could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, and the results were not successful, since the combination of the two PATs is no better than the Ingenuity PAT. Focus then turned to an alternative combination, a homepage called NCBI, which has search engines connected to several freely available PATs and thus fulfills the requirements. Through the search engine, “omics” data can be combined and more than one input can be handled at a time. Since technology is rapidly moving forward, the need for new tools for data interpretation also grows. This means that in the near future we may be able to find a PAT that fulfills the requirements of the end users.

Keywords: Biochemistry, Cardiovascular disease, Database, Genomics, Lipids, Lipidomics, Metabolomics, PAT, Technology

Sammanfattning

This thesis examines analysis tools from a multidisciplinary perspective. There are many different analysis tools today that analyze specific types of omics data, and we therefore investigate how many there are and what they can do. By defining a number of specific requirements, such as how many types of omics data a tool can handle and the accuracy of its analysis, one can see which analysis tools are most suitable for mapping omics data. The results of software testing show that no analysis tool available today fulfills the specified requirements or the main aim. The Ingenuity analysis tool is the closest we can get to the requirements we seek. At the request of the end user, two analysis tools were tested to see whether a combination of them could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, but the result was not successful, since the combination of these tools is no better than the Ingenuity analysis tool. Focus then turned to an alternative combination, a homepage called NCBI. The homepage has a search engine connected to several different analysis tools that are free to use. Through the search engine, “omics” data can be combined and more than one input value can be handled at a time. Since technology is advancing rapidly, however, new analysis tools are needed for data handling, and in the near future we may have an analysis tool that fulfills the end users' requirements.

Keywords: Biochemistry, Cardiovascular disease, Database, Genomics, Lipids, Lipidomics, Metabolomics, Analysis tool, Technology


Contents

Abstract
Sammanfattning
1. Introduction
    1.1. Purpose
    1.2. Problem definitions and Aims
    1.3. Problem discussion
    1.4. Related work with PAT
2. Methods
    2.1. Model in use
        2.1.1. Requirement collection, documentation and validation
        2.1.2. Requirement processing and test case creation
        2.1.3. Objective
        2.1.4. Underlying objectives
    2.2. Alternative research methods
3. Biomedical background
    3.1. Genetics
        3.1.1. Gene
        3.1.2. SNP
    3.2. Biochemistry of Lipids
        3.2.1. Lipid definition
        3.2.2. Classes of Lipids
        3.2.3. Enzymes involved in the synthesis of lipids
        3.2.4. Lipoproteins
    3.3. Genomics
    3.4. Metabolomics
    3.5. Lipidomics
    3.6. Cardiovascular diseases
4. Computer Science background
    4.1. Databases, Data mining and Knowledge discovery
    4.2. PAT
5. Requirements and Test elicitation
    5.1. Requirements
    5.2. Testing
    5.3. Test cases
6. Result
    6.1. Finding the PATs
    6.2. Sorting the PATs
    6.3. Testing the PATs
    6.4. Evaluating the PATs
    6.5. Final evaluation of the PATs
    6.6. The best PAT from the ranked list
    6.7. Combining PATs
    6.8. Functionalities
    6.9. Quality
7. Discussion
    7.1. Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combine them with genetic information?
    7.2. What are the functionalities offered by the available analysis tools?
    7.3. What are the qualities of these tools and how to evaluate them?
    7.4. Why not Ingenuity and why Uniprot with FEvER?
8. Future Value
9. References
Appendix 1 – Test Cases
Appendix 2 – Lipid, MI SNP and Metabo SNP data sheet
Appendix 3 – Requirements Matrixes
Appendix 4 – Response Times

1. Introduction

Vast amounts of research are done in lipidomics and genomics, making computers, the Internet and various analysis tools very common today, in both simple and advanced forms. For example, a simple calculation can be performed on one computer and transferred or copied to another if needed. More advanced operations sometimes require a software tool that can perform a certain task on a given set of data in order to give a certain result. The result, in turn, is usually not logically ordered, and a visual presentation is needed. This is where a pathway analysis tool (PAT) is needed. A PAT is an advanced tool that processes given data, compares it with data stored in a database and presents the results obtained visually. A company tends to hire a programmer to develop a PAT in order to integrate it within the organization [34]. One of the main groups of scientific users is researchers in the fields of bioinformatics, genetics, genomics and metabolomics. Researchers depend on these pathway analysis tools in their scientific work. In some scientific fields, such as genomics and metabolomics, there are too many PATs, doing all kinds of different tasks. Too many PATs in a specific field can confuse researchers who do not have enough knowledge of technology [5]. This makes it difficult to decide which PATs are suited for certain data and within which scientific field. Since technology is also moving forward extremely fast, people with multidisciplinary knowledge are needed more and more [20]. For researchers who work within the biomedical fields of metabolomics and genomics there are specific analysis tools. The purpose of these PATs is to help users in their work by letting them visualize data that may lead to new scientific discoveries. Technology and information sharing have taken a big step forward and have helped substantially in different areas around the world, such as health care and medicine.

1.1. Purpose

Finding reliable pathway analysis tools (referred to as PATs from now on) that can do all the necessary data computations and present the results visually is requested by Céline Fernandez from the Clinical Research Center (CRC) in Malmö (referred to as the end user from now on). CRC works on discovering new medicines, diagnostic tools and improved treatments in order to improve health worldwide.


1.2. Problem definitions and Aims

Since there are many PATs available with lots of information, the following research questions are defined in this thesis:

- Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combines them with genetic information?
- What are the functionalities offered by the available analysis tools?
- What are the qualities of these tools and how to evaluate them?

The objectives are defined in order to help answer the three research questions. The main aim of this thesis is the following:

To find a PAT that can process a combination of data inputs of the “omics” type, i.e. lipidomics/metabolomics and genomics data.

In order to reach the main aim, several underlying objectives are needed. These are the following:

1) Find PATs that are able to map pathways for the following types of data:
    a) Overall metabolomics data
    b) Lipidomics data
    c) Genomics data
2) Evaluate the selected PATs and their functions. Test the current accuracy of the existing PATs in order to answer whether the output from these tools shows the “correct” results.
3) Evaluate the selected PATs according to specific requirements given by the end user; see section 1.3 for the specific requirements.

After the evaluation of the PATs according to the requirements, two options are possible:

Option 1: One or more PATs pass steps 2 and 3 and are delivered to the end user.

Option 2: No PAT fulfilling the requirements is found. Alternative solutions are then to see whether it is possible to adapt any of the evaluated analysis tools, to combine more than one of them, or to develop a PAT in-house that meets the requirements of the end user.

1.3. Problem discussion

In order to solve the problem we must consider what PATs are, how complex they are and what they can do. The functionalities of the PATs need to be tested [28] to see if they fulfill the specific requirements (see Table 1).


Table 1. The eight specific requirements that need to be fulfilled by a PAT.

Requirement ID | Requirement description
1 | User is able to see and select on the PAT what type of data it must process (whether the input field is for metabolomic, lipidomic or genomic data)
2 | User must be able to check whether the result obtained from the PAT is valid according to literature, the Internet or laboratory results
3 | The user must receive results from the PAT within a certain time
4 | The user can navigate between the start of the search (input data) and the end of the search (results obtained)
5 | The user can get a visual presentation of metabolomics, lipidomics and genomics data from the PAT
6 | The user can zoom in and out, expanding the view to neighboring possible results to see pathways connected to the results received from the PAT
7 | The user can input a specific type of data into the PAT (metabolomic, lipidomic or genomic)
8 | The user can input combined omics data and then map them to pathways

Acquiring knowledge from the literature gives us information about the complexity of a PAT [27]. The functionalities of a PAT can be determined with the help of software testing of data inputs [9], and in this way we can check whether the PAT satisfies the requirements of the potential users. Quality is a broader concept and harder to define, since quality has different meanings to different people [36]. The qualities of a PAT are considered acceptable if it fulfills all the requirements [36] in a set of requirement specifications. We will be using the requirement specifications in Table 1. Homepages associated with PATs also need to be quality checked, and five selected aspects are used: Accuracy and Correctness (how trustworthy is the information provided on the homepages), Completeness (are the homepages complete or under construction), Relevance (how relevant is the content or information on a homepage to the PAT), Time and Punctuality (how fast can a homepage be found when searching) and Traceability (is the information provided on the homepages traceable to its original source).


1.4. Related work with PAT

Most PATs today are made with a specific focus on metabolomics and genomics. This is due to research work in metabolic engineering, cellular metabolism and toxicogenomics [16, 25]. Companies spend vast amounts of money developing a PAT while trying to compete with each other [8, 15]. For the companies, the competition involves building, adapting and evaluating each other's PATs and explaining why their own PAT is better than the others [8, 15, 33]. Since each PAT is developed specifically for one biomedical field [41], no full-scale analysis of the entire set of PATs exists yet. Our study is a first attempt at such an analysis of a complete set of PATs.

2. Methods

This section describes the scientific methods used to evaluate the different PATs. It starts with the selected method in use, then describes how the information is gathered and gives details on the objective and the underlying objectives.

2.1. Model in use

The main purpose of the project (to find a PAT that can process a combination of data inputs of the “omics” type, i.e. lipidomics/metabolomics and genomics data) was divided into four underlying objectives, each with its own specific goal. The methods are based on an empirical model, with a study of PATs in order to test and analyze each PAT and its homepage. Test cases are designed based on the requirements gathered from the end user at the first interview. The requirements are rechecked with the end user a few weeks later in a second interview. Once they are confirmed, software testing begins using the requirements and test cases, in order to see whether the input data matches the output data of the PAT. The input data is a gene name (e.g. NPPA), a reference SNP accession ID (an rs number such as rs5068) or a lipid class name (such as lipoproteins). The output data from the PAT is verified for relevance by comparing the received results with information found in the literature. A ranked list is made with the best PAT first, based on how many requirements are met. If no PAT meets all the requirements, the ranked list is discarded and, at the end user's request, two specifically selected tools with which the end user is already familiar are adapted or combined.
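The ranking step described above can be illustrated with a minimal Python sketch. It assumes that each tool's test outcome is recorded as the set of requirement IDs (1 to 8, as in Table 1) that it passed. The class name `PatResult`, the function `rank_pats` and the tool names and outcomes are hypothetical and are not results from this study.

```python
from dataclasses import dataclass, field

REQUIREMENT_IDS = range(1, 9)  # the eight requirement IDs from Table 1


@dataclass
class PatResult:
    """Outcome for one tool: which requirement IDs it passed (hypothetical structure)."""
    name: str
    met: set = field(default_factory=set)

    @property
    def score(self) -> int:
        # Number of the eight requirements that were met.
        return len(self.met & set(REQUIREMENT_IDS))

    @property
    def meets_all(self) -> bool:
        # True only if every requirement ID is in the set of met requirements.
        return self.met >= set(REQUIREMENT_IDS)


def rank_pats(results):
    """Return results sorted best first; ties are broken alphabetically by name."""
    return sorted(results, key=lambda r: (-r.score, r.name))


# Made-up tools and outcomes, for illustration only.
results = [
    PatResult("Tool A", {1, 3, 4}),
    PatResult("Tool B", {1, 2, 3, 4, 5, 6, 7}),
    PatResult("Tool C", {2, 5}),
]
ranked = rank_pats(results)
for position, r in enumerate(ranked, start=1):
    print(position, r.name, f"{r.score}/8")

if not any(r.meets_all for r in ranked):
    print("No tool meets all requirements; fall back to combining tools.")
```

Counting met requirements gives every requirement equal weight, which matches the description above; a weighted score would be a different design choice.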

2.1.1. Requirement collection, documentation and validation

Five meetings are booked at the Clinical Research Center (CRC) in order to conduct interviews. All participants (researchers, including the end user) discuss the problem that needs to be solved. The discussion focuses on PATs in general, and the researchers specify the functions they want a PAT to have. Requirements are formulated in connection with these functions, and a new meeting is booked. During each meeting everything is written down and documented. After each meeting, the requirements are collected to be sorted and processed in order to make test cases. Later a checkup takes place at the same place, to see if everything is on the right track.

2.1.2. Requirement processing and test case creation

The requirements are processed and formulated. They are also reduced from 15 to eight, keeping the most important things that a PAT must do. Each requirement is given an identification number. Test case templates are sought, and one template is selected, downloaded and then customized (Fig 1). Specific test cases are designed to suit the requirements and are linked to their respective requirement (see Table 2). Each specific test case is designed by stating the goal of the test along with the events needed to achieve the goal. Lastly the expected response is written, describing what results we should expect by following the events. The whole process starts by carefully checking a requirement from the list and seeing whether it can be covered by a single test case. If that is not possible, several test cases are needed. If we look at the first requirement in Table 1, we see that three different data types need to be tested. We therefore have to split the requirement into more than one test case, since not every PAT may be able to process all three data types. We take the first data type, metabolomic input data, and select an input that we know should give a response and present some results. From this we can write down the events in the test case: an input is entered and a response is received. We can then also state the expected response, which in our case is that the PAT returns information related to the metabolomic input we entered. The next two test cases are similar, differing only in the input data type. The same approach is applied to the rest of the test cases: each requirement is checked and evaluated to see whether it can be covered by one test or must be split into several test cases, and the events and the expected response are written.
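As a rough illustration of how one requirement is split into several test cases, the sketch below models a test case with the fields described above (goal, events, expected response) and generates one case per data type. The class `CaseSpec`, the function `split_by_data_type` and the wording of the goals and events are assumptions made for this example; they do not reproduce the template in Figure 1.

```python
from dataclasses import dataclass

DATA_TYPES = ("metabolomic", "lipidomic", "genomic")


@dataclass(frozen=True)
class CaseSpec:
    """One test case: a goal, the events to perform and the expected response."""
    case_id: int
    requirement_id: int
    goal: str
    events: tuple
    expected_response: str


def split_by_data_type(requirement_id, first_case_id=1):
    """Create one test case per data type for a requirement covering all three."""
    cases = []
    for offset, data_type in enumerate(DATA_TYPES):
        cases.append(CaseSpec(
            case_id=first_case_id + offset,
            requirement_id=requirement_id,
            goal=f"Select the {data_type} input field",
            events=(
                f"Enter a known {data_type} input",
                "Submit the input and wait for a response",
            ),
            expected_response=f"Results related to the {data_type} input are shown",
        ))
    return cases


# Requirement 1 is split into three test cases, one per data type.
cases = split_by_data_type(requirement_id=1)
for case in cases:
    print(case.case_id, case.goal)
```

Generating the three cases from one function keeps them identical apart from the data type, which mirrors the statement that the next two test cases differ only in their input.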

Figure 1. A test case template used in this study.

Table 2. Requirement IDs with descriptions, linked to specific test case IDs.

ID | Requirement description | Type | Linked with Test Case ID
1 | User is able to see and select on the PAT what type of data it must process (whether the input field is for metabolomic, lipidomic or genomic data) | Functional | 1, 2 and 3
2 | User must be able to check whether the results obtained from the PAT are valid according to literature or laboratory results | Non-functional | 4
3 | The user must receive results from the PAT within a certain time | Non-functional | 5
4 | The user should navigate between the start of the search (input data) and the end of the search (results obtained) | Non-functional | 6
5 | The user should get a visual presentation of metabolomics, lipidomics and genomics data from the PAT | Non-functional | 7
6 | The user must be able to zoom in and out, expanding the view to neighboring possible results to see pathways connected to the results received from the PAT | Functional | 8
7 | The user must input a specific type of data into the PAT (metabolomic, lipidomic or genomic) | Functional | 9
8 | The user must be able to input combined omics data and then map them to pathways | Functional | 10


2.1.3. Objective

The main purpose is achieved by acquiring knowledge from literature such as books and articles and by doing software testing. The results obtained from the tests are then compared with the requirements set by the potential users of the PAT.

2.1.4. Underlying objectives

Objective 1:

Gather information by searching books and articles, find as many PATs as possible and determine what data each can process. Download the PATs, if possible, in order to analyze them.

Objective 2:

Evaluate the selected PATs with their functions and methods by going through each tool, clicking around and inputting data. Test cases are designed from the given requirements. Tests on the PATs are based upon:

a) Metabolomics, lipidomics and genetic pathways and correlations known from the literature
b) Comparison between results obtained from the literature and from the PAT
c) Comparison between existing laboratory results and the PAT
d) How long it takes for the analysis tool to process data

Correct results are considered to be those that come from scientific articles, books or laboratory results verified by scientists. Pathways and correlations with metabolomics, lipidomics and genetics are tested against results known from the literature. Results obtained from the PAT are first compared against article and book results, and afterwards against the existing laboratory results. The accuracy of a PAT is determined from its output data: the results either accurately match all data or they do not. A simple timer is used to record the processing time of a PAT. Finally, a list of PATs will show which PATs passed, which failed and why they failed our examination.
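The timing and accuracy checks in Objective 2 could be sketched as follows in Python. This is only an illustration under stated assumptions: the study used a simple timer and manual comparison, and no PAT programming interface is implied. The function `fake_pat` is a hypothetical stand-in for a tool lookup, and the pathway names are made-up placeholders, not real annotations for NPPA.

```python
import time


def run_timed(query_pat, query):
    """Call a PAT lookup function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = query_pat(query)
    return result, time.perf_counter() - start


def matches_literature(pat_output, literature_output):
    """All-or-nothing accuracy: True only if both contain exactly the same items."""
    return set(pat_output) == set(literature_output)


def fake_pat(query):
    # Hypothetical stand-in for a real tool; placeholder values only.
    known = {"NPPA": ["pathway-1", "pathway-2"]}
    return known.get(query, [])


result, elapsed = run_timed(fake_pat, "NPPA")
print(f"processing time: {elapsed:.6f} s")
print("accurate:", matches_literature(result, ["pathway-2", "pathway-1"]))
```

Comparing sets ignores the order in which results are listed, so only the content of the output decides whether it counts as accurate.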

Objective 3:

In order to have a satisfied end user, a specific set of requirements must be fulfilled in a final evaluation. Requirements are collected at an early stage through interviews with researchers and the end user, who also represent other potential users. The most desired and important requirements were discussed and identified to be the following.

The selected analysis tool must be able to:

a) Navigate between data and results
b) Make a visual presentation of the metabolomics, lipidomics or genomics data obtained
c) Offer zoom-in and zoom-out functions, expanding the view to neighboring possible results connected to pathways in the results obtained
d) Process more than one type of data (metabolomic, lipidomic or genomic)
e) Combine omics data and then map their pathways

Navigation will be tested by going from the output data (results obtained) back to the input data (the point where data is inserted). Data is inserted in the required fields, and traceability or clickable tracking views are sought when results are obtained. Any visual presentation of the results obtained is accepted, but a detailed view of pathway combinations and correlations is preferred. On the output data, zoom functions are sought, typically shown in the PAT as a small magnifying glass with a plus or minus sign. To test how many types of data (metabolomic, lipidomic or genomic) the PAT can process, one input of each data type will be selected. All three data types together (metabolomic, lipidomic and genomic) are tested first, then two data types together (metabolomic with lipidomic or genomic, lipidomic with genomic or metabolomic), and lastly each data type on its own (metabolomic, lipidomic, genomic). If a PAT passes all aims after evaluation, all results and test material are intended to be turned over to the end user. Further support will be provided in the form of answering questions on the specific PAT. Tests on the PATs Uniprot and FEvER will be done if no PAT is found that fulfills the requirements.
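The order in which input combinations are tested (all three data types first, then pairs, then single types) can be enumerated with a short Python sketch. The function name `input_combinations` is an assumption made for this example.

```python
from itertools import combinations

DATA_TYPES = ("metabolomic", "lipidomic", "genomic")


def input_combinations(data_types=DATA_TYPES):
    """Yield input combinations from all types together down to single types."""
    for size in range(len(data_types), 0, -1):
        yield from combinations(data_types, size)


plan = list(input_combinations())
for combo in plan:
    print(" + ".join(combo))
```

This gives seven combinations in total: one with all three types, three pairs and three single types.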

2.2. Alternative research methods

There are alternative methods to conduct this study, but they would involve working in a biochemistry laboratory to observe, interview and obtain results from experiments, and afterwards designing and building a complete PAT. Another method is to make a homepage and connect it to a PAT that is being used in the laboratory. The method selected in section 2.1 and described more in section 4 was chosen because it gives good-quality results, saves time and is efficient.

3. Biomedical background

This section contains the background information needed in order to understand the biomedical part. The information gathered is about genetics, lipids and their biochemistry, metabolomics, genomics and cardiovascular disease.

3.1. Genetics

Genetics is the study of genes, with their structures, sequences and role in heredity. It is a way to try to explain how genes work, what they are and what they can do [32]. Genetics involves scientific studies of genes and their effects leading to variation in living organisms [32]. That is, it describes how certain traits or conditions are passed down from one generation to the next, and how genes, as units of heredity, carry instructions for making proteins that direct activities in cells and functions of our bodies. An example is inherited disorders leading to diseases [32]. Disorders have been detected thanks to the large number of laboratory experiments and technological advances; the stored data makes PATs useful, giving functions to search for genes and match them with each other.

3.1.1. Gene

Genes are small molecular units that carry the heredity of living organisms. A gene holds the information to build and maintain an organism. Eukaryotic cells have a nucleus, which contains tightly packed, well-protected DNA [5]. The main building blocks of a gene are nucleotides carrying the nitrogenous bases A, T, C and G, covalently linked into a strand. Two such strands are held together by hydrogen bonds between the bases. This makes a sequence that in the end forms a long double-helix DNA chain. The DNA chain is tightly packed together with histones, which are proteins, to form an organized structure. The organized structure is called a chromosome [11]. All the chromosomes are well protected within the nucleus (Fig 2). The DNA chain in turn codes for many functions of living organisms [5]. Genetic information and traits also get passed on to the offspring when mating. In our genome there are structural genes which, when read, tell us what materials are needed in order to build up a cell or an organism. This is our genotype. Which structural genes are used is determined in combination with the environment, and this is called our phenotype. The phenotype is also affected by the environment of earlier generations, and this is called epigenetics [5]. Examples of phenotypes are eye color and blood type. The genotypes are identical in all human individuals up to about 99 percent. The remaining 1 percent varies from person to person, creating the features that make us all unique. Tiny differences in the genome sequences distinguish one individual from another [5]. These tiny single-base differences arise through reproduction, when two individuals create an offspring, and through Single Nucleotide Polymorphisms (SNPs), described further below. Some of these tiny genetic variations are important because they affect susceptibility to certain diseases (such as asthma, diabetes, sclerosis and cancer), and keeping track of them is hard unless you have an analysis tool at your disposal [5].

Figure 2. A schematic presentation of human DNA assembled into a chromosome.

3.1.2. SNP

SNP is short for Single Nucleotide Polymorphism, a sequence variation in DNA. It means that a single nitrogen base in a gene sequence differs in one individual while the rest of the sequence is the same as in another individual [5]. For example, the gene sequence ATAGGC is almost the same as the gene sequence ATCGGC, except that the second A has been replaced by a C. Such single-nucleotide changes occur throughout the whole genome [3]. SNPs occur in all species, lead to genetic variation and may result in different phenotypes of the organism. The research results in [4] show how such differentiation has occurred. The genetic changes are shaped by natural selection toward the most favorable adaptation of the genes [3]. Some SNPs are even specific to one ethnic group and missing in another. According to [32], both the coding and the non-coding regions of the DNA can be affected. SNPs are involved in susceptibility to diseases, as mentioned at the end of section 3.1.1.

The following scenario illustrates why SNPs are important [32]. A couple registers for a health check and gives blood to be analyzed in order to find out how healthy they are. The blood is processed so that only short sequences of nucleotides are left. The sequence of one individual is

“GCCAGTATTGTCGATTTCACAAGTGCCTTTCTGTCGGGATGTCACACAACGG”

and the sequence of the other is

“GCCAGTATTGTCGATTTCACAAGTGCGTTTCTGTCGGGATGTCACACAACGG”.

Both sequences code for a protein that controls the uptake of fat and sugar in the human body. The two individuals differ in a single base: the first sequence has a C where the second has a G. One of them has a high risk of developing diabetes. With today's technology, SNP analyses are used to determine disease susceptibility [32]. The analysis reveals bad news for the couple: the individual with the base changed to G will have to start injecting insulin unless food habits change within a year or two. Scenarios like this are very common in health care today, and health care is not the only area that exploits genetic variation. In forensic science, genetic variations are exploited in DNA fingerprinting [32].
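The comparison made in the scenario above can be sketched in a few lines of Python. This is only an illustration of the idea of locating a single-base difference between two equally long sequences; the function name `find_snps` is hypothetical, and real SNP analysis first aligns the sequences against a reference genome, which is not shown here.

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, base in seq_a, base in seq_b) for every mismatch.

    Assumes the two sequences are already aligned and equally long.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The short example from the text: the second A is replaced by a C.
short_diff = find_snps("ATAGGC", "ATCGGC")

# The two sequences from the health-check scenario.
person_1 = "GCCAGTATTGTCGATTTCACAAGTGCCTTTCTGTCGGGATGTCACACAACGG"
person_2 = "GCCAGTATTGTCGATTTCACAAGTGCGTTTCTGTCGGGATGTCACACAACGG"
scenario_diff = find_snps(person_1, person_2)

print(short_diff)
print(scenario_diff)
```

Each reported tuple gives the position of the differing base and the base found in each individual, which is the information a tool would need in order to look the variation up in a database.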

3.2. Biochemistry of Lipids

Biochemistry, also called biological chemistry, is the study of chemical processes in living organisms. Biochemical processes regulate and govern all living processes within all living organisms [5]. This occurs through biochemical signaling. Signaling is a kind of information flow, like sending a message from one place to another. Signals flow through every part of an organism and regulate its metabolism. Metabolism is the set of processes by which living organisms sustain life and reproduce themselves. Lipids are an important part of biochemistry. They are important components of a cell: they form the cell membrane and vital tissues and serve as an energy source for the organism [1]. Lipids are stored as energy reserves within the organism and used when needed. They also help maintain the electrochemical balance of a cell and take part in cell signaling and in trafficking, that is, controlling what goes into or out of the cell [11].

Lipids usually consist of a polar head and a hydrophobic tail. They bind to each other because the hydrophobic part tends to stay in contact with other hydrophobic molecules [3]. The balance between the hydrophobic and polar parts of the lipids directs the three-dimensional structure of the molecules [7]: with a relatively large polar part the lipids form micelles, while a more equal distribution leads to the formation of double layers known as membranes (Fig 3).

Figure 3. Picture of lipids with hydrophobic tails bound together and with other components forming the membrane. (Modified picture taken from Human Cell Biology ref. [43]).

3.2.1. Lipid definition

According to [19], chemists, biochemists and other analysts who work with lipids have a firm understanding of the term lipid. Nevertheless, there is no widely accepted definition today; lipids are simply said to be a group of naturally occurring compounds. [44] and [53] state that thousands of different lipid molecules can be found in an organism and that lipids can be divided into six main categories (Fig 4). They all have a low solubility in water and a high solubility in organic solvents.

3.2.2. Classes of Lipids

Because of the diversity of lipids in human plasma, a new nomenclature system was recently proposed in [26]. It separates lipids into eight classes, or categories, of which six are considered main classes. Each class can be further divided into subclasses and individual molecular species (Fig 3).

The first category is the fatty acyls, also called fatty acids. They come in three forms: fatty acids, octadecanoids and eicosanoids. They are the most common building blocks of structurally more complex lipids and can be saturated or unsaturated. Cells use these lipids to form the various membranes found in a cell, to store energy and to adjust the membrane fluidity in many cells. [43, 53]

The second category is the glycerolipids, which come in three forms: mono-, di- and triacylglycerolipids. Their main function is energy storage, and they are stored in bulk as fat in the tissues of animals. [43, 53]

The third category is the glycerophospholipids, usually just called phospholipids. The main forms are phosphatidylcholine (PC), phosphatidylethanolamine (PE) and phosphatidic acid (PA). The glycerophospholipids are the only class with a phosphorus bond, and they are the key component in the formation of bilayers. [43, 53]

The fourth category consists of the sphingolipids. The main forms are sphingomyelin and ceramides. Sphingolipids have a polar head and two non-polar tails. Sphingomyelin has a protective role: it forms the myelin sheath that protects nerves. [43, 53]

The fifth category is the sterol lipids, which are various forms of alcohols. Sterol lipids have important biological roles: they act as regulating hormones and as signaling molecules. [43, 53]

The last category is the prenols, which form terpenes and act as precursor molecules of vitamins such as vitamin A, E and K. [43, 53]

3.2.3. Enzymes involved in the synthesis of lipids

This section takes a closer look at lipids and their synthesis, to give a better understanding of the amount of information a PAT must be able to process, from the data input (a lipid name connected to glycerolipids) to the results obtained.

Some lipid chains are very long or complex while others are short. Synthesizing lipids purely chemically would take a long time, but with the help of enzymes it is much faster, as [37] shows. Lipids occur in numerous forms, and several enzymes are needed. In [46] a systems biology view presents the required enzymes by use of a PAT. For example, the synthesis of fatty acids occurs in the cytoplasm, and the key enzymes involved are acetyl-CoA carboxylase (ACC) and malonyl-CoA carboxylase (MCC), as stated in [51]. Another enzyme, acyl-CoA:cholesterol acyltransferase (ACAT), acts on cholesterol [51]. This is supported by [45], which gives a clear overview in pictures. Because there are so many fatty acids, saturated and unsaturated, they are given designated symbols [31] to keep track of their carbon atoms and bonds. A symbol consists of two numbers separated by a colon (:) [31]. The first number gives the carbon chain length of the fatty acid and the second number the degree of unsaturation, that is, the number of double bonds. A fatty acid with several unsaturated bonds therefore has a higher second number (see Table 3). Fatty acids longer than 16 carbons are synthesized through a two-carbon elongation process carried out, according to [31], by enzymes in the endoplasmic reticulum (ER). Besides elongation, desaturation also takes place in the ER, using four desaturase enzymes named delta four, delta five, delta six and delta nine. The number in each name indicates the position in the fatty acid carbon chain at which the desaturation occurs [31]. The main desaturase is delta nine, also called stearoyl-CoA desaturase-1. Desaturation requires oxygen (O2), the coenzyme reduced nicotinamide adenine dinucleotide (NADH) and an electron-transporting hemoprotein called cytochrome b5 [47]. In fatty acid desaturation two hydrogen atoms are removed from the fatty acid, so that both the fatty acid and NADH are oxidized. This creates a double bond between two carbons in the fatty acid chain.

Table 3. The main fatty acids in organisms (modified table taken from Cyberlipid Center ref. [31] and Virginia web education ref. [10]).

Main fatty acids

Carbons  Name              Systematic name                  Symbol  Structure

Saturated fatty acids
12       Lauric acid       Dodecanoic acid                  12:0    CH3(CH2)10COOH
14       Myristic acid     Tetradecanoic acid               14:0    CH3(CH2)12COOH
16       Palmitic acid     Hexadecanoic acid                16:0    CH3(CH2)14COOH
18       Stearic acid      Octadecanoic acid                18:0    CH3(CH2)16COOH
20       Arachidic acid    Eicosanoic acid                  20:0    CH3(CH2)18COOH
22       Behenic acid      Docosanoic acid                  22:0    CH3(CH2)20COOH
24       Lignoceric acid   Tetracosanoic acid               24:0    CH3(CH2)22COOH

Unsaturated fatty acids
16       Palmitoleic acid  9-Hexadecenoic acid              16:1    CH3(CH2)5CH=CH(CH2)7COOH
18       Oleic acid        9-Octadecenoic acid              18:1    CH3(CH2)7CH=CH(CH2)7COOH
18       Linoleic acid     9,12-Octadecadienoic acid        18:2    CH3(CH2)4(CH=CHCH2)2(CH2)6COOH
18       α-Linolenic acid  9,12,15-Octadecatrienoic acid    18:3    CH3CH2(CH=CHCH2)3(CH2)6COOH
18       γ-Linolenic acid  6,9,12-Octadecatrienoic acid
20       Arachidonic acid  5,8,11,14-Eicosatetraenoic acid
24       Nervonic acid     15-Tetracosenoic acid            24:1    CH3(CH2)7CH=CH(CH2)13COOH
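The symbol notation used in Table 3 is simple enough to be read by a program. The following sketch, with the hypothetical helper names `parse_symbol` and `saturated_structure`, splits a symbol such as 18:2 into its carbon count and number of double bonds, and rebuilds the condensed structure of a straight-chain saturated fatty acid in the form used in the table. It is an illustration only and not part of any PAT examined in this thesis.

```python
from typing import NamedTuple


class FattyAcidSymbol(NamedTuple):
    carbons: int       # first number: length of the carbon chain
    double_bonds: int  # second number: number of double bonds

    @property
    def saturated(self) -> bool:
        # A saturated fatty acid has no double bonds (second number is 0).
        return self.double_bonds == 0


def parse_symbol(symbol: str) -> FattyAcidSymbol:
    """Parse a symbol such as '18:2' into carbon count and double bonds."""
    carbons_text, bonds_text = symbol.strip().split(":")
    carbons, double_bonds = int(carbons_text), int(bonds_text)
    if carbons < 2 or double_bonds < 0:
        raise ValueError(f"not a valid fatty acid symbol: {symbol!r}")
    return FattyAcidSymbol(carbons, double_bonds)


def saturated_structure(carbons: int) -> str:
    """Condensed formula of a straight-chain saturated fatty acid.

    One carbon sits in the CH3 end and one in the COOH end, so the
    chain in between holds carbons - 2 CH2 groups.
    """
    return f"CH3(CH2){carbons - 2}COOH"


linoleic = parse_symbol("18:2")
palmitic = parse_symbol("16:0")
print(linoleic, linoleic.saturated)
print(palmitic, saturated_structure(palmitic.carbons))
```

The symbol alone does not say where the double bonds are located; that information is carried by the systematic name in the table.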

Complex lipids have a longer biosynthetic pathway. According to [47], two main pathways are known: the sn-glycerol-3-phosphate pathway (also known as the Kennedy pathway) and the monoacylglycerol pathway (Fig 5 and 6). Synthesis by the Kennedy pathway occurs in the liver and adipose tissue, while the monoacylglycerol pathway takes place in the intestine, as confirmed in [42]. Both start with the catabolism of glucose (glycolysis), resulting in the biosynthesis of glycerol; however, new evidence in [47] indicates that some glycerol is synthesized anew (de novo) from single molecules by a process called glyceroneogenesis. The following reactions occur in the endoplasmic reticulum (ER) of mammalian organisms [47]. First, sn-glycerol-3-phosphate is esterified by a fatty acid coenzyme at the sn- position, in a reaction catalyzed by the enzyme glycerol-3-phosphate acyltransferase (GPAT), to form lysophosphatidic acid. Lysophosphatidic acid is then acylated to form phosphatidic acid, an intermediate in the synthesis of all glycerolipids [47]. During the synthesis of triacyl-sn-glycerol, the phosphate group is removed by a family of enzymes called lipid phosphate phosphatases (PAP), as stated in [47], forming 1,2-diacyl-sn-glycerols, which are further acylated by diacylglycerol acyltransferase (DGAT) into triacyl-sn-glycerol (Fig 5). During the synthesis of the glycerophospholipids phosphatidylcholine (PC), phosphatidylethanolamine (PE) and phosphatidylserine, the phosphate group is not removed from phosphatidic acid, as stated in [6]. Instead, phosphatidic acid is used as a precursor molecule in the synthesis of these glycerolipids (Fig 6). Synthesis by the monoacylglycerol pathway is less complex and involves only a few enzymes, belonging to an acylglycerol acyltransferase family, to form the triacylglycerols in the intestine [47].

Figure 4. Seven lipid classes and how they interact biosynthetically (modified picture taken from Molecular biochemistry ref. [45]).

Figure 5. The Kennedy pathway synthesis in mammals (modified picture taken from The AOCS Lipid Library – Triacylglycerols ref. [47]).

Figure 6. Synthesis of acylglycerides and glycerophospholipids showing a link between the two pathways (modified picture taken from Lipid Library ref. [35]).

3.2.4. Lipoproteins

Lipids are almost insoluble in water, but there are ways for them to be transported through the blood circulation [5]. Lipoproteins allow lipids to be transported through the blood circulation to reach different tissues [5]. Lipoproteins are assemblies that contain both proteins and lipids. The protein part serves as an emulsifier for the lipids [11]. There are five major classes of lipoproteins, two of which are especially important [1]: high density lipoproteins (HDL) and low density lipoproteins (LDL). The remaining classes are intermediate density lipoproteins (IDL), very low density lipoproteins (VLDL) and chylomicrons [11]. Both HDL and LDL carry lipids such as cholesterol; LDL is sometimes referred to as the bad cholesterol and HDL as the good cholesterol. According to [11], problems can occur when LDL is oxidized, leading to almost unstoppable chain reactions. The effect of these chain reactions is atherosclerosis many years later [11].

3.3. Genomics

Genomics is a discipline within genetics that focuses on the study of the genomes of all organisms [32]. The purpose of this field of research is to determine the entire DNA sequence of organisms and to produce a scaled map of all their genes [32]. This also includes mapping what a gene does and its association with processes within an organism, e.g. those studied in metabolomics or lipidomics [7]. During mapping and association, each gene is given a designated name and number (as an ID tag), and all the necessary information about that specific gene is stored in a database [32]. A PAT can retrieve the information from these databases when needed.

3.4. Metabolomics

Metabolomics is the study of chemical processes involving metabolites, in which sophisticated analytical technologies are used to make systematic studies [8, 18]. A systematic study consists of target analysis, metabolite profiling and metabolic fingerprinting. Metabolites are found in all biological cells, and they all have unique chemical signatures, like a fingerprint [46]. These unique fingerprints are the end products of cellular processes and can be used to see how specific chemical processes have occurred [8, 46]. The chemical processes examined can come from a living organism, cells, tissues or even an organ. The research field of metabolomics consists of many subfields; the one we focus on is lipidomics (lipids/fatty acids) [40].

3.5. Lipidomics

Lipidomics describes the complete profile of lipids in cells, tissues or organisms [47]. It is a subfield of metabolomics and a newly emerged research field that has been driven forward by rapid advances in technology [53]. Such technologies include mass spectrometry (MS), fluorescence spectroscopy (FS) [24] and nuclear magnetic resonance (NMR) [39]. These technologies produce large amounts of data that are stored in databases, which gives PATs the possibility of offering a new method of data analysis [18].

3.6. Cardiovascular diseases

There are many diseases around the world. One class of them, the cardiovascular diseases, involves the heart or the vessels that transport blood (arteries and veins) [46]. Cardiovascular diseases include the following: aneurysm (abnormal bulge in an artery), angina (chest pain due to lack of blood to the heart muscle), atherosclerosis (plaque builds up inside the arteries), cerebrovascular accident (stroke), congestive heart failure, coronary artery disease and myocardial infarction (heart attack). Several researchers in [12] claim that known factors such as lipid or fat content can affect the cardiovascular system, and point out that high density lipoproteins (HDL) and low density lipoproteins (LDL) are regarded as a factor behind cardiovascular diseases. Once the disease is detected it has usually progressed for years, and it can lead to the need for an operation or even to death. Another group of researchers has shown in [29] that several SNPs are associated with the plasma levels of HDL and LDL, which in turn are associated with myocardial infarction. In [2], individuals from five different ethnic groups were tested for eight SNPs from three genes associated with cholesterol and lipoprotein synthesis, and a clear correlation with myocardial infarction was obtained. A study on the risks for myocardial infarction in [48] strengthens this theory.

4. Computer Science background

This section contains the background information needed to understand the IT part and how a PAT works.

4.1. Databases, Data mining and Knowledge discovery

A database consists of tables with many columns and rows that hold a vast amount of data. The information is stored in a specific place and is usually well organized. Depending on what is inserted into the database, the information can concern genetics, lipidomics or something else entirely. Databases often require some form of tool in order to read and retrieve specific information quickly among the vast amount of data. This is where data mining comes in: specific information is sought and gathered by a tool and then presented as results. From the presented results, knowledge can be gained. The gained knowledge can take many forms, for example improving experiments, confirming experimental results or changing the approach used to solve a scientific problem [13].

4.2. PAT

As mentioned earlier, all kinds of IT-based tools have been developed in various biomedical fields [21]. This has been done to keep track of all the scientific information obtained during the past years [20]. The PATs we work with process information about genes, SNPs and lipid metabolism, and therefore play an important role in lipidomics and genomics research. The tools can be either web-based software or downloadable programs that take data in the form of a gene name, an SNP accession ID (rs number) or the class name of a lipid. The PAT then processes the given data by searching local or remote databases. The databases consist of many tables with different information related to genes and lipids. The PAT performs many joins between the tables and puts the information together. When a search is made by data mining, the information from these tables is gathered and presented as results based on the input parameters [9]. The input parameters are the texts entered in the search field. In some cases the tools fail to provide results for certain input. Finally, the analysis tool shows whether any results were found, where the lipids or genes are used in the metabolism and how they are connected to other lipids or genes in the metabolism. The researchers working with the PAT can analyze the results further in order to gain new knowledge. With new knowledge, new discoveries can be made in e.g. lipidomics or genomics, which leads to a demand for updates to the databases and the PATs (Fig 8). With the help of a PAT, new discoveries about diseases like atherosclerosis can be made [24], or new pathway links in the metabolic system responses can be found [52].
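The table joins described above can be illustrated with a small in-memory database. The sketch below uses Python's standard `sqlite3` module; the table names, column names and the single example row are assumptions made for illustration and do not reflect the schema of any PAT examined in this thesis. The only fact taken from this study is that the gene APOE is associated with lipoproteins.

```python
import sqlite3

# Hypothetical three-table schema: genes, pathways and a link table between them.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE pathway (pathway_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE gene_pathway (gene_id INTEGER, pathway_id INTEGER);
    INSERT INTO gene VALUES (1, 'APOE');
    INSERT INTO pathway VALUES (10, 'lipoprotein metabolism');
    INSERT INTO gene_pathway VALUES (1, 10);
    """
)


def pathways_for_gene(symbol: str) -> list[str]:
    """Join the three tables and return the pathway names linked to a gene symbol."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM gene AS g
        JOIN gene_pathway AS gp ON gp.gene_id = g.gene_id
        JOIN pathway AS p ON p.pathway_id = gp.pathway_id
        WHERE g.symbol = ?
        ORDER BY p.name
        """,
        (symbol,),  # the search-field text is passed as a bound parameter
    ).fetchall()
    return [name for (name,) in rows]


print(pathways_for_gene("APOE"))     # one linked pathway
print(pathways_for_gene("UNKNOWN"))  # empty list: the tool found no result
```

An empty result list corresponds to the case mentioned in the text where a tool fails to provide results for a certain input.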

Figure 8. A view on how everything is connected (both virtually and physically) to the PAT.

5. Requirements and Test elicitation

The following section contains information about requirement specification, test plans, test cases and software testing.

5.1. Requirements

Many software programs or tools require months of testing to see whether they function correctly, and need very thorough sets of specific requirements in order to be considered good, functional tools, as stated in [39]. Specific sets of requirements can be obtained through interviews with the people who matter, such as stakeholders or individuals with a key role. We use an empirical method for software testing and gather requirements based partially on the requirements phase of one of the waterfall models used by Ian Sommerville [50]. In the requirements phase, sometimes also called the requirements engineering phase, brainstorming, research and analysis are conducted on the software that will be either developed or tested. We intend to do the testing part without developing any new PAT, and therefore only the requirements methodology is adapted and applied in our study. The brainstorming, research and analysis are the most important part of requirements gathering. Basic requirements are defined for the uses that the software must support [49]. During this phase, current working processes are studied in depth, together with how the problems can be addressed or solved. Without understanding the given requirements, a successful system or software is unlikely to be delivered. Requirements elicitation involves many interviews with key individuals. Requirements are analyzed and evaluated and can be of three kinds: those with the highest priority are the must have, those with second priority are the should have, and those with the lowest priority are the nice to have. The requirements are then specified and validated, which means that the individuals with a key role or the stakeholders accept them. Requirements are therefore needed in order to define what a software program must or should do, the features it has and the quality that it must fulfill [39]. Requirements are divided into functional and non-functional requirements. Functional requirements are always defined with shall or must. Non-functional requirements are properties or qualities that a product must have, and they are used to describe the usability, reliability and performance of the software [17]. Quality is not something that can be measured easily, as mentioned in section 1.3, but one way to measure it, according to [23], is to use metrics or statistics on measurable properties that are associated with quality.

5.2. Testing

In order to do good testing, test plans are developed in the form of a document so that systematic approaches can be used in software testing. Systematic approaches are fast and save time [30]. A test plan describes the testing phases (e.g. unit testing, integration testing, acceptance testing), the requirements needed, all activities, the resources needed and the documentation notes. In this work we use acceptance testing to determine whether a set of requirements is met. The acceptance tests are run with specific data. Once the results are obtained, they are compared with known expected results, and depending on whether they match, a pass or no pass is given. There are many ways to design tests, and most of the time they differ between companies [30]. All test plans are created prior to any testing. A test plan usually has a suitable name related to the tests performed. Once a test plan is selected, it is carried out according to a template [30] in which the test plan document is divided into two parts. The first part consists of general information about the tests that will be performed: participants, test strategies, specifications and the functions or features that will be used. The second part contains the procedure, in the form of scenarios (also called cases), for how the tests will be done. The creation of the test plan, requirements and test cases leads to the actual software testing. The purpose of software testing is to investigate and gain information on how the selected software program works [39]. Software properties are evaluated during the testing, and bugs and errors that are found are also eliminated [23]. Once the software program or application meets the requirements, it is considered to be working well and to satisfy the needs of those who requested it.

5.3. Test cases

If software is going to be tested, a set of test conditions has to be written from the functional requirements [17]. These test conditions are referred to as test cases and are performed in a fixed sequence that must be followed during testing. First the goal of the test is described together with the event or execution steps, followed by the expected response and the actual response from the test. Test cases are often created with a set of conditions from the given requirements, as mentioned above, in order to reduce ambiguity in the software to a minimum [21]. The number of test cases depends on the number of requirements given. To perform the testing phase, testers are selected to examine, discover and determine whether the software is working correctly. To keep track of a tester's work, traceability matrices are used, linking requirements to specific test cases (Fig 8). A test case consists of many steps, starting with an input based on the requirement to be tested and ending once all steps have been completed. The action or execution events describe how to do the test, along with the expected response or outcome. The actual results obtained are written down once the test case is complete.
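The structure of a test case and of a traceability matrix, as described above, can be sketched as a small data structure. The class and function names below are hypothetical, and the two example records only illustrate the layout; the pairing of test case 4 with requirement ID 2 and test case 5 with requirement ID 3 is our reading of section 6.3, and the goals, steps and responses are invented placeholders rather than the actual test cases from appendix 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseRecord:
    case_id: int
    requirement_id: int           # the requirement this test case traces back to
    goal: str
    steps: list[str]
    expected: str
    actual: Optional[str] = None  # filled in once the test case is complete

    def verdict(self) -> str:
        """Compare the actual response with the expected one."""
        if self.actual is None:
            return "not run"
        return "pass" if self.actual == self.expected else "no pass"


def traceability_matrix(cases: list[CaseRecord]) -> dict[int, list[int]]:
    """Map each requirement ID to the IDs of the test cases that cover it."""
    matrix: dict[int, list[int]] = {}
    for case in cases:
        matrix.setdefault(case.requirement_id, []).append(case.case_id)
    return matrix


# Placeholder records for illustration only.
cases = [
    CaseRecord(4, 2, "results match known data", ["enter input", "run search"],
               expected="match found", actual="match found"),
    CaseRecord(5, 3, "tool responds quickly", ["enter input", "start timer"],
               expected="response in time"),
]

print(traceability_matrix(cases))
print([case.verdict() for case in cases])
```

Keeping the requirement ID on each record makes it possible to see at a glance which requirements still lack a completed test case.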

Figure 1. A test case template used in this study.

6. Result

6.1. Finding the PATs

Many PATs were found by searching the Internet, and most of them had their own homepages. The searches were made with the Google search engine using the search term “PAT”. 46 homepages describing analysis tools were found, and these tools had to be evaluated. The homepages stated what type of data the PATs could process, which was confirmed by downloading and testing the PATs. The downloaded PATs were evaluated with test cases 1 to 3, which require the tool to be able to take as input and process lipidomics, metabolomics and genomics data, as described in appendix 1. The genomic data used, which have known correlations with lipid synthesis and metabolism, are shown in appendix 2. Out of the 46 PATs, 23 analysis tools fulfilled the criterion of being able to process metabolomic, lipidomic and genomic data.

6.2. Sorting the PATs

The PATs are sorted by ranking them on their ability to process more than one type of data. The Ingenuity PAT is the only one that processes all types of data. The majority of the selected PATs process only genomics data (13 out of 23), whereas 9 of the 23 PATs can process both genomic and protein data (Fig 9).
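The ranking principle described above, where a tool that handles more data types is placed higher, can be sketched as a simple sort. The tool names and their data types below are hypothetical placeholders and are not taken from the ranking list in Figure 9.

```python
# Hypothetical tools mapped to the data types they can process.
tools = {
    "Tool A": {"genomics"},
    "Tool B": {"genomics", "proteomics"},
    "Tool C": {"genomics", "proteomics", "lipidomics", "metabolomics"},
}


def rank_tools(data_types_by_tool: dict[str, set[str]]) -> list[str]:
    """Order tools by number of supported data types, most first.

    Ties are broken alphabetically so the ranking is reproducible.
    """
    return sorted(
        data_types_by_tool,
        key=lambda name: (-len(data_types_by_tool[name]), name),
    )


for position, name in enumerate(rank_tools(tools), start=1):
    print(position, name, len(tools[name]))
```

A count of data types is only one possible ranking key; the requirement IDs a tool passed could be added to the sort key in the same way.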

Figure 9. A ranking list of PATs based on the number of data types they can process and the ID number of the requirement they passed.

6.3. Testing the PATs

The PATs go through testing with test cases 4 and 5, which are linked to requirement IDs 2 and 3, according to sub-aim 2 in section 1.2. Very little information can be found in the literature when searching for metabolomics, lipidomics and genetic pathways or their correlations with PATs. The books found describe basic information about metabolomics, lipidomics and genetics but not much about their pathways or correlations [5, 31 and 3]. Searches on the Internet give few results in the form of articles and journals (such as [10, 34, 41 and 46]).

For the comparison with the results obtained from the PATs, three references can be found. Two of them relate to lipids and their synthesis pathway [1, 5], and the third concerns the correlation of genetics to pathways [20]. The information in these references correlates with the results obtained by the PATs.

Since few results are found in the literature and only a handful by the Google search engine, the work focuses on laboratory results obtained by the end user, which are known to be valid. 35 out of 45 PATs are compared and show successful results, correlating with the laboratory results.

A timer is used to measure the time a PAT takes to process data (requirement ID 3). The timer starts from zero and counts upwards until the PAT has completed its processing. Three times are recorded for each PAT to obtain a more accurate measurement. All PATs respond within a maximum of 7 seconds.
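The timing procedure above was carried out by hand with a timer. For tools that can be called from a script, the same procedure could be automated roughly as follows. This is a sketch under assumptions: `fake_query` is a hypothetical stand-in for a real PAT request, and the 7 second limit is simply the maximum response time reported above.

```python
import time
from typing import Callable

MAX_SECONDS = 7.0  # longest response time observed in this study
RUNS = 3           # three times are recorded per tool


def time_runs(task: Callable[[], object], runs: int = RUNS) -> list[float]:
    """Run the task several times and return the elapsed seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # timer starts from zero for every run
        task()
        durations.append(time.perf_counter() - start)
    return durations


def within_limit(durations: list[float], limit: float = MAX_SECONDS) -> bool:
    """A tool passes when even its slowest run stays within the limit."""
    return max(durations) <= limit


def fake_query() -> None:
    # Hypothetical stand-in for submitting data to a PAT and waiting for results.
    time.sleep(0.01)


durations = time_runs(fake_query)
print([round(d, 3) for d in durations], within_limit(durations))
```

Recording every run, rather than only an average, keeps the slowest response visible, which is what the requirement is checked against.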

6.4. Evaluating the PATs

Evaluation of the PATs according to sub-aim 2 results in a total of 14 analysis tools passing the criteria (Fig 10). These tools fulfill the requirements of processing data quickly and showing correlation with the results reported in the literature and obtained at the laboratory.

Figure 10. The PATs that passed basic evaluation according to sub-aim 2.

6.5. Final evaluation of the PATs

The PATs are further evaluated against the end user requirements, and the result shows that no analysis tool fulfills the requirements (Fig 10). Test cases 6 to 10 (linked to requirement IDs 4 to 8) are used for the final evaluation.

Figure 10. We see the PATs that passed certain requirements.

No PATs passed sub-aim 3. The tools failed on the zoom in and out function (to expand the view to neighboring possible results) connected to pathways when results were returned, and on the combination of more than one type of omics data as input. A complete view of the results and the reasons for exclusion is shown in appendix 3.


6.6. The best PAT from the ranked list

The Ingenuity PAT is ranked best in our list containing all of today’s available analysis tools. It can process and fulfill almost all of the requirements given by the end user. The tool is even capable of taking input of more than one omics data type, and during the testing stage no input limits were found when combining data. The only setback is the missing zoom in and out function (to expand the view) connected to pathways upon obtaining results, which the program could not perform. A gene name (APOE) and two protein accession numbers (NP_000032 and P02649) are selected in order to test the tool. The selected gene and proteins are known to be involved in the synthesis and formation of lipoproteins. The expected result from the PAT is therefore information related to lipoproteins. The gene name is inserted in the search field and the returned result shows 1 match for lipoproteins (Fig 11). The first selected protein accession number (NP_000032) returned 1 match for lipoproteins, linking it to the gene APOE (Fig 12). The second selected protein accession number (P02649) also returned 1 match for lipoproteins, linking it to the gene APOE (Fig 13).

Figure 11. Returned result from the Ingenuity PAT after input of the gene APOE.

Figure 12. Search results after input of a protein with the accession number (NP_000032) showing the protein to be linked to lipoproteins and the gene APOE.

Figure 13. Search results after input of a protein with the accession number (P02649) showing the protein to be linked to lipoproteins and the gene APOE.

6.7. Combining PATs

No PAT passes the examination outlined in sub-aim 3, since even the Ingenuity PAT, which scored well in our testing, failed due to the missing zoom functions. Instead we have to investigate whether it is possible to develop a PAT that meets the requirements of the end user, or whether a combination of 2 or 3 PATs can fulfill those requirements (see section 3.1.2).

The decision after a meeting with the end user is that developing a PAT ourselves is not an option due to limited time and insufficient manpower. The end user requests that the pathway analysis program Uniprot batch converter be tested primarily against a program called FEvER, but also against all other combinations of PATs that can be found. Since Uniprot handles lipidomics while FEvER handles metabolomics, the two PATs have functions that can complement each other. After input of genomic data, Uniprot gave either a blank page, showing that nothing was found, or a list of possible matches to the protein encoded by the gene. Tests with the Uniprot batch converter end with some of the gene names successfully converted into proteins. Inputting the results from Uniprot into FEvER works well and the PAT starts, but it always ends with 0 results. The combination of Uniprot and FEvER is therefore not successful.

In our search for combinations of PATs we turn our focus to the National Center for Biotechnology Information (NCBI) homepage, where a combination of pathway analysis programs can be found. The NCBI homepage offers a series of PATs without any need for downloads or payments and contains a vast amount of information in molecular biology collected worldwide. The NCBI homepage is also funded by the U.S. government. It provides an online tool with a very useful search engine, capable of taking more than one input and forwarding the search to the specific databases to which the NCBI homepage is connected. Results are returned according to how much input is inserted in the search field. The gene ENAC was selected, which codes for a protein that affects the sodium channels in biological organisms. The expectation is to obtain substantial information about ENAC and the sodium channel from all databases connected to the NCBI homepage. The returned results show that 26 of 37 databases found matches containing information about articles, genes, proteins, SNPs and nucleotide sequences related to ENAC and the sodium channel (Fig 14). A second gene, NPPA, is then selected; it codes for a protein that makes a receptor called natriuretic peptide class A, regulating water and sodium balance in biological organisms. The selected gene names (ENAC and NPPA) are put into the search field as “ENAC, NPPA”. No combinations or relations are expected to be found between these gene names, and no results are expected to be returned from the databases. The results obtained exceed the expectations, with matches in 8 databases containing some information about articles, genes, molecular interactions and data mapping relating ENAC and NPPA to each other (Fig 15). With more databases connected to each other, the chance of finding combined results therefore increases.
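A comparable search could in principle be scripted instead of typed into the homepage. The sketch below only builds a query URL for NCBI's public E-utilities `esearch` endpoint, which was not used in this thesis. Combining the gene names with AND, the helper name `esearch_url` and the choice of the `gene` database are assumptions made for illustration, and no request is sent.

```python
from typing import List
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database: str, gene_names: List[str]) -> str:
    """Build an esearch URL that asks one NCBI database for all given gene names."""
    # Assumption: the names are combined with AND so that only records
    # mentioning every gene are matched.
    term = " AND ".join(gene_names)
    query = urlencode({"db": database, "term": term, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"


if __name__ == "__main__":
    # The two gene names used in the combined search described above.
    print(esearch_url("gene", ["ENAC", "NPPA"]))
```

Repeating the call with different database names would approximate the search across several databases, one database per request.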

Figure 14. A search on NCBI’s homepage across all databases it is connected with. The gene ENAC is used as data in order to investigate the returned search results.

Figure 15. A search on NCBI’s homepage across all databases it is connected with. The genes ENAC and NPPA are used together as data in order to investigate the combined search result returned.

6.8. Functionalities

A variety of functionalities are found in the PATs through software testing. The most common functionalities are the following: the search input field, the input text area field, dropdown lists and a response window showing how many seconds it takes to receive the results (Fig 16).

Figure 16. Picture of the search input field marked blue, input text area field marked yellow and push button marked green from 2 different websites.

Some tools give a nice visual presentation showing how results such as genes or lipids are connected (Fig 17). Two tools have a 3D view, but half the time during testing it gave only a blank white page or crashed the program.

Figure 17. A gene name was searched and the PAT visually showed the result. The searched gene is colored red and shown in relation to other genes, proteins, signaling receptors and biological cell processes.

No PAT fulfills all the requirements according to sub-aim 3. Of all the investigated PATs, only 2 have easy navigation functions according to our requirements and only 5 make visual pathway presentations. None of the PATs have any zoom in or out functions (to expand the view) connected to pathways on the results obtained.

6.9. Quality

Quality is defined as follows: if a PAT fulfills all requirements, its quality is good. As the results show, no PAT fulfills the requirements. The quality is therefore not acceptable for any of the PATs. The following aspects are used to evaluate the quality of the homepages: Accuracy and Correctness, Completeness, Relevance, Time and Punctuality, Traceability. All homepages provide accurate information with no errors or misleading information, although many homepages belong to companies that consider their product to be the best. Five homepages also state that they are a work in progress, making them less appealing by design. No company states when or where its PAT was developed or how long it has been in existence. The Google search engine uses a page ranking system that ranks the companies' pages high. This makes them fast to find, taking about 3 seconds at most; full results are shown in appendix 4. Companies rarely give references for the information they put on their homepages, making it hard to trace the information they provide; few or no references can be found. In total, only one homepage provides some references and information that can be traced.

7. Discussion

7.1. Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combines them with genetic information?

Looking at the results we see that 46 PATs are found but only 23 fulfilled the criteria for being able to process metabolomic, lipidomic or genomic data. The PATs are sorted and ranked based on how many types of data they are capable of handling and further evaluated according to sub-aim 2. A total of 14 PATs passed sub-aim 2. This leaves only 1 available PAT that can process all types of omics data, which is the Ingenuity PAT. It looks promising as a PAT and is number one on the ranking list; however, the tool fails the end user requirements on the zoom in and out functionality (to expand the view to neighboring possible results) connected to pathways upon received results. The Ingenuity PAT itself was easy to use and the results returned are also understandable. If the requirements had been slightly different, this would be a good PAT to use. The Ingenuity PAT belongs to a company, so payment is required in order to use it after a few days of trial. 495 U.S. dollars is a high price for an individual, but for a larger group of at least six working researchers who might use the Ingenuity PAT over a period of two years, the price is acceptable.
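The sorting and ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis (where ranking was done by hand): the class `Pat`, the function `rank_pats` and the tool names are hypothetical, and ties are broken alphabetically only to make the order deterministic.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# The three omics data types considered in this thesis.
OMICS_TYPES: FrozenSet[str] = frozenset({"metabolomics", "lipidomics", "genomics"})


@dataclass(frozen=True)
class Pat:
    name: str
    supported: FrozenSet[str]  # data types the tool accepts as input


def rank_pats(pats: List[Pat]) -> List[Pat]:
    """Drop tools that handle none of the omics types, then rank the rest
    by how many of the three types they handle (most first)."""
    eligible = [p for p in pats if p.supported & OMICS_TYPES]
    return sorted(eligible, key=lambda p: (-len(p.supported & OMICS_TYPES), p.name))


if __name__ == "__main__":
    # Hypothetical tools, not results from the thesis.
    candidates = [
        Pat("Tool B", frozenset({"genomics"})),
        Pat("Tool A", frozenset(OMICS_TYPES)),
        Pat("Tool C", frozenset({"proteomics"})),
    ]
    for pat in rank_pats(candidates):
        print(pat.name, len(pat.supported & OMICS_TYPES))
```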

7.2. What are the functionalities offered by the available analysis tools?

Throughout software testing, going through the test cases, the functionalities of all PATs are tested. The PATs have a variety of functionalities. As the results show, the most commonly used functions are the input fields, text area input fields and a dynamic dropdown list. This applies to both downloaded and web-based PATs. Some PATs are more unique and offer extra functionalities such as file upload or links related to the inserted search field. All PATs have some form of visual presentation but only a few give the desired effect, like the KEGG PAT. KEGG has the best type of visual presentation, with arrows and different color markings. Next in line is the Ingenuity PAT (as of 2012-05-20 this tool only has a test version but still looks good), showing a visual presentation with possible correlations.

7.3. What are the qualities of these tools and how to evaluate them?

Quality (mentioned in section 1.3) is interpreted differently by different people, and everyone has their own view of what quality is. Our view of good quality is that if all requirements are fulfilled by the PATs, they also fulfill the needs of the potential users. However, what is good quality for someone may not be good for someone else, so a more general view of quality is often chosen. We nevertheless chose to have 8 specific requirements that needed to be fulfilled by a PAT, as seen in section 1.3, and the end user accepts this as good quality. Quality definitions are hard to make in general for software programs or applications, even when making broad and general definitions, as pointed out in [36], while [38] strengthens the reasons. According to [27], homepages can be examined thoroughly by certain aspects to determine their quality. The examined homepages for the respective PATs have good standards, fulfilling 4 of 5 aspects. In [38] requirements are also a way to define quality, and in [14] approaches are shown more thoroughly. We follow the definition of quality given by [14], under which all requirements need to be fulfilled for quality to be considered good; this is easily seen in a traceability matrix, as [38] states. Looking at the matrices afterwards for traceability [22], appendix 3 shows that not all requirements are fulfilled; therefore we cannot say that the quality of these analyzed PATs is good.
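The rule that quality is good only when every requirement is fulfilled can be expressed as a small traceability check. The sketch below is illustrative only: the mapping of requirement IDs to test cases follows Appendix 1, while the function names and the assumption that a requirement counts as fulfilled only when all of its linked test cases pass are ours.

```python
from typing import Dict, List, Set

# Requirement ID -> linked test case numbers, as listed in Appendix 1.
REQUIREMENT_TESTS: Dict[int, List[int]] = {
    1: [1, 2, 3],
    2: [4],
    3: [5],
    4: [6],
    5: [7],
    6: [8],
    7: [9],
    8: [10],
}


def unmet_requirements(passed_tests: Set[int]) -> List[int]:
    """Return the requirement IDs with at least one linked test case not passed."""
    return [
        requirement
        for requirement, tests in REQUIREMENT_TESTS.items()
        if not all(test in passed_tests for test in tests)
    ]


def quality_is_good(passed_tests: Set[int]) -> bool:
    """Quality is good only if no requirement is left unmet."""
    return not unmet_requirements(passed_tests)


if __name__ == "__main__":
    # Hypothetical tool that passes every test case except the zoom test (8).
    passed = set(range(1, 11)) - {8}
    print(unmet_requirements(passed), quality_is_good(passed))
```

A single failed test case, such as the zoom test linked to requirement ID 6, is enough for the check to report that quality is not good.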

If we look at quality on downloaded PATs versus the homepage of NCBI we might ask the following question: what are the pros and cons? Downloaded PATs seem to be more customized by companies to a specific group of users in the biomedical field, thereby fulfilling the quality for that specific group. Monthly fees are required or the tool can be bought, and additional payments are needed to get support. The NCBI homepage has free services and costs nothing to use. Results obtained from the downloaded PATs are the same as on the NCBI homepage; however, NCBI is based on scientific results directly linked to scientific articles and can therefore be seen as more valid. Information gained from all PATs is meant to help researchers in their biomedical field (whether lipidomics, genomics or any other field) make new discoveries.

Figure 18. Small figure with +/- on downloaded PATs versus NCBI

7.4. Why not Ingenuity and why Uniprot with FEvER?

As the results show, the Ingenuity PAT had promising features and was very good at fulfilling almost all requirements. During discussion, however, the end user decided to discard it since it is a new tool and would take a lot of time to learn. The end user was more familiar with the PAT named Uniprot and the in-house developed PAT FEvER. Since Uniprot handles lipidomics while FEvER handles metabolomics, the two PATs have functions that can complement each other. For this reason the two tools were analyzed instead. As the results show, the two tools did not work well together. Either the conversion went well but no results were shown, or it simply gave a blank page, meaning no results were presented. In the end this combination was also discarded.

8. Future Value

This work started from the question of how to carry out a multidisciplinary thesis, and this thesis shows one way it can be done. The testing method used on the pathway analysis tools encountered no problems, and other users can use the same method with ease. The hard part is when no tools are found that meet the requirements. We also found that even with a small group of programmers, months are needed to develop a PAT, due to the vast amount of information that must be included in the program. Not all PATs can combine or take more than one input of omics data. The search for an alternative solution leads to the NCBI homepage, which has several collected PATs that are free to use. Given the pros and cons discussed for downloaded versus free PATs, there will probably be more free resources in the future. Homepages like NCBI grow in popularity, attracting many users. Freely available PATs are preferred since they are free to use and their quality is as good as that of the downloaded ones. Today we already see a glimpse of the beginning of this process. Companies will try to match the demands of users and start to run either longer free trials or even make their tool free to use, sponsored by advertisements.

9. References

1. Albertsson-Erlanson C, (1991), Medicinsk och fysiologisk kemi – en introduktion, Lund: Studentlitteratur
2. Anand S. S, Xia C, Paré G, Montpetit A, Rangarajan S, McQueen J. M, Cordell J. H, Keavney B, Yusuf S, Hudson J. T, Engert C. J, (2009), Genetic Variants Associated With Myocardial Infarction Risk Factors in Over 8000 Individuals From Five Ethnic Groups, Circulation: Cardiovascular Genetics Volume 23
3. Atkins P. W, Jones L. L, (2008), Chemical Principles: The quest for insight, New York: W. H Freeman & Company, 527-534.
4. Barreiro B. L, Laval G, Quach H, Patin E, Quintana-Murci L, (2008), Natural selection has driven population differentiation in modern humans, Nature Genetics Volume 40, 340-345.
5. Becker M. W, Bertoni P. G, Hardin J, Kleinsmith J. L, (2009), The World of the Cell, San Francisco: Pearson Benjamin Cummings Education Inc, 508-520; 526; 527-534; 346-347.
6. Biochemistry, (2012) Synthesis of membrane Lipids and Triglycerides > http://www.uky.edu/~dhild/biochem/20/lect20.html < 2012-06-14
7. Buhman K. K, Chen C. H, Farese Jr V. R, (2001), The Enzyme of Neutral Lipid Synthesis, The Journal of Biological Chemistry Volume 276, Number 44.
8. Cong T. T, Wlaschin A, Srienc F, (2009), Elementary mode analysis: a useful metabolic PAT for characterizing cellular metabolism, SpringerLink Volume 81 Number 5.
9. Curriculum Proposal, (2012) Information gathering by data mining > http://www.sigkdd.org/curriculum.php < 2012-07-11
10. Cyberlipid center, (2012) Fatty Acids > http://www.cyberlipid.org/fa/acid0001.htm < 2012-06-26
11. Devlin M. T, (2006), Text Book of Biochemistry: With Clinical Correlations, New Jersey: Wiley-Liss Inc, 24-29; 666; 711-713; 716-717.
12. Fahy E, Subramaniam S, Brown H. A, Glass K. C, Merrill Jr H. A, Murphy C. R, Raetz H. R. C, Russell W. D, Seyama Y, Shaw W, Shirmizu T, Spener F, Meer G, VanNieuwenhze S. M, White H. S, Witztum L. J, Dennis A. E, (2005), Lipidomics reveals a remarkable diversity of lipids in human plasma, Journal of Lipid Research Volume 46.
13. Fayyad U, Piatetsky-Shapiro G, Smyth P, (1996), From Data Mining to Knowledge Discovery in Databases, AI Magazine Volume 17 Number 3.
14. Firesmith D, (2003), Using Quality Models to Engineer Quality Requirements, Journal of Object Technology Volume 2 Number 5.
15. Ganter B, Giroux CN, (2008), Emerging applications of network and pathway analysis in drug discovery and development, PubMed central Volume 11 Issue 1.
16. Ganter B, Zidek N, Hewitt R. P, Müller D, Vladimirova A, (2008), PATs and toxicogenomics reference databases for risk assessment, Future Medicine Volume 9 Number 1.
17. Gutiérrez J. J, Escalona J. M, Mejías M, Torres J, (2012), Generation of test cases from functional requirements, Department of System Information at University of Seville with 4:th Workshop on System Information.
18. Han X, (2007), Neurolipidomics: Challenges and development, Frontiers of Bioscience Volume 12
19. Human Cell Biology – BIO3I5F, (2012) The cell membrane > http://www.erin.utoronto.ca/~w3bio315/lecture2.htm < 2012-04-29
20. Ignacimuthu S, (2008), Biotechnology: An Introduction, Oxford: Alpha science International Ltd, 1-10.
21. iSixSigma - Tools and Templates, (2012) Importance of Test Plans or Test Protocols > http://www.isixsigma.com/tools-templates/design-of-experiments-doe/importance-test-planstest-protocol-template/ < 2012-08-20


22. Jordan W. K, Nordenstam J, Lauwers Y. G, Rothenberger A. D, Alavi K, Garwood M, Cheng L. L, (2009), Metabolomic Characterization of Human Rectal Adenocarcinoma with Intact Tissue Magnetic Resonance Spectroscopy, Diseases of the Colon and Rectum Volume 52 Issue 3
23. Kannenberg A, Saiedian H, (2009), Why Software Requirements Traceability Remains a Challenge, CrossTalk: The Journal of Defense Software Engineering Volume 22 Number 5.
24. King Y. J, Ferrara R, Tabibiazar R, Spin M. J, Chen M. M, Kuchinsky A, Vailaya A, Kincaid R, Tsalenko A, Deng X-F. D, Connolly A, Zhang P, Yang E, Watt C, Yakhini Z, Ben-Dor A, Adler A, Bruhn L, Tsao P, Quertermous T, Ashley A. E, (2005), Pathway analysis of coronary atherosclerosis, Research Article Physiological Genomics Volume 23 Number 1
25. Klamt S, Stelling J, (2002), Two approaches for metabolic pathway analysis, Trends in biotechnology Volume 21 Issue 2.
26. LipidMaps Nature – Lipidomicsgateway, (2012) Lipid classification system > http://www.lipidmaps.org/data/classification/LM_classification_exp.php < 2012-06-14
27. Lundh D, (2011), Information Quality and Security, Skövde University
28. Luo L, (2001), Software Testing Techniques – Technology Maturation and Research Strategy, Carnegie Mellon University
29. Meer G, (2005), Cellular Lipidomics, The EMBO Journals members review Volume 24
30. Mogyorodi G.E, (2005), Requirements-Based Testing – Ambiguity Reviews, Software Testing Services Number 1.
31. Molecular biochemistry, (2012) Fatty acid synthesis > http://www.rpi.edu/dept/bcbp/molbiochem/MBWeb/mb2/part1/fasynthesis.htm < 2012-06-14
32. National Human Genome Research Institute: Genetic and Genomic Science, (2012) Genetic and genomic science > http://www.genome.gov/19016904 < 2012-04-20
33. Network Science – NetSci, (2012) Welcome to NetSci’s Lists of Software for Bioinformatics: PATs > http://www.netsci.org/Resources/Software/Bioinform/pathwayanalysis.html < 2012-02-27
34. Olson L. D, Kesharwani S, (2010), Enterprise Information Systems: Contemporary Trends and Issues, Singapore: World Scientific Publishing Co. Pte. Ltd, 7-23.
35. Phosphatidic acid, lysophosphatidic acid and related lipids: structure, occurrence, biochemistry and analysis, (2012) Phosphatidic acid – Occurrence and Biosynthesis > http://lipidlibrary.aocs.org/Lipids/pa/index.htm < 2012-07-11
36. Quality, (2012) Quality > http://www.qualitydigest.com/html/qualitydef.html < 2012-04-09

37. Quehenberger O, Armando M. A, Brown H. A, Milne B. S, Myers S. D, Merrill H. A, Bandyopadhyay S, Jones N. K, Kelly S, Shaner L. R, Sullards M. C, Wang E, Murphy C. R, Barkley M. R, Leiker J. T, Raetz H. R. C, Guan Z, Laird M. G, Six A. D, Russell W. D, McDonald G. J, Subramaniam S, Fahy E, Dennis A. E, (2010), Lipidomics reveals a remarkable diversity of lipids in human plasma, Journal of Lipid Research Volume 51.

38. Reeves A. C, Bednar A. D, (1994), Defining Quality: Alternatives and Implications, Academy of Management Review Volume 19 Number 3.
39. Rosenberg L.H, Hammer F. T, Huffman L. L, (1998), Requirements, Testing and Metrics, CiteSeer 15:th Annual Pacific Northwest Software Quality Conference.
40. Schilling H. C, Letscher D, Palsson Ø. B, (2000), Theory for the Systemic Definition of Metabolic Pathways and their use in Interpreting Metabolic Function from a Pathway-Oriented Perspective, Journal of Theoretical Biology Volume 203 Issue 3.
41. Schuster S, Dandekar T, Fell A.D, (1999), Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering, Trends in biotechnology Volume 17 Issue 2.
42. The AOCS Lipid Library – Triacylglycerols, (2012) Biosynthesis and metabolism > http://lipidlibrary.aocs.org/lipids/tag2/index.htm < 2012-06-14
43. The lipid chronicles, (2012) Lipidomics > http://www.samuelfurse.com/2011/12/lipidomics/ < 2012-03-19
44. The Lipid Library, (2012) Lipid synthesis > http://lipidlibrary.aocs.org/index.html < 2012-04-15
45. The Medical Biochemistry Page, (2012) Lipid synthesis > http://themedicalbiochemistrypage.org/lipid-synthesis.php < 2012-03-09
46. Vance E. D, Vance E.J, (2008), Biochemistry of Lipids, Lipoproteins and Membranes 5th edition, Amsterdam: Elsevier, 278-279; 583-588.
47. Virginia web education, (2012) Lipids > http://web.virginia.edu/Heidi/chapter8/chp8.htm < 2012-06-26
48. Voight et al, (2012), Plasma HDL cholesterol and risk of myocardial infarction: a mendelian randomization study, The Lancet Volume 380 Issue 9841
49. Waterfall Model, (2012), All about the waterfall model > http://www.waterfall-model.com/ < 2012-08-18
50. Sommerville Ian, (1996), Software process models, Journal of ACM Computing surveys Volume 28 Issue 1, p269-271
51. Watson D. A, (2006), Lipidomics: A global approach to lipid analysis in biological systems, Journal of Lipid Research Volume 47.
52. Watson D. A, (2006), Thematic review series: Systems Biology Approaches to Metabolic and Cardiovascular Disorders, Journal of Lipid Research Volume 47.
53. Wenk MR, (2005), The emerging field of lipidomics, Nature Reviews Drug Discovery Volume 46.
54. William W. C, Xianlin H, (2010), Lipid Analysis: Isolation, Separation, Identification and Lipidomic Analysis, Bridgwater: The Oily Press


Appendix 1 – Test Cases

Test Case

Test case: a document which describes INPUT, ACTION, EVENT and EXPECTED RESPONSE to determine if a feature of an application is working correctly or not. A set of inputs, execution preconditions, and expected outcomes developed for a particular objective, such as to exercise a particular program path or to verify compliance with a specific requirement. (Comp Software testing, (2010), Test Case formats > http://www.faqs.org/qa/qa-4044.html < 2010-12-18)
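The test cases below all share the same parts: a goal, an event with numbered steps and an expected response, each linked to a requirement ID. As a minimal sketch of that structure (not something used in the thesis), they could be recorded in code as follows; the class name `PatTestCase` and its `render` method are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PatTestCase:
    number: int
    title: str
    requirement_id: int
    goal: str
    event: List[str]          # the steps carried out, in order
    expected_response: str

    def render(self) -> str:
        """Return the test case as plain text in the layout used in this appendix."""
        lines = [
            f'Test Case {self.number} "{self.title}" (Requirement ID {self.requirement_id}):',
            "Goal:",
            self.goal,
            "Event:",
        ]
        lines += [f"{i}. {step}" for i, step in enumerate(self.event, start=1)]
        lines += ["Expected response:", self.expected_response]
        return "\n".join(lines)


if __name__ == "__main__":
    # Content taken from Test Case 3 in this appendix.
    case = PatTestCase(
        number=3,
        title="Genomic type of data input",
        requirement_id=1,
        goal="See if the PAT can process genomic data.",
        event=[
            'Input genomic data type (such as "APOE").',
            "Get genetically related result from the PAT.",
        ],
        expected_response="The PAT processes the genomic data type.",
    )
    print(case.render())
```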

Test Case 1 “Metabolomic type of data input” (Requirement ID 1):
Goal:
See if the PAT can process metabolomic data.
Event:
(Presumption made that the PAT is already running).
1. Input metabolomic data type (such as “NPPA”).
2. Get result related to the metabolism from the PAT.
Expected response:
The user can input metabolomic data type on the selected PAT and obtain information related to “NPPA”.

Test Case 2 “Lipidomic type of data input” (Requirement ID 1):
Goal:
See if the PAT can process lipidomic data.
Event:
(Presumption made that the PAT is already running).
1. Input lipidomic data type (such as “rs4420638”).
2. Get result related to lipids from the PAT.
Expected response:
The PAT processes the lipidomic data type and returns information related to “rs4420638”.

Test Case 3 “Genomic type of data input” (Requirement ID 1):
Goal:
See if the PAT can process genomic data.
Event:
(Presumption made that the PAT is already started).
1. Input genomic data type (such as “APOE”).
2. Get genetically related result from the PAT.
Expected response:
The PAT processes the genomic data type and returns information related to “APOE”.

Test Case 4 “Verifying the result” (Requirement ID 2):
Goal:
Get correct and valid results returned by the PAT.
Event:
1. Input omics (metabolomics, lipidomics and genomics) data.
2. Get result from the PAT.
3. Check the results obtained against valid results from books, articles or laboratory results.
4. Confirm that the received result is valid.
Expected response:
The result returned from the PAT is valid according to literature or laboratory results.

Test Case 5 “Time effectiveness” (Requirement ID 3):
Goal:
Information about how fast the PAT returns a result.
Event:
2. Start the timer.
4. Stop the timer.
5. Record time.
Expected response:
The result returned from the PAT took less than 5 seconds and is displayed.

Test Case 6 “Navigation” (Requirement ID 4):
Goal:
See the navigation capabilities between the data inserted and the result given by the selected PAT.
Event:
3. Try navigating between the start of the inputted data and the results obtained by scrolling in the result window.
4. See a path leading from start (input data) to end (result).
Expected response:
It was possible to navigate between the inserted data and the result obtained.

Test Case 7 “Visual presentation” (Requirement ID 5):
Goal:
See if the result can be visually presented/displayed by using the PAT.
Event:
3. Get a visual presentation where you can map results to each other.
Expected response:
There was a visual presentation when using the PAT.

Test Case 8 “Zooming” (Requirement ID 6):
Goal:
See if the PAT displays any zoom functions.
Event:
2. Get result from the PAT that is connected to pathways.
3. Search for a small magnifying glass with a plus or minus sign.
4. Search for a specific function with arrows for zooming.
5. Try zooming in to narrow the view.
6. Try zooming out to expand the view.
Expected response:
The PAT has zoom functions.

Test Case 9 “Specific data input type” (Requirement ID 7):
Goal:
See if the PAT can take a specific type of data as an input.
Event:
1. Input a specific type of data (such as accession number: AC_0088966, rs number: rs896530).
Expected response:
The PAT could process the specific type of data.

Test Case 10 “Mapping combined data types” (Requirement ID 8):
Goal:
See if the PAT can combine omics data and map them to related pathways.
Event:
1. Input more than one type of omics (metabolomics, lipidomics and genomics) data combined.
3. See if the result is combined with other data types, showing a mapped view with the other data types.
Expected response:
The PAT could combine and map the data for the user.


Appendix 2 – Lipid, MI SNP and Metabo SNP data sheet

Lipid and MI SNP                     | Metabo SNP
Gene Name  rs (number)  SNP function | Gene Name  rs (number)  SNP function
FADS123    174546       Lipid SNP    | ENAC                    Metabo SNP
UBE2L3     181362       Lipid SNP    | ENAC                    Metabo SNP
LILRA3     386000       Lipid SNP    | NPPA       5068         Metabo SNP
APOE       439401       Lipid SNP    | BDNF       6265         Metabo SNP
KLHL8      442177       Lipid SNP    | SLC35F1    89107        Metabo SNP
TTC39B     581080       Lipid SNP    | NPPA       198358       Metabo SNP
CITED2     605066       Lipid SNP    | BCL11A     243021       Metabo SNP
SORT1      629301       Lipid SNP    | MDS1       419076       Metabo SNP
MSL2L1     645040       Lipid SNP    |            579459       Metabo SNP
LOC55908   737337       Lipid SNP    | NPPB       632793       Metabo SNP
SCARB1     838880       Lipid SNP    | TMEM133    633185       Metabo SNP
APOA1      964184       Lipid SNP    | PNPLA3     738409       Metabo SNP
APOB       1042034      Lipid SNP    | PLCE1      932764       Metabo SNP
LDLR       1122608      Lipid SNP    | TFAP2B     987237       Metabo SNP
GCKR       1260326      Lipid SNP    | CHRNA3     1051730      Metabo SNP
NAT2       1495741      Lipid SNP    | SGK1       1057293      Metabo SNP
LIPC       1532085      Lipid SNP    | C5orf174   1173771      Metabo SNP
LPA        1564348      Lipid SNP    | Glasgow    1230297      Metabo SNP
ZNF648     1689800      Lipid SNP    | JAG1       1327235      Metabo SNP
HNF4A      1800961      Lipid SNP    | LOC10018   1329650      Metabo SNP
ABCA1      1883025      Lipid SNP    | CYP1A1     1378942      Metabo SNP
CYP26A1    2068888      Lipid SNP    |                         Metabo SNP
ANGPTL3    2131925      Lipid SNP    | Glasgow2   1703492      Metabo SNP
TRPS1      2293889      Lipid SNP    | STC1       1731274      Metabo SNP
CAPN3      2412710      Lipid SNP    | SGK1       1743966      Metabo SNP
PCSK9      2479409      Lipid SNP    | HFE        1799945      Metabo SNP
LACTB      2652834      Lipid SNP    | MDS1       1918974      Metabo SNP
C6orf106   2814944      Lipid SNP    | ENAC       2228576      Metabo SNP
AMPD3      2923084      Lipid SNP    | ESR1       2234693      Metabo SNP
CMIP       2925979      Lipid SNP    | HNF1a      2259816      Metabo SNP
FRMD5      2929282      Lipid SNP    | NEDD4L     2288774      Metabo SNP
TRIB1      2954029      Lipid SNP    | HCCA2      2334499      Metabo SNP
IRS1       2972146      Lipid SNP    | FES        2521501      Metabo SNP
LRP4       3136441      Lipid SNP    | LYPLAL1    2605100      Metabo SNP
MYLIP      3757354      Lipid SNP    | MOV10      2932538      Metabo SNP
CETP       3764261      Lipid SNP    |            2987983      Metabo SNP
PGS1       4129767      Lipid SNP    | DBH        3025343      Metabo SNP
ABCA8      4148008      Lipid SNP    | SLC22A2    3127573      Metabo SNP
ABCG58     4299376      Lipid SNP    | EGLN2      3733829      Metabo SNP
APOE       4420638      Lipid SNP    |            3741913      Metabo SNP
PABPC4     4660293      Lipid SNP    | ULK4       3774372      Metabo SNP
KLF14      4731702      Lipid SNP    | CACNB2     4373814      Metabo SNP
ZNF664     4765127      Lipid SNP    | ZBED3      4457053      Metabo SNP
1LNT2      4846914      Lipid SNP    | GCK        4607517      Metabo SNP
TOP1       6029526      Lipid SNP    |                         Metabo SNP
PLTP       6065906      Lipid SNP    | JAG1       6040055      Metabo SNP
LDLR       6511720      Lipid SNP    |            6983267      Metabo SNP
PDE3A      7134375      Lipid SNP    | GLIS3      7034200      Metabo SNP
MVK        7134594      Lipid SNP    | ADM        7129220      Metabo SNP
LIPG       7241918      Lipid SNP    | KIAA1486   7578326      Metabo SNP

ANGPTL4 7255436 Lipid

SNP

RBMS1ITG 7593730 Metabo

SNP

NYNRIN 8017377 Lipid

SNP

MSRA 7826222 Metabo

SNP

ABO 9411489 Lipid

SNP

7931342 Metabo

SNP

MAP3K1 9686661 Lipid

SNP

MEIS2 8031633 Metabo

SNP

COBLL1 10195252 Lipid

SNP

TBX2 8068318 Metabo

SNP

PLEC1 11136341 Lipid

SNP

IL6 10242595 Metabo

SNP

LRP1 11613352 Lipid

SNP

10993994 Metabo

SNP

PINX1 11776767 Lipid

SNP


SNP

STARD3 11869286 Lipid

SNP

NPPA 11191548 Metabo

SNP

COBLL1 12328675 Lipid

SNP

EBF1 11953630 Metabo

SNP

LPL 12678919 Lipid

SNP

ZNF652 12940887 Metabo

SNP

MC4R 12967135 Lipid

SNP

PLCD3 12946454 Metabo

SNP

SLC39A8 13107325 Lipid

SNP

CST3CST9 13038305 Metabo

SNP

TYW1B 13238203 Lipid

SNP


SNP

LCAT 16942887 Lipid

SNP

GUCY1A3 13139571 Metabo

SNP

MLXIPL 17145738 Lipid

SNP

ZNF652 16948048 Metabo

SNP

APOB 1367117 Lipid

SNP

FGF5 16998073 Metabo

SNP

HLA 2247056 Lipid ATP2B1 17249754 Metabo

P a g e | 4 9

SNP SNP

PLA2G6 5756931 Lipid

SNP

VPS13C 17271305 Metabo

SNP

OSBPL7 7206971 Lipid

SNP

SHROOM3 17319721 Metabo

SNP

LRPPP1R3 9987289 Lipid

SNP

MTHFR 17367504 Metabo

SNP

J MJ D1C 10761731 Lipid

SNP

GOSR2 17608766 Metabo

SNP

SMG6 216172 MI SNP STK39 35929607 Metabo

SNP

SORT1 629301 MI SNP

SORT1 646776 MI SNP

APOA1 964184 MI SNP

LDLR 1122608 MI SNP

LPA 1564348 MI SNP

CXCL12 1746048 MI SNP

KIAA1822 2895811 MI SNP

LPA 3798220 MI SNP

ADAMTS7 3825807 MI SNP

LDLR 6511720 MI SNP

PHACTR 9349379 MI SNP

ABO 9411489 MI SNP

MRAS 9818870 MI SNP

KCNE2 9982601 MI SNP

PCSK9 11206510 MI SNP

ZC3HC1 11556924 MI SNP

TCF21 12190287 MI SNP

CNNM2 12413409 MI SNP

PPAP2B 17114036 MI SNP

MIA3 17465637 MI SNP

ANKS1A 17609940 MI SNP

SH2B3 3184504 MI SNP

CDKN2A 4977574 MI SNP

WDR12 6725887 MI SNP

RALI 12936587 MI SNP
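Several rs numbers in the data sheet above appear both as a Lipid SNP and as an MI SNP. The following Python sketch shows one way such overlaps could be found before the data is submitted to a PAT. Only a subset of the rows is copied in, the helper names are hypothetical, and prefixing the bare numbers with "rs" is an assumption based on the "rs (number)" column heading and the rs number format used in Test Case 9.

```python
# (gene name, rs number) pairs copied from the data sheet; only a subset is shown.
lipid_snps = [
    ("SORT1", 629301),
    ("APOA1", 964184),
    ("LDLR", 1122608),
    ("LPA", 1564348),
    ("PCSK9", 2479409),
    ("LDLR", 6511720),
    ("ABO", 9411489),
]
mi_snps = [
    ("SORT1", 629301),
    ("SORT1", 646776),
    ("APOA1", 964184),
    ("LDLR", 1122608),
    ("LPA", 1564348),
    ("LPA", 3798220),
    ("LDLR", 6511720),
    ("ABO", 9411489),
    ("PCSK9", 11206510),
]


def shared_snps(first, second):
    """Return the (gene, rs number) pairs present in both lists, sorted by rs number."""
    return sorted(set(first) & set(second), key=lambda row: row[1])


def to_rs_id(rs_number):
    """Format a bare number as an rs identifier (assumed input format for a PAT)."""
    return f"rs{rs_number}"


for gene, rs_number in shared_snps(lipid_snps, mi_snps):
    print(gene, to_rs_id(rs_number))
```

For the subset shown, the shared entries are SORT1 (629301), APOA1 (964184), LDLR (1122608 and 6511720), LPA (1564348) and ABO (9411489).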


Appendix 3 – Requirements Matrixes


Appendix 4 – Response Times
