Department of Computing Science,
and the Division of Infection and Immunity,
Institute of Biomedical and Life Sciences
The Development of Data Standards and a
Database to Aid Proteomic Research
Andrew Jones
Submitted for the degree of Doctor of
Philosophy in Computing Science at
the University of Glasgow
October 2004
© 2004, Andrew Jones
Abstract
The thesis reports new developments in the area of database support for proteomics exper-
iments. We have developed a proposal for a data standard that will facilitate sharing and
archival of data. We have also developed a database implementing the standard, which is a
prototype of a public repository capable of storing large volumes of data. Our technology
allows for the integration of results from both microarrays and proteomics. The database
has been evaluated in the context of two investigations performed by collaborating biolo-
gists. We have demonstrated that our technology enables the discovery of new results by
facilitating complex queries and providing novel visualisations of experimental data.
Thesis statement
This work will highlight the requirements of proteomic research for standard formats and
centralised databases that allow results to be well annotated and queried. We have developed
a proposal for a data standard, and a prototype of a public repository, and the thesis will
demonstrate how they facilitate the research process.
Declaration
I declare that this thesis describes my own work, that it has not been accepted in a pre-
vious application for a degree, and that all sources of information have been specifically
acknowledged. The work reported in Chapters 4 (FGE-OM) and 5 (RAPAD) was initiated
during a two-week period I spent at the Computational Biology and Informatics Laboratory,
University of Pennsylvania working with Prof. Chris Stoeckert and Angel Pizarro. During
the two weeks, the framework for FGE-OM was developed and the SQL database schema
for RAPAD was designed. The subsequent development of RAPAD, including refinements
to the schema, the creation of the web interface and software for data visualisation, was
performed by myself at the University of Glasgow.
Chapter 3 contains a revised version of material published in [176]. The material in Chapter
4 has been revised from [175].
Andrew Jones
Thesis Overview
There is a new research paradigm in molecular biology in which large data sets are obtained
about genes and proteins, and the results enable researchers to formulate new hypotheses
about the system they are studying. This methodology reverses the classical approach, in
which an experiment is designed to test a hypothesis. The field of research is collectively
known as functional genomics, as researchers attempt to assign functions to all genes that
can be discovered in the genome sequence. The experiments can also give insights into the
factors that are crucial in particular processes, such as disease, by discovering the differences
between results from a diseased sample and a normal sample. The methods that investigate
protein abundance, interactions and localisation on a large scale are known as proteomics.
Proteomic investigations present significant computational challenges because data sets are
very large and contain heterogeneous information from different laboratories, which could
be useful to researchers working in a variety of domains. The thesis will describe proposals
for data standards for proteomics, and a new relational database, which will alleviate some
of the computational challenges presented by the experiments. The proposals for a standard
should ensure that proteome data can be archived and will be accessible to querying in the
future.
Chapter 1 will describe the experimental techniques of functional genomics, three case
studies of proteomic research and the requirements for central databases and standardisation.
There has been significant work in both bioinformatics and computing science research to
improve methods for making data accessible and open to a wide range of queries, which will
lead to the next generation of the Web. Chapter 2 will focus on the new developments in
computing science, and will cover previous work on data standards for life sciences that allow
information to be exchanged between research groups and deposited in central databases.
There are a large number of databases for functional genomics that have different capabilities
and access methods. The chapter will present the challenges in data integration that arise
from the number of different systems that exist. An area that has attracted much recent
attention in computing science is ontology development. Ontologies are structured controlled
vocabularies of terms with definitions that describe a domain in a way that ensures there is
a shared understanding of the concepts by different people. An ontology can also contain
rules associated with the terms that allow computer systems to ask logical questions of the
relationships between different parts of a data set. Chapter 2 will describe the ontologies
that currently exist for life sciences.
Chapter 3 will focus on the standardisation of data formats for proteomics. There will
be a description of the previous work in this area, which consists of an object model 1 to
describe the experimental methodology. We have developed an alternative proposal for a
data standard, which was released in October 2003 to describe additional information that
should be captured in a community standard. It is essential that the finalised standard
contains sufficient description of the results, and the methods that were used to obtain data,
to ensure that future re-evaluation and statistical analysis are possible. The chapter will
describe our proposal and will give an overview of the current progress towards a
community-accepted standard for proteomics.
There is an established data standard for gene expression studies using microarrays. It
is becoming feasible for researchers to perform both proteomic and microarray investiga-
tions on the same starting samples. In other cases, the results from different investigations
using microarray or proteomic techniques could be integrated, leading to a much better
understanding of the genes and proteins that are important in the sample conditions. We
believe that microarray and proteomic data sets could be integrated more easily, and queried
in parallel, if they have a single shared data standard. Therefore, we have integrated the
microarray standard with the current models of proteomic data to form a single proposal for
a data standard, known as FGE-OM (Functional Genomics Experiment - Object Model),
which will be described in Chapter 4. Chapter 4 will also contain a discussion of the impor-
tance of using ontologies to describe the experimental protocols, to allow future comparison
and querying of different data sets.
We have developed a database for storage of proteomic results, experimental protocols
and details of the biological samples on which the experiments were performed, known as
RAPAD (RNA And Protein Abundance Database), which will be described in Chapter
5. RAPAD is an extension of a microarray database system developed at the University
of Pennsylvania. We have extended a microarray database into proteomics because we
hypothesise that data integration across the two fields will be facilitated if the technologies
are captured in a shared database schema and they have a similar user interface. There is
a very close correspondence between FGE-OM and RAPAD, described in Chapter 5, which
allows RAPAD to be used to test that FGE-OM correctly captures the data semantics.
RAPAD also acts as a prototype of a public repository, and demonstrates that proteome
data can be visualised and queried in complex ways using real data sets. Two investigations
are supported by the current implementation of RAPAD, which will be described in Chapters
6 and 7. The investigations allow the core facilities of the database to be evaluated.
1 An object model is a platform-independent notation for describing a software system. The importance of object models for developing data standards will be described in Chapter 2.
Chapter 6 will describe how the database assists an investigation performed in the labo-
ratory of Dr Jonathan Wastling at the Institute of Biomedical and Life Sciences, University
of Glasgow. The investigation aims to discover the proteins that are differentially expressed
in a human cell culture when invaded by the intracellular parasite Toxoplasma gondii, compared
with non-invaded cells. The results will enable a better understanding of host-parasite
interactions. The chapter will demonstrate how gene expression and protein abundance
values have been compared in practice.
Chapter 7 will describe another project at the Institute of Biomedical and Life
Sciences that is supported by RAPAD. The project is attempting to catalogue
all the expressed proteins in the disease-causing parasite Trypanosoma brucei, using
a variety of experimental techniques. The genome sequence is nearing completion but the
level of functional annotation is poor. The proteome catalogue facilitates the genome an-
notation, and the experiments give insights into the dynamic nature of proteins within the
system. Chapter 7 will describe visualisation software written by the author that allows new
conclusions to be drawn from the results.
Chapter 8 will summarise and extend our arguments on standardisation, ontologies and
archiving of data in public repositories. There will be a comparison of our approach with
alternative methods that could have been employed. There will be a description of the work
that is still required to solve the research challenges that follow directly from the thesis, and
a summary of our contribution.
There are four appendices at the end of the thesis. The first, Appendix A, will describe
an investigation performed by the author into indexing large collections of biological data
represented in Extensible Markup Language (XML), as an alternative to relational database
storage. Appendix B contains detailed diagrams of FGE-OM, which supplement the work
presented in Chapter 4. The RAPAD database schema is included in Appendix C. Finally,
Appendix D will describe how difference gel electrophoresis data can be represented in Gla-
PSI, FGE-OM and RAPAD.
Acknowledgements
I give thanks to my supervisors Ela Hunt and Jonathan Wastling. Throughout my PhD, Ela
has given me great support, spending inordinate lengths of time discussing ideas, reading
my work, and giving me encouragement to persevere with my ideas. At the outset of my
research, Jonathan’s enthusiasm was infectious, which gave me great interest in the subject.
I am very grateful to the MRC for funding my research through first an MRes degree, and
then the PhD.
I would like to thank Chris Stoeckert for giving me the opportunity to visit his lab in
Philadelphia, and thanks to Angel Pizarro for giving up so much of his time while I was
there. The time spent in Philadelphia provided a big impetus for my work, for which I am
very grateful. My thanks also to Mike Turner for giving valuable feedback on my work. I give
thanks to Morag Nelson and Anne Faldas, who generated the data I have used in Chapters
6 and 7, for taking time to explain their experiments, for trying out all my software and for
appearing interested when I talk about databases!
Finally, my biggest thanks to my partner Clare, for all her love and support.
Contents
1 Investigations in Functional Genomics
  1.1 Introduction
    1.1.1 Experimental methodology
    1.1.2 Systems biology
    1.1.3 Overview
  1.2 Proteomics
    1.2.1 Gel based proteomics
    1.2.2 Mass spectrometry
    1.2.3 Other proteomics techniques
    1.2.4 Post-translational modifications
    1.2.5 Case studies of proteomics research
    1.2.6 Case study 1
    1.2.7 Case study 2
    1.2.8 Case study 3
    1.2.9 Publication of proteomics data
  1.3 Gene expression techniques
    1.3.1 The development of microarrays
    1.3.2 Serial analysis of gene expression
  1.4 Other techniques used in functional genomics
    1.4.1 RNA interference
    1.4.2 Immunohistochemistry
    1.4.3 Metabolomics
    1.4.4 Protein interaction studies
    1.4.5 Three dimensional structures
  1.5 Investigations across the “omics”
    1.5.1 Comparative studies
  1.6 Summary
2 Databases, standards and ontologies for the life sciences
  2.1 Introduction
    2.1.1 Computational support for the life sciences
    2.1.2 The future accessibility of data
    2.1.3 Guide to the chapter
  2.2 Technology required for data standards
    2.2.1 Extensible Markup Language: XML
    2.2.2 Resource Description Framework
    2.2.3 DAML+OIL and the Web Ontology Language
    2.2.4 Unified Modeling Language
    2.2.5 The object management group
  2.3 Data standards in the life sciences
    2.3.1 Microarray standards
    2.3.2 PEDRo
    2.3.3 PSI-OM
    2.3.4 Mass spectrometry
    2.3.5 Protein interaction standards
    2.3.6 Other data standards in life sciences
  2.4 Databases for life sciences
    2.4.1 Microarray databases
    2.4.2 Proteomics databases
    2.4.3 Other Databases for Life Sciences
  2.5 Ontologies
    2.5.1 Software for developing ontologies
    2.5.2 Gene Ontology
    2.5.3 MGED Ontology
    2.5.4 Other ontologies in life sciences
    2.5.5 The Grid and data integration
    2.5.6 Data standards and ontologies in other fields
  2.6 Data integration
    2.6.1 Federation
    2.6.2 Warehouses
    2.6.3 Mediator approaches
    2.6.4 Schema integration
  2.7 Discussion
  2.8 Conclusions
3 An object model for proteomics
  3.1 Introduction
    3.1.1 The emergence of proteomics
    3.1.2 Publication of data
    3.1.3 A central repository for proteomics
    3.1.4 The status of proteomics standards
  3.2 Methods
  3.3 Previous work
    3.3.1 SWISS-2DPAGE
    3.3.2 GELBANK and HUP-ML
    3.3.3 PEDRo
  3.4 Gla-PSI: A model for 2-D gel electrophoresis and analysis
    3.4.1 Overview of the experiment and protein extraction
    3.4.2 Two-dimensional gel electrophoresis
    3.4.3 Image analysis
    3.4.4 Protein spots
    3.4.5 Two-dimensional difference gel electrophoresis
    3.4.6 Statistical analysis
    3.4.7 Annotation
  3.5 Future developments in proteomics standards
    3.5.1 An overview of PSI-OM
    3.5.2 Data model in PSI-OM
    3.5.3 An ontology for proteomics
    3.5.4 Minimum information about a proteomics experiment
  3.6 Discussion
    3.6.1 Web access to date
    3.6.2 Status of proteome standards
  3.7 Conclusions
4 Development of a data standard for functional genomics
  4.1 Introduction
    4.1.1 Requirements for standards
    4.1.2 Status of standardisation
  4.2 Methods
    4.2.1 Ontologies
  4.3 Overview of FGE-OM
    4.3.1 BioOM
    4.3.2 ArrayOM
    4.3.3 ProteomicsOM
    4.3.4 A workflow for proteomics
  4.4 Other work: CEBS object model for systems biology data
    4.4.1 SysBio-OM data model
    4.4.2 SysBio-OM Protocol and BioMaterial packages
    4.4.3 SysBio-OM BioAssay and SummaryData packages
  4.5 Discussion
    4.5.1 FGE-OM, SysBio-OM and future standards
    4.5.2 Developments to MAGE-OM
    4.5.3 Integrated standards
  4.6 Conclusion
5 A prototype public database for proteomics
  5.1 Introduction
    5.1.1 Extending existing technology
    5.1.2 The development of RAPAD
    5.1.3 Chapter guide
  5.2 Previous work
    5.2.1 GUS
    5.2.2 Proteomics database
    5.2.3 Ontologies
  5.3 Systems and Methods
    5.3.1 Schema development
    5.3.2 Interface development
    5.3.3 Data integration
    5.3.4 Visualisation
    5.3.5 Unique identifiers
  5.4 Implementation
    5.4.1 Data privacy
    5.4.2 Studies, protocols and contact details
    5.4.3 Protein separations
    5.4.4 2-D gel data
    5.4.5 Mass spectrometry and external databases
    5.4.6 RAPAD Querier
    5.4.7 Public data access
    5.4.8 Ontologies
  5.5 Discussion
    5.5.1 A prototype of a central repository
    5.5.2 The relationship between FGE-OM and RAPAD
    5.5.3 Support for current proteome studies
    5.5.4 Future developments
  5.6 Conclusions
6 Database support for proteomic studies of host-parasite interactions
  6.1 Introduction
    6.1.1 Host-parasite interactions
    6.1.2 Genomic investigation of Toxoplasma
    6.1.3 Microarray analysis
    6.1.4 Support for proteome studies
    6.1.5 Project status
  6.2 Methods
    6.2.1 Display of protein data from different gels
    6.2.2 Comparison of protein and gene expression data
    6.2.3 Functional classification of proteins
  6.3 Results
    6.3.1 Visualisation of differential expression
    6.3.2 Functional annotation of proteins
    6.3.3 Comparison with microarray data
    6.3.4 Post-translational modifications
    6.3.5 Public access to data
  6.4 Discussion
  6.5 Summary and conclusions
7 Software support for a proteome map of Trypanosoma brucei
  7.1 Introduction
    7.1.1 The biology of trypanosomes
    7.1.2 Annotating the genome
    7.1.3 Database support
    7.1.4 Project status
  7.2 Methods
    7.2.1 Generation of samples for proteome analysis
    7.2.2 Project requirements capture
    7.2.3 Visualisation
  7.3 Results
    7.3.1 Investigation into multiple protein forms
    7.3.2 Using data in RAPAD to improve genome annotation
    7.3.3 Search for post-translational modifications
    7.3.4 Results Summary
  7.4 Discussion
    7.4.1 Improving the annotation of genes
    7.4.2 Visualisation issues in the life sciences
    7.4.3 Analysis of modifications
    7.4.4 Future work
  7.5 Conclusions
8 Future work, discussion and conclusions
  8.1 Summary of thesis
  8.2 Discussion
    8.2.1 Alternative approaches
    8.2.2 Digital archiving and publication of life science data
    8.2.3 The role of data standards
    8.2.4 A functional genomics standard
    8.2.5 Proteomics standards
    8.2.6 A vision for future data sharing
  8.3 Future work
  8.4 Summary
A An XML indexing solution for data integration
  A.1 Introduction
  A.2 Previous work
  A.3 Results
    A.3.1 Index A
    A.3.2 Index B
    A.3.3 Index creation
    A.3.4 Queries
    A.3.5 Index A Results
    A.3.6 Index B Results
    A.3.7 Visualisation
  A.4 Discussion
B Detailed diagrams of FGE-OM
C Database schema for RAPAD
D Modelling and database storage of difference gel data
  D.1 Introduction
    D.1.1 Host-parasite responses
  D.2 Gla-PSI
  D.3 FGE-OM
  D.4 RAPAD
List of Figures
1.1 A conceptual view of the data flow in functional genomics. . . . . . . . . . . . 31.2 The data flow in a proteomics experiment. . . . . . . . . . . . . . . . . . . . . 61.3 A sample image from 2-DE separation of proteins from Toxoplasma gondii . . 71.4 A schematic of a difference gel electrophoresis experiment. . . . . . . . . . . . 111.5 An MS trace viewed with Voyager software [339]. . . . . . . . . . . . . . . . . 131.6 A sample trace from tandem mass spectrometry . . . . . . . . . . . . . . . . 151.7 Two dimensional liquid chromatography coupled with MS . . . . . . . . . . . 181.8 The ICAT method for quantitative proteomics . . . . . . . . . . . . . . . . . 201.9 A two dimensional gel highlights possible different phosphorylation states of
Protein disulfide isomerase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231.10 A summary of the technique involved in the creation of Affymetrix microarrays 301.11 A summary of Yeast Two-Hybrid experiments . . . . . . . . . . . . . . . . . . 341.12 Affinity methods for assaying protein interactions . . . . . . . . . . . . . . . . 35
2.1 A partial record from the PIR database, in the native PIR format. . . . . . . 452.2 A partial record from the PIR database, released in XML format. . . . . . . . 452.3 An example partial PIR record stored in a relational database . . . . . . . . . 462.4 The main components of a UML class diagram for a hospital computer system. 492.5 The top level of MAGE-OM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512.6 The BioMaterial package in MAGE-OM . . . . . . . . . . . . . . . . . . . . . 522.7 A screenshot of the Protege editor displaying the Gene Ontology for Yeast. . 662.8 The entry for actin in the Gene Ontology, displayed in the AmiGo browser . 67
3.1 The data flow in a proteomics experiment. The parts of the analysis covered by Gla-PSI are boxed.
3.2 The complete PEDRo model represented in UML
3.3 The classes that record biological samples in PEDRo
3.4 The part of PEDRo covering protein separation techniques
3.5 The model of MS ionisation and protocol in PEDRo
3.6 MS data and database searches modelled in PEDRo
3.7 The complete Gla-PSI object model represented as a UML class diagram.
3.8 A model of 2-DE data, and a scanned gel image.
3.9 The classes capture data from image analysis applications, including multiple analysis across a number of gels.
3.10 The relationship between spot data (Spot) and identified proteins (Protein)
3.11 Classes for storing difference gel electrophoresis data.
3.12 The part of Gla-PSI modelling statistical analysis of a proteomics experiment.
3.13 Several classes are subclasses of Identifiable
3.14 A draft version of the main components of PSI-OM.
3.15 Part of PSI-OM showing the relationships between spots identified on a gel and the corresponding protein records.
3.16 A draft version of the protein data model in PSI-OM
4.1 A time line displaying the emergence of microarray and proteomics technology, and the efforts to standardise data formats.
4.2 An overview of the FGE-OM object model. The model is divided into three namespaces: BioOM, ArrayOM and ProteomicsOM.
4.3 A screenshot of the term “Age” in the MGED Ontology viewed with OilEd.
4.4 A complete listing of the packages within FGE-OM.
4.5 The packages and classes in the BioOM namespace of FGE-OM
4.6 The packages in the ArrayOM namespace
4.7 The ProteomicsOM namespace.
4.8 The ProteinSeparation package
4.9 The ProteomeBioAssay package
4.10 The ProteinData package.
4.11 The model of MS data and protocols, adapted from PEDRo.
4.12 The ProteinRecord package.
4.13 A workflow for a proteomics experiment involving 2-DE or liquid chromatography to separate proteins, followed by MS to identify proteins
4.14 A subset of classes in the QuantitationType package from SysBio-OM
4.15 The CommonBioAssayData package from SysBio-OM
4.16 The top image shows a small subset of classes from the Measurement package in SysBio-OM, the lower is the Measurement package in FGE-OM.
4.17 The Protocol package from SysBio-OM
4.18 The BioMaterial package from SysBio-OM.
4.19 The BioAssay package from SysBio-OM.
5.1 A summary of several workflows in functional genomics to illustrate the requirements for data integration.
5.2 A mapping from classes in FGE-OM to database tables in RAPAD.
5.3 The architecture of RAPAD.
5.4 The user interaction with RAPAD for entering a 2-DE experiment.
5.5 The interface for entering protocol information into RAPAD.
5.6 A web page for specifying sources of biological materials
5.7 A summary of the database schema for storing information about the design of a study
5.8 The database schema for protein separation techniques and the relationships to the BioAssayTreatment table.
5.9 Screenshots for loading 2-DE, scanning and image analysis data into RAPAD
5.10 The tables present in the database schema store data from gel spots, image analysis and the scanning of a 2-D gel
5.11 The database schema for linking protein records to gel spots
5.12 The database schema for mass spectrometry, adapted from PEDRo.
5.13 A screen shot of the 2-D Gel Viewer that provides search capabilities over protein data and links to MS results
5.14 A form for entering annotation about a gel spot and linking to protein records
5.15 A table displaying all the proteins identified on a single gel.
5.16 The query interface for searching for specific protein records.
6.1 The process of matching microarray data to protein abundance data.
6.2 Output from GoMiner, displaying the GO tree browser open for the gene Tropomyosin 1
6.3 Output from FatiGO showing the classification of up and down-regulated proteins in the Biological Process branch of GO
6.4 The interface for visualising spots across replicate gels
6.5 The interface for displaying data combined across replicates
6.6 The protein record for Cathepsin B in RAPAD has external links to various databases.
6.7 The table in RAPAD displaying protein abundance and gene expression values
6.8 Spots matched to vimentin from infected and non-infected samples
6.9 Spots matched to actin beta from infected and non-infected samples
6.10 Superoxide dismutase from infected and non-infected samples
6.11 Potential PTMs of protein disulphide isomerase
6.12 The result of a search for potential post-translational modification of protein disulphide isomerase.
6.13 A summary page displays all the gels present in the experiment, and a link exists to display the experimental protocols used for each gel.
7.1 The life cycle of Trypanosoma brucei
7.2 An electron micrograph of the bloodstream form of Trypanosoma brucei
7.3 The span of peptides that have been matched within a protein sequence
7.4 Protein spots matched to β-tubulin, overlaid with a graphic displaying the span of peptide hits
7.5 Protein spots matched to α-tubulin, overlaid with a graphic displaying the span of peptide hits
7.6 Protein spots matched to five different Elongation Factors
7.7 Protein spots matched to Elongation factor 1-α
7.8 Protein spots matched to EF-β and EF (putative) are displayed with the corresponding span of peptide hits
7.9 The span of peptide hits for protein spots matched to Elongation Factor 2
7.10 A multiple alignment of five Hsp 70 protein sequences from T. brucei
7.11 Protein spots matched to five different Hsp70 protein sequences
7.12 The interface for publishing T. brucei proteome data
7.13 A search using the Gel Viewer reveals 100 proteins, annotated as “hypothetical”
7.14 The protein spots that have been matched to different hypothetical proteins
7.15 Four spots containing arginine kinase. The MS results for spots 575 and 535 reveal possible modifications
7.16 There are four spots that match initiation factor 5, of which possible modifications were found for spots 554 and 575
8.1 A possible model for future data sharing and exchange
A.1 Index A has four components: the Data Path Tree, Data Stores, XML Locater Lists and an XML Dictionary.
A.2 Index B has four components: the Data Path Tree, Data Stores, the Structure Container and the XML Dictionary (not shown).
A.3 The method used to implement a join query in Index B is implemented in a six stage algorithm.
A.4 A prototype interface for querying an indexed store of XML data.
B.1 The ProteinSeparation package of FGE-OM.
B.2 The ProteomeBioAssay package.
B.3 The ProteinData package.
B.4 The ProteinRecord package.
B.5 The MassSpecProtocol package.
B.6 The MassSpecData package.
D.1 The part of Gla-PSI covering DIGE experiments
D.2 A DIGE experiment represented in Gla-PSI
D.3 A DIGE study represented in FGE-OM
D.4 Relative protein abundance data calculated from DIGE can be viewed in the Gel Viewer
List of Tables
1.1 Software available for image analysis of 2-D gels.
1.2 Software available for searching mass spectrometry data.
2.1 Summary table displaying features of microarray databases
3.1 A summary of the interviews held with researchers to formulate an understanding of proteomics research.
6.1 The correspondence between gene and protein abundance for HFF cells infected with T. gondii
A.1 Build times in seconds for Index A and B for four different sizes of data set
A.2 Summary of query timings for Index A, values are time in seconds
A.3 Summary of query timings for Index B, with different caching procedures.
D.1 Experimental plan for Cy labelling of proteins in the DIGE experiment
Commonly used abbreviations
2-DE - Two dimensional gel electrophoresis
API - Application Programming Interface
cDNA - complementary DNA
EST - Expressed Sequence Tag
FG - Functional Genomics
FGE-OM - Functional Genomics Experiment Object Model
Gla-PSI - Glasgow proposal for the Proteomics Standards Initiative
GO - Gene Ontology
HUPO - Human Proteome Organisation
IPG - Immobilized pH Gradient
LC-MS - Liquid Chromatography-Mass Spectrometry
LIMS - Laboratory Information Management System
MAGE-ML - Microarray and Gene Expression Markup Language
MAGE-OM - Microarray and Gene Expression Object Model
MALDI - Matrix-Assisted Laser Desorption Ionisation
MGED Society - Microarray Gene Expression Data Society
MIAME - Minimum Information About a Microarray Experiment
MIAPE - Minimum Information About a Proteomics Experiment
MO - MGED Ontology
mRNA - messenger RNA
MS - Mass Spectrometry
MW - Molecular weight
NMR - Nuclear Magnetic Resonance
PEDRo - Proteomics Experiment Data Repository
pI - Isoelectric point
PSI - Proteomics Standards Initiative
PSI-OM - Proteomics Standards Initiative Object Model
PSI-Ont - Proteomics Standards Initiative Ontology
PTM - Post-Translational Modification
RAD - RNA Abundance Database
RAPAD - RNA And Protein Abundance Database
RDF - Resource Description Framework
RDMS - Relational Database Management Systems
RNAi - RNA interference
SAGE - Serial Analysis of Gene Expression
TOF - Time Of Flight
UML - Unified Modeling Language
URI - Uniform Resource Identifier
URL - Uniform Resource Locator
W3C - The World Wide Web Consortium
XMI - XML Metadata Interchange
XML - Extensible Markup Language
Chapter 1
Investigations in Functional Genomics
1.1 Introduction
In recent years, the sequencing of the human genome has gained much deserved publicity
[164, 334]. The sequence of man, and all the model organisms, has generated a vast amount
of information about the basis of life at the molecular level. This was only possible due
to progress in the way in which DNA sequencing is performed [276, 305], and the work of
bioinformaticians to produce software that can assemble the huge genome sequences, find
genes and determine similarity between genes in different organisms. We can state to a
reasonable level of accuracy how many genes there are in man (23758 genes are currently
predicted in Ensembl [94]), mouse (26762 in Ensembl), and yeast (approximately 6,000 [190])
and new genomes can be sequenced on relatively short time scales. However, the genome
sequence is only a starting point: the actual DNA sequences comprising the genes tell us
nothing about how living systems function, or what happens when they go wrong, causing
disease. This knowledge requires information about the molecular function performed by
the proteins encoded by every gene, the interaction partners for the proteins, and the subtle
changes that are propagated to the whole system when a protein malfunctions, or is not
present. One of the most conclusive arguments about how far there is to go in molecular
biology is provided by the surprisingly small difference in the total number of genes between
the nematode worm Caenorhabditis elegans (about 20000 [353]) and humans (20000 - 40000
depending on different estimates) [59]. C. elegans contains only 959 cells and the difference
in biological complexity between it and man is vast, yet this is not caused by the number of
genes. We must ask how such a small number of genes in humans gives rise to the number of
different cell types, the complex development of organs and ultimately the intricacy of brain
circuitry that leads to consciousness. The answer must lie in several phenomena: the actual
number of functional proteins being far larger than the number of genes, caused by differential
splicing creating multiple products from a single gene; modifications to proteins that alter
their function; protein interactions giving rise to complex new functions not achieved by
single proteins; and exquisite regulation of when and where genes are expressed. Therefore,
simply assigning a single function to a gene is a major oversimplification, as it fails to
capture the richness of the whole system, including the possibility for a gene to encode more
than one protein. Furthermore, each protein form may have several different functions in
different physiological locations.
1.1.1 Experimental methodology
Technological developments have given rise to a number of new experimental approaches
for the large scale analysis of biological systems, collectively known as functional
genomics (FG). FG involves the analysis of very large data sets to find the genes
or proteins that are implicated in disease processes or the changes that result from external
stimuli, and to aid efforts to annotate all genes with information about their biological
function. The workflow displayed in Figure 1.1 gives an overview of how different experiments
can be used to gain insights into gene function. FG includes studies that determine gene
expression, protein abundance (in proteomics), protein localisation and others. The different
methods can be classified into seven categories [360], which can be used to assign a function
to a protein by investigating:
• The extent of expression of a protein under different conditions and in different loca-
tions.
• The interaction partners for a protein.
• The gene neighbourhood, including any co-expressed genes, such as bacterial operons.
• The phenotype of the gene knockout.
• The biochemical activity of the protein once isolated.
• Any post-translational modifications that are observed.
• The three dimensional structure of the protein.
The experiments present significant computational challenges due to the vast sizes of data
sets, the heterogeneity in the information generated by each different lab, and the frequency
at which new laboratory techniques are developed. It is vital that functional genomics data
sets can be integrated and adequately queried, linked to gene databases, and exchanged
between research groups [322]. This requires: (i) the development of new database
technologies, and (ii) standard formats to which published data must adhere. The focus of the
work presented in the thesis is to address these two requirements for proteomic studies.
Figure 1.1: A conceptual view of the data flow in functional genomics. [Panels: Genome Sequencing, Microarray analysis, Proteomics, Immunohistochemistry and Metabolomics; outputs: genome sequence, global gene expression, global protein expression, positional expression profile and metabolite profile, combined through image analysis, statistical processing and data integration into a functionally annotated genome. The legend distinguishes sample flow from data flow.]
1.1.2 Systems biology
A new research area in the life sciences, known as systems biology, is the effort to
understand all the components and interactions that comprise an entire biological system. Systems biology
and functional genomics are not synonymous but there is a large overlap between the two
domains. Functional genomics is the acquisition of data about the function of genes on a
large scale, using various technologies. Systems biology is the discipline of trying to order
all the available information into an understanding of how components interact. Functional
genomics studies are one of the main sources of data, although they alone are not
enough to build up a complete picture of the system. Critically, in many functional
genomic studies there is no information about causality. If a group of genes is up-regulated
under a particular biological condition, it is not possible to say if the genes are regulated
in response to the condition, or if the condition is caused by the change in gene regulation
[188]. A complete understanding of metabolic pathways requires experiments that assay the
biochemical reactions, such as the flux in the pathways under a certain condition, compared
with the steady state. New technological advances will enable single molecule measurements
and visualisation of molecular interactions that will be crucial to systems biology, by allow-
ing researchers to derive insights into cellular processes at previously impossible resolution.
These new technologies will require significant database support.
1.1.3 Overview
The scope of our work is restricted to developing technology to aid functional genomics
research. The main focus is the development of a database and a data standard for the
proteomic techniques that are used to detect and measure the abundance of proteins in
complex samples, and to integrate these data with results from other types of experiments.
In this chapter, the main techniques in functional genomics research are described, along
with the computational challenges they present. An outline of the experimental techniques
in proteomics, and three case studies that have been performed, is given in Section 1.2. The
experimental techniques that measure gene expression are described in Section 1.3. Other
types of functional genomics research are described in Section 1.4, and a summary of major
functional genomics investigations is given in Section 1.5.
1.2 Proteomics
The proteome is the complete set of proteins expressed in a sample of interest,
or the entire set of proteins that could be found in an organism. The term “proteomics”
was first used in the mid 1990s to refer to a newly emerging approach of analysing large
numbers of proteins expressed in a sample [345, 349]. Knowledge of the proteins expressed
in a sample can aid understanding of the entire system if the functions of the proteins are well
understood. Alternatively, proteomics experiments can give insights into the functions of
proteins that have little annotation, for example if a protein is strongly expressed in one
condition compared with another [362]. Researchers aim to define the proteome of a cellular
sample, tissue, organ or organism using various techniques. The proteome is highly dynamic:
the abundance of different proteins changes, proteins are translocated to different organelles,
chemical modifications alter the behaviour of proteins and protein-protein interactions give
rise to complex new functions. Researchers are often limited to taking a snapshot of the
system at one time, but as the size of data sets continues to increase, it will be possible to
gain a more complete understanding [137]. Data sets produced by different laboratories may
comprise heterogeneous file formats from different sources, which are difficult to
compare; the requirements for bioinformatics support therefore continue to grow. Data sets
must be made publicly accessible, and software must be designed that allows researchers
to perform detailed re-analysis of data, using various statistical packages. This area is the
focus of Chapter 3, which describes our work on the development of a standard data format
for proteomics. A second issue is that there are currently no major public databases for
publishing proteome data sets, although several are in development. In Chapter 5, there
is a description of a database for proteomics that we have developed as a prototype for a
public repository. The database supports two on-going projects at the University of Glasgow,
described in Chapters 6 and 7.
The emergence of proteomics has been driven by the development of new technologies,
although one of the most commonly used approaches is still protein separation
by two dimensional gel electrophoresis (2-DE). 2-DE was first developed in the 1970s,
and pioneered in the 1980s by Angelika Görg and colleagues [136], and while 2-DE techniques
have improved, the experimental basis remains the same today [135]. The main technique
for identifying proteins is mass spectrometry (MS), in which there have been major technical
advances, coupled with the development of software, enabling clear identification of proteins,
even in mixed samples. Gel based proteomics is described in Section 1.2.1,
MS techniques are outlined in Section 1.2.2, other proteomics techniques are described in
Section 1.2.3, and investigations into post-translational modifications are outlined in Section
1.2.4.
1.2.1 Gel based proteomics
The majority of proteomics experiments involve a stage of protein separation, followed by a
technique for identifying proteins once isolated from the mixture. One of the most common
processes is the use of gel electrophoresis, coupled with mass spectrometry. Figure 1.2
displays a workflow from an experiment to determine the abundance of a large set of proteins.
Initially, proteins are extracted from a starting sample and solubilised using a protocol that
is dependent upon the origin of the sample and the technique used. Proteomics is not restricted
to a particular area of the life sciences, but can be performed on almost any type of biological
substance, such as microbial cultures, tissues, organs, whole organisms and environmental
samples.
Figure 1.2: The data flow in a proteomics experiment. [Stages shown for samples A, B and C: experiment design, protein solubilisation, 2D-PAGE, image analysis, digestion with trypsin, mass spectrometry (MALDI, MS/MS), sequence database search, protein identification, comparison of abundance across gels and statistical analysis, yielding a protein expression profile. Each gel produces a spot table with the columns ID, Vol, X, Y and Protein. The legend distinguishes sample flow from data flow.]
Figure 1.3: A sample image from 2-DE separation of proteins from Toxoplasma gondii (courtesy of A. M. Cohen). [Horizontal axis: pH 4 to pH 7; vertical axis: MW.]
The solubilised protein mixture is applied to an IPG (Immobilized pH Gradient) strip
and an electric current is applied. A protein migrates to a specific position in the pH gradient
where it has no net charge, in a process known as isoelectric focusing. In the second dimension,
the strip is placed on top of a polyacrylamide gel¹ and a second current is applied. The
gel contains a denaturing agent, such as SDS, which causes the three dimensional structure
of the protein to unfold, and gives each protein a net negative charge. In this dimension
the proteins migrate into the gel to a distance that is dependent on their molecular weight.
Smaller proteins migrate furthest and tend to appear at the bottom of gels in most images.
The proteins can be visualised by staining (Figure 1.3). Different IPG strips can be used
to separate proteins with different pI (isoelectric point) values; for example, a standard IPG
strip may separate on a pH 4 - 7 gradient. To achieve finer resolution, a narrower pH 5.5
- 6.5 gradient more accurately resolves spots with a pI value in this range. Proteins
with pI values at the extremes of the pH gradient may not be observed on a 2-D gel.
This issue is discussed in the Limitations section.
¹ The abbreviation 2D-PAGE (Two dimensional PolyAcrylamide Gel Electrophoresis) is often used in the literature.
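The effect of the strip's pH range on resolution can be illustrated with a minimal sketch. It assumes a perfectly linear pH gradient and treats a protein as resolved only if its pI lies within the range of the strip; the `Protein` class, the function names and the example pI and molecular weight values are hypothetical and are not taken from any software described in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    name: str
    pi: float      # isoelectric point
    mw_kda: float  # molecular weight in kDa


def focus_position(pi, ph_low, ph_high):
    """Fractional position (0 to 1) along the strip where a protein focuses,
    assuming a linear pH gradient (a simplifying assumption)."""
    return (pi - ph_low) / (ph_high - ph_low)


def resolved_on_strip(proteins, ph_low, ph_high):
    """Proteins whose pI falls inside the strip's pH range, ordered as they
    would appear from top to bottom of the gel (largest first, since
    smaller proteins migrate furthest in the second dimension)."""
    on_strip = [p for p in proteins if ph_low <= p.pi <= ph_high]
    return sorted(on_strip, key=lambda p: p.mw_kda, reverse=True)


# Hypothetical proteins, for illustration only.
proteins = [
    Protein("A", pi=5.9, mw_kda=42.0),
    Protein("B", pi=6.1, mw_kda=55.0),
    Protein("C", pi=8.2, mw_kda=30.0),
]

for low, high in [(4.0, 7.0), (5.5, 6.5)]:
    names = [p.name for p in resolved_on_strip(proteins, low, high)]
    gap = focus_position(6.1, low, high) - focus_position(5.9, low, high)
    print(f"pH {low}-{high}: resolved {names}, A-B gap {gap:.2f} of strip length")
```

Under these assumptions protein C (pI 8.2) is absent from both strips, and the gap between A and B is three times larger on the narrow range strip, because the same pI difference is spread over a gradient one third as wide.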
Image analysis and quantification of protein spot volume
A 2-D gel can be stained to visualise proteins, using Coomassie blue or silver (discussed
below), and scanned with a flat bed scanner. The scanned image is analysed with specialised
software that detects properties of protein spots, including their coordinates within the image
and an estimate of the volume of protein in the gel. Coordinates are usually specified as the
central point of a circular spot with a particular diameter, or as a set of boundary points
that specify the exact shape of the spot in two dimensions. The volume is estimated from the
darkness of each pixel within the spot. Different software packages have different methods
for quantifying the volume of protein in a spot, and most apply a strategy to normalise the
values across the gel, or a set of replicate gels. The software can match spots produced on
different gels which correspond to the same protein, and determine the relative difference in
the spot size and intensity across two or more gels. One problem is that there
has been little work comparing the algorithms used for quantifying protein spots, or examining the
relationship between the visible intensity of a spot and the actual volume of protein, which
is dependent upon the stain used. Generally fluorescent dyes give the best sensitivity and
linearity. Other stains include Coomassie blue and silver staining. Silver stains allow lower
volumes of protein to be visualised, but there is poor digestion of silver stained proteins with
trypsin and the stains are notoriously non-linear. Coomassie blue offers reasonable linearity
[200] and is widely used due to low cost, although it is less sensitive than either silver staining
or fluorescent dyes.
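As an illustration of the quantification step described above, the following sketch estimates the volume of a circular spot by summing the darkness of each pixel inside it, then normalises the volumes across the gel. It is only one simple possibility: the function names, the 8-bit greyscale convention (255 for white) and the choice of total-volume normalisation are assumptions, and the commercial packages each use their own methods.

```python
def spot_volume(image, cx, cy, diameter, white=255):
    """Sum the darkness (white - intensity) of every pixel whose centre lies
    inside a circular spot centred on (cx, cy)."""
    r_squared = (diameter / 2.0) ** 2
    volume = 0
    for y, row in enumerate(image):
        for x, intensity in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r_squared:
                volume += white - intensity
    return volume


def normalise(volumes):
    """Express each spot volume as a fraction of the total volume on the gel
    (an assumed strategy; other normalisation schemes exist)."""
    total = sum(volumes.values())
    if total == 0:
        raise ValueError("no stained protein detected on the gel")
    return {spot_id: v / total for spot_id, v in volumes.items()}


# A tiny hypothetical greyscale scan (5 rows x 10 columns) with two spots.
image = [[255] * 10 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        image[y][x] = 155  # fainter spot
    for x in range(6, 9):
        image[y][x] = 55   # darker spot

raw = {
    "spot_1": spot_volume(image, cx=2, cy=2, diameter=2),
    "spot_2": spot_volume(image, cx=7, cy=2, diameter=2),
}
print(raw, normalise(raw))
```

A boundary-point representation of the spot would replace the circle test with a point-in-polygon test; the summation and normalisation would be unchanged.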
A goal of computational research is to perform analysis of protein abundance values from
2-D gels produced by different laboratories, as is happening in the microarray field [86].
However, this cannot occur without significant efforts to determine how different software
packages perform gel image analysis. The ProteomeGRID is attempting this kind of anal-
ysis by creating an automated infrastructure for analysing and comparing 2-D gels, using
high performance distributed computing [256]. Large scale analysis of images from different
sources requires software companies to have an open approach to the algorithms or statis-
tical techniques offered by their software, or they must collaborate to create a standardised
output. An alternative would be for researchers to release the original high-resolution scans
of images, in addition to lists of protein volumes, to enable future re-evaluation of large
collections of images in a single analysis. One analysis has been performed to compare the
quality of spot detection in two software packages (Z3 [366] and Melanie 3 [210]) [264]. It
was discovered that both perform reasonably well at detecting spots (approximately 90%
accuracy), and moderately well for detecting ratios of volumes where the ratio is not great
(less than 1:6). A more detailed analysis is required of all the different software packages that
perform image analysis. This work is beyond our scope, but a list of the software packages
available for image analysis is given in Table 1.1.
• ImageMaster published by Amersham Biosciences, http://www.amershambiosciences.com
• Melanie 4 - developed at the Swiss Institute for Bioinformatics, http://ca.expasy.org/melanie/
• DeCyder published by Amersham Biosciences, http://www.amershambiosciences.com
• PDQuest published by Bio-Rad, http://www.bio-rad.com/
• Z3 published by Compugen, http://www.2dgels.com/
• ProGenesis published by Prolific, Inc., http://www.prolificinc.com/progenesis.html
• Delta 2D published by Bio Imaging, http://www.raytest.de/bio imaging/products/delta2D/delta2d.html
Table 1.1: Software available for image analysis of 2-D gels.
In the current situation there is little quality control over protein volume values, therefore
the values have limited use outside of the original experiment. There have been several
efforts to automate the process of comparing large collections of gel images, such as Veeser
et al. 2001 [331] and Rogers et al. 2003 [272]. These efforts are similar to the comparisons
that are being performed across large numbers of microarrays to detect patterns of gene
expression [86, 319]; however, there are several challenges that must be overcome before large
scale comparisons can be made over 2-D gels: variability in the appearance of
gel spots causes difficulties matching spots across a series of gels [338], different staining
protocols affect the signal strength, and errors can be made in protein identification.
A review of current progress in the area of algorithms for detecting and quantifying protein
spots is given by Dowsey, Dunn and Yang [83]. This is an area in which significant future
research is required.
Difference gel electrophoresis
A major new technology in gel based proteomics is two-dimensional difference gel electrophoresis
[327], or DIGE², in which two samples are labelled with different fluorescent
dyes, mixed and separated on a single gel. The gel is scanned at different wavelengths,
creating two images that can be compared. This removes the variability in resolving spots
on different gels, thereby improving the matching of spots between samples. The system can
be adapted to use three dyes. The third dye is used to label a mixture of proteins formed
by pooling the two samples in the experiment, to improve normalisation of protein volumes
between different images, allowing smaller changes in protein level to be determined
as significant (Figure 1.4) [8].
² Ettan DIGE™: Fluorescence 2D Difference Gel Electrophoresis [98], produced by Amersham Biosciences.
Limitations of gel electrophoresis
There are several limitations of 2-DE technology. Firstly, membrane and nuclear proteins
tend to be highly hydrophobic and difficult to solubilise, therefore they often do not appear
on a gel [3]. Secondly, high molecular weight proteins do not migrate well through gels and
may not be detected. Thirdly, 2-DE tends to detect high abundance proteins and many
functionally important proteins may be present only in small quantities. Finally, it is fairly
common for multiple proteins to co-migrate to the same spot, causing problems quantifying
the volume of individual proteins. However, this limitation can be avoided by the use of
narrow range pH gels, or zoom gels that improve the resolution of gel spots. Another
advance in gel electrophoresis is sample prefractionation. A protocol reported by Zuo and
Speicher in 2002 [370] can resolve complex mixtures of proteins by first separating proteins
into separate pools based on the charge of proteins. Each fraction of the sample is analysed
by 2-DE, performed over several overlapping narrow range pH gradient gels. This technique
allows more low abundance proteins to be detected as there is a general improvement in
spot resolution, and high abundance proteins are less likely to mask or interfere with other
protein spots. The detection of membrane proteins by 2-DE has been improved by systematic
analysis of the different variables and constituents of buffers to maximise the solubility of
membrane proteins, allowing improved loading of the proteins onto gels [277]. A review
of optimised solubilisation procedures for resolving membrane proteins is given by Molloy
[217]. The poor reproducibility of 2-DE is often discussed as a major limitation; however,
the gradual improvements in protocols for the two dimensions mean that reproducibility of
2-DE is now fairly high [317].
Figure 1.4: A schematic of a difference gel electrophoresis experiment. Proteins are extracted
from Sample A and Sample B, and a pooled sample is also prepared; each is given a different
label (blue, green or red). The samples are recombined and separated by 2-DE, and the gel is
scanned at three wavelengths, giving three images and a combined image.
1.2.2 Mass spectrometry
Ionisation types
The most common method of protein identification in proteomics is mass spectrometry (MS;
a review of techniques is given by Mann [203]). In gel based proteomics, a protein spot
is excised from the gel and digested with a protease that cleaves the protein at specific,
predictable positions along its length to form a set of peptides. The most commonly used
protease is trypsin. The peptide mixture can be applied to a matrix, at which a laser is fired
at a particular wavelength. The matrix is chosen to absorb at that wavelength, causing
the peptides to become ionised. This process is matrix-assisted laser desorption ionisation
(MALDI) as developed by Karas and Hillenkamp in the late 1980s [180, 151], which is often
used for identifying proteins in conjunction with gel electrophoresis. An alternative ionisation
approach is electrospray, first developed by Fenn and colleagues [102], in which a liquid
containing the peptide mixture is forced through a gold or platinum plated glass capillary
with a fine tip, at a high voltage, causing small droplets to form in a spray. The droplets
evaporate, imparting their charge to the peptides.
Detection
There are various methods for detecting the mass of the peptides that have been ionised.
Time of flight (TOF) is often coupled with MALDI (MALDI-TOF), and functions in the
following way. A laser fires at the matrix, imparting a fixed amount of kinetic energy to the
peptides. The ionised peptides travel through the mass spectrometer and reach the detector
in an amount of time that is dependent on the mass of the peptide, hence smaller peptides
travel faster. Therefore, the mass of each peptide can be determined from the length of time
taken to reach the detector.
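The relationship between flight time and mass can be illustrated with a short calculation. The sketch below assumes the simplest textbook model, which is not spelled out in the text above: every ion acquires the same kinetic energy per unit charge by acceleration through a fixed potential, and then drifts over a fixed distance. The voltage and drift length are illustrative values only, not the settings of any particular instrument.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, coulombs
DALTON = 1.66053906660e-27  # one dalton, kilograms


def flight_time(mass_da, charge=1, voltage=20_000.0, drift_length=1.0):
    """Time in seconds for an ion to cross the drift region.

    Assumes kinetic energy = charge * e * voltage (illustrative model).
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * E_CHARGE * voltage / mass_kg)
    return drift_length / velocity


def mass_from_flight_time(time_s, charge=1, voltage=20_000.0, drift_length=1.0):
    """Invert flight_time: recover the ion mass in daltons from its flight time."""
    mass_kg = 2 * charge * E_CHARGE * voltage * time_s ** 2 / drift_length ** 2
    return mass_kg / DALTON


if __name__ == "__main__":
    for mass in (1000.0, 2000.0, 4000.0):
        print(mass, flight_time(mass))
```

Under this model the flight time grows with the square root of the mass, so a peptide four times heavier takes twice as long to reach the detector.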
A quadrupole detector is commonly used with electrospray ionisation. A quadrupole
consists of four electrically charged rods to which an oscillating current is applied. Pep-
tides travel through the quadrupole but only at a particular amplitude of electric field can
a peptide, of a given mass, reach the detector. Therefore, a range of amplitudes is scanned,
allowing the mass of a peptide to be inferred from the amplitude at that time. A similar
system is the quadrupole ion trap in which ions enter a device that comprises several elec-
trodes trapping the ions inside. Various voltages are applied to the electrodes to eject ions
according to their mass:charge ratios. The ions are focused and detected using an electron
multiplier [177].
Figure 1.5: An MS trace viewed with Voyager software [339].
A recent advance in detection is Fourier Transform Ion Cyclotron Resonance (FTICR)
mass spectrometry [205]. FTICR can be coupled with either MALDI or electrospray ionisation.
Ions are collected in a cell (ICR trap), which is surrounded by a large electromagnet
that causes the ions to resonate. The resonance can be detected by an electrode and
converted into a mass:charge ratio, producing a spectrum similar to that produced by TOF
or quadrupole detection.
Data interpretation
Regardless of the method of ionisation, the result is a list of peptide masses on an MS
trace (Figure 1.5). Initially, a noise reduction procedure may be performed on a trace to
remove very weakly detected masses that are unlikely to be the result of genuine peptides.
The software supplied with the mass spectrometer can perform this task automatically but
the researcher may also manually select the strong peaks that they believe correspond to
peptides. The complete set of peptide masses, called the peptide mass fingerprint, can be
used to identify the protein. The list of masses is entered into a search engine that queries a
database of protein sequences, or translated DNA sequences, on which a theoretical digest
is performed. The search engine allows the researcher to specify which protease was used for
digesting the protein and calculates, for every protein in the database, the expected peptide
masses that would result from using that protease. Table 1.2 displays some of the software
that is available for searching peptide mass data. The software finds the proteins in the
database that have a set of predicted peptide masses that match most closely the observed
peptide masses. The software produces output that includes a statistical score indicating the
likelihood of a correct match, the number of peptides matched, and the percentage coverage
of the peptides matched out of the entire protein sequence. Each value has a statistical
basis, but the researcher uses a combination of these measures that is dependent on various
criteria, to decide if a protein has been correctly matched. In some cases, obtaining complete
coverage of the proteome may be of primary importance, and a low threshold will be used
that allows some false positives. In other situations, finding the exact identity of a single
protein is crucial and a high threshold will be used.
• PROWL - http://prowl.rockefeller.edu/
• MOWSE - http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
• ProteinProspector - http://prospector.ucsf.edu/
• MASCOT - http://www.matrixscience.com
• SEQUEST - http://fields.scripps.edu/sequest/
• PepMAPPER - http://wolf.bms.umist.ac.uk/mapper/
Table 1.2: Software available for searching mass spectrometry data.
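The theoretical digest and matching step described above can be sketched in a few lines. This is a minimal illustration and not the algorithm of any of the packages in Table 1.2: it assumes trypsin cleaves after K or R except before P (a commonly used rule), counts matched masses rather than computing a statistical score, and uses a two-protein database whose names and sequences are invented.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def rank_proteins(observed_masses, database, tolerance=0.2):
    """Rank proteins by how many observed masses match a predicted peptide."""
    results = []
    for name, sequence in database.items():
        predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
        matched = sum(
            1 for obs in observed_masses
            if any(abs(obs - pred) <= tolerance for pred in predicted)
        )
        results.append((name, matched))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical database; names and sequences are invented for illustration.
DATABASE = {"protein_A": "MKTAYIAKQR", "protein_B": "GGSLLKPEPTIDER"}
observed = [peptide_mass(p) for p in tryptic_digest(DATABASE["protein_A"])]
print(rank_proteins(observed, DATABASE))
```

The tolerance plays the role of the threshold discussed above: widening it admits more matches at the cost of more false positives.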
Tandem mass spectrometry
The peptide mass fingerprint method does not always identify a protein with sufficient confi-
dence. In these cases, an alternative approach called tandem mass spectrometry, or MS/MS,
can be used. MS/MS is so called because it involves two sequential MS stages. The first
stage separates the peptides by their mass but, rather than the ionised
peptides hitting a detector, a single peptide is selected and collided with an inert gas such as
argon or nitrogen. The collision causes the bonds between amino acids to break, resulting in
a range of ionised fragments. The mass of each ionised fragment is detected in the second
MS stage. For example, if the selected peptide contains eight amino acids, the fragmentation
would produce new peptides containing 8, 7, 6, 5, 4 amino acids and so on in the second
stage. The difference in mass between successive fragments corresponds to the exact mass of
the amino acid that is lost between the two. The masses of the fragments can be
read from right to left on a trace, revealing the amino acid sequence of the peptide (Figure
1.6). The peptide sequence, or the set of masses from the MS/MS trace, can then be searched
against a sequence database to find an exact match (or near exact) that will conclusively
identify the protein.
Figure 1.6: Three traces from a tandem mass spectrometry experiment, reproduced from [189].
Image (a) displays the first MS stage, from which the two strongest peptides are selected for
fragmentation. The results of the second stage fragmentation are shown in (b) and (c). The
difference between the mass of the peaks, shown on the y-axis, corresponds to the mass of the
individual amino acids that form the peptide sequences shown.
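Reading a sequence from the mass differences in an MS/MS trace, as described for tandem mass spectrometry above, can be sketched as follows. The example is idealised: it assumes a complete, noise-free ladder in which successive fragments differ by exactly one residue, and it reports residues in order of increasing fragment mass (which end of the peptide this corresponds to depends on the ion series, which is not modelled here). Leucine and isoleucine have identical masses, so only L is listed.

```python
# Monoisotopic residue masses in daltons (standard values).
# I is omitted because it is isobaric with L and cannot be distinguished by mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def ladder_for(peptide, offset=0.0):
    """Build an idealised fragment ladder: one mass per successive prefix.

    `offset` stands in for the constant terminal/charge contribution.
    """
    masses, total = [offset], offset
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def read_ladder(fragment_masses, tolerance=0.01):
    """Translate successive mass differences into residues ('?' if no match)."""
    masses = sorted(fragment_masses)
    sequence = []
    for lighter, heavier in zip(masses, masses[1:]):
        delta = heavier - lighter
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tolerance]
        sequence.append(matches[0] if matches else "?")
    return "".join(sequence)


print(read_ladder(ladder_for("GASPVTK", offset=19.018)))
```

A tight tolerance matters here: K and Q differ by only about 0.036 Da, so a looser tolerance would make them indistinguishable.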
Standardising mass spectrometry
One of the major limitations of MS is that there is no standardisation, either across the
methods employed by different instruments to measure peptide masses or in the input
parameters for the instruments. One effort to remedy this situation is provided in a study
by Purvine and colleagues [259]. They created a standard mixture of peptides and proteins,
which they assayed by liquid chromatography and MS (LC-MS), coupled with a database
search engine. The system correctly identified 23 peptides and 12 proteins from the mixture.
The experimental methodology has been released as a standard for assessing the quality of
studies, to see how effectively other systems can identify different proteins from within the
mixture.
The peak list generated from MS is usually entered into a search engine to identify the
protein. Each different application has its own measures of the quality of a protein match
and a researcher often decides, using a combination of measures, whether an identification is
correct. The measures of correct matching often depend upon the software being used, and a
cut-off is determined by each laboratory, using their own criteria that depend upon the type
of experiment. This means that there is no standard method for comparing the likelihood
of a correct match between data produced from different laboratory setups. Therefore, it
is very difficult to ascertain in large data sets the statistical probability that a protein has
been correctly identified. The efforts of the Proteomics Standards Initiative to solve some of
the standardisation problems are described in the following chapter.
1.2.3 Other proteomics techniques
One of the main criticisms of 2-DE based proteomics is the unreliability of estimates of
protein volume made by image analysis. The stain used to visualise protein spots greatly
affects the linearity of the relationship between true protein volume and the spot density
measured by analysis software. There is little information in the published literature about
the accuracy of measurement of protein volumes, therefore in the past results have often
been qualitative: spots are present on one gel and absent on another, or clearly up or down
regulated with large fold differences. However, recent advances in staining or labelling of
proteins, such as DIGE analysis, and improvements in software have enabled quantitative
measurement of protein volume from 2-D gels [181, 328]. In the microarray domain there
has been substantial work on the quantification and statistical analysis of results to be able
to say what differences are statistically significant (examples include [130, 267, 312]). The
interpretation of results would be easier if quantitative analysis of proteomics data sets could
be performed. Towards this goal, a set of new experimental techniques has been devised for
quantifying protein volumes in samples, as described below.
A limitation of 2-DE based proteomics is that highly abundant proteins are identified
much more readily than low abundance proteins. Many functionally significant proteins,
such as transcription factors, are present in low copy number in the cell, and it is vital
that these proteins can be assayed. Therefore, techniques have been developed that perform
proteomic analysis using separation techniques other than 2-DE, which detect proteins that
are expressed at low levels.
Liquid chromatography and tandem mass spectrometry
A technique has been developed in the labs of John Yates at the Scripps Institute, for identi-
fying large numbers of expressed proteins. This technique is unbiased with regard to protein
volume, protein charge or molecular weight, and can identify membrane proteins [344]. The
technique is known as MudPIT (Multidimensional Protein Identification Technology). Mud-
PIT is a further development of a technique reported in 1999, in which two dimensional
liquid chromatography (LC) is coupled directly with MS (LC-MS, Figure 1.7) [195].
There are many variations in the functionality of LC but the principle is that a solution
containing the proteins or peptides to be separated is applied to a column. The column
contains substances that create a gradient to fractionate the mixture based on the charge or
hydrophobicity of the proteins [290]. Reverse phase (RP) chromatography is often performed
in proteomics, in which a column is filled with an aqueous solution and there is an increasing
gradient of an organic solvent. Different fractions are eluted from the column according
to their hydrophobicity as the gradient of solvent increases. The fractions can be collected
for further separations or analyses, such as mass spectrometry, because RP can be directly
coupled to electrospray ionisation. One of the limitations of this technique is that complex
mixtures of proteins, such as the entire proteome of a sample, often cannot be adequately
resolved. This problem can be overcome by performing two-dimensional chromatography
in which two sequential stages are performed, which separate on different properties of the
mixture. The first stage is often ion-exchange chromatography, for instance eluting particular
proteins using different concentrations of KCl in stages, causing proteins or peptides to
separate differentially according to their charge.
Figure 1.7: Two dimensional liquid chromatography coupled with MS for identifying large
numbers of proteins from a mixture, reproduced from [195]. Two phases of LC are performed:
(i) strong cation exchange (SCX) for separating by charge, (ii) reversed phase (RP)
separating by hydrophobicity, followed by tandem mass spectrometry. The labelled stages are:
denatured protein complex; peptides (pH < 3); 2D chromatographic separation of peptides;
peptide fragmentation using tandem mass spectrometry; computational translation of tandem
mass spectra to amino acid sequences using genomic sequences; identified proteins in complex.
MudPIT was used with the SEQUEST software for performing database searches [93] in
a study reported in 2001 [344]. The technique was used to identify almost 1500 proteins from
the Saccharomyces cerevisiae proteome, including proteins with extremes in pI, MW, abun-
dance and hydrophobicity. Many studies have been performed to determine the proteome
of S. cerevisiae by 2-DE and MS; however, prior to this analysis the largest study had
resolved only 279 proteins [245]. A later refinement of the process was reported by Peng et al.
2003 [243] in another study of the yeast proteome, using two dimensional chromatography
coupled with tandem mass spectrometry. The study identified a similar number of proteins,
approximately 1500, and reported a very low rate of false positives (less than 1%).
ICAT
The technique of mass spectrometry for protein identification has been discussed above, but
if performed using a standard protocol, MS does not produce quantitative output. This is
because the heights of peaks on a trace are very poorly reproducible, and do not generally
correlate well with the amount of protein in the starting sample. In 1999, a new technique was
reported by Gygi and colleagues [142], in which MS was coupled with liquid chromatography
for protein separation, and proteins from two different samples could be compared concur-
rently. The scheme is shown in Figure 1.8 and consists of labelling proteins from two different
conditions with ICAT reagents (Isotope-Coded Affinity Tags). ICAT has a component that
binds cysteine residues in proteins, with an isotopically heavy reagent binding proteins in
one sample, and an isotopically light reagent binding proteins in the other sample. The sam-
ples are combined, and enzymatically cleaved to produce peptides. The ICAT reagent also
includes biotin which allows peptides to be extracted with an avidin affinity column because
avidin binds biotin with a very high affinity. Peptides labelled with the ICAT reagent are
captured in the affinity column. The peptides are then analysed in a mass spectrometer
which reveals a pair of adjacent peaks for each peptide. The adjacent peaks are separated
by a difference of 8 Da, which is the difference in mass between the heavy and light reagents.
The relative height of the two peaks represents the relative volume of protein that was present in
the two samples. At this stage, there is no information about protein identity. The peptides
undergo a second stage of MS, in which peptides are fragmented into amino acids (MS/MS
described above) to reveal the amino acid sequence that in many cases can be used to search
a sequence database, correctly identifying the protein. In the original paper describing the
method, ICAT was used to analyse the volume of proteins in two cultures of yeast growing
in different media. The authors were able to identify subtle changes in protein expression
that correlate well with previously published data. One limitation of this method is that the
reagents bind cysteine residues, and cysteine is one of the rarest amino acids. However, the
first publication about ICAT suggests that the percentage of cysteine-free proteins in yeast
is only 8%.
Figure 1.8: The ICAT method for quantitative proteomics. Proteins from cell state 1 (light
ICAT) and cell state 2 (heavy ICAT) are combined and proteolysed, and the ICAT peptides are
isolated by affinity. Relative protein abundance is quantified by measuring the ratio of
peaks (relative abundance plotted against mass/charge), and MS/MS analysis identifies the
protein from the sequence of a peptide.
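The pairing of light and heavy peaks in the ICAT method can be illustrated with a small sketch. It assumes singly charged peptides that each carry one labelled cysteine, so that partners are separated by exactly the 8 Da quoted above; the peak list is invented, and the tolerance is an arbitrary illustrative value.

```python
def find_icat_pairs(peaks, mass_shift=8.0, tolerance=0.05):
    """Find light/heavy peak pairs and their heavy:light intensity ratio.

    peaks: list of (mass, intensity) tuples with non-zero intensities.
    Assumes singly charged peptides with one labelled cysteine each.
    """
    peaks = sorted(peaks)
    pairs = []
    for i, (light_mass, light_intensity) in enumerate(peaks):
        for heavy_mass, heavy_intensity in peaks[i + 1:]:
            if abs((heavy_mass - light_mass) - mass_shift) <= tolerance:
                ratio = heavy_intensity / light_intensity
                pairs.append((light_mass, heavy_mass, ratio))
    return pairs


# Invented peak list: one labelled pair plus an unrelated peak.
example_peaks = [(1000.5, 200.0), (1008.5, 400.0), (1500.2, 50.0)]
print(find_icat_pairs(example_peaks))
```

A ratio above 1 indicates that the peptide was more abundant in the sample labelled with the heavy reagent. Peptides with several cysteines, or with higher charge states, would need the expected spacing to be adjusted accordingly.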
SILAC
A similar approach for quantifying protein abundance is SILAC (stable isotope labelling by
amino acids in cell culture), presented by Blagoev in 2003 [36]. In this approach, a heavy form of
arginine or leucine, labelled with 13C, is incorporated into the medium in which cells are
growing. A separate culture is grown in normal medium for a different condition. The
proteins are then extracted, digested with a protease and analysed by mass spectrometry.
Each peptide that contains an arginine residue is represented by a pair of adjacent peaks,
caused by a slight increase in mass of the peptide in the heavy carbon medium. It is expected
that all proteins contain arginine residues. The method was utilised to examine the EGFR
(Epidermal Growth Factor Receptor) pathway. One culture was stimulated with EGF, the
other was not stimulated. The cells from both cultures were lysed, mixed in a 1:1 ratio, and
an affinity column was used to extract proteins that interact with EGFR. The differences
in volume for proteins implicated in EGFR processes were accurately determined from pairs
of peptides, as for the ICAT method. However, SILAC can only be used for cell cultures
growing in a medium, whereas ICAT reagents are used to label the proteins after they have
been extracted from the sample; there are therefore fewer restrictions on the samples that
can be analysed with ICAT.
Other differential labelling strategies
ICAT and SILAC were two of the first procedures reported for labelling proteins to quantify
their abundance on a large scale by mass spectrometry. However there are various other
labels that have been used to create “heavy” and “light” isotopes that can be detected by
MS. An example is iTRAQ (isobaric Tags for Relative and Absolute Quantitation), which
functions in a similar way to ICAT but has the advantage that more than two samples can
be compared concurrently [15]. Water labelled with 16O or 18O [214], deuterium and various
other tags attached to amino acid side chains have also been applied to protein quantification (reviewed
in [284]). It is likely that these methods will begin to overtake gel based quantitation of
protein abundance as they do not suffer from the same limitations in the range of proteins
that will be identified.
1.2.4 Post-translational modifications
The genome sequence is a static representation of biology, and while it is possible to predict
the amino acid sequence of proteins with a high degree of accuracy, this does not reflect the
complete picture of proteins as functional units in cells. The chemical alteration of proteins,
known as post-translational modification (PTM), is a common phenomenon that occurs in a
time and signal controlled manner. Modifications include the addition and removal of phos-
phate groups (phosphorylation and dephosphorylation), which are well known mechanisms
for controlling the catalytic and signalling activity of proteins [172]. For example, receptor
tyrosine kinases (RTKs) potentiate external signals to the inside of cells. RTKs reside in
cell membranes and, when bound by a ligand, change in conformation, switching on their
kinase activity. The RTK subsequently binds and phosphorylates other proteins within the
cell, transmitting the signal downstream [206].
The addition of carbohydrate molecules to proteins, termed glycosylation, is the most
common type of modification. Analysis of glycosylation, or “glycomics”, describes studies to
find all the carbohydrate molecules produced by a protein, and already 5000 genes have been
assigned as having a potential role in the synthesis of carbohydrates across all the sequences
deposited in GenBank [337]. Other types of modification are acetylation, methylation and
cysteine oxidation. In general, modifications cause proteins to change in conformation, lead-
ing to the protein translocating to another part of the cell, or causing new protein interactions
to form. Modifications play a role in maintaining the tertiary (the 3-D conformation of a
single protein unit) and the quaternary (multi-protein complex) structure of proteins, and
are therefore ultimately associated with function.
Identifying PTMs
Protein modifications can be identified by 2-DE coupled with MS, and various other methods
for their detection have been developed (a review of techniques is given by Mann and Jensen
[204]). Distinct protein spots can be observed on a 2-D gel that correspond to differentially
modified forms of the same protein. Phosphorylation can be observed in the case of different
spots positioned in a horizontal line, due to a change in the protein’s charge (pI) with only a
negligible change in molecular weight (Figure 1.9). Glycosylation of proteins (the addition of
chains of carbohydrates) causes a change in molecular weight and pI, causing variant forms of
proteins to appear in a diagonal line. MS can conclusively identify modifications, for example
if the peptide mass fingerprint reveals a peptide with a shift in mass that corresponds exactly
to the known mass of a modification type. Tandem mass spectrometry is even more accurate,
and can reveal the exact amino acid position of the modification if one amino acid displays a
characteristic increase in mass. However, there are several problems using this technique on a
large scale. Firstly, phosphopeptides are low in abundance and extract poorly from gel slices.
Secondly, during MALDI-TOF only a proportion of peptides reach the detector, so the modified
peptides may often go undetected. Thirdly, while using electrospray ionisation, phosphorylated
peptides ionise poorly in acidified solvents. Finally, in MS/MS the situation is worse, as only
a few peptides in the entire sequence may be detected, therefore the majority of the protein
sequence is not analysed, and modifications on the rest of the protein are silent.
Figure 1.9: A two dimensional gel highlights possible different phosphorylation states of
protein disulfide isomerase from a human cell line (image courtesy of M. Nelson).
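Detecting a modification from a shift in peptide mass, as described above, amounts to comparing the difference between observed and expected masses against a table of known shifts. The sketch below uses standard monoisotopic mass shifts for four common modifications; these values, the function name and the tolerance are supplied here for illustration and do not come from the text.

```python
# Standard monoisotopic mass shifts in daltons for some common modifications.
PTM_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
    "oxidation": 15.99491,
}


def candidate_modifications(observed_mass, expected_mass, tolerance=0.02):
    """Return the modifications whose mass shift explains the observed difference."""
    delta = observed_mass - expected_mass
    return [
        name for name, shift in PTM_SHIFTS.items()
        if abs(delta - shift) <= tolerance
    ]


# Illustrative masses: a peptide observed about 80 Da heavier than predicted.
print(candidate_modifications(1079.97, 1000.0))
```

A match on mass alone only suggests a candidate; locating the modified residue requires the MS/MS evidence discussed above.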
There are methods for improving the detection of modifications including the use of
affinity columns that bind phosphorylated proteins [103], to enrich for these proteins as they
often occur as a small proportion of the total amount of a single protein. One such method
is Immobilized Metal Affinity Chromatography (IMAC) in which columns are loaded with a
metal ion-containing resin that causes phosphopeptides to bind under acidic conditions [354].
Other techniques used to identify modifications include Western blot analysis whereby proteins
are treated with specific antibodies that are known to bind particular phosphorylation
sites on peptides. The antibodies can be fluorescently labelled, allowing differences in fluo-
rescence signal to detect the amount of phosphorylated protein. A similar approach is the
use of autoradiography, whereby radiolabelled 32P is incorporated into proteins, which can
then be quantified [179].
The development of new techniques means that data sets of PTMs are rapidly increas-
ing in size, and good database support is required to make the information available to
researchers to avoid manual analysis of the literature. One estimate suggests that there are
at least 200,000 published PTMs in PubMed [285]. It is a major research challenge to make
the information on PTMs available in the context of large scale investigations.
1.2.5 Case studies of proteomics research
In this section, examples are given of proteomic investigations we have studied. Chapters
3 and 4 will return to this topic and discuss the development of standard data formats for
proteomics, and Chapter 5 will outline a database system that has been implemented to aid
research.
A major part of the development process of the standard was the capture of the re-
quirements of proteomic research. Three case studies of current research activity at the
University of Glasgow, which use proteomic techniques, have been performed. Two case
studies of research in parasitology are summarised below (Case Studies 1 and 3), which
ultimately contributed to the work described in Chapters 6 and 7. Case study 2 outlines
a collaboration at the Beatson Institute (The Beatson Institute for Cancer Research,
www.beatson.gla.ac.uk) with the research group of Prof. Walter Kolch,
investigating the MAP Kinase signalling pathway. The data from case study 2 were not
available for inclusion in RAPAD but the experimental setup was taken into consideration
during the development of the model presented in Chapter 3.
1.2.6 Case study 1
This case study is derived from work with researchers in the field of microbial pathogenesis
[61] at the Institute of Biomedical and Life Sciences, University of Glasgow. The researchers
wish to investigate the changes that occur in the proteome of a human cell line (the host)
during invasion with the parasite Toxoplasma gondii compared with non-infected host cells.
A set of replicate samples are obtained and the proteins are extracted from each sample,
solubilised and separated by 2-DE. The gels are scanned and image analysis is performed
to match spots on different gels corresponding to the same protein. Protein spots showing
differential expression are extracted from the gel and prepared for MS. Many proteins are
identified conclusively by MS. The next stage involves characterising the large number of
hits that are obtained. There are a large number of Internet accessible resources about
human proteins which can only be searched manually. This process is very time consuming
for a large data set. If database searches could be automated, many more proteins could be
analysed in one study, and greater insights could be made. After a long period of manual
database searching, a significant amount of information is obtained about each protein, but
there is no simple mechanism for summarising or managing the information.
The researchers also wish to identify post-translational control mechanisms, to determine
if a protein expressed during parasite invasion has been modified, compared with the same
protein in non-invaded cells. Potential modifications can be found by 2-DE if a protein
migrates to a different position on one gel compared with another gel, the result of a slight
change in the charge or molecular weight of the protein caused by the modification. The
modification can be positively identified on an MS trace by discovering a peptide with a mass
that is different from the expected value, and the difference corresponds to the mass of an
additional group, such as an extra methyl group. However, to discover modifications that
are functionally important, the researcher must have information about how the protein is
modified in other conditions. These efforts are hindered because there are no major databases
of MS traces or modifications available. An annotated database, containing a large number
of MS traces, would greatly improve the identification of modifications in two ways. Firstly,
annotated traces for proteins with confirmed modifications could be mined to improve the
algorithms for the detection of modifications in other proteins. Secondly, if a particular
protein already has an entry in the database, differences in the modification pattern could
be highlighted, and investigated further to determine if the modification is significant for the
function of the protein.
1.2.7 Case study 2
This case study was conducted at the Beatson Institute, in collaboration with Prof. Walter
Kolch. A cell line was obtained in which the protein Raf-1 is knocked out. The protein
is known to be involved in major metabolic processes in the MAP kinase pathway [38],
and researchers wish to discover the downstream effects of the loss of Raf-1. Gels are
run using a difference gel electrophoresis system, labelling proteins from the knockout cell
line with one dye, and from a normal cell line with a different dye. A series of replicates
is run, and the gel images are analysed. The researcher has a number of questions they
wish to pose: for example, which spots show significant differential expression between
the samples, and what are the identities of these proteins? After statistical analysis, two
hundred spots showing the greatest difference in expression are highlighted for further study.
The two hundred spots are robotically picked from the gel and prepared for MS. MS traces
are analysed, peak lists are produced and entered into applications that search genome
databases. The searches identify approximately one hundred and fifty proteins that reside in
databases, of which many have only basic functional annotation. The researcher wishes to
further characterise the proteins by searching other relevant databases, of which about ten
exist. The researcher must manually browse Internet sites to assemble information and read
bibliographic references, which takes a number of hours for each protein, or up to days if
extensive literature searches are required. Therefore, characterising all one hundred and fifty
proteins in detail could take several weeks for a single researcher.
Once the proteins have been characterised, the researcher wishes to build a mathematical
model of the changes that occur in the metabolic pathway, caused by the loss of function of
Raf-1. Data for the model are to be drawn from the 2-DE studies, a microarray experiment
that has been carried out by another research group on the same cell line, and biochemical
studies carried out over several decades by many different research groups. The process
of retrieving data from the biochemical studies is extremely laborious because little of the
data reside in accessible databases, so extensive literature searches are required. The
microarray data sets have been published by other research groups, and are available on
the Internet, but do not have any information about how the cell lines were cultured. In
addition, the database identifiers (accession numbers) for the features on the microarray do
not match the identifiers of the proteins identified by MS. Therefore, it is not possible to
make any direct comparison with changes observed in the 2-DE studies. The major problems
highlighted by this case study are lack of tools for the integration of data from distributed
databases, and insufficient information stored with published data for it to be re-used.
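The identifier mismatch described above is typically worked around by mapping both sets of accession numbers onto a shared key, such as a gene identifier. The sketch below assumes that two cross-reference tables already exist, which is precisely what the case study lacked. The function name and dictionary layout are hypothetical.

```python
def join_on_shared_gene(array_values, protein_values, array_to_gene, protein_to_gene):
    """Pair microarray and protein measurements that map to the same gene.

    array_values / protein_values: identifier -> measured value
    array_to_gene / protein_to_gene: identifier -> shared gene key (assumed given)
    """
    by_gene = {}
    for feature_id, value in array_values.items():
        gene = array_to_gene.get(feature_id)
        if gene is not None:
            by_gene.setdefault(gene, {})["mrna"] = value
    for protein_id, value in protein_values.items():
        gene = protein_to_gene.get(protein_id)
        if gene is not None:
            by_gene.setdefault(gene, {})["protein"] = value
    # Keep only the genes that were measured at both levels.
    return {g: v for g, v in by_gene.items() if "mrna" in v and "protein" in v}
```

Any feature or protein without an entry in its cross-reference table is silently dropped, which mirrors the loss of comparable data when identifiers cannot be reconciled.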
1.2.8 Case study 3
This study was performed with Prof. Mike Turner at the Institute of Biomedical and Life
Sciences, University of Glasgow, in the context of an investigation to determine the proteome
of the parasite Trypanosoma brucei. The genome sequence of T. brucei is nearing comple-
tion, but many genes have little functional annotation and it is hypothesised that proteome
investigations can aid the annotation process. The data from this investigation form the
basis for Chapter 7.
Proteomics experiments can aid annotation efforts by conclusively identifying proteins
that are expressed under particular conditions. There are many other examples of published
work in which researchers have used proteomics techniques to catalogue the set of proteins
present in a sample of interest, to determine the entire proteome of particular cell types,
organelles or microorganisms (examples include whole yeast cells [115], the human heart
mitochondrion [316] and the plasma membrane of yeast [223]). The organism being studied
may have no genome sequence, or the sequence may be incomplete, therefore there are
significant problems conclusively identifying spots found on a 2-D gel. In some cases, several
2-D gels may be run to separate proteins within different pH ranges. Spots from the gels
are picked, and prepared for MS. Four scenarios for the results of database searches with
peptide masses obtained from an MS trace are possible:
1. A good match to a sequence in the genome database, with functional annotation.
2. A match with no annotation but with homology to sequences from other organisms.
3. A match with no annotation and no homologous sequences.
4. No match in any genome database, for example if the identification has been made
from an expressed sequence tag (EST) database.
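The four scenarios can be expressed as a simple decision function. This is only a sketch of the classification: the three boolean inputs are assumed to have been derived already from the search results, and the function name is hypothetical.

```python
def classify_search_result(genome_match, annotated, has_homologues):
    """Return the scenario number (1-4) for a database search result.

    genome_match:   a match was found in the genome database
    annotated:      the matched sequence carries functional annotation
    has_homologues: the matched sequence has homologues in other organisms
    """
    if not genome_match:
        return 4  # e.g. identified only from an EST database
    if annotated:
        return 1
    return 2 if has_homologues else 3
```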
Genome sequencing and annotation work is only partially complete for many organisms,
therefore a major problem arises due to the dynamic nature of the sequence databases. After
the release of each new database version, sequences are more likely to be found in groups
1 and 2. However, it is extremely difficult to identify which sequences have been updated
between database versions and the information cannot be accessed without repeating all
the initial searches. The sequence identifiers may also change between database releases,
therefore automating the process of searching for protein records that have been updated is
a major challenge. The sequence of peptides from an MS/MS experiment can also be used
to discover new genes within the genome, or act as an identifier for genes that previously
had not been sequenced, that fall into category 4.
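One way to approach the update problem described above is to compare two database releases by sequence rather than by identifier, since identifiers may change. The sketch below makes a strong simplifying assumption, namely that the sequences themselves are unchanged between releases, which will not always hold in practice. The data layout and function name are hypothetical.

```python
def annotation_changes(old_release, new_release):
    """List records whose identifier or annotation changed between releases.

    Each release maps identifier -> (sequence, annotation). Records are
    matched by exact sequence, assuming sequences are stable across releases.
    """
    old_by_seq = {seq: (ident, ann) for ident, (seq, ann) in old_release.items()}
    changes = []
    for new_id, (seq, new_ann) in new_release.items():
        if seq in old_by_seq:
            old_id, old_ann = old_by_seq[seq]
            if old_id != new_id or old_ann != new_ann:
                changes.append((old_id, new_id, old_ann, new_ann))
    return changes
```

Records that appear only in the new release are ignored here. They would need a fresh search against the stored peptide masses.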
1.2.9 Publication of proteomics data
There is a growing body of publications in which researchers have utilised a global approach
to study the proteins in a system. A search of PubMed for the word “proteomics” returns over
3500 articles (July 2004). Articles describing gel based proteomics usually include a printed
image of one or more gels, often with a table containing proteins that have been identified
(example [208]). In some cases, there is a comparative analysis across several conditions and
the ratios of the volume of proteins are displayed in a table (example [129]). Experiments
involving different separation techniques, such as liquid chromatography, coupled with MS
for protein identification often display the chromatograms for the different fractions, and
images of MS traces (examples [369, 361]). The proteins that have been identified are also
usually presented in a table. Most publications reproduce the protocols for MS and give a
reference to the software used for protein identification, but rarely is there any detail about
the input parameters for the software or the version of the database that was searched, and
there is variability in the significance cut-off that was used for protein identifications. It
is therefore often not possible to assess the statistical probability that proteins have been
correctly identified without substantial manual effort.
The data from proteomic studies are usually not open to any kind of automated analysis,
even if publications are reproduced electronically on the Internet. This is because the results
are often embedded within images, which cannot be extracted, or the results are written in
the main body of text, which must be read manually to understand the context. This cannot
be automated using current information retrieval techniques. We focus on the challenges of
making proteome data widely accessible in Chapter 3.
1.3 Gene expression techniques
The techniques described above attempt to assess the status of the proteins within a system.
However, the experiments present technical challenges due to the difficulties of extracting
very low volumes of proteins from the cell. There is also no technique for amplifying the
volume of a protein that is equivalent to PCR (polymerase chain reaction) for amplifying
nucleic acid sequences. Therefore, in the last decade, techniques have been developed for
assessing how strongly genes are expressed by measuring the messenger RNA (mRNA) levels
produced. These techniques are described in this section.
1.3.1 The development of microarrays
Microarrays were first developed in the mid 1990s from two different approaches. One of
the first developments in microarrays was achieved by Shalon and colleagues in 1996, who
developed a protocol for attaching DNA fragments to a glass slide, and hybridising two sets
of yeast chromosomes, labelled with different fluorophores [287]. A paper was published later
that year by DeRisi and colleagues outlining how microarrays, formed by spotting cDNA
(complementary DNA) onto a slide, can be used to assay gene expression in the context of classifying
differences in human tumour cell lines [76]. A different article was published at the same time
outlining the use of microarrays for detecting mutations in a gene implicated in breast cancer
from a number of patients [145]. Each cDNA “feature” corresponds to the complementary
sequence of the mRNA that is produced for each gene to be assayed.
Affymetrix arrays
An alternative approach was pioneered by the Affymetrix company in which very short
(10-50 base pairs) stretches of DNA are synthesised on the chip using a technique inherited from
the semi-conductor industry, called photolithography [5]. Short sequences of DNA bases
(oligonucleotides) are synthesised on the chip, one base at a time in specific positions. The
process uses fine masks over the chip that allow light to reach particular positions, which
causes the specific degradation of a “blocking residue” that prevents additional bases be-
ing added to an oligonucleotide chain. The chip is then washed with a solution containing
whichever base (A, C, G or T) is required in the next position at the unmasked
oligonucleotide, attached to a new blocking residue. A new mask is applied and the next set of
bases is added (Figure 1.10). In this way, chains of nucleotides can be built up one base at
a time.
Measuring expression
Using either of the two approaches outlined above, the result is a chip or slide containing
up to tens of thousands of reporters. Each reporter detects the level of expression for
one gene. When a gene is expressed, mRNA is produced as an intermediate messenger, which
is later translated into a protein, the functional unit in the cell. It is believed that the
relative amount of mRNA in one cell compared with another is indicative of the rate of gene
expression and can give insights into the genes that cause the differences between samples.
Two sets of mRNA from samples produced under different conditions (for example, one normal,
one disease) can be labelled with different fluorescent compounds (one red, one green) and
hybridised to the array. The ratio of red to green for each reporter gives the difference in
expression for each gene between the two samples. For Affymetrix arrays only one sample
is assayed at a time (a one-colour array), and two different samples must be compared using
two different hybridisations to the chip. Statistical processing is performed to ensure that
values obtained from different assays can be compared. Large changes in expression for a
gene, between a normal and a disease sample, may implicate the gene in the disease process.
Figure 1.10: A summary of the technique involved in the creation of Affymetrix microarrays, image obtained from [5].
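The red-to-green ratio for a two-colour array is conventionally reported on a log2 scale, so that a two-fold increase and a two-fold decrease are symmetric around zero. The following is a minimal sketch: the dictionary layout and function name are assumptions, and the background subtraction and normalisation that real analyses require are omitted.

```python
import math


def log2_ratios(red, green):
    """Return log2(red / green) for every reporter present in both channels.

    red / green: reporter identifier -> fluorescence intensity.
    Reporters with a non-positive intensity in either channel are skipped.
    """
    return {
        reporter: math.log2(red[reporter] / green[reporter])
        for reporter in red
        if reporter in green and red[reporter] > 0 and green[reporter] > 0
    }
```

A value of 2 then indicates four-fold higher signal in the red channel, and a value of -2 indicates four-fold higher signal in the green channel.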
Since the early days of research the use of microarrays has grown at a remarkable rate.
A simple search of PubMed for the word “microarray” reveals almost 6000 articles published
since 1996. Each experiment generates a large amount of data: most studies involve many
parallel assays, with each assay containing thousands of data points. Therefore, as a general
estimate, each published study could generate several hundred thousand data points. In ad-
dition, we should also consider the genes’ annotation, experimental protocols, and statistical
processing. The challenges in database support for microarrays are clearly very large. These
requirements were realised by the MGED (Microarray and Gene Expression Data) society
in the late 1990s [42], which was established to improve support for publishing, querying
and exchanging microarray data sets. The issues of data standardisation, and the creation
of public databases, are discussed in the following chapter.
1.3.2 Serial analysis of gene expression
The technique of serial analysis of gene expression (SAGE) was first reported in 1995 by
Velculescu and colleagues [332] as a method for quantifying the expression of genes, prior
to the invention of microarrays. The basic principle is that short tags (10-14 base pairs),
which uniquely identify the transcript of the gene, are obtained for each gene to be assayed.
A sample is obtained, and the tags are isolated from the transcripts, reverse transcribed
(converting mRNA back into DNA), and concatenated to form a long stretch of DNA. The
newly formed DNA is sequenced, and the number of times each tag appears indicates the
level of expression of each gene. In 1997 the technique was used successfully to assay the
expression of over 4000 genes in yeast, which was one of the first examples of a
technique to perform high-throughput analysis on a whole system [333].
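The counting step of SAGE can be illustrated with a short sketch. It assumes, as a deliberate simplification, that the sequenced concatemer consists of fixed-length tags laid end to end. Real concatemers contain ditags separated by restriction sites, which this example ignores. The tag-to-gene lookup table and function name are hypothetical.

```python
from collections import Counter


def count_tags(concatemer, tag_length, tag_to_gene):
    """Count how often each gene's tag occurs in a sequenced concatemer.

    The concatemer is cut into consecutive, non-overlapping tags of
    tag_length bases; tags missing from tag_to_gene are ignored.
    """
    counts = Counter()
    for start in range(0, len(concatemer) - tag_length + 1, tag_length):
        gene = tag_to_gene.get(concatemer[start:start + tag_length])
        if gene is not None:
            counts[gene] += 1
    return counts
```

The resulting counts are the raw expression estimates: a gene whose tag appears twice as often is taken to be expressed at roughly twice the level.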
1.4 Other techniques used in functional genomics
The main focus of our research is to improve computational support for proteomics, and to
integrate the results of protein abundance experiments with gene expression values. However,
it is also important that technology can be extended to capture and integrate data from all
types of functional genomics experiment. This section contains a brief overview of other
types of large scale experiments which may yield data needed for functional genomics.
1.4.1 RNA interference
RNA interference (RNAi) is a technique first developed in Caenorhabditis elegans [108]. It
is a powerful method for removing the function of a gene without having to develop genetic
crosses, or engineer complex methods for deleting the gene from the genome. In certain
species, simply injecting the organism with double stranded RNA of the same sequence as
the targeted gene prevents the gene being translated into protein. The same effect can also
be achieved to a lesser extent using single stranded anti-sense RNA. The resulting phenotype
of the gene knockout allows researchers to assign a function to a gene, as long as the knockout
is not lethal, and this has proved vital for investigating C. elegans. The vast majority of the
predicted 20000 genes have been tested with RNAi. Similar experiments have been performed
in plants, in Drosophila and in the disease causing parasites, trypanosomes (a review is given
by Hannon [147]). There is some evidence that RNAi may be effective in mammalian cells,
although this has not yet been conclusively demonstrated, and the complete mechanism for
RNAi is not currently understood. However, RNAi is a highly specific technique that allows
researchers to determine the function of genes on a large scale.
1.4.2 Immunohistochemistry
The position of a protein in a cell or tissue can be localised using immunohistochemistry,
which is a widely used technique in molecular biology [69]. A particular protein can be
viewed under a microscope using a specific antibody to which a fluorescent label has been
attached, such as green fluorescent protein (GFP [163]), or a radioactive tag. More generally,
proteins can be visualised in a sample using silver staining.
The position of the protein in the cell can be visualised, and differences in the pattern
of labelled proteins can be used to classify samples. Localisation information may provide
clues to the function of a protein. For example, a protein shown to be highly expressed
in cell membranes may prove to be a transporter or membrane receptor. The technique
can be modified to visualise two proteins concurrently, using two different fluorescent labels
attached to antibodies against the two proteins to be studied. In one study, 75% of the yeast
proteome was analysed by this method, totalling 4156 proteins, allowing researchers to infer
significant functional information [157].
1.4.3 Metabolomics
Proteins and mRNA sequences are not the only molecules that can give information about the
current state of a system. Biological reactions are catalysed by proteins, but the reactants are
in fact small molecules, such as citrate, glucose or NADPH, known as metabolites. Researchers
have developed techniques to analyse the metabolites within one system compared with
another, for example to determine the difference between bacterial strains, or to analyse
the critical changes in metabolite concentration during a disease process. The study of the
entire set of metabolites as a diagnostic tool has become known as metabolomics (current
progress is reviewed by Weckwerth 2003 [346]), and the term metabonomics has also been
used. According to Nicholson 2002 [227], metabonomics is the study of metabolic profiles in
vivo in whole organisms, biofluids or tissues.
In theory, mass spectrometry could be used directly to detect the metabolites present
in a sample, by detecting the mass of all the metabolites and comparing with a reference
database. In practice, an additional stage is used to separate metabolites according to their
molecular mass, prior to MS, to increase the resolution. The additional stage can be liquid
or gas chromatography (GC) [105]. The principle of GC is similar to LC but uses a column
filled with an inert gas, rather than a solution. The mixture undergoes a process that causes
it to become gaseous, and small molecules separate according to a property, such as mass
or charge. There have been several studies that determine the metabolites present in plant
samples using LC/MS or GC/MS, examples include [271, 105, 347].
An alternative approach for determining the metabolome is nuclear magnetic resonance
(NMR) [306]. NMR can detect a fingerprint of the metabolites in a sample that contain 1H,
13C, 15N, or 31P when pulsed with a radio frequency. The atomic nuclei give information
about the chemical environment within a magnetic field. NMR has the advantage over
MS that it is not destructive of the sample, and in some cases can be used in a non-invasive
manner for analysing tissues. This kind of metabolomics is used for diagnostics, to determine
the characteristic fingerprint of the metabolites present in a particular bacterial strain, or a
diseased tissue.
1.4.4 Protein interaction studies
Proteins rarely act as single units in cells, but form complexes with other proteins to create
new functions. It is therefore an essential part of functional genomics research to gain
insights into the interaction partners of proteins. The main experimental techniques for
such studies are summarised here.
One of the main technologies, developed in the late 1980s, is the Yeast Two-Hybrid system,
which works in the following way [107]. The DNA binding domain of a transcription factor A
is fused to protein X, and the activation domain of transcription factor A is separated and
fused to protein Y. Transcription factor A switches on a gene that causes a visible change in
a cell culture, causing cells to grow rapidly, or a particular colour to develop. Transcription
factor A can only switch on the gene if its two domains come into contact, caused by protein
X and protein Y interacting (Figure 1.11). The two-hybrid method has been employed on
a large scale to analyse protein interactions in yeast [323], C. elegans [343] and Helicobacter
pylori [263]. In the study on yeast, researchers plated 192 “bait” proteins, and assayed almost
all of the 6000 predicted proteins as “prey”, revealing 281 protein-protein interactions. The
reverse study was also performed, using all the predicted proteins as bait against a library
of prey proteins, revealing a further 700 protein interactions. This system has proved vital
for determining functionally significant interactions; however, it has disadvantages [56]. It is
based on transcriptional activation, and therefore forces interaction partners to localise together
in the nucleus, producing a large number of false positives. Therefore, other methods are
usually required to confirm the interactions identified by Yeast Two-Hybrid analysis. In
addition, the fusion of proteins to the transcription factor domains may block sites required
for interactions, or required for modifications that must occur before interaction, such that
genuine interactions may be missed.
Figure 1.11: A summary of Yeast Two-Hybrid experiments, reproduced from [56].
An alternative method for detecting protein-protein interactions is affinity purification
of multiprotein complexes. In this method a single protein A is fused to a tag that can be
purified using an antibody that is attached to an affinity column. Proteins that bind to A,
forming a complex, can be pulled out. The complex is separated on a one or two dimensional
gel and identified by MS. This system has been used in yeast to identify 3617 interactions
with 493 baits [152]. A similar method is tandem-affinity purification (TAP tagging) in
which protein A is fused to a tag that binds IgG beads in a column [120, 270]. Other proteins
interact with protein A forming a complex. The TAP tag contains a highly specific protease
cleavage site to enable the complex containing protein A to be extracted from the column
without disrupting the interactions. The proteins within the complex can subsequently be
identified by gel electrophoresis and mass spectrometry. The affinity based methods have
the advantage over Yeast Two-Hybrid that interactions take place under conditions that are
much closer to natural cellular conditions, although interactions may not be detected if the
interacting proteins are not in high abundance.
Figure 1.12: Affinity methods for assaying protein interactions, reproduced from [56].
A new advance in understanding protein interactions is the development of protein
microarrays (or protein chips) [251]. The basic technique involves immobilising a set of
recombinant proteins to a surface, such as a membrane or slide. The chip can then be assayed with
a protein or antibody attached to a fluorescent molecule. Any protein spot that fluoresces
is likely to be an interaction partner for that protein or antibody. Multiple proteins can be
tested against the chip in sequence, to generate data about protein interactions on a large
scale. There are currently several technical difficulties with the production of protein chips.
However, although protein chips are still at the “proof-of-concept” stage, new techniques
for printing protein spots, immobilising correctly folded proteins and detection should soon
make this technique widely available to researchers, enabling rapid, large scale surveys of
protein interactions.
1.4.5 Three dimensional structures
The three dimensional structure of a protein is one of the most insightful pieces of infor-
mation about its function, particularly if a structure is obtained in which a ligand is bound
to the active site. The resolution of 3-D structures is a major research field and might be
considered outside the scope of functional genomics. However, in recent years an effort has
been initiated to perform high-throughput generation of protein structures, that has been
termed structural genomics, or structural proteomics [360]. Large collections of recombinant
proteins are screened in parallel for the ability to form crystals, each using a range of ex-
perimental conditions. An early example of the success of this approach was demonstrated
by Christendat and co-workers in 2000, in which 10 structures were published
simultaneously [57]. The Protein Data Bank (PDB) held over 26,000 structures in July 2004,
and this number is likely to increase exponentially as the structural proteomics effort gains
momentum.
1.5 Investigations across the “omics”
Large scale investigations are being undertaken in many labs, working on a great range of
organisms. The techniques used depend upon the organism, for example in the nematode
worm, C. elegans, RNAi is one of the best methods for investigating the function of genes (a
review is given by Lee and colleagues [191]). However, RNAi is not a viable method for some
other species. In mice, more common techniques include the development of “knock-out”
mice, whereby targeted recombination replaces a specific gene in embryonic stem (ES) cells.
The ES cells are then injected into blastocysts, which can form embryos when implanted in
a pseudo-pregnant mouse. The resulting litter contains certain mice with the gene knocked-
out, from which a strain of mice can be developed. The phenotype of the resulting strain
gives information about the function of a gene [308].
A summary of the FG approaches that have been used in yeast is given by Castrillo and
Oliver [50]. Yeast has been a very important model organism, and many of the techniques
described in this chapter were first developed in a yeast model. Current investigations in
yeast focus on finding all the genes in the genome, using bioinformatics approaches [190]. In
addition, various high-throughput approaches have been used to study the transcriptome (footnote 4)
[165], proteome [115] and metabolome [9]. Investigations in parasitology form the basis for
the work in Chapters 6 and 7, and FG studies on other organisms are too numerous to cover in
detail. However, in the following section a brief description is given of studies in which more
than one type of approach has been used to study a system, such as genome, transcriptome,
and proteome analysis.
Footnote 4: The transcriptome is the complete mRNA abundance of a sample.
1.5.1 Comparative studies
There are several examples of published work in which researchers have characterised a
biological system by applying more than one type of functional genomics technique, and
in the next few years it will become common for researchers to perform parallel analysis
of the transcriptome, proteome and metabolome. In 1999, two papers reported similar
analysis on yeast to determine the global gene expression and to compare this with protein
abundance data, in an attempt to find the correspondence between the rate of transcription
and translation [115, 143]. The paper by Futcher and colleagues [115] compared protein data
from 2-DE, using LC-MS for identification, against mRNA data from SAGE and microarrays.
The results suggested that the correlation between gene and protein expression is high. They
found that approximately one molecule of mRNA gives rise to 4000 molecules of protein. A
study published earlier that year by Steven Gygi and colleagues at the University of Washington
compared data from 2-DE and SAGE, and found a very poor correlation between gene expression
and protein abundance [143]. In their study, certain groups of proteins that had the same
level of abundance had mRNA levels that varied 30-fold. Conversely, genes with similar
levels of mRNA produced proteins that varied up to 20-fold in volume. The difference between
the two studies may result from anomalies in the experimental techniques that produced the
data, or the statistical model used to perform the comparison.
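As an illustration of how the choice of statistical model enters such a comparison, the sketch below computes a Pearson correlation on log-transformed abundances. It is not the method used in either study: the log10 transform, the function names and any input data are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_correlation(mrna, protein):
    """Correlate mRNA and protein abundances on a log10 scale."""
    return pearson(
        [math.log10(m) for m in mrna],
        [math.log10(p) for p in protein],
    )
```

If every protein were present at exactly 4000 copies per mRNA molecule, the log-scale correlation would be 1. Correlating the raw rather than the log-transformed values lets a few highly abundant species dominate the result, which is one way in which two analyses of similar data can reach different conclusions.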
A study by Lee and colleagues in 2004 performed comparative analysis of gene expression
and protein abundance in yeast, using microarrays and 2-DE, to establish which genes and
proteins were up-regulated in a particular mutant strain [192]. Of the 4290 genes assayed by
microarrays, fifty-four were found to be differentially expressed. Eighteen differentially
expressed proteins were observed by comparative 2-DE analysis, of which 14 were
identified by MS. The study revealed that many of the sequences differentially expressed in
both analyses had similar functions, but the overall data sets were too small to perform any
kind of statistical correlation analysis between the rate of transcription and translation. This
study exemplifies the current problems hindering large scale comparison of microarray and
protein abundance results. There are few studies that make protein abundance data publicly
available, and therefore it is difficult to determine how accurately the level of mRNA predicts
the volume of the corresponding protein. For this to be possible, data must be pooled from
several different studies, which requires the deposition of experiments in a public repository,
where the results are formatted in a standard way. Moreover, it is likely that there will
be significant variation in the relationship between mRNA and protein production. This
might occur both at the protein class level, and at the species level. Thus, the discovery of a
single process to govern transcriptional control of protein production may be unlikely. The
problems of standardisation, and public deposition of data, are addressed in the following
chapter.
There are many large FG studies that are currently being performed on a variety of
organisms, and in the next few years it is likely that studies analysing more than one level of
the central dogma (footnote 5) will become widespread. It is clear from the studies that microarrays are
a powerful tool for finding genes that have an important role in a process, but single data
points may not be able to predict accurately the abundance of functional protein, if analysed
independently of the entire data set. Protein abundance values may be a more accurate
measure of the amount of functional material but the experiments are less reproducible, and
cannot be performed at the same throughput level as microarrays. Therefore, a combination
of approaches will provide a more complete picture of the status of the system and the data
will feed into models of cellular and physiological processes, allowing the vision of systems
biology (as described in Section 1.1.2) to be realised. The issues involved with integrating
data from microarrays and proteomics are explored in detail in Chapters 5 and 6.
1.6 Summary
The techniques described in this chapter provide insights into gene and protein function,
with new technological developments allowing researchers to generate very large data sets
on a previously unimaginable scale. The monetary cost of such ambitious experiments is
extraordinarily high, since they are dependent on an expanding range of complex machinery
requiring high levels of technical expertise. Therefore, there is an economic requirement
to maximise the amount of information from each experiment, and to provide flexible data
storage capable of repeated interrogation.
Footnote 5: “The Central Dogma of Molecular Biology” was proposed by Francis Crick to explain that the information flow usually ran from DNA to RNA to protein [64].
An important consideration is how to interpret data from large scale approaches, and how
to place statistical confidence on findings derived from the data. It is critical that more than
one experimental approach is utilised; for example, microarray results are often confirmed
by PCR or Northern analysis, and differential expression of proteins can be confirmed using
antibodies in a Western analysis. The combination of results from more than one level of the
“omics”, for example comparing mRNA and protein level, will enable much higher confidence
to be placed on functional assignments. The data sets will ultimately feed into models that
are used to generate an overview at the level of the whole system. Before this can be achieved,
a significant body of work is required to improve public databases for functional genomics
data, and community wide agreement is required on standard formats to which published
experimental data must conform. An overview of the current work in this area is the focus
of the following chapter.
Chapter 2
Databases, standards and
ontologies for the life sciences
2.1 Introduction
In the previous chapter the techniques that comprise functional genomics research were
described, along with the computational challenges they present. In particular, the focus
was on proteomics research, for which we have developed proposals for a data standard, and
a new database system, described later in the thesis. This chapter contains a description of
the major research developments in database technology for functional genomics (FG) and
other life sciences domains. FG experiments require the development of standard formats
for transferring data between research groups and sending datasets to central repositories.
Ontologies are controlled vocabularies of terms describing a particular domain, and are vital
for data interchange and archiving in FG. Current advances in standards and ontologies are
described.
2.1.1 Computational support for the life sciences
In theory, building good databases for life sciences should be no different from building
conventional databases for commerce, banking and industry; in practice, however, there are a
number of key differences. Relational database management systems (RDMS) have been
designed to support commercial applications with relatively simple data types: most con-
cepts required for a banking database can be represented by strings, integers and floating
point numbers. In addition, this area is standardised to a large extent, as there are well
designed packaged solutions that can be purchased. The huge growth in life sciences data, to
which massive public access is required, presents new challenges to the database community.
Consider that the human genome sequence, even without the annotation of genes, is a set
of 3 billion characters, which must be queried in a number of different ways. It is not easy
to query DNA code stored in tables in an RDMS; additional indexes and software
have therefore been designed de novo and run alongside database applications to provide access to
the data. The situation in functional genomics research is even more complex due to the
heterogeneity of data sets produced by different laboratories.
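As an illustration of the kind of additional index referred to above, the following is a minimal sketch of a k-mer index over a sequence string. It is not the method used by any particular genome database; the function names, the toy sequence and the choice of k are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every substring of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def find_pattern(sequence, index, k, pattern):
    """Return start positions of pattern, using the index to select candidates."""
    if len(pattern) < k:
        raise ValueError("pattern must be at least k characters long")
    # Only positions sharing the first k characters need to be checked.
    candidates = index.get(pattern[:k], [])
    return [pos for pos in candidates if sequence.startswith(pattern, pos)]


genome = "ACGTACGTTAGCACGT"  # toy sequence, not real data
index = build_kmer_index(genome, k=4)
hits = find_pattern(genome, index, 4, "ACGTT")
print(hits)
```

The index is built once and held alongside the stored sequence, so that a query inspects only the candidate positions rather than scanning all characters.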
In proteomics, high resolution images of 2-D gels are an integral part of a data set, to
which significant information must be attached. RDMS can store images, but do not offer
any facilities for querying data within images, or any image comparison. As the field of
proteomics is developing rapidly, there are frequent changes and improvements in the types
of experiment, in laboratory equipment and new software. The number of different data
formats that a bench researcher must deal with is large, and providing an integrated view of
all the data within even a single experiment is a challenge. Once a study has been performed,
researchers often spend significant periods of time searching online databases to characterise
genes and proteins that have been highlighted by their study. Each year the Nucleic Acids
Research journal (NAR) has a special issue, the Molecular Biology Database Collection,
describing all the databases that are freely available over the Internet [117]. In 2004, the
collection contained 548 different databases, many of which are relevant to functional ge-
nomics. Most databases can be queried via the Internet, but the results of queries are often
embedded in web pages that are very difficult to process automatically. Alternatively, many
databases offer a download of their entire contents in a bespoke text format that requires
specific software for handling. A complete data set assembled by a researcher could contain a
great variety of file formats, high-resolution images with annotation, experimental protocols
written in lab books, and large quantities of raw and statistically analysed data. It is vital
that experimental data is made available to other research groups. The publication of results
only in journals is no longer sufficient because data sets are simply too large to comprehend
by reading alone. Research is required to develop local databases for laboratory manage-
ment, and centralised public repositories [273]. Standardisation of formats must occur to
enable developers to create software that can process results into a single file that can be
used for sending data to centralised repositories, or to other research groups.
2.1.2 The future accessibility of data
The remarkable growth of the World Wide Web in the last decade has changed the face of
business and research, by enabling information to be made globally accessible, in an instant.
The Web has altered the way scientists publish their data, as almost all journals are now
accessible on the Internet, and can be searched very rapidly with an index. Our libraries
are not yet defunct, but are certainly under threat. This model of Web publishing is still
far from ideal because almost all web pages are intended to be read and understood by
people, and not by computer systems. Additional software has been created to allow the
Web to be searched, but the search engines utilise only a fairly simple index of the text in
web pages, and generally ignore the context. For example, it would be desirable to be able
to find automatically all the databases in the NAR Molecular Biology Database Collection
which contain information about proteins, query them for a specific protein, and summarise
the results. Unfortunately this will not be possible in the near future because there is no
standard mechanism for automatically discovering the types of data stored in a particular
system, or how they can be accessed. The solution to these problems may be found in the
Semantic Web [342], the next generation architecture of the Web.
The Semantic Web has been proposed by Tim Berners-Lee, the inventor of the WWW, as
a global network of resources that are machine understandable [31]. The basic premise is that
web sites will be created using technologies that allow them to specify the objects described
in the web pages, the relationships between objects and how the web sites can be accessed.
An essential component will be ontologies, which are controlled vocabularies containing terms
that have a strict definition, and a specified source location, to ensure that a version of a
term is used in different contexts with exactly the same meaning. Ontologies can contain
a set of rules associated with terms, which allow the terms to be processed in computer
systems. Software can discover the relationships between terms, and perform reasoning,
to ask logical questions of a resource described using an ontology [133]. A hypothetical
biological example is as follows. All databases within the NAR Molecular Biology Database
Collection are made accessible through the Semantic Web, using a software package that is
freely available, similar to the HTML editors that are used to produce current web pages.
A database specifies what it contains, such as the three-dimensional structures of proteins,
and that it can be accessed by querying with a URL (Uniform Resource Locator) followed
by the term ?query=PROTEIN NAME. The terms that describe the contents and methods of
accessing the database are obtained from a controlled vocabulary that resides elsewhere on
the Web, to ensure that the same terms are used by different databases. Software can then
be developed that automatically discovers the 3-D structure database, queries it for a protein
name, and processes the results as required by the user.
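The hypothetical scenario above can be sketched in a few lines. The resource description, the base URL (structures.example.org) and the function names are all assumptions made for illustration; no real database or Semantic Web toolkit is being described.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical self-description published by a database: what it contains
# and how it is queried. In the scenario, these terms would be drawn from a
# shared controlled vocabulary.
STRUCTURE_DB = {
    "contains": "protein 3-D structure",
    "base_url": "http://structures.example.org/search",
    "query_parameter": "query",
}


def discover(resources, wanted):
    """Select the resources that declare the wanted type of content."""
    return [r for r in resources if r["contains"] == wanted]


def build_query_url(resource, protein_name):
    """Append ?query=PROTEIN NAME (URL-encoded) to the declared base URL."""
    params = {resource["query_parameter"]: protein_name}
    return resource["base_url"] + "?" + urlencode(params)


found = discover([STRUCTURE_DB], "protein 3-D structure")
url = build_query_url(found[0], "cytochrome c")
print(url)
```

The point of the sketch is that the client code contains no knowledge of the individual database: both the discovery step and the query construction are driven by the published description.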
This has clear implications for biomedical research, and it is one of the areas that will
benefit most from the Semantic Web [173]. The life sciences, unlike the axiom-based sciences,
rely on knowledge acquisition about a domain, and have been subject to an unavoidable
historical bias caused by the interests of the particular researcher investigating an area. The
advent of functional genomics removes much of the bias because, rather than an experiment
being designed to test a hypothesis, the experiment itself generates hypotheses about the
function of genes, proteins or entire systems. The results presented in a journal publication
could still be focused on a researcher’s particular interests, but the whole data sets will often
contain far more information than is highlighted in the original publication, which could be
valuable to many other research groups. The Semantic Web has the potential to maximise
the knowledge derived from a single experiment, by making it as widely accessible as possible.
For a knowledge-based science, clearly this will be a major advance.
The Semantic Web will be built using a number of technologies, of which several al-
ready exist (described in Section 2.2). Extensible Markup Language (XML) has become the
primary notation for exchanging information over the Web, and most standard formats for
the life sciences are expressed in XML. XML itself cannot express how concepts are related
to each other; this functionality is offered by the Resource Description Framework (RDF),
which can describe the location of objects on the Web, and how objects relate to each other.
Finally, the development of ontologies will be vital for ensuring that terminology is used in
a standard way, and various formats for expressing ontologies have been developed. Current
progress in ontologies for biomedical research is presented in Section 2.5.
The vision of the Semantic Web may be realised in the next decade, but in the nearer
term many of the concepts can be applied now, to improve the facilities for data publishing
and exchange. The results of functional genomics experiments must be made accessible in
public databases. Later in the chapter there is a description of the public databases that
currently exist for functional genomics data (Section 2.4), although neither the problem of
developing standard access methods, nor the challenge of data integration (Section 2.6), have
yet been solved. The development of central repositories is not possible without standard
exchange formats that researchers must use to express their data sets. A description of
current developments in data standardisation is also given (Section 2.3).
2.1.3 Guide to the chapter
The structure of the chapter is as follows. The formats used to express data standards and
ontologies are described first (Section 2.2). Since the development of public repositories is a
major challenge without common data formats, previous work in standardisation is described
in Section 2.3. A summary of databases that have been developed for life sciences is presented
in Section 2.4. There are a number of newly established efforts to design ontologies to capture
biological information, described in Section 2.5. Finally, there are major efforts by a number
of research groups to bring all the diverse parts of related information together in common
systems (data integration), described briefly in Section 2.6.
2.2 Technology required for data standards
2.2.1 Extensible Markup Language: XML
The emergence of data standards has been tied to the rise in usage of Extensible Markup
Language [101] (XML) as a data interchange format in e-commerce, industry and research.
The importance of XML for bioinformatics has been recognised for some time [2]. An XML
document has a hierarchy of tagged elements, in which the name of the tag describes the data
type that follows. XML has been described as semi-structured data because the document
is self-describing [44], unlike the tuples¹ in a relational database, which have little meaning
in the absence of the database schema. An example of a partial record in the native format
from the PIR (Protein Information Resource) database [254] is given (Figure 2.1), along with
the same data stored in XML (Figure 2.2) and a representation of how the same data could
be stored in a relational database (Figure 2.3).
XML has become the most commonly utilised format for expressing data standards and
ontologies because there are a large number of applications that can automatically process
XML documents [279, 82], unlike bespoke text formats that require processing software to
be re-written every time there is a change to the format. Many life sciences databases now
offer a bulk download in XML format that could be used for data integration, as described in
Section 2.6. Data represented in XML can be validated using a document that specifies what
elements and relationships are allowed in the XML. The current specification for validation
documents is XML Schema [341], which has superseded the initial proposal of the Document
Type Definition (DTD) [75].
¹ A tuple is a term for a row of data in a table of a relational database.
ENTRY CCHU #type complete iProClass View of CCHU
TITLE cytochrome c [validated] - human
ORGANISM #formal_name Homo sapiens #common_name man
...
SUMMARY #length 105 #molecular_weight 11749
SEQUENCE
5 10 15 20 25 30
1 M G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G
31 P N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I I W
61 G E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E
91 E R A D L I A Y L K K A T N E
Figure 2.1: A partial record from the PIR database, in the native PIR format.
<ProteinEntry id="CCHU">
  <protein>
    <name status="validated">cytochrome c [validated]</name>
  </protein>
  <organism>
    <source>human</source>
    <common>man</common>
    <formal>Homo sapiens</formal>
  </organism>
  ...
  <summary>
    <length>105</length>
    <type>complete</type>
  </summary>
  <sequence>MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
  </sequence>
</ProteinEntry>
Figure 2.2: A partial record from the PIR database, released in XML format.
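To illustrate why generic tools can process such a record, the following sketch reads the XML of Figure 2.2 with the parser in the Python standard library. The elided part of the record is simply omitted, and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The record from Figure 2.2, without the elided section.
PIR_XML = """
<ProteinEntry id="CCHU">
  <protein>
    <name status="validated">cytochrome c [validated]</name>
  </protein>
  <organism>
    <source>human</source>
    <common>man</common>
    <formal>Homo sapiens</formal>
  </organism>
  <summary>
    <length>105</length>
    <type>complete</type>
  </summary>
  <sequence>MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
  </sequence>
</ProteinEntry>
"""

entry = ET.fromstring(PIR_XML)

# Element names act as labels, so values are retrieved by path, not position.
entry_id = entry.get("id")
status = entry.find("protein/name").get("status")
formal_name = entry.findtext("organism/formal")
declared_length = int(entry.findtext("summary/length"))

# Remove the line break and padding inside the sequence element.
sequence = "".join(entry.findtext("sequence").split())

print(entry_id, status, formal_name, declared_length, len(sequence))
```

No code specific to the PIR format was required; the same calls would work on any well-formed XML document, which is the advantage over bespoke text formats noted above.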
Protein Entry Table
ID    Name          Status     Length  Type      Sequence  Organism
CCHU  cytochrome c  validated  105     Complete  MGD..TNE  1
...

Organism Table
Organism ID  Source      Common      Formal
1            human       man         Homo sapiens
2            chimpanzee  chimpanzee  Pan troglodytes
3            ....

Figure 2.3: An example of how a partial PIR record could be stored in two relations in a relational database.
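The two relations of Figure 2.3 can be sketched with an in-memory SQLite database. The table and column names are assumptions derived from the figure headings, and the abbreviated sequence value is kept exactly as it appears in the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    source      TEXT,
    common      TEXT,
    formal      TEXT
);
CREATE TABLE protein_entry (
    id       TEXT PRIMARY KEY,
    name     TEXT,
    status   TEXT,
    length   INTEGER,
    type     TEXT,
    sequence TEXT,
    organism INTEGER REFERENCES organism(organism_id)  -- foreign key
);
""")

conn.executemany(
    "INSERT INTO organism VALUES (?, ?, ?, ?)",
    [(1, "human", "man", "Homo sapiens"),
     (2, "chimpanzee", "chimpanzee", "Pan troglodytes")],
)
conn.execute(
    "INSERT INTO protein_entry VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("CCHU", "cytochrome c", "validated", 105, "Complete", "MGD..TNE", 1),
)

# The organism of a protein is recovered only by joining on the foreign key.
row = conn.execute(
    """SELECT p.id, p.name, o.formal
       FROM protein_entry AS p
       JOIN organism AS o ON p.organism = o.organism_id
       WHERE p.id = ?""",
    ("CCHU",),
).fetchone()
print(row)
```

In contrast to the XML record, a row such as ("CCHU", ..., 1) carries little meaning without the schema: the value 1 is interpretable only once the reference to the organism table is known.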
XML was initially intended to be a format for transferring data over the Web, and soft-
ware has been developed for processing XML into different formats or to extract information
for database storage. In recent years there has been a growing momentum towards develop-
ing methods for storage and querying of “raw” XML, because it has been recognised that the
hierarchical, semi-structured nature of XML captures the semantics of certain data in a more
natural way than a relational database representation, particularly for data that has a tree
structure. There is a substantial body of research for improving the facilities for querying
data represented in XML format, and several proposals have been made for query languages,
such as XQuery [357]. In Appendix A, there is a report of work undertaken by the author
to develop a new type of index for fast querying of biological data, represented in XML. The
index has the potential to be extended to aid data integration, as highlighted in Section 2.6.
2.2.2 Resource Description Framework
The Resource Description Framework (RDF), recommended by the W3C², provides a way
of modelling metadata [269]. In the context of RDF, metadata is machine understandable
information describing web pages, but metadata can also have the general meaning of “data
about data”. In this sense, metadata is the real world meaning and context of the data values.
RDF is expressed in XML but, unlike XML, RDF can explicitly specify the properties of
other objects in the document, allowing automated reasoning. The following example is from
the article “What is RDF?” by Tim Bray [40]:
² The World Wide Web Consortium (W3C) is an organisation for the development of technologies and best practice for the Web [352].
<rdf:Description about='http://www.textuality.com/RDF/Why-RDF.html'>
  <Author>Tim Bray</Author>
  <Home-Page rdf:resource='http://www.textuality.com' />
</rdf:Description>
In this example, the excerpt of RDF describes an article on a web page, specifying that the
author is “Tim Bray” and the home page of the web site is http://www.textuality.com. An
RDF description consists of three components: a Resource, a Property, and a Statement. A
resource is any object that has a Uniform Resource Identifier (URI), such as a web page,
or part of an XML document. A property is a resource that has a name, and is a facet
of, or belongs to, another resource. In the example, the author is a property of the article.
A statement is a combination of a resource, property and value, such as The Author OF
http://www.textuality.com/RDF/Why-RDF.html IS Tim Bray.
RDF could be used in the life sciences domain, for instance to describe protein records
in a web accessible database, in which the URI of the record is the resource, and the amino
acid sequence of the protein is a property. The following statement could be deduced auto-
matically:
The Protein Sequence OF www.myProtDB.org/query?myDBId=1A1B IS "MLENT...".
The RDF representation has advantages over a pure XML representation because, while
a person viewing an XML document may be able to deduce that a protein sequence is a
property of a protein record, this could not be done automatically [228]. There are various
biomedical ontologies described below that utilise extensions of RDF. In the field of chemistry
RDF is also used, for example to express the Chemical Markup Language that enables the
interchange of molecular data [220, 131].
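The statements discussed in this section can be modelled as (resource, property, value) triples. The sketch below does not use an RDF library; the Statement type and the values_of function are assumptions made for illustration, and the truncated protein sequence is reproduced as given in the text.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    resource: str  # the URI being described
    prop: str      # the named property of the resource
    value: str     # the value of that property


ARTICLE = "http://www.textuality.com/RDF/Why-RDF.html"
PROTEIN_RECORD = "www.myProtDB.org/query?myDBId=1A1B"

statements = [
    Statement(ARTICLE, "Author", "Tim Bray"),
    Statement(ARTICLE, "Home-Page", "http://www.textuality.com"),
    Statement(PROTEIN_RECORD, "Protein Sequence", "MLENT..."),
]


def values_of(stmts, resource, prop):
    """Answer 'what is the <prop> OF <resource>?' without human interpretation."""
    return [s.value for s in stmts if s.resource == resource and s.prop == prop]


print(values_of(statements, ARTICLE, "Author"))
print(values_of(statements, PROTEIN_RECORD, "Protein Sequence"))
```

Because each property is attached explicitly to a resource, the question can be answered mechanically, which is the advantage over a pure XML representation described above.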
2.2.3 DAML+OIL and the Web Ontology Language
The use of ontologies is a major research area in the life sciences. Several examples drawn
from this area are discussed in Section 2.5. There is a formal language for expressing on-
tologies, which was originally called DAML+OIL because it resulted from the fusion of two
separate efforts [154]. It is now set to become the W3C standard OWL (Web Ontology
Language [238]). OWL is expressed in XML and uses the RDF extension. OWL is a further
extension of RDF because it specifies what the associated objects are, and how they are
related, rather than only specifying a single object with a set of properties. An ontology
expressed in OWL consists of axioms that state the formal relationships between classes and
properties.
For example, an ontology describing genes, transcripts and proteins could be defined as
follows. One relationship could be specified: isTranslated, between the class:mRNA (mod-
elling an RNA sequence record) and the class:Protein (for the protein sequence record).
The class:Protein and class:mRNA both have a textual definition that describes exactly
what is meant by the term. This representation is powerful because it allows reasoning to be
carried out by a computer system, in combination with rules over other objects. The software
could find that the protein sequence is created by translating an mRNA sequence. This kind
of reasoning cannot be done in a purely relational database system, because the semantics
of a relationship are usually only captured by a record having a foreign key that references
another table. The meaning of a relationship in a database can be open to interpretation.
A well designed ontology ensures that every concept and relationship has a clearly indicated
meaning [39].
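The gene, transcript and protein example can be sketched as follows. This is plain Python rather than OWL, and the textual definitions, instance names and function names are assumptions made for illustration; it shows only the principle that a relationship declared between classes can be checked and used by software.

```python
# Classes of the ontology, each with a textual definition (illustrative wording).
CLASSES = {
    "mRNA": "An RNA sequence record.",
    "Protein": "A protein sequence record.",
}

# Declared relationships between classes: (subject class, name, object class).
RELATIONSHIPS = [("mRNA", "isTranslated", "Protein")]

# Hypothetical records, each typed with a class from the ontology.
INSTANCES = {"transcript_1": "mRNA", "protein_1": "Protein"}

# Statements asserted about the records.
ASSERTED = [("transcript_1", "isTranslated", "protein_1")]


def is_consistent(statement):
    """Check a statement against the class-level relationships of the ontology."""
    subject, relationship, obj = statement
    declared = (INSTANCES[subject], relationship, INSTANCES[obj])
    return declared in RELATIONSHIPS


def translated_from(protein_id):
    """Find the mRNA records from which a protein record is translated."""
    return [
        s for (s, r, o) in ASSERTED
        if r == "isTranslated" and o == protein_id and is_consistent((s, r, o))
    ]


print(translated_from("protein_1"))
```

A foreign key in a relational schema would record that the two records are linked, but not that the link means translation or that it runs from mRNA to protein; here both are stated explicitly.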
2.2.4 Unified Modeling Language
An important component of a data standard is an object model that describes a system
independently of the technology that is used for its implementation. Object models are
most commonly expressed in Unified Modeling Language [324] (UML), which is a standard
notation designed to improve the process of developing large software systems [274]. UML
includes components that represent the design and visualisation of the architecture of a
system during development. UML supports the definition of “use case” scenarios and work-
flows which could be used to model the biological research process. UML can also be used
for database design.
The most commonly used part of UML for representing a system is the class diagram. A
class diagram represents real world objects as a set of classes with attributes of certain types
(such as strings, integers, or user-defined), and relationships between classes (see Figure 2.4).
The concept of inheritance can also be represented in UML, in which one class inherits all
the attributes and relationships of another class. It is common in class diagrams to see
multiple subclasses inheriting from a single superclass. This design is intended to reduce the
amount of code required to implement the model because the attributes and relationships
only have to be programmed once for the superclass, rather than repeating code for each of
the subclasses. The concept of inheritance is exemplified in the description of MAGE-OM,
the object model for microarray experiments (Section 2.3.1).
[Figure 2.4, diagram not reproduced. Classes shown: Hospital, Ward, Person, Doctor and Patient, with a package named Staff. Annotations: a class represents a real world object; each attribute has a name and a type (name: String and DOB: date are attributes of Person); a package groups classes; a diamond indicates containment (Ward cannot exist without Hospital); an open arrow indicates inheritance (Doctor and Patient are subclasses of Person and inherit the attributes name and DOB from the superclass); an arrow in a relationship indicates the direction in which the relationship should be implemented; the numbers 1 and 1..n refer to the multiplicity of the relationship (one instance of Hospital can be linked to one or more instances of Ward).]
Figure 2.4: The main components of a UML class diagram for a hospital computer system.
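The inheritance and containment relationships of the hospital example can be sketched in code as follows. The assignment of telephone to Doctor, admission to Patient, and name, address and postcode to Hospital is an assumption made for the sketch, as is the add_ward method.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Person:
    name: str
    dob: date


@dataclass
class Doctor(Person):   # inherits name and dob from Person
    telephone: int


@dataclass
class Patient(Person):  # inherits name and dob from Person
    admission: date


@dataclass
class Ward:
    ward_number: int


@dataclass
class Hospital:
    name: str
    address: str
    postcode: str
    # Containment: wards are created and owned by their hospital.
    wards: List[Ward] = field(default_factory=list)

    def add_ward(self, ward_number: int) -> Ward:
        ward = Ward(ward_number)
        self.wards.append(ward)
        return ward


doctor = Doctor(name="A. Example", dob=date(1970, 1, 1), telephone=5550100)
hospital = Hospital("Example Infirmary", "1 Example Street", "EX1 1EX")
hospital.add_ward(1)
print(doctor.name, isinstance(doctor, Person), len(hospital.wards))
```

The attributes name and dob are written once in Person and are available in both subclasses, which is the saving in implementation effort described above.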
An object model enables developers to have a shared understanding of the components
of a complex system, but it can also be converted into an XML validation document and a
database schema without significant effort. Another use of UML is to support the design of
code for an entire software system, for instance to provide database connectivity, produce
output in a file format, or describe user interactions with the system.
2.2.5 The Object Management Group
The Object Management Group [231] (OMG) is a consortium formed to improve the interoperability
of software systems. The standards defined by OMG are expressed in UML, and other
notations, such as the MetaObject Facility (MOF) [231]. The main component of OMG is
the Model Driven Architecture (MDA). This is a notation for specifying the components of
large software systems for business, which is independent of the technology that will realise
them. A model is first specified in MDA, and it can then be instantiated with any programming
language or platform such as Java [169], C++ [63], .NET [213], and so on. This model insulates
companies from evolution of technologies, and reduces the overhead of re-implementation.
A second benefit of ensuring that a system is described in a platform independent manner
is that it should help the sharing of applications and data across different domains. OMG
is also involved with checking the consistency of object models but it is left to domain ex-
perts to ensure that an object model correctly represents the concepts in the domain. The
OMG has been involved with verifying the object model for the microarray data standard,
described in Section 2.3.1.
2.3 Data standards in the life sciences
The problems of the incompatibility of data from different laboratories have been recognised
by researchers, leading to the development of data interchange formats. In the absence of a
data standard, even if published data is made available from authors’ web sites, the overhead
required to write software to interpret data from a number of different sources is often too
great, and the information is effectively inaccessible. A good data standard should ensure
that sufficient information is stored about the biological samples and experimental protocols
to enable future re-evaluation of the information. This is a major issue for digital archiving
because the volume of data continues to grow very rapidly. It cannot be assumed that it will
be possible to perform manual searches of the literature for all the relevant experiments in
the future, and automated methods will be required. In this section a brief introduction is
given to the established and proposed data standards.
The data format for microarrays, called MAGE-ML (Section 2.3.1), has influenced efforts
in other areas of functional genomics. The draft standard for proteomics, called PEDRo
(Proteomics Experiment Data Repository), is introduced in Section 2.3.2 and is one of the
main focal points of the following chapter. Mass spectrometry (MS) is a crucial part of
proteomic analysis, and was incorporated into the original PEDRo proposals. Data standards
for MS are now under development by a newly formed group, described in Section 2.3.4. In
the rest of the section, there is a description of other data exchange formats that are relevant
to life sciences research.
2.3.1 Microarray standards
Microarray experiments have now become widespread [55] and produce very large amounts
of data that could potentially be useful to researchers in a variety of contexts. The re-
quirements for central repositories of data, and standards for sharing and publishing, were
recognised several years ago [42]. A group of researchers formed the MGED (Microarray
Gene Expression Data) Society for improving the facilities for data sharing [212]. The first
stage of the standardisation process was the release of a checklist of information that should
be made available with a microarray data set to allow future re-evaluation of the data. The
checklist is known as MIAME [41] (Minimum Information About a Microarray Experiment).
[Figure 2.5, diagram not reproduced. Packages shown: ArrayDesign, Array, BioAssay, BioMaterial, BioAssayData, Experiment, HigherLevelAnalysis, AuditAndSecurity, BioEvent, Description, Measurement, Protocol, BioSequence, BQS and DesignElement. Classes shown: Identifiable (identifier, name), Describable, Extendable, NameValueType (name, value, type), Description (text, URI), Audit (date, action) and Security, linked by the associations propertySets, descriptions, auditTrail and security.]
Figure 2.5: The top level of MAGE-OM, reproduced from [212]. There are fifteen packages containing classes to capture different parts of a microarray experiment. There are three classes included at the top level: Identifiable, Describable and Extendable, which can be used by most other classes in the model for linking to additional attributes.
MIAME specifies the parts of experimental protocols, sample details, raw data and analysis
that must be released for an experiment to be understood and potentially reproduced, if
the same biological samples are available. The MIAME guidelines have been accepted by a
number of journals, and they must be satisfied for a publication to be accepted [23, 24, 25].
A formal specification of the microarray requirements was released as an object model,
MicroArray Gene Expression-Object Model (MAGE-OM), expressed in UML. The object
model serves two purposes. Firstly, the class diagrams allow developers to have a shared
understanding of the concepts and relationships in the standard. Secondly, the object model
has been used to generate a software toolkit, available from the MGED website, which allows
developers to create applications that process data into an exchange format, based on the
model. The data format, MAGE-ML [297] (MAGE-Markup Language), is expressed in XML,
and several major databases now accept MAGE-ML for loading data (Section 2.4.1). An
essential component of the standard is the MGED Ontology that consists of a controlled
vocabulary of terms used in microarray experiments (described in Section 2.5).
[Figure 2.6, diagram not reproduced. Classes shown: BioMaterial, BioSource, BioSample, LabeledExtract, Treatment (order : int), BioMaterialMeasurement, CompoundMeasurement, Measurement, Compound (isSolvent : boolean = false), OntologyEntry, DatabaseEntry (accession, accessionVersion, URI), Contact and NameValueType. Association names shown include SourceContact, Sources, treatments, Action, actionMeasurement, qualityControlStatistics, compoundMeasurements, componentCompounds, Type, MaterialType, Characteristics, Labels, MerckIndex and ExternalLIMS.]
Figure 2.6: The BioMaterial package in MAGE-OM, reproduced from [212].
The MAGE object model
The overview of MAGE-OM is displayed in Figure 2.5. There are fifteen packages, each
containing a number of classes to represent part of a microarray workflow. For example,
Array, ArrayDesign and DesignElement describe the features on a microarray, and BioAssay
describes the hybridization of mRNA to the array. MAGE-OM is designed to allow as much
flexibility as possible to ensure that it does not restrict the types of experiment that can be
captured. An example of this is in the BioMaterial package shown in Figure 2.6. The package
is intended to capture the substances that are processed at various stages in the experiment.
A BioMaterial can be one of three types: a BioSource (the source of biological material),
a LabeledExtract (for example the fluorescently labelled mRNA that is hybridized to an
array) or a BioSample (any intermediate between a BioSource and LabeledExtract). This
is an example of inheritance because the three classes inherit relationships from BioMaterial.
The use of inheritance should reduce the amount of programming required to capture this
part of the model because the relationships to other classes only need to be coded a single
time for BioMaterial, rather than once for each of the three more specific classes. One of
the relationships allows the class to reference OntologyEntry, which can be used to specify
a number of characteristics about the material, by obtaining the values from a controlled
vocabulary. Any kind of simple laboratory treatment can be described using a combination
of the class Treatment and the relationship to OntologyEntry, which captures the type of
treatment.
EXAMPLE: The mRNA that is hybridized to an array is captured in LabeledExtract.
LabeledExtract references the set of treatments that have been used to create it, via
Treatment, BioMaterialMeasurement and BioSample. Chemical compounds, such as the
fluorescent labels that are attached, are recorded in Compound. A cycle of treatments can be
described that points back to the original starting material in BioSource.
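The structure described in this example can be sketched as follows. This is a simplification and not the MAGE software toolkit: the intermediate class BioMaterialMeasurement is omitted, so that a Treatment refers directly to the material it was applied to, and the attribute names, sample names and the trace_to_source function are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyEntry:
    category: str
    value: str  # in MAGE-OM the value would come from a controlled vocabulary


@dataclass
class Treatment:
    action: OntologyEntry   # the type of treatment
    source: "BioMaterial"   # the material the treatment was applied to
    order: int = 1


@dataclass
class BioMaterial:
    name: str
    characteristics: List[OntologyEntry] = field(default_factory=list)
    treatments: List[Treatment] = field(default_factory=list)


# The relationships above are coded once and inherited by all three types.
class BioSource(BioMaterial):
    pass


class BioSample(BioMaterial):
    pass


class LabeledExtract(BioMaterial):
    pass


def trace_to_source(material):
    """Follow the first treatment of each material back to the starting material."""
    path = [material.name]
    while material.treatments:
        material = material.treatments[0].source
        path.append(material.name)
    return path


source = BioSource("tissue sample")
sample = BioSample(
    "extracted RNA",
    treatments=[Treatment(OntologyEntry("Action", "extraction"), source)],
)
extract = LabeledExtract(
    "labelled mRNA",
    treatments=[Treatment(OntologyEntry("Action", "labelling"), sample)],
)
print(trace_to_source(extract))
```

None of these classes refers to arrays or hybridization, which illustrates the point made in the text that the package is not specific to microarray experiments.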
This package does not contain any classes that are specific to a microarray experiment, and
therefore could potentially be used to model concepts from other types of functional genomics
experiment. This issue is expanded on in Chapter 4, in which MAGE-OM is combined with
a model of proteomics data to form a proposal for a data standard that we believe can be
extended to cover all functional genomics techniques.
2.3.2 PEDRo
In recent years, the success of MAGE-ML as a microarray standard has encouraged re-
searchers in proteomics to attempt a similar standardisation procedure. The status of pro-
teomics standardisation is the focus of the following chapter but a brief overview is given
here. The Proteomics Experiment Data Repository [315] (PEDRo) object model has been
released to initiate discussion in the community about the requirements for a data stan-
dard. Data standards for proteomics are managed by the Proteomics Standards Initiative
[257] (PSI), which was formed by the Human Proteome Organisation [161] (HUPO). PEDRo
represents a typical proteomics workflow, and consists of four parts:
• Biological sample origin.
• Protein separation techniques.
• Mass spectrometry laboratory protocols.
• Mass spectrometry data analysis.
PEDRo is designed to allow an experiment involving a number of stages of protein separation
to be described, including: 2-DE, affinity columns and chemical treatments. MS data is also
described in the PEDRo model, including support for storage of database searches and the
results of the searches. There are a number of organisations developing standards for MS
to serve different purposes (described below), therefore it is important that a consensus is
reached. A detailed description of PEDRo is given in the following chapter.
2.3.3 PSI-OM
PEDRo was presented to the PSI in 2003 as a proposal for a data standard for proteomics.
A new object model called PSI-OM (Proteomics Standards Initiative - Object Model), loosely
based on PEDRo, was developed in 2004; the author contributed to it at the annual
meetings of the PSI. PSI-OM has a similar structure to PEDRo, covering protein separation
techniques and MS. In the following chapter, there is a description of an object model
we developed (Gla-PSI) that preceded the development of PSI-OM, therefore a complete
description of PSI-OM is given after the section on Gla-PSI.
2.3.4 Mass spectrometry
Mass spectrometry is used in proteomics to identify proteins. An experiment generates raw
data, in the form of a trace, and processed data comprising a list of peaks that correspond to
the masses of peptides. There is a major problem preventing re-analysis of MS data, which
is caused by the proprietary data formats generated by mass spectrometer manufacturers.
Instruments are supplied with software for data collection and analysis. The software can
save an analysis only in a data format that cannot be interpreted
by any other software. Researchers often manually enter the peak heights into a text editor,
for input into database search programs. Proprietary formats pose a major problem for
research throughput and data archiving. It cannot be assumed that the software needed to
interpret the spectra will still be available in the future. It is also not feasible for researchers
wishing to analyse the spectra deposited in databases, to obtain the software that produced
them. Therefore, there is a great need for a data exchange standard that can be interpreted
without specialist software. The standard must support algorithm development for large
scale database searches.
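The requirement for an exchange format that can be read without vendor software can be illustrated with a minimal sketch in Python. It writes a peak list to plain XML and reads it back using only the standard library. The element and attribute names (spectrum, peak, mz, intensity) and the instrument label are hypothetical; they are not taken from any of the formats discussed in this chapter.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z value, intensity)


def peaks_to_xml(instrument: str, peaks: List[Peak]) -> str:
    """Serialise a peak list as a small, self-describing XML document.

    The element and attribute names are illustrative assumptions only.
    """
    root = ET.Element("spectrum", {"instrument": instrument})
    for mz, intensity in peaks:
        # repr() keeps the full precision of each floating point value.
        ET.SubElement(root, "peak", {"mz": repr(mz), "intensity": repr(intensity)})
    return ET.tostring(root, encoding="unicode")


def xml_to_peaks(document: str) -> List[Peak]:
    """Recover the peak list from the XML text, with no vendor software."""
    root = ET.fromstring(document)
    return [(float(p.get("mz")), float(p.get("intensity"))) for p in root.iter("peak")]


# Hypothetical peak list.
peaks = [(842.51, 1200.0), (1045.56, 830.5), (2211.1, 95.25)]
document = peaks_to_xml("example-instrument", peaks)
print(document)
print(xml_to_peaks(document) == peaks)
```

Because the file is plain text with named fields, any program with an XML parser can recover the peaks, which is the property that proprietary binary formats lack.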
There are several proposals for MS standards including GAML (Generalized Analytical
Markup Language [128]), SpectroML and the Analytical Information Markup Language
(AnIML) [13]. Both SpectroML and AnIML have been developed by the National Institute
of Standards and Technology in the USA [222]. GAML is an industry-generated effort to
develop an XML-based data format for analytical instruments. GAML stores values of X/Y
coordinates from a trace, and the parameters entered in the instrument. SpectroML has
similar goals, and was originally developed in collaboration with ASTM, an internationally
recognised standards organisation [18]. SpectroML has now been superseded at the ASTM
by AnIML, which is a wider XML based format for analytical instruments. The PEDRo
model also supports MS data.
A recent project has been initiated at the Institute for Systems Biology, known as
mzXML, which is part of the SASHIMI open source software for downstream analysis of
MS data [278]. The goal of the project is to produce software for processing each of the out-
put formats produced by different instrument vendors, into a single XML file. The mzXML
format can then be analysed with a single piece of software that has a statistical measure
of the likelihood that a correct match has been made to a protein. This should improve the
comparability of data produced by different types of instrument.
The efforts described above are being coordinated by a sub-group of the Proteomics
Standards Initiative, and meetings of the PSI have been well supported by MS instrument
manufacturers. A single proposal, mzData, has been formulated. It is agreed that vendors
will supply software with their instruments for creating output in mzData format. The
first version of mzData describes the raw data from MS, which is the list of peaks on the
trace, and the format also captures the input parameters that are produced by different
instruments [258]. The next version of the format will capture the input parameters and
results of database searches, in addition to the peak list used to identify proteins.
2.3.5 Protein interaction standards
Protein interaction experiments have become widespread, and there are a number of
databases that offer access to large volumes of data arising from Yeast Two-Hybrid and
affinity column experiments, such as BIND [32], DIP [65], MINT [367] and many others.
There is some overlap in the data coverage between the databases, and therefore it is desir-
able that data can easily be exchanged between different systems. This requirement led to
the development of the PSI interaction standard [150], which is now supported by most of
the publicly available databases. The format is being developed incrementally, and the first
release (level 1) covers the majority of data that is currently available. Level 1 can describe
both binary, and more complex interactions, but the format does not include detailed de-
scriptions of the experimental methodology used to generate the data, or a description of
the mechanism of interaction. This kind of data is not widely available at the present time
but may be supported in future versions of the standard.
2.3.6 Other data standards in life sciences
Mathematical models of biological data
The data generated by functional genomics, and traditional biochemistry experiments, reveal
information about the role of proteins and metabolites in a cell, and the interactions between
different components. Researchers have begun to create mathematical models of chemical
reactions and biological processes, which can in theory predict what changes would be prop-
agated to the system when part of it is perturbed. Mathematical models are published
in journals, often represented as a series of equations printed with mathematical symbols
that cannot be interpreted by a computer. Models are also represented by software, and
can therefore be released as computer code. However, there are a large number of different
programming languages and different versions of code, so it is not easy to combine
models that have been developed independently. The problem is further complicated because
processes can be modelled at different physiological levels: cellular, tissue, organ and
organ systems can all be represented mathematically. Researchers would ultimately like to
integrate models represented in different formats, and at different levels of detail.
CellML has been created to standardise the format in which mathematical models of
cellular functions are described [196]. CellML is expressed in XML, and uses constructs from
another well-established format known as MathML [340]. MathML describes mathematical
equations and consists of two types of encoding: content and presentation. The first
expresses what is meant by a mathematical expression; the second deals with how the
expression should be presented for a web browser or printer.
The main constructs of CellML are components and variables, and MathML is used to
specify a mathematical relationship between variables that have been declared by a compo-
nent of the model. CellML also has structures for describing reactions, units, and connections
between different components. The complete specifications for CellML are available through
the web site [51]. It is hoped that researchers wishing to publish a model of a physiological
process will release the model in CellML, allowing future integration with other relevant
models.
The Systems Biology Markup Language (SBML) has been created to model biochemical
networks, such as metabolic pathways or sets of co-regulated genes [155]. Conceptually,
a biochemical reaction can be broken down into a number of components that comprise
the main parts of SBML, including Compartment, Reaction, Rule and several others, each
of which has a textual description, and a number of associated attributes. The format is
expressed in XML and there are various software packages that support the first version
of SBML [311]. The second version of SBML may include MathML support, which could
enable some interchange between models represented in CellML and SBML.
Metabolomics
Metabolomics is a new area of functional genomics: the study of the composition of small
molecules (metabolites) in different samples, using NMR (Nuclear Magnetic Resonance) and
mass spectrometry. The metabolomics community does not currently have a data standard;
however, a data model has been created to record a generic NMR experiment. The
work is part of the Collaborative Computing Project for the NMR community (CCPN).
CCPN contains an object model and a programming interface for creating software [113]. It
is possible that CCPN could contribute to a data standard for metabolomics although it is
likely that additional modules will be required to capture the biological focus and intention
of a metabolomics experiment.
An object model called SysBio-OM has recently been released as part of the Chemical Effects
in Biological Systems (CEBS) database developed by the National Center for Toxicogenomics
in the USA [355]. SysBio-OM covers various components of microarray, proteomics
and metabolomics experiments, however, due to its recent release, it is not possible to say
whether the metabolomics component will gain widespread use in the community. The CEBS
proposal is discussed in detail in Chapter 4.
2.4 Databases for life sciences
Databases are often created by small research communities wishing to disseminate their data
to a wider audience. The problem with this model is that no standard protocols exist for
accessing or querying databases, and many databases have their own text formats to allow
researchers to download the data in bulk. This presents several problems to the user. Firstly,
a researcher may not know about all the relevant databases that exist. This
motivated the creation of the NAR Molecular Biology Database Collection,
which aims to improve awareness of the databases that exist. Secondly, it is very slow to browse or
query all the relevant web sites manually, and assimilate the information by cutting and
pasting into a word processing document or spreadsheet. This problem is partly remedied
by systems like SRS (Sequence Retrieval System) [99], which present pointers to relevant
data items. However, the onus of data acquisition and assimilation of results is still on the
user. Thirdly, the databases are highly dynamic, and some are updated daily. Database
updates most commonly involve new data being added, but errors are also corrected and ID
numbers change with different database releases. Data that has been found by a researcher
may become out of date fairly rapidly, and there are no standard methods for automatic
repetition of the same searches. There are considerable efforts to alleviate these challenges
by employing data integration methods, described in Section 2.6.
A different aspect of the data integration challenge is the storage of heterogeneous data
types within unified systems that can be queried. Chapter 5 describes a database system for
proteomics, which is built on top of an existing microarray database system, as an extension
into a wider system for functional genomics. In this section, a comparison of the features
offered by different microarray databases is given, and the systems that already exist for
proteomics are described. There are several other databases that are highly relevant to
functional genomics research, outlined in Section 2.4.3.
2.4.1 Microarray databases
Chapter 5 describes the development of a database that is capable of storing both proteomics
and microarray data, built as an extension of the RAD (RNA Abundance Database)
system developed at the University of Pennsylvania. However, there are a large number
of different databases for microarrays that offer various different capabilities. A detailed
review of the main features of microarray databases was published by Gardiner-Garden and
Littlejohn in 2001 [119], which is brought up to date in this section (Table 2.1).
ArrayExpress
ArrayExpress at the European Bioinformatics Institute has been developed by researchers
who have been central to the efforts of MGED to standardise microarray data [16]. Ar-
rayExpress accepts public deposition of data, can be queried via a web based interface,
and is MIAME compliant. Data can be sent to ArrayExpress in MAGE-ML format, and
the database can store a significant amount of detail covering experimental protocols and
biological samples.
URL: www.ebi.ac.uk/arrayexpress/
RAD
RAD (RNA Abundance Database) is a system produced at the Center for Bioinformatics,
University of Pennsylvania [302]. RAD is capable of storing single or two channel arrays,
Affymetrix arrays and SAGE experiments (Serial Analysis of Gene Expression). There is
a web based interface for loading data and protocols known as the RAD Study-Annotator
[202]. The database schema for RAD, and the web interface, are freely available. As part
of the GUS (Genomics Unified Schema) system for functional genomics, it supports gene
expression data on several major web sites, such as PlasmoDB [21].
URL: www.cbil.upenn.edu/RAD
Stanford Microarray Database
The Stanford Microarray Database (SMD) [134] is a well established system that stores 160
published array experiments (March 2004), from a number of organisms. The web site can be
queried to retrieve particular studies, and a set of software is available for data visualisation
and statistical analysis, such as graphical output from ANOVA (analysis of variance [88]).
Searches can also be performed for a particular gene or clone across all microarrays. The
software used to generate SMD is freely available, and has been deployed by several other
organisations. SMD researchers are part of the MGED effort, SMD is MIAME compliant,
and there are plans to enable export of MAGE-ML in the future.
URL: genome-www5.stanford.edu
BASE
The BioArray Software System (BASE) is freely available for researchers to download and
install locally [275]. BASE includes a database schema that can be deployed in MySQL
[221], and an interface, which runs on a web server, can be created using PHP [246], Java
[169] and Javascript [171]. Data produced by image processing software can be loaded in
tab-delimited files, and additional software is included for performing statistical analysis.
BASE has several advantages over other similar systems. Firstly, all the software required to
run BASE is freely available: PHP, MySQL and Java. Secondly, all the source code for the
project can be downloaded and altered as required. However, a system based on MySQL is
likely to be less robust than one based on a commercial RDBMS, such as Oracle [235] or DB2
[70], therefore BASE may be more suited to smaller scale microarray databases.
URL: base.thep.lu.se
GEO
The GEO (Gene Expression Omnibus) database is hosted at the NCBI [85]. GEO has
different goals from the other microarray databases discussed so far. The support of the
MIAME guidelines and the MAGE format are not major goals of GEO. In contrast, GEO
aims to act as a large public repository for as wide a range of data as possible. Each
experiment is stored in a simple, tabular format that is indexed to allow searches. Data
can be submitted by any organisation, using either a web based interface or a bulk loading
facility. GEO has been incorporated in the Entrez system3, and therefore information can
be queried in parallel with bibliographic references, and databases of nucleotide or protein
sequences [123]. GEO does not store substantial information about protocols or biological
samples, and can be viewed as a very large data repository rather than a store of microarray
experiments.
URL: www.ncbi.nlm.nih.gov/geo/
3 Entrez is the data retrieval system at the NCBI which performs queries over a large number of different NCBI databases [97], described in Section 2.6.
Yale Microarray Database
Yale Microarray Database (YMD) [54] is in the final stages of testing with a number of data
sources, and is not as well established as ArrayExpress, SMD or RAD. However, YMD offers
certain features not present in other systems. Microarray images are fairly large, and each
experiment can contain hundreds of raw images, each being a TIFF file several megabytes
in size. Most databases choose to store only the processed data, created by software after
analysis of images. YMD includes an image server that enables researchers to obtain raw
images for future re-analysis. It remains to be seen how frequently images will be re-analysed,
but keeping raw data ensures that future evaluation is possible, even if the amount
of data stored grows very rapidly. Experimental protocols can be entered via the Web,
and sample tracking can be performed to link DNA samples to the arrays. Data stored in
YMD can be linked to external resources, and a number of tools are available for performing
statistical analysis. The image server in YMD is both the advantage and disadvantage of the
system: data can be re-analysed but the system may not scale up to very large data sets.
URL: info.med.yale.edu/microarray/
HugeIndex
HugeIndex is a gene expression database developed at Harvard [148]. The database schema is
very simple, containing only four tables and it is intended for storage of microarray results and
limited information about the experiment. HugeIndex is specialised to store gene expression
data from normal human tissues. The query interface allows particular genes to be specified,
or data can be accessed by the type of organ. The initial release in 2002 contained 59
experiments.
URL: HugeIndex.org
Integration across all databases
A scheme for how data can ultimately be integrated across all the databases has been out-
lined by Stoeckert and colleagues [303]. In essence, all databases have different structures,
reflecting the needs and requirements of the local users that are supported by the system.
If data is to be published, it should be made available via the Web, and conform to the
MIAME guidelines that are essentially a checklist of parts of the analysis that must be made
available. However, this alone is not amenable to large scale automatic analysis. For that
to be possible, researchers must either make data available in MAGE-ML format, or submit
Database name  RDBMS                      Web queries  Total expts.                 Source code available  MIAME compliant  MAGE import/export
ArrayExpress   Oracle                     Yes          115                          Yes                    Yes              Import
BASE           MySQL                      N/A          Intended for local setup     Yes                    Yes              Export planned
GEO            Storage of indexed tables  Yes          605                          No or N/K              No               No
SMD            Oracle                     Yes          160                          Yes                    Yes              Export planned
YMD            Oracle                     N/K          N/K                          Not currently          Not currently    N/K
RAD            Oracle                     Yes          16 (RAD), many in GUS sites  Yes                    Yes              Both under dev.
HugeIndex      PostgreSQL                 Yes          59 (2002)                    Yes                    No               Future plans
Table 2.1: Summary table displaying features of microarray databases. Data is correct as of March 2004, except where stated. An N/K symbol (not known) indicates that the information is not readily available.
data to a public database that has an export option for MAGE. Currently, few databases
actually create MAGE-ML, due to the complexity of the format, although almost all, with
the exception of GEO, plan to produce MAGE-ML in the future. When this is realised, it
will be possible to move data seamlessly between public repositories, and for researchers to
download and assemble large datasets, for analysis with locally installed software packages.
2.4.2 Proteomics databases
The following chapter contains a proposal for a standard data format for proteomics, and
covers the current output formats from several databases. There is also a detailed description
of other databases and a comparison with our system in Chapter 5. A brief overview of the
publicly available systems is given here.
There are a number of proteomics databases that can provide access via the Internet.
SWISS-2DPAGE was initially developed in 1993, storing 2-DE images and information about
proteins identified on gels. The proteins often have a link to a record in the annotated
sequence database, Swiss-Prot. SWISS-2DPAGE has an interface containing images of 2-D
gels, which can be used to access information about protein spots [153].
Another proteome database, developed by the Japanese Human Proteome Organisation
(J-HUPO [166]), has an output format known as HUP-ML. HUP-ML is centred on 2-DE data
and experimental protocols, allowing the constituents of solutions and timings to be specified,
similar to sample preparation stages described in MAGE-ML. There are a number of domain
specific proteome databases, storing 2-DE or MS data (a summary of proteomics databases
can be found at WORLD2D-PAGE [351]). In general, the databases store only limited
information about experimental protocols and are not fully integrated with other types of
protein databases. It is a major challenge to integrate distributed proteomics databases
because data is not formatted in a uniform manner, and the databases rarely offer flexible
query facilities.
The GELBANK system has recently been made available over the Internet [20], and
has similar functionality to SWISS-2DPAGE. There is also a 2D-PAGE database at the Max-Planck
Institute in Berlin, storing images of 2-D gels that can be annotated with spot
coordinates which link to pages describing proteins that have been identified [255]. Basic
information about protocols is stored, and gels can be browsed by species. The functionality
of these 2-D gel databases is described in more detail in Chapter 5.
There are no major repositories of mass spectrometry data which have query facilities,
possibly due to the size of the output format for MS and the problems of incompatible data
formats, as reported in Section 2.3.4. One effort that attempts to remedy this situation is
RADARS [106], which is a commercial relational database application for managing large
volumes of data from high-throughput studies. Due to the commercial nature of the soft-
ware it is not possible to assess the functionality of RADARS in practice. Another recent
development is the Open Proteomics Database that allows bulk downloads of raw MS data
in various formats, including mzXML [232]. This system allows public access to a large
amount of data (400,000 spectra), but it requires developers to obtain software to interpret
and manage the spectra once downloaded, and the spectra cannot be queried online. This
prevents it from being used by most researchers, who do not have the time or resources to
obtain software for managing this volume of data.
2.4.3 Other Databases for Life Sciences
The databases for microarrays and proteomics rely heavily on the existence of genome
databases for linking to annotation about gene products, and obtaining the original DNA or
protein sequences. The main databases containing nucleotide and protein sequence data are
GenBank at the NCBI [122], EMBL [91] and DNA Data Bank of Japan (DDBJ) [80]. These
databases are generally considered to contain raw sequence data, although they do contain
some basic annotation, including bibliographic references, the data source and the predicted
intron/exon structure of genes. Data are regularly transferred between the databases us-
ing an agreed mapping, called the Feature Table [72]. Certain records in GenBank have a
link to an external database, such as a curated record in Swiss-Prot [310]. Swiss-Prot has
cross-references to many different databases, including all the raw sequence databases, and
repositories of protein motifs and families.
There has been an effort to unify protein sequence databases in the Universal Protein
Resource (UniProt) system [326], which comprises several components. The main component
is a curated, non-redundant source of all the protein sequences that exist in any database.
There is also a separate archive (UniParc) containing all the identifiers with which sequences
have previously been annotated [325]. The archive contains links to the most recent record
of a protein in UniProt. The archive will enable software to be developed which performs
repeated searches, to find changes to identifiers. This will be particularly important for
datasets that have been assembled locally over time in a laboratory, and which contain
sequence identifiers that do not exist in the current version of a database.
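The kind of repeated search that such an archive makes possible can be sketched in a few lines of Python. The archive is represented here as a dictionary from every identifier a sequence has carried to its current identifier; this structure, the function name update_identifiers and all identifiers shown are hypothetical, and no real UniParc interface is used.

```python
from typing import Dict, List, Tuple


def update_identifiers(
    local_ids: List[str], archive: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Compare locally stored identifiers against an identifier archive.

    Returns a mapping of changed identifiers (old to current) and a list
    of identifiers that the archive does not contain at all.
    """
    changed: Dict[str, str] = {}
    unknown: List[str] = []
    for identifier in local_ids:
        current = archive.get(identifier)
        if current is None:
            unknown.append(identifier)
        elif current != identifier:
            changed[identifier] = current
    return changed, unknown


# Hypothetical archive: every past identifier points to the current record.
archive = {"ID_0001": "ID_0001", "ID_0002": "ID_1002", "ID_0003": "ID_1002"}
local_dataset = ["ID_0001", "ID_0002", "ID_0003", "ID_9999"]

changed, unknown = update_identifiers(local_dataset, archive)
print(changed)
print(unknown)
```

Running such a comparison after each database release would flag the entries in a locally assembled dataset that need to be refreshed, and those that cannot be resolved automatically.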
The protein structure community has initiated a high-throughput approach for obtaining
protein 3-D structures, known as structural genomics. Protein structures are currently stored
in the Protein Data Bank [253] (PDB). Each 3-D structure gives a strong indication of the
function of a protein, particularly if the structure shows a small molecule bound to an active
site. If a structure exists for a protein closely related to those highlighted
in a functional genomics study, it is vital that the structure can be displayed within the context of
the experiment. This will enable protein or gene abundance studies to be correlated with a
detailed functional analysis.
This review covers a very small subset of the most important databases that exist. There
are a great number of resources about genes and proteins which could be relevant to an FG
experiment. The problem of integrating all the diverse databases is highlighted further in
Section 2.6.
2.5 Ontologies
One of the first definitions of an ontology and its potential for data integration was presented
by Gruber in 1993 [138]. The idea of conceptualisation was introduced, expressed as the
following problem: how can we digitally represent objects, concepts and their relationships
that arise from a real world situation? The representation required is, in effect, a simplified
view of the world that is useful for some purpose. The term “ontology” was coined to
describe the exact specifications of the conceptualisation. An ontology usually consists of
a set of terms that represent objects, and their relationships in the real world. The terms
must be associated with definitions that are human readable, describing what the term
means, along with a set of formal rules specifying how the terms can be used in a computer
system. Gruber suggests how ontologies can be used for data integration, using the example
of different bibliographic databases. For example, a rule is specified that describes what an
author is, and how the author relates to their publications. If different databases associate
records with a set of rules, the rules themselves can be used to query the source databases,
without an underlying knowledge of the particular database schema.
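The bibliographic example can be made concrete with a small sketch in Python. It is a considerable simplification of the approach described: the formal rules are reduced to a mapping from two shared terms (author, publication) to the field names used by each source. The two sources, their field names and their records are all hypothetical.

```python
from typing import Dict, List

# Two hypothetical bibliographic databases with different schemas.
SOURCES: Dict[str, List[Dict[str, str]]] = {
    "library_a": [
        {"writer": "A. Author", "title": "Paper one"},
    ],
    "library_b": [
        {"author_name": "A. Author", "paper": "Paper two"},
        {"author_name": "B. Writer", "paper": "Paper three"},
    ],
}

# Each source states how its own fields correspond to the shared terms.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "library_a": {"author": "writer", "publication": "title"},
    "library_b": {"author": "author_name", "publication": "paper"},
}


def publications_by(author: str) -> List[str]:
    """Query every source using only the shared terms.

    The caller never needs to know the field names of a particular schema.
    """
    results: List[str] = []
    for source_name, records in SOURCES.items():
        fields = MAPPINGS[source_name]
        for record in records:
            if record[fields["author"]] == author:
                results.append(record[fields["publication"]])
    return results


print(publications_by("A. Author"))
```

The query is phrased entirely in terms of the shared vocabulary, so a third source could be added by supplying one more mapping rather than by rewriting the query.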
Ontologies will become widely used in the Semantic Web, as highlighted at the beginning
of the chapter, to describe the contents of a web site, and how it can be accessed. This will
enable software to discover automatically the resources that are relevant to the user. In
the rest of the section, a brief description of the software available for developing ontologies
is provided. The major proposals within the life sciences, and other related areas are also
reviewed.
2.5.1 Software for developing ontologies
A number of tools are available for generating ontologies, and they include Protege [230] and
OilEd [28]. The Protege software is available as open-source Java code, developed around a
‘plug-in’ architecture (Figure 2.7). This enables other research groups to adapt the software
for their own use, and develop new plug-ins. Examples of plug-ins include: software for
visualising ontologies in domain-specific ways, tools for merging ontologies, archiving and
querying. OilEd was developed at the University of Manchester and it is designed for the
development of DAML+OIL ontologies. It includes functionality for reasoning over the
ontologies for knowledge acquisition and inconsistency checking. Both Protege and OilEd
are freely available and can export data in DAML+OIL format, enabling the ontologies to be
transferred between editors, which should improve the accessibility of ontology information.
2.5.2 Gene Ontology
A major advance in computational biology is the development of the Gene Ontology
(GO) [125, 126]. GO includes three ontologies: cellular localisation, molecular function and
biological process, for a number of model organisms. Entries for genes are sorted according
to the categories defined by the ontology, and the controlled vocabularies ensure that terms
Figure 2.7: A screenshot of the Protege editor displaying the Gene Ontology for Yeast.
are used with the same meaning in different contexts. For example, the protein Raf-1, which is
involved in the MAP Kinase metabolic signalling pathway, has many entries in GO. One
entry in the biological process branch of the ontology is as follows:
GO:0003673 : Gene_Ontology ( 149784 )
  GO:0008150 : biological_process ( 99849 )
    GO:0009987 : cellular process ( 32926 )
      GO:0007154 : cell communication ( 9155 )
        GO:0007165 : signal transduction ( 6932 )
          GO:0007242 : intracellular signaling cascade ( 2389 )
            GO:0007243 : protein kinase cascade ( 904 )
A database and a user interface have been developed that enable GO to be queried [126].
GO annotations are being added to Swiss-Prot, TrEMBL4 and Interpro5, in a project known
as GOA [46] (Gene Ontology Annotation). Each entry in Swiss-Prot has several keywords
that describe a protein’s function, which were developed prior to the creation of GO. The
keywords have been manually mapped to GO terms. This now allows for automatic retrieval
of GO annotations, once a protein sequence has been found in Swiss-Prot.
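The entry listed above for Raf-1 can be represented as a simple parent lookup, as in the following Python sketch. It assumes that each term in the listing is a child of the term on the preceding line, and it gives every term a single parent; GO itself allows a term to have several parents, so this is an illustration of one path only. The names PARENT, NAMES and path_from_root are hypothetical.

```python
from typing import Dict, List

# Term names copied from the listing above.
NAMES: Dict[str, str] = {
    "GO:0003673": "Gene_Ontology",
    "GO:0008150": "biological_process",
    "GO:0009987": "cellular process",
    "GO:0007154": "cell communication",
    "GO:0007165": "signal transduction",
    "GO:0007242": "intracellular signaling cascade",
    "GO:0007243": "protein kinase cascade",
}

# Child-to-parent links, assuming the listing runs from general to specific.
PARENT: Dict[str, str] = {
    "GO:0008150": "GO:0003673",
    "GO:0009987": "GO:0008150",
    "GO:0007154": "GO:0009987",
    "GO:0007165": "GO:0007154",
    "GO:0007242": "GO:0007165",
    "GO:0007243": "GO:0007242",
}


def path_from_root(term: str) -> List[str]:
    """Return the chain of terms from the root down to the given term."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    path.reverse()
    return path


for depth, go_id in enumerate(path_from_root("GO:0007243")):
    print("  " * depth + go_id + " : " + NAMES[go_id])
```

Walking up the hierarchy in this way is what allows an annotation made at a specific term, such as protein kinase cascade, to be retrieved by a query for a more general term, such as signal transduction.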
4 TrEMBL is an automatically annotated supplement of Swiss-Prot, which contains all the translations of the EMBL DNA database prior to their manual annotation within Swiss-Prot [37].
5 Interpro is a database of protein families and domains [219].
Figure 2.8: The entry for actin in the Gene Ontology, displayed in the AmiGO browser [12].
There are a number of other projects extending GO, and GO is being used by a number
of organisations to add levels of information to gene and protein products (links can be
found from the GO web site [124]). GO is a major advance in molecular biology because
it enables a high level view of large datasets, allowing researchers to generate functional
classifications very rapidly for all genes or proteins in a data set. However, it is vital that
the Gene Ontology is continuously curated and improved, to reduce the number of incorrect
or inaccurate functional assignments. It is becoming common practice for researchers to
obtain the top set of significant results from their study, say 100 genes or proteins, and
assign functional groupings based on GO. The conclusions drawn from the groupings must
be verified by external means, such as further experiments or literature surveys, because it
is possible that errors have been introduced into GO, which may be propagated into other
systems built on top.
Software for GO
A number of software applications are available for viewing and searching GO, of which
several are summarised here. Access from the Gene Ontology web site is provided by the
AmiGO browser [12]. AmiGO presents a view of the GO tree that can be browsed, allowing
users to move up or down the hierarchy of the ontology. Figure 2.8 displays the GO tree for
the human gene actin. In this example, GO suggests that actin is localised in the cytoplasm,
and more specifically to the cytoskeleton of the cell. A gene can be found at many different
places in GO if the gene has been implicated in several different processes, or possibly if there
is conflicting evidence about function. The AmiGO browser has basic search mechanisms for
retrieving entries by GO ID, ontology term or gene name. There is an alternative graphical
view of GO, and parts of the tree can be downloaded in XML format or as a text file.
GoMiner is a stand-alone application written in Java, which provides a view of GO for
a list of genes that are predicted to be up- or down-regulated between two conditions [368].
The software displays where gene names are located in the GO tree and provides statistics
to show branches of GO that contain more up-regulated (or down-regulated) genes. A DAG
(Directed Acyclic Graph) viewer is also included that displays graphically where genes appear
in the tree.
FatiGO offers similar functionality to GoMiner but in a web browser interface that
accepts two lists of gene symbols, corresponding to the genes that are up- or down-regulated in
a study [7]. Summary information is produced outlining where terms appear in the ontology,
for the three different ontology parts. Statistics are provided displaying which parts of the
ontology are matched to genes that are up- or down-regulated in the study. The software
can display information for a specified level in the ontology, from 2 down to 5 (lowest level),
and links to external databases are provided, such as Swiss-Prot and the KEGG database
of metabolic pathways [184]. The usage of GoMiner and FatiGO in practice is demonstrated
in the study presented in Chapter 6.
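The kind of statistic described above, showing whether a branch of GO contains more up-regulated genes than expected, can be illustrated with a short sketch. The test used here is an assumption: the text does not state which statistic GoMiner or FatiGO compute, and a one-sided hypergeometric tail probability is shown only as one common way of posing the question. The function name and all numbers are hypothetical.

```python
from math import comb


def hypergeom_tail(total_genes, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes when
    `selected` genes are drawn without replacement from `total_genes`,
    of which `annotated` belong to the GO branch of interest."""
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(total_genes - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected)


if __name__ == "__main__":
    # Hypothetical study: 1000 genes measured, 40 annotated to one GO branch,
    # 100 genes up-regulated, 10 of which fall in that branch.
    p = hypergeom_tail(1000, 40, 100, 10)
    print(f"P(at least 10 annotated genes by chance) = {p:.4g}")
```

A small probability would suggest that the branch is over-represented among the up-regulated genes. In practice a correction for testing many branches at once would also be needed, which this sketch omits.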
GOblet also provides access to GO via the Web. DNA or protein sequences can be
submitted to a BLAST survey that returns the best matches to sequences in Swiss-Prot and
TrEMBL, which have been mapped to GO terms [149].
2.5.3 MGED Ontology
The MGED Ontology (MO) is a hierarchical collection of terms used to describe microarray
experiments. Each term has a textual description of its meaning, and a specification of
where it should be used in MAGE-OM. MO contains terms that can be used to describe the
origin and characteristics of biological samples, regardless of the usage of the sample. For
this reason, MO could be utilised to describe samples in a number of functional genomics
investigations. In Chapter 4, a proposal is made for a functional genomics data standard,
and a detailed description of the contents and structure of MO is given there.
2.5.4 Other ontologies in life sciences
Ontologies are being created to model various different domains within the life sciences. The
OBO (Open Biology Ontologies) project aims to bring together related ontologies into a
common structure [233]. A set of rules has been established for inclusion within OBO: the
ontologies have to be open and freely available, described in a common syntax (GO or OWL)
and must have a definition that can be understood by people. An organisation has also been
established for unifying the work in ontologies for functional genomics, known as Standards
and Ontologies for Functional Genomics [298] (SOFG). A brief description of some of the
ontologies within OBO is given below.
Taxonomy
The NCBI taxonomy ontology is an important resource for standardising the taxonomic
naming of organisms [224]. The ontology is accessible via the Web, and the records contain
links to other information about the organism through Entrez, such as nucleotide and protein
sequences, expression data and publications.
Anatomy
There are several ontologies covering the anatomy of organisms, such as C. elegans [353],
Drosophila [112], mouse [90, 218] and humans [100]. The SOFG organisation is coordinating
an effort to integrate them to produce a single anatomical ontology. A related project
is XSPAN from the University of Edinburgh, which aims to provide access to anatomical
information from embryos for several model organisms [358].
Sequence data
The Sequence Ontology (SO) project has recently been initiated to capture information
about features on DNA and protein sequences, such as chromosomal variations, gene features
(intron and exon structure) and RNA processing during transcription [286]. It is intended
that genomic databases will be annotated with these terms to facilitate integration across
systems offering different methods of querying.
Metabolic pathways
One of the first major proposals for ontologies for molecular biology was made by Karp in
1995 [183]. Karp presents the idea that knowledge representation could be used to determine
mappings between different databases to aid integration. The architecture proposed by Karp
was influential in a number of data integration projects described in Section 2.6. A database
of E. coli genes and biochemical pathways was later defined, known as EcoCyc [182]. EcoCyc
contains curated descriptions of the function and chromosomal location of all E. coli genes,
and uses an ontology of pathways to allow the knowledge to be formally queried. EcoCyc
presents an integrated view of data derived from a number of sources including genome
databases, bibliographic references, and protein structures.
Summary
In a functional genomics database, many of the ontologies described above could be used for
specifying characteristics of biological samples, genes, proteins or experimental techniques.
Database systems for functional genomics should provide the facility to link out to external
ontologies so that an object can be specified, which is accompanied by an exact definition
that has a meaning outside the scope of the source database. It is hoped that if databases
use ontologies extensively, the vision of the Semantic Web can be realised, and as Gruber
proposed, data integration can become automated. Software can then be developed to recog-
nise objects automatically in different databases which correspond to the same real world
objects.
2.5.5 The Grid and data integration
The Grid is the next generation architecture for high performance computing [132]. The
Grid is a network of computers joined by high bandwidth connections, allowing the creation
of software that assigns a computationally intensive job to the best available resource on the
network. There is a collaborative effort to perform data integration on a large scale via the
Grid, known as OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
[234]. OGSA-DAI comprises many projects aiming to provide access to vast data sets, in
particular in the fields of astronomy, geoscience and biology. One of the major biological
proposals is called myGrid.
myGrid comprises a network of biological web services, such as BLAST and EMBOSS6,
6EMBOSS is an open source package of software for performing common sequence analysis tasks [92].
which must be registered at a central location [300]. Each resource must contain a standard
description of the type of service it offers, and how it can be accessed. Once this infrastructure
is in place, it will be possible to write software that automatically discovers applications that
are available for performing the task required by the user. Each service has a wrapper7 to
enable standard queries to be submitted, and to convert between different input formats.
Queries can be written in OQL [1] (Object Query Language) and submitted to the source
database over the Grid. The system is specifically tailored to an organisation because a
database is maintained at each location, storing a record of the services that have been used
in the context of a particular workflow, thus facilitating their re-use. The local database also
stores an audit trail of what services have been used at what time, with a system that alerts
the developers if an external data source or service changes, such as a new database release,
which may require a search to be performed again.
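The role of a wrapper and a central registry, as described above, can be sketched as follows. This is not the myGrid implementation: the two services, their request and reply formats, and the names Wrapper, REGISTRY and query_all are all hypothetical, chosen only to show how differing inputs and outputs can be converted to one standard form.

```python
def service_a(request):
    # Hypothetical service: takes "id=<accession>" and
    # returns "<accession>|<description>".
    accession = request.split("=", 1)[1]
    return f"{accession}|record from A"


def service_b(request):
    # Hypothetical service: takes {"acc": <accession>} and
    # returns a tuple (accession, description).
    return (request["acc"], "record from B")


class Wrapper:
    """Converts a standard query into a service's native input and the
    service's native output back into a standard result."""

    def __init__(self, service, to_native, to_standard):
        self.service = service
        self.to_native = to_native
        self.to_standard = to_standard

    def query(self, accession):
        return self.to_standard(self.service(self.to_native(accession)))


def _a_to_standard(reply):
    accession, description = reply.split("|", 1)
    return {"accession": accession, "description": description}


def _b_to_standard(reply):
    return {"accession": reply[0], "description": reply[1]}


# Central registry: each service is registered with its wrapper.
REGISTRY = {
    "A": Wrapper(service_a, lambda acc: f"id={acc}", _a_to_standard),
    "B": Wrapper(service_b, lambda acc: {"acc": acc}, _b_to_standard),
}


def query_all(accession):
    """Send the same standard query to every registered service."""
    return {name: wrapper.query(accession) for name, wrapper in REGISTRY.items()}


if __name__ == "__main__":
    print(query_all("X1"))
```

Because every wrapper returns the same result structure, client code does not need to know how each service expects to be called.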
2.5.6 Data standards and ontologies in other fields
Ontologies and data standards are becoming widespread in the life sciences but are also
widely used in commercial applications and other fields of research. A related area is the
development of ontologies of language, which could also have uses in the life sciences. The
WordNet project comprises an ontology of the English language, in which nouns, verbs,
adjectives and adverbs are organised into synonymous groupings, similar to a thesaurus [350].
Synonyms in the life sciences present considerable challenges. In particular, many genes and
proteins have been given more than one name over time, and the synonyms often persist. It
is becoming more common to store experimental protocols and descriptions of hypotheses
alongside raw data, to enable data sets to be retrieved. Resources such as WordNet will
be useful for defining particular concepts in a standard way that could be described using
different, synonymous terms.
2.6 Data integration
Data integration is one of the greatest challenges currently facing bioinformatics [299]. The
Molecular Biology Database Collection contains 548 databases at present, and this is likely
to be an underestimate of the total number of different systems that are available. The
integration challenge can be broken down into two parts. The first is bringing together
7A wrapper is a piece of code that converts the specific inputs and outputs offered by a single application to a standard set of inputs and outputs.
similar types of data, such as genome, transcriptome and proteome, into a single system that
can be adequately queried. The second is discovering and querying
all the resources on the Internet that relate to one particular gene or protein sequence.
The first challenge of data integration is addressed in Chapter 5, in which a framework is
described for storing different types of FG data in one system. A possible solution to the
second challenge has also been addressed, in the context of indexing large collections of XML
data, to generate an integrated query system to a number of databases. An investigation
by the author into XML indexing for biological data is described in Appendix A, which has
been continued in the Xtect project [359] by colleagues at the University of Glasgow and the
University of Strathclyde.
There has been substantial work in the area of data integration in e-commerce and
biomedical fields with the aim of generating single access points to heterogeneous data
sources. In a survey of approaches by Garcia-Molina et al. [118], three general methods are
identified: federation, warehousing and mediation. Federation involves a set of databases
supplying agreed additional information or software for accessing information in a standard way.
Warehousing is a large scale approach of reconstructing local copies of relevant databases by
creating an integrated schema that covers all the constituent databases, and importing data
on a regular basis. Mediation-based approaches send queries to diverse databases, in some
cases via the Internet, and convert the results into a single format. Examples of biological
resources that have utilised these approaches are given below.
2.6.1 Federation
The Entrez system provides access to many different databases based at the NCBI [314].
Entrez queries GenBank, PubMed, GEO and many others (the web site has a complete list
[97]), and provides a number of output formats including HTML, XML and a text format.
However, there is no integration of results; instead, a list of the number of hits in each of
the databases is returned, which must be manually browsed by the user. This process is very
time consuming, especially if a large number of genes are to be queried, such as the top 200
hits from a microarray experiment.
2.6.2 Warehouses
One of the largest efforts to integrate life sciences data has been demonstrated by SRS [99]
(Sequence Retrieval System), which provides access to a large number of databases using
pre-defined hyperlinks. SRS downloads all the source databases at a regular interval and
builds a text index. SRS accepts queries against any type of text in the entry, and allows
users to retrieve a record with a particular ID number. SRS does not post-process the queries
to integrate the information; instead, a list of entries from different databases is returned.
SRS does not support a major query language such as SQL, so complex queries cannot
be made.
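The general idea of a text index over downloaded entries can be illustrated with a small inverted index. This sketch makes no claim about how SRS is implemented internally; the entry identifiers, the entry text and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(entries):
    """Map each lower-cased word to the set of entry IDs that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word.strip(".,;()")].add(entry_id)
    return index


def search(index, word):
    """Return the IDs of all entries containing the word, in sorted order."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    # Hypothetical entries taken from two different source databases.
    entries = {
        "db1:001": "Raf-1 protein kinase",
        "db2:A7": "Actin, cytoskeleton protein",
    }
    index = build_index(entries)
    print(search(index, "protein"))
```

The search returns a flat list of matching entries from both databases. Nothing in the index relates one entry to another, which is the limitation noted above.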
The GUS system from the University of Pennsylvania comprises a large relational schema
that is divided into different namespaces8, which have been developed from separate source
databases. Data from various sources (GenBank, EMBL, DDBJ and others) are downloaded
at intervals, cleaned to remove erroneous annotation, and added to the database. A pro-
gramming layer resides on top of the database to allow queries to be performed. In addition
to genomic data, GUS also stores microarray data, and Chapter 5 describes a proposal for a
proteomics extension to GUS.
2.6.3 Mediator approaches
K2
An approach known as K2 has also been developed at the University of Pennsylvania, which
formulates queries over a number of databases, and presents an integrated view to the user.
K2 originated in a project known as Kleisli that introduced the idea of mediators [68, 348]
and a query language known as Collection Programming Language (CPL). The mediators
describe the data sources in terms of common objects, and provide a mapping from the un-
derlying data source to the objects. CPL can then be used to query the object representation
of the data, even if the underlying data sources do not have query capabilities. The system
has been compared with GUS by Davidson and colleagues [67]. Davidson concludes that the
data warehousing approach is preferable for larger scale, production-strength applications,
and the mediator approach may be favoured for smaller systems for users wishing to browse
data sources via web pages.
TAMBIS and BioDataServer
There are a number of other bioinformatics systems to integrate heterogeneous resources,
including TAMBIS, which has been developed as an interface to a number of databases
8A namespace is a subdivision of a database schema or object model in which all the names of the components are unique.
and tools frequently used by biologists [241]. TAMBIS is developed from the same software
used to generate K2, and uses mediators to access several databases. TAMBIS is supported
by a description logic known as GRAIL [22] that includes rules to link different concepts
together. For example, a protein is formally linked to motifs found in its sequence by the
rule hasComponentMotif. GRAIL is used to formulate queries, and automatically retrieves
information from the relevant database.
Another mediator-based system is BioDataServer [114], which enables information retrieval
over a number of biological databases, with similar goals to K2. BioDataServer generates an
interface that maps the data sources into a relational database, and can be viewed as a cross
between a mediator and a warehousing approach. BioDataServer enables complex queries
to be formulated over the data, even if the underlying query capabilities of the data sources
do not support SQL.
DiscoveryLink
DiscoveryLink from IBM offers access using SQL to a number of databases in distributed
locations [144]. DiscoveryLink processes queries and decides which parts of the query need
to be sent to which database. Each data source has a wrapper that maps the structure
of the source data to the relational model employed by DiscoveryLink. The wrapper also
stores information about the query capabilities of the data source, and maps parts of queries
sent from DiscoveryLink into the format accepted by the data source. For a wrapper to
be developed, it is required that the underlying data source must include an interface that
accepts programmable queries, and must return data in the form of a table. The software
can then process the results after they have been returned. DiscoveryLink does not offer
any kind of semantic integration; for example, the problem of synonyms is not solved, and
redundant data may be returned. If data is modelled differently by different databases, all
the results will be presented to the user, but will not be fully integrated.
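The consequence of combining tabular results without semantic integration can be shown with a short sketch. It is not based on the DiscoveryLink API: the two sources, the gene names and the functions wrap and federated_query are hypothetical. Each wrapper exposes its source as rows of a table, and the mediator simply concatenates the rows.

```python
# Two hypothetical sources that describe the same gene under different names.
SOURCE_A = [{"gene": "ABC1", "function": "kinase"}]
SOURCE_B = [{"gene": "abc-one", "function": "kinase"}]


def wrap(table):
    """Wrapper: expose a source as a function that returns matching rows."""
    def run(gene_names):
        return [row for row in table if row["gene"] in gene_names]
    return run


def federated_query(wrappers, gene_names):
    """Send the query to every source and concatenate the returned rows.
    No attempt is made to recognise that two rows describe the same gene."""
    rows = []
    for source_name, run in wrappers.items():
        for row in run(gene_names):
            rows.append({"source": source_name, **row})
    return rows


if __name__ == "__main__":
    wrappers = {"A": wrap(SOURCE_A), "B": wrap(SOURCE_B)}
    # Querying by one name misses the entry stored under the synonym.
    print(federated_query(wrappers, {"ABC1"}))
    # Querying by both names returns two rows for what is one real gene.
    print(federated_query(wrappers, {"ABC1", "abc-one"}))
```

Resolving the redundancy would require a mapping from synonyms to a shared identifier, which is the semantic step that this style of integration leaves to the user.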
2.6.4 Schema integration
An alternative approach, which could be used to develop a warehouse, is that of schema
integration, which involves matching elements in different database schemas believed to
correspond to the same real world object. Many approaches involve a manual process using
a graphical user interface to match elements from different schemas, which is time-consuming
and error-prone. Attempts have been made to automate schema matching [81], and recent
work has been done to integrate XML data sources. Integration using XML is gaining
popularity in molecular biology, because many databases now offer a bulk download of data
in XML. If a mapping can be produced across different XML Schemas, a warehouse could
be created by importing different databases in XML, and converting data to a standard
representation.
Yang et al. [363] have developed an algorithm that finds matching elements in XML
Schemas and removes differences in the hierarchical structure. The algorithm allows differ-
ent schemas to be weighted according to how representative they are of the system, and
produces an integrated schema. However, it is assumed that elements in different documents
have already been re-named so that real world objects all share the same name in different
schemas. This is not necessarily a trivial task if a great number of different databases are to
be integrated. In molecular biology many concepts have synonyms, and conversely, similar
but non-identical concepts may share the same name. A similar approach relevant to data
integration is XClust [194], which is an algorithm for clustering and integrating DTDs (doc-
ument type definition, the initial proposal for validating XML). The algorithm first searches
for similar DTDs, and then integrates over clusters of related documents. The technique has
been demonstrated for real world data sets derived from e-business, and may be applicable
to biological database integration.
A recent approach by Hunt and colleagues at Glasgow University aims to alleviate the
data integration challenge by developing indexes of the paths9 found in XML documents. If
more than one identical path is found to the same leaf node, containing the same piece of
data, the additional paths are removed to avoid redundancy. The index is created on top of
the SRS system and is stored in a relational database. It is intended that the system will be
used to retrieve data from a large number of databases, for a set of genes or proteins that
are highlighted for further study from a functional genomics experiment [159].
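The idea of indexing the paths found in XML documents can be sketched with the Python standard library. This is an illustration of the general technique only, not the implementation by Hunt and colleagues: the document contents and function names are hypothetical, and the index is held in memory rather than in a relational database. Each distinct combination of path and leaf value is stored once, with the set of documents in which it occurs.

```python
import xml.etree.ElementTree as ET


def leaf_paths(xml_text):
    """Yield (path, text) for every element that has text and no children."""
    root = ET.fromstring(xml_text)
    stack = [(root, "/" + root.tag)]
    while stack:
        element, path = stack.pop()
        children = list(element)
        if not children and element.text and element.text.strip():
            yield path, element.text.strip()
        for child in children:
            stack.append((child, path + "/" + child.tag))


def build_path_index(documents):
    """Map each (path, value) pair to the IDs of the documents containing it.
    Identical pairs are stored once, which avoids redundant index entries."""
    index = {}
    for doc_id, xml_text in documents.items():
        for path, text in leaf_paths(xml_text):
            index.setdefault((path, text), set()).add(doc_id)
    return index


if __name__ == "__main__":
    # Hypothetical documents exported from two databases.
    documents = {
        "d1": "<entry><gene><name>actin</name></gene></entry>",
        "d2": "<entry><gene><name>actin</name></gene><organism>human</organism></entry>",
    }
    for key, doc_ids in build_path_index(documents).items():
        print(key, sorted(doc_ids))
```

A query for a gene name then becomes a lookup on the value part of the key, returning every path and document in which that value appears.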
2.7 Discussion
This chapter describes the current state of the art in database technology for biomedical
research. It is an area that is being driven by both the day-to-day requirements of ex-
perimentalists, and strong theoretical work in computing science. The challenge of data
integration is so great because FG experiments often generate unexpected results that must
be investigated from various perspectives. In the past, a biological investigation required
9A path is the hierarchy of elements that precede textual data in XML.
a researcher to have a comprehensive knowledge of a particular organism, organ, or set of
genes. The situation now is far more challenging, as the results from a functional genomics
experiment could lead an investigator into a great variety of domains. For example, the
top 200 hits from a microarray investigation on liver samples could contain genes that had
previously not been implicated in liver function at all. This would require the investigator
to determine the function of the genes from a number of different angles: protein structure,
modifications, biochemistry analysis from databases or the literature, and several others.
Many of the new developments presented in this chapter aim to improve the facilities for
automating the retrieval of this information.
The vision of the Semantic Web is one of the driving forces of the work on standards and
ontologies, but its realisation is some way off. The technologies that will be used to create
the Semantic Web can be put into practice now, and will greatly improve the capabilities
of computer systems. It is clear that there are major advantages to the use of ontologies in
databases, for web publishing of data and in exchange formats, which are as follows. Firstly,
the problem of synonyms in the life sciences is significant. The names of genes and proteins
have been assigned in the last few decades, often based on some phenotypic characteristic that
has limited relevance now that comparative genomics can discover the same gene in different
organisms. For example, the “wingless” gene in Drosophila has a number of synonyms
reflecting the range of roles that it has in different parts of the organism. It was named
because when its function is removed, the flies have no wings, which is clearly of limited
relevance for the same gene in humans. It is becoming apparent that a new organised
naming system is required that takes into account the role of a gene in different organisms.
This is one of the areas that will be aided greatly by the Gene Ontology. A second problem
is finding a common description of how experiments have been performed (the methods).
An ontology-based description of experimental protocols will aid the retrieval of experiments
stored in databases, and may allow future reasoning over different experiments to find how
they are different. For example, it may be possible to find automatically the genes with
altered expression that differentiate two strains of an organism, if the description of each
strain is well structured using ontologies. The synonym problem also arises in the description
of protocols. For instance, in the description of microarrays in the previous chapter, I avoided
the term “probe”, which is frequently used in the methods section of microarray publications.
“Probe” is used by some groups to mean the features deposited on the array, and by others
to describe the labelled mRNA that is hybridized to the array. It is hoped that the MGED
Ontology (MO) can remove these kinds of problems because it contains terms with strict
definitions that are not open to confusion, and therefore software can be developed to search
for particular terms, knowing that queries will be answered correctly. If MO can be extended
to describe all functional genomics investigations, and gains widespread usage, we will be
some way towards solving the problem of imprecise language that hinders automated analysis.
The work on data standards is essential to allow the creation of public databases that
can be queried, and to allow data sets to be downloaded in bulk for re-analysis with new
statistical techniques. The object models are a vital component that allows developers to have
a shared understanding of large systems. The models also allow software to be developed
for creating standard output in an exchange format, and can act as a bridge between flat
files and database storage. In particular, MAGE-OM has influenced efforts in other parts
of functional genomics because it has gained widespread community acceptance, and it is
forward thinking in the use of ontologies.
It could be argued that many of the data integration methods currently being developed
will not be required if the Semantic Web is successful. In reality, it is the data integration
efforts that are currently on-going that will evolve into the vision of the Semantic Web. The
development of ontologies to describe biological knowledge as it is now, and to describe the
experimental process that was used to produce the knowledge, will be vital. In addition, the
schema matching techniques that aim to find the commonality between different databases
will be a vital intermediate step towards fully interoperable systems. The data integration
methods will help us to learn how different data are structured, and how they can be described
in common terms.
The solution to the data integration challenge is still an open research question. The
majority of research groups performing functional genomics investigations are left with a
laborious, time consuming task, often involving manual Web browsing, to assimilate infor-
mation about genes, proteins and pathways. If systems can be developed that automate this
process, they will free up large amounts of research time that could be better spent else-
where, and new knowledge will be derived by discovering the relationships between different
components of biological systems.
2.8 Conclusions
In this chapter, a brief overview has been given of the different databases that exist for storing
functional genomics data, and the data integration challenges that they present. It is vital
that data standards and ontologies are created to allow researchers to exchange and transfer
data sets to central repositories. An overview of the major proposals has been given. In the
following chapter, the current status of proteomics data standards is described. The main
focus is the development of a new object model that supplements the first draft standard,
and there is a discussion of the future of data sharing and publishing in proteomics.
Chapter 3
An object model for proteomics
3.1 Introduction
The first two chapters outlined the computational requirements of functional genomics ex-
periments, and previous work in databases, standards and ontologies for life sciences. This
chapter comprises two parts. The first focuses on the development of an object model to capture
proteomics data, which we released as a proposal for a standard data format and
published in October 2003 [176]. The first model is referred to as Gla-PSI (Glasgow proposal
for the Proteomics Standards Initiative) and covers studies in which proteins are separated
by two-dimensional gel electrophoresis (2-DE), and identified by mass spectrometry (MS).
Gla-PSI was developed to supplement the draft standard for proteomics, the Proteomics
Experiment Data Repository (PEDRo) originating at the University of Manchester, which
was released in January 2003 [315]. The latter part of the chapter outlines the continued
development of the official standard of the Proteomics Standards Initiative with which we
have been involved. The new object model, PSI-OM (Proteomics Standards Initiative Ob-
ject Model), was initiated in 2004 at the annual meeting of the PSI. We have contributed
to the development in collaboration with other members of PSI. PSI-OM has evolved from
PEDRo, and includes parts of the data model from Gla-PSI.
3.1.1 The emergence of proteomics
The challenge in proteomics research is to characterise the expression of all, or as many as
possible, of the proteins in a sample of interest. Comparative analysis may also be carried
out to determine the difference in protein expression between two or more samples, and the
differences may provide clues as to proteins that are critical for the process being studied,
such as a disease. Two dimensional gel electrophoresis (2-DE) is frequently used to separate
proteins into discrete spots that may be quantifiable, and mass spectrometry (MS) is often
used to determine the observed mass of the peptides in the protein. Observed peptide masses
can be used in a search against a sequence database to identify the protein. There are also
new protein separation techniques, such as multi-dimensional chromatography and affinity
methods, for determining the proteins that are present in a sample. The core techniques of
2-DE and MS have been available to researchers for several decades but it is only in recent
years that large scale analysis has become feasible, forming the field of proteomics. There
have been gradual improvements in the experimental protocols for 2-DE and several new
stains have been designed that improve the linearity of the relationship between visible stain
and the actual amount of protein in a gel spot. Software has been developed for improved
detection and quantification of spots on gels, and matching spots between different gels. The
technology for MS has also moved forward with improved ionisation protocols and detection
mechanisms (described in Chapter 1). However, the main reason for the major shift in
research paradigm towards the global approach has not been related to the improvements
in 2-DE and MS, but can be attributed to the vast increase in the availability of DNA
and protein sequence data in the genome databases. MS is a good method of
high-throughput protein identification only if protein sequences are deposited in a database, or very
closely related sequences exist. Therefore, without the major sequence databases, it would
not be possible to perform large scale proteomic investigations.
3.1.2 Publication of data
In other areas of biology, the deposition of data in a central repository is a prerequisite for
publication: DNA sequences must be deposited in GenBank [30] and protein structures in
the PDB [253]. At present, public access to large amounts of proteomics data is limited.
There is a database of 2-D gels, hosted at the Swiss Institute of Bioinformatics, known as
SWISS-2DPAGE [153], and a similar effort at Argonne National Laboratory, called GELBANK [20].
Both databases offer images of 2-D gels that can be browsed, providing access to limited
information about spots identified on gels (described in Section 3.3). However, the general
availability of proteomics data is poor, and most journal publications only display gel images
and a table of proteins that have been identified. Essentially this information is inaccessible
to computational analysis, even if the data is placed on the author’s web site, because there
is no common mechanism for querying or finding the data. The same issue exists for the
related fields of phylogeny and immunohistochemistry where diagrams of trees, or images
of cells, are reproduced in journals but are not open to computational analysis. The rate
of production of data is too great for researchers to have complete awareness of the protein
expression data that could relate to the system they are studying, and if there is no change
in the way proteomics data is represented, the situation will become far worse.
3.1.3 A central repository for proteomics
There is a major requirement for the development of a central proteome database that
includes 2-DE images, their analysis and MS data. For such a plan to be realised, it is vital
that a standard data model is adopted by the research community to enable experiments
from different laboratories to be compared or queried. A central database must contain
sufficient detail about experimental protocols for the context of the experiment to be fully
understood. It is also important that statistical analysis is captured, to ensure that new
results derived from data are electronically accessible, and can be verified. It is only in
the last two years that efforts have been initiated to develop a standard data format for
proteomics, resulting in the PEDRo proposal released in January 2003 [315]. A community
wide proteomics standard is still some way off, even though in the related field of microarray
analysis the data format MAGE-ML [297] has become well established in a relatively short
time frame. There are several reasons for the delay in finding a consensus on a standard.
The most significant challenge is the complexity of proteomics experiments compared with
microarrays. The identity of each feature on a microarray is known in advance, and matching
data points across a set of microarrays is a trivial task. In proteomics, protein spots on a
2-D gel have to be identified by some process, which may be error prone; single spots may
contain multiple proteins; and multiple different forms of the same protein can appear in
several positions on one gel. The reproducibility of 2-DE has improved greatly but is still
far from 100%. There are also various statistical models of the match quality for proteins
identified with MS data, but no single standard that can be compared across experiments.
The result is that a single proteomics data set is complex, and the experimental methodology
is rarely homogeneous across different laboratories. This presents a major challenge because
it is difficult to create a model that captures all the methods used, and data that may arise
in proteomics experiments. In consequence, heterogeneous data formats are used, which are
difficult to load into a central database that supports queries over experiments produced by
different laboratories.
The expression of all the proteins in a disease sample compared with a normal sample
can facilitate understanding the disease process but the information can also provide an
additional level of information to sequence databases [4]. For example, if experiments to
determine the proteome of human liver cells reveal that a specific protein is abundantly
expressed, the information is functionally significant and should be available to researchers
accessing sequence databases. Additionally, gel spots analysed by MS may reveal peptides
that match a region of genomic sequence that has not been annotated. Therefore, the peptide
sequences can be used to discover new genes, or edit incorrectly annotated genes.
The global protein profile generated by experiments depends upon the conditions under
which the sample was produced and processed, prior to separation by 2-DE. The data may
be valuable to researchers in diverse fields, who could obtain new results from data sets
originally intended for another purpose. Therefore, it is vital that experimental protocols are
rigorously documented, according to a shared standard, and stored in a structured format
that allows searches over biological conditions: species, cell, tissue type; or experimental
conditions, such as: gel constituents, stain, or MS instrument parameters.
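As an illustration of what a structured, searchable protocol record might look like, the following minimal Python sketch stores a few annotations per experiment and filters on them. The class ExperimentRecord, its field names and the three example records are assumptions made purely for illustration; they are not taken from Gla-PSI or PEDRo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentRecord:
    """Hypothetical structured annotation for one proteomics experiment."""
    identifier: str
    species: str
    tissue: str
    stain: str
    ms_instrument: str


def search(records: List[ExperimentRecord], **criteria: str) -> List[ExperimentRecord]:
    """Return the records whose named fields equal every given criterion."""
    return [
        record for record in records
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# Invented example records, used only to exercise the search function.
records = [
    ExperimentRecord("exp1", "Toxoplasma gondii", "tachyzoite", "silver", "MALDI-TOF"),
    ExperimentRecord("exp2", "Homo sapiens", "liver", "Coomassie", "MALDI-TOF"),
    ExperimentRecord("exp3", "Homo sapiens", "liver", "silver", "ESI-MS/MS"),
]

# A search over biological conditions, and one over experimental conditions.
human_liver = search(records, species="Homo sapiens", tissue="liver")
silver_stained = search(records, stain="silver")
```

Free-text protocol descriptions could not be filtered in this way, which is why the structured format matters for a central repository.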
3.1.4 The status of proteomics standards
The Proteomics Standards Initiative (PSI) [257] was formed by the Human Proteome Organ-
isation (HUPO). So far, there have been two annual meetings at the European Bioinformatics
Institute [236, 237] and one meeting in Nice, France in 2004. The PEDRo proposal for a stan-
dard was released to demonstrate that a universal proteomics data format could be feasible,
and to stimulate discussion from the proteomics community about the requirements for a
standard (described in detail in Section 3.3). PEDRo focuses on the experimental techniques
used by proteomics researchers. Gla-PSI was developed at the same time as PEDRo, but
was modified following the release of PEDRo to model in more detail the protein data that
arises in a proteomics experiment (described in Section 3.4). Gla-PSI models 2-DE data, dif-
ference gel electrophoresis, image analysis and statistical analysis of large data sets (Figure
3.1). These data types are not adequately covered in PEDRo and therefore Gla-PSI acts as
a proposal for additional information that should be captured in the community standard.
Gla-PSI allows researchers to store data from any of the image analysis applications that
are available. Statistical analyses performed on data produced from image processing, such
as software, algorithms and the associated parameters, can also be captured. The model
is further specialised to manage difference gel electrophoresis data. Gla-PSI links spots
visualised on a gel, to proteins that have been identified by MS. The model is not a proposal
[Diagram: Overview of a Proteomics Experiment. Stages shown: Experiment Design, Biological Sample, Solubilisation, 2D-PAGE, Image Analysis (tables of spot ID, volume, X, Y and protein), Statistical Analysis, Mass Spectrometry (MALDI, MS/MS), Sequence Database Search, Protein Identification and Global Expression Profile. The legend distinguishes sample flow from data flow.]
Figure 3.1: The data flow in a proteomics experiment. The parts of the analysis covered by Gla-PSI are boxed.
for annotation standards for MS; however, there are a number of groups working towards a
standard for MS under the auspices of PSI, described in Chapter 2. PSI will oversee the
development of a complete model for proteomics that encompasses sample origin, 2-DE and
MS. The current status of proteome standards is presented in Section 3.5.
A new model, PSI-OM (PSI object model), is under development following several work-
shop meetings. The new model has evolved from PEDRo and includes part of the data
model from Gla-PSI. PSI-OM will ultimately be merged with the microarray data model to
form a single unified standard for functional genomics, as described in the following chap-
ter. It has been recognised during the development of microarray standards that controlled
vocabularies (ontologies) are critical for the creation of systems that have enough flexibility
to capture a wide range of experiment types, and allow the information to be queried in
complex ways. An ontology for proteomics is under development, as described in Section
3.5.3. A major contribution towards microarray standardisation was the release of a set
of guidelines for researchers wishing to publish, known as MIAME (Minimum Information
About a Microarray Experiment) [41]. A similar effort is underway in proteomics that will
be released in late 2004 or early 2005 (Section 3.5.4).
The rest of the chapter is structured as follows. Section 3.2 describes the methodology
used to develop Gla-PSI, and how requirements capture was carried out. The previous work
in proteomics data formats and standards is given in Section 3.3. A detailed description
of Gla-PSI is given in Section 3.4. The future development of a community wide proteome
standard, an ontology and guidelines for publication are described in Section 3.5. Section 3.6
includes a discussion of the importance of standards for proteomics, and the current status
of public access to proteomics data.
3.2 Methods
The early stages of developing a standard involved the creation of a prototype database for
2-DE and MS data by the author for a Master’s degree by research [174]. The database high-
lighted the challenges of integrating heterogeneous data types, and capturing experimental
protocols, in a structured format. The prototype demonstrated that many of the questions
posed by biologists could not be answered using current technology, a problem that would be
solved by the development of a central repository and appropriate query tools.
Case studies into proteomics investigation have been carried out (Chapter 1) which
demonstrated the requirement for new bioinformatics tools to facilitate the analysis of large
protein data sets. The case studies also highlighted significant challenges in data integra-
tion and systems development, and found several areas in which proteomics techniques are
employed:
• Proteome cataloguing: determine the entire set of proteins expressed in a cell type,
organelle or microorganism.
• Hypothesis generation: discover proteins whose function may be important in the
condition of interest.
• Protein regulation: discover sets of proteins that share patterns of expression across
a range of sample conditions.
• Correlating gene and protein expression.
• Post-translational modifications: including phosphorylation, glycosylation
and acetylation.
The case studies also revealed that a critical factor required for aiding proteomics research
is the development of a data standard. Therefore, Gla-PSI was initiated to model data
from 2-DE, difference gel electrophoresis, image analysis and statistical processing. The
development of the model was driven by analysis of real data sets, and an understanding
of the types of queries that researchers would like to pose. The experimental basis for Gla-
PSI was established over a significant period in which requirements capture was performed
(Table 3.1). A number of interviews were held with principal investigators in laboratories
performing proteomics investigations. Time was also spent shadowing bench researchers to
gain a better understanding of the techniques involved in the research. Finally, literature
surveys were performed into functional genomics investigations, databases for life sciences,
and data standards in other fields to learn what procedures are commonly used to model
complex domains. During the development of Gla-PSI, regular meetings were held to present
the model to biological researchers, gaining feedback to ensure that a database based on the
model would cover all the data types that are required.
The data flow shown in Figure 3.1 outlines the stages in which information must be
captured in a proteomics experiment, and the boxed area represents the part of the anal-
ysis covered by Gla-PSI. Gla-PSI is expressed in Unified Modeling Language [324] (UML,
described in Chapter 2) and was developed using the UML modelling tool Rational Rose
[266]. Gla-PSI comprises class diagrams in UML to represent the concepts, objects and
relationships in a proteomics experiment.
Name | Position | Meetings | Time span | Description
Dr Jonathan Wastling | Principal investigator | 50 | 2001-2004 | Dr Wastling runs a laboratory that uses proteomic techniques to investigate parasitology. Many meetings were held in which different proteomic technologies were discussed along with the computational challenges they present.
Aude Foucher | PhD student | 5 | 2001 | Miss Foucher supplied data sets for the first prototype database and evaluated the system.
Adrian Cohen | PhD student | 5 | 2001 | Mr Cohen used proteomics to catalogue the expressed proteins in the parasite Toxoplasma gondii and supplied test data for the first prototype database.
Dr Chris Ward | Post-doctoral researcher | 5 | 2002-2003 | Dr Ward presented his work using proteomics to identify the proteome of an organelle from Toxoplasma gondii and supplied data for testing.
Prof. Walter Kolch | Principal investigator | 5 | 2003-2004 | Prof. Kolch is head of a laboratory at the Beatson Institute for Cancer Research. The future developments of proteome databases have been discussed on several occasions.
Alex von Kriegsheim | PhD student | 3 | 2003 | Mr von Kriegsheim is a researcher at the Beatson Institute for Cancer Research and performs DIGE analysis. The coverage of the Gla-PSI model was discussed in a series of meetings.
Morag Nelson | PhD student | 30 | 2002-2004 | Miss Nelson is investigating the differential expression of proteins in host cells when invaded by a parasite, compared with non-invaded cells. Miss Nelson produced the data that is analysed in Chapter 6.
Prof. Mike Turner | Principal investigator | 5 | 2003-2004 | Prof. Turner is head of a laboratory that investigates the mechanism of action of trypanosomes and malaria. One of the techniques employed is proteomics. There have been several discussions of the requirements for the database and the annotation of the genome sequence.
Anne Faldas | Research assistant | 20 | 2003-2004 | Miss Faldas is cataloguing the proteome of the parasite Trypanosoma brucei (Chapter 7).
Table 3.1: A summary of the interviews held with researchers to formulate an understanding of proteomics research.
3.3 Previous work
Gla-PSI was released as a proposal for information that should be captured in a community
standard for proteomics, in addition to what is captured in PEDRo. In this section a
detailed description of PEDRo is given, along with a brief description of other data formats
for proteomics.
3.3.1 SWISS-2DPAGE
The SWISS-2DPAGE system was first established in the early 1990s as a web repository of
2-D gel data [153]. The web interface contains gel images overlaid with a map of spots which
has hyperlinks to other web pages for individual spot records. The spot records can be linked
to corresponding entries in the protein sequence database Swiss-Prot. The functionality of
the database is discussed in more detail in Chapter 5. The system utilises a textual data
format for specifying 2-DE and protein spot data, which is similar to the format of the Swiss-
Prot database, and was considered as a candidate format during the standardisation process
(see the SWISS-2DPAGE website for a sample record [309]). The format contains some
information about how the protein was identified, such as the peaks produced from mass
spectrometry, and can incorporate links to bibliographic references and other databases.
However, there is limited information about the protocols employed to create the gel. The
format does not include the method of scanning to create the gel image, or the software used
to analyse the image. The minimum set of information that must be supplied is also very
limited; certain entries therefore contain only the protein name, species of origin
and identifiers for the protein and gel. A data standard for proteomics requires a wider and
more complex specification of the minimum information that should be captured for each
protein entry.
3.3.2 GELBANK and HUP-ML
A similar format is produced by the GELBANK database (the data format is displayed in
Babnigg and Giometti 2004 [20]). The GELBANK text format is similar to SWISS-2DPAGE
but contains slightly different information about the gel protocol, and has different formatting.
Protein spots are stored with the following information: gel position, the observed
molecular weight (MW) and isoelectric point (pI) of the protein, the theoretically calculated MW and
pI, the protein name and its accession number. There is no current facility for linking to MS
data that would enable the quality of the protein match to be assessed. The Japanese Human
Proteomics Organisation (J-HUPO) has also produced a proteomics data format, HUP-ML
(HUman Proteome Markup Language) represented in XML. HUP-ML has been presented at
past PSI meetings, and contains more detailed information about sample processing prior to
2-D gel electrophoresis. There is a DTD (document type definition) available for validating
HUP-ML [160]. The developers of HUP-ML are committed to the PSI development process
and will produce a mapping from HUP-ML to the finalised standard of PSI.
Figure 3.2: The complete PEDRo model represented in UML, reproduced from [315].
Figure 3.3: The classes that record biological samples in PEDRo, reproduced from [315].
3.3.3 PEDRo
The Proteomics Experiment Data Repository (PEDRo) from the University of Manchester
was created to address the requirements for a proteomics standard and covered four parts
of the analysis: sample generation, sample processing, MS protocols, and MS data analysis.
The complete PEDRo model is displayed in Figure 3.2; the four parts of the analysis are
represented by different shading in the four sections of the model. The sample generation
part is shown in Figure 3.3. An overview of the experimental hypothesis and citations for
methods and results are captured in the class Experiment. There is a relationship to the class
Sample and SampleOrigin for recording basic details about the type of material on which the
experiment is being performed, along with genotype information in Organism. PEDRo was
originally designed for capturing data from experiments with yeast, therefore the description
of sample is focused on cell cultures and has very limited facilities for recording any detailed
phenotype information about larger organisms.
Protein separation in PEDRo
Figure 3.4 summarises the classes for capturing protein separation techniques. The Sample
class is a subclass of Analyte (Figure 3.3) and separation techniques are modelled as sub-
classes of AnalyteProcessingStep. The substance on which a separation technique is per-
formed (the input) is modelled by a relationship from Analyte to AnalyteProcessingStep.
Sample, a subclass of Analyte, is thus directly related to the first separation technique
(AnalyteProcessingStep) performed on it. The separation techniques are modelled by
Figure 3.4: The part of PEDRo covering protein separation techniques, reproduced from [315].
classes, such as Gel, Column and ChemicalTreatment. The products of separation (outputs)
are modelled by the classes GelItem, Fraction and TreatedAnalyte. The inheritance re-
lationship enables a series of treatments to be specified where the product (output) of one
treatment becomes the input for another. 2-D gel data is represented by the attributes in
GelItem, and spots matched between gels can be captured in RelatedGelItem. The method
used to perform comparative gel analysis is not recorded in PEDRo.
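The chaining of separation steps described above can be sketched in a few lines of Python. The class names Analyte, Sample, Fraction, GelItem, AnalyteProcessingStep, Column and Gel follow the PEDRo classes named in the text; the attribute names (name, produced_by, description, input_analyte, outputs) and the provenance function are assumptions introduced for illustration and are not part of the PEDRo specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Analyte:
    """Any substance that can be the input to a processing step."""
    name: str
    produced_by: Optional["AnalyteProcessingStep"] = None


class Sample(Analyte):
    """The starting material of the experiment."""


class Fraction(Analyte):
    """Output of a column separation."""


class GelItem(Analyte):
    """Output of a gel separation, for example a spot."""


@dataclass(eq=False)
class AnalyteProcessingStep:
    """A separation technique applied to exactly one input analyte."""
    description: str
    input_analyte: Analyte
    outputs: List[Analyte] = field(default_factory=list)

    def produce(self, output: Analyte) -> Analyte:
        # Record which step created the output so the chain can be traced back.
        output.produced_by = self
        self.outputs.append(output)
        return output


class Column(AnalyteProcessingStep):
    """Chromatographic separation."""


class Gel(AnalyteProcessingStep):
    """Gel-based separation."""


def provenance(analyte: Analyte) -> Tuple[List[str], Analyte]:
    """Walk back through the processing steps to the original analyte."""
    steps: List[str] = []
    while analyte.produced_by is not None:
        step = analyte.produced_by
        steps.append(step.description)
        analyte = step.input_analyte
    return list(reversed(steps)), analyte


# Because outputs are themselves analytes, a fraction can be loaded onto a gel.
sample = Sample("cell lysate")
fraction = Column("ion-exchange column", sample).produce(Fraction("fraction 3"))
spot = Gel("2-D gel", fraction).produce(GelItem("spot 17"))
steps, origin = provenance(spot)
```

Here steps lists the two separations in the order they were performed, and origin is the original sample, which is the kind of traversal a repository would need in order to relate a gel spot to its biological source.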
Mass spectrometry in PEDRo
The third section models the type of ion source for a mass spectrometer and the machine
parameters (Figure 3.5). The protein sample, and its analysis, are represented by the re-
lationship from Analyte to MassSpecExperiment, enabling a link to a gel spot, column
fraction or output from another type of treatment.
MS data itself is represented in the fourth part of the model (Figure 3.6). The data in
MS is typically a list of peaks from an MS trace. Database searches that are carried out
to identify proteins from the MS data are captured by DBSearch and DBSearchParameters.
Peptides that are matched by the data are represented by PeptideHit and protein records
Figure 3.5: The model of MS ionisation and protocol in PEDRo, reproduced from [315].
Figure 3.6: MS data and database searches modelled in PEDRo, reproduced from [315].
that have been matched are modelled by ProteinHit and Protein. There is a relationship
between ProteinHit and RelatedGelItem that enables a direct link from gel spots to the
proteins to which they have been matched, without traversing the entire set of MS data and
analysis. There are a large number of attributes in most of the classes that are representative
of the properties of the experiment that researchers may wish to store. However, for certain
concepts it is very difficult to cover all the possible attributes, for example different database
search programs offer a large range of parameters that cannot all be explicitly specified in
the model. Therefore, the class OntologyEntry is used to specify additional attributes that
can be added where required, by obtaining the relevant term from a controlled vocabulary.
The development of an ontology for proteomics is introduced in Section 3.5.3, and there is a
detailed discussion of ontology usage in the following chapter.
3.4 Gla-PSI: A model for 2-D gel electrophoresis and analysis
This section includes a detailed breakdown of the components in Gla-PSI. The UML concepts
of classes and attributes are used to represent objects in a proteomics investigation, and
relationships have been created between classes to model the links between items in an
experiment. The complete model is shown in Figure 3.7, and the following sections describe
each part of the analysis in turn. A case study demonstrating how the model captures data
from a difference gel electrophoresis experiment is given in Appendix D.
3.4.1 Overview of the experiment and protein extraction
Gla-PSI does not contain a complete proposal for describing the overview of an experiment;
however, we believe that there are classes in MAGE-OM that can adequately describe the
hypothesis of a proteomics investigation and the biological samples used. Experimental
protocols for recording protein extraction and solubilisation can also be described in MAGE-OM.
In our original publication describing Gla-PSI [176], the exact details of how protein
samples and protocols can be recorded in MAGE-OM were not given; however, the following
chapter describes the complete integration.
3.4.2 Two-dimensional gel electrophoresis
A complex mixture of proteins can be separated by a number of techniques, including: two-
dimensional gel electrophoresis (2-DE), chromatography, affinity column and others. Gla-PSI
is focused around 2-DE, which is the most widely used technique for protein separation in
[Diagram: UML class diagram. The classes and attributes shown are listed below; association multiplicities are not reproduced. The legend distinguishes new classes in the model from classes derived from MAGE or PEDRo.]
Identifiable: identifier : String, name : String
Describable (no attributes shown)
Description: text : String, URI : String
BibliographicReference: title : String, authors : String, publication : String, publisher : String, editor : String, year : Date, volume : String, issue : String, pages : String, URI : String
Database: version : String, URI : String
DatabaseEntry: accession : String, accessionVersion : String, URI : String
ExternalReference: exportedFromServer : String, exportedFromDB : String, exportID : String, exportName : String
OntologyEntry: category : String, value : String, description : String
OntologyRef, Type (no attributes shown)
Parameter: parameterType : String, parameterValue : String, parameterUnit : String
ExperimentDesign, ExperimentParameters, ProteinPreparation (no attributes shown)
2D-PAGE: pI_start : Double, pI_end : Double, mW_start : Double, mW_end : Double, percentAcrylamide : Double, solubilizationBuffer : String, stainDetails : String, dimensionX : Double, dimensionY : Double, dimensionZ : Double, dimensionUnit : String
ScannedImage: scanner : String, fileURI : String, resolution : Double, contrast : Double, brightness : Double, wavelength : Double, dimensionX : Integer, dimensionY : Integer
ImageAnalysis: softwareName : String, version : String, fileURI : String, imageProcessing : String
Spot: volume : Double, normalisedVolume : Double, area : Double, peakHeight : Double, xCoord : Integer, yCoord : Integer, experiment_pI : Double, experiment_mW : Double, radius : Double
SpotRefs: spotID : String
MatchedSpots: quality : String, description : String
MultipleAnalysis: analysisType : String
StatisticalAnalysis: software : String, version : String, algorithm : String, dataFile : String, analysisType : String
DIGEAnalysis, SpotSets (no attributes shown)
DIGESingleImage: dyeLabel : String, isMasterGel : String, volumeAverage : Double
DIGESingleSpot: volume : Double, peakHeight : Double, normalisedVolume : Double
SpotRatio: id1 : String, id2 : String, normalisedRatio : Double, quality : String, ratioType : String
Protein: id : String, mW : Double, pI : Double, accession : String, swissProtID : String, pirID : String
IDEvidence, MassSpec (no attributes shown)
Notes on the diagram: The stages preceding image analysis have been presented in models: MAGE http://www.mged.org and PEDRo http://pedro.man.ac.uk. All classes are subclasses of Identifiable and Describable (not shown). Therefore, all classes can have an identifier attached and be linked to annotation classes.
Figure 3.7: The complete Gla-PSI object model represented as a UML class diagram.
[Diagram: UML classes 2D-PAGE, ScannedImage and ImageAnalysis. Attributes in 2D-PAGE have been derived from the PEDRo model: http://pedro.man.ac.uk]
Figure 3.8: A model of 2-DE data, and a scanned gel image.
proteomics. A standard for 2-DE must capture the conditions under which the gel was run.
The conditions include the dimensions and voltages applied to the pH strip, gel dimensions,
buffers, and staining procedures. Many of these parameters are covered in PEDRo, and
certain attributes from PEDRo have been reproduced in the 2D-PAGE class. Once a gel has
been run, there is a significant amount of information that must be captured, which is not
adequately covered in PEDRo. Initially, a gel is scanned and a raw image is produced. Gla-
PSI incorporates this process, recording the details of the scanner and the image produced, in
the class ScannedImage (Figure 3.8). The model allows multiple instances of a scanning event
to cover cases where researchers have re-scanned a gel, for example, at different resolutions.
ScannedImage has attributes for the resolution, contrast, and brightness of the image, to
allow different versions of the same image to be stored. The dimensions of the image in
pixels are also stored. The image derived from a scanned gel becomes the input for the next
part of the model: image analysis.
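The relationship between a gel and its scans can be sketched as follows. This is a minimal illustration rather than the Gla-PSI implementation: the attributes are a subset of those shown for 2D-PAGE and ScannedImage in Figure 3.8, renamed to Python conventions, and the class name TwoDPage, the method names, the file URIs and the resolution unit are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScannedImage:
    """One scanning event for a gel."""
    scanner: str
    file_uri: str
    resolution: float   # unit is an assumption here (for example, dots per inch)
    dimension_x: int    # image width in pixels
    dimension_y: int    # image height in pixels


@dataclass
class TwoDPage:
    """A single 2-D gel (the 2D-PAGE class), holding any number of scans."""
    stain_details: str
    percent_acrylamide: float
    scans: List[ScannedImage] = field(default_factory=list)

    def add_scan(self, image: ScannedImage) -> None:
        """Attach a further scan, for example a re-scan at another resolution."""
        self.scans.append(image)

    def highest_resolution_scan(self) -> ScannedImage:
        # max() raises ValueError if the gel has not been scanned yet.
        return max(self.scans, key=lambda image: image.resolution)


# Invented example: the same gel scanned twice at different resolutions.
gel = TwoDPage(stain_details="silver", percent_acrylamide=12.5)
gel.add_scan(ScannedImage("scanner-A", "file:///gels/gel1_low.tif", 150.0, 1000, 1200))
gel.add_scan(ScannedImage("scanner-A", "file:///gels/gel1_high.tif", 300.0, 2000, 2400))
best = gel.highest_resolution_scan()
```

Keeping every scan as a separate record, rather than overwriting a single image attribute, preserves the link between each analysis and the exact image it was performed on.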
3.4.3 Image analysis
A number of software packages can be used to analyse scanned 2-D gels. The software is able
to perform edge detection on an image to determine the coordinates, volume, area, and other
properties of protein spots. The class Spot accommodates many of the properties produced
by current software packages; however, it is not possible to include all measures that may be
produced by current or future software versions (Figure 3.9). The class Parameter, with the
attributes parameterType, parameterValue and parameterUnit, is used to cover data types that are
not explicitly included in the model (shown on Figure 3.7). Values for these attributes will
be obtained from a controlled vocabulary to ensure consistency. The class ImageAnalysis
[Diagram: UML classes ImageAnalysis, Spot, SpotRefs, MatchedSpots and MultipleAnalysis.]
Figure 3.9: The classes capture data from image analysis applications, including multiple analysis across a number of gels.
[Diagram: UML classes Spot, Protein, IDEvidence, MassSpec, DatabaseEntry, Database, SpotRefs, MatchedSpots, MultipleAnalysis and Parameter.]
Figure 3.10: The relationship between spot data (Spot) and identified proteins (Protein). The evidence for a spot being matched to a protein, such as MS data, can be added to the relationship, although Gla-PSI does not have a specification of MS data.
records the software package and a description of image processing that has occurred.
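The use of generic parameters to extend Spot can be sketched as below. The attribute names follow the Parameter and Spot classes in Figure 3.7 (parameterType, parameterValue, parameterUnit), renamed to Python conventions; the contents of the controlled vocabulary, the method name extra and the example values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical controlled vocabulary of additional spot measures.
ALLOWED_PARAMETER_TYPES = {"circularity", "background", "saturation"}


@dataclass
class Parameter:
    """A measure that is not an explicit attribute of Spot."""
    parameter_type: str
    parameter_value: str
    parameter_unit: str = ""

    def __post_init__(self) -> None:
        # Reject terms that are not in the controlled vocabulary.
        if self.parameter_type not in ALLOWED_PARAMETER_TYPES:
            raise ValueError(f"unknown parameter type: {self.parameter_type}")


@dataclass
class Spot:
    """Explicit attributes cover common measures; parameters cover the rest."""
    x_coord: int
    y_coord: int
    volume: float
    area: float
    parameters: List[Parameter] = field(default_factory=list)

    def extra(self, parameter_type: str) -> Optional[str]:
        """Return the value of an additional measure, or None if absent."""
        for parameter in self.parameters:
            if parameter.parameter_type == parameter_type:
                return parameter.parameter_value
        return None


# Invented example: a software-specific measure attached to a spot.
spot = Spot(x_coord=23, y_coord=24, volume=454.0, area=12.5)
spot.parameters.append(Parameter("circularity", "0.92"))
circularity = spot.extra("circularity")
```

Validating the parameter type when the record is created is one way of enforcing the consistency that the controlled vocabulary is intended to provide.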
3.4.4 Protein spots
There are separate classes for spots identified on a gel (Spot), and proteins (Protein) to
which spots may be matched (Figure 3.10). The relationship between Spot and Protein
allows one or more spot records to be linked to one or more protein records. The cardinality
is displayed by 0..n to 0..n on the relationship between the two classes. The relationship from
Spot to Protein is modelled in this way because there are known instances where a single
spot contains a number of different proteins. In the opposite direction, it is possible that
a particular protein arises in a number of different positions on one gel. The relationship
[Diagram: UML classes DIGEAnalysis, DIGESingleImage, DIGESingleSpot, SpotSets, SpotRatio, Spot, SpotRefs, MatchedSpots and MultipleAnalysis.]
Figure 3.11: Classes for storing difference gel electrophoresis data.
is linked via an attribute that includes the evidence for the match, such as any MS data
that is available. This is very important because any findings based upon the predicted
expression of a protein should take into account the probability that the protein has been
identified correctly. Gla-PSI does not model MS data but a finalised MS standard should
be integrated at this position. The Protein class contains sufficient information such that
a repository based on the model can link directly to external databases. A single protein
may have entries in a number of databases that may be relevant to the experiment, such as
GenBank, Swiss-Prot, PIR, or domain specific databases.
Image analysis applications have the ability to match spots on different gels, believed
to correspond to the same protein: from replicate gels for the same samples, or from gels
over which a sample condition is varied, such as a time course experiment. Spots that have
been matched are linked via a specific class, MatchedSpots, and the class MultipleAnalysis
stores a description of the type of matching that has been carried out.
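The relationship between spots and proteins described in this section can be illustrated with a minimal Python sketch, in which the evidence for a match is held on the link itself. The class SpotProteinMatch, the two helper functions and all identifiers and values are assumptions made for this example; they are not part of Gla-PSI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spot:
    spot_id: str
    x_coord: int
    y_coord: int
    volume: float


@dataclass
class Protein:
    name: str
    accession: str  # placeholder for an entry in an external database


@dataclass
class SpotProteinMatch:
    """One link between a spot and a protein (hypothetical association class)."""
    spot: Spot
    protein: Protein
    evidence: Optional[str] = None  # e.g. a pointer to MS data, if any exists


def proteins_in_spot(matches: List[SpotProteinMatch], spot: Spot) -> List[Protein]:
    """A single spot may contain several different proteins."""
    return [m.protein for m in matches if m.spot is spot]


def spots_for_protein(matches: List[SpotProteinMatch], protein: Protein) -> List[Spot]:
    """A single protein may appear at several positions on one gel."""
    return [m.spot for m in matches if m.protein is protein]


# Illustrative data only: identifiers and values are invented.
s1 = Spot("spot-1", 120, 340, 1520.0)
s2 = Spot("spot-2", 180, 340, 980.0)
p1 = Protein("protein A", "ACC-0001")
p2 = Protein("protein B", "ACC-0002")
matches = [
    SpotProteinMatch(s1, p1, evidence="MS run 17"),
    SpotProteinMatch(s1, p2),  # no evidence recorded yet
    SpotProteinMatch(s2, p1, evidence="MS run 18"),
]
```

Keeping the evidence on the link, rather than on Spot or Protein, allows the same protein to be matched to different spots with different levels of support.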
3.4.5 Two-dimensional difference gel electrophoresis
Data produced from a difference gel electrophoresis experiment is captured in Gla-PSI as
shown in Figure 3.11. Amersham Biosciences produce DIGE (Difference In Gel Electrophoresis)
technology and DeCyder™ software for the analysis of gels [74]. DeCyder can export data
in an XML format, known as DeCyderML (personal communication from Amersham Biosciences),
which we have mapped to Gla-PSI. A single DIGE gel can produce several images,
corresponding to the fluorescent dyes used for different samples. DeCyderML contains a
class for the single channel image, with attributes such as dye type, which corresponds to
the class DIGESingleImage. DIGESingleImage is a subclass of ScannedImage and therefore
inherits all the attributes from ScannedImage. The class DIGESingleSpot models spots that
have been identified from a scan of the gel at a single wavelength. DeCyderML also has a
class for storing information about spots that have been calculated from a combination of
scans at several different wavelengths (co-migrated spots). This data can be recorded in the
general Spot class in Gla-PSI. DeCyderML includes information about spots that have been
matched across gels, which can be recorded using the MultipleAnalysis and MatchedSpots
classes that exist in Gla-PSI for storing non-DIGE data. DeCyder software calculates ratios
between pairs of single image spots that have co-migrated, captured in SpotRatio.
Figure 3.12: The part of Gla-PSI modelling statistical analysis of a proteomics experiment.
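The DIGE classes described in this section can be sketched in Python as follows. The attribute names follow Figure 3.11, but the uri attribute on ScannedImage, the dye labels, the file names and the way the ratio is calculated (a simple quotient of normalised volumes) are assumptions for illustration; the text does not state how DeCyder derives its ratios.

```python
from dataclasses import dataclass


@dataclass
class ScannedImage:
    uri: str  # hypothetical attribute standing in for the ScannedImage fields


@dataclass
class DIGESingleImage(ScannedImage):
    # Inherits uri from ScannedImage; the attributes below follow Figure 3.11.
    dye_label: str
    is_master_gel: str
    volume_average: float


@dataclass
class DIGESingleSpot:
    spot_id: str
    image: DIGESingleImage
    volume: float
    peak_height: float
    normalised_volume: float


@dataclass
class SpotRatio:
    id1: str
    id2: str
    normalised_ratio: float
    ratio_type: str


def ratio_between(a: DIGESingleSpot, b: DIGESingleSpot) -> SpotRatio:
    """Assumed calculation: quotient of the two normalised volumes."""
    return SpotRatio(a.spot_id, b.spot_id,
                     a.normalised_volume / b.normalised_volume,
                     ratio_type="normalised volume")


# Illustrative values only.
cy3 = DIGESingleImage("gel1_cy3.tif", "Cy3", "true", 1000.0)
cy5 = DIGESingleImage("gel1_cy5.tif", "Cy5", "true", 1100.0)
spot_a = DIGESingleSpot("s-cy3-1", cy3, 500.0, 40.0, 2.0)
spot_b = DIGESingleSpot("s-cy5-1", cy5, 900.0, 65.0, 4.0)
ratio = ratio_between(spot_a, spot_b)
```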
3.4.6 Statistical analysis
Statistical analysis techniques, such as ANOVA [88] (analysis of variance), are used to locate
spots whose volume is significantly different between two samples, indicating a change in
protein expression under a certain condition. It is vital that the exact details of the analysis
are preserved to ensure that the same procedure can be reproduced by other research groups.
A number of statistical techniques can be applied to large data sets, such as analysis over a
number of replicates, or over a number of gels analysing a varying condition. An example
is cluster analysis, as performed on microarray data sets [86], to detect groups of proteins
sharing similar expression patterns over a number of gels. The StatisticalAnalysis class
accommodates a description of the software or algorithm used to perform the analysis, and
the appropriate parameters and significance levels used (Figure 3.12). Gla-PSI has a link from
a description of the analysis to the raw data. The analysis can be linked to individual spot
records, or spots matched between gels. A formal description of statistical analysis presented
by Papageorgiou [239] covers most of the attributes that are applicable to proteomics analysis,
but is possibly too complex for use in biological applications. Gla-PSI has few attributes,
with the intention that the details of the analysis will be described with data types obtained
from controlled vocabularies. It is desirable that future versions of a proteomics standard
incorporate statistical standards as they are developed.
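To make concrete the kind of detail that must be preserved, the sketch below computes the F statistic of a one-way ANOVA on spot volumes and stores a description of the analysis alongside links to the spots analysed. The field names of StatisticalAnalysis here are assumptions and do not reproduce the attributes in Figure 3.12; the volumes are invented, and the conversion of F to a significance value is deliberately omitted.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


def one_way_anova_f(groups: List[List[float]]) -> float:
    """F statistic for a one-way ANOVA over two or more groups of spot volumes."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


@dataclass
class StatisticalAnalysis:
    """Hypothetical record; the field names are assumptions, not Gla-PSI attributes."""
    algorithm: str
    software: str
    parameters: Dict[str, str] = field(default_factory=dict)
    significance_level: float = 0.05
    spot_ids: List[str] = field(default_factory=list)  # links back to the raw data


# Invented volumes for one matched spot across replicate gels.
control = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]
f_value = one_way_anova_f([control, treated])

analysis = StatisticalAnalysis(
    algorithm="one-way ANOVA",
    software="example script",
    parameters={"groups": "control, treated"},
    spot_ids=["spot-1"],
)
```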
3.4.7 Annotation
Gla-PSI is designed to allow annotation of all aspects of the experiment including raw data,
experimental protocols and analysis (Figure 3.13). Annotation may be in the form of free
text or links to external databases or ontologies. MAGE-OM includes classes that allow
annotation to be added and linked to any other part of the model, which have been included
in Gla-PSI. Gla-PSI uses inheritance, such that all classes in the model are subclasses of
Identifiable, inheriting the attributes that allow a name and identifier to be added to each
class (Figure 3.13). Identifiable is a subclass of Describable, which has a relationship
to the annotation classes. Every class also inherits from Describable, enabling all classes
to be linked out to other database entries, or bibliographic references.
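The inheritance chain described above can be sketched in Python. Only the relationships Describable, Identifiable and a domain class are taken from the text; the list-valued fields and the example URI are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatabaseEntry:
    accession: str
    uri: str


@dataclass
class Describable:
    # Annotation that any class in the model can carry.
    descriptions: List[str] = field(default_factory=list)
    database_entries: List[DatabaseEntry] = field(default_factory=list)


@dataclass
class Identifiable(Describable):
    identifier: str = ""
    name: str = ""


@dataclass
class Spot(Identifiable):
    volume: float = 0.0


spot = Spot(identifier="spot-1", name="example spot", volume=1520.0)
spot.descriptions.append("Free-text note about this spot.")
spot.database_entries.append(DatabaseEntry("ACC-0001", "http://example.org/ACC-0001"))
```

Because Spot inherits from Identifiable, which in turn inherits from Describable, it gains both an identifier and the ability to carry annotation without declaring either itself.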
3.5 Future developments in proteomics standards
The PEDRo proposals have been presented to the research community and there have been
three annual meetings of PSI [236, 237] at which the model has been discussed and refined.
Currently, a new object model is in development, known as PSI-OM (Proteomics Standards
Initiative-Object Model). The rest of the section discusses proposed additions and changes,
giving a snapshot of the current development as of July 2004. There are several parts of
the original proposal that were not modelled correctly and certain omissions, including those
covered in more detail in Gla-PSI.
Figure 3.13: Several classes are subclasses of Identifiable, enabling a unique identifier and name to be attached. Each class is also a subclass of Describable, enabling links to bibliographic references and external database entries to be specified. [UML class diagram. Subclasses of Identifiable shown: Protein, SpotRatio, DIGESingleSpot, 2D-PAGE, ScannedImage, MultipleAnalysis, MatchedSpots, StatisticalAnalysis, ImageAnalysis, DIGEAnalysis, Spot, SpotSets. Annotation classes and attributes shown: Identifiable (identifier, name); Describable; ExternalReference (exportedFromServer, exportedFromDB, exportID, exportName); BibliographicReference (title, authors, publication, publisher, editor, year, volume, issue, pages, URI); Database (version, URI); DatabaseEntry (accession, accessionVersion, URI); OntologyEntry (category, value, description); OntologyRef; Type; Description (text, URI).]
3.5.1 An overview of PSI-OM
An overview of the new model is displayed in Figure 3.14. The main features of
the experimental techniques are similar to PEDRo, with a cycle from Analyte to
AnalyteProcessingStep. No effort has yet been made to specify a detailed description
of a sample within the model; however, there is a relationship from SourceInformation
to OntologyEntry to specify characteristics of a sample. At the top level is the class
MIAPEDataSet for clustering a set of related proteomics experiments, below which is the
top level of one complete analysis (Project). The concept of a StudyGroup has been in-
troduced for comparing one set of samples with another. For example, an experiment is
performed to compare mice with a gene knockout X, against wild-type mice. Ten gels are
performed, of which five are replicates from pooled samples of knockout mice tissue, and five
are replicates from wild-type. An instantiation of PSI-OM would contain one instance of
Project and two instances of StudyGroup (one for wild-type and one for knockout). The
source of biological material is captured in the class Source. The model allows either 10
sources of material to be specified for biological replicates (10 different mice), or two sources
of protein, each subsequently split using the class Subdivision, to be specified for technical
replicates.
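The knockout example above can be written out as a small Python sketch showing both ways of instantiating the model. The class names follow Figure 3.14, but their fields, the subdivide helper (a stand-in for Subdivision) and the wording of the hypothesis are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    name: str


@dataclass
class AnalytePortion:
    source: Source
    portion: int


@dataclass
class StudyGroup:
    name: str
    sources: List[Source] = field(default_factory=list)
    portions: List[AnalytePortion] = field(default_factory=list)


@dataclass
class Project:
    hypothesis: str
    study_groups: List[StudyGroup] = field(default_factory=list)


def subdivide(source: Source, n: int) -> List[AnalytePortion]:
    """Stand-in for Subdivision: split one source into n portions."""
    return [AnalytePortion(source, i + 1) for i in range(n)]


def biological_design() -> Project:
    """Ten sources (ten different mice), one gel per source."""
    groups = [
        StudyGroup(g, sources=[Source(f"{g} mouse {i + 1}") for i in range(5)])
        for g in ("knockout", "wild-type")
    ]
    return Project("Knockout of gene X alters protein expression", groups)


def technical_design() -> Project:
    """Two pooled sources, each split into five portions, one gel per portion."""
    groups = []
    for g in ("knockout", "wild-type"):
        pooled = Source(f"{g} pooled tissue")
        groups.append(StudyGroup(g, sources=[pooled], portions=subdivide(pooled, 5)))
    return Project("Knockout of gene X alters protein expression", groups)


bio = biological_design()
tech = technical_design()
```

Both designs yield one Project with two StudyGroup instances and ten gels; they differ only in whether the replication is recorded at the level of Source or of the portions produced by subdivision.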
Figure 3.14: A draft version of the main components of PSI-OM. [UML class diagram. Classes shown: MIAPEDataSet, Project (hypothesis), StudyGroup, StudyGroupDataSet, StudyDescription (experimentalFactor), Source, SourceInformation, OntologyEntry, Protocol, Description, RunDetails, Analyte, AnalyteProcessingStep, OtherAnalyte, OtherAnalyteProcessingStep, AnalytePortion, Subdivision, CombinedAnalytes, Combination, TaggedAnalyte, TaggingProcess, Column, ColumnRun, SampleLoading, Fraction, MobilePhaseComponent, PercentOfComponent, Timepoint, Gel1D, Gel1DLane, PhysicalBand, Gel2D, PhysicalGelSpot, Analysis, ExpressedProtein.]
Figure 3.15: Part of PSI-OM showing the relationships between spots identified on a gel and the corresponding protein records. [UML class diagram. Classes shown: Gel1D, Gel2D, Image (URI), ImageAcquisition, DatabaseEntry, IdentifiedBand, IdentifiedSpot, Analysis, DIGECompositeSpot, OntologyEntry, MSDataCapture, MSDataSet, ProteinRecord, ProteinModification, PhysicalBand, PhysicalGelSpot, Fraction, StudyGroupDataSet, ExpressedProtein. A note on the diagram refers to the data model diagram for the link between Image, ImageAnalysis and IdentifiedSpot or IdentifiedBand.]
3.5.2 Data model in PSI-OM
The diagram in Figure 3.15 displays the overview of a proteomics data set. A number
of experiments are packaged together using the class StudyGroupDataSet. The core data
point is an ExpressedProtein which can be linked to a set of classes describing the result
of separation techniques (PhysicalGelSpot, PhysicalBand, Fraction and so on). The
class ExpressedProtein will capture a complex concept, as follows. In a 2-DE experiment
particular proteins may appear in multiple positions on a 2-D gel, which may be the result of
differential splicing of gene products or chemical modifications to the protein. These variant
forms of the protein will usually be identified only by a single protein name or accession
number; however, it is vital that the alternative forms are differentiated in the model. An
ExpressedProtein is intended to capture the idea of a single protein form that arises in one
position on a gel, or in one column fraction, resulting from the set of modifications that it
has. If the nature of the modification is known, it can be captured in ProteinModification,
and a reference to a record in a sequence database can be captured in ProteinRecord and
DatabaseEntry. The current model has no detailed specification for MS standards because
these are in development by a separate organisation, and will be added to the model when
finalised.
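The distinction between a protein record and its expressed forms can be illustrated with a short Python sketch. The class names follow Figure 3.15, but the fields, the grouping function and the accession and location strings are assumptions for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ProteinRecord:
    accession: str  # placeholder accession in a sequence database


@dataclass
class ExpressedProtein:
    record: ProteinRecord
    location: str  # e.g. a gel spot or column fraction identifier
    modifications: List[str] = field(default_factory=list)


def forms_by_accession(forms: List[ExpressedProtein]) -> Dict[str, List[ExpressedProtein]]:
    """Group the distinct expressed forms under the accession they share."""
    grouped = defaultdict(list)
    for form in forms:
        grouped[form.record.accession].append(form)
    return dict(grouped)


# One sequence database record observed as two forms at different gel positions.
record = ProteinRecord("ACC-0001")
forms = [
    ExpressedProtein(record, "gel1/spot-12"),
    ExpressedProtein(record, "gel1/spot-13", ["phosphorylation"]),
]
grouped = forms_by_accession(forms)
```

A query by accession alone would merge the two forms; keeping ExpressedProtein as the core data point preserves the fact that they were observed separately.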
Figure 3.16: A draft version of the protein data model in PSI-OM. The classes on the left model conventional 2-DE and the classes on the right represent difference gel electrophoresis. [UML class diagram. Classes shown: Image (URI), ImageAnalysis, MultipleGelAnalysis, SpotsMatchedAcrossGels, SingleGelSpotSet, IdentifiedSpot, SpotMeasurement (value), OntologyEntry, DIGEAnalysis, DIGESpotSet, DIGESingleSpotSet, DIGESingleSpot, DIGECompositeSpot.]
The draft model of protein spot data arising from image analysis has been influenced
by Gla-PSI, and is displayed in Figure 3.16. There are two separate sets of classes for
modelling gel electrophoresis data. The classes on the left of Figure 3.16 model standard
gel electrophoresis, in which one sample is applied to one gel, and multiple samples are
compared on different gels. The classes on the right model data resulting from a DIGE
experiment, in which there are two kinds of spot data: spots arising from scanning a gel at
a single wavelength (DIGESingleSpot), and spots arising from a composite image that has
been calculated from the single channel images (DIGECompositeSpot). The attributes that
will be assigned to classes are still to be finalised, but one issue that must be resolved is the
extent to which ontologies will be utilised. It is possible to include many attributes in the
model for describing protein data, or put the types of attributes in a controlled vocabulary
and link many classes to OntologyEntry. This is an area for future discussion but we believe
that there are considerable advantages to using ontologies extensively, because the controlled
vocabularies can be updated at regular intervals, allowing gradual evolution of the coverage
of the model. It is not possible to update an object model at regular intervals without
generating backward compatibility problems.
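The trade-off between fixed attributes and ontology terms can be sketched in Python. SpotMeasurement and OntologyEntry appear in Figure 3.16, but the field names, the vocabulary contents and the measure helper are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str


@dataclass
class SpotMeasurement:
    value: float
    measurement_type: OntologyEntry
    unit: Optional[OntologyEntry] = None


# A controlled vocabulary kept outside the object model (terms are illustrative).
VOCABULARY = {
    "volume": OntologyEntry("MeasurementType", "volume"),
    "peak height": OntologyEntry("MeasurementType", "peak height"),
}


def measure(term: str, value: float) -> SpotMeasurement:
    """Create a measurement whose type must be a term in the vocabulary."""
    if term not in VOCABULARY:
        raise KeyError(f"'{term}' is not in the controlled vocabulary")
    return SpotMeasurement(value, VOCABULARY[term])


# Extending coverage means adding a term, not changing the SpotMeasurement class.
VOCABULARY["circularity"] = OntologyEntry("MeasurementType", "circularity")
m = measure("circularity", 0.87)
```

Had circularity been a fixed attribute of a spot class, supporting it would have required a new version of the object model rather than an update to the vocabulary.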
3.5.3 An ontology for proteomics
The original PEDRo model used ontologies sparingly, taking the approach that an initial
model for proteomics should function as a document for specifying the main components of
a typical workflow to stimulate discussion in the community. The Gla-PSI proposal specifies
that ontologies are required to capture certain parts of the analysis, but there are currently no
major ontologies containing proteomic experimental terms. Therefore, it has recently been
proposed that the MGED Ontology should be extended for proteomics. The MGED Ontology
(MO) includes a controlled vocabulary of terms describing microarray experiments, including
the details of biological samples (described in more detail in the following chapter). There
is no difference between the sample prior to mRNA extraction for a microarray assay or
protein extraction for proteomic analysis, hence parts of MO can describe biological samples
for proteomics. A new ontology, PSI-Ont, is in development and will include terms describing
proteomic experimental techniques. PSI-Ont will be developed as an extension to the MGED
Ontology, and will follow the same structure.
3.5.4 Minimum information about a proteomics experiment
An essential stage in improving the process of exchanging and publishing microarray data
was the release of the MIAME guidelines [41]. MIAME is a checklist of the information that
should be made publicly available to allow the data sets to be re-analysed, or to allow the ex-
periment to be reproduced, if identical biological samples are available. An equivalent effort
has been initiated by PSI to develop MIAPE (Minimum Information About a Proteomics
Experiment). The guidelines will be formalised after a series of meetings and discussions
via the mailing list. In overview, we believe that MIAPE should contain the following. It
is vital that sufficient description of the biological samples is given so that the validity of
each study group can be established. Researchers should also publish the protein extraction
protocols, detailed descriptions of the protein separation techniques, and the equipment and
protocols utilised for MS. Any software that is used to analyse data should be reported with
a version number, vendor name and contact details. If database searches have been carried
out to identify proteins, there should be a date stamp of when the search was carried out if
the database is updated daily, or a version stamp if the database is released less frequently.
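The rule for stamping database searches can be expressed as a simple check. This is only a sketch of how a MIAPE-style checklist item might be validated; the DatabaseSearch record and its field names are hypothetical, since the guidelines had not been formalised at the time of writing.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatabaseSearch:
    database: str
    updated_daily: bool
    search_date: Optional[date] = None
    database_version: Optional[str] = None


def missing_stamp(search: DatabaseSearch) -> Optional[str]:
    """Return the name of the stamp that is required but absent, or None."""
    if search.updated_daily and search.search_date is None:
        return "search_date"
    if not search.updated_daily and search.database_version is None:
        return "database_version"
    return None


# Illustrative records; the database names are placeholders.
daily_ok = DatabaseSearch("daily-db", True, search_date=date(2004, 7, 1))
daily_bad = DatabaseSearch("daily-db", True, database_version="12")
release_bad = DatabaseSearch("release-db", False)
```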
3.6 Discussion
3.6.1 Web access to data
It has been recognised that past funding for large databases of scientific data has not been
sufficient, and as a result, important information is lost [209]. An activity which attempts
to remedy this situation is the effort to develop biochemical pathway databases, such as
KEGG [184]. Information regarding reaction kinetics and functional information has been
published over several decades, but is not generally available in electronic form. Only papers
published in the last decade may be available on the Internet, and data is not presented in any
kind of format that can be mined automatically. Instead, information retrieval techniques
must be used with significant manual intervention. This process is time consuming and will
miss substantial amounts of information. Today, data regarding one biological system is
often too extensive for a single researcher to gain access to by reading published literature,
and automated methods are required. Microarray experts have previously recognised these
needs and efforts are underway to develop large central repositories [42]. In recent years a
parallel effort has been initiated by proteomics researchers; however, there are currently no
major central repositories of proteomics data [252]. A standard data format will facilitate
the creation of a central repository that will allow re-analysis of published data as new
statistical techniques are developed. Microarray and proteomics experiments generate large
amounts of data that is of potential use to researchers in many other fields. In particular,
the studies can improve genome annotation by demonstrating conditions in which genes or
proteins have been shown to be up or down regulated, allowing researchers to improve the
functional annotation.
3.6.2 Status of proteome standards
This chapter documents the development of the Gla-PSI model, which we released in October
2003. Gla-PSI represents data from one section of a proteomics workflow and complements
other work undertaken by various organisations. PSI is overseeing the development of a
standard, and is using PEDRo as an initial framework from which to develop a unified
model. Gla-PSI covers image analysis of 2-DE, multiple gel comparison, DIGE and statistical
analysis of large data sets, and represents additional information that should be included
in the next version of the community standard. Capturing experimental protocols in a
structured format is a major challenge due to the enormous range of possible experiments
that could be performed. The MAGE format for microarrays has been designed with a flexible
structure that allows it to be extended into new technologies by using ontologies. Gla-PSI
utilises parts of MAGE for adding annotation and bibliographic references to
the model. In our original publication on Gla-PSI [176], we stated that classes derived
from MAGE should be used for capturing information about experimental protocols and the
biological samples on which experiments are performed but at that time the integration had
not been completed. The following chapter describes later work, which is the integration
of Gla-PSI, PEDRo and MAGE to create a framework for capturing data from a range of
functional genomics techniques.
In Section 3.5, the development of the next version of the official PSI object model (PSI-
OM) was discussed, which incorporates parts of Gla-PSI and PEDRo. The development of
the object model will take place in conjunction with the creation of an ontology for pro-
teomics (PSI-Ont), which will be regulated by PSI. An important first stage will be the
creation of a document that specifies the minimum information set that must be published
alongside proteomics data to allow future re-analysis (MIAPE). The development of all three
components (PSI-OM, PSI-Ont and MIAPE) will continue with discussions at official meet-
ings of PSI, and via an email mailing list. The development of a finalised standard requires
significant contribution from the proteomics community before consensus can be reached.
The complete model should be flexible with regard to new technologies and experimental
protocols. A data standard should not prescribe how researchers carry out experiments,
but should capture enough detail to ensure that useful data archives can be developed. If
a standard is to be accepted, tools must be developed which enable researchers to capture
data conforming to the standard without substantial manual data entry. Laboratory Infor-
mation Management Systems (LIMS) are available from commercial software vendors. They
capture instrument parameters, and track solutions using bar-coding. It is likely that future
versions will be specifically tailored for proteomics applications, and software vendors should
provide an output file conforming to the proposed standard. A data set containing 2-DE
images, MS traces, analysis and annotation is fairly bulky, therefore the development of a
single public database covering all aspects of proteomics is unlikely for all species. A more
feasible solution is the development of distributed, domain specific proteome databases, such
as databases for a single organism or disease, with data transfer between databases occurring
via an XML data format, created from the object model. It is essential that databases provide wide
ranging query facilities to enable the development of applications that search for data sets
of interest. Data integration applications will be developed to link proteome databases to
other repositories, such as databases of sequences, motifs and structures.
3.7 Conclusions
Gla-PSI has been developed to represent 2-DE, image analysis, difference gel electrophoresis
and statistical processing. It was initially developed at the same time as the PEDRo proposal,
however it was later modified and released to document additional information that should
be recorded in a community wide standard. The model has influenced the development of
the next version of the standard, PSI-OM.
The microarray field has recognised the need for central data repositories and exchange
standards for some time. The additional complexity of proteomics experiments means that
the efforts are some way behind, and there are still no databases that offer access to protein
separation information, quantification data and mass spectrometry. The development of
a proteomics data standard will enable data to be sent to a public database. Chapter 5
describes a prototype system that could serve as a centralised public database for proteomics.
The database stores protocols and data from 2-DE and MS, and facilities for integration with
microarray results are demonstrated. We believe that the efforts of MGED in the microarray
field can be used directly for proteomics, and in the following chapter there is a description
of the unification of the proteomics proposals with MAGE-OM, to create a proposal for
a standard across the whole range of functional genomics techniques.
Chapter 4
Development of a data standard for functional genomics
4.1 Introduction
In Chapter 2, the importance of data standards for life sciences was outlined and this was
further exemplified in the previous chapter with a description of the development of a data
model for proteomics. The success of the MAGE-ML format for microarrays demonstrates
the feasibility of a community wide standard for capturing data from a diverse range of
experiment types. This chapter covers the integration of the Gla-PSI model into a wider
proposal for functional genomics, published in July 2004 [175], which includes
substantial detail from MAGE-OM and the draft standard for proteomics, PEDRo. The
new model is known as FGE-OM (Functional Genomics Experiment - Object Model) and has
been presented to the standards organisations for proteomics and microarrays as a proposal
for the integration of the current efforts in both fields.
URL: www.gusdb.org/fge.html
4.1.1 Requirements for standards
The motivation for integrating the current proposals for microarrays and proteomics is as
follows. It is becoming common for research groups to carry out experiments using multiple
types of technology as the cost of performing experiments has fallen. Several institutions
have semi-automated facilities offering a service for performing parts of experiments that
were previously very labour intensive. The functional genomics facility in Glasgow is one
example, offering a sequencing, microarray and proteomics service to researchers [293]. Re-
searchers now generate large volumes of data from diverse techniques that they wish to
compare, or analyse side by side. There are several facets of experiments that can be de-
scribed using the same terms. An overview of a functional genomics (FG) experiment can
be described with a text description of the hypothesis, and a parameter that is varied be-
tween different samples, such as the different time points in a time course experiment. The
biological samples used in any type of FG experiment should be described using common
terms because this stage precedes the extraction of mRNA, proteins or metabolites and could
potentially be analysed downstream using any of the experimental techniques. Experimental
protocols from microarrays, 2-DE and other separation techniques can be described as a set
of sequential steps involving substances, actions and equipment. It may also be desirable
that all experiments are annotated with an audit trail, capturing when, where and by whom
the experiments were carried out. Data points in an FG experiment are usually genes or
proteins which may be quantified or localised in one sample compared with another. It is
therefore possible to create a framework containing the common parts of FG analysis as
part of an all encompassing data format. A shared format that has wide community accep-
tance would allow developers to create software capable of formatting all locally generated
FG data into one format that can be exchanged with other researchers or sent to public
databases. The format should be suitably designed such that there is no great overhead if
research groups wish to use only a subset of the entire model, for example if they are only
performing proteomics. It is likely that one single model for FG will require significantly less
effort for developers than creating software to manage four or five separate formats. Finally,
if experimental protocols are captured in a common format it will open up new possibilities
for comparing data produced from different methodologies, allowing researchers to have a
view of the biology that is nearer to the whole system level.
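The idea of a common description of protocols can be sketched in Python: the same structure of ordered steps, each naming an action, substances and equipment, is used for both a proteomics and a microarray protocol, with an optional audit trail. All class and field names and the wording of the steps are assumptions for illustration and are not taken from FGE-OM.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ProtocolStep:
    action: str
    substances: List[str] = field(default_factory=list)
    equipment: List[str] = field(default_factory=list)


@dataclass
class Audit:
    performed_on: date
    location: str
    performed_by: str


@dataclass
class Protocol:
    name: str
    steps: List[ProtocolStep] = field(default_factory=list)  # order matters
    audit: List[Audit] = field(default_factory=list)


def describe(protocol: Protocol) -> List[str]:
    """Render the steps in order as numbered lines."""
    return [f"{i}. {step.action}" for i, step in enumerate(protocol.steps, start=1)]


# The same structure describes protocols from two different technologies.
two_de = Protocol("2-DE separation", [
    ProtocolStep("Isoelectric focusing", ["protein extract"], ["IPG strip"]),
    ProtocolStep("SDS-PAGE", ["focused strip"], ["gel tank"]),
])
array = Protocol("Microarray hybridisation", [
    ProtocolStep("Label sample", ["mRNA extract", "fluorescent dye"]),
    ProtocolStep("Hybridise to array", ["labelled sample"], ["hybridisation chamber"]),
])
```

Software written against this one structure can list, compare or validate protocols without knowing which technology produced them.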
An integrated data format will also facilitate the development of public repositories for
storage and querying of functional genomics data. Microarray experiments are used widely
because a large number of assays can be performed concurrently, producing a large number
of possible leads about the genes that are significantly associated with a particular condition
or disease. However, while it had previously been believed that there is a correlation between
the expression of mRNA and protein [115], more recent studies have indicated that mRNA
level is a poor indicator of protein abundance [178]. Proteomics experiments can determine
the relative level of protein produced, and would therefore be expected to be a better indicator
of the level of protein activity. Proteomics experiments can also give information about
post-translational modifications, which may have important effects on the function of the
protein [240]. It is therefore desirable that microarray and proteomics data can be queried
in parallel to determine the extent of gene expression and the level of encoded protein that
has been observed for a particular gene. Protein and RNA expression data should also
be accessible with genomic data, to allow better annotation of the genome with functional
information derived from FG studies, such as "protein X is up regulated under condition Y".
A current example of this functionality is offered by the SOURCE database [78], which can
be queried by gene name, and returns textual annotation about the gene, and the relative
expression values from different microarray studies in which it has been assayed. Single data
points from a microarray experiment may not be sufficiently powerful to determine how much
active protein was present in the sample at that time, but can provide functional evidence
if a gene is strongly expressed in a sample or condition, or conversely not expressed where
it might be expected. These kinds of results can be assayed by further experimentation and
lead to the formation of new hypotheses about the function of genes and systems as a whole.
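Querying the two data types in parallel can be illustrated with a minimal Python sketch, in which two dictionaries stand in for microarray and proteomics results keyed by gene. The gene names, the ratios and the threshold are invented for this example; a real repository would issue the equivalent query against its database.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative values only: gene names and expression ratios are invented.
microarray: Dict[str, float] = {"geneA": 2.1, "geneB": 0.4, "geneC": 1.0}
proteomics: Dict[str, float] = {"geneA": 1.1, "geneB": 0.5}


def expression_for(gene: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (mRNA ratio, protein ratio); None where the gene was not assayed."""
    return microarray.get(gene), proteomics.get(gene)


def discordant(threshold: float = 1.5) -> List[str]:
    """Genes whose mRNA ratio exceeds the threshold while the protein ratio does not."""
    return sorted(
        g for g in microarray.keys() & proteomics.keys()
        if microarray[g] > threshold and proteomics[g] <= threshold
    )
```

Genes returned by discordant are candidates for the kind of follow-up experiment described above, since the mRNA level alone would have suggested a change in protein abundance that was not observed.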
Functional genomics databases should also incorporate information from other types
of study: immunohistochemistry and protein interaction studies, such as yeast two-hybrid
[107]. Such systems would enable data mining applications to be developed that search for
the factors that affect regulation of transcription and translation, and ultimately, protein
function. Integrated databases will aid the development of mathematical models capturing
the effects of changes at the system level, and could provide source data for the modelling
of metabolic pathways [336]. Data mining algorithms could then be employed to search for
genes that may be important in a condition of interest, such as drug targets for a particular
disease.
4.1.2 Status of standardisation
Data standards for proteomics, and other FG experiments, are at a much earlier stage than
microarrays (Figure 4.1). PEDRo was released as a draft proposal to stimulate community
discussion about what was required in a data standard and, aside from the data capture
tool released with PEDRo (PEDRoDC), there have been few implementations of PEML
(Proteomics Experiment Markup Language), the XML-based data exchange language based
on PEDRo. This is because PEML is a complex format, and therefore considerable effort
is required by developers to create software that produces PEML. Furthermore, the benefits
of producing output in PEML at this time are limited, because there are no major public
repositories that accept PEML as input. There are also several parts of PEDRo that do not
adequately capture a proteomics workflow, the most important being insufficient descriptions
of biological samples, and no support for auditing. These two areas are captured in MAGE,
and this part of the object model has been refined over a significant period by a team
of experienced developers. It is vital that the next round of development in proteomics
standards makes extensive use of the experience gained in the development of MAGE. This
process has already begun with several MAGE developers giving oral presentations at the
2004 meeting of the PSI in Nice, France [257].
Figure 4.1: A time line displaying the emergence of microarray and proteomics technology, and the efforts to standardise data formats. [Timeline with the years 1996 and 1999 to 2006 marked, and separate tracks for microarray standards and proteomics standards. Events labelled: advent of microarrays; first large scale experiments; formation of MGED; MIAME guidelines published; first object model to OMG; release of MAGE-ML v.1; development of MAGE v.2; formation of PSI; release of PEDRo; release of Gla-PSI; development of PSI-OM; release of FGE-OM and SysBio-OM.]
FGE-OM offers a possible framework for developing a standard across all FG experiments;
however, an alternative proposal has been released, known as CEBS (Chemical Effects in
Biological Systems) SysBio-OM. SysBio-OM was released after the creation of FGE-OM and was
therefore not available for analysis at the time of development. A comparison of the
features offered by the two systems is made in Section 4.4. The future development of MAGE-
OM and the PSI data standard should take place jointly, using FGE-OM and SysBio-OM
as a framework around which it can be coordinated.
FGE-OM captures microarray and proteomics data, including separation techniques such
as two-dimensional gel electrophoresis (2-DE), and protein identification by mass spectrom-
etry (MS). The model also stores experimental protocols, raw data and data analysis. FGE-
OM comprises three namespaces that organise the classes in logical subsets: BioOM, Ar-
rayOM and ProteomicsOM (Figure 4.2). Substantial detail from MAGE-OM has been
Figure 4.2: An overview of the FGE-OM object model. The model is divided into three namespaces: BioOM, ArrayOM and ProteomicsOM. [Diagram: the top level of the object model, FGE-OM, divided into BioOM (components common to all functional genomics experiments), ArrayOM (microarray specific components) and ProteomicsOM (classes modelling proteomics technologies).]
used to develop BioOM (the part of the model that is generic), and ArrayOM (the parts
of the model specific to microarrays). BioOM contains a set of packages and classes that
describe an experiment using microarrays, proteomics, or potentially other functional
genomics techniques. The ProteomicsOM namespace captures information from
proteomics-specific technologies. The object model has been implemented as a relational database,
known as RAPAD (RNA And Protein Abundance Database), which is described in the
following chapter.
The rest of the chapter is structured as follows. Section 4.2 outlines the methodology used
to create FGE-OM. A detailed description of FGE-OM is given in Section 4.3 and Section 4.4
briefly describes the contents of the alternative SysBio-OM proposal, and compares it with
FGE-OM. Finally, a plan for how the development of an integrated standard for functional
genomics can take place is outlined in Section 4.5.
4.2 Methods
FGE-OM was developed using an evolutionary software development model. MAGE-OM
was imported into a UML editing tool, and changes were made to accommodate proteomics
data. The PEDRo object model has not been released in UML format, however a database
schema has been released in SQL, which matches the object model very closely. Therefore,
the PEDRo database schema was reverse engineered and imported into the editing tool.
Additional classes were added manually from Gla-PSI where required. The initial develop-
ment involved the creation of class diagrams to model parts of proteomics experiments, using
components derived from MAGE-OM where possible. This was followed by a phase of dis-
cussion between several developers to test whether hypothetical proteomics workflows were
adequately covered in the object model. In cases where FGE-OM did not correctly model
a possible workflow, refinements were made to the model. The model was further refined
after the objects had been mapped to relations, and deployed as a relational database. At
the time of development there had been no complete implementation of the PEDRo model
or database schema, therefore several classes had to be refined to reflect real data sets.
FGE-OM was developed in UML using the modelling tool Poseidon™ [249], into which
the source models were imported. Poseidon has the advantage over other tools that there
is a version that is freely available, offering sufficient functionality to view and edit UML
class diagrams. It is vital that as many developers as possible have access to the object
model, beyond being able to view images of class diagrams. The main alternative, Rational
Rose [266], is expensive software which precludes many researchers from analysing models.
There is a major compatibility problem between the UML versions specified by different
vendors. UML is intended to be a standard notation, but there is currently no robust method
of transferring models between tools. An interchange format for UML, XML Metadata
Interchange (XMI) [356], has now been defined that may improve compatibility in the future,
but the current implementations of XMI only specify the contents of the model, not the
diagrams that have been drawn to represent the model. Therefore, once an object model
has been imported, diagrams must be redrawn by the developers, which is a laborious task.
4.2.1 Ontologies
An ontology can be described as the result of knowledge capture about a particular domain, in
a formal structure [138]. The use of ontologies in life sciences is rapidly increasing, because it
is believed that they can improve facilities for data re-use and integration [300]. The MGED
Ontology (MO) has been created to capture terms used in a microarray experiment [304].
Each entry contains a term, a definition and a specification for where the term should be used
in the model. An example term viewed with the OilEd editor [28] is displayed in Figure 4.3
(OilEd is described in Chapter 2). The ontology contains classes, properties and instances
Figure 4.3: A screenshot of the term “Age” in the MGED Ontology viewed with OilEd.
(individuals in OilEd). A class is the type of information (e.g. Age), the properties of the
class are its attributes (e.g. “has Measurement” and “Initial time point”) and the actual
values are the instances (e.g. years). There is also a definition of the term, in the case of
Age: “The time period elapsed since an identifiable point in the life cycle of an organism. If
a developmental stage is specified, the identifiable point would be the beginning of that stage.
Otherwise the identifiable point must be specified such as planting.”
The class OntologyEntry from MAGE-OM is used widely to store terms obtained from
controlled vocabularies, along with the source of the vocabulary. Ontologies are vital for
capturing the complexity of biological samples used in functional genomics.
EXAMPLE: Two FG experiments are performed. The biological material of the first is a
cell culture grown in a specific medium, and that of the second is a tissue sample from a
person suffering from heart disease.
It would be extremely difficult to engineer a schema to capture this range of information in
a structured way. For example, without an ontology, a model to capture the species of origin
could be designed with a class Species and an attribute scientificName. However, this
can pose major problems for querying because a name can be represented in different ways,
for example through abbreviations, different classification systems and user errors. This
problem was avoided in MAGE-OM by designing classes that have a relationship to
OntologyEntry, for instance a relationship called speciesName. The model would be
instantiated by obtaining the value from
a taxonomic database, along with an ID number and a URL pointing to the source data.
In FGE-OM, OntologyEntry is used in this way in all three namespaces, and many of the
terms in the MGED Ontology can be used for both microarrays and proteomics.
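The querying problem described above can be illustrated with a minimal Python sketch. It is not part of FGE-OM: the attribute names (category, value, accession, uri), the misspelt example strings and the placeholder URL are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntry:
    """A controlled term plus the source it was obtained from (simplified)."""
    category: str   # e.g. "Species"
    value: str      # the term itself
    accession: str  # ID number in the source database
    uri: str        # URL pointing to the source data (placeholder here)


# Free-text design: the same species entered in three different ways.
free_text = ["Mus musculus", "M. musculus", "mus musclus"]

# Controlled design: every sample refers to one term from a taxonomic source.
mouse = OntologyEntry("Species", "Mus musculus", "10090",
                      "https://example.org/taxonomy/10090")
controlled = [mouse, mouse, mouse]

# An exact-match query misses two of the three free-text records ...
hits_free = [s for s in free_text if s == "Mus musculus"]
# ... but finds every sample when matching on the accession.
hits_controlled = [s for s in controlled if s.accession == "10090"]

print(len(hits_free), len(hits_controlled))
```

In the sketch the query on the controlled term retrieves all three samples, whereas the free-text query retrieves only the record that happens to be spelt in the expected way.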
EXAMPLE: A comparative 2-D gel analysis is being used on tissue from the hearts of two
samples of mice, one of which has a genetic defect. One characteristic that the researchers
want to capture is the gender of the mice.
The gender is specified by a relationship from the class BioMaterial to OntologyEntry
called Characteristics. OntologyEntry captures the category (Gender), the value (Male)
and the term’s definition. In many cases the usage is more complex because classes in the
ontology can have subclasses to build up a hierarchical structure; in fact, Gender is a subclass
of Sex. The hierarchy is expressed in the object model by a reference from one instance of
OntologyEntry to another. The overall effect of the use of ontologies is the delegation of the
task of describing the domain to a different process, the ontology development, instead of
representing all concepts in the object model. This is advantageous because ontologies can
be easily extended without affecting the core functionality, but an object model must stay
fixed for a significant period of time, and cannot gradually evolve.
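The gender example can be sketched as follows, assuming a simplified OntologyEntry with a single parent reference. The attribute names (parent, characteristics), the sample name and the definition text are illustrative assumptions; the real classes carry more attributes and associations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyEntry:
    category: str
    value: Optional[str] = None
    definition: str = ""
    # Reference from one OntologyEntry instance to another (assumed name).
    parent: Optional["OntologyEntry"] = None

    def path(self) -> List[str]:
        """Return the categories from the most general term down to this one."""
        entry: Optional["OntologyEntry"] = self
        names: List[str] = []
        while entry is not None:
            names.append(entry.category)
            entry = entry.parent
        return list(reversed(names))


@dataclass
class BioMaterial:
    name: str
    characteristics: List[OntologyEntry] = field(default_factory=list)


sex = OntologyEntry("Sex")
gender = OntologyEntry("Gender", "Male", "illustrative definition text", parent=sex)
heart_tissue = BioMaterial("heart tissue, mouse group 1", [gender])

print(gender.path())
```

New terms can be added to the ontology, and attached to a BioMaterial, without any change to the two class definitions, which is the delegation described above.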
The MGED ontology will be extended further to incorporate standard terms used in
protein studies. There are examples of how the ontology has been implemented in a relational
database in the following chapter (page 170). Other ontologies, such as the Mouse Anatomy
Ontology [45] and the Plant Ontology [247], can also be used to describe biological samples
where required. The usage of other external ontologies will be vital because the MGED
Ontology will never contain all the terms to describe any kind of sample on which microarrays
could be performed. However, separate ontologies will become available from specific research
communities and, as long as the source and definition of a term are clearly stated, then
structured descriptions of biological samples can be captured. This will greatly improve the
facilities for querying databases in the future to find relevant data sets.
Figure 4.4: A complete listing of the packages within FGE-OM.
Figure 4.5: The packages and classes in the BioOM namespace of FGE-OM. The boxed packages have been altered from MAGE-OM, others are identical to packages in MAGE-OM. Open arrows indicate inheritance, for example Identifiable is a subclass of Describable (the superclass) and inherits all the attributes from Describable. [Packages shown: Experiment, Protocol, BioMaterial, Measurement, BioAssay, BioAssayData, BioEvent, Description, BioSequence, BQS, HigherLevelAnalysis and AuditAndSecurity. Classes shown: Identifiable, Extendable and Describable.]
4.3 Overview of FGE-OM
FGE-OM models microarray and proteomics data and a complete listing of the packages
and classes is given in Figure 4.4. All classes in BioOM and ArrayOM are derived from
MAGE-OM. In ProteomicsOM, classes in the packages MassSpecData, MassSpecProtocol
and ProteinSeparation have been derived from PEDRo, classes in ProteinData and Pro-
teomeBioAssay are from Gla-PSI, and ProteinRecord contains newly created classes. In the
rest of this section there is a description of the three namespaces, and the relationships that
exist between classes residing in different namespaces. The use of the model in the context
of a sample biological workflow is also described. A set of detailed diagrams, displaying the
attributes of classes and the cardinality of relationships, is displayed in Appendix B.
4.3.1 BioOM
Figure 4.5 shows the packages in the BioOM namespace. BioOM covers the components
in FGE-OM that are common to all experiment types. The majority of the packages are
identical to packages of the same name in MAGE-OM, as described in Chapter 2, and the
technical documentation that describes MAGE-OM can be obtained via the MGED web site
[212]. There are components of packages BioAssay and BioAssayData (from MAGE-OM)
that contain array specific information, which have been placed in newly created packages
within the ArrayOM namespace. The three abstract classes at the top-level: Extendable,
Figure 4.6: The packages in the ArrayOM namespace. The boxed packages are newly created in FGE-OM but contain a number of classes derived from MAGE-OM. The other packages are identical to packages with the same name in MAGE-OM. [Packages shown: Array, ArrayBioAssay, ArrayDesign, ArrayBioAssayData, QuantitationType and DesignElement.]
Describable, and Identifiable are unchanged from MAGE-OM, and most classes inherit
their attributes. Identifiable allows a name and an identifier to be added to classes.
Describable enables links to external ontologies, data ownership and an audit trail to be
attached. Extendable enables a triplet of attributes: Name, Value, Type to be attached to
any class for storage of properties that are not recorded in other parts of the model.
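The role of the three abstract classes can be sketched in Python as below. This is a simplified illustration rather than the FGE-OM definition: the chain Extendable, Describable, Identifiable is assumed to follow MAGE-OM, the attribute names are invented for the example, and ExampleClass is a hypothetical class standing in for any class of the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NameValueType:
    """The Name, Value, Type triplet used for properties not modelled elsewhere."""
    name: str
    value: str
    type: str


@dataclass
class Extendable:
    property_sets: List[NameValueType] = field(default_factory=list)


@dataclass
class Describable(Extendable):
    # Stands in for ontology links, data ownership and the audit trail.
    descriptions: List[str] = field(default_factory=list)


@dataclass
class Identifiable(Describable):
    identifier: str = ""
    name: str = ""


@dataclass
class ExampleClass(Identifiable):
    """Hypothetical model class; it inherits all three capabilities."""


item = ExampleClass(identifier="example:1", name="gel 1")
item.descriptions.append("created by the example script")
item.property_sets.append(NameValueType("stain", "silver", "string"))

print(item.identifier, item.property_sets[0].value)
```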
The BioAssay package in MAGE-OM contains a class describing the hybridization
of mRNA to an array. This class has been relocated in our model to ArrayOM,
and a new package (ArrayBioAssay) has been created in ArrayOM containing the
Hybridization class. The rest of the classes in BioOM:BioAssay are the same as in
MAGE-OM. The BioOM:BioAssayData package contains only five classes: BioAssayData,
BioAssayDimension, MeasuredBioAssayData, BioDataTuples and BioDataValues. The
five classes are identical to those in MAGE-OM. These classes specify the general structure
and location of data from any type of experiment and therefore reside in the BioOM names-
pace. BioAssayDimension allows experimental data to be packaged together across a range
of conditions, such as a multiple array or multiple gel comparison.
4.3.2 ArrayOM
Packages unchanged from MAGE-OM
The ArrayOM namespace (Figure 4.6) contains the packages derived from MAGE-OM which
are microarray specific. The packages Array, ArrayDesign and DesignElement describe the
layout of features on a microarray and have not been altered. QuantitationType includes
details of how array data is analysed using any of the available statistical packages, and is
therefore also included in ArrayOM. However, various data types from functional genomics
experiments could be quantified in similar ways, using standard statistical tests. Therefore,
an alternative design would be to include a generic package in BioOM modelling statistical
processing, recording the software used, and the parameters employed. This design was
considered but has not been implemented at this stage. The software for statistical analysis
of microarray data is continuously evolving and, apart from image analysis, there are no
dedicated statistical packages for quantifying proteomics data.
Differences from MAGE-OM
The ArrayBioAssayData package is a modified version of the BioAssayData package in
MAGE-OM. ArrayBioAssayData includes the MAGE-OM derived class BioDataCube that
represents the three dimensions of data: the array features; the parameter that is varied
across a multiple array experiment; and the values calculated for each array feature, such
as the relative fluorescence. BioDataCube captures the order of the three dimensions, and
stores pointers to separate files containing large quantities of numerical data. The three
dimensions of data also exist in a proteomics experiment, and potentially in other functional
genomics experiments, therefore in theory it should be possible to create a generic data
model in BioOM that models the dimensions of data. However, the BioDataCube is possi-
bly too simplistic to capture proteomics data, having only an ordering and pointers to lists
of values in files. In proteomics, a multiple 2-DE experiment may detect certain proteins
present on one gel and not another, calculated by image analysis software. The comparison
of multiple gels can be error prone and spots matched across multiple gels may have scores
assigned to the quality of the match. Spots may also be matched based on experimental
evidence, such as MS data. A generic data model covering all types of functional genomics
experiments would have to be more complex and would require major changes to the rela-
tionships between classes derived from PEDRo. The ArrayBioAssay package contains only
Hybridization, which is linked to classes in BioOM:BioAssay.
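The idea of a cube that records only an ordering of three dimensions and a pointer to a list of values can be sketched as follows. The letters B, D and Q (for arrays, array features and calculated value types), the dimension sizes and the flat layout are assumptions for illustration; the values are generated rather than read from a file.

```python
from typing import Dict, Sequence


def cube_value(values: Sequence[float], order: str,
               sizes: Dict[str, int], index: Dict[str, int]) -> float:
    """Look up one value in a flat list laid out in the given dimension order.

    `order` names the dimensions from slowest to fastest varying, e.g. "BDQ".
    """
    offset = 0
    for dim in order:
        offset = offset * sizes[dim] + index[dim]
    return values[offset]


# B = arrays in the experiment, D = array features, Q = calculated value types.
sizes = {"B": 2, "D": 3, "Q": 2}
# Stand-in for the numbers held in an external data file.
flat = [float(i) for i in range(2 * 3 * 2)]

position = {"B": 1, "D": 2, "Q": 0}
v_bdq = cube_value(flat, "BDQ", sizes, position)
v_qdb = cube_value(flat, "QDB", sizes, position)

print(v_bdq, v_qdb)
```

The same position resolves to a different element of the flat list when the recorded order changes, which is why the ordering must be stored with the data. The sketch also shows the limitation discussed above: there is no place in such a structure for a spot that is absent from one gel or for a score describing the quality of a match.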
4.3.3 ProteomicsOM
Figure 4.7: The ProteomicsOM namespace. [Packages shown: ProteinSeparation, MassSpecProtocol, ProteomeBioAssay, MassSpecData, ProteinRecord and ProteinData.]
The proteomics namespace (Figure 4.7) is a further development of PEDRo and Gla-PSI.
The design of PEDRo was based upon different principles from the design of MAGE-OM.
MAGE-OM was intended to be future proof, by including generic attributes in classes, and allowing
data types to be specified using controlled vocabularies of terms, rather than specifying
explicitly in the model which data types should be stored in which position. PEDRo contains
specific named attributes for all the data types that may need to be recorded. In 2-DE, a
gel is used to separate thousands of proteins into individual spots. An image of the gel is
analysed with specialised software that produces output about gel spots, such as an estimate
of volume, area, the coordinates on the gel and many others. PEDRo aims to explicitly define
all of the data types that are produced by current image analysis software and therefore will
require modification in the future. A model following MAGE design principles would have
a placeholder for the first data type and value, followed by the second data type and value,
and so on. ProteomicsOM includes the classes from PEDRo in new packages, however the
classes have been linked explicitly to components in BioOM that allow generic protocols and
parameters to be attached, as required. The following sections describe the classes that are
contained within the six packages of ProteomicsOM.
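The contrast between the two design principles described above can be sketched as follows. Both classes are illustrative only: the attribute names and numerical values are invented for the example and are not taken from PEDRo or MAGE-OM.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExplicitGelSpot:
    """PEDRo style: one named attribute for each data type."""
    volume: float
    area: float
    x: float
    y: float


@dataclass
class GenericGelSpot:
    """MAGE style: data types are named by controlled terms, not by the schema."""
    values: Dict[str, float] = field(default_factory=dict)


# Arbitrary values, used only to populate the two structures.
explicit = ExplicitGelSpot(volume=1520.0, area=48.0, x=102.5, y=310.0)

generic = GenericGelSpot()
for term, value in [("volume", 1520.0), ("area", 48.0), ("x", 102.5), ("y", 310.0)]:
    generic.values[term] = value

# A data type produced by new software needs no schema change here ...
generic.values["circularity"] = 0.87
# ... whereas the explicit class has no attribute in which to store it.
has_attr = hasattr(explicit, "circularity")

print(has_attr, sorted(generic.values))
```

The explicit form is easier to validate and query, while the generic form tolerates new data types. ProteomicsOM keeps the explicit PEDRo classes and adds the generic mechanism through links to BioOM.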
ProteinSeparation Package
The ProteinSeparation package describes a number of separation techniques, including 2-
DE and liquid chromatography, and is summarised in Figure 4.8. Classes modelling sep-
aration techniques are subclasses of BioAssayTreatment within BioOM. An instance of
BioAssayTreatment can be linked to Protocol, which allows any type of protocol informa-
tion from hardware or software to be added, along with a set of parameters. This mechanism
Figure 4.8: The ProteinSeparation package contains classes that model the relationship between separation techniques and the products of those techniques. [Classes shown: the separation techniques Gel2D and LCColumn, the separation products PhysicalGelSpot and Fraction, and the source biomaterial classes BioMaterial, BioAssayTreatment and BioMaterialMeasurement.]
can be used to store additional information about proteome experiments, if the attributes
specified in the part of the model derived from PEDRo do not cover the information that
must be recorded. This mechanism will be particularly important for storing information
about nascent technologies that cannot be covered by PEDRo as it stands. The products of
a separation technique, such as a gel spot, or column fraction are modelled as classes, with
a set of attributes capturing the relevant parameters, and are subclasses of BioMaterial.
The classes Gel2D and LCColumn have a large number of attributes that are not displayed in
Figure 4.8 for clarity (Gel2D records the gel dimensions, pI and molecular weight range and
so on). However, more detailed diagrams displaying all the attributes and the cardinality of
relationships are included in Appendix B. A separation product can become the input for
another separation technique, therefore the model utilises a link from BioAssayTreatment
to BioMaterial via BioMaterialMeasurement to specify the source of material. These three
classes are all contained within BioOM.
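The chaining of separation techniques can be sketched as below. The class names follow the text, but the attributes are heavily reduced, and the produced_by back-reference is an assumption introduced so that the chain can be walked in a few lines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BioMaterial:
    name: str
    produced_by: Optional["BioAssayTreatment"] = None  # assumed back-reference


@dataclass
class BioMaterialMeasurement:
    bio_material: BioMaterial
    quantity: str = ""  # free text in this sketch


@dataclass
class BioAssayTreatment:
    name: str
    source: BioMaterialMeasurement


class LCColumn(BioAssayTreatment):
    """Separation technique: liquid chromatography."""


class Gel2D(BioAssayTreatment):
    """Separation technique: two-dimensional gel electrophoresis."""


class Fraction(BioMaterial):
    """Separation product: a column fraction."""


class PhysicalGelSpot(BioMaterial):
    """Separation product: a spot on a gel."""


def lineage(material: BioMaterial) -> List[str]:
    """Walk back from a separation product to the original sample."""
    steps = [material.name]
    while material.produced_by is not None:
        treatment = material.produced_by
        steps.append(treatment.name)
        material = treatment.source.bio_material
        steps.append(material.name)
    return steps


mixture = BioMaterial("soluble protein mixture")
column = LCColumn("LC run", BioMaterialMeasurement(mixture))
fraction = Fraction("fraction 7", produced_by=column)
gel = Gel2D("2-DE gel", BioMaterialMeasurement(fraction))
spot = PhysicalGelSpot("spot 42", produced_by=gel)

print(lineage(spot))
```

Because every product is a BioMaterial and every technique takes a measured BioMaterial as its source, any number of separation steps can be linked without adding classes.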
ProteomeBioAssay package
The ProteomeBioAssay package contains only one class, GelImageAnalysis, however new re-
lationships have been added to enable the re-use of classes in BioOM:BioAssay in the protein
context (Figure 4.9). These relationships have the following semantics. FeatureExtraction
from MAGE-OM models the process by which data is extracted from a scanned microarray.
Figure 4.9: The relationship between the GelImageAnalysis class, in the ProteomeBioAssay package, with classes from the BioAssay package in the BioOM namespace.
In ProteomicsOM, GelImageAnalysis is a subclass of FeatureExtraction, and models
the process of analysing a 2-D gel with specialist software. FeatureExtraction is linked
to PhysicalBioAssay, which is linked to the source image (Image), the scanning process
(ImageAcquisition) and information about a specific channel or wavelength at which the
array has been scanned (Channel). These classes can be re-used in proteomics, to refer to the
scanning of a 2-D gel. The Channel class is re-used from MAGE to model the technique of
difference gel electrophoresis, in which a single gel is scanned at a number of different wave-
lengths. Data that is obtained from image analysis is stored in classes linked to BioAssayData
in the ProteinData package. There are two relationships from MeasuredBioAssay, one to
the data model in MeasuredBioAssayData, the other to FeatureExtraction. This en-
ables the raw data, MeasuredBioAssayData, to be linked to the process by which it was
generated (scanning and image analysis are referenced through FeatureExtraction and
PhysicalBioAssay).
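The way in which raw data can be traced back to the scanning process may be sketched as follows. The class names are those used in the text, but the attributes, the direct link from Image to Channel, the channel labels and the file names are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Channel:
    name: str  # label for the wavelength at which the gel was scanned


@dataclass
class Image:
    uri: str
    channel: Channel


@dataclass
class PhysicalBioAssay:
    name: str
    images: List[Image] = field(default_factory=list)


@dataclass
class FeatureExtraction:
    physical_bio_assay: PhysicalBioAssay


@dataclass
class GelImageAnalysis(FeatureExtraction):
    software: str = ""  # assumed attribute


@dataclass
class MeasuredBioAssayData:
    uri: str


@dataclass
class MeasuredBioAssay:
    feature_extraction: FeatureExtraction
    data: MeasuredBioAssayData


def scanned_channels(measured: MeasuredBioAssay) -> List[str]:
    """Follow the links from the data back to the channels that were scanned."""
    assay = measured.feature_extraction.physical_bio_assay
    return [image.channel.name for image in assay.images]


# One difference gel scanned at two wavelengths (file names are placeholders).
gel = PhysicalBioAssay("DIGE gel 1", [
    Image("gel1_scan_a.tif", Channel("channel A")),
    Image("gel1_scan_b.tif", Channel("channel B")),
])
analysis = GelImageAnalysis(gel, software="hypothetical image analysis tool")
measured = MeasuredBioAssay(analysis, MeasuredBioAssayData("gel1_spots.txt"))

print(scanned_channels(measured))
```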
ProteinData Package
The ProteinData package models information about gel spots (Figure 4.10). Spot data is
captured in IdentifiedSpot, which has attributes covering data types produced by image
analysis software. The model also captures data from difference gel electrophoresis. Spots
from the single channel image are captured in DIGESingleSpot, and co-migrated spots from
Figure 4.10: The ProteinData package. [Classes shown: GelImageAnalysis, FeatureExtraction, IdentifiedSpot, PhysicalGelSpot, BioMaterial, DIGESingleSpot, BioDataTuples, BioDataValues, MultipleAnalysis, MatchedSpots, PhysicalBioAssay, BioAssayData, BioAssayDimension and SpotRatio.]
the composite image are stored in IdentifiedSpot. Spot data is linked to the gel from which
it was produced because IdentifiedSpot is a subclass of PhysicalGelSpot, which is directly
linked to Gel2D in the ProteinSeparation package (Figure 4.8). Spot data is linked back to
the image analysis from which it was produced via BioAssayData and MeasuredBioAssay,
as described above (Figure 4.9). The ProteinData package also captures multiple gel com-
parisons. BioAssayDimension in BioOM models multiple sample comparisons, and is used
in ProteomicsOM by the addition of a link to MatchedSpots, modelling spots matched across
multiple gels to capture differential expression of proteins. MultipleAnalysis is a subclass
of GelImageAnalysis and records the software used for the multiple gel comparison, and
this groups together a set of MatchedSpots in one analysis.
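A multiple gel comparison of this kind can be sketched as below. The attributes, the match score, the gel and spot names and the numerical values are all invented for illustration, and the inheritance of MultipleAnalysis from GelImageAnalysis is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IdentifiedSpot:
    gel: str
    spot_id: str
    volume: float


@dataclass
class MatchedSpots:
    spots: List[IdentifiedSpot]
    match_score: Optional[float] = None  # assumed attribute


@dataclass
class MultipleAnalysis:
    software: str
    matches: List[MatchedSpots]


def gels_without_spot(match: MatchedSpots, gels: List[str]) -> List[str]:
    """List the gels on which no spot was matched."""
    present = {spot.gel for spot in match.spots}
    return [gel for gel in gels if gel not in present]


gels = ["gel A", "gel B"]
both = MatchedSpots([IdentifiedSpot("gel A", "A-1", 1200.0),
                     IdentifiedSpot("gel B", "B-7", 950.0)], match_score=0.9)
only_a = MatchedSpots([IdentifiedSpot("gel A", "A-2", 400.0)])
analysis = MultipleAnalysis("hypothetical matching software", [both, only_a])

missing = [gels_without_spot(match, gels) for match in analysis.matches]
print(missing)
```

The second match records a protein spot detected on one gel only, the case that a simple three-dimensional cube of values cannot express.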
MassSpecProtocol and MassSpecData packages
The packages capturing MS data and protocols contain classes derived from PEDRo (Figure
4.11). MS protocols are modelled by a package called MassSpecProtocol which contains
a class at the top level called MassSpecExperiment. MassSpecExperiment is a subclass
of BioAssayTreatment that can be used to link to the biological substance on which MS
has been performed (in BioMaterial). The substance can be the product of a series of
separation techniques, such as a spot from a 2-D gel. PEDRo-derived classes specify many of
the parameters that are associated with an MS instrument, along with the type of ionisation
Figure 4.11: The model of MS data and protocols, adapted from PEDRo.
Figure 4.12: The ProteinRecord package.
employed, such as electrospray or MALDI (described in Chapter 1). Additional text and
parameters not covered in these classes can be attached using the generic Protocol class
in BioOM, linked to BioAssayTreatment. This ensures that the model can be extended
to include protocols from different MS instrument manufacturers, new software, and new
technologies. A new package, MassSpecData, has been defined to capture the list of peaks
from a trace and the database searches that are subsequently carried out. Proteins identified
by MS analysis and database searches are stored in the ProteinRecord package.
ProteinRecord package
A new package was designed to store details of proteins identified in an investigation (Figure
4.12). The class Protein can be referenced from MS data that has been used for protein
identification. The protein identifier and database URL are captured in DatabaseEntry, and
the species of origin in OntologyEntry (from BioOM:Description). ProteinModification
stores information about modifications that have been observed. The type of modification,
such as glycosylation or phosphorylation, is obtained from a controlled vocabulary and cap-
tured in OntologyEntry. The position of the modification is captured in Location.
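The ProteinRecord package can be sketched in Python as follows. The class names follow the text; the attribute names, the representation of Location as a single residue number, the accession and the URL are placeholders chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyEntry:
    category: str
    value: str


@dataclass
class DatabaseEntry:
    accession: str
    uri: str


@dataclass
class Location:
    position: int  # residue number; assumed representation


@dataclass
class ProteinModification:
    modification_type: OntologyEntry
    location: Location


@dataclass
class Protein:
    database_entry: DatabaseEntry
    species: OntologyEntry
    modifications: List[ProteinModification] = field(default_factory=list)


protein = Protein(
    DatabaseEntry("EXAMPLE_0001", "https://example.org/protein/EXAMPLE_0001"),
    OntologyEntry("Species", "Mus musculus"),
)
protein.modifications.append(
    ProteinModification(OntologyEntry("ModificationType", "phosphorylation"),
                        Location(15))
)

summary = [(m.modification_type.value, m.location.position)
           for m in protein.modifications]
print(summary)
```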
4.3.4 A workflow for proteomics
A sample workflow is displayed in Figure 4.13, demonstrating how FGE-OM captures pro-
teomics data. The overview of the experiment is modelled by the class Experiment. If
the experiment includes multiple samples, for example comparing a number of 2-D gels,
the parameter that is varied between samples, such as the different genotypes of groups of
organisms, is attached to classes referencing Experiment. A biological substance must be
processed to extract proteins, and make the proteins soluble in a multi-stage process. This
is modelled by a series of treatments (Treatment) applied to a substance (BioMaterial), to
produce the final soluble mixture of proteins, on which certain separation techniques may be
performed. Protein separation techniques, such as 2-DE or liquid chromatography, are mod-
elled as specialised subclasses of BioAssayTreatment. Each BioAssayTreatment has a mea-
sured source of material, which is captured in BioMaterial and BioMaterialMeasurement.
When data is produced after imaging a 2-D gel, an instance of PhysicalBioAssay is created.
PhysicalBioAssay can be referenced by the class ImageAcquisition, representing the scan-
ning of the gel. 2-DE image analysis is represented by GelImageAnalysis, which is a subclass
of FeatureExtraction. Gel spot data produced by image analysis can be stored in specific
classes in the ProteomicsOM namespace, linked to image acquisition via MeasuredBioAssay.
If MS is performed on a spot excised from a gel, or a fraction from a column, an instance of
BioMaterial is created, modelling the physical entity that is the excised spot or fraction.
MassSpecExperiment is a subclass of BioAssayTreatment, which can be linked to the source
of material. MS data obtained from a particular gel spot is linked directly to data produced
by image analysis of the spot, which is captured in MeasuredBioAssayData.
4.4 Other work: CEBS object model for systems biology data
Subsequent to the development of FGE-OM, a new model covering several functional ge-
nomics techniques has been published [355]. This section reviews the new model, and dis-
cusses how it can contribute to the on-going standards work for FG.
Figure 4.13: A workflow for a proteomics experiment involving 2-DE or liquid chromatography to separate proteins, followed by MS to identify proteins. Diamonds indicate events, rectangles are physical entities and ovals represent data.
Figure 4.14: A subset of classes in the QuantitationType package from SysBio-OM. Darker boxes are newly created classes in the model, lighter boxes represent classes that have not been changed from MAGE-OM.
The CEBS object model, SysBio-OM, has been created with similar goals to FGE-OM
and will support a database for toxicogenomics. Toxicogenomics is the study of the effects
of toxicological compounds on gene and protein expression. The model has been created
by merging MAGE-OM and PEDRo, and adding additional classes to model metabolomics
data. SysBio-OM has been developed with the requirements of toxicogenomics in mind, but
the authors claim that it covers generic types of microarray, proteome or metabolome study.
There is no division of technologies into separate namespaces, as in FGE-OM, but new classes
have been added to the packages in MAGE-OM, and two packages, CommonBioAssayData
and SummaryData, have been newly designed. CommonBioAssayData covers protein ex-
pression, protein-protein interaction and metabolomics data, and SummaryData captures
a textual overview of the data to allow a researcher to decide whether a data set may be
relevant without requiring a full data analysis. At the top level there is very little difference
between SysBio-OM and FGE-OM: both have the classes Identifiable, Describable and
Extendable linked to many of the classes in the model. SysBio-OM is identical to MAGE-OM
(and FGE-OM) in the packages AuditAndSecurity, Array, ArrayDesign, DesignElement,
BQS, HigherLevelAnalysis and Description. The BioAssayData package in SysBio-OM is
identical to that in MAGE-OM, whereas in FGE-OM it has been split into two packages.
4.4.1 SysBio-OM data model
The SysBio-OM QuantitationType package contains two superclasses at the top level,
SpecializedQuantitationType and StandardQuantitationType. There are several newly
designed classes in SysBio-OM, including PeakAbundance, Intensity, Percentage and Volume,
all of which are subclasses of SpecializedQuantitationType (Figure 4.14 displays a subset
of classes in the package). These classes capture measurement data for various types of FG
experiment. The MAGE-OM derived classes for quantifying microarray data are subclasses
of StandardQuantitationType. SysBio-OM is not restrictive in the kinds of measurement
that can be used for different technologies, and is therefore more generic than the equivalent
section of FGE-OM. FGE-OM captures measurement data for proteomics in specific classes
in the ProteinData and MassSpecData packages, and microarray data in QuantitationType.
The approach taken in SysBio-OM may be superior for this section, and should be considered
as a possible design for an extension to the QuantitationType package in the next version of
MAGE.
The CommonBioAssayData package is a new feature in SysBio-OM (Figure 4.15) to
model proteomics and metabolomics data. Rows of numerical data are represented by
CommonBioDataTuples, and single data points are subclasses of DataElement (boxed in
Figure 4.15). The raw data values are stored in the class QuantitationDimension in the
CommonBioAssayData package. It is not clear how the model captures information about
spots matched across multiple gels.
The Measurement package in SysBio-OM is an extension of the MAGE-OM package,
incorporating many different types of measurement and units that could be used in functional
genomics. In MAGE-OM, and SysBio-OM, each class has an attribute unitNameCV with
an enumeration of values, e.g. the class TimeUnit has an enumeration containing the values:
years, months, weeks, d, h, s, us, ns, fs, other. The option other is included in almost all
classes in the Measurement package and causes problems for developing applications based
on the model because it is not specified how the type other is controlled or used. The FGE-
OM Measurement package does not have any of the specific classes for units but instead has
two links to the OntologyEntry class to specify the type and name of the unit (Figure 4.16).
This design may be superior because the names of units are not hard coded in the model,
avoiding the problem of the attribute other, and it is therefore unlimited in what can be
captured. Incorporating all the measurement types and units into the MGED Ontology
would be a simple task, because it already includes most of those added to SysBio-OM.
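The contrast between the two designs can be illustrated with a minimal Python sketch. It assumes the TimeUnit enumeration holds only the values quoted above; the field names of OntologyEntry and Measurement, and the example unit "ps", are hypothetical and are not taken from either object model.

```python
from dataclasses import dataclass
from enum import Enum


class TimeUnit(Enum):
    """Enumerated style: the legal unit names are fixed inside the model."""
    YEARS = "years"
    MONTHS = "months"
    WEEKS = "weeks"
    D = "d"
    H = "h"
    S = "s"
    US = "us"
    NS = "ns"
    FS = "fs"
    OTHER = "other"  # escape hatch whose handling the model leaves unspecified


def enumerated_unit(name: str) -> TimeUnit:
    """Map a unit name onto the enumeration; unknown names collapse to OTHER."""
    try:
        return TimeUnit(name)
    except ValueError:
        return TimeUnit.OTHER


@dataclass(frozen=True)
class OntologyEntry:
    """Reference to an ontology term (illustrative fields only)."""
    category: str
    value: str


@dataclass(frozen=True)
class Measurement:
    """Ontology style: one generic class with two links to OntologyEntry."""
    value: float
    unit_type: OntologyEntry
    unit_name: OntologyEntry


# A unit missing from the enumeration loses its name in the enumerated design...
lost = enumerated_unit("ps")

# ...but is kept in the ontology-based design, provided the ontology defines it.
kept = Measurement(
    value=250.0,
    unit_type=OntologyEntry("UnitType", "time"),
    unit_name=OntologyEntry("TimeUnit", "ps"),
)
```

In the sketch, the enumerated design can only record that the unit was "other", whereas the ontology-based design retains the unit name without any change to the model.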
Figure 4.15: The CommonBioAssayData package from SysBio-OM. The boxed classes are discussed in the text.
Figure 4.16: The top image shows a small subset of classes from the Measurement package in SysBio-OM, the lower is the Measurement package in FGE-OM.
Figure 4.17: The Protocol package from SysBio-OM. The boxed classes are newly created.
4.4.2 SysBio-OM Protocol and BioMaterial packages
The Protocol package in SysBio-OM diverges from MAGE-OM (Figure 4.17) by introducing
new classes for different types of protocol (1-D gel, 2-D gel, MS database search and NMR).
The model does not specify what attributes belong to these classes, which may create
confusion for developers using this part of SysBio-OM. The Protocol package in MAGE-OM
was intended to be independent from technology and can therefore be re-used with no change
for any type of FG experiment. The addition of new classes without attributes does not add
significantly to what can be captured by this part of the model. A new design that can
capture all the information in Protocol of SysBio-OM but remain generic would introduce
a new relationship from the Protocol class to OntologyEntry called protocolType, which
captures the type of protocol, such as 2-D or 1-D gel.
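The proposed protocolType relationship can be sketched as follows. This is an illustration only: the attribute names, the example protocols and the helper protocols_of_type are hypothetical, and the sketch simply shows that a single generic Protocol class can distinguish protocol types without technology-specific subclasses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OntologyEntry:
    """Reference to an ontology term (illustrative fields only)."""
    category: str
    value: str


@dataclass
class Protocol:
    """Generic protocol; its type is a reference to an ontology term."""
    name: str
    description: str
    protocol_type: OntologyEntry


def protocols_of_type(protocols: List[Protocol], type_value: str) -> List[str]:
    """Return the names of all protocols annotated with the given type."""
    return [p.name for p in protocols if p.protocol_type.value == type_value]


# Hypothetical example records.
PROTOCOLS = [
    Protocol("Whole cell lysate 2-DE", "IEF followed by SDS-PAGE",
             OntologyEntry("ProtocolType", "2-D gel")),
    Protocol("Membrane fraction SDS-PAGE", "Single dimension separation",
             OntologyEntry("ProtocolType", "1-D gel")),
    Protocol("Peptide mass fingerprint search", "Search of MS peak lists",
             OntologyEntry("ProtocolType", "MS database search")),
]

two_d = protocols_of_type(PROTOCOLS, "2-D gel")
```

Adding a new kind of protocol then requires only a new ontology term, not a change to the object model.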
The BioMaterial package in SysBio-OM has several new classes, modelling gel spots and
column fractions, derived from PEDRo (Figure 4.18). These classes also exist in FGE-OM
but reside in the ProteomicsOM namespace in order to leave the BioMaterial package
independent of any technology; the core functionality of the two models is nevertheless very
similar for this part. It may be advantageous to put technology-specific classes in separate
packages, as in FGE-OM, so that it is easier for developers to understand the intended usage
of the model and focus only on the parts of the model that are required.
4.4.3 SysBio-OM BioAssay and SummaryData packages
The BioAssay package in SysBio-OM is displayed in Figure 4.19. The intended usage of the
package is very similar to a combination of BioAssay, ArrayBioAssay and ProteomeBioAssay
in FGE-OM. A new class, GelFeatureExtraction, models the process of gel image analysis
enabling the classes Image, Channel and ImageAcquisition from MAGE-OM to be re-used
in the proteomics context. Another new class, CommonBioAssayCreation, models techniques
such as a 2-D gel, NMR or a column separation, and links to data acquisition and raw data,
such as images, through the PhysicalBioAssay class. CommonBioAssayCreation functions
in a very similar way to BioAssayTreatment in FGE-OM (although BioAssayTreatment
also exists in SysBio-OM with a different function). CommonBioAssayCreation references
the source material for a treatment through BioMaterialMeasurement in exactly the same
way as in FGE-OM. PhysicalBioAssay has associations with classes modelling column or
NMR data files for metabolomics data (NMROutputFile and ColumnFractionOutputFile).
The SummaryData package is a new development proposed in SysBio-OM (diagram not
shown), which contains only two classes, QualitativeOrSummaryData and
DataInterpretation. These two classes are for adding textual descriptions to the
experiment, and it remains to be seen how this differs from what can be captured in the
Experiment package.
Figure 4.18: The BioMaterial package from SysBio-OM.
Figure 4.19: The BioAssay package from SysBio-OM.
4.5 Discussion
The object model, FGE-OM, was created in UML to represent both proteomics and microar-
ray experiments. FGE-OM is based on MAGE-OM and incorporates additional information
from PEDRo and Gla-PSI. There are three namespaces in the new model: BioOM, ArrayOM,
and ProteomicsOM. The BioOM namespace is suitable for describing a generic functional
genomics experiment, encompassing microarrays, 2-DE, histochemistry and others. The
ProteomicsOM namespace was defined from PEDRo and Gla-PSI, and includes classes with
attributes covering 2-DE, MS and data analysis. ProteomicsOM has been integrated with
BioOM, enabling generic protocols, including details of hardware or software, to be attached
to specific classes. FGE-OM uses inheritance from several key superclasses: experimental
techniques are modelled as subclasses of BioAssayTreatment and the products of treatments
are subclasses of BioMaterial. This framework will allow new models describing other tech-
nologies to be added into FGE-OM without significant difficulty, allowing a unified model
for functional genomics to be created in the future. An important use of FGE-OM will be to
generate an XML Schema, to allow research groups to format data in a consistent manner
into FGE-ML, a markup language based on the model. A software toolkit is also required,
based on the microarray software toolkit (MAGEstk), for creating FGE-ML from the object
model.
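The inheritance framework described in this paragraph can be illustrated with a short Python sketch. Only BioAssayTreatment and BioMaterial are names from the model; the subclasses Gel, Fraction, TwoDGelSeparation and ColumnSeparation, the product_type attribute and the run_workflow helper are hypothetical and are used only to show how a new technology could be added by subclassing.

```python
from typing import List


class BioMaterial:
    """Superclass for samples and for the products of treatments."""

    def __init__(self, name: str) -> None:
        self.name = name


class BioAssayTreatment:
    """Superclass for experimental techniques applied to a BioMaterial."""

    product_type = BioMaterial  # kind of material the treatment produces

    def __init__(self, name: str) -> None:
        self.name = name

    def apply(self, source: BioMaterial) -> BioMaterial:
        """Produce a new material derived from the source material."""
        return self.product_type(f"{source.name} -> {self.name}")


# Hypothetical technology-specific subclasses.
class Gel(BioMaterial):
    pass


class Fraction(BioMaterial):
    pass


class TwoDGelSeparation(BioAssayTreatment):
    product_type = Gel


class ColumnSeparation(BioAssayTreatment):
    product_type = Fraction


def run_workflow(sample: BioMaterial,
                 treatments: List[BioAssayTreatment]) -> List[BioMaterial]:
    """Apply treatments in order, returning every material in the chain."""
    history = [sample]
    for treatment in treatments:
        history.append(treatment.apply(history[-1]))
    return history


history = run_workflow(
    BioMaterial("cell lysate"),
    [ColumnSeparation("ion exchange"), TwoDGelSeparation("2-DE")],
)
```

Because run_workflow depends only on the two superclasses, a treatment from another technology can be chained in without changing existing code.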
FGE-OM has been created by merging models that have slightly different design
principles. MAGE-OM was intended to be “future proof” by including generic classes that could
be used for various technologies. Conversely, PEDRo aimed to describe the current status of
proteomics experiments, recognising that future developments would require changes to the
model. The forthcoming versions of both MAGE-OM and the protein model, PSI-OM, will
undergo changes that may bring about the convergence of the different design principles. In
other words, MAGE-OM will include classes for some parts of the model that capture the
standard cases more simply, and PSI-OM will utilise more generic classes to model exper-
imental protocols and biological samples. This issue is outlined in detail in the following
section. We believe that the design process for the next version of both MAGE-OM and
PSI-OM should be guided by the experience of developers who have attempted to create
software based on the two models. It is our view that ontologies should be used extensively,
to reduce the burden on the developers to create an object model that captures all possible
uses of the technology.
FGE-OM demonstrates that the integration of the two current versions of the object
models is feasible. We believe that even if the next versions of the models are developed
independently, the framework described here can be easily evolved, reflecting the changes to
the new object models, and there are significant benefits to capturing both microarray and
proteomic technology in the same structure.
4.5.1 FGE-OM, SysBio-OM and future standards
The CEBS SysBio-OM model is an alternative proposal for an FG data standard. There are
currently no major proposals specifically for metabolomics, however CCPN (A Collaborative
Computing Project for the NMR Community) is fairly well established in the NMR com-
munity and contains an object model and programming interface [113]. The metabolomics
part of SysBio-OM comprises a simple model of NMR data, therefore the CCPN proposals
may be able to contribute to the efforts, and both models should stimulate discussion in the
metabolomics field as to the requirements for a data standard.
In overview, SysBio-OM adds new classes to seven MAGE packages to cover proteomics,
and creates two new packages. The object model has been used
for generating code that acts as a bridge between flat data files and the CEBS database,
and it is planned that future functionality will enable import and export of MAGE-ML and
the future proteome standard, PSI-ML. Another function of SysBio-OM is to act as a pro-
posal for the future development of an integrated data standard across several fields. The
design of certain packages, such as the QuantitationType package, serves this purpose well,
because it is generic, and can capture a wide range of quantitation types. The design of
other packages such as BioMaterial and Protocol mixes the generic approach of MAGE with
technology specific classes. It is our view that this may cause problems because the design of
MAGE will change for the next version, and the PEDRo proposals are changing to become
PSI-OM, as reported in the previous chapter. Therefore, it is likely that a large amount of
work will be required to redesign these packages to reflect the changes to MAGE-OM and
PSI-OM, but this should not be the case for FGE-OM. FGE-OM separates different tech-
nologies with only a few key relationships linking classes in different namespaces, and the
original functionality of MAGE-OM packages is maintained in almost all cases. Therefore,
when PSI-OM is finalised it can be easily merged with the next version of MAGE, using
FGE-OM as a guide. The packages CommonBioAssayData and BioAssay in SysBio-OM
function in a similar way to a combination of ProteinData and the three related BioAs-
say packages in FGE-OM. The CommonBioAssayData package (SysBio-OM) appears to be
more generic than ProteinData (FGE-OM) and utilises inheritance from the superclasses
DataElementDimension, DataElement and QuantitationType for the three dimensions of
data. It remains to be seen how this works in practice, but if a successful implementation
of this part of the model is demonstrated in the CEBS database, this may represent a good
framework for developing a generic data model across all FG experiments. It is likely that
the best design of a standard for FG will take parts of both SysBio-OM and FGE-OM and
a potential framework for this integration is described below.
4.5.2 Developments to MAGE-OM
The division of FGE-OM into namespaces is a simple but important concept that should
make a large object model easier to understand, allowing developers to focus more quickly on
the relevant parts. The next version of MAGE is planned to contain a core of components
that are shared across all types of FG experiment, similar to the BioOM namespace. A
structured description of the purpose of the experiment, the biological samples and the
parameter that is varied across samples is the most important part of the core. All types
of FG experiment can be described in this way and the use of the MGED Ontology, and
extensions to it, will be an essential component. This part of the design ensures that the
purpose of the experiment can be determined very easily by manual or automated inspection
of files rather than having to parse all the information in the document and search for the
differences between the samples. For example, the purpose of an experiment may be to
determine the changes in gene expression between two cell lines, one of which had gene X
knocked out. This information must be easy to search for as it is one of the most crucial
parts of the experimental annotation. FGE-OM, MAGE-OM and SysBio-OM have classes
at the top level, ExperimentFactor and ExperimentFactorValue, which allow the critical
characteristics and differences between the samples under comparison to be specified. These
classes are vital for the purpose of the experiment to be easily understood and therefore the
FG data standard should retain them at the top level. A database should ensure that these
attributes are stored in a way that allows rapid querying and programmatic access to this part
of the annotation. I believe that the next version will benefit from the proposed extensions
of SysBio-OM and FGE-OM. The QuantitationType and CommonBioAssayData packages from
SysBio-OM offer a generic framework for capturing FG data and could be incorporated into
the core namespace. In FGE-OM, the simplification of the Measurement package may be
advantageous and should be considered.
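The requirement that a database should allow rapid querying of experimental factors can be sketched with a small relational example. The table and column names below are hypothetical and are not taken from any existing schema; the sketch uses an in-memory SQLite database purely for illustration.

```python
import sqlite3

# Hypothetical tables mirroring ExperimentFactor and ExperimentFactorValue.
SCHEMA = """
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE experiment_factor (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    name TEXT NOT NULL
);
CREATE TABLE experiment_factor_value (
    id INTEGER PRIMARY KEY,
    factor_id INTEGER NOT NULL REFERENCES experiment_factor(id),
    sample TEXT NOT NULL,
    factor_value TEXT NOT NULL
);
-- Index so that experiments can be found quickly by the factor studied.
CREATE INDEX idx_factor_name ON experiment_factor(name);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the demo schema and load the knockout example from the text."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO experiment VALUES (1, 'Gene X knockout study')")
    conn.execute("INSERT INTO experiment_factor VALUES (1, 1, 'genotype')")
    conn.executemany(
        "INSERT INTO experiment_factor_value VALUES (?, ?, ?, ?)",
        [(1, 1, "cell line A", "wild type"),
         (2, 1, "cell line B", "gene X knocked out")],
    )
    return conn


def samples_by_factor(conn: sqlite3.Connection, factor_name: str):
    """List (experiment, sample, value) for every sample varying this factor."""
    return conn.execute(
        """
        SELECT e.name, v.sample, v.factor_value
        FROM experiment_factor_value AS v
        JOIN experiment_factor AS f ON f.id = v.factor_id
        JOIN experiment AS e ON e.id = f.experiment_id
        WHERE f.name = ?
        ORDER BY v.id
        """,
        (factor_name,),
    ).fetchall()
```

A single indexed query over the factor tables then reveals the purpose of the experiment, without parsing the rest of the annotation.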
The next version of MAGE aims to fix semantic annotation problems with the current
version that have been discovered over several years since its release. PEDRo has been
widely accepted as a draft standard from which the first formal proteomics standard can
be developed. It is vital that PSI-OM, which will supersede PEDRo, utilises the experience
gained from MAGE to avoid the same problems. One general criticism of MAGE-OM is that
for certain concepts it is “over engineered”, in other words, the designers attempted to define
a model that could cover all eventualities but the most common case is captured in a complex
way. Large efforts are required from software developers to create applications that produce
MAGE-ML, and there are still relatively few public databases that offer MAGE-ML input
and output, although this feature is in development for almost all microarray databases.
The next version of MAGE is likely to make greater use of the OntologyEntry class, and
PSI-OM should also utilise ontologies to capture complex concepts. The PSI ontology (PSI-
Ont) will become an extension of the MGED Ontology. PSI-OM will be designed with
the consideration of future integration with MAGE, and the separate mass spectrometry
standards that are under development (as described in the previous chapter).
4.5.3 Integrated standards
The development of an integrated standard requires joint meetings between PSI and MGED.
The two organisations are now committed to co-developing a standard, however the devel-
opment of MAGE will first focus on the creation of a core module, based around similar
principles to BioOM. The last meeting of PSI (Nice, France 2004) was attended by sev-
eral key developers of MAGE, and the previous MGED programming workshop (European
Bioinformatics Institute, Cambridge, UK Dec 2003) had presentations by members of PSI.
FGE-OM was presented at both meetings by the author. It is vital that collaboration con-
tinues between the two organisations. This requires principal investigators to present work
to the wider biological research community to ensure that there is a good awareness and
support for the standard. The Object Management Group (OMG) was involved with the
development of MAGE-OM, providing a framework for checking the consistency of the object
model. The future FG standard should also be vetted through OMG, because while this in-
troduces extra developmental stages, there are likely to be fewer problems that arise once the
model is being used by a large community. Finally, there needs to be a number of workshop
meetings in which developers of MAGE and PSI-OM work together to define a format that
captures everything that is required in the two fields. The format should support functional
genomics, not just microarrays and proteomics, therefore researchers in other parts of FG
research should also be aware of the efforts. It is likely that a data format will only gain
widespread support once several major databases are committed to its development. One
other consideration is that a data format that can encompass a range of functional genomics
techniques may be too bulky for many users who use only a single technique and wish to
utilise a subset of the standard. If the different namespaces are well designed, it will be pos-
sible to derive the single technology data formats from the model, MAGE-ML and PSI-ML,
for transferring results to databases storing only microarray or proteomics experiments.
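The idea of deriving a single-technology format from a namespaced model can be sketched as a simple selection over namespaces. The namespace names are those of FGE-OM, but the assignment of packages to namespaces shown here is illustrative only and incomplete, and derive_subset is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative, incomplete assignment of packages to FGE-OM namespaces.
MODEL: Dict[str, List[str]] = {
    "BioOM": ["Experiment", "BioMaterial", "Protocol", "Measurement"],
    "ArrayOM": ["ArrayBioAssay", "QuantitationType"],
    "ProteomicsOM": ["ProteomeBioAssay", "ProteinData", "MassSpecData"],
}


def derive_subset(model: Dict[str, List[str]], technology: str,
                  core: str = "BioOM") -> Dict[str, List[str]]:
    """Keep the shared core plus one technology-specific namespace."""
    if technology not in model or technology == core:
        raise KeyError(f"unknown technology namespace: {technology}")
    return {ns: list(model[ns]) for ns in (core, technology)}


proteomics_only = derive_subset(MODEL, "ProteomicsOM")
microarray_only = derive_subset(MODEL, "ArrayOM")
```

Under this assumption, each single-technology format contains the shared core and exactly one technology namespace, so a proteomics-only database never needs to interpret microarray-specific packages.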
In the following chapter, the development of an Internet accessible database is described,
which will ultimately form part of a large system for functional genomics. The CEBS
database will also offer access to various types of FG data, and it is likely that several
other systems will come on-line in the next few years. It is important that developers of
different systems collaborate at an early stage to avoid the data incompatibility problems
that have arisen over the last decade in biomedical research, which make the challenge of
data integration so great.
4.6 Conclusion
The chapter has described the development of an object model for functional genomics. FGE-
OM comprises three namespaces that have been created to reflect the different components
in a large biological investigation. BioOM contains twelve packages and ArrayOM contains
six packages that match very closely the structure of MAGE-OM. The third namespace,
ProteomicsOM, comprises six packages that contain classes derived from PEDRo and Gla-
PSI. FGE-OM is intended to demonstrate a potential schema for the integration of microarray
and proteomics data standards, and acts as a proposal from which the next version of MAGE-
OM can be developed. The division into namespaces should allow the model to evolve as
the proteomics and microarray proposals change, and also creates a framework that enables
object models from other types of FG experiment to be integrated. FGE-OM has been
presented to PSI to influence the design of the finalised standard for proteomics, and to
MGED to generate discussion about the next version of MAGE-ML. The model has been
verified against real data by the development of a database implementation that matches
the structure of the object model very closely, described in the following chapter.
Chapter 5
A prototype public database for
proteomics
5.1 Introduction
The main aim of the research presented in this thesis is to improve the facilities for data
sharing and querying in functional genomics (FG). In the previous chapter, the definition of
a functional genomics object model was given, which acts as a proposal for a data standard.
In this chapter, a database implementation is discussed which is capable of storing data from
both microarrays and proteomics. The RAPAD (RNA And Protein Abundance Database)
system is an extension of the RAD microarray database from the University of Pennsylvania,
into which a proteomics component has been incorporated. There are many database systems
for storing microarray data (ArrayExpress, GEO and SMD, summarised in Chapter 2) and
several initial attempts to capture proteomics data, of which SWISS-2DPAGE is the best
established. However, there is no major public repository that covers both protein
separations experiments and mass spectrometry, and an integration of data from microarrays
and proteomics has not previously been demonstrated in a database.
5.1.1 Extending existing technology
A description of different experiment types used in FG was given in Chapter 1. In overview,
a typical proteomics experiment involves obtaining a set of samples produced under different
conditions and attempting to separate, identify and (possibly) quantify the proteins present
in the different samples. Where proteomics differs from microarray analysis is the range of
different methods that could be used at each stage to get the final result, including: multiple
separation stages, novel techniques for quantifying protein abundance, and identification of-
ten through mass spectrometry (MS) accompanied by database searches. A significant part
of the challenge of formally describing this information occurred during the development of
the object models described in the previous two chapters. Therefore, the major implementa-
tion challenges involved creating interfaces for capturing data and protocols, development of
complex query facilities, visualisation of results and data integration. Section 5.2 describes
the databases that exist for capturing proteomics data; none offers a complete solution
for storing 2-DE data, MS data and experimental protocols. Therefore, a system is
required that can capture a complete proteomics workflow in a structured format that can be
queried. The decision to extend the RAD system into proteomics rather than develop a new
database from scratch was based on several criteria. Firstly, it is important that microarrays
and proteomics data can be queried side by side, and in conjunction with other functional
genomics data. This will be facilitated by having a shared database schema and user inter-
face, and it will be easier to produce a mapping from an object model, such as FGE-OM,
to a database if the general structure is similar. There is already a close correspondence
between RAD and MAGE-OM, therefore a large part of FGE-OM is already mapped to
a relational representation. Secondly, RAD is a part of the GUS system which is a major
public repository, providing access to genomic sequence data, ESTs, RNA, SAGE [332] and
gene expression data. One of the long term goals of GUS is to incorporate proteomics, im-
munohistochemistry, and cell anatomy components, creating a single access point to many
types of functional genomics data (Figure 5.1). Therefore, RAPAD also serves as a prototype
for developing a proteomics namespace in GUS which, when complete, will provide access to
2-DE, MS and other proteomics data for major web sites such as PlasmoDB [21], ToxoDB
[187] and GeneDB [127]. Thirdly, the time required to develop a large system is significantly
reduced if developing on top of established software, compared with developing de novo. In
summary, RAPAD was developed with several major goals that are explored in the rest of
the chapter:
• RAPAD functions as a prototype for a major public repository for proteomics data,
and ultimately will form part of GUS.
• The implementation was created to provide a framework for developing tools for in-
vestigating the correlation between gene expression and protein abundance, stored in
the same database.
• The current implementation, while serving as a prototype for the future development
of a public resource, also acts as a platform for supporting on-going proteomics
research. RAPAD currently supports two projects at the University of Glasgow: changes
in protein expression of host cells following invasion of Toxoplasma gondii, described
in Chapter 6; and the determination of the proteome of Trypanosoma brucei (Chapter
7). It is planned that the current implementation will be extended and used to manage
large volumes of data produced by the Functional Genomics Facility in Glasgow [293].
Figure 5.1: A summary of several workflows in functional genomics to illustrate the requirements for data integration.
[Diagram: "Overview of Functional Genomics Experiments", showing sample flow and data flow for genome sequencing, microarray analysis, proteomics and immunohistochemistry, converging on data integration and a functionally annotated genome.]
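The second goal in the list above, investigating the correlation between gene expression and protein abundance held in the same database, can be illustrated with a minimal sketch. The table names, column names and values are invented for illustration and do not reflect the RAPAD schema; an in-memory SQLite database stands in for the real system.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Load hypothetical log ratios for transcripts and proteins."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene_expression (gene_id TEXT, log_ratio REAL);
        CREATE TABLE protein_abundance (gene_id TEXT, log_ratio REAL);
        """
    )
    conn.executemany(
        "INSERT INTO gene_expression VALUES (?, ?)",
        [("geneA", 1.5), ("geneB", -0.5), ("geneC", 0.25)],
    )
    conn.executemany(
        "INSERT INTO protein_abundance VALUES (?, ?)",
        [("geneA", 1.0), ("geneB", 0.5)],
    )
    return conn


def paired_changes(conn: sqlite3.Connection):
    """Join the two data types on a shared gene identifier."""
    return conn.execute(
        """
        SELECT g.gene_id, g.log_ratio, p.log_ratio
        FROM gene_expression AS g
        JOIN protein_abundance AS p ON p.gene_id = g.gene_id
        ORDER BY g.gene_id
        """
    ).fetchall()


def concordant(pairs):
    """Genes whose mRNA and protein levels change in the same direction."""
    return [gene for gene, rna, protein in pairs if rna * protein > 0]
```

Because both data types sit in one schema, the comparison is a single join; with separate databases the same question would first require identifiers to be exported and reconciled.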
5.1.2 The development of RAPAD
The approach taken during the development of RAPAD is as follows. A large database
schema has been designed (174 tables), based closely on the RAD system, and new table
definitions have been created covering proteomics (51 tables). The main advantage of devel-
oping on top of RAD is that it is already MIAME compliant, and a set of tables exist that
correspond to MAGE-OM objects. Therefore, the same tables can be used to store objects
defined in the BioOM (generic) and ArrayOM (microarray specific) namespaces in FGE-OM.
There is software under development for transferring data between MAGE-ML and RAD
(MAGE - RAD Translator, Mr T [202]), which can be adapted to import MAGE-ML into
RAPAD, and extended to map data stored in RAPAD to the finalised functional genomics
data format, based on FGE-OM. The proteomics component was derived primarily from
the PEDRo database schema and the Gla-PSI object model. The PEDRo database schema
matches the PEDRo object model very closely, much of which contributed to FGE-OM;
mapping concepts from the object representation (in the ProteomicsOM namespace) to
the RAPAD database schema was therefore not a major challenge. Additional tables were imported
from the Core namespace of GUS for storing login and security data, and from the SRes
namespace of GUS for storing taxonomic information, bibliographic references and contact
details. The different namespaces that exist in GUS are described in more detail in Section
5.2. The RAPAD schema, and the web interface, are freely available for download from
the FGE-OM web site (www.gusdb.org/fge.html). Figure 5.2 summarises the correspondence
between classes in FGE-OM and tables in RAPAD. There are additional components that are
modelled in FGE-OM using ontologies, which are stored in RAPAD using other GUS-derived
tables, described in Section 5.3.1.
Figure 5.2: A mapping from classes in FGE-OM to database tables in RAPAD.
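A class-to-table correspondence of this kind can be sketched as a hand-maintained lookup. The class GelSpot, its attributes and the table name gel_spot are hypothetical and are not taken from Figure 5.2 or from the RAPAD schema; the sketch only shows the general shape of such a mapping.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class GelSpot:
    """Hypothetical object-model class with three attributes."""
    spot_id: int
    x: float
    y: float


# Hypothetical hand-written mapping from class names to table names.
CLASS_TO_TABLE = {"GelSpot": "gel_spot"}


def insert_statement(obj) -> Tuple[str, tuple]:
    """Build a parameterised INSERT for an object with a mapped table."""
    table = CLASS_TO_TABLE[type(obj).__name__]
    columns = asdict(obj)  # attribute names double as column names
    names = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({names}) VALUES ({marks})"
    return sql, tuple(columns.values())


sql, params = insert_statement(GelSpot(1, 10.5, 20.0))
```

The values are passed as parameters rather than interpolated into the SQL string, so only the trusted table and column names appear in the statement text.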
5.1.3 Chapter guide
The rest of the chapter is structured as follows: previous work on databases and ontologies
is described in the next section. The methods used to develop the schema, the user interface
and perform data integration are outlined in Section 5.3. Section 5.4 describes the current
implementation and the database schema, using the web interface to illustrate examples.
Section 5.5 includes a discussion of the technology, and how it can be extended in the future.
5.2 Previous work
5.2.1 GUS
The GUS database developed at the University of Pennsylvania is an established system
storing functional genomics data. GUS provides the database facilities for several major web
sites that allow access to genome, transcriptome and EST data for various organisms,
including Plasmodium falciparum [248], Trypanosoma brucei [127], Toxoplasma gondii [320]
and several others. GUS also supports AllGenes [10], which is a gene index for human
and mouse created from collections of ESTs and mRNA sequences (described in more detail
in the following chapter). GUS consists of several namespaces that have been developed
independently and added into one large schema. The tables comprising the namespaces can
be viewed in the GUS Schema browser [141], and fall into the following categories: Core, App,
DoTS, RAD, SRes and TESS. Core stores details of users, projects, and information about
a specific database implementation. The App namespace stores help pages and information
specific to the application that is using GUS. The DoTS database supports the AllGenes web
sites and consists of tables for storing details of genes, mRNAs and ESTs. The different types
of sequence data can be associated together, for example if an EST and mRNA sequence
both arise from a single gene sequence, all entries can be linked together through a single
entity (a DoTS gene). This enables a user to map different types of database identifier back
to the same gene. The RAD database is the gene expression component of the database, and
various features are described in the rest of the chapter. SRes contains a variety of tables for
storing contact details, taxonomy, phenotype, general ontologies, associations to the Gene
Ontology and others. TESS (Transcription Element Search System) stores information about
transcription factor binding sites and can be used for predicting new sites according to a
statistical model.
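The way DoTS links different sequence types through a single gene entity can be illustrated with a small in-memory index. The class GeneIndex and all identifiers below are hypothetical; the sketch shows only the principle of resolving different identifiers to one gene, not the DoTS table structure.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class GeneIndex:
    """Links sequence identifiers of any type to a single gene entity."""

    def __init__(self) -> None:
        self._gene_of: Dict[str, str] = {}
        self._members: Dict[str, Set[str]] = defaultdict(set)

    def link(self, gene_id: str, identifier: str) -> None:
        """Record that a sequence identifier arises from the given gene."""
        self._gene_of[identifier] = gene_id
        self._members[gene_id].add(identifier)

    def gene_for(self, identifier: str) -> Optional[str]:
        """Map an EST, mRNA or other identifier back to its gene."""
        return self._gene_of.get(identifier)

    def same_gene(self, first: str, second: str) -> bool:
        """True if both identifiers resolve to the same known gene."""
        gene = self.gene_for(first)
        return gene is not None and gene == self.gene_for(second)


# Hypothetical identifiers.
index = GeneIndex()
index.link("gene:0001", "EST:12345")
index.link("gene:0001", "mRNA:67890")
index.link("gene:0002", "EST:54321")
```

With such an index, results reported against an EST identifier in one experiment and an mRNA identifier in another can be compared at the level of the gene.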
5.2.2 Proteomics database
Several existing proteomics databases are described in this section. The most
established proteomics database is SWISS-2DPAGE [153], which offers static gel images that
can be clicked on to access protein data, and allows searches on proteins by accession number,
by text search over descriptions and by the author of the study. The data format used by SWISS-
2DPAGE was described in Chapter 3. There are a number of other systems developed using
the software available from SWISS-2DPAGE that offer similar capabilities. In general, 2-D
gel databases on the Web tend to offer only static pictures of 2-D gels with links to pages
about spots identified on the gels; they have at best very limited search facilities, and little
or no information about experimental protocols or biological samples. The data from these
systems can usually only be queried by manual browsing of web pages.
The GELBANK system [20], also described in Chapter 3, has facilities for searching,
and has a visualisation system for gels that allows zooming, but stores only very basic
information about experimental protocols: the gel stain, a brief description of the starting
sample and a description of the first and second dimension separation. GELBANK does
not store mass spectrometry data, and SWISS-2DPAGE only stores basic MS information
without any information about the quality of the match, therefore in these systems it is
difficult to place a confidence value on the correct identification of a protein.
There are several systems that manipulate gel images and enable Web publishing of data.
The GelScape system is one example, which allows researchers to register gel images and lists
of identified spots, storing data in text files [365]. GelScape is not supported by a DBMS
and therefore cannot offer complex query facilities.
There are several commercial systems for storing MS data, such as RADARS [106], how-
ever these systems tend to be very expensive and are therefore not feasible for many labora-
tories. LIMS (Laboratory Information Management Systems) applications are also available
from software companies that have facilities for storing 2-D gels, but are usually only acces-
sible at high cost and are generally geared toward sample tracking in a generic laboratory
experiment, rather than specifically capturing protein separations and mass spectrometry.
5.2.3 Ontologies
GUS makes extensive use of ontologies to store concepts that cannot easily be represented in
a database schema. The MGED Ontology (MO) was described in the last chapter in terms
of its use in data standards, however it is also important in the database context. In RAD,
MO is used to populate data entry forms in the interface and we have followed this in the
design of RAPAD. This is particularly important for storing the characteristics of biological
samples and details of experimental protocols. The NCBI Taxonomy is stored in the SRes
part of GUS and can be referenced in RAD, and RAPAD, for storing the species of origin
of a sample.
5.3 Systems and Methods
5.3.1 Schema development
The database schema was created using a database design application (PowerDesigner 9™
[250]). The schema was developed manually, guided by information from the object
model, FGE-OM, rather than using an automatic conversion application. There are currently
no reliable methods of automating schema evolution, and all schema generation packages,
where a database is generated from an object model, assume that the database is newly cre-
ated. The database schemas for RAD [262] and PEDRo [242] were imported into the design
application and new tables were created as required, to store objects that were defined in
the Gla-PSI model. Section 5.4 describes the constituents of the database in detail but in
overview it has the following structure: tables covering an overview of the experiment, bio-
logical samples and protocols are all re-used from RAD. Other parts of GUS (SRes and Core
namespaces) have been installed alongside to capture data privacy, bibliographic references
and species data. Specific details of protein separation techniques and mass spectrometry
are stored in tables similar to the PEDRo specifications. Tables covering image analysis and
2-D gel data are derived from Gla-PSI. In the following chapter there is a description of
the integration of microarray and protein abundance data. The microarray data is stored
in a set of temporary tables within RAPAD rather than tables derived from RAD due to
time constraints, and because the aim of the work is to demonstrate that integration of results
is possible, not that microarray experiments can be stored in RAD, which has been well
documented in the past [302]. The complete RAPAD schema is displayed in Appendix C.
5.3.2 Interface development
The interface for loading data (the RAPAD Study-Annotator) has been developed from the
existing RAD Study-Annotator [202] and functions within a web browser. The query inter-
face (RAPAD Querier) has been created de novo, after consultation with researchers about
their requirements for publishing, visualising and querying data. A significant period of time
has been spent testing the interface with both real and artificially generated data by database
developers and bench researchers. Feedback arising from interviews with researchers has en-
abled improvements to the interface, such as providing help pages, and adding comments on
data entry forms to make the interface easier to use. The interface allows data and protocols
to be entered manually, but an option also exists to load data about gel spots and protein
records in bulk. Two file formats have been specified for bulk loading, following consultation
with researchers and principal investigators in the Functional Genomics Facility at Glasgow
University.
5.3.3 Data integration
One of the main goals of developing RAPAD is to facilitate the integration across different
types of functional genomics data. The diagram in Figure 5.1 displays several different types
of experiment and the requirements for data integration. One of the goals of RAPAD is
to test whether core RAD tables can capture experimental protocols and sample tracking
information from other types of FG experiments. It is planned that the proteomics compo-
nent will become part of GUS to produce a system that is capable of integrating all the data
types shown in Figure 5.1. This section explores how integration can, in theory, take
place across a complete system for FG. In the following chapter a specific example
is given describing integration across microarrays and proteomics for a project supported by
the current implementation of RAPAD.
Database identifiers for proteins
The core data point in a proteomic investigation is an identified protein, which may have been
quantified, for example as a volume ratio between different conditions produced by image analysis, or
by differences in fluorescence or radioactivity, as measured from a labelling experiment. The
database identifier of the protein will depend upon the organism being studied, but often the
identifier will point to a record in a sequence database, such as GenBank. For organisms with
incomplete genome sequences, mass spectrometry data may be searched against predicted
translations of the latest release of the genome data or EST databases. These data sets
contain only partial, or inaccurate protein sequences, and many sequences have no record in
GenBank. In these cases integrating data points is an even greater challenge.
Identifiers for microarray features
The data points in a microarray experiment are measured for every clone deposited on the
array, or oligonucleotide position, collectively known as the features. The type of identifier
given to an array feature depends on the type of microarray. The features on arrays produced
by cDNA deposition are identified by unique IDs supplied by the array manufacturer, which
often have entries in the manufacturer’s own database, and may be supplied with GenBank
identifiers of the cDNA or EST records from which the clones were produced. Affymetrix
arrays are supplied with their own unique identifiers of the features, and GenBank identifiers
can be obtained from the company’s web site using a software toolkit. The data values
associated with every clone are usually a single fluorescence measure, or a ratio of fluorescence
from scanning an array at two different wavelengths.
Matching different types of identifier
Immunohistochemistry data sets tend to be far smaller than microarrays or proteomics,
arising from antibodies raised against particular proteins that usually have a known entry
in GenBank. It is desirable that the data points from all experiments are related back to
the genome, gene and protein databases, allowing a user to search for particular genes, and
to discover studies in which modulated expression or localisation data exists. To realise this
goal there must be a mapping across the identifiers from: i) protein sequences identified by
MS, ii) cDNA or EST sequences on microarrays, iii) protein records in immunohistochemistry
experiments, and iv) the DNA and protein records in GenBank, SWISS-PROT and other
major databases. Some protein records in GenBank have a link to the corresponding gene
record, however there is usually no direct link to the corresponding record for the cDNA
sequence that has been used on a microarray. The only robust method for mapping across
all identifiers is to perform sequence similarity searches.
The AllGenes web site provides access to the Database of Transcribed Sequences (DoTS)
at the University of Pennsylvania [10]. DoTS is a part of GUS and has predefined mappings
for human and mouse sequences for the different identifiers that exist for sequences from
microarrays, EST, cDNA and genome databases. In the following chapter, a process is
described outlining how microarray and proteomics data have been integrated using DoTS,
from studies of a parasitic infection of human cell culture. However, for organisms other
than human and mouse, the following algorithm is required to integrate data:
1. Create a new database of sequence clusters that will comprise entries containing clusters
of different database identifiers that correspond to individual gene sequences.
2. For every protein record matched by MS data, or identified by another method in a
proteomics experiment, obtain the protein sequence, example: ABC.
3. Perform a sequence search with ABC against a translation of the most recent version
of the organism’s genome database. For sequences that match a gene exactly, create a
new database record for this cluster, including the DNA or protein sequence.
4. For sequences that do not match anything in a genome database, search against EST
or cDNA databases for very close or exact matches. Assign all the identifiers that can
be found to a new record in the cluster database.
5. Retrieve cDNA or clone sequence data from microarrays and search against the genome
database. For exact matches check to see if a corresponding entry exists in the cluster
database. If the entry exists, add the microarray clone ID to the record in the cluster
database, otherwise create a new entry.
6. Perform the same process for immunohistochemistry, or other FG data points, to
retrieve the most closely matching sequence entries, and add records to the cluster
database.
7. Integration occurs by performing queries over the cluster database to find proteins
and microarray clones that have been assigned to the same cluster. For these records,
quantitative data may be comparable if the starting samples from different experiments
have been treated in the same way. However, substantial statistical analysis is required
to determine the validity of correlating protein and mRNA abundance data [143]. This
is an issue which will require significant future efforts from statisticians working with
bench biologists.
8. An entry should be created for every known gene in the cluster database, containing
the identifier from GenBank, Swiss-Prot, PIR and the genome database specific to that
organism.
9. When a gene is highlighted by a user, there should be an option to perform a query
of the cluster database to find all the other sequence identifiers that the gene has been
assigned. A further query can then be performed to retrieve any FG experiments in
which modulated expression, localisation or interaction data exists.
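The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the ClusterDB class, its method names, and the example identifiers and sequences are all hypothetical, and exact string equality stands in for the sequence similarity search purely to keep the sketch self-contained.

```python
from collections import defaultdict


class ClusterDB:
    """Toy cluster database: one cluster per distinct gene sequence (hypothetical design)."""

    def __init__(self):
        self._cluster_of_seq = {}          # sequence -> cluster id
        self._members = defaultdict(set)   # cluster id -> {(source, identifier)}

    def assign(self, source, identifier, sequence):
        """Add an identifier to the cluster for its sequence, creating the cluster if needed.

        Assumption: exact string equality replaces a real sequence similarity search.
        """
        new_id = len(self._cluster_of_seq) + 1
        cluster_id = self._cluster_of_seq.setdefault(sequence, new_id)
        self._members[cluster_id].add((source, identifier))
        return cluster_id

    def members(self, cluster_id):
        """Return all (source, identifier) pairs assigned to a cluster."""
        return sorted(self._members[cluster_id])

    def shared_clusters(self, source_a, source_b):
        """Return clusters holding identifiers from both sources (step 7)."""
        shared = {}
        for cluster_id, members in self._members.items():
            sources = {src for src, _ in members}
            if source_a in sources and source_b in sources:
                shared[cluster_id] = sorted(members)
        return shared


db = ClusterDB()
# Steps 2-4: a protein identified by MS (identifier and sequence are invented).
db.assign("proteomics", "RPD101.1", "MKTAYIAK")
# Step 5: a microarray clone that maps to the same gene sequence.
db.assign("microarray", "CLONE_0042", "MKTAYIAK")
# A clone with no proteomics counterpart receives its own cluster.
db.assign("microarray", "CLONE_0099", "MSLLTEVE")

integrated = db.shared_clusters("proteomics", "microarray")
```

In this sketch, integrated contains only the cluster shared by the protein record and the first clone. In practice the cluster store would be a relational table and the matching step would apply a similarity threshold rather than equality.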
The process described above allows different types of FG experiment to be integrated
at the level of individual gene and protein records, however, one problem that this does
not address is the challenge of different protein forms. Proteomics experiments can reveal
differentially modified proteins, different splice forms and protein complexes. In these cases
it possibly does not make sense to attempt to correlate results with quantitative changes at the
mRNA level, but qualitative results may be of interest. For example, if three spots appear
on gel A, produced by different phosphorylation states of protein X, and only one gel spot
appears on gel B, from a different condition, it would be difficult to attempt to correlate the
total difference in protein volume with changes in the amount of mRNA between condition A
and B as measured by microarrays. However, it may be interesting to note that microarray
analysis reveals up-regulation of gene X (in condition A), and 2-DE analysis reveals three
differentially modified forms.
The integration of different FG experiments is a major database challenge, however it also
raises issues in data visualisation. If a set of good visualisation tools is created, coupled with
complex query facilities, a researcher can begin to build a global picture of gene and protein
regulation, and the changes that occur during disease. Visualisation issues are addressed in
the following section.
5.3.4 Visualisation
Proteomics data is visualised in RAPAD using several different methods. Information about
experimental protocols, samples and bibliographic references can be viewed in web pages
created dynamically within the RAPAD Study-Annotator. 2-DE data is viewed in a Java
Applet [168] that resides within a web browser. The Gel Viewer has controls for navigating
around a gel and zooming to unlimited resolution, and manages several gels simultaneously
using tabbed panes that the user can switch between. Individual records of mass spectrom-
etry data are viewed using the web interface supplied with the MASCOT software [207],
however certain parts of the results are also summarised in a tabular format within HTML
pages.
There are several issues with the visualisation of 2-DE data in the Gel Viewer. In theory
a Java Applet should load in any web browser that has Java installed, however this is not our
experience. The Gel Viewer is loaded within an HTML page that is created dynamically by
PHP code [246]. The PHP code writes out parameters encoding gel spots and proteins for
the Applet to read in, however there appears to be a flaw in the way in which different web
browsers load the Applet. When the number of parameters becomes large, for example with
four gels each with several hundred proteins, the Applet starts before all the parameters
have been read in, leading to missing spots, or gel images not displaying correctly. This
problem does not occur for Internet Explorer version 6 but is a major problem in Netscape
and Mozilla. There does not appear to be a simple solution to this bug. Therefore in the
future, the Gel Viewer may need to be re-coded, to enable greater flexibility with regard to
accessibility of the database.
There are several alternatives for developing applications that function within a web
browser, including Macromedia Flash [199], Javascript [171] and Scalable Vector Graphics
(SVG) [281]. However, each of these technologies has certain limitations for complex appli-
cations, and there are few good examples of their use in the life sciences. SVG is useful for
drawing regular shapes and objects, such as graphs or diagrams but is not suitable for load-
ing high resolution images such as 2-D gels. Javascript is used widely in web applications but
is not suitable for developing complex software. For example, Javascript could not be used
for zooming on a gel image without loading a new web page with each zoom factor, which
would be too slow. An example of an application developed using Flash is the Human-Mouse
Homology Map at the NCBI, which provides a visualisation of the homology between mouse
genes along a specified human chromosome, or vice versa [158]. For large chromosomes, the
visualisation of genes is difficult to read and the software runs too slowly, which may be
an implementation problem or a limitation of the technology. A preferable solution for the
continued development of the Gel Viewer may be to create a stand alone desktop application
that database users must download, using a technology such as Java Web Start [170].
5.3.5 Unique identifiers
It is essential for archiving data that protein records in RAPAD are assigned a unique ID
that persists even if a new version of the database is created. Protein records in the current
implementation are assigned a sequential numerical ID that is managed by the RDBMS. The
ID number can be used to query the database if it is prefixed with RPD, and suffixed with the
database version number (1). For example, record 101 can be queried via the web interface,
as long as it has been specified as a public record, using the string RPD101.1. This system
is not ideal from the point of view of security, and would be improved by creating a record
in the DatabaseEntry table for each protein that is publicly accessible, with an identifier
that is unrelated to the RDBMS identifier.
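The identifier format described above can be illustrated with a short Python sketch. The function names are hypothetical, and the regular expression simply encodes the pattern given in the text (the prefix RPD, the record number, a full stop and the database version); it is not taken from the RAPAD code.

```python
import re
from typing import Tuple

# Pattern assumed from the text: "RPD" + record number + "." + database version.
_RPD_PATTERN = re.compile(r"^RPD(\d+)\.(\d+)$")


def format_rpd(record_id: int, db_version: int = 1) -> str:
    """Build a public query string such as RPD101.1 from a record number."""
    return f"RPD{record_id}.{db_version}"


def parse_rpd(identifier: str) -> Tuple[int, int]:
    """Split a query string into (record number, database version)."""
    match = _RPD_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a RAPAD protein identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, format_rpd(101) gives the string RPD101.1 used in the text, and parse_rpd recovers the record number and version from it.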
There is an effort to create universal public identifiers in the IBM Life Science Identifier
(LSID) project [296]. The aim is to create identifiers that are globally unique and persistent,
therefore they can never be re-used and will outlive the objects that they identify. An
LSID is created by concatenating the web address of the organisation, the database name
followed by the type of identifier, the identifier itself and finally the database version number,
separated by colons. The examples below demonstrate how a uniform resource name (URN)
is formulated for three major databases.
URN:LSID:ebi.ac.uk:SWISS-PROT.accession:P34355:3
URN:LSID:rcsb.org:PDB:1D4X:22
URN:LSID:ncbi.nlm.nih.gov:GenBank.accession:NT_001063:
An equivalent LSID for RAPAD would be:
URN:LSID:brc.gla.ac.uk:RAPAD:RPD101:1
A foreseeable problem is that RAPAD does not currently have a permanent
home and the web address is likely to change. However, this problem can be avoided as
long as the Bioinformatics Research Centre in Glasgow (brc.gla.ac.uk) does not develop
an alternative database called RAPAD, which is unlikely. The LSID project can be easily
implemented if databases adhere to the guidelines and provide programmatic access to the
database, accepting the LSID of an object as a query string.
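As a minimal sketch of the colon-separated layout shown in the examples above, the following Python functions build and split an LSID. The function and field names are hypothetical, and the code follows only the structure described in this section; it is not a complete implementation of the LSID specification.

```python
def build_lsid(authority, namespace, object_id, version=""):
    """Join the parts of an LSID with colons; an empty version leaves a trailing colon."""
    return ":".join(["URN", "LSID", authority, namespace, object_id, str(version)])


def parse_lsid(lsid):
    """Split an LSID of the form URN:LSID:authority:namespace:object:version."""
    parts = lsid.split(":")
    if len(parts) != 6 or parts[0].upper() != "URN" or parts[1].upper() != "LSID":
        raise ValueError(f"not a six-part LSID: {lsid!r}")
    return {
        "authority": parts[2],   # web address of the organisation
        "namespace": parts[3],   # database name and, optionally, identifier type
        "object_id": parts[4],   # the identifier itself
        "version": parts[5],     # database version, possibly empty
    }


rapad_lsid = build_lsid("brc.gla.ac.uk", "RAPAD", "RPD101", 1)
```

A database that offers programmatic access could pass the object_id and version fields returned by parse_lsid directly to its own query interface.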
5.4 Implementation
RAPAD has been deployed in Oracle 9i [235] as part of a standard three tier architecture
(Figure 5.3). A web interface has been created, written in the PHP language [246]. The
database schema is large (174 tables), and has the capability to store information from a
wide range of technologies. Therefore, web pages have been developed for data capture as
they are required by the users. In this section, an overview of each part of the database is
described, using examples of data capture in the Study-Annotator to illustrate graphically
how the database has been implemented.
A workflow is displayed in Figure 5.4 summarising the stages at which data is entered
by the user. There are several stages at which queries are made of the database to retrieve
terms from an ontology to populate drop-down boxes in the user interface, described in more
detail below.
[Figure 5.3 diagram. Tier labels: User Interface, Middleware, Data Storage. Component labels: Querier, Study-Annotator, Gel Viewer, MASCOT Results; PHP (interface generation), Java (batch queries for specific investigations and Gel Viewer code), Perl (scripts supplied with MASCOT); Oracle database, image server, MASCOT server.]
Figure 5.3: The architecture of RAPAD.
[Figure 5.4 diagram. Column labels: User interaction, RAPAD Study-Annotator, Oracle database. Study-Annotator stages: Login Page; Study, contacts and references; BioSource and solubilisation protocol; 2-DE, image, scanning and image analysis, and bulk load in two files; Visualise 2-DE in Gel Viewer. Database actions: check username and password; 1) query DB for ontology terms, 2) add details to DB; query for gel and spot details. User actions: data entry.]
Figure 5.4: The user interaction with RAPAD for entering a 2-DE experiment.
5.4.1 Data privacy
The first entry point for the RAPAD Study-Annotator requires users to login, and select
their data privacy preferences (Figure 5.4). Essentially, this requires selecting the Project,
Group and Study settings. The Project setting specifies the database namespace in which the
data will be stored, which will be required when RAPAD is integrated with GUS. The value
of Project is set to “RAPAD” in the current implementation. The Group setting is the top
level for dividing researchers into different classifications. It is envisaged that each laboratory
will have its own Group value. The Study value is a further specification, and captures a
complete investigation, consisting of many different 2-DE experiments. For example, the
entire Trypanosoma brucei proteome investigation is currently captured as one study. All
tables in RAPAD have the following attributes, to ensure data integrity:
MODIFICATION_DATE NOT NULL DATE
USER_READ NOT NULL NUMBER(1)
USER_WRITE NOT NULL NUMBER(1)
GROUP_READ NOT NULL NUMBER(1)
GROUP_WRITE NOT NULL NUMBER(1)
OTHER_READ NOT NULL NUMBER(1)
OTHER_WRITE NOT NULL NUMBER(1)
ROW_USER_ID NOT NULL NUMBER(12)
ROW_GROUP_ID NOT NULL NUMBER(3)
ROW_PROJECT_ID NOT NULL NUMBER(3)
ROW_ALG_INVOCATION_ID NOT NULL NUMBER(12)
The attributes ROW_USER_ID, ROW_GROUP_ID and ROW_PROJECT_ID are assigned the foreign key
linking to the corresponding record for each user, group and project for every record that is
entered in the database. Additional tables exist for linking information to the Study in which
it belongs. Data security issues are discussed in more detail in Section 5.4.7.
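To illustrate how the per-row attributes listed above might be used, the following Python sketch checks read access for a single row. The interpretation of the flags is an assumption, by analogy with Unix file permissions (owner, then group, then everyone else); the text lists the columns but does not define how they are evaluated, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RowPermissions:
    """Subset of the per-row privacy attributes listed above."""
    user_read: int      # USER_READ
    group_read: int     # GROUP_READ
    other_read: int     # OTHER_READ
    row_user_id: int    # ROW_USER_ID
    row_group_id: int   # ROW_GROUP_ID


def can_read(row: RowPermissions, user_id: int, group_ids: Iterable[int]) -> bool:
    """Assumed semantics: the owner flag applies first, then the group flag, then 'other'."""
    if user_id == row.row_user_id:
        return bool(row.user_read)
    if row.row_group_id in set(group_ids):
        return bool(row.group_read)
    return bool(row.other_read)


# A row readable by its owner (user 7) and by group 3, but not by anyone else.
private_row = RowPermissions(user_read=1, group_read=1, other_read=0,
                             row_user_id=7, row_group_id=3)
```

The write flags (USER_WRITE, GROUP_WRITE, OTHER_WRITE) could be checked in the same way.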
5.4.2 Studies, protocols and contact details
Bibliographic references, experimental protocols and contact details can be entered in RA-
PAD, and are not linked to any particular study, allowing their re-use in many different
contexts. The web page for entering Protocol data (Figure 5.5) has drop-down menus for
selecting the type of protocol, options include nucleic acid extraction, protein solubilisation,
Figure 5.5: The interface for entering protocol information into RAPAD.
gel stain, and so on. These options are populated from the OntologyEntry table, and are
used for linking the protocol to the correct page in the Study-Annotator. For example, any
protocols entered with the option gel stain will appear as options for linking to a staining
protocol in the 2-DE Assay page of the interface.
A set of web pages exists for capturing the intention of the study as a textual description,
and a set of parameters can also be entered, with a different parameter value for each
experiment in the study. For example, in a time course experiment, samples from 1, 2, 4,
6, and 24 hours post infection are each analysed by 2-DE. This information can be cap-
tured in RAPAD, linking the parameter to the 2-DE details, and in turn the 2-DE details
can be linked to a description of the protein sample (BioMaterial). The source of mate-
rial can be entered in RAPAD, linked to contact details for the provider of material, the
species of origin, type of material (e.g. DNA, protein, cells, generated from entries in the
OntologyEntry table), and a general description (stored in the table BioSource, Figure
5.6). A series of treatments can be applied to convert a source of material (BioSource) to
a substance (BioMaterial), such as a protein mixture, which can be linked to a 2-D gel
record. Alternatively, BioMaterial could store labelled mRNA that has been hybridised to
a microarray. Treatments correspond to basic laboratory procedures such as additions of
Figure 5.6: A web page for specifying sources of biological materials
solutions, washes, incubations and many more, allowing a researcher to store a structured
definition of lab protocols, such as the extraction and solubilisation of proteins from cells.
These features have been inherited from RAD, however additional tables have been added to
the database schema: StudyAssayProt, StudyDesignAssayProt and so on, for linking study
and biomaterial details to the corresponding proteomics experiment (table ProteomeAssay)
rather than a microarray (table Array, Figure 5.7).
5.4.3 Protein separations
RAPAD has capabilities to store information describing a series of protein separation treat-
ments (Figure 5.8), although the focus of the current implementation is 2-DE. Every experi-
ment type has an entry in a specific table (e.g. Gel2D, Gel1D or LCColumn) and an entry in a
generic table, BioAssayTreatment. BioAssayTreatment can be linked to a measured input
of a biological material, captured in AnalyteMeasurement and a view² (BioMaterial) on
the table BioMaterialImp. The output of each treatment produces a set of entries in specific
tables, such as PhysicalGelItem and Fraction, which are linked to BioMaterialImp,
enabling a series of treatments with specified inputs and outputs to be captured in a structured
format. The BioAssayTreatment table has a relationship to Protocol, which enables
additional protocol information to be attached to a technique, if the attributes specified in
the table specific to the technique do not cover what is required.
² A view in SQL is a single table that is derived from other tables. A view may not be physically stored in the relational schema but is a notation representing certain information that is frequently required [89].
[Figure 5.7 diagram. Proteomics tables: ProteomeAssay, StudyDesignAssayProt, StudyAssayProt, StudyFactorValueProt, Study, StudyFactor, StudyDesign, Gel2D, BioAssayTreatment. Microarray tables: Assay, StudyDesignAssay, StudyAssay, StudyFactorValue, Study, StudyFactor, StudyDesign, Array.]
Figure 5.7: A summary of the database schema for storing information about the design of a study. Three RAD derived tables have been replicated in the RAPAD schema with changes to one relationship, referencing ProteomeAssay rather than Array. Each box represents a database relation (table) and arrows represent a relationship between two tables, such as Gel2D has a foreign key from BioAssayTreatment.
5.4.4 2-D gel data
The details about a 2-D gel are entered on the 2-DE Assay page (Figure 5.9). The parameters
of the gel are entered in the Gel2D table, and the table ProteomeAssay stores the name of
the experiment and a link to the experiment’s operator. ProteomeAssay is used to link
indirectly to protocols for the first and second dimension separation, protein solubilisation
and staining (all stored in the table Protocol). Following input of 2-DE data, scanning
information can be entered into the table ImageAcquistion, capturing: the type of scanner
used, the operator, the date, a protocol if required and any associated parameters with values.
Multiple scans can be entered, each associated with a particular channel or wavelength, which
can also be used to store a difference gel electrophoresis experiment, in which a single gel
is fluorescently labelled and scanned at two or three wavelengths. Each scan is assigned
a unique name that appears on the Gel Image Analysis page. On this page, the user can
[Figure 5.8 diagram. Tables: Gel2D, Gel1D, LCColumn, Fraction, PhysicalGelItem (link to image analysis data), BioAssayTreatment, AnalyteMeasurement, Protocol, BioMaterialImp. Relationship labels: Source, Product.]
Figure 5.8: The database schema for protein separation techniques and the relationships to the BioAssayTreatment table.
enter a protocol and name of the software used to analyse the gel image (inserted in the
GelImageAnalysis table). The image scan must also be associated with a gel image that is
stored on the file system, and the URI (Uniform Resource Identifier) of the file is updated
in the ImageAcquistion table. Two further pages exist for bulk loading data: gel spot files
and protein files. Spot data files contain lists of spot ID numbers, coordinates and volume
values (calculated by image analysis), which are stored in the tables IdentifiedSpot and
PhysicalGelItem. Each IdentifiedSpot record links to the image analysis that produced
it (in GelImageAnalysis). Data files can also be loaded that contain tab delimited data
about the proteins to which spots have been matched, including: the protein name, species,
MW (molecular weight), pI (isoelectric point), links to external databases, and a link to MS data on a
separate file server. The data is loaded in batches, linked to the correct spot using the table
AnalyteMeasurement, linked to BioMaterial and PhysicalGelItem (Figure 5.11).
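Reading a spot data file of the kind described above could be sketched in Python as follows. The column names (spot_id, x, y, volume), the use of a header row and the tab delimiter are assumptions made for illustration; the actual RAPAD bulk-loading formats are not reproduced here.

```python
import csv
import io


def read_spot_file(handle):
    """Parse a tab-delimited spot file into a list of dictionaries.

    Assumed columns (hypothetical): spot_id, x, y, volume.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    spots = []
    for row in reader:
        spots.append({
            "spot_id": int(row["spot_id"]),   # spot ID number from image analysis
            "x": float(row["x"]),             # spot coordinates on the gel image
            "y": float(row["y"]),
            "volume": float(row["volume"]),   # volume calculated by image analysis
        })
    return spots


# Invented example data standing in for a file exported by image analysis software.
example = "spot_id\tx\ty\tvolume\n1\t120.5\t88.0\t15320.0\n2\t301.0\t412.5\t980.25\n"
spots = read_spot_file(io.StringIO(example))
```

Each parsed row would then correspond to one IdentifiedSpot record and its associated PhysicalGelItem entry.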
The schema design for this section is fairly complex (Figure 5.11), however this reflects
the nature of a proteomics experiment: a spot may be excised from a gel and could be used in
a number of different experiment types: MS, chromatography, or additional gel separations.
Therefore, an entry exists to model a gel spot as a physical entity (a BioMaterial), to enable
further treatments on the spot to be captured. A gel spot does not have an identifier until
the gel image has been analysed and spot data has been input. Therefore, to correctly specify
a gel spot, a record is required in the IdentifiedSpot table (from image analysis), in the
[Figure 5.9 panels: 2-DE assay, Image acquisition, Image analysis.]
Figure 5.9: Screenshots for loading 2-DE, scanning and image analysis data into RAPAD. The scanner image is obtained from http://biology.berkeley.edu/EML/scanner.jpg, the image analysis software is a screenshot of DeCyder™ [74].
[Figure 5.10 diagram. Tables: BioAssayTreatment, Gel2D, ImageAcquisition, IdentifiedSpot, DIGESingleSpot, Channel, GelImageAnalysis, PhysicalGelItem, MatchedSpots, MultipleAnalysis, ProteomeAssay.]
Figure 5.10: The tables present in the database schema store data from gel spots, image analysis and the scanning of a 2-D gel. The database also records information about spots matched across a number of gels in MatchedSpots and MultipleAnalysis, and difference gel electrophoresis data in the table DIGESingleSpot.
[Figure 5.11 diagram. Tables: ProteinRecord, MassSpecExperiment, IdentifiedSpot, PeakList, BioAssayTreatment, PhysicalGelItem, BioMaterial, DBSearch, ProteinHit, AnalyteMeasurement. Annotation: direct link to top protein hit.]
Figure 5.11: The database schema for linking protein records to gel spots. A protein record is linked to the gel spot via the raw MS data and database searches that have been performed for identification. A direct link from the gel spot (PhysicalGelItem) to the protein record has also been implemented to enable fast queries.
PhysicalGelItem table (referring to the actual spot on the gel) and in the BioMaterial view
when required, to enable the gel spot to be linked to additional treatments in the database
(via BioAssayTreatment). If spots corresponding to the same protein have been matched
across gels, this information can be captured in the table MatchedSpots, and spots appear
with a different symbol in the Gel Viewer.
5.4.5 Mass spectrometry and external databases
Mass spectrometry data can be stored in tables derived from the PEDRo database schema.
The tables are linked to the rest of the schema via the BioAssayTreatment table (Figure 5.12).
BioAssayTreatment references a source of biological material, enabling MS data to be linked
to a protein sample arising from a series of separation techniques, which could be a gel spot.
However, in the current implementation only a URI is stored in the DBSearch table, linking
to the results of searches with MS data, generated using the MASCOT software. Certain
data are extracted automatically from the MS results using a script developed by Karl
Burgess (IBLS, University of Glasgow), and stored in the ProteinHit table, such as the
match score, e-value, the number of peptides hit in a sequence, and the sequence coverage³.
These factors enable the quality of match to be determined, allowing researchers to exclude
data from certain views in the interface, if the MS data does not conclusively identify a
protein. The table ProteinRecord stores properties of each protein in the database, such as
MW, pI, the protein’s name and a reference to the species of origin, stored in the SRes Taxon
table. The table ProteinRecordEntry links a record to external database entries, stored in
DatabaseEntry. DatabaseEntry captures the database accession number, and has three
external links to OntologyEntry in which the database name, database URI and database
version are captured. In this way, a protein record can be linked to any external database
required, as long as it is Internet accessible.
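The use of the match quality values stored in ProteinHit to exclude inconclusive identifications can be sketched in Python as below. The field names mirror the values described above, but the class, the function and in particular the threshold values are hypothetical and chosen only for illustration; appropriate cut-offs depend on the search engine and the experiment.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProteinHit:
    """Match quality values of the kind extracted from MS search results."""
    protein_name: str
    score: float               # match score reported by the search engine
    e_value: float             # expectation value of the match
    peptides_matched: int      # number of peptides hit in the sequence
    sequence_coverage: float   # percentage of the sequence covered


def confident_hits(hits: Iterable[ProteinHit],
                   min_score: float = 60.0,      # illustrative threshold only
                   max_e_value: float = 0.05,    # illustrative threshold only
                   min_peptides: int = 2) -> List[ProteinHit]:
    """Keep only the hits that pass every quality threshold."""
    return [
        hit for hit in hits
        if hit.score >= min_score
        and hit.e_value <= max_e_value
        and hit.peptides_matched >= min_peptides
    ]


# Invented example records.
hits = [
    ProteinHit("protein A", score=112.0, e_value=1e-6, peptides_matched=9, sequence_coverage=34.0),
    ProteinHit("protein B", score=41.0, e_value=0.8, peptides_matched=1, sequence_coverage=3.5),
]
kept = confident_hits(hits)
```

In a relational setting the same filter would normally be expressed as a WHERE clause over the ProteinHit table.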
5.4.6 RAPAD Querier
An important feature of a database system for functional genomics is the ability to perform
complex queries. The current RAPAD implementation includes a set of tools that enable
data to be visualised and queried, to support biological research. The use of the query
interface, the RAPAD Querier, is outlined in Chapters 6 and 7 with regard to two biological
investigations: the proteome of host cells when invaded with the parasite Toxoplasma gondii
³ Sequence coverage is the percentage of the protein sequence that is covered by the peptides that have been matched.
[Figure 5.12 diagram. Tables: BioAssayTreatment, MSExperiment, PeakList, Peak, ProteinHit, ProteinRecord, ProteinModification, PhysicalGelItem. Group labels: tables for protocol, tables for database searches.]
Figure 5.12: The database schema for mass spectrometry, adapted from PEDRo.
and the determination of the proteome of Trypanosoma brucei. Specific features of the
interface have been geared towards providing the queries required by the two projects, to
meet their specific goals. An overview of the main features of the RAPAD Querier is given in the
rest of this section.
There are several different methods for accessing data in RAPAD. Firstly, for researchers
annotating data in a particular study there is an option to load any of the gels in that study in
the Gel Viewer (Figure 5.13). Researchers can also perform a search to find all 2-D gels within
their Project-Group preference settings, within a particular study, performed by specific
operators, or containing a certain protein name. The Gel Viewer has been implemented as
a Java Applet [168], an application that runs within a web browser, thereby enabling any
user to view data without needing to install new software (except Java). The Gel Viewer
is capable of loading multiple gels simultaneously in different tab windows. Within the Gel
Viewer basic searches can be performed to find particular protein names, label a specific spot
by ID number, or highlight a set of proteins with a range of molecular weights or pI values.
Controls exist for moving around the gel and zooming on particular regions for highlighting
subtle differences in spot patterns between two or more gels. Once the Gel Viewer has been
loaded, there is a set of options for viewing data about a single gel: 1. Display All Spots,
2. Display All Proteins, 3. Search This Data, 4. Display Gel Details, 4. Show Gel Info, 5.
Show Microarray Data, and if two gels have been loaded 6. Show Matched Spots.
1. There is an option to view a table, created dynamically in HTML, showing all the discrete
Figure 5.13: A screen shot of the 2-D Gel Viewer that provides search capabilities over protein data and links to MS results. There is a feature for loading multiple gels in different tabbed windows, for example for comparing gels for samples for different conditions.
Figure 5.14: A form for entering annotation about a gel spot and linking to protein records. Links are provided for adding data about protein modifications and updating the protein details.
Figure 5.15: A table displaying all the proteins identified on a single gel.
spots that have been identified on a gel. Each spot ID number is a hyperlink that loads
the record for that spot (Figure 5.14), where additional annotation can be entered and
the gel spot can be linked to protein data, such as MS information. There is also
a page for entering the type, location and description of post-translational modifications.
2. Similar output is provided displaying only the gel spots that have been matched to protein
records (Figure 5.15).
3. An option is given for loading an HTML form that enables searches to be performed over
a data set arising from a single gel. Search criteria include approximate matches to multiple
protein names entered, ranges of values for molecular weight, pI, and statistics from MS
data about the quality of a match (Figure 5.16). Boolean “AND” or “OR” searches can be
performed, and the resulting data can be ordered by any of the above criteria. The results
of a search are displayed in a table on a web page, with links to the source data, and an
option exists for highlighting the spots found by a search in the Gel Viewer.
4. Clicking the Display Gel Details button loads a page displaying the parameters and
protocol employed for the gel. There are links to separate protocols for the first and second
dimension separation, staining, and protein solubilisation. If the gel has been linked to
information about a biological sample (BioSample) or source of material (BioSource) in
Figure 5.16: The query interface for searching for specific protein records.
RAPAD, this information is displayed, along with the protocol for gel image scanning and
gel image analysis.
5. If the gel has been associated with a microarray study, a table can be loaded displaying
all the proteins on the gel, alongside the microarray expression values for the corresponding
gene. This feature is illustrated in the following chapter.
6. RAPAD has options for loading information about spots on different gels that correspond
to the same protein. Clicking the Show Matched Spots button loads an HTML page display-
ing all the proteins on the two gels. Spots that match across the two gels are highlighted
in bold, and if spot volume information has been entered, the ratio of volumes is displayed,
corresponding to an approximation of the change in expression of the protein between the
two conditions.
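The ratio calculation for matched spots can be illustrated with a minimal sketch. It is not the RAPAD implementation: the function name, the dictionary layout and all protein names and volumes below are assumptions made for illustration, and the choice of which gel forms the denominator is arbitrary.

```python
from typing import Dict, Optional

# Hypothetical spot volumes keyed by protein name; None means that no
# volume has been entered for that spot. All values are made up.
gel_a: Dict[str, Optional[float]] = {"HSP70": 200.0, "Actin": 1000.0, "GAPDH": None}
gel_b: Dict[str, Optional[float]] = {
    "HSP70": 500.0, "Actin": 1000.0, "GAPDH": 300.0, "Enolase": 50.0,
}


def volume_ratios(a: Dict[str, Optional[float]],
                  b: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Return the ratio b/a for every protein found on both gels.

    The ratio is None when either volume is missing or the denominator
    is zero, so matched spots are still listed without a ratio.
    """
    ratios: Dict[str, Optional[float]] = {}
    for protein in sorted(a.keys() & b.keys()):
        va, vb = a[protein], b[protein]
        if va is None or vb is None or va == 0:
            ratios[protein] = None
        else:
            ratios[protein] = vb / va
    return ratios


matched = volume_ratios(gel_a, gel_b)
```

Proteins found on only one gel (Enolase in this made-up data) are left out of the matched set, and a ratio is only an approximation of the change in expression, as noted above.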
An important feature is the ability to summarise all the data within a study, especially if it
results from proteins identified on a number of different gels. An option exists to classify gels
within a study into two groups, for example one set of gels from “disease” samples, versus a
set of “normal” samples. The proteins identified in the two groups appear in separate tables,
with links to the source protein records, and an option to load the Gel Viewer highlighting
selected spots on the gel.
5.4.7 Public data access
The standard interface contains pages displaying protein spot records that can be updated,
intended for researchers to modify and insert new data as required. Clearly, this system
is not suitable for external access, even if updates could only be performed by researchers
with a specific login, because it would be difficult to ensure that data was always secure.
Therefore, a separate interface has been created allowing anyone to view publicly acces-
sible data in RAPAD, which only has views of the data, with no facilities for updating.
Data can be accessed in this interface through a page that displays all the public studies
in RAPAD, giving the option to load particular gels. A query page is also available to
search for particular proteins identified on any gel in the public system. The page displaying
protein records in RAPAD can be queried by a web link, thus providing basic program-
matic access. The following URL can be used to link to any record on the public system
(http://balabio.dcs.gla.ac.uk/jonesa/RAPAD/ProteinView.php?Query=RPD123), whereby
RPD123 is a unique identifier assigned to each protein. This system enables other databases
to link to protein records in RAPAD. This feature will be especially important for proteins
identified by MS for which there is no annotation in public databases, or the protein is only
annotated as “hypothetical”. In effect, the MS data proves that a protein is expressed under
the particular sample conditions.
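As an illustration of this basic programmatic access, another database could build links of this form as follows. This is a sketch only: the base URL and the Query parameter are taken from the example above, while the helper function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address and parameter name as given in the example URL above.
BASE_URL = "http://balabio.dcs.gla.ac.uk/jonesa/RAPAD/ProteinView.php"


def protein_record_url(identifier: str) -> str:
    """Build a link to the public page for one protein record."""
    return f"{BASE_URL}?{urlencode({'Query': identifier})}"


def identifier_from_url(url: str) -> str:
    """Recover the protein identifier from a link of the same form."""
    return parse_qs(urlsplit(url).query)["Query"][0]


link = protein_record_url("RPD123")
```

Using urlencode rather than plain string concatenation means that an identifier containing reserved characters would still produce a valid link.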
It is essential that only data intended for public access can be viewed through this
interface in RAPAD. This is ensured by the design of the tables inherited from RAD, in which
every record across the entire schema is assigned data privacy settings. Every table
has a set of permission attributes that specify who can view a particular piece of data:
the researcher who entered the data (USER_READ), members of the group (GROUP_READ)
and anyone (OTHER_READ). The group-level setting can be used for releasing data to a set of
different laboratories without making data publicly available, for example to allow
collaborators in a different location to view or update records. If researchers wish to make their
data publicly accessible, every record in the study has the attribute OTHER_READ changed
from 0 to 1. Therefore, when any web page is accessed through the public data interface, a
simple check is performed to ensure that no protein data will be accessed where OTHER_READ
= 0. Similar attributes exist in every table for ensuring that data can only be changed by
certain individuals (write access). At the present time, the studies supported by RAPAD
have not been published, therefore the interface for making data publicly accessible has only
been tested with artificial data.
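The row-level check described above can be sketched with a small in-memory example. Everything here is an assumption made for illustration: SQLite is used only so that the sketch is self-contained, the table is a cut-down stand-in for a protein table, and the lower-case column names (user_read, group_read, other_read) are an assumed spelling of the read-permission attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down, hypothetical protein table carrying the three read flags.
conn.execute(
    """
    CREATE TABLE ProteinRecord (
        protein_record_id INTEGER PRIMARY KEY,
        name              TEXT    NOT NULL,
        user_read         INTEGER NOT NULL DEFAULT 1,
        group_read        INTEGER NOT NULL DEFAULT 1,
        other_read        INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO ProteinRecord (name, other_read) VALUES (?, ?)",
    [("private protein", 0), ("public protein", 1)],
)


def public_protein_names(connection: sqlite3.Connection) -> list:
    """Return only the records that have been flagged as publicly readable."""
    rows = connection.execute(
        "SELECT name FROM ProteinRecord WHERE other_read = 1 ORDER BY name"
    )
    return [row[0] for row in rows]


def release_all(connection: sqlite3.Connection) -> None:
    """Make every record in this table public by switching the flag from 0 to 1."""
    connection.execute("UPDATE ProteinRecord SET other_read = 1")
```

In this sketch the filter sits in the query issued by the interface, which mirrors the description above, where the check is made when a page is accessed through the public data interface.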
5.4.8 Ontologies
In the previous chapter, the importance of developing ontologies to support the develop-
ment of standard exchange formats was outlined. In this section, the implementation of
ontologies within RAPAD is addressed. The OntologyEntry table in RAPAD stores a flat
representation of the MGED Ontology (MO) [211] (for specifying protocols, characteristics
of biomaterials and many other parts of the analysis), following the design of RAD. The
OntologyEntry schema is as follows (data security attributes not shown):
Name                           Null?    Type
------------------------------ -------- --------------
ONTOLOGY_ENTRY_ID              NOT NULL NUMBER(10)
PARENT_ID                               NUMBER(10)
TABLE_ID                                NUMBER(8)
ROW_ID                                  NUMBER(12)
EXTERNAL_DATABASE_RELEASE_ID            NUMBER(10)
SOURCE_ID                               VARCHAR2(100)
URI                                     VARCHAR2(500)
NAME                                    VARCHAR2(100)
CATEGORY                       NOT NULL VARCHAR2(100)
VALUE                          NOT NULL VARCHAR2(100)
DEFINITION                              VARCHAR2(500)
The attribute CATEGORY captures the type of term: ProtocolType, DevelopmentalStage,
DataType and so on. An example would be:
• CATEGORY = ProtocolType
• VALUE = nucleic acid extraction
• DEFINITION = "The procedure of extracting nucleic acid from the
biomaterial"
In RAPAD, additional entries have been included in the OntologyEntry table to cover
properties of proteins, such as types of chemical modifications. The storage of post-translational
modification (PTM) data is an important feature of RAPAD; such data may be generated, for
instance, from tandem MS or from a phosphate labelling experiment. The type of PTM,
such as glycosylation, phosphorylation or biotinylation is obtained from the OntologyEntry
table. This has two clear advantages: firstly to reduce manual entry, as terms do not have
to be typed in each time, but are selected from a drop-down menu; secondly, errors and
imprecision should be reduced if the term is presented to the user with a clear definition,
ensuring that there is a shared understanding of exactly what is being specified. It would
not be possible to design an ontology capable of capturing all terms used in any type of
study. The approach taken in RAD is that users can enter new terms when required, once the
terms have been checked by a member of MGED. A similar feature has been implemented in RAPAD,
whereby new terms can be added to the OntologyEntry table by contacting the author.
Terms are annotated as “user defined” along with a URI specifying the source of the term
and a definition to ensure that the origin of the term is clear.
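The way a flat table of ontology terms can drive a drop-down menu, and accept user-defined terms, can be sketched as follows. This is an illustration under stated assumptions, not the RAPAD code: the in-memory list stands in for the OntologyEntry table, the category name ModificationType, the PTM definitions, the "annotation" key and the example.org URI are all hypothetical.

```python
# In-memory stand-in for rows of the OntologyEntry table. Only the
# ProtocolType row is taken from the text; the rest is illustrative.
entries = [
    {"category": "ProtocolType", "value": "nucleic acid extraction",
     "definition": "The procedure of extracting nucleic acid from the biomaterial"},
    {"category": "ModificationType", "value": "phosphorylation",
     "definition": "Addition of a phosphate group to a protein"},
    {"category": "ModificationType", "value": "glycosylation",
     "definition": "Attachment of a carbohydrate to a protein"},
]


def drop_down_options(category: str) -> list:
    """Return (value, definition) pairs for one category, sorted by value."""
    return sorted(
        (e["value"], e["definition"]) for e in entries if e["category"] == category
    )


def add_user_term(category: str, value: str, definition: str, uri: str) -> None:
    """Add a new term, marked as user defined, with its source URI and definition."""
    entries.append({
        "category": category,
        "value": value,
        "definition": definition,
        "uri": uri,
        "annotation": "user defined",  # hypothetical field name
    })
```

A call such as add_user_term("ModificationType", "biotinylation", "Attachment of biotin to a protein", "http://example.org/terms/biotinylation") would make the new term appear in the next menu built for that category, with its definition shown alongside.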
A number of terms describing proteins and proteomics experiments have been added
to the OntologyEntry table during the development of RAPAD. It is important that this
controlled vocabulary is made available to others developing similar systems. The PSI is
developing an ontology (PSI-Ont) as an extension to MO, covering protein terms, and ul-
timately will provide a repository where developers can obtain and add new terms used in
proteomics studies. The vocabulary developed for RAPAD will contribute to PSI-Ont.
A separate part of GUS, known as SRes, stores phenotype information such as disease
states, bibliographic references and taxonomy information. SRes has been installed alongside
RAPAD, and stores a flat representation of parts of the NCBI taxonomy [224], which is in
effect an ontology of species. This means that the names of species are captured in a
controlled way, which facilitates database queries.
5.5 Discussion
The RAPAD system was developed with several main aims: to support the local proteomic
research requirement, to test the extension of RAD into proteomics as a prototype of a future
public repository for proteomics, to assess if FGE-OM correctly models the data semantics
and to test facilities for correlating changes in protein abundance with gene expression values.
In this section, the progress towards these goals is discussed.
5.5.1 A prototype of a central repository
RAPAD has been developed on top of RAD, which is a well established system grounded in
a significant amount of database research. RAD has robust facilities for storing structured
descriptions of biological samples and experimental protocols, and uses ontologies to create a
standard representation of certain concepts. Protocols stored in this way can be queried more
easily than a free text description, and this opens the possibility for data mining in the future.
RAPAD makes use of the features from RAD that ensure data integrity and security, with
facilities for tracking which individuals have entered data, and restricting access to certain
information where necessary. The successful implementation of a proteomics database using
core RAD tables also demonstrates that parts of the schema could be used for other types
of functional genomics study, such as immunohistochemistry.
RAPAD has been tested by the developers of GUS. The developers have taken the
database schema and interface code, and work is underway to add the proteomics com-
ponent to GUS. The addition of proteomics support in GUS will be a major advance for
web sites such as PlasmoDB, which provides access to FG data for Plasmodium falciparum,
the causative agent of malaria. Large volumes of proteome data are being produced for P.
falciparum [110, 229] but there is currently no method for publicly releasing the material in
a format that can be queried, and it cannot be integrated with microarray or genomic data.
One of the goals of developing RAPAD was to build a prototype of a public proteomics
repository. The proteome extension of GUS is underway, utilising the RAPAD database
schema and interface code, demonstrating that the prototyping stage has been successful.
5.5.2 The relationship between FGE-OM and RAPAD
The object model specified in the previous chapter is a proposal for a data standard. However,
a specification expressed solely in UML cannot be used to test if the concepts of the domain
have been correctly modelled, or if real data can be captured in practice. One of the functions
of RAPAD is to demonstrate that real data can be captured by our proposal. We must first
establish the correspondence between FGE-OM and RAPAD, because the database schema
was not created automatically from the object model. Figure 5.2 displays the names of
classes in FGE-OM and tables in RAPAD that cover the same parts of the domain. The
attributes for the majority of tables are identical or very similar to those belonging to classes
in FGE-OM (the database schema and additional diagrams of FGE-OM are displayed in the
appendices). The BioOM and ArrayOM namespaces in FGE-OM contain classes derived
from MAGE-OM. The relationship between these classes and tables in RAD (now inherited
in RAPAD) has been established previously, and software is in development for automatically
converting between MAGE-OM and RAD [202]. Many of the tables in RAPAD that store
proteome data are derived from the PEDRo database schema, and the PEDRo schema and
object model are virtually identical. Therefore, the parts of ProteomicsOM that are derived
from PEDRo are highly similar to the corresponding part of RAPAD. Finally, tables have
been created in RAPAD that exactly correspond with the parts of FGE-OM that are derived
from Gla-PSI. The overall result is that FGE-OM and RAPAD have a very similar structure,
and therefore it is reasonable to state that by illustrating the use of RAPAD in a real research
environment, it is demonstrated that FGE-OM correctly models proteome workflows. The
integration of gene and protein expression results was one of the main goals of developing
FGE-OM, and this functionality is demonstrated in RAPAD in the following chapter.
5.5.3 Support for current proteome studies
A second goal of developing RAPAD was to produce a system capable of supporting on-going
proteomics research, because the currently available databases do not offer all the facilities
that are required. The following two chapters describe projects that are supported by the
current implementation; in this section, however, a brief description of the main advantages
of RAPAD is given.
The database allows a structured description of experimental protocols and biological
samples to be specified using ontologies. This should improve the capabilities for querying
in the future as data sets become large. This feature is not included in SWISS-2DPAGE
and GELBANK, which only offer fairly simple descriptions of protocols. The data security
features inherited from RAD also provide a simple mechanism for allowing particular re-
searchers or groups to access or modify information in the database. This feature is vital for
large organisations in which many different levels of security could be required.
Data security models
In modern database management systems (DBMS) there are two broad approaches used to
ensure data security: discretionary and mandatory [66]. The security policy can be enforced
at various levels, such as over the entire database, on particular relations, or down to the level
of a single attribute of one row of data. The discretionary approach gives particular rights
to a specific user on different objects in the database, and different users may have different
rights on the same object. Therefore, this model is very flexible but it has a large overhead
if security settings have to be checked for many different objects and users. The alternative
security approach is the mandatory scheme in which certain database objects are assigned
a particular classification, and users are given a clearance level that specifies which data are
accessible or can be modified. The mandatory approach is used in situations where data
fall into particular levels of accessibility, such as government or military databases where
controlling data access is of utmost importance. Security settings can be managed by the
security subsystem of the DBMS, and encoded as a set of rules that must be checked every
time an object is accessed or modified.
In RAPAD, a security system closer to the discretionary approach is employed at the level
of individual rows of data (tuples). However, there are currently no formal rules specified in
the DBMS; instead, checks are made by the user interface to ensure that data has been specified
as publicly accessible, or can be modified by a certain user, and so on. The attributes that
specify the security setting exist for every tuple in the database. This approach is possibly
not as robust as having security rules set in the DBMS, but this would require a permanent
database administrator to update the rules with every new user or group that utilises the
database. The approach taken in RAPAD should in theory be more robust than ensuring
data security only at the level of the user interface. Additionally, the security settings can
be updated automatically without requiring a permanent database administrator.
Query capabilities
RAPAD has a query system that enables users to generate fairly complex queries to find
particular proteins in a study. The details of MS search results are stored which enable
the quality of a match to a protein to be determined, for example allowing a researcher
to exclude particular proteins that are only weakly identified. The results of a search over
different gels can be displayed in the Gel Viewer, which can load several gels simultaneously
for comparing the proteomes of different samples. The Gel Viewer has other features that are
advantageous compared with other databases, such as the facilities to zoom to an unlimited
depth to visualise small spots. The same region can be highlighted on a different gel to find
differences in the pattern of spots. The Gel Viewer can also display the names of proteins,
and the predicted pI and MW, which can be toggled on or off. There are capabilities that
enable researchers to search for possible post-translational modifications. These features are
exemplified in Chapters 6 and 7.
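The kind of query described here, including the Boolean "AND" or "OR" searches of Section 5.4.6, can be sketched in a few lines. The sketch is illustrative only: the record fields, the helper names and all values are assumptions, and a substring test stands in for whatever approximate name matching RAPAD actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpotRecord:
    """Hypothetical, simplified view of an identified gel spot."""
    spot_id: int
    protein_name: str
    mw_kda: float
    pi: float
    match_score: float


Criterion = Callable[[SpotRecord], bool]


def name_contains(fragment: str) -> Criterion:
    """Approximate name match: case-insensitive substring test."""
    needle = fragment.lower()
    return lambda s: needle in s.protein_name.lower()


def in_range(field: str, low: float, high: float) -> Criterion:
    """Inclusive range test on a numeric field such as mw_kda or pi."""
    return lambda s: low <= getattr(s, field) <= high


def search(spots: List[SpotRecord], criteria: List[Criterion],
           mode: str = "AND", order_by: str = "spot_id") -> List[SpotRecord]:
    """Combine the criteria with AND (all) or OR (any) and order the hits."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    hits = [s for s in spots if combine(c(s) for c in criteria)]
    return sorted(hits, key=lambda s: getattr(s, order_by))


# Made-up records for illustration.
spots = [
    SpotRecord(1, "Actin", 42.0, 5.3, 120.0),
    SpotRecord(2, "Tubulin beta", 50.0, 4.8, 35.0),
    SpotRecord(3, "Actin-related protein", 47.0, 6.3, 80.0),
]

# Well-identified actin-like proteins only (weak matches excluded).
and_hits = search(spots, [name_contains("actin"),
                          in_range("match_score", 50.0, 1000.0)], "AND")
# Anything named tubulin, or with a pI between 6 and 7, ordered by MW.
or_hits = search(spots, [name_contains("tubulin"),
                         in_range("pi", 6.0, 7.0)], "OR", order_by="mw_kda")
```

Putting a lower bound on the match score is one way to express the exclusion of weakly identified proteins mentioned above.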
Integration of gene and protein experiments
In the introduction, it was hypothesised that extending a database schema and graphical
user interface intended for microarray experiments into proteomics would facilitate the
integration of data across the two domains. In the following chapter, there is a description of
how the results can be integrated by matching the identifiers associated with gene expression
values to the identifiers for protein abundance. However, this is only part of the process.
An advantage of our approach is that biological samples and experimental protocols can be
entered into RAPAD, and are not linked to a particular experiment, but can be used in
any context. For this reason, a sample could be described a single time, using ontologies to
record the type of material, the source (company, organisation, contact details and so on),
and the species of origin. The sample description could then be associated with a microarray
hybridization, 2-DE, or an LC-MS analysis. When a large number of studies of this type
have been entered in the system, the RAPAD Querier will be capable of retrieving all the
experiments that have been performed on a particular sample. Therefore, integration occurs
at the level of results, as described in the following chapter, and at the level of the biological
samples and experimental protocols.
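Both levels of integration can be illustrated with a small sketch. It rests on assumptions throughout: the dictionaries stand in for database tables, the keys and values are made up, and matching on a shared identifier is a simplification of the process described in the following chapter.

```python
from typing import Dict, List, Tuple

# One shared sample description (made-up values), entered a single time.
samples = {"S1": {"material": "cell culture", "species": "Homo sapiens"}}

# Each experiment refers to the sample by key instead of re-describing it.
experiments = [
    {"id": "E1", "type": "microarray hybridization", "sample": "S1"},
    {"id": "E2", "type": "2-DE", "sample": "S1"},
    {"id": "E3", "type": "LC-MS", "sample": "S1"},
]


def experiments_for_sample(sample_key: str) -> List[str]:
    """Integration at the sample level: all experiments run on one sample."""
    return [e["id"] for e in experiments if e["sample"] == sample_key]


def match_results(gene_expression: Dict[str, float],
                  protein_abundance: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Integration at the results level: pair values that share an identifier."""
    shared = sorted(gene_expression.keys() & protein_abundance.keys())
    return {key: (gene_expression[key], protein_abundance[key]) for key in shared}
```

Identifiers present in only one of the two data sets are dropped by match_results, which is one reason why matching identifiers is only part of the process.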
Availability of RAPAD
The database schema, the RAPAD Study-Annotator and the code for the Gel Viewer are
all freely available for download on the web site. Therefore, other developers can install
RAPAD locally to manage their own proteomics data, and there should not be a significant
overhead installing the current version. However, the current version has not undergone
several rounds of testing and therefore may require some modification or bug fixes once
implemented elsewhere.
Features of RAPAD demonstrate the feasibility of integrating proteomics and microar-
ray data in a single system (a specific example of this facility is described in the following
chapter). At present there are no well publicised systems offering this facility. The CEBS
SysBio system [355] hopes to offer similar capabilities in the future for data mining across
a range of experiment types, but a working prototype is not currently available. An inte-
grated database enables researchers to begin asking questions about the correlation between
gene expression and protein abundance at the global level. It is also thought that post-
translational modifications are important for protein function, and their relationship with
gene expression and protein abundance values has not previously been investigated. It is also
likely that a proteome database could discover instances where proteins display modulated
regulation, which would not be observed at the transcriptome level.
5.5.4 Future developments
The current implementation of RAPAD supports proteomics research and can store microar-
ray data. It has been demonstrated that experimental hypotheses, biological samples and
protocols can be stored in common tables, regardless of whether a microarray or proteome
experiment has been performed. RAPAD could therefore be extended to cover metabolomics
experiments, given that metabolome data comprise column separations and mass spectrom-
etry. This would allow for integration across the transcriptome, proteome and metabolome,
giving a broad view of the biological system to the researcher. A future version of the
database could also incorporate a number of features that will improve facilities for data
mining. A number of links to external databases are already provided but this could be ex-
tended. For example, proteins that have a 3-D structure could be displayed using structure
visualisation software, such as RasMol [280] or Chime [216]. For certain studies it would also
be useful to correlate protein abundance with chromosomal location; this could be achieved
using the Expressionview software, which can display a microarray data set, and visualise
the position of genes on the chromosomes [109]. The relationship between chromosomal loca-
tion and gene expression is particularly important for bacterial studies because sets of genes
are often co-expressed from operons, and the genes within one operon often have related
functions.
Functional classification of genes and proteins in RAPAD is provided through dynamic
links to the Gene Ontology (GO); however, a great variety of new software is currently in
development by a number of groups for summarising and correlating functional categories
with expression values. Additional software for querying and summarising GO, such as
GoMiner [368] (described in Chapter 2), will be installed alongside RAPAD when it becomes
available.
The current RAPAD implementation does not provide support for any detailed statisti-
cal analysis of data sets. The R software has a programmable interface that allows direct
connection to relational databases [261]. Therefore, pre-defined packages can be used to
search for significant differences in protein volumes, and correlations between gene and
protein abundances. New packages can also be written in R for normalising across mRNA and
protein volume data, and for mining data to search for patterns of co-regulation. These
features would enable protein abundance data to be queried in parallel with gene expression
studies, functional classifications and 3-D structures to improve the facilities for knowledge
acquisition. This kind of statistical analysis requires large data sets from gel electrophoresis,
and more research is required into the accuracy of relative protein volume between two or
more gels, detected by image analysis applications.
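The simplest form of the correlation analysis mentioned above is a Pearson coefficient between paired gene expression and protein abundance values. The text proposes R for this work; the sketch below uses Python purely for illustration, and the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two paired series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)
```

In practice the inputs would need to be normalised first, and, as noted above, the result is only as reliable as the relative protein volumes reported by the image analysis software.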
5.6 Conclusions
RAPAD supports proteomics research, and comprises a relational database with a web based
interface, which has been created by extending existing technologies. The system uses on-
tologies to capture knowledge in a standardised, controlled manner. This demonstrates that
re-using and integrating existing systems can facilitate integration of different types of data,
and that the time to develop a large system is significantly reduced, compared with develop-
ing de novo. The implementation also acts as a prototype for a major, public repository for
proteomics, which is currently in development. In the following two chapters, two specific
projects are described that allow the core features of RAPAD to be evaluated. The results
will illustrate how the software has enabled researchers to improve annotation of their data,
and formulate queries that facilitate new biological discoveries.
Chapter 6
Database support for proteomic
studies of host-parasite interactions
6.1 Introduction
The RAPAD system was created to test the feasibility of extending an established microarray
database into proteomics, as a step towards creating a single, integrated database for func-
tional genomics. In this chapter, an example is given of a project that is supported by the
current implementation of RAPAD, including an outline of how facilities of the database have
been specifically tailored for making new discoveries in this area. The biological investigation
aims to characterise changes in the proteome of host cells when invaded with the intracellular
protozoan parasite Toxoplasma gondii, from an in vitro culture. This chapter outlines how
the data from this investigation allows the core facilities of RAPAD to be evaluated. A de-
scription is given of additional software that has been developed for: (i) the visualisation of
proteins with modulated expression, (ii) the integration of the proteomics data with previ-
ously published microarray studies and (iii) the discovery of post-translational modifications.
The results enable researchers to formulate hypotheses about the biological processes that
occur during parasite invasion, and gain a better understanding of host-parasite relationships
in general.
6.1.1 Host-parasite interactions
Toxoplasma gondii and the closely related parasites Plasmodium,
Cryptosporidium and Eimeria pose major global problems to human and animal health.
Genome projects are well underway, and functional genomics investigations are being used to
elucidate the biological processes involved in the infectivity of the parasites [6]. Toxoplasma
is used as a model organism for studying related parasites because it is relatively easy and
safe to culture in vitro, can invade animal models or host cell cultures, and possesses many
of the characteristics of its phylum, Apicomplexa [186]. T. gondii can infect a remarkably
wide range of hosts, including birds, livestock, humans and even oceanic mammals, such as
whales. The parasite is found in almost all geographic regions and infects 10-30% of human
populations [49]. Infection occurs after ingestion of oocysts from the faeces of cats, the
definitive host, or from tissue cysts in infected, undercooked meat. In the majority of cases,
T. gondii forms cysts in the deep tissues, including the brain, where it maintains a life-long
chronic infection. Toxoplasma induces disease in certain cases: (i) the parasite can cross the
placenta to the foetus, causing congenital defects or abortion; (ii) T. gondii can also be fatal
in immuno-compromised patients, for example in individuals with AIDS. It is believed that
the tissue cysts rupture, enabling the parasite to switch from the latent form (bradyzoite)
to a rapidly dividing form (tachyzoite), killing host cells. Therefore, one of the areas for
further investigation is to discover the factors that cause a switch between the two forms
[198]. Substantial work has also been carried out to identify the parasite and host proteins
that are critical for infectivity, and to elucidate the pathways in which they function. It
is believed that the method of invasion is conserved across the Apicomplexa; therefore, any
discoveries made in T. gondii could have far-reaching consequences.
The parasite invades by the following mechanism (reviewed by Sibley 2004 [291]). Toxo-
plasma releases molecules (adhesins) that attach to surface receptors on host cells. The para-
site actively penetrates the membrane, enclosing itself within a vacuole (the parasitophorous
vacuole) that is primarily formed from the host’s cell membrane, thereby reducing the ability
of the host to recognise and reject the parasite. The parasite releases the contents of a set of
organelles into the cytosol, including rhoptries that are crucial for parasite infectivity [29].
Rhoptries release a set of proteins that cause the parasitophorous vacuole to interact with
host cell mitochondria and endoplasmic reticulum, allowing the parasite to scavenge glucose
and cholesterol. An understanding of the proteins and pathways involved in infectivity has
been developed over several decades using classical techniques, such as gene knockout experi-
ments [185], but developments in technology have now opened up the possibility of analysing
the systems on a much larger scale.
6.1.2 Genomic investigation of Toxoplasma
The genome of T. gondii is currently being sequenced [187] and access to large EST databases
has been available for several years [321]. Therefore, T. gondii can now be investigated using
functional genomics techniques, allowing researchers to gain a wider view of the systems
involved with infectivity than previously possible. The genome is 80 Mb (megabases) in
size, contains 11 chromosomes and, as of early 2004, has ten-fold coverage of the
sequence [320], created using the “shotgun” approach [52]. Many genes have little or no
functional assignment; therefore, any studies that provide insights into gene function will aid
the annotation efforts. A previous study investigated the constituents of the proteome of the
tachyzoite (rapidly dividing stage) of T. gondii by two-dimensional gel electrophoresis (2-DE)
[61]. The study discovered that the same proteins appear in a number of positions on a single
gel, indicating that differential splicing of gene products or post-translational modifications
are common. A separate investigation into the proteomics of Toxoplasma demonstrated a
protocol for a 2-D gel map of the tachyzoite stage [79]. Microarray studies have also been
carried out by Gail [116] and Blader [35], discussed below.
6.1.3 Microarray analysis
A more detailed understanding of the function of proteins from Toxoplasma, and the com-
plex networks of interacting proteins, will greatly facilitate the search for new drug targets.
However, researchers also wish to focus on how the parasite interacts with host cells, and
what changes occur in the functioning of the host cells. Microarray studies by Blader and
colleagues [35] determined the genes that are significantly up or down-regulated in a host
cell culture (Human Foreskin Fibroblasts, HFF) when invaded by the parasite, compared
with non-infected host cells, at a number of time points after invasion (1-24 hours post in-
fection). Several groups of genes displaying modulated expression were defined, leading to
hypotheses about the mechanisms of parasite invasion, and the recruitment of host processes
for its own survival. It is believed that the parasites arrest the cell cycle to enable them to
continue utilising host resources as long as possible. An important mechanism for host cell
defence against parasites and viruses is the apoptosis cascade, which causes host cells to die,
thus preventing further development of the intracellular pathogen. Evidence suggests that
Toxoplasma switches on a number of host genes that inhibit and prevent the propagation
of the apoptosis cascade [292]. The microarray results revealed down-regulation of genes
implicated in mitosis and meiosis (cell cycle processes), of apoptosis genes, and of genes
encoding cytoskeletal proteins. The role of calcium dependent signalling during parasite invasion has also been
studied in detail (reviewed by Arrizabalaga and Boothroyd [17]). Some evidence suggests
that Toxoplasma utilises its own calcium dependent pathways, unlike other parasites that
sequester host pathways, therefore one area of study is to determine if there are also changes
in the host genes implicated in these processes. Blader also discovered up-regulation of
genes involved in glycolysis and cholesterol synthesis for energy generation. Infection by
the parasite is resisted by host cells, and therefore it is expected that an up-regulation of
genes involved in the immune response would be observed. In the microarray study, an early
up-regulation of these genes was observed, at one hour post-infection.
A later study by de Avalos, Blader and colleagues [73] performed a microarray experi-
ment, similar to Blader 2001, on the related organism Trypanosoma cruzi. T. cruzi is also
an intracellular pathogen believed to invade by a similar mechanism. The results indicated
that very few host genes were up-regulated early in infection, unlike the T. gondii data,
and across the whole data set, the correspondence in up-regulated genes between T. gondii
and T. cruzi was very low. This has important implications for general understanding of
host-parasite interactions. It has previously been thought that the response of host cells
to invasion by a parasite would be the same, or similar, regardless of the type of parasite.
However, the comparison of the T. gondii and T. cruzi data suggests that there may be
different mechanisms used by host cells to respond to invasion by different parasites. The
consequence of this finding is that drug development should be targeted towards disrupting
very specific processes for specific parasites, rather than targeting a single set of processes
to prevent invasion by any kind of parasite. It is important that the host responses to a
number of parasites are studied in more detail to elucidate the mechanisms involved.
6.1.4 Support for proteome studies
RAPAD is supporting a project from the laboratory of Jonathan Wastling in the Institute
of Biomedical and Life Sciences at the University of Glasgow. The investigations were
performed by Morag Nelson, a PhD student, as part of a project to investigate the changes
in the proteome of mammalian host cells when invaded with T. gondii, compared with non-
infected host cells, at 24 hours post infection. The investigation uses 2-DE for protein
separation, coupled with MS (mass spectrometry) for protein identification. The specific
aims of the biological investigation are as follows. Firstly, to verify if changes observed at
the transcriptional level (by microarray analysis) are confirmed by changes in the amount of
protein produced. Secondly, it is believed that because proteins are the functional unit in the
cell, protein abundance is a better indicator of functional significance than gene expression
values. Therefore, new groups of proteins could be discovered with modulated expression,
which were not found by microarray analysis, leading to the formation of novel hypotheses.
A third aim is to investigate what role post-translational modifications (PTMs) might play
in parasite infectivity.
The experiments present considerable computational challenges that enable the evalu-
ation of the core facilities of RAPAD in three key areas: managing large volumes of data
across replicates, enabling complex queries, and visualisation of results to allow new findings
to be derived. In this chapter, we report on additional work by the author to develop specific
queries and visualisation software, in order to enable differential expression of proteins to
be detected across two conditions from a number of replicate gels. Facilities have also been
developed to integrate microarray data points with the corresponding proteins identified by
MS, and to support the storage and querying of PTMs, in conjunction with gene expres-
sion and protein abundance data. The integration of transcriptome and proteome data may
answer several questions:
• The interval between changes in gene expression and protein abundance. If
genes are up-regulated immediately after infection, when are changes observed in the
level of protein?
• Translational control: are there groups of proteins with modulated expression that
were not associated with a change in gene expression?
• Post-translational modification: do groups of proteins undergo changes in modifi-
cation status that are functionally significant, where there is no change in the rate of
transcription?
6.1.5 Project status
The current status of the biological study is as follows. Fourteen gels produced by 2-DE, from seven
infections with T. gondii and seven non-infected cell lines, have been loaded into RAPAD.
From the gels, approximately 350 differentially expressed spots have been identified. There
are 130 distinct proteins out of the 350, because some proteins appear in multiple copies in
different places on the 2-D gels, and in some cases the same protein has been identified on
replicate gels. Currently, about 40 protein spots (14 distinct) have been matched to the
corresponding microarray clone, although it is expected that this number will increase as the
number of protein records in RAPAD increases (discussed in Section 6.4).
The rest of the chapter is structured as follows. Section 6.2 briefly describes the biological
methodology, how differential protein expression data is visualised, how microarray data
points are matched to protein records, and the techniques employed to assign, display and
summarise functional classification of proteins. An overview of the results is given in Section
6.3, focusing on how RAPAD has supported the generation of new hypotheses. Discussion
is provided in Section 6.4.
6.2 Methods
The source of biological material for the investigations was a human foreskin fibroblast
(HFF) cell line, which was prepared and infected with Toxoplasma gondii, using a protocol
reproduced from the microarray study by Blader et al. 2001 [35]. This should ensure that
the proteome data from these studies are, as far as possible, comparable with the earlier
microarray analysis. The experimental protocols for protein solubilisation, the IPG strip
(first dimension separation), the gel electrophoresis stage (second dimension) and staining
are stored in RAPAD. Eleven biological replicates (infected versus non-infected, 22 gels)
were performed but most of the examples given are from a single replicate (replicate 11).
Coomassie blue stain was used to visualise proteins, gels were scanned with a standard
laboratory scanner and images were analysed using the ImageMaster 2D Elite software [162].
The matching of spots on two different gels (pairwise between replicates) was performed by
the 2D Elite software, which also measured the spot volumes. Differential expression of
proteins was determined as follows. Spots with a volume difference of greater than 30% were
picked for MS analysis, as were spots that were present on one gel but not on the other, as
determined by manual inspection. The gels were normalised to background on a per spot basis, after
background subtraction had taken place, using the method “normalisation at lowest on
boundary”. The spot coordinates and volumes determined by 2D Elite were imported into
RAPAD. Samples were sent for MALDI-TOF (Matrix Assisted Laser Desorption Ionisation
- Time of Flight) analysis and identifications were made using the MASCOT software [207].
The samples that did not produce a significant protein identification were analysed using a
tandem MS system (AB Q-Star Pulsar).
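The selection rule described above can be illustrated with a minimal sketch. It assumes that the percentage difference is measured relative to the smaller of the two spot volumes, which the text does not specify. The function names and example values are hypothetical; in the study itself, spot matching and volume measurement were performed by the 2D Elite software.

```python
def percent_volume_difference(vol_a, vol_b):
    """Percentage difference between two spot volumes, relative to the smaller one.

    Assumption: the source does not state which volume is the reference.
    """
    low, high = sorted((vol_a, vol_b))
    if low <= 0:
        return float("inf")  # treat a zero-volume spot as an unbounded difference
    return (high - low) / low * 100.0


def pick_spots_for_ms(matched_spots, threshold=30.0):
    """Return the IDs of spots that would be picked for MS analysis.

    matched_spots: iterable of (spot_id, volume_infected, volume_non_infected);
    a volume of None marks a spot that is absent from that gel.
    """
    picked = []
    for spot_id, vol_infected, vol_non_infected in matched_spots:
        if vol_infected is None or vol_non_infected is None:
            picked.append(spot_id)  # present on one gel only
        elif percent_volume_difference(vol_infected, vol_non_infected) > threshold:
            picked.append(spot_id)
    return picked


# Hypothetical example values, for illustration only.
example = [
    ("s1", 131.0, 100.0),  # 31% larger in the infected sample
    ("s2", 100.0, 90.0),   # about 11% difference
    ("s3", None, 50.0),    # absent from the infected gel
]
print(pick_spots_for_ms(example))
```

With a different reference volume (for example the non-infected gel), the same threshold would select a slightly different set of spots, so the definition of the percentage should be fixed before results are compared across replicates.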
The contribution of the author was: (i) to develop the core RAPAD system, as described
in the previous chapter; (ii) to create additional displays of differential expression (Section
6.2.1); (iii) to write software for matching gene expression values to protein abundance data
(Section 6.2.2); and (iv) to develop scripts to retrieve identifiers that enable hyperlinks to be
created from RAPAD to external software, in order to provide a summary of the functional
classification of each protein in this specific study (Section 6.2.3).
6.2.1 Display of protein data from different gels
The previous chapter described facilities in the RAPAD Gel Viewer. The Gel Viewer enables
multiple gels to be loaded simultaneously, the results of searches to be viewed, and offers the
display of the predicted charge and molecular weight of the proteins. These features allow
a researcher to search for PTMs and analyse the proteins that have been identified in the
study. For the Toxoplasma investigation an additional interface was created to improve the
visualisation of proteins that are differentially expressed on 2-D gels. The interface addresses
the problem of spot interpretation where certain proteins appear in several different positions
on individual gels, corresponding to particular PTMs or differentially spliced forms of the
protein. A series of gels has been run from replicate samples, which may provide
supporting or contradictory evidence, and this information must be assimilated. The first
goal was to develop additional software to help researchers define the spots on different
gels that correspond to the same protein.
EXAMPLE: Spots matching protein XYZ1 appear ten times on gels from infected samples
(across replicates), and three times from non-infected samples (exemplified in the Results
section, Figure 6.4). A visualisation has been created that shows the exact regions in which
XYZ1 appears on the different gels, enabling the researcher to determine how many different
forms of XYZ1 exist in total, and which forms are up or down-regulated. It may
be the case that the three spots containing XYZ1 from the non-infected sample correspond
to a particularly modified form of the protein that has the same abundance in infected and
non-infected samples. The additional spots from infected samples might instead correspond to a
different form of XYZ1, which is produced in greater abundance and is crucial for parasite
infectivity. The visualisation system displays which forms of the protein are up-regulated,
down-regulated or have stable abundance.
A query has been developed in RAPAD that returns a page that lists the proteins that
have been identified across all replicates. The researcher selects the proteins they wish to
investigate and the Gel Viewer opens, highlighting the proteins selected on all replicates,
with each replicate gel loaded in a separate tabbed window of the viewer. The researcher
can zoom on the proteins and note the ID numbers of spots in the same position. The ID
numbers of spots in the same positions are manually entered into a text file by the researcher,
which is then loaded into the database (into the tables MatchedSpots and MultipleAnalysis).
This allows a spot set to be defined that corresponds to the same form of the same protein on
different gels. After a spot set has been defined, a second interface displays the spot sets and
the volume of individual spots on different gels, if these values have been entered in RAPAD,
to display which spots appear in greater or lesser volume. This should allow the researcher
to define a particular variant of a protein (one spot set) as up or down-regulated during
infection. Section 6.3.1 gives an example of how the software has been used in practice to
identify differentially expressed proteins in the biological investigation.
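The definition and comparison of spot sets can be sketched as follows. The layout of the researcher's text file is not given in this chapter, so the format used here (a spot set name followed by gel:spot pairs) is an assumption, as are the function names, the gel numbering, the volumes and the 30% tolerance borrowed from the spot-picking rule in Section 6.2.

```python
from statistics import mean


def parse_spot_sets(lines):
    """Parse lines such as 'ACTB_a 1:42 3:29' into {name: [(gel, spot), ...]}.

    Hypothetical format: one spot set per line, '#' starts a comment line.
    """
    spot_sets = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        name, members = fields[0], fields[1:]
        spot_sets[name] = [tuple(int(x) for x in m.split(":")) for m in members]
    return spot_sets


def classify(spot_set, volumes, condition_of_gel, tolerance=0.3):
    """Label one spot set as 'up', 'down', 'stable' or 'undetermined'.

    volumes: {(gel, spot): volume}; spots without a stored volume are skipped.
    condition_of_gel: {gel: 'infected' or 'non-infected'}.
    """
    by_condition = {"infected": [], "non-infected": []}
    for gel, spot in spot_set:
        if (gel, spot) in volumes:
            by_condition[condition_of_gel[gel]].append(volumes[(gel, spot)])
    if not by_condition["infected"] or not by_condition["non-infected"]:
        return "undetermined"
    ratio = mean(by_condition["infected"]) / mean(by_condition["non-infected"])
    if ratio > 1 + tolerance:
        return "up"
    if ratio < 1 / (1 + tolerance):
        return "down"
    return "stable"


# Hypothetical input: gels 1 and 3 are non-infected, gel 2 is infected.
sets = parse_spot_sets(["# name gel:spot ...", "ACTB_a 1:42 3:29 2:17"])
volumes = {(1, 42): 100.0, (3, 29): 110.0, (2, 17): 200.0}
conditions = {1: "non-infected", 2: "infected", 3: "non-infected"}
print(classify(sets["ACTB_a"], volumes, conditions))
```

Averaging volumes within each condition is only one possible summary; pairwise ratios within each replicate would be an equally reasonable choice.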
6.2.2 Comparison of protein and gene expression data
The experimental protocol for infecting an HFF cell line with T. gondii for the proteome
study was reproduced from Blader’s study, as detailed above, so it should be possible
to compare the expression of a gene measured by a microarray with the
protein abundance value obtained in this study. The microarrays of Blader were created
according to a standard protocol, and each feature is supplied with an identifier of the cDNA clone
(for example, IMAGE:123456) and of the GenBank cDNA record. The cDNA record does not
share an identifier with either the protein record returned by MASCOT [207] (the software
used to identify proteins following MS), or with the corresponding nucleotide record found by
following a link from the protein record. Therefore, performing matching between microarray
clones and protein sequences is not a trivial task.
An initial attempt to find corresponding gene and protein records used pattern matching
over the names of the microarray features and the protein names, expressed as an SQL query,
in the following way. A query is deployed to match the first word of both the microarray
clone name, and the protein name. A list of exceptions is generated where the first word
occurs frequently and is not informative, such as “hypothetical” or “protein”. In these cases,
other words in the protein name are analysed to find matches. A list of potential matches is
supplied to the user, but sensible matches are returned in only approximately 50% of cases,
because the following problems arise:
• If synonyms exist for gene names, one name may be used for the cDNA clone and
another for the protein.
• Certain words occur frequently in gene names which cause incorrect matches to be
found, such as “alpha” or “beta”.
[Figure 6.1: flow diagram. Inputs: a list of cDNA clone IDs (from the microarrays) and a list of protein IDs (from 2-D gel data and mass spectrometry; source databases GenBank, Swiss-Prot, PIR and PDB). Steps: retrieve GenBank nucleotide IDs using BioJava to give a list of GenBank nucleotide IDs; retrieve the DoTS ID number for each sequence from DoTS at AllGenes.org; store a local copy of the DoTS ID for each sequence, and the mapping from DoTS ID to DoTS gene record, in RAPAD; retrieve the DoTS gene number for every microarray result and for every protein; join query. Output: Microarray gene name | Gene expression value | Protein Name | Protein Volume]
Figure 6.1: The process of matching microarray data to protein abundance data.
• Gene families exist with a number of closely related entries, such as Tropomyosin 2, 3
and 4, which have closely related sequences; microarray clones or protein
records may therefore have been annotated incorrectly, or a search with the MS data may return
the incorrect entry.
• More generally, annotation in the databases is prone to inaccuracy and is being con-
stantly refined.
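The first-word matching strategy, and the gene family problem listed above, can be illustrated with a short sketch. The original was expressed as an SQL query; this version is a Python approximation written for illustration, and the function names and example names are hypothetical. Only the two uninformative words mentioned in the text are included in the exception list.

```python
# Words that are skipped because they occur frequently and carry no information.
UNINFORMATIVE = {"hypothetical", "protein"}


def informative_words(name):
    """Lower-case the words of a name, strip punctuation, drop exception words."""
    words = [w.strip(",.;()").lower() for w in name.split()]
    return [w for w in words if w and w not in UNINFORMATIVE]


def candidate_matches(clone_names, protein_names):
    """Pair clone and protein names that share their first informative word."""
    index = {}
    for protein in protein_names:
        words = informative_words(protein)
        if words:
            index.setdefault(words[0], []).append(protein)
    matches = []
    for clone in clone_names:
        words = informative_words(clone)
        if words:
            for protein in index.get(words[0], []):
                matches.append((clone, protein))
    return matches


# Hypothetical names, for illustration only.
clones = ["tropomyosin 2 (beta)", "hypothetical protein ABC1"]
proteins = ["Tropomyosin 4", "actin, beta"]
print(candidate_matches(clones, proteins))
```

In this example the clone for Tropomyosin 2 is paired with the protein Tropomyosin 4, which shows how members of a gene family produce plausible-looking but incorrect candidate matches.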
Further improvements to the algorithm for matching names would improve specificity but
it would be very difficult to engineer a robust method that would succeed in all situations.
Therefore, a different approach has been implemented in RAPAD using AllGenes [10].
AllGenes is a web site that provides access to the Database of Transcribed Sequences (DoTS),
which collects the different identifiers assigned to sequences (cDNA, mRNA, DNA)
that correspond to the same underlying gene. For example, the gene
“heterogeneous nuclear ribonucleoprotein F” has a GenBank record for the protein sequence
(gi|4826760), a nucleotide record (NM_004966), a microarray specific ID (IMAGE:345833), and
the corresponding cDNA GenBank ID (W72693). The DNA, cDNA and microarray
identifiers are each assigned a DoTS number, and collections of DoTS entries that correspond to
the same underlying object (gene) are assigned a single DoTS gene number (DG.36388269).
DoTS entries have been created by performing sequence similarity searches, and assembling
clusters of sequences that correspond to the same object. A significant number of DoTS
entries have been manually curated.
The following series of actions is used to match protein records to microarray clones
(summarised in Figure 6.1).
1. RAPAD stores a URL referencing a web page on an external server for visualising
MASCOT results. A script retrieves GenBank protein IDs from the web page.
2. Protein records are retrieved from GenBank using the API (Application Programming
Interface) provided by BioJava [34]. Many GenBank protein records have a link to
the corresponding nucleotide record under the data type: DB Source, except for cases
where the protein sequence originated from a 3-D structure, or a database other than
GenBank, such as Swiss-Prot. In these cases, the nucleotide record must be found by
following a series of links manually (approximately 10% of proteins), or performing a
sequence similarity search on the GenBank nucleotide database.
3. The DoTS web site allows programmatic access for single entries, and has batch capa-
bilities, but does not currently scale up for accepting very large numbers of identifiers.
Therefore, the DoTS database has been downloaded in flat files, and the UNIX grep
utility was used to search the files for the DoTS identifiers for GenBank nucleotide
records (found automatically from MASCOT or found manually) and cDNA records
(from the microarrays).
4. DoTS identifiers for microarray clones or proteins are stored in a newly created table
in RAPAD. A mapping from all DoTS identifiers to the corresponding DoTS genes is
stored in a table in RAPAD that can be queried when required.
5. An SQL query finds DoTS numbers for every protein, and retrieves the corresponding
DoTS gene number. A search is performed to find any microarray features that have
a DoTS number that has been mapped to the same DoTS gene ID.
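Steps 4 and 5 above can be sketched as a join over two small tables. The RAPAD schema for these tables is not given in this chapter, so the table and column names below are hypothetical, as are the individual DoTS numbers (DT.1 to DT.3) and the unmatched accession; the GenBank accessions NM_004966 and W72693 and the gene number DG.36388269 are taken from the example in the text. SQLite is used here only to keep the sketch self-contained.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with hypothetical mapping tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE SequenceDots (accession TEXT PRIMARY KEY, source TEXT, dots_id TEXT);
        CREATE TABLE DotsGene (dots_id TEXT PRIMARY KEY, dots_gene TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO SequenceDots VALUES (?, ?, ?)",
        [
            ("NM_004966", "protein", "DT.1"),      # nucleotide record found via the protein
            ("W72693", "microarray", "DT.2"),      # cDNA record of the microarray clone
            ("XX000000", "microarray", "DT.3"),    # made-up clone with no protein match
        ],
    )
    conn.executemany(
        "INSERT INTO DotsGene VALUES (?, ?)",
        [("DT.1", "DG.36388269"), ("DT.2", "DG.36388269"), ("DT.3", "DG.0")],
    )
    return conn


def match_proteins_to_clones(conn):
    """Return (protein accession, clone accession, DoTS gene) for shared DoTS genes."""
    query = """
        SELECT p.accession, m.accession, pg.dots_gene
        FROM SequenceDots AS p
        JOIN DotsGene AS pg ON pg.dots_id = p.dots_id
        JOIN DotsGene AS mg ON mg.dots_gene = pg.dots_gene
        JOIN SequenceDots AS m ON m.dots_id = mg.dots_id
        WHERE p.source = 'protein' AND m.source = 'microarray'
        ORDER BY p.accession, m.accession
    """
    return conn.execute(query).fetchall()


rows = match_proteins_to_clones(build_example_db())
print(rows)
```

Because the match is made on the DoTS gene number rather than on names, synonyms and uninformative words in the annotation no longer affect the result.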
The results of matching protein data to microarray results are displayed in the RAPAD
interface in a table, showing properties of the protein with links to the full protein record.
The microarray results from the different time points are displayed alongside. If the protein
has been matched across the two gels (infected and non-infected in this case), and volume
measures have been found for the two gel spots, the ratio of the protein volumes is displayed
alongside the microarray results. When large datasets are assembled it should be possible
to determine the correlation between gene expression and protein abundance for a series of
time points. This will enable the lag between the up-regulation of a gene and the production
of new protein to be calculated on a large scale.
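One way in which the correlation mentioned above might be computed is sketched below. It assumes that protein abundance is expressed as a log2 ratio of infected to non-infected spot volume and that a Pearson correlation is used; neither choice is stated in the text, and the function names and values are hypothetical.

```python
import math


def log2_ratio(volume_infected, volume_non_infected):
    """Log2 ratio of spot volumes (infected versus non-infected)."""
    return math.log2(volume_infected / volume_non_infected)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns nan if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def correlation_by_time_point(protein_log_ratios, expression_by_time):
    """Correlate protein log ratios with gene expression at each time point.

    expression_by_time: {hours post infection: [log ratio per gene]}, with
    genes listed in the same order as the matched proteins.
    """
    return {
        hours: pearson(values, protein_log_ratios)
        for hours, values in expression_by_time.items()
    }


# Hypothetical values for three matched gene/protein pairs.
protein = [log2_ratio(200, 100), log2_ratio(50, 100), log2_ratio(100, 100)]
expression = {1: [0.0, 0.1, -0.1], 6: [0.9, -1.1, 0.1]}
correlations = correlation_by_time_point(protein, expression)
print(max(correlations, key=correlations.get))
```

The time point with the highest correlation gives a rough indication of the lag between a change in gene expression and the change in protein abundance measured at 24 hours post infection.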
6.2.3 Functional classification of proteins
Proteomics experiments generate large quantities of complex data, so analyses that
summarise the data are required to gain a better understanding of the whole system.
The biological investigation reported in this chapter is analysing the changes that occur in
the human proteome, and there are a great number of resources available for characterising
human proteins. One example is the Gene Ontology (GO) project [126] (described in Chap-
ter 2), which has assembled a large amount of information about the function of proteins. In
RAPAD, GO ID numbers are stored for all proteins identified in this study and hyperlinks
have been created to the AmiGO browser [12]. AmiGO graphically highlights the position of
the term, and has controls for traversing up and down the GO tree, enabling the researcher
Figure 6.2: Output from GoMiner, displaying the GO tree browser open for the gene Tropomyosin 1.
to view the hierarchical classification of a gene (or protein). However, this system is not ideal
for a large collection of proteins because the knowledge about function must be manually
assembled by browsing, and is difficult to summarise because it is unclear which
depth of the tree functional information should be taken from. For certain proteins, the lowest
depth may provide useful annotation, but in other cases a more general classification (higher
up the hierarchy) may be more informative. Therefore, additional tools have been used that
summarise GO classifications: GoMiner [368] and FatiGO [7].
GoMiner accepts a list of gene symbols¹ from one or two experiments, and displays
summaries of where genes have been found in the hierarchy. GoMiner also displays which
branches of GO are linked to genes that are up or down-regulated with statistics (described in
more detail in Chapter 2). For example, if three genes involved with cytoskeletal development
are up-regulated and one is down-regulated, this result would be displayed graphically, with
a statistic indicating that, for this set of conditions, cytoskeletal proteins tend to be up-
regulated (Figure 6.2).
FatiGO provides access to GO over the Internet, and has similar goals to GoMiner.
1 A gene symbol is an official annotation for every human gene from the Human Genome Organisation (HUGO) [156]. Example: the gene actin beta has the gene symbol ACTB.
Figure 6.3: Output from FatiGO showing the classification of up and down-regulated proteins in the Biological Process branch of GO at a depth of 3, the third lowest (Query = infected cells, Reference = non-infected cells).
FatiGO provides summaries of where up and down-regulated genes appear in GO. FatiGO
accepts lists of gene symbols that have been highlighted from two experiments, and allows
the user to select the depth of the hierarchy and which branch of the three classifications
in GO to display. A visual summary of results is displayed with p-values to indicate the
significance of the association between one of the two conditions in the experiment and a
particular branch of GO (Figure 6.3).
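The kind of statistic reported by such tools can be illustrated with a one-sided hypergeometric (Fisher-type) test. This chapter does not state which test GoMiner or FatiGO applies, so the choice of test is an assumption; the function name and the counts are hypothetical, loosely following the cytoskeletal example given above.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    N: genes in the whole list; K: genes annotated with the GO branch;
    n: up-regulated genes; k: up-regulated genes annotated with the branch.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 20 genes in total, 8 of them up-regulated,
# 4 annotated as cytoskeletal, 3 of which are up-regulated.
p = enrichment_p_value(k=3, n=8, K=4, N=20)
print(round(p, 4))
```

A small p-value would indicate that the branch contains more up-regulated genes than expected by chance; with counts as small as these, the association would not normally be considered significant.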
FatiGO and GoMiner can also be used to classify proteins instead of genes, but both tools
require gene symbols as input rather than GO identifiers or GenBank accession numbers.
A set of scripts was developed by the author to retrieve the gene symbols from GenBank
nucleotide records for all the proteins highlighted in this investigation. The gene symbols are
stored in RAPAD, and are also used to create web links to the Ensembl genome browser [58]
for visualising the chromosomal location of the gene, as well as linking to GenAtlas [121] and
GeneCards [268]. GenAtlas and GeneCards summarise information about the function of
genes, display the intron/exon structure, provide physical maps showing other genes in the
localised region, give expression values in different human tissues, and display the domains
of the protein.
6.3 Results
The introduction outlined several key changes that are thought to occur in host cells when
invaded with Toxoplasma gondii. The proteome project had several major hypotheses to test,
which required significant database support. In this section, an outline of the results of the
analysis is given in four areas: the display of differentially expressed proteins, software that
aids the functional annotation of proteins, the integration with microarray results and the
search for post-translational modifications. The purpose of this chapter is to focus on how
RAPAD has facilitated these processes for the experiments with Toxoplasma, using several
examples of proteins highlighted by the study which may have a role in the infectivity of
the parasite. The proteome investigation is still continuing, and a complete report of the
biological results is beyond the scope of this work.
6.3.1 Visualisation of differential expression
The development of software for the visualisation of spots on different gels corresponding to
the same protein was described in Section 6.2.1. In this section an example is given of the
usage of the software, in the context of the T. gondii infection data.
[Figure 6.4 image labels: Spot IDs 42 and 41; Spots 25 and 24; Spots 29 and 27]
Figure 6.4: The interface for viewing spots across replicate gels. A table displays proteins ordered by name, allowing the researcher to select entries that have been identified as the same protein across different replicates, in this case ACTB. The Gel Viewer opens, highlighting the proteins in different windows to allow the researcher to assess which spots correspond to each other on different gels. A polygon has been overlaid to demonstrate that spots 42 and 41 from non-infected replicate 1 appear to correspond with spots 29 and 27 from non-infected replicate 3. Gel images courtesy of M. Nelson.
The process is demonstrated in Figure 6.4 for six spots containing the protein ACTB,
which appears in 26 spots in total. In this example, there is a cluster of four spots matched
to ACTB on one gel, and two spots on a replicate gel. The corresponding region has been
highlighted for the two gels, and a polygon has been drawn² to demonstrate that spots 41
and 42 from non-infected replicate 1 correspond with spots 27 and 29 from non-infected
replicate 3. In this example, spot 42 (replicate 1) and spot 29 (replicate 3) form one spot set³,
and spot 41 (replicate 1) and spot 27 (replicate 3) form a different spot set. The region can then be
compared on gels from infected samples to see if this particular form of the protein, in this
exact position, is up or down-regulated.
The Gel Viewer, in combination with the RAPAD query system, allows differential ex-
pression of proteins to be visualised. An additional view of the data has been created which
will allow the results to be made public when the study is published in a journal. A total of
130 differentially expressed proteins have been identified by the researcher, which are stored
in RAPAD. Figure 6.5 displays the interface for viewing data that is combined across repli-
cates. The data can be viewed in a table that provides links to the individual protein records,
and enables any number of proteins to be selected and opened within the Gel Viewer. There
are facilities for investigating the function of the proteins, addressed in the following section.
6.3.2 Functional annotation of proteins
The software described in Section 6.2.1 facilitates the determination of a set of proteins that
show changed expression between infected and non-infected host cells. Each protein record
has links to a number of external databases: GenBank displays the nucleotide and protein
sequence; Harvester [33], GenAtlas, and GeneCards summarise a large amount of information
that has previously been assembled for each entry; and Ensembl enables a researcher to
visualise the chromosomal location of a gene. A link to the Gene Ontology record for the
protein is also provided, allowing the researcher to build a complex picture of the function
of each protein. RAPAD includes an option for annotating a protein spot with a textual
description, thereby allowing new findings derived from external sources to
be recorded in the database.
Proteins with modulated expression in this study could potentially fall into three cate-
gories:
2 The polygon was created manually by the author to clarify which spots correspond to each other across the two gels.
3 A spot set is defined as a group of spots in the same position on different gels, corresponding to a specific isoform of the protein.
Figure 6.5: The interface for displaying data combined across replicates. The top image displays the option for assigning groups of gels to two different conditions (infected versus non-infected). The lower image shows the table of proteins that have been identified in each group of gels.
• Host proteins actively up or down-regulated by the parasite, required for invasion or
maintaining infection.
• Proteins expressed by host cells in an attempt to resist parasite infectivity.
• Proteins with altered expression, caused indirectly, as a result of other proteins being
up or down-regulated.
It is therefore important, when analysing changes to the host proteome, to consider whether
or not a change facilitates parasite infectivity, as this has major consequences for the
interpretation placed on the result.
Example: Differential expression of Cathepsin B
One of the proteins found by the researchers to be differentially expressed was
Cathepsin B, which cleaves proteins, transforming them from their initially translated form
(the prepro protein) into the functional form. Previous studies have suggested that Cathepsin
B from T. gondii is required for infectivity and rhoptry protein processing [260]. The pro-
teome studies described here, along with the previous microarray experiments, suggest that
human Cathepsin B is down-regulated during infection. A study by Que et al. in 2002 [260]
demonstrated that inhibition of Toxoplasma Cathepsin B prevented the parasite from infect-
ing cells, and was therefore a potential drug target. The study by Que also demonstrated
a significant sequence and structural similarity between human and Toxoplasma Cathepsin
proteins. Therefore, the finding that human Cathepsin B is down-regulated during infection
raises the possibility that human Cathepsin interferes with correct processing of Toxoplasma
proteins. If this proved to be correct, induction of expression of human Cathepsin could
prove to be an inhibitor of Toxoplasma infectivity. However, the situation is more complex
because human Cathepsin has also been implicated in the apoptosis pathway [139], and
one of the critical factors enabling a parasite to maintain infection is inhibition of apop-
tosis. Therefore, Toxoplasma may cause the down-regulation of Cathepsin to prevent the
cell entering apoptosis. This demonstrates that there is a significant information retrieval
task required to understand the results after particular proteins have been highlighted. The
interface provided by RAPAD allows the researcher to assimilate the results from past experi-
ments rapidly, via other Internet accessible resources (Figure 6.6), and record the information
within the database.
Figure 6.6: The protein record for Cathepsin B in RAPAD has external links to AmiGO [12], GenBank [30] and GeneCards [268].
Summary of biological results
Since the results of the investigation with T. gondii will be published by Dr Wastling and
Morag Nelson at a later date, a complete description of the results of the biological inves-
tigation is outside the scope of this work. When the results are ready for publication, the
RAPAD interface will provide public access to the data, as described in Section 6.3.5.
Cathepsin B, described above, is one of many proteins found to have modulated expres-
sion during parasite infectivity, which demonstrates the effectiveness of the experimental
approach adopted by Dr Wastling and the software developed in this investigation. Initial
results from the proteomics investigation have discovered down-regulation of proteins in-
volved in the formation of the cytoskeleton, as expected due to the ability of the parasite
to halt new cell growth and cell division. Other proteins implicated in apoptosis, such as
cytochrome c, are also down-regulated, and there is an up-regulation of proteins involved in
the host’s response to stress. The following section describes work by the author to match
the protein abundance values from this study, to gene expression values from the previously
published microarray experiments. Several examples are given of proteins that have been
shown to be differentially expressed, which have been highlighted for further investigation.
6.3.3 Comparison with microarray data
We have developed software to match proteins identified by MS to the corresponding clones
from the microarray study by Blader and colleagues, in order to discover the correlation be-
tween gene expression and protein abundance. The Blader experiment contains two relevant
datasets. The first is a time course experiment to highlight genes with altered expression
at 1, 2, 4, 6 and 24 hours post infection with T. gondii. The second data set from Blader’s
microarray experiment contains an analysis, from two independent infections, of the genes
that were most strongly up or down-regulated at 24 hours post-infection. The proteomics
experiment carried out at Glasgow determines the abundance of proteins at 24 hours post-
infection. It is likely that there is a lag between an up-regulation in gene expression, and
the production of new protein, although the length of time is not known exactly.
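
The matching step described above can be illustrated with a minimal sketch. It assumes that every protein identification and every microarray clone has already been mapped to a shared gene-level key. The field names (`gene_key`, `spot_id`, `clone`), the spot and clone labels and the choice of gene symbols as the key are hypothetical, since the text does not state which identifier the matching software joins on. The two expression ratios are the 24 hour values reported in Table 6.1.

```python
from collections import defaultdict

# Hypothetical MS identifications; spot IDs and gene keys are illustrative only.
proteins = [
    {"spot_id": "S1", "name": "Cathepsin B", "gene_key": "CTSB"},
    {"spot_id": "S2", "name": "Vimentin", "gene_key": "VIM"},
    {"spot_id": "S3", "name": "Unmatched protein", "gene_key": "XYZ1"},
]

# Microarray clones with infected / non-infected ratios at 24 h (Table 6.1).
clones = [
    {"clone": "clone-A", "gene_key": "CTSB", "ratio_24h": 0.47},
    {"clone": "clone-B", "gene_key": "VIM", "ratio_24h": 0.43},
]


def match_proteins_to_clones(proteins, clones):
    """Return (protein, clone) pairs that share the same gene key."""
    by_key = defaultdict(list)
    for clone in clones:
        by_key[clone["gene_key"]].append(clone)
    # Proteins whose key is absent from the microarray results yield no pair.
    return [(p, c) for p in proteins for c in by_key.get(p["gene_key"], [])]


matches = match_proteins_to_clones(proteins, clones)
for protein, clone in matches:
    print(protein["name"], clone["clone"], clone["ratio_24h"])
```

Proteins without a counterpart in the microarray results simply drop out of the join, which is why limited coverage in either experiment reduces the number of matched data points.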
The technique to match data points present in both data sets performed correct matching
between gene and protein identifiers. However, due to the limited coverage of both exper-
iments, the datasets are not currently large enough to infer global information about the
rate of translational control for Toxoplasma proteins. The results of the matching, displayed
in Table 6.1, provide qualitative information about the correspondence between the rate of
Figure 6.7: The table in RAPAD displaying protein abundance and gene expression values. The column headings are as follows: 1 = Spot ID, 2 = Protein name, 3 = cDNA clone name, columns 4 to 8 are relative gene expression values from a time course experiment, and columns 9 and 10 are relative expression values from a separate microarray hybridization (24 hour time point, see Section 6.3.3). Column 10 = spot ID of matching spot on a second gel and column 11 is the ratio of protein volume between the two gels.
Protein Name                                             1h    2h    4h    6h    24h   24h(i)  24h(ii)

Up-regulated
Annexin-1                                                2.02  0.65  1.18  0.95  0.79  —       —
Heterogeneous ribonucleoprotein F                        —     —     —     —     —     2.77    2.30
HS70kDa protein 8 isoform 1                              —     —     —     —     —     2.58    2.22
Nucleoside diphosphate kinase 1                          1.82  0.75  1.37  1.12  2.55  —       —
Phospholipase C alpha or Protein disulphide isomerase    —     —     —     —     —     2.03    2.64
Thioredoxin peroxidase                                   —     —     —     —     —     2.14    2.04
Tubulin beta                                             —     —     —     —     —     3.94    3.00
Villin 2                                                 1.53  1.02  1.26  1.34  2.50  —       —

Down-regulated
Actin beta                                               0.69  1.05  0.74  1.07  0.44  —       —
AHNAK (Desmoyokin)                                       —     —     —     —     —     0.41    0.15
Cathepsin B                                              0.95  0.96  0.84  1.00  0.47  —       —
Dimethyl arginine dimethyl aminohydrolase                1.13  0.84  1.01  1.29  2.02  —       —
Heterogeneous ribonucleoprotein F                        —     —     —     —     —     2.77    2.30
Superoxide dismutase                                     1.50  3.56  3.77  4.72  1.69  —       —
Tubulin beta                                             —     —     —     —     —     3.94    3.00
Vimentin                                                 0.87  0.73  0.81  0.74  0.43  —       —
Table 6.1: The correspondence between gene and protein abundance for HFF cells infected with T. gondii. Column 1 contains the names of proteins identified in the proteome study, which are up or down-regulated during parasite infection. The numerical values are the corresponding gene expression values from the study by Blader [35] from a time course experiment (columns 2-6) and two independent infections at the 24h time point (columns 7 and 8). The values are the ratio of the expression of the gene that corresponds to the protein in column 1, from infected versus non-infected samples. A value greater than 1 indicates the gene is up-regulated during infection, less than 1 indicates that the gene is down-regulated. The — symbol indicates that the value was not present in the Blader study.
Figure 6.8: The top image displays a part of the gel from the infected sample at a higher magnification, and the bottom image is the non-infected sample. Spots matched to vimentin are highlighted. The cluster of spots marked 2 is present on both gels. The cluster of spots marked 1 is only present in non-infected samples. Gel images courtesy of M. Nelson.
transcription and translation. The first column in Table 6.1 displays the proteins that have
been found to be up or down-regulated during infection in the proteomics investigation, and
have been matched to a gene in the Blader study. Proteins are identified as up-regulated
in infected samples if they appear in a larger volume on gels from infected samples, or the
spot is present in the infected sample and absent in the non-infected sample. A protein is
defined as down-regulated if it appears in a larger volume on gels from non-infected samples,
or is only present on gels from non-infected samples. Columns 2-6 display the expression values
at the five time points post-infection from the Blader study for the gene that corresponds to
the protein in column 1.
Columns 7 and 8 display the expression values for genes that have been matched to proteins
in this investigation, from two further independent infections at 24 hours post-infection in
the Blader study. Table 6.1 summarises fairly complex data, as for example vimentin and
actin both appear in multiple copies on gels from infected and non-infected samples. Both
vimentin and actin are defined as down-regulated because there are spots clearly present
across replicates on non-infected samples, which are not present on infected gels. Figure 6.8
displays the spots matched to vimentin from infected and non-infected samples. The spot
cluster 2 is present on both gels in roughly similar volumes. Cluster 1 is only present in non-
infected samples. This indicates that several forms of vimentin with particular modifications
are down-regulated during infection.
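
The classification rules in this paragraph can be written down as a short sketch. The function names and the example volumes are hypothetical, and no fold-change threshold is applied because the text does not give one; a spot that is absent from a condition is represented by `None`.

```python
def classify_spot(infected_volume, non_infected_volume):
    """Classify one spot set; None means the spot is absent in that condition."""
    if infected_volume is None and non_infected_volume is None:
        return "not detected"
    if non_infected_volume is None:
        return "up-regulated"      # present only on gels from infected samples
    if infected_volume is None:
        return "down-regulated"    # present only on gels from non-infected samples
    if infected_volume > non_infected_volume:
        return "up-regulated"
    if infected_volume < non_infected_volume:
        return "down-regulated"
    return "unchanged"


def classify_protein(spot_sets):
    """Collect the distinct directions of change over all forms of a protein."""
    labels = {classify_spot(inf, non) for inf, non in spot_sets}
    return sorted(labels - {"unchanged", "not detected"})


# Illustrative volumes only: one cluster unchanged, one cluster absent when infected.
vimentin = [(1.0, 1.0), (None, 0.8)]
# Illustrative volumes only: one form larger when infected, another form smaller.
tubulin_beta = [(2.0, 1.0), (0.5, 1.5)]

print(classify_protein(vimentin))
print(classify_protein(tubulin_beta))
```

Returning a list of labels rather than a single label lets a protein with several forms fall into both categories, which is how a protein can appear in both halves of Table 6.1.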
The spots matched to actin beta are displayed in Figure 6.9. The pattern of spots indi-
cates that particular forms of actin beta are less abundant during infection, or it may reflect
the fact that the total volume of all spots is reduced, and certain spots cannot be viewed at
very low volumes. Both vimentin and actin are implicated in cytoskeletal development, and
may be down-regulated because Toxoplasma arrests the host’s cell cycle. Tubulin beta and
heterogeneous ribonucleoprotein F appear in both halves of the table because some forms of
the proteins appear in greater volumes in infected samples, and other forms in non-infected
samples. Therefore, there may be a different type of modification that causes spots to shift
positions on the 2-D gel, and it is not possible to state simply whether the proteins are up
or down-regulated.
Up-regulated proteins
Three genes (HS70kDa protein, protein disulphide isomerase and thioredoxin peroxidase)
are strongly up-regulated in the Blader study at 24 hours, and the corresponding proteins
are also up-regulated in this investigation. HS70kDa is a heat shock protein that is released
Figure 6.9: Spots matched to actin beta from infected (top) and non-infected (bottom) samples. Gel images courtesy of M. Nelson.
when the cell is placed under stress, therefore it may represent a host cell response to infec-
tion. Thioredoxin peroxidase is implicated in oxidative stress and regulation of transcription
factors, and may also be a sign of a host cell response.
The comparison data reveals that both phospholipase C alpha (PCA) and protein disul-
phide isomerase (PDI) are predicted to match the same microarray clone, annotated as a
“glucose regulated protein” (accession R33030). The 2-D gel data also reveals that spots
containing phospholipase C alpha are also predicted to contain PDI, based on MS results.
Further analysis reveals that GenBank contains exactly the same protein sequence for both
PCA (BAA03759) and PDI (JC5704). The Harvester database contains a different, unrelated
protein sequence for PCA (Harvester ID Q15111), but PDI has the same protein sequence in
Harvester and GenBank. This indicates that the PCA record in GenBank contains an incor-
rect protein sequence. It appears that both the proteomics and microarray data agree that
PDI is up-regulated in response to parasite infection. PDI functions to rearrange disulphide
bonds in proteins, and the up-regulation may be due to a general increase in proteins that
must be produced during infection. PCA may not be implicated in this study at all, and
if it is incorrectly annotated in GenBank, the record should be updated. The public access
part of RAPAD, described in Chapter 5, will allow other databases to connect to RAPAD
when the proteome data has been published.
The proteome studies reveal that Annexin-1 is up-regulated during infection at the 24
hour time point. It is interesting to note that the gene expression studies suggest that
Annexin is up-regulated early, and then down-regulated later. This would suggest that
there is a large lag between changes in gene expression and the production of new protein,
however much larger data sets would be required to confirm and quantify this hypothesis.
The record in the SOURCE database [78] for Annexin suggests that it is involved with
exocytosis, membrane fusion and an anti-inflammatory response. The Swiss-Prot database
specifies that Annexin can be phosphorylated, leading to inactivation. The 2-DE data reveals
two adjacent spots that may be the result of differentially phosphorylated forms, which have
been further investigated (Section 6.3.4).
Down-regulated proteins
There are eight proteins that have been classified as down-regulated in the protein investiga-
tion and which have been matched to microarray data points. The apparent down-regulation
of the proteins actin beta and vimentin has been discussed above. The gene expression data
suggest that vimentin is down-regulated as expected, but the results for actin beta are less
clear, although on average the gene for actin beta seems to be down-regulated. The function
of Cathepsin B was discussed in Section 6.3.2, and it appears that the microarray data sug-
gest the gene is slightly down-regulated early in infection, and very strongly down-regulated
late in infection. AHNAK (Desmoyokin) appears to be down-regulated in both the proteome
and microarray investigation. It is believed to have various roles, including signal transduc-
tion and regulation. The GenAtlas entry suggests that AHNAK plays “a regulatory role of
the actin-bound cytoskeleton to the l-type Ca2+ channel”, which would suggest that it may
be down-regulated as part of the inhibition of cell cycle and cytoskeletal development, caused
by the parasite.
There are several forms of the protein heterogeneous ribonucleoprotein F that appear in
higher volumes in infected samples, but other forms appear in lower quantities. The mi-
croarray experiments suggest that the gene is strongly up-regulated. The protein is involved
in RNA processing. It would be expected that more genes are expressed when the cell is
under stress, such as during infection. A general increase in gene expression should correlate
with higher RNA processing, and we might expect that heterogeneous ribonucleoprotein F
would be up-regulated. The finding that there are different variants of this protein may
suggest that an activated form of the protein is present in much higher volumes in infected
samples, and spots that are larger in non-infected cells correspond with a de-activated form
of the protein. Additional investigations into the PTMs of the protein would be required to
confirm this hypothesis.
The protein for dimethyl arginine dimethyl aminohydrolase appears in lower abundance
during infection in the proteomic study but the microarray data suggest that the gene has
fairly stable expression until the 24 hour time point, at which it is strongly up-regulated.
This protein has a catalytic role associated with the generation of nitric oxide.
While nitric oxide is used by macrophage cells to kill engulfed pathogens, nitric oxide is
unlikely to be used in this way in an HFF cell line. It is therefore difficult to hypothesise as
to why the protein appears in lower abundance. The protein superoxide dismutase exhibits
unusual results in this study, and is discussed below.
Superoxide dismutase
In general, there appears to be a reasonable correspondence between gene expression and
protein abundance, because most proteins that are found to be up-regulated in the proteome
Figure 6.10: The top images display the spot identified as superoxide dismutase chain A from the non-infected sample, replicate 11 (left) versus infected (right). The lower image displays superoxide dismutase. A polygon has been drawn on top of the image to display the likely position of the protein in the second gel. Gel images courtesy of M. Nelson.
study have a corresponding gene expression value that is greater than one. In addition, most
of the proteins that are down-regulated have a corresponding gene expression value of less
than one. The one clear exception is superoxide dismutase, which is down-regulated in the
proteome, but strongly up-regulated in the microarray study. The Gene Ontology classifies
the protein as released in response to oxidative stress, which we would predict to be greater
during parasite invasion, therefore the result from 2-DE is surprising.
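
The comparison rule used here (a ratio greater than one for up-regulated proteins, less than one for down-regulated proteins) can be sketched as a simple check. The sketch uses only the rows of Table 6.1 that have a 24 hour time course value and applies a naive threshold at exactly 1 with no allowance for measurement error; the variable and function names are hypothetical. Besides superoxide dismutase, such a threshold also flags Annexin-1 and dimethyl arginine dimethyl aminohydrolase, both of which are discussed individually in the text above.

```python
# (protein, direction on 2-D gels, 24 h time course ratio) taken from Table 6.1.
table_6_1 = [
    ("Annexin-1", "up", 0.79),
    ("Nucleoside diphosphate kinase 1", "up", 2.55),
    ("Villin 2", "up", 2.50),
    ("Actin beta", "down", 0.44),
    ("Cathepsin B", "down", 0.47),
    ("Dimethyl arginine dimethyl aminohydrolase", "down", 2.02),
    ("Superoxide dismutase", "down", 1.69),
    ("Vimentin", "down", 0.43),
]


def gene_direction(ratio):
    """Direction of change implied by an infected / non-infected ratio."""
    if ratio > 1:
        return "up"
    if ratio < 1:
        return "down"
    return "unchanged"


# Proteins whose gel result disagrees with the direction of the gene ratio.
discordant = [
    name for name, protein_direction, ratio in table_6_1
    if gene_direction(ratio) != protein_direction
]
concordant_count = len(table_6_1) - len(discordant)

print(discordant)
print(f"{concordant_count} of {len(table_6_1)} rows agree in direction")
```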
There are two spots on 2-D gels from non-infected samples, one predicted to match “su-
peroxide dismutase chain A” and another matching “superoxide dismutase”. The automated
comparison predicts that only the latter protein matches the microarray result in the Blader
study. A local alignment of the two protein sequences reveals that they have very low ho-
mology (35% similarity, alignment not shown), indicating that they are not highly related
proteins, even though they have similar names (GenBank accessions gi|515251 and gi|34711).
The diagram in Figure 6.10 displays the positions of the spots on the gels from infected and
non-infected samples. The top image displays superoxide dismutase chain A and the lower
image shows the position of superoxide dismutase, from infected (right) and non-infected
(left) samples. The microarray results demonstrate very strong up-regulation of superoxide
dismutase during infection. It is likely that in the proteome study “superoxide dismutase
chain A” is a different protein, and is not strongly down-regulated. Therefore, considering
only the lower image on Figure 6.10 (superoxide dismutase), there is no clear spot, or only
a spot with a far lower volume, in the infected sample. This result is surprising given the
suggested role of the protein, therefore further analysis is required to verify that superoxide
dismutase is down-regulated during infection in the proteome but up-regulated in the tran-
scriptome. If this proved to be correct, this would demonstrate strong post-transcriptional
control regulating protein abundance, because a large increase in gene expression does not
appear to produce a corresponding change in protein abundance.
In summary, the results of the comparison between microarray and proteomics highlight
the potential for discovery of the relationship between gene expression and protein abun-
dance when larger data sets are assembled. The study reveals several proteins that correlate
well with gene expression values. The examples presented in this section demonstrate that
information about the proteins’ functions can be assimilated easily within RAPAD, due to
the number of links to external databases which are provided.
Figure 6.11: Four spots containing protein disulphide isomerase. The pattern of spots is indicative of different phosphorylated forms of the protein. Gel image courtesy of M. Nelson.
6.3.4 Post-translational modifications
The database query facility and the Gel Viewer enable researchers to find proteins that lo-
calise to the same region on the gel, and share the same name. This can highlight potential
post-translational modifications for further enquiry. An example is shown in Figure 6.11
of four spots matched to protein disulphide isomerase, a protein that catalyses the
rearrangement of disulphide bonds in proteins. The pattern of several spots in a horizontal line
is characteristic of different phosphorylation states, although other types of variable modifi-
cations can produce clusters of spots. It is also possible that differential splicing occurs to
produce various different protein sequences from a single gene.
Mass spectrometry data were used primarily to identify proteins. However, the MS data
were also searched a second time to find variable modifications on the proteins.
The MASCOT software has an option to search for different types of modification, such
as phosphorylation and acetylation, to test whether the mass of each detected peptide
matches a peptide sequence more closely if one of the residues carries a particular modification.
The search was implemented for clusters of spots that match the same protein (Vimentin,
PDI and Annexin). However, the searches revealed little information about modifications.
Start - End  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Sequence         Modifications
 63 -  73    1191.58   1190.57   1190.59   -0.02  0     LAPEYEAAATR
 95 - 104    1084.56   1083.55   1083.56   -0.01  0     YGVSGYPTLK
108 - 119    1236.51   1235.51   1235.51    0.00  0     DGEEAGAYDGPR
259 - 271    1619.78   1618.77   1618.78   -0.01  0     DLLIAYYDVDYEK
336 - 344    1188.53   1187.52   1187.53   -0.01  0     FVMQEEFSR        Oxidation (M)
336 - 347    1652.70   1651.70   1651.66    0.04  1     FVMQEEFSRDGK     Acetyl (K); Acetyl (N-term); Oxidation (M); Phospho (ST)
352 - 362    1359.66   1358.65   1358.65   -0.00  0     FLQDYFDGNLK
352 - 363    1515.75   1514.74   1514.75   -0.01  1     FLQDYFDGNLKR
434 - 448    1680.75   1679.75   1679.75   -0.00  0     MDATANDVPSPYEVR  Oxidation (M)
449 - 460    1341.68   1340.68   1340.68    0.00  0     GFPTIYFSPANK
449 - 461    1469.75   1468.74   1468.77   -0.03  1     GFPTIYFSPANKK
472 - 482    1370.69   1369.68   1369.69   -0.01  0     ELSDFISYLQR
Figure 6.12: The result of a search for potential post-translational modification of protein disulphide isomerase, revealing a peptide that may be acetylated and phosphorylated. The oxidations are caused experimentally and are not biologically relevant.
There are several possible reasons. Firstly, MS usually detects only a proportion (10-40%)
of the peptides in a protein. Therefore, in many cases the modification is
to a peptide that is not detected by MS. Secondly, it is believed that peptides with certain
modifications do not ionise well, and are therefore less likely to be detected than peptides
without additional modifications. Finally, it is possible that the cluster of spots is the result
of several different translations of the same gene, to produce a set of proteins that contain
peptides that still match the sequence entry in the database. The searches revealed a single
possible modification to the PDI spot at the furthest right position in Figure 6.11, which is
predicted to have been acetylated and phosphorylated (Figure 6.12). A phosphorylation to
a protein could be confirmed by a labelling experiment to quantify the number of phosphate
residues per protein, in each spot. There are facilities in RAPAD for the storage and querying
of PTMs after they have been confirmed, as described in the previous chapter.
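
The mass arithmetic that underlies a variable-modification search of the kind shown in Figure 6.12 can be sketched as follows. This is not the MASCOT algorithm; it only shows how a calculated peptide mass changes when modifications are added. It assumes standard monoisotopic residue and modification masses, singly protonated ions, and one occurrence of each modification listed for the peptide at residues 336-347. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the amino acids in the peptides below.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "E": 129.04259, "F": 147.06841,
    "G": 57.02146, "K": 128.09496, "L": 113.08406, "M": 131.04049,
    "P": 97.05276, "Q": 128.05858, "R": 156.10111, "S": 87.03203,
    "T": 101.04768, "V": 99.06841, "Y": 163.06333,
}
WATER = 18.01056    # added once per peptide for the free termini
PROTON = 1.00728    # removed from the observed m/z of a singly charged ion

# Monoisotopic mass shifts (Da) of the modifications named in Figure 6.12.
MOD_MASS = {"Oxidation": 15.99491, "Acetyl": 42.01057, "Phospho": 79.96633}


def peptide_mass(sequence, mods=()):
    """Calculated neutral mass of a peptide carrying the given modifications."""
    residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residues + WATER + sum(MOD_MASS[m] for m in mods)


def delta(observed_mz, sequence, mods=()):
    """Experimental minus calculated mass, assuming a singly protonated ion."""
    return (observed_mz - PROTON) - peptide_mass(sequence, mods)


mods_336_347 = ["Acetyl", "Acetyl", "Oxidation", "Phospho"]
print(round(peptide_mass("LAPEYEAAATR"), 2))
print(round(peptide_mass("FVMQEEFSR", ["Oxidation"]), 2))
print(round(peptide_mass("FVMQEEFSRDGK", mods_336_347), 2))
print(round(delta(1652.70, "FVMQEEFSRDGK", mods_336_347), 2))
```

Under these assumptions the calculated masses agree with the Mr(calc) column of Figure 6.12 to two decimal places, which is consistent with the peptide at residues 336-347 carrying two acetyl groups, one oxidation and one phosphate.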
6.3.5 Public access to data
An interface has been created that will allow public access to the proteomic data to ac-
company a future journal publication. The opening page loads a general description of the
experiment, a summary of all the gels, a listing of the number of proteins identified on each
gel, and links to the protocols for the protein solubilisation, first and second dimension sep-
aration, staining and scanning (Figure 6.13). There is an option to select particular gels,
and view a table containing the proteins that have been identified. The second page allows
users to select particular proteins, and open the Gel Viewer highlighting the proteins, with
different gels appearing in separate tabbed windows. The security of data is ensured because
a check is made before loading each page that data has been specified as publicly accessible
for every gel and protein (every database table has the attribute OTHER READ which is set
from 0 to 1 for public data). At the time of writing, the researchers do not wish to release
the data until it has been published elsewhere, therefore the URL for this part of the inter-
face will accompany the publication of the data. The query interface that forms part of the
RAPAD Study-Annotator, described in the previous chapter, will be linked to the publicly
accessible data sets. This will allow other researchers to verify the findings, and opens the
possibility for new discoveries by allowing complex queries of the data.
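
The public access check described in this section can be sketched with an in-memory SQLite database. The table name `gel`, its columns and the example rows are hypothetical, and `other_read` is used here as an assumed column spelling of the OTHER READ attribute; the real RAPAD schema and database engine may differ.

```python
import sqlite3

# Hypothetical stand-in for one RAPAD table; not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gel ("
    "gel_id INTEGER PRIMARY KEY, name TEXT, other_read INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO gel (name, other_read) VALUES (?, ?)",
    [("infected replicate 1", 1), ("non-infected replicate 1", 0)],
)


def public_gels(connection):
    """Return only the gels whose other_read flag has been set to 1."""
    rows = connection.execute(
        "SELECT name FROM gel WHERE other_read = 1 ORDER BY gel_id"
    )
    return [name for (name,) in rows]


print(public_gels(conn))
```

Filtering inside the query, rather than after the rows have been loaded, means that unpublished records never reach the page that renders them.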
6.4 Discussion
The experiments described in this chapter present challenges due to the size and complexity
of the data. One of the major challenges is the requirement for summarising data across
replicates, and determining if proteins are differentially expressed during parasite invasion.
The biological goals were to investigate if proteomics experiments confirmed or conflicted
with previous hypotheses regarding the mechanism of parasite invasion, and the continued
survival of the parasites in host cell culture. This has been facilitated by the development of
software for matching spots between gels, and visualising differentially expressed proteins.
The Gel Viewer enables multiple gels to be loaded concurrently, with controls for zooming,
search facilities for highlighting particular spots and links to more detailed information in
the database. There are also query facilities for finding particular proteins in the database
and for summarising all the data across replicates. Software has been written to connect
RAPAD to a number of external databases and analysis applications that summarise func-
tional classifications, to find which classes of proteins change in expression during parasite
invasion. The ability to connect to external software demonstrates the flexibility of RAPAD
which is due to the extensive use of ontologies. External database entries are stored in a
generic table (DatabaseEntry), with a record stored in the OntologyEntry table that has
sufficient information for capturing how the link to the external database should be imple-
mented, capturing the database’s URL and version. Therefore, external links to any web
accessible database can be provided.
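
The generic linking mechanism described above can be sketched as two small record types. Only the table names DatabaseEntry and OntologyEntry and the fact that the URL and version are stored come from the text; the field names, the `{accession}` placeholder convention and the example.org address are assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass
class OntologyEntry:
    """Describes how to reach one external database (hypothetical fields)."""
    database_name: str
    version: str
    url_template: str  # assumed convention: URL with an {accession} placeholder


@dataclass
class DatabaseEntry:
    """One external record attached to a protein (hypothetical fields)."""
    accession: str
    source: OntologyEntry


def external_link(entry):
    """Build the hyperlink for an entry from its source database template."""
    return entry.source.url_template.format(accession=quote(entry.accession))


# Placeholder address; a real deployment would store each database's own URL.
example_db = OntologyEntry(
    database_name="ExampleDB",
    version="1.0",
    url_template="https://example.org/protein/{accession}",
)
print(external_link(DatabaseEntry("BAA03759", example_db)))
```

Because the link format is stored as data rather than code, supporting a further web accessible database only requires a new OntologyEntry record.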
This investigation allows the general functionality of RAPAD to be assessed in a genuine
research environment. It is common for proteomics investigations to require the display of
differential expression of proteins, and links to external Internet accessible databases. The
data set described in this chapter is fairly large (14 gels, 350 identified proteins), and in the
following chapter there is a description of a different study in which a further 1000 protein
identifications are stored in RAPAD. This demonstrates that RAPAD can scale up to manage
Figure 6.13: A summary page displays all the gels present in the experiment, and a link exists to display the experimental protocols used for each gel.
substantial data sets, and allows them to be queried. The interface code and database schema
are freely available, therefore other developers can re-create their own version of RAPAD to
support a variety of proteome studies.
The integration between the locally generated proteomics data and the previously published
microarray studies was a critical requirement of the project. The results demonstrate
the viability of the approach; however, there are currently few protein records that
are matched to microarray data points. This appears to be a reflection of the proportion of
records present in both studies, rather than a flaw in the methodology. The microarrays used
by Blader contained from 18,000 to 27,000 clones. However, the results were only reported
for those genes that showed a 2-fold difference in fluorescence between scans generated from
infected and non-infected samples, corresponding to approximately 1800 microarray results.
In the proteome study, 130 distinct proteins displaying differential expression have been iden-
tified, of which 14 have been matched to a clone in the microarray study. This is about 1 in
9 proteins that match a microarray clone. It would be expected by chance that a minimum
of 1 in 15 proteins identified by 2-DE should match a clone in the microarray results (1800
results from 27000 clones = 1/15). In reality, we would expect a far higher number to corre-
spond in the two studies because protein spots have been selected for analysis if they appear
in different volumes across the two conditions. It is assumed that if a protein is produced in
much greater abundance, there would be a corresponding increase in the mRNA levels that
would be detected by microarray analysis. Therefore, it might be predicted that most of the
proteins identified in the proteome study should appear in the Blader results.
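
The proportions quoted in this paragraph can be reproduced with a few lines of arithmetic. All four input figures come from the text; the variable names are hypothetical, and the upper figure of 27,000 clones is used, as in the text.

```python
differential_proteins = 130   # distinct differentially expressed proteins
matched_to_clone = 14         # of these, matched to a microarray clone
reported_results = 1800       # microarray results with a 2-fold difference
clones_on_array = 27000       # upper estimate of clones on the arrays

observed_fraction = matched_to_clone / differential_proteins
chance_fraction = reported_results / clones_on_array
expected_by_chance = differential_proteins * chance_fraction

print(f"observed: 1 in {1 / observed_fraction:.1f}")
print(f"chance:   1 in {1 / chance_fraction:.0f}")
print(f"matches expected by chance: {expected_by_chance:.1f}")
```

On these figures, about 8.7 of the 130 proteins would be expected to match a reported clone by chance alone, compared with the 14 matches that were observed.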
There are a large number of the 130 proteins found in this study that do not match any
differentially expressed genes in the microarray study. It is possible that the Blader study
did not have complete coverage of all genes, but the majority of genes were assayed and were
found to have stable gene expression between infected and non-infected samples. Therefore,
the differentially expressed proteins that do not match anything in the Blader study are of
interest, because they demonstrate that there may be post-transcriptional control in response
to parasite invasion. In other words, many proteins are produced in greater or lesser volume
during infection that do not have a measurable difference in their mRNA levels. It is possible
that certain proteins that are required for infectivity would not be highlighted from a gene
expression experiment, and in that case the mechanisms for infectivity cannot be studied
using microarrays only. This finding demonstrates the viability of the 2-DE and MS approach
for hypothesis formation, and it is likely that it will continue to grow as a technology for
functional genomics analysis.
6.5 Summary and conclusions
In this chapter software has been described that enables clustering and visualising spots on
replicate gels that contain the same form of a protein, and the spots that contain variant
forms. This has enabled potential post-translational modifications to be identified for fur-
ther study. When PTMs have been confirmed, RAPAD has facilities for their storage and
querying. The results suggest that different forms of proteins exist in infected and non-
infected samples, although the exact types of the modifications have yet to be confirmed.
The data sets will continue to grow rapidly, and it will be vital to combine information about
modifications with the relative expression values measured by microarrays, 2-DE and other
technologies. RAPAD provides a framework in which this kind of data integration can take
place on a large scale, and it will serve as a repository for the publication of data to accom-
pany journal articles. It is planned that the data from the experiments with Toxoplasma
will be published at some point in the future. RAPAD will provide public access to the
data, using the interface described in the previous chapter, to allow researchers accessing the
article to query the proteome data.
A common type of proteome investigation is the search for differentially expressed pro-
teins, using 2-DE, image analysis and mass spectrometry. The RAPAD system has been
extended to support the experiments presented in this chapter, which compare a human cell
line, invaded with Toxoplasma gondii, with non-invaded cells. RAPAD specifically facilitates
the identification of differential expression by providing a visualisation of clusters of spots
that have been matched to the same protein across a series of replicates. Following the
identification of proteins, a large amount of information must be assimilated from diverse
databases to characterise the proteins. Every protein record in RAPAD has hyperlinks to sev-
eral other databases, using the GenBank identifier or the corresponding gene symbol, which
were obtained for each protein using scripts written by the author. Additional tools were
used to summarise the functions of proteins from the Gene Ontology. An approach has been
presented for matching differentially expressed proteins to the corresponding results from a
previously published microarray experiment. The results of the matching demonstrate some
correspondence between genes that are up-regulated during infection and increased protein
abundance on 2-D gels, but the data sets are not currently large enough to quantify the cor-
relation. The software can be re-used when data sets are larger for determining the global
rate of transcription and translation.
The following chapter outlines a project with a different parasite, Trypanosoma brucei.
RAPAD assists an investigation to catalogue all the proteins that can be found using a
gel-based approach, to improve the functional annotation of the genome, and determine the
dynamic nature of the proteome.
Chapter 7
Software support for a proteome
map of Trypanosoma brucei
7.1 Introduction
The previous chapter focused on the use of proteomics techniques to find differentially ex-
pressed proteins that allow for the formation of new hypotheses about the function of a
system. This chapter outlines the use of proteomics in a different context, where it is used
for cataloguing information about protein expression, to improve the functional annotation
of genes and the search for post-translational modifications. The RAPAD database sup-
ports a proteome map of the parasite Trypanosoma brucei that causes sleeping sickness in
Africa. The genome sequence of T. brucei is nearing completion, and many open
reading frames have been accurately predicted from it, but the functional annotation of the genes
is generally poor. There are many genes that have only been tentatively identified and have
no functional assignment. The proteome data is able to confirm the existence of genes that
encode proteins expressed in the cell line and provide insights into the dynamic nature of
proteins in terms of modifications, and different isoforms that exist. Additional software
has been written to provide a novel visualisation of proteins identified by mass spectrome-
try, and to summarise information within a substantial data set. The analysis presented in
this chapter will improve the naming of certain genes, and provides a potential functional
assignment for several proteins.
7.1.1 The biology of trypanosomes
Trypanosoma brucei is a eukaryotic parasite that causes sleeping sickness in sub-Saharan
Africa, and there have been a number of recent epidemics [294]. Trypanosomes live in
the bloodstream and tissue fluids of mammals, causing a variety of diseases in livestock and
Figure 7.1: The life cycle of Trypanosoma brucei, from DPDx - CDC Parasitology Diagnostic web site, http://www.dpd.cdc.gov/dpdx/HTML/TrypanosomiasisAfrican.asp
mortality in humans. They are transmitted by tsetse flies, and it is predicted that more than
half a billion people live in affected areas, with hundreds of thousands of new cases per year
[26]. The expected outcome, in the absence of chemotherapy, is death. Anti-trypanosomal
drugs have been developed, although drugs are not 100% effective, and resistant strains are
now arising [301].
The prospects for the development of a vaccine are very slim because the parasite evades
the immune response through the process of antigenic variation, first reported by Vickerman
in the 1960s [335]. A set of proteins, known as variant surface glycoproteins (VSG), form
a dense outer layer around the parasite, protecting against recognition from the immune
system. There is one locus from which a single VSG gene is activated at any one time,
with approximately 1000 other VSG genes distributed in different, silenced positions. At
intervals, a rearrangement of the genes occurs, switching the gene that is positioned in the
activated locus. A different protein becomes expressed, forming a new surface coat that will
not be recognised by the immune system (the mechanisms of gene switching are reviewed by
Barry 1997 [27]).
Trypanosomes undergo a complex developmental cycle that is simplified in Figure 7.1.
Figure 7.2: An electron micrograph of the bloodstream form of Trypanosoma brucei, from http://www.ulb.ac.be/sciences/biodic/ImProto0003.html
The regulation of the life-cycle is poorly understood despite its obvious importance to the
parasite. When a fly takes a blood meal from an infected mammalian host, bloodstream
forms (Figure 7.2) differentiate to the procyclic stage of the life cycle in the gut of the fly,
accompanied by alterations in metabolism and morphology caused by changes in expression
of an unknown number of proteins. It is vital these proteins are identified given the severity
of the disease and the unusual biology of trypanosomes, which is discussed in more detail
below. It is also possible that proteins involved in regulating the life-cycle may prove to be
viable drug targets.
7.1.2 Annotating the genome
The genome sequence of T. brucei is nearing completion and the sequence of chromosomes I
and II was reported in 2003 [146, 87]. The genome contains 11 chromosomes in total, and is
27 Megabases in length. Currently, 5500 coding sequences have been conclusively identified
(March 2004) [127], and it is expected that the total gene number will be about 8000. Efforts
are now underway to determine the function of all genes, with particular focus on genes that
cause drug resistance, genes that enable the parasite to evade the immune response and the
proteins that are up-regulated during infection of mammals. Trypanosoma brucei belongs
to a small class of unicellular organisms, the kinetoplastids, which exhibit highly unusual
regulation of gene expression. It seems that these organisms do not regulate transcription
by RNA polymerase II, and large numbers of genes appear to be regulated from a single
transcriptional initiation point. The genes lie adjacent to each other in long runs, interspersed
with almost no introns, similar to bacterial operons [60]. However, unlike operons, the
genes do not encode similar proteins that would be expected to be under a single control
mechanism, but instead contain seemingly unrelated genes. It will therefore be interesting
to discover what functional genomics (FG) experiments can demonstrate about how genetic
regulation is performed in these parasites. Microarray analysis would be expected to reveal
unusual results because transcriptional control may occur only through regulation of the
rate of degradation of mRNA, or the rate of splicing. Therefore, the abundance of mRNA
may have different patterns from organisms with conventional gene regulation. Proteomics
studies aim to determine the level of expressed proteins and therefore may prove vital in
elucidating how post-transcriptional control is exerted.
It is essential that the functional annotation of the T. brucei genome is improved rapidly,
and made widely available, to facilitate the search for new drugs to control sleeping sickness.
There are also several related species that cause serious diseases. One of the closest relatives is
Trypanosoma cruzi that causes Chagas disease in South America. The parasite is transmitted
by triatomine bugs, infects mainly cardiovascular and autonomic nervous tissues, and is fatal
in about a third of all cases [53]. There are several members of the genus Leishmania, which
cause a variety of life-threatening diseases in the third world. Genome sequencing is under
way for T. cruzi and Leishmania major. Comparative genome studies must be performed
to ensure that any gene annotations for closely related species can be related back to newly
sequenced genes in other organisms.
7.1.3 Database support
RAPAD is supporting a project to generate a catalogue of all the expressed proteins from T.
brucei, which can be separated by two dimensional gel electrophoresis (2-DE) and identified
by mass spectrometry (MS). The experiments are being performed by Anne Faldas and Prof.
Mike Turner in the Institute of Biomedical and Life Sciences at the University of Glasgow,
and the biological data in this chapter is reproduced with their permission.
Many of the 8000 genes in the genome are annotated as “hypothetical proteins” because
they have been identified solely by gene prediction algorithms. A naïve search of the genome
database, GeneDB [127], for the annotation “hypothetical AND protein” in T. brucei produces
a list of 11,999 entries, for which there is little or no further annotation. Several
entries must refer to the same underlying gene, but appear more than once in the database,
because this number is far larger than the expected total number of genes. Clearly, if a
protein is identified conclusively by mass spectrometry, the protein is a real sequence, and
is expressed under the conditions used to generate the sample. This information must feed
back to the genome curators to allow the annotation to change from “hypothetical protein”
to “confirmed protein”. If homologous sequences from other organisms have been found by
similarity searches, the functional assignment of the homologous sequence should also be
added as annotation (described in Section 7.3.2).
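The feedback step described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the RAPAD or GeneDB implementation: the function name, the gene identifiers, the data layout and the wording of the new annotation are all assumptions made for the example. The threshold of 0.05 mirrors the significance level used for the protein identifications.

```python
def propose_annotation_updates(annotations, ms_hits, homologues=None, alpha=0.05):
    """Suggest new annotations for hypothetical proteins confirmed by MS.

    annotations: dict mapping a gene identifier to its current annotation.
    ms_hits: iterable of (gene identifier, p-value) pairs from the MS search.
    homologues: optional dict mapping a gene identifier to the functional
        assignment of a homologous sequence (hypothetical input).
    """
    homologues = homologues or {}
    updates = {}
    for gene_id, p_value in ms_hits:
        if p_value >= alpha:
            continue  # identification is not significant
        if annotations.get(gene_id) != "hypothetical protein":
            continue  # only hypothetical proteins are re-annotated
        new_annotation = "confirmed protein"
        if gene_id in homologues:
            new_annotation += "; similar to " + homologues[gene_id]
        updates[gene_id] = new_annotation
    return updates
```

The function returns proposals rather than changing the annotations in place, on the assumption that the genome curators would review each change before it is applied.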
RAPAD supports searching and filtering of proteome data, allowing complex Boolean
queries to mine specific information from large data sets. It is also important that protein
data arising from gels with different pH ranges is combined in an intuitive manner, requir-
ing the development of good visualisation tools. This facility in RAPAD was described in
the previous chapter. One of the most important parts of the analysis is to discover the
frequency of post-translational modifications, or other events, that cause multiple spots
matched to the same protein to appear on a gel. Many proteins appear in multiple copies at
different positions on the gel, indicating that some processing or alteration of proteins must
be occurring to change either the charge or mass. For example, 92 distinct spots contain a
tubulin protein (α or β), many of which appear near the base of the gel, indicating small
molecular weight proteins, and the spots are reproducible across replicate gels. This would
suggest that the spots contain only fragments of proteins, the result of degradation. Software
has been developed alongside RAPAD to investigate this phenomenon (Section 7.3.1).
7.1.4 Project status
The current status of the T. brucei data deposited in RAPAD is as follows (June 2004).
There are 955 proteins identified in total, which arise from 619 spots on three gels. The
number of proteins is higher than the number of spots because several different proteins
are frequently identified from a single spot. A database query reveals that 260 proteins
have distinct molecular weights, indicating that this is the approximate number of different
proteins that have been identified. The rest of the analysis has been performed on one single
master gel (pH range 4-7), which contains 879 distinct spots. On the master gel 753 protein
identifications have been made from 460 spots.
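The kind of summary reported above can be illustrated with a short Python sketch. It assumes, purely for the example, that each identification is available as a record with a spot identifier, a protein name and a molecular weight. These field names are hypothetical and do not reflect the RAPAD schema, in which the figures are obtained by database queries.

```python
from collections import defaultdict


def summarise_identifications(identifications):
    """Count identifications, spots and distinct molecular weights.

    identifications: list of dicts with the (hypothetical) keys
    'spot_id', 'protein' and 'mol_weight'.
    """
    proteins_per_spot = defaultdict(set)
    weights = set()
    for record in identifications:
        proteins_per_spot[record["spot_id"]].add(record["protein"])
        weights.add(record["mol_weight"])
    # A spot with more than one protein name explains why identifications
    # can outnumber spots.
    mixed_spots = sum(1 for names in proteins_per_spot.values() if len(names) > 1)
    return {
        "identifications": len(identifications),
        "spots": len(proteins_per_spot),
        "distinct_weights": len(weights),
        "spots_with_several_proteins": mixed_spots,
    }
```

Counting distinct molecular weights is only a rough proxy for the number of different proteins, because two unrelated proteins can share the same predicted weight.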
The rest of the chapter is structured as follows: the methods used to capture the project
requirements and to develop the software are discussed in Section 7.2. Section 7.3 describes
the results, in terms of how RAPAD supports the discovery of modifications and aids genome
annotation. An investigation into the causes of multiple spots arising for a single protein is
also described. Discussion is provided in Section 7.4.
7.2 Methods
7.2.1 Generation of samples for proteome analysis
One of the major problems of performing functional genomics analysis on trypanosomes is
the speed with which they evolve, and it has been reported that trypanosome lines can spon-
taneously change their phenotype as a result of laboratory manipulation (see for example
van Deursen et al. 2001 [329]). If researchers perform investigations to characterise the
gene or protein expression of trypanosomes, the results may only have relevance to the exact
laboratory strain on which the experiments were performed. To alleviate these problems, a
reference strain of T. brucei has been generated (TREU 927), which has been used for gen-
erating the genome sequence [329]. The strain has several properties that are representative
of trypanosomes in the wild, and it can be cultured in vitro. Proteins have been extracted
from procyclic forms of the TREU 927 line grown as an in vitro culture for the proteome
study in Glasgow. This is vital because DNA has also been extracted from this line for
microarrays that are being created. Therefore, it will be possible to compare data from
the genome, transcriptome and proteome in the future, and the proteome experiments can
directly contribute to improving the annotation of the genome. The proteomics experiments
in the database comprise three main gels which have been run over different pH ranges (4-7,
6-11, and 4.5-5.5) to achieve a high resolution of proteins. The experimental protocols for
protein solubilisation, the two dimensions of gel separation and staining are all stored in
RAPAD. The details of the experimental procedure are given below.
Procyclic forms of the genome reference strain TREU 927/4 were grown in SDM-79
with 10% foetal calf serum according to [330]. Parasites were purified by washing in PSG
buffer and centrifuged at 13,000g. Approximately 2 × 10^8 trypanosomes (650 µg protein) were
solubilised in 470µl lysis/rehydration buffer (9M urea, 2M thiourea, 2% CHAPS, 65mM DTT
and 0.5% IPG buffer pH4-7, trace bromophenol blue). A protease inhibitor cocktail (5µl,
Roche), at a concentration of 25µg/ml, and 10µl nucleases (2000 units/ml DNase, 1750
units/ml RNase A, 50mM MgCl2) were added to limit proteolysis and digest nucleic acids.
The sample was incubated at room temperature for 1 hour, vortexing every 10 minutes, then
freeze/thawed in liquid nitrogen. The sample (450µl) was loaded on to a 24 cm IPG strip
(Amersham) and isoelectric focusing was performed, reaching more than 70,000Vhrs.
The strips were equilibrated in 100mM DTT for 15 minutes followed by 15 minutes in
250mM α-iodoacetamide before being applied to a 12.5% precast SDS polyacrylamide gel.
Electrophoresis ran overnight at 15°C using the Amersham buffer kit. The gels were stained
using colloidal Coomassie dye and scanned using Image Master (Amersham). Replicate gels
were performed (ten replicates of pH 4-7, five replicates of 4.5-5.5 and 6-11) of which one was
selected for protein identification. The 2D Elite software (Amersham) was used to generate
a picklist, and the gel was transferred to the Amersham robotic workstation, each gel plug
digested with trypsin and mixed with a CHCA (α-cyano-4-hydroxy cinnamic acid) matrix,
and spotted on to a MALDI (Matrix Assisted Laser Desorption Ionisation) target plate. A
peptide sample and a gel plug were collected for each sample and stored at −20°C. Analysis
of the peptides was performed using MALDI-TOF (Time Of Flight) with a Voyager system
(Perseptive Biosystems) and tandem MS (AB Q-Star Pulsar). Tandem MS was used for the
majority of protein identifications (approximately 95%). Genome sequence information was
downloaded from GeneDB (Release 3) to a local database that was searched using MASCOT
software [207]. Proteins were positively identified at a significance value of P < 0.05 as
calculated by the software.
7.2.2 Project requirements capture
The first phase of developing an understanding of the problem area involved meetings with
the project leader and researchers working on trypanosomes. The current practice of man-
aging data was observed. This consists of the data from the project being stored in Excel
spreadsheets. Data was entered into the spreadsheet by manual copy and pasting from mass
spectrometry results and database searches that had been performed to characterise pro-
teins. Protein data in the Excel spreadsheet was related back to the spot on the 2-D gel
from which it arose, using the numerical identifier assigned to the spot by the image analysis
application, which was entered in the corresponding row of the table.
The project leader, Prof. Mike Turner, outlined a set of six questions that could poten-
tially be solved by improvements in software:
1. Can the time and labour to identify proteins be reduced?
2. How many different proteins can be identified from 2500 spots?
3. How widespread and common are post-translational modifications?
4. How can we improve the T. brucei genome annotation?
5. Can we build a “point and click” virtual 2D gel?
6. Can we build pages that give original MS data interpretations?
Figure 7.3: The span of peptides that have been matched within a protein sequence are represented by the shaded section of the block, for a cluster of four spots, explained in Section 7.2.3. (Labels in the figure: Protein unfolds during 2-DE; Digested into peptides; Peptides detected by MS; Peptide span of whole sequence; Folded protein.)
The issues of genome annotation and data integration were discussed in meetings with the
curators of the T. brucei genome database at the Sanger centre, Cambridge UK (December
2003). The web site providing access to the genome is GeneDB, which is supported by the
GUS database system. One of the main goals of the proteome project is to improve genome
annotation. Once the proteome namespace has been added to GUS, as discussed in Chapter
5, the proteomics data can be stored directly within GeneDB. However, it is important that
data produced from the experiments can be linked up with GeneDB in the near future, prior
to the full deployment of a new version of GUS that supports proteomics. Towards this
goal, a new interface has been developed as part of RAPAD for publishing data, with unique
identifiers that can be linked up with GeneDB, when the proteomics data is made public.
7.2.3 Visualisation
There are many different spots that have been identified as the same protein on the master
gel in this investigation. The database can be queried for a particular protein name, and
the results of the query can be visualised in the Gel Viewer. The Gel Viewer provides a link
from each spot to the record for the mass spectrometry results that were used to identify
the protein. However, there are limited facilities for investigating why so many different
spots arise that appear to match the same protein. Therefore, additional software has been
implemented alongside RAPAD for visualising the peptide sequences that have been matched
by MS data, to investigate why certain proteins appear in multiple positions on a 2-D gel.
A piece of text processing software has been written to extract the peptide sequences from
mass spectrometry results. The full length sequence has also been obtained for each protein,
and linked up to the Gel Viewer to provide a visualisation for every spot, displaying the
proportion of the protein sequence that has been matched: the span of peptide hits (Figure
7.3). Each spot is labelled with a white block representing the entire protein sequence, filled
with a shaded section. The left end of the shaded block represents the position of the first
peptide hit in the protein sequence, and the right end represents the last
position of the last hit to the protein sequence. From this information, it is possible to say
that at least this proportion of the protein sequence was present in the spot, assuming correct
identification from MS data. Peptides may not be detected by MS for several reasons: (i)
during MS/MS only a proportion of the peptides most strongly detected in the first stage
are subjected to the second stage of MS, (ii) ionisation is dependent on various properties of
a peptide, such as its charge and (iii) there is technical variability in the efficiency of peptide
ionisation.
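The calculation of the span of peptide hits can be sketched as follows. This is a minimal illustration in Python and not the software written for the project: the function names are hypothetical, peptides are located by exact string matching, and the block is drawn as text rather than overlaid on a gel image.

```python
def peptide_span(protein_seq, peptides):
    """Return (start, end, fraction) for the span of peptide hits.

    start is the position of the first matched residue and end is one past
    the last matched residue (0-based). fraction is the span length divided
    by the protein length. Peptides that do not occur in the sequence are
    ignored, and None is returned if nothing matches.
    """
    hits = []
    for peptide in peptides:
        position = protein_seq.find(peptide)
        if position != -1:
            hits.append((position, position + len(peptide)))
    if not hits:
        return None
    start = min(hit_start for hit_start, _ in hits)
    end = max(hit_end for _, hit_end in hits)
    return start, end, (end - start) / len(protein_seq)


def render_block(start, end, length, width=40):
    """Draw the block as text: '#' marks the shaded span, '.' the remainder."""
    left = start * width // length
    right = max(left + 1, end * width // length)
    return "." * left + "#" * (right - left) + "." * (width - right)
```

The span is a lower bound on the proportion of the sequence present in the spot only under the assumption that the matched peptides come from one contiguous stretch of the protein. Residues between the first and last hit are shaded even if no peptide was matched to them.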
The genome database contains several different genes that share the same name. An
additional visualisation has been created to summarise where these different forms of the
same protein arise on the master gel. A different colour is used to shade spots that have
been matched to peptides within a specific protein sequence in the database. In this way,
groups of proteins that have the same name but are in fact different, can easily be visualised
on the gel. This allows researchers to verify that clusters of proteins with the same name
have been identified correctly, because it is expected that proteins located in the same region
of the gel will arise from the same gene.
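The colour assignment can be sketched in Python as shown below. The palette, the function name and the representation of a spot as a tuple of spot identifier, protein name and sequence accession are assumptions made for illustration, and do not describe the data structures used in RAPAD.

```python
from itertools import cycle

# Hypothetical palette; colours repeat if there are more sequences than entries.
PALETTE = ["blue", "red", "yellow", "orange", "white"]


def colour_spots(spots, protein_name):
    """Map each spot matched to protein_name to a colour per sequence.

    spots: iterable of (spot_id, protein_name, accession) tuples.
    Spots matched to the same database sequence share a colour, so that
    different genes with the same name can be told apart on the gel.
    If a spot matches several sequences with this name, the last one wins.
    """
    palette = cycle(PALETTE)
    colour_of_accession = {}
    colour_of_spot = {}
    for spot_id, name, accession in spots:
        if name != protein_name:
            continue
        if accession not in colour_of_accession:
            colour_of_accession[accession] = next(palette)
        colour_of_spot[spot_id] = colour_of_accession[accession]
    return colour_of_spot
```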
7.3 Results
7.3.1 Investigation into multiple protein forms
The proteomics experiments on T. brucei reveal several proteins that appear in multiple
positions on a single gel (the pH 4-7 gel); examples include Heat Shock Protein 70 (62 spots),
α-tubulin (50 spots), β-tubulin (40 spots), Elongation factors (EF 1-α, EF 1-β, EF 1-γ, EF
2; creating 37 spots in total) and Heat Shock Protein 60 (19 proteins). There are several
reasons why proteins may appear in multiple positions. Firstly, chemical modifications, such
as the gain or loss of phosphate groups on the protein, can cause multiple spots to appear in
a localised region. Secondly, a protein may also be fragmented at some point, either in vivo
or during the experimental procedure, therefore peptides measured by mass spectrometry
may not have arisen from the full protein sequence. Protein spots that arise near the bottom
of the gel, indicating low molecular weight proteins (described on page 7 in Chapter 1),
are more likely to contain only fragments of proteins. Thirdly, it is formally possible that
differential splicing causes different proteins to be produced from the same gene, which still
have peptides that match the protein entry in a sequence database, even if the full length
sequence of the protein is different from the predicted form. However, while differential
splicing in higher eukaryotes seems to be a very common phenomenon [215], it has never
been reported in T. brucei because almost all genes comprise a single exon and therefore
are not spliced at all. It is also possible that the proteins which seem to appear in multiple
copies are false positives, arising because the sequences have some characteristic that causes
many incorrect database matches.
Tubulin proteins
α and β-tubulin produce many spots on the 2-D gels for T. brucei, which could be the
result of protein modifications. α and β-tubulin form a heterodimer and are one of the main
components of microtubules that form a layer around the cytoplasm, just beneath the outer
cell membrane [140]. A study by Lubega and colleagues demonstrated that mice can be
immunised against African trypanosomosis by injection with tubulin proteins, raising the
possibility that tubulins could form part of a successful vaccine [197].
It has previously been demonstrated that post-translational modifications (PTMs) of
tubulin are associated with the construction of the cytoskeleton and fall into two categories:
general protein modifications, such as phosphorylation or acetylation, and tubulin-specific,
Figure 7.4: Protein spots matched to β-tubulin, overlaid with a graphic displaying the span of peptide hits (shaded block) as a proportion of the full length sequence (white block). The boxed regions (labelled 1 to 4) are discussed further in the text. Gel image courtesy of A. Faldas.
such as tyrosination. The acetylation of tubulin has previously been identified by 2-DE,
therefore many of the spots observed in this study are likely to correspond to differentially
modified forms of the protein (original experiments are reviewed by Gull [140]).
β-tubulin
The results from the peptide alignment analysis with β-tubulin are displayed in Figure 7.4.
The main cluster of proteins (1 on Figure 7.4) towards the top of the gel is in the position
that would be predicted by the molecular weight of β-tubulin (50 kDa). It is likely that there
are several different types of chemical modifications that occur to β-tubulin, causing the 16
different spots to appear in this region. The spots at the bottom left of the gel (4) have fairly
short spans of peptide hits (less than 10% of the full sequence), therefore are more likely to
be caused by peptide fragments. In the bottom middle range of the gel there are two spots
(3) both with peptides matching a range in the middle of the protein sequence, indicating
these two are caused by two similar protein fragments, possibly with a single modification
causing a localised shift in position.
There is a cluster of several spots in the middle/left of the image (2 on Figure 7.4),
which appear to have very long peptide spans (up to 80%). This result is surprising because
it would not be expected that the full length protein sequence for tubulin would migrate this
far into the gel. Therefore, it is theoretically possible that this protein arises from differential
splicing of gene products to produce a protein that has peptide sequences from the two ends
of the original sequence. It is also possible that the spots contain a different protein that
has peptides that closely match parts of the β-tubulin sequence. However, a BLAST [11]
search of GeneDB with the peptides from these regions reveals that there are no similar
sequences except the other tubulin proteins (BLAST results not shown). The MS data for
the close groupings of three spots (spot ID 677, 664 and 641) have very high MASCOT scores,
indicating that the matches are probably correct, with strong hits to peptides near the start
of the sequence, and other matches to peptides near the end of the protein sequence. GeneDB
contains a cluster of identical genes on chromosome 1, annotated as β-tubulin, although the
exact number of genes is not known because it varies in different cell lines. It is also very
difficult to assemble regions of the genome that contain repetitive identical sequences. There
are no gene sequences deposited in GeneDB that could explain the long span of peptide hits
of this spot cluster.
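The reasoning used to single out such spots can be expressed as a simple check, sketched here in Python. The sketch rests on two crude assumptions that are not part of the analysis above: that the mass of a fragment is proportional to its length, and that the observed mass of a spot can be read from its vertical position on the gel. The function name, the tolerance and the input layout are hypothetical.

```python
def flag_inconsistent_spots(spots, predicted_mw_kda, tolerance=0.2):
    """Return the spots whose peptide span is too long for their position.

    spots: iterable of (spot_id, span_fraction, observed_mw_kda) tuples,
        where span_fraction is the span of peptide hits as a proportion
        of the full length sequence.
    predicted_mw_kda: molecular weight of the full length protein.
    tolerance: relative error allowed on the observed mass (an assumption).
    """
    flagged = []
    for spot_id, span_fraction, observed_mw_kda in spots:
        # Smallest mass of a single contiguous fragment covering the span.
        minimum_mass = span_fraction * predicted_mw_kda
        if minimum_mass > observed_mw_kda * (1 + tolerance):
            flagged.append(spot_id)
    return flagged
```

A spot is flagged when a single contiguous fragment covering the whole span would be heavier than the mass suggested by the position of the spot, which is the situation described for the cluster with long peptide spans. Flagged spots would still need the manual inspection of MS scores and sequence searches described in the text.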
A further observation on Figure 7.4 is that the peptides matched tend to cluster at the
N-terminus (left end) and there are no peptides matched to the C-terminus (right end) of
protein sequences. This raises the possibility that there is cleavage of a peptide at the C-
terminus. Alternatively, it is possible that there are modifications that prevent peptides
being ionised in a mass spectrometer. In particular, it is known that the C-terminus of
β-tubulin is extensively glutamylated, which is the addition of up to 20 extra glutamate
residues to a defined glutamate near the C-terminus [283]. This may prevent the peptides
at the C-terminus from being detected by MS.
Figure 7.5: Protein spots matched to α-tubulin, overlaid with a graphic displaying the span of peptide hits. There is a correlation between the span of peptide hits and the position of a spot on the gel. The regions labelled 1 to 5 are referred to in the text. Gel image courtesy of A. Faldas.
α-tubulin
Figure 7.5 displays the peptide spans for α-tubulin. The cluster of six spots towards the top
of the gel (1) are in the position that would be expected for a protein with the molecular
mass of α-tubulin (50 kDa) and therefore probably contain the full length sequence. There
is a cluster of spots presumably caused by various small modifications to the protein, which
account for the localised shifts in positions. The genome contains a cluster of identical α-
tubulin sequences on chromosome 1, therefore the different spot positions are not due to
differences in gene sequence.
At the bottom of the gel there are a large number of possible fragments, and there appears
to be a fairly strong correlation between spots located in the same region and the span of
peptide hits (see for example 2, 3 and 4 on Figure 7.5). This would suggest that a fragment is
being produced reproducibly with one or two different modifications on the peptides present
in the fragment. The volume of spots in the small molecular weight range also appears to
be reproducible across replicate gels by manual inspection. However, it is not possible to
investigate the peptide spans of all spots from replicate gels by MS due to the cost involved.
It remains to be investigated if these fragments have any biological significance or if they
are experimental artifacts. The correlation between peptide span and spot position may be
related to protein modifications. Modification status affects the ability of a peptide to be
ionised, therefore peptides that have the same set of modifications should have the same
probability of being detected by mass spectrometry. Proteins located in similar regions are
likely to contain many peptides that have been modified in the same way, and these peptides
will share the same likelihood of being detected by mass spectrometry.
There are two spots towards the bottom left that have very long spans of matched
peptides (5). This is similar to the results for β-tubulin, and it is unlikely that a full length
protein could migrate this distance in the gel, therefore these may be the result of differential
splicing. An alternative, although unlikely, possibility is that tubulin fragments from the two
ends of the protein have independently co-migrated and appear as a single spot. It is also
possible that the protein fragmented but the 3-D structure did not completely disassociate
as expected, leaving different parts of the protein bound together, with a small overall mass.
The spots (IDs 741 and 734) both have strong hits to the α-tubulin protein record, matching
peptides near the beginning and end of the protein sequence. Additional experiments could
be performed to further characterise this protein spot, for example performing tandem mass
spectrometry on as many peptides as possible to determine what parts of the protein are
present in the spot.
The same observation about the lack of peptides matched at the C-terminus can be made
for α-tubulin, as well as β-tubulin. This may be due to glutamylation, which has also been
reported for α-tubulin [84], or tyrosination of C-terminal peptides [289]. Modifications of
these kinds are thought to be common on α-tubulin, and may prevent peptides becoming
ionised during MS. The peptide spans on Figure 7.5 also demonstrate that there are no
N-terminal peptides that have been matched. This raises the possibility that PTMs also
occur on N-terminal peptides, which as far as we are aware has not been previously reported.
This demonstrates that the peptide visualisation software has the capacity for hypothesis
generation, which can be confirmed by further experimentation.
Figure 7.6: Protein spots matched to five different Elongation Factors. EF-α (blue); EF-β (red); EF-2 (yellow); EF-γ (orange and boxed); EF (putative) (white). Gel image courtesy of A. Faldas.
Elongation factor proteins
The peptide alignment analysis has also been performed to classify Elongation Factor (EF)
protein spots. Elongation factors function during protein translation, for example controlling
the addition of new amino acids onto a growing peptide chain. It has been suggested that T.
brucei protein abundance is controlled at the level of translation rather than transcription,
therefore any insights into EF proteins could prove important in understanding regulation.
There are at least five different elongation factor genes, with many spots appearing on the
2-D gel (Figure 7.6). Functional annotation for these genes in T. brucei is still at an early
stage, therefore any information from proteomics that can aid annotation will be useful.
An analysis was carried out to determine the peptide spans of EF 1-α, EF 1-β, EF-2, a
sequence annotated in the database solely as EF (putative), and EF-γ (one protein spot -
peptide alignment not shown), to test whether spots have been correctly identified on gels,
and to determine whether sequences have been correctly predicted in the genome database.
Figure 7.7: Protein spots matched to Elongation factor 1-α. Gel image courtesy of A. Faldas.
Elongation factor 1-α
The graphic for Elongation factor 1-α (Figure 7.7) displays a large cluster of spots to the
right of the gel, likely to be caused by multiple differentially modified forms of the protein.
The post-translational modification of EF 1-α is a common phenomenon in other organisms,
such as plants [265], but as far as we are aware, it has not been investigated in detail for
trypanosomes. The evidence presented here suggests that PTMs to EF 1-α from T. brucei
are also very common. The spots towards the bottom of the gel are likely to be protein
fragments, shown by the very short spans of peptides (less than 5% of the sequence length).
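The span measure used above to flag likely fragments can be sketched as follows. This is a minimal illustration, not the software written for this work: the function names and the toy sequence are assumptions, and exact substring matching stands in for the peptide hits reported by the MS search.

```python
from typing import List, Optional, Tuple


def peptide_span(protein: str, peptides: List[str]) -> Optional[Tuple[int, int]]:
    """Return the 1-based (start, end) range between the outermost matched peptides."""
    hits = []
    for peptide in peptides:
        index = protein.find(peptide)  # first exact occurrence, or -1 if absent
        if index != -1:
            hits.append((index + 1, index + len(peptide)))
    if not hits:
        return None
    return min(start for start, _ in hits), max(end for _, end in hits)


def span_fraction(protein: str, peptides: List[str]) -> float:
    """Fraction of the protein length lying between the outermost matched peptides."""
    span = peptide_span(protein, peptides)
    if span is None:
        return 0.0
    start, end = span
    return (end - start + 1) / len(protein)


# Toy 40-residue sequence and peptides (hypothetical, for illustration only).
toy_protein = "MSEQLKDVTARGHWPNCYFIKLMAGTESDQRVNHPWYCIA"
full_length_like = span_fraction(toy_protein, ["SEQLK", "WYCIA"])
fragment_like = span_fraction(toy_protein, ["WYC"])
```

A spot whose span fraction falls below a chosen threshold (5% of the sequence length in the text above) would be flagged as a probable fragment.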
Elongation factor 1-β and EF (putative)
The left gel in Figure 7.8 displays Elongation factor 1-β (EF-β) protein spots. There are
three spots in the middle of the gel, which are likely to result from different modifications,
such as different phosphorylations to the protein. A single spot towards the bottom of the
gel is probably a fragment of the full length sequence. The right image on Figure 7.8 displays
the spots matched to EF (putative). There is probably one match to the full protein, in the
centre of the gel, and two possible fragments at the bottom of the gel. A multiple alignment
has been performed, using ClustalW [318], of the sequences of EF-β and EF (putative) from
91.m00148                       MCDHMYSPVFIPFAFFSIVKCHNKCSFVCNRSGKDMSIKDVNVKSGKLEE 50
gi|461992|sp|P34827|EF1B_TRYCR  −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−MSVKDVNKRSGELEG 15
TRYP_x−70a06.p2kb545_154        −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−MSSLKEIN−−−−−−−G 9
gi|310944|gb|AAA30183.1|        −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−NSARVKDAMTTLKELNG 17
                                                                     :*:

91.m00148                       KLKGKLFLGGVKPSEEDVKAFNDLLGGDNTNVFRWVKNIASFTEAERTAW 100
gi|461992|sp|P34827|EF1B_TRYCR  KLKGKLFLGGTKPSKEDVKLFNDLLGAENTSLYLWVKHMTSFTEAERKAW 65
TRYP_x−70a06.p2kb545_154        RLSAQPYVSGFTPSKEDARIFSEMFG−SNTAVIQWAARMAAYYQAER−−− 55
gi|310944|gb|AAA30183.1|        RLSSQPYVSGYCPAR−KTRRYSLRCS−−−ARLALWLSGPHVWLRTIKR−− 61
                                :*..: ::.* *:....: :. . ..: : * : .: :

91.m00148                       GAPVKITPPVAAPVAAPAAAPAAAPAATPARKAAEADDDDIDLFGETTEE 150
gi|461992|sp|P34827|EF1B_TRYCR  GAPVKVTATTSA−−SAPAKQAPKKAASAPAKQADE−−DEEIDLFGEATEE 111
TRYP_x−70a06.p2kb545_154        −−−VQLTK−−−−−−−−−GATASKTSATTKAAAGDD−−−DDIDLFGEATEE 90
gi|310944|gb|AAA30183.1|        −−TEQILK−−−−−−−−−−−−−GTASSSKKAAAAED−−−EDIDLFGEATEE 93
                                :: . . .:: * . : ::******:***

91.m00148                       ELAALEAKKKKDAAAKSTKKVIIAKSSILFDIKPWDDTVDLQKLATELHA 200
gi|461992|sp|P34827|EF1B_TRYCR  ETAALEAKKKKDTDAKKAKKEVIAKSSILFDVKPWDDTVDLQALANKLHA 161
TRYP_x−70a06.p2kb545_154        ELAALEAKKKKDAAAKSSKKVIIAKSSILFDIKPWDDTVDLDGLAQKLHA 140
gi|310944|gb|AAA30183.1|        ETAALEAKKKKDADAKKAKKEVIAKSSILFDVKPWDDTVDLQALADKLHA 143
                                * **********: **.:** :*********:*********: ** :***

91.m00148                       IKRDGLLWGDHKLVPIAFGVKKLQQLVVIEDDKVSGDDLEEMIMSFGDAV 250
gi|461992|sp|P34827|EF1B_TRYCR  VKRDGLLWGDHKLVPVAFGVKKLQQLIVIEDDKVLSDDLEELIMSFEDEV 211
TRYP_x−70a06.p2kb545_154        IKRDGLLWGDHKLVPIAFGVKKLQQLVVIEDDKVSGDDLEEMIMSFGDDV 190
gi|310944|gb|AAA30183.1|        VKRDGLLWGDHKLVPVAFGVKKLQQLIVIEDDKVSSDDLEELIMSFEDEV 193
                                :**************:**********:******* .*****:**** * *

91.m00148                       QSMDIVAWNKI 261
gi|461992|sp|P34827|EF1B_TRYCR  QSMDIVAWNKI 222
TRYP_x−70a06.p2kb545_154        QSMDIVAWNKI 201
gi|310944|gb|AAA30183.1|        QSMDIVAWNKI 204
                                ***********

>91.m00148 |||25 kDa elongation factor 1−beta, putative|t_brucei|chr_4|RPCI93|26G5|91
>gi|461992|sp|P34827|EF1B_TRYCR 25 KD ELONGATION FACTOR 1−BETA (EF−1−BETA) (T. cruzi)
>TRYP_x−70a06.p2kb545_154 |||elongation factor, putative|Trypanosoma brucei||chr 10|||Manual
>gi|310944|gb|AAA30183.1| elongation factor (T. cruzi)

Gel panels: EF−beta (left); EF (putative) (right).
Figure 7.8: Protein spots matched to EF-β and EF (putative) are displayed with the corresponding span of peptide hits. The boxed regions mark a spot that contains peptides that match both EF-β and EF (putative). A multiple alignment is also displayed of EF-β from T. brucei and T. cruzi, with EF (putative) from T. brucei and EF from T. cruzi. The boxed region of the alignment shows that the starting codon of EF-β from T. brucei may have been wrongly predicted. Gel images courtesy of A. Faldas.
T. brucei and from T. cruzi (in the lower part of Figure 7.8). There is a very high degree of
sequence similarity between EF-β and EF (putative), with long stretches of identical residues.
In T. brucei the sequences lie on chromosome 4 and chromosome 10, so there is a low
chance that this is an annotation error and they are in fact the same sequence. However,
contamination has been detected in sequences derived from the chromosome
10 project, and therefore it is not possible to say definitively that the two sequences arise
from different genes.
The alignment shows that the N-terminus of EF-β may have been incorrectly predicted,
because the first 30 to 40 residues align poorly and there is a region 37 residues downstream
that matches the start of the other EF sequences. It is also worth noting that the first
residue of the T. cruzi EF sequence is not a methionine and may also have been incorrectly
predicted. There is a methionine nine residues downstream that aligns very well with the
start codon of EF (putative) from T. brucei, which is more likely to be the correct start
position.
The alignment of peptide sequences against proteins reveals a single spot that contains a
peptide that exactly matches the protein sequence of both EF-β and EF (putative), towards
the bottom left corner of the gel (boxed in Figure 7.8). This finding, and the high sequence
similarity in the multiple alignment, demonstrates that mass spectrometry results cannot
always distinguish conclusively between EF (putative) and EF-β. However,
the spots in the middle of the gel have long peptide spans that cover the N-terminus of
the protein sequence, which is more divergent than the C-terminus of the sequence between
EF-β and EF (putative). Therefore, these spots are likely to have been correctly identified.
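The ambiguity described above can be illustrated with a short sketch that reports which peptides occur in more than one candidate sequence. The two sequences are the C-terminal regions copied from the last two blocks of the alignment in Figure 7.8; the function names are assumptions, and the two peptides are hypothetical examples chosen to show both cases, not observed MS data.

```python
from typing import Dict, List

# C-terminal regions taken from the alignment in Figure 7.8
# (91.m00148 = EF-beta; TRYP_x-70a06.p2kb545_154 = EF (putative)).
PROTEINS: Dict[str, str] = {
    "EF-beta": "IKRDGLLWGDHKLVPIAFGVKKLQQLVVIEDDKVSGDDLEEMIMSFGDAVQSMDIVAWNKI",
    "EF (putative)": "IKRDGLLWGDHKLVPIAFGVKKLQQLVVIEDDKVSGDDLEEMIMSFGDDVQSMDIVAWNKI",
}


def proteins_containing(peptide: str, proteins: Dict[str, str]) -> List[str]:
    """Names of all proteins whose sequence contains the peptide exactly."""
    return sorted(name for name, seq in proteins.items() if peptide in seq)


def ambiguous_peptides(peptides: List[str], proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each peptide that matches more than one protein to the proteins it matches."""
    result = {}
    for peptide in peptides:
        names = proteins_containing(peptide, proteins)
        if len(names) > 1:
            result[peptide] = names
    return result


# Hypothetical peptides: the first lies in a region shared by both sequences,
# the second spans the single residue that differs between them.
peptides = ["LQQLVVIEDDK", "VSGDDLEEMIMSFGDAVQSMDIVAWNK"]
shared = ambiguous_peptides(peptides, PROTEINS)
```

A spot identified only by peptides that appear in `shared` cannot be assigned to one of the two proteins on sequence evidence alone, whereas a peptide covering a divergent region discriminates between them.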
Elongation factor 2
The image in Figure 7.9 displays the peptide spans of proteins matched to EF-2. There are
eight spots near the top of the gel which are probably differentially modified forms of the
complete protein, and the spots at the bottom of the gel are likely to be protein fragments.
There is no T. brucei EF-2 sequence deposited in GenBank as of May 2004, but there is an
EF-2 gene in GeneDB. The closest match in GenBank is Elongation Factor 2 from T. cruzi.
A sequence alignment reveals that Elongation Factor 2 is almost identical between T. brucei
and T. cruzi, indicating that the sequence has been correctly named. The last part of the
alignment is displayed in the lower part of Figure 7.9, and it appears that the end point of
the T. brucei sequence may have been incorrectly predicted.
TRYP_x−70a06.p2kb545_355    AIHRGGGQIIPTARRVFYACCLTATPRLMEPMFQVDIQTVEHAMGGIYGV 750
gi|1800107|dbj|BAA09433.1|  AIHRGGGQIIPTARRVFYACCLTAAPRLMEPMFQVDIQTVEHAMGGIYGV 721
                            ************************:*************************

TRYP_x−70a06.p2kb545_355    LTRRRGVIIGEENRPGTPIYNVRAYLPVAESFGFTADLRAGTGGQAFPQC 800
gi|1800107|dbj|BAA09433.1|  LTRRRGVIIGEENRPGTPIYNVRAYLPVAESFGFTADLRAGTGGQAFPQC 771
                            **************************************************

TRYP_x−70a06.p2kb545_355    VFDHWQQYPGDPLDPKSQANTLVLSIRQRKGLKPDIPGLDTFLDKL 846
gi|1800107|dbj|BAA09433.1|  VFDHW−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 776
                            *****.. .... ...:.:.: : . .. ... .. .: ..

TRYP_x−70a06.p2kb545_355 (T. brucei)
gi|1800107|dbj|BAA09433.1| (T. cruzi)
Figure 7.9: The span of peptide hits for protein spots matched to Elongation Factor 2. The alignment shows the 150 residues at the C-terminus of the EF-2 sequences from T. brucei and T. cruzi. The boxed region shows that the end point of one of the sequences may not have been predicted correctly, given the overall similarity between the two sequences is so high. Gel image courtesy of A. Faldas.
Elongation factor γ
There is a single protein spot matched to EF-γ, near the bottom of the gel (orange and
boxed in Figure 7.6). A BLAST search of the EF-γ gene sequence hits only other EF-γ
sequences, and not the other EF genes, so this match is probably correct. However,
the spot is positioned near the base of the gel, indicating that it may only be a protein
fragment, so it cannot be concluded that the full EF-γ protein is present on the gel.
Summary of elongation factor results
In summary, the results demonstrate that there are at least five genes encoding elongation
factors in T. brucei and many different protein spots appear on the 2-D gel, raising the
possibility that protein modifications are common. Modifications could regulate the activity
of elongation factor proteins, to achieve control over the translation of proteins. This is an
interesting area for further research because T. brucei does not modulate the rate of tran-
scriptional initiation, and it is likely that control over protein expression occurs downstream,
perhaps by regulating the rate of translation.
Heat shock proteins
The heat shock proteins (Hsp) are conserved across virtually all organisms, and are often
expressed in response to environmental stress. It has been shown that Hsps are up-regulated
when the temperature of the parasite’s environment is rapidly increased, for example during
transfer from the tsetse fly (25 °C) to the mammalian host (37 °C). At this time there are
extensive changes in morphology and metabolism of the parasite as it switches from the
procyclic form to the bloodstream form. It is thought that the expression of Hsp genes at
this time is crucial. It has been demonstrated that post-transcriptional control is exerted
to regulate the expression of Hsp70, and this control may be exerted at the level of mRNA
stability [193]. The proteome map of T. brucei suggests that many different protein forms
exist, due to the large number of distinct spots that have been matched to Hsp70, so
post-translational modifications may also be common.
The current level of annotation for T. brucei heat shock proteins is fairly poor. Many
spots on a single gel match Hsp70, and it is possible that these arise from several
closely related genes, rather than all 62 distinct protein spots arising from one gene. An
analysis was carried out to identify how many distinct genes coded for the 62 protein spots
observed. Five distinct protein sequences were obtained that had been matched by mass
a) −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−  −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
b) −−−−−−−−−−−−−−−−−−−−−−−−−−−−MTYEGAIGIDLGTTYSCVGVWQ  NERVEIIANDQGNRTTPSYVAFTDSERLIGDAAKNQVAMNPTNTVFDAKR 72
c) MSRMWLTTAAVFLTVTVAAVSAAPESGGKVEAPCVGIDLGTTYSVVGVWQ  KGDVHIIPNEMGNRITPSVVAFTDTERLIGDGAKNQLPQNPHNTIYTIKR 100
d) −−−−−−−−−−−−−−−−−−−−−−−−−−−−−MPAPAIGIDLGTTYSCVGVFK  NDQVEIVANDQGNRTTPSYVSFSETERLVGDAAKNQVAMNPTNTVFDAKR 71
e) −−−−−MLARRVCAPMCLASAPFARWQSSKVTGDVIGIDLGTTYSCVAVME  GDRPRVLENTEGFRTTPSVVAFKGQEKLVGLAAKRQAITNPQSTFFAVKR 95

a) −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−  −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
b) LIGRKFSDSVVQSDMKHWPFKVVTKGDDKPVIQVQFRGETKTFNPEEISS  MVLLKMKEVAESYLGKQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLR 172
c) LIGRKYTDAAVQADKKLLSYEVIADRDGKPKVQVMVGGKKKQFTPEEISA  MVLQKMKEIAETYLGEKVKNAVVTVPAYFNDAQRQSTKDAGTIAGLNVVR 200
d) IIGRKYDDPDLQADMKHWPFKVTVK−EGKPVVEVEYQGERRTFFPEEISA  MVLQKMKEIAESYLGEKVSKAVVTVPAYFNDSQRQATKDAGSIAGLEVLR 170
e) LIGRRFDDEHIQHDIKNVPYKIIRSNNGDAWVQ−−−DGNGKQYSPSQVGA  FVLEKMKETAENFLGRKVSNAVVTCPAYFNDAQRQATKDAGTIAGLNVIR 192

a) −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−  −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
b) IINEPTAAAIAYGLDKADEGKERNVLIFDLGGGTFDVTLLTIDGGIFEVK  ATNGDTHLGGEDFDNRLVAHFTEEFKRKNKGKDLSSNLRALRRLRTACER 272
c) IINEPTAAAIAYGLN−−−KAGEKNILVFDLGGGTFDVSLLTIDEGFFEVV  ATNGDTHLGGEDFDNNMMRHFVDMLKKK−KNVDISKDQKALARLRKACEA 296
d) IVNEPTAAAIAYGMDRSSEGAMKTVLIFDLGGGTFDVTLLNIDGGLFEVR  ATAGDTHLGGEDFDSRLVDYFATEFRTR−TGKDLRGNARAMRRLRTACER 269
e) VVNEPTAAALAYGLD−−−KTKDSLIAVYDLGGGTFDISVLEIAGGVFEVK  ATNGDTHLGGEDFDLCLSDHILEEFRKT−SGIDLSKERMALQRIREAAEK 288

a) −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−  −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
b) AKRTLSSAAQATIEIDALFENID−−−−FQATITRARFEELCGDLFRGTLQ  PVERVLQDAKMDKRAVHDVVLVGGSTRIPKVMQLVSDFFGGKELNKSINP 368
c) AKRQLSSHPEARVEVDSLTEGFD−−−−FSEKITRAKFEELNMDLFKGTLV  PVQRVLEDAKLKKSDIHEIVLVGGSTRVPKVQQLISDFFGGKELNRGINP 392
d) VKRTLSSSASTNIEIDALYEGFD−−−−FFSKITRARFEEMCRDQFERCLE  PVRKVLKDAEVDASAVDDVVLVGGSTRIPRVQQLVQNFFNGKEPNRSINP 365
e) AKCELSTTMETEVNLPFITANQDGAQHVQMMVSRSKFESLADKLVQRSLG  PCKQCIKDAAVDLKEISEVVLVGGMTRMPKVVEAVKQFFG−REPFRGVNP 387

a) −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−  −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−MGTFDLSGIP 10
b) DEAVAYGAAVQAFILTGGKSKQTEGLLLLDVAPLTLGIETAGGVMTALIK  RNTTIPTKKSQIFSTYSDNQPGVHIQVFEGERTMTKDCHLLGTFDLSGIP 468
c) DEAVAYGAAVQAAVLTGESEVGGR−VVLVDVIPLSLGIETVGGVMTKLIE  RNTQIPTKKSQVFSTHADNQPGVLIQVYEGERQLTKDNRLLGKFELSGIP 491
d) DEAVAYGAAVQAHIVSGGKSKQTKDLLLVDVTPLSLGVETAGGVMSVLIP  RNTSVPAQKSQTFSTNADNQRSVEIKVYEGERPLVSQCQCLGTFTLTDIP 465
e) DEAVALGAATLGGVLRG−−−−DVKGLVLLDVTPLSLGIETLGGVFTRMIP  KNTTIPTKKSQTFSTAADNQTQVGIKVFQGEREMASDNQMMGQFDLVGIP 483

a) PAPRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRLSKADI  ERMVSDAAKYEAEDKAQRERIDAKNGLENYAFSMKNTINDPN−VAGKLDD 109
b) PAPRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRLSKADI  ERMVSDAAKYEAEDKAQRERIDAKNGLENYAFSMKNTINDPN−VAGKLDD 567
c) PAARGVPQIEVTFDVDENSILQVSAMDKSSGKKEEITITNDKGRLSEEEI  ERMVREAAEFEDEDRKVRERVDARNSLESVAYSLRNQVNDKDKLGGKLDP 591
d) PAPRGKPRITVSFDVNVDGILVVTAVEETAGKTQAITISNDKGRLSREQI  DKMVAEAEKFAEEDRANAEKIEARNSVENYTFSLRSTLSDPD−VQQNISQ 564
e) PAPRGVPQIEVTFDIDANGICHVTAKDKATGKTQNITITAHGG−LTKEQI  ENMIRDSEMHAEADRVKRELVEVRNNAETQANTAERQLTEWK−−−−YVTD 578

a) ADKNAVTTAVEEALRWLNDNQEASLDEYNHRQKELEGVCAPILSKMYQGM  GGGDAAGGMPGGMPGGMPG−−−−GMGGGMGGAAASSGPKVEEVD 199
b) ADKNAVTTAVEEALRWLNDNQEASLDEYNHRQKELEGVCAPILSKMYQGM  GGGDAAGGMPGGMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD 661
c) NDKAAVETAVAEAIRFLDENPNAEKEEYKTALETLQSVTNPIIQKTYQSA  GGGDKPQPMDDL−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 653
d) EDQQKIQTVVNAVVNWLDENRDATKEEYDAKNKEIEQVAHPILSAYYVKR  AMEQAPPAPPSGE−−−−−−−−−−−−−−−−−−−GEGNAPVPDDVD 639
e) AEKENVRTLLAELRKVME−NPNVTKDELSASTDKLQKAVMECGRTEYQQA  AAANSGSSGSSSTEGQ−−−−−−−−−−−−−−GEQQQQQASGEKKE 657

Figure 7.10: A multiple alignment of five Hsp 70 protein sequences from T. brucei; a) = TRYPtp2h24gd03.q1k 1, b) = TRYPtp30n4hh05.p1k 3, c) = TRYP xi-1015g04.q1k 13, d) = 125.m00218, e) = 92.m00252.
spectrometry data, all predicted to be Hsp70 by BLAST searches. A multiple alignment of
the five sequences has been performed (Figure 7.10). All five sequences are highly related
but no two appear similar enough to have arisen from an incorrect prediction of a single
gene, therefore there appear to be at least five distinct Hsp70 genes that exist in T. brucei.
The first sequence in the alignment is significantly shorter than the other four, possibly
indicating that the start of this gene has been incorrectly predicted, or it is a pseudogene.
The similarity between all sequences raises the possibility that mass spectrometry matches
to these proteins could be incorrect. However, there are few long stretches in any sequence
that are identical to a different sequence, so it is likely that most peptide matches
will be made correctly. A study by Lee in 1998 suggested that there is an Hsp70 locus in
T. brucei containing 6 identical genes [193]. A search of GeneDB (May 2004) for the text
queries “*heat shock protein*” and “*hsp*” finds seven proteins that are predicted to be an
Hsp 70, of which four are clustered on chromosome 11, which may be the locus reported by
Lee, one sequence on chromosome 9 and two on chromosome 7.
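The observation that few long stretches are identical between sequences can be checked directly on aligned rows. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the two rows are the b) and d) sequences from one 50-column block of the alignment in Figure 7.10.

```python
# Gap symbols: ASCII hyphen and the minus sign used in the figure.
GAP_CHARS = "-\u2212"


def longest_identical_run(row_a: str, row_b: str) -> int:
    """Length of the longest run of consecutive identical, non-gap columns."""
    best = run = 0
    for a, b in zip(row_a, row_b):
        if a == b and a not in GAP_CHARS:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b)
             if a not in GAP_CHARS and b not in GAP_CHARS]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Rows b) and d) of one block of the Hsp70 alignment (Figure 7.10).
ROW_B = "MVLLKMKEVAESYLGKQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLR"
ROW_D = "MVLQKMKEIAESYLGEKVSKAVVTVPAYFNDSQRQATKDAGSIAGLEVLR"

run_length = longest_identical_run(ROW_B, ROW_D)
identity = percent_identity(ROW_B, ROW_D)
```

Comparing the longest identical run with the typical length of a matched peptide gives a rough indication of how often a peptide could match two of the sequences equally well.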
A multiple alignment has been performed of the sequences retrieved from the current
release of GeneDB against those from the MS analysis, which come from GenBank and older
Figure 7.11: Protein spots matched to five different Hsp70 protein sequences. 125.m00218 = blue; 92.m00252 = red; TRYP xi-1015g04.q1k 13 = yellow; TRYPtp30n4hh05.p1k 3 = white; all spots marked as cyan contain peptides that hit both TRYPtp2h24gd03.q1k 1 (cyan) and TRYPtp30n4hh05.p1k 3 (white). Gel image courtesy of A. Faldas.
downloads of GeneDB that were used for the original MS analysis over the last year. The
alignment is displayed at the end of the chapter. Ten out of the twelve sequences appear
to be distinct, and two of the sequences from the MS analysis are identical to sequences
on chromosome 11. It is possible that these are the same genes; however, this cannot be
verified, as there is no correspondence between different versions of sequence identifiers in
GeneDB. Two sequences in GeneDB, Tb09.160.3090 and Tb07.29K4.60, are annotated as
Hsp70 and contain several motifs that are highly similar to other Hsp70 sequences. Over
the full length, however, they are more divergent and are 25% longer than the other Hsp70
sequences. They would therefore be predicted to have a higher molecular weight and may in fact
be a closer match to a different heat shock protein.
Figure 7.11 displays which protein spot matches which sequence in the genome database.
There are distinct clusters of spots that match the sequence 92.m00252 (red) and 125.m00218
(blue). Only one spot matches TRYP xi-1015g04.q1k 13 (yellow), at the bottom of the
gel, so this may only be a protein fragment. There is a cluster of spots predicted
to match TRYPtp30n4hh05.p1k 3 (white) and TRYPtp2h24gd03.q1k 1 (cyan). However, all
those coloured cyan are matched to peptides that also exactly match TRYPtp30n4hh05.p1k 3
(white), so it is not possible to say from this analysis which is the correct protein
identification. It is possible that the MS results have incorrectly predicted the identity of
the proteins coloured cyan or white. It is not possible to say definitively that the protein
TRYPtp2h24gd03.q1k 1 (cyan), which has a very short sequence, is expressed in this sample,
and it may be a pseudogene.
7.3.2 Using data in RAPAD to improve genome annotation
An interface has been developed which allows external databases to link to protein records
in RAPAD. Unique ID numbers have been assigned to proteins that identify the database
version (v. 1) so that in future database versions, a link can be provided to the most recent
records. A record displays the protein name, has a link to the corresponding gel with the
spot highlighted, and provides evidence about the quality of the match to MS data. When
the data is released to the public, the web page for each protein can be referenced from other
databases. Alternatively, a more robust approach would be for other databases to store the
unique ID number that has been assigned to each protein, and maintain a single URL pointing
to the current implementation of the database. This feature will be used by
the genome database, when the existence of a protein has been verified by the proteome
map, as discussed in Chapter 5. The interface that allows public access to T. brucei data in
RAPAD is displayed in Figure 7.12.
Hypothetical proteins
An analysis has been performed to find the number of distinct proteins stored in RAPAD
which are named as a “hypothetical protein”. A simple search of RAPAD for the word
hypothetical in the protein name reveals 100 matching entries that arise from 47 distinct
spots on the master gel. It is therefore likely that the actual number of proteins that are
annotated as hypothetical on the master gel, is somewhere between 47 and 100 because it is
possible that there is more than one distinct protein annotated as hypothetical in a single
spot. However, given that many sequences have not been manually curated, the genome
database may contain a large number of open reading frames that have been incorrectly
predicted, and the sequence may have been hit by chance. A further database search reveals
that 24 out of the 100 proteins are matched with a sequence coverage of less than 5%, so
these may not be true matches.
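The counts reported above (matching entries, distinct spots, and low-coverage matches) can be derived from a list of protein matches as in the following sketch. The record type, field names and example data are hypothetical and do not reflect the RAPAD schema; in RAPAD the equivalent figures were obtained by database searches.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProteinMatch:
    spot_id: int             # gel spot the match belongs to
    protein_name: str        # name of the matched database sequence
    coverage_percent: float  # sequence coverage of the match


def summarise_hypothetical(matches: List[ProteinMatch],
                           min_coverage: float = 5.0) -> Dict[str, int]:
    """Count hypothetical-protein entries, their distinct spots, and weak matches."""
    hypothetical = [m for m in matches if "hypothetical" in m.protein_name.lower()]
    return {
        "entries": len(hypothetical),
        "distinct_spots": len({m.spot_id for m in hypothetical}),
        "low_coverage": sum(m.coverage_percent < min_coverage for m in hypothetical),
    }


# Made-up records for illustration only.
example_matches = [
    ProteinMatch(101, "hypothetical protein", 12.0),
    ProteinMatch(101, "hypothetical protein, conserved", 3.5),
    ProteinMatch(202, "Conserved hypothetical protein", 4.0),
    ProteinMatch(303, "alpha-tubulin", 40.0),
]
summary = summarise_hypothetical(example_matches)
```

Because one spot can contain several matches, the number of entries is an upper bound and the number of distinct spots a lower bound on the number of hypothetical proteins, as argued in the text.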
Figure 7.12: The interface for publishing T. brucei proteome data. The initial page displays images of gels that are stored and the number of identified proteins on each gel. A list of proteins can be generated and individual records can be displayed.
Figure 7.13: A search using the Gel Viewer reveals 100 proteins, annotated as “hypothetical”. Gel image courtesy of A. Faldas.
The protein sequences hit by MS data were obtained for 88 out of the 100 sequences. The
other 12 sequences could not be obtained because the ID numbers that are stored in RAPAD
have changed in GeneDB, and there is no link to the current record. This is a major problem
for researchers working with genome sequences before they are complete because temporary
ID numbers are assigned to proteins that are later deleted, and databases often do not
maintain an archive of previous identifiers. For the spots that do not link to a current
GeneDB record, the MS searches must be repeated, which is time consuming. There will
be an option in the next release of RAPAD to perform repeated MS searches automatically.
Of the 88 sequences, there were 57 distinct protein sequences that have been matched. A
piece of software was written that matches the peptide sequences hit by mass spectrometry
results to the protein sequences, to determine which spots on the gel matched which protein
sequence. Ten proteins were found to have been matched to more than one spot, accounting
for 33 spots in total. The diagram in Figure 7.13 displays all the spots
that have been annotated as matching a protein whose name contains the word hypothetical.
Many of the proteins lie at the bottom of the gel, indicating possible protein fragments that
may have been matched to short protein sequences in the database. Several of the database
sequences annotated as hypothetical are very short, of which the shortest contains only 39
Figure 7.14: The protein spots that have been matched to different hypothetical proteins. The spots with the same colour label have been matched to the same database sequence. The three boxed regions are discussed further in the text. Gel image courtesy of A. Faldas.
amino acids, and is very unlikely to be a correctly predicted protein.
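The matching step described above, which assigns each spot to the protein sequences containing its peptides and then identifies proteins found in more than one spot, might look like the following. This is a sketch rather than the software that was written: the names, sequences, spot numbers and peptides are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def spots_by_protein(spot_peptides: Dict[int, List[str]],
                     proteins: Dict[str, str]) -> Dict[str, Set[int]]:
    """Map each protein to the set of spots with at least one peptide found in it."""
    result: Dict[str, Set[int]] = defaultdict(set)
    for spot_id, peptides in spot_peptides.items():
        for name, sequence in proteins.items():
            if any(peptide in sequence for peptide in peptides):
                result[name].add(spot_id)
    return dict(result)


def multi_spot_proteins(assignments: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Keep only proteins matched to more than one spot."""
    return {name: spots for name, spots in assignments.items() if len(spots) > 1}


# Toy sequences and peptide lists (illustrative only).
toy_proteins = {
    "hyp_A": "MSEQLKDVTARGHWPNCYFIK",
    "hyp_B": "LMAGTESDQRVNHPWYCIA",
}
toy_spot_peptides = {1: ["DVTAR"], 2: ["GHWPNCYFIK"], 3: ["ESDQR"]}

assignments = spots_by_protein(toy_spot_peptides, toy_proteins)
repeated = multi_spot_proteins(assignments)
```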
There are ten hypothetical proteins that have been matched to more than one gel spot.
Groups of spots matched to the same protein are displayed in a particular colour on Figure
7.14. Three pairs of spots that have been matched to three different hypothetical proteins
have been highlighted for further study because they reside in the middle of the gel, therefore
are unlikely to be protein fragments. Furthermore, two spots matched to one protein, located
next to each other, are unlikely to be incorrect matches because the probability of two
adjacent spots independently matching the same sequence is low. However, it is still possible
that an incorrect match could be made to a short “hypothetical” protein sequence in the
database if there were two spots containing the same protein that had a peptide that matched
the hypothetical protein by chance.
Spot group 1
The spots marked 1 in Figure 7.14 (Spot IDs 313 and 275) are both fairly strongly matched
to a 438 amino acid protein, annotated as “Conserved hypothetical protein”. The left spot
contains only this protein, whereas the right spot is predicted to match five different proteins: ATPase
beta subunit; Lipophosphoglycan biosynthetic protein; α-tubulin; Conserved hypothetical
protein; and Hsp83-1. A BLAST search of the hypothetical sequence against the non-
redundant (NR) database at GenBank reveals a top hit matching the following entry, with
an e-value of 0.15 (not highly significant):
NP_937883 1392 aa DEFINITION restin isoform b; cytoplasmic linker 1;
Reed-Steinberg cell-expressed intermediate filament-associated protein
[Homo sapiens].
The finding that the protein does not strongly match any annotated entries in the genome
databases for other organisms indicates that the protein may be specific to T. brucei and
its close relatives. It may therefore be a good candidate for further functional analysis to
determine if it is essential for the life cycle of trypanosomes.
Spot group 2
The spots marked 2 are both matched to several different proteins. The spot on the left
(ID 330) matches: Hsp81-2; S-adenosylmethionine synthetase; Hypothetical protein,
conserved; Hypothetical protein, conserved (possible RNA binding protein); and β-tubulin. The
right spot (ID 323) is annotated as: Elongation factor 2; Conserved hypothetical protein;
Hsp81-2; α-tubulin; Hsp70; and S-adenosylmethionine synthetase. Both spots are matched to
S-adenosylmethionine synthetase and the hypothetical protein, so it is possible that
the hypothetical protein is highly similar to S-adenosylmethionine synthetase and has been
matched by chance. However, the match has been made based on tandem MS data, using
peptide sequence information, and manual inspection of the results demonstrates that the
two matches are to peptides of different sequences.
A BLAST search of the sequence for the conserved hypothetical protein matches several
sequences. The strongest match is to an ATPase from a blue-green alga, with an e-value of
1e−88, which is highly significant. The mass spectrometry data for the spots matches three
and four peptides respectively to the database sequence, therefore there is a good chance
that the match was made correctly. The evidence suggests that this “Conserved hypothetical
protein” is an ATPase.
LOCUS ZP_00327657 609 aa linear BCT 17-JUN-2004
DEFINITION COG3044: Predicted ATPase of the ABC class
[Trichodesmium erythraeum IMS101].
Spot group 3
The two spots marked 3 in Figure 7.14 (IDs 543 and 548) are annotated as a hypothetical
protein on chromosome 4, and have been matched at a 5% sequence coverage. This value
is fairly low, but given that two proximally located spots have been matched to the same
sequence independently, it is reasonable to assume that a correct match has been made. A
BLAST search of the sequence matches several proteins from other organisms at a fairly
high degree of significance; however, all are annotated only as hypothetical proteins. The
degree of similarity (e-value 5e−05) to the top matching sequence (GenBank ID NP_522045)
indicates that the sequence is likely to be a real protein, but more work is required to assign
a function. There are also six other spots that have been matched to this protein, located at
the base of the gel. These spots probably contain fragments of the protein, but their number
indicates that the protein is fairly abundantly expressed.
7.3.3 Search for post-translational modifications
The method of searching MS data for possible modifications to peptides, which was detailed
in the last chapter, was repeated for the T. brucei data. A manual search was performed to
investigate if peptides have an altered mass resulting from phosphorylation, deamidation or
acetylation. Several clusters of spots that appeared to result from PTMs were investigated,
but this method of searching for modifications has several major limitations and therefore a
large scale approach has not been undertaken at the present time. The results from two of
the searches are presented below.
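The principle behind the search, comparing a mass difference against the known shift of each modification, can be sketched as follows. The shifts are standard monoisotopic values; the function name and the tolerance are assumptions, and the worked example uses the two Mr(calc) values reported in Figure 7.15 for the peptide VREQLSTMTDDLQGTYYPLSGMTK with and without a deamidation.

```python
from typing import List

# Monoisotopic mass shifts in daltons for the modifications named in the text.
MOD_SHIFTS = {
    "Deamidation (NQ)": 0.984016,
    "Oxidation (M)": 15.994915,
    "Acetyl": 42.010565,
    "Phospho (ST)": 79.966331,
}


def candidate_modifications(mr_modified: float, mr_unmodified: float,
                            tolerance: float = 0.05) -> List[str]:
    """Return modifications whose mass shift explains the difference within tolerance."""
    delta = mr_modified - mr_unmodified
    return [name for name, shift in MOD_SHIFTS.items()
            if abs(delta - shift) <= tolerance]


# Mr(calc) with Deamidation (NQ) + 2 Oxidation (M), and with 2 Oxidation (M) only.
with_deamidation = 2766.27
without_deamidation = 2765.29
candidates = candidate_modifications(with_deamidation, without_deamidation)
```

A deamidation shifts the mass by just under 1 Da, which is close to the spacing between isotope peaks, so assignments of this modification from mass alone need particular care.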
Arginine kinase
There is a cluster of four spots in the middle of the gel which have been matched to arginine
kinase (Figure 7.15). Arginine kinase is thought to be important in protozoans because
it is up-regulated in response to cell stress, fulfilling the same role as creatine kinase in
multicellular eukaryotes, and is a possible target for chemotherapy [244].
The spots marked 575 and 554 are predicted to have undergone deamidation. Deamida-
tion is the conversion of a glutamine residue to glutamate, or asparagine to aspartate, and
it is known to occur during the degradation of proteins [282]. The oxidation positions are
caused by the experimental process and not indicative of a protein’s status in vivo. A
phosphorylation and a deamidation have been detected on the same peptide for spot 535; however,
the e-value is high (expect = 31) and the match may represent an artifact.
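The Observed and Mr(expt) columns in the MS results are related by the charge state of the ion. The sketch below shows the conversion; the charge states are not listed in the results, so inferring them by comparison with Mr(calc) is an assumption, as are the function names.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def neutral_mass(mz: float, charge: int) -> float:
    """Neutral peptide mass Mr from an observed m/z value and a positive charge state."""
    return charge * (mz - PROTON_MASS)


def infer_charge(mz: float, mr_calc: float, max_charge: int = 4) -> int:
    """Pick the charge state whose neutral mass lies closest to the calculated mass."""
    return min(range(1, max_charge + 1),
               key=lambda z: abs(neutral_mass(mz, z) - mr_calc))


# Values from Figure 7.15 for QPPKDFGDLNTLVDVDPEGK: Observed 728.74, Mr(calc) 2183.07.
charge = infer_charge(728.74, 2183.07)
mr_expt = neutral_mass(728.74, charge)
delta = mr_expt - 2183.07  # corresponds to the Delta column
```

Small differences from the tabulated Mr(expt) values are expected, because the observed m/z values are reported to only two decimal places.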
Gel spot labels: 575, 571, 535, 528

Spot 575
Start − End  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Sequence
 99 − 118    728.74    2183.19   2183.07    0.12  1     QPPKDFGDLNTLVDVDPEGK
143 − 151    608.77    1215.53   1215.47    0.06  0     EQYEEMESR Oxidation (M)
152 − 175    922.80    2765.37   2766.27   −0.90  1     VREQLSTMTDDLQGTYYPLSGMTK Deamidation (NQ); 2 Oxidation (M)
154 − 175    837.75    2510.21   2510.12    0.10  0     EQLSTMTDDLQGTYYPLSGMTK 2 Oxidation (M)
176 − 189    588.00    1760.96   1760.87    0.09  0     ETQQQLIDDHFLFK

Spot 535
Query  Obs     Mr(expt)  Mr(calc)  Delta  Miss  Score  Expect  Rank  Peptide
27     608.74  1215.46   1215.47   −0.01  0     30     0.66    1     EQYEEMESR + Oxidation (M)
30     734.36  1466.70   1466.72   −0.03  0     26     1.9     1     SLAGYPFNPCLTK
33     867.40  1732.79   1732.82   −0.03  0     59     0.0013  1     DFGDLNTLVDVDPEGK
35     587.96  1760.87   1760.87   −0.00  0     33     0.46    1     ETQQQLIDDHFLFK
40     728.69  2183.06   2183.07   −0.02  1     29     1.5     1     QPPKDFGDLNTLVDVDPEGK
46     837.69  2510.05   2510.12   −0.06  0     25     4.7     1     EQLSTMTDDLQGTYYPLSGMTK + 2 Oxidation (M)
47     657.58  2626.31   2626.31   −0.00  2     37     0.33    1     VTDKQPPKDFGDLNTLVDVDPEGK
49     922.76  2765.25   2765.29   −0.04  1     37     0.31    1     VREQLSTMTDDLQGTYYPLSGMTK + 2 Oxidation (M)
12     427.74  853.47    854.38    −0.91  0     10     31      5     AVNTIEK + Deamidation (NQ); Phospho (ST)
Figure 7.15: Four spots containing arginine kinase. The MS results for spots 575 and 535 reveal possible modifications. Gel image courtesy of A. Faldas.
Initiation factor
There are four spots that have been strongly matched to eukaryotic initiation factor 5 (Fig-
ure 7.16). Of these, Spot 575 contains both initiation factor and arginine kinase by chance.
A deamidation has been observed for the match to initiation factor protein. Spot 554 has
also been predicted to have undergone deamidation. The spots are all likely to have slight
differences in the chemical sidechains, causing the four different spots to appear. A deami-
dation causes a slight change in mass and an alteration in the charge of the protein but it is
likely that there are other modifications that are not observed in the MS data, which cause
the different spots to appear.
[Gel image with labelled spots 554, 557, 571 and 575, and MASCOT peptide match results:]

Query  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Score  Expect   Rank  Peptide
8      430.80    859.59    859.50    0.09   0     56     0.0012   1     VIDLSVSK
14     688.92    1375.82   1375.77   0.05   0     81     5e−06    1     VSIVALDIFTGNK
21     648.69    1943.05   1943.89   −0.85  0     60     0.0011   1     MEDQAPSTHNVEVPFVK + Deamidation (NQ); Oxidation (M)
31     826.21    3300.83   3301.65   −0.82  1     64     0.00077  1     VSIVALDIFTGNKMEDQAPSTHNVEVPFVK + Deamidation (NQ); Oxidation (M)

Query  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Score  Expect   Rank  Peptide
17     608.77    1215.53   1215.47   0.06   0     30     0.63     1     EQYEEMESR + Oxidation (M)
25     588.00    1760.96   1760.87   0.09   0     50     0.0087   1     ETQQQLIDDHFLFK
27     728.74    2183.19   2183.07   0.12   1     34     0.54     1     QPPKDFGDLNTLVDVDPEGK
35     837.75    2510.21   2510.12   0.10   0     26     3.8      1     EQLSTMTDDLQGTYYPLSGMTK + 2 Oxidation (M)
38     922.80    2765.37   2766.27   −0.90  1     51     0.014    1     VREQLSTMTDDLQGTYYPLSGMTK + Deamidation (NQ); 2 Oxidation (M)

Figure 7.16: There are four spots that match initiation factor 5, of which possible modifications were found for spots 554 and 575. Gel image courtesy of A. Faldas.
7.3.4 Results Summary
The investigations into multiple protein products demonstrate the core functionality of
RAPAD. RAPAD supports the finding and visualisation of spots that have been identified as the
same protein. Additional software was developed alongside RAPAD to determine the range
of peptides that were matched in mass spectrometry results, and to provide a visualisation
of the clusters. The visualisation software highlighted some unusual results for the tubulin
proteins and, coupled with the multiple alignments, should improve annotation of Elongation
Factor sequences. The visualisation of heat shock protein 70 results indicates that there are
at least five different gene sequences from which Hsp70 proteins are expressed in the sample.
The visualisation makes it clear that only very short spans of peptides are present in spots
at the base of the gel, indicating that they are protein fragments. It is an area for future
investigation to determine if these are biologically meaningful, or experimental artifacts.
The analysis demonstrates a strong correlation between spots that are proximally located
and the span of peptide hits, even for spots that are not fragments but probably contain full
length proteins. An investigation was also carried out to verify that sequences annotated as
hypothetical proteins in the genome database were real proteins identified in the proteome
study. Three proteins were analysed in detail. Two of the sequences are likely to be real
proteins, but a definitive function cannot be assigned to them at this time. The third protein
appears to be an ATPase in T. brucei, and the next version of the genome database should
update this annotation. Finally, a search for PTMs within MS data was undertaken and
several potential sites were found. There are major limitations with the method of searching
for PTMs and therefore other experiments are required to confirm modifications. The issues
raised by the results are discussed in Section 7.4.1.
7.4 Discussion
The annotation of an organism’s genome is a major challenge once sequencing is nearing
completion. The usual method to assign functions to newly sequenced genes is to apply
computational methods to find proteins in other organisms that are homologous and have
a functional assignment. After this initial stage, the slow process begins of performing
laboratory investigations to determine the mechanism of action of proteins, and to search
for the proteins that are important in disease. The biological goals of the trypanosome
project are to catalogue all the expressed proteins that can be found by various methods.
In particular, the proteome study is able to verify that genes annotated as “hypothetical
proteins” are expressed in the particular cell line. The project also aims to shed light on the
number of different forms of proteins that are found.
The core functionality of RAPAD has aided the management of large volumes of data
for the proteome investigation. The bulk upload feature enabled the protein identifications
to be moved easily from the spreadsheets that were used previously. This reduces the
overhead of manual data entry, which is time consuming and error prone. The database query
facilities allow the data set to be searched and filtered, which is important for large data
sets. Section 7.2 outlined a series of questions that RAPAD may be able to address; they are
answered here.
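The bulk upload and filtering described above can be sketched as follows. This is a minimal illustration and not the RAPAD implementation: the table name `identification`, its columns, the spreadsheet rows and the score threshold of 50 are all hypothetical, and an in-memory SQLite database stands in for the real schema.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export: one row per protein identification.
SPREADSHEET = """spot_id,protein_name,score
1,protein A,81
1,protein B,26
2,protein A,64
3,protein C,30
"""


def bulk_load(conn, csv_text):
    """Insert every spreadsheet row with a single executemany call."""
    rows = [(int(r["spot_id"]), r["protein_name"], float(r["score"]))
            for r in csv.DictReader(io.StringIO(csv_text))]
    conn.execute("CREATE TABLE IF NOT EXISTS identification "
                 "(spot_id INTEGER, protein_name TEXT, score REAL)")
    conn.executemany("INSERT INTO identification VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def strong_matches(conn, min_score):
    """Return (spot_id, protein_name) pairs at or above a score threshold."""
    cur = conn.execute(
        "SELECT spot_id, protein_name FROM identification "
        "WHERE score >= ? ORDER BY spot_id, protein_name", (min_score,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
loaded = bulk_load(conn, SPREADSHEET)
strong = strong_matches(conn, 50)  # 50 is an arbitrary illustrative cut-off
print(loaded, strong)
```

Loading all rows in one call avoids the row-by-row manual entry that the spreadsheet workflow required, and the threshold query mirrors the separation of strong and weak matches.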
Q. Can the time and labour to identify proteins be reduced?
The RAPAD Querier allows researchers to verify which proteins have been strongly or weakly
matched, and there is a facility for loading very large amounts of protein data into the
system in bulk. However, in the current implementation there is no automated pipeline for moving
raw mass spectrometry data to the MASCOT server, and placing the results of searches in
RAPAD. This feature will be considered for the next version of the database (Section 7.4.4).
Q. How many different proteins can be identified from 2500 spots?
At present the system holds almost 1000 identified proteins for 650 spots, across three
gels. The combination of data across replicate gels was discussed in the previous chapter,
and the system should therefore scale easily to 2500 spots and many more.
Q. How widespread and common are post-translational modifications?
The additional investigations into the causes of multiple spots that match the same protein
demonstrate that post-translational modifications (PTMs) are very common for some pro-
teins. The software was also able to demonstrate that many of the spots near the base of the
gel are almost certainly fragments of proteins, and are not caused by PTMs. A search of MS
results to confirm types of modification did not reveal any significant results, demonstrating
that more biological investigations are required.
Q. How can we improve the T. brucei genome annotation?
The interface for publishing data allows the genome database to connect to records in
RAPAD, which verify the existence of proteins. The analysis reported in this chapter will aid
the annotation of several groups of genes, summarised in Section 7.4.1.
Q. Can we build a “point and click” virtual 2D gel?
The Gel Viewer provides this facility by dynamically linking the spots on the gel to individual
records for each protein. The results of complex queries in RAPAD can also be visualised in
the Gel Viewer, providing a system for data analysis and management that is more powerful
than the facilities offered by commercial image analysis applications.
Q. Can we build pages that give original MS data interpretations?
One feature that has not been employed at this time in RAPAD is to automate repeated
searching of MS data, for example to search for different types of PTM that could be found
within the data. A number of searches have been performed manually in MASCOT to find
modifications on peptides; however, very few positive results have been obtained. There may
therefore be limited benefit in implementing an automated search at this time. The graphic
showing peptide hits within protein sequences is a novel visualisation of MS results, and is
discussed further in Section 7.4.2.
The database does not store raw MS data in the present implementation, for two reasons:
the files are large, and the raw data is in a proprietary format that can only be interpreted
with software that is installed on a few terminals. The latter is a major drawback for
re-analysis of data. The next version of RAPAD may include an automated system for analysing
MS data, similar to the SASHIMI software [278] developed at the Institute for Systems Biology,
in the proteomics group headed by Ruedi Aebersold. SASHIMI is open source software that
aims to improve the downstream analysis of MS data. It comprises an application that
converts raw MS data, from any of the instruments that are available, into a single XML-based
format that can be analysed with a number of software packages to standardise the
identification of proteins.
7.4.1 Improving the annotation of genes
The additional investigations identified several sequences in the genome database that may
have been incorrectly predicted. The study also discovered that there are several different
proteins with highly related, but not identical, sequences that share the same protein name.
It is likely that the protein families were formed by relatively recent gene duplication events,
and the function of these protein families may be redundant. However, it is also possible that
different members of the family perform slightly different roles. For example, the finding that
up to five different proteins, annotated as heat shock protein 70, are strongly expressed raises
the possibility that all of the different forms are functionally significant. It is believed that
Hsps may be important when trypanosomes infect the mammalian host, so it matters that the
individual forms can be distinguished, and the current naming strategy for these proteins is
inadequate. As gene annotation is not yet finalised for T. brucei, we believe that the Hsp
genes should have a suffix on the name that uniquely identifies each one, for example the
chromosome position of each gene, plus a letter if more than one sequence resides on the same
arm of a chromosome, e.g. Hsp70 (11 p a).
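The proposed naming scheme can be sketched in a few lines. The function below is an illustration under stated assumptions and is not an agreed nomenclature: the gene identifiers and chromosome locations are hypothetical, letters are assigned in input order, and the sketch only covers up to 26 genes per chromosome arm.

```python
from collections import defaultdict


def assign_suffixes(genes):
    """genes: list of (gene_id, chromosome, arm) tuples in a fixed order.

    Returns {gene_id: label}. A letter is appended only when more than one
    gene resides on the same arm of a chromosome.
    """
    by_arm = defaultdict(list)
    for gene_id, chromosome, arm in genes:
        by_arm[(chromosome, arm)].append(gene_id)
    labels = {}
    for (chromosome, arm), ids in by_arm.items():
        for index, gene_id in enumerate(ids):
            letter = f" {chr(ord('a') + index)}" if len(ids) > 1 else ""
            labels[gene_id] = f"Hsp70 ({chromosome} {arm}{letter})"
    return labels


# Hypothetical gene identifiers and locations, for illustration only.
genes = [("geneX", 11, "p"), ("geneY", 11, "p"), ("geneZ", 7, "q")]
labels = assign_suffixes(genes)
print(labels)
```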
The analysis reveals that most proteins near the bottom of the gel, in the low molecular
weight range, have very short spans of peptides matched, indicating that these proteins are
likely to be fragments generated either experimentally or in vivo. It is possible that these
spots do not have great biological significance. The visualisation software highlighted an
unexpected result for both β-tubulin and α-tubulin. It was observed that several spots near
the base of the gel, which would be predicted to have a low molecular weight, matched
peptides from the two ends of the protein sequence. There are several possible explanations
for this result, one of which is that splicing occurs at the level of mRNA, resulting in a
protein that is formed from the two ends of the gene sequence. The evidence presented
here is far from sufficient to confirm this hypothesis, and how these spots arose remains
open for discussion. Additional experiments are required to investigate the result, for example
by performing MS/MS to sequence as many peptides as possible to determine the exact
constituents of the protein spot.
An interesting finding from the visualisation is that proteins of the same name, in the
same region of the gel, tend to have a similar span of peptide hits. It might be expected that
the distribution of peptides matched from the same protein would be fairly similar for all
spots regardless of their position on the gel, only subject to random variation in ionisation
and detection of peptides. It is known that certain chemical modifications, such as
phosphorylation, cause peptides to ionise less well in MS. It could therefore be expected that
spots with the same span of peptide hits share a set of modifications to the peptides
that are detected by MS. Small differences in the range of peptides matched between spots
located near each other could indicate the loss or gain of a modification. For example, if pro-
tein A matches peptides covering the range 50-80 amino acids in the sequence, and protein
B matches peptides covering 50-95, this may indicate that protein A has an additional phos-
phate group on the peptide from position 81-95, preventing its detection by MS. However,
there is also a technical explanation for the correlation between peptide span and spot posi-
tion. Spots closely co-located are more likely to have been included in the same MS run and,
as ionisation efficiency is highly variable, spots on the same MALDI plate may be subject to
more similar ionisation conditions. This area requires further investigation. For example,
experiments with radioactive or fluorescently labelled phosphates could determine the
phosphorylation status of protein spots, and the visualisation software could then be used to
verify whether the span of peptides matched is related to modification status. Our
results indicate that there is a high correlation between the peptides detected by MS and a
protein’s position on a gel, and as far as we are aware, this has not been previously reported.
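The comparison of peptide spans between spots can be sketched as below. This is an illustrative reconstruction and not the visualisation software itself: the function names are hypothetical, each peptide is located by its first exact occurrence in the protein sequence, and only a difference at the C-terminal end of the span is reported.

```python
def peptide_span(protein_seq, peptides):
    """Return the 1-based (start, end) range covered by the matched peptides,
    or None if no peptide occurs in the sequence."""
    positions = []
    for peptide in peptides:
        start = protein_seq.find(peptide)  # first exact occurrence only
        if start >= 0:
            positions.append((start + 1, start + len(peptide)))
    if not positions:
        return None
    return min(s for s, _ in positions), max(e for _, e in positions)


def unmatched_region(span_a, span_b):
    """Residues at the C-terminal end covered by span_b but not by span_a."""
    if span_b[1] > span_a[1]:
        return span_a[1] + 1, span_b[1]
    return None


# The example from the text: protein A covers 50-80, protein B covers 50-95.
region = unmatched_region((50, 80), (50, 95))
print(region)

# Toy sequence and peptides, invented purely to exercise peptide_span.
toy_span = peptide_span("MKTAYIAKQRQISFVKSHFSRQ", ["TAYIAK", "ISFVK"])
print(toy_span)
```

For the example in the text, the region returned is positions 81 to 95, the peptide on which protein A may carry the additional phosphate group.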
7.4.2 Visualisation issues in the life sciences
In general, the visualisation of life sciences data requires significant further research, and
there are few examples of published work concerning investigations into best practice for
visualising large data sets. Software for biomedical applications is often created without
developers applying standard guidelines for graphical user interface design, leading to the
generation of systems that are not intuitive for users.
The visualisation of the span of peptide hits is a new method for viewing mass spectrom-
etry results on a 2-D gel. A similar approach could be adopted to view microarray results,
such as displaying the extent of the hybridization signal for different probes within each
feature on the array. The other use of the Gel Viewer reported in Section 7.3.2, in which
different colours are used to display the clusters of spots that match the same protein, is a
standard method for summarising complex data, and could potentially be used to display a
variety of functional genomics data.
The visualisation software displaying the span of peptide hits will be included in the next
release of RAPAD and could be adapted to show other facets of the mass spectrometry
results. The height of the bar could be used to indicate the e-value or score assigned by the
search software. The software could also be adapted to display different proteins that have
been matched to the same gel spot, using different shading of bands on the spot label.
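One way in which bar height could encode the e-value is sketched below. The mapping is an assumption made for illustration and is not a feature of the existing software. It scales the height with the negative base-10 logarithm of the expect value, so that smaller e-values give taller bars, and the maximum height of 40 pixels and the floor of 1e-6 are arbitrary choices.

```python
import math


def bar_height(expect, max_height=40.0, floor=1e-6):
    """Map an expect value to a bar height in pixels.

    Heights scale with -log10(expect) and are clamped to [0, max_height];
    expect values of 1 or more give a height of zero.
    """
    expect = max(expect, floor)
    strength = -math.log10(expect)       # for example, 0.001 gives 3
    full_scale = -math.log10(floor)      # strength that reaches max_height
    return max(0.0, min(max_height, max_height * strength / full_scale))


for e in (4.7, 0.66, 0.0013, 5e-06):
    print(e, round(bar_height(e), 1))
```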
7.4.3 Analysis of modifications
On the 2-D gel, there are several clusters of spots that match the same protein, which are
likely to be the result of different PTMs causing slight changes in the mass or charge of the
protein. A search was performed on MS data to confirm the modifications but this only
revealed a few results that are not highly significant. The main problems are that many
proteins are identified by only a small proportion of the total peptides in the sequence and
the majority of the modifications will not be observed. This issue was discussed in more
detail in the previous chapter but the additional visualisations presented in this chapter
could ultimately help to find and display modifications. The graphic showing the peptides
that have been matched could be modified to display more detailed information. For example,
the labelling bars next to spots could display the peptides that have been matched along the
length of the protein, together with possible modification sites along the protein sequence,
obtained from an in silico analysis of the sequence or from a database of known
modifications. If a particular peptide that was detected in one spot and not in another had
a known modification site, this could provide evidence that the peptide had been modified in
one of the spots. The major hindrance to this effort is that there are no major databases of
modifications, even though a very large amount of research identifying modifications has
been performed over several decades. It is hoped that the
future integration of RAPAD into GUS will allow researchers to publish and distribute data
about modifications to the wider research community.
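The reasoning above, that a peptide missing from one spot and carrying a known modification site is a candidate for modification, can be sketched as follows. The function name, the spot labels, the peptide ranges and the site positions are hypothetical, and in practice the sites would come from an in silico prediction or a database of known modifications.

```python
def candidate_modified_peptides(peptides_by_spot, known_sites):
    """Find peptides that are absent from some spots and contain a known site.

    peptides_by_spot: {spot_id: set of (start, end) peptide ranges, 1-based}.
    known_sites: set of residue positions with a known or predicted modification.
    Returns a list of (peptide, spots_missing_it, sites_within_peptide).
    """
    all_peptides = set().union(*peptides_by_spot.values())
    results = []
    for peptide in sorted(all_peptides):
        missing = sorted(spot for spot, peps in peptides_by_spot.items()
                         if peptide not in peps)
        sites = sorted(p for p in known_sites if peptide[0] <= p <= peptide[1])
        if missing and sites:
            results.append((peptide, missing, sites))
    return results


# Hypothetical data: spot B matches one more peptide than spot A.
spots = {"A": {(50, 65), (66, 80)}, "B": {(50, 65), (66, 80), (81, 95)}}
hits = candidate_modified_peptides(spots, known_sites={88, 120})
print(hits)
```

A hit of this kind is only circumstantial evidence, since a peptide may also be missed through random variation in ionisation and detection.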
7.4.4 Future work
The proteome map of T. brucei comprises 2-DE and MS derived data. It is planned that
other techniques, such as LC-MS (reported in Chapter 1), will be used to generate even
greater volumes of protein data. The RAPAD database schema has capabilities for storing
this kind of information but the web pages have not yet been created for data entry or the
visualisation of results. A major issue will be the integration of this data with the gel based
studies. In the near future the data must also be made available to the curators of GeneDB
to enable improvements to the annotation of genes. The long term goals are to integrate
the proteome part of RAPAD into GUS, which will enable the proteome data to be stored
directly within GeneDB.
7.5 Conclusions
The core functionality of RAPAD has greatly improved the data management facilities for
the Trypanosoma brucei proteome project by enabling queries over the large data set to
find proteins of interest. Additional investigations have been performed on several groups
of proteins that appear abundantly on 2-D gels, for which the genome annotation is poor.
The results demonstrate one way in which experimental data, coupled with bioinformatics
analysis, can find protein sequences that have been incorrectly predicted. The visualisation
of results in new ways could be applied to proteome data from any organism and would aid
the annotation of newly sequenced genomes. The large data set generated by the T. brucei
investigation also demonstrates the scalability of the current implementation of RAPAD.
Appendix: Alignment of Hsp70 sequences
A multiple alignment has been performed with ClustalW on twelve sequences predicted to
match heat shock protein 70. Five sequences are from the MS results matched by proteins
in the T. brucei proteome map, and seven sequences are from the current version of the T.
brucei genome database (Section 7.3.1). Tb09.160.3090 and Tb07.29K4.60 are considerably
longer than the other sequences, and align poorly. They may therefore have been incorrectly
predicted or, if correctly predicted, should be named as a different heat shock protein, e.g.
Hsp80.
Tb11.01.3110 -------------------------------MTYEGAIGIDLGTTYSCVG 19
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 -------------------------------MTYEGAIGIDLGTTYSCVG 19
TRYPtp30n4hh05.p1k_3 -------------------------------MTYEGAIGIDLGTTYSCVG 19
Tb11.01.3120 -------------------------------MTYEGAIGIDLGTTYSCVG 19
TRYP_xi-1015g04.q1k_13 ---MSRMWLTTAAVFLTVTVAAVSAAPESGGKVEAPCVGIDLGTTYSVVG 47
Tb07.29K4.620 --------------------------------MPAPAIGIDLGTTYSCVG 18
125.m00218 --------------------------------MPAPAIGIDLGTTYSCVG 18
Tb11.01.3080 -------------------------------MTYEGAIGIDLGTTYSCVG 19
92.m00252 --------MLARRVCAPMCLASAPFARWQSSKVTGDVIGIDLGTTYSCVA 42
Tb07.29K4.60 MQHAVEIEAKRRVELDEATRARYVVVKEETRASGDRVIGIDLGTTNSCIS 50
Tb09.160.3090 -----MLCLAQWALLLVLCLVGCCCTVSGGSEVLAVDIGADWAKGATRVI 45
Tb11.01.3110 VWQN--ERVEIIANDQGNRTTPSYVAFTDSE-----------RLIGDAAK 56
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 VWQN--ERVEIIANDQGNRTTPSYVAFTDSE-----------RLIGDAAK 56
TRYPtp30n4hh05.p1k_3 VWQN--ERVEIIANDQGNRTTPSYVAFTDSE-----------RLIGDAAK 56
Tb11.01.3120 VWQN--ERVEIIANDQGNRTTPSYVAFTDSE-----------RLIGDAAK 56
TRYP_xi-1015g04.q1k_13 VWQK--GDVHIIPNEMGNRITPSVVAFTDTE-----------RLIGDGAK 84
Tb07.29K4.620 VFKN--DQVEIVANDQGNRTTPSYVSFSETE-----------RLVGDAAK 55
125.m00218 VFKN--DQVEIVANDQGNRTTPSYVSFSETE-----------RLVGDAAK 55
Tb11.01.3080 VWQN--ERVEIIANDQGNRTTPSYVAFVNNE-----------VLVGDAAK 56
92.m00252 VMEG--DRPRVLENTEGFRTTPSVVAFKGQE-----------KLVGLAAK 79
Tb07.29K4.60 YIDKKTNRPKIIPSPTGSWVFPTAITFDKSHKV---------RLYGEEAR 91
Tb09.160.3090 GGST-APRASIVLNDQTNRKSPQCIAFRIVPNAGNDTLRSVERLFAEEAR 94
Tb11.01.3110 NQVAMNPTNTVFDAKRLIGRKFSDSVVQ---------------------S 85
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 NQVAMNPTNTVFDAKRLIGRKFSDSVVQ---------------------S 85
TRYPtp30n4hh05.p1k_3 NQVAMNPTNTVFDAKRLIGRKFSDSVVQ---------------------S 85
Tb11.01.3120 NQVAMNPTNTVFDAKRLIGRKFSDSVVQ---------------------S 85
TRYP_xi-1015g04.q1k_13 NQLPQNPHNTIYTIKRLIGRKYTDAAVQ---------------------A 113
Tb07.29K4.620 NQVAMNPTNTVFDAKRIIGRKYDDPDLQ---------------------A 84
125.m00218 NQVAMNPTNTVFDAKRIIGRKYDDPDLQ---------------------A 84
Tb11.01.3080 NHAARGSNGVIFDAKRLIGRKFSDSVVQ---------------------S 85
92.m00252 RQAITNPQSTFFAVKRLIGRRFDDEHIQ---------------------H 108
Tb07.29K4.60 ACVRTSASATLCSGKRLIGRGVGELGRV---------------------Q 120
Tb09.160.3090 SLEPRFPQQSICGPSLLAGLIVSKEISAGQKHHEQTGNQRSEREGVISFS 144
Tb11.01.3110 DMKHWPFKVVTKGDDKPVIQVQFRG--------ETKTFNPEEISSMVLLK 127
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 DMKHWPFKVVTKGDDKPVIQVQFRG--------ETKTFNPEEISSMVLLK 127
TRYPtp30n4hh05.p1k_3 DMKHWPFKVVTKGDDKPVIQVQFRG--------ETKTFNPEEISSMVLLK 127
Tb11.01.3120 DMKHWPFKVVTKGDDKPVIQVQFRG--------ETKTFNPEEISSMVLLK 127
TRYP_xi-1015g04.q1k_13 DKKLLSYEVIADRDGKPKVQVMVGG--------KKKQFTPEEISAMVLQK 155
Tb07.29K4.620 DMKHWPFKVTVK-EGKPVVEVEYQG--------ERRTFFPEEISAMVLQK 125
125.m00218 DMKHWPFKVTVK-EGKPVVEVEYQG--------ERRTFFPEEISAMVLQK 125
Tb11.01.3080 DMKHWPFKVEEGEKGGAVMRVEHLG--------EGMLLQPEQISARVLAY 127
92.m00252 DIKNVPYKIIRSNNGDAWVQ---DG--------NGKQYSPSQVGAFVLEK 147
Tb07.29K4.60 SQLHKTNMVTLNERGEVAVEIM------------GRTYTVTHIIAMFLRY 158
Tb09.160.3090 DTDRFTYVVVPQIRRKSAVVRITPGGSSEGTTTAPIEFTVEELIGMILGH 194
Tb11.01.3110 MKEVAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 MKEVAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176
TRYPtp30n4hh05.p1k_3 MKEVAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176
Tb11.01.3120 MKEVAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176
TRYP_xi-1015g04.q1k_13 MKEIAETYLG-EKVKNAVVTVPAYFNDAQRQSTKDAGTIAGLNVVRIINE 204
Tb07.29K4.620 MKEIAESYLG-EKVSKAVVTVPAYFNDSQRQATKDAGSIAGLEVLRIVNE 174
125.m00218 MKEIAESYLG-EKVSKAVVTVPAYFNDSQRQATKDAGSIAGLEVLRIVNE 174
Tb11.01.3080 LKSCAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176
92.m00252 MKETAENFLG-RKVSNAVVTCPAYFNDAQRQATKDAGTIAGLNVIRVVNE 196
Tb07.29K4.60 LKKEAEKFLK-EPVNAVVVSVPAFFTPQQKVATEDAALAAGFDVLEVIDE 207
Tb09.160.3090 MKRSAERSLDGAPVRHLVLVVPTSSSLAYRQAMVDAAAVVGLRTIRLVHG 244
Tb11.01.3110 PTAAAIAYGLDK----------ADEGKERNVLIFDLGGGTFDVTLLTIDG 216
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 PTAAAIAYGLDK----------ADEGKERNVLIFDLGGGTFDVTLLTIDG 216
TRYPtp30n4hh05.p1k_3 PTAAAIAYGLDK----------ADEGKERNVLIFDLGGGTFDVTLLTIDG 216
Tb11.01.3120 PTAAAIAYGLDK----------ADEGKERNVLIFDLGGGTFDVTLLTIDG 216
TRYP_xi-1015g04.q1k_13 PTAAAIAYGLNK----------AGE---KNILVFDLGGGTFDVSLLTIDE 241
Tb07.29K4.620 PTAAAIAYGMDR----------SSEGAMKTVLIFDLGGGTFDVTLLNIDG 214
125.m00218 PTAAAIAYGMDR----------SSEGAMKTVLIFDLGGGTFDVTLLNIDG 214
Tb11.01.3080 PTAAAIAYGLDK----------ADEGKERNVLVFDFGGGTFDVSIISVSG 216
92.m00252 PTAAALAYGLDK----------TKDS---LIAVYDLGGGTFDISVLEIAG 233
Tb07.29K4.60 PSAACLAHTVLQPSNASSREHLSGSKRIVRSLVFDLGGGTLDCAVMENDR 257
Tb09.160.3090 SAAAATQLAHLNTETLFRG-HPSNTTERKYAMIYDMGSSKTEVAVFRFTP 293
Tb11.01.3110 -------GIFEVKATNGDTHLGGEDFDNRLVAHFTEEFKRKN-------- 251
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 -------GIFEVKATNGDTHLGGEDFDNRLVAHFTEEFKRKN-------- 251
TRYPtp30n4hh05.p1k_3 -------GIFEVKATNGDTHLGGEDFDNRLVAHFTEEFKRKN-------- 251
Tb11.01.3120 -------GIFEVKATNGDTHLGGEDFDNRLVAHFTEEFKRKN-------- 251
TRYP_xi-1015g04.q1k_13 -------GFFEVVATNGDTHLGGEDFDNNMMRHFVDMLKKK--------- 275
Tb07.29K4.620 -------GLFEVRATAGDTHLGGEDFDSRLVDYFATEFRTR--------- 248
125.m00218 -------GLFEVRATAGDTHLGGEDFDSRLVDYFATEFRTR--------- 248
Tb11.01.3080 -------GVFEVKATNGDTHLGGEDVDAALLEHALADIRNRY-------- 251
92.m00252 -------GVFEVKATNGDTHLGGEDFDLCLSDHILEEFRKT--------- 267
Tb07.29K4.60 R-----RGTFTLVATHGDPLLGGNDWDAVLSQHFSDQFERKWR----VPL 298
Tb09.160.3090 ATARDDFGTVTLVASATNHTLGGRSFDRCLARYVERNLFPAAKPTPVTPV 343
Tb11.01.3110 --KGKDLSSNLRALRRLRTACERAKRTLSSAAQATIEIDALF-------E 292
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 --KGKDLSSNLRALRRLRTACERAKRTLSSAAQATIEIDALF-------E 292
TRYPtp30n4hh05.p1k_3 --KGKDLSSNLRALRRLRTACERAKRTLSSAAQATIEIDALF-------E 292
Tb11.01.3120 --KGKDLSSNLRALRRLRTACERAKRTLSSAAQATIEIDALF-------E 292
TRYP_xi-1015g04.q1k_13 --KNVDISKDQKALARLRKACEAAKRQLSSHPEARVEVDSLT-------E 316
Tb07.29K4.620 --TGKDLRGNARAMRRLRTACERVKRTLSSSASTNIEIDALY-------E 289
125.m00218 --TGKDLRGNARAMRRLRTACERVKRTLSSSASTNIEIDALY-------E 289
Tb11.01.3080 --GIEQGSLSQKMLSKLRSRCEEVKRVLSHSTVGEIALDGLLP------D 293
92.m00252 --SGIDLSKERMALQRIREAAEKAKCELSTTMETEVNLPFITAN---QDG 312
Tb07.29K4.60 EDAEGNVGQGVATYRQLLLEAEKAKIHFTHSTEPYYGYNRAFHFSEKLRD 348
Tb09.160.3090 LDRKPVTATTRRAVVSLLRAVNAARERLSVNQNVPFVVPGVRE------D 387
Tb11.01.3110 NIDFQATITRARFEELCGDLFRGTLQPVERVLQDAKMDKRAVHDVV---L 339
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 NIDFQATITRARFEELCGDLFRGTLQPVERVLQDAKMDKRAVHDVV---L 339
TRYPtp30n4hh05.p1k_3 NIDFQATITRARFEELCGDLFRGTLQPVERVLQDAKMDKRAVHDVV---L 339
Tb11.01.3120 NIDFQATITRARFEELCGDLFRGTLQPVERVLQDAKMDKRAVHDVV---L 339
TRYP_xi-1015g04.q1k_13 GFDFSEKITRAKFEELNMDLFKGTLVPVQRVLEDAKLKKSDIHEIV---L 363
Tb07.29K4.620 GFDFFSKITRARFEEMCRDQFERCLEPVRKVLKDAEVDASAVDDVV---L 336
125.m00218 GFDFFSKITRARFEEMCRDQFERCLEPVRKVLKDAEVDASAVDDVV---L 336
Tb11.01.3080 GEEYVLKLTRARLEELCTKIFARCLSVVQRALKDASMKVEDIEDVV---L 340
92.m00252 AQHVQMMVSRSKFESLADKLVQRSLGPCKQCIKDAAVDLKEISEVV---L 359
Tb07.29K4.60 IVPLEATLTLEEYIELTRPLRVRCVECLNKLFDHTSIRPADIDNVL---L 395
Tb09.160.3090 GGDFIANISRAQFEEACGELFNEAVRLRDHAITQTNGTVRSLNELVRLEL 437
Tb11.01.3110 VGGSTRIPKVMQLVSDFFGGKELNKSINPDE-AVAYGAAVQAFILTGG-- 386
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 VGGSTRIPKVMQLVSDFFGGKELNKSINPDE-AVAYGAAVQAFILTGG-- 386
TRYPtp30n4hh05.p1k_3 VGGSTRIPKVMQLVSDFFGGKELNKSINPDE-AVAYGAAVQAFILTGG-- 386
Tb11.01.3120 VGGSTRIPKVMQLVSDFFGGKELNKSINPDE-AVAYGAAVQAFILTGG-- 386
TRYP_xi-1015g04.q1k_13 VGGSTRVPKVQQLISDFFGGKELNRGINPDE-AVAYGAAVQAAVLTG--- 409
Tb07.29K4.620 VGGSTRIPRVQQLVQNFFNGKEPNRSINPDE-AVAYGAAVQAHIVSGG-- 383
125.m00218 VGGSTRIPRVQQLVQNFFNGKEPNRSINPDE-AVAYGAAVQAHIVSGG-- 383
Tb11.01.3080 VGGSSRIPAVQAQLRELFRGKQLCSSVHPDE-AVAYGAAVQAHVLSGGYG 389
92.m00252 VGGMTRMPKVVEAVKQFFG-REPFRGVNPDE-AVALGAATLGGVLRGD-- 405
Tb07.29K4.60 VGAMTRDPPIRHLLTEYFGRHVESEASCPADYAVAIGAAVRGAMLQGGFD 445
Tb09.160.3090 IGGATRMPKLQERLSEGYG-KPADRTLNSDEAVVSGAALMIHDTLSRIRV 486
Tb11.01.3110 KSKQTEGLLLLDVAPLTLG------IETAGGVMTALIKRNTTIPTKKSQI 430
TRYPtp2h24gd03.q1k_1 --------------------------------------------------
Tb11.01.3130 KSKQTEGLLLLDVAPLTLG------IETAGGVMTALIKRNTTIPTKKSQI 430
TRYPtp30n4hh05.p1k_3 KSKQTEGLLLLDVAPLTLG------IETAGGVMTALIKRNTTIPTKKSQI 430
Tb11.01.3120 KSKQTEGLLLLDVAPLTLG------IETAGGVMTALIKRNTTIPTKKSQI 430
TRYP_xi-1015g04.q1k_13 ESEVGGRVVLVDVIPLSLG------IETVGGVMTKLIERNTQIPTKKSQV 453
Tb07.29K4.620 KSKQTKDLLLVDVTPLSLG------VETAGGVMSVLIPRNTSVPAQKSQT 427
125.m00218 KSKQTKDLLLVDVTPLSLG------VETAGGVMSVLIPRNTSVPAQKSQT 427
Tb11.01.3080 ESSRTAGIVLLDVVPLSIG------VEVDDGKFDVIIRRNTTIPYLATKE 433
92.m00252 ----VKGLVLLDVTPLSLG------IETLGGVFTRMIPKNTTIPTKKSQT 445
Tb07.29K4.60 DLLSNTRFVTGTAQALKQGGFLRRCCNRIGSLVSSSVNPNAIGQRWRGRA 495
Tb09.160.3090 MESLTNDIYFTASPPIKES------NETKPHRNLLFAKRNTTVPAARSLI 530
Tb11.01.3110 FS--------TYSD-----NQPGVHIQVFEGERTMTKDCHLLGTFDLSGI 467
TRYPtp2h24gd03.q1k_1 -----------------------------------------MGTFDLSGI 9
Tb11.01.3130 FS--------TYSD-----NQPGVHIQVFEGERTMTKDCHLLGTFDLSGI 467
TRYPtp30n4hh05.p1k_3 FS--------TYSD-----NQPGVHIQVFEGERTMTKDCHLLGTFDLSGI 467
Tb11.01.3120 FS--------TYSD-----NQPGVHIQVFEGERTMTKDCHLLGTFDLSGI 467
TRYP_xi-1015g04.q1k_13 FS--------THAD-----NQPGVLIQVYEGERQLTKDNRLLGKFELSGI 490
Tb07.29K4.620 FS--------TNAD-----NQRSVEIKVYEGERPLVSQCQCLGTFTLTDI 464
125.m00218 FS--------TNAD-----NQRSVEIKVYEGERPLVSQCQCLGTFTLTDI 464
Tb11.01.3080 YS--------TVDD-----NQSEVEIQVFEGERPLTRHNHRLGSFVLDGI 470
92.m00252 FS--------TAAD-----NQTQVGIKVFQGEREMASDNQMMGQFDLVGI 482
Tb07.29K4.60 KG--------LSDEEIANYAKELVEFEAACDRRLLLERAENDANFVMRRV 537
Tb09.160.3090 FPNRTADFTLTLHDGNGRYSRSVLVSGVSGSMNAAREKEKEMSTERANKV 580
. :
Tb11.01.3110 PP------------------------------------------------ 469
TRYPtp2h24gd03.q1k_1 PP------------------------------------------------ 11
Tb11.01.3130 PP------------------------------------------------ 469
TRYPtp30n4hh05.p1k_3 PP------------------------------------------------ 469
Tb11.01.3120 PP------------------------------------------------ 469
TRYP_xi-1015g04.q1k_13 PP------------------------------------------------ 492
Tb07.29K4.620 PP------------------------------------------------ 466
125.m00218 PP------------------------------------------------ 466
Tb11.01.3080 TP------------------------------------------------ 472
92.m00252 PP------------------------------------------------ 484
Tb07.29K4.60 TADSSKRQGMQEKRVRQLSEQLKFWQYMVHNFHDHEDELLRTVRELEQAL 587
Tb09.160.3090 TKTS---------------------------------------------- 584
.
Tb11.01.3110 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 513
TRYPtp2h24gd03.q1k_1 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 55
Tb11.01.3130 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 513
TRYPtp30n4hh05.p1k_3 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 513
Tb11.01.3120 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 513
TRYP_xi-1015g04.q1k_13 ------AARGVPQIEVTFDVDENSILQVSAMDKSSGKKEEITITNDKGRL 536
Tb07.29K4.620 ------APRGKPRITVSFDVNVDGILVVTAVEETAGKTQAITISNDKGRL 510
125.m00218 ------APRGKPRITVSFDVNVDGILVVTAVEETAGKTQAITISNDKGRL 510
Tb11.01.3080 ------AKHGEPTITVTFSVDADGILTVTAAEELGSVTKTLVVENSE-RL 515
92.m00252 ------APRGVPQIEVTFDIDANGICHVTAKDKATGKTQNITITAHG-GL 527
Tb07.29K4.60 DELEGLAEDNTSGLTTAGTVDFSSVTPVNHCEEEERDCSSVSAASRSAQL 637
Tb09.160.3090 ------VVLRQVEVVVEVVLSRSGLPYVAGSYVHARYAEQVTVLPSVKKT 628
. : . :. ..: * . :
Tb11.01.3110 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 545
TRYPtp2h24gd03.q1k_1 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 87
Tb11.01.3130 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 545
TRYPtp30n4hh05.p1k_3 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 545
Tb11.01.3120 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 545
TRYP_xi-1015g04.q1k_13 SEEEIERMVREAAEFEDEDRKV------------------RERVDARNSL 568
Tb07.29K4.620 SREQIDKMVAEAEKFAEEDRAN------------------AEKIEARNSV 542
125.m00218 SREQIDKMVAEAEKFAEEDRAN------------------AEKIEARNSV 542
Tb11.01.3080 TSEEVQKMIEVAQKFALTDATA------------------LARMEATERL 547
92.m00252 TKEQIENMIRDSEMHAEADRVK------------------RELVEVRNNA 559
Tb07.29K4.60 RTAHGDGKLKERTQDEEGEKPKGRKIMRRAVPLPRASAEAQELVEAGHPA 687
Tb09.160.3090 GDNETTAQKDENNNPSQNETDTTSTIS-----------PGREKRSGGSPS 667
. : .
Tb11.01.3110 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 582
TRYPtp2h24gd03.q1k_1 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 124
Tb11.01.3130 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 582
TRYPtp30n4hh05.p1k_3 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 582
Tb11.01.3120 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 582
TRYP_xi-1015g04.q1k_13 ESVAYSLRNQVNDKDKLGGKLDPNDKAAVETAVAEAIR------------ 606
Tb07.29K4.620 ENYTFSLRSTLSDPD-VQQNISQEDQQKIQTVVNAVVN------------ 579
125.m00218 ENYTFSLRSTLSDPD-VQQNISQEDQQKIQTVVNAVVN------------ 579
Tb11.01.3080 TQWFDRLEAVMETVPQPYSEKLQKRIAFLPHGKEWVGT------------ 585
92.m00252 ETQANTAERQLTEWK----YVTDAEKENVRTLLAELRK------------ 593
Tb07.29K4.60 LRGADVSMTESTRSAFFEAQVEERAWREPPTPPGEHGS------------ 725
Tb09.160.3090 AANSNSAKMQNSRADEAKENETPTGDEILEVNERDAGTGGKNNNAKVRHF 717
Tb11.01.3110 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAGGMPG 628
TRYPtp2h24gd03.q1k_1 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAG---- 166
Tb11.01.3130 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAGGMPG 628
TRYPtp30n4hh05.p1k_3 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAGGMPG 628
Tb11.01.3120 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAG---- 624
TRYP_xi-1015g04.q1k_13 --FLDENPNAEKEEYKTALET--LQSVTNPIIQKTYQSAGGGDKPQ---- 648
Tb07.29K4.620 --WLDENRDATKEEYDAKNKE--IEQVAHPILSAYYVKRAMEQAPP---- 621
125.m00218 --WLDENRDATKEEYDAKNKE--IEQVAHPILSAYYVKRAMEQAPP---- 621
Tb11.01.3080 --QLHTYTDAASIEAKVAKIERLAKRALKSARREGKDGWAPGNEDNGSGD 633
92.m00252 --VME-NPNVTKDELSASTDK--LQKAVMECGRTEYQQAAAANSGS---- 634
Tb07.29K4.60 --WQEVKRAVDAGEPVGSPIG--LQELQRPMTHEEMLQVLNNIAPIDDPV 771
Tb09.160.3090 ALRFPLNNTPAPSSTSQGGVNMNKEEALAARNRLRALQRLDDERLRRSGL 767
. :
Tb11.01.3110 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 661
TRYPtp2h24gd03.q1k_1 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 199
Tb11.01.3130 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 661
TRYPtp30n4hh05.p1k_3 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 661
Tb11.01.3120 ----GMPGRYARWYARRNGWWDGRRCGIVRA------------------- 651
TRYP_xi-1015g04.q1k_13 ---------------------------PMDDL------------------ 653
Tb07.29K4.620 ---------------APPSGEGEGNAPVPDDVD----------------- 639
125.m00218 ---------------APPSGEGEGNAPVPDDVD----------------- 639
Tb11.01.3080 DNDGDDNSDEDDELQRGRGVTEGSGRPPIRKRDRIEAINANTE------- 676
92.m00252 ---------------SGSSSTEGQGEQQQQQASGEKKE------------ 657
Tb07.29K4.60 SEEHARKRDHSIDMRTMTIVEGAVDMVALQELLEEEAKRAEELQRAQKKG 821
Tb09.160.3090 LNDVESLLLHYKSLDAWSAQQSDDNSNDWRSVVKDVSRWFEEVGGDVNVT 817
Tb11.01.3110 ----------------
TRYPtp2h24gd03.q1k_1 ----------------
Tb11.01.3130 ----------------
TRYPtp30n4hh05.p1k_3 ----------------
Tb11.01.3120 ----------------
TRYP_xi-1015g04.q1k_13 ----------------
Tb07.29K4.620 ----------------
125.m00218 ----------------
Tb11.01.3080 ----------------
92.m00252 ----------------
Tb07.29K4.60 EKQLVADSSAKLFAMD 837
Tb09.160.3090 ELQKQYQRLKDLKLGE 833
Chapter 8
Future work, discussion and conclusions
This chapter summarises the contents of the thesis (Section 8.1) and extends our arguments
on data sharing. It compares the approach we have taken with possible alternatives
(Section 8.2.1), and discusses the future of digital archiving for biomedical data
(Section 8.2.2), the role of standards (Section 8.2.3), and the immediate future work that
leads on from our research (Section 8.3). Finally, it summarises the contribution this
research makes to the field of functional genomics (Section 8.4).
8.1 Summary of thesis
Chapter 1 introduced the concept of a functional genomics investigation and described the
main experimental techniques that encompass transcriptomics, proteomics and other new
developments. Chapter 2 covered the computational techniques used to address the challenges
presented by the experiments. The availability of databases, the creation of ontologies and
data standards were discussed. The development of a proteome standard was discussed in
Chapter 3, An object model for proteomics, which covered the past efforts (PEDRo), our
own object model Gla-PSI, and gave a snapshot of the next version, PSI-OM.
It is our view that gene and protein expression experiments should ultimately be
described in the same format. Chapter 4 described the integration of the object models for
microarrays and proteomics to create a new model, FGE-OM. The chapter also contained
a description of how the future development of standards should take place, and of the role
of ontologies. The RAPAD database was described in Chapter 5. The database has several
functions. Firstly, it demonstrates that an established microarray database can be extended
to support proteomics, and that integration of results from gel electrophoresis and
microarrays is possible. Secondly, the database serves as a prototype for a public repository of
proteomics data, and components from RAPAD are currently being incorporated into the
GUS platform for functional genomics. Thirdly, the relational implementation verifies that
the object model, FGE-OM, correctly models proteomic experiments. Finally, the database
has been evaluated with two projects at the University of Glasgow. The first, described in
Chapter 6, demonstrates the differential expression of proteins by comparing a human cell
line invaded by Toxoplasma gondii with non-invaded cells. The second, described in Chapter 7,
catalogues the proteome of the African parasite Trypanosoma brucei. Additional
software was specifically tailored to provide novel visualisations of data, and to summarise
complex information.
8.2 Discussion
This section compares our approach with possible alternatives, and discusses the contribution
the thesis makes to digital archiving, publication of data, and data standards.
8.2.1 Alternative approaches
The initial goal of our research was to utilise software engineering techniques to improve
the database facilities for data storage and querying. There was also a wider requirement
to support future re-analysis and publication of proteome data. This section critically
analyses our approach and describes alternative methods that could have been employed.
Extending existing technology
Our approach has been to use a combination of object modelling to describe the biological
workflow (Gla-PSI and FGE-OM), and relational technology (RAPAD) to store and query
proteome data. There is a third component, a data exchange format expressed in XML,
that has been discussed, but has not yet been implemented. We have re-used and extended
well established technology (MAGE-OM and RAD), which we believe is advantageous for
the following reasons. In general, the time to develop a large system should be greatly
reduced if previously existing technology is extended, such as RAD into RAPAD, instead
of developing from scratch. This claim is difficult to quantify, but the RAPAD schema
and graphical interface were developed in approximately seven or eight months, primarily
by the author, in consultation with developers at the University of Pennsylvania. We do
not believe that a comparable system could have been developed entirely from scratch in a
similar length of time. Another reason for extending a microarray database into proteomics
is that the two technologies share parts of a database schema and have a similar user
interface, which facilitates integration across the two domains. For example, biological
samples need only be described once, even if both microarrays and proteomics experiments
have been performed. This allows for the execution of queries with regard to the samples,
and for the retrieval of relevant microarray and proteome studies.
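The benefit of describing a sample once can be illustrated with a minimal sketch. The table
and column names below are hypothetical and heavily simplified; they are not taken from the
RAD or RAPAD schemas, and SQLite is used only so that the example is self-contained. Both
kinds of study reference the same sample record, so a single query on the sample retrieves
the microarray and the proteome studies together.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative only and are
# not taken from the RAD or RAPAD schemas.
SCHEMA = """
CREATE TABLE bio_sample (
    sample_id   INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE study (
    study_id   INTEGER PRIMARY KEY,
    sample_id  INTEGER NOT NULL REFERENCES bio_sample(sample_id),
    technology TEXT NOT NULL,   -- 'microarray' or 'proteomics'
    title      TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database with one sample shared by two studies."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # The sample is described once (made-up demonstration row).
    conn.execute(
        "INSERT INTO bio_sample VALUES (1, 'human cell line, non-infected')"
    )
    # Both studies point at the same sample record.
    conn.executemany(
        "INSERT INTO study VALUES (?, ?, ?, ?)",
        [
            (1, 1, "microarray", "expression profile"),
            (2, 1, "proteomics", "2-DE gel analysis"),
        ],
    )
    return conn


def studies_for_sample(conn, keyword):
    """Return (technology, title) for every study whose sample matches keyword."""
    rows = conn.execute(
        """
        SELECT st.technology, st.title
        FROM study AS st
        JOIN bio_sample AS bs ON bs.sample_id = st.sample_id
        WHERE bs.description LIKE ?
        ORDER BY st.technology, st.title
        """,
        (f"%{keyword}%",),
    )
    return rows.fetchall()
```

Calling studies_for_sample(build_demo(), "non-infected") returns one row per technology,
because the join runs through the single shared sample description.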
However, there are dangers in re-using technology. In general, if a design or programming
error originates in one system, it may be inherited by another system and not detected. It is
also possible that adapting existing technology leads to the creation of a system that is not
optimal for the users. A hypothetical example would be that a system tailored to capture
microarray protocols could be adapted adequately to record proteomics experiments, but the
interface is not intuitive for the user. We believe that we have avoided these problems during
the development of RAPAD owing to the extensive interaction that took place between the
author and the experimentalists.
The use of object models
We have taken the approach that data standards should be developed in UML (Unified
Modeling Language) for object modelling, and that an XML representation can be derived
from the model. This means that ultimately there are three parallel technologies that must
be managed: an object model, an XML Schema, and a relational database. Software is
required to create XML from the object representation, and to process data for database
entry. An alternative approach would be to model the experiments only at the level of XML,
by writing an XML Schema. This approach has the advantage that fewer technologies have
to be managed because the object model would not be required. In addition, any developer
can read and edit an XML Schema without the kind of specialist software that is required
for editing object models. However, it is generally believed that object models for complex
domains are easier to understand and develop than XML Schemas [47]. If a developer has a
basic knowledge of UML, the contents of an object model can usually be understood fairly
easily, because class diagrams and use-case scenarios give a graphical representation of the
system. The only way to comprehend an XML Schema is by reading hundreds of lines of
text, which for a large, complex domain is not feasible. An XML Schema is designed for
machine processing, and while in theory it may be human readable, this is not the intended
usage. There are tools for automatically creating an XML Schema from UML [96], although
software is still required to create correctly formatted data in XML files.
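As a minimal sketch of the software that creates XML from an object representation, the
fragment below serialises two small classes using only the Python standard library. The
class names, attributes and element names (Gel, Spot, id, x, y) are assumptions made for
illustration; they are not taken from Gla-PSI, FGE-OM or any published exchange format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spot:
    """Hypothetical class: one spot detected on a gel."""
    spot_id: str
    x: float
    y: float


@dataclass
class Gel:
    """Hypothetical class: a gel that owns a list of spots."""
    gel_id: str
    spots: List[Spot] = field(default_factory=list)


def gel_to_xml(gel: Gel) -> str:
    """Serialise a Gel object and its spots as an XML string."""
    root = ET.Element("Gel", id=gel.gel_id)
    for spot in gel.spots:
        # Each association in the object model becomes a nested element.
        ET.SubElement(root, "Spot", id=spot.spot_id, x=str(spot.x), y=str(spot.y))
    return ET.tostring(root, encoding="unicode")
```

A real standard would also need the reverse mapping and validation against the XML Schema,
neither of which is shown here.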
A data standard could alternatively be expressed as objects and classes in a software
system. This approach has the disadvantage that the data standard is tied to a particular
programming language that can only be fully understood by an expert in that language.
Furthermore, programming code cannot be represented graphically unless it is converted
into a UML representation and imported into an object modelling tool. In this case, it
is often beneficial to start with an object model and derive code, rather than developing
a system and then reverse engineering. We believe that the complexity of the proteomics
domain means that the advantages of utilising an object model for representing the standard
far outweigh the disadvantages.
Database management systems
RAPAD is a relational system that uses Oracle for database management. We have proposed
that XML will be used for data transfer, and that software will be created for processing the
exchange language, such as PSI-ML, for entry of data into the relational database. The use
of a relational database management system (RDMS) has considerable advantages. There has
been substantial research into improving query performance for relational databases
over the last three decades, and an RDMS is more secure and less likely to lose data than
alternative solutions [89]. However, there is a growing trend towards the storage of XML in
its native format, rather than converting XML formatted data into a relational representation
[104]. There are a number of arguments that are too detailed to address satisfactorily here,
but the main point is that an XML Schema can be evolved and changed fairly easily. XML
files formatted in the new version can be stored immediately in the native XML store with
minimal additional effort. In contrast, relational databases must be stable for significant
lengths of time. Furthermore, the hierarchical tree structure of XML may represent some
concepts more naturally than the tabular representation in a relational database. There
are a number of XML database management systems that can either be purchased [313],
or that are freely available [14], although none are likely to offer the query performance or
security features of an RDMS. These issues are discussed in more detail in the report on
XML indexing in Appendix A.
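The schema evolution argument can be sketched as follows. This is a toy illustration under
stated assumptions: the document structure is hypothetical, SQLite stands in for the
relational system, and a Python list of parsed documents stands in for a native XML store.
It says nothing about the query performance or security of either approach.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two versions of a hypothetical exchange document; version 2 adds <Instrument>.
DOC_V1 = "<Experiment><Sample>cell line</Sample></Experiment>"
DOC_V2 = (
    "<Experiment><Sample>cell line</Sample>"
    "<Instrument>mass spectrometer</Instrument></Experiment>"
)


def make_relational_store():
    """Create a table whose columns were fixed when only version 1 existed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (sample TEXT)")
    return conn


def shred(conn, xml_text):
    """Relational route: copy each known element into its column.

    Elements without a column (here <Instrument>) are not stored until the
    table and this loader are both changed.
    """
    doc = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO experiment (sample) VALUES (?)", (doc.findtext("Sample"),)
    )


def query_native(store, path):
    """Native route: documents are kept whole and queried by element path."""
    return [doc.findtext(path) for doc in store if doc.find(path) is not None]
```

In this sketch the version 2 document can be queried for its new element as soon as it is
stored whole, whereas the relational table still has only the column defined for version 1.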
8.2.2 Digital archiving and publication of life science data
It is our view that the model of publishing data only in journals is no longer sufficient
in the post-genome era. As little as 20 years ago, the scientific record consisted of libraries
containing journals that had to be searched manually, with some form of index. The situation
has now improved, as most journals are published electronically, and can be searched online
using information retrieval techniques. This model is still far from ideal because data sets, if
present at all, are embedded within the text, or in a tabular format, and the context of the
results can only be interpreted by reading the article and understanding the methods used.
Furthermore, data may be presented as an image that cannot be searched at all, and rarely
can text be extracted from images that are published online. This is not to say that journal
articles will become redundant. The publication of textual descriptions of an experiment will
always be required to offer an interpretation of the results and to position the work within
its context, but new methods are required to disseminate a digital record of the experiments.
The policy of journals
A functional genomics data set contains far more information than is conveyed in the original
publication. Different statistical models can be applied to mine new information from data,
and importantly, the results may be useful to research groups who were not aware of the
original publication. For example, the differential expression of proteins presented in Chapter
6 may be published in a parasitology journal. However, one of the study conditions is the
expression of proteins in a non-infected human cell line, which could be useful to other
researchers using the same cell line in any other field of research. Bioinformatics developers
must create systems that ensure data sets are available for a considerable length of time.
Public databases must be created for large scale experiments, and must be capable of being
queried not only for the results, but also for the protocols and structured descriptions of
the samples used. The editors of journals should employ a policy that a publication will only be
accepted once an electronic record of the experiment has been deposited in a public database.
The journal Molecular and Cellular Proteomics has recently released a set of guidelines for
authors wishing to publish articles in which a large number of proteins have been identified
by MS [48]. Authors must make available all the information that allows others to evaluate
the probability that proteins have been correctly identified. The journal also encourages
authors to deposit the raw spectra as supplementary material. However, the journal’s policy
is that it will not offer a database facility, and currently there are no public repositories
that are widely used. It is therefore left to authors to place spectra on their own web sites,
although the spectra will be in a format that cannot be interpreted by most other researchers.
This exemplifies the problems that hinder standardisation efforts at present. The problems
will only be solved when a standard format is agreed and public databases are developed.
The RAPAD prototype presented in Chapter 5 demonstrates that deposition and
publication of complex proteomic data sets is possible. The integration with microarray results
has also been demonstrated, and will lead to the creation of a proteomics namespace in GUS.
GUS will provide access to a wide range of functional genomics data, to accompany journal
publications, and will give added value to data because it can be analysed within the context
of all the other results stored within the system. The current RAPAD implementation is a
significant intermediate step towards a public data repository for proteomics.
Archival of raw data
The software for normalising microarray data sets, and for gel spot detection and quantification
of proteins from gel images, continues to improve. It is worth noting that very few public
databases store raw data: neither the original scans of microarrays and gels, nor the
coordinates from a mass spectrometry trace. Instead, a processed version of data is stored, which
may have undergone several different statistical or software manipulations that cannot be
reversed. The processed version of the data will be sufficient for the majority of users, but
storing only processed data prevents any future developments in the algorithms for processing
raw data being applied across massive data sets. One exception is that the original DNA traces
from genome sequencing projects are stored (Ensembl [95] and the NCBI [225] both have a
server that allows downloads of sequencing traces), although the traces can only be browsed by
species name and downloaded in bulk, which leaves the user with a large data handling task.
The cost of data storage continues to decrease rapidly, while bandwidth continues to increase.
Furthermore, the emergence of Grid technology means that high-performance computing will soon
be available to many bench researchers. Therefore, it is worth considering whether public
databases should store raw data from all published studies, even if it cannot be queried
at present. The alternative is that researchers who wish to publish must “guarantee” to
make raw data available on request, but this is very difficult to enforce, and prevents any
automatic assembly of data sets. The first version of MIAME stated that the storage of raw
image data was not an absolute requirement. The next versions of MAGE-OM and PSI-OM
are currently in development, and this issue should be discussed. It could be argued that the
cost to perform microarray experiments continues to fall, and therefore data sets could be
re-obtained if required. There are also several different array platforms that cannot be directly
compared, therefore raw data may be re-evaluated infrequently. In the proteomics field, it
has yet to be demonstrated whether it is feasible to compare large numbers of gel images
produced with different protocols, although this is being investigated in the ProteomeGRID
project [256]. However, we believe that the potential benefits of being able to assemble very
large data sets in the future could be great, and standards organisations should reconsider
whether storage of raw data should be a prerequisite for publication. If this policy is to be
realistically enforced, the public databases must develop facilities for archiving large files,
and query facilities for finding the correct files when required.
If raw data is to be made widely available, there must be a significant change in philosophy
for bench researchers. Many researchers are wary of releasing entire raw data sets because
in the initial publication they may not have covered all the possible results that could be
deduced, and they are not willing to lose ownership of the data and any future publications
that result. Furthermore, the release of raw data to the public allows other groups to verify
the correctness of the entire analysis, which may not be welcomed by all, although in the long
term it should benefit the life sciences. The release of raw data does not initially benefit the
research group that owns the information. Therefore, data sharing will only be encouraged
if journals prevent authors from publishing unless data sets have been made available. It is
crucial that research is not hindered by these requirements, and significant efforts must be
made by computing science to aid the archival of data.
8.2.3 The role of data standards
The goals of standards organisations fall into different categories depending on the kinds
of data that are represented. The protein-protein interaction community required a format
that enables data transfer between the major databases, similar to the agreements between
GenBank, DDBJ and EMBL for genomic data, and therefore PSI rapidly created a format
that is agreed upon by all major parties [150]. In this case, standardisation is fairly simple,
because there is an obvious, immediate need for a solution that may not be found without
the intervention of an official body. The situation for other functional genomics experiments
is more complex. PSI and MGED are attempting to generate data formats that are a
digital representation of the experiment: results, methods, and analysis. For most users a
data format that facilitates deposition in public repositories is required. The format should
capture three vital components:
1. The data from the experiment, stored in a way that can be easily retrieved and
manipulated.
2. Information about the purpose (hypothesis) and overview of the experiment.
3. The methods used that allow the experiment to be fully understood or notionally
reproduced, although functional genomics experiments are rarely reproduced due to
cost and lack of identical biological samples.
We believe there are clear advantages in storing a structured description of 1 and 2. The
data from the experiment (1) is required by most users, and due to the size of data sets,
it must be in a format that can be processed easily. Furthermore, querying a free text
description of the purpose of the experiment (2) is far from sufficient, and therefore it must be
well structured, preferably using ontologies. The overview of the experiment must include
the samples used, because it is usually only in the context of the sample that the results
have any meaning. It is generally agreed that a structured description of 1 and 2 is a
requirement for the next versions of MAGE-OM and PSI-OM. MAGE-OM also attempts to
capture all parts of an experiment, including those that fall into category 3, the methods,
in a structured format. One of the reasons is that MAGE is also used for sample tracking
in some laboratories, and that by storing a structured description of the entire experiment,
automated comparison between different experiments is possible, for instance to isolate a
batch of reagents that is contaminated if a particular assay fails. However, while MAGE is
useful in this context, it could be argued that a complete breakdown of the protocol is not
required for public databases, because it is very unlikely that it will ever be queried.
Furthermore, proteomics experiments are complex, and new technologies are frequently developed,
so structuring entire protocols is even harder, and may hinder the development of a
finalised format. Therefore, the initial development of the standard for proteomics should
focus on describing the experimental overview and the core data.
8.2.4 A functional genomics standard
The design of FGE-OM makes use of concepts from PEDRo and MAGE, and for certain
parts of experimental protocols, information can be stored either using the well defined
PEDRo classes, or using generic MAGE derived classes. A finalised standard should not be
ambiguous about how concepts can be represented, otherwise it will become more difficult to
process and query the format. An integrated functional genomics format is
in development, and these issues are still open for discussion. One of the main criticisms of
MAGE is that it is “over-engineered”, and too complex for many developers to use. PEDRo
can be understood fairly easily by both developers and experimentalists, but in practice
real experiments cannot always be captured adequately. Our view is that a compromise is
needed. It is vital that a well structured description of data, the starting sample and the
experimental overview is created. In terms of general experimental protocols, there is a need
for some large laboratories to have a structured description of how protocols were performed,
for auditing and sample tracking, but it may not be essential for public databases. A data
format should not necessarily be an electronic version of the entire experiment, for example
it is very unlikely that users will ever need to query a public database for the voltage applied
during the second stage of electrophoresis. In contrast, users may want to query a database
for the pH range of a gel. The protocol must be deposited in the database, and there should
be no additional overhead in doing so, because methods are required in a journal publication
anyway. For some parts of the experiment, however, simply depositing the text of the protocol
within the database will be sufficient. One of the goals of the standards bodies will be to
decide which parts of protocols should be well structured using ontologies, and which parts
can be described in plain text.
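The distinction drawn above between structured and plain text parts of a protocol can be
sketched as a record in which the pH range is held in typed fields that can be queried, while
the remainder of the protocol (including details such as applied voltages) is deposited as
text. The class and field names are hypothetical and are not part of FGE-OM, PEDRo or MAGE.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GelProtocol:
    """Hypothetical record mixing structured fields with free text."""
    ph_min: float       # structured: users may query on the pH range
    ph_max: float       # structured: users may query on the pH range
    protocol_text: str  # free text: deposited for readers, not queried


def gels_covering_ph(protocols: List[GelProtocol], ph: float) -> List[GelProtocol]:
    """Return the protocols whose gel pH range includes the given value."""
    return [p for p in protocols if p.ph_min <= ph <= p.ph_max]
```

Deciding which attributes deserve their own fields, and which can stay in protocol_text, is
the task the text assigns to the standards bodies.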
FGE-OM is not intended to be a finalised working object model, but was released to
instigate more collaboration between PSI and MGED, and to provide a framework for an
integrated data standard for functional genomics. This goal has been achieved as it is
now planned that the proteomics standard will incorporate a “core” module, based on the
BioOM namespace in FGE-OM, which is being created by MGED, to capture an experimental
overview and descriptions of biological samples.
8.2.5 Proteomics standards
The official standard, PSI-OM, is currently in development. The author has contributed to
this development at meetings of the PSI [257], and directly through the design of a model
of gel data. The Gla-PSI model, described in Chapter 3, was released to incorporate a more
detailed data model into the PEDRo proposals, as 2-DE is widely used, and a structured
representation of the data, and the methods used to obtain it, is essential to allow for future
re-analysis. It is vital for developing a community standard that a wide range of users and
developers are consulted, to allow for continual refinement and improvements to the object
model.
8.2.6 A vision for future data sharing
The diagram in Figure 8.1 displays a possible organisation of future data publication and
sharing. The main concept is that a generic model will be defined, similar to FGE-OM.
Software will be developed that can create FGE-ML, a mark-up language based on the
format, used for formatting experiments that cover a wide range of technologies. The main
benefit is that biological samples need only be described once, regardless of the down-stream
processing and analysis. For many users, this amount of complexity will be unnecessary, and
smaller subsets of the formats can be created for single technologies, MAGE-ML (version 2)
and PSI-ML (version 1). There will be various databases that can accept either the single
technology formats, or the wider FGE-ML specification, depending on their scope. FGE-ML,
or the derived formats, can also be used for transferring information between databases.
Ontologies will be used for populating the data format and the databases, improving the
facilities for querying, and asking questions of data semantics (reasoning). The ontologies
must be available for programmatic access, and date stamped, to ensure that there is an
exact specification of which version of a term is being used. It is vital that ontologies remain
accessible in the future, to ensure that URLs referencing the source of a term do not become
out of date, causing a data set to become incomplete.
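One way to make the requirement for date stamped terms concrete is to store, alongside each
term used in a data set, the ontology it came from and the date of the ontology version. The
sketch below is an assumption about how such a reference might look; the field names and the
key format are hypothetical and are not defined by MGED or PSI.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OntologyTermRef:
    """Hypothetical reference to one term in one dated ontology version."""
    ontology: str        # name of the source ontology
    accession: str       # identifier of the term within that ontology
    label: str           # human readable name at the time of use
    version_date: date   # date stamp of the ontology version used

    def key(self) -> str:
        """Build a string that pins the term to an exact ontology version."""
        return f"{self.ontology}:{self.accession}@{self.version_date.isoformat()}"


def same_term_version(a: OntologyTermRef, b: OntologyTermRef) -> bool:
    """True only if both references name the same term in the same version."""
    return a.key() == b.key()
```

Two data sets that cite the same accession from different ontology versions then compare as
different, which signals that the meaning of the term should be checked before the data sets
are combined.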
8.3 Future work
There are several areas in which efforts are continuing that follow on from the work
presented in the thesis. These include proteomics standards, a functional genomics standard,
and the development of a public repository. The first version of the official standard of the
Proteomics Standards Initiative, PSI-OM, must be defined in the near future. PSI-OM will
be presented and discussed at meetings for computer systems developers, and at biological
conferences, to ensure that it has wide community input, and that it adequately covers all
the technologies that are currently used. The standard should be designed to accommodate
future developments. The object model will only be one component of a successful data
standard, and it is also vital that an ontology is developed in tandem. It is planned that
there will be a PSI endorsed ontology, PSI-Ont, that will contain terms describing parts of
experimental techniques that are difficult to represent in the object model. PSI-Ont will
also contain certain data values, such as types of units, the names of instrument
manufacturers and so on. This part of the vocabulary is required because the usage of these
terms should be controlled, to allow future querying of the format, and there are many
components that should not be “hard-coded” in the object model. Detailed discussions are
required to converge on a data standard that captures the requirements of research as it is
now, and ensures that future re-analysis of archived data is possible. Another consideration is
that the developers of an object model should not attempt to create a format that is highly
descriptive of every possible technology but is too complex to use. The development of PSI-OM
and PSI-Ont is well under way, and it is likely that a finalised data standard will be agreed
within the next year.
[Figure 8.1 diagram. Labels: ontologies (MGED + PSI Ontology, Gene Ontology, NCBI Taxonomy);
terms downloaded for instantiating the model, and used for querying the database and populating
drop-down menus for data entry; laboratory LIMS for capturing data; software installed locally
for creating and interpreting FGE-ML; MAGE-ML and PSI-ML derived from the main model; FGE-ML
transfer of complete datasets; databases for microarray, functional genomics, proteomics and
species data (ArrayExpress, RAPAD, PlasmoDB, CEBS).]
Figure 8.1: A possible model for future data sharing and exchange. Example databases are in italics.
We envisage that the fields of proteomics, microarray technology, metabolomics, and
other functional genomic techniques should converge on a single data standard. We offer
FGE-OM as a framework from which future development can start. Of primary importance
is the definition of a notation that describes biological samples from any type of experiment in
a structured format. At this stage it may be too ambitious to design an all encompassing data
standard before a complete format has been widely accepted for proteomics or metabolomics.
However, the development of standards should take place with the future integration with
MAGE-OM in mind. In particular, there is limited benefit in attempting to develop different
structures for describing the experimental overview and biological samples. It is essential that
PSI members continue to attend MAGE meetings, and that MAGE developers contribute
to the on-going discussions on PSI-OM.
The RAPAD database supports the projects outlined in Chapters 6 and 7. However,
it is still a prototype of a public repository, as it has not undergone sufficient testing to
ensure that the code is free from error, and that data is completely secure for external public
access. One of the goals of developing RAPAD was to investigate the possibility of using
tables from the RAD schema to store proteomics experimental protocols. This facility has
been demonstrated, and in the near future a proteomics extension to GUS will be developed.
At the University of Glasgow, RAPAD will be extended to support the requirements of the
Functional Genomics Facility [293]. An extension for laboratory sample tracking, similar to
a LIMS (Laboratory Information Management System), and for customer billing is currently
in development by Dr Giorgia Riboldi-Tunnicliffe at Glasgow University.
The current implementation of RAPAD can record data from 2-D electrophoresis and
mass spectrometry (MS). The integration of microarray and proteomics results was
demonstrated in Chapter 6; however, the facility for the complete storage of microarray experiments
has not yet been incorporated, as this was not a primary objective of our development. The
next round of development will ensure that microarray protocols and data can be recorded
in RAPAD. There are several additions to the proteome component that will be required in
the future. The database schema contains tables to store column separations and labelling
experiments, for capturing LC-MS or ICAT technologies (described in Chapter 1), but the
user interface has not yet been developed. Web pages will be created for entry of LC-MS
and ICAT protocols, and a separate interface will be required for the visualisation of the
results, and the querying of data. The interface will be an important component because
very large volumes of data are produced by LC-MS, which must be open to various types
of query. The code for the interface, and the next version of the database schema, will be
released via the web site at regular intervals, to allow other developers to test the code, and
to create similar systems in different locations.
It was reported in Chapter 5 that the MGED ontology has been stored and extended
in RAPAD. The ontology can be used for capturing types of experimental protocols and
details of biological samples; however, the use of controlled vocabularies could be extended
further. For example, the complete Gene Ontology (GO) could be stored in RAPAD and
used to find the correspondence between different RNA sequences (from gene expression
studies), proteins (from proteome studies) and the functional annotations in GO. Chapter 2
described the vision of the Semantic Web and Grid based services for integrating different
types of applications. RAPAD could be described using ontologies and made available on the
Grid as a web service to allow applications to query the database for proteomic experiments,
in conjunction with other Internet accessible resources. This functionality is already being
investigated for RAPAD as part of a PhD project by Frank Gibson in the Department of
Computing Science at the University of Newcastle.
The development of technology to aid proteome research is likely to be mirrored by other
research communities, such as metabolomics. It must be possible to query various types of data
in parallel, so that the data can feed into models at the level of whole biological systems.
It has already been demonstrated that microarray results can be used to hypothesise about
the components that interact within metabolic pathways [201]. It is likely that protein
abundance data and metabolic profiles will also contribute to the models, although this
will require considerable efforts from database engineers who will have to provide integrated
query systems for all the information. The systems biology approach, which aims to gain
insights into the functioning of biological processes, will only produce significant results when
databases are in place that provide access to very large volumes of data. It is our view
that the development of centralised databases will be greatly hindered unless the efforts to
define standard formats are successful, and gain widespread acceptance.
8.4 Summary
The work presented in the thesis moves the field closer to the vision presented in Figure
8.1 in the following ways. The object model for proteomics (Gla-PSI) has contributed to
the community efforts that will lead to the creation of the first version of the official PSI
standard within the next year. In particular, 2-DE data, which PEDRo does not cover adequately,
will now be covered in much greater detail. Biological samples were also not described
in sufficient detail in PEDRo, and will now be captured using classes from MAGE-OM, as
proposed in FGE-OM. FGE-OM can be viewed as an intermediate step towards a shared
data format for functional genomics, which will improve the facilities for integrating data
from the different techniques. The RAPAD system has a close correspondence with FGE-
OM, therefore the relational implementation demonstrates that the object model captures
proteomic workflows in practice.
RAPAD is a working database for proteomics, as demonstrated by the on-going investi-
gations reported in Chapters 6 and 7. RAPAD has aided the research process for these two
projects whose requirements are common to many proteome investigations. Therefore, the
software could be used in a variety of different domains. The thesis has reported that the
current facilities for releasing proteomics data sets on the Internet are not sufficient. The
RAPAD schema and interface code are freely available for download by other developers who
can implement and extend the system, to support the requirements of their own laboratory
data. RAPAD also serves as a prototype of a public repository of data, and the future in-
tegration of a proteomics namespace into GUS will greatly extend the facilities of the web
sites supported by the platform.
We believe that efforts to improve facilities for archiving and querying functional genomics
data will bring significant benefit to the biological research community. Unless there are
continued developments in bioinformatics, there is a danger of losing important
information, as the computational technology lags behind laboratory innovation. Our
research will lead to greatly improved accessibility of proteome data, thereby maximising
the knowledge that can be gained from the experiments, and enabling new discoveries.
Appendix A
An XML indexing solution for data
integration
A.1 Introduction
Chapter 2 introduced the challenges in data integration for the life sciences, and there was a
brief description of the concept that an XML representation of data could facilitate integra-
tion across different databases. In this section there is a brief report of work undertaken by
the author in 2001. The work comprises a new method of indexing large volumes of biologi-
cal data represented in XML. The index was designed with several goals in mind: firstly to
develop a prototype system storing local versions of various biological databases, to improve
the information retrieval task for researchers. Secondly, to assess the feasibility of using a
persistent programming language to implement a large index, and to compare the perfor-
mance of the index against previously published studies. Thirdly, to test the capabilities of
a native XML store compared with a relational representation of biological data.
The first goal has been partially completed: local versions of the DBLP bibliography
database [71] (100 MB) and PIR [254] (Protein Information Resource, 800 MB) were indexed,
and fast queries have been demonstrated. A graphical query interface was also created. A
complete prototype system has now been fully realised in the Xtect project [359] by our
collaborators at the University of Glasgow and the University of Strathclyde, which was
published at the DILS (Data Integration in Life Sciences) workshop in 2004 [159]. The
initial prototype demonstrates that a persistent programming language can be used for
implementing a large index; however, performance difficulties are encountered as the size
of the index grows, due to the method of caching (the process of loading frequently accessed
objects into main memory) employed by the persistent store. The
RAPAD database (Chapter 5) utilises a relational representation of proteomics data, which
allows complex queries and provides a robust security model. However, the object models
for data standards continue to evolve, and it is not possible to make major changes to a
relational database after it has been deployed without significant effort. It is possible to
evolve the XML representation of the data standard with no significant effort, and therefore
a native XML storage solution would be advantageous because it could be tested with data
expressed in the latest version of the standard with minimal refactoring required.
The rest of this report is structured as follows. There is a brief introduction to previous
work on XML indexing, native XML databases and query languages in Section A.2. The
implementation of the index structure and the results of performance tests are described in
Section A.3 and discussion is in Section A.4.
A.2 Previous work
XML was designed to act as a format to exchange data over the Internet, and large volumes of
XML are produced in a number of different domains. XML can be automatically processed,
and has been a W3C standard for a number of years [101]. XML is now being used as a data
storage format, and a number of biomedical databases can be downloaded in XML. In the
initial specification, XML data was validated using a document type definition (DTD) that
specifies the tags that are allowed in a particular data set, to ensure that data is uniformly
formatted and can be processed. DTD has been superseded by XML Schema [341], which
incorporates the ability to specify the type of data (string, integer or floating point number),
along with several other changes.
The most advanced proposal for a query language for XML is XQuery [357]. XQuery is
based upon a previous proposal known as XPath that enables paths in XML (the hierarchy of
tags) to be specified in a formal manner. Other proposals include a graphical query language
[226], which enables users to build queries, developed from a graphical representation of the
tree structures of the source XML.
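As a minimal illustration of how a path through the hierarchy of tags identifies data, the following sketch uses the limited XPath subset supported by the xml.etree.ElementTree module in the Python standard library. The element names are taken from the PIR example given in Section A.3.1; the second entry (p21) is invented for illustration, and the choice of Python is an assumption of this sketch rather than the technology used in the work reported here.

```python
import xml.etree.ElementTree as ET

# A tiny document using the tag names of the PIR example; the p21 entry
# is a made-up second record for illustration only.
doc = ET.fromstring(
    "<PIR_Database>"
    "<PROTEIN_ENTRY><PROTEIN_NAME>p53</PROTEIN_NAME></PROTEIN_ENTRY>"
    "<PROTEIN_ENTRY><PROTEIN_NAME>p21</PROTEIN_NAME></PROTEIN_ENTRY>"
    "</PIR_Database>"
)

# The path names each tag on the way from the root element to the data.
names = [e.text for e in doc.findall("./PROTEIN_ENTRY/PROTEIN_NAME")]
```

The path expression plays the same role as the data paths used by the index structures described in Section A.3.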
Several proposals have been made regarding storage of XML within databases. The
proposals include databases that store the native structure of XML, such as Tamino [313]
or Xindice [14]; and techniques for representing the structure of XML within tables in a
relational database [111, 288, 77, 364]. An investigation by Cooper et al. in 2001 [62]
encoded XML within a tree structure designed for fast retrieval of strings. The tag names
comprising the paths to data were stored as compressed strings, with textual data stored
in the same structure. The entire index, and the source data, were stored in a relational
database. The conclusion was that the new index structure showed significant improvements
in performance, and utilised a larger data collection, compared with the previous approaches.
The first version of our index structure is loosely based on the work of Cooper.
A recent proposal that indexes XML has been made by Buneman et al. 2003 [43].
This uses a representation that separates the textual data from the tree structure of XML
(the nested organisation of tags). Processing queries over the tree structure (referred to
as the Skeleton) takes up a large proportion of the time required to answer a query. The
authors show that by querying a compressed representation of the Skeleton in main memory,
significant improvements in query performance can be made. Tests have been performed
over collections of XML data, including Swiss-Prot, demonstrating effective query processing
against large collections of biologically relevant data.
A.3 Results
We have designed new structures to index XML, which have been implemented using Persis-
tent Java (PJama) [19]. PJama was developed at the University of Glasgow in collaboration
with Sun Microsystems [307], and allows objects, represented in Java, to be written directly
to disk. The objects can be accessed when required without needing specific methods to
read the objects from disk, allowing the encoding of very large structures that behave as
if they were represented in main memory. Two index structures have been created, Index
A and Index B, for XML versions of the bibliography database, DBLP [71] (100 MB), and
the PIR database (800 MB). The performance of the indexes created for DBLP was tested
for simple queries, and more complex Boolean queries in which a join was required. The
structure of the indexes is as follows.
A.3.1 Index A
Index A comprises four components: a Data Path Tree, many Data Stores, an XML Dictionary
and many XML Locater Lists (Figure A.1).
Figure A.1: Index A has four components: the Data Path Tree, Data Stores, XML Locater
Lists and an XML Dictionary. [The figure shows example XML Dictionary entries (1 Protein_Entry,
2 Reference, 3 Authors, 4 Author, 5 Volume, 6 Title, 7 Organism, 8 Common, 9 Formal) and Data
Stores for paths 1_2_3_4 (Protein_Entry:Reference:Authors:Author) and 1_7_9
(Protein_Entry:Organism:Formal), the latter holding the data "Homo sapiens".]
• The XML Dictionary stores the names of the tags (elements) encountered in the XML
document, assigning each one a unique integer code that is used in the rest of the index
for the purpose of compression.
• The Data Path Tree stores a summary of all the different types of XML paths encountered
in the database. The following is an example of one data path (without closing tags).
<PIR_Database><PROTEIN_ENTRY><PROTEIN_NAME>p53
The Data Path Tree contains a node for each element on the path, storing the integer
corresponding to the element’s name and a reference to the child node, if one exists.
• There is one Data Store for each path in the Data Path Tree. Each Data Store contains
all the textual data, found in the entire collection of XML documents, which can be
reached by a particular Data Path. For example, all the protein names would be stored
in one Data Store. The structure of the Data Store is a digital trie [295] that allows
fast searching of strings. Each node of the trie stores one character of the string and a
pointer to the child node (the next character in the string).
• XML Locater Lists contain a set of identifiers for the records from the source XML
dataset that contain a specific string in a particular Data Store. Hence, the identifiers
for all the database records that contain the protein name, p53, are stored in one XML
Locater List. The last node in the string (the “3” of p53) has a link to the XML
Locater List.
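The relationship between a Data Store and its XML Locater Lists can be sketched as follows. This is a minimal illustration under stated assumptions, not the PJama implementation: the class and method names are hypothetical, Python is used in place of Persistent Java, child links are held in a dictionary rather than as individual object pointers, and the record identifiers are made-up integers.

```python
class TrieNode:
    """One character of a stored string, with links to the next characters."""

    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.locaters = None  # record identifiers; set only where a string ends


class DataStore:
    """All text values reachable by one data path, held in a character trie."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, text, record_id):
        # Walk (or create) one node per character of the string.
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        # The node for the final character links to the Locater List.
        if node.locaters is None:
            node.locaters = []
        node.locaters.append(record_id)

    def find(self, text):
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.locaters or [])


# Hypothetical Data Store for the protein name path.
protein_names = DataStore()
protein_names.add("p53", 1)
protein_names.add("p53", 2)
protein_names.add("p21", 3)
```

In this sketch, a prefix such as "p5" is present in the trie but returns no identifiers, because only the node for the final character of a complete string carries a Locater List.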
Index A is designed to support very rapid retrieval of simple query terms against an XML
record. For example to find all the documents that contain the protein name (p53), the
following algorithm is executed:
1. The user must know which path corresponds to the data to be searched for, in this
case it is PIR Database/Protein Entry/Protein Name. A fully functional application
would achieve this using a graphical user interface.
2. Look up the integer codes for each element in the XML Dictionary.
3. Search the Data Path Tree for a path that corresponds to the elements in the query.
4. When the leaf node is reached in the Data Path Tree, follow the link to the Data Store.
5. Match each character in the query string (p, then 5, then 3) in the Data Store.
6. Correct matches are obtained if there is an XML Locater List referenced from the “3”
node (of “p53”) in the Data Store. If it exists, the identifiers in the XML Locater List
can be used to find records in a database, or on the file system, which correctly answer
the query.
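The steps above can be sketched in a few lines. The following is an illustrative approximation only: nested Python dictionaries stand in for the persistent objects of the real index, the reserved keys "store" and "ids" are hypothetical conventions of this sketch, and the sample records are invented.

```python
def build_index(records):
    """Build the XML Dictionary and Data Path Tree.

    records: iterable of (record_id, [(path, text), ...]) pairs, where
    path is a tuple of tag names.
    """
    dictionary = {}  # XML Dictionary: tag name -> integer code
    path_tree = {}   # Data Path Tree: nested dicts keyed by integer code
    for record_id, leaves in records:
        for path, text in leaves:
            node = path_tree
            for tag in path:
                code = dictionary.setdefault(tag, len(dictionary) + 1)
                node = node.setdefault(code, {})
            # "store" links the leaf of the path to its Data Store (a trie).
            trie = node.setdefault("store", {})
            for ch in text:
                trie = trie.setdefault(ch, {})
            # "ids" plays the role of the XML Locater List.
            trie.setdefault("ids", []).append(record_id)
    return dictionary, path_tree


def lookup(dictionary, path_tree, path, text):
    """Return the identifiers of records holding text at the given path."""
    node = path_tree
    for tag in path:                  # steps 2 and 3: codes, then path search
        code = dictionary.get(tag)
        if code is None or code not in node:
            return []
        node = node[code]
    trie = node.get("store")          # step 4: follow the link to the Data Store
    if trie is None:
        return []
    for ch in text:                   # step 5: match each character in turn
        trie = trie.get(ch)
        if trie is None:
            return []
    return list(trie.get("ids", []))  # step 6: read the Locater List, if any


PATH = ("PIR_Database", "Protein_Entry", "Protein_Name")
records = [(1, [(PATH, "p53")]), (2, [(PATH, "p21")]), (3, [(PATH, "p53")])]
dictionary, path_tree = build_index(records)
matches = lookup(dictionary, path_tree, PATH, "p53")
```

The number of dictionary accesses in lookup is the length of the path plus the length of the query string, which reflects the claim that few object references need to be followed for a simple query.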
This is an extremely efficient structure for performing simple queries of this type because,
once a query has been formulated, the number of object references that must be followed is
almost the minimum possible. However, there are two problems with this index structure,
which prevent it from being highly usable for many queries. The first is that the order of
elements within an XML document can be crucial, and it would not be possible to specify a query
of the type “find leaf node A followed by leaf node B” using this structure. Secondly, more
complex queries of the type “find a record containing leaf node A with value X AND leaf node B
with value Y” are performed inefficiently. This query can only be answered by performing two
runs through the process described above, and combining the two sets of results to find their
intersection (the results present in both sets). If the two sets of results are large, the
combination part of the query would take a considerable length of time (discussed further in
Section A.3.5). An attempt to alleviate these two problems, without reducing the performance
of simple queries, resulted in the design of Index B.
A.3.2 Index B
Index B retains the Data Path Tree, the Data Stores and the XML Dictionary from Index
A but includes an additional component, the Structure Container that stores a compressed
representation of the structure of every XML record (Figure A.2). An XML record is
represented in the Structure Container by a set of nodes that contain the integer code of the
element, and pointers to child and sibling nodes. The order of elements in the Structure
Container matches the original XML record. A leaf in the Structure Container has a pointer
to a node in the corresponding Data Store, which represents the final character of the textual
string. An entry in the Structure Container is effectively a compressed version of an XML
record and can be used to reconstruct the source text. The textual values in the Data Store
can be obtained by reading backward from the leaf node to the first node (e.g. reading 3,
then 5 then p for p53). The Structure Container also stores the ID number of the record,
which is referenced directly from the leaf nodes in the Data Store. The XML Locater Lists
are not required and the records do not need to be stored in another file system or database
because the Structure Container can be used to reconstruct an exact copy of the source
XML.
Figure A.2: Index B has four components: the Data Path Tree, Data Stores, the Structure
Container and the XML Dictionary (not shown).
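The reconstruction of a record from the Structure Container can be sketched as follows. This is a simplified illustration with several assumptions: the names CharNode, insert, read_back and to_xml are hypothetical, each trie node is given an explicit parent link so that text can be read backward, a Python list of children replaces the child and sibling pointers, and attributes and whitespace in the source XML are ignored.

```python
class CharNode:
    """Trie node that remembers its parent so text can be read backward."""

    def __init__(self, char=None, parent=None):
        self.char = char
        self.parent = parent
        self.children = {}


def insert(root, text):
    """Add text to the trie and return the node for its final character."""
    node = root
    for ch in text:
        if ch not in node.children:
            node.children[ch] = CharNode(ch, node)
        node = node.children[ch]
    return node


def read_back(node):
    """Walk parent links from the final character back to the root."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


def to_xml(structure, dictionary):
    """Rebuild XML text from a (code, content) Structure Container entry."""
    code, content = structure
    tag = dictionary[code]
    if isinstance(content, CharNode):
        body = read_back(content)  # leaf: pointer into a Data Store
    else:
        body = "".join(to_xml(child, dictionary) for child in content)
    return f"<{tag}>{body}</{tag}>"


# Hypothetical integer codes and one record with two leaves, in source order.
dictionary = {1: "Protein_Entry", 2: "Protein_Name", 3: "Organism"}
names, organisms = CharNode(), CharNode()
record = (1, [(2, insert(names, "p53")), (3, insert(organisms, "Homo sapiens"))])
xml_text = to_xml(record, dictionary)
```

Because the structure entry holds only integer codes and pointers, the text of each value is stored once in the Data Store however many records share it.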
A query of the type “find all the documents that contain the protein name p53” is
performed in exactly the same way as for Index A. A Boolean “AND” query is performed using
a more complex algorithm that is more efficient than performing two individual queries and
finding the intersection of the results, as was required for Index A. The basic concept is that
the first term is evaluated via the Data Path Tree as before. Records matched by the first
term are retrieved from the Structure Container, and only those that contain the path
corresponding to the second term of the query are evaluated further; the rest can be
discarded. The value of the second query term is searched in the corresponding Data Store, as
displayed in Figure A.3.
Figure A.3: The join query in Index B is implemented as a six-stage algorithm: i) search for
the data path for the structure of the query; ii) search the Data Store; iii) follow pointers to
structures; iv) search the Structure Containers; v) for matched structures, follow pointers back
to the Data Stores; vi) match data in the store.
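The idea behind the Index B join can be sketched as follows. This is an approximation under stated assumptions rather than the implemented algorithm: the class IndexB and its methods are hypothetical names, a dictionary keyed by text value stands in for each Data Store trie and its pointers into the Structure Container, and the author names and years are invented.

```python
from collections import defaultdict


class IndexB:
    """Toy index: Data Stores whose entries point at record structures."""

    def __init__(self):
        # path -> text value -> list of structures (stands in for a trie whose
        # final-character nodes point into the Structure Container)
        self.data_stores = defaultdict(lambda: defaultdict(list))

    def add(self, record_id, leaves):
        """leaves: ordered list of (path, text) pairs for one XML record."""
        structure = {"id": record_id, "leaves": list(leaves)}
        for path, text in leaves:
            bucket = self.data_stores[path][text]
            if structure not in bucket:
                bucket.append(structure)

    def find_and(self, path_a, text_a, path_b, text_b):
        # Stages i to iii: evaluate the first term, follow pointers to structures.
        candidates = self.data_stores.get(path_a, {}).get(text_a, [])
        results = []
        for structure in candidates:
            # Stage iv: structures lacking the second path are discarded;
            # stages v and vi: compare the stored value with the second term.
            if any(p == path_b and t == text_b for p, t in structure["leaves"]):
                results.append(structure["id"])
        return results


index = IndexB()
index.add(1, [("author", "Author One"), ("author", "Author Two"), ("year", "2001")])
index.add(2, [("author", "Author One"), ("year", "2003")])
index.add(3, [("author", "Author Three"), ("year", "2001")])
hits = index.find_and("author", "Author One", "year", "2001")
```

Only the records matched by the first term are inspected, so the very large list of records for a common year is never materialised; this is the reason the approach is expected to scale better than intersecting two independent result lists.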
A.3.3 Index creation
The time to construct Index A and B is given below:
Total records        1000    5000    64,000    240,000
Size of XML (MB)      0.4     2.1        25        100
Index A (s)          11.2    57.8       742       3240
Index B (s)          12.6    71.8       833       3521
Store size (MB)        17      40       281       1014
Table A.1: Build times in seconds for Index A and B for four different sizes of data set. The
size of the persistent store for Index B is given in MB.
Table A.1 displays the time in seconds to build Index A and B on disk. Index B takes
between roughly 9% and 24% more time to build than Index A. The build times for Index A and B
appear to grow approximately linearly with the size of data. The store size grows less than
linearly with collection size because, as the data size becomes large, the tries in the
Data Stores become saturated. In other words, when data is added to a larger store, it is
more likely that the same string of characters is already entirely (or partially) present in the
store, and only a new XML location object has to be added. In addition, the XML Dictionary
and Data Path Tree grow very rapidly at the start of a build but will only subsequently grow
when new types of documents are encountered. A version of Index B was also created for the
PIR database in XML as a single 800 MB file, giving rise to a persistent store of size 5.7 GB.
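The saturation effect can be illustrated with a small sketch that counts how many trie nodes each insertion creates. The function name count_new_nodes and the sample strings are assumptions made for illustration; a nested dictionary stands in for a Data Store.

```python
def count_new_nodes(trie, text):
    """Insert text into a nested-dict trie; return how many nodes were created."""
    created = 0
    node = trie
    for ch in text:
        if ch not in node:
            node[ch] = {}
            created += 1
        node = node[ch]
    return created


trie = {}
growth = [
    count_new_nodes(trie, s)
    for s in ["Homo sapiens", "Homo sapiens", "Homo erectus", "Mus musculus"]
]
```

A repeated string creates no new nodes and a string sharing a prefix creates nodes only for its unshared suffix, so the number of nodes grows more slowly than the volume of text inserted.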
A.3.4 Queries
The following searches were carried out over the indexes stored persistently on disk, and in
main memory, for the DBLP database.
1. Search for 10,000 authors’ names.
2. Search for 10,000 sets of two authors’ names in a single bibliographic reference.
3. Search for 10,000 sets of the author’s name and the year of publication.
Query sets 1 and 3 were obtained by selecting authors’ names (and the author and year
pairs) at random from a query retrieving all records from the collections. Set 2 was obtained
by selecting records at random from a query that retrieved all records containing two or
more authors. The searches are similar to the queries in the publication by Cooper et al.
[62]. While several parameters are not the same between these tests and those discussed by
Cooper, the tests are intended to give an initial benchmark, against which future comparisons
could be made. Queries were carried out on a four-processor Sun Enterprise 450 server running
Solaris, with a processor clock speed of 300 MHz. The Java heap was set to a minimum size of
400 MB and a maximum size of 800 MB.
A.3.5 Index A Results
Query             1      2 (join time)    3 (join time)
In memory (s)     3.5    4.1 (1.9)        2194 (2032)
Persistent (s)    31     91 (41)          1473 (1353)
Table A.2: Summary of query timings for Index A; values are times in seconds. Persistent
results are from a test with a cold cache.
Table A.2 displays the timings for Index A for queries 1, 2 and 3. The timings
demonstrate that the index performs efficiently while retrieving single items of data,
because query 1 is performed in a small amount of time both in memory and with a persistent
index. Query 2 is also returned within a fairly short time frame; however, query 3 performs
poorly. Measurements were made to determine the proportion of time required by different
parts of the query. It was observed that, for query 3, the vast majority of the time taken was
not in the retrieval of data, but in carrying out the join operation (join times are shown in
parentheses for queries 2 and 3). In this query, the search for the year of publication returned
extremely large result sets. For every query, the list of identifiers for documents that
contained the correct year was very large, and had to be compared with the identifiers returned
for a particular author’s name, in a search taking O(nm) time for lists of n and m identifiers,
which is inefficient. An unexpected result was that the length of time to complete query 3 is
longer in memory than in a persistent store. The result may be explained by the lengthy join
operations performing more slowly as the limits of memory are reached, with the entire index
loaded into memory.
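The cost of the join can be illustrated by comparing a pairwise comparison of two identifier lists with a hash-based alternative. The sketch below is not the code used in the tests: the function names and identifier values are hypothetical, and the hash-based version is a standard alternative shown for contrast rather than a technique evaluated in this work.

```python
def join_nested(author_ids, year_ids):
    """Compare identifiers pairwise: O(n*m) comparisons in the worst case."""
    matches = []
    for a in author_ids:
        for y in year_ids:
            if a == y:
                matches.append(a)
                break
    return matches


def join_hashed(author_ids, year_ids):
    """Build a hash set from one list, then probe it: O(n + m) on average."""
    year_set = set(year_ids)
    return [a for a in author_ids if a in year_set]


# A short author list and a much longer year list, as in query 3.
author_ids = [17, 42, 99]
year_ids = list(range(0, 1000, 3))
```

Both functions return the identifiers present in both lists; the difference lies only in the number of comparisons required when the year list is very large.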
A.3.6 Index B Results
Query             1       2       3
Cold cache (s)    450     1298    1692
Caching on (s)    0.78    2.1     2.4
Table A.3: Summary of query timings for Index B, with different caching procedures.
The results for Index B are fairly complex but a summary of the main results is shown
in Table A.3. The results for “Caching On” refer to timing the queries after the index has
already been accessed for other tests in the same run, and “Cold cache” refers to performing
the queries immediately after starting Java. The method used to carry out join queries did
not allow the timing of the separate stages of the query to be assessed, therefore an exact
measure of the length of the join stage is not given. In theory, Index B carries out query
joins in a more efficient manner than Index A. The small time to retrieve query 3 with
“Caching On” demonstrates that query joins are carried out very efficiently in Index B. An
unexpected finding was that Index B retrieves single items of data significantly slower than
Index A, when run with a cold cache (31s for Index A, 450s for Index B). The result for
query 1 is particularly unexpected because the Structure Container is the only difference
between Index A and B, yet the XML structure objects within the Structure Container are
not accessed to complete query 1. Query 1 is completed by the author’s name being matched
in the correct Data Store and pointers followed to the locater objects. The locater objects are
referenced from the Data Stores in Index A, but are stored alongside the root of objects in
the Structure Container in Index B. The exact same number of operations will be performed
to complete query 1 on Index A and B. Therefore, the slow-down that is observed must be
accounted for by a difference in the overall size of the persistent store, or by differences in
the way in which objects are physically located on disk. The overall size of Index B is about
one third greater than that of Index A (1 GB vs 750 MB), but the difference in retrieval time is
more than 10-fold. Therefore, it is likely that the containers in Index B are stored differently across
disk partitions, compared to Index A. Future work is required in this area to understand the
effects of different caching procedures and different methods of making containers persistent.
A set of similar queries was attempted for the PIR database, stored using the Index B
structure. A query to find 10,000 author names, retrieving 4.5 million records in total, takes
244 seconds from a cold cache, and only 3 seconds if the search is repeated with a warm
cache. A series of Boolean “AND” queries was attempted but errors were encountered due
to the limits of the PJama technology being reached.
A.3.7 Visualisation
One facet of querying an XML representation is enabling the user to visualise the structure
of the data and find the leaf nodes that correspond to the semantics of the data they wish
to find. An interface was created displaying the Data Path Tree (Figure A.4) enabling the
user to browse the structure of the data. Boolean queries can be formulated, and records
returned by the search can be viewed.
A.4 Discussion
The results for querying the DBLP bibliography demonstrate that rapid queries can be
performed against fairly large collections of data. The index returned results very slowly
from a cold cache, but rapidly if searches were repeated and data were already cached in
main memory. Unexpectedly, much better results were observed for a simpler index
structure that did not have the XML Structure Container, even for comparisons over queries
that did not utilise the Structure Container in the search. The problems observed may
be related to the manner in which PJama caches objects in main memory, such that the more
complex index structure is accessed much less efficiently. The caching policy of PJama is
difficult to analyse, and therefore future improvements to the index structure would require
implementing the structure in a different technology, such as the native disk I/O
(input/output) methods offered by Java v1.4 [167].
Figure A.4: A prototype interface for querying an indexed store of XML data.
The investigation demonstrates that indexing XML is a viable approach for storage of
local versions of biological databases, without requiring the overhead of implementing and
maintaining a relational database management system. The Xtect project has extended
this work at Glasgow University, and software has been developed to store local versions of
several different biological databases, and in addition to indexing, to perform matching across
different databases to find data paths that are semantically equivalent. Redundant paths
are removed from the index and a digest of the information that can be found about each
gene or protein can be presented to the user. This represents an advance in data integration
utilising XML, which is likely to be extended further in the near future.
The indexing investigation also demonstrates that a feasible solution for storage of bio-
logical data may be to use a native XML representation, although the technology employed
in this investigation is not robust enough for a large scale system. There are several com-
mercial database systems for XML, which may be a viable alternative to relational database
storage for biological data, especially for cases where there are frequent changes to the data
that must be stored.
Appendix B
Detailed diagrams of FGE-OM
The diagrams in this section display the contents of the six packages within the Proteomic-
sOM namespace of FGE-OM, which are described in detail in Chapter 4.
Figure B.1: The ProteinSeparation package of FGE-OM.
Figure B.2: The ProteomeBioAssay package.
Figure B.3: The ProteinData package.
Figure B.4: The ProteinRecord package.
Figure B.5: The MassSpecProtocol package.
Figure B.6: The MassSpecData package.
Appendix C
Database schema for RAPAD
/*==============================================================*/
/* Table: ACQUISITION */
/*==============================================================*/
create table ACQUISITION (
ACQUISITION_ID NUMBER(8) not null,
ASSAY_ID NUMBER(8) not null,
PROTOCOL_ID NUMBER(10),
CHANNEL_ID NUMBER(4),
ACQUISITION_DATE DATE,
NAME VARCHAR2(100),
URI VARCHAR2(255),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ACQUISITION primary key (ACQUISITION_ID)
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND05 */
/*==============================================================*/
create index ACQUISITION_IND05 on ACQUISITION (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND06 */
/*==============================================================*/
create index ACQUISITION_IND06 on ACQUISITION (
CHANNEL_ID ASC
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND07 */
/*==============================================================*/
create index ACQUISITION_IND07 on ACQUISITION (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND08 */
/*==============================================================*/
create index ACQUISITION_IND08 on ACQUISITION (
NAME ASC
)
/
/*==============================================================*/
/* Table: ACQUISITIONPARAM */
/*==============================================================*/
create table ACQUISITIONPARAM (
ACQUISITION_PARAM_ID NUMBER(5) not null,
ACQUISITION_ID NUMBER(8) not null,
PROTOCOL_PARAM_ID NUMBER(10),
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(50) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ACQUISITIONPARAM primary key
(ACQUISITION_PARAM_ID)
)
/
/*==============================================================*/
/* Index: ACQPARAM_AK01 */
/*==============================================================*/
create unique index ACQPARAM_AK01 on ACQUISITIONPARAM (
ACQUISITION_ID ASC,
NAME ASC
)
/
/*==============================================================*/
/* Table: ANALYSIS */
/*==============================================================*/
create table ANALYSIS (
ANALYSIS_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYSIS primary key (ANALYSIS_ID)
)
/
/*==============================================================*/
/* Table: ANALYSISIMPLEMENTATION */
/*==============================================================*/
create table ANALYSISIMPLEMENTATION (
ANALYSIS_IMPLEMENTATION_ID NUMBER(5) not null,
ANALYSIS_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYSISIMPLEMENTATION3 primary key
(ANALYSIS_IMPLEMENTATION_ID)
)
/
/*==============================================================*/
/* Index: ANALYSISIMPLEMENTATION_IND01 */
/*==============================================================*/
create index ANALYSISIMPLEMENTATION_IND01 on
ANALYSISIMPLEMENTATION (
ANALYSIS_ID ASC
)
/
/*==============================================================*/
/* Table: ANALYSISIMPLEMENTATIONPARAM */
/*==============================================================*/
create table ANALYSISIMPLEMENTATIONPARAM (
ANALYSIS_IMP_PARAM_ID NUMBER(5) not null,
ANALYSIS_IMPLEMENTATION_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYSISIMPLPARAM primary key
(ANALYSIS_IMP_PARAM_ID)
)
/
/*==============================================================*/
/* Index: ANALYSISIMPPARAM_IND01 */
/*==============================================================*/
create index ANALYSISIMPPARAM_IND01 on ANALYSISIMPLEMENTATIONPARAM
(
ANALYSIS_IMPLEMENTATION_ID ASC
)
/
/*==============================================================*/
/* Table: ANALYSISINPUT */
/*==============================================================*/
create table ANALYSISINPUT (
ANALYSIS_INPUT_ID NUMBER(5) not null,
ANALYSIS_INVOCATION_ID NUMBER(5) not null,
TABLE_ID NUMBER(5),
INPUT_ROW_ID NUMBER(10),
INPUT_VALUE VARCHAR2(50),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYSISINPUT primary key (ANALYSIS_INPUT_ID)
)
/
/*==============================================================*/
/* Index: ANALYSISINPUT_IND01 */
/*==============================================================*/
create index ANALYSISINPUT_IND01 on ANALYSISINPUT (
ANALYSIS_INVOCATION_ID ASC
)
/
/*==============================================================*/
/* Index: ANALYSISINPUT_IND02 */
/*==============================================================*/
create index ANALYSISINPUT_IND02 on ANALYSISINPUT (
TABLE_ID ASC
)
/
/*==============================================================*/
/* Table: ANALYSISINVOCATION */
/*==============================================================*/
create table ANALYSISINVOCATION (
ANALYSIS_INVOCATION_ID NUMBER(5) not null,
ANALYSIS_IMPLEMENTATION_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYSISINVOCATION primary key
(ANALYSIS_INVOCATION_ID)
)
/
/*==============================================================*/
/* Index: ANALYSISINVOCATION_IND01 */
/*==============================================================*/
create index ANALYSISINVOCATION_IND01 on ANALYSISINVOCATION (
ANALYSIS_IMPLEMENTATION_ID ASC
)
/
/*==============================================================*/
/* Table: ANALYSISINVOCATIONPARAM */
/*==============================================================*/
create table ANALYSISINVOCATIONPARAM (
ANALYSIS_INVOCATION_PARAM_ID NUMBER(5) not null,
ANALYSIS_INVOCATION_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANAYLSISINVOCATIONPARAM primary key
(ANALYSIS_INVOCATION_PARAM_ID)
)
/
/*==============================================================*/
/* Index: ANALYSISINVOCATIONPARAM_IND01 */
/*==============================================================*/
create index ANALYSISINVOCATIONPARAM_IND01 on
ANALYSISINVOCATIONPARAM (
ANALYSIS_INVOCATION_ID ASC
)
/
/*==============================================================*/
/* Table: ANALYSISOUTPUT */
/*==============================================================*/
create table ANALYSISOUTPUT (
ANALYSIS_OUTPUT_ID NUMBER(10) not null,
ANALYSIS_INVOCATION_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
TYPE VARCHAR2(50) not null,
VALUE NUMBER(5) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYSISOUTPUT primary key (ANALYSIS_OUTPUT_ID)
)
/
/*==============================================================*/
/* Index: ANALYSISOUTPUT_IND01 */
/*==============================================================*/
create index ANALYSISOUTPUT_IND01 on ANALYSISOUTPUT (
ANALYSIS_INVOCATION_ID ASC
)
/
/*==============================================================*/
/* Table: ANALYTEMEASUREMENT */
/*==============================================================*/
create table ANALYTEMEASUREMENT (
BIO_MATERIAL_MEASUREMENT_ID NUMBER(10) not null,
BIO_MATERIAL_ID NUMBER(8) not null,
BIOASSAY_TREATMENT_ID NUMBER(8),
VALUE FLOAT,
UNIT_TYPE_ID NUMBER(5),
MEASUREMENT_DESCRIPTION VARCHAR2(300),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ANALYTE_MEASUREMENT primary key
(BIO_MATERIAL_MEASUREMENT_ID)
)
/
/*==============================================================*/
/* Index: ANALYTE_MEASUREMENT_IND04 */
/*==============================================================*/
create unique index ANALYTE_MEASUREMENT_IND04 on
ANALYTEMEASUREMENT (
BIO_MATERIAL_ID ASC
)
/
/*==============================================================*/
/* Table: ARRAY */
/*==============================================================*/
create table ARRAY (
ARRAY_ID NUMBER(4) not null,
MANUFACTURER_ID NUMBER(12) not null,
PLATFORM_TYPE_ID NUMBER(10) not null,
SUBSTRATE_TYPE_ID NUMBER(10),
PROTOCOL_ID NUMBER(10),
EXTERNAL_DATABASE_RELEASE_ID NUMBER(4),
SOURCE_ID VARCHAR2(100),
NAME VARCHAR2(100) not null,
VERSION VARCHAR2(50) not null,
DESCRIPTION VARCHAR2(500),
ARRAY_DIMENSIONS VARCHAR2(50),
ELEMENT_DIMENSIONS VARCHAR2(50),
NUMBER_OF_ELEMENTS NUMBER(10),
NUM_ARRAY_COLUMNS NUMBER(3),
NUM_ARRAY_ROWS NUMBER(3),
NUM_GRID_COLUMNS NUMBER(3),
NUM_GRID_ROWS NUMBER(3),
NUM_SUB_COLUMNS NUMBER(6),
NUM_SUB_ROWS NUMBER(6),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ARRAY primary key (ARRAY_ID)
)
/
/*==============================================================*/
/* Index: ARRAY_AK01 */
/*==============================================================*/
create unique index ARRAY_AK01 on ARRAY (
NAME ASC,
VERSION ASC
)
/
/*==============================================================*/
/* Index: ARRAY_IND02 */
/*==============================================================*/
create index ARRAY_IND02 on ARRAY (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Index: ARRAY_IND03 */
/*==============================================================*/
create index ARRAY_IND03 on ARRAY (
PLATFORM_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: ARRAY_IND04 */
/*==============================================================*/
create index ARRAY_IND04 on ARRAY (
SUBSTRATE_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: ARRAY_IND05 */
/*==============================================================*/
create index ARRAY_IND05 on ARRAY (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: ARRAY_IND06 */
/*==============================================================*/
create index ARRAY_IND06 on ARRAY (
MANUFACTURER_ID ASC
)
/
/*==============================================================*/
/* Table: ARRAYANNOTATION */
/*==============================================================*/
create table ARRAYANNOTATION (
ARRAY_ANNOTATION_ID NUMBER(5) not null,
ARRAY_ID NUMBER(4) not null,
NAME VARCHAR2(500) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ARRAYANNOTATION primary key (ARRAY_ANNOTATION_ID)
)
/
/*==============================================================*/
/* Index: ARRAYANNOTATION_IND01 */
/*==============================================================*/
create index ARRAYANNOTATION_IND01 on ARRAYANNOTATION (
ARRAY_ID ASC
)
/
/*==============================================================*/
/* Table: ASSAY */
/*==============================================================*/
create table ASSAY (
ASSAY_ID NUMBER(8) not null,
ARRAY_ID NUMBER(4) not null,
PROTOCOL_ID NUMBER(10),
ASSAY_DATE DATE,
ARRAY_IDENTIFIER VARCHAR2(100),
ARRAY_BATCH_IDENTIFIER VARCHAR2(100),
OPERATOR_ID NUMBER(10) not null,
EXTERNAL_DATABASE_RELEASE_ID NUMBER(5),
SOURCE_ID VARCHAR2(50),
NAME VARCHAR2(100),
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ASSAY primary key (ASSAY_ID)
)
/
/*==============================================================*/
/* Index: ASSAY_INDEX */
/*==============================================================*/
create index ASSAY_INDEX on ASSAY (
ARRAY_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAY_IND02 */
/*==============================================================*/
create index ASSAY_IND02 on ASSAY (
OPERATOR_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAY_IND03 */
/*==============================================================*/
create index ASSAY_IND03 on ASSAY (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAY_IND04 */
/*==============================================================*/
create index ASSAY_IND04 on ASSAY (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAY_IND05 */
/*==============================================================*/
create index ASSAY_IND05 on ASSAY (
NAME ASC
)
/
/*==============================================================*/
/* Table: ASSAYBIOMATERIAL */
/*==============================================================*/
create table ASSAYBIOMATERIAL (
ASSAY_BIO_MATERIAL_ID NUMBER(5) not null,
ASSAY_ID NUMBER(8) not null,
BIO_MATERIAL_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ASSAYBIOMATERIAL primary key
(ASSAY_BIO_MATERIAL_ID)
)
/
/*==============================================================*/
/* Index: ASSAYBIOMATERIAL_IND01 */
/*==============================================================*/
create index ASSAYBIOMATERIAL_IND01 on ASSAYBIOMATERIAL (
BIO_MATERIAL_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAYBIOMATERIAL_IND02 */
/*==============================================================*/
create index ASSAYBIOMATERIAL_IND02 on ASSAYBIOMATERIAL (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Table: ASSAYDATAPOINT */
/*==============================================================*/
create table ASSAYDATAPOINT (
id NUMBER(8) not null,
time float not null,
protein_assay float not null,
lc_column NUMBER(8) not null,
constraint PK_ASSAYDATAPOINT primary key (lc_column, id)
)
/
/*==============================================================*/
/* Table: ASSAYGROUP */
/*==============================================================*/
create table ASSAYGROUP (
STUDY_ID NUMBER(4) not null,
ASSAY_ID NUMBER(8) not null,
STUDY_DESIGN_ID NUMBER(5) not null,
FACTOR_VALUE VARCHAR2(100) not null,
STUDY_FACTOR_VALUE_ID NUMBER(8),
constraint PK_ASSAYGROUP primary key
(STUDY_ID, ASSAY_ID, STUDY_DESIGN_ID)
)
/
/*==============================================================*/
/* Table: ASSAYLABELEDEXTRACT */
/*==============================================================*/
create table ASSAYLABELEDEXTRACT (
ASSAY_LABELED_EXTRACT_ID NUMBER(8) not null,
ASSAY_ID NUMBER(8) not null,
LABELED_EXTRACT_ID NUMBER(8) not null,
CHANNEL_ID NUMBER(4) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ASSAYLABELEDEXTRACT primary key
(ASSAY_LABELED_EXTRACT_ID)
)
/
/*==============================================================*/
/* Index: ASSAYLABELEDEXTRACT_IND01 */
/*==============================================================*/
create index ASSAYLABELEDEXTRACT_IND01 on ASSAYLABELEDEXTRACT (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAYLABELEDEXTRACT_IND02 */
/*==============================================================*/
create index ASSAYLABELEDEXTRACT_IND02 on ASSAYLABELEDEXTRACT (
CHANNEL_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAYLABELEDEXTRACT_IND03 */
/*==============================================================*/
create index ASSAYLABELEDEXTRACT_IND03 on ASSAYLABELEDEXTRACT (
LABELED_EXTRACT_ID ASC
)
/
/*==============================================================*/
/* Table: ASSAYPARAM */
/*==============================================================*/
create table ASSAYPARAM (
ASSAY_PARAM_ID NUMBER(10) not null,
ASSAY_ID NUMBER(8) not null,
PROTOCOL_PARAM_ID NUMBER(10) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ASSAYPARAM primary key (ASSAY_PARAM_ID)
)
/
/*==============================================================*/
/* Index: ASSAYPARAM_IND01 */
/*==============================================================*/
create index ASSAYPARAM_IND01 on ASSAYPARAM (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: ASSAYPARAM_IND02 */
/*==============================================================*/
create index ASSAYPARAM_IND02 on ASSAYPARAM (
PROTOCOL_PARAM_ID ASC
)
/
/*==============================================================*/
/* Table: ASSAYPARAMPROT */
/*==============================================================*/
create table ASSAYPARAMPROT (
ASSAY_PARAM_ID NUMBER(10) not null,
PROTEOME_ASSAY_ID NUMBER(8),
PROTOCOL_PARAM_ID NUMBER(10) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ASSAYPARAM_PROT primary key (ASSAY_PARAM_ID)
)
/
/*==============================================================*/
/* Index: ASSAYPARAM_PROT_IND02 */
/*==============================================================*/
create index ASSAYPARAM_PROT_IND02 on ASSAYPARAMPROT (
PROTOCOL_PARAM_ID ASC
)
/
/*==============================================================*/
/* Table: BAND */
/*==============================================================*/
create table BAND (
id NUMBER(8) not null,
area float,
intensity float,
local_background float,
annotation varchar(200),
annotation_source varchar(200),
volume float,
pixel_x_coord float,
pixel_y_coord float,
pixel_radius float,
normalisation varchar(200),
normalised_volume float,
lane_number float not null,
apparent_mass float not null,
gel_1d NUMBER(8) not null,
physicalGelSpot_ID NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_BAND primary key (gel_1d, id)
)
/
/*==============================================================*/
/* Table: BIOASSAYTREATMENT */
/*==============================================================*/
create table BIOASSAYTREATMENT (
BIOASSAY_TREATMENT_ID NUMBER(8) not null,
ORDER_NUM NUMBER(3) not null,
PROTEOME_ASSAY_ID NUMBER(8),
PROTOCOL_ID NUMBER(10),
TREATMENT_TYPE_ID NUMBER(10),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_BIOASSAYTREATMENT primary key
(BIOASSAY_TREATMENT_ID)
)
/
/*==============================================================*/
/* Table: BIOMATERIALCHARACTERISTIC */
/*==============================================================*/
create table BIOMATERIALCHARACTERISTIC (
BIO_MATERIAL_CHARACTERISTIC_ID NUMBER(5) not null,
BIO_MATERIAL_ID NUMBER(8) not null,
ONTOLOGY_ENTRY_ID NUMBER(10) not null,
VALUE VARCHAR2(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_BIOAMATCHARACTERISTIC primary key
(BIO_MATERIAL_CHARACTERISTIC_ID)
)
/
/*==============================================================*/
/* Index: BIOMATCHARACTERISTIC_IND01 */
/*==============================================================*/
create index BIOMATCHARACTERISTIC_IND01 on
BIOMATERIALCHARACTERISTIC (
BIO_MATERIAL_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATCHARACTERISTIC_IND02 */
/*==============================================================*/
create index BIOMATCHARACTERISTIC_IND02 on
BIOMATERIALCHARACTERISTIC (
ONTOLOGY_ENTRY_ID ASC
)
/
/*==============================================================*/
/* Table: BIOMATERIALIMP */
/*==============================================================*/
create table BIOMATERIALIMP (
BIO_MATERIAL_ID NUMBER(8) not null,
LABEL_METHOD_ID NUMBER(4),
TAXON_ID NUMBER(10),
BIO_SOURCE_PROVIDER_ID NUMBER(12),
BIO_MATERIAL_TYPE_ID NUMBER(10),
SUBCLASS_VIEW VARCHAR2(27) not null,
EXTERNAL_DATABASE_RELEASE_ID NUMBER(5),
SOURCE_ID VARCHAR2(50),
STRING1 VARCHAR2(100),
STRING2 VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_BIOMATERIALIMP primary key (BIO_MATERIAL_ID)
)
/
/*==============================================================*/
/* Index: BIOMATERIALIMP_IND01 */
/*==============================================================*/
create index BIOMATERIALIMP_IND01 on BIOMATERIALIMP (
LABEL_METHOD_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATERIALIMP_IND02 */
/*==============================================================*/
create index BIOMATERIALIMP_IND02 on BIOMATERIALIMP (
TAXON_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATERIALIMP_IND03 */
/*==============================================================*/
create index BIOMATERIALIMP_IND03 on BIOMATERIALIMP (
BIO_MATERIAL_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATERIALIMP_IND04 */
/*==============================================================*/
create index BIOMATERIALIMP_IND04 on BIOMATERIALIMP (
BIO_SOURCE_PROVIDER_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATERIALIMP_IND05 */
/*==============================================================*/
create index BIOMATERIALIMP_IND05 on BIOMATERIALIMP (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Table: BIOMATERIALMEASUREMENT */
/*==============================================================*/
create table BIOMATERIALMEASUREMENT (
BIO_MATERIAL_MEASUREMENT_ID NUMBER(10) not null,
TREATMENT_ID NUMBER(10) not null,
BIO_MATERIAL_ID NUMBER(8) not null,
VALUE FLOAT,
UNIT_TYPE_ID NUMBER(10),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_BIOMATERIALMEASUREMENT primary key
(BIO_MATERIAL_MEASUREMENT_ID)
)
/
/*==============================================================*/
/* Index: BIOMATERIALMEASUREMENT_IND03 */
/*==============================================================*/
create index BIOMATERIALMEASUREMENT_IND03 on
BIOMATERIALMEASUREMENT (
TREATMENT_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATERIALMEASUREMENT_IND04 */
/*==============================================================*/
create index BIOMATERIALMEASUREMENT_IND04 on
BIOMATERIALMEASUREMENT (
BIO_MATERIAL_ID ASC
)
/
/*==============================================================*/
/* Index: BIOMATERIALMEASUREMENT_IND05 */
/*==============================================================*/
create index BIOMATERIALMEASUREMENT_IND05 on
BIOMATERIALMEASUREMENT (
UNIT_TYPE_ID ASC
)
/
/*==============================================================*/
/* Table: BOUNDARYPOINT */
/*==============================================================*/
create table BOUNDARYPOINT (
id NUMBER(8) not null,
pixel_x_coord float not null,
pixel_y_coord float not null,
spot_gel_2d NUMBER(8) not null,
spot_id NUMBER(8) not null,
physicalGelItem_ID NUMBER(8) not null,
gel2D NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_BOUNDARYPOINT primary key
(spot_gel_2d, spot_id, physicalGelItem_ID, gel2D, id)
)
/
/*==============================================================*/
/* Table: CHANNEL */
/*==============================================================*/
create table CHANNEL (
CHANNEL_ID NUMBER(4) not null,
NAME VARCHAR2(100) not null,
DEFINITION VARCHAR2(500) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_CHANNEL primary key (CHANNEL_ID)
)
/
/*==============================================================*/
/* Index: CHANNEL_AK01 */
/*==============================================================*/
create unique index CHANNEL_AK01 on CHANNEL (
NAME ASC
)
/
/*==============================================================*/
/* Table: CHEMICALTREATMENT */
/*==============================================================*/
create table CHEMICALTREATMENT (
chemical_treatment_ID NUMBER(8) not null,
BIOASSAY_TREATMENT_ID NUMBER(8) not null,
treatment_type NUMBER(10),
digestion varchar(200) not null,
derivatisations varchar(200) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_CHEMICALTREATMENT primary key
(chemical_treatment_ID)
)
/
/*==============================================================*/
/* Table: COLLISIONCELL */
/*==============================================================*/
create table COLLISIONCELL (
collision_cellID NUMBER(8) not null,
mz_analysis_ID NUMBER(8),
gas_type varchar(100) not null,
gas_pressure float not null,
collision_offset float not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_COLLISIONCELL primary key (collision_cellID)
)
/
/*==============================================================*/
/* Table: COMPOSITEELEMENTANNOTATION */
/*==============================================================*/
create table COMPOSITEELEMENTANNOTATION (
COMPOSITE_ELEMENT_ANNOT_ID NUMBER(12) not null,
COMPOSITE_ELEMENT_ID NUMBER(10) not null,
NAME VARCHAR2(500) not null,
VALUE VARCHAR2(255) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_COMPOSITEELEMENTANNOTATION primary key
(COMPOSITE_ELEMENT_ANNOT_ID)
)
/
/*==============================================================*/
/* Index: COMPELEMENTANNOT_IND01 */
/*==============================================================*/
create index COMPELEMENTANNOT_IND01 on COMPOSITEELEMENTANNOTATION
(
COMPOSITE_ELEMENT_ID ASC
)
/
/*==============================================================*/
/* Table: COMPOSITEELEMENTGUS */
/*==============================================================*/
create table COMPOSITEELEMENTGUS (
COMPOSITE_ELEMENT_GUS_ID NUMBER(12) not null,
COMPOSITE_ELEMENT_ID NUMBER(10),
TABLE_ID NUMBER(5) not null,
ROW_ID NUMBER(12) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_COMPOSITEELEMENTGUS primary key
(COMPOSITE_ELEMENT_GUS_ID)
)
/
/*==============================================================*/
/* Table: COMPOSITEELEMENTIMP */
/*==============================================================*/
create table COMPOSITEELEMENTIMP (
COMPOSITE_ELEMENT_ID NUMBER(10) not null,
PARENT_ID NUMBER(10),
ARRAY_ID NUMBER(4) not null,
SUBCLASS_VIEW VARCHAR2(27) not null,
EXTERNAL_DATABASE_RELEASE_ID NUMBER(4),
SOURCE_ID VARCHAR2(50),
TINYINT1 NUMBER(3),
SMALLINT1 NUMBER(5),
SMALLINT2 NUMBER(5),
CHAR1 VARCHAR2(5),
CHAR2 VARCHAR2(5),
TINYSTRING1 VARCHAR2(50),
TINYSTRING2 VARCHAR2(50),
SMALLSTRING1 VARCHAR2(100),
SMALLSTRING2 VARCHAR2(100),
STRING1 VARCHAR2(500),
STRING2 VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_COMPOSITEELEMENTIMP primary key
(COMPOSITE_ELEMENT_ID)
)
/
/*==============================================================*/
/* Index: RAD3_SPOTFAMILY_IND01 */
/*==============================================================*/
create index RAD3_SPOTFAMILY_IND01 on COMPOSITEELEMENTIMP (
COMPOSITE_ELEMENT_ID ASC,
SMALLSTRING1 ASC,
SMALLSTRING2 ASC
)
/
/*==============================================================*/
/* Index: RAD3_SPOTFAMILY_IND02 */
/*==============================================================*/
create index RAD3_SPOTFAMILY_IND02 on COMPOSITEELEMENTIMP (
ARRAY_ID ASC,
EXTERNAL_DATABASE_RELEASE_ID ASC,
SOURCE_ID ASC
)
/
/*==============================================================*/
/* Index: SAGETAG_IND01 */
/*==============================================================*/
create index SAGETAG_IND01 on COMPOSITEELEMENTIMP (
ARRAY_ID ASC,
TINYSTRING1 ASC
)
/
/*==============================================================*/
/* Index: SAGETAG_IND02 */
/*==============================================================*/
create index SAGETAG_IND02 on COMPOSITEELEMENTIMP (
PARENT_ID ASC
)
/
/*==============================================================*/
/* Index: SAGETAG_IND03 */
/*==============================================================*/
create index SAGETAG_IND03 on COMPOSITEELEMENTIMP (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Table: COMPOSITEELEMENTRESULTIMP */
/*==============================================================*/
create table COMPOSITEELEMENTRESULTIMP (
COMPOSITE_ELEMENT_RESULT_ID NUMBER(10) not null,
COMPOSITE_ELEMENT_ID NUMBER(10) not null,
QUANTIFICATION_ID NUMBER(8) not null,
SUBCLASS_VIEW VARCHAR2(27) not null,
FLOAT1 FLOAT,
FLOAT2 FLOAT,
FLOAT3 FLOAT,
FLOAT4 FLOAT,
INT1 NUMBER(12),
SMALLINT1 NUMBER(5),
SMALLINT2 NUMBER(5),
SMALLINT3 NUMBER(5),
TINYINT1 NUMBER(3),
TINYINT2 NUMBER(3),
TINYINT3 NUMBER(3),
CHAR1 VARCHAR2(5),
CHAR2 VARCHAR2(5),
CHAR3 VARCHAR2(5),
STRING1 VARCHAR2(500),
STRING2 VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_SFMRES primary key (COMPOSITE_ELEMENT_RESULT_ID)
)
/
/*==============================================================*/
/* Index: COMPELEMENTRESULTIMP_IND01 */
/*==============================================================*/
create index COMPELEMENTRESULTIMP_IND01 on
COMPOSITEELEMENTRESULTIMP (
COMPOSITE_ELEMENT_ID ASC,
SUBCLASS_VIEW ASC
)
/
/*==============================================================*/
/* Index: COMPELEMENTRESULTIMP_IND02 */
/*==============================================================*/
create index COMPELEMENTRESULTIMP_IND02 on
COMPOSITEELEMENTRESULTIMP (
COMPOSITE_ELEMENT_ID ASC,
QUANTIFICATION_ID ASC,
SUBCLASS_VIEW ASC
)
/
/*==============================================================*/
/* Table: CONTROL */
/*==============================================================*/
create table CONTROL (
CONTROL_ID NUMBER(5) not null,
CONTROL_TYPE_ID NUMBER(10) not null,
ASSAY_ID NUMBER(8) not null,
TABLE_ID NUMBER(10) not null,
ROW_ID NUMBER(12) not null,
NAME VARCHAR2(100),
VALUE VARCHAR2(255),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_CONTROL primary key (CONTROL_ID)
)
/
/*==============================================================*/
/* Index: CONTROL_IND01 */
/*==============================================================*/
create index CONTROL_IND01 on CONTROL (
TABLE_ID ASC
)
/
/*==============================================================*/
/* Index: CONTROL_IND02 */
/*==============================================================*/
create index CONTROL_IND02 on CONTROL (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: CONTROL_IND03 */
/*==============================================================*/
create index CONTROL_IND03 on CONTROL (
CONTROL_TYPE_ID ASC
)
/
/*==============================================================*/
/* Table: DATABASEENTRY */
/*==============================================================*/
create table DATABASEENTRY (
database_name NUMBER(10),
database_version NUMBER(10),
database_uri NUMBER(10),
database_entry_ID NUMBER(8) not null,
accession VARCHAR(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_DATABASE_ENTRY primary key (database_entry_ID)
)
/
/*==============================================================*/
/* Table: DBSEARCH */
/*==============================================================*/
create table DBSEARCH (
db_search_ID NUMBER(8) not null,
peak_list_ID NUMBER(8),
db_search_parameters_ID NUMBER(8),
username varchar(100) not null,
id_date date not null,
n_terminal_aa varchar(100),
c_terminal_aa varchar(100),
count_of_specific_aa NUMBER(8),
name_of_counted_aa varchar(100),
regex_pattern varchar(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
search_file_uri VARCHAR2(300),
constraint PK_DBSEARCH primary key (db_search_ID)
)
/
/*==============================================================*/
/* Table: DBSEARCHPARAMETERS */
/*==============================================================*/
create table DBSEARCHPARAMETERS (
db_search_parameters_ID NUMBER(8) not null,
PROTOCOL_ID NUMBER(10),
program varchar(100) not null,
database varchar(100) not null,
database_date date not null,
taxonomical_filter NUMBER(8),
fixed_modifications varchar(100),
variable_modifications varchar(100),
max_missed_cleavages NUMBER(8),
mass_value_type varchar(100),
fragment_ion_tolerance float,
peptide_mass_tolerance float,
accurate_mass_mode NUMBER(8),
mass_error_type varchar(100),
mass_error float,
protonated NUMBER(8),
icat_option NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_DBSEARCHPARAMETERS primary key
(db_search_parameters_ID)
)
/
/*==============================================================*/
/* Table: DETECTION */
/*==============================================================*/
create table DETECTION (
detection_ID NUMBER(8) not null,
type varchar(9)
constraint CKC_TYPE_DETECTIO check (type is null or (
type in ('photomultiplier','electron multiplier',
'micro-channel plate','ICR') )),
constraint PK_DETECTION primary key (detection_ID)
)
/
/*==============================================================*/
/* Table: DIGESINGLESPOT */
/*==============================================================*/
create table DIGESINGLESPOT (
identified_spot_ID NUMBER(8),
DIGESingleSpot_ID NUMBER(8) not null,
GEL_IMAGE_ANALYSIS_ID NUMBER(8) not null,
SPOT_MEASURES_ID NUMBER(10),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_DIGESINGLESPOT primary key (DIGESingleSpot_ID)
)
/
/*==============================================================*/
/* Table: ELECTROSPRAY */
/*==============================================================*/
create table ELECTROSPRAY (
electrospray_ID NUMBER(8) not null,
ion_source_ID NUMBER(8),
spray_tip_voltage float,
spray_tip_diameter float not null,
solution_voltage float,
cone_voltage float not null,
loading_type varchar(2)
constraint CKC_LOADING_TYPE_ELECTROS check
(loading_type is null or ( loading_type in ('LC','DI')
)),
solvent varchar(100) not null,
interface_manufacturer varchar(200) not null,
spray_tip_manufacturer varchar(200) not null,
ion_source NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ELECTROSPRAY primary key (electrospray_ID)
)
/
/*==============================================================*/
/* Table: ELEMENTANNOTATION */
/*==============================================================*/
create table ELEMENTANNOTATION (
ELEMENT_ANNOTATION_ID NUMBER(10) not null,
ELEMENT_ID NUMBER(10) not null,
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(500) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ELEMENTANNOTATION primary key
(ELEMENT_ANNOTATION_ID)
)
/
/*==============================================================*/
/* Index: ELEMENTANNOTATION_IND01 */
/*==============================================================*/
create index ELEMENTANNOTATION_IND01 on ELEMENTANNOTATION (
ELEMENT_ID ASC
)
/
/*==============================================================*/
/* Table: ELEMENTIMP */
/*==============================================================*/
create table ELEMENTIMP (
ELEMENT_ID NUMBER(10) not null,
COMPOSITE_ELEMENT_ID NUMBER(10),
ARRAY_ID NUMBER(4) not null,
ELEMENT_TYPE_ID NUMBER(10),
EXTERNAL_DATABASE_RELEASE_ID NUMBER(5),
SOURCE_ID VARCHAR2(50),
SUBCLASS_VIEW VARCHAR2(27) not null,
TINYINT1 NUMBER(3),
SMALLINT1 NUMBER(5),
CHAR1 VARCHAR2(5),
CHAR2 VARCHAR2(5),
CHAR3 VARCHAR2(5),
CHAR4 VARCHAR2(5),
CHAR5 VARCHAR2(5),
CHAR6 VARCHAR2(5),
CHAR7 VARCHAR2(5),
TINYSTRING1 VARCHAR2(50),
TINYSTRING2 VARCHAR2(50),
SMALLSTRING1 VARCHAR2(100),
SMALLSTRING2 VARCHAR2(100),
STRING1 VARCHAR2(500),
STRING2 VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ELEMENT primary key (ELEMENT_ID)
)
/
/*==============================================================*/
/* Index: ARRAY_IND01 */
/*==============================================================*/
create index ARRAY_IND01 on ELEMENTIMP (
ARRAY_ID ASC
)
/
/*==============================================================*/
/* Index: RAD3_SPOT_IND02 */
/*==============================================================*/
create index RAD3_SPOT_IND02 on ELEMENTIMP (
COMPOSITE_ELEMENT_ID ASC
)
/
/*==============================================================*/
/* Index: SHORTOLIGO_IND02 */
/*==============================================================*/
create index SHORTOLIGO_IND02 on ELEMENTIMP (
ELEMENT_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: SHORTOLIGO_IND03 */
/*==============================================================*/
create index SHORTOLIGO_IND03 on ELEMENTIMP (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Table: ELEMENTRESULTIMP */
/*==============================================================*/
create table ELEMENTRESULTIMP (
ELEMENT_RESULT_ID NUMBER(10) not null,
ELEMENT_ID NUMBER(10) not null,
COMPOSITE_ELEMENT_RESULT_ID NUMBER(10),
QUANTIFICATION_ID NUMBER(8) not null,
SUBCLASS_VIEW VARCHAR2(27) not null,
FOREGROUND FLOAT,
BACKGROUND FLOAT,
FOREGROUND_SD FLOAT,
BACKGROUND_SD FLOAT,
FLOAT1 FLOAT,
FLOAT2 FLOAT,
FLOAT3 FLOAT,
FLOAT4 FLOAT,
FLOAT5 FLOAT,
FLOAT6 FLOAT,
FLOAT7 FLOAT,
FLOAT8 FLOAT,
FLOAT9 FLOAT,
FLOAT10 FLOAT,
FLOAT11 FLOAT,
FLOAT12 FLOAT,
FLOAT13 FLOAT,
FLOAT14 FLOAT,
INT1 NUMBER(12),
INT2 NUMBER(12),
INT3 NUMBER(12),
INT4 NUMBER(12),
INT5 NUMBER(12),
INT6 NUMBER(12),
INT7 NUMBER(12),
INT8 NUMBER(12),
INT9 NUMBER(12),
INT10 NUMBER(12),
INT11 NUMBER(12),
INT12 NUMBER(12),
INT13 NUMBER(12),
INT14 NUMBER(12),
INT15 NUMBER(12),
TINYINT1 NUMBER(3),
TINYINT2 NUMBER(3),
TINYINT3 NUMBER(3),
SMALLINT1 NUMBER(5),
SMALLINT2 NUMBER(5),
SMALLINT3 NUMBER(5),
CHAR1 VARCHAR2(5),
CHAR2 VARCHAR2(5),
CHAR3 VARCHAR2(5),
CHAR4 VARCHAR2(5),
TINYSTRING1 VARCHAR2(50),
TINYSTRING2 VARCHAR2(50),
TINYSTRING3 VARCHAR2(50),
SMALLSTRING1 VARCHAR2(100),
SMALLSTRING2 VARCHAR2(100),
STRING1 VARCHAR2(500),
STRING2 VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ELEMENTRESULT_N primary key (ELEMENT_RESULT_ID)
)
/
/*==============================================================*/
/* Index: ELEMENTRESULTIMP_IND01 */
/*==============================================================*/
create index ELEMENTRESULTIMP_IND01 on ELEMENTRESULTIMP (
ELEMENT_ID ASC,
SUBCLASS_VIEW ASC
)
/
/*==============================================================*/
/* Index: ELEMENTRESULTIMP_IND02 */
/*==============================================================*/
create index ELEMENTRESULTIMP_IND02 on ELEMENTRESULTIMP (
ELEMENT_ID ASC,
QUANTIFICATION_ID ASC,
SUBCLASS_VIEW ASC
)
/
/*==============================================================*/
/* Index: ELEMENTRESULTIMP_IND03 */
/*==============================================================*/
create index ELEMENTRESULTIMP_IND03 on ELEMENTRESULTIMP (
COMPOSITE_ELEMENT_RESULT_ID ASC
)
/
/*==============================================================*/
/* Table: FRACTION */
/*==============================================================*/
create table FRACTION (
Fraction_ID NUMBER(8) not null,
start_point float not null,
end_point float not null,
protein_assay float,
LCColumn_ID NUMBER(8) not null,
BIO_MATERIAL_ID NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_FRACTION primary key (Fraction_ID, LCColumn_ID)
)
/
/*==============================================================*/
/* Table: GEL1D */
/*==============================================================*/
create table GEL1D (
Gel1D_ID NUMBER(8) not null,
BIOASSAY_TREATMENT_ID NUMBER(8) not null,
description varchar(100) not null,
equipment varchar(200) not null,
percent_acrylamide float not null,
solubilization_buffer varchar(100) not null,
stain_details varchar(200) not null,
protein_assay float,
in_gel_digestion varchar(100),
background varchar(100),
pixel_size_x varchar(100),
pixel_size_y varchar(100),
denaturing_agent varchar(100),
mass_start float,
mass_end float,
run_details varchar(300),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_GEL1D primary key (Gel1D_ID)
)
/
/*==============================================================*/
/* Table: GEL2D */
/*==============================================================*/
create table GEL2D (
Gel2D_ID NUMBER(8) not null,
BIOASSAY_TREATMENT_ID NUMBER(8) not null,
description varchar(500),
equipment varchar(200),
percent_acrylamide float,
solubilization_buffer varchar(100),
stain_details varchar(200),
protein_assay float,
in_gel_digestion varchar(100),
background varchar(100),
pixel_size_x varchar(100),
pixel_size_y varchar(100),
pi_start float,
pi_end float,
mass_start float,
mass_end float,
first_dim_details varchar(200),
second_dim_details varchar(200),
dimensionX NUMBER(8),
dimensionY NUMBER(8),
dimensionZ NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_GEL2D primary key (Gel2D_ID)
)
/
/*==============================================================*/
/* Table: GELIMAGEANALYSIS */
/*==============================================================*/
create table GELIMAGEANALYSIS (
ACQUISITION_ID NUMBER(8) not null,
PROTOCOL_ID NUMBER(10),
GEL_IMAGE_ANALYSIS_ID NUMBER(8) not null,
processing_description VARCHAR(300),
warped_image VARCHAR(100),
warping_map VARCHAR(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
name VARCHAR(100),
IMAGEANALYSIS_DATE DATE,
constraint PK_GEL_IMAGE_ANALYSIS primary key
(GEL_IMAGE_ANALYSIS_ID)
)
/
/*==============================================================*/
/* Table: GRADIENTSTEP */
/*==============================================================*/
create table GRADIENTSTEP (
GradientStep_ID NUMBER(8) not null,
step_time float not null,
lc_column NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_GRADIENTSTEP primary key
(lc_column, GradientStep_ID)
)
/
/*==============================================================*/
/* Table: HEXAPOLE */
/*==============================================================*/
create table HEXAPOLE (
hexapole_ID NUMBER(8) not null,
mz_analysis_ID NUMBER(8),
description varchar(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_HEXAPOLE primary key (hexapole_ID)
)
/
/*==============================================================*/
/* Table: IDENTIFIEDSPOT */
/*==============================================================*/
create table IDENTIFIEDSPOT (
identified_spot_ID NUMBER(8) not null,
area float,
intensity float,
local_background float,
annotation varchar(200),
annotation_source varchar(200),
volume float,
pixel_x_coord float,
pixel_y_coord float,
pixel_radius float,
normalisation varchar(200),
normalised_volume float,
apparent_pi float,
apparent_mass float,
physicalGelItem_ID NUMBER(8) not null,
GEL_IMAGE_ANALYSIS_ID NUMBER(8),
SPOT_MEASURES_ID NUMBER(10),
peakHeight NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_IDENTIFIEDSPOT primary key (identified_spot_ID)
)
/
/*==============================================================*/
/* Table: IMAGEACQUISITION */
/*==============================================================*/
create table IMAGEACQUISITION (
IMAGE_ACQUISITION_ID NUMBER(8) not null,
PROTOCOL_ID NUMBER(10),
CHANNEL_ID NUMBER(4),
PROTEOME_ASSAY_ID NUMBER(8),
ACQUISITION_DATE DATE,
NAME VARCHAR2(100),
URI VARCHAR2(255),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_IMAGE_ACQUISITION primary key
(IMAGE_ACQUISITION_ID)
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND02 */
/*==============================================================*/
create index ACQUISITION_IND02 on IMAGEACQUISITION (
CHANNEL_ID ASC
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND03 */
/*==============================================================*/
create index ACQUISITION_IND03 on IMAGEACQUISITION (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: ACQUISITION_IND04 */
/*==============================================================*/
create index ACQUISITION_IND04 on IMAGEACQUISITION (
NAME ASC
)
/
/*==============================================================*/
/* Table: INTEGRITYSTATINPUT */
/*==============================================================*/
create table INTEGRITYSTATINPUT (
INTEGRITY_STAT_INPUT_ID NUMBER(10) not null,
INTEGRITY_STATISTIC_ID NUMBER(8) not null,
INPUT_TABLE_ID NUMBER(10) not null,
INPUT_ROW_ID NUMBER(10) not null,
ROW_DESIGNATION VARCHAR2(200),
IS_TRUSTED_INPUT NUMBER(1),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_INTEGRITYSTATINPUT primary key
(INTEGRITY_STAT_INPUT_ID)
)
/
/*==============================================================*/
/* Index: INTEGRITYSTATINPUT_IND01 */
/*==============================================================*/
create index INTEGRITYSTATINPUT_IND01 on INTEGRITYSTATINPUT (
INTEGRITY_STATISTIC_ID ASC
)
/
/*==============================================================*/
/* Index: INTEGRITYSTATINPUT_IND02 */
/*==============================================================*/
create index INTEGRITYSTATINPUT_IND02 on INTEGRITYSTATINPUT (
INPUT_TABLE_ID ASC
)
/
/*==============================================================*/
/* Table: INTEGRITYSTATISTIC */
/*==============================================================*/
create table INTEGRITYSTATISTIC (
INTEGRITY_STATISTIC_ID NUMBER(8) not null,
STATISTIC_METHOD VARCHAR2(200) not null,
TRUSTED_INPUT_FORMULA VARCHAR2(200),
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_INTEGRITYSTATISTIC primary key
(INTEGRITY_STATISTIC_ID)
)
/
/*==============================================================*/
/* Table: IONSOURCE */
/*==============================================================*/
create table IONSOURCE (
ion_source_ID NUMBER(8) not null,
collision_energy float,
type varchar(12)
constraint CKC_TYPE_IONSOURC check (type is null or
( type in ('MALDI','Electrospray','OtherIonisation') )),
mz_analysis NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_IONSOURCE primary key (ion_source_ID)
)
/
/*==============================================================*/
/* Table: IONTRAP */
/*==============================================================*/
create table IONTRAP (
ion_trap_ID NUMBER(8) not null,
mz_analysis_ID NUMBER(8),
gas_type varchar(100) not null,
gas_pressure float not null,
rf_frequency float,
excitation_amplitude float,
isolation_centre float not null,
isolation_width float not null,
final_ms_level float,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_IONTRAP primary key (ion_trap_ID)
)
/
/*==============================================================*/
/* Table: LABELMETHOD */
/*==============================================================*/
create table LABELMETHOD (
LABEL_METHOD_ID NUMBER(4) not null,
PROTOCOL_ID NUMBER(10) not null,
CHANNEL_ID NUMBER(4) not null,
LABEL_USED VARCHAR2(50),
LABEL_METHOD VARCHAR2(1000),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_LABEL primary key (LABEL_METHOD_ID)
)
/
/*==============================================================*/
/* Index: LABELMETHOD_IND01 */
/*==============================================================*/
create index LABELMETHOD_IND01 on LABELMETHOD (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: LABELMETHOD_IND02 */
/*==============================================================*/
create index LABELMETHOD_IND02 on LABELMETHOD (
CHANNEL_ID ASC
)
/
/*==============================================================*/
/* Table: LCCOLUMN */
/*==============================================================*/
create table LCCOLUMN (
LCColumn_ID NUMBER(8) not null,
BIOASSAY_TREATMENT_ID NUMBER(8),
description varchar(200) not null,
manufacturer varchar(100) not null,
part_number varchar(50) not null,
batch_number varchar(50) not null,
internal_length float not null,
internal_diameter float not null,
stationary_phase varchar(200) not null,
bead_size float,
pore_size float,
temperature float not null,
flow_rate float,
injection_volume float not null,
parameters_file varchar(100) not null,
lc_column NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_LCCOLUMN primary key (LCColumn_ID)
)
/
/*==============================================================*/
/* Table: LISTPROCESSING */
/*==============================================================*/
create table LISTPROCESSING (
list_processing_ID NUMBER(8) not null,
smoothing_process varchar(100) not null,
background_threshold float not null,
peak_list_ID NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_LISTPROCESSING primary key (list_processing_ID)
)
/
/*==============================================================*/
/* Table: MAGEDOCUMENTATION */
/*==============================================================*/
create table MAGEDOCUMENTATION (
MAGE_DOCUMENTATION_ID NUMBER(5) not null,
MAGE_ML_ID NUMBER(8) not null,
TABLE_ID NUMBER(10) not null,
ROW_ID NUMBER(12) not null,
MAGE_IDENTIFIER VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MAGEDOCUMENTATION primary key
(MAGE_DOCUMENTATION_ID)
)
/
/*==============================================================*/
/* Index: MAGEDOCUMENTATION_IND01 */
/*==============================================================*/
create index MAGEDOCUMENTATION_IND01 on MAGEDOCUMENTATION (
TABLE_ID ASC
)
/
/*==============================================================*/
/* Index: MAGEDOCUMENTATION_IND02 */
/*==============================================================*/
create index MAGEDOCUMENTATION_IND02 on MAGEDOCUMENTATION (
MAGE_ML_ID ASC
)
/
/*==============================================================*/
/* Table: MAGEML */
/*==============================================================*/
create table MAGEML (
MAGE_ML_ID NUMBER(8) not null,
MAGE_PACKAGE VARCHAR2(100) not null,
MAGE_ML CLOB not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MAGEML primary key (MAGE_ML_ID)
)
/
/*==============================================================*/
/* Table: MALDI */
/*==============================================================*/
create table MALDI (
MALDI_ID NUMBER(8) not null,
ion_source_ID NUMBER(8),
laser_wavelength float not null,
laser_power float,
matrix_type varchar(100),
grid_voltage float not null,
acceleration_voltage float not null,
ion_mode varchar(50) not null,
ion_source NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MALDI primary key (MALDI_ID)
)
/
/*==============================================================*/
/* Table: MASSSPECEXPERIMENT */
/*==============================================================*/
create table MASSSPECEXPERIMENT (
mass_spec_experiment_ID NUMBER(8) not null,
BIOASSAY_TREATMENT_ID NUMBER(8) not null,
MSMachineID NUMBER(8),
description varchar(200),
parameters_file varchar(200),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MASSSPECEXPERIMENT primary key
(mass_spec_experiment_ID)
)
/
/*==============================================================*/
/* Table: MASSSPECMACHINE */
/*==============================================================*/
create table MASSSPECMACHINE (
mass_spec_machine_ID NUMBER(8) not null,
ion_source_ID NUMBER(8),
manufacturer varchar(200) not null,
model_name varchar(200) not null,
software_version varchar(200) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MASSSPECMACHINE primary key
(mass_spec_machine_ID)
)
/
/*==============================================================*/
/* Table: MATCHEDSPOTS */
/*==============================================================*/
create table MATCHEDSPOTS (
matched_spots_ID NUMBER(8) not null,
multiple_analysis_ID NUMBER(8) not null,
identified_spot_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MATCHED_SPOTS primary key
(matched_Spots_ID, identified_spot_ID)
)
/
/*==============================================================*/
/* Table: MOBILEPHASECOMPONENT */
/*==============================================================*/
create table MOBILEPHASECOMPONENT (
id NUMBER(8) not null,
description varchar(100) not null,
concentration float not null,
lc_column NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MOBILEPHASECOMPONENT primary key (id)
)
/
/*==============================================================*/
/* Index: MPC_IND */
/*==============================================================*/
create index MPC_IND on MOBILEPHASECOMPONENT (
lc_column ASC
)
/
/*==============================================================*/
/* Table: MSMSFRACTION */
/*==============================================================*/
create table MSMSFRACTION (
peak_list_ID NUMBER(8),
msms_fraction_ID NUMBER(8) not null,
target_m_to_z float not null,
plus_or_minus float not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MSMSFRACTION primary key (msms_fraction_ID)
)
/
/*==============================================================*/
/* Table: MULTIPLEANALYSIS */
/*==============================================================*/
create table MULTIPLEANALYSIS (
multiple_analysis_ID NUMBER(8) not null,
analysis_type NUMBER(10),
PROTOCOL_ID NUMBER(10),
description VARCHAR(300),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MULTIPLE_ANALYSIS primary key
(multiple_analysis_ID)
)
/
/*==============================================================*/
/* Table: MULTIPLEANALYSISGELIA */
/*==============================================================*/
create table MULTIPLEANALYSISGELIA (
MULT_ANALYSIS_GELIA_ID NUMBER(8) NOT NULL,
GEL_IMAGE_ANALYSIS_ID NUMBER(8) NOT NULL,
MULTIPLE_ANALYSIS_ID NUMBER(8) NOT NULL,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MULTIPLE_ANALYSIS_GELIA primary key
(MULT_ANALYSIS_GELIA_ID)
)
/
/*==============================================================*/
/* Table: MZANALYSIS */
/*==============================================================*/
create table MZANALYSIS (
mz_analysis_ID NUMBER(8) not null,
detection_ID NUMBER(8),
type varchar(15)
constraint CKC_TYPE_MZANALYS check (type is null or (
type in ('Quadrupole','Hexapole','IonTrap','CollisionCell',
'ToF','OthermzAnalysis') )),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_MZANALYSIS primary key (mz_analysis_ID)
)
/
/*==============================================================*/
/* Table: ONTOLOGYENTRY */
/*==============================================================*/
create table ONTOLOGYENTRY (
ONTOLOGY_ENTRY_ID NUMBER(10) not null,
PARENT_ID NUMBER(10),
TABLE_ID NUMBER(8),
ROW_ID NUMBER(12),
EXTERNAL_DATABASE_RELEASE_ID NUMBER(10),
SOURCE_ID VARCHAR2(100),
URI VARCHAR2(500),
NAME VARCHAR2(100),
CATEGORY VARCHAR2(100) not null,
VALUE VARCHAR2(100) not null,
DEFINITION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_ONTOLOGYENTRY primary key (ONTOLOGY_ENTRY_ID)
)
/
/*==============================================================*/
/* Index: ONTOLOGYENTRY_AK01 */
/*==============================================================*/
create unique index ONTOLOGYENTRY_AK01 on ONTOLOGYENTRY (
CATEGORY ASC,
VALUE ASC
)
/
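/*==============================================================*/
/* Example lookup (illustrative sketch, not part of the schema) */
/* ONTOLOGYENTRY_AK01 is unique on (CATEGORY, VALUE), so a      */
/* lookup on both columns returns at most one row.              */
/* The literal values below are hypothetical placeholders.      */
/*==============================================================*/
select ONTOLOGY_ENTRY_ID, NAME, DEFINITION
from ONTOLOGYENTRY
where CATEGORY = 'ExampleCategory'
and VALUE = 'ExampleValue'
/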
/*==============================================================*/
/* Index: ONTOLOGYENTRY_IND01 */
/*==============================================================*/
create index ONTOLOGYENTRY_IND01 on ONTOLOGYENTRY (
PARENT_ID ASC
)
/
/*==============================================================*/
/* Index: ONTOLOGYENTRY_IND02 */
/*==============================================================*/
create index ONTOLOGYENTRY_IND02 on ONTOLOGYENTRY (
TABLE_ID ASC
)
/
/*==============================================================*/
/* Index: ONTOLOGYENTRY_IND03 */
/*==============================================================*/
create index ONTOLOGYENTRY_IND03 on ONTOLOGYENTRY (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Table: OTHERIONISATION */
/*==============================================================*/
create table OTHERIONISATION (
otherIonisation_ID NUMBER(8) not null,
ion_source_ID NUMBER(8),
ONTOLOGY_ENTRY_ID NUMBER(10),
name varchar(100) not null,
ion_source NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_OTHERIONISATION primary key (otherIonisation_ID)
)
/
/*==============================================================*/
/* Table: OTHERMZANALYSIS */
/*==============================================================*/
create table OTHERMZANALYSIS (
other_mz_analysis_ID NUMBER(8) not null,
mz_analysis_ID NUMBER(8),
ONTOLOGY_ENTRY_ID NUMBER(10),
name varchar(50) not null,
mz_analysis NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_OTHERMZANALYSIS primary key
(other_mz_analysis_ID)
)
/
/*==============================================================*/
/* Table: PEAK */
/*==============================================================*/
create table PEAK (
peak_ID NUMBER(8) not null,
m_to_z float not null,
abundance float,
multiplicity NUMBER(8),
peak_list_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PEAK primary key (peak_ID, peak_list_ID)
)
/
/*==============================================================*/
/* Table: PEAKLIST */
/*==============================================================*/
create table PEAKLIST (
peak_list_ID NUMBER(8) not null,
mass_spec_experiment_ID NUMBER(8),
list_type varchar(11) not null
constraint CKC_LIST_TYPE_PEAKLIST check (list_type in
('Full List','Edited List','MSMS Result')),
description varchar(100),
mass_value_type varchar(50),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PEAKLIST primary key (peak_list_ID)
)
/
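/*==============================================================*/
/* Example query (illustrative sketch, not part of the schema)  */
/* Lists the peaks belonging to each MS/MS result peak list.    */
/* Assumes PEAK.peak_list_ID refers to PEAKLIST.peak_list_ID;   */
/* no foreign key is declared in the two table definitions      */
/* above, so this relationship is inferred from column names.   */
/*==============================================================*/
select pl.peak_list_ID, pl.list_type, p.peak_ID, p.m_to_z, p.abundance
from PEAKLIST pl
join PEAK p on p.peak_list_ID = pl.peak_list_ID
where pl.list_type = 'MSMS Result'
order by pl.peak_list_ID, p.m_to_z
/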
/*==============================================================*/
/* Table: PEPTIDEHIT */
/*==============================================================*/
create table PEPTIDEHIT (
peptide_hit_ID NUMBER(8) not null,
score float not null,
score_type varchar(100) not null,
sequence varchar(100) not null,
information varchar(100),
probability float,
db_search_ID NUMBER(8),
database_entry_ID NUMBER(8),
constraint PK_PEPTIDEHIT primary key (peptide_hit_ID)
)
/
/*==============================================================*/
/* Table: PERCENTX */
/*==============================================================*/
create table PERCENTX (
Percent_ID NUMBER(8) not null,
lc_column NUMBER(8),
GradientStep_ID NUMBER(8),
percentage float not null,
mobile_phase_component NUMBER(8) not null,
gradient_step_lc_column NUMBER(8) not null,
gradient_step_id NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PERCENTX primary key (Percent_ID)
)
/
/*==============================================================*/
/* Table: PHYSICALGELITEM */
/*==============================================================*/
create table PHYSICALGELITEM (
physicalGelItem_ID NUMBER(8) not null,
gel2D NUMBER(8),
Gel1D_ID NUMBER(8),
BIO_MATERIAL_ID NUMBER(8),
ProteinRecord_ID NUMBER(8),
description VARCHAR(200),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PHYSICALGELITEM primary key (physicalGelItem_ID)
)
/
/*==============================================================*/
/* Table: PROCESSIMPLEMENTATION */
/*==============================================================*/
create table PROCESSIMPLEMENTATION (
PROCESS_IMPLEMENTATION_ID NUMBER(5) not null,
PROCESS_TYPE_ID NUMBER(10) not null,
NAME VARCHAR2(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSIMPLEMENTATION primary key
(PROCESS_IMPLEMENTATION_ID)
)
/
/*==============================================================*/
/* Index: PROCESSIMPLEMENTATION_IND_01 */
/*==============================================================*/
create index PROCESSIMPLEMENTATION_IND_01 on PROCESSIMPLEMENTATION
(
PROCESS_TYPE_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSIMPLEMENTATIONPARAM */
/*==============================================================*/
create table PROCESSIMPLEMENTATIONPARAM (
PROCESS_IMPLEMETATION_PARAM_ID NUMBER(5) not null,
PROCESS_IMPLEMENTATION_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSIMPPARAM primary key
(PROCESS_IMPLEMETATION_PARAM_ID)
)
/
/*==============================================================*/
/* Index: PROCESSIMPPARAM_IND01 */
/*==============================================================*/
create index PROCESSIMPPARAM_IND01 on PROCESSIMPLEMENTATIONPARAM (
PROCESS_IMPLEMENTATION_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSINVOCATION */
/*==============================================================*/
create table PROCESSINVOCATION (
PROCESS_INVOCATION_ID NUMBER(5) not null,
PROCESS_IMPLEMENTATION_ID NUMBER(5) not null,
PROCESS_INVOCATION_DATE DATE not null,
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSINV primary key (PROCESS_INVOCATION_ID)
)
/
/*==============================================================*/
/* Index: PROCESSINVOCATION_IND01 */
/*==============================================================*/
create index PROCESSINVOCATION_IND01 on PROCESSINVOCATION (
PROCESS_IMPLEMENTATION_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSINVOCATIONPARAM */
/*==============================================================*/
create table PROCESSINVOCATIONPARAM (
PROCESS_INVOCATION_PARAM_ID NUMBER(8) not null,
PROCESS_INVOCATION_ID NUMBER(5) not null,
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSINVOCATIONPARAM primary key
(PROCESS_INVOCATION_PARAM_ID)
)
/
/*==============================================================*/
/* Index: PROCESSINVOCATIONPARAM_IND01 */
/*==============================================================*/
create index PROCESSINVOCATIONPARAM_IND01 on
PROCESSINVOCATIONPARAM (
PROCESS_INVOCATION_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSINVQUANTIFICATION */
/*==============================================================*/
create table PROCESSINVQUANTIFICATION (
PROCESS_INV_QUANTIFICATION_ID NUMBER(8) not null,
PROCESS_INVOCATION_ID NUMBER(5) not null,
QUANTIFICATION_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSINVQUANT primary key
(PROCESS_INV_QUANTIFICATION_ID)
)
/
/*==============================================================*/
/* Index: PROCESSINVQUANTIFICATION_IND01 */
/*==============================================================*/
create index PROCESSINVQUANTIFICATION_IND01 on
PROCESSINVQUANTIFICATION (
PROCESS_INVOCATION_ID ASC
)
/
/*==============================================================*/
/* Index: PROCESSINVQUANTIFICATION_IND02 */
/*==============================================================*/
create index PROCESSINVQUANTIFICATION_IND02 on
PROCESSINVQUANTIFICATION (
QUANTIFICATION_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSIO */
/*==============================================================*/
create table PROCESSIO (
PROCESS_IO_ID NUMBER(12) not null,
PROCESS_INVOCATION_ID NUMBER(5) not null,
TABLE_ID NUMBER(5) not null,
INPUT_RESULT_ID NUMBER(12) not null,
INPUT_ROLE VARCHAR2(50),
OUTPUT_RESULT_ID NUMBER(12) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSIO primary key (PROCESS_IO_ID)
)
/
/*==============================================================*/
/* Index: PROCESSIO_IND01 */
/*==============================================================*/
create index PROCESSIO_IND01 on PROCESSIO (
TABLE_ID ASC
)
/
/*==============================================================*/
/* Index: PROCESSIO_IND_01 */
/*==============================================================*/
create index PROCESSIO_IND_01 on PROCESSIO (
OUTPUT_RESULT_ID ASC
)
/
/*==============================================================*/
/* Index: PROCESSIO_IND_02 */
/*==============================================================*/
create index PROCESSIO_IND_02 on PROCESSIO (
PROCESS_INVOCATION_ID ASC
)
/
/*==============================================================*/
/* Index: PROCESSIO_IND_03 */
/*==============================================================*/
create index PROCESSIO_IND_03 on PROCESSIO (
INPUT_RESULT_ID ASC,
OUTPUT_RESULT_ID ASC,
TABLE_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSIOELEMENT */
/*==============================================================*/
create table PROCESSIOELEMENT (
PROCESS_IO_ELEMENT_ID NUMBER(10) not null,
PROCESS_IO_ID NUMBER(12) not null,
ELEMENT_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROCESSIOELEMENT primary key
(PROCESS_IO_ELEMENT_ID)
)
/
/*==============================================================*/
/* Index: PROCESSIOELEMENT_IND01 */
/*==============================================================*/
create index PROCESSIOELEMENT_IND01 on PROCESSIOELEMENT (
PROCESS_IO_ID ASC
)
/
/*==============================================================*/
/* Table: PROCESSRESULT */
/*==============================================================*/
create table PROCESSRESULT (
PROCESS_RESULT_ID NUMBER(12) not null,
VALUE FLOAT not null,
UNIT_TYPE_ID NUMBER(10),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_RESULT5 primary key (PROCESS_RESULT_ID)
)
/
/*==============================================================*/
/* Index: PROCESSRESULT_IND01 */
/*==============================================================*/
create index PROCESSRESULT_IND01 on PROCESSRESULT (
PROCESS_RESULT_ID ASC,
VALUE ASC
)
/
/*==============================================================*/
/* Index: PROCESSRESULT_IND02 */
/*==============================================================*/
create index PROCESSRESULT_IND02 on PROCESSRESULT (
UNIT_TYPE_ID ASC
)
/
/*==============================================================*/
/* Table: PROTEINHIT */
/*==============================================================*/
create table PROTEINHIT (
protein_hit_ID NUMBER(8) not null,
ProteinRecord_ID NUMBER(8),
peptide_hit_ID NUMBER(8),
description VARCHAR(300),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
score float,
percent_seq_coverage float,
peptides_matched VARCHAR2(100),
e_value float,
db_search_ID NUMBER(8),
constraint PK_PROTEINHIT primary key (protein_hit_ID)
)
/
/*==============================================================*/
/* Table: PROTEINMODIFICATION */
/*==============================================================*/
create table PROTEINMODIFICATION (
protein_modification_ID NUMBER(8) not null,
modification_type NUMBER(10),
protein_record_ID NUMBER(8),
start_pos NUMBER(8),
end_pos NUMBER(8),
description VARCHAR(200),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROTEINMODIFICATION primary key
(protein_modification_ID)
)
/
/*==============================================================*/
/* Table: PROTEINRECORD */
/*==============================================================*/
create table PROTEINRECORD (
protein_record_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
TAXON_NAME_ID NUMBER(10),
protein_name VARCHAR2(100),
pI FLOAT(126),
mW FLOAT(126),
constraint PK_PROTEINRECORD primary key (protein_record_ID)
)
/
/*==============================================================*/
/* Table: PROTEINRECORDENTRY */
/*==============================================================*/
create table PROTEINRECORDENTRY (
pr_record_entry_id NUMBER(8) not null,
protein_record_ID NUMBER(8) not null,
database_entry_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROTEINRECORDENTRY primary key
(pr_record_entry_id)
)
/
/*==============================================================*/
/* Table: PROTEOMEASSAY */
/*==============================================================*/
create table PROTEOMEASSAY (
PROTEOME_ASSAY_ID NUMBER(8) not null,
PROTOCOL_ID NUMBER(10),
ASSAY_DATE DATE,
OPERATOR_ID NUMBER(10) not null,
EXTERNAL_DATABASE_RELEASE_ID NUMBER(5),
SOURCE_ID VARCHAR2(50),
NAME VARCHAR2(100),
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROTEOME_ASSAY primary key (PROTEOME_ASSAY_ID)
)
/
/*==============================================================*/
/* Index: PROTEOME_ASSAY_IND02 */
/*==============================================================*/
create index PROTEOME_ASSAY_IND02 on PROTEOMEASSAY (
OPERATOR_ID ASC
)
/
/*==============================================================*/
/* Index: PROTEOME_ASSAY_IND03 */
/*==============================================================*/
create index PROTEOME_ASSAY_IND03 on PROTEOMEASSAY (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: PROTEOME_ASSAY_IND04 */
/*==============================================================*/
create index PROTEOME_ASSAY_IND04 on PROTEOMEASSAY (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Index: PROTEOME_ASSAY_IND05 */
/*==============================================================*/
create index PROTEOME_ASSAY_IND05 on PROTEOMEASSAY (
NAME ASC
)
/
/*==============================================================*/
/* Table: PROTOCOL */
/*==============================================================*/
create table PROTOCOL (
PROTOCOL_ID NUMBER(10) not null,
PROTOCOL_TYPE_ID NUMBER(10) not null,
SOFTWARE_TYPE_ID NUMBER(10),
HARDWARE_TYPE_ID NUMBER(10),
BIBLIOGRAPHIC_REFERENCE_ID NUMBER(10),
EXTERNAL_DATABASE_RELEASE_ID NUMBER(4),
SOURCE_ID VARCHAR2(100),
NAME VARCHAR2(100) not null,
URI VARCHAR2(100),
PROTOCOL_DESCRIPTION VARCHAR2(4000),
HARDWARE_DESCRIPTION VARCHAR2(500),
SOFTWARE_DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROTOCOL primary key (PROTOCOL_ID)
)
/
/*==============================================================*/
/* Index: PROTOCOL_IND01 */
/*==============================================================*/
create index PROTOCOL_IND01 on PROTOCOL (
PROTOCOL_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: PROTOCOL_IND02 */
/*==============================================================*/
create index PROTOCOL_IND02 on PROTOCOL (
SOFTWARE_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: PROTOCOL_IND03 */
/*==============================================================*/
create index PROTOCOL_IND03 on PROTOCOL (
HARDWARE_TYPE_ID ASC
)
/
/*==============================================================*/
/* Table: PROTOCOLPARAM */
/*==============================================================*/
create table PROTOCOLPARAM (
PROTOCOL_PARAM_ID NUMBER(10) not null,
PROTOCOL_ID NUMBER(10) not null,
NAME VARCHAR2(100) not null,
DATA_TYPE_ID NUMBER(10),
UNIT_TYPE_ID NUMBER(10),
VALUE VARCHAR2(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_PROTOCOLPARAM primary key (PROTOCOL_PARAM_ID)
)
/
/*==============================================================*/
/* Index: PROTOCOLPARAM_IND01 */
/*==============================================================*/
create index PROTOCOLPARAM_IND01 on PROTOCOLPARAM (
DATA_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: PROTOCOLPARAM_IND02 */
/*==============================================================*/
create index PROTOCOLPARAM_IND02 on PROTOCOLPARAM (
UNIT_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: PROTOCOLPARAM_IND03 */
/*==============================================================*/
create index PROTOCOLPARAM_IND03 on PROTOCOLPARAM (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Table: QUADRUPOLE */
/*==============================================================*/
create table QUADRUPOLE (
quadrupole_ID NUMBER(8) not null,
mz_analysis_ID NUMBER(8),
description varchar(100),
mz_analysis NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_QUADRUPOLE primary key (quadrupole_ID)
)
/
/*==============================================================*/
/* Table: QUANTIFICATION */
/*==============================================================*/
create table QUANTIFICATION (
QUANTIFICATION_ID NUMBER(8) not null,
ACQUISITION_ID NUMBER(8) not null,
OPERATOR_ID NUMBER(10),
PROTOCOL_ID NUMBER(10),
RESULT_TABLE_ID NUMBER(5),
QUANTIFICATION_DATE DATE,
NAME VARCHAR2(100),
URI VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_QUANTIFICATION primary key (QUANTIFICATION_ID)
)
/
/*==============================================================*/
/* Index: QUANTIFICATION_IND01 */
/*==============================================================*/
create index QUANTIFICATION_IND01 on QUANTIFICATION (
ACQUISITION_ID ASC
)
/
/*==============================================================*/
/* Index: QUANTIFICATION_IND02 */
/*==============================================================*/
create index QUANTIFICATION_IND02 on QUANTIFICATION (
OPERATOR_ID ASC
)
/
/*==============================================================*/
/* Index: QUANTIFICATION_IND03 */
/*==============================================================*/
create index QUANTIFICATION_IND03 on QUANTIFICATION (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Index: QUANTIFICATION_IND04 */
/*==============================================================*/
create index QUANTIFICATION_IND04 on QUANTIFICATION (
RESULT_TABLE_ID ASC
)
/
/*==============================================================*/
/* Index: QUANTIFICATION_IND05 */
/*==============================================================*/
create index QUANTIFICATION_IND05 on QUANTIFICATION (
NAME ASC
)
/
/*==============================================================*/
/* Table: QUANTIFICATIONPARAM */
/*==============================================================*/
create table QUANTIFICATIONPARAM (
QUANTIFICATION_PARAM_ID NUMBER(5) not null,
QUANTIFICATION_ID NUMBER(8) not null,
PROTOCOL_PARAM_ID NUMBER(10),
NAME VARCHAR2(100) not null,
VALUE VARCHAR2(50) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_QUANTIFICATIONPARAM primary key
(QUANTIFICATION_PARAM_ID)
)
/
/*==============================================================*/
/* Index: QUANTPARAM_AK01 */
/*==============================================================*/
create unique index QUANTPARAM_AK01 on QUANTIFICATIONPARAM (
NAME ASC,
QUANTIFICATION_ID ASC
)
/
/*==============================================================*/
/* Table: RELATEDACQUISITION */
/*==============================================================*/
create table RELATEDACQUISITION (
RELATED_ACQUISITION_ID NUMBER(4) not null,
ACQUISITION_ID NUMBER(8) not null,
ASSOCIATED_ACQUISITION_ID NUMBER(8) not null,
NAME VARCHAR2(100),
DESIGNATION VARCHAR2(50),
ASSOCIATED_DESIGNATION VARCHAR2(50),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_RELASSAY primary key (RELATED_ACQUISITION_ID)
)
/
/*==============================================================*/
/* Index: RELATEDACQUISITION_IND01 */
/*==============================================================*/
create index RELATEDACQUISITION_IND01 on RELATEDACQUISITION (
ACQUISITION_ID ASC
)
/
/*==============================================================*/
/* Index: RELATEDACQUISITION_IND02 */
/*==============================================================*/
create index RELATEDACQUISITION_IND02 on RELATEDACQUISITION (
ASSOCIATED_ACQUISITION_ID ASC
)
/
/*==============================================================*/
/* Table: RELATEDQUANTIFICATION */
/*==============================================================*/
create table RELATEDQUANTIFICATION (
RELATED_QUANTIFICATION_ID NUMBER(4) not null,
QUANTIFICATION_ID NUMBER(8) not null,
ASSOCIATED_QUANTIFICATION_ID NUMBER(8) not null,
NAME VARCHAR2(100),
DESIGNATION VARCHAR2(50),
ASSOCIATED_DESIGNATION VARCHAR2(50),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_RELATEDQUANTIFICATION primary key
(RELATED_QUANTIFICATION_ID)
)
/
/*==============================================================*/
/* Index: RELATEDQUANTIFICATION_IND01 */
/*==============================================================*/
create index RELATEDQUANTIFICATION_IND01 on RELATEDQUANTIFICATION
(
QUANTIFICATION_ID ASC
)
/
/*==============================================================*/
/* Index: RELATEDQUANTIFICATION_IND02 */
/*==============================================================*/
create index RELATEDQUANTIFICATION_IND02 on RELATEDQUANTIFICATION
(
ASSOCIATED_QUANTIFICATION_ID ASC
)
/
/*==============================================================*/
/* Table: SPOTMEASURESIMP */
/*==============================================================*/
create table SPOTMEASURESIMP (
SPOT_MEASURES_ID NUMBER(10) not null,
SUBCLASS_VIEW VARCHAR2(27) not null,
FLOAT1 FLOAT,
FLOAT2 FLOAT,
FLOAT3 FLOAT,
FLOAT4 FLOAT,
FLOAT5 FLOAT,
FLOAT6 FLOAT,
FLOAT7 FLOAT,
FLOAT8 FLOAT,
FLOAT9 FLOAT,
FLOAT10 FLOAT,
FLOAT11 FLOAT,
FLOAT12 FLOAT,
FLOAT13 FLOAT,
FLOAT14 FLOAT,
INT1 NUMBER(12),
INT2 NUMBER(12),
INT3 NUMBER(12),
INT4 NUMBER(12),
INT5 NUMBER(12),
INT6 NUMBER(12),
INT7 NUMBER(12),
INT8 NUMBER(12),
INT9 NUMBER(12),
INT10 NUMBER(12),
INT11 NUMBER(12),
INT12 NUMBER(12),
INT13 NUMBER(12),
INT14 NUMBER(12),
INT15 NUMBER(12),
TINYINT1 NUMBER(3),
TINYINT2 NUMBER(3),
TINYINT3 NUMBER(3),
SMALLINT1 NUMBER(5),
SMALLINT2 NUMBER(5),
SMALLINT3 NUMBER(5),
CHAR1 VARCHAR2(5),
CHAR2 VARCHAR2(5),
CHAR3 VARCHAR2(5),
CHAR4 VARCHAR2(5),
TINYSTRING1 VARCHAR2(50),
TINYSTRING2 VARCHAR2(50),
TINYSTRING3 VARCHAR2(50),
SMALLSTRING1 VARCHAR2(100),
SMALLSTRING2 VARCHAR2(100),
STRING1 VARCHAR2(500),
STRING2 VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_SPOT_MEASURES_IMP primary key (SPOT_MEASURES_ID)
)
/
/*==============================================================*/
/* Table: SPOTRATIO */
/*==============================================================*/
create table SPOTRATIO (
spotRatio_ID NUMBER(8) not null,
first_DIGESingleSpot_ID NUMBER(8) not null,
second_DIGESingleSpot_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_SPOTRATIO primary key (spotRatio_ID)
)
/
/*==============================================================*/
/* Table: STUDY */
/*==============================================================*/
create table STUDY (
STUDY_ID NUMBER(4) not null,
CONTACT_ID NUMBER(12) not null,
BIBLIOGRAPHIC_REFERENCE_ID NUMBER(10),
EXTERNAL_DATABASE_RELEASE_ID NUMBER(4),
SOURCE_ID VARCHAR2(100),
NAME VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(4000),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDY primary key (STUDY_ID)
)
/
/*==============================================================*/
/* Index: STUDY_AK01 */
/*==============================================================*/
create unique index STUDY_AK01 on STUDY (
NAME ASC
)
/
/*==============================================================*/
/* Index: STUDY_IND01 */
/*==============================================================*/
create index STUDY_IND01 on STUDY (
BIBLIOGRAPHIC_REFERENCE_ID ASC
)
/
/*==============================================================*/
/* Index: STUDY_IND02 */
/*==============================================================*/
create index STUDY_IND02 on STUDY (
CONTACT_ID ASC
)
/
/*==============================================================*/
/* Index: STUDY_IND03 */
/*==============================================================*/
create index STUDY_IND03 on STUDY (
EXTERNAL_DATABASE_RELEASE_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYASSAY */
/*==============================================================*/
create table STUDYASSAY (
STUDY_ASSAY_ID NUMBER(8) not null,
STUDY_ID NUMBER(4) not null,
ASSAY_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYASSAY primary key (STUDY_ASSAY_ID)
)
/
/*==============================================================*/
/* Index: STUDYASSAY_IND01 */
/*==============================================================*/
create index STUDYASSAY_IND01 on STUDYASSAY (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYASSAY_IND02 */
/*==============================================================*/
create index STUDYASSAY_IND02 on STUDYASSAY (
STUDY_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYASSAYPROT */
/*==============================================================*/
create table STUDYASSAYPROT (
STUDY_ASSAY_ID NUMBER(8) not null,
PROTEOME_ASSAY_ID NUMBER(8),
STUDY_ID NUMBER(4) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYASSAY_PROT primary key (STUDY_ASSAY_ID)
)
/
/*==============================================================*/
/* Index: STUDYASSAY_PROT_IND02 */
/*==============================================================*/
create index STUDYASSAY_PROT_IND02 on STUDYASSAYPROT (
STUDY_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYBIOMATERIAL */
/*==============================================================*/
create table STUDYBIOMATERIAL (
STUDY_BIO_MATERIAL_ID NUMBER(10) not null,
STUDY_ID NUMBER(4) not null,
BIO_MATERIAL_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYBIOMATERIAL primary key
(STUDY_BIO_MATERIAL_ID)
)
/
/*==============================================================*/
/* Index: STUDYBIOMATERIAL_IND01 */
/*==============================================================*/
create index STUDYBIOMATERIAL_IND01 on STUDYBIOMATERIAL (
STUDY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYBIOMATERIAL_IND02 */
/*==============================================================*/
create index STUDYBIOMATERIAL_IND02 on STUDYBIOMATERIAL (
BIO_MATERIAL_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYDESIGN */
/*==============================================================*/
create table STUDYDESIGN (
STUDY_DESIGN_ID NUMBER(5) not null,
STUDY_ID NUMBER(4) not null,
NAME VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(4000),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYDESIGN primary key (STUDY_DESIGN_ID)
)
/
/*==============================================================*/
/* Index: STUDYDESIGN_AK01 */
/*==============================================================*/
create unique index STUDYDESIGN_AK01 on STUDYDESIGN (
NAME ASC,
STUDY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYDESIGN_IND01 */
/*==============================================================*/
create index STUDYDESIGN_IND01 on STUDYDESIGN (
STUDY_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYDESIGNASSAY */
/*==============================================================*/
create table STUDYDESIGNASSAY (
STUDY_DESIGN_ASSAY_ID NUMBER(8) not null,
STUDY_DESIGN_ID NUMBER(5) not null,
ASSAY_ID NUMBER(8) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYDESIGNASSAY primary key
(STUDY_DESIGN_ASSAY_ID)
)
/
/*==============================================================*/
/* Index: STUDYDESIGNASSAY_IND01 */
/*==============================================================*/
create index STUDYDESIGNASSAY_IND01 on STUDYDESIGNASSAY (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYDESIGNASSAY_IND02 */
/*==============================================================*/
create index STUDYDESIGNASSAY_IND02 on STUDYDESIGNASSAY (
STUDY_DESIGN_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYDESIGNASSAYPROT */
/*==============================================================*/
create table STUDYDESIGNASSAYPROT (
STUDY_DESIGN_ASSAY_ID NUMBER(8) not null,
STUDY_DESIGN_ID NUMBER(5) not null,
PROTEOME_ASSAY_ID NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYDESIGNASSAYPROT primary key
(STUDY_DESIGN_ASSAY_ID)
)
/
/*==============================================================*/
/* Index: STUDYDESIGNASSAYPROT_IND02 */
/*==============================================================*/
create index STUDYDESIGNASSAYPROT_IND02 on STUDYDESIGNASSAYPROT (
STUDY_DESIGN_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYDESIGNDESCRIPTION */
/*==============================================================*/
create table STUDYDESIGNDESCRIPTION (
STUDY_DESIGN_DESCRIPTION_ID NUMBER(5) not null,
STUDY_DESIGN_ID NUMBER(5) not null,
DESCRIPTION_TYPE VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(4000) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYDESIGNDESCR primary key
(STUDY_DESIGN_DESCRIPTION_ID)
)
/
/*==============================================================*/
/* Index: STUDYDESIGNDESCRIPTION_IND01 */
/*==============================================================*/
create index STUDYDESIGNDESCRIPTION_IND01 on
STUDYDESIGNDESCRIPTION (
STUDY_DESIGN_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYDESIGNTYPE */
/*==============================================================*/
create table STUDYDESIGNTYPE (
STUDY_DESIGN_TYPE_ID NUMBER(6) not null,
STUDY_DESIGN_ID NUMBER(5) not null,
ONTOLOGY_ENTRY_ID NUMBER(10) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYDESIGNTYPE primary key
(STUDY_DESIGN_TYPE_ID)
)
/
/*==============================================================*/
/* Index: STUDYDESIGNTYPE_IND01 */
/*==============================================================*/
create index STUDYDESIGNTYPE_IND01 on STUDYDESIGNTYPE (
ONTOLOGY_ENTRY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYDESIGNTYPE_IND02 */
/*==============================================================*/
create index STUDYDESIGNTYPE_IND02 on STUDYDESIGNTYPE (
STUDY_DESIGN_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYFACTOR */
/*==============================================================*/
create table STUDYFACTOR (
STUDY_FACTOR_ID NUMBER(5) not null,
STUDY_DESIGN_ID NUMBER(5) not null,
STUDY_FACTOR_TYPE_ID NUMBER(10),
NAME VARCHAR2(100) not null,
DESCRIPTION VARCHAR2(500),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYFACTOR primary key (STUDY_FACTOR_ID)
)
/
/*==============================================================*/
/* Index: STUDYFACTOR_AK01 */
/*==============================================================*/
create unique index STUDYFACTOR_AK01 on STUDYFACTOR (
NAME ASC,
STUDY_DESIGN_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYFACTOR_IND01 */
/*==============================================================*/
create index STUDYFACTOR_IND01 on STUDYFACTOR (
STUDY_FACTOR_TYPE_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYFACTORVALUE */
/*==============================================================*/
create table STUDYFACTORVALUE (
STUDY_FACTOR_VALUE_ID NUMBER(8) not null,
STUDY_FACTOR_ID NUMBER(5) not null,
ASSAY_ID NUMBER(8) not null,
VALUE_ONTOLOGY_ENTRY_ID NUMBER(10),
STRING_VALUE VARCHAR2(100),
MEASUREMENT_UNIT_OE_ID NUMBER(10),
MEASUREMENT_TYPE VARCHAR2(10),
MEASUREMENT_KIND VARCHAR2(20),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYFACTORVAL primary key
(STUDY_FACTOR_VALUE_ID)
)
/
/*==============================================================*/
/* Index: STUDYFACTORVALUE_IND01 */
/*==============================================================*/
create index STUDYFACTORVALUE_IND01 on STUDYFACTORVALUE (
ASSAY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYFACTORVALUE_IND2 */
/*==============================================================*/
create index STUDYFACTORVALUE_IND2 on STUDYFACTORVALUE (
STUDY_FACTOR_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYFACTORVALUE_IND3 */
/*==============================================================*/
create index STUDYFACTORVALUE_IND3 on STUDYFACTORVALUE (
VALUE_ONTOLOGY_ENTRY_ID ASC
)
/
/*==============================================================*/
/* Index: STUDYFACTORVALUE_IND4 */
/*==============================================================*/
create index STUDYFACTORVALUE_IND4 on STUDYFACTORVALUE (
MEASUREMENT_UNIT_OE_ID ASC
)
/
/*==============================================================*/
/* Table: STUDYFACTORVALUEPROT */
/*==============================================================*/
create table STUDYFACTORVALUEPROT (
STUDY_FACTOR_VALUE_ID NUMBER(8) not null,
STUDY_FACTOR_ID NUMBER(5) not null,
PROTEOME_ASSAY_ID NUMBER(8) not null,
VALUE_ONTOLOGY_ENTRY_ID NUMBER(10),
STRING_VALUE VARCHAR2(100),
MEASUREMENT_UNIT_OE_ID NUMBER(10),
MEASUREMENT_TYPE VARCHAR2(10),
MEASUREMENT_KIND VARCHAR2(20),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_STUDYFACVAL_PROT primary key
(STUDY_FACTOR_VALUE_ID)
)
/
/*==============================================================*/
/* Table: TANDEMSEQUENCEDATA */
/*==============================================================*/
create table TANDEMSEQUENCEDATA (
tandem_sequence_ID NUMBER(8) not null,
db_search_parameters_ID NUMBER(8),
source_type varchar(100) not null,
sequence varchar(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_TANDEMSEQUENCEDATA primary key
(tandem_sequence_ID)
)
/
/*==============================================================*/
/* Table: TOF */
/*==============================================================*/
create table TOF (
TOF_ID NUMBER(8) not null,
mz_analysis_ID NUMBER(8),
reflectron_state varchar(4) not null
constraint CKC_REFLECTRON_STATE_TOF check
(reflectron_state in ('On','Off','None')),
internal_length float not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_TOF primary key (TOF_ID)
)
/
/*==============================================================*/
/* Table: TREATEDANALYTE */
/*==============================================================*/
create table TREATEDANALYTE (
chemical_treatment_ID NUMBER(8),
treated_analyte_ID NUMBER(8) not null,
BIO_MATERIAL_ID NUMBER(8),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_TREATEDANALYTE primary key (treated_analyte_ID)
)
/
/*==============================================================*/
/* Table: TREATMENT */
/*==============================================================*/
create table TREATMENT (
TREATMENT_ID NUMBER(10) not null,
ORDER_NUM NUMBER(3) not null,
BIO_MATERIAL_ID NUMBER(8) not null,
TREATMENT_TYPE_ID NUMBER(10) not null,
PROTOCOL_ID NUMBER(10),
NAME VARCHAR2(100),
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(12) not null,
ROW_PROJECT_ID NUMBER(12) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_TREATMENT primary key (TREATMENT_ID)
)
/
/*==============================================================*/
/* Index: TREATMENT_IND01 */
/*==============================================================*/
create index TREATMENT_IND01 on TREATMENT (
BIO_MATERIAL_ID ASC
)
/
/*==============================================================*/
/* Index: TREATMENT_IND02 */
/*==============================================================*/
create index TREATMENT_IND02 on TREATMENT (
TREATMENT_TYPE_ID ASC
)
/
/*==============================================================*/
/* Index: TREATMENT_IND03 */
/*==============================================================*/
create index TREATMENT_IND03 on TREATMENT (
PROTOCOL_ID ASC
)
/
/*==============================================================*/
/* Table: TREATMENTPARAM */
/*==============================================================*/
create table TREATMENTPARAM (
TREATMENT_PARAM_ID NUMBER(10) not null,
TREATMENT_ID NUMBER(10) not null,
PROTOCOL_PARAM_ID NUMBER(10) not null,
VALUE VARCHAR2(100) not null,
MODIFICATION_DATE DATE not null,
USER_READ NUMBER(1) not null,
USER_WRITE NUMBER(1) not null,
GROUP_READ NUMBER(1) not null,
GROUP_WRITE NUMBER(1) not null,
OTHER_READ NUMBER(1) not null,
OTHER_WRITE NUMBER(1) not null,
ROW_USER_ID NUMBER(12) not null,
ROW_GROUP_ID NUMBER(3) not null,
ROW_PROJECT_ID NUMBER(3) not null,
ROW_ALG_INVOCATION_ID NUMBER(12) not null,
constraint PK_TREATMENTPARAM primary key (TREATMENT_PARAM_ID)
)
/
/*==============================================================*/
/* Index: TREATMENTPARAM_IND01 */
/*==============================================================*/
create index TREATMENTPARAM_IND01 on TREATMENTPARAM (
PROTOCOL_PARAM_ID ASC
)
/
/*==============================================================*/
/* Index: TREATMENTPARAM_IND02 */
/*==============================================================*/
create index TREATMENTPARAM_IND02 on TREATMENTPARAM (
TREATMENT_ID ASC
)
/
/*==============================================================*/
/* View: AFFYMETRIXCEL */
/*==============================================================*/
create or replace view AFFYMETRIXCEL(ELEMENT_RESULT_ID,
ELEMENT_ID,
COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID, SUBCLASS_VIEW,
MEAN, STDV, NPIXELS, MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_result_id,
element_id,
composite_element_result_id,
quantification_id,
subclass_view,
foreground AS mean,
float1 AS stdv,
int3 AS npixels,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'AffymetrixCEL'
with check option
/
/*==============================================================*/
/* View: AFFYMETRIXMAS4 */
/*==============================================================*/
create or replace view AFFYMETRIXMAS4(COMPOSITE_ELEMENT_RESULT_ID,
COMPOSITE_ELEMENT_ID, QUANTIFICATION_ID, SUBCLASS_VIEW,
POSITIVE_PROBE_PAIRS, NEGATIVE_PROBE_PAIRS, NUM_PROBE_PAIRS_USED,
PAIRS_IN_AVERAGE, LOG_AVERAGE_RATIO, AVERAGE_DIFFERENCE,
ABSOLUTE_CALL, MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_result_id,
composite_element_id,
quantification_id,
subclass_view,
tinyint1 AS positive_probe_pairs,
tinyint2 AS negative_probe_pairs,
tinyint3 AS num_probe_pairs_used,
smallint1 AS pairs_in_average,
float1 AS log_average_ratio,
float2 AS average_difference,
string1 AS absolute_call,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementResultImp
WHERE subclass_view = 'AffymetrixMAS4'
with check option
/
/*==============================================================*/
/* View: AFFYMETRIXMAS5 */
/*==============================================================*/
create or replace view AFFYMETRIXMAS5(COMPOSITE_ELEMENT_RESULT_ID,
SUBCLASS_VIEW, COMPOSITE_ELEMENT_ID, QUANTIFICATION_ID, SIGNAL,
DETECTION, DETECTION_P_VALUE, STAT_PAIRS, STAT_PAIRS_USED,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_result_id,
subclass_view,
composite_element_id,
quantification_id,
float1 AS signal,
char1 AS detection,
float2 AS detection_p_value,
smallint1 AS stat_pairs,
smallint2 AS stat_pairs_used,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementResultImp
WHERE SUBCLASS_VIEW = 'AffymetrixMAS5'
with check option
/
/*==============================================================*/
/* View: ARRAYVISIONELEMENTRESULT */
/*==============================================================*/
create or replace view ARRAYVISIONELEMENTRESULT(ELEMENT_RESULT_ID,
SUBCLASS_VIEW, ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID,
QUANTIFICATION_ID, FOREGROUND, BACKGROUND, SD, MAD,
SIGNAL_TO_NOISE, PERCENT_REMOVED, PERCENT_REPLACED,
PERCENT_AT_FLOOR, PERCENT_AT_CEILING, BKG_PERCENT_AT_FLOOR,
BKG_PERCENT_AT_CEILING, X, Y, AREA, FLAG, MODIFICATION_DATE,
USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,
OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT
element_result_id,
subclass_view,
element_id,
composite_element_result_id,
quantification_id,
foreground,
background,
foreground_sd AS sd,
float1 AS mad,
float2 AS signal_to_noise,
float3 AS percent_removed,
float4 AS percent_replaced,
float5 AS percent_at_floor,
float6 AS percent_at_ceiling,
float7 AS bkg_percent_at_floor,
float8 AS bkg_percent_at_ceiling,
tinystring1 AS x,
tinystring2 AS y,
tinystring3 AS area,
tinyint1 AS flag,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'ArrayVisionElementResult'
with check option
/
/*==============================================================*/
/* View: BIOMATERIAL */
/*==============================================================*/
create or replace view BIOMATERIAL(BIO_MATERIAL_ID, SUBCLASS_VIEW,
BIO_MATERIAL_TYPE_ID, EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE,
OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT
bio_material_id,
subclass_view,
bio_material_type_id,
external_database_release_id,
source_id,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM BioMaterialImp
WHERE subclass_view = 'BioMaterial'
with check option
/
/*==============================================================*/
/* View: BIOSAMPLE */
/*==============================================================*/
create or replace view BIOSAMPLE(BIO_MATERIAL_ID, SUBCLASS_VIEW,
BIO_MATERIAL_TYPE_ID, EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID,
NAME, DESCRIPTION, MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
bio_material_id,
subclass_view,
bio_material_type_id,
external_database_release_id,
source_id,
string1 AS name,
string2 AS description,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM BioMaterialImp
WHERE subclass_view = 'BioSample'
with check option
/
/*==============================================================*/
/* View: BIOSOURCE */
/*==============================================================*/
create or replace view BIOSOURCE(BIO_MATERIAL_ID, SUBCLASS_VIEW,
TAXON_ID, BIO_MATERIAL_TYPE_ID, BIO_SOURCE_PROVIDER_ID,
EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, NAME, DESCRIPTION,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT BioMaterialImp.bio_material_id,
BioMaterialImp.subclass_view,
BioMaterialImp.taxon_id,
BioMaterialImp.bio_material_type_id,
BioMaterialImp.bio_source_provider_id,
BioMaterialImp.external_database_release_id,
BioMaterialImp.source_id,
BioMaterialImp.string1 AS name,
BioMaterialImp.string2 AS description,
BioMaterialImp.modification_date,
BioMaterialImp.user_read,
BioMaterialImp.user_write,
BioMaterialImp.group_read,
BioMaterialImp.group_write,
BioMaterialImp.other_read,
BioMaterialImp.other_write,
BioMaterialImp.row_user_id,
BioMaterialImp.row_group_id,
BioMaterialImp.row_project_id,
BioMaterialImp.row_alg_invocation_id
FROM BioMaterialImp
where subclass_view='BioSource'
with check option
/
/*==============================================================*/
/* View: COMPOSITEELEMENT */
/*==============================================================*/
create or replace view COMPOSITEELEMENT(COMPOSITE_ELEMENT_ID,
SUBCLASS_VIEW, PARENT_ID, ARRAY_ID, EXTERNAL_DATABASE_RELEASE_ID,
SOURCE_ID, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_id,
subclass_view,
parent_id,
array_id,
external_database_release_id,
source_id,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementImp
WHERE subclass_view = 'CompositeElement'
with check option
/
/*==============================================================*/
/* View: COMPOSITEELEMENTRESULT */
/*==============================================================*/
create or replace view COMPOSITEELEMENTRESULT(
COMPOSITE_ELEMENT_RESULT_ID, SUBCLASS_VIEW, COMPOSITE_ELEMENT_ID,
QUANTIFICATION_ID, MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_result_id,
subclass_view,
composite_element_id,
quantification_id,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementResultImp
WHERE subclass_view = 'CompositeElementResult'
with check option
/
/*==============================================================*/
/* View: ELEMENT */
/*==============================================================*/
create or replace view ELEMENT(ELEMENT_ID, SUBCLASS_VIEW,
ELEMENT_TYPE_ID, COMPOSITE_ELEMENT_ID,
ARRAY_ID, EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_id,
subclass_view,
element_type_id,
composite_element_id,
array_id,
external_database_release_id,
source_id,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementImp
WHERE subclass_view = 'Element'
with check option
/
/*==============================================================*/
/* View: ELEMENTRESULT */
/*==============================================================*/
create or replace view ELEMENTRESULT(ELEMENT_RESULT_ID,
SUBCLASS_VIEW, ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID,
QUANTIFICATION_ID, FOREGROUND, BACKGROUND, FOREGROUND_SD,
BACKGROUND_SD, MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_result_id,
subclass_view,
element_id,
composite_element_result_id,
quantification_id,
foreground,
background,
foreground_sd,
background_sd,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'ElementResult'
with check option
/
/*==============================================================*/
/* View: GEMTOOLSELEMENTRESULT */
/*==============================================================*/
create or replace view GEMTOOLSELEMENTRESULT(ELEMENT_RESULT_ID,
ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID,
SUBCLASS_VIEW, SIGNAL, SIGNAL_TO_BACKGROUND, AREA_PERCENTAGE,
VISUAL_FLAG, MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_result_id,
element_id,
composite_element_result_id,
quantification_id,
subclass_view,
float1 AS signal,
float2 AS signal_to_background,
float3 AS area_percentage,
tinyint1 AS visual_flag,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'GEMToolsElementResult'
with check option
/
/*==============================================================*/
/* View: GENEPIXELEMENTRESULT */
/*==============================================================*/
create or replace view GENEPIXELEMENTRESULT(ELEMENT_RESULT_ID,
ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID,
SUBCLASS_VIEW, FOREGROUND_SD, BACKGROUND_SD, SPOT_DIAMETER,
FOREGROUND_MEAN, FOREGROUND_MEDIAN, BACKGROUND_MEAN,
BACKGROUND_MEDIAN, PERCENT_OVER_BG_PLUS_ONE_SD,
PERCENT_OVER_BG_PLUS_TWO_SDS, PERCENT_FOREGROUND_SATURATED,
MEAN_OF_RATIOS, MEDIAN_OF_RATIOS, RATIOS_SD, RGN_RATIO,
RGN_R_SQUARED, NUM_FOREGROUND_PIXELS, NUM_BACKGROUND_PIXELS,
FLAG, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE,
ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT element_result_id,
element_id,
composite_element_result_id,
quantification_id,
subclass_view,
foreground_sd,
background_sd,
float1 AS spot_diameter,
float2 AS foreground_mean,
float3 AS foreground_median,
float4 AS background_mean,
float5 AS background_median,
float6 AS percent_over_bg_plus_one_sd,
float7 AS percent_over_bg_plus_two_sds,
float8 AS percent_foreground_saturated,
float9 AS mean_of_ratios,
float10 AS median_of_ratios,
float11 AS ratios_sd,
float12 as rgn_ratio,
float13 as rgn_r_squared,
smallint1 AS num_foreground_pixels,
smallint2 AS num_background_pixels,
tinyint1 AS flag,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'GenePixElementResult'
with check option
/
/*==============================================================*/
/* View: LABELEDEXTRACT */
/*==============================================================*/
create or replace view LABELEDEXTRACT(BIO_MATERIAL_ID,
SUBCLASS_VIEW, BIO_MATERIAL_TYPE_ID, LABEL_METHOD_ID,
EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, NAME, DESCRIPTION,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,
OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
bio_material_id,
subclass_view,
bio_material_type_id,
label_method_id,
external_database_release_id,
source_id,
string1 AS name,
string2 AS description,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM BioMaterialImp
WHERE LABEL_METHOD_ID is not null
AND SUBCLASS_VIEW = 'LabeledExtract'
with check option
/
/*==============================================================*/
/* View: MOIDRESULT */
/*==============================================================*/
create or replace view MOIDRESULT(COMPOSITE_ELEMENT_RESULT_ID,
COMPOSITE_ELEMENT_ID,
QUANTIFICATION_ID,
SUBCLASS_VIEW, EXPRESSION, LOWER_BOUND, UPPER_BOUND, LOG_P,
MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_result_id,
composite_element_id,
quantification_id,
subclass_view,
float1 AS expression,
float2 AS lower_bound,
float3 AS upper_bound,
float4 AS log_p,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementResultImp
WHERE subclass_view = 'MOIDResult'
with check option
/
drop view OTHERSPOTMEASURES
/
/*==============================================================*/
/* View: OTHERSPOTMEASURES */
/*==============================================================*/
create or replace view OTHERSPOTMEASURES as
select SPOTMEASURESIMP.SPOT_MEASURES_ID,
SPOTMEASURESIMP.SUBCLASS_VIEW,
SPOTMEASURESIMP.FLOAT1, SPOTMEASURESIMP.FLOAT2,
SPOTMEASURESIMP.FLOAT3,
SPOTMEASURESIMP.FLOAT4, SPOTMEASURESIMP.FLOAT5,
SPOTMEASURESIMP.FLOAT6,
SPOTMEASURESIMP.FLOAT7, SPOTMEASURESIMP.FLOAT8,
SPOTMEASURESIMP.FLOAT9,
SPOTMEASURESIMP.FLOAT10, SPOTMEASURESIMP.FLOAT11,
SPOTMEASURESIMP.FLOAT12,
SPOTMEASURESIMP.FLOAT13, SPOTMEASURESIMP.FLOAT14,
SPOTMEASURESIMP.INT1,
SPOTMEASURESIMP.INT2, SPOTMEASURESIMP.INT3, SPOTMEASURESIMP.INT4,
SPOTMEASURESIMP.INT5,
SPOTMEASURESIMP.INT6, SPOTMEASURESIMP.INT7, SPOTMEASURESIMP.INT8,
SPOTMEASURESIMP.INT9,
SPOTMEASURESIMP.INT10, SPOTMEASURESIMP.INT11,
SPOTMEASURESIMP.INT12,
SPOTMEASURESIMP.INT13, SPOTMEASURESIMP.INT14,
SPOTMEASURESIMP.INT15,
SPOTMEASURESIMP.TINYINT1, SPOTMEASURESIMP.TINYINT2,
SPOTMEASURESIMP.TINYINT3,
SPOTMEASURESIMP.SMALLINT1, SPOTMEASURESIMP.SMALLINT2,
SPOTMEASURESIMP.SMALLINT3,
SPOTMEASURESIMP.CHAR1, SPOTMEASURESIMP.CHAR2,
SPOTMEASURESIMP.CHAR3,
SPOTMEASURESIMP.CHAR4, SPOTMEASURESIMP.TINYSTRING1,
SPOTMEASURESIMP.TINYSTRING2,
SPOTMEASURESIMP.TINYSTRING3, SPOTMEASURESIMP.SMALLSTRING1,
SPOTMEASURESIMP.SMALLSTRING2, SPOTMEASURESIMP.STRING1,
SPOTMEASURESIMP.STRING2, SPOTMEASURESIMP.MODIFICATION_DATE,
SPOTMEASURESIMP.USER_READ, SPOTMEASURESIMP.USER_WRITE,
SPOTMEASURESIMP.GROUP_READ,
SPOTMEASURESIMP.GROUP_WRITE, SPOTMEASURESIMP.OTHER_READ,
SPOTMEASURESIMP.OTHER_WRITE, SPOTMEASURESIMP.ROW_USER_ID,
SPOTMEASURESIMP.ROW_GROUP_ID, SPOTMEASURESIMP.ROW_PROJECT_ID,
SPOTMEASURESIMP.ROW_ALG_INVOCATION_ID
from SPOTMEASURESIMP
/
/*==============================================================*/
/* View: SAGETAG */
/*==============================================================*/
create or replace view SAGETAG(COMPOSITE_ELEMENT_ID,
SUBCLASS_VIEW,
PARENT_ID, ARRAY_ID, TAG, MODIFICATION_DATE, USER_READ,
USER_WRITE, GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE,
ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID)
as
SELECT
composite_element_id,
subclass_view,
parent_id,
array_id,
tinystring1 AS tag,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementImp
WHERE subclass_view = 'SAGETag'
with check option
/
/*==============================================================*/
/* View: SAGETAGMAPPING */
/*==============================================================*/
create or replace view SAGETAGMAPPING(ELEMENT_ID, SUBCLASS_VIEW,
ARRAY_ID, COMPOSITE_ELEMENT_ID, EXTERNAL_DATABASE_RELEASE_ID,
SOURCE_ID, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_id,
subclass_view,
array_id,
composite_element_id,
external_database_release_id,
source_id,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementImp
WHERE subclass_view = 'SAGETagMapping'
with check option
/
/*==============================================================*/
/* View: SAGETAGRESULT */
/*==============================================================*/
create or replace view SAGETAGRESULT(COMPOSITE_ELEMENT_RESULT_ID,
COMPOSITE_ELEMENT_ID, QUANTIFICATION_ID, SUBCLASS_VIEW,
TAG_COUNT, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_result_id,
composite_element_id,
quantification_id,
subclass_view,
int1 AS tag_count,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementResultImp
WHERE subclass_view = 'SAGETagResult'
with check option
/
/*==============================================================*/
/* View: SCANALYZEELEMENTRESULT */
/*==============================================================*/
create or replace view SCANALYZEELEMENTRESULT(ELEMENT_RESULT_ID,
ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID,
SUBCLASS_VIEW, I, B, BA, SPIX, BGPIX, TOP, LEFT, BOT, RIGHT,
FLAG, MRAT, REGR, CORR, LFRAT, GTB1, GTB2, EDGEA, KSD, KSP,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,
OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT
element_result_id,
element_id,
composite_element_result_id,
quantification_id,
subclass_view,
foreground AS i,
background AS b,
float1 AS ba,
int1 AS spix,
int2 AS bgpix,
int3 AS top,
int4 AS left,
int5 AS bot,
int6 AS right,
tinyint1 AS flag,
float2 AS mrat,
float3 AS regr,
float4 AS corr,
float5 AS lfrat,
float6 AS gtb1,
float7 AS gtb2,
float8 AS edgea,
float9 AS ksd,
float10 AS ksp,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'ScanAlyzeElementResult'
with check option
/
/*==============================================================*/
/* View: SHORTOLIGO */
/*==============================================================*/
create or replace view SHORTOLIGO(ELEMENT_ID, SUBCLASS_VIEW,
ARRAY_ID, COMPOSITE_ELEMENT_ID, NAME,
X_POSITION, Y_POSITION, SEQUENCE, DESCRIPTION, MODIFICATION_DATE,
USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_id,
subclass_view,
array_id,
composite_element_id,
smallstring2 AS name,
tinystring1 AS x_position,
tinystring2 AS y_position,
smallstring1 AS sequence,
string1 AS description,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementImp
WHERE subclass_view = 'ShortOligo'
with check option
/
/*==============================================================*/
/* View: SHORTOLIGOFAMILY */
/*==============================================================*/
create or replace view SHORTOLIGOFAMILY(COMPOSITE_ELEMENT_ID,
SUBCLASS_VIEW, PARENT_ID, ARRAY_ID,
EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, NAME, DESCRIPTION,
MODIFICATION_DATE, USER_READ, USER_WRITE,
GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_id,
subclass_view,
parent_id,
array_id,
external_database_release_id,
source_id,
smallstring1 AS name,
string1 AS description,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementImp
WHERE subclass_view = 'ShortOligoFamily'
with check option
/
/*==============================================================*/
/* View: SPOT */
/*==============================================================*/
create or replace view SPOT(ELEMENT_ID, SUBCLASS_VIEW, ARRAY_ID,
ELEMENT_TYPE_ID,
COMPOSITE_ELEMENT_ID,
EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, ARRAY_ROW, ARRAY_COLUMN,
GRID_ROW, GRID_COLUMN, SUB_ROW,
SUB_COLUMN, SEQUENCE_VERIFIED, NAME, DESCRIPTION,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,
GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,
ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
element_id,
subclass_view,
array_id,
element_type_id,
composite_element_id,
external_database_release_id,
source_id,
char1 AS array_row,
char2 AS array_column,
char3 AS grid_row,
char4 AS grid_column,
char5 AS sub_row,
char6 AS sub_column,
tinyint1 AS sequence_verified,
tinystring1 AS name,
string1 AS description,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementImp
WHERE subclass_view = 'Spot'
with check option
/
/*==============================================================*/
/* View: SPOTELEMENTRESULT */
/*==============================================================*/
create or replace view SPOTELEMENTRESULT(ELEMENT_RESULT_ID,
ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID,
QUANTIFICATION_ID, SUBCLASS_VIEW, MEDIAN, MORPH, IQR, MEAN,
BG_MEDIAN, BG_MEAN, BG_SD, VALLEY, MORPH_ERODE,
MORPH_CLOSE_OPEN, AREA, PERIMETER, CIRCULARITY, BADSPOT,
VISUAL_FLAG, MODIFICATION_DATE, USER_READ,
USER_WRITE, GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE,
ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,
ROW_ALG_INVOCATION_ID) as
SELECT
element_result_id,
element_id,
composite_element_result_id,
quantification_id,
subclass_view,
foreground AS median,
background AS morph,
foreground_sd AS iqr,
float1 AS mean,
float2 AS bg_median,
float3 AS bg_mean,
float4 AS bg_sd,
float5 AS valley,
float6 AS morph_erode,
float7 AS morph_close_open,
int1 AS area,
int2 AS perimeter,
float8 AS circularity,
tinyint1 AS badspot,
tinyint2 AS visual_flag,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM ElementResultImp
WHERE subclass_view = 'SpotElementResult'
with check option
/
/*==============================================================*/
/* View: SPOTFAMILY */
/*==============================================================*/
create or replace view SPOTFAMILY(COMPOSITE_ELEMENT_ID,
SUBCLASS_VIEW, PARENT_ID, ARRAY_ID,
EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, PLATE_NAME,
WELL_LOCATION, PCR_FAILURE_FLAG, NAME, DESCRIPTION,
MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,
OTHER_READ, OTHER_WRITE, ROW_USER_ID,
ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as
SELECT
composite_element_id,
subclass_view,
parent_id,
array_id,
external_database_release_id,
source_id,
smallstring1 AS plate_name,
smallstring2 AS well_location,
tinyint1 AS pcr_failure_flag,
string2 AS name,
string1 AS description,
modification_date,
user_read,
user_write,
group_read,
group_write,
other_read,
other_write,
row_user_id,
row_group_id,
row_project_id,
row_alg_invocation_id
FROM CompositeElementImp
where subclass_view = 'SpotFamily'
with check option
/
alter table ACQUISITION
add constraint FK_ACQ_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table ACQUISITION
add constraint FK_ACQ_CHANNEL foreign key (CHANNEL_ID)
references CHANNEL (CHANNEL_ID) not deferrable
/
alter table ACQUISITION
add constraint FK_ACQ_PRTCL foreign key (PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID) not deferrable
/
alter table ACQUISITIONPARAM
add constraint FK_ACQPARAM_ACQ foreign key (ACQUISITION_ID)
references ACQUISITION (ACQUISITION_ID) not deferrable
/
alter table ACQUISITIONPARAM
add constraint FK_ACQPARAM_PRTPRM foreign key
(PROTOCOL_PARAM_ID)
references PROTOCOLPARAM (PROTOCOL_PARAM_ID) not deferrable
/
alter table ANALYSISIMPLEMENTATION
add constraint FK_ANLIMP_ANL foreign key (ANALYSIS_ID)
references ANALYSIS (ANALYSIS_ID) not deferrable
/
alter table ANALYSISIMPLEMENTATIONPARAM
add constraint FK_ANLIMPPARAM_ANLIMP foreign key
(ANALYSIS_IMPLEMENTATION_ID)
references ANALYSISIMPLEMENTATION
(ANALYSIS_IMPLEMENTATION_ID) not deferrable
/
alter table ANALYSISINPUT
add constraint FK_ANLINPUT_ANALYSISINV foreign key
(ANALYSIS_INVOCATION_ID)
references ANALYSISINVOCATION (ANALYSIS_INVOCATION_ID) not
deferrable
/
alter table ANALYSISINVOCATION
add constraint FK_ANLINV_ANLIMP foreign key
(ANALYSIS_IMPLEMENTATION_ID)
references ANALYSISIMPLEMENTATION
(ANALYSIS_IMPLEMENTATION_ID) not deferrable
/
alter table ANALYSISINVOCATIONPARAM
add constraint FK_ANLPARAM_ANLINV foreign key
(ANALYSIS_INVOCATION_ID)
references ANALYSISINVOCATION (ANALYSIS_INVOCATION_ID) not
deferrable
/
alter table ANALYSISOUTPUT
add constraint FK_ANALYSISOUTPUT4 foreign key
(ANALYSIS_INVOCATION_ID)
references ANALYSISINVOCATION (ANALYSIS_INVOCATION_ID) not
deferrable
/
alter table ANALYTEMEASUREMENT
add constraint FK_ANALYTE__REFERENCE_BIOASSAY foreign key
(BIOASSAY_TREATMENT_ID)
references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)
/
alter table ANALYTEMEASUREMENT
add constraint FK_ANALYTE__REFERENCE_BIOMATER foreign key
(BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID)
/
alter table ARRAY
add constraint FK_ARRAY_ONTO01 foreign key (PLATFORM_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table ARRAY
add constraint FK_ARRAY_ONTO02 foreign key (SUBSTRATE_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table ARRAY
add constraint FK_ARRAY_PROTOCOL foreign key (PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID) not deferrable
/
alter table ARRAYANNOTATION
add constraint FK_ARRAYANN_ARRAY foreign key (ARRAY_ID)
references ARRAY (ARRAY_ID) not deferrable
/
alter table ASSAY
add constraint FK_ASSAY_ARRAY foreign key (ARRAY_ID)
references ARRAY (ARRAY_ID) not deferrable
/
alter table ASSAY
add constraint FK_ASSAY_PRTCL foreign key (PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID) not deferrable
/
alter table ASSAYBIOMATERIAL
add constraint FK_ASSAYBIOMATERIAL15 foreign key
(BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable
/
alter table ASSAYBIOMATERIAL
add constraint FK_ASSAYBIOSOURCE13 foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table ASSAYDATAPOINT
add constraint FK_ASSAYDAT_REFERENCE_LCCOLUMN foreign key (id)
references LCCOLUMN (LCColumn_ID)
/
alter table ASSAYGROUP
add constraint FK_ASSAYGRO_REFERENCE_ASSAY foreign key
(ASSAY_ID)
references ASSAY (ASSAY_ID)
/
alter table ASSAYGROUP
add constraint FK_ASSAYGRO_REFERENCE_STUDY foreign key
(STUDY_ID)
references STUDY (STUDY_ID)
/
alter table ASSAYGROUP
add constraint FK_ASSAYGRO_REFERENCE_STUDYDES foreign key
(STUDY_DESIGN_ID)
references STUDYDESIGN (STUDY_DESIGN_ID)
/
alter table ASSAYGROUP
add constraint FK_ASSAYGRO_REFERENCE_STUDYFAC foreign key
(STUDY_FACTOR_VALUE_ID)
references STUDYFACTORVALUE (STUDY_FACTOR_VALUE_ID)
/
alter table ASSAYLABELEDEXTRACT
add constraint FK_ASSAYLAB_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table ASSAYLABELEDEXTRACT
add constraint FK_ASSAYLAB_CHANNEL foreign key (CHANNEL_ID)
references CHANNEL (CHANNEL_ID) not deferrable
/
alter table ASSAYLABELEDEXTRACT
add constraint FK_ASSAYLAB_LEX foreign key (LABELED_EXTRACT_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable
/
alter table ASSAYPARAM
add constraint FK_ASSAYPARAM_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table ASSAYPARAM
add constraint FK_ASSAYPARAM_PRTOPRM foreign key
(PROTOCOL_PARAM_ID)
references PROTOCOLPARAM (PROTOCOL_PARAM_ID) not deferrable
/
alter table ASSAYPARAMPROT
add constraint FK_ASSAYPAR_REFERENCE_PROTEOME foreign key
(PROTEOME_ASSAY_ID)
references PROTEOMEASSAY (PROTEOME_ASSAY_ID)
/
alter table ASSAYPARAMPROT
add constraint FK_ASSAYPAR_REFERENCE_PROTOCOL foreign key
(PROTOCOL_PARAM_ID)
references PROTOCOLPARAM (PROTOCOL_PARAM_ID)
/
alter table BAND
add constraint FK_BAND_REFERENCE_PHYSICAL foreign key
(physicalGelSpot_ID)
references PHYSICALGELITEM (physicalGelItem_ID)
/
alter table BIOASSAYTREATMENT
add constraint FK_BIOASSAY_REFERENCE_ONTOLOGY foreign key
(TREATMENT_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table BIOASSAYTREATMENT
add constraint FK_BIOASSAY_REFERENCE_PROTEOME foreign key
(PROTEOME_ASSAY_ID)
references PROTEOMEASSAY (PROTEOME_ASSAY_ID)
/
alter table BIOASSAYTREATMENT
add constraint FK_BIOASSAY_REFERENCE_PROTOCOL foreign key
(PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID)
/
alter table BIOMATERIALCHARACTERISTIC
add constraint FK_BMCHARAC_BIOMAT foreign key (BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable
/
alter table BIOMATERIALCHARACTERISTIC
add constraint FK_BMCHARAC_ONTOLOGY foreign key
(ONTOLOGY_ENTRY_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table BIOMATERIALIMP
add constraint FK_BIOMATERIALIMP15 foreign key
(LABEL_METHOD_ID)
references LABELMETHOD (LABEL_METHOD_ID) not deferrable
/
alter table BIOMATERIALIMP
add constraint FK_BIOMATTYPE_OE foreign key
(BIO_MATERIAL_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table BIOMATERIALMEASUREMENT
add constraint FK_BMM_BIOMATERIAL foreign key (BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable
/
alter table BIOMATERIALMEASUREMENT
add constraint FK_BMM_ONTO foreign key (UNIT_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table BIOMATERIALMEASUREMENT
add constraint FK_BMM_TREATMENT foreign key (TREATMENT_ID)
references TREATMENT (TREATMENT_ID) not deferrable
/
alter table BOUNDARYPOINT
add constraint FK_BOUNDARY_REFERENCE_IDENTIFI foreign key
(spot_id)
references IDENTIFIEDSPOT (identified_spot_ID)
on delete cascade
/
alter table CHEMICALTREATMENT
add constraint FK_CHEMICAL_REFERENCE_BIOASSAY foreign key
(BIOASSAY_TREATMENT_ID)
references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)
/
alter table CHEMICALTREATMENT
add constraint FK_CHEMICAL_REFERENCE_ONTOLOGY foreign key
(treatment_type)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table COLLISIONCELL
add constraint FK_COLLISIO_REFERENCE_MZANALYS foreign key
(mz_analysis_ID)
references MZANALYSIS (mz_analysis_ID)
/
alter table COMPOSITEELEMENTANNOTATION
add constraint FK_CEANNOT_CE foreign key (COMPOSITE_ELEMENT_ID)
references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not
deferrable
/
alter table COMPOSITEELEMENTGUS
add constraint FK_CEG_CE foreign key (COMPOSITE_ELEMENT_ID)
references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not
deferrable
/
alter table COMPOSITEELEMENTIMP
add constraint FK_CE_ARRAY foreign key (ARRAY_ID)
references ARRAY (ARRAY_ID) not deferrable
/
alter table COMPOSITEELEMENTIMP
add constraint FK_CE_CE foreign key (PARENT_ID)
references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not
deferrable
/
alter table COMPOSITEELEMENTRESULTIMP
add constraint FK_CERESULT_CELEMENT foreign key
(COMPOSITE_ELEMENT_ID)
references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not
deferrable
/
alter table COMPOSITEELEMENTRESULTIMP
add constraint FK_CERESULT_QUANT foreign key
(QUANTIFICATION_ID)
references QUANTIFICATION (QUANTIFICATION_ID) not deferrable
/
alter table CONTROL
add constraint FK_CONTROL_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table CONTROL
add constraint FK_CONTROL_ONTO foreign key (CONTROL_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table DATABASEENTRY
add constraint FK_DATABASE_NAME_FK_DATABA_ONT foreign key
(database_name)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table DATABASEENTRY
add constraint FK_DATABASE_URI_FK_DATABA_ONTO foreign key
(database_uri)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table DATABASEENTRY
add constraint FK_DATABASE_VERS_FK_DATABA_ONT foreign key
(database_version)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table DBSEARCH
add constraint FK_DBSEARCH_REFERENCE_DBSEARCH foreign key
(db_search_parameters_ID)
references DBSEARCHPARAMETERS (db_search_parameters_ID)
/
alter table DBSEARCH
add constraint FK_DBSEARCH_REFERENCE_PEAKLIST foreign key
(peak_list_ID)
references PEAKLIST (peak_list_ID)
/
alter table DBSEARCHPARAMETERS
add constraint FK_DBSEARCH_REFERENCE_PROTOCOL foreign key
(PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID)
/
alter table DIGESINGLESPOT
add constraint FK_DIGESING_REFERENCE_IDENTIFI foreign key
(identified_spot_ID)
references IDENTIFIEDSPOT (identified_spot_ID)
/
alter table DIGESINGLESPOT
add constraint FK_DIGESING_REFERENCE_IMAGE_AN foreign key
(GEL_IMAGE_ANALYSIS_ID)
references GELIMAGEANALYSIS (GEL_IMAGE_ANALYSIS_ID)
/
alter table DIGESINGLESPOT
add constraint FK_DIGESING_REFERENCE_SPOT_MEA foreign key
(SPOT_MEASURES_ID)
references SPOTMEASURESIMP (SPOT_MEASURES_ID)
/
alter table ELECTROSPRAY
add constraint FK_ELECTROS_REFERENCE_IONSOURC foreign key
(ion_source_ID)
references IONSOURCE (ion_source_ID)
/
alter table ELEMENTANNOTATION
add constraint FK_ELEANNOT_ELEMENTIMP foreign key (ELEMENT_ID)
references ELEMENTIMP (ELEMENT_ID) not deferrable
/
alter table ELEMENTIMP
add constraint FK_ELEMENT_ARRAY foreign key (ARRAY_ID)
references ARRAY (ARRAY_ID) not deferrable
/
alter table ELEMENTIMP
add constraint FK_ELEMENT_COMPELEFAM foreign key
(COMPOSITE_ELEMENT_ID)
references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not
deferrable
/
alter table ELEMENTIMP
add constraint FK_ELEMENT_ONTO foreign key (ELEMENT_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table ELEMENTRESULTIMP
add constraint FK_ELEMENTRESULT_ELEMENTIMP foreign key
(ELEMENT_ID)
references ELEMENTIMP (ELEMENT_ID) not deferrable
/
alter table ELEMENTRESULTIMP
add constraint FK_ELEMENTRESU_QUANT foreign key
(QUANTIFICATION_ID)
references QUANTIFICATION (QUANTIFICATION_ID) not deferrable
/
alter table ELEMENTRESULTIMP
add constraint FK_ELEMENTRES_SFR foreign key
(COMPOSITE_ELEMENT_RESULT_ID)
references COMPOSITEELEMENTRESULTIMP
(COMPOSITE_ELEMENT_RESULT_ID) not deferrable
/
alter table FRACTION
add constraint FK_FRACTION_REFERENCE_BIOMATER foreign key
(BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID)
/
alter table FRACTION
add constraint FK_FRACTION_REFERENCE_LCCOLUMN foreign key
(LCColumn_ID)
references LCCOLUMN (LCColumn_ID)
/
alter table GEL1D
add constraint FK_GEL1D_REFERENCE_BIOASSAY foreign key
(BIOASSAY_TREATMENT_ID)
references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)
/
alter table GEL2D
add constraint FK_GEL2D_REFERENCE_BIOASSAY foreign key
(BIOASSAY_TREATMENT_ID)
references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)
/
alter table GRADIENTSTEP
add constraint FK_GRADIENT_REFERENCE_LCCOLUMN foreign key
(GradientStep_ID)
references LCCOLUMN (LCColumn_ID)
/
alter table HEXAPOLE
add constraint FK_HEXAPOLE_REFERENCE_MZANALYS foreign key
(mz_analysis_ID)
references MZANALYSIS (mz_analysis_ID)
/
alter table IDENTIFIEDSPOT
add constraint FK_IDENTIFI_REFERENCE_IMAGE_AN foreign key
(GEL_IMAGE_ANALYSIS_ID)
references GELIMAGEANALYSIS (GEL_IMAGE_ANALYSIS_ID)
/
alter table IDENTIFIEDSPOT
add constraint FK_IDENTIFI_REFERENCE_PHYSICAL foreign key
(physicalGelItem_ID)
references PHYSICALGELITEM (physicalGelItem_ID)
/
alter table IDENTIFIEDSPOT
add constraint FK_IDENTIFI_REFERENCE_SPOT_MEA foreign key
(SPOT_MEASURES_ID)
references SPOTMEASURESIMP (SPOT_MEASURES_ID)
/
alter table IMAGEACQUISITION
add constraint FK_IMAGE_AC_REFERENCE_CHANNEL foreign key
(CHANNEL_ID)
references CHANNEL (CHANNEL_ID)
/
alter table IMAGEACQUISITION
add constraint FK_IMAGE_AC_REFERENCE_PROTEOME foreign key
(PROTEOME_ASSAY_ID)
references PROTEOMEASSAY (PROTEOME_ASSAY_ID)
/
alter table GELIMAGEANALYSIS
add constraint FK_IMAGE_AN_REFERENCE_IMAGE_AC foreign key
(ACQUISITION_ID)
references IMAGEACQUISITION (IMAGE_ACQUISITION_ID)
/
alter table GELIMAGEANALYSIS
add constraint FK_IMAGE_AN_REFERENCE_PROTOCOL foreign key
(PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID)
/
alter table INTEGRITYSTATINPUT
add constraint INTEGRITYSTATINPUT_FK01 foreign key
(INTEGRITY_STATISTIC_ID)
references INTEGRITYSTATISTIC (INTEGRITY_STATISTIC_ID) not
deferrable
/
alter table IONTRAP
add constraint FK_IONTRAP_REFERENCE_MZANALYS foreign key
(mz_analysis_ID)
references MZANALYSIS (mz_analysis_ID)
/
alter table LABELMETHOD
add constraint FK_LABELEDMETHOD_PROTO foreign key (PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID) not deferrable
/
alter table LABELMETHOD
add constraint FK_LABELMETHOD_CHANNEL foreign key (CHANNEL_ID)
references CHANNEL (CHANNEL_ID) not deferrable
/
alter table LCCOLUMN
add constraint FK_LCCOLUMN_REFERENCE_BIOASSAY foreign key
(BIOASSAY_TREATMENT_ID)
references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)
/
alter table LISTPROCESSING
add constraint FK_LISTPROC_REFERENCE_PEAKLIST foreign key
(peak_list_ID)
references PEAKLIST (peak_list_ID)
/
alter table MAGEDOCUMENTATION
add constraint FK_MDOC_MAGEML foreign key (MAGE_ML_ID)
references MAGEML (MAGE_ML_ID) not deferrable
/
alter table MALDI
add constraint FK_MALDI_REFERENCE_IONSOURC foreign key
(ion_source_ID)
references IONSOURCE (ion_source_ID)
/
alter table MASSSPECEXPERIMENT
add constraint FK_MASSSPEC_REFERENCE_BIOASSAY foreign key
(BIOASSAY_TREATMENT_ID)
references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)
/
alter table MASSSPECEXPERIMENT
add constraint FK_MASSSPEC_REFERENCE_MASSSPEC foreign key
(MSMachineID)
references MASSSPECMACHINE (mass_spec_machine_ID)
/
alter table MASSSPECMACHINE
add constraint FK_MASSSPEC_REFERENCE_IONSOURC foreign key
(ion_source_ID)
references IONSOURCE (ion_source_ID)
/
alter table MATCHEDSPOTS
add constraint FK_MATCHED__REFERENCE_MULTIPLE foreign key
(multiple_analysis_ID)
references MULTIPLEANALYSIS (multiple_analysis_ID)
/
alter table MATCHEDSPOTS
add constraint FK_MATCHED__REFERENCE_IDENTIFI foreign key
(identified_spot_ID)
references IDENTIFIEDSPOT (identified_spot_ID)
/
alter table MOBILEPHASECOMPONENT
add constraint FK_MOBILEPH_REFERENCE_LCCOLUMN foreign key
(lc_column)
references LCCOLUMN (LCColumn_ID)
/
alter table MSMSFRACTION
add constraint FK_MSMSFRAC_REFERENCE_PEAKLIST foreign key
(peak_list_ID)
references PEAKLIST (peak_list_ID)
/
alter table MULTIPLEANALYSIS
add constraint FK_MULTIPLE_REFERENCE_ONTOLOGY foreign key
(analysis_type)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table MULTIPLEANALYSIS
add constraint FK_MULTIPLE_REFERENCE_PROTOCOL foreign key
(PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID)
/
alter table MULTIPLEANALYSISGELIA
add constraint FK_MULTIPLE_REFERENCE_MULTAG foreign key
(MULTIPLE_ANALYSIS_ID)
references MULTIPLEANALYSIS (MULTIPLE_ANALYSIS_ID)
/
alter table MULTIPLEANALYSISGELIA
add constraint FK_MULTIPLE_REFERENCE_MULTGIA foreign key
(GEL_IMAGE_ANALYSIS_ID)
references GELIMAGEANALYSIS (GEL_IMAGE_ANALYSIS_ID)
/
alter table MZANALYSIS
add constraint FK_MZANALYS_REFERENCE_DETECTIO foreign key
(detection_ID)
references DETECTION (detection_ID)
/
alter table ONTOLOGYENTRY
add constraint FK_ONTOLOGYENTRY_PARENT foreign key (PARENT_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table OTHERIONISATION
add constraint FK_OTHERION_REFERENCE_IONSOURC foreign key
(ion_source_ID)
references IONSOURCE (ion_source_ID)
/
alter table OTHERIONISATION
add constraint FK_OTHERION_REFERENCE_ONTOLOGY foreign key
(ONTOLOGY_ENTRY_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table OTHERMZANALYSIS
add constraint FK_OTHERMZA_REFERENCE_MZANALYS foreign key
(mz_analysis_ID)
references MZANALYSIS (mz_analysis_ID)
/
alter table OTHERMZANALYSIS
add constraint FK_OTHERMZA_REFERENCE_ONTOLOGY foreign key
(ONTOLOGY_ENTRY_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table PEAK
add constraint FK_PEAK_REFERENCE_PEAKLIST foreign key
(peak_list_ID)
references PEAKLIST (peak_list_ID)
/
alter table PEAKLIST
add constraint FK_PEAKLIST_REFERENCE_MASSSPEC foreign key
(mass_spec_experiment_ID)
references MASSSPECEXPERIMENT (mass_spec_experiment_ID)
/
alter table PEPTIDEHIT
add constraint FK_PEPTIDEH_REFERENCE_DATABASE foreign key
(database_entry_ID)
references DATABASEENTRY (database_entry_ID)
/
alter table PEPTIDEHIT
add constraint FK_PEPTIDEH_REFERENCE_DBSEARCH foreign key
(db_search_ID)
references DBSEARCH (db_search_ID)
/
alter table PERCENTX
add constraint FK_PERCENTX_REFERENCE_GRADIENT foreign key
(lc_column, GradientStep_ID)
references GRADIENTSTEP (lc_column, GradientStep_ID)
/
alter table PERCENTX
add constraint FK_PERCENTX_REFERENCE_MOBILEPH foreign key
(Percent_ID)
references MOBILEPHASECOMPONENT (id)
/
alter table PHYSICALGELITEM
add constraint FK_PHYSICAL_REFERENCE_BIOMATER foreign key
(BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID)
/
alter table PHYSICALGELITEM
add constraint FK_PHYSICAL_REFERENCE_GEL1D foreign key
(Gel1D_ID)
references GEL1D (Gel1D_ID)
/
alter table PHYSICALGELITEM
add constraint FK_PHYSICAL_REFERENCE_GEL2D foreign key (gel2D)
references GEL2D (Gel2D_ID)
/
alter table PHYSICALGELITEM
add constraint FK_PHYSICAL_REFERENCE_PROTEINR foreign key
(ProteinRecord_ID)
references PROTEINRECORD (protein_record_ID)
/
alter table PROCESSIMPLEMENTATION
add constraint FK_PROCESSIMP_ONTO foreign key (PROCESS_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROCESSIMPLEMENTATIONPARAM
add constraint FK_PRCSIMPPARAM_PRCSIMP foreign key
(PROCESS_IMPLEMENTATION_ID)
references PROCESSIMPLEMENTATION (PROCESS_IMPLEMENTATION_ID)
not deferrable
/
alter table PROCESSINVOCATION
add constraint FK_PROCESS_PROCIMP foreign key
(PROCESS_IMPLEMENTATION_ID)
references PROCESSIMPLEMENTATION (PROCESS_IMPLEMENTATION_ID)
not deferrable
/
alter table PROCESSINVOCATIONPARAM
add constraint FK_PROCESSINVPARAM_PROCESSINV foreign key
(PROCESS_INVOCATION_ID)
references PROCESSINVOCATION (PROCESS_INVOCATION_ID) not
deferrable
/
alter table PROCESSINVQUANTIFICATION
add constraint FK_PROCESSINQUANT_P foreign key
(PROCESS_INVOCATION_ID)
references PROCESSINVOCATION (PROCESS_INVOCATION_ID) not
deferrable
/
alter table PROCESSINVQUANTIFICATION
add constraint FK_PROCESSINQUANT_Q foreign key
(QUANTIFICATION_ID)
references QUANTIFICATION (QUANTIFICATION_ID) not deferrable
/
alter table PROCESSIO
add constraint FK_PROCESSEDRESULT21 foreign key
(OUTPUT_RESULT_ID)
references PROCESSRESULT (PROCESS_RESULT_ID) not deferrable
/
alter table PROCESSIO
add constraint FK_PROCESSIO_PROCESSINV foreign key
(PROCESS_INVOCATION_ID)
references PROCESSINVOCATION (PROCESS_INVOCATION_ID) not
deferrable
/
alter table PROCESSIOELEMENT
add constraint FK_PROCESSIO foreign key (PROCESS_IO_ID)
references PROCESSIO (PROCESS_IO_ID) not deferrable
/
alter table PROCESSRESULT
add constraint FK_PROCESSRESULT_ONTO foreign key (UNIT_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROTEINHIT
add constraint FK_PROTEINH_REFERENCE_PEPTIDEH foreign key
(peptide_hit_ID)
references PEPTIDEHIT (peptide_hit_ID)
/
alter table PROTEINHIT
add constraint FK_PROTEINH_REFERENCE_PROTEINR foreign key
(ProteinRecord_ID)
references PROTEINRECORD (protein_record_ID)
/
alter table PROTEINHIT
add constraint FK_db_search_id foreign key
(db_search_ID)
references DBSEARCH (db_search_ID)
/
alter table PROTEINMODIFICATION
add constraint FK_PROTEINM_REFERENCE_ONTOLOGY foreign key
(modification_type)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table PROTEINMODIFICATION
add constraint FK_PROTEINM_REFERENCE_PROTEINR foreign key
(protein_record_ID)
references PROTEINRECORD (protein_record_ID)
/
alter table PROTEINRECORD
add constraint FK_PROTEINTAXONNAME foreign key (TAXON_NAME_ID)
references SRES_TAXONNAME (TAXON_NAME_ID)
/
alter table PROTEINRECORDENTRY
add constraint FK_PROTENTRYREC foreign key (protein_record_ID)
references PROTEINRECORD (protein_record_ID)
/
alter table PROTEINRECORDENTRY
add constraint FK_PROTENTRYDB foreign key (database_entry_ID)
references DATABASEENTRY (database_entry_ID)
/
alter table PROTEOMEASSAY
add constraint FK_PROTEOME_REFERENCE_PROTOCOL foreign key
(PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID)
/
alter table PROTOCOL
add constraint FK_PROTOCOL_HDWTYPE_OE foreign key
(HARDWARE_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROTOCOL
add constraint FK_PROTOCOL_PRTTYPE_OE foreign key
(PROTOCOL_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROTOCOL
add constraint FK_PROTOCOL_SFWTYPE_OE foreign key
(SOFTWARE_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROTOCOLPARAM
add constraint FK_PROTOCOLPARAM_ONTO1 foreign key
(DATA_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROTOCOLPARAM
add constraint FK_PROTOCOLPARAM_ONTO2 foreign key
(UNIT_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table PROTOCOLPARAM
add constraint FK_PROTOCOLPARAM_PROTO foreign key (PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID) not deferrable
/
alter table QUADRUPOLE
add constraint FK_QUADRUPO_REFERENCE_MZANALYS foreign key
(mz_analysis_ID)
references MZANALYSIS (mz_analysis_ID)
/
alter table QUANTIFICATION
add constraint FK_QUANT_ACQ foreign key (ACQUISITION_ID)
references ACQUISITION (ACQUISITION_ID) not deferrable
/
alter table QUANTIFICATION
add constraint FK_QUANT_PROTOCOL foreign key (PROTOCOL_ID)
references PROTOCOL (PROTOCOL_ID) not deferrable
/
alter table QUANTIFICATIONPARAM
add constraint FK_QUANTPARAM_PRTPRM foreign key
(PROTOCOL_PARAM_ID)
references PROTOCOLPARAM (PROTOCOL_PARAM_ID) not deferrable
/
alter table QUANTIFICATIONPARAM
add constraint FK_QUANTPARAM_QUANT foreign key
(QUANTIFICATION_ID)
references QUANTIFICATION (QUANTIFICATION_ID) not deferrable
/
alter table RELATEDACQUISITION
add constraint FK_RELACQ_ACQ01 foreign key (ACQUISITION_ID)
references ACQUISITION (ACQUISITION_ID) not deferrable
/
alter table RELATEDACQUISITION
add constraint FK_RELACQ_ACQ02 foreign key
(ASSOCIATED_ACQUISITION_ID)
references ACQUISITION (ACQUISITION_ID) not deferrable
/
alter table RELATEDQUANTIFICATION
add constraint FK_RELQUANT_QUANT01 foreign key
(QUANTIFICATION_ID)
references QUANTIFICATION (QUANTIFICATION_ID) not deferrable
/
alter table RELATEDQUANTIFICATION
add constraint FK_RELQUANT_QUANT02 foreign key
(ASSOCIATED_QUANTIFICATION_ID)
references QUANTIFICATION (QUANTIFICATION_ID) not deferrable
/
alter table SPOTRATIO
add constraint FK_SPOTRATI_2_FK_DIGE_S_DIGES2 foreign key
(second_DIGESingleSpot_ID)
references DIGESINGLESPOT (DIGESingleSpot_ID)
/
alter table SPOTRATIO
add constraint FK_SPOTRATI_FK_DIGE_S_DIGESING foreign key
(first_DIGESingleSpot_ID)
references DIGESINGLESPOT (DIGESingleSpot_ID)
/
alter table STUDYASSAY
add constraint FK_STDYASSAY_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table STUDYASSAY
add constraint FK_STDYASSAY_STDY foreign key (STUDY_ID)
references STUDY (STUDY_ID) not deferrable
/
alter table STUDYASSAYPROT
add constraint FK_STUDYASS_REFERENCE_PROTEOME foreign key
(PROTEOME_ASSAY_ID)
references PROTEOMEASSAY (PROTEOME_ASSAY_ID)
/
alter table STUDYASSAYPROT
add constraint FK_STUDYASS_REFERENCE_STUDY foreign key
(STUDY_ID)
references STUDY (STUDY_ID)
/
alter table STUDYBIOMATERIAL
add constraint FORMHELP_FK01 foreign key (STUDY_ID)
references STUDY (STUDY_ID) not deferrable
/
alter table STUDYBIOMATERIAL
add constraint FORMHELP_FK02 foreign key (BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable
/
alter table STUDYDESIGN
add constraint FK_STDYDES_STDY foreign key (STUDY_ID)
references STUDY (STUDY_ID) not deferrable
/
alter table STUDYDESIGNASSAY
add constraint FK_STDYDESASSAY_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table STUDYDESIGNASSAY
add constraint FK_STDYDESASSAY_STDYDES foreign key
(STUDY_DESIGN_ID)
references STUDYDESIGN (STUDY_DESIGN_ID) not deferrable
/
alter table STUDYDESIGNASSAYPROT
add constraint FK_STUDYDES_REFERENCE_STUDYDES foreign key
(STUDY_DESIGN_ID)
references STUDYDESIGN (STUDY_DESIGN_ID)
/
alter table STUDYDESIGNASSAYPROT
add constraint FK_STUDYDES_REFERENCE_PROTEOME foreign key
(PROTEOME_ASSAY_ID)
references PROTEOMEASSAY (PROTEOME_ASSAY_ID)
/
alter table STUDYDESIGNDESCRIPTION
add constraint FK_STDYDESDCR_STDYDES foreign key
(STUDY_DESIGN_ID)
references STUDYDESIGN (STUDY_DESIGN_ID) not deferrable
/
alter table STUDYDESIGNTYPE
add constraint FK_STDYDESTYPE_ONTO foreign key
(ONTOLOGY_ENTRY_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table STUDYDESIGNTYPE
add constraint FK_STDYDESTYPE_STDYDES foreign key
(STUDY_DESIGN_ID)
references STUDYDESIGN (STUDY_DESIGN_ID) not deferrable
/
alter table STUDYFACTOR
add constraint FK_STDYFCTR_ONTO foreign key
(STUDY_FACTOR_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table STUDYFACTOR
add constraint FK_STDYFCTR_STDYDES foreign key
(STUDY_DESIGN_ID)
references STUDYDESIGN (STUDY_DESIGN_ID) not deferrable
/
alter table STUDYFACTORVALUE
add constraint FK_STDYFCTRVAL_ASSAY foreign key (ASSAY_ID)
references ASSAY (ASSAY_ID) not deferrable
/
alter table STUDYFACTORVALUE
add constraint FK_STDYFCTRVAL_OEUNIT foreign key
(MEASUREMENT_UNIT_OE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table STUDYFACTORVALUE
add constraint FK_STDYFCTRVAL_OEVALUE foreign key
(VALUE_ONTOLOGY_ENTRY_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table STUDYFACTORVALUE
add constraint FK_STDYFCTRVAL_STDYFCTR foreign key
(STUDY_FACTOR_ID)
references STUDYFACTOR (STUDY_FACTOR_ID) not deferrable
/
alter table STUDYFACTORVALUEPROT
add constraint FK_STUDYFAC_FK_SFV_ME_ONTOLOGY foreign key
(MEASUREMENT_UNIT_OE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table STUDYFACTORVALUEPROT
add constraint FK_STUDYFAC_FK_STF_VA_ONTOLOGY foreign key
(VALUE_ONTOLOGY_ENTRY_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)
/
alter table STUDYFACTORVALUEPROT
add constraint FK_STUDYFAC_REFERENCE_PROTEOME foreign key
(PROTEOME_ASSAY_ID)
references PROTEOMEASSAY (PROTEOME_ASSAY_ID)
/
alter table STUDYFACTORVALUEPROT
add constraint FK_STUDYFAC_REFERENCE_STUDYFAC foreign key
(STUDY_FACTOR_ID)
references STUDYFACTOR (STUDY_FACTOR_ID)
/
alter table TANDEMSEQUENCEDATA
add constraint FK_TANDEMSE_REFERENCE_DBSEARCH foreign key
(db_search_parameters_ID)
references DBSEARCHPARAMETERS (db_search_parameters_ID)
/
alter table TOF
add constraint FK_TOF_REFERENCE_MZANALYS foreign key
(mz_analysis_ID)
references MZANALYSIS (mz_analysis_ID)
/
alter table TREATEDANALYTE
add constraint FK_TREATEDA_REFERENCE_BIOMATER foreign key
(BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID)
/
alter table TREATEDANALYTE
add constraint FK_TREATEDA_REFERENCE_CHEMICAL foreign key
(chemical_treatment_ID)
references CHEMICALTREATMENT (chemical_treatment_ID)
/
alter table TREATMENT
add constraint FK_TREATMENT6 foreign key (BIO_MATERIAL_ID)
references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable
/
alter table TREATMENT
add constraint FK_TREATMENT7 foreign key (TREATMENT_TYPE_ID)
references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable
/
alter table TREATMENTPARAM
add constraint FK_TREATMENTPARAM_PRTOPRM foreign key
(PROTOCOL_PARAM_ID)
references PROTOCOLPARAM (PROTOCOL_PARAM_ID) not deferrable
/
alter table TREATMENTPARAM
add constraint FK_TREATMENTPARAM_TREATMENT foreign key
(TREATMENT_ID)
references TREATMENT (TREATMENT_ID) not deferrable
/
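Most of the foreign keys listed above reject the deletion of a referenced row, whereas FK_BOUNDARY_REFERENCE_IDENTIFI removes dependent BOUNDARYPOINT rows through on delete cascade. The following is a minimal sketch of that difference, assuming SQLite (through Python's sqlite3 module) in place of Oracle. The tables are reduced to their key columns, the primary key columns of BOUNDARYPOINT and PEAK are hypothetical, and the inserted values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.executescript("""
create table IDENTIFIEDSPOT (identified_spot_ID integer primary key);
create table BOUNDARYPOINT (
    boundary_point_ID integer primary key,  -- hypothetical key column
    spot_id integer,
    constraint FK_BOUNDARY_REFERENCE_IDENTIFI foreign key (spot_id)
        references IDENTIFIEDSPOT (identified_spot_ID) on delete cascade
);
create table PEAKLIST (peak_list_ID integer primary key);
create table PEAK (
    peak_ID integer primary key,            -- hypothetical key column
    peak_list_ID integer,
    constraint FK_PEAK_REFERENCE_PEAKLIST foreign key (peak_list_ID)
        references PEAKLIST (peak_list_ID)
);
insert into IDENTIFIEDSPOT values (1);
insert into BOUNDARYPOINT values (10, 1), (11, 1);
insert into PEAKLIST values (1);
insert into PEAK values (100, 1);
""")

# Deleting the spot also removes its boundary points (cascade).
conn.execute("delete from IDENTIFIEDSPOT where identified_spot_ID = 1")
remaining_points = conn.execute("select count(*) from BOUNDARYPOINT").fetchone()[0]

# Deleting a peak list that still has peaks is rejected (no cascade).
try:
    conn.execute("delete from PEAKLIST where peak_list_ID = 1")
    peaklist_delete_rejected = False
except sqlite3.IntegrityError:
    peaklist_delete_rejected = True
```

In this sketch the spot deletion leaves no boundary points behind, while the peak list deletion fails until its peaks are removed first.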
Appendix D
Modelling and database storage of
difference gel data
D.1 Introduction
The focus of the thesis is to improve technology for the management and sharing of proteome
data arising from 2-DE and MS. Chapter 1 reports three case studies: (i) a host-parasite
interaction study, (ii) the study of changes in the proteome of cell culture with a knock-
out of the gene Raf-1, and (iii) the determination of the proteome of Trypanosoma brucei.
The three case studies were used to inform the development of Gla-PSI and the data from
case studies (i) and (iii) are stored in RAPAD, as reported in Chapters 6 and 7. Case
study (ii), performed at the Beatson Institute, focused on a difference gel electrophoresis
(DIGE) experiment to find differentially expressed proteins. However, the data from the
DIGE study did not become available for inclusion in RAPAD due to technical difficulties
with the experimental setup. DIGE is becoming a major technique in proteomic analysis
because it allows more accurate determination of relative protein volume between two or
more study groups than standard gel electrophoresis. As case studies (i) and (iii) utilised
standard 2-DE analysis, the main data sets used for testing the technology we developed
did not include DIGE data. The purpose of this Appendix is to demonstrate that Gla-PSI,
FGE-OM and RAPAD are capable of representing DIGE data.
Chapter 6 describes a study, using standard 2-DE, that compares the proteome of host cells
invaded by a parasite with that of non-invaded host cells. The experiments
have recently been extended to study the proteome using the DIGE technique. Section D.1.1
briefly describes the experimental methodology and Section D.2
illustrates how such DIGE data can be represented in Gla-PSI. Section D.3 describes how
the same experiment can be captured in FGE-OM. The data has recently been added to
RAPAD, as described in Section D.4, and can be viewed within the Gel Viewer.

Replicate  Cy2  Cy3   Cy5
1          S    Inf1  Non1
2          S    Inf2  Non2
3          S    Non3  Inf3
4          S    Non4  Inf4
Table D.1: Experimental plan for Cy labelling of proteins in the DIGE experiment with Toxoplasma gondii. S = pooled sample from all eight replicates, Inf1 = Infected sample replicate 1, Non1 = Non-infected sample replicate 1.
D.1.1 Host-parasite responses
This section briefly outlines a study to elucidate the changes in the proteome
of a human cell culture when invaded by a parasite, compared with non-invaded cells.
The study was performed by Morag Nelson, a PhD student, in the laboratory of
Dr Jonathan Wastling at the Institute of Biomedical and Life Sciences, University of Glasgow.
The DIGE investigation accompanies the standard 2-DE studies described
in Chapter 6. RAPAD aids the information retrieval task, the combination of data across
replicate gels and the comparison with microarray results. The
hypothesis of the investigation and the generation of samples are detailed in Chapter 6; the
following experimental procedure was used for DIGE analysis.
Four biological replicates were performed (four infected HFF samples versus four non-
infected). The samples were labelled with Cy dyes as shown in Table D.1. A fifth gel was
run with pooled material from the non-infected and infected samples. The fifth gel was used
for generating samples for mass spectrometry (MS) to identify the proteins. The gels were
scanned and the gel images were loaded into DeCyder™ [74] software. The software performs spot
matching across a series of gels and quantifies the difference in fluorescence, corresponding to
the relative abundance of a particular protein between the infected and non-infected samples.
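To make the labelling plan concrete, the dye assignments of Table D.1 can be written down as data and used to compute an infected/non-infected ratio for one matched spot. This is a minimal sketch, not DeCyder's algorithm: it assumes that each Cy3 and Cy5 spot volume is divided by the Cy2 volume of the pooled standard on the same gel before the ratio is taken, and the volumes and the function name are invented for illustration.

```python
# Dye assignment per replicate gel, transcribed from Table D.1.
# "S" is the pooled standard; the dyes are swapped in replicates 3 and 4.
LABELLING_PLAN = {
    1: {"Cy2": "S", "Cy3": "Inf1", "Cy5": "Non1"},
    2: {"Cy2": "S", "Cy3": "Inf2", "Cy5": "Non2"},
    3: {"Cy2": "S", "Cy3": "Non3", "Cy5": "Inf3"},
    4: {"Cy2": "S", "Cy3": "Non4", "Cy5": "Inf4"},
}


def infected_to_non_infected_ratio(replicate, volumes):
    """Return the infected/non-infected ratio for one spot on one gel.

    `volumes` maps dye name to spot volume. Each sample volume is first
    divided by the Cy2 (pooled standard) volume; this normalisation is an
    assumption of the sketch, not a description of DeCyder.
    """
    plan = LABELLING_PLAN[replicate]
    standard = volumes["Cy2"]
    # Map sample name (e.g. "Inf1") to its standard-normalised volume.
    normalised = {plan[dye]: volumes[dye] / standard for dye in ("Cy3", "Cy5")}
    infected = next(v for name, v in normalised.items() if name.startswith("Inf"))
    non_infected = next(v for name, v in normalised.items() if name.startswith("Non"))
    return infected / non_infected


# Hypothetical volumes for the same spot on gels 1 and 3 (dye-swapped).
ratio_gel1 = infected_to_non_infected_ratio(
    1, {"Cy2": 1000.0, "Cy3": 2000.0, "Cy5": 1000.0}
)
ratio_gel3 = infected_to_non_infected_ratio(
    3, {"Cy2": 1000.0, "Cy3": 1000.0, "Cy5": 2000.0}
)
```

Looking up the sample name through the plan, rather than hard-coding Cy3 as "infected", is what keeps the ratio oriented the same way on the dye-swapped gels. Within a single gel the standard cancels out of the ratio; under the stated assumption its value lies in making the normalised volumes comparable across the four gels.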
D.2 Gla-PSI
[Figure D.1: UML class diagram, not reproduced here. Classes shown: Identifiable, Describable, Description, Database, DatabaseEntry, OntologyEntry, OntologyRef, ExternalReference, BibliographicReference, Parameter, ExperimentDesign, ExperimentParameters, ProteinPreparation, 2D-PAGE, ScannedImage, ImageAnalysis, DIGEAnalysis, DIGESingleImage, DIGESingleSpot, SpotRatio, SpotSets, Spot, SpotRefs, MatchedSpots, MultipleAnalysis, StatisticalAnalysis, Protein, IDEvidence and MassSpec. Diagram notes: the stages preceding image analysis have been presented in the models MAGE (http://www.mged.org) and PEDRo (http://pedro.man.ac.uk); all classes are subclasses of Identifiable and Describable (not shown), so all classes can have an identifier attached and be linked to annotation classes. The legend distinguishes new classes in the model from classes derived from MAGE or PEDRo.]
Figure D.1: The Gla-PSI model. The boxed classes are discussed in the text.
Gla-PSI is shown in Figure D.1 and the main classes that are used to store DIGE data are
boxed. Figure D.2 demonstrates how classes in Gla-PSI capture the parasitology experiment
described above. ExperimentDesign describes the purpose of the experiment (infected versus
non-infected samples) and ExperimentParameters captures the replicates described in
Table D.1 and the gel used for spot picking. ProteinPreparation captures the method of
protein extraction, solubilisation and the attachment of Cy labels. Gla-PSI does not include
attributes for ExperimentDesign, ExperimentParameters or ProteinPreparation; it is therefore
left to individual system designers to develop structures that adequately store the
information from their domain of interest. 2D-PAGE has attributes for storing the details
of gel separation. There are four gels described in this experiment that are part of the
DIGE analysis. Within the DIGE analysis there are three scanned images from each 2-D
gel, corresponding to scanning the gel at three fluorescence wavelengths (DIGESingleImage).
Each image produces a set of spots stored in DIGESingleSpot. DeCyder software can also
produce a composite image by combining the three images, and data from spots measured
on the composite image are stored in Spot. The composite image is stored in ScannedImage
(to the right of DIGESingleSpot in Figure D.2). Spots that have been matched across the
replicates are captured in the classes: MultipleAnalysis, MatchedSpots and SpotRefs. All
classes can be attached to Parameter for capturing attributes not represented explicitly in
the model. MultipleAnalysis also links to a separate instance of 2D-PAGE which captures
the gel used to generate a picklist for MS analysis. The database accession numbers for
the proteins identified can be stored in Protein, and the model allows for single spots to
be matched to more than one protein, as multiple proteins migrating to the same spot is
a common phenomenon in 2-DE. The image analysis software can match spots across the
set of DIGE gels and the picklist gel. The ID numbers of matched spots are captured in
SpotRefs, which links composite spots on DIGE gels to spots from the picklist gel, thereby
associating the protein identifications to the relative abundance values for each spot on the
DIGE gels.
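The class structure described above can be sketched in code. The following is a minimal illustration, assuming Python dataclasses as the target. The class names and the attributes dyeLabel, volume, accession and identifier follow Figure D.1, and the three spot volumes are those shown in Figure D.2. The way associations are held in lists, the helper function, and the identifiers and accessions are assumptions of the sketch rather than part of Gla-PSI.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protein:
    accession: str


@dataclass
class Spot:
    """A spot on a composite DIGE image or on the picklist gel."""
    identifier: str
    volume: Optional[float] = None
    # One spot may contain several co-migrating proteins.
    proteins: List[Protein] = field(default_factory=list)


@dataclass
class DIGESingleSpot:
    volume: float


@dataclass
class DIGESingleImage:
    dyeLabel: str
    spots: List[DIGESingleSpot] = field(default_factory=list)


@dataclass
class MatchedSpots:
    """A cluster of matched spots; spot_refs holds Spot identifiers (cf. SpotRefs.spotID)."""
    spot_refs: List[str] = field(default_factory=list)


def proteins_for_match(match, spots_by_id):
    """Collect the accessions of every protein attached to any spot in the cluster."""
    return sorted(
        {p.accession for ref in match.spot_refs for p in spots_by_id[ref].proteins}
    )


# One DIGE gel scanned at three wavelengths (volumes taken from Figure D.2).
images = [
    DIGESingleImage("Cy2", [DIGESingleSpot(12643)]),
    DIGESingleImage("Cy3", [DIGESingleSpot(3456)]),
    DIGESingleImage("Cy5", [DIGESingleSpot(17006)]),
]

# A composite spot on a DIGE gel matched to a picklist-gel spot that was
# identified by MS; identifiers and accessions are hypothetical.
composite = Spot("dige1-spot42")
picked = Spot("pick-spot7", proteins=[Protein("ACC0002"), Protein("ACC0001")])
spots_by_id = {s.identifier: s for s in (composite, picked)}
match = MatchedSpots(spot_refs=["dige1-spot42", "pick-spot7"])

accessions = proteins_for_match(match, spots_by_id)
```

Resolving the references through the matched cluster is what lets the protein identifications made on the picklist gel be reported alongside the abundance values measured on the DIGE gels.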
[Figure D.2: class instance diagram, not reproduced here. It shows DIGEAnalysis linked to three DIGESingleImage instances (dyeLabel = Cy2, Cy3 and Cy5) with DIGESingleSpot instances (example volumes 12643, 3456 and 17006), together with ExperimentDesign, ExperimentParameters, ProteinPreparation, 2D-PAGE, ScannedImage, ImageAnalysis, SpotSets, Spot, Protein, MultipleAnalysis, MatchedSpots and SpotRefs, including a second 2D-PAGE and ScannedImage that capture the gel used for spot picking and identification of proteins.]
Figure D.2: A DIGE experiment represented in Gla-PSI. The boxes represent classes in the model and the text in each box is a comment to describe the purpose of the class or example attributes and values. The lines indicate relationships between classes and the numbers represent the relative number of classes (cardinality) that participate in the relationship.
D.3 FGE-OM
The workflow in Figure D.3 demonstrates the representation in FGE-OM of DIGE data from
the parasitology study described above. At the top level the experimental hypothesis is captured
in the class Experiment, which is attached to an ExperimentFactor (not shown)
representing the difference between the infected and the non-infected cell lines. One instance
of Experiment is indirectly related (via BioAssay) to two instances of BioMaterial (one for
infected, one for non-infected). A series of treatments is performed on the BioMaterial,
including the pooling of samples after they have been labelled with the Cy dyes. The output
of the set of treatments is the four samples shown in Table D.1 and the sample for the prep
gel, which are represented in BioMaterial, and each BioMaterial is associated with an instance
of BioAssayTreatment. BioAssayTreatment is a superclass from which specific types
of treatment can inherit relationships. In this case, an instance of Gel2D is used to capture
the details of the two-dimensional separation. BioAssayTreatment is associated with the
PhysicalBioAssay class that is used for linking various classes together. ImageAcquisition
(scanning) is linked to a source PhysicalBioAssay and an output PhysicalBioAssay that
is related to the four images: three from scanning at the three fluorescence wavelengths and
one composite image (captured in Image, Channel and the image format in OntologyEntry).
The gel that is used for spot picking is modelled in the same way but the Channel class is
not required as the gel does not contain fluorescent labels and has not been scanned at a
particular wavelength. Gel image analysis is modelled by the general MAGE-OM derived
class FeatureExtraction and a more specific class GelImageAnalysis. The five different
gels are related to each other through the class MultipleAnalysis. MeasuredBioAssay re-
lates the image analysis event to the data via MeasuredBioAssayData (not shown). There
is a relationship to the class BioDataTuples that stores rows of gel spot data, with rela-
tionships to IdentifiedSpot and DIGESingleSpot. IdentifiedSpot represents composite
spot information (from the image combined across the three channels) and DIGESingleSpot
stores attributes of spots in the single channel images. Spots that are matched across more
than one gel, for example matched between the standard gel used for MS analysis and the
DIGE gels, are stored in MatchedSpots which is related to the class MultipleAnalysis.
IdentifiedSpot and DIGESingleSpot have various attributes that are measured by image
analysis such as relative volumes, ratios between the different channels and a spot’s co-
ordinates. Complete class diagrams showing all the attributes are displayed in Appendix
B.
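As a rough illustration of the cardinalities in this workflow, the sketch below enumerates the image records implied by the study: four images for each DIGE gel (three single-channel scans plus a composite) and one image for the picklist gel, which needs no Channel. The dictionary layout, the gel names and the function name images_for_gel are assumptions made for this example; only the class names Image and Channel, the Cy dye labels and the gel counts (four DIGE gels and one picklist gel) come from the study description.

```python
DYES = ("Cy2", "Cy3", "Cy5")  # one scanned channel per dye


def images_for_gel(gel_name, is_dige):
    """Return illustrative Image records for one gel."""
    if not is_dige:
        # The picklist gel is not fluorescently labelled, so no Channel is attached.
        return [{"gel": gel_name, "channel": None, "composite": False}]
    images = [{"gel": gel_name, "channel": dye, "composite": False} for dye in DYES]
    # The composite image combines the three channels and has no single Channel.
    images.append({"gel": gel_name, "channel": None, "composite": True})
    return images


gels = [(f"DIGE gel {i}", True) for i in range(1, 5)] + [("picklist gel", False)]
all_images = [img for name, is_dige in gels for img in images_for_gel(name, is_dige)]
print(len(gels), len(all_images))  # 5 gels, 17 images
```

The count of 17 follows from 4 gels × 4 images + 1 picklist image.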
Figure D.3: A DIGE study represented in FGE-OM. The boxes represent classes in the
model and the text in each box is a comment to describe the purpose of the class. The
lines indicate relationships between classes and the numbers represent the relative number
of classes (cardinality) that participate in the relationship for this experiment.
[Diagram not reproduced. Classes shown include Experiment, BioMaterial, Gel2D,
BioAssayTreatment, PhysicalBioAssay, ImageAcquisition, Image, Channel, FeatureExtraction,
GelImageAnalysis, MultipleAnalysis, MatchedSpots, MeasuredBioAssay, BioDataTuples,
IdentifiedSpot and DIGESingleSpot. Annotations note that the treatments produce 5
BioMaterials (4 for DIGE, 1 for the standard gel) and that the picklist gel (non-DIGE) is
stored in the same structures but does not require the Channel class.]
D.4 RAPAD
The parasitology study described above has recently been entered into RAPAD. There are
17 gel images in total in the study, corresponding to four images from each of the four
DIGE gels (three different wavelengths and a composite image) plus a single gel image from
the prep gel used for MS identification. Figure D.4 displays screenshots of the prep gel
visualised in the Gel Viewer. The DeCyder™ software calculates the relative volume ratio
between the two study conditions (infected versus non-infected) across all four DIGE gels.
The ratio of volumes is stored in the IdentifiedSpot table, linked to the prep gel (stored in
Gel2D, ProteomeAssay, ImageAcquisition and GelImageAnalysis, described in Chapter 5).
Organising the data in this way allows a simple visualisation of the proteins that are up-
or down-regulated without needing to examine the entire series of images, because the Gel
Viewer allows the user to search on the spot volume. All the DIGE gel images are stored in
RAPAD within the same study and can also be loaded concurrently with the prep gel in the
Gel Viewer. Proteins with a volume ratio greater than zero are present in higher abundance
in non-infected cells, and those with a ratio less than zero are in higher abundance in
infected cells.
RAPAD contains microarray, standard 2-DE and DIGE data for HFF cells invaded with
T. gondii. This means that comparisons can be made between the level of gene expression
and protein abundance as measured by more than one technique. This allows for validation
of the experimental methodology, and the derivation of significant biological information
about the proteins modulated in response to parasite invasion of host cells.
Figure D.4: Relative protein abundance data calculated from DIGE can be viewed in the
Gel Viewer via the gel used for protein identification by MS. The user can query for proteins
down-regulated (panel A) or proteins up-regulated (panel B) in the Gel Viewer.
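The kind of query the Gel Viewer performs on the stored ratios can be approximated as follows. This is a sketch under stated assumptions, not the RAPAD schema or code: the table name IdentifiedSpot is taken from the text, but the column names (spot_id, volume_ratio), the use of SQLite, the function name spots_by_direction and the sample values are all hypothetical. The sign convention follows the description above, in which values greater than zero indicate higher abundance in non-infected cells.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the IdentifiedSpot table.
conn.execute(
    "CREATE TABLE IdentifiedSpot (spot_id INTEGER PRIMARY KEY, volume_ratio REAL)"
)
conn.executemany(
    "INSERT INTO IdentifiedSpot (spot_id, volume_ratio) VALUES (?, ?)",
    [(1, 2.4), (2, -1.7), (3, 1.2), (4, -3.1)],  # made-up values
)


def spots_by_direction(connection, higher_in_infected):
    """Return spot ids whose ratio sign matches the requested direction."""
    # Only the comparison operator is interpolated; it is chosen from two fixed strings.
    op = "<" if higher_in_infected else ">"
    rows = connection.execute(
        f"SELECT spot_id FROM IdentifiedSpot WHERE volume_ratio {op} 0 ORDER BY spot_id"
    )
    return [row[0] for row in rows]


print(spots_by_direction(conn, higher_in_infected=True))
print(spots_by_direction(conn, higher_in_infected=False))
```

Storing a single signed ratio per identified spot keeps the query to one comparison, which is why the prep gel alone is enough to display the regulated proteins.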
Bibliography
[1] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Query-ing Documents in Object Databases. Int. J. on Digital Libraries, 1:5–19, 1997.
[2] F. Achard, G. Vaysseix, and E.Barillot. XML, bioinformatics and data integration.Bioinformatics, 17:115–125, 2001.
[3] C. Adessi, C. Miege, C. Albrieux, and T. Rabilloud. Two-dimensional electrophoresis ofmembrane proteins: A current challenge for immobilized pH gradients. Electrophoresis,18:127–135, 1997.
[4] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422:198–207, 2003.
[5] Affymetrix. http://www.affymetrix.com/.
[6] J. W. Ajioka, J. M. Fitzpatrick, and C. P. Reitter. Toxoplasma gondii genomics:shedding light on pathogenesis and chemotherapy. Expert Rev Mol Med., 2001:1–19,2001.
[7] F. Al-Shahrour, R. Diaz-Uriarte, and J. Dopazo. FatiGO: a web tool for findingsignificant associations of Gene Ontology terms with groups of genes. Bioinformatics,20:578–580, 2004.
[8] A. Alban, S. O. David, L. Bjorkesten, C. Andersson, E. Sloge, S. Lewis, and I. Cur-rie. A novel experimental design for comparative two-dimensional gel analysis: two-dimensional difference gel electrophoresis incorporating a pooled internal standard.Proteomics, 3:36–44, 2003.
[9] J. Allen, H. M. Davey, D. Broadhurst, J. K. Heald, J. J. Rowland, S. G. Oliver, andD. B Kell. High-throughput classification of yeast mutants for functional genomicsusing metabolic footprinting. Nat Biotechnol., 21:692–696, 2003.
[10] AllGenes: a web site providing access to an integrated database of known and predictedhuman and mouse genes. (version 6.0, 2003) Center for Bioinformatics, University ofPennsylvania. http://www.allgenes.org.
[11] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, andD. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein databasesearch programs. Nucleic Acids Res., 25:3389–3402, 1997.
[12] AmiGO. http://www.godatabase.org/.
[13] Analytical Information Markup Language (AnIML).http://animl.sourceforge.net/.
351
[14] Apache Xindice. http://xml.apache.org/xindice/.
[15] Applied Biosystems. http://www.appliedbiosystems.com.
[16] ArrayExpress at the EBI. http://www.ebi.ac.uk/arrayexpress/.
[17] G. Arrizabalaga and J. C. Boothroyd. Role of calcium during Toxoplasma gondiiinvasion and egress. Int J Parasitol., 34:361–368, 2004.
[18] ASTM International. http://www.astm.org.
[19] M. P. Atkinson, L. Daynes, M. J. Jordan, T. Printezis, and S. Spence. An OrthogonallyPersistent Java. SIGMOD Record, 25(4):68–75, 1996.
[20] G. Babnigg and C. S. Giometti. GELBANK: a database of annotated two-dimensionalgel electrophoresis patterns of biological systems with completed genomes. NucleicAcids Res., 32:D582–D585, 2004.
[21] A. Bahl, B. Brunk, R. L. Coppel, J. Crabtree, S. J. Diskin, M. J. Fraunholz, et al.PlasmoDB: The Plasmodium Genome Resource. An integrated database providingtools for accessing and analyzing mapping, expression, and sequence data (both finishedand unfinished). Nucleic Acids Res., 30:87–90, 2002.
[22] P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. AnOntology for Bioinformatics Applications. Bioinformatics, 15:510–520, 1999.
[23] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals.Science, 298:539, 2002.
[24] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals.Bioinformatics, 18:1409, 2002.
[25] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals.The Lancet, 360:1019, 2002.
[26] M. P. Barrett. The fall and rise of sleeping sickness. The Lancet, 353:1113–1114, 1999.
[27] J. D. Barry. The relative significance of mechanisms of antigenic variation in Africantrypanosomes. Parasitology Today, 13:203–244, 1997.
[28] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. OilEd: a reason-able ontology edi-tor for the semantic web. In Proceedings of KI2001, Joint German/Austrian conferenceon Artificial Intelligence, pages 396–408, 2001.
[29] C. J. Beckers, J. F. Dubremetz, O. Mercereau-Puijalon, and K. A. Joiner. The Tox-oplasma gondii rhoptry protein ROP 2 is inserted into the parasitophorous vacuolemembrane, surrounding the intracellular parasite, and is exposed to the host cell cy-toplasm. J Cell Biol., 127:947–961, 1994.
[30] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Gen-Bank. Nucleic Acids Res., 31:23–27, 2003.
[31] T. Berners-Lee and J. Hendler. Nature Debates: Scientific publishing on the ‘semanticweb’. http://www.nature.com/nature/debates/e-access/Articles/bernerslee.htm.
[32] BIND at Blueprint. http://www.blueprint.org/bind/bind.php.
352
[33] Bioinformatic Harvester, Collection of all human (non fragmented) SWALL proteinsand their cross references to the major bioinformatic databases.http://harvester.embl.de/.
[34] BioJava. http://www.biojava.org.
[35] I. J. Blader, I. D. Manger, and J. C. Boothroyd. Microarray analysis reveals previouslyunknown changes in Toxoplasma gondii -infected human cells. J Biol Chem., 276:24223–24231, 2001.
[36] B. Blagoev, I. Kratchmarova, S. E. Ong, M. Nielsen, L. J. Foster, and M. Mann. Aproteomics strategy to elucidate functional protein-protein interactions applied to EGFsignaling. Nat Biotechnol., 21:315–318, 2003.
[37] B. Boeckmann, A. Bairoch, R. Apweiler, M. C. Blatter, A. Estreicher, E. Gasteiger,et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.Nucleic Acids Res., 31:365–370, 2003.
[38] S. Boldt, U. H. Weidle, and W. Kolch. The role of MAPK pathways in the action ofchemotherapeutic drugs. Carcinogenesis, 23:1831–1838, 2002.
[39] S. Bowers and B. Ludascher. An Ontology-Driven Framework for Data Transformationin Scientific Workflows. In Proceeding of the International Workshop on Data Integra-tion in Life Sciences, Lecture Notes in Computer Science, volume 2994, pages 1–16,2004.
[40] Tim Bray. What is RDF? http://www.xml.com/pub/a/2001/01/24/rdf.html.
[41] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, et al.Minimum information about a microarray experiment (MIAME)-toward standards formicroarray data. Nat. Genet., 29:365–71, 2001.
[42] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner. One-stop shop for microar-ray data - Is a universal, public DNA-microarray database a realistic goal? Nature,403:699–700, 2000.
[43] P. Buneman, M. Grohe, and C. Koch. Path Queries on Compressed XML. In Proceed-ings of 29th International Conference on Very Large Data Bases, Berlin, Germany,pages 141–152, 2003.
[44] Peter Buneman. Semistructured data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 117–121,1997.
[45] A. Burger, D. Davidson, and R. Baldock. Formalization of Mouse Embryo Anatomy.Bioinformatics, 20:259–267, 2004.
[46] E. Camon, M. Magrane, D. Barrell, D. Binns, W. Fleischmann, P. Kersey, et al. TheGene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT,TrEMBL, and InterPro. Genome Res., 13:662–672, 2003.
[47] D. Carlson. Modeling XML Applications with UML: Practical e-Business Applications.Addison-Wesley, 2001.
353
[48] S. Carr, R. Aebersold, M. Baldwin, A. Burlingame, K. Clauser, and A. Nesvizhskii. Theneed for guidelines in publication of peptide and protein identification data: WorkingGroup on Publication Guidelines for Peptide and Protein Identification Data. Mol CellProteomics., 3:531–533, 2004.
[49] V. B. Carruthers. Host cell invasion by the opportunistic pathogen Toxoplasma gondii .Acta Trop., 81:111–122, 2002.
[50] J. I. Castrillo and S. G. Oliver. Yeast as a Touchstone in Post-genomic Research:Strategies for Integrative Analysis in Functional Genomics. J Biochem Mol Biol.,37:93–106, 2004.
[51] CellML. http://www.cellml.org/.
[52] S. Celniker, D. Wheeler, B. Kronmiller, J. Carlson, A. Halpern, S. Patel, et al. Finish-ing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromaticgenome sequence. Genome Biol., 3:research0079.1–0079.14, 2002.
[53] Chagas disease information. The UNICEF-UNDP-World Bank-WHO Special Pro-gramme for Research and Training in Tropical Diseases.http://www.who.int/tdr/diseases/chagas/diseaseinfo.htm.
[54] K. H. Cheung, K. White, and J. Hager. YMD: A microarray database for large-scale gene expression analysis. In Proceedings of the American Medical InformaticsAssociation Annual Symposium, pages 140–144, 2002.
[55] The Chipping Forecast. Supplement to Nat Genet., 21:1–60, 1999.
[56] S. Cho, S. G. Park, D. H. Lee, and B. Chul. Protein-protein Interaction Networks:from Interactions to Networks. J Biochem Mol Biol., 37:45–52, 2004.
[57] D. Christendat, A. Yee, A. Dharamsi, Y. Kluger, A. Savchenko, J. R. Cort, et al.Structural proteomics of an archaeon. Nat Struct Biol., 7:903–909, 2000.
[58] M. Clamp, D. Andrews, D. Barker, P. Bevan, G. Cameron, Y. Chen, et al. Ensembl2002: accommodating comparative genomics. Nucleic Acids Res., 31:38–42, 2003.
[59] J-M. Claverie. What If There Are Only 30,000 Human Genes? Science, 291:1255–1257,2001.
[60] C. E. Clayton. Life without transcriptional control? from fly to man and back again.EMBO J., 21:1881–1888, 2002.
[61] A. M. Cohen, K. Rumpel, G. H. Coombs, and J. M. Wastling. Characterisation ofglobal protein expression by two-dimensional electrophoresis and mass spectrometry:proteomics of Toxoplasma gondii . Int J Parasitol., 32:39–51, 2002.
[62] B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A FastIndex for Semistructured Data. In Proceedings of 27th International Conference onVery Large Data Bases, pages 341–350, 2001.
[63] Cprogramming.com - Your Resource for C++ Programming.http://www.cprogramming.com/.
[64] F. Crick. Central Dogma of Molecular Biology. Nature, 227:561–563, 1970.
354
[65] Database of Interacting Proteins (DIP). http://dip.doe-mbi.ucla.edu/.
[66] C. J. Date. An Introduction to Database Systems - Volume 1, 6th Edition. Addison-Wesley, 1995. DAT c 95:1 1.Ex.
[67] S. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, G. C. Overton, and C. J.Stoeckert Jr. K2/Kleisli and GUS: Experiments in integrated access to genomic datasources. IBM Systems Journal, 40(2):512–531, 2001.
[68] S. B. Davidson, G. C. Overton, V. Tannen, and L. Wong. BioKleisli: A Digital Libraryfor Biomedical Researchers. Int. J. on Digital Libraries, 1:36–53, 1997.
[69] T. N. Davis. Protein localization in proteomics. Curr Opin Chem Biol., 8:49–53, 2004.
[70] DB2 published by IBM. http://www.ibm.com/.
[71] DBLP, Computer Science Bibliography. http://dblp.uni-trier.de/.
[72] The DDBJ/EMBL/GenBank Feature Table: Definition.http://www.ebi.ac.uk/embl/Documentation/FT definitions/feature table.html.
[73] S. V. de Avalos, I. J. Blader, M. Fisher, J. C. Boothroyd, and B. A. Burleigh. Immedi-ate/Early Response to Trypanosoma cruzi Infection Involves Minimal Modulation ofHost Cell Transcription. J. Biol. Chem., 277:639–644, 2002.
[74] DeCyderTMpublished by Amersham Biosciences. http://www.apbiotech.com/.
[75] The definition of Document Type Definition (DTD). http://www.w3.org/TR/REC-html40/sgml/dtd.html.
[76] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen,Y. A. Su, and J. M. Trent. Use of a cDNA microarray to analyse gene expressionpatterns in human cancer. Nat Genet., 14:457–460, 1996.
[77] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED.In Proceedings of the 1999 ACM SIGMOD international conference on Managementof data, pages 431–442, 1999.
[78] M. Diehn, G. Sherlock, G. Binkley, H. Jin, J. C. Matese, and T. Hernandez-Boussard.SOURCE: a unified genomic resource of functional annotations, ontologies, and geneexpression data. Nucleic Acids Res., 31:219–223, 2003.
[79] H. Dlugonska, K. Dytnerska, G. Reichmann, S. Stachelhaus, and H. G. Fischer. To-wards the Toxoplasma gondii proteome: position of 13 parasite excretory antigens on astandardized map of two-dimensionally separated tachyzoite proteins. Parasitol Res.,87:634–637, 2001.
[80] DNA Data Bank of Japan. http://www.ddbj.nig.ac.jp/.
[81] A. Doan, P. Domingos, and A. Levy. Learning Source Descriptions for Data Integration.In Proceedings of the International Workshop on The Web and Databases (WebDB),page Learning Source Descriptions for Data Integration, 2000.
[82] Document Object Model (DOM). http://www.w3.org/DOM/.
[83] A. W. Dowsey, M. J. Dunn, and G. Z. Yang. The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics, 3:1567–1596, 2003.
355
[84] B. Edde, J. Rossier, J-P. LeCaer, F. Desbruyeres, F. Gros, and P. Denoulet. Post-translational glutamylation of alpha-tubulin. Science, 247:83–85, 1990.
[85] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI geneexpression and hybridization array data repository. Nucleic Acids Res., 30:207–210,2002.
[86] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis anddisplay of genome-wide expression patterns. Proc Natl Acad Sci U S A., 95:14863–14868, 1998.
[87] N. M. El-Sayed, E. Ghedin, J. Song, A. MacLeod, F. Bringaud, C. Larkin, et al. Thesequence and analysis of Trypanosoma brucei chromosome II. Nucleic Acids Res.,31:4856–4863, 2003.
[88] The Electronic Statistics Textbook.http://www.statsoftinc.com/textbook/stathome.html.
[89] R. A. Elmasri and S. B. Navathe. Fundamentals of Database Systems, 3rd edition.Addison-Wesley, 2000.
[90] EMAP: The Edinburgh Mouse Atlas Project. http://genex.hgu.mrc.ac.uk/.
[91] The EMBL Nucleotide Sequence Database. http://www.ebi.ac.uk/embl/.
[92] EMBOSS. http://www.hgmp.mrc.ac.uk/Software/EMBOSS/.
[93] J. Eng and J. Yates. SEQUEST. http://fields.scripps.edu/sequest/.
[94] Ensembl Genome Browser. http://www.ensembl.org/.
[95] Ensembl Trace Server. http://trace.ensembl.org/.
[96] Enterprise Architect v 4.1, published by Sparx Systems.http://www.sparxsystems.com.au/.
[97] Entrez, The Life Sciences Search Engine. http://www.ncbi.nih.gov/Entrez/.
[98] Ettan DIGE: Fluorescence 2D Difference Gel Electrophoresis.http://www.amershambiosciences.com/proteomics/dige/.
[99] T. Etzold, A. Ulyanow, and P. Argos. SRS: Information Retrieval System for MolecularBiology Data Banks. Methods Enzymol., 266:114–128, 1996.
[100] eVOC: The Human Gene Expression VOCabulary.http://www.sanbi.ac.za/evoc/.
[101] Extensible Markup Language (XML). http://www.w3c.org/XML/.
[102] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrosprayionization for mass spectrometry of large biomolecules. Science, 246:64–71, 1989.
[103] S. B. Ficarro, M. L. McCleland, P. T. Stukenberg, D. J. Burke, M. M. Ross, J. Sha-banowitz, D. F. Hunt, and F. M. White. Phosphoproteome analysis by mass spec-trometry and its application to Saccharomyces cerevisiae. Nat Biotechnol., 20:301–305,2002.
356
[104] T. Fiebig, S. Helmer, C-C. Kanne, G. Moerkotte, J. Neumann, R. Schiele, and T. West-mann. Anatomy of a native XML base management system. VLDB J., 11:292–314,2002.
[105] O. Fiehn, J. Kopka, R. N. Trethewey, and L. Willmitzer. Identification of uncommonplant metabolites based on calculation of elemental compositions using gas chromatog-raphy and quadrupole mass spectrometry. Anal Chem., 72:3573–3580, 2000.
[106] H. I. Field, D. Fenyo, and R. C. Beavis. RADARS, a bioinformatics solution that auto-mates proteome mass spectral analysis, optimises protein identification, and archivesdata in a relational database. Proteomics, 2:36–47, 2002.
[107] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions.Nature, 340:245–246, 1989.
[108] A. Fire, S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver, and C. C. Mello. Potentand specific genetic interference by double-stranded RNA in Caenorhabditis elegans.Nature, 391:806–811, 1998.
[109] G. Fischer, S. M. Ibrahim, G. A. Brockmann, J. Pahnke, E. Bartocci, H-J. Thiesen,P. Serrano-Fernandez, and S. Moller. Expressionview: visualization of quantitativetrait loci and gene-expression data in Ensembl. Genome Biol., 4:R77, 2003.
[110] L. Florens, M. P. Washburn, J. D. Raine, R. M. Anthony, M. Grainger, J. D. Haynes,et al. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419:520–526,2002.
[111] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDMBS.IEEE Data Engineering Bulletin, 22:27–34, 1999.
[112] FlyBase: A Database of the Drosophila Genome. http://www.flybase.org.
[113] R. Fogh, J. Ionides, E. Ulrich, W. Boucher, W. Vranken, J. P. Linge, et al. The CCPNproject: an interim report on a data model for the NMR community. Nat Struct Biol.,9:416–418, 2002.
[114] A. Freier, R. Hofestadt, M. Lange, U. Scholz, and A. Stephanik. BioDataServer: ASQL-based service for the online integration of life science data. In Silico Biol., 2:37–57,2002.
[115] B. Futcher, G. I. Latter, P. Monardo, C. S. McLaughlin, and J. I. Garrels. A samplingof the yeast proteome. Mol Cell Biol., 19:7357–7368, 1999.
[116] M. Gail, U. Gross, andW. Bohne. Transcriptional profile of Toxoplasma gondii -infectedhuman fibroblasts as revealed by gene-array hybridization. Mol Genet Genomics.,265:905–912, 2001.
[117] M. Y. Galperin. The Molecular Biology Database Collection: 2004 update. NucleicAcids Res., 32, Database issue:D3–D22, 2004.
[118] H. Garcia-Molina, J. Ullman, and J. Widom. Database Systems: The Complete Book.Prentice Hall, 2002.
[119] M Gardiner-Garden and T. G. Littlejohn. A comparison of microarray databases.Brief. Bioinformatics, 2:143–158, 2001.
357
[120] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, and A. Bauer. Functionalorganization of the yeast proteome by systematic analysis of protein complexes. Nature,415:141–147, 2002.
[121] GenAtlas. http://www.genatlas.org/.
[122] Genbank. http://www.ncbi.nlm.nih.gov/Genbank/.
[123] Gene Expression Omnibus (GEO). http://www.ncbi.nlm.nih.gov/geo/.
[124] The Gene Ontology Consortium. http://www.geneontology.org/.
[125] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology.Nat Genet., 25:25–29, 2000.
[126] The Gene Ontology Consortium. Creating the gene ontology resource: design andimplementation. Genome Res., 11:1425–1433, 2001.
[127] GeneDB. http://www.genedb.org.
[128] Generalized Analytical Markup Language. http://www.gaml.org.
[129] S. Gharbi, P. Gaffney, A. Yang, M. J. Zvelebil, R. Cramer, M. D. Waterfield, and J. F.Timms. Evaluation of two-dimensional differential gel electrophoresis for proteomicexpression analysis of a model breast cancer cell system. Mol Cell Proteomics., 1:91–98, 2002.
[130] M. Girolami and R. Breitling. Biologically valid linear factor models of gene expression.Bioinformatics, 20:3021–3033, 2004.
[131] G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright. Chemical markup, XMLand the World-Wide Web. 3. Toward a signed semantic chemical web of trust. J ChemInf Comput Sci., 41:1124–1130, 2001.
[132] The Global Grid Forum (GGF). http://www.gridforum.org/.
[133] C. A. Goble. The Semantic Web: A Killer App for AI? In Artificial Intelligence:Methodology, Systems, and Applications, 10th International Conference, AIMSA 2002,Varna, Bulgaria, pages 274–278, 2002.
[134] J. Gollub, C. A. Ball, G. Binkley, J. Demeter, D. B. Finkelstein, J. M. Hebert, et al.The Stanford Microarray Database: data access and quality assessment tools. NucleicAcids Res., 31:94–96, 2003.
[135] A. Gorg, C. Obermaier, G. Boguth, A. Harder, B. Scheibe, R. Wildgruber, andW. Weiss. The current state of two-dimensional electrophoresis with immobilized pHgradients. Electrophoresis, 21:1037–1053, 2000.
[136] A. Gorg, W. Postel, and S. Gunther. The current state of two-dimensional electrophore-sis with immobilized pH gradients. Electrophoresis, 9:531–546, 1988.
[137] P. R. Graves and T. A. Haystead. Molecular biologist’s guide to proteomics. MicrobiolMol Biol Rev., 66:39–63, 2002.
[138] T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition,5:199–220, 1993.
358
[139] M. E. Guicciardi, J. Deussing, H. Miyoshi, S. F. Bronk, P. A. Svingen, C. Peters,S. H. Kaufmann, and G. J. Gores. Cathepsin B contributes to TNF-alpha-mediatedhepatocyte apoptosis by promoting mitochondrial release of cytochrome c. J ClinInvest., 106:1127–1137, 2000.
[140] K. Gull. The cytoskeleton of trypanosomatid parasites. Annu Rev Microbiol., 53:629–655, 1999.
[141] The GUS 3.0 schema. http://www.gusdb.org/cgi-bin/schemaBrowser.
[142] S. P. Gygi, B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold. Quan-titative analysis of complex protein mixtures using isotope-coded affinity tags. NatBiotechnol., 17:994–999, 1999.
[143] S. P. Gygi, Y. Rochon, B. R. Franza, and R. Aebersold. Correlation between proteinand mRNA abundance in yeast. Mol Cell Biol., 19:1720–30, 1999.
[144] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope.DiscoveryLink: A system for integrated access to life sciences data sources. IBMSystems Journal, 40:489–511, 2001.
[145] J. G. Hacia, L. C. Brody, M. S. Chee, S. P. Fodor, and F. S. Collins. Detectionof heterozygous mutations in BRCA1 using high density oligonucleotide arrays andtwo-colour fluorescence analysis. Nat Genet., 14:441–447, 1996.
[146] N. Hall, M. Berriman, N. J. Lennard, B. R. Harris, C. Hertz-Fowler, E. N. Bart-Delabesse, et al. The DNA sequence of chromosome I of an African trypanosome:gene content, chromosome organisation, recombination and polymorphism. NucleicAcids Res., 31:4864–4873, 2003.
[147] G. J. Hannon. RNA interference. Nature, 418:244–251, 2002.
[148] P. M. Haverty, Z. Weng, N. L. Best, K. R. Auerbach, L. L.i Hsiao, R. V. Jensen,and S. R. Gullans. Hugeindex: a database with visualization tools for high-densityoligonucleotide array data from normal human tissues. Nucleic Acids Res., 30:214–217, 2002.
[149] S. Hennig, D. Groth, and H. Lehrac. Automated gene ontology annotation for anony-mous sequence data. Nucleic Acids Res., 31:3712–3715, 2003.
[150] H. Hermjakob, L. Montecchi-Palazzi, G. Bader, J. Wojcik, L. Salwinski, A. Ceol,et al. The HUPO PSI’s molecular interaction format–a community standard for therepresentation of protein interaction data. Nat Biotechnol., 22:177–183, 2004.
[151] F. Hillenkamp and M. Karas. Mass spectrometry of peptides and proteins by matrix-assisted ultraviolet laser desorption/ionization. Methods Enzymol., 193:280–95, 1990.
[152] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, et al. Systematicidentification of protein complexes in Saccharomyces cerevisiae by mass spectrometry.Nature, 415:180–183, 2002.
[153] C. Hoogland, J. C. Sanchez, L. Tonella, P. A. Binz, A. Bairoch, D. F. Hochstrasser,and R. D. Appel. The 1999 SWISS-2DPAGE database update. Nucleic Acids Res.,28:286–288, 2000.
359
[154] I. Horrocks. DAML+OIL: a reason-able web ontology language. In Proceedings ofEDBT 2002, number 2287 in Lecture Notes in Computer Science, pages 2–13, March2002.
[155] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, et al. The Sys-tems Biology Markup Language (SBML): A Medium for Representation and Exchangeof Biochemical Network Models. Bioinformatics, 19:524–531, 2003.
[156] HUGO Gene Nomenclature Committee (HGNC).http://www.gene.ucl.ac.uk/nomenclature/.
[157] W. K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman,and E. K. O’Shea. Global analysis of protein localization in budding yeast. Nature,425:686–691, 2003.
[158] Human-Mouse Homology Map. http://www.ncbi.nlm.nih.gov/Homology/.
[159] E. Hunt, E. Pafilis, I. Tulloch, and J. Wilson. Index-Driven XML Data Integration toSupport Functional Genomics. In Proceeding of the International Workshop on DataIntegration in Life Sciences, Lecture Notes in Computer Science, volume 2994, pages95–109, 2004.
[160] HUP-ML format is available as a DTD (Document Type Definition).http://www1.biz.biglobe.ne.jp/˜jhupo/HUP-ML/hup-ml.dtd.
[161] HUPO - The Human Proteome Organisation. http://www.hupo.org/.
[162] ImageMaster published by Amersham Biosciences. http://www.apbiotech.com/.
[163] Immunohistochemistry - In Situ Hybridization. http://home.no.net/immuno/.
[164] The International Human Genome Sequencing Consortium. Initial sequencing andanalysis of the human genome. Nature, 401:860–921, 2001.
[165] R. Jansen and M. Gerstein. Analysis of the yeast transcriptome with structural andfunctional categories: characterizing highly expressed proteins. Nucleic Acids Res.,28:1481–1488, 2000.
[166] Japanese Human Proteome Organisation (J-HUPO). http://www.jhupo.org/.
[167] Java 2 Platform, Standard Edition (J2SE), v1.4 Overview.http://java.sun.com/j2se/1.4/.
[168] Java Applet. http://java.sun.com/applets/.
[169] Java Technology. http://java.sun.com/.
[170] Java Web Start Technology. http://java.sun.com/products/javawebstart/.
[171] JavaScript.comTM- The Definitive JavaScript Resource.http://www.javascript.com/.
[172] O. N. Jensen. Modification-specific proteomics: characterization of post-translationalmodifications by mass spectrometry. Curr Opin Chem Biol., 8:33–41, 2004.
[173] T. K. Jenssen and E. Hovig. The semantic web and biology. Drug Discov Today.,7:992, 2002.
360
[174] A. Jones. A database for storing the results of 2D-PAGE experiments. Master’s thesis,University of Glasgow, 2001.
[175] A. Jones, E. Hunt, J. M. Wastling, A. Pizarro, and C. J. Stoeckert Jr. An object modeland database for functional genomics. Bioinformatics, 20:1583–1590, 2004.
[176] A. Jones, J. Wastling, and E. Hunt. Proposal for a standard representation of two-dimensional gel electrophoresis data. Comp. Funct. Genom., 4:492–501, 2003.
[177] K. R. Jonscher and J. R. Yates 3rd. The quadrupole ion trap mass spectrometer–a small solution to a big challenge. Anal Biochem., 244:1–15, 1997.
[178] K. Kadota, D. Tominaga, R. Asai, and K. Takahashi. Correlation Analysis of mRNA and Protein Abundances in Human Tissues. Genome Lett., 2:139–148, 2003.
[179] D. E. Kalume, H. Molina, and A. Pandey. Tackling the phosphoproteome: tools and strategies. Curr Opin Chem Biol., 7:64–9, 2003.
[180] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem., 60:2299–2301, 1988.
[181] N. A. Karp, D. P. Kreil, and K. S. Lilley. Determining a significant change in protein expression with DeCyder™ during a pair-wise comparison using two-dimensional difference gel electrophoresis. Proteomics, 4:1421–1432, 2004.
[182] P. Karp, M. Riley, S. Paley, A. Pellegrini-Toole, and M. Krummenacker. EcoCyc: Electronic Encyclopedia of E. coli Genes and Metabolism. Nucleic Acids Res., 27:55–58, 1999.
[183] P. D. Karp. A strategy for database interoperation. J. Comput. Biol., 2:573–586, 1995.
[184] KEGG: Kyoto Encyclopedia of Genes and Genomes. http://www.genome.ad.jp/kegg/.
[185] K. Kim, D. Soldati, and J. C. Boothroyd. Gene replacement in Toxoplasma gondii with chloramphenicol acetyltransferase as selectable marker. Science, 262:911–914, 1993.
[186] K. Kim and L. M. Weiss. Toxoplasma gondii: the model apicomplexan. Int J Parasitol., 34:423–432, 2004.
[187] J. C. Kissinger, B. Gajria, L. Li, I. T. Paulsen, and D. S. Roos. ToxoDB: accessing the Toxoplasma gondii genome. Nucleic Acids Res., 31:234–236, 2003.
[188] H. Kitano. Systems Biology: A Brief Overview. Science, 295:1662–1664, 2002.
[189] T. G. Kleno, C. M. Andreasen, H. O. Kjeldal, L. R. Leonardsen, T. N. Krogh, P. F. Nielsen, M. V. Sorensen, and O. N. Jensen. MALDI MS peptide mapping performance by in-gel digestion on a probe with prestructured sample supports. Anal Chem., 76:3576–3583, 2004.
[190] A. Kumar, P. M. Harrison, K. H. Cheung, N. Lan, N. Echols, P. Bertone, P. Miller, M. B. Gerstein, and M. Snyder. An integrated approach for finding overlooked genes in yeast. Nat Biotechnol., 20:58–63, 2002.
[191] J. Lee, S. Nam, S. B. Hwang, M. Hong, J. Y. Kwon, K. S. Joeng, S. H. Im, J. Shim, and M. C. Park. Functional genomic approaches using the nematode Caenorhabditis elegans as a model system. J Biochem Mol Biol., 37:107–113, 2004.
[192] J-H. Lee, D-E. Lee, B-U. Lee, and H-S. Kim. Global Analyses of Transcriptomes and Proteomes of a Parent Strain and an L-Threonine-Overproducing Mutant Strain. J Bacteriol., 185:5442–5451, 2003.
[193] M. G. Lee. The 3’ untranslated region of the hsp 70 genes maintains the level of steady state mRNA in Trypanosoma brucei upon heat shock. Nucleic Acids Res., 26:4025–4033, 1998.
[194] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: clustering XML schemas for effective integration. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, pages 292–299, 2002.
[195] A. J. Link, J. Eng, D. M. Schieltz, E. Carmack, G. J. Mize, D. R. Morris, B. M. Garvik, and J. R. Yates 3rd. Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol., 17:676–682, 1999.
[196] C. M. Lloyd, M. D. B. Halstead, and P. F. Nielsen. CellML: its future, present and past. Prog. Biophys. Mol. Biol., 85:433–450, 2004.
[197] G. W. Lubega, D. K. Byarugaba, and R. K. Prichard. Immunization with a tubulin-rich preparation from Trypanosoma brucei confers broad protection against African trypanosomosis. Exp Parasitol., 102:9–22, 2002.
[198] R. E. Lyons, R. McLeod, and C. W. Roberts. Toxoplasma gondii: tachyzoite to bradyzoite interconversion. Trends Parasitol., 18:198–201, 2002.
[199] Macromedia. http://www.macromedia.com/.
[200] P. Mahon and P. Dupree. Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full. Electrophoresis, 22:2075–2085, 2001.
[201] H. Mamitsuka, Y. Okuno, and A. Yamaguchi. Mining biologically active patterns in metabolic pathways using microarray expression profiles. ACM SIGKDD Explorations Newsletter, 5:113–121, 2003.
[202] E. Manduchi, G. R. Grant, H. He, J. Liu, M. D. Mailman, A. D. Pizarro, P. L. Whetzel, and C. J. Stoeckert Jr. RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics, 20:452–459, 2004.
[203] M. Mann, R. C. Hendrickson, and A. Pandey. Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem., 70:437–473, 2001.
[204] M. Mann and O. N. Jensen. Proteomic analysis of post-translational modifications. Nat Biotechnol., 21:255–261, 2003.
[205] A. G. Marshall, C. L. Hendrickson, and G. S. Jackson. Fourier transform ion cyclotron resonance mass spectrometry: a primer. Mass Spectrom Rev., 17:1–35, 1998.
[206] C. J. Marshall. Specificity of receptor tyrosine kinase signaling: transient versus sustained extracellular signal-regulated kinase activation. Cell, 80:179–185, 1995.
[207] MASCOT, published by Matrix Science. http://www.matrixscience.com.
[208] M. H. Maurer, C. Berger, M. Wolf, C. D. Futterer, R. E. Feldmann Jr., S. Schwab, and W. Kuschinsky. The proteome of human brain microdialysate. Proteome Sci., 1(7), 2003.
[209] S. M. Maurer, R. B. Firestone, and C. R. Scriver. Science’s neglected legacy. Nature, 405:117–120, 2000.
[210] Melanie3 published by GeneBio. http://www.GeneBio.com/Melanie.html.
[211] The MGED Ontology. http://mged.sourceforge.net/ontologies/MGEDontology.php.
[212] Microarray Gene Expression Data Society (MGED). http://www.mged.org/.
[213] Microsoft .NET Information. http://www.microsoft.com/net/.
[214] O. A. Mirgorodskaya, Y. P. Kozmin, M. I. Titov, R. Korner, C. P. Sonksen, and P. Roepstorff. Quantitation of peptides and proteins by matrix-assisted laser desorption/ionization mass spectrometry using (18)O-labeled internal standards. Rapid Commun Mass Spectrom., 14:1226–1232, 2000.
[215] B. Modrek, A. Resch, C. Grasso, and C. Lee. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29:2850–2859, 2001.
[216] Molecular Visualization Resources: CHIME. http://www.umass.edu/microbio/chime/.
[217] M. P. Molloy. Two-Dimensional Electrophoresis of Membrane Proteins Using Immobilized pH Gradients. Anal Biochem., 280:1–10, 2000.
[218] The Mouse Anatomical Dictionary Browser. http://www.informatics.jax.org/searches/anatdict_form.shtml.
[219] N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, D. Barrell, A. Bateman, et al. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31:315–318, 2003.
[220] P. Murray-Rust, H. S. Rzepa, M. J. Williamson, and E. L. Willighagen. Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators. J Chem Inf Comput Sci., 44:462–469, 2004.
[221] MySQL. http://www.mysql.com/.
[222] National Institute for Standards and Technology. http://www.nist.gov.
[223] C. Navarre, H. Degand, K. L. Bennett, J. S. Crawford, E. Mortz, and M. Boutry. Subproteomics: Identification of plasma membrane proteins from the yeast Saccharomyces cerevisiae. Proteomics, 12:1706–1714, 2002.
[224] The NCBI Taxonomy Homepage. http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/.
[225] NCBI Trace Archive. http://www.ncbi.nlm.nih.gov/Traces/.
[226] W. Ni and T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. In Eighth International Conference on Database Systems for Advanced Applications (DASFAA), pages 363–370, 2003.
[227] J. K. Nicholson, J. Connelly, J. C. Lindon, and E. Holmes. Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov., 1:153–161, 2002.
[228] M. Nilsson. The semantic web: How RDF will change learning technology standards, 2001. http://www.cetis.ac.uk/content/20010927172953.
[229] N. Nirmalan, P. F. G. Sims, and J. E. Hyde. Quantitative proteomics of the human malaria parasite Plasmodium falciparum and its application to studies of development and inhibition. Mol Microbiol., 52:1187–1199, 2004.
[230] N. F. Noy, R. W. Fergerson, and M. A. Musen. The knowledge model of Protege-2000: Combining interoperability and flexibility. In 12th International Conference on Knowledge Engineering and Knowledge Management, pages 17–32, 2001.
[231] The Object Management Group. http://www.omg.org/.
[232] OPD: Open Proteomics Database. http://bioinformatics.icmb.utexas.edu/OPD/.
[233] Open Biological Ontologies (OBO). http://obo.sourceforge.net/.
[234] Open Grid Services Architecture Data Access and Integration (OGSA-DAI). http://www.ogsadai.org.uk/.
[235] Oracle 9i. http://www.oracle.com/.
[236] S. Orchard, P. Kersey, H. Hermjakob, and R. Apweiler. The HUPO Proteomics Standards Initiative meeting: towards common standards for exchanging proteomics data. Comp Funct Genom, 4:16–19, 2003.
[237] S. Orchard, P. Kersey, W. Zhu, L. Montecchi-Palazzi, H. Hermjakob, and R. Apweiler. Progress in establishing common standards for exchanging proteomics data: The second meeting of the HUPO Proteomics Standards Initiative. Comp Funct Genom, 4:203–206, 2003.
[238] OWL Web Ontology Language. http://www.w3.org/TR/owl-features/.
[239] H. Papageorgiou, F. Pentaris, E. Theodoruou, M. Vardaki, and M. Petrakos. Modeling statistical metadata. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management, pages 25–35, 2001.
[240] G. M. Pasinetti and L. Ho. From cDNA microarrays to high-throughput proteomics. Implications in the search for preventive initiatives to slow the clinical progression of Alzheimer’s disease dementia. Restor Neurol Neurosci., 18:137–142, 2001.
[241] N. W. Paton, R. Stevens, P. G. Baker, C. A. Goble, S. Bechhofer, and A. Brass. Query Processing in the TAMBIS Bioinformatics Source Integration System. In Proceedings 11th Int. Conf. on Scientific and Statistical Databases (SSDBM), pages 138–147, 1999.
[242] PEDRo (Proteomics Experiment Data Repository). http://pedro.man.ac.uk/.
[243] J. Peng, J. E. Elias, C. C. Thoreen, L. J. Licklider, and S. P. Gygi. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res., 2:43–50, 2003.
[244] C. A. Pereira, G. D. Alonso, H. N. Torres, and M. M. Flawia. Arginine kinase: a common feature for management of energy reserves in African and American flagellated trypanosomatids. J Eukaryot Microbiol., 49:82–85, 2002.
[245] M. Perrot, F. Sagliocco, T. Mini, C. Monribot, U. Schneider, A. Shevchenko, M. Mann, P. Jeno, and H. Boucherie. Two-dimensional gel protein database of Saccharomyces cerevisiae (update 1999). Electrophoresis, 20:2280–2298, 1999.
[246] PHP: Hypertext Preprocessor. http://www.php.net.
[247] The Plant Ontology Consortium. http://www.plantontology.org/.
[248] PlasmoDB: The Plasmodium Genome Resource. http://www.plasmodb.org.
[249] Poseidon for UML™, available from Gentleware. http://www.gentleware.com.
[250] PowerDesigner 9™, available from Sybase Inc. http://www.sybase.com.
[251] P. F. Predki. Functional protein microarrays: ripe for discovery. Curr Opin Chem Biol., 8:8–13, 2004.
[252] J. T. Prince, M. W. Carlson, R. Wang, P. Lu, and E. M. Marcotte. The need for a public proteomics repository. Nat Biotechnol., 22:471–472, 2004.
[253] The Protein Data Bank. http://www.rcsb.org/pdb/.
[254] Protein Information Resource. http://pir.georgetown.edu/.
[255] Proteome 2D-PAGE database at Max-Planck. http://www.mpiib-berlin.mpg.de/2D-PAGE/.
[256] ProteomeGRID. http://vip.doc.ic.ac.uk/proteomegrid/.
[257] The Proteomics Standards Initiative. http://psidev.sourceforge.net/.
[258] PSI-MS XML Data Format. http://psidev.sourceforge.net/ms/.
[259] S. Purvine, A. F. Picone, and E. Kolker. Standard mixtures for proteome studies. OMICS, 8:79–92, 2004.
[260] X. Que, H. Ngo, J. Lawton, M. Gray, Q. Liu, J. Engel, et al. The cathepsin B of Toxoplasma gondii, toxopain-1, is critical for parasite invasion and rhoptry protein processing. J Biol Chem., 277:25791–25797, 2002.
[261] The R Project for Statistical Computing. http://www.r-project.org/.
[262] RAD (RNA Abundance Database). http://www.cbil.upenn.edu/RAD/.
[263] J. C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, et al. The protein-protein interaction map of Helicobacter pylori. Nature, 409:211–215, 2001.
[264] B. Raman, A. Cheung, and M. R. Marten. Quantitative comparison and evaluation of two commercially available, two-dimensional electrophoresis image analysis software packages, Z3 and Melanie. Electrophoresis, 23:2194–2202, 2002.
[265] W. D. Ransom, P-C. Lao, D. A. Gage, and W. F. Boss. Phosphoglycerylethanolamine Posttranslational Modification of Plant Eukaryotic Elongation Factor 1 α. Plant Physiol., 117:949–960, 1998.
[266] Rational Rose 2000e, published by Rational Software. http://www.rational.com/.
[267] S. Raychaudhuri, J. Stuart, and R. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput., 5:455–66, 2000.
[268] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, and D. Lancet. GeneCards: encyclopedia for genes, proteins and diseases. Weizmann Institute of Science, Bioinformatics Unit and Genome Center (Rehovot, Israel). http://bioinformatics.weizmann.ac.il/cards.
[269] Resource Description Framework (RDF). http://www.w3.org/RDF.
[270] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol., 17:1030–1032, 1999.
[271] U. Roessner, C. Wagner, J. Kopka, R. N. Trethewey, and L. Willmitzer. Technical advance: simultaneous analysis of metabolites in potato tuber by gas chromatography-mass spectrometry. Plant J., 23:131–142, 2000.
[272] M. Rogers, J. Graham, and R. P. Tonge. Using statistical image models for objective evaluation of spot detection in two-dimensional gels. Proteomics, 3:879–86, 2003.
[273] D. S. Roos. Bioinformatics–trying to swim in a sea of data. Science, 291:1260–1261, 2001.
[274] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language Reference Manual. Addison Wesley, 1999.
[275] L. H. Saal, C. Troein, J. Vallon-Christersson, S. Gruvberger, A. Borg, and C. Peterson. BioArray Software Environment: A Platform for Comprehensive Management and Analysis of Microarray Data. Genome Biol., 3:software0003.1–0003.6, 2002.
[276] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, C. A. Fiddes, C. A. Hutchison, P. M. Slocombe, and M. Smith. Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265:687–695, 1977.
[277] V. Santoni, S. Kieffer, D. Desclaux, F. Masson, and T. Rabilloud. Membrane proteomics: use of additive main effects with multiplicative interaction model to classify plasma membrane proteins according to their solubility and electrophoretic properties. Electrophoresis, 21:3329–3344, 2000.
[278] SASHIMI. http://sashimi.sourceforge.net/.
[279] SAX (Simple API for XML). http://sax.sourceforge.net/.
[280] R. A. Sayle and E. J. Milner-White. RasMol: Biomolecular graphics for all. Trends Biochem Sci., 20:374–376, 1995.
[281] Scalable Vector Graphics (SVG). http://www.w3.org/Graphics/SVG/.
[282] D. G. Schmid, F. D. von der Mulbe, B. Fleckenstein, T. Weinschenk, and G. Jung. Broadband detection electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry to reveal enzymatically and chemically induced deamidation reactions within peptides. Anal Chem., 73:6008–6013, 2001.
[283] A. Schneider, U. Plessmann, and K. Weber. Subpellicular and flagellar microtubules of Trypanosoma brucei are extensively glutamylated. J Cell Sci., 110:431–437, 1997.
[284] L. V. Schneider and M. P. Hall. Stable Isotope Methods for High-Precision Proteomics. Drug Discov Today., in press, 2005.
[285] J. Seo and K-J. Lee. Post-translational modifications and their biological functions: Proteomic analysis and systematic approaches. J Biochem Mol Biol., 37:35–44, 2004.
[286] The Sequence Ontology Project. http://song.sourceforge.net/.
[287] D. Shalon, S. J. Smith, and P. O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res., 6:639–645, 1996.
[288] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proceedings of 25th International Conference on Very Large Data Bases, pages 302–314, 1999.
[289] T. Sherwin, A. Schneider, R. Sasse, T. Seebeck, and K. Gull. Distinct localization and cell cycle dependence of COOH terminally tyrosinolated alpha-tubulin in the microtubules of Trypanosoma brucei brucei. J Cell Biol., 104:439–446, 1987.
[290] Y. Shi, R. Xiang, C. Horvath, and J. A. Wilkins. The role of liquid chromatography in proteomics. J Chromatogr A., 1053:27–36, 2004.
[291] L. D. Sibley. Intracellular Parasite Invasion Strategies. Science, 304:248–253, 2004.
[292] A. P. Sinai, T. M. Payne, J. C. Carmen, L. Hardi, S. J. Watson, and R. E. Molestina. Mechanisms underlying the manipulation of host apoptotic pathways by Toxoplasma gondii. Int J Parasitol., 34:381–391, 2004.
[293] Sir Henry Wellcome Functional Genomics Facility (SHWFGF), based in the University of Glasgow. http://www.gla.ac.uk/functionalgenomics/.
[294] D. H. Smith, J. Pepin, and A. H. Stich. Human African trypanosomiasis: an emerging public health crisis. Br Med Bull., 54:341–355, 1998.
[295] W. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003.
[296] SourceForge.net: Project Info - Life Science Identifier (LSID). http://sourceforge.net/projects/lsid/.
[297] P. T. Spellman, M. Miller, J. Stewart, C. Troup, U. Sarkans, S. Chervitz, et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol., 23, 2002. RESEARCH0046.
[298] Standards and Ontologies for Functional Genomics. http://www.sofg.org/.
[299] L. D. Stein. Integrating biological databases. Nat Rev Genet., 4:337–345, 2003.
[300] R. D. Stevens, A. J. Robinson, and C. A. Goble. myGrid: personalised bioinformatics on the information grid. Bioinformatics, 19:I302–I304, 2003.
[301] A. Stich, P. M. Abel, and S. Krishna. Human African trypanosomiasis. BMJ, 325:203–206, 2002.
[302] C. Stoeckert, A. Pizarro, E. Manduchi, M. Gibson, B. Brunk, J. Crabtree, J. Schug, S. Shen-Orr, and G. C. Overton. A relational schema for both array-based and SAGE gene expression experiments. Bioinformatics, 17:300–308, 2001.
[303] C. J. Stoeckert, H. C. Causton, and C. A. Ball. Microarray databases: standards and ontologies. Nat Genet., 32:469–473, 2002.
[304] C. J. Stoeckert and H. Parkinson. The MGED ontology: a framework for describing functional genomics experiments. Comp. Funct. Genom., 4:127–132, 2003.
[305] E. C. Strauss, J. A. Kobori, G. Siu, and L. E. Hood. Specific-primer-directed DNA sequencing. Anal Biochem., 154:353–360, 1986.
[306] L. W. Sumner, P. Mendes, and R. A. Dixon. Plant Metabolomics: Large-scale Phytochemistry in the Functional Genomics Era. Phytochemistry, 62:817–836, 2003.
[307] Sun Microsystems, Inc. http://www.sun.com/.
[308] Y. H. Sung, J. Song, and H-W. Lee. Functional Genomics Approach Using Mice. J Biochem Mol Biol., 37:122–132, 2004.
[309] SWISS-2DPAGE: Two-dimensional polyacrylamide gel electrophoresis database. http://ca.expasy.org/ch2d/.
[310] Swiss-Prot. http://www.expasy.ch/sprot/.
[311] The Systems Biology Markup Language. http://sbml.org/.
[312] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A., 96:2907–2912, 1999.
[313] Tamino XML server. http://www.softwareag.com/tamino/.
[314] T. A. Tatusova, L. Karsch-Mizrachi, and J. A. Ostell. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15:536–543, 1999.
[315] C. F. Taylor, N. W. Paton, K. L. Garwood, P. D. Kirby, D. A. Stead, Z. Yin, et al. A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol., 21:247–254, 2003.
[316] S. W. Taylor, E. Fahy, B. Zhang, G. M. Glenn, D. E. Warnock, S. Wiley, et al. Characterization of the human heart mitochondrial proteome. Nat Biotechnol., 21:281–286, 2003.
[317] D. E. Terry and D. M. Desiderio. Between-gel reproducibility of the human cerebrospinal fluid proteome. Proteomics, 3:3, 2003.
[318] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673–4680, 1994.
[319] P. Toronen, M. Kolehmainen, G. Wong, and E. Castren. Analysis of gene expression data using self-organizing maps. FEBS Lett., 451:142–146, 1999.
[320] ToxoDB: The Toxoplasma Genome Resource. http://www.toxodb.org/.
[321] Toxoplasma Genome Page. www.ebi.ac.uk/parasites/toxo/toxpage.html.
[322] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422:193–197, 2003.
[323] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403:623–627, 2000.
[324] Unified Modeling Language. http://www.uml.org/.
[325] UniParc, The UniProt Archive. http://www.ebi.ac.uk/uniparc/.
[326] UniProt (Universal Protein Resource). http://www.uniprot.org.
[327] M. Unlu, M. E. Morgan, and J. S. Minden. Difference gel electrophoresis: a single gel method for detecting changes in cell extracts. Electrophoresis, 18:2071–2077, 1997.
[328] G. Van den Bergh, S. Clerens, F. Vandesande, and L. Arckens. Reversed-phase high-performance liquid chromatography prefractionation prior to two-dimensional difference gel electrophoresis and mass spectrometry identifies new differentially expressed proteins between striate cortex of kitten and adult cat. Electrophoresis, 24:1471–1481, 2003.
[329] F. J. van Deursen, S. K. Shahi, C. M. Turner, C. Hartmann, C. Guerra-Giraldez, K. R. Matthews, and C. E. Clayton. Characterisation of the growth and differentiation in vivo and in vitro of bloodstream-form Trypanosoma brucei strain TREU 927. Mol Biochem Parasitol., 112:163–171, 2001.
[330] F. J. van Deursen, D. J. Thornton, and K. R. Matthews. A reproducible protocol for analysis of the proteome of Trypanosoma brucei by 2-dimensional gel electrophoresis. Mol Biochem Parasitol., 128:107–110, 2003.
[331] S. Veeser, M. J. Dunn, and G. Z. Yang. Multiresolution image registration for two-dimensional gel electrophoresis. Proteomics, 1:856–870, 2001.
[332] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene expression. Science, 270:484–487, 1995.
[333] V. E. Velculescu, L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett Jr, P. Hieter, B. Vogelstein, and K. W. Kinzler. Characterization of the yeast transcriptome. Cell, 88:243–251, 1997.
[334] J. C. Venter, M. D. Adams, and E. W. Myers. The Sequence of the Human Genome. Science, 291:1304–1351, 2001.
[335] K. Vickerman. On the surface coat and flagellar adhesion in trypanosomes. J Cell Sci., 5:163–194, 1969.
[336] E. O. Voit. Metabolic modeling: a tool of drug discovery in the post-genomic era. Drug Discov. Today, 7:621–628, 2002.
[337] C-W. von der Lieth, A. Bohne-Lang, K. K. Lohmann, and M. Frank. Bioinformatics for glycomics: Status, methods, requirements and perspectives. Brief. Bioinformatics, 5:164–178, 2004.
[338] T. Voss and P. Haberl. Observations on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: consequences for comprehensive data analysis. Electrophoresis, 21:3345–3350, 2000.
[339] Voyager Version 5 with Data Explorer Software, published by Applied Biosystems. http://www.appliedbiosystems.com/.
[340] W3C Math home page. http://www.w3.org/Math/.
[341] W3C Recommendation for XML Schema. http://www.w3.org/XML/Schema.
[342] W3C Semantic Web. http://www.w3.org/2001/sw/.
[343] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F. Temple, M. A. Brasch, N. Thierry-Mieg, and M. Vidal. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science, 287:116–122, 2000.
[344] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol., 19:242–247, 2001.
[345] V. C. Wasinger, S. J. Cordwell, A. Cerpa-Poljak, J. X. Yan, A. A. Gooley, M. R. Wilkins, M. W. Duncan, R. Harris, K. L. Williams, and I. Humphery-Smith. Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis, 16:1090–1094, 1995.
[346] W. Weckwerth. Metabolomics in systems biology. Annu Rev Plant Biol., 54:669–689, 2003.
[347] W. Weckwerth, V. Tolstikov, and O. Fiehn. Metabolomic characterization of transgenic potato plants using GC/TOF and LC/MS analysis reveals silent metabolic phenotypes. In Proceedings of the 49th ASMS Conference on Mass spectrometry and Allied Topics, pages 1–2. Chicago: Am. Soc. Mass Spectrom., 2001.
[348] G. Wiederhold. Intelligent integration of diverse information (invited talk). In Int. Conf. on Information and Knowledge Management, Baltimore, 1992.
[349] M. R. Wilkins, J. C. Sanchez, A. A. Gooley, R. D. Appel, I. Humphery-Smith, D. F. Hochstrasser, and K. L. Williams. Progress with proteome projects: why all proteins expressed by a genome should be identified and how to do it. Biotechnol Genet Eng Rev., 13:19–50, 1996.
[350] WordNet - a lexical database for the English language. http://www.cogsci.princeton.edu/~wn/.
[351] WORLD-2DPAGE: Index to 2-D PAGE databases and services. http://us.expasy.org/ch2d/2d-index.html.
[352] The World Wide Web Consortium. http://www.w3c.org/.
[353] WormBase. http://www.wormbase.org/.
[354] W. Xhou, B. A. Merrick, M. G. Khaledi, and K. B. Tomer. Detection and sequencing of phosphopeptides affinity bound to immobilized metal ion beads by matrix-assisted laser desorption/ionization mass spectrometry. J Am Soc Mass Spectrom., 11:273–282, 2000.
[355] S. Xirasagar, S. Gustafson, A. Merrick, K. B. Tomer, S. Stasiewicz, D. D. Chan, et al. CEBS Object Model for Systems Biology Data, CEBS MAGE SysBio-OM. Bioinformatics, 20:2004–2015, 2004.
[356] XML Metadata Interchange (XMI). http://www.omg.org/technology/documents/formal/xmi.htm.
[357] XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery/.
[358] XSPAN - A Cross-Species Anatomy Project. http://www.xspan.org/.
[359] Xtect. http://xtect.cis.strath.ac.uk/.
[360] A. F. Yakunin, A. A. Yee, A. Savchenko, A. M. Edwards, and C. H. Arrowsmith. Structural proteomics: a tool for genome annotation. Curr Opin Chem Biol., 8:42–48, 2004.
[361] W. Yan, H. Lee, E. C. Yi, D. Reiss, P. Shannon, B. K. Kwieciszewski, et al. System-based proteomic analysis of the interferon response in human liver cells. Genome Biol., 5:R54, 2004.
[362] M. Yanagida. Functional proteomics; current achievements. J Chromatogr B Analyt Technol Biomed Life Sci., 771:89–106, 2002.
[363] X. Yang, M. L. Lee, and T. W. Ling. Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach. In 22nd International Conference on Conceptual Modeling (ER), pages 520–533, 2003.
[364] M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1:110–141, 2001.
[365] N. Young, Z. Chang, and D. S. Wishart. GelScape: a web-based server for interactively annotating, manipulating, comparing and archiving 1D and 2D gel images. Bioinformatics, 20:976–978, 2004.
[366] Z3 published by Compugen. http://www.2dgels.com/.
[367] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni. MINT: a Molecular INTeraction database. FEBS Lett., 513:135–140, 2002.
[368] B. R. Zeeberg, W. Feng, G. Wang, M. D. Wang, A. T. Fojo, M. Sunshine, et al. GoMiner: A Resource for Biological Interpretation of Genomic and Proteomic Data. Genome Biol., 4:R28, 2003.
[369] R. Zeng, H. Q. Ruan, X. S. Jiang, H. Zhou, L. Shi, L. Zhang, Q. H. Sheng, Q. Tu, Q. C. Xia, and J. R. Wu. Proteomic analysis of SARS associated coronavirus using two-dimensional liquid chromatography mass spectrometry and one-dimensional sodium dodecyl sulfate-polyacrylamide gel electrophoresis followed by mass spectrometric analysis. J Proteome Res., 3:549–555, 2004.
[370] X. Zuo and D. W. Speicher. Comprehensive analysis of complex proteomes using microscale solution isoelectrofocusing prior to narrow pH range two-dimensional electrophoresis. Proteomics, 2:58–68, 2002.