

Indian Journal of Biotechnology Vol 1, January 2002, pp 101-116

Bioinformatics: Advancing Biotechnology through Information Technology Part I: Molecular Biology Databases

Sudeshna Adak* and Biplav Srivastava

IBM India Research Lab, Block I, IIT Campus, Hauz Khas, New Delhi 110016, India

This paper reviews the molecular biology databases and other bioinformatics resources available to biotechnologists who want to use the wealth of genomic data available today. Genomic data, along with the associated proteomic and functional data, are often distributed across multiple databases, requiring a time-consuming search by the user. The explosion of information in molecular biology has created a veritable maze, through which careful navigation is required for research and innovation in biotechnology. This paper, one of a series, introduces readers to the major molecular biology databases and to bioinformatics tools such as BLAST for similarity searching and RasMol for protein structure visualization. Subsequent papers will take readers on a journey across bioinformatics and the biotechnological discoveries it is making possible. Advances in computer technologies and the birth of the internet are also part of this revolution in biology. Online databases have given scientists and researchers across the world access to previously unimaginable volumes of biologically relevant data. Bioinformatics, a truly multidisciplinary science, aims to use the benefits of computer technologies in understanding the biology of life itself.

Keywords: bioinformatics, biological databases, alignment, Entrez, SRS, BLAST

1. Introduction

The announcement of the completion of a 'working draft' of the human genome on June 26, 2000, captured the imagination of people across the world in a way that science and technology had not done since man walked on the moon. Translating the 3 billion characters in the DNA sequences that make up the human genome into biologically meaningful information has given rise to a new field: bioinformatics. When the Human Genome Project was conceived in 1987, the field of bioinformatics was barely in its infancy. Today, bioinformatics has become a recognized discipline in its own right, born out of the need to bring together the information sciences and the biological sciences in understanding the wealth of data created by the various genomics, proteomics and functional genomics projects around the world. This paper is intended to introduce bioinformatics to scientists and biotechnologists who are beginning to explore and use its tools in making new advances and discoveries in biology.

The first question today in the mind of many scientists is "What is bioinformatics?" Bioinformatics has been touted as in-silico biology, where wet lab experimental biology can (perhaps) be replaced with computers. A more precise definition of bioinformatics is the application of information sciences (mathematics, statistics and computer science) to increase our understanding of biology. Probably the most remarkable success of bioinformatics to date has been its use in the 'shotgun sequencing' of the human genome.

* Author for correspondence: Tel: 91-11-6861100 E-mail: [email protected]

In shotgun sequencing (Bankier et al., 1987), a large piece of DNA is broken up randomly into smaller fragments. The smaller fragments are subcloned, their ends sequenced, and the fragments are reassembled based on overlaps (Fig. 1). This approach rapidly reveals 90% of the desired sequence information, and the remaining few gaps are filled by custom oligonucleotide primers (Waterston & Sulston, 1995). Sequence assembly is still one of the primary uses of bioinformatics in the various sequencing projects underway today.
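The reassembly step can be illustrated with a minimal sketch. It assumes error-free reads from one strand and uses a simple greedy merge of the best-overlapping pair; the function names `overlap` and `greedy_assemble` and the toy fragments are invented for this illustration, and real assemblers are considerably more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining gaps stay open
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy fragments invented for this example.
fragments = ["ATGGCGT", "GCGTACCT", "ACCTGAAT"]
print(greedy_assemble(fragments))  # ['ATGGCGTACCTGAAT']
```

When no pair of fragments overlaps by at least `min_len` characters, the function returns several contigs, which corresponds to the gaps that the text says are closed with custom primers.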

Bioinformatics, as a subject, consists of three core areas:

• Molecular Biology Databases
• Sequence Comparison and Sequence Analysis
• The Emerging Technology of Microarrays

For detailed discussion of these topics, readers are referred to Baxevanis & Ouellette (1998); Gibas & Jambeck (2001); Rashidi & Buehler (1999); Mount (2001); and Misener & Krawetz (2000). This paper, the first in a series of three, reviews the first core area, molecular biology databases. The subsequent papers will review the other two core areas of bioinformatics. The review of molecular biology databases is intended to answer two kinds of questions faced by biotechnologists in using bioinformatics:

1. What are the resources currently available and what is their potential usefulness in biotechnology? (Sections 2 and 3 of this paper provide a review of these resources.)

2. What are the important aspects of database technology that are particularly relevant in creating new biological databases and adding value to existing biological databases? (Section 4 of this paper discusses the main added value in molecular biology databases today: seamless integration of multiple, heterogeneous databases that gives the user a single point of entry to a variety of resources.)

Fig. 1. Shotgun sequencing (first step shown: purify the DNA of interest and fragment it into small pieces)

While sequencing of the human genome has captured the attention of the scientific community at large, similar efforts for other organisms and the creation of databases in the related areas of proteomics and functional genomics have not received the same publicity. However, it is the combined use of a variety of biological databases that will have an impact on every industry that uses biotechnology today, such as pharmaceuticals, agriculture, forensics, bioremediation and biofuels, and other biochemical industrial processes. An understanding of how to combine the resources of molecular biology databases is therefore necessary for the modern biotechnologist. A hypothetical scenario, illustrating how the combined use of such resources can lead to a new biological discovery, is given below.

Use Case Scenario: A scientist at a plant biotechnology company is interested in understanding the genetic basis of fruit development. Specifically, the scientist wants to identify the genes involved in ripening green strawberries into red strawberries and to determine the biochemical pathways involved in the ripening process. Fig. 2 illustrates a comparative genomics approach that can be used in a purely in-silico effort to determine the genes involved in the ripening of strawberries. In this approach, the genome of the strawberry is compared to the annotated genomes of similar species to identify the genes and their associated functions.

Fig. 2. From strawberry genome to phenotype (colour: red and flavour: sweet). Steps shown: Genome; Gene Identification (BLAST searching of plant genome databases); Translation; Function Identification (protein BLAST searching of protein databases, literature search); products and metabolites; Biochemical Pathways (Path Comparison: search pathways databases); Phenotype.

2. Molecular Biology Databases

Table 1. Major Biological Databases

Database Name | Link | Contents

Molecular Biology Database Catalogs
Infobiogen Catalogue of Molecular Biology Databases | http://www.infobiogen.fr/services/dbcat/ | -
The Molecular Biology Database Collection | http://nar.oupjournals.org/cgi/content/full/29/1/11/DC1 | -
The European Bioinformatics Institute Biocatalogue | http://www.ebi.ac.uk/biocat/ | -

Major Biomedical Literature Databases
NCBI's PubMed | http://www.ncbi.nlm.nih.gov/PubMed | Medline and Pre-Medline citations

Major Nucleotide Sequence Databases
NCBI's Genbank | http://www.ncbi.nlm.nih.gov/Genbank | All known nucleotide and protein sequences: International Nucleotide Sequence Data Collaboration
EMBL Nucleotide Sequence Database | http://www.ebi.ac.uk/embl/ | All known nucleotide and protein sequences: International Nucleotide Sequence Data Collaboration
DNA Data Bank of Japan (DDBJ) | http://www.ddbj.nig.ac.jp | All known nucleotide and protein sequences: International Nucleotide Sequence Data Collaboration

Major Protein Databases
Protein Information Resource (PIR) | http://pir.georgetown.edu | Comprehensive, annotated, non-redundant protein sequence database
Swiss-Prot/TrEMBL | http://www.expasy.ch/sprot | Curated protein sequences
InterPro | http://www.ebi.ac.uk/interpro | Integrated resource for protein families, domains, and sites
MetaFam | http://metafam.ahc.umn.edu/ | Integrated protein family information
Membrane Protein Database | http://biophys.bio.tuat.ac.jp/ohshima/database/ | Membrane sequences, transmembrane regions and structures
TRANSFAC | http://transfac.gbf.de/TRANSFAC/ | Transcription factors and binding sites

Major Protein Structure Databases
Protein Data Bank (PDB) | http://www.rcsb.org/pdb/ | Structure data determined by X-ray crystallography and NMR
NCBI's MMDB | http://www.ncbi.nlm.nih.gov/Structure | All experimentally determined 3D protein structures linked to NCBI's Entrez
SCOP | http://scop.mrc-lmb.cam.ac.uk/scop/ | Familial and structural protein relationships

Major Mutation Databases
NCBI's dbSNP | http://www.ncbi.nlm.nih.gov/SNP/ | Database of single nucleotide polymorphisms
ALFRED | http://alfred.med.yale.edu/alfred/index.asp | Allele frequencies and DNA polymorphisms
HGBASE | http://hgbase.cgr.ki.se/ | Intragenic sequence polymorphisms
OMIM | http://www.ncbi.nlm.nih.gov/OMIM/ | Catalog of human genetic and genomic disorders

Major Gene Expression Databases
Gene Expression Omnibus | http://www.ncbi.nlm.nih.gov/GEO | NCBI's repository for gene expression (under development)
Stanford Microarray Database (SMD) | http://genome-www4.Stanford.edu | Gateway to microarray data from Stanford labs

Major Plant Genome Databases
Genoplante | www.genoplante.org | Genomics for plant improvement
UK CropNet | http://ukcrop.net/ | Comprehensive gateway to crop genomes

Major Microbial Genome Databases (Contents: Completed Microbial Genomes)
NCBI's Microbial Genome Gateway | http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html | -
DOE's Microbial Genomics Gateway | http://microbialgenome.org/ | -
TIGR's Comprehensive Microbial Resources | http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl | -

Major Organism-specific Genome Databases
Genomes OnLine Database (GOLD) | http://wit.integratedgenomics.com/GOLD/ | Information regarding complete and ongoing genome projects
Flybase | http://www.fruitfly.org | Drosophila sequences and genomic information
Full-Malaria | http://133.11.149.55/ | Malaria full-length cDNA database
Mouse Genome Database (MGD) | http://www.informatics.jax.org/ | Mouse genetics and genomics
Arabidopsis thaliana genome database | http://www.tigr.org/tdb/e2k1/ath1/ | Arabidopsis thaliana genome database
Saccharomyces Genome Database (SGD) | http://genome-www.stanford.edu/Saccharomyces/ | S. cerevisiae genome information
Rice Genome Project | http://rgp.dna.affrc.go.jp/ | Reporting current data in the rice genome project
ZmDB | http://zmdb.iastate.edu | Reporting current data in the maize genome project

Most biologists and biotechnologists are familiar with the better-known nucleotide sequence databases such as Genbank or EMBL, as well as the protein-related databases Swiss-Prot, PIR, and PDB (Table 1). However, there are numerous specialized biological databases that have been created out of a particular need, either to answer a particular biological question or to better serve a particular segment of the biological community. The objective in describing the molecular biology databases is to promote the use of these resources by our readers in the design and analysis of their experiments. Comprehensive listings of molecular biology databases are available through the following web catalogues:

• Infobiogen (Table 1) provides an online catalogue of molecular biology databases. The catalogue lists 511 databases (as of October 30, 2001), categorized by content as: DNA related, 87; RNA related, 29; protein related, 94; genomic, 58; mapping, 29; protein structure, 18; literature, 43; and miscellaneous, 153.

• The Molecular Biology Database Collection (Table 1) is an online catalogue of key databases of value to the biological community. This collection currently lists 281 high-quality online databases. The databases included in this collection are considered particularly relevant for biotechnology research as they provide new value to the underlying data by virtue of curation, new data connections or other innovative approaches.

• Since 1993, the European Bioinformatics Institute has maintained the Biocatalogue (Table 1), a software directory of general interest in molecular biology and genetics. It is also categorized into key areas, similar to the Infobiogen catalogue.

It is beyond the scope of this paper to describe the plethora of molecular biology databases available today. Readers are referred to the Infobiogen Catalogue and the Molecular Biology Database Collection for a comprehensive listing. The major well-known biological databases are listed in Table 1. Some specialized databases, which are perhaps of more interest to biotechnologists and molecular biologists, are described below.

Specialized Resources

a) Plant Databases. The completion of the sequencing of the entire genome of the model plant Arabidopsis thaliana (Arabidopsis Genome Initiative, 2000) has been hailed by plant biotechnologists as the beginning of a new era. Various efforts to sequence the genomes of major crop plants are underway and will be completed shortly. Scientists are now faced with the challenge of identifying new plant genes, understanding the functions of newly discovered plant genes and "reaping the plant gene harvest" by using this new information to improve crop yield. While plant genomics (i.e. unraveling the functions of plant genes based on whole genome sequencing) is still in its infancy, plant biotechnologists have been more active in creating proteomics and metabolic profile databases. Proteomics uses two-dimensional (2D) gel electrophoresis to separate the proteins in a cell or tissue by size and pH characteristic (isoelectric point), followed by mass spectrometry to help identify each component of the resulting gel pattern. Using this technology, researchers are building protein expression pattern databases for Arabidopsis, rice, maize, and pine trees (see http://www.expasy.ch/ch2d/2d-index.html for links to these databases). The databases provide pictures of 2D gels with links to previous publications on the proteins of interest, the plant tissue from which each protein was derived, and sequence data for the proteins. However, as an increasing number of plant genomes become available, plant genomics will play an even more significant role in plant biotechnology. Two gateways to the emerging area of plant genomics are reviewed here.

UK CropNet: Comparative genomics (ascribing function through alignment of similar nucleotide or protein sequences) has become a key area of research in plant biotechnology, because the genomes of closely related plant species have been found to have remarkably similar genes and gene functions. As vast amounts of plant genomic data become available, the use of bioinformatics to improve plant varieties is also becoming vital. To make sense of all the genomic data, UK CropNet was established in 1996 with the specific aim of developing software and databases that facilitate the querying of genomic information from different crop species. Particular emphasis has been placed on developing software tools for comparative mapping. UK CropNet has used the AceDB (Durbin & Thierry-Mieg, 1992) database system to create separate databases for each of its projects, with individual databases for Arabidopsis, barley, Brassica, forage grasses, and millet.

Genoplante: Genoplante is a remarkable instance of collaboration between publicly funded institutions and private organizations in furthering scientific research in genomics and bioinformatics, specifically for plants. It is a major partnership programme in plant genomics, which links public research in France (INRA, CIRAD, IRD, CNRS) and the main private companies involved in crop improvement and protection (Biogemma, Aventis CropScience, Bioplante). Genoplante is part of the fierce international competition in plant genomics, as can be seen from the variety of programmes being created around the world. It is the French answer, and tomorrow the European answer, to this major scientific and economic challenge. By pooling their knowledge and their finances in a structured research network, the public and private members of the programme benefit from synergy while still following their own programmes. The objective of Genoplante is to create a network of laboratories across Europe that will pool their resources to discover new plant genes, study the genomics of model plant species such as Arabidopsis and rice, and conduct genome-based research on the major crops under cultivation in Europe.

b) Microbial Databases. Micro-organisms (viruses, bacteria, fungi, protozoa and algae) hold the key to maintaining the earth's ecological system. The unique properties of microbes represent an extremely valuable resource for biotechnology and are key elements in breakthroughs such as more effective and safer vaccines, identification of new drug and chemical targets in pathogens, improved industrial catalysts, bioremediation, and perhaps clues to the origin of the Earth. However, as thousands of microbes are known to exist on earth, systematic understanding of the biology of such a large number of organisms was a daunting task until recently (less than 1% of microbes on earth have been cultured and studied in the laboratory). Whole genome sequencing represents an important step, as it can accelerate the understanding of a microbe's biological capabilities and lead to real impact in the field of biotechnology. To date, there are 59 complete, annotated microbial genomes, with 17 more whose sequencing is complete and whose annotation is in progress. For an additional impressive number (~200 to 300) of microbial genomes, sequencing is currently in progress. Initial analysis of the available microbial genome data has already led to some surprising results: 20-30% of genes encode unknown proteins apparently unique to the species, and 40-50% of genes encode proteins of unknown function. Some of the important databases that are gateways to microbial genomes (see Table 1 for links), and that will prove a key resource in microbial biotechnology research, are described below.

NCBI's Microbial Genomes Gateway: The gateway provides access to complete and unfinished microbial genomes, including sequence and structural similarity searching tools. It also provides taxonomic information on the microbial species.

Microbial Genomics Gateway: This is the portal to the US Department of Energy (DOE)'s Microbial Genome Programme. It provides a comprehensive set of links to web pages for microbes under study around the world. The portal also provides links to DOE's research on genetic engineering and biotechnology using microbes. For example, DOE scientists, having finished sequencing the radiation-resistant bacterium Deinococcus radiodurans (Fig. 3) (White et al., 1999), are now investigating whether the genome of D. radiodurans can be altered to increase its potential usefulness in cleaning up toxic waste around the globe (Melin et al., 2001). Through the use of biotechnological processes, scientists hope to transfer genes from other organisms that will enable the bacterium to degrade toxic chemicals, such as toluene, found in mixed chemical and radiation waste sites.

TIGR Comprehensive Microbial Resource: In addition to the links to microbial genomes (complete, annotated and sequencing in progress), TIGR provides a set of bioinformatics tools (similarity search and gene identification) that are geared specifically to microbial genomes.

Fig. 3. Radiation resistance in D. radiodurans (curves for D. radiodurans and E. coli; horizontal axis: radiation (kGy), 0 to 10; vertical axis: logarithmic scale, 0.001 to 100)

c) Mutation Databases. A key aspect of research in genetic engineering is understanding how mutation (variation in the DNA sequence) affects different phenotypes (characteristics of the organism). In humans, many mutations are harmless, but some are not, and most diseases are associated with mutations. Moreover, inherited mutations in humans are mostly single nucleotide polymorphisms (SNPs), which occur every 100 to 300 bases. Because of the importance of identifying such mutations in the study of the genetic basis of diseases, ten of the major pharmaceutical companies in the world have come together in an unprecedented scientific collaboration to form the SNP consortium. The objective of the SNP consortium is to create a database of all known human SNPs. Two of the major gateways to human mutation databases are dbSNP and OMIM (Table 1). A complete catalogue of human mutation databases is available at http://www.uta.fi/laitokset/imt/bioinfo/BTKbase/database.html. Interestingly, the effect of mutations in crops is being actively studied by plant biotechnologists, but there is currently no public database cataloguing such mutations. While the creation of human mutation databases involves expensive and time-consuming screening, the effect of plant mutations can be studied through manipulation of plant genes. Biotechnology methods currently being used to study mutations in plants include:

• Chimeraplasty: Creates SNPs in plant genes (Zuo & Chua, 2000).

• Trait utility system: Creates random mutations in a large number of genes in fertile maize plants by inserting DNA that can jump in and out of genes, and the resulting plants are screened for interesting changes such as drought resistance or sweeter kernels (Gura, 2000).

• Activation Tagging: Generates wholesale mutations in plants by inserting DNA enhancers via a plant-cell infecting bacterium (Weigel et al., 2000).

3. Applications for Molecular Biology Databases

Biological databases often provide bioinformatics applications as part of the user interface. Three primary types of bioinformatics tools are commonly coupled with the databases: text-based database searching, similarity-based database searching, and visualization tools. The most frequently used of these tools are described below.

Text-based Database Searching

Most publicly available biological databases provide a text-based search interface that allows retrieval of entries that "match" user-specified words or phrases. Advanced search features typically include boolean searches (combining terms with and/or/not), wild cards, etc. The example given below demonstrates the capabilities and limitations of text-based searching in NCBI's PubMed database (Table 1).


Use Case Scenario: The goal is to search PubMed to determine possible functions of the yeast gene MDM2. A query on PubMed for "yeast AND MDM2" retrieved 39 citations. A quick scan shows that many of the articles also refer to the p53 tumor suppressor gene, and a careful analysis showed that MDM2 inhibits p53 and apoptosis by binding to it. This, however, is primarily in humans. To investigate other functions of MDM2 that are specific to yeast, a refined search adding "NOT p53" returned 8 articles, one of which discusses the role of MDM2 in fatty acid metabolism. Thus, careful searches and sifting through the biological literature can help in discovering all the functions known to be associated with a gene.
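The boolean refinement in this scenario can be sketched on a small in-memory collection. This is only an illustration of AND/NOT filtering; it does not call PubMed, the citation titles are invented, and the helper names `tokens` and `search` are assumptions made for the example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-cased word tokens of a free-text field."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records, must_have=(), must_not_have=()):
    """Return records containing every `must_have` term and no `must_not_have` term."""
    hits = []
    for rec in records:
        words = tokens(rec["title"])
        has_all = all(term.lower() in words for term in must_have)
        has_excluded = any(term.lower() in words for term in must_not_have)
        if has_all and not has_excluded:
            hits.append(rec)
    return hits


# Hypothetical citation titles, invented for this illustration only.
records = [
    {"id": 1, "title": "MDM2 binds p53 and inhibits apoptosis in human cells"},
    {"id": 2, "title": "A yeast two-hybrid screen for MDM2 and p53 interaction"},
    {"id": 3, "title": "Yeast MDM2 and fatty acid metabolism"},
    {"id": 4, "title": "Yeast cell cycle regulation"},
]

broad = search(records, must_have=["yeast", "MDM2"])
refined = search(records, must_have=["yeast", "MDM2"], must_not_have=["p53"])
print([r["id"] for r in broad])    # [2, 3]
print([r["id"] for r in refined])  # [3]
```

Record 2 shows why the refinement matters: it mentions yeast only as an experimental system, so the broad query retrieves it even though it says nothing about the function of the yeast gene.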

The quality of the results of a text-based search depends on the quality of the contents of the database. Hence, both when conducting a text-based search and, even more importantly, when designing biological databases and enabling text searches in newly created databases, it is important to keep the following in mind:

• If the database contains free-form text, spelling errors can exclude relevant entries. Inconsistencies may also result in relevant entries being excluded (for example, IL-2 and IL2 are both used to refer to Interleukin-2).

• Problems with free-form text can be avoided by the use of keywords. However, author-supplied keywords can also be arbitrary and inconsistent.

• The best solution is the use of a controlled vocabulary. For example, PubMed uses an extensive controlled vocabulary called MeSH (medical subject headings). However, it is important to understand the organization and hierarchy of such a controlled vocabulary if it is used in searching.
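The IL-2 versus IL2 problem and the controlled-vocabulary remedy can be sketched as follows. The tiny vocabulary, the normalisation rule and the function names are assumptions made for this illustration; they are not taken from MeSH or from any existing database.

```python
import re

# Hypothetical mini-vocabulary: preferred term -> variants seen in free text.
VOCABULARY = {
    "interleukin-2": ["IL-2", "IL2", "IL 2", "interleukin 2", "interleukin-2"],
}


def normalise(term: str) -> str:
    """Collapse case, spaces and hyphens so that 'IL-2' and 'il2' compare equal."""
    return re.sub(r"[\s\-]+", "", term.lower())


# Reverse index from normalised variant to preferred term.
LOOKUP = {
    normalise(variant): preferred
    for preferred, variants in VOCABULARY.items()
    for variant in variants
}


def to_preferred(term: str) -> str:
    """Map a free-text term to its preferred term, or return it unchanged."""
    return LOOKUP.get(normalise(term), term)


print(to_preferred("IL-2"))  # interleukin-2
print(to_preferred("il2"))   # interleukin-2
print(to_preferred("IL-4"))  # IL-4 (not in the toy vocabulary)
```

Indexing entries under the preferred term at load time, and mapping query terms the same way, means that both spellings retrieve the same entries.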

Similarity-based Database Searching: BLAST and FASTA

With large-scale genome sequencing projects, the flood of DNA sequence data coming into public databases is staggering. Researchers increasingly rely on inferring the function of putative genes through similarity to well-characterized proteins. It is important to realize that an in-silico sequence similarity search needs to be designed as carefully as a wet lab experiment in order to get biologically meaningful results. In this paper, some of the issues in using the most popular sequence similarity search tools are reviewed (Durbin et al., 1999; Gusfield, 1997).

Sequence similarity searches use alignments to determine a "match". An alignment of two sequences is a matching of the two sequences that allows for the most common mutations: insertions, deletions, and single-character substitutions. The basic considerations in using a sequence-similarity search are:

Global vs. Local Alignment. Global alignment forces complete alignment of the input sequences, whereas local alignment aligns only the most similar segments. The choice between global and local alignment depends on whether the user assumes that the sequences are related over their entire length or that they share only isolated regions of homology. As similarities usually span segments rather than entire sequences, local alignment is the most popular choice for database similarity searches. See Fig. 4 for a sample output of global alignment (a) and local alignment (b).
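The difference between the two modes can be made concrete with a small dynamic-programming sketch that computes only the best alignment score. The scoring values (match +1, mismatch -1, gap -2), the linear gap penalty and the function name `alignment_score` are assumptions chosen for illustration; production tools use substitution matrices and more refined gap models.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of `a` and `b` under a simple linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised, so both sequences are aligned end to end.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # local: a poorly scoring prefix is discarded
                best = max(best, cell)   # and the alignment may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two toy sequences that share only the segment ACGT.
x, y = "TTTTACGTTTTT", "GGACGTGG"
print(alignment_score(x, y, local=True))   # 4: the shared segment alone
print(alignment_score(x, y, local=False))  # -8: the flanking regions are forced into the alignment
```

The local score reflects only the shared segment, while the global score is dragged down by the unrelated flanks, which is why local alignment suits sequences that share isolated regions of homology.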

Fig. 4: Global (a) and local (b) alignments. Panel (a) aligns sequences P00001 and P00090; panel (b) aligns sequences P13569 and P33593.

Alignment Algorithms. There are a variety of relatively efficient alignment algorithms, each of which aims to determine the optimal alignment. The first of these to be described in the biological literature was the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970) for optimal global alignment, followed by a slight variant, the Smith-Waterman algorithm (Smith & Waterman, 1981) for optimal local alignment of two sequences. These two methods were developed prior to wholesale genome sequencing. Today, the special purpose parallel machines and the massive computation time that these algorithms require for database-scale searches have rendered them almost obsolete for that purpose. Most users prefer BLAST (Altschul et al, 1990) (http://www.ncbi.nlm.nih.gov/BLAST) or FASTA (Pearson & Lipman, 1988) (http://www.ebi.ac.uk/fasta33), which rely on heuristic strategies to speed up alignment searches. Promising regions are first determined through rapid exact match searches, and only then is Smith-Waterman invoked. This approach permits FASTA or BLAST to run 10 to 100 times faster than conventional Smith-Waterman, at the cost of missing a few alignments. Some of the adjustable parameters described below give the user the flexibility to trade off speed against accuracy. BLAST, in general, tends to be faster and more sensitive (it detects more alignments), but FASTA returns fewer false hits.
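The idea of locating promising regions through rapid exact match searches can be sketched as follows. This is a deliberately simplified illustration and not the actual BLAST or FASTA implementation: the word length `k`, the `min_hits` cutoff, and both function names are assumptions made for the example.

```python
from collections import defaultdict


def word_hits(query, subject, k=3):
    """Map each diagonal (subject offset minus query offset) to the
    query positions where an exact k-letter word match occurs."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    diagonals = defaultdict(list)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[j - i].append(i)
    return dict(diagonals)


def promising_diagonals(query, subject, k=3, min_hits=2):
    """Return diagonals with enough word hits to justify running a
    full (and much slower) dynamic programming alignment on them."""
    hits = word_hits(query, subject, k)
    return sorted(d for d, positions in hits.items() if len(positions) >= min_hits)
```

Only the regions around the returned diagonals would then be passed to a rigorous alignment step, which is where the speed-up over an exhaustive search comes from, and also why a few true alignments can be missed.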

Search Parameters. The effectiveness of an alignment algorithm depends on its parameters: a careful choice is necessary, without which important alignments may be overlooked or too many spurious alignments may be returned. There are three sets of parameters that can be specified by the user to control the results: the alignment parameters, algorithmic parameters, and output parameters.

=> The alignment parameters include the choice of substitution matrix and the costs associated with gaps. The substitution matrix gives the cost associated with substituting one residue for another in aligning two protein sequences. The most popular substitution matrices are the PAM (Schwartz & Dayhoff, 1978) and BLOSUM (Henikoff & Henikoff, 1992) families of matrices. The gap cost parameters involve a cost associated with opening a gap and a lesser cost associated with extending a gap.

=> The different algorithmic parameters mostly control the heuristics on which BLAST and FASTA rely and hence allow the user to control the speed and accuracy of their alignments. We refer the reader to the online manuals available at the BLAST and FASTA sites referred to above for a detailed description of these algorithmic parameters.

=> The output parameters include a threshold for the E-score and the desired number of matches. The E-score is a measure of the statistical significance of an alignment: it combines the raw score of the alignment, the length of the query sequence, and the size of the database. The E-score gives the expected number of sequences in the database that would align with at least the given raw score by chance, so smaller E-scores indicate more significant matches. Typically, in a database the size of Genbank or Swiss-Prot, one expects 5 to 10 sequences to match at random, and thus alignments with E-scores greater than 5 (or 10) are ignored. However, in smaller databases, such as the PDB, it is important to use a smaller E-score threshold.
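As a rough illustration of how two of these parameters behave, the sketch below computes an affine gap cost and an E-score using the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length, n the database size and S the raw score. The default values of `lam` and `k`, the gap costs, and the convention that a gap of length L costs one opening plus L - 1 extensions are all assumptions chosen for the example; real values depend on the scoring system and the program in use.

```python
import math


def affine_gap_cost(length, open_cost=11, extend_cost=1):
    """Cost of a gap of the given length: one opening cost plus a
    smaller cost for each additional position (assumed convention)."""
    if length <= 0:
        return 0
    return open_cost + extend_cost * (length - 1)


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring at least raw_score.

    lam and k are illustrative placeholder values; they must match
    the substitution matrix and gap costs actually used.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

The formula makes the dependence on database size explicit: the same raw score yields a proportionally larger E-score in a larger database, which is why a threshold suited to one database does not transfer directly to another.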

Various forms of BLAST and FASTA are used in alignment of different types of biological sequences, some of which are listed in Table 2.
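The "translated" searches in Table 2 compare a nucleotide sequence at the protein level by first translating it in all six reading frames. A minimal sketch of that translation step is given below, assuming the standard genetic code; the function names are hypothetical and the code is not taken from BLAST or FASTA.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]
```

Each of the six protein strings can then be compared against a protein database, which is how a new DNA sequence can reveal potential coding regions without prior knowledge of its reading frame.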

Pairwise sequence alignment using BLAST and FASTA has been extended in two ways: (1) multiple alignment of DNA or protein sequences and (2) structural alignment for determination of protein structural neighbours, where the extension of BLAST to 3-dimensional coordinates is called VAST (Gibrat et al, 1996). For multiple alignments, the most common method, Clustal W (Gibson et al, 1994), creates a multiple alignment of DNA or protein sequences, starting with BLAST or FASTA pairwise alignment scores. A more detailed discussion of these topics is beyond the scope of this paper and we refer the readers to Durbin et al (1999).

Table 2: BLAST and FASTA variants for different searches

Program             Input sequence             Comparison database        Common use
BLASTn / FASTA      Nucleotide                 Nucleotide                 Align a new DNA sequence to a nucleotide sequence database.
BLASTp / FASTA      Protein                    Protein                    Align an amino acid query sequence to a protein sequence database.
BLASTx / FASTx      Nucleotide (translated)    Protein                    Analyze a new DNA sequence (translated) to find potential coding regions.
TBLASTn / tFASTx    Protein                    Nucleotide (translated)    Useful for EST analysis.
TBLASTx / tFASTx    Nucleotide (translated)    Nucleotide (translated)    Useful for EST analysis.

Protein Structure Visualization: RasMol and Kinemage

Most protein structure databases today come equipped with visualization tools, the most frequently used being the freely available RasMol (Sayle & Milner-White, 1995). The name "RasMol" is derived from Raster (the array of pixels on a computer screen) and Molecules. The fact that the initials of the author of RasMol are R.A.S. is probably only coincidental (Fig. 5).

Fig. 5: (a) RasMol [Phage CRO Repressor on DNA. Andrew Coulson & Roger Sayle with RasMol, University of Edinburgh, 1993] and (b) Kinemage images of protein structure

• RasMol. This is a molecular graphics program that allows visualization of proteins, nucleic acids, and small molecules for which a 3-dimensional structure is available. In order to display a molecule, RasMol requires an atomic coordinate file that specifies the position of every atom in the molecule through its 3-dimensional cartesian coordinates. RasMol accepts this coordinate file in a variety of formats, including the Protein Data Bank (PDB) format. The visualization provides the user with a choice of colour schemes and molecular representations [wireframe, cylinder (Dreiding) stick bonds, alpha-carbon trace, space filling (CPK) spheres, macro-molecular ribbons (either smooth shaded solid ribbons or parallel strands), hydrogen bonding and dot surface]. Additional features such as text labeling for selected atoms, different colour schemes for different parts of the molecule, zoom, rotation, etc. have made this the most popular of all visualization tools.

• Chime and Protein Explorer are derivatives of RasMol that allow visualization inside web browsers, while RasMol runs independently outside a web browser.

• Kinemage. One of the drawbacks of RasMol was that it did not allow the user to move two molecules, or parts of a molecular complex, relative to each other. For example, RasMol cannot show the binding of a substrate to, or its release from, an enzyme. This drawback was addressed when kinemage (kinetic images) was developed (Richardson & Richardson, 1989). To quote the authors, "Kinemages are set up to illustrate a particular idea about a three-dimensional object, rather than neutrally displaying that object; they incorporate the author's selection, emphasis, and viewpoint..."
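To show what an atomic coordinate file contains, the following minimal sketch reads the fixed-column ATOM and HETATM records of a PDB-format file and extracts the cartesian coordinates that a viewer such as RasMol needs. The `Atom` type and the function names are hypothetical, and only a handful of the fields defined by the format are read.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atoms(lines):
    """Extract atoms from PDB-format lines using the fixed column layout."""
    atoms = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, remarks and other record types
        atoms.append(Atom(
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            residue=line[17:20].strip(),
            chain=line[21].strip(),
            residue_number=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms


def centroid(atoms):
    """Geometric centre of the atoms, e.g. for centring a view."""
    n = len(atoms)
    return (sum(a.x for a in atoms) / n,
            sum(a.y for a in atoms) / n,
            sum(a.z for a in atoms) / n)
```

Representations such as the alpha-carbon trace then amount to filtering this list, for example keeping only atoms whose name is "CA", before drawing.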

4. Integrated Molecular Biology Database Systems

A database is a repository that provides a centralized and homogeneous view of its contents. The repository is created and modified through a database management system (DBMS). Every data item in the database is structured according to a schema, defined as a set of pre-specified rules through the data definition language. The contents of the database can typically be accessed through a graphical user interface (GUI) that allows browsing through the contents of the repository, much as one may browse through the books in a library. Most databases also allow querying of their contents through a specialized query language. The data definition language and the query language form the data model and define the semantics of the manipulations and operations allowed on the database.
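The roles of the data definition language and the query language can be illustrated with a small relational example. The sketch below uses Python's built-in sqlite3 module; the `protein` table, its columns, and the entries in it are invented for illustration and do not correspond to the schema of any database discussed in this paper.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database whose schema is declared in SQL DDL."""
    conn = sqlite3.connect(":memory:")
    # Data definition language: the schema every entry must obey.
    conn.executescript("""
        CREATE TABLE protein (
            accession TEXT PRIMARY KEY,
            name      TEXT NOT NULL,
            organism  TEXT NOT NULL,
            length    INTEGER NOT NULL CHECK (length > 0)
        );
    """)
    # Hypothetical entries, not real database records.
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?, ?)",
        [
            ("X0001", "demo kinase", "Homo sapiens", 350),
            ("X0002", "demo protease", "Mus musculus", 120),
            ("X0003", "demo transporter", "Homo sapiens", 610),
        ],
    )
    return conn


def proteins_longer_than(conn, min_length):
    """Query language: retrieve accessions by a condition on the data."""
    rows = conn.execute(
        "SELECT accession FROM protein WHERE length > ? ORDER BY accession",
        (min_length,),
    )
    return [row[0] for row in rows]
```

An entry that violates the schema, for instance one with a non-positive length, is rejected by the DBMS, which is what gives the repository its homogeneous view.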

For example, the schemas of the Genome Sequence Data Base (GSDB) and the Mouse Genome Database (MGD) are defined using the data definition language of the Sybase relational DBMS; the structure of the Arabidopsis thaliana database (AtDB), as well as of numerous other genomic databases, is defined using AceDB (Durbin & Thierry-Mieg, 1992); and the structures of the Genome Database (GDB) and the Protein Data Bank (PDB) are defined using an object protocol model (OPM) (Chen & Markowitz, 1995) that allows storage of images on top of a relational database management system. For molecular biology databanks maintained as files, the data definition languages used for defining their structure are not based on a data model per se and range from generic notations, such as the ASN.1 data exchange format used for Genbank, to ad hoc data definition languages, such as that employed for EMBL.

Comprehensive studies of molecular biology data involve exploring multiple databases. Rather than requiring the user to combine information retrieved from multiple databases, it is preferable to provide an integration of the databases. The particular challenges of integrating biological data sources (as compared to heterogeneous data sources from other domains) have been discussed (Davidson et al, 1995; Markowitz & Ritter, 1995; Karp, 1995). The main hurdle in the integration of multiple biological databases is their inherent heterogeneity, which is caused by:

1 Heterogeneity of Content: Different databases are used to store a variety of information; for example, protein sequence information is available through Swiss-Prot while protein structure information is available through the Protein Data Bank (PDB).

2 Heterogeneity of Database Management System: Different data types require different database management systems. For example, table-structured data can be easily stored in relational database management systems (such as those of Oracle), text data as in Genbank is not amenable to storage in relational DBMSs, while images require object database management systems such as that of OPM (Chen & Markowitz, 1995).

3 Heterogeneity of Data Model: The data model (the schema and the query language) of heterogeneous systems will also vary. For example, relational systems mostly use SQL (structured query language), while special query languages such as OQL (object query language) need to be used for object databases like the PDB.

In integrating molecular biology databases, the following issues need to be addressed:

Integration of Data

==> A basic problem underlying the integration of heterogeneous databases is the autonomy of the sources, which has led to a lack of cooperation and non-standardization of formats, with some notable exceptions. For example, Genbank, EMBL and the DNA Data Bank of Japan (DDBJ) cooperate in creating a centralized repository for the human genome sequence data, and data is exchanged daily for the purpose of synchronization. A cooperation/co-licensing agreement is the first step in the creation of integrated systems.

==> As data is exchanged between heterogeneous systems, schema converters are required (which convert data from one schema to either the schema of another database or a global schema). General schema integration methodologies have been discussed (Batini et al, 1986) and further evaluated in the context of biological databases by Buneman et al (1995).

==> Data conflicts and errors need to be resolved in a systematic manner.
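A schema converter and a systematic conflict check can be sketched in a few lines. Everything named below is hypothetical: the two source layouts (`source_a`, `source_b`), their field names, and the three-field global schema are assumptions made only to illustrate the idea.

```python
GLOBAL_FIELDS = ("accession", "organism", "sequence")

# Hypothetical field names used by two autonomous sources.
FIELD_MAPS = {
    "source_a": {"acc": "accession", "org": "organism", "seq": "sequence"},
    "source_b": {"ID": "accession", "OS": "organism", "SQ": "sequence"},
}


def to_global(source, record):
    """Convert one record from a source schema into the global schema."""
    mapping = FIELD_MAPS[source]
    converted = {mapping[key]: value
                 for key, value in record.items() if key in mapping}
    missing = [field for field in GLOBAL_FIELDS if field not in converted]
    if missing:
        raise ValueError(f"record from {source} lacks fields: {missing}")
    # Normalize formatting so that equal sequences compare equal.
    converted["sequence"] = converted["sequence"].replace(" ", "").upper()
    return converted


def find_conflicts(records):
    """Return accessions whose converted records disagree with each other."""
    first_seen = {}
    conflicts = set()
    for record in records:
        previous = first_seen.setdefault(record["accession"], record)
        if previous != record:
            conflicts.add(record["accession"])
    return sorted(conflicts)
```

Once every source is mapped into the same layout, conflicting entries can be flagged for curation instead of being silently overwritten.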

Integration of User Interfaces

==> Browsing Interface: A unified view of the data is presented in one of two possible ways (Markowitz & Ritter, 1995): (a) a global schema is created by unifying the schemas of the component databases; or (b) local views of the data are provided, which use the "local" schemas of the component databases.

==> Query Interface: Each of the component databases may support different types of queries (e.g. free-text search, keyword search, accession number search, etc.). Integrated querying of multiple databases is still the holy grail of heterogeneous database management systems today. The main hurdles to integrated querying remain: (1) rewriting the queries and query rules in the integrated system as operations on the source databases; (2) the ability to implement joins across heterogeneous items from the source databases; (3) the ability to identify redundancy; and (4) the ability to exploit the parallelism of the different servers (computers) in use for the source databases.
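Hurdles (2) and (4) can be illustrated with a toy integrated query that contacts two sources concurrently and joins their answers on a shared accession. The two in-memory dictionaries stand in for remote sequence and structure databases; their contents and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two remote source databases.
SEQUENCE_SOURCE = {"X0001": "MKTAYIAK", "X0002": "MSLLTEVE"}
STRUCTURE_SOURCE = {"X0001": ["S100"], "X0003": ["S200"]}


def query_sequences(accessions):
    """Rewritten sub-query for the sequence source."""
    return {a: SEQUENCE_SOURCE[a] for a in accessions if a in SEQUENCE_SOURCE}


def query_structures(accessions):
    """Rewritten sub-query for the structure source."""
    return {a: STRUCTURE_SOURCE[a] for a in accessions if a in STRUCTURE_SOURCE}


def integrated_query(accessions):
    """Run both sub-queries in parallel and join them on the accession."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        seq_future = pool.submit(query_sequences, accessions)
        struct_future = pool.submit(query_structures, accessions)
        sequences = seq_future.result()
        structures = struct_future.result()
    return {
        accession: {"sequence": sequences[accession],
                    "structures": structures[accession]}
        for accession in sequences.keys() & structures.keys()
    }
```

In a real federation the join key is rarely this clean: the sources may identify the same protein by different accessions, which is where the heterogeneity described above becomes the dominant cost.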

Integration of Visualization and Analytical Tools

==> With integration comes the advantage that bioinformatics applications (analytical and visualization tools) can be developed on a common data model.

There are currently three types of integration strategies being used for molecular biology databases: data warehousing, link-driven federation, and semantic integration. These strategies, along with the integrated molecular biology database systems that have resulted from these efforts, are reviewed below.

Data Warehousing

Data warehouses represent the materialization of a global schema, i.e. the integrated database is defined by the global schema and loaded with data from the component databases. The steps involved in creating a data warehouse are:

• Downloading of data from the component databases

• Data cleaning (removal of erroneous entries and resolving data conflicts)

• Reformatting the data into the global schema

A data warehouse is often confused with a consolidation of multiple databases. In consolidating multiple databases, the component databases are subsumed into a larger database and the individual component databases are discarded, whereas in data warehousing, the component databases are not disturbed. Consolidation is far more complex and expensive, requiring consensus on common names, data structures, and policies. Furthermore, existing applications on component databases must be converted in order to function on the consolidated system. The relative advantages and disadvantages of data warehousing systems are listed in Table 3. Current data warehousing systems used in molecular biology databases are:

Table 3: Data Warehousing

Advantages:

• Downloaded source data can be manipulated into suitable formats

• Global schema allows browsing of data through a unified view

• Execution of queries is usually very fast because all data is locally available

• System is reliable because there is no outside dependency

Disadvantages:

• High maintenance cost as data needs to be constantly synchronized with component databases

• Large initial costs associated with setup and schema development

• Storage requirements add to cost

• System does not scale easily: it is not easy to add new databases

GUS (Genomics Unified Schema). The Genomics Unified Schema (GUS) (Davidson et al, 2001) is a warehousing-based data integration system from the University of Pennsylvania. GUS uses a relational data model and stores nucleotide and amino acid sequences and their annotations in tables. The data sources already included are GenBank/EMBL/DDBJ, dbEST and SWISS-PROT.

GUS builds and maintains a map between DNA sequence-based entries at some sites and gene-based entries at others through its local storage of the necessary data. Its tables hold the conceptual entities that DNA sequences and annotations indirectly represent: genes, the RNAs derived from these genes, and the proteins derived from those RNAs. While transforming the data into a gene-centric organization, it cleans the data by identifying erroneous annotations and misidentified sequences.

GUS facilitates data maintenance by tracking how data is generated or accessed from sources and subsequently modified. This helps in following the continuous changes (external or internal) to the knowledge of genes that the data items represent: the original data may be revised by the source itself, annotations are gradually verified experimentally, and predicted values become more accurate with better algorithms. For example, with computationally derived annotations, GUS stores the algorithm used, its implementation version, input parameters and the run time information.

To keep its database synchronized with external data sources after the initial download, GUS retrieves updates and new entries from them, either according to the source's change schedule or periodically. The changed fields of the modified entries are detected from the database based on a difference operation, and updated accordingly. Both the new and updated entries are subjected to an annotation update process in which the protein and DNA sequences and their annotations are transformed into gene- and protein-based entries. The user always sees a production version of the database while updates are made to the next development version. When the development cycle is over, the development version is put into production, and a new development version is created.
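The difference operation used for synchronization can be pictured with the following minimal sketch. It is not the GUS implementation; the dictionary-based record layout and both function names are assumptions made for illustration.

```python
def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for fields that differ."""
    fields = old.keys() | new.keys()
    return {field: (old.get(field), new.get(field))
            for field in fields
            if old.get(field) != new.get(field)}


def plan_update(local, incoming):
    """Classify incoming entries against the locally stored copies.

    Returns which accessions are new, which are modified (with the
    changed fields), and which can be skipped as unchanged.
    """
    plan = {"new": [], "modified": {}, "unchanged": []}
    for accession, entry in incoming.items():
        if accession not in local:
            plan["new"].append(accession)
            continue
        diff = changed_fields(local[accession], entry)
        if diff:
            plan["modified"][accession] = diff
        else:
            plan["unchanged"].append(accession)
    return plan
```

Only the entries listed as new or modified would need to pass through the more expensive annotation update step, which keeps the periodic synchronization affordable.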

Link Driven Federation

The link driven federation approach has been used successfully, mainly by online molecular biology databases, to add value to their data. A link connects an object in one database to objects in possibly another database. Links can also be to objects in the same database (for example, the related documents links that appear with entries in the PubMed database). The links provided allow the user to start from a data item of interest in a particular database and then jump to other related data sources through the links. The user still has to interact with the individual sources, but the interaction is made easier by convenient links, and the individual source databases need not be invoked or queried directly. The most widely used integrated molecular biology systems, Entrez, SRS (Etzold & Argos, 1993), and LinkDB (Fujibuchi et al, 1998), are examples of this approach (Table 4).
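Link-driven navigation can be modelled as following edges in a graph whose nodes are (database, entry) pairs. The sketch below is a toy model under that assumption; the link table, the database names and the entry identifiers are invented and do not reflect the internals of Entrez, SRS or LinkDB.

```python
from collections import deque

# Hypothetical cross-references between entries of different databases.
LINKS = {
    ("nucleotide", "N1"): [("protein", "P1"), ("literature", "L1")],
    ("protein", "P1"): [("structure", "S1"), ("literature", "L1")],
    ("literature", "L1"): [("literature", "L2")],
}


def linked_entries(start, target_db, max_hops=2):
    """Return entries of target_db reachable from start by following
    at most max_hops links (breadth-first traversal)."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        entry, hops = queue.popleft()
        if entry != start and entry[0] == target_db:
            found.append(entry[1])
        if hops == max_hops:
            continue
        for neighbour in LINKS.get(entry, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return sorted(found)
```

The model also makes the main weakness visible: every link has to exist in the table beforehand, so adding a new database means generating links for each existing entry.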

Table 4: Link-Driven Federation

Advantages:

• Point-and-click links make it convenient for the user to see related sources

• Links to a variety of information for each entry in a database

Disadvantages:

• Manual link creation is difficult for large databases

• Does not scale easily: for each entry in a database, it is difficult to generate all links to a new database

• Changes in source database schema may result in links becoming obsolete

Entrez. The National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health, USA, is the foremost repository of publicly available genomic and proteomic data. Its integrated information retrieval system, known as Entrez, is perhaps the most utilized of all biological database systems. Entrez uses a link-based approach to cross-reference entries from different databases. These include the nucleotide sequence database Genbank, the medical literature database PubMed, NCBI's protein sequence database, NCBI's protein structure database and NCBI's database of whole genomes (Fig. 6). Hard links are applied between the different databases whenever there is a logical connection between entries. LinkOut from PubMed citations also provides links to user-defined external web pages (for example, full text journal articles, biological data, sequence centres, etc.). These external resources provide a URL, a resource name, and a brief description of their web site, which PubMed uses to create the links to their sites. For a complete review of the features and complexities of Entrez, readers may refer to a tutorial on the Entrez system at http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/helpdoc.html.

SRS. The creators of the Swiss-Prot database at the Swiss Institute of Bioinformatics and the European Bioinformatics Institute have created SRS (Sequence Retrieval System). SRS allows retrieval from an extensive catalogue of more than 75 public biological databases of interest. The link button in SRS allows the user to obtain all the entries in one databank that are linked to an entry (or entries) in another databank. Hyperlinks are links between entries, which are displayed as hypertext (clicking on the text takes you to the related entry). These are hard coded into SRS and are useful for examining entries that are referenced directly from a data item of interest.

To see what data is linked to an entry, tick the checkbox next to that entry and click the link button; this displays the LINK page (Fig. 7). After selecting the database to be linked with the entry (say Swiss-Prot), the user clicks the submit link button. The result will be a list of all the SWISS-PROT entries that are related to the EMBL entries with which we started. These will be displayed on the Query Result page.

Fig. 6: Entrez map

Fig. 7: SRS link page

LinkDB. The integrated database retrieval system DBGET/LinkDB (Fujibuchi et al, 1998) is the backbone of the Japanese GenomeNet service. DBGET is used to search and extract entries from a wide range of molecular biology databases (Fig. 8), while LinkDB is used to search and compute links between entries in different databases. Once an entry is retrieved through DBGET, all links from this entry can be obtained by clicking on the entry name, which triggers a search against LinkDB. In addition to the original links provided by the source database, which are embedded in the entry, LinkDB also aims at providing computer-generated links, which include:

• Factual Links: links between database entries, e.g., Medline ID and GenBank accession

• Similarity Links: links produced by similarity search, e.g., the results of BLAST and FASTA

• Biological Links: links by biological meaning, e.g., molecular or genetic interactions in the KEGG pathways.

Fig. 8: DBGET database links

Semantic Integration

The ultimate aim of the LinkDB system is to add to its current hyperlinks using biological meanings and relations. For example, known protein-protein interactions should be represented through a bi-directional hyperlink between the two protein sequence entries. However, this kind of semantic integration is still in its infancy and is the focus of much bioinformatics research. Efforts on semantic integration worth mentioning are the development of XML ontologies and some early work on TAMBIS.

• Ontology: An ontology is a set of concepts (objects, events, and relations) that are specified in order to create an agreed-upon vocabulary for exchanging information. Specification of an ontology involves (1) determining which concepts are to be included in the ontology; (2) assigning English language meaning to the terms; and (3) defining all possible relations between the concepts in the ontology.

• XML: XML (eXtensible Markup Language), an improvement on HTML (the language used in creation of web pages), has emerged as the de facto format for exchanging data in the last few years, with abundant development tools and backing from major vendors. XML is ideal for format integration because it can handle semi-structured data (unlike the rigid data structures of relational database systems). The wide use of XML is easily seen in the variety of standards being proposed, which are shown in Table 5.

XML Ontologies. In a discipline-wide effort to standardize the representation of entries in biological databases, various organizations and institutions are participating in the creation of XML ontologies.

TAMBIS. Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) is an integration system for molecular biology where a domain-dependent ontology is used for information retrieval. The ontology is central to the system and plays a role in query formulation and execution (Goble et al, 2001).

Table 5: XML Standards in Biology

XML Standard: Description

AnatML: A language for storing geometric information and documentation obtained as part of the musculoskeletal modelling project.

Array XML (AXML): For exchanging and storing data from microarray experiments.

Bioinformatic Sequence Markup Language (BSML): A public domain standard for encoding and display of DNA, RNA and protein sequence information.

Biopolymer Markup Language (BIOML): The BIOML standard allows the full specification of all experimental information known about molecular entities composed of biopolymers (for example, proteins and genes).

Genome Annotation Markup Elements (GAME): The goals of GAME, at least in the perspective of the bioxml community, are to provide an XML ontology and tools for annotating biosequence "annotation features".

CellML: For storage and exchange of computer-based biological models.

Clinical Trial Data Model: FDA safety domain and metadata models with an XML ontology for clinical data.

Gene Expression Markup Language (GEML): An open-standard XML format for microarray and gene expression data.

GeneOntology Markup Language: The Gene Ontology(tm) Consortium is attempting to produce a dynamic controlled vocabulary that can be applied to all eukaryotes, even as we gain more knowledge of gene and protein roles in cells.

GeneX Gene Expression Markup Language (GeneXML): Part of the GeneX project, a massively distributed gene expression database.

Molecular Dynamics Markup Language (MoDL): MoDL (pronounced "Model") is an XML application that allows chemical simulation data visualization over the Web.

Systems Biology Markup Language (SBML): Represents models of biological systems common in research on a number of topics, including cell signaling pathways, metabolic pathways, biochemical reactions, etc.

Taxonomic Markup Language (TaxonomicML): TaxonomicML seeks to standardize (1) the description of the structure (topology) of a biological phylogeny; (2) the presentation of statistical metadata about the phylogeny; and (3) the option of superimposing a Linnaean taxonomy.

XML Description Language for Taxonomy (XDELTA): XML file format derived from the DELTA (DEscription Language for TAxonomy) standard.
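As a small illustration of why XML suits semi-structured biological records, the sketch below parses an entry in which some elements are optional and others repeat. The element and attribute names are invented for this example and do not follow any particular XML standard; parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: <function> is absent and <feature> repeats.
RECORD = """<entry accession="X0001">
  <name>demo protein</name>
  <organism>Homo sapiens</organism>
  <feature type="domain" start="10" end="85"/>
  <feature type="signal" start="1" end="9"/>
</entry>"""


def summarize(xml_text):
    """Extract a flat summary, tolerating missing optional elements."""
    root = ET.fromstring(xml_text)
    return {
        "accession": root.get("accession"),
        "name": root.findtext("name"),
        # An optional element simply yields None when it is absent.
        "function": root.findtext("function"),
        "features": [
            (f.get("type"), int(f.get("start")), int(f.get("end")))
            for f in root.findall("feature")
        ],
    }
```

A relational table would need nullable columns and a separate feature table to hold the same entry, whereas the XML record carries its own structure and can be exchanged as a single document.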

Discussion

The beginning of the new millennium coincided with the dawn of a new era in biology. The next century is expected to see remarkable discoveries in the field of biology, and biotechnology is all set to play a major role in revolutionizing medical treatment and solving many of life's mysteries. Our interaction with the environment will also be guided by biotechnology. This review paper intends to provide only a flavour of the scope and resources available today. We are still at the beginning of using genomics and proteomics in biotechnology. Integration of data (by the biologist or by the computer scientist in creating access to heterogeneous systems) will be one more important step in furthering biotechnology research. Current research in bioinformatics, as well as software development efforts in bioinformatics, as it pertains to molecular biology databases, is focused on integration.

The key to biotechnology discoveries is locked in the genomes of organisms and bioinformatics holds the key to unlock this data for the next generation of innovations.

References: Altschul S F, Gish W, Miller W, Myers E W & Lipman D J, 1990.

Basic local alignment search tool. J Mol Bioi, 215, 403-410. Arabidopsis Genome Initiative. 2000. Analysis of the genome

sequence of the flowering plant Arabidopsis thaliana. Nature (Lond), 408, 796-815.

Bankier A T, Weston K M & Barrel B G, 1987. Random cloning and sequencing by the M13/dedeoxynucleotide chain termination method. Methods Enzymol, 155, 51-93.

Batini C, Lenzerini M & Navathe S, ] 986. A comparative analysis of methodologies for database schema integration. ACM Comput Surv, 18. 323-364.

Buneman P, Davidson S, Hart K, Overton C & Wong L, 1995. A data transformation system for biological data sources. in Proceedings of the 21st International Conference on Very Large Databases. Pp 158-169.

Baxevanis A D & Francis Ouellette B F (Eds), 1998. Bioinformatics : A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, New York.

Chen I A & Markowitz V M, 1995. An overview of the object­protocol model (OPM) and OPM data management tools. In! Syst, 20, 393-4]8.

Davidson S, Overton C & Buneman P, 1995. Challenges in integrating biological data sources. J Computational Bioi, 2, 557-572.

Davidson S, Crabtree J, Brunk B, Schug J, Tannen V, Overton C & Stoeckert C, 2001. K2IKleisli and GUS: Experiments in integrated access to genomic data sources. IBM Syst J, 40. 5]2-531.

Durbin R & Thierry-Mieg J, 1992. Syntactic definition for the Ace DB database manager. Available at http://probe.nalusda.gov:8000/acedocs.

Durbin R, Krogh A, Mitchison G & Eddy S, 1999. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge. UK.

Etzold T & Argos P, 1993. SRS: An indexing and retrieval tool for flat file data libraries. Comput Appl Biosci, 9, 49-57.

Fujibuchi W, Goto S, Migimatsu H, Uchiyama I, Ogiwara A, Akiyama Y & Kanehisa M, 1998. DBGET/LinkDB: an integrated database retrieval system. in Pacific Symposium on Biocomputing. Pp 683-694.

Gibas C & Jambeck P, 2001. Developing Bioinformatics Computer Skills. O'Reilly & Associates, New York.

Gibrat J-F, Madej T & Bryant S H, 1996. Surprising similarities in structure comparison. Curr Opinion Struct Biol, 6, 377-385.

Gibson T J, Thompson J D & Higgins D G, 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22, 4673-4680.

Goble C, Stevens R, Ng G, Bechhofer S, Paton N, Baker P, Peim M & Brass A, 2001. Transparent access to multiple Bioinformatics information sources. IBM Syst J, 40, 532-551.

Gura T, 2000. Reaping the plant gene harvest. Science, 287, 412-414.

Gusfield D, 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.

Henikoff S & Henikoff J G, 1992. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA, 89, 10915-10919.

Karp P, 1995. A strategy for database interoperation. J Computational Biol, 2, 573-583.

Markowitz V M & Ritter O, 1995. Characterizing heterogeneous molecular biology database systems. J Computational Biol, 2, 547-556.

Melin A M, Perromat A & Deleris G, 2001. Sensitivity of Deinococcus radiodurans to gamma-irradiation: A novel approach by Fourier transform infrared spectroscopy. Arch Biochem Biophys, 394, 265-274.

Misener S & Krawetz S A, 2000. Bioinformatics: Methods and Protocols. Humana Press, New Jersey.

Mount D, 2001. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.

Needleman S B & Wunsch C D, 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J Mol Biol, 48, 443-453.

Pearson W R & Lipman D J, 1988. Improved tools for biological sequence analysis. Proc Natl Acad Sci USA, 85, 2444-2448.

Rashidi H H & Buehler L K, 1999. Bioinformatics Basics: Applications in Biological Science and Medicine. CRC Press, Boca Raton, Florida.

Richardson J S & Richardson D C, 1989. Principles and patterns of protein conformation. in Prediction of Protein Structure and the Principles of Protein Conformation, edited by G D Fasman. Plenum Press, New York. Pp 1-98.

Sayle R A & Milner-White E J, 1995. RasMol: Biomolecular graphics for all. Trends Biochem Sci, 20, 374-376.

Schwartz R M & Dayhoff M O, 1978. Matrices for detecting distant relationships. in Atlas of Protein Sequence and Structure, 5 (suppl. 3), 353-358.

Smith T F & Waterman M S, 1981. Identification of common molecular subsequences. J Mol Biol, 147, 195-197.

Waterston R & Sulston J, 1995. The C. elegans genome sequencing project. Proc Natl Acad Sci USA, 92, 10836-10840.

Weigel D, Ahn J H, Blazquez M A, Borevitz J O et al, 2000. Activation tagging in Arabidopsis. Plant Physiol, 122, 1004-1013.

White O et al, 1999. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science, 286, 1571-1577.

Zuo J & Chua N H, 2000. Chemical-inducible systems for regulated expression of plant genes. Curr Opin Biotechnol, 11, 146-151.