An introduction to biological databases

Database or databank ?

Initially, subtle distinctions were drawn between databases and databanks (in the UK, but not in the USA), such as:

« Database management programs for the management of databanks »

Nowadays, the term « database » (db) is usually preferred

What is a database ?

A collection of data that is:

structured
searchable (index) -> table of contents
updated periodically (release) -> new edition
cross-referenced (hyperlinks) -> links with other db

It also includes the associated tools (software) needed for db access, db updating, and insertion or deletion of db information.

Databases: a simple example

« Introduction To Database » Teacher Database (ITDTdb) (flat file, 3 entries)

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
//

Easy to manage: all the entries are visible at the same time !

Databases: a simple example (cont.)

Relational database (« table file »):

Teacher    Accession number    Education
Amos       1                   Biochemistry
Laurent    2                   Biochemistry
M-Claude   3                   Biochemistry

Course    Date               Involved teachers
DEA       Oct-nov-dec 2000   1,3
EMBnet    Sept 2000          2,3

Easier to manage; choice of the output
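As a minimal sketch of what « choice of the output » can mean in practice, the two tables above can be loaded into an in-memory SQLite database and queried. The table names, column names and the extra `involved` link table are assumptions made for this illustration; the slides do not prescribe a schema.

```python
import sqlite3

# Toy rows copied from the two tables above.
TEACHERS = [(1, "Amos", "Biochemistry"),
            (2, "Laurent", "Biochemistry"),
            (3, "M-Claude", "Biochemistry")]
COURSES = [("DEA", "Oct-nov-dec 2000"), ("EMBnet", "Sept 2000")]
# Assumption: one row per (course, teacher) pair replaces the "1,3" list column.
INVOLVED = [("DEA", 1), ("DEA", 3), ("EMBnet", 2), ("EMBnet", 3)]


def build_db():
    """Create the illustrative schema in memory and fill it with the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE teacher (ac INTEGER PRIMARY KEY, name TEXT, education TEXT);
        CREATE TABLE course (name TEXT PRIMARY KEY, date TEXT);
        CREATE TABLE involved (course TEXT, teacher_ac INTEGER);
    """)
    conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", TEACHERS)
    conn.executemany("INSERT INTO course VALUES (?, ?)", COURSES)
    conn.executemany("INSERT INTO involved VALUES (?, ?)", INVOLVED)
    return conn


def teachers_of(conn, course):
    """Return the names of the teachers involved in one course, by accession number."""
    rows = conn.execute(
        "SELECT t.name FROM teacher t "
        "JOIN involved i ON i.teacher_ac = t.ac "
        "WHERE i.course = ? ORDER BY t.ac",
        (course,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    db = build_db()
    for course_name, _date in COURSES:
        print(course_name, teachers_of(db, course_name))
```

The same data could answer other questions (for example, all courses of one teacher) by changing only the query, which is the practical advantage over scanning the flat file.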

Why biological databases ?

Explosive growth in biological data

Data (sequences, 3D structures, 2D gel analysis, MS analysis….) are no longer published in a conventional manner, but directly submitted to databases

Essential tools for biological research, as classical publications used to be !

Some databases in the field of molecular biology…

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. !!!!

Biological databases

Some statistics

More than 1000 different databases

Generally accessible through the web (useful link: www.expasy.ch/alinks.html)

Variable size: <100 Kb to >10 Gb
DNA: > 10 Gb
Protein: 1 Gb
3D structure: 5 Gb
Other: smaller

Update frequency: daily to annually

Categories of databases for Life Sciences

Sequences (DNA, protein) -> Primary db
Genomics
Protein domain/family -> Secondary db
Mutation/polymorphism
Proteomics (2D gel, MS)
3D structure -> Structure db
Metabolism
Bibliography
Others

Distribution of sequence databases

Books, articles     1968 -> 1985
Computer tapes      1982 -> 1992
Floppy disks        1984 -> 1990
CD-ROM              1989 -> ?
FTP                 1989 -> ?
On-line services    1982 -> 1994
WWW                 1993 -> ?
DVD                 2001 -> ?

Sequence Databases: some « technical » definitions

Data storage management:
flat file: text file
relational (e.g., Oracle)
object oriented (rare in biological field)

Format (flat file): fasta, GCG, NBRF/PIR, MSF… standardized format ?

Federated databases: different autonomous, redundant, heterogeneous db linked together by links/hyperlinks.

Ideal minimal content of a « sequence » db

Sequences !!
Accession number (AC)
References
Taxonomic data
ANNOTATION/CURATION
Keywords
Cross-references
Documentation

Sequence database: example

SWISS-PROT flat file:

ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
...
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
CC   -!- DATABASE: NAME=R&D Systems' cytokine source book;
CC       WWW="http://www.rndsystems.com/cyt_cat/epo.html".
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
...
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   PROPEP      190    193       MAY BE REMOVED IN PROCESSED PROTEIN.
FT   DISULFID     34    188
...

Slide callouts: taxonomy; reference; annotations; cross-references; keywords

Sequence database: example (cont.)

FT   DISULFID     34    188
FT   DISULFID     56     60
FT   CARBOHYD     51     51       N-LINKED (GLCNAC...).
FT   CARBOHYD     65     65       N-LINKED (GLCNAC...).
FT   CARBOHYD    110    110       N-LINKED (GLCNAC...).
FT   CARBOHYD    153    153
FT   CONFLICT     40     40       E -> Q (IN CAA26095).
FT   CONFLICT     85     85       Q -> QQ (IN REF. 5).
FT   CONFLICT    140    140       G -> R (IN CAA26095).
**   Chromosomal location: 7q22
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//

Slide callout: sequence
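Because every line of the flat file starts with a two-letter line code, an entry can be split into fields with very little code. The sketch below is only an illustration on a shortened copy of the EPO_HUMAN entry; the function names are hypothetical, and it assumes the value starts at column 6 and that DR and KW values are semicolon-separated and end with a period, as in the lines shown above.

```python
from collections import defaultdict

# Shortened copy of the entry shown above (only a few line types kept).
ENTRY = """\
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DE   Erythropoietin precursor.
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
//
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        if len(line) < 2:
            continue
        fields[line[:2]].append(line[5:].strip())
    return fields


def cross_references(fields):
    """Return each DR line as a tuple of its semicolon-separated parts."""
    refs = []
    for value in fields.get("DR", []):
        refs.append(tuple(part.strip() for part in value.rstrip(".").split(";")))
    return refs


def keywords(fields):
    """Return the KW lines as a flat list of keywords."""
    joined = " ".join(fields.get("KW", [])).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


if __name__ == "__main__":
    entry = parse_entry(ENTRY)
    print(entry["AC"], cross_references(entry), keywords(entry))
```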

Sequence database: example

…a SWISS-PROT entry, in fasta format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
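The fasta format is simple enough to read with a few lines of code: a header line starting with `>` followed by sequence lines. The following is a minimal sketch (the function name `read_fasta` is made up for this example), applied to the entry above; it skips blank lines and does no validation of the residues.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines between rows
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""

if __name__ == "__main__":
    for header, sequence in read_fasta(EXAMPLE):
        database, accession, entry_name = header.split()[0].split("|")
        print(database, accession, entry_name, len(sequence))
```

The sequence length obtained this way can be compared with the « 193 AA » stated on the ID line of the full SWISS-PROT entry.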

Databases 1: nucleotide sequence

The main DNA sequence db are EMBL (Europe) / GenBank (USA) / DDBJ (Japan)

There are also specialized databases for:
the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
3D structure (DNA and RNA)
Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA editing sites; Multimedia Telomere Resource…

EMBL/GenBank/DDBJ

These 3 db contain essentially the same information within 2-3 days (a few differences in format and syntax)

Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
Genome projects and sequencing centers
Individual scientists
Patent offices (e.g. European Patent Office, EPO)

Non-confidential data are exchanged daily

Currently: 8.3 x 10^6 sequences, over 9.7 x 10^9 bp; sequences from > 50’000 different species

EMBL/GenBank/DDBJ

Heterogeneous sequence length: genomes, variants, fragments…

Sequence sizes:
max 300’000 bp/entry (! genomic sequences, overlapping)
min 10 bp/entry

Archive: nothing goes out -> highly redundant !

Full of errors: in sequences, in annotations, in CDS attribution…

No consistency of annotations; most annotations are done by the submitters; heterogeneous quality, completeness and updating of the information

EMBL/GenBank/DDBJ

Unexpected information you can find in these db:

FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel
FT                   Castro"

Or:

FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"

EMBL entry: example

ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   SWISS-PROT; P01588; EPO_HUMAN.
XX
…

Slide callouts: keyword; taxonomy; references; cross-references

EMBL entry (cont.)

CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120

Slide callouts: annotation; sequence
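The `join(...)` locations in the feature table can be checked against each other with a little arithmetic. The sketch below is an illustration only: the helper names are invented, and it handles just the simple forward-strand `a..b` and `join(a..b,...)` forms that appear in this entry, not the full location syntax.

```python
import re


def parse_location(location):
    """Return [(start, end), ...] for 'a..b' or 'join(a..b,c..d,...)' (1-based, inclusive)."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total number of bases covered by all segments of a location."""
    return sum(end - start + 1 for start, end in parse_location(location))


def gaps(location):
    """Intervals lying between consecutive segments (the introns of an mRNA join)."""
    segments = parse_location(location)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(segments, segments[1:])]


# Locations copied from the EMBL entry above.
MRNA = "join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)"
CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"

if __name__ == "__main__":
    print("CDS length:", spliced_length(CDS))
    print("gaps in mRNA:", gaps(MRNA))
```

For this entry the CDS segments add up to 582 bases, i.e. 194 codons, which is consistent with the 193-residue translation followed by a stop codon, and the gaps between the mRNA segments coincide with the four annotated introns.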

GenBank entry: example

LOCUS       HSERPG       3398 bp    DNA             PRI       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1  GI:31224
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313 (6005), 806-810 (1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES             Location/Qualifiers
     source          1..3398
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     exon            397..627
                     /number=1
     sig_peptide     join(615..627,1194..1261)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /codon_start=1
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /db_xref="GI:312304"
                     /db_xref="SWISS-PROT:P01588"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
                     EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
                     RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI

…

GenBank entry (cont.)

                     TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
     intron          628..1193
                     /number=1
     exon            1194..1339
                     /number=2
     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2760)
                     /product="erythropoietin"
     intron          1340..1595
                     /number=2
     exon            1596..1682
                     /number=3
     intron          1683..2293
                     /number=3
     exon            2294..2473
                     /number=4
     intron          2474..2607
                     /number=4
     exon            2608..3327
                     /note="3' untranslated region"
                     /number=5
BASE COUNT      698 a   1034 c    991 g    675 t
ORIGIN
        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
      121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
      181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
      241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc

DDBJ entry: example

LOCUS       HSERPG       3398 bp    DNA             HUM       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313, 806-810(1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs
FEATURES             Location/Qualifiers
     source          1..3398
                     /db_xref="taxon:9606"
                     /organism="Homo sapiens"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /db_xref="SWISS-PROT:P01588"
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
                     AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
                     QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
                     TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"

…

DDBJ (cont.)

     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
                     /product="erythropoietin"
     sig_peptide     join(615..627,1194..1261)
     exon            397..627
                     /number=1
     intron          628..1193
                     /number=1
     exon            1194..1339
                     /number=2
     intron          1340..1595
                     /number=2
     exon            1596..1682
                     /number=3
     intron          1683..2293
                     /number=3
     exon            2294..2473
                     /number=4
     intron          2474..2607
                     /number=4
     exon            2608..3327
                     /note="3' untranslated region"
                     /number=5
BASE COUNT      698 a   1034 c    991 g    675 t
ORIGIN
        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat

The tremendous increase in nucleotide sequences

EMBL data…first increase in data due to the PCR development…

1980: 80 genes fully sequenced !

EMBL divisions

EMBL has been divided into subdatabases to allow easier data management and searches:

fun, hum, inv, mam, org, phg, pln, pro, rod, syn, unc, vrl, vrt

est, gss, htg, sts, patent

RefSeq: a SWISS-PROT clone?

The NCBI Reference Sequence project (RefSeq) will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins. RefSeq standards provide a foundation for the functional annotation of the human genome. They provide a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery.

Molecule            Accession Format   Genome
Complete Genome     NC_######          Archaea, Bacteria, Organelle, Virus, Viroid
Complete Chrom.     NC_######          Eukaryote
Complete Sequence   NC_######          Plasmid
Genomic Contig      NT_######          Homo sapiens
mRNA                NM_######          Homo sapiens, Mus musculus, Rattus norvegicus
Protein             NP_######          All of the above
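Since the molecule type is encoded in the accession prefix, it can be read off mechanically. The sketch below covers only the four prefixes and the six-digit format listed in the table above; the function name and the example accession in the usage line are hypothetical, and other prefixes or digit counts are deliberately reported as unknown.

```python
import re

# Prefix -> molecule type, as listed in the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome or plasmid sequence",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}

# Assumption: two-letter prefix, underscore, exactly six digits (NC_######).
_ACCESSION = re.compile(r"^(N[CTMP])_(\d{6})$")


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES[match.group(1)]


if __name__ == "__main__":
    # "NM_123456" is a made-up accession used only to exercise the function.
    print(refseq_molecule("NM_123456"))
```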

RefSeq: a SWISS-PROT clone?

RefSeq records are created via a process consisting of:
identifying sequences that represent distinct genes
establishing the correct gene name-to-accession number association
identifying the full extent of available sequence data
creating a new RefSeq record with a status of PREDICTED, PROVISIONAL or REVIEWED

Provisional RefSeq records are reviewed by a biologist who confirms the initial name-to-sequence association, adds information including a summary of gene function, and, more importantly, corrects, re-annotates, or extends the sequence data using data available in other GenBank records.

Databases 2: genomics

Contain information on genes, gene location (mapping), gene nomenclature and links to sequence databases;

Exist for most organisms important for life science research;

Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B.subtilis), etc.;

Format: generally relational (Oracle, Sybase or AceDb).

MIM

OMIM™: Online Mendelian Inheritance in Man

a catalog of human genes and genetic disorders

contains a summary of literature, pictures, and reference information. It also contains numerous links to articles and sequence information.

MIM: example

*133170 ERYTHROPOIETIN; EPO

Alternative titles; symbols

EP

TABLE OF CONTENTS

TEXT
REFERENCES
SEE ALSO
CONTRIBUTORS
CREATION DATE
EDIT HISTORY

Database Links

Gene Map Locus: 7q21

Note: pressing the symbol will find the citations in MEDLINE whose text most closely matches the text of the preceding OMIM paragraph, using the Entrez MEDLINE neighboring function.

TEXT

Human erythropoietin is an acidic glycoprotein hormone with molecular weight 34,000. As the prime regulator of red cell production, its major functions are to promote erythroid differentiation and to initiate hemoglobin synthesis. Sherwood and Shouval (1986) described a human renal carcinoma cell line that continuously produces erythropoietin. Eschbach et al. (1987) demonstrated the effectiveness of recombinant human erythropoietin in treating the anemia of end-stage renal disease. Lee-Huang (1984) cloned human erythropoietin cDNA in E. coli. McDonald et al. (1986) and Shoemaker and Mitsock (1986) cloned the mouse gene and the latter workers showed that coding DNA and amino acid sequence are about 80% conserved between man and mouse. This is a much higher order of conservation than for various interferons, interleukin-2, and GM-CSF.

……

Ensembl

Contains all the human genome DNA sequences currently available in the public domain.

Automated annotation: using different software tools, features are identified in the DNA sequences:
Genes (known or predicted)
Single nucleotide polymorphisms (SNPs)
Repeats
Homologies

Created and maintained by the EBI and the Sanger Centre (UK)

www.ensembl.org

Database 3: protein sequence

SWISS-PROT: created in 1986 (A. Bairoch)

TrEMBL: created in 1996; complement to SWISS-PROT; derived from automated EMBL CDS translations (« proteomic » version of EMBL)

GenPept: derived from automated GenBank CDS translations and journal scans (« proteomic » version of GenBank)

PIR: Protein Information Resource

MIPS: Martinsried Institute for Protein Sequences; PIR + PATCHX (supplement of unverified protein sequences from external sources)

Database 3: protein sequence

NRL-3D: produced by PIR from PDB (3D structure) sequences

Many specialized protein databases for specific families or groups of proteins. Examples: YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), etc.

SWISS-PROT

Collaboration between the SIB (CH) and EMBL/EBI (UK)

Annotated (manually), non-redundant, cross-referenced, documented protein sequence database.

88’000 sequences from more than 6’800 different species
70’000 references (publications)
550’000 cross-references (databases)
~200 Mb of annotations

Weekly releases; available from about 50 servers across the world, the main source being ExPASy

SWISS-PROT: example

Never changed

SWISS-PROT (cont.)

TrEMBL (Translation of EMBL)

Computer-annotated supplement to SWISS-PROT, as it is impossible to cope with the flow of data…

Well-structured SWISS-PROT-like resource

Derived from automated EMBL CDS translation (maintained at the EBI (UK))

TrEMBL is automatically generated and annotated using software tools (not comparable with SWISS-PROT in terms of quality)

TrEMBL contains everything that is not yet in SWISS-PROT

Yerk!! But there is no choice and these software tools are becoming quite good !

The simplified story of a Sprot entry

cDNAs, genomes, … -> EMBLnew -> EMBL
EMBL -> (CDS) -> TrEMBLnew -> TrEMBL
TrEMBL -> SWISS-PROT

« Automatic »
• Redundancy check (merge)
• InterPro (family attribution)
• Annotation

« Manual »
• Redundancy (merge, conflicts)
• Annotation
• Sprot tools (macros…)
• Sprot documentation
• Medline
• Databases (MIM, MGD…)
• Brain storming

Once in Sprot, the entry is no longer in TrEMBL, but still in EMBL (archive)

SWISS-PROT introduces a new arithmetical concept !

How many sequences in SWISS-PROT + TrEMBL ?

88’000 + 300’000 = about 240’000

SWISS-PROT + TrEMBL (SPTR): a minimum of redundancy
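One way to read this arithmetic: when two collections are merged and redundant entries are collapsed, the total is smaller than the sum of the parts. The toy sketch below only illustrates that effect. The criterion used (exact sequence identity), the function name and the records are all assumptions for the example; the slides do not state how redundancy is actually defined for SPTR.

```python
def merge_non_redundant(*collections):
    """Merge (accession, sequence) collections, keeping one accession per distinct sequence.

    Assumption for this sketch: two entries are redundant when their sequences
    are identical; the first accession seen for a sequence is the one kept.
    """
    merged = {}
    for collection in collections:
        for accession, sequence in collection:
            merged.setdefault(sequence, accession)
    return merged


# Made-up records, ordered so that the curated collection takes priority.
CURATED = [("S1", "MKT"), ("S2", "MGV")]
AUTOMATIC = [("T1", "MGV"), ("T2", "MAA"), ("T3", "MAA")]

if __name__ == "__main__":
    merged = merge_non_redundant(CURATED, AUTOMATIC)
    total = len(CURATED) + len(AUTOMATIC)
    print(f"{len(CURATED)} + {len(AUTOMATIC)} = {total} entries, {len(merged)} after merging")
```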

TrEMBL divisions

TrEMBL: SPTrEMBL + REMTrEMBL

SPTrEMBL: TrEMBL entries that will eventually be integrated into SWISS-PROT, but that have not yet been manually annotated

REMTrEMBL: sequences that are not destined to be included in SWISS-PROT:
Immunoglobulins and T-cell receptors
Synthetic sequences
Patented sequences
Small fragments (<8 aa)
CDS not coding for real proteins

TrEMBLnew: updates to the latest release of TrEMBL

TrEMBL divisions

Subdivisions

Archaea             arc
Fungus              fun
Human               hum
Invertebrate        inv
Mammals             mam
Major Hist. Comp.   mhc
Organelles          org
Phage               phg
Plant               pln
Prokaryote          pro
Rodent              rod
Uncommented         unc
Viral               vrl
Vertebrate          vrt

TrEMBL: example

GenPept (translation of GenBank)

GenPept is a protein database translated from the last release of GenBank (+ journal scans)

The current release has 484’496 entries

In contrast to TrEMBL, keeps all protein sequences including small fragments (< 8 aa), immunoglobulins….

Redundancy: 20 entries for human EPO

GenPept: example

LOCUS       L33410_1 [HUMMLCMPL]
DEFINITION  Human c-mpl ligand (ML) mRNA, complete cds; erythropoietin homology domain bp 66..522.
DATE        07-JAN-1995
ACCESSION   L33410
NID
ORGANISM    Homo_SP_sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
COMMENT
CDS         216..1277 /gene="ML" /product="c-mpl ligand" /protein_id="AAA59857.1" /db_xref="GI:506827"
WEIGHT      37823
LENGTH      353
ORIGIN
        1 MELTELLLVV MLLLTARLTL SSPAPPACDL RVLSKLLRDS HVLHSRLSQC PEVHPLPTPV
       61 LLPAVDFSLG EWKTQMEETK AQDILGAVTL LLEGVMAARG QLGPTCLSSL LGQLSGQVRL
      121 LLGALQSLLG TQLPPQGRTT AHKDPNAIFL SFQHLLRGKV RFLMLVGGST LCVRRAPPTT
      181 AVPSRTSLVL TLNELPNRTS GLLETNFTAS ARTTGSGLLK WQQGFRAKIP GLLNQTSRSL
      241 DQIPGYLNRI HELLNGTRGL FPGPSRRTLG APDISSGTSD TGSLPPNLQP GYSPSPTHPP
      301 TGQYTLFPLP PTLPTPVVQL HPLLPDPSAP TPTPTSPLLN TSYTHSQNLS QEG
//

PIR

Protein Information Resource, created in 1984

Successor of the National Biomedical Research Foundation (NBRF) protein sequence database, developed in 1965 by M. O. Dayhoff (« Atlas of Protein Sequence and Structure »)

Maintained by MIPS (Germany) and JIPID (Japan)

Provides some cross-referencing to EMBL/GenBank/DDBJ and PDB, GDB, FlyBase, OMIM, SGD, and MGD

In August 2000: 178’050 entries. Redundancy: 3 entries for human EPO

PIR: example

>P1;ZUHU
erythropoietin precursor - human
C;Species: Homo sapiens (man)
C;Date: 27-Nov-1985 #sequence_revision 27-Nov-1985 #text_change 22-Jun-1999
C;Accession: A01855; A24744; A25384; A22210; S56178
R;Jacobs, K.; Shoemaker, C.; Rudersdorf, R.; Neill, S.D.; Kaufman, R.J.; Mufson, A.; Seehra, J.; Jones, S.S.; Hewick, R.; Fritsch, E.F.; Kawakita, M.; Shimizu, T.; Miyake, T.
Nature 313, 806-810, 1985
A;Title: Isolation and characterization of genomic and cDNA clones of human erythropoietin.
A;Reference number: A01855; MUID:85137899
A;Accession: A01855
A;Molecule type: mRNA; DNA
A;Residues: 1-193
A;Cross-references: GB:X02157; GB:X02158
R;Lin, F.K.; Suggs, S.; Lin, C.H.; Browne, J.K.; Smalling, R.; Egrie, J.C.; Chen, K.K.; Fox, G.M.; Martin, F.; Stabinsky, Z.; Badrawi, S.M.; Lai, P.H.; Goldwasser, E.
Proc. Natl. Acad. Sci. U.S.A. 82, 7580-7584, 1985
A;Title: Cloning and expression of the human erythropoietin gene.
A;Reference number: A24744; MUID:86067948
A;Accession: A24744
A;Molecule type: DNA
A;Residues: 1-193
A;Cross-references: GB:M11319; NID:g182197; PIDN:AAA52400.1; PID:g182198
R;Lai, P.H.; Everett, R.; Wang, F.F.; Arakawa, T.; Goldwasser, E.
J. Biol. Chem. 261, 3116-3121, 1986
A;Title: Structural characterization of human erythropoietin.
A;Reference number: A25384; MUID:86140080
A;Accession: A25384
A;Molecule type: protein
A;Residues: 28-86,'Q',87-193
A;Experimental source: urine
A;Note: forms without the carboxyl-terminal residue and the four carboxyl-terminal residues were observed
R;Yanagawa, S.; Hirade, K.; Ohnota, H.; Sasaki, R.; Chiba, H.; Ueda, M.; Goto, M.
J. Biol. Chem. 259, 2707-2710, 1984
A;Title: Isolation of human erythropoietin with monoclonal antibodies.
A;Reference number: A22210; MUID:84135751

PIR (cont.)

A;Accession: A22210
A;Molecule type: protein
A;Residues: 28-29,'X',31-33,'L',35-50,'X',52-53,'D',55,'G',57
R;Matsumoto, S.; Ikura, K.; Ueda, M.; Sasaki, R.
Plant Mol. Biol. 27, 1163-1172, 1995
A;Title: Characterization of a human glycoprotein (erythropoietin) produced in cultured tobacco cells.
A;Reference number: S56178; MUID:95284365
A;Accession: S56178
A;Molecule type: protein
A;Residues: 28-33,'X',35-37
C;Comment: Erythropoietin is produced by kidney or liver of adult mammals and by liver of fetal or neonatal mammals.
C;Genetics:
A;Gene: GDB:EPO
A;Cross-references: GDB:119110; OMIM:133170
A;Map position: 7q21.3-7q22.1
A;Introns: 5/1; 53/3; 82/3; 142/3
C;Function:
A;Description: the primary inducer of erythrocyte formation
C;Superfamily: erythropoietin
C;Keywords: erythropoiesis; glycoprotein; hormone; kidney; liver
F;1-27/Domain: signal sequence #status predicted
F;28-193/Product: erythropoietin #status experimental
F;34-188,56-60/Disulfide bonds: #status experimental
F;51,65,110/Binding site: carbohydrate (Asn) (covalent) #status experimental
F;153/Binding site: carbohydrate (Ser) (covalent) #status experimental
>P1;ZUHU
MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
KLYTGEACRT GDR*

Composite protein sequence db

Composite db   Primary sources (listed by redundancy priority criteria)
NRDB           PDB, SWISS-PROT, PIR, GenPept, SP update, GenPept update
OWL            SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX          PIR, MIPS, NRL-3D, SWISS-PROT, EMBL translation, GenBank translation, Kabat (immuno), PseqIP
SPTrEMBL *     SWISS-PROT, SPTrEMBL, TrEMBLnew

Different composite db use different primary sources and different redundancy criteria in their amalgamation procedures

* Also called SWall at EBI
SWIR: SPTrEMBL + Wormpep

Composite: protein family

The proteins/genes are classified by superfamily/family according to Blast/Fasta (homology) results

General:
ProtFam: PIR
ProtoMap: SWISS-PROT
SYSTERS: SWISS-PROT and PIR (non redundant)
ProClass: PIR and PROSITE

Species specific:
HOVERGEN: vertebrates
HOBACGEN: bacteria
COG: complete organism genome

ProtoMap: example

ProtoMap (cont.)

Database 4: protein domain/family

Contains biologically significant « patterns / profiles / HMMs » formulated in such a way that, with appropriate computational tools, they can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs

-> tools to identify the function of uncharacterized proteins translated from genomic or cDNA sequences (« functional diagnostic »)

Protein domain/family

Most proteins have a « modular » structure

Estimation: ~ 3 domains / protein

Domains (conserved sequences or structures) are identified by multiple sequence alignments

Domains can be defined by different methods:
Pattern (regular expression): used for very conserved domains
Profiles (weighted matrices): two-dimensional tables of position-specific match, gap, and insertion scores, derived from aligned sequence families; used for less conserved domains
Hidden Markov Model (HMM): probabilistic models; another method to generate profiles.

Some statistics

15 most common protein domains for H. sapiens (Incomplete)

Immunoglobulin and major histocompatibility complex domain
Eukaryotic protein kinase
Zinc finger, C2H2 type
Rhodopsin-like GPCR superfamily
Src homology 3 (SH3) domain
RNA-binding region RNP-1 (RNA recognition motif)
Fibronectin type III domain
Pleckstrin homology (PH) domain
Homeobox domain
Major histocompatibility complex protein, Class I
EF-hand family
EGF-like domain
RING finger
Cadherin domain
PDZ domain (also known as DHR or GLGF)
Serine proteases, trypsin family

http://www.ebi.ac.uk/proteome/HUMAN/interpro/top15d.html

Protein domain/family db

Secondary databases are the fruit of analyses of the sequences found in the primary db

Either manually curated (e.g. PROSITE, Pfam, etc.) or automatically generated (e.g. ProDom, DOMO)

Some depend on the method used to detect if a protein belongs to a particular domain/family (patterns, profiles, HMM)

Protein domain/family db

Secondary db   Primary source        Information
PROSITE        SWISS-PROT            Patterns (regular expressions)
PROSITE        SWISS-PROT            Profiles (weighted matrices)
PRINTS         OWL and SWISS-PROT    Aligned motifs (fingerprints)
Pfam           SWISS-PROT            HMM (Hidden Markov Models)
BLOCKS         PROSITE/PRINTS        Aligned motifs
IDENTIFY       BLOCKS/PRINTS         Fuzzy regular expressions

Prosite

Created in 1988 (SIB)

Contains fully annotated functional domains, based on two methods: patterns and profiles

Entries are deposited in PROSITE in two distinct files:
Patterns/profiles with the lists of all matches in the parent version of SWISS-PROT
Documentation

Aug 2000: contains 1064 documentation entries that describe 1424 different patterns, rules and profiles/matrices.

Prosite (pattern): example

ID   EPO_TPO; PATTERN.
AC   PS00817;
DT   OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Erythropoietin / thrombopoeitin signature.
PA   P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=1;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,disulfide; /SITE=11,disulfide;
DR   P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR   P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR   P07321, EPO_MOUSE , T; P49157, EPO_PIG , T; P29676, EPO_RAT , T;
DR   P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR   P40226, TPO_MOUSE , T; P49745, TPO_RAT , T;
DR   P42706, TPO_PIG , P;
DO   PDOC00644;
//

Slide callouts: diagnostic performance (NR lines); list of matches (DR lines)
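A PROSITE pattern such as the PA line above maps almost directly onto a regular expression. The converter below is a minimal sketch, not PROSITE's own scanning software: the function name is invented, and only the elements needed here plus a few common ones are handled (`x`, single residues, `[...]`, `{...}`, repeat counts, and `<` / `>` anchors on the first and last element).

```python
import re

# One pattern element: optional '<', a residue specification,
# an optional repeat count "(n)" or "(n,m)", and an optional '>'.
_ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern):
    """Translate the content of a PROSITE PA line into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, residue, low, high, end = match.groups()
        if residue == "x":
            core = "."                          # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # any residue except those listed
        else:
            core = residue                      # single residue or [ABC] class
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


PA_LINE = "P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C."
# First 60 residues of EPO_HUMAN (P01588), copied from the SWISS-PROT entry.
EPO_HUMAN_START = "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAENITTGCAEHC"

if __name__ == "__main__":
    regex = prosite_to_regex(PA_LINE)
    hit = re.search(regex, EPO_HUMAN_START)
    print(regex)
    print(hit.start() + 1, hit.end(), hit.group())
```

On EPO_HUMAN the match covers residues 29 to 56, so the two cysteines of the pattern fall on positions 34 and 56, which are annotated as disulfide-bonded cysteines (DISULFID 34-188 and 56-60) in the SWISS-PROT entry.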

Prosite (profile): example

PROSITE: PS50097

ID   BTB; MATRIX.
AC   PS50097;
DT   DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE).
DE   BTB domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;
MA   /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7;
MA   /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24;
MA   /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13;
MA   /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24;
MA   /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9;
MA   /M: SY='V'; M=1,-25,-3,-29,-25,-2,-26,-26,17,-22,10,7,-23,-25,-23,-22,-11,-3,24,-27,-10,-25;
MA   /M: SY='D'; M=-6,7,-26,8,7,-25,6,-7,-27,0,-23,-17,8,-13,0,-3,3,-6,-23,-27,-17,3;
MA   /I: I=-5; MI=0; IM=0; DM=-15; MD=-15;
MA   /M: SY='G'; M=-6,8,-27,8,-3,-27,22,-7,-30,-8,-26,-19,10,-14,-8,-9,2,-9,-24,-28,-21,-6;
MA   /M: SY='K'; M=-7,-4,-23,-4,7,-23,-13,-2,-21,10,-18,-9,-3,-12,7,9,-2,-4,-16,-25,-12,6;
MA   /M: SY='E'; M=-8,-6,-21,-8,1,-15,-21,-7,-7,-1,-10,-5,-3,-14,0,-1,-2,-2,-6,-26,-9,-1;
MA   /M: SY='F'; M=-12,-28,-22,-34,-26,31,-31,-21,18,-26,16,9,-22,-27,-27,-21,-20,-9,14,-6,13,-26;
MA   /M: SY='R'; M=-13,-9,-24,-10,-3,-11,-21,7,-17,7,-16,-4,-4,-8,2,9,-9,-9,-16,-20,-1,-2;
MA   /M: SY='A'; M=21,-15,-8,-22,-17,-10,-10,-23,0,-15,-5,-5,-14,-18,-17,-19,4,6,12,-24,-15,-17;
MA   /M: SY='H'; M=-15,5,-22,2,-1,-20,-16,65,-26,-8,-21,-5,15,-19,6,-2,-2,-11,-26,-32,7,0;
MA   /M: SY='K'; M=-12,-5,-29,-5,5,-25,-18,-8,-26,34,-24,-9,-1,-14,8,34,-8,-8,-17,-20,-10,5;
MA   /M: SY='A'; M=4,-12,-12,-16,-10,-6,-18,-14,-2,-13,-1,-2,-11,-17,-12,-13,-3,1,2,-24,-8,-11;
MA   /M: SY='V'; M=-7,-26,-19,-31,-26,7,-32,-24,27,-23,14,11,-22,-25,-23,-22,-13,0,28,-19,3,-26;
MA   /M: SY='L'; M=-10,-30,-20,-30,-21,9,-30,-20,22,-29,47,20,-29,-29,-20,-20,-29,-10,12,-20,0,-21;
MA   /M: SY='A'; M=18,-6,0,-12,-8,-18,-6,-16,-15,-10,-18,-12,-2,-14,-8,-13,18,11,-5,-32,-19,-8;
…

Prosite (profile): example (cont.)……MA /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5;MA /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11;MA /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8;MA /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20;MA /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1;MA /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20;MA /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3;MA /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7;MA /I: E1=0; IE=-105; DE=-105;NR /RELEASE=39,87397;NR /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0);NR /FALSE_NEG=0; /PARTIAL=0;CC /TAXO-RANGE=??E?V; /MAX-REPEAT=2;DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T; DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T; DR Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T; DR Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T; DR P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T; DR P42284, LOLS_DROME, T; O14682, PI10_HUMAN, T; Q05516, PLZF_HUMAN, T; DR O43791, SPOP_HUMAN, T; P42282, TTKA_DROME, T; P17789, TTKB_DROME, T; DR P21073, VA55_VACCC, T; P24768, VA55_VACCV, T; P21037, VC02_VACCC, T; DR P17371, VC02_VACCV, T; P32228, VC04_SPVKA, T; P32206, VC13_SPVKA, T; DR P21013, VF03_VACCC, T; P24357, VF03_VACCV, T; P22611, VMT8_MYXVL, T; DR P08073, VMT9_MYXVL, T; O43167, Y441_HUMAN, T; Q10225, YAZ4_SCHPO, T; DR P40560, YIA1_YEAST, T; P34324, YKV2_CAEEL, T; P34371, YLJ8_CAEEL, T; DR P34568, YNV5_CAEEL, T; P41886, YPT9_CAEEL, T; Q09563, YR47_CAEEL, T; DR Q10017, YSW1_CAEEL, T; Q13105, Z151_HUMAN, T; Q60821, Z151_MOUSE, T; DR P24278, ZN46_HUMAN, T; DR Q13829, TNP1_HUMAN, ?; 
DO PDOC50097;//
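The /NORMALIZATION and /CUT_OFF lines of the profile above can be checked against each other. With MODE=1 and FUNCTION=LINEAR, the normalized score appears to be R1 + R2 × raw score. The short Python sketch below assumes that linear form (the function name `normalized_score` is ours, not part of PROSITE) and reproduces the N_SCORE values listed in the entry.

```python
# Parameters copied from the /NORMALIZATION line of PS50097.
R1 = 0.9751
R2 = 0.02068202


def normalized_score(raw_score: int) -> float:
    """Linear normalization (assumed form): N_SCORE = R1 + R2 * raw score."""
    return R1 + R2 * raw_score


# Raw cut-off scores taken from the two /CUT_OFF lines (LEVEL=0 and LEVEL=-1).
for level, raw in ((0, 363), (-1, 267)):
    print(f"LEVEL={level}: SCORE={raw} -> N_SCORE={normalized_score(raw):.1f}")
```

A raw score of 363 gives about 8.5 and a raw score of 267 gives about 6.5, matching the N_SCORE values on the /CUT_OFF lines.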

PRINTS

Compendium of protein motif fingerprints

Most protein families are characterized by several conserved motifs

Fingerprint: set of motif(s) (simple or composite, such as multidomains) = signature of family membership

True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part
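The full-versus-partial distinction can be sketched in a few lines of Python. This is only an illustration of the idea, not the PRINTS scoring method: the motif labels are hypothetical placeholders, and the function simply counts how many motifs of a fingerprint were found in a sequence.

```python
def classify_fingerprint_match(motifs_found: set[str], fingerprint: list[str]) -> str:
    """Return 'full', 'partial' or 'none' depending on how many motifs matched."""
    matched = [motif for motif in fingerprint if motif in motifs_found]
    if fingerprint and len(matched) == len(fingerprint):
        return "full"      # all elements present: candidate true family member
    if matched:
        return "partial"   # only some elements present: possible subfamily member
    return "none"


# Hypothetical fingerprint of three motif labels (not real PRINTS identifiers).
fingerprint = ["MOTIF1", "MOTIF2", "MOTIF3"]
print(classify_fingerprint_match({"MOTIF1", "MOTIF2", "MOTIF3"}, fingerprint))
print(classify_fingerprint_match({"MOTIF2"}, fingerprint))
```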

ProDom

Consists of an automated compilation of homologous domain alignments (procedure based on PSI-BLAST searches)

Updating problem!

Last ProDom update: February 7, 2000, built from SWISS-PROT 38 + TREMBL + TREMBL updates (October 22, 1999)

ProDom: example

Your query

Protein domain/family: Composite databases

Example: InterPro

Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites;

Single set of «documents» linked to the various methods;

Will be used to improve the functional annotation of SWISS-PROT (classification of unknown protein…)

This release contains 3052 entries, representing 574 domains, 2418 families, 46 repeats and 14 post-translational modification sites.

InterPro: example

IPR001323

Name: Erythropoietin/thrombopoietin

Type: Family

Abstract: Erythropoietin, a plasma glycoprotein, is the primary physiological mediator of erythropoiesis [1]. It is involved in the regulation of the level of peripheral erythrocytes by stimulating the differentiation of erythroid progenitor cells, found in the spleen and bone marrow, into mature erythrocytes [2]. It is primarily produced in adult kidneys and foetal liver, acting by attachment to specific binding sites on erythroid progenitor cells, stimulating their differentiation [3]. Severe kidney dysfunction causes reduction in the plasma levels of erythropoietin, resulting in chronic anaemia - injection of purified erythropoietin into the blood stream can help to relieve this type of anaemia. Levels of erythropoietin in plasma fluctuate with varying oxygen tension of the blood, but androgens and prostaglandins also modulate the levels to some extent [3]. Erythropoietin glycoprotein sequences are well conserved, a consequence of which is that the hormones are cross-reactive among mammals, i.e. that from one species, say human, can stimulate erythropoiesis in other species, say mouse or rat [4].

Thrombopoietin (TPO), a glycoprotein, is the mammalian hormone which functions as a megakaryocytic lineage specific growth and differentiation factor affecting the proliferation and maturation from their committed progenitor cells acting at a late stage of megakaryocyte development. It acts as a circulating regulator of platelet numbers.

….

InterPro: example...

Example list: P33708 P33709 P49745 (view matches for the examples)

Publications

1. Shoemaker C.B., Mitsock L.D. 849-858 (1986)
2. Takeuchi M., Takasaki S., Miyazaki H., Kato T., Hoshi S., Kochibe N., Kobata A. J. Biol. Chem. 263: 3657-3663 (1988)
3. Lin F.K., Lin C.H., Lai P.H., Browne J.K., Egrie J.C., Smalling R., Fox G.M., Chen K.K., Castro M., Suggs S. Gene 44: 201-209 (1986)
4. Nagao M., Suga H., Okano M., Masuda S., Narita H., Ikura K., Sasaki R. Nucleotide sequence of rat erythropoietin. 1171: 99-102 (1992)

Children: IPR003013

Signatures: PROSITE PS00817 EPO_TPO; PFAM PF00758 EPO_TPO

Matches: Table, Graphical
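The idea of a single InterPro «document» linked to signatures from several member databases can be pictured as a small data structure. The sketch below is an assumption about how one might model it in Python, not InterPro's actual schema. The class and field names are ours, and the values are those shown in the example entry.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    database: str   # member database, e.g. "PROSITE" or "PFAM"
    accession: str  # signature accession in that database
    name: str       # signature name


@dataclass
class InterProEntry:
    accession: str
    name: str
    entry_type: str
    signatures: list[Signature] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)  # protein accession numbers

    def signatures_by_database(self) -> dict[str, list[str]]:
        """Group signature accessions by the member database they come from."""
        grouped: dict[str, list[str]] = {}
        for sig in self.signatures:
            grouped.setdefault(sig.database, []).append(sig.accession)
        return grouped


epo_tpo = InterProEntry(
    accession="IPR001323",
    name="Erythropoietin/thrombopoietin",
    entry_type="Family",
    signatures=[
        Signature("PROSITE", "PS00817", "EPO_TPO"),
        Signature("PFAM", "PF00758", "EPO_TPO"),
    ],
    examples=["P33708", "P33709", "P49745"],
)
print(epo_tpo.signatures_by_database())
```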

Databases 5: mutation/polymorphism

Contain information on sequence variations that may or may not be linked to genetic diseases;

Mainly human, but see also OMIA - Online Mendelian Inheritance in Animals

General db:
OMIM
HGMD - Human Gene Mutation db
SVD - Sequence variation db
HGBASE - Human Genic Bi-Allelic Sequences db
dbSNP - Human single nucleotide polymorphism (SNP) db

Disease-specific db: most of these databases are linked either to a single gene or to a single disease;
p53 mutation db
ADB - Albinism db (mutations in human genes causing albinism)
Asthma and Allergy gene db
….

Mutation/polymorphisms: definitions

SNPs: single nucleotide polymorphisms

c-SNPs: coding single nucleotide polymorphisms (single nucleotide polymorphisms within cDNA sequences)

SAPs: single amino-acid polymorphisms

Missense mutation -> SAP
Nonsense mutation -> STOP
Insertion/deletion of nucleotides -> frameshift…

! Numbering of the mutation depends on the db (aa no 1 is not necessarily the initiator Met!)
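The missense/nonsense distinction can be made concrete with a small Python sketch. It assumes the standard genetic code and a single-codon substitution. The function name `classify_substitution` and the "silent" and "stop-lost" labels are ours, added only to make the example complete.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"     # same amino acid, no SAP
    if alt_aa == "*":
        return "nonsense"   # amino acid replaced by a STOP
    if ref_aa == "*":
        return "stop-lost"  # STOP replaced by an amino acid
    return "missense"       # amino acid replaced by another one: a SAP


print(classify_substitution("GAG", "GTG"))  # Glu -> Val
print(classify_substitution("TGG", "TGA"))  # Trp -> STOP
print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
```

Insertions and deletions are not handled here, since a frameshift changes every downstream codon and cannot be described by a single codon pair.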

Mutation/polymorphisms

SNP Consortium http://snp.cshl.org/

Bayer, Roche, IBM, Pfizer, Novartis, Motorola…

Mission: develop up to 300,000 SNPs distributed evenly throughout the human genome and make the information related to these SNPs available to the public without intellectual property restrictions. The project started in April 1999 and is anticipated to continue until the end of 2001.

dbSNP at NCBI http://www.ncbi.nlm.nih.gov/SNP/

Collaboration between the National Human Genome Research Institute and the National Center for Biotechnology Information (NCBI)

Mission: central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms

As of Aug 24, 2000, dbSNP has submissions for 803557 SNPs.

Chromosome 21 dbSNP http://csnp.isb-sib.ch/

A joint project between the Division of Medical Genetics of the University of Geneva Medical School and the SIB

Mission: comprehensive cSNP (single nucleotide polymorphisms within cDNA sequences) database and map of chromosome 21

Mutation/polymorphisms

Very heterogeneous format;

Generally modest size;

There are initiatives to standardize and to unify these databases (SVD - Sequence Variation Database project at EBI: HMutDB)

Databases 6: proteomics

Contain information obtained by 2D-PAGE: master images of the gels and description of identified proteins

Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.

Format: composed of image and text files

Most 2D-PAGE databases are “federated” and use SWISS-PROT as a master index

There is currently no protein Mass Spectrometry (MS) database (not for long…)

Databases 7: 3D structure

Contain the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies

Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)

PDB (Protein Data Bank), SCOP (Structural Classification of Proteins, according to their secondary structures), BMRB (BioMagResBank; NMR results)

Future: Homology-derived 3D structure db.

PDB

Protein Data Bank, managed by RCSB

Currently there are ~13’000 structures for about 4’000 different molecules, but far fewer protein families!

There are also databases that contain data derived from PDB. Examples: HSSP (homology-derived secondary structure of proteins), SWISS-3DIMAGE (images)…

Restriction enzyme

PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………

PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
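To show what the ATOM records carry, the Python sketch below reads two of them and computes the distance between the backbone N and CA atoms of TRP 5. It is a simplification under stated assumptions: the records are split on whitespace, as they appear on the slide, whereas real PDB files use fixed columns and should be parsed by column position. The function names are ours.

```python
import math


def parse_atom(record: str) -> dict:
    """Parse a whitespace-separated ATOM record as printed on the slide."""
    fields = record.split()
    if fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return {
        "serial": int(fields[1]),
        "atom": fields[2],
        "residue": fields[3],
        "residue_number": int(fields[4]),
        "xyz": tuple(float(value) for value in fields[5:8]),  # coordinates in angstroms
    }


def distance(atom_a: dict, atom_b: dict) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


n_atom = parse_atom("ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89")
ca_atom = parse_atom("ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90")
print(f"N-CA distance: {distance(n_atom, ca_atom):.2f} angstroms")
```

The result is about 1.47 Å, a plausible length for an N-CA bond.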

Databases 8: metabolic

Contain information that describes enzymes, biochemical reactions and metabolic pathways;

ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;

Examples of metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;

Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.

Databases 9: bibliographic

Bibliographic reference databases contain citations and abstracts of published life science articles;

Example: Medline

Other more specialized databases also exist (example: Agricola).

Medline

MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences

Indexes more than 4,000 biomedical journals published in the United States and 70 other countries

Contains over 10 million citations from 1966 to the present

Contains links to biological db and to some journals

New records are added to PreMEDLINE daily

! Many papers not dealing with humans are not in Medline

! Before 1970, only the first 10 authors are kept

! Not all journals have citations going back to 1966

Medline/Pubmed

PubMed is developed by the National Center for Biotechnology Information (NCBI)

PubMed provides access to bibliographic information such as MEDLINE, PreMEDLINE, HealthSTAR, and to integrated molecular biology databases (composite db)

PMID: 10923642 (PubMed ID), UI: 20378145 (Medline ID)

Databases 10: others

There are many databases that cannot be classified in the categories listed previously;

Examples: ReBase (restriction enzymes), TRANSFAC (transcription factors), O-GLYCBASE (O-linked sugars), Protein-protein interactions db (DIP), biotechnology patents db, etc.;

As well as many other resources concerning any aspect of macromolecules and molecular biology.

Proliferation of databases

What is the best db for sequence analysis?

Which contains the highest-quality data?

Which is the most comprehensive?

Which is the most up-to-date?

Which is the least redundant?

Which is the best indexed (allows complex queries)?

Which Web server responds most quickly?

…

Some important practical remarks

Databases contain many errors (automated annotation)!

Not all db are available on all servers

The update frequency is not the same for all servers; creation of db_new between releases (example: EMBLnew, TrEMBLnew…)

Some servers automatically add useful cross-references to an entry (implicit links) in addition to already existing links (explicit links)

Database retrieval tools

Sequence Retrieval System (SRS, Europe): allows any flat-file db to be indexed to any other; lets the user formulate queries across a wide range of different db types via a single interface, without worrying about data structures, query languages…

Entrez (USA): less flexible than SRS but exploits the concept of « neighbouring », which allows related articles in different db to be linked together, whether or not they are cross-referenced directly

ATLAS: specific for macromolecular sequence db (e.g. NRL-3D)

….

More information about SWISS-PROT

The golden goals of SWISS-PROT

Annotated / curated

Complete

Non-redundant

Highly cross-referenced

Available from a variety of servers and through sequence analysis software tools

Associated with a wide range of documentation

Review: Protein sequence databases, R. Apweiler (2000), Adv. in protein chemistry, 54, 31-70

SWISS-PROT: species

6’840 different species

20 species represent about 45% of all sequences in the database

5’000 species are only represented by one to three sequences. In most cases, these are sequences which were obtained in the context of a phylogenetic study

SWISS-PROT: cross-references

SWISS-PROT was the first database with cross-references.

Explicitly cross-referenced to 34 databases

Cross-ref to DNA (EMBL/GenBank/DDBJ), 3D-structure (PDB), literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D-gel (SWISS-2DPAGE), specialized db (PROSITE, TRANSFAC)

Implicitly cross-referenced to additional db on the WWW (GeneCards, PRODOM, etc.)
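Explicit cross-references are stored in the flat file on DR lines, which makes them easy to extract. The Python sketch below assumes the layout `DR   DATABASE; PRIMARY_ID; SECONDARY_ID; ...` terminated by a period. The two sample lines are illustrative, assembled from the PROSITE and Pfam identifiers of the InterPro example rather than copied from a real SWISS-PROT entry, and the function name is ours.

```python
def parse_dr_line(line: str) -> tuple[str, list[str]]:
    """Split a DR line into (database name, list of identifiers)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip().rstrip(".")
    fields = [part.strip() for part in body.split(";")]
    return fields[0], fields[1:]


# Illustrative DR lines (assumed layout, not taken from a real entry).
sample_lines = [
    "DR   PROSITE; PS00817; EPO_TPO; 1.",
    "DR   PFAM; PF00758; EPO_TPO; 1.",
]

cross_references: dict[str, list[str]] = {}
for dr_line in sample_lines:
    database, identifiers = parse_dr_line(dr_line)
    # Keep only the primary identifier for each cross-referenced database.
    cross_references.setdefault(database, []).append(identifiers[0])

print(cross_references)
```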

Annotations

Function(s)

Post-translational modifications (PTM)

Domains

Quaternary structure

Similarities

Diseases, mutagenesis

Conflicts, variants

Cross-references

…

A Swiss-Prot entry

Sprot entry (cont.)

Future for human proteins

Original estimate: from 70’000 to 100’000 genes

Incyte recently announced an estimate of 140’000 genes

More recent estimates give about 30’000 to 40’000 genes

C. elegans and Drosophila have ~15’000 genes. There were two rounds of genome duplication in the evolutionary history leading to vertebrates. Very roughly, this means that:

Human genes = ~60’000 genes (15’000 × 2 × 2) - losses + new genes

But more than 1 million proteins! (due to PTM, alternative products, variants…)

http://www.ensembl.org/genesweep.html

Genesweep

What after genomes?

Proteome projects are an essential tool for the understanding of real proteins

There will be a flood of characterization data (MS, 2D) that will be the equivalent of ESTs at the protein level

Protein databases are going to be more and more important for new biological studies

Databases in GCG

DNA: EMBL, EPD, RepBase, vectordb (NCBI)

Protein: Swiss-Prot, TrEMBL, PDB

Other: PROSITE, REBASE

How to access databases in GCG?

Fetch or Typedata?

Stringsearch

Name

Lookup (based on SRS)

Useful to generate list files
