An introduction to biological databases
Database or databank ?
Initially, subtle distinctions were drawn between databases and databanks (in the UK, but not in the USA), such as:
« Database management programs for the management of databanks »
Nowadays, the term « database » (db) is usually preferred
What is a database ?
A collection of…
structured
searchable (index) -> table of contents
updated periodically (release) -> new edition
cross-referenced (hyperlinks) -> links with other db
…data
Also includes the associated tools (software) needed for db access, db updating, and the insertion or deletion of db information.
Databases: a simple example
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
//
Easy to manage: all the entries are visible at the same time !
« Introduction To Database » Teacher Database (ITDTdb) (flat file, 3 entries)
Databases: a simple example (cont.)
Teacher     Accession number    Education
Amos        1                   Biochemistry
Laurent     2                   Biochemistry
M-Claude    3                   Biochemistry

Course      Date                Involved teachers
DEA         Oct-nov-dec 2000    1,3
EMBnet      Sept 2000           2,3
Relational database (« table file »):
Easier to manage; choice of the output
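The step from the flat file to the « table file » can be sketched in a few lines of Python. This is only an illustration: the function names (`parse_flat_file`, `to_tables`) are made up for this example, the URL lines of the flat file are left out for simplicity, and the resulting tables simply follow the Course fields of the flat-file entries as written there.

```python
# Flat file from the example: one "Field: value" per line, "//" ends an entry.
# (URL lines are omitted here to keep the sketch short.)
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
"""


def parse_flat_file(text):
    """Return one dict per entry, keyed by field name."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.strip() == "//":          # end-of-entry marker
            entries.append(current)
            current = {}
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return entries


def to_tables(entries):
    """Split the entries into a teacher table and a course table."""
    teachers, courses = {}, {}
    for entry in entries:
        acc = int(entry["Accession number"])
        teachers[acc] = (entry["First Name"], entry["Last Name"])
        for item in entry["Course"].split(";"):
            if item:                      # skip the empty piece after a trailing ";"
                name, date = item.split("=", 1)
                course = courses.setdefault(name, {"date": date, "teachers": []})
                course["teachers"].append(acc)
    return teachers, courses


teachers, courses = to_tables(parse_flat_file(FLAT_FILE))
for name, course in courses.items():
    print(name, course["date"], course["teachers"])
```

Once the data are in two tables linked by the accession number, the output can be chosen freely (for instance, all teachers of one course) without scanning every entry.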
Why biological databases ?
Explosive growth in biological data
Data (sequences, 3D structures, 2D gel analysis, MS analysis….) are no longer published in a conventional manner, but directly submitted to databases
Essential tools for biological research, as classical publications used to be !
Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc … !!!!
Biological databases
Some statistics
More than 1000 different databases
Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
Variable size: <100 Kb to >10 Gb
DNA: > 10 Gb
Protein: 1 Gb
3D structure: 5 Gb
Other: smaller
Update frequency: daily to annually
Categories of databases for Life Sciences
Sequences (DNA, protein) -> Primary db
Genomics
Protein domain/family -> Secondary db
Mutation/polymorphism
Proteomics (2D gel, MS)
3D structure -> Structure db
Metabolism
Bibliography
Others
Distribution of sequence databases
Books, articles     1968 -> 1985
Computer tapes      1982 -> 1992
Floppy disks        1984 -> 1990
CD-ROM              1989 -> ?
FTP                 1989 -> ?
On-line services    1982 -> 1994
WWW                 1993 -> ?
DVD                 2001 -> ?
Sequence Databases: some « technical » definitions
Data storage management:
flat file: text file
relational (e.g., Oracle)
object oriented (rare in the biological field)
Format (flat file): FASTA, GCG, NBRF/PIR, MSF… standardized format ?
Federated databases: different autonomous, redundant, heterogeneous db linked together by links/hyperlinks.
Ideal minimal content of a « sequence » db
Sequences !!
Accession number (AC)
References
Taxonomic data
ANNOTATION/CURATION
Keywords
Cross-references
Documentation
Sequence database: example
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
...
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
CC   -!- DATABASE: NAME=R&D Systems' cytokine source book;
CC       WWW="http://www.rndsystems.com/cyt_cat/epo.html".
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
...
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   PROPEP      190    193       MAY BE REMOVED IN PROCESSED PROTEIN.
FT   DISULFID     34    188
...
(SWISS-PROT flat file; callouts on the slide: reference, taxonomy, annotations, keywords, cross-references)
Sequence database: example (cont.)
FT   DISULFID     34    188
FT   DISULFID     56     60
FT   CARBOHYD     51     51       N-LINKED (GLCNAC...).
FT   CARBOHYD     65     65       N-LINKED (GLCNAC...).
FT   CARBOHYD    110    110       N-LINKED (GLCNAC...).
FT   CARBOHYD    153    153
FT   CONFLICT     40     40       E -> Q (IN CAA26095).
FT   CONFLICT     85     85       Q -> QQ (IN REF. 5).
FT   CONFLICT    140    140       G -> R (IN CAA26095).
**   Chromosomal location: 7q22
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//
(callout on the slide: sequence)
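Because every line of the SWISS-PROT flat file starts with a two-letter line code, an entry can be read with very little code. The sketch below is a simplified illustration, not the official parser: it uses a shortened copy of the EPO_HUMAN entry, and the helper name `parse_entry` is invented for this example.

```python
from collections import defaultdict

# Shortened copy of the EPO_HUMAN entry (only a few line types are kept).
ENTRY = """\
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   DISULFID     34    188
FT   DISULFID     56     60
//
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):         # end of entry
            break
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    return fields


fields = parse_entry(ENTRY)
accession = fields["AC"][0].rstrip(";")
keywords = [k.strip() for k in " ".join(fields["KW"]).rstrip(".").split(";")]
# Feature lines: key, start, end (a description may follow).
disulfides = [
    tuple(int(x) for x in ft.split()[1:3])
    for ft in fields["FT"]
    if ft.startswith("DISULFID")
]
print(accession, keywords, disulfides)
```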
Sequence database: example
…a SWISS-PROT entry, in fasta format:
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
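The FASTA record above is simple enough to read without any library. The following is a minimal sketch (the function name `read_fasta` is chosen for this example); it also splits the `sp|accession|entry name` header used in this record.

```python
FASTA = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""


def read_fasta(text):
    """Return a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


header, sequence = read_fasta(FASTA)[0]
db, accession, rest = header.split("|", 2)
entry_name = rest.split()[0]
print(db, accession, entry_name, len(sequence))
```

The sequence length agrees with the « 193 AA » given on the ID line of the full entry.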
Databases 1: nucleotide sequence
The main DNA sequence db are EMBL (Europe) / GenBank (USA) / DDBJ (Japan)
There are also specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
3D structure (DNA and RNA)
Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA editing sites; Multimedia Telomere Resource…
EMBL/GenBank/DDBJ
These 3 db contain essentially the same information, synchronized within 2-3 days (with a few differences in format and syntax)
Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
Genome projects and sequencing centers
Individual scientists
Patent offices (e.g. European Patent Office, EPO)
Non-confidential data are exchanged daily
Currently: 8.3 x 10^6 sequences, over 9.7 x 10^9 bp
Sequences from > 50’000 different species
EMBL/GenBank/DDBJ
Heterogeneous sequence length: genomes, variants, fragments…
Sequence sizes:
max 300’000 bp / entry (! genomic sequences, overlapping)
min 10 bp / entry
Archive: nothing goes out -> highly redundant !
Full of errors: in sequences, in annotations, in CDS attribution…
No consistency of annotations; most annotations are done by the submitters; the quality, completeness and updating of the information are heterogeneous
EMBL/GenBank/DDBJ
Unexpected information you can find in these db:
FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel
FT                   Castro"
Or:
FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"
EMBL entry: example
ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   SWISS-PROT; P01588; EPO_HUMAN.
XX
…
(callouts on the slide: taxonomy, cross-references, references, keyword)
EMBL entry (cont.)
CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120
(callouts on the slide: annotation, sequence)
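The feature locations in the entry above can be checked with a little arithmetic. The sketch below handles only the simple `join(a..b,c..d,…)` form seen here (no `complement`, no single-base or partial locations); the function name `span_lengths` is invented for this example.

```python
import re


def span_lengths(location):
    """Length of every a..b span in a simple join(...) location."""
    return [
        int(end) - int(start) + 1
        for start, end in re.findall(r"(\d+)\.\.(\d+)", location)
    ]


# CDS location copied from the EMBL entry of the human EPO gene.
CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"

lengths = span_lengths(CDS)
cds_length = sum(lengths)
print(lengths, cds_length, cds_length // 3)
```

The five coding segments add up to 582 bp, i.e. 194 codons, which is consistent with the 193 amino acids of the translation followed by a stop codon.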
GenBank entry: example
LOCUS       HSERPG       3398 bp    DNA             PRI       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1  GI:31224
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313 (6005), 806-810 (1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES             Location/Qualifiers
     source          1..3398
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     exon            397..627
                     /number=1
     sig_peptide     join(615..627,1194..1261)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /codon_start=1
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /db_xref="GI:312304"
                     /db_xref="SWISS-PROT:P01588"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
                     EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
                     RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI
…
GenBank entry (cont.)
                     TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
intron 628..1193
/number=1
exon 1194..1339
/number=2
mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760)
/product="erythropoietin"
intron 1340..1595
/number=2
exon 1596..1682
/number=3
intron 1683..2293
/number=3
exon 2294..2473
/number=4
intron 2474..2607
/number=4
exon 2608..3327
/note="3' untranslated region"
/number=5
BASE COUNT 698 a 1034 c 991 g 675 t
ORIGIN
1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc
DDBJ entry: example
LOCUS       HSERPG       3398 bp    DNA             HUM       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313, 806-810(1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs
FEATURES             Location/Qualifiers
     source          1..3398
                     /db_xref="taxon:9606"
                     /organism="Homo sapiens"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /db_xref="SWISS-PROT:P01588"
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
                     AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
                     QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
                     TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
…
DDBJ (cont.)
     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
                     /product="erythropoietin"
     sig_peptide     join(615..627,1194..1261)
     exon            397..627
                     /number=1
intron 628..1193
/number=1
exon 1194..1339
/number=2
intron 1340..1595
/number=2
exon 1596..1682
/number=3
intron 1683..2293
/number=3
exon 2294..2473
/number=4
intron 2474..2607
/number=4
exon 2608..3327
/note="3' untranslated region"
/number=5
BASE COUNT 698 a 1034 c 991 g 675 t
ORIGIN
1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
The tremendous increase in nucleotide sequences
EMBL data… first increase in data due to the development of PCR…
1980: 80 genes fully sequenced !
EMBL divisions
EMBL has been divided into subdatabases to allow easier data management and searches:
fun, hum, inv, mam, org, phg, pln, pro, rod, syn, unc, vrl, vrt
est, gss, htg, sts, patent
RefSeq a SWISS-PROT clone?
The NCBI Reference Sequence project (RefSeq) will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins. RefSeq standards provide a foundation for the functional annotation of the human genome. They provide a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery.
Molecule             Accession format    Genome
Complete Genome      NC_######           Archaea, Bacteria, Organelle, Virus, Viroid
Complete Chrom.      NC_######           Eukaryote
Complete Sequence    NC_######           Plasmid
Genomic Contig       NT_######           Homo sapiens
mRNA                 NM_######           Homo sapiens, Mus musculus, Rattus norvegicus
Protein              NP_######           All of the above
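The accession formats in the table above lend themselves to a simple prefix lookup. The sketch below assumes exactly the format shown in the table (two letters, an underscore, six digits); the function name `classify_refseq` and the accession `NM_123456` are made up for illustration.

```python
import re

# Prefix -> molecule type, as listed in the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome or plasmid sequence",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = re.fullmatch(r"([A-Z]{2})_(\d{6})", accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


print(classify_refseq("NM_123456"))   # hypothetical accession
print(classify_refseq("X02158"))      # an EMBL/GenBank accession, not RefSeq
```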
RefSeq a SWISS-PROT clone?
RefSeq records are created via a process consisting of:
identifying sequences that represent distinct genes
establishing the correct gene name-to-accession number association
identifying the full extent of available sequence data
creating a new RefSeq record with a status of: PREDICTED, PROVISIONAL or REVIEWED
Provisional RefSeq records are reviewed by a biologist who confirms the initial name-to-sequence association, adds information including a summary of gene function, and, more importantly, corrects, re-annotates, or extends the sequence data using data available in other GenBank records.
Databases 2: genomics
Contain information on genes, gene location (mapping), gene nomenclature and links to sequence databases;
Exist for most organisms important for life science research;
Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), etc.;
Format: generally relational (Oracle, Sybase or AceDb).
MIM
OMIM™: Online Mendelian Inheritance in Man
a catalog of human genes and genetic disorders
contains a summary of literature, pictures, and reference information. It also contains numerous links to articles and sequence information.
MIM: example
*133170 ERYTHROPOIETIN; EPO
Alternative titles; symbols
EP
TABLE OF CONTENTS
TEXT REFERENCES SEE ALSO CONTRIBUTORS CREATION DATE EDIT HISTORY
Database Links
Gene Map Locus: 7q21
Note: pressing the symbol will find the citations in MEDLINE whose text most closely matches the text of the preceding OMIM paragraph, using the Entrez
MEDLINE neighboring function.
TEXT
Human erythropoietin is an acidic glycoprotein hormone with molecular weight 34,000. As the prime regulator of red cell production, its major functions are to
promote erythroid differentiation and to initiate hemoglobin synthesis. Sherwood and Shouval (1986) described a human renal carcinoma cell line that
continuously produces erythropoietin. Eschbach et al. (1987) demonstrated the effectiveness of recombinant human erythropoietin in treating the anemia of
end-stage renal disease. Lee-Huang (1984) cloned human erythropoietin cDNA in E. coli. McDonald et al. (1986) and Shoemaker and Mitsock (1986)
cloned the mouse gene and the latter workers showed that coding DNA and amino acid sequence are about 80% conserved between man and mouse. This is
a much higher order of conservation than for various interferons, interleukin-2, and GM-CSF.
……
Ensembl
Contains all the human genome DNA sequences currently available in the public domain.
Automated annotation: using different software tools, features are identified in the DNA sequences:
Genes (known or predicted)
Single nucleotide polymorphisms (SNPs)
Repeats
Homologies
Created and maintained by the EBI and the Sanger Centre (UK)
www.ensembl.org
Database 3: protein sequence
SWISS-PROT: created in 1986 (A. Bairoch)
TrEMBL: created in 1996; complement to SWISS-PROT; derived from automated EMBL CDS translations (« proteomic » version of EMBL)
GenPept: derived from automated GenBank CDS translations and journal scans (« proteomic » version of GenBank)
PIR: Protein Information Resource
MIPS: Martinsried Institute for Protein Sequences
PIR + PATCHX (supplement of unverified protein sequences from external sources)
Database 3: protein sequence
NRL-3D: produced by PIR from PDB (3D structure) sequences
Many specialized protein databases for specific families or groups of proteins. Examples: YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), etc.
SWISS-PROT
Collaboration between the SIB (CH) and EMBL/EBI (UK)
Annotated (manually), non-redundant, cross-referenced, documented protein sequence database.
88’000 sequences from more than 6’800 different species
70’000 references (publications)
550’000 cross-references (databases)
~200 Mb of annotations
Weekly releases; available from about 50 servers across the world, the main source being ExPASy
SWISS-PROT: example
Never changed
TrEMBL (Translation of EMBL)
Computer-annotated supplement to SWISS-PROT, as it is impossible to cope with the flow of data…
Well-structured SWISS-PROT-like resource
Derived from automated EMBL CDS translation (maintained at the EBI (UK))
TrEMBL is automatically generated and annotated using software tools (not comparable with SWISS-PROT in terms of quality)
TrEMBL contains everything that is not yet in SWISS-PROT
Yerk!! But there is no choice and these software tools are becoming quite good !
The simplified story of a Sprot entry
cDNAs, genomes, …
EMBLnew, EMBL
CDS
TrEMBLnew, TrEMBL
« Automatic »
• Redundancy check (merge)
• InterPro (family attribution)
• Annotation
SWISS-PROT
« Manual »
• Redundancy (merge, conflicts)
• Annotation
• Sprot tools (macros…)
• Sprot documentation
• Medline
• Databases (MIM, MGD…)
• Brain storming
Once in Sprot, the entry is no longer in TrEMBL, but still in EMBL (archive)
SWISS-PROT introduces a new arithmetical concept !
How many sequences in SWISS-PROT + TrEMBL ?
88’000 + 300’000 = about 240’000
SWISS-PROT and TrEMBL (SPTR): a minimum of redundancy
TrEMBL divisions
TrEMBL: SPTrEMBL + REMTrEMBL
SPTrEMBL: TrEMBL entries that will eventually be integrated into SWISS-PROT, but that have not yet been manually annotated
REMTrEMBL: sequences that are not destined to be included in SWISS-PROT:
Immunoglobulins and T-cell receptors
Synthetic sequences
Patented sequences
Small fragments (<8 aa)
CDS not coding for real proteins
TrEMBLnew: updates to the latest release of TrEMBL
TrEMBL divisions
Subdivisions
Archaea              arc
Fungus               fun
Human                hum
Invertebrate         inv
Mammals              mam
Major Hist. Comp.    mhc
Organelles           org
Phage                phg
Plant                pln
Prokaryote           pro
Rodent               rod
Uncommented          unc
Viral                vrl
Vertebrate           vrt
TrEMBL: example
GenPept (translation of GenBank)
GenPept is a protein database translated from the last release of GenBank (+ journal scans)
The current release has 484’496 entries
In contrast to TrEMBL, keeps all protein sequences including small fragments (< 8 aa), immunoglobulins….
Redundancy: 20 entries for human EPO
GenPept: example
LOCUS       L33410_1 [HUMMLCMPL]
DEFINITION  Human c-mpl ligand (ML) mRNA, complete cds; erythropoietin
            homology domain bp 66..522.
DATE        07-JAN-1995
ACCESSION   L33410
NID
ORGANISM    Homo_SP_sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
COMMENT
CDS         216..1277
            /gene="ML"
            /product="c-mpl ligand"
            /protein_id="AAA59857.1"
            /db_xref="GI:506827"
WEIGHT      37823
LENGTH      353
ORIGIN
        1 MELTELLLVV MLLLTARLTL SSPAPPACDL RVLSKLLRDS HVLHSRLSQC PEVHPLPTPV
       61 LLPAVDFSLG EWKTQMEETK AQDILGAVTL LLEGVMAARG QLGPTCLSSL LGQLSGQVRL
      121 LLGALQSLLG TQLPPQGRTT AHKDPNAIFL SFQHLLRGKV RFLMLVGGST LCVRRAPPTT
      181 AVPSRTSLVL TLNELPNRTS GLLETNFTAS ARTTGSGLLK WQQGFRAKIP GLLNQTSRSL
      241 DQIPGYLNRI HELLNGTRGL FPGPSRRTLG APDISSGTSD TGSLPPNLQP GYSPSPTHPP
      301 TGQYTLFPLP PTLPTPVVQL HPLLPDPSAP TPTPTSPLLN TSYTHSQNLS QEG
//
PIR
Protein Information Resource, created in 1984
Successor of the National Biomedical Research Foundation (NBRF) protein sequence database, developed in 1965 by M. O. Dayhoff (« Atlas of Protein Sequence and Structure »)
Maintained by MIPS (Germany) and JIPID (Japan)
Provides some cross-referencing to EMBL/GenBank/DDBJ and PDB, GDB, FlyBase, OMIM, SGD, and MGD
In August 2000: 178’050 entries
Redundancy: 3 entries for human EPO
PIR: example
>P1;ZUHU
erythropoietin precursor - human
C;Species: Homo sapiens (man)
C;Date: 27-Nov-1985 #sequence_revision 27-Nov-1985 #text_change 22-Jun-1999
C;Accession: A01855; A24744; A25384; A22210; S56178
R;Jacobs, K.; Shoemaker, C.; Rudersdorf, R.; Neill, S.D.; Kaufman, R.J.; Mufson, A.; Seehra, J.; Jones, S.S.; Hewick, R.; Fritsch, E.F.; Kawakita, M.; Shimizu, T.; Miyake, T.
Nature 313, 806-810, 1985
A;Title: Isolation and characterization of genomic and cDNA clones of human erythropoietin.
A;Reference number: A01855; MUID:85137899
A;Accession: A01855
A;Molecule type: mRNA; DNA
A;Residues: 1-193
A;Cross-references: GB:X02157; GB:X02158
R;Lin, F.K.; Suggs, S.; Lin, C.H.; Browne, J.K.; Smalling, R.; Egrie, J.C.; Chen, K.K.; Fox, G.M.; Martin, F.; Stabinsky, Z.; Badrawi, S.M.; Lai, P.H.; Goldwasser, E.
Proc. Natl. Acad. Sci. U.S.A. 82, 7580-7584, 1985
A;Title: Cloning and expression of the human erythropoietin gene.
A;Reference number: A24744; MUID:86067948
A;Accession: A24744
A;Molecule type: DNA
A;Residues: 1-193
A;Cross-references: GB:M11319; NID:g182197; PIDN:AAA52400.1; PID:g182198
R;Lai, P.H.; Everett, R.; Wang, F.F.; Arakawa, T.; Goldwasser, E.
J. Biol. Chem. 261, 3116-3121, 1986
A;Title: Structural characterization of human erythropoietin.
A;Reference number: A25384; MUID:86140080
A;Accession: A25384
A;Molecule type: protein
A;Residues: 28-86,'Q',87-193
A;Experimental source: urine
A;Note: forms without the carboxyl-terminal residue and the four carboxyl-terminal residues were observed
R;Yanagawa, S.; Hirade, K.; Ohnota, H.; Sasaki, R.; Chiba, H.; Ueda, M.; Goto, M.
J. Biol. Chem. 259, 2707-2710, 1984
A;Title: Isolation of human erythropoietin with monoclonal antibodies.
A;Reference number: A22210; MUID:84135751
PIR (cont.)
A;Accession: A22210
A;Molecule type: protein
A;Residues: 28-29,'X',31-33,'L',35-50,'X',52-53,'D',55,'G',57
R;Matsumoto, S.; Ikura, K.; Ueda, M.; Sasaki, R.
Plant Mol. Biol. 27, 1163-1172, 1995
A;Title: Characterization of a human glycoprotein (erythropoietin) produced in cultured tobacco cells.
A;Reference number: S56178; MUID:95284365
A;Accession: S56178
A;Molecule type: protein
A;Residues: 28-33,'X',35-37
C;Comment: Erythropoietin is produced by kidney or liver of adult mammals and by liver of fetal or neonatal mammals.
C;Genetics:
A;Gene: GDB:EPO
A;Cross-references: GDB:119110; OMIM:133170
A;Map position: 7q21.3-7q22.1
A;Introns: 5/1; 53/3; 82/3; 142/3
C;Function:
A;Description: the primary inducer of erythrocyte formation
C;Superfamily: erythropoietin
C;Keywords: erythropoiesis; glycoprotein; hormone; kidney; liver
F;1-27/Domain: signal sequence #status predicted
F;28-193/Product: erythropoietin #status experimental
F;34-188,56-60/Disulfide bonds: #status experimental
F;51,65,110/Binding site: carbohydrate (Asn) (covalent) #status experimental
F;153/Binding site: carbohydrate (Ser) (covalent) #status experimental
>P1;ZUHU
MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
KLYTGEACRT GDR*
Composite protein sequence db
Composite db    Primary sources
NRDB            PDB, SWISS-PROT, PIR, GenPept, SP update, GenPept update
OWL             SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX           PIR, MIPS, NRL-3D, SWISS-PROT, EMBL translation, GenBank translation, Kabat (immuno), PseqIP
SPTrEMBL *      SWISS-PROT, SPTrEMBL, TrEMBLnew
Different composite db use different primary sources and different redundancy criteria in their amalgamation procedures
Redundancy priority criteria
* Also called SWall at EBI
SWIR: SPTrEMBL + Wormpep
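How a priority order between primary sources removes redundancy can be illustrated with a toy example. Everything in this sketch is an assumption made for illustration: the redundancy criterion is plain sequence identity (real composite db use different and more elaborate criteria), and the entry identifiers and three-letter sequences are invented.

```python
def merge_sources(sources):
    """Merge sources given in priority order.

    sources: list of (db_name, {entry_id: sequence}).
    For each distinct sequence, the entry from the first
    (highest-priority) source that contains it is kept.
    """
    merged = {}
    for db_name, entries in sources:
        for entry_id, seq in entries.items():
            # setdefault keeps the earlier, higher-priority entry
            merged.setdefault(seq, (db_name, entry_id))
    return merged


# Toy data: made-up identifiers and sequences, listed in priority order.
sources = [
    ("SWISS-PROT", {"sp1": "MKT", "sp2": "GAV"}),
    ("PIR", {"pir1": "MKT", "pir2": "LLQ"}),
    ("GenPept", {"gp1": "LLQ", "gp2": "GAV", "gp3": "PPW"}),
]

merged = merge_sources(sources)
for seq, (db_name, entry_id) in merged.items():
    print(seq, db_name, entry_id)
```

Seven entries collapse into four distinct sequences; changing the order of the sources changes which entry represents each sequence, which is one reason the composite db differ from each other.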
Composite: protein family
The proteins /genes are classified by superfamily/family according to Blast/Fasta (homology) results
General:
ProtFam: PIR
ProtoMap: SWISS-PROT
SYSTERS: SWISS-PROT and PIR (non redundant)
ProClass: PIR and PROSITE
Species specific:
HOVERGEN: vertebrates
HOBACGEN: bacteria
COG: complete organism genome
ProtoMap: example
ProtoMap (cont.)
Database 4: protein domain/family
Contains biologically significant « patterns / profiles / HMMs » formulated in such a way that, with appropriate computational tools, they can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs
-> tools to identify the function of uncharacterized proteins translated from genomic or cDNA sequences (« functional diagnostic »)
Protein domain/family
Most proteins have a « modular » structure
Estimation: ~ 3 domains / protein
Domains (conserved sequences or structures) are identified by multiple sequence alignments
Domains can be defined by different methods:
Pattern (regular expression): used for very conserved domains
Profiles (weighted matrices): two-dimensional tables of position-specific match, gap, and insertion scores, derived from aligned sequence families; used for less conserved domains
Hidden Markov Model (HMM): probabilistic models; another method to generate profiles
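The idea of a weighted matrix can be made concrete with a very reduced sketch: each profile position carries one match score per residue, and an ungapped segment is scored by adding up the score of its residue at each position. The three rows below are the first three match rows of the PROSITE BTB profile (PS50097) with its 22-letter alphabet; gap and insertion scores, cut-offs and normalization are deliberately ignored, and the function name `score_ungapped` is invented for this example.

```python
# Alphabet of the profile: one score column per letter.
ALPHABET = "ABCDEFGHIKLMNPQRSTVWYZ"

# First three match rows of the BTB profile (consensus residues C, D, V).
MATCH_ROWS = [
    [-6, -10, 28, -14, -9, -15, -20, -14, -19, -15, -17,
     -14, -8, -19, -14, -15, 0, 0, -9, -32, -17, -12],
    [-16, 41, -28, 53, 15, -34, -11, -1, -33, 0, -27,
     -25, 21, -11, 0, -8, 2, -6, -26, -38, -19, 7],
    [2, -23, -8, -28, -24, -1, -24, -25, 16, -20, 7,
     6, -20, -25, -23, -20, -10, -4, 24, -23, -9, -24],
]


def score_ungapped(segment, rows=MATCH_ROWS):
    """Sum of the position-specific match scores (no gaps, no insertions)."""
    if len(segment) != len(rows):
        raise ValueError("segment and profile must have the same length")
    return sum(row[ALPHABET.index(res)] for res, row in zip(segment, rows))


print(score_ungapped("CDV"))   # consensus residues at all three positions
print(score_ungapped("CEI"))   # conservative substitutions still score well
print(score_ungapped("WWW"))   # unrelated residues are penalized
```

Unlike a pattern, which either matches or not, the profile gives a graded score, which is why it suits less conserved domains.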
Some statistics
15 most common protein domains for H. sapiens (Incomplete)
Immunoglobulin and major histocompatibility complex domain
Eukaryotic protein kinase
Zinc finger, C2H2 type
Rhodopsin-like GPCR superfamily
Src homology 3 (SH3) domain
RNA-binding region RNP-1 (RNA recognition motif)
Fibronectin type III domain
Pleckstrin homology (PH) domain
Homeobox domain
Major histocompatibility complex protein, Class I
EF-hand family
EGF-like domain
RING finger
Cadherin domain
PDZ domain (also known as DHR or GLGF)
Serine proteases, trypsin family
http://www.ebi.ac.uk/proteome/HUMAN/interpro/top15d.html
Protein domain/family db
Secondary databases are the fruit of analyses of the sequences found in the primary db
Either manually curated (e.g. PROSITE, Pfam, etc.) or automatically generated (e.g. ProDom, DOMO)
Some depend on the method used to detect if a protein belongs to a particular domain/family (patterns, profiles, HMM)
Protein domain/family db
Secondary db    Primary source        Information
PROSITE         SWISS-PROT            Patterns (regular expressions)
PROSITE         SWISS-PROT            Profiles (weighted matrices)
PRINTS          OWL and SWISS-PROT    Aligned motifs (fingerprints)
Pfam            SWISS-PROT            HMM (Hidden Markov Models)
BLOCKS          PROSITE/PRINTS        Aligned motifs
IDENTIFY        BLOCKS/PRINTS         Fuzzy regular expressions
Prosite
Created in 1988 (SIB)
Contains fully annotated functional domains, based on two methods: patterns and profiles
Entries are deposited in PROSITE in two distinct files:
Patterns/profiles, with the lists of all matches in the parent version of SWISS-PROT
Documentation
Aug 2000: contains 1064 documentation entries that describe 1424 different patterns, rules and profiles/matrices.
Prosite (pattern): example
ID   EPO_TPO; PATTERN.
AC   PS00817;
DT   OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Erythropoietin / thrombopoeitin signature.
PA   P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=1;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,disulfide; /SITE=11,disulfide;
DR   P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR   P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR   P07321, EPO_MOUSE , T; P49157, EPO_PIG   , T; P29676, EPO_RAT   , T;
DR   P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR   P40226, TPO_MOUSE , T; P49745, TPO_RAT   , T;
DR   P42706, TPO_PIG   , P;
DO   PDOC00644;
//
Diagnostic performance
List of matches
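A PROSITE pattern such as the one on the PA line above is essentially a regular expression written in another syntax. The sketch below translates the common pattern elements into Python `re` syntax and scans the EPO_HUMAN sequence; the function name `prosite_to_regex` is invented for this example, and the translation covers only the basic syntax (`x`, `[..]`, `{..}`, repeats, `<` and `>` anchors).

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":                    # any residue
            core = "."
        elif core.startswith("{"):         # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


EPO_HUMAN = (
    "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE"
    "NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA"
    "VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD"
    "AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
)

regex = prosite_to_regex("P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.")
hit = re.search(regex, EPO_HUMAN)
print(regex)
print(hit.start() + 1, hit.end(), hit.group())   # 1-based start and end
```

The match runs from residue 29 to residue 56; the two cysteines of the pattern fall on positions 34 and 56, the positions annotated as disulfide-bonded in the SWISS-PROT entry.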
Prosite (profile): example
PROSITE: PS50097
ID BTB; MATRIX.
AC PS50097;
DT DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE).
DE BTB domain profile.
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67;
MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62;
MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE';
MA /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!';
MA /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?';
MA /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2;
MA /I: B1=0; BI=-105; BD=-105;
MA /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;
MA /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7;
MA /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24;
MA /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13;
MA /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24;
MA /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9;
MA /M: SY='V'; M=1,-25,-3,-29,-25,-2,-26,-26,17,-22,10,7,-23,-25,-23,-22,-11,-3,24,-27,-10,-25;
MA /M: SY='D'; M=-6,7,-26,8,7,-25,6,-7,-27,0,-23,-17,8,-13,0,-3,3,-6,-23,-27,-17,3;
MA /I: I=-5; MI=0; IM=0; DM=-15; MD=-15;
MA /M: SY='G'; M=-6,8,-27,8,-3,-27,22,-7,-30,-8,-26,-19,10,-14,-8,-9,2,-9,-24,-28,-21,-6;
MA /M: SY='K'; M=-7,-4,-23,-4,7,-23,-13,-2,-21,10,-18,-9,-3,-12,7,9,-2,-4,-16,-25,-12,6;
MA /M: SY='E'; M=-8,-6,-21,-8,1,-15,-21,-7,-7,-1,-10,-5,-3,-14,0,-1,-2,-2,-6,-26,-9,-1;
MA /M: SY='F'; M=-12,-28,-22,-34,-26,31,-31,-21,18,-26,16,9,-22,-27,-27,-21,-20,-9,14,-6,13,-26;
MA /M: SY='R'; M=-13,-9,-24,-10,-3,-11,-21,7,-17,7,-16,-4,-4,-8,2,9,-9,-9,-16,-20,-1,-2;
MA /M: SY='A'; M=21,-15,-8,-22,-17,-10,-10,-23,0,-15,-5,-5,-14,-18,-17,-19,4,6,12,-24,-15,-17;
MA /M: SY='H'; M=-15,5,-22,2,-1,-20,-16,65,-26,-8,-21,-5,15,-19,6,-2,-2,-11,-26,-32,7,0;
MA /M: SY='K'; M=-12,-5,-29,-5,5,-25,-18,-8,-26,34,-24,-9,-1,-14,8,34,-8,-8,-17,-20,-10,5;
MA /M: SY='A'; M=4,-12,-12,-16,-10,-6,-18,-14,-2,-13,-1,-2,-11,-17,-12,-13,-3,1,2,-24,-8,-11;
MA /M: SY='V'; M=-7,-26,-19,-31,-26,7,-32,-24,27,-23,14,11,-22,-25,-23,-22,-13,0,28,-19,3,-26;
MA /M: SY='L'; M=-10,-30,-20,-30,-21,9,-30,-20,22,-29,47,20,-29,-29,-20,-20,-29,-10,12,-20,0,-21;
MA /M: SY='A'; M=18,-6,0,-12,-8,-18,-6,-16,-15,-10,-18,-12,-2,-14,-8,-13,18,11,-5,-32,-19,-8;
….
Prosite (profile): example (cont.)
……
MA /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5;
MA /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11;
MA /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8;
MA /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20;
MA /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1;
MA /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20;
MA /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3;
MA /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7;
MA /I: E1=0; IE=-105; DE=-105;
NR /RELEASE=39,87397;
NR /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0);
NR /FALSE_NEG=0; /PARTIAL=0;
CC /TAXO-RANGE=??E?V; /MAX-REPEAT=2;
DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T;
DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T;
DR Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T;
DR Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T;
DR P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T;
DR P42284, LOLS_DROME, T; O14682, PI10_HUMAN, T; Q05516, PLZF_HUMAN, T;
DR O43791, SPOP_HUMAN, T; P42282, TTKA_DROME, T; P17789, TTKB_DROME, T;
DR P21073, VA55_VACCC, T; P24768, VA55_VACCV, T; P21037, VC02_VACCC, T;
DR P17371, VC02_VACCV, T; P32228, VC04_SPVKA, T; P32206, VC13_SPVKA, T;
DR P21013, VF03_VACCC, T; P24357, VF03_VACCV, T; P22611, VMT8_MYXVL, T;
DR P08073, VMT9_MYXVL, T; O43167, Y441_HUMAN, T; Q10225, YAZ4_SCHPO, T;
DR P40560, YIA1_YEAST, T; P34324, YKV2_CAEEL, T; P34371, YLJ8_CAEEL, T;
DR P34568, YNV5_CAEEL, T; P41886, YPT9_CAEEL, T; Q09563, YR47_CAEEL, T;
DR Q10017, YSW1_CAEEL, T; Q13105, Z151_HUMAN, T; Q60821, Z151_MOUSE, T;
DR P24278, ZN46_HUMAN, T;
DR Q13829, TNP1_HUMAN, ?;
DO PDOC50097;
//
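Reading such an entry by program: a minimal sketch, not the official PROSITE tooling. It assumes each record starts with a two-letter line code (ID, AC, DE, DR…) and that the entry ends with «//», as in the example above. The function names are made up for this illustration. The score formula is also an assumption: FUNCTION=LINEAR is taken to mean N_SCORE = R1 + R2 x raw score, which agrees with the two CUT_OFF lines of the entry (363 -> 8.5 and 267 -> 6.5).

```python
from collections import defaultdict

# Abridged copy of the PS50097 entry (MA lines and most DR lines omitted).
SAMPLE_ENTRY = """\
ID BTB; MATRIX.
AC PS50097;
DE BTB domain profile.
DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T;
DR Q13829, TNP1_HUMAN, ?;
DO PDOC50097;
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry terminator
            break
        if not line.strip():
            continue
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    return dict(fields)


def parse_dr(dr_lines):
    """Split DR lines into (accession, entry_name, status) triples."""
    hits = []
    for line in dr_lines:
        for chunk in line.split(";"):
            parts = [p.strip() for p in chunk.split(",")]
            if len(parts) == 3:
                hits.append(tuple(parts))
    return hits


def normalized_score(raw_score, r1=0.9751, r2=0.02068202):
    """Assumed linear normalization; R1 and R2 are copied from the MA line."""
    return r1 + r2 * raw_score


entry = parse_entry(SAMPLE_ENTRY)
hits = parse_dr(entry["DR"])
print(entry["AC"], len(hits), "cross-references")
for raw in (363, 267):
    print(raw, "->", round(normalized_score(raw), 1))
```

The same line-code logic applies to other flat-file db of this family, which is why a single indexing tool can handle many of them.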
PRINTS
Compendium of protein motif fingerprints
Most protein families are characterized by several conserved motifs
Fingerprint: set of motif(s) (simple or composite, such as multidomains) = signature of family membership
True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part
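The fingerprint idea can be sketched in a few lines. This is only an illustration of the all-motifs versus partial-match rule stated above, not the actual PRINTS scoring method; the motif names and the function name are hypothetical.

```python
def classify_by_fingerprint(fingerprint, matched_motifs):
    """Compare the motifs found in a sequence with a family fingerprint."""
    found = [motif for motif in fingerprint if motif in matched_motifs]
    if len(found) == len(fingerprint):
        return "family member"
    if found:
        return "possible subfamily member"
    return "no match"


# Hypothetical fingerprint made of four motifs.
FINGERPRINT = ["motif1", "motif2", "motif3", "motif4"]

print(classify_by_fingerprint(FINGERPRINT, {"motif1", "motif2", "motif3", "motif4"}))
print(classify_by_fingerprint(FINGERPRINT, {"motif2", "motif3"}))
print(classify_by_fingerprint(FINGERPRINT, set()))
```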
ProDom
Consists of an automated compilation of homologous domain alignments (procedure based on PSI-BLAST searches)
Updating problem !
Last ProDom update: February 7, 2000
Built from SWISS-PROT 38 + TREMBL + TREMBL updates - October 22, 1999
ProDom: example
Protein domain/family: Composite databases
Example: InterPro
Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites;
Single set of «documents» linked to the various methods;
Will be used to improve the functional annotation of SWISS-PROT (classification of unknown proteins…)
This release contains 3052 entries, representing 574 domains, 2418 families, 46 repeats and 14 post-translational modification sites.
InterPro: example
IPR001323
Name: Erythropoietin/thrombopoeitin
Type: Family
Abstract:
Erythropoietin, a plasma glycoprotein, is the primary physiological mediator of erythropoiesis [1]. It is involved in the regulation of the level of peripheral erythrocytes by stimulating the differentiation of erythroid progenitor cells, found in the spleen and bone marrow, into mature erythrocytes [2]. It is primarily produced in adult kidneys and foetal liver, acting by attachment to specific binding sites on erythroid progenitor cells, stimulating their differentiation [3]. Severe kidney dysfunction causes reduction in the plasma levels of erythropoietin, resulting in chronic anaemia - injection of purified erythropoietin into the blood stream can help to relieve this type of anaemia. Levels of erythropoietin in plasma fluctuate with varying oxygen tension of the blood, but androgens and prostaglandins also modulate the levels to some extent [3]. Erythropoietin glycoprotein sequences are well conserved, a consequence of which is that the hormones are cross-reactive among mammals, i.e. that from one species, say human, can stimulate erythropoiesis in other species, say mouse or rat [4].
Thrombopoeitin (TPO), a glycoprotein, is the mammalian hormone which functions as a megakaryocytic lineage specific growth and differentiation factor affecting the proliferation and maturation from their committed progenitor cells acting at a late stage of megakaryocyte development. It acts as a circulating regulator of platelet numbers.
….
InterPro: example (cont.)
Example list: P33708, P33709, P49745
Publications:
1. Shoemaker C.B., Mitsock L.D. 849-858 (1986)
2. Takeuchi M., Takasaki S., Miyazaki H., Kato T., Hoshi S., Kochibe N., Kobata A. J. Biol. Chem. 263: 3657-3663 (1988)
3. Lin F.K., Lin C.H., Lai P.H., Browne J.K., Egrie J.C., Smalling R., Fox G.M., Chen K.K., Castro M., Suggs S. Gene 44: 201-209 (1986)
4. Nagao M., Suga H., Okano M., Masuda S., Narita H., Ikura K., Sasaki R. Nucleotide sequence of rat erythropoietin. 1171: 99-102 (1992)
Children: IPR003013
Signatures: PROSITE PS00817 EPO_TPO; PFAM PF00758 EPO_TPO
Databases 5: mutation/polymorphism
Contain information on sequence variations, whether or not they are linked to genetic diseases;
Mainly human, but see also: OMIA - Online Mendelian Inheritance in Animals
General db:
OMIM
HGMD - Human Gene Mutation db
SVD - Sequence variation db
HGBASE - Human Genic Bi-Allelic Sequences db
dbSNP - Human single nucleotide polymorphism (SNP) db
Disease-specific db: most of these databases are linked either to a single gene or to a single disease;
p53 mutation db
ADB - Albinism db (mutations in human genes causing albinism)
Asthma and Allergy gene db
….
Mutation/polymorphisms: definitions
SNPs: single nucleotide polymorphisms
c-SNPs: coding single nucleotide polymorphisms (single nucleotide polymorphisms within cDNA sequences)
SAPs: single amino-acid polymorphisms
Missense mutation: -> SAP
Nonsense mutation: -> STOP
Insertion/deletion of nucleotides -> frameshift…
! Numbering of the mutation depends on the db (aa no 1 is not necessarily the initiator Met !)
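The definitions above can be illustrated with a small sketch. It assumes the standard genetic code and single-codon substitutions; the function names, the example codons and the 25-residue offset are hypothetical and chosen only for illustration. An insertion/deletion is treated as a frameshift only when its length is not a multiple of three.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a nucleotide substitution by its effect on the encoded residue."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"  # -> STOP
    return "missense"      # -> SAP


def classify_indel(length):
    """An insertion/deletion shifts the reading frame unless it is a multiple of 3."""
    return "in-frame" if length % 3 == 0 else "frameshift"


def to_precursor_numbering(position_in_mature_chain, residues_removed):
    """Convert a position counted on the mature chain to precursor numbering."""
    return position_in_mature_chain + residues_removed


print(classify_substitution("GTG", "GCG"))  # Val -> Ala
print(classify_substitution("TGG", "TGA"))  # Trp -> stop
print(classify_indel(1), classify_indel(3))
print(to_precursor_numbering(10, 25))       # hypothetical 25-residue signal peptide
```

The last function shows why the same mutation can carry different numbers in different db: the count may start at the initiator Met, at the first residue of the mature chain, or elsewhere.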
Mutation/polymorphisms
The SNP Consortium: http://snp.cshl.org/
Bayer, Roche, IBM, Pfizer, Novartis, Motorola……
Mission: identify up to 300,000 SNPs distributed evenly throughout the human genome and make the information related to these SNPs available to the public without intellectual property restrictions. The project started in April 1999 and is expected to continue until the end of 2001.
dbSNP at NCBI: http://www.ncbi.nlm.nih.gov/SNP/
Collaboration between the National Human Genome Research Institute and the National Center for Biotechnology Information (NCBI)
Mission: central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms
As of Aug 24, 2000, dbSNP has submissions for 803,557 SNPs.
Chromosome 21 dbSNP: http://csnp.isb-sib.ch/
A joint project between the Division of Medical Genetics of the University of Geneva Medical School and the SIB
Mission: comprehensive cSNP (single nucleotide polymorphisms within cDNA sequences) database and map of chromosome 21
Mutation/polymorphisms
Very heterogeneous format;
Generally modest size;
There are initiatives to standardize and to unify these databases (SVD - Sequence Variation Database project at EBI: HMutDB)
Databases 6: proteomics
Contain information obtained by 2D-PAGE: master images of the gels and descriptions of the identified proteins
Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
Format: composed of image and text files
Most 2D-PAGE databases are “federated” and use SWISS-PROT as a master index
There is currently no protein mass spectrometry (MS) database (not for long…)
Databases 7: 3D structure
Contain the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies
Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)
PDB (Protein Data Bank), SCOP (structural classification of proteins, according to the secondary structures), BMRB (BioMagResBank; NMR results)
Future: Homology-derived 3D structure db.
PDB
Protein Data Bank, managed by RCSB
Currently there are ~13’000 structures for about 4’000 different molecules, but far fewer protein families !
There are also databases that contain data derived from PDB. Examples: HSSP (homology-derived secondary structure of proteins), SWISS-3DIMAGE (images)…
Restriction enzyme
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
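Using the coordinates: a minimal sketch of how the ATOM records above can be read. Real PDB files are fixed-column, so splitting on whitespace is an assumption that only holds for simple records like the ones shown here (no chain identifier, no fused columns). The function name is made up for this illustration. The distance computed between the N and CA atoms of TRP 5 should come out close to a typical N-CA bond length.

```python
import math

# First three ATOM records of entry 12CA, as shown on the slide.
ATOM_LINES = """\
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
"""


def parse_atom(line):
    """Parse one whitespace-separated ATOM record into a dictionary."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "res_name": f[3],
        "res_seq": int(f[4]),
        "xyz": (float(f[5]), float(f[6]), float(f[7])),  # coordinates in Angstroms
        "occupancy": float(f[8]),
        "b_factor": float(f[9]),
    }


atoms = {atom["name"]: atom for atom in map(parse_atom, ATOM_LINES.splitlines())}
n_ca_distance = math.dist(atoms["N"]["xyz"], atoms["CA"]["xyz"])
print(f"N-CA distance in TRP 5: {n_ca_distance:.2f} Angstroms")
```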
Databases 8: metabolic
Contain information that describes enzymes, biochemical reactions and metabolic pathways;
ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;
Examples of metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;
Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.
Databases 9: bibliographic
Bibliographic reference databases contain citations and abstracts of published life science articles;
Example: Medline
Other more specialized databases also exist (example: Agricola).
Medline
MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
Covers more than 4,000 biomedical journals published in the United States and 70 other countries
Contains over 10 million citations, from 1966 to the present
Contains links to biological db and to some journals
New records are added to PreMEDLINE daily
! Many papers not dealing with humans are not in Medline
! Before 1970, only the first 10 authors are kept
! Not all journals have citations going back to 1966
Medline/Pubmed
PubMed is developed by the National Center for Biotechnology Information (NCBI)
PubMed provides access to bibliographic information such as MEDLINE, PreMEDLINE, HealthSTAR, and to integrated molecular biology databases (composite db)
PMID: 10923642 (PubMed ID), UI: 20378145 (Medline ID)
Databases 10: others
There are many databases that cannot be classified in the categories listed previously;
Examples: ReBase (restriction enzymes), TRANSFAC (transcription factors), O-GLYCBASE (O-linked sugars), protein-protein interactions db (DIP), biotechnology patents db, etc.;
As well as many other resources concerning any aspect of macromolecules and molecular biology.
Proliferation of databases
What is the best db for sequence analysis ?
Which contains the highest-quality data ?
Which is the most comprehensive ?
Which is the most up-to-date ?
Which is the least redundant ?
Which is the most indexed (allows complex queries) ?
Which Web server responds most quickly ?
…….??????
Some important practical remarks
Databases: many errors (automated annotation) !
Not all db are available on all servers
The update frequency is not the same for all servers; creation of db_new between releases (example: EMBLnew; TrEMBLnew….)
Some servers automatically add useful cross-references to an entry (implicit links) in addition to already existing links (explicit links)
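Explicit versus implicit links, as a rough sketch. The assumption here is that a server builds implicit links from the accession number of an entry with one URL template per target db; the templates below use the placeholder host example.org and are not real addresses, and the function name is hypothetical.

```python
# Hypothetical URL templates; real servers use their own addresses.
IMPLICIT_LINK_TEMPLATES = {
    "GeneCards": "https://example.org/genecards?acc={acc}",
    "ProDom": "https://example.org/prodom?acc={acc}",
}


def all_links(accession, explicit_links):
    """Merge the links stored in the entry with links derived from its accession."""
    links = {db: (url, "explicit") for db, url in explicit_links.items()}
    for db, template in IMPLICIT_LINK_TEMPLATES.items():
        # Explicit links win: an implicit link is added only if none exists yet.
        links.setdefault(db, (template.format(acc=accession), "implicit"))
    return links


links = all_links("P33708", {"PROSITE": "https://example.org/prosite/PS00817"})
for db, (url, kind) in sorted(links.items()):
    print(f"{db:10s} {kind:8s} {url}")
```

Because implicit links are computed when the entry is displayed, the same entry can show different links on different servers.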
Database retrieval tools
Sequence Retrieval System (SRS, Europe): allows any flat-file db to be indexed to any other; allows queries to be formulated across a wide range of different db types via a single interface, without having to worry about data structures, query languages…
Entrez (USA): less flexible than SRS but exploits the concept of « neighbouring », which allows related articles in different db to be linked together, whether or not they are cross-referenced directly
ATLAS: specific for macromolecular sequence db (e.g. NRL-3D)
….
More information about SWISS-PROT
The golden goals of SWISS-PROT
Annotated / curated
Complete
Non-redundant
Highly cross-referenced
Available from a variety of servers and through sequence analysis software tools
Associated with a wide range of documentation
Review: Protein sequence databases. R. Apweiler (2000), Adv. in protein chemistry, 54, 31-70
SWISS-PROT: species
6’840 different species
20 species represent about 45% of all sequences in the database
5’000 species are only represented by one to three sequences. In most cases, these are sequences which were obtained in the context of a phylogenetic study
SWISS-PROT: cross-references
SWISS-PROT was the first database with cross-references.
Explicitly cross-referenced to 34 databases
Cross-ref to DNA (EMBL/GenBank/DDBJ), 3D-structure (PDB), literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D-gel (SWISS-2DPAGE), specialized db (PROSITE, TRANSFAC)
Implicitly cross-referenced to additional db on the WWW (GeneCards, PRODOM, etc.)
Annotations
Function(s)
Post-translational modifications (PTM)
Domains
Quaternary structure
Similarities
Diseases, mutagenesis
Conflicts, variants
Cross-references
…
A Swiss-Prot entry
Sprot entry (cont.)
Future for human proteins
Original estimate: from 70’000 to 100’000 genes
Incyte recently announced an estimate of 140’000 genes
More recent estimates give about 30’000 to 40’000 genes
C. elegans and Drosophila have ~15’000 genes. There were two rounds of genome duplication in the evolutionary history leading to vertebrates. Very roughly, this means that:
Human genes = ~60’000 genes (15’000 x 2 x 2) - losses + new genes
But more than 1 million proteins ! (due to PTM, alternative products, variants…)
http://www.ensembl.org/genesweep.html
Genesweep
What after genomes?
Proteome projects are an essential tool for the understanding of real proteins
There will be a flood of characterization data (MS, 2D) that will be the equivalent of ESTs at the protein level
Protein databases are going to be more and more important for new biological studies
Databases in GCG
DNA: EMBL, EPD, RepBase, vectordb (NCBI)
Protein: Swiss-Prot, TrEMBL, PDB
Other: PROSITE, REBASE
How to access databases in GCG?
Fetch or typedata ?
Stringsearch
Name
Lookup (based on SRS)
Useful to generate list files