Protein Database Bioinformatics Lab

Upload: byron-small

Post on 16-Jan-2016


Page 1: Protein Database Bioinformatics Lab. Sequence Databases GenBank --DNA sequences and derived protein sequences EMBL --DNA sequences and derived protein

Protein Database

Bioinformatics Lab


Sequence Databases

• GenBank -- DNA sequences and derived protein sequences
• EMBL -- DNA sequences and derived protein sequences
• DDBJ -- DNA sequences and derived protein sequences
• SWISS-PROT -- Protein sequences
• PDB -- Three-dimensional structures of proteins


GenBank, EMBL & DDBJ

• GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

• A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.

• These three organizations exchange data on a daily basis.


GenBank, EMBL & DDBJ

• GenBank Release 122.0, Feb. 15, 2001: 10,897,000 sequence records; 11,720,000,000 bases
• EMBL Release 66, Mar. 2, 2000: 11,169,673 sequence records; 11,916,112,872 bases
• DDBJ: the center for operating DDBJ, at the National Institute of Genetics (NIG), Japan, was established in April 1995.


Protein Databases

Protein databases come in many kinds: protein sequences, motifs, classification, structure, structure alignment, and curation.

• GenBank, EMBL and DDBJ (derived sequences, http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
• SWISS-PROT, PIR (sequences)
• PROSITE, PRINTS (sequence motifs)
• HSSP, FSSP (classification, alignment)
• PDB (3-D structure)


SWISS-PROT/TrEMBL

• Annotated protein sequences
• Established in 1986
• Developed by the SWISS-PROT groups at SIB and at EBI
• Maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI))
• Website: http://www.expasy.ch/


Different Features of SWISS-PROT

• Curated protein sequence database
• Format follows that of the EMBL database as closely as possible
• Three differences from other protein sequence databases:
  1. Strives to provide a high level of annotation
  2. Minimal level of redundancy
  3. High level of integration with other databases


Three Distinct Criteria

1. Annotation

Each entry combines core data with annotation. The core data are the sequence data, the citation information (bibliographical references), and the taxonomic data (description of the biological source of the protein). The annotation describes items such as protein functions, post-translational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, and variants.


2. Minimal Redundancy

Many sequence databases contain, for a given protein sequence, separate entries that correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize redundancy. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.


3. Integration With Other Databases

• SWISS-PROT and TrEMBL - Protein sequences
• PROSITE - Protein families and domains
• SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
• SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
• SWISS-MODEL Repository - Automatically generated protein models
• CD40Lbase - CD40 ligand defects
• ENZYME - Enzyme nomenclature
• SeqAnalRef - Sequence analysis bibliographic references


SWISS-PROT/TrEMBL

• TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.

• SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152 entries
• TrEMBL Release 16.2 of 23-Mar-2001: 436,924 entries


SWISS-PROT Format

Line code  Content                        Occurrence in an entry
ID         Identification                 Once; starts the entry
AC         Accession number(s)            One or more
DT         Date                           Three times
DE         Description                    One or more
GN         Gene name(s)                   Optional
OS         Organism species               One or more
OG         Organelle                      Optional
OC         Organism classification        One or more
RN         Reference number               One or more
RP         Reference position             One or more
RC         Reference comment(s)           Optional
RX         Reference cross-reference(s)   Optional
RA         Reference authors              One or more
RT         Reference title                Optional
RL         Reference location             One or more
CC         Comments or notes              Optional
DR         Database cross-references      Optional
KW         Keywords                       Optional
FT         Feature table data             Optional
SQ         Sequence header                Once
(blanks)   Sequence data                  One or more
//         Termination line               Once; ends the entry
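The line codes above lend themselves to a very simple reader. The sketch below groups the lines of one entry by their two-letter code and joins the sequence lines. It assumes the usual flat-file layout (code in the first two columns, data starting at the sixth column); the function name `parse_swissprot_entry` and the sample entry are hypothetical and made up for illustration, not a real database record.

```python
from collections import defaultdict


def parse_swissprot_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if line.startswith("//"):           # termination line ends the entry
            break
        code, content = line[:2], line[5:].rstrip()
        if code == "  ":                    # blank code: sequence data
            sequence.append(content.replace(" ", ""))
        elif code.strip():                  # any other non-empty line code
            fields[code].append(content)
    result = dict(fields)
    result["sequence"] = "".join(sequence)
    return result


# Hypothetical, abbreviated entry (invented for this example).
example = """\
ID   EXAMPLE_HUMAN   STANDARD;      PRT;    12 AA.
AC   X00001;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

entry = parse_swissprot_entry(example)
print(entry["AC"])
print(entry["sequence"])
```

Codes that may occur more than once (AC, DE, OS and so on) are kept as lists, which matches the "One or more" column of the table.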


Access to SWISS-PROT and TrEMBL

• SRS - Access to SWISS-PROT, TrEMBL and other databases using the Sequence Retrieval System
• Full text search in SWISS-PROT and TrEMBL
• By accession number or ID (AC or ID line; SWISS-PROT and TrEMBL)
• By description or identification (any word in the DE, OS, OG, GN and ID lines; SWISS-PROT and TrEMBL)
• By author (RA line; SWISS-PROT and TrEMBL)
• By citation (RL line; SWISS-PROT only)
• Retrieve a list of SWISS-PROT/TrEMBL entries
• Randomly retrieve a SWISS-PROT/TrEMBL entry


Protein Data Bank

• PDB is an archive of three-dimensional structures of proteins; some nucleic acids are included as well.
• PDB is operated by RCSB (Research Collaboratory for Structural Bioinformatics), funded by NSF, DOE, and two units of NIH: NIGMS (National Institute of General Medical Sciences) and NLM (National Library of Medicine).
• Established at Brookhaven National Laboratory (BNL) in 1971 as an archive for biological macromolecular crystal structures.
• In the 1980s, the number of deposited structures began to increase dramatically.
• In October 1998, the management of the PDB became the responsibility of RCSB.
• Website: http://www.rcsb.org


PDB Holdings List: 27-Mar-2001

                               Molecule Type
Exp. Tech.                     Proteins, Peptides,  Protein/Nucleic  Nucleic  Carbohydrates  Total
                               and Viruses          Acid Complexes   Acids
X-ray Diffraction and other    11045                526              552      14             12137
NMR                            1832                 71               366      4              2273
Theoretical Modeling           281                  19               21       0              321
Total                          13158                616              939      18             14731

5032 Structure Factor Files
968 NMR Restraint Files


PDB Content Growth


PDB Growth in New Folds


PDB Data File Format

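• There are two main formats: PDB and CIF

• PDB is a fixed-column format: each field sits at a set column position

• CIF is free-format: values are delimited by whitespace rather than by column position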


PDB Format

• HEADER : First line of the entry, contains PDB ID code, classification, and date of deposition.
• OBSLTE : Statement that the entry has been removed from distribution and list of the ID code(s) which replaced it.
• TITLE : Description of the experiment represented in the entry.
• CAVEAT : Severe error indicator. Entries with this record must be used with care.
• COMPND : Description of macromolecular contents of the entry.
• SOURCE : Biological source of macromolecules in the entry.
• KEYWDS : List of keywords describing the macromolecule.
• EXPDTA : Experimental technique used for the structure determination.
• AUTHOR : List of contributors.
• REVDAT : Revision date and related information.
• SPRSDE : List of entries withdrawn from release and replaced by current entry.
• JRNL : Literature citation that defines the coordinate set.
• REMARK : General remarks, some are structured and some are free form.
• DBREF : Reference to the entry in the sequence database(s).
• SEQADV : Identification of conflicts between PDB and the named sequence database.
• SEQRES : Primary sequence of backbone residues.
• MODRES : Identification of modifications to standard residues.
• HET : Identification of non-standard groups or residues (heterogens).
• HETNAM : Compound name of the heterogens.
• HETSYN : Synonymous compound names for heterogens.
• FORMUL : Chemical formula of non-standard groups.
• HELIX : Identification of helical substructures.
• SHEET : Identification of sheet substructures.
• TURN : Identification of turns.
• SSBOND : Identification of disulfide bonds.
• LINK : Identification of inter-residue bonds.
• HYDBND : Identification of hydrogen bonds.
• SLTBRG : Identification of salt bridges.
• CISPEP : Identification of peptide residues in cis conformation.
• SITE : Identification of groups comprising important sites.
• CRYST1 : Unit cell parameters, space group, and Z.
• ORIGXn : Transformation from orthogonal coordinates to the submitted coordinates (n = 1, 2, or 3).
• SCALEn : Transformation from orthogonal coordinates to fractional crystallographic coordinates (n = 1, 2, or 3).
• MTRIXn : Transformations expressing non-crystallographic symmetry (n = 1, 2, or 3). There may be multiple sets of these records.
• TVECT : Translation vector for infinite covalently connected structures.
• MODEL : Specification of model number for multiple structures in a single coordinate entry.
• ATOM : Atomic coordinate records for standard groups.
• SIGATM : Standard deviations of atomic parameters.
• ANISOU : Anisotropic temperature factors.
• SIGUIJ : Standard deviations of anisotropic temperature factors.
• TER : Chain terminator.
• HETATM : Atomic coordinate records for heterogens.
• ENDMDL : End-of-model record for multiple structures in a single coordinate entry.
• CONECT : Connectivity records.
• MASTER : Control record for bookkeeping.
• END : Last record in the file.


An Example of PDB

HEADER IMMUNOGLOBULIN 09-MAY-89 2MCG 2MCG 2
COMPND IMMUNOGLOBULIN LAMBDA LIGHT CHAIN DIMER (/MCG$) 2MCG 3
COMPND 2 (TRIGONAL FORM) 2MCG 4
SOURCE HUMAN (HOMO $SAPIENS) 2MCG 5
AUTHOR K.R.ELY,J.N.HERRON,A.B.EDMUNDSON 2MCG 6
REVDAT 2 15-JUL-92 2MCGA 1 SPRSDE 2MCGA 1
SPRSDE 15-OCT-90 2MCG 1MCG 2MCGA 2
JRNL AUTH K.R.ELY,J.N.HERRON,M.HARKER,A.B.EDMUNDSON 2MCG 9
JRNL TITL THREE-DIMENSIONAL STRUCTURE OF A LIGHT CHAIN 2MCG 10
REMARK 1 REFERENCE 1 2MCG 16
REMARK 1 AUTH A.B.EDMUNDSON,K.R.ELY,J.N.HERRON,B.D.CHESON 2MCG 17
SEQRES 1 1 216 PCA SER ALA LEU THR GLN PRO PRO SER ALA SER GLY SER 2MCG 183

FORMUL 3 HOH *318(H2 O1) 2MCG 217
SSBOND 1 CYS 1 22 CYS 1 90 2MCG 218
CRYST1 72.300 72.300 185.900 90.00 90.00 120.00 P 31 2 1 6 2MCG 223
ORIGX1 0.013831 0.007985 0.000000 0.00000 2MCG 224
ORIGX2 0.000000 0.015971 0.000000 0.00000 2MCG 225
ORIGX3 0.000000 0.000000 0.005379 0.00000 2MCG 226
SCALE1 0.013831 0.007985 0.000000 0.00000 2MCG 227
SCALE2 0.000000 0.015971 0.000000 0.00000 2MCG 228
SCALE3 0.000000 0.000000 0.005379 0.00000 2MCG 229
ATOM 1 N  PCA 1 1 23.624 -24.231 101.873 1.00 17.85 2MCG 230
ATOM 2 CA PCA 1 1 23.296 -22.902 102.481 1.00 17.38 2MCG 231
ATOM 3 C  PCA 1 1 24.304 -22.495 103.531 1.00 16.74 2MCG 232
ATOM 4 O  PCA 1 1 23.962 -21.756 104.487 1.00 16.81 2MCG 233
ATOM 5 CB PCA 1 1 21.845 -23.057 103.035 1.00 18.02 2MCG 234
ATOM 6 CG PCA 1 1 21.816 -24.552 103.492 1.00 18.36 2MCG 235
ATOM 7 CD PCA 1 1 23.109 -25.217 102.974 1.00 18.57 2MCG 236
ATOM 8 OE PCA 1 1 23.354 -26.423 103.256 1.00 19.02 2MCG 237
TER 3214 SER 2 216 2MCG3443
HETATM 3215 O HOH 1 26.302 -28.430 111.973 1.00 4.66 2MCG3444
CONECT 145 144 660 2MCG3762
MASTER 170 0 0 0 0 0 0 6 3530 2 10 34 2MCGA 5
END 2MCG3773
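Because PDB is a fixed-column format, an ATOM or HETATM record is read by slicing at column positions rather than by splitting on spaces. The sketch below assumes the standard column layout of the PDB format specification (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, coordinates in 31-54, occupancy in 55-60, temperature factor in 61-66). The helper names are hypothetical, and since the transcript above has lost the original column spacing, the demo rebuilds a properly spaced line from the first ATOM record's values instead of reusing the text verbatim.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM record by column (1-based columns in comments)."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


def format_atom_record(serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Lay the fields out at fixed columns (atom-name alignment is simplified)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


# Values of the first ATOM record of 2MCG, re-spaced into fixed columns.
line = format_atom_record(1, "N", "PCA", "1", 1, 23.624, -24.231, 101.873, 1.00, 17.85)
atom = parse_atom_record(line)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Slicing keeps working when adjacent fields touch (for example a coordinate of -100.000 directly after another number), which is where whitespace splitting fails.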


Fragment of CIF example

##################### ATOM_SITE #####################
loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1 ATOM N N  GLY A 1 . -8.863  16.944 14.289 1.00 21.88 1 1 1
1 ATOM C CA GLY A 1 . -9.929  17.026 13.244 1.00 22.85 1 1 2
1 ATOM C C  GLY A 1 . -10.051 15.625 12.618 1.00 43.92 1 1 3
1 ATOM O O  GLY A 1 . -9.782  14.728 13.407 1.00 25.22 1 1 4
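A CIF loop names its columns first and then lists whitespace-separated rows, so a reader can pair each value with its tag. The sketch below is a deliberately minimal reader for one such loop; `parse_cif_loop` is a hypothetical helper, and it assumes one row per line with no quoted or multi-line values, which a full CIF parser would have to handle.

```python
def parse_cif_loop(text):
    """Read one loop_ block: tag names first, then whitespace-separated rows."""
    tags, rows = [], []
    in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):    # skip blank lines and comments
            continue
        if line == "loop_":
            in_loop = True
        elif in_loop and line.startswith("_"):  # a column tag
            tags.append(line)
        elif in_loop and tags:                  # a data row
            rows.append(dict(zip(tags, line.split())))
    return rows


# First two rows of the atom_site fragment shown above.
fragment = """\
##################### ATOM_SITE #####################
loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1 ATOM N N GLY A 1 . -8.863 16.944 14.289 1.00 21.88 1 1 1
1 ATOM C CA GLY A 1 . -9.929 17.026 13.244 1.00 22.85 1 1 2
"""

rows = parse_cif_loop(fragment)
for row in rows:
    print(row["_atom_site.id"], row["_atom_site.label_atom_id"], row["_atom_site.cartn_x"])
```

Values stay as strings here; converting coordinates with `float` is left to the caller.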


3-D Structure from PDB

• 20 Amino acids

http://www.clunet.edu/BioDev/omm/aa/aa.htm

http://www.nyu.edu/pages/mathmol/library/life/

http://inquiry.uiuc.edu/bioweb/tutorial/amino_acids.htm


Phenylalanine, Glycine, Histidine, Isoleucine, Lysine,
Leucine, Methionine, Asparagine, Proline, Glutamine,
Arginine, Serine, Threonine, Valine, Tryptophan,
Glutamic acid, Alanine, Cysteine, Aspartic acid, Tyrosine


How to Construct 3-D Molecule

• Read coordinates from PDB (find the matching structure)
• Set up data structure of molecules
• Form bonds among atoms and groups
• Calculate secondary structure
• Implement 3-D graphical algorithms
• Render 3-D graph in various styles: wires, sticks, balls, ribbons, and the like
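One possible shape for the "data structure of molecules" step is a molecule that holds residues, each of which holds its atoms by name. This is only an illustrative sketch: the class names `Atom`, `Residue` and `Molecule` are assumptions made here (they are not taken from any existing viewer), chain IDs are ignored for brevity, and the coordinates come from 2MCG ATOM records in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    serial: int
    name: str
    x: float
    y: float
    z: float


@dataclass
class Residue:
    name: str
    seq: int
    atoms: dict = field(default_factory=dict)     # atom name -> Atom


@dataclass
class Molecule:
    residues: dict = field(default_factory=dict)  # residue number -> Residue

    def add_atom(self, res_name, res_seq, atom):
        """File the atom under its residue, creating the residue on first use."""
        residue = self.residues.setdefault(res_seq, Residue(res_name, res_seq))
        residue.atoms[atom.name] = atom


mol = Molecule()
mol.add_atom("SER", 2, Atom(9, "N", 25.548, -22.930, 103.333))
mol.add_atom("SER", 2, Atom(10, "CA", 26.608, -22.758, 104.327))
mol.add_atom("ALA", 3, Atom(15, "N", 27.758, -24.228, 105.876))
print(len(mol.residues), sorted(mol.residues[2].atoms))
```

Keeping atoms addressable by residue and name makes the later steps (bonds within a residue, bonds between neighbouring residues) simple lookups.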


Bonds among atoms

ATOM 20 N   LEU 1 4 30.279 -25.716 105.041 1.00 10.60 2MCG 249
ATOM 21 CA  LEU 1 4 31.406 -26.518 104.496 1.00 9.39 2MCG 250
ATOM 22 C   LEU 1 4 32.658 -25.786 105.165 1.00 8.90 2MCG 251
ATOM 23 O   LEU 1 4 32.890 -24.586 104.967 1.00 8.74 2MCG 252
ATOM 24 CB  LEU 1 4 31.615 -26.794 103.141 1.00 8.79 2MCG 253
ATOM 25 CG  LEU 1 4 31.552 -27.440 101.860 1.00 8.37 2MCG 254
ATOM 26 CD1 LEU 1 4 32.732 -26.945 100.970 1.00 7.99 2MCG 255
ATOM 27 CD2 LEU 1 4 31.706 -28.963 102.016 1.00 8.09 2MCG 256

Leucine (LEU, L)
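The slide does not say how bonds are derived from these records. One common approach, used here purely as an assumption, is to treat two atoms as bonded when they lie closer than a distance cutoff. The sketch applies a 1.9 angstrom cutoff (an illustrative value that covers C-C, C-N and C-O bonds but would miss longer ones such as S-S) to the leucine coordinates above.

```python
from itertools import combinations
from math import dist

# Atom name -> (x, y, z) for LEU 4 of 2MCG, copied from the records above.
leu = {
    "N":   (30.279, -25.716, 105.041),
    "CA":  (31.406, -26.518, 104.496),
    "C":   (32.658, -25.786, 105.165),
    "O":   (32.890, -24.586, 104.967),
    "CB":  (31.615, -26.794, 103.141),
    "CG":  (31.552, -27.440, 101.860),
    "CD1": (32.732, -26.945, 100.970),
    "CD2": (31.706, -28.963, 102.016),
}


def find_bonds(atoms, cutoff=1.9):
    """Return atom-name pairs whose distance is below `cutoff` angstroms."""
    return [(a, b) for a, b in combinations(atoms, 2)
            if dist(atoms[a], atoms[b]) < cutoff]


bonds = find_bonds(leu)
for a, b in bonds:
    print(f"{a}-{b}: {dist(leu[a], leu[b]):.2f}")
```

With these coordinates the cutoff recovers the seven covalent bonds of the leucine residue (backbone N-CA, CA-C, C-O and the side chain from CA out to CD1 and CD2). An alternative is a per-residue bond template keyed by atom name, which avoids distance tests altogether.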


Bonds between groups

ATOM 9  N   SER 1 2 25.548 -22.930 103.333 1.00 16.05 2MCG 238
ATOM 10 CA  SER 1 2 26.608 -22.758 104.327 1.00 15.38 2MCG 239
ATOM 11 C   SER 1 2 27.351 -24.076 104.604 1.00 14.81 2MCG 240
ATOM 12 O   SER 1 2 27.530 -24.949 103.740 1.00 15.00 2MCG 241
ATOM 13 CB  SER 1 2 25.887 -22.406 105.682 1.00 15.73 2MCG 242
ATOM 14 OG  SER 1 2 25.193 -23.586 106.117 1.00 15.14 2MCG 243
ATOM 15 N   ALA 1 3 27.758 -24.228 105.876 1.00 13.72 2MCG 244
ATOM 16 CA  ALA 1 3 28.328 -25.397 106.456 1.00 12.33 2MCG 245
ATOM 17 C   ALA 1 3 29.255 -26.303 105.686 1.00 11.58 2MCG 246
ATOM 18 O   ALA 1 3 29.033 -27.552 105.641 1.00 11.28 2MCG 247
ATOM 19 CB  ALA 1 3 27.101 -26.228 106.998 1.00 12.39 2MCG 248
ATOM 20 N   LEU 1 4 30.279 -25.716 105.041 1.00 10.60 2MCG 249
ATOM 21 CA  LEU 1 4 31.406 -26.518 104.496 1.00 9.39 2MCG 250
ATOM 22 C   LEU 1 4 32.658 -25.786 105.165 1.00 8.90 2MCG 251
ATOM 23 O   LEU 1 4 32.890 -24.586 104.967 1.00 8.74 2MCG 252
ATOM 24 CB  LEU 1 4 31.615 -26.794 103.141 1.00 8.79 2MCG 253
ATOM 25 CG  LEU 1 4 31.552 -27.440 101.860 1.00 8.37 2MCG 254
ATOM 26 CD1 LEU 1 4 32.732 -26.945 100.970 1.00 7.99 2MCG 255
ATOM 27 CD2 LEU 1 4 31.706 -28.963 102.016 1.00 8.09 2MCG 256
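Consecutive residues in a chain are joined by a peptide bond between the carbonyl C of one residue and the backbone N of the next. As a sketch of how groups might be linked, the code below checks that C-to-N distance for residues 2 to 4 of 2MCG; the function name `peptide_links` and the 1.5 angstrom cutoff are assumptions for illustration.

```python
from math import dist

# Backbone N and C coordinates for residues 2-4 of 2MCG, copied from the records above.
backbone = {
    2: {"name": "SER", "N": (25.548, -22.930, 103.333), "C": (27.351, -24.076, 104.604)},
    3: {"name": "ALA", "N": (27.758, -24.228, 105.876), "C": (29.255, -26.303, 105.686)},
    4: {"name": "LEU", "N": (30.279, -25.716, 105.041), "C": (32.658, -25.786, 105.165)},
}


def peptide_links(residues, cutoff=1.5):
    """Link residue i to the next one when C(i) and N(next) are within `cutoff`."""
    links = []
    numbers = sorted(residues)
    for prev, nxt in zip(numbers, numbers[1:]):
        d = dist(residues[prev]["C"], residues[nxt]["N"])
        if d < cutoff:
            links.append((prev, nxt, d))
    return links


links = peptide_links(backbone)
for prev, nxt, d in links:
    print(f"{backbone[prev]['name']} {prev} -> {backbone[nxt]['name']} {nxt}: {d:.2f}")
```

Checking the distance, rather than assuming that neighbouring residue numbers are always bonded, also catches chain breaks where residues are missing from the coordinates.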


Nucleic Acid Database(NDB)

• The NDB Project is funded by the National Science Foundation and the Department of Energy

• The goal of the NDB Project is to assemble and distribute structural information about nucleic acids

• NDB uses the same file format as PDB.


Molvie 1.0

• A visual and interactive environment to display, analyze, fold and compare molecular structures.

• Developed by us in Java AWT.

• Runs as a Java application or as an applet embedded in a web page (http://www.cs.ucsb.edu/~mli/Bioinf/software/index.html).


Some features

• Molvie 1.0 is programmed in Java, hence it is platform-independent.

• There is no limit on the number of molecules, atoms, residues or animation frames displayed, as long as there is enough computer memory.

• Molvie has many rendering styles.

• Molvie can display two molecules simultaneously and allows the user to align secondary structure by dragging the mouse.

• Molvie also allows the user to click on part of the 3-D structure of a protein and displays the corresponding primary amino acid sequence.


Molvie Application Screen


Molvie Applet Screen


Show Molvie