Structure Databases. DNA/Protein structure-function analysis and prediction. Lecture 6. Bioinformatics Section, Vrije Universiteit, Amsterdam

Upload: elaine-joubert

Post on 14-Dec-2015

TRANSCRIPT

Page 1: Structure Databases DNA/Protein structure-function analysis and prediction Lecture 6 Bioinformatics Section, Vrije Universiteit, Amsterdam

Structure Databases

DNA/Protein structure-function analysis and prediction

Lecture 6

Bioinformatics Section, Vrije Universiteit, Amsterdam

Page 2:

The dictionary definition

Main Entry: da·ta·base
Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
Function: noun
Date: circa 1962

: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)

- Webster dictionary

Page 3:

WHAT is a database?

A collection of data that needs to be:

Structured
Searchable
Updated (periodically)
Cross-referenced

Challenge: to change “meaningless” data into useful information that can be accessed and analysed in the best way possible.

For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?

You need an appropriate database management system (DBMS)

Page 4:

DBMS

A set of programs that store, extract and modify data

Internal organization: controls speed and flexibility

[Diagram: USER(S), the DBMS operations (store, extract, modify) and the Database]

Page 5:

DBMS organisation types

Flat file databases (flat DBMS): simple, restrictive, table
Hierarchical databases (hierarchical DBMS): simple, restrictive, tables
Relational databases (RDBMS): complex, versatile, tables
Object-oriented databases (ODBMS): complex, versatile, objects

Page 6:

Relational databases

Data is stored in multiple related tables

Data relationships across tables can be either many-to-one or many-to-many

A few rules allow the database to be viewed in many ways

Let's convert the “course details” to a relational database

Page 7:

Our flat file database

Course details (FLAT DATABASE 2)

Name       Depart.    Course   E1  E2  E3  P1  P2
Student 1  Chemistry  Biology  A   B   B   A   C   …
Student 2  Ecology    Maths    A   D   A   A   A   …
Student 2  Ecology    Biology  A   B   A   A   A   …
Student 1  Chemistry  English  A   A   A   A   A   …
Student 1  Chemistry  Maths    C   C   B   A   A   …
..

Page 8:

Normalize (1NF) …

We remove repeating records (rows)

sID  Name      dID
1    Student1  1
2    Student2  2

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

sID  cID  E1  E2  E3  P1  P2
1    1    A   B   B   A   C   …
2    2    A   D   A   A   A   …
2    1    A   B   A   A   A   …
1    3    A   A   A   A   A   …
1    2    C   C   B   A   A   …
..

(Slide legend: primary keys, foreign keys)

Page 9:

Normalize (2NF) …

We remove redundant fields (columns)

sID  Name      dID
1    Student1  1
2    Student2  2

cID  Course
1    Biology
2    Maths
3    English

gID  Grade
1    A
2    B
3    C

dID  Department
1    Chemistry
2    Ecology

wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

sID  cID  gID  wID
1    1    1    1
1    1    2    2
1    1    2    3
1    1    1    4
1    1    3    5
2    1    1    1
2    1    2    2
2    1    1    3
2    1    1    4
2    1    1    5
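The normalization steps on these slides can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (student, course, department, grade, work, result) and the helper functions are hypothetical, the slides do not name their tables, and only three of the flat-file rows are loaded.

```python
import sqlite3

# Flat-file rows from the slides: name, department, course, grades for E1 E2 E3 P1 P2.
FLAT_ROWS = [
    ("Student1", "Chemistry", "Biology", "ABBAC"),
    ("Student2", "Ecology", "Maths", "ADAAA"),
    ("Student2", "Ecology", "Biology", "ABAAA"),
]
WORK_ITEMS = ["E1", "E2", "E3", "P1", "P2"]


def get_id(con, table, value):
    """Return the primary key of value in a lookup table, inserting it if new."""
    row = con.execute(f"SELECT id FROM {table} WHERE name = ?", (value,)).fetchone()
    if row:
        return row[0]
    return con.execute(f"INSERT INTO {table} (name) VALUES (?)", (value,)).lastrowid


def build_database(flat_rows):
    """Split the flat rows into lookup tables plus one linking table (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    for table in ("department", "course", "grade", "work"):
        con.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    con.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT UNIQUE, "
        "dID INTEGER REFERENCES department(id))"
    )
    con.execute(
        "CREATE TABLE result (sID INTEGER REFERENCES student(id), "
        "cID INTEGER REFERENCES course(id), gID INTEGER REFERENCES grade(id), "
        "wID INTEGER REFERENCES work(id))"
    )
    for name, dept, course, grades in flat_rows:
        d_id = get_id(con, "department", dept)
        con.execute("INSERT OR IGNORE INTO student (name, dID) VALUES (?, ?)", (name, d_id))
        s_id = con.execute("SELECT id FROM student WHERE name = ?", (name,)).fetchone()[0]
        c_id = get_id(con, "course", course)
        for work, grade in zip(WORK_ITEMS, grades):
            con.execute(
                "INSERT INTO result VALUES (?, ?, ?, ?)",
                (s_id, c_id, get_id(con, "grade", grade), get_id(con, "work", work)),
            )
    return con


def grades_for(con, student, course):
    """Join the tables back together to recover one flat-file row."""
    return con.execute(
        "SELECT work.name, grade.name FROM result "
        "JOIN student ON student.id = result.sID "
        "JOIN course ON course.id = result.cID "
        "JOIN grade ON grade.id = result.gID "
        "JOIN work ON work.id = result.wID "
        "WHERE student.name = ? AND course.name = ? ORDER BY work.id",
        (student, course),
    ).fetchall()
```

Calling `grades_for(build_database(FLAT_ROWS), "Student2", "Biology")` rebuilds that student's row from the key tables, which is the point of the exercise: each name is stored once and the linking table holds only keys.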

Page 10:

Relational Databases

What have we achieved?

No repeating information
Less storage space
Better reality representation
Easy modification/management
Easy usage of any combination of records

Remember: the DBMS has programs to access and edit this information, so ignore the human reading limitation of the primary keys

Page 11:

Accessing database information

A request for data from a database is called a query

Queries can be of three forms:

Choose from a list of parameters
Query by example (QBE)
Query language

Page 12:

Query Languages

The standard: SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)

Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.

Standard interactive and programming language for getting information from and updating a database.

RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
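As a small illustration of SQL being used both to get information from a database and to update it, the sketch below runs a SELECT and an UPDATE through Python's built-in sqlite3 module. The `course` table reuses the course names from the earlier slides; the function names and the renamed course are assumptions made for the example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database holding a small course table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE course (cID INTEGER PRIMARY KEY, name TEXT)")
    con.executemany(
        "INSERT INTO course (cID, name) VALUES (?, ?)",
        [(1, "Biology"), (2, "Maths"), (3, "English")],
    )
    return con


def run_queries(con):
    """Run one retrieval query, one update, and a second retrieval query."""
    # Getting information: the SELECT statement is the query.
    before = con.execute("SELECT name FROM course WHERE cID = ?", (2,)).fetchone()[0]
    # Updating the database uses the same language.
    con.execute("UPDATE course SET name = ? WHERE cID = ?", ("Mathematics", 2))
    after = con.execute("SELECT name FROM course ORDER BY cID").fetchall()
    return before, [row[0] for row in after]
```

The `?` placeholders let the driver substitute the values, which avoids assembling SQL text by hand.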

Page 13:

Distributed databases

From local to global attitude

Data appears to be in one location but is most definitely not

A definition: two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A, B, C)

An intricate network for combining and sharing information

Administrators praise fast network technologies!!!

Users praise the internet!!!

Page 14:

Data warehouse

Periodically, one imports data from databases and stores it (locally) in the data warehouse.

Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).

Disadvantage: expensive, intensive, needs to be updated.

Advantage: easy control of integrated data-mining pipeline.

Page 15:

So why do biologists care?

Page 16:

Three main reasons

Database proliferation: dozens to hundreds at the moment

More and more scientific discoveries result from inter-database analysis and mining

Rising complexity of required data-combinations, e.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)

Page 17:

Biological databases

Like any other database: data organization for optimal analysis

Data is of different types:

Raw data (DNA, RNA, protein sequences)
Curated data (DNA, RNA and protein annotated sequences and structures, expression data)

Page 18:

Raw Biological data: Nucleic Acids (DNA)

Page 19:

Raw Biological data: Amino acid residues (proteins)

Page 20:

Curated Biological Data

DNA, nucleotide sequences

Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Expression data
Mass spectrometry

Page 21:

Curated Biological Data

Proteins, residue sequences

MCTUYTCUYFSTYRCCTYFSCD

Extended sequence information
Secondary structure
Hydrophobicity, motif data
Protein-protein interaction
Post-translational protein modification (PTM)
Mass spectrometry (metabolomics, proteomics)

Page 22:

Curated Biological data: 3D structures, folds

Page 23:

Biological Databases

The 2003 NAR Database Issue: http://nar.oupjournals.org/content/vol31/issue1/

Page 24:

Distributed information

Pearson’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

Page 25:

A few biological databases

Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
Genome Databases: Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
Structure Databases: PDB, MSD, FSSP, DALI
Microarray Database: ArrayExpress
Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives
Alignment Databases: BAliBASE, Homstrad, FSSP

Page 26:

Structural Databases

Protein Data Bank (PDB): http://www.rcsb.org/pdb/

Structural Classification of Proteins (SCOP): http://scop.berkeley.edu and http://scop.mrc-lmb.cam.ac.uk/scop/

Page 27:

PDB

3D Macromolecular structural data

Data originates from NMR or X-ray crystallography techniques

Total no. of structures: 34,626 (17/01/2006)

If the 3D structure of a protein is solved ... they have it

Page 28:

PDB content

Page 29:

PDB information

The PDB files have a standard format

Key features

Informative descriptors
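Because the PDB file format is fixed-width, a coordinate record can be read by slicing character columns. The sketch below is a minimal illustration rather than a full parser: it handles only ATOM records, the `Atom` type and `parse_atom_line` function are hypothetical names, and the sample record is made up for the example instead of being taken from a real entry.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice one fixed-width ATOM record (0-based, end-exclusive Python slices)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain_id=line[21],             # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue sequence number
        x=float(line[30:38]),          # columns 31-38: x coordinate
        y=float(line[38:46]),          # columns 39-46: y coordinate
        z=float(line[46:54]),          # columns 47-54: z coordinate
    )


# Made-up example record, built with explicit field widths so the columns line up.
SAMPLE = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}"
)
```

`parse_atom_line(SAMPLE)` returns the nitrogen atom of residue MET 1 in chain A. For real work an established parser is the safer choice, since PDB files contain many other record types.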

Page 30:

PDB-mirror on the WWW …

e.g. 1AE5

Page 31:

Example output: 1AE5

Page 32:

SCOP

Structural Classification Of Proteins

3D Macromolecular structural data grouped based on structural classification

Data originates from the PDB

Current version (v1.69): 25973 PDB Entries (July 2005), 70859 Domains

Page 33:

SCOP levels, bottom-up

1. Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.

2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

3. Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
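The 30% figure quoted for families refers to pairwise residue identity. A minimal sketch of that calculation follows, assuming the two sequences are already aligned and use "-" for gaps. The function names are hypothetical, and the choice of denominator (columns where both sequences have a residue) is one of several conventions in use. SCOP assignment itself is expert curation, so the threshold check only illustrates the rule of thumb.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs")
    return 100.0 * identical / compared


def meets_family_threshold(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """Rule-of-thumb check against the 30% identity guideline quoted above."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

As the globin example shows, a pair can fall below the threshold and still belong to the same family, so a `False` result here does not rule out a relationship.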

Page 34:

SCOP-mirror on the WWW …

Page 35:

Enter SCOP at the top of the hierarchy

Page 36:

Keyword search of SCOP entries

Page 37:

CATH

Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.

Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.

The Topology level clusters structures according to their topological connections and numbers of secondary structures.

The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
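The four CATH levels form a hierarchy, and CATH labels each node with a dotted numeric code, one number per level. That notation is not shown on the slide, so treat it as background. The sketch below, with hypothetical names `CathCode`, `parse_cath_code` and `shared_level`, splits such a code and reports the deepest level two codes share.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


LEVELS = ["none", "Class", "Architecture", "Topology", "Homologous superfamily"]


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code such as 'C.A.T.H' into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    return CathCode(*(int(part) for part in parts))


def shared_level(a: CathCode, b: CathCode) -> str:
    """Name the deepest level at which two codes still agree."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return LEVELS[depth]
```

For example, `shared_level(parse_cath_code("3.40.50.300"), parse_cath_code("3.40.50.720"))` gives "Topology": the two codes agree on class, architecture and topology but name different homologous superfamilies. The codes here are used only to illustrate the notation.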

Page 38:

CATH-mirror on the WWW …

Page 39:

DSSP

Dictionary of secondary structure of proteins

The DSSP database comprises the secondary structures of all PDB entries

DSSP is actually software that translates the PDB structural co-ordinates into secondary (standardized) structure elements

A similar example is STRIDE

Page 40:

WHY bother???

Researchers create and use the data

Use of known information for analyzing new data

New data needs to be screened

Structural/Functional information

Extends the knowledge and information on a higher level than DNA or protein sequences

Page 41:

In the end ….

Computers can figure out all kinds of problems, except the things in the world that just don't add up.
- James Magary

We should add: For that we employ the human brain, experts and experience.

Page 42:

Bio-databases: A short word on problems

Even today we face some key limitations:

There is no standard format: every database or program has its own format
There is no standard nomenclature: every database has its own names
Data is not fully optimized: some datasets have missing information without indications of it
Data errors: data is sometimes of poor quality, erroneous, misspelled
Error propagation resulting from computer annotation

Page 43:

What to take home

Databases are a collection of data: need to access and maintain easily and flexibly
Biological information is vast and sometimes very redundant
Distributed databases bring it all together with quality controls, cross-referencing and standardization
Computers can only create data, they do not give answers
Review suggestion: “Integrating biological databases”, Stein, Nature 2003