structure databases dna/protein structure-function analysis and prediction lecture 6 bioinformatics...

Structure Databases

DNA/Protein structure-function analysis and prediction

Lecture 6

Bioinformatics Section, Vrije Universiteit, Amsterdam

The dictionary definition

Main Entry: da·ta·base
Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
Function: noun
Date: circa 1962

: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)

- Webster dictionary

WHAT is a database?

A collection of data that needs to be:
  Structured
  Searchable
  Updated (periodically)
  Cross-referenced

Challenge: to change “meaningless” data into useful information that can be accessed and analysed in the best way possible.

For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?

You need an appropriate database management system (DBMS)

DBMS

Internal organization
  Controls speed and flexibility

A unity of programs that
  Store
  Extract
  Modify

Diagram: USER(S), DBMS (Store, Extract, Modify), Database

DBMS organisation types

Flat file databases (flat DBMS)
  Simple, restrictive, table

Hierarchical databases (hierarchical DBMS)
  Simple, restrictive, tables

Relational databases (RDBMS)
  Complex, versatile, tables

Object-oriented databases (ODBMS)
  Complex, versatile, objects

Relational databases

Data is stored in multiple related tables

Data relationships across tables can be either many-to-one or many-to-many

A few rules allow the database to be viewed in many ways

Let's convert the “course details” to a relational database

Our flat file database

Course details (FLAT DATABASE 2)

Name       Depart.    Course   E1  E2  E3  P1  P2
Student 1  Chemistry  Biology  A   B   B   A   C   …..
Student 2  Ecology    Maths    A   D   A   A   A   …..
Student 2  Ecology    Biology  A   B   A   A   A   …..
Student 1  Chemistry  English  A   A   A   A   A   …..
Student 1  Chemistry  Maths    C   C   B   A   A   …..
..

Normalize (1NF) …

We remove repeating records (rows)

sID  Name      dID
1    Student1  1

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

sID  cID  E1  E2  E3  P1  P2
1    1    A   B   B   A   C   …..
2    2    A   D   A   A   A   …..
2    1    A   B   A   A   A   …..
1    3    A   A   A   A   A   …..
1    2    C   C   B   A   A   …..
..
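The split above can be sketched in a few lines of Python. This is only an illustration of the idea, not the method a real DBMS uses: the helper names `get_id` and `normalize` are hypothetical, the grades are packed into one string per row for brevity, and IDs are simply assigned in order of first appearance, which happens to reproduce the numbering on the slide.

```python
# Flat rows as on the slide: (name, department, course, grades for E1 E2 E3 P1 P2).
flat_rows = [
    ("Student 1", "Chemistry", "Biology", "ABBAC"),
    ("Student 2", "Ecology", "Maths", "ADAAA"),
    ("Student 2", "Ecology", "Biology", "ABAAA"),
    ("Student 1", "Chemistry", "English", "AAAAA"),
    ("Student 1", "Chemistry", "Maths", "CCBAA"),
]


def get_id(table, value):
    """Return the ID of value in table, assigning the next integer if it is new."""
    if value not in table:
        table[value] = len(table) + 1
    return table[value]


def normalize(rows):
    """Split flat rows into student, course, department and result tables."""
    departments, courses, students = {}, {}, {}
    student_dept = {}  # sID -> dID (many students map to one department)
    results = []       # (sID, cID, grades)
    for name, dept, course, grades in rows:
        d_id = get_id(departments, dept)
        c_id = get_id(courses, course)
        s_id = get_id(students, name)
        student_dept[s_id] = d_id
        results.append((s_id, c_id, grades))
    return students, student_dept, courses, departments, results


students, student_dept, courses, departments, results = normalize(flat_rows)
```

Each student name and department name is now stored once; the result rows only carry the integer keys.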

Normalize (2NF) …

We remove redundant fields (columns)

(Diagram legend: Primary keys, Foreign keys)

sID  Name  dID

cID  Course
1    Biology
2    Maths
3    English

gID  Grade
1    A
2    B
3    C

dID  Department
1    Chemistry
2    Ecology

wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

sID  cID  gID  wID
1    1    1    1
1    1    2    2
1    1    2    3
1    1    1    4
1    1    3    5
2    1    1    1
2    1    1    2
2    1    2    3
2    1    1    4
2    1    1    5
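As a rough sketch of how these tables and their keys could look in a real RDBMS, the example below uses Python's built-in `sqlite3` module. The table names (`student`, `result`, and so on) and the column names `name` and `letter` are assumptions chosen for illustration; only the ID columns and the Student 1 / Biology rows come from the slide. A join over the foreign keys rebuilds the human-readable flat view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request
conn.executescript("""
CREATE TABLE department (dID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student    (sID INTEGER PRIMARY KEY, name TEXT,
                         dID INTEGER REFERENCES department(dID));
CREATE TABLE course     (cID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE grade      (gID INTEGER PRIMARY KEY, letter TEXT);
CREATE TABLE project    (wID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE result (
    sID INTEGER REFERENCES student(sID),
    cID INTEGER REFERENCES course(cID),
    gID INTEGER REFERENCES grade(gID),
    wID INTEGER REFERENCES project(wID),
    PRIMARY KEY (sID, cID, wID)
);
""")

conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Chemistry"), (2, "Ecology")])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [(1, "Student 1", 1)])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])
conn.executemany("INSERT INTO grade VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.executemany("INSERT INTO project VALUES (?, ?)",
                 [(1, "E1"), (2, "E2"), (3, "E3"), (4, "P1"), (5, "P2")])
# Student 1, Biology: grades A B B A C for E1 E2 E3 P1 P2.
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 2, 3), (1, 1, 1, 4), (1, 1, 3, 5)])

# Follow the foreign keys to rebuild the flat, readable view.
flat_view = conn.execute("""
    SELECT s.name, d.name, c.name, p.name, g.letter
    FROM result r
    JOIN student s    ON s.sID = r.sID
    JOIN department d ON d.dID = s.dID
    JOIN course c     ON c.cID = r.cID
    JOIN grade g      ON g.gID = r.gID
    JOIN project p    ON p.wID = r.wID
    ORDER BY r.sID, r.cID, r.wID
""").fetchall()
```

The composite primary key on `result` is one possible design choice: it allows a student at most one grade per course and project.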

Relational Databases

What have we achieved?
  No repeating information
  Less storage space
  Better reality representation
  Easy modification/management
  Easy usage of any combination of records

Remember: the DBMS has programs to access and edit this information, so ignore the human reading limitation of the primary keys

Accessing database information

A request for data from a database is called a query

Queries can be of three forms:
  Choose from a list of parameters
  Query by example (QBE)
  Query language

Query Languages

The standard: SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)

Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.

Standard interactive and programming language for getting information from and updating a database.

RDBMS (SQL), ODBMS (Java, C++, OQL etc)
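To make "getting information from and updating a database" concrete, here is a minimal sketch that sends SQL statements to an in-memory SQLite database through Python's `sqlite3` module. The `course` table and its `name` column are assumptions for illustration; any SQL-speaking RDBMS would accept equivalent statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (cID INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])

# Getting information: a SELECT with a parameter placeholder.
maths_id = conn.execute(
    "SELECT cID FROM course WHERE name = ?", ("Maths",)
).fetchone()[0]

# Updating: rename that course, then read all names back in ID order.
conn.execute("UPDATE course SET name = ? WHERE cID = ?", ("Mathematics", maths_id))
names = [row[0] for row in conn.execute("SELECT name FROM course ORDER BY cID")]
```

The `?` placeholders let the database driver handle quoting, which is generally safer than pasting values into the SQL string.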

Distributed databases

From local to global attitude

Data appears to be in one location but is most definitely not

A definition: Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A,B,C)

An intricate network for combining and sharing information

Administrators praise fast network technologies!!!

Users praise the internet!!!

Data warehouse

Periodically, one imports data from databases and stores it (locally) in the data warehouse.

Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).

Disadvantage: expensive, intensive, needs to be updated.

Advantage: easy control of integrated data-mining pipeline.
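The periodic import could look roughly like the sketch below. Everything in it is hypothetical: `fetch_sequences` and `fetch_structures` stand in for downloads from remote databases, the protein IDs and sequences are toy values, and a single SQLite table plays the role of the local warehouse.

```python
import sqlite3


def fetch_sequences():
    """Hypothetical stand-in for a download from a remote sequence database."""
    return [("protA", "MKVLAA"), ("protB", "GSHMTT")]


def fetch_structures():
    """Hypothetical stand-in for a download from a remote structure database."""
    return [("protA", "alpha"), ("protB", "beta")]


def refresh_warehouse(conn):
    """Re-import both sources into one local table; meant to be run periodically."""
    conn.execute("""CREATE TABLE IF NOT EXISTS protein (
                        pid TEXT PRIMARY KEY, sequence TEXT, fold_class TEXT)""")
    structures = dict(fetch_structures())
    for pid, sequence in fetch_sequences():
        # INSERT OR REPLACE overwrites an existing row with the same primary key.
        conn.execute("INSERT OR REPLACE INTO protein VALUES (?, ?, ?)",
                     (pid, sequence, structures.get(pid)))
    conn.commit()


warehouse = sqlite3.connect(":memory:")
refresh_warehouse(warehouse)
refresh_warehouse(warehouse)  # a second refresh replaces rows instead of duplicating them
rows = warehouse.execute("SELECT * FROM protein ORDER BY pid").fetchall()
```

Once the data sits in one local table, an integrated query no longer depends on the remote sources being reachable, at the cost of having to rerun the refresh.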

So why do biologists care?

Three main reasons

Database proliferation
  Dozens to hundreds at the moment

More and more scientific discoveries result from inter-database analysis and mining

Rising complexity of required data-combinations
  E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)

Biological databases

Like any other database
  Data organization for optimal analysis

Data is of different types
  Raw data (DNA, RNA, protein sequences)
  Curated data (DNA, RNA and protein annotated sequences and structures, expression data)

Raw Biological data: Nucleic Acids (DNA)

Raw Biological data: Amino acid residues (proteins)

Curated Biological Data: DNA, nucleotide sequences

Gene boundaries, topology

Gene structure
  Introns, exons, ORFs, splicing

Expression data

Mass spectrometry (metabolomics, proteomics)

Post-Translational protein Modification (PTM)

Curated Biological Data: Proteins, residue sequences

MCTUYTCUYFSTYRCCTYFSCD

Extended sequence information

Secondary structure

Hydrophobicity, motif data

Protein-protein interaction

Curated Biological data: 3D Structures, folds

Biological Databases

The 2003 NAR Database Issue: http://nar.oupjournals.org/content/vol31/issue1/

Distributed information

Pearson’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

A few biological databases

Nucleotide Databases
  Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT

Genome Databases
  Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites

Protein Databases
  Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT

Structure Databases
  PDB, MSD, FSSP, DALI

Microarray Database
  ArrayExpress

Literature Databases
  MEDLINE, Software Biocatalog, Flybase Archives

Alignment Databases
  BAliBASE, Homstrad, FSSP

Structural Databases

Protein Data Bank (PDB)
  http://www.rcsb.org/pdb/

Structural Classification of Proteins (SCOP)
  http://scop.berkeley.edu
  http://scop.mrc-lmb.cam.ac.uk/scop/

PDB

3D Macromolecular structural data

Data originates from NMR or X-ray crystallography techniques

Total no. of structures: 34,626 (17/01/2006)

If the 3D structure of a protein is solved ... they have it

PDB content

PDB information

The PDB files have a standard format

Key features

Informative descriptors
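The standard format is column-based: each field of a coordinate record sits at fixed character positions. The sketch below reads a few fields from one ATOM record following the published PDB column layout. The sample record is made up for illustration (it is not taken from 1AE5), and `parse_atom_line` is a hypothetical helper, not part of any PDB software.

```python
# A made-up ATOM record assembled column by column (1-based columns in the comments).
SAMPLE = (
    "ATOM  "      # 1-6   record name
    "    1"       # 7-11  atom serial number
    " "           # 12    unused
    " N  "        # 13-16 atom name
    " "           # 17    alternate location indicator
    "MET"         # 18-20 residue name
    " "           # 21    unused
    "A"           # 22    chain identifier
    "   1"        # 23-26 residue sequence number
    "    "        # 27-30 insertion code and unused columns
    "  27.340"    # 31-38 x coordinate
    "  24.430"    # 39-46 y coordinate
    "   2.614"    # 47-54 z coordinate
    "  1.00"      # 55-60 occupancy
    "  9.67"      # 61-66 temperature factor
    "          "  # 67-76 unused here
    " N"          # 77-78 element symbol
)


def parse_atom_line(line):
    """Extract selected fields from an ATOM/HETATM record, or return None."""
    if not line.startswith(("ATOM  ", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE)
```

Because fields are located by position rather than by whitespace, splitting a line on spaces is unreliable; slicing by column is the safer approach for this format.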

PDB-mirror on the WWW …

e.g. 1AE5

Example output: 1AE5

SCOP

Structural Classification Of Proteins

3D Macromolecular structural data grouped based on structural classification

Data originates from the PDB

Current version (v1.69)
  25973 PDB Entries (July 2005)
  70859 Domains

SCOP levels bottom-up

1. Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.

2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

3. Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
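One way to work with this hierarchy in code is through SCOP's concise classification strings, which have the form class.fold.superfamily.family (the class level sits above the three levels listed here). The sketch below reports the most specific level two such strings share. The function name `lowest_shared_level` is hypothetical and the example strings are illustrative only.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def lowest_shared_level(sccs_a, sccs_b):
    """Return the most specific SCOP level shared by two classification strings."""
    shared = None
    for level, a, b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break  # everything below the first mismatch differs as well
        shared = level
    return shared


same_family = lowest_shared_level("a.1.1.2", "a.1.1.2")
same_superfamily = lowest_shared_level("a.1.1.2", "a.1.1.1")
same_fold = lowest_shared_level("a.1.1.2", "a.1.2.1")
unrelated = lowest_shared_level("a.1.1.2", "b.1.1.1")
```

Read against the definitions above: a shared family implies a clear evolutionary relationship, a shared superfamily a probable one, and a shared fold only structural similarity.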

SCOP-mirror on the WWW …

Enter SCOP at the top of the hierarchy

Keyword search of SCOP entries

CATH

Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.

Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.

The Topology level clusters structures according to their topological connections and numbers of secondary structures.

The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

CATH-mirror on the WWW …

DSSP

Dictionary of secondary structure of proteins

The DSSP database comprises the secondary structures of all PDB entries

DSSP is actually software that translates the PDB structural co-ordinates into secondary (standardized) structure elements

A similar example is STRIDE
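DSSP output assigns one secondary-structure letter per residue. As a small sketch of what can be done with such an assignment, the code below collapses a per-residue string into contiguous segments. It does not run DSSP itself: the input string is invented, "-" is assumed to mark residues without an assigned element, and `segments` is a hypothetical helper.

```python
from itertools import groupby

# One-letter DSSP secondary-structure codes.
DSSP_CODES = {
    "H": "alpha helix",
    "B": "beta bridge",
    "E": "extended strand",
    "G": "3-10 helix",
    "I": "pi helix",
    "T": "turn",
    "S": "bend",
}


def segments(ss_string):
    """Collapse a per-residue string into (code, start, length) runs, 1-based."""
    runs = []
    position = 1
    for code, group in groupby(ss_string):
        length = len(list(group))
        runs.append((code, position, length))
        position += length
    return runs


example = "-HHHHTT-EEE"  # invented assignment for an 11-residue stretch
runs = segments(example)
helix_residues = sum(length for code, _, length in runs if code == "H")
```

Summaries like the number of helical residues are the kind of standardized structure element that can then be compared across PDB entries.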

WHY bother???

Researchers create and use the data

Use of known information for analyzing new data

New data needs to be screened

Structural/Functional information

Extends the knowledge and information on a higher level than DNA or protein sequences

In the end ….

Computers can figure out all kinds of problems, except the things in the world that just don't add up. James Magary

We should add: For that we employ the human brain, experts and experience.

Bio-databases: A short word on problems

Even today we face some key limitations

There is no standard format
  Every database or program has its own format

There is no standard nomenclature
  Every database has its own names

Data is not fully optimized
  Some datasets have missing information without indications of it

Data errors
  Data is sometimes of poor quality, erroneous, misspelled
  Error propagation resulting from computer annotation

What to take home

Databases are a collection of data
  Need to access and maintain easily and flexibly

Biological information is vast and sometimes very redundant

Distributed databases bring it all together with quality controls, cross-referencing and standardization

Computers can only create data, they do not give answers

Review-suggestion: “Integrating biological databases”, Stein, Nature Reviews Genetics 2003
