bioinformatics databases

BIOINFORMATICS Databases

Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a

Contents: Databases

• Structuring Information in Tables
• Keys and Joins
• Normalization
• Complex RDB encoding
• Indexes and Optimization
• Forms and Reports
• Clustering & Trees
• Function Classification and Orthologs
• The Genomic vs. Single-molecule Perspective
• Folds in Genomes, shared & common folds
• Genome Trees
• Bulk Structure Prediction
• Extent of Fold Assignment: the Bias Problem
• Correcting for Biases with Sampling
• Cross-tabulation, folds and functions
• Analysis of Expression Data
• Analysis of Other Whole Genome Datasets

Relational Databases

• Databases make program data persistent
• RDBs turn formless data into a number of structured tables
◊ Ways of joining together tables to give various views of the data

This type of “membership” analysis has been performed previously in terms of the occurrence of sequence motifs, families, functions, and biochemical pathways. Starting from the most basic units, genomes have been compared in terms of the relative frequencies of short oligonucleotide and oligopeptide “words” (Blaisdell et al., 1996; Karlin & Burge, 1995; Karlin et al., 1992; Karlin et al., 1996). The degree of gene duplication in a number of genomes has been ascertained (Brenner et al., 1995; Koonin et al., 1996b; Riley & Labedan, 1997; Wolfe & Shields, 1997; Gerstein, 1997; Tamames et al., 1997). Other analyses have looked at how many highly conserved sequence families in one organism are present in another (Green et al., 1993; Koonin et al., 1995; Tatusov et al., 1997; Ouzounis et al., 1995a,b; Clayton et al., 1997). Finally, if sequences can be related to specific functions and pathways, one can see whether homologous sequences in two organisms truly have the same role (ortholog vs. paralog) and whether particular pathways are present or absent in different organisms (Karp et al., 1996a; Karp et al., 1996b; Koonin et al., 1996a; Mushegian & Koonin, 1996; Tatusov et al., 1996, 1997). This work has yielded many interesting conclusions in terms of pathways that are modified or absent in certain organisms. For instance, the essential citric acid cycle is found to be highly modified in H. influenzae (Fleischmann et al.,

Unstructured Data

Semi-Structured

REMARK 8 HET GROUP TRIVIAL NAME: FLAVIN ADENINE DINUCLEOTIDE (FAD) 1FNB 79

REMARK 8 CAS REGISTRY NUMBER: 146-14-5 1FNB 80

REMARK 8 SEQUENCE NUMBER: 315 1FNB 81

REMARK 8 NUMBER OF ATOMS IN GROUP: 53 1FNB 82

REMARK 8 1FNB 83

REMARK 8 HET GROUP TRIVIAL NAME: PHOSPHATE 1FNB 84

REMARK 8 1FNB 87

REMARK 8 HET GROUP TRIVIAL NAME: SULFATE 1FNB 88

REMARK 8 1FNB 91

REMARK 8 HET GROUP TRIVIAL NAME: K2 PT(CN)4 1FNB 92

REMARK 8 CHARGE: 2- ( PT(CN)4 -- ) 1FNB 93

REMARK 8 SEQUENCE NUMBER: PT1 - PT7 1FNB 94

REMARK 8 ADDITIONAL COMMENTS: BINDING SITES USED IN MIR PHASING 1FNB 96

REMARK 8 1FNB 97

REMARK 8 HEAVY ATOM PARAMETERS ARE AS FOLLOWS: 1FNB 98

REMARK 8 PT PT 1 11.832 -8.309 27.027 0.68 33.00 1FNB 99

REMARK 8 PT PT 2 13.996 -2.135 13.212 0.42 40.00 1FNB 100

REMARK 8 PT PT 3 33.293 18.752 27.229 0.32 42.00 1FNB 101

REMARK 8 PT PT 4 19.961 -15.348 -10.328 0.23 28.00 1FNB 102

REMARK 8 PT PT 5 8.312 14.713 35.679 0.26 31.00 1FNB 103

REMARK 8 PT PT 6 27.594 -7.790 23.540 0.14 35.00 1FNB 104

REMARK 8 PT PT 7 15.917 -9.001 12.608 0.30 50.00 1FNB 105

REMARK 8 1FNB 106

REMARK 8 HET GROUP TRIVIAL NAME: URANYL NITRATE (UO2--) 1FNB 107

REMARK 8 EMPIRICAL FORMULA: UO2 (NO3)2 1FNB 108

REMARK 8 CHARGE: 2- 1FNB 109

REMARK 8 SEQUENCE NUMBER: UR1 - UR13 1FNB 110

REMARK 8 ADDITIONAL COMMENTS: BINDING SITES USED IN MIR PHASING 1FNB 112

REMARK 8 1FNB 113

REMARK 8 HEAVY ATOM PARAMETERS ARE AS FOLLOWS: 1FNB 114

REMARK 8 U UR 1 8.513 16.214 36.081 0.49 27.00 1FNB 115

Structured Data

did_     fid
d2rs51_  1.002.007
d1imr__  1.010.002
d1pyib1  1.007.030
d1dxtd_  1.001.001
d181l__  1.004.002
d1vmoa_  1.002.044
d2gsq_1  1.001.031
d1etb2_  1.002.003
d1guha1  1.001.031
d1hrc__  1.001.003
d150lc_  1.004.002
d1dmf__  1.007.035
d1l19__  1.004.002
d1yrnc_  1.010.002
d1apld_  1.001.004
d1ndab2  1.003.004
d2rmai_  1.002.036

fid_       bestrep  N_minsp  N_scop  objname
1.001.001  d1flp__  8        340     Globin-like
1.001.002  d1hdj__  4        33      Long alpha-hairpin
1.001.003  d1ctj__  9        78      Cytochrome c
1.001.004  d1enh__  18       76      DNA-binding 3-helical bundle
1.001.005  d1dtr_2  1        3       Diphtheria toxin repressor (DtxR) dimeriz
1.001.006  d1tns__  1        2       Mu transposase, DNA-binding domain
1.001.007  d2spca_  1        2       Spectrin repeat unit
1.001.008  d1bdd__  1        4       Immunoglobulin-binding protein A modules
1.001.009  d1bal__  1        5       Peripheral subunit-binding domain of 2-ox
1.001.010  d2erl__  3        5       Protozoan pheromone proteins

gid_    TrgStrt  TrgStop  did
HI0299  119      135      d193l__
HI0572  180      240      d1aba__
HI0989  56       125      d1aco_1
HI0988  106      458      d1aco_2
HI0154  2        76       d1acp__
HI1633  2        432      d1adea_
HI0349  1        183      d1aky__
HI1309  35       52       d1alo_3
HI0589  8        25       d1alo_3
HI1358  239      444      d1amg_2
HI1358  218      410      d1amy_2
HI0460  20       24       d1ans__
HI1386  139      147      d1ans__
HI0421  11       14       d1ans__
HI0361  285      295      d1ans__
HI0835  100      106      d1ans__

Turn the Survey into a Table (I)

Unique Identifier for Person?

8 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu

Turn the Survey into a Table (II)

Standard-

Turn the Survey into a Table (III)

• Dependencies between Values (dates)
• Unstructured Text

Statistics are only Possible on Standardized Values

• SIMPLE Language for Building and Querying Tables
• CREATE a table
• INSERT values into it
• SELECT various entries from it (tuples, rows)
• UPDATE the values
• Example: How Many Globin Folds are there in E. coli versus Yeast?

matches table

gid_    TrgStrt  TrgStop  did      score
HI0299  119      135      d193l__  3.1
HI0572  180      240      d1aba__  0.0032
HI0989  56       125      d1aco_1  0.0049
HI0988  106      458      d1aco_2  4.4e-14
HI0154  2        76       d1acp__  1.2e-23
HI1633  2        432      d1adea_  0
HI0349  1        183      d1aky__  7.6e-36
HI1309  35       52       d1alo_3  1.1
HI0589  8        25       d1alo_3  1.8
HI1358  239      444      d1amg_2  0.002
HI1358  218      410      d1amy_2  0.00037
HI0460  20       24       d1ans__  1.8
HI1386  139      147      d1ans__  3.3
HI0421  11       14       d1ans__  6.4
HI0361  285      295      d1ans__  8.2
HI0835  100      106      d1ans__  9.7

create table matches (
    gid     char(255),  -- Genome_ID
    TrgStrt int,        -- Start of Match in Gene
    TrgStop int,        -- End of Match in Gene
    did     char(255),  -- ID of Matching Structure
    score   real        -- e-value of Match
)

matches table 2

insert into matches (gid, TrgStrt, TrgStop, did, score)
values ('HI0299', 119, 135, 'd193l__', 3.1)

structures table

create table structures (
    did char(255),  -- ID of Matching Structure
    fid char(255)   -- ID of fold that structure has
)

did_     fid
d2rs51_  1.002.007
d1imr__  1.010.002
d1pyib1  1.007.030
d1dxtd_  1.001.001
d181l__  1.004.002
d1vmoa_  1.002.044
d2gsq_1  1.001.031
d1etb2_  1.002.003
d1guha1  1.001.031
d1hrc__  1.001.003
d150lc_  1.004.002
d1dmf__  1.007.035
d1l19__  1.004.002
d1yrnc_  1.010.002
d1apld_  1.001.004
d1ndab2  1.003.004
d2rmai_  1.002.036

10 K domain structure IDs (did) vs. 300 fold IDs (fid)

folds table

create table folds (
    fid     char(255),  -- fold ID
    bestrep char(255),
    N_hlx   int,
    N_beta  int,        -- number of helices & sheets
    name    char(255)   -- name of fold
)

fid_       bestrep  N_hlx  N_beta  name
1.001.001  d1flp__  8      0       Globin-like
1.001.002  d1hdj__  4      0       Long alpha-hairpin
1.001.003  d1ctj__  9      0       Cytochrome c
1.001.004  d1enh__  2      0       DNA-binding 3-helical bundle
1.001.005  d1dtr_2  1      3       Diphtheria toxin repressor (DtxR) dimeriz
1.001.006  d1tns__  1      2       Mu transposase, DNA-binding domain
1.001.007  d2spca_  0      2       Spectrin repeat unit
1.001.008  d1bdd__  0      4       Immunoglobulin-binding protein A modules
1.001.009  d1bal__  0      5       Peripheral subunit-binding domain of 2-ox
1.001.010  d2erl__  3      5       Protozoan pheromone proteins

Table Interpretation

Match Table: Ways Structures A, B, and C can match HI Genome

Structures have a limited number of folds, which have various characteristics

Structure of a Table

• Row
◊ Entity, Tuple, Instance
• Column
◊ Field
◊ Attribute of an Entity
◊ dimension
• Key
◊ Certain Attributes (or combination of attributes) can uniquely identify an object, these are keys
• NULL
◊ Variant Records

key key
Table     attr-a  attr-b  attr-c  attr-d  attr-e  attr-f
tuple-1   a1      b1      c1      d1      e1      f1
tuple-2   a2      b2      c2      d2      e2      f2
tuple-3   a3      b3      c3      d3      e3      f3
tuple-4   a4      b4      c4      d4      e4      f4
tuple-5   a5      b5      c5      d5      e5      f5
tuple-6   a6      b6      c6      d6
tuple-7   a7      b7      c7      d7              f7
tuple-8   a8      b8      c8      d8      e8      f8
tuple-9   a9      b9      c9      d9      e9      f9
tuple-10  a10     b10     c10     d10             f10
tuple-11  a11     b11     c11     d11     e11     f11
tuple-12  a12     b12     c12     d12     e12     f12
tuple-13  a13     b13     c13     d13     e13     f13
tuple-14  a14     b14     c14     d14     e14     f14

What is a Key?

table matches(gid, TrgStrt, TrgStop, did, score)

table structures(did, fid)

table folds(fid, bestrep, N_hlx, N_beta, name)

gid -> many matches
gid,TrgStrt -> unique match (one tuple)
thus, primary key gid,TrgStrt
gid,TrgStop -> unique match as well
fid -> many did's, but did -> one fid
thus, primary key did
one-to-one between fid and name

1<->1
1->many
many->1

SQL Select on a Single Table

• Select {columns} from {a table} where {row-selection is true}
• projection of a selection
• Sort result on an attribute

SQL Select on a Single Table, Example

• Select * from matches where gid = 'HI0016'
  HI0016  1    173  d1dar_2  2e-07
  HI0016  179  274  d1dar_1  8.5e-06
  HI0016  399  476  d1dar_4  0.00031
• Select * from matches where gid = 'HI0016' and TrgStrt = 179
  HI0016  179  274  d1dar_1  8.5e-06

gid_    TrgStrt  TrgStop  did      score
HI0299  119      135      d193l__  3.1
HI0572  180      240      d1aba__  0.0032
HI0989  56       125      d1aco_1  0.0049
HI0349  1        183      d1aky__  7.6e-36
HI1309  35       52       d1alo_3  1.1
HI0589  8        25       d1alo_3  1.8
HI1358  239      444      d1amg_2  0.002
HI0016  1        173      d1dar_2  2e-07
HI0016  179      274      d1dar_1  8.5e-06
HI0016  399      476      d1dar_4  0.00031
HI0460  20       24       d1ans__  1.8
HI1386  139      147      d1ans__  3.3
HI0421  11       14       d1ans__  6.4
HI0361  285      295      d1ans__  8.2
HI0835  100      106      d1ans__  9.7

SQL Select on a Single Table, Example 2

• Select did from matches where score < 0.0001
  d1aky__, d1dar_2, d1dar_1

  HI0349  1    183  d1aky__  7.6e-36
  HI0016  1    173  d1dar_2  2e-07
  HI0016  179  274  d1dar_1  8.5e-06

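The CREATE, INSERT, and SELECT steps above can be tried end to end. The following is only a minimal sketch, assuming Python's standard `sqlite3` module in place of the database engine used for the slides; the helper names `select_by_gene` and `select_significant` are hypothetical, and only a few rows of the matches table are loaded.

```python
import sqlite3

# In-memory database; column names follow the slides' matches table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE matches (
        gid TEXT, TrgStrt INTEGER, TrgStop INTEGER, did TEXT, score REAL,
        PRIMARY KEY (gid, TrgStrt)   -- the key identified on the slides
    )""")

# A subset of the rows shown in the example table.
rows = [
    ("HI0349", 1, 183, "d1aky__", 7.6e-36),
    ("HI1358", 239, 444, "d1amg_2", 0.002),
    ("HI0016", 1, 173, "d1dar_2", 2e-07),
    ("HI0016", 179, 274, "d1dar_1", 8.5e-06),
    ("HI0016", 399, 476, "d1dar_4", 0.00031),
    ("HI0460", 20, 24, "d1ans__", 1.8),
]
conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?, ?)", rows)


def select_by_gene(gid):
    """Select * from matches where gid = ... (hypothetical helper)."""
    cur = conn.execute(
        "SELECT * FROM matches WHERE gid = ? ORDER BY TrgStrt", (gid,))
    return cur.fetchall()


def select_significant(cutoff):
    """Select did from matches where score < cutoff, best score first."""
    cur = conn.execute(
        "SELECT did FROM matches WHERE score < ? ORDER BY score", (cutoff,))
    return [row[0] for row in cur.fetchall()]
```

Passing values as `?` parameters avoids the quoting problems that arise when identifiers such as HI0016 are pasted into the query text.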

Joins

Matches

gid_    TrgStrt  TrgStop  did      score
HI0299  119      135      d193l__  3.1
HI0572  180      240      d1aba__  0.0032
HI0989  56       125      d1aco_1  0.0049
HI0988  106      458      d1aco_2  4.4e-14
HI0154  2        76       d1acp__  1.2e-23
HI1633  2        432      d1adea_  0
HI0349  1        183      d1aky__  7.6e-36
HI1309  35       52       d1alo_3  1.1
HI0589  8        25       d1alo_3  1.8
HI1358  239      444      d1amg_2  0.002
HI1358  218      410      d1amy_2  0.00037
HI0460  20       24       d1ans__  1.8
HI1386  139      147      d1ans__  3.3
HI0421  11       14       d1ans__  6.4
HI0361  285      295      d1ans__  8.2
HI0835  100      106      d1ans__  9.7

Structures

did_     fid
d2rs51_  1.002.007
d1imr__  1.010.002
d1pyib1  1.007.030
d1dxtd_  1.001.001
d181l__  1.004.002
d1vmoa_  1.002.044
d2gsq_1  1.001.031
d1etb2_  1.002.003
d1guha1  1.001.031
d1hrc__  1.001.003
d150lc_  1.004.002
d1dmf__  1.007.035
d1l19__  1.004.002
d1yrnc_  1.010.002
d1ans__  1.007.008
d2rmai_  1.002.036

fid_       bestrep  N_hlx  N_beta  name
1.001.001  d1flp__  8      0       Globin-like
1.001.002  d1hdj__  4      0       Long alpha-hairpin
1.001.003  d1ctj__  9      0       Cytochrome c
1.001.004  d1enh__  2      0       DNA-binding 3-helical bundle
1.001.005  d1dtr_2  1      3       Diphtheria toxin repressor (DtxR) dimeriz
1.001.006  d1tns__  1      2       Mu transposase, DNA-binding domain
1.001.007  d2spca_  0      2       Spectrin repeat unit
1.001.008  d1bdd__  0      4       Immunoglobulin-binding protein A modules
1.007.008  d1qkt__  4      3       Neurotoxin III (ATX III)
1.001.010  d2erl__  3      5       Protozoan pheromone proteins

Foreign Key

SQL Select on Multiple Tables

• Select * from matches, structures, folds
  where matches.gid = 'HI0361'
  and matches.did = structures.did
  and structures.fid = folds.fid
• Returns
  matches | structures | folds
  HI0361,285,295,d1ans__,8.2 | d1ans__,1.007.008 | 1.007.008,d1qkt__,4,3,Neurotoxin III ...
• Select score, name from matches, structures, folds
  where gid = 'HI0361'
  and matches.did = structures.did
  and structures.fid = folds.fid
  8.2, Neurotoxin III ...
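The three-table select above can be reproduced on a toy database. This is a sketch under stated assumptions: Python's standard `sqlite3` module stands in for the original engine, the explicit `JOIN ... ON` form is used in place of the comma-separated table list (the two are equivalent for this query), and `fold_for_gene` is a hypothetical helper name. Only a handful of rows from the slides are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER,
                          did TEXT, score REAL);
    CREATE TABLE structures (did TEXT PRIMARY KEY, fid TEXT);
    CREATE TABLE folds (fid TEXT PRIMARY KEY, bestrep TEXT,
                        N_hlx INTEGER, N_beta INTEGER, name TEXT);

    INSERT INTO matches VALUES ('HI0361', 285, 295, 'd1ans__', 8.2);
    INSERT INTO matches VALUES ('HI0299', 119, 135, 'd193l__', 3.1);
    INSERT INTO structures VALUES ('d1ans__', '1.007.008');
    INSERT INTO structures VALUES ('d1hrc__', '1.001.003');
    INSERT INTO folds VALUES ('1.007.008', 'd1qkt__', 4, 3,
                              'Neurotoxin III (ATX III)');
    INSERT INTO folds VALUES ('1.001.003', 'd1ctj__', 9, 0, 'Cytochrome c');
""")

# matches.did is the foreign key into structures; structures.fid into folds.
QUERY = """
    SELECT matches.score, folds.name
    FROM matches
    JOIN structures ON matches.did = structures.did
    JOIN folds ON structures.fid = folds.fid
    WHERE matches.gid = ?
"""


def fold_for_gene(gid):
    """Return (score, fold name) pairs for every match of one gene."""
    return conn.execute(QUERY, (gid,)).fetchall()
```

A gene whose matched structure is absent from the structures table simply drops out of the result, which is the usual behaviour of an inner join.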

Foreign Key

matches.did is a foreign key into the structures table, i.e. it looks up exactly one structure.

matches    structures

Selection as Array Lookup

• Same for a fold identifier from a structure id
◊ $fid = $structures{$did}
◊ (perl pseudo-code)
• Same for matches and folds tables, but this time arrays return multiple values and have multiple field keys
◊ ($bestrep, $N_hlx, $N_beta, $name) = $folds{$fid}
◊ ($TrgStop, $did, $score) = $match{$gid,$TrgStrt}
• Joining as a double-lookup
◊ $did = "1mbd__"
  ($bestrep, $N_hlx, $N_beta, $name) = $folds{ $structures{$did} }
◊ Select bestrep, N_hlx, N_beta, name from structures, folds where structures.fid = folds.fid and structures.did = '1mbd__'
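The array-lookup picture above translates directly to dictionaries. The sketch below uses Python rather than Perl, with a few rows copied from the slides' tables; the function name `fold_of_structure` is hypothetical.

```python
# Each table becomes a dictionary keyed on its primary key.
structures = {"d1ans__": "1.007.008", "d1hrc__": "1.001.003"}

folds = {
    "1.007.008": ("d1qkt__", 4, 3, "Neurotoxin III (ATX III)"),
    "1.001.003": ("d1ctj__", 9, 0, "Cytochrome c"),
}

# A multi-field key (gid, TrgStrt) is just a tuple key.
matches = {("HI0361", 285): (295, "d1ans__", 8.2)}


def fold_of_structure(did):
    """Join structures and folds as a double lookup: folds[structures[did]]."""
    return folds[structures[did]]


bestrep, n_hlx, n_beta, name = fold_of_structure("d1ans__")
```

The double lookup works because did determines exactly one fid; a one-to-many relation would need a dictionary of lists instead.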

SQL Select on Multiple Tables

• Select {columns} from {huge cross-product of tables} where {row-selection is true}
◊ cross-product T(1) x T(2) builds a huge virtual table where every row of T(1) is paired with every row of T(2). Then perform selection on this.
• Select fid from matches, structures where gid = 'HI009' and matches.did = structures.did

Matches    Structures

Cross Product A x B

A(1) = Row 1 of Table A
A(2) = Row 2 of Table A
A(i) = Row i of Table A
A has N rows and C columns

B(1) = Row 1 of Table B
B(2) = Row 2 of Table B
B(i) = Row i of Table B
B has M rows and K columns

A x B has N x M rows and C+K columns

A x B =
A(1)B(1)  A(1)B(2)  A(1)B(3)  ...  A(1)B(M)
A(2)B(1)  A(2)B(2)  A(2)B(3)  ...  A(2)B(M)
...
A(N)B(1)  A(N)B(2)  A(N)B(3)  ...  A(N)B(M)
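The row and column counts of the cross product are easy to check on a toy case. This sketch assumes tables held as lists of tuples; `cross_product` is a hypothetical helper built on `itertools.product` from the standard library, and the rows are abbreviated to two columns each.

```python
from itertools import product


def cross_product(a_rows, b_rows):
    """Pair every row of A with every row of B, concatenating the columns."""
    return [a + b for a, b in product(a_rows, b_rows)]


# A: N = 3 rows, C = 2 columns (gid, did)
matches = [("HI0361", "d1ans__"), ("HI0299", "d193l__"), ("HI0460", "d1ans__")]
# B: M = 2 rows, K = 2 columns (did, fid)
structures = [("d1ans__", "1.007.008"), ("d1hrc__", "1.001.003")]

virtual = cross_product(matches, structures)      # N x M rows, C + K columns

# The "where matches.did = structures.did" selection on the virtual table.
joined = [row for row in virtual if row[1] == row[2]]
```

Real database engines avoid materializing the full virtual table; the sketch only illustrates the definition.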

• Korth & Silberschatz
◊ branch <=> matches (gid-start +++ did)
◊ customer <=> folds (fid +++)
◊ linked by account <=> structures (did fid)

ER-diagrams

Start gid structure

Aggregate Functions--Statistics on Attributes

• Query Statistics
◊ select gid, count(distinct did) from matches group by gid
◊ select max(N_hlx) from folds where N_beta = 0
• How many matches to globins in the E. coli genome
• Complex Query by nesting selections
◊ F <= select fid from folds where name contains “globin”
◊ D <= select did from structures where fid in F
◊ N <= select count(distinct gid,TrgStrt) from matches where did in D and score < .01
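The nested F, D, N selections can be written as one query with subselects. The sketch below is an illustration only: it assumes Python's standard `sqlite3` module, uses `LIKE` for the slides' "contains", counts distinct (gid, TrgStrt) pairs through a derived table, and loads a few rows taken from the joined table shown in these slides (where fold 1.001.031 is labelled Globin-like). `count_fold_matches` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER,
                          did TEXT, score REAL);
    CREATE TABLE structures (did TEXT PRIMARY KEY, fid TEXT);
    CREATE TABLE folds (fid TEXT PRIMARY KEY, name TEXT);

    INSERT INTO folds VALUES ('1.001.031', 'Globin-like'),
                             ('1.007.008', 'Neurotoxin III (ATX III)');
    INSERT INTO structures VALUES ('d1aco_1', '1.001.031'),
                                  ('d1aco_2', '1.001.031'),
                                  ('d1acp__', '1.001.031'),
                                  ('d1aky__', '1.001.031'),
                                  ('d1ans__', '1.007.008');
    INSERT INTO matches VALUES ('HI0989', 56, 125, 'd1aco_1', 0.0049),
                               ('HI0988', 106, 458, 'd1aco_2', 4.4e-14),
                               ('HI0154', 2, 76, 'd1acp__', 1.2e-23),
                               ('HI0349', 1, 183, 'd1aky__', 7.6e-36),
                               ('HI0460', 20, 24, 'd1ans__', 1.8);
""")

NESTED = """
    SELECT COUNT(*) FROM (
        SELECT DISTINCT gid, TrgStrt FROM matches
        WHERE did IN (SELECT did FROM structures
                      WHERE fid IN (SELECT fid FROM folds WHERE name LIKE ?))
          AND score < ?
    )
"""


def count_fold_matches(word, cutoff):
    """Count distinct matches to folds whose name contains `word`."""
    return conn.execute(NESTED, (f"%{word}%", cutoff)).fetchone()[0]
```

SQLite's `LIKE` is case-insensitive for ASCII by default, so the pattern `%globin%` also finds "Globin-like".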


Join Gives Unnormalized Table

gid_    TrgStrt  TrgStop  did      score    fid        N_hlx  N_beta  name
HI0299  119      135      d193l__  3.1      1.010.002  0      2       Spectrin repeat unit
HI0572  180      240      d1aba__  0.0032   1.002.045  1      2       Mu transposase, DNA-binding domain
HI0989  56       125      d1aco_1  0.0049   1.001.031  8      0       Globin-like
HI0988  106      458      d1aco_2  4.4e-14  1.001.031  8      0       Globin-like
HI0154  2        76       d1acp__  1.2e-23  1.001.031  8      0       Globin-like
HI1633  2        432      d1adea_  0        1.010.002  0      2       Spectrin repeat unit
HI0349  1        183      d1aky__  7.6e-36  1.001.031  8      0       Globin-like
HI1309  35       52       d1alo_3  1.1      1.007.008  4      3       Neurotoxin III (ATX III)
HI0589  8        25       d1alo_3  1.8      1.002.045  1      2       Mu transposase, DNA-binding domain
HI1358  239      444      d1amg_2  0.002    1.004.002  1      3       Diphtheria toxin repressor (DtxR)
HI1358  218      410      d1amy_2  0.00037  1.002.044  0      4       Immunoglobulin-binding protein A
HI0460  20       24       d1ans__  1.8      1.007.008  4      3       Neurotoxin III (ATX III)
HI1386  139      147      d1ans__  3.3      1.007.008  4      3       Neurotoxin III (ATX III)
HI0421  11       14       d1ans__  6.4      1.007.008  4      3       Neurotoxin III (ATX III)
HI0361  285      295      d1ans__  8.2      1.007.008  4      3       Neurotoxin III (ATX III)
HI0835  100      106      d1ans__  9.7      1.007.008  4      3       Neurotoxin III (ATX III)

Joining Two or More Tables with a Select Query Gives a New, “Bigger” Table

Normalization


• What if we want to update Fold 1.007.008 to be “Neurotoxin IV”?
◊ Many Updates
• So it is good if the data were previously normalized into separate tables
◊ Eliminate Redundancy
◊ Allow Consistent Updating
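The difference between "many updates" and a single consistent update can be seen by counting the rows each UPDATE touches. This is a minimal sketch, assuming Python's standard `sqlite3` module; the table name `joined` and the helper `rename_fold` are hypothetical, and only four rows of the joined table are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Un-normalized: the fold name is repeated on every matching row.
    CREATE TABLE joined (gid TEXT, did TEXT, fid TEXT, name TEXT);
    INSERT INTO joined VALUES
        ('HI0460', 'd1ans__', '1.007.008', 'Neurotoxin III (ATX III)'),
        ('HI1386', 'd1ans__', '1.007.008', 'Neurotoxin III (ATX III)'),
        ('HI0421', 'd1ans__', '1.007.008', 'Neurotoxin III (ATX III)'),
        ('HI0989', 'd1aco_1', '1.001.031', 'Globin-like');

    -- Normalized: each fold name is stored exactly once.
    CREATE TABLE folds (fid TEXT PRIMARY KEY, name TEXT);
    INSERT INTO folds VALUES
        ('1.007.008', 'Neurotoxin III (ATX III)'),
        ('1.001.031', 'Globin-like');
""")


def rename_fold(table, fid, new_name):
    """Rename a fold in `table` and report how many rows were rewritten.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    cur = conn.execute(
        f"UPDATE {table} SET name = ? WHERE fid = ?", (new_name, fid))
    return cur.rowcount


rows_unnormalized = rename_fold("joined", "1.007.008", "Neurotoxin IV")
rows_normalized = rename_fold("folds", "1.007.008", "Neurotoxin IV")
```

In the un-normalized table every copy of the name must be rewritten, and missing one leaves the data inconsistent; in the normalized folds table there is only one row to change.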

Normalization Example

Un-normalized

Name     City    Area-Code  Phone-Number
Charles  NY      212        345-6789
Mark     SF      415        236-8982
Jane     NY      212        567-2345
Jeff     SF      415        435-3535
Jack     Boston  617        234-9988

Normalized

Name     City    Phone-Number
Charles  NY      345-6789
Mark     SF      236-8982
Jane     NY      567-2345
Jeff     SF      435-3535
Jack     Boston  234-9988

City    Area-Code
NY      212
SF      415
Boston  617

Normalized Tables

gid_    TrgStrt  TrgStop  did      score
HI0299  119      135      d193l__  3.1
HI0572  180      240      d1aba__  0.0032
HI0989  56       125      d1aco_1  0.0049
HI0988  106      458      d1aco_2  4.4e-14
HI0154  2        76       d1acp__  1.2e-23
HI1633  2        432      d1adea_  0
HI0349  1        183      d1aky__  7.6e-36
HI1309  35       52       d1alo_3  1.1
HI0589  8        25       d1alo_3  1.8
HI1358  239      444      d1amg_2  0.002
HI1358  218      410      d1amy_2  0.00037
HI0460  20       24       d1ans__  1.8
HI1386  139      147      d1ans__  3.3
HI0421  11       14       d1ans__  6.4
HI0361  285      295      d1ans__  8.2
HI0835  100      106      d1ans__  9.7

Theory of Normalization

Query Optimization

• Get at the Data Quickly!!
• Indexes
• Hash Functions Reproduce the Effect of Indexes
◊ Rapidly Associate a Bucket with Each Key
• Joining 10 tables, which to do first?
◊ Joining is slow so store some tables in unnormalized form
o Speed vs Memory
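The idea that a hash function associates a bucket with each key can be sketched with a plain dictionary. This is an illustration, not how any particular database implements its indexes; `build_index`, `lookup`, and `scan` are hypothetical helpers, and the rows are a few entries from the matches table.

```python
from collections import defaultdict

# (gid, TrgStrt, TrgStop, did, score) rows from the matches table.
matches = [
    ("HI0299", 119, 135, "d193l__", 3.1),
    ("HI1358", 239, 444, "d1amg_2", 0.002),
    ("HI1358", 218, 410, "d1amy_2", 0.00037),
    ("HI0460", 20, 24, "d1ans__", 1.8),
]


def build_index(rows, key_col):
    """Build a hash index once: key value -> bucket (list) of rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_col]].append(row)
    return index


def lookup(index, key):
    """Indexed access: one hash lookup, independent of table size."""
    return index.get(key, [])


def scan(rows, key_col, key):
    """Un-indexed access: inspect every row of the table."""
    return [row for row in rows if row[key_col] == key]


gid_index = build_index(matches, key_col=0)
```

The index costs extra memory and must be kept up to date on inserts, which is the speed versus memory trade-off noted above.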

Indexes

Object Databases

Fortran vs. C

Forms & reports [user views]

• Reports are the result of running a succession of select queries on a database, joining together a number of tables, and then pasting the results together
• Forms are the same but they are editable
• Forms and Reports represent particular views of the data
◊ For instance, one can be keyed on gene id, listing all the structures matching a gene, and the other could be keyed on structure id, listing all the genes matching a given structure

Aspects of Forms:Transactions and Security

• Transactions
◊ Genome Centers and United Airlines!
◊ Log each entry and enable UNDO
• Security
◊ Only certain users can modify certain fields

Complex Data Example:Encoding Trees in RDBs

Node  Name
1     Organism
2     Bacteria
3     Archaea
4     Eukarya
5     Metazoa
6     Plants

Node  Parent
1     0
2     1
3     1
4     1
5     4
6     4
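The two tables above are enough to recover the whole tree. The sketch below holds them as Python dictionaries and walks the parent pointers; it assumes, as the table suggests, that parent 0 marks the root. The function names `lineage` and `children` are hypothetical.

```python
# Node -> Name and Node -> Parent, copied from the two tables.
names = {1: "Organism", 2: "Bacteria", 3: "Archaea",
         4: "Eukarya", 5: "Metazoa", 6: "Plants"}
parent = {1: 0, 2: 1, 3: 1, 4: 1, 5: 4, 6: 4}


def lineage(node):
    """Follow parent pointers up to the root (parent 0); return root-first."""
    path = []
    while node != 0:
        path.append(names[node])
        node = parent[node]
    return path[::-1]


def children(node):
    """All nodes whose parent is `node` (a selection on the Parent column)."""
    return sorted(n for n, p in parent.items() if p == node)
```

Finding children is a single select on the parent table, while finding a full lineage needs one lookup per level, which is why trees are awkward to query in plain relational form.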

RDBs Everywhere: Internet Mail

RDBs Everywhere: File System

find -ls

INODE   SIZE  PERMISSION  LINKS  USER  GROUP     BYTES   MMM-DD--YEAR  NAME
120462  1     drwxr-xr-x  10     mbg   gerstein  1024    Feb 12 1997   .
120463  1     drwxr-xr-x  2      mbg   gerstein  1024    Jan 30 1997   ./hi-tbl
120464  514   -rw-r--r--  1      mbg   gerstein  525335  Nov 10 1996   ./hi-tbl/id_gorss.tbl
120465  19    -rw-r--r--  1      mbg   gerstein  18469   Nov 10 1996   ./hi-tbl/id_kytedool.tbl
120466  514   -rw-r--r--  1      mbg   gerstein  525372  Nov 10 1996   ./hi-tbl/id_seq.tbl
108224  507   -rw-r--r--  1      mbg   gerstein  518822  Nov 10 1996   ./mj-tbl/id_gorss.tbl
108227  54    -rw-r--r--  1      mbg   gerstein  54775   Jan 30 1997   ./mj-tbl/id_abcode.tbl
108228  19    -rw-r--r--  1      mbg   gerstein  19131   Nov 11 1996   ./mj-tbl/id_kytedool.tbl
108229  106   -rw-r--r--  1      mbg   gerstein  108345  Nov 16 1996   ./mj-tbl/word_stats.tbl.bak
108230  106   -rw-r--r--  1      mbg   gerstein  108354  Jan 28 1997   ./mj-tbl/word_stats.tbl
108231  7     -rw-r--r--  1      mbg   gerstein  6962    Jan 30 1997   ./mj-tbl/hist_seqlen.tbl
108232  7     -rw-r--r--  1      mbg   gerstein  6967    Jan 30 1997   ./mj-tbl/hist_num_H_res.tbl
91903   1     drwxr-xr-x  2      mbg   gerstein  1024    Nov 19 1996   ./po-tbl

/etc/passwd

USER:PASSWD:UID:GID:COMMENT:DIR:SHELL

ftp:*:14:50:FTP User:/home/ftp:
nobody:*:99:99:Nobody:/:
mlml:cw5ZrAmNBAxvU:106:100:Michael Levitt (linux):/u1/mlml:/bin/tcsh
dabushne:ErR3hu4q0tO7Y:108:100:Dave:/u1/dabushne:/bin/tcsh
mbg:V9CPWXAG.mo3E:5514:165:Mark Gerstein,432A, BASS,2-6105,:/u0/mbg:/bin/tcsh
mbgmbg:V9CPWXAG.mo3E:5515:165:logs into mbg,,,,:/u0/mbg:/bin/tcsh
mbg10:V9CPWXAG.mo3E:5516:165:alternate account for mbg:/home/mbg10:/bin/tcsh
local::502:20:Local Installed Packages:/u1/local:/bin/tcsh
login::503:20:Hyper Login:/u0/login:/u0/login/hyper-login.pl

Quickie Trees and Clustering

Top-down vs. Bottom up

Top-down when you know how many subdivisions

k-means as an example of top-down
1) Pick k (e.g. ten) random points as putative cluster centers.
2) Group the points to be clustered by the center to which they are closest.
3) Take the mean of each group and repeat, with the means now serving as the cluster centers.
4) Stop when the centers stop moving.
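The four k-means steps can be sketched in a few lines. This is a plain illustration in standard-library Python, not a production clustering routine; the function and parameter names (`kmeans`, `initial_centers`, `max_iter`, `seed`) are assumptions made for the example, and an empty group simply keeps its previous center.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    # 1) Pick k random points as putative centers (or use the ones given).
    rng = random.Random(seed)
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    groups = []
    for _ in range(max_iter):
        # 2) Group each point with the center to which it is closest.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # 3) Take the mean of each group as the new center.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        # 4) Stop when the centers stop moving.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2, initial_centers=[(0, 0), (10, 10)])` separates the two triples of points. The result in general depends on the starting centers, so in practice the procedure is often repeated from several random starts.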

Methods of Building Trees from the bottom up

CHOOSE METHOD - Parsimony

• Minimizing the number of changes at each node
• Requires greater computer resources than distance methods
• Depends on phylogenetically informative sites
• Retains all sequence information throughout the analysis

Problems:
• As the sequences diverge, the accuracy of the inference drops
• Long Edge Attraction
• Multiple islands of “almost the most parsimonious trees” can exist

CHOOSE METHOD - Distance Based

Distance Methods
• Compute distance measures
• Build the tree from the table of distances

Assumptions
• A single coefficient of sequence similarity contains the information necessary to reconstruct the phylogeny
• May reduce the available information

Measuring Distances
• Compute all pairwise distances
• Correct for multiple substitution events
• Weight according to nucleotide substitution frequency
• Weight according to codon degeneracy
• Different measures presuppose different models of character evolution
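The first two distance steps can be sketched as follows. The slides do not name a particular correction; the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3) is used here as one common choice for nucleotide sequences, and it is an assumption of this example. The sequences are assumed to be aligned and of equal length, and the function names are hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def jukes_cantor(p):
    """Correct an observed distance p for multiple substitutions (JC model)."""
    if p >= 0.75:
        return math.inf          # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_table(seqs):
    """All pairwise corrected distances for a dict of name -> sequence."""
    return {
        (i, j): jukes_cantor(p_distance(seqs[i], seqs[j]))
        for i, j in combinations(sorted(seqs), 2)
    }
```

The corrected distance is always at least as large as the observed one, since some substitutions are hidden by later changes at the same site. The resulting table is the input to a tree-building step, which is outside this sketch.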

Bootstrap to Test the Tree

ANALYZE TREE - Bootstrap

• Randomly resample the data with replacement, creating a new dataset that is then used to infer a phylogeny
• Generating replicate samples
• Observe tree topology
• Percentage of grouping
• Majority Rule Consensus
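The resampling step can be sketched as drawing alignment columns with replacement. This illustration assumes an alignment stored as a dict of name -> equal-length sequence; the tree inference itself is not shown, so `grouping_support` takes the groupings found in each replicate tree as given. All function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows in step."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n, seed=0):
    """Generate n replicate datasets from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]


def grouping_support(replicate_groupings, group):
    """Fraction of replicate trees in which `group` appears as a grouping.

    `replicate_groupings` holds one set of groupings (frozensets of taxon
    names) per replicate tree, produced by whatever tree method is in use.
    """
    hits = sum(1 for groupings in replicate_groupings if group in groupings)
    return hits / len(replicate_groupings)
```

Every sequence is resampled with the same column indices, so each replicate is still a valid alignment of the same length as the original.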

Popular Tree Program Systems

PREPARE THE DATA - PAUP
• Phylogenetic Analysis Using Parsimony
• David Swofford, Smithsonian
• Sophisticated parsimony program with a wide variety of options
o Tree building algorithms
o Weighting schemes
o Resampling procedures

PREPARE THE DATA - Phylip
• J. Felsenstein, University of Washington
• A comprehensive set of phylogenetic inference programs
o Maximum Likelihood
o Parsimony
o Distance
o Single and multiple tree algorithms

Tree of Life

Universal Ancestor
    Eubacteria
        Chlamydia
            Chlamydia psittaci
            Chlamydia trachomatis
        Borrelia burgdorferi
        Bacteroidaceae
            Bacteroides fragilis
            Porphyromonas gingivalis
        Cyanobacteria
            Chroococcales
                Microcystis aeruginosa
                Synechococcus sp.
                Synechocystis sp.
            Anabaena
                Anabaena sp.
                Anabaena variabilis
            Fremyella diplosiphon
        Proteobacteria
            gamma subdivision ----
            delta subdivision
                Myxococcus xanthus
                Desulfovibrio vulgaris
            epsilon subdivision
                Campylobacter jejuni
                Helicobacter pylori
            Pseudomonas sp.
        Thermotoga maritima
        Thermus aquaticus
    Archaea and Eukaryotae
        Archaea
            Sulfolobus
                Sulfolobus acidocaldarius
                Sulfolobus solfataricus
            Euryarchaeota ----
        Eukaryotae
            Giardia lamblia
            mitochondrial eukaryotes ----

GenProtEC - Functional Classification

the E. coli database
http://genprotec.mbl.edu/start

COGs - Orthologs

Ortholog ~ gene with precise same role in diff. organism, directly related by descent from a common ancestor

Ortholog, homolog, fold vs Paralog


Example Report: Motions Database

CREATE TABLE classes (
    class_num_ CHAR(10),
    new CHAR(10),
    class_name CHAR(80)
)
CREATE TABLE classifications (
    id_ CHAR(10),
    class_num CHAR(10)
)
CREATE TABLE links (
    id_ CHAR(10),
    url_ CHAR(150),
    hilit_text CHAR(100),
    other_text CHAR(500),
    flag CHAR(5)
)
CREATE TABLE names (
    id_ CHAR(10),
    seq_num_n INT,
    name CHAR(255)
)
CREATE TABLE refs (
    id_ CHAR(10),
    medline_I INT,
    endnote_I INT,
    flag_n INT
)
CREATE TABLE descriptions (
    id_ CHAR(10),
    num_I INT,
    prose CHAR(5000)
)
CREATE TABLE relations (
    id_ CHAR(15),
    id_to_ CHAR(15),
    type CHAR(30),
    comment CHAR(512)
)
CREATE TABLE single_vals (
    id_ CHAR(10),
    name_ CHAR(30),
    val CHAR(30),
    comment CHAR(500)
)
CREATE TABLE structures (
    id_ CHAR(10),
    pdb_id_ CHAR(8),
    name_short CHAR(50),
    chain CHAR(1),
    name_long CHAR(100)
)
CREATE TABLE value_names (
    abbrev_ CHAR(15),
    name CHAR(50)
)
CREATE TABLE endnote_refs (
    num_I INT,
    name CHAR(512)
)

Report shows information, merging together many tables with variable amounts of information. Form same but allows entry.

Schema

Example Report: Motions Database

Structures: Variable Number Per ID (Var. Num. of Phone Num. per Person), Foreign Key into PDB

Single Values: Joining Two Tables and Iterating in Perl

# Join value_names and single_vals for one motion id.
$sth = $dbh->query(
    "SELECT value_names.name, single_vals.val, single_vals.comment " .
    "FROM value_names, single_vals " .
    "WHERE single_vals.id_ = '$id' AND single_vals.name_ = value_names.abbrev_ " .
    "ORDER BY value_names.name");

$rows = $sth->numrows;

# Print a heading, then one line per returned row.
if ($rows > 0) {
    &PrintHead("Particular values describing motion");
    for ($i = 0; $i < $rows; $i++) {
        @values = $sth->fetchrow;
        PrintSingleVals(@values);
    }
}

NAMES
id_     seq_num_n  name
aat     7          Aspartate Amino Transferase (AAT)
acetyl  1005       Acetylcholinesterase
br      97         Bacteriorhodopsin (bR)
cm      23         Calmodulin

REFS
id_     medline_I  endnote_I
acetyl  0          1007
br      90294303   893
br      93154310   313
cm      92263094   648
cm      92390716   647
cm      94082290   673

ENDNOTE_REFS
num_I  name
313    S Subramaniam, M Gerstein, D Oesterhelt and R H Hender
893    R Henderson, J M Baldwin, T A Ceska, F Zemlin, E Beckm
1007   M K Gilson, T P Straatsma, JA A McCammon, D R Ripoll,
647    W E Meador, A R Means and F A Quiocho (1992). Target e
648    M Ikura, G M Clore, A M Gronenborn, G Zhu, C B Klee an
649    B-H Oh, J Pandit, C-H Kang, K Nikaido, S Gokcen, G F-L

References: Join Two Lists (Protein Names and References) with a Table Containing Key for each List (a Relation: protein has reference.)

SELECT endnote_refs.name, refs.medline_I
FROM endnote_refs, refs
WHERE refs.id_ = 'cm' AND refs.endnote_I = endnote_refs.num_I
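The references join can be run against the sample rows listed for REFS and ENDNOTE_REFS. The sketch assumes Python's standard `sqlite3` module in place of the original database; the reference names are kept truncated exactly as they appear on the slide, and `references_for` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE refs (id_ TEXT, medline_I INTEGER, endnote_I INTEGER);
    CREATE TABLE endnote_refs (num_I INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO refs VALUES
        ('acetyl', 0, 1007), ('br', 90294303, 893), ('br', 93154310, 313),
        ('cm', 92263094, 648), ('cm', 92390716, 647), ('cm', 94082290, 673);

    INSERT INTO endnote_refs VALUES
        (313, 'S Subramaniam, M Gerstein, D Oesterhelt and R H Hender'),
        (893, 'R Henderson, J M Baldwin, T A Ceska, F Zemlin, E Beckm'),
        (1007, 'M K Gilson, T P Straatsma, JA A McCammon, D R Ripoll,'),
        (647, 'W E Meador, A R Means and F A Quiocho (1992). Target e'),
        (648, 'M Ikura, G M Clore, A M Gronenborn, G Zhu, C B Klee an'),
        (649, 'B-H Oh, J Pandit, C-H Kang, K Nikaido, S Gokcen, G F-L');
""")


def references_for(protein_id):
    """Return (reference name, medline id) pairs for one protein id."""
    cur = conn.execute(
        "SELECT endnote_refs.name, refs.medline_I "
        "FROM endnote_refs, refs "
        "WHERE refs.id_ = ? AND refs.endnote_I = endnote_refs.num_I "
        "ORDER BY refs.medline_I",
        (protein_id,))
    return cur.fetchall()
```

With these sample rows, the 'cm' entry pointing at endnote number 673 has no partner in ENDNOTE_REFS, so it is silently dropped by the join; an outer join would be needed to list it.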

Graphics: How to Store Complex Data? (File Pointers, BLOBS, OODB)

Large-scale Example: Census DB

• 9 Genome Comparison
• 1437 Relational Tables
• 442 Mb
• Simple ASCII Layout

Major Application II: Overall Genome Characterization

• Overall Occurrence of a Certain Feature in the Genome
◊ e.g. how many kinases in Yeast
• Compare Organisms and Tissues
◊ Expression levels in Cancerous vs Normal Tissues
• Databases, Statistics

(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)

The World of Structures is also Finite: A Fold Library

• Structure helps to understand genomes in simplest terms -- fewest parts & most duplication
• Structural domain more precisely defined than sequence module
• Sequence Similarity more reliably related to Structure than Function
• Many approaches to building Library
◊ Manual (scop, Murzin)
◊ Automatic: FSSP-HSSP (Holm/Sander), Entrez-MMDB (Bryant)
◊ Semi-automatic: CATH (Thornton), HOMALDB (Sali)
◊ Sequences 1st: Pfam (Durbin/Eddy), COGs (Koonin/Lipman), Blocks (Henikoff), ProSite (Bairoch)

[Figure: ~1000 folds; ~100000 genes (human); ~1000 genes (T. pallidum)]

Cross-Reference: Folds → Sequences → Organisms

Abbrev.  Kingdom (subgroup)        Genome                    Num. ORFs  Reference
EC       Bacteria (gram negative)  Escherichia coli          4290       Blattner et al.
HI       Bacteria (gram negative)  Haemophilus influenzae    1680       TIGR
HP       Bacteria (gram negative)  Helicobacter pylori       1577       TIGR
MG       Bacteria (gram positive)  Mycoplasma genitalium     468        TIGR
MJ       Archaea (Euryarchaeota)   Methanococcus jannaschii  1735       TIGR
MP       Bacteria (gram positive)  Mycoplasma pneumoniae     677        Himmelreich et al.
SC       Eukarya (fungi)           Saccharomyces cerevisiae  6218       Goffeau et al.
SS       Bacteria (Cyanobacteria)  Synechocystis sp.         3168       Kaneko et al.

class Fold# EC SC HI SS HP MJ MP MG total Fam. PDB Rep. Struc. Name

α/β 18 60 46 23 40 19 7 4 3 202 16 183 1xel - NAD(P)-binding Rossmann Fold

α/β 24 20 69 17 19 17 16 10 11 179 13 132 1gky - P-loop Containing NTP Hydrolases

α+β 31 37 28 18 16 12 40 3 3 157 23 160 1fxd - like Ferredoxin

α/β 01 45 36 13 22 11 10 5 4 146 37 399 1byb - TIM-barrel

α/β 23 18 17 7 9 4 8 2 2 67 5 36 1pyd a:2-181 Thiamin-binding

α/β 04 15 11 7 10 1 9 5 5 63 13 132 2tmd a:490-645 FAD/NAD(P)-binding

α+β 55 8 9 7 8 9 3 6 6 56 4 23 1sry a:111-421 Class-II-aaRS/Biotin Synthetases

β 27 7 10 8 8 4 4 3 3 47 5 19 1fnb 19-154 Reductase/Elongation Factor Domain

β 24 13 7 4 3 3 3 3 3 39 18 177 1snc - OB-fold

α+β 11 10 8 4 8 2 2 2 1 37 11 48 1igd - beta-Grasp

β 55 9 10 5 5 2 2 2 2 37 7 19 1bdo - Barrel-sandwich hybrid

α/β 15 5 5 4 4 5 6 3 3 35 3 22 2ts1 1-217 ATP pyrophosphatases

α/β 05 10 4 2 4 2 2 2 3 29 4 35 1zym a: The "swivelling" beta/beta/alpha domain

α/β 60 5 7 4 6 3 2 1 1 29 3 18 3pmg a:1-190 Phosphoglucomutase, first 3 domains

α+β 68 4 2 3 6 4 2 4 3 28 2 3 1mat - Creatinase/methionine aminopeptidase

α+β 39 6 4 3 4 4 1 1 1 24 3 42 1gad o:149-312 like G3P dehydrogenase, Ct-dom

α+β 18 5 4 4 1 2 2 1 2 21 3 23 1fkd - FKBP-like

α/β 41 3 3 3 3 1 3 1 1 18 3 16 1opr - Phosphoribosyltransferases (PRTases)

α 78 1 9 1 2 1 1 1 1 17 1 23 1oel a:(*) GroEL, the ATPase domain

α+β 10 2 2 2 4 2 1 2 2 17 2 5 1dar 477-599 Ribosomal protein S5 domain 2-like

α+β 43 4 3 2 2 1 1 2 2 17 4 50 3grs 364-478 FAD/NAD-linked reductases, dimer-dom.

α+β 09 3 4 3 1 2 1 1 1 16 3 12 1kpa a: HIT-like

α/β 47 4 2 3 1 2 1 1 1 15 2 10 1ulb - Purine and uridine phosphorylases

α+β 33 3 1 3 3 2 1 1 1 15 2 3 1tig - IF3-like

α+β 26 2 3 1 2 2 1 1 1 13 3 4 1stu - dsRBD & PDA domains

α+β 29 2 5 1 1 1 1 1 1 13 3 26 1one a:1-141 like Enolase, Nt-dom.

Μ 11 2 1 2 1 2 2 1 1 12 1 1 1ecl - type I DNA topoisomerase

β 23 1 3 1 1 1 1 1 1 10 1 1 1whi - Ribosomal protein L14

α/β 31 2 2 1 1 1 1 1 1 10 1 10 1trk a:535-680 Transketolase, Ct-dom.

α/β 61 1 1 1 1 1 1 1 1 8 1 4 3pgk - Phosphoglycerate kinase

α/β 13 49 8 14 57 12 5 1 146 15 100 3chy - Flavodoxin-like

α/β 38 24 54 15 11 4 4 5 117 19 112 2rn2 - Ribonuclease H-like motif

α 02 7 18 6 9 4 5 5 54 4 33 1hdj - Long alpha-hairpin

β 21 14 13 3 3 2 2 1 38 2 44 1lep a: GroES-like

α/β 30 7 13 4 10 2 1 1 38 7 83 1srx - Thioredoxin-like

α/β 56 8 4 2 4 2 4 2 26 3 105 2at2 a: Asp-carbamoyltransferase, Cat.-chain

α+β 70 3 6 3 3 3 3 3 24 3 24 1mxa 1-101 S-adenosylmethionine synthetase. MAT

α/β 44 2 1 3 5 6 4 2 23 5 16 1vid - SAM-dependent methyltransferases

Μ 12 4 1 4 3 2 4 4 22 1 1 1bgw - type II DNA topoisomerase

Μ 16 3 10 2 3 1 1 1 21 1 4 1dkz a: like HSP70, Ct-dom.

β 31 4 2 3 3 3 2 1 18 3 20 1bmf a:24-94 like F1 ATP synthase, a & b sub., A-dom.

α 21 4 2 4 3 2 1 1 17 5 54 1fha - Ferritin-like

α/β 55 3 6 1 2 1 2 1 16 1 29 1xaa - Isocitrate/isopropylmalate dehydrogenases

α+β 71 3 2 3 3 2 2 1 16 5 10 2pol a:1-122 DNA clamp

α 49 2 2 2 2 2 2 2 14 2 18 1bmf a:380-510 Left-handed superhelix

α/β 50 4 4 1 2 1 1 1 14 3 27 2ctb - Zn-dependent exopeptidases

α/β 43 4 1 2 3 1 1 1 13 1 7 1cde - Glycinamide ribonucleotide transformylase

β 53 2 1 2 2 2 1 1 11 1 4 1lxa - Single-stranded left-handed beta-helix

β 38 2 2 1 2 1 1 1 10 1 7 1pkn 116-217 Pyruvate kinase beta-barrel domain

β 28 2 1 2 1 1 1 1 9 1 6 1efu a:297-393 EF-Tu, Ct-dom.

α/β 03 2 2 1 1 1 1 1 9 1 1 1rlr 221-748 ribonucleotide reductase, R1 sub., Ct-dom.

α+β 85 1 3 1 1 1 1 1 9 3 43 1mld a:145-313 like LDH/MDH, Ct-dom.

α 15 1 1 1 1 1 1 1 7 1 3 1bmf g: F1-ATPase, gamma subunit

α+β 24 1 1 1 1 1 1 1 7 1 1 1ctf - Ribosomal protein L7/12, Ct-dom.

[Figure: Individual Structures grouped into Sequence Families and folds ("Superfold")]

class Fold# EC SC HI SS HP MJ MP MG total Fam.PDB Rep. Struc. Name

α/β 18 60 46 23 40 19 7 4 3 202 16 183 1xel - NAD(P)-bindin

α/β 24 20 69 17 19 17 16 10 11 179 13 132 1gky - P-loop Contai

α+β 31 37 28 18 16 12 40 3 3 157 23 160 1fxd - like Ferrodoxi

α/β 01 45 36 13 22 11 10 5 4 146 37 399 1byb - TIM-barrel

α/β 23 18 17 7 9 4 8 2 2 67 5 36 1pyd a:2-181 Thiamin-bindin

α/β 04 15 11 7 10 1 9 5 5 63 13 132 2tmd a:490-645 FAD/NAD(P)-

α+β 55 8 9 7 8 9 3 6 6 56 4 23 1sry a:111-421 Class-II-aaRS

β 27 7 10 8 8 4 4 3 3 47 5 19 1fnb 19-154 Reductase/El

β 24 13 7 4 3 3 3 3 3 39 18 177 1snc - OB-fold

α+β 11 10 8 4 8 2 2 2 1 37 11 48 1igd - beta-Grasp

(1) Structures in Folds (scop)

(2) Match Sequences (fasta, blast)

(3) Organize Sequences by Genome or Taxon

(4) Results in “Fold Table”
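Steps (1) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_fold_table`, the three input mappings and the toy sequence identifiers are hypothetical, and the real pipeline (scop parsing, fasta/blast runs, score cutoffs) is not shown.

```python
from collections import defaultdict

def build_fold_table(structure_fold, matches, sequence_genome):
    """Count, per fold and genome, the sequences matched to that fold.

    structure_fold:  {structure_id: fold_name}       step (1), e.g. from scop
    matches:         [(sequence_id, structure_id)]   step (2), e.g. fasta/blast hits
    sequence_genome: {sequence_id: genome_code}      step (3)
    """
    table = defaultdict(lambda: defaultdict(set))
    for seq_id, struct_id in matches:
        fold = structure_fold.get(struct_id)
        genome = sequence_genome.get(seq_id)
        if fold is None or genome is None:
            continue  # hit cannot be placed in the table
        table[fold][genome].add(seq_id)  # a set, so repeated hits count once
    # step (4): collapse the sets to counts
    return {fold: {g: len(seqs) for g, seqs in by_genome.items()}
            for fold, by_genome in table.items()}

# Toy data: the sequence identifiers are made up for illustration.
structure_fold = {"1byb": "TIM-barrel", "1gky": "P-loop"}
matches = [("ec001", "1byb"), ("ec002", "1byb"), ("sc001", "1byb"),
           ("sc002", "1gky"), ("ec001", "1byb")]
sequence_genome = {"ec001": "EC", "ec002": "EC", "sc001": "SC", "sc002": "SC"}
fold_table = build_fold_table(structure_fold, matches, sequence_genome)
print(fold_table)
```

Storing sequence identifiers in sets rather than incrementing counters keeps a sequence that hits the same fold several times from being counted more than once.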

[Figure: a genome sequence divided into numbered regions. Legend: 1 PDB Match (152); 2 Low Complexity Region (116); 3 TM helix (30); 4 Linker Region (5); 5 Coiled-Coil; 6 All-alpha or All-beta Region; Structurally Uncharacterized (186)]

Venn Diagrams for Shared Folds

[Figure: Venn diagrams of folds shared among taxa: Eubacteria, Eukaryote, Other Euk.; Metazoa, Other Met.; Arthropod, Chordate]

Eukaryotes (229)

Eubacteria (202)

other (virus) (78)

Metazoa (194)

Plants (124)

other eukaryotes (151)

Chordates (181)

other metazoa (126)

Arthropods (105)

~300-350 folds (282 folds in scop 1.32 [‘96])

~120K sequences in OWL 27.1

7 phylogenetic groups of organisms

5 genomes -- HI, EC (bacteria), MJ (archaeon), SC (eukaryote), CE (worm, animal)

[Figure: Venn diagrams of shared folds; recoverable labels: MJ, SC, "of 339"]

Patterns of Fold Usage in 8 Genomes

[Figure: number of folds, families and superfolds vs. "Fold" Present in at Least this Many Genomes (0-8)]

Presence patterns; columns, left to right: EC SC HI SS HP MJ MP MG (1 = present, . = absent); (##) = number of folds with that pattern

11111111 (30)  .1...... (23)  1....... (19)  11111.11 (16)  111111.. (16)
1111.... (09)  11111... (08)  1.1..... (08)  1.111.11 (06)  11...... (06)
...1.... (06)  1.11.... (05)  .1.1.... (05)  1.111... (04)  11.1.... (04)
.1...1.. (04)  ..1..... (04)  111111.1 (03)  1111111. (03)  1111..11 (03)
1111.1.. (03)  .....1.. (03)  1111.111 (02)  111...11 (02)  111.11.. (02)
1.11.1.. (02)  ..111... (02)  .1.11... (02)  1..1.1.. (02)  1.1..1.. (02)
111..... (02)  .11..... (02)  ......1. (02)  ....1... (02)  111..111 (01)
111.1.11 (01)  1.111..1 (01)  1.1111.. (01)  .1.1..11 (01)  .1.11.1. (01)
.11.1..1 (01)  1....111 (01)  1..111.. (01)  1.1...11 (01)  1.1..11. (01)
11....11 (01)  11.1.1.. (01)  11.11... (01)  111..1.. (01)  111.1... (01)
.11...1. (01)  1.....11 (01)  1...11.. (01)  1.1.1... (01)  ......11 (01)
....1..1 (01)  ...1.1.. (01)  ...11... (01)  ..1.1... (01)  .1....1. (01)
1....1.. (01)  .......1 (01)

                                fold   fam.   superfold
total in PDB                     338    990      25
in at least one of 8 genomes     240    547      23

present in this many genomes
  1                               60    192       1
  2                               32     82       4
  3                               23     54       3
  4                               27     53       3
  5                               17     50       0
  6                               27     49       3
  7                               24     41       2
  8                               30     26       7
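The "present in this many genomes" counts can be recomputed from the presence patterns by counting the 1s in each pattern. The sketch below assumes the patterns are available as a dictionary from pattern string to number of folds; the function name `folds_by_genome_count` is hypothetical and only five of the patterns are entered.

```python
from collections import Counter

def folds_by_genome_count(pattern_counts):
    """Map 'number of genomes containing the fold' -> number of folds."""
    histogram = Counter()
    for pattern, n_folds in pattern_counts.items():
        histogram[pattern.count("1")] += n_folds
    return histogram

# Five of the patterns from the slide, with their fold counts.
patterns = {
    "11111111": 30,
    ".1......": 23,
    "1.......": 19,
    "11111.11": 16,
    "111111..": 16,
}
hist = folds_by_genome_count(patterns)
for n_genomes in sorted(hist):
    print(n_genomes, hist[n_genomes])
```

With the full list of patterns entered, the same function should give the fold column of the summary table (60 folds in exactly one genome, 30 in all eight).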

("Superfold")

Superfold = fold that allows many non-homologous seq. (Thornton)

Cluster Trees Grouping Initial Genomes on Basis of Shared Folds

D = shared fold dist. betw. 2 genomes

D = S/T

S = # shared folds

T = total # folds in both

Example (Venn diagram with 20 and 30 unshared folds and 10 shared): D = 10/(20+10+30)

[Figure: Fold Tree vs. “Classic” Tree]
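As a minimal sketch, D = S/T can be computed directly from two sets of fold identifiers, taking T as the size of the union, which matches the worked example 10/(20+10+30). The function name `shared_fold_distance` and the toy fold labels are assumptions for illustration.

```python
def shared_fold_distance(folds_a, folds_b):
    """D = S/T: shared folds divided by the total number of folds in both genomes."""
    shared = folds_a & folds_b   # S
    total = folds_a | folds_b    # T
    if not total:
        raise ValueError("both fold sets are empty")
    return len(shared) / len(total)

# Toy genomes reproducing the slide's numbers: 20 unshared, 10 shared, 30 unshared.
genome_a = {f"fold{i}" for i in range(0, 30)}    # folds 0..29
genome_b = {f"fold{i}" for i in range(20, 60)}   # folds 20..59
d = shared_fold_distance(genome_a, genome_b)
print(round(d, 3))
```

A matrix of such values over all genome pairs is the kind of input a standard clustering routine needs to build a fold tree.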

20 Genomes

Top-10 Folds in a Genome

Rank  M. genitalium                #    B. subtilis                  #     E. coli                      #
1     ∆ P-loop hydrolase           60   ∆ P-loop hydrolase           173   ∆ P-loop hydrolase           191
2     = SAM methyl-transferase     16   ⊗ Rossmann domain            165   ⊗ Rossmann domain            158
3     ⊗ Rossmann domain            13   • Phosphate-binding barrel   79    • Phosphate-binding barrel   64
4     Class I synthetase           12   ♦ PLP-transferase            44    ♦ PLP-transferase            38
5     Class II synthetase          11   ∗ CheY-like domain           36    ∗ CheY-like domain           36
6     Nucleic acid binding dom.    11   = SAM methyl-transferase     30    ◊ Ferredoxins                35

Total ORFs                         479                               4268                               4268
with Common Superfamilies          105 (22%)                         465 (11%)                          458 (11%)

M. thermo-autotrophicum

A. fulgidus

Rank Superfamily # Superfamily #

1 ∆∆∆∆ P-loophydrolyase

2 •••• Phosphate-binding barrel

54 ⊗⊗⊗⊗ Rossmanndomain

3 ⊗⊗⊗⊗ Rossmanndomains

53 •••• Phosphate-binding barrel

4 ◊◊◊◊ Ferredoxins 48 ◊◊◊◊ Ferredoxins 49

5 = SAM methyl-tranferase

17 = SAM methyl-tranferase

6 ♦♦♦♦ PLP-transferases 15 ♦♦♦♦ PLP-transferases 18

Total ORFs 1869 2409with CommonSuperfamilies

252(14%)

309(13%)

Rank Superfamily #

2 x Protein kinase 123

3 ⊗⊗⊗⊗ Rossmanndomain

4RNA-binding

domain75

5 = SAM methyl-transferase 63

6Ribonuclease H-

like57

Total ORFs 6218with CommonSuperfamilies

560(9%)

S. cerevisiae

Eubacteria

Archaea

Depends on comparison method, DB, &c (new top superfamilies via ψ-Blast, Intersection of top-10 to get shared and common)

Top-10 Worm Folds

Fold                               class  num. matches in worm genome  frac. all worm dom. (F)
Ig                                 B      830                          1.7%   18   4
Knottins                           SML    565                          1.1%   0    3
Protein kinases (cat. core)        MULT   472                          0.9%   1    142
C-type lectin-like                 A+B    322                          0.6%   0    1
corticoid recep. (DNA-bind dom.)   SML    276                          0.5%   1    10
Ligand-bind dom. nuc. receptor     A      257                          0.5%   0    0
alpha-alpha superhelix             A      247                          0.5%   6    114
C2H2 Zn finger                     SML    239                          0.5%   0    78
P-loop NTP Hydrolase               A/B    235                          0.5%   72   133
Ferredoxin                         A+B    207                          0.4%   83   114

Characteristics of Common, Shared Folds: βαβ structure

All share α/β structure with repeated R.H. βαβ units connecting adjacent strands or nearly so (18+4+2 of 24)

HI, MJ, SC vs scop 1.32

336: 42

What are the most common folds: Overall? In plants? In animals?

Num. of Sequences

Eukaryote

ss Fold Name

Totals

6 3139

Overall Top-10 ∇ 1REI-A β Immunoglobulin-like 32 13 ◊ 1 ◊ 25 ◊6TIM-B α/β TIM-barrel 29 6 ◊ 7 20 2 13

1ATP-E O Protein Kinases (catalytic core) 1 4 3 ◊ 3 6 61FXD O Ferredoxin-like 17 4 2 2 17 ◊ 81AKE-A α/β NTP Hydrolases containing P-loop 9 3 ◊ 5 3 2 71HDD-C α DNA-binding 3-helical bundle 13 3 ◊ ◊ 2 5 ◊2HSD-A α/β Rossmann Fold (NAD binding) 11 3 ◊ 7 3 1 31MBD α Globin-like 3 2 1 ◊ 4 12RN2 α/β like Ribonuclease H 15 2 5 1 2 1 5

1ZNF S Classic Zinc Finger 2 1 ◊ 3 1

Sequence Family Top-11 ∇

1REI-A β Immunoglobulin-like 32 13 ◊ 1 ◊ 25 ◊6TIM-B α/β TIM-barrel 29 6 ◊ 7 20 2 13

1FXD O Ferredoxin-like 17 4 2 2 17 ◊ 82RN2 α/β like Ribonuclease H 15 2 5 1 2 1 51PYP β OB-fold 15 ◊ ◊ 1 ◊ ◊ ◊1PTX S Small inhibitors, toxins, lectins 14 ◊ 3 ◊ ◊2TBV-C β Viral coat and capsid proteins 14 1 12 1HDD-C α DNA-binding 3-helical bundle 13 3 ◊ ◊ 2 5 ◊2HSD-A α/β Rossmann Fold (NAD binding) 11 3 ◊ 7 3 1 31RCF α/β Flavodoxin-like 11 ◊ ◊ 4 ◊ ◊ ◊1RCB α 4-helical cytokines 11 ◊ ◊ ◊ 2

Percent of Sequences

Eukaryote

Fold Name

Plant Top-10 ∇∇∇∇α/β TIM-barrel 29 6 ³ 7 20 2 13

O like Ferredoxin 17 4 2 2 17 ³ 8α/β NTP Hydrolases containing P-loop 9 3 ³ 5 3 2 7

O Protein Kinases (catalytic core) 1 4 3 ³ 3 6 6

S Small inhibitors, toxins, lectins 14 ³ 3 ³ ³α/β Rossmann Fold (NAD binding) 11 3 ³ 7 3 1 3

O RuBisCO (small subunit) 1 ³ ³ 2 ³β like Concanavalin A 6 ³ ³ ³ 2 ³ 2

α like Hydrophobic Seed Protein 2 ³ 2

α/β like Ribonuclease H 15 2 5 1 2 1 5

Metazoan Top-10 ∇∇∇∇β like Immunoglobulin 32 13 ³ 1 ³ 25 ³

O Protein Kinases (catalytic core) 1 4 3 ³ 3 6 6α DNA-binding 3-helical bundle 13 3 ³ ³ 2 5 ³α like Globin 3 2 1 ³ 4 1

S Classic Zinc Finger 2 1 ³ 3 1α/β NTP Hydrolases containing P-loop 9 3 ³ 5 3 2 7

β Trypsin-like serine proteases 4 1 1 ³ 2 ³

α Cytochrome P450 1 1 ³ ³ 2 1

S like Glucocort. receptor (DNA-binding) 4 1 ³ 2 ³α EF-hand 3 1 ³ 1 2 1

An Issue with Fold Counting: Biases in the Databanks

Top-10 in a bacterial genome (H. influenzae)

Example Structure (PDB)  Fold Name                                     Percentage of known folds in genome  Rank in eubacterial Top-10
2HSD-A                   Rossmann Fold (NAD binding)                   9.6                                  1
1AKE-A                   NTP Hydrolases containing P-loop              5.7                                  3
1RCF                     Flavodoxin-like                               5.1                                  4
6TIM-B                   TIM-barrel                                    4.5                                  2
1FXD                     Ferredoxin-like                               4.2                                  5
2RN2                     like Ribonuclease H                           3.0                                  16
1SBP                     like Periplasmic binding protein (class II)   3.0                                  11
2DRI                     like Periplasmic binding protein (class I)    3.0                                  19
1SRY-*                   Class II aaRS and biotin synthetases          2.7                                  50
1PYP                     OB-fold                                       2.7                                  9

• Over-representation of certain species and functions in the databanks (e.g. human v. plant globins, Ig’s)
• Nevertheless HI top-10 like eubacterial top-10
• PDB small, biased sample of genome (6-12%)
• Diff. numbers with diff. comparison sensitivity
  ◊ FASTA, HMM, &c
• Some Correction with Seq. Weighting, Diff. Sampling
• Uniform sampling is better than high sensitivity for some and low for others (ψ-blast problem)
• Best to avoid FPs than FNs for Venn

[Figure: Sampling of Globin sequences by type: HBα, HBβ, Mb, Other]

Same Issues with Real US Census!!

Using a Tree to Correct for Biases

[Figure: tree over sequences A, B, C, D (100%); weights .8 .8 1.1 1.4]

• Databank has biases.
• Assuming "fair" distribution spreads sequences uniformly through "space", want to weight sequences:
  ◊ over-represented, down (mammal)
  ◊ under-represented, up (plant & NV)
• Weights derived from a tree
  ◊ Length of an unshared branch is allotted directly to sequence
  ◊ Length of a shared branch is divided proportionally among sequences

Other schemes (Argos, Sander)
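The two tree rules can be sketched as a short recursive function. This is an illustration under assumptions, not the exact published scheme: the nested-tuple tree encoding, the name `tree_weights` and the toy branch lengths are all hypothetical, and "divided proportionally" is read here as proportional to the weight each sequence has already accumulated below the shared branch.

```python
def tree_weights(node):
    """Return {sequence_name: weight} for the subtree rooted at `node`.

    A leaf is (name, branch_length); an internal node is
    ([child, child, ...], branch_length).
    """
    label, length = node
    if isinstance(label, str):
        # Unshared branch: its whole length goes to the one sequence below it.
        return {label: length}
    weights = {}
    for child in label:
        weights.update(tree_weights(child))
    below = sum(weights.values())
    if length > 0 and below > 0:
        # Shared branch: split its length in proportion to the weight
        # each sequence below it has accumulated so far.
        for name in weights:
            weights[name] += length * weights[name] / below
    return weights

# Toy tree: A and B are close relatives sharing a branch of length 2; C is isolated.
toy_tree = ([([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)], 0.0)
weights = tree_weights(toy_tree)
print(weights)
```

In the toy tree the weights sum to the total branch length (7.0), and the isolated sequence C ends up with more weight than either of the two close relatives, which is the intended down-weighting of over-represented groups.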

Know All Folds in a Genome: How are we doing on MG?

• MG smallest genome with 479 ORFs
• Separate PDB Match, TMs, LC (SEG), linkers
• How many residues in genome matched by known folds, in 1975, ‘76, ‘77...’00...’50
• The impact of PSI-blast in comparison to pairwise methods
  ◊ Two way PSI-blast gives an improvement (genome vs PDB, PDB vs. genome)
• Union of many sets of PDB matches finds >40% of a.a. and more than half the ORFs (242/479)
  ◊ (Eisenberg, Godzik, Bork, Koonin, Frishman)

• ~65% structurally characterized

[Figure: Fraction of the MG Genome (by residue) with Structural Annotation over Time, 1974-98. Labels: PDB matches (orig. '97 fasta, ψ-blast, all matches); good TMs and low-complexity regions; linkers; poor or low-qual. TM and LC predictions; known func. vs. no func.]

Know All Folds in Genome: MG Optimistic → Prediction

• Just use one pairwise method for matching
• Multiple, big genomes (e.g. SC)

[Figure: match fraction vs. Year, 1987-1997, for SC, MJ, HI, MP, MG, EC, SS, HP with FIT; FIT extrapolated over 1970-2050 for HI, MP, MG]

TM-helix “prediction”

• TM prediction (KD, GES). Count number with 2 peaks, 3 peaks, &c.
• Similar conclusions to others: von Heijne, Rost, Jones, &c.
• Divide Predictions into sure and marginal (Boyd & Beckwith’s criteria)
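Peak counting on a hydropathy profile can be sketched as follows, using the Kyte-Doolittle (KD) scale. The 19-residue window and the 1.6 cutoff are commonly quoted KD settings and are assumptions here; the slides do not state which window or "Min H value" thresholds were used, and the function name `count_tm_peaks` and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def count_tm_peaks(seq, window=19, threshold=1.6):
    """Count separate stretches whose windowed mean hydropathy reaches `threshold`."""
    peaks = 0
    in_peak = False
    for i in range(len(seq) - window + 1):
        # Unknown residue codes contribute 0.0 (an assumption of this sketch).
        mean_h = sum(KD.get(aa, 0.0) for aa in seq[i:i + window]) / window
        if mean_h >= threshold:
            if not in_peak:
                peaks += 1   # a new peak starts here
            in_peak = True
        else:
            in_peak = False
    return peaks

# Artificial sequence: two hydrophobic stretches separated by charged residues.
toy_orf = "K" * 30 + "L" * 22 + "K" * 30 + "I" * 22 + "K" * 30
n_peaks = count_tm_peaks(toy_orf)
print(n_peaks)
```

Running the same count with a stricter and a looser threshold is one simple way to split predictions into "sure" and "marginal" sets.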

[Figure: distribution of Number of TM Helices (1-14) for Bacteria (HI), Eukaryote (SC), Archaeon (MJ); Thresholds on Min H value separating TM, Marginal and Soluble; Number of TM helices per ORF (1-19), with marginal predictions marked]

Comparative Genomics of Membrane Proteins

• Yeast has more mem. prots., esp. 2-TMs
• Similar conclusions to others: von Heijne, Rost, Jones, &c.
• Overall, no strong preference for particular supersecondary structures
  ◊ Freq. of Number of TM helixes follows a Zipf-like law: F = 1/(5n²)
• In detail, worm has a peak for 7-TMs and E. coli for 12-TMs
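The Zipf-like law is easy to tabulate. The snippet below simply evaluates F = 1/(5n²) for a few values of n; the function name `tm_frequency` is hypothetical, and F is taken at face value as a frequency without assuming any particular normalization.

```python
def tm_frequency(n):
    """Zipf-like expected frequency of ORFs with n TM helices: F = 1/(5 n^2)."""
    if n < 1:
        raise ValueError("n must be a positive number of TM helices")
    return 1.0 / (5 * n ** 2)

expected = {n: tm_frequency(n) for n in (1, 2, 7, 12)}
for n, f in expected.items():
    print(f"{n:2d} TM helices: F = {f:.4f}")
```

Comparing these expected values with observed counts at n = 7 (worm) or n = 12 (E. coli) is how a peak above the overall trend would show up.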

[Figure: distribution of Number of TM Helices for Bacteria (HI), Eukaryote (SC), Archaeon (MJ); log-scale plot (1, 10, 100) with FIT; Number of TM helices (1-17) for worm, yeast, E. coli]

2º Structure Prediction

Fraction of residues Predicted to be in...

      strand  helix
Avg   17%     39%
SD    1%      2%
EC    17%     39%
HI    16%     41%
HP    15%     42%
MG    17%     39%
MJ    19%     37%
MP    17%     39%
SC    17%     34%
SS    16%     38%

• Bulk prediction of 2º struc. in genomes
• Same fraction of α and β (by element, half each)
• Both overall and only for unknown soluble proteins.
• Diff From PDB: 31% helical and 21% strand.
• Related results: Frishman

Not expected since.…..

Different Amino Acid Composition Should Give Different 2º Structure

Each a.a. has different propensity for local structure
-> Different Compositions (K from 4.4 in EC to 10.4 in MJ, Q too)
-> Different Local Structure (but compensation?)

Propensities from Regan (beta) and Baldwin (alpha)

EC HI SS SC HP MP MG MJ TM-hlx helix strand

K 4.4 6.3 4.2 7.3 8.9 8.6 9.5 10.4 8.8 -1.5 -0.4

C 1.2 1.0 1.0 1.3 1.1 .8 .8 1.3 -2 -1.1 -0.8

R 5.5 4.5 5.1 4.5 3.5 3.5 3.1 3.8 12.3 -1.9 -0.4

N 4.0 4.9 4.0 6.1 5.9 6.2 7.5 5.3 4.8 -1 -0.5

Q 4.4 4.6 5.6 3.9 3.7 5.4 4.7 1.5 4.1 -1.3 -0.4

A 9.5 8.2 8.5 5.5 6.8 6.7 5.6 5.5 -1.6 -1.9 0

I 6.0 7.1 6.3 6.6 7.2 6.6 8.2 10.5 -3.1 -1.2 -1.3

H 2.3 2.1 1.9 2.2 2.1 1.8 1.6 1.4 3 -1.1 -0.4

S 5.8 5.8 5.8 9.0 6.8 6.5 6.6 4.5 -0.6 -1.1 -0.9

M 2.8 2.4 2.0 2.1 2.2 1.6 1.5 2.2 -3.4 -1.4 -0.9

P 4.4 3.7 5.1 4.3 3.3 3.5 3.0 3.4 0.2 3 >3.0

G 7.4 6.6 7.4 5.0 5.8 5.5 4.6 6.3 -1 0 1.2

F 3.9 4.5 4.0 4.5 5.4 5.6 6.1 4.2 -3.7 -1 -1.1

E 5.7 6.5 6.0 6.5 6.9 5.7 5.7 8.7 8.2 -1.2 -0.2

Y 2.9 3.1 2.9 3.4 3.7 3.2 3.2 4.4 0.7 -1.2 -1.6

V 7.1 6.7 6.7 5.6 5.6 6.5 6.1 6.9 -2.6 -0.8 -0.9

T 5.4 5.2 5.5 5.9 4.4 6.0 5.4 4.0 -1.2 -0.6 -1.4

D 5.1 5.0 5.0 5.8 4.8 5.0 4.9 5.5 9.2 -1 0.9

L 10.6 10.5 11.4 9.6 11.2 10.3 10.7 9.5 -2.8 -1.6 -0.5

W 1.5 1.1 1.6 1.0 .7 1.2 1.0 .7 -1.9 -1.1 -1

total propensity

α -1.00 -1.02 -0.96 -1.00 -1.05 -1.03 -1.05 -1.01

β -0.27 -0.33 -0.26 -0.36 -0.37 -0.38 -0.42 -0.36

Column groups: Amino Acid Composition (EC to MJ); Propensity (kcal/mole): TM-hlx, helix, strand
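The "total propensity" row appears to be a composition-weighted sum of the per-residue propensities. That reading is an assumption, but it reproduces the α value for EC when the EC composition column (in percent) is combined with the helix column of the table. The dictionary names and the function `total_propensity` are hypothetical.

```python
# EC composition (percent) and helix propensity (kcal/mole), copied from the table above.
EC_COMPOSITION = {
    "K": 4.4, "C": 1.2, "R": 5.5, "N": 4.0, "Q": 4.4, "A": 9.5, "I": 6.0,
    "H": 2.3, "S": 5.8, "M": 2.8, "P": 4.4, "G": 7.4, "F": 3.9, "E": 5.7,
    "Y": 2.9, "V": 7.1, "T": 5.4, "D": 5.1, "L": 10.6, "W": 1.5,
}
HELIX_PROPENSITY = {
    "K": -1.5, "C": -1.1, "R": -1.9, "N": -1.0, "Q": -1.3, "A": -1.9, "I": -1.2,
    "H": -1.1, "S": -1.1, "M": -1.4, "P": 3.0, "G": 0.0, "F": -1.0, "E": -1.2,
    "Y": -1.2, "V": -0.8, "T": -0.6, "D": -1.0, "L": -1.6, "W": -1.1,
}

def total_propensity(composition_percent, propensity):
    """Composition-weighted mean propensity (kcal/mole per residue)."""
    return sum(composition_percent[aa] * propensity[aa]
               for aa in composition_percent) / 100.0

ec_alpha = total_propensity(EC_COMPOSITION, HELIX_PROPENSITY)
print(round(ec_alpha, 2))
```

The strand column contains ">3.0" for proline, so the β row cannot be recomputed exactly without choosing a value for that entry.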

Supersecondary structure words

• Look at super-secondary patterns (“words” such as αα or βαβ) in predictions
• Compare observed freq. with expected freq.
  odds = f(αβ)/[f(α)f(β)] (Freq. Words, Karlin)
• Do have differences between genomes (and PDB) here
  HI more αα, ααα, αααα ...
  SC more ββ, βββ, βββββ...
  MJ more αβαβ, βαβα …
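A minimal sketch of the odds calculation, assuming the predicted secondary structure has already been reduced to a string of elements with "a" for a helix and "b" for a strand. That encoding, the function name `word_odds` and the toy strings are assumptions for illustration.

```python
from collections import Counter

def word_odds(elements, word):
    """odds = f(word) / product of f(element) over the elements of the word."""
    n, k = len(elements), len(word)
    if k == 0 or n < k:
        raise ValueError("element string is shorter than the word")
    single = Counter(elements)
    words = Counter(elements[i:i + k] for i in range(n - k + 1))
    observed = words[word] / (n - k + 1)      # f(word)
    expected = 1.0
    for element in word:
        expected *= single[element] / n       # f(a) * f(b) * ...
    if expected == 0:
        raise ValueError("word contains an element absent from the string")
    return observed / expected

alternating = word_odds("abab", "ab")   # "ab" is over-represented
blocky = word_odds("aaaabbbb", "ab")    # "ab" occurs only once
print(round(alternating, 2), round(blocky, 2))
```

An odds value above 1 means the word occurs more often than expected from the element frequencies alone, which is how the table entries for HI, MJ, SC and PDB are read.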

Relative Abundance (Odds Ratio)

Supersecondary     Maximum Difference
Structure "Word"   between 3 Genomes    HI     MJ     SC     PDB
ββ                 26%                  0.96   1.06   1.24   1.22
αα                 15%                  0.97   0.85   0.83   0.85
αβ                 10%                  1.09   1.09   0.99   0.95
βα                 7%                   0.98   1.00   0.93   0.99
βββ                41%                  0.96   1.15   1.46   1.62
ααα                19%                  1.01   0.83   0.84   0.92
αβα                18%                  1.04   1.03   0.87   1.16
ααβ                15%                  1.03   0.97   0.89   0.70
βαβ                12%                  1.15   1.24   1.10   1.19
βαα                11%                  0.93   0.87   0.83   0.78
ββα                9%                   0.90   0.94   0.99   0.82
αββ                6%                   0.97   0.98   1.03   0.80
ββββ               54%                  1.03   1.35   1.78   2.28
αααα               29%                  1.10   0.82   0.89   1.18
βββα               25%                  0.85   0.94   1.10   0.98
βαβα               23%                  1.11   1.18   0.94   1.48
αβαβ               21%                  1.21   1.23   0.99   1.39
αβαα               21%                  1.00   0.95   0.81   1.00
…                  …                    …      …      …      …

Different Perspectives on Protein Thermostability

In depth focus on single molecule vs. broad view of many (all?) proteins. Anecdotal vs. Comprehensive (the genomic perspective)

Ion pairs in GluDHs

Change in entropy of unfolded state in engineering of TLP (disulfides)

Thermostability: Analyzing a few Factors with Genome Comparison

Organism                                                    Category                    Genome Abbreviation  # of Proteins  Physiological condition
Pyrococcus horikoshii (Strain OT3) (Kawarabayasi et al., 1998)  archaea                     OT                   2061           98°C, anaerobe
Aquifex aeolicus (Deckert et al., 1998)                     eubacteria, gram negative   AA                   1522           95°C
Methanococcus jannaschii (Bult et al., 1996)                archaea                     MJ                   1735           85°C, anaerobe
Archaeoglobus fulgidus (Klenk et al., 1997)                 archaea                     AF                   2409           83°C, anaerobe
Methanobacterium thermoautotrophicum (Smith et al., 1997)   archaea                     MT                   1869           65°C, anaerobe
Haemophilus influenzae (Fleischmann et al., 1995)           eubacteria, gram negative   HI                   1680           mesophilic temp.
Mycoplasma genitalium (Fraser et al., 1995)                 eubacteria, gram positive   MG                   470            mesophilic temp.
Mycoplasma pneumoniae (Himmelreich et al., 1996)            eubacteria, gram positive   MP                   677            mesophilic temp.
Helicobacter pylori (Tomb et al., 1997)                     eubacteria, gram negative   HP                   1590           mesophilic temp.
Escherichia coli (Blattner et al., 1997)                    eubacteria, gram negative   EC                   4288           mesophilic temp.
Synechocystis sp. (Kaneko et al., 1996)                     cyanobacteria               SS                   3168           mesophilic temp.
Saccharomyces cerevisiae (Goffeau et al., 1997)             eukaryote, fungus           SC                   6218           mesophilic temp.

CEEEEHHHHHHHHHCCEEEEEEEEECCCMEAPAGNIDIIKAGMKSPVQLTVKNDT

tertiary (EK)

local (DK)

Composition Analysis of the Proteome

whole genome

More Charged Residues in Thermophiles, Suggestive of Salt Bridges

1-4 Spacing of Charged Residues More than Expected in Thermophile Helices ⇒ Salt Bridges

(predicted helices in all; Thermophile vs. Mesophile)

Quantify with LOD score
LOD = log (observed/expected)
For inst., expected[EK(4)] ~ f(E)*f(K)
LOD > 0, greater than expected
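The LOD score for a residue pair at a fixed spacing can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: the natural logarithm is used, pairs are counted in one direction only (first residue, then second residue `spacing` positions later), frequencies are taken over the helix residues supplied, and the function name `lod_score` and the toy helix are hypothetical.

```python
import math
from collections import Counter

def lod_score(helices, first, second, spacing=4):
    """LOD = log(observed/expected) for `first` followed by `second` at i, i+spacing."""
    counts = Counter()
    n_residues = 0
    n_positions = 0
    n_observed = 0
    for seq in helices:
        counts.update(seq)
        n_residues += len(seq)
        for i in range(len(seq) - spacing):
            n_positions += 1
            if seq[i] == first and seq[i + spacing] == second:
                n_observed += 1
    if n_positions == 0 or n_observed == 0:
        return None  # no data, or log(0) undefined
    observed = n_observed / n_positions
    expected = (counts[first] / n_residues) * (counts[second] / n_residues)
    return math.log(observed / expected)

# Toy helix in which every E is followed by a K four residues later.
lod = lod_score(["EAAAKEAAAK"], "E", "K", spacing=4)
print(round(lod, 2))
```

Applied to the predicted helices of each genome, a per-genome value of this kind can then be compared between thermophiles and mesophiles.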

[Figure: genomes ordered MP MG EC SC HP SS HI (Mesophile, 10 to 45) then MT MJ AF AA OT (Thermophile); Physiological temperature in C: MT 65, MJ 85, AF 83, AA 95, OT 98]

[Figure: sequence Length distributions, thermophile vs. mesophile and thermophilic cog vs. mesophilic cog]

Sequence Length Doesn’t Completely Relate to Thermostability

Simple distributions of sequence length have thermophiles shorter (Eisenberg)

But this neglects special case of AA (eubacterial thermophile): archaeal sequences shorter

Controlling for Biases: Stratified Sample

Stratified Sampling based on COGs

[Figure: Ortho. vs. ALL; genome orderings Meso, MT AF OT AA MJ and Meso, AA MT AF OT MJ]

Correct for duplications, repeats, unique families; Extend COGs to get 52 ortholog families

(COGs, Lipman, Koonin)

Controls II: Known Structures, Random Genomes

49   J  ribosomal   1rss  3          +
80   J  ribosomal   1aci  0.7  0.1
81   J  ribosomal   1ad2  4.3  2.1   +
91   J  ribosomal   1bxe  0.9  0.9
93   J  ribosomal   1whi  1.9  1.1   +
96   J  ribosomal   1sei  2.1  -0.1
98   J  ribosomal   1pkp  1.7  -1.1  -
184  J  ribosomal   1a32  1.9  -0.1
186  J  ribosomal   1rip  0.9  -0.5
16   J  synthetase  1pys  2.6  5     +
124  J  synthetase  1ady  6.1  3.5   +
162  J  synthetase  2ts1  3.3  0.5
30   J  other       1yub  5.3  -0.3
125  F  other       1tmk  0.4  0.4
149  C  other       1btm  4.3  -1.3  -
541  N  other       1fts  3.4  0.2
112  E  other       1cj0  4.6  1.6   +
552  N  other       1ffh  4.6  -0.4

Therm.Avg.SBCOG Cat. PDB Diff.

5.6 3.1

MesoAvg.SB

[Figure: Original Skewed Composition, Clustered, Uniform Composition]

3D Structures

For orthologs of known structure: map tertiary salt bridges onto multiple alignment and look at conservation in Therm. vs. Meso.

Random Sampling: Make up random thermo. and meso. genomes, see what distribution of each statistic is

[Figure: Therm. vs. Meso.]
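One way to "make up random genomes" is a label-shuffling (permutation) test: pool the per-protein values, reassign them at random to pseudo-thermophile and pseudo-mesophile sets, and see how often the random difference reaches the observed one. This is a sketch of that general technique, not necessarily the procedure used for the slides; the function name `permutation_p_value` and the numbers are invented for illustration.

```python
import random

def permutation_p_value(thermo, meso, n_perm=2000, seed=0):
    """One-sided test of mean(thermo) - mean(meso) against shuffled labels."""
    rng = random.Random(seed)            # fixed seed keeps the sketch reproducible
    observed = sum(thermo) / len(thermo) - sum(meso) / len(meso)
    pooled = list(thermo) + list(meso)
    n_thermo = len(thermo)
    n_meso = len(pooled) - n_thermo
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # build one random thermo/meso split
        diff = sum(pooled[:n_thermo]) / n_thermo - sum(pooled[n_thermo:]) / n_meso
        if diff >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# Invented per-protein statistics (e.g. some salt-bridge measure).
thermo_values = [5.1, 6.0, 7.2, 8.3, 9.4, 10.5]
meso_values = [1.0, 2.1, 3.2, 4.3, 0.5, 1.7]
observed_diff, p_value = permutation_p_value(thermo_values, meso_values)
print(round(observed_diff, 2), p_value)
```

The list of shuffled differences is itself the "distribution of each statistic" that the observed value is compared against.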

How Representative are the Known Structures of the Proteins in a Complete Genome? The issue of Bias

Assess 2º, TM predictions
(+) comprehensive, statistical
(-) predictions inaccurate (~65%)
(-) extrapolate from PDB (esp. TM), domain problem

Is prediction (extrapolation) based on known structures justified?

Length: Genome Sequences are longer than those in Known Structures

340 aa for avg. genome seq. (470 aa for yeast)
205 aa for PDB chain
~160 aa for PDB domain

[Figure: Length distributions for genomes (SC, MJ, HI, MP, MG, EC, SS, HP, with FIT), PDB domains and whole chains]

Amino Acid Composition

ABS. rms K I C Q W N F L G A P S R H M E D T Y V

EC 4.4 6.0 1.2 4.4 1.5 4.0 3.9 10.6 7.4 9.5 4.4 5.8 5.5 2.3 2.8 5.7 5.1 5.4 2.9 7.1

HI 6.3 7.1 1.0 4.6 1.1 4.9 4.5 10.5 6.6 8.2 3.7 5.8 4.5 2.1 2.4 6.5 5.0 5.2 3.1 6.7

SS 4.2 6.3 1.0 5.6 1.6 4.0 4.0 11.4 7.4 8.5 5.1 5.8 5.1 1.9 2.0 6.0 5.0 5.5 2.9 6.7

SC 7.3 6.6 1.3 3.9 1.0 6.1 4.5 9.6 5.0 5.5 4.3 9.0 4.5 2.2 2.1 6.5 5.8 5.9 3.4 5.6

HP 8.9 7.2 1.1 3.7 .7 5.9 5.4 11.2 5.8 6.8 3.3 6.8 3.5 2.1 2.2 6.9 4.8 4.4 3.7 5.6

MP 8.6 6.6 .8 5.4 1.2 6.2 5.6 10.3 5.5 6.7 3.5 6.5 3.5 1.8 1.6 5.7 5.0 6.0 3.2 6.5

MG 9.5 8.2 .8 4.7 1.0 7.5 6.1 10.7 4.6 5.6 3.0 6.6 3.1 1.6 1.5 5.7 4.9 5.4 3.2 6.1

MJ 10.4 10.5 1.3 1.5 .7 5.3 4.2 9.5 6.3 5.5 3.4 4.5 3.8 1.4 2.2 8.7 5.5 4.0 4.4 6.9

AVG 7.5 7.3 1.1 4.2 1.1 5.5 4.8 10.5 6.1 7.0 3.8 6.4 4.2 1.9 2.1 6.5 5.1 5.2 3.3 6.4

SD 2.3 1.4 .2 1.3 .3 1.2 .8 .7 1.0 1.5 .7 1.3 .9 .3 .4 1.0 .3 .7 .5 .6

EC 16 -25 8 -29 19 7 -15 -2 28 -6 13 -5 -3 16 3 28 -7 -14 -7 -22 1

HI 17 8 27 -38 24 -21 6 12 26 -15 -2 -20 -2 -6 -7 10 5 -17 -11 -14 -4

SS 20 -29 13 -39 49 9 -13 1 37 -6 1 11 -3 6 -15 -8 -2 -16 -6 -20 -4

SC 21 24 18 -21 5 -27 31 14 15 -36 -34 -7 51 -7 -2 -4 5 -4 0 -8 -20

HP 27 52 29 -34 0 -51 27 36 34 -26 -18 -29 14 -28 -4 2 11 -20 -25 1 -20

MP 28 45 18 -55 44 -17 35 41 24 -29 -20 -25 8 -27 -18 -28 -8 -17 2 -11 -7

MG 36 61 48 -50 27 -32 62 53 28 -41 -33 -36 11 -35 -28 -30 -8 -18 -8 -11 -12

MJ 38 77 88 -23 -61 -49 14 6 14 -19 -35 -28 -25 -20 -35 1 40 -8 -31 20 -2

AVG 26 31 -36 13 -23 19 20 26 -22 -16 -17 6 -13 -13 -4 4 -14 -11 -8 -9

RMS 45 39 38 35 31 30 28 27 25 24 23 21 21 18 18 16 15 15 15 11

How Representative are the Known Structures of the Proteins in Complete Genome?

Name  Soluble PDB  =  all-β  +  all-α
A     8.40%           6.8%      9.2%
C     1.72%           1.6%      1.4%
D     5.91%           5.9%      5.8%
E     6.29%           5.2%      7.3%
F     3.94%           4.2%      4.2%
G     7.79%           8.4%      6.4%
H     2.19%           2.1%      2.2%
I     5.54%           5.4%      5.1%
K     6.02%           5.6%      6.5%
L     8.37%           7.3%      9.6%
M     2.15%           1.7%      2.4%
N     4.57%           5.3%      4.4%
P     4.70%           5.1%      4.4%
Q     3.73%           3.5%      4.2%
R     4.78%           4.2%      5.4%
S     5.97%           7.2%      5.7%
T     5.87%           7.2%      5.2%
V     6.96%           7.6%      5.7%
W     1.46%           1.7%      1.5%
Y     3.64%           3.8%      3.5%

Composition of Different Regions of Genomes

• Are composition differences uniform?
• Resampling
• Non-globular regions differ most in occurrence and composition
• Remove Repetitive Regions (SEG)

AVG SD EC HI HP MG MJ MP SC SS

Statistics for Amino Acids

Total Number 775998 1358465 505279 500616 170400 497968 237905 2900670 1033450

Fraction Masked by...

PDB Match 8.7% 3.7% 11.1% 13.7% 8.8% 12.9% 7.1% 9.7% 6.2% 9.0%

Non-globular Region 21.7% 6.9% 16.7% 13.9% 22.2% 28.2% 35.1% 24.7% 23.9% 20.5%

TM-helix 4.9% 1.4% 7.3% 6.1% 4.8% 3.8% 2.9% 4.5% 5.2% 5.9%

Linker Region 5.1% 0.4% 5.3% 4.8% 4.8% 5.0% 5.0% 5.2% 4.6% 5.1%

Fraction Remaining

Uncharacterized 59.7% 8.9% 59.6% 61.5% 59.4% 50.2% 49.9% 55.8% 60.0% 59.6%

AVG SD EC HI HP MG MJ MP SC SS

Overall 23% 10% 16% 17% 27% 36% 38% 28% 21% 20%

PDB Match 18% 9% 12% 14% 24% 27% 34% 20% 12% 15%

Non Globular Region 36% 13% 32% 33% 39% 50% 52% 40% 42% 35%

TM-helix 49% 15% 55% 53% 55% 57% 55% 56% 56% 51%

Linker Region 27% 10% 22% 24% 29% 39% 33% 35% 21% 25%

Uncharacterized Region 23% 6% 15% 17% 26% 34% 32% 27% 20% 19%


Name  Hydroph.  Soluble (PS)  biophys. proteins (BP)  BP/PS - 1

P H 4.7% 3.7% -21%

F H 4.0% 3.2% -19%

M H 2.1% 1.8% -16%

D P 6.0% 5.1% -16%

V H 7.0% 6.2% -12%

C H 1.7% 1.5% -9%

S P 6.0% 5.7% -5%

G . 7.8% 7.7% -1%

I H 5.6% 5.5% -1%

N P 4.6% 4.6% 0%

W H 1.4% 1.5% 1%

T P 5.8% 6.0% 2%

L H 8.4% 8.7% 5%

A . 8.4% 8.8% 6%

Y . 3.7% 3.9% 6%

H P 2.2% 2.4% 6%

Q P 3.7% 4.0% 6%

R P 4.8% 5.2% 9%

E P 6.2% 7.0% 13%

K P 5.9% 7.7% 30%

PDB Select length class name

1sty - 137 β Staph nuclease

1cgp a:9-137 129 β CAP

1bgh - 85 β Gene V protein

1pht - 83 β SH3 domain

1tpf a: 250 α/β TIM

1wsy a: 248 α/β Trp Synthase

8dfr - 186 α/β DHFR

2rn2 - 155 α/β Ribonuclease H

1brs d: 87 α/β Barstar

1gbs - 185 α+β Hen Lysozyme

119l - 162 α+β T4 lysozyme

193l - 129 α+β alpha-Lactalbumin

7rsa - 124 α+β RNAse A

1brn l: 108 α+β Barnase

1fkd - 107 α+β FK506

9rnt - 104 α+β RNAse T1

1sha a: 103 α+β SH2 domain

1ubi - 76 α+β Ubiquitin

1cse i: 63 α+β CI-2 inhibitor

1igd - 61 α+β B1 domain

1mbd - 153 α Globin

1hrc - 105 α Cytochrome c

2wrp r: 104 α Trp Repressor

1lli a: 89 α Cro Repressor

1cop d: 66 α Lambda Repressor

1rpo - 61 α ROP

1myk a: 47 α Arc Repressor

2zta a: 31 α GCN4 zipper

1btl - 263 Μ beta-Lactamase

1bpi - 58 S BPTI

AVG 116

Biophysical Proteins

Proteins that inform our view of the folding process -- as compared to the PDB.

Shorter (116 v 161)

Fewer hydrophobes

Adding Structure to Functional Genomics, Function to Structural Genomics

Function

Folds v. Genomes

%ID v. RMS

Purely Seq. Based Analysis -- e.g. EcoCyc, ENZYME, GenProtEC, COGs, MIPS

Why Structure? Do we really need it?

1 Most Highly Conserved
2 Precisely Defined Modules
3 Seq. ⇔ Struc. Clearer than Seq. ⇔ Func.
4 Link to Chemistry, Drugs

Fold-Function Combinations

• Two Different Folds Catalyze the Same Reaction -- e.g. Carbonic Anhydrases (4.2.1.1)
• Many Functions on the Same Fold -- e.g. the TIM-barrel

91 Enzymatic Functions + Non-Enzyme

[Figure: Fold-Function Combinations, Possible (=92x229) vs. Observed]

The MostVersatile Folds,

Versatile Functions16 9 6 6 6 5 4 4 4 3 3 3 3 3 3 3

0.0.0 22 5 40 666 374 168 11 464 105 1 1 7 102

1.1.1 106 2661.1.3 4 11.10.2 51.11.1 41.14.13 3 61.14.14 21 501.14.15 21.14.99 71.17.4 361.18.6 421.3.1 15 82 31.3.99 101.6.5 21.6.99 8 2 41.9.3 62.1.3 6 12.3.1 6 82.6.1 1282.7.1 102.7.4 291 1562.7.7 13.1.1 122 12 13.1.2 33.1.3 773.1.31 43.1.4 43.2.1 170 1213.2.3 33.4.11 23.4.16 4 13.5.2 1423.5.4 53.6.1 14 403.7.1 23.8.1 34.1.1 28 14.1.2 584.1.3 4 14.1.99 74.2.1 48 15 15.1.3 255.3.1 3825.3.3 15.4.3 15.4.99 16.3.2 56.3.3 96.3.4 176.4.1. 6

NONENZ

150 0.0.0 # # # # # # # # # 8 6 1 # # # # # # 1 5 1 # 4 # 7 #

7 3.2.1 4 # 1 # # 3 #7 4.2.1 # 1 # # # 2 26 3.1.3 9 7 5 4 # #6 3.5.1 # 1 1 # # 25 1.11.1 # 1 # 4 #5 1.9.3 3 5 #5 2.7.1 # 3 # 1 #5 3.6.1 # # # 35 4.1.1 # 1 8 # 14 1.14.13 2 6 3 24 2.4.2 # # 4 74 2.5.1 # 3 #4 3.1.1 # 1 # #4 3.2.2 1 # 1 #3 1.1.1 # #3 1.3.1 # # 33 1.6.99 8 4 2

SMBA A+B MULTIA/B

Top-4 Functions: Glycosidases, hydro-lyases, phosphoric monoester hydrolases, linear amide hydrolases (3.2.1, 4.2.1, 3.1.3, 3.5.1)

Top-5 Folds: TIM-barrel (16), alpha-beta hydrolase fold (9), Rossmann fold (6), P-loop NTP hydrolase fold (6), Ferredoxin fold (6)

Top Multifunctional Folds →

Fold-Function Combinations

Cross-Tabulation Summary Diagram

         A    B    A/B  A+B  MULTI  SML  sum
NONENZ   34   30   14   28   4      26   136
OX       13   5    17   3    4      5    47
TRAN     3    3    16   8    5      -    35
HYD      4    11   30   18   4      -    67
LY       2    3    13   5    -      -    23
ISO      1    2    7    4    2      -    16
LIG      -    1    2    3    1      -    7
sum      57   55   99   69   20     31   331

A B A/B A+B

NONENZ 7.1 5.7 7.1 9.2 2.8 0.7

OX 3.5 2.1 9.2 2.1 0.7 0.7

TRAN 0.7 10.6 1.4 1.4 0.7

HYD 2.8 2.8 6.4 5.7 1.4

LY 2.1 4.3

ISO 0.7 1.4 2.8 0.7

LIG 1.4 1.4

[ Similar analysis in Martin et al. (1998), Structure 6: 875 ]
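A cross-tabulation of this kind is a count over (function category, fold class) pairs plus the two sets of marginal sums. The sketch below assumes each fold-function combination has already been reduced to such a pair; the function name `cross_tabulate` and the toy pairs are hypothetical, with category labels borrowed from the table.

```python
from collections import Counter

def cross_tabulate(pairs):
    """Return (cells, row_sums, col_sums) for (function, fold_class) pairs."""
    cells = Counter(pairs)
    row_sums = Counter()
    col_sums = Counter()
    for (function, fold_class), n in cells.items():
        row_sums[function] += n
        col_sums[fold_class] += n
    return cells, row_sums, col_sums

# Toy fold-function combinations, invented for illustration.
combinations = [
    ("HYD", "A/B"), ("HYD", "A/B"), ("HYD", "B"),
    ("OX", "A/B"), ("NONENZ", "A"), ("NONENZ", "SML"),
]
cells, row_sums, col_sums = cross_tabulate(combinations)
total = sum(cells.values())
for (function, fold_class), n in sorted(cells.items()):
    # Percent of all combinations, the form used in the percentage tables.
    print(function, fold_class, n, f"{100 * n / total:.1f}%")
```

Dividing each cell by the grand total gives percentages that can be compared across genomes or classification schemes of different size.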

[Figure: cross-tabulation charts; labels: nonENZ, Yeast, E. coli]

E. coli

         A    B    AB
NONENZ   10   9.0  15

OX 5.1 5.1 10

TRAN 1.3 13

HYD 2.6 1.3 14

LY 2.6 1.3

ISO 1.3 1.3 5.1

LIG 1.3

A B A/B A+B

metabolism 1 3.5 2.3 10 4.5 1.3 0.8

energy 2 1.1 1.2 5 1.5 0.3 0.2

growth, div., DNA syn. 3 4.9 3.6 4 4.5 1.8 1.2

transcription 4 1.5 1.3 2.2 1.5 0.5 0.8

protein synthesis 5 1 0.9 0.7 1.3 0.3 0.2

protein targeting 6 1.2 1.7 2 1.6 0.5 0.3

transport facilitation 7 0.9 0.5 0.7 0.6 0.4

intracellular transport 8 1.8 2.1 1.6 0.6 1

cellular biogenesis 9 0.9 0.7 1.2 0.3 0.3 0.1

signal transduction 10 1 1 1.1 0.3 0.7 0.3

cell rescue, defense… 11 1.5 1 2.6 1.9 0.7 0.5

ionic homeostasis 13 0.5 0.3 0.4 0.4 0.2

A B A/B A+B

NONENZ 7.1 5.7 7.1 9.2 2.8 0.7

OX 3.5 2.1 9.2 2.1 0.7 0.7

TRAN 0.7 10.6 1.4 1.4 0.7

HYD 2.8 2.8 6.4 5.7 1.4

LY 2.1 4.3

ISO 0.7 1.4 2.8 0.7

LIG 1.4 1.4

Compare Classifications and Genomes

worm, SwissProt

Compare 1 Structure-Function Cross-Tab for Different Genomes and Different Functional & Structural Classifications for the Yeast Genome

CATH (Thornton)

MIPS YFC (Mewes)

COGs vs SCOP: Different Structure Function Relationships for Most Conserved Proteins

A B A/B A+B

C 2.2 2.6 4.8 3 0.4

E 2.2 1.1 7.4 2.6 0.7

F 1.1 3.7 1.8

G 0.4 0.4 3.3 0.7

H 1.1 0.7 4.8 3

I 0.7 0.7 2.2 0.4 0.4

J 2.2 1.8 3 3 0.4 0.4

K 1.1 0.4

L 1.1 1.5 1.1 1.1

M 0.4 0.4 0.7

N 1.8 0.7 0.4 0.7 0.4

O 1.5 1.1 3 2.2 0.4 0.4

P 0.4 1.1 0.7 0.4

A B A/B A+B

C 7.2 2.9

E 1.4 1.4 1.4

G 4.3 1.4

H 1.4 2.9 1.4

J 8.7 7.2 7.2 10 1.4 1.4

N 1.4 1.4

O 2.9 7.2 2.9

P 1.4 2.9 1.4

(Scop, Murzin, Ailey, Brenner, Hubbard, Chothia; COGs, Tatusov, Koonin, Lipman)

Gene Expression Datasets: the Transcriptome

Yeast Expression Data in Academia: levels for all 6000 genes!

X-ref. with other genome data: protein fold features common in Transcriptome....

• Young/Lander, Chips, Abs. Exp.
• Brown, µarray, Rel. Exp. over Timecourse
• Snyder, Transposons, Protein Exp.
• Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression

Composition of Genome vs. Transcriptome

[Figure: Genome Composition vs. Transcriptome Composition by Amino Acid (N S C L F I D H Q Y P M W E T K R V G A) for datasets Samson, SAGE-S, SAGE-L, SAGE-G/M, Church-heat, Church-gal, Church-alpha, Church-a, Young]

V, G, A ↑; N, S ↓
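Genome and transcriptome composition differ only in how each gene is weighted: once per gene for the genome, by expression level for the transcriptome. A minimal sketch, assuming each gene is given as a (protein sequence, expression level) pair; the function name `composition` and the two toy genes are invented.

```python
from collections import Counter

def composition(genes, weighted=True):
    """Amino acid fractions; each gene counts `level` times if weighted, else once.

    genes: list of (protein_sequence, expression_level) pairs.
    """
    counts = Counter()
    total = 0.0
    for sequence, level in genes:
        weight = level if weighted else 1.0
        for aa in sequence:
            counts[aa] += weight
        total += weight * len(sequence)
    if total == 0:
        raise ValueError("no expressed residues")
    return {aa: n / total for aa, n in counts.items()}

# Two toy genes: the valine-rich one is expressed three times as strongly.
toy_genes = [("AAAA", 1.0), ("VVVV", 3.0)]
genome_comp = composition(toy_genes, weighted=False)
transcriptome_comp = composition(toy_genes, weighted=True)
print(genome_comp, transcriptome_comp)
```

The same weighting applies unchanged to fold classes or individual folds if the sequence is replaced by a list of fold assignments per gene.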

Which Protein Folds are Highly Expressed?

Top-10 folds in genome and transcriptome

Fold (class, rep. structure); Composition; Rank

TIM barrel α/β 1byb 4.2 8.3 +98 5 1 1 1 1 1 1 1 1 1
P-loop NTP hydrolases α/β 1gky 5.8 5.2 -11 3 2 2 4 4 4 5 5 6 7
Ferredoxin like α+β 1fxd 3.9 3.4 -14 6 3 7 11 9 8 10 4 10 11
Rossmann fold α/β 1xel 3.3 3.3 0 8 4 3 3 3 2 2 19 15 9
7-bladed beta-propeller β 1mda* 6.4 2.9 -55 2 5 4 5 6 6 7 9 9 16
alpha-alpha superhelix α 2bct 4.4 2.7 -37 4 6 11 15 16 12 12 8 5 8
Thioredoxin fold α/β 2trx 1.7 2.7 +63 14 7 6 8 2 5 4 11 10 6
G3P dehydrogenase-like α+β 1drw† 0.2 2.7 +1316 81 8 12 2 5 3 3 35 19 30
beta grasp α+β 1igd 0.6 2.6 +348 36 9 10 21 9 18 21 82 122 120
HSP70 C-term. fragment multi 1dky 0.8 2.6 +231 31 10 16 17 11 16 12 48 25 56
long helices oligomers α 1zta 3.8 2.1 -46 7 15 8 14 21 15 19 21 20 33
Protein kinases (cat. core) multi 1hcl 6.8 1.6 -77 1 18 19 9 16 11 15 13 16 17
alpha/beta hydrolases α/β 2ace 2.2 0.9 -62 10 32 31 25 26 21 23 26 26 26
Zn2/C6 DNA-bind. dom. sml 1aw6 2.6 0.3 -89 9 75 94 27 50 32 40 48 39 50

Broad Categories Const. in Transcriptome over Timecourse, Not Specific Genes (or Folds)

[Figure: composition by Amino Acid for Young, Brown Timepoint 1, Brown Timepoint 7; and for Samson, SAGE-S, SAGE-L, SAGE-G/M, Church-heat, Church-gal, Church-alpha, Church-a, Young]

Fold class composition weighted by transcript frequency

[Figure: All alpha, All beta, Alpha/beta, Alpha+beta, Multidomain, Small protein; First timepoint vs. Last timepoint]

Common Yeast Folds (scop)   Rep. Structure   Genome Duplication   Expression (aerobic)   Expression (anaerobic)

Protein kinases (cat. core) 1hcl 1 3 4

NTP Hydrolases with P-loop 1gky 2 1 2

Classic Zn finger 1ard 3 9 5

Ribonuclease H-like motif 2rn2 4 2 1

Rossmann Fold 1xel 5 4 3

Zn2/Cys6 DNA-binding dom. 125d 6 6 7

7-bladed beta-propeller 2bbk-H 7 8 16

TIM-barrel 1byb 8 5 6

like Ferredoxin 1fxd 9 7 10

DNA-binding 3-helix bundle 1enh 10 30 36

… …

GroES-like 1lep-A 17 10 9

… …

like HSP70, Ct-dom. 1dkz-A 22 11 8

Brown cDNA microarray expts. not as useful for X-ref. at individual timepts

Nevertheless, they show same aa composition and fold class usage at different timepts. However, top fold changes and also specific TM proteins....

Different Classes of Membrane Proteins Have Different Changes in Expression Level (esp. 12 TMs)

Differential expression of transmembrane proteins (12 segments)

[Figure: Transmembrane (12 segments) vs. All open reading frames, x-axis 1-7]

Most Expressed TMs in anaerobic conditions
ORF         TMs
YPR149W     4
YDR343C     9
YDR342C     9
YKL217W     7
YHR096C     9
YBR116C     2
YIL088C     6
YBR012W-B   2
YBR054W     7
YBR218C     2

Most Expressed TMs in aerobic conditions
ORF         TMs
YHR078W     4
YGL008C     6
YBR012W-B   2
YLR340W     2
YPL131W     2
YHR099W     2
YMR205C     2
YHR216W     2
YLR432W     2
YIL075C     5

[Figure: Hexose permease expression, x-axis 1-7]

[Figure: Expression of lactate transporter, x-axis 1-7]

One column gives the expression in aerobic conditions (high sugar, second time-series data point in DeRisi et al.), and the other column, in anaerobic conditions (low sugar, high ethanol, last time-series data point in DeRisi et al.). 9 hexose permeases, 1 lactate transporter.

Functional category number  Function                                         Average correlation  # ORFs
01                          METABOLISM                                       0.1001               1005
01.01                       amino-acid metabolism                            0.1488               199
01.01.01                    amino-acid biosynthesis                          0.239                114
01.01.04                    regulation of amino-acid metabolism              0.23                 32
01.01.07                    amino-acid transport                             0.1198               23
01.01.10                    amino-acid degradation                           0.0524               36
01.01.99                    other amino-acid metabolism activities           0.2205               4
01.02                       nitrogen and sulphur metabolism                  0.1869               73
01.02.01                    nitrogen and sulphur utilization                 0.0726               37
01.02.04                    regulation of nitrogen and sulphur utilization   0.3715               28
01.02.07                    nitrogen and sulphur transport                   0.2829               8
01.03                       nucleotide metabolism                            0.1708               134
01.03.01                    purine-ribonucleotide metabolism                 0.3639               42
01.03.04                    pyrimidine-ribonucleotide metabolism             0.176                28
01.03.07                    deoxyribonucleotide metabolism                   0.1095               12
01.03.10                    metabolism of cyclic and unusual nucleotides     0.2848               8
01.03.13                    regulation of nucleotide metabolism              0.2696               13
01.03.16                    polynucleotide degradation                       0.2461               20
01.03.19                    nucleotide transport                             0.1187               12
01.03.99                    other nucleotide-metabolism activities           -0.0328              7
01.04                       phosphate metabolism                             0.1348               31
01.04.01                    phosphate utilization                            0.1612               13
01.04.04                    regulation of phosphate utilization              0.0599               8
01.04.07                    phosphate transport                              0.0724               10
01.05                       carbohydrate metabolism                          0.0779               409
01.05.01                    carbohydrate utilization                         0.075                256
01.05.04                    regulation of carbohydrate utilization           0.1174               120


Correlate with Expression Level with Functional
