Sequence analysis of nucleic acids and proteins: part 2 Based on Chapter 3 of Post-genome bioinformatics by Minoru Kanehisa Oxford University Press, 2000 Prediction of structure and function

Upload: rolf-harper

Post on 25-Dec-2015


Page 1:

Sequence analysis of nucleic acids and proteins: part 2

Based on Chapter 3 of

Post-genome bioinformatics

by Minoru Kanehisa

Oxford University Press, 2000

Prediction of structure and function

Page 2:

Search and learning problems in sequence analysis

Problems in biological science and Math/Stat/CompSci methods

Similarity search
• Pairwise sequence alignment
• Database search for similar sequences
• Multiple sequence alignment
• Phylogenetic tree reconstruction
• Protein 3D structure alignment

Optimization algorithms
• Dynamic programming (DP)
• Simulated annealing (SA)
• Genetic algorithms (GA)
• Markov chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
• Hopfield neural network

Structure/function prediction

Ab initio prediction
• RNA secondary structure prediction
• RNA 3D structure prediction
• Protein 3D structure prediction

Knowledge-based prediction
• Motif extraction
• Functional site prediction
• Cellular localization prediction
• Coding region prediction
• Transmembrane domain prediction
• Protein secondary structure prediction
• Protein 3D structure prediction

Pattern recognition and learning algorithms
• Discriminant analysis
• Neural networks
• Support vector machines
• Hidden Markov models (HMM)
• Formal grammar
• CART

Molecular classification
• Superfamily classification
• Ortholog/paralog grouping of genes
• 3D fold classification

Clustering algorithms
• Hierarchical, k-means, etc.
• PCA, MDS, etc.
• Self-organizing maps, etc.

Page 3:

Thermodynamic principle

The amino acid sequence contains all the information necessary to fold a protein molecule into its native 3D state under physiological conditions. A protein can fold, be denatured, and then spontaneously refold. This is called Anfinsen’s thermodynamic principle.

Thus it should be possible to predict 3D structure computationally by minimizing a suitable conformational energy function. This is called ab initio prediction, but such a function is difficult to define and difficult to minimize globally.

In practice, structures determined by X-ray crystallography and nuclear magnetic resonance (NMR) are used to give empirical structure-function relationships.

Page 4:

A schematic illustration of RNA secondary structure elements.

Elements shown: stem, hairpin loop, pseudoknot, bulge loop, internal loop, branch loop.

RNA secondary structure can be predicted ab initio by defining an energy function and minimizing it with DP, in a process similar to sequence alignment.
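The DP idea can be sketched with a deliberately simplified stand-in for an energy function: instead of minimizing a thermodynamic energy, the code below maximizes the number of nested base pairs (the classic Nussinov-style recursion). The function name `nussinov_pairs`, the allowed pair set (including G-U wobble pairs) and the minimum hairpin loop size of 3 are assumptions made for this illustration, and pseudoknots are not handled.

```python
def nussinov_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    A crude proxy for minimum energy: every allowed pair counts as +1.
    min_loop is the smallest number of unpaired bases allowed in a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])   # i or j left unpaired
            if (seq[i], seq[j]) in pairs:                 # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                     # split into two substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov_pairs("GGGAAAUCC"))
```

As in sequence alignment, the table is filled from short subproblems to long ones, and a traceback over `best` (omitted here) would recover the actual stems and loops.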

Page 5:

Yeast alanyl transfer RNA

[Figure: secondary structure of the tRNA, with base pairs marked by dots (for example G.C).]

Page 6:

The definition of a dihedral angle and the three backbone dihedral angles, φ, ψ and ω, in a protein. Because ω is around 180°, the backbone configuration can be specified by φ and ψ for each peptide unit.

[Figure: backbone atoms N, Cα and C’ of consecutive residues, with H, O and side-chain R substituents; one peptide unit is marked.]
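A dihedral angle is defined by four consecutive atoms: it is the angle between the plane through the first three and the plane through the last three. The sketch below computes it from Cartesian coordinates with the usual atan2 formula; the helper names and the example coordinates are hypothetical and are not taken from the slides.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: a planar trans arrangement gives 180 degrees, as for omega.
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)))
```

For a real backbone, φ would be computed from the atoms C’(i-1), N(i), Cα(i), C’(i), and ψ from N(i), Cα(i), C’(i), N(i+1).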

Prediction of protein secondary structure: many methods

Page 7:

Prediction of protein secondary structure

The options are α-helix, β-strand and coil.

Many secondary (2º) structure prediction methods exist; the method of Chou and Fasman and the method of Garnier, Osguthorpe and Robson (GOR) are widely used. These are position- and structure-specific scoring matrices (PSSMs) based on modest or large numbers of proteins. The next page displays the GOR PSSM for α-helices.

These days one can choose from methods based on almost every major machine learning approach: ANN, HMM, etc.

Page 8:

Helix state

     -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4    5    6    7    8
G   -16  -18  -18  -29  -41  -51  -67  -85 -105  -64  -42  -37  -30  -33  -26  -21  -17
A    18   20   23   25   32   40   45   45   62   58   51   45   48   43   37   30   32
V     1   -1   -5    0   -2   -9  -10   -5    4   -5   -3   -8  -11   -1    0   -7   -7
L    17   19   22   28   23   29   37   37   51   48   54   59   41   36   34   28   15
I   -21  -19  -15   -5    0    2   10    9   17   12    8   12    6    6   16   18    9
S   -23  -16  -18  -13  -20  -25  -27  -31  -51  -41  -47  -43  -35  -34  -38  -34  -36
T   -13  -21  -16  -16  -14  -11   -7  -14  -28  -30  -33  -30  -20  -17  -18  -12   -8
D    16   20   18   14   23   22   19   26   -1   -5  -26  -35  -21   -6   -3   -1    1
E    19   24   31   35   39   36   36   45   52   40   14  -17  -13  -14  -10   -7   -2
N     2    3   -2   -6   -6   -9  -16  -22  -44  -29  -24  -13    0   -2   -4   -5    3
Q     7    9    6    0    7    0   -3   10   23   35   29   23   16   10    0    0    1
K    25   24   22   18   14   16   16   25   28   37   44   54   49   44   39   44   47
H    14    0   -7   -6  -14   -6   -2    1    2   21   24   25   27   25   19   25   31
R     1   -5  -19  -25  -16  -16   -7   -4   -1   -1    3    6    0    0   -6    8    0
F     0    7   17   23   23   18   29   26   32   40   34   28   12    3   15    6    4
Y    -8   -9  -10  -18  -13  -13  -31  -26  -15  -24  -18  -23  -28  -19  -16  -18  -23
W     8   18   11    9    2   26   37   29   30   17   -1   12   13   11   31   13    2
C   -77  -71  -74  -74  -67  -60  -71  -61  -47  -46  -56  -58  -67  -70  -71  -80  -81
M     2  -12   -9   -1    0   21   33   25   34   41   39   44   29   15    4   -2  -11
P     0   -6   -7   -6  -15  -22  -35  -47  -68 -179  -95  -72  -53  -37  -28  -22  -11
X     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Axis labels in the figure: Cter, Nter
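A table of this kind is typically used by summing, for the residue being predicted, the contributions of its neighbours within eight positions on either side. The sketch below does this for the helix state using four rows copied from the table (G, A, E, P). It assumes that column m holds the contribution of the residue at position pos + m; the Cter/Nter labels suggest the orientation of the columns should be checked against the original figure before relying on this. The function name `helix_score` is hypothetical, and residues without a row are treated like X (all zeros).

```python
# Four rows of the helix-state table above, columns for offsets -8 .. +8.
GOR_HELIX = {
    "G": [-16, -18, -18, -29, -41, -51, -67, -85, -105, -64, -42, -37, -30, -33, -26, -21, -17],
    "A": [18, 20, 23, 25, 32, 40, 45, 45, 62, 58, 51, 45, 48, 43, 37, 30, 32],
    "E": [19, 24, 31, 35, 39, 36, 36, 45, 52, 40, 14, -17, -13, -14, -10, -7, -2],
    "P": [0, -6, -7, -6, -15, -22, -35, -47, -68, -179, -95, -72, -53, -37, -28, -22, -11],
}


def helix_score(seq, pos):
    """Sum of helix-state scores contributed by residues within 8 of pos.

    Assumed convention: the residue at pos + m uses column m of its row.
    """
    total = 0
    for m in range(-8, 9):
        j = pos + m
        if 0 <= j < len(seq):
            row = GOR_HELIX.get(seq[j])   # unknown residues contribute 0, like X
            if row is not None:
                total += row[m + 8]
    return total


peptide = "AAEAAGPAAEEAA"
print([helix_score(peptide, i) for i in range(len(peptide))])
```

A full predictor would hold one such table per state (helix, strand, coil) and assign each residue the state with the highest summed score.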

Page 9:

Two architectures of the hierarchical neural network: (a) the perceptron and (b) the back-propagation neural network.

[Figure: (a) input layer and output layer; (b) input layer, hidden layer and output layer.]

Page 10:

Prediction of transmembrane domains

Membrane proteins are very common, perhaps 25% of all proteins. The interior of a membrane is hydrophobic, so a transmembrane domain typically consists of hydrophobic residues, about 20 of them to span the membrane.

There are a number of rules for detecting transmembrane domains: Kyte-Doolittle hydropathy scores work fairly well, and the Klein-Kanehisa-DeLisi discriminant function does even better.
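A hydropathy scan can be sketched as a sliding-window average of per-residue Kyte-Doolittle values, flagging windows whose mean is high. The window length of 19 and the cutoff of 1.6 used below are commonly quoted settings for transmembrane helices and are treated here as adjustable assumptions; the function names are hypothetical. The Klein-Kanehisa-DeLisi discriminant function is not shown.

```python
# Kyte-Doolittle hydropathy values (one-letter amino acid codes).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; entry i covers seq[i:i+window]."""
    return [sum(KD[a] for a in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy reaches the threshold."""
    return [i for i, h in enumerate(hydropathy_profile(seq, window)) if h >= threshold]


# Toy sequence: a 20-residue leucine stretch flanked by charged residues.
toy = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
print(candidate_tm_windows(toy))
```

Overlapping hits would normally be merged into one predicted segment of roughly 20 residues.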

Page 11:

Three-dimensional structures of two membrane proteins

Photosynthetic reaction centre (PDB: 1PRC)

Outer membrane protein: porin (PDB: 1OMF)

Page 12:

Hidden Markov Models (HMMs)

S = States {s0, s1, …, sn}

V = Output alphabet {v0, v1, …, vm}

A = {aij} = transition probability from si to sj

B = {bi(j)} = probability of outputting vj in state si

• What is the probability of a sequence of observations?

• What are the maximum likelihood estimates of parameters in an HMM?

• What is the most likely sequence of states that produced a given sequence of observations?
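The first question is usually answered with the forward algorithm, which sums over all state paths by dynamic programming. The sketch below uses a two-state toy model whose state names, start distribution and probabilities are invented for illustration; only the recursion itself corresponds to the definitions of A and B above.

```python
def forward_probability(obs, start, trans, emit):
    """P(observations) under an HMM, summed over all hidden state paths."""
    if not obs:
        return 1.0
    states = list(start)
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in states) * emit[t][symbol]
                 for t in states}
    return sum(alpha.values())


# Hypothetical toy model (not from the slides).
START = {"s0": 0.6, "s1": 0.4}
TRANS = {"s0": {"s0": 0.7, "s1": 0.3}, "s1": {"s0": 0.4, "s1": 0.6}}
EMIT = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.1, "b": 0.9}}

print(forward_probability("abba", START, TRANS, EMIT))
```

The second question (maximum likelihood parameter estimation) is typically handled with the Baum-Welch algorithm, and the third with the Viterbi algorithm, which replaces the sum in the recursion by a maximum.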

Page 13:

A hidden Markov model for sequence analysis

Delete states: d1 d2 d3 d4

Insert states: I0 I1 I2 I3 I4

Match states: m0 m1 m2 m3 m4 m5, running from Start to End

m=match state (output), I=insert state (output), d=delete state (no output)

Page 14:

Prediction of protein 3D structures

Knowledge-based prediction of protein 3D (3º) structure can be classified into two categories: comparative modelling and fold recognition. Comparative modelling can work well when there is significant sequence similarity to a protein with known 3D structure. By contrast, fold recognition is used when no significant sequence similarity exists, and it makes use of the knowledge and analysis of all known protein structures. One fold recognition method, due to Eisenberg and colleagues, involves 3D-1D alignment. Another is threading.

Page 15:

The 3D-1D method for prediction of protein 3D structures involves the construction of a library of 3D profiles for the known protein structures.

[Figure: each residue position is assigned an environmental class from its main chain and its side chain (inside or outside; polar or apolar). Class labels shown: E, P1, P2, B1, B2, B3.]

3D-1D score (rows: amino acids; columns: environmental classes)

       B1     B1     B1   . . . .
A    -0.66  -0.79  -0.91  . . . .
R    -1.67  -1.16  -2.16  . . . .
.      .      .      .
Y     0.18   0.07   0.17  . . . .
W     1.00   1.17   1.05  . . . .

3D profile (rows: amino acids; columns: residue number 1, 2, 3, …, N)

        1     2     3   . . . .  N
A      12   -66    46   . . . .
R     -32   -80   -34   . . . .
.       .     .     .
Y     -94   112  -210   . . . .
W    -214   102  -135   . . . .

Page 16:

Gene Structure I

DNA     - - - - agacgagataaatcgattacagtca - - - -
          Transcription
RNA     - - - - agacgagauaaaucgauuacaguca - - - -
          Translation
Protein - - - - - DEI - - - -
          Protein folding problem
Protein

[Figure: a gene drawn as Exon, Intron, Exon, Intron, Exon, with the step labelled Splicing.]
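The DNA, RNA and protein lines of this slide can be reproduced in a few lines of code. The sketch below follows the slide in treating transcription as replacing T with U in the sequence as written, and translates with the standard genetic code. The function names are hypothetical, and the reading frame (offset 1) is the one that yields the DEI fragment shown on the slide.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """RNA with the same sequence as the given DNA strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(rna, frame=0):
    """Translate every complete codon of rna, starting at offset frame."""
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


rna = transcribe("agacgagataaatcgattacagtca")
print(rna.lower())
print(translate(rna, frame=1))
```

The translation begins with DEI, matching the protein fragment on the slide.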

Page 17:

Gene Structure II

5’ DNA: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4 3’

TRANSCRIPTION

pre-mRNA

SPLICING

mRNA: AUG - X1…Xn - STOP

TRANSLATION

protein sequence

protein 3D structure

Page 18:

Gene Structure III

5’ DNA: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4 3’

Signals marked on the gene:

• Promoter: TATA
• Translation initiation: ATG
• Splice site: GGTGAG
• Branchpoint: CTGAC
• Pyrimidine tract
• Splice site: CAG
• Stop codon: TAG/TGA/TAA
• polyA signal

Page 19:

Additional Difficulties

• Alternative splicing

• Pseudogenes

[Figure: DNA is transcribed to pre-mRNA; SPLICING and ALTERNATIVE SPLICING give different mRNAs, and TRANSLATION of these gives Protein I and Protein II.]

Page 20:

Approaches to Gene Recognition

• Homology: BLASTN, TBLASTX, Procrustes

• Statistical de novo: GRAIL, FGENEH, Genscan, Genie, Glimmer

• Hybrid: GenomeScan, Genie

F(*,*,*,…)

Page 21:

Example: Glimmer

Gene Finding in Microbial DNA

• No introns

• 90% coding

• Shorter genomes (less than 10 million bp)

• Lots of data

Page 22:

Gene Structure in Prokaryotes

Translation initiation: ATG

ORF

Stop codon: TAG/TGA/TAA
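With no introns, a first pass at prokaryotic gene finding is to list open reading frames: an ATG followed in the same frame by the first stop codon. The sketch below scans the three forward frames only; the reverse strand, alternative start codons and any statistical scoring (as in Glimmer) are left out, and the function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Forward-strand ORFs as (start, end) with dna[start:end] = ATG ... stop.

    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs


seq = "CCATGAAATTTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

In real genomes `min_codons` would be set much higher, since short ORFs occur frequently by chance.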

Page 23:

Simplest Hidden Markov Gene Model

States: Intergene, ATG, Coding, TAA

Emission distributions shown in the figure:

• A 0.25, C 0.25, G 0.25, T 0.25
• A 0.9, C 0.03, G 0.04, T 0.03

Transition probabilities shown on the arcs: 1, 1, 0.9, 0.1, 0.1, 0.9

Page 24:

The Viterbi Algorithm

A A C A G T G A C T C T
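The Viterbi algorithm finds the most likely state path for a sequence such as the one above. The sketch below applies it to a two-state reduction of the simplest gene model. Several things are assumptions made for this illustration: the ATG and TAA states are dropped, the uniform emission table is assigned to the intergenic state and the A-rich table to the coding state, each state is kept with probability 0.9 and left with probability 0.1, and both states are equally likely at the start.

```python
import math

STATES = ("intergene", "coding")
START = {"intergene": 0.5, "coding": 0.5}                      # assumed
TRANS = {"intergene": {"intergene": 0.9, "coding": 0.1},       # assumed wiring
         "coding": {"coding": 0.9, "intergene": 0.1}}
EMIT = {"intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03}}


def viterbi(obs):
    """Most likely state path for obs, computed in log space."""
    if not obs:
        return []
    v = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for x in obs[1:]:
        new_v, ptr = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: v[s] + math.log(TRANS[s][t]))
            ptr[t] = best
            new_v[t] = v[best] + math.log(TRANS[best][t]) + math.log(EMIT[t][x])
        v = new_v
        backpointers.append(ptr)
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(backpointers):         # trace back from the last position
        state = ptr[state]
        path.append(state)
    return path[::-1]


print(viterbi("AACAGTGACTCT"))
```

Working with logarithms avoids numerical underflow on long sequences, which matters for genome-scale input.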

Page 25:

Example: Genscan

Gene Finding in Human DNA

• Introns

• 5% coding

• Large genome (3 billion bp)

• Alternative splicing

Page 26:

The Genscan HMM

Page 27:

Examples of functional sites.

DNA
• Processing: Replication; Transcription
• Functional sites: Replication origin; Promoter; Enhancer; Operator and other prokaryotic regulators
• Interacting molecules: Origin recognition complex; RNA polymerase; Transcription factor; Repressor, etc.

RNA
• Processing: Post-transcriptional processing; Translation
• Functional sites: Splice site; Translation initiation site
• Interacting molecules: Spliceosome; Ribosome

Protein
• Processing: Post-translational processing; Protein sorting; Protein function
• Functional sites: Cleavage site; Phosphorylation and other modification sites; ATP binding sites; Signal sequence, localization signals; DNA binding sites; Ligand binding sites; Catalytic sites
• Interacting molecules: Protease; Protein kinase, etc.; Signal recognition particle; DNA; Ligands; Many different molecules

Page 28:

Protein sorting prediction

The final step in informational expression of proteins involves their sorting to the appropriate location within or outside the cell. The information for correct localization is usually located within the protein itself.

Page 31:

Sequence Alignment Problem

• Task: find common patterns shared by multiple protein sequences

• Importance: understanding function and structure; revealing evolutionary relationships; organizing data …

• Types: Pairwise vs. Multiple; Global vs. Local.

• Approaches: criteria-based (extension of pairwise methods) versus model-based (EM, Gibbs, HMM)

Page 32:

Outline of Liu-Lawrence approach

• Local alignment --- Examples, the Gibbs sampling algorithm

• A simple multinomial model for block-motifs and the Bayesian missing-data formulation.

Possible but not covered here:

• Motif sampler: repeated motifs.

• The hidden Markov model (its decoupling)

• The propagation model and beyond

Page 33:

Example: search for regulatory binding sites

• Gene Transcription and Regulation
– Transcription is initiated by RNA polymerase binding at the so-called promoter region (TATA-box; or -10, -35).
– It is regulated by some (regulatory) proteins binding on DNA “near” the promoter region.
– These binding sites on DNA are often “similar” in composition.

[Figure: a gene from 5’ to 3’ showing enhancers and repressors, the promoter region with RNA polymerase, and the translation start at the starting codon AUG.]

Page 35:

The particular dataset

• 18 DNA segments, each of length 105 bp.
• There is at least one CRP binding site, known experimentally, in each sequence.
• The binding sites are about 16-19 base pairs long, with considerable variability in their contents.
• We are interested in seeing whether we can find these sites computationally.

Page 36:

The Data Set

Page 37:

Truth?

Page 38:

Example: H-T-H proteins

• HTH (helix-turn-helix): sequence-specific DNA binding, gene regulation.
• Motifs occur as local isolated structures. The whole 3-D structures are known and very different.
• 30 sequences with known HTH positions were chosen. The set represents a typically diverse cross section of HTH sequences.
• The width of the motif pattern is assumed to be in the range from 17 to 22. The criterion “information per parameter” (IPP) is used to determine the optimal width, 21.
• A heuristic convergence procedure was developed (multiple restarts with IPP monitored).

Finding

Page 40:

Local Alignment of Multiple Sequences

[Figure: the sequences stacked one above another; sequence k has length nk and contains a motif of width w at position ak.]

Alignment variable: A={a1, a2, …, ak}

Objective: find the “best” common patterns.

Page 41:

Motif Alignment Model

[Figure: the sequences stacked one above another; sequence k has length nk and contains a motif of width w at position ak.]

The missing data: Alignment variable: A={a1, a2, …, ak}

• Every non-site position follows a common multinomial distribution with p0=(p0,1 ,…, p0,20)
• Every position i in the motif element follows the probability distribution pi=(pi,1 ,…, pi,20)

Page 42:

The Tricky Part: The alignment variable A={a1, a2, …, ak} is not observable

• General Missing Data problem:
– Unobserved data in each datum
– Object of the DP optimization (path)
– Potentially observable
– Examples
• Alignment
• RNA structure
• Protein secondary structure

Page 43:

Statistical Models

• How do we describe patterns?
– frequencies of amino acid types.
– multinomial distribution --- more generally a “model”

A typical aligned motif

Page 44:

Multinomial Distribution

Motif positions

        1 2 3 4 5 6
Seq 1   I G K P I E
Seq 2   V G D P G E
Seq 3   V G D D A D
Seq 4   I G Q H P E
Seq 5   L S G P E E

A total of k sequences

Model Mi for i-th column:

(ki,1, ki,2, …, ki,20) ~ Multinom (k, pi ) where pi=(pi,1 ,…, pi,20)

Page 45:

Estimation for the “pattern”

• The maximum likelihood estimate:

\[ \hat{p}_{ij} = \frac{k_{ij}}{k}, \qquad k_{i,1} + \cdots + k_{i,20} = k \]

• Bayesian estimate:

– Prior: pi ~ Dirichlet (αi,1 ,…, αi,20), “pseudo-counts”

– Posterior: [pi | obs ] ~ Dirichlet (αi,1+ki,1 ,…, αi,20+ki,20)

– Posterior mean:

\[ \hat{p}_{ij} = \frac{k_{ij} + \alpha_{ij}}{k + \sum_{l} \alpha_{il}} \]

– Posterior distribution: the Dirichlet distribution given in the posterior line above.
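The two estimates can be compared on one column of the aligned motif from the previous slide. The sketch below computes the maximum likelihood estimate k_ij / k and the posterior mean under a symmetric Dirichlet prior; the choice of the same pseudo-count (1.0) for every amino acid type and the function names are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# The five aligned motif instances from the multinomial distribution slide.
MOTIF = ["IGKPIE", "VGDPGE", "VGDDAD", "IGQHPE", "LSGPEE"]


def column_counts(motif, i):
    """Counts k_ij of each amino acid type in column i (0-based)."""
    return Counter(seq[i] for seq in motif)


def ml_estimate(counts):
    """Maximum likelihood estimate: k_ij / k."""
    k = sum(counts.values())
    return {a: counts[a] / k for a in AMINO_ACIDS}


def posterior_mean(counts, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet prior with pseudo-count alpha."""
    k = sum(counts.values())
    total = k + alpha * len(AMINO_ACIDS)
    return {a: (counts[a] + alpha) / total for a in AMINO_ACIDS}


counts = column_counts(MOTIF, 1)          # second motif position: G, G, G, G, S
print(ml_estimate(counts)["G"], posterior_mean(counts)["G"])
```

With only five sequences the maximum likelihood estimate assigns probability zero to every unseen amino acid, whereas the posterior mean keeps them possible; this is the practical reason for pseudo-counts.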

Page 46:

Dealing with the missing data

• Let Θ=(p0 , p1 , … , pw ) be the “parameter”, and A={a1, a2, …, aK} the alignment variable.

• Iterative sampling: P(Θ | A, Data); P(A | Θ, Data)

Draw from [Θ | A, Data], then draw from [A | Θ, Data]

• Predictive updating: pretend that K-1 sequences have been aligned. We stochastically predict the alignment for the K-th sequence.

[Figure: motif positions a1, a2, a3 already placed in the other sequences, with ak marked “?”.]

Page 47:

The Algorithm

• Initialize by choosing random starting positions \( a_1^{(0)}, a_2^{(0)}, \ldots, a_K^{(0)} \).

• Iterate the following steps many times:
– Randomly or systematically choose a sequence, say, sequence k, to exclude.
– Carry out the predictive-updating step to update ak.

• Stop when not much change is observed, or some criterion is met.

Page 48:

The PU-Step

[Figure: motif positions a1, a2, a3 already placed in the other sequences, with ak marked “?”.]

1. Compute predictive frequencies of each position i in the motif:

cij = count of amino acid type j at position i.

c0j = count of amino acid type j in all non-site positions.

qij = (cij+bj)/(K-1+B), B = b1 + … + b20, “pseudo-counts”

2. Sample from the predictive distribution of ak:

\[ P(a_k = l+1) \propto \prod_{i=1}^{w} \frac{q_{i,R_k(l+i)}}{q_{0,R_k(l+i)}} \]

where R_k(l+i) denotes the residue type at position l+i of sequence k.
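The algorithm and its predictive-updating step can be put together in a short sketch. It follows the two steps above: counts from the other K-1 sequences give q_ij, and the start position of the left-out sequence is drawn with weights equal to the product of q_ij / q_0j over the window. Several details are assumptions for this illustration: a single pseudo-count value for every residue type, the normalization of the background frequencies q_0j by the number of non-site positions, a fixed number of sweeps instead of a convergence test, and the function name `gibbs_motif_sampler`. A DNA alphabet is used to keep the toy example small; every sequence must be at least w long and contain only alphabet symbols.

```python
import random
from collections import Counter


def gibbs_motif_sampler(seqs, w, alphabet="ACGT", pseudo=1.0, sweeps=100, seed=0):
    """Site sampler: returns one motif start position (0-based) per sequence."""
    rng = random.Random(seed)
    K = len(seqs)
    B = pseudo * len(alphabet)                        # total pseudo-count
    a = [rng.randrange(len(s) - w + 1) for s in seqs]  # random initial positions
    for _ in range(sweeps):
        for k in range(K):                            # leave out sequence k
            counts = [Counter() for _ in range(w)]    # c_ij from the other sequences
            background = Counter()                    # c_0j from non-site positions
            for j in range(K):
                if j == k:
                    continue
                for pos, ch in enumerate(seqs[j]):
                    if a[j] <= pos < a[j] + w:
                        counts[pos - a[j]][ch] += 1
                    else:
                        background[ch] += 1
            n_bg = sum(background.values())
            q = [{x: (counts[i][x] + pseudo) / (K - 1 + B) for x in alphabet}
                 for i in range(w)]
            q0 = {x: (background[x] + pseudo) / (n_bg + B) for x in alphabet}
            s = seqs[k]
            weights = []
            for l in range(len(s) - w + 1):           # weight of starting at l
                ratio = 1.0
                for i in range(w):
                    ratio *= q[i][s[l + i]] / q0[s[l + i]]
                weights.append(ratio)
            a[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return a


# Toy data with the word TTGACA planted in each sequence (hypothetical example).
toy = ["CGCGTTGACACGCG", "TTGACAGCGCGCGC", "GCGCGCGCTTGACA", "GCGTTGACAGCGCG"]
print(gibbs_motif_sampler(toy, w=6))
```

Because the procedure is stochastic, it is usually run from several random starts and the alignments compared by a score, as described for the H-T-H example.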

Page 49:

Phase-shift and Fragmentation

• Sometimes the sampler gets stuck in a local shift optimum.

[Figure: the sampled positions ak shown offset from the true motif locations.]

• How to “escape” from this local optimum?
– Simultaneous move: A → A+δ, where A+δ = {a1+δ, … , aK+δ}
– Use a Metropolis step: accept the move with probability p,

\[ p = \min\left\{ 1, \frac{\pi(A+\delta \mid R)}{\pi(A \mid R)} \right\} \]

Compare entropies between new columns and left-out ones.

Page 50:

Acknowledgements for slides used

PDB: protein figures

Lior Pachter: gene finding

Jun Liu: Gibbs sampler