Download - Lecture 1 - MITweb.mit.edu/10.555/Spring2004/notes/L01-2004... · 2004-02-03 · 10.555-Bioinformatics Lecture 1: Introduction and General Overview 5 Drivers of change (1) zGenomics:

10.555-Bioinformatics Lecture 1: Introduction and General Overview

1

10.555Bioinformatics: Principles, Methods and Applications

MIT, Spring term, 2004

Lecture 1:

Introduction to the course.Definitions, perspectives, general overview.

Rudiments of dynamic programming


2

Course goals-requirements

GoalsNot code writingNot algorithm optimizationProvide appreciation-familiarity-expertise with available

computational tools of sequence analysis, pattern discovery, analysis of cell physiology

Define important problems in bioinformatics-systems biology

Suggest process for problem solving by illustrative case studies involving important biological-physiological problems

RequirementsAdvanced courseNone formallyApplied probability- statistics most important background


3

MIT Class:http://web.mit.edu/cheme/gnswebpage/index.shtml

Look for course 10.555

More Information at:


4

We have entered a period of rapid change

… For the times, they are a changin’Bob Dylan

How is phenotype controlled by the genes?Nobody knows, least of all the machines.Medicine will thrive if we can discover the

meansTo merge our knowledge and informationAnd find genes’ intent and control by

environmentFor the times, they are a changin’.

Jay Bailey


5

Drivers of change (1)

Genomics: Sequencing and study of genomes

More than 100 (150) genome species sequenced: Comparative studies

All industrially important organisms will be sequenced within the next 1-2 years

Impact of Genomics:Generation of data about the cellular

phenotype: Link with genotype (expression)Need of integration in a meaningful

frameworkUse of these data in discovery


6


Heavy investment in the life sciences

Excess of $200B in the USA

Mostly health-oriented, however, equally applicable to other fieldsApplied molecular biology

Construct optimal genetic backgroundControls at the genetic levelOptimal cell-bioreactor combinations


7


Enormous opportunitiesMedicineChemicals (cell factories)Materials (special properties, PHA’s, PHB)Pharmaceuticals (chirality)Fuels (bio-diesel), ethanol, CO2, methane-

methanolEnvironmental

Plants (resistance, oils, enrichment, polymers)


8

Drivers of Bioinformatics (1)

Genomics


9

Drivers of Bioinformatics (2)

Technologies probing the expression, proteomic and metabolic phenotype generating large volumes data

DNA microarrays (gene chips)

2-D gel electrophoresis

Protein chips or HPLC methods combined with maldi/seldi tof spectrometry for protein characterization

Isotope label distributions


10

0.4% Pine-Solmid-exponential phase

0.4% Pine-Sol early stationary phase

Control early stationary phase

Same culture+3 hours


11

Example of 2-D gels


12

2-Dimensional Liquid Chromatography

To MS

To MS

To MS

1. Ion Exchange 2. Size exclusion

To MS

Inject peptides


13

Trehalose

GLC6PCO 2

Ribu5P

Sed7PE4P

GLC

Fru6P

GAP

754.8100

316.8

55.5

114.5

Suc

IsocitOAA

aKG Glt

GlumSucCoA

CO 2

CO 2 NH 3

15.7

2.5

5.1

20.6Asp

PYR13

5.1

PYR

AcCoA Acetate

PEPAla ,Val,Lac

CO 2

CO 2

CO 2

19.6 23.2

23.2

3.9

Lysine

H4D

Meso-DAP 12.7

CO 2NH 3

Suc

Fluxes estimated from isotopomer balancing (locally optimal solution - value of objective: 0.021)

(2.62)

G3P151.8

48.89.4

0.02

303.5

22.4

5.09

76.7

21.9

0.2

0.13

12.68

Bioinformatics

The process and methods applied to the upgrade of the information

content of biological measurements

More specifically:The utilization of sequence, expression, proteomic and physiological data to identify characteristic patterns of disease and elucidate mechanisms of gene regulation, signaltransduction, flux control and overall cell physiology


15

Probing cellular function

DNA microarrays

Environment

DNA

mRNAFluxes

Proteins

Proteomics Metabolic Flux Analysis

Signal Transduction


16

Central GoalGenome sequences and technologies generating

large volumes of dataDNA microarrays (gene chips)ProteomicsSmall metabolite measurementsIsotope labeling

Challenge:Utilize genomic information and above data to elucidate gene regulation, metabolic flux control, and overall cell physiology

Importance:Elucidate genetic regulation and flux controls

in their entirety instead of one at a time


17

I. Genomics: Sequence-driven problems

Bioinformatics course

Structured to address these driving forces.

It comprises three basic units:

II. Data-driven problems

III. Functional Genomics and Issues of Systems Biology


18

I. Sequence-driven problems

Central theme: Sequence comparisonsMetrics of comparisonScoring matricesScore evaluationComparison by alignment vs. comparison

by entries of bio-dictionaries (Teiresias)

EmphasisComputer ScienceApplied probabilityStatisticsComparative genomics


19

I. Sequence-driven problems

• Identify Open Reading Frames

• Identify gene splicing sites (introns)

• Gene regulation (using expression and other data)

• Gene annotation (inter-genomic comparisons)

• Determine sequence patterns of regulatory sites. Important in understanding gene regulation and expression characteristics of tissues, disease, phase of development, etc.

GENOMICS

• Fragment reassembly for chromosome sequence


20

I. Sequence-driven problems (cont’d)

• Single, multiple protein alignment (homology)

• Framework for the analysis of signaling networks

• Identification of functional domains in protein sequences

PROTEOMICS

• Pathway interactions• Effect of pathway convergence-divergence on signaling• Controls of signaling (kinetic, regulatory)• Integration

• Sequence-structure, sequence-function relationships (structural bioinformatics)• Pattern discovery (phylogeny, remote associations)


21

II. Data-driven problems

Central theme: Information upgrade

Transcriptional data (microarrays). Use to:Identify discriminatory genesDefine clusters of co-expressed genesCorrelations of gene expressionSample classification

Disease diagnosisTherapy prognosisEvaluation of side effects


22

Data-driven problems (cont’d)

Metabolic rate-isotopic tracer data. Use to:Determine fluxesStudy flux control (metabolic engineering)

Mass Spectrometry data of proteins. Use to:Identify proteins and protein fragmentsProtein quantification


23

III. Problems of Systems Biology-Physiology-

Functional Genomics

Central theme:Interactions of cellular parts

NetworksAssociations of data

Methods:Network analysis (balances)Projections (PCA, FDA, etc.)Correlations (PLS, PCR, etc.)

Pattern Discovery in data


24

Interesting problems

Identify discriminatory genesDecipher gene regulationElucidate gene regulatory networksUse transcriptional data to discover

characteristic patterns of gene expression for different tissue types, disease, etc.

Microarray data analysis


25

Example: Use microarray expression data to discover discriminatory genes

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

Auto

scal

ed E

xpre

ssio

n Le

vel


26






27

Interesting problems: Decipher structure of gene regulation: Modified lac operon

σ3 ... σ1 σ2 ... P1 cap ... lacI lacP lacO

lacZ lacY lacA

mRNA

☺Lac

CAP

Rich Medium

Growth

Study effect of:Maltose, Glucose,LactoseGrowth phaseRich Medium, O2

+

+

-- CAP-cAMP

+


28

Expected profiles in differential gene expression experiments

Glc, Glc,Lac, Glc, Rich Mal,Lac Glc, Grth.Grth No grth Medium Growth Stat. Rich Md.

σ3σ1σ2

cap1lacIlacZlacYlacAcap2malImalZmalYmalA


29

Example of gene chip data


30

Deciphering microarray data (cont’d)

Step 2: Determine structure of interactions

P1 P2 P3 P4 GC1 GC2 GC3 GC4 GC5 ...

C1C2C3

H L L L I R 0 I I ...H H H L I R R I 0 ... H H L H R R 0 I R …

Look for most generic patterns:

W W H W I R I W W I RL W H W W I I R W W RW W W H W W I R R W I

Apply: Decision trees, Teiresias, pattern discovery


31

Deciphering microarray data (cont’d)

Simulation results with lac operonSuccess rate of pattern discovery

0

20

40

60

80

100

0 20 40 60 80 100% of possible combinations considered in parameter

space

% o

f pat

tern

s di

scov

ered

S


32






33

Interesting problems of Systems Biology-Physiology-Functional Genomics

Genetic Regulatory Networks

• Apply time-lagged correlations in transcriptional data

• Apply pattern discovery in upstream sequence regions of co-expressed genes


34






35

A73

0

Figure 1Figure 1

FullFullLightLight

23 h

rs.

23 h

rs.

24 h

rs.

24 h

rs.

Transient LightTransient Light(10 min. samples)(10 min. samples) Transient DarkTransient Dark

(10 min. samples)(10 min. samples)

0.5

0.6

0.7

0.8

0.9

1

50:00 74:00 74:40 75:20 76:00 76:40 77:20

0

0.5

1

1.5

2

2.5

0 50 100 150 200 250

A73

0

Time (hrs)

FullFullDarkDark

Example: Defining the physiological state from the expression phenotype


36

Example of gene chip data


37

Example (cont’d)

-2.5

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

2.5

0 10 20 30 40 50 60 70 80 90 100

110

120

130

140

150

160

170

180

190

200

Time (min.)


38

Dimensionality reduction of the expression phenotype space by CDA

(10 minute time delay)

FL

FL

D23D23

D23D24D24

D24L10

L20L30L40L50L60L70L70L80L90L100D10

D20D30D40D50

D60

D70

D80D90

D100

-3 -2 -1 0 1 2 3-2.5

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

2.5

B-ALLT-ALLAML

CV1

CV2

fMB

fMT

fBT

fBT>0

B-ALL fTM>0

T-ALL AML

Decision boundaries

fBT: between B-ALL and T-ALL

fMT:between T-ALL and AML

fMB: between AML and B-ALL

( ) ( ) ( ) 021

:

≥+−−−

=

kmT

kmT

km

Tmkf

yyyyyyy

xVy

Figure 5: Physiological state domains are defined by a series of decision boundaries for the projected values and group means in the reduced CV plane.


40

Interesting problems (cont’d)

Determine rate controlling steps in metabolic networks to achieve directed flux amplification

Framework for the analysis of metabolic networks

Determination of in vivo metabolic fluxes from metabolite and isotopomer balances

METABOLISM (Metabolic Engineering)

• Pathway interactions• Controls of flux (kinetic, regulatory)• Integration

Establish link between the expression phenotype and the metabolic phenotype as determined by the fluxes

Discover patterns in fluxes


41

Trehalose

GLC6PCO 2

Ribu5P

Sed7PE4P

GLC

Fru6P

GAP

754.8100

316.8

55.5

114.5

Suc

IsocitOAA

aKG Glt

GlumSucCoA

CO 2

CO 2 NH 3

15.7

2.5

5.1

20.6Asp

PYR13

5.1

PYR

AcCoA Acetate

PEPAla ,Val,Lac

CO 2

CO 2

CO 2

19.6 23.2

23.2

3.9

Lysine

H4D

Meso-DAP 12.7

CO 2NH 3

Suc

Fluxes estimated from isotopomer balancing (locally optimal solution - value of objective: 0.021)

(2.62)

G3P151.8

48.89.4

0.02

303.5

22.4

5.09

76.7

21.9

0.2

0.13

12.68


42

Interesting problems: Signal transduction pathways

Receptor-Ligand binding

Nucleus Gene induction

Signalingpathways

MetabolicEnzyme Regulation

InputE1

MAPKKK MAPKKK*

MAPKK MAPKK-P MAPKK-PP

E2

MAPKKP'ase

MAPK MAPK-P MAPK-PP

MAPK P'ase

Output

Analysis of structure and interactions of signal transduction pathways


43

Problems of Systems Biology-Physiology-Functional Genomics

Identification of genes that collectivelyCorrelate with a phenotypeAre critical for important physiological properties


44

PHA/PHB biosynthesis genes

Acetyl-CoA

O

S-CoA

phaA phaB phaE phaC

Thiolase

2

HS-CoA

Reductase

NADPH+H+NADP+

Synthase

PHB

O

On

O

S-CoA

O

Acetoacetyl-CoA

O

S-CoA

OH

Hydroxybutyryl-CoA

HS-CoA

orphaC


45

Drug Discovery (cont’d)

WT AM50

2

4

6

8

10

12

14

16

18

20

PHB

Synt

hesi

s R

ate

(µm

ole/

g cd

w/h

r)


46

PHA/PHB biosynthesis genes

Acetyl-CoA

O

S-CoA

phaA phaB phaE phaC

Thiolase

2

HS-CoA

Reductase

NADPH+H+NADP+

Synthase

PHB

O

On

O

S-CoA

O

Acetoacetyl-CoA

O

S-CoA

OH

Hydroxybutyryl-CoA

HS-CoA

orphaC


47

Regulation of PHA biosynthesis in Synechocystis sp.

Acetyl-phosphate

O

O P OH

OH

O

PHBSynthase(active)

PHB

O

On

Acetyl-CoA

O

S-CoAAcetyl-CoA

O

S-CoA

Acetate

O

OH

PHBSynthase(inactive)

Acetate Kinase ackA

Phospho-transacetylase pta

Acetyl-CoA synthetaseacs


48

0

1000

2000

3000

4000

5000

0 2 4 6 8 10 12

PHA

Synt

hase

activ

itynm

olm

in-1

mg

prot

ein-1

Acetyl phosphate (mM)

Acetyl phosphate dependent activation of theSynechocystis sp. PHA synthase


49

Figure 2

(B)(A)

(C)

-25-20-15-10-505

101520

-20 -10 0 10 20 30

BGA10N

5AM

10PA

BG11

0.3NA

4.0

8

.0

12

..0

PHB

(%D

CW

)

5AM

BG11

10N BGA

0.3NA

10PA

Dimensionality reduction by Fisher Discriminant Analysis

Projections: CV1 = a1g1 + a2g2 + a3g3 + … + angnCV2 = b1g1 + b2g2 + b3g3 + … + bngn


50

Figure 5

-0.2 -0.1 0.0 0.1 0.2CV1

-0.5

-0.3

-0.1

0.1

0.3

CV

2

ssr2194slr1895

cpcB ndhG

glmSpsbA2

sll1369

slr1859

slr1800

sll0048

sll0611

sll0423

sll0442

sll0036

ssr2062ssl2384

sll0267sll0174

ssr0756

apcE

rbcL

slr0284

sll0395

smr0001

slr0639

slr1820slr1902

bcp

ssr1114

slr1807

Plotting the loadings (i.e., the coefficients of linear projection


51

Figure 2

R2 = 0.879

-505

101520

-20 -10 0 10 20CV2

PHB

(%D

CW

)

Correlational structure of PHB accumulation

PHB = c1g1 + c2g2 + c3g3 + … + cngn


52

Linking the Metabolic and ExpressionPhenotypes

Metabolic Phenotype

ExpressionPhenotype

MA PC HR EO NS O C TO YP PI EC

Mapping of the Metabolic Phenotypeonto the Macroscopic Phenotype

Mapping of the Expression Phenotypeonto the MacroscopicPhenotype

M1M2...…Mn

*M1*M2...

*Mn

M1M2......Mn

M1M2...

Mn

G1G2......Gm

G1G2......Gm

G1G2......Gm

*G1*G2......

*Gm

Bioinformatics

The process and methods applied to the upgrade of the information

content of biological measurements

More specifically:The utilization of sequence, expression, proteomic and physiological data to identify characteristic patterns of disease and elucidate mechanisms of gene regulation, signaltransduction, flux control and overall cell physiology


54

In summary

Bioinformatics is driven by genomics and data

These driving forces define three essential units (or types of problems) of instruction:

Sequence-driven problemsData-driven problemsProblems of physiology or systems biology

While initially focused on sequence analysis, the future belong to successful linkage of genomics and physiology

This is a young field very much in flux with many problems in search of good solutions


55

Spring Semester 2004

10.555 Bioinformatics: Principles, Methods and Applications

Instructors

Gregory Stephanopoulos and Isidore RigoutsosSpecial lectures by Dr. Joanne Kelleher

Assistant-Grading: Adrian Fay, [email protected]

9 units (H), Class meets Tuesday 2-5pm, Room 66-154

COURSE SYLLABUS

Part I: INTRODUCTION

Lecture 1: February 3- Historical perspectives, definitions- Impact of genomics on problems in molecular and cellular biology; need for integration and quantification, contributions of engineering- Overview of problems. Sequence driven and data driven problems- Rudiments of dynamic programming with applications to sequence analysis- Overview of course methods- Integrating cell-wide data, broader issues of physiology

Lecture 2: February 10 (Assignment 1, due February 24)- Primer on probabilities, inference, estimation, Bayes theorem- Dynamic programming, application to sequence alignment, Markov chains, HMM


56

Part II: SEQUENCE DRIVEN PROBLEMSNo class on February 17 (Monday schedule-President's day)Lecture 3: February 24 (Assignment 2, due on March 2)

- Primer on Biology (the units, the code, the process, transcription, translation, central dogma, genes, gene expression and control, replication, recombination and repair)- Data generation and storage- Schemes for gene finding in prokaryotes/eukaryotes- Primer on databases on the web- Primer on web engines

Lecture 4: March 2 (Assignment 3, due March 9)- Some useful computer science (notation, recursion, essential algorithms on sets, trees and graphs, computational complexity)- Physical mapping algorithms- Fragment assembly algorithms- Comparison of two sequences- Dynamic programming revisited- Popular algorithms: Smith-Waterman, Blast, Psi-blast, Fasta

Lecture 5: March 9 (Assignment 4, due on March 16)- Building and using scoring matrices- Multiple sequence alignment- Functional annotation of sequences- Micro RNA, Antimicrobial peptides

Lecture 6: March 16 (Assignment 5, due March 30) - Pattern discovery in biological sequences- Protein motifs, profiles, family representations, tandem repeats, multiple sequence alignment and sequence comparison through pattern discovery- Functional annotation-Promoter site recognition

No class on March 23: Spring BreakLecture 7: March 30 (Assignment 6, due on April 13): Physiology

- Primer on cell physiology. Definition at the macroscopic level- Molecular cell physiology. Interactions of pathways, cells, organs. Measurements and their integration- Distribution of kinetic control. Rudiments of metabolic control analysis (MCA)


57

PART III: UPGRADING EXPRESSION AND METABOLIC DATA

Lecture 8: April 6: Fluxes- MCA continued. Analysis of metabolic pathways, fluxes, the metabolic phenotype- Methods for metabolic flux determination- Methods for underdetermined systems

Lecture 9: April 13 (Assignment 7, due on April 27): Microarrays- Monitoring gene expression levels. DNA microarrays- Data collection, error analysis, normalization, filtering- Novel applications of DNA microarrays- Analysis of gene expression data- Clustering methods: Coordinated gene expression- Identification of discriminatory genes- Determination of discriminatory gene expression patterns. Use in diagnosis- Data visualization. Reconstruction of genetic regulatory networks

No class on April 20: Patriots DayLecture 10: April 27: Linkage

- Linking the metabolic and expression phenotypes- Quantitative predictive models of cell physiology from gene expression-Linkage by partial least squares- Systems biology

Lecture 11: May 4- Signaling and signal transduction pathways- Measurements in signaling networks- Integrated analysis of signal transduction networks

Lecture 12: May 18- Putting it all together- Project presentations


58

HOMEWORKSThere will be five Problem Sets on the methodologies and computational algorithms covered in the course, as follows:

Problem Set – 1: Material of Lectures 1, 2 Problem Set – 2: Material of Lecture 3Problem Set – 3: Material of Lecture 4 Problem Set – 4: Material of Lecture 5Problem Set – 5: Material of Lecture 6 Problem Set – 6: Material of Lectures 7,8Problem Set – 7: Material of Lectures 9,10

PROJECTSThe students, in groups of 2-3, will carry out a project on a course-related subject of their own choosing, or from a list of suggested topics. The groups must be formed topics selected by April 6, 2004. An oral presentation of the project by the group members will take place on May 18, 2004, at which time the final report on the project will be also due.

GRADEThere will be no mid-term or final exams. The grade in the course will be based on the homeworks, the group project, and the oral presentation, with the following weights:Homeworks (40 %); Written project report (35 %); Oral presentation (25 %)

CLASS NOTES and REFERENCESLecture notes will be posted on the web. Additionally, the course will draw from a series of published papers and the following list of books, which are recommended as references:


59

References•Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, D.Gusfield, Cambridge University Press, ISBN: 0521585198•Fundamental concepts of bioinformatics, D.E. Krane and M. L. Raymer, ISBN: 0-8053-4633-3•Introduction to probability, D.P. Bertsekas and J.N. Tsitsiklis, ISBN:1-886529-40-X•Genetics, a Molecular Approach, T.A.Brown, Chapman & Hall, ISBN: 0412447304.•Introduction to Computational Molecular Biology, J.Setubal and J.Meidanis, PWS Publishing Company, ISBN: 0534952623.•Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, A.D.Baxevanis and B.F.F.Ouellette, Wiley-Interscience, ISBN: 0471191965.•Bioinformatics: The Machine Learning Approach, P. Baldi and S. Brunal, MIT Press, ISBN: 0-262-02442-X•Introduction to Computational Biology: Maps, Sequences, Genomes, M.S.Waterman, Chapman & Hall, ISBN: 0412993910.•Biological Sequence Analysis: Probabilistic Models of proteins and Nucleic Acids, R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Cambridge University Press, ISBN: 0-521-62041•Bioinformatics: Methods and Protocols, S. Misener and S.A. Krawetz (editors), Humana Press, ISBN: 0-89603-732-0•Bioinformatics Basics: Applications in Biological Science and Medicine, H.H. Rashidi and L.K. Buehler, CRC Press, ISBN: 0-8493-2375-4•Introduction to Protein Structure?, C.Branden and J.Tooze, Garland Publishing Inc., ISBN: 0815302703.•Molecular Biotechnology: Principles and Applications of Recombinant DNA, B.R.Glick and J.JPasternak, ASM Press, ISBN: 1555811361.•Introduction to Proteins and Protein Engineering, B.Robson and J.Garnier, Elsevier Science Publishers, ISBN: 0444810471.•Computational Molecular Biology: An algorithmic approach, Pavel Pevzner, MIT Press, ISBN: 0262161974•Metabolic Engineering: Principles and Methodologies, G. Stephanopoulos, A. Aristidou and J. Nielsen, Academic Press, ISBN: 0-12-666260-6

Additional references of web-based material will be distributed to the students during the course.


60

DYNAMIC PROGRAMMING

Input Output

Cost

Decision

n-1S n-1

X n-1

R n-1

nS n+1 S n

X n

R n

iS i+1 S i

X i

R i

1S 2 S 1

X 1

R1

Solve a problem by using already computed solutions for smaller, but similar problems

S i = ti(S i+1, Xi)

R i = ri(S i+1, Xi)

Overall cost function: f(R1, R2, .. Rn)• Separability: f = Σ Ri• Monotonicity

fi*(S i+1) = opt[ Ri(S i+1, Xi) + f*i-1 (S i) ]Xi

Overall Optimization solved by• breaking down the problem to the sum of its parts using the recurrence relation:

Task: Optimize f

• keeping track of the optimum decision at each stage, Xi*

where * denotes optimum value


61

SEQUENCE DRIVEN PROBLEMS

• DNA Sequencing• Identification of similar prefixes and suffixes for given sequences

• Comparing two sequences: Homology searches in databases• Identification of similar substrings

• Multiple alignment of sequences: Homologies in related genes/proteins• Identification of short substrings/motifs

• Determination of introns and exons

• Construction of Phylogenetic trees

• Structure prediction from sequences• RNA secondary structure prediction• Protein folding

Application of DP to sequence problems


62

COMPARISON OF SEQUENCES

Cost Function, Ri• reward for exact match• penalty for mismatch• penalty for matching with a gap or a break in the sequence

Decision, Xi• Alignment with an element in the sequence• Alignment with a gap

Input,Si+1• Alignment of preceding prefix

Output, Si• Alignment of preceding prefix and current element


63

s = AAAC t = AGC

For the alignment of the s[1:i] and t[1:j], we have:Decision, X• Align s[1:i] with t[1:j-1], and match a space with t[j]• Align s[1: i -1] with t[1:j-1], and match s[i] with t[j]• Align s[1: i -1] with t[1:j], and match s[i] with a space

Cost, R• reward for exact match, pm• penalty for mismatch, mm• penalty for matching with a gap or a break in the sequence, sp

f*(s[1: i], t[1:j]) = maxf*(s[1:i], t[1:j-1]) - spf*( s[1: i -1], t[1:j-1]) + pm, if s[i] = t[j]f*( s[1: i -1], t[1:j-1]) - mm, if s[i] ≠ t[j]f*( s[1: i -1], t[1:j]) - sp

DP definition of the problem: Recursive

COMPARISON OF SEQUENCES


64

ELUCIDATION OF ALGORITHM

s = _AAAC t = _AGC

Consider 2nd element of s and t, and the decisions possible:

A

A

t:

s:

AA _

AA _

AA

Known fromprevious calculation -2+=

Known fromprevious calculation -2+=

Known fromprevious calculation 1+=

sp = -2, mm = -1, pm = 1

Basic calculation approach

Sequences can be expanded at will by adding more spaces


65

Matrix based calculation

Step 1: Initialization of s[1] and t[1]

Cost of alignment of _ and _ = 0

Step 2: Initialization of first row and column

i = 1, j = 2: _ A_

t:

s: Cost = 0 + -2 = -2

i = 1, j = 3: _ AG_

t:

s:

Cost = -2 + (-2) = -4

i = 1, j = 4: _ AGC_

t:

s:

Cost = -4 + -2 = -6

i = 2, j = 1:

0 -2 -4 -6

-8

-2

-4

-6

A G C

A

A

A

C

Keeping track of optimum decision

s = _AAAC t = _AGC



66

Step 3(0): Apply recursive optimization

i = 2, j = 2 Compare _A_A

0 -2 -4 -6

-8

-2

-4

-6

A G C

A

A

A

C

1 Possible Decisions:

_ A __ A

_ A_A _

_ A_ A

Cost:

_ A_

_A

+ = -2 + -2 = -4

__

AA

+ = 0 + 1 = 1

__A

A_

+ = -2 + -2 = -4

t:

s:

t:

s:

t:

s:

Optimum



67

Step 3(1): Apply recursive optimization

i = 2, j = 3 Compare _AG_A

0 -2 -4 -6

-8

-2

-4

-6

A G C

A

A

A

C

1 Possible Decisions:

_ AG __ A

_ AG_ A _

_ AG_ A

Cost:

_ AG_

_A

+ = -4 + -2 = -6

_A_

GA

+ = -2+ -1 = -3

_A_A

G_

+ = 1 + -2 = -1

t:

s:

t:

s:

t:

s: Optimum

-1

And so on, and so forth ...



68

s = AAAC t = AGC

0 -2 -4 -6

-8

-2

-4

-6

1 -1 -3

-1 0 -2

-3 -2 -1

-5 -4 -1

A G C

A

A

A

C

Arrows keep track of optimal decision

Reconstruction of optimal alignment doneby retracing the arrows from the last corner

Bidimensional array for computing optimal alignments

ALGORITHM: SUMMARY


69

Parameters (allow for modification of the basic algorithm):

• Initialization • Recurrence scoring relation• Start point in sequence reconstruction • Scoring of spaces• Multiple sequence alignment

Other modifications:• Cost function R can be a matrix, with mm and pm being vectors• Gap penalty can be a function instead of a fixed value

MODIFICATIONS TO THE BASIC ALGORITHM

Allows for Global/ Local Alignment

Allows for Semiglobal Alignment


70

RNA SECONDARY STRUCTURE PREDICTION

Cost function: Free Energy of structure

Dynamic programming applied under certain assumptions:• Free energy of structure is the sum of its parts

Implies - independent base pairs, or- consider adjacent base pairs, but ignore free energies of bases not

part of any loop

•Ignore the prediction of knots in the structure

• Only certain types of loops allowed

Under these assumptions, and with an appropriate cost function, a DPbased algorithm can be used to compute the optimal structure

Download - Lecture 1 - MITweb.mit.edu/10.555/Spring2004/notes/L01-2004... · 2004-02-03 · 10.555-Bioinformatics Lecture 1: Introduction and General Overview 5 Drivers of change (1) zGenomics:

Top Related