10.555-Bioinformatics Lecture 1: Introduction and General Overview
10.555 Bioinformatics: Principles, Methods and Applications
MIT, Spring term, 2004
Lecture 1:
Introduction to the course. Definitions, perspectives, general overview.
Rudiments of dynamic programming
Course goals-requirements
Goals
• Not code writing
• Not algorithm optimization
• Provide appreciation-familiarity-expertise with available computational tools of sequence analysis, pattern discovery, analysis of cell physiology
• Define important problems in bioinformatics-systems biology
• Suggest process for problem solving by illustrative case studies involving important biological-physiological problems
Requirements
• Advanced course
• None formally
• Applied probability-statistics most important background
More information at:
MIT Class: http://web.mit.edu/cheme/gnswebpage/index.shtml
Look for course 10.555
We have entered a period of rapid change
… For the times, they are a changin’
Bob Dylan
How is phenotype controlled by the genes?
Nobody knows, least of all the machines.
Medicine will thrive if we can discover the means
To merge our knowledge and information
And find genes’ intent and control by environment
For the times, they are a changin’.
Jay Bailey
Drivers of change (1)
Genomics: Sequencing and study of genomes
More than 100 (150) genome species sequenced: Comparative studies
All industrially important organisms will be sequenced within the next 1-2 years
Impact of Genomics:
• Generation of data about the cellular phenotype: Link with genotype (expression)
• Need of integration in a meaningful framework
• Use of these data in discovery
Drivers of change (2)
Heavy investment in the life sciences
Excess of $200B in the USA
Mostly health-oriented, however, equally applicable to other fields
Applied molecular biology
• Construct optimal genetic background
• Controls at the genetic level
• Optimal cell-bioreactor combinations
Drivers of change (3)
Enormous opportunities
• Medicine
• Chemicals (cell factories)
• Materials (special properties, PHA’s, PHB)
• Pharmaceuticals (chirality)
• Fuels (bio-diesel), ethanol, CO2, methane-methanol
• Environmental
• Plants (resistance, oils, enrichment, polymers)
Drivers of Bioinformatics (1)
Genomics
Drivers of Bioinformatics (2)
Technologies probing the expression, proteomic and metabolic phenotype, generating large volumes of data
DNA microarrays (gene chips)
2-D gel electrophoresis
Protein chips or HPLC methods combined with MALDI/SELDI-TOF mass spectrometry for protein characterization
Isotope label distributions
[Figure panels: 0.4% Pine-Sol, mid-exponential phase; 0.4% Pine-Sol, early stationary phase; control, early stationary phase; same culture +3 hours]
Example of 2-D gels
2-Dimensional Liquid Chromatography
[Figure: peptides are injected and separated in two dimensions (1. ion exchange, 2. size exclusion); the separated fractions go to MS]
[Figure: metabolic flux map with numerical flux values on each reaction. Labeled metabolites include GLC, GLC6P, Fru6P, GAP, G3P, Ribu5P, Sed7P, E4P, PEP, PYR, AcCoA, Acetate, Isocit, aKG, SucCoA, Suc, OAA, Asp, Glt, Glum, Ala, Val, Lac, H4D, Meso-DAP, Lysine, Trehalose, CO2 and NH3]
Fluxes estimated from isotopomer balancing (locally optimal solution - value of objective: 0.021)
Bioinformatics
The process and methods applied to the upgrade of the information
content of biological measurements
More specifically: The utilization of sequence, expression, proteomic and physiological data to identify characteristic patterns of disease and elucidate mechanisms of gene regulation, signal transduction, flux control and overall cell physiology
Probing cellular function
[Diagram: Environment, DNA, mRNA, Proteins, Fluxes and Signal Transduction, with the probing technologies DNA microarrays, Proteomics and Metabolic Flux Analysis]
Central Goal
Genome sequences and technologies generating large volumes of data
• DNA microarrays (gene chips)
• Proteomics
• Small metabolite measurements
• Isotope labeling
Challenge: Utilize genomic information and above data to elucidate gene regulation, metabolic flux control, and overall cell physiology
Importance: Elucidate genetic regulation and flux controls in their entirety instead of one at a time
Bioinformatics course
Structured to address these driving forces.
It comprises three basic units:
I. Genomics: Sequence-driven problems
II. Data-driven problems
III. Functional Genomics and Issues of Systems Biology
I. Sequence-driven problems
Central theme: Sequence comparisons
• Metrics of comparison
• Scoring matrices
• Score evaluation
• Comparison by alignment vs. comparison by entries of bio-dictionaries (Teiresias)
Emphasis
• Computer Science
• Applied probability
• Statistics
• Comparative genomics
I. Sequence-driven problems
GENOMICS
• Fragment reassembly for chromosome sequence
• Identify Open Reading Frames
• Identify gene splicing sites (introns)
• Gene regulation (using expression and other data)
• Gene annotation (inter-genomic comparisons)
• Determine sequence patterns of regulatory sites. Important in understanding gene regulation and expression characteristics of tissues, disease, phase of development, etc.
I. Sequence-driven problems (cont’d)
PROTEOMICS
• Single, multiple protein alignment (homology)
• Identification of functional domains in protein sequences
• Sequence-structure, sequence-function relationships (structural bioinformatics)
• Pattern discovery (phylogeny, remote associations)
• Framework for the analysis of signaling networks
  • Pathway interactions
  • Effect of pathway convergence-divergence on signaling
  • Controls of signaling (kinetic, regulatory)
  • Integration
II. Data-driven problems
Central theme: Information upgrade
Transcriptional data (microarrays). Use to:
• Identify discriminatory genes
• Define clusters of co-expressed genes
• Correlations of gene expression
• Sample classification
  • Disease diagnosis
  • Therapy prognosis
  • Evaluation of side effects
Data-driven problems (cont’d)
Metabolic rate-isotopic tracer data. Use to:
• Determine fluxes
• Study flux control (metabolic engineering)
Mass Spectrometry data of proteins. Use to:
• Identify proteins and protein fragments
• Protein quantification
III. Problems of Systems Biology-Physiology-Functional Genomics
Central theme: Interactions of cellular parts
• Networks
• Associations of data
Methods:
• Network analysis (balances)
• Projections (PCA, FDA, etc.)
• Correlations (PLS, PCR, etc.)
• Pattern Discovery in data
Interesting problems
Microarray data analysis
• Identify discriminatory genes
• Decipher gene regulation
• Elucidate gene regulatory networks
• Use transcriptional data to discover characteristic patterns of gene expression for different tissue types, disease, etc.
Example: Use microarray expression data to discover discriminatory genes
[Figure: autoscaled expression level (y-axis, -2 to 2)]
Interesting problems: Decipher structure of gene regulation: Modified lac operon
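[Diagram: modified lac operon. Elements shown: σ1, σ2, σ3, P1, cap, lacI, lacP, lacO, lacZ, lacY, lacA, mRNA, Lac, CAP and CAP-cAMP, with positive (+) and negative (-) regulatory interactions and inputs from rich medium and growth]
Study effect of:
• Maltose, Glucose, Lactose
• Growth phase
• Rich Medium, O2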
Expected profiles in differential gene expression experiments
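[Figure: expected expression profiles. Rows (genes): σ3, σ1, σ2, cap1, lacI, lacZ, lacY, lacA, cap2, malI, malZ, malY, malA. Columns (conditions): combinations of Glc, Lac, Mal and rich medium under growth, no growth and stationary phase]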
Example of gene chip data
Deciphering microarray data (cont’d)
Step 2: Determine structure of interactions
      P1  P2  P3  P4  GC1  GC2  GC3  GC4  GC5  ...
C1    H   L   L   L   I    R    0    I    I    ...
C2    H   H   H   L   I    R    R    I    0    ...
C3    H   H   L   H   R    R    0    I    R    ...
Look for most generic patterns:
W W H W I R I W W I R
L W H W W I I R W W R
W W W H W W I R R W I
Apply: Decision trees, Teiresias, pattern discovery
Deciphering microarray data (cont’d)
Simulation results with lac operon: Success rate of pattern discovery
[Figure: % of patterns discovered (y-axis, 0 to 100) versus % of possible combinations considered in parameter space (x-axis, 0 to 100)]
Interesting problems of Systems Biology-Physiology-Functional Genomics
Genetic Regulatory Networks
• Apply time-lagged correlations in transcriptional data
• Apply pattern discovery in upstream sequence regions of co-expressed genes
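As a rough illustration of the time-lagged correlation idea, the sketch below correlates one gene's profile with a second profile shifted by a number of samples and reports the lag with the strongest correlation. The function names and the two expression profiles are hypothetical and are not taken from the course data; Pearson correlation is assumed as the measure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lagged_correlation(regulator, target, lag):
    """Correlate regulator[t] with target[t + lag] (lag >= 0, in samples)."""
    if lag == 0:
        return pearson(regulator, target)
    return pearson(regulator[:-lag], target[lag:])

def best_lag(regulator, target, max_lag):
    """Return the lag (0..max_lag) with the largest absolute correlation."""
    return max(range(max_lag + 1),
               key=lambda k: abs(lagged_correlation(regulator, target, k)))

# Hypothetical profiles: gene_b repeats gene_a two samples later.
gene_a = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 1.5, 2.5, 1.0, 0.0]
gene_b = [0.2, 0.1] + gene_a[:-2]
print(best_lag(gene_a, gene_b, max_lag=3))
```

A strong correlation at a positive lag is only a hint that one gene may act upstream of the other; it does not establish regulation by itself.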
Example: Defining the physiological state from the expression phenotype
[Figure 1: A730 versus time (hrs), with periods of full light, full dark (23 hrs. and 24 hrs.), transient light (10 min. samples) and transient dark (10 min. samples) marked]
Example of gene chip data
Example (cont’d)
[Figure: time course with y-axis from -2.5 to 2.5 and x-axis Time (min.) from 0 to 200]
Dimensionality reduction of the expression phenotype space by CDA (10 minute time delay)
[Figure: samples FL, D23, D24, L10 to L100 and D10 to D100 plotted in the reduced CDA plane]
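[Figure 5: B-ALL, T-ALL and AML samples plotted in the (CV1, CV2) plane, with the regions fBT > 0 (B-ALL), fTM > 0 (T-ALL) and AML separated by the boundaries fMB, fMT and fBT]
Decision boundaries
fBT: between B-ALL and T-ALL
fMT: between T-ALL and AML
fMB: between AML and B-ALL
$$y = V^{T} x$$
$$f_{mk}:\ (\bar{y}_m - \bar{y}_k)^{T}\, y - \tfrac{1}{2}\,(\bar{y}_m - \bar{y}_k)^{T}(\bar{y}_m + \bar{y}_k) \ge 0$$
Figure 5: Physiological state domains are defined by a series of decision boundaries for the projected values and group means in the reduced CV plane.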
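A minimal sketch of how such decision boundaries could be evaluated, assuming the boundary between groups m and k is the linear function f_mk(y) = (ȳ_m − ȳ_k)ᵀ y − ½ (ȳ_m − ȳ_k)ᵀ (ȳ_m + ȳ_k) of the projected point y and the projected group means. The group means below are invented for illustration and are not read from the figure.

```python
def boundary(y, mean_m, mean_k):
    """f_mk(y) = (ym - yk)^T y - 0.5 (ym - yk)^T (ym + yk); >= 0 favors group m."""
    diff = [a - b for a, b in zip(mean_m, mean_k)]
    mid = [(a + b) / 2 for a, b in zip(mean_m, mean_k)]
    return sum(d * (v - c) for d, v, c in zip(diff, y, mid))

def classify(y, group_means):
    """Return the group m whose boundaries f_mk(y) are >= 0 for every other k."""
    for m, mean_m in group_means.items():
        if all(boundary(y, mean_m, mean_k) >= 0
               for k, mean_k in group_means.items() if k != m):
            return m
    return None

# Hypothetical group means in the (CV1, CV2) plane.
means = {"B-ALL": (-2.0, 0.0), "T-ALL": (1.0, 1.5), "AML": (1.0, -1.5)}
print(classify((-1.5, 0.2), means))
```

Under this reading, f_mk(y) >= 0 is equivalent to y lying at least as close to the mean of group m as to the mean of group k, so the three boundaries partition the plane into one domain per group.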
Interesting problems (cont’d)
METABOLISM (Metabolic Engineering)
• Framework for the analysis of metabolic networks
  • Pathway interactions
  • Controls of flux (kinetic, regulatory)
  • Integration
• Determination of in vivo metabolic fluxes from metabolite and isotopomer balances
• Determine rate controlling steps in metabolic networks to achieve directed flux amplification
• Establish link between the expression phenotype and the metabolic phenotype as determined by the fluxes
• Discover patterns in fluxes
[Figure: metabolic flux map with numerical flux values on each reaction. Labeled metabolites include GLC, GLC6P, Fru6P, GAP, G3P, Ribu5P, Sed7P, E4P, PEP, PYR, AcCoA, Acetate, Isocit, aKG, SucCoA, Suc, OAA, Asp, Glt, Glum, Ala, Val, Lac, H4D, Meso-DAP, Lysine, Trehalose, CO2 and NH3]
Fluxes estimated from isotopomer balancing (locally optimal solution - value of objective: 0.021)
Interesting problems: Signal transduction pathways
[Diagram: receptor-ligand binding feeds signaling pathways that lead to gene induction in the nucleus and to metabolic enzyme regulation. Cascade species shown: Input, E1, E2; MAPKKK, MAPKKK*; MAPKK, MAPKK-P, MAPKK-PP, MAPKK P'ase; MAPK, MAPK-P, MAPK-PP, MAPK P'ase; Output]
Analysis of structure and interactions of signal transduction pathways
Problems of Systems Biology-Physiology-Functional Genomics
Identification of genes that collectively
• Correlate with a phenotype
• Are critical for important physiological properties
PHA/PHB biosynthesis genes
[Diagram: PHB biosynthesis pathway. 2 Acetyl-CoA → Acetoacetyl-CoA + HS-CoA (thiolase, phaA); Acetoacetyl-CoA + NADPH + H+ → Hydroxybutyryl-CoA + NADP+ (reductase, phaB); Hydroxybutyryl-CoA → PHB + HS-CoA (synthase, phaE phaC or phaC)]
Drug Discovery (cont’d)
[Figure: PHB synthesis rate (µmole/g cdw/hr), y-axis up to 20, for WT and AM50]
PHA/PHB biosynthesis genes
[Diagram: PHB biosynthesis pathway. 2 Acetyl-CoA → Acetoacetyl-CoA + HS-CoA (thiolase, phaA); Acetoacetyl-CoA + NADPH + H+ → Hydroxybutyryl-CoA + NADP+ (reductase, phaB); Hydroxybutyryl-CoA → PHB + HS-CoA (synthase, phaE phaC or phaC)]
Regulation of PHA biosynthesis in Synechocystis sp.
[Diagram: Acetate, Acetyl-phosphate and Acetyl-CoA, linked by acetate kinase (ackA), phosphotransacetylase (pta) and acetyl-CoA synthetase (acs); PHB synthase (inactive) is converted to PHB synthase (active), which forms PHB from Acetyl-CoA]
Acetyl phosphate dependent activation of the Synechocystis sp. PHA synthase
[Figure: PHA synthase activity (nmol min-1 mg protein-1, 0 to 5000) versus acetyl phosphate (mM, 0 to 12)]
Dimensionality reduction by Fisher Discriminant Analysis
[Figure 2, panels (A), (B), (C): samples BGA, 10N, 5AM, 10PA, BG11 and 0.3NA in the projected space, and PHB (%DCW) for the same samples]
Projections:
CV1 = a1g1 + a2g2 + a3g3 + … + angn
CV2 = b1g1 + b2g2 + b3g3 + … + bngn
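The projections are plain weighted sums of the gene expression values g1 … gn. A minimal sketch, with made-up loadings and a made-up expression profile for n = 4 genes (in practice the loadings would come from the discriminant analysis itself, which is not shown here):

```python
def project(expression, loadings):
    """Linear projection: sum over genes of loading_i * expression_i."""
    if len(expression) != len(loadings):
        raise ValueError("expression and loadings must have the same length")
    return sum(a * g for a, g in zip(loadings, expression))

# Hypothetical loadings (a_i for CV1, b_i for CV2) and one expression profile g.
a = [0.5, -0.5, 0.0, 1.0]
b = [0.0, 1.0, -1.0, 0.5]
g = [2.0, 1.0, 3.0, -1.0]

cv1, cv2 = project(g, a), project(g, b)
print(cv1, cv2)
```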
Plotting the loadings (i.e., the coefficients of linear projection)
[Figure 5: gene loadings plotted on CV1 (-0.2 to 0.2) versus CV2 (-0.5 to 0.3). Labeled genes: ssr2194, slr1895, cpcB, ndhG, glmS, psbA2, sll1369, slr1859, slr1800, sll0048, sll0611, sll0423, sll0442, sll0036, ssr2062, ssl2384, sll0267, sll0174, ssr0756, apcE, rbcL, slr0284, sll0395, smr0001, slr0639, slr1820, slr1902, bcp, ssr1114, slr1807]
Correlational structure of PHB accumulation
[Figure 2: PHB (%DCW) versus CV2, R2 = 0.879]
PHB = c1g1 + c2g2 + c3g3 + … + cngn
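To make the R2 value concrete, the sketch below fits a straight line of PHB against a projected variable by ordinary least squares and computes the coefficient of determination. The five data points are invented for illustration; they are not the values behind the reported R2 = 0.879.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical CV2 scores and PHB contents (%DCW) for five samples.
cv2_scores = [-10.0, -5.0, 0.0, 5.0, 10.0]
phb = [1.0, 4.0, 8.0, 11.0, 16.0]

slope, intercept, r2 = fit_line(cv2_scores, phb)
print(slope, intercept, r2)
```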
Linking the Metabolic and Expression Phenotypes
[Diagram: the Metabolic Phenotype (M1, M2, ..., Mn) and the Expression Phenotype (G1, G2, ..., Gm) are each mapped onto the MACROSCOPIC PHENOTYPE; the mapped vectors are shown with starred entries (*M1, *M2, ..., *Mn and *G1, *G2, ..., *Gm)]
Mapping of the Metabolic Phenotype onto the Macroscopic Phenotype
Mapping of the Expression Phenotype onto the Macroscopic Phenotype
Bioinformatics
The process and methods applied to the upgrade of the information
content of biological measurements
More specifically: The utilization of sequence, expression, proteomic and physiological data to identify characteristic patterns of disease and elucidate mechanisms of gene regulation, signal transduction, flux control and overall cell physiology
In summary
Bioinformatics is driven by genomics and data
These driving forces define three essential units (or types of problems) of instruction:
• Sequence-driven problems
• Data-driven problems
• Problems of physiology or systems biology
While initially focused on sequence analysis, the future belongs to successful linkage of genomics and physiology
This is a young field very much in flux with many problems in search of good solutions
Spring Semester 2004
10.555 Bioinformatics: Principles, Methods and Applications
Instructors
Gregory Stephanopoulos and Isidore Rigoutsos
Special lectures by Dr. Joanne Kelleher
Assistant-Grading: Adrian Fay, [email protected]
9 units (H), Class meets Tuesday 2-5pm, Room 66-154
COURSE SYLLABUS
Part I: INTRODUCTION
Lecture 1: February 3
- Historical perspectives, definitions
- Impact of genomics on problems in molecular and cellular biology; need for integration and quantification, contributions of engineering
- Overview of problems. Sequence driven and data driven problems
- Rudiments of dynamic programming with applications to sequence analysis
- Overview of course methods
- Integrating cell-wide data, broader issues of physiology
Lecture 2: February 10 (Assignment 1, due February 24)
- Primer on probabilities, inference, estimation, Bayes theorem
- Dynamic programming, application to sequence alignment, Markov chains, HMM
Part II: SEQUENCE DRIVEN PROBLEMS
No class on February 17 (Monday schedule, Presidents' Day)
Lecture 3: February 24 (Assignment 2, due on March 2)
- Primer on Biology (the units, the code, the process, transcription, translation, central dogma, genes, gene expression and control, replication, recombination and repair)
- Data generation and storage
- Schemes for gene finding in prokaryotes/eukaryotes
- Primer on databases on the web
- Primer on web engines
Lecture 4: March 2 (Assignment 3, due March 9)
- Some useful computer science (notation, recursion, essential algorithms on sets, trees and graphs, computational complexity)
- Physical mapping algorithms
- Fragment assembly algorithms
- Comparison of two sequences
- Dynamic programming revisited
- Popular algorithms: Smith-Waterman, Blast, Psi-blast, Fasta
Lecture 5: March 9 (Assignment 4, due on March 16)
- Building and using scoring matrices
- Multiple sequence alignment
- Functional annotation of sequences
- Micro RNA, Antimicrobial peptides
Lecture 6: March 16 (Assignment 5, due March 30)
- Pattern discovery in biological sequences
- Protein motifs, profiles, family representations, tandem repeats, multiple sequence alignment and sequence comparison through pattern discovery
- Functional annotation
- Promoter site recognition
No class on March 23: Spring Break
Lecture 7: March 30 (Assignment 6, due on April 13): Physiology
- Primer on cell physiology. Definition at the macroscopic level
- Molecular cell physiology. Interactions of pathways, cells, organs. Measurements and their integration
- Distribution of kinetic control. Rudiments of metabolic control analysis (MCA)
PART III: UPGRADING EXPRESSION AND METABOLIC DATA
Lecture 8: April 6: Fluxes
- MCA continued. Analysis of metabolic pathways, fluxes, the metabolic phenotype
- Methods for metabolic flux determination
- Methods for underdetermined systems
Lecture 9: April 13 (Assignment 7, due on April 27): Microarrays
- Monitoring gene expression levels. DNA microarrays
- Data collection, error analysis, normalization, filtering
- Novel applications of DNA microarrays
- Analysis of gene expression data
- Clustering methods: Coordinated gene expression
- Identification of discriminatory genes
- Determination of discriminatory gene expression patterns. Use in diagnosis
- Data visualization. Reconstruction of genetic regulatory networks
No class on April 20: Patriots' Day
Lecture 10: April 27: Linkage
- Linking the metabolic and expression phenotypes
- Quantitative predictive models of cell physiology from gene expression
- Linkage by partial least squares
- Systems biology
Lecture 11: May 4
- Signaling and signal transduction pathways
- Measurements in signaling networks
- Integrated analysis of signal transduction networks
Lecture 12: May 18
- Putting it all together
- Project presentations
HOMEWORKS
There will be seven Problem Sets on the methodologies and computational algorithms covered in the course, as follows:
Problem Set 1: Material of Lectures 1, 2
Problem Set 2: Material of Lecture 3
Problem Set 3: Material of Lecture 4
Problem Set 4: Material of Lecture 5
Problem Set 5: Material of Lecture 6
Problem Set 6: Material of Lectures 7, 8
Problem Set 7: Material of Lectures 9, 10
PROJECTS
The students, in groups of 2-3, will carry out a project on a course-related subject of their own choosing, or from a list of suggested topics. The groups must be formed and topics selected by April 6, 2004. An oral presentation of the project by the group members will take place on May 18, 2004, at which time the final report on the project will also be due.
GRADE
There will be no mid-term or final exams. The grade in the course will be based on the homeworks, the group project, and the oral presentation, with the following weights: Homeworks (40 %); Written project report (35 %); Oral presentation (25 %)
CLASS NOTES and REFERENCES
Lecture notes will be posted on the web. Additionally, the course will draw from a series of published papers and the following list of books, which are recommended as references:
References
• Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, ISBN: 0521585198
• Fundamental Concepts of Bioinformatics, D.E. Krane and M.L. Raymer, ISBN: 0-8053-4633-3
• Introduction to Probability, D.P. Bertsekas and J.N. Tsitsiklis, ISBN: 1-886529-40-X
• Genetics, a Molecular Approach, T.A. Brown, Chapman & Hall, ISBN: 0412447304
• Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS Publishing Company, ISBN: 0534952623
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, A.D. Baxevanis and B.F.F. Ouellette, Wiley-Interscience, ISBN: 0471191965
• Bioinformatics: The Machine Learning Approach, P. Baldi and S. Brunak, MIT Press, ISBN: 0-262-02442-X
• Introduction to Computational Biology: Maps, Sequences, Genomes, M.S. Waterman, Chapman & Hall, ISBN: 0412993910
• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Cambridge University Press, ISBN: 0-521-62041
• Bioinformatics: Methods and Protocols, S. Misener and S.A. Krawetz (editors), Humana Press, ISBN: 0-89603-732-0
• Bioinformatics Basics: Applications in Biological Science and Medicine, H.H. Rashidi and L.K. Buehler, CRC Press, ISBN: 0-8493-2375-4
• Introduction to Protein Structure, C. Branden and J. Tooze, Garland Publishing Inc., ISBN: 0815302703
• Molecular Biotechnology: Principles and Applications of Recombinant DNA, B.R. Glick and J.J. Pasternak, ASM Press, ISBN: 1555811361
• Introduction to Proteins and Protein Engineering, B. Robson and J. Garnier, Elsevier Science Publishers, ISBN: 0444810471
• Computational Molecular Biology: An Algorithmic Approach, Pavel Pevzner, MIT Press, ISBN: 0262161974
• Metabolic Engineering: Principles and Methodologies, G. Stephanopoulos, A. Aristidou and J. Nielsen, Academic Press, ISBN: 0-12-666260-6
Additional references of web-based material will be distributed to the students during the course.
DYNAMIC PROGRAMMING
Solve a problem by using already computed solutions for smaller, but similar problems
[Diagram: a chain of stages n, n-1, ..., i, ..., 1. Stage i receives the input S_{i+1} and the decision X_i, and produces the output S_i and the cost R_i]
$S_i = t_i(S_{i+1}, X_i)$
$R_i = r_i(S_{i+1}, X_i)$
Task: Optimize f
Overall cost function: $f(R_1, R_2, \ldots, R_n)$
• Separability: $f = \sum_i R_i$
• Monotonicity
Overall optimization solved by
• breaking down the problem to the sum of its parts using the recurrence relation:
$f_i^*(S_{i+1}) = \operatorname{opt}_{X_i} \left[ R_i(S_{i+1}, X_i) + f_{i-1}^*(S_i) \right]$
• keeping track of the optimum decision at each stage, $X_i^*$
where * denotes optimum value
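The staged recurrence can be sketched generically as follows, taking "opt" to mean maximization and assuming a separable objective (the sum of the stage costs). The function names (`solve`, `decisions`, `transition`, `reward`) and the small allocation example are hypothetical, chosen only to show the recurrence and the bookkeeping of the optimal decisions.

```python
from functools import lru_cache

def solve(n, start_state, decisions, transition, reward):
    """Maximize the sum of stage rewards over stages n, n-1, ..., 1.

    Implements f_i*(S) = max over X of [ r_i(S, X) + f_{i-1}*(t_i(S, X)) ],
    with f_0* = 0, and returns (optimal value, tuple of optimal decisions).
    """
    @lru_cache(maxsize=None)
    def best(i, state):
        if i == 0:
            return 0, ()
        options = []
        for x in decisions(i, state):
            value, plan = best(i - 1, transition(i, state, x))
            options.append((reward(i, state, x) + value, (x,) + plan))
        return max(options)
    return best(n, start_state)

# Hypothetical example: split 2 units of a resource across 3 stages.
# rewards[i][x] is the payoff of spending x units at stage i.
rewards = {3: [0, 3, 4], 2: [0, 2, 5], 1: [0, 1, 6]}

value, plan = solve(
    n=3,
    start_state=2,                               # units available before stage 3
    decisions=lambda i, s: range(s + 1),         # spend 0..s units
    transition=lambda i, s, x: s - x,            # units left for the next stage
    reward=lambda i, s, x: rewards[i][x],
)
print(value, plan)
```

The memoization via `lru_cache` is what reuses "already computed solutions for smaller, but similar problems": each (stage, state) pair is solved once.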
SEQUENCE DRIVEN PROBLEMS
Application of DP to sequence problems
• DNA Sequencing
  • Identification of similar prefixes and suffixes for given sequences
• Comparing two sequences: Homology searches in databases
  • Identification of similar substrings
• Multiple alignment of sequences: Homologies in related genes/proteins
  • Identification of short substrings/motifs
• Determination of introns and exons
• Construction of Phylogenetic trees
• Structure prediction from sequences
  • RNA secondary structure prediction
  • Protein folding
COMPARISON OF SEQUENCES
Cost Function, Ri
• reward for exact match
• penalty for mismatch
• penalty for matching with a gap or a break in the sequence
Decision, Xi
• Alignment with an element in the sequence
• Alignment with a gap
Input, Si+1
• Alignment of preceding prefix
Output, Si
• Alignment of preceding prefix and current element
COMPARISON OF SEQUENCES
DP definition of the problem: Recursive
s = AAAC t = AGC
For the alignment of s[1:i] and t[1:j], we have:
Decision, X
• Align s[1:i] with t[1:j-1], and match a space with t[j]
• Align s[1:i-1] with t[1:j-1], and match s[i] with t[j]
• Align s[1:i-1] with t[1:j], and match s[i] with a space
Cost, R
• reward for exact match, pm
• penalty for mismatch, mm
• penalty for matching with a gap or a break in the sequence, sp
$$
f^*(s[1:i], t[1:j]) = \max
\begin{cases}
f^*(s[1:i], t[1:j-1]) - sp \\
f^*(s[1:i-1], t[1:j-1]) + pm, & \text{if } s[i] = t[j] \\
f^*(s[1:i-1], t[1:j-1]) - mm, & \text{if } s[i] \neq t[j] \\
f^*(s[1:i-1], t[1:j]) - sp
\end{cases}
$$
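The recursive definition translates almost directly into a memoized function. This is a minimal sketch: the name `align_score` is illustrative, pm, mm and sp are treated as positive magnitudes (the reward is added, the penalties subtracted), and the base cases for an empty prefix (only spaces, each costing sp) are an assumption taken from the usual initialization.

```python
from functools import lru_cache

def align_score(s, t, pm=1, mm=1, sp=2):
    """Optimal global alignment score of s and t under the recurrence above."""
    @lru_cache(maxsize=None)
    def f(i, j):
        # f(i, j) is the best score for aligning s[:i] with t[:j].
        if i == 0:
            return -sp * j          # t[:j] aligned against spaces only
        if j == 0:
            return -sp * i          # s[:i] aligned against spaces only
        diag = f(i - 1, j - 1) + (pm if s[i - 1] == t[j - 1] else -mm)
        return max(f(i, j - 1) - sp,    # space matched with t[j]
                   diag,                # s[i] matched with t[j]
                   f(i - 1, j) - sp)    # s[i] matched with a space
    return f(len(s), len(t))

print(align_score("AAAC", "AGC"))
```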
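ELUCIDATION OF ALGORITHM
Basic calculation approach
s = _AAAC t = _AGC
Sequences can be expanded at will by adding more spaces
Consider 2nd element of s and t, and the decisions possible:
• Match the A of s with a space: score = (known from previous calculation) + (-2)
• Match the A of t with a space: score = (known from previous calculation) + (-2)
• Match the A of s with the A of t: score = (known from previous calculation) + 1
Scores used: space = -2, mismatch = -1, match = +1 (that is, the penalties sp = 2 and mm = 1 are subtracted and the reward pm = 1 is added)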
ELUCIDATION OF ALGORITHM
Matrix based calculation
s = _AAAC t = _AGC
Step 1: Initialization of s[1] and t[1]
Cost of alignment of _ and _ = 0
Step 2: Initialization of first row and column
i = 1, j = 2: t = _A, s = _ : Cost = 0 + (-2) = -2
i = 1, j = 3: t = _AG, s = _ : Cost = -2 + (-2) = -4
i = 1, j = 4: t = _AGC, s = _ : Cost = -4 + (-2) = -6
i = 2, j = 1:
        _    A    G    C
   _    0   -2   -4   -6
   A   -2
   A   -4
   A   -6
   C   -8
Keeping track of optimum decision
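A small sketch of the initialization steps, assuming a space penalty of 2 per position; cells that have not been computed yet are left as `None`. The function name `init_matrix` is illustrative.

```python
def init_matrix(s, t, sp=2):
    """Score matrix with only the first row and column filled in.

    Row 0 / column 0 correspond to the leading '_' of each sequence.
    """
    rows, cols = len(s) + 1, len(t) + 1
    score = [[None] * cols for _ in range(rows)]
    score[0][0] = 0                                  # '_' aligned with '_'
    for j in range(1, cols):                         # t prefix against spaces
        score[0][j] = score[0][j - 1] - sp
    for i in range(1, rows):                         # s prefix against spaces
        score[i][0] = score[i - 1][0] - sp
    return score

matrix = init_matrix("AAAC", "AGC")
for row in matrix:
    print(row)
```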
ELUCIDATION OF ALGORITHM
Step 3(0): Apply recursive optimization
i = 2, j = 2: Compare t = _A with s = _A
Possible decisions and costs:
• s[2] = A matched with a space (prefixes t = _A, s = _): -2 + (-2) = -4
• s[2] = A matched with t[2] = A (prefixes t = _, s = _): 0 + 1 = 1 (Optimum)
• t[2] = A matched with a space (prefixes t = _, s = _A): -2 + (-2) = -4
        _    A    G    C
   _    0   -2   -4   -6
   A   -2    1
   A   -4
   A   -6
   C   -8
ELUCIDATION OF ALGORITHM
Step 3(1): Apply recursive optimization
i = 2, j = 3: Compare t = _AG with s = _A
Possible decisions and costs:
• s[2] = A matched with a space (prefixes t = _AG, s = _): -4 + (-2) = -6
• s[2] = A matched with t[3] = G, a mismatch (prefixes t = _A, s = _): -2 + (-1) = -3
• t[3] = G matched with a space (prefixes t = _A, s = _A): 1 + (-2) = -1 (Optimum)
        _    A    G    C
   _    0   -2   -4   -6
   A   -2    1   -1
   A   -4
   A   -6
   C   -8
And so on, and so forth ...
ALGORITHM: SUMMARY
s = AAAC t = AGC
Bidimensional array for computing optimal alignments
        _    A    G    C
   _    0   -2   -4   -6
   A   -2    1   -1   -3
   A   -4   -1    0   -2
   A   -6   -3   -2   -1
   C   -8   -5   -4   -1
Arrows keep track of optimal decision
Reconstruction of optimal alignment done by retracing the arrows from the last corner
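Putting the pieces together, a compact sketch of the whole procedure: fill the score array, record the decision taken in each cell (the "arrows"), and retrace them from the last corner. The function name `global_align` and the tie-breaking order (diagonal, then up, then left) are choices made for this illustration; other tie-breaking rules yield other alignments with the same optimal score.

```python
def global_align(s, t, pm=1, mm=1, sp=2):
    """Fill the score array and retrace one optimal global alignment."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]       # the "arrows"
    for j in range(1, cols):
        score[0][j], move[0][j] = score[0][j - 1] - sp, "left"
    for i in range(1, rows):
        score[i][0], move[i][0] = score[i - 1][0] - sp, "up"
    for i in range(1, rows):
        for j in range(1, cols):
            sub = pm if s[i - 1] == t[j - 1] else -mm
            options = [
                (score[i - 1][j - 1] + sub, "diag"),  # s[i] matched with t[j]
                (score[i - 1][j] - sp, "up"),         # s[i] matched with a space
                (score[i][j - 1] - sp, "left"),       # t[j] matched with a space
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Retrace the arrows from the last corner back to the origin.
    i, j, top, bottom = len(s), len(t), [], []
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            i, j = i - 1, j - 1
            top.append(s[i])
            bottom.append(t[j])
        elif step == "up":
            i -= 1
            top.append(s[i])
            bottom.append("_")
        else:
            j -= 1
            top.append("_")
            bottom.append(t[j])
    return score, "".join(reversed(top)), "".join(reversed(bottom))

score, aligned_s, aligned_t = global_align("AAAC", "AGC")
for row in score:
    print(row)
print(aligned_s)
print(aligned_t)
```

For s = AAAC and t = AGC the filled array reproduces the table above, and the value in the last corner (-1) is the optimal alignment score.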
MODIFICATIONS TO THE BASIC ALGORITHM
Parameters (allow for modification of the basic algorithm):
• Initialization
• Recurrence scoring relation
• Start point in sequence reconstruction
• Scoring of spaces
• Multiple sequence alignment
Other modifications:
• Cost function R can be a matrix, with mm and pm being vectors
• Gap penalty can be a function instead of a fixed value
Allows for Global/ Local Alignment
Allows for Semiglobal Alignment
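The sketch below shows, under common textbook conventions, how changing the initialization, the recurrence and the place where the final score is read switches between the three alignment types. Here "local" is taken to mean a Smith-Waterman style recurrence (scores floored at 0, best cell anywhere), and "semiglobal" to mean that leading and trailing spaces are not charged; the function name and `mode` parameter are illustrative.

```python
def alignment_score(s, t, mode="global", pm=1, mm=1, sp=2):
    """Alignment score for mode in {'global', 'semiglobal', 'local'}."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    if mode == "global":
        # Initialization: leading spaces are charged only in the global case.
        for j in range(1, cols):
            score[0][j] = -sp * j
        for i in range(1, rows):
            score[i][0] = -sp * i
    for i in range(1, rows):
        for j in range(1, cols):
            sub = pm if s[i - 1] == t[j - 1] else -mm
            best = max(score[i - 1][j - 1] + sub,
                       score[i - 1][j] - sp,
                       score[i][j - 1] - sp)
            if mode == "local":
                best = max(best, 0)        # recurrence: restart an alignment here
            score[i][j] = best
    if mode == "global":
        return score[-1][-1]               # read the last corner
    if mode == "semiglobal":
        # Trailing spaces are free: best cell in the last row or last column.
        return max(score[-1] + [row[-1] for row in score])
    return max(max(row) for row in score)  # local: best cell anywhere

for mode in ("global", "semiglobal", "local"):
    print(mode, alignment_score("AAAC", "AGC", mode))
```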
RNA SECONDARY STRUCTURE PREDICTION
Cost function: Free energy of structure
Dynamic programming applied under certain assumptions:
• Free energy of structure is the sum of its parts. Implies:
- independent base pairs, or
- consider adjacent base pairs, but ignore free energies of bases not part of any loop
• Ignore the prediction of knots in the structure
• Only certain types of loops allowed
Under these assumptions, and with an appropriate cost function, a DP-based algorithm can be used to compute the optimal structure
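As a simplified stand-in for a free-energy cost function, the sketch below uses the independent-base-pair assumption in its crudest form: every allowed pair contributes the same amount, so the optimal structure is the one with the most nested (knot-free) base pairs. This is a Nussinov-style base-pair maximization, not a thermodynamic model; allowing G-U wobble pairs and requiring at least three unpaired bases in a hairpin loop are assumptions of this example.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in rna (no knots)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = max number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]                    # base j left unpaired
            for k in range(i, j - min_loop):          # base j paired with base k
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    value = max(value, left + 1 + best[k + 1][j - 1])
            best[i][j] = value
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))
```

Each subproblem covers a shorter stretch of the sequence than the one that uses it, which is the same "smaller, but similar problems" structure as in sequence alignment.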