course introduction
DESCRIPTION
Course Introduction. What these courses are about What I expect What you can expect. What these courses are about. overview of ways in which computers are used to solve problems in biology supervised learning of illustrative or frequently-used algorithms and programs - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/1.jpg)
Course Introduction
• What these courses are about
• What I expect
• What you can expect
![Page 2: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/2.jpg)
What these courses are about
• overview of ways in which computers are used to solve problems in biology
• supervised learning of illustrative or frequently-used algorithms and programs
• supervised learning of programming techniques and algorithms selected from these uses
![Page 3: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/3.jpg)
I Expect
• students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry)
• students will have basic familiarity with statistics
• students have some programming experience and willingness to work to improve
![Page 4: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/4.jpg)
You can expect
• Homework assignments– 80% of grade
• Final (20% of grade)
• Grades totally determined by points system
![Page 5: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/5.jpg)
Textbook
• Required textbook: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al.
• Recommended additional textbook:
Introduction to Computational Biology by Waterman
![Page 6: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/6.jpg)
Chapter 1
Introduction
![Page 7: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/7.jpg)
Purpose
• A great acceleration in the accumulation of biological knowledge started in our era
• Part of the challenge is to organize, classify and parse the immense richness of sequence data
• This is not just a task of string parsing, for behind the string of base or amino acids is the whole complexity of molecular biology
![Page 8: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/8.jpg)
• A major task in computational molecular biology is to “decipher” information contained in biological sequences
• Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers
Information Flow
![Page 9: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/9.jpg)
Review of basic biochemistry
• Central Dogma: DNA makes RNA makes protein
• Sequence determines structure determines function
![Page 10: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/10.jpg)
Structure
• DNA composed of four nucleotides or "bases": A,C,G,T
• RNA composed of four also: A,C,G,U (T transcribed as U)
• proteins are composed of amino acids
![Page 11: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/11.jpg)
Purpose
This class is about methods which are in
principle capable of capturing some of the
complexity of biology, by integrating diverse
sources of biological information into clean,
general, and tractable probabilistic models
for sequence analysis.
![Page 12: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/12.jpg)
However,
The most reliable way to determine a biological molecule’s structure or function is by direct experimentation.
It is far easier to obtain the DNA sequence of the gene corresponding to an RNA or protein than it is to experimentally determine its function or its structure.
![Page 13: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/13.jpg)
The Human Genome Project
Gives us the raw sequence of an estimated 20,000-25,000 human genes, only a small fraction of which have been studied experimentally.
The development of computational methods have become more important (computer science, statisticians, and etc….)
![Page 14: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/14.jpg)
Basic Information
• New sequences are adapted from pre-existing sequences
• We compare a new sequence with an old sequence with known structure or function
• Two related sequences are called homologous and we are transferring information by homology
• It is somewhat similar to determine the similarity between two text strings
• In fact, we will be trying to find a plausible alignment between sequences
![Page 15: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/15.jpg)
Definition
• A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids– DNA composed of four nucleotides or
"bases": A,C,G,T– RNA composed of four also: A,C,G,U (T
transcribed as U)– proteins are composed of amino acids (20)
![Page 16: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/16.jpg)
Character representation of sequences• DNA or RNA
– use 1-letter codes (e.g., A,C,G,T)
• protein– use 1-letter codes
• can convert to/from 3-letter codes
(e.g., A = Ala = Alanine
C = Cys = Cysteine)
![Page 17: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/17.jpg)
Alignment
• Find the best alignment between two strings under some scoring system
• “+1” for a match; “-1” for a mismatch• Most important, we want a scoring system to give the
biologically most likely alignment the highest score• Note that biological molecules have evolutionary
histories, 3D folded structures, and other features• This is more the realm of statistics than computer
science• Probabilistic modeling approach might be used and
extend
![Page 18: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/18.jpg)
Probabilities & Probabilistic Models• A model means a system that simulates the
object under consideration• A probabilistic model is to produce different
outcomes with different probabilities• That is, it stimulates a whole class of objects,
and assign each object an associated probability• The objects will be sequences, and a model
might describe a family of related sequences
![Page 19: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/19.jpg)
Example: Rolling a six-sided die
• A probability model of rolling a 6-sided die involves 6 parameters p1, p2, p3, p4, p5, and p6
• The probability of rolling i is pi
• pi 0 and Σ≧ pi=1
• Rolling the die 3 times independently, P([1,6,3])= p1 p6 p3
![Page 20: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/20.jpg)
Example: Biological Sequence
• Biological sequences are strings from finite alphabet of residues (4 nucleotides or 20 amino acids)
• A residue a occurs at random with probability qa, independent of all other residues in the sequence
• If the sequence is denoted by x1… xn, the probability of the whole sequence is
qx1 qx2… qxn
![Page 21: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/21.jpg)
Maximum Likelihood Estimation
• The parameters of a probability model is estimated from a training set (sample)
• The probability qa for amino acid a can be estimated as the observed frequency of residues in a database of known protein sequences (SWISS-PROT)
• The training sequences are not systematically biased towards a peculiar residue composition
![Page 22: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/22.jpg)
http://au.expasy.org/sprot/
![Page 23: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/23.jpg)
MLE (continued)
• This way of estimating models is called maximum likelihood estimation (MLE)
• The MLE maximizes the total probability of all sequences given the model (the likelihood)
• Given a model with parameters θ and a set of data D, the maximum likelihood estimate for θ is that value which maximizes P(D|θ)
![Page 24: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/24.jpg)
Estimation
• If estimating parameters from a limited amount of data, there is a danger of overfitting
• Overfitting: The model becomes very well adapted to the training data, but it will not generalize well to new data
• For example, observing the three flips of a coin [tail, tail, tail] would lead to the maximum likelihood estimate that the probability of head is 0 and that of tail is 1
![Page 25: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/25.jpg)
Conditional, Joint, and Marginal
• We have two dies, D1 and D2
• The conditional probability of rolling i given die D1 is called P(i|D1)
• We pick a die with probability P(Dj), j=1, 2• The probability for picking die Dj and rolling an i is
the product of the two probabilities, P(i, Dj)=P(Dj)P(i|Dj), the joint probability
• P(X, Y)=P(X|Y)P(Y)• P(X)=ΣYP(X, Y)=ΣY P(X|Y)P(Y), the marginal prob
ability
![Page 26: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/26.jpg)
Bayes Theorem
• Bayes’ theorem
• The denominator is the marginal• The numerator is the joint
)(
)()|()|(
YP
XPXYPYXP
![Page 27: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/27.jpg)
Example 1
Consider an occasionally dishonest casino
that uses two kinds of dice. Of the dice 99%
are fair but 1% are loaded so that a six
comes up 50% of the time. Suppose we pick
a die at random and roll it three times,
getting three consecutive sixes. What is
P(Dloaded|3 sixes)?
![Page 28: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/28.jpg)
Example 1 (Continued)
)()|sixes 3(
)()|sixes 3(sixes) 3(
125.05.0)|sixes 3(
0.01)(
)sixes 3(
)()|sixes 3()sixes 3|(
fairfair
loadedloaded
3loaded
loaded
loadedloaded
loaded
DPDP
DPDPP
DP
DP
P
DPDPDP
![Page 29: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/29.jpg)
Example 1 (Continued)
We will still more likely pick up a fair die, despite
seeing three successive sixes.
0.21
)99.0()6/1()01.0)(5.0(
)01.0)(5.0()sixes 3|(
33
3
loaded
DP
![Page 30: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/30.jpg)
Example 2
• Assume that, on average, extracellular protein have a slightly different amio acid composition than intracellular proteins
• For example, cysteine is more common in extracellular than intracellular proteins
• Question: whether a new protein sequence x=x1…xn is intracellular or extracellular?
![Page 31: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/31.jpg)
Example 2 (continued)
• We first split our training examples from SWISS-PROT into extracellular and intracellular proteins
• Estimate a set of frequencies for intracellular proteins, and a corresponding set of extracellular frequencies
• The probability that any new sequence is extracellular is pext, and the corresponding probability of being intracellular is pint. Note that pint=1- pext
intaq
extaq
![Page 32: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/32.jpg)
Example 2 (continued)
intintextext
extext
intext
intext
)|ext(
)int|()ext|()(
)int|( and )ext|(
ii
i
ii
xixi
xi
xixi
qpqp
qpxP
xppxppxP
qxPqxP
![Page 33: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/33.jpg)
Bayesian Model
• θ is the parameter of interest• Before collecting data, the information regarding
θ is called the prior information, P(θ)• After collected the data, the information regardin
g θ is called the posterior information, P(θ|D)• If we do not have enough data to reliably estimat
e the parameters, we can use prior knowledge to constrain the estimates
![Page 34: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/34.jpg)
Bayesian and Frequentist
D ~ N(θ,1)
• To frequentists, θ is fixed (unknown)
• To Bayesians, θ is random
• If θ is random, what should its distribution
be?
• Frequentists argue that the determination of the prior distribution of θ is very subjective
![Page 35: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/35.jpg)
Prior and Posterior
• Suppose that θ has a probability distribution
P(θ) (prior)• Assume that θ and D|θ are independent • P(D, θ) is the joint distribution of D and θ• P(D | θ) is the conditional distribution of D given
θ• P(θ | D) is the conditional distribution of θ
given D (the posterior)
![Page 36: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/36.jpg)
Prior and Posterior
• P(D| θ)P(θ)=P(D, θ)=P(θ | D)P(D)• Bayes’ theorem:
DPPDP
DP)(|
|
![Page 37: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/37.jpg)
Posterior Distribution
Given D’s density p(D|θ) and a prior
probability density P(θ), the posterior
density for θ is given as
p(θ|D)=cp(θ) p(D|θ) ,
where
c-1=∫ p(θ) p(D|θ) dθ
(the marginal of D).
![Page 38: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/38.jpg)
Example
D ~ N(θ,2), is known.
P()=N( 0, 02)
Then the posterior density is normal with
and
220
2
220
20
0 /1/1
/1
/1/1
/1mean
d
1220 /1/1variance
![Page 39: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/39.jpg)
Conjugate Prior
• D ~ N(θ,2) is a normal distribution
• Prior distribution, P()=N( 0, 02), is
also a normal distribution
• Posterior distribution, P(|D), is also a normal distribution
• The normal distribution is conjugate to the normal
![Page 40: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/40.jpg)
Specification of the Prior
• Conjugate priors:
–The beta distribution is conjugate to the
binomial
–The normal distribution is conjugate to the
normal
–The gamma distribution is conjugate to the
Poisson
![Page 41: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/41.jpg)
Specification of the Prior
• Noninformative (uninformative) priors: P(θ) constant• When we don’t have a strong belief or in
public policy situations strongly differ
![Page 42: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/42.jpg)
Specification of the Prior
Sometimes, we will wish to use an
informative P(θ). We know a priori that the
amino acids phenylalanine (Phe, F), tyrosine (Tyr,
Y), and tryptophan (Trp, W) are structurally similar
and often evolutionarily interchangeable. We
would want to use a P(θ) that tends to favor
parameter sets that assign them very similar
probabilities.
![Page 43: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/43.jpg)
Parameter Estimation
• Choose the parameter value for θ that maximize P(|D)
• This is called maximum a posteriori or MAP estimation
• MAP estimation maximizes the likelihood times the prior
• If the prior is flat (uninformative), then MAP is equal to the MLE
• Another parameter estimation is to choose the mean of the posterior
![Page 44: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/44.jpg)
Maximum A Posteriori (MAP) Estimation
• Ex: estimating probabilities for a die
We roll 1, 3, 4, 2, 4, 6, 2, 1, 2, 2
MLE: p5=0 ~ overfitting
add 1 to each observed number of counts (pseudocount):
MAP: p5=1/16
![Page 45: Course Introduction](https://reader036.vdocument.in/reader036/viewer/2022081515/56815049550346895dbe469a/html5/thumbnails/45.jpg)
When estimating large parameter sets from
small amounts of data, we believe that
Bayesian methods provide a consistent
formalism for bringing in additional
information from previous experience with
the same type of data.