8/6/2019 Statistics and Probability for tics
STATISTICS AND PROBABILITY FOR BIOINFORMATICS
By KUMAR PARIJAT TRIPATHI
1. INTRODUCTION
2. EVENTS, PROBABILITY AND RULES
3. CLASSICAL PROBABILITY: EQUALLY LIKELY OUTCOMES
4. SUBJECTIVE PROBABILITIES
5. PROBABILITY RULES
6. (PROBABILISTIC) INDEPENDENCE
7. SEQUENCE ANALYSIS: PAIRWISE ALIGNMENT
8. SUBSTITUTION MATRICES
9. HIDDEN MARKOV MODELS
INTRODUCTION
An experiment is a situation involving chance or probability that leads to results called outcomes.
An outcome is the result of a single trial of an experiment.
An event is one or more outcomes of an experiment.
Probability is the measure of how likely an event is.
Probability of an Event
When all outcomes are equally likely:
P(A) = (the number of ways event A can occur) / (the total number of possible outcomes)
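The counting formula can be sketched in a few lines of Python. This is only an illustration, assuming equally likely outcomes; the function name `classical_probability` and the die example are not from the slides.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = ways A can occur / total outcomes (equally likely outcomes assumed)."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

# Illustrative experiment: one roll of a fair six-sided die.
die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}  # event A: the roll is even

p_even = classical_probability(even, die)
print(p_even)  # 1/2
```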
DETERMINISTIC AND RANDOM EXPERIMENT
Deterministic Experiment: Experiments that have only one possible result or outcome, i.e. whose result is certain or unique, are called deterministic or predictable experiments. The result of these experiments is predictable with certainty and is known prior to its conduct. This approach stipulates that the conditions under which the experiment is conducted determine its result.
Probabilistic Experiment: An experiment whose result is uncertain, i.e. a random experiment, is a probabilistic experiment. The experiment results in two or more outcomes. The result of the experiment will be one of the possible outcomes but cannot be predicted prior to its conduct. The different possible outcomes of the experiment can be known or assessed, but it is not possible to predict the occurrence of a particular outcome at any particular execution of the experiment.
EVENTS, PROBABILITY, RULES
Events (informal): a tossed coin comes up heads, a roll of two dice gives a double 6, etc.
Subsets (formal): of a sample space.
Propositions: it will rain tomorrow, or India will win the match.
All three are denoted by A, B, ..., H, ...
A or B can be represented by A ∪ B or A ∨ B; A and B is represented by A ∩ B, where ∪ means union and ∩ means intersection; not A is represented by Ā.
Basic Expression: pr(A / H), where A and H are events, sets or propositions; pr = probability; / = given, conditional upon, on the assumption of. Sometimes H is the hypothesis under which the probability is evaluated.
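As a rough illustration of events as subsets, the sketch below represents events on two dice as Python sets and evaluates pr(A / H) by counting inside H. The particular events and the helper `pr` are chosen here for illustration and assume equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered results of rolling two dice.
sample_space = set(product(range(1, 7), repeat=2))

A = {o for o in sample_space if o == (6, 6)}   # double 6
B = {o for o in sample_space if sum(o) == 7}   # sum is 7
H = {o for o in sample_space if o[0] == 6}     # given: first die shows 6

a_or_b = A | B               # union, "A or B"
a_and_b = A & B              # intersection, "A and B"
not_a = sample_space - A     # complement, "not A"

def pr(event, given):
    """pr(event / given): count outcomes of the event inside the conditioning set."""
    return Fraction(len(event & given), len(given))

pr_a = pr(A, sample_space)   # 1/36
pr_a_given_h = pr(A, H)      # 1/6
```

Conditioning on H shrinks the set of outcomes being counted, which is why pr(A / H) differs from the unconditional value here.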
SUBJECTIVE PROBABILITIES
A = interest rates will be higher on Feb 1, 2001 than they are now. pr(A/H) = something for some people; such judgements are made in the finance industry.
A bookie offers odds of 2:1 AGAINST horse A winning a race. So the bookie's pr(A will win/H) = 1/3. H?
A bookie offers odds of 2:1 ON horse A winning a race. Then the bookie's pr(A will win/H) = 2/3.
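The conversion from odds to probability used above can be written out as a small sketch. The function names are illustrative, and the code ignores the bookmaker's margin, which real odds usually include.

```python
from fractions import Fraction

def prob_from_odds_against(a, b):
    """Odds of a:b AGAINST an event correspond to probability b / (a + b)."""
    return Fraction(b, a + b)

def prob_from_odds_on(a, b):
    """Odds of a:b ON an event correspond to probability a / (a + b)."""
    return Fraction(a, a + b)

print(prob_from_odds_against(2, 1))  # 1/3
print(prob_from_odds_on(2, 1))       # 2/3
```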
PROBABILITY RULES (FOR ALL APPROACHES)
0 ≤ pr(A/H) ≤ 1
APPROPRIATENESS OF ADDITION RULE
Roll 2 fair dice; put A = sum is even, B = sum is 7. A and B cannot both occur, because 7 is odd.
pr(A/H) = 18/36, pr(B/H) = 6/36, pr(A or B/H) = (18+6)/36 = 2/3
The proportion of Australian Aborigines aged 0-4 yrs in the 1971 census was 0.177; the proportion aged 5-9 yrs was 0.154. What is the chance that a randomly selected Australian Aborigine is aged less than 10 years?
Put A = A.A. is aged 0-4, B = A.A. is aged 5-9. The two age groups are mutually exclusive, so
pr(A.A. aged less than 10 yrs/H) = pr(A or B/H) = 0.177 + 0.154 = 0.331
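The dice calculation can be checked by enumerating all 36 rolls. The sketch below also adds one counter-example that is not on the slide (C = sum is 8, which overlaps with A) to show why the addition rule needs mutually exclusive events.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def pr(predicate):
    """Probability of an event given as a true/false test on a roll."""
    return Fraction(sum(1 for r in rolls if predicate(r)), len(rolls))

pr_even = pr(lambda r: sum(r) % 2 == 0)                      # A: 18/36
pr_seven = pr(lambda r: sum(r) == 7)                         # B: 6/36
pr_even_or_seven = pr(lambda r: sum(r) % 2 == 0 or sum(r) == 7)

# Counter-example (not from the slides): C = sum is 8 is NOT exclusive of A.
pr_eight = pr(lambda r: sum(r) == 8)                         # C: 5/36
pr_even_or_eight = pr(lambda r: sum(r) % 2 == 0 or sum(r) == 8)

print(pr_even_or_seven == pr_even + pr_seven)   # True: A and B are exclusive
print(pr_even_or_eight == pr_even + pr_eight)   # False: A and C overlap
```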
EXAMPLES 1 AND 2
Example 1. Roll three dice. What is the chance of at least one ace?
There are 6*6*6 = 216 possible results on the three dice. How many involve at least one ace? HARD! How many involve no aces? 5*5*5 = 125.
pr(at least one ace) = 1 - pr(no aces) = 1 - (5/6)*(5/6)*(5/6) = 91/216 ≈ 0.421
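The "hard" direct count is easy for a computer, so the complement shortcut can be verified by brute force. A minimal sketch, treating an ace as a die showing 1:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # all 216 results

# Direct count: results with at least one ace (a die showing 1).
with_ace = sum(1 for r in rolls if 1 in r)
direct = Fraction(with_ace, len(rolls))

# Complement rule: 1 - pr(no aces).
via_complement = 1 - Fraction(5, 6) ** 3

print(with_ace, direct, via_complement)  # 91 91/216 91/216
```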
Example 2. Pick a card at random from a well-shuffled pack. Then pick a second card likewise, not replacing the first card before doing so.
Let A = 1st card red, B = 2nd card red and H = the usual.
pr(A/H) = 26/52 = 1/2
pr(B/A&H) = 25/51
pr(A&B/H)? By the multiplication rule, pr(A&B/H) = 26/52 * 25/51
B = (B & A) or (B & not A); then by the addition rule
pr(B/H) = pr(B & A/H) + pr(B & not A/H) = 26/52*25/51 + 26/52*26/51 = 1/2
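Both results can be confirmed by listing every ordered pair of distinct cards. In this sketch the deck is reduced to colours only, which is a simplification made here for illustration.

```python
from fractions import Fraction
from itertools import permutations

# A 52-card pack reduced to colours: 26 red ("R") and 26 black ("K").
deck = ["R"] * 26 + ["K"] * 26

# All ordered draws of two different cards (no replacement): 52 * 51 pairs.
draws = list(permutations(range(52), 2))

both_red = sum(1 for i, j in draws if deck[i] == "R" and deck[j] == "R")
second_red = sum(1 for i, j in draws if deck[j] == "R")

pr_a_and_b = Fraction(both_red, len(draws))
pr_b = Fraction(second_red, len(draws))

print(pr_a_and_b)  # 25/102, equal to 26/52 * 25/51
print(pr_b)        # 1/2
```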
EXAMPLE 3
Of the 4,639,221 base pairs of the sequence of E. coli, 1,142,136 are As and there were 255,179 occurrences of the dinucleotide AA. A position in the E. coli genome is chosen at random; A = an A at that position, while B = an A at the next position.
Pr(A/H) = 1,142,136/4,639,221 ~ 1/4
Pr(B/A&H) = 255,179/1,142,136
Pr(A&B/H) = 255,179/4,639,221 (MULTIPLICATION RULE)
Pr(B/H) ~ 1/4 (AS ABOVE)
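The arithmetic of this example, using only the three counts quoted above, can be sketched as follows. The comparison at the end takes Pr(B/H) to be approximately equal to Pr(A/H), as the slide does.

```python
from fractions import Fraction

GENOME_LENGTH = 4_639_221   # base pairs in the E. coli sequence
COUNT_A = 1_142_136         # number of As
COUNT_AA = 255_179          # occurrences of the dinucleotide AA

pr_a = Fraction(COUNT_A, GENOME_LENGTH)       # Pr(A/H)
pr_b_given_a = Fraction(COUNT_AA, COUNT_A)    # Pr(B/A&H)
pr_a_and_b = pr_a * pr_b_given_a              # multiplication rule

print(round(float(pr_a), 4))          # 0.2462
print(round(float(pr_b_given_a), 4))  # 0.2234
print(pr_a_and_b == Fraction(COUNT_AA, GENOME_LENGTH))  # True
```

Since Pr(B/A&H) comes out slightly below 1/4, an A appears to be a little less likely after another A than at a random position, so the two events are close to, but not exactly, independent on these counts.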
EXTENDED ADDITION AND MULTIPLICATION RULES
ADDITION RULE: If A1, A2, ... are mutually exclusive given H, that is, A1 and A2 is impossible given H (more generally, pr(A1&A2/H) = 0, and likewise for every other pair), then
pr(A1 or A2 or .../H) = pr(A1/H) + pr(A2/H) + ...
MULTIPLICATION RULE:
Pr(A1&A2&A3&.../H) = Pr(A1/H)*Pr(A2/A1&H)*Pr(A3/A1&A2&H)*...
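A small sketch of both extended rules follows. The helper names and the two worked cases (three red cards in a row, a die showing 1, 2 or 3) are illustrations chosen here, not examples from the slides.

```python
from fractions import Fraction

def multiplication_rule(conditionals):
    """Multiply pr(A1/H), pr(A2/A1&H), pr(A3/A1&A2&H), ... in order."""
    result = Fraction(1)
    for p in conditionals:
        result *= p
    return result

def addition_rule(probabilities):
    """Add pr(A1/H), pr(A2/H), ...; valid only for mutually exclusive events."""
    return sum(probabilities, Fraction(0))

# Three red cards in a row, drawn without replacement.
three_reds = multiplication_rule(
    [Fraction(26, 52), Fraction(25, 51), Fraction(24, 50)]
)

# A fair die shows 1, 2 or 3 (three mutually exclusive events).
low_roll = addition_rule([Fraction(1, 6)] * 3)

print(three_reds)  # 2/17
print(low_roll)    # 1/2
```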
(PROBABILISTIC) INDEPENDENCE
Say B is (probabilistically) independent of A given H if
pr(B/A&H) = pr(B/H) (*)
(*) implies pr(A&B/H) = pr(A/H)*pr(B/H)
This is usually taken as the definition of independence.
(*) implies A is independent of B given H, as long as pr(B/H) ≠ 0
PROOF: pr(A&B/H) = pr(A/H)*pr(B/A&H) and also pr(A&B/H) = pr(B/H)*pr(A/B&H). Under (*) the first expression equals pr(A/H)*pr(B/H); cancel pr(B/H) if it ≠ 0 to get pr(A/B&H) = pr(A/H).
Summary: as long as everything is ≠ 0, any one of:
pr(A/B&H) = pr(A/H)
pr(B/A&H) = pr(B/H)
pr(A&B/H) = pr(A/H)*pr(B/H)
implies the other two.
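The product form of independence is easy to test by enumeration. The sketch below uses two dice; the events and helper names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def pr(event, given=lambda r: True):
    """pr(event / given), counting rolls that satisfy the condition."""
    conditioned = [r for r in rolls if given(r)]
    return Fraction(sum(1 for r in conditioned if event(r)), len(conditioned))

def independent(a, b):
    """Check pr(A&B/H) = pr(A/H) * pr(B/H)."""
    return pr(lambda r: a(r) and b(r)) == pr(a) * pr(b)

first_six = lambda r: r[0] == 6
second_six = lambda r: r[1] == 6
sum_even = lambda r: sum(r) % 2 == 0
sum_seven = lambda r: sum(r) == 7

print(independent(first_six, second_six))  # True
print(independent(sum_even, sum_seven))    # False: they exclude each other
print(pr(second_six, given=first_six) == pr(second_six))  # True, form (*)
```

Mutually exclusive events with non-zero probability, such as "sum is even" and "sum is 7", are never independent: knowing one occurred rules the other out.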
SEQUENCE ANALYSIS: PAIRWISE ALIGNMENT
One of the basic problems a biologist is faced with when given two DNA or protein sequences is to determine whether they are related.
The aim of sequence alignment theory is to determine (a) the best
alignment between the two sequences and (b) whether the two sequences
show similarity by pure chance or due to common ancestry.
Alignment algorithms strive to model the mutational process giving
rise to the two sequences.
The basic mutational processes are:
1. Substitutions: replace a residue (DNA base or amino acid) with another.
2. Insertions: add residues to the sequence.
3. Deletions: remove residues from the sequence.
Insertions and deletions result in gaps in the alignment.
When calculating the total score of an alignment
X: x(1), x(2), x(3), x(4), ..., x(n)
Y: y(1), y(2), y(3), y(4), ..., y(n)
(where x(i) and y(i) are now either a sequence residue or a gap) we assume independence between aligned positions, such that the probability of the alignment is
Pr(Alignment) = Pr(x(1),y(1)) * Pr(x(2),y(2)) * ... * Pr(x(n),y(n))
where Pr(x(i),y(i)) = the probability of aligning residue x(i) with y(i).
Since alignments usually are long, resulting in a very low total probability of the alignment, it is common to use the logarithm of the probability as the score of the alignment:
log(Pr(Alignment)) = log(Pr(x(1),y(1))) + log(Pr(x(2),y(2))) + ... + log(Pr(x(n),y(n)))
resulting in a score with the additive property:
S = s(x(1),y(1)) + s(x(2),y(2)) + ... + s(x(n),y(n)), where s(x(i),y(i)) = log(Pr(x(i),y(i)))
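The reason for working with logarithms can be seen numerically. In this sketch the per-position probability of 0.05 and the alignment length of 1000 are made-up values, chosen only to show the effect.

```python
import math

def alignment_log_score(pair_probs):
    """S = sum of log Pr(x(i), y(i)) over the aligned positions."""
    return sum(math.log(p) for p in pair_probs)

# Hypothetical alignment: 1000 positions, each pair with probability 0.05.
pair_probs = [0.05] * 1000

raw_product = math.prod(pair_probs)          # too small for a float
log_score = alignment_log_score(pair_probs)  # stays well within range

print(raw_product)  # 0.0 (underflow)
print(log_score)    # about -2995.7
```

The product underflows to zero in floating point, while the sum of logarithms remains an ordinary number that can still be compared between alignments.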
SUBSTITUTION MATRICES:
We can always produce an optimal alignment and an alignment score of two
sequences, whether they really are related or not. But when is the score high
enough to infer homology?
One way to answer this is by comparing the probability of the alignment when
we assume homology, to the probability of the alignment when we assume the
sequences to be independent.
Thus we have two models:
Match model (M): (assuming homology)
The residues x(i) and y(i) at position i in the alignment occur together with
probability pr(x(i),y(i)).
Positions in the alignment are still assumed to be independent, but the
sequences are assumed to be dependent (that is, pr(x(i),y(i)) ≠ q(x(i))*q(y(i))).
Random model (R): (assuming no homology)
Now both the sequences and the positions in the alignment are assumed to be
independent. Thus at each position i in the alignment, residues x(i) and y(i)
occur with probability q(x(i))*q(y(i)), where q(a) is the probability of residue a occurring on its own.
We score the alignment using the relative likelihood
The idea is as follows:
x and y are really homologous:
If S is large enough we reject the random model and assume the sequences
to be homologous.
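The formula for the relative likelihood did not survive on this slide. A common choice, assumed here, is the log-odds ratio of the match model to the random model, s(a,b) = log( pr(a,b) / (q(a)*q(b)) ), summed over positions. The sketch below uses that assumption with made-up DNA probabilities and an ungapped alignment; none of the numbers come from the slides.

```python
import math

BASES = "ACGT"

# Hypothetical random model: every base has background probability 1/4.
q = {a: 0.25 for a in BASES}

# Hypothetical match model: identical pairs 0.175 each, others 0.025 each
# (4 * 0.175 + 12 * 0.025 = 1).
p = {(a, b): (0.175 if a == b else 0.025) for a in BASES for b in BASES}

def s(a, b):
    """Assumed log-odds score of one aligned pair: match model vs random model."""
    return math.log(p[(a, b)] / (q[a] * q[b]))

def log_odds_score(x, y):
    """Total score S of an ungapped alignment of equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(x, y))

print(log_odds_score("ACGT", "ACGT") > 0)  # True: favours the match model
print(log_odds_score("ACGT", "CATG") < 0)  # True: favours the random model
```

A table of s(a,b) for every pair of residues is what a substitution matrix stores; how large S must be before the random model is rejected is a separate question that this sketch does not address.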
HIDDEN MARKOV MODELS
Markov Chains: A Markov chain is a random process X(0), X(1), X(2), ... which jumps randomly between different states in a state space S = {S(1), S(2), S(3), ...} with something called the memoryless property: what state the process will jump to next depends only on the current state, not the past ones.
Which state the process begins in is determined by the initial distribution.
The process jumps between different states according to the transition probabilities.
The transition probabilities can be organized in an array, or a transition matrix.
Possible outcome sequence: X(0)=a, X(1)=a, X(2)=b, X(3)=c.
Impossible outcome sequence: X(0)=b, X(1)=b, X(2)=a, X(3)=b.
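The transition matrix itself was a figure and is not recoverable here, so the sketch below uses a hypothetical three-state chain. Its numbers are invented, chosen only so that the first sequence above is possible and the second is not (here because the jump from b to b has probability 0).

```python
# Hypothetical initial distribution and transition matrix (not from the slides).
INITIAL = {"a": 0.5, "b": 0.25, "c": 0.25}
TRANSITION = {
    "a": {"a": 0.5, "b": 0.5, "c": 0.0},
    "b": {"a": 0.5, "b": 0.0, "c": 0.5},
    "c": {"a": 0.25, "b": 0.25, "c": 0.5},
}

def sequence_probability(states):
    """Pr(X(0), X(1), ...): initial probability times each transition in turn."""
    prob = INITIAL[states[0]]
    for current, nxt in zip(states, states[1:]):
        prob *= TRANSITION[current][nxt]
    return prob

print(sequence_probability(["a", "a", "b", "c"]))  # 0.0625
print(sequence_probability(["b", "b", "a", "b"]))  # 0.0, impossible
```

Each factor depends only on the current state and the next one, which is the memoryless property in computational form.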
BIOLOGICAL MOTIVATION
MARKOV CHAIN MODEL
Example
Assume we have two dice A and B:
where A generates numbers between 1 and 6 and B generates
numbers between 1 and 4. The process is as follows:
1. We randomly choose a die to start with, A or B
2. We roll the die and record the number
3. We choose whether to roll the current die again, or switch to the
other
4. Repeat steps 2-3.
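The four steps can be simulated directly. This sketch assumes fair dice, a uniform choice of the starting die, and a switch probability of 0.1 after each roll; the slides do not give these values.

```python
import random

FACES = {"A": 6, "B": 4}  # die A shows 1-6, die B shows 1-4

def simulate(n_rolls, p_switch=0.1, seed=None):
    """Return (dice used, numbers rolled) for n_rolls rolls of the process."""
    rng = random.Random(seed)
    die = rng.choice(["A", "B"])                    # step 1: pick a starting die
    dice, numbers = [], []
    for _ in range(n_rolls):
        dice.append(die)
        numbers.append(rng.randint(1, FACES[die]))  # step 2: roll and record
        if rng.random() < p_switch:                 # step 3: switch or keep
            die = "B" if die == "A" else "A"
    return dice, numbers                            # step 4: loop repeats 2-3

dice, numbers = simulate(20, seed=0)
print("".join(dice))
print(numbers)
```

The list `dice` is the part that becomes hidden in a hidden Markov model; an observer sees only `numbers`.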
To translate this into HMM formulation:
The state space is S = {A, B}.
We randomly choose the first die X(0) according to the initial
probabilities π = (π(A), π(B)), where Pr(X(0) = A) = π(A).
The first observed number Y(0) = y appears with probability e(X(0))(y),
the output probability of y for the die X(0).
We switch between states according to the transition probabilities.
In roll n the state is X(n) and the observed number is Y(n).
Now assume that someone else rolled the dice, and we only know the
underlying probabilities (initial distribution, transition probabilities,
output distribution) and have the observed output.
This is a hidden Markov model where the hidden states are which die
was used in each roll, and the output sequence is the numbers observed.
HMM theory can help us answer questions like:
What is the probability of observing such a series given our
model?
What is the most likely underlying sequence of dice (state sequence) giving rise to this output?
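These two questions are usually answered with the forward algorithm and the Viterbi algorithm, which the extracted slides do not spell out. A minimal sketch for the two-dice model follows, assuming fair dice, a uniform initial distribution and a 0.9 probability of keeping the same die; all of these values are illustrative.

```python
STATES = ("A", "B")
INITIAL = {"A": 0.5, "B": 0.5}                     # assumed initial distribution
TRANSITION = {"A": {"A": 0.9, "B": 0.1},           # assumed transition matrix
              "B": {"A": 0.1, "B": 0.9}}

def emission(state, y):
    """Output probability of number y for a fair die (A: 1-6, B: 1-4)."""
    faces = 6 if state == "A" else 4
    return 1.0 / faces if 1 <= y <= faces else 0.0

def forward(observations):
    """Probability of the observed numbers, summed over all hidden dice paths."""
    alpha = {s: INITIAL[s] * emission(s, observations[0]) for s in STATES}
    for y in observations[1:]:
        alpha = {s: emission(s, y) * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(observations):
    """Most likely hidden dice path and its joint probability with the output."""
    best = {s: (INITIAL[s] * emission(s, observations[0]), [s]) for s in STATES}
    for y in observations[1:]:
        new_best = {}
        for s in STATES:
            # Best predecessor r for landing in state s at this roll.
            prob, path = max(((best[r][0] * TRANSITION[r][s], best[r][1])
                              for r in STATES), key=lambda t: t[0])
            new_best[s] = (prob * emission(s, y), path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

observed = [5, 6, 6]
print(forward(observed))
print(viterbi(observed)[1])  # ['A', 'A', 'A']: only die A can show 5 or 6
```

For long output sequences these products become very small, so practical implementations work with logarithms or rescale at each step.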
PROFILE HMM FOR SEQUENCE ALIGNMENT