Statistics and Probability for Bioinformatics

Upload: mudit-misra

Post on 07-Apr-2018


TRANSCRIPT

  • 8/6/2019 Statistics and Probability for Bioinformatics

    STATISTICS AND PROBABILITY FOR BIOINFORMATICS

    By

    KUMAR PARIJAT TRIPATHI

    1. INTRODUCTION

    2. EVENTS, PROBABILITY AND RULES

    3. CLASSICAL PROBABILITY: EQUALLY LIKELY OUTCOMES

    4. SUBJECTIVE PROBABILITIES

    5. PROBABILITY RULES

    6. (PROBABILISTIC) INDEPENDENCE

    7. SEQUENCE ANALYSIS: PAIRWISE ALIGNMENT

    8. SUBSTITUTION MATRICES

    9. HIDDEN MARKOV MODELS


    INTRODUCTION

    An experiment is a situation involving chance or probability that leads to results called outcomes.

    An outcome is the result of a single trial of an experiment.

    An event is one or more outcomes of an experiment.

    Probability is the measure of how likely an event is.

    Probability of an event (when all outcomes are equally likely):

    P(A) = (the number of ways event A can occur) / (the total number of possible outcomes)
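The counting formula above can be sketched in a few lines. This is a minimal illustration, assuming every outcome is equally likely; the helper name `classical_probability` and the single-die example are not from the slides.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = (# outcomes in A) / (# possible outcomes), outcomes equally likely."""
    outcomes = list(sample_space)
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

# Illustrative experiment: one roll of a fair six-sided die.
die = range(1, 7)
p_even = classical_probability(lambda o: o % 2 == 0, die)
print(p_even)  # 1/2
```

Exact fractions are used so that the result matches the hand calculation rather than a rounded decimal.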


    DETERMINISTIC AND RANDOM EXPERIMENT

    Deterministic Experiment: An experiment that has only one possible result or outcome, i.e. whose result is certain or unique, is called a deterministic or predictable experiment. The result of such an experiment is predictable with certainty and is known before the experiment is conducted. This approach stipulates that the conditions under which the experiment is conducted determine its result.

    Probabilistic Experiment: An experiment whose result is uncertain, i.e. a random experiment, is a probabilistic experiment. The experiment has two or more possible outcomes. The result will be one of the possible outcomes, but it cannot be predicted before the experiment is conducted. The different possible outcomes can be known or assessed, but it is not possible to predict the occurrence of a particular outcome at any particular execution of the experiment.


    EVENTS, PROBABILITY, RULES

    Events (informal): a tossed coin comes up heads, a roll of two dice gives a double 6, etc.

    Subsets (formal): of a sample space.

    Propositions: it will rain tomorrow, or India will win the match.

    All three are denoted by A, B, ..., H, ...

    "A or B" can be represented by A ∪ B or A ∨ B; "A and B" is represented by A ∩ B, where ∪ means union and ∩ means intersection; "not A" is represented by Ā.

    Basic expression: pr(A / H), where A and H are events, sets or propositions; pr = probability; / = given, conditional upon, on the assumption of. Sometimes H is the hypothesis under which the probability is evaluated.


    SUBJECTIVE PROBABILITIES

    A = interest rates will be higher on Feb 1, 2001 than they are now. pr(A/H) = something for some people; such judgements are made in the finance industry.

    A bookie offers odds of 2:1 AGAINST horse A winning a race, so the bookie's pr(A will win/H) = 1/3. What is H?

    A bookie offers odds of 2:1 ON horse A winning a race. Then the bookie's pr(A will win/H) = 2/3.
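The conversion from betting odds to an implied probability used in the two bookie examples can be written out as follows. This is a small sketch; the function names are hypothetical and it ignores the bookie's margin.

```python
from fractions import Fraction

def prob_from_odds_against(a, b):
    """Odds of a:b AGAINST an event imply probability b / (a + b)."""
    return Fraction(b, a + b)

def prob_from_odds_on(a, b):
    """Odds of a:b ON an event imply probability a / (a + b)."""
    return Fraction(a, a + b)

print(prob_from_odds_against(2, 1))  # 1/3
print(prob_from_odds_on(2, 1))       # 2/3
```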


    PROBABILITY RULES (FOR ALL APPROACHES)

    0 ≤ pr(A/H) ≤ 1


    APPROPRIATENESS OF ADDITION RULE

    Roll 2 fair dice, and put A = sum is even, B = sum is 7. A and B cannot both occur, since 7 is odd.

    pr(A/H) = 18/36

    pr(B/H) = 6/36

    pr(A or B/H) = (18+6)/36 = 2/3

    The proportion of Australian Aborigines aged 0-4 yrs in the 1971 census was 0.177; the proportion aged 5-9 yrs was 0.154. What is the chance that a randomly selected Australian Aborigine is aged 9 years or less?

    Put A = the person is aged 0-4, B = the person is aged 5-9. Then pr(aged 9 yrs or less/H) = pr(A or B/H) = 0.177 + 0.154 = 0.331.
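The dice figures above can be checked by listing all 36 equally likely rolls. The sketch below is only a verification aid; the helper `pr` and the lambda names are illustrative, not part of the slides.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered results of rolling two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event by counting equally likely outcomes."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

is_even = lambda r: sum(r) % 2 == 0   # event A
is_seven = lambda r: sum(r) == 7      # event B

p_a = pr(is_even)
p_b = pr(is_seven)
p_a_and_b = pr(lambda r: is_even(r) and is_seven(r))
p_a_or_b = pr(lambda r: is_even(r) or is_seven(r))

print(p_a, p_b, p_a_and_b, p_a_or_b)  # 1/2 1/6 0 2/3
```

Because pr(A and B) is 0, the events are mutually exclusive, which is exactly the condition under which adding pr(A) and pr(B) is appropriate.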


    EXAMPLES 1. AND 2.

    Roll three dice. What is the chance of at least one ace?

    There are 6*6*6 = 216 possible results on the three dice. How many involve at least one ace? HARD! How many involve no aces? 5*5*5 = 125.

    pr(at least one ace) = 1 - pr(no aces) = 1 - (5/6)*(5/6)*(5/6) = 91/216
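As a check on the complement argument, the "hard" direct count is easy for a computer. This sketch enumerates all 216 results and compares the direct count with 1 - (5/6)^3; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 216 ordered results of rolling three fair dice.
rolls = list(product(range(1, 7), repeat=3))

# Direct count of results that contain at least one ace (a 1).
with_ace = sum(1 for r in rolls if 1 in r)
p_direct = Fraction(with_ace, len(rolls))

# Complement rule: 1 - pr(no aces).
p_complement = 1 - Fraction(5, 6) ** 3

print(with_ace, p_direct, p_complement)  # 91 91/216 91/216
```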

    Pick a card at random from a well-shuffled pack. Then pick a second card likewise, not replacing the first card before doing so.

    Let A = 1st card red, B = 2nd card red, and H = the usual.

    pr(A/H) = 26/52 = 1/2

    pr(B/A&H) = 25/51

    pr(A&B/H)? By the multiplication rule, pr(A&B/H) = 26/52 * 25/51.

    B = (B&A) or (B&Ā), so by the addition rule

    pr(B/H) = pr(B&A/H) + pr(B&Ā/H) = 26/52*25/51 + 26/52*26/51 = 1/2
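The card calculation can be confirmed by enumerating every ordered pair of distinct cards. The sketch assumes a standard 52-card pack with 26 red cards; the deck encoding and helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Cards 0..25 are red, 26..51 are black (an arbitrary labelling).
is_red = lambda card: card < 26

# All 52*51 ordered draws of two different cards.
draws = list(permutations(range(52), 2))

def pr(event):
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

p_a = pr(lambda d: is_red(d[0]))                    # 1st card red
p_b = pr(lambda d: is_red(d[1]))                    # 2nd card red
p_ab = pr(lambda d: is_red(d[0]) and is_red(d[1]))  # both red
p_b_given_a = p_ab / p_a

print(p_a, p_b_given_a, p_ab, p_b)  # 1/2 25/51 25/102 1/2
```

The second card is red with probability 1/2 overall, even though its conditional probability given a red first card is 25/51.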


    EXAMPLE 3.

    Of the 4,639,221 base pairs of the sequence of E. coli, 1,142,136 are As, and there are 255,179 occurrences of the dinucleotide AA. A position in the E. coli genome is chosen at random; A = an A at that position, while B = an A at the next position.

    Pr(A/H) = 1,142,136 / 4,639,221 ≈ 1/4

    Pr(B/A&H) = 255,179 / 1,142,136

    Pr(A&B/H) = 255,179 / 4,639,221 (multiplication rule)

    Pr(B/H) ≈ 1/4 (as above)
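The arithmetic of this example can be reproduced directly from the three counts quoted on the slide. The sketch below only uses those counts; the variable names are illustrative.

```python
# Counts quoted in the example.
genome_length = 4_639_221
count_a = 1_142_136
count_aa = 255_179

p_a = count_a / genome_length      # Pr(A/H)
p_b_given_a = count_aa / count_a   # Pr(B/A&H)
p_ab = count_aa / genome_length    # Pr(A&B/H)

print(f"Pr(A/H)     = {p_a:.4f}")
print(f"Pr(B/A&H)   = {p_b_given_a:.4f}")
print(f"Pr(A&B/H)   = {p_ab:.4f}")
print(f"Pr(A)*Pr(B/A) = {p_a * p_b_given_a:.4f}")
```

With these counts Pr(B/A&H) comes out a little below Pr(A/H), so under this crude estimate an A is slightly less likely to follow an A than to appear at a random position.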


    EXTENDED ADDITION AND MULTIPLICATION RULES

    ADDITION RULE

    If A1, A2, ... are mutually exclusive given H, that is, A1 and A2 is impossible given H (more generally, pr(A1&A2/H) = 0), and likewise for every other pair, then

    pr(A1 or A2 or ... /H) = pr(A1/H) + pr(A2/H) + ...

    MULTIPLICATION RULE

    Pr(A1&A2&A3&... /H) = Pr(A1/H) * Pr(A2/A1&H) * Pr(A3/A1&A2&H) * ...
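To illustrate the extended multiplication rule, the card example can be carried one draw further: the chance that the first k cards drawn without replacement are all red. This is an illustrative sketch; the function `pr_all_red` is not from the slides.

```python
from fractions import Fraction

def pr_all_red(k, red=26, total=52):
    """Pr(A1 & ... & Ak / H), where Ai = the i-th card drawn is red.

    Each factor is the conditional probability of one more red card
    given that all earlier cards were red.
    """
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(red - i, total - i)
    return p

print(pr_all_red(2))  # 25/102
print(pr_all_red(3))  # 2/17
```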


    (PROBABILISTIC) INDEPENDENCE

    Say B is (probabilistically) independent of A given H if

    pr(B/A&H) = pr(B/H)   (*)

    (*) implies pr(A&B/H) = pr(A/H)*pr(B/H)

    This is usually taken as the definition of independence.

    (*) implies A is independent of B given H, as long as pr(B/H) ≠ 0.

    PROOF: pr(A&B/H) = pr(A/H)*pr(B/A&H) and also pr(A&B/H) = pr(B/H)*pr(A/B&H). Substitute (*) into the first expression, equate the two, and cancel pr(B/H) if it is ≠ 0.

    Summary: as long as everything is ≠ 0, any one of:

    pr(A/B&H) = pr(A/H)

    pr(B/A&H) = pr(B/H)

    pr(A&B/H) = pr(A/H)*pr(B/H)

    implies the other two.
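A quick way to see condition (*) in action is to test it on two dice. In this sketch the events are chosen for illustration only: "first die is even" turns out to be independent of "sum is 7", while "sum is even" is not.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))

def pr(event, given=lambda r: True):
    """pr(event / given) by counting equally likely outcomes."""
    conditioned = [r for r in rolls if given(r)]
    return Fraction(sum(1 for r in conditioned if event(r)), len(conditioned))

first_even = lambda r: r[0] % 2 == 0
sum_is_7 = lambda r: sum(r) == 7
sum_even = lambda r: sum(r) % 2 == 0

# Independent: conditioning on the first die being even changes nothing.
print(pr(sum_is_7, given=first_even), pr(sum_is_7))  # 1/6 1/6

# Not independent: an even sum rules out a sum of 7.
print(pr(sum_is_7, given=sum_even), pr(sum_is_7))    # 0 1/6
```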


    SEQUENCE ANALYSIS: PAIRWISE ALIGNMENT

    One of the basic problems a biologist is faced with when given two DNA or protein sequences is to determine whether they are related.


    The aim of sequence alignment theory is to determine (a) the best alignment between the two sequences and (b) whether two sequences show similarity by pure chance or due to common ancestry.

    Alignment algorithms strive to model the mutational process giving rise to the two sequences.

    The basic mutational processes are:

    1. Substitutions: replace a residue (DNA base or amino acid) with another.

    2. Insertions: add residues to the sequence.

    3. Deletions: remove residues from the sequence.

    Insertions and deletions result in gaps in the alignment.


    When calculating the total score of an alignment

    X: x(1), x(2), x(3), x(4), ..., x(n)

    Y: y(1), y(2), y(3), y(4), ..., y(n)

    (where x(i) and y(i) are now each either a sequence residue or a gap) we assume independence between residues, such that the probability of the alignment is

    Pr(Alignment) = Pr(x(1),y(1)) * Pr(x(2),y(2)) * ... * Pr(x(n),y(n))

    where Pr(x(i),y(i)) = the probability of aligning residue x(i) with y(i).

    Since alignments are usually long, resulting in a very low total probability of the alignment, it is common to use the logarithm of the probability as the score of the alignment:

    log(Pr(Alignment)) = log(Pr(x(1),y(1))) + log(Pr(x(2),y(2))) + ... + log(Pr(x(n),y(n)))

    resulting in a score with the additive property:

    S = s(x(1),y(1)) + s(x(2),y(2)) + ... + s(x(n),y(n))
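The reason for working with logarithms can be seen numerically. The sketch below uses a made-up alignment of 300 columns, each with a hypothetical probability of 0.02; these numbers are purely illustrative.

```python
import math

# Hypothetical per-column probabilities Pr(x(i), y(i)) for a 300-column alignment.
column_probs = [0.02] * 300

# Multiplying them directly underflows ordinary floating point to 0.0.
direct_product = math.prod(column_probs)

# Summing logarithms gives a usable, additive score instead.
log_score = sum(math.log(p) for p in column_probs)

print(direct_product)
print(log_score)
```

The additive form also means the score of an alignment can be built up column by column, which is what alignment algorithms rely on.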


    SUBSTITUTION MATRICES:

    We can always produce an optimal alignment and an alignment score of two sequences, whether they really are related or not. But when is the score high enough to infer homology?

    One way to answer this is by comparing the probability of the alignment when we assume homology to the probability of the alignment when we assume the sequences to be independent.

    Thus we have two models:

    Match model (M): (assuming homology)

    The residues x(i) and y(i) at position i in the alignment occur together with probability pr(x(i),y(i)). Positions in the alignment are still assumed to be independent, but the sequences are assumed to be dependent (that is, pr(x(i),y(i)) ≠ q(x(i))*q(y(i)), where q(a) is the probability of residue a occurring on its own).

    Random model (R): (assuming no homology)

    Now both the sequences and the positions in the alignment are assumed to be independent. Thus at each position i in the alignment, residues x(i) and y(i) occur with probability q(x(i))*q(y(i)).


    We score the alignment using the relative likelihood

    The idea is as follows:

    x and y are really homologous:

    If S is large enough we reject the random model and assume the sequences

    to be homologous.
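One common way to write the relative likelihood of the match model against the random model, assuming the definitions of pr and q given for the two models, is the log-odds score S = sum over i of log( pr(x(i),y(i)) / (q(x(i))*q(y(i))) ). The sketch below applies it to ungapped DNA with a toy model; the values of `p` and `q` are hypothetical and chosen only so that matches are favoured.

```python
import math

BASES = "ACGT"

# Hypothetical random model: each base occurs with probability 1/4.
q = {b: 0.25 for b in BASES}

# Hypothetical match model: identical pairs get 0.20 each (0.80 in total),
# the 12 mismatching pairs share the remaining 0.20.
p = {(a, b): (0.20 if a == b else 0.20 / 12) for a in BASES for b in BASES}

def s(a, b):
    """Log-odds score of aligning residue a with residue b."""
    return math.log(p[(a, b)] / (q[a] * q[b]))

def log_odds_score(x, y):
    """Total score S of an ungapped alignment of equal-length sequences."""
    return sum(s(a, b) for a, b in zip(x, y))

print(log_odds_score("ACGTAC", "ACGTAC"))  # positive: favours the match model
print(log_odds_score("ACGTAC", "TGCATG"))  # negative: favours the random model
```

The table of s(a, b) values is what a substitution matrix stores; a positive total means the alignment is more probable under homology than under independence.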


    HIDDEN MARKOV MODELS

    Markov Chains: A Markov chain is a random process X(0), X(1), X(2), ... which jumps randomly between different states in a state space S = {S(1), S(2), S(3), ...} with something called the memoryless property: what state the process will jump to next depends only on the current state, not on the past ones.

    Which state the process begins in is determined by the initial distribution.

    The process jumps between different states according to the transition probabilities.
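A Markov chain as described above can be simulated with nothing more than an initial distribution and a table of transition probabilities. The three states and all numbers in this sketch are hypothetical, chosen only to make it runnable.

```python
import random

# Hypothetical state space, initial distribution and transition probabilities.
states = ["S1", "S2", "S3"]
initial = {"S1": 0.6, "S2": 0.3, "S3": 0.1}
transition = {
    "S1": {"S1": 0.7, "S2": 0.2, "S3": 0.1},
    "S2": {"S1": 0.3, "S2": 0.4, "S3": 0.3},
    "S3": {"S1": 0.5, "S2": 0.0, "S3": 0.5},
}

def sample_chain(n_steps, rng):
    """Return X(0), ..., X(n_steps); each jump depends only on the current state."""
    x = rng.choices(states, weights=[initial[s] for s in states])[0]
    path = [x]
    for _ in range(n_steps):
        x = rng.choices(states, weights=[transition[x][s] for s in states])[0]
        path.append(x)
    return path

print(sample_chain(10, random.Random(0)))
```

The memoryless property shows up in the loop: the weights for the next state are looked up from the current state `x` alone.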


    The transition probabilities can be organized in an array, or a transition

    matrix.

    Possible outcome sequence: X(0)=a,X(1)=a,X(2)=b,X(3)= c .

    Impossible outcome: X(0)= b,X(1)=b,X(2)=a, X(3)=b.
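Whether an outcome sequence is possible depends on whether every jump in it has a non-zero transition probability. The slide's own matrix is not reproduced here, so the sketch below uses a hypothetical matrix in which the jump from b to b has probability 0; it is chosen only to be consistent with the two sequences listed above.

```python
# Hypothetical initial distribution and transition matrix (not the slide's own).
initial = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
transition = {
    "a": {"a": 0.5, "b": 0.5, "c": 0.0},
    "b": {"a": 0.0, "b": 0.0, "c": 1.0},
    "c": {"a": 1.0, "b": 0.0, "c": 0.0},
}

def path_probability(path):
    """Pr(X(0)=path[0], X(1)=path[1], ...) for a Markov chain."""
    prob = initial[path[0]]
    for current, nxt in zip(path, path[1:]):
        prob *= transition[current][nxt]
    return prob

print(path_probability(["a", "a", "b", "c"]))  # positive: a possible outcome
print(path_probability(["b", "b", "a", "b"]))  # 0.0: an impossible outcome
```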

    BIOLOGICAL MOTIVATION


    MARKOV CHAIN MODEL



    Example

    Assume we have two dice A and B:

    where A generates numbers between 1 and 6 and B generates

    numbers between 1 and 4. The process is as follows:

    1. We randomly choose a die to start with, A or B

    2. We roll the die and record the number

    3. We choose whether to roll the current die again, or switch to the

    other

    4. Repeat steps 2-3.


    To translate this into HMM formulation:

    The state space is S = {A, B}.

    We randomly choose the first die X(0) according to the initial probabilities π = (π(A), π(B)), where Pr(X(0) = A) = π(A).

    The first observed number Y(0) = y appears with probability e_X(0)(y), the probability that die X(0) shows y.

    We switch between states according to the transition probabilities.

    In roll n the state is X(n) and the observed number is Y(n).
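The two-dice process can be simulated directly from this formulation. In the sketch below the initial and transition probabilities are hypothetical, and both dice are assumed fair, so each face of the current die is equally likely.

```python
import random

STATES = ["A", "B"]

# Hypothetical parameters; the slides do not give numerical values.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
faces = {"A": [1, 2, 3, 4, 5, 6], "B": [1, 2, 3, 4]}  # assumed fair dice

def roll_sequence(n, rng):
    """Return (hidden dice X(0..n-1), observed numbers Y(0..n-1))."""
    state = rng.choices(STATES, weights=[initial[s] for s in STATES])[0]
    hidden, observed = [], []
    for _ in range(n):
        hidden.append(state)
        observed.append(rng.choice(faces[state]))  # roll the current die
        # Keep the current die or switch to the other one.
        state = rng.choices(STATES, weights=[transition[state][s] for s in STATES])[0]
    return hidden, observed

hidden, observed = roll_sequence(15, random.Random(0))
print(hidden)
print(observed)
```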


    Now assume that someone else rolled the dice, and we only know the underlying probabilities (initial distribution, transition probabilities, output distribution) and have the observed output.

    This is a hidden Markov model, where the hidden states are which die was used in each roll and the output sequence is the numbers observed.

    HMM theory can help us answer questions like:

    What is the probability of observing such a series given our model?

    What is the most likely underlying sequence of dice (state sequence) giving rise to this output?
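The two questions above are usually answered with the forward algorithm and the Viterbi algorithm respectively; the slides do not name them, so treat this as one standard approach. The sketch applies both to the two-dice model, with hypothetical initial and transition probabilities and fair dice assumed.

```python
STATES = ["A", "B"]

# Hypothetical parameters; the slides do not give numerical values.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}

def emission(state, y):
    """Probability that the given die shows y (A has 6 faces, B has 4)."""
    sides = 6 if state == "A" else 4
    return 1.0 / sides if 1 <= y <= sides else 0.0

def forward(obs):
    """Pr(observed numbers / model), summed over all hidden dice sequences."""
    alpha = {s: initial[s] * emission(s, obs[0]) for s in STATES}
    for y in obs[1:]:
        alpha = {s: emission(s, y) * sum(alpha[r] * transition[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely hidden sequence of dice for the observed numbers."""
    delta = {s: initial[s] * emission(s, obs[0]) for s in STATES}
    back = []
    for y in obs[1:]:
        # Best predecessor of each state, then the best path probability so far.
        prev = {s: max(STATES, key=lambda r: delta[r] * transition[r][s]) for s in STATES}
        delta = {s: delta[prev[s]] * transition[prev[s]][s] * emission(s, y) for s in STATES}
        back.append(prev)
    path = [max(STATES, key=lambda s: delta[s])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]

print(forward([6, 5, 6]))
print(viterbi([6, 5, 6]))  # ['A', 'A', 'A'], since only die A can show 5 or 6
```

For long observation sequences the products become very small, so practical implementations work with logarithms, for the same reason given for alignment scores.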


    PROFILE HMM FOR SEQUENCE ALIGNMENT
