lecture 6ll

Upload: jean-rene

Post on 14-Apr-2018


  • 7/27/2019 Lecture 6ll

    1/46

    Bioinformatics 1 -- lecture 6

    Affine gap penalty

    Substitution Matrices

    PAM

    BLOSUM

    Matrix bias in local alignment

  • 7/27/2019 Lecture 6ll

    2/46

    Reminder: scoring an alignment

    A ~ T S F M ~

    A G L S T F M

    The score of the alignment is the sum of the scores of each

    column (match, deletion or insertion) in the alignment.

    Match: look up match score from substitution matrix

    New gap: use gap initiation penalty

    Additional gap: use gap extension penalty

    End gap: Optional, may be zero.

  • 7/27/2019 Lecture 6ll

    3/46

    Affine gap exercise

    ~ ~ A T S F M

    A G L S T F M

    A ~ ~ T S F M

    A G L S T F M

    A ~ T S F M ~

    A G L S T F M

    Which alignment scores the highest?

    Given:

    No end gap penalty.

    BLOSUM score.

    Affine gap penalty (gap opening -2, gap extension -1).
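As a check on the exercise above, here is a minimal Python sketch that scores a fixed alignment column by column. The function name `score_alignment` and the small `SUB` table are illustrative, not from the lecture; `SUB` holds only the BLOSUM62 entries these three alignments need. Internal gaps cost -2 to open and -1 to extend, and gaps before the first or after the last residue of a sequence are treated as free end gaps.

```python
# Only the BLOSUM62 entries needed for the three alignments in the exercise.
SUB = {
    frozenset("A"): 4, frozenset("S"): 4, frozenset("F"): 6, frozenset("M"): 5,
    frozenset("AL"): -1, frozenset("TS"): 1, frozenset("TL"): -1,
    frozenset("FT"): -2, frozenset("MF"): 0,
}


def score_alignment(top, bottom, sub, gap_open=-2, gap_extend=-1, gap_char="~"):
    """Sum column scores; end gaps are free, internal gaps are affine."""
    assert len(top) == len(bottom)

    def residue_span(row):
        # Index of the first and last real residue in a row.
        idx = [i for i, c in enumerate(row) if c != gap_char]
        return idx[0], idx[-1]

    spans = {"top": residue_span(top), "bottom": residue_span(bottom)}
    total = 0
    prev_gap_row = None  # row that carried a gap in the previous column
    for i, (a, b) in enumerate(zip(top, bottom)):
        if a != gap_char and b != gap_char:
            total += sub[frozenset((a, b))]
            prev_gap_row = None
            continue
        row = "top" if a == gap_char else "bottom"
        first, last = spans[row]
        if i < first or i > last:
            prev_gap_row = None  # end gap: no penalty
            continue
        total += gap_extend if prev_gap_row == row else gap_open
        prev_gap_row = row
    return total


scores = [
    score_alignment("~~ATSFM", "AGLSTFM", SUB),
    score_alignment("A~~TSFM", "AGLSTFM", SUB),
    score_alignment("A~TSFM~", "AGLSTFM", SUB),
]
print(scores)
```

Under these assumptions the second alignment scores highest, because the two-residue gap costs only -3 and keeps four good matches.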

  • 7/27/2019 Lecture 6ll

    4/46

    Worksheet for affine gap local dynamic programming

    [Worksheet: blank scoring grids for the M, I, and D matrices, with Q S I M P R P down the side and L T V K P across the top]

    Fill in the scores and the traceback letters using the BLOSUM62 matrix and

    gap opening = -2, gap extension = -1. Start at 0. End at the maximum.

    No end gaps (meaning no starting gaps).

  • 7/27/2019 Lecture 6ll

    5/46

    Rules for affine gap penalty DP

    Do not penalize end-gaps.

    I can follow D, and vice versa, but it is a gap opening.

    For each box, write the score and the traceback letter (M, I, or D).

    Affine gap penalty worksheet

    M = match matrix: scores for alignments currently in a match state. Fill in M as the max over three possible diagonal arrows.

    I = insertion matrix: scores for alignments with a gap in the first sequence. Fill in I as the max over three possible down arrows.

    D = deletion matrix: scores for alignments with a gap in the second sequence. Fill in D as the max over three possible right arrows.

    [Worksheet: three blank grids (M, I, D), each with A D P Q F G across the top and A K L K L D Q F G P down the side]

    Note: I wrote the sequences in the gap rows!!

    M[i,j] = MAX { M[i-1,j-1] + match score,  I[i-1,j-1] + match score,  D[i-1,j-1] + match score }

    I[i,j] = MAX { M[i,j-1] - 2,  I[i,j-1] - 1,  D[i,j-1] - 2 }

    D[i,j] = MAX { M[i-1,j] - 2,  I[i-1,j] - 2,  D[i-1,j] - 1 }
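The three recurrences above can be sketched in Python as follows. This is one possible reading, not the lecture's own code: the function name `local_affine_score`, the 0 floor inside the M update (the "start at 0" rule for local alignment), the -infinity borders for I and D, and the toy `identity_score` function are all assumptions. It returns only the best score and does no traceback.

```python
NEG_INF = float("-inf")


def local_affine_score(a, b, score, gap_open=-2, gap_extend=-1):
    """Best local alignment score using the M, I, D matrices."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            # Diagonal step; the 0 lets a local alignment start here.
            M[i][j] = s + max(0, M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1])
            # Step along j: opening from M or D, extending from I.
            I[i][j] = max(M[i][j - 1] + gap_open,
                          I[i][j - 1] + gap_extend,
                          D[i][j - 1] + gap_open)
            # Step along i: opening from M or I, extending from D.
            D[i][j] = max(M[i - 1][j] + gap_open,
                          I[i - 1][j] + gap_open,
                          D[i - 1][j] + gap_extend)
            best = max(best, M[i][j])  # end at the maximum
    return best


def identity_score(x, y):
    # Toy scoring (assumption): +2 for identical residues, -1 otherwise.
    return 2 if x == y else -1


print(local_affine_score("AAAGGG", "AAATGGG", identity_score))
```

In the example, six matches (12) minus one gap opening (-2) beats any ungapped local alignment, so the gap is accepted.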

  • 7/27/2019 Lecture 6ll

    6/46

    BLOSUM matrix for match scores

    Two 20x20 substitution matrices are used: BLOSUM & PAM.

       A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
    A  4  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -2
    C     9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
    D        6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -3
    E           5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -2
    F              6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1  3
    G                 6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -3
    H                    8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2  2
    I                       4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1
    K                          5 -2 -1  0 -1  1  2  0 -1 -2 -3 -2
    L                             4  2 -3 -3 -2 -2 -2 -1  1 -2 -1
    M                                5 -2 -2  0 -1 -1 -1  1 -1 -1
    N                                   6 -2  0  0  1  0 -3 -4 -2
    P                                      7 -1 -2 -1 -1 -2 -4 -3
    Q                                         5  1  0 -1 -2 -2 -1
    R                                            5 -1 -1 -3 -3 -2
    S                                               4  1 -2 -3 -2
    T                                                  5  0 -2 -2
    V                                                     4 -3 -1
    W                                                       11  2
    Y                                                           7

    BLOSUM62

    Each number is the score for aligning a single pair of amino acids.

    What is the score for this alignment?:

    ACEPGAA

    ASDDGTV
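To answer the question above mechanically, here is a small Python sketch that rebuilds the symmetric lookup from the upper triangle shown on this slide and sums the column scores. The names `ORDER`, `UPPER_TRIANGLE`, `BLOSUM62`, and `ungapped_score` are illustrative choices, not part of the lecture.

```python
ORDER = "ACDEFGHIKLMNPQRSTVWY"

# Upper triangle of BLOSUM62, one row per residue, starting at the diagonal.
UPPER_TRIANGLE = """
4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2
9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3
5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2
6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3
6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3
8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2
4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1
5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2
4 2 -3 -3 -2 -2 -2 -1 1 -2 -1
5 -2 -2 0 -1 -1 -1 1 -1 -1
6 -2 0 0 1 0 -3 -4 -2
7 -1 -2 -1 -1 -2 -4 -3
5 1 0 -1 -2 -2 -1
5 -1 -1 -3 -3 -2
4 1 -2 -3 -2
5 0 -2 -2
4 -3 -1
11 2
7
"""

BLOSUM62 = {}
for i, line in enumerate(UPPER_TRIANGLE.strip().splitlines()):
    for k, value in enumerate(map(int, line.split())):
        x, y = ORDER[i], ORDER[i + k]
        BLOSUM62[x, y] = value
        BLOSUM62[y, x] = value  # the matrix is symmetric


def ungapped_score(top, bottom):
    """Sum of per-column substitution scores for two equal-length strings."""
    assert len(top) == len(bottom)
    return sum(BLOSUM62[x, y] for x, y in zip(top, bottom))


alignment_score = ungapped_score("ACEPGAA", "ASDDGTV")
print(alignment_score)
```

The seven columns contribute 4, -1, 2, -1, 6, 0, and 0.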

  • 7/27/2019 Lecture 6ll

    7/46

    Substitution matrices

    Used to score aligned positions, usually of amino acids.

    Expressed as the log-likelihood ratio of mutation (or log-odds ratio)

    Derived from multiple sequence alignments

    Two commonly used matrices: PAM and BLOSUM

    PAM = percent accepted mutations (Dayhoff)

    BLOSUM = Blocks substitution matrix (Henikoff)

    Read: Mount pp94-113

  • 7/27/2019 Lecture 6ll

    8/46

    PAM

    Evolutionary time is measured in Percent Accepted

    Mutations, or PAMs

    One PAM of evolution means 1% of the residues/bases have

    changed, averaged over all 20 amino acids.

    To get the relative frequency of each type of mutation, we

    count the times it was observed in a database of multiple

    sequence alignments.

    Based on global alignments

    Assumes a Markov model for evolution.

    M Dayhoff, 1978

  • 7/27/2019 Lecture 6ll

    9/46

    BLOSUM

    Based on database of ungapped local alignments

    (BLOCKS)

    Alignments have lower similarity than PAM alignments.

    BLOSUM number indicates the percent identity level of

    sequences in the alignment. For example, for BLOSUM62

    sequences with approximately 62% identity were counted.

    Some BLOCKS represent functional units, providing

    validation of the alignment.

    Henikoff & Henikoff, 1992

  • 7/27/2019 Lecture 6ll

    10/46

    A multiple sequence alignment is made using many pairwise sequence alignments

    Multiple Sequence Alignment

  • 7/27/2019 Lecture 6ll

    11/46

    Columns in a MSA have a common evolutionary history

    By aligning the sequences, we assert that the aligned

    residues in each column had a common ancestor.

  • 7/27/2019 Lecture 6ll

    12/46

    How do you count the mutations?

    Assume any of the sequences could be the ancestral one.

    L K F R L S K K P

    L K F R L S K K P

    L K F R L T K K P

    L K F R L S K K P

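    L K F R L S R K P

    L K F R L T R K P

    L K F R L ~ K K P

    [Figure: one highlighted column of the alignment, reading G G W W N G G from top to bottom, with arrows from the first G to the six residues below it: G W W N G G]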

    If the first sequence was the ancestor,

    then it mutated to a W twice, to N

    once, and conserved G three times.

  • 7/27/2019 Lecture 6ll

    13/46

    Or, we could have picked...

    L K F R L S K K P

    L K F R L S K K P

    L K F R L T K K P

    L K F R L S K K P

    L K F R L S R K P

    L K F R L T R K P

    L K F R L ~ K K P

    [Figure: the same highlighted column, G G W W N G G, now with a W chosen as the ancestor]

    If W was the ancestor, then it mutated

    to a G four times, to N once, and was

    conserved once.

  • 7/27/2019 Lecture 6ll

    14/46

    Substitution matrices are symmetrical

    Since we don't know which sequence came first, we don't

    know whether G-->W or W-->G is correct. So we count this as one mutation of each type:

    G-->W and W-->G. In the end the 20x20 matrix will have

    the same number for elements (i,j) and (j,i).

    (That's why we only show the upper triangle.)

  • 7/27/2019 Lecture 6ll

    15/46

    Summing the substitution counts

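    [Figure: one column of an MSA (G G W W N G G) beside a symmetrical count matrix with rows and columns G, W, N. Counting the first G against the six residues below it gives G-G = 3, G-W = 2, G-N = 1.]

    We assume the ancestor is one of the observed amino acids,

    but we don't know which, so we try them all.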

  • 7/27/2019 Lecture 6ll

    16/46

    Next possible ancestor...

    [Figure: the second G counted against the five residues below it: G-G = 2, G-W = 2, G-N = 1.]

    We already counted this residue against all others, so we blank it out.

  • 7/27/2019 Lecture 6ll

    17/46

    Next...

    [Figure: the first W counted against the four residues below it: W-W = 1, G-W = 2, W-N = 1.]

  • 7/27/2019 Lecture 6ll

    18/46

    Next...

    [Figure: the second W counted against the three residues below it: W-W = 0, G-W = 2, W-N = 1.]

  • 7/27/2019 Lecture 6ll

    19/46

    Next...

    [Figure: the N counted against the two residues below it: N-N = 0, G-N = 2, W-N = 0.]

  • 7/27/2019 Lecture 6ll

    20/46

    Next...

    [Figure: the next-to-last G counted against the last G: G-G = 1, G-W = 0, G-N = 0.]

  • 7/27/2019 Lecture 6ll

    21/46

    Last...

    [Figure: the last G has no residues below it: 0, 0, 0.]

    (no counts for last seq.)

  • 7/27/2019 Lecture 6ll

    22/46

    Summing the substitution counts

    [Figure: the MSA column G G W W N G G beside the completed count matrix: G-G = 6, G-W = 8, G-N = 4, W-W = 1, W-N = 2, N-N = 0. TOTAL = 21]

    Now we do this for every

    column in every multiple

    sequence alignment...
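The try-every-ancestor counting walked through on the preceding slides is the same as counting every unordered pair of residues in the column once. A short Python sketch, with the helper name `count_pairs` chosen here for illustration:

```python
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count each unordered residue pair in one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        # Sort so that (G, W) and (W, G) land in the same cell.
        counts[tuple(sorted((x, y)))] += 1
    return counts


counts = count_pairs("GGWWNGG")
print(dict(counts), sum(counts.values()))
```

For a column of 7 residues there are 7*6/2 = 21 pairs, which matches the TOTAL on the slide.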

  • 7/27/2019 Lecture 6ll

    23/46

    log odds

    log odds ratio = log2(observed/expected )

    Substitutions (and many other things in bioinformatics) are

    expressed as a "likelihood ratio", or "odds ratio" of the

    observed data over the expected value. Here, likelihood and

    odds are used as synonyms for probability.

    So log odds is the log (usually base 2) of the odds ratio.

  • 7/27/2019 Lecture 6ll

    24/46

    Getting log-odds from counts

    Observed probability of G->G

    qGG = P(G->G)=6/21 = 0.29

    Expected probability of G->G,

    eGG = 0.57*0.57 = 0.33

    odds ratio = qGG/eGG = 0.29/0.33

    log odds ratio = log2(qGG/eGG )

    If the lod is < 0, then the mutation is less likely than expected by chance. If it is > 0, it is more likely.

    P(G) = 4/7 = 0.57
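Carrying the slide's arithmetic through to the final number, as a small Python check (variable names are illustrative):

```python
from math import log2

q_gg = 6 / 21      # observed: 6 G-G pairs out of 21 pairs in the column
p_g = 4 / 7        # 4 of the 7 residues in the column are G
e_gg = p_g * p_g   # expected if residues paired at random
lod_gg = log2(q_gg / e_gg)

print(round(q_gg, 2), round(e_gg, 2), round(lod_gg, 2))
```

The result is slightly negative, so in this one column G-G pairs are a little rarer than chance would predict.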

  • 7/27/2019 Lecture 6ll

    25/46

    Different observations, same expectation

    Gs spread over many columns:

    G G
    G A
    W G
    W A
    N G
    G A
    G A

    P(G) = 0.50, eGG = 0.25, qGG = 9/42 = 0.21, lod = log2(0.21/0.25) = -0.2

    Gs concentrated:

    G W
    G A
    G W
    G A
    G W
    G A
    G A

    P(G) = 0.50, eGG = 0.25, qGG = 21/42 = 0.5, lod = log2(0.50/0.25) = 1
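The two cases above can be recomputed from the columns themselves. The sketch below assumes the two-column alignments read top to bottom as shown on the slide; the function name `lod_same_residue` is hypothetical. Pairs are counted within each column, and the expected value is the squared overall frequency of the residue.

```python
from itertools import combinations
from math import log2


def lod_same_residue(columns, residue):
    """log2(observed/expected) for residue-residue pairs over all columns."""
    pairs = 0
    same = 0
    for col in columns:
        for x, y in combinations(col, 2):
            pairs += 1
            if x == residue and y == residue:
                same += 1
    total = sum(len(col) for col in columns)
    p = sum(col.count(residue) for col in columns) / total
    return log2((same / pairs) / (p * p))


spread = ["GGWWNGG", "GAGAGAA"]        # Gs spread over both columns
concentrated = ["GGGGGGG", "WAWAWAA"]  # all Gs in one column

lod_spread = lod_same_residue(spread, "G")
lod_concentrated = lod_same_residue(concentrated, "G")
print(round(lod_spread, 2), round(lod_concentrated, 2))
```

Both data sets have the same composition (half G), so the expected value is identical; only the observed pairing differs.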

  • 7/27/2019 Lecture 6ll

    26/46

    Different observations, same expectation

    G and W seen together more often than expected:

    G G
    G A
    W G
    A W
    N G
    G A
    G A

    P(G) = 0.50, P(W) = 0.14, eGW = 0.07, qGW = 7/42 = 0.17, lod = log2(0.17/0.07) = 1.3

    Gs and Ws not seen together:

    G W
    G A
    G W
    G A
    G W
    G A
    A G

    P(G) = 0.50, P(W) = 0.14, eGW = 0.07, qGW = 3/42 = 0.07, lod = log2(0.07/0.07) = 0

    In class exercise:

  • 7/27/2019 Lecture 6ll

    27/46

    Get the substitution value for P->Q

    ...given a very small database.

    PQPPQQQPQQP

    PQPPPQQQP

    P(P)=_____, P(Q)=_____

    ePQ = _____

    qPQ = ___/___ =_____

    lod = log2(qPQ/ePQ) = ____

    [Blank count matrix with rows and columns P, Q]

    In class exercise:

  • 7/27/2019 Lecture 6ll

    28/46

    Markovian evolution and PAM

    A Markov process is one where the likelihood of the next

    "state" depends only on the current state.

    Markovian evolution assumes that base changes (or

    amino acid changes) occur at a constant rate and

    depend only on the identity of the current base (or amino acid).

    [Figure: one position in a protein over millions of years, G G A V V G, with the steps labeled .9946, .0002, .0021, .0001, .9932]

  • 7/27/2019 Lecture 6ll

    29/46

    Markovian evolution is an extrapolation

    Start with all G's. Wait 1 million years. Where

    do they go?

    Using PAM1, we expect them to mutate to

    about 0.0002 A, 0.0007 P, 0.9946 G, etc.

    Wait another million years.

    The new A's mutate according to PAM1 for A's,

    P's mutate according to PAM1 for P's, etc.

    Wait another million, etc., etc., etc.

    What is the final distribution of amino acids at

    the positions that were once G's?

    [Figure: the PAM1 matrix]

  • 7/27/2019 Lecture 6ll

    30/46

    Matrix multiplication

    To start we have 100% G, 0% everything else:

    (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

    PAM1 x (0, 0, 0, 0, 1, 0, ..., 0) = (P(G->A), P(G->C), P(G->D), P(G->E), P(G->G), P(G->F), P(G->H), P(G->I), P(G->K), P(G->L), P(G->M), P(G->N), P(G->P), P(G->Q), P(G->R), P(G->S), P(G->T), ...)

    After 1 MY we have each amino acid according to the PAM probabilities:

    (0.0001, 0.0001, 0.00015, 0.00005, 0.99943, 0.00002, 0.00005, 0.00001, 0.0002, 0.00015, 0.00002, 0.00003, 0.0006, 0.0006, 0.00002, ...)

  • 7/27/2019 Lecture 6ll

    31/46

    Matrix multiplication

    After 2 MY each amino acid has mutated again according to the PAM1 probabilities:

    PAM1 x (0.0001, 0.0001, 0.00015, 0.00005, 0.99943, 0.00002, 0.00005, 0.00001, 0.0002, 0.00015, 0.00002, 0.00003, 0.0006, 0.0006, 0.00002, ...) = etc.

  • 7/27/2019 Lecture 6ll

    32/46

    250 PAMs

    PAM1 x PAM1 x ... x PAM1 (250 times) = (PAM1)^250 = PAM250
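The extrapolation can be illustrated with a toy example. The 3-state matrix `TOY_PAM1` below is invented for illustration and is not the real PAM1; each column gives the probabilities of what one residue becomes in one step, so columns sum to 1. The helpers `mat_vec`, `mat_mul`, and `mat_pow` are plain-Python stand-ins for the matrix multiplication on the slides.

```python
STATES = ["G", "A", "P"]

# Toy values (assumption). Column k = fate of residue k after one step.
TOY_PAM1 = [
    [0.990, 0.004, 0.002],  # probability of ending as G
    [0.006, 0.995, 0.003],  # probability of ending as A
    [0.004, 0.001, 0.995],  # probability of ending as P
]


def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def mat_mul(a, b):
    """Square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Repeated multiplication: m x m x ... x m, `power` times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


start = [1.0, 0.0, 0.0]  # 100% G, 0% everything else
after_1 = mat_vec(TOY_PAM1, start)
after_250 = mat_vec(mat_pow(TOY_PAM1, 250), start)
print(after_1)
print([round(x, 3) for x in after_250])
```

After one step the distribution is just the G column of the matrix; after 250 steps the positions that were once G are spread over all three residues, but the probabilities still sum to 1.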

  • 7/27/2019 Lecture 6ll

    33/46

    Differences between PAM and BLOSUM

    PAM

    PAM matrices are based on global alignments of closely related proteins.

    The PAM1 is the matrix calculated from comparisons of sequences with no more

    than 1% divergence.

    Other PAM matrices are extrapolated from PAM1 using an assumed Markov

    chain.

    BLOSUM

    BLOSUM matrices are based on local alignments.

    BLOSUM 62 is a matrix calculated from comparisons of sequences with approx

    62% identity.

    All BLOSUM matrices are based on observed alignments; they are not

    extrapolated from comparisons of closely related proteins.

    BLOSUM 62 is the default matrix in BLAST (the database search program). It is

    tailored for comparisons of moderately distant proteins. Alignment of distant

    relatives may be more accurate with a different matrix.

  • 7/27/2019 Lecture 6ll

    34/46

    Increasing sophistication in match scoring

    1. Identity score.

    2. Genetic code changes (mutations of one base more likely than 2 or 3). (1966)

    3. Matrices based on chemical similarity of amino acids. (1985)

    4. Matrices based on multiple sequence alignments (PAM (1978), BLOSUM (1992))

    5. Dipeptide substitution matrices (i.e. AG --> DG, etc.) (1994)

    6. Class specific substitution matrices (D. Jones' transmembrane protein matrix) (1994)

    7. Structure-based substitution matrices (2000)

    8. Position-specific, structure-based substitution matrices (2006)

  • 7/27/2019 Lecture 6ll

    35/46

    PAM250

  • 7/27/2019 Lecture 6ll

    36/46

    BLOSUM62

  • 7/27/2019 Lecture 6ll

    37/46

    Which substitution matrix favors...

    conservation of polar residues

    conservation of non-polar residues

    conservation of C, Y, or W

    polar-to-nonpolar mutations

    polar-to-polar mutations

    PAM250 BLOSUM62

  • 7/27/2019 Lecture 6ll

    38/46

    Local alignment, revisited

    Starts at zero (the score of a non-alignment)

    Ends at the maximum score anywhere in the matrix.

    Global Local

    Advantages:

    Does not care if the aligned region has long "tails".

    Can align pieces of one sequence to pieces of another.

    Multi-domain sequences are OK

  • 7/27/2019 Lecture 6ll

    39/46

    Local alignment, revisited

    Global Local

    Disadvantages:

    Fails on multidomain alignments if large gaps are present.

    Success depends on an additional parameter: Matrix Bias

  • 7/27/2019 Lecture 6ll

    40/46

    Local Alignment with matrix bias

    Matrix bias = a constant added to the

    substitution score.

    Has the same effect as starting the

    alignment at a number other than zero.

    [Figure: scoring grid with P G T S F E P down the side and A T S F M across the top]

    A(i,j) = MAX { A(i-1,j-1) + match score + bias,  A(i,j-1) + gap*,  A(i-1,j) + gap,  0 + match score }

    *linear gap penalty.
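A literal Python reading of the recurrence above, as a sketch only. The function name `local_score_with_bias`, the zero borders, the gap value of -2, and the toy `identity_score` function are assumptions; the sequences are the ones on the slide's grid. It returns the best cell value and does no traceback.

```python
def local_score_with_bias(a, b, score, gap=-2, bias=0):
    """Best local score with a linear gap penalty and a matrix bias."""
    n, m = len(a), len(b)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            A[i][j] = max(
                A[i - 1][j - 1] + s + bias,  # extend along the diagonal
                A[i][j - 1] + gap,           # gap (linear penalty)
                A[i - 1][j] + gap,           # gap (linear penalty)
                0 + s,                       # start a new alignment here
            )
            best = max(best, A[i][j])
    return best


def identity_score(x, y):
    # Toy scoring (assumption): +2 for identical residues, -1 otherwise.
    return 2 if x == y else -1


no_bias = local_score_with_bias("PGTSFEP", "ATSFM", identity_score, bias=0)
with_bias = local_score_with_bias("PGTSFEP", "ATSFM", identity_score, bias=1)
print(no_bias, with_bias)
```

With bias = 1 a mismatch on the diagonal costs nothing in this toy scoring, so the T-S-F match can be extended through neighboring mismatches without losing score, while gaps stay as expensive as before.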

  • 7/27/2019 Lecture 6ll

    41/46

    Effect of matrix bias

    Higher matrix bias favors matches over

    gaps. RESULT: more matches, longer local alignments.

    [Figure: alignments at increasing matrix bias]

  • 7/27/2019 Lecture 6ll

    42/46

    How do we know when the alignment is

    correct?

    2DRC:A 1/2 MISLIAALAVDRVIGMENAM-PFNLPADLAWFKRNTL-------DKPVIMGRHTWESIG-

    1DRF:_ 3/4 SLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPE

    2DRC:A 52/53 --RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEIMVIGGGRVYE

    1DRF:_ 63/64 KNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK

    2DRC:A 102/103 QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPDDWESVF------SEFHDADAQNSHSYCF

    1DRF:_ 123/124 EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE---KGIKYKF

    2DRC:A 154/155 EILERR

    1DRF:_ 180/181 EVYEKN

    Compare to known structure-based alignments

  • 7/27/2019 Lecture 6ll

    43/46

    Aligning fibronectin

    Fibronectin is a long multidomain

    protein involved in

    adhesion/migration of cells, blood

    clotting, signaling, and

    interactions with the extracellular

    matrix (ECM). Interacts with collagen, fibrin, heparin and

    integrins.

    It is made up of many copies of at

    least 3 "modules". Small differences within modules cause

    important biological effects.

    How do you align fibronectins?

  • 7/27/2019 Lecture 6ll

    44/46

    Multiple local alignments?

    II

    III

    II II II III II III

    Fragment-based alignment methods find all local

    alignments. (BLAST, FASTA)

    One way would be to select the maximum score, then the next highest score, and so on to get all of the possible

    alignments.

  • 7/27/2019 Lecture 6ll

    45/46

    You have seen....

    Dynamic programming:

    Global alignment

    Global/local alignment (no end gaps. 3 ways to do it.)

    Local alignment

    Linear gap penalty

    Affine gap penalty

    How many ways are there to do DP?

  • 7/27/2019 Lecture 6ll

    46/46

    In class exercise: local alignment using BestFit

    Start SeqLab.

    Go to Editor mode, with no sequences.

    Download two protein sequences from the PIR database (File/Add sequences/Databases):

    PIR2:A46444

    PIR2:S58653

    Select both. Run BestFit (Functions/Pairwise comparison/BestFit).

    Go to Options: set gap creation (range 1-20), gap extension

    (range 0-10). Run. Look at the results. (Compare to what you see on