
Probabilistic Sequence Alignment

BMI 877

Colin Dewey

cdewey@biostat.wisc.edu

February 25, 2014

What you’ve seen thus far

• The pairwise sequence alignment task

• Notion of a “best” alignment – scoring schemes

• Dynamic programming algorithms for efficiently finding a “best” alignment

• Variants on the task – global, local, different gap penalty functions

• Heuristic methods for large-scale alignment - BLAST

Tasks addressed today

• How can we express the uncertainty of an alignment?

• How can we estimate the parameters for alignment?

• How can we align multiple sequences to each other?

Picking Alignments

Two candidate alignments of the same mel/pse sequence pair (the match-column markers from the original slide are omitted):

Alignment 1:

mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATA------------
pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC

mel -------CATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAG--------------------CGCAGTCGGCGGCGGCGGC
pse CATTCGGCATGTGAAAA-------------TCCTTATTAATCCAGAACGTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGC

mel GCGCAGTCAGC----------GGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCG---------------
pse -CGCAG-CAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAA--TCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCG

mel ---AAAGAGAGCG-TTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC
pse GCCAAAGAGAGCGATTTTATTTATGTG-----------------ACTGCGCTGCCTG----------GTCCTCGGC

Alignment 1 summary: 27 mismatches, 12 gaps, 116 spaces

Alignment 2:

mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATC
pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC

mel CCAGCGAGA------ACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAAT
pse CATTCGGCATGTGAAAATCCTTATTAATCCAGAAC-------------------------------------------------

mel AAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTG
pse --------------------------------------------GTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGCCGCA

mel GCAGCGA-----------------------------------------------------------------------------
pse GCAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAATCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCGGCCAAAGA

mel ----------------------------------CTGCGAC
pse GAGCGATTTTATTTATGTGACTGCGCTGCCTGGTCCTCGGC

Alignment 2 summary: 45 mismatches, 4 gaps, 214 spaces

An Elusive Cis-Regulatory Element

[Figure: Drosophila melanogaster polytene chromosomes]

>chr3R:2824911-2825170
TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC

Adf1→Antp:06447: binding site for transcription factor Adf1 (Antp: antennapedia)


The Conservation of Adf1→Antp:06447

The same two alignments as above; the Adf1→Antp:06447 binding-site region aligns very differently under each:

Alignment 1 summary: 27 mismatches, 12 gaps, 116 spaces
Alignment 2 summary: 45 mismatches, 4 gaps, 214 spaces

Binding-site region under Alignment 1 (33% identity):

mel TGTGCGTCAGCGTCGGCCGCAACAGCG
pse TGTG-----------------ACTGCG

Binding-site region under Alignment 2 (74% identity):

mel TGTGCGTCAGCGTCGGCCGCAACAGCG
pse TGTGCGCCAGCGTCAGCGCCAGCGCCG

The Polytope

[Figure: the alignment polytope for this sequence pair, with the two candidate alignments above corresponding to different vertices.]

364 vertices, 760 ridges, 398 facets
Sequence lengths = 260 bp, 280 bp

Methodological machinery to be used

• Hidden Markov models (HMMs)
  – Viterbi and Forward algorithms
  – Profile HMMs
  – Pair HMMs
  – Connections to classical sequence alignment

Hidden Markov models

• Generative probabilistic models of sequences
• Explicitly model unobserved (hidden) states that “emit” the characters of the observed sequence
• Primary task of interest is to infer the hidden states given the observed sequence
• Alignment case: hidden states = alignment

Two HMM random variables

• Observed sequence
• Hidden state sequence
• HMM:
  – Markov chain over the hidden state sequence
  – Dependence of each observed character on the corresponding hidden state

The Parameters of an HMM

• as in Markov chain models, we have transition probabilities

  $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$: the probability of a transition from state k to state l, where $\pi$ represents a path (a sequence of states) through the model

• since we’ve decoupled states and characters, we also have emission probabilities

  $e_k(b) = P(x_i = b \mid \pi_i = k)$: the probability of emitting character b in state k

A Simple HMM with Emission Parameters

[Figure: an HMM with a silent begin state (0), four emitting states (1–4), and a silent end state (5). Each emitting state has an emission distribution over {A, C, G, T} (e.g., A 0.4, C 0.1, G 0.2, T 0.3), and edges carry transition probabilities. Callouts mark an emission parameter (the probability of emitting character A in state 2) and a transition parameter (the probability of a transition from state 1 to state 3).]

Three Important Questions

• How likely is a given sequence?
  – the Forward algorithm
• What is the most probable “path” (sequence of hidden states) for generating a given sequence?
  – the Viterbi algorithm
• How can we learn the HMM parameters given a set of sequences?
  – the Forward-Backward (Baum-Welch) algorithm

How Likely is a Given Path and Sequence?

• the probability that the path is taken and the sequence is generated:

  $P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$

  where $\pi_{L+1}$ is the end state (assuming begin/end are the only silent states on the path)
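To make this concrete, here is a minimal Python sketch of the product above for a toy model in the style of the figure. The specific transition values are illustrative placeholders, since the figure’s exact edge labels did not survive transcription.

```python
# transition probabilities a[k][l]; states 0 (begin) and 5 (end) are silent.
# These values are placeholders, not the figure's exact numbers.
a = {0: {1: 0.5, 2: 0.5},
     1: {1: 0.2, 3: 0.8},
     2: {2: 0.1, 4: 0.9},
     3: {3: 0.4, 5: 0.6},
     4: {4: 0.1, 5: 0.9}}

# emission probabilities e[k][b] for the emitting states 1..4
e = {1: {'A': 0.4, 'C': 0.1, 'G': 0.2, 'T': 0.3},
     2: {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
     3: {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
     4: {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}}

def joint_probability(x, path, end=5):
    """P(x, pi) = a_{0,pi_1} * prod_i [ e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}} ]."""
    prob = a[0][path[0]]                     # begin -> first emitting state
    for i, (char, state) in enumerate(zip(x, path)):
        prob *= e[state][char]               # emit x_i in state pi_i
        nxt = path[i + 1] if i + 1 < len(path) else end
        prob *= a[state][nxt]                # transition to the next state
    return prob

# e.g. P("AAC", path 1,1,3) = 0.5 * 0.4*0.2 * 0.4*0.8 * 0.3*0.6
print(joint_probability("AAC", [1, 1, 3]))
```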

How Likely Is A Given Path and Sequence?

[Figure: the example HMM from above, annotated for computing the joint probability of a particular path and sequence.]

How Likely is a Given Sequence?

• We usually only observe the sequence, not the path
• To find the probability of a sequence, we must sum over all possible paths
• but the number of paths can be exponential in the length of the sequence...
• the Forward algorithm enables us to compute this efficiently

How Likely is a Given Sequence: The Forward Algorithm

• A dynamic programming solution
• subproblem: define $f_k(i)$ to be the probability of generating the first i characters and ending in state k
• we want to compute $f_N(L)$, the probability of generating the entire sequence (x) and ending in the end state (state N)
• can define this recursively

The Forward Algorithm

• because of the Markov property, we don’t have to explicitly enumerate every path
• e.g. compute $f_l(i)$ using the values $f_k(i-1)$ for the states k with transitions into l

[Figure: the example HMM from above, showing which states’ Forward values feed into each step of the recursion.]

The Forward Algorithm

• initialization: $f_0(0) = 1$, and $f_k(0) = 0$ for every other state k

  (the probability that we’re in the start state and have observed 0 characters from the sequence)

The Forward Algorithm

• recursion for silent states:

  $f_l(i) = \sum_k f_k(i)\, a_{kl}$

• recursion for emitting states (i = 1…L):

  $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$

The Forward Algorithm

• termination:

  $P(x) = f_N(L) = \sum_k f_k(L)\, a_{kN}$

  (the probability that we’re in the end state and have observed the entire sequence)
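Putting initialization, recursion, and termination together, here is a minimal Forward sketch in Python, reusing the toy a and e dictionaries from the earlier sketch (begin = state 0 and end = state 5 are the only silent states). For long sequences you would work in log space to avoid underflow.

```python
import numpy as np

def forward(x, a, e, n_states=6, end=5):
    """P(x): sum over all paths, computed by dynamic programming."""
    L = len(x)
    f = np.zeros((L + 1, n_states))
    f[0, 0] = 1.0                            # initialization: f_0(0) = 1
    for i in range(1, L + 1):
        for l in e:                          # recursion for emitting states
            f[i, l] = e[l][x[i - 1]] * sum(
                f[i - 1, k] * a.get(k, {}).get(l, 0.0) for k in range(n_states))
    # termination: P(x) = f_N(L) = sum_k f_k(L) * a_{k,N}
    return sum(f[L, k] * a.get(k, {}).get(end, 0.0) for k in range(n_states))

print(forward("TAGA", a, e))
```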

Forward Algorithm Example

[Figure: the example HMM from above, with its emission distributions and transition probabilities.]

• given the sequence x = TAGA
• initialization: $f_0(0) = 1$; $f_k(0) = 0$ for all other states k
• other values are computed by applying the emitting-state recursion and the termination equation above

Three Important Questions

• How likely is a given sequence?
• What is the most probable “path” for generating a given sequence?
• How can we learn the HMM parameters given a set of sequences?

Finding the Most Probable Path: The Viterbi Algorithm

• Dynamic programming approach, again!
• subproblem: define $v_k(i)$ to be the probability of the most probable path accounting for the first i characters of x and ending in state k
• we want to compute $v_N(L)$, the probability of the most probable path accounting for all of the sequence and ending in the end state
• can define $v_k(i)$ recursively
• can use DP to find $v_N(L)$ efficiently

Finding the Most Probable Path: The Viterbi Algorithm

• initialization: $v_0(0) = 1$, and $v_k(0) = 0$ for every other state k

The Viterbi Algorithm

• recursion for emitting states (i = 1…L):

  $v_l(i) = e_l(x_i) \max_k \left[ v_k(i-1)\, a_{kl} \right]$

• recursion for silent states:

  $v_l(i) = \max_k \left[ v_k(i)\, a_{kl} \right]$

• keep track of the most probable path: $\mathrm{ptr}_i(l) = \arg\max_k \left[ v_k(i-1)\, a_{kl} \right]$

The Viterbi Algorithm

• termination:

  $P(x, \pi^*) = \max_k \left[ v_k(L)\, a_{kN} \right]$

• traceback: follow pointers back starting at $\pi_L^* = \arg\max_k \left[ v_k(L)\, a_{kN} \right]$
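A matching Viterbi sketch over the same toy a and e dictionaries from above: the sum in the Forward recursion becomes a max, and pointers are recorded for traceback.

```python
import numpy as np

def viterbi(x, a, e, n_states=6, end=5):
    """Return (probability of the most probable path, the path itself)."""
    L = len(x)
    v = np.zeros((L + 1, n_states))
    ptr = np.zeros((L + 1, n_states), dtype=int)
    v[0, 0] = 1.0                            # initialization: v_0(0) = 1
    for i in range(1, L + 1):
        for l in e:                          # recursion for emitting states
            best_k = max(range(n_states),
                         key=lambda k: v[i - 1, k] * a.get(k, {}).get(l, 0.0))
            v[i, l] = e[l][x[i - 1]] * v[i - 1, best_k] * a.get(best_k, {}).get(l, 0.0)
            ptr[i, l] = best_k               # remember the best predecessor
    # termination: P(x, pi*) = max_k v_k(L) * a_{k,N}
    last = max(range(n_states), key=lambda k: v[L, k] * a.get(k, {}).get(end, 0.0))
    best = v[L, last] * a.get(last, {}).get(end, 0.0)
    path = [last]                            # traceback: follow pointers back
    for i in range(L, 1, -1):
        path.append(ptr[i, path[-1]])
    return best, path[::-1]

print(viterbi("TAGA", a, e))
```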

Forward & Viterbi Algorithms

begin end

• Forward/Viterbi algorithms effectively consider all possible paths for a sequence– Forward to find probability of a sequence– Viterbi to find most probable path

• consider a sequence of length 4…

HMM parameter estimation

• Easy if the hidden path is known for each sequence
• In general, the paths are unknown
• The Baum-Welch (Forward-Backward) algorithm is used to compute maximum likelihood estimates
• The Backward algorithm is the analog of the Forward algorithm, computing probabilities of suffixes of a sequence

Learning Parameters: The Baum-Welch Algorithm

• algorithm sketch:
  – initialize the parameters of the model
  – iterate until convergence:
    • calculate the expected number of times each transition or emission is used
    • adjust the parameters to maximize the likelihood given these expected counts
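Below is a compact Baum-Welch sketch for a simplified setting: a fully connected HMM with an explicit initial distribution instead of silent begin/end states, and no numerical scaling (so it is only suitable for short toy sequences). The sequences and symbol encoding in the usage lines are made up for illustration.

```python
import numpy as np

def baum_welch(seqs, n_states, n_symbols, n_iter=50, seed=0):
    """EM for a fully connected HMM with initial distribution pi."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)             # random row-stochastic init
    E = rng.random((n_states, n_symbols))
    E /= E.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    for _ in range(n_iter):
        A_num = np.zeros_like(A)
        E_num = np.zeros_like(E)
        pi_num = np.zeros(n_states)
        for x in seqs:
            L = len(x)
            f = np.zeros((L, n_states))
            b = np.zeros((L, n_states))
            f[0] = pi * E[:, x[0]]                # forward pass
            for i in range(1, L):
                f[i] = (f[i - 1] @ A) * E[:, x[i]]
            b[L - 1] = 1.0                        # backward pass
            for i in range(L - 2, -1, -1):
                b[i] = A @ (E[:, x[i + 1]] * b[i + 1])
            px = f[L - 1].sum()                   # P(x)
            gamma = f * b / px                    # posterior state probabilities
            pi_num += gamma[0]
            for i in range(L - 1):                # expected transition usage
                A_num += np.outer(f[i], E[:, x[i + 1]] * b[i + 1]) * A / px
            for i in range(L):                    # expected emission usage
                E_num[:, x[i]] += gamma[i]
        A = A_num / A_num.sum(axis=1, keepdims=True)   # M-step: renormalize
        E = E_num / E_num.sum(axis=1, keepdims=True)
        pi = pi_num / pi_num.sum()
    return A, E, pi

# toy usage with a made-up symbol encoding
sym = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
seqs = [[sym[c] for c in s] for s in ("TAGA", "GGTA", "TACAG")]
A_hat, E_hat, pi_hat = baum_welch(seqs, n_states=2, n_symbols=4)
```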

How can we use HMMs for pairwise alignment?

• What is the observed sequence?
  – one of the two sequences?
  – both sequences?
• What is the hidden path?
  – the alignment

Profile HMM for pairwise alignment

• Select one sequence to be observed (the query)
• The other sequence (the reference) defines the states of the HMM
• Three classes of states
  – Match: corresponds to aligned positions
  – Delete: positions of the reference that are deleted in the query
  – Insert: positions of the query that are insertions relative to the reference

Profile HMMs

[Figure: a small profile HMM with match states m1–m3, insert states i0–i3, and delete states d1–d3, connected between start and end states.]

• Match states represent key conserved positions
• Insert states account for extra characters in some sequences
• Delete states are silent; they account for characters missing in some sequences
• Insert and match states have emission distributions over sequence characters, e.g. A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, …

Example Profile HMM

[Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences: a profile HMM with match states, delete states (silent), and insert states.]

Profile HMM considerations

• Odd asymmetry: we have to pick one sequence as the reference
• Models the conditional probability P(X|Y) of the query sequence (X) given the reference sequence (Y)
• Is there something more natural here?
  – Yes, Pair HMMs
• We will revisit Profile HMMs for multiple alignment a bit later

Pair Hidden Markov Models

• each non-silent state emits one or a pair of characters
  – I: insert state (emits a character of the first sequence only)
  – D: delete state (emits a character of the second sequence only)
  – H: homology (match) state (emits a pair of characters, one from each sequence)

PHMM Paths = Alignments

sequence 1: AAGCGC
sequence 2: ATGTC

hidden:    B  H  H  I  I  H  D  H  E
observed:     A  A  G  C  G  -  C      (sequence 1)
              A  T  -  -  G  T  C      (sequence 2)

Transition Probabilities

• probabilities of moving between states at each step (rows: state i; columns: state i+1; blank entries are disallowed transitions)

      H        I      D      E
  B   1-2δ-τ   δ      δ      τ
  H   1-2δ-τ   δ      δ      τ
  I   1-ε-τ    ε             τ
  D   1-ε-τ           ε      τ

Emission Probabilities

• Homology (H) emits pairs of characters, with joint probabilities p(a, b):

      A     C     G     T
  A   0.13  0.03  0.06  0.03
  C   0.03  0.13  0.03  0.06
  G   0.06  0.03  0.13  0.03
  T   0.03  0.06  0.03  0.13

• Insertion (I) and Deletion (D) each emit a single character; the two single-character distributions shown on the slide are:

  A 0.3, C 0.2, G 0.3, T 0.2
  A 0.1, C 0.4, G 0.4, T 0.1

Pair HMM Viterbi

• $v^H(i,j)$, $v^I(i,j)$, $v^D(i,j)$: the probability of the most likely sequence of hidden states generating the length-i prefix of x and the length-j prefix of y, with the last state being H, I, or D, respectively:

  $v^H(i,j) = p_{x_i y_j} \max\{(1-2\delta-\tau)\,v^H(i-1,j-1),\ (1-\varepsilon-\tau)\,v^I(i-1,j-1),\ (1-\varepsilon-\tau)\,v^D(i-1,j-1)\}$

  $v^I(i,j) = q_{x_i} \max\{\delta\,v^H(i-1,j),\ \varepsilon\,v^I(i-1,j),\ \varepsilon\,v^D(i-1,j)\}$

  $v^D(i,j) = q_{y_j} \max\{\delta\,v^H(i,j-1),\ \varepsilon\,v^I(i,j-1),\ \varepsilon\,v^D(i,j-1)\}$

• note that the recurrence relations here allow ID and DI transitions (the $\varepsilon\,v^D$ term in the I recurrence and the $\varepsilon\,v^I$ term in the D recurrence), which the transition table above disallows

PHMM Alignment

• calculate the probability of the most likely alignment:

  $\tau \max\{v^H(n,m),\ v^I(n,m),\ v^D(n,m)\}$

• traceback, as in Needleman-Wunsch (NW), to obtain the sequence of states giving the highest probability, e.g.

  HIDHHDDIIHH...
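Here is a sketch of the Viterbi fill and termination for this pair HMM, using the δ/ε/τ transition table and the emission tables above. Two points are assumptions rather than facts from the slides: which of the two single-character tables belongs to I versus D, and the reuse of ε for the I↔D transitions that the recurrences allow. Probability space is used for clarity; log space is preferable in practice.

```python
import numpy as np

delta, epsilon, tau = 0.2, 0.1, 0.1   # gap open / gap extend / stop (illustrative)

# single-character emission tables from the slides;
# which is I vs. D is an assumption
q_I = {'A': 0.3, 'C': 0.2, 'G': 0.3, 'T': 0.2}
q_D = {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}

def p(a, b):
    """Pair table: 0.13 on the diagonal, 0.06 transitions, 0.03 transversions."""
    if a == b:
        return 0.13
    purines = {'A', 'G'}
    return 0.06 if (a in purines) == (b in purines) else 0.03

def pair_viterbi(x, y):
    """Probability of the most likely alignment (path) of x and y."""
    n, m = len(x), len(y)
    vH = np.zeros((n + 1, m + 1))
    vI = np.zeros((n + 1, m + 1))
    vD = np.zeros((n + 1, m + 1))
    vH[0, 0] = 1.0                     # the begin state transitions like H
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                vH[i, j] = p(x[i-1], y[j-1]) * max(
                    (1 - 2*delta - tau) * vH[i-1, j-1],
                    (1 - epsilon - tau) * vI[i-1, j-1],
                    (1 - epsilon - tau) * vD[i-1, j-1])
            if i > 0:
                vI[i, j] = q_I[x[i-1]] * max(
                    delta * vH[i-1, j],
                    epsilon * vI[i-1, j],
                    epsilon * vD[i-1, j])   # epsilon reused for D->I (assumption)
            if j > 0:
                vD[i, j] = q_D[y[j-1]] * max(
                    delta * vH[i, j-1],
                    epsilon * vI[i, j-1],   # epsilon reused for I->D (assumption)
                    epsilon * vD[i, j-1])
    return tau * max(vH[n, m], vI[n, m], vD[n, m])   # final transition to E

print(pair_viterbi("AAGCGC", "ATGTC"))
```

Traceback is omitted for brevity; as on the slide, it follows pointers back from the best terminal state, exactly as in NW.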

Correspondence with Needleman-Wunsch (NW)

• NW values ≈ logarithms of Pair HMM Viterbi values
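One way to see this correspondence (a sketch): take logarithms of the Viterbi recurrences above. Products become sums, and each recurrence takes the familiar NW shape, with log emission probabilities playing the role of substitution scores and log transition probabilities playing the role of gap penalties:

$$\log v^H(i,j) = \log p_{x_i y_j} + \max \begin{cases} \log(1-2\delta-\tau) + \log v^H(i-1,j-1) \\ \log(1-\varepsilon-\tau) + \log v^I(i-1,j-1) \\ \log(1-\varepsilon-\tau) + \log v^D(i-1,j-1) \end{cases}$$

In the I and D recurrences, $\log \delta$ and $\log \varepsilon$ act as (negative) gap-open and gap-extend penalties.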

Posterior Probabilities

• there are similar recurrences for the Forward and Backward values
• from the Forward values $f^S(i,j)$ and Backward values $b^S(i,j)$, we can calculate the posterior probability that the path passes through a certain state S after generating the length-i and length-j prefixes:

  $P(S \text{ at } (i,j) \mid x, y) = \dfrac{f^S(i,j)\, b^S(i,j)}{P(x, y)}$

Uses for Posterior Probabilities

• sampling of suboptimal alignments
• posterior probability of pairs of residues being homologous (aligned to each other)
• posterior probability of a residue being gapped
• training model parameters (EM)

Posterior Probabilities

[Figure: plot of the posterior probability of each alignment column]

Parameter Training

• supervised training
  – given: sequences and correct alignments
  – do: calculate parameter values that maximize the joint likelihood of the sequences and alignments
• unsupervised training
  – given: sequence pairs, but no alignments
  – do: calculate parameter values that maximize the marginal likelihood of the sequences (summing over all possible alignments)

Multiple Alignment with Profile HMMs

• given a set of sequences to be aligned
  – use Baum-Welch to learn the parameters of the model
  – may also adjust the length of the profile HMM during training
• to compute a multiple alignment given the profile HMM
  – run the Viterbi algorithm on each sequence
  – the Viterbi paths indicate correspondences among sequences

Multiple Alignment with Profile HMMs

More common multiple alignment strategy: progressive alignment

• align pairs of sequences (and intermediate alignments) following a guide tree, merging upward into a full multiple alignment:

  TGTAAC + TGTAC    →  TGTAAC
                       TGT-AC

  ATGTC + ATGTGGC   →  ATGT--C
                       ATGTGGC

  merge the two     →  -TGTAAC
                       -TGT-AC
                       ATGT--C
                       ATGTGGC

  add TGTTAAC       →  -TGTTAAC
                       -TGT-AAC
                       -TGT--AC
                       ATGT---C
                       ATGT-GGC

Classification w/ Profile HMMs

• To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest, e.g. β-galactosidase, β-glucanase, β-amylase, α-amylase
• Given a sequence x, use Bayes’ rule to make the classification:

  $P(\mathrm{family}_k \mid x) = \dfrac{P(x \mid \mathrm{family}_k)\, P(\mathrm{family}_k)}{\sum_j P(x \mid \mathrm{family}_j)\, P(\mathrm{family}_j)}$
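As a small sketch of this classification step, suppose log P(x | family) has already been computed for each family’s profile HMM with the Forward algorithm; the family names, uniform prior, and log-likelihood values below are illustrative assumptions.

```python
import math

def classify(log_likelihoods, log_priors):
    """Bayes' rule: posterior P(family | x) for each family."""
    log_joint = {f: ll + log_priors[f] for f, ll in log_likelihoods.items()}
    m = max(log_joint.values())                   # log-sum-exp for stability
    log_z = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {f: math.exp(v - log_z) for f, v in log_joint.items()}

families = ["beta-galactosidase", "beta-glucanase", "beta-amylase", "alpha-amylase"]
log_priors = {f: math.log(1 / len(families)) for f in families}   # uniform prior
log_likelihoods = dict(zip(families, [-1053.2, -1099.8, -1042.7, -1047.9]))  # assumed
print(classify(log_likelihoods, log_priors))
```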

PFAM

• Large database of protein families
• Each family has a trained Profile HMM
• Example search with a globin sequence: [figure: PFAM search results]

Summary

• Probabilistic models for alignment are more powerful than classical combinatorial alignment algorithms
  – they capture uncertainty in the alignment
  – they allow for principled estimation of parameters
  – they are easily used in classification settings
