Bioinformatics: Statistical methods for pattern searching
Ulf Schmitz, ulf.schmitz@informatik.uni-rostock.de
Statistical methods for aiding alignment
Bioinformatics and Systems Biology Group, www.sbi.informatik.uni-rostock.de
Outline
1. Expectation Maximization Algorithm
2. Markov Models
3. Hidden Markov models
Expectation Maximization Algorithm
• an algorithm for locating similar sequence patterns in a set of sequences
• suspected parts are then aligned
• an expected scoring matrix representing the distribution of sequence characters in each column of the alignment will be generated
• the pattern is matched to each sequence and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences
• this procedure is repeated until there is no further improvement
Expectation Maximization Algorithm
Seq1, Seq2, Seq3, Seq4, …, Seq10: each 100 nucleotides long

Preliminary local alignment of the sequences (provides initial estimates of frequencies of nucleotides in each motif column):
seq1: … TCAGAATGCAGCATAG …
seq2: … CGCATAGAGCATAGAC …
seq3: … ACAGACAAAAAAATAC …
seq4: … CATAGCAGATACAGCA …
…
seq10: …

Columns not in motif provide background frequencies:
seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG
seq2: TAGACCATAGACCGATACGCGCATAGAGCATAGACACGATAGCATAGCATAGCAT
seq3: TACAGATCAGCAAGAGCCGACAGACAAAAAAATACGAGCAAAACGAGCATTATCG
seq4: TAGGGGACACAGATACAGACATAGCAGATACAGCATAGACATAGACAGATAGCAG
…
seq10: …
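The initial estimates described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the four sequence rows as displayed on the slide, takes the motif to occupy 0-based columns 19 to 34 (read off the displayed alignment), and the names `initial_estimates`, `MOTIF_START` and `MOTIF_LEN` are hypothetical. With only four sequences the numbers will not match the frequency table shown on the next slide.

```python
from collections import Counter

# The four sequence rows as displayed on the slide.
seqs = [
    "ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG",
    "TAGACCATAGACCGATACGCGCATAGAGCATAGACACGATAGCATAGCATAGCAT",
    "TACAGATCAGCAAGAGCCGACAGACAAAAAAATACGAGCAAAACGAGCATTATCG",
    "TAGGGGACACAGATACAGACATAGCAGATACAGCATAGACATAGACAGATAGCAG",
]
MOTIF_START, MOTIF_LEN = 19, 16  # assumption: motif position in each row


def initial_estimates(seqs, start, width, alphabet="ACGT"):
    """Return (background, site) frequency estimates from an alignment."""
    site_counts = [Counter() for _ in range(width)]
    bg_counts = Counter()
    for s in seqs:
        for i, base in enumerate(s):
            if start <= i < start + width:
                site_counts[i - start][base] += 1  # motif column count
            else:
                bg_counts[base] += 1               # flanking (background) count
    bg_total = sum(bg_counts.values())
    background = {b: bg_counts[b] / bg_total for b in alphabet}
    site = [{b: col[b] / len(seqs) for b in alphabet} for col in site_counts]
    return background, site


background, site = initial_estimates(seqs, MOTIF_START, MOTIF_LEN)
print(site[0])  # base frequencies in site column 1
```

In practice a small pseudocount is usually added to every count so that no base gets probability zero; that is left out here to keep the sketch short.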
Expectation Maximization Algorithm
Column frequencies for the aligned motif in seq1 … seq10 (site columns 1 to 16), as a 4x17 matrix (one background column plus 16 site columns):

       Background   Site column 1   Site column 2   …
G      0.27         0.4             0.1             …
C      0.25         0.4             0.1             …
A      0.25         0.2             0.1             …
T      0.23         0.2             0.7             …
Sum    1.00         1.0             1.0

The first column gives the background frequencies in the flanking sequence.
Expectation Maximization Algorithm
Each sequence is scanned for all possible locations for the site to find the most probable location of the site
• the EM algorithm consists of two steps which are repeated consecutively
• step 1 - expectation step
  – column-by-column composition of the found site is used to estimate the probability of finding the site at any position in each of the sequences
  – these probabilities are used to provide expected base or amino acid distribution for each column of the site

[Figure: Sequence 1 with the site (x) placed at successive start positions A, B and C; the remaining positions (o) are flanking sequence]

Use estimates of residue frequencies for each column in the motif to calculate the probability of the motif in this position, and multiply by the background frequencies in the remaining positions.
Expectation Maximization Algorithm
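seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG
(site 1 places site columns 1 to 16 on sequence positions 1 to 16)

Table of column frequencies of each base

       Background   Site column 1   Site column 2   …
G      0.27         0.4             0.1             …
C      0.25         0.4             0.1             …
A      0.25         0.2             0.1             …
T      0.23         0.2             0.7             …
Sum    1.00         1.0             1.0

$$P_{\text{site 1, seq1}} = 0.2 \times 0.1 \times P_3 \times \dots \times P_{16} \times 0.27 \times 0.25 \times P_{19} \times \dots \times P_{100}$$

• 0.2 for A in pos. 1, 0.1 for C in pos. 2
• P_3 … P_16 for the next 14 positions in site
• 0.27 for G in flanking pos. 1, 0.25 for A in flanking pos. 2
• P_19 … P_100 for the next 82 flanking positions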
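The product above can be written as a small function. The sketch below is a hedged illustration, not the slide author's code: `site_probability` is a hypothetical name, and the toy model keeps only the background column and the first two site columns, copied from the slide table as printed (site column 1 as printed adds up to 1.2 rather than 1.0, so the values are illustrative only).

```python
# Frequencies copied from the slide table (background, site columns 1 and 2).
background = {"G": 0.27, "C": 0.25, "A": 0.25, "T": 0.23}
site = [
    {"G": 0.4, "C": 0.4, "A": 0.2, "T": 0.2},  # site column 1 (as printed)
    {"G": 0.1, "C": 0.1, "A": 0.1, "T": 0.7},  # site column 2
]


def site_probability(seq, k, site, background):
    """Probability of seq if the site starts at 0-based position k.

    Positions inside the site use the site column frequencies;
    all remaining positions use the background frequencies.
    """
    width = len(site)
    p = 1.0
    for i, base in enumerate(seq):
        if k <= i < k + width:
            p *= site[i - k][base]
        else:
            p *= background[base]
    return p


# "ACGA" mirrors the first four factors on the slide:
# A in site pos. 1, C in site pos. 2, then G and A in the flanking sequence.
p = site_probability("ACGA", 0, site, background)
print(p)  # 0.2 * 0.1 * 0.27 * 0.25
```

For long sequences the same product is usually accumulated as a sum of logarithms to avoid floating-point underflow.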
Expectation Maximization Algorithm
seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG

[Figure: the 16-column site window (columns 1 to 16) is moved along seq1 one position at a time; the same is done for every sequence up to seq10]

The probability of this best location in seq1, say at site k, is the ratio of the site probability at k to the sum of all site probabilities:

$$P(\text{site } k \text{ in seq1}) = \frac{P_{\text{site } k,\ \text{seq1}}}{P_{\text{site 1, seq1}} + P_{\text{site 2, seq1}} + \dots + P_{\text{site 85, seq1}}}$$
The probability of the site location in each sequence is then calculated in this manner.
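As a minimal sketch of this normalization (the function name `site_posteriors` and the three raw values are made up for illustration), each raw site probability is divided by the sum over all candidate start positions. A 100-nucleotide sequence with a 16-column site has 100 - 16 + 1 = 85 candidate sites, which is where the last term P(site 85) comes from.

```python
def site_posteriors(site_probs):
    """Normalize raw site probabilities so that they sum to 1."""
    total = sum(site_probs)
    return [p / total for p in site_probs]


seq_len, width = 100, 16
n_sites = seq_len - width + 1  # number of possible site start positions

# Hypothetical raw probabilities for three candidate sites.
posteriors = site_posteriors([1e-60, 2e-60, 7e-60])
print(n_sites, posteriors)
```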
Expectation Maximization Algorithm
• step 2 – maximization step
  – the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set

(e.g.) P(site 1 in seq1) = 0.01 and P(site 2 in seq1) = 0.02

seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG

Site 1 (seq1 positions 1 to 16) adds 0.01 to the count of its base in each site column:
+0.01 A C A T A G A C A G T A T A G A
Site 2 (seq1 positions 2 to 17) adds 0.02 in the same way:
+0.02 C A T A G A C A G T A T A G A G
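The weighted counting in this example can be sketched as follows. The helper name `add_weighted_counts` is hypothetical, and the two weights are the example values 0.01 and 0.02 from the slide; a real run would add one weighted contribution for every site position in every sequence.

```python
def add_weighted_counts(counts, seq, k, weight):
    """Add `weight` to the count of the base seen in each site column
    when the site starts at 0-based position k of seq."""
    for col in range(len(counts)):
        counts[col][seq[k + col]] += weight


width = 16
seq1 = "ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG"
counts = [dict.fromkeys("ACGT", 0.0) for _ in range(width)]

add_weighted_counts(counts, seq1, 0, 0.01)  # site 1: ACATAGACAGTATAGA
add_weighted_counts(counts, seq1, 1, 0.02)  # site 2: CATAGACAGTATAGAG

print(counts[0])  # column 1 received 0.01 for A and 0.02 for C
```

Dividing each column's counts by the column total then gives the updated frequency table used in the next expectation step.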
Expectation Maximization Algorithm
• This procedure is repeated for all other site locations and all other sequences.
• A new version of the table of residue frequencies can be built.
• The expectation and maximization steps are repeated until the estimates of the base frequencies do not change.

MEME (Multiple EM for Motif Elicitation):
• is a tool for performing MSAs by the EM method
• see http://www.sdsc.edu/MEME/meme/website/meme.html
Expectation Maximization Algorithm
• the EM algorithm consists of two steps which are repeated consecutively
• step 1 - expectation step
  – column-by-column composition of the found site is used to estimate the probability of finding the site at any position in each of the sequences
  – these probabilities are used to provide expected base or amino acid distribution for each column of the site
• step 2 – maximization step
  – the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set
• step 1 is then repeated until the algorithm converges on a solution
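Putting the two steps together, a compact EM loop might look like the sketch below. It is a simplified illustration, not MEME and not the lecturer's implementation: the names `em_motif`, `site_likelihood` and `normalize` are hypothetical, a random start position per sequence stands in for the preliminary alignment, the background is estimated once from all bases and kept fixed, and a small pseudocount (an assumption) keeps every frequency above zero.

```python
import random

ALPHABET = "ACGT"


def normalize(counts):
    """Turn per-column counts into per-column frequencies."""
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]


def site_likelihood(seq, k, site, background):
    """Probability of seq with the site starting at 0-based position k."""
    width, p = len(site), 1.0
    for i, base in enumerate(seq):
        p *= site[i - k][base] if k <= i < k + width else background[base]
    return p


def em_motif(seqs, width, n_iter=50, pseudo=0.1, tol=1e-6, seed=0):
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    background = {b: sum(s.count(b) for s in seqs) / total for b in ALPHABET}

    # Initial site model from one random window per sequence.
    counts = [dict.fromkeys(ALPHABET, pseudo) for _ in range(width)]
    for s in seqs:
        k = rng.randrange(len(s) - width + 1)
        for c in range(width):
            counts[c][s[k + c]] += 1.0
    site = normalize(counts)

    for _ in range(n_iter):
        counts = [dict.fromkeys(ALPHABET, pseudo) for _ in range(width)]
        for s in seqs:
            # Expectation: probability of the site at every start position.
            raw = [site_likelihood(s, k, site, background)
                   for k in range(len(s) - width + 1)]
            z = sum(raw)
            # Maximization input: weighted base counts for each site column.
            for k, p in enumerate(raw):
                for c in range(width):
                    counts[c][s[k + c]] += p / z
        new_site = normalize(counts)
        change = max(abs(new_site[c][b] - site[c][b])
                     for c in range(width) for b in ALPHABET)
        site = new_site
        if change < tol:  # frequency estimates no longer change
            break
    return site, background


toy_seqs = ["ACGTACGTGG", "TTACGTACCA", "GGACGTACTT"]  # made-up input
site, background = em_motif(toy_seqs, width=4)
print(site)
```

EM only converges to a local optimum, so the result depends on the starting estimate; tools built on this idea typically try several starting points.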
Markov chain models
– a Markov chain model is defined by:
  • a set of states
    • some states emit symbols
    • other states (e.g. the begin state) are silent
  • a set of transitions with associated probabilities
    • the transitions emanating from a given state define a distribution over the possible next states
Markov chain models
• given some sequence x of length L, we can ask how probable the sequence is, based on our model
• for any probabilistic model of sequences, we can write this probability as:

$$\Pr(x) = \Pr(X_L, X_{L-1}, \dots, X_1) = \Pr(X_L \mid X_{L-1}, \dots, X_1)\,\Pr(X_{L-1} \mid X_{L-2}, \dots, X_1) \cdots \Pr(X_1)$$

• key property of a (1st order) Markov chain: the probability of each X_i depends only on X_{i-1}

$$\Pr(x) = \Pr(X_L \mid X_{L-1})\,\Pr(X_{L-1} \mid X_{L-2}) \cdots \Pr(X_2 \mid X_1)\,\Pr(X_1) = \Pr(X_1) \prod_{i=2}^{L} \Pr(X_i \mid X_{i-1})$$
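The first-order factorization translates directly into code. The sketch below works in log space and uses made-up initial and transition probabilities (the tables `initial` and `transition` are hypothetical, chosen only so that each row sums to 1); `markov_log_prob` is an illustrative name.

```python
import math

# Hypothetical parameters; each transition row sums to 1.
initial = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
transition = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.3, "G": 0.3, "T": 0.3},
}


def markov_log_prob(x, initial, transition):
    """log Pr(x) = log Pr(X_1) + sum over i >= 2 of log Pr(X_i | X_{i-1})."""
    logp = math.log(initial[x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += math.log(transition[prev][cur])
    return logp


# Pr("ACG") = Pr(A) * Pr(C | A) * Pr(G | C) = 0.25 * 0.2 * 0.1
print(math.exp(markov_log_prob("ACG", initial, transition)))
```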
Markov chain models
[Figure: 1st order Markov chain]
Markov chain models
Example Application
• CpG islands
  – CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C, G
  – but the regions upstream of genes are richer in CG dinucleotides than elsewhere: these regions are CpG islands
  – useful evidence for finding genes
• Could predict CpG islands with Markov chains
  – one to represent CpG islands
  – one to represent the rest of the genome
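One common way to use the two chains is a log-odds score: sum, over all adjacent base pairs, the log ratio of the island-model transition probability to the rest-of-genome transition probability. The sketch below assumes that approach; the transition values are invented for illustration (only the C row differs between the two models) and are not estimates from real genomes.

```python
import math

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical transition tables; only the row for C differs.
island = {b: dict(UNIFORM) for b in "ACGT"}
island["C"] = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}      # C->G is common
rest = {b: dict(UNIFORM) for b in "ACGT"}
rest["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}        # C->G is rare


def log_odds(x, model_plus, model_minus):
    """Sum of log2 ratios of transition probabilities along x.

    Positive scores favour model_plus (here: the CpG island chain).
    """
    return sum(
        math.log2(model_plus[prev][cur] / model_minus[prev][cur])
        for prev, cur in zip(x, x[1:])
    )


print(log_odds("CGCG", island, rest))  # positive: looks island-like
print(log_odds("CACA", island, rest))  # negative: looks like the rest
```

Scores are often divided by the sequence length so that windows of different sizes can be compared.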
Markov chain models
Selecting the Order of a Markov Chain Model
• Higher order models remember more “history”
• Additional history can have predictive value
• Example:
– predict the next word in this sentence fragment “…finish __” (up, it, first, last, …?)
– now predict it given more history
• “Fast guys finish __”
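The same effect can be shown on a nucleotide string. The sketch below estimates order-k conditional probabilities by counting; `fit_markov` is a hypothetical helper and the training string is made up so that one base of history is ambiguous while two bases are not.

```python
from collections import Counter, defaultdict


def fit_markov(text, order):
    """Estimate Pr(next symbol | previous `order` symbols) by counting."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return {
        ctx: {sym: n / sum(c.values()) for sym, n in c.items()}
        for ctx, c in counts.items()
    }


text = "GCAACTGCAACT"  # made-up training sequence

order1 = fit_markov(text, 1)
order2 = fit_markov(text, 2)

print(order1["C"])   # after C alone, A and T are equally likely
print(order2["GC"])  # after GC, the next base is always A
print(order2["AC"])  # after AC, the next base is always T
```

The price of more history is more parameters: an order-k model over four bases has 4^k contexts, so higher orders need more training data to estimate reliably.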
Hidden Markov Models (HMM)
Hidden State
• We will distinguish between the observed parts of a problem and the hidden parts
• In the Markov models we have considered previously, it is clear which state accounts for each part of the observed sequence
• In another model, there are multiple states that could account for each part of the observed sequence
  – this is the hidden part of the problem
  – states are decoupled from sequence symbols
Hidden Markov models
Markov model: Move from state to state according to the probability distribution of each state and emit the states visited.
Hidden Markov model: Move from state to state in the same way, but emit a symbol according to a probability distribution instead.
Hidden Markov Model
• Red square: match state
• Green diamond: insert state
• Blue circle: delete state
• Arrows indicate the probability of transition from one state to the next.
Hidden Markov Model
N * F L S
N * F L S
N K Y L T
Q * W - T
A. Sequence alignment
B. Hidden Markov model for sequence alignment
BEG M1 M2 M3 M4 END
I0 I1 I2 I3 I4
D1 D2 D3 D4
Probability of sequence: N K Y L T
Path: BEG -> M1 -> I1 -> M2 -> M3 -> M4 -> END
0.33 * 0.05 * 0.33 * 0.05 * 0.33 * 0.05 * 0.33 * 0.05 * 0.33 * 0.05 * 0.5 = 6.1 * 10^-10
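The arithmetic can be checked in a few lines. The sketch assumes that each 0.33 is a transition probability and each 0.05 an emission probability along the path for N K Y L T, with 0.5 as the final transition into END; the variable names are illustrative.

```python
import math

# (transition probability, emission probability) for each of the five
# emitting states on the path M1, I1, M2, M3, M4 (values from the slide).
steps = [(0.33, 0.05)] * 5
end_transition = 0.5

p = end_transition
for trans_p, emit_p in steps:
    p *= trans_p * emit_p

print(p)              # about 6.1e-10
print(math.log10(p))  # log scale is easier to compare across paths
```

Products like this shrink quickly with sequence length, which is why HMM software normally works with log probabilities or log-odds scores.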
Hidden Markov Models
Three Important Questions
• How likely is a given sequence?
• What is the most probable “path” for generating a given sequence?
• How can we learn the HMM parameters given a set of sequences?
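The first two questions are classically answered with the forward algorithm and the Viterbi algorithm. The sketch below shows both on a toy two-state HMM; the states "I" and "B" and every probability are hypothetical values chosen for illustration, not parameters from the lecture. The third question is usually handled with the Baum-Welch algorithm, an EM procedure, which is not shown here.

```python
states = ("I", "B")  # hypothetical: island-like and background-like states
start = {"I": 0.5, "B": 0.5}
trans = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
emit = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs):
    """Question 1: total probability of obs, summed over all state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        f = {s: emit[s][o] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


def viterbi(obs):
    """Question 2: most probable state path for obs and its probability."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        ptr, new_v = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            new_v[s] = v[best] * trans[best][s] * emit[s][o]
        backpointers.append(ptr)
        v = new_v
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backpointers):  # trace back from the final state
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]


print(forward("CGCG"))
print(viterbi("CGCG"))
print(viterbi("ATAT"))
```

Both functions run in time proportional to the sequence length times the square of the number of states, instead of enumerating every possible path.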
Hidden Markov Models
HMM-based homology searching
• formal probabilistic basis and consistent theory behind gap and insertion scores
• HMMs good for profile searches, bad for alignment (due to parametrisation of the models)
• HMMs are slow

Tools:
• HMMER - http://hmmer.wustl.edu/
• SAM - http://cse.ucsc.edu/research/comp/bio/sam.html
Outlook
• Machine learning
• Clustering
Sequence Alignment
Thanks for your attention!