biol335: sequence alignment
Post on 26-Jun-2015
Sequence alignment
Paul Gardner
March 19, 2015
Paul Gardner Sequence alignment
What is homology?
- A homologous trait is any characteristic of organisms that is derived from a common ancestor [1]. E.g. vertebrate forelimbs.
- Contrast this with analogous traits: similarities between organisms that were not present in their last common ancestor [1]. E.g. the wings of pterosaurs, bats & birds.
1. Text & image adapted from: http://en.wikipedia.org/wiki/Homology_(biology)
How can we measure evolutionary relatedness?
- Is a jellyfish more closely related to a sponge or to a custard apple?
- How can we test this?
Barton et al. (2007) Evolution. Cold Spring Harbor Laboratory Press
The homology search problem
- Given a biological sequence, can we identify homologues in other species?
Allers & Mevarech (2005) Archaeal genetics the third way. Nature Reviews Genetics
Given gazillions of biomolecular sequences, how can we estimate homology?
- By sequence comparison! Usually by alignment.
- NOTE: sequence similarity is used to infer homology. Homology is binary: something either is or isn’t homologous. Sequences can be 87% similar; they can’t be 87% homologous!
- Alignments are two or more sequences placed on top of each other, identifying functional, structural, or evolutionary relationships between the sequences.
[Figure: diagram labelling sequences 2, 4, 5, 3 and 1; image not recoverable]

>1
GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA
>2
GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA
>3
UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG
>4
GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG
>5
CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC

1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA--
2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG--
3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA--
4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG
5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA--
                   **               *

1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A
2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A
3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G
4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG
5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C
                           **** * **

[Figure: consensus structure diagram for the alignment, apparently drawn with IUPAC nucleotide ambiguity codes; image not recoverable]
Did you do your homework?
- Play: http://phylo.cs.mcgill.ca
How should a sequence alignment be interpreted & scored?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query  1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
               || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct  828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

Query  57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
                | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct  828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
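The "Identities" and "Gaps" lines of the report can be checked directly from the two aligned rows. The sketch below is only an illustration: the function name `alignment_stats` and the constants `QUERY` and `SUBJECT` are made up here, and the strings are simply the two alignment blocks above joined end to end.

```python
# Aligned rows copied from the BLAST report above (both blocks concatenated).
QUERY = (
    "CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC"
    "TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG"
)
SUBJECT = (
    "CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC"
    "ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG"
)


def alignment_stats(query, subject):
    """Return (identities, gaps, alignment_length) for two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    # An identity is a column where both rows carry the same residue.
    identities = sum(q == s and q != "-" for q, s in zip(query, subject))
    # A gap column is one where either row carries a gap character.
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, subject))
    return identities, gaps, len(query)


identities, gaps, length = alignment_stats(QUERY, SUBJECT)
print(f"Identities = {identities}/{length} ({round(100 * identities / length)}%)")
print(f"Gaps = {gaps}/{length} ({round(100 * gaps / length)}%)")
```

The counts agree with the report (78 identities and 6 gap columns over 106 alignment columns).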
Let's play a game of coin toss
- How can we detect cheating or bias?
  - coin toss
  - ./coin-toss.pl 100
  - ./coin-toss2.pl 100
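One way to ask whether a run of tosses looks biased is an exact binomial test. The sketch below is not the course's `coin-toss.pl` script (its contents are not shown in these slides); the function name and the choice of a two-sided test are assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(heads, tosses, p_heads=0.5):
    """Exact two-sided binomial test.

    Sums the probabilities of every outcome that is no more likely than the
    observed number of heads, assuming each toss lands heads with p_heads.
    """
    def prob(k):
        return comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)

    observed = prob(heads)
    # Small tolerance so outcomes tied with the observed one are included.
    return sum(prob(k) for k in range(tosses + 1) if prob(k) <= observed * (1 + 1e-9))


for heads in (50, 60, 80):
    print(heads, "heads in 100 tosses: p =", binomial_two_sided_p(heads, 100))
```

A small p-value says the observed count would be surprising for a fair coin; it does not by itself say how the coin is biased.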
Where do those scores come from?
- Most alignment scores (from BLAST, HMMER, ...) can be interpreted as log-odds scores, or bit-scores
- What is a log-odds score?
  - The observed frequency of two events A and B occurring together is $p_{AB}$
  - The expected frequency, if the two events are independent, is $f_A f_B$ (i.e. how likely is the joint event to occur by chance?)
  - The log-odds score is $\log_2\left(\frac{\#\mathrm{Observed}}{\#\mathrm{Expected}}\right) = \log_2\left(\frac{p_{AB}}{f_A f_B}\right)$
- What happens when $p_{AB} = f_A f_B$?
- What happens when $p_{AB} > f_A f_B$ or $p_{AB} < f_A f_B$?
- Often called the bit-score: information theorists like to discuss how many “bits” of information they have, hence $\log_2$.
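The definition above is short enough to try out directly. In this minimal sketch the function name `log_odds` and the example frequencies are invented for illustration; only the formula comes from the slide.

```python
from math import log2


def log_odds(p_ab, f_a, f_b):
    """Log-odds score in bits: observed joint frequency over the frequency
    expected if A and B were independent."""
    return log2(p_ab / (f_a * f_b))


# Hypothetical frequencies with f_a = f_b = 0.5, so the expected value is 0.25.
print(log_odds(0.25, 0.5, 0.5))   # observed equals expected
print(log_odds(0.5, 0.5, 0.5))    # observed is twice the expected
print(log_odds(0.125, 0.5, 0.5))  # observed is half the expected
```

The sign of the score shows whether the pair is seen more often (positive) or less often (negative) than chance predicts, and a score of zero means the observation carries no information either way.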
Logarithms – reminder
Exercise
- If I were to throw two dice 360 times and roll double 6’s 80 times, what is the log-odds score for that (base 2)?
- What if instead I only rolled 10 double 6’s?
- What if instead I only rolled 5 double 6’s?
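A worked version of the exercise, assuming fair dice so that a double six has probability 1/36 on each throw. The symbols $E$ (expected count) and $S_n$ (score for $n$ observed double sixes) are introduced here for illustration only.

```latex
\begin{align*}
E      &= 360 \times \tfrac{1}{36} = 10 \\
S_{80} &= \log_2\!\left(\tfrac{80}{10}\right) = \log_2 8 = 3 \text{ bits} \\
S_{10} &= \log_2\!\left(\tfrac{10}{10}\right) = \log_2 1 = 0 \text{ bits} \\
S_{5}  &= \log_2\!\left(\tfrac{5}{10}\right) = \log_2 \tfrac{1}{2} = -1 \text{ bit}
\end{align*}
```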
Where do the nucleotide scores come from?
TABLE 1. PAM Substitution Scores Based on the Uniform Mutation Model

PAM     Percent    Match   Mismatch  Match/Mismatch  Ave. Information
Dist.   Conserved  Score   Score     Score Ratio     Per Position
                   (Bits)  (Bits)                    (Bits)
1       99.0       1.99    -6.24     0.32            1.90
2       98.0       1.97    -5.25     0.38            1.83
5       95.2       1.93    -3.95     0.49            1.64
10      90.6       1.86    -3.00     0.62            1.40
15      86.4       1.79    -2.46     0.73            1.21
20      82.4       1.72    -2.09     0.82            1.05
25      78.7       1.66    -1.82     0.91            0.92
30      75.3       1.59    -1.60     0.99            0.80
35      72.0       1.53    -1.42     1.07            0.70
40      69.0       1.46    -1.27     1.15            0.62
45      66.2       1.40    -1.15     1.22            0.54
50      63.5       1.34    -1.04     1.29            0.47

States, DJ et al. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods Enzymol.
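The table can be approximately reproduced from a simple model. The sketch below assumes that one PAM step leaves a base unchanged with probability 0.99 and changes it to each of the other three bases with probability 0.01/3, that all four bases have background frequency 0.25, and that a PAM distance of n means n such steps. These assumptions, and the function name `pam_scores`, are mine rather than taken from the slide, and the results agree with the table only to within small rounding differences.

```python
from math import log2


def pam_scores(pam_distance, step_change=0.01, background=0.25):
    """Return (fraction conserved, match score, mismatch score, average
    information per position), with scores in bits, under a uniform model."""
    stay = 1.0 - step_change
    move = step_change / 3.0
    # One-step transition matrix over the four bases.
    step = [[stay if i == j else move for j in range(4)] for i in range(4)]
    # Start from the identity matrix and apply the step pam_distance times.
    probs = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(pam_distance):
        probs = [
            [sum(probs[i][k] * step[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)
        ]
    conserved = probs[0][0]  # probability a base is unchanged
    changed = probs[0][1]    # probability of one specific substitution
    match = log2(conserved / background)
    mismatch = log2(changed / background)
    # Expected score per aligned position: one match outcome, three mismatches.
    information = conserved * match + 3 * changed * mismatch
    return conserved, match, mismatch, information


conserved, match, mismatch, information = pam_scores(10)
print(f"PAM 10: {100 * conserved:.1f}% conserved, match {match:.2f}, "
      f"mismatch {mismatch:.2f}, information {information:.2f} bits")
```

The match/mismatch score ratio in the table is the match score divided by the magnitude of the mismatch score.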
Protein evolution
- What sorts of amino-acid replacements are likely during the evolution of a protein sequence?
[Figure: codon wheel of the genetic code (1st, 2nd and 3rd codon positions), with each amino acid annotated by name, three- and one-letter code, molecular weight, chemical structure, side-chain class (basic, acidic, polar, nonpolar/hydrophobic) and possible modifications (S: Sumo, M: Methyl, P: Phospho, U: Ubiquitin, nM: N-Methyl, oG: O-glycosyl, nG: N-glycosyl).]
Image source: http://upload.wikimedia.org/wikipedia/en/d/d6/GeneticCode21-version-2.svg
Where do the protein score matrices come from?
- The BLOSUM62 matrix
Henikoff & Henikoff (1992) Amino acid substitution matrices from protein blocks. PNAS.
Image source: http://www.mathgon.com/Cours/TP/TP1/Alignements.html
Multiple and pairwise sequence alignment
- Score matrices are used to evaluate how “good” an alignment is...
- Does the alignment explain the likely evolutionary history of these sequences?
- Is the biochemical function likely to have been preserved?
- The highest-scoring pairwise sequence alignment can be found using dynamic programming
- The highest-scoring multiple sequence alignment cannot be found easily
H.pylori            ------GVDA NALHRPKRFF GAARNIEEGG SLTIIATALI ETGSRMDEVI ------FEEF
D.radiodurans       ------GVDS TALYPPKRFL GAARNIEEGG SLTIIATAMV ETGSTGDTVI ------FEEF
M.tuberculosis      ------GVDS TALYPPKRFL GAARNIEEGG SLTIIATAMV ETGSTGDTVI ------FEEF
C.pneumoniae        ------GVDA SALHKPKRFF GAARNIEGGG SLTILATALI DTGSRMDEVI ------FEEF
F.nucleatum         ------GIDP TALYHPKNFF GAARNIKDGG SLTIIATILV DTGSKMDEVI ------YEEF
S.enterica          ------GVDA NALHRPKRFF GAARNVEEGG SLTIIATALI DTGSKMDEVI ------YEEF
B.thetaiotaomicron  ------GVDA NALHKPKRFF GAARNIENGG SLTIIATALI DTGSKMDEVI ------FEEF
L.interrogans       ------GVDS NALHKPKRFF GAARNIEEGG SLTIIATALI DTGSKMDEVI ------FEEF
P.marinus           GYQPTLGTDV GELQER---- -ITSTLE--G SITSIQAVYV PADDLTDPAP ------ATTF
U.parvum            GYQSTLESDV THIQNR---- -LFRNKN--G SITSFQTIFL PMDDLSDPSA ------VAVF
B.subtilis          ------GIDP AAFHRPKRFF GAARNIEEGG SLTILATALV DTGSRMDDVI ------YEEF
C.difficile         ------GIDP GALHGPKKFF GAARNIRQGG SLTILGTALV ETGSRMDDVI ------FEEF
S.griseus           ------GVDS TALYPPKRFF GAARNIEDGG SLTILATALV ETGSRMDEVI ------FEEF
F.nodosum           ------GVDP AALYKPKHFF GAARNTREGG SLTIIATALI ETGSKMDEVI ------FEEF
M.infernorum        ------GVDA NALQKPRRFF ATARNLEEGG SVTIIATALI DTGSKMDDVI ------FEEF
T.yellowstonii      ------GLEA TALQKPKRFF GTARNIEEGG SLTIIATALV ETGSRMDDVI ------FEEF
E.coli              ------GVDA NALHRPKRFF GAARNVEEGG SLTIIATALI DTGSKMDEVI ------YEEF
H.volcanii          GYPAYLAARL SEFYERAGYF TTVNGEE--G SVSVIGAVSP PGGDFSEPVT QNTLRIVKTF
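As a concrete illustration of finding the highest-scoring pairwise alignment by dynamic programming, here is a minimal sketch of global alignment in the style of Needleman-Wunsch. The slide does not name a specific algorithm, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary choices made for this example; a real tool would use a substitution matrix and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths. Extending the same table to many sequences at once multiplies the dimensions, which is why the multiple alignment case is so much harder.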
Multiple sequence alignment
Different approaches & issues
- An NP-hard problem (Wang & Jiang, 1994)
- Which means no known algorithm is guaranteed to find the mathematically optimal multiple sequence alignment in reasonable time on a modern computer
- Which means we have to identify heuristic approaches that run quickly and produce reasonable solutions
Introducing ClustalW
ClustalW
- Clustering Alignment
- A progressive method
- Very fast
- Widely used
- Very good for rather similar sequences
- Advanced settings, fine tuned
- ClustalW (W for weighting)
- ClustalX: Graphical version
Credit: text lifted verbatim from Stinus Lindgreen’s slides. Image source: www.wikimedia.org
ClustalW
- Three steps:
  1. Do all-against-all pairwise alignments
  2. Build guide tree T
  3. Perform multiple alignment along T
     - sequence-sequence, sequence-profile & profile-profile
Basic idea: align the most similar sequences first → add more divergent sequences later
Credit: text lifted verbatim from Stinus Lindgreen’s slides. Image source: www.wikimedia.org
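The first two steps can be sketched in a few lines. This is a simplified illustration, not ClustalW's implementation: the pairwise distance uses Python's `difflib.SequenceMatcher` as a stand-in for a real pairwise alignment score, the tree is built by plain average-linkage clustering, and the names `pairwise_distance` and `guide_tree` are invented for this example. Step 3, the progressive alignment along the tree, is not implemented here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distance(a, b):
    """Stand-in distance: 1 minus a similarity ratio between two sequences."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree(seqs):
    """seqs maps unique names to sequences; returns nested tuples of names
    recording the order in which clusters were merged."""
    # Step 1: all-against-all distances between the input sequences.
    dist = {
        frozenset((x, y)): pairwise_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }
    # Each cluster is keyed by its subtree and lists its member names.
    clusters = {name: [name] for name in seqs}

    def cluster_distance(t1, t2):
        pairs = [(x, y) for x in clusters[t1] for y in clusters[t2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly merge the two closest clusters (average linkage).
    while len(clusters) > 1:
        t1, t2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2)] = members
    return next(iter(clusters))


toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
print(guide_tree(toy))
```

In the toy input the two near-identical sequences are joined first and the divergent one is attached last, which mirrors the basic idea of aligning the most similar sequences first.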
Discussion
- What do we use multiple sequence alignments for?
Relevant reading
- Reviews:
  - Eddy SR (2004) Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology.
- Methods:
  - Thompson J et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research.
  - Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.
The End