biol335: sequence alignment

Sequence alignment

Paul Gardner

March 19, 2015


What is homology?

I A homologous trait is any characteristic of organisms that is derived from a common ancestor [1]

E.g. vertebrate forelimbs.

I Contrast this to analogous traits: similarities between organisms that were not in the last common ancestor [1]. E.g. wings from pterosaurs, bats & birds.

1. Text & image adapted from: http://en.wikipedia.org/wiki/Homology_(biology)

How can we measure evolutionary relatedness?

I Is a jellyfish more related to a sponge or the custard apple?

I How can we test this?

Barton et al. (2007) Evolution. Cold Spring Harbor Laboratory Press

The homology search problem

I Given a biological sequence, can we identify homologues in other species?

Allers & Mevarech (2005) Archaeal genetics the third way. Nature Reviews Genetics

Given gazillions of biomolecular sequences, how can we estimate homology?

I By sequence comparison! Usually by alignment.

I NOTE: sequence similarity is used to infer homology.

Homology is binary: either something is or isn’t homologous. Sequences can be 87% similar, they can’t be 87% homologous!

I Alignments are two or more sequences placed on top of each other, identifying functional, structural, or evolutionary relationships between the sequences.

>1
GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA
>2
GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA
>3
UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG
>4
GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG
>5
CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC

                   **               *
1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA--
2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG--
3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA--
4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG
5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA--

                           **** * **
1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A
2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A
3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G
4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG
5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C


Did you do your homework?

I Play: http://phylo.cs.mcgill.ca

How should a sequence alignment be interpreted & scored?

>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome

Length=4537948

Features in this part of subject sequence:

cold-shock DNA-binding domain protein

Score = 57.2 bits (62), Expect = 2e-05

Identities = 78/106 (74%), Gaps = 6/106 (6%)

Strand=Plus/Plus

Query 1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
              || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct 828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

Query 57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
               | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct 828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
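The Identities and Gaps lines of the report above can be recomputed directly from the two aligned rows. The following is a minimal sketch, assuming the Query and Sbjct rows are copied as shown and that a column counts as a gap when either row has a "-"; the function name alignment_stats is made up for this example and is not part of BLAST.

```python
# Aligned rows copied from the BLAST report above ('-' marks a gap).
query = ("CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC"
         "TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG")
sbjct = ("CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC"
         "ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG")


def alignment_stats(row_a, row_b):
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identities = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    gaps = sum(a == "-" or b == "-" for a, b in zip(row_a, row_b))
    return identities, gaps, len(row_a)


identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} ({identities / length:.0%})")
print(f"Gaps = {gaps}/{length} ({gaps / length:.0%})")
```

Run on these rows, the counts agree with the report: 78 identical columns and 6 gap columns out of 106.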

Let's play a game of coin toss

I How can we detect cheating or bias?

I coin toss

I ./coin-toss.pl 100

I ./coin-toss2.pl 100
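One way to approach the bias question is to ask how surprising the observed number of heads would be for a fair coin. The sketch below is an assumption about how that could be done (an exact two-sided binomial test); it does not reproduce coin-toss.pl or coin-toss2.pl, whose contents are not shown here, and the function names are made up for this example.

```python
from math import comb


def binomial_tail(n, k, p=0.5):
    """Probability of at least k heads in n tosses of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def two_sided_p_value(n, heads):
    """Chance that a fair coin strays from n/2 heads at least as far as observed."""
    extreme = max(heads, n - heads)
    if 2 * extreme == n:
        # Exactly half heads: no deviation at all from the fair expectation.
        return 1.0
    return min(1.0, 2 * binomial_tail(n, extreme))


# Hypothetical head counts out of 100 tosses.
for heads in (50, 60, 75):
    print(heads, two_sided_p_value(100, heads))
```

A small value suggests that the tosses are unlikely to come from a fair coin. How small is small enough is a separate choice that the sketch leaves open.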

Where do those scores come from?

I Most alignment scores (from BLAST, HMMER, ...) can be interpreted as log-odds or bit-scores

I What is a log-odds score?

I The observed frequency of two events occurring together is $p_{AB}$

I The expected frequency of two independent events is $f_A \cdot f_B$ (i.e. what is the likelihood of the event occurring by chance)

I The log-odds score is $\log_2\left(\frac{\#\text{Observed}}{\#\text{Expected}}\right) = \log_2\left(\frac{p_{AB}}{f_A \cdot f_B}\right)$

I What happens when $p_{AB} = f_A \cdot f_B$?

I What happens when $p_{AB} > f_A \cdot f_B$ or $p_{AB} < f_A \cdot f_B$?

I Often called the bit-score: information theorists like to discuss how many “bits” of information they have, hence “log2”.

Logarithms – reminder

Exercise

I If I were to throw two dice 360 times and rolled double 6’s 80 times, what is the log-odds score for that (base 2)?

I What if instead I only rolled 10 double 6’s?

I What if instead I only rolled 5 double 6’s?
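The exercise can be checked with a few lines of code. This is a minimal sketch, assuming fair and independent dice, so that a double six has probability 1/36 and 360 throws are expected to give 10 of them; log_odds is a name chosen for this example.

```python
from math import log2


def log_odds(observed, expected):
    """Log-odds score in bits: log2(observed / expected)."""
    return log2(observed / expected)


throws = 360
# Two fair dice show a double six with probability 1/36,
# so 360 throws are expected to give 10 double sixes.
expected = throws / 36

for observed in (80, 10, 5):
    print(observed, log_odds(observed, expected))
```

Seeing 80 double sixes is 8 times the expected count, which gives 3 bits. Seeing 10 matches the expectation, which gives 0 bits. Seeing 5 is half the expectation, which gives -1 bit.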

Where do the nucleotide scores come from?

TABLE 1. PAM Substitution Scores Based on the Uniform Mutation Model

PAM    Percent    Match score  Mismatch score  Match/mismatch  Ave. information
dist.  conserved  (bits)       (bits)          score ratio     per position (bits)
1      99.0       1.99         -6.24           0.32            1.90
2      98.0       1.97         -5.25           0.38            1.83
5      95.2       1.93         -3.95           0.49            1.64
10     90.6       1.86         -3.00           0.62            1.40
15     86.4       1.79         -2.46           0.73            1.21
20     82.4       1.72         -2.09           0.82            1.05
25     78.7       1.66         -1.82           0.91            0.92
30     75.3       1.59         -1.60           0.99            0.80
35     72.0       1.53         -1.42           1.07            0.70
40     69.0       1.46         -1.27           1.15            0.62
45     66.2       1.40         -1.15           1.22            0.54
50     63.5       1.34         -1.04           1.29            0.47

States, DJ et al. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods Enzymol.
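The match and mismatch columns of Table 1 can be approximated as log-odds scores. The sketch below rests on two assumptions that are not spelled out on the slide: all four nucleotides have frequency 0.25, and a changed position is equally likely to become any of the other three bases. Because it starts from the rounded "percent conserved" column, its output can differ from the table in the second decimal place.

```python
from math import log2

BASE_FREQ = 0.25  # assumption: all four nucleotides are equally frequent


def match_score(conserved):
    """Bits for an aligned identical pair, given the fraction of conserved positions."""
    # Observed: P(base a) * P(a unchanged); expected by chance: P(a) * P(a).
    return log2((BASE_FREQ * conserved) / (BASE_FREQ * BASE_FREQ))


def mismatch_score(conserved):
    """Bits for an aligned differing pair; a change is spread evenly over 3 other bases."""
    observed = BASE_FREQ * (1 - conserved) / 3
    return log2(observed / (BASE_FREQ * BASE_FREQ))


# PAM distance and percent conserved, taken from three rows of Table 1.
for pam, percent in [(1, 99.0), (10, 90.6), (50, 63.5)]:
    c = percent / 100
    print(pam, round(match_score(c), 2), round(mismatch_score(c), 2))
```

As sequences diverge, the reward for a match shrinks and the penalty for a mismatch becomes milder, which is the trend visible down the table.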

Protein evolution

I What sorts of amino-acid replacements are likely during the evolution of a protein sequence?

[Figure: the standard genetic code drawn as a codon wheel (1st, 2nd and 3rd positions), with each amino acid labelled by molecular weight, chemical class (basic, acidic, polar, nonpolar/hydrophobic) and known modifications (sumo, methyl, phospho, ubiquitin, N-methyl, O-glycosyl, N-glycosyl)]

Image source: http://upload.wikimedia.org/wikipedia/en/d/d6/GeneticCode21-version-2.svg

Where do the protein score matrices come from?

I The BLOSUM62 matrix

Henikoff & Henikoff (1992) Amino acid substitution matrices from protein blocks. PNAS.

Image source: http://www.mathgon.com/Cours/TP/TP1/Alignements.html

Multiple and pairwise sequence alignment

I Score matrices are used to evaluate how “good” an alignment is...

I Does the alignment explain the likely evolutionary history of these sequences?

I Is the biochemical function likely to have been preserved?

I The highest-scoring pairwise sequence alignment can be found using dynamic programming

I The highest-scoring multiple sequence alignment cannot be found easily
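The dynamic programming idea for pairwise alignment can be sketched with the classic Needleman-Wunsch global alignment. This is an illustration only: the match, mismatch and gap values below are arbitrary placeholders rather than scores from a real matrix such as BLOSUM62, and a linear gap penalty is assumed.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best global alignment score and one optimal alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = global_alignment("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table has one cell per pair of prefixes, so the work grows with the product of the two sequence lengths. Extending the same table to many sequences at once adds one dimension per sequence, which is why the multiple-alignment case becomes impractical.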

H.pylori           ------GVDA NALHRPKRFF GAARNIEEGG SLTIIATALI ETGSRMDEVI ------FEEF
D.radiodurans      ------GVDS TALYPPKRFL GAARNIEEGG SLTIIATAMV ETGSTGDTVI ------FEEF
M.tuberculosis     ------GVDS TALYPPKRFL GAARNIEEGG SLTIIATAMV ETGSTGDTVI ------FEEF
C.pneumoniae       ------GVDA SALHKPKRFF GAARNIEGGG SLTILATALI DTGSRMDEVI ------FEEF
F.nucleatum        ------GIDP TALYHPKNFF GAARNIKDGG SLTIIATILV DTGSKMDEVI ------YEEF
S.enterica         ------GVDA NALHRPKRFF GAARNVEEGG SLTIIATALI DTGSKMDEVI ------YEEF
B.thetaiotaomicron ------GVDA NALHKPKRFF GAARNIENGG SLTIIATALI DTGSKMDEVI ------FEEF
L.interrogans      ------GVDS NALHKPKRFF GAARNIEEGG SLTIIATALI DTGSKMDEVI ------FEEF
P.marinus          GYQPTLGTDV GELQER---- -ITSTLE--G SITSIQAVYV PADDLTDPAP ------ATTF
U.parvum           GYQSTLESDV THIQNR---- -LFRNKN--G SITSFQTIFL PMDDLSDPSA ------VAVF
B.subtilis         ------GIDP AAFHRPKRFF GAARNIEEGG SLTILATALV DTGSRMDDVI ------YEEF
C.difficile        ------GIDP GALHGPKKFF GAARNIRQGG SLTILGTALV ETGSRMDDVI ------FEEF
S.griseus          ------GVDS TALYPPKRFF GAARNIEDGG SLTILATALV ETGSRMDEVI ------FEEF
F.nodosum          ------GVDP AALYKPKHFF GAARNTREGG SLTIIATALI ETGSKMDEVI ------FEEF
M.infernorum       ------GVDA NALQKPRRFF ATARNLEEGG SVTIIATALI DTGSKMDDVI ------FEEF
T.yellowstonii     ------GLEA TALQKPKRFF GTARNIEEGG SLTIIATALV ETGSRMDDVI ------FEEF
E.coli             ------GVDA NALHRPKRFF GAARNVEEGG SLTIIATALI DTGSKMDEVI ------YEEF
H.volcanii         GYPAYLAARL SEFYERAGYF TTVNGEE--G SVSVIGAVSP PGGDFSEPVT QNTLRIVKTF

Multiple sequence alignment

Different approaches & issues

I An NP-hard problem (Wang & Jiang (1994))

I Which means no known algorithm is guaranteed to find the mathematically optimal multiple sequence alignment on a modern computer in reasonable time

I Which means we have to identify heuristic approaches that run quickly and produce reasonable solutions


Introducing ClustalW

ClustalW

I Clustering Alignment

I A progressive method

I Very fast

I Widely used

I Very good for rather similar sequences

I Advanced settings, fine tuned

I ClustalW (for weighting)

I ClustalX: Graphical version

Credit: text lifted verbatim from Stinus Lindgreen’s slides. Image source: www.wikimedia.org

ClustalW

I Three steps:

1 Do all-against-all pairwise alignments

2 Build guide tree T

3 Perform multiple alignment along T

I sequence-sequence, sequence-profile & profile-profile

Basic idea: Align most similar sequences first → add more divergent sequences later

Credit: text lifted verbatim from Stinus Lindgreen’s slides. Image source: www.wikimedia.org
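Steps 1 and 2 of the progressive approach can be sketched as follows. This is a toy stand-in and not ClustalW's actual procedure. It assumes the sequences are ungapped and of equal length, so that the fraction of differing positions can replace a real pairwise alignment score. It builds the guide tree with simple average-linkage clustering, which is an assumption made for this example. Step 3 (the alignment along the tree) is not implemented. The fragments are the fourth and fifth blocks of five rows from the protein alignment shown earlier.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    # Step 1: all-against-all pairwise distances.
    dist = {frozenset((p, q)): distance(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}
    # Each cluster maps its (nested) tree to the list of leaf names it contains.
    clusters = {name: [name] for name in seqs}

    def linkage(s, t):
        """Mean distance over all leaf pairs drawn from clusters s and t."""
        pairs = [(p, q) for p in clusters[s] for q in clusters[t]]
        return sum(dist[frozenset(pq)] for pq in pairs) / len(pairs)

    # Step 2: repeatedly join the two closest clusters.
    while len(clusters) > 1:
        s, t = min(combinations(clusters, 2), key=lambda st: linkage(*st))
        clusters[(s, t)] = clusters.pop(s) + clusters.pop(t)
    return next(iter(clusters))


fragments = {
    "H.pylori": "SLTIIATALIETGSRMDEVI",
    "E.coli": "SLTIIATALIDTGSKMDEVI",
    "S.enterica": "SLTIIATALIDTGSKMDEVI",
    "P.marinus": "SITSIQAVYVPADDLTDPAP",
    "U.parvum": "SITSFQTIFLPMDDLSDPSA",
}
print(guide_tree(fragments))
```

The nested tuples give the order in which a progressive aligner would combine sequences: the innermost pairs are the most similar and would be aligned first.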

Discussion

I What do we use multiple sequence alignments for?

Relevant reading

I Reviews:

  I Eddy SR (2004) Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology.

I Methods:

  I Thompson J et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research.

  I Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.

The End
