Multiple Sequence Alignment James McInerney bioinf4biologists Feb. 2009



Alignment can be easy or difficult

Easy

GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA

Difficult due to insertions or deletions (indels)

TTGACATG CCGGGG--- A AACCG
TTGACATG CCGGTG-- GT AAGCC
TTGACATG - CTAGG--- A ACGCG
TTGACATG - CTAGGGAAC ACGCG
TTGACATC - CTCTG--- A ACGCG

Homology Definition

• Homology: similarity that is the result of inheritance from a common ancestor. Identification and analysis of homologies is central to phylogenetic systematics.
• An alignment is a hypothesis of positional homology between bases/amino acids.

Multiple Sequence Alignment - Goals

• To generate a concise, information-rich summary of sequence data
• Sometimes used to illustrate the dissimilarity between a group of sequences
• Alignments can be treated as models that can be used to test hypotheses
• Used to identify homologous residues within sequences

Multiple sequence alignments - problems

• All sequences show some similarity (even random sequences)
• Similarity levels might be high in some parts of the sequence and low in other parts
• Sequences might show substantial length variation and presence/absence of various domains

SSU rRNA

• Structural RNA (not translated)
• Found in the small ribosomal subunit
• Widely used for phylogeny reconstruction (found in every species)
• Contains stem and loop structures
• Stem structures usually conform to Watson-Crick base pairing

Alignment of 16S rRNA can be guided by secondary structure

<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E. coli          UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst. nidulans UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B. subtilis      UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl. aurantiacus UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match

Alignment of 16S rRNA sequences from different bacteria

Protein Alignment may be guided by Tertiary Structure Interactions

Homo sapiens DjlA protein

Escherichia coli DjlA protein

Multiple Sequence Alignment - Methods

3 main methods of alignment:

• Manual (using custom-built text editors)
• Automatic (using custom-built alignment software)
• Combined

Manual Alignment - reasons

• Might be carried out because
  - Alignment is easy
  - There is some extraneous information (structural)
  - Automated alignment methods have encountered the local minimum problem
  - An automated alignment method can be "improved"

Local minimum

GARFIELDTHEFAT---CAT
GARFIELDTHEFATFATCAT

Dotplots

• The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously
• Let's consider a dotplot between sperm whale and human myoglobins

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

• Put one sequence on top
• Put the other on the side
• Where residues are identical, put a dot
• Diagonal lines of dots show similarities

Dotplot example: sperm whale vs human myg

• Just do the first 10 amino acids of each
• Make a table with
  - whale sequence on top
  - human sequence on the side

Sperm whale myoglobin: G L S D G E W Q L V
Human myoglobin:       V L S E G E W Q L V

Dotplot example: sperm whale vs human myg

• This is the result for the whole sequence
• It is easy to see that the diagonal is a line of dots
• So sperm whale and human myoglobin are very similar
• But the picture is noisy; it can be smoothed using a sliding window, which considers neighbouring residues as well

Dotplot example: sperm whale vs human myg

• Noise can be smoothed using a sliding window, which considers neighbouring residues as well
• This has been done here; the diagonal is clearly highly similar
• Also, instead of using simple identity, use a scoring matrix
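The dotplot recipe can be sketched in a few lines of Python. This is only an illustration of the idea on the slides, not how dotlet works internally. The function name `dotplot`, the `window` and `min_matches` parameters, and the `*`/`.` symbols are assumptions made for this example. It scores by simple identity; a scoring matrix could replace the equality test.

```python
def dotplot(top, side, window=1, min_matches=None):
    """Return text rows: one row per window start in `side`, one column per window start in `top`.

    A cell is '*' (a dot) when at least `min_matches` of the `window`
    residues starting at that cell are identical, otherwise '.'.
    """
    threshold = window if min_matches is None else min_matches
    rows = []
    for i in range(len(side) - window + 1):
        row = []
        for j in range(len(top) - window + 1):
            matches = sum(side[i + k] == top[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


whale = "GLSDGEWQLV"   # first 10 residues of sperm whale myoglobin
human = "VLSEGEWQLV"   # first 10 residues of human myoglobin

raw = dotplot(whale, human)                # identity, no smoothing
smooth = dotplot(whale, human, window=3)   # dot only if 3 residues in a row match

for row in raw:
    print(row)
```

With `window=1`, every chance identity gets a dot, which is the noise the slides mention. Raising `window` keeps only runs of matches, so the diagonal stands out.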

Dotplots in practice

• The best tool is an applet called dotlet
  - www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  - www.bip.bham.ac.uk/dotlet/Dotlet.html
• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window
• Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-Prot entry

For further discussion of dotplots, see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods:

• Dynamic programming
  - Consider 2 protein sequences of 100 amino acids in length
  - If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  - More time than the universe has existed to align 20 sequences exhaustively
• Progressive alignment
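The arithmetic behind the "age of the universe" claim can be checked with a short sketch. It takes the slide's premise at face value (100^n seconds for n sequences of length 100) and assumes an age of roughly 13.8 billion years. The function names are made up for this example.

```python
# Assumption: the universe is about 13.8 billion years old (~4.4e17 seconds).
AGE_OF_UNIVERSE_SECONDS = 13.8e9 * 365.25 * 24 * 3600


def exhaustive_seconds(n_sequences, length=100):
    """Cost of exhaustive alignment under the slide's premise: length ** n seconds."""
    return length ** n_sequences


def first_infeasible(length=100):
    """Smallest number of sequences whose cost exceeds the age of the universe."""
    n = 2
    while exhaustive_seconds(n, length) <= AGE_OF_UNIVERSE_SECONDS:
        n += 1
    return n


print(exhaustive_seconds(2))    # 10000 seconds, under three hours
print(exhaustive_seconds(20))   # 10**40 seconds
print(first_infeasible())       # number of sequences at which the limit is crossed
```

Under these assumptions the limit is already crossed well before 20 sequences, which is why a heuristic such as progressive alignment is needed.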

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and, as such, is not guaranteed to find the 'optimal' alignment
• Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) = n(n-1)/2 pairwise alignments as a starting point
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure (CLUSTAL W)

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1  -
Hbb_Horse  2  17  -
Hba_Human  3  59  60  -
Hba_Horse  4  59  59  13  -
Myg_Whale  5  77  77  75  75  -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure annotations: alpha-helices; tree nodes numbered 1 to 4]

ClustalW - Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2 possibilities
• Calculate the 'distance' between each pair of sequences based on these isolated pairwise alignments
• Generate a distance matrix
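A rough sketch of this stage is given here, with two simplifying assumptions that are not from the slides. The input rows are taken as already aligned, where ClustalW would align each pair separately. The 'distance' is taken to be the fraction of differing residues over gap-free columns. The rows are the five fragments from the ClustalW overview slide, and the function names are invented for the example.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of differing residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not compared:
        return 1.0
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned):
    """Symmetric distance matrix as a dict of dicts, one entry per pair of names."""
    dist = {name: {name: 0.0} for name in aligned}
    for a, b in combinations(aligned, 2):
        dist[a][b] = dist[b][a] = p_distance(aligned[a], aligned[b])
    return dist


aligned = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

pairs = list(combinations(aligned, 2))
print(len(pairs))   # n(n-1)/2 pairs for n = 5
dist = distance_matrix(aligned)
```

On these short fragments the numbers will not reproduce the matrix on the overview slide, which was computed from full-length sequences.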

Path Graph for aligning two sequences

Possible alignment

Scoring scheme:
• Match +1
• Mismatch 0
• Indel -1

Column scores along this path: 1, 1, 0, 1, 0, -1
Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Optimal Alignment 1

Column scores: 1, 1, -1, 1, 1, 1
Alignment score: 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Column scores: 1, -1, 1, 1, 1, 1
Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
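The path scores above can be reproduced with a small sketch using the slide's scoring scheme (match +1, mismatch 0, indel -1). `score_alignment` scores a given pair of aligned rows. `best_score` is a minimal global dynamic-programming recurrence that returns only the optimal score, not the path. Both names are made up for this example.

```python
MATCH, MISMATCH, INDEL = 1, 0, -1   # scoring scheme from the slide


def score_alignment(row_a, row_b):
    """Sum the column scores of two aligned rows ('-' marks an indel)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += INDEL
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


def best_score(seq_a, seq_b):
    """Best achievable global alignment score, computed row by row."""
    prev = [j * INDEL for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        curr = [i * INDEL]
        for j in range(1, len(seq_b) + 1):
            diagonal = prev[j - 1] + (MATCH if seq_a[i - 1] == seq_b[j - 1] else MISMATCH)
            curr.append(max(diagonal, prev[j] + INDEL, curr[j - 1] + INDEL))
        prev = curr
    return prev[-1]


print(score_alignment("GATTC-", "GAATTC"))   # the 'possible alignment'
print(score_alignment("GA-TTC", "GAATTC"))   # optimal alignment 1
print(score_alignment("G-ATTC", "GAATTC"))   # optimal alignment 2
print(best_score("GATTC", "GAATTC"))         # best score over all paths
```

Two different paths reach the same best score, so the optimal alignment is not unique here.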

Alignment of 3 sequences

ClustalW - Guide Tree

• Generate a Neighbor-Joining 'guide tree' from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

Distance Matrix

What is required for the Neighbour joining method

PAM       Spinach   Rice  Mosquito  Monkey  Human
Spinach       0.0   84.9     105.6    90.8   86.3
Rice         84.9    0.0     117.8   122.4  122.6
Mosquito    105.6  117.8       0.0    84.7   80.8
Monkey       90.8  122.4      84.7     0.0    3.3
Human        86.3  122.6      80.8     3.3    0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Tree: Monkey and Human joined at node Mon-Hum; Spinach, Mosquito and Rice not yet joined]

Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55

Next Cycle

PAM       Spinach   Rice  Mosquito  MonHum
Spinach       0.0   84.9     105.6    88.6
Rice         84.9    0.0     117.8   122.5
Mosquito    105.6  117.8       0.0    82.8
MonHum       88.6  122.5      82.8     0.0

[Tree: Mosquito joined to Mon-Hum at node Mos-(Mon-Hum); Spinach and Rice not yet joined]

Penultimate Cycle

PAM        Spinach   Rice  MosMonHum
Spinach        0.0   84.9       97.1
Rice          84.9    0.0      120.2
MosMonHum     97.1  120.2        0.0

[Tree: Spinach and Rice joined at node Spin-Rice]

Last Joining

PAM        SpinRice  MosMonHum
SpinRice        0.0      108.7
MosMonHum     108.7        0.0

[Tree: Spin-Rice joined to Mos-(Mon-Hum) at node (Spin-Rice)-(Mos-(Mon-Hum))]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice, Spinach]
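The worked example can be replayed with a short sketch. It follows the simplified procedure on the slides: join the closest pair, then take a simple average of distances. Full neighbor joining as usually defined also corrects each distance for how far a taxon is from all the others before picking a pair, which this sketch does not do. The function name `average_join` and the bracketed labels are assumptions for this example.

```python
from itertools import combinations


def average_join(names, pair_distances):
    """Repeatedly join the closest pair; return the joins as (a, b, distance)."""
    names = list(names)
    dist = {frozenset(pair): d for pair, d in pair_distances.items()}
    joins = []
    while len(names) > 1:
        a, b = min(combinations(names, 2), key=lambda p: dist[frozenset(p)])
        joins.append((a, b, dist[frozenset((a, b))]))
        merged = f"({a},{b})"
        for other in names:
            if other not in (a, b):
                # simple average of the distances to the two joined members
                dist[frozenset((merged, other))] = (
                    dist[frozenset((a, other))] + dist[frozenset((b, other))]
                ) / 2
        names = [n for n in names if n not in (a, b)] + [merged]
    return joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
pam = {
    ("Spinach", "Rice"): 84.9, ("Spinach", "Mosquito"): 105.6,
    ("Spinach", "Monkey"): 90.8, ("Spinach", "Human"): 86.3,
    ("Rice", "Mosquito"): 117.8, ("Rice", "Monkey"): 122.4,
    ("Rice", "Human"): 122.6, ("Mosquito", "Monkey"): 84.7,
    ("Mosquito", "Human"): 80.8, ("Monkey", "Human"): 3.3,
}

joins = average_join(names, pam)
for a, b, d in joins:
    print(f"join {a} + {b} at distance {d:.2f}")
```

The join order matches the slides: Monkey with Human, then Mosquito, then Spinach with Rice, then the two groups. The slides round to one decimal at every cycle, so their later values can differ from the unrounded ones printed here in the last digit.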

Multiple Alignment - First pair

• Align the two most closely related sequences first
• This alignment is then 'fixed' and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged
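The "same place in both sequences" rule can be illustrated with a small sketch. A gap added to a fixed group is inserted as a whole column, so the rows of the group never shift relative to each other. The helper name `insert_gap` is invented for this example. The rows are sequences 3 and 4 from the progressive alignment step 3 slide.

```python
def insert_gap(group, position, length=1):
    """Insert `length` gap characters at `position` in every row of an aligned group."""
    return [row[:position] + "-" * length + row[position:] for row in group]


pair_3_4 = [
    "gctcgatacacgatgactagcta",   # sequence 3
    "gctcgatacacgatgacgagcga",   # sequence 4
]

widened = insert_gap(pair_3_4, position=13, length=3)
for row in widened:
    print(row)
```

Both rows receive the gap at the same column, so the alignment between sequences 3 and 4 stays as it was.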

ClustalW - Decision time

• Consult the guide tree to see what alignment is performed next
  - Align a third sequence to the first two
  Or
  - Align two entirely different sequences to each other

[Figure: Option 1 and Option 2]

ClustalW - Alternative 1

• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, each of these two entities is treated as a single sequence

ClustalW - Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW - Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a 'pair' having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Guide tree figure: leaves 1 2 3 4 5]

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Guide tree figure: leaves 1 2 3 4 5]

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Guide tree figure: leaves 1 2 3 4 5]

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Guide tree figure: leaves 1 2 3 4 5]

ClustalW - Good points/Bad points

• Advantages
  - Speed
• Disadvantages
  - No objective function
  - No way of quantifying whether or not the alignment is good
  - No way of knowing if the alignment is 'correct'

ClustalW - Local Minimum

• Potential problems
  - Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  - Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some sequences are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW - Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW - User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP - Gap Opening Penalty is the cost of opening a gap in an alignment
• GEP - Gap Extension Penalty is the cost of extending this gap
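One common way to combine the two penalties is an affine gap cost, sketched here. The exact convention varies between programs, for example whether the first gap position also pays the extension penalty. The formula and the values used here are illustrative assumptions, not ClustalW's defaults.

```python
def gap_cost(length, gop, gep):
    """Affine cost of one gap of `length` positions: open once, extend per position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.5   # illustrative values only

one_long_gap = gap_cost(3, GOP, GEP)
three_short_gaps = 3 * gap_cost(1, GOP, GEP)
print(one_long_gap, three_short_gaps)
```

With an opening penalty much larger than the extension penalty, one long gap costs less than several short ones. This pushes the aligner towards a few contiguous indels.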

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  - D E G K N Q P R S
  - But this can be changed by the user
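Finding the hydrophilic stretches described above can be sketched as follows. It uses the residue set and the run length of 5 from the slide and assumes upper-case one-letter codes. The function name is hypothetical. How much the GOP is lowered inside a stretch is not stated on the slides, so the sketch only locates the stretches.

```python
HYDROPHILIC = frozenset("DEGKNQPRS")   # default residue set from the slide


def hydrophilic_stretches(sequence, min_run=5):
    """Return (start, end) index pairs, end exclusive, of runs of >= min_run hydrophilic residues."""
    stretches = []
    run_start = None
    for i, residue in enumerate(sequence):
        if residue in HYDROPHILIC:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                stretches.append((run_start, i))
            run_start = None
    # a run may extend to the very end of the sequence
    if run_start is not None and len(sequence) - run_start >= min_run:
        stretches.append((run_start, len(sequence)))
    return stretches


print(hydrophilic_stretches("AAADEGKNAAA"))   # one run of exactly 5
print(hydrophilic_stretches("AADEGKAA"))      # run of 4 is too short
```

Positions inside the returned ranges are the ones where a gap would be made cheaper to open.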

Divergent Sequences

• The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
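The last step, putting the gaps back into the DNA, can be sketched as follows. The helper name `thread_gaps` is made up. The protein alignment `M-LLG` / `MTLLG` for the two example sequences is assumed here (aligned by hand), not taken from the slides.

```python
def thread_gaps(aligned_protein, dna):
    """Copy gaps from an aligned protein row into its coding DNA, one codon per residue."""
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(ch != "-" for ch in aligned_protein)
    if residues != len(codons):
        raise ValueError("protein row and DNA do not describe the same number of residues")
    next_codon = iter(codons)
    return "".join("---" if ch == "-" else next(next_codon) for ch in aligned_protein)


# ATG CTG TTA GGG     translates to M L L G
# ATG ACT CTG TTA GGG translates to M T L L G
row_1 = thread_gaps("M-LLG", "ATGCTGTTAGGG")
row_2 = thread_gaps("MTLLG", "ATGACTCTGTTAGGG")
print(row_1)
print(row_2)
```

The resulting DNA alignment has a single gap that is one whole codon long. This is a more plausible event than the three scattered gaps in the direct DNA alignment.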

Manual Alignment - software

GDE - The Genetic Data Environment (UNIX)

CINEMA - Java applet, available from
  - http://www.biochem.ucl.ac.uk

Seqapp/Seqpup - Mac/PC/UNIX, available from
  - http://iubio.bio.indiana.edu

SeAl for Macintosh, available from
  - http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC, available from
  - http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html


Alignment can be easy or difficult

Easy

Difficult due to insertions or deletions

(indels)

GCGGCCCA TCAGGTAGTT GGTGG

GCGGCCCA TCAGGTAGTT GGTGG GCGTTCCA TCAGCTGGTT GGTGG

GCGTCCCA TCAGCTAGTT GGTGG

GCGGCGCA TTAGCTAGTT GGTGA

TTGACATG CCGGGG--- A AACCG

TTGACATG CCGGTG-- GT AAGCC TTGACATG - CTAGG--- A ACGCG

TTGACATG - CTAGGGAAC ACGCG

TTGACATC - CTCTG--- A ACGCG

Homology Definitionbull Homology similarity that is the result of inheritance

from a common ancestor - identification and analysis of homologies is central to phylogenetic systematics

bull An Alignment is an hypothesis of positional homology between basesAmino Acids

Multiple Sequence Alignment- Goals

bull To generate a concise information-rich summary of sequence data

bull Sometimes used to illustrate the dissimilarity between a group of sequences

bull Alignments can be treated as models that can be used to test hypotheses

bull Used to identify homologous residues within sequences

Multiple sequence alignments - problems

bull All sequences show some similarity (even random sequences)

bull Similarity levels might be high in some parts of the sequence and low in other parts

bull Sequences might show substantial length variation and presenceabsence of various domains

SSU rRNA

bull Structural RNA (not translated)bull Found in the small ribosomal subunitbull Widely-used for phylogeny reconstruction

(found in every species)bull Contains stem and loop structuresbull Stem structures usually conform to

watson-crick base pairing

Alignment of 16S rRNA can be guided by secondary structure

lt---------------(--------------------HELIX 19---------------------)lt---------------(22222222-000000-111111-00000-111111-0000-22222222Thermus ruber UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGATh thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGAEcoli UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGAAncystnidulans UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGABsubtilis UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGAChlaurantiacus UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGAmatch

Alignment of 16S rRNA sequences from different bacteria

Protein Alignment may be guided Protein Alignment may be guided by Tertiary Structure Interactionsby Tertiary Structure Interactions

Homo sapiens DjlA protein

Escherichia coli DjlA protein

Multiple Sequence Alignment- Methods

ndash3 main methods of alignment

bull Manual (using custom-built text editors)bull Automatic (using custom-built alignment

software)bull Combined

Manual Alignment - reasonsbull Might be carried out because

ndash Alignment is easyndash There is some extraneous information (structural)

ndash Automated alignment methods have encountered the local minimum problem

ndash An automated alignment method can be ldquoimprovedrdquo

Local minimum

GARFIELDTHEFAT---CATGARFIELDTHEFATFATCAT

bull The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously

bull Lets consider a dotplot between sperm whale and human myoglobins

Dotplots

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

bull Put one sequence on top

bull the other on the side

bull where residues are identical put a dot

bull Diagonal lines of dots show similarities

Dotplot example sperm whale vs human myg

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bullJust do the first 10 amino acids of eachbullMake a table with

ndashwhale sequence on top ndashhuman sequence on the side

bull This is the result for the whole sequence

bull It is easy to see that the diagonal is a line of dots

bull So sperm whale and human myoglobin are very similar

bull But the picture is noisy can smooth using a sliding window which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

16

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bull can smooth noise using a sliding window which considers neighbouring residues as well

bull Have done this here can see the diagonal is highly similar

bull Also instead of using using a simple identity use a scoring matrix

Dotplot example sperm whale vs human myg

Dotplots in practicebull The best tool is an applet called dotlet

bull wwwisrecisb-sibchjavadotletDotlethtmlbull wwwbipbhamacukdotletDotlethtml

bull an applet is a program that runs in a web browser This means that you can produce dotplots within a netscapeIE window

bull Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

bull Protein has many repeats bull SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

bull Perform a dotplot of the SLIT protein against itself wwwbiobhamacukdotletDotlethtml

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

20Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming2 methodsbull Dynamic programming

ndash Consider 2 protein sequences of 100 amino acids in lengthndash If it takes 1002 seconds to exhaustively align these sequences

then it will take 1003 seconds to align 3 sequences 1004 to align 4 sequencesetc

ndash More time than the universe has existed to align 20 sequences exhaustively

bull Progressive alignment

Progressive Alignmentbull Devised by Feng and Doolittle in 1987bull Essentially a heuristic method and as such

is not guaranteed to find the lsquooptimalrsquo alignment

bull Requires n-1+n-2+n-3n-n+1 pairwise alignments as a starting point

bull Most successful implementation is Clustal (Des Higgins) This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG2 GEEKAAVLALWDKVN--EEEVGG3 PADKTNVKAAWGKVGAHAGEYGA4 AADKTNVKAAWSKVGGHAGEYGA5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human 1 -Hbb_Horse 2 17 -Hba_Human 3 59 60 -Hba_Horse 4 59 59 13 -Myg_Whale 5 77 77 75 75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3 4

2

1

3 4

alpha-helices

Quick pairwise alignment calculate distance matrix

Neighbor-joining tree(guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

bull First perform all possible pairwise alignments between each pair of sequences There are (n-1)+(n-2)(n-n+1) possibilities

bull Calculate the lsquodistancersquo between each pair of sequences based on these isolated pairwise alignments

bull Generate a distance matrix

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Figure labels: 1 2 3 4 5]

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Figure labels: 1 2 3 4 5]
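A quick sanity check on a finished alignment such as the one above is that every row has the same length and that removing the gaps gives back the original sequence. The sketch below applies that check to the five sequences from these slides; the function name `check_alignment` is hypothetical.

```python
def check_alignment(originals, aligned):
    """True if all aligned rows have equal length and each row, with gaps
    removed, is identical to its original sequence."""
    same_length = len({len(row) for row in aligned}) == 1
    residues_kept = all(
        row.replace("-", "") == seq for seq, row in zip(originals, aligned)
    )
    return same_length and residues_kept


originals = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgacagcta",
    "gctcgatacacgatgactagcta",
    "gctcgatacacgatgacgagcga",
    "ctcgaacgatacgatgactagct",
]
final_alignment = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgac-agcta",
    "gctcgatacacga---tgactagcta",
    "gctcgatacacga---tgacgagcga",
    "-ctcga-acgatacgatgactagct-",
]
print(check_alignment(originals, final_alignment))
```

This only confirms that the alignment is well formed; it says nothing about whether the gaps are in biologically sensible places.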

ClustalW- Good points/Bad points

• Advantages
– Speed
• Disadvantages
– No objective function
– No way of quantifying whether or not the alignment is good
– No way of knowing if the alignment is ‘correct’
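Although the progressive procedure does not optimise any overall score, a score can still be computed afterwards to compare candidate alignments. The sketch below uses a sum-of-pairs score, which is an illustration chosen here and not something the slides attribute to ClustalW. It reuses the match +1, mismatch 0, indel -1 scheme from the path-graph slides; ignoring gap-gap pairs is an assumption, and the function name `sum_of_pairs` is hypothetical.

```python
from itertools import combinations


def sum_of_pairs(aligned, match=1, mismatch=0, indel=-1):
    """Sum the pairwise scores of every column over every pair of rows."""
    total = 0
    for column in zip(*aligned):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # assumption: two gaps facing each other score nothing
            if x == "-" or y == "-":
                total += indel
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


# The two-sequence examples from the path-graph slides.
print(sum_of_pairs(["GATTC-", "GAATTC"]))
print(sum_of_pairs(["GA-TTC", "GAATTC"]))
```

For two sequences this reduces to the ordinary pairwise score, so the two calls give the scores of 2 and 4 quoted on the path-graph slides.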

ClustalW- Local Minimum

• Potential problems
– Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
– Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP - Gap Opening Penalty: the cost of opening a gap in an alignment
• GEP - Gap Extension Penalty: the cost of extending this gap
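How the two penalties combine can be shown with a small worked example. This sketch assumes the cost of a gap is the opening penalty plus the extension penalty times the gap length; conventions differ between programs (some charge the extension only from the second position). The penalty values are illustrative and the function name `gap_cost` is hypothetical.

```python
def gap_cost(length, gop, gep):
    """Cost of one gap of `length` positions: open once, then pay per position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


# Illustrative values only.
GOP, GEP = 10.0, 0.2

one_long_gap = gap_cost(3, GOP, GEP)
three_short_gaps = 3 * gap_cost(1, GOP, GEP)
print(one_long_gap, three_short_gaps)
```

With an opening penalty much larger than the extension penalty, one gap of three positions costs far less than three separate gaps of one position, which is why this scheme favours a few longer gaps over many scattered ones.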

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
– D E G K N Q P R S
– But this can be changed by the user
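The position-specific rules above can be sketched for a single aligned row as follows. This is a sketch under stated assumptions, not ClustalW's actual formula: the multipliers `lowered`, `raised` and `hydrophilic` are made-up illustrative values, "within 8 residues" is read as a distance of at most 8 columns, and the near-gap rule is assumed to take precedence over the hydrophilic rule. The function name `position_specific_gops` is hypothetical.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide


def position_specific_gops(row, base_gop, lowered=0.3, raised=2.0,
                           hydrophilic=0.67, window=8, run_length=5):
    """Return one gap opening penalty per column of an aligned row."""
    gap_positions = [i for i, ch in enumerate(row) if ch == "-"]

    # Mark columns that sit inside a run of at least `run_length`
    # hydrophilic residues; the sentinel closes a run at the end of the row.
    in_run = [False] * len(row)
    start = None
    for i, ch in enumerate(row + "#"):
        if ch in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                for j in range(start, i):
                    in_run[j] = True
            start = None

    gops = []
    for i, ch in enumerate(row):
        if ch == "-":
            gops.append(base_gop * lowered)      # existing gap: cheaper to reuse
        elif any(abs(i - g) <= window for g in gap_positions):
            gops.append(base_gop * raised)       # too close to an existing gap
        elif in_run[i]:
            gops.append(base_gop * hydrophilic)  # likely loop region
        else:
            gops.append(base_gop)
    return gops


# Made-up row: 10 A, one gap, 12 A, a 5-residue hydrophilic run, 3 A.
row = "A" * 10 + "-" + "A" * 12 + "DEGKN" + "A" * 3
gops = position_specific_gops(row, base_gop=10.0)
print(gops)
```

The resulting table is lowest at the existing gap, highest in the columns around it, and slightly reduced inside the hydrophilic stretch.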

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
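The delay rule can be sketched as a simple split on average identity. The sketch follows the slide's description of divergence as the average difference from all other sequences; how ClustalW applies the cutoff internally is not shown here. The sequence names and percent identities are hypothetical, as is the function name `split_by_divergence`.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Split sequence names into (align_first, delayed) by mean percent identity.

    identity maps each name to a dict of percent identities to every other name.
    """
    align_first, delayed = [], []
    for name, others in identity.items():
        mean_identity = sum(others.values()) / len(others)
        (delayed if mean_identity < cutoff else align_first).append(name)
    return align_first, delayed


# Hypothetical pairwise percent identities for four sequences.
identity = {
    "A": {"B": 85.0, "C": 60.0, "D": 25.0},
    "B": {"A": 85.0, "C": 62.0, "D": 22.0},
    "C": {"A": 60.0, "B": 62.0, "D": 30.0},
    "D": {"A": 25.0, "B": 22.0, "C": 30.0},
}
align_first, delayed = split_by_divergence(identity)
print(align_first, delayed)
```

Here only sequence D falls below the 40% cutoff, so it would be added after A, B and C have been aligned.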

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
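Putting the gaps back into the DNA according to the amino acid alignment can be sketched as follows. The sketch takes an already aligned protein row and its ungapped coding DNA; the protein alignment `M-LLG` / `MTLLG` is an assumed alignment of the translations of the two example sequences, and the function name `thread_gaps` is hypothetical.

```python
def thread_gaps(aligned_protein, dna):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon."""
    residues = len(aligned_protein.replace("-", ""))
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the number of residues")
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    # Each protein gap becomes a three-base gap; each residue takes its codon.
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


# Translations of the two example sequences, aligned by hand (assumed).
print(thread_gaps("M-LLG", "ATGCTGTTAGGG"))
print(thread_gaps("MTLLG", "ATGACTCTGTTAGGG"))
```

Because gaps are only ever inserted in multiples of three between codons, the reading frame is preserved, unlike in the direct DNA alignment shown on the slide.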

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet, available from
– http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX, available from
– http://iubio.bio.indiana.edu
SeAl for Macintosh, available from
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Multiple Sequence Alignment- Goals

bull To generate a concise information-rich summary of sequence data

bull Sometimes used to illustrate the dissimilarity between a group of sequences

bull Alignments can be treated as models that can be used to test hypotheses

bull Used to identify homologous residues within sequences

Multiple sequence alignments - problems

bull All sequences show some similarity (even random sequences)

bull Similarity levels might be high in some parts of the sequence and low in other parts

bull Sequences might show substantial length variation and presenceabsence of various domains

SSU rRNA

bull Structural RNA (not translated)bull Found in the small ribosomal subunitbull Widely-used for phylogeny reconstruction

(found in every species)bull Contains stem and loop structuresbull Stem structures usually conform to

watson-crick base pairing

Alignment of 16S rRNA can be guided by secondary structure

lt---------------(--------------------HELIX 19---------------------)lt---------------(22222222-000000-111111-00000-111111-0000-22222222Thermus ruber UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGATh thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGAEcoli UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGAAncystnidulans UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGABsubtilis UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGAChlaurantiacus UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGAmatch

Alignment of 16S rRNA sequences from different bacteria

Protein Alignment may be guided Protein Alignment may be guided by Tertiary Structure Interactionsby Tertiary Structure Interactions

Homo sapiens DjlA protein

Escherichia coli DjlA protein

Multiple Sequence Alignment- Methods

ndash3 main methods of alignment

bull Manual (using custom-built text editors)bull Automatic (using custom-built alignment

software)bull Combined

Manual Alignment - reasonsbull Might be carried out because

ndash Alignment is easyndash There is some extraneous information (structural)

ndash Automated alignment methods have encountered the local minimum problem

ndash An automated alignment method can be ldquoimprovedrdquo

Local minimum

GARFIELDTHEFAT---CATGARFIELDTHEFATFATCAT

bull The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously

bull Lets consider a dotplot between sperm whale and human myoglobins

Dotplots

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

bull Put one sequence on top

bull the other on the side

bull where residues are identical put a dot

bull Diagonal lines of dots show similarities

Dotplot example sperm whale vs human myg

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bullJust do the first 10 amino acids of eachbullMake a table with

ndashwhale sequence on top ndashhuman sequence on the side

bull This is the result for the whole sequence

bull It is easy to see that the diagonal is a line of dots

bull So sperm whale and human myoglobin are very similar

bull But the picture is noisy can smooth using a sliding window which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

16

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bull can smooth noise using a sliding window which considers neighbouring residues as well

bull Have done this here can see the diagonal is highly similar

bull Also instead of using using a simple identity use a scoring matrix

Dotplot example sperm whale vs human myg

Dotplots in practicebull The best tool is an applet called dotlet

bull wwwisrecisb-sibchjavadotletDotlethtmlbull wwwbipbhamacukdotletDotlethtml

bull an applet is a program that runs in a web browser This means that you can produce dotplots within a netscapeIE window

bull Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

bull Protein has many repeats bull SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

bull Perform a dotplot of the SLIT protein against itself wwwbiobhamacukdotletDotlethtml

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

20Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming

2 methods

• Dynamic programming
  – Consider 2 protein sequences of 100 amino acids in length
  – If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  – More time than the universe has existed to align 20 sequences exhaustively

• Progressive alignment
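The growth in running time can be checked with a little arithmetic. The sketch below takes the slide's model at face value (100^n seconds for n sequences of length 100); the figure of roughly 13.8 billion years for the age of the universe is an assumption brought in for comparison.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate value, an assumption of this sketch


def exhaustive_seconds(n_sequences, length=100):
    """Time in seconds under the slide's model: length ** n_sequences."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"{n:2d} sequences: {years:.3g} years ({ratio:.3g} x age of universe)")
```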

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) pairwise alignments as a starting point, i.e. n(n-1)/2
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

CLUSTAL W

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1  -
Hbb_Horse  2  0.17  -
Hba_Human  3  0.59  0.60  -
Hba_Horse  4  0.59  0.59  0.13  -
Myg_Whale  5  0.77  0.77  0.75  0.75  -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale]

Progressive alignment following guide tree

[Figure: alignment order on the guide tree; alpha-helices marked]

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) possibilities

• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments

• Generate a distance matrix
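The three steps above can be sketched as follows. This is only an illustration under simplifying assumptions: it reuses the short pre-aligned fragments from the overview slide instead of performing fresh pairwise alignments, and it measures distance as the fraction of differing residues, which is not claimed to be the exact measure ClustalW uses. The function names `p_distance` and `distance_matrix` are made up for the example, and the values it prints will not match the matrix computed from the full-length sequences.

```python
from itertools import combinations

# Pre-aligned fragments from the overview slide (gaps shown as '-').
aligned = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}


def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    differing = sum(x != y for x, y in compared)
    return differing / len(compared)


def distance_matrix(seqs):
    """Symmetric dict-of-dicts holding one distance per pair of sequences."""
    matrix = {name: {name: 0.0} for name in seqs}
    for a, b in combinations(seqs, 2):
        matrix[a][b] = matrix[b][a] = p_distance(seqs[a], seqs[b])
    return matrix


pairs = list(combinations(aligned, 2))
print(len(pairs), "pairwise comparisons")  # n(n-1)/2 for n sequences

dist = distance_matrix(aligned)
for name, row in dist.items():
    print(f"{name:10s}", " ".join(f"{row[other]:.2f}" for other in aligned))
```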

Path Graph for aligning two sequences

Possible alignment

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

Scores along this path: 1, 1, 0, 1, 0, -1

Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1

Alignment score 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1

Alignment score 4

Alignment using this path

G-ATTC
GAATTC
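The path scores for GATTC and GAATTC can be reproduced with a small dynamic-programming table. The sketch below is a Needleman-Wunsch style global alignment using the slide's scoring scheme (match +1, mismatch 0, indel -1). It returns only the best score and does not trace back the path; the function names are illustrative.

```python
MATCH, MISMATCH, INDEL = 1, 0, -1


def score_alignment(row_a, row_b):
    """Score two equal-length gapped rows column by column."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += INDEL
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


def best_global_score(a, b):
    """Fill the path graph and return the best end-to-end score."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * INDEL  # leading gaps in b
    for j in range(1, cols):
        table[0][j] = j * INDEL  # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + INDEL,     # gap in b
                table[i][j - 1] + INDEL,     # gap in a
            )
    return table[-1][-1]


print(score_alignment("GATTC-", "GAATTC"))   # the first, non-optimal path
print(score_alignment("GA-TTC", "GAATTC"))   # Optimal Alignment 1
print(score_alignment("G-ATTC", "GAATTC"))   # Optimal Alignment 2
print(best_global_score("GATTC", "GAATTC"))  # best achievable score
```

Two different paths reach the same best score, which is why the slides show two optimal alignments.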

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances

• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

Distance Matrix

What is required for the Neighbour joining method: a distance matrix

PAM        Spinach   Rice  Mosquito  Monkey  Human
Spinach        0.0   84.9     105.6    90.8   86.3
Rice          84.9    0.0     117.8   122.4  122.6
Mosquito     105.6  117.8       0.0    84.7   80.8
Monkey        90.8  122.4      84.7     0.0    3.3
Human         86.3  122.6      80.8     3.3    0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice not yet joined]

Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

[Figure: Spinach and the Mon-Hum subtree (Monkey, Human)]

Next Cycle

PAM        Spinach   Rice  Mosquito  MonHum
Spinach        0.0   84.9     105.6    88.6
Rice          84.9    0.0     117.8   122.5
Mosquito     105.6  117.8       0.0    82.8
MonHum        88.6  122.5      82.8     0.0

[Figure: Mosquito joined to Mon-Hum as Mos-(Mon-Hum); Spinach and Rice not yet joined]

Penultimate Cycle

PAM        Spinach   Rice  MosMonHum
Spinach        0.0   84.9       97.1
Rice          84.9    0.0      120.2
MosMonHum     97.1  120.2        0.0

[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]

Last Joining

PAM        SpinRice  MosMonHum
SpinRice        0.0      108.7
MosMonHum     108.7        0.0

[Figure: Spin-Rice joined to Mos-(Mon-Hum) as (Spin-Rice)-(Mos-(Mon-Hum))]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice and Spinach]
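The cycles above can be replayed in code. The sketch follows the simplified procedure exactly as the slides describe it: join the closest pair, then average its distances to everything else. The full neighbor-joining method of Saitou and Nei picks pairs with a criterion that also corrects for each taxon's overall divergence, and that correction is deliberately not implemented here. Distances are read with one decimal place (for example 84.9), and the function name `join_closest` is illustrative.

```python
names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]


def join_closest(names, rows):
    """Repeatedly join the closest pair and average its distances.

    Returns the join order as nested tuples, plus the distance at each join.
    """
    dist = {
        frozenset((a, b)): rows[i][j]
        for i, a in enumerate(names)
        for j, b in enumerate(names)
        if i < j
    }
    clusters = list(names)
    history = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)           # closest two subtrees
        a, b = sorted(pair, key=clusters.index)  # keep a stable left/right order
        merged = (a, b)
        history.append((merged, dist[pair]))
        others = [c for c in clusters if c not in pair]
        for c in others:
            d_a = dist.pop(frozenset((a, c)))
            d_b = dist.pop(frozenset((b, c)))
            dist[frozenset((merged, c))] = (d_a + d_b) / 2  # simple average
        del dist[pair]
        clusters = others + [merged]
    return clusters[0], history


tree, history = join_closest(names, rows)
print(tree)
for merged, d in history:
    print(f"{d:7.2f}  {merged}")
```

Because the sketch does not round between cycles, the last joining distance comes out near 108.6 rather than the 108.7 shown on the slide.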

Multiple Alignment- First pair

• Align the two most closely-related sequences first

• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged
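The ‘fixed’ behaviour can be illustrated with a helper that opens a gap in an already-aligned block. It is a minimal sketch: the function name `insert_gap_column` and the choice of position 5 are made up for the example, and the two rows are the aligned pair from the progressive alignment slides.

```python
def insert_gap_column(rows, position, width=1):
    """Insert `width` gap characters at `position` in every row of a fixed block."""
    return [row[:position] + "-" * width + row[position:] for row in rows]


pair = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgac-agcta",
]
widened = insert_gap_column(pair, 5)
for row in widened:
    print(row)
```

Every row receives the gap at the same column, so residues that were aligned before the insertion are still aligned afterwards.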

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  – Align a third sequence to the first two (Option 1)
  Or
  – Align two entirely different sequences to each other (Option 2)

ClustalW- Alternative 1

If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the third sequence) are each treated as a single sequence

[Figure: aligned pair + third sequence]

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

[Figure: first aligned pair + second aligned pair]

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Figure: guide tree with leaves 1, 2, 3, 4, 5]

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Figure: guide tree with leaves 1, 2, 3, 4, 5]

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Figure: guide tree with leaves 1, 2, 3, 4, 5]

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Figure: guide tree with leaves 1, 2, 3, 4, 5]

ClustalW- Good points/Bad points

• Advantages
  – Speed

• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum

• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  – Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?

• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)

• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

• GEP- Gap Extension Penalty is the cost of extending this gap
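How the two penalties combine can be shown with a small cost function. The sketch assumes the common affine form, in which a gap costs one opening charge plus an extension charge per gapped residue. Conventions differ on whether the first residue is also charged an extension, and the numeric values below are illustrative and are not asserted to be ClustalW's defaults.

```python
def gap_cost(length, gop=10.0, gep=0.2):
    """Affine cost of one gap: an opening charge plus a per-residue extension."""
    if length <= 0:
        return 0.0
    return gop + gep * length


one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
print(one_long_gap, three_short_gaps)
```

With a high GOP and a low GEP, one long gap is much cheaper than several short ones, so the aligner tends to concentrate gaps in a few places.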

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are:
  – D E G K N Q P R S
  – But this can be changed by the user
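The positions affected by these two rules can be located with a short sketch. It only finds the positions and leaves out how much the GOP is raised or lowered, since the slides do not give those factors. Reading ‘within 8 residues’ as a distance in alignment columns is an assumption, and the function names are illustrative.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default residues listed on the slide


def hydrophilic_positions(seq, run=5):
    """Indices that fall inside a run of at least `run` hydrophilic residues."""
    positions = set()
    start = None
    for i, residue in enumerate(seq + "#"):  # sentinel closes a trailing run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                positions.update(range(start, i))
            start = None
    return positions


def near_existing_gap(row, window=8):
    """Non-gap columns lying within `window` columns of a gap in an aligned row."""
    gaps = [i for i, c in enumerate(row) if c == "-"]
    return {
        i
        for i, c in enumerate(row)
        if c != "-" and any(abs(i - g) <= window for g in gaps)
    }


print(sorted(hydrophilic_positions("AAADEGKNAAA")))
print(sorted(near_existing_gap("ACDEFGHIKLMN-PQ", window=2)))
```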

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment
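The last step, copying gaps from the protein alignment back onto the DNA, can be sketched as below. The two DNA sequences are the ones shown above. The aligned protein rows `M-LLG` and `MTLLG` were written by hand for this illustration (the sequences translate to MLLG and MTLLG under the standard genetic code), and the function name `thread_gaps` is made up.

```python
def thread_gaps(aligned_protein, dna):
    """Copy the gaps of an aligned protein row onto its coding DNA, codon by codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(c != "-" for c in aligned_protein)
    if residues != len(codons):
        raise ValueError("protein row and DNA encode different numbers of residues")
    next_codon = iter(codons)
    out = []
    for c in aligned_protein:
        # A protein gap becomes a whole-codon gap; a residue consumes one codon.
        out.append("---" if c == "-" else next(next_codon))
    return "".join(out)


print(thread_gaps("M-LLG", "ATGCTGTTAGGG"))
print(thread_gaps("MTLLG", "ATGACTCTGTTAGGG"))
```

The gap now spans exactly one codon, so the reading frame is preserved, unlike the scattered single-base gaps in the direct DNA alignment.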

Manual Alignment- software

GDE- The Genetic Data Environment (UNIX)
CINEMA- Java applet available from:
  – http://www.biochem.ucl.ac.uk
Seqapp/Seqpup- Mac/PC/UNIX available from:
  – http://iubio.bio.indiana.edu
SeAl for Macintosh available from:
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC available from:
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

SSU rRNA

bull Structural RNA (not translated)bull Found in the small ribosomal subunitbull Widely-used for phylogeny reconstruction

(found in every species)bull Contains stem and loop structuresbull Stem structures usually conform to

watson-crick base pairing

Alignment of 16S rRNA can be guided by secondary structure

lt---------------(--------------------HELIX 19---------------------)lt---------------(22222222-000000-111111-00000-111111-0000-22222222Thermus ruber UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGATh thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGAEcoli UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGAAncystnidulans UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGABsubtilis UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGAChlaurantiacus UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGAmatch

Alignment of 16S rRNA sequences from different bacteria

Protein Alignment may be guided Protein Alignment may be guided by Tertiary Structure Interactionsby Tertiary Structure Interactions

Homo sapiens DjlA protein

Escherichia coli DjlA protein

Multiple Sequence Alignment- Methods

ndash3 main methods of alignment

bull Manual (using custom-built text editors)bull Automatic (using custom-built alignment

software)bull Combined

Manual Alignment - reasonsbull Might be carried out because

ndash Alignment is easyndash There is some extraneous information (structural)

ndash Automated alignment methods have encountered the local minimum problem

ndash An automated alignment method can be ldquoimprovedrdquo

Local minimum

GARFIELDTHEFAT---CAT
GARFIELDTHEFATFATCAT

Dotplots

• The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously
• Let's consider a dotplot between sperm whale and human myoglobins

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

Dotplot example sperm whale vs human myg

• Put one sequence on top
• the other on the side
• where residues are identical put a dot
• Diagonal lines of dots show similarities

Sperm whale myoglobin: G L S D G E W Q L V
Human myoglobin:       V L S E G E W Q L V

• Just do the first 10 amino acids of each
• Make a table with
  - whale sequence on top
  - human sequence on the side
• This is the result for the whole sequence
• It is easy to see that the diagonal is a line of dots
• So sperm whale and human myoglobin are very similar
• But the picture is noisy; can smooth using a sliding window which considers neighbouring residues as well
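The table for the first 10 amino acids can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the slides: the function name `dotplot` and the use of `*` for a dot and `.` for an empty cell are choices made here.

```python
def dotplot(top, side):
    """Build one text row per residue on the side; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for a in top) for b in side]


whale = "GLSDGEWQLV"  # first 10 residues of sperm whale myoglobin
human = "VLSEGEWQLV"  # first 10 residues of human myoglobin

grid = dotplot(whale, human)
print("  " + whale)              # whale sequence on top
for residue, row in zip(human, grid):
    print(residue + " " + row)   # human sequence on the side
```

Most of the marks fall on the main diagonal, with a few isolated marks elsewhere; those isolated marks are the noise that a sliding window is meant to remove.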

Dotplot example sperm whale vs human myg

• Can smooth noise using a sliding window which considers neighbouring residues as well
• Have done this here; can see the diagonal is highly similar
• Also, instead of using a simple identity, use a scoring matrix
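A sliding window can be sketched as follows. The window length of 3 and the threshold of 2 identical residues are arbitrary values picked for this illustration, and the name `windowed_dotplot` is hypothetical. The sketch still counts identities; a scoring matrix would replace the `==` test with a lookup of substitution scores.

```python
def windowed_dotplot(top, side, window=3, threshold=2):
    """Mark (i, j) when at least `threshold` of the `window` residue pairs
    starting at top[i] and side[j] are identical."""
    grid = []
    for j in range(len(side) - window + 1):
        row = ""
        for i in range(len(top) - window + 1):
            matches = sum(top[i + k] == side[j + k] for k in range(window))
            row += "*" if matches >= threshold else "."
        grid.append(row)
    return grid


whale = "GLSDGEWQLV"  # first 10 residues of sperm whale myoglobin
human = "VLSEGEWQLV"  # first 10 residues of human myoglobin

smoothed = windowed_dotplot(whale, human)
for row in smoothed:
    print(row)
```

For these two fragments the isolated off-diagonal matches disappear, because no two of them lie on the same diagonal within one window.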

Dotplots in practice

• The best tool is an applet called dotlet
• www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• www.bip.bham.ac.uk/dotlet/Dotlet.html
• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window
• Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-Prot entry

For further discussion of dotplots, see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods:

• Dynamic programming
  - Consider 2 protein sequences of 100 amino acids in length
  - If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  - More time than the universe has existed to align 20 sequences exhaustively
• Progressive alignment
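The growth in running time is easy to check with a little arithmetic. The sketch below takes the slide's premise at face value (100^n seconds for n sequences of length 100); the age of the universe, about 13.8 billion years, is an outside figure added here only for scale.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate value, used only for comparison


def exhaustive_seconds(n_sequences, length=100):
    """Time for exhaustive alignment under the slide's premise: length ** n seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    print(f"{n:2d} sequences: {exhaustive_seconds(n):.1e} s = {years:.1e} years")

ratio = exhaustive_seconds(20) / SECONDS_PER_YEAR / AGE_OF_UNIVERSE_YEARS
print(f"20 sequences take about {ratio:.1e} times the age of the universe")
```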

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) = n(n-1)/2 pairwise alignments as a starting point
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

CLUSTAL W

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1  -
Hbb_Horse  2  17  -
Hba_Human  3  59  60  -
Hba_Horse  4  59  59  13  -
Myg_Whale  5  77  77  75  75  -

Neighbor-joining tree (guide tree)

Hbb_Human
Hbb_Horse
Hba_Horse
Hba_Human
Myg_Whale

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

alpha-helices

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) possibilities
• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments
• Generate a distance matrix
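The three steps above can be sketched as follows. Several simplifications are assumed: the five short globin fragments from the ClustalW overview slide are used as if they were already pairwise aligned, the distance is the plain fraction of differing residues (a stand-in for whatever distance Clustal computes), and the names `p_distance` and `distance_matrix` are hypothetical. The values will therefore not match the matrix on the overview slide.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues, skipping columns where either row has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x != y for x, y in columns) / len(columns)


def distance_matrix(aligned):
    """Symmetric matrix of pairwise distances, stored as a nested dictionary."""
    matrix = {name: {name: 0.0} for name in aligned}
    for a, b in combinations(aligned, 2):
        matrix[a][b] = matrix[b][a] = p_distance(aligned[a], aligned[b])
    return matrix


fragments = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

n = len(fragments)
pair_count = len(list(combinations(fragments, 2)))
print(n, "sequences need", pair_count, "pairwise alignments")  # n(n-1)/2

matrix = distance_matrix(fragments)
for name, row in matrix.items():
    print(name, " ".join(f"{row[other]:.2f}" for other in fragments))
```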

Path Graph for aligning two sequences

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

Possible alignment

Scores along this path: 1, 1, 0, 1, 0, -1
Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1
Alignment score: 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1
Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
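The path graph is what a dynamic programming (Needleman-Wunsch style) global alignment fills in. The sketch below uses the scoring scheme from the slides (match +1, mismatch 0, indel -1); the function names are hypothetical, and the traceback returns just one of the equally good paths.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Sum the column scores of an existing alignment."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        else:
            total += match if x == y else mismatch
    return total


def global_align(a, b, match=1, mismatch=0, indel=-1):
    """Fill the score table, then trace one optimal path back from the corner."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * indel
    for j in range(1, len(b) + 1):
        score[0][j] = j * indel
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + indel, score[i][j - 1] + indel)

    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


print(score_alignment("GATTC-", "GAATTC"))   # the 'possible alignment' path
best, top_row, bottom_row = global_align("GATTC", "GAATTC")
print(best)
print(top_row)
print(bottom_row)
```

Because two paths share the best score, which one is printed depends only on the order in which the traceback tries the three moves.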

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

Figure: taxa A and B connected through Node 1

Distance Matrix

What is required for the Neighbour joining method: a distance matrix

PAM       Spinach   Rice   Mosquito   Monkey   Human
Spinach       0.0   84.9      105.6     90.8    86.3
Rice         84.9    0.0      117.8    122.4   122.6
Mosquito    105.6  117.8        0.0     84.7    80.8
Monkey       90.8  122.4       84.7      0.0     3.3
Human        86.3  122.6       80.8      3.3     0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

Figure: Monkey and Human joined at Mon-Hum; Spinach, Mosquito and Rice not yet joined

Calculation of New Distances

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

Figure: Monkey and Human joined at Mon-Hum; Spinach compared with the new subtree

Next Cycle

PAM       Spinach   Rice   Mosquito   MonHum
Spinach       0.0   84.9      105.6     88.6
Rice         84.9    0.0      117.8    122.5
Mosquito    105.6  117.8        0.0     82.8
MonHum       88.6  122.5       82.8      0.0

Figure: Mosquito joined to Mon-Hum, giving Mos-(Mon-Hum); Spinach and Rice not yet joined

Penultimate Cycle

PAM         Spinach   Rice   MosMonHum
Spinach         0.0   84.9        97.1
Rice           84.9    0.0       120.2
MosMonHum      97.1  120.2         0.0

Figure: Spinach and Rice joined at Spin-Rice, alongside Mos-(Mon-Hum)

Last Joining

PAM         SpinRice   MosMonHum
SpinRice         0.0       108.7
MosMonHum      108.7         0.0

Figure: Spin-Rice joined to Mos-(Mon-Hum), giving (Spin-Rice)-(Mos-(Mon-Hum))

Unrooted Neighbor-Joining Tree

Leaves: Human, Monkey, Mosquito, Rice, Spinach
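The cycles of the worked example can be replayed with a short script. Note the assumption: the sketch follows the slides literally (join the pair with the smallest distance, then take a simple average of the two old distances), which is simpler than the full neighbor-joining algorithm, where the pair to join is chosen with a correction for each taxon's average distance to all others. The function name `join_by_average` is hypothetical.

```python
def join_by_average(names, rows):
    """Repeatedly join the closest pair and average its distances to the rest."""
    dist = {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i < j:
                dist[frozenset((a, b))] = rows[i][j]

    clusters = list(names)
    joins = []
    while len(clusters) > 1:
        pairs = [(x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]]
        a, b = min(pairs, key=lambda pair: dist[frozenset(pair)])
        merged = f"({a},{b})"
        joins.append((a, b, dist[frozenset((a, b))]))
        for other in clusters:
            if other not in (a, b):
                # simple average of the distances to the two joined members
                dist[frozenset((merged, other))] = (
                    dist[frozenset((a, other))] + dist[frozenset((b, other))]
                ) / 2
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return joins, clusters[0]


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]

joins, tree = join_by_average(names, rows)
for a, b, d in joins:
    print(f"join {a} + {b} at distance {d:.2f}")
print(tree)
```

The script keeps unrounded intermediate values, so its last distance comes out slightly below the 108.7 in the table, which was computed from values already rounded to one decimal place.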

Multiple Alignment- First pair

• Align the two most closely-related sequences first
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  - Align a third sequence to the first two (Option 1)
  Or
  - Align two entirely different sequences to each other (Option 2)

ClustalW- Alternative 1

If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, each of these two entities (the aligned pair and the third sequence) is treated as a single sequence

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-
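The rule that an earlier alignment is ‘fixed’ can be checked on the five sequences of this example. The sketch below is a consistency check only, not an aligner: it confirms that every row of the final alignment has the same length, that removing the gaps gives back the input sequences, and that the pairs aligned in steps 1 and 2 keep their relative alignment. The helper name `project` is hypothetical.

```python
def project(alignment, keys):
    """Keep only the chosen rows and drop columns that are gaps in all of them."""
    rows = [alignment[k] for k in keys]
    kept = [column for column in zip(*rows) if any(ch != "-" for ch in column)]
    return {k: "".join(column[i] for column in kept) for i, k in enumerate(keys)}


inputs = {
    1: "gctcgatacgatacgatgactagcta",
    2: "gctcgatacaagacgatgacagcta",
    3: "gctcgatacacgatgactagcta",
    4: "gctcgatacacgatgacgagcga",
    5: "ctcgaacgatacgatgactagct",
}
step1 = {1: "gctcgatacgatacgatgactagcta", 2: "gctcgatacaagacgatgac-agcta"}
step2 = {3: "gctcgatacacgatgactagcta", 4: "gctcgatacacgatgacgagcga"}
final = {
    1: "gctcgatacgatacgatgactagcta",
    2: "gctcgatacaagacgatgac-agcta",
    3: "gctcgatacacga---tgactagcta",
    4: "gctcgatacacga---tgacgagcga",
    5: "-ctcga-acgatacgatgactagct-",
}

same_length = len({len(row) for row in final.values()}) == 1
recovers_inputs = all(final[k].replace("-", "") == inputs[k] for k in inputs)
keeps_step1 = project(final, [1, 2]) == step1
keeps_step2 = project(final, [3, 4]) == step2
print(same_length, recovers_inputs, keeps_step1, keeps_step2)
```

Sequences 3 and 4 received the same three-column gap in step 3, so projecting the final alignment back onto those two rows gives exactly their step 2 alignment.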

ClustalW- Good points/Bad points

• Advantages
  - Speed
• Disadvantages
  - No objective function
  - No way of quantifying whether or not the alignment is good
  - No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum

• Potential problems
  - Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  - Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some sequences are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP (Gap Opening Penalty) is the cost of opening a gap in an alignment
• GEP (Gap Extension Penalty) is the cost of extending this gap
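How the two penalties combine can be shown with a small function. This is a sketch of one common convention (pay the GOP once, then the GEP for every gapped position); programs differ on whether the first position also pays the GEP, and the penalty values below are made up for the example rather than Clustal defaults.

```python
def gap_cost(length, gop, gep):
    """Cost of one gap: GOP once to open it, plus GEP for each gapped position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP = 10.0  # illustrative value only
GEP = 0.5   # illustrative value only

one_long_gap = gap_cost(3, GOP, GEP)
three_short_gaps = 3 * gap_cost(1, GOP, GEP)
print(one_long_gap, three_short_gaps)
```

With an opening penalty much larger than the extension penalty, one gap of three positions is far cheaper than three separate gaps of one position each.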

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  - D, E, G, K, N, Q, P, R, S
  - But this can be changed by the user
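The position-specific rules can be put together in a sketch of a GOP table. The rules themselves come from the slides; the three multipliers (`lowered`, `raised`, `hydrophilic`) are placeholders invented for this illustration and are not the factors Clustal uses, and the function names are hypothetical.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide


def hydrophilic_columns(row, min_run=5):
    """Columns that lie inside a run of at least `min_run` hydrophilic residues."""
    columns, run = set(), []
    for i, ch in enumerate(row + "\0"):   # sentinel closes the last run
        if ch in HYDROPHILIC:
            run.append(i)
        else:
            if len(run) >= min_run:
                columns.update(run)
            run = []
    return columns


def gop_table(rows, gop=10.0, window=8, lowered=0.5, raised=2.0, hydrophilic=0.66):
    """One gap opening penalty per column of an aligned group of sequences."""
    length = len(rows[0])
    gap_columns = {i for i in range(length) if any(r[i] == "-" for r in rows)}
    loop_columns = set().union(*(hydrophilic_columns(r) for r in rows))
    table = []
    for i in range(length):
        if i in gap_columns:
            table.append(gop * lowered)   # existing gap: lowered, other rules skipped
            continue
        value = gop
        if any(abs(i - g) <= window for g in gap_columns):
            value *= raised               # close to an existing gap
        if i in loop_columns:
            value *= hydrophilic          # inside a hydrophilic stretch
        table.append(value)
    return table


group = ["DEKNS" + "A" * 15 + "-" + "A" * 4]   # toy one-sequence group
table = gop_table(group)
print([round(value, 1) for value in table])
```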

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
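The cutoff can be illustrated by computing, for each sequence, its average percent identity to all the others and setting aside those below 40%. This is a sketch of the idea as worded on the slide, not of Clustal's exact rule; it reuses the short globin fragments from the ClustalW overview slide as if they were pairwise aligned, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identical residues over columns where neither row has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def split_by_divergence(aligned, cutoff=40.0):
    """Return (align first, delay) according to mean identity to all other sequences."""
    first, delayed = [], []
    for name in aligned:
        scores = [percent_identity(aligned[name], aligned[other])
                  for other in aligned if other != name]
        mean = sum(scores) / len(scores)
        (delayed if mean < cutoff else first).append(name)
    return first, delayed


fragments = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

first, delayed = split_by_divergence(fragments)
print("align first:", first)
print("delay:", delayed)
```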

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes.

It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
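The translate, align and thread-back procedure can be sketched for the two sequences on this slide. Two things are assumed: the codon table below contains only the five codons that occur in the example, and the protein alignment (`M-LLG` against `MTLLG`) is supplied by hand rather than computed. The function names are hypothetical.

```python
# Standard genetic code, restricted to the codons present in this example
CODONS = {"ATG": "M", "CTG": "L", "TTA": "L", "GGG": "G", "ACT": "T"}


def translate(dna):
    """Translate an in-frame DNA sequence codon by codon."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def thread_gaps(aligned_protein, dna):
    """Insert a three-base gap in the DNA wherever the aligned protein has a gap."""
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


short_dna = "ATGCTGTTAGGG"
long_dna = "ATGACTCTGTTAGGG"
print(translate(short_dna), translate(long_dna))

aligned_short, aligned_long = "M-LLG", "MTLLG"   # protein alignment assumed by hand
print(thread_gaps(aligned_short, short_dna))
print(thread_gaps(aligned_long, long_dna))
```

The single protein gap becomes one whole-codon gap in the DNA, so the reading frame is preserved, unlike in the direct DNA alignment shown on the slide.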

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet, available from
  - http://www.biochem.ucl.ac.uk
SeqApp/SeqPup - Mac/PC/UNIX, available from
  - http://iubio.bio.indiana.edu
Se-Al for Macintosh, available from
  - http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from
  - http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html



  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Protein Alignment may be guided by Tertiary Structure Interactions

Homo sapiens DjlA protein

Escherichia coli DjlA protein

Multiple Sequence Alignment- Methods

– 3 main methods of alignment

• Manual (using custom-built text editors)
• Automatic (using custom-built alignment software)
• Combined

Manual Alignment - reasons

• Might be carried out because
– Alignment is easy
– There is some extraneous information (structural)
– Automated alignment methods have encountered the local minimum problem
– An automated alignment method can be “improved”

Local minimum

GARFIELDTHEFAT---CAT
GARFIELDTHEFATFATCAT

Dotplots

• The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously
• Let's consider a dotplot between sperm whale and human myoglobins

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

Dotplot example sperm whale vs human myg

• Put one sequence on top
• The other on the side
• Where residues are identical put a dot
• Diagonal lines of dots show similarities
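The recipe above (one sequence on top, the other on the side, a dot wherever residues are identical) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `dotplot` and the `*`/`.` text rendering are assumptions made here, not part of any tool named in the slides. It uses the first 10 residues of each myoglobin.

```python
def dotplot(top, side):
    """Return one text row per residue of `side`.

    A '*' marks a cell where the residue on the side is identical to the
    residue on top; '.' marks every other cell.
    """
    return ["".join("*" if s == t else "." for t in top) for s in side]


whale = "GLSDGEWQLV"  # first 10 residues of sperm whale myoglobin (top)
human = "VLSEGEWQLV"  # first 10 residues of human myoglobin (side)

rows = dotplot(whale, human)

# Print the table with the whale sequence as the column header.
print("  " + whale)
for residue, row in zip(human, rows):
    print(residue + " " + row)
```

Reading the printed table, the main diagonal carries most of the stars, and a few isolated off-diagonal stars appear wherever the same amino acid happens to recur.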

• Just do the first 10 amino acids of each
• Make a table with
– whale sequence on top
– human sequence on the side

Sperm whale myoglobin (top): G L S D G E W Q L V
Human myoglobin (side): V L S E G E W Q L V

• This is the result for the whole sequence
• It is easy to see that the diagonal is a line of dots
• So sperm whale and human myoglobin are very similar
• But the picture is noisy; it can be smoothed using a sliding window which considers neighbouring residues as well

[Figure: smoothed dotplot, sperm whale myoglobin (top) vs human myoglobin (side)]

• Noise can be smoothed using a sliding window which considers neighbouring residues as well
• This has been done here; the diagonal can be seen to be highly similar
• Also, instead of using a simple identity, use a scoring matrix
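A sliding-window version of the dotplot might look like the sketch below. The window size, the threshold and the function name `windowed_dotplot` are illustrative assumptions, not settings taken from dotlet or any other tool. The `score` argument defaults to simple identity; a lookup into a substitution matrix could be passed instead, which is not included here.

```python
def windowed_dotplot(top, side, window=3, threshold=2, score=None):
    """Dotplot smoothed with a diagonal sliding window.

    For each cell, the scores of the cells along the same diagonal inside
    the window are summed, and a '*' is drawn only if the sum reaches
    `threshold`. Cells falling outside either sequence are ignored.
    """
    if score is None:
        # Default: simple identity scoring (1 for identical residues).
        score = lambda a, b: 1 if a == b else 0
    half = window // 2
    rows = []
    for i in range(len(side)):
        row = []
        for j in range(len(top)):
            total = 0
            for k in range(-half, half + 1):
                if 0 <= i + k < len(side) and 0 <= j + k < len(top):
                    total += score(side[i + k], top[j + k])
            row.append("*" if total >= threshold else ".")
        rows.append("".join(row))
    return rows


whale = "GLSDGEWQLV"
human = "VLSEGEWQLV"
smoothed = windowed_dotplot(whale, human)

print("  " + whale)
for residue, row in zip(human, smoothed):
    print(residue + " " + row)
```

With these assumed settings, an isolated match such as the V in the corner no longer produces a dot, while cells on the well-conserved diagonal still do.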

Dotplots in practice

• The best tool is an applet called dotlet
• www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• www.bip.bham.ac.uk/dotlet/Dotlet.html
• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window
• Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Swiss-prot entry

For further discussion of dotplots see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods

• Dynamic programming
– Consider 2 protein sequences of 100 amino acids in length
– If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
– More time than the universe has existed to align 20 sequences exhaustively

• Progressive alignment
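The scaling argument above can be checked with a little arithmetic. The sketch below takes the slide's premise of 100^n seconds for n sequences of length 100 at face value, and assumes an age of the universe of roughly 13.8 billion years; that figure and the function name are assumptions introduced here for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
# Assumed age of the universe: about 13.8 billion years.
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR


def exhaustive_seconds(n_sequences, length=100):
    """Time in seconds under the slide's premise of length**n_sequences."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    t = exhaustive_seconds(n)
    print(n, "sequences:", t, "s =", t / AGE_OF_UNIVERSE_S, "ages of the universe")

# Smallest number of sequences whose exhaustive alignment exceeds that age.
first_too_slow = next(n for n in range(2, 21)
                      if exhaustive_seconds(n) > AGE_OF_UNIVERSE_S)
print("exceeds the age of the universe from", first_too_slow, "sequences")
```

Under these assumptions the limit is passed long before 20 sequences, which is why a heuristic such as progressive alignment is used instead.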

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) = n(n-1)/2 pairwise alignments as a starting point
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1   -
Hbb_Horse  2   17  -
Hba_Human  3   59  60  -
Hba_Horse  4   59  59  13  -
Myg_Whale  5   77  77  75  75  -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale; the numbers on the tree give the order of the alignment steps]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure annotation: alpha-helices]

CLUSTAL W

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) = n(n-1)/2 possibilities
• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments
• Generate a distance matrix
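As a rough illustration of these three steps, the sketch below enumerates every pair and fills a distance matrix. Two simplifications are assumptions made here: the five globin fragments from the overview slide are used as already aligned, rather than aligning each pair separately as ClustalW does, and the distance is a plain fraction of mismatched columns. The numbers will therefore not reproduce the matrix shown in the overview.

```python
from itertools import combinations


def n_pairwise(n):
    """Number of pairwise alignments: (n-1) + (n-2) + ... + 1."""
    return n * (n - 1) // 2


def p_distance(a, b):
    """Fraction of mismatches over columns where neither row has a gap."""
    compared = mismatched = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        mismatched += x != y
    return mismatched / compared if compared else 0.0


def distance_matrix(aligned):
    """Symmetric dict-of-dicts of distances between all pairs of rows."""
    dist = {name: {name: 0.0} for name in aligned}
    for p, q in combinations(aligned, 2):
        dist[p][q] = dist[q][p] = p_distance(aligned[p], aligned[q])
    return dist


fragments = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

matrix = distance_matrix(fragments)
print(n_pairwise(len(fragments)), "pairwise comparisons")
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in fragments])
```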

Path Graph for aligning two sequences

Possible alignment

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

Scores along this path: 1, 1, 0, 1, 0, -1

Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Column scores: 1, 1, 0, 1, 0, -1

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1

Alignment score: 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1

Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
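The path-graph scoring above can be reproduced with a standard global dynamic-programming alignment (Needleman-Wunsch style) using the slide's scheme of match +1, mismatch 0 and indel -1. This is a minimal sketch: the function names are illustrative, and the traceback returns only one of the equally optimal alignments.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Score two equal-length aligned rows column by column."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        else:
            total += match if x == y else mismatch
    return total


def global_alignment(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)

    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_alignment("GATTC", "GAATTC")
print(best)
print(row_a)
print(row_b)
print(score_alignment("GATTC-", "GAATTC"))  # the non-optimal 'possible alignment'
```

Both optimal alignments shown in the slides score 4; which of the two the traceback reports depends only on the order in which ties are broken.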

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

Distance Matrix

What is required for the Neighbour joining method: a distance matrix

PAM       Spinach  Rice   Mosquito  Monkey  Human
Spinach   0.0      84.9   105.6     90.8    86.3
Rice      84.9     0.0    117.8     122.4   122.6
Mosquito  105.6    117.8  0.0       84.7    80.8
Monkey    90.8     122.4  84.7      0.0     3.3
Human     86.3     122.6  80.8      3.3     0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice not yet joined]

Calculation of New Distances

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

[Figure: Spinach and the Mon-Hum subtree (Monkey, Human)]

Next Cycle

PAM       Spinach  Rice   Mosquito  MonHum
Spinach   0.0      84.9   105.6     88.6
Rice      84.9     0.0    117.8     122.5
Mosquito  105.6    117.8  0.0       82.8
MonHum    88.6     122.5  82.8      0.0

[Figure: Mosquito joined to Mon-Hum as Mos-(Mon-Hum); Spinach and Rice not yet joined]

Penultimate Cycle

PAM        Spinach  Rice   MosMonHum
Spinach    0.0      84.9   97.1
Rice       84.9     0.0    120.2
MosMonHum  97.1     120.2  0.0

[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]

Last Joining

PAM        SpinRice  MosMonHum
SpinRice   0.0       108.7
MosMonHum  108.7     0.0

[Figure: final join, (Spin-Rice)-(Mos-(Mon-Hum))]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice and Spinach]
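The joining cycles walked through above can be sketched as a short loop: repeatedly join the closest pair, then replace it by a cluster whose distance to every other node is the simple average described in the slides. Note the assumption: this follows the slides' simplified description only. The full neighbor-joining method uses a rate-corrected criterion to choose the pair and a different distance update, neither of which is implemented here. The function name and the bracketed cluster labels are illustrative.

```python
def join_by_averaging(names, matrix):
    """Join the closest pair repeatedly, averaging distances to the new cluster.

    Returns a list of (cluster_a, cluster_b, distance) in joining order.
    """
    dist = {frozenset((a, b)): matrix[i][j]
            for i, a in enumerate(names)
            for j, b in enumerate(names) if i < j}
    active = list(names)
    joins = []
    while len(active) > 1:
        pairs = [(x, y) for i, x in enumerate(active) for y in active[i + 1:]]
        a, b = min(pairs, key=lambda p: dist[frozenset(p)])
        merged = f"({a},{b})"
        for other in active:
            if other not in (a, b):
                # Simple average of the two distances, as in the slides.
                dist[frozenset((merged, other))] = (
                    dist[frozenset((a, other))] + dist[frozenset((b, other))]) / 2
        joins.append((a, b, dist[frozenset((a, b))]))
        active = [x for x in active if x not in (a, b)] + [merged]
    return joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
pam = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]

joins = join_by_averaging(names, pam)
for a, b, d in joins:
    print(f"join {a} + {b} at distance {d:.2f}")
```

The joining order matches the slides: Monkey with Human, then Mosquito, then Spinach with Rice, then the final join. The last distance comes out slightly below the 108.7 in the slides because the slides round intermediate values to one decimal place and this sketch does not.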

Multiple Alignment- First pair

• Align the two most closely-related sequences first
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged
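The idea that a later gap goes into the same place in every sequence of a fixed group can be illustrated with a tiny helper. The name `insert_gap_column` and the toy pair below are assumptions used only for illustration.

```python
def insert_gap_column(group, position):
    """Insert one gap column at `position` in every row of an aligned group.

    Every row receives the gap at the same column, so the residues that
    were paired before the insertion are still paired afterwards.
    """
    return [row[:position] + "-" + row[position:] for row in group]


fixed_pair = ["GA-TTC", "GAATTC"]
widened = insert_gap_column(fixed_pair, 3)
for row in widened:
    print(row)
```

Removing the gaps from each row of `widened` gives back the original sequences, and the relative alignment of the pair is unchanged.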

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
– Align a third sequence to the first two
Or
– Align two entirely different sequences to each other

Option 1          Option 2

ClustalW- Alternative 1

• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the third sequence) are each treated as a single sequence

[Figure: aligned pair + third sequence]

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

[Figure: first aligned pair + second aligned pair]

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Figure: guide tree with leaves 1 2 3 4 5]

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Figure: guide tree with leaves 1 2 3 4 5]

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Figure: guide tree with leaves 1 2 3 4 5]

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Figure: guide tree with leaves 1 2 3 4 5]
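One simple sanity check on a finished alignment such as the one above is that every row has the same length and that removing the gaps from each row gives back the input sequence. The sketch below applies that check to the five example sequences; the function name `check_alignment` is an assumption for illustration, and the check says nothing about whether the alignment is biologically sensible.

```python
def check_alignment(aligned, originals):
    """True if all rows are equally long and degap to the original sequences."""
    if len({len(row) for row in aligned}) != 1:
        return False
    return all(row.replace("-", "") == seq
               for row, seq in zip(aligned, originals))


originals = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgacagcta",
    "gctcgatacacgatgactagcta",
    "gctcgatacacgatgacgagcga",
    "ctcgaacgatacgatgactagct",
]
final_alignment = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgac-agcta",
    "gctcgatacacga---tgactagcta",
    "gctcgatacacga---tgacgagcga",
    "-ctcga-acgatacgatgactagct-",
]

print(check_alignment(final_alignment, originals))
```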

ClustalW- Good points/Bad points

• Advantages
– Speed

• Disadvantages
– No objective function
– No way of quantifying whether or not the alignment is good
– No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum

• Potential problems
– Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
– Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some sequences are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment
• GEP- Gap Extension Penalty is the cost of extending this gap
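How the two penalties combine can be shown with a small calculation. The sketch below assumes one common convention, in which a gap of a given length costs the opening penalty plus the extension penalty for each gapped position; other conventions charge the extension only for positions after the first. The penalty values are illustrative and are not claimed to be ClustalW defaults.

```python
def gap_cost(length, gop, gep):
    """Cost of one gap of `length` positions: opening plus per-position extension."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP = 10.0  # illustrative gap opening penalty
GEP = 0.2   # illustrative gap extension penalty

one_long_gap = gap_cost(3, GOP, GEP)
three_short_gaps = 3 * gap_cost(1, GOP, GEP)
print(one_long_gap, three_short_gaps)
```

With an opening penalty much larger than the extension penalty, a single gap of three positions is far cheaper than three separate one-position gaps, so the alignment favours a few longer indels over many scattered ones.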

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
– D E G K N Q P R S
– But this can be changed by the user
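The hydrophilic-stretch rule can be sketched as follows: find runs of at least 5 residues drawn from the default set, then lower the GOP at those positions. The base penalty and the reduction factor below are hypothetical values chosen for illustration, not ClustalW's actual numbers, and the function names are assumptions.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide


def hydrophilic_stretches(seq, min_run=5):
    """Return (start, end) index pairs of runs of >= min_run hydrophilic residues."""
    stretches = []
    start = None
    for i, aa in enumerate(seq + "#"):  # '#' is a sentinel that ends any open run
        if aa in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                stretches.append((start, i))
            start = None
    return stretches


def position_gops(seq, base_gop=10.0, reduction=0.5, min_run=5):
    """Per-position GOP table, lowered inside hydrophilic stretches.

    `base_gop` and `reduction` are hypothetical illustration values.
    """
    gops = [base_gop] * len(seq)
    for start, end in hydrophilic_stretches(seq, min_run):
        for i in range(start, end):
            gops[i] = base_gop * reduction
    return gops


protein = "AAADEGKNAAAKRA"
print(hydrophilic_stretches(protein))
print(position_gops(protein))
```

In this toy sequence only the run DEGKN is long enough to count as a stretch; the shorter run KR leaves the penalty unchanged.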

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• Sequences below the cutoff are delayed until the others have been aligned

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
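The last step of that recipe, putting the gaps back into the DNA according to the protein alignment, can be sketched as below. The sketch assumes the protein alignment is already available (here written by hand for the two example sequences, which translate to MLLG and MTLLG under the standard genetic code) and that each DNA sequence is in frame with a length divisible by three. The function name `thread_gaps` is an illustrative assumption.

```python
def thread_gaps(aligned_protein, dna):
    """Insert '---' into `dna` wherever `aligned_protein` has a gap.

    Each non-gap residue consumes one codon (three nucleotides) of `dna`.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) % 3 != 0 or residues != len(codons):
        raise ValueError("DNA length does not match the protein row")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining)
                   for aa in aligned_protein)


# Hand-written protein alignment of the two translated example sequences.
protein_rows = ["M-LLG", "MTLLG"]
dna_rows = ["ATGCTGTTAGGG", "ATGACTCTGTTAGGG"]

codon_alignment = [thread_gaps(p, d) for p, d in zip(protein_rows, dna_rows)]
for row in codon_alignment:
    print(row)
```

The resulting gap is a whole codon, which keeps the reading frame intact, unlike the scattered single-nucleotide gaps in the direct DNA alignment shown above.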

Manual Alignment- software

GDE- The Genetic Data Environment (UNIX)

CINEMA- Java applet available from
– http://www.biochem.ucl.ac.uk

Seqapp/Seqpup- Mac/PC/UNIX available from
– http://iubio.bio.indiana.edu

SeAl for Macintosh available from
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC available from
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html


Multiple Sequence Alignment- Methods

ndash3 main methods of alignment

bull Manual (using custom-built text editors)bull Automatic (using custom-built alignment

software)bull Combined

Manual Alignment - reasonsbull Might be carried out because

ndash Alignment is easyndash There is some extraneous information (structural)

ndash Automated alignment methods have encountered the local minimum problem

ndash An automated alignment method can be ldquoimprovedrdquo

Local minimum

GARFIELDTHEFAT---CATGARFIELDTHEFATFATCAT

bull The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously

bull Lets consider a dotplot between sperm whale and human myoglobins

Dotplots

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

bull Put one sequence on top

bull the other on the side

bull where residues are identical put a dot

bull Diagonal lines of dots show similarities

Dotplot example sperm whale vs human myg

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bullJust do the first 10 amino acids of eachbullMake a table with

ndashwhale sequence on top ndashhuman sequence on the side

bull This is the result for the whole sequence

bull It is easy to see that the diagonal is a line of dots

bull So sperm whale and human myoglobin are very similar

bull But the picture is noisy can smooth using a sliding window which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

16

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bull can smooth noise using a sliding window which considers neighbouring residues as well

bull Have done this here can see the diagonal is highly similar

bull Also instead of using using a simple identity use a scoring matrix

Dotplot example sperm whale vs human myg

Dotplots in practicebull The best tool is an applet called dotlet

bull wwwisrecisb-sibchjavadotletDotlethtmlbull wwwbipbhamacukdotletDotlethtml

bull an applet is a program that runs in a web browser This means that you can produce dotplots within a netscapeIE window

bull Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

bull Protein has many repeats bull SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

bull Perform a dotplot of the SLIT protein against itself wwwbiobhamacukdotletDotlethtml

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

20Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming2 methodsbull Dynamic programming

ndash Consider 2 protein sequences of 100 amino acids in lengthndash If it takes 1002 seconds to exhaustively align these sequences

then it will take 1003 seconds to align 3 sequences 1004 to align 4 sequencesetc

ndash More time than the universe has existed to align 20 sequences exhaustively

bull Progressive alignment

Progressive Alignmentbull Devised by Feng and Doolittle in 1987bull Essentially a heuristic method and as such

is not guaranteed to find the lsquooptimalrsquo alignment

bull Requires n-1+n-2+n-3n-n+1 pairwise alignments as a starting point

bull Most successful implementation is Clustal (Des Higgins) This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG2 GEEKAAVLALWDKVN--EEEVGG3 PADKTNVKAAWGKVGAHAGEYGA4 AADKTNVKAAWSKVGGHAGEYGA5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human 1 -Hbb_Horse 2 17 -Hba_Human 3 59 60 -Hba_Horse 4 59 59 13 -Myg_Whale 5 77 77 75 75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3 4

2

1

3 4

alpha-helices

Quick pairwise alignment calculate distance matrix

Neighbor-joining tree(guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

bull First perform all possible pairwise alignments between each pair of sequences There are (n-1)+(n-2)(n-n+1) possibilities

bull Calculate the lsquodistancersquo between each pair of sequences based on these isolated pairwise alignments

bull Generate a distance matrix

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

bull Two penalties are set by the user (there are default values but you should know that it is possible to change these)

bull GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

bull GEP- Gap Extension Penalty is the cost of extending this gap

Position-Specific gap penaltiesbull Before any pair of (groups of) sequences are

aligned a table of GOPs are generated for each position in the two (sets of) sequences

bull The GOP is manipulated in a position-specific manner so that it can vary over the sequences

bull If there is a gap at a position the GOP and GEP penalties are lowered the other rules do not apply

bull This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
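The position-specific rules on the last two slides can be sketched as a function that builds a GOP table for one group of aligned sequences. This is only an illustration and not ClustalW's code: the name `gop_table`, the base penalty of 10 and the scaling factors (0.5, 2.0 and 0.67) are assumptions chosen for readability. Only the 8-residue distance, the run length of 5 and the default hydrophilic residues come from the slides.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed on the slide


def gop_table(alignment, base_gop=10.0, near=8, run=5):
    """Return one gap-opening penalty per column of `alignment`.

    `alignment` is a list of equal-length strings that use '-' for gaps.
    All scaling factors below are illustrative assumptions.
    """
    ncols = len(alignment[0])
    has_gap = [any(seq[i] == "-" for seq in alignment) for i in range(ncols)]
    gap_cols = [i for i, g in enumerate(has_gap) if g]

    # Mark columns that lie in a run of `run` hydrophilic residues in any sequence
    in_stretch = [False] * ncols
    for seq in alignment:
        for start in range(ncols - run + 1):
            if all(c in HYDROPHILIC for c in seq[start:start + run]):
                for i in range(start, start + run):
                    in_stretch[i] = True

    table = []
    for i in range(ncols):
        if has_gap[i]:
            gop = base_gop * 0.5      # existing gap: lowered, other rules do not apply
        else:
            gop = base_gop
            if any(abs(i - g) <= near for g in gap_cols):
                gop *= 2.0            # close to an existing gap: raised
            if in_stretch[i]:
                gop *= 0.67           # hydrophilic stretch (likely loop): lowered
        table.append(gop)
    return table
```

The GEP could be handled with a second table built the same way; it is left out here to keep the sketch short.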

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
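One way to picture the cutoff is to compute, for each sequence, its mean percent identity to all the others and flag those that fall below the threshold. The sketch below does that. The function name, the dictionary layout and the reading of "most different on average" as mean pairwise identity are assumptions made for illustration; ClustalW's own test may be defined differently.

```python
def delayed_sequences(identity, cutoff=40.0):
    """Return the names whose mean identity to all other sequences is below `cutoff`.

    `identity` maps frozenset({name_a, name_b}) to a percent identity.
    """
    names = sorted({name for pair in identity for name in pair})
    delayed = []
    for name in names:
        others = [identity[frozenset((name, other))] for other in names if other != name]
        if sum(others) / len(others) < cutoff:
            delayed.append(name)
    return delayed
```

With three hypothetical sequences where A and B are 90% identical but C is 30% and 35% identical to them, only C would be held back until A and B have been aligned.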

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes directly

Unaligned:

ATGCTGTTAGGG
ATGACTCTGTTAGGG

Aligned as DNA:

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
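The last step, putting the gaps into the DNA according to the protein alignment, can be sketched in a few lines. The function name `thread_dna` is hypothetical, and the sketch assumes the DNA is in frame, has no stop codon at the end, and has a length that is a multiple of three.

```python
def thread_dna(protein_aln, dna):
    """Place gaps in an unaligned coding sequence to match its aligned protein.

    Each residue in `protein_aln` consumes one codon of `dna`;
    each '-' becomes a three-base gap '---'.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out, k = [], 0
    for aa in protein_aln:
        if aa == "-":
            out.append("---")
        else:
            out.append(codons[k])
            k += 1
    if k != len(codons):
        raise ValueError("protein and DNA lengths do not match")
    return "".join(out)


# The two sequences from the slide translate to MLLG and MTLLG.
# Aligning the proteins by hand gives M-LLG / MTLLG.
print(thread_dna("M-LLG", "ATGCTGTTAGGG"))     # ATG---CTGTTAGGG
print(thread_dna("MTLLG", "ATGACTCTGTTAGGG"))  # ATGACTCTGTTAGGG
```

The gap now falls on a codon boundary and is three bases long, which is what a single amino acid insertion or deletion would produce.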

Manual Alignment- software

• GDE- The Genetic Data Environment (UNIX)
• CINEMA- Java applet available from
  – http://www.biochem.ucl.ac.uk
• Seqapp/Seqpup- Mac/PC/UNIX, available from
  – http://iubio.bio.indiana.edu
• Se-Al for Macintosh, available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
• BioEdit for PC, available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Manual Alignment - reasons

• Might be carried out because
  – Alignment is easy
  – There is some extraneous information (structural)
  – Automated alignment methods have encountered the local minimum problem
  – An automated alignment method can be “improved”

Local minimum

GARFIELDTHEFAT---CAT
GARFIELDTHEFATFATCAT

Dotplots

• The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously
• Let's consider a dotplot between sperm whale and human myoglobins

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

• Put one sequence on top
• Put the other on the side
• Where residues are identical, put a dot
• Diagonal lines of dots show similarities
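The four steps above map directly onto a few lines of code. The sketch below prints a text dotplot for the first 10 residues of each myoglobin; the function name `dotplot` and the choice of `*` and `.` as symbols are illustrative assumptions.

```python
def dotplot(top, side):
    """Return one text row per residue of `side`, one column per residue of `top`.

    A '*' marks identical residues, a '.' marks everything else.
    """
    return ["".join("*" if a == b else "." for a in top) for b in side]


whale = "GLSDGEWQLV"  # first 10 residues of sperm whale myoglobin
human = "VLSEGEWQLV"  # first 10 residues of human myoglobin

print(" ", whale)
for residue, row in zip(human, dotplot(whale, human)):
    print(residue, row)
```

Reading the output from top left to bottom right, the main diagonal is almost unbroken, while the scattered off-diagonal dots are chance matches between identical residues.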

Dotplot example sperm whale vs human myg

• Just do the first 10 amino acids of each
• Make a table with
  – whale sequence on top
  – human sequence on the side

Sperm whale myoglobin: G L S D G E W Q L V
Human myoglobin: V L S E G E W Q L V

[Figure: dotplot of the whole sequences]

• This is the result for the whole sequence
• It is easy to see that the diagonal is a line of dots
• So sperm whale and human myoglobin are very similar
• But the picture is noisy; it can be smoothed using a sliding window, which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

Sperm whale myoglobin: G L S D G E W Q L V
Human myoglobin: V L S E G E W Q L V

• Noise can be smoothed using a sliding window, which considers neighbouring residues as well
• This has been done here; the diagonal can be seen to be highly similar
• Also, instead of using a simple identity, use a scoring matrix
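A sliding window can be sketched as follows: a dot is drawn only when enough of the residue pairs along a short diagonal segment agree. The window length of 3 and threshold of 2 are arbitrary illustrative values, and the sketch still counts simple identities; replacing the identity test with a lookup in a scoring matrix, as the slide suggests, would need a matrix that is not given here.

```python
def windowed_dotplot(top, side, window=3, threshold=2):
    """Return text rows where '*' marks a diagonal segment with enough matches.

    Position (i, j) gets a dot when at least `threshold` of the `window`
    residue pairs side[i+k], top[j+k] are identical.
    """
    rows = []
    for i in range(len(side) - window + 1):
        row = ""
        for j in range(len(top) - window + 1):
            matches = sum(side[i + k] == top[j + k] for k in range(window))
            row += "*" if matches >= threshold else "."
        rows.append(row)
    return rows


for row in windowed_dotplot("GLSDGEWQLV", "VLSEGEWQLV"):
    print(row)
```

Isolated chance matches no longer produce dots because their neighbours along the diagonal do not match, so the true diagonal stands out more clearly.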

Dotplots in practice

• The best tool is an applet called dotlet
  – www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  – www.bip.bham.ac.uk/dotlet/Dotlet.html
• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window
• Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-Prot entry

For further discussion of dotplots see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods:

• Dynamic programming
  – Consider 2 protein sequences of 100 amino acids in length
  – If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  – More time than the universe has existed to align 20 sequences exhaustively
• Progressive alignment
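The growth described under dynamic programming is easy to check numerically. The sketch below uses the slide's cost model (length to the power of the number of sequences, in seconds) and compares it with an approximate age of the universe; the figure of 1.38e10 years is a value assumed here and is not taken from the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate value, assumed for comparison


def exhaustive_seconds(n_sequences, length=100):
    """Cost model from the slide: length ** n_sequences seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    print(f"{n} sequences: {years:.3g} years")
```

For 20 sequences the model gives 100^20 = 10^40 seconds, many orders of magnitude beyond the age of the universe, which is why a heuristic such as progressive alignment is used instead.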

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and, as such, is not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) pairwise alignments as a starting point, that is, n(n-1)/2
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

[Figure: the three stages of CLUSTAL W, illustrated with five globin sequences]

1. Quick pairwise alignment: calculate distance matrix

               1    2    3    4    5
Hbb_Human  1   -
Hbb_Horse  2   17   -
Hba_Human  3   59   60   -
Hba_Horse  4   59   59   13   -
Myg_Whale  5   77   77   75   75   -

2. Neighbor-joining tree (guide tree), with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale

3. Progressive alignment following guide tree (alpha-helices are marked in the figure)

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) = n(n-1)/2 possibilities
• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments
• Generate a distance matrix
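The pair count and one simple notion of distance can be sketched as below. The distance used here, one minus the fraction of identical residues over columns without gaps, is an assumption made for illustration; the slides do not say exactly how ClustalW converts a pairwise alignment into a distance.

```python
from itertools import combinations


def n_pairs(n):
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def identity_distance(a, b):
    """1 - fraction of identical residues, ignoring columns that contain a gap.

    `a` and `b` are the two rows of a pairwise alignment (equal length).
    """
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    same = sum(x == y for x, y in cols)
    return 1.0 - same / len(cols)


print(n_pairs(5), len(list(combinations(range(5), 2))))  # both give 10
print(identity_distance("GATTC-", "GAATTC"))
```

Filling a matrix with `identity_distance` for every pair returned by `combinations` gives the distance matrix that the guide tree is built from.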

Path Graph for aligning two sequences

Scoring Scheme
• Match: +1
• Mismatch: 0
• Indel: -1

Possible alignment

Scores along this path: 1, 1, 0, 1, 0, -1
Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1
Alignment score 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1
Alignment score 4

Alignment using this path

G-ATTC
GAATTC
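The path graph can be filled in and traced back with a standard global alignment (Needleman-Wunsch) routine. The sketch below uses the slide's scoring scheme (match +1, mismatch 0, indel -1); the function name `global_align` and the tie-breaking order in the traceback are choices made here, so it reports only one of the two optimal alignments.

```python
def global_align(a, b, match=1, mismatch=0, indel=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + indel,          # gap in b
                score[i][j - 1] + indel,          # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path
    out_a, out_b, i, j = [], [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("GATTC", "GAATTC"))
```

For GATTC and GAATTC the best score is 4: five matches and one indel. Both alignments shown on the slides reach it, which illustrates that an optimal alignment need not be unique.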

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B joined at Node 1]

Distance Matrix

What is required for the Neighbour joining method? A distance matrix.

PAM        Spinach   Rice   Mosquito   Monkey   Human
Spinach        0.0   84.9      105.6     90.8    86.3
Rice          84.9    0.0      117.8    122.4   122.6
Mosquito     105.6  117.8        0.0     84.7    80.8
Monkey        90.8  122.4       84.7      0.0     3.3
Human         86.3  122.6       80.8      3.3     0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Tree so far: Mon-Hum joins Monkey and Human; Spinach, Mosquito and Rice are not yet joined]

Calculation of New Distances

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55

[Tree so far: Mon-Hum joins Monkey and Human; distance to Spinach being calculated]

Next Cycle

PAM        Spinach   Rice   Mosquito   MonHum
Spinach        0.0   84.9      105.6     88.6
Rice          84.9    0.0      117.8    122.5
Mosquito     105.6  117.8        0.0     82.8
MonHum        88.6  122.5       82.8      0.0

[Tree so far: Mos-(Mon-Hum) joins Mosquito to Mon-Hum (Monkey, Human); Spinach and Rice are not yet joined]

Penultimate Cycle

PAM         Spinach   Rice   MosMonHum
Spinach         0.0   84.9        97.1
Rice           84.9    0.0       120.2
MosMonHum      97.1  120.2         0.0

[Tree so far: Mos-(Mon-Hum) as before; Spin-Rice joins Spinach and Rice]

Last Joining

PAM         SpinRice   MosMonHum
SpinRice         0.0       108.7
MosMonHum      108.7         0.0

[Tree: (Spin-Rice)-(Mos-(Mon-Hum)) joins Spin-Rice (Spinach, Rice) to Mos-(Mon-Hum)]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice and Spinach]
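The joining cycles walked through above can be reproduced with a short loop: find the smallest distance, join that pair, and replace the pair with a cluster whose distance to every other node is the simple average. This sketch follows the slides' simplified procedure. The full neighbor-joining method of Saitou and Nei chooses pairs with a criterion that corrects for each node's overall divergence rather than taking the raw minimum, so it should not be read as a complete implementation. The function name and data layout are assumptions.

```python
def join_closest(dist):
    """Repeatedly join the closest pair, averaging distances to the new cluster.

    `dist` maps frozenset({a, b}) to a distance.
    Returns the joins in order as (name_a, name_b, distance) tuples.
    """
    dist = dict(dist)
    joins = []
    while dist:
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        joins.append((a, b, dist[pair]))
        new = f"({a},{b})"
        others = {name for p in dist for name in p} - {a, b}
        # Keep distances that do not involve the joined pair
        updated = {p: d for p, d in dist.items() if a not in p and b not in p}
        for o in others:
            # Simple average of the distances to the two joined nodes
            updated[frozenset((o, new))] = (dist[frozenset((o, a))] + dist[frozenset((o, b))]) / 2
        dist = updated
    return joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
pam = {frozenset((names[i], names[j])): rows[i][j] for i in range(5) for j in range(i)}

for a, b, d in join_closest(pam):
    print(f"join {a} + {b} at distance {d:.2f}")
```

The order of joins matches the slides: Human with Monkey, then Mosquito, then Spinach with Rice, then the two clusters. The final distance comes out slightly below the slide's 108.7 because the slides round to one decimal place at every cycle and this sketch does not.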

Multiple Alignment- First pair

• Align the two most closely related sequences first
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  – Option 1: align a third sequence to the first two
  – Or Option 2: align two entirely different sequences to each other

ClustalW- Alternative 1

• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pair and the new sequence) are each treated as a single sequence

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

Sequences 1 and 2 are aligned:

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

Sequences 3 and 4 are aligned:

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

The two pairs are aligned to each other:

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
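Step 3 shows the rule that an existing alignment stays fixed: when gaps are needed, they go into the same columns of every sequence in the group. A minimal sketch of that operation is below. The function name `insert_gap_columns` is hypothetical, and the position and count are simply read off the slide rather than computed by an alignment algorithm.

```python
def insert_gap_columns(group, position, count=1):
    """Insert `count` gap columns at `position` in every sequence of an aligned group."""
    return [seq[:position] + "-" * count + seq[position:] for seq in group]


pair_34 = [
    "gctcgatacacgatgactagcta",  # sequence 3
    "gctcgatacacgatgacgagcga",  # sequence 4
]
for seq in insert_gap_columns(pair_34, 13, 3):
    print(seq)
```

Because the same columns are added to both sequences, the relative alignment of 3 and 4 is unchanged, while the pair as a whole now lines up with sequences 1 and 2.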

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

Sequence 5 is aligned to the group of four:

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

ClustalW-Good points/Bad points

• Advantages
  – Speed
• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’



Local minimum

GARFIELDTHEFAT---CATGARFIELDTHEFATFATCAT

bull The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously

bull Lets consider a dotplot between sperm whale and human myoglobins

Dotplots

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

bull Put one sequence on top

bull the other on the side

bull where residues are identical put a dot

bull Diagonal lines of dots show similarities

Dotplot example sperm whale vs human myg

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bullJust do the first 10 amino acids of eachbullMake a table with

ndashwhale sequence on top ndashhuman sequence on the side

bull This is the result for the whole sequence

bull It is easy to see that the diagonal is a line of dots

bull So sperm whale and human myoglobin are very similar

bull But the picture is noisy can smooth using a sliding window which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

16

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bull can smooth noise using a sliding window which considers neighbouring residues as well

bull Have done this here can see the diagonal is highly similar

bull Also instead of using using a simple identity use a scoring matrix

Dotplot example sperm whale vs human myg

Dotplots in practicebull The best tool is an applet called dotlet

bull wwwisrecisb-sibchjavadotletDotlethtmlbull wwwbipbhamacukdotletDotlethtml

bull an applet is a program that runs in a web browser This means that you can produce dotplots within a netscapeIE window

bull Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

bull Protein has many repeats bull SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

bull Perform a dotplot of the SLIT protein against itself wwwbiobhamacukdotletDotlethtml

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

20Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming2 methodsbull Dynamic programming

ndash Consider 2 protein sequences of 100 amino acids in lengthndash If it takes 1002 seconds to exhaustively align these sequences

then it will take 1003 seconds to align 3 sequences 1004 to align 4 sequencesetc

ndash More time than the universe has existed to align 20 sequences exhaustively

bull Progressive alignment

Progressive Alignmentbull Devised by Feng and Doolittle in 1987bull Essentially a heuristic method and as such

is not guaranteed to find the lsquooptimalrsquo alignment

bull Requires n-1+n-2+n-3n-n+1 pairwise alignments as a starting point

bull Most successful implementation is Clustal (Des Higgins) This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG2 GEEKAAVLALWDKVN--EEEVGG3 PADKTNVKAAWGKVGAHAGEYGA4 AADKTNVKAAWSKVGGHAGEYGA5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human 1 -Hbb_Horse 2 17 -Hba_Human 3 59 60 -Hba_Horse 4 59 59 13 -Myg_Whale 5 77 77 75 75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3 4

2

1

3 4

alpha-helices

Quick pairwise alignment calculate distance matrix

Neighbor-joining tree(guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

bull First perform all possible pairwise alignments between each pair of sequences There are (n-1)+(n-2)(n-n+1) possibilities

bull Calculate the lsquodistancersquo between each pair of sequences based on these isolated pairwise alignments

bull Generate a distance matrix

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)

• GOP (Gap Opening Penalty) is the cost of opening a gap in an alignment

• GEP (Gap Extension Penalty) is the cost of extending this gap
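A small sketch of how the two penalties combine may help. It assumes the common affine convention in which a gap pays the GOP once plus the GEP for every gapped position. Whether the first position is also charged the extension differs between programs, and the numeric values below are illustrative only, not ClustalW defaults.

```python
def gap_cost(length, gop, gep):
    """Affine cost of one gap: pay GOP once to open it, then GEP per gapped position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.5  # illustrative values, chosen for easy arithmetic

one_long = gap_cost(3, GOP, GEP)          # a single gap of three positions
three_short = 3 * gap_cost(1, GOP, GEP)   # three separate one-position gaps

print(one_long, three_short)
```

With these values one long gap is much cheaper than three short ones, which is why a high GOP with a low GEP favours a few long indels over many scattered ones.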

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  - D E G K N Q P R S
  - But this can be changed by the user
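The position-specific rules above can be sketched as a function that builds one GOP per alignment column. This is an illustration under stated assumptions, not ClustalW's implementation: the multipliers `lower`, `raise_` and `hydro` are placeholders, and the order in which the rules are tested (existing gap, then proximity to a gap, then hydrophilic run) is a guess based on the wording of the slides.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed above


def hydrophilic_columns(seq, run_length=5):
    """Return column indices lying inside a run of at least run_length hydrophilic residues."""
    cols, start = set(), None
    for i, ch in enumerate(seq + "#"):  # the sentinel closes a run that reaches the end
        if ch in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                cols.update(range(start, i))
            start = None
    return cols


def position_specific_gop(group, base_gop, window=8, lower=0.5, raise_=1.5, hydro=0.7):
    """Return one GOP per column of an aligned group (all multipliers are placeholders)."""
    ncols = len(group[0])
    gapped = {c for c in range(ncols) if any(seq[c] == "-" for seq in group)}
    hydro_cols = set()
    for seq in group:
        hydro_cols |= hydrophilic_columns(seq)
    table = []
    for c in range(ncols):
        if c in gapped:
            table.append(base_gop * lower)    # existing gap: cheaper, other rules skipped
        elif any(abs(c - g) <= window for g in gapped):
            table.append(base_gop * raise_)   # close to an existing gap: more expensive
        elif c in hydro_cols:
            table.append(base_gop * hydro)    # hydrophilic stretch (likely loop): cheaper
        else:
            table.append(base_gop)
    return table


# Toy group: one gap at column 4 and one hydrophilic stretch (DEGKN) in the first sequence.
seq1 = "LLLL" + "-" + "L" * 14 + "DEGKN" + "LLLL"
seq2 = "L" * len(seq1)
table = position_specific_gop([seq1, seq2], base_gop=10.0)
print(table)
```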

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
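As a rough sketch of the delay rule, the function below puts a sequence aside when its mean percent identity to the other sequences falls below the cutoff. Using the mean follows the "most different on average" wording above and is an assumption, as are the function name and the identity values, which are invented for illustration.

```python
def split_by_divergence(names, identity, cutoff=40.0):
    """Split sequences into (align_first, delayed) by mean percent identity to the others."""
    align_first, delayed = [], []
    for name in names:
        others = [identity[frozenset((name, other))] for other in names if other != name]
        mean_identity = sum(others) / len(others)
        (delayed if mean_identity < cutoff else align_first).append(name)
    return align_first, delayed


# Hypothetical percent identities, invented purely for illustration.
names = ["seqA", "seqB", "seqC", "seqD"]
identity = {
    frozenset(("seqA", "seqB")): 85.0,
    frozenset(("seqA", "seqC")): 70.0,
    frozenset(("seqB", "seqC")): 72.0,
    frozenset(("seqA", "seqD")): 30.0,
    frozenset(("seqB", "seqD")): 28.0,
    frozenset(("seqC", "seqD")): 35.0,
}

align_first, delayed = split_by_divergence(names, identity)
print("align first:", align_first)
print("delayed:", delayed)
```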

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

Unaligned:
ATGCTGTTAGGG
ATGACTCTGTTAGGG

Aligned as DNA:
ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
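The last step, putting the protein-level gaps back into the DNA, can be sketched as follows. The function name `thread_dna` is an assumption, and the protein alignment used in the example (M-LLG against MTLLG) is a hand-made one for the two sequences above, which translate to MLLG and MTLLG under the standard genetic code.

```python
def thread_dna(aligned_protein, dna):
    """Copy the gaps of an aligned protein sequence onto its coding DNA, codon by codon."""
    residues = aligned_protein.replace("-", "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be three times the number of residues")
    out, pos = [], 0
    for ch in aligned_protein:
        if ch == "-":
            out.append("---")           # one amino-acid gap becomes a whole-codon gap
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)


print(thread_dna("M-LLG", "ATGCTGTTAGGG"))
print(thread_dna("MTLLG", "ATGACTCTGTTAGGG"))
```

Because gaps are only ever inserted three nucleotides at a time, the reading frame is preserved, unlike in the nucleotide-level alignment shown above.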

Manual Alignment- software

• GDE - The Genetic Data Environment (UNIX)
• CINEMA - Java applet available from
  - http://www.biochem.ucl.ac.uk
• Seqapp/Seqpup - Mac/PC/UNIX, available from
  - http://iubio.bio.indiana.edu
• SeAl for Macintosh, available from
  - http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
• BioEdit for PC, available from
  - http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good points/Bad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Dotplots

• The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously

• Let's consider a dotplot between sperm whale and human myoglobins

Sperm whale myoglobin

GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG

human myoglobin

VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG

Dotplot example sperm whale vs human myg

• Put one sequence on top
• The other on the side
• Where residues are identical, put a dot
• Diagonal lines of dots show similarities
• Just do the first 10 amino acids of each
• Make a table with
  - whale sequence on top
  - human sequence on the side

Sperm whale myoglobin: G L S D G E W Q L V
Human myoglobin:       V L S E G E W Q L V

Dotplot example sperm whale vs human myg

• This is the result for the whole sequence
• It is easy to see that the diagonal is a line of dots
• So sperm whale and human myoglobin are very similar
• But the picture is noisy; we can smooth it using a sliding window, which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

• We can smooth noise using a sliding window, which considers neighbouring residues as well
• This has been done here; the diagonal is clearly highly similar
• Also, instead of using a simple identity, use a scoring matrix
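The dotplot recipe, including the sliding-window smoothing, can be tried on the first 10 residues of each myoglobin. This is a minimal sketch: the function names and the window settings are assumptions, and it scores simple identity only rather than using a scoring matrix.

```python
def dotplot(top, side, window=1, min_matches=1):
    """Return a grid of booleans (rows = side sequence, columns = top sequence).

    A cell is True when at least `min_matches` of the `window` residue pairs
    starting at that cell are identical. window=1 gives the plain identity plot.
    """
    rows = len(side) - window + 1
    cols = len(top) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(side[i + k] == top[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text, one character per cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


whale = "GLSDGEWQLV"  # on top
human = "VLSEGEWQLV"  # on the side

plain = dotplot(whale, human)
smoothed = dotplot(whale, human, window=3, min_matches=3)

print(render(plain))
print()
print(render(smoothed))
```

In the plain plot, off-diagonal dots appear wherever the same residue occurs by chance. Requiring three consecutive identities removes them and leaves only the diagonal run.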

Dotplots in practice

• The best tool is an applet called dotlet
  - www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  - www.bip.bham.ac.uk/dotlet/Dotlet.html
• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window
• Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-Prot entry

For further discussion of dotplots, see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods

• Dynamic programming
  - Consider 2 protein sequences of 100 amino acids in length
  - If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  - More time than the universe has existed to align 20 sequences exhaustively
• Progressive alignment
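The timing argument for exhaustive dynamic programming is easy to check numerically. The sketch below takes the slide's cost model at face value (length^n seconds for n sequences) and compares it with an approximate age of the universe of 13.8 billion years, a commonly quoted figure that is not part of the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate, commonly quoted estimate


def exhaustive_seconds(n_sequences, length=100):
    """Cost model from the slide: length**n seconds for n sequences."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    print(f"{n:2d} sequences: {years:.3g} years")
```

Under this model 20 sequences need 100^20 = 10^40 seconds, which exceeds the age of the universe by more than twenty orders of magnitude.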

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and, as such, is not guaranteed to find the 'optimal' alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) = n(n-1)/2 pairwise alignments as a starting point
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

• Quick pairwise alignment: calculate distance matrix

Hbb_Human 1  -
Hbb_Horse 2  .17   -
Hba_Human 3  .59   .60   -
Hba_Horse 4  .59   .59   .13   -
Myg_Whale 5  .77   .77   .75   .75   -

• Neighbor-joining tree (guide tree), with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale

• Progressive alignment following guide tree (CLUSTAL W); alpha-helices are marked on the alignment

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) = n(n-1)/2 possibilities

• Calculate the 'distance' between each pair of sequences based on these isolated pairwise alignments

• Generate a distance matrix
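The bookkeeping behind this step can be sketched as follows. Two simplifications are assumed here: the distance is the plain fraction of differing positions, and, to keep the example short, the five toy DNA sequences are taken as already aligned rather than being aligned pair by pair first. The function names are illustrative.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences, skipping gapped columns."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(seqs):
    """Map each unordered pair of names to a distance; n sequences give n(n-1)/2 entries."""
    return {frozenset((m, n)): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}


# The five toy sequences from the progressive alignment slides, in aligned form.
seqs = {
    "1": "gctcgatacgatacgatgactagcta",
    "2": "gctcgatacaagacgatgac-agcta",
    "3": "gctcgatacacga---tgactagcta",
    "4": "gctcgatacacga---tgacgagcga",
    "5": "-ctcga-acgatacgatgactagct-",
}

matrix = distance_matrix(seqs)
print(len(matrix), "pairwise distances")
for pair, d in sorted(matrix.items(), key=lambda item: item[1]):
    print(sorted(pair), round(d, 3))
```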

Path Graph for aligning two sequences

Possible alignment

Scoring scheme
• Match: +1
• Mismatch: 0
• Indel: -1

Scores along this path: 1, 1, 0, 1, 0, -1
Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Scores per column: 1, 1, 0, 1, 0, -1

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1
Alignment score: 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1
Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
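The path-graph scores above can be reproduced with a short dynamic programming sketch. It uses the standard Needleman-Wunsch recurrence with the slide's scoring scheme (match +1, mismatch 0, indel -1). The function names are assumptions, and the traceback returns only one of the equally good optimal alignments.

```python
def alignment_score(x, y, match=1, mismatch=0, indel=-1):
    """Score two aligned strings column by column."""
    return sum(indel if "-" in (p, q) else match if p == q else mismatch
               for p, q in zip(x, y))


def global_alignment(a, b, match=1, mismatch=0, indel=-1):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b) for one optimal alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(alignment_score("GATTC-", "GAATTC"))   # the non-optimal path shown first
best, top, bottom = global_alignment("GATTC", "GAATTC")
print(best)
print(top)
print(bottom)
```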

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining 'guide tree' from these pairwise distances

• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

A and B are neighbors, joined at Node 1.

Distance Matrix

What is required for the neighbour-joining method? A distance matrix.

PAM        Spinach   Rice    Mosquito  Monkey  Human
Spinach      0.0     84.9    105.6      90.8    86.3
Rice        84.9      0.0    117.8     122.4   122.6
Mosquito   105.6    117.8      0.0      84.7    80.8
Monkey      90.8    122.4     84.7       0.0     3.3
Human       86.3    122.6     80.8       3.3     0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum, so we'll join Human and Monkey to MonHum and calculate the new distances.

Tree so far: Mon-Hum joins Monkey and Human; Spinach, Mosquito and Rice are not yet joined.

Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55

Next Cycle

PAM        Spinach   Rice    Mosquito  MonHum
Spinach      0.0     84.9    105.6      88.6
Rice        84.9      0.0    117.8     122.5
Mosquito   105.6    117.8      0.0      82.8
MonHum      88.6    122.5     82.8       0.0

Tree so far: Mosquito joins Mon-Hum to form Mos-(Mon-Hum); Spinach and Rice are not yet joined.

Penultimate Cycle

PAM         Spinach   Rice    MosMonHum
Spinach       0.0     84.9     97.1
Rice         84.9      0.0    120.2
MosMonHum    97.1    120.2      0.0

Tree so far: Spinach and Rice join to form Spin-Rice, alongside Mos-(Mon-Hum).

Last Joining

PAM         SpinRice  MosMonHum
SpinRice       0.0     108.7
MosMonHum    108.7       0.0

Final join: (Spin-Rice)-(Mos-(Mon-Hum))

Unrooted Neighbor-Joining Tree

((Spinach, Rice), (Mosquito, (Monkey, Human)))
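The worked example can be replayed in code. The sketch below follows the slides literally: it joins the closest pair and replaces it with a simple average of distances. This is a simplification, since the full neighbor-joining method chooses pairs with a criterion that also corrects for each taxon's average distance to all the others. The function name is an assumption, and the code keeps unrounded intermediate values, so the last distance comes out as 108.6125 rather than the 108.7 obtained on the slides by rounding at each cycle.

```python
from itertools import combinations

NAMES = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
ROWS = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]


def join_by_simple_average(names, rows):
    """Join the closest pair repeatedly, averaging distances as in the slides.

    Returns the nested-parenthesis tree and the list of (a, b, distance) joins.
    """
    dist = {frozenset((names[i], names[j])): rows[i][j]
            for i, j in combinations(range(len(names)), 2)}
    clusters = list(names)
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        merged = f"({a},{b})"
        history.append((a, b, dist[frozenset((a, b))]))
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # Distance from the new subtree is the mean of its two members' distances.
            dist[frozenset((merged, c))] = (dist[frozenset((a, c))] + dist[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return clusters[0], history


tree, history = join_by_simple_average(NAMES, ROWS)
for a, b, d in history:
    print(f"join {a} + {b} at distance {d:.2f}")
print(tree)
```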

Multiple Alignment- First pair

• Align the two most closely related sequences first

• This alignment is then 'fixed' and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  - Option 1: align a third sequence to the first two
  - Option 2: align two entirely different sequences to each other

ClustalW- Alternative 1

• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pair and the new sequence) are treated as two single sequences

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a 'pair' having more than one sequence



bull Put one sequence on top

bull the other on the side

bull where residues are identical put a dot

bull Diagonal lines of dots show similarities

Dotplot example sperm whale vs human myg

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bullJust do the first 10 amino acids of eachbullMake a table with

ndashwhale sequence on top ndashhuman sequence on the side

bull This is the result for the whole sequence

bull It is easy to see that the diagonal is a line of dots

bull So sperm whale and human myoglobin are very similar

bull But the picture is noisy can smooth using a sliding window which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

16

Sperm whale myoglobin

G L S D G E W Q L V V L S E G E W Q L V

Human myoglobin

bull can smooth noise using a sliding window which considers neighbouring residues as well

bull Have done this here can see the diagonal is highly similar

bull Also instead of using using a simple identity use a scoring matrix

Dotplot example sperm whale vs human myg

Dotplots in practicebull The best tool is an applet called dotlet

bull wwwisrecisb-sibchjavadotletDotlethtmlbull wwwbipbhamacukdotletDotlethtml

bull an applet is a program that runs in a web browser This means that you can produce dotplots within a netscapeIE window

bull Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

bull Protein has many repeats bull SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

bull Perform a dotplot of the SLIT protein against itself wwwbiobhamacukdotletDotlethtml

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

20Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming2 methodsbull Dynamic programming

ndash Consider 2 protein sequences of 100 amino acids in lengthndash If it takes 1002 seconds to exhaustively align these sequences

then it will take 1003 seconds to align 3 sequences 1004 to align 4 sequencesetc

ndash More time than the universe has existed to align 20 sequences exhaustively

bull Progressive alignment

Progressive Alignmentbull Devised by Feng and Doolittle in 1987bull Essentially a heuristic method and as such

is not guaranteed to find the lsquooptimalrsquo alignment

bull Requires n-1+n-2+n-3n-n+1 pairwise alignments as a starting point

bull Most successful implementation is Clustal (Des Higgins) This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG2 GEEKAAVLALWDKVN--EEEVGG3 PADKTNVKAAWGKVGAHAGEYGA4 AADKTNVKAAWSKVGGHAGEYGA5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human 1 -Hbb_Horse 2 17 -Hba_Human 3 59 60 -Hba_Horse 4 59 59 13 -Myg_Whale 5 77 77 75 75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3 4

2

1

3 4

alpha-helices

Quick pairwise alignment calculate distance matrix

Neighbor-joining tree(guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

bull First perform all possible pairwise alignments between each pair of sequences There are (n-1)+(n-2)(n-n+1) possibilities

bull Calculate the lsquodistancersquo between each pair of sequences based on these isolated pairwise alignments

bull Generate a distance matrix

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)

• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

• GEP- Gap Extension Penalty is the cost of extending this gap
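As a rough illustration of how the two penalties combine, the sketch below charges the GOP once when a gap is opened and the GEP for each further position the gap spans. The function name, the default values (10.0 and 0.2) and the convention that the first gap position is covered by the GOP alone are assumptions made for this example; they are not taken from the slides.

```python
def gap_cost(gap_length, gop=10.0, gep=0.2):
    """Cost of a single gap spanning `gap_length` positions (affine scheme).

    Assumed convention: the opening penalty covers the first position and
    every additional position pays the extension penalty.
    """
    if gap_length <= 0:
        return 0.0
    return gop + gep * (gap_length - 1)


# One gap of three positions is much cheaper than three separate gaps,
# which is why a small GEP favours a few long gaps over many short ones.
one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
```

Raising the GEP relative to the GOP pushes the balance the other way, towards shorter gaps.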

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences

• The GOP is manipulated in a position-specific manner so that it can vary over the sequences

• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply

• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
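The rules above can be sketched as a small function that builds a GOP table for an aligned group. This is only an illustration: the scaling factors (0.5, 1.5 and 0.7), the base GOP of 10.0, the reading of ‘within 8 residues’ as a column distance of at most 8, and the choice to count a hydrophilic run found in any sequence of the group are all assumptions for the example, not ClustalW's actual numbers.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default residue set listed on the slide


def hydrophilic_run_columns(seq, run_length=5):
    """Indices that lie inside a run of at least `run_length` hydrophilic residues."""
    cols, start = set(), None
    for i, ch in enumerate(seq + "#"):  # '#' is a sentinel that closes a final run
        if ch in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                cols.update(range(start, i))
            start = None
    return cols


def gop_table(aligned, base_gop=10.0, gap_factor=0.5, near_factor=1.5,
              hydro_factor=0.7, window=8, run_length=5):
    """Position-specific GOPs for a group of equal-length aligned sequences."""
    length = len(aligned[0])
    gap_cols = [i for i in range(length) if any(s[i] == "-" for s in aligned)]
    hydro_cols = set()
    for s in aligned:
        hydro_cols |= hydrophilic_run_columns(s, run_length)

    table = []
    for i in range(length):
        if i in gap_cols:
            # Existing gap: penalty is lowered and the other rules are skipped.
            table.append(base_gop * gap_factor)
            continue
        gop = base_gop
        if any(abs(i - g) <= window for g in gap_cols):
            gop *= near_factor   # discourage a new gap close to an existing one
        if i in hydro_cols:
            gop *= hydro_factor  # encourage gaps in likely loop regions
        table.append(gop)
    return table
```

For example, `gop_table(["LLLLLLLLLLLL-LLLLLLLLLLLL"])` lowers the penalty at the gap column and raises it for the eight columns on either side.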

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align

• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)

• The user has the choice of setting a cutoff (default is 40% identity)

• This will delay the alignment of these sequences until the others have been aligned
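A minimal sketch of this delay rule, assuming the criterion is the mean percent identity of each sequence to all of the others (as the slide words it). The function name, the input layout and the identity values are hypothetical and serve only as an illustration.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Split sequences into those aligned first and those delayed.

    identity: {name: {other_name: percent identity}} for every pair.
    A sequence is delayed when its mean identity to the others is below `cutoff`.
    """
    align_first, delayed = [], []
    for name, row in identity.items():
        mean_identity = sum(row.values()) / len(row)
        if mean_identity < cutoff:
            delayed.append(name)
        else:
            align_first.append(name)
    return align_first, delayed


# Hypothetical percent identities for four sequences.
identity = {
    "seqA": {"seqB": 83.0, "seqC": 60.0, "seqD": 23.0},
    "seqB": {"seqA": 83.0, "seqC": 62.0, "seqD": 23.0},
    "seqC": {"seqA": 60.0, "seqB": 62.0, "seqD": 25.0},
    "seqD": {"seqA": 23.0, "seqB": 23.0, "seqC": 25.0},
}
align_first, delayed = split_by_divergence(identity)
```

Here only `seqD` falls below the 40% cutoff, so it would be added after the other three have been aligned.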

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

Unaligned:

ATGCTGTTAGGG
ATGACTCTGTTAGGG

Aligned as DNA:

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

• The result might be highly implausible and might not reflect what is known about biological processes

• It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment
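The translate-align-thread approach can be sketched with the two sequences from the slide. The helper names are invented for this example, the standard genetic code is assumed, input is assumed to be upper-case DNA in frame, and the protein alignment (`M-LLG` against `MTLLG`) is written by hand here rather than produced by an alignment program.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def thread_gaps(aligned_protein, dna):
    """Copy the gaps of an aligned protein back onto its DNA, codon by codon."""
    codons = iter([dna[i:i + 3] for i in range(0, len(dna), 3)])
    return "".join("---" if residue == "-" else next(codons)
                   for residue in aligned_protein)


short_dna = "ATGCTGTTAGGG"
long_dna = "ATGACTCTGTTAGGG"

# Hand-written protein alignment of translate(short_dna) and translate(long_dna).
aligned_short = thread_gaps("M-LLG", short_dna)
aligned_long = thread_gaps("MTLLG", long_dna)
```

The gap now falls on a codon boundary (`ATG---CTGTTAGGG`), a single three-base indel instead of the scattered gaps in the direct DNA alignment.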

Manual Alignment- software

GDE- The Genetic Data Environment (UNIX)

CINEMA- Java applet available from
  – http://www.biochem.ucl.ac.uk

Seqapp/Seqpup- Mac/PC/UNIX available from
  – http://iubio.bio.indiana.edu

SeAl for Macintosh available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

• This is the result for the whole sequence

• It is easy to see that the diagonal is a line of dots

• So sperm whale and human myoglobin are very similar

• But the picture is noisy; it can be smoothed using a sliding window which considers neighbouring residues as well

Dotplot example sperm whale vs human myg

[Figure: dotplot with axes Sperm whale myoglobin (V L S E G E W Q L V) and Human myoglobin (G L S D G E W Q L V)]

• Noise can be smoothed using a sliding window which considers neighbouring residues as well

• This has been done here: the diagonal is clearly highly similar

• Also, instead of using a simple identity, use a scoring matrix

Dotplot example sperm whale vs human myg
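The sliding-window idea can be sketched in a few lines. A cell (i, j) gets a dot when the windows starting at position i of one sequence and position j of the other share at least `threshold` identical residues. The function name and parameters are chosen for this example, and identity scoring is used here instead of a scoring matrix to keep the sketch short. The two fragments are the opening residues of human and sperm whale myoglobin.

```python
def dotplot(a, b, window=1, threshold=1):
    """Return the set of (i, j) cells that receive a dot.

    A dot is placed when a[i:i+window] and b[j:j+window] have at least
    `threshold` identical positions.
    """
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


human = "GLSDGEWQLV"
whale = "VLSEGEWQLV"

noisy = dotplot(human, whale)                            # single-residue identity
smoothed = dotplot(human, whale, window=3, threshold=3)  # three in a row must match
```

With a window of 1 there are off-diagonal dots such as (1, 8), where two unrelated leucines happen to match; with a window of 3 only diagonal cells survive.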

Dotplots in practice

• The best tool is an applet called dotlet
  • www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  • www.bip.bham.ac.uk/dotlet/Dotlet.html

• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window

• Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-Prot entry

For further discussion of dotplots see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods

• Dynamic programming
  – Consider 2 protein sequences of 100 amino acids in length
  – If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  – More time than the universe has existed to align 20 sequences exhaustively

• Progressive alignment
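The growth described above is easy to check numerically. The sketch below takes the slide's rule of thumb at face value (length to the power of the number of sequences, in seconds) and compares it with an assumed age of the universe of about 13.8 billion years. The function names are made up for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_SECONDS = 13.8e9 * SECONDS_PER_YEAR  # roughly 4.4e17 seconds


def exhaustive_seconds(n_sequences, length=100):
    """Slide's rule of thumb: length ** n_sequences seconds."""
    return length ** n_sequences


def first_infeasible(length=100):
    """Smallest number of sequences whose cost exceeds the age of the universe."""
    n = 2
    while exhaustive_seconds(n, length) <= AGE_OF_UNIVERSE_SECONDS:
        n += 1
    return n
```

Under these assumptions the limit is already passed at 9 sequences, long before the 20 mentioned on the slide, which is why a heuristic such as progressive alignment is used instead.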

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and, as such, not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) = n(n-1)/2 pairwise alignments as a starting point
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human  1   -
Hbb_Horse  2  17   -
Hba_Human  3  59  60   -
Hba_Horse  4  59  59  13   -
Myg_Whale  5  77  77  75  75   -

[Figure: tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale; node labels 1 to 4; alpha-helices marked]

Quick pairwise alignment: calculate distance matrix

Neighbor-joining tree (guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) possibilities

• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments

• Generate a distance matrix
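A small sketch of the bookkeeping in this step: enumerate every pair and store one distance per pair. To keep it short, the example does not perform the separate pairwise alignments that ClustalW would; instead it reuses the five already-aligned fragments from the overview slide and takes the fraction of differing residues as the distance. That shortcut, the function names and the choice of distance measure are assumptions for this example.

```python
from itertools import combinations

ALIGNED = {
    "1": "PEEKSAVTALWGKVN--VDEVGG",
    "2": "GEEKAAVLALWDKVN--EEEVGG",
    "3": "PADKTNVKAAWGKVGAHAGEYGA",
    "4": "AADKTNVKAAWSKVGGHAGEYGA",
    "5": "EHEWQLVLHVWAKVEADVAGHGQ",
}


def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in columns) / len(columns)


def distance_matrix(seqs):
    """One entry per unordered pair, n(n-1)/2 entries in total."""
    return {frozenset((m, n)): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}


matrix = distance_matrix(ALIGNED)
```

Five sequences give 4+3+2+1 = 10 entries, matching the count on the slide.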

Path Graph for aligning two sequences

Possible alignment

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

Path scores: 1, 1, 0, 1, 0, -1

Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Path scores: 1, 1, 0, 1, 0, -1

Optimal Alignment 1

Path scores: 1, 1, -1, 1, 1, 1

Alignment score 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Path scores: 1, -1, 1, 1, 1, 1

Alignment score 4

Alignment using this path

G-ATTC
GAATTC
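The path-graph slides can be reproduced with a short global alignment routine using the same scoring scheme (match +1, mismatch 0, indel -1). This is a generic Needleman-Wunsch style sketch written for this example, not ClustalW's own pairwise code. When several paths tie for the best score, the traceback returns just one of them.

```python
def global_align(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)

    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = global_align("GATTC", "GAATTC")
```

For GATTC against GAATTC the best score is 4, the value shown for both optimal alignments on the slides.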

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances

• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

PAM        Spinach    Rice  Mosquito  Monkey   Human
Spinach        0.0    84.9     105.6    90.8    86.3
Rice          84.9     0.0     117.8   122.4   122.6
Mosquito     105.6   117.8       0.0    84.7    80.8
Monkey        90.8   122.4      84.7     0.0     3.3
Human         86.3   122.6      80.8     3.3     0.0

What is required for the Neighbour joining method? A distance matrix

Distance Matrix

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: tree so far, with Monkey and Human joined as Mon-Hum; remaining taxa: Spinach, Mosquito, Rice]

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

[Figure: Mon-Hum subtree (Monkey, Human) and Spinach]

Calculation of New Distances

PAM        Spinach    Rice  Mosquito  MonHum
Spinach        0.0    84.9     105.6    88.6
Rice          84.9     0.0     117.8   122.5
Mosquito     105.6   117.8       0.0    82.8
MonHum        88.6   122.5      82.8     0.0

[Figure: tree with nodes Mon-Hum and Mos-(Mon-Hum); taxa Human, Monkey, Mosquito, Spinach, Rice]

Next Cycle

PAM        Spinach    Rice  MosMonHum
Spinach        0.0    84.9       97.1
Rice          84.9     0.0      120.2
MosMonHum     97.1   120.2        0.0

[Figure: tree with nodes Mon-Hum, Mos-(Mon-Hum) and Spin-Rice; taxa Human, Monkey, Mosquito, Spinach, Rice]

Penultimate Cycle

PAM        SpinRice  MosMonHum
SpinRice        0.0      108.7
MosMonHum     108.7        0.0

[Figure: tree with nodes Mon-Hum, Mos-(Mon-Hum), Spin-Rice and (Spin-Rice)-(Mos-(Mon-Hum)); taxa Human, Monkey, Mosquito, Spinach, Rice]

Last Joining

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice, Spinach]

Unrooted Neighbor-Joining Tree
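The cycles walked through on these slides (join the closest pair, then replace it with the simple average of its distances, rounded to one decimal) can be reproduced with the sketch below. It mirrors the slides' simplified procedure only: the full neighbor-joining criterion of Saitou and Nei also corrects each pairwise distance for how far the two taxa are from everything else, and that correction is not implemented here. The distances are entered with one decimal place, which is the reading of the table that agrees with the worked average (90.8 + 86.3)/2 = 88.55. Function and variable names are made up for this example.

```python
from decimal import Decimal, ROUND_HALF_UP
from itertools import combinations

NAMES = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
# Upper triangle of the PAM distance matrix, in combinations(NAMES, 2) order.
UPPER = ["84.9", "105.6", "90.8", "86.3", "117.8",
         "122.4", "122.6", "84.7", "80.8", "3.3"]
DIST = {frozenset(pair): Decimal(value)
        for pair, value in zip(combinations(NAMES, 2), UPPER)}


def cluster_by_simple_average(names, dist):
    """Repeatedly join the closest pair; return [(new cluster, join distance), ...]."""
    names, dist, joins = list(names), dict(dist), []
    while len(names) > 1:
        a, b = min(combinations(names, 2), key=lambda pair: dist[frozenset(pair)])
        joined = f"({a},{b})"
        for other in names:
            if other not in (a, b):
                mean = (dist[frozenset((a, other))] + dist[frozenset((b, other))]) / 2
                # Round half up to one decimal, as the slide tables do.
                dist[frozenset((joined, other))] = mean.quantize(
                    Decimal("0.1"), rounding=ROUND_HALF_UP)
        joins.append((joined, dist[frozenset((a, b))]))
        names = [n for n in names if n not in (a, b)] + [joined]
    return joins


joins = cluster_by_simple_average(NAMES, DIST)
```

The join order comes out as Monkey with Human (3.3), then Mosquito (82.8), then Spinach with Rice (84.9), and finally the two groups together (108.7), matching the tables above.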

Multiple Alignment- First pair

• Align the two most closely related sequences first

• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged
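The ‘fixed’ behaviour can be pictured as inserting whole gap columns: every sequence in an already-aligned group receives the gap at the same position, so the rows never shift relative to each other. The helper below is a hypothetical illustration of that idea, using sequences 3 and 4 from the worked progressive alignment example.

```python
def insert_gap_columns(group, position, length=1):
    """Insert `length` gap columns before `position` in every row of an aligned group."""
    return [seq[:position] + "-" * length + seq[position:] for seq in group]


pair = ["gctcgatacacgatgactagcta",
        "gctcgatacacgatgacgagcga"]

# Open a three-column gap after the first 13 bases of both rows at once.
widened = insert_gap_columns(pair, 13, 3)
```

Both rows become 26 columns long and stay aligned to each other exactly as before.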

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  – Align a third sequence to the first two
  Or
  – Align two entirely different sequences to each other

Option 1        Option 2

ClustalW- Alternative 1

• If a third sequence is to be aligned to the first two, then whenever a gap has to be introduced to improve the alignment, the two entities (the existing pair and the new sequence) are treated as two single sequences

+

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

bull can smooth noise using a sliding window which considers neighbouring residues as well

bull Have done this here can see the diagonal is highly similar

bull Also instead of using using a simple identity use a scoring matrix

Dotplot example sperm whale vs human myg

Dotplots in practicebull The best tool is an applet called dotlet

bull wwwisrecisb-sibchjavadotletDotlethtmlbull wwwbipbhamacukdotletDotlethtml

bull an applet is a program that runs in a web browser This means that you can produce dotplots within a netscapeIE window

bull Dotplots are often useful to identify things like repeated domains or duplications in big proteins

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

bull Protein has many repeats bull SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

bull Perform a dotplot of the SLIT protein against itself wwwbiobhamacukdotletDotlethtml

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

20Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming2 methodsbull Dynamic programming

ndash Consider 2 protein sequences of 100 amino acids in lengthndash If it takes 1002 seconds to exhaustively align these sequences

then it will take 1003 seconds to align 3 sequences 1004 to align 4 sequencesetc

ndash More time than the universe has existed to align 20 sequences exhaustively

bull Progressive alignment

Progressive Alignmentbull Devised by Feng and Doolittle in 1987bull Essentially a heuristic method and as such

is not guaranteed to find the lsquooptimalrsquo alignment

bull Requires n-1+n-2+n-3n-n+1 pairwise alignments as a starting point

bull Most successful implementation is Clustal (Des Higgins) This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG2 GEEKAAVLALWDKVN--EEEVGG3 PADKTNVKAAWGKVGAHAGEYGA4 AADKTNVKAAWSKVGGHAGEYGA5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human 1 -Hbb_Horse 2 17 -Hba_Human 3 59 60 -Hba_Horse 4 59 59 13 -Myg_Whale 5 77 77 75 75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3 4

2

1

3 4

alpha-helices

Quick pairwise alignment calculate distance matrix

Neighbor-joining tree(guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

bull First perform all possible pairwise alignments between each pair of sequences There are (n-1)+(n-2)(n-n+1) possibilities

bull Calculate the lsquodistancersquo between each pair of sequences based on these isolated pairwise alignments

bull Generate a distance matrix

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

bull Two penalties are set by the user (there are default values but you should know that it is possible to change these)

bull GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

bull GEP- Gap Extension Penalty is the cost of extending this gap

Position-Specific gap penaltiesbull Before any pair of (groups of) sequences are

aligned a table of GOPs are generated for each position in the two (sets of) sequences

bull The GOP is manipulated in a position-specific manner so that it can vary over the sequences

bull If there is a gap at a position the GOP and GEP penalties are lowered the other rules do not apply

bull This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps bull If there is no gap opened then the GOP is increased if the

position is within 8 residues of an existing gapbull This discourages gaps that are too close togetherbull At any position within a run of hydrophilic residues the GOP is

decreasedbull These runs usually indicate loop regions in protein structuresbull A run of 5 hydrophilic residues is considered to be a hydrophilic

stretchbull The default hydrophilic residues are

ndash D E G K N Q P R Sndash But this can be changed by the user

Divergent Sequences
• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned

Advice on progressive alignment
• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
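
The last step, putting the gaps back into the DNA, can be sketched as follows. The function name is hypothetical, and the protein alignment (`M-LLG` against `MTLLG`) is an assumed result for the two sequences on this slide, which translate to MLLG and MTLLG under the standard genetic code; in practice it would come from a protein aligner.

```python
def thread_codons(aligned_protein, dna):
    """Place gaps in a coding DNA sequence following its aligned protein.

    Each amino acid consumes one codon; each protein gap becomes "---".
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out = []
    used = 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(codons[used])
            used += 1
    if used != len(codons):
        raise ValueError("protein and DNA lengths do not correspond")
    return "".join(out)


print(thread_codons("M-LLG", "ATGCTGTTAGGG"))
print(thread_codons("MTLLG", "ATGACTCTGTTAGGG"))
```

The single three-base gap keeps every codon intact, unlike the scattered one- and two-base gaps in the direct DNA alignment shown above.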

Manual Alignment- software
GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet available from
  – http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX available from
  – http://iubio.bio.indiana.edu
SeAl for Macintosh available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Dotplots in practice
• The best tool is an applet called dotlet
  – www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  – www.bip.bham.ac.uk/dotlet/Dotlet.html
• An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window
• Dotplots are often useful to identify things like repeated domains or duplications in big proteins
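
To make the idea concrete, here is a minimal text dotplot. It is a sketch under simplifying assumptions: a dot is drawn only where two windows are identical, whereas tools such as dotlet score windows with a substitution matrix and a threshold. The function name and the default window size are illustrative.

```python
def dotplot(a, b, window=3):
    """Return rows of '*' and '.', one row per window start in `a`.

    A '*' at (i, j) means a[i:i+window] equals b[j:j+window].
    """
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


# Self-comparison of a toy sequence containing the repeat "ABC".
for line in dotplot("ABCXABC", "ABCXABC"):
    print(line)
```

In a self-comparison the main diagonal is always filled; repeated domains show up as additional off-diagonal marks, as in the corners of this toy plot.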

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

• Protein has many repeats
• SLIT_DROME (P24014)

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

• Perform a dotplot of the SLIT protein against itself: www.bio.bham.ac.uk/dotlet/Dotlet.html

Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-prot entry

For further discussion of dotplot see Attwood and Parry-Smith p116-8

Dynamic programming
2 methods
• Dynamic programming
  – Consider 2 protein sequences of 100 amino acids in length
  – If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  – More time than the universe has existed to align 20 sequences exhaustively
• Progressive alignment
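
The arithmetic behind that claim is easy to check, taking the slide's premise that n sequences of length 100 need 100^n seconds. The age of the universe used here (about 13.8 billion years) and the function names are assumptions added for the comparison.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR  # assumed age, in seconds


def exhaustive_seconds(n_sequences, length=100):
    """Seconds needed under the slide's premise of length**n growth."""
    return length ** n_sequences


def first_infeasible(length=100):
    """Smallest number of sequences whose cost exceeds the age of the universe."""
    n = 2
    while exhaustive_seconds(n, length) <= AGE_OF_UNIVERSE_S:
        n += 1
    return n


print(first_infeasible())
print(exhaustive_seconds(20) / AGE_OF_UNIVERSE_S)
```

Under these assumptions the limit is reached well before 20 sequences, which is why a heuristic such as progressive alignment is used instead.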

Progressive Alignment
• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and, as such, is not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) = n(n-1)/2 pairwise alignments as a starting point
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure (CLUSTAL W)

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure annotation: alpha-helices]

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) = n(n-1)/2 possibilities
• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments
• Generate a distance matrix
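
A minimal sketch of this stage is shown below. It assumes the sequences are already comparable column by column (a toy stand-in for the separate pairwise alignments that ClustalW performs) and uses the fraction of differing residues as the distance; both the function names and that distance measure are illustrative choices.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Map each unordered pair of names to its distance.

    `aligned` is a dict of name -> sequence; n sequences give n(n-1)/2 pairs.
    """
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


toy = {"a": "ACGT", "b": "ACGA", "c": "A-GT", "d": "TCGT"}
matrix = distance_matrix(toy)
print(len(matrix))
print(matrix[("a", "b")])
```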

Path Graph for aligning two sequences

Possible alignment

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

Scores along this path: 1, 1, 0, 1, 0, -1

Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1

Alignment score 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1

Alignment score 4

Alignment using this path

G-ATTC
GAATTC
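
The path scores on these slides can be reproduced with a short sketch. It uses the slides' scoring scheme (match +1, mismatch 0, indel -1); the function names are hypothetical, and `best_score` is a standard global dynamic-programming recurrence (Needleman-Wunsch style) rather than anything taken from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Sum the column scores of two equal-length aligned strings."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def best_score(x, y, match=1, mismatch=0, indel=-1):
    """Highest global alignment score, filling the path graph row by row."""
    prev = [j * indel for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * indel]
        for j in range(1, len(y) + 1):
            diagonal = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


print(score_alignment("GATTC-", "GAATTC"))
print(score_alignment("GA-TTC", "GAATTC"))
print(score_alignment("G-ATTC", "GAATTC"))
print(best_score("GATTC", "GAATTC"))
```

Both optimal alignments reach the same score, which is why the choice between them is arbitrary for the algorithm.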

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

Distance Matrix

What is required for the Neighbour joining method? A distance matrix:

PAM       Spinach  Rice   Mosquito  Monkey  Human
Spinach     0.0    84.9    105.6     90.8    86.3
Rice       84.9     0.0    117.8    122.4   122.6
Mosquito  105.6   117.8      0.0     84.7    80.8
Monkey     90.8   122.4     84.7      0.0     3.3
Human      86.3   122.6     80.8      3.3     0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice not yet joined]

Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

[Figure: Spinach compared against the Mon-Hum subtree (Monkey, Human)]

Next Cycle

PAM       Spinach  Rice   Mosquito  MonHum
Spinach     0.0    84.9    105.6     88.6
Rice       84.9     0.0    117.8    122.5
Mosquito  105.6   117.8      0.0     82.8
MonHum     88.6   122.5     82.8      0.0

[Figure: Mosquito joined to Mon-Hum, giving Mos-(Mon-Hum); Spinach and Rice not yet joined]

Penultimate Cycle

PAM        Spinach  Rice   MosMonHum
Spinach      0.0    84.9     97.1
Rice        84.9     0.0    120.2
MosMonHum   97.1   120.2      0.0

[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]

Last Joining

PAM        SpinRice  MosMonHum
SpinRice      0.0      108.7
MosMonHum   108.7        0.0

[Figure: final join, (Spin-Rice)-(Mos-(Mon-Hum))]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice and Spinach]
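
The worked example can be replayed in a few lines. The sketch follows the procedure exactly as the slides describe it (join the closest pair, then take a simple average of distances to the new subtree) and reads the matrix values as PAM distances with one decimal place, e.g. 3.3 for Human-Monkey. Note that this closest-pair, simple-average scheme is a simplification: the full neighbor-joining method selects pairs with a corrected criterion rather than the raw minimum distance. The function name and data layout are assumptions.

```python
def join_closest(names, matrix):
    """Repeatedly join the closest pair, averaging distances to the new subtree.

    `matrix` maps frozenset({a, b}) -> distance. Returns the joins in order
    as (left, right, distance) tuples.
    """
    dist = dict(matrix)
    active = list(names)
    joins = []
    while len(active) > 1:
        a, b = min(
            ((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        joins.append((a, b, dist[frozenset((a, b))]))
        merged = f"({a},{b})"
        active = [n for n in active if n not in (a, b)]
        for other in active:
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] + dist[frozenset((b, other))]
            ) / 2
        active.append(merged)
    return joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
matrix = {
    frozenset((names[i], names[j])): rows[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}
for left, right, d in join_closest(names, matrix):
    print(left, right, round(d, 2))
```

The join order matches the slides (Monkey with Human, then Mosquito, then Spinach with Rice). The later distances differ slightly in the last digit from the slide tables because the slides round intermediate values to one decimal place.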

Multiple Alignment- First pair
• Align the two most closely related sequences first
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged

ClustalW- Decision time
• Consult the guide tree to see what alignment is performed next
  – Align a third sequence to the first two (Option 1)
  Or
  – Align two entirely different sequences to each other (Option 2)

ClustalW- Alternative 1
If the situation arises where a third sequence is aligned to the first two, then, when a gap has to be introduced to improve the alignment, each of these two entities is treated as a single sequence.

ClustalW- Alternative 2
• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-
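
The key property of these steps is that a group, once aligned, moves as a unit: a new gap is inserted in the same column of every sequence in the group. A minimal sketch of that operation, using the step where sequences 3 and 4 are merged with the aligned pair 1 and 2, is given below. The function name is hypothetical, and the gap position and width are read off the slide rather than computed by an aligner.

```python
def insert_gap_columns(group, position, width):
    """Insert `width` gap columns at `position` in every sequence of a group."""
    return [s[:position] + "-" * width + s[position:] for s in group]


group_34 = [
    "gctcgatacacgatgactagcta",  # sequence 3
    "gctcgatacacgatgacgagcga",  # sequence 4
]
for row in insert_gap_columns(group_34, position=13, width=3):
    print(row)
```

Because both rows receive the gap at the same place, the relative alignment of sequences 3 and 4 is unchanged, which is also why an early error cannot be repaired later.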

ClustalW- Good points/Bad points

• Advantages
  – Speed
• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum
• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  – Arbitrary alignment





Example dotplot - repeated domains in Drosophila melanogaster SLIT protein

Swiss-Prot entry

For further discussion of dotplots, see Attwood and Parry-Smith, pp. 116-8

Dynamic programming

2 methods

• Dynamic programming
  – Consider 2 protein sequences of 100 amino acids in length
  – If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
  – More time than the universe has existed to align 20 sequences exhaustively
• Progressive alignment
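The growth in cost can be checked with a few lines of arithmetic. This sketch takes the slide's premise at face value (100^n seconds for n sequences of length 100) and assumes an age of the universe of roughly 13.8 billion years; the names `exhaustive_seconds` and `AGE_OF_UNIVERSE_S` are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR  # roughly 4.4e17 seconds (approximate)


def exhaustive_seconds(n_seqs: int, length: int = 100) -> int:
    """Cost of exhaustive alignment under the slide's premise: length ** n_seqs seconds."""
    return length ** n_seqs


for n in (2, 3, 4, 20):
    t = exhaustive_seconds(n)
    print(n, f"{t:.1e} s", f"{t / AGE_OF_UNIVERSE_S:.1e} universe ages")
```

For 20 sequences the estimate is 10^40 seconds, many orders of magnitude beyond the age of the universe, which is the motivation for a heuristic such as progressive alignment.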

Progressive Alignment

• Devised by Feng and Doolittle in 1987
• Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment
• Requires (n-1)+(n-2)+(n-3)+...+(n-n+1) pairwise alignments as a starting point, i.e. n(n-1)/2
• Most successful implementation is Clustal (Des Higgins). This software is cited 3000 times per year in the scientific literature
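The count of starting pairwise alignments is simply the number of unordered pairs of sequences. A small sketch, using the five globin names from the ClustalW overview as sample labels (the function name `n_pairwise` is hypothetical):

```python
from itertools import combinations


def n_pairwise(n: int) -> int:
    """(n-1) + (n-2) + ... + 1 pairwise alignments, i.e. n(n-1)/2."""
    return n * (n - 1) // 2


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
pairs = list(combinations(names, 2))  # every unordered pair exactly once
print(len(pairs), n_pairwise(len(names)))
```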

Overview of ClustalW Procedure

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1   -
Hbb_Horse  2  17   -
Hba_Human  3  59  60   -
Hba_Horse  4  59  59  13   -
Myg_Whale  5  77  77  75  75   -

Neighbor-joining tree (guide tree)

Leaves as labelled in the figure: Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

(alpha-helices are marked on the figure)

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+(n-n+1) possibilities
• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments
• Generate a distance matrix
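One simple way to turn isolated pairwise alignments into a distance matrix is sketched below. It assumes the distance is the percentage of differing residues over the columns where neither sequence has a gap; this is an illustrative choice, not necessarily the exact measure ClustalW uses. The sequence names and toy alignments are hypothetical.

```python
def percent_difference(a: str, b: str) -> float:
    """Percentage of mismatched columns, ignoring any column that contains a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return 100.0 * differing / compared if compared else 0.0


def distance_matrix(pair_alignments: dict) -> dict:
    """Map each unordered pair of names to the distance of its own pairwise alignment."""
    return {frozenset(pair): percent_difference(*aln)
            for pair, aln in pair_alignments.items()}


# Hypothetical pairwise alignments, each computed in isolation.
toy = {
    ("s1", "s2"): ("GA-TTC", "GAATTC"),
    ("s1", "s3"): ("GATTC", "GACTC"),
    ("s2", "s3"): ("GAATTC", "GA-CTC"),
}
for pair, d in distance_matrix(toy).items():
    print(sorted(pair), d)
```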

Path Graph for aligning two sequences

Possible alignment

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

Scores along this path: 1, 1, 0, 1, 0, -1

Score for this path = 2

Alignment using this path

GATTC-
GAATTC
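The path score can be reproduced by summing the scoring scheme over the alignment columns. A minimal sketch (the function name `score_alignment` is hypothetical):

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = 0, indel: int = -1) -> int:
    """Sum column scores for two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += indel      # a gap in either sequence is an indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GATTC-", "GAATTC"))  # 2
```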

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1

Alignment score 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1

Alignment score 4

Alignment using this path

G-ATTC
GAATTC
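The optimal score of 4 can be found by dynamic programming over the path graph. The following is a sketch of a standard global alignment (Needleman-Wunsch style) with the slide's scoring scheme; it is offered as an illustration, not as ClustalW's own pairwise routine. When several optimal paths exist, the traceback here returns just one of them.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = 0, indel: int = -1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_align("GATTC", "GAATTC")
print(best)
print(top)
print(bottom)
```

Both optimal alignments shown above score 4; which one the traceback reports depends only on the order in which ties are broken.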

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree (in the figure, A and B are connected by Node 1)

What is required for the Neighbour joining method? A distance matrix.

Distance Matrix

PAM        Spinach   Rice    Mosquito  Monkey  Human
Spinach      0.0     84.9    105.6      90.8    86.3
Rice        84.9      0.0    117.8     122.4   122.6
Mosquito   105.6    117.8      0.0      84.7    80.8
Monkey      90.8    122.4     84.7       0.0     3.3
Human       86.3    122.6     80.8       3.3     0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

Joined so far: Mon-Hum (Monkey, Human). Not yet joined: Spinach, Mosquito, Rice.

Calculation of New Distances

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55
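The averaging step can be written out directly. This sketch assumes the PAM values carry one decimal place (Spinach-Monkey 90.8, Spinach-Human 86.3, and so on); the function name `distance_to_joined` is hypothetical.

```python
# Distances from each remaining taxon to Monkey and to Human.
PAM = {
    ("Spinach", "Monkey"): 90.8, ("Spinach", "Human"): 86.3,
    ("Rice", "Monkey"): 122.4, ("Rice", "Human"): 122.6,
    ("Mosquito", "Monkey"): 84.7, ("Mosquito", "Human"): 80.8,
}


def distance_to_joined(dist: dict, other: str, a: str, b: str) -> float:
    """Simple average of the distances from `other` to the two joined taxa."""
    return (dist[(other, a)] + dist[(other, b)]) / 2


for taxon in ("Spinach", "Rice", "Mosquito"):
    print(taxon, round(distance_to_joined(PAM, taxon, "Monkey", "Human"), 2))
```

Note that a plain average of this kind is the simplified update used on these slides; the full neighbor-joining algorithm of Saitou and Nei uses a corrected criterion and update, which this sketch does not attempt.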

Next Cycle

PAM        Spinach   Rice    Mosquito  MonHum
Spinach      0.0     84.9    105.6      88.6
Rice        84.9      0.0    117.8     122.5
Mosquito   105.6    117.8      0.0      82.8
MonHum      88.6    122.5     82.8       0.0

Joined so far: Mon-Hum, Mos-(Mon-Hum)

Penultimate Cycle

PAM         Spinach   Rice    MosMonHum
Spinach       0.0     84.9     97.1
Rice         84.9      0.0    120.2
MosMonHum    97.1    120.2      0.0

Joined so far: Mon-Hum, Mos-(Mon-Hum), Spin-Rice

Last Joining

PAM         SpinRice  MosMonHum
SpinRice      0.0      108.7
MosMonHum   108.7        0.0

Joined: (Spin-Rice)-(Mos-(Mon-Hum))

Unrooted Neighbor-Joining Tree

((Spinach, Rice), (Mosquito, (Monkey, Human)))
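The whole sequence of cycles can be replayed with a short loop. This is a sketch of the simplified procedure walked through on the slides (join the closest pair, then average), assuming PAM values with one decimal place; it is not the full Saitou and Nei neighbor-joining criterion. The names `build_guide_tree`, `NAMES` and `ROWS` are hypothetical.

```python
from itertools import combinations

NAMES = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
ROWS = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
PAM = {frozenset((NAMES[i], NAMES[j])): ROWS[i][j]
       for i, j in combinations(range(len(NAMES)), 2)}


def build_guide_tree(names, dist):
    """Repeatedly join the closest pair; return (nested-tuple tree, list of joins)."""
    clusters = list(names)
    d = dict(dist)
    joins = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        joins.append((a, b, d[frozenset((a, b))]))
        new = (a, b)
        for c in clusters:
            if c != a and c != b:
                # Distance to the new subtree is the average of the two old distances.
                d[frozenset((new, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = [c for c in clusters if c != a and c != b] + [new]
    return clusters[0], joins


tree, joins = build_guide_tree(NAMES, PAM)
print(tree)
for a, b, dist_ab in joins:
    print(a, "+", b, "at", round(dist_ab, 2))
```

The join order matches the slides: Monkey with Human, then Mosquito, then Spinach with Rice, then the two groups. The last distance comes out near 108.61 rather than 108.7 because the slides round the table at every cycle, whereas this sketch keeps full precision.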

Multiple Alignment- First pair

• Align the two most closely-related sequences first
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged
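The "fixed" rule means a later gap is inserted as a whole column into every sequence of an already-aligned group. A minimal sketch, using sequences 3 and 4 from the progressive alignment example in this deck (the function name `insert_gap_column` is hypothetical):

```python
def insert_gap_column(group, pos, width=1):
    """Insert `width` gap columns at `pos` into every sequence of an aligned group."""
    return [seq[:pos] + "-" * width + seq[pos:] for seq in group]


group = ["gctcgatacacgatgactagcta",
         "gctcgatacacgatgacgagcga"]
for row in insert_gap_column(group, 13, width=3):
    print(row)
```

Because every row receives the gap at the same position, the columns that were already aligned to each other stay aligned.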

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  – Option 1: Align a third sequence to the first two
  Or
  – Option 2: Align two entirely different sequences to each other

ClustalW- Alternative 1

• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, each of these two entities is treated as a single sequence

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence
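The order of these decisions can be read off the guide tree by visiting it from the leaves inwards. The sketch below assumes the guide tree is given as nested two-element tuples, a representation chosen here for illustration; the function name `alignment_order` is hypothetical.

```python
def alignment_order(tree):
    """List the (left group, right group) pairs in the order they would be aligned."""
    steps = []

    def visit(node):
        if not isinstance(node, tuple):
            return [node]              # a leaf is a group of one sequence
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))    # align the two groups, treated as a 'pair'
        return left + right

    visit(tree)
    return steps


# Hypothetical guide tree for five sequences: ((1,2),(3,4)) joined first, then 5.
for left, right in alignment_order((((1, 2), (3, 4)), 5)):
    print(left, "+", right)
```

For this tree the steps are 1 with 2, 3 with 4, the two pairs with each other, and finally sequence 5 against the group of four.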

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

ClustalW- Good points/Bad points

• Advantages
  – Speed
• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum

• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  – Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely-related and some sequences are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment
• GEP- Gap Extension Penalty is the cost of extending this gap
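How the two penalties combine can be illustrated with an affine gap cost. This sketch assumes one common convention, in which the opening penalty covers the first gap position and the extension penalty is charged for each further position; the numeric values are illustrative only and are not ClustalW's defaults.

```python
def gap_cost(length: int, gop: float, gep: float) -> float:
    """Affine gap cost: GOP for opening, plus GEP for every additional gap position."""
    if length <= 0:
        return 0.0
    return gop + gep * (length - 1)


# Hypothetical penalties chosen for illustration.
GOP, GEP = 10.0, 0.5
print(gap_cost(1, GOP, GEP))       # one-residue gap
print(gap_cost(4, GOP, GEP))       # one gap of four residues
print(2 * gap_cost(2, GOP, GEP))   # two separate gaps of two residues each
```

With a GOP much larger than the GEP, one long gap is cheaper than several short gaps covering the same number of positions, so raising the GOP gives fewer gaps and raising the GEP gives shorter ones.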

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
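The rules above can be combined into a per-position GOP table. The sketch below follows the rules as stated (existing gap, within 8 residues of a gap, run of 5 hydrophilic residues), but the scaling factors, the base GOP and all function names are hypothetical; ClustalW's actual formulas are not reproduced here.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed above


def hydrophilic_stretches(seq: str, min_run: int = 5):
    """Mark positions that lie inside a run of at least `min_run` hydrophilic residues."""
    marked = [False] * len(seq)
    start = None
    for i, residue in enumerate(seq + "#"):  # '#' is a sentinel that ends any open run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for k in range(start, i):
                    marked[k] = True
            start = None
    return marked


def gop_table(has_gap, hydrophilic, base_gop=10.0,
              lowered=0.5, raised=2.0, loop_factor=0.5, window=8):
    """Position-specific GOPs; all factors are illustrative assumptions."""
    gap_columns = [i for i, gapped in enumerate(has_gap) if gapped]
    table = []
    for i, gapped in enumerate(has_gap):
        if gapped:
            table.append(base_gop * lowered)  # existing gap: lowered, other rules skipped
            continue
        gop = base_gop
        if any(abs(i - g) <= window for g in gap_columns):
            gop *= raised                     # too close to an existing gap
        if hydrophilic[i]:
            gop *= loop_factor                # likely loop region
        table.append(gop)
    return table


seq = "AAADEGKNAAA"
print(hydrophilic_stretches(seq))
print(gop_table([False] * len(seq), hydrophilic_stretches(seq)))
```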

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
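A rough sketch of the delay rule: flag any sequence whose average identity to all the others falls below the cutoff. Two assumptions are made loudly here. First, the globin distances from the ClustalW overview are read as percent differences, so identity is taken as 100 minus the distance. Second, the "average over all other sequences" test follows the wording of the slide and may not match the exact test ClustalW applies. The function name `sequences_to_delay` is hypothetical.

```python
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]

# Lower triangle of the overview distance matrix, read as percent difference (assumption).
DIFF = {
    ("Hbb_Horse", "Hbb_Human"): 17,
    ("Hba_Human", "Hbb_Human"): 59, ("Hba_Human", "Hbb_Horse"): 60,
    ("Hba_Horse", "Hbb_Human"): 59, ("Hba_Horse", "Hbb_Horse"): 59,
    ("Hba_Horse", "Hba_Human"): 13,
    ("Myg_Whale", "Hbb_Human"): 77, ("Myg_Whale", "Hbb_Horse"): 77,
    ("Myg_Whale", "Hba_Human"): 75, ("Myg_Whale", "Hba_Horse"): 75,
}
IDENTITY = {frozenset(pair): 100.0 - d for pair, d in DIFF.items()}


def sequences_to_delay(names, identity, cutoff=40.0):
    """Return the sequences whose mean identity to all others is below `cutoff` percent."""
    delayed = []
    for a in names:
        others = [identity[frozenset((a, b))] for b in names if b != a]
        if sum(others) / len(others) < cutoff:
            delayed.append(a)
    return delayed


print(sequences_to_delay(NAMES, IDENTITY))
```

Under these assumptions only the myoglobin sequence falls below the 40% cutoff, so it would be held back until the four haemoglobin chains have been aligned.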

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable



Progressive Alignmentbull Devised by Feng and Doolittle in 1987bull Essentially a heuristic method and as such

is not guaranteed to find the lsquooptimalrsquo alignment

bull Requires n-1+n-2+n-3n-n+1 pairwise alignments as a starting point

bull Most successful implementation is Clustal (Des Higgins) This software is cited 3000 times per year in the scientific literature

Overview of ClustalW Procedure

1 PEEKSAVTALWGKVN--VDEVGG2 GEEKAAVLALWDKVN--EEEVGG3 PADKTNVKAAWGKVGAHAGEYGA4 AADKTNVKAAWSKVGGHAGEYGA5 EHEWQLVLHVWAKVEADVAGHGQ

Hbb_Human 1 -Hbb_Horse 2 17 -Hba_Human 3 59 60 -Hba_Horse 4 59 59 13 -Myg_Whale 5 77 77 75 75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3 4

2

1

3 4

alpha-helices

Quick pairwise alignment calculate distance matrix

Neighbor-joining tree(guide tree)

Progressive alignment following guide tree

CLUSTAL W

ClustalW- Pairwise Alignments

bull First perform all possible pairwise alignments between each pair of sequences There are (n-1)+(n-2)(n-n+1) possibilities

bull Calculate the lsquodistancersquo between each pair of sequences based on these isolated pairwise alignments

bull Generate a distance matrix

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW- Local Minimum

• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct it later in the procedure
  – Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?

• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)

• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

• GEP- Gap Extension Penalty is the cost of extending this gap
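One common way to combine the two penalties is an affine gap cost, sketched below in Python. The formula (one opening charge plus one extension charge per gapped column) and the numeric values are assumptions for illustration only; conventions differ between programs, and the values are not claimed to be ClustalW's defaults.

```python
def gap_cost(length, gop=10.0, gep=0.2):
    """Cost of a single gap spanning `length` columns under an affine scheme.

    gop and gep are hypothetical example values, not program defaults.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length


one_long_gap = gap_cost(3)          # a single gap of three columns
three_short_gaps = 3 * gap_cost(1)  # three separate one-column gaps
print(one_long_gap, three_short_gaps)
```

With any opening penalty larger than the extension penalty, one long gap is cheaper than several short ones, which is why the scheme favours a few contiguous indels.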

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
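The position-specific rules above can be sketched as a per-column table in Python. Everything numeric here is a hypothetical choice: the multipliers `lower`, `raise_near` and `lower_hydro` are placeholders rather than ClustalW's actual factors, and treating a column as hydrophilic when any row has a qualifying run there is also an assumption. Only the window of 8, the run length of 5 and the residue set come from the slides.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default set listed on the slide


def hydrophilic_columns(row, run=5):
    """Indices lying inside a run of at least `run` consecutive hydrophilic residues."""
    cols, start = set(), None
    for i, ch in enumerate(row + "#"):  # sentinel closes a trailing run
        if ch in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                cols.update(range(start, i))
            start = None
    return cols


def gop_table(group, gop=10.0, lower=0.5, raise_near=1.5, lower_hydro=0.7, window=8):
    """Per-column gap opening penalty for an aligned group (equal-length strings)."""
    width = len(group[0])
    gap_cols = [c for c in range(width) if any(row[c] == "-" for row in group)]
    hydro = set().union(*(hydrophilic_columns(row) for row in group))
    table = []
    for c in range(width):
        if c in gap_cols:
            table.append(gop * lower)  # existing gap: cheaper, other rules skipped
            continue
        value = gop
        if any(abs(c - g) <= window for g in gap_cols):
            value *= raise_near        # close to an existing gap: more expensive
        if c in hydro:
            value *= lower_hydro       # hydrophilic stretch (likely loop): cheaper
        table.append(value)
    return table


# Toy single-row group: a gap at column 10 and a hydrophilic run at columns 21-25.
toy = ["A" * 10 + "-" + "A" * 10 + "DEGKN" + "AAA"]
table = gop_table(toy)
```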

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
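A minimal Python sketch of the delay rule follows, assuming that "most different on average" is measured as the mean percent identity of a sequence to all the others. The function name, the identity table and its values are hypothetical; only the 40% cutoff is taken from the slide.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Split sequence names into (align_now, delayed) by mean percent identity.

    identity maps each name to a dict of percent identities to every other name.
    """
    align_now, delayed = [], []
    for name, row in identity.items():
        mean = sum(row.values()) / len(row)
        (delayed if mean < cutoff else align_now).append(name)
    return align_now, delayed


# Hypothetical pairwise percent identities for three sequences.
identity = {
    "seqA": {"seqB": 85.0, "seqC": 30.0},
    "seqB": {"seqA": 85.0, "seqC": 28.0},
    "seqC": {"seqA": 30.0, "seqB": 28.0},
}
align_now, delayed = split_by_divergence(identity)
print(align_now, delayed)
```

Here seqC averages 29% identity to the others, so it falls below the cutoff and would be added only after seqA and seqB have been aligned.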

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

Unaligned:
ATGCTGTTAGGG
ATGACTCTGTTAGGG

Aligned as DNA:
ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
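The last step, copying the protein gaps back onto the DNA, can be sketched in Python as follows. The function name `thread_codons` is hypothetical, and the protein alignment `M-LLG` / `MTLLG` is supplied by hand as an assumed result of aligning the translations of the two example sequences; the translation and protein alignment steps themselves are not shown.

```python
def thread_codons(aligned_protein, dna):
    """Copy the gaps of an aligned protein onto its coding DNA, three columns per residue."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out, k = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")      # one protein gap becomes a whole-codon gap
        else:
            out.append(codons[k])  # next codon of the ungapped DNA
            k += 1
    if k != len(codons):
        raise ValueError("protein and DNA lengths disagree")
    return "".join(out)


dna_1 = thread_codons("M-LLG", "ATGCTGTTAGGG")
dna_2 = thread_codons("MTLLG", "ATGACTCTGTTAGGG")
print(dna_1)
print(dna_2)
```

The gap now falls on a codon boundary, so the reading frame of both sequences is preserved, unlike in the direct DNA alignment shown on the slide.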

Manual Alignment- software

• GDE- The Genetic Data Environment (UNIX)
• CINEMA- Java applet, available from
  – http://www.biochem.ucl.ac.uk
• Seqapp/Seqpup- Mac/PC/UNIX, available from
  – http://iubio.bio.indiana.edu
• SeAl for Macintosh, available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
• BioEdit for PC, available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Overview of ClustalW Procedure

[Figure: the three stages of CLUSTAL W for five globin sequences; alpha-helices are marked on the alignment]

• Quick pairwise alignment: calculate distance matrix

Hbb_Human  1   -
Hbb_Horse  2  17   -
Hba_Human  3  59  60   -
Hba_Horse  4  59  59  13   -
Myg_Whale  5  77  77  75  75   -

• Neighbor-joining tree (guide tree), with tips Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale

• Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2 possibilities

• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments

• Generate a distance matrix
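As a rough illustration of these three bullets, the Python sketch below enumerates every pair and fills a symmetric matrix. The distance used (`p_distance`, the fraction of differing aligned columns) and the pass-through "aligner" are assumptions made for the example; the slides do not specify how ClustalW computes its distances.

```python
from itertools import combinations


def p_distance(aligned_a, aligned_b):
    """Fraction of differing columns, ignoring columns that contain a gap."""
    cols = [(x, y) for x, y in zip(aligned_a, aligned_b) if "-" not in (x, y)]
    return sum(x != y for x, y in cols) / len(cols) if cols else 1.0


def distance_matrix(seqs, align):
    """seqs: dict name -> sequence; align: function returning an aligned pair."""
    names = list(seqs)
    dist = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):  # n(n-1)/2 pairs in total
        d = p_distance(*align(seqs[a], seqs[b]))
        dist[a][b] = dist[b][a] = d
    return dist


# Toy input: equal-length sequences, so the stand-in aligner returns them unchanged.
toy = {"s1": "GATTC", "s2": "GATTA", "s3": "CATTA"}
matrix = distance_matrix(toy, lambda a, b: (a, b))
n_pairs = len(list(combinations(toy, 2)))
```

In practice the `align` argument would be a real pairwise aligner; the pass-through lambda is only there to keep the example self-contained.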

Path Graph for aligning two sequences

[Figure: path graph]

Possible alignment

Scoring Scheme
• Match +1
• Mismatch 0
• Indel -1

[Figure: path through the graph with step scores 1, 1, 0, 1, 0, -1]

Score for this path = 2

Alignment using this path

GATTC-
GAATTC

Step scores: 1, 1, 0, 1, 0, -1

Optimal Alignment 1

[Figure: path through the graph with step scores 1, 1, -1, 1, 1, 1]

Alignment score 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

[Figure: path through the graph with step scores 1, -1, 1, 1, 1, 1]

Alignment score 4

Alignment using this path

G-ATTC
GAATTC
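The path scores on these slides can be checked with a short dynamic-programming sketch in Python. It uses the slide's scoring scheme (match +1, mismatch 0, indel -1) with a standard Needleman-Wunsch style recurrence; the function names are illustrative and the code is not taken from ClustalW.

```python
def path_score(x, y, match=1, mismatch=0, indel=-1):
    """Score one given alignment (two equal-length gapped strings)."""
    total = 0
    for p, q in zip(x, y):
        if p == "-" or q == "-":
            total += indel
        else:
            total += match if p == q else mismatch
    return total


def best_score(a, b, match=1, mismatch=0, indel=-1):
    """Best global alignment score, filling the path graph one row at a time."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


print(path_score("GATTC-", "GAATTC"))  # the 'possible alignment'
print(path_score("GA-TTC", "GAATTC"))  # optimal alignment 1
print(path_score("G-ATTC", "GAATTC"))  # optimal alignment 2
print(best_score("GATTC", "GAATTC"))   # best score over all paths
```

Both optimal alignments reach the same score because the gap can sit opposite either of the two adjacent A residues, which is one source of the arbitrariness in alignment noted elsewhere in the slides.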

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances

• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

Distance Matrix

What is required for the Neighbour joining method? A distance matrix:

PAM        Spinach   Rice  Mosquito  Monkey  Human
Spinach        0.0   84.9     105.6    90.8   86.3
Rice          84.9    0.0     117.8   122.4  122.6
Mosquito     105.6  117.8       0.0    84.7   80.8
Monkey        90.8  122.4      84.7     0.0    3.3
Human         86.3  122.6      80.8     3.3    0.0

First Step

PAM distance 3.3 (Human - Monkey) is the minimum, so we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: Monkey and Human joined at node Mon-Hum; Spinach, Mosquito and Rice not yet joined]

Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55

[Figure: Spinach and the Mon-Hum subtree (Monkey, Human)]

Next Cycle

PAM        Spinach   Rice  Mosquito  MonHum
Spinach        0.0   84.9     105.6    88.6
Rice          84.9    0.0     117.8   122.5
Mosquito     105.6  117.8       0.0    82.8
MonHum        88.6  122.5      82.8     0.0

[Figure: Mosquito joined to Mon-Hum at node Mos-(Mon-Hum); Spinach and Rice not yet joined]

Penultimate Cycle

PAM         Spinach   Rice  MosMonHum
Spinach         0.0   84.9       97.1
Rice           84.9    0.0      120.2
MosMonHum      97.1  120.2        0.0

[Figure: Spinach and Rice joined at node Spin-Rice, alongside Mos-(Mon-Hum)]

Last Joining

PAM         SpinRice  MosMonHum
SpinRice         0.0      108.7
MosMonHum      108.7        0.0

[Figure: final join (Spin-Rice)-(Mos-(Mon-Hum))]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with tips Human, Monkey, Mosquito, Rice and Spinach]
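The joining cycles worked through above can be reproduced with a short Python sketch. It implements only the procedure the slides describe (join the closest pair, then replace it by a node whose distances are simple averages); it is not the full neighbor-joining criterion, which also corrects each distance for a taxon's average distance to all others. The PAM values are read with one decimal place (for example 3.3 for Human - Monkey), an assumption that agrees with the averaging arithmetic on the slides. The slides round after every cycle, so later values can differ from this sketch in the last digit.

```python
def cluster_by_simple_average(dist):
    """dist maps frozenset({a, b}) -> distance; returns the joins in order."""
    dist = dict(dist)
    nodes = set().union(*dist)
    joins = []
    while len(nodes) > 1:
        pair = min(dist, key=dist.get)  # closest pair still unjoined
        a, b = sorted(pair)
        new = f"({a},{b})"
        joins.append((a, b, dist[pair]))
        nodes -= {a, b}
        for other in nodes:             # simple average to the new node
            dist[frozenset({other, new})] = (
                dist[frozenset({other, a})] + dist[frozenset({other, b})]
            ) / 2
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        nodes.add(new)
    return joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
pam = {
    frozenset({names[i], names[j]}): rows[i][j]
    for i in range(5)
    for j in range(i + 1, 5)
}
joins = cluster_by_simple_average(pam)
for a, b, d in joins:
    print(a, b, round(d, 2))
```

The join order (Human with Monkey, then Mosquito, then Spinach with Rice, then the two groups) matches the sequence of slides.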

Multiple Alignment- First pair

• Align the two most closely-related sequences first
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next
  – Align a third sequence to the first two
  Or
  – Align two entirely different sequences to each other

[Figure: Option 1 and Option 2]

Path Graph for aligning two sequences

Possible alignment

1

1

0

1

0

-1

Scoring SchemebullMatch +1bullMismatch 0bullIndel -1

Score for this path= 2

Alignment using this path

GATTC-GAATTC

1

1

0

1

0

-1

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a 'pair' having more than one sequence.

Progressive alignment - step 1

Unaligned sequences:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

Sequences 1 and 2 aligned:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Figure: guide tree with leaves 1, 2, 3, 4, 5.]

Progressive alignment - step 2

Unaligned sequences:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

Sequences 3 and 4 aligned:
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Figure: guide tree with leaves 1, 2, 3, 4, 5.]

Progressive alignment - step 3

Alignment of 1 and 2:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

+

Alignment of 3 and 4:
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Result:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Figure: guide tree with leaves 1, 2, 3, 4, 5.]

Progressive alignment - final step

Alignment of 1 to 4:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

Result:
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Figure: guide tree with leaves 1, 2, 3, 4, 5.]
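The rule that an existing alignment is kept fixed, and that any new gap goes into the same column of every sequence in the group, can be illustrated with step 3 of the worked example. This is only a sketch: the helper `insert_gaps` is a hypothetical name, and the gap position (column 13, length 3) is read off the slide rather than computed by an aligner.

```python
def insert_gaps(group, column, length=1):
    """Insert `length` gap characters at `column` in every row of an aligned group.

    Rows keep their alignment relative to each other, because each row
    receives the same gap in the same place.
    """
    return [row[:column] + "-" * length + row[column:] for row in group]

# The two fixed pairwise alignments from steps 1 and 2.
pair_12 = ["gctcgatacgatacgatgactagcta", "gctcgatacaagacgatgac-agcta"]
pair_34 = ["gctcgatacacgatgactagcta", "gctcgatacacgatgacgagcga"]

# Step 3: the (3, 4) group receives one three-column gap as a unit.
merged = pair_12 + insert_gaps(pair_34, column=13, length=3)
for number, row in enumerate(merged, start=1):
    print(number, row)
```

A real progressive aligner would find the gap position by aligning the two groups with dynamic programming; the point here is only that the group is edited as a unit.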

ClustalW- Good points/Bad points

• Advantages
  - Speed
• Disadvantages
  - No objective function
  - No way of quantifying whether or not the alignment is good
  - No way of knowing if the alignment is 'correct'

ClustalW- Local Minimum

• Potential problems
  - Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  - Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP - Gap Opening Penalty: the cost of opening a gap in an alignment
• GEP - Gap Extension Penalty: the cost of extending this gap
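How the two penalties combine can be shown with a small sketch. It assumes the common affine form in which a gap of n positions costs GOP + n x GEP; some programs charge GOP + (n - 1) x GEP instead, so treat the exact formula as an assumption. The values 10.0 and 0.5 are made up for illustration and are not ClustalW defaults.

```python
def gap_cost(length, gop, gep):
    """Cost of one contiguous gap of `length` positions under an affine scheme.

    Assumed form: opening penalty once, extension penalty per gap position.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length

# Hypothetical penalty values, chosen only to show the effect.
GOP, GEP = 10.0, 0.5

one_long_gap = gap_cost(2, GOP, GEP)        # a single gap of two positions
two_short_gaps = 2 * gap_cost(1, GOP, GEP)  # two separate gaps of one position
print(one_long_gap, two_short_gaps)
```

With a large GOP and a small GEP, one long gap is much cheaper than several short ones, which is why the choice of these two values shapes where gaps end up.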

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are:
  - D E G K N Q P R S
  - But this can be changed by the user
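The hydrophilic-stretch rule can be sketched as a scan for runs of residues from the default set. This is an illustrative sketch only: the function name `hydrophilic_positions` and the toy sequence are assumptions, and the code marks positions without applying any actual penalty change.

```python
HYDROPHILIC = frozenset("DEGKNQPRS")  # default residue set listed above

def hydrophilic_positions(seq, min_run=5, residues=HYDROPHILIC):
    """Return the 0-based positions lying inside a run of at least
    `min_run` consecutive hydrophilic residues."""
    positions = set()
    start = None
    # The trailing "#" is a sentinel that closes a run ending at the last residue.
    for i, aa in enumerate(seq + "#"):
        if aa in residues:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                positions.update(range(start, i))
            start = None
    return positions

toy = "MLVDEGKNAAASS"  # hypothetical sequence: DEGKN is a run of 5, SS is too short
stretch = hydrophilic_positions(toy)
print(sorted(stretch))
```

Positions returned by this scan are the ones where, according to the rule above, the gap opening penalty would be lowered.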

Divergent Sequences

• The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
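One way to picture the cutoff is to compute each sequence's average percent identity to all the others and set aside those that fall below it. This is a rough sketch under stated assumptions: the sequences are taken to be already comparable column by column, the averaging rule follows the wording of the slide rather than ClustalW's source code, and the names `percent_identity` and `split_by_divergence` are hypothetical.

```python
def percent_identity(a, b):
    """Percent of compared columns that are identical; columns with a gap are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def split_by_divergence(seqs, cutoff=40.0):
    """Split names into (early, delayed) by mean identity to all other sequences."""
    early, delayed = [], []
    for name, seq in seqs.items():
        others = [percent_identity(seq, t) for n, t in seqs.items() if n != name]
        mean = sum(others) / len(others) if others else 100.0
        (delayed if mean < cutoff else early).append(name)
    return early, delayed

# Hypothetical toy data: "d" is unlike the other three.
toy_seqs = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAA",
    "c": "ACGAACGTAC",
    "d": "TTTTTTTTTT",
}
early, delayed = split_by_divergence(toy_seqs)
print(early, delayed)
```

Sequences in `delayed` would be added to the alignment only after the others have been aligned.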

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

Unaligned:
ATGCTGTTAGGG
ATGACTCTGTTAGGG

Aligned as DNA:
ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
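The last step, putting the gaps into the DNA according to the protein alignment, can be sketched as follows. The function name `thread_gaps` is hypothetical, and the protein alignment used here (M-LLG against MTLLG) is an assumed outcome of aligning the translations of the two example sequences, not output from a real aligner.

```python
def thread_gaps(protein_row, dna):
    """Copy the gaps of an aligned protein row onto its unaligned coding DNA.

    Each amino acid consumes one codon; each protein gap becomes '---'.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out = []
    k = 0
    for aa in protein_row:
        if aa == "-":
            out.append("---")
        else:
            if k >= len(codons):
                raise ValueError("protein row has more residues than codons")
            out.append(codons[k])
            k += 1
    if k != len(codons):
        raise ValueError("DNA has more codons than the protein row has residues")
    return "".join(out)

# Translations: ATG CTG TTA GGG -> MLLG, ATG ACT CTG TTA GGG -> MTLLG.
dna_rows = [
    thread_gaps("M-LLG", "ATGCTGTTAGGG"),
    thread_gaps("MTLLG", "ATGACTCTGTTAGGG"),
]
for row in dna_rows:
    print(row)
```

The single three-base gap keeps the codons intact, in contrast to the scattered one- and two-base gaps in the direct DNA alignment shown above.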

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet available from:
  - http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX, available from:
  - http://iubio.bio.indiana.edu
SeAl for Macintosh, available from:
  - http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from:
  - http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Possible alignment

Scoring Scheme
• Match: +1
• Mismatch: 0
• Indel: -1

[Figure: path graph. Scores along this path: 1, 1, 0, 1, 0, -1.]

Score for this path = 2

Alignment using this path

GATTC-
GAATTC

[Figure: path graph. Scores along this path: 1, 1, 0, 1, 0, -1.]
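The path score can be checked by scoring the alignment column by column under the scheme given in the slides (match +1, mismatch 0, indel -1). A minimal sketch; the function name `score_alignment` is an illustrative choice.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Sum the column scores of two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # a gap in either row
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

print(score_alignment("GATTC-", "GAATTC"))
```

For the alignment shown, the columns score 1, 1, 0, 1, 0 and -1, which sums to 2.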

Optimal Alignment 1

[Figure: path graph. Scores along this path: 1, 1, -1, 1, 1, 1.]

Alignment score: 4

Alignment using this path

GA-TTC
GAATTC

Optimal Alignment 2

[Figure: path graph. Scores along this path: 1, -1, 1, 1, 1, 1.]

Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
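That the best score is 4 and that two different paths reach it can be confirmed with a small dynamic-programming sketch. It assumes the scoring scheme from the slides and a simple per-position indel cost (no separate opening and extension penalties); the function name `global_alignment` is illustrative.

```python
def global_alignment(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, number of optimal paths) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    count = [[0] * (m + 1) for _ in range(n + 1)]
    count[0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:  # diagonal move: align a[i-1] with b[j-1]
                s = match if a[i - 1] == b[j - 1] else mismatch
                options.append((score[i - 1][j - 1] + s, count[i - 1][j - 1]))
            if i > 0:            # gap in b
                options.append((score[i - 1][j] + indel, count[i - 1][j]))
            if j > 0:            # gap in a
                options.append((score[i][j - 1] + indel, count[i][j - 1]))
            best = max(s for s, _ in options)
            score[i][j] = best
            # Add up the paths from every predecessor that achieves the best score.
            count[i][j] = sum(c for s, c in options if s == best)
    return score[n][m], count[n][m]

best_score, n_paths = global_alignment("GATTC", "GAATTC")
print(best_score, n_paths)
```

The two optimal paths correspond to placing the single gap opposite either of the two A residues in GAATTC.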

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining 'guide tree' from these pairwise distances
• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1.]

Distance Matrix

What is required for the Neighbour joining method? A distance matrix:

PAM        Spinach   Rice    Mosquito   Monkey   Human
Spinach      0.0     84.9     105.6      90.8     86.3
Rice        84.9      0.0     117.8     122.4    122.6
Mosquito   105.6    117.8       0.0      84.7     80.8
Monkey      90.8    122.4      84.7       0.0      3.3
Human       86.3    122.6      80.8       3.3      0.0

Optimal Alignment 1

1

1

-1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

GA-TTCGA-TTCGAATTCGAATTC

Optimal Alignment 2

1

-1

1

1

1

1

Alignment score 4Alignment score 4

Alignment using this path

G-ATTCG-ATTCGAATTCGAATTC

Alignment of 3 sequences

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1

If a third sequence is to be aligned to the first two, then whenever a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the new sequence) are each treated as a single sequence.

[Figure: existing pairwise alignment + third sequence]

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

[Figure: first pairwise alignment + second pairwise alignment]

ClustalW- Progression

• The alignment is progressively built up in this way, with each step treated as a pairwise alignment, sometimes with each member of a 'pair' containing more than one sequence.
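One way to picture how a guide tree dictates the sequence of pairwise steps is a post-order walk over the tree, in which both children of a node are finished before the node itself is aligned. The sketch below is an illustration only: the nested-tuple tree, the labels "1" to "5" and the function name `alignment_steps` are hypothetical, and a real aligner may order independent steps differently (for example, closest pair first).

```python
def alignment_steps(tree):
    """Yield (left_group, right_group) pairs, children before their parent.

    `tree` is a nested tuple of sequence labels, e.g. (("1", "2"), "3").
    Each yielded pair is one pairwise step; a group may hold several
    already-aligned sequences.
    """
    def leaves(node):
        if isinstance(node, str):
            return [node]
        return leaves(node[0]) + leaves(node[1])

    def walk(node):
        if isinstance(node, str):
            return
        left, right = node
        yield from walk(left)
        yield from walk(right)
        yield leaves(left), leaves(right)

    yield from walk(tree)


# Hypothetical guide tree for five sequences.
guide_tree = ((("1", "2"), ("3", "4")), "5")
for left, right in alignment_steps(guide_tree):
    print("/".join(left), "+", "/".join(right))
```

For this tree the steps are 1 + 2, then 3 + 4, then the two pairs together, then the group of four with sequence 5.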

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Figure: guide tree with tips 1 2 3 4 5]

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Figure: guide tree with tips 1 2 3 4 5]

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Figure: guide tree with tips 1 2 3 4 5]
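When two already-aligned groups are brought together, a gap is added to every sequence in a group at the same column, which is why the relative alignment inside each group never changes. A minimal sketch of that operation follows; the function name and its parameters are assumptions for illustration. The column position (13) and gap length (3) are read off the step 3 result rather than computed by an alignment algorithm.

```python
def insert_gap_columns(rows, position, count=1, gap="-"):
    """Insert `count` gap columns at `position` in every row of an alignment.

    All rows receive the gap at the same column, so the alignment of the
    rows relative to each other is unchanged.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("rows of an alignment must have equal length")
    return [row[:position] + gap * count + row[position:] for row in rows]


# Sequences 3 and 4 as aligned to each other in step 2 (no gaps needed).
pair_34 = ["gctcgatacacgatgactagcta",
           "gctcgatacacgatgacgagcga"]

for row in insert_gap_columns(pair_34, position=13, count=3):
    print(row)
```

The two printed rows are the rows labelled 3 and 4 in the four-sequence alignment of step 3.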

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Figure: guide tree with tips 1 2 3 4 5]

ClustalW- Good points/Bad points

• Advantages
– Speed

• Disadvantages
– No objective function
– No way of quantifying whether or not the alignment is good
– No way of knowing if the alignment is 'correct'

ClustalW- Local Minimum

• Potential problems
– Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct it later in the procedure
– Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?

• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the three-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change them)

• GOP - Gap Opening Penalty: the cost of opening a gap in an alignment

• GEP - Gap Extension Penalty: the cost of extending this gap
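The effect of having separate opening and extension penalties can be seen with a small calculation. The sketch below assumes one common convention, in which a gap of n columns costs GOP + n × GEP; other conventions charge the extension only from the second column onward. The numeric values are illustrative and are not claimed to be the program's defaults.

```python
def gap_cost(length, gop=10.0, gep=0.2):
    """Affine cost of a single gap spanning `length` columns.

    Assumed convention: the opening penalty is paid once, and the
    extension penalty is paid for every gapped column.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length


one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
print(f"one gap of 3 columns:   {one_long_gap:.1f}")
print(f"three gaps of 1 column: {three_short_gaps:.1f}")
```

With a GOP much larger than the GEP, one long gap is far cheaper than several short ones covering the same number of columns, so the aligner prefers a few contiguous indels.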

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are:
– D E G K N Q P R S
– But this can be changed by the user
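The position-specific rules can be put together as a table of GOP values, one per alignment column. The sketch below is a rough illustration under stated assumptions: the three multipliers (`at_gap_factor`, `near_gap_factor`, `hydrophilic_factor`) and the base GOP are hypothetical numbers chosen for readability, not the values the program uses, and the way the "near a gap" and "hydrophilic run" rules combine when both apply is also an assumption. Only the 8-residue window, the run length of 5 and the residue set come from the slides.

```python
HYDROPHILIC = set("DEGKNQPRS")


def position_specific_gop(rows, base_gop=10.0, at_gap_factor=0.3,
                          near_gap_factor=2.0, hydrophilic_factor=0.5,
                          window=8, run_length=5):
    """Return one gap opening penalty per column of an aligned group."""
    width = len(rows[0])
    gap_cols = [i for i in range(width) if any(r[i] == "-" for r in rows)]

    # Mark columns inside a run of >= run_length hydrophilic residues
    # in at least one row.
    in_run = [False] * width
    for row in rows:
        start = 0
        while start < width:
            end = start
            while end < width and row[end] in HYDROPHILIC:
                end += 1
            if end - start >= run_length:
                for i in range(start, end):
                    in_run[i] = True
            start = max(end, start + 1)

    gops = []
    for i in range(width):
        if i in gap_cols:
            # A gap already exists here: lower the penalty, skip other rules.
            gops.append(base_gop * at_gap_factor)
            continue
        gop = base_gop
        if any(abs(i - g) <= window for g in gap_cols):
            gop *= near_gap_factor      # too close to an existing gap
        if in_run[i]:
            gop *= hydrophilic_factor   # likely loop region
        gops.append(gop)
    return gops


# Hypothetical single-row example: a gap at column 4, a hydrophilic
# stretch (DEGKN) at columns 17 to 21.
example = ["LLLL-" + "L" * 12 + "DEGKN" + "LLL"]
for column, gop in enumerate(position_specific_gop(example)):
    print(column, example[0][column], gop)
```

In the example, the gapped column gets the lowest penalty, the columns within 8 positions of it get a raised penalty, and the hydrophilic stretch further along gets a reduced one.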

Divergent Sequences

• The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment of these sequences until the others have been aligned
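A simple way to picture the cutoff is to compute, for each sequence, its mean percent identity to all the others and to set aside those that fall below the threshold. The sketch below follows the "most different on average" wording of the slide; how the program itself measures divergence may differ, and the function name and the identity values are hypothetical.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Split sequence names into (align_first, delayed).

    `identity[a][b]` is the percent identity between sequences a and b.
    A sequence is delayed when its mean identity to the others is below
    `cutoff`.
    """
    align_first, delayed = [], []
    for name, row in identity.items():
        others = [value for other, value in row.items() if other != name]
        mean_identity = sum(others) / len(others)
        (delayed if mean_identity < cutoff else align_first).append(name)
    return align_first, delayed


# Hypothetical pairwise percent identities for four sequences.
identity = {
    "seqA": {"seqB": 85.0, "seqC": 78.0, "seqD": 35.0},
    "seqB": {"seqA": 85.0, "seqC": 80.0, "seqD": 38.0},
    "seqC": {"seqA": 78.0, "seqB": 80.0, "seqD": 33.0},
    "seqD": {"seqA": 35.0, "seqB": 38.0, "seqC": 33.0},
}

align_first, delayed = split_by_divergence(identity)
print("align first:", align_first)
print("delayed:    ", delayed)
```

Here only seqD, with a mean identity of about 35%, falls below the 40% cutoff and would be added after the other three have been aligned.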

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
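The last step, putting gaps into the DNA according to the protein alignment, amounts to replacing each amino acid with its codon and each protein gap with three DNA gap characters. A minimal sketch follows, using the two sequences from the slide (which translate to MLLG and MTLLG under the standard genetic code). The protein alignment `M-LLG` / `MTLLG` is supplied by hand as an assumed result of the protein alignment step; the function name is hypothetical, and terminal stop codons or partial codons are not handled.

```python
def thread_dna_onto_protein(aligned_protein, dna, gap="-"):
    """Place gaps in a coding DNA sequence to match its aligned protein.

    Each residue in `aligned_protein` consumes one codon of `dna`;
    each gap becomes three gap characters.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(dna) % 3 != 0 or len(codons) != residues:
        raise ValueError("DNA length does not match the protein sequence")
    remaining = iter(codons)
    return "".join(gap * 3 if aa == gap else next(remaining)
                   for aa in aligned_protein)


print(thread_dna_onto_protein("M-LLG", "ATGCTGTTAGGG"))
print(thread_dna_onto_protein("MTLLG", "ATGACTCTGTTAGGG"))
```

This gives `ATG---CTGTTAGGG` against `ATGACTCTGTTAGGG`: a single three-base gap that keeps the reading frame intact, in contrast to the scattered one- and two-base gaps of the direct DNA alignment.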

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet, available from:
– http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX, available from:
– http://iubio.bio.indiana.edu
Se-Al for Macintosh, available from:
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from:
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Optimal Alignment 2

Column scores: 1 -1 1 1 1 1

Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
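The alignment score is the sum of the per-column scores. The sketch below reproduces the column scores for this alignment, assuming +1 for a match and -1 for a gap as shown; the mismatch score of -1 is an assumption, since this alignment contains no mismatches, and the function name is hypothetical.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Return (per-column scores, total) for two aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(gap)
        elif a == b:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores, sum(scores)


columns, total = alignment_score("G-ATTC", "GAATTC")
print(columns)  # [1, -1, 1, 1, 1, 1]
print(total)    # 4
```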

Alignment of 3 sequences

ClustalW- Guide Tree

• Generate a Neighbor-Joining 'guide tree' from these pairwise distances

• This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

• The neighbor-joining method is a greedy heuristic which, at each step, joins the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected through Node 1]

ClustalW- Guide Tree

bull Generate a Neighbor-Joining lsquoguide treersquo from these pairwise distances

bull This guide tree gives the order in which the progressive alignment will be carried out

Neighbor joining method

bullThe neighbor-joining method is a greedy heuristic which joins at each step the two closest sub-trees that are not already joinedbullIt is based on the minimum evolution principlebullneighbors are defined as two taxa that are connected by a single node in an unrooted tree

A B

Node 1

PAM Spinach Rice Mosquito Monkey HumanSpinach 00 849 1056 908 863Rice 849 00 1178 1224 1226Mosquito 1056 1178 00 847 808Monkey 908 1224 847 00 33Human 863 1226 808 33 00

What is required for the Neighbour joining method

Distance matrixDistance Matrix

PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Figure: guide tree with leaves 1 2 3 4 5]

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Figure: guide tree with leaves 1 2 3 4 5]

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

[Figure: guide tree with leaves 1 2 3 4 5]

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-

[Figure: guide tree with leaves 1 2 3 4 5]

ClustalW- Good points/Bad points

• Advantages
  – Speed
• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum

• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  – Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP - Gap Opening Penalty is the cost of opening a gap in an alignment
• GEP - Gap Extension Penalty is the cost of extending this gap
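To see how the two penalties interact, here is a minimal sketch that charges every run of gap characters in one aligned sequence an opening cost plus an extension cost per gap column. The numeric values of `GOP` and `GEP` are illustrative only and are not claimed to be ClustalW defaults; the convention `GOP + GEP * length` is also an assumption (some programs charge the extension only from the second column onwards), and terminal gaps are charged like internal ones here.

```python
import re


def gap_penalty(aligned_seq, gop, gep):
    """Total affine gap cost of one aligned sequence.

    Each maximal run of '-' costs gop once, plus gep for every gap column.
    """
    total = 0.0
    for run in re.finditer(r"-+", aligned_seq):
        length = run.end() - run.start()
        total += gop + gep * length
    return total


GOP = 10.0  # illustrative value, not a documented default
GEP = 0.5   # illustrative value, not a documented default

# Sequence 3 from the slides: one gap of three columns.
one_long_gap = gap_penalty("gctcgatacacga---tgactagcta", GOP, GEP)
# Sequence 5 from the slides: three separate single-column gaps.
three_short_gaps = gap_penalty("-ctcga-acgatacgatgactagct-", GOP, GEP)
print(one_long_gap, three_short_gaps)
```

With a large GOP and a small GEP, three separate one-column gaps cost far more than a single three-column gap, which is why this scoring scheme favours a few long gaps over many short ones.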

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
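The position-specific rules can be sketched as a function that turns one aligned protein sequence into a table of GOP values. This is a simplified illustration under stated assumptions: the three multipliers are made-up numbers (the slides give no factors), the rules are applied in a fixed priority order (gap first, then proximity to a gap, then hydrophilic stretch), only the GOP is adjusted, "within 8 residues" is read as a column distance of at most 8, and a single sequence stands in for the group of sequences that the real table would be built from.

```python
import re

HYDROPHILIC = "DEGKNQPRS"   # default hydrophilic residues listed on the slide
AT_GAP_FACTOR = 0.25        # hypothetical: lower the GOP where a gap already exists
NEAR_GAP_FACTOR = 2.0       # hypothetical: raise the GOP close to an existing gap
HYDROPHILIC_FACTOR = 0.5    # hypothetical: lower the GOP inside a hydrophilic stretch
NEAR_GAP_WINDOW = 8         # "within 8 residues of an existing gap"
MIN_RUN = 5                 # a run of 5 hydrophilic residues counts as a stretch


def position_specific_gops(aligned_seq, base_gop):
    """Return one gap opening penalty per alignment column (illustrative rules)."""
    gap_columns = [i for i, ch in enumerate(aligned_seq) if ch == "-"]
    hydrophilic = set()
    for run in re.finditer("[" + HYDROPHILIC + "]{%d,}" % MIN_RUN, aligned_seq):
        hydrophilic.update(range(run.start(), run.end()))

    gops = []
    for i, ch in enumerate(aligned_seq):
        if ch == "-":
            # Existing gap: penalty lowered, the other rules do not apply.
            gops.append(base_gop * AT_GAP_FACTOR)
        elif any(abs(i - g) <= NEAR_GAP_WINDOW for g in gap_columns):
            gops.append(base_gop * NEAR_GAP_FACTOR)
        elif i in hydrophilic:
            gops.append(base_gop * HYDROPHILIC_FACTOR)
        else:
            gops.append(base_gop)
    return gops


# Toy sequence: a gap at column 4 and a hydrophilic stretch DEGKN at columns 17-21.
seq = "AAAA-" + "A" * 12 + "DEGKN" + "AAA"
gops = position_specific_gops(seq, base_gop=10.0)
print(gops)
```

In the toy sequence the penalty is lowest at the existing gap, doubled for the columns around it, halved inside the hydrophilic stretch and unchanged elsewhere.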

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment of these sequences until the others have been aligned
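A minimal sketch of the delay rule, following the wording of the slide: compute each sequence's average percent identity to all the others and postpone those that fall below the cutoff. The function name, the input format (a dictionary of precomputed pairwise percent identities) and the example values are all hypothetical; the exact criterion used inside ClustalW may differ from this average-based reading.

```python
def delayed_sequences(pairwise_identity, cutoff=40.0):
    """Names of sequences whose mean percent identity to the others is below cutoff."""
    totals = {}
    counts = {}
    for (a, b), identity in pairwise_identity.items():
        for name in (a, b):
            totals[name] = totals.get(name, 0.0) + identity
            counts[name] = counts.get(name, 0) + 1
    means = {name: totals[name] / counts[name] for name in totals}
    return sorted(name for name, mean in means.items() if mean < cutoff)


# Made-up percent identities for three hypothetical sequences.
identities = {
    ("seqA", "seqB"): 90.0,
    ("seqA", "seqC"): 35.0,
    ("seqB", "seqC"): 30.0,
}
late = delayed_sequences(identities)
print(late)
```

Here only the third sequence averages below 40% identity, so it would be set aside and aligned after the other two.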

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
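The last step of that recipe, putting the gaps back into the DNA, can be sketched as follows. The function name `thread_codons` is hypothetical, and the aligned protein strings `M-LLG` and `MTLLG` are an assumed protein alignment of the two example sequences (ATG CTG TTA GGG translates to M L L G, and ATG ACT CTG TTA GGG to M T L L G); the sketch does not perform the translation or the protein alignment itself.

```python
def thread_codons(aligned_protein, dna):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon.

    Every residue consumes the next codon; every protein gap becomes '---'.
    """
    residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the number of residues")
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if ch == "-" else next(codons) for ch in aligned_protein)


# Assumed protein alignment of the two example sequences from the slide.
short_dna = thread_codons("M-LLG", "ATGCTGTTAGGG")
long_dna = thread_codons("MTLLG", "ATGACTCTGTTAGGG")
print(short_dna)
print(long_dna)
```

The gap now spans exactly one codon, so the reading frame is preserved, unlike the scattered single-base gaps in the direct DNA alignment shown on the slide.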

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet available from
  – http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX, available from
  – http://iubio.bio.indiana.edu
SeAl for Macintosh, available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html


Neighbor joining method

• The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined
• It is based on the minimum evolution principle
• Neighbors are defined as two taxa that are connected by a single node in an unrooted tree

[Figure: taxa A and B connected by a single node, Node 1]

PAM        Spinach   Rice    Mosquito  Monkey   Human
Spinach      0.0     84.9    105.6      90.8     86.3
Rice        84.9      0.0    117.8     122.4    122.6
Mosquito   105.6    117.8      0.0      84.7     80.8
Monkey      90.8    122.4     84.7       0.0      3.3
Human       86.3    122.6     80.8       3.3      0.0

What is required for the Neighbour joining method?

Distance matrix

Distance Matrix

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: tree diagram with node Mon-Hum joining Monkey and Human; Spinach, Mosquito and Rice not yet joined]

First Step

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55
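The joining cycles can be reproduced with a short sketch that repeats the two operations described here: pick the closest pair, then replace it by a merged node whose distance to every other node is the simple average of the two old distances. The function name and data layout are hypothetical, and the matrix values assume the PAM distances carry one decimal place (for example 84.9). Note that this follows the simplified procedure of the slides; the published neighbor-joining algorithm of Saitou and Nei selects pairs with a corrected criterion rather than the raw minimum distance. The slides also round after every cycle, so their last table shows 108.7 where the unrounded arithmetic below gives 108.6125.

```python
from itertools import combinations


def cluster_by_simple_average(names, distances):
    """Repeatedly join the closest pair, averaging distances to the merged node.

    Returns the joins in order as (left, right, distance) tuples.
    """
    dist = {frozenset(pair): d for pair, d in distances.items()}
    active = list(names)
    joins = []
    while len(active) > 1:
        a, b = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        joins.append((a, b, dist[frozenset((a, b))]))
        merged = "(" + a + "," + b + ")"
        for other in active:
            if other not in (a, b):
                # Simple average of the distances to the two joined nodes.
                dist[frozenset((merged, other))] = (
                    dist[frozenset((a, other))] + dist[frozenset((b, other))]
                ) / 2
        active = [n for n in active if n not in (a, b)] + [merged]
    return joins


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
pam = {
    ("Spinach", "Rice"): 84.9, ("Spinach", "Mosquito"): 105.6,
    ("Spinach", "Monkey"): 90.8, ("Spinach", "Human"): 86.3,
    ("Rice", "Mosquito"): 117.8, ("Rice", "Monkey"): 122.4,
    ("Rice", "Human"): 122.6, ("Mosquito", "Monkey"): 84.7,
    ("Mosquito", "Human"): 80.8, ("Monkey", "Human"): 3.3,
}
joins = cluster_by_simple_average(names, pam)
for left, right, d in joins:
    print(left, "+", right, round(d, 2))
```

The order of the joins (Monkey with Human, then Mosquito, then Spinach with Rice, then the two remaining groups) matches the cycles shown in the slides.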



PAM distance 33 (Human - Monkey) is the minimum So well join Human and Monkey to MonHum and well calculate the new distances

Mon-Hum

MonkeyHumanSpinachMosquito Rice

First Step

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree We do this with a simple average of distancesDist[Spinach MonHum]

= (Dist[Spinach Monkey] + Dist[Spinach Human])2 = (908 + 863)2 = 8855

Mon-Hum

MonkeyHumanSpinach

Calculation of New Distances

PAM Spinach Rice Mosquito MonHumSpinach 00 849 1056 886Rice 849 00 1178 1225Mosquito 1056 1178 00 828MonHum 886 1225 828 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Next Cycle

PAM Spinach Rice MosMonHumSpinach 00 849 971Rice 849 00 1202MosMonHum 971 1202 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

Penultimate Cycle

PAM SpinRice MosMonHumSpinach 00 1087MosMonHum 1087 00

HumanMosquito

Mon-Hum

MonkeySpinachRice

Mos-(Mon-Hum)

Spin-Rice

(Spin-Rice)-(Mos-(Mon-Hum))

Last Joining

Human

Monkey

MosquitoRice

Spinach

Unrooted Neighbor-Joining Tree

Multiple Alignment- First pairbull Align the two most closely-related sequences first

bull This alignment is then lsquofixedrsquo and will never change If a gap is to be introduced subsequently then it will be introduced in the same place in both sequences but their relative alignment remains unchanged

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP - Gap Opening Penalty: the cost of opening a gap in an alignment
• GEP - Gap Extension Penalty: the cost of extending this gap
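As a rough illustration of how the two penalties combine, the sketch below charges each gap one opening penalty plus one extension penalty per gap column. That convention, the helper names `gap_cost` and `total_gap_cost`, and the values GOP = 10.0 and GEP = 0.5 are assumptions chosen for the example, not ClustalW defaults. End gaps are charged like internal gaps here, which is a simplification.

```python
import re


def gap_cost(length, gop, gep):
    """Cost of one gap: one opening penalty plus `length` extension penalties."""
    return gop + gep * length


def total_gap_cost(row, gop, gep):
    """Sum the cost of every run of '-' characters in one aligned row."""
    return sum(gap_cost(len(m.group()), gop, gep) for m in re.finditer(r"-+", row))


# Rows taken from the worked example in this deck; penalty values are illustrative.
one_long_gap = "gctcgatacacga---tgactagcta"
three_short_gaps = "-ctcga-acgatacgatgactagct-"

print(total_gap_cost(one_long_gap, gop=10.0, gep=0.5))
print(total_gap_cost(three_short_gaps, gop=10.0, gep=0.5))
```

With a large GOP relative to GEP, one gap of three columns is much cheaper than three separate single-column gaps, which is the behaviour the two-penalty scheme is meant to produce.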

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are:
  – D E G K N Q P R S
  – But this can be changed by the user
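The position-specific rules above can be sketched as a small table-building function. This is only an illustration of the rules as stated on the slides. The multipliers (`lowered`, `raised`, `hydro`), the base GOP of 10.0, the function names, and the choice to apply the "near a gap" and "hydrophilic stretch" adjustments together are all assumptions, not the actual ClustalW formulas.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed on the slide


def hydrophilic_positions(row, run=5):
    """Mark columns lying inside a run of at least `run` hydrophilic residues."""
    marked = [False] * len(row)
    start = None
    for i, residue in enumerate(row + "#"):  # "#" is a sentinel closing a final run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                for j in range(start, i):
                    marked[j] = True
            start = None
    return marked


def position_gops(alignment, gop=10.0, lowered=0.5, raised=2.0, hydro=0.5, window=8):
    """Build a per-column GOP table for one group of aligned sequences."""
    length = len(alignment[0])
    has_gap = [any(row[i] == "-" for row in alignment) for i in range(length)]
    stretches = [hydrophilic_positions(row) for row in alignment]
    table = []
    for i in range(length):
        if has_gap[i]:
            # Existing gap: lower the penalty; the other rules do not apply.
            table.append(gop * lowered)
            continue
        value = gop
        lo, hi = max(0, i - window), min(length, i + window + 1)
        if any(has_gap[lo:hi]):
            value *= raised  # discourage a new gap close to an existing one
        if any(marks[i] for marks in stretches):
            value *= hydro   # encourage gaps in likely loop regions
        table.append(value)
    return table


print(position_gops(["AAAA-AAAA"], window=2))
print(position_gops(["AADEGKNAA"]))
```

The first call uses a shortened window of 2 so that the effect of distance from the gap is visible in a nine-column toy alignment.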

Divergent Sequences

• The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment of these sequences until the others have been aligned
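A minimal sketch of the delay rule, assuming divergence is measured as each sequence's mean percent identity to all the others, as the slide words it. The function name `split_by_divergence` and the identity values are hypothetical, and the exact test ClustalW applies may differ.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Split sequences into those aligned first and those delayed.

    `identity` maps frozenset({name_a, name_b}) to percent identity.
    A sequence is delayed when its mean identity to all others is below `cutoff`.
    """
    names = sorted({name for pair in identity for name in pair})
    mean_identity = {}
    for name in names:
        values = [value for pair, value in identity.items() if name in pair]
        mean_identity[name] = sum(values) / len(values)
    aligned_first = [n for n in names if mean_identity[n] >= cutoff]
    delayed = [n for n in names if mean_identity[n] < cutoff]
    return aligned_first, delayed


# Hypothetical pairwise percent identities for four sequences.
identity = {
    frozenset(("A", "B")): 90.0,
    frozenset(("A", "C")): 85.0,
    frozenset(("B", "C")): 80.0,
    frozenset(("A", "D")): 30.0,
    frozenset(("B", "D")): 25.0,
    frozenset(("C", "D")): 35.0,
}

aligned_first, delayed = split_by_divergence(identity)
print("align first:", aligned_first)
print("delay:", delayed)
```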

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to judge which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes directly

Unaligned:
ATGCTGTTAGGG
ATGACTCTGTTAGGG

Aligned as DNA:
ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
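The translate, align, then back-thread procedure can be sketched for the two sequences on this slide. The helper names `translate` and `thread_gaps` are illustrative, the standard genetic code is assumed, and the protein alignment (`M-LLG` against `MTLLG`) is supplied by hand here rather than computed by an aligner.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA sequence whose length is a multiple of 3."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def thread_gaps(dna, aligned_protein):
    """Copy gaps from an aligned protein back onto its DNA, one codon per residue."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out, k = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")      # a protein gap becomes a whole-codon gap
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)


short_dna = "ATGCTGTTAGGG"
long_dna = "ATGACTCTGTTAGGG"
print(translate(short_dna), translate(long_dna))

# Protein alignment assumed for illustration: the short protein lacks one residue.
print(thread_gaps(short_dna, "M-LLG"))
print(thread_gaps(long_dna, "MTLLG"))
```

The threaded result places a single three-base gap on a codon boundary, in contrast to the scattered single-base gaps in the direct DNA alignment shown on the slide.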

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet, available from:
  – http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX, available from:
  – http://iubio.bio.indiana.edu
Se-Al for Macintosh, available from:
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from:
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55

[Tree figure labels: Mon-Hum, Monkey, Human, Spinach]

Next Cycle

PAM        Spinach   Rice    Mosquito  MonHum
Spinach      0.0     84.9    105.6      88.6
Rice        84.9      0.0    117.8     122.5
Mosquito   105.6    117.8      0.0      82.8
MonHum      88.6    122.5     82.8       0.0

[Tree figure labels: Human, Mosquito, Mon-Hum, Monkey, Spinach, Rice, Mos-(Mon-Hum)]

Penultimate Cycle

PAM         Spinach   Rice    MosMonHum
Spinach       0.0     84.9     97.1
Rice         84.9      0.0    120.2
MosMonHum    97.1    120.2      0.0

[Tree figure labels: Human, Mosquito, Mon-Hum, Monkey, Spinach, Rice, Mos-(Mon-Hum), Spin-Rice]
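The averaging step can be checked against the tables in this deck. The sketch below joins Mosquito with MonHum in the four-taxon table and recomputes the distances from Spinach and Rice to the new node. The function name `merge_by_average` and the dictionary layout are illustrative choices. It follows the simplified description on the slides (join the closest pair, then take a simple average); the standard neighbor-joining update is defined differently.

```python
def merge_by_average(dist, a, b, merged_name):
    """Replace taxa a and b with one node whose distances are simple averages."""
    taxa = sorted({t for pair in dist for t in pair} - {a, b})
    new_dist = {}
    for i, x in enumerate(taxa):
        # Distances between untouched taxa are carried over unchanged.
        for y in taxa[i + 1:]:
            new_dist[frozenset((x, y))] = dist[frozenset((x, y))]
        # Distance to the new node: mean of the distances to its two members.
        new_dist[frozenset((x, merged_name))] = (
            dist[frozenset((x, a))] + dist[frozenset((x, b))]
        ) / 2
    return new_dist


# Four-taxon table from the slides (decimal points restored).
table = {
    frozenset(("Spinach", "Rice")): 84.9,
    frozenset(("Spinach", "Mosquito")): 105.6,
    frozenset(("Spinach", "MonHum")): 88.6,
    frozenset(("Rice", "Mosquito")): 117.8,
    frozenset(("Rice", "MonHum")): 122.5,
    frozenset(("Mosquito", "MonHum")): 82.8,
}

closest = min(table, key=table.get)  # the pair with the smallest distance
next_table = merge_by_average(table, "Mosquito", "MonHum", "MosMonHum")
for pair, value in sorted(next_table.items(), key=lambda item: sorted(item[0])):
    print(sorted(pair), round(value, 2))
```

The recomputed Spinach and Rice distances (97.1 and 120.15) agree with the three-taxon table to the one decimal place shown there.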

Last Joining

PAM         SpinRice  MosMonHum
SpinRice       0.0     108.7
MosMonHum    108.7       0.0

[Tree figure labels: Human, Mosquito, Mon-Hum, Monkey, Spinach, Rice, Mos-(Mon-Hum), Spin-Rice, (Spin-Rice)-(Mos-(Mon-Hum))]

Unrooted Neighbor-Joining Tree

[Tree figure leaves: Human, Monkey, Mosquito, Rice, Spinach]

Multiple Alignment- First pair

• Align the two most closely related sequences first
• This alignment is then 'fixed' and will never change. If a gap is to be introduced subsequently, it will be introduced in the same place in both sequences, so their relative alignment remains unchanged
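The idea that a later gap goes into the same place in every row of an already fixed alignment can be sketched as a column insertion. The function name `insert_gap_columns` is illustrative. The two rows are sequences 3 and 4 from the worked example in this deck, and the position and gap length are taken from the step 3 slide.

```python
def insert_gap_columns(alignment, position, count=1, gap="-"):
    """Insert `count` gap columns at `position` in every row of a fixed alignment."""
    return [row[:position] + gap * count + row[position:] for row in alignment]


# Sequences 3 and 4 align to each other without gaps, so the pair is fixed as-is.
fixed_pair = [
    "gctcgatacacgatgactagcta",
    "gctcgatacacgatgacgagcga",
]

# Aligning the pair to sequences 1 and 2 needs a three-column gap after column 13.
widened = insert_gap_columns(fixed_pair, position=13, count=3)
for row in widened:
    print(row)
```

Because every row receives the same columns, the residues of sequences 3 and 4 stay paired exactly as before.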

ClustalW- Decision time

• Consult the guide tree to see what alignment is performed next:
  – Align a third sequence to the first two
  Or
  – Align two entirely different sequences to each other

[Figure: Option 1 and Option 2]

ClustalW- Alternative 1

If a third sequence is to be aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the new sequence) are each treated as a single sequence.

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a 'pair' having more than one sequence
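One way to picture how the guide tree dictates the order of pairwise steps is a post-order walk over a nested-tuple tree. The tree shape `(((1, 2), (3, 4)), 5)` is inferred from the worked example in this deck and is an assumption, as is the function name `merge_order`. Only the order of the steps is produced here, not the alignments themselves.

```python
def merge_order(tree, steps=None):
    """Return (leaves, steps) for a guide tree given as nested 2-tuples.

    Each step is a pair (left_group, right_group) of sequence labels that are
    aligned to each other at that point, children before parents.
    """
    if steps is None:
        steps = []
    if not isinstance(tree, tuple):
        return [tree], steps            # a single sequence: nothing to align yet
    left, _ = merge_order(tree[0], steps)
    right, _ = merge_order(tree[1], steps)
    steps.append((left, right))         # align the two groups as one 'pair'
    return left + right, steps


guide_tree = (((1, 2), (3, 4)), 5)      # assumed shape for the five-sequence example
leaves, steps = merge_order(guide_tree)
for number, (left, right) in enumerate(steps, start=1):
    print(f"step {number}: align {left} with {right}")
```

Each step is still a pairwise alignment, but from the third step on at least one member of the pair is a group of already aligned sequences.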

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

[Figure labels: 1 2 3 4 5]

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

[Figure labels: 1 2 3 4 5]

Last Joining

PAM         SpinRice   MosMonHum
Spinach     0.0        108.7
MosMonHum   108.7      0.0

[Figure: joining order. Mon-Hum, then Mos-(Mon-Hum), then Spin-Rice, and finally (Spin-Rice)-(Mos-(Mon-Hum)).]

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with tips Human, Monkey, Mosquito, Rice and Spinach.]

Multiple Alignment - First pair

• Align the two most closely related sequences first.

• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, it will be introduced in the same place in both sequences, so their relative alignment remains unchanged.

ClustalW - Decision time

• Consult the guide tree to see which alignment is performed next:

– Align a third sequence to the first two, or
– Align two entirely different sequences to each other

[Figure: Option 1 and Option 2]

ClustalW - Alternative 1

• If a third sequence is aligned to the first two, then whenever a gap has to be introduced to improve the alignment, each of the two entities (the existing pair and the new sequence) is treated as a single sequence.

ClustalW - Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

ClustalW - Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.
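The ‘fixed’ rule described above can be sketched in a few lines of Python. This is a minimal illustration, not ClustalW's own code: the function names `insert_gap_column` and `drop_all_gap_columns` and the chosen column are assumptions made for the example, and the two rows are the aligned pair from the worked example in this deck.

```python
def insert_gap_column(group, column):
    """Insert a gap at the same column in every row of an aligned group."""
    return [row[:column] + "-" + row[column:] for row in group]


def drop_all_gap_columns(group):
    """Remove columns that contain nothing but gaps."""
    kept = [col for col in zip(*group) if any(ch != "-" for ch in col)]
    return ["".join(chars) for chars in zip(*kept)]


# The already-aligned pair (sequences 1 and 2 of the worked example).
pair = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgac-agcta",
]

# A later step needs a gap at column 13 (hypothetical position):
# it goes into both rows, so the pair's relative alignment is untouched.
widened = insert_gap_column(pair, 13)
```

Removing the shared gap column again with `drop_all_gap_columns(widened)` gives back the original pair, which is what "their relative alignment remains unchanged" means in practice.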

Progressive alignment - step 1

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

Progressive alignment - step 2

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
+
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
+
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-
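A quick sanity check on the worked example can be written in Python. The sketch below only confirms two properties that any alignment of these five sequences must have: every row has the same width, and removing the gaps gives back the original sequence. It says nothing about whether the alignment is a good one. The helper names `degap` and `check_alignment` are assumptions made for this illustration.

```python
# The five input sequences from the worked example.
originals = {
    1: "gctcgatacgatacgatgactagcta",
    2: "gctcgatacaagacgatgacagcta",
    3: "gctcgatacacgatgactagcta",
    4: "gctcgatacacgatgacgagcga",
    5: "ctcgaacgatacgatgactagct",
}

# The rows of the final alignment.
final = {
    1: "gctcgatacgatacgatgactagcta",
    2: "gctcgatacaagacgatgac-agcta",
    3: "gctcgatacacga---tgactagcta",
    4: "gctcgatacacga---tgacgagcga",
    5: "-ctcga-acgatacgatgactagct-",
}


def degap(row):
    """Remove gap characters from an aligned row."""
    return row.replace("-", "")


def check_alignment(originals, aligned):
    """True if all rows share one width and degap to their original sequence."""
    same_width = len({len(row) for row in aligned.values()}) == 1
    intact = all(degap(aligned[key]) == seq for key, seq in originals.items())
    return same_width and intact


ok = check_alignment(originals, final)
```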

ClustalW - Good points/Bad points

• Advantages
– Speed

• Disadvantages
– No objective function
– No way of quantifying whether or not the alignment is good
– No way of knowing if the alignment is ‘correct’

ClustalW - Local Minimum

• Potential problems
– Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct it later in the procedure
– Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?

• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW - Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW - User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these).

• GOP - Gap Opening Penalty is the cost of opening a gap in an alignment.

• GEP - Gap Extension Penalty is the cost of extending this gap.
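How the two penalties combine can be shown with a small Python sketch. It assumes one common affine convention, in which a gap of a given length costs the opening penalty once plus the extension penalty for each gapped position; the exact formula and the default values inside ClustalW may differ, and the numbers below are purely illustrative.

```python
def gap_cost(length, gop, gep):
    """Affine gap cost (assumed convention): open once, extend per position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


# Hypothetical penalty values, chosen only to make the arithmetic easy.
GOP = 10.0
GEP = 0.5

one_long_gap = gap_cost(3, GOP, GEP)         # a single gap of three positions
three_short_gaps = 3 * gap_cost(1, GOP, GEP)  # three separate one-position gaps
```

With a large GOP and a small GEP, one long gap is much cheaper than several short ones, which is why these two values shape how gaps are clustered in the result.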

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences.

• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences.

• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply.

• This makes gaps more likely at positions where gaps already exist.

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap.
• This discourages gaps that are too close together.
• At any position within a run of hydrophilic residues, the GOP is decreased.
• These runs usually indicate loop regions in protein structures.
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch.
• The default hydrophilic residues are:
– D E G K N Q P R S
– But this can be changed by the user
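The position-specific rules above can be put together in a rough Python sketch that builds a GOP table for one aligned row. The 8-residue window, the run length of 5 and the residue set come from the slides; the multipliers (`lowered`, `raised`, `hydro`) are hypothetical, and applying the "near a gap" and "hydrophilic run" rules together by multiplication is an assumption of this sketch rather than a statement of what ClustalW does.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide


def hydrophilic_positions(row, min_run=5):
    """Indices lying inside a run of at least min_run hydrophilic residues."""
    inside = set()
    start = None
    for i, ch in enumerate(row + "#"):  # sentinel closes a run at the end
        if ch in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                inside.update(range(start, i))
            start = None
    return inside


def position_gops(row, gop, lowered=0.5, raised=2.0, hydro=0.5, window=8):
    """Per-position GOP table; all multipliers are illustrative assumptions."""
    gaps = [i for i, ch in enumerate(row) if ch == "-"]
    hydro_idx = hydrophilic_positions(row)
    table = []
    for i, ch in enumerate(row):
        if ch == "-":
            table.append(gop * lowered)  # existing gap: lowered, other rules skipped
            continue
        value = gop
        if any(abs(i - g) <= window for g in gaps):
            value *= raised              # close to an existing gap: discourage
        if i in hydro_idx:
            value *= hydro               # hydrophilic stretch: encourage
        table.append(value)
    return table


# Hypothetical row: a hydrophilic stretch, a hydrophobic core, and one gap.
row = "DEGKN" + "L" * 13 + "-" + "L"
table = position_gops(row, gop=10.0)
```

In this toy row the first five positions get a reduced GOP (hydrophilic stretch), the positions within 8 residues of the gap get an increased GOP, and the gap position itself gets a lowered one.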

Divergent Sequences

• The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align.

• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned).

• The user has the choice of setting a cutoff (default is 40% identity).

• This will delay the alignment until the others have been aligned.
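The idea of delaying divergent sequences can be sketched as follows. The sketch assumes that pairwise percent identities have already been computed, and it uses the "most different on average" definition from the slide; the sequence names, the identity values and the function names are hypothetical, and ClustalW's own criterion may be defined differently.

```python
def mean_identity(names, identity):
    """Average percent identity of each sequence to all of the others."""
    means = {}
    for name in names:
        others = [n for n in names if n != name]
        total = sum(identity[frozenset((name, other))] for other in others)
        means[name] = total / len(others)
    return means


def split_by_divergence(names, identity, cutoff=40.0):
    """Return (align_first, delayed) using the mean-identity cutoff."""
    means = mean_identity(names, identity)
    align_first = [n for n in names if means[n] >= cutoff]
    delayed = [n for n in names if means[n] < cutoff]
    return align_first, delayed


# Hypothetical pairwise percent identities for four sequences.
names = ["A", "B", "C", "D"]
identity = {
    frozenset(("A", "B")): 80.0,
    frozenset(("A", "C")): 75.0,
    frozenset(("A", "D")): 30.0,
    frozenset(("B", "C")): 78.0,
    frozenset(("B", "D")): 35.0,
    frozenset(("C", "D")): 32.0,
}

align_first, delayed = split_by_divergence(names, identity)
```

Here sequence D averages well under 40% identity to the others, so it is held back and added only after A, B and C have been aligned.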

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
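The last step of that recipe, putting the gaps back into the DNA, can be sketched in Python. The function name `thread_gaps` is an assumption, and the protein rows `M-LLG` and `MTLLG` are a hand-made alignment of the translations of the two example sequences (standard genetic code), used here only for illustration.

```python
def thread_gaps(protein_row, dna):
    """Copy gaps from an aligned protein row onto its coding DNA, codon by codon."""
    residues = protein_row.replace("-", "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be three times the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    # Each residue consumes one codon; each protein gap becomes three DNA gaps.
    return "".join("---" if aa == "-" else next(codons) for aa in protein_row)


# The two example sequences translate to MLLG and MTLLG.
short_dna = "ATGCTGTTAGGG"
long_dna = "ATGACTCTGTTAGGG"

aligned_short = thread_gaps("M-LLG", short_dna)
aligned_long = thread_gaps("MTLLG", long_dna)
```

Because gaps are only ever inserted in multiples of three and between codons, the reading frame is preserved, unlike in the direct DNA alignment shown above.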

Manual Alignment - software

GDE - The Genetic Data Environment (UNIX)

CINEMA - Java applet available from:
– http://www.biochem.ucl.ac.uk

Seqapp/Seqpup - Mac/PC/UNIX available from:
– http://iubio.bio.indiana.edu

Se-Al for Macintosh available from:
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC available from:
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

ClustalW- Decision timebull Consult the guide tree to see what alignment is performed next

ndash Align a third sequence to the first twoOrndash Align two entirely different sequences to each other

Option 1Option 1 Option 2Option 2

ClustalW- Alternative 1If the situation arises where a third sequence is aligned to the first two then when a gap has to be introduced to improve the alignment each of these two entities are treated as two single sequences

+

ClustalW- Alternative 2bull If on the other hand two separate

sequences have to be aligned together then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

+

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

bull Two penalties are set by the user (there are default values but you should know that it is possible to change these)

bull GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

bull GEP- Gap Extension Penalty is the cost of extending this gap

Position-Specific gap penaltiesbull Before any pair of (groups of) sequences are

aligned a table of GOPs are generated for each position in the two (sets of) sequences

bull The GOP is manipulated in a position-specific manner so that it can vary over the sequences

bull If there is a gap at a position the GOP and GEP penalties are lowered the other rules do not apply

bull This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps bull If there is no gap opened then the GOP is increased if the

position is within 8 residues of an existing gapbull This discourages gaps that are too close togetherbull At any position within a run of hydrophilic residues the GOP is

decreasedbull These runs usually indicate loop regions in protein structuresbull A run of 5 hydrophilic residues is considered to be a hydrophilic

stretchbull The default hydrophilic residues are

ndash D E G K N Q P R Sndash But this can be changed by the user

Divergent Sequencesbull The most divergent sequences (most different on average

from all of the other sequences) are usually the most difficult to align

bull It is sometimes better to delay their aligment until later (when the easier sequences have already been aligned)

bull The user has the choice of setting a cutoff (default is 40 identity)

bull This will delay the alignment until the others have been aligned

Advice on progressive alignmentbull Progressive alignment is a mathematical process that is completely

independent of biological realitybull Can be a very good estimatebull Can be an impossibly poor estimatebull Requires user input and skillbull Treat cautiouslybull Can be improved by eye (usually)bull Often helps to have colour-codingbull Depending on the use the user should be able to make a judgement

on those regions that are reliable or notbull For phylogeny reconstruction only use those positions whose

hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
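
The last step, putting the protein gaps back into the DNA, can be sketched in a few lines. This is a minimal illustration using the sequences ATGCTGTTAGGG and ATGACTCTGTTAGGG from the example above, which translate to MLLG and MTLLG under the standard genetic code. The protein alignment `M-LLG` / `MTLLG` is written by hand here as an assumption, and `thread_gaps` is a hypothetical helper name.

```python
def thread_gaps(protein_row, dna):
    """Copy the gaps of one aligned protein row onto its coding DNA, codon by codon."""
    residues = sum(ch != "-" for ch in protein_row)
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    out, pos = [], 0
    for ch in protein_row:
        if ch == "-":
            out.append("---")            # one amino-acid gap = one whole codon
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)


print(thread_gaps("M-LLG", "ATGCTGTTAGGG"))     # ATG---CTGTTAGGG
print(thread_gaps("MTLLG", "ATGACTCTGTTAGGG"))  # ATGACTCTGTTAGGG
```

Because every gap is a multiple of three nucleotides, the reading frame is preserved, unlike the direct DNA alignment shown on the slide.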

Manual Alignment- software
GDE- The Genetic Data Environment (UNIX)
CINEMA- Java applet available from:
  – http://www.biochem.ucl.ac.uk
Seqapp/Seqpup- Mac/PC/UNIX available from:
  – http://iubio.bio.indiana.edu
SeAl for Macintosh available from:
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC available from:
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

ClustalW- Alternative 1
• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the third sequence) are treated as two single sequences

ClustalW- Alternative 2
• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out

ClustalW- Progression
• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence

Progressive alignment - step 1
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

Progressive alignment - step 2
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgacagcta
3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga
5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

Progressive alignment - step 3
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
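
The step above merges two already-aligned groups as if each were a single sequence: a gap opened in a group is inserted into every row of that group. The sketch below illustrates only that merging step. The path string is written by hand as an assumption (ClustalW finds it by dynamic programming), and `merge_groups` and its `M`/`A`/`B` operation codes are hypothetical names.

```python
def merge_groups(group_a, group_b, path):
    """Merge two aligned groups along a path of column operations.

    'M' pairs the next column of each group, 'A' takes a column from group_a
    only (gap in every row of group_b), 'B' does the reverse.
    """
    out_a = [[] for _ in group_a]
    out_b = [[] for _ in group_b]
    i = j = 0
    for op in path:
        if op not in ("M", "A", "B"):
            raise ValueError(f"unknown path operation: {op!r}")
        take_a = op != "B"
        take_b = op != "A"
        for out, row in zip(out_a, group_a):
            out.append(row[i] if take_a else "-")
        for out, row in zip(out_b, group_b):
            out.append(row[j] if take_b else "-")
        i += 1 if take_a else 0
        j += 1 if take_b else 0
    return ["".join(r) for r in out_a + out_b]


group_12 = ["gctcgatacgatacgatgactagcta",
            "gctcgatacaagacgatgac-agcta"]
group_34 = ["gctcgatacacgatgactagcta",
            "gctcgatacacgatgacgagcga"]
path = "M" * 13 + "A" * 3 + "M" * 10   # hand-written path, an assumption
for row in merge_groups(group_12, group_34, path):
    print(row)
```

With this path the three-column gap lands in both sequence 3 and sequence 4 at once, reproducing the four-row alignment shown on the slide.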

Progressive alignment - final step
1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-
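
A cheap sanity check on any finished alignment is that all rows have the same length and that removing the gaps gives back the input sequences unchanged. The sketch below applies that check to the five sequences of this worked example; `check_alignment` is a hypothetical helper, and passing the check says nothing about whether the alignment is biologically sensible.

```python
def check_alignment(unaligned, aligned):
    """Return the alignment length, or raise ValueError if the rows are inconsistent."""
    if len(unaligned) != len(aligned):
        raise ValueError("different number of input and aligned sequences")
    lengths = {len(row) for row in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned rows differ in length")
    for raw, row in zip(unaligned, aligned):
        if row.replace("-", "") != raw:
            raise ValueError("removing gaps does not recover the input sequence")
    return lengths.pop()


unaligned = ["gctcgatacgatacgatgactagcta",
             "gctcgatacaagacgatgacagcta",
             "gctcgatacacgatgactagcta",
             "gctcgatacacgatgacgagcga",
             "ctcgaacgatacgatgactagct"]
aligned = ["gctcgatacgatacgatgactagcta",
           "gctcgatacaagacgatgac-agcta",
           "gctcgatacacga---tgactagcta",
           "gctcgatacacga---tgacgagcga",
           "-ctcga-acgatacgatgactagct-"]
print(check_alignment(unaligned, aligned))  # 26
```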

ClustalW- Good points/Bad points
• Advantages
  – Speed
• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum
• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct this later in the procedure
  – Arbitrary alignment

Increasing the sophistication of the alignment process
• Should we treat all the sequences in the same way, even though some sequences are closely related and some sequences are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats
• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values
• Two penalties are set by the user (there are default values, but you should know that it is possible to change these)
• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment
• GEP- Gap Extension Penalty is the cost of extending this gap
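
How the two penalties combine can be shown with a small affine gap cost function. This is a sketch under stated assumptions: the extension penalty is charged for every gapped column including the first (some programs charge it only from the second column), terminal gaps are penalised like internal ones, and the penalty values used in the example are arbitrary, not ClustalW defaults. The function names are hypothetical.

```python
import re


def gap_cost(length, gop, gep):
    """Cost of one gap of `length` columns: one opening charge plus an extension charge per column."""
    if length <= 0:
        return 0.0
    return gop + gep * length


def row_gap_penalty(row, gop, gep):
    """Total gap cost of one aligned row, treating each run of '-' as a single gap."""
    return sum(gap_cost(len(m.group()), gop, gep) for m in re.finditer(r"-+", row))


# One gap of three columns costs far less than three separate one-column gaps.
print(row_gap_penalty("gctcgatacacga---tgactagcta", gop=10.0, gep=0.5))  # 11.5
print(row_gap_penalty("-ctcga-acgatacgatgactagct-", gop=10.0, gep=0.5))  # 31.5
```

Keeping the GOP much larger than the GEP is what makes an aligner prefer one long gap over several short ones.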


  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

ClustalW- Progression

bull The alignment is progressively built up in this way with each step being treated as a pairwise alignment sometimes with each member of a lsquopairrsquo having more than one sequence

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

bull Two penalties are set by the user (there are default values but you should know that it is possible to change these)

bull GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

bull GEP- Gap Extension Penalty is the cost of extending this gap

Position-Specific gap penaltiesbull Before any pair of (groups of) sequences are

aligned a table of GOPs are generated for each position in the two (sets of) sequences

bull The GOP is manipulated in a position-specific manner so that it can vary over the sequences

bull If there is a gap at a position the GOP and GEP penalties are lowered the other rules do not apply

bull This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps bull If there is no gap opened then the GOP is increased if the

position is within 8 residues of an existing gapbull This discourages gaps that are too close togetherbull At any position within a run of hydrophilic residues the GOP is

decreasedbull These runs usually indicate loop regions in protein structuresbull A run of 5 hydrophilic residues is considered to be a hydrophilic

stretchbull The default hydrophilic residues are

ndash D E G K N Q P R Sndash But this can be changed by the user

Divergent Sequencesbull The most divergent sequences (most different on average

from all of the other sequences) are usually the most difficult to align

bull It is sometimes better to delay their aligment until later (when the easier sequences have already been aligned)

bull The user has the choice of setting a cutoff (default is 40 identity)

bull This will delay the alignment until the others have been aligned

Advice on progressive alignmentbull Progressive alignment is a mathematical process that is completely

independent of biological realitybull Can be a very good estimatebull Can be an impossibly poor estimatebull Requires user input and skillbull Treat cautiouslybull Can be improved by eye (usually)bull Often helps to have colour-codingbull Depending on the use the user should be able to make a judgement

on those regions that are reliable or notbull For phylogeny reconstruction only use those positions whose

hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

bull It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGGATGACTCTGTTAGGG

ATG-CT--GTTAGGGATGACTCTGTTAGGG

The result might be highly-implausible and might not reflect what is known about biological processesIt is much more sensible to translate the sequences to their corresponding amino acid sequences align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment

Manual Alignment- softwareGDE- The Genetic Data Environment (UNIX)CINEMA- Java applet available from

ndash httpwwwbiochemuclacuk

SeqappSeqpup- MacPCUNIX available fromndash httpiubiobioindianaedu

SeAl for Macintosh available fromndash httpevolvezoooxacukSe-AlSe-Alhtml

BioEdit for PC available fromndash httpwwwmbioncsueduRNasePinfoprogramsBIOEDITbio

edithtml

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Progressive alignment - step 11 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

12345

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

bull Two penalties are set by the user (there are default values but you should know that it is possible to change these)

bull GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

bull GEP- Gap Extension Penalty is the cost of extending this gap

Position-Specific gap penaltiesbull Before any pair of (groups of) sequences are

aligned a table of GOPs are generated for each position in the two (sets of) sequences

bull The GOP is manipulated in a position-specific manner so that it can vary over the sequences

bull If there is a gap at a position the GOP and GEP penalties are lowered the other rules do not apply

bull This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps bull If there is no gap opened then the GOP is increased if the

position is within 8 residues of an existing gapbull This discourages gaps that are too close togetherbull At any position within a run of hydrophilic residues the GOP is

decreasedbull These runs usually indicate loop regions in protein structuresbull A run of 5 hydrophilic residues is considered to be a hydrophilic

stretchbull The default hydrophilic residues are

ndash D E G K N Q P R Sndash But this can be changed by the user

Divergent Sequencesbull The most divergent sequences (most different on average

from all of the other sequences) are usually the most difficult to align

bull It is sometimes better to delay their aligment until later (when the easier sequences have already been aligned)

bull The user has the choice of setting a cutoff (default is 40 identity)

bull This will delay the alignment until the others have been aligned

Advice on progressive alignmentbull Progressive alignment is a mathematical process that is completely

independent of biological realitybull Can be a very good estimatebull Can be an impossibly poor estimatebull Requires user input and skillbull Treat cautiouslybull Can be improved by eye (usually)bull Often helps to have colour-codingbull Depending on the use the user should be able to make a judgement

on those regions that are reliable or notbull For phylogeny reconstruction only use those positions whose

hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

bull It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGGATGACTCTGTTAGGG

ATG-CT--GTTAGGGATGACTCTGTTAGGG

The result might be highly-implausible and might not reflect what is known about biological processesIt is much more sensible to translate the sequences to their corresponding amino acid sequences align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment

Manual Alignment- softwareGDE- The Genetic Data Environment (UNIX)CINEMA- Java applet available from

ndash httpwwwbiochemuclacuk

SeqappSeqpup- MacPCUNIX available fromndash httpiubiobioindianaedu

SeAl for Macintosh available fromndash httpevolvezoooxacukSe-AlSe-Alhtml

BioEdit for PC available fromndash httpwwwmbioncsueduRNasePinfoprogramsBIOEDITbio

edithtml

  • Multiple Sequence Alignment James McInerney bioinf4biologists Feb 2009
  • Alignment can be easy or difficult
  • Homology Definition
  • Multiple Sequence Alignment- Goals
  • Multiple sequence alignments - problems
  • Slide 6
  • Slide 7
  • SSU rRNA
  • Alignment of 16S rRNA can be guided by secondary structure
  • Protein Alignment may be guided by Tertiary Structure Interactions
  • Multiple Sequence Alignment- Methods
  • Manual Alignment - reasons
  • Local minimum
  • Dotplots
  • Dotplot example sperm whale vs human myg
  • Slide 16
  • Slide 17
  • Dotplots in practice
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Example dotplot - repeated domains in Drosophila melanogaster SLIT protein
  • Dynamic programming
  • Progressive Alignment
  • Slide 23
  • ClustalW- Pairwise Alignments
  • Path Graph for aligning two sequences
  • Possible alignment
  • Alignment using this path
  • Optimal Alignment 1
  • Optimal Alignment 2
  • Alignment of 3 sequences
  • ClustalW- Guide Tree
  • Neighbor joining method
  • Distance Matrix
  • First Step
  • Calculation of New Distances
  • Next Cycle
  • Penultimate Cycle
  • Last Joining
  • Unrooted Neighbor-Joining Tree
  • Multiple Alignment- First pair
  • ClustalW- Decision time
  • ClustalW- Alternative 1
  • ClustalW- Progression
  • Progressive alignment - step 1
  • Progressive alignment - step 2
  • Progressive alignment - step 3
  • Progressive alignment - final step
  • ClustalW-Good pointsBad points
  • ClustalW-Local Minimum
  • Increasing the sophistication of the alignment process
  • Slide 51
  • ClustalW- Caveats
  • ClustalW- User-supplied values
  • Position-Specific gap penalties
  • Discouraging too many gaps
  • Divergent Sequences
  • Advice on progressive alignment
  • Alignment of protein-coding DNA sequences
  • Manual Alignment- software

Progressive alignment - step 21 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgacagcta

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

5 ctcgaacgatacgatgactagct

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

12345

Progressive alignment - step 31 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta

4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

12345

Progressive alignment - final step1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta

2 gctcgatacaagacgatgac-agcta

3 gctcgatacacga---tgactagcta

4 gctcgatacacga---tgacgagcga

5 -ctcga-acgatacgatgactagct-

12345

ClustalW-Good pointsBad points

bull Advantagesndash Speed

bull Disadvantagesndash No objective functionndash No way of quantifying whether or not the alignment is good

ndash No way of knowing if the alignment is lsquocorrectrsquo

ClustalW-Local Minimumbull Potential problems

ndash Local minimum problem If an error is introduced early in the alignment process it is impossible to correct this later in the procedure

ndash Arbitrary alignment

Increasing the sophistication of the alignment process

bull Should we treat all the sequences in the same way - even though some sequences are closely-related and some sequences are distant relatives

bull Should we treat all positions in the sequences as though they were the same - even though they might have different functions and different locations in the 3-dimensional structure

ClustalW- Caveatsbull Sequence weightingbull Varying substitution matricesbull Residue-specific gap penalties and reduced penalties

in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions

bull Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

bull Two penalties are set by the user (there are default values but you should know that it is possible to change these)

bull GOP- Gap Opening Penalty is the cost of opening a gap in an alignment

bull GEP- Gap Extension Penalty is the cost of extending this gap

Position-Specific gap penaltiesbull Before any pair of (groups of) sequences are

aligned a table of GOPs are generated for each position in the two (sets of) sequences

bull The GOP is manipulated in a position-specific manner so that it can vary over the sequences

bull If there is a gap at a position the GOP and GEP penalties are lowered the other rules do not apply

bull This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps bull If there is no gap opened then the GOP is increased if the

position is within 8 residues of an existing gapbull This discourages gaps that are too close togetherbull At any position within a run of hydrophilic residues the GOP is

decreasedbull These runs usually indicate loop regions in protein structuresbull A run of 5 hydrophilic residues is considered to be a hydrophilic

stretchbull The default hydrophilic residues are

ndash D E G K N Q P R Sndash But this can be changed by the user

Divergent Sequencesbull The most divergent sequences (most different on average

from all of the other sequences) are usually the most difficult to align

bull It is sometimes better to delay their aligment until later (when the easier sequences have already been aligned)

bull The user has the choice of setting a cutoff (default is 40 identity)

bull This will delay the alignment until the others have been aligned

Advice on progressive alignmentbull Progressive alignment is a mathematical process that is completely

independent of biological realitybull Can be a very good estimatebull Can be an impossibly poor estimatebull Requires user input and skillbull Treat cautiouslybull Can be improved by eye (usually)bull Often helps to have colour-codingbull Depending on the use the user should be able to make a judgement

on those regions that are reliable or notbull For phylogeny reconstruction only use those positions whose

hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

bull It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGGATGACTCTGTTAGGG

ATG-CT--GTTAGGGATGACTCTGTTAGGG

The result might be highly-implausible and might not reflect what is known about biological processesIt is much more sensible to translate the sequences to their corresponding amino acid sequences align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment

Manual Alignment- softwareGDE- The Genetic Data Environment (UNIX)CINEMA- Java applet available from

ndash httpwwwbiochemuclacuk

SeqappSeqpup- MacPCUNIX available fromndash httpiubiobioindianaedu

SeAl for Macintosh available fromndash httpevolvezoooxacukSe-AlSe-Alhtml

BioEdit for PC available fromndash httpwwwmbioncsueduRNasePinfoprogramsBIOEDITbio

edithtml

Progressive alignment - step 3

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta

+

3 gctcgatacacgatgactagcta
4 gctcgatacacgatgacgagcga

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

Progressive alignment - final step

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga

+

5 ctcgaacgatacgatgactagct

1 gctcgatacgatacgatgactagcta
2 gctcgatacaagacgatgac-agcta
3 gctcgatacacga---tgactagcta
4 gctcgatacacga---tgacgagcga
5 -ctcga-acgatacgatgactagct-
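The progressive steps above have two properties that are easy to check mechanically: every row of the result has the same length, and stripping the gaps from a row gives back the sequence that went in. The following is a minimal Python sketch of that check using the five sequences from the slides. The names `degap` and `check_alignment` are illustrative only and are not part of ClustalW.

```python
# Ungapped input sequences, copied from the slides (sequence 2 with its gap removed).
ORIGINALS = {
    "1": "gctcgatacgatacgatgactagcta",
    "2": "gctcgatacaagacgatgacagcta",
    "3": "gctcgatacacgatgactagcta",
    "4": "gctcgatacacgatgacgagcga",
    "5": "ctcgaacgatacgatgactagct",
}

# Final alignment, copied from the slide.
ALIGNED = {
    "1": "gctcgatacgatacgatgactagcta",
    "2": "gctcgatacaagacgatgac-agcta",
    "3": "gctcgatacacga---tgactagcta",
    "4": "gctcgatacacga---tgacgagcga",
    "5": "-ctcga-acgatacgatgactagct-",
}


def degap(row):
    """Remove gap characters from an aligned row."""
    return row.replace("-", "")


def check_alignment(aligned, originals):
    """Return the alignment width, or raise ValueError if the alignment is inconsistent."""
    widths = {len(row) for row in aligned.values()}
    if len(widths) != 1:
        raise ValueError("rows have different lengths")
    for name, row in aligned.items():
        if degap(row) != originals[name]:
            raise ValueError(f"row {name} does not match its input sequence")
    return widths.pop()


WIDTH = check_alignment(ALIGNED, ORIGINALS)
```

A check like this only confirms that the alignment is well formed. It says nothing about whether the columns are homologous.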

ClustalW- Good points/Bad points

• Advantages
  – Speed

• Disadvantages
  – No objective function
  – No way of quantifying whether or not the alignment is good
  – No way of knowing if the alignment is ‘correct’

ClustalW- Local Minimum

• Potential problems
  – Local minimum problem: if an error is introduced early in the alignment process, it is impossible to correct it later in the procedure
  – Arbitrary alignment

Increasing the sophistication of the alignment process

• Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?

• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change them)

• GOP - Gap Opening Penalty: the cost of opening a gap in an alignment

• GEP - Gap Extension Penalty: the cost of extending this gap
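To see how the two penalties interact, here is a small Python sketch of an affine gap cost. It assumes the convention cost = GOP + GEP x (gap length). Other programs charge GEP only for positions after the first, so treat the formula as an assumption. The values 10 and 0.5 are made up for illustration and are not claimed to be ClustalW defaults. The function names are hypothetical.

```python
import re


def gap_cost(length, gop, gep):
    """Cost of a single gap spanning `length` columns (assumed: GOP + GEP * length)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gop + gep * length


def total_gap_cost(row, gop, gep):
    """Sum the cost of every run of '-' in one aligned row."""
    return sum(gap_cost(len(m.group()), gop, gep) for m in re.finditer(r"-+", row))


# One gap of three columns (row 3 of the worked example) ...
ONE_LONG_GAP = total_gap_cost("gctcgatacacga---tgactagcta", gop=10, gep=0.5)
# ... versus three separate gaps of one column each (row 5).
THREE_SHORT_GAPS = total_gap_cost("-ctcga-acgatacgatgactagct-", gop=10, gep=0.5)
```

With a GOP that is large relative to the GEP, one long gap is much cheaper than several short ones covering the same number of columns. This sketch charges terminal gaps like internal ones, which is a simplification.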

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
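The position-specific rules above can be sketched as a table of GOP values, one per alignment column. This is an illustrative Python sketch, not ClustalW's implementation. The multipliers (`gap_factor`, `near_factor`, `hydro_factor`) are invented placeholders. Detecting hydrophilic runs row by row, and multiplying the near-gap and hydrophilic adjustments when both apply, are assumptions. The matching reduction of the GEP at gap positions is not modelled.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default residue set listed on the slide


def hydrophilic_columns(row, run=5):
    """Column indices lying inside a run of at least `run` hydrophilic residues."""
    cols = set()
    start = None
    for i, ch in enumerate(row + "#"):  # "#" is a sentinel that closes a final run
        if ch in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                cols.update(range(start, i))
            start = None
    return cols


def gop_table(rows, base_gop, gap_factor=0.5, near_factor=2.0,
              hydro_factor=0.5, window=8, run=5):
    """Per-column gap opening penalties for a group of aligned rows of equal length."""
    ncol = len(rows[0])
    gap_cols = {c for c in range(ncol) if any(r[c] == "-" for r in rows)}
    hydro_cols = set()
    for r in rows:
        hydro_cols |= hydrophilic_columns(r, run)

    table = []
    for c in range(ncol):
        if c in gap_cols:
            # A gap already exists here: lower the GOP and skip the other rules.
            table.append(base_gop * gap_factor)
            continue
        gop = base_gop
        if any(abs(c - g) <= window for g in gap_cols):
            gop *= near_factor   # too close to an existing gap: discourage
        if c in hydro_cols:
            gop *= hydro_factor  # likely loop region: encourage
        table.append(gop)
    return table
```

For example, `gop_table(["LLLLDEGKNLLLL"], 10.0)` lowers the penalty only over the five-residue stretch `DEGKN`.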

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment until the others have been aligned
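As a rough illustration of the cutoff, the Python sketch below flags sequences whose mean percent identity to all the others falls below a threshold. It follows the slide's wording ("most different on average") and the 40% figure. Computing identity over the columns where neither sequence has a gap is an assumption. The function names and toy sequences are made up, and ClustalW's own bookkeeping may differ.

```python
def percent_identity(a, b):
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def sequences_to_delay(rows, cutoff=40.0):
    """Names of sequences whose mean identity to all others is below `cutoff`."""
    delayed = []
    for name, row in rows.items():
        scores = [percent_identity(row, other)
                  for other_name, other in rows.items() if other_name != name]
        if scores and sum(scores) / len(scores) < cutoff:
            delayed.append(name)
    return delayed


# Toy, equal-length sequences (hypothetical data, already pairwise comparable).
TOY = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAC",
    "c": "ACGTACGTTT",
    "d": "TGCATGCATG",
}
DELAYED = sequences_to_delay(TOY)
```

Here only sequence `d` is flagged, so it would be added after `a`, `b` and `c` have been aligned to each other.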

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable
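One crude, mechanical way to start restricting an alignment to safer positions is to drop every column that contains a gap. The Python sketch below does only that, using the five-sequence alignment from the worked example. A gap-free column is not automatically an unimpeachable hypothesis of homology, so this is a first filter before inspection by eye, not a substitute for it. The function name is hypothetical.

```python
ALIGNMENT = [
    "gctcgatacgatacgatgactagcta",
    "gctcgatacaagacgatgac-agcta",
    "gctcgatacacga---tgactagcta",
    "gctcgatacacga---tgacgagcga",
    "-ctcga-acgatacgatgactagct-",
]


def gap_free_columns(rows):
    """Return the rows restricted to columns in which no sequence has a gap."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("rows must all have the same length")
    keep = [c for c in range(len(rows[0])) if all(r[c] != "-" for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]


FILTERED = gap_free_columns(ALIGNMENT)
```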

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGGATGACTCTGTTAGGG
ATG-CT--GTTAGGGATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
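The last step of that recipe, putting the gaps back into the DNA, can be sketched in a few lines of Python. The function below threads an ungapped coding sequence onto its already-aligned protein row, writing three gap characters for each gap in the protein. The function name, the second DNA sequence and the protein alignment are hypothetical. The sketch assumes the DNA is in frame with no stop codon, and it does not verify that each codon actually encodes the residue it is paired with.

```python
def back_translate_gaps(protein_row, dna):
    """Insert codon-sized gaps into `dna` wherever `protein_row` has a gap."""
    n_residues = sum(ch != "-" for ch in protein_row)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be three times the number of residues")
    codons = []
    pos = 0
    for ch in protein_row:
        if ch == "-":
            codons.append("---")          # one protein gap = one codon-sized gap
        else:
            codons.append(dna[pos:pos + 3])
            pos += 3
    return "".join(codons)


# First DNA sequence from the slide; it translates to MLLGMTLLG.
DNA_1 = "ATGCTGTTAGGGATGACTCTGTTAGGG"
# Hypothetical second sequence lacking one codon; it translates to MLGMTLLG.
DNA_2 = "ATGCTGGGGATGACTCTGTTAGGG"

# Hypothetical protein alignment of the two translations.
ALIGNED_DNA_1 = back_translate_gaps("MLLGMTLLG", DNA_1)
ALIGNED_DNA_2 = back_translate_gaps("ML-GMTLLG", DNA_2)
```

Because gaps are only ever inserted in multiples of three, the reading frame is preserved, which is exactly what a direct DNA alignment cannot guarantee.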

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)

CINEMA - Java applet available from
  – http://www.biochem.ucl.ac.uk

Seqapp/Seqpup - Mac/PC/UNIX available from
  – http://iubio.bio.indiana.edu

SeAl for Macintosh available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change them)
• GOP- Gap Opening Penalty is the cost of opening a gap in an alignment
• GEP- Gap Extension Penalty is the cost of extending this gap
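As a rough illustration of how the two penalties interact, the sketch below scores a single gap under an affine scheme. The function name `gap_cost`, the numeric values, and the convention that every gapped column pays the extension penalty are assumptions made for this example; they are not taken from the slides or from the ClustalW source.

```python
def gap_cost(length, gop, gep):
    """Cost of one gap spanning `length` columns under an affine scheme.

    Assumption: the opening penalty is charged once and every gapped
    column also pays the extension penalty. Other conventions exist.
    """
    if length <= 0:
        return 0.0
    return gop + length * gep


# Illustrative penalty values only.
GOP = 10.0
GEP = 0.2

one_long_gap = gap_cost(3, GOP, GEP)          # a single gap of three columns
three_short_gaps = 3 * gap_cost(1, GOP, GEP)  # three separate one-column gaps

print(one_long_gap, three_short_gaps)
```

With a high GOP and a low GEP, one long gap is much cheaper than several short ones, which is why changing these two values can alter the shape of the resulting alignment.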

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences
• The GOP is manipulated in a position-specific manner so that it can vary over the sequences
• If there is a gap at a position, the GOP and GEP are lowered; the other rules do not apply
• This makes gaps more likely at positions where gaps already exist

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap
• This discourages gaps that are too close together
• At any position within a run of hydrophilic residues, the GOP is decreased
• These runs usually indicate loop regions in protein structures
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch
• The default hydrophilic residues are
  – D E G K N Q P R S
  – But this can be changed by the user
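The position-specific rules can be sketched as a small function that builds a GOP table for one already-gapped sequence. This is a minimal sketch, not ClustalW's implementation: the slides give only the direction of each adjustment, so the three scaling factors are hypothetical, and letting the "near a gap" and "hydrophilic run" rules multiply together is also an assumption.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide
GAP_WINDOW = 8                  # "within 8 residues of an existing gap"
MIN_RUN = 5                     # run length that counts as a hydrophilic stretch

# Hypothetical factors: the slides state directions, not magnitudes.
LOWER_AT_GAP = 0.5
RAISE_NEAR_GAP = 2.0
LOWER_HYDROPHILIC = 0.67


def hydrophilic_positions(seq):
    """Indices that lie inside a run of at least MIN_RUN hydrophilic residues."""
    inside = set()
    start = None
    for i, res in enumerate(seq + "#"):  # sentinel closes a trailing run
        if res in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= MIN_RUN:
                inside.update(range(start, i))
            start = None
    return inside


def gop_table(aligned_seq, base_gop):
    """Position-specific gap opening penalties for one gapped sequence."""
    gaps = [i for i, c in enumerate(aligned_seq) if c == "-"]
    hydro = hydrophilic_positions(aligned_seq)
    table = []
    for i, c in enumerate(aligned_seq):
        if c == "-":
            # Existing gap: penalty lowered, the other rules do not apply.
            table.append(base_gop * LOWER_AT_GAP)
            continue
        gop = base_gop
        if any(abs(i - g) <= GAP_WINDOW for g in gaps):
            gop *= RAISE_NEAR_GAP      # discourage gaps too close together
        if i in hydro:
            gop *= LOWER_HYDROPHILIC   # likely loop region
        table.append(gop)
    return table
```

Calling `gop_table("LLLLLLLLLLDEGKNLLLLLLLLLL", 10.0)` leaves the penalty unchanged in the hydrophobic flanks and lowers it across the five-residue hydrophilic stretch.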

Divergent Sequences

• The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned)
• The user has the choice of setting a cutoff (default is 40% identity)
• This will delay the alignment of these sequences until the others have been aligned
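One way to picture the cutoff is to compute each sequence's mean percent identity to all the others and set aside those that fall below the threshold. The sketch below assumes a precomputed table of pairwise identities; the sequence names and numbers are invented for illustration, and using the mean follows the wording of the slide rather than any particular program's rule.

```python
def mean_identity(identity, name):
    """Average percent identity of `name` to every other sequence."""
    others = identity[name]
    return sum(others.values()) / len(others)


def split_by_divergence(identity, cutoff=40.0):
    """Return (align_first, delayed) lists of sequence names."""
    align_first, delayed = [], []
    for name in identity:
        if mean_identity(identity, name) < cutoff:
            delayed.append(name)
        else:
            align_first.append(name)
    return align_first, delayed


# Hypothetical pairwise percent identities.
identity = {
    "A": {"B": 85.0, "C": 80.0, "D": 30.0},
    "B": {"A": 85.0, "C": 78.0, "D": 28.0},
    "C": {"A": 80.0, "B": 78.0, "D": 35.0},
    "D": {"A": 30.0, "B": 28.0, "C": 35.0},
}

align_first, delayed = split_by_divergence(identity)
print(align_first, delayed)
```

Here only sequence D has a mean identity under 40%, so it would be held back until A, B and C have been aligned.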

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality
• Can be a very good estimate
• Can be an impossibly poor estimate
• Requires user input and skill
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes

ATGCTGTTAGGG
ATGACTCTGTTAGGG

ATG-CT--GTTAGGG
ATGACTCTGTTAGGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences, and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
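The last step, putting the gaps back into the DNA, can be sketched as follows. The helper name `thread_codons` is hypothetical, and the protein alignment `M-LLG` / `MTLLG` is an assumed result for the two example sequences above (which translate to MLLG and MTLLG); the protein alignment itself would come from an alignment program.

```python
def thread_codons(aligned_protein, dna):
    """Insert gaps into a coding DNA sequence to match a protein alignment.

    Each residue in `aligned_protein` consumes one codon from `dna`;
    each '-' becomes a three-nucleotide gap '---'.
    """
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out = []
    used = 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            if used >= len(codons):
                raise ValueError("protein row is longer than the DNA allows")
            out.append(codons[used])
            used += 1
    if used != len(codons):
        raise ValueError("DNA has codons left over after threading")
    return "".join(out)


# Assumed protein alignment of the two example sequences.
row1 = thread_codons("M-LLG", "ATGCTGTTAGGG")
row2 = thread_codons("MTLLG", "ATGACTCTGTTAGGG")
print(row1)
print(row2)
```

Because gaps are only ever inserted in whole codons, the reading frame is preserved, unlike in the direct DNA alignment shown above.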

Manual Alignment- software

• GDE- The Genetic Data Environment (UNIX)
• CINEMA- Java applet available from
  – http://www.biochem.ucl.ac.uk
• Seqapp/Seqpup- Mac/PC/UNIX available from
  – http://iubio.bio.indiana.edu
• Se-Al for Macintosh available from
  – http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
• BioEdit for PC available from
  – http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html
