alignment multiple sequence - - university of newcastle upon tyne

Multiple Sequence Alignment


Page 1: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Multiple Sequence Alignment

Page 2: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Alignment can be easy or difficult

Easy

GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
******** ********** *****

Difficult due to insertions or deletions (indels)

TTGACATG CCGGGG---A AACCG
TTGACATG CCGGTG--GT AAGCC
TTGACATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
******** ?????????? *****

Page 3: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Homology: Definition

• Homology: similarity that is the result of inheritance from a common ancestor. Identification and analysis of homologies is central to phylogenetic systematics.
• An alignment is a hypothesis of positional homology between bases/amino acids.

Page 4: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Multiple Sequence Alignment-Goals

• To generate a concise, information-rich summary of sequence data.
• Sometimes used to illustrate the dissimilarity between a group of sequences.
• Alignments can be treated as models that can be used to test hypotheses.
• Does this model of events accurately reflect known biological evidence?

Page 5: Alignment Multiple Sequence - - University of Newcastle Upon Tyne
Page 6: Alignment Multiple Sequence - - University of Newcastle Upon Tyne
Page 7: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Alignment of 16S rRNA can be guided by secondary structure

<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **

Alignment of 16S rRNA sequences from different bacteria

Page 8: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Protein Alignment may be guided by Tertiary Structure Interactions

[Figure: Homo sapiens DjlA protein; Escherichia coli DjlA protein]

Page 9: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Multiple Sequence Alignment-Methods

– 3 main methods of alignment:
• Manual
• Automatic
• Combined

Page 10: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Manual Alignment - reasons

• Might be carried out because:
– Alignment is easy.
– There is some extraneous information (structural).
– Automated alignment methods have encountered the local minimum problem.
– An automated alignment method can be “improved”.

Page 11: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Dynamic programming

2 methods:
• Dynamic programming
– Consider 2 protein sequences of 100 amino acids in length.
– If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences... etc.
– More time than the universe has existed to align 20 sequences exhaustively.
• Progressive alignment
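The scaling argument above can be checked with a few lines of arithmetic. This is only a sketch under the slide's own assumption that aligning n sequences of length 100 exhaustively costs 100^n seconds; the age of the universe is taken as roughly 13.8 billion years, and the function name is made up for illustration.

```python
# Sketch of the slide's cost model: 100**n seconds for n sequences of length 100.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR  # roughly 4.35e17 seconds


def exhaustive_seconds(n_sequences, length=100):
    """Cost of exhaustive alignment under the slide's length**n assumption."""
    return length ** n_sequences


# Smallest number of sequences whose cost exceeds the age of the universe.
first_too_slow = next(n for n in range(2, 21)
                      if exhaustive_seconds(n) > AGE_OF_UNIVERSE_S)

for n in (2, 3, 4, 20):
    print(n, exhaustive_seconds(n))
print("first n slower than the age of the universe:", first_too_slow)
```

Under this model the cost passes the age of the universe well before 20 sequences, which is why a heuristic such as progressive alignment is used instead.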

Page 12: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Progressive Alignment

• Devised by Feng and Doolittle in 1987.
• Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment.
• Requires (n-1)+(n-2)+(n-3)+...+1 = n(n-1)/2 pairwise alignments as a starting point.
• Most successful implementation is Clustal (Des Higgins).

Page 13: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Overview of ClustalW Procedure

Quick pairwise alignment: calculate distance matrix

Hbb_Human  1    -
Hbb_Horse  2   .17   -
Hba_Human  3   .59  .60   -
Hba_Horse  4   .59  .59  .13   -
Myg_Whale  5   .77  .77  .75  .75   -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure labels: alpha-helices; CLUSTAL W]

Page 14: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Pairwise Alignments

• First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+1 = n(n-1)/2 possibilities.
• Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments.
• Generate a distance matrix.
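The three steps above can be sketched as follows. This is a simplified illustration, not ClustalW's own code: it assumes each pair has already been aligned to equal length, it uses the plain fraction of mismatched columns as the ‘distance’, and the toy sequences and function names are made up.

```python
from itertools import combinations


def n_pairs(n):
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def p_distance(a, b):
    """Fraction of differing columns between two aligned sequences.

    Columns containing a gap in either sequence are ignored.
    """
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in cols) / len(cols)


def distance_matrix(seqs):
    """Map each unordered pair of names to its distance."""
    return {frozenset((p, q)): p_distance(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}


# Hypothetical toy data: equal-length, gap-free sequences.
toy = {"s1": "GATTC", "s2": "GATTA", "s3": "CATTA"}
matrix = distance_matrix(toy)
print(n_pairs(len(toy)), matrix[frozenset(("s1", "s2"))])
```

A real implementation would first run a pairwise alignment for every pair and would typically correct the raw distances; the sketch only shows how the pair count and the matrix fit together.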

Page 15: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Path Graph for aligning two sequences.

Page 16: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Possible alignment

Scoring Scheme:
• Match: +1
• Mismatch: 0
• Indel: -1

Scores along this path: 1, 1, 0, 1, 0, -1

Score for this path = 2

Page 17: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Alignment using this path

GATTC-
GAATTC

Scores along this path: 1, 1, 0, 1, 0, -1

Page 18: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Optimal Alignment 1

Scores along this path: 1, 1, -1, 1, 1, 1

Alignment score: 4

Alignment using this path

GA-TTC
GAATTC

Page 19: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Optimal Alignment 2

Scores along this path: 1, -1, 1, 1, 1, 1

Alignment score: 4

Alignment using this path

G-ATTC
GAATTC
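The path scores on the preceding slides can be reproduced with a short sketch. It assumes the slides' scoring scheme (match +1, mismatch 0, indel -1) and uses a standard global dynamic programming recurrence (Needleman-Wunsch style) to find the best achievable score; the function names are illustrative only.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Sum the column scores of a finished pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def best_score(s, t, match=1, mismatch=0, indel=-1):
    """Best global alignment score, filled in row by row over the path graph."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * indel]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


print(score_alignment("GATTC-", "GAATTC"))  # the 'possible' alignment
print(score_alignment("GA-TTC", "GAATTC"))  # optimal alignment 1
print(score_alignment("G-ATTC", "GAATTC"))  # optimal alignment 2
print(best_score("GATTC", "GAATTC"))
```

Two different alignments reach the same best score here, which illustrates that an optimal alignment need not be unique.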

Page 20: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Guide Tree

• Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances.
• This guide tree gives the order in which the progressive alignment will be carried out.

Page 21: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Neighbor joining method

• The neighbor joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined.
• It is based on the minimum evolution principle.
• One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree.

[Figure: neighbors A and B connected by a single node, Node 1]

Page 22: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

What is required for the Neighbour joining method?

Distance matrix

PAM        Spinach    Rice  Mosquito  Monkey   Human
Spinach        0.0    84.9     105.6    90.8    86.3
Rice          84.9     0.0     117.8   122.4   122.6
Mosquito     105.6   117.8       0.0    84.7    80.8
Monkey        90.8   122.4      84.7     0.0     3.3
Human         86.3   122.6      80.8     3.3     0.0

Page 23: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

First Step

PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.

[Figure: tree with leaves Monkey, Human, Spinach, Mosquito and Rice; new node Mon-Hum]

Page 24: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Calculation of New Distances

After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

[Figure: tree with leaves Monkey, Human and Spinach; node Mon-Hum]

Page 25: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

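Next Cycle

PAM        Spinach    Rice  Mosquito  MonHum
Spinach        0.0    84.9     105.6    88.6
Rice          84.9     0.0     117.8   122.5
Mosquito     105.6   117.8       0.0    82.8
MonHum        88.6   122.5      82.8     0.0

The minimum is now 82.8 (Mosquito - MonHum), so these are joined as Mos-(Mon-Hum).

[Figure: tree with leaves Human, Mosquito, Monkey, Spinach and Rice; nodes Mon-Hum and Mos-(Mon-Hum)]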

Page 26: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

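Penultimate Cycle

PAM        Spinach    Rice  MosMonHum
Spinach        0.0    84.9       97.1
Rice          84.9     0.0      120.2
MosMonHum     97.1   120.2        0.0

The minimum is now 84.9 (Spinach - Rice), so these are joined as Spin-Rice.

[Figure: tree with leaves Human, Mosquito, Monkey, Spinach and Rice; nodes Mon-Hum, Mos-(Mon-Hum) and Spin-Rice]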

Page 27: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Last Joining

PAM        SpinRice  MosMonHum
SpinRice        0.0      108.7
MosMonHum     108.7        0.0

[Figure: tree with leaves Human, Mosquito, Monkey, Spinach and Rice; nodes Mon-Hum, Mos-(Mon-Hum), Spin-Rice and (Spin-Rice)-(Mos-(Mon-Hum))]
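The joining cycles above can be reproduced with a short sketch. Note that it implements exactly the simplified procedure shown on these slides (join the closest pair, then take a simple average of distances), which is closer to average-linkage clustering than to the full neighbor-joining criterion of Saitou and Nei. The function name and the cluster label format are made up for illustration.

```python
from itertools import combinations


def average_join(names, dist):
    """Repeatedly join the closest pair of clusters.

    dist maps frozenset({a, b}) to a distance. The distance from a new
    cluster to any other cluster is the simple average of the distances
    from its two members. Returns the merges as (a, b, distance) tuples.
    """
    clusters = list(names)
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        new = f"({a},{b})"
        for c in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (d[frozenset((a, c))]
                                          + d[frozenset((b, c))]) / 2
        clusters = [c for c in clusters if c not in (a, b)] + [new]
    return merges


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
rows = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
dist = {frozenset((a, b)): rows[i][j]
        for i, a in enumerate(names)
        for j, b in enumerate(names) if i < j}

merges = average_join(names, dist)
for a, b, value in merges:
    print(f"join {a} + {b} at {value:.2f}")
```

The join order matches the slides (Monkey with Human, then Mosquito, then Spinach with Rice, then the final join). The sketch keeps unrounded intermediate values, so the last digit of later distances can differ slightly from the tables, which carry values rounded to one decimal place.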

Page 28: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice and Spinach]

Page 29: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Multiple Alignment- First pair

• Align the two most closely-related sequences first.
• This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.

Page 30: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Decision time

• Next, consult the guide tree to see what alignment is performed next.
– Align a third sequence to the first two
Or
– Align two entirely different sequences to each other.

[Figure: Option 1 and Option 2]

Page 31: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Alternative 1

• If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, each of these two entities (the aligned pair and the third sequence) is treated as a single sequence.

Page 32: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Alternative 2

• If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

Page 33: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Progression

• The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.

Page 34: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW-Good points/Bad points

• Advantages:
– Speed.
• Disadvantages:
– No objective function.
– No way of quantifying whether or not the alignment is good.
– No way of knowing if the alignment is ‘correct’.

Page 35: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW-Local Minimum

• Potential problems:
– Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure.
– Arbitrary alignment.

Page 36: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Increasing the sophistication of the alignment process.

• Should we treat all the sequences in the same way, even though some sequences are closely-related and some sequences are distant relatives?
• Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?

Page 37: Alignment Multiple Sequence - - University of Newcastle Upon Tyne
Page 38: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Caveats

• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.
• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.

Page 39: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Sequence weighting

• First we must be able to categorise sequences according to whether they have close relatives or if they are distantly-related to the other sequences (calculated directly from the guide tree).
• Weights are normalised, so that the largest weight is 1.
• Closely-related sequences have a large amount of the same information, so they are downweighted.
• These weights are multiplication factors.
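The normalisation and the use of weights as multiplication factors can be sketched as below. The raw weights are hypothetical numbers chosen only for illustration (they are not taken from a real guide tree), and multiplying a residue-pair score by the weights of the two sequences involved is an assumption about how the factors are applied.

```python
def normalise_weights(raw):
    """Scale weights so that the largest weight is exactly 1."""
    top = max(raw.values())
    return {name: w / top for name, w in raw.items()}


def weighted_pair_score(score, weight_a, weight_b):
    """Apply both sequence weights as multiplication factors (assumed usage)."""
    return score * weight_a * weight_b


# Hypothetical raw weights: the two close relatives share information,
# so each of them starts with a smaller weight than the distant sequence.
raw = {"Hbb_Human": 0.2, "Hbb_Horse": 0.2, "Myg_Whale": 0.4}
weights = normalise_weights(raw)
print(weights)
print(weighted_pair_score(4, weights["Hbb_Human"], weights["Myg_Whale"]))
```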

Page 40: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- User-supplied values

• Two penalties are set by the user (there are default values, but you should know that it is possible to change these).
• GOP - Gap Opening Penalty is the cost of opening a gap in an alignment.
• GEP - Gap Extension Penalty is the cost of extending this gap.

Page 41: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW- Manipulation of penalties

• Although GOP and GEP are set by the user, the program attempts to manipulate these according to the following criteria:
– Dependence on the weight matrix.
– Dependence on the similarity of the sequences: the percent identity of the sequences is used as a scaling factor to increase the GOP for closely-related sequences and decrease it for more distantly-related sequences.

Page 42: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

ClustalW

• Dependence on the length of the sequences:
– The program uses the formula
– GOP -> (GOP + log(MIN(N,M))) * (Average residue mismatch score) * (percent identity scaling factor)
– Here N and M are the lengths of the two sequences. The logarithm of the length of the shortest sequence is used as a scaling factor to increase the GOP with increasing length.
• Dependence on the difference in lengths of the two sequences:
– GEP -> GEP*(1.0+|log(N/M)|)
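The two adjustments above translate directly into code. This is a sketch of the formulas as written on the slide, not ClustalW's source: the function and parameter names are made up, the natural logarithm is an assumption (the slide does not state the base), and the mismatch score and identity scaling factor are simply passed in as numbers.

```python
import math


def adjusted_gop(gop, n, m, avg_mismatch, identity_scale):
    """GOP -> (GOP + log(min(N, M))) * mismatch score * identity scaling."""
    return (gop + math.log(min(n, m))) * avg_mismatch * identity_scale


def adjusted_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N / M)|)."""
    return gep * (1.0 + abs(math.log(n / m)))


# Equal lengths leave the GEP unchanged; unequal lengths raise it.
print(adjusted_gep(1.0, 100, 100))
print(adjusted_gep(1.0, 100, 200))
# Longer sequences raise the GOP (illustrative input values only).
print(adjusted_gop(10.0, 100, 120, 1.0, 1.0))
print(adjusted_gop(10.0, 400, 450, 1.0, 1.0))
```

Because of the absolute value, the GEP adjustment is the same whichever of the two sequences is longer.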

Page 43: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Position-Specific gap penalties

• Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences.
• The GOP is manipulated in a position-specific manner, so that it can vary over the sequences.
• If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply.
• This makes gaps more likely at positions where gaps already exist.

Page 44: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Discouraging too many gaps

• If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap.
• This discourages gaps that are too close together.
• At any position within a run of hydrophilic residues, the GOP is decreased.
• These runs usually indicate loop regions in protein structures.
• A run of 5 hydrophilic residues is considered to be a hydrophilic stretch.
• The default hydrophilic residues are:
– D, E, G, K, N, Q, P, R, S
– But this can be changed by the user.
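Finding the hydrophilic stretches described above can be sketched as follows. The residue set and the run length of 5 come from the slide; treating a run of 5 or more as a stretch, the function name and the example sequence are assumptions made for illustration.

```python
HYDROPHILIC = frozenset("DEGKNQPRS")  # default set listed on the slide


def hydrophilic_positions(seq, min_run=5, residues=HYDROPHILIC):
    """Flag every position that lies inside a run of >= min_run hydrophilic residues."""
    flagged = [False] * len(seq)
    start = None
    # The trailing '#' is a sentinel that closes a run ending at the last residue.
    for i, aa in enumerate(seq + "#"):
        if aa in residues:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for k in range(start, i):
                    flagged[k] = True
            start = None
    return flagged


# Hypothetical sequence: 'KDEGS' is a stretch, the final 'DE' is too short.
flags = hydrophilic_positions("AAKDEGSLLDE")
print("".join("h" if f else "." for f in flags))
```

Positions flagged this way are the ones where the GOP would be decreased; passing a different `residues` set mirrors the user being able to change the default list.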

Page 45: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Divergent Sequences

• The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align.
• It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned).
• The user has the choice of setting a cutoff (default is 40% identity).
• This will delay the alignment of sequences that fall below the cutoff until the others have been aligned.

Page 46: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Advice on progressive alignment

• Progressive alignment is a mathematical process that is completely independent of biological reality.
• Can be a very good estimate.
• Can be an impossibly poor estimate.
• Requires user input and skill.
• Treat cautiously.
• Can be improved by eye (usually).
• Often helps to have colour-coding.
• Depending on the use, the user should be able to make a judgement on those regions that are reliable or not.
• For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable.

Page 47: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes.

ATGCTGTTAGGG
ATGCTCGTAGGG

ATGCT-GTTAGGG
ATGCTCGTA-GGG

The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
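The last step, putting the gaps back into the DNA, can be sketched as below. It assumes the protein alignment has already been produced by some other tool and that the DNA is in frame with no stop codon; the function name is made up, and the gapped protein used in the second call is a hypothetical example rather than the output of a real alignment.

```python
def thread_gaps(dna, aligned_protein):
    """Copy gaps from an aligned protein onto its coding DNA, codon by codon.

    Each gap character in the protein becomes '---' in the DNA, so gaps
    never split a codon.
    """
    n_residues = sum(aa != "-" for aa in aligned_protein)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x number of residues")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)


# The slide's first sequence, ATG CTG TTA GGG, translates to M L L G.
print(thread_gaps("ATGCTGTTAGGG", "MLLG"))   # no gaps: DNA is unchanged
print(thread_gaps("ATGCTGTTAGGG", "ML-LG"))  # hypothetical gapped protein
```

Unlike the single-base gaps in the DNA alignment shown on the slide, gaps introduced this way always come in multiples of three and keep the reading frame intact.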

Page 48: Alignment Multiple Sequence - - University of Newcastle Upon Tyne

Manual Alignment- software

GDE - The Genetic Data Environment (UNIX)
CINEMA - Java applet available from:
– http://www.biochem.ucl.ac.uk
Seqapp/Seqpup - Mac/PC/UNIX available from:
– http://iubio.bio.indiana.edu
SeAl for Macintosh, available from:
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from:
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html