Multiple sequence alignment and their reliability

Post on 30-Dec-2015


DESCRIPTION

Multiple sequence alignment and their reliability. The Bioinformatics Unit, G.S. Wise Faculty of Life Science, Tel Aviv University, Israel. January 2013. By Haim Ashkenazy. http://guidance.tau.ac.il/workshop_2013/ What are alignments good for? To compare sequences. Find homology.

TRANSCRIPT

Multiple sequence alignment and their reliability

The Bioinformatics Unit
G.S. Wise Faculty of Life Science
Tel Aviv University, Israel
January 2013

By Haim Ashkenazy

http://guidance.tau.ac.il/workshop_2013/

TAU Bioinformatics Workshop

What are alignments good for?

• To compare sequences
  o Find homology
  o Similar sequence → similar function

• To learn about sequence evolution
  o Mismatch = point mutation
  o Gap = indel (insertion or deletion)
  o Reconstruct phylogenetic tree
  o Infer selection forces, e.g., detecting positive selection, co-evolving sites

• For structure prediction
  o Similar regions potentially have similar structure

Making an alignment (pairwise)

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• For 2 sequences: pairwise alignment
  o Local alignment: finds regions of high similarity in parts of the sequences
  o Global alignment: finds the best alignment across the entire two sequences

• Use exact solution
  o Needleman-Wunsch (for global) or Smith-Waterman (for local): http://www.ebi.ac.uk/Tools/psa/
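The exact global method can be sketched in a few lines. This is a minimal Needleman-Wunsch implementation under an assumed scoring scheme (match +1, mismatch -1, linear gap penalty -1) chosen only for illustration; the slide does not give scores, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# The two sequences from the slide, without the gap character.
score, top, bottom = needleman_wunsch("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
print(score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, which is why exact dynamic programming is practical for two sequences but not for many. When several alignments are equally good, the traceback returns only one of them.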

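Sequence evolution

[Figure: a toy sequence evolving along a tree; time axis: 30 MYA, 5 MYA, Today; leaves: Human, Chimp, Mouse]

Sequences shown on the tree, from root to leaves:

ATGAAATAA
ATGTTTTAA   ATGCCCAAATAA
ATGTTTTCA   ATGTTTTAA   ATGCCCAAA

Two alignments of the present-day sequences are shown:

A T G - - - T T T T A A
A T G - - - T T T T C A
A T G C C C A A A - - -

A T G - - - T T T T A A
A T G - - - T T T T C A
A T G C C C - - - A A A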

Alignment and phylogeny are mutually dependent

[Figure: diagram linking unaligned sequences, sequence alignment, MSA, phylogeny reconstruction and inaccurate tree building; tree scale bar 0.4]

Alignment and phylogeny are both challenging

• ~25% of residues are wrongly aligned (based on BAliBASE: a large representative set of proteins)
• 5% of tree branches are wrong (based on simulations of 100 protein sequences)

Making an alignment (MSA)

• For more sequences: multiple sequence alignment (MSA)
  o Exact methods are not feasible (too slow)
  o We use heuristic methods
  o Several advanced MSA programs are available

Basically two recommended methods:
• MAFFT: fastest and one of the most accurate
• PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions

Progressive alignment

First step: compute pairwise distances

Sequences: A, B, C, D, E

Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table.

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32
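A minimal sketch of this first step. The slide does not say how similarities are converted to distances, so the conversion here is an assumption: distance = 100 x (1 - fraction of identical residues), counted over gap-free columns of an already aligned pair. The function name `percent_distance` is hypothetical.

```python
from itertools import combinations


def percent_distance(aligned_a, aligned_b):
    """Distance = 100 * (1 - identity), over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        identical += (x == y)
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * (1.0 - identical / compared)


# All-against-all: 5 sequences give 10 unordered pairs.
names = ["A", "B", "C", "D", "E"]
pairs = list(combinations(names, 2))
print(len(pairs))

# The pairwise alignment from the earlier slide: 9 identical residues
# out of 14 gap-free columns.
d = percent_distance("ADLGAVFALCDRYFQ", "ADLGRTQN-CDRYYQ")
print(round(d, 1))
```

In a full pipeline each of the 10 pairs would first be aligned, and the resulting distances would fill the lower triangle of the table.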

Second step: build a guide tree

Cluster the sequences to create a tree (guide tree):

• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

[Figure: guide tree over the leaves A, B, C, D, E, drawn next to the pairwise distance table]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
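The clustering step can be sketched with the distance table from the slide. The slide does not name a clustering method, so UPGMA (average linkage) is assumed here purely for illustration; alignment programs may use other methods, such as neighbor joining. Trees are represented as nested tuples, and all function names are hypothetical.

```python
from itertools import combinations

# Lower-triangular distance table from the slide.
DIST = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}


def leaf_distance(x, y):
    """Look up the distance between two leaves in either order."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def leaves(tree):
    """Return the leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def cluster_distance(t1, t2):
    """Average distance over all leaf pairs (UPGMA / average linkage)."""
    pairs = [(x, y) for x in leaves(t1) for y in leaves(t2)]
    return sum(leaf_distance(x, y) for x, y in pairs) / len(pairs)


def upgma(names):
    """Repeatedly merge the two closest clusters until one tree remains."""
    clusters = list(names)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a, b))
    return clusters[0]


guide_tree = upgma(["A", "B", "C", "D", "E"])
print(guide_tree)
```

With this table the sketch joins A with B (distance 8), then C with D (distance 10), then the two pairs (average 15.5), and attaches E last.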

Third step: align sequences in a bottom-up order

1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree

[Figure: guide tree grouping Sequence A with Sequence B, Sequence C with Sequence D, and Sequence E on its own branch]
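The bottom-up order can be read off a guide tree with a post-order walk. This is a minimal sketch that assumes a guide tree given as nested tuples with the grouping (A,B), (C,D), then E; the tree literal and the function name `alignment_order` are hypothetical, and the actual alignment of each pair of groups is not shown.

```python
def alignment_order(tree, steps=None):
    """Post-order walk of a nested-tuple guide tree.

    Returns the label of the (sub)alignment at this node and the list of
    steps, one per internal node, with children listed before parents.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return tree, steps
    left, right = tree
    left_label, _ = alignment_order(left, steps)
    right_label, _ = alignment_order(right, steps)
    steps.append((left_label, right_label))
    return left_label + right_label, steps


# Assumed guide tree: A with B, C with D, and E joined last.
guide_tree = ((("A", "B"), ("C", "D")), "E")
_, steps = alignment_order(guide_tree)
for number, (left, right) in enumerate(steps, start=1):
    print(f"step {number}: align {left} with {right}")
```

Each step aligns two existing sequences or sub-alignments, so gaps placed in an early step are carried into every later one.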

Multiple sequence alignment (MSA): progressive alignment

[Figure: sequences A, B, C, D, E → pairwise distance table → guide tree → MSA; an additional step is labeled "Iterative"]

Sources of alignment errors

Progressive alignment algorithms are greedy heuristics

Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007)

Alignment of the sequences as given:

GEELTNWPSPVCHNRLASGIDDSTAFRFPRPQKWIISYSLHCVI...
GEELTLWPSPVCHNRLASGIDASIAFRFPRAQKRFYRYSLHCVI...
TEELTHWPFPVCRNRLARGIGSAIAFRCPRSQEHI-RNSLPCVI...
TEELRYWPFPVCQN--ARGNGSVIEARNPGSQ-----KVLPYVI...

Alignment of the reversed sequences:

...IVCHLSYSIIWKQPRPFRFATSDDIGSALRNHCVPSPWNTLEEG
...IVCHLSYRYFRKQARPFRFAISADIGSALRNHCVPSPWLTLEEG
...IVCPLSNRI-HEQSRPCRFAIASGIGRALRNRCVPFPWHTLEET
...IVYPLVK-----QSGPNRAEIVSGNGRA--NQCVPFPWYRLEET
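The Heads-or-Tails idea can be sketched as a procedure: align the sequences as given, align the reversed sequences, flip the second result back, and compare the two. In the sketch below the `align` argument stands for any aligner; `left_justify` is a toy stand-in that only pads with trailing gaps and is not a real alignment method. The column-level comparison is a simplification, and all names are hypothetical.

```python
def residue_columns(msa):
    """Describe each column by the residue index it holds per row (None = gap)."""
    counters = [0] * len(msa)
    columns = []
    for c in range(len(msa[0])):
        column = []
        for r, row in enumerate(msa):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[r])
                counters[r] += 1
        columns.append(tuple(column))
    return columns


def hot_column_agreement(seqs, align):
    """Fraction of 'heads' columns reproduced by the 'tails' alignment."""
    heads = align(seqs)
    tails_reversed = align([s[::-1] for s in seqs])
    # Flip the tails alignment back so both read in the original direction.
    tails = [row[::-1] for row in tails_reversed]
    heads_columns = residue_columns(heads)
    tails_columns = set(residue_columns(tails))
    same = sum(column in tails_columns for column in heads_columns)
    return same / len(heads_columns)


def left_justify(seqs):
    """Toy 'aligner': pad shorter sequences with trailing gaps."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


print(hot_column_agreement(["ACGT", "ACG"], left_justify))
print(hot_column_agreement(["ACGT", "TTGA"], left_justify))
```

The toy aligner places its gap at opposite ends in the two directions, so the first call finds no shared columns, while equal-length sequences agree completely. With a real aligner, disagreement between the two directions marks regions where co-optimal solutions exist.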

GUIDANCE: Guide-tree based alignment confidence scores

[Figure: base alignment → bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100) → progressive alignment (MSA 1, MSA 2, …, MSA 99, MSA 100) → GUIDANCE scores (scale 0 to 1)]

Penn, Privman et al. MBE. 2010
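The bootstrap step in the figure can be sketched as resampling alignment columns with replacement. This is a sketch of the standard column bootstrap only, assuming 100 replicates as in the figure; building an NJ tree from each replicate and re-aligning the sequences along it are not shown, and the function name is hypothetical.

```python
import random


def bootstrap_columns(msa, rng):
    """Resample alignment columns with replacement (same width as the input)."""
    width = len(msa[0])
    picks = [rng.randrange(width) for _ in range(width)]
    return ["".join(row[c] for c in picks) for row in msa]


# Toy base alignment taken from the sequence evolution slide.
base_msa = [
    "ATG---TTTTAA",
    "ATG---TTTTCA",
    "ATGCCCAAA---",
]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_columns(base_msa, rng) for _ in range(100)]
print(len(replicates), len(replicates[0][0]))
```

Each replicate would yield one perturbed guide tree and, from it, one alternative MSA. How consistently a residue pair stays aligned across those alternative MSAs is what the confidence score captures.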

Comparing alignments

Common measures to quantify distance between two MSAs:

1. CS: Each column of the MSA that is identically aligned in the other MSA is given a score of 1; all other columns are given the score 0.
2. SP: Each pair of residues in the MSA that is identically aligned in the other MSA is given a score of 1; all other residue pairs are given the score 0.
3. Sum-of-pairs column score (SPC): The score of each column is simply the average of the SPs over all pairs in it.
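The first two measures can be sketched directly from their definitions, averaged over the whole MSA. The sketch assumes both MSAs hold the same sequences in the same row order and that gaps are written as "-"; the function names are hypothetical. The example reuses the two toy alignments from the sequence evolution slide.

```python
from itertools import combinations


def residue_columns(msa):
    """Describe each column by the residue index it holds per row (None = gap)."""
    counters = [0] * len(msa)
    columns = []
    for c in range(len(msa[0])):
        column = []
        for r, row in enumerate(msa):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[r])
                counters[r] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(column):
    """All residue pairs in one column, as ((row, index), (row, index))."""
    residues = [(r, i) for r, i in enumerate(column) if i is not None]
    return set(combinations(residues, 2))


def column_score(msa, other):
    """CS: fraction of columns of `msa` found identically in `other`."""
    other_columns = set(residue_columns(other))
    columns = residue_columns(msa)
    return sum(column in other_columns for column in columns) / len(columns)


def sum_of_pairs_score(msa, other):
    """SP: fraction of residue pairs of `msa` that are also paired in `other`."""
    other_pairs = set()
    for column in residue_columns(other):
        other_pairs |= aligned_pairs(column)
    pairs = set()
    for column in residue_columns(msa):
        pairs |= aligned_pairs(column)
    return len(pairs & other_pairs) / len(pairs)


msa_1 = ["ATG---TTTTAA", "ATG---TTTTCA", "ATGCCCAAA---"]
msa_2 = ["ATG---TTTTAA", "ATG---TTTTCA", "ATGCCC---AAA"]
cs = column_score(msa_1, msa_2)
sp = sum_of_pairs_score(msa_1, msa_2)
print(cs, round(sp, 3))
```

For these two alignments 6 of the 12 columns are identical (CS = 0.5) and 15 of the 21 residue pairs are preserved (SP = 15/21). SP is the more forgiving measure because a column that differs in one sequence still keeps its other pairs.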

Accuracy of GUIDANCE scores

http://guidance.tau.ac.il

As a rule of thumb, use HoT for fewer than 8 sequences

[Screenshot: input form of the GUIDANCE web server at http://guidance.tau.ac.il; callouts: un-aligned sequences (FASTA format); choose sequence type; choose alignment method]

GUIDANCE results

[Screenshot: MSA colored by confidence score, from confident to uncertain, with sequence scores and column scores]

GUIDANCE outputs

• Download MSA for downstream analysis
• Text files with all scores
• Mask residues by score
• Remove unreliable sequences

Filtering sequences with low scores and re-aligning

• Remove unreliable sequences
• Re-align sequences after filtration
• Sequences left after filtration

But always remember not to remove too much data, and consider the biology…

Filtering columns with low scores

• Remove unreliable columns
• MSA after filtration

Filtering residues with low scores

• Masking unreliably aligned residues
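Column filtering and residue masking can be sketched as simple threshold operations on a scored alignment. The alignment, the scores, the cutoff of 0.8 and the mask character "X" below are all made up for illustration; they are not GUIDANCE defaults, and the function names are hypothetical.

```python
def remove_columns(msa, column_scores, cutoff):
    """Keep only the columns whose score is at least `cutoff`."""
    keep = [c for c, s in enumerate(column_scores) if s >= cutoff]
    return ["".join(row[c] for c in keep) for row in msa]


def mask_residues(msa, residue_scores, cutoff, mask="X"):
    """Replace residues scoring below `cutoff` with `mask`; gaps are left alone."""
    masked = []
    for row, scores in zip(msa, residue_scores):
        masked.append("".join(
            ch if ch == "-" or s >= cutoff else mask
            for ch, s in zip(row, scores)
        ))
    return masked


# Made-up alignment and scores, for illustration only.
msa = ["ACD-F", "ACDEF", "A-DEY"]
column_scores = [1.0, 0.9, 1.0, 0.4, 0.7]
filtered = remove_columns(msa, column_scores, cutoff=0.8)
print(filtered)

residue_scores = [
    [1.0, 0.9, 1.0, 0.0, 0.9],
    [1.0, 0.9, 1.0, 0.5, 0.9],
    [1.0, 0.0, 1.0, 0.3, 0.2],
]
masked = mask_residues(msa, residue_scores, cutoff=0.8)
print(masked)
```

Masking keeps the alignment width and discards only the individual residues judged unreliable, whereas column removal drops every residue in a low-scoring column, including the well-aligned ones.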

Filtering unreliable regions can improve downstream analysis

(Mol Biol Evol 2012;29:1-5)

Acknowledgments

• Prof. Tal Pupko
• Dr. Eyal Privman
• Dr. Osnat Penn
• Pupko’s lab members

1. Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D. and Pupko, T. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research, 2010 Jul 1; 38 (Web Server issue): W23-W28; doi: 10.1093/nar/gkq443

2. Penn, O., Privman, E., Landan, G., Graur, D. and Pupko, T. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution, 2010 Aug; 27(8): 1759-67; doi: 10.1093/molbev/msq066

3. Landan, G. and Graur, D. (2008). Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput 13: 15-24

Thanks for your attention!

1. Download and save the sequences file "Seq_For_GUIDANCE.fs" from http://guidance.tau.ac.il/workshop_2013/ (File, then “Save as”). This file contains 20 protein sequences in FASTA format.

2. Run the GUIDANCE web server to create a protein alignment:
   a. Use the GUIDANCE algorithm
   b. Select “amino acids” as the sequence type
   c. Select MAFFT as the alignment method
   d. Run (press the “Submit” button)
   e. (In case it does not run for you, you can see the results at: http://guidance.tau.ac.il/results/13589321556364/output.php)

3. What is the alignment score? What does it mean about the alignment achieved?

4. Which sequences can be removed to improve the alignment? What is the biological justification for that? Try it!

Appendix: MSA servers

MAFFT

• Web server & download: http://mafft.cbrc.jp/alignment/server/

Choosing a MAFFT strategy

• Efficiency-tuned variants: quick & dirty or slow but accurate

[Figure: MAFFT strategies arranged on a scale from "quick & dirty" to "slow but accurate"]

L-INS-i

ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------

--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------

------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------

--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo

--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

G-INS-i

XXXXXXXXXXX-XXXXXXXXXXXXXXX

XX-XXXXXXXXXXXXXXX-XXXXXXXX

XXXXX----XXXXXXXX---XXXXXXX

XXXXX-XXXXXXXXXX----XXXXXXX

XXXXXXXXXXXXXXXX----XXXXXXX

E-INS-i

oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo

---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------

-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo

---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------

---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------
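For a local installation, the three strategies above correspond to command-line options. The sketch below assumes the `mafft` executable is on the PATH and uses the option sets given in the MAFFT manual for L-INS-i, G-INS-i and E-INS-i; verify them against the installed version (for example with `mafft --help`). The helper names are hypothetical, and the input file name is the one from the exercise.

```python
import shutil
import subprocess

# Option sets as listed in the MAFFT manual for the three strategies
# (an assumption to verify against the installed version).
STRATEGY_FLAGS = {
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the argument list for a MAFFT run with the given strategy."""
    return ["mafft", *STRATEGY_FLAGS[strategy], fasta_path]


def run_mafft(strategy, fasta_path):
    """Run MAFFT and return the alignment (FASTA text) written to stdout."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft executable not found on PATH")
    result = subprocess.run(
        mafft_command(strategy, fasta_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Only prints the command; call run_mafft(...) to actually execute it.
print(" ".join(mafft_command("L-INS-i", "Seq_For_GUIDANCE.fs")))
```

Building the command separately from running it makes it easy to inspect the exact options before starting one of the slower, more accurate strategies.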


MAFFT output

• A colored view of the alignment
• Choose a format: Clustal, Fasta and save as text file
• Run GUIDANCE also from here!!

PRANK

Classical alignment errors for HIV env

• Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

PRANK output

If you need a different format – copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/
