Pairwise and Multiple Sequence Alignment

Lesson 2

[Title-slide figure: a nucleotide coding sequence (ATGGTGAACCTGACCTCTGACGAG…) shown with an alignment of two protein sequences:]

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…

Motivation

What is sequence alignment?

Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE

Why perform a pairwise sequence alignment?

Finding homology between two sequences, e.g., for predicting characteristics of a protein, premised on:

similar sequence (or structure) → similar function

Local vs. Global

Local alignment: finds regions of high similarity in parts of the sequences

Global alignment: finds the best alignment across the entire two sequences

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN CDRYYQ

Evolutionary changes in sequences

Three types of nucleotide changes:

1. Substitution: a replacement of one (or more) sequence characters by another

2. Insertion: an insertion of one (or more) sequence characters

3. Deletion: a deletion of one (or more) sequence characters

Insertion + Deletion → Indel

[Slide examples, only partly recoverable from the extraction: AAGA, AACA, AAG, GA, AA, T, A]

Choosing an alignment:

Many different alignments between two sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

. . .

How do we determine which is the best alignment?

Toy exercise

Compute the score of each of the following alignments using this naïve scoring scheme:

Match: +1   Mismatch: -2   Indel: -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
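The toy exercise can be checked with a few lines of Python. This is a minimal sketch, not part of the original lesson: the function name `score_alignment` is made up for illustration, and each alignment is assumed to be given as two gapped rows of equal length with "-" marking an indel.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Sum the column scores of a gapped pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += indel       # a gap in either row is an indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The two candidate alignments from the exercise.
first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)
```

Under this scheme the alignment with the larger total is the preferred one; changing the three penalties can change which alignment wins.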

Scoring scheme:

Substitution matrix

      A    C    G    T
A     1   -2   -2   -2
C    -2    1   -2   -2
G    -2   -2    1   -2
T    -2   -2   -2    1

Gap penalty (opening = extending)

Substitution matrices: accounting for biological context

Which best reflects the biological reality regarding nucleotide mismatch penalty?

1. Tr > Tv > 0

2. Tv > Tr > 0

3. 0 > Tr > Tv

4. 0 > Tv > Tr

Tr = Transition

Tv = Transversion
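To work with the Tr/Tv distinction in code, a substitution can be classified from the chemical class of the two bases: a transition stays within purines (A, G) or within pyrimidines (C, T), and a transversion crosses between them. The sketch below is only an illustration; the names `classify_substitution`, `PURINES` and `PYRIMIDINES` are not from the lesson, and no penalty values are assumed.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'match', 'transition' or 'transversion' for two DNA bases."""
    a, b = a.upper(), b.upper()
    if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
        raise ValueError("expected bases from A, C, G, T")
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Count how many of the 12 possible substitutions fall in each class.
bases = "ACGT"
counts = {"transition": 0, "transversion": 0}
for x in bases:
    for y in bases:
        if x != y:
            counts[classify_substitution(x, y)] += 1
print(counts)
```

A nucleotide substitution matrix that distinguishes Tr from Tv would look up one of two mismatch penalties based on this classification.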

Scoring schemes: accounting for biological context

Which best reflects the biological reality regarding these mismatch penalties?

1. Arg->Lys > Ala->Phe

2. Arg->Lys > Thr->Asp

3. Asp->Val > Asp->Glu

PAM matrices

Family of matrices: PAM 80, PAM 120, PAM 250, …

The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based

The (i,j)th cell in a PAMn matrix denotes the probability that amino acid i will be replaced by amino acid j in time n: P(i→j, n)

Greater n numbers denote greater distances

PAM - limitations

Based on only one original dataset

Examines proteins with few differences (85% identity)

Based mainly on small globular proteins, so the matrix is biased

BLOSUM matrices

Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments)

BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity

The (i,j)th cell in a BLOSUM matrix denotes the log of odds of the observed frequency and expected frequency of amino acids i and j in the same position in the data: log(Pij / (qi * qj))

Higher n numbers denote higher identity between the sequences on which the matrix is based
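The log-odds formula can be tried out numerically. The sketch below follows the formula exactly as written in the lesson, log(Pij / (qi * qj)); the base-2 logarithm, the function name `log_odds` and the frequencies are assumptions made up for illustration, not values from any real BLOSUM matrix, and real matrices additionally scale and round the result.

```python
import math


def log_odds(p_ij, q_i, q_j):
    """Log-odds of observing the pair (i, j) versus expecting it by chance."""
    return math.log2(p_ij / (q_i * q_j))


# Hypothetical frequencies: both residues have background frequency 0.1,
# so a chance pairing is expected with frequency 0.1 * 0.1 = 0.01.
enriched = log_odds(p_ij=0.02, q_i=0.1, q_j=0.1)   # seen twice as often as expected
depleted = log_odds(p_ij=0.005, q_i=0.1, q_j=0.1)  # seen half as often as expected
print(round(enriched, 3), round(depleted, 3))
```

A positive value means the pair occurs more often than chance predicts (a favored substitution); a negative value means it occurs less often.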

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

↓ More distant sequences

BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

Substitution matrices exercise

Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment:

Human - chimp
Human - yeast
Human - fish

PAM options: PAM60 PAM120 PAM250

BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80

Substitution matrices

Nucleic acids: transition-transversion

Amino acids:

Evolutionary (empirical data) based (PAM, BLOSUM)

Physico-chemical properties based (Grantham, McLachlan)

Gap penalty

AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC

• Which alignment has a higher score?

• Which alignment is more likely?
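The two questions can be explored by scoring both alignments under a gap penalty where opening equals extending, and under one where opening a gap costs more than extending it. This is a sketch under stated assumptions: match = 1 and mismatch = -1 are borrowed from the lesson's later worked example, while the values -5 (open) and -1 (extend) and the name `score_with_gaps` are made up for illustration.

```python
def score_with_gaps(row1, row2, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Score an alignment; the first column of a gap run costs gap_open,
    every further column of the same run costs gap_extend."""
    total = 0
    previous_gap_row = None  # which row carried the gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            previous_gap_row = None
    return total


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")   # three separate gaps
contiguous = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")  # one gap of length three

linear = [score_with_gaps(*aln) for aln in (scattered, contiguous)]
affine = [score_with_gaps(*aln, gap_open=-5, gap_extend=-1)
          for aln in (scattered, contiguous)]
print(linear, affine)
```

With these particular numbers the ranking of the two alignments flips between the two penalty models, which is the kind of effect the questions above point at: three independent indel events versus a single event spanning three positions.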

Pairwise alignment algorithm matrix representation: formulation

2 sequences, S1 and S2, and a scoring scheme: match = 1, mismatch = -1, gap = -2

V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j]

V[i+1,j+1] = max { V[i,j] + S(S1[i+1], S2[j+1]),
                   V[i+1,j] + S(gap),
                   V[i,j+1] + S(gap) }

Neighboring cells used by the recurrence:

V[i,j]      V[i+1,j]
V[i,j+1]    V[i+1,j+1]
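The recurrence translates almost line by line into code. The following is a minimal sketch of the matrix-filling step for global alignment, using the scoring scheme stated above; the function name `fill_matrix` is illustrative, and the sequences AAAC (rows) and AGC (columns) are the ones used in the lesson's worked matrix.

```python
def fill_matrix(s1, s2, match=1, mismatch=-1, gap=-2):
    """Fill V so that V[i][j] is the best score for aligning s1[:i] with s2[:j]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    V = [[0] * cols for _ in range(rows)]
    # Initialization: aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        V[i][0] = i * gap
    for j in range(1, cols):
        V[0][j] = j * gap
    # Each cell takes the best of: diagonal (match/mismatch), up (gap), left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + substitution,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    return V


for row in fill_matrix("AAAC", "AGC"):
    print(row)
```

The bottom-right cell holds the score of the best global alignment; the alignment itself is recovered by tracing back through the cells that produced each maximum.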

Pairwise alignment algorithm matrix representation: initialization

Scoring scheme: Match = 1, Mismatch = -1, Indel (gap) = -2

S1\S2            A     G     C
           0     1     2     3
     0     0    -2    -4    -6
A    1    -2
A    2    -4
A    3    -6
C    4    -8

Pairwise alignment algorithm matrix representation: filling the matrix

Scoring scheme: Match = 1, Mismatch = -1, Indel (gap) = -2

S1\S2            A     G     C
           0     1     2     3
     0     0    -2    -4    -6
A    1    -2     1    -1    -3
A    2    -4    -1     0    -2
A    3    -6    -3    -2    -1
C    4    -8    -5    -4    -1

Pairwise alignment algorithm matrix representation: trace back

S1\S2            A     G     C
           0     1     2     3
     0     0    -2    -4    -6
A    1    -2     1    -1    -3
A    2    -4    -1     0    -2
A    3    -6    -3    -2    -1
C    4    -8    -5    -4    -1

Tracing back from the bottom-right cell (score -1) to the top-left cell gives the alignment:

AAAC
AG-C
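The trace-back step can be sketched as a walk from the bottom-right cell to the top-left, at each cell checking which neighbor could have produced its value. The code below takes the filled matrix from the slides as a literal. Several paths can tie; the order in which moves are tried here (up, then diagonal, then left) is an assumption chosen so that the walk lands on the alignment shown in the lesson, and the function name `trace_back` is illustrative.

```python
# Filled matrix from the lesson: S1 = AAAC indexes the rows, S2 = AGC the columns.
V = [
    [0, -2, -4, -6],
    [-2, 1, -1, -3],
    [-4, -1, 0, -2],
    [-6, -3, -2, -1],
    [-8, -5, -4, -1],
]


def trace_back(V, s1, s2, match=1, mismatch=-1, gap=-2):
    """Recover one optimal alignment from a filled global-alignment matrix."""
    i, j = len(s1), len(s2)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and V[i][j] == V[i - 1][j] + gap:
            # Came from above: s1[i-1] is aligned to a gap.
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        elif (i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1]
              + (match if s1[i - 1] == s2[j - 1] else mismatch)):
            # Came from the diagonal: the two characters are aligned.
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif j > 0 and V[i][j] == V[i][j - 1] + gap:
            # Came from the left: s2[j-1] is aligned to a gap.
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
        else:
            raise ValueError("matrix is inconsistent with the scoring scheme")
    return "".join(reversed(top)), "".join(reversed(bottom))


aligned_s1, aligned_s2 = trace_back(V, "AAAC", "AGC")
print(aligned_s1)
print(aligned_s2)
```

Trying the moves in a different order can return a different alignment with the same score, which is why an optimal alignment is not necessarily unique.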

Assessing the significance of an alignment score

True:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATTC-GAA
AGGCTCATTTCTGA-

Random (shuffled sequences):

AGATCAGTAGACTA
GAGTAGCTATCTCT

AGATCAGTAGACTA-----
----GAGTAG-CTATCTCT

CGATAGATAGCATA
GCATGTCATGATTC

CGATAGATAGCATA---------
---------GCATGTCATGATTC

. . .

Scores shown on the slide: 28.0, 26.0, 16.0
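The true-versus-random comparison can be sketched as a shuffle test: score the real pair, then score many pairs of shuffled copies (same composition, scrambled order) and see how often chance does at least as well. The scoring scheme below (match = 1, mismatch = -1, gap = -2) is borrowed from the lesson's worked matrix and is an assumption here, since the slide does not say which scheme produced its scores; `global_score` and `shuffle_test` are illustrative names.

```python
import random


def global_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Best global alignment score, keeping only two matrix rows in memory."""
    previous = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        current = [i * gap]
        for j in range(1, len(s2) + 1):
            substitution = match if s1[i - 1] == s2[j - 1] else mismatch
            current.append(max(previous[j - 1] + substitution,
                               previous[j] + gap,
                               current[j - 1] + gap))
        previous = current
    return previous[-1]


def shuffle_test(s1, s2, n_shuffles=200, seed=0):
    """Compare the true score with scores of shuffled versions of both sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    true_score = global_score(s1, s2)
    random_scores = []
    for _ in range(n_shuffles):
        a, b = list(s1), list(s2)
        rng.shuffle(a)
        rng.shuffle(b)
        random_scores.append(global_score("".join(a), "".join(b)))
    at_least_as_good = sum(score >= true_score for score in random_scores)
    # Add-one correction so the estimate is never exactly zero.
    p_value = (at_least_as_good + 1) / (n_shuffles + 1)
    return true_score, random_scores, p_value


true_score, random_scores, p_value = shuffle_test("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print(true_score, max(random_scores), p_value)
```

A true score well above the bulk of the shuffled scores suggests the similarity is unlikely to be due to composition alone.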

Web servers for pairwise alignment

BLAST 2 sequences (bl2seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment

Does not use an exact algorithm but a heuristic

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=9254694&dopt=Citation

Back to NCBI

BLAST - bl2seq

Bl2seq - query

blastn: nucleotide
blastp: protein

Bl2seq results

Match, Dissimilarity, Similarity, Gaps, Low complexity

Query type: AA or DNA?

For coding sequences, AA (protein) data are better

Selection operates most strongly at the protein level → the homology is more evident

AA: 20-character alphabet
DNA: 4-character alphabet

↓

lower chance of random similarity for AA

BLAST - programs

Query: DNA Protein

Database: DNA Protein

BLAST - Blastp

Blastp - results

Blastp - results (cont')

BLAST scores:

Bit score: a score for the alignment according to the number of similarities, identities, etc. It has a standard set of units and is thus independent of the scoring scheme

Expected score (E-value): the number of alignments with the same or higher score one can "expect" to see by chance when searching a random database with a random sequence of particular sizes. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog
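The lesson gives no formulas for these two scores. A common way to relate them is the Karlin-Altschul form, where a raw score S is normalized to bits with two statistical parameters (lambda and K) and the E-value then depends only on the bit score and the sizes of the query and database. The sketch below assumes that form; the values of `lam`, `k`, the raw score and the lengths are made up for illustration, and actual BLAST output also applies length corrections that are not modeled here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


# Illustrative (assumed) parameters and sizes.
lam, k = 0.267, 0.041
query_length, database_length = 150, 1_000_000

for raw in (40, 60, 100):
    bits = bit_score(raw, lam, k)
    print(raw, round(bits, 1), e_value(bits, query_length, database_length))
```

Because lambda and K absorb the details of the substitution matrix and gap penalties, bit scores from searches run with different scoring schemes can be compared directly, and each additional bit halves the E-value.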

Multiple Sequence Alignment (MSA)

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--

Similar to pairwise alignment BUT n sequences are aligned instead of just 2

Each row represents an individual sequence

Each column represents the 'same' position
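Because each column represents the 'same' position, an MSA can be read column by column. As a small illustration, the sketch below stores the eight example rows as strings and lists the columns in which every sequence carries the same residue; the function name `conserved_columns` and the strict "all identical, no gaps" criterion are assumptions made for this example.

```python
msa = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",  # Seq1
    "VTISCTGTSSNIGS--ITVNWYQQLPG",  # Seq2
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",  # Seq3
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",  # Seq4
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",  # Seq5
    "ATLVCLISDFYPGA--VTVAWKADS--",  # Seq6
    "AALGCLVKDYFPEP--VTVSWNSG---",  # Seq7
    "VSLTCLVKGFYPSD--IAVEWWSNG--",  # Seq8
]


def conserved_columns(rows):
    """Return 0-based indices of columns where all rows share one residue."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an MSA must have the same length")
    return [index for index, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] != "-"]


for index in conserved_columns(msa):
    print(index + 1, msa[0][index])  # 1-based column number and its residue
```

Less strict measures (for example, the fraction of rows sharing the most common residue) grade columns from variable to conserved instead of giving a yes/no answer.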

Why perform an MSA?

MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species



[Figure: multiple sequence alignment with variable and conserved regions marked]

Alignment methods

No practical exact (optimal) solution is available for MSA; all methods in use are heuristics:

Progressive/hierarchical alignment (ClustalX)

Iterative alignment (MAFFT, MUSCLE)

Progressive alignment

First step: compute pairwise distances

Compute the pairwise alignments for all against all (10 pairwise alignments for the five sequences A, B, C, D, E). The similarities are converted to distances and stored in a table:

      A    B    C    D    E
A
B     8
C    15   17
D    16   14   10
E    32   31   31   32

Second step: build a guide tree

Cluster the sequences to create a tree (guide tree):

• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

[Figure: pairwise distance table and the guide tree built from it, with leaves A, D, C, B, E]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
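To make the clustering step concrete, the sketch below builds a guide tree from the lesson's pairwise distance table for sequences A to E. The lesson does not say which clustering method is used; average-linkage clustering (UPGMA) is used here only as a simple stand-in, and the function name `upgma` and the Newick-style output string are choices made for this example.

```python
from itertools import combinations


def upgma(labels, leaf_dist):
    """Average-linkage clustering; returns the tree as a Newick-like string."""
    members = {name: [name] for name in labels}  # cluster name -> its leaves
    newick = {name: name for name in labels}     # cluster name -> subtree text

    def cluster_dist(c1, c2):
        # Mean distance over all leaf pairs drawn from the two clusters.
        pairs = [(a, b) for a in members[c1] for b in members[c2]]
        return sum(leaf_dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(members) > 1:
        c1, c2 = min(combinations(sorted(members), 2),
                     key=lambda pair: cluster_dist(*pair))
        merged = c1 + c2  # adequate naming for single-letter labels
        members[merged] = members.pop(c1) + members.pop(c2)
        newick[merged] = "(" + newick.pop(c1) + "," + newick.pop(c2) + ")"
    return next(iter(newick.values()))


labels = ["A", "B", "C", "D", "E"]
leaf_dist = {
    frozenset("AB"): 8,
    frozenset("AC"): 15, frozenset("BC"): 17,
    frozenset("AD"): 16, frozenset("BD"): 14, frozenset("CD"): 10,
    frozenset("AE"): 32, frozenset("BE"): 31, frozenset("CE"): 31, frozenset("DE"): 32,
}

print(len(leaf_dist))            # number of pairwise comparisons
print(upgma(labels, leaf_dist))  # order in which sequences would be joined
```

Reading the nested parentheses from the inside out gives the joining order for the progressive alignment: closest pairs first, then pairs of pairs, then the most distant sequence last.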

Third step: align sequences in a bottom-up order

1. Align the most similar (neighboring) pairs

2. Align pairs of pairs

3. Align sequences clustered to pairs of pairs deeper in the tree

[Figure: guide tree with leaves A, D, C, B, E next to the aligned sequences A to E]

Main disadvantages of progressive alignments

Guide-tree topology may be considerably wrong

Globally aligning pairs of sequences may create errors that will propagate through to the final result

Iterative alignment

[Figure: sequences A to E cycling through a pairwise distance table, a guide tree, and an MSA]

Iterate until the MSA does not change (convergence)

Blastp - acquiring sequences

MSA input: multiple-sequence FASTA file

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH

>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH

>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH

>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]
MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR

MSA using ClustalX

Step 1: Load the sequences

A little unclear…

Edit FASTA headers…

Original headers:

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]

Edited headers:

> delta globin
> beta globin
> epsilon globin
> G-gamma globin
> A-gamma globin
> hemoglobin zeta
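Editing the headers by hand works for six sequences; for larger files the same edit can be scripted. The sketch below is one possible way to do it, assuming headers of the form shown in the lesson (fields separated by "|", followed by a description and a species name in square brackets). The function names `short_name` and `rename_headers` are made up for illustration.

```python
import re


def short_name(header):
    """Reduce a FASTA header to its description, e.g. 'delta globin'."""
    description = header.lstrip(">").split("|")[-1]       # text after the last '|'
    description = re.sub(r"\[[^\]]*\]", "", description)  # drop '[Homo sapiens]'
    description = description.replace(",", "")            # 'hemoglobin, zeta' -> 'hemoglobin zeta'
    return " ".join(description.split())                  # normalize whitespace


def rename_headers(fasta_text):
    """Rewrite every header line of a FASTA string, leaving sequence lines as-is."""
    lines = []
    for line in fasta_text.splitlines():
        lines.append("> " + short_name(line) if line.startswith(">") else line)
    return "\n".join(lines)


example = (
    ">gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]\n"
    "MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKV\n"
    ">gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]\n"
    "MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGS"
)
print(rename_headers(example))
```

Some programs keep only the first whitespace-delimited word of a header as the sequence name, so replacing the spaces with underscores may be a safer variant in practice.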

Step 2: Perform alignment

MSA and conservation view

Messing-up alignment of HIV-1 env

MSA tools

Progressive: CLUSTALX/CLUSTALW

Iterative: MUSCLE, MAFFT, PRANK
