sequence alignment goal: line up two or more sequences an alignment of two amino acid sequences:...

Sequence Alignment

Goal: line up two or more sequences

An alignment of two amino acid sequences:

123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSeq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

Position within the alignment = columns

Sequence Alignment

The alignment is a hypothesis: • The positions with identical nt/AA were present in the common

ancestor

• Differences represent the nt/AA that have diverged since the

common ancestor

Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSeq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

From extant sequences to evolution

Constructing and evaluating alignments

• how to identify regions of sequence similarity between two sequences?

• How to evaluate the degree of similarity?

• What is the biological significance of the alignment?

Dot Plots: visualization of an alignment

1) Take two English words:

2) place the two sequences on vertical and horizontal axes of graph

3) put dots wherever there is a match

4) diagonal line is the region of identity – local alignment

THISSEQUENCE and THATSEQUENCE

Alignments reveal insertions and deletions

a gap in Seq1 accounts for the insertion of ISA into Seq2

seq1 THIS---SEQUENCE seq2 THISISASEQUENCE

THISSEQUENCE

THATSEQUENCE

– How many substitutions?

– Are all substitutions equal?

– If these were real AA sequences in two extant organisms, how can we determine whether they reflect evolutionary ancestry?

• Would two unrelated sequence share this level of identity?

Alignments reveal substitutions

Need a methods for evaluating the likelihood of

- A to V (alanine to valine)

- R to F (Arginine to Phenylalanine)

Substitutions

Scoring schemes to assess similarity

• Percent identity = number of identical amino acids

• Percent similarity (biochemical equivalence)

• Substitution matrices– value assigned based on the probability of substitution– score the alignment

123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSeq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

Substitution matrices

How might one construct a scoring matrix?

– what types of sequence events should we consider?• DNA level? Transition vs. transversion• Amino acid level? Biochemical equivalence• Using known proteins?

– comparing protein homologs

– post hoc determination of probabilities

– should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago?

• Probability of substitutions increases over time

• Probability that multiple substitutions occurred in a single position

Substitution matrix

Each cell represents the likelihood of substitution of each possible pair of amino acids

Sum up the score = 52THISSEQUENCETHATSEQUENCE

581145505695

PAM and BLOSUM matrices for AA sequences

Most protein alignment matrices are empirically derived:

• PAM Scoring Matrices– compared full length of closely related proteins– Measured the frequency of all possible substitution pairs

• BLOSUM Scoring Matrices– compared highly conserved regions of proteins– blocks

How to score gaps?THISISASEQUENCE vs THATSEQUENCE?

THISISASEQUENCETH----ATSEQUENCETHA---TSEQUENCETH---ATSEQUENCETH-A-T-SEQUENCE

Scoring the alignment must take into account1) Substitutions2) Gaps

Gap penalties:1) start a new gap (-4)2) extend an existing gap (-1)

Score all, choose highest score

More than one possible alignment

Alignments

• Finding regions of sequence identity or similarity

• Inserting gaps to reflect indels

• Scoring the possible alignments to find the optimal alignment by

Common tools that produce alignments

• BLAST to identify similar sequences, given a query sequence

• ClustalW to align two or more sequences across their entire length

the blast algorithm• Uses PAM or BLOSUM matrix

• divides query sequence into short strings, called words

• searches through the database to find subject sequences that contain similar words

• When finds similar words, it extends and scores the alignment

• Output consists of all subject sequences that align to the query at or above a threshold score

• If no words are similar, then no alignment

BLAST Algorithm

divide entire length into words (segments of X length)

Extend hits one base at a time

S is the alignment score:

If it falls below a threshold, the extension processes ends

HSPs are Aligned Regions

• High scoring segment pairs = the original word match plus the extension– high scoring = score of the alignment above threshold– segment = the region of the query sequence aligned to the subject– pair = alignment between two sequences (query and subject)

• BLAST often produces several short HSPs rather than a single aligned region

BLAST Results report local alignments

>gi|17556182|ref|NP_497582.1| Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans]

Score = 283 bits (723), Expect = 8e-75 Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%)

Query: 48 KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101 K+ RV+LP+SV+EYQVGQL+SVAEASK P++ +G+ KGQYTSbjct: 70 KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129

Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160 HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189

Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN H L+ + E V I+IA+ + L S D + P+KF+S KTGRGPL NSbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249

Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279 WK + P MCAYKLVTV FKW+G Q VEN+ H Q RLF+ FHR++FCW+DKW Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304

Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309 LTM DIR +E + +++L+E R+ V+GMSbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334

Query was the entire protein sequence (position 1 to 749)

• Score,E-value, Identities, Positives, Gaps

BLAST Statistics• E-value is equivalent to a P value

• smaller numbers are more significant – 1e-4 = 1 x 10-4 = 0.0004

– 1e-50 = 1 x 10-50

• E-value is calculated from the alignment score (S)

• how many alignments of that score would likely occur by chance if you query a database of that size?– if GenBank contains 10 million sequences, there is a good

probability that the sequence “MAGAV” will occur multiple times in sequences that are NOT evolutionarily related

• The E-value represents the likelihood that the observed alignment is due to chance alone

Interpretation of output

• very low E-values (e-100) represent sequences that are very close to being identical

• moderate E-values are related genes (homologs)

• long list of gradually declining of E-values indicates a large gene family

• you must examine the results when e-value is in the 10-4 to -5 range– examine sequences

• a few AA matches in a long sequence?• many AA matches in a very short sequence?

Evaluating Blast results

• Alignment:– colored bar alignments (region of alignment, score along length)– sequence alignments (region of alignment, AA information)

• Exploring potential function from significant blast hits– use accession link to go to the record page for each hit

• published papers• full sequence information• annotation

• Blast is linked to a protein domain tool

sequence alignment goal: line up two or more sequences an alignment of two amino acid sequences:...

sequencesan alignment

amino acid sequences

regions of sequence

level of identity

unrelated sequence

sequence alignmentgoal

real aa sequences

f arginine

Documents

introduction sequence alignment - mit...

comparing sequences and multiple sequence alignment...

a strategy for the rapid multiple alignment of protein...

class 2: basic sequence alignment. sequence comparison much...

- comparison - multiple sequences alignment

sequences bioinformatics - gerstein...

b ioinform atics alignment of biological sequences -...

pairwise global alignment of sequences · pairwise global...

bioinformatics sequences alignment...

progressivemauve: multiple genome alignment with gene gain...

alignment of long sequences - biostat.wisc.edu ·...

sequence alignment. 2 sequence comparison much of...

sequence alignment. sequences much of bioinformatics...

"an eulerian path approach to global multiple alignment...

nonuniform temporal alignment of slice sequences...

introduction sequence alignment - mathematics › classes...

pairwise sequence alignment for very long sequences on...

1 multiple sequence alignment(msa). 2 multiple alignment...

exploring protein sequences tutorial 5. exploring protein...

2. comparing biological sequences : sequence...