sequence alignment goal: line up two or more sequences an alignment of two amino acid sequences:...
Post on 01-Jan-2016
217 Views
Preview:
TRANSCRIPT
Sequence Alignment
Goal: line up two or more sequences
An alignment of two amino acid sequences:
123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSeq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
Position within the alignment = columns
Sequence Alignment
The alignment is a hypothesis: • The positions with identical nt/AA were present in the common
ancestor
• Differences represent the nt/AA that have diverged since the
common ancestor
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSeq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
From extant sequences to evolution
Constructing and evaluating alignments
• how to identify regions of sequence similarity between two sequences?
• How to evaluate the degree of similarity?
• What is the biological significance of the alignment?
Dot Plots: visualization of an alignment
1) Take two English words:
2) place the two sequences on vertical and horizontal axes of graph
3) put dots wherever there is a match
4) diagonal line is the region of identity – local alignment
THISSEQUENCE and THATSEQUENCE
Alignments reveal insertions and deletions
a gap in Seq1 accounts for the insertion of ISA into Seq2
seq1 THIS---SEQUENCE seq2 THISISASEQUENCE
THISSEQUENCE
THATSEQUENCE
– How many substitutions?
– Are all substitutions equal?
– If these were real AA sequences in two extant organisms, how can we determine whether they reflect evolutionary ancestry?
• Would two unrelated sequence share this level of identity?
Alignments reveal substitutions
Need a methods for evaluating the likelihood of
- A to V (alanine to valine)
- R to F (Arginine to Phenylalanine)
Substitutions
Scoring schemes to assess similarity
• Percent identity = number of identical amino acids
• Percent similarity (biochemical equivalence)
• Substitution matrices– value assigned based on the probability of substitution– score the alignment
123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSeq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
Substitution matrices
How might one construct a scoring matrix?
– what types of sequence events should we consider?• DNA level? Transition vs. transversion• Amino acid level? Biochemical equivalence• Using known proteins?
– comparing protein homologs
– post hoc determination of probabilities
– should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago?
• Probability of substitutions increases over time
• Probability that multiple substitutions occurred in a single position
Substitution matrix
Each cell represents the likelihood of substitution of each possible pair of amino acids
Sum up the score = 52THISSEQUENCETHATSEQUENCE
581145505695
PAM and BLOSUM matrices for AA sequences
Most protein alignment matrices are empirically derived:
• PAM Scoring Matrices– compared full length of closely related proteins– Measured the frequency of all possible substitution pairs
• BLOSUM Scoring Matrices– compared highly conserved regions of proteins– blocks
How to score gaps?THISISASEQUENCE vs THATSEQUENCE?
THISISASEQUENCETH----ATSEQUENCETHA---TSEQUENCETH---ATSEQUENCETH-A-T-SEQUENCE
Scoring the alignment must take into account1) Substitutions2) Gaps
Gap penalties:1) start a new gap (-4)2) extend an existing gap (-1)
Score all, choose highest score
More than one possible alignment
Alignments
• Finding regions of sequence identity or similarity
• Inserting gaps to reflect indels
• Scoring the possible alignments to find the optimal alignment by
Common tools that produce alignments
• BLAST to identify similar sequences, given a query sequence
• ClustalW to align two or more sequences across their entire length
the blast algorithm• Uses PAM or BLOSUM matrix
• divides query sequence into short strings, called words
• searches through the database to find subject sequences that contain similar words
• When finds similar words, it extends and scores the alignment
• Output consists of all subject sequences that align to the query at or above a threshold score
• If no words are similar, then no alignment
BLAST Algorithm
divide entire length into words (segments of X length)
Extend hits one base at a time
S is the alignment score:
If it falls below a threshold, the extension processes ends
HSPs are Aligned Regions
• High scoring segment pairs = the original word match plus the extension– high scoring = score of the alignment above threshold– segment = the region of the query sequence aligned to the subject– pair = alignment between two sequences (query and subject)
• BLAST often produces several short HSPs rather than a single aligned region
BLAST Results report local alignments
>gi|17556182|ref|NP_497582.1| Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans]
Score = 283 bits (723), Expect = 8e-75 Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%)
Query: 48 KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101 K+ RV+LP+SV+EYQVGQL+SVAEASK P++ +G+ KGQYTSbjct: 70 KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129
Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160 HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H PSbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189
Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN H L+ + E V I+IA+ + L S D + P+KF+S KTGRGPL NSbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249
Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279 WK + P MCAYKLVTV FKW+G Q VEN+ H Q RLF+ FHR++FCW+DKW Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304
Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309 LTM DIR +E + +++L+E R+ V+GMSbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334
Query was the entire protein sequence (position 1 to 749)
• Score,E-value, Identities, Positives, Gaps
BLAST Statistics• E-value is equivalent to a P value
• smaller numbers are more significant – 1e-4 = 1 x 10-4 = 0.0004
– 1e-50 = 1 x 10-50
• E-value is calculated from the alignment score (S)
• how many alignments of that score would likely occur by chance if you query a database of that size?– if GenBank contains 10 million sequences, there is a good
probability that the sequence “MAGAV” will occur multiple times in sequences that are NOT evolutionarily related
• The E-value represents the likelihood that the observed alignment is due to chance alone
Interpretation of output
• very low E-values (e-100) represent sequences that are very close to being identical
• moderate E-values are related genes (homologs)
• long list of gradually declining of E-values indicates a large gene family
• you must examine the results when e-value is in the 10-4 to -5 range– examine sequences
• a few AA matches in a long sequence?• many AA matches in a very short sequence?
Evaluating Blast results
• Alignment:– colored bar alignments (region of alignment, score along length)– sequence alignments (region of alignment, AA information)
• Exploring potential function from significant blast hits– use accession link to go to the record page for each hit
• published papers• full sequence information• annotation
• Blast is linked to a protein domain tool
top related