![Page 1: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/1.jpg)
Basics of sequence analysis: Ch. 6 and Ch. 7
• Sequence acquisition
• Sequence data
• Reconstructing sequence
• Sequence alignment
• Alignment algorithms
• Database searching
• Uses of alignments
![Page 2: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/2.jpg)
![Page 3: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/3.jpg)
http://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg
![Page 4: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/4.jpg)
ABI era
Source: wikipedia
![Page 5: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/5.jpg)
Scaling up by brute force
$3,000,000,000 genome
![Page 6: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/6.jpg)
Source: G. Church
![Page 7: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/7.jpg)
Source: G. Church
![Page 8: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/8.jpg)
A Genome Analyzer flowcell (left) and imaging region or ‘tile’ (right), with a magnified section showing a cluster.
Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.
![Page 9: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/9.jpg)
http://www.eurofinsdna.com
![Page 10: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/10.jpg)
Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.
![Page 11: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/11.jpg)
![Page 12: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/12.jpg)
![Page 13: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/13.jpg)
Name
Confidence call
Sequence
![Page 14: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/14.jpg)
![Page 15: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/15.jpg)
![Page 16: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/16.jpg)
![Page 17: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/17.jpg)
![Page 18: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/18.jpg)
![Page 19: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/19.jpg)
![Page 20: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/20.jpg)
![Page 21: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/21.jpg)
CCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
Read
Alignment
Reference genome
![Page 22: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/22.jpg)
Hypothesis #1
Genome ref
Read
Hypothesis #2
Read
Genome ref
![Page 23: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/23.jpg)
Quality, Q, is a function of the probability, P, that the sequencer called a wrong base:
$$Q = -10 \log_{10}(P)$$
Q is estimated by the sequencing software.
Q=10: 1 in 10 chance that the base was miscalled
Q=20: 1 in 100 chance that the base was miscalled
Q=30: 1 in 1000 chance that the base was miscalled
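The relation between Q and P can be checked numerically. The following is a minimal sketch; the function names `phred_quality` and `error_probability` are illustrative choices, not taken from the slides.

```python
import math


def phred_quality(p_error):
    """Quality score Q for a miscall probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Inverse relation: miscall probability P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(f"P = {p}: Q = {phred_quality(p):.0f}")
```

Each factor of ten in the error probability changes Q by 10, which is why Q=10, 20, 30 correspond to 1 in 10, 1 in 100 and 1 in 1000.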
![Page 24: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/24.jpg)
Genome ref
Read
Genome ref
Read
Hypothesis #1
Hypothesis #2
![Page 25: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/25.jpg)
Genome ref
Read
Genome ref
Read
Hypothesis #1
Hypothesis #2
Q=30
Q=10
![Page 26: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/26.jpg)
A penalty scheme to account for different types of dissimilarities
Let’s stipulate that small gaps (indels) occur in bacterial genomes at 1 in 10K positions:
$$\mathrm{Penalty}_{gap} = -10 \log_{10}(P_{gap}) = -10 \log_{10}(0.0001) = 40$$
Let’s stipulate that SNPs occur in bacterial genomes at 1 in 1K positions, but depending on Q, a sequencing error may be more likely:
$$\mathrm{Pen}_{min} \equiv \min\{-10 \log_{10}(P_{miscall}),\ -10 \log_{10}(P_{SNP})\} \equiv \min\{Q,\ -10 \log_{10}(0.001)\} = \min\{Q,\ 30\}$$
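The penalty scheme can be sketched in a few lines, assuming the stipulated rates (indels at 1 in 10K, SNPs at 1 in 1K). The constant and function names are hypothetical and chosen only for this illustration.

```python
import math

GAP_RATE = 1e-4  # stipulated indel rate: 1 in 10K positions
SNP_RATE = 1e-3  # stipulated SNP rate: 1 in 1K positions


def phred(p):
    """Convert a probability into a Phred-scaled penalty."""
    return -10 * math.log10(p)


def gap_penalty():
    """Penalty for a small gap: -10 * log10(P_gap)."""
    return phred(GAP_RATE)


def mismatch_penalty(q):
    """Penalty for a mismatch at a base with quality q.

    The cheaper explanation wins: either the base was miscalled (cost q)
    or the genome really carries a SNP (cost -10 * log10(P_SNP)).
    """
    return min(q, phred(SNP_RATE))


if __name__ == "__main__":
    print("gap penalty:", round(gap_penalty()))
    for q in (10, 30, 40):
        print(f"mismatch at Q={q}:", round(mismatch_penalty(q)))
```

A low-quality base (Q=10) is penalized only 10 because a miscall is the likelier explanation, while a high-quality base (Q=40) is capped at the SNP penalty of 30.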
![Page 27: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/27.jpg)
Penalty_gap = 40
Penalty_SNP = 30
Penalty = 40
Q=40
Q=10
Penalty = 70
Penalty = 50
![Page 28: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/28.jpg)
How do we find our sequence in the first place?
![Page 29: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/29.jpg)
Local alignment
Given a string P ("pattern") of length m and a string T ("text") of length n, find substrings a and b of P and T, respectively, having maximal global alignment score.

| Event | Example (pattern over text) | Penalty |
| --- | --- | --- |
| Match | `TTG` over `TTG` | +15 |
| Mismatch | `TCG` over `TTG` | -30 |
| Gap in pattern | `T-G` over `TTG` | -40 |
| Gap in text | `TCG` over `T-G` | -40 |
![Page 30: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/30.jpg)
Smith-Waterman
![Page 31: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/31.jpg)
Smith-Waterman
![Page 32: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/32.jpg)
Smith-Waterman
![Page 33: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/33.jpg)
Smith-Waterman
![Page 34: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/34.jpg)
Smith-Waterman
![Page 35: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/35.jpg)
Smith-Waterman
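The slide images for this section are not available in the transcript, so the following is a generic sketch of the Smith-Waterman recurrence rather than a reproduction of the slides. It assumes the scores listed for the local alignment problem (+15 match, -30 mismatch, -40 gap) and a linear gap cost; the function name and return format are illustrative.

```python
def smith_waterman(pattern, text, match=15, mismatch=-30, gap=-40):
    """Return (best score, aligned pattern part, aligned text part)."""
    m, n = len(pattern), len(text)
    # H[i][j] = best score of a local alignment ending at pattern[i-1], text[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if pattern[i - 1] == text[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh alignment
                          H[i - 1][j - 1] + sub,  # match or mismatch
                          H[i - 1][j] + gap,      # gap in text
                          H[i][j - 1] + gap)      # gap in pattern
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell until the score hits zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if pattern[i - 1] == text[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(pattern[i - 1])
            bottom.append(text[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(pattern[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(text[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTG", "ACTTGCA"))
```

The table takes O(mn) time and memory, which is what motivates the indexing and heuristic methods later in the deck.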
![Page 36: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/36.jpg)
![Page 37: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/37.jpg)
Dynamic Programming
![Page 38: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/38.jpg)
Indexing
![Page 39: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/39.jpg)
Indexing
![Page 40: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/40.jpg)
Dot Plots
Window:2, Stringency:1
![Page 41: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/41.jpg)
Dot Plots
Window:2, Stringency:1
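A dot plot with a window and stringency can be sketched as follows. This assumes the usual reading of the two parameters: a dot is placed at (i, j) when at least `stringency` of the `window` positions starting at i and j match. The function names are illustrative.

```python
def dot_plot(seq_a, seq_b, window=2, stringency=1):
    """Return the set of (i, j) window start positions that get a dot."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq_a[i + k] == seq_b[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text, with seq_a vertical and seq_b horizontal."""
    lines = ["  " + seq_b]
    for i, ch in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GATTACA", "GATCACA"
    print(render(a, b, dot_plot(a, b, window=2, stringency=2)))
```

Raising the stringency relative to the window removes isolated chance matches and leaves the diagonal runs that indicate similarity.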
![Page 42: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/42.jpg)
Dot matrix analysis of DNA sequences encoding λ cI repressor (vertical) and P22 c2 repressor
Window: 11, Stringency: 7
![Page 43: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/43.jpg)
Analysis of the regions of low complexity
![Page 44: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/44.jpg)
Calculation of an alignment score
![Page 45: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/45.jpg)
Pairwise Alignment Examples (II)
A dispersed alignment without gaps may have a higher score than a more visually appealing alignment with gaps.
![Page 46: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/46.jpg)
An alignment scoring system is required to evaluate how good an alignment is
• positive and negative values assigned
• gap creation and extension penalties
• positive score for identities
• some partial positive score for conservative substitutions
• global versus local alignment
• use of a substitution matrix
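The elements of a scoring system listed above can be combined in a short scoring function. This is a sketch under assumed values: the match, mismatch, gap creation and gap extension scores are placeholders, and a real protein scoring system would look up each pair in a substitution matrix instead of using a single match/mismatch value.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two equal-length alignment rows in which '-' marks a gap.

    Assumed convention: the first position of a gap costs gap_open and
    every further position of the same gap costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap = None  # which row carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            total += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            total += match if x == y else mismatch
            prev_gap = None
    return total


if __name__ == "__main__":
    print(score_alignment("ACGT", "ACGT"))
    print(score_alignment("AC--GT", "ACTTGT"))
```

With these placeholder values a two-base gap costs 6, so an ungapped alignment with a few mismatches can outscore a gapped alignment that looks better by eye.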
![Page 47: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/47.jpg)
“Window location” by FASTA and BLAST
![Page 48: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/48.jpg)
Two kinds of sequence alignment: global and local
The global alignment algorithm of Needleman and Wunsch (1970).
The local alignment algorithm of Smith and Waterman (1981).
BLAST, a heuristic version of Smith-Waterman.
![Page 49: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/49.jpg)
Should the result of the alignment include all residues (amino acids or nucleotides), or just those that "match"?
• If all of them, a global alignment is desired. In a global alignment, the presence of mismatched elements is neutral: it doesn't affect the overall match score.
• If just those that match, a local alignment is desired. Local alignments are accomplished by including negative scores for "mismatched" positions, so scores get worse as we move away from the region of match. Instead of starting the traceback with the highest value in the first row or column, start with the highest value in the entire matrix and stop when the score hits zero.
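The difference between the two modes comes down to two changes in the same dynamic programming table, which the sketch below makes explicit. The scores (match +1, mismatch -1, gap -2) are assumptions for illustration; passing `mismatch=0` gives the "mismatches are neutral" global scheme described above.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (default) or local alignment score of a and b."""
    # Global: leading gaps are charged. Local: every cell may restart at 0.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            value = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if local:
                value = max(0, value)  # never carry a negative prefix
            cur.append(value)
        best = max(best, max(cur))
        prev = cur
    # Local: highest cell anywhere. Global: bottom-right cell.
    return best if local else prev[-1]


if __name__ == "__main__":
    x, y = "TTTACGTTTT", "GGGACGGGG"
    print("global:", alignment_score(x, y))
    print("local: ", alignment_score(x, y, local=True))
```

For two sequences that share only a short core, the local score reflects the core alone while the global score is dragged down by the unrelated flanks.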
![Page 50: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/50.jpg)
What is Database Search ?
• Find a particular (usually) short sequence in a database of sequences (or one huge sequence).
• Problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
  – Databases always return some kind of hit; how much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences
• General idea: good alignments contain high scoring regions.
![Page 51: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/51.jpg)
Imperfect Alignment
• What is an imperfect alignment?
• Why imperfect alignment?
• The result may not be optimal.
• Finding the optimal alignment is usually too costly in terms of time and memory.
![Page 52: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/52.jpg)
Database Search Methods
• Hash table based methods
  – FASTA family
    • FASTP, FASTA, TFASTA, FASTAX, FASTAY
  – BLAST family
    • BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST
  – Others
    • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
  – Mummer, AVID, Reputer, MGA, QUASAR
![Page 54: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/54.jpg)
History of sequence searching
• 1970: NW
• 1981: SW
• 1985: FASTA
• 1990: BLAST
![Page 55: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/55.jpg)
Hash Table
• K-gram = subsequence of length K
• A^k entries
– A is alphabet size
• Linear time construction
• Constant lookup time
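A k-gram index of this kind can be sketched with a Python dictionary. Note the assumption: a dictionary stores only the k-grams that actually occur, rather than a full table of A^k entries, but it keeps the same linear construction and expected constant-time lookup. The function name is illustrative.

```python
from collections import defaultdict


def build_kgram_index(seq, k):
    """Map every k-gram of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):  # one pass over the sequence
        index[seq[i:i + k]].append(i)
    return index


if __name__ == "__main__":
    index = build_kgram_index("TGAGTGCGA", 2)
    print(index["TG"])  # every start position of the 2-gram TG
    print(index["GA"])
```

Looking up a k-gram returns every position where it occurs, which is the starting point for the seed-based methods that follow.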
![Page 56: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/56.jpg)
FASTP
Lipman & Pearson, 1985
![Page 57: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/57.jpg)
FASTP
• Three phase algorithm
1. Find short good matches using k-grams
   – K = 1 or 2
2. Find start and end positions for good matches
3. Use DP to align good matches
![Page 59: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/59.jpg)
FASTP: Phase 1 (2)
• Similar to dot plot
• Offsets range from 1-m to n-1
• Each offset is scored as
  – # matches - # mismatches
• Diagonals (offsets) with large score show local similarities
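The diagonal scoring can be sketched directly from the bullets above. This is a simplified illustration with K = 1 that compares letters pairwise along each diagonal; the actual FASTP implementation finds the matches through a k-gram lookup table rather than by scanning every diagonal. Here the offset is taken to be (text position - query position), with m the query length and n the text length, which is an assumed reading of the slide.

```python
def diagonal_scores(query, text):
    """Score every offset d = j - i as (# matches) - (# mismatches)."""
    m, n = len(query), len(text)
    scores = {}
    for d in range(1 - m, n):  # offsets 1-m .. n-1
        matches = mismatches = 0
        # Query positions i for which text position i + d exists.
        for i in range(max(0, -d), min(m, n - d)):
            if query[i] == text[i + d]:
                matches += 1
            else:
                mismatches += 1
        scores[d] = matches - mismatches
    return scores


if __name__ == "__main__":
    scores = diagonal_scores("CGT", "AACGTAA")
    best = max(scores, key=scores.get)
    print("best offset:", best, "score:", scores[best])
```

The best-scoring offsets are the candidates that Phase 2 rescores.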
![Page 60: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/60.jpg)
FASTP: Phase 2
• 5 best diagonal runs are found
• Rescore these 5 regions using PAM250.
– Initial score
• Indels are not considered yet
![Page 61: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/61.jpg)
FASTP: Phase 3
• Sort the aligned regions in descending score
• Optimize these alignments using Needleman-Wunsch
• Report the results
![Page 62: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/62.jpg)
FASTA – Improvement Over FASTP
Pearson 1995
![Page 63: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/63.jpg)
FASTA (1)
• Phase 2: Choose 10 best diagonal runs instead of 5
![Page 64: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/64.jpg)
FASTA (2)
• Phase 2.5
  – Eliminate diagonals that score less than some given threshold.
  – Combine matches to find longer matches. This incurs a join penalty similar to a gap penalty.
![Page 65: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/65.jpg)
FASTA Variations
• TFASTAX and TFASTAY: query protein against a DNA library in all reading frames
• FASTAX, FASTAY: DNA query in all reading frames against protein database
![Page 66: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/66.jpg)
BLAST
Altschul, Gish, Miller, Myers, Lipman, 1990
![Page 67: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/67.jpg)
BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search Tool
• An approximation of Smith-Waterman
• Designed for database searches
– Short query sequence against long database sequence or a database of many sequences
• Sacrifices search sensitivity for speed
![Page 68: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/68.jpg)
BLAST Algorithm (1)
• Eliminate low complexity regions from the query sequence.
  – Replace them with X (protein) or N (DNA)
• Hash table on query sequence.
  – K = 3 for proteins
  – Example: the query MCGPFILGTYC yields the k-grams MCG, CGP, and so on
![Page 69: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/69.jpg)
BLAST Algorithm (2)
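• For each k-gram find all k-grams that align with score at least cutoff T using BLOSUM62
  – 20^k candidates
  – ~50 on average per k-gram
  – ~50n for the entire query
• Build hash table

Example: the query PQGMCGPFILGTYC yields the k-grams PQG, QGM, and so on. Candidate neighbors of PQG, with T = 13:

| Candidate | Score |
| --- | --- |
| PQG | 18 |
| PEG | 15 |
| PRG | 14 |
| PSG | 13 |
| PQA | 12 (below T) |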
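The neighborhood step can be sketched as an exhaustive enumeration of candidate words. To stay self-contained, the example below uses a toy score table that holds only the BLOSUM62 entries needed to reproduce the PQG example on the slide. Every other pair falls back to a placeholder value, which is an assumption made for illustration and not a real BLOSUM62 score; a real implementation would load the full matrix.

```python
from itertools import product

# Only the BLOSUM62 entries used in the slide's PQG example.
PARTIAL_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}
MISSING = -4  # ASSUMPTION: placeholder for pairs absent from the toy table


def pair_score(a, b):
    """Symmetric lookup in the toy table, with the placeholder fallback."""
    return PARTIAL_SCORES.get((a, b), PARTIAL_SCORES.get((b, a), MISSING))


def neighborhood(word, alphabet, threshold):
    """All words over alphabet that score at least threshold against word."""
    hits = {}
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(pair_score(x, y) for x, y in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


if __name__ == "__main__":
    # Restricted alphabet: only the letters that appear in the example.
    for w, s in sorted(neighborhood("PQG", "PQGERSA", 13).items(),
                       key=lambda item: -item[1]):
        print(w, s)
```

Each surviving word is added to the hash table alongside the original k-gram, so a database word can seed an alignment even when it is not an exact copy of the query word.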
![Page 70: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/70.jpg)
BLAST Algorithm (3)
• Sequentially scan the database and locate each k-gram in the hash table
• Each match is a seed for an ungapped alignment.
![Page 71: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/71.jpg)
BLAST Algorithm (4)
• HSP (High Scoring Pair) = A match between a query word and the database
• Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A
• Extend the hit until the score falls below a threshold value, X
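The extension step can be sketched as below. It assumes the common "X-drop" reading of the threshold: extension stops once the running score has fallen X or more below the best score seen so far. Simple match/mismatch scores stand in for a substitution matrix, only the rightward direction is shown (leftward extension is symmetric), and all names are illustrative.

```python
def extend_right(query, text, q_start, t_start, x_drop,
                 match=1, mismatch=-1):
    """Ungapped extension to the right of a seed.

    Returns (best score, number of extended positions that achieve it).
    """
    score = best = 0
    best_len = 0
    k = 0
    while q_start + k < len(query) and t_start + k < len(text):
        same = query[q_start + k] == text[t_start + k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score >= x_drop:
            break  # score dropped too far below the best: stop extending
    return best, best_len


if __name__ == "__main__":
    print(extend_right("ACGTACGT", "ACGTTTTT", 0, 0, x_drop=2))
```

The extension is trimmed back to the position of the best score, so trailing mismatches are never part of the reported pair.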
![Page 72: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/72.jpg)
BLAST Algorithm (5)
• Keep only the extended matches that have a score at least S.
• Determine the statistical significance of the result
![Page 73: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/73.jpg)
BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.
![Page 74: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/74.jpg)
BLAST Variations
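| Program | Query | Target | Type |
| --- | --- | --- | --- |
| BLASTP | Protein | Protein | Gapped |
| BLASTN | Nucleic acid | Nucleic acid | Gapped |
| BLASTX | Nucleic acid | Protein | Gapped |
| TBLASTN | Protein | Nucleic acid | Gapped |
| TBLASTX | Nucleic acid (translated) | Nucleic acid | Gapped |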
![Page 75: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/75.jpg)
Even More Variations
– PsiBLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
– Main differences are
• Seed choice (k, gapped seeds)
• Additional data structures
![Page 76: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/76.jpg)
Suffix Tree
• Tree structure that contains all suffixes of the input sequence
• TGAGTGCGA
• GAGTGCGA
• AGTGCGA
• GTGCGA
• TGCGA
• GCGA
• CGA
• GA
• A
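The idea of storing every suffix can be sketched with a naive suffix trie built from nested dictionaries. This is a deliberate simplification: it inserts each suffix letter by letter, so it takes quadratic time and space, whereas a true suffix tree compresses single-child paths and can be built in linear time. The function names and the `$` end marker are illustrative choices.

```python
def build_suffix_trie(seq):
    """Insert every suffix of seq into a trie of nested dicts."""
    root = {}
    for start in range(len(seq)):
        node = root
        for ch in seq[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start  # marks the end of the suffix that began at start
    return root


def contains(trie, pattern):
    """True if pattern occurs in the indexed sequence.

    Any substring is a prefix of some suffix, so the search simply walks
    down from the root, one letter of the pattern per step.
    """
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("TGAGTGCGA")
    print(contains(trie, "GTGC"), contains(trie, "GAC"))
```

The search visits one node per pattern letter, independent of the length of the indexed sequence.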
![Page 77: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/77.jpg)
Suffix Tree Example
![Page 78: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/78.jpg)
Suffix Tree Analysis
• O(n) space and construction time
  – 10n to 70n space usage reported
• O(m) search time for m-letter sequence
• Good for
  – Small data
  – Exact matches
![Page 79: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/79.jpg)
Suffix Array
• 5 bytes per letter
• O(m log n) search time
• Better space usage
• Slower search
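A suffix array and its O(m log n) binary search can be sketched as follows. The construction shown simply sorts the suffixes, which is easy to read but slower than the specialized construction algorithms used in practice. The function names are illustrative.

```python
def build_suffix_array(seq):
    """Start positions of all suffixes of seq, in lexicographic order."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def find_occurrences(seq, suffix_array, pattern):
    """All start positions of pattern, found by two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return seq[start:start + m]

    # First suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-letter prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    seq = "TGAGTGCGA"
    sa = build_suffix_array(seq)
    print(sa)
    print(find_occurrences(seq, sa, "GA"))
```

Each binary search makes O(log n) comparisons of up to m letters, which gives the O(m log n) search time, and the index itself is just one integer per letter of the sequence.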
![Page 80: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/80.jpg)
Mummer
![Page 81: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/81.jpg)
Other Sequence Comparison Tools
• Reputer, MGA, AVID
• QUASAR (suffix array)
![Page 82: Basics of sequence analysis Ch.6 and ChBasics of sequence analysis Ch.6 and Ch.7 • Sequence acquisition • Sequence data • Reconstructing sequence ... • Problem is identical](https://reader034.vdocument.in/reader034/viewer/2022050122/5f522bb3920977566c7b8bdc/html5/thumbnails/82.jpg)
Uses of sequence alignment
• Search databases
• Assess similarity, relatedness
• Identify structural variations (point, gross)
• Determine specificity of primers
• Evaluate complexity of a sequence
• Assemble sequence de novo