Indexing DNA sequences for local similarity search
Joint work of Angela, Dr. Mamoulis and Dr. Yiu
17/5/2007
Outline
  Introduction
    DNA sequences
    Local similarity search
  Related works
    BLAST
  Prefix-suffix hashing scheme
  Experimental results
  Conclusion
  Future work
DNA sequences
  DNA exists in the chromosomes of organisms
  A genome is all the DNA in an organism
  Composed of 4 nucleotides: A, C, G, T
  Humans have 23 pairs of chromosomes, amounting to about 3 billion bp
  Public DNA databases contain the genomes of organisms and information about them
DNA similarity
  DNA sequences contain special regions, e.g. genes, motifs
  Some regions are conserved across species
  Similar regions may imply similar functions and structures
  Given a sequence being studied (the query), search for similar regions in the database sequences
Similarity measurement
  Σ = {A, C, G, T}
  Sequence alignment
    Align sequences S and T
    Insert spaces in S and T to form S' and T'
  Scoring matrix σ
    Match/mismatch scoring
    Let x and y be two aligned characters or spaces from two sequences, $x, y \in \Sigma \cup \{\text{space}\}$
      $$\sigma(x, y) = \begin{cases} R & \text{if } x = y \text{ and } x \neq \text{space} \\ P & \text{if } x \neq y \\ -\infty & \text{if } x = y = \text{space} \end{cases}$$
    where R (reward) is positive and P (penalty) is negative
Gap penalty
  Gap = a maximal subsequence of spaces in an alignment
  Affine gap penalty
    $W_g + q W_s$
    where $W_g$ and $W_s$ are constants, $W_g \geq 0$, $W_s \geq 0$, and $q \geq 1$ is the gap length
  Penalty of one length-q gap < penalty of q separate deletions/insertions
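To make the last point concrete, here is a minimal sketch of the affine penalty. The function name and the constants 5 and 2 are illustrative assumptions, not values prescribed by the slides.

```python
def affine_gap_penalty(q: int, w_g: float, w_s: float) -> float:
    """Penalty of a single gap of length q under the affine model: Wg + q * Ws."""
    if q < 1:
        raise ValueError("gap length q must be at least 1")
    return w_g + q * w_s


# Illustrative constants (assumed for this example only).
W_G, W_S = 5, 2

# One gap of length 3 pays the opening cost Wg once ...
one_long_gap = affine_gap_penalty(3, W_G, W_S)          # 5 + 3 * 2

# ... whereas three separate length-1 gaps pay it three times.
three_short_gaps = 3 * affine_gap_penalty(1, W_G, W_S)  # 3 * (5 + 2)

print(one_long_gap, three_short_gaps)
```

With these assumed constants the single long gap costs 11 and the three short gaps cost 21, so the inequality holds whenever Wg > 0 and q > 1.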
DNA sequence alignments
  Global alignment
    Needleman-Wunsch algorithm (1970)
      A C - G T T C A
      A C C G - - G A
  Local alignment
    Smith-Waterman algorithm (1981)
      A C C G T A G C A C G T
      - C C A T A - - A C G -
  Dynamic programming
    Optimal solution
    Time and space complexity O(mn), where m and n are the lengths of the two sequences
Global alignment
  Input: two sequences S and T
  Output: alignment of S and T with the highest score
  V(i, j): the optimal score to align S[1..i] and T[1..j]
  Basis:
    $V(0, 0) = 0$
    $V(i, 0) = V(i-1, 0) + \sigma(S[i], \text{-})$
    $V(0, j) = V(0, j-1) + \sigma(\text{-}, T[j])$
  Recurrence:
    $V(i, j) = \max \{\, V(i-1, j-1) + \sigma(S[i], T[j]),\; V(i-1, j) + \sigma(S[i], \text{-}),\; V(i, j-1) + \sigma(\text{-}, T[j]) \,\}$
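The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes R = 1 and P = -1, writes a space as "-", and fills the boundary row and column by accumulating the score of aligning each character with a space. It returns only the optimal score and omits traceback.

```python
def sigma(x: str, y: str, reward: int = 1, penalty: int = -1) -> int:
    """Match/mismatch score; '-' stands for a space (assumed R = 1, P = -1)."""
    if x == "-" and y == "-":
        raise ValueError("two spaces are never aligned with each other")
    return reward if x == y else penalty


def global_alignment_score(s: str, t: str) -> int:
    """Optimal global alignment score V(m, n) by dynamic programming."""
    m, n = len(s), len(t)
    v = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: a prefix aligned against nothing but spaces.
    for i in range(1, m + 1):
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, n + 1):
        v[0][j] = v[0][j - 1] + sigma("-", t[j - 1])
    # Recurrence: diagonal (match/mismatch), up (space in T), left (space in S).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                v[i - 1][j] + sigma(s[i - 1], "-"),
                v[i][j - 1] + sigma("-", t[j - 1]),
            )
    return v[m][n]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (m + 1)(n + 1) cells, each filled in constant time, which is where the O(mn) time and space bound comes from.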
Local alignment
  Input: two sequences S and T
  Output:
    Substring A from S
    Substring B from T
    Score of the optimal (global) alignment of A and B
  V(i, j): the optimal score to align a substring of S ending at i with a substring of T ending at j
  Basis:
    $V(i, 0) = 0$, $V(0, j) = 0$
  Recurrence:
    $V(i, j) = \max \{\, 0,\; V(i-1, j-1) + \sigma(S[i], T[j]),\; V(i-1, j) + \sigma(S[i], \text{-}),\; V(i, j-1) + \sigma(\text{-}, T[j]) \,\}$
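The only changes from the global case are the zero boundary and the extra 0 in the maximum, as this minimal sketch shows. It assumes R = 1 and P = -1 for both mismatches and spaces. The function name and the returned end positions (1-based) are choices made for this illustration.

```python
def local_alignment_score(s: str, t: str, reward: int = 1, penalty: int = -1):
    """Return (best score, end position in s, end position in t), 1-based.

    The 0 in the recurrence lets an alignment restart anywhere, so the best
    cell in the whole table marks the end of the best local alignment.
    """
    m, n = len(s), len(t)
    v = [[0] * (n + 1) for _ in range(m + 1)]  # V(i, 0) = V(0, j) = 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = v[i - 1][j - 1] + (reward if s[i - 1] == t[j - 1] else penalty)
            v[i][j] = max(0, diag, v[i - 1][j] + penalty, v[i][j - 1] + penalty)
            if v[i][j] > best:
                best, best_i, best_j = v[i][j], i, j
    return best, best_i, best_j


# The shared region ACGT ends at position 7 of the first string and 6 of the second.
print(local_alignment_score("TTTACGTAAA", "GGACGTCC"))
```

Recovering the substrings A and B themselves would require a traceback from the best cell until a 0 is reached, which is left out here.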
Local similarity search
  Input
    Two DNA sequences
  Output
    The alignments of the regions from the two sequences that score higher than a score threshold
Database search
  Input
    A query sequence and a sequence database
  Output
    The local similarity search results between the query sequence and each database sequence
  Objectives:
    Perform local similarity search fast
    Maintain search sensitivity
BLAST
  Basic Local Alignment Search Tool
  By NCBI (National Center for Biotechnology Information) of the US Government
  Finds regions of local similarity between sequences (DNA, RNA or proteins)
  Applies heuristics: fast
  Applies statistical theory: relatively accurate
Sample BLAST result
  Score = 44.4 bits (27), Expect = 0.013
  Identities = 37/47 (78%)
  Strand = Plus / Minus

  Query: 6         caggggtccaggcccccagcccctctcctgggcccctcaccccgcgg 52
                   |||||||||  |||||||||||  ||||| ||| || |   ||||||
  Sbjct: 199635477 caggggtccccgcccccagcccagctcctcggcaccccgggccgcgg 199635431

  Score = 44.4 bits (27), Expect = 0.013
  Identities = 35/43 (81%)
  Strand = Plus / Minus

  Query: 333       ccccgtttctcggatggaaaaactgaggctccgaaagcagaag 375
                   |||| |||| | ||||||||||||||||| | || ||  ||||
  Sbjct: 505025625 ccccatttcacagatggaaaaactgaggcccagagagaggaag 505025583
Sample BLAST result
  Matrix: blastn matrix:1 -1
  Gap Penalties: Existence: 5, Extension: 2
  Number of Sequences: 1
  Number of Hits to DB: 2,526,608
  Number of extensions: 138741
  Number of successful extensions: 27
  Number of sequences better than 1.0: 1
  Number of HSP's gapped: 27
  Number of HSP's successfully gapped: 27
  Length of query: 375
  Length of database: 880,975,758
  Length adjustment: 44
  Effective length of query: 331
  Effective length of database: 880,975,714
  Effective search space: 291602961334
  Effective search space used: 291602961334
How BLAST works
  Split a search into phases
    Hit generation
    Ungapped extension
    Gapped extension
    Traceback
  Configurable parameters
    Word length W
    Match reward R
    Mismatch penalty P
    Cutoff score S
    Dropoff score X
    E-value threshold E
Hit generation
  Word hits (length W, default = 11)
  Database sequences are compressed:
    A = 00, C = 01, G = 10, T = 11
    Compression factor = 4 (2 bits per nucleotide instead of one byte)
  Build a lookup table on sliding windows of the query sequence
    4-sliding window of length 8
  Scan the compressed database sequence for exact matches present in the lookup table
  Extend the exact matches of length 8 to W
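The two ingredients of this phase, 2-bit packing and a lookup table built on the query, can be sketched as follows. This is a simplified illustration and not BLAST's actual implementation. It scans the uncompressed database string for clarity and matches words of length w directly, rather than matching length-8 words and extending them to W. The function names and the padding of the last byte with A are assumptions.

```python
from collections import defaultdict

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> bytes:
    """Pack 4 nucleotides into each byte, 2 bits per nucleotide."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")  # pad the last byte with A (assumption)
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        out.append(byte)
    return bytes(out)


def build_lookup(query: str, w: int) -> dict:
    """Map every length-w word of the query to its query offsets."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def word_hits(query: str, db: str, w: int) -> list:
    """Scan db once; report (query offset, db offset) for every exact word match."""
    table = build_lookup(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        for i in table.get(db[j:j + w], ()):
            hits.append((i, j))
    return hits


print(pack("ACGT"))                       # one byte: 00 01 10 11
print(word_hits("ATACGT", "GCACGT", 4))   # the shared word ACGT
```

Because the table is built on the short query, the database is read exactly once per search, which is why the cost of this phase is dominated by the size of the (compressed) database file.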
Ungapped extension
  Extend the word hits in both directions until the score drops by X or more
  The extended hit qualifies if it scores higher than the cutoff score S
  Example: X = 2, S = 3
    Query:   A T A C G T A C G T A C G T
    DB seq:  G C A C G T A C G C G T
    Per-position scores      Score
       1 1 1 1 1 1           6
       1 1 1 1 1 1 1         7 (drop -1)
    -2 1 1 1 1 1 1 1         5 (drop 2)
    -2 1 1 1 1 1 1 1 -2      3 (drop 2)
    Extended hit = CACGTACGC
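A minimal sketch of X-drop extension follows, using the per-position scores from the example (match = 1, mismatch = -2). It makes one assumption that differs from the slide: each direction is trimmed back to its best-scoring point, so the terminating mismatches are not kept in the returned segment. All names are hypothetical.

```python
def extend_one_way(q, d, qi, dj, step, x_drop, match, mismatch):
    """Walk from (qi, dj) in direction step; return (best gain, length at best)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(q) and 0 <= dj < len(d):
        gain += match if q[qi] == d[dj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain >= x_drop:
            break  # score dropped by X or more below the best seen
        qi += step
        dj += step
    return best_gain, best_len


def ungapped_extend(q, d, qi, dj, w, x_drop=2, match=1, mismatch=-2):
    """Extend the exact word hit q[qi:qi+w] == d[dj:dj+w] in both directions."""
    right_gain, right_len = extend_one_way(q, d, qi + w, dj + w, +1, x_drop, match, mismatch)
    left_gain, left_len = extend_one_way(q, d, qi - 1, dj - 1, -1, x_drop, match, mismatch)
    score = w * match + left_gain + right_gain
    return score, d[dj - left_len:dj + w + right_len]


# Word hit ACGTAC at offset 2 in both sequences (0-based), length 6.
print(ungapped_extend("ATACGTACGTACGT", "GCACGTACGCGT", 2, 2, 6))
```

On the example sequences this trimmed variant reports ACGTACG with score 7: the extra G on the right is kept, and the mismatching C on each side, which the slide includes in CACGTACGC, is dropped.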
Gapped extension + traceback
  Extend the hits in both directions
  Allow gaps
  Perform restricted dynamic programming on the gapped extended hits
E-value
  Low-complexity regions: about half the human genome is easily recognized as repetitive
  The significance of a hit is evaluated by its E-value
  A hit is statistically significant if its score is higher than one obtained from two random sequences
  The alignment scores of two random sequences follow the extreme value distribution
  The expected number of hits with score at least S is given by
    $E = K m n e^{-\lambda S}$
    where m and n are the lengths of the two sequences, and K and λ are constants that depend on the scoring scheme
  The smaller the E-value, the more statistically significant the hit
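The formula is easy to evaluate directly. The sketch below also uses the standard bit-score normalisation S' = (λS - ln K) / ln 2, under which E = m n 2^(-S'). That identity is not on the slides and is brought in here only to check the sample BLAST result shown earlier. Function names are hypothetical, and the K and λ values in the consistency check are arbitrary illustrative numbers.

```python
import math


def expected_hits(k: float, lam: float, m: int, n: int, s: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * s)


def bit_score(k: float, lam: float, s: float) -> float:
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * s - math.log(k)) / math.log(2)


def evalue_from_bits(search_space: float, bits: float) -> float:
    """E = (m * n) * 2 ** (-S'), with m * n given as the effective search space."""
    return search_space * 2.0 ** (-bits)


# Sample result: 44.4 bits, effective search space 291602961334.
e = evalue_from_bits(291_602_961_334, 44.4)
print(round(e, 3))

# Both forms agree for any K and lambda (values below are illustrative only).
raw = expected_hits(0.1, 1.0, 100, 1000, 20)
via_bits = evalue_from_bits(100 * 1000, bit_score(0.1, 1.0, 20))
print(math.isclose(raw, via_bits))
```

The first value comes out at about 0.013, in line with the Expect value in the sample output. The match is approximate because the reported bit score is itself rounded.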
Extreme Value Distribution
  Positively skewed tail
  Higher probability of a high score than under the normal distribution
  [Figure: extreme value distribution curve; axis label s; annotation ln K / λ]
Prefix-Suffix Hashing Scheme
  Goals
    Speed up hit generation and ungapped extension
    Reduce the number of hits so as to reduce the processing costs of the later phases
  Design
    Build hashing indexes on the database sequences
    The index stores the offsets of the words (length W) of the database sequence
    During a search, for each sliding window of the query sequence, look up the index for the offsets of the hits in the database sequence
Index structure
  Word pattern of length W
    Partitioned into a prefix and a suffix
    The prefix and the suffix are each represented by a hash value
      $H(T) = \sum_{i=0}^{|T|-1} 4^i \cdot V(T[i])$, with V(A) = 0, V(C) = 1, V(G) = 2, V(T) = 3
  For each possible prefix
    Lookup file
      For each possible suffix
        Pointers to the actual offsets of the word pattern
        Total number N of offsets
    Entry file
      For each possible suffix
        The N offsets
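The hash function is a base-4 encoding of the word, with the first character as the least significant digit. A minimal sketch, with a hypothetical function name:

```python
V = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_hash(t: str) -> int:
    """H(T) = sum over i in [0, |T|-1] of 4**i * V(T[i])."""
    return sum(4 ** i * V[ch] for i, ch in enumerate(t))


print(word_hash("AAAAA"))  # smallest length-5 prefix
print(word_hash("CA"))     # C is digit 0, so it contributes 1 * 4**0
print(word_hash("AC"))     # C is digit 1, so it contributes 1 * 4**1
print(word_hash("TTTTT"))  # largest length-5 prefix: 4**5 - 1
```

For a fixed word length the mapping is one-to-one onto 0 .. 4^|T| - 1, so the hash value can serve directly as a position in the lookup file without collision handling.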
Index structure
  [Diagram: lookup files and entry files. In the lookup files, each prefix (AAAAA, AAAAC, …) has one record per suffix (AAAAAA, AAAAAC, …) holding the pointers and the number of offsets; the pointers lead to the list of offsets in the entry files. The per-prefix lookup files are merged.]
Build the index
  For each sliding window of the database sequence,
    Divide it into a prefix of length P and a suffix of length S
    Store its offset with the prefix and suffix
  Flush the offsets to the disk if memory is full
  Reorganise the offsets on the disk into the corresponding lookup files and entry files
  Merge the lookup files into one
During a search
  Divide the query sequence into sliding windows of length W
  For each sliding window,
    Compute the hash values of its prefix, HP, and its suffix, HS
  Sort the sliding windows by their HP, then by their HS
  Access the lookup file for HS at the HP block
  Access the entry file for the offsets of the hits of the word
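The build and search steps can be sketched with an in-memory stand-in for the on-disk files. This is a simplification under stated assumptions: a nested dictionary keyed by HP and then HS replaces the lookup and entry files, nothing is flushed or merged, and all function names are hypothetical.

```python
from collections import defaultdict

V = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_hash(t: str) -> int:
    """Base-4 hash of a word, first character least significant."""
    return sum(4 ** i * V[ch] for i, ch in enumerate(t))


def build_index(db: str, p: int, s: int):
    """index[HP][HS] -> list of database offsets of the word (prefix + suffix)."""
    w = p + s
    index = defaultdict(lambda: defaultdict(list))
    for off in range(len(db) - w + 1):
        word = db[off:off + w]
        index[word_hash(word[:p])][word_hash(word[p:])].append(off)
    return index


def search(index, query: str, p: int, s: int):
    """Return (query offset, database offset) for every word hit."""
    w = p + s
    windows = []
    for qoff in range(len(query) - w + 1):
        word = query[qoff:qoff + w]
        windows.append((word_hash(word[:p]), word_hash(word[p:]), qoff))
    windows.sort()  # by HP, then HS, so lookups touch each prefix block in order
    hits = []
    for hp, hs, qoff in windows:
        for doff in index.get(hp, {}).get(hs, ()):
            hits.append((qoff, doff))
    return hits


# Tiny example with P = 2, S = 2 (the experiments use P = 5, S = 6).
idx = build_index("ACGTACGTTT", 2, 2)
print(search(idx, "TACGT", 2, 2))
```

Sorting the windows by HP and then HS means that, in an on-disk layout, lookups for the same prefix block are issued together rather than in query order.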
Experiments
  Database sequence: human chromosomes 1-4, 840M bp
  Query sequences: randomly selected from human chromosomes
  W = 11, P = 5, S = 6
  Tasks:
    Compare the order of prefix and suffix
    Compare the hit generation time of the algorithms
      BLAST
      PS-Hash: Prefix-Suffix Hashing Scheme
      HashQuery: build a lookup table on the query sequence and scan the database sequence
      Sequential Scan
    Study the ungapped extension in BLAST
Experimental results
  Two sets of index files built
    Prefix as lookup (prefix->suffix)
    Suffix as lookup (suffix->prefix)

  Query length  Eff. len.  # of hits  | prefix->suffix                 | suffix->prefix
                                      | total (s)  lookup    entry     | total (s)  lookup    entry
  490           490        484925     | 5.39364    0.454373  4.9388    | 5.66894    0.504918  5.152831
  512           70         20752      | 1.076      0.293433  0.78247   | 1.20806    0.303494  0.904463
  512           512        336367     | 6.04708    0.477877  5.568728  | 6.08929    0.497289  5.591531
  513           513        580441     | 5.65264    0.475084  5.16985   | 6.0363     0.514839  5.520972
  490           452        1288149    | 5.36993    0.463572  4.905935  | 5.51566    0.497818  5.006489

  Eff. len. is the effective search length of the query sequence after filtering.
Experimental results

  Query length  Eff. len.  | BLAST               | PS-Hash             | HashQuery  | Sequential Scan
                           | Hits      Time (s)  | Hits      Time (s)  | Time (s)   | Time (s)
  490           490        | 363141    7.0770    | 484925    5.39364   | 40.0346    | 4506.84
  512           70         | 14949     7.6932    | 20752     1.076     | 32.0046    | 558.989
  512           512        | 194686    6.7870    | 336367    6.04708   | 41.6682    | 4721.28
  513           513        | 248233    6.7912    | 580441    5.65264   | 43.1642    | 4652.43
  490           452        | 350078    13.395    | 11288149  5.36993   | 40.7474    | 3868.43
Analysis
  Index files
    Number of word patterns = $4^{11}$ = 4M
    Number of prefix patterns = $4^{5}$ = 1K
    Number of suffix patterns = $4^{6}$ = 4K
    Total size of lookup file = $4^{11} \times (4 + 4)$ bytes = 32MB
    Total size of entry files = 840M × 4 bytes ≈ 3GB
Analysis
  Number of bytes read (w.r.t. the first query)
    BLAST: compressed sequence file = 210MB
    PS-Hash: (# of query sliding windows) × (4 + 4) + (# of hits) × 4 = 1.85MB
    HashQuery: sequence file = 840MB
    Sequential Scan: sequence file = 840MB
  PS-Hash accesses only 1/113 of the bytes that BLAST accesses, but its running time is not much faster and in some cases is even slower
    Disk locality
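The figures on the two Analysis slides can be reproduced with a few lines of arithmetic. The sketch assumes the number of sliding windows of a length-490 query is 490 - W + 1, takes the hit count 484925 from the first row of the results, and reads "MB" as 2^20 bytes for the index and PS-Hash figures. The final ratio divides 210 by 1.85 as the slide appears to do.

```python
W = 11                      # word length used in the experiments
DB_LEN = 840_000_000        # database length in bp
MIB = 2 ** 20

# Index files
word_patterns = 4 ** W                      # 4M possible words
lookup_bytes = word_patterns * (4 + 4)      # two 4-byte fields per word
entry_bytes = DB_LEN * 4                    # one 4-byte offset per database position

# Bytes read for the first query (length 490, 484925 hits)
query_len, hits = 490, 484_925
windows = query_len - W + 1                 # assumed definition of "# of sliding windows"
ps_hash_bytes = windows * (4 + 4) + hits * 4
blast_bytes = DB_LEN // 4                   # 4 nucleotides per byte

print(lookup_bytes / MIB)                   # lookup file size in MB
print(round(entry_bytes / 1e9, 2))          # entry files size in GB
print(round(ps_hash_bytes / MIB, 2))        # PS-Hash bytes read in MB
print(round(210 / (ps_hash_bytes / MIB)))   # BLAST-to-PS-Hash ratio as on the slide
```

The hit offsets account for almost all of the PS-Hash figure; the lookup records for the 480 windows contribute under 4KB.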
Experimental results
  BLAST ungapped extension
    Database sequence: 840M bp
    Query: 512 bp
    E-value: $10^{-15}$
    Total number of word hits: 194,686
  [Figure: plot with a logarithmic vertical axis (1 to 100000) and a horizontal axis from 0 to 450; series: success, fail]
Conclusion
  Introduced local similarity search
  Described BLAST
  Proposed the Prefix-Suffix Hashing Scheme
  Showed experimental results and comparisons
Future work
  Optimise the implementation of the Prefix-Suffix Hashing Scheme
  Utilise the information on the number of word hits produced by each sliding window of the query sequence
  Extend the index to store neighbour information about the word patterns
  Derive a useful threshold to restrict the generation of hits for later-phase processing
  Test on multiple sequences in the database
References
  BLAST website: http://www.ncbi.nlm.nih.gov/blast/
  The Statistics of Sequence Similarity Scores: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
  Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
  Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87(6):2264-2268.
  Korf, I., Yandell, M. & Bedell, J. (2003) BLAST. Sebastopol, CA: O'Reilly & Associates.
  WU-BLAST website: http://blast.wustl.edu/
  FSA-BLAST website: http://www.fsa-blast.org/