
Post on 19-Dec-2015


Better Filtering with Gapped q-grams

S. Burkhardt

Center for Bioinformatics, Saarbrücken

J. Kärkkäinen

Max-Planck Institut f. Informatik, Saarbrücken

Outline

• Motivation
• The "classic" q-gram Lemma
• q-shapes
• Measuring filter quality/speed
• Experimental results
• Conclusion

The k-mismatches problem

For a pattern P, a string S, and a value k:

find all occurrences of P in S with at most k character replacements.
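As a baseline, the problem can be solved without any filter by comparing P against every window of S. The sketch below is a naive reference implementation, not the method of the talk; the function name `k_mismatch_occurrences` is made up for this illustration.

```python
def k_mismatch_occurrences(pattern, text, k):
    """Return all start positions where pattern occurs in text
    with at most k character replacements (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window cannot be a match any more
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(k_mismatch_occurrences("ACTC", "GCATTCGATGGACTGGACTAGTGATTGAGT", 1))
```

This takes O(|P| · |S|) character comparisons in the worst case, which is the cost a filter algorithm tries to avoid.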

Filter Algorithms

Filtration Stage:

Examine S with a filter criterion

Return areas with potential matches

Verification Stage:

Verify which areas have true matches
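The two stages can be sketched as follows. To keep the example short it uses a simple pigeonhole criterion (split P into k+1 pieces; at least one piece must occur unchanged in any match) rather than the q-gram criterion discussed in this talk. All function names are hypothetical, and the sketch assumes |P| >= k+1.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pigeonhole_candidates(pattern, text, k):
    """Filtration stage (assumed criterion, not the q-gram filter):
    report start positions where at least one of k+1 pattern pieces
    occurs exactly at the right offset."""
    m = len(pattern)
    pieces = k + 1
    bounds = [m * i // pieces for i in range(pieces + 1)]
    candidates = set()
    for lo, hi in zip(bounds, bounds[1:]):
        piece = pattern[lo:hi]
        pos = text.find(piece)
        while pos != -1:
            start = pos - lo  # where the whole pattern would begin
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            pos = text.find(piece, pos + 1)
    return sorted(candidates)


def filter_and_verify(pattern, text, k):
    """Run filtration, then verify every candidate area."""
    candidates = pigeonhole_candidates(pattern, text, k)
    matches = [s for s in candidates
               if hamming(pattern, text[s:s + len(pattern)]) <= k]
    return candidates, matches
```

The filter may report areas without a true match, but it must never miss one; the verification stage removes the false positives.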

Pattern P

A C T C

Find occurrences of P with at most k errors

k = 1

String S

G C A T T C G A T G G A C T G G A C T A G T G A T T G A G T

The q-gram Lemma

For a pattern P, a string S, and a value k:

An occurrence of P in S with at most k errors shares at least

t = |P| - q + 1 - kq

substrings of length q (q-grams) with P.

Example: P = T C G A T T A C, |P| = 8, q = 3, k = 1

q-grams of P: TCG, CGA, GAT, ATT, TTA, TAC

S = G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

Number of q-grams in P: |P| - q + 1 = 6

With k errors: at least t = |P| - q + 1 - qk common q-grams in |P| letters

=> t = 8 - 3 + 1 - 3 = 3
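The arithmetic of the lemma can be checked with a few lines of code. This is a minimal sketch with hypothetical helper names; for |P| = 8, q = 3, k = 1 the formula |P| - q + 1 - qk evaluates to 8 - 3 + 1 - 3 = 3, because a single replacement can destroy at most q = 3 of the 6 q-grams.

```python
def qgrams(s, q):
    """All substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m, q, k):
    """Lower bound from the q-gram lemma: t = m - q + 1 - k*q."""
    return m - q + 1 - k * q


def shared_qgrams(a, b, q):
    """Count positions where two equal-length strings have the same q-gram."""
    return sum(x == y for x, y in zip(qgrams(a, q), qgrams(b, q)))


if __name__ == "__main__":
    P = "TCGATTAC"
    print(qgrams(P, 3))
    print(qgram_threshold(len(P), 3, 1))
    # One replacement in the middle of P (A -> G at position 3)
    print(shared_qgrams(P, "TCGGTTAC", 3))
```

A replacement in the middle of P hits three overlapping q-grams, which is the worst case the threshold accounts for.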

In the DP matrix, one can count the number of matching q-grams per diagonal
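Counting per diagonal can be sketched without building the matrix explicitly: a q-gram at position i in P that equals the q-gram at position j in S lies on diagonal j - i. The code below is an illustrative sketch under that convention, with hypothetical function names; for the k-mismatches problem an occurrence starting at position d lies entirely on diagonal d.

```python
from collections import defaultdict


def qgram_diagonal_counts(pattern, text, q):
    """Map each diagonal d = text_pos - pattern_pos to its number of
    matching q-grams."""
    index = defaultdict(list)  # q-gram -> start positions in the pattern
    for i in range(len(pattern) - q + 1):
        index[pattern[i:i + q]].append(i)
    counts = defaultdict(int)
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            counts[j - i] += 1
    return dict(counts)


def candidate_diagonals(pattern, text, q, k):
    """Diagonals with at least t = |P| - q + 1 - k*q matching q-grams."""
    t = len(pattern) - q + 1 - k * q
    if t <= 0:
        raise ValueError("threshold is not positive: no filtering possible")
    counts = qgram_diagonal_counts(pattern, text, q)
    last_start = len(text) - len(pattern)
    return sorted(d for d, c in counts.items()
                  if c >= t and 0 <= d <= last_start)


if __name__ == "__main__":
    # P with one replacement, embedded at position 2
    print(candidate_diagonals("TCGATTAC", "GGTCGGTTACAA", 3, 1))
```

Only the reported diagonals need to be passed on to the verification stage.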

General idea:

• Use substrings with gaps (q-shapes)
• Compute the correct threshold t
• The total length s is called the span

3-shape ##.#: s = 4, 1 gap, t = 1

3-gram ###: t = 0 (no filter!)

|Q| = 11, k = 3 (O = match, X = mismatch)

3-gram ###, alignment OOXOOXOOXOO:

OOX OXO XOO OOX OXO XOO OOX OXO XOO

3-shape ##.#, alignment OOOXXOOXOOO:

OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O

We developed a DP-based approach for computing the threshold t given a q-shape, a query length |P|, and the number of errors k
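The DP itself is not given on the slides. For short queries the threshold can be checked by brute force instead: try every placement of k mismatches and count how many shape positions stay free of mismatches. The sketch below does exactly that; it is an exhaustive search, not the authors' DP, and the function names are hypothetical.

```python
from itertools import combinations


def shape_offsets(shape):
    """Offsets of the '#' (matching) positions, e.g. '##.#' -> [0, 1, 3]."""
    return [i for i, c in enumerate(shape) if c == "#"]


def shape_threshold(shape, m, k):
    """Minimum number of intact shape occurrences over all placements of
    k mismatches in an alignment of length m (exhaustive, small m only).
    Assumes k <= m."""
    offsets = shape_offsets(shape)
    positions = range(m - len(shape) + 1)
    best = len(positions)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        intact = sum(all(p + o not in bad for o in offsets)
                     for p in positions)
        best = min(best, intact)
    return best


if __name__ == "__main__":
    print(shape_threshold("###", 11, 3))   # contiguous 3-gram
    print(shape_threshold("##.#", 11, 3))  # gapped 3-shape
```

The search visits C(m, k) mismatch placements, so it is only practical for small m and k; it is meant as a check of the definition, not as a replacement for the DP.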

Judging the quality of q-shapes I

Observation: The threshold t is not the only factor that influences the behaviour of a q-shape

We define the minimum coverage as the minimum number of matching characters for any arrangement of t matching q-shapes in P and a substring of length |P| in S

Judging the quality of q-shapes II

##.#
 ##.#
-----

For t = 2 and the 3-shape ##.#, the minimum coverage is 5
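For small cases the minimum coverage can be computed directly from its definition: choose t distinct shape positions and count the characters they cover together. The following sketch enumerates all choices; it is exhaustive and only meant for small inputs, and `min_coverage` is a name invented for this example.

```python
from itertools import combinations


def min_coverage(shape, t, m):
    """Smallest number of distinct characters covered by any t distinct
    occurrences of the shape inside an alignment of length m.
    Returns None if fewer than t positions exist."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = None
    for chosen in combinations(starts, t):
        covered = {p + o for p in chosen for o in offsets}
        if best is None or len(covered) < best:
            best = len(covered)
    return best


if __name__ == "__main__":
    print(min_coverage("##.#", 2, 11))  # gapped 3-shape
    print(min_coverage("###", 2, 11))   # contiguous 3-gram
```

Two contiguous 3-grams can overlap in two characters and cover only 4, whereas two copies of ##.# overlap in at most one character, so the gapped shape demands more matching characters for the same t.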

The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S

Judging the quality of q-shapes III

3-shape: ##.#, Σ = {A,C,G,T}

Expected number of occurrences of a single 3-shape in S:

$\mathrm{occ} = |S| \cdot \frac{1}{|\Sigma|^{q}}$
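To get a feeling for the numbers, the estimate can be evaluated for a DNA alphabet. This small sketch assumes a uniform random string and, like the formula, approximates the number of positions in S by |S|; the function name is hypothetical.

```python
def expected_occurrences(text_length, q, alphabet_size=4):
    """Expected number of occurrences of one fixed q-shape in a uniform
    random string: |S| / |Sigma|^q."""
    return text_length / alphabet_size ** q


if __name__ == "__main__":
    for q in (3, 6, 9, 12):
        print(q, expected_occurrences(50_000_000, q))
```

Each additional matching character divides the expected number of hits by the alphabet size, which is why larger q makes the filtration stage faster.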

The speed of the filter step is influenced by the expected number of matching q-shapes in S. The efficiency of the filtration correlates closely with the minimum coverage

Judging the quality of q-shapes IV

Speed: value of q

Efficiency: minimum coverage

Good shapes are not necessarily regular or predictable in their form.

Judging the quality of q-shapes V

Shapes with maximal minimum coverage for |Q| = 50, k = 5:

q = 6:  ##......#..#..#.#
q = 9:  ###..#..#.#...#.##
q = 10: ###..#..#.#..###.#
q = 11: #######.##.##
q = 12: ###.#..###.#..###.#

Experimental setup for q-shapes:

• 50 million character random (Bernoulli) string S
• 1000 random queries of length 500
• queries have no approximate matches in S
• compute threshold for |Q| = 50
• actual value of |Q| is 500! (to reduce runtime of tests)

Experiments show 10x reduced filter efficiency; relative performance between shapes unaffected

Evaluating q-shapes

What we measured for every shape and all queries:

A) The total number of occurrences of all shapes. Good indicator of the total work for the filter phase.

B) The number of diagonals containing at least t shapes. Good indicator of the filter efficiency.

The experiments show a good correlation between A and the predicted values, as well as between B and the minimum coverage
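The two measurements can be reproduced on a small scale with a sketch like the one below. It is not the experimental code behind the talk: the function names, the string sizes, and the threshold t = 2 used in the demo are arbitrary choices for illustration, and the shape is the q = 6 shape listed earlier.

```python
import random
from collections import defaultdict


def shape_key(s, start, offsets):
    """Characters of s selected by the shape placed at position start."""
    return "".join(s[start + o] for o in offsets)


def measure_shape(shape, query, text, t):
    """Return (A, B): A = total number of shape hits between query and
    text, B = number of diagonals with at least t hits."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    index = defaultdict(list)  # shape content -> positions in the query
    for i in range(len(query) - span + 1):
        index[shape_key(query, i, offsets)].append(i)
    per_diagonal = defaultdict(int)
    total_hits = 0
    for j in range(len(text) - span + 1):
        for i in index.get(shape_key(text, j, offsets), ()):
            per_diagonal[j - i] += 1
            total_hits += 1
    flagged = sum(1 for c in per_diagonal.values() if c >= t)
    return total_hits, flagged


if __name__ == "__main__":
    random.seed(0)  # fixed seed so that repeated runs are comparable
    text = "".join(random.choice("ACGT") for _ in range(20000))
    query = "".join(random.choice("ACGT") for _ in range(50))
    # t = 2 is an arbitrary value for the demo, not a computed threshold
    print(measure_shape("##......#..#..#.#", query, text, 2))
```

On random data, A should be close to the number of query shape positions times the expected occurrences per shape, and B counts the areas that would be handed to verification.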

Our work…

• An analysis of q-grams with gaps (q-shapes)
• Results include:
  • experimental evidence for their superiority when compared to standard q-grams
  • a method to roughly judge their quality, the minimum coverage
  • a way to calculate the parameters required to use them in a filter algorithm

Todo…

• an algorithm to predict the best shapes
• improve the quality measure for q-grams
• extension to the k-differences problem (with insertions and deletions)
• a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)