multi-seed lossless filtration gregory kucherov laurent noé loria/inria, nancy, france mikhail...
Post on 22-Dec-2015
214 views
TRANSCRIPT
Multi-seed lossless filtrationMulti-seed lossless filtration
Gregory KucherovLaurent Noé
LORIA/INRIA, Nancy, France
Mikhail RoytbergInstitute of Mathematical Problems in Biology,
Puschino, Russia
CPM (Istanbul)July 5-7, 2004
Text filtration: general principleText filtration: general principle
lossless and lossy filters
true match
Filtration applied to sequence comparisonFiltration applied to sequence comparison
potential similarities
Filtration applied to sequence alignmentFiltration applied to sequence alignment
potential similarities
GaplessGapless similarit similarities. Hamming distance.ies. Hamming distance.
Similarities are defined through Hamming distance
GCTACGACTTCGAGCTGC
...CTCAGCTATGACCTCGAGCGGCCTATCTA...
GaplessGapless similarit similarities. Hamming distance.ies. Hamming distance.
Similarities are defined through Hamming distance
GaplessGapless similarit similarities. Hamming distance.ies. Hamming distance.
Similarities are defined through Hamming distance
(m,k)-problem, (m,k)-instances
m
k
GaplessGapless similarit similarities. Hamming distance.ies. Hamming distance.
Similarities are defined through Hamming distance
(m,k)-problem, (m,k)-instances This work: lossless filtering
m
k
Filtering by contiguous fragmentFiltering by contiguous fragment
PEX (Navarro&Raffinot 2002)– Searching for a contiguous pattern
PEX with errors– Searching for a contiguous pattern with l possible errors
• requires retrieval of all l-variants in the index. Efficient for– small alphabets (ADN,ARN)– relatively small l (<= 2)
m=18
k=3
11
km
####
conserved1
#########(1)
(m,k)
Superposition of two filtersSuperposition of two filters
Pevzner&Waterman 1995
Idea: combine PEX with another filter based on a regularly-spaced seed
PEX :
spaced PEX (matches occurring at every k positions).
####
#---#---#---#
#---#---#---# #---#---#---# #---#---#---# #---#---#---#
k+1
Spaced seedsSpaced seeds
Spaced seeds (spaced Q-grams)– proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-
problems
Principle– Searching for spaced rather than contiguous patterns
– Selectivity• defined by the weight of the seed (number of #’s)
###-##
Spaced seeds for sequence comparisonSpaced seeds for sequence comparison
Ma, Tromp, Li 2002 (PatternHunter)
Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004, ...
Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004, ...
This work: lossless filtration using spaced seed families (extension of Burkhard&Karkkainen 2001)
single filter based on several distinct seeds each seed detects a part of (m,k)-instances but
together they must detect all (m,k)-instances
Families of spaced seedsFamilies of spaced seeds
Independent work (lossy seed families for sequence alignment):
Li, Ma, Kisman, Tromp 2004 (PatternHunter II) Xu, Brown, Li, Ma, this conference Sun, Buhler, RECOMB 2004 (Mandala)
– every (18,3)-instance contains an occurrence of a seed of F
– all seeds of the family have the same weight 7
Example: (18.3)-problem (cont)Example: (18.3)-problem (cont)
Family F solvesthe (18,3)-problem
##-#-#######---#--##-#
F
##-##-########-####--#####-##---#-#####----####-######---#-#-##-#####-#-#-#-----###
Example: (18.3)-problem (cont)Example: (18.3)-problem (cont)
##-#-#######---#--##-#
###-##---#-###
###---#--##-# ###---#--##-#
w=7
w=9
####
###-##
##-##-########-####--#####-##---#-#####----####-######---#-#-##-#####-#-#-#-----###
Comparative selectivityComparative selectivity
##-#-#######---#--##-#
w=4 ~39. 10-4
w=5 ~9.8 10-4
w=7 ~1.2 10-4
w=9 ~0.23 10-4
Selectivity of families on Bernoulli similarities (p(match) = 1/4) estimated as the probability for one of the seeds to occur at a given position
How far should we goHow far should we go
A trivial extreme solution ... – would be to pick all seeds of weight m - k. – prohibitive cost except for very small problems
We are interested in intermediate solutions:– relatively small number of seeds (< 10) to keep the hash table of a
reasonable size,– the seed weight sufficiently large to obtain a good selectivity
kmC ~
ResultsResults
Computing properties of seed families Seed design
– Seed expansion/contraction– Periodic seeds– Seed optimality– Heuristic seed design
Experiments– Examples of designed seed families– Application to computing specific oligonucleotides
Conclusions
MeMeaasursuringing the the efficefficiency of a familyiency of a family
Optimal threshold (Burkhard&Karkkainen): minimal number of seed occurrences over all (m,k)-instances
A seed family F is lossless iff the optimal threshold TF(m,k)1
TF(m,k) can be computed by a dynamic programming algorithm in time O(m·k·2(S+1)) and space O(k·2(S+1)), where S is the maximal length of a seed from F
optimizations are possible (see the paper) the resulting space complexity is the same as in the
Burkhard&Karkkainen algorithm
MeMeaasursuringing the the efficefficiency of a family (cont)iency of a family (cont)
Using a similar DP technique we can compute, within the same time complexity bound:
the number UF(m,k) of undetected (m,k)-similarities for a (lossy) family F
the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed
[see the paper for details]
Design Design of seedof seed famil familiesies
Pruning exhaustive search tree (Burkhard&Karkkainen)
– Construct all solutions of weight w from solutions of weight w – 1
– Example:if ##--#--# and ##-#---# are solutions of weight w-1,
consider their «union» ##-##--# of weight w.
– Prohibitive cost: • more than a week for computing all single-seed solutions of
the (50,5)-problem• the search space blows up for multi-seed families
Seed expansion/contractionSeed expansion/contraction
Burkhard&Karkkainen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
Seed expansion/contractionSeed expansion/contraction
Burkhard&Karkkainen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
the only solution of weight 12 of the (25,2)-problem
Seed expansion/contractionSeed expansion/contraction
Burkhard&Karkkainen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
– Let be the i-regular expansion of F obtained by inserting i-1 jokers between successive positions of each seed of F
– Example:If F = { ###-# , ##-## } then
= { #-#-#---# , #-#---#-# } = { #--#--#-----# , #--#-----#--# }
Fi
F2F3
the only solution of weight 12 of the (25,2)-problem
Seed expansion/contractionSeed expansion/contraction (cont)(cont)
Lemma:
– If a family F solves an (m,k)–problem, then both F and solves the (i·m, (i+1)·k- 1)–problem
– If a family solves the (i·m,k)–problem, then its i-contraction F solves the (m, )-problem
Fi
Fi
ik
##-#-#######---#--##-#
##-#-#######---#--##-#
#-#---#---#-#-#-##-#-#-------#-----#-#-#
(18,3)
(36,7)
Periodic seedsPeriodic seeds
Iterating short seeds with good properties
into longer seeds
###-#--###-#--###-#
###-#--
Cyclic problemCyclic problem
Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed Qi=[Q,- (m-s(Q))]i solves the linear (m·(i+1)+s(Q)-1,k)-problem.
Cyclic (11,3)-problem
Linear (30,3)-problem
###-#--#---
###-#--#---###-#--#
Extension to multi-seed caseExtension to multi-seed case
Cyclic (11,3)-problem
Linear (25,3)-problem
###-#--#---
###-#--#---###-#--##--#---###-#--#---###
Extension to multi-seed caseExtension to multi-seed case
Cyclic (11,3)-problem
Linear (25,3)-problem
###-#--#---
###-#--#---###-#--# #--#---###-#--#---###
AAsymptotsymptotic optimalityic optimality
Theorem:Fix a number of errors k. Let w(m) be the maximal weight
of a seed solving the linear (m,k)-problem. Then
the fraction of the number of jokers tends to 0 but the convergence speed depends on k
seed expansion cannot provide an asymptotically optimal solution
( )
Non-asymptotic optimality Non-asymptotic optimality
Fix a number of errors k. For each seed (seed family) Q there exists mQ s.t.
mmQ, Q solves the (m,k)-problem
For a class of seeds , Q is an optimal seed in iff Q realizes the minimal mQ over all seeds of
Lemma: Let n be an integer and r=n/3. For every k2, seed #n-
r-#r is optimal among seeds of weight n with one joker.
Heuristic seed design: genetic algorithmHeuristic seed design: genetic algorithm
a population of seed families is evolving by mutating and crossing over
seed families are screened against sets of difficult (m,k)-instances
for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families
compute the contribution of each seed of the family. Mutate the less “valuable” seeds.
difficult(m,k)-instances
seed families
select and reorderselect
Application Application of lossless filtering: of lossless filtering: oligooligo design design
Specific oligonucleotides: small DNA molecules (10-50bp) that hybridize with a given target sequence and do not hybridize with the other background sequences (e.g. the rest of the genome)
Formalization: given a sequence, find all windows of length m which do not occur elsewhere within k substitution errors
ExperimentExperiment
This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp)
All 32-windows occurring elsewhere within 5 errors have been computed
The computation took slightly more than 1 hour on a P4 3GHz computer
87% of the database have been “filtered out”
Further questionsFurther questions
Combinatorial structure of optimal seed families
Efficient design algorithm