Improved hit criteria for DNA local alignment

JOBIM 2004 Montréal - June 28th
Laurent Noé, Gregory Kucherov
LORIA, Nancy, France


Page 1: Improved hit criteria for DNA local alignment

Improved hit criteria for DNA local alignment

JOBIM 2004 Montréal - June 28th

Laurent Noé, Gregory Kucherov
LORIA, Nancy, France

Page 2: Improved hit criteria for DNA local alignment

Plan

Introduction
– Local alignment
– Heuristic methods

Hit criteria
– Seed models and extension proposed
– Single/multiple hit strategies and extension proposed

Experiments

Conclusion
– Extensions

Page 3: Improved hit criteria for DNA local alignment

Local alignment methods

Why be interested in local alignment methods?
– Improvement needed: #sequences, #users, (budget)

Dynamic programming (Smith-Waterman)
– Gives an exact solution
– Quadratic cost (best optimization in [Crochemore et al 02])

Heuristic algorithms (in practice)
– FASTA, BLAST, PatternHunter, BLASTZ, YASS, …
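To make the "exact solution, quadratic cost" point concrete, here is a minimal sketch of the Smith-Waterman score computation. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration, not values from the talk.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    Every cell of a len(a) x len(b) table is visited once, which is the
    quadratic cost mentioned on the slide. Only two rows are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ATCAGT", "ATCAGT"))  # 6
```

Heuristic tools avoid filling the whole table by first looking for seeds and only extending around them.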

Page 4: Improved hit criteria for DNA local alignment

Seed filtering

Start with small, conserved and easily detected fragments (seeds).

Then extend seeds to build possible alignments.

[Figure: dot plot with labels "Detected seeds" and "Detected alignment"]

ctcgactcgggctcacgctcgcaccgggttacagcggtcgattgctaggcctcgggctcgcgctcgcgcgctagacaccgggttacagcgt
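The seed detection step can be sketched with a k-mer index: every word of length k of the first sequence is stored, then the second sequence is scanned for the same words. This is an illustrative sketch for exact contiguous seeds only; the function name and the choice of a dictionary index are assumptions, not the implementation of any of the tools cited.

```python
from collections import defaultdict


def find_seeds(s, t, k):
    """Return all (i, j) such that s[i:i+k] == t[j:j+k] (dot-plot seeds)."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    hits = []
    for j in range(len(t) - k + 1):
        # .get() avoids creating empty entries for words absent from s.
        for i in index.get(t[j:j + k], ()):
            hits.append((i, j))
    return hits


print(find_seeds("ATCAGTGC", "GGATCAGT", 6))  # [(0, 2)]
```

Each returned pair is one dot of the dot plot; the extension step would then start from these positions.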

Page 5: Improved hit criteria for DNA local alignment

Two questions usually asked

1. Seed model: what can serve as a seed?

2. Hit criterion: what is the criterion that witnesses a potential alignment?

[Figure: dot plot with labels "Detected seeds", "Detected alignment", "→ 1. Seed model", "→ 2. Hit criterion"]

Page 6: Improved hit criteria for DNA local alignment

1. What can serve as a seed

Contiguous Seed

Exact similarity :

ATCAGT
||||||
ATCAGT

Seed Pattern : ######

Example : ######

ATCAGTGCAATGCTCAAGA
|||||.||.||||:|||||
ATCAGCGCGATGCGCAAGA

Page 7: Improved hit criteria for DNA local alignment

Spaced Seed Model [Ma et al. 02: PATTERNHUNTER]

Seed Pattern : ###--#-##

‘#’ : obligatory match position
‘-’ : joker position (“don’t care” position)

Weight : 6 [number of #]
Span : 9 [number of all symbols]

Example : ###--#-##

ATCAGTGCAATGCTCAAGA
|||||.||.||||:|||||
ATCAGCGCGATGCGCAAGA
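As a small worked check of the two seed models on the example alignment, the sketch below slides a seed pattern along the match line and reports every offset where all '#' positions face a match. The function name is hypothetical; the alignment and the two patterns are the ones shown on the slides.

```python
def seed_hits(pattern, matches):
    """Offsets where every '#' of pattern faces a '|' in the match line."""
    required = [k for k, c in enumerate(pattern) if c == "#"]
    return [
        i
        for i in range(len(matches) - len(pattern) + 1)
        if all(matches[i + k] == "|" for k in required)
    ]


alignment = "|||||.||.||||:|||||"
print(seed_hits("######", alignment))     # []
print(seed_hits("###--#-##", alignment))  # [2, 9, 10]
```

On this alignment the longest run of matches has length 5, so the contiguous seed of weight 6 never hits, while the spaced seed of the same weight hits three times.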

Page 8: Improved hit criteria for DNA local alignment

Spaced Seeds

Some probabilistic observations:

For spaced seeds, hits at subsequent positions are more independent events.

For contiguous vs spaced seeds of the same weight, the expected number of hits is (basically) the same, but the probabilities of having at least one hit are very different.

[Figure: two copies of each seed (###### and ###--#-##) placed at subsequent positions under a run of matches]
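The two quantities compared above can be computed exactly under a Bernoulli model. The sketch below is one possible way to do it, assuming each alignment column is a match independently with probability p; the function names and the dynamic programming over the last span-1 columns are illustrative choices, not the method used by the authors.

```python
def expected_hits(pattern, n, p):
    """Expected number of seed hits in a random alignment of length n."""
    weight = pattern.count("#")
    return max(0, n - len(pattern) + 1) * p ** weight


def hit_probability(pattern, n, p):
    """Probability of at least one seed hit in a random alignment of length n."""
    span = len(pattern)
    # Bit (span-1-k) of the mask is set when pattern position k is '#'.
    mask = sum(1 << (span - 1 - k) for k, c in enumerate(pattern) if c == "#")
    keep = (1 << (span - 1)) - 1
    # Maps the last span-1 columns (1 = match) to P(these columns, no hit so far).
    states = {0: 1.0}
    for pos in range(n):
        nxt = {}
        for state, prob in states.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if pos >= span - 1 and (window & mask) == mask:
                    continue  # the seed hits here: drop from the "no hit" mass
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    return 1.0 - sum(states.values())


for seed in ("######", "###--#-##"):
    print(seed, expected_hits(seed, 64, 0.7), hit_probability(seed, 64, 0.7))
```

The values n = 64 and p = 0.7 in the usage lines are taken from the Bernoulli setting given later in the talk; any other values can be substituted.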

Page 9: Improved hit criteria for DNA local alignment

Some probabilistic observations:

[Animation: the spaced seed ###--#-## and the contiguous seed ###### are slid along each of the following alignments in turn]

ATCAGTGCAATGCTCAAGA
|||||||||||||||||||
ATCAGTGCAATGCTCAAGA

ATCAGTGCAATGCTCAAGA
|||||.|||||||||||||
ATCAGCGCAATGCTCAAGA

ATCAGTGCAATGCTCAAGA
|||||.|||||||:|||||
ATCAGCGCAATGCGCAAGA

ATCAGTGCAATGCTCAAGA
|||||.||.||||:|||||
ATCAGCGCGATGCGCAAGA

For contiguous vs spaced seeds of the same weight, the expected number of hits is (basically) the same, but the probabilities of having at least one hit are very different.

Page 10: Improved hit criteria for DNA local alignment

Spaced seeds

The spaced seed model is generally more sensitive than the contiguous seed model.

Extend the spaced seed model by taking into account the specificity of DNA substitutions.

Page 11: Improved hit criteria for DNA local alignment

Mutational events

Biological properties
– Transitions are usually over-represented.
– Regularity phenomenon in coding sequences.
Use those properties to extend the spaced seed model.

[Figure: the four nucleotides A, T, G, C, with transitions (A↔G, T↔C) and transversions (all other substitutions)]

Alignment legend: ‘.’ : transition, ‘:’ : transversion

ATCAGTGCAATGCTCAAGA
|||||.||.||||:|||||
ATCAGCGCGATGCGCAAGA
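The three-symbol match line used throughout the slides can be rebuilt from the two aligned sequences. The sketch below assumes an ungapped alignment over the alphabet A, C, G, T; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """'|' for a match, '.' for a transition, ':' for a transversion."""
    if x == y:
        return "|"
    # A transition stays inside one chemical class (A<->G or C<->T).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "." if same_class else ":"


def match_line(s, t):
    """Match line of an ungapped alignment of s and t."""
    return "".join(classify(x, y) for x, y in zip(s, t))


print(match_line("ATCAGTGCAATGCTCAAGA", "ATCAGCGCGATGCGCAAGA"))
# |||||.||.||||:|||||
```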

Page 12: Improved hit criteria for DNA local alignment

BLASTZ model [Schwartz et al. 03]

A spaced seed that allows one possible transition substitution over its ‘#’ positions.

Problem: running time → a seed of large weight is needed to obtain reasonable speed.

Example : ###-#--##--#-#--#--##

ATCAGGCATGCTAAGATCGGATCCTCAATGGCTCA
|||.|||:|||.|||||.||:||||||:||.||||
ATCGGGCTTGCCAAGATTGGTTCCTCATTGCCTCA
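A hit test in the spirit of the BLASTZ model can be sketched as follows: all '#' positions must match, except that at most one of them may carry a transition. This is an illustrative reading of the slide, not BLASTZ code; the function name is hypothetical and the sequences are the ones of the example.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def one_transition_hits(pattern, s, t):
    """Offsets where the '#' positions match up to one transition."""
    required = [k for k, c in enumerate(pattern) if c == "#"]
    hits = []
    for i in range(min(len(s), len(t)) - len(pattern) + 1):
        diffs = [(s[i + k], t[i + k]) for k in required if s[i + k] != t[i + k]]
        if not diffs or (len(diffs) == 1 and diffs[0] in TRANSITIONS):
            hits.append(i)
    return hits


s = "ATCAGGCATGCTAAGATCGGATCCTCAATGGCTCA"
t = "ATCGGGCTTGCCAAGATTGGTTCCTCATTGCCTCA"
print(one_transition_hits("###-#--##--#-#--#--##", s, t))  # [2]
```

At offset 2 the only mismatch under a '#' is the A/G transition at alignment position 3, so the relaxed criterion accepts it where a strict spaced seed would not.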

Page 13: Improved hit criteria for DNA local alignment

YASS model: Transition Constrained Seeds

Seed Pattern: ##@#-#@-###

‘#’ : obligatory match position
‘-’ : joker position (“don’t care” position)
‘@’ : transition-constrained position

Transition-constrained position: a position that corresponds to either a match or a transition.

Example : ##@#-#@-###

ATCAGTGCAATGCTCAAGA
|||||.||.||||:|||||
ATCAGCGCGATGCGCAAGA

Page 14: Improved hit criteria for DNA local alignment

Transition Constrained Seeds

Seed Pattern: ##@#-#@-###

‘#’ : obligatory match position
‘-’ : joker position (“don’t care” position)
‘@’ : transition-constrained position

Weight : 8 [number of # + half the number of @]

@ carries 1 bit of information whereas # carries 2 bits.

@ adapted to GC-rich/poor genomes
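The weight rule and the hit test for transition-constrained seeds can be checked on the running example. In this sketch the alignment is given as a match line ('|' match, '.' transition, ':' transversion); the function names are hypothetical.

```python
# Symbols of the match line accepted by each seed symbol.
ALLOWED = {"#": "|", "@": "|.", "-": "|.:"}


def tc_seed_weight(pattern):
    """Number of '#' plus half the number of '@'."""
    return pattern.count("#") + pattern.count("@") / 2


def tc_seed_hits(pattern, line):
    """Offsets where the transition-constrained seed matches the line."""
    return [
        i
        for i in range(len(line) - len(pattern) + 1)
        if all(line[i + k] in ALLOWED[c] for k, c in enumerate(pattern))
    ]


print(tc_seed_weight("##@#-#@-###"))                       # 8.0
print(tc_seed_hits("##@#-#@-###", "|||||.||.||||:|||||"))  # [1, 6]
```

The hit at offset 6 relies on the '@' at seed position 2 absorbing the transition at alignment position 8, which a plain '#' would reject.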

Page 15: Improved hit criteria for DNA local alignment

Spaced seeds and Transition-Constrained Seeds

Seed pattern (why ##@#-#@-### and not #@-#-#-#@# ?)
– Not chosen randomly → need to:
  • define an alignment model.
  • search for the best (or at least a good) seed pattern according to this model.
    (Sensitivity: probability to detect any alignment given by the model)
– The chosen model can drastically change the seed shape…

Example
Bernoulli model: ##@-#@#--#-#-###
Markov model: ##@##-##@##

Page 16: Improved hit criteria for DNA local alignment

Probabilistic Alignment Models:

– Bernoulli [Keich et al 02]
– Markov [Buhler et al 03]
– Automata (M3/M8) and HMMs [Brejova et al 03]
– Homogeneous alignments [Kucherov et al 04]

ATCAGTGCAATGCTCAAGA
|||||.||.||||:|||||
ATCAGCGCGATGCGCAAGA
2222212212222022222

P(’2’) = 0.7, P(’1’) = 0.15, P(’0’) = 0.15

222221221222 X

Transition has an emission probability for each symbol
Ex : P(’2’) = 0.8, P(’1’) = 0.10, P(’0’) = 0.10

“HSP”: alignments found by heuristic algorithms
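The encoding of an alignment as a word over {2, 1, 0} (match, transition, transversion) and its probability under the Bernoulli model can be sketched as follows. The function names are hypothetical; the probabilities are the ones quoted on the slide.

```python
import math

# Match line symbol -> model symbol: 2 = match, 1 = transition, 0 = transversion.
SYMBOL = {"|": "2", ".": "1", ":": "0"}


def encode(line):
    """Translate a match line into the 2/1/0 alphabet of the model."""
    return "".join(SYMBOL[c] for c in line)


def bernoulli_probability(encoded, probs):
    """Probability of the encoded alignment when columns are independent."""
    return math.prod(probs[c] for c in encoded)


probs = {"2": 0.7, "1": 0.15, "0": 0.15}
word = encode("|||||.||.||||:|||||")
print(word)  # 2222212212222022222
print(bernoulli_probability(word, probs))
```

A Markov model would replace the single table `probs` by one table per context of preceding symbols.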

Page 17: Improved hit criteria for DNA local alignment

Seed Design

Alignment Model : Bernoulli
– P(match) = 0.7, P(transition) = 0.15, P(transversion) = 0.15
– alignment length = 64

Page 18: Improved hit criteria for DNA local alignment

Seed Design

Alignment Model : Markov
– 5th order, obtained on N. meningitidis, S. cerevisiae, Drosophila, and human sequences.

Page 19: Improved hit criteria for DNA local alignment

Experiments

S. cerevisiae / Neisseria sequences

Page 20: Improved hit criteria for DNA local alignment

To summarize ...

We have presented several seed models (contiguous, “classic” spaced seeds, BLASTZ).

We introduced transition-constrained seeds and showed how they improve the sensitivity.

Next: from detected seeds to detected alignments.

Page 21: Improved hit criteria for DNA local alignment

2. Hit criterion

What is the criterion that witnesses a potential alignment?

Restriction: only the information about seeds is available.

[Figure: dot plot with labels "Detected seeds", "Detected alignment", "→ 2. Hit criterion"]

Page 22: Improved hit criteria for DNA local alignment

Several methods have been proposed

FASTA:
– Several small seeds on proximal diagonals

BLAST: (single hit)
– One “large” seed.

Gapped-BLAST: (double hit)
– Two seeds on the same diagonal

To define a good criterion, we first have to define the class of similarities we want to detect: a mutation model.

[Figure: dot plots]
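The difference between a single-hit and a double-hit criterion can be illustrated on a list of seed positions (i, j) in the dot plot. The sketch below keeps pairs of non-overlapping seeds that lie on the same diagonal and are close to each other; the function name and the max_dist parameter are assumptions made for the example, not the exact rule of Gapped-BLAST.

```python
from collections import defaultdict


def double_hits(seeds, k, max_dist):
    """Pairs of non-overlapping seeds of length k on the same diagonal,
    whose start positions are at most max_dist apart."""
    by_diag = defaultdict(list)
    for i, j in seeds:
        by_diag[j - i].append(i)
    pairs = []
    for diag, starts in by_diag.items():
        starts.sort()
        for idx, a in enumerate(starts):
            for b in starts[idx + 1:]:
                if b - a > max_dist:
                    break
                if b - a >= k:  # the two seeds do not overlap
                    pairs.append(((a, a + diag), (b, b + diag)))
    return pairs


seeds = [(0, 2), (1, 3), (8, 10), (30, 5)]
print(double_hits(seeds, k=6, max_dist=20))
# [((0, 2), (8, 10)), ((1, 3), (8, 10))]
```

Under a single-hit criterion all four seeds would trigger an extension; here the isolated seed (30, 5) is discarded.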

Page 23: Improved hit criteria for DNA local alignment

Mutation effect on Seeds

Mutation effect
– Substitutions: “suppressing seeds”
– Indels: “diagonal shifts”

Remaining seeds
– Estimation of inter-seed distances
  • via a Waiting Time distribution
– Estimation of diagonal shifts
  • via a Random Walk model

[Figure: dot plot]
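One simple way to picture the random walk idea: if each alignment column is an indel with some probability, and an indel moves the diagonal by +1 or -1, the distribution of the total diagonal shift after n columns can be computed exactly. The step model (symmetric ±1 steps), the parameter names indel_prob and coverage, and both functions are assumptions made for this sketch; the talk does not give the exact model used in YASS.

```python
def shift_distribution(n, indel_prob):
    """P(diagonal shift = d) after n columns of a symmetric +/-1 random walk."""
    dist = {0: 1.0}
    steps = ((-1, indel_prob / 2), (0, 1.0 - indel_prob), (1, indel_prob / 2))
    for _ in range(n):
        nxt = {}
        for d, prob in dist.items():
            for step, q in steps:
                nxt[d + step] = nxt.get(d + step, 0.0) + prob * q
        dist = nxt
    return dist


def shift_bound(n, indel_prob, coverage=0.95):
    """Smallest delta such that P(|shift| <= delta) >= coverage."""
    dist = shift_distribution(n, indel_prob)
    total, delta = dist.get(0, 0.0), 0
    while total < coverage and delta < n:
        delta += 1
        total += dist.get(delta, 0.0) + dist.get(-delta, 0.0)
    return delta


print(shift_bound(100, 0.05))
```

Such a bound tells how far from the diagonal of a first seed a second seed of the same alignment can reasonably be expected.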

Page 24: Improved hit criteria for DNA local alignment

YASS hit criterion

According to these parameters, YASS proposes:

– An intermediate criterion between the BLAST single-hit and Gapped-BLAST double-hit criteria.

– Overlap-controlled multi-hits

[Figure: two hits of ###### on |:|||||||:|||:||| (7) and two hits of ###--#-## on |:||||:|||||:|.|. (9)]

Page 25: Improved hit criteria for DNA local alignment

Sensitivity

Comparison of BLASTn/Gapped-BLAST/YASS hit criteria

[Figure: score 25]

Page 26: Improved hit criteria for DNA local alignment

Sensitivity (cont)

Comparison of BLASTn/Gapped-BLAST/YASS hit criteria

[Figure: score 35]

Page 27: Improved hit criteria for DNA local alignment

YASS criterion mixed with spaced seeds

Page 28: Improved hit criteria for DNA local alignment

Experiments

Local alignment sensitivity
– YASS software / BLASTn (2.2.6 package)

M.t : M. tuberculosis CDC1551
S.s : Synechocystis sp. PCC 6803
V.p : Vibrio p. RIMD 2210633 I
Y.p : Yersinia pestis KIM

Page 30: Improved hit criteria for DNA local alignment

Ads

YASS web page
http://www.loria.fr/projects/YASS

YASS can be queried online
http://yass.loria.fr

YASS is Open Source

Page 31: Improved hit criteria for DNA local alignment

Conclusions

Two improvements:
– Transition-constrained spaced seeds
– Hit criterion combining statistical models and the advantages of single/multi-hit strategies.

A tool that implements both of them

Page 32: Improved hit criteria for DNA local alignment

Extensions

To be done
– Multi-seed approach [Li03, Buhler04, Noe04]
– Seed design on the fly (not necessarily static seeds).
– and others …

Page 33: Improved hit criteria for DNA local alignment

Questions

[Figure: DNA sequence fragments with some letters replaced by “?”]

Page 35: Improved hit criteria for DNA local alignment

*(96264-94728)(582917-584471) Ev: 0 s: 1537/1555 r
* S.cerevisiae.V (reverse complementary strand) / gi|12057208|(forward strand)
* score = 1073 : bitscore = 491.92
* mutations per triplet 347, 108, 152 (1.79e-36) | ts : 272 tv : 335

|96260 |96250 |96240 |96230 |96220 |96210 |96200 |96190
TTCCGCTTCATTAACCATTCGATCAATCTCCGTATCAGATAGCCCAGACGCTCCGGCAACAGTGATGGAAGAGTCTTTG
|||:||:||:||:|||||:||:||.||.||.:.:||.::.|.:||:||:|::||:::.|..||||||::.|:::||||.
TTCGGCATCTTTCACCATGCGTTCGATTTCTTCTTCGCTCAAACCTGAAGAACCTTGGATGGTGATGTTGGCTGCTTTA
|582920 |582930 |582940 |582950 |582960 |582970 |582980

|96180 |96170 |96160 |96150 |96140 |96130 |96120
TGGCTGGCGAGATCTTTTGCTGAAACGTTGATGATGCCGTTCGCATCGATATCAAAAGTGACTTCAATTTGTGGGGTAC
.:|:||:|:::.|||||:||:|||||||::|:|||||||||:||.|||||.||.||.||:|||||.|||||.||:.|||
CCGGTGCCTTTGTCTTTGGCGGAAACGTGCAGGATGCCGTTGGCGTCGATGTCGAAGGTTACTTCGATTTGCGGCATAC
|583000 |583010 |583020 |583030 |583040 |583050 |583060

|96100 |96090 |96080 |96070 |96060 |96050 |96040
CTTTTGGAGCTGGAGGAATGCCCGCAAGAGTAAAATTACCTATTAATTTGTTATCCTTGACTAACTCCCTCTCACCTTG
|:.:.||:||:||:|:.|||.|::|:|..:|.||:|:|||.|::.|||||||.:|:::..|::..||:|:.||.|||||
CGCGCGGTGCAGGTGCGATGTCGCCCAAGTTGAACTGACCCAAAGATTTGTTGGCAGAAGCGCGTTCGCGTTCGCCTTG
|583080 |583090 |583100 |583110 |583120 |583130 |583140

|96020 |96010 |96000 |95990 |95980 |95970 |95960
GAAAACTTTAACTTCCACCGATGTTTGACCTGATGCCGCAGTTGAAAAAATTTGAGATTTCTTATTGGGAATTGTAGAA
:|.:||:|:.|.::..||.|:::||||...:::|:|:||.||:||.||:|.|||:||.:..||.:|:||.||:||.|:.
CAGTACGTGGATGGTTACTGCGCTTTGGTTGTCTTCGGCGGTAGAGAACACTTGCGACGCTTTGGTCGGGATGGTGGTG
|583160 |583170 |583180 |583190 |583200 |583210 |583220

|95940 |95930 |95920 |95910 |95900 |95890 |95880
TTTCTTGGGATTAATTTTGTAAAAACTCCTCCTAAAGTTTCAATACCCAATGATAGGGGAGTGACATCTAGCAACAAAA
||..|.:|.||.|.|||:||:|::||:||:||.|:.|||||.||||||||:||.||.|||||:||.||.||.|.|||:|
TTCTTCTGAATCAGTTTGGTCATCACGCCGCCCATGGTTTCGATACCCAAAGACAGAGGAGTTACGTCCAGTAGCAATA
|583240 |583250 |583260 |583270 |583280 |583290 |583300

|95860 |95850 |95840 |95830 |95820 |95810 |95800
CATCGGTAACTTCACCAGACAAGACCGCAGCCTGTATAGCGGCCCCTAAAGCGACTGCTTCATCAGGGTTAACAGCTTT
|.|||:|.:::.|.||.::|||:||.:|.:|.||:||:||:||:||||:.||.||:|||||.||||||||:||.:||||
CGTCGCTGCGGCCGCCGCTCAATACTTCGCCTTGGATCGCTGCGCCTACGGCAACGGCTTCGTCAGGGTTCACGTCTTT
|583320 |583330 |583340 |583350 |583360 |583370 |583380

|95780 |95770 |95760 |95750 |95740 |95730 |95720
TGATGCATCCTTACCGAATAATTTCTTTACAGTATCTGCAACCTTGGGCATCCTTGACATACCACCAACTAATAAAACA
::..|::||.||.|||||:||::..||:||.|.:|||:::||.||:|||||:|::|||:::||.||.||.||:|::||.
GCGCGGTTCTTTGCCGAAGAAGGCTTTAACGGCTTCTTGTACTTTCGGCATACGGGACTGCCCGCCGACCAAGATTACG
|583400 |583410 |583420 |583430 |583440 |583450 |583460

|95710 |95700 |95690 |95680 |95670 |95660 |95650 |95640
TCCGATATATCTGAGGCGGTAATTCTTGCGTCTTTCAGTGCTTTTTTGACAGGATCAACCGTTCTATCAATCAATGGGG
||::::||.||:::||.|:|:|::|.:||.|||||||.|||::|||||::|||:||.|.:|::|:.:.|||||.:::::
TCGTCGATGTCGCCGGTGCTCAAGCCGGCATCTTTCAATGCAATTTTGCAAGGTTCGATAGAGCGGGTAATCAGGTCTT
|583470 |583480 |583490 |583500 |583510 |583520 |583530 |583540

|95630 |95620 |95610 |95600 |95590 |95580 |95570 |95560
CGGTTATATTCTCAAGCTGAACCCTAGAAAAGGGCATACGAATATGCTTTGGGCCTGCAGCATCAGCAGTTATGAAAGG
|....|:..|.||.|..|:..|:|:.|:||::::|||::::|:.||.||.|||||:|.:||.||:...||:|||:|:||
CAACCAGGCTTTCGAATTTGGCGCGGGTAATTTTCATCGCCAAGTGTTTCGGGCCGGTTGCGTCCATGGTGATGTACGG
|583550 |583560 |583570 |583580 |583590 |583600 |583610 |583620

|95550 |95540 |95530 |95520 |95510 |95500 |95490 |95480
CAAGTTTATTTCTGTAGAGAGTGTAGAAGACAGTTCGATTTTAGCCTTTTCAGCGGCTTCTCTTATTCTTTGGACAGCC
||.|||:|||||:||::.::|::..::.||||.|||||||||.||.|||||.||.||||||.|.|::|:|||:|:||||
CAGGTTAATTTCGGTTTGCTGGCCGCTGGACAATTCGATTTTGGCTTTTTCGGCAGCTTCTTTCAGGCGTTGTAGAGCC
|583630 |583640 |583650 |583660 |583670 |583680 |583690 |583700

|95470 |95460 |95450 |95440 |95430 |95420 |95410 |95400
ATACGGTCATTACTCAAATCGATACCGGTTTCTTTCTTGAAATGAGAAATAATTTCTTGCAACAAATAAATGTCAAAAT
||:::|||:|::.|||||||.||.||:::||||||.|||||:|:.|:.||.||:|::|::|::|....:::|||.||.|
ATCACGTCTTGTTTCAAATCAATGCCTTGTTCTTTTTTGAACTCGGCGATGATGTGGTCGATGAGGCGTTGGTCGAAGT
|583710 |583720 |583730 |583740 |583750 |583760 |583770

|95390 |95380 |95370 |95360 |95350 |95340
CTTCGCCACCCAAATGGGTGTCACCATTGGTAGATTTAACCTCAAA------GATACC------GTTATCGATGTCCAG
||||.||.|||||.:.|||.||.||.|||||:|:.:::||.||.||      |:..||      |||.:||||:||:|:
CTTCACCGCCCAAGAAGGTATCGCCGTTGGTTGCCAATACTTCGAATTGTTTGTCGCCGTCGAGGTTGGCGATTTCGAT
|583790 |583800 |583810 |583820 |583830 |583840 |583850

|95320 |95310 |95300 |95290 |95280 |95270 |95260
GATTGAAATATCGAAAGTACCACCGCCCAAGTCGAAAACAGCAAT------GACTTTTGGCTCTGATTTATCTAGACCG
|||:|||||||||||||||||.|||||||||||.:|:||.||:|.      |:||||::::||:::|||.||.|:||||
GATGGAAATATCGAAAGTACCGCCGCCCAAGTCATATACGGCTACTTTGCGGTCTTTGTTGTCGCCTTTGTCCATACCG
|583870 |583880 |583890 |583900 |583910 |583920 |583930

|95250 |95240 |95230 |95220 |95210 |95200 |95190
TAAGCTAGGGCAGCAGCTGTTGGTTCGTTGACAACACGTAATACATTAAGCCCAATAATTTGTCCTGCGTCTTTAGTAG
:|:||.|..||.||:||:||.||.|||||||..|..|||::.||.|.:|.:||....||:.|:|||.|||||||.||.|
AATGCCAAAGCGGCTGCGGTCGGCTCGTTGATGATGCGTTTCACGTCCAAACCGGCGATACGGCCTACGTCTTTGGTGG
|583950 |583960 |583970 |583980 |583990 |584000 |584010

|95170 |95160 |95150 |95140 |95130 |95120 |95110
CTTGTCTTTGGGCATCATTGAAGTAAGCAGGAACGGTGACAACAGCATTTTTGACGCTCTTCGCTAAGTAAGCCTCCGC
||||:|:||||:..||.||||||||.|||||.|||||.|.:||.||:|.::|:||:.|.|.::|.||||||||.||:||
CTTGACGTTGGCTGTCGTTGAAGTAGGCAGGGACGGTAATCACGGCTTCGGTTACTTTTTCGCCCAAGTAAGCTTCGGC
|584030 |584040 |584050 |584060 |584070 |584080 |584090

|95090 |95080 |95070 |95060 |95050 |95040 |95030
TGTTTCCTTCATTTTATTTAAGATAAAACCTCCTATTTGGGCGGGGGAGTACGTTCTGTTTCTAGCCTCTACCCAGGCA
:|.|||.|||||||||.:.|.||.:::::|::::|||||.|:.||.||::.|:.|.||..|.::||.|.||||||:||.
GGCTTCTTTCATTTTACGCAGGACTTCTGCGGAAATTTGAGGAGGAGACAGCTCTTTGCCTTGTGCTTTTACCCATGCG
|584110 |584120 |584130 |584140 |584150 |584160 |584170

|95010 |95000 |94990 |94980 |94970 |94960 |94950
TCTCCATTAGAATGCTTGACGATTTTGAAAGGAACCTGATTAATATCTCTTTGGACTTCAGCGTCCTCGAAACGGCGGC
||:||.||.::.::.||||.|||||.||||||:|.::.:|..||.||:|:|||||||||::.|||.||.||:.:|.|||
TCGCCGTTGTTGGCTTTGATGATTTCGAAAGGCATAGATTCGATGTCGCGTTGGACTTCTTTGTCTTCAAATTTGTGGC
|584190 |584200 |584210 |584220 |584230 |584240 |584250

|94930 |94920 |94910 |94900 |94890 |94880 |94870
CGATTAAACGCTTAGTAGCAAACAAAGTGTTTTCTGAGTTTATGACGGATTGTCGTTTGGCTGGCTCACCAACTAAACG
||||.|||||.||.|..||.:|:|:||||||||.:|:|||:.|:||:|:|||:||||||||:|||:||||.||:|..::
CGATCAAACGTTTGGCGGCGTAAATAGTGTTTTTGGCGTTGGTTACCGCTTGGCGTTTGGCAGGCGCACCGACGAGGAT
|584260 |584270 |584280 |584290 |584300 |584310 |584320 |584330

|94850 |94840 |94830 |94820 |94810 |94800 |94790
TTCTCCGTCTTTAGTGAAAGCCACTACAGACGGAGTAGTTCTTGAGCCTTCTGCATTTTCGATAATTCTCGGAACTTTA
|||:|||.|:|.:.:.:||||:|.:||.|||||:||.||:|:||:|||||||||.||||||||:|.|.|:|:::::...
TTCGCCGCCGTCCAAATAAGCGATAACGGACGGCGTGGTGCGTGCGCCTTCTGCGTTTTCGATCACTTTGGTTTGACCG
|584340 |584350 |584360 |584370 |584380 |584390 |584400 |584410

|94780 |94770 |94760 |94750 |94740
CCTTCCATAATAGCTACCGCAGAATTGGTAGTACCTAAATCAATACCGATAAC
..|||:.:|||.||.|::::|||.|||||:||||||||.||.||||||||:||
TTTTCGGAAATGGCCAAACAAGAGTTGGTTGTACCTAAGTCGATACCGATTAC
|584420 |584430 |584440 |584450 |584460


|96260 |96250 |96240 |96230 |96220 |96210 |96200 |96190TTCCGCTTCATTAACCATTCGATCAATCTCCGTATCAGATAGCCCAGACGCTCCGGCAACAGTGATGGAAGAGTCTTTG|||:||:||:||:|||||:||:||.||.||.:.:||.::.|.:||:||:|::||:::.|..||||||::.|:::||||.TTCGGCATCTTTCACCATGCGTTCGATTTCTTCTTCGCTCAAACCTGAAGAACCTTGGATGGTGATGTTGGCTGCTTTA |582920 |582930 |582940 |582950 |582960 |582970 |582980

|96180 |96170 |96160 |96150 |96140 |96130 |96120 TGGCTGGCGAGATCTTTTGCTGAAACGTTGATGATGCCGTTCGCATCGATATCAAAAGTGACTTCAATTTGTGGGGTAC.:|:||:|:::.|||||:||:|||||||::|:|||||||||:||.|||||.||.||.||:|||||.|||||.||:.|||CCGGTGCCTTTGTCTTTGGCGGAAACGTGCAGGATGCCGTTGGCGTCGATGTCGAAGGTTACTTCGATTTGCGGCATAC |583000 |583010 |583020 |583030 |583040 |583050 |583060

|96100 |96090 |96080 |96070 |96060 |96050 |96040 CTTTTGGAGCTGGAGGAATGCCCGCAAGAGTAAAATTACCTATTAATTTGTTATCCTTGACTAACTCCCTCTCACCTTG|:.:.||:||:||:|:.|||.|::|:|..:|.||:|:|||.|::.|||||||.:|:::..|::..||:|:.||.|||||CGCGCGGTGCAGGTGCGATGTCGCCCAAGTTGAACTGACCCAAAGATTTGTTGGCAGAAGCGCGTTCGCGTTCGCCTTG |583080 |583090 |583100 |583110 |583120 |583130 |583140

|96020 |96010 |96000 |95990 |95980 |95970 |95960 GAAAACTTTAACTTCCACCGATGTTTGACCTGATGCCGCAGTTGAAAAAATTTGAGATTTCTTATTGGGAATTGTAGAA:|.:||:|:.|.::..||.|:::||||...:::|:|:||.||:||.||:|.|||:||.:..||.:|:||.||:||.|:.CAGTACGTGGATGGTTACTGCGCTTTGGTTGTCTTCGGCGGTAGAGAACACTTGCGACGCTTTGGTCGGGATGGTGGTG |583160 |583170 |583180 |583190 |583200 |583210 |583220

|95940 |95930 |95920 |95910 |95900 |95890 |95880 TTTCTTGGGATTAATTTTGTAAAAACTCCTCCTAAAGTTTCAATACCCAATGATAGGGGAGTGACATCTAGCAACAAAA||..|.:|.||.|.|||:||:|::||:||:||.|:.|||||.||||||||:||.||.|||||:||.||.||.|.|||:|TTCTTCTGAATCAGTTTGGTCATCACGCCGCCCATGGTTTCGATACCCAAAGACAGAGGAGTTACGTCCAGTAGCAATA |583240 |583250 |583260 |583270 |583280 |583290 |583300

|95860 |95850 |95840 |95830 |95820 |95810 |95800 CATCGGTAACTTCACCAGACAAGACCGCAGCCTGTATAGCGGCCCCTAAAGCGACTGCTTCATCAGGGTTAACAGCTTT|.|||:|.:::.|.||.::|||:||.:|.:|.||:||:||:||:||||:.||.||:|||||.||||||||:||.:||||CGTCGCTGCGGCCGCCGCTCAATACTTCGCCTTGGATCGCTGCGCCTACGGCAACGGCTTCGTCAGGGTTCACGTCTTT |583320 |583330 |583340 |583350 |583360 |583370 |583380

|95780 |95770 |95760 |95750 |95740 |95730 |95720 TGATGCATCCTTACCGAATAATTTCTTTACAGTATCTGCAACCTTGGGCATCCTTGACATACCACCAACTAATAAAACA::..|::||.||.|||||:||::..||:||.|.:|||:::||.||:|||||:|::|||:::||.||.||.||:|::||.GCGCGGTTCTTTGCCGAAGAAGGCTTTAACGGCTTCTTGTACTTTCGGCATACGGGACTGCCCGCCGACCAAGATTACG |583400 |583410 |583420 |583430 |583440 |583450 |583460

|95710 |95700 |95690 |95680 |95670 |95660 |95650 |95640 TCCGATATATCTGAGGCGGTAATTCTTGCGTCTTTCAGTGCTTTTTTGACAGGATCAACCGTTCTATCAATCAATGGGG||::::||.||:::||.|:|:|::|.:||.|||||||.|||::|||||::|||:||.|.:|::|:.:.|||||.:::::TCGTCGATGTCGCCGGTGCTCAAGCCGGCATCTTTCAATGCAATTTTGCAAGGTTCGATAGAGCGGGTAATCAGGTCTT|583470 |583480 |583490 |583500 |583510 |583520 |583530 |583540

|95630 |95620 |95610 |95600 |95590 |95580 |95570 |95560 CGGTTATATTCTCAAGCTGAACCCTAGAAAAGGGCATACGAATATGCTTTGGGCCTGCAGCATCAGCAGTTATGAAAGG|....|:..|.||.|..|:..|:|:.|:||::::|||::::|:.||.||.|||||:|.:||.||:...||:|||:|:||CAACCAGGCTTTCGAATTTGGCGCGGGTAATTTTCATCGCCAAGTGTTTCGGGCCGGTTGCGTCCATGGTGATGTACGG |583550 |583560 |583570 |583580 |583590 |583600 |583610 |583620