SPECTRUM-BASED DE NOVO REPEAT DETECTION IN GENOMIC SEQUENCES
Do Huy Hoang
OUTLINE
Introduction
  What is a repeat?
  Why study repeats?
Related work
SAGRI
  Algorithm
  Analysis
  Evaluation
INTRODUCTION
WHAT IS A REPEAT? (DEFINITION)
[General]: Nucleotide sequences that occur multiple times within a genome.
[CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).
WHAT IS A REPEAT? (FUNCTION)
Motifs
  Very short repeats (10-20 bp)
  Transcription factor binding sites
Long and short interspersed elements (LINE, SINE)
  Jumping genes
Genes and pseudogenes
Tandem repeats
  Simple short sequence repeats
  (A)n, (CGG)n
WHY STUDY REPEATS? (1)
Eukaryotic genomes contain a lot of repeats.
  E.g. the human genome contains 50% repeats.
Repeats are believed to play an important role in evolution and disease.
  E.g. Alu elements are particularly prone to recombination. Insertion of Alu repeats inactivates genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).
Repeats are important to chromatin structure.
  Most TEs in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).
  Heterochromatin is known to contain many SINE and LINE repeats.
WHY STUDY REPEATS? (2)
Repeats complicate sequence assembly and genome comparison.
  Many people remove repeats before they analyze the genome.
Repeats set hurdles for microarray probe signal analysis.
  The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.
Repeats may contribute to human diversity more than genes.
Repeats can be used as a DNA fingerprint.
STEPS IN REPEAT FINDING
Repeat library (RepeatMasker)
De novo repeat discovery (two steps):
  Identification of repeats
  Classification of repeats
SAGRI ALGORITHM
ALGORITHM OUTLINE
Input: a text G
FindHit phase: finds all candidates for the second occurrence of a repeat region
  ACGACGCGATTAACCCTCGACGTGATCCTC
Validation phase: uses the hits from phase 1 to find all pairs of repeats
  ACGACGCGATTAACCCTCGACGTGATCCTC
SPECTRUM-BASED REPEAT FINDER
What is a spectrum?
Given a string G, its spectrum is the set of all k-mers occurring in G.
E.g. k = 3, G = ACGACGCTCACCCT
The spectrum is ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA.
CTC is a k-mer occurring at position 7.
ACG is a k-mer occurring at positions 1 and 4.
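The spectrum example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `spectrum_positions` is an assumption of this example, not something taken from the slides, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict

def spectrum_positions(text, k):
    """Map each k-mer of text to the 1-based positions where it starts."""
    positions = defaultdict(list)
    for start in range(len(text) - k + 1):
        positions[text[start:start + k]].append(start + 1)
    return dict(positions)

occ = spectrum_positions("ACGACGCTCACCCT", 3)
print(sorted(occ))              # the spectrum, in alphabetical order
print(occ["CTC"], occ["ACG"])   # starting positions of two of its k-mers
```

The keys of the dictionary form the spectrum; the values are the occurrence lists used later to decide whether a k-mer has been seen before.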
OBSERVATION 1: HOW TO FIND CANDIDATE REGIONS CONTAINING REPEATS?
Two regions of repeats should share some k-mers.
E.g. the following repeats share CGA.
ACGACGCGATTAACCCTCGACGTGATCCTC
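FEASIBLE EXTENSION (BUD)
S = ACGACGTGATTAACCCTCGACGTGATCCTC (i marks the current position in S)
Given the spectrum S for G[1..i-1], the candidate extensions of the k-mer CGA are:
  A: not feasible (X)
  C: feasible
  G: not feasible (X)
  T: feasible
Note: T is called a fooling probe!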
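A feasible extension can be checked directly against the spectrum of the prefix. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that i = 18 (the second occurrence of CGA in the slide's sequence) and that the spectrum of G[1..i-1] contains the 3-mers lying fully inside that prefix.

```python
def kmers_before(text, i, k):
    """Spectrum of text[1..i-1] (1-based): all k-mers lying fully inside that prefix."""
    prefix = text[:i - 1]
    return {prefix[p:p + k] for p in range(len(prefix) - k + 1)}

def feasible_extensions(spectrum, kmer):
    """Letters x such that the last k-1 letters of kmer followed by x form a k-mer in the spectrum."""
    return [x for x in "ACGT" if kmer[1:] + x in spectrum]

S = "ACGACGTGATTAACCCTCGACGTGATCCTC"
spec = kmers_before(S, 18, 3)           # assumption: i = 18
print(feasible_extensions(spec, "CGA"))
```

Under these assumptions the feasible letters are C and T. C continues the earlier copy at position 2, while T is feasible only because GAT occurs elsewhere in the prefix, which is what makes it a fooling probe.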
OBSERVATION 2
A path of feasible extensions may be a repeat.
Example: S = ACGACGCTATCGATGCCCTC
The spectrum S for G[1..10] is ACG, CGA, CGC, CTA, GAC, GCT, TAT.
Starting from position 11, there exists a path of feasible extensions: CGA-C-G-C
This path corresponds to the length-6 substring at position 2 (CGACGC).
Also, this path has one mismatch compared with the length-6 substring at position 11 (CGATGC).
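The Observation 2 example can be checked mechanically. The sketch below uses hypothetical helper names and simply verifies the three claims of the slide: the path is feasible with respect to the spectrum of G[1..10], it spells a substring found at position 2, and it differs from the text at position 11 in one letter.

```python
def spectrum_of(text, k):
    """Set of all k-mers of text."""
    return {text[p:p + k] for p in range(len(text) - k + 1)}

def is_feasible_path(spectrum, path, k):
    """True if every k-mer along the path belongs to the spectrum."""
    return all(path[p:p + k] in spectrum for p in range(len(path) - k + 1))

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

S = "ACGACGCTATCGATGCCCTC"
spec = spectrum_of(S[:10], 3)
path = "CGACGC"                       # CGA-C-G-C
print(is_feasible_path(spec, path, 3))
print(S.find(path) + 1)               # 1-based position of the earlier copy
print(hamming(path, S[10:16]))        # mismatches against the text at position 11
```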
PHASE 1: FINDHIT()
Algorithm:
Input: a text G
Initialize the empty spectrum S
For i = 1 to n
  /* we maintain the invariant that S is a spectrum for G[1..i-1] */
  Let x be the k-mer at position i
  If x exists in S, run DetectRepSeq(S, i)
  Insert x into S
Note: DetectRepSeq(S, i) looks for a repeat occurring at position i.
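The FindHit loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the detection step is passed in as a callback (`detect_rep_seq`, a hypothetical parameter name), and the spectrum here holds every k-mer that starts before the current position.

```python
def find_hit(text, k, detect_rep_seq):
    """Scan text once; call detect_rep_seq whenever the current k-mer was seen before."""
    spectrum = set()   # k-mers starting at earlier positions
    hits = []
    for i in range(len(text) - k + 1):
        x = text[i:i + k]
        if x in spectrum:
            hits.append(detect_rep_seq(spectrum, i))
        spectrum.add(x)
    return hits

# Stand-in detector that only reports the 1-based position of the re-occurring k-mer.
positions = find_hit("ACGACGCTCACCCT", 3, lambda spectrum, i: i + 1)
print(positions)
```

With the spectrum example sequence, the only k-mer that re-occurs is ACG, so the stand-in detector is called once, at position 4.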
EXAMPLE: DetectRepSeg(S(18), 18)
[Figure, built up over several animation frames: the text with a Ref pointer and a Curr pointer, positions 18 to 28 marked, the spectrum S(18), and the tree of feasible extensions explored from the k-mer CGA at position 18.]
Text: ACGAAGTGATTAACCCTCGACGCGATCC
Letters shown under positions 18 to 28: CGA C G C G A T C T
Spectrum S(18): AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA
Search tree on the final frame (the number after each letter is the running mismatch count against the text; * marks a pruned branch):
  CGA-T1-T2-A3*
  CGA-A1-G1-T2-G2-A2-T2-T3*
  CGA-A1-C2-C2-C3*
  CGA-A1-C2-G3*
OTHER DETAILS
Extend backward
Stop backtracking after h steps
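The extension search illustrated by the DetectRepSeg example can be sketched as a depth-first search over feasible extensions with a mismatch cap. This is a hedged illustration rather than the authors' procedure: the function names are hypothetical, a branch is pruned as soon as its mismatch count reaches `max_mismatches` (3 here, matching the starred nodes of the example), and neither backward extension nor the backtracking limit h is modeled. It assumes k >= 2.

```python
def kmers_starting_before(text, i, k):
    """k-mers that start at 0-based positions 0..i-1."""
    return {text[p:p + k] for p in range(min(i, len(text) - k + 1))}

def feasible_paths(text, i, k, spectrum, max_mismatches):
    """Depth-first search over feasible extensions from the k-mer at 0-based position i.

    Returns (path, mismatches) for every path that cannot be extended further
    without reaching max_mismatches or running past the end of the text.
    """
    finished = []
    stack = [(text[i:i + k], 0)]
    while stack:
        path, mismatches = stack.pop()
        pos = i + len(path)              # text index the next letter is compared with
        children = []
        if pos < len(text):
            for letter in "ACGT":
                if path[-(k - 1):] + letter in spectrum:
                    m = mismatches + (letter != text[pos])
                    if m < max_mismatches:
                        children.append((path + letter, m))
        if children:
            stack.extend(children)
        else:
            finished.append((path, mismatches))
    return finished

text = "ACGAAGTGATTAACCCTCGACGCGATCC"
spectrum = kmers_starting_before(text, 17, 3)      # 0-based 17 is position 18
paths = feasible_paths(text, 17, 3, spectrum, max_mismatches=3)
longest = max(paths, key=lambda item: len(item[0]))
print(longest)
```

On the example text the longest surviving path spells the prefix of the earlier copy that starts at position 2, with two mismatches against the text at position 18. The search is exponential in the worst case, which is presumably why the slides bound the backtracking.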
VALIDATION PHASE
Decompose the hits into a set of k-mers and index all the locations of these k-mers.
For each pair of locations of a k-mer w in the hits, do a BLAST extension.
  Use an auxiliary data structure to avoid double checking.
Report the pairs whose length exceeds our threshold.
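The validation steps can be sketched as follows. Several things here are assumptions made for illustration: the BLAST extension is replaced by a much simpler ungapped, mismatch-bounded extension to the right, the "auxiliary data structure" is a per-diagonal record of how far the text has already been covered, and all function and parameter names are hypothetical.

```python
from collections import defaultdict

def extend_right(text, p, q, max_mismatches):
    """Length of the ungapped match starting at p and q, trimmed to end on a match."""
    length = mismatches = i = 0
    while q + i < len(text):
        if text[p + i] == text[q + i]:
            length = i + 1
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
        i += 1
    return length

def validate(text, hit_regions, k, min_len, max_mismatches=0):
    """Report (p, q, length) for seed pairs whose extension is at least min_len long."""
    index = defaultdict(set)                      # k-mer -> start positions inside the hits
    for start, end in hit_regions:                # half-open 0-based regions
        for p in range(start, end - k + 1):
            index[text[p:p + k]].add(p)
    seeds = sorted((q - p, p) for locs in index.values()
                   for p in locs for q in locs if p < q)
    covered = {}                                  # diagonal -> end of the last extension
    pairs = []
    for diag, p in seeds:
        if p < covered.get(diag, 0):
            continue                              # already inside an earlier extension
        length = extend_right(text, p, p + diag, max_mismatches)
        covered[diag] = p + length
        if length >= min_len:
            pairs.append((p, p + diag, length))
    return pairs

print(validate("ACGTTCAGTTGACGTTCAGCC", [(0, 21)], k=4, min_len=8))
```

Sorting the seeds by diagonal and position means each diagonal is extended once from its leftmost seed, and later seeds that fall inside that extension are skipped.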
ANALYSIS
How to find most repeats?
  Avoid false negatives
How to get better speed?
  Avoid false positives
HOW DO WE CHOOSE K? (1)
If k is too big, the k-mers are too specific and we may miss some repeats.
If k is too small, the k-mers cannot help us differentiate repeats from non-repeats.
For repeats of length 50 and similarity > 0.9, we found that $k \ge \log_4 n + 2$ is good enough.
HOW DO WE CHOOSE K? (2)
A random k-mer matches one of n chosen k-mers.
Pr(a k-mer re-occurs by chance in a sequence of length n) (analogous to throwing n balls into $4^k$ bins):
$1 - (1 - 4^{-k})^m \approx 1 - \exp(-m/4^k)$
Taking $m = n$, we require $1 - \exp(-n/4^k) \le \epsilon_1$; hence $k \ge \log_4 n + \log_4(1/\epsilon_1)$.
If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.
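The choice of k can be explored numerically. The sketch below assumes the chance re-occurrence probability is 1 - exp(-n/4^k) and that the tolerated probability is 1/16, which is how this slide appears to arrive at the "+ 2" term; the function names are hypothetical.

```python
import math

def reoccurrence_prob(n, k):
    """Approximate probability that a k-mer re-occurs by chance among n random k-mers."""
    return 1.0 - math.exp(-n / 4 ** k)

def smallest_k(n, eps):
    """Smallest k whose chance re-occurrence probability is at most eps."""
    k = 1
    while reoccurrence_prob(n, k) > eps:
        k += 1
    return k

def bound_k(n, eps):
    """Closed-form bound log_4 n + log_4 (1/eps)."""
    return math.log(n, 4) + math.log(1 / eps, 4)

for n in (4 ** 10, 10 ** 6):
    print(n, smallest_k(n, 1 / 16), round(bound_k(n, 1 / 16), 3))
```

For a sequence of about a million bases both the loop and the closed form point to k = 12.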
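THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) (1)
A pair of repeats of length L, with m mismatches.
Probability of a preserved k-mer in the repeat is
$1 - M / \binom{L}{m}$
where M is the number of nonnegative integer solutions to
$x_1 + x_2 + \cdots + x_{m+1} = L - m$
subject to
$0 \le x_1, x_2, \ldots, x_{m+1} \le k - 1$
[Figure: a repeat of length L in which the m mismatches (X) separate matching runs of lengths x_1, x_2, ..., x_{m+1}.]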
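THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) (2)
It is easy to see that M is the coefficient of $x^{L-m}$ in
$(1 + x + x^2 + \cdots + x^{k-1})^{m+1} = \frac{(1 - x^k)^{m+1}}{(1 - x)^{m+1}}$
Hence
$M = \sum_{j=0}^{\lfloor (L-m)/k \rfloor} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$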
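The count M can be evaluated with the inclusion-exclusion sum that follows from expanding (1 - x^k)^(m+1) / (1 - x)^(m+1), and cross-checked by brute force on small cases. The sketch below uses hypothetical function names; the brute-force check enumerates every placement of the m mismatches and keeps those in which no run of matches reaches length k.

```python
from itertools import combinations
from math import comb

def count_no_preserved_kmer(L, m, k):
    """M: placements of m mismatches in length L with every run of matches shorter than k."""
    return sum((-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
               for j in range((L - m) // k + 1))

def prob_preserved_kmer(L, m, k):
    """Probability that at least one k-mer survives when the m mismatches are placed uniformly."""
    return 1 - count_no_preserved_kmer(L, m, k) / comb(L, m)

def brute_force_count(L, m, k):
    """Same count as count_no_preserved_kmer, by enumerating mismatch positions."""
    total = 0
    for mismatches in combinations(range(L), m):
        bounds = (-1,) + mismatches + (L,)
        runs = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
        total += all(run < k for run in runs)
    return total

print(count_no_preserved_kmer(10, 3, 3), brute_force_count(10, 3, 3))
print(prob_preserved_kmer(50, 5, 12))   # length 50, 5 mismatches, k = 12
```

The last line evaluates the setting quoted earlier in the slides (length 50, similarity 0.9, taken here as exactly 5 mismatches, which is an assumption of this example).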
CRITERION FOR PATH TERMINATION (1)
Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches at, say, 10%.
Then the pruning strategy is length dependent: if the length of the string is r, we allow $\delta(r)$ mismatches.
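One way to make a mismatch allowance depend on the path length is to use a binomial tail. The sketch below is an assumption made for illustration and is not the formula used in the slides: it treats the r positions as independent, each mismatching with probability q, and returns the smallest allowance that a true repeat would exceed with probability at most eps. The function names are hypothetical.

```python
from math import comb

def tail_prob(r, q, s):
    """P(more than s mismatches among r independent positions with mismatch probability q)."""
    return sum(comb(r, j) * q ** j * (1 - q) ** (r - j) for j in range(s + 1, r + 1))

def allowed_mismatches(r, q, eps):
    """Smallest s such that exceeding s mismatches has probability at most eps."""
    s = 0
    while tail_prob(r, q, s) > eps:
        s += 1
    return s

for r in (20, 50, 100):
    print(r, allowed_mismatches(r, 0.1, 0.01))
```

The allowance grows with r but more slowly than a fixed 10% of r plus a constant would suggest for short paths, which is the practical reason for pruning by a length-dependent rule.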
CRITERION FOR PATH TERMINATION (2)
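Let q be the mismatch probability and r be the length of the string.
Probability that a string has s mismatches = $P_q(s)$
[Equation for $P_q(s)$: a binomial-type sum over j in terms of q, r and s; not legible in this transcript.]
For a threshold $\epsilon$ (say, 0.01), we set $\delta(r) = \max\{\, 2 \le s \le r-2 \mid P_q(s) > \epsilon \,\} + 2$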
CONTROL OF FALSE POSITIVES (1)
Two typical cases
The probability of (case 1)/(case 2) is 2*4-
P(case 1 or case 2) is small.
For example: 4 errors, q = 0.1, k = 12, P(case 1) = 1.77 * 10^-8
EVALUATION
Comparison with other programs
PROGRAMS
EulerAlign by Zhang and Waterman
PALS by Edgar and Myers
REPuter by Kurtz et al.
SAGRI
MEASUREMENT
Count Ratio (CR): the ratio of the number of reported repeat pairs that share more than 50% with a reference pair to the number of reference pairs.
Shared Repeat Region (SRR): the ratio of the found region to the reference region.
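The two measures can be sketched on interval data. The slide definitions leave some details open, so the following choices are assumptions of this example: a repeat pair is a pair of half-open intervals, a reference pair counts as found when some reported pair overlaps more than 50% of each of its two copies, and SRR is the fraction of reference bases covered by reported regions. All names are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def shares_majority(found, ref):
    """True if each copy of the found pair covers more than 50% of the matching reference copy."""
    return all(2 * overlap(f, r) > (r[1] - r[0]) for f, r in zip(found, ref))

def count_ratio(found_pairs, ref_pairs):
    """Fraction of reference pairs matched by at least one found pair."""
    hit = sum(any(shares_majority(f, r) for f in found_pairs) for r in ref_pairs)
    return hit / len(ref_pairs)

def shared_repeat_region(found_pairs, ref_pairs):
    """Fraction of reference bases that lie inside some found region."""
    ref_bases = {p for pair in ref_pairs for s, e in pair for p in range(s, e)}
    found_bases = {p for pair in found_pairs for s, e in pair for p in range(s, e)}
    return len(ref_bases & found_bases) / len(ref_bases)

reference = [((0, 10), (20, 30))]
reported = [((0, 8), (20, 28))]
print(count_ratio(reported, reference), shared_repeat_region(reported, reference))
```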
SIMULATED DATA
Conclusion from simulated data
  The result is consistent with the analysis.
GENOME DATA
M.gen (0.6 Mbp)
  Organism with the smallest genome
  Lives in the primate genital and respiratory tracts
C.tra (1 Mbp)
  Lives inside the cells of humans
A.ful (2.1 Mbp)
  Found in high-temperature oil fields
E.coli (4 Mbp)
  An important bacterium that lives in the lower intestines of mammals
Human chr22, p20M to p21M (1 Mbp)
Use the CR and SRR ratios to measure.
Cross validation:
  G/H = 1, H/G < 1: G “outperforms” H
  G/H < 1, H/G = 1: H “outperforms” G
  G/H < 1, H/G < 1: G and H are complementary
  G/H = 1, H/G = 1: G and H are similar
QUESTIONS AND ANSWERS
H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008