methods for repeat detection in nucleotide sequences

83
Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010 Methods for Repeat Detection In Nucleotide Sequences Gary Benson Computer Science, Biology, Bioinformatics Boston University [email protected]

Upload: dorit

Post on 24-Feb-2016

49 views

Category:

Documents


0 download

DESCRIPTION

Methods for Repeat Detection In Nucleotide Sequences. Gary Benson Computer Science, Biology, Bioinformatics Boston University [email protected]. Outline. Classes of Repeats in DNA Tandem Repeats Techniques for finding repetitive sequence Tandem Repeats Database Variant Tandem Repeats. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Methods for Repeat Detection In Nucleotide Sequences

Gary BensonComputer Science, Biology, Bioinformatics

Boston [email protected]

Page 2: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Outline

• Classes of Repeats in DNA• Tandem Repeats• Techniques for finding repetitive sequence• Tandem Repeats Database• Variant Tandem Repeats

Page 3: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Why Look at Repeats in DNA?

• Repeats make up the largest portion of DNA.– coding sequence (~5% of human DNA) – repetitive sequence (>50% of human DNA)

Page 4: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Classes of Repeats in DNA

• Interspersed repeats:– Retrotransposons

• Sines:• Lines:• LTRs

– Transposons

Page 5: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Classes of Repeats in DNA

• Inverted repeats• Tandem repeats

– Satellite repeats– microsatellites– minisatellites– VNTR (variable number of tandem repeats)

Page 6: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Tandem Repeats

A tandem repeat (TR) is any pattern of nucleotides that has been duplicated so that it appears several times in succession.

For example, the sequence fragment below contains a tandem repeat of the trinucleotide cgt:

tcgctggtcata cgt cgt cgt cgt cgt tacaaacgtcttccgt

Page 7: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Approximate Tandem Repeats

More typically, the tandem copies are only approximate due to mutations. Here is an alignment of copies from a human TR from Chromosome 5.

Shown are

and

a consensus pattern

23.7 copies

Page 8: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Why are tandem repeats interesting?

• They are associated with human disease:Fragile-X mental retardation Myotonic dystrophy Huntington’s diseaseFriedreich’s ataxiaEpilepsy DiabetesOvarian cancer

• They are often polymorphic, making them valuable genomic markers. Also, they may cause significant variation in the human population.

• They are involved in gene regulation and often may contain transcription factor binding sites.

Page 9: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

LOCUS RATIGCA 4461 bp DNA ROD 18-APR-1994

DEFINITION Rat Ig germline epsilon H-chain gene C-region, 3' end. 2881 cgccccaagt aggcttcatc atgctctttg gtttagcaat agcccaaagc aagctatgca 2941 tccatctcag gcccagaggg atgaggagac cagaatcaag acatacccac gcccatccca 3001 cgcccaacca ccaaccacca gcacatcagg ttcacacacc tgagaccagt ggctcccatc 3061 acacacacac acacacacac acacacacac acacacacac acacacaagc ccgtacacat 3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct ctgaggatca ccaatggcag 3181 agtcggccag cacctcagcc tccaggccaa tccttatact ttggcccact gcaggccatg 3241 agagatggag gaggtggagg cctgagctgt ggaaaaccag agacaggaag atggtctgta 3301 ctccaggcca atccttatac tttggcccac tgcaggccat gagagatgga ggaggtggag 3361 gcctgagctg tggaaaacca gagacaggaa gatggtctgt atggagagag tagtaaacca 3421 gattataggg agactgaggc aggagtagag ctcctacaag gccagtagtc taccttagag 3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg ctcctcagat atcacaacca 3541 agattcctgg agccagagtg tgcatgcagg ccctagaaga aatgtggagc ttagagccct 3601 tcctggaggg ccctgggcac tctgaacaaa aggcaattct gtaggctgta tagaggcatc 3661 ctgtcagata cacacacaca tgcacacaca tacacacaca gagacacaga cacacacaca 3721 tgcccacaca catgcataca cacatgcaca cacatacaca cacagagaca cagacacaca 3781 catgcccaca cacatgcata cacacatgca tgcacacaca cacacacaca tacacataca 3841 cacacacaca cacaccccgc aggtagcctt catcatgctg tctagcgata gccctgctga 3901 gggtgggaga tactgggtca tggtgggcac cggagtagaa agagggaatg agcagtcagg 3961 gtcaggggaa aaggacatct gcctccaggg ctgaacagag acttggagca gtcccagagc 4021 aagtgggatg gggagctctg ccactccagt ttcaccagga ctgcctgaga ccagtgaggg

Page 10: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

LOCUS RATIGCA 4461 bp DNA ROD 18-APR-1994

DEFINITION Rat Ig germline epsilon H-chain gene C-region, 3' end. 2881 cgccccaagt aggcttcatc atgctctttg gtttagcaat agcccaaagc aagctatgca 2941 tccatctcag gcccagaggg atgaggagac cagaatcaag acatacccac gcccatccca 3001 cgcccaacca ccaaccacca gcacatcagg ttcacacacc tgagaccagt ggctcccatc 3061 acacacacac acacacacac acacacacac acacacacac acacacaagc ccgtacacat 3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct ctgaggatca ccaatggcag 3181 agtcggccag cacctcagcc tccaggccaa tccttatact ttggcccact gcaggccatg 3241 agagatggag gaggtggagg cctgagctgt ggaaaaccag agacaggaag atggtctgta 3301 ctccaggcca atccttatac tttggcccac tgcaggccat gagagatgga ggaggtggag 3361 gcctgagctg tggaaaacca gagacaggaa gatggtctgt atggagagag tagtaaacca 3421 gattataggg agactgaggc aggagtagag ctcctacaag gccagtagtc taccttagag 3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg ctcctcagat atcacaacca 3541 agattcctgg agccagagtg tgcatgcagg ccctagaaga aatgtggagc ttagagccct 3601 tcctggaggg ccctgggcac tctgaacaaa aggcaattct gtaggctgta tagaggcatc 3661 ctgtcagata cacacacaca tgcacacaca tacacacaca gagacacaga cacacacaca 3721 tgcccacaca catgcataca cacatgcaca cacatacaca cacagagaca cagacacaca 3781 catgcccaca cacatgcata cacacatgca tgcacacaca cacacacaca tacacataca 3841 cacacacaca cacaccccgc aggtagcctt catcatgctg tctagcgata gccctgctga 3901 gggtgggaga tactgggtca tggtgggcac cggagtagaa agagggaatg agcagtcagg 3961 gtcaggggaa aaggacatct gcctccaggg ctgaacagag acttggagca gtcccagagc 4021 aagtgggatg gggagctctg ccactccagt ttcaccagga ctgcctgaga ccagtgaggg

Page 11: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch model – Sequences differ only by mismatches:

AAAGCTTCGGAGTGCCCGAAATGCATCGGGGTGCCTGA

Sequence 1:

Sequence 2:

Page 12: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Tandem Repeats Finder

An online sequence analysis tool.

OR

A program to download and run locally.

Data from TRF is listed as “simple repeats” at the UCSC genome browser website.

Page 13: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch model – Sequences differ only by mismatches:

AAAGCTTCGGAGTGCCCGAAATGCATCGGGGTGCCTGA

1101101111011111011

Alignments of similar sequences can be represented by bit strings (zeros and ones).

Sequence 1:

Sequence 2:

Page 14: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch model – Sequences differ only by mismatches:

One model parameter required:

p = probability of matching letters in a column = probability of a 1 in the bit string

Sometimes known as a Bernoulli (coin toss) model.

Page 15: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch/indel model – adds indels to sequence differences:

AAAGCTTCGG-AGT--GCCCGAAA-GCATCGGGAGTTAGCCTGA

Sequence 1:

Sequence 2:

Page 16: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch/Indel model – adds indels to sequence differences:

AAAGCTTCGG-AGT--GCCCGAAA-GCATCGGGAGTTAGCCTGA

1121101111211122111011

Alignments of sequences can be represented by strings of numbers in [0,1,2].

Sequence 1:

Sequence 2:

Page 17: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch/Indel model – adds indels to sequence differences:

AAAGCTTCGG-AGT--GCCCGAAA-GCATCGGGAGTTAGCCTGA

1121101111311133111011

If the direction of insertion or deletion matters, we use strings of numbers in [0,1,2,3].

Sequence 1:

Sequence 2:

Page 18: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Similarity Models

Match/mismatch/indel model – adds indels to sequence differences:

At least two model parameters required:

p = probability of matching letters in a column = probability of a 1 in the numerical string r = probability of an insertion or deletion

= probability of a 2 or 3 in the numerical string

Page 19: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Detecting Similar Sequences

Methods for similarity detection involve some form of scanning the input sequences, usually, with a window of fixed size. Information about the contents of the window is stored. This is called indexing.

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

Page 20: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Indexing

The index is a list of all possible window contents together with a list, for each content, of where it occurs:

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…,TTG,…TTT

Page 21: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

First sequence:

Page 22: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0

Page 23: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1

Page 24: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2

Page 25: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 3

Page 26: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

Page 27: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8

Page 28: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Building an Index

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8A C G T T G C A G T T G A C T G A C G

AAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8 9

Page 29: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Scanning a new sequenceAAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8 9

T G C A G T T G . . . Second sequence:

Page 30: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Scanning a new sequenceAAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8 9

T G C A G T T G . . . Second sequence:

Page 31: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Scanning a new sequenceAAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8 9

T G C A G T T G . . . Second sequence:

Page 32: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Scanning a new sequenceAAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8 9

T G C A G T T G . . . Second sequence:

Page 33: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Scanning a new sequenceAAA, AAC,….,ACG,…,CAG,…,CGT,…GAC,…,GTT,…,TGC,…TTG,…TTT

0 1 2 4 3

8 9

T G C A G T T G . . . Second sequence:

Page 34: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Interaction between the similarity model and the index

Once a model is chosen and the index is built, two questions arise:

1. Is it possible to find a match using the window size chosen?

2. How many character matches are likely to be detected with the window size chosen?

Page 35: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q1: Is it possible to find a match?

This is known as the waiting time problem. Waiting Time: How many consecutive positions must

be examined until a run of k ones occurs.

Page 36: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q1: Is it possible to find a match?

This is known as the waiting time problem. Waiting Time: How many consecutive positions must

be examined until a run of k ones occurs.

Specific sequence example:AAAGCTTCGGAGTGCCCGAAATGCATCGGGGTGCCTGA1101101111011111011

Sequence 1:

Sequence 2:

Page 37: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Waiting Time Specific Example

AAAGCTTCGGAGTGCCCGAAATGCATCGGGGTGCCTGA1101101111011111011

k waiting time1 12 23 94 105 166 -

Sequence 1:

Sequence 2:

Page 38: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q1: Is it possible to find a match?

Waiting Time: Given a Bernoulli sequence with generating probability p and length n, what is the probability that a run of k ones occurs?

Randomly generated Bernoulli sequence using p: k

1110101111011011010

n

Page 39: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Waiting Time Formulas

These calculate the probability of a first occurrence of a run of k ones at every sequence length from 1 to n.

for n ≥ 3, k = 3F(111:n) =

P(1)3 – F(111: n – 1) · P(1) – F(111: n – 2) · P(1)2 – ∑ k = 3 to n – 3 F(111: k) · P(1)3

where:F(111:n) is the probability of a first occurrence of 3 ones in a row

at position n,P(1) is the model probability of a match.

Page 40: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Waiting Time Formulas

Predictions: If k = 3, p = .5, n = 121. In what position [0..12] is it most likely to get a first

occurrence of 3 ones in a row?2. By what position will there be a cumulative probability of

30% to see a first occurrence of 3 ones in a row? 3. What is the likely cumulative probability of getting 3 ones in

a row anywhere up to position 12?

Page 41: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Waiting Time Formulas

Calculated probabilities:

Probabilities of first occurrence of patterns in coin toss sequences

P(1) P(0) P(1) P(1)^2 P(1)^30.5 0.5 0.5 0.25 0.125

HHHP(111) Position 1 2 3 4 5 6 7 8 9 10 11 12

Probability 0.000000 0.000000 0.125000 0.062500 0.062500 0.062500 0.054688 0.050781 0.046875 0.042969 0.039551 0.036377Cumulative 0.000000 0.000000 0.125000 0.187500 0.250000 0.312500 0.367188 0.417969 0.464844 0.507813 0.547363 0.583740

Page 42: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q2: How many character matches will be detected?

This is known as the coverage problem. Coverage: Given a Bernoulli sequence with generating

probability p and length n, what is the probability distribution for number of ones contained in runs of k or more ones?

Page 43: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q2: How many character matches will be detected?

Specific sequence example: Let k = 3, n = 19

AAAGCTTCGGAGTGCCCGAAATGCATCGGGGTGCCTGA1101101111011111011

Sequence 1:

Sequence 2:

n

k

Page 44: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q2: How many character matches will be detected?

Specific sequence example: Let k = 3, n = 19

AAAGCTTCGGAGTGCCCGAAATGCATCGGGGTGCCTGA1101101111011111011

Sequence 1:

Sequence 2:

n

Total character matches detected is 9.

Page 45: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Data Structure – modified Aho Corasick Tree

Seed is 1*1**1.

Tree represents all patternsobtained by replacing each * by either 0 or 1.

Fail links in AC tree go to longest match between a stringsuffix and a prefix of a pattern.

Page 46: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Recurrence Formula

Page 47: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Q2: How many character matches will be detected?

Page 48: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Basic Assumption

We assume that two, mutated, adjacent copies of a pattern will contain runs of exact matches.

d d

T A T A C G T C T C C A C G G A

Page 49: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Basic Assumption

We assume that two, mutated, adjacent copies of a pattern will contain runs of exact matches.

d d

T A T A C G T C T C C A C G G A

We identify the runs with seeds.

Page 50: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Basic Assumption

We assume that two, mutated, adjacent copies of a pattern will contain runs of exact matches.

d d

T A T A C G T C T C C A C G G A

Page 51: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

The TRF Algorithm Outline

Page 52: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Criteria for Recognition• Are there enough matches at a common distance?• Are there enough matches if nearby distances are

included?• Do the matches start close enough to the left end?

Page 53: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Page 54: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Tools

Page 55: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Selecting a Data Set

Page 56: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Viewing a Data Set

Page 57: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

TRF Characteristics Table

Page 58: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Filter for large patterns with many copies

Page 59: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

More information about a single repeat

Page 60: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Single repeat view

Page 61: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Filter for Gene Overlap or Proximity

Page 62: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Filters for Triplets in Genes

Page 63: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Triplets in Genes

Page 64: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Changing Visible Columns

Page 65: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Changing Visible Columns

Page 66: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Link for Annotations

Page 67: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Annotations

Page 68: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Information link to the Source Database

Page 69: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Following the Information link to the UCSC Browser

Page 70: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

The TRDB Browser link

Page 71: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

The TRDB Browser

Page 72: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Distributions for a Data Set

Page 73: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Pattern size distribution Human chr. I: size 1 - 60

Page 74: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Pattern size distribution Human chr. I: sizes 60 - 120

Page 75: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Pattern size distribution Drosophila chr. 2R: size 1 - 60

Page 76: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Clustering repeats

Page 77: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Clustering repeats

Page 78: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Human Chr. 15 Family

Page 79: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Human Chr. 1 Family

Page 80: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Data Download

Page 81: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Data Download

Page 82: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Using TRF on your own data

Page 83: Methods for Repeat Detection In Nucleotide Sequences

Bioinformatic and Comparative Genome Analysis course, Institut Pasteur, July 5 - July 17, 2010

Uploading a Sequence