Transposable Elements (TE) in genomic sequence Mina Rho

Upload: george-anderson

Post on 31-Dec-2015


Page 1: Transposable Elements (TE) in genomic sequence Mina Rho

Transposable Elements (TE) in genomic sequence

Mina Rho

Page 2

Contents

• Definition

• De novo identification of repeat families in large genomes (RepeatScout)
  Alkes L. Price, Neil C. Jones and Pavel A. Pevzner

• Combined Evidence Annotation of Transposable Elements in Genome Sequences
  Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, Dominique Anxolabehere

Page 3

Mobile element/Transposable element

Transposon
- a segment of DNA that can move to different positions in the genome of a single cell.
- is cut out of its location and inserted into a new location (cut and paste).
- moves as DNA (no RNA intermediate).

Retrotransposon
- copies and pastes itself into a new location.
- the copy is made of RNA and reverse-transcribed back into DNA by reverse transcriptase.
- LTR retrotransposons carry long terminal repeats (LTRs) at their ends.

=> TEs are expected to yield information about evolution, mutation, and changes in the amount of DNA in the genome.

Page 4
Page 5
Page 6

RepeatScout

Page 7

Definition

• Repeat family: a collection of similar sequences that appear many times in a genome.
  – the Alu repeat family has over 1 million approximate occurrences in the human genome
  – repeat families account for ~50% of the human genome

• l-mer: a substring of length l

Page 8

Background

• Current methods for identifying repeat families
  – Given an existing library of repeat families
    • RepeatMasker
  – De novo identification
    • REPuter (Kurtz et al., 2000)
    • RepeatFinder (Volfovsky et al., 2001)
    • RECON (Bao and Eddy, 2002)
    • RepeatGluer (Pevzner et al., 2004)
    • PILER (Edgar and Myers, 2005)
    • RepeatScout

Page 9

Overview of RepeatScout

• Method
  – Builds a table of high-frequency l-mers as seeds
  – Extends each seed to a longer consensus sequence

• Main advantage
  – an efficient method of similarity search that enables a rigorous definition of repeat boundaries.

Page 10

How to create l-mer table

[Figure: a hash table keyed by l-mer (l-mer1 to l-mer6). Each entry stores the l-mer's frequency and the position of its last occurrence, and is updated while scanning the sequence at positions i, i+1, i+2, ..., j, k.]
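The table described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a plain dictionary as the hash table and counting every forward-strand occurrence (overlapping ones included); the function name `build_lmer_table` is hypothetical, and RepeatScout itself may treat overlaps and reverse complements differently.

```python
def build_lmer_table(seq, l):
    """Map each l-mer to [frequency, position of last occurrence].

    Assumption: every window of length l on the forward strand is counted,
    including overlapping occurrences.
    """
    table = {}
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        entry = table.get(lmer)
        if entry is None:
            table[lmer] = [1, i]      # first time this l-mer is seen
        else:
            entry[0] += 1             # one more occurrence
            entry[1] = i              # remember the latest position
    return table


if __name__ == "__main__":
    for lmer, (freq, last) in sorted(build_lmer_table("ACGACGAC", 3).items()):
        print(lmer, freq, last)
```

A single pass over the sequence is enough, which is why the table can be built for whole genomes.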

Page 11

Output of l-mer table

l-mer            frequency  position of last occurrence
AAAAAAAAAAAGATA  8          2920943
AAAAAAAGGAAAGAA  5          2468525
AGGCTTGAACAATGG  3          1425014
AAAAAAAAGAAAGAA  62         3009663
GTTGGTTTCAAAGAA  7          2855871
AAAAAAAATTTTTTT  22         2992836
ATTCAAGTTAAATGG  4          1473342
ATTCAATGTAACCAC  3          1463008
ATGCATGCAATGCAT  9          1788944
ATGCATTTAAAAGAA  3          1464381
AAAAAACTCACTCCA  5          1489159

Page 12

How to build all positions of repeats

[Figure: the same hash table (l-mer1 to l-mer6), with each l-mer now linked to the list of all positions (i, i+1, i+2, ..., j, k) at which it occurs in the sequence.]
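One way to picture this step is a second pass that records every position of each l-mer instead of only the last one. The sketch below is an assumption-laden illustration: `lmer_positions` and its `min_count` parameter are hypothetical names, and only forward-strand exact matches are collected.

```python
from collections import defaultdict


def lmer_positions(seq, l, min_count=2):
    """Return {l-mer: [all start positions]} for l-mers seen at least min_count times."""
    positions = defaultdict(list)
    for i in range(len(seq) - l + 1):
        positions[seq[i:i + l]].append(i)
    # Keep only l-mers frequent enough to serve as seeds.
    return {k: v for k, v in positions.items() if len(v) >= min_count}


if __name__ == "__main__":
    print(lmer_positions("ACGACGAC", 3))
```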

Page 13

[Figure: a query sequence Q (Q1 Q2 Q3 Q4) seeded by a high-frequency l-mer (l-mer1) and aligned against its occurrences S1, S2, S3, S4, S5 in the sequence.]

Q is extended one nucleotide at a time, maximizing the objective function.
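The idea of growing Q one nucleotide at a time can be illustrated with a deliberately simplified sketch. It is not RepeatScout's procedure: instead of scoring alignments, it assumes exact matching, extends only to the right, and stops when fewer than `c` occurrences agree on the next base. The function name and parameters are hypothetical.

```python
from collections import Counter


def extend_right(seq, positions, l, c=3, max_ext=10000):
    """Greedily extend a seed to the right, one nucleotide at a time.

    positions: start positions of the seed l-mer in seq.
    c: minimum number of occurrences that must support each added base.
    Simplification: exact matches only, no gaps or mismatches.
    """
    q_len = l
    active = list(positions)          # occurrences still agreeing with Q
    extension = []
    while len(extension) < max_ext:
        # Count the base that follows Q in each occurrence still in play.
        following = Counter(seq[p + q_len] for p in active if p + q_len < len(seq))
        if not following:
            break
        base, support = following.most_common(1)[0]
        if support < c:
            break                     # too few occurrences support any base
        extension.append(base)
        active = [p for p in active
                  if p + q_len < len(seq) and seq[p + q_len] == base]
        q_len += 1
    return "".join(extension)


if __name__ == "__main__":
    genome = "TTACGTAGGCAA" + "GGACGTAGGCTT" + "CCACGTAGGCGA"
    print(extend_right(genome, [2, 14, 26], 4, c=3))
```

The real method tolerates mismatches and gaps through the alignment score a(Q, Sk), so occurrences are not discarded at the first disagreement as they are here.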

Page 14

Objective Function

|Q|: the length of Q
C: minimum threshold on the number of repeat elements
a(Q, Sk): a pairwise fit-preferred alignment score
p: incomplete-fit penalty
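The formula itself is not in the transcript. As a sketch only, the snippet below assumes the objective sums the per-occurrence scores a(Q, Sk), clipped at zero so that unrelated occurrences do not count against Q, and subtracts C times |Q|. This form is an assumption to be checked against the RepeatScout paper. The scores are taken as precomputed inputs, with the penalty p assumed to be already folded into each a(Q, Sk).

```python
def objective(scores, q_len, c):
    """Assumed form: sum_k max(a(Q, S_k), 0) - c * |Q|.

    scores: precomputed fit-preferred alignment scores a(Q, S_k), one per occurrence.
    q_len:  |Q|, the current length of the consensus.
    c:      minimum threshold on the number of repeat elements.
    """
    return sum(max(s, 0) for s in scores) - c * q_len


if __name__ == "__main__":
    # Three occurrences; the second aligns badly and contributes nothing.
    print(objective([10, -4, 7], q_len=5, c=2))
```

Under this assumed form, lengthening Q only pays off while enough occurrences keep aligning to offset the growing C·|Q| term.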

Page 15

Output of optimized Q

>R=0
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAA
>R=1
AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTTTGAAGAGAGTAGTGGTTCTCCCAGCACGCAGCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACCCCCGAGTAGCCTAACTGGGAGGCACCCCCCAGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAAACTTCCAGAGGAACAATCAGGCAGCAACATTTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCAGGCAAACAGGGTCTGGAGTGGACCTCCAGCAAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAACCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGATGGGGAAAAAACAGAGCAGAAAAACTGGAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCCTCACCAGCAACGGAACAAAGCTGGACGGAGAATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTCCAAGCTAAAGGAGGAAATTCAAACCCATGGCAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGAGCTGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAAAGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAAGTTTAGAGAAAAAAGAATAAAAAGAAATGA
>R=2
TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTCTCAAACTCCTGGGCTCAAGTGATCCTCCCACCTCAGCCTCTTTAATAGATGCGATTA
>R=3
TTTTTATACATGCTGTAGACAATCAATTCACACCTGTACTTTTTTTTAAGGTTGTGTTATTGCACTTTTATACCTCTTGACTGGTAGCTGATTTCCTTGAATACCTGTAAGGTAATCACCGGCTCACCAATGAATGTGGTTTTAACAATGGCTCACAGTGGCTTGGAAAGCCCTCATGGGAAGTATTTCTGAGGAAAAGTGGAGAGTGTGCAGGAATAGTTTTGAAAAACAGAGACAACCGATGTCCTCCTTCCCTCCCTTGCCTCTCCTCATGTGCCAGGTTTTCTGTTTTCTCCACTATTACAGAATCACCATGTTGTATCCTGTGATGAAAAGTTTTTATCTCTTTAATCATCCCATTTCGTCCTCCAGACCTTTTTTTTTCTGGAAGGGTTGTAAGCAGAAGGGACGAAACATCTTCAGAAAAACACATTATGATATAAACTTAGTGAAAAGATTCATCATATTTAAGAAATGGACAGGATGAAATCCTGAATTCATAAAAATTTTAAAAATCAGTTTACATAACATCCATCCCTTTTGTCTCTATCCCTTATCCA

Page 16

Parameter setting and post processing

• Parameter setting
  – Recommended smallest l = 15
  – For the arbitrary length L,
  – Q is extended by up to 10,000 bp on each side
  – Remove repeat families with |Q| < 50 bp

• Postprocessing
  – Tandem Repeats Finder, NSEG
    • Remove repeat families with >50% of their length annotated as low-complexity or tandem repeats
  – RepeatMasker
    • Mask the repeat families based on the library
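The two removal rules on this slide amount to a simple filter over the candidate families. The sketch below assumes that the fraction of each consensus annotated as low-complexity or tandem repeat has already been computed by external tools; `keep_family`, `filter_library`, and the input layout are hypothetical.

```python
def keep_family(consensus, masked_fraction, min_len=50, max_masked=0.5):
    """Apply the two removal rules from the slide.

    consensus:       the consensus sequence Q of a repeat family.
    masked_fraction: share of Q annotated as low-complexity or tandem repeat
                     (assumed to come from tools such as TRF and NSEG).
    """
    if len(consensus) < min_len:      # rule 1: drop families with |Q| < 50
        return False
    if masked_fraction > max_masked:  # rule 2: drop if >50% is low-complexity/tandem
        return False
    return True


def filter_library(families):
    """families: iterable of (name, consensus, masked_fraction) tuples."""
    return [name for name, q, frac in families if keep_family(q, frac)]


if __name__ == "__main__":
    library = [
        ("R=0", "ACGT" * 75, 0.10),   # long, mostly complex: kept
        ("R=1", "ACGT" * 10, 0.00),   # only 40 bp: removed
        ("R=2", "AT" * 100, 0.95),    # mostly tandem repeat: removed
    ]
    print(filter_library(library))
```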

Page 17

Benchmark

• C. briggsae genome (108 Mb)
• 7 h on a single 0.5 GHz DEC Alpha processor

Page 18

Combined evidence model of TE

Page 19

Overview

Query Sequences: Drosophila melanogaster (Fruit fly) Release 3, 4

Combined evidence model: pipeline of RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, and TE-HMM

- Methods for the annotation of known TE families

- Methods for the annotation of anonymous TE families

Benchmark: FlyBase Release 3.1 annotation

Evaluated by sensitivity, specificity, and characteristics of the predicted boundaries

Page 20

Tools

• BLASTER
  – Compares query sequences against a subject databank.
  – Launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX).
  – Cuts long sequences before launching BLAST and reassembles the results.

• MATCHER
  – Maps match results onto query sequences by filtering overlapping hits.
  – Keeps the match results with E-value < 10^-10 and length > 20.
  – Chains the remaining matches by dynamic programming.

• GROUPER
  – Gathers similar sequences into groups.
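The MATCHER thresholds can be read as a plain filter over match records. The sketch below only illustrates that filtering step and is not MATCHER's code. The `Hit` record, its field names, and the half-open coordinate convention are all assumptions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A hypothetical match record; coordinates are half-open [start, end)."""
    query_start: int
    query_end: int
    evalue: float


def filter_hits(hits, max_evalue=1e-10, min_length=20):
    """Keep hits with E-value < max_evalue and length > min_length."""
    return [h for h in hits
            if h.evalue < max_evalue
            and (h.query_end - h.query_start) > min_length]


if __name__ == "__main__":
    hits = [
        Hit(100, 600, 1e-50),   # long and significant: kept
        Hit(700, 715, 1e-30),   # only 15 bp: removed
        Hit(900, 1400, 1e-3),   # weak E-value: removed
    ]
    print(filter_hits(hits))
```

The surviving hits would then be chained by dynamic programming, a step not shown here.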

Page 21

Measures

For each nucleotide,
• TP: correctly annotated as belonging to a TE
• FP: falsely predicted as belonging to a TE
• TN: correctly annotated as not belonging to a TE
• FN: falsely predicted as not belonging to a TE
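These per-nucleotide counts feed the sensitivity and specificity figures. The slide lists only the four counts, so the sketch below assumes the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names and the boolean-mask input are hypothetical.

```python
def confusion_counts(truth, predicted):
    """Count TP, FP, TN, FN per nucleotide.

    truth, predicted: equal-length sequences of booleans, True where the
    nucleotide belongs to a TE in the reference / predicted annotation.
    """
    if len(truth) != len(predicted):
        raise ValueError("annotations must cover the same nucleotides")
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and not t:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Assumed definition: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Assumed definition: TN / (TN + FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    truth = [True, True, True, False, False, False, False, False]
    pred = [True, True, False, True, False, False, False, False]
    tp, fp, tn, fn = confusion_counts(truth, pred)
    print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```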

Page 22
Page 23
Page 24

Page 25

Method for the Annotation of known TE families

- BLASTER using BLASTN and MATCHER (BLRn)
- RepeatMasker (RM)
- RepeatMasker with MATCHER (RMm)
- RepeatMasker-BLASTER (RMBLR): hits from both BLRn and RM are combined and passed to MATCHER

Page 26

Method for the Annotation of anonymous TE families

- all-by-all comparison with BLASTER using BLASTN, MATCHER, and GROUPER
- RECON
- BLASTER using TBLASTX and MATCHER
- TE-HMM
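Gathering similar sequences into groups after an all-by-all comparison can be pictured as single-linkage clustering over the pairwise matches. The sketch below uses a small union-find for this. It is a stand-in technique chosen for illustration, not a description of how GROUPER works, and every name in it is hypothetical.

```python
def group_sequences(names, matches):
    """Single-linkage grouping: sequences connected by any chain of matches share a group.

    names:   iterable of sequence identifiers.
    matches: iterable of (a, b) pairs judged similar by the all-by-all comparison.
    """
    names = list(names)
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra           # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(group_sequences(["s1", "s2", "s3", "s4", "s5"],
                          [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]))
```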

Page 27

What they (we) learned

• Overall, BLRn outperforms RM with respect to the precise determination of TE boundaries.
• RM is more sensitive for the detection of small and divergent TEs.
• The differences between BLRn and RM make them complementary for TE annotation.
• A combined-evidence framework can improve the quality and confidence of TE annotation.

Page 28

Pipeline structure

• TE detection software: BLASTER, RepeatMasker, TE-HMM, and RECON
• Tandem repeat detection software: RepeatMasker, Tandem Repeats Finder (TRF), Mreps
• Database: MySQL
• Open Portable Batch System
• The whole genomic sequence was segmented into chunks of 200 kb overlapping by 10 kb.
• The results from the different tools were stored in the database.
• An XML file is generated from the stored results and loaded into the Apollo genome annotation tool.
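The segmentation step can be sketched as a coordinate generator. This is a minimal illustration, assuming each chunk is 200 kb long and shares its last 10 kb with the next chunk, so chunk starts are 190 kb apart. The function name and the half-open coordinates are hypothetical choices.

```python
def chunk_coordinates(seq_len, chunk=200_000, overlap=10_000):
    """Return half-open (start, end) windows covering a sequence of length seq_len.

    Assumption: consecutive chunks share `overlap` bases, so starts advance
    by chunk - overlap.
    """
    if seq_len <= 0:
        return []
    step = chunk - overlap
    coords = []
    start = 0
    while True:
        end = min(start + chunk, seq_len)
        coords.append((start, end))
        if end == seq_len:            # last chunk reaches the end of the sequence
            break
        start += step
    return coords


if __name__ == "__main__":
    for start, end in chunk_coordinates(500_000):
        print(start, end)
```

The overlap means a TE that straddles a chunk boundary is still seen whole in at least one chunk, provided it is shorter than the overlap.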

Page 29

The Annotation Pipeline

Page 30