RNA-seq for gene finding
- 1. Transcript discovery and gene model correction using next-generation sequencing data
Sucheta Tripathy, VBI, 11th Nov 2010
- 2. Brief History of Sequencing
Sanger dideoxy sequencing method (1977).
Maxam-Gilbert chemical degradation method (1977).
Two labs that developed automated sequencers:
1. Leroy Hood at Caltech, 1986 (commercialized by AB)
2. Wilhelm Ansorge at EMBL, 1986 (commercialized by Pharmacia-Amersham and GE Healthcare)
- 3. Brief History of Sequencing
Hypoxanthine-guanine phosphoribosyltransferase (HGPRT)
Alu sequences
- 4. Brief History of Sequencing
Hitachi Laboratory developed a high-throughput capillary array sequencer, 1996.
1991: a patent filed by EMBL on media-less, solid-support-based sequencing.
- 5. NextGen Sequencing Methods
454 sequencing method (2006)
Principles of pyrophosphate detection (1985, 1988)
Illumina (Solexa) genome sequencing method (2007)
Applied Biosystems ABI SOLiD System (2007)
Helicos single-molecule sequencing (HeliScope, 2007)
Pacific Biosciences single-molecule real-time (SMRT) technology (2010)
Sequenom for nanotechnology-based sequencing.
BioNanomatrix nanofluidics.
RNAP technology.
- 6. Figure 1. (A) Outline of the GS 454 DNA sequencer workflow.
Library construction (I) ligates 454-specific adapters to DNA
fragments (indicated as A and B) and couples amplification beads
with DNA in an emulsion PCR to amplify fragments before sequencing
(II). The beads are loaded into the picotiter plate (III). (B)
Schematic illustration of the pyrosequencing reaction which occurs
on nucleotide incorporation to report sequencing-by-synthesis.
(Adapted from http://www.454.com.)
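The sequencing-by-synthesis readout in panel (B) can be illustrated with a toy model. This is only a sketch under stated assumptions: the flow order `TACG`, the function names, and the idealised noise-free signal (exactly proportional to the number of bases incorporated per flow) are all illustrative choices, not details given in the slides.

```python
# Toy model of pyrosequencing: nucleotides are flowed one at a time in a
# fixed cycle, and the light signal for each flow is taken to be exactly
# proportional to the number of bases incorporated (the homopolymer length).
# Real flowgrams are noisy; this idealised version ignores that.

FLOW_ORDER = "TACG"  # assumed flow cycle, for illustration only


def flowgram(read, n_flows, flow_order=FLOW_ORDER):
    """Return the idealised signal per flow while synthesising `read`."""
    signals = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # Incorporate every consecutive copy of the flowed base.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
    return signals


def decode(signals, flow_order=FLOW_ORDER):
    """Rebuild the synthesised sequence from per-flow signals."""
    return "".join(
        flow_order[i % len(flow_order)] * s for i, s in enumerate(signals)
    )


read = "TTACGGGA"
signals = flowgram(read, n_flows=8)
print(signals)          # [2, 1, 1, 3, 0, 1, 0, 0]
print(decode(signals))  # TTACGGGA
```

A flow with signal 0 means the flowed nucleotide did not match the next template position; a signal of 3 corresponds to a run of three identical bases, which is why homopolymer lengths are the hard case for this chemistry.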
- 7. Outline of the Illumina Genome Analyzer workflow. Similar
fragmentation and adapter ligation steps take place (I), before
applying the library onto the solid surface of a flow cell.
Attached DNA fragments form bridge molecules which are subsequently
amplified via an isothermal amplification process, leading to a
cluster of identical fragments that are subsequently denatured for
sequencing primer annealing (II). Amplified DNA fragments are
subjected to sequencing-by-synthesis using 3'-blocked labelled
nucleotides (III). (Adapted from the Genome Analyzer brochure,
http://www.solexa.com.)
- 8. (A) Primers hybridise to the P1 adapter within the library
template. A set of four fluorescence-labelled di-base probes
competes for ligation to the sequencing primer. These probes have
partly degenerated DNA sequence (indicated by n and z). Specificity
of the di-base probe is achieved by interrogating the first and
second base in each ligation reaction (CA in this case for the
complementary strand). (B) Sequence determination by the SOLiD DNA
sequencing platform is performed in multiple ligation cycles, using
different primers, each one shorter than the previous one by a
single base. The number of ligation cycles determines the eventual
read length, whilst for each sequence tag, six rounds of primer
reset occur [from primer (n) to primer (n-4)]. (Adapted and modified
from http://www.appliedbiosystems.com.)
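Because each di-base probe reports on two adjacent bases with one of four dyes, a read comes out as a string of colours and not as bases. The sketch below assumes the commonly described two-base colour code (colour 0 for AA/CC/GG/TT, 1 for AC/CA/GT/TG, 2 for AG/GA/CT/TC, 3 for AT/TA/CG/GC); the slides do not spell this table out, and the function names are illustrative.

```python
# Two-base (colour-space) encoding sketch. With A=0, C=1, G=2, T=3 the
# assumed colour of a dinucleotide is the XOR of the two base codes, which
# reproduces the table described in the lead-in.

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colours(seq):
    """Return one colour (0-3) per overlapping pair of adjacent bases."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colours(first_base, colours):
    """Recover the base sequence from a known first base plus colours."""
    bases = [first_base]
    for colour in colours:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ colour])
    return "".join(bases)


seq = "ATGGCA"
colours = encode_colours(seq)
print(colours)                          # [3, 1, 0, 3, 1]
print(decode_colours(seq[0], colours))  # ATGGCA
```

Decoding needs one known base (in practice the last base of the adapter), since a colour only says how two neighbouring bases relate to each other.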
- 9. Cost
Adapted from Eric Lander, 2010
- 10. Throughput
Standard ABI Sanger sequencing
96 samples/day
Read length ~650 bp
Total = 450,000 bases of sequence data
454 was the game changer!
~400,000 different templates (reads)/day
Read length ~250 bp
Total = 100,000,000 bases of sequence data!!!
- 11. Throughput
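The daily totals quoted for Sanger and 454 sequencing are reads per day multiplied by read length. A quick check, as a sketch (the function name is illustrative): the 454 figure reproduces exactly, while 96 reads of 650 bp give only 62,400 bases, so the quoted 450,000 bases would correspond to roughly 692 reads, or about seven 96-sample runs. The slides do not state a number of runs, so that reading is an inference.

```python
# Throughput arithmetic: bases per day = reads per day * read length.

def bases_per_day(reads_per_day, read_length_bp):
    """Total bases produced per day for a given read count and length."""
    return reads_per_day * read_length_bp


sanger = bases_per_day(96, 650)        # one 96-sample plate
pyro = bases_per_day(400_000, 250)     # 454 figures from the slide

# Number of 650 bp reads that the quoted Sanger total would require.
implied_sanger_reads = 450_000 / 650

print(sanger)                       # 62400
print(pyro)                         # 100000000
print(round(implied_sanger_reads))  # 692
```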
454 Life Sciences/Roche
Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
Published 1 million bases of Neanderthal DNA in 2006
May 2007 published complete genome of James Watson (3.2 billion bases ~20x coverage)
Solexa/Illumina
10 Gb per machine per week
May 2008 published complete genomes for 3 HapMap subjects (14x coverage)
ABI SOLiD
20 Gb per machine per week
- 12. RNASeq
Catalogue all species of transcripts.
mRNA
Non-coding RNA
Small RNA
Determine splicing patterns and other post-transcriptional modifications.
Quantify the expression levels.
- 13. Zhong Wang et al., Nat. Rev. Genet., 2009
- 14. Other Applications
SNP detection
Splice Variant Discovery
Identification of miRNA targets
TF binding sites
Genome Methylation pattern
RNA editing
Metagenomic projects
Gene Expression Analysis
- 15. Differences from other expression sequencing methods
EST: low throughput, expensive, NOT quantitative.
SAGE, CAGE, MPSS: high throughput, digital gene expression levels
Expensive
Sanger sequencing methods
A portion of transcript is analyzed
Isoforms are indistinguishable
- 16. Advantages:
Little or no background noise.
Sensitive to isoform discovery.
Both low and highly expressed genes can be quantified.
Highly reproducible.
- 17. Data Analysis
Mapping Reads to the reference assembly
Filtering output:
Reads mapping more than x times
Downstream data analysis
- 18. Mapping
One or two mismatches in reads < 35 bases
One insertion/deletion.
K-mer based seeding.
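The mapping and filtering steps can be sketched in a few lines. This is a minimal illustration, not any particular aligner's algorithm: the function names, the toy reference, and the choice of k = 3 are assumptions made for the example, and only substitutions are handled (the single insertion/deletion case is left out).

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with exact k-mer matches, then verify the whole read.

    Returns sorted reference start positions where the read aligns with at
    most `max_mismatches` substitutions. Indels are not considered.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


def filter_multimappers(mapped, max_hits):
    """Keep reads that map at least once and at most `max_hits` times."""
    return {name: hits for name, hits in mapped.items()
            if 0 < len(hits) <= max_hits}


reference = "GATTACAGATTACACCGTAGGCT"  # toy reference, illustration only
k = 3
index = build_index(reference, k)
reads = {
    "r1": "ATTACA",   # occurs twice in the reference
    "r2": "CCGTAGG",  # unique, exact match
    "r3": "CCGAAGG",  # unique, one mismatch
    "r4": "TTTTTT",   # no seed matches
}
mapped = {name: map_read(seq, reference, index, k) for name, seq in reads.items()}
print(mapped)
print(sorted(filter_multimappers(mapped, max_hits=1)))
```

Here `max_hits` plays the role of x in the filtering step: `r1` is dropped because it maps to two places, and `r4` is dropped because it does not map at all.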
- Identification of Novel Transcripts.
- 19. Transcript abundance.
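One common way to turn mapped read counts into transcript abundance is RPKM (reads per kilobase of transcript per million mapped reads). The slides do not name a specific measure, so RPKM is an assumption here, and the counts in the example are made up for illustration.

```python
# RPKM sketch: normalise a transcript's read count by its length (in kb)
# and by the sequencing depth (in millions of mapped reads).

def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp transcript,
# out of 10 million mapped reads in the library.
value = rpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```

The length term keeps long transcripts from looking more highly expressed just because they collect more reads, and the depth term makes values comparable between libraries of different sizes.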