RNA Sequencing
05-06-2013, Elio Schijlen
Next gen insight into transcriptomes
Transcriptome: the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition. Understanding the transcriptome is essential for interpreting the functional elements of the genome.
The key aims of transcriptomics are:
• to catalogue all species of transcripts, including mRNAs, non-coding RNAs and small RNAs;
• to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications;
• to quantify the changing expression levels of each transcript during development and under different conditions.
Recently, the development of novel high-throughput DNA sequencing methods has provided a new way of determining, mapping and quantifying transcriptomes. This method, termed RNA-Seq (RNA sequencing), has clear advantages over previous approaches and is revolutionizing the manner in which eukaryotic transcriptomes are analysed.
454
Illumina HiSeq2000
PacBio RS
Ion Proton
SOLiD 5500
From next to 3rd generation sequencing
Illumina HiSeq: fluorescent nt scanning
SOLiD: ligation of fluorescent oligos
454: pyrosequencing
Ion Proton: hydrogen ion detection
PacBio: real-time fluorescent detection
From next to 3rd generation sequencing
Illumina HiSeq: ssDNA sequence template
• Clonally amplified into clusters on glass slide (flow cell)
SOLiD 5500: idem
454: ssDNA sequence template
• Clonally amplified on beads (emPCR)
Ion Proton: idem
PacBio: dsDNA sequence template
• Single molecule/polymerase molecule complex
Illumina HiSeq2000
[Figure: instrument components: syringe pumps, reagents compartment, optics, flow cell access door, flow cell (8 channels)]
Illumina HiSeq2000
[Figure: sequencing workflow: library preparation from DNA (0.1-5.0 μg), cluster growth on a single molecule array, sequencing, image acquisition, base calling]
Eusol BACs: 177.14 M PF clusters; 33.8 Gb >Q30
Lane  Sample ID  Sample Ref       Index         Description                                 Yield (Mbases)  % PF   # Reads     % of raw clusters per lane
1     lane1      unknown          Undetermined  Clusters with unmatched barcodes for lane 1 3,234           87.47  36,608,108  9.74
1     plate10    EUsol_fill_gaps  TAGCTT        -                                           3,359           94.77  35,088,534  9.34
1     plate1     EUsol_fill_gaps  ATCACG        -                                           4,150           95.35  43,091,246  11.47
1     plate2     EUsol_fill_gaps  CGATGT        -                                           3,480           95.66  36,020,422  9.59
1     plate3     EUsol_fill_gaps  TTAGGC        -                                           3,496           95.27  36,331,200  9.67
1     plate4     EUsol_fill_gaps  TGACCA        -                                           4,674           95.4   48,508,022  12.91
1     plate5     EUsol_fill_gaps  ACAGTG        -                                           2,305           93.65  24,365,574  6.49
1     plate6     EUsol_fill_gaps  GCCAAT        -                                           1,895           94.83  19,783,144  5.27
1     plate7     EUsol_fill_gaps  CAGATC        -                                           3,366           94.9   35,115,836  9.35
1     plate8     EUsol_fill_gaps  ACTTGA        -                                           2,592           95.29  26,934,126  7.17
1     plate9     EUsol_fill_gaps  GATCAG        -                                           3,232           94.59  33,829,830  9.01
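The columns of the demultiplexing report above can be cross-checked with a little arithmetic. The sketch below recomputes each sample's share of the lane from the read counts, and estimates the number of PF clusters from the total yield. The paired-end 101 nt read length is an assumption: it is not stated on the slide, but it reproduces the 177.14 M PF clusters given in the header line.

```python
# Yield (Mbases) and number of reads per sample, copied from the report.
samples = {
    "lane1 (undetermined)": (3234, 36_608_108),
    "plate10": (3359, 35_088_534),
    "plate1": (4150, 43_091_246),
    "plate2": (3480, 36_020_422),
    "plate3": (3496, 36_331_200),
    "plate4": (4674, 48_508_022),
    "plate5": (2305, 24_365_574),
    "plate6": (1895, 19_783_144),
    "plate7": (3366, 35_115_836),
    "plate8": (2592, 26_934_126),
    "plate9": (3232, 33_829_830),
}

READ_LENGTH_NT = 101   # ASSUMPTION: not given on the slide
READS_PER_CLUSTER = 2  # ASSUMPTION: paired-end run

total_reads = sum(reads for _, reads in samples.values())
total_yield_mb = sum(yield_mb for yield_mb, _ in samples.values())

# Share of the lane taken by each sample (last column of the report).
share = {name: 100 * reads / total_reads for name, (_, reads) in samples.items()}

# Yield in Mbases / read length = millions of PF reads; halve for PE clusters.
pf_clusters_millions = total_yield_mb / READ_LENGTH_NT / READS_PER_CLUSTER

for name, pct in share.items():
    print(f"{name:22s} {pct:6.2f} % of lane")
print(f"Estimated PF clusters: {pf_clusters_millions:.2f} M")
```

A check like this is a quick way to spot samples that are strongly under- or over-represented in a pooled lane, such as plate6 and plate4 here.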
SOLiD 5500
454 sequencing technology & workflow
NGS - 454 pyrosequencing raw read
GCTAAG
Ion semiconductor sequencing
Ion Torrent PGM & Proton
3rd Gen Sequencing: PacBio
SMRT sequencing
Kb read length
<50,000 reads
<100 Mb
Pacbio sequencing
Phospholinked
Cleavage by DNA polymerase
• Fluorophore clipped off by polymerase
• DNA synthesized is natural
• No steric hindrance or accumulation of background signal
ZMW: Zero Mode Waveguide
Sequence read length (raw), quality
Illumina HiSeq: fixed 50 or 100 nt, SR and PE
SOLiD 5500: fixed 75 nt
454: range 50-1,000 nt (avg ~750)
Ion Torrent: range 50-200 nt (avg ~170)
PacBio: range 50-20,000 nt (avg ~3-4 kb)
Sequence read quality
Illumina HiSeq: HQ reads, systematic errors
• Lower quality 3’ ends
• Low GC coverage
SOLiD: very HQ reads
• Lower quality 3’ ends
454: HQ reads, systematic errors
• Homopolymer problems
• Clonality
• Lower quality 3’ ends
Ion Torrent: idem, but lower overall quality
PacBio: low quality (0.8-0.85)
• Random errors
• No decrease in read quality at the 3’ end
Sequence reads & throughput/run
Illumina HiSeq: 1.5E+09 reads per full flow cell, 12 days/run
• Up to 550 Gb (2 flow cells)
SOLiD 5500XL: 1.5E+09 reads per full flow cell, 6 days/run
• Up to 240 Gb (2 flow chips)
454: 1E+06 reads per full PTP, 1 day/run
• Up to 1 Gb
Ion Torrent: 60-80E+06 reads per Ion PI chip, 4 hours/run
• Up to 10 Gb
PacBio: 300,000 reads (8 cell strip), 1 day/run
• Up to 0.75 Gb
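As a rough back-of-envelope illustration, the read counts above can be combined with the read lengths quoted earlier to estimate a nominal yield per run (reads × read length). The values below are assumptions chosen from the quoted ranges (for example 70E+06 reads for Ion Torrent and 3.5 kb for PacBio), and the estimate ignores paired-end reads and multiple flow cells, so it will not match the "up to" figures exactly.

```python
def nominal_yield_gb(reads, mean_read_length_nt):
    """Nominal yield in gigabases: number of reads x mean read length."""
    return reads * mean_read_length_nt / 1e9

# (reads per run, mean read length in nt); midpoints are ASSUMPTIONS
# picked from the ranges on the slides, not vendor specifications.
platforms = {
    "Illumina HiSeq": (1.5e9, 100),
    "SOLiD 5500XL": (1.5e9, 75),
    "454": (1e6, 750),
    "Ion Torrent": (70e6, 170),
    "PacBio": (300_000, 3500),
}

for name, (reads, length_nt) in platforms.items():
    print(f"{name:15s} ~{nominal_yield_gb(reads, length_nt):8.2f} Gb per run")
```

The orders of magnitude agree with the quoted maxima, which is the main point: the platforms differ by more than two orders of magnitude in output per run.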
Transcript coverage
DNA samples for sequencing
• Genomic DNA
• mRNA
• Small RNA
• Active chromatin (ChIP-sequencing)
• Other applications
Library preparation: ligate adapters to both ends of the fragmented nucleic acid
RNA input requirements
RNA: DNA-free, RNase-free, non-degraded, no contaminants (proteins, polysaccharides)
Protocol variations
Fragmentation methods
• RNA: nebulization, hydrolysis
• cDNA: sonication, DNase I treatment
Depletion of highly abundant transcripts
• Positive selection of mRNA: poly(A) selection or target specific
• Negative selection: RiboMinus, RNase H
Strand specificity
• Most RNA sequencing is not strand-specific
Single-end or paired-end sequencing
(Illumina) RNA seq workflow
Aligning the millions of reads to a "reference genome": many tools are available for aligning genomic reads to a reference genome (sequence alignment tools). However, special attention is needed when aligning a transcriptome to a genome, mainly when dealing with genes that contain introns. As discussed above, the sequence libraries are created by extracting mRNA via its poly(A) tail, which is added to the mRNA molecule post-transcriptionally, so splicing has already taken place. The resulting library and short reads therefore do not contain intronic sequence. When these short reads are aligned to a reference genome, only reads that lie entirely within an exon will match, while reads that span an exon-exon junction will not. Several software packages exist for short read alignment, and specialized tools for transcriptome data have been developed, e.g. TopHat for spliced alignment and Cufflinks for assembling and quantifying transcripts from those alignments.
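The junction problem described above can be made concrete with a toy example. The sketch below uses a made-up 44 nt "genome" with two exons and one intron. It shows that a read taken from within an exon matches the genome contiguously, that a read spanning the exon-exon junction does not, and that a naive split search recovers the junction. The sequences, function names and parameters are hypothetical and for illustration only. Real spliced aligners such as TopHat use indexed, mismatch-tolerant methods, not this exact-match search.

```python
# Toy genome: 5' flank + exon 1 + intron (GT...AG) + exon 2 + 3' flank.
EXON1, INTRON, EXON2 = "ATGGCTTACA", "GTAAGTCTCTCTAG", "CACCTGAATG"
genome = "CCGTA" + EXON1 + INTRON + EXON2 + "TTCCA"
mrna = EXON1 + EXON2  # spliced transcript: the intron is gone


def align_contiguous(read, reference):
    """Exact, ungapped match; returns the start position or None."""
    pos = reference.find(read)
    return pos if pos >= 0 else None


def align_split(read, reference, min_anchor=4, min_intron=2):
    """Naive spliced match: try every split point of the read and look for
    the two halves separated by a gap. Returns (left_start, right_start,
    split_point) for the first hit, or None."""
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left = reference.find(read[:k])
        if left < 0:
            continue
        right = reference.find(read[k:], left + k + min_intron)
        if right >= 0:
            return left, right, k
    return None


exonic_read = mrna[0:8]     # lies entirely within exon 1
junction_read = mrna[6:14]  # spans the exon 1 / exon 2 junction

print(align_contiguous(exonic_read, genome))    # position in the genome
print(align_contiguous(junction_read, genome))  # None: no contiguous match
hit = align_split(junction_read, genome)
left, right, k = hit
print(hit, "inferred intron:", genome[left + k:right])
```

The inferred gap between the two read halves corresponds to the intron, which is how split-read alignment can reveal splice junctions.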
Sequence coverage
A. thaliana: approx. 60E+06 mapped reads result in a plateau of unique expressed gene models (approx. 20,000)
Multimapped 50 nt SR reads (A. thaliana ~5%) can cause inaccurate expression estimates
Tubulin B chain: reads mapped to the reference genome (gray)
Blue lines: intron-spanning reads
Histograms: read coverage
• Blue: contributed by multimapped reads
• Green: contributed by uniquely mapped reads
Including multimapped reads artificially increases the expression value
Read mapping for 2 genes that share a genome region at their 3’ ends on opposite strands: multimapped reads derived from the + strand would severely overestimate expression of the – strand gene.
Ekblom et al., 2012, Comparative and Functional Genomics, doi:10.1155/2012/281693
Wenger and Galliot, 2013, BMC Genomics 14:204, doi:10.1186/1471-2164-14-204
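The effect of multimapped reads on expression values described above can be illustrated with a small counting sketch. The read set and the gene names (geneA, geneB) are invented for illustration, and the three counting modes are common simplifications, not the method of any particular tool.

```python
from collections import defaultdict


def count_expression(alignments, mode="unique"):
    """Count reads per gene.

    alignments: one tuple of gene names per read (all genes the read maps to).
    mode: "unique"     -> multimapped reads are ignored
          "all"        -> a multimapped read counts fully for every gene it hits
          "fractional" -> a multimapped read is split evenly over its hits
    """
    if mode not in ("unique", "all", "fractional"):
        raise ValueError(f"unknown mode: {mode}")
    counts = defaultdict(float)
    for hits in alignments:
        if len(hits) == 1:
            counts[hits[0]] += 1.0
        elif mode == "all":
            for gene in hits:
                counts[gene] += 1.0
        elif mode == "fractional":
            for gene in hits:
                counts[gene] += 1.0 / len(hits)
    return dict(counts)


# Hypothetical data: 6 reads unique to geneA, 2 unique to geneB,
# and 4 reads that map equally well to both genes.
alignments = [("geneA",)] * 6 + [("geneA", "geneB")] * 4 + [("geneB",)] * 2

for mode in ("unique", "all", "fractional"):
    print(mode, count_expression(alignments, mode))
```

In this toy data set, counting every hit of a multimapped read triples the apparent expression of geneB compared with unique reads only, which is the kind of inflation shown in the coverage histograms.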
Some considerations
The information gathered by RNA-Seq has limitations similar to those of other RNA expression analysis pipelines.
RNA status dependent
• Biological variability: tissue specific, time dependent. Use triplicates!
• Gene expression levels change over a cell's lifetime and with its context.
• Strongly dependent on RNA quality
Library prep method dependent
Sequencing technology dependent
Analysis method dependent
Because of this, care must be taken when drawing conclusions from the sequencing experiment. Results must be verified using an independent technology.