RNA sequencing - Netherlands Bioinformatics Centre … · From next to 3rd generation sequencing ...

RNA Sequencing

05-06-2013, Elio Schijlen

Next gen insight into transcriptomes

Transcriptome: the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition. Understanding the transcriptome is essential for interpreting the functional elements of the genome.

The key aims of transcriptomics are:

• to catalogue all species of transcripts, including mRNAs, non-coding RNAs and small RNAs;

• to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications;

• to quantify the changing expression levels of each transcript during development and under different conditions.

Recently, the development of novel high-throughput DNA sequencing methods has provided a new approach for determining, mapping and quantifying transcriptomes. This method, termed RNA-Seq (RNA sequencing), has clear advantages over previous approaches and is revolutionizing the manner in which eukaryotic transcriptomes are analysed.

454

Illumina HiSeq2000

PacBio RS

Ion Proton

SOLiD 5500

From next to 3rd generation sequencing

Illumina HiSeq: fluorescent nucleotide scanning

SOLiD: ligation of fluorescent oligos

454: pyrosequencing

Ion Proton: hydrogen detection

PacBio: real-time fluorescent detection

From next to 3rd generation sequencing

Illumina HiSeq: ssDNA sequence template

• Clonally amplified into clusters on glass slide (flow cell)

SOLiD 5500: idem

454: ssDNA sequence template

• Clonally amplified on beads (emPCR)

Ion Proton: idem

PacBio: dsDNA sequence template

• Single molecule/polymerase molecule complex

Illumina HiSeq2000

[Figure: HiSeq2000 instrument, labelled: syringe pumps, reagents compartment, optics, flow cell access door. Flow cell: 8 channels]

Illumina HiSeq2000

[Figure: Illumina workflow. Library preparation from DNA (0.1-5.0 μg); single molecule array; cluster growth; sequencing; image acquisition; base calling]

Eusol BACs: 177.14 M PF clusters; 33.8 Gb > Q30

Lane  Sample ID  Sample Ref       Index         Description                                  Yield (Mbases)  % PF   # Reads     % of raw clusters per lane
1     lane1      unknown          Undetermined  Clusters with unmatched barcodes for lane 1  3,234           87.47  36,608,108   9.74
1     plate10    EUsol_fill_gaps  TAGCTT        -                                            3,359           94.77  35,088,534   9.34
1     plate1     EUsol_fill_gaps  ATCACG        -                                            4,150           95.35  43,091,246  11.47
1     plate2     EUsol_fill_gaps  CGATGT        -                                            3,480           95.66  36,020,422   9.59
1     plate3     EUsol_fill_gaps  TTAGGC        -                                            3,496           95.27  36,331,200   9.67
1     plate4     EUsol_fill_gaps  TGACCA        -                                            4,674           95.4   48,508,022  12.91
1     plate5     EUsol_fill_gaps  ACAGTG        -                                            2,305           93.65  24,365,574   6.49
1     plate6     EUsol_fill_gaps  GCCAAT        -                                            1,895           94.83  19,783,144   5.27
1     plate7     EUsol_fill_gaps  CAGATC        -                                            3,366           94.9   35,115,836   9.35
1     plate8     EUsol_fill_gaps  ACTTGA        -                                            2,592           95.29  26,934,126   7.17
1     plate9     EUsol_fill_gaps  GATCAG        -                                            3,232           94.59  33,829,830   9.01
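
The last column of the lane report can be reproduced from the read counts alone. The sketch below is a minimal illustration, assuming that "% of raw clusters per lane" is each sample's read count divided by the total for the lane; the dictionary keys are just labels chosen here, the counts are copied from the report.

```python
# Read counts per sample, copied from the lane 1 report.
reads = {
    "lane1 (undetermined)": 36_608_108,
    "plate10": 35_088_534,
    "plate1": 43_091_246,
    "plate2": 36_020_422,
    "plate3": 36_331_200,
    "plate4": 48_508_022,
    "plate5": 24_365_574,
    "plate6": 19_783_144,
    "plate7": 35_115_836,
    "plate8": 26_934_126,
    "plate9": 33_829_830,
}

# Assumption: the percentage column is the sample's share of all reads in the lane.
total_reads = sum(reads.values())
share = {sample: 100.0 * n / total_reads for sample, n in reads.items()}

# Print samples from most to least represented.
for sample, pct in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{sample:22s} {reads[sample]:>12,d} {pct:6.2f}%")
```

A listing like this makes uneven pooling visible at a glance: plate4 received more than twice the reads of plate6, and roughly a tenth of the lane could not be assigned to any barcode.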

SOLiD 5500

454 sequencing technology & workflow

NGS - 454 pyrosequencing raw read

GCTAAG

Ion semiconductor sequencing

Ion Torrent PGM & Proton

3rd Gen Sequencing: PacBio

SMRT sequencing

Kb read length

<50,000 reads

<100 Mb

Pacbio sequencing

Phospholinked

Cleavage by DNA polymerase

• Fluorophore clipped off by polymerase

• DNA synthesized is natural

• No steric hindrance or accumulation of background signal

ZMW: Zero Mode Waveguide

Sequence read length (raw), quality

Illumina HiSeq: fixed 50 or 100 nt, single-read (SR) and paired-end (PE)

SOLiD 5500: fixed 75 nt

454: range 50-1,000 nt (av. ~750)

Ion Torrent: range 50-200 nt (av. ~170)

PacBio: range 50-20,000 nt (av. ~3-4 kb)

Sequence read quality

Illumina HiSeq: HQ reads, systematic errors

• Lower quality at 3’ ends

• Low GC coverage

SOLiD: very HQ reads

454: HQ reads, systematic errors

• Homopolymer problems

• Clonality

Ion Torrent: idem, but lower overall quality

PacBio: low quality (0.8-0.85)

• Random errors

• No decrease in read quality at the 3’ end
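
To put the quality figures on one scale, accuracy can be converted to a Phred score with Q = -10 · log10(error probability). The sketch below assumes that the PacBio value of 0.8-0.85 is a per-base raw read accuracy, which the slide does not state explicitly; the function names are chosen here for illustration.

```python
import math

def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy to a Phred score: Q = -10 * log10(1 - accuracy)."""
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)

def phred_to_accuracy(q: float) -> float:
    """Inverse conversion: accuracy = 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)

# Assumed interpretation of the slide: 0.80-0.85 is raw per-base accuracy.
for acc in (0.80, 0.85):
    print(f"accuracy {acc:.2f} -> Q{accuracy_to_phred(acc):.1f}")

# Q30, the threshold used in the Illumina run summary, as an accuracy.
print(f"Q30 -> accuracy {phred_to_accuracy(30):.4f}")
```

Under that assumption the raw PacBio reads sit below Q10, whereas a Q30 base has a 1 in 1,000 chance of being wrong.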

Sequence reads & throughput/run

Illumina HiSeq: 1.5E+09 reads per full flow cell, 12 days/run

• Up to 550 Gb (2 flow cells)

SOLiD 5500XL: 1.5E+09 reads per full flow cell, 6 days/run

• Up to 240 Gb (2 flow chips)

454: 1E+06 reads per full PTP, 1 day/run

• Up to 1 Gb

Ion Torrent: 60-80E+06 reads per Ion PI chip, 4 hours/run

• Up to 10 Gb

PacBio: 300,000 reads (8-cell strip), 1 day/run

• Up to 0.75 Gb
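
Read number and read length together give a rough idea of the yield per run. The sketch below is a back-of-envelope estimate only: it multiplies the read counts by the read lengths quoted earlier in the slides, and it assumes paired-end 100 nt reads for Illumina, a midpoint of 70E+06 reads for Ion Torrent and a 3.5 kb average for PacBio. The function name and the tuple layout are hypothetical.

```python
def estimated_yield_gb(reads: float, mean_length_nt: float,
                       reads_per_fragment: int = 1, units_per_run: int = 1) -> float:
    """Rough yield in Gb: reads x mean length x mates per fragment x flow cells (or chips)."""
    return reads * mean_length_nt * reads_per_fragment * units_per_run / 1e9

# (platform, reads per unit, assumed mean length, mates, units per run, quoted "up to" Gb)
platforms = [
    ("Illumina HiSeq", 1.5e9, 100, 2, 2, 550),
    ("SOLiD 5500XL", 1.5e9, 75, 1, 2, 240),
    ("454", 1e6, 750, 1, 1, 1),
    ("Ion Torrent", 70e6, 170, 1, 1, 10),
    ("PacBio", 3e5, 3500, 1, 1, 0.75),
]

for name, reads, length, mates, units, quoted in platforms:
    estimate = estimated_yield_gb(reads, length, mates, units)
    print(f"{name:15s} estimate {estimate:8.2f} Gb   quoted up to {quoted} Gb")
```

The estimates land in the same order of magnitude as the quoted maxima rather than exactly on them, since the average read lengths are approximate and not every read passes filtering.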

Transcript coverage

Samples for sequencing: genomic DNA, mRNA, small RNA; other applications such as ChIP-sequencing (active chromatin)

Library preparation: ligate adapters to both ends of the fragmented nucleic acid

RNA input requirements

RNA: DNA-free, RNase-free, non-degraded, no contaminants (proteins, polysaccharides)

Protocol variations

Fragmentation methods

• RNA: nebulization, hydrolysis

• cDNA: sonication, DNase I treatment

Depletion of highly abundant transcripts

• Positive selection of mRNA: poly(A) selection or target-specific

• Negative selection (RiboMinus, RNase H)

Strand specificity

• Most RNA sequencing is not strand-specific

Single-end or paired-end sequencing

(Illumina) RNA seq workflow

Aligning the millions of reads to a "reference genome": many tools are available for aligning genomic reads to a reference genome (sequence alignment tools). However, special attention is needed when aligning a transcriptome to a genome, mainly when dealing with genes that contain introns. As discussed above, the sequence libraries are created by extracting mRNA using its poly(A) tail, which is added to the mRNA molecule post-transcriptionally, so splicing has already taken place. The library and the short reads obtained from it therefore cannot come from intronic sequences. When a standard aligner maps these short reads to a reference genome, only reads lying entirely inside exonic regions will be matched, while reads spanning exon-exon junctions will not. Several software packages exist for short read alignment, and specialized tools for transcriptome data have recently been developed, e.g. TopHat (splice-aware alignment) and Cufflinks (transcript assembly and quantification).
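
The junction problem can be shown with a toy example. The sequences below are invented for illustration, and the split search is a naive exact-match scan that only demonstrates the idea; it is not how TopHat or any real splice-aware aligner is implemented.

```python
# Invented toy gene: two exons separated by one intron.
exon1 = "ATGGCGTACGTTAGC"
intron = "GTAAGTCCCTTTAAACCCTTTCAG"
exon2 = "GGATCCGATTACAGG"

genome = exon1 + intron + exon2   # what the reference contains
mrna = exon1 + exon2              # what was sequenced after splicing

exonic_read = exon1[2:14]                # lies entirely inside exon 1
junction_read = exon1[-6:] + exon2[:6]   # spans the exon-exon junction

def find_spliced(read: str, reference: str, min_anchor: int = 4):
    """Naive split search: return (left_start, left_end, right_start) or None.

    Tries every split point and looks for an exact match of the left part,
    followed downstream (after at least one skipped base) by the right part.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = reference.find(left)
        while left_pos != -1:
            right_pos = reference.find(right, left_pos + split + 1)
            if right_pos != -1:
                return left_pos, left_pos + split, right_pos
            left_pos = reference.find(left, left_pos + 1)
    return None

print("exonic read found contiguously:  ", exonic_read in genome)
print("junction read found contiguously:", junction_read in genome)
print("junction read found in mRNA:     ", junction_read in mrna)
print("split placement on genome:       ", find_spliced(junction_read, genome))
```

In this toy case the skipped stretch between the two placed parts is exactly the intron, which is the information a splice-aware aligner reports as an intron-spanning read.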

http://en.wikipedia.org/wiki/Reference_genome

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

http://en.wikipedia.org/wiki/Sequence_alignment_software

http://tophat.cbcb.umd.edu/

http://cufflinks.cbcb.umd.edu/

Sequence coverage

A. thaliana: approx. 60E+06 mapped reads result in a plateau of unique gene models expressed (approx. 20,000)

Multimapped 50 nt SR reads (A. thaliana ~5%) can cause inaccurate expression estimates

Tubulin B chain: reads mapped to the reference genome (gray)

Blue lines: intron-spanning reads

Histograms: read coverage

Blue: contributed by multimapped reads

Green: contributed by uniquely mapped reads

Including multimapped reads artificially increases the expression value

Read mapping of 2 genes sharing a genome region at their 3’ ends on opposite strands

Multimapped reads derived from the + strand would severely overestimate expression of the – strand gene.
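
The effect of multimapped reads on expression values can be illustrated with a toy count. The read names, gene names and hits below are hypothetical, and the two counting modes are a simplified sketch, not the behaviour of any particular quantification tool.

```python
from collections import Counter

# Hypothetical alignments: read id -> genes overlapped by each reported hit.
alignments = {
    "read1": ["geneA"],
    "read2": ["geneA"],
    "read3": ["geneA", "geneB"],   # multimapped
    "read4": ["geneA", "geneB"],   # multimapped
    "read5": ["geneB"],
}

def count_reads(alignments: dict, mode: str) -> Counter:
    """Count reads per gene.

    mode="unique": only reads with a single hit are counted.
    mode="all":    every hit of every read is counted, so a multimapped
                   read contributes to each gene it maps to.
    """
    if mode not in ("unique", "all"):
        raise ValueError("mode must be 'unique' or 'all'")
    counts = Counter()
    for hits in alignments.values():
        if len(hits) == 1 or mode == "all":
            for gene in hits:
                counts[gene] += 1
    return counts

print("unique only:", dict(count_reads(alignments, "unique")))
print("all hits:   ", dict(count_reads(alignments, "all")))
```

Here geneB goes from 1 to 3 counts when multimapped reads are included, although only one read can be assigned to it with certainty.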

Ekblom et al., 2012 Comparative and Functional Genomics doi:10.1155/2012/281693

Wenger and Galliot BMC Genomics 2013, 14:204 doi:10.1186/1471-2164-14-204

Some considerations

The information gathered by RNA-seq has similar limitations as other RNA expression analysis pipelines.

RNA status dependent

• Biological variable: tissue specific; time dependent. Triplicates!

• During a cell's lifetime and context, its gene expression levels change.

• Strongly RNA quality dependent

Library prep method dependent

Sequencing technology dependent

Analysis method dependent

Because of this, care must be taken when drawing conclusions from the sequencing experiment. Results must be verified using an independent technology.
