[ppt]mdsc 675 bioinformatics for the...

RNA-Seq

Dr. Christoph W. Sensen and Dr. Jung Soh

Trieste Course 2017

What is RNA-Seq?

An experimental protocol that uses next-generation sequencing technologies to sequence the messenger RNA molecules within a biological sample in an effort to determine the primary sequence and relative abundance of each mRNA

Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet. 12(10):671-682.

Also known as “Whole Transcriptome Shotgun Sequencing” (WTSS)

Bioanalyzer (RNA quality)

Plant material

Total RNA extraction

mRNA isolation

cDNA libraries

454 (1/2-plate)

Illumina (1 lane, 108PE)

Bioinformatics

Metabolite profiling

Biochemistry PIs

Genome Québec Innovation Centre

Reference transcriptomes (75)

• combination of ½-plate of 454 and 1 lane of 108PE Illumina sequencing

• excellent depth and coverage

• high-quality assemblies

• submission of total RNA samples improves quality control

• takes better advantage of sequencing facilities

• similar overall cost

• 76SE Illumina sequencing on selected species for comparative transcriptomics

• repeat sequencing in rare cases of low-quality initial output

Bioinformatics

Innovation Centre

Sequencing strategy

Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.

RNA-Seq workflow

RNA-Seq vs. microarray

Characteristics | RNA-Seq | Microarray
Which transcripts? | All in a sample | Only those for which probes are designed
Transcript sequence generation | Yes | No
Low-abundance transcript detection | Yes | Limited
Abundance info source | Count (of the reads aligned to gene) | Fluorescence level (of the probe spot for gene)
Resolution | Base | Probe sequence
Background noise | Low | High
Additional info | Alternative splicing, transcriptome-level variation | Limited

RNA-Seq data analysis

Lots of short reads

Map reads (input: reference genome)

Table of mapped loci per read

Bin reads to features (input: feature annotation of exons, genes, transcripts)

Table of counts per feature

Normalize counts

Table of normalized quantification values per feature

Detect differentially expressed (DE) features

DE features

Several of these steps are usually combined in a single tool

Mapping reads

Need a reference genome

Issues
– Reads spanning across exon junctions
– Alternative splicing
– Reads mapping to multiple locations in the genome

Huge amounts of data

Most common mapping results format
– SAM: sequence alignment/map
– BAM: binary format of SAM

Many tools
– Bowtie, SOAP, BWA, SHRiMP, mrFAST, mrsFAST, ZOOM, SSAHA2, Mosaik
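To make the SAM format and the exon-junction issue more concrete, here is a minimal Python sketch that reads the first six mandatory fields of a SAM alignment line. The helper names (`SamRecord`, `parse_sam_line`, `is_mapped`, `is_spliced`) and the two example records are invented for illustration. In practice a dedicated library would normally be used instead of hand-written parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """First six of the eleven mandatory SAM fields (hypothetical helper type)."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


def is_mapped(rec: SamRecord) -> bool:
    # Flag bit 0x4 means "segment unmapped".
    return not (rec.flag & 0x4)


def is_spliced(rec: SamRecord) -> bool:
    # The CIGAR operation 'N' marks a skipped reference region,
    # which is how an intron between two exons is usually represented.
    return "N" in rec.cigar


# Invented example records: one read spanning an exon junction, one unmapped read.
seq, qual = "A" * 36, "I" * 36
spliced_line = f"read1\t0\tchr1\t100\t255\t20M500N16M\t*\t0\t0\t{seq}\t{qual}"
unmapped_line = f"read2\t4\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}"

spliced = parse_sam_line(spliced_line)
unmapped = parse_sam_line(unmapped_line)
print(spliced.rname, spliced.pos, is_mapped(spliced), is_spliced(spliced))
print(unmapped.qname, is_mapped(unmapped))
```

BAM stores the same records in compressed binary form, so it cannot be read line by line like this.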

Bowtie

Binning reads

Need annotated features
– Exons, genes, transcripts

For each feature, the total number of reads mapped is produced
– Not directly comparable across features/samples yet
– Usually followed by normalization
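The binning step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the feature coordinates and read positions are invented, a read counts for a feature if it overlaps it by at least one base, and strand, spliced alignments and multi-mapping reads are ignored. The function name `count_reads_per_feature` is hypothetical.

```python
# Hypothetical annotation: feature name -> (chromosome, start, end), 1-based inclusive.
features = {
    "geneA": ("chr1", 100, 1099),
    "geneB": ("chr1", 2000, 2499),
}

# Hypothetical mapped reads: (chromosome, start, end), 1-based inclusive.
reads = [
    ("chr1", 150, 185),    # inside geneA
    ("chr1", 1090, 1125),  # overlaps the end of geneA
    ("chr1", 2100, 2135),  # inside geneB
    ("chr2", 150, 185),    # different chromosome, not counted
    ("chr1", 1500, 1535),  # between the two genes, not counted
]


def count_reads_per_feature(reads, features):
    """Return a table of raw read counts per feature."""
    counts = {name: 0 for name in features}
    for chrom, start, end in reads:
        for name, (f_chrom, f_start, f_end) in features.items():
            # Two intervals overlap when each starts before the other ends.
            if chrom == f_chrom and start <= f_end and end >= f_start:
                counts[name] += 1
    return counts


counts = count_reads_per_feature(reads, features)
print(counts)  # {'geneA': 2, 'geneB': 1}
```

The nested loop is fine for a toy example. With millions of reads, an indexed or sorted interval structure would be needed instead.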

Normalizing counts

Why normalize?
– Longer features have more reads mapped
– Deeper sequencing produces more reads

RPKM is most frequently used
– Reads Per Kilobase per Million reads
– Defined as C/(LN)
  C = number of reads mapped to a feature
  L = length of the feature (in kilobases)
  N = total number of reads from the sample (in millions)
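The definition C/(LN) translates directly into a short function. The sketch below takes the feature length in base pairs and the total read count as a plain number, then converts them to kilobases and millions. The function name `rpkm` and the example numbers are invented for illustration.

```python
def rpkm(read_count: float, feature_length_bp: float, total_reads: float) -> float:
    """Reads Per Kilobase per Million reads: C / (L * N)."""
    if feature_length_bp <= 0 or total_reads <= 0:
        raise ValueError("feature length and total read count must be positive")
    length_kb = feature_length_bp / 1_000       # L, in kilobases
    total_millions = total_reads / 1_000_000    # N, in millions of reads
    return read_count / (length_kb * total_millions)


# Invented example: a 2,000 bp gene with 500 reads in a sample of 10 million reads.
shallow = rpkm(500, 2_000, 10_000_000)

# The same gene sequenced twice as deeply collects twice as many reads,
# but the normalized value stays the same.
deep = rpkm(1_000, 2_000, 20_000_000)

print(shallow, deep)  # 25.0 25.0
```

The second call shows why normalization matters. The raw counts differ by a factor of two purely because of sequencing depth, while the RPKM values agree.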

RPKM examples

http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNA_Seq.pdf

Gene model predicted for fungus Trametes versicolor using Augustus and RNA-seq hints

Splice Variants

“Non-coding” RNA molecules

LincRNA-p21

Tran et al., In press

MIRA Assembly

Contig: T_rep_c1201
Read members: 96
Length: 2429 bp

Combined Assembly

T_rep_c1201 is part of a 6-member contig

• 2 are partial transcripts assembled by PTA

Example

Detecting Differential Expression

Compare quantification values across samples or across features

Most tools summarize/normalize counts and suggest DE features
– Cufflinks/Cuffdiff, R packages (DESeq, edgeR, baySeq, TSPM), SAMtools

DE features go through similar analysis to microarray data analysis (e.g. validation)
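As a rough illustration of comparing quantification values across two samples, the sketch below flags features whose normalized values differ by at least a chosen log2 fold change. This is a simple screening heuristic, not the statistical method used by Cuffdiff, DESeq or edgeR, which model count variability across replicates. The function names, the pseudocount of 1, the threshold and the sample values are all assumptions made for this example.

```python
import math


def log2_fold_change(value_a: float, value_b: float, pseudocount: float = 1.0) -> float:
    """log2 of (b / a), with a pseudocount to avoid division by zero."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))


def flag_candidates(sample_a: dict, sample_b: dict, min_abs_log2fc: float = 1.0) -> dict:
    """Return features present in both samples whose |log2 fold change| meets the threshold."""
    candidates = {}
    for feature in sorted(sample_a.keys() & sample_b.keys()):
        lfc = log2_fold_change(sample_a[feature], sample_b[feature])
        if abs(lfc) >= min_abs_log2fc:
            candidates[feature] = lfc
    return candidates


# Invented normalized quantification values (for example RPKM) for two samples.
control = {"geneA": 25.0, "geneB": 7.0, "geneC": 0.0}
treated = {"geneA": 27.0, "geneB": 31.0, "geneC": 3.0}

candidates = flag_candidates(control, treated)
print(candidates)  # {'geneB': 2.0, 'geneC': 2.0}
```

A fold-change screen like this gives no measure of statistical significance, so candidates found this way still need replicates and validation, as with microarray results.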

Cufflinks

Cufflinks Tutorial

https://docs.google.com/document/d/1t1gi2Djxd0ykMVe2bF8BVOBsOsPngjFh2999u3rZq-A/edit?hl=en&authkey=CKL1i8sD#

Anaerobic biocorrosion in reactors filled with WP-LS medium

SSV1 Replication Cycle (UV Induced)
