[ppt]mdsc 675 bioinformatics for the...
What is RNA-Seq?
An experimental protocol that uses next-generation sequencing technologies to sequence the messenger RNA molecules within a biological sample in an effort to determine the primary sequence and relative abundance of each mRNA
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet. 12(10):671-682.
Also known as “Whole Transcriptome Shotgun Sequencing” (WTSS)
[Figure: sample-to-data workflow. Box labels: Plant material; Total RNA extraction; Bioanalyzer (RNA quality); mRNA isolation; cDNA libraries; 454 (1/2-plate); Illumina (1 lane, 108PE); Bioinformatics; Metabolite profiling. Side labels: Biochemistry PIs; Genome Québec Innovation Centre.]
Reference transcriptomes (75)
• combination of ½-plate of 454 and 1 lane of 108PE Illumina sequencing
• excellent depth and coverage
• high-quality assemblies
• submission of total RNA samples improves quality control
• takes better advantage of sequencing facilities
• similar overall cost
• 76SE Illumina sequencing on selected species for comparative transcriptomics
• repeat sequencing in rare cases of low-quality initial output
[Figure side label: Bioinformatics Innovation Centre]
Sequencing strategy
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.
RNA-Seq workflow
[Figure: RNA-Seq workflow diagram; surviving label: intron]
RNA-Seq vs. microarray
Characteristic                      RNA-Seq                                              Microarray
Which transcripts?                  All in a sample                                      Only those for which probes are designed
Transcript sequence generation      Yes                                                  No
Low-abundance transcript detection  Yes                                                  Limited
Abundance info source               Count (of the reads aligned to gene)                 Fluorescence level (of the probe spot for gene)
Resolution                          Base                                                 Probe sequence
Background noise                    Low                                                  High
Additional info                     Alternative splicing, transcriptome-level variation  Limited
RNA-Seq data analysis
[Flowchart: analysis pipeline]
1. Input: lots of short reads
2. Map reads (needs: reference genome)
3. Output: table of mapped loci per read
4. Bin reads to features (needs: feature annotation of exons, genes, transcripts)
5. Output: table of counts per feature
6. Normalize counts
7. Output: table of normalized quantification values per feature
8. Detect differentially expressed (DE) features
9. Output: DE features
Note: some of these steps are usually combined in a tool.
Mapping reads
• Need a reference genome
• Issues
  – Reads spanning exon junctions
  – Alternative splicing
  – Reads mapping to multiple locations in the genome
  – Huge amounts of data
• Most common mapping results format
  – SAM: Sequence Alignment/Map
  – BAM: binary format of SAM
• Many tools
  – Bowtie, SOAP, BWA, SHRiMP, mrFAST, mrsFAST, ZOOM, SSAHA2, Mosaik
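
To make the SAM format on this slide more concrete, here is a minimal Python sketch that reads the first six mandatory tab-separated columns of one SAM alignment line (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR). The helper names (`parse_sam_line`, `reference_end`, `spans_junction`) and the example read are hypothetical and not part of the original slides. It relies on two SAM conventions: FLAG bit 0x4 marks an unmapped read, and the CIGAR operation N marks a skipped reference region, which spliced aligners use for reads that span an exon junction.

```python
import re
from typing import NamedTuple

class Alignment(NamedTuple):
    read_name: str   # QNAME
    flag: int        # FLAG (bitwise)
    ref_name: str    # RNAME
    pos: int         # POS, 1-based leftmost mapped position
    mapq: int        # MAPQ
    cigar: str       # CIGAR string, e.g. "20M500N16M"

# One CIGAR element is a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference sequence.
REF_CONSUMING = set("MDN=X")

def parse_sam_line(line: str) -> Alignment:
    """Parse the first six mandatory columns of a SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5])

def is_mapped(aln: Alignment) -> bool:
    """FLAG bit 0x4 set means the read is unmapped."""
    return not (aln.flag & 0x4)

def reference_end(aln: Alignment) -> int:
    """1-based inclusive end of the alignment on the reference."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(aln.cigar)
               if op in REF_CONSUMING)
    return aln.pos + span - 1

def spans_junction(aln: Alignment) -> bool:
    """True if the CIGAR contains a skipped region (N), e.g. an intron."""
    return "N" in aln.cigar

# Hypothetical example: a 36-base read split across a 500-base intron.
example = "\t".join(
    ["read1", "0", "chr1", "100", "60", "20M500N16M", "*", "0", "0",
     "A" * 36, "I" * 36]
)
aln = parse_sam_line(example)
print(is_mapped(aln), reference_end(aln), spans_junction(aln))
```

In practice a dedicated library or SAMtools would normally be used to read SAM/BAM files; the sketch only shows what one record contains.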
Binning reads
• Need annotated features
  – Exons, genes, transcripts
• For each feature, the total number of reads mapped is produced
  – Not directly comparable across features/samples yet
  – Usually followed by normalization
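
The binning step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the method of any particular tool: each read is reduced to its chromosome and 1-based start position, each feature is a hypothetical `(name, chromosome, start, end)` tuple with inclusive coordinates, and a read is counted for every feature that contains its start position. The function name `bin_reads` and the sample data are invented for the example.

```python
def bin_reads(read_positions, features):
    """Count, for each feature, the reads whose start position falls inside it.

    read_positions: iterable of (chrom, pos) with 1-based positions
    features: iterable of (name, chrom, start, end), inclusive coordinates
    """
    features = list(features)
    counts = {name: 0 for name, _chrom, _start, _end in features}
    for chrom, pos in read_positions:
        for name, f_chrom, start, end in features:
            if chrom == f_chrom and start <= pos <= end:
                counts[name] += 1
    return counts

# Hypothetical annotation and mapped read positions.
features = [("geneA", "chr1", 100, 199), ("geneB", "chr1", 300, 399)]
reads = [("chr1", 150), ("chr1", 160), ("chr1", 350),
         ("chr2", 150), ("chr1", 250)]
print(bin_reads(reads, features))
```

The nested loop is fine for a toy example; with millions of reads, real tools index the features (for example with sorted intervals) instead of scanning all of them per read. The resulting raw counts are the "table of counts per feature" and still need normalization.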
Normalizing counts
• Why normalize?
  – Longer features have more reads mapped
  – Deeper sequencing produces more reads
• RPKM is most frequently used
  – Reads Per Kilobase per Million reads
  – Defined as RPKM = C / (L × N)
    C = number of reads mapped to a feature
    L = length of the feature (in kilobases)
    N = total number of reads from the sample (in millions)
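
As a worked example of the RPKM definition above, the following Python sketch computes C / (L × N) from raw inputs. The function name `rpkm` and the numbers are hypothetical; the only assumption beyond the slide is that the feature length is given in bases and the read total as a plain count, so the function converts them to kilobases and millions itself.

```python
def rpkm(read_count, feature_length_bp, total_reads):
    """RPKM = C / (L * N), with L in kilobases and N in millions of reads."""
    length_kb = feature_length_bp / 1_000
    reads_millions = total_reads / 1_000_000
    return read_count / (length_kb * reads_millions)

# Hypothetical feature: 500 reads on a 2,000 bp feature,
# from a sample with 10 million reads in total.
# L = 2 kb and N = 10 million, so RPKM = 500 / (2 * 10) = 25.
print(rpkm(500, 2_000, 10_000_000))
```

Doubling the feature length or the sequencing depth while keeping the read count fixed halves the value, which is exactly the correction for the two effects listed under "Why normalize?".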
MIRA Assembly
Contig: T_rep_c1201
Read members: 96
Length: 2429 bp
Combined Assembly
T_rep_c1201 is part of a 6-member contig
• 2 are partial transcripts assembled by PTA
Example
Detecting Differential Expression
• Compare quantification values across samples or across features
• Most tools summarize/normalize counts and suggest DE features
  – Cufflinks/Cuffdiff, R packages (DESeq, edgeR, baySeq, TSPM), SAMtools
• DE features go through similar analysis to microarray data analysis (e.g. validation)
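
To illustrate what "comparing quantification values across samples" means at its simplest, the Python sketch below computes a log2 fold change per feature between two samples of normalized values and lists the features that exceed a threshold. This is only a toy ranking and not what the tools named above do: they fit statistical models to the counts and report significance. The function names, the pseudocount of 1 (added to avoid dividing by zero), the threshold, and the sample values are all assumptions made for the example.

```python
import math

def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    """log2((b + pseudocount) / (a + pseudocount)) per shared feature."""
    return {
        feature: math.log2((sample_b[feature] + pseudocount)
                           / (sample_a[feature] + pseudocount))
        for feature in sample_a
        if feature in sample_b
    }

def candidate_de_features(sample_a, sample_b,
                          min_abs_log2fc=1.0, pseudocount=1.0):
    """Features whose absolute log2 fold change reaches the threshold."""
    changes = log2_fold_changes(sample_a, sample_b, pseudocount)
    return sorted(f for f, v in changes.items() if abs(v) >= min_abs_log2fc)

# Hypothetical normalized values (e.g. RPKM) for two samples.
control = {"geneA": 3.0, "geneB": 10.0, "geneC": 7.0}
treated = {"geneA": 15.0, "geneB": 10.0, "geneC": 1.0}
print(log2_fold_changes(control, treated))
print(candidate_de_features(control, treated))
```

A fold change alone says nothing about variability between replicates, which is why candidates from any method still go through validation.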
Cufflinks Tutorial
https://docs.google.com/document/d/1t1gi2Djxd0ykMVe2bF8BVOBsOsPngjFh2999u3rZq-A/edit?hl=en&authkey=CKL1i8sD#
Anaerobic biocorrosion in reactors filled with WP-LS medium