Statistical Genomics and Bioinformatics Workshop, 8/16/2013
Statistical Genomics and Bioinformatics Workshop:
Genetic Association and RNA-Seq Studies
RNA‐Seq and Differential Expression Analysis
Brooke L. Fridley, PhD
University of Kansas Medical Center
Next-generation gap
http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.f.268.html
Why sequence RNA (versus DNA)?
• Functional studies
– Genome may be constant but an experimental condition has a pronounced effect on gene expression
• Some molecular features can only be observed at the RNA level
– Alternative isoforms, fusion transcripts, RNA editing
• Predicting transcript sequence from genome sequence is difficult
– Alternative splicing, RNA editing, etc.
Why sequence RNA (versus DNA)?
• Interpreting mutations that do not have an obvious effect on protein sequence
– ‘Regulatory’ mutations that affect which mRNA isoform is expressed and how much (e.g. splice sites, promoters, TFBS)
• Prioritizing protein-coding somatic mutations
– If the gene is not expressed, a mutation in that gene would be less interesting
– If the gene is expressed but only from the wild-type allele, this might suggest loss of function
– If the mutant allele itself is expressed, this might suggest a candidate drug target
Challenges to RNA Studies
• Sample
– Purity? Quantity? Quality?
• RNAs consist of small exons that may be separated by large introns
– Mapping reads to the genome is challenging
• The relative abundance of RNAs varies widely
– Abundances span a 10^5 to 10^7-fold range (5 to 7 orders of magnitude)
– Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads
• RNAs come in a wide range of sizes
– Small RNAs must be captured separately
– PolyA selection of large RNAs may result in 3’ end bias
• RNA is fragile compared to DNA (easily degraded)
The evolution of transcriptomics
1995: P. Brown et al., gene expression profiling using spotted cDNA microarrays (hybridization-based): expression levels of known genes
2002: Affymetrix, whole-genome expression profiling using tiling arrays (hybridization-based): identifying and profiling novel genes and splicing variants
2008: many groups, mRNA‐seq: direct sequencing of mRNAs using next-generation sequencing (NGS) techniques
RNA‐seq is still a technology under active development
RNA-Seq vs Microarrays
• General expression profiling
• Novel genes
• Alternative splicing
• Detect gene fusion
• Can use on any sequenced genome
• Better dynamic range
• Cleaner and more informative data
• Data analysis challenges
Advantages of RNA-Seq compared with other transcriptomics methods
Goals of RNA-seq Study
• Catalogue all species of transcript including: mRNAs, non-coding RNAs and small RNAs
• Determine the transcriptional structure of genes in terms of:
– Start sites
– 5′ and 3′ ends
– Splicing patterns / novel isoforms
– Other post-transcriptional modifications
• Quantification of expression levels and comparison (different conditions, tissues, etc.)
– Gene and exon level
• Determine allelic expression
• Gene fusion
• Transcriptome for non-model organisms
Shirley Pepke, Barbara Wold & Ali Mortazavi. Computation for ChIP‐seq and RNA‐seq studies. Nature Methods 6, S22‐S32 (2009). Published online: 15 October 2009
Types of RNAs
• Coding RNAs, “genes”
• Non-coding RNAs
• Ribosomal RNA

Type                             Size           Function
microRNA (miRNA)                 21‐23 nt       regulation of gene expression
small interfering RNA (siRNA)    19‐23 nt       antiviral mechanisms
piwi‐interacting RNA (piRNA)     26‐31 nt       interaction with piwi proteins/spermatogenesis
small nuclear RNA (snRNA)        100‐300 nt     RNA splicing
small nucleolar RNA (snoRNA)     (not listed)   modification of other RNAs
RNA
[Figure: pre‐mRNA (exons and introns) is spliced into mature mRNA with a poly‐A tail]
Alternative Splicing
Next-Gen Sequencing (NGS)
• Long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or cDNA fragmentation
• In contrast to small RNAs (like miRNAs), larger RNAs must be fragmented
• RNA fragmentation or cDNA fragmentation (different techniques)
• Types of bias:
– RNA fragmentation: depleted for transcript ends
– cDNA fragmentation: biased toward the 3’ end of transcripts
Zhong Wang, Mark Gerstein & Michael Snyder. RNA‐Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57‐63 (January 2009)
• Sequencing adaptors (blue) are added to each cDNA fragment, and a short sequence is obtained from each cDNA using high-throughput sequencing technology
• Typical read length: 30-400 bp depending on technology
• Reads are aligned with the reference genome or transcriptome and classified as three types: exonic reads, junction reads and poly(A) end-reads.
• de novo assembly also possible for non-model organisms
• These three types are used to generate a base-resolution expression profile for each gene
• Example: A yeast ORF with one intron
Un-replicated Experimental Design
• 1 biological replicate per treatment group
• Pros: Cheap, can be informative, prelim data
• Cons: Can only make inferences about the particular biological individuals, NOT the treatment groups
• Applications: Pilot studies (although variation cannot be assessed), reference transcriptome assembly
Biological vs Technical Replicates
• Biological Replicates – Multiple Unique Individuals/Samples
• Technical Replicates – One Individual/Sample with some technical steps replicated
• Biological Variance > Technical Variance (typically)
• Biological replicates are more useful because they allow inferences about the “treatment”
Sample Pooling
• Combining multiple samples / individuals / tissues during preparation into a single sample for assay.
• Pros: Necessary when not enough “material” per individual for assay.
• Cons: The measure of variability between individuals is lost. A “bad” sample can bias the pooled sample.
Multiplexing
• Each sample is “indexed” and combined into one pooled sample.
– Indexing allows one to identify the “reads” for each subject
• Pros: Removes technical variation as source of confounding; cost effective
• Cons: Reduces “depth” or “coverage” per sample
[Figure: a flow cell and its lanes]
Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones.
Auer P.L. and Doerge R.W. Genetics 2010; 185: 405-416. Copyright © 2010 by the Genetics Society of America
Coverage Estimation
• Lander/Waterman equation for coverage
• C = LN / G
– C is the coverage
– L is the read length
– N is the number of reads
– G is the haploid genome length
• http://support.illumina.com/downloads/sequencing_coverage_calculator.ilmn
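The coverage equation C = LN / G can be worked through in a few lines. The following is a minimal sketch; the function names and the example numbers (100 bp reads, 30 million reads, a 3 Gb haploid genome) are assumptions chosen for illustration and do not come from the slides.

```python
import math


def coverage(read_length_bp, num_reads, genome_length_bp):
    """Expected coverage C = L * N / G (Lander/Waterman)."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return read_length_bp * num_reads / genome_length_bp


def reads_needed(target_coverage, read_length_bp, genome_length_bp):
    """Solve C = L * N / G for N, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length_bp / read_length_bp)


# Hypothetical inputs: 100 bp reads, 30 million reads, 3 Gb haploid genome.
example_coverage = coverage(100, 30_000_000, 3_000_000_000)
print(example_coverage)

# Hypothetical target: how many 100 bp reads give 15x on the same genome?
print(reads_needed(15, 100, 3_000_000_000))
```

For a transcriptome, G would have to be replaced by an estimate of the expressed transcriptome length, and transcript abundances are uneven, so this number is only a rough guide.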
What Coverage is needed?
• http://metalhelix.github.io/coverme/
• The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.
– More reads are needed for alternative splicing / fusions
– Fewer reads are needed for DE gene expression studies
– 15x is recommended for standard transcriptome studies
Tarazona et al. (2011) Genome Research. Differential expression in RNA‐seq: A matter of depth
Coverage and Depth
• Number of detected genes (coverage) and costs increase with sequencing depth (number of analyzed reads)
• Calculation of coverage is less straightforward in transcriptome analysis (transcription activity varies)
ENCODE Standards
• http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf
• DE testing requires only modest depths of sequencing: 30M paired‐end reads of length > 30 nt, of which 20‐25M are mappable to the genome or known transcriptome
• Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms require more extensive sequencing.
• Experiments should be performed with two or more biological replicates, unless there is a compelling reason why this is impractical or wasteful. A biological replicate is defined as an independent growth of cells/tissue and subsequent analysis.
• Technical replicates not required, except to evaluate cases where biological variability is abnormally high.
Questions?
Example Workflow
Brian T. Wilhelm, Josette‐Renée Landry. RNA‐Seq—quantitative measurement of expression through massively parallel RNA‐sequencing. Methods 48(3), 249‐257 (2009)
Tuxedo Suite RNA-Seq Pipeline
Inputs:
• Raw sequence data (.fastq files)
• Reference genome (.fa file)
• Gene annotation (.gtf file)
Pipeline steps, in order:
• Sequencing: RNA‐seq reads (2 x 100 bp)
• Read alignment: Bowtie/TopHat alignment (genome)
• Transcript compilation: Cufflinks
• Gene identification: Cufflinks (cuffcompare)
• Differential expression: Cuffdiff (A:B comparison)
• Visualization: CummeRbund
RNA-Seq - Bioinformatics challenges (I):
• Storing, retrieving and processing of large amounts of data
• Base calling
• Quality analysis for bases and reads => FASTQ files
• Mapping/aligning RNA-Seq reads (Alternative: assemble contigs and align them to genome)
• Multiple alignment possible for some reads
• Sequencing errors and polymorphisms => SAM/BAM files
RNA-Seq - Bioinformatics challenges (II):
• Exon junctions and poly(A) ends
– Identification of poly(A): long stretches of A (or T) at the ends of reads
– Splice sites:
  – Specific sequence context: GT-AG dinucleotides at the intron boundaries
  – Low expression for intronic regions
  – Known or predicted splice sites
  – Detection of new sites (e.g. via split read mapping)
• Overlapping genes
• RNA editing
• Secondary structure of transcripts
• Quantification of expression signals
Mapping
• A multiread is a read that maps equally well to many reference sequences.
• read: AGTCGACTAGCTATTAGCATG
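The multiread idea can be illustrated with a toy exact-match search. This is only a sketch: real aligners tolerate mismatches and use indexed references, and the reference names and sequences below are hypothetical, built around the example read from the slide.

```python
def exact_hits(read, references):
    """Return the names of reference sequences that contain the read exactly."""
    return [name for name, seq in references.items() if read in seq]


def is_multiread(read, references):
    """A read is a multiread here if it matches more than one reference."""
    return len(exact_hits(read, references)) > 1


read = "AGTCGACTAGCTATTAGCATG"

# Hypothetical references: two isoforms sharing the region the read comes from.
references = {
    "isoform_1": "CC" + read + "GG",
    "isoform_2": "TT" + read + "AA",
    "unrelated": "ACGTACGTACGTACGTACGTACGT",
}

print(exact_hits(read, references))
print(is_multiread(read, references))
```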
Read mapping vs. de novo assembly
Haas and Zody, Nature Biotechnology 28, 421–423 (2010)
[Figure panels: good reference genome vs. no reference genome]
• Options:
Align and then assemble
Assemble and then align
• Align to:
Genome
transcriptome
Genomic vs Transcript Mapping of Reads
[Figure: reads spanning Exon A/Exon B and Exon C/Exon D, mapped at the transcript level and at the genome level]
Mapping of reads at genome or transcript level
Genomic Mapping
• Advantages:
Less likely to have multireads across different isoforms.
One can get a sense of the coverage across exons.
• Disadvantages:
Estimating isoform expression is somewhat involved.
Needs an (annotated) genome! (i.e. not great for non-model organisms)
Transcriptome Mapping
• Advantages:
Transcript-level expression
Slightly easier to do.
• Disadvantages:
Multiple isoforms can share an exon; can get multireads.
Requires annotation to roll up to gene-level counts
RNA-Seq Quality Control
• Quality control is important to make sure the experiment and data are valid
Percentage of reads properly mapped / uniquely mapped
5’ or 3’ bias
Per base sequence quality
Per sequence GC content
Sequence length
Duplication levels
Coverage (reads per base)
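A few of the QC metrics listed above are simple enough to compute by hand. The sketch below is illustrative only; the function names are hypothetical, and dedicated QC tools report these metrics in more detail.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of bases in a read that are G or C (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def duplication_level(reads):
    """Fraction of reads that are exact copies of another read already seen."""
    if not reads:
        return 0.0
    unique = len(Counter(reads))
    return 1 - unique / len(reads)


def mapped_fraction(n_mapped, n_total):
    """Fraction of reads that were mapped (0.0 if there are no reads)."""
    return n_mapped / n_total if n_total else 0.0


# Hypothetical toy reads.
toy_reads = ["ACGT", "ACGT", "TTTT", "GGGG"]
print([gc_content(r) for r in toy_reads])
print(duplication_level(toy_reads))
```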
Alignment of Reads
Alignment using bowtie algorithm:
• Not more than 2 mismatches per read allowed
• Reads with multiple alignment discarded
• Read longer than 35 bp truncated to 35 bp
• A read is assigned to a gene when the middle position of its alignment overlaps the gene footprint
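The alignment rules above can be sketched as plain filtering logic. This is not Bowtie's interface; it is a hypothetical Python illustration, and it assumes that truncation keeps the first 35 bases of the read and that coordinates are 0-based with an exclusive gene end.

```python
MAX_READ_LENGTH = 35   # reads longer than this are truncated
MAX_MISMATCHES = 2     # at most 2 mismatches per read


def truncate(read):
    """Keep only the first 35 bases (assumption: the 3' end is trimmed)."""
    return read[:MAX_READ_LENGTH]


def count_mismatches(read, reference_segment):
    """Count differing bases; zip stops at the shorter of the two sequences."""
    return sum(1 for a, b in zip(read, reference_segment) if a != b)


def passes_filter(read, reference_segment, n_alignments):
    """Keep a read only if it aligns uniquely with at most 2 mismatches."""
    if n_alignments != 1:
        return False  # reads with multiple alignments are discarded
    trimmed = truncate(read)
    return count_mismatches(trimmed, reference_segment) <= MAX_MISMATCHES


def overlaps_gene(align_start, read_length, gene_start, gene_end):
    """Assign a read to a gene by the middle position of its alignment."""
    middle = align_start + read_length // 2
    return gene_start <= middle < gene_end
```

Under these assumptions, a 35 bp read starting at position 90 has its middle at 107 and would be counted for a gene spanning 100 to 200.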
TopHat Pipeline: Splice Junctions
• Reads are mapped against a reference genome, and those reads that do not map are set aside.
• An initial consensus of mapped regions is computed by Maq.
• Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions.
• The initially unmapped (IUM) reads are indexed and aligned to these splice junction sequences.
Dataset after Aligning Reads
Gene     Treatment 1                          Treatment 2
1        14         18         10         47         13         24
2        10         3          15         1          11         5
3        1          0          10         80         21         34
4        0          0          0          0          2          0
5        4          3          3          5          33         29
...
53256    47         29         11         71         278        339
Total    22910173   30701031   18897029   20546299   28491272   27082148
Differential Expression (DE) Analysis
• We would like to test whether the proportion of reads aligning to gene 1 tends to be different for experimental units that received treatment 1 than for experimental units that received treatment 2.
Treatment 1: 14 out of 22910173, 18 out of 30701031, 10 out of 18897029
vs.
Treatment 2: 47 out of 20546299, 13 out of 28491272, 24 out of 27082148
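The proportions being compared for gene 1 can be computed directly from the counts and library totals in the table. The sketch below only rescales each count to reads per million; it is not a statistical test, and the variable and function names are hypothetical.

```python
# Gene 1 counts and library totals taken from the table above.
gene1_counts = {
    "treatment_1": [14, 18, 10],
    "treatment_2": [47, 13, 24],
}
library_sizes = {
    "treatment_1": [22910173, 30701031, 18897029],
    "treatment_2": [20546299, 28491272, 27082148],
}


def reads_per_million(count, library_size):
    """Proportion of reads aligning to a gene, scaled to one million reads."""
    return count / library_size * 1_000_000


gene1_rpm = {
    group: [
        reads_per_million(c, n)
        for c, n in zip(gene1_counts[group], library_sizes[group])
    ]
    for group in gene1_counts
}

for group, values in gene1_rpm.items():
    print(group, [round(v, 3) for v in values])
```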
Example 1: Need for Normalization
• Every gene in B is expressed in A at the same level. G = number of genes expressed in B
• A also contains a set of G genes that are expressed in A but not in B. Thus, A has 2*G expressed genes and its RNA production is twice that of sample B.
• Each sample is sequenced at approximately the same depth
• Without adjustment, a gene expressed in both A and B will have ½ as many reads in A as in B, since the reads for A are spread over twice as many genes.
• The normalization would adjust sample A by a factor of 2.
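The arithmetic in this example can be checked with a toy calculation. The sketch assumes, for simplicity, that every expressed gene has the same abundance; the gene count and sequencing depth are hypothetical.

```python
def expected_reads_per_gene(depth, n_expressed_genes):
    """Expected reads per gene when all expressed genes are equally abundant."""
    return depth / n_expressed_genes


G = 1_000          # hypothetical number of genes expressed in B
depth = 1_000_000  # hypothetical reads per sample, same for A and B

reads_b = expected_reads_per_gene(depth, G)      # B expresses G genes
reads_a = expected_reads_per_gene(depth, 2 * G)  # A expresses 2*G genes

# Factor needed to put a shared gene in A on the same scale as in B.
normalization_factor = reads_b / reads_a
print(reads_a, reads_b, normalization_factor)
```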
Example 2: Need for Normalization
• Suppose you multiplex 4 samples to lane 1 and 2 samples to lane 2 of a flowcell.
The 4 samples in lane 1 will have fewer “reads” (lower counts) than the 2 samples in lane 2.
• Need to account for the total number of reads per sample (library size)
Measure for expression: FPKM and RPKM
• Longer transcripts, more fragments/reads
• FPKM/RPKM measure “average pair coverage” per transcript
• FPKM: Fragments Per Kilobase per Million
• RPKM: Reads Per Kilobase per Million
• Paired-end RNA-Seq experiments produce two reads per fragment.
• Both reads may not be mappable.
• If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value.
• Thus, FPKM is calculated by counting fragments, not reads.
• For a single-end experiment, FPKM = RPKM
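The difference between counting reads and counting fragments can be seen on a toy set of paired-end fragments. This is a minimal sketch; the data and the function name are hypothetical, and a fragment is counted here if at least one of its two mates is mapped.

```python
def count_reads_and_fragments(pairs):
    """pairs: one (mate1_mapped, mate2_mapped) tuple of booleans per fragment."""
    reads = sum(int(m1) + int(m2) for m1, m2 in pairs)
    fragments = sum(1 for m1, m2 in pairs if m1 or m2)
    return reads, fragments


# Hypothetical fragments: some have both mates mapped, some only one, one none.
pairs = [(True, True), (True, False), (False, True), (True, True), (False, False)]
n_reads, n_fragments = count_reads_and_fragments(pairs)
print(n_reads, n_fragments)
```

Fragments with both mates mapped contribute two reads but one fragment, which is the uneven double-counting that fragment-based counting avoids.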
RPKM
• RPKM: Reads Per Kilobase per Million mapped reads
• RPKM = C/(L x N)
– C: Number of mappable reads on a feature, such as an exon or transcript
– L: Length of feature (in kb)
– N: Total number of mappable reads (in millions)
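The RPKM formula translates directly into code. In this sketch the inputs are given in base pairs and raw read counts and converted to kilobases and millions inside the function; the function name and example values are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """RPKM = C / (L x N), with L in kilobases and N in millions of reads."""
    length_kb = feature_length_bp / 1_000
    total_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * total_millions)


# Hypothetical feature: 500 reads on a 2,000 bp transcript, 20 million mapped reads.
print(rpkm(500, 2_000, 20_000_000))
```

For paired-end data, passing the fragment count instead of the read count gives FPKM under the same formula.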
Distributions used for Modeling Count Data
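• $Y_{gi}$ = count of reads for subject $i$ on treatment $j$ for gene $g$, with $i = 1, \dots, N$ and $g = 1, \dots, G$
• $Y_{gi} \sim \text{Poisson}(\lambda_{gi})$, $\lambda_{gi} > 0$, with $P(Y_{gi} = y) = \dfrac{\lambda_{gi}^{y} e^{-\lambda_{gi}}}{y!}$, so that $E(Y_{gi}) = \text{Var}(Y_{gi}) = \lambda_{gi}$
• Observed in RNA-seq data that $\text{Var}(Y_{gi}) > E(Y_{gi})$
• “Over-dispersed” Poisson (a.k.a. Negative Binomial or Poisson-Gamma Distribution)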
Empirical Assessment of Over-Dispersion of NGS Counts
https://sites.google.com/site/davismcc/useful-documents
[Figure annotation: for the Poisson distribution, mean = variance]
Distributions used for Modeling Count Data
• Technical Variation follows a Poisson Distribution
• Biological Variation follows a Negative Binomial Distribution
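• Let $Y_{gi} \sim \text{NB}(M_i p_{gj}, \phi_g)$
– $M_i$ = library size for sample $i$ (i.e., total number of reads)
– $\phi_g$ = dispersion parameter for gene $g$
– $\sqrt{\phi_g}$ represents the coefficient of variation of biological variation between the samples (able to separate biological from technical variation)
– $\phi_g = 0$ reduces to Poisson($\mu_{gi}$)
– $E(Y_{gi}) = \mu_{gi} = M_i p_{gj}$ and $\text{Var}(Y_{gi}) = \mu_{gi}(1 + \mu_{gi}\phi_g)$
– DE is based on parameter $p_{gj}$, the relative abundance of gene $g$ in treatment group $j$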
Model used in edgeR
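The mean-variance relationship of this model can be made concrete with a small calculation. The sketch assumes the parameterization in which the mean is the library size times the relative abundance of the gene, and the variance is mean x (1 + mean x dispersion); the function names and numeric inputs are hypothetical.

```python
import math


def nb_mean_variance(library_size, relative_abundance, dispersion):
    """Mean and variance of a negative binomial count under the model above."""
    mu = library_size * relative_abundance
    variance = mu * (1 + mu * dispersion)
    return mu, variance


def biological_cv(dispersion):
    """Coefficient of variation of the biological variation."""
    return math.sqrt(dispersion)


# Hypothetical gene: 1 read in 10,000, library of 1 million reads.
mu, var_nb = nb_mean_variance(1_000_000, 0.0001, 0.04)
_, var_poisson = nb_mean_variance(1_000_000, 0.0001, 0.0)

print(mu, var_nb, var_poisson, biological_cv(0.04))
```

With zero dispersion the variance equals the mean, which is the Poisson case; any positive dispersion inflates the variance, and the inflation grows with the mean.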
edgeR Package and Methods
• edgeR (Robinson, McCarthy, Smyth; 2010):
– models count data using an over-dispersed Poisson model
– estimates the gene-wise dispersions by conditional maximum likelihood, conditioning on the total count for that gene (Smyth and Verbyla, 1996)
– uses an empirical Bayes (EB) procedure to shrink the dispersions towards a consensus value (i.e., borrowing information) (Robinson and Smyth, 2007)
– assesses DE using an exact test analogous to Fisher’s exact test, but adapted for over-dispersed data (Robinson and Smyth, 2008)
• Similar in idea to limma (Smyth, 2004)
Other Methods/Programs for DE Analysis
• DESeq, BaySeq, Cuffdiff, SAMseq and many others
• Each has pros and cons with different assumptions.
• No single method will be optimal under all circumstances, and hence the method of choice in a particular situation depends on the experimental conditions.
• Suggest running more than one method to look for sensitivity of results.
Following DE analysis
• Visualization of results
Volcano plot (FC vs p-value)
• Multiple Testing Correction
FDR
q-values
• Pathway Analysis
IPA
Network Analysis
• Validation Studies
Technical validation of sequence data
Confirm/replicate association results
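For the multiple testing step, one widely used FDR procedure is the Benjamini-Hochberg adjustment. The slides name FDR and q-values without specifying a method, so the sketch below is an assumption about which procedure might be applied; Storey's q-values, for example, are computed differently. The p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Genes whose adjusted value falls below the chosen FDR threshold would then be carried forward to pathway analysis and validation.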
Questions?