
Statistical Genomics and Bioinformatics Workshop, 8/16/2013

Statistical Genomics and Bioinformatics Workshop:

Genetic Association and RNA-Seq Studies

RNA‐Seq and Differential Expression Analysis

Brooke L. Fridley, PhD

University of Kansas Medical Center


Next-generation gap

http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.f.268.html


Why sequence RNA (versus DNA)?

• Functional studies

– Genome may be constant but an experimental condition has a pronounced effect on gene expression

• Some molecular features can only be observed at the RNA level

– Alternative isoforms, fusion transcripts, RNA editing

• Predicting transcript sequence from genome sequence is difficult

– Alternative splicing, RNA editing, etc.


Why sequence RNA (versus DNA)?

• Interpreting mutations that do not have an obvious effect on protein sequence

– ‘Regulatory’ mutations that affect which mRNA isoform is expressed and how much

– e.g. splice sites, promoters, TFBS

• Prioritizing protein-coding somatic mutations

– If the gene is not expressed, a mutation in that gene would be less interesting

– If the gene is expressed but only from the wild-type allele, this might suggest loss of function

– If the mutant allele itself is expressed, this might suggest a candidate drug target

Challenges to RNA Studies

• Sample

– Purity? Quantity? Quality?

• RNAs consist of small exons that may be separated by large introns

– Mapping reads to the genome is challenging

• The relative abundance of RNAs varies widely

– Over a 10^5 to 10^7-fold range (5 to 7 orders of magnitude)

– Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads

• RNAs come in a wide range of sizes

– Small RNAs must be captured separately

– PolyA selection of large RNAs may result in 3’ end bias

• RNA is fragile compared to DNA (easily degraded)

The evolution of transcriptomics

1995: P. Brown et al., gene expression profiling using spotted cDNA microarrays: expression levels of known genes

2002: Affymetrix, whole-genome expression profiling using tiling arrays: identifying and profiling novel genes and splicing variants

2008: many groups, mRNA‐seq: direct sequencing of mRNAs using next-generation sequencing (NGS) techniques

RNA‐seq is still a technology under active development

Hybridization-based


RNA-Seq vs Microarrays

• General expression profiling

• Novel genes

• Alternative splicing

• Detect gene fusion

• Can use on any sequenced genome

• Better dynamic range

• Cleaner and more informative data

• Data analysis challenges

Advantages of RNA-Seq compared with other transcriptomics methods


Goals of RNA-seq Study

• Catalogue all species of transcript including: mRNAs, non-coding RNAs and small RNAs

• Determine the transcriptional structure of genes in terms of:

– Start sites

– 5′ and 3′ ends

– Splicing patterns / novel isoforms

– Other post-transcriptional modifications

• Quantification of expression levels and comparison (different conditions, tissues, etc.)

– Gene and exon level

• Determine allelic expression

• Gene fusion

• Transcriptome for non-model organisms

Shirley Pepke, Barbara Wold & Ali Mortazavi. Computation for ChIP‐seq and RNA‐seq studies. Nature Methods 6, S22‐S32 (2009). Published online: 15 October 2009

Types of RNAs

• Coding RNAs, “genes”

• Non-coding RNAs

• Ribosomal RNA

Type                            Size         Function
microRNA (miRNA)                21‐23 nt     regulation of gene expression
small interfering RNA (siRNA)   19‐23 nt     antiviral mechanisms
piwi‐interacting RNA (piRNA)    26‐31 nt     interaction with piwi proteins/spermatogenesis
small nuclear RNA (snRNA)       100‐300 nt   RNA splicing
small nucleolar RNA (snoRNA)                 modification of other RNAs

RNA

[Figure: pre‐mRNA (exons and introns) is spliced into mature mRNA with a poly‐A tail]

Alternative Splicing


Next-Gen Sequencing (NGS)

• Long RNAs are first converted into a library of cDNA fragments through either: RNA fragmentation or DNA fragmentation


• In contrast to small RNAs (like miRNAs), larger RNAs must be fragmented

• RNA fragmentation or cDNA fragmentation (different techniques)

• Types of bias:

– RNA fragmentation: depletion of transcript ends

– cDNA fragmentation: bias toward the 3’ end

Zhong Wang, Mark Gerstein & Michael Snyder. RNA‐Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57‐63 (January 2009)

• Sequencing adaptors (blue) are added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology

• Typical read length: 30-400 bp depending on technology

• Reads are aligned with the reference genome or transcriptome and classified as three types: exonic reads, junction reads and poly(A) end-reads.

• de novo assembly also possible for non-model organisms


• These three types are used to generate a base-resolution expression profile for each gene

• Example: A yeast ORF with one intron

Un-replicated Experimental Design

• 1 biological replicate per treatment group

• Pros: Cheap, can be informative, prelim data

• Cons: Can only make inferences about the particular biological individuals, NOT the treatment groups

• Applications: Pilot studies (although these cannot assess variation), reference transcriptome assembly

Biological vs Technical Replicates

• Biological Replicates – Multiple Unique Individuals/Samples

• Technical Replicates – One Individual/Sample with some technical steps replicated

• Biological Variance > Technical Variance (typically)

• Biological replicates are more useful because they allow inferences about the “treatment”

Sample Pooling

• Combining multiple samples / individuals / tissues during preparation into a single sample for assay.

• Pros: Necessary when not enough “material” per individual for assay.

• Cons: The measure of variability between individuals is lost. A “bad” sample can bias the pooled sample.

Multiplexing

• Each sample is “indexed” and combined into one pooled sample.

– Indexing allows one to identify the “reads” for each subject

• Pros: Removes technical variation as source of confounding; cost effective

• Cons: Reduces “depth” or “coverage” per sample

[Figure: flow cell and lane]

Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones.

Auer PL and Doerge RW. Genetics 2010;185:405-416

Copyright © 2010 by the Genetics Society of America

Coverage Estimation

• Lander/Waterman equation for coverage

• C = LN / G

C stands for coverage

L is the read length

N is the number of reads

G is the haploid genome length

• http://support.illumina.com/downloads/sequencing_coverage_calculator.ilmn
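The Lander/Waterman relation above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the example numbers (100 bp reads, 30 million reads, a 100 Mb target) are assumptions made for the sketch, not values from the workshop.

```python
import math


def coverage(read_length: int, n_reads: int, genome_length: int) -> float:
    """Expected coverage C = L * N / G (Lander/Waterman)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_length * n_reads / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Rearranged form N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical example: 30 million reads of 100 bp over a 100 Mb target.
example_coverage = coverage(100, 30_000_000, 100_000_000)
example_reads = reads_needed(15, 100, 100_000_000)
```

For RNA-seq, G would be the size of the expressed transcriptome rather than the genome, and coverage per transcript depends on expression level, so a result like this is only a rough planning figure.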



What Coverage is needed?

• http://metalhelix.github.io/coverme/

• The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.

– More reads needed for alternative splicing / fusions

– Fewer reads needed for DE gene expression studies

– 15x is recommended for standard transcriptome studies

Tarazona et al. (2011) Genome Research. Differential expression in RNA‐seq: A matter of depth

Coverage and Depth

• Number of detected genes (coverage) and costs increase with sequence depth (number of analyzed reads)

• Calculation of coverage is less straightforward in transcriptome analysis (transcription activity varies)


ENCODE Standards

• http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf

• DE testing requires only modest depths of sequencing: 30M paired‐end reads of length > 30 nt, of which 20‐25M are mappable to the genome or known transcriptome

• Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms require more extensive sequencing.

• Experiments should be performed with two or more biological replicates, unless there is a compelling reason why this is impractical or wasteful. A biological replicate is defined as an independent growth of cells/tissue and subsequent analysis.

• Technical replicates not required, except to evaluate cases where biological variability is abnormally high.


Questions?


Example Workflow

Brian T. Wilhelm, Josette‐Renée Landry. RNA‐Seq—quantitative measurement of expression through massively parallel RNA‐sequencing. Methods, Volume 48, Issue 3, 2009, 249‐257

Tuxedo Suite RNA-Seq Pipeline

Inputs

• Raw sequence data (.fastq files)

• Reference genome (.fa file)

• Gene annotation (.gtf file)

Steps

• Sequencing: RNA‐seq reads (2 x 100 bp)

• Read alignment: Bowtie/TopHat alignment (genome)

• Transcript compilation: Cufflinks

• Gene identification: Cufflinks (cuffcompare)

• Differential expression: Cuffdiff (A:B comparison)

• Visualization: CummeRbund

RNA-Seq - Bioinformatics challenges (I):

• Storing, retrieving and processing of large amounts of data

• Base calling

• Quality analysis for bases and reads => FASTQ files

• Mapping/aligning RNA-Seq reads (Alternative: assemble contigs and align them to genome)

• Multiple alignment possible for some reads

• Sequencing errors and polymorphisms => SAM/BAM files

RNA-Seq - Bioinformatics challenges (II):

• Exon junctions and poly(A) ends

– Identification of poly(A): long stretches of A (or T) at the ends of reads

– Splice sites:

    Specific sequence context: GT – AG dinucleotides

    Low expression for intronic regions

    Known or predicted splice sites

    Detection of new sites (e.g. via split read mapping)

• Overlapping genes

• RNA editing

• Secondary structure of transcripts

• Quantification of expression signals

Mapping

• A multiread is a read that maps equally well to many reference sequences.

• read: AGTCGACTAGCTATTAGCATG
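To make the multiread definition concrete, here is a small Python sketch. It uses exact substring matching as a stand-in for a real aligner (which would also tolerate mismatches and use an index), and the three reference sequences are hypothetical, built only for this illustration.

```python
READ = "AGTCGACTAGCTATTAGCATG"  # the read shown on the slide

# Hypothetical reference sequences (for example, isoforms sharing an exon).
REFERENCES = {
    "isoform_1": "CCC" + READ + "GGA",
    "isoform_2": "TTAA" + READ,
    "isoform_3": "ACGTACGTACGTACGTACGTACGT",
}


def mapping_locations(read, references):
    """Names of reference sequences that contain the read exactly."""
    return [name for name, seq in references.items() if read in seq]


def is_multiread(read, references):
    """A read is a multiread if it maps equally well to more than one reference."""
    return len(mapping_locations(read, references)) > 1


hits = mapping_locations(READ, REFERENCES)
```

Here the read matches both `isoform_1` and `isoform_2` equally well, so it cannot be assigned to one of them without further information.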



Read mapping vs. de novo assembly

Haas and Zody, Nature Biotechnology 28, 421–423 (2010)

Good reference No reference genome

• Options:

Align and then assemble

Assemble and then align

• Align to:

Genome

transcriptome

Genomic vs Transcript Mapping of Reads

[Figure: reads over Exons A and B and over Exons C and D, shown under transcript-level mapping and under genome-level mapping, where read placement is marked “?”]

Mapping of reads at genome or transcript level

Genomic Mapping

• Advantages:

Less likely to have multireads across different isoforms.

One can get a sense of the coverage across exons.

• Disadvantages:

Estimating isoform expression is more involved.

Needs an (annotated) genome! (i.e. not great for non-model organisms)

Transcriptome Mapping

• Advantages:

Transcript-level expression

Slightly easier to do.

• Disadvantages:

Multiple isoforms can share an exon; can get multireads.

Requires annotation to roll up to gene-level counts

RNA-Seq Quality Control

• Quality control is important to make sure the experiment and data are valid

Percentage of reads properly mapped / uniquely mapped

5’ or 3’ bias

Per base sequence quality

Per sequence GC content

Sequence length

Duplication levels

Coverage (reads per base)

Alignment of Reads

Alignment using bowtie algorithm:

• No more than 2 mismatches allowed per read

• Reads with multiple alignments are discarded

• Reads longer than 35 bp are truncated to 35 bp

• A read is assigned to a gene when the middle position of its alignment overlaps the gene footprint
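The filtering rules above can be illustrated with a toy Python sketch. It is a brute-force scan written only to show the rules (truncate to 35 bp, allow at most 2 mismatches, discard reads with multiple alignments, assign by middle position); it is not how Bowtie works internally, and all function names and the toy genome are assumptions.

```python
def truncate(read, max_len=35):
    """Keep only the first max_len bases of a read."""
    return read[:max_len]


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def align_unique(read, genome, max_mismatches=2, max_len=35):
    """Return the single alignment start, or None if there are 0 or 2+ hits."""
    read = truncate(read, max_len)
    hits = [
        pos
        for pos in range(len(genome) - len(read) + 1)
        if mismatches(read, genome[pos:pos + len(read)]) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def overlaps_gene(pos, read_len, gene_start, gene_end):
    """Assign a read to a gene if its middle base falls in [gene_start, gene_end)."""
    middle = pos + read_len // 2
    return gene_start <= middle < gene_end


# Toy genome: a distinctive 11-base segment flanked by homopolymer runs.
SEGMENT = "CGTACGTTAGC"
TOY_GENOME = "A" * 20 + SEGMENT + "T" * 20
position = align_unique(SEGMENT, TOY_GENOME)
```

A read that occurs twice in the reference (for example in `SEGMENT + SEGMENT`) returns `None`, mirroring the rule that reads with multiple alignments are discarded.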



TopHat Pipeline: Splice Junctions

• Reads are mapped against a reference genome, and those reads that do not map are set aside.

• An initial consensus of mapped regions is computed by Maq.

• Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions.

• The initially unmapped (IUM) reads are indexed and aligned to these splice junction sequences.


Dataset after Aligning Reads

Gene    Treatment 1                         Treatment 2
1       14         18         10            47         13         24
2       10         3          15            1          11         5
3       1          0          10            80         21         34
4       0          0          0             0          2          0
5       4          3          3             5          33         29
...
53256   47         29         11            71         278        339
Total   22910173   30701031   18897029      20546299   28491272   27082148

Differential Expression (DE) Analysis

• We would like to test whether the proportion of reads aligning to gene 1 tends to be different for experimental units that received treatment 1 than for experimental units that received treatment 2.

Treatment 1               Treatment 2

14 out of 22910173        47 out of 20546299

18 out of 30701031   vs.  13 out of 28491272

10 out of 18897029        24 out of 27082148
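As a descriptive first look at the comparison above, the proportions for gene 1 can be put on a common scale (reads per million). The sketch below uses the counts and library totals from the slide; the function and variable names are illustrative assumptions, and this is only a summary step, not a test for differential expression.

```python
# (gene 1 count, total reads in library) for each experimental unit
TREATMENT_1 = [(14, 22910173), (18, 30701031), (10, 18897029)]
TREATMENT_2 = [(47, 20546299), (13, 28491272), (24, 27082148)]


def reads_per_million(count, library_size):
    """Proportion of reads aligning to the gene, scaled to one million reads."""
    return count / library_size * 1_000_000


def group_rpm(samples):
    """Reads per million for every sample in a treatment group."""
    return [reads_per_million(count, total) for count, total in samples]


rpm_1 = group_rpm(TREATMENT_1)
rpm_2 = group_rpm(TREATMENT_2)

for label, values in (("Treatment 1", rpm_1), ("Treatment 2", rpm_2)):
    print(label, [round(v, 3) for v in values])
```

Whether the difference between the two groups is larger than expected by chance is a question for a count model, which also accounts for the variability between replicates.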

Example 1: Need for Normalization

• Every gene in B is expressed in A at the same level. G = number of genes expressed in B

• A also contains a set of G genes that are expressed in A but not in B.

Thus, A has 2*G expressed genes and its total RNA production is twice that of sample B.

• Each sample is sequenced at approx. the same depth

• Without adjustment, a gene expressed in both A and B will have ½ as many reads in A as in B, since the reads in A are spread over twice as many genes.

• The normalization would adjust sample A by a factor of 2.
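The arithmetic of this example can be checked with a small Python sketch. It assumes, for simplicity, that every expressed gene has the same expression level and that reads are allocated exactly in proportion to expression (no sampling noise); the values of G and the sequencing depth are hypothetical.

```python
def expected_reads(expression, depth):
    """Split `depth` reads across genes in proportion to their expression."""
    total = sum(expression.values())
    return {gene: depth * level / total for gene, level in expression.items()}


G = 4         # hypothetical number of genes expressed in B
DEPTH = 8000  # hypothetical sequencing depth, the same for both samples

sample_b = {f"shared_{k}": 1.0 for k in range(G)}
sample_a = dict(sample_b)                                 # same genes, same level
sample_a.update({f"a_only_{k}": 1.0 for k in range(G)})  # plus G genes only in A

reads_a = expected_reads(sample_a, DEPTH)
reads_b = expected_reads(sample_b, DEPTH)

# A shared gene gets half as many reads in A, although its expression is equal.
ratio = reads_a["shared_0"] / reads_b["shared_0"]
normalization_factor = 1 / ratio
```

Scaling by total read count alone would not remove this effect, because both libraries have the same depth; the correction has to reflect the difference in RNA composition.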



Example 2: Need for Normalization

• Suppose you multiplex 4 samples to lane 1 and 2 samples to lane 2 of a flowcell.

The 4 samples in lane 1 will each have fewer “reads” (counts) than the 2 samples in lane 2.

• Need to account for the total number of reads per sample (library size)

Measure for expression: FPKM and RPKM

• Longer transcripts yield more fragments/reads

• FPKM/RPKM measure “average pair coverage” per transcript

• FPKM: Fragments Per Kilobase per Million

• RPKM: Reads Per Kilobase per Million

• Paired-end RNA-Seq experiments produce two reads per fragment.

• It may not be possible to map both reads of a pair.

• If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value.

• Thus, FPKM is calculated by counting fragments, not reads.

• For a single-end experiment, FPKM = RPKM

RPKM

• RPKM: Reads Per Kilobase per Million mapped reads

• RPKM = C / (L x N)

C: Number of mappable reads on a feature, such as an exon or transcript

L: Length of feature (in kb)

N: Total number of mappable reads (in millions)
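A minimal Python sketch of the RPKM formula follows. The function name, the choice to accept feature length in bp and library size in raw reads (converted to kb and millions inside), and the example numbers are assumptions made for illustration.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """RPKM = C / (L x N), with L in kilobases and N in millions of reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    length_kb = length_bp / 1e3
    total_millions = total_mapped_reads / 1e6
    return count / (length_kb * total_millions)


# Hypothetical example: 1000 reads on a 2 kb transcript in a 10 million read library.
example = rpkm(1000, 2000, 10_000_000)
```

For paired-end data the same function gives FPKM if `count` and `total_mapped_reads` are fragment counts rather than read counts.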

Distributions used for Modeling Count Data

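• $Y_{ijg}$ = count of reads for subject $i$ on treatment $j$ for gene $g$

$i = 1, \ldots, N$ and $g = 1, \ldots, G$

• $Y_{ijg} \sim \text{Poisson}(\lambda_{ijg})$, $\lambda_{ijg} > 0$

$P(Y_{ijg} = y) = \dfrac{\lambda_{ijg}^{y} \, e^{-\lambda_{ijg}}}{y!}$

Observed in RNA-seq data that $\text{Var}(Y_{ijg}) > E(Y_{ijg})$

“Over-dispersed” Poisson (a.k.a. Negative Binomial or Poisson-Gamma Distribution)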

Empirical Assessment of Over-Dispersion of NGS Counts

https://sites.google.com/site/davismcc/useful-documents

Poisson distribution: mean = variance

Distributions used for Modeling Count Data

• Technical Variation follows a Poisson Distribution

• Biological Variation follows a Negative Binomial Distribution

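• Let $Y_{ijg} \sim \text{NB}(M_i \, p_{gj}, \phi_g)$, where $Y_{ijg}$ is the read count for subject $i$ on treatment $j$ for gene $g$

$M_i$ = library size for sample $i$ (i.e., total number of reads)

$\phi_g$ = dispersion parameter for gene $g$

$\phi_g$ is the squared coefficient of variation of the biological variation between the samples (able to separate biological from technical variation)

$\phi_g = 0$ reduces to Poisson($\mu_{ijg}$)

$E(Y_{ijg}) = \mu_{ijg} = M_i \, p_{gj}$ and $\text{Var}(Y_{ijg}) = \mu_{ijg}(1 + \mu_{ijg}\phi_g)$

DE is based on the parameter $p_{gj}$, the relative abundance of gene $g$ in treatment group $j$

Model used in edgeR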
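The mean-variance relationship of this model can be checked by simulation. The sketch below is not edgeR code: it draws negative binomial counts as a Gamma-Poisson mixture using only the Python standard library, with a hypothetical mean of 20 and dispersion of 0.2, and compares the sample variance with mean x (1 + mean x dispersion).

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_nb(mu, phi, rng):
    """Gamma-Poisson mixture: mean mu, variance mu * (1 + mu * phi)."""
    if phi == 0:
        return sample_poisson(mu, rng)  # no over-dispersion: plain Poisson
    lam = rng.gammavariate(1.0 / phi, mu * phi)  # shape 1/phi, scale mu*phi
    return sample_poisson(lam, rng)


def simulate(mu, phi, n=20000, seed=1):
    """Sample mean and variance of n simulated counts."""
    rng = random.Random(seed)
    draws = [sample_nb(mu, phi, rng) for _ in range(n)]
    return mean(draws), variance(draws)


nb_mean, nb_var = simulate(mu=20, phi=0.2)          # theory: variance 20 * (1 + 4) = 100
pois_mean, pois_var = simulate(mu=20, phi=0.0)      # theory: variance 20
```

With the dispersion set to zero the variance stays close to the mean (Poisson, technical variation only); with a positive dispersion the variance grows roughly quadratically in the mean.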



edgeR Package and Methods

• edgeR (Robinson, McCarthy, Smyth; 2010):

– Models count data using an over-dispersed Poisson model

– Estimates the gene-wise dispersions by conditional maximum likelihood, conditioning on the total count for that gene (Smyth and Verbyla, 1996)

– An Empirical Bayes (EB) procedure is used to shrink the dispersions towards a consensus value (i.e., borrowing information) (Robinson and Smyth, 2007)

– DE is assessed using an exact test analogous to Fisher’s exact test, but adapted for over-dispersed data (Robinson and Smyth, 2008)

– Similar in idea to limma (Smyth, 2004)

Other Methods/Programs for DE Analysis

• DESeq, BaySeq, Cuffdiff, SAMseq and many others

• Each has pros and cons with different assumptions.

• No single method will be optimal under all circumstances, and hence the method of choice in a particular situation depends on the experimental conditions.

• Suggest running more than one method to look for sensitivity of results.


27

Following DE analysis

• Visualization of results

– Volcano plot (FC vs p-value)

• Multiple Testing Correction

– FDR

– q-values

• Pathway Analysis

– IPA

– Network Analysis

• Validation Studies

– Technical validation of sequence data

– Confirm/replicate association results
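For the multiple testing step, one common way to control the FDR is the Benjamini-Hochberg adjustment. The slides do not say which procedure was used, so the sketch below is an assumption, and the p-values in the example are made up for illustration. Storey's q-values are a related but different quantity and are not computed here.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_p)
```

Genes whose adjusted p-value falls below the chosen FDR level (for example 0.05) would be reported as differentially expressed.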



Questions?
