gene expression analysis of rna-seq...

Gene expression analysis of RNA-Seq data

Dena Leshkowitz,

Introduction to Deep-Sequencing Data Analysis 2018

Bioinformatics Unit, LSCF, WIS

RNA-Seq Potential

RNA-Seq: a revolutionary tool for transcriptomics

In theory RNA-Seq can be used to build a complete map of the transcriptome across all cell types, perturbations and states (Trapnell C. et al, Nature methods 6 469-477(2011))

RNA-Seq Applications

RNA-Seq: a revolutionary tool for transcriptomics

Discover novel transcripts

Determine transcript structure, measure

transcripts expression and detect differentially

expressed transcripts/isoforms between

conditions, treatments…

Measure gene expression and detect

differentially expressed genes between

conditions, treatments… based on known gene

structures (model organisms)

Main Topics

Introduction

Experimental design issues

Analysing Gene expression from RNA-Seq data

RNA-Seq pipeline: STAR –DESeq2

RNA-Seq Workflow

Condition 1(normal colon)

Condition 2(colon tumor)

Isolate RNAs

Sequence ends

Million/billions of fragments are sequenced

Converted to amplified small cDNA fragments with

linkers at the endsSamples of interest

Assemble or map to genome or to a

transcriptome

Quantify and detect differential expression

Modified - www.med.nyu.edu/rcr/rcr/course/RNA-seq.pptx

High Throughput Genomics

DNA Microarrays

IlluminaHiSeq2500

NextSeq500

Microarrays vs RNA-Seq

Both high throughput methods can profile the genes with similar performance

Microarrays suffer from compression (saturation) at the high end

Low expression is problematic in both platforms

Malone et al. BMC Biol. 2011; 9: 34.

RNA-Seq (log2)

Microarray & RNA-SeqPros and Cons

Microarrays RNA-SeqGene model organism/Transcript and de novo assembly

Cost $-$$ $/$$$

Biases Decade of research and solutions

Understanding is evolving

Data sizes Mb -images Gb- sequence data

Dynamic range 102 105

Transcript discovery , isoform identification

& Transcript-chimeras

No No/Yes

Genome required Yes Yes/No

Allele specific expression

No No/Yes

Main Topics

Introduction

mRNA in the RNA “World”

Most abundant RNA is rRNA – 98%

Illumina standard protocol enriches for mRNA by: oligo(dT)-based affinity

matrices

Sequence: rRNA capture beads (Ribo-Zero)

Size : hybridization-based rRNA depletion (Duplex-‐specific Nuclease (DSN))

Experiment DesignSequencing Options

Illumina NextSeq/HiSeq Sequencing options:

Length of sequence (up to 300 bases)

Paired-end (PE) or single-end (SE)

Both PE and longer length sequencing increase the sensitivity and specificity of the detection of the alternative splicing and novel transcripts

PE 100

SE 100

Experimental Design Mammalian tissue

Liu Y. et al., 2014; ENCODE 2011 RNA-Seq

Differential gene expression profiling

10-25M 50 base single-end

Alternative splicing

50-100M100 base paired-end

Allele specific expression

50-100M100 base paired-end

De novo assembly >100M100 base paired-end

(5M Bulk Mars-seq)

ENCODE consortium’s Standards, Guidelines and Best Practices for RNA-Seq

Experimental Design

Assessing biological variation requires biological replicates - (2X2) are a minimum, yet more are recommended

Avoid batch effects - technical sources of variation that have been added to samples during processing

Consult with the person which will analyse the data before performing the experiment – Kick-off meeting

Main Topics

Introduction

Abigail Yu Nature Methods 10, 1165–1166 (2013)

RNA-Seq is a straightforward process: you isolate RNA, sequence it with a high-throughput sequencer, and put it all back together. What is the problem?

What to do with all this data??

HELP !!!!

I just got sequence data…

RNA-Seq Workflow

Raw ReadsPreprocessing & Quality

control (cutadapt & FASTQC)

Example of QC report

Pre-processing Need to assess that the sequence data has

sufficient quality

Recommendation is to use the high quality sequence data (more important in de novo assembly): Check the amount of read duplication (indication

of too much PCR amplification)

Trim sequences if: the end is of low quality, contain adapter or polyA-T

Filter low quality reads

Avoid using samples with insufficient number of reads

RNA-Seq Workflow

control (cutadapt & FASTQC)

Map the reads of each sample to the genome

(STAR)

Mapping Short RNA-Seq Reads

Do I align the reads to the genome or to the transcriptome?

Novel discovery

Mapping to GenomeHow to align reads that span

exons?

http://en.wikipedia.org/wiki/RNA-Seq

The Challenge –Identifying NovelJunctions

Reads are short and contain errors

Rarely transcribed genes have few reads spanning the junctions

We are interested in discovering novel junctions i.e. we are not only relying on annotation of known genes

Need to perform the task in a timely manner

Tophat: Exon first two step approach

Mapping to the genome is done with Bowtie

Extracting unmapped reads (not including low complexity)

Copyright restrictions may apply.

Trapnell, C. et al. Bioinformatics 2009 25:1105-1111;

doi:10.1093/bioinformatics/btp120

The junctions can be improved and extended to contain donor and accepter sites

Copyright restrictions may apply.

Trapnell, C. et al. Bioinformatics 2009 25:1105-

1111; doi:10.1093/bioinformatics/btp120

Accepted_hits.bam

Insertion.bed

Deletion.bedJunctions.bed

Tophat

FastqGenome indexed

Tophat Outputs

Transcript discovery/Transcript or Gene Quantification/View in a Genome Browser

Junction.bed

TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.

Newer Aligners – Improving speed

Steven Salzberg - Transcriptome Assembly Computational Challenges of Next Generation Sequence Datahttps://www.youtube.com/watch?v=2qGiw4MRK3c

It first tries to find candidate locations across the target genome starting from the left side of the read

It identifies these locations by first mapping part of each read using the global FM index, which in most cases identifies one or a small number of candidates

HISAT then selects one of the ~48,000 local indexes for each candidate and uses it to align the remainder of the

STAR-Spliced Transcripts Alignment to a Reference

STAR seed finding phase is the sequential search for a Maximal Mappable Prefix (MMP)

In the first step, the algorithm finds the MMP starting from the first base of the read

Next, the MMP search is repeated for the unmapped portion of the read

Speed of search is achieved since the suffix array is not compressed and therefore requires increased memory usage (30-50Gb)

Last stage is Clustering, stitching and scoring

Examples of Input and Output

Mapped Reads – bam format

Sequences – fastq

Mapping to genome

Mapping Output - Alignment file SAM or binary BAM file

Visualization of Tophat outputs in a Genome Browser (IGV)

read mappedto exon

read mappedto junction

RNA-Seq Workflow

control (cutadaptFASTQC)

Map the reads of each sample to the genome

(STAR)

Quantify gene counts

per sample

(STAR)

Basic Quantification Step

Count the number of reads aligned to each gene (do not need to determine to from which transcript the read was derived)

Trapnell et al. Nat Biotechnol. 2010 May;28(5):511-5.

HTSeq or STAR

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

A gene is quantified by counting the number of fragments/reads which align to all its exons

Discard a read if it is non-uniquely mapped to a gene

STAR/HTSeq ResultA count matrix

sample_1 sample_2 sample_3 sample_4

gene_1 15 9 11 18

gene_2 19 21 21 40

gene_3 106 114 153 207

gene_4 569 565 756 992

gene_5 1029 1260 1559 1968

gene_6 5049 5897 7537 10029

SUM 10M 50M 30M 20M

Need to account for the differences in sequence amount between the samples

RNA-Seq Workflow

Raw Reads

Preprocessing & Quality control

(cutadaptFASTQC)

Map the reads of each sample to the

genome (STAR)

Quantify gene

counts per sample

(STAR)

Detect differentially

expressed genes

(DESeq2)

DESeq2 Normalizationmedian-of-ratios method

Create a “virtual reference sample” by taking, for each gene, the geometric mean of counts over all samples

Normalize each sample to this reference, to get one scaling factor (“size factor”) per sample.

DESeq2 Normalization Need to normalize the amount of sequence data between the samples1. Geometric mean is calculated for each gene across

all samples.

2. The counts for a gene in each sample is then divided by this mean.

3. The median of these ratios in a sample is the size factor for that sample.

This procedure corrects for library size and RNA composition bias, which can arise for example when only a small number of genes are very highly expressed in one experiment condition but not in the other.

Example-DESeq Normalization

sample_1 sample_2 sample_3 sample_4 geometric mean ratio sample_1

gene_1 15 9 11 18 12.79 1.17

gene_2 19 21 21 40 24.06 0.79

gene_3 106 114 153 207 139.87 0.76

gene_4 569 565 756 992 700.73 0.81

gene_5 1029 1260 1559 1968 1412.26 0.73

gene_6 5049 5897 7537 10029 6887.68 0.73

Median 0.77

The scaling factor for sample 1 is 0.77

Determining Differentially Expressed Genes

http://www.molgen.mpg.de/1242892/rnaseq.pdf

The advantage of having many replicates allows to learn about the biological variation within the conditions tested.Aim: finding genes which have a significant difference between the groups which is larger than the “noise” – variation within the groups

Discover genes showing different average expression levels across two groups

RNA-Seq NoiseSuppose we sequence the same library twice to the same depth. Will we get the same gene counts?

Poisson noise – the variance in counts that persists even if everything is exactly the same – in our case we ran the same library

Poisson distribution is sometimes called the law of small numbers because it is the probability distribution of the number of occurrences of an event that happens rarely but has very many opportunities to happen.

Poisson Distribution

Assuming the gene counts in a RNA-Seq experiment follow a Poisson distribution we would expect that the average gene count and the variance of the counts are equal.

var = µ

https://youtu.be/fxtB8c3u6l8

https://youtu.be/HK7WKsL3c2w

Biological Variation

When we sequence biological replicate samples the concentration of a given gene will vary around a mean value with a certain standard deviation

This standard deviation cannot be calculated, it has to be estimated from the data

var = µ + c µ

Poisson noise Biological noise

Negative Binomial

Expected by Poisson

Negative binomial distribution

In RNA-Seq analysis the negative binomial distribution is used as an alternative to the Poisson since it takes into account variance that exceeds the gene mean

The count data is used to estimate the varianceOrange line: the fitted observed curve for the variance

Detecting Differentially Expressed Genes

DESeq2 tests for differential expression by the use of negative binomial generalized linear models

The output consists of: Log fold change (treatment/control)

p-value - indicates the probability that the observed difference between treatment and control will be observed even though the there is no true treatment effect

Adjusted p value – multiple test correction In the RNA-Seq study we simultaneously tested all

RNA-Seq Workflow

Raw Reads

Preprocessing & Quality control

(cutadaptFASTQC)

Map the reads of each sample to the

genome (STAR)

Quantify gene

counts per sample

(STAR)

Detect differentially

expressed genes

(DESeq2)

Main Topics

Introduction

Analysing RNA-Seq data

In exercises 2 & 5 we will use run this pipeline

Exercise 2 In this current study we analyzed the dynamics of gene expression in A. thaliana meristem during the transition to flowering using RNA sequencing (RNA-seq).

Submitting jobs to the WEXAC cluster

Examplesbsub -q bio-guest -R "rusage[mem=10000] span[hosts=1]" -J

sam1660397 -e ./samtools1660397.err -o ./samtools1660397.out "samtools index

SRR1660397Aligned.sortedByCoord.out.bam"

bsub -q bio-guest -R "rusage[mem=2000] span[hosts=1]" -o ~/RNA-Seq-Arabodopsis-map/star_job_SRR1660397.txt -e ~/RNA-Seq-Arabodopsis-map/star_error_SRR1660397.txt -J st1 -n 10 < ~/course_2018/Arabidopsis_input/commands/star_command_SRR1660397.txt

STAR --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --twopassMode Basic --runThreadN 10 --sjdbGTFfile/shareDB/iGenomes/Arabidopsis_thaliana/NCBI/TAIR10/Annotation/Genes/genes.gtf --quantMode GeneCounts --outFileNamePrefix SRR1660397 --genomeDir/shareDB/iGenomes/Arabidopsis_thaliana/NCBI/TAIR10/Sequence/STAR_index/ --readFilesIn~/course_2018/Arabidopsis_input/data/SRR1660397_day8/SRR1660397_day8_R1.fastq> log_SRR1660397.txt 2>&1

References

1. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc. 2013;8(9):1765-86. doi: 10.1038/nprot.2013.099. PubMed PMID: 23975260. (DESeq2)

2. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105-11. doi: 10.1093/bioinformatics/btp120. PubMed PMID: 19289445; PubMed Central PMCID: PMCPMC2672628.

3. Anders S, Pyl PT, Huber W. HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166-9. doi: 10.1093/bioinformatics/btu638. PubMed PMID: 25260700.

4. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi:10.1093/bioinformatics/bts635.

THE END

Thanks

Questions??

Experiment DesignStrand specific protocol

Nature Methods 7, 709–715 (2010)

Why is strand information important?

How is the stranded library made?

Expression ValuesFragments (Reads) Per Kilobase of exon per Million mapped fragments

Nat Methods. 2008, Mapping and quantifying mammalian transcriptomesby RNA-Seq. Mortazavi A et al.

C= the number of fragments mapped onto the gene's exons

N= total number of (mapped) fragments in the experiment

L= the length of the transcript (sum of exons)

CiFPKMi 36 1010

gene expression analysis of rna-seq...

Documents

sequencing technologies - technical university of...

introduction to second- generation sequencing · second...

capabilities of next generation sequencing...

‘sanger sequencing’ has been the only dna sequencing...

whole genome sequencing: transforming …...whole genome...

targeted sequencing using a long-read sequencing...

nextseq 550 system guide (15069765)€¦ · revisionhistory...

nextseq sequencing workflow overview · nextseq 500/550...

genomic sequencing and its data...

sequencing lesson instructions - volusia county...

genomics background for hazelnut pathogenomics course ·...

introduction to the nextseq system · introduction to the...

genomics - home | umass amherstequipment illumina nextseq...

next generation sequencing - lab genetix · next generation...

intro: high-throughput sequencing - i-med.ac.at · third...

overviewofindexedsequencing onthenextseq,miseq,andhiseq...

sequencing-related topics 1. chain-termination sequencing 2....

validation of ngs sequencing by sanger sequencing

exome sequencing or trio analysis - dna sequencing & … ·...

nextseq 500 system guide (15046563 v02)