RNA-seq from a bioinformatics perspective · 2017-11-07

RNA-seq from a bioinformatics perspective

Harmen van de Werken Erasmus MC;

Cancer Computational Biology Center (CCBC)

Outlook

RNA-seq software + RNA-seq courses

Alternative splicing & Promoters

Introduction RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

Table 6-1 Molecular Biology of the Cell (© Garland Science 2008)

Which type of RNA is most abundant? How many different genes do we have per type?

THE HUMAN GENOME

▪ Consensus
  ~ 22,500 protein-coding genes
  ~ 9,000 long non-coding RNAs
  ~ 2,500 – 3,000 small RNAs

▪ miTranscriptome¹
  ~ 91,013 genes
  ~ 58,648 lncRNA genes

¹ Iyer MK et al. Nature Genetics 47, 199–208 (2015)

Transcriptomics of (Cancer) Tissue

Common mRNA-seq Workflow

Dry lab Bioinformatics

Wetlab

(c)DNA Next Generation Sequencing (NGS)

ThermoFisher Ion Torrent Personal Genome Machine (PGM)

PacBio RS II

Illumina HiSeq 2000

Illumina HiSeq 2000

Illumina Sequencing

Ion Torrent Platform

RNAseq Data Analysis

Alternative splicing & Promoters RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

Detecting Single Nucleotide Variants and small indels

Errors occur at each stage

Primary Analysis
- Incorrect base calling
- Homopolymer errors
- Phasing

Errors occur at each stage

Secondary Analysis: Read mapping
- Incorrect reference sequence
- Pseudogenes
- Indels
- Complex variants

Errors occur at each stage

Secondary Analysis: Variant calling

Variant-calling filters are heuristics; therefore, they will generate false negatives and false positives and are best applied as soft filters.
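
To illustrate what a soft filter means in practice, here is a minimal Python sketch. It is not taken from any particular variant caller: the `Variant` class, the filter names, and the thresholds (`min_depth`, `min_alt_fraction`) are hypothetical and chosen only for illustration. The point is that a failing call is tagged rather than discarded, so it can still be reviewed later.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical, simplified variant call (not a real VCF record)."""
    name: str
    depth: int            # number of reads covering the site
    alt_fraction: float   # fraction of reads supporting the variant
    filters: list = field(default_factory=list)


def apply_soft_filters(variants, min_depth=10, min_alt_fraction=0.05):
    """Tag variants that fail a heuristic instead of removing them."""
    for v in variants:
        if v.depth < min_depth:
            v.filters.append("LowDepth")
        if v.alt_fraction < min_alt_fraction:
            v.filters.append("LowAltFraction")
        if not v.filters:
            v.filters.append("PASS")
    return variants


# Illustrative values only (assumed, not from the slides).
calls = apply_soft_filters([
    Variant("site_A", depth=120, alt_fraction=0.40),
    Variant("site_B", depth=6, alt_fraction=0.50),
    Variant("site_C", depth=200, alt_fraction=0.02),
])
for v in calls:
    print(v.name, ";".join(v.filters))
```

All three calls remain in the output; only their filter tags differ, which leaves room to rescue a false negative by inspecting the tagged calls.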

Errors occur at each stage

Tertiary Analysis
- Incorrect gene annotation
- Contamination in reference databases

False Negative: c.2237_2259del,insCCAACAAGGAA EGFR

False Negative BRAF p.V600R

RNAseq Data Analysis

Alternative splicing & Promoters RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

Differential Gene Expression mRNA-seq Workflow

Fig1. FastQC report on per-position base quality and overrepresented sequences

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Fig2. FASTQ format of one read
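
The four lines of the FASTQ record can be read programmatically. The Python sketch below parses the example read and converts the quality string into Phred scores. It assumes the Phred+33 (Sanger) encoding, which the slide does not state; the function names are made up for this illustration.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read id, sequence, quality string)."""
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode quality characters; offset=33 is an assumption (Phred+33)."""
    return [ord(char) - offset for char in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


record = [
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345",
    "I" * 30 + "9IG9IC",  # the quality line of the read, written compactly
]

read_id, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
print(read_id)
print("read length:", len(seq))
print("lowest base quality:", min(scores))
print("error probability at Q40:", error_probability(40))
```

Under this encoding `I` decodes to Q40 and `9` to Q24, so the dip near the 3' end of the read is the kind of pattern a FastQC per-position quality plot summarizes across all reads.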

mRNA-seq alignment

Courtesy: Wikipedia

mRNA derived cDNA fragments alignment to transcriptome

Alignment to transcriptome

Alignment to reference genome

RNA-Seq - Alignment

Alignment algorithms need:
• Reference sequence
• Transcriptome database (optional)

Algorithms commonly used for RNA-Seq alignment:
• TopHat
• STAR
• HISAT2

Visualization of NGS Transcriptomics and Genomics data

RNA-Seq - Alignment/QC

RNA-Seq - Stranded

Differential expression

Rakesh Kaundal et al.

Normalization of RNA-seq

Total count (TC): Gene counts are divided by the total number of mapped reads.

Upper Quartile (UQ): Very similar in principle to TC; the total counts are replaced by the upper quartile of counts.

Median (Med): Also similar to TC; the total counts are replaced by the median counts.

Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010).

Quantile (Q): First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.

Reads Per Kilobase per Million mapped reads (RPKM): This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.
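
Three of these methods (TC, UQ and RPKM) are simple enough to sketch in a few lines of plain Python. This is an illustration under assumptions, not the implementation of any package: the count matrix and gene lengths are invented, the upper quartile is taken over non-zero counts (a common convention that the slide does not specify), and TMM and quantile normalization are left out.

```python
from statistics import quantiles


def column_totals(counts):
    """Total mapped reads per sample; counts[gene][sample]."""
    return [sum(column) for column in zip(*counts)]


def total_count(counts):
    """TC: divide each count by the total number of reads in its sample."""
    totals = column_totals(counts)
    return [[c / t for c, t in zip(row, totals)] for row in counts]


def upper_quartile(counts):
    """UQ: divide by the 75th percentile of the sample's non-zero counts (assumption)."""
    uq = [
        quantiles([c for c in column if c > 0], n=4, method="inclusive")[2]
        for column in zip(*counts)
    ]
    return [[c / q for c, q in zip(row, uq)] for row in counts]


def rpkm(counts, gene_lengths_bp):
    """RPKM: reads per kilobase of transcript per million mapped reads."""
    totals = column_totals(counts)
    return [
        [c * 1e9 / (length * t) for c, t in zip(row, totals)]
        for row, length in zip(counts, gene_lengths_bp)
    ]


# Invented example: 3 genes x 2 samples; sample 2 was sequenced twice as deep.
counts = [[10, 20], [30, 60], [60, 120]]
gene_lengths_bp = [1000, 2000, 3000]

tc = total_count(counts)
uq = upper_quartile(counts)
rp = rpkm(counts, gene_lengths_bp)
print(tc[0], uq[0], rp[0])
```

In this toy matrix the second sample is exactly the first at double depth, so every method returns identical values for both samples, which is the behaviour a depth normalization is meant to provide.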

Reduce Dimensions PCA /QC

Principal Component Analysis (PCA) of a multivariate Gaussian distribution. PCA is a linear algorithm: it cannot capture complex non-linear (for example polynomial) relationships between features.
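
To make the "linear" part concrete, the sketch below computes only the first principal component with power iteration in plain Python. It is a teaching example under assumptions: the data are invented, rows stand for samples and columns for features (for example log-transformed expression of two genes), and a real analysis would use an established library routine instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    Uses power iteration on the sample covariance matrix. Assumes the data
    have non-zero variance and that the start vector is not orthogonal to
    the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each score is a linear combination of the centered features.
    scores = [sum(c * x for c, x in zip(r, v)) for r in centered]
    return v, scores


# Invented data: the second feature is exactly twice the first.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(samples)
print(direction)
print(scores)
```

The component is a weighted sum of the input features, so a straight-line relationship such as the one above is captured fully, whereas a curved relationship would be spread over several components.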

Reduce Dimensions t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear algorithm.

Clustering Analysis/ QC

Fig 1: Example of hierarchical clustering: clusters are successively merged with the nearest cluster. The length of the vertical dendrogram lines reflects the nearness. (Jansen et al.)
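
The merging procedure described in the caption can be sketched as follows. The linkage rule is not given on the slide; single linkage (distance between the two closest members) and Euclidean distance are assumptions made here, and the points are invented.

```python
import math


def hierarchical_clustering(points):
    """Agglomerative clustering with single linkage (assumed).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples;
    the distance corresponds to the height of the merge in a dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Invented samples: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = hierarchical_clustering(points)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

For quality control, a sample that joins the tree only at the last and largest merge distance, like the fifth point here, is a candidate outlier worth inspecting.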

Clustering Analysis/ QC

Differential expression DESeq2

A common difficulty in the analysis of read count data is the strong variance of log fold change (LFC) estimates for genes with low read counts.
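
A small numerical example shows why low counts are a problem. This is not DESeq2's procedure (DESeq2 addresses the issue with shrinkage estimation); it only computes a naive LFC with a pseudocount, which is an assumption added here to avoid division by zero, on invented counts.

```python
import math


def log2_fold_change(count_a, count_b, pseudocount=1.0):
    """Naive log2 fold change of condition B over condition A."""
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))


# The same absolute difference of 3 reads, at low and at high expression.
low = log2_fold_change(2, 5)
high = log2_fold_change(1000, 1003)
print("low-count gene LFC:", low)
print("high-count gene LFC:", round(high, 4))
```

A difference of three reads, easily produced by sampling noise, doubles the apparent expression of the low-count gene but barely moves the estimate for the high-count gene.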

Differential expression of genes

Test for differential gene expression, with correction for multiple testing
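
The slide does not name a correction method. One widely used choice is the Benjamini-Hochberg procedure, which controls the false discovery rate; the sketch below assumes that method and uses invented p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.20]
padj = benjamini_hochberg(pvalues)
for p, q in zip(pvalues, padj):
    print(f"p = {p:.2f}  adjusted = {q:.4f}")
```

With thousands of genes tested at once, filtering on the adjusted value rather than the raw p-value keeps the expected share of false positives among the reported genes under the chosen threshold.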

Gene Set Enrichment Analysis: GO and KEGG database

Gene Set Enrichment Analysis: GO and KEGG database

RNAseq Data Analysis

Alternative splicing & Promoters RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

Fusion Gene Detection

Fig1. RNA-seq mapping of short reads across exon-exon junctions; a junction can be classified as a trans or a cis event. (Wikipedia)

Fusion Gene Detection

Fusion Gene Detection

FusionCatcher Tool

FusionCatcher outperforms other tools by using multiple aligners:
❖ Bowtie
❖ Bowtie2
❖ BLAT
❖ STAR

RNAseq Data Analysis

Alternative splicing & Promoters RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts

RNA-seq de novo Assembly

❖ Define the whole transcriptome without a reference.

❖ Trinity

RNA-seq Analysis Software

RNA-seq Molmed Courses

❖ Basic Course on 'R'
❖ Galaxy for NGS
❖ Workshop Ingenuity Pathway Analysis (IPA) + CLC Workbench / Ingenuity Variant Analysis
❖ Gene expression data analysis using R: How to make sense out of your RNA-Seq/microarray data

Take Home Message

Think before you start

Thank you for your attention

Hematology: Mathijs A. Sanders, Remco Hoogenboezem

CCBC: Job van Riet, Wesley van de Geer

https://ccbc.erasmusmc.nl

@ErasmusMC_CCBC

Harmen van de Werken

Cancer Computational Biology Center (CCBC)