
RNA-Seq Module 2

From QC to differential gene expression.

Ying Zhang Ph.D, Informatics Analyst

Research Informatics Support System (RISS) MSI

Apr. 24, 2012

RNA-Seq Tutorials

•  Tutorial 1: Introductory (Mar. 28 & Apr. 19)
–  RNA-Seq experiment design and analysis
–  Instruction on individual software will be provided in other tutorials

•  Tutorial 2: Introductory (Apr. 3 & Apr. 24)
–  Analysis of RNA-Seq data using TopHat and Cufflinks

•  Tutorial 3: Intermediate (May 23)
–  Advanced RNA-Seq analysis topics and troubleshooting

Tutorial Outline

Review
–  Key definitions and concepts

Pre-processing of RNA-seq data
–  QC and data cleaning

Applications of RNA-seq
–  Identification of differential gene expression
•  Using TopHat, Cufflinks and Cuffdiff
–  Definition of the transcriptome
•  Transcriptome assembly with / without reference genome
–  Comparison of transcriptomes
•  Identification of novel transcripts


RNA-Seq

Definitions & Concepts

Key definitions (I)

•  SE: single end sequencing
•  PE: paired end sequencing
•  Mate-pair sequencing

[Figure: library preparation for SE, PE and mate-pair sequencing. Labels for SE/PE sequencing: sample, mRNA isolation, fragmentation (size: ~200 bp), library preparation, sequence fragment end(s). Labels for mate-pair sequencing: fragmentation (size: ~2000 bp), circularization, fragmentation, sequence fragment end(s).]

Key definitions (II)

•  Fragment size selection
-  Only fragments with a size around 200 bp will be sequenced, in order to reduce sequencing bias.

•  Sequencing depth is the average read coverage of the target sequences
-  Sequencing depth = total number of reads X read length / estimated target sequence length
-  Example: for a 5 Mb transcriptome, if 1 million 50 bp reads are produced, the depth is 1 M X 50 bp / 5 Mb = 10 X

Library type and sequencing depth by application (ENCODE RNA-Seq guidelines):

Application                              Library type    Sequencing depth
De novo assembly of transcriptome        PE, Mated PE    Extensive (> 50 X)
Refine gene model                        PE, SE          Extensive
Differential gene expression             PE              Moderate (10 X ~ 30 X)
Identification of structural variants    PE              Extensive
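The depth formula can be checked with a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative and do not come from any tool used in the tutorial.

```python
def sequencing_depth(num_reads: int, read_length_bp: int, target_length_bp: int) -> float:
    """Average coverage = total sequenced bases / estimated target length."""
    if target_length_bp <= 0:
        raise ValueError("target_length_bp must be positive")
    return num_reads * read_length_bp / target_length_bp


# The slide's example: 1 million 50 bp reads over a 5 Mb transcriptome.
example_depth = sequencing_depth(1_000_000, 50, 5_000_000)
print(f"{example_depth:.0f} X")
```

For differential gene expression, the result can be compared against the moderate range (10 X to 30 X) listed in the table.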

Key definitions (III)

Phred (quality) Score
•  Phred Score (Q) is the log transformation of the error rate (P) at each base calling position: Q = -10 log10(P)
•  Encoded using ASCII codes:
–  Sanger standard: ASCII 33-126 = Phred Score 0-93
•  Phred score 30 ~ 1 error per 1000 nucleotides
•  Phred score 20 ~ 1 error per 100 nucleotides
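The relationship between Phred score, error rate and the ASCII encoding can be sketched as follows. The helper names are hypothetical; the offset of 33 is the Sanger (Phred+33) convention described above.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list:
    """Turn a Sanger-encoded quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


print(phred_to_error_prob(30))   # about 1 error per 1000 bases
print(decode_quality("=1:?"))    # first four quality characters of the example read
```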


NGS File formats

File formats in NGS (I)

CASAVA software → fastq → Mapping → SAM/BAM → Assembly → GTF

http://genome.ucsc.edu/FAQ/FAQformat.html

File formats in NGS (II) - FASTQ and FASTQ_flt (MSI)

•  CASAVA:
–  Illumina software package for base calling

•  Fastq format:
–  Text format. Stores sequence and quality info
–  4 lines per sequence
–  CASAVA 1.8 example record:

    @HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC     (read ID / header)
    TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT                           (sequence)
    +
    =1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=                           (quality score, Phred+33)

–  In the header, HWI-M00262 is the machine ID. In "1:N:0:AGATC", 1 is the read pair #, N is the QC filter flag (Y = bad, N = good) and AGATC is the barcode.

•  FASTQ_flt data
–  Fastq files processed by MSI standard to remove reads with QC flag "Y"
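The four-line record structure and the QC filter flag can be illustrated with a short reader. This is a sketch only, not MSI's actual FASTQ_flt procedure; the function names are hypothetical and the second record is made up to show a read flagged "Y".

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, four lines at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header, sequence, quality)


def is_filtered(header: str) -> bool:
    """True if a CASAVA 1.8 header carries QC flag 'Y' (bad read)."""
    # Header layout: '@<read id> <read pair #>:<QC flag>:<control bits>:<barcode>'
    _, _, description = header.partition(" ")
    fields = description.split(":")
    return len(fields) >= 2 and fields[1] == "Y"


example = io.StringIO(
    "@HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC\n"
    "TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT\n"
    "+\n"
    "=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=\n"
    # Hypothetical second read with QC flag 'Y', for illustration only.
    "@HWI-M00262:4:000000000-A0ABC:1:1:18376:2028 1:Y:0:AGATC\n"
    "ACGT\n"
    "+\n"
    "!!!!\n"
)
kept = [record for record in read_fastq(example) if not is_filtered(record.header)]
print(len(kept), "read(s) kept")
```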

File formats in NGS (III)

•  SAM/BAM format:
–  Sequence alignment format
–  SAM: text format
–  BAM: binary version of SAM
–  Bitwise flag field: indicates whether a read is mapped or not, paired or not, etc.

•  SAM/BAM is the standard format for mapped reads and can be used by almost all NGS tools, e.g. assemblers, viewers, quantifiers.

CASAVA software → fastq → mapping → SAM/BAM
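The bitwise flag field packs several yes/no properties into one integer. A minimal decoder is sketched below; the bit values follow the SAM specification, while the short property names are labels chosen here for illustration.

```python
# Bit values from the SAM specification; the names are illustrative labels.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))  # a typical flag for the first read of a properly mapped pair
```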

File formats in NGS (IV)

•  GTF format
–  Gene Transfer Format
–  Widely used format for annotated genomes and transcriptomes
–  Downloadable from major browser sites, e.g. UCSC, Ensembl, NCBI
–  Illumina also provides a set of annotated genomes: iGenomes
•  Available through Galaxy and command line

CASAVA software → fastq → mapping → SAM/BAM → assembly → GTF
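A GTF line has nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes). The sketch below parses one such line, using the Xkr4 exon from this slide as input. The function name is hypothetical, and the attribute parsing is deliberately simple: it assumes no ";" appears inside a quoted value.

```python
def parse_gtf_line(line: str) -> dict:
    """Split one GTF line into its nine columns and parse the attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    parsed_attributes = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed_attributes[key] = value.strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": parsed_attributes,
    }


example_line = (
    "chr1\tunknown\texon\t3204563\t3207049\t.\t-\t.\t"
    'gene_id "Xkr4"; transcript_id "NM_001011874";'
)
record = parse_gtf_line(example_line)
print(record["attributes"]["gene_id"], record["end"] - record["start"] + 1, "bp")
```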

Seqname   Source    Feature   Start     End       Score   Strand   Frame   Attributes
chr1      unknown   exon      3204563   3207049   .       -        .       gene_id "Xkr4"; transcript_id "NM_001011874";

Steps in RNA-Seq Data Analysis

Step 1: Quality Control (FastQC); input: fastq
Step 2: Data prepping (Filter/Trimmer/Converter); output: fastqsanger
Step 3: Map Reads to Reference Genome/Transcriptome (TopHat); output: bam/sam
Step 4: Assemble Transcriptome (Cufflinks); output: gtf; fpkm
Identify Differentially Expressed Genes (Cuffdiff); output: fpkm; diff

Other applications:
–  De novo Assembly
–  Refine gene models

Step 1: Quality control of the input data

Step 2: Data prepping

Quality control of the raw reads

•  Goal: to determine the quality of the sequencing process

•  Recommended program: fastqc
–  Available both in Galaxy and on the Linux platform

•  Checklist of read quality:
–  File format
   •  Basic Statistics
–  Reliability of base calling
   •  Per base sequence quality
   •  Per sequence quality score
–  Contamination
   •  Per sequence GC content
   •  Overrepresented sequences

Data prepping (I)

Is NOT NEEDED, if:
•  In the right format
•  Good reads quality
–  Phred score per base & per sequence >= 20 (better if >= 30)
•  No contamination detected
•  Paired reads are synchronized
–  Bad mapping efficiency of PE reads is symptomatic of de-synchronization

BAD format: Wrong Fastq Format (CASAVA 1.8):

    @HWI-M00262:4:000-A0ABC:1:1:1836 1:N:0:AGATC
    TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
    +
    =1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

GOOD: Right Fastq Format (CASAVA 1.7):

    @HWI-M00262:4:000-A0ABC:1:1:1836/1
    TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
    +
    =1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

Data prepping – Needed (I)

Data format is incorrect:
•  Paired reads are not indicated as RNAME/1, RNAME/2
•  The quality score is not in Sanger/Illumina 1.9 format

Action: change the format, using edit attributes, FASTQ Groomer, or the header line converter.
Note: If you are using Galaxy to analyze your data, changing the file name WILL NOT change the file format.
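The header change from the CASAVA 1.8 style to the RNAME/1, RNAME/2 style amounts to a small string rewrite. The sketch below illustrates the idea only; it is not the Galaxy converter's code, and the function name is hypothetical.

```python
def casava18_to_17_header(header: str) -> str:
    """Rewrite '@<id> <pair #>:<flag>:<bits>:<barcode>' as '@<id>/<pair #>'."""
    read_id, separator, description = header.partition(" ")
    if not separator:
        return header  # no description part, so nothing to convert
    pair_number = description.split(":")[0]
    return f"{read_id}/{pair_number}"


old_style = casava18_to_17_header("@HWI-M00262:4:000-A0ABC:1:1:1836 1:N:0:AGATC")
print(old_style)
```

Note that this drops the QC flag and barcode from the header, so any filtering on the "Y" flag has to happen before the conversion.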

[Figure: distribution of Phred score in reads. Panels: Good; Bad (trimming needed).]

Data prepping – Needed (II)

Data contains bad "reads":
•  The quality score of reads / part of the reads is < 20

Action: remove the low quality reads, using fastq filter and fastq trimmer
•  fastq filter: removes entire reads
•  fastq column trimmer: uniformly removes the same nucleotide positions from all reads
•  fastq quality trimmer: removes nucleotide positions with low quality
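The difference between the three operations can be sketched in plain Python. These functions illustrate the ideas only and are not the Galaxy tools' actual algorithms: the mean-quality criterion and the 3'-end-only trimming are assumptions made for the example, and all names are hypothetical.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - offset for char in quality]


def passes_filter(quality: str, min_mean_q: float = 20.0) -> bool:
    """Filter: keep or drop the ENTIRE read based on its mean quality."""
    scores = phred_scores(quality)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


def column_trim(sequence: str, quality: str, left: int = 0, right: int = 0):
    """Column trimmer: remove the same number of positions from every read."""
    end = len(sequence) - right
    return sequence[left:end], quality[left:end]


def quality_trim_3prime(sequence: str, quality: str, min_q: int = 20):
    """Quality trimmer: drop low-quality positions from the 3' end of this read."""
    scores = phred_scores(quality)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_q:
        keep -= 1
    return sequence[:keep], quality[:keep]


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
print(quality_trim_3prime("ACGTAC", "IIII##"))
print(column_trim("ACGTAC", "IIII##", left=1, right=2))
```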


Example of bad data: sequence contamination

Data prepping – Needed (III)

Adapter sequences are detected.
Action: remove the adapter sequences, using CutAdapt

Data prepping – Needed (IV)

Data is out of synch:
•  The Forward and Reverse reads are not arranged in the same order.

Action: synchronize the files, using fastq interlacer and fastq de-interlacer
Note: Synchronization check and correction should be the last step in data prepping, because the previous prepping steps can cause de-synchronization of PE data.
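What synchronization means can be shown on small in-memory lists. This is a sketch under stated assumptions, not the fastq interlacer's implementation: reads are (header, sequence, quality) tuples, mates share a read ID that differs only in a trailing /1 or /2 (or a CASAVA 1.8 description), and the function names are hypothetical.

```python
def read_key(header: str) -> str:
    """Read ID shared by both mates: drop '/1', '/2' or the 1.8 description."""
    name = header.split(" ")[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def synchronize(forward: list, reverse: list) -> list:
    """Pair up mates by read ID, in forward-file order, dropping singletons."""
    reverse_by_key = {read_key(record[0]): record for record in reverse}
    pairs = []
    for record in forward:
        mate = reverse_by_key.get(read_key(record[0]))
        if mate is not None:
            pairs.append((record, mate))
    return pairs


# Read r2 lost its mate during filtering, and the reverse file is out of order.
forward_reads = [("@r1/1", "A", "I"), ("@r2/1", "C", "I"), ("@r3/1", "G", "I")]
reverse_reads = [("@r3/2", "T", "I"), ("@r1/2", "T", "I")]
synced = synchronize(forward_reads, reverse_reads)
print([(fwd[0], rev[0]) for fwd, rev in synced])
```

The dictionary holds one whole file in memory, which is fine for an illustration but not for full-size sequencing runs.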

Summary - Data prepping

Data prepping is NEEDED, if:
•  Data format is incorrect
•  Data contains bad "reads"
•  Adapter sequences are detected
•  Data is out of synch, meaning the Forward and Reverse reads are not paired in the same order

Summary: Galaxy Tools for pre-processing

1.  Fastq Groomer: convert quality scores to Sanger standard format
–  Necessary for data generated with CASAVA 1.7 or earlier; not needed for CASAVA 1.8 and above
2.  Convert read header format from 1.8 to 1.7
3.  Fastq filter: removal of low quality reads
4.  Fastq Trimmer: removal of low quality end bases
5.  Cutadapt: cut adapter sequences
6.  Synchronization
–  Fastq Interlacer/De-Interlacer
–  Critical for PE data analysis


Applications of RNA-Seq

①  Evaluation of a tissue's transcriptome
What is the composition of the transcriptome?

②  Comparative analysis of two or more transcriptomes
How do two or more species' transcriptomes compare?

③  Differential gene expression (this tutorial)
What genes are differentially regulated in two or more conditions?


Differential Gene Expression (DGE)

Two Scenarios

①  DGE - Non discovery mode
DGE without detection of novel transcripts

②  DGE - Discovery mode
DGE with detection of novel transcripts

①  DGE - Non discovery mode

fastq (Condition 1, Condition 2)
→ Quality Control (fastqc), for each condition
→ Map Reads to Reference sequence or genome (TopHat)
→ bam/sam: Mapped Reads (sample 1), Mapped Reads (sample 2)
→ Identify Differential Expression (Cuffdiff), using the Pre-defined Annotation
→ fpkm; diff

②  DGE - Discovery mode

fastq
→ Data Prepping
→ Map Reads to Reference sequence or genome (TopHat)
→ SAM/BAM
→ Assemble sample transcriptome with discovery of novel transcripts (Cufflinks)
→ gtf
→ Merge sample transcriptomes into one (Cuffcompare* / Cuffmerge)
→ gtf
→ Identify Differential Expression (Cuffdiff)
→ fpkm; diff

* Only available in Galaxy

RNA-Seq analytical tool: Tuxedo

A mapper: Bowtie
–  Maps short reads to the reference genome.

A splice junction aligner: TopHat
–  Uses Bowtie to align short reads to the reference genome or sequence
–  Infers and estimates the splicing sites.

A transcriptome assembler: Cufflinks
–  cuffcompare (comparing transcriptomes)
–  cuffmerge (merging transcriptomes)
–  cuffdiff (identifying differentially expressed genes)

A visualization (R) package: cummeRbund

Outputs: TopHat: bam/sam; Cufflinks: gtf; fpkm; Cuffdiff: diff; fpkm

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell C. et al. (2012) Nature Protocols

Quantify Expression Abundance (Cufflinks): FPKM

[Figure: calculating transcript abundance. Steps shown: mRNA isolation; fragmentation, RNA -> cDNA; paired end sequencing; map reads to the genome / reference transcriptome (genes A and B); calculate transcript abundance. For genes A and B, the panels show the # of fragments (paired reads), then the # of fragments per kilobase of exon, then the # of fragments per kilobase of exon per million mapped reads (FPKM).]

Panel values as shown on the slide: Sample 1: Gene A 2, Gene B 2; Sample 1: Gene A 2, Gene B 1; Sample 1: Gene A .7, Gene B .3, Total 3; Sample 2: Gene A .6, Gene B .4, Total 7.

FPKM
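The definition of FPKM can be written as a one-line formula. The sketch below applies only the definition (fragments per kilobase of exon per million mapped fragments); Cufflinks itself uses a statistical model to assign fragments to isoforms, which is not reproduced here. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments: float, exon_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of exon per million mapped fragments."""
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Hypothetical gene: 100 fragments on 2,000 bp of exon, 10 million mapped fragments.
example_fpkm = fpkm(100, 2_000, 10_000_000)
print(example_fpkm)
```

Dividing by exon length makes long and short genes comparable, and dividing by the number of mapped fragments makes samples of different sequencing depth comparable.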


DGE Non discovery mode

DGE without detection of novel transcripts

•  Approach:
–  Treat RNA-Seq as a "high-resolution" microarray.

•  Most appropriate for:
–  Quick identification and analysis of differentially expressed genes
–  Best for systems with an annotated reference genome, such as human and mouse

•  Analysis specificities and limitations:
-  Only maps reads to previously known transcripts
-  Only tests the differential expression of previously known transcripts

•  Programs to use:
1.  TopHat: the mapper
2.  Cuffdiff: the "tester"

Why choose TopHat as the mapper?

•  Mapping RNA-Seq reads to the genome is a big challenge: RNA-Seq reads are continuous, while the genomic sequence they come from is discontinuous.
•  Initially, genome aligners (such as BWA) treated introns as gaps.

[Figure: BWA alignment of RNA-Seq reads at the Dgcr2 gene.]

TopHat: discovering splice junctions with RNA-Seq. Trapnell C et al. (2009) Bioinformatics

If there were no introns: reads should continuously cover the splice junctions.

[Figure: BWA alignment at the Dgcr2 gene, with exons 1, 2 and 3 marked.]

INTRONs shouldn't be treated as GAPs.

TopHat – The splice junction aligner

•  It became intuitive to incorporate the splicing information in the mapping process.
•  Later, it became necessary to build splicing junctions ab initio, because of the incompleteness of known junctions
–  Splicing signals: donor and acceptor sites
•  So TopHat was developed.

[Figure: BWA and TopHat alignments at the Dgcr2 gene.]

TopHat: discovering splice junctions with RNA-Seq. Trapnell C et al. (2009) Bioinformatics

Basis of TopHat

Step 1: mapping
–  Direct un-spliced, un-paired mapping (using bowtie)
–  Assemble contiguous "coverage islands"

Step 2: building splicing junctions (+/- using ref juncs)
–  Identify possible splice donors and acceptors
–  Predict possible splicing junctions

Step 3: 2nd mapping
–  Uses bowtie again in closure search (finding splice junctions with mapping support)

Output: mapping results (SAM/BAM file) and junctions.bed. Only the SAM/BAM file will be used by Cufflinks and Cuffdiff.

Courtesy of Dr. Kevin Silverstein

TopHat – General considerations

TopHat is optimized for human and mouse genome.

Q1: Is the project in human or mouse?

A: Yes
Action: Nothing to be changed. Can use default parameters.

A: No
Action: Cannot use default parameters. Need to input all species-specific parameters, e.g. gene-model-related parameters such as intron length.

TopHat – General considerations

Q2: Is the library paired-end?

A: Yes
Action: Set the parameters for the mean inner distance between paired reads (-r) and the standard deviation of the inner distance (--mate-std-dev).
inner distance = fragment length (220) – 2 X read length

A: No
Action: Nothing to be changed.
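The inner-distance arithmetic can be sketched as follows. The fragment length of 220 is the value from the slide; the read length of 50 and the standard deviation of 20 are hypothetical values chosen for the example, and the helper name is illustrative.

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Inner distance = fragment length - 2 x read length."""
    return fragment_length - 2 * read_length


# Hypothetical library: 220 bp fragments sequenced as 2 x 50 bp reads.
inner_distance = mate_inner_distance(220, 50)

# Option values that would be passed to TopHat for this library
# (the standard deviation of 20 is an assumed value, not from the slides).
tophat_options = ["-r", str(inner_distance), "--mate-std-dev", "20"]
print(" ".join(tophat_options))
```

A negative result means the two reads overlap, which is possible when the fragments are shorter than twice the read length.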

TopHat options – Non discovery analytical approach

Q3: What are the parameters to select?

①  Select "Full parameter list".
②  Select Yes for the option to "Use own junctions".
③  Select Yes for the option to "Use gene annotation model", AND provide known annotation (gtf file).
④  Select Yes for the option to "Only look for supplied junctions".

Assessing mapping efficiency

①  % of reads mapped, % of reads properly paired

②  Use: Samtools and Picard tools
–  flagstat: line 3 and line 7
•  For TopHat, first filter BAM file on MAPQ value of 255
–  Filter SAM or BAM files on FLAG MAPQ RG LN or by region
–  SAM/BAM Alignment Summary Metrics

③  Estimate the insertion size
–  Insertion size metrics

Review the mapping statistics. Recommendations: for human and mouse, good mapping will result in
-  >= 80% mapping percentage
-  >= 70% paired reads

Mapping visualization: Integrative Genomics Viewer (IGV)

[Figure: IGV tracks for a healthy sample and a cancer sample.]

Run IGV locally to view multiple tracks

Directions to install IGV: http://www.broadinstitute.org/igv/

DGE Workflow - Non discovery mode

fastq (Condition 1, Condition 2)
→ Quality Control (fastqc), for each condition
→ Map Reads to Reference sequence or genome (TopHat)
→ bam/sam: Mapped Reads (sample 1), Mapped Reads (sample 2)
→ Identify Differential Expression (Cuffdiff), using the Pre-defined Annotation
→ fpkm; diff

Cuffdiff Facts

Cuffdiff:
–  Quantifies the gene expression abundance
–  Statistically evaluates the differential expression

Considerations on handling "tail" data:
–  Exclude the low-expressed genes to remove transcription artifacts. Set the parameter for "Min Alignment Count".

[Figure: density plot of global gene expression; x-axis: log10(FPKM), y-axis: density.]

Cuffdiff Facts

Cuffdiff:
–  Quantifies the gene expression abundance
–  Statistically evaluates the differential expression

Considerations on handling "tail" data:
–  Exclude the highly expressed genes, such as some house-keeping genes. Set "yes" to "Perform quartile normalization".

[Figure: density plot of global gene expression; x-axis: log10(FPKM), y-axis: density.]

Cuffdiff output

[Figure: Cuffdiff output for Healthy_sample and Cancer_sample.]

Post-analysis processing and iterations

•  Check for non-biological variations
–  Also known as technical variation, or within-group variation.
–  This type of variation is detected among samples of the same group.

•  Sources of technical variation:
–  Batch effect
   •  How were the samples collected and processed? Were the samples processed as groups, and if so what was the grouping?
–  Non-synchronized cell cultures
   •  Were all the cells from the same genetic background and growth phase?
–  Use of technical replicates rather than biological replicates

•  Detection of non-biological variation
–  PCA analysis, MDS analysis, or unsupervised clustering analysis of FPKM values

Steps in PCA analysis

Construct the multiple variable matrix, e.g. tables of FPKM values:

transcripts   Sample A   Sample V   Sample O   Sample E   Sample I   Sample U
gene1         6.18       6.64       6.46       6.30       6.58       6.54
gene2         5.48       0.11       1.00       0.24       0.02       0.68
gene3         20.53      18.93      18.79      18.51      18.00      18.26
gene4         55.47      52.71      50.39      54.66      49.15      44.68
gene5         7.28       8.09       8.57       7.82       8.29       9.38
gene6         14.65      13.88      13.48      13.98      14.72      12.47
gene7         16.41      13.80      14.99      17.20      14.39      13.50
gene8         6.17       6.79       7.20       6.70       8.42       7.26
gene9         25.83      24.24      25.63      27.09      22.18      23.09
gene10        38.04      30.39      35.53      37.42      28.72      27.28
gene11        195.06     179.88     178.18     208.25     179.01     155.15
gene12        32.82      32.04      31.84      33.62      31.06      29.46
gene13        18.41      16.75      16.72      17.33      16.32      16.87
gene14        24.00      21.05      22.68      22.72      22.08      22.45
…

Group 1 (A, V, O)
Group 2 (E, I, U)

[Figure: PCA analysis; samples A, V, O, E, I and U plotted on PC1 vs. PC2.]
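A PCA of the FPKM table can be sketched with NumPy's singular value decomposition. This is one possible way to do it, not necessarily how the slide's plot was produced: the log2(FPKM + 1) transform and the use of NumPy are assumptions made for the example, and only the 14 genes listed on the slide are used.

```python
import numpy as np

samples = ["A", "V", "O", "E", "I", "U"]

# Rows are gene1..gene14, columns are samples A, V, O, E, I, U (FPKM values).
fpkm_table = np.array([
    [6.18, 6.64, 6.46, 6.30, 6.58, 6.54],
    [5.48, 0.11, 1.00, 0.24, 0.02, 0.68],
    [20.53, 18.93, 18.79, 18.51, 18.00, 18.26],
    [55.47, 52.71, 50.39, 54.66, 49.15, 44.68],
    [7.28, 8.09, 8.57, 7.82, 8.29, 9.38],
    [14.65, 13.88, 13.48, 13.98, 14.72, 12.47],
    [16.41, 13.80, 14.99, 17.20, 14.39, 13.50],
    [6.17, 6.79, 7.20, 6.70, 8.42, 7.26],
    [25.83, 24.24, 25.63, 27.09, 22.18, 23.09],
    [38.04, 30.39, 35.53, 37.42, 28.72, 27.28],
    [195.06, 179.88, 178.18, 208.25, 179.01, 155.15],
    [32.82, 32.04, 31.84, 33.62, 31.06, 29.46],
    [18.41, 16.75, 16.72, 17.33, 16.32, 16.87],
    [24.00, 21.05, 22.68, 22.72, 22.08, 22.45],
])

# Samples become rows; the log transform damps the most highly expressed genes.
log_expression = np.log2(fpkm_table.T + 1.0)
centered = log_expression - log_expression.mean(axis=0)

# SVD of the centered matrix: U * S gives the sample coordinates on each PC.
u, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
scores = u * singular_values
explained = singular_values**2 / np.sum(singular_values**2)

for name, (pc1, pc2) in zip(samples, scores[:, :2]):
    print(f"Sample {name}: PC1={pc1:+.3f}  PC2={pc2:+.3f}")
print("Variance explained by PC1, PC2:", explained[:2].round(3))
```

A sample that sits far from the other members of its group on PC1 or PC2 is a candidate for non-biological variation and is worth checking against the batch and sample-handling records.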


DGE Discovery mode

Inputs: SampA.fq (CONDITION A), SampB.fq (CONDITION B); Reference Index (Genome.fa); Reference Gene Annotation (Genes.gtf)

1.  QC with fastqc (each condition)
2.  Alignment with TopHat → SampA.bam, SampB.bam
3.  Assemble with Cufflinks → SampA.gtf, SampB.gtf
4.  Merge assemblies with Cuffcompare → merged.gtf
5.  Quantitation and differential expression with cuffdiff → gene_exp.diff; isoform_exp.diff; ……
6.  Store Results
7.  Visualization with cummeRbund

[Diagram label: Discovery Phase]

DGE with detection of novel transcripts

•  Novel transcripts will be assembled and tested for differential expression.
–  Potential identification of new splicing variants

•  Key advantage (over microarray):
–  Not limited by previous knowledge
–  Extends current knowledge banks

•  Programs used:
1.  TopHat: the mapper
2.  Cufflinks: the assembler
3.  Cuffdiff: the tester

TopHat – Best practice in Discovery Mode

Same as before, but TopHat needs to be run at least TWICE in order to reliably and consistently identify the splicing junctions.

①  The first run generates a full list of junctions.
②  The second run applies the full junction file to all the samples to keep the mapping consistent.

The "TWO-STEP" running of TopHat:
1.  Run TopHat as before
2.  Re-run TopHat with a list of junctions (see settings on the next slide).

TopHat options – discovery analytical approach

①  Combine the sample junctions.bed files into one using Concatenate.
②  Turn on "Full parameter list".
③  Turn on (set "yes" to) the option for "Use own junctions".
④  Provide junctions files (bed file).
⑤  Turn on the option for "Use Closure Search".
⑥  Turn on "Use Microexon Search".


Considerations for Cufflinks

Cufflinks facts

•  Optimized for human and mouse genomes
•  Uses a parsimonious method to assemble the transcripts +/- known annotation
•  Can estimate the transcript abundances
–  FPKM: # of Fragments Per Kilobase of exon model per Million mapped fragments
•  Can estimate the fragment length distribution
–  Not available in Galaxy
•  Output file: GTF file

http://cufflinks.cbcb.umd.edu/

Cufflinks – General considerations

Cufflinks is optimized for human and mouse genome.

Q1: Is the project in human or mouse?

A: Yes
Action: Nothing to change

A: No
Action: Cannot use default parameters. Need to input all species-specific parameters, e.g. gene-model-related parameters such as intron length.

Cufflinks – General considerations

Q2: Want to use a known annotation in transcriptome assembly and report novel transcripts assembled?

A: Yes
Action: Use the option for "Use Reference Annotation"; select "Use Reference Annotation as Guide".

A: No
Action: Nothing to change

Cufflinks – General considerations

Q3: Can I pool samples as one input to cufflinks?

A: No, because we might lose some isoforms in this manner.
–  It is possible that one isoform may only be called from one sample, due to some uncontrollable sample preparation process.
–  Cufflinks will only report isoforms above a certain abundance threshold (10% of the major transcript).
–  The rare isoform will be diluted in the pooled samples, so that it may go missing in the assembly.

            Isoform A (FPKM)   Isoform B (FPKM)   Called?
Sample 1    50                 5                  Yes
Sample 2    50                 3                  No
Pooled      100                8                  No
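The dilution effect in the table can be reproduced with a small threshold check. This is a simplified sketch, not Cufflinks' actual filtering code: the function name is hypothetical, and the threshold is treated as inclusive because the table marks an isoform at exactly 10% of the major transcript as called.

```python
def isoform_called(isoform_fpkm: float, major_fpkm: float, min_fraction: float = 0.10) -> bool:
    """True if the minor isoform reaches the fraction of the major isoform's abundance."""
    return isoform_fpkm / major_fpkm >= min_fraction


# (Isoform A FPKM, Isoform B FPKM) for each row of the table.
table = {"Sample 1": (50, 5), "Sample 2": (50, 3), "Pooled": (100, 8)}
calls = {name: isoform_called(minor, major) for name, (major, minor) in table.items()}

for name, called in calls.items():
    print(f"{name}: isoform B called = {called}")
```

Isoform B passes the threshold in Sample 1 (5 / 50 = 10%) but not in the pooled data (8 / 100 = 8%), which is why assembling each sample separately and merging afterwards preserves it.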

Page 62:

Cuffcompare Facts

Cuffcompare
•  Compares multiple transcriptomes and reports the similarity between them.
•  Available in Galaxy.

Cuffmerge
•  A new function implemented in the Cufflinks package.
•  Its purpose is to remove assembly artifacts.
•  Available as a command-line tool.
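Since Cuffmerge is run from the command line, a short sketch of preparing its input may help. Cuffmerge takes a text file listing one assembly GTF per line; the helper names, file names, and output directory below are placeholders chosen for this example.

```python
import os
import tempfile


def write_manifest(gtf_paths, manifest_path):
    """Write one GTF path per line, the list format cuffmerge reads."""
    with open(manifest_path, "w") as handle:
        for path in gtf_paths:
            handle.write(path + "\n")
    return manifest_path


def cuffmerge_command(manifest_path, out_dir, ref_gtf=None):
    """Return a cuffmerge command (argument list); ref_gtf is optional."""
    cmd = ["cuffmerge", "-o", out_dir]
    if ref_gtf:
        cmd += ["-g", ref_gtf]
    cmd.append(manifest_path)
    return cmd


workdir = tempfile.mkdtemp()
manifest = write_manifest(
    ["sample1/transcripts.gtf", "sample2/transcripts.gtf"],
    os.path.join(workdir, "assemblies.txt"),
)
merge_cmd = cuffmerge_command(manifest, "merged_asm")
print(" ".join(merge_cmd))
```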

Page 63:

Follow the same instructions for running Cuffdiff and post-processing as in the DGE non-discovery mode.

Page 64:

Reproducibility and the value of Workflow

Page 65:

Analysis strategy in “Workflow”

•  Workflow is
–  A sequential collection of Galaxy operations to complete an analysis

Page 66:

Create a “Workflow”

–  From scratch
–  From current history
–  Edit existing workflow

Page 67:

Share/Publish/Use “Workflow”

Page 68:

Tutorial optional material

Evaluation of transcriptome

Two Scenarios

Page 69:

①  De novo assembly of transcriptome
Assemble the transcriptome without a reference transcriptome/genome

②  Reference-guided assembly of transcriptome

Page 70:

Key definitions

Short Reads

Contigs = consensus of overlapping reads

Scaffolds = contigs + known-length gaps
•  Known-length gaps can be estimated by mate-pair sequencing

Draft transcriptome/genome = a collection of unordered scaffolds
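To make the scaffold definition concrete, the toy sketch below joins contigs with runs of `N` whose lengths stand in for the estimated gaps. The sequences and gap sizes are invented for illustration, and representing gaps as `N` characters is a common convention assumed here, not something the slide specifies.

```python
def build_scaffold(contigs, gap_lengths):
    """Join contigs with runs of 'N' representing known-length gaps."""
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    pieces = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        pieces.append("N" * gap)  # gap of estimated length, sequence unknown
        pieces.append(contig)
    return "".join(pieces)


# Three toy contigs separated by gaps of 3 and 2 bases.
scaffold = build_scaffold(["ACGT", "GGCA", "TTA"], [3, 2])
print(scaffold)
```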

Page 71:

De novo assembly of the transcriptome

[Workflow diagram: Samples (RNA-Seq) → fastq → Pre-processing: QC and data cleaning → fastq → De novo assembly of transcriptome (Trans-ABySS *)]

* Only one assembler is shown in this diagram, to illustrate the concept of assembly. However, to construct a reliable transcriptome, multiple assemblers should be used to generate a consensus assembly.

Page 72:

Trans-ABySS Facts

•  ABySS is a de novo, parallel sequence assembler that is designed for short reads.
–  Can work on single-end reads and paired-end reads.
–  Is a de Bruijn graph assembler.
–  It takes two steps:
•  Using all possible k-mers from the reads to build the initial contigs
•  Using mate-pair information to extend contigs

•  Trans-ABySS is a pipeline for analyzing ABySS-assembled contigs from RNA-Seq data.
–  Uses several k-mer lengths

•  Availability: Command line

•  Homepage:
–  http://www.bcgsc.ca/platform/bioinfo/software/abyss
–  http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
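The first ABySS step, breaking reads into k-mers and linking them in a de Bruijn graph, can be sketched in a few lines. This is a toy illustration of the general technique with invented reads, not the ABySS implementation; the function names are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["ACGTA", "CGTAC"]  # two overlapping toy reads
graph = de_bruijn_edges(reads, 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))

# Building graphs for several k values mirrors the multi-k idea in spirit.
graphs_by_k = {k: de_bruijn_edges(reads, k) for k in (3, 4)}
```

Walking unbranched paths through such a graph yields the initial contigs; smaller k values tolerate low coverage, while larger ones resolve repeats better, which is one motivation for combining several k-mer lengths.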

Page 73:

Reference-guided assembly of transcriptome
(Also known as “transcriptome reconstruction”)

[Workflow diagram: Samples (RNA-Seq) → fastq → Pre-processing: QC and data cleaning → fastq → Map short reads to reference genome (TopHat) → BAM/SAM → Assemble transcriptome from mapped reads (Cufflinks) → GTF; a known annotation is an additional input]
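For readers working outside Galaxy, the two main steps of this diagram can be sketched as a pair of command lines. TopHat writes its alignments to `accepted_hits.bam` in its output directory, which then serves as the Cufflinks input. The index name, file names, and directory names below are placeholders, and the helper function is hypothetical.

```python
import os


def reference_guided_commands(index_base, fastq, tophat_dir, cufflinks_dir):
    """Return (tophat_cmd, cufflinks_cmd) for a single-end sample."""
    # TopHat places its alignments in <output dir>/accepted_hits.bam.
    bam = os.path.join(tophat_dir, "accepted_hits.bam")
    tophat_cmd = ["tophat", "-o", tophat_dir, index_base, fastq]
    cufflinks_cmd = ["cufflinks", "-o", cufflinks_dir, bam]
    return tophat_cmd, cufflinks_cmd


tophat_cmd, cufflinks_cmd = reference_guided_commands(
    "genome_index", "sample1_clean.fastq", "tophat_out", "cufflinks_out"
)
print(" ".join(tophat_cmd))
print(" ".join(cufflinks_cmd))
```

Cufflinks writes the assembled transcripts to `transcripts.gtf` in its output directory, which corresponds to the GTF at the end of the diagram.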

Page 74:

If you choose TopHat and Cufflinks as the assembler, follow the instructions for the DGE discovery mode.

Page 75:

Specific notes for prokaryotic samples

•  Cufflinks developer: “We don’t recommend assembling bacteria transcripts using Cufflinks at first. If you are working on a new bacteria genome, consider a computational gene finding application such as Glimmer.”

•  So for a bacterial transcriptome:
•  If the genome is available, do genome annotation first, then reconstruct the transcriptome.
•  If the genome is not available, try de novo assembly, followed by gene annotation.

Page 76:

Summary – Hybrid method on transcriptome assembly

Martin J. et al. (2011) Next-generation transcriptome assembly. Nature Reviews Genetics

Page 77:

Comparative Study of Transcriptomes

Page 78:

Q: How can I compare different transcriptomes?

[Diagram: Sample 1.gtf, Sample 2.gtf, Sample 3.gtf, …, Sample N.gtf are compared using either of the tools below]

•  Cuffcompare: compares individual transcriptomes
•  Generic tools: Operate on Genomic Intervals
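The idea behind the generic interval tools can be illustrated with a small intersection routine. This is a toy sketch, not the Galaxy implementation: it assumes half-open `(chrom, start, end)` coordinates, ignores strand, uses invented intervals, and compares every pair, which is only practical for small inputs.

```python
def intersect(intervals_a, intervals_b):
    """Return the overlapping parts of two lists of (chrom, start, end) intervals."""
    hits = []
    for chrom_a, start_a, end_a in intervals_a:
        for chrom_b, start_b, end_b in intervals_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # half-open: touching intervals do not overlap
                hits.append((chrom_a, start, end))
    return hits


# Toy transcript spans from two samples.
sample1 = [("chr1", 100, 500), ("chr2", 50, 80)]
sample2 = [("chr1", 400, 900), ("chr2", 80, 120)]
shared = intersect(sample1, sample2)
print(shared)
```

Interval operations like this report where features overlap, whereas Cuffcompare additionally compares the transcript structures in the GTF files.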

Page 79:

Galaxy Tools for pre-processing

•  Cuffcompare
•  Operate on Genomic Intervals

Page 80:

•  Downstream visualization and analysis:
–  Will be covered in Tutorial Module 3.
–  IGV: Integrative Genomics Viewer
–  IPA: Ingenuity Pathway Analysis
–  Other analysis packages:
•  R packages: ArrayExpressHTS, cummeRbund

Page 81:

Discussion and Questions?

•  Get Support at MSI:
–  Email: [email protected]
–  General Questions:
•  Subject line: “RISS:…”
–  Galaxy Questions:
•  Subject line: “Galaxy:…”