
RNA-Seq Module 2

From QC to differential gene expression.

Ying Zhang Ph.D, Informatics Analyst

Research Informatics Support System (RISS) MSI

Apr. 24, 2012

RNA-Seq Tutorials

•  Tutorial 1: Introductory (Mar. 28 & Apr. 19)
–  RNA-Seq experiment design and analysis
–  Instruction on individual software will be provided in other tutorials
•  Tutorial 2: Introductory (Apr. 3 & Apr. 24)
–  Analysis of RNA-Seq data using TopHat and Cufflinks
•  Tutorial 3: Intermediate (May 23)
–  Advanced RNA-Seq analysis topics and troubleshooting

Tutorial Outline

•  Review
–  Key definitions and concepts
•  Pre-processing of RNA-seq data
–  QC and data cleaning
•  Applications of RNA-seq
–  Identification of differential gene expression
    •  Using TopHat, Cufflinks and Cuffdiff
–  Definition of the transcriptome
    •  Transcriptome assembly with / without reference genome
–  Comparison of transcriptomes
    •  Identification of novel transcripts

RNA-Seq

Definitions & Concepts

Key definitions (I)

•  SE: single end sequencing
•  PE: paired end sequencing
•  Mate-pair sequencing

Figure: library preparation. Sample -> mRNA isolation -> fragmentation -> sequence fragment end(s). SE and PE sequencing: fragment size ~200 bp. Mate-pair sequencing: fragment size ~2000 bp, with circularization and a second fragmentation before the fragment ends are sequenced.

Key definitions (II)

•  Fragment size selection
-  Only fragments with size around 200 bp will be sequenced in order to reduce sequencing bias.
•  Sequencing depth is the average read coverage of the target sequences
-  Sequencing depth = total number of reads X read length / estimated target sequence length
-  Example: for a 5 Mb transcriptome, if 1 million 50 bp reads are produced, the depth is 1 M X 50 bp / 5 Mb = 10 X
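The depth formula can be checked with a few lines of Python. This is a minimal sketch; the function name `sequencing_depth` is illustrative and is not part of any RNA-Seq tool.

```python
def sequencing_depth(n_reads, read_length_bp, target_length_bp):
    """Average coverage = total sequenced bases / target sequence length."""
    if target_length_bp <= 0:
        raise ValueError("target length must be positive")
    return n_reads * read_length_bp / target_length_bp


# The example from the slide: 1 million 50 bp reads over a 5 Mb transcriptome.
depth = sequencing_depth(1_000_000, 50, 5_000_000)
print(f"{depth:.0f} X")  # prints "10 X"
```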

Application: library type; sequencing depth

•  De novo assembly of transcriptome: PE, mated PE; extensive (> 50 X)
•  Refine gene model: PE, SE; extensive
•  Differential gene expression: PE; moderate (10 X ~ 30 X)
•  Identification of structural variants: PE; extensive

Source: ENCODE RNA-Seq guidelines

Key definitions (III)

Phred (quality) Score

•  Phred score (Q) is the log transformation of the error rate (P) at each base calling position: Q = -10 log10(P)
•  Encoded using ASCII codes:
–  Sanger standard: ASCII 33-126 = Phred score 0-93
•  Phred score 30 ~ 1 error per 1000 nucleotides
•  Phred score 20 ~ 1 error per 100 nucleotides
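The relation between Phred score, error rate and ASCII encoding can be sketched in Python as follows. The function names are illustrative; the only assumption is the Sanger (Phred+33) offset described above.

```python
import math


def phred_to_error(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of Phred scores.

    offset=33 corresponds to the Sanger (Phred+33) encoding.
    """
    return [ord(ch) - offset for ch in quality_string]


print(phred_to_error(30))      # about 0.001, i.e. 1 error per 1000 bases
print(error_to_phred(0.01))    # about 20
print(decode_quality("=1:?"))  # [28, 16, 25, 30]
```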

NGS File formats

File formats in NGS (I)

CASAVA software

SAM/BAM

Mapping

Assembly

http://genome.ucsc.edu/FAQ/FAQformat.html

File format NGS (II) - FASTQ and FASTQ_flt (MSI)

•  CASAVA:
–  Illumina software package for base calling
•  Fastq format:
–  Text format. Stores sequence and quality info
–  4 lines per sequence
–  CASAVA 1.8 header line:

@HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

–  Line 1: read ID (header); line 2: sequence; line 3: "+"; line 4: quality score (Phred+33)
–  The header includes the machine ID (HWI-M00262), the read pair # (1), the QC filter flag (Y = bad, N = good) and the barcode (AGATC)

•  FASTQ_flt data
–  Fastq files processed by MSI standard to remove reads with QC flag “Y”
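A minimal Python sketch of how the CASAVA 1.8 header can be split into fields and how reads with QC flag "Y" can be dropped. It assumes well-formed, unwrapped FASTQ (exactly 4 lines per read); the function names are hypothetical, and this is not the MSI filtering procedure itself.

```python
def parse_casava18_header(header):
    """Split a CASAVA 1.8 read header into named fields."""
    read_id, description = header.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = read_id.split(":")
    pair, filtered, control, barcode = description.split(":")
    return {
        "instrument": instrument,
        "flowcell": flowcell,
        "lane": int(lane),
        "read_pair": int(pair),
        "failed_qc": filtered == "Y",  # Y = bad, N = good
        "barcode": barcode,
    }


def iter_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ text, 4 lines per read."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 3]


def drop_failed_reads(lines):
    """Keep only the reads whose QC filter flag is N."""
    return [rec for rec in iter_fastq(lines)
            if not parse_casava18_header(rec[0])["failed_qc"]]


example = [
    "@HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC",
    "TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT",
    "+",
    "=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=",
]
print(parse_casava18_header(example[0])["barcode"])  # AGATC
print(len(drop_failed_reads(example)))               # 1
```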

•  SAM/BAM format:
–  Sequence alignment format
–  SAM: text format
–  BAM: binary file of SAM
–  Bitwise flag field: indicating mapped or not, paired or not, etc.
•  SAM/BAM format is the standard format of mapped reads, and can be used by almost all NGS tools, e.g. assemblers, viewers, quantifiers.
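The bitwise flag field can be decoded with a short Python sketch. The bit values follow the SAM specification; the short names attached to each bit are illustrative labels chosen for this example.

```python
# (bit, meaning) pairs as defined in the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(4))
# ['unmapped']
```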

File formats NGS (III)

•  GTF format
–  Gene Transfer Format
–  Widely used format for annotated genomes and transcriptomes
–  Downloadable from major browser sites, e.g. UCSC, Ensembl, NCBI
–  Illumina also provides a set of annotated genomes: iGenomes
    •  Available through Galaxy and command line
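A GTF record has nine tab-separated fields, the last of which holds `key "value";` attribute pairs. The sketch below parses one record in Python; it assumes tab-separated input and uses the Xkr4 exon line from this tutorial's GTF example. The function name `parse_gtf_line` is hypothetical.

```python
import re

# Matches attribute pairs such as: gene_id "Xkr4";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one GTF record into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF record must have 9 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE.findall(attrs)),
    }


record = "\t".join([
    "chr1", "unknown", "exon", "3204563", "3207049", ".", "-", ".",
    'gene_id "Xkr4"; transcript_id "NM_001011874";',
])
exon = parse_gtf_line(record)
print(exon["attributes"]["gene_id"])   # Xkr4
print(exon["end"] - exon["start"] + 1)  # 2487
```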

File formats in NGS (IV)

Seqname  Source   feature  start    end      score  strand  frame  attributes
chr1     unknown  exon     3204563  3207049  .      -       .      gene_id "Xkr4"; transcript_id "NM_001011874";

Steps in RNA-Seq Data Analysis

Workflow (Steps 1 to 4), with the program used and the file type produced at each stage:
•  Quality Control (FastQC)
•  Data prepping (Filter/Trimmer/Converter): fastqsanger
•  Map Reads to Reference Genome/Transcriptome (TopHat): bam/sam
•  Assemble Transcriptome (Cufflinks): gtf; fpkm
•  Identify Differentially Expressed Genes (Cuffdiff): fpkm; diff

Other applications: De novo Assembly; Refine gene models

Step 1: Quality control of the input data

Step 2: Data prepping

Quality control of the raw reads

•  Goal: to determine the quality of the sequencing process
•  Recommended program: fastqc
–  Available both in Galaxy and on the Linux platform
•  Checklist of read quality:
–  File format
    •  Basic Statistics
–  Reliability of base calling
    •  Per base sequence quality
    •  Per sequence quality score
–  Contamination
    •  Per sequence GC content
    •  Overrepresented sequences

Data prepping (I)

Is NOT NEEDED, if:
•  In the right format
•  Good reads quality
–  Phred score per base & per sequence >= 20 (better if >= 30)
•  No contamination detected
•  Paired reads are synchronized
–  Bad mapping efficiency of PE reads is symptomatic of de-synchronization

BAD format

Wrong Fastq Format (CASAVA 1.8):
@HWI-M00262:4:000-A0ABC:1:1:1836 1:N:0:AGATC
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

Right Fastq Format (CASAVA 1.7):
@HWI-M00262:4:000-A0ABC:1:1:1836/1
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

Data prepping – Needed (I)

Data format is incorrect,
•  paired reads are not indicated as RNAME/1, RNAME/2
•  the quality score is not in Sanger/Illumina 1.9 format

Action: change the format, using Edit Attributes, FASTQ Groomer, or the header line converter.
Note: If you are using Galaxy to analyze your data, changing the file name WILL NOT change the file format.

Figure: distribution of Phred score in reads (panels: "Bad", "Trimming needed")

Data contains bad “reads”,
•  the quality score of reads/part of the reads is < 20

Action: remove the low quality reads, using fastq filter and fastq trimmer
•  fastq filter: removes entire reads.
•  fastq column trimmer: uniformly removes the same nucleotide positions from all reads.
•  fastq quality trimmer: removes the nucleotide positions with low quality.
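The difference between filtering and trimming can be illustrated with a simplified Python sketch. It assumes Phred+33 quality strings and trims only from the 3' end, one base at a time; the Galaxy tools use their own algorithms and options, so this shows the idea only, with hypothetical function names.

```python
def mean_phred(quality, offset=33):
    """Average Phred score of a read (what a whole-read filter might test)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_3prime(sequence, quality, min_phred=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_phred."""
    keep = len(sequence)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_phred:
        keep -= 1
    return sequence[:keep], quality[:keep]


# 'I' encodes Phred 40, '#' encodes Phred 2 in Phred+33.
print(trim_3prime("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
print(mean_phred("IIII"))               # 40.0
```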

Data prepping – Needed (II)

Example of bad data: sequence contamination

Adapter sequences are detected
Action: remove the adapter sequences, using CutAdapt

Data prepping – Needed (III)

Data is out of synch,
•  the Forward and Reverse reads are not arranged in the same order.

Action: synchronize the files, using fastq interlacer and fastq de-interlacer
Note: Synchronization check and correction should be the last step in data prepping, because the previous steps in prepping can cause de-synchronization of PE data.
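A synchronization check amounts to comparing read names in the forward and reverse files, position by position. The Python sketch below works on lists of header lines and accepts both the "/1", "/2" style and the CASAVA 1.8 style; the function names are hypothetical, and the sketch only detects de-synchronization, it does not repair it.

```python
def pair_name(header):
    """Read name shared by both mates: drop '@', the description, and /1 or /2."""
    name = header.lstrip("@").split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def first_desync(forward_headers, reverse_headers):
    """Return the index of the first mismatched pair, or None if in sync."""
    for i, (fwd, rev) in enumerate(zip(forward_headers, reverse_headers)):
        if pair_name(fwd) != pair_name(rev):
            return i
    if len(forward_headers) != len(reverse_headers):
        # One file has extra reads after the shared part.
        return min(len(forward_headers), len(reverse_headers))
    return None


forward = ["@read1/1", "@read2/1", "@read3/1"]
reverse = ["@read1/2", "@read3/2"]  # read2 was removed by a filter
print(first_desync(forward, reverse))  # 1
```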

Data prepping – Needed (IV)

Summary - Data prepping

Data prepping is NEEDED, if:

•  Data format is incorrect

•  Data contains bad “reads”

•  Adapter sequences are detected

•  Data is out of synch, meaning the pairing of Forward and Reverse reads is out of order

Summary: Galaxy Tools for pre-processing

1.  Fastq Groomer: Convert quality score to Sanger standard format
–  Necessary for data generated with CASAVA 1.7 or less, no need for CASAVA 1.8 and above.
2.  Convert read header format from 1.8 to 1.7
3.  Fastq filter: removal of low quality reads
4.  Fastq Trimmer: removal of low quality end bases
5.  Cutadapt: Cut Adapter sequences
6.  Synchronization
–  Fastq Interlacer/De-Interlacer
–  Critical for PE data analysis

Applications of RNA-Seq

① Evaluation of a tissue’s transcriptome
What is the composition of the transcriptome?

② Comparative analysis of two or more transcriptomes
How do two or more species’ transcriptomes compare?

③ Differential gene expression (this tutorial)
What genes are differentially regulated in two or more conditions?

Differential Gene Expression (DGE)

Two Scenarios

① DGE – Non discovery mode DGE without detection of novel transcripts

② DGE - Discovery mode

DGE with detection of novel transcripts

①  DGE - Non discovery mode

Workflow, run for Condition 1 and Condition 2: fastq -> Quality Control (fastqc) -> Data Prepping -> Map Reads to Reference sequence or genome (TopHat) -> bam/sam (mapped reads per sample)

Mapped reads (SAM/BAM) + Pre-defined Annotation -> Identify Differential Expression (Cuffdiff) -> fpkm; diff

②  DGE - Discovery mode

Assemble sample transcriptome with discovery of novel transcripts (Cufflinks)

Merge sample transcriptomes into one (Cuffcompare * / Cuffmerge)

* Only available in Galaxy

RNA-Seq analytical tool: Tuxedo

•  A mapper: Bowtie
–  Maps short reads to the reference genome.
•  A splice junction aligner: TopHat
–  Uses Bowtie to align short reads to reference genome or sequence
–  It infers and estimates the splicing sites.
•  A transcriptome assembler: Cufflinks
–  cuffcompare (comparing transcriptomes)
–  cuffmerge (merging transcriptomes)
–  cuffdiff (identifying differentially expressed genes).
•  A visualization (R) package: cummeRbund

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell C. et al. (2012) Nature Protocols

Outputs: TopHat (using Bowtie): bam/sam; Cufflinks (with Cuffmerge, Cuffcompare): gtf; fpkm; Cuffdiff: diff; fpkm; then visualization with cummeRbund

Quantify Expression Abundance (Cufflinks): FPKM

Figure: mRNA isolation -> fragmentation, RNA -> cDNA -> paired end sequencing -> map reads to the genome / reference transcriptome (Gene A, Gene B) -> calculate transcript abundance.

The example tables in the figure (Gene A and Gene B, Sample 1 and Sample 2) go from:
•  # of fragments (paired reads), to
•  # fragments per kilobase of exon, to
•  # fragments per kilobase of exon per million mapped reads

DGE Non discovery mode

DGE without detection of novel transcripts

•  Approach:
–  Treat RNA-Seq as a “high-resolution” microarray.
•  Most appropriate for:
–  Quick identification and analysis of differentially expressed genes
–  Best for systems with an annotated reference genome, such as human and mouse
•  Analysis specificities and limitations:
–  Only maps reads to previously known transcripts
–  Only tests the differential expression of previously known transcripts.
•  Programs to use:
1.  TopHat: the mapper
2.  Cuffdiff: the “tester”

Why choose TopHat as the mapper?

•  Mapping RNA-Seq reads to the genome is a big challenge: RNA-Seq reads are continuous, but the genomic sequence they come from is discontinuous (exons separated by introns).
•  Initially, genome aligners (such as BWA) treated introns as gaps.
•  If there were no introns, reads would continuously cover the splice junctions (figure: Exons 1, 2, 3).
•  INTRONs shouldn’t be treated as GAPs.

TopHat: discovering splice junctions with RNA-Seq. Trapnell C. et al. (2009) Bioinformatics

TopHat – The splice junction aligner

•  It became intuitive to incorporate the splicing information in the mapping process.
•  Later, it became necessary to build splicing junctions ab initio, because of the incompleteness of known junctions
–  Splicing signals: donor and acceptor sites
•  So TopHat was developed.

TopHat

Basis of TopHat

Step 1: mapping
–  Direct un-spliced, un-paired mapping (using bowtie)

Step 2: building splicing junctions (+/- using ref juncs)
–  Assemble contiguous “coverage islands”
–  Identify possible splice donors and acceptors
–  Predict possible splicing junctions

Step 3: 2nd mapping
–  Uses bowtie again in closure search (finding splice junctions with mapping support)

Output: Mapping results (SAM/BAM file), and junctions.bed. Only the SAM/BAM file will be used by Cufflinks and Cuffdiff.

Courtesy of Dr. Kevin Silverstein

TopHat – General considerations

TopHat is optimized for human and mouse genomes.

Q1: Is the project in human or mouse?

A: Yes
Action: Nothing to be changed. Can use default parameters.

A: No
Action: Cannot use default parameters. Need to input all species-specific parameters, e.g. the gene-model related parameters, such as intron length.

TopHat – General considerations

Q2: Is the library paired-end?

A: Yes
Action: Set the parameters for the mean inner distance between paired reads (-r) and the standard deviation of the inner distance (--mate-std-dev).
inner distance = fragment length (220) – 2 X read length

A: No
Action: Nothing to be changed.
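The inner-distance arithmetic for paired-end libraries can be sketched in Python. The fragment length of 220 bp comes from the slide; the read length of 50 bp and the standard deviation of 20 bp are assumptions chosen only for illustration, and the helper names are hypothetical. The `-r` and `--mate-std-dev` options are the TopHat options named in Q2.

```python
def mate_inner_distance(fragment_length, read_length):
    """Inner distance = fragment length - 2 x read length."""
    return fragment_length - 2 * read_length


def tophat_pair_options(fragment_length, read_length, std_dev):
    """Build the paired-end options as a command-line argument list."""
    inner = mate_inner_distance(fragment_length, read_length)
    return ["-r", str(inner), "--mate-std-dev", str(std_dev)]


# 220 bp fragments sequenced with 2 x 50 bp reads (assumed read length).
print(mate_inner_distance(220, 50))      # 120
print(tophat_pair_options(220, 50, 20))  # ['-r', '120', '--mate-std-dev', '20']
```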

TopHat options – Non discovery analytical approach

Q3: What are the parameters to select?

①  Select “Full parameter list”.
②  Select Yes for the option to “Use own junctions”.
③  Select Yes for the option to “Use gene annotation model”, AND provide known annotation (gtf file).
④  Select Yes for the option to “Only look for supplied junctions”.

Assessing mapping efficiency

①  % of reads mapped, % of reads properly paired
②  Use: Samtools and Picard tools
–  flagstat: line 3 and line 7
    •  For TopHat, first filter the BAM file on MAPQ value of 255
–  Filter SAM or BAM files on FLAG MAPQ RG LN or by region
–  SAM/BAM Alignment Summary Metrics
③  Estimate the insertion size
–  Insertion size metrics

A: Review of the mapping statistics.
Recommendations: For human and mouse, good mapping will result in
–  >= 80% mapping percentage
–  >= 70% paired reads
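The recommended thresholds can be applied to read counts with a small Python sketch. The counts below are invented for illustration, the function name is hypothetical, and both percentages are computed against the total number of reads, which is an assumption about how the counts are reported.

```python
def mapping_summary(total_reads, mapped_reads, properly_paired_reads,
                    min_mapped_pct=80.0, min_paired_pct=70.0):
    """Compute mapping percentages and compare them to the thresholds."""
    mapped_pct = 100.0 * mapped_reads / total_reads
    paired_pct = 100.0 * properly_paired_reads / total_reads
    return {
        "mapped_pct": mapped_pct,
        "paired_pct": paired_pct,
        "ok": mapped_pct >= min_mapped_pct and paired_pct >= min_paired_pct,
    }


# Hypothetical counts taken from a flagstat-style report.
print(mapping_summary(1_000_000, 850_000, 720_000))
# {'mapped_pct': 85.0, 'paired_pct': 72.0, 'ok': True}
```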

Mapping visualization: Integrative Genomics Viewer (IGV)

Figure: IGV tracks for the Healthy Sample and the Cancer sample

Run IGV locally to view multiple tracks

Directions to install IGV: http://www.broadinstitute.org/igv/

DGE Workflow - Non discovery mode

Workflow, run for Condition 1 and Condition 2: Quality Control (fastqc) -> Map Reads to Reference sequence or genome (TopHat) -> bam/sam

bam/sam + Pre-defined Annotation -> Identify Differential Expression (Cuffdiff) -> fpkm; diff

Cuffdiff Facts

Cuffdiff:
•  Quantifies the gene expression abundance
•  Statistical evaluation of the differential expression

Considerations on handling “Tail” data (figure: distribution of global gene expression, x-axis log10(FPKM) from 0 to 4):
•  Exclude the low-expressed genes to remove transcription artifacts. Set the parameter for “Min Alignment Count”.
•  Exclude the highly expressed genes, such as some house-keeping genes. Set “yes” to “Perform quartile normalization”.

Cuffdiff output

Healthy_sample Cancer_sample

Post-analysis processing and iterations

•  Check for non-biological variations
–  Also known as technical variation, or within-group variation.
–  This type of variation is detected among samples of the same group.
•  Source of the technical variations:
–  Batch effect
    •  How were the samples collected and processed? Were the samples processed as groups, and if so what was the grouping?
–  Non-synchronized cell cultures
    •  Were all the cells from the same genetic background and growth phase?
–  Use of technical replicates rather than biological replicates
•  Detection of non-biological variation
–  PCA analysis; or MDS analysis; or unsupervised clustering analysis of FPKM values

Steps in PCA analysis

Construct the multiple variable matrix, e.g. tables of FPKM values

transcripts  Sample A  Sample V  Sample O  Sample E  Sample I  Sample U
gene1          6.18      6.64      6.46      6.30      6.58      6.54
gene2          5.48      0.11      1.00      0.24      0.02      0.68
gene3         20.53     18.93     18.79     18.51     18.00     18.26
gene4         55.47     52.71     50.39     54.66     49.15     44.68
gene5          7.28      8.09      8.57      7.82      8.29      9.38
gene6         14.65     13.88     13.48     13.98     14.72     12.47
gene7         16.41     13.80     14.99     17.20     14.39     13.50
gene8          6.17      6.79      7.20      6.70      8.42      7.26
gene9         25.83     24.24     25.63     27.09     22.18     23.09
gene10        38.04     30.39     35.53     37.42     28.72     27.28
gene11       195.06    179.88    178.18    208.25    179.01    155.15
gene12        32.82     32.04     31.84     33.62     31.06     29.46
gene13        18.41     16.75     16.72     17.33     16.32     16.87
gene14        24.00     21.05     22.68     22.72     22.08     22.45
…………………………..

Group 1 (A,V,O)
Group 2 (E,I,U)

PCA analysis (figure: PCA plot of the samples, with Group 1 and Group 2 marked)
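To make the PCA step concrete, the sketch below computes the first principal component of the samples in plain Python, using power iteration on the sample-by-sample matrix. It is a teaching sketch under stated assumptions: only the first five genes of the FPKM table are used, the values are centered per gene but not log-transformed or scaled, and the function name is hypothetical. In practice a statistics package would normally be used instead.

```python
import math

SAMPLES = ["A", "V", "O", "E", "I", "U"]
# First five genes of the FPKM table (rows = genes, columns = samples).
FPKM = [
    [6.18, 6.64, 6.46, 6.30, 6.58, 6.54],
    [5.48, 0.11, 1.00, 0.24, 0.02, 0.68],
    [20.53, 18.93, 18.79, 18.51, 18.00, 18.26],
    [55.47, 52.71, 50.39, 54.66, 49.15, 44.68],
    [7.28, 8.09, 8.57, 7.82, 8.29, 9.38],
]


def pc1_sample_coordinates(matrix, iterations=500):
    """Relative positions of the samples along PC1 (unit-length vector)."""
    n = len(matrix[0])
    # Center every gene (row) on its mean across samples.
    centered = []
    for row in matrix:
        mean = sum(row) / n
        centered.append([value - mean for value in row])
    # Sample-by-sample Gram matrix G = X^T X of the centered data.
    gram = [[sum(row[i] * row[j] for row in centered) for j in range(n)]
            for i in range(n)]
    # Power iteration converges to the dominant eigenvector of G.
    vector = [float(i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iterations):
        product = [sum(gram[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector


for sample, coordinate in zip(SAMPLES, pc1_sample_coordinates(FPKM)):
    print(sample, round(coordinate, 3))
```

Samples that sit close together along this axis have similar expression profiles; a sample that separates from the rest of its own group is a candidate for non-biological variation.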

DGE Discovery mode

Workflow, run for CONDITION A and CONDITION B:
1.  QC with fastqc: SampA.fq, SampB.fq
2.  Alignment with TopHat, using the Reference Index (Genome.fa): SampA.bam, SampB.bam
3.  Assemble with Cufflinks: SampA.gtf, SampB.gtf
4.  Merge assemblies with Cuffcompare, using the Reference Gene Annotation (Genes.gtf): merged.gtf
5.  Quantitation and differential expression with cuffdiff: gene_exp.diff; isoform_exp.diff; ……
6.  Store Results; Visualization with cummeRbund

Discovery Phase: DGE with detection of novel transcripts

•  Novel transcripts will be assembled and tested for differential expression.
–  Potential identification of new splicing variants
•  Key advantage (over microarray).
–  Not limited by previous knowledge
–  Extends current knowledge banks
•  Programs used:
1.  TopHat: the mapper
2.  Cufflinks: the assembler
3.  Cuffdiff: the tester

TopHat – Best practice in Discovery Mode

Same as before, but TopHat needs to be run at least TWICE in order to reliably and consistently identify the splicing junctions.

The “TWO-STEP” running of TopHat:
1.  Run TopHat as before
2.  Re-run TopHat with a list of junctions (see settings in next slide).

①  The first run generates a full list of junctions.
②  The second run applies the full junction file to all the samples to keep the mapping consistent.

TopHat options – discovery analytical approach

①  Combine the sample junctions.bed files into one using Concatenate.
②  Turn on “Full parameter list”.
③  Turn on (set “yes” to) the option for “Use own junctions”.
④  Provide junctions files (bed file).
⑤  Turn on the option for “Use Closure Search”.
⑥  Turn on “Use Microexon Search”.

Considerations for Cufflinks

Cufflinks facts

•  Optimized for human and mouse genomes
•  Uses a parsimonious method to assemble the transcripts +/- known annotation
•  Can estimate the transcript abundances
–  FPKM: # of Fragments Per Kilobase of exon model per Million mapped fragments
•  Can estimate the fragment length distribution
–  Not available in Galaxy
•  Output file: GTF file

http://cufflinks.cbcb.umd.edu/

Cufflinks – General considerations

Cufflinks is optimized for human and mouse genomes.

Q1: Is the project in human or mouse?

A: Yes
Action: Nothing to change

A: No
Action: Cannot use default parameters. Need to input all species-specific parameters, e.g. the gene-model related parameters, such as intron length.

Q2: Want to use a known annotation in transcriptome assembly and report novel transcripts assembled?

A: Yes
Action: Use the option for “Use Reference Annotation”; Select “Use Reference Annotation as Guide”.

A: No
Action: Nothing to change

Q3: Can I pool samples as one input to cufflinks?

A: No. Because we might lose some isoforms in this manner.
–  It is possible that one isoform may only be called from one sample, due to some uncontrollable sample preparation process.
–  Cufflinks will only report isoforms above a certain abundance threshold (10% of the major transcripts).
–  The rare isoform will be diluted in the pooled samples, so that it may become missing in the assembly.

           Isoform A (FPKM)  Isoform B (FPKM)  Called?
Sample 1   50                5                 Yes
Sample 2   50                3                 No
Pooled     100               8                 No
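The dilution effect in the table can be reproduced with a small Python check of the 10% rule. This is only an illustration of the threshold arithmetic with a hypothetical function name; it is not how Cufflinks is implemented.

```python
def minor_isoform_called(major_fpkm, minor_fpkm, min_fraction=0.10):
    """True if the minor isoform reaches min_fraction of the major isoform."""
    return minor_fpkm / major_fpkm >= min_fraction


table = {
    "Sample 1": (50, 5),
    "Sample 2": (50, 3),
    "Pooled": (100, 8),  # the sum of Sample 1 and Sample 2
}
for name, (major, minor) in table.items():
    print(name, minor_isoform_called(major, minor))
# Sample 1 True
# Sample 2 False
# Pooled False
```

Isoform B passes the threshold in Sample 1 on its own (5 / 50 = 10%), but in the pooled data it drops to 8 / 100 = 8% and is no longer reported.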

Cuffcompare Facts

Cuffcompare
•  Compares multiple transcriptomes and reports the similarity between them.
•  Available in Galaxy.

Cuffmerge
•  A new function implemented in the Cufflinks package.
•  Purpose is to remove assembly artifacts.
•  Available using command line tools.

Follow the same instructions to run Cuffdiff and post-processing as in DGE-non discovery mode.

Reproducibility and the value of Workflow

Analysis strategy in “Workflow”

•  Workflow is
–  A sequential collection of Galaxy operations to complete an analysis
•  Create a “Workflow”
–  From scratch
–  From current history
–  Edit existing workflow
•  Share/Publish/Use “Workflow”

Tutorial optional material

Evaluation of transcriptome

Two Scenarios

①  De novo assembly of transcriptome Assemble transcriptome without a reference transcriptome/genome

②  Reference-guided assembly of transcriptome

Key definitions

Short Reads

Contigs = consensus of overlapping reads

Scaffolds = contigs + known-length gaps •  known-length gaps could be estimated by Mate-pair sequencing

Draft transcriptome/genome = a collection of non-ordered scaffolds

De novo assembly of the transcriptome

Workflow: Samples (RNA-Seq) -> Pre-processing: QC and Data cleaning -> fastq -> De novo Assembly of Transcriptome (Trans-ABySS *)

* Only one assembler is shown in this diagram to illustrate the concept of assembly. However, in order to construct a reliable transcriptome, multiple assemblers should be used to generate a consensus assembly.

Trans-ABySS Facts

•  ABySS is a de novo, parallel sequence assembler that is designed for short reads.
–  Can work on single end reads and paired end reads.
–  Is a de Bruijn graph assembler
–  It takes two steps:
    •  Using all possible k-mers from the reads to build the initial contigs
    •  Using mate-pair information to extend contigs
•  Trans-ABySS is a pipeline for analyzing ABySS-assembled contigs from RNA-Seq data.
–  Uses several k-mer lengths
•  Availability: Command line
•  Homepage:
–  http://www.bcgsc.ca/platform/bioinfo/software/abyss
–  http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss

Reference-guided assembly of transcriptome
(Also known as “transcriptome reconstruction”)

Workflow: Samples (RNA-Seq) -> Pre-processing: QC and Data cleaning -> Map short reads to reference genome (TopHat) -> BAM/SAM -> Assemble transcriptome from mapped reads (Cufflinks), with Known Annotation

If choosing TopHat and Cufflinks as the assembler, follow the instructions in DGE-discovery mode

Specific Notes for Prokaryotes’ samples

•  Cufflinks developer: “We don’t recommend assembling bacteria transcripts using Cufflinks at first. If you are working on a new bacteria genome, consider a computational gene finding application such as Glimmer.”
•  So for bacteria transcriptome:
–  If the genome is available, do genome annotation first, then reconstruct the transcriptome.
–  If the genome is not available, try the de novo assembly, followed by gene annotation.

Summary – Hybrid method on transcriptome assembly

Next-generation transcriptome assembly. Martin J. et al. (2011) Nature Reviews Genetics

Comparative Study of Transcriptomes

Q: How can I compare different transcriptomes?

Input: Sample 1.gtf, Sample 2.gtf, Sample 3.gtf, ……., Sample N.gtf

•  Cuffcompare: Compare individual transcriptomes
•  Generic tools: Operate on Genomic Intervals

Galaxy Tools for pre-processing

•  Downstream visualization and analysis:
–  Will be covered in Tutorial Module 3.
–  IGV: Integrative Genomics Viewer
–  IPA: Ingenuity Pathway Analysis
–  Other analysis packages:
    •  R packages: ArrayExpressHTS, cummeRbund

Discussion and Questions?

•  Get Support at MSI:
–  Email: help@msi.umn.edu
–  General Questions:
    •  Subject line: “RISS:…”
–  Galaxy Questions:
    •  Subject line: “Galaxy:…”
