RNA-Seq Module 2
From QC to differential gene expression.
Ying Zhang Ph.D, Informatics Analyst
Research Informatics Support System (RISS) MSI
Apr. 24, 2012
RNA-Seq Tutorials
• Tutorial 1: Introductory (Mar. 28 & Apr. 19)
– RNA-Seq experiment design and analysis
– Instruction on individual software will be provided in other tutorials
• Tutorial 2: Introductory (Apr. 3 & Apr. 24)
– Analysis of RNA-Seq data using TopHat and Cufflinks
• Tutorial 3: Intermediate (May 23)
– Advanced RNA-Seq analysis topics and troubleshooting
Tutorial Outline
• Review
– Key definitions and concepts
• Pre-processing of RNA-seq data
– QC and data cleaning
• Applications of RNA-seq
– Identification of differential gene expression
• Using TopHat, Cufflinks and Cuffdiff
– Definition of the transcriptome
• Transcriptome assembly with / without reference genome
– Comparison of transcriptomes
• Identification of novel transcripts
RNA-Seq
Definitions & Concepts
Key definitions (I)
• SE: single-end sequencing
• PE: paired-end sequencing
• Mate-pair sequencing
[Figure: library preparation. Sample → mRNA isolation → fragmentation → sequence fragment end(s). SE and PE sequencing use fragments of ~200 bp. Mate-pair sequencing uses fragments of ~2000 bp, which are circularized and fragmented again before the fragment ends are sequenced.]
Key definitions (II)
• Fragment size selection
- Only fragments with a size of around 200 bp are sequenced, in order to reduce sequencing bias.
• Sequencing depth is the average read coverage of the target sequences
- Sequencing depth = total number of reads × read length / estimated target sequence length
- Example: for a 5 Mb transcriptome, if 1 million 50 bp reads are produced, the depth is 1 M × 50 bp / 5 Mb = 10×
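The depth formula above can be checked with a few lines of Python. This is a minimal sketch; the function name `sequencing_depth` and its parameter names are illustrative and not part of any RNA-Seq tool.

```python
def sequencing_depth(num_reads: int, read_length_bp: int, target_length_bp: int) -> float:
    """Average coverage = total sequenced bases / estimated target length."""
    if target_length_bp <= 0:
        raise ValueError("target length must be positive")
    return num_reads * read_length_bp / target_length_bp


# The slide's example: 1 million 50 bp reads over a 5 Mb transcriptome.
depth = sequencing_depth(1_000_000, 50, 5_000_000)
print(f"{depth:.0f}x")  # prints "10x"
```

The same function can be turned around when planning an experiment: fix the target depth and solve for the number of reads.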
Application                             Library type    Sequencing depth
De novo assembly of transcriptome       PE, mated PE    Extensive (> 50×)
Refine gene model                       PE, SE          Extensive
Differential gene expression            PE              Moderate (10× ~ 30×)
Identification of structural variants   PE              Extensive
Source: ENCODE RNA-Seq guidelines
Key definitions (III)
Phred (quality) score
• The Phred score (Q) is the log transformation of the error rate (P) at each base-calling position: Q = -10 log10(P)
• Encoded using ASCII codes:
– Sanger standard: ASCII 33-126 = Phred score 0-93
• Phred score 30 ~ 1 error per 1000 nucleotides
• Phred score 20 ~ 1 error per 100 nucleotides
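A short Python sketch of the Phred relationship and the Sanger (Phred+33) encoding described above. The function names are illustrative only.

```python
import math

PHRED_OFFSET = 33  # Sanger standard: ASCII code = Phred score + 33


def phred_from_error(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Inverse of the formula above: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str) -> list:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


print(round(phred_from_error(0.001)))  # error rate of 1 in 1000
print(error_from_phred(20))            # error rate for Phred 20
print(decode_quality("=1:?7A7+"))      # first bases of the example read
```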
NGS File formats
File formats in NGS (I)
[Diagram: CASAVA software → fastq; mapping → SAM/BAM; assembly → GTF]
http://genome.ucsc.edu/FAQ/FAQformat.html
File format NGS (II) - FASTQ and FASTQ_flt (MSI)
• CASAVA: – Illumina software package for base calling
• Fastq format:
– Text format. Stores sequence and quality info
– 4 lines per sequence
– CASAVA 1.8 header line:
@HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=
Line 1: read ID (header); line 2: sequence; line 3: +; line 4: quality scores (Phred+33)
Header fields: machine ID (HWI-M00262); read pair # (1); QC filter flag (Y = bad, N = good); barcode (AGATC)
• FASTQ_flt data – Fastq files processed by MSI standard to remove reads with QC flag “Y”
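As an illustration of the header fields and of the QC-flag filtering behind FASTQ_flt, here is a minimal Python sketch that splits a CASAVA 1.8 header into its parts. The function and dictionary key names are assumptions made for this example; it is not the MSI filtering script.

```python
def parse_casava18_header(header: str) -> dict:
    """Split a CASAVA 1.8 FASTQ header line into named fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    read_id, description = header[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = read_id.split(":")
    read_number, filter_flag, control, barcode = description.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read_number": int(read_number),
        "failed_filter": filter_flag == "Y",  # Y = bad, N = good
        "control": control,
        "barcode": barcode,
    }


fields = parse_casava18_header(
    "@HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC"
)
print(fields["instrument"], fields["read_number"], fields["failed_filter"])
```

A filter in the spirit of FASTQ_flt would keep a record only when `failed_filter` is false.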
• SAM/BAM format:
– Sequence alignment format
– SAM: text format
– BAM: binary version of SAM
– Bitwise flag field: indicates whether a read is mapped or not, paired or not, etc.
• SAM/BAM is the standard format for mapped reads and can be used by almost all NGS tools, e.g. assemblers, viewers, quantifiers.
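The bitwise flag field can be decoded with simple bit tests. The sketch below uses the bit values defined in the SAM specification; the short labels attached to each bit are informal names chosen for this example.

```python
# Bit values from the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_mapped(flag: int) -> bool:
    """A read is mapped when the 'unmapped' bit (0x4) is not set."""
    return not flag & 0x4


print(decode_flag(99))
print(is_mapped(4))
```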
File formats NGS (III)
• GTF format
– Gene Transfer Format
– Widely used format for annotated genomes and transcriptomes
– Downloadable from major browser sites, e.g. UCSC, Ensembl, NCBI
– Illumina also provides a set of annotated genomes: iGenomes
• Available through Galaxy and command line
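A GTF record is one tab-separated line with nine columns (seqname, source, feature, start, end, score, strand, frame, attributes). The following Python sketch parses one record, using the Xkr4 exon from the slides as input. The function name is illustrative, and the parser only handles the quoted `key "value";` attribute style.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes")


def parse_gtf_line(line: str) -> dict:
    """Parse one tab-separated GTF record into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: gene_id "Xkr4"; transcript_id "NM_001011874";
    record["attributes"] = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
    return record


example = "\t".join([
    "chr1", "unknown", "exon", "3204563", "3207049", ".", "-", ".",
    'gene_id "Xkr4"; transcript_id "NM_001011874";',
])
record = parse_gtf_line(example)
print(record["seqname"], record["start"], record["end"], record["attributes"])
```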
File formats in NGS (IV)
seqname  source   feature  start    end      score  strand  frame  attributes
chr1     unknown  exon     3204563  3207049  .      -       .      gene_id "Xkr4"; transcript_id "NM_001011874";
Steps in RNA-Seq Data Analysis
Quality Control
Data prepping
Map Reads to Reference Genome/Transcriptome
Assemble Transcriptome
Identify Differentially Expressed Genes
Other applications: De novo Assembly
Refine gene models
[Workflow diagram, Steps 1-4: fastq → FastQC → Filter/Trimmer/Converter → fastqsanger → TopHat → bam/sam → Cufflinks → gtf; fpkm → Cuffdiff → fpkm; diff]
Step 1 Quality control of the input data
Step 2
Data prepping
Quality control of the raw reads
• Goal: to determine the quality of the sequencing process
• Recommended program: FastQC
– Available both in Galaxy and on the Linux platform
• Checklist of read quality:
– File format
• Basic Statistics
– Reliability of base calling
• Per base sequence quality
• Per sequence quality score
– Contamination
• Per sequence GC content
• Overrepresented sequences
Data prepping (I)
Is NOT NEEDED, if:
• The data is in the right format
• Read quality is good
– Phred score per base & per sequence >= 20 (better if >= 30)
• No contamination is detected
• Paired reads are synchronized
– Poor mapping efficiency of PE reads is symptomatic of de-synchronization
BAD format
GOOD
Wrong Fastq format (CASAVA 1.8):
@HWI-M00262:4:000-A0ABC:1:1:1836 1:N:0:AGATC
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=
Right Fastq format (CASAVA 1.7):
@HWI-M00262:4:000-A0ABC:1:1:1836/1
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=
Data prepping– Needed (I)
Data format is incorrect:
• paired reads are not indicated as RNAME/1, RNAME/2
• the quality score is not in Sanger/Illumina 1.9 format
Action: change the format, using Edit Attributes, FASTQ Groomer, or the header line converter.
Note: If you are using Galaxy to analyze your data, changing the file name WILL NOT change the file format.
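The header conversion from the CASAVA 1.8 style to the RNAME/1, RNAME/2 style can be sketched in Python as follows. This is an illustrative stand-in for the Galaxy header line converter, not its actual implementation, and the function name is hypothetical.

```python
def casava18_to_17_header(header: str) -> str:
    """Rewrite '@ID 1:N:0:BARCODE' as '@ID/1'.

    Headers without a description part are returned unchanged.
    """
    parts = header.strip().split(" ", 1)
    if len(parts) == 1:
        return parts[0]
    read_id, description = parts
    read_number = description.split(":", 1)[0]
    return f"{read_id}/{read_number}"


print(casava18_to_17_header("@HWI-M00262:4:000-A0ABC:1:1:1836 1:N:0:AGATC"))
```

Only the header line (line 1 of each 4-line record) is rewritten; sequence and quality lines are left as they are.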
Bad Trimming needed
Good
Distribution of Phred Score in reads
Data contains bad “reads”:
• the quality score of the reads, or of part of the reads, is < 20
Action: remove the low-quality reads or bases, using fastq filter and fastq trimmer
• fastq filter: removes entire reads
• fastq column trimmer: uniformly removes the same nucleotide positions from all reads.
• fastq quality trimmer: removes nucleotide positions with low quality.
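The three clean-up actions can be illustrated with a simplified Python sketch. These functions are assumptions made for this example: the filter uses the mean read quality, and the quality trimmer only trims from the 3' end, which may differ from the rules the Galaxy tools apply.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a Phred+33 quality string."""
    return [ord(ch) - offset for ch in quality]


def passes_filter(quality: str, min_mean_q: float = 20) -> bool:
    """Filter: keep the whole read only if its mean quality is high enough."""
    scores = phred_scores(quality)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


def column_trim(seq: str, quality: str, left: int = 0, right: int = 0):
    """Column trimmer: drop the same number of positions from every read."""
    end = len(seq) - right
    return seq[left:end], quality[left:end]


def quality_trim_3prime(seq: str, quality: str, min_q: int = 20):
    """Quality trimmer: drop trailing bases whose score is below min_q."""
    scores = phred_scores(quality)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_q:
        keep -= 1
    return seq[:keep], quality[:keep]


print(quality_trim_3prime("ACGTACGT", "IIIIII##"))
print(column_trim("ACGTACGT", "IIIIII##", left=1, right=2))
print(passes_filter("####"))
```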
Data prepping – Needed (II)
Example of bad data: sequence contamination
Adapter sequences are detected Action: remove the adapter sequences, using CutAdapt
Data prepping – Needed (III)
Data is out of sync:
• the forward and reverse reads are not arranged in the same order.
Action: synchronize the files, using FASTQ interlacer and FASTQ de-interlacer.
Note: The synchronization check and correction should be the last step in data prepping, because the previous prepping steps can de-synchronize PE data.
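A synchronization check compares the read names of the forward and reverse files position by position. The Python sketch below works on lists of header lines and is only an illustration of the idea; the function names are hypothetical and it does not repair the files the way the interlacer/de-interlacer pair does.

```python
def pair_name(header: str) -> str:
    """Read name without '@', the 1.8 description, or a /1 or /2 suffix."""
    name = header.lstrip("@").split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_desync(forward_headers: list, reverse_headers: list):
    """Return the index of the first mismatched pair, or None if in sync."""
    for index, (fwd, rev) in enumerate(zip(forward_headers, reverse_headers)):
        if pair_name(fwd) != pair_name(rev):
            return index
    if len(forward_headers) != len(reverse_headers):
        return min(len(forward_headers), len(reverse_headers))
    return None


forward = ["@r1/1", "@r2/1", "@r3/1"]
reverse = ["@r1/2", "@r3/2"]  # r2 was removed by an earlier filtering step
print(first_desync(forward, reverse))
```

The example shows why filtering one file of a pair independently leads to de-synchronization.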
Data prepping – Needed (IV)
Summary - Data prepping
Data prepping is NEEDED, if:
• Data format is incorrect
• Data contains bad “reads”
• Adapter sequences are detected
• Data is out of sync, meaning the forward and reverse reads are not in the same order
Summary: Galaxy Tools for pre-processing
1. Fastq Groomer: convert quality scores to Sanger standard format
– Necessary for data generated with CASAVA 1.7 or earlier; not needed for CASAVA 1.8 and above.
2. Convert read header format from 1.8 to 1.7
3. Fastq filter: removal of low-quality reads
4. Fastq Trimmer: removal of low-quality end bases
5. Cutadapt: cut adapter sequences
6. Synchronization
– Fastq Interlacer/De-Interlacer
– Critical for PE data analysis
Applications of RNA-Seq
① Evaluation of a tissue’s transcriptome
What is the composition of the transcriptome?
② Comparative analysis of two or more transcriptomes
How do two or more species’ transcriptomes compare?
③ Differential gene expression (this tutorial)
What genes are differentially regulated in two or more conditions?
Differential Gene Expression (DGE)
Two Scenarios
① DGE – Non discovery mode DGE without detection of novel transcripts
② DGE - Discovery mode
DGE with detection of novel transcripts
① DGE - Non discovery mode
[Workflow diagram: fastq (Condition 1, Condition 2) → Quality Control (fastqc) → Data Prepping → Map Reads to Reference sequence or genome (TopHat) → Mapped Reads (bam/sam; sample 1, sample 2) + Pre-defined Annotation → Identify Differential Expression (Cuffdiff) → fpkm; diff]
② DGE - Discovery mode
[Workflow diagram: two steps are added between mapping and Cuffdiff: Assemble sample transcriptome with discovery of novel transcripts (Cufflinks) → gtf; Merge sample transcriptomes into one (Cuffcompare* / Cuffmerge) → gtf]
* Only available in Galaxy
RNA-Seq analytical tool: Tuxedo
• A mapper: Bowtie
– Maps short reads to the reference genome.
• A splice junction aligner: TopHat
– Uses Bowtie to align short reads to the reference genome or sequence
– Infers and estimates the splice sites.
• A transcriptome assembler: Cufflinks
– cuffcompare (comparing transcriptomes)
– cuffmerge (merging transcriptomes)
– cuffdiff (identifying differentially expressed genes).
• A visualization (R) package: cummeRbund
Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell C. et al. (2012) Nature Protocols
[Diagram: Bowtie → TopHat (bam/sam) → Cufflinks (gtf; fpkm) → Cuffmerge / Cuffcompare → Cuffdiff (diff; fpkm) → cummeRbund]
Quantify Expression Abundance (Cufflinks): FPKM
[Figure: mRNA isolation → fragmentation, RNA -> cDNA → paired-end sequencing → map reads to the genome / reference transcriptome (genes A and B) → calculate transcript abundance per sample. For each gene, the # of fragments (paired reads) is normalized to # fragments per kilobase of exon, and then to # fragments per kilobase of exon per million mapped reads (FPKM).]
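The FPKM definition translates directly into a formula. The Python sketch below applies it to plain fragment counts; this is a simplification for illustration, since Cufflinks estimates abundances with a more elaborate statistical model, and the function name and example numbers are made up.

```python
def fpkm(fragments: float, exon_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of exon per million mapped fragments.

    FPKM = fragments / (exon length in kb) / (mapped fragments in millions)
         = fragments * 1e9 / (exon length in bp * total mapped fragments)
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and totals must be positive")
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical gene: 100 fragments on 2 kb of exon, 10 million mapped fragments.
print(fpkm(100, 2000, 10_000_000))
```

The two normalizations make values comparable between genes of different length and between samples of different sequencing depth.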
DGE Non discovery mode
DGE without detection of novel transcripts
• Approach:
– Treat RNA-Seq as a “high-resolution” microarray.
• Most appropriate for:
– Quick identification and analysis of differentially expressed genes
– Systems with an annotated reference genome, such as human and mouse
• Limitations of the analysis:
– Reads are only mapped to previously known transcripts
– Differential expression is only tested for previously known transcripts.
• Programs to use:
1. TopHat: the mapper
2. Cuffdiff: the “tester”
Why choose TopHat as the mapper?
• Mapping RNA-Seq reads to the genome is a big challenge: RNA-Seq reads are continuous, but the genomic sequence they come from is discontinuous.
• Initially, genome aligners (such as BWA) treated introns as gaps.
[Figure: BWA alignments at the Dgcr2 locus (exons 1, 2, 3). If there were no introns, reads should continuously cover the splice junctions. INTRONs shouldn’t be treated as GAPs.]
TopHat: discovering splice junctions with RNA-Seq. Trapnell C et al. (2009) Bioinformatics
TopHat – The splice junction aligner
• It became intuitive to incorporate the splicing information in the mapping process.
• Later, it became necessary to build splice junctions ab initio, because the set of known junctions is incomplete
– Splicing signals: donor and acceptor sites
• So TopHat was developed.
[Figure: BWA and TopHat alignments at the Dgcr2 locus.]
Basis of TopHat
• Direct un-spliced, un-paired mapping (using Bowtie)
• Assemble contiguous “coverage islands”
• Identify possible splice donors and acceptors
• Predict possible splice junctions
• Uses Bowtie again in closure search (finding splice junctions with mapping support)
Step 1: mapping
Step 2: building splice junctions (+/- using reference junctions)
Step 3: second mapping
Output: mapping results (SAM/BAM file) and junctions.bed. Only the SAM/BAM file will be used by Cufflinks and Cuffdiff.
Courtesy of Dr. Kevin Silverstein
TopHat – General considerations
TopHat is optimized for the human and mouse genomes.
Q1: Is the project in human or mouse?
A: Yes. Action: Nothing needs to be changed; the default parameters can be used.
A: No. Action: The default parameters cannot be used. Input all species-specific parameters, e.g. gene-model-related parameters such as intron length.
TopHat – General considerations
Q2: Is the library paired-end?
A: Yes. Action: Set the parameters for the mean inner distance between paired reads (-r) and the standard deviation of the inner distance (--mate-std-dev).
inner distance = fragment length (220) – 2 × read length
A: No. Action: Nothing needs to be changed.
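The inner-distance rule for paired-end libraries is a one-line calculation. The sketch below assumes the 220 bp fragment length quoted on the slide as the default; the function name is illustrative.

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Inner distance = fragment length - 2 * read length.

    A negative result means the two reads of a pair overlap.
    """
    return fragment_length - 2 * read_length


# With a 220 bp fragment, as on the slide:
for read_length in (50, 76, 100):
    print(read_length, mate_inner_distance(220, read_length))
```

The resulting value is what would be supplied to TopHat's -r option for the corresponding read length.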
TopHat options – Non discovery analytical approach
Q3: What are the parameters to select?
① Select “Full parameter list”.
② Select Yes for the option “Use own junctions”.
③ Select Yes for the option “Use gene annotation model”, AND provide a known annotation (GTF file).
④ Select Yes for the option “Only look for supplied junctions”.
Assessing mapping efficiency
① % of reads mapped, % of reads properly paired
② Use: Samtools and Picard tools
– flagstat: line 3 and line 7
• For TopHat, first filter the BAM file on a MAPQ value of 255
– Filter SAM or BAM files on FLAG MAPQ RG LN or by region
– SAM/BAM Alignment Summary Metrics
③ Estimate the insert size
– Insertion size metrics
A: Review the mapping statistics. Recommendations: for human and mouse, good mapping will result in
- >= 80% of reads mapped
- >= 70% of reads properly paired
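Once the read counts are known (for example from a flagstat report), the recommended thresholds can be checked as in this Python sketch. The function, its parameters and the example counts are hypothetical; the 80% and 70% defaults are the slide's recommendations for human and mouse.

```python
def mapping_summary(total_reads: int, mapped_reads: int, properly_paired_reads: int,
                    min_mapped_pct: float = 80.0, min_paired_pct: float = 70.0) -> dict:
    """Compute mapping percentages and compare them with the thresholds."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    mapped_pct = 100.0 * mapped_reads / total_reads
    paired_pct = 100.0 * properly_paired_reads / total_reads
    return {
        "mapped_pct": mapped_pct,
        "properly_paired_pct": paired_pct,
        "ok": mapped_pct >= min_mapped_pct and paired_pct >= min_paired_pct,
    }


print(mapping_summary(total_reads=1000, mapped_reads=850, properly_paired_reads=720))
```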
Mapping visualization: Integrative Genomics Viewer (IGV)
[Figure: IGV tracks for a healthy sample and a cancer sample.]
Run IGV locally to view multiple tracks
Directions to install IGV: http://www.broadinstitute.org/igv/
DGE Workflow - Non discovery mode
[Workflow diagram: fastq (Condition 1, Condition 2) → Quality Control (fastqc) → Map Reads to Reference sequence or genome (TopHat) → Mapped Reads (bam/sam; sample 1, sample 2) + Pre-defined Annotation → Identify Differential Expression (Cuffdiff) → fpkm; diff]
Cuffdiff Facts
Cuffdiff:
• Quantifies the gene expression abundance
• Statistically evaluates the differential expression
Considerations on handling “tail” data.
[Figure: density plot of global gene expression; x-axis: log10(FPKM), y-axis: density.]
• Exclude the low-expressed genes to remove transcription artifacts. Set the parameter “Min Alignment Count”.
• Exclude the highly expressed genes, such as some house-keeping genes. Set “yes” for “Perform quartile normalization”.
Cuffdiff output
Healthy_sample Cancer_sample
Post-analysis processing and iterations
• Check for non-biological variation
– Also known as technical variation, or within-group variation.
– This type of variation is detected among samples of the same group.
• Sources of technical variation:
– Batch effect
• How were the samples collected and processed? Were the samples processed as groups, and if so, what was the grouping?
– Non-synchronized cell cultures
• Were all the cells from the same genetic background and growth phase?
– Use technical replicates rather than biological replicates
• Detection of non-biological variation
– PCA analysis, MDS analysis, or unsupervised clustering analysis of FPKM values
Steps in PCA analysis
Construct the multiple variable matrix, e.g. a table of FPKM values
Group 1 (A, V, O); Group 2 (E, I, U)
transcripts  Sample A  Sample V  Sample O  Sample E  Sample I  Sample U
gene1            6.18      6.64      6.46      6.30      6.58      6.54
gene2            5.48      0.11      1.00      0.24      0.02      0.68
gene3           20.53     18.93     18.79     18.51     18.00     18.26
gene4           55.47     52.71     50.39     54.66     49.15     44.68
gene5            7.28      8.09      8.57      7.82      8.29      9.38
gene6           14.65     13.88     13.48     13.98     14.72     12.47
gene7           16.41     13.80     14.99     17.20     14.39     13.50
gene8            6.17      6.79      7.20      6.70      8.42      7.26
gene9           25.83     24.24     25.63     27.09     22.18     23.09
gene10          38.04     30.39     35.53     37.42     28.72     27.28
gene11         195.06    179.88    178.18    208.25    179.01    155.15
gene12          32.82     32.04     31.84     33.62     31.06     29.46
gene13          18.41     16.75     16.72     17.33     16.32     16.87
gene14          24.00     21.05     22.68     22.72     22.08     22.45
…
[Figure: PCA analysis. Samples A, V, O, E, I and U plotted on PC1 (x-axis) and PC2 (y-axis).]
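To make the PCA step concrete, here is a dependency-free Python sketch that computes sample scores on PC1 and PC2 from the FPKM table. It centers each gene and uses power iteration on the sample-by-sample Gram matrix. This is one simple way to obtain principal components and is an assumption of this example, not the method used to produce the slide's plot; in practice a statistics package would normally be used, often after log-transforming the FPKM values.

```python
import math


def pca_scores(samples, n_components=2, iterations=500):
    """Return per-sample scores on the first principal components.

    samples: one row per sample, one column per gene.
    """
    n, p = len(samples), len(samples[0])
    # Center each gene (column) on its mean across samples.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Gram matrix (samples x samples). Its eigenvectors, scaled by the
    # square root of the eigenvalue, are the sample scores on each PC.
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centered] for ri in centered]

    all_scores = []
    for _component in range(n_components):
        start = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
        start_norm = math.sqrt(sum(x * x for x in start))
        vec = [x / start_norm for x in start]
        eigenvalue = 0.0
        for _step in range(iterations):  # power iteration
            product = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = math.sqrt(sum(x * x for x in product))
            if norm == 0.0:  # no variance left
                vec, eigenvalue = [0.0] * n, 0.0
                break
            vec, eigenvalue = [x / norm for x in product], norm
        all_scores.append([math.sqrt(eigenvalue) * v for v in vec])
        # Deflate: remove the component just found before the next search.
        gram = [[gram[i][k] - eigenvalue * vec[i] * vec[k] for k in range(n)]
                for i in range(n)]
    return all_scores


sample_names = ["A", "V", "O", "E", "I", "U"]
fpkm_table = {  # values copied from the table on the slide
    "gene1": [6.18, 6.64, 6.46, 6.30, 6.58, 6.54],
    "gene2": [5.48, 0.11, 1.00, 0.24, 0.02, 0.68],
    "gene3": [20.53, 18.93, 18.79, 18.51, 18.00, 18.26],
    "gene4": [55.47, 52.71, 50.39, 54.66, 49.15, 44.68],
    "gene5": [7.28, 8.09, 8.57, 7.82, 8.29, 9.38],
    "gene6": [14.65, 13.88, 13.48, 13.98, 14.72, 12.47],
    "gene7": [16.41, 13.80, 14.99, 17.20, 14.39, 13.50],
    "gene8": [6.17, 6.79, 7.20, 6.70, 8.42, 7.26],
    "gene9": [25.83, 24.24, 25.63, 27.09, 22.18, 23.09],
    "gene10": [38.04, 30.39, 35.53, 37.42, 28.72, 27.28],
    "gene11": [195.06, 179.88, 178.18, 208.25, 179.01, 155.15],
    "gene12": [32.82, 32.04, 31.84, 33.62, 31.06, 29.46],
    "gene13": [18.41, 16.75, 16.72, 17.33, 16.32, 16.87],
    "gene14": [24.00, 21.05, 22.68, 22.72, 22.08, 22.45],
}
# Transpose so that each row is one sample.
samples = [list(column) for column in zip(*fpkm_table.values())]
pc1, pc2 = pca_scores(samples)
for name, x, y in zip(sample_names, pc1, pc2):
    print(f"Sample {name}: PC1={x:8.2f}  PC2={y:8.2f}")
```

If samples of the same group do not cluster together on such a plot, that is a hint of technical variation such as a batch effect. The sign of each component is arbitrary.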
DGE Discovery mode
[Workflow diagram, for CONDITION A and CONDITION B: SampA.fq / SampB.fq → QC with fastqc → Alignment with TopHat (Reference Index: Genome.fa) → SampA.bam / SampB.bam → Assemble with Cufflinks → SampA.gtf / SampB.gtf → Merge assemblies with Cuffcompare (Reference Gene Annotation: Genes.gtf) → merged.gtf → Quantitation and differential expression with cuffdiff → gene_exp.diff; isoform_exp.diff; … → Store Results → Visualization with cummeRbund]
Discovery Phase: DGE with detection of novel transcripts
• Novel transcripts will be assembled and tested for differential expression.
– Potential identification of new splicing variants
• Key advantage (over microarray):
– Not limited by previous knowledge
– Extends current knowledge banks
• Programs used:
1. TopHat: the mapper
2. Cufflinks: the assembler
3. Cuffdiff: the tester
TopHat – Best practice in Discovery Mode
Same as before, but TopHat needs to be run at least TWICE in order to reliably and consistently identify the splice junctions.
The “TWO-STEP” running of TopHat:
1. Run TopHat as before
2. Re-run TopHat with a list of junctions (see settings in next slide).
① The first run generates a full list of junctions.
② The second run applies the full junction file to all the samples to keep the mapping consistent.
① Combine the sample junctions.bed files into one using Concatenate.
② Turn on “Full parameter list”.
③ Turn on (set “yes” to) the option for “Use own junctions”.
④ Provide junctions files (bed file).
⑤ Turn on the option for “Use Closure Search”.
⑥ Turn on “Use Microexon Search”.
TopHat options – discovery analytical approach
Considerations for Cufflinks
Cufflinks facts
• Optimized for human and mouse genomes
• Uses a parsimonious method to assemble the transcripts +/- known annotation
• Can estimate the transcript abundances
– FPKM: # of Fragments Per Kilobase of exon model per Million mapped fragments
• Can estimate the fragment length distribution
– Not available in Galaxy
• Output file: GTF file
http://cufflinks.cbcb.umd.edu/
Cufflinks – General considerations
Cufflinks is optimized for the human and mouse genomes.
Q1: Is the project in human or mouse?
A: Yes. Action: Nothing to change.
A: No. Action: The default parameters cannot be used. Input all species-specific parameters, e.g. gene-model-related parameters such as intron length.
Cufflinks – General considerations
Q2: Do you want to use a known annotation in the transcriptome assembly and report the novel transcripts assembled?
A: Yes. Action: Use the option “Use Reference Annotation”; select “Use Reference Annotation as Guide”.
A: No. Action: Nothing to change.
Cufflinks – General considerations
Q3: Can I pool samples as one input to Cufflinks?
A: No, because we might lose some isoforms in this manner.
– It is possible that an isoform is only called in one sample, due to uncontrollable variation in the sample preparation process.
– Cufflinks only reports isoforms above a certain abundance threshold (10% of the major transcript).
– A rare isoform will be diluted in the pooled samples, so that it may be missing from the assembly.
           Isoform A (FPKM)   Isoform B (FPKM)   Called?
Sample 1   50                 5                  Yes
Sample 2   50                 3                  No
Pooled     100                8                  No
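The dilution effect in the table can be reproduced with a small Python sketch of the 10% rule. The function name and the way the threshold is applied (minor isoform FPKM divided by the major isoform FPKM) are simplifying assumptions for illustration and do not reproduce Cufflinks' internal filtering.

```python
def called_isoforms(isoform_fpkm: dict, min_fraction: float = 0.10) -> dict:
    """Mark each isoform as called if it reaches min_fraction of the major isoform."""
    major = max(isoform_fpkm.values())
    if major <= 0:
        return {name: False for name in isoform_fpkm}
    return {name: value / major >= min_fraction
            for name, value in isoform_fpkm.items()}


cases = {
    "Sample 1": {"A": 50, "B": 5},
    "Sample 2": {"A": 50, "B": 3},
    "Pooled": {"A": 100, "B": 8},
}
for label, fpkm_values in cases.items():
    print(label, called_isoforms(fpkm_values))
```

Isoform B passes the threshold in Sample 1 alone (5/50 = 10%) but not in the pooled data (8/100 = 8%), which is why samples should be assembled separately.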
Cuffcompare Facts
Cuffcompare
• Compares multiple transcriptomes and reports the similarity between them.
• Available in Galaxy.
Cuffmerge
• A new function implemented in the Cufflinks package.
• Its purpose is to remove assembly artifacts.
• Available using command line tools.
Follow the same instructions to run Cuffdiff and post-processing as in DGE non-discovery mode.
Reproducibility and the value of Workflow
Analysis strategy in “Workflow”
• A Workflow is
– A sequential collection of Galaxy operations to complete an analysis
• Create a “Workflow”
– From scratch
– From current history
– Edit existing workflow
• Share/Publish/Use “Workflow”
Tutorial optional material
Evaluation of transcriptome
Two Scenarios
① De novo assembly of transcriptome Assemble transcriptome without a reference transcriptome/genome
② Reference-guided assembly of transcriptome
Key definitions
Short Reads
Contigs = consensus of overlapping reads
Scaffolds = contigs + known-length gaps • known-length gaps could be estimated by Mate-pair sequencing
Draft transcriptome/genome = a collection of non-ordered scaffolds
De novo assembly of the transcriptome
[Workflow diagram: Samples (RNA-Seq, fastq) → Pre-processing: QC and data cleaning (fastq) → De novo Assembly of Transcriptome (Trans-ABySS *)]
* Only one assembler is shown in this diagram to illustrate the concept of assembly. However, in order to construct a reliable transcriptome, multiple assemblers should be used to generate a consensus assembly.
Trans-ABySS Facts
• ABySS is a de novo, parallel sequence assembler that is designed for short reads.
– Can work on single-end and paired-end reads.
– Is a de Bruijn graph assembler
– It takes two steps:
• Using all possible k-mers from the reads to build the initial contigs
• Using mate-pair information to extend contigs
• Trans-ABySS is a pipeline for analyzing ABySS-assembled contigs from RNA-Seq data.
– Uses several k-mer lengths
• Availability: command line
• Homepage:
– http://www.bcgsc.ca/platform/bioinfo/software/abyss
– http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
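The idea behind the first assembly step, building a de Bruijn graph from all k-mers in the reads, can be shown with a toy Python sketch. This is a conceptual illustration with made-up reads and function names; it is not how ABySS is implemented and it ignores sequencing errors, reverse complements and graph simplification.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list:
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads: list, k: int) -> dict:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["ACGT", "CGTA"]  # two overlapping toy reads
graph = de_bruijn_edges(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Walking the unbranched path AC → CG → GT → TA spells out the contig ACGTA, which is the consensus of the two overlapping reads.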
Reference-guided assembly of transcriptome
(Also known as “transcriptome reconstruction”)
[Workflow diagram: Samples (RNA-Seq, fastq) → Pre-processing: QC and data cleaning (fastq) → Map short reads to reference genome (TopHat) → BAM/SAM → Assemble transcriptome from mapped reads (Cufflinks), with Known Annotation → GTF]
If choosing TopHat and Cufflinks as the assembler, follow the instructions in DGE discovery mode.
Specific notes for prokaryotic samples
• Cufflinks developer: “We don’t recommend assembling bacteria transcripts using Cufflinks at first. If you are working on a new bacteria genome, consider a computational gene finding application such as Glimmer.”
• So for a bacterial transcriptome:
• If the genome is available, do genome annotation first, then reconstruct the transcriptome.
• If the genome is not available, try de novo assembly, followed by gene annotation.
Summary – Hybrid method on transcriptome assembly
Next-generation transcriptome assembly. Martin J. et al. (2011) Nature Reviews Genetics
Comparative Study of Transcriptomes
Q: How can I compare different transcriptomes?
[Diagram: Sample 1.gtf, Sample 2.gtf, Sample 3.gtf, …, Sample N.gtf]
• Cuffcompare: compares individual transcriptomes
• Generic tools: Operate on Genomic Intervals
Galaxy Tools for pre-processing
• Downstream visualization and analysis:
– Will be covered in Tutorial Module 3.
– IGV: Integrative Genomics Viewer
– IPA: Ingenuity Pathway Analysis
– Other analysis packages:
• R packages: ArrayExpressHTS, cummeRbund
Discussion and Questions?
• Get Support at MSI:
– Email: help@msi.umn.edu
– General Questions:
• Subject line: “RISS:…”
– Galaxy Questions:
• Subject line: “Galaxy:…”