RNA-seq data analysis Project
QI LIU
From reads to differential expressionRaw Sequence Data
FASTQ Files
Unspliced MappingBWA, Bowtie Mapped
ReadsSAM/BAM Files
Expression Quantification
DEseq, edgeR, etc
Functional Interpretation
QC by FastQC
QC by RNA-SeQC
Spliced mapping
TopHat, MapSplice
Reads Mapping
Summarize read counts
FPKM/RPKMCufflinks
Cuffdiff
DE testing
Function enrichment
Infer networks
Integrate with other
data
Biological Insights & hypothesis
List of DE
ToolsRead alignment: TopHat2 (http://ccb.jhu.edu/software/tophat/index.shtml) Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) Tools for manipulating SAM files: SAMTOOLS (http://samtools.sourceforge.net/samtools.shtml)
Counting reads in gene level:htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
Downstream analysis:The R statistical computing environment (http://www.r-project.org/)R package: edgeR (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
Visualization: IGV (http://www.broadinstitute.org/software/igv/download)
/home/igptest/path.txtsetpkgs -a python
BOWTIE2=/scratch/liuq6/software/bowtie2TOPHAT2=/scratch/liuq6/software/tophat2
export PYTHONPATH=/scratch/liuq6/software/htseqlib:$PYTHONPATHexport PATH=/home/igptest/exomesequencing/software/:$BOWTIE2:$TOPHAT2:$PATH
transcriptomeindex=/scratch/liuq6/reference/gtfindex/Homo_sapiens.GRCh37.75
genomeindex=/scratch/liuq6/reference/bowtie2_index/hg19
gtffile=/scratch/liuq6/reference/Homo_sapiens.GRCh37.75_chr1-22-X-Y-M.gtf
reference=/home/igptest/exomesequencing/reference/hg19/hg19_chr.fa
VarScan=/home/igptest/exomesequencing/software/VarScan.v2.2.10.jar
• The PATH is an environment variable. It is a colon delimited list of directories that your shell searches through when you enter a command. All executables are kept in different directories on the Linux and Unix like operating systems.
• PYTHONPATH is an environment variable which you can set to add additional directories where python will look for modules and packages
Environment Variables
Two ways
1. source path.txt
source is a bash shell built-in command that executes the content of the file passed as argument.
2. put the scripts in .bashrc .bashrc is a file from which bash reads and executes command automatically when you log in. /home/yourusername/.bashrc
Practice
Before and after you load all the environmentecho $SHELL (the name of the current shell)env (the existing environment variables)echo $HOMEecho $PATHecho $gtffile
Reads alignment
1. un-spliced mapping to
transcriptome and then genome
(bowtie)
2. “Contiguously unmappable" reads are used to predict possible splice junctions.
TopHat
Tophat2
• tophat2 [options]* <genome_index_base> PE_reads_1.fq.gz PE_reads_2.fq.gz
• Options:
-o/--output-dir <string> Sets the name of the directory in which TopHat will
write all of its output. The default is "./tophat_out".
-G/--GTF <GTF/GFF3 file> Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file
--transcriptome-index <dir/prefix> use the previously built transcriptome index files
The basename of the genome index to be searched
TopHat output
• accepted_hits.bam. A list of read alignments in SAM format.
• junctions.bed. A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.
• insertions.bed and deletions.bed. UCSC BED tracks of insertions and deletions reported by TopHat.
Practice
• tophat2 --transcriptome-index=$transcriptomeindex -o ctrl1 $genomeindex /scratch/igptest/RNAseq/data/ctrl1.fastq
Add --transcriptome-only to save time• tophat2 --transcriptome-
index=$transcriptomeindex -o ctrl1 --transcriptome-only $genomeindex /scratch/igptest/RNAseq/data/ctrl1.fastq
SAMTOOLS
• samtools view • samtools sort• samtools index
Practice
• View the header section of bam file samtools view -h accepted_hits.bam
• Extract the alignments with MAPQ>10 samtools view -q 10 accepted_hits.bam
Practice
• Extract the alignments mapped to reverse strand
samtools view -f 16 accepted_hits.bam
Practice
• Index the alignment file (.bai file) samtools index accepted_hits.bam
Use samtools to get all the reads mapped to tp53 and myc. How many reads? Any junction reads?– TP53, 17:7,571,720-7,590,868– MYC, 8:128,748,315-128,753,680
Homework
• Install IGV https://www.broadinstitute.org/igv/download• Use IGV to view the bam file (need the bam
and index files)• Take a look at the reads mapped to genes MYC
and TP53
Homework
• Align ER1.fastq • How many reads mapped to MYC and TP53?
Are there any junction reads? Visualize the results in IGV
• Extract the alignments mapped to MYC and generate a new BAM file
• Extract the alignments with MAPQ>10 and generate a new BAM file and index the file
Ht-seq Given a file with aligned sequencing reads and a list of genomic features, a common task is
to count how many reads map to each feature.
Deal with reads that overlap more than one features
Ht-seq• Options
Ht-seq
• htseq-count [options] <alignment_file> <gff_file>
Output: a table with counts for each feature, followed by the special counters, which count reads that were not counted for any feature for various reasons. The names of the special counters all start with a double underscore, to facilitate filtering.
__no_feature: reads (or read pairs) which could not be assigned to any feature (set S as described above was empty).
__ambiguous: reads (or read pairs) which could have been assigned to more than one feature and hence were not counted for any of these (set Shad mroe than one element).
__too_low_aQual: reads (or read pairs) which were skipped due to the –a option __not_aligned: reads (or read pairs) in the SAM file without alignment __alignment_not_unique: reads (or read pairs) with more than one reported alignment. These reads
are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times, unless they getv filtered out by due to the -a option.)
Practice
• htseq-count -s no -f bam -i gene_name ctrl1/accepted_hits.bam $gtffile > ctrl1.count
Homework
• Summarize the read counts to the gene level for ER1
• Summarize the read counts to the exon level for ctrl1
edgeR
• Install R in your laptop• Install edgeR package
source("http://bioconductor.org/biocLite.R") biocLite("edgeR")
edgeR
• Raw counts• Small sample size• Complicated experimental design
Negative binomial distribution
• Technical replicates –Poisson distribution var(ygi)=gi
• Biological variances var(ygi)=gi+
Normalization
• TMM (a trimmed mean of M-values)
minimize the log-fold changes between the samples for most genes.
Normalization• RNA-seq measures the relative abundance of
each gene in each RNA sample, but RNA output per cell
FPKM: g2-g10 TMM: g1
N Tg1 1 1000g2 1 1g3 1 1g4 1 1g5 1 1g6 1 1g7 1 1g8 1 1g9 1 1g10 1 1
Differential analysisRead Data
Normalization
Estimate the dispersion
Differential expression
preprocess
visualize
MDS
PCA
Heatmap
calcNormFactors
DGEList
estimateGLMCommonDisp
estimateGLMTrendedDisp
estimateGLMTagwiseDisp
glmFit
glmLRT