![Page 1: Differential Gene Expression Analysis using RNA-Seq Dataalumni.cs.ucr.edu/~elenah/courses/CSCI693/Lecture3.pdf · Workflow of RNA-Seq Differential Gene Expression Analysis Adapted](https://reader030.vdocument.in/reader030/viewer/2022040411/5eda9f3c09f66a09130ba563/html5/thumbnails/1.jpg)
Differential Gene Expression Analysis using RNA-Seq Data
RNA-Seq Data
1. Biologists collect mRNA from many cells
2. Cells come from two or more biological samples (different tissues)
RNA-Seq Data Generation
3. The collected mRNA is fragmented, size-selected, and sequenced
4. Sequenced reads are mapped back to the reference genome
Mapping RNA-Seq Reads
• Mapping a read to a reference genome means finding the position in the genome that the read comes from
• Reads containing splice junctions cannot be mapped to a reference genome directly
• Ways to map reads with splice junctions:
– Use special algorithms/methods/techniques
– Map the reads to the annotated transcriptome
Mapping RNA-Seq Reads
• Example: the read TCAAG occurs at position 10 in the given reference genome (if position count starts with 1)
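The example above can be sketched as an exact-match search that reports 1-based positions. The reference string here is hypothetical (the slide's reference sequence was in a figure that is not reproduced in this transcript) and was chosen so that TCAAG lands at position 10. `find_exact_matches` is an illustrative helper, not part of any mapping tool; real mappers use indexed search and tolerate mismatches.

```python
def find_exact_matches(read, reference):
    """Return all 1-based positions where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = reference.find(read, start + 1)  # step by one to catch overlaps
    return positions


# Hypothetical reference, not the one shown on the slide.
reference = "ACGTTGCAATCAAGGCT"
print(find_exact_matches("TCAAG", reference))  # [10]
```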
Mapping RNA-Seq Reads
• Since mRNA is collected from many cells:
– reads cover the entire lengths of exons
– overlapping reads come from different mRNA molecules
Mapping concerns:
1. A read that is mapped to two or more locations in a reference genome is called ambiguous and is discarded from the analysis
2. Two reads are called copy-duplicates if they are mapped to the same start position in the genome. These may be products of the polymerase chain reaction (PCR), which is used to make copies of mRNA segments so that sequencing is possible. Only one read from each set of copy-duplicates is used in the analysis
• The number of reads mapped to a single gene, transcript, or exon (the read count) is used to estimate differential gene expression
• Given two (or more) samples, find the read count in each sample and use statistics to infer whether the counts differ significantly
Mapping RNA-Seq Reads
• Estimating a read count for a transcript of a gene is not trivial:
– Alternative splicing (if a read is mapped to an exon shared by two or more transcripts, we cannot be certain which transcript the read comes from)
– Overlapping genes (uncertainty in counting a read that maps to a region belonging to two or more overlapping genes)
Mapping RNA-Seq Reads
• Estimating a read count for a transcript of a gene is not trivial
• To remedy:
– Estimate the read count for each gene or exon instead
– Use reads containing splice junctions
– In some cases, discard the read from the analysis
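A minimal sketch of the gene-level remedy: count a read for a gene only when it overlaps exactly one gene, and set it aside when it overlaps two or more. The gene names, coordinates, and the helper `count_reads_per_gene` are hypothetical, and this is not how any particular counting tool is implemented.

```python
def count_reads_per_gene(read_intervals, genes):
    """Count reads per gene; reads overlapping several genes are set aside.

    read_intervals: list of (start, end) pairs, 1-based inclusive.
    genes: dict mapping gene name -> (start, end), 1-based inclusive.
    """
    counts = {name: 0 for name in genes}
    discarded = 0
    for r_start, r_end in read_intervals:
        # Two closed intervals overlap when each starts before the other ends.
        hits = [name for name, (g_start, g_end) in genes.items()
                if r_start <= g_end and g_start <= r_end]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            discarded += 1  # overlapping genes: origin of the read is uncertain
    return counts, discarded


# Hypothetical annotation with two overlapping genes.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 169), (460, 509), (700, 749), (950, 999)]
counts, discarded = count_reads_per_gene(reads, genes)
print(counts, discarded)  # {'geneA': 1, 'geneB': 1} 1
```

The read at 460-509 falls in the region shared by both genes and is discarded; the read at 950-999 overlaps no gene and is simply not counted.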
Workflow of RNA-Seq Differential Gene Expression Analysis
Adapted from “RNA-seq Data Analysis: A Practical Approach” by Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong (Chapman & Hall/CRC Mathematical and Computational Biology)
Preprocessing
• Adapter trimming
• Trimming of low-quality read ends (3’ end)
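The two preprocessing steps can be sketched as follows. This is a simplified illustration: the adapter is matched exactly, the quality threshold of 20 is an assumed value, and both helper names are hypothetical. Dedicated trimming tools also handle partial adapter matches and sequencing errors, which this sketch does not.

```python
def trim_adapter(read, adapter):
    """Remove the first exact copy of `adapter` and everything after it."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


def trim_low_quality_3prime(read, qualities, min_quality=20):
    """Drop bases from the 3' end while their Phred quality is below min_quality."""
    end = len(read)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return read[:end], qualities[:end]


# Hypothetical read with an adapter fragment at its 3' end.
print(trim_adapter("ACGTACGTAGATCGG", "AGATCGG"))  # ACGTACGT

# Hypothetical Phred scores; the last two bases fall below the threshold.
print(trim_low_quality_3prime("ACGTAC", [30, 32, 31, 28, 12, 8]))
# ('ACGT', [30, 32, 31, 28])
```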
Adapter Trimming
More on adapter trimming:
http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data
http://training.bioinformatics.ucdavis.edu/docs/2013/02/bootcamp/galaxy/_downloads/qa-and-i.pdf
Tools: FastQC, Cutadapt, qrqc, Scythe
Quality Control: Nucleotide Profile
Tools: FastQC, qrqc
Quality Control: Base Quality Profile
Quality Control: k-mer Enrichment
Quality Control: Read Length Distribution after Trimming
Quality Control: Statistics
Mapping RNA-Seq Reads
• Mapping to a reference genome
• Mapping to a transcriptome
• Gene annotation information (start/end of exons in known genes)
Mapping RNA-Seq Reads
1. Genes are located on both strands of DNA
2. Reads are always sequenced from 5’ to 3’
3. Mapping is performed against the (+) strand of DNA only
4. Also map the reverse complement (rc) of each read: ATTGC, rc: GCAAT
Sequence shown in the figure: GCAATCTGGC
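A small sketch of the strand handling on this slide: the read and its reverse complement are both searched on the (+) strand, and a hit of the reverse complement is reported as a (−) strand mapping. It uses the read ATTGC and the sequence GCAATCTGGC from the slide; the helper names and the exact-match search are simplifying assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def map_to_plus_strand(read, reference):
    """Return (strand, 1-based position) for exact hits on the (+) reference."""
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        pos = reference.find(query)
        while pos != -1:
            hits.append((strand, pos + 1))
            pos = reference.find(query, pos + 1)
    return hits


reference = "GCAATCTGGC"
print(reverse_complement("ATTGC"))             # GCAAT
print(map_to_plus_strand("ATTGC", reference))  # [('-', 1)]
```

The read itself does not occur in the (+) sequence, but its reverse complement matches at position 1, so the read is taken to come from the (−) strand.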
Mapping RNA-Seq Reads
Ambiguous Reads (identify and discard)
• A read that is mapped with the same (smallest) number of mismatches to two or more locations in the genome
• A read that is mapped to both the + (positive) and – (negative) strands with the same smallest number of mismatches
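The two definitions above can be sketched with a brute-force search: every window on both strands is scored by its number of mismatches, and a read is ambiguous when the smallest mismatch count is reached more than once. The reference string, the mismatch limit of 1, and the helper names are assumptions for illustration; indels are ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hits(read, reference, max_mismatches=1):
    """Return (strand, 1-based position, mismatches) for all best alignments."""
    rc = read.translate(COMPLEMENT)[::-1]
    candidates = []
    for strand, query in (("+", read), ("-", rc)):
        for i in range(len(reference) - len(query) + 1):
            d = hamming(query, reference[i:i + len(query)])
            if d <= max_mismatches:
                candidates.append((d, strand, i + 1))
    if not candidates:
        return []
    best = min(d for d, _, _ in candidates)
    return [(strand, pos, d) for d, strand, pos in candidates if d == best]


def classify(read, reference, max_mismatches=1):
    """Label a read as 'unmapped', 'unique', or 'ambiguous'."""
    hits = best_hits(read, reference, max_mismatches)
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "ambiguous"


# Hypothetical reference containing a repeat and a reverse-complement pair.
reference = "AAAAACCCCCAAAAAGGGGG"
print(classify("AAAAA", reference))  # ambiguous: two locations on the + strand
print(classify("CCCCC", reference))  # ambiguous: matches both + and - strands
print(classify("ACCCC", reference))  # unique
```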
Mapping RNA-Seq Reads
[Figure: examples of ambiguous and unique read mappings]
Mapping RNA-Seq Reads
• Sequencing instruments require a certain quantity of mRNA
• Polymerase chain reaction (PCR) produces multiple copies of mRNA segments
• Copies of the same segment are sequenced, producing copy duplicates (a product of PCR, unrelated to the mRNA abundance in the biological sample)
Mapping RNA-Seq Reads
• Two reads are called copy-duplicates if they are mapped to the same start position in the genome (identify and count only one read)
• Copy duplicates can arise only within a single sample, so reads from different samples that start at the same position are not copy duplicates
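A minimal sketch of copy-duplicate removal under the definition above: within each sample, only the first read seen at a given start position is kept. The tuple layout, sample labels, and read IDs are hypothetical; a real pipeline would presumably also key on chromosome and strand, which goes beyond what the slide states.

```python
def remove_copy_duplicates(alignments):
    """Keep one read per (sample, start position) pair.

    alignments: list of (sample, start, read_id) tuples.
    """
    seen = set()
    kept = []
    for sample, start, read_id in alignments:
        key = (sample, start)  # duplicates are only defined within a sample
        if key not in seen:
            seen.add(key)
            kept.append((sample, start, read_id))
    return kept


# Hypothetical alignments: r1 and r2 are copy duplicates, r4 is not.
alignments = [
    ("s1", 100, "r1"),
    ("s1", 100, "r2"),
    ("s1", 250, "r3"),
    ("s2", 100, "r4"),
]
print(remove_copy_duplicates(alignments))
# [('s1', 100, 'r1'), ('s1', 250, 'r3'), ('s2', 100, 'r4')]
```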
Mapping RNA-Seq Reads
• Collect mapping statistics:
– Total reads submitted for mapping
– Total unique reads mapped
– Total ambiguous reads mapped
– Total copy duplicates
– Distribution of reads by mismatches/indels
– Total reads mapped to splice junctions
– GC bias in mapped reads
– Depth of coverage
– 3’ end gene bias (more reads mapped to 3’ end)
Counting Reads: HTSeq
Normalization
• Raw read count has to be normalized to enable comparison between samples
• RPKM: Reads Per Kilobase of gene length per Million mapped reads
• RPKM is the total number of raw reads mapped to a gene, divided by the length of the gene in kilobases and by the total number of mapped reads in millions
• Sometimes the mappable length is used instead (since ambiguous reads are discarded, repeated regions within genes are not covered by reads)
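The RPKM definition above translates directly into a short function. The gene length and read counts in the example are made-up numbers, and `rpkm` is an illustrative helper rather than a function from any package.

```python
def rpkm(gene_read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene length per Million mapped reads."""
    length_kb = gene_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return gene_read_count / (length_kb * mapped_millions)


# Hypothetical 2 kb gene in two samples sequenced to different depths.
print(rpkm(500, 2_000, 10_000_000))    # 25.0
print(rpkm(1_000, 2_000, 20_000_000))  # 25.0
```

The raw counts differ by a factor of two, but so does the sequencing depth, so the normalized values agree. To use the mappable length instead, pass it in place of `gene_length_bp`.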