Rickard Sandberg
Gene Expression Analyses
Assistant Professor, Ludwig Institute for Cancer Research, Department of Cell and Molecular Biology, Karolinska Institutet
Outline
- microarrays
- RNA-Seq
- Common gene expression analyses steps
- clustering of samples
- differential expression tests
- enrichment tests
Transcriptome analyses
- rRNAs (dominating, ~95%)
- mRNAs (~5%)
- long non-coding RNAs (e.g. lincRNAs) (~0.05%)
- snoRNAs, snRNAs
- microRNAs, piRNAs
Different protocols identify different parts of the transcriptome
- PolyA selection
- Ribominus (removal of ribosomal RNAs), not-so-random hexamers or DSN
- small RNA protocol
DNA microarrays
- oligonucleotide arrays (Affymetrix, Agilent, Illumina etc.)
- cDNA microarrays (competitive hybridization)
Important Considerations
§ Microarrays were designed based on EST clusters
§ Probes mapping at multiple locations
§ Multiple probe sets mapping to the same gene
§ Many projects curated microarray probes to only allow uniquely mapping ones, e.g. customCDF
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
Basis of Microarrays
Steps in microarray analyses
§ Start with RAW data (for Affy arrays = CEL files)
§ Normalize
  → remove systematic strength biases
  → often quantile normalization
§ Background adjust/transform
  → tries to estimate signal from background
  → log2 transform (ratios problem, stabilize variance)
§ Gene (or probe set) summarization
  → median polish (fancy average of probes targeting the same gene/transcript/probe set)
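The normalization, transform and summarization steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation used by any particular microarray package; the function names and the toy intensities are hypothetical, and background adjustment is left out.

```python
import math
from statistics import mean, median

def quantile_normalize(arrays):
    """arrays: one list of intensities per array, all of the same length.
    Every array is given the same distribution: the mean of the sorted
    values at each rank (ties are broken by position)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(a[i] for a in sorted_arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

def log2_transform(values):
    """log2 transform; values must be positive."""
    return [math.log2(v) for v in values]

def median_polish_summary(rows, n_iter=10):
    """rows: probes x arrays matrix (log2 values) for one probe set.
    Tukey's median polish; returns one summary value per array
    (overall effect + array effect)."""
    resid = [list(r) for r in rows]
    n_probes, n_arrays = len(resid), len(resid[0])
    overall = 0.0
    probe_eff = [0.0] * n_probes
    array_eff = [0.0] * n_arrays
    for _ in range(n_iter):
        for i in range(n_probes):                      # sweep probe medians
            m = median(resid[i])
            probe_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = median(array_eff)
        overall += m
        array_eff = [c - m for c in array_eff]
        for j in range(n_arrays):                      # sweep array medians
            m = median(resid[i][j] for i in range(n_probes))
            array_eff[j] += m
            for i in range(n_probes):
                resid[i][j] -= m
        m = median(probe_eff)
        overall += m
        probe_eff = [r - m for r in probe_eff]
    return [overall + c for c in array_eff]

# Hypothetical raw intensities for two arrays with three probes each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 9]])
logged = [log2_transform(a) for a in normalized]
```

Median polish is robust to a single outlying probe because it works with medians rather than means, which is the sense in which it is a "fancy average".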
Gene Expression - Microarray data
§ Repositories of raw and processed data:
  → Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo/
  → ArrayExpress http://www.ebi.ac.uk/microarray-as/ae/
§ Databases with Gene Expression Atlases
  → Human, Mouse and Rat Tissue Atlas: Symatlas / BioGPS http://biogps.gnf.org/
  → Cancer gene expression atlas: Oncomine www.oncomine.org
In which tissues is my gene expressed? Use BioGPS (formerly Symatlas)
http://biogps.gnf.org/
Finding experiments where my gene is differentially expressed
ArrayExpress, GEO
§ Do not use updated CDFs (probe to transcript mappings)
§ Constantly evolving (hard to reproduce years later)
§ Offer no quality control
§ Limited capabilities for more comprehensive analyses
What are the methods measuring?
• Expressed Sequence Tags
• Traditional 3’UTR focused microarrays
• Exon and Tiling Arrays
• Deep Sequencing using Illumina/Solexa, SOLiD, (454)
Isolate polyA+ RNA
mRNA-seq protocol
Wang et al. 2009 Nat Rev Genet
§ polyA+ RNAs
§ rRNA- RNAs
§ short RNAs (e.g. miRNAs)
§ Ribosome footprint sequencing
§ GRO-Seq (Global Run On sequencing)
§ CLIP-Seq (RNA-protein interactions)
§ non-RNA applications: ChIP-Seq, DNase hypersensitive sites, ...
Strand-specific RNA-Seq protocols
Mapping of splice junctions
1. Compile sets of junctions: Exon n | GTAAGT ... AG | Exon n+1
2. Map reads towards genome + junction compilation:
   genome chromosome FASTA files + known and putative splice junctions FASTA file
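Compiling a junction set can be sketched as follows: for every upstream/downstream exon pair, join the end of the first exon to the start of the second, so that reads spanning the junction have something to align to. This is only an illustration; `junction_sequences`, `flank`, the coordinate convention (0-based, half-open) and the toy genome are assumptions, not part of any specific mapper.

```python
def junction_sequences(genome, exons, flank):
    """genome: chromosome sequence as a string.
    exons: list of (start, end) intervals, 0-based half-open, sorted by start.
    flank: number of bases taken from each side of a junction.
    Returns {(donor_end, acceptor_start): junction_sequence} for all
    pairings of an upstream exon with a downstream exon."""
    junctions = {}
    for i, (start1, end1) in enumerate(exons):
        for start2, end2 in exons[i + 1:]:
            left = genome[max(start1, end1 - flank):end1]     # last bases of upstream exon
            right = genome[start2:min(end2, start2 + flank)]  # first bases of downstream exon
            junctions[(end1, start2)] = left + right
    return junctions

# Toy genome: exon, intron starting with GTAAGT and ending with AG, exon.
genome = "ACGTTGCA" + "GTAAGTCCCCCCCCAG" + "TTGACCGA"
exons = [(0, 8), (24, 32)]
junctions = junction_sequences(genome, exons, flank=4)
```

In practice `flank` would be chosen just below the read length, so that a read can only align to a junction sequence if it actually crosses the junction.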
TopHat first method: identifying the transcriptome
1. Identify candidate exons (A, B, C) via genomic mapping
2. Generate possible pairings of exons
3. Align “unmappable” reads to possible junctions
Longer reads
[Figure residue: read >HWI-EAS229_75_30DY0AAXX:7:1:0:949; sequences GATGTTCTCAGTGTCC, GATGTAATCAGTGTCC, AACCCTCTCAGTGTCC; label "Very long (100Kb+) intron"]
By segmenting the long reads and mapping the segments independently, we can look harder for junctions we might have missed with shorter reads.
Running time is independent of intron size.
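The idea of segmenting a read and mapping the pieces independently can be illustrated with a toy example. This sketch uses exact string search on a tiny made-up genome; real mappers use an index and tolerate mismatches, and the names `map_segments` and `implied_gaps` are hypothetical. Segments that themselves contain a junction would not map here, which is the case the junction-focused step is meant to rescue.

```python
def map_segments(read, genome, seg_len):
    """Split the read into consecutive segments of seg_len bases and map each
    one independently by exact search. Trailing bases that do not fill a whole
    segment are ignored. Returns (offset_in_read, genome_position) pairs, with
    genome_position == -1 when a segment has no exact match."""
    hits = []
    for offset in range(0, len(read) - seg_len + 1, seg_len):
        segment = read[offset:offset + seg_len]
        hits.append((offset, genome.find(segment)))
    return hits

def implied_gaps(hits, seg_len):
    """For consecutive mapped segments, report (gap_start, gap_end, gap_length)
    whenever they are separated on the genome: a candidate intron."""
    gaps = []
    for (_, pos1), (_, pos2) in zip(hits, hits[1:]):
        if pos1 >= 0 and pos2 >= 0:
            gap = pos2 - (pos1 + seg_len)
            if gap > 0:
                gaps.append((pos1 + seg_len, pos2, gap))
    return gaps

# Toy genome: two 8-base exons separated by a 16-base intron.
genome = "ACGTTGCA" + "GTAAGTCCCCCCCCAG" + "TTGACCGA"
read = "ACGTTGCA" + "TTGACCGA"          # spliced read covering both exons
hits = map_segments(read, genome, seg_len=8)
gaps = implied_gaps(hits, seg_len=8)
```

Because each segment is located on its own, the work does not grow with the distance between the two mapped positions, which matches the remark that running time is independent of intron size.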
Mapping to transcriptome
[Figure: gene structure (exons, introns, 5’UTR, 3’UTR) on genomic DNA (W and C strands); transcription gives pre-mRNA; RNA processing (splicing, polyadenylation) gives polyadenylated mRNA]
Microexons and junction coverage
[Figure: gene structure with 2 or more splice junctions within the same read; in-house mapping versus TopHat mapping]
Different read lengths will have different problems!
Finding novel non-annotated genes or transcript variants
Example of STAR aligned single-cell RNA-Seq data
Mapping speed: 308 M reads / hour
% uniquely mapping: 60
% multimapping: 25
% unmapped: 15
281 719 splice junctions: 279 356 with GT/AG, 2 123 with GC/AG, 215 with AT/AC
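The junction counts above are grouped by the dinucleotides at the two ends of the intron. A small sketch of that classification and of the resulting fractions follows; `intron_motif` is a hypothetical helper, and the counts are the ones quoted on the slide.

```python
def intron_motif(intron_seq):
    """Return the donor/acceptor dinucleotides of an intron, e.g. 'GT/AG'."""
    seq = intron_seq.upper()
    return f"{seq[:2]}/{seq[-2:]}"

# Counts quoted on the slide.
total_junctions = 281_719
motif_counts = {"GT/AG": 279_356, "GC/AG": 2_123, "AT/AC": 215}

fractions = {motif: n / total_junctions for motif, n in motif_counts.items()}
other = total_junctions - sum(motif_counts.values())   # junctions outside the three classes

for motif, frac in fractions.items():
    print(f"{motif}: {100 * frac:.2f}%")
print(f"other: {other} junctions")
```

With these numbers, more than 99% of the detected junctions carry the GT/AG motif.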
[Figure: log10(reads) coverage tracks for testes, liver, skeletal muscle and heart; transcripts AK074759, BC011574, AK092689; panels 3A and 3B]
RNA-Seq generates quantitative expression estimates
[Figure: brain / UHR expression ratio, RNA-Seq reads versus Taqman, on log axes from 10^-4 to 10^4; <10M reads; R = 0.953, slope = 0.933]
Mortazavi et al. Nat Methods 2008 Ramskold et al. PLoS Comp Biol 2009
[Figure: RPKM by region (axis 0 to 15): exon 12.3, intron 0.13, intergenic 0.10; label "150x"]
Wang*, Sandberg* et al. Nature 2008
How gene expression levels are estimated
[Figure: gene A (2 kb transcript) and gene B (600 bp transcript) are fragmented and sequenced into reads (ACGCG..., TCGAG..., AGGTA..., CCGTG..., CTGCG...)]
The number of fragments is proportional to the abundance and length of the transcript.
Normalize for different transcript lengths and different sequencing depths in different samples.
RPKM (reads per kilobase and million mappable reads):
Given 10 million mappable reads and 500 reads mapping to gene A:
RPKM, gene A: 500 reads x 1000/2000 x 10^6/10^7 = 500 / (2 x 10) = 25 RPKM
RPKM roughly corresponds to transcripts per cell (Mortazavi et al. 2008), assuming a standard cell with ~300,000 transcripts
Fragments PKM (FPKM)
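The RPKM calculation can be written as a one-line function. This is a minimal sketch; the function name is illustrative, and the gene B read count is a made-up number used only to show the effect of transcript length.

```python
def rpkm(reads, transcript_length_nt, mappable_reads):
    """Reads per kilobase of transcript and million mappable reads."""
    return reads * 1e9 / (transcript_length_nt * mappable_reads)

mappable_reads = 10_000_000

# Gene A from the slide: 500 reads on a 2 kb transcript.
rpkm_a = rpkm(500, 2000, mappable_reads)      # 25 RPKM, as on the slide

# Gene B (600 bp transcript): the read count here is hypothetical.
rpkm_b = rpkm(150, 600, mappable_reads)

print(rpkm_a, rpkm_b)
```

In this example gene B gets fewer reads than gene A but the same RPKM, because its transcript is shorter. FPKM is computed the same way with fragments (read pairs) counted instead of individual reads.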
Gene quantification and mRNA copy numbers in cells
$$\frac{C}{N} = \frac{X L}{T}, \qquad X = \frac{C\,T}{N\,L} = \frac{R}{10^{3}} \cdot \frac{T}{10^{6}} = \frac{R\,T}{10^{9}}$$
C, number of reads mapping to transcript
N, total number of sequenced reads
X, copies per cell of transcript
T, total length of transcriptome
L, transcript length
R, RPKM (reads per kilobase and million mappable reads)
T can be estimated from
1. starting amount of mRNA
2. spiked-in controls
3. estimated transcriptome length: 300,000 transcripts of around 1500 nt each -> 4.5 x 10^8 nt
- 1 RPKM ~ 0.5 transcripts per cell
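The conversion from RPKM to copies per cell follows from X = C T / (N L) together with R = 10^9 C / (N L), which gives X = R T / 10^9. A short numeric check is sketched below; the function names are illustrative, and the same read total N is assumed in both formulas.

```python
def copies_from_counts(reads, total_reads, transcriptome_length_nt, transcript_length_nt):
    """X = C * T / (N * L): copies per cell from raw read counts."""
    return reads * transcriptome_length_nt / (total_reads * transcript_length_nt)

def copies_from_rpkm(rpkm, transcriptome_length_nt):
    """X = R * T / 10^9: copies per cell from an RPKM value."""
    return rpkm * transcriptome_length_nt / 1e9

# Transcriptome length estimate from the slide: 300,000 transcripts of ~1500 nt.
T = 300_000 * 1500                      # 4.5 x 10^8 nt

one_rpkm = copies_from_rpkm(1.0, T)     # 0.45, i.e. roughly 0.5 transcripts per cell

# Gene A from the RPKM example: 500 reads, 2 kb transcript, 10 M reads, 25 RPKM.
x_counts = copies_from_counts(500, 10_000_000, T, 2000)
x_rpkm = copies_from_rpkm(25.0, T)
```

Both routes give the same number of copies for gene A, which is a quick way to confirm that the two forms of the formula agree.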
Depth needed for accurate expression level estimation
[Figure, two panels, x-axis: million mapped reads (1 to 45).
 Panel 1: percentage of genes within ±20% of final expression, by expression level: 1-9 RPKM (n=4338), 10-29 RPKM (n=3048), 30-99 RPKM (n=2817), 100-999 RPKM (n=1469), 1000-6705 RPKM (n=56).
 Panel 2: percentage of genes within a given fold-change of final expression: 2-fold, 1.5-fold, 1.2-fold, 1.1-fold, 1.05-fold.]
Mortazavi et al. 2008; Ramskold/Kavak et al. 2011 (book chapter)
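Why lowly expressed genes need more depth can be illustrated with a back-of-the-envelope calculation. The sketch below assumes read counts follow pure Poisson sampling, which is a simplification and not the analysis behind the figure; under that assumption the relative standard deviation of a count is 1/sqrt(reads). All function names are hypothetical.

```python
import math

def expected_reads(rpkm, transcript_length_nt, mapped_reads):
    """Expected read count: RPKM x length in kb x depth in millions."""
    return rpkm * (transcript_length_nt / 1000) * (mapped_reads / 1e6)

def poisson_cv(reads):
    """Relative standard deviation of a Poisson count with this mean."""
    return 1 / math.sqrt(reads)

def depth_for_cv(rpkm, transcript_length_nt, target_cv):
    """Mapped reads needed so that the Poisson CV drops to target_cv."""
    reads_needed = 1 / target_cv ** 2
    return reads_needed / (rpkm * transcript_length_nt / 1000) * 1e6

# A 1500 nt transcript at 1 RPKM and at 100 RPKM, with 20 M mapped reads.
for level in (1, 100):
    reads = expected_reads(level, 1500, 20e6)
    print(level, "RPKM:", reads, "reads, CV", round(poisson_cv(reads), 3))
```

Under this assumption the sampling noise for a gene shrinks with the square root of its read count, so a tenfold lower expression level needs about tenfold more reads for the same precision.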
RNA sequencing of blastocyst-derived cell lines
Read counts for selected genes
Gene    ES     TS     XEN    EpiSC
Nanog   6525   20     1      263
Cdx2    124    6256   1      1
Sox17   11     5      9814   99
Sox3    151    1234   6      796
Shh     0      0      0      1
Ihh     4      12     107    17
Dhh     10     212    575    80
Significance of expression level
background RPKM ~ 0.05 RPKM
detection level of 0.3 RPKM
an average 1,500 nt transcript
20 M uniquely mapping reads
background model: 0.05 x 1.5 x 20 = 1.5 reads
expressed at 0.3 RPKM: 0.3 x 1.5 x 20 = 9 reads
binomial test for 9 reads out of 20 M mapping to the transcript, given a background probability of 1.5 / (20 x 10^6), gives a p-value of 2.8e-5
expressed at 1 RPKM: 1 x 1.5 x 20 = 30 reads
[Figure labels: 0.05 RPKM, 1 RPKM]
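The binomial test in this example can be reproduced with the standard library. The sketch below sums the upper tail P(X >= 9) for 20 million reads when each read hits the transcript with probability 1.5 / (20 x 10^6), the background rate worked out above; `binomial_upper_tail` is an illustrative helper and the sum is truncated after a fixed number of terms.

```python
import math

def binomial_upper_tail(k_obs, n, p, extra_terms=200):
    """P(X >= k_obs) for X ~ Binomial(n, p), with 0 < p < 1.
    Terms are built up from P(X = 0) in log space; the sum stops
    extra_terms after k_obs (or at n), where terms are negligible
    for small n * p."""
    log_pmf = n * math.log1p(-p)                 # log P(X = 0)
    log_odds = math.log(p) - math.log1p(-p)
    upper = min(n, k_obs + extra_terms)
    tail = 0.0
    for k in range(upper + 1):
        if k >= k_obs:
            tail += math.exp(log_pmf)
        if k < n:
            # step from P(X = k) to P(X = k + 1)
            log_pmf += math.log((n - k) / (k + 1)) + log_odds
    return tail

n_reads = 20_000_000
background_p = 1.5 / n_reads                     # 1.5 expected background reads

p_value = binomial_upper_tail(9, n_reads, background_p)
print(f"{p_value:.1e}")
```

The result is about 2.8e-5, matching the p-value quoted for a transcript expressed at 0.3 RPKM.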
Mixed species/strains experiments
§ Mixed species experiments allow mapping of host and pathogen interactions
§ Parasite-host interactions
§ Tumor-stroma interactions
Allele-sensitive RNA-seq using mouse crosses
Fusion events, e.g. translocations in cancer
Ozsolak and Milos, Nat Rev Genet 2011
Early Quality Control
[Figure: fraction of reads (0.0 to 1.0) mapping to the 20% at 5', the middle, and the 20% at 3' of genes, for SMARTer, optimized, and variants #1 to #4]
Supplementary Figure 6. Read coverage across genes in single-cell RNA-Seq data. Fraction of reads mapping to the 20% 5’ most, the 20% 3' most, and the 60% in the middle region for all individual single-cell transcriptome data from HEK293T cells. Variant protocols are as the optimized except for differences in volume of TSO used (variant #1 uses 2 ul instead of 1 ul), template switching oligo (variant #2 uses rGrG+N, variant #4 uses rGrGrG) or preamplification enzyme (variant #3 uses Advantage 2).
[Figure, three panels, each for SMARTer, optimized, and variants #1 to #4: (A) fraction of mapped reads by number of mismatches (1 to 9); (B) read mapping (STAR to hg19), reads (%): no match, multimapping, uniquely mapping; (C) genomic regions, fraction of mapped reads: intergenic, intronic, exonic]
Supplementary Figure 2. Mapping statistics for single-cell libraries generated using SMARTer, optimized Smart-Seq and variants of the optimized protocol. (A) The fraction of uniquely aligned reads with 1 to 9 mismatches for each single-cell RNA-Seq library. (B) Percentage of reads that could be aligned uniquely, aligned to multiple genomic coordinates (multimapping) or did not align for all single-cell RNA-Seq libraries. (C) The fraction of uniquely aligned reads that mapped to exonic, intronic or intergenic regions (annotations based on RefSeq gene models). Variant protocols are as the optimized except for differences in volume of TSO used (variant #1 uses 2 ul instead of 1 ul), template switching oligo (variant #2 uses rGrG+N, variant #4 uses rGrGrG) or preamplification enzyme (variant #3 uses Advantage 2).
Biological QC: check that replicates agree and that samples group by origin/type
Hierarchical clustering
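A minimal sketch of sample clustering for this kind of QC is shown below: pairwise distance is 1 minus the Pearson correlation of expression vectors, and clusters are merged by average linkage. The sample names and values are made up for illustration, the helper names are hypothetical, and a real analysis would use an established clustering library.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equally long vectors (non-constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(samples):
    """samples: dict of sample name -> expression vector (e.g. log2 values).
    Repeatedly merges the two closest clusters; returns the merge history
    as (cluster_a, cluster_b, distance) tuples."""
    def dist(a, b):
        return mean(1 - pearson(samples[i], samples[j]) for i in a for j in b)

    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        pairs = [(dist(a, b), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        d, a, b = min(pairs)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical expression values for two tissues with two replicates each.
samples = {
    "liver_1": [10, 9, 1, 2, 5],
    "liver_2": [9, 10, 2, 1, 5],
    "brain_1": [1, 2, 10, 9, 5],
    "brain_2": [2, 1, 9, 10, 5],
}
merges = average_linkage(samples)
```

If the data pass this QC, the earliest merges join replicates of the same origin, and samples of different origin are only joined at a much larger distance.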
PCA / SVD
[Figure: samples plotted on SVD component 1 versus SVD component 2 (axes -100 to 150); groups PC3 (n=4), T24 (n=4), Lncap (n=4)]
NCI60 cell line expression clustering
[Figure: hierarchical clustering of the NCI60 cell lines with a color scale from -1.00 to 1.00; clusters labeled by tissue of origin: leukaemia, colon, melanoma, CNS, renal, ovarian, breast, prostate, non-small-lung]
Note: the leaf ordering is fairly arbitrary.
Be careful about interpreting high-order clusters.
Singular Value Decomposition (SVD)
[Figure: a genes x arrays expression matrix (14 arrays, e_0m to e_390m in 30-minute steps) decomposed into three matrices: genes x eigenarrays, eigenarrays x eigengenes, and eigengenes x arrays]
QC: Similarities between replicates
SVD Analysis of Mouse T-cell Stimulation
[Figure: sample projection on eigengene 1 (52%) versus eigengene 2 (31%) for samples at 0 hr, 6 hr and 48 hr; eigengene profiles across 0 hr, 6 hr, 48 hr]
The first two eigengenes capture 83% of variation (52% + 31%)
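The percentage attached to each eigengene is its squared singular value divided by the sum of all squared singular values. The sketch below computes the leading fractions with plain power iteration on the samples x samples matrix A^T A; it is a teaching illustration with hypothetical names and a made-up matrix, no centering is applied, and a numerical library would normally be used for the SVD itself.

```python
import math

def gram(matrix):
    """A^T A for a genes x samples matrix given as a list of rows."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]

def top_eigenpair(g, n_iter=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power
    iteration from a uniform start vector (fails if the leading eigenvector
    is orthogonal to that start)."""
    n = len(g)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigval = sum(v[i] * g[i][j] * v[j] for i in range(n) for j in range(n))
    return eigval, v

def variance_fractions(matrix, n_components=2):
    """Fraction of total squared signal captured by each leading eigengene."""
    g = gram(matrix)
    n = len(g)
    total = sum(g[i][i] for i in range(n))      # sum of squared singular values
    fractions = []
    for _ in range(n_components):
        eigval, v = top_eigenpair(g)
        fractions.append(eigval / total)
        # deflate: remove the component just found
        g = [[g[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]
    return fractions

# Hypothetical genes x samples matrix (4 genes, 3 samples).
expression = [[5.0, 4.0, 1.0], [4.0, 5.0, 2.0], [1.0, 1.0, 6.0], [2.0, 3.0, 5.0]]
print(variance_fractions(expression))
```

Adding the fractions for the first two components gives the kind of "captures X% of variation" figure quoted for a two-dimensional sample projection.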
QC: Outliers
[Figure labels: Embryoid bodies; Sonic Hedgehog induced; "?"]
Differential Expression
Either based on reads or RPKM values
Most tools developed for microarrays are based on probe set expression values, whereas RNA-Seq tools aim to use read counts
Reads
• have more statistical power
• have unresolved biases
• need fewer replicates?
Expression levels, RPKMs
• better understood statistics, but less power
Statistical models of differential expression
Transcript length effects in differential expression tests
Oshlack and Wakefield Biology Direct 2009
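The transcript length effect can be illustrated with a toy calculation: at the same expression levels and fold change, a longer transcript collects more reads and so a count-based test has more power. The sketch below uses a simple exact binomial test of an equal read split between two equally deep samples as a stand-in; it is not the method of the cited paper, and the expression levels, lengths and depth are hypothetical.

```python
from math import comb

def two_sided_binomial_p(count_a, count_b):
    """Exact test of an equal split of reads between two samples of equal
    depth: twice the lower tail of Binomial(n, 0.5), capped at 1."""
    n = count_a + count_b
    k = min(count_a, count_b)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)

def read_count(rpkm, length_nt, mapped_reads):
    """Expected reads for a transcript, rounded to a whole count."""
    return round(rpkm * length_nt / 1000 * mapped_reads / 1e6)

depth = 10_000_000          # mapped reads per sample (hypothetical)
for length_nt in (1000, 5000):
    a = read_count(10, length_nt, depth)    # condition 1: 10 RPKM
    b = read_count(12, length_nt, depth)    # condition 2: 12 RPKM
    print(length_nt, a, b, two_sided_binomial_p(a, b))
```

In this toy setting the same 1.2-fold change is far from significant for the 1 kb transcript but clearly significant for the 5 kb transcript, so lists of significant genes from count-based tests tend to be biased towards long transcripts.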
p-values should not be the basis for sorting
non-coding RNAs in prostate cancer: Expression and differential expression
Enrichment analyses
Goals of enrichment analyses
Factors to consider
Gene Sets, e.g. pathways and gene ontology
§ Gene Ontology
§ KEGG
§ BioCarta
§ PANTHER
§ Chromosomal location
§ Genes found differentially expressed in another experiment
Two strategies
List-based enrichment analyses
                  Gene in list   Gene NOT in list
In category       a              b
NOT in category   c              d
[Figure: overlap between the gene set and the genes in the category, within all genes]
Assessing significance
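One common way to assess significance for the 2 x 2 table (a, b, c, d) is a one-sided hypergeometric test, equivalent to a one-sided Fisher's exact test: how likely is it to see at least a category genes in the list if the list were drawn at random from all genes? The sketch below is a minimal standard-library version; the function name and the example counts are hypothetical, and tools such as DAVID may use modified statistics.

```python
from math import comb

def enrichment_p_value(a, b, c, d):
    """One-sided hypergeometric p-value for over-representation.
    a: in category and in list       b: in category, not in list
    c: not in category, in list      d: not in category, not in list"""
    total = a + b + c + d            # background: all genes considered
    in_category = a + b
    in_list = a + c
    upper = min(in_category, in_list)
    tail = sum(comb(in_category, k) * comb(total - in_category, in_list - k)
               for k in range(a, upper + 1))
    return tail / comb(total, in_list)

# Hypothetical example: 40 of 200 list genes fall in a category that
# contains 500 of 10,000 background genes.
p = enrichment_p_value(a=40, b=460, c=160, d=9340)
print(p)
```

The background enters through `total`, so changing which genes count as "all genes" changes the p-value. When many categories are tested, the resulting p-values also need correction for multiple testing.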
DAVID
Query many types of gene sets in one go
Current Background: HOMO SAPIENS
Check Defaults
• Main Accessions (0 selected)
• Other Accessions (0 selected)
• Gene Ontology (3 selected)
• Protein Domains (3 selected)
• Pathways (3 selected)
• General Annotations (0 selected)
• Functional Categories (3 selected)
• Protein Interactions (0 selected)
• Literature (0 selected)
• Disease (1 selected)
• Tissue Expression
Gene set enrichment analyses (GSEA)
Molecular Signatures Database (MSigDB)
Gene Ontology analyses
§ Note: Background matters; choosing the wrong background set of genes may affect/confound your results
§ Depends upon preselected categories
§ List-dependent methods, e.g. DAVID, http://david.abcc.ncifcrf.gov/
§ List-independent methods, e.g. GSEA, http://www.broad.mit.edu/gsea/
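A list-independent method works on the full ranking of genes rather than on a thresholded list. The sketch below shows a simplified, unweighted running-sum enrichment score of the kind GSEA is built on: walk down the ranked list, step up at genes in the set and down otherwise, and report the largest deviation from zero. It omits the weighting and permutation-based significance of the actual GSEA software, and the gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic.
    ranked_genes: genes ordered from most up- to most down-regulated.
    gene_set: set of genes to test.
    Returns the running-sum value with the largest absolute deviation."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hits if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]        # placeholder gene names
top_heavy = enrichment_score(ranked, {"g1", "g2"})    # set concentrated at the top
bottom_heavy = enrichment_score(ranked, {"g5", "g6"}) # set concentrated at the bottom
```

A set concentrated at the top of the ranking gives a score near +1, one concentrated at the bottom a score near -1, and a set scattered evenly through the ranking a score near 0.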
Questions?