assessing gene essentiality using genome-wide crispr screens
TRANSCRIPT
Assessing gene essentiality using genome-wide CRISPR screens
Katharina Imkeller, DKFZ and EMBL, Heidelberg, Germany
June 25th 2019
CSAMA 2019, 17th edition
@K Imkeller @Boutroslab
Reverse vs. forward genetics
Forward genetics:Find the genetic basis for a specificobserved phenotype.
Reverse genetics:Modify gene sequence and analyze theresulting phenotype.
Discovery of CFTR gene mutationcausing Cystic fibrosis.
Wikipedia
Knockout of gene affecting hair growth.
Biological motivation for reverse genetics screensN
OR
MA
LD
EA
TH
GR
OW
TH
Cell fitnessCell fitness
Suppressing gene
Essential gene
Essential gene
Suppressing gene
Cell fitness
Essential gene
Suppressing gene
Cell fitness
Core essential genes:
I RPL13 - ribosomal component
I POLR1B - RNA polymerase
Growthsupressing genes:
I ”tumor suppressor”
I TP53
I BRCA1
Synthetic lethality to target tumor cells:
I PARP in BRCA1 mutated tumors
I BRAF in KRAS mutated tumors
Advantages of using CRISPR-Cas9 for gene knockout
* shRNA based screens have problems with off-target effects and weakphenotypes.
Guide RNA (gRNA) simultaneously serves as perturbagen and barcode
I gRNA can be PCRamplified from genome
I serves as a proxy for geneknockout
Different types of CRISPR mediated genetic perturbations
Name CRISPR associated enzyme perturbationCRISPR-KO Cas9 gene knockoutCRISPRi dCas9 + transcription inactivator expression inhibitionCRISPRa dCas9 + transcription activator expression activationCRISPR-BE dCas9 + base editor base editing (C-G, A-T)
Experimental procedure of pooled CRISPR screens
library T0 T1
pooled DNA array
librarycloning
lentivirus
transduction
selection
amplification and sequencing
Experimental design principle
I guide RNA/ gRNA: perturbagenand barcode
I library size around 100K gRNAs
I perturbed cells growing in a pool
I individual growth rate depends ongene knockout
I compare abundance T0 vs. T1
For protocol see e.g. Joung et al. 2017
Phenotype detection in pooled CRISPR screens
library T0 T1
pooled DNA array
librarycloning
lentivirus
transduction
selection
amplification and sequencing gRNA abundance at T0 (log2)
gRNA abundance at T1 (log2)
targeting tumor suppressors
targeting essential genes
Differences between RNA-seq and CRISPR screening data
M-A plot
Logarithmic fold change: M = log2(S1
S2)
Mean abundance: A =1
2log2(S1S2)
CRISPR screen
−5
0
0 3 6 9 12
A
M
250
500
750
1000
1250
count
RNA sequencing
−5.0
−2.5
0.0
2.5
5.0
0 5 10 15
A
M
50
100
150
200
count
Screening data is skewed towards negative fold changes
ASYMMETRY: T0 vs. T1 gRNA abundance
I negative logFC are more frequent
I especially for gRNAs that have low initial frequency
gRNAs with no expected effect on cell fitness
0
5
10
15
0 5 10 15
log2 counts T0
log2
cou
nts
T1
LFC < −1LFC > 1
sgRNA counts R1
0.00
0.05
0.10
0.15
0 25 50 75 100
Fraction of library
Mea
n fr
eque
ncy
FC_direction
negative
positive
Frequency of FC
−10
−5
0
5
−10 −5 0 5
log2 FC, R1
log2
FC
, R2
Fold changes, R1 vs. R2A
2.5
5.0
7.5
10.0
12.5
0 3 6 9 12
log2 counts T0
log2
cou
nts
T1
LFC < −1LFC > 1
sgRNA counts R1
0.0
0.1
0.2
0.3
0 25 50 75 100
Fraction of library
Mea
n fr
eque
ncy
FC_direction
negative
positive
Frequency of FC
−10
−5
0
5
−10 −5 0 5 10
log2 FC, R1
log2
FC
, R2
Fold changes, R1 vs. R2B
Computational simulation of screen to test influence of experiment design
transduce
grow
split
sequence
sequence
sequence
grow
library
T1_R1
T1_R2
C cells
τ
N split
T0_R1
T0_R2
Distribution at T1
Distribution at T0
Library distribution
CPCR
CPCR
CPCRC virus
I gRNA counts modeled as atuple of integer numbers
I result: counts after sequencing
I functions modify the counts(multiplication, randomsampling)
I number of cell splittings Nsplit ,cell duplication time τ
I ”coverages” CPCR , Cvirus , Ccells
Mean gRNA coverage in pooled CRISPR screens determines cell number
0
3000
6000
9000
−4 0 4 8 12
Counts per gRNA (log2)
Fre
quen
cy
Experiments assume a narrowdistribution and are designed based onmean gRNA coverage.
Exponential growth
Cell splitting(Random sampling)
Cell seedingn_cells Mean coverage of gRNAs:
how many times is one gRNA on averagerepresented in a pooled experiment?
coverage =ncells
ngRNAs
For example (coverage 500):107gRNAs × 500 = 5 × 109cells
Computational simulation of screen to test influence of experiment design
transduce
grow
split
sequence
sequence
sequence
grow
library
T1_R1
T1_R2
C cells
τ
N split
T0_R1
T0_R2
Distribution at T1
Distribution at T0
Library distribution
CPCR
CPCR
CPCRC virus
Cell splitting causes asymmetry in gRNA abundance changes
Simulation with different coverage
cell_cov = 100 cell_cov = 800
0 5 10 15 0 5 10 150
5
10
15
log2 counts T0
log2
cou
nts
T1
LFC < −1 LFC > 1
0 25 50 75 100 0 25 50 75 1000.0
0.1
0.2
Library fraction
Mea
n fr
eque
ncy cov_cells
1004008001500
Lower gRNA coverage increases asymmetry of gRNA abundance changes.
Asymmetry is caused by repetitive cell splitting
Bottle neck effect
Population bottlenecks in the Northern Elephant Seal
Bottle neck event: Hunting in 19th century, reduction of population size to 20individuals. Today‘s 30,000 seals have a strongly reduced genetic diversity.
Stoffel et al.
It is not OK to assume symmetry of null-distribution!
Current analysis tools loose detection power when asymmetry increases.
Detection of essential genesby MAGeCK-RRA (Li et al. 2014)
A
95% precision
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00
Recall
Pre
cisi
on
Coverage Ccells
100 (88% recall)400 (98% recall)1500 (100% recall)
D
Ccells= 100 Ccells= 800
0 5 10 15 0 5 10 150
5
10
15
log2 counts T0
log2
cou
nts
T1
B
LFC < −1 LFC > 1
0 20 40 60 80 100 0 20 40 60 80 1000.0
0.1
0.2
Library fraction
Mea
n fr
eque
ncy Coverage Ccells
1004008001500
C
LFC < −1 LFC > 1
0 20 40 60 80 100 0 20 40 60 80 1000.000
0.025
0.050
0.075
0.100
Library fraction
Mea
n fr
eque
ncy
Doubling time τ205090
E
F
0.0
2.5
5.0
7.5
10.0
12.5
0.00 0.25 0.50 0.75 1.00
Library fraction
log2
sgR
NA
cou
nt
timepointT08T15T18
G
Software package gscreend with improved statistical test
● raw sgRNA counts● normalization, calculation of
fold changes
● split data into intervals according to count at T0
● fit skew normal distribution to fold changes (for every interval)
● based on this model, calculate p-values for single perturbagens
● rank perturbagens according to p-value
● aggragate on gene level using αRRA
gscr
eend
ana
lysi
s w
orkf
low
Step 1: Data preparation
I Normalization or scaling to the total counts in the reference sample.
I Calculation of logarithmic fold changes, addition of pseudo-counts:
LFC = log2(nT1 + 1
nT0 + 1)
I Partitioning into groups according to abundance in reference sample.
Low abundant gRNAs(20-30% percentile)
Histogram of LFC_data$LFC
LFC_data$LFC
Den
sity
−6 −4 −2 0 2
0.0
0.2
0.4
0.6
High abundant gRNAs(80-90% percentile)
Histogram of LFC_data$LFC
LFC_data$LFC
Den
sity
−3 −2 −1 0 1 2 3
0.0
0.5
1.0
1.5
Step 2: Statistical modeling of gRNA level data
I Modeling of the data as a mixture of null-distribution f0 and unknowndistribution f1 of the gRNAs with fitness effect:f (x) = (1 − λ)f0(x) + λf1(x)
0.00
0.05
0.10
0.15
0.20
−10 0 10
LFC
dens
ity distributionf0f1
Step 2: Statistical modeling of gRNA level data
I Modeling of the data as a mixture of null-distribution f0 and unknowndistribution f1 of the gRNAs with fitness effect:f (x) = (1 − λ)f0(x) + λf1(x)
I f0 is a skew normal distribution with 3 parameters:location ξ, scale parameter ω, skewness parameter α
0.00
0.05
0.10
0.15
0.20
−10 −5 0 5X
dens
ity
ξ
012
Location
0.0
0.1
0.2
0.3
0.4
−10 0X
dens
ity
ω123
Scale
0.00
0.05
0.10
0.15
0.20
−10 −5 0 5 10X
dens
ity
α0.512
Skewness
Step 3: Fitting the null-distribution
I Fit ξ, ω, and α from the actual LFC data.
I Ignore strong positive or negative LFCs, only consider the central 90%data point (using approach derived from least quantile of squaresregression (Rousseeuw et al. 1987).
Low abundant gRNAs(20-30% percentile)ξ = 0.16, ω = 0.69, α = 1.57
Histogram of LFC
LFC
Den
sity
−6 −4 −2 0 2
0.0
0.2
0.4
0.6
High abundant gRNAs(80-90% percentile)ξ = −0.02, ω = 0.13, α = 1.09
Histogram of LFC
LFC
Den
sity
−3 −2 −1 0 1 2 3
0.0
1.0
2.0
Step 4: Aggregation of gRNA level data to gene level
I Calculation of p-value for every gRNA.
I Ranking of gRNAs according to p-values.
I Robust rank aggregation (Kolde et al. 2012) to aggregate on gene level(typically 3-10 gRNAs per gene).
I Do the observed gRNA ranks for a given gene lie significantly outside ofwhat you would expect by random sampling? (Permutation test)
sgRNA_A1
sgRNA_A2
sgRNA_A4
sgRNA_A3 gene_A
gene_B
Results from a screen performed in HCT116 cells
components of the ribosomenon-essential genes
0
1
2
3
−8 −6 −4 −2 0
lfc
neg.
fdr
[−lo
g10]
gscreend
gscreend performance on simulated data
Ranking accuracy is improved using gscreend compared to other method.
A
0.25
0.50
0.75
1.0 1.5 2.0 2.5 3.0
Nb of replicates
Rec
all (
99%
pre
cisi
on)
BAGELgscreendmageckscreenBeam
Detection of essential genesB
0
30
60
90
120
0 1000 2000 3000
rank
Num
ber
dete
cted
Ranking of ribosome componentsC
MRPL34
MRPS12
NDUFAF3
PRRC2A
UVRAG0
1
2
3
−8 −6 −4 −2 0
lfc
neg.
fdr
[−lo
g10]
gscreend
NDUFAF3
UVRAG
PRRC2A
MRPS12
MRPL340
1
2
3
−8 −6 −4 −2 0
lfc
neg.
fdr
[−lo
g10]
mageck
NDUFAF3
PRRC2A
MRPS12
UVRAG
MRPL340
10
20
30
−8 −6 −4 −2 0
lfc
Bay
es fa
ctor
BAGELD
MRPL34 MRPS12 NDUFAF3 PRRC2A UVRAG
T0
T18
A
T18
B
T18
C
T0
T18
A
T18
B
T18
C
T0
T18
A
T18
B
T18
C
T0
T18
A
T18
B
T18
C
T0
T18
A
T18
B
T18
C
0
3
6
9
Nor
m. c
ount
[log
2]
E
0.85
0.90
0.95
1.00
1 2 3
Nb of replicates
Rec
all (
99%
pre
cisi
on)
gscreendmageckscreenBeam
Fitness increaseF
0.25
0.50
0.75
1.00
1 2 3
Nb of replicates
Rec
all (
99%
pre
cisi
on)
Fitness decreaseG gscreend mageck
2.5 5 7.5 10 17 2.5 5 7.5 10 17
100
200
300
400
500
600
700
800
900
Library skew ratio
Spl
ittin
g co
vera
ge
Recall(99% precision)
<0.85
0.85−0.9
0.9−0.95
0.95−0.98
>0.98
H
This has major implications for experiment design
I We can predict the minimal necessary experiment size.
I gscreend allows reduction of experiment size by up to 50%.
Conclusions
I Understand the data from an experimental point of view!
I Changes in gRNA abundance are asymmetric in pooled CRISPR screens(unlike RNA-seq data).
I We provide recommendation for optimal experimental design.
I gscreend: more accurate phenotype detection at smaller experiment size.
gscreend (in preparation for Bioconductor submission):
https://github.com/imkeller/gscreend
bioRxivModeling asymmetric count ratios in CRISPR screens to decrease experimentsize and improve phenotype detectionKatharina Imkeller, Giulia Ambrosi, Michael Boutros, Wolfgang Huberdoi: https://doi.org/10.1101/699348
Example forapplications of
CRISPR screens...
Context dependent lethality
Cancer dependency map: https://depmap.org/portal/
Rauscher et al. 2018, Henkel et al. 2019
BRAF mutation context dependency
Rauscher et al. 2018, Henkel et al. 2019
Resources
I Some of the graphics in this presentation were generated with Biorender(www.biorender.com)
I Rousseeuw, PJ, Leroy, AM. Robust regression and outlier detection. WileySeries in Probability and Statistics 329 (1987).
I Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene listintegration and meta-analysis. Bioinformatics. 2012.10.1093/bioinformatics/btr709
I Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, Irizarry RA, Liu JS,Brown M, Liu XS. MAGeCK enables robust identification of essentialgenes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol.2014. 10.1186/s13059-014-0554-4
I Joung J, Konermann S, Gootenberg JS, Abudayyeh OO, Platt RJ,Brigham MD, Sanjana NE, Zhang F. Genome-scale CRISPR-Cas9knockout and transcriptional activation screening. Nat Protoc. 2017.10.1038/nprot.2017.016
I Rauscher B, Heigwer F, Henkel L, Hielscher T, Voloshanenko O, BoutrosM. Toward an integrated map of genetic interactions in cancer cells. MolSyst Biol. 2018. 10.15252/msb.20177656
I Henkel L, Rauscher B, Boutros M. Context-dependent genetic interactionsin cancer. Curr Opin Genet Dev. 2019. 10.1016/j.gde.2019.03.004