assessing gene essentiality using genome-wide crispr screens

Assessing gene essentiality using genome-wide CRISPR screens

Katharina Imkeller, DKFZ and EMBL, Heidelberg, Germany

June 25th 2019

CSAMA 2019, 17th edition

@K Imkeller @Boutroslab

Reverse vs. forward genetics

Forward genetics:Find the genetic basis for a specificobserved phenotype.

Reverse genetics:Modify gene sequence and analyze theresulting phenotype.

Discovery of CFTR gene mutationcausing Cystic fibrosis.

Wikipedia

Knockout of gene affecting hair growth.

Biological motivation for reverse genetics screensN

OR

MA

LD

EA

TH

GR

OW

TH

Cell fitnessCell fitness

Suppressing gene

Essential gene

Essential gene

Suppressing gene

Cell fitness

Essential gene

Suppressing gene

Cell fitness

Core essential genes:

I RPL13 - ribosomal component

I POLR1B - RNA polymerase

Growthsupressing genes:

I ”tumor suppressor”

I TP53

I BRCA1

Synthetic lethality to target tumor cells:

I PARP in BRCA1 mutated tumors

I BRAF in KRAS mutated tumors

Advantages of using CRISPR-Cas9 for gene knockout

* shRNA based screens have problems with off-target effects and weakphenotypes.

Guide RNA (gRNA) simultaneously serves as perturbagen and barcode

I gRNA can be PCRamplified from genome

I serves as a proxy for geneknockout

Different types of CRISPR mediated genetic perturbations

Name CRISPR associated enzyme perturbationCRISPR-KO Cas9 gene knockoutCRISPRi dCas9 + transcription inactivator expression inhibitionCRISPRa dCas9 + transcription activator expression activationCRISPR-BE dCas9 + base editor base editing (C-G, A-T)

Experimental procedure of pooled CRISPR screens

library T0 T1

pooled DNA array

librarycloning

lentivirus

transduction

selection

amplification and sequencing

Experimental design principle

I guide RNA/ gRNA: perturbagenand barcode

I library size around 100K gRNAs

I perturbed cells growing in a pool

I individual growth rate depends ongene knockout

I compare abundance T0 vs. T1

For protocol see e.g. Joung et al. 2017

Phenotype detection in pooled CRISPR screens

library T0 T1

pooled DNA array

librarycloning

lentivirus

transduction

selection

amplification and sequencing gRNA abundance at T0 (log2)

gRNA abundance at T1 (log2)

targeting tumor suppressors

targeting essential genes

Differences between RNA-seq and CRISPR screening data

M-A plot

Logarithmic fold change: M = log2(S1

S2)

Mean abundance: A =1

2log2(S1S2)

CRISPR screen

−5

0

0 3 6 9 12

A

M

250

500

750

1000

1250

count

RNA sequencing

−5.0

−2.5

0.0

2.5

5.0

0 5 10 15

A

M

50

100

150

200

count

Screening data is skewed towards negative fold changes

ASYMMETRY: T0 vs. T1 gRNA abundance

I negative logFC are more frequent

I especially for gRNAs that have low initial frequency

gRNAs with no expected effect on cell fitness

0

5

10

15

0 5 10 15

log2 counts T0

log2

cou

nts

T1

LFC < −1LFC > 1

sgRNA counts R1

0.00

0.05

0.10

0.15

0 25 50 75 100

Fraction of library

Mea

n fr

eque

ncy

FC_direction

negative

positive

Frequency of FC

−10

−5

0

5

−10 −5 0 5

log2 FC, R1

log2

FC

, R2

Fold changes, R1 vs. R2A

2.5

5.0

7.5

10.0

12.5

0 3 6 9 12

log2 counts T0

log2

cou

nts

T1

LFC < −1LFC > 1

sgRNA counts R1

0.0

0.1

0.2

0.3

0 25 50 75 100

Fraction of library

Mea

n fr

eque

ncy

FC_direction

negative

positive

Frequency of FC

−10

−5

0

5

−10 −5 0 5 10

log2 FC, R1

log2

FC

, R2

Fold changes, R1 vs. R2B

Computational simulation of screen to test influence of experiment design

transduce

grow

split

sequence

sequence

sequence

grow

library

T1_R1

T1_R2

C cells

τ

N split

T0_R1

T0_R2

Distribution at T1

Distribution at T0

Library distribution

CPCR

CPCR

CPCRC virus

I gRNA counts modeled as atuple of integer numbers

I result: counts after sequencing

I functions modify the counts(multiplication, randomsampling)

I number of cell splittings Nsplit ,cell duplication time τ

I ”coverages” CPCR , Cvirus , Ccells

Mean gRNA coverage in pooled CRISPR screens determines cell number

0

3000

6000

9000

−4 0 4 8 12

Counts per gRNA (log2)

Fre

quen

cy

Experiments assume a narrowdistribution and are designed based onmean gRNA coverage.

Exponential growth

Cell splitting(Random sampling)

Cell seedingn_cells Mean coverage of gRNAs:

how many times is one gRNA on averagerepresented in a pooled experiment?

coverage =ncells

ngRNAs

For example (coverage 500):107gRNAs × 500 = 5 × 109cells

Computational simulation of screen to test influence of experiment design

transduce

grow

split

sequence

sequence

sequence

grow

library

T1_R1

T1_R2

C cells

τ

N split

T0_R1

T0_R2

Distribution at T1

Distribution at T0

Library distribution

CPCR

CPCR

CPCRC virus

Cell splitting causes asymmetry in gRNA abundance changes

Simulation with different coverage

cell_cov = 100 cell_cov = 800

0 5 10 15 0 5 10 150

5

10

15

log2 counts T0

log2

cou

nts

T1

LFC < −1 LFC > 1

0 25 50 75 100 0 25 50 75 1000.0

0.1

0.2

Library fraction

Mea

n fr

eque

ncy cov_cells

1004008001500

Lower gRNA coverage increases asymmetry of gRNA abundance changes.

Asymmetry is caused by repetitive cell splitting

Bottle neck effect

Population bottlenecks in the Northern Elephant Seal

Bottle neck event: Hunting in 19th century, reduction of population size to 20individuals. Today‘s 30,000 seals have a strongly reduced genetic diversity.

Stoffel et al.

It is not OK to assume symmetry of null-distribution!

Current analysis tools loose detection power when asymmetry increases.

Detection of essential genesby MAGeCK-RRA (Li et al. 2014)

A

95% precision

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00

Recall

Pre

cisi

on

Coverage Ccells

100 (88% recall)400 (98% recall)1500 (100% recall)

D

Ccells= 100 Ccells= 800

0 5 10 15 0 5 10 150

5

10

15

log2 counts T0

log2

cou

nts

T1

B

LFC < −1 LFC > 1

0 20 40 60 80 100 0 20 40 60 80 1000.0

0.1

0.2

Library fraction

Mea

n fr

eque

ncy Coverage Ccells

1004008001500

C

LFC < −1 LFC > 1

0 20 40 60 80 100 0 20 40 60 80 1000.000

0.025

0.050

0.075

0.100

Library fraction

Mea

n fr

eque

ncy

Doubling time τ205090

E

F

0.0

2.5

5.0

7.5

10.0

12.5

0.00 0.25 0.50 0.75 1.00

Library fraction

log2

sgR

NA

cou

nt

timepointT08T15T18

G

Software package gscreend with improved statistical test

● raw sgRNA counts● normalization, calculation of

fold changes

● split data into intervals according to count at T0

● fit skew normal distribution to fold changes (for every interval)

● based on this model, calculate p-values for single perturbagens

● rank perturbagens according to p-value

● aggragate on gene level using αRRA

gscr

eend

ana

lysi

s w

orkf

low

Step 1: Data preparation

I Normalization or scaling to the total counts in the reference sample.

I Calculation of logarithmic fold changes, addition of pseudo-counts:

LFC = log2(nT1 + 1

nT0 + 1)

I Partitioning into groups according to abundance in reference sample.

Low abundant gRNAs(20-30% percentile)

Histogram of LFC_data$LFC

LFC_data$LFC

Den

sity

−6 −4 −2 0 2

0.0

0.2

0.4

0.6

High abundant gRNAs(80-90% percentile)

Histogram of LFC_data$LFC

LFC_data$LFC

Den

sity

−3 −2 −1 0 1 2 3

0.0

0.5

1.0

1.5

Step 2: Statistical modeling of gRNA level data

I Modeling of the data as a mixture of null-distribution f0 and unknowndistribution f1 of the gRNAs with fitness effect:f (x) = (1 − λ)f0(x) + λf1(x)

0.00

0.05

0.10

0.15

0.20

−10 0 10

LFC

dens

ity distributionf0f1

Step 2: Statistical modeling of gRNA level data

I Modeling of the data as a mixture of null-distribution f0 and unknowndistribution f1 of the gRNAs with fitness effect:f (x) = (1 − λ)f0(x) + λf1(x)

I f0 is a skew normal distribution with 3 parameters:location ξ, scale parameter ω, skewness parameter α

0.00

0.05

0.10

0.15

0.20

−10 −5 0 5X

dens

ity

ξ

012

Location

0.0

0.1

0.2

0.3

0.4

−10 0X

dens

ity

ω123

Scale

0.00

0.05

0.10

0.15

0.20

−10 −5 0 5 10X

dens

ity

α0.512

Skewness

Step 3: Fitting the null-distribution

I Fit ξ, ω, and α from the actual LFC data.

I Ignore strong positive or negative LFCs, only consider the central 90%data point (using approach derived from least quantile of squaresregression (Rousseeuw et al. 1987).

Low abundant gRNAs(20-30% percentile)ξ = 0.16, ω = 0.69, α = 1.57

Histogram of LFC

LFC

Den

sity

−6 −4 −2 0 2

0.0

0.2

0.4

0.6

High abundant gRNAs(80-90% percentile)ξ = −0.02, ω = 0.13, α = 1.09

Histogram of LFC

LFC

Den

sity

−3 −2 −1 0 1 2 3

0.0

1.0

2.0

Step 4: Aggregation of gRNA level data to gene level

I Calculation of p-value for every gRNA.

I Ranking of gRNAs according to p-values.

I Robust rank aggregation (Kolde et al. 2012) to aggregate on gene level(typically 3-10 gRNAs per gene).

I Do the observed gRNA ranks for a given gene lie significantly outside ofwhat you would expect by random sampling? (Permutation test)

sgRNA_A1

sgRNA_A2

sgRNA_A4

sgRNA_A3 gene_A

gene_B

Results from a screen performed in HCT116 cells

components of the ribosomenon-essential genes

0

1

2

3

−8 −6 −4 −2 0

lfc

neg.

fdr

[−lo

g10]

gscreend

gscreend performance on simulated data

Ranking accuracy is improved using gscreend compared to other method.

A

0.25

0.50

0.75

1.0 1.5 2.0 2.5 3.0

Nb of replicates

Rec

all (

99%

pre

cisi

on)

BAGELgscreendmageckscreenBeam

Detection of essential genesB

0

30

60

90

120

0 1000 2000 3000

rank

Num

ber

dete

cted

Ranking of ribosome componentsC

MRPL34

MRPS12

NDUFAF3

PRRC2A

UVRAG0

1

2

3

−8 −6 −4 −2 0

lfc

neg.

fdr

[−lo

g10]

gscreend

NDUFAF3

UVRAG

PRRC2A

MRPS12

MRPL340

1

2

3

−8 −6 −4 −2 0

lfc

neg.

fdr

[−lo

g10]

mageck

NDUFAF3

PRRC2A

MRPS12

UVRAG

MRPL340

10

20

30

−8 −6 −4 −2 0

lfc

Bay

es fa

ctor

BAGELD

MRPL34 MRPS12 NDUFAF3 PRRC2A UVRAG

T0

T18

A

T18

B

T18

C

T0

T18

A

T18

B

T18

C

T0

T18

A

T18

B

T18

C

T0

T18

A

T18

B

T18

C

T0

T18

A

T18

B

T18

C

0

3

6

9

Nor

m. c

ount

[log

2]

E

0.85

0.90

0.95

1.00

1 2 3

Nb of replicates

Rec

all (

99%

pre

cisi

on)

gscreendmageckscreenBeam

Fitness increaseF

0.25

0.50

0.75

1.00

1 2 3

Nb of replicates

Rec

all (

99%

pre

cisi

on)

Fitness decreaseG gscreend mageck

2.5 5 7.5 10 17 2.5 5 7.5 10 17

100

200

300

400

500

600

700

800

900

Library skew ratio

Spl

ittin

g co

vera

ge

Recall(99% precision)

<0.85

0.85−0.9

0.9−0.95

0.95−0.98

>0.98

H

This has major implications for experiment design

I We can predict the minimal necessary experiment size.

I gscreend allows reduction of experiment size by up to 50%.

Conclusions

I Understand the data from an experimental point of view!

I Changes in gRNA abundance are asymmetric in pooled CRISPR screens(unlike RNA-seq data).

I We provide recommendation for optimal experimental design.

I gscreend: more accurate phenotype detection at smaller experiment size.

gscreend (in preparation for Bioconductor submission):

https://github.com/imkeller/gscreend

bioRxivModeling asymmetric count ratios in CRISPR screens to decrease experimentsize and improve phenotype detectionKatharina Imkeller, Giulia Ambrosi, Michael Boutros, Wolfgang Huberdoi: https://doi.org/10.1101/699348

Example forapplications of

CRISPR screens...

Context dependent lethality

Cancer dependency map: https://depmap.org/portal/

Rauscher et al. 2018, Henkel et al. 2019

BRAF mutation context dependency

Rauscher et al. 2018, Henkel et al. 2019

Resources

I Some of the graphics in this presentation were generated with Biorender(www.biorender.com)

I Rousseeuw, PJ, Leroy, AM. Robust regression and outlier detection. WileySeries in Probability and Statistics 329 (1987).

I Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene listintegration and meta-analysis. Bioinformatics. 2012.10.1093/bioinformatics/btr709

I Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, Irizarry RA, Liu JS,Brown M, Liu XS. MAGeCK enables robust identification of essentialgenes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol.2014. 10.1186/s13059-014-0554-4

I Joung J, Konermann S, Gootenberg JS, Abudayyeh OO, Platt RJ,Brigham MD, Sanjana NE, Zhang F. Genome-scale CRISPR-Cas9knockout and transcriptional activation screening. Nat Protoc. 2017.10.1038/nprot.2017.016

I Rauscher B, Heigwer F, Henkel L, Hielscher T, Voloshanenko O, BoutrosM. Toward an integrated map of genetic interactions in cancer cells. MolSyst Biol. 2018. 10.15252/msb.20177656

I Henkel L, Rauscher B, Boutros M. Context-dependent genetic interactionsin cancer. Curr Opin Genet Dev. 2019. 10.1016/j.gde.2019.03.004

assessing gene essentiality using genome-wide crispr screens

Documents