crseek: a python module for facilitating …1 crseek: a python module for facilitating 2 complicated...

CRSeek: a Python module for facilitating complicated CRISPR

design strategies

Will Dampier Corresp., 1, 2, 3 , Cheng-Han Chung 1, 2 , Neil T Sullivan 1, 2 , Andrew J Atkins 1, 2 , Michael R Nonnemacher 1, 2, 4 , Brian Wigdahl Corresp. 1, 2, 4

1 Department of Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, PA, United States

2 Center for Molecular Virology and Translational Neuroscience, Institute for Molecular & Medicine and Infectious Disease, Drexel University College of

Medicine, Philadelphia, PA, United States3 School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, PA, United States

4 Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, United States

Corresponding Authors: Will Dampier, Brian Wigdahl

Email address: [email protected], [email protected]

With the popularization of the CRISPR-Cas gene editing system there has been an

explosion of new techniques made possible by this versatile technology. However, the

computational field has lagged behind with a current lack of computational tools for

developing complicated CRISPR-Cas gene editing strategies. We present crseek, a Python

package that provides a consistent application programming interface (API) for multiple

cleavage prediction algorithms. Four popular cleavage prediction algorithms were

implemented and further adapted to work on draft-quality genomes. Furthermore, since

crseek mirrors the popular scikit-learn API, the package can be easily integrated as an

upstream processing module for facilitating further CRISPR-Cas machine learning research.

The package is fully integrated with the biopython package facilitating simple import,

export, and manipulation of sequences before and after gene editing. This manuscript

presents four common gene editing tasks that would be difficult with current tools but are

easily performed with the crseek package. We believe this package will help

bioinformaticians rapidly design complex CRISPR-Cas gene editing strategies and will be a

useful addition to the field.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018

CRSeek: a Python module for facilitating1

complicated CRISPR design strategies2

Will Dampier1,2,3, Cheng-Han Chung1,2, Neil T. Sullivan1,2, Andrew3

Atkins1,2, Michael R. Nonnemacher1,2,4, and Brian Wigdahl1,2,44

1Department of Microbiology and Immunology, Drexel University College of Medicine,5

Philadelphia, PA, USA6

2Center for Molecular Virology and Translational Neuroscience, Institute for Molecular7

Medicine and Infectious Disease, Drexel University College of Medicine, Philadelphia,8

PA, USA9

3School of Biomedical Engineering, Science, and Health Systems, Drexel University,10

Philadelphia, Pennsylvania, USA11

4Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, USA12

Corresponding author:13

Will Dampier and Brian Wigdahl14

Email address: [email protected], [email protected]

ABSTRACT16

With the popularization of the CRISPR-Cas gene editing system there has been an explosion of new

techniques made possible by this versatile technology. However, the computational field has lagged

behind with a current lack of computational tools for developing complicated CRISPR-Cas gene editing

strategies. We present crseek, a Python package that provides a consistent application programming

interface (API) for multiple cleavage prediction algorithms. Four popular cleavage prediction algorithms

were implemented and further adapted to work on draft-quality genomes. Furthermore, since crseek

mirrors the popular scikit-learn API, the package can be easily integrated as an upstream processing

module for facilitating further CRISPR-Cas machine learning research. The package is fully integrated

with the biopython package facilitating simple import, export, and manipulation of sequences before

and after gene editing. This manuscript presents four common gene editing tasks that would be difficult

with current tools but are easily performed with the crseek package. We believe this package will help

bioinformaticians rapidly design complex CRISPR-Cas gene editing strategies and will be a useful addition

to the field.

17

18

19

20

21

22

23

24

25

26

27

28

29

INTRODUCTION30

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) are a family of related genes that31

function as a defense system in prokaryotes against phages and plasmid DNA (Barrangou, 2015). By32

incorporating a small portion of the invading DNA into the prokaryotic genome, it can defend against33

subsequent attacks. In this system, CRISPR-associated (Cas) genes process these genomic repeats into a34

guide RNA (gRNA). One particular region of the gRNA, termed the spacer, directs Cas endonucleases35

to cleave the target DNA at a region complementary to the spacer on foreign DNA (Marraffini and36

Sontheimer, 2010). Recently, aspects of this system have been re-engineered for researchers to easily37

use this system as a highly specific endonuclease that can be directed to edit nearly any desired DNA38

sequence across the genome (Jinek et al., 2012; Cong et al., 2013).39

The CRISPR-Cas system has revolutionized the gene editing field (Doudna and Charpentier, 2014).40

It has democratized the field by lowering the difficulty of editing a specific genomic locus (Genscript,41

2016). Various applications of CRISPR-Cas systems have been developed with useful properties such42

as an inducible expression system (Cao et al., 2016), tissue specificity (Ablain et al., 2015), as well as43

advanced editing strategies (Kim et al., 2017). In practice, targeting a specific locus with any of these44

systems simply involves finding a protospacer adjacent motif (PAM) next to a unique 20 bp protospacer45

and performing basic molecular biology techniques (Genscript, 2016; Sternberg and Doudna, 2015).46


Recent studies have elucidated the majority of the mechanism and binding site recognition of the47

CRISPR-Cas system. Early target search strategies involved using ad-hoc rule-based methods that48

excluded potential targets by defining either a total maximum of mismatches or distinguishing between49

seed and tail regions (Fu et al., 2013; Hsu et al., 2013; Wu et al., 2014). Current targeting methods50

employ a data-driven approach by using a dataset of experimentally determined targeting probabilities and51

extracting sophisticated rules to construct scoring algorithms. Hsu et al. (2013) created a position-specific52

binding matrix by testing 700 gRNAs engineered to have intended mismatches to their target. Doench53

et al. (2016) followed this work by creating an even larger dataset of ≥80,000 gRNAs targeting many54

positions across the same gene. Klein et al. (2018) further refined this method by developing a kinetic55

model of CRISPR-Cas binding and cleavage using these datasets and showed that a small number of56

parameters can explain these experimentally determined binding rules.57

There are numerous currently employed tools for designing gRNAs reviewed by Ding et al. (2016);58

Cui et al. (2018). All of these tools use the same basic strategy. First, the target DNA is searched for PAM59

sequences and the adjacent protospacers are extracted. Next, the potential spacers are searched against a60

reference database allowing for multiple mismatches and are scored for potential off-target sites using one61

of the many scoring algorithms described above. Finally, the protospacers are ranked by their potential62

off-target risk. These tools can be served either as a webserver and occasionally as a locally installable63

toolset.64

However, all of these tools lack many of the features needed from a bioinformatic developer perspec-65

tive.66

• Webtools often limit searching to model organisms or pre-built indices.67

• Most tools cannot search for non-standard Cas9 variants such as Cpf1, SaCas9, or other species-68

specific Cas9s.69

• Currently distributed software lacks unit-testing, one-command install of the tool and dependencies,70

continuous integration frameworks, or a documented API.71

These features are needed in order for bioinformaticians to design larger scale experiments or develop72

automation methods for more advanced gene editing strategies.73

In this manuscript we present crseek (https://github.com/DamLabResources/crseek), a Python74

library mirroring the popular scikit-learn application program interface (API) proposed in Buitinck et al.75

(2013). It provides both high- and low-level methods for designing CRISPR gene editing strategies.76

crseek provides an invaluable resource for computational biologists looking to employ CRISPR-Cas77

based gene editing techniques.78

METHODS79

This manuscript presents a collection of tools for the biomedical software developer looking to design80

advanced CRISPR-Cas gene-editing strategies. This toolset was built from a developer perspective to81

aid biologists with programming experience who are looking to perform gRNA design for non-standard82

CRISPR-Cas gene editing. The toolset, and its dependencies, are installable using the Ananconda83

environment system (Anaconda, 2016).84

Terminology85

Throughout the research field there is a great deal of variability in the nomeclature of the various parts of86

the CRISPR-Cas complex. Many of these nomeclatures do not conform to the Python PEP-8 standard87

(Van Rossum et al., 2001). For consistency we refer to the complex parts in the following ways:88

• The gRNA refers to the entire strand of RNA which complexes with Cas9 to form the CRISPR-Cas89

complex. This molecule is not directly represented in crseek.90

• The spacer refers to the 20 nucleotide region of the gRNA which is used for target matching by91

the CRISPR-Cas complex. This requires an RNA alphabet.92

• The pam refers to the, potentially degenerate, nucleotide recognition site of the particular CRISPR-93

Cas system. The PAM resides adjacent to the spacer and is required for binding. This requires a94

DNA alphabet.95

2/16


• The target refers to the 20 nucleotide region of the DNA which is complementary to the spacer.96

In order to accommodate multiple CRISPR-Cas systems the target includes the PAM recognition97

site. This requires a DNA alphabet.98

• The loci refers to a ≥20 nucleotide segment of the DNA which may contain potential targets.99

The crseek tool does not make a distinction between on-target sites, those where binding is intended,100

and off-target sites, those where binding is unintended; all (spacer, target) pairs are treated equally.101

This framing is useful for expressing bleeding-edge uses of CRISPR-Cas that exploit the promiscuity of102

SpCas9 to target heterogeneous populations such as HIV (Dampier et al., 2018) or modulate microbial103

communities (Gomaa et al., 2014).104

Implementation Strategy105

We have intentionally mirrored the scikit-learn API. We subclass the BaseEstimator class and over-106

load the fit, transform, predict, and predict proba methods. This allows us to seamlessly107

use all of the tools of the scikit-searn package such as cross-validation, normalization, and other prediction108

methods (Pedregosa et al., 2011). We have decomposed this methodology into three main tasks: Searching,109

Preprocessing, and Estimating. We use the biopython library (Cock et al., 2009) to enforce appropriate110

nucleotide alphabets as inputs and outputs of the various tools.111

Searching112

We define search as the act of locating potential target sites in loci. DNA segments can be any113

files, or directories of files, readable by Bio.SeqIO (fasta, genbank, etc), lists of Bio.SeqRecord114

objects, or np.arrays of characters. The crseek tool provides two strategies for this searching:115

exhaustive or mismatch based. In an exhaustive search, all positions are considered as potential targets.116

For mismatch searching, a wrapper was created around the popular cas-offinder library (Bae et al.,117

2014). This library allows for rapid mismatch searching, even employing any graphical processors present118

and properly installed. The utils.tile seqrecord and utils.cas offinder return arrays of119

(spacer, target) pairs for downstream analysis.120

Preprocessing121

Once targets have been found on the genomic loci they must be encoded as (spacer, target) pairs for122

downstream analysis. There are two main strategies for this encoding: mismatch versus one-hot encoding.123

Rule-based mismatching and the Hsu et al. (2013) strategies both give the same weight independent of124

the mismatch identity; as such this encoding is a binary vector of length 21 (20 for the spacer, 1 for125

pam matching). One-hot encoding is used for strategies such as Doench et al. (2016) which give different126

binding penalties based on the mismatch identity; for example an A:T mismatch may have a larger penalty127

than an A:C mismatch and these penalties change with position.128

These two strategies have been encapsulated into the preprocessing.MatchingTransformer129

and preprocessing.OneHotEncoder classes. These classes take pairs of (spacer, target)130

and return binary vectors for downstream processing. By implementing these as subclasses of the131

sklearn.BaseEstimator, one can use these classes in sklearn.Pipeline instances. Down-132

stream processing can either employ one of the pre-built estimator classes or used as preprocessing133

for other machine learning algorithms.134

Estimating135

Multiple pre-built algorithms were collected for determining the activity of any given (spacer, target)136

pair. These estimators input the binary vectors produced by the preprocessing modules. A flexible137

MismatchEstimator was built that allows one to specify the seed-length, number of seed or non-seed138

mismatches, and the PAM identity. These parameters can also be loaded from user-supplied yaml files to139

allow for the exploration of non-standard CRISPR-Cas systems. The penalty scoring strategy described140

by Hsu et al. (2013) was implemented as the MITEstimator and the Doench et al. (2016) method141

as the CFDEstimator. The newly proposed kinetic model developed by Klein et al. (2018) has also142

been implemented as the KineticEstimator. New estimators can be easily added by subclassing the143

SequenceBase abstract base class and overloading the relevant methods.144

3/16


Additional Features145

In order to account for many of the idiosyncrasies of designing CRISPR-Cas tools, a few optional features146

were added. Degenerate bases can be used across all tools. These bases are assumed to be each relevant147

nucleotide in equal likelihoods and the penalty scores are scaled accordingly (e.g. an R is 50% likely to be148

an A or G). SAM/BAM files can be used as input in all search tools. This allows for search and estimation149

across variable populations. This is useful when targeting microbiomes, metagenomes, or highly mutable150

viral genomes.151

In vitro validation of Cas9 cleavage of mixed populations152

In order to evaluate whether the crseek module can be used for novel analyses, an in vitro cutting assay153

using mixed populations of sequences from HIV-1-infected individuals was adapted. Genomic DNA154

was extracted from PMBCs derived from five patients in the Drexel Medicine CNS AIDS Research and155

Eradication Study (CARES). The Long Terminal Repeat (LTR) region was amplified using a two-round156

PCR strategy as previously described (Li et al., 2011). Bulk PCR products were cloned into the pGL3157

basic expression vector as previously described (Li et al., 2015) obtaining one to six unique sequences per158

patient. Additionally, sequences representing the consensus LTRs from CXCR4- (X4) and CCR5-utilizing159

(R5) viruses constructed from previous analyses (Antell et al., 2016), were synthesized and cloned into the160

pGL3 vector by VectorBuilder (Cyagen Biosciences). The LTR-A gRNA (ATCAGATATCCACTGACCTT161

NGG) from Hu et al. (2014) was selected for analysis, purchased from IDT, and in vitro transcribed using162

the EnGen sgRNA synthesis procedure (NEB) as described by the manufacturer protocol. The gRNA163

was transcribed, purified, and concentrated using the RNA clean and concentrator procedure provided by164

Zymo Research.165

The in vitro digestion of patient LTRs was performed by following NEB in vitro digestion of DNA with166

Cas9 nuclease protocol with no modifications. The multiple unique sequences cloned into pGL3 vectors167

were added in equimolar ratios to mirror the biological context of HIV-1 infection. Cutting reactions168

were performed for 1 hour at 37°C followed by 65°C heat-inactivation for 5 min. BamHI restriction169

endonuclease digestion was performed to linearize any undigested plasmid and samples were cleaned170

using the Qiagen PCR cleanup procedure to ensure proper running on the High Sensitivity Bioanalyzer171

chip (Agilent) which was used to measure the sizes and molarities of the fragments.172

As the sizes reported by the Bioanalyzer have a known degree of noise, a mean shift clustering173

algorithm was used to identify the appropriate cutoffs to use when assigning a Bioanalyzer peak to a174

particular molecular fragment. We use a width of 500 bp due to the estimate of 10% error in the length175

estimation described in technical specifications provided by Agilent. The two fragment sizes resulting176

from the Cas9 in vitro digestion were calculated in a similar way. The calculation of cleavage efficiency,177

the mean of the molarity of two Cas9-digested fragments (designated as mol f 1 and mol f 2 in Equation178

1) was used as the molarity for the fragmented pGL3. The molarity of the total pGL3 was calculated by179

adding the molarity of fragmented pGL3 and the molarity of unfragmented pGL3 (designated as molun).180

The function of cleavage efficiency can be described as:181

cleavage =(mol f 1+mol f 2)/2

molun+(mol f 1+mol f 2)/2(1)

The processed data is shown in Table 1.182

RESULTS183

In order to showcase the many uses of the crseek module, this manuscript includes four simple and184

realistic CRISPR-Cas design tasks. These tasks would not be reasonably possible with currently available185

tools due to either the draft quality of the template genome or the complexity of the design.186

Task 1187

Creating positive controls for CRISPR library screens are critical to downstream validation. As an188

illustrative example, imagine developing a gRNA that can knockout the expression of eGFP from a189

plasmid (Addgene-54622, (Davidson, 2016)) transformed into the newly sequenced Clostridioides difficile190

genome (Genbank: NC 009089.1, Monot et al. (2011)). A robust positive control will have a perfect191

homology with the eGFP locus while having a low homology with any other position in the C. difficile192

genome. The following script shows a basic implementation.193

4/16


# Build the CFD estimator

estimator = estimators.CFDEstimator.build_pipeline()

# Load the relevant sequences

genome_path = 'data/Clostridioides_difficile_630.gb'

plasmid_path = 'data/addgene-plasmid-54622-sequence-158642.gbk'

with open(genome_path) as handle:

genome = list(SeqIO.parse(handle, 'genbank'))[0]

with open(plasmid_path) as handle:

plasmid = list(SeqIO.parse(handle, 'genbank'))[0]

egfp_feature = get_egfp(plasmid)

egfp_record = egfp_feature.extract(plasmid)

# Extract all possible NGG targets

possible_targets = utils.extract_possible_targets(egfp_record,

pams=('NGG',))

# Find all targets across the host genome

possible_binding = utils.cas_offinder(possible_targets,

5, locus = [genome])

# Score each hit across the genome

scores = estimator.predict_proba(possible_binding.values)

possible_binding['Score'] = scores

# Find the maximum score for each spacer

results = possible_binding.groupby('spacer')['Score'].agg('max')

results.sort_values(inplace=True)

Utilizing this simple script reveals that there are numerous potential spacers. The biopython194

GenomeDiagram module can be used to generate a proposed plasmid map as shown in Fig. 1. This is195

not an unexpected result as the C. difficile genome does not contain genes that share significant homology196

with eGFP, as such this represents simple application of the crseek module. Aside from the draft-quality197

of the genome, this design could be accomplished with many of the currently available tools.198

Task 2199

The crseek module is also capable of more complex spacer design tasks. The following code snippet200

describes the generation of a large library of spacers that target individuals genes in the C. difficile201

genome. The tight integration of crseek with the biopython package allows for simple reading and202

writing of the numerous file types employed in the biological field. Furthermore, with the crseek203

module it is simple to ensure that the spacers designed for a particular gene do not have homology204

elsewhere in the genome.205

library_grnas = []

gene_info = []

genome_key = genome.id + ' ' + genome.description

for feat in genome.features:

# Only target genes

if feat.type != 'CDS':

continue

# Get info about this gene

5/16


product = feat.qualifiers['product'][0]

tag = feat.qualifiers['locus_tag'][0]

gene_record = feat.extract(genome)

# Get potential spacers

possible_targets = utils.extract_possible_targets(gene_record)

# Score hits

possible_binding = utils.cas_offinder(possible_targets,

3, locus = [genome])

scores = estimator.predict_proba(possible_binding.values)

possible_binding['Score'] = scores

# Set hits within the gene to np.nan

slic = pd.IndexSlice[genome_key, :,

feat.location.start:feat.location.end]

possible_binding.loc[slic, 'Score'] = np.nan

# Aggregate and sort off-target scan

grouped_spacers = possible_binding.groupby('spacer')

ot_scores = grouped_spacers['Score'].agg('max')

# Spacers which only have intragenic hit

ot_scores.fillna(0, inplace=True)

ot_scores.sort_values(inplace=True)

# Save information

gene_info.append({'Product': product,

'Tag': tag,

'UsefulGuides': (ot_scores<=0.25).sum()})

top = ot_scores.head()

for protospacer, off_score in top.to_dict().items():

dna_target = protospacer.back_transcribe()

strand = '+'

if location == -1:

dna_target = reverse_complement(dna_target)

strand = '-'

location = genome.seq.find(dna_target)

library_grnas.append({'Product': product,

'Tag': tag,

'Protospacer': protospacer,

'Location': location,

'Strand': strand,

'Off Target Score': off_score})

gene_df = pd.DataFrame(gene_info)

library_df = pd.DataFrame(library_grnas)

This analysis shows that of the 3751 annotated genes in this draft C. difficile genome, 3506 (93.4%)206

of them can be targeted by at least five spacers without off-target effects. Fig. 2 shows the selected207

spacers along with the genes that are targeted. The design of the crseek library facilitates this208

analysis with only three lines of CRISPR-specific code.209

6/16


Task 3210

The crseek module is flexible enough to facilitate the analysis of novel experimental designs. Popula-211

tions of mixed HIV-1 LTR-containing plasmids were subjected to an in vitro cutting assay, as described212

above, in an effort to explore the most appropriate way of predicting SpCas9 effectiveness against a mixed213

population of DNA fragments. This mirrors the biological situation found in HIV-1-infected individu-214

als who often contain a swarm of related, yet genetically distinct, integrated viral genomes (Dampier215

et al., 2017). The script below describes how to compare the predictions of three popular methods with216

experimentally determined cutting profiles.217

# Load in the known data

cutting_data = pd.read_csv('data/IVCA_gRNAs_efficiency.csv')

# Text objects need to be converted to Bio.Seq

# objects with the correct Alphabet

ct = utils.smrt_seq_convert('Seq',

cutting_data['Target'].values,

alphabet=Alphabet.generic_dna)

cutting_data['target'] = list(ct)

sd = utils.smrt_seq_convert('Seq',

cutting_data['gRNA Sequence'].values,

alphabet=Alphabet.generic_dna)

cutting_data['spacer'] = [s.transcribe() for s in sd]

# Predict the score using each model across all sequences

ests = [('MIT', estimators.MITEstimator.build_pipeline()),

('CFD', estimators.CFDEstimator.build_pipeline()),

('Kinetic', estimators.KineticEstimator.build_pipeline())]

for name, est in ests:

# Models only require the gRNA and Target sequence columns

data = cutting_data[['spacer', 'target']].values

cutting_data[name] = est.predict_proba(data)

# Aggregate the predicted cutting data across each sample

predicted_cleavage = pd.pivot_table(cutting_data,

index = 'Patient',

columns = 'gRNA Name',

values = ['MIT', 'CFD',

'Kinetic'],

aggfunc = 'mean')

After aggregation, these results can be plotted against the known results as shown in Fig. 3. In these218

studies, the CFD model has the best correlation between the predicted and observed results (r2 = 0.90)219

when compared to an r2 = 0.81 for the MIT model and r2 = 0.83 for the Kinetic model.220

Task 4221

The extensible nature of the crseek module allows simple integration with the scikit-learn archi-222

tecture. The crseek.preprocessing transformers can be used to easily extract features from223

target, spacer pairs that can be used as inputs for scikit-learn prediction models. Addition-224

ally, the crseek.estimators subclass the sklearn.BaseEstimator and implement .fit,225

.predict, and .predict proba methods allowing them to be used interchangeably with native226

scikit-learn models. The script below shows how crseek tools can be used to develop new machine227

learning methods for identifying interaction rules as well as comparing the results to currently published228

methodologies.229

7/16


files = sorted(glob.glob('data/GUIDESeq/*/*.tsv'))

dfs = [pd.read_csv(f, sep='\t').reset_index() for f in files]

hit_data = pd.concat(dfs,

axis=0, ignore_index=True)

# Multiple data loading/cleaning steps have been omitted.

# See the Github repository files for the entire script.

from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import cross_validate, StratifiedKFold

# Convert spacer-target pairs into a binary "matching" vector

transformer = preprocessing.MatchingTransformer()

X = transformer.transform(hit_data[['spacer', 'target']].values)

y = (hit_data['NormSum'] >= cutoff).values

# Evaluate the module using 3-fold cross-validation

grad_res = cross_validate(GradientBoostingClassifier(), X, y,

scoring = ['accuracy',

'precision',

'recall'],

cv = StratifiedKFold(random_state=0),

return_train_score=True)

grad_res = pd.DataFrame(grad_res)

# Evaluate MITEstimator using 3-fold cross-validation

mit_res = cross_validate(estimators.MITEstimator(), X, y,

scoring = ['accuracy',

'precision',

'recall'],

cv = StratifiedKFold(random_state=0),

return_train_score=True)

mit_res = pd.DataFrame(mit_res)

The prediction models that are part of the crseek module can be directly used in the scikit-learn230

evaluation functions like cross validate. Additionally, the preprocessing tools can be used231

to convert (spacer, target) pairs into a feature space that is directly usable by other scikit-learn232

models. Fig. 4 shows the results of exploring simple, built-in, scikit-learn models utilizing data from a233

GUIDE-Seq experiment by Tsai et al. (2015). While these results are preliminary, there is evidence that234

more advanced machine learning methods such as gradient boosting and feed-forward neural networks235

may have improved prediction capabilities.236

Discussion237

As the CRISPR-Cas gene-editing field expands, the number of potential strategies will increase as well.238

It is not feasible to build single purpose-driven tools to account for all of these potential strategies. For239

example, nickase-based strategies require finding targets with potential spacers on each strand at the same240

position (Shen et al., 2014), while employing homologous recombination gene editing strategies requires241

finding targets that are a specific distance away (Chen et al., 2013). While it may be easy to develop242

single instances of these strategies by hand, it would be impracticable to do so across many, potentially243

thousands, of genes. As such, a modular architecture was employed to allow bioinformaticians to use244

individual parts of the library as needed to leverage the power of Python to design gene-editing strategies.245

The crseek module can also be used to explore the use of CRISPR-Cas gene-editing in non-model246

organisms. Even ”draft quality” genome sequences containing degenerate bases can be searched. This247

further includes the use of differing CRISPR-Cas systems that can be explored easily. Researchers can248

also integrate crseek into the scikit-learn ecosystem to develop more advanced estimators as more data249

8/16


becomes available.250

The crseek module can be used to evaluate the effectiveness of CRISPR-Cas gene in rapidly251

evolving bleeding-edge use-cases of the CRISPR-Cas system. This will be an invaluable tool for fields252

editing highly mutable genomes and meta-genomes of mixed populations. In the case of HIV-1, recent253

research has shown that inter-patient variability presents a barrier to effective targeting (Dampier et al.,254

2017; Roychoudhury et al., 2018). However, utilizing this tool it was possible to exploit the promiscuity255

of SpCas9 to construct a set spacers capable of targeting the wide range of known mutations (Dampier256

et al., 2018). Additionally, CRISPR gene-editing has been proposed as a strategy for shifting the human257

microbiome by selectively targeting specific species (Bikard and Barrangou, 2017; Gomaa et al., 2014).258

The crseekmodule can measure the effectiveness across each composed metagenome and accommodate259

arbitrarily complicated targeting rules (eg. target genomes A-E, miss genomes F-J).260

Lastly, the crseek module will be useful to researchers exploring the binding rules of non-model261

Cas9 variants. Task 4 shows the ease of training a new estimator given a set of binding data. Using a Cas9262

variant in a genome-wide binding experiment such as GUIDE-Seq (Tsai et al., 2015), CIRCLE-Seq (Tsai263

et al., 2017), or SITE-Seq (Cameron et al., 2017), can provide invaluable training data. The crseek264

module can be easily merged with the scikit-learn architecture or even more advanced deep learning265

frameworks such as Keras (Chollet et al., 2015) or Tensorflow (Abadi et al., 2015).266

Conclusion267

By designing a consistent API that mirrors other common tools in the Python scientific stack, crseek268

can be used to automate arbitrarily complicated gene editing strategies with minimal effort. Along with269

the multitude of implemented features, crseek is an invaluable resource for computational biologists270

exploring the intricacies of CRISPR-Cas gene editing.271

ACKNOWLEDGMENTS272

We would like to acknowledge the patients of the Drexel Medicine CARES cohort for providing material273

for this study. Patients in the Drexel CNS AIDS Research and Eradication Study (CARES) Cohort were274

recruited from the Partnership Comprehensive Care Practice of the Division of Infectious Disease and275

HIV Medicine at Drexel University College of Medicine (Drexel Medicine), located in Philadelphia,276

Pennsylvania. Patients in the Drexel Medicine CARES cohort were recruited under protocol 1201000748277

(Brian Wigdahl, PI), which adheres to the ethical standards of the Helsinki Declaration (1964, amended278

most recently in 2008). All patients provided written consent upon enrollment.279

REFERENCES280

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean,281

J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R.,282

Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster,283

M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas,284

F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow:285

Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.286

Ablain, J., Durand, E. M., Yang, S., Zhou, Y., and Zon, L. I. (2015). A CRISPR/Cas9 vector system for287

tissue-specific gene disruption in zebrafish. Developmental cell, 32(6):756–64.288

Anaconda (2016). Anaconda software distribution.289

Antell, G., Dampier, W., Aiamkitsumrit, B., Nonnemacher, M., Jacobson, J., Pirrone, V., Zhong, W.,290

Kercher, K., Passic, S., Williams, J., Schwartz, G., Hershberg, U., Krebs, F., and Wigdahl, B. (2016).291

Utilization of HIV-1 envelope V3 to identify X4- and R5-specific Tat and LTR sequence signatures.292

Retrovirology, 13(1).293

Bae, S., Park, J., and Kim, J.-S. (2014). Cas-OFFinder: a fast and versatile algorithm that searches294

for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics (Oxford, England),295

30(10):1473–5.296

Barrangou, R. (2015). The roles of CRISPR–Cas systems in adaptive immunity and beyond. Current297

Opinion in Immunology, 32:36–41.298

Bikard, D. and Barrangou, R. (2017). Using CRISPR-Cas systems as antimicrobials. Current Opinion in299

Microbiology, 37:155–160.300

9/16


Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer,301

P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. (2013).302

API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD303

Workshop: Languages for Data Mining and Machine Learning, pages 108–122.304

Cameron, P., Cameron, P., Settle, A. H., Fuller, C. K., Thompson, M. S., Cigan, A. M., Young, J. K., and305

May, A. P. (2017). SITE-Seq: A Genome-wide Method to Measure Cas9 Cleavage. Protocol Exchange.306

Cao, J., Wu, L., Zhang, S.-M., Lu, M., Cheung, W. K. C., Cai, W., Gale, M., Xu, Q., and Yan, Q. (2016).307

An easy and efficient inducible CRISPR/Cas9 platform with improved specificity for multiple gene308

targeting. Nucleic acids research, 44(19):e149.309

Chen, C., Fenk, L. A., and de Bono, M. (2013). Efficient genome editing in Caenorhabditis elegans by310

CRISPR-targeted homologous recombination. Nucleic Acids Research, 41(20):e193–e193.311

Chollet, F. et al. (2015). Keras.312

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck,313

T., Kauff, F., Wilczynski, B., and de Hoon, M. J. L. (2009). Biopython: freely available Python314

tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England),315

25(11):1422–3.316

Cong, L., Ran, F. A., Cox, D., Lin, S., Barretto, R., Habib, N., Hsu, P. D., Wu, X., Jiang, W., Marraffini,317

L. A., and Zhang, F. (2013). Multiplex genome engineering using CRISPR/Cas systems. Science (New318

York, N.Y.), 339(6121):819–23.319

Cui, Y., Xu, J., Cheng, M., Liao, X., and Peng, S. (2018). Review of CRISPR/Cas9 sgRNA Design Tools.320

Interdisciplinary Sciences: Computational Life Sciences, 10(2):455–465.321

Dampier, W., Sullivan, N., Chung, C.-H., Mell, J., Nonnemacher, M., and Wigdahl, B. (2017). Designing322

broad-spectrum anti-HIV-1 gRNAs to target patient-derived variants. Scientific Reports, 7(1).323

Dampier, W., Sullivan, N. T., Mell, J., Pirrone, V., Ehrlich, G., Chung, C.-H., Allen, A. G., DeSimone,324

M., Zhong, W., Kercher, K., Passic, S., Williams, J., Szep, Z., Khalili, K., Jacobson, J., Nonnemacher,325

M. R., and Wigdahl, B. (2018). Broad spectrum and personalized gRNAs for CRISPR/Cas9 HIV-1326

therapeutics. AIDS Research and Human Retroviruses, page AID.2017.0274.327

Davidson, M. (2016). megfp-pbad.328

Ding, Y., Li, H., Chen, L.-L., and Xie, K. (2016). Recent Advances in Genome Editing Using CRISPR/-329

Cas9. Frontiers in Plant Science, 7:703.330

Doench, J. G., Fusi, N., Sullender, M., Hegde, M., Vaimberg, E. W., Donovan, K. F., Smith, I., Tothova,331

Z., Wilen, C., Orchard, R., Virgin, H. W., Listgarten, J., and Root, D. E. (2016). Optimized sgRNA332

design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology,333

34(2):184–191.334

Doudna, J. A. and Charpentier, E. (2014). The new frontier of genome engineering with CRISPR-Cas9.335

Fu, Y., Foden, J. A., Khayter, C., Maeder, M. L., Reyon, D., Joung, J. K., and Sander, J. D. (2013).336

High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature337

Biotechnology, 31(9):822–826.338

Genscript (2016). CRISPR Handbook. GenScript, (October):1–32.339

Gomaa, A. A., Klumpe, H. E., Luo, M. L., Selle, K., Barrangou, R., and Beisel, C. L. (2014). Pro-340

grammable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. mBio,341

5(1):00928–13.342

Hsu, P. D., Scott, D. A., Weinstein, J. A., Ran, F. A., Konermann, S., Agarwala, V., Li, Y., Fine, E. J.,343

Wu, X., Shalem, O., Cradick, T. J., Marraffini, L. A., Bao, G., and Zhang, F. (2013). DNA targeting344

specificity of RNA-guided Cas9 nucleases. Nature Biotechnology, 31(9):827–832.345

Hu, W., Kaminski, R., Yang, F., Zhang, Y., Cosentino, L., Li, F., Luo, B., Alvarez-Carbonell, D., Garcia-346

Mesa, Y., Karn, J., Mo, X., and Khalili, K. (2014). RNA-directed gene editing specifically eradicates347

latent and prevents new HIV-1 infection. Proceedings of the National Academy of Sciences of the348

United States of America, 111(31):11461–6.349

Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J. A., and Charpentier, E. (2012). A pro-350

grammable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science (New York,351

N.Y.), 337(6096):816–21.352

Kim, Y. B., Komor, A. C., Levy, J. M., Packer, M. S., Zhao, K. T., and Liu, D. R. (2017). Increasing the353

genome-targeting scope and precision of base editing with engineered Cas9-cytidine deaminase fusions.354

Nature biotechnology, 35(4):371–376.355

10/16


Klein, M., Eslami-Mossallam, B., Arroyo, D. G., and Depken, M. (2018). Hybridization Kinetics Explains356

CRISPR-Cas Off-Targeting Rules. Cell Reports, 22(6):1413–1423.357

Li, L., Aiamkitsumrit, B., Pirrone, V., Nonnemacher, M. R., Wojno, A., Passic, S., Flaig, K., Kilareski, E.,358

Blakey, B., Ku, J., Parikh, N., Shah, R., Martin-Garcia, J., Moldover, B., Servance, L., Downie, D.,359

Lewis, S., Jacobson, J. M., Kolson, D., and Wigdahl, B. (2011). Development of co-selected single360

nucleotide polymorphisms in the viral promoter precedes the onset of human immunodeficiency virus361

type 1-associated neurocognitive impairment. Journal of NeuroVirology, 17(1):92–109.362

Li, L., Parikh, N., Liu, Y., Kercher, K., Dampier, W., Flaig, K., Aiamkitsumrit, B., Passic, S., Frantz,363

B., Blakey, B., Pirrone, V., Moldover, B., Jacobson, J., Feng, R., Nonnemacher, M., and Wigdahl,364

B. (2015). Impact of Naturally Occurring Genetic Variation in the HIV-1 LTR TAR Region and Sp365

Binding Sites on Tat-Mediated Transcription. Journal of Human Virology & Retrovirology, 2(4).366

Marraffini, L. A. and Sontheimer, E. J. (2010). CRISPR interference: RNA-directed adaptive immunity in367

bacteria and archaea. Nature Reviews Genetics, 11(3):181–190.368

Monot, M., Boursaux-Eude, C., Thibonnier, M., Vallenet, D., Moszer, I., Medigue, C., Martin-Verstraete,369

I., and Dupuy, B. (2011). Reannotation of the genome sequence of Clostridium difficile strain 630.370

Journal of Medical Microbiology, 60(8):1193–1199.371

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer,372

P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and373

Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning374

Research, 12:2825–2830.375

Roychoudhury, P., De Silva Feelixge, H., Reeves, D., Mayer, B. T., Stone, D., Schiffer, J. T., and Jerome,376

K. R. (2018). Viral diversity is an obligate consideration in CRISPR/Cas9 designs for targeting the377

HIV reservoir. BMC Biology, 16(1):75.378

Shen, B., Zhang, W., Zhang, J., Zhou, J., Wang, J., Chen, L., Wang, L., Hodgkins, A., Iyer, V., Huang, X.,379

and Skarnes, W. C. (2014). Efficient genome modification by CRISPR-Cas9 nickase with minimal380

off-target effects. Nature Methods, 11(4):399–402.381

Sternberg, S. H. and Doudna, J. A. (2015). Expanding the Biologist’s Toolkit with CRISPR-Cas9.382

Tsai, S. Q., Nguyen, N. T., Malagon-Lopez, J., Topkar, V. V., Aryee, M. J., and Joung, J. K. (2017).383

CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR-Cas9 nuclease off-targets.384

Nature methods, 14(6):607–614.385

Tsai, S. Q., Zheng, Z., Nguyen, N. T., Liebers, M., Topkar, V. V., Thapar, V., Wyvekens, N., Khayter,386

C., Iafrate, A. J., Le, L. P., Aryee, M. J., and Joung, J. K. (2015). GUIDE-seq enables genome-wide387

profiling of off-target cleavage by CRISPR-Cas nucleases. Nature biotechnology, 33(2):187–197.388

Van Rossum, G., Warsaw, B., and N., C. (2001). Pep 8 – style guide for python code.389

Wu, X., Scott, D. A., Kriz, A. J., Chiu, A. C., Hsu, P. D., Dadon, D. B., Cheng, A. W., Trevino, A. E.,390

Konermann, S., Chen, S., Jaenisch, R., Zhang, F., and Sharp, P. A. (2014). Genome-wide binding of391

the CRISPR endonuclease Cas9 in mammalian cells. Nature biotechnology, 32(7):670–6.392

FIGURES & TABLES393

11/16


Patient Clone Target Cleavage Efficiency

R5 1 ATCAGATATCCACTGACCTT TGG 73%

X4 1 ACCAGATATCCACTGTGCTT TGG 6%

A0019

1 ACAAGATTTCCACTGACCTT TGG

56%2 ACAAGATTTCCACTGACCTT TGG

A0020

1 ACCAGATTTCCCCTGACCTT TGG

25%2 ACCAGATTTCCCCTGACCTT TGG

A0049

1 GTCAGATACCCACTGACCTT TGG

91%2 GTCAGATACCCACTGACCTT TGG

3 GTCAGATATCCACTGACCTT TGG

A0050 1 ACCAGGTTTCCACTGACCTT TGG 58%

A0107


84%




5 ACTAGATGGCCACTGACCTT TGG


Table 1. Sequences used in the in vitro cutting assay. The Target column refers to the target region of

the cloned sequence with the highest match to the LTR-A gRNA used for cleavage

(ATCAGATATCCACTGACCTT NGG) in the LTR (loci). The pam region is bolded. The Cleavage

Efficiency refers to the percentage of the plasmids that were cleaved as measured by the BioAnalyzer.

12/16


Amp-R

ori

EGFP

araC

T7 Tag

Figure 1. A plasmid map of mEGFP-pBAD with the gene and promoter features labeled in blue and

light blue. Potential spacer-targeting sites are shown in green.

13/16


Designable genes

Undesignable genes

Spacers

Figure 2. Approximately 93% of genes in the C. difficile genome can be targeted by at least 5 spacers

without predicted off-target effects. Gene regions are shown along the inner track with those in blue

representing those for which at least five spacers could be designed while those in red could not be

properly targeted. The outer track shows the location of spacers in green.

14/16


Observed Cleavage 0 0.25 0.50 0.75 1.0



Pre

dic

ted

Cle

ava

ge

0

0.2

0.4

0.6

0.8

1.0

A B C MIT CFD Kinetic

r2=0.81 r2=0.90 r2=0.83

Figure 3. Prediction models have different levels of agreement with in vitro data. Three different

prediction models were used to predict the cleavage percentage of a mixed population of targets. Each

point represents the percentage of cleavage observed on a population of distinct genetic variants

compared to the predicted percentage of cleavage. The line indicates the regression line, while the

shadows indicate the 95% confidence intervals of 1000 bootstrap replicates. A) The predictions were

performed using the algorithm proposed by Hsu et al. (2013). B) The predictions were performed using

the algorithm proposed by Doench et al. (2016) C) The predictions were performed using the algorithm

proposed by Klein et al. (2018).

15/16


MIT

CF

D

Kin

etic

Gra

die

nt B

oo

stin

g

Ne

are

st N

eig

hb

ors

Ne

ura

l N

etw

ork

Me

an

Ab

so

lute

Err

or

0

0.05

0.10

0.15

0.20

0.25

Figure 4. Advanced machine learning techniques may be useful in improving the agreement between

predicted and observed cleavage. Each bar indicates the mean absolute error between the predicted

cleavage rate and the observed cleavage rate measured using GUIDE-Seq by Tsai et al. (2015).

Scikit-learn models were fit using default parameters on the output of the

crseek.preprocessing.MisMatchEstimator.

16/16


crseek: a python module for facilitating …1 crseek: a python module for facilitating 2 complicated...

Documents