
Title: Machine Learning for Enzyme Engineering, Selection, and Design

Title: Machine Learning for Enzyme Engineering, Selection, and Design

Authors: Ryan Feehan*1, Daniel Montezano*1, and Joanna S.G. Slusky1,2

Affiliations: 1 Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66045-7534. 2 Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave., Lawrence, KS 66045-3101

*These authors contributed equally

Correspondence: [email protected]

Abstract

Machine learning is a useful computational tool for large and complex tasks such as those in the field of enzyme engineering, selection, and design. In this review, we examine enzyme-related applications of machine learning. We start by comparing tools that can identify the function of an enzyme and the site responsible for that function. Then we detail methods for optimizing important experimental properties, such as the enzyme environment and enzyme reactants. We describe recent advances in enzyme systems design, and in enzyme design itself. Throughout, we compare and contrast the data and algorithms used for these tasks to illustrate how algorithms and data can best be used by future designers.

1. Introduction

Enzymes catalyze chemical reactions that would otherwise require high temperature or pressure, and they increase reaction rates up to a million-fold. Because of these qualities, enzymes find numerous applications in environmental remediation (Sharma et al. 2018), human health, and industrial synthesis (Jegannathan and Nielsen 2013). Moreover, using enzymes in industrial synthesis is often more environmentally friendly than other synthetic alternatives (Jegannathan and Nielsen 2013). Designing, engineering, and selecting enzymes is therefore a subject of great medical, environmental, and industrial importance.

A large number of enzymes have already found their way into scientific research and industry. With a sense that the properties of these natural proteins could be improved, decades-long design efforts have continually modified these enzymes in order to sharpen their function and optimize the conditions under which they work (Fox and Huisman 2008).
However, sequence space is very large, and nature has left a significant part of it unexplored. It is likely that interesting new enzymes could be created if we could explore these untouched regions of protein sequence space in an efficient manner. For decades, protein design has promised not just new proteins but also that the process of designing new proteins would reveal fundamental truths about protein folding and protein interactions. The same is true for enzyme design, as it can teach us which intrinsic enzymatic properties are encoded in the sequences of amino acids and in the atomic arrangements of catalytic sites.


Machine learning (ML) is becoming the tool of choice for such exploratory efforts. ML is learning done without full manual oversight. It is also sometimes called statistical learning because it focuses on patterns or statistical properties of data. Machine learning can mean learning the distribution of the data, learning to make predictions for new data, learning to generate new data from estimated data distributions, or even the process of continuous learning by machines, as in artificial intelligence. As it applies to enzyme engineering, ML provides a way to use biological data (at the organism, protein sequence, protein structure, residue, or atomic level) to extract information (patterns or data distributions) that can then be used for downstream tasks such as classifying new enzymes; predicting properties of enzymes, their substrates, and their optimal microenvironment; and finding new enzymes or combinations of enzymes with better catalytic activity. With more experimental data and more computational power available, ML efforts in the field of enzymology include the creation of new databases for training, data preprocessing (Mazurenko et al. 2020), and the ethical concerns related to creating new biomolecules with unintended characteristics (Kemp et al. 2020). Over the past three years, ML has been applied toward answering long-standing questions in enzymology: predicting the function (or functions) of a putative enzyme given its sequence or structure, predicting what mutations would improve catalytic turnover rates, and predicting how environmental changes will affect enzyme function. Here we review those recent advances, focusing specifically on the preferred datasets, algorithms, and features that facilitated the advances of the past three years in enzyme identification, classification, optimization, systemization, and, finally, de novo enzyme design. Special consideration is given to models that are interpretable such that the importance of features to enzyme function is revealed. The top models we discuss are summarized in Table 1.


Table 1. Resource Table. Selected list of recent, excellent enzymatic ML tools and methods organized by the task they perform as covered in this review. A dash is used in lieu of a name if a tool/method does not have an author-designated name. ML method abbreviations are: convolutional neural network (CNN), support vector machine (SVM), k-nearest neighbors (kNN), recurrent neural networks (RNN), random forest (RF), naïve Bayes (NB), gradient boosting regression trees (GB), partial least squares regression (PLSR), linear regression (LR), multi-layer perceptron (MLP), neural network (NN), and direct coupling analysis (DCA). Input type abbreviations are: sequence (Seq), structure (Struct), and experimental (Exp). Best use specifies what each resource is capable of predicting, highlighting differences from similar resources when applicable. Availability specifies if a resource tool could be accessed via a webserver or if the model/code was available for download. All the tools with an available webserver also have download options.


2. Enzyme function prediction

A substantial amount of work has been done in attempting to computationally predict a protein's enzymatic classification or its catalytic site. Common non-ML methodology for both includes using homology to detect enzyme function from the entire sequence (BLAST+) (Camacho et al. 2009), a sequence motif (PROSITE) (Sigrist et al. 2013), or a domain (Pfam) (Finn et al. 2016). These tools predict enzyme function by using previously annotated enzymes or enzyme sites and finding sequences or sites that are similar to the annotated examples. Homology-based predictions are not sensitive to small, important amino acid changes and require numerous, well-studied homologs. In contrast, ML methods use sequence and structural features beyond sequence or structural similarity, allowing ML models to make accurate enzymatic predictions on proteins with few or no homologs.

2.1 ML for enzymatic classification
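To make the homology limitation concrete, a short sketch (with invented toy sequences, not real proteins) shows how a single catalytic-residue substitution barely changes global sequence identity, so an identity-based homology search alone cannot flag it:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Hypothetical 20-residue "enzyme": one catalytic-residue swap
# (a serine replaced by alanine) barely moves global identity.
wild_type = "MKTAYIAKQRSISTVGGAHA"
mutant    = "MKTAYIAKQRAISTVGGAHA"  # S->A at position 11; toy example

print(percent_identity(wild_type, mutant))  # 95.0 -- still "homologous"
```

The two sequences remain 95% identical, yet the toy mutant has lost its hypothetical catalytic serine; features beyond whole-sequence similarity are needed to catch such cases.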

The Nomenclature Committee of the International Union of Biochemistry classifies enzymatic reactions using a four-level hierarchy, called Enzyme Commission (EC) numbers (McDonald and Tipton 2014). EC numbers contain four numbers, one for each level, separated by periods. The task of predicting whether a protein is enzymatic or not is often referred to as level zero. Level one divides enzymes into seven major enzyme classes: 1: oxidoreductases, 2: transferases, 3: hydrolases, 4: lyases, 5: isomerases, 6: ligases, and 7: translocases. The next three levels use reaction elements such as the chemical bond and substrate to further categorize enzymes into subclasses. Previous ML EC number predictors used classical classification algorithms (Amidi et al. 2017; Che et al. 2016; De Ferrari et al. 2012; Kumar and Skolnick 2012; Li et al. 2016; Shen and Chou 2007; Zou and Xiao 2016). These classical algorithms differ from the current, mostly deep learning predictors in that the classical predictors require user-defined features. Conversely, deep learning methods extract features from raw data representations, which leads to overall better metrics. The transition to and popularity of deep networks coincided with the availability of deep learning methods in TensorFlow (Martín Abadi et al. 2015). Here we compare the similarities and differences of four recent and effective ML EC number predictors: DEEPre (Li et al. 2018), mlDEEPre (Zou et al. 2019), ECPred (Dalkiran et al. 2018), and DeepEC (Ryu et al. 2019). ML methods for classification of enzyme reactions are mostly sequence-based, due in large part to the abundance of protein sequences with associated EC numbers. Some of these models use homologous sequences in order to have more data, while other models avoid homologs to prevent overfitting. For example, DeepEC uses 1,388,606 protein sequences with 4,669 different EC numbers.
The data set was made by mapping EC numbers to over a million unannotated sequences using sequence similarity. Although homologs were seen as helpful for training DeepEC, it is more common to remove sequences with high similarity in order to avoid bias and overfitting during performance evaluation. For example, another ML enzyme classifier, ECPred, uses representatives from >50% sequence-similarity clusters to create a non-redundant dataset with only 55,180 enzyme sequences. A consequence of using a smaller dataset is that ECPred is only capable of making 634 complete EC number predictions and 224 partial EC number predictions, in contrast with almost five thousand complete EC number predictions by DeepEC.

In addition to being sequence-based, all four ML enzyme classifiers use some variation of a level-by-level approach. Level-by-level predictors decompose the problem into simpler tasks by using one or more predictors for each EC number level. Generally, a binary classifier is used for level zero predictions, trained on enzyme sequences as positives and non-enzyme sequences as negatives. From there, level-by-level predictors progress through the EC number hierarchy. The success of level-by-level prediction is due to the diversity of criteria considered throughout the EC hierarchy. For example, isomerases are further divided into types of isomerization, a criterion that can only be considered at level two for the isomerase enzyme class. A complete level-by-level strategy is computationally intensive. ECPred trains a different model for each of its 859 possible predictions, making it the most computationally expensive of the four tools. mlDEEPre builds on the level-by-level predictor in DEEPre by adding a model between level 0 and level 1 that is trained to predict whether an enzyme has one or multiple EC numbers (because it catalyzes multiple reactions). Different reactants are the most frequent reason for multi-functional enzymes (Dönertaş et al. 2016); therefore mlDEEPre can also be useful for substrate identification, which we describe below.

There are a number of difficulties in creating direct comparisons among the most recent models. The evolutionary nature of biological data complicates evaluations. Most enzymes within a category are homologs, but training and then testing on close homologs causes overfitting and inflates metrics, essentially training the model on the test set. Although methods that check sequence similarity, such as ECPred and DEEPre, report more reliable self-performance evaluations, differences in the sequence similarity cutoffs used complicate comparing these evaluations. Studies frequently also include comparisons with competing tools. To improve comparability, studies often evaluate their performance on a test set created by previous methods. When compared on a common test set (Roy et al. 2012), ECPred and DEEPre display similar performance. Generating new, specialized test sets is also useful for performance evaluations, such as the no-Pfam set created by the ECPred authors.
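The level-by-level strategy can be sketched as a cascade of per-level classifiers. The models below are stand-in stubs (real predictors train one classifier per level on sequence-derived features), and the feature names are invented for illustration:

```python
# Minimal sketch of a level-by-level EC predictor. All models are stubs;
# real tools train a classifier per level on sequence-derived features.

EC_CLASSES = {1: "oxidoreductase", 2: "transferase", 3: "hydrolase",
              4: "lyase", 5: "isomerase", 6: "ligase", 7: "translocase"}

def predict_level0(features):  # stub binary enzyme/non-enzyme model
    return features.get("has_catalytic_motif", False)

def predict_level1(features):  # stub seven-way main-class model
    return features.get("main_class_hint", 3)

def predict_subclasses(features, main_class):  # stub lower-level models
    return features.get("subclass_hints", (0, 0, 0))

def level_by_level(features):
    """Cascade through the EC hierarchy, stopping at level 0 for
    non-enzymes and otherwise assembling a four-level EC number."""
    if not predict_level0(features):
        return None                      # level 0: not an enzyme
    main = predict_level1(features)      # level 1: one of 7 classes
    sub = predict_subclasses(features, main)
    return (main,) + sub                 # full four-level EC number

ec = level_by_level({"has_catalytic_motif": True, "main_class_hint": 3,
                     "subclass_hints": (2, 1, 4)})
print(ec, EC_CLASSES[ec[0]])  # (3, 2, 1, 4) hydrolase -> EC 3.2.1.4
```

The cascade structure makes the computational cost visible: a tool like ECPred, which covers 859 possible predictions, needs a separately trained model at each branch of this tree.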
Since Pfam annotations are added using homology, the no-Pfam test set represents challenging cases such as those faced when predicting on novel and de novo sequences. ECPred outperforms DEEPre when compared on the set with no Pfam annotations. However, as acknowledged by ECPred's authors, DEEPre's performance may be hindered by its use of domain annotations as a feature, which points out that such tests may not always be fair to competing algorithms. Considerations for accurate comparisons include differences in how models report classifications. For example, DeepEC only generates and evaluates complete four-level predictions. Conversely, ECPred predicts only partial EC numbers in cases where a fuller prediction is at low confidence or is not possible. Thus, for four-level EC numbers that ECPred predicts, ECPred surpasses DEEPre; whereas for complete four-level prediction that includes classes ECPred does not predict, DEEPre shows superior metrics. Finally, EC numbers are frequently added, amended, transferred, and deleted based on new data. To our knowledge, no ML tools re-train with every EC number update to account for this. An extreme example of this issue is that most recent publications ignore translocases, a level-one enzyme class added in mid-2018.

2.2 ML for catalytic site prediction

Catalytic site prediction may assist in computationally discriminating between better and worse enzyme designs. Unlike enzyme classification, which is best suited to sequence information, catalytic site prediction works best with structure-based methods. Structure-based methods require annotations at the residue level, as opposed to the whole-protein-level annotations of EC numbers. The M-CSA (Mechanism and Catalytic Site Atlas) contains residue-level annotations for 964 unique reactions, making it the best available database (Ribeiro et al. 2018).
These annotations can only be mapped to active sites for ~60% of the 91,112 protein structures in the Protein Data Bank with associated EC numbers (as of 11/19/2020) (Berman et al. 2000). The enzyme site predictor PreVAIL (Song et al. 2018) passes both sequence and structural features to a random forest (RF) algorithm that is capable of ranking features by their importance to the model, thereby revealing aspects of what makes an enzyme. Top features for PreVAIL came from both sequence (such as evolutionary information) and structure (such as solvent accessibility and contact number). PreVAIL's authors compared PreVAIL to other methods using eight different data sets. Competitors included support vector machines (SVMs) and neural networks that use sequence and/or structural data (Gutteridge et al. 2003; Petrova and Wu 2006; Zhang et al. 2008). For a holdout test set, PreVAIL has recall and precision values of 62.2% and 14.9%, respectively, which is better than a sequence-based SVM's recall and precision values of 50.1% and 14.7%, respectively (Zhang et al. 2008). When contextualizing these low precision rates, it is important to note the extreme imbalance of data used for catalytic residue prediction: the high proportion of non-catalytic residues in the set makes high precision difficult to achieve. PreVAIL outperformed competing methods on all except two test sets, where a competing method scored higher recall values with lower precision values than PreVAIL. The competing methods that scored higher recall values on a test were an SVM (Petrova and Wu 2006), which like PreVAIL uses sequence and structure features, and a neural network when it uses only structural features. A new type of feature extraction method for enzyme site prediction was recently implemented by the Altman group (Torng and Altman 2019b). In this method, a local atomic distribution, represented by a grid-like data structure, is used as input for a convolutional neural network (CNN). CNNs are a deep learning algorithm known for their excellent performance on spatial applications such as facial recognition. Passing the raw structural data allows the CNN to perform feature extraction, which can overcome issues faced by physicochemical properties, such as data loss and high dimensionality.
The three-dimensional CNN (3DCNN) outperformed a CNN that was passed physicochemical features calculated by the authors' commonly used FEATURE program (Bagley and Altman 1995). Unfortunately, this new feature extraction method has only been shown to produce models capable of predicting sites for a specific enzyme domain or family. As such, it will require additional work relative to previously discussed tools and is limited to enzyme domains with available data. The 3DCNN has an impressive average recall of 95.5% and a precision of 99%. When comparing these metrics to PreVAIL's, it is important to note that each 3DCNN model covers a single enzyme class connected to one M-CSA entry. Conversely, PreVAIL is intended to cover all enzymes, testing and training on proteins with less than 30% sequence similarity, making for few, if any, M-CSA reactions being present in both the training and testing sets.
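The grid-like input such a 3DCNN consumes can be sketched as a voxelized, one-hot occupancy map of the atoms surrounding a candidate site. The atom coordinates and the four-channel element scheme below are invented for illustration; real inputs are derived from PDB structures centered on a residue of interest:

```python
import numpy as np

# Sketch: map (element, x, y, z) atoms inside a cube centered on the
# origin to a per-element occupancy grid of shape (channels, n, n, n).
CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}

def voxelize(atoms, box=8.0, resolution=1.0):
    """One-hot occupancy grid over a box of side `box` angstroms."""
    n = int(box / resolution)
    grid = np.zeros((len(CHANNELS), n, n, n), dtype=np.float32)
    for element, x, y, z in atoms:
        # shift coordinates so the box corner is at index 0
        idx = [int((c + box / 2) / resolution) for c in (x, y, z)]
        if all(0 <= i < n for i in idx) and element in CHANNELS:
            grid[CHANNELS[element], idx[0], idx[1], idx[2]] = 1.0
    return grid

# invented toy neighborhood of three atoms
atoms = [("O", 0.2, -1.1, 0.4), ("C", 1.9, 0.3, -2.5), ("N", -3.0, 2.2, 1.0)]
grid = voxelize(atoms)
print(grid.shape, int(grid.sum()))  # (4, 8, 8, 8) 3
```

Because the grid preserves spatial relationships among atoms, convolutional filters can learn local geometric patterns directly, which is the feature-extraction step that hand-crafted physicochemical descriptors would otherwise perform.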

3. Applications of Enzyme Properties Prediction

Predicting enzymatic function and the level of activity for that function are important steps in enzyme engineering and must be taken into account in order to engineer or select enzymes that are useful in industrial settings. Enzyme activity depends not only on the specificity of an enzyme for different substrates but also on the microenvironment in which the enzyme performs its function. This microenvironment includes many factors, such as the reaction temperature and enzyme solubility. ML has been used to predict optimal enzyme conditions and enzyme substrate specificity, and it has aided in understanding which properties of enzymes are impacted most by changes in the microenvironment in which enzymes perform their function.

3.1 Condition Optimization

Each enzyme works best at a specific temperature, and most enzymes must be soluble in specific media in order to function. These conditions are critical for large-scale industrial use of biocatalysts. ML methods are an attractive bioinformatics tool for determining optimal enzyme conditions. Current models exclusively use sequence-based features (and not structure-based features), thereby reducing the cost and complexity of dataset creation. Recent ML models with features of single or paired amino acids have shown compelling successes in determining optimal temperatures for organismal growth and catalysis (Li et al. 2019). Using a non-linear regression vector machine model, organismal growth temperature was predicted from sequence data and then subsequently used as a feature in a random forest model, along with amino acid composition, to predict the optimal catalytic temperature for individual enzymes (Li et al. 2019). For this second task, combining amino acid composition features with a physiological property such as optimal growth temperature provided higher prediction performance. Interestingly, attempts to improve the model with further feature engineering and the introduction of a more complex deep learning model (Li et al. 2020) did not produce higher performance in predicting enzyme catalytic temperature. However, significant improvements over the model presented in (Li et al. 2019) were obtained when data set imbalance was reduced (Gado et al. 2020). This work was informed by the fact that available data on temperature-dependent enzyme function is highly skewed, with most experimental data points normally distributed around 37°C and less than 5% of data having temperatures above 85°C. This skew had led to overall low performance when predicting high-temperature enzymes, despite the considerable industrial use of enzymes at high temperatures. The model was improved using imbalance-aware ML models in combination with resampling and ensemble learning. The best performance was obtained with an ensemble tree-based model, which had a 60% improvement in prediction of optimal catalytic temperature at high temperature ranges.
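The single- and paired-amino-acid features these sequence-only temperature predictors rely on are straightforward to compute. A minimal sketch (the sequence is a toy, not drawn from any of the cited datasets):

```python
from collections import Counter
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(seq):
    """Fraction of each of the 20 amino acids in a sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AA]

def dipeptide_freq(seq):
    """Frequency of each of the 400 ordered amino acid pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs[a + b] / total for a, b in product(AA, AA)]

seq = "MKVLAAGLLALA"  # toy sequence, not a real enzyme
comp = aa_composition(seq)   # 20-dimensional feature vector
dipep = dipeptide_freq(seq)  # 400-dimensional feature vector
print(len(comp), len(dipep))  # 20 400
```

Concatenating such fixed-length vectors (optionally with a physiological property like predicted growth temperature) gives the kind of input that the random forest and ensemble tree models described above consume.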
Other features derived from protein sequences, such as sequence-order information (Chou 2001) and the frequency of amino acid triads (Shen et al. 2007), have also been used to determine enzyme optimum temperature. These features were used to successfully classify a small set of xylanases into three thermophilicity classes (non-thermophilic, thermophilic, and hyper-thermophilic) (Foroozandeh Shahraki et al. 2020). Thus, the state of the art allows for accurate prediction of optimal temperature conditions (both growth and catalytic) from sequence data alone, at least within classes of enzymes. Determining optimal conditions for industrial enzyme use also includes enzyme solubility, as experimental characterization usually requires proteins to be soluble in heterologous expression. EnzymeMiner is a pipeline that includes an ML model for solubility prediction (Hon et al. 2020). The model predicts the solubility of an enzyme in an E. coli expression system, and the whole pipeline is available as a webserver. Using a template-based approach, EnzymeMiner produces a ranked list of candidates for experimentation from a list of putative enzymes of unknown function. The EnzymeMiner pipeline identifies enzymes from user-defined criteria and ranks the results according to predicted solubility. EnzymeMiner can be used as a downstream step after models that predict catalytic temperature to prioritize sequences for experimental studies. As a webserver, EnzymeMiner allows for interoperability of ML models. Since several isolated tools (described above) predict optimal conditions for different aspects of the microenvironment where enzymes function, selection guides such as EnzymeMiner that promote an easy-to-run integrated workflow should receive more focus in future ML efforts for enzyme engineering.

3.2 Substrate Identification

Enzymes can be promiscuous: they can catalyze changes on multiple substrates with varying specificities.
Substrate identification for enzymatic activities is a subset of the larger field of drug and small molecule identification (Lo et al. 2018). When coupled with high-throughput experimental technologies and deep evolutionary analysis, ML can help identify enzyme substrates with particular specificities or affinities for a variety of enzymes. Peptide substrates have been designed to bind to specific 4'-phosphopantetheinyl transferases while not binding to homologs using a method called POOL (Tallorin et al. 2018). In this design, an interpretable naïve Bayes classifier allowed for iterative improvement and was therefore preferred to black-box methods like support vector machines or deep neural networks. ML has also been used to identify which glycosyltransferases attach which sugars (Taujale et al. 2020). Deep evolutionary mining identified regions of sequences responsible for the sugar transfer mechanism and for transferase acceptor specificity. While the latter is determined by hypervariable regions in the enzyme sequence that are subfamily-specific, the former is determined by features of the binding pocket. Expanding on previous work with a smaller glycosyltransferase family (Yang et al. 2018), an ensemble tree method was trained to identify and annotate the donor substrate for a large portion of glycosyltransferases with unknown function. Interestingly, as in the case of temperature optimization, ensemble learning provided high prediction performance on a hard task with extensive sequence variation: the six-substrate classifier achieved 90% accuracy in predicting which of six different sugars is transferred. Similar to the Tallorin et al. model, the ensemble gradient boosting regression tree model allowed for interpretability by measuring the relative importance of different features. Specific conserved residues (including second-shell residues) were identified as determinants of specificity. In addition, the physicochemical properties of side-chain volume, polarity, and residue accessible area were correlated with each different donor class.
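A minimal illustration of why a naïve Bayes classifier like the one used in POOL is considered interpretable: its per-class, per-position residue counts can be inspected directly, so each prediction can be traced to specific positions. The peptides and labels below are invented toy data, not from Tallorin et al.:

```python
import math
from collections import defaultdict

def train_nb(peptides, labels, alpha=1.0):
    """Per-class, per-position residue counts with Laplace smoothing."""
    model = defaultdict(lambda: defaultdict(lambda: alpha))
    priors = defaultdict(int)
    for pep, y in zip(peptides, labels):
        priors[y] += 1
        for pos, res in enumerate(pep):
            model[y][(pos, res)] += 1
    return model, priors, len(peptides)

def log_score(model, priors, n, pep, y):
    """Log prior plus per-position log likelihoods (20-residue alphabet)."""
    s = math.log(priors[y] / n)
    for pos, res in enumerate(pep):
        s += math.log(model[y][(pos, res)] / (priors[y] + 20))
    return s

# invented 4-residue peptides with binder/nonbinder labels
peptides = ["GDSL", "GDSV", "AKTL", "AKTV"]
labels = ["binder", "binder", "nonbinder", "nonbinder"]
model, priors, n = train_nb(peptides, labels)

query = "GDSI"
pred = max(priors, key=lambda y: log_score(model, priors, n, query, y))
print(pred)  # binder
```

Inspecting `model["binder"]` shows exactly which position/residue combinations drive the classification, which is the property that makes iterative peptide redesign tractable.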
The prediction of substrate specificity and the identification of protein features that correlate with it have also been applied to less well-characterized enzyme families such as the thiolase superfamily (Robinson et al. 2020a) and adenylate-forming enzymes (Robinson et al. 2020b). Random forest models outperformed other approaches (e.g., artificial neural networks, linear models, SVMs) on these tasks, where enzymes can present extensive sequence variation and have a broad range of substrate specificities. In both cases, amino acids lining the active site were selected and encoded with physicochemical properties. For thiolases (Robinson et al. 2020a), features derived from the substrate and the whole sequence were also used, and the binary classifier predicted whether an enzyme-substrate pair would be active or inactive. For the adenylate-forming family (Robinson et al. 2020b), evolutionary analysis was combined with ML to detect substrate-binding residues, and the physicochemical properties of those residues were encoded as features. Substrate identification is closely related to function prediction for multifunctional enzymes, and the methods reviewed in this section complement the deep learning model mlDEEPre described previously (Zou et al. 2019).

3.3 Turnover Rate

In addition to predicting whether a particular thiolase can convert a particular substrate, the level of activity for each sequence/substrate combination was predicted using a random forest regression model (Robinson et al. 2020a). This is an example of a third important application of ML in enzyme engineering: enzyme activity level prediction. This method can be used to compare the relative importance of specific residue locations that are altered in different sequences. The study trained two random forest models: a classifier to decide whether a sequence/substrate pair has enzymatic activity or not, and a regression model to predict the level of activity of a sequence/substrate pair predicted to be active.
The most important features for the classifier model included chemical descriptors of the substrate such as the aromaticity index and molecular connectivity index, while for activity level prediction there was a correlation of activity level with the oxygen content and molecular connectivity index of the substrate. Important protein features included the physicochemical properties of residues lining the pocket. For the enzyme glucose oxidase (used in the food industry and in blood glucose measuring devices), ML was successfully applied to predict the effects of multiple mutations on enzyme activity under different environmental conditions, such as pH and mediator molecules (Ostafe et al. 2020). Using only sequence features combined into a protein Fast Fourier Transform spectrum, the model captured both the effect of multiple mutations and epistatic effects. ML models based on molecular dynamics trajectories are also predictive of enzyme reactivity. Features computed from the conformational trajectory of an enzyme-substrate complex were used to predict the reactivity of ketol-acid reductoisomerase. Using least absolute shrinkage and selection operator (LASSO) feature selection (Tibshirani 1996), features of the reactants at the beginning of the trajectory were shown to be highly predictive of reactivity. Predictive features included intra-substrate features (molecule organization such as bond lengths and angles) and features measuring the interaction between the substrate and the environment (binding and distance to water molecules and metal ions) (Bonk et al. 2019). A method for computing atomistic features, similar to the previously described 3DCNN for enzyme site prediction (Torng and Altman 2019a), was trained using molecular dynamics trajectories and was applied to predict reaction rates from hydrolysis trajectories (Chew et al. 2020). Though the enzyme-substrate complex was most predictive of enzyme rates, features quantifying the interaction of the substrate with the environment were also used and shown to partially encode the necessary information for reliable rate prediction. Intra-substrate features were not used in this model, leaving open the possibility that these features, successfully used in (Bonk et al.
2019), could account for the reported missed predictions in this model. 3.4 Systems Engineering In addition to optimizing or selecting particular enzymes, ML has been deployed in the improvement of biosynthetic pathways. By changing expression rates, turnover rates, and post translational modification rates within metabolic systems ML can significantly increase cellular production of particular metabolites. These current models point the way to a future that includes the optimization of complete living systems. The combination of ensemble ML methods with mechanistic genome-scale metabolic models was used to optimize the biosynthetic pathway of tryptophan metabolism (Zhang et al. 2020). Optimizing for combinations of six promoter genes controlling five upstream enzymes of the shimitake pathway improved tryptophan production 74% over already optimized strains. The training dataset used, similar to the previously discussed POOL method (Tallorin et al. 2018), was obtained with high-throughput techniques. Further, these two methods represent a tendency in balancing ML exploitation—a greedy approach where the best option is always taken—with exploration—where suboptimal paths are first explored in search for better final outcomes. The exploit-explore trade-off in ML is easy to see in reinforcement learning, where a machine learning “agent” declines immediate rewards to search for possibly higher gains later in the task. But as exemplified by these two methods, this trade-off can also be applied in a supervised learning setting. Here it was achieved by incorporating a measure of prediction noise in the prediction task, augmenting the space of reachable solutions compared to the exploitative approach. 
When exploitation and exploration were explicitly compared, the exploitative approach selected promoter combinations that yielded higher metabolite production, while the exploratory approach recommended more diverse promoter combinations, improving designs in a subsequent optimization round (Zhang et al. 2020).
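One way to trade off exploitation against exploration in a supervised setting is to score candidates by predicted yield plus a bonus proportional to the spread of an ensemble's predictions (an upper-confidence-bound-style acquisition). The sketch below illustrates the idea only; the data, feature encoding, and models are invented for illustration and are not those of Zhang et al. 2020:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in: candidate promoter combinations encoded as feature vectors,
# with noisy "production" measurements for the training set.
X_train = rng.random((60, 5))
y_train = X_train.sum(axis=1) + rng.normal(0, 0.1, 60)
candidates = rng.random((200, 5))

# An ensemble yields both a mean prediction and a spread (prediction noise).
ensemble = [RandomForestRegressor(n_estimators=30, random_state=s).fit(X_train, y_train)
            for s in range(5)]
preds = np.stack([m.predict(candidates) for m in ensemble])  # shape (5, 200)
mean, std = preds.mean(axis=0), preds.std(axis=0)

kappa = 2.0  # larger kappa -> more exploration
exploit_pick = int(np.argmax(mean))                # greedy: highest predicted yield
explore_pick = int(np.argmax(mean + kappa * std))  # yield plus uncertainty bonus
print(exploit_pick, explore_pick)
```

Setting kappa to zero recovers the purely exploitative approach; increasing it steers the next experimental round toward candidates the ensemble is uncertain about.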

Identifying the relative concentrations of upstream enzymes that lead to higher yields and rates in biosynthetic pathways is particularly important for cell-free systems. Artificial neural networks predicted the metabolic flux in the upper part of glycolysis using the relative, experimentally determined concentrations of four upstream enzymes as input (Ajjolli Nagaraja et al. 2020). While the tryptophan metabolism model (Zhang et al. 2020) did not extrapolate outside the training data, this glycolysis model was designed specifically to extrapolate predictions using artificial data generated by an auxiliary tree-based classifier. This resampling strategy allowed the model to predict enzyme concentrations that yielded flux values up to 63% higher than those present in the training data. An ensemble of ten neural networks was used to screen and rank a large combinatorial space of candidate enzyme combinations for the six-step biosynthetic pathway that synthesizes n-butanol in Clostridium (Karim et al. 2020). The ML recommendations improved production of two related metabolites by 6- and 20-fold over previously engineered designs with the highest reported yields. Understanding metabolism through mechanistic genome-scale models is difficult due to the lack of experimental data on enzyme turnover numbers, which are costly to obtain and prone to noise. ML was used to predict enzyme turnover numbers for whole metabolic pathways, improving downstream metabolic inferences on a genome-scale metabolic model. Predictions were based on a combination of features (structure, network, biochemistry, and assay conditions) and correlated well with data from both in vivo and in vitro experiments (Heckmann et al. 2018).
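The ensemble-screening strategy used in several of these studies can be sketched as: train several small neural networks on measured concentration-to-yield data, then rank an enumerated combinatorial grid of untested concentration settings by the ensemble's mean prediction. Everything below (the four-enzyme setup, the response function, the concentration levels) is a toy stand-in, not the published models:

```python
import itertools
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Toy stand-in for measured data: relative concentrations of 4 enzymes -> flux.
X_train = rng.random((80, 4))
y_train = 2 * X_train[:, 0] + X_train[:, 1] * X_train[:, 2] + rng.normal(0, 0.05, 80)

# Ensemble of ten small neural networks, each with a different initialization.
ensemble = [MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                         max_iter=5000, random_state=s).fit(X_train, y_train)
            for s in range(10)]

# Enumerate a combinatorial grid of candidate concentration settings and rank them.
levels = [0.25, 0.5, 0.75, 1.0]
grid = np.array(list(itertools.product(levels, repeat=4)))  # 4^4 = 256 combinations
mean_pred = np.mean([m.predict(grid) for m in ensemble], axis=0)
top10 = grid[np.argsort(mean_pred)[::-1][:10]]  # combinations to test next round
```

In practice the grid would be the feasible experimental design space, and the top-ranked combinations would feed the next round of high-throughput measurements.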
As an alternative to the more common supervised and unsupervised learning methodologies, reinforcement learning was applied to study the problem of post-translational regulation of biosynthetic enzymes (Britton et al. 2020). This ML model was used to challenge a commonly held understanding of biological regulation: the model indicated that regulation pushes reactions to a non-equilibrium state with the goal of maintaining the solvent capacity of the cell. Use of ML in this manner demonstrated that such models can be used for biological understanding in addition to engineering improvements.

4. Enzyme engineering

Recent work has begun to examine the previously unexplored potential of machine learning to create new enzymes, either as optimized designs or as new functions not evolved by nature. There have been two successes in this area, one using sequence features and one using structural features. In a landmark example of ML-aided enzyme design, direct coupling analysis of sequence data was used to engineer chorismate mutases (Russ et al. 2020). The authors developed a generative ML model that samples the protein sequence space and captures the essential characteristics of chorismate mutases. Newly generated sequences, when expressed, reproduced the activity seen with native chorismate mutases, showing comparable catalytic power. A new metric called relative enrichment, computed from the frequency of alleles detected in a deep sequencing experiment, was also used to assess activity; 30% of the new designs showed relative enrichment similar to that of a native chorismate mutase used as reference, along with large sequence diversity relative to mutases known in nature. All de novo designs showed less than 27% sequence identity to native chorismate mutases.
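An enrichment metric of this kind can be sketched as a log-ratio of allele frequencies between the input and selected sequencing pools, normalized to a reference allele. The counts below are invented, and the exact formula of Russ et al. 2020 may differ; this shows only the general shape of such a calculation:

```python
import math

# Hypothetical allele counts from deep sequencing before and after selection.
input_counts = {"native": 5000, "design_A": 4000, "design_B": 800}
selected_counts = {"native": 9000, "design_A": 7500, "design_B": 150}

def enrichment(allele):
    """Log2 change in an allele's frequency from the input to the selected pool."""
    f_in = input_counts[allele] / sum(input_counts.values())
    f_sel = selected_counts[allele] / sum(selected_counts.values())
    return math.log2(f_sel / f_in)

# Normalize to the native reference: values near 0 indicate activity
# comparable to the native enzyme; strongly negative values indicate depletion.
ref = enrichment("native")
relative = {allele: enrichment(allele) - ref for allele in input_counts}
for allele, r in relative.items():
    print(f"{allele:10s} {r:+.2f}")
```

In this toy example design_A scores near the native reference while design_B is strongly depleted by selection.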
This result shows that sequence-based features such as amino acid composition and residue correlations carry sufficient information not only for enzyme function and enzyme property prediction, but also for the de novo design of new enzymes that preserve the original function. In a second design, structural features were used to design aldehyde-deformylating oxygenase activity into non-enzymatic helix bundles of the ferritin family (Mak et al. 2020). Using a logistic regression model, a set of structural features was ranked to identify the minimal set required to recreate this function in non-functional scaffolds. The regression was interpretable, and the most important features were the energy of the active site, the overall system energy, and the active-site volume.

5. Outlook and Challenges

In enzyme engineering, design, and selection, the best feature type and ML method depend on the problem being tackled (Figure 1, Table 1). Sequences are the simplest protein representation and as such continue to be useful for creating features for every problem discussed here. However, purely sequence-based predictors have only been shown to be sufficient for enzyme classification. The problem of enzyme classification is centered around EC numbers, one of the oldest continuously updated bioinformatics repositories. The abundance of data in that resource has allowed enzyme classification to transition to deep learning methods. We expect that in addition to deep learning, future tools will use attention-based learning, which recently achieved a landmark in translating sequence data to atomic-level structure (AlphaFold 2.0, unpublished), as the level-by-level architecture of the EC classification problem is particularly well suited to attention. All models beyond EC classification benefit from the inclusion of structure-based or experiment-based features. Data availability for these features remains limited; this is currently being addressed by the community. For example, more than 50 journals currently instruct their authors to use the standards for reporting enzymology data (STRENDA) guidelines (Tipton et al. 2014). As the STRENDA database grows, its inclusion of pH, temperature, Km, and kcat will be crucial for advancing condition optimization, substrate identification, and turnover rate predictors.
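The level-by-level structure of EC numbers lends itself to hierarchical prediction, in which one model predicts the top-level class and a class-specific model then predicts the subclass. A minimal sketch of that cascade, using random labels and features purely as placeholders (this is not any published EC predictor):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Toy stand-ins: sequence-derived feature vectors with two-level EC-style
# labels such as "3.1" (class 3, subclass 1).
X = rng.random((300, 10))
level1 = rng.integers(1, 4, 300)   # EC class, here 1-3
level2 = rng.integers(1, 3, 300)   # subclass within each class, here 1-2

# First-level model predicts the EC class.
top = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, level1)

# One second-level model per class predicts the subclass, trained only
# on sequences belonging to that class.
sub = {c: RandomForestClassifier(n_estimators=50, random_state=0)
          .fit(X[level1 == c], level2[level1 == c]) for c in (1, 2, 3)}

def predict_ec(x):
    c = int(top.predict(x.reshape(1, -1))[0])
    s = int(sub[c].predict(x.reshape(1, -1))[0])
    return f"{c}.{s}"

print(predict_ec(X[0]))  # e.g. "2.1"
```

Extending the cascade to the third and fourth EC digits follows the same pattern, with each level conditioned on the prediction above it.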

Figure 1. The feature tree. ML models reviewed in this paper are organized according to the type of features used. The maroon dotted line across the center of the tree divides features into those that are broadly sequence-based (left of line) and those that are structure-based (right of line). The downward arrow categorizes features hierarchically, from atomic (top) to organismal (bottom). Citations for the models (numbered column on the right) are color-coded by the task they are used for: enzyme classification (red), enzyme site prediction (orange), condition optimization (green), substrate identification (teal), turnover rate (blue), and design (purple).

Another strategy for dealing with limited data is using classical ML algorithms instead of deep learning. Currently, the most commonly used classical ML algorithm is the tree-based random forest, which is well suited to complex classification tasks with high-dimensional data. Here we described random forest algorithms used for organismal growth temperature prediction, optimal enzyme temperature prediction, enzyme function prediction, substrate classification, and activity level prediction. Ensemble models and the reduction of imbalance in datasets also play key roles in model performance, as shown with enzyme temperature optimization. Combining predictors outperformed individual predictors, indicating that a better balance between exploration (combining models) and exploitation (selecting the best predictions) is useful for achieving higher performance in enzyme system engineering. It is likely that the future will include more enzyme design efforts guided by ML, combining ML predictors in more complex workflows and reducing screening costs and the number of rounds of experimental testing. With respect to data representation, no consensus yet exists about the best encoding for sequence features. Even though one-hot encoding is a popular method, in at least one instance the use of physicochemical properties somewhat improved accuracy compared with one-hot encoding (Robinson et al. 2020a). As more data become available, we are likely to see protein sequences encoded using deep learning embeddings of the type used in natural language processing (Alley et al. 2019; Heinzinger et al. 2019) to produce highly informative feature sets for downstream protein tasks. By using embeddings as inputs, simpler models can outperform more complex models relying on other data representations, as shown for protein function prediction (Villegas-Morcillo et al. 2020).
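The two sequence encodings discussed above can be illustrated side by side: one-hot encoding records only amino-acid identity, while a physicochemical encoding replaces each residue with numeric property values. The sketch below uses Kyte-Doolittle hydropathy as the single property; real encodings typically stack several such scales per residue:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy, one of many possible physicochemical scales.
HYDROPATHY = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
              "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
              "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
              "W": -0.9, "Y": -1.3}

def one_hot(seq):
    """Each residue becomes a 20-dimensional indicator vector."""
    enc = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        enc[i, AMINO_ACIDS.index(aa)] = 1.0
    return enc

def physicochemical(seq):
    """Each residue becomes a vector of numeric properties (here, just one)."""
    return np.array([[HYDROPATHY[aa]] for aa in seq])

seq = "MKTAY"
print(one_hot(seq).shape)            # (5, 20)
print(physicochemical(seq).ravel())  # hydropathy of M, K, T, A, Y
```

One-hot vectors are sparse and carry no notion of residue similarity, whereas property encodings place chemically similar residues close together in feature space, which is one plausible reason for the accuracy gain reported above.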
Finally, despite the difficulties in comparing ML models, more comprehensive comparisons that specifically evaluate the benefits of each type of feature (evolutionary conservation, structural data, sequence encodings) and feature encoding (one-hot, physicochemical encoding, unsupervised embeddings) would be welcome, since their impact on different protein prediction tasks is an under-analyzed question.

Acknowledgements

We gratefully acknowledge helpful discussions with Meghan W. Franklin as well as funding from NIGMS award DP2GM128201.

References

Ajjolli Nagaraja A, Charton P, Cadet XF, Fontaine N, Delsaut M, Wiltschi B, Voit A, Offmann B, Damour C, Grondin-Perez B et al. 2020. A machine learning approach for efficient selection of enzyme concentrations and its application for flux optimization. Catalysts. 10(3).

Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. 2019. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods. 16(12):1315-1322.

Amidi S, Amidi A, Vlachakis D, Paragios N, Zacharaki EI. 2017. Automatic single- and multi-label enzymatic function prediction by machine learning. PeerJ. 5:e3095-e3095.

Bagley SC, Altman RB. 1995. Characterizing the microenvironment surrounding protein sites. Protein Science. 4(4):622-635.

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The protein data bank. Nucleic Acids Research. 28(1):235-242.

Bonk BM, Weis JW, Tidor B. 2019. Machine learning identifies chemical characteristics that promote enzyme catalysis. Journal of the American Chemical Society. 141(9):4108-4118.

Britton S, Alber M, Cannon WR. 2020. Machine learning and optimal control of enzyme activities to preserve solvent capacity in the cell. bioRxiv.2020.2004.2006.028035.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. Blast+: Architecture and applications. BMC Bioinformatics. 10(1):421.

Che Y, Ju Y, Xuan P, Long R, Xing F. 2016. Identification of multi-functional enzyme with multi-label classifier. PLOS ONE. 11(4):e0153503.

Chew AK, Jiang S, Zhang W, Zavala VM, Van Lehn RC. 2020. Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks. Chemical Science. 11(46):12464-12476.

Chou K-C. 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics. 43(3):246-255.

Dalkiran A, Rifaioglu AS, Martin MJ, Cetin-Atalay R, Atalay V, Doğan T. 2018. Ecpred: A tool for the prediction of the enzymatic functions of protein sequences based on the ec nomenclature. BMC Bioinformatics. 19(1):334.

De Ferrari L, Aitken S, van Hemert J, Goryanin I. 2012. Enzml: Multi-label prediction of enzyme classes using interpro signatures. BMC Bioinformatics. 13(1):61.

Dönertaş HM, Martínez Cuesta S, Rahman SA, Thornton JM. 2016. Characterising complex enzyme reaction data. PLOS ONE. 11(2):e0147952.

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A et al. 2016. The pfam protein families database: Towards a more sustainable future. Nucleic Acids Research. 44(D1):D279-D285.

Foroozandeh Shahraki M, Farhadyar K, Kavousi K, Azarabad MH, Boroomand A, Ariaeenejad S, Hosseini Salekdeh G. 2020. A generalized machine-learning aided method for targeted identification of industrial enzymes from metagenome: A xylanase temperature dependence case study. Biotechnology and Bioengineering.

Fox RJ, Huisman GW. 2008. Enzyme optimization: Moving from blind evolution to statistical exploration of sequence–function space. Trends in Biotechnology. 26(3):132-138.

Gado JE, Beckham GT, Payne CM. 2020. Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. Journal of Chemical Information and Modeling. 60(8):4098-4107.

Gutteridge A, Bartlett GJ, Thornton JM. 2003. Using a neural network and spatial clustering to predict the location of active sites in enzymes. Journal of Molecular Biology. 330(4):719-734.

Heckmann D, Lloyd CJ, Mih N, Ha Y, Zielinski DC, Haiman ZB, Desouki AA, Lercher MJ, Palsson BO. 2018. Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models. Nature Communications. 9(1):5252.

Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B. 2019. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics. 20(1):723.

Hon J, Borko S, Stourac J, Prokop Z, Zendulka J, Bednar D, Martinek T, Damborsky J. 2020. Enzymeminer: Automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities. Nucleic Acids Research. 48(W1):W104-W109.

Jegannathan KR, Nielsen PH. 2013. Environmental assessment of enzyme use in industrial production – a literature review. Journal of Cleaner Production. 42:228-240.

Kemp L, Adam L, Boehm CR, Breitling R, Casagrande R, Dando M, Djikeng A, Evans NG, Hammond R, Hills K et al. 2020. Bioengineering horizon scan 2020. eLife. 9:e54489.

Kumar N, Skolnick J. 2012. Eficaz2.5: Application of a high-precision enzyme function predictor to 396 proteomes. Bioinformatics. 28(20):2687-2688.

Li G, Rabe KS, Nielsen J, Engqvist MKM. 2019. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synthetic Biology. 8(6):1411-1420.

Li G, Zrimec J, Ji B, Geng J, Larsbrink J, Zelezniak A, Nielsen J, Engqvist MK. 2020. Performance of regression models as a function of experiment noise.

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. 2018. Deepre: Sequence-based enzyme ec number prediction by deep learning. Bioinformatics. 34(5):760-769.

Li YH, Xu JY, Tao L, Li XF, Li S, Zeng X, Chen SY, Zhang P, Qin C, Zhang C et al. 2016. Svm-prot 2016: A web-server for machine learning prediction of protein functional families from sequence irrespective of similarity. PLOS ONE. 11(8):e0155290.

Lo Y-C, Rensi SE, Torng W, Altman RB. 2018. Machine learning in chemoinformatics and drug discovery. Drug Discovery Today. 23(8):1538-1546.

Mak WS, Wang X, Arenas R, Cui Y, Bertolani S, Deng WQ, Tagkopoulos I, Wilson DK, Siegel JB. 2020. Discovery, design, and structural characterization of alkane-producing enzymes across the ferritin-like superfamily. Biochemistry. 59(40):3834-3843.

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/.

Mazurenko S, Prokop Z, Damborsky J. 2020. Machine learning in enzyme engineering. ACS Catalysis. 10(2):1210-1223.

McDonald AG, Tipton KF. 2014. Fifty-five years of enzyme classification: Advances and difficulties. The FEBS Journal. 281(2):583-592.

Ostafe R, Fontaine N, Frank D, Ng Fuk Chong M, Prodanovic R, Pandjaitan R, Offmann B, Cadet F, Fischer R. 2020. One-shot optimization of multiple enzyme parameters: Tailoring glucose oxidase for ph and electron mediators. Biotechnology and Bioengineering. 117(1):17-29.

Petrova NV, Wu CH. 2006. Prediction of catalytic residues using support vector machine with selected protein sequence and structural properties. BMC Bioinformatics. 7(1):312.

Ribeiro António J M, Holliday GL, Furnham N, Tyzack JD, Ferris K, Thornton JM. 2018. Mechanism and catalytic site atlas (m-csa): A database of enzyme reaction mechanisms and active sites. Nucleic Acids Research. 46(D1):D618-D623.

Robinson SL, Smith MD, Richman JE, Aukema KG, Wackett LP. 2020a. Machine learning-based prediction of activity and substrate specificity for olea enzymes in the thiolase superfamily. Synthetic Biology. 5(1).

Robinson SL, Terlouw BR, Smith MD, Pidot SJ, Stinear TP, Medema MH, Wackett LP. 2020b. Global analysis of adenylate-forming enzymes reveals β-lactone biosynthesis pathway in pathogenic nocardia. Journal of Biological Chemistry. 295(44):14826-14839.

Roy A, Yang J, Zhang Y. 2012. Cofactor: An accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Research. 40(W1):W471-W477.

Russ WP, Figliuzzi M, Stocker C, Barrat-Charlaix P, Socolich M, Kast P, Hilvert D, Monasson R, Cocco S, Weigt M et al. 2020. An evolution-based model for designing chorismate mutase enzymes. Science. 369(6502):440.

Ryu JY, Kim HU, Lee SY. 2019. Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proceedings of the National Academy of Sciences. 116(28):13996.

Sharma B, Dangi AK, Shukla P. 2018. Contemporary enzyme based technologies for bioremediation: A review. Journal of Environmental Management. 210:10-22.

Shen H-B, Chou K-C. 2007. Ezypred: A top–down approach for predicting enzyme functional classes and subclasses. Biochemical and Biophysical Research Communications. 364(1):53-59.

Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H. 2007. Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences. 104(11):4337.

Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I. 2013. New and continuing developments at prosite. Nucleic Acids Research. 41(D1):D344-D347.

Song J, Li F, Takemoto K, Haffari G, Akutsu T, Chou K-C, Webb GI. 2018. Prevail, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. Journal of Theoretical Biology. 443:125-137.

Tallorin L, Wang J, Kim WE, Sahu S, Kosa NM, Yang P, Thompson M, Gilson MK, Frazier PI, Burkart MD et al. 2018. Discovering de novo peptide substrates for enzymes using machine learning. Nature Communications. 9(1):5253.

Taujale R, Venkat A, Huang L-C, Zhou Z, Yeung W, Rasheed KM, Li S, Edison AS, Moremen KW, Kannan N. 2020. Deep evolutionary analysis reveals the design principles of fold a glycosyltransferases. eLife. 9:e54532.

Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological). 58(1):267-288.

Tipton KF, Armstrong RN, Bakker BM, Bairoch A, Cornish-Bowden A, Halling PJ, Hofmeyr J-H, Leyh TS, Kettner C, Raushel FM et al. 2014. Standards for reporting enzyme data: The strenda consortium: What it aims to do and why it should be helpful. Perspectives in Science. 1(1-6):131-137.

Torng W, Altman RB. 2019a. High precision protein functional site detection using 3d convolutional neural networks. Bioinformatics (Oxford, England). 35(9):1503-1512.

Torng W, Altman RB. 2019b. High precision protein functional site detection using 3d convolutional neural networks. Bioinformatics. 35(9):1503-1512.

Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. 2020. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics.

Yang M, Fehl C, Lees KV, Lim E-K, Offen WA, Davies GJ, Bowles DJ, Davidson MG, Roberts SJ, Davis BG. 2018. Functional and informatics analysis enables glycosyltransferase activity prediction. Nature Chemical Biology. 14(12):1109-1117.

Zhang J, Petersen SD, Radivojevic T, Ramirez A, Pérez-Manríquez A, Abeliuk E, Sánchez BJ, Costello Z, Chen Y, Fero MJ et al. 2020. Combining mechanistic and machine learning models for predictive engineering and optimization of tryptophan metabolism. Nature Communications. 11(1):4880.

Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan L. 2008. Accurate sequence-based prediction of catalytic residues. Bioinformatics. 24(20):2329-2338.

Zou H-L, Xiao X. 2016. Classifying multifunctional enzymes by incorporating three different models into chou’s general pseudo amino acid composition. The Journal of Membrane Biology. 249(4):551-557.

Zou Z, Tian S, Gao X, Li Y. 2019. Mldeepre: Multi-functional enzyme function prediction with hierarchical multi-label deep learning. Frontiers in Genetics. 9(714).