
VIROLOGY

Learning the language of viral evolution and escape

Brian Hie1,2, Ellen D. Zhong1,3, Bonnie Berger1,4*, Bryan Bryson2,5*

The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.

Viral mutations that allow an infection to escape from recognition by neutralizing antibodies have prevented the development of a universal antibody-based vaccine for influenza (1, 2) or HIV (3) and are a concern in the development of therapies for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection (4, 5). Escape has motivated high-throughput experimental techniques that perform causal escape profiling of all single-residue mutations to a viral protein (1–4). Such techniques, however, require substantial effort to profile even a single viral strain, and testing the escape potential of many (combinatorial) mutations in many viral strains remains infeasible. Instead, we sought to train an algorithm that learns to model escape from viral sequence data alone. This approach is not unlike learning properties of natural language from large text corpuses (6, 7) because languages such as English and Japanese use sequences of words to encode complex meanings and have complex rules (for example, grammar). To escape, a mutant virus must preserve infectivity and evolutionary fitness—it must obey a “grammar” of biological rules—and the mutant must no longer be recognized by the immune system, which is analogous to a change in the “meaning” or the “semantics” of the virus.

Currently, computational models of protein evolution focus either on fitness (8) or on functional or semantic similarity (9–11), but we want to understand both (Fig. 1A). Rather than developing two separate models of fitness and function, we developed a single model that simultaneously achieves these tasks. We leveraged state-of-the-art machine learning algorithms called language models (6, 7), which learn the probability of a token (such as an English word) given its sequence context (such as a sentence) (Fig. 1B). Internally, the language model constructs a semantic representation, or an “embedding,” for a given sequence (6), and the output of a language model encodes how well a particular token fits within the rules of the language, which we call “grammaticality” and can also be thought of as “syntactic fitness” (supplementary text, note S2). The same principles used to train a language model on a sequence of English words can train a language model on a sequence of amino acids. Although immune selection occurs on phenotypes (such as protein structures), evolution dictates that selection is reflected within genotypes (such as protein sequences),

    RESEARCH

    Hie et al., Science 371, 284–288 (2021) 15 January 2021 1 of 5

1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 2Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA 02139, USA. 3Computational and Systems Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 4Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 5Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
*Corresponding author. Email: [email protected] (B.Be.); [email protected] (B.Br.)

Fig. 1. Modeling viral escape requires characterizing semantic change and grammaticality. (A) Constrained semantic change search (CSCS) for viral escape prediction is designed to search for mutations to a viral sequence that preserve fitness while being antigenically different. This corresponds to a mutant sequence that is grammatical (conforms to the structure and rules of a language) but has high semantic change with respect to the original (for example, wild type) sequence. (B) A neural language model with a bidirectional long short-term memory (BiLSTM) architecture was used to learn both semantics (as a hidden layer output) and grammaticality (as the language model output). CSCS combines semantic change and grammaticality to predict escape (12). (C) CSCS-proposed changes to a news headline (implemented by using a neural language model trained on English news headlines) make large changes to the overall semantic meaning of a sentence or to the part-of-speech structure. The semantically closest mutated sentence according to the same model is largely synonymous with the original headline.
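The CSCS combination described in the caption can be illustrated as a simple rank combination: each candidate mutation gets a semantic-change score and a grammaticality score, and mutations high in both are prioritized. The sketch below is illustrative only; `rank`, `cscs_scores`, and the toy score lists are hypothetical stand-ins for real language-model outputs, and the published method's exact combination rule may differ.

```python
# Sketch of constrained semantic change search (CSCS) ranking.
# Assumed inputs per candidate mutant: `semantic_change` (distance between
# mutant and wild-type embeddings) and `grammaticality` (model probability
# of the mutant residue in context). Both names are hypothetical.

def rank(values):
    """Return the rank of each value (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def cscs_scores(semantic_change, grammaticality):
    """Combine the two criteria by summing their ranks; a higher combined
    score marks a mutant that is high in both dimensions."""
    sc_rank = rank(semantic_change)
    gr_rank = rank(grammaticality)
    return [s + g for s, g in zip(sc_rank, gr_rank)]

# Toy example: mutant 0 is fit but antigenically unchanged, mutant 1 is
# antigenically changed but unfit, mutant 2 is reasonably high in both.
semantic_change = [0.1, 0.9, 0.8]
grammaticality = [0.6, 0.01, 0.7]
scores = cscs_scores(semantic_change, grammaticality)   # [1, 2, 3]
best = max(range(3), key=lambda i: scores[i])           # mutant 2
```

In the real setting, the semantic-change input would be a distance between hidden-layer embeddings of the mutant and wild-type sequences, and the grammaticality input would come from the language model output, as in panel (B).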

Corrected 14 June 2021. See full text.

which language models can leverage to learn functional properties from sequence variation. We hypothesize that (i) language model–encoded semantic change corresponds to antigenic change, (ii) language model grammaticality captures viral fitness, and (iii) both high semantic change and grammaticality help predict viral escape. Searching for mutations with both high grammaticality and high semantic change is a task that we call constrained semantic change search (CSCS) (Fig. 1C) (12). Our language model implementation of CSCS uses sequence data alone (which is easier to obtain than structure) and requires no explicit escape information (is completely unsupervised), does not rely on multiple sequence alignment (MSA) preprocessing (is “alignment-free”), and captures global relationships across an entire sequence (for example, because word choice at the beginning of a sentence can influence word choice at the end) (supplementary text, notes S2 and S3).

We assessed the generality of our approach across viruses by analyzing three proteins: influenza A hemagglutinin (HA), HIV-1 envelope glycoprotein (Env), and SARS-CoV-2 spike glycoprotein (Spike). All three are found on the viral surface, are responsible for binding host cells, are targeted by antibodies, and are drug targets (1–5). We trained a separate language model for each protein using a corpus of virus-specific amino acid sequences (12).

We initially sought to understand the semantic patterns learned by our viral language models. We therefore visualized the semantic embeddings of each sequence in the influenza, HIV, and coronavirus corpuses using Uniform Manifold Approximation and Projection (UMAP) (13). The resulting two-dimensional semantic landscapes show clustering patterns that correspond to subtype, host species, or both (Fig. 2), suggesting that the model was able to learn functionally meaningful patterns from raw sequence.

We quantified these clustering patterns, which are visually enriched for particular subtypes or hosts, with Louvain clustering (14) to group sequences on the basis of their semantic embeddings (fig. S1, A to C). We then measured the clustering purity on the basis of the percent composition of the most represented metadata category (sequence subtype or host species) within each cluster (12). Average cluster purities for HA subtype, HA host species, and Env subtype are 99, 96, and 95%, respectively, which are comparable with or higher than the clustering purities obtained with MSA-based phylogenetic reconstruction (Fig. 2, D and F, and fig. S1D) (12).

Within the HA landscape, clustering patterns suggest interspecies transmissibility. The sequence for 1918 H1N1 pandemic influenza belongs to the main avian H1 cluster, which contains sequences from the avian reservoir for 2009 H1N1 pandemic influenza (Fig. 2C and fig. S1, A to C). Antigenic similarity between H1 HA from 1918 and 2009, although nearly a century apart, is well supported (15). Within the landscape of SARS-CoV-2 Spike and homologous proteins, clustering proximity is consistent with the suggested zoonotic origin of several human coronaviruses (Fig. 2G), including bat and civet for SARS-CoV-1, camel for Middle East respiratory syndrome–related coronavirus (MERS-CoV), and bat and


[Fig. 2 panels: UMAP semantic landscapes colored by influenza HA subtype (A), influenza HA host species (B and C, including the 1918 pandemic sequence and the avian reservoir for the 2009 pandemic), HIV Env subtype (E), and coronavirus spike host species (G, including SARS-CoV-1, MERS-CoV, and SARS-CoV-2 with bat, civet, camel, and pangolin neighbors), with Louvain versus phylogenetic cluster-purity bar plots (D and F).]

Fig. 2. Semantic embedding landscape is antigenically meaningful. (A and B) UMAP visualization of the high-dimensional semantic embedding landscape of influenza HA. (C) A cluster consisting of avian sequences from the 2009 flu season onward also contains the 1918 pandemic flu sequence, which is consistent with their antigenic similarity (15). (D) Louvain clusters of the HA semantic embeddings have similar purity with respect to subtype or host species compared with phylogenetic sequence clustering (Phylo). Bar height, mean; error bars, 95% confidence. (E and F) The HIV Env semantic landscape shows subtype-related distributional structure and high Louvain clustering purity. Bar height, mean; error bars, 95% confidence. (G) Sequence proximity in the semantic landscape of coronavirus spike proteins is consistent with the possible zoonotic origin of SARS-CoV-1, MERS-CoV, and SARS-CoV-2.
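The cluster purity reported in panels (D) and (F) has a simple definition: for each cluster, the fraction of member sequences belonging to the most represented metadata category, averaged over clusters. A minimal sketch, with `cluster_purity` and the toy assignments as illustrative assumptions rather than the paper's exact code:

```python
# Sketch of cluster purity: mean over clusters of
# (count of majority metadata label) / (cluster size).
from collections import Counter

def cluster_purity(clusters, labels):
    """`clusters` assigns each sequence to a cluster ID; `labels` gives its
    metadata category (e.g., subtype or host species)."""
    members = {}
    for c, l in zip(clusters, labels):
        members.setdefault(c, []).append(l)
    purities = [Counter(ls).most_common(1)[0][1] / len(ls)
                for ls in members.values()]
    return sum(purities) / len(purities)

# Toy example: cluster 0 is 3/4 pure in "H1", cluster 1 is 2/2 pure in "H3",
# so the mean purity is 0.875.
clusters = [0, 0, 0, 0, 1, 1]
labels = ["H1", "H1", "H1", "H3", "H3", "H3"]
purity = cluster_purity(clusters, labels)
```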


pangolin for SARS-CoV-2 (16). Analysis of these semantic landscapes strengthens our hypothesis that our viral sequence embeddings encode functional and antigenic variation.

We then assessed the relationship between viral fitness and language model grammaticality using high-throughput deep mutational scan (DMS) characterization of hundreds or thousands of mutations to a given viral protein. We obtained datasets measuring replication fitness of all single-residue mutations to A/WSN/1933 (WSN33) HA H1 (17), combinatorial mutations to antigenic site B in six HA H3 strains (18), or all single-residue mutations to BG505 and BF520 HIV Env (19), as well as a dataset measuring the dissociation constant (Kd) between combinatorial mutations to yeast-displayed SARS-CoV-2 Spike receptor-binding domain (RBD) and human ACE2 (20), which we used to approximate the fitness of Spike.

Language model grammaticality was significantly correlated (table S1, t-distribution P values) with viral fitness across all viral strains and across studies that examined single or combinatorial mutations (Fig. 3A), even though our language models were not given any explicit fitness-related information nor trained on the DMS mutants. When we compared viral fitness with the magnitude of mutant semantic change (rather than grammaticality), we observed significant negative correlation (table S1, t-distribution P values) in 8 out of 10 strains tested (Fig. 3A). This makes sense biologically because a mutation with a large effect on function is on average more likely to be deleterious and result in a loss of fitness. These results suggest that “grammaticality” of a given mutation captures fitness information and adds an additional dimension to our understanding of how semantic change encodes perturbed protein function.

We then tested whether combining semantic change and grammaticality enables us to predict mutations that lead to viral escape. Our experimental setup involved making, in silico, all possible single-residue mutations to a given viral protein sequence; next, each mutant was ranked according to the CSCS objective that combines semantic change and grammaticality. We validated this ranking on the basis of enrichment of experimentally verified mutants that causally induce escape from neutralizing antibodies. Three of these causal escape datasets used a DMS with antibody selection to identify escape mutations to WSN33 HA H1 (1), A/Perth/16/2009 (Perth09) HA H3


Fig. 3. Biological interpretation of language model semantics and grammaticality enables escape prediction. (A) Whereas grammaticality is positively correlated with fitness, semantic change has negative correlation, suggesting that most semantically altered proteins lose fitness. (B and C) However, a mutation with both high semantic change and high grammaticality is more likely to induce escape. Considering both semantic change and grammaticality enables identification of escape mutants that is consistently higher than that of previous fitness models or generic functional embedding models. (D) Across 891 surveilled SARS-CoV-2 Spike sequences, only three have both higher semantic change and grammaticality than a Spike sequence with four mutations that is associated with a potential reinfection case.
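The escape-prediction evaluation summarized in panels (B) and (C) integrates a curve of acquired escape mutants versus total acquired mutants as candidates are read off in ranked order, which is equivalent to a rank-based ROC AUC of a binary escape label against the ranking. A sketch under that equivalence; `escape_auc` and the toy inputs are hypothetical, not the authors' evaluation code:

```python
# Sketch of the enrichment AUC: the probability that a randomly chosen
# escape mutant outranks a randomly chosen non-escape mutant (ties count
# half), which is the rank-based formulation of the ROC AUC.

def escape_auc(scores, is_escape):
    pos = [s for s, e in zip(scores, is_escape) if e]
    neg = [s for s, e in zip(scores, is_escape) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: five mutants ranked by a CSCS-style score, two of which are
# experimentally verified escape mutants.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_escape = [True, False, True, False, False]
auc = escape_auc(scores, is_escape)   # 5/6, i.e., about 0.83
```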


(2), and BG505 Env (3). The fourth identified escape mutations to SARS-CoV-2 Spike by using natural replication error after in vitro passages under antibody selection (5), whereas the fifth performed a DMS to identify mutants that affect antibody binding to yeast-displayed Spike RBD (4).

We computed the area under the curve (AUC) of acquired escape mutations versus the total acquired mutations (12). In all five cases, escape prediction with CSCS resulted in both statistically significant and strong AUCs of 0.81, 0.73, 0.64, 0.85, and 0.57 for H1 WSN33, H3 Perth09, Env BG505, Spike, and Spike RBD, respectively (one-sided permutation-based P < 1 × 10−5 for H1, H3, Env, and Spike; P = 2 × 10−4 for Spike RBD) (Fig. 3B and table S2). We did not provide the model with any information on escape, a setup in machine learning referred to as “zero-shot prediction” (7). The AUC decreased when ignoring either grammaticality or semantic change, evidence that both are useful in predicting escape (Fig. 3C, fig. S2A, and table S2). Although semantic change is negatively correlated with fitness, it is positively predictive (along with grammaticality) of escape (table S2), indicating that functional mutations are often deleterious, but when fitness is preserved, they are associated with antigenic change and subsequent escape from immunity.

We also tested how well alternative models of fitness (each requiring MSA preprocessing) (8, 21) or of semantic change (pretrained on generic protein sequence) (9–11) predict escape, although these models are not explicitly designed for escape prediction. Fitness models associate more frequently observed patterns with higher fitness and greater escape potential, whereas semantic models associate larger functional changes with escape (12). CSCS with our viral language models was more predictive of escape across all five datasets (Fig. 3B and fig. S2A). Moreover, the individual grammaticality or semantic change components of our language models often outperformed benchmark models (table S2).

Language modeling can also characterize sequence changes beyond single-residue mutations, such as from accumulated replication error or recombination (22), although our approach is agnostic to how a sequence acquires its mutations. We therefore estimated the antigenic change and fitness of a set of four mutations to the SARS-CoV-2 Spike associated with a reported reinfection event (23). Among 891 other distinct, surveilled Spike sequences, we found that only three (0.34%) represent both higher semantic change and grammaticality (Fig. 3D). We estimate significant escape potential of these four mutations (random mutant null distribution P < 1 × 10−8) (12), and we observed similar patterns for known antigenically dissimilar sequences (fig. S2B) (12). Our analysis suggests a way to quantify the escape potential of interesting combinatorial sequence changes, such as those from possible reinfection (23), and calls for more information that relates combinatorial mutations to reinfection and escape.

To further assess whether our model could learn structurally relevant patterns from sequence alone, we scored each residue on the basis of the CSCS objective, visualized escape potential across the protein structure, and quantified enrichment or depletion of escape (12). Escape potential is significantly enriched in the HA head (permutation-based P < 1 × 10−5) and significantly depleted in the HA stalk (permutation-based P < 1 × 10−5) (Fig. 4, A and B; fig. S3; and table S3), which is consistent with literature on HA mutation rates and supported by the successful development of antistalk broadly neutralizing antibodies (24). We also detected, consistent with existing knowledge, a significant enrichment (permutation-based P < 1 × 10−5) of escape mutations in the V1/V2 hypervariable regions of the HIV Env (Fig. 4, C and D; fig. S3; and table S3) (3). Our model only learns escape patterns linked to mutations, rather than posttranslational changes such as glycosylation that contribute to HIV escape (3), which may explain the lack of escape potential specifically assigned to Env glycosylation sites (Fig. 4C and table S3).

The escape potential within the SARS-CoV-2 Spike is significantly enriched in both the RBD (permutation-based P = 2.7 × 10−3) and N-terminal domain (permutation-based


[Fig. 4 panels: protein structures colored by predicted escape potential for influenza H1 and H3 HA (head enrichment, stalk depletion, both P < 1 × 10−5), HIV Env (V1/V2 enrichment, P < 1 × 10−5), and SARS-CoV-2 Spike (RBD enrichment, P = 2.7 × 10−3; N-terminal domain enrichment, P < 1 × 10−5; S2 depletion, P < 1 × 10−5), with regional escape enrichment bar plots (−log10 P) over Env regions (V1 to V5 loops, fusion peptide, gp120, gp41, CD4 binding loop, MPER, glycosylation) and Spike regions (HR1, HR2, RBD, S1 NTD, S2, fusion peptide, glycosylation).]

Fig. 4. Structural localization of predicted escape potential. (A and B) HA trimer colored by escape potential. (C) Escape potential P values for HIV Env. The gray dashed line indicates the statistical significance threshold. (D) The Env trimer colored by escape potential, oriented to show the V1/V2 regions. (E and F) Potential for escape in SARS-CoV-2 Spike is significantly enriched at the N-terminal domain and receptor binding domain (RBD) and significantly depleted at multiple regions in the S2 subunit. The gray dashed line indicates the statistical significance threshold.
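The permutation-based P values behind these regional enrichment claims can be approximated with a simple resampling test: compare a region's mean predicted escape potential against means of randomly drawn residue sets of the same size. The sketch below is an illustrative assumption, not the authors' implementation; `enrichment_pvalue`, the toy scores, and the region definition are all hypothetical.

```python
# Sketch of a one-sided permutation test for regional escape enrichment.
import random

def enrichment_pvalue(scores, region, n_perm=10000, seed=0):
    """Fraction of random same-size residue sets whose mean score is at
    least the region's mean, with a +1 correction for a valid estimate."""
    rng = random.Random(seed)
    observed = sum(scores[i] for i in region) / len(region)
    idx = list(range(len(scores)))
    hits = 0
    for _ in range(n_perm):
        perm = rng.sample(idx, len(region))
        if sum(scores[i] for i in perm) / len(region) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy protein of 50 residues where the first 10 score much higher:
# that region comes out strongly enriched (small P value).
scores = [1.0] * 10 + [0.1] * 40
p = enrichment_pvalue(scores, region=list(range(10)))
```

A depletion test is the mirror image, counting random sets whose mean is at most the observed mean.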


P < 1 × 10−5), whereas escape potential is significantly depleted in the S2 subunit (permutation-based P < 1 × 10−5) (Fig. 4, E and F; fig. S3; and table S3). These results are supported by the greater evolutionary conservation at S2 antigenic sites (25). Our model of Spike escape therefore suggests that immunodominant antigenic sites in S2 (5, 25) may be more stable target antibody epitopes and underscores the need for more exhaustive causal escape profiling of Spike in regions beyond the RBD.

Our study leverages the principle that evolutionary selection is reflected in sequence variation. This principle may allow CSCS to generalize beyond viral escape to different kinds of natural selection (such as T cell selection) or drug selection. CSCS and its components could be used to select elements of a multivalent or mosaic vaccine. Our techniques also lay the foundation for more complex modeling of sequence dynamics. We anticipate that the “distributional hypothesis” from linguistics (26), in which co-occurrence patterns can model complex concepts and on which language models are based, can further inform viral evolution.

REFERENCES AND NOTES

1. M. B. Doud, J. M. Lee, J. D. Bloom, Nat. Commun. 9, 1386 (2018).
2. J. M. Lee et al., eLife 8, e49324 (2019).
3. A. S. Dingens, D. Arenz, H. Weight, J. Overbaugh, J. D. Bloom, Immunity 50, 520–532.e3 (2019).
4. A. J. Greaney et al., Cell Host Microbe 10.1016/j.chom.2020.11.007 (2020).
5. A. Baum et al., Science 369, 1014–1018 (2020).
6. M. Peters et al., Deep contextualized word representations. Proc. NAACL-HLT, 2227–2237 (2018).
7. A. Radford et al., OpenAI Blog 1, 9 (2019).
8. T. A. Hopf et al., Bioinformatics 35, 1582–1584 (2019).
9. T. Bepler, B. Berger, Learning protein sequence embeddings using information from structure. arXiv:1902.08661 [cs.LG] (2019).
10. R. Rao et al., Evaluating protein transfer learning with TAPE. Proc. Adv. Neural Inf. Process. Syst., 9686–9698 (2019).
11. E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, G. M. Church, Nat. Methods 16, 1315–1322 (2019).
12. Materials and methods are available as supplementary materials.
13. L. McInnes, J. Healy, UMAP: Uniform Manifold Approximation and Projection for dimension reduction. arXiv:1802.03426 [stat.ML] (2018).
14. V. D. Blondel, J. L. Guillaume, R. Lambiotte, E. Lefebvre, J. Stat. Mech. Theory Exp. 2008, P10008 (2008).
15. R. Xu et al., Science 328, 357–360 (2010).
16. K. G. Andersen, A. Rambaut, W. I. Lipkin, E. C. Holmes, R. F. Garry, Nat. Med. 26, 450–452 (2020).
17. M. B. Doud, J. D. Bloom, Viruses 8, 155 (2016).
18. N. C. Wu et al., Nat. Commun. 11, 1233 (2020).
19. H. K. Haddox, A. S. Dingens, S. K. Hilton, J. Overbaugh, J. D. Bloom, eLife 7, e34420 (2018).
20. T. N. Starr et al., Cell 182, 1295–1310.e20 (2020).
21. K. Katoh, D. M. Standley, Mol. Biol. Evol. 30, 772–780 (2013).
22. Y. Xiao et al., Cell Host Microbe 19, 493–503 (2016).
23. K. K.-W. To et al., Clin. Infect. Dis. ciaa1275 (2020).
24. E. Kirkpatrick, X. Qiu, P. C. Wilson, J. Bahl, F. Krammer, Sci. Rep. 8, 10432 (2018).
25. S. Ravichandran et al., Sci. Transl. Med. 12, eabc3539 (2020).
26. Z. S. Harris, Word 10, 146–162 (1954).
27. B. Hie, brianhie/viral-mutation: viral-mutation release 0.3. Zenodo (2020).
28. B. Hie, Data for “Learning the language of viral evolution and escape”. Zenodo (2020).

ACKNOWLEDGMENTS

We thank A. Balazs, H. Fraser, O. Leddy, A. Lerer, A. Lin, A. Nitido, U. Roy, and A. Schmidt for helpful discussions. We thank S. Chun, B. DeMeo, A. Narayan, A. Nguyen, S. Nyquist, and A. Wu for assistance with the manuscript. Funding: B.H. is partially supported by the U.S. Department of Defense (DOD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG). E.Z. is partially supported by the National Science Foundation (NSF) Graduate Research Fellowship. Author contributions: All authors conceived the project and methodology. B.H. performed the computational experiments and wrote the software. All authors interpreted the results and wrote the manuscript. Competing interests: B.H., B.Be., and B.Br. have filed a provisional patent application (serial no. 53/049,676) related to this work. Data and materials availability: Code, scripts for plotting and visualizing, and pretrained models are deposited to Zenodo at doi:10.5281/zenodo.4034681 (27) and are also available at https://github.com/brianhie/viral-mutation. We used the following publicly available datasets for model training: Influenza A HA protein sequences from the NIAID Influenza Research Database (IRD) (www.fludb.org); HIV-1 Env protein sequences from the Los Alamos National Laboratory (LANL) HIV database (www.hiv.lanl.gov); Coronaviridae spike protein sequences from the Virus Pathogen Resource (ViPR) database (www.viprbrc.org/brc/home.spg?decorator=corona); SARS-CoV-2 Spike protein sequences from NCBI Virus (www.ncbi.nlm.nih.gov/labs/virus/vssi); and SARS-CoV-2 Spike and other Betacoronavirus spike protein sequences from GISAID (www.gisaid.org). We used the following publicly available datasets for fitness and escape validation: fitness single-residue DMS of HA H1 WSN33 from Doud and Bloom (2016) (17); fitness combinatorial DMS of antigenic site B in six HA H3 strains from Wu et al. (18); fitness single-residue DMS of Env BF520 and BG505 from Haddox et al. (19); ACE2 binding affinity combinatorial DMS of Spike RBD from Starr et al. (20); escape single-residue DMS of HA H1 WSN33 from Doud et al. (2018) (1); escape single-residue DMS of HA H3 Perth09 from Lee et al. (2); escape single-residue DMS of Env BG505 from Dingens et al. (3); escape mutations of Spike from Baum et al. (5); and escape single-residue DMS of Spike RBD from Greaney et al. (4). Training and validation datasets are deposited to Zenodo at doi:10.5281/zenodo.4029296 (28), and links to this data are also available at https://github.com/brianhie/viral-mutation.

    SUPPLEMENTARY MATERIALS

science.sciencemag.org/content/371/6526/284/suppl/DC1
Materials and Methods
Supplementary Text
Figs. S1 to S3
Tables S1 to S3
References (29–49)
MDAR Reproducibility Checklist

8 July 2020; accepted 9 November 2020
10.1126/science.abd7331



Corrected 14 June 2021. See full text.


Natural language predicts viral escape

Learning the language of viral evolution and escape
Brian Hie, Ellen D. Zhong, Bonnie Berger and Bryan Bryson
Science 371 (6526), 284–288. Originally published online January 14, 2021. DOI: 10.1126/science.abd7331

Viral mutations that evade neutralizing antibodies, an occurrence known as viral escape, can occur and may impede the development of vaccines. To predict which mutations may lead to viral escape, Hie et al. used a machine learning technique for natural language processing with two components: grammar (or syntax) and meaning (or semantics) (see the Perspective by Kim and Przytycka). Three different unsupervised language models were constructed for influenza A hemagglutinin, HIV-1 envelope glycoprotein, and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike glycoprotein. Semantic landscapes for these viruses predicted viral escape mutations that produce sequences that are syntactically and/or grammatically correct but effectively different in semantics and thus able to evade the immune system.

Science, this issue p. 284; see also p. 233


Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Science is a registered trademark of AAAS. Copyright © 2021 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
