
VIROLOGY

Learning the language of viral evolution and escape

Brian Hie1,2, Ellen D. Zhong1,3, Bonnie Berger1,4*, Bryan Bryson2,5*

The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.

Viral mutations that allow an infection to escape from recognition by neutralizing antibodies have prevented the development of a universal antibody-based vaccine for influenza (1, 2) or HIV (3) and are a concern in the development of therapies for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection (4, 5). Escape has motivated high-throughput experimental techniques that perform causal escape profiling of all single-residue mutations to a viral protein (1–4). Such techniques, however, require substantial effort to profile even a single viral strain, and testing the escape potential of many (combinatorial) mutations in many viral strains remains infeasible. Instead, we sought to train an algorithm that learns to model escape from viral sequence data alone. This approach is not unlike learning properties of natural language from large text corpuses (6, 7) because languages such as English and Japanese use sequences of words to encode complex meanings and have complex rules (for example, grammar). To escape, a mutant virus must preserve infectivity and evolutionary fitness—it must obey a “grammar” of biological rules—and the mutant must no longer be recognized by the immune system, which is analogous to a change in the “meaning” or the “semantics” of the virus.

Currently, computational models of protein evolution focus either on fitness (8) or on functional or semantic similarity (9–11), but we want to understand both (Fig. 1A). Rather than developing two separate models of fitness and function, we developed a single model that simultaneously achieves these tasks. We leveraged state-of-the-art machine learning algorithms called language models (6, 7), which learn the probability of a token (such as an English word) given its sequence context (such as a sentence) (Fig. 1B). Internally, the language model constructs a semantic representation, or an “embedding,” for a given sequence (6), and the output of a language model encodes how well a particular token fits within the rules of the language, which we call “grammaticality” and can also be thought of as “syntactic fitness” (supplementary text, note S2). The same principles used to train a language model on a sequence of English words can train a language model on a sequence of amino acids. Although immune selection occurs on phenotypes (such as protein structures), evolution dictates that selection is reflected within genotypes (such as protein sequences),

    RESEARCH

    Hie et al., Science 371, 284–288 (2021) 15 January 2021 1 of 5

1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 2Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA 02139, USA. 3Computational and Systems Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 4Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 5Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
*Corresponding author. Email: [email protected] (B.Be.); [email protected] (B.Br.)

Fig. 1. Modeling viral escape requires characterizing semantic change and grammaticality. (A) Constrained semantic change search (CSCS) for viral escape prediction is designed to search for mutations to a viral sequence that preserve fitness while being antigenically different. This corresponds to a mutant sequence that is grammatical (conforms to the structure and rules of a language) but has high semantic change with respect to the original (for example, wild type) sequence. (B) A neural language model with a bidirectional long short-term memory (BiLSTM) architecture was used to learn both semantics (as a hidden layer output) and grammaticality (as the language model output). CSCS combines semantic change and grammaticality to predict escape (12). (C) CSCS-proposed changes to a news headline (implemented by using a neural language model trained on English news headlines) make large changes to the overall semantic meaning of a sentence or to the part-of-speech structure. The semantically closest mutated sentence according to the same model is largely synonymous with the original headline.
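The CSCS combination described in the caption can be illustrated as a simple rank combination: each candidate mutation gets a semantic-change score and a grammaticality score, and mutations high in both are prioritized. The sketch below is illustrative only; `rank`, `cscs_scores`, and the toy score lists are hypothetical stand-ins for real language-model outputs, and the published method's exact combination rule may differ.

```python
# Sketch of constrained semantic change search (CSCS) ranking.
# Assumed inputs per candidate mutant: `semantic_change` (distance between
# mutant and wild-type embeddings) and `grammaticality` (model probability
# of the mutant residue in context). Both names are hypothetical.

def rank(values):
    """Return the rank of each value (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def cscs_scores(semantic_change, grammaticality):
    """Combine the two criteria by summing their ranks; a higher combined
    score marks a mutant that is high in both dimensions."""
    sc_rank = rank(semantic_change)
    gr_rank = rank(grammaticality)
    return [s + g for s, g in zip(sc_rank, gr_rank)]

# Toy example: mutant 0 is fit but antigenically unchanged, mutant 1 is
# antigenically changed but unfit, mutant 2 is reasonably high in both.
semantic_change = [0.1, 0.9, 0.8]
grammaticality = [0.6, 0.01, 0.7]
scores = cscs_scores(semantic_change, grammaticality)   # [1, 2, 3]
best = max(range(3), key=lambda i: scores[i])           # mutant 2
```

In the real setting, the semantic-change input would be a distance between hidden-layer embeddings of the mutant and wild-type sequences, and the grammaticality input would come from the language model output, as in panel (B).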

Corrected 14 June 2021. See full text.

which language models can leverage to learn functional properties from sequence variation. We hypothesize that (i) language model–encoded semantic change corresponds to antigenic change, (ii) language model grammaticality captures viral fitness, and (iii) both high semantic change and grammaticality help predict viral escape. Searching for mutations with both high grammaticality and high semantic change is a task that we call constrained semantic change search (CSCS) (Fig. 1C) (12). Our language model implementation of CSCS uses sequence data alone (which is easier to obtain than structure) and requires no explicit escape information (is completely unsupervised), does not rely on multiple sequence alignment (MSA) preprocessing (is “alignment-free”), and captures global relationships across an entire sequence (for example, because word choice at the beginning of a sentence can influence word choice at the end) (supplementary text, notes S2 and S3).

We assessed the generality of our approach across viruses by analyzing three proteins: influenza A hemagglutinin (HA), HIV-1 envelope glycoprotein (Env), and SARS-CoV-2 spike glycoprotein (Spike). All three are found on the viral surface, are responsible for binding host cells, are targeted by antibodies, and are drug targets (1–5). We trained a separate language model for each protein using a corpus of virus-specific amino acid sequences (12).

We initially sought to understand the semantic patterns learned by our viral language models. We therefore visualized the semantic embeddings of each sequence in the influenza, HIV, and coronavirus corpuses using Uniform Manifold Approximation and Projection (UMAP) (13). The resulting two-dimensional semantic landscapes show clustering patterns that correspond to subtype, host species, or both (Fig. 2), suggesting that the model was able to learn functionally meaningful patterns from raw sequence.

We quantified these clustering patterns, which are visually enriched for particular subtypes or hosts, with Louvain clustering (14) to group sequences on the basis of their semantic embeddings (fig. S1, A to C). We then measured the clustering purity on the basis of the percent composition of the most represented metadata category (sequence subtype or host species) within each cluster (12). Average cluster purities for HA subtype, HA host species, and Env subtype are 99, 96, and 95%, respectively, which are comparable with or higher than the clustering purities obtained with MSA-based phylogenetic reconstruction (Fig. 2, D and F, and fig. S1D) (12).

Within the HA landscape, clustering patterns suggest interspecies transmissibility. The sequence for 1918 H1N1 pandemic influenza belongs to the main avian H1 cluster, which contains sequences from the avian reservoir for 2009 H1N1 pandemic influenza (Fig. 2C and fig. S1, A to C). Antigenic similarity between H1 HA from 1918 and 2009, although nearly a century apart, is well supported (15). Within the landscape of SARS-CoV-2 Spike and homologous proteins, clustering proximity is consistent with the suggested zoonotic origin of several human coronaviruses (Fig. 2G), including bat and civet for SARS-CoV-1, camel for Middle East respiratory syndrome–related coronavirus (MERS-CoV), and bat and


[Fig. 2 panels: UMAP semantic landscapes colored by influenza HA subtype (A), influenza HA host species (B and C, including the 1918 pandemic sequence and the avian reservoir for the 2009 pandemic), HIV Env subtype (E), and coronavirus spike host species (G, including SARS-CoV-1, MERS-CoV, and SARS-CoV-2 with bat, civet, camel, and pangolin neighbors), with Louvain versus phylogenetic cluster-purity bar plots (D and F).]

Fig. 2. Semantic embedding landscape is antigenically meaningful. (A and B) UMAP visualization of the high-dimensional semantic embedding landscape of influenza HA. (C) A cluster consisting of avian sequences from the 2009 flu season onward also contains the 1918 pandemic flu sequence, which is consistent with their antigenic similarity (15). (D) Louvain clusters of the HA semantic embeddings have similar purity with respect to subtype or host species compared with phylogenetic sequence clustering (Phylo). Bar height, mean; error bars, 95% confidence. (E and F) The HIV Env semantic landscape shows subtype-related distributional structure and high Louvain clustering purity. Bar height, mean; error bars, 95% confidence. (G) Sequence proximity in the semantic landscape of coronavirus spike proteins is consistent with the possible zoonotic origin of SARS-CoV-1, MERS-CoV, and SARS-CoV-2.
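The cluster purity reported in panels (D) and (F) has a simple definition: for each cluster, the fraction of member sequences belonging to the most represented metadata category, averaged over clusters. A minimal sketch, with `cluster_purity` and the toy assignments as illustrative assumptions rather than the paper's exact code:

```python
# Sketch of cluster purity: mean over clusters of
# (count of majority metadata label) / (cluster size).
from collections import Counter

def cluster_purity(clusters, labels):
    """`clusters` assigns each sequence to a cluster ID; `labels` gives its
    metadata category (e.g., subtype or host species)."""
    members = {}
    for c, l in zip(clusters, labels):
        members.setdefault(c, []).append(l)
    purities = [Counter(ls).most_common(1)[0][1] / len(ls)
                for ls in members.values()]
    return sum(purities) / len(purities)

# Toy example: cluster 0 is 3/4 pure in "H1", cluster 1 is 2/2 pure in "H3",
# so the mean purity is 0.875.
clusters = [0, 0, 0, 0, 1, 1]
labels = ["H1", "H1", "H1", "H3", "H3", "H3"]
purity = cluster_purity(clusters, labels)
```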


pangolin for SARS-CoV-2 (16). Analysis of these semantic landscapes strengthens our hypothesis that our viral sequence embeddings encode functional and antigenic variation.

We then assessed the relationship between viral fitness and language model grammaticality using high-throughput deep mutational scan (DMS) characterization of hundreds or thousands of mutations to a given viral protein. We obtained datasets measuring replication fitness of all single-residue mutations to A/WSN/1933 (WSN33) HA H1 (17), combinatorial mutations to antigenic site B in six HA H3 strains (18), or all single-residue mutations to BG505 and BF520 HIV Env (19), as well as a dataset measuring the dissociation constant (Kd) between combinatorial mutations to yeast-displayed SARS-CoV-2 Spike receptor-binding domain (RBD) and human ACE2 (20), which we used to approximate the fitness of Spike.

Language model grammaticality was significantly correlated (table S1, t-distribution P values) with viral fitness across all viral strains and across studies that examined single or combinatorial mutations (Fig. 3A), even though our language models were not given any explicit fitness-related information nor trained on the DMS mutants. When we compared viral fitness with the magnitude of mutant semantic change (rather than grammaticality), we observed significant negative correlation (table S1, t-distribution P values) in 8 out of 10 strains tested (Fig. 3A). This makes sense biologically because a mutation with a large effect on function is on average more likely to be deleterious and result in a loss of fitness. These results suggest that “grammaticality” of a given mutation captures fitness information and adds an additional dimension to our understanding of how semantic change encodes perturbed protein function.

We then tested whether combining semantic change and grammaticality enables us to predict mutations that lead to viral escape. Our experimental setup involved making, in silico, all possible single-residue mutations to a given viral protein sequence; next, each mutant was ranked according to the CSCS objective that combines semantic change and grammaticality. We validated this ranking on the basis of enrichment of experimentally verified mutants that causally induce escape from neutralizing antibodies. Three of these causal escape datasets used a DMS with antibody selection to identify escape mutations to WSN33 HA H1 (1), A/Perth/16/2009 (Perth09) HA H3


Fig. 3. Biological interpretation of language model semantics and grammaticality enables escape prediction. (A) Whereas grammaticality is positively correlated with fitness, semantic change has negative correlation, suggesting that most semantically altered proteins lose fitness. (B and C) However, a mutation with both high semantic change and high grammaticality is more likely to induce escape. Considering both semantic change and grammaticality enables identification of escape mutants that is consistently higher than that of previous fitness models or generic functional embedding models. (D) Across 891 surveilled SARS-CoV-2 Spike sequences, only three have both higher semantic change and grammaticality than a Spike sequence with four mutations that is associated with a potential reinfection case.
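The escape-prediction evaluation summarized in panels (B) and (C) integrates a curve of acquired escape mutants versus total acquired mutants as candidates are read off in ranked order, which is equivalent to a rank-based ROC AUC of a binary escape label against the ranking. A sketch under that equivalence; `escape_auc` and the toy inputs are hypothetical, not the authors' evaluation code:

```python
# Sketch of the enrichment AUC: the probability that a randomly chosen
# escape mutant outranks a randomly chosen non-escape mutant (ties count
# half), which is the rank-based formulation of the ROC AUC.

def escape_auc(scores, is_escape):
    pos = [s for s, e in zip(scores, is_escape) if e]
    neg = [s for s, e in zip(scores, is_escape) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: five mutants ranked by a CSCS-style score, two of which are
# experimentally verified escape mutants.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_escape = [True, False, True, False, False]
auc = escape_auc(scores, is_escape)   # 5/6, i.e., about 0.83
```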


(2), and BG505 Env (3). The fourth identified escape mutations to SARS-CoV-2 Spike by using natural replication error after in vitro passages under antibody selection (5), whereas the fifth performed a DMS to identify mutants that affect antibody binding to yeast-displayed Spike RBD (4).

We computed the area under the curve (AUC) of acquired escape mutations versus the total acquired mutations (12). In all five cases, escape prediction with CSCS resulted in both statistically significant and strong AUCs of 0.81, 0.73, 0.64, 0.85, and 0.57 for H1 WSN33, H3 Perth09, Env BG505, Spike, and Spike RBD, respectively (one-sided permutation-based P < 1 × 10−5 for H1, H3, Env, and Spike; P = 2 × 10−4 for Spike RBD) (Fig. 3B and table S2). We did not provide the model with any information on escape, a setup in machine learning referred to as “zero-shot prediction” (7). The AUC decreased when ignoring either grammaticality or semantic change, evidence that both are useful in predicting escape (Fig. 3C, fig. S2A, and table S2). Although semantic change is negatively correlated with fitness, it is positively predictive (along with grammaticality) of escape (table S2), indicating that functional mutations are often deleterious, but when fitness is preserved, they are associated with antigenic change and subsequent escape from immunity.

We also tested how well alternative models of fitness (each requiring MSA preprocessing) (8, 21) or of semantic change (pretrained on generic protein sequence) (9–11) predict escape, although these models are not explicitly designed for escape prediction. Fitness models associate more frequently observed patterns with higher fitness and greater escape potential, whereas semantic models associate larger functional changes with escape (12). CSCS with our viral language models was more predictive of escape across all five datasets (Fig. 3B and fig. S2A). Moreover, the individual grammaticality or semantic change components of our language models often outperformed benchmark models (table S2).

Language modeling can also characterize sequence changes beyond single-residue mutations, such as from accumulated replication error or recombination (22), although our approach is agnostic to how a sequence acquires its mutations. We therefore estimated the antigenic change and fitness of a set of four mutations to the SARS-CoV-2 Spike associated with a reported reinfection event (23). Among 891 other distinct, surveilled Spike sequences, we found that only three (0.34%) represent both higher semantic change and grammaticality (Fig. 3D). We estimate significant escape potential of these four mutations (random mutant null distribution P < 1 × 10−8) (12), and we observed similar patterns for known antigenically dissimilar sequences (fig. S2B) (12). Our analysis suggests a way to quantify the escape potential of interesting combinatorial sequence changes, such as those from possible reinfection (23), and calls for more information that relates combinatorial mutations to reinfection and escape.

To further assess whether our model could learn structurally relevant patterns from sequence alone, we scored each residue on the basis of the CSCS objective, visualized escape potential across the protein structure, and quantified enrichment or depletion of escape (12). Escape potential is significantly enriched in the HA head (permutation-based P < 1 × 10−5) and significantly depleted in the HA stalk (permutation-based P < 1 × 10−5) (Fig. 4, A and B; fig. S3; and table S3), which is consistent with literature on HA mutation rates and supported by the successful development of antistalk broadly neutralizing antibodies (24). We also detected, consistent with existing knowledge, a significant enrichment (permutation-based P < 1 × 10−5) of escape mutations in the V1/V2 hypervariable regions of the HIV Env (Fig. 4, C and D; fig. S3; and table S3) (3). Our model only learns escape patterns linked to mutations, rather than posttranslational changes such as glycosylation that contribute to HIV escape (3), which may explain the lack of escape potential specifically assigned to Env glycosylation sites (Fig. 4C and table S3).

The escape potential within the SARS-CoV-2 Spike is significantly enriched in both the RBD (permutation-based P = 2.7 × 10−3) and N-terminal domain (permutation-based


[Fig. 4 panels: protein structures colored by predicted escape potential for influenza H1 and H3 HA (head enrichment, stalk depletion, both P < 1 × 10−5), HIV Env (V1/V2 enrichment, P < 1 × 10−5), and SARS-CoV-2 Spike (RBD enrichment, P = 2.7 × 10−3; N-terminal domain enrichment, P < 1 × 10−5; S2 depletion, P < 1 × 10−5), with regional escape enrichment bar plots (−log10 P) over Env regions (V1 to V5 loops, fusion peptide, gp120, gp41, CD4 binding loop, MPER, glycosylation) and Spike regions (HR1, HR2, RBD, S1 NTD, S2, fusion peptide, glycosylation).]

Fig. 4. Structural localization of predicted escape potential. (A and B) HA trimer colored by escape potential. (C) Escape potential P values for HIV Env. The gray dashed line indicates the statistical significance threshold. (D) The Env trimer colored by escape potential, oriented to show the V1/V2 regions. (E and F) Potential for escape in SARS-CoV-2 Spike is significantly enriched at the N-terminal domain and receptor binding domain (RBD) and significantly depleted at multiple regions in the S2 subunit. The gray dashed line indicates the statistical significance threshold.
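The permutation-based P values behind these regional enrichment claims can be approximated with a simple resampling test: compare a region's mean predicted escape potential against means of randomly drawn residue sets of the same size. The sketch below is an illustrative assumption, not the authors' implementation; `enrichment_pvalue`, the toy scores, and the region definition are all hypothetical.

```python
# Sketch of a one-sided permutation test for regional escape enrichment.
import random

def enrichment_pvalue(scores, region, n_perm=10000, seed=0):
    """Fraction of random same-size residue sets whose mean score is at
    least the region's mean, with a +1 correction for a valid estimate."""
    rng = random.Random(seed)
    observed = sum(scores[i] for i in region) / len(region)
    idx = list(range(len(scores)))
    hits = 0
    for _ in range(n_perm):
        perm = rng.sample(idx, len(region))
        if sum(scores[i] for i in perm) / len(region) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy protein of 50 residues where the first 10 score much higher:
# that region comes out strongly enriched (small P value).
scores = [1.0] * 10 + [0.1] * 40
p = enrichment_pvalue(scores, region=list(range(10)))
```

A depletion test is the mirror image, counting random sets whose mean is at most the observed mean.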


P < 1 × 10−5), whereas escape potential is significantly depleted in the S2 subunit (permutation-based P < 1 × 10−5) (Fig. 4, E and F; fig. S3; and table S3). These results are supported by the greater evolutionary conservation at S2 antigenic sites (25). Our model of Spike escape therefore suggests that immunodominant antigenic sites in S2 (5, 25) may be more stable target antibody epitopes and underscores the need for more exhaustive causal escape profiling of Spike in regions beyond the RBD.

Our study leverages the principle that evolutionary selection is reflected in sequence variation. This principle may allow CSCS to generalize beyond viral escape to different kinds of natural selection (such as T cell selection) or drug selection. CSCS and its components could be used to select elements of a multivalent or mosaic vaccine. Our techniques also lay the foundation for more complex modeling of sequence dynamics. We anticipate that the “distributional hypothesis” from linguistics (26), in which co-occurrence patterns can model complex concepts and on which language models are based, can further inform viral evolution.

REFERENCES AND NOTES

1. M. B. Doud, J. M. Lee, J. D. Bloom, Nat. Commun. 9, 1386 (2018).
2. J. M. Lee et al., eLife 8, e49324 (2019).
3. A. S. Dingens, D. Arenz, H. Weight, J. Overbaugh, J. D. Bloom, Immunity 50, 520–532.e3 (2019).
4. A. J. Greaney et al., Cell Host Microbe 10.1016/j.chom.2020.11.007 (2020).
5. A. Baum et al., Science 369, 1014–1018 (2020).
6. M. Peters et al., Deep contextualized word representations. Proc. NAACL-HLT, 2227–2237 (2018).
7. A. Radford et al., OpenAI Blog 1, 9 (2019).
8. T. A. Hopf et al., Bioinformatics 35, 1582–1584 (2019).
9. T. Bepler, B. Berger, Learning protein sequence embeddings using information from structure. arXiv:1902.08661 [cs.LG] (2019).
10. R. Rao et al., Evaluating protein transfer learning with TAPE. Proc. Adv. Neural Inf. Process. Syst., 9686–9698 (2019).
11. E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, G. M. Church, Nat. Methods 16, 1315–1322 (2019).
12. Materials and methods are available as supplementary materials.
13. L. McInnes, J. Healy, UMAP: Uniform Manifold Approximation and Projection for dimension reduction. arXiv:1802.03426 [stat.ML] (2018).
14. V. D. Blondel, J. L. Guillaume, R. Lambiotte, E. Lefebvre, J. Stat. Mech. Theory Exp. 2008, P10008 (2008).
15. R. Xu et al., Science 328, 357–360 (2010).
16. K. G. Andersen, A. Rambaut, W. I. Lipkin, E. C. Holmes, R. F. Garry, Nat. Med. 26, 450–452 (2020).
17. M. B. Doud, J. D. Bloom, Viruses 8, 155 (2016).
18. N. C. Wu et al., Nat. Commun. 11, 1233 (2020).
19. H. K. Haddox, A. S. Dingens, S. K. Hilton, J. Overbaugh, J. D. Bloom, eLife 7, e34420 (2018).
20. T. N. Starr et al., Cell 182, 1295–1310.e20 (2020).
21. K. Katoh, D. M. Standley, Mol. Biol. Evol. 30, 772–780 (2013).
22. Y. Xiao et al., Cell Host Microbe 19, 493–503 (2016).
23. K. K.-W. To et al., Clin. Infect. Dis. ciaa1275 (2020).
24. E. Kirkpatrick, X. Qiu, P. C. Wilson, J. Bahl, F. Krammer, Sci. Rep. 8, 10432 (2018).
25. S. Ravichandran et al., Sci. Transl. Med. 12, eabc3539 (2020).
26. Z. S. Harris, Word 10, 146–162 (1954).
27. B. Hie, brianhie/viral-mutation: viral-mutation release 0.3. Zenodo (2020).
28. B. Hie, Data for “Learning the language of viral evolution and escape”. Zenodo (2020).

ACKNOWLEDGMENTS

We thank A. Balazs, H. Fraser, O. Leddy, A. Lerer, A. Lin, A. Nitido, U. Roy, and A. Schmidt for helpful discussions. We thank S. Chun, B. DeMeo, A. Narayan, A. Nguyen, S. Nyquist, and A. Wu for assistance with the manuscript. Funding: B.H. is partially supported by the U.S. Department of Defense (DOD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG). E.Z. is partially supported by the National Science Foundation (NSF) Graduate Research Fellowship. Author contributions: All authors conceived the project and methodology. B.H. performed the computational experiments and wrote the software. All authors interpreted the results and wrote the manuscript. Competing interests: B.H., B.Be., and B.Br. have filed a provisional patent application (serial no. 53/049,676) related to this work. Data and materials availability: Code, scripts for plotting and visualizing, and pretrained models are deposited to Zenodo at doi:10.5281/zenodo.4034681 (27) and are also available at https://github.com/brianhie/viral-mutation. We used the following publicly available datasets for model training: Influenza A HA protein sequences from the NIAID Influenza Research Database (IRD) (www.fludb.org); HIV-1 Env protein sequences from the Los Alamos National Laboratory (LANL) HIV database (www.hiv.lanl.gov); Coronaviridae spike protein sequences from the Virus Pathogen Resource (ViPR) database (www.viprbrc.org/brc/home.spg?decorator=corona); SARS-CoV-2 Spike protein sequences from NCBI Virus (www.ncbi.nlm.nih.gov/labs/virus/vssi); and SARS-CoV-2 Spike and other Betacoronavirus spike protein sequences from GISAID (www.gisaid.org). We used the following publicly available datasets for fitness and escape validation: fitness single-residue DMS of HA H1 WSN33 from Doud and Bloom (2016) (17); fitness combinatorial DMS of antigenic site B in six HA H3 strains from Wu et al. (18); fitness single-residue DMS of Env BF520 and BG505 from Haddox et al. (19); ACE2 binding affinity combinatorial DMS of Spike RBD from Starr et al. (20); escape single-residue DMS of HA H1 WSN33 from Doud et al. (2018) (1); escape single-residue DMS of HA H3 Perth09 from Lee et al. (2); escape single-residue DMS of Env BG505 from Dingens et al. (3); escape mutations of Spike from Baum et al. (5); and escape single-residue DMS of Spike RBD from Greaney et al. (4). Training and validation datasets are deposited to Zenodo at doi:10.5281/zenodo.4029296 (28), and links to this data are also available at https://github.com/brianhie/viral-mutation.

    SUPPLEMENTARY MATERIALS

science.sciencemag.org/content/371/6526/284/suppl/DC1
Materials and Methods
Supplementary Text
Figs. S1 to S3
Tables S1 to S3
References (29–49)
MDAR Reproducibility Checklist

8 July 2020; accepted 9 November 2020
10.1126/science.abd7331



Corrected 14 June 2021. See full text.


Natural language predicts viral escape

Learning the language of viral evolution and escape
Brian Hie, Ellen D. Zhong, Bonnie Berger and Bryan Bryson
Science 371 (6526), 284–288. Originally published online January 14, 2021. DOI: 10.1126/science.abd7331

Viral mutations that evade neutralizing antibodies, an occurrence known as viral escape, can occur and may impede the development of vaccines. To predict which mutations may lead to viral escape, Hie et al. used a machine learning technique for natural language processing with two components: grammar (or syntax) and meaning (or semantics) (see the Perspective by Kim and Przytycka). Three different unsupervised language models were constructed for influenza A hemagglutinin, HIV-1 envelope glycoprotein, and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike glycoprotein. Semantic landscapes for these viruses predicted viral escape mutations that produce sequences that are syntactically and/or grammatically correct but effectively different in semantics and thus able to evade the immune system.

Science, this issue p. 284; see also p. 233


Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Science is a registered trademark of AAAS. Copyright © 2021 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
