understanding genomic diversity, pan-genome ...understanding genomic diversity,...

31
Understanding genomic diversity, pan-genome, and evolution of SARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj Sharma, Sai Gayatri Vemuri and Gaurav Sharma Institute of Bioinformatics and Applied Biotechnology (IBAB), Bengaluru, Karnataka, India * These authors contributed equally to this work. ABSTRACT Coronovirus disease 2019 (COVID-19) infection, which originated from Wuhan, China, has seized the whole world in its grasp and created a huge pandemic situation before humanity. Since December 2019, genomes of numerous isolates have been sequenced and analyzed for testing conrmation, epidemiology, and evolutionary studies. In the rst half of this article, we provide a detailed review of the history and origin of COVID-19, followed by the taxonomy, nomenclature and genome organization of its causative agent Severe Acute Respiratory Syndrome-related Coronavirus-2 (SARS-CoV-2). In the latter half, we analyze subgenus Sarbecovirus (167 SARS-CoV-2, 312 SARS-CoV, and 5 Pangolin CoV) genomes to understand their diversity, origin, and evolution, along with pan-genome analysis of genus Betacoronavirus members. Whole-genome sequence-based phylogeny of subgenus Sarbecovirus genomes reasserted the fact that SARS-CoV-2 strains evolved from their common ancestors putatively residing in bat or pangolin hosts. We predicted a few country-specic patterns of relatedness and identied mutational hotspots with high, medium and low probability based on genome alignment of 167 SARS-CoV-2 strains. A total of 100-nucleotide segment-based homology studies revealed that the majority of the SARS-CoV-2 genome segments are close to Bat CoV, followed by some to Pangolin CoV, and some are unique ones. Open pan-genome of genus Betacoronavirus members indicates the diversity contributed by the novel viruses emerging in this group. Overall, the exploration of the diversity of these isolates, mutational hotspots and pan-genome will shed light on the evolution and pathogenicity of SARS-CoV-2 and help in developing putative methods of diagnosis and treatment. Subjects Bioinformatics, Evolutionary Studies, Genomics, Microbiology, Virology Keywords COVID-19, SARS, Coronavirus, Pandemic, Viral disease, Genome, Bioinformatics, Genomics, Mutations INTRODUCTION The emergence of coronavirus disease 2019 (COVID-19) has sent people from across the world into a state of high alert, and they are trying to survive complete or partial lockdowns (Lescure et al., 2020). The causative agent of this pandemic is a novel Severe Acute Respiratory Syndrome (SARS) Coronavirus, Severe Acute Respiratory Syndrome-related Coronavirus-2 (SARS-CoV-2). Since 2000, the world has witnessed two major coronavirus outbreaks: the rst, of SARS, caused by SARS-CoV, in 2002 in China (Zhong et al., 2003), How to cite this article Parlikar A, Kalia K, Sinha S, Patnaik S, Sharma N, Vemuri SG, Sharma G. 2020. Understanding genomic diversity, pan-genome, and evolution of SARS-CoV-2. PeerJ 8:e9576 DOI 10.7717/peerj.9576 Submitted 29 April 2020 Accepted 29 June 2020 Published 17 July 2020 Corresponding author Gaurav Sharma, [email protected] Academic editor Abhishek Kumar Additional Information and Declarations can be found on page 24 DOI 10.7717/peerj.9576 Copyright 2020 Parlikar et al. Distributed under Creative Commons CC-BY 4.0

Upload: others

Post on 18-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Understanding genomic diversity,pan-genome, and evolution of SARS-CoV-2Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik,Neeraj Sharma, Sai Gayatri Vemuri and Gaurav Sharma

Institute of Bioinformatics and Applied Biotechnology (IBAB), Bengaluru, Karnataka, India* These authors contributed equally to this work.

ABSTRACTCoronovirus disease 2019 (COVID-19) infection, which originated from Wuhan,China, has seized the whole world in its grasp and created a huge pandemic situationbefore humanity. Since December 2019, genomes of numerous isolates have beensequenced and analyzed for testing confirmation, epidemiology, and evolutionarystudies. In the first half of this article, we provide a detailed review of the history andorigin of COVID-19, followed by the taxonomy, nomenclature and genomeorganization of its causative agent Severe Acute Respiratory Syndrome-relatedCoronavirus-2 (SARS-CoV-2). In the latter half, we analyze subgenus Sarbecovirus(167 SARS-CoV-2, 312 SARS-CoV, and 5 Pangolin CoV) genomes to understandtheir diversity, origin, and evolution, along with pan-genome analysis of genusBetacoronavirus members. Whole-genome sequence-based phylogeny of subgenusSarbecovirus genomes reasserted the fact that SARS-CoV-2 strains evolved from theircommon ancestors putatively residing in bat or pangolin hosts. We predicted a fewcountry-specific patterns of relatedness and identified mutational hotspots with high,medium and low probability based on genome alignment of 167 SARS-CoV-2strains. A total of 100-nucleotide segment-based homology studies revealed that themajority of the SARS-CoV-2 genome segments are close to Bat CoV, followed bysome to Pangolin CoV, and some are unique ones. Open pan-genome of genusBetacoronavirus members indicates the diversity contributed by the novel virusesemerging in this group. Overall, the exploration of the diversity of these isolates,mutational hotspots and pan-genome will shed light on the evolution andpathogenicity of SARS-CoV-2 and help in developing putative methods of diagnosisand treatment.

Subjects Bioinformatics, Evolutionary Studies, Genomics, Microbiology, VirologyKeywords COVID-19, SARS, Coronavirus, Pandemic, Viral disease, Genome, Bioinformatics,Genomics, Mutations

INTRODUCTIONThe emergence of coronavirus disease 2019 (COVID-19) has sent people from across theworld into a state of high alert, and they are trying to survive complete or partial lockdowns(Lescure et al., 2020). The causative agent of this pandemic is a novel Severe AcuteRespiratory Syndrome (SARS) Coronavirus, Severe Acute Respiratory Syndrome-relatedCoronavirus-2 (SARS-CoV-2). Since 2000, the world has witnessed two major coronavirusoutbreaks: the first, of SARS, caused by SARS-CoV, in 2002 in China (Zhong et al., 2003),

How to cite this article Parlikar A, Kalia K, Sinha S, Patnaik S, Sharma N, Vemuri SG, Sharma G. 2020. Understanding genomic diversity,pan-genome, and evolution of SARS-CoV-2. PeerJ 8:e9576 DOI 10.7717/peerj.9576

Submitted 29 April 2020Accepted 29 June 2020Published 17 July 2020

Corresponding authorGaurav Sharma,[email protected]

Academic editorAbhishek Kumar

Additional Information andDeclarations can be found onpage 24

DOI 10.7717/peerj.9576

Copyright2020 Parlikar et al.

Distributed underCreative Commons CC-BY 4.0

Page 2: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

and the second, of Middle East Respiratory Syndrome (MERS) in 2012 in Saudi Arabia(Zaki et al., 2012). Amongst coronaviruses, SARS-CoV and MERS are known as highlypathogenic viruses that cause pneumonia and respiratory system infections (Coleman &Frieman, 2014). Besides these, several other low pathogenic strains are also known thatcause mild respiratory diseases and infect majorly the upper respiratory tract (Han et al.,2020). After the diagnosis of the first reported case in Wuhan, China, and genomesequencing of the Wuhan Hu-1 strain (SARS-CoV-2 reference genome), scientists acrossthe world are striving to develop vaccines and diagnostic kits and understand itsevolution and epidemiology. Several potential drug molecules to combat COVID-19 areundergoing clinical trials. As no vaccine or drugs are available at present, the only waysknown to contain transmission are social distancing, regular washing of hands, andcovering mouth and nose while coughing or sneezing, thereby, preventing direct or indirectphysical contact and containing the transfer of infected respiratory droplets.

This study comprises a detailed overview of the history and origin of COVID-19along with the taxonomy, nomenclature and genome structure of its causative agentSARS-CoV-2. Following that, we analyze 167 SARS-CoV-2, 312 SARS-CoV and fivePangolin CoV genomes to study their genomic variability, evolution and mutationhotspots. We also characterize the pan-genome of the genus Betacoronavirus (includingthe recently sequenced Pangolin CoV) to classify their proteins-coding regions into core,accessory, and unique categories within each genome group.

TAXONOMY AND NOMENCLATURECoronaviruses are members of family Coronaviridae that include enveloped positive sensessRNA containing viruses, taxonomically classified amongst order Nidovirales in the realmRiboviria. Like other viruses, they thrive in the gray area of living and non-living asthey do not have the machinery to survive outside a living host cell. Therefore, they cannotreplicate outside the host. The Riboviruses or realm Riboviria members replicate byutilizing RNA-dependent RNA-polymerases (RdRps). Their genetic material that is, RNAserves as messenger RNA (mRNA), directly translating into proteins and undergoesgenetic recombination in the presence of another viral genome in the host cell(Barr & Fearns, 2010).

Members of order Nidovirales have a positive sense linear (capped and polyadenylated)RNA molecule, known for causing severe infections. The word “Nido” stands for “nest”implying that all Nidoviruses express subgenomic mRNAs (sgRNA) in a nested form. Theyare known to infect hosts within three important classes in Vertebrates, namely mammals(Mammalia), birds (Aves) and fishes (Pisces). Coronaviruses (family Coronaviridae;subfamily Orthocoronavirinae) commonly infect both mammals and birds; Torovirus(family Tobaniviridae; subfamily Torovirinae) infect specifically mammals and Bafinivirus(family Tobaniviridae; subfamily Piscanivirinae) infect fishes. Coronavirus virions arespherical, Torovirus are crescent-shaped, and Bafinivirus are rod-like; however, allmembers of this family are adorned with a crown or club-shaped surface proteinscalled the Spike proteins. Because of their crown-like morphology observed in electronmicrographs, they are named Coronaviruses. Family Coronaviridae has a single

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 2/31

Page 3: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

subfamily Orthocoronavirinae and its members form a monophyletic clade. They arefurther classified into four defined genera based on phylogenetic studies of conservedgenome regions and their serological cross-reactivity: Alphacoronavirus, Betacoronavirus,Gammacoronavirus and Deltacoronavirus. Under genus Betacoronavirus, five lineagesdiverge: Lineage A (subgenus: Embecovirus) includes HCoV-OC43 and HCoV-HKU1,Lineage B (subgenus: Sarbecovirus) includes SARS-CoV and SARS-CoV-2, Lineage C(subgenus: Merbecovirus) contains Bat coronavirus BtCoV-HKU4 and BtCoV-HKU5,Lineage D (subgenus: Nobecovirus) includes Rousettus Bat coronavirus BtCoV-HKU9and lineage E (subgenus: Hibecovirus) includes Bat Hp-betacoronavirus/Zhejiang2013.The subgenus Sarbecovirus members tend to undergo deep recombination events leadingto the formation of new alleles (Boni et al., 2020) and zoonotic transfer.

ORIGIN AND HISTORY OF THE CORONAVIRUSINFECTIONSThe first Coronavirus associated disease was described in 1931 (Peiris, 2012). The viruseswere first isolated from humans in the UK and USA around the same time. The firstisolated specimen was B814, taken from a boy exhibiting symptoms of the common cold in1960 and cultured in human embryonic tracheal organ culture (Tyrrell & Bynoe, 1966).The specimen was deemed distinct from Adeno-, Entero- or Rhinoviruses, being etherlabile and propagated only in organ cultures (Kendall, Bynoe & Tyrrell, 1962; Tyrrell &Bynoe, 1966). Later, the gradual discoveries of 229E in 1966 (Hamre & Procknow, 1966),OC43 in 1967 (McIntosh, Becker & Chanock, 1967), SARS-CoV in 2002, Bovine CoV in2006, MERS-CoV in 2012 and SARS-CoV-2 in 2019 (Zhu et al., 2020) were majorsignificant steps in coronavirus history.

Coronaviruses have been associated with mild respiratory infections and cold inhumans and animals (specifically poultry and livestock). These viruses were notconsidered treacherous until the SARS epidemic that emerged in China in November 2002.The associated SARS-CoV species were first classified as a separate clade underBetacoronavirus after this outbreak, post which new species including those of humancoronaviruses (hCoVs) have been identified. The 2002 SARS epidemic ultimately infected8,096 people, with 774 deaths. Later, MERS-CoV emerged in Saudi Arabia in September2012, causing 2,494 confirmed cases of infection and 858 deaths across 27 countries(https://www.who.int/emergencies/diseases/en/).

The current pandemic, COVID-19, is caused by the virus, initially designated “2019novel coronavirus” or “2019-nCoV”, later re-named SARS-CoV-2, segregating it as a novelSARS species (Huang et al., 2020). Now, it forms a new clade in the subgenus Sarbecovirusand is established as the 7th member of the family which infects humans (Simmonds et al.,2017; Gorbalenya et al., 2020; Zhu et al., 2020). The first patients of COVID-19 wereidentified in late December 2019 when several local health centers reported clusters ofpatients with pneumonia caused by an unknown pathogen, epidemiologically linked toliving animal and seafood wholesale market in Wuhan. The Chinese Center for DiseaseControl and Prevention called a rapid action team for the epidemiological investigation.Bronchoalveolar-lavage fluids were propagated on human respiratory tract epithelial cell

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 3/31

Page 4: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

cultures for 4–6 weeks. Extracted nucleic acid was identified as viral RNA by real-timereverse transcription PCR targeting the RdRp region of pan-BetaCoV. The electronmicrographs revealed the presence of distinctive spikes of length 9–12 nm, establishingmorphological resemblance with the Coronaviridae family (Zhu et al., 2020). Followingthis, SARS-CoV-2 infection has been spreading worldwide. As of 29 April 2020(submission date), 213 countries and territories around the world are under surveillance,with more than 3.15 million confirmed cases and over 218,490 reported fatalities, and thesenumbers continue to exponentially increase day by day.

Since December 2019, numerous SARS-CoV-2 genomes have been sequenced, whichallows researchers to address many critical questions regarding its origin, transmission,epidemiology, and most importantly to design vaccines and viral detection kits. Viralgenome sequencing and multiple sequence alignment of SARS-CoV-2 genomes haverevealed its identity with Bat CoV RaTG13 (MN996532) and Pangolin-CoV (Cui, Li &Shi, 2019; Boni et al., 2020; Rehman et al., 2020). A comparative phylogenetic studyrevealed that the Pangolin-CoV genes shared a high level of sequence identity withSARS-CoV-2 genes, specifically orf1b, the spike (S), orf7a and orf10 genes (Zhang, Wu &Zhang, 2020). Higher identity of S protein sequence implies the functional similaritybetween Pangolin-CoV and SARS-CoV-2 as compared to the RaTg1 strain (Zhang, Wu &Zhang, 2020) further suggesting Pangolin as an intermediate host for infection and naturalreservoir of SARS-CoV-2-like strains.

CORONAVIRUS GENOME ORGANIZATIONCoronaviruses belong to family Coronaviridae and comprise enveloped viruses thatreplicate in the cytoplasm of animal host cells; distinguished by the presence of asingle-stranded positive-sense RNA genome (about 30 kb in length) (Marra et al., 2003).Coronaviruses possess the largest genomes among all known RNA viruses (Fehr &Perlman, 2015), ranging from 25.32 kb in Porcine Deltacoronavirus PD-CoV to 31.775 kbin Beluga whale coronavirus SW1 (information available on NCBI Virus webpagehttps://www.ncbi.nlm.nih.gov/labs/virus/). SARS-CoV-2 is a spherical or pleomorphicenveloped particle, containing single-stranded, positive-sense RNA associated with anucleoprotein, lying inside a capsid composed of matrix protein (Lai et al., 2020).

The SARS-CoV-2 reference genome (NC_045512; Wuhan Hu-1 strain) is 29,903nucleotides long, which consists of A (8,954 nt, 29.94%), G (5,863 nt, 19.61%), C (5,492 nt,18.37%), T (9,594 nt, 32.08%). Compared to SARS-CoV-2, the SARS-CoV referencegenome (NC_004718; Tor1 strain) is a little larger, that is, 29,751 nucleotides and itsnucleotide composition is marginally different: A (8,481 nt, 28.50), G (6,187 nt, 20.80%),C (5,940 nt, 19.97%), T (9,143 nt, 30.73%). We identified that the GC content (averageof all genomes under this study) of SARS-CoV, SARS-CoV-2, and Pangolin CoV is40.81%, 38% and 38.52% respectively. Like other Betacoronavirus members, this genomecontains two flanking untranslated regions (UTRs), 5′-UTR (265 nt) and 3′-UTR (229 nt)sequences.

A typical CoV genome contains at least six ORFs (Graham et al., 2008). The genomes ofall coronaviruses usually encode four well-conserved and characterized structural proteins:

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 4/31

Page 5: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

S (spike), E (envelope), M (membrane) and N (nucleocapsid) (Graham et al., 2008;Fuk-Woo Chan et al., 2020), encoded by sgRNA 2, 4, 5 and 9a respectively present at 3′endand sharing 1/3 of the genome in SARS-CoV (Graham et al., 2008).

SARS-CoV-2 Proteins: The SARS-CoV-2 genome has 12 protein-coding regions, whichencode two categories of proteins: first, Structural proteins, which give characteristicstructure to the virus and are involved in viral entry, and second, Non-structural (NS)or accessory proteins, which help in viral replication (Marra et al., 2003; Graham et al.,2008; Hu et al., 2018; Fuk-Woo Chan et al., 2020). Most of the information available so farabout these proteins is based on SARS-CoV and other members of the genusBetacoronavirus.

STRUCTURAL PROTEINSSpike proteinThe S protein is a 1,273 aa trimeric, cell-surface glycoprotein consisting of two subunits(S1 and S2), encoded by the gene S. The S1 subunit is responsible for receptor binding(Hu et al., 2018). This is also important for mediating the fusion of viral and hostmembranes. Both these processes are critical for virus entry into host cells (Tan, Lim &Hong, 2005). SARS-CoV-2 has a polybasic cleavage site RRAR, at the junction of S1 and S2subunits, which enables effective cleavage by Furin and other proteases (Andersen et al.,2020; Zhang, Wu & Zhang, 2020). Furthermore, one proline residue is also inserted atthe leading cleavage site of SARS-CoV-2, making “PRRA”, the final inserted sequence.Comparison of this site within different Betacoronavirus members revealed that theinsertion of a Furin cleavage site at the S1–S2 junction of SARS-CoV-2 enhances cell-cellfusion (Follis, York & Nunberg, 2006;De Haan et al., 2008; Andersen et al., 2020). Similarly,an effective cleavage of the polybasic cleavage site in Hemagglutinin esterase proteinfacilitated the inter-species transmission of MERS-like coronaviruses from bats to humans(Andersen et al., 2020). Majorly, variations in the S protein are responsible for twoattributes, tissue tropism and host ranges of different CoVs (Hu et al., 2018). It wasobserved that S protein underwent several drastic changes during the viral infection.The S protein of SARS-CoV-2 is more vulnerable to mutations, especially in the aminoacids associated with the spike protein-cell receptor interface. Interestingly, the amino acidsequence represented ~19% changes with four major insertions and ~81% sequencesimilarity in contrast to SARS-CoV (Ahmed, Quadeer & McKay, 2020; Rehman et al.,2020).

Nucleocapsid proteinN proteins are 419 aa long phosphoproteins weighing ~46kDa. These have helix bindingproperties. Coronavirus N proteins possess three easily distinguishable and highlyconserved domains; the N-terminal domain (NTD) (N1b), the C-terminal domain (CTD)(N2b) and the N3 region (Grossoehme et al., 2009; Cong et al., 2019). The N proteinplays a vital role in virion structure formation as it is localized in both the replication-transcription region and the ERGIC (Endoplasmic reticulum-Golgi apparatusIntermediate Compartment), the site of virion assembly (Tok & Tatar, 2017).

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 5/31

Page 6: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

The N protein of SARS-CoV is primarily expressed during the early stages of SARS-CoVinfection (Surjit & Lal, 2008; Grossoehme et al., 2009; Cong et al., 2019). It is importantas a diagnostic marker as it induces a strong immune response (Hu et al., 2018).Evolutionary analysis has shown 89–91% sequence homology between the N proteins inthe SARS-CoV and Bat SL-CoV (Hu et al., 2018; Ahmed, Quadeer & McKay, 2020).The N protein can bind to Nsp3 protein to help bind the genome to replication/transcription complex (RTC) (Cong et al., 2019) and package the encapsulated genomeinto virions (Chen, Liu & Guo, 2020). IBV, SARS CoV and MHV N protein undergophosphorylation and allow discrimination of non-viral mRNA binding from viralmRNA binding (Cong et al., 2019). N protein is also an antagonist of interferon (IFN)(Kopecky-Bromberg et al., 2007) and virus-encoded repressor of RNA interference (RNAi),therefore, it appears to benefit the viral replication (Chen, Liu & Guo, 2020).

Membrane proteinThe most abundant structural protein in the genome is the membrane (M) glycoprotein,which spans across the membrane bilayer three times. Thus, M glycoproteins have threetransmembrane regions, leaving a short NH2-terminal domain outside the virus and along COOH terminus (cytoplasmic domain) inside the virion (Tok & Tatar, 2017;Mousavizadeh & Ghasemi, 2020). The M protein plays a key role in regenerating virions inthe cell. M proteins undergo glycosylation in the Golgi apparatus which is crucial forthe virion to fuze into the cell and to make antigenic proteins. As mentioned before, Nprotein forms a complex by binding to genomic RNA andM protein triggers the formationof interacting virions in the endoplasm (Tok & Tatar, 2017).

Envelope glycoproteinE glycoproteins are composed of approximately 76–109 amino acids in other coronavirusspecies. E protein plays a crucial role in the assembly and morphogenesis of virionswithin the host. About 30 amino acids in the N-terminus of the E protein enablebinding to the viral membranes. Co-expression of E and M proteins with mammalianexpression vectors enable the formation of virus-like structures within the cell (Tok &Tatar, 2017).

NON-STRUCTURAL (ACCESSORY) PROTEINSORF1ab and ORF1aThe SARS 5′ proximal gene 1 (~22 kb) comprises two long overlapping open readingframes, orf1a and orf1ab, encoding for polyproteins 1a and 1ab. These polyproteins arecleaved by viral proteases that is, PLpro (papain-like protease) and 3CLpro (chymotrypsin-like protease) into 16 non-structural proteins, Nsp1–Nsp16. Expression of orf1abinvolves a (−1) ribosomal frameshift upstream of orf1a stop codon, thus forming thefull-length ORF1ab. ORF1a polyprotein of ~500 kDa encodes for Nsp 1–11, while ORF1abpolyprotein of ~800 kDa encodes for all Nsp1–16. The ORF1a and ORF1ab polyproteinsundergo post-translational modifications to form mature proteins and hence are not

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 6/31

Page 7: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

detected during infection (Graham et al., 2008; Lei, Kusov & Hilgenfeld, 2018). A briefdescription of these proteins is provided below:

i. Nsp1: SARS-CoV Nsp1 is an N-terminal protein coded by the first gene of ORF1ab,which plays an important role in SARS-CoV pathogenesis by inhibiting the hostgene expression via binding to the small subunit of the ribosome and then truncatingthe translation activity. Nsp1 induces endonucleolytic RNA cleavage of a cappedhost mRNA, making it translationally incompetent (Tanaka et al., 2012). Nsp1inhibits the type-1 IFN expression and other antiviral signaling pathways, thussuppressing the innate immune system. The viral mRNA is resistant to Nsp1-mediated RNA cleavage in the infected host cells, however, the mechanism throughwhich it achieves this is unknown (Tanaka et al., 2012).

ii. Nsp2: the function of Nsp2 in SARS-CoV is unknown. The experimental results showthat the deletion of the nsp2 gene from MHV and SARS-CoV was tolerated withmodest growth and some RNA defect. Also, there was no observable effect on thepolyprotein processing in the mutants. Another evidence shows that neither theNsp2 protein nor the nsp2 gene is involved in pathogenesis (Graham et al., 2005).

iii. Nsp3: the multi-domain SARS-CoV Nsp3 is a 215 kDa glycosylated transmembranemultidomain protein involved in viral replication and transcription. It may serveas a scaffolding protein for numerous other proteins (Angelini et al., 2013).The organization of various Nsp3 domains differs amongst the Coronavirusgenomes. However, eight domains are common to all CoVs: ubiquitin-like domain 1(Ubl1), the glutamate-rich acidic domain also known as “hypervariable region”, an Xmacrodomain, ubiquitin-like domain 2 (Ubl2), PL2pro (papain-like protease 2),ectodomain 3Ecto, also known as “zinc-finger domain”, and domains Y1 and CoV-Y(unknown functions). Nsp3 releases itself, Nsp1 and Nsp2 from the polyproteins.It alters the post-translational modification of host proteins to antagonize theinnate immune response by de-MARylating, de-PARylating, de-ubiquitinating,or de-ISGylating the host proteins. Meanwhile, Nsp3 modifies itself by theN-glycosylation of the ectodomain and can also interact with host proteins (such asRCHY1) to enhance viral survival (Lei, Kusov & Hilgenfeld, 2018).

iv. Nsp4: it is known that Coronaviruses induce double-membrane vesicles (DMVs).Any alteration in the DMVs’ morphology impairs the RNA synthesis andtherefore growth of Nsp4 mutants (Gadlage et al., 2010).

v. Nsp5: the Nsp5 protease also known as 3CLpro or Mpro, cleaves the Nsp peptides at11 cleavage sites (Stobart et al., 2013). Nsp5 of Porcine Deltacoronavirus (PDCoV)is a type I IFN antagonist. It disrupts the IFN signaling pathway by cleaving theNF-κB essential modulator (NEMO), thus impairing the host’s ability to activate theIFN response (Zhu et al., 2017).

vi. Nsp6: Nsp6, along with Nsp3 and Nsp4, has membrane proliferation ability. Thus, itcan induce perinuclear vesicles localized around the microtubule-organizingcenter and DMV formation (Angelini et al., 2013). The coronavirus Nsp6 protein

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 7/31

Page 8: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

restricts the autophagosome expansion either directly or indirectly throughstarvation or chemical inhibition of MTOR signaling (Cottam, Whelband &Wileman, 2014).

vii. Nsp7: the Nsp12 RNA-dependent RNA polymerase cannot act on its own anddepends upon stimulation by a complex of Nsp7 and Nsp8. Thus, Nsp7 acts asone of the cofactors along with Nsp8 to stimulate Nsp12. It is hypothesized that Nsp7and Nsp8 heterodimers stabilize the Nsp12 regions involved in RNA binding; also,Nsp8 acts as a minor RdRp (Kirchdoerfer & Ward, 2019).

viii. Nsp8: Nsp8 is a 22 kDa non-canonical polymerase shown to be capable of de novoRNA synthesis with low fidelity on single-stranded RNA templates. The ‘main’ RdRpin Coronaviruses is the Nsp12 which employs a primer-dependent initiationmechanism. These observations led to a hypothesis that Nsp8 would act as an RNAprimase and synthesize a short primer that will be extended by Nsp12 (Te Velthuis,Van den Worm & Snijder, 2012).

ix. Nsp9: Nsp9 is an RNA-binding subunit in the replication complex in allcoronaviruses (Zeng et al., 2018).

x. Nsp10: SARS-CoV Nsp10 binds to and stimulates the exoribonucleases, Nsp14, andNsp16, thus playing an important role in the replication-transcription complex(RTC) formation (Bouvet et al., 2014). Nsp10 acts as an essential co-factor intriggering Nsp16 2′O-MTase activity, which suggests its involvement in theregulation of capping of viral RNA (Lugari et al., 2010).

xi. Nsp11: the exact function of Nsp11 in Coronaviridae is unknown. Arterivirus Nsp11was identified as NendoU (Nidoviral uridylate-specific endoribonucleases) that playsa role in the viral life cycle. Nsp11 could be essential to produce helicase (Woo et al.,2005).

xii. Nsp12: Nsp12 is a 102 kDa RNA-dependent RNA polymerase (RdRp) featuringall conserved motifs of the known RdRps. It is the most conserved protein incoronaviruses and assumes a center stage in the viral RTC. The Nsp7/Nsp8 complexenhances the binding of Nsp12 to RNA. Nsp8 is the second, non-canonical RdRp incoronaviruses (Subissi et al., 2014).

xiii. Nsp13: Nsp13 is a 66.5 kDa multi-functional protein. The N terminal has azinc-binding domain while the C-terminal harbors a helicase domain-containingconserved motif of superfamily-1 helicases (Subissi et al., 2014).

xiv. Nsp14: Nsp14 is a 60 kDa bifunctional enzyme. Its N-terminal is a 3′–5′exoribonuclease involved in the mismatch repair system, improving the fidelityof virus replication via RNA proofreading. Yeast trans-complementationstudies have shown that the C-terminal domain of SARS-CoV Nsp14 is anS-adenosylmethionine-dependent guanine-N7-methyltransferase (MTase) (Subissiet al., 2014; Zeng et al., 2016) with no RNA sequence specificity (Subissi et al., 2014).

xv. Nsp15: SARS-CoV Nsp15 is a uridine specific ribonuclease that cleaves the 3′ ofuridylates, producing 2′–3′ cyclic phosphate ends (Subissi et al., 2014).

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 8/31

Page 9: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

xvi. Nsp16: Nsp16 is a 2′-O-Methyl Transferase (Zeng et al., 2016) and requires Nsp10 asa stimulatory factor to exhibit its MTase activity (Chen et al., 2011). EukaryoticmRNA has a unique 5′ cap; to evade the host machinery, the viral RNA must bemade indistinguishable from the host mRNA by capping the viral RNA (Menachery,Debbink & Baric, 2014). Both Nsp14 and Nsp16 are involved in the modificationof viral RNA cap structure (Zeng et al., 2016) to maintain the viral RNA stability,ensure protein translation and immune escape (Lugari et al., 2010; Subissi et al.,2014). The SARS-CoV genome encodes two SAM-dependent methyltransferases(Zeng et al., 2016); Nsp14 N7 methyltransferase, and Nsp16 2′-O-methyl transferase,that methylates the RNA at N7 of guanosine and ribose 2′O sites, respectively(Chen et al., 2011). This process also involves Nsp10 for stabilizing the SAM bindingregion and RNA binding of Nsp16 (Subissi et al., 2014).

ORF3a proteinThe orf3a gene lies between S and E genes and encodes for a transmembrane protein(McBride & Fielding, 2012). SARS-CoV-2 ORF3a protein shows 97.82% homology to NS3 ofBat coronavirus RaTG13 and 72% homology to SARS-CoV ORF3a protein (Issa et al., 2020).It is localized to the cell membrane and partly to the ER and the Golgi perinuclearspace in the host cell. ORF3a is known to activate PERK (PKR-like ER kinase) signalingpathway through which the viral particles escape ER-associated degradation and theresulting ER stress also induces apoptosis. ORF3a of both SARS-CoV and SARS-CoV-2 havean APA3_viroporin (a pro-apoptosis protein) conserved domain (McBride & Fielding, 2012;Issa et al., 2020). ORF3a has been shown to activate NF-κB and JNK. This leads to theupregulation of the chemokine named RANTES (Regulated upon Activation, Normal T CellExpressed and Secreted) along with IL-8 in A549 and HEK293T cells (Liu et al., 2014).The extracellular N-terminus of protein can evoke a humoral immune response. Bindingof ORF3a protein to caveolin-1 may be required for viral uptake and release (Liu et al., 2014).ORF3a can augment the activation of the p38 MAPK pathway and induce the mitochondriato leak inducing apoptosis (Liu et al., 2014).

ORF6 proteinSARS-CoV ORF6/Protein 6 is a 63-aa polypeptide. It has an amphipathic 1–40 aaN-terminal portion and a highly polar C-terminal. The amino acid residues 2–32 in theN-terminal form an a-helix which is embedded in the cell membrane. There are twosignal sequences in the C-terminal; aa 49–52 sequence YSEL targets proteins forincorporating into endosomes and the acidic tail which signals ER export. This proteinis localized in the ER and Golgi apparatus. ORF6 is incorporated into VLPs, whenco-expressed with SARS-CoV S, M and E structural proteins. It enhances viral replication,thus serving an important role in pathogenesis during SARS-CoV infection (Liu et al.,2014). ORF6 is also well known as a β-interferon antagonist; its overexpression inhibitsnuclear import of STAT1 in IFN-β treated cells. Along with ORF3b and N protein, theSARS-CoV ORF6 inhibits activation of IRF-3 via phosphorylation and binding of IRF-3 to

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 9/31

Page 10: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

a promoter with IRF-3 binding sites. IRF-3 is an important protein for the expression ofIFN. ORF6 may disturb the ER/Golgi transport necessary for the interferon response(Kopecky-Bromberg et al., 2007). During SARS-CoV infection, the C-terminal domain ofORF6 modulates the host protein nuclear transport and type-I interferon signaling, thusplaying an important role in immune evasion (Liu et al., 2014).

ORF7a proteinSARS-CoV ORF7a/ Protein 7a is a 122 aa type-I transmembrane protein (Liu et al., 2014).It consists of 15 aa N terminal signal peptide sequence, an 81 aa ectodomain, 21 aa Cterminal transmembrane domain, and a cytoplasmic tail of 5 aa residues. The ectodomainof ORF7a binds to human LFA-1 (lymphocyte function-related antigen 1) on Jurkat cellsvia the alpha integrin-I domain of LFA-1. This suggests that probably LFA-1 is abinding factor or receptor for SARS-CoV on human leukocytes (McBride & Fielding,2012). SARS-CoV ORF7a is localized in the ER-Golgi Intermediate Compartment(ERGIC), the assembly point of coronaviruses. It interacts with M and E structuralproteins, indicating a possible role in viral assembly during SARS-CoV replication.In mammalian cells infected with SARS-CoV, ORF7a has been known to inducecaspase-dependent apoptosis by cleaving PARP (poly(ADP-ribose) polymerase), anapoptotic marker. Its pro-apoptotic nature is a result of the interaction between itstransmembrane domain with Bcl-XL, an anti-apoptotic protein of the Bcl-2 family(Liu et al., 2014). Like ORF3a, overexpression of ORF7a activates NF-κB and c-Jun N-terminal kinase (JNK), augmenting the production of pro-inflammatory cytokines such asinterleukin 8 (IL-8) and RANTES (Liu et al., 2014). ORF7a overexpression inducedapoptosis can occur via a caspase-3 dependent pathway as well as p38 MAPK pathway(McBride & Fielding, 2012).

ORF7b proteinSARS-CoV ORF7b/Protein 7b is a 44 aa long, highly hydrophobic, integraltransmembrane protein with a luminal N-terminal and cytoplasmic C-terminal.The expression of ORF7a is reduced significantly when the orf7a start codon (upstreamof 7b) is mutated to a strong Kozak sequence or when an additional start codon AUG isinserted upstream of the orf7b start codon. Thus, ORF7b may be expressed by “leakyscanning” of the ribosome (Liu et al., 2014). The Golgi-restricted localization of ORF7bwas attributed to the transmembrane domain (Liu et al., 2014). However, the role ofORF7a and ORF7b during the replication of SARS-CoV remains uncertain (McBride &Fielding, 2012).

ORF8 proteinHuman SARS-CoV-2 isolated from early patients, Civet SARS-CoV and other batSARSr-CoV contains the full-length ORF8 (Fuk-Woo Chan et al., 2020). Two genomes ofgenus Betacoronavirus isolated from horseshoe bat, SARS-Rf-BatCoV YNLF_31C andYNLF_34C are 93% identical to the human and civet SARS-CoV genome. However, allHuman SARS-CoV isolates from mid and late-phase patients contain a signature 29

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 10/31

Page 11: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

nucleotide deletion in the orf8 gene, splitting it into orf8a and orf8b. This suggests thatORF8 may play a role in interspecies transmission. The generation of civet SARSr-CoVcould be a result of a potential recombination event between SARSr-Rf-BatCoVs andSARSr-Rs-BatCoVs identified around ORF8 (Lau et al., 2015). Interestingly, the newSARS-CoV-2 ORF8 shares very less homology with the conserved ORF8 or ORF8bisolated from Human SARS-CoV or its related viruses, Civet SARS-CoV, Bat-CoVYNLF_31C andYNLF_34C. The ORF8 lacks any known functional domain or motif.SARS-CoV ORF8b has an aggregation motif VLVVL (75–79aa) which is known to activateNLRP3 inflammasomes and trigger intracellular stress pathways (Fuk-Woo Chan et al.,2020), but this is not conserved in SARS-CoV-2. Notably, both ORF3 and ORF8 inSARS-CoV-2 are highly divergent from the interferon antagonist ORF3a andinflammasome activator ORF8b in SARS-CoV (Yuen et al., 2020).

ORF10 proteinThe SARS-CoV-2 ORF10 protein has no homologs in SARS-CoV and any other membersof genus Betacoronavirus (Xu et al., 2020). Viruses are known to sabotage ubiquitinationpathways for replication and pathogenesis. The ORF10 interacts with specifically theCUL2ZYG11B complex and the rest of the members of the Cullin-2 (CUL2) RING E3 ligasecomplex. ZYG11B degrades substrates with exposed N-terminal glycine residues.By studying the ORF10 interactome, it was observed that it interacts the most with theCULZYG11B complex (Gordon et al., 2020). This evidence shows that ORF10 might bind tothis complex and hijack its ubiquitination of restriction factors (Gordon et al., 2020).

METHODOLOGY (GENOMES, DATABASES, AND TOOLSUSED IN THIS STUDY)Complete genomes, belonging to SARS-CoV-2 (167 genomes), SARS-CoV (312 genomes)and Pangolin CoV (five genomes) were downloaded from NCBI Virus webpage(https://www.ncbi.nlm.nih.gov/labs/virus/) as on 29 March 2020. Similarly, RefSeqassemblies of 56 suborder Cornidovirineae genomes (including 18 from Betacoronavirusgenus) along with the genome of Paguronivirus-1 (as an outgroup) were downloadedas on 1 April 2020. The latest version of the Virus database was downloaded from theNCBI Virus page as on 6 April 2020.

For alignments of the studied genomes, as required for phylogeny and mutationalstudies, we have used MUSCLE v3.8.1551 and MEGAX (Edgar, 2004; Kumar et al., 2018)tools as per their default settings. To generate maximum likelihood phylogenies ofgenome-based alignments, RAxML v8.2.12 (Stamatakis, 2014) was used to performrapid bootstrap analysis and search for bestscoring ML tree in one program run (“-f a”parameter) using GTRGAMMA as nucleotide substitution model (-m) with 100 bootstrapvalues. After obtaining the newik trees from RAxML, an online iTOL platform (Letunic &Bork, 2019) was used to visualize phylogenetic trees as shown in this study. For eachexperiment in respective result sections, we have provided comprehensive details about thegenomes under study, research methodology, and used parameters.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 11/31

Page 12: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

For the pangenome study, Proteinortho v6.0.12 (Lechner et al., 2011) was used with anE value cutoff 1E−05 for each identified hit along with other default settings. It utilizesdiamond v0.8.36.98 and BLAST 2.8.1+ (Christiam et al., 2009; Buchfink, Xie & Huson,2014) to identify ortholog proteins using reciprocal blast strategy. For the identificationof pairwise average nucleotide identity (ANI) values within all genomes under study,PYANI v0.1.2 (Pritchard et al., 2016) program was used. Throughout this study, theWuhan Hu-1 strain has been used as a reference for SARS-CoV-2 genomes.

RESULTS AND DISCUSSIONHuman SARS-CoV-2 genome is quite disparate as compared to otherBetacoronavirus genomesThe SARS outbreak in 2003 and MERS in 2012 had already set a precedent for widespreadgenome analysis, therefore, with the recent emergence of COVID-19 in November2019, extensive sequencing and genome analysis of SARS-CoV-2 isolates are beingperformed. Since February 2020, several Bat and Pangolin coronaviruses have also beensequenced to understand the probable origin of SARS-CoV-2. In this study, we haveanalyzed 312 SARS-CoV, 167 SARS-CoV-2 and 5 Pangolin CoV genomes to understandtheir genomic conservation, unique genes, mutational hotspots and respective evolutionand origin.

Phylogeny relationship at suborder Cornidovirineae levelSARS-CoV-2 is a member of suborder Cornidovirineae, family Coronaviridae, genusBetacoronavirus, and subgenus Sarbecovirus. In this study, we have tried to understand thestrain diversity and evolutionary relationships at different taxonomy levels in a top-bottomclassification scheme. For this analysis, the complete whole-genome sequences of 56suborder Cornidovirineae reference genomes were analyzed, which included membersfrom five genera i.e. 23 Alphacoronavirus, 20 Betacoronavirus, 10 Deltacoronavirus andthree Gammacoronavirus and the genome of Paguronivirus-1 was included as an outgroupfor this study. Their genome length varied from 25,425-31,686 nucleotides. ClustalWalignment (using MEGA-X) generated an alignment of 35,601 nucleotides off which17,284 conserved sites were stripped and used to generate an ML-based phylogeny usingRAxML. We noted that this phylogeny is unambiguously able to demarcate all the generainto their respective monophyletic clades (Fig. S1). The significant variations amongstbranch length distances indicate the relative diversity within each genus.

Phylogeny relationship at genus Betacoronavirus levelWe generated the phylogeny of 18 Betacoronavirus reference strains (along with fivePangolin CoV strains) using their complete genomes ranging within 29,114–31,526nucleotides. ClustalW based whole genome alignment using MEGAX generated analignment of length 33,346 from which 24,555 conserved nucleotide sites were strippedand used to generate ML phylogeny (Fig. S2). Three distinct clades were seen in thephylogeny: the first clade includes strains from subgenus Sarbecovirus, Nobecovirus andHibecovirus, second with Embecovirus members, and the third clade of members of

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 12/31

Page 13: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

subgenus Merbecovirus. Within the first clade, both members of subgenus Sarbecovirus(SARS-CoV and SARS-CoV-2) are grouped with Pangolin CoV strains, which is supportedby their percentage identity. All Pangolin CoV genomes are >99.8% identical to each otherwhereas their closest relatives are the SARS-CoV-2 reference strain (~85% identity),followed by the SARS-CoV reference strain (~79% identity). These three groups are closest(~57% identity) to their sister branch occupied by subgenus Hibecovirus strain: BatHp-betacoronavirus Zhejiang 2013. Two reference strains of another subgenusNobecovirus (~67% identity with each other) form a separate clade that is closest (~52%identity with other group members) to the above-mentioned clade comprising subgenusSarbecovirus and Hibecovirus. The second clade only has representatives of subgenusEmbecovirus including Murine hepatitis virus, Bovine CoV, Rabbit CoV, Rat CoV, etc.Some of these strains form respective subclades in the phylogeny based on their closeness.Strains belonging to subgenus Merbecovirus that include the MERS, form the thirdclade. MERS genomes (England 1 and MERS Co.) are 99.7% identical to each other andtherefore form a separate sub-clade. As observed from their phylogenetic branch lengthsand percentage identity matrix, all subgenera are quite distinct from each other.As expected, members of each subgenus are closer to each other, therefore formingrespective monophyletic clades.

Phylogeny relationship at subgenus Sarbecovirus levelTo understand the similarities and differences among the strains of subgenus Sarbecovirus,we analyzed the complete genomes of 103 SARS-CoV-2 and 312 SARS-CoV isolatesincluding the reference strains of Wuhan Hu-1 (SARS-CoV-2) and Tor2 (SARS-CoV).Since Pangolin CoV belongs to subgenus Sarbecovirus, we also included 5 Pangolin CoVgenome sequences in our dataset and aligned them using MUSCLE. The genome size ofSARS-CoV, Pangolin CoV and SARS-CoV-2 strains were in the range of 29,013–30,311nucleotides, 29,795–29,806 nucleotides and 29,325–29,945 nucleotides, respectively.The alignment of length 31,718 nucleotides was stripped to a conserved region of 26,762nucleotides (sites with gaps in between them were excluded) which was used to generate amaximum likelihood phylogeny using RAxML (Fig. 1; Fig. S3). We noted from thephylogeny that SARS-CoV-2 strains form a single monophyletic clade as compared to theSARS-CoV. SARS-CoV strain RaTG13 procured from the bat in 2013 from Yunnan,China is closest to SARS-CoV-2 strains, suggesting that they might have diverged from acommon ancestor. Average Nucleotide Identity (ANI) studies also indicated that theSARS-CoV-2 reference genome and the Bat RaTG13 branch share 96.11% identity at thegenome level, a finding supported by earlier reports (Lv et al., 2020; Zhou et al., 2020).The next closest strains in the phylogeny were those of Pangolin CoV, all forming asubclade. Other close relatives are the SARS-CoV strains Bat CoVZC45 and BatCoVZXC21, procured from Zhoushan, eastern China in 2018. The phylogeny suggeststhat Pangolin CoVs are closer to SARS-CoV-2 strains as compared to CoVZC45 andCoVZXC21, however, genome-wide percentage identity suggests the opposite—thegenomes of both CoVZC45 and CoVZXC21 are ~89% identical to SARS-CoV-2, asopposed to its 86% identity with Pangolin CoVs.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 13/31

Page 14: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

MT0

4033

5.1

Pan

golin

PC

oV G

X-P

5L

JF292911 SA

RS

-1 US

A M

us MA

15 d4ym1

MT044258 SARS-2 USA Hum CA6

KY

4171

45 S

AR

S-1

Chi

na B

at R

f409

2

KF514423 SARS-1 USA NA WTic c3P20-2009

JF292922 SARS-1 USA NA ExoN1 c5P1

KF514417 SARS-1 USA NA ExoN1 c5 3P20-2010

MT1

6372

1 SA

RS-

2 U

SA H

um W

A9-

UW

6

MT159720 SARS-2 USA Hum CruiseA-4

AY291315 S

AR

S-1 G

ermany H

um Frankfurt-1

HQ890529 SARS-1 USA Mus MA15-ExoN1 d2ym4

KF514400 SARS-1 USA NA WTic c1 8P20-2010

MT0

6617

5 S

AR

S-2

Tai

wan

Hum

NTU

01

HQ890536 SARS-1 USA Mus MA15-ExoN1 d2om3

LC528233 SARS-2 NA Hum Hu-DP-Kng-19-027-RNA

AY427439 SAR

S-1 NA H

um A

S

FJ882953 SARS-1 USA Mus MA15-ExoN1 P3pp4

AY

394979 SA

RS

-1 NA

NA

GZ-C

AY394998 SARS-1 China NA LC1

AY279354 SARS-1 NA NA BJ04

AY362699 SARS-1 NA NA TWC3

LC528232 SARS-2 NA Hum Hu-DP-Kng-19-020-RNA

JF292915 SA

RS

-1 US

A M

us MA

15 d4ym5

EU371564 SARS-1 NA NA BJ182-12

DQ

4120

43 S

AR

S-1

NA

Bat

Rm

1

HQ890533 SARS-1 USA Mus MA15-ExoN1 d4ym3

DQ

898174 SA

RS

-1 Canada N

A C

V7

MK

062182 SA

RS

-1 US

A H

um

Urb

ani icS

AR

S-C

3-MA

AY3949

85 S

ARS-1 China N

A HSZ-B

b

GQ

1535

48 S

AR

S-1

Ch

ina

Bat

HK

U3-

13

FJ882942 SARS-1 USA Mus MA15-ExoN1 P3pp5

KF514399 SARS-1 USA NA WTic c1 1P20-2010

AY

323977 SA

RS

-1 Italy NA

HS

R-1

GQ

1535

44 S

AR

S-1

Ch

ina

Bat

HK

U3-

9

AY345987 SARS-1 HongKong NA AG02

MN996529 SARS-2 China Hum WIV05

AY394990 SARS-1 China NA HZS2-E

AY

278741 SA

RS

-1 NA

NA

Urb

ani

KF514416 SARS-1 USA NA ExoN1 c5 8P20-2010

MT0

7286

4.1

Pan

golin

PC

oV G

X-P

2V

AY502930 SARS-1 Taiwan NA TW7

AY283794 S

AR

S-1 S

ingapore NA

Sin2500

JF292920 SA

RS

-1 US

A M

us MA

15 d3om5

MT184913 SARS-2 USA Hum CruiseA-26

AY291451 SAR

S-1 Taiwan N

A TW1

AY350750 SARS-1 NA NA PUMC01

EU371562 SARS-1 NA NA BJ182-4

GU553365 SARS-1 USA M

on HKU-39849 HARROD-00003

HQ890527 SARS-1 USA Mus MA15-ExoN1 d2ym2

HQ890535 SARS-1 USA Mus MA15-ExoN1 d2om2

FJ882946 SARS-1 USA NA wtic-MB P3pp13

AY485278 SARS-1 NA NA Sino3-11

FJ882955 SARS-1 USA NA ExoN1 P3pp19

FJ95

9407

SA

RS-

1 C

hina

Mon

A00

1

AY54

5916

SA

RS-

1 C

hina

NA

HC

-SZ-

266-

03

KF514420 SARS-1 USA NA ExoN1 c5P10-2009

MK

062179 SA

RS

-1 US

A H

um

Urb

ani icS

AR

S

FJ882951 SARS-1 USA Mus MA15-ExoN1 P3pp3

AY502931 SARS-1 Taiwan NA TW8

AY864805 SARS-1 C

hina NA B

J162

MT0

5049

3 SA

RS-

2 In

dia

Hum

166

AY394989 SARS-1 China NA HZS2-D

MT121215 SARS-2 China Hum SH01

MT1

6371

9 SA

RS-2

USA H

um W

A7-UW

4

MT184910 SARS-2 USA Hum CruiseA-23

MT198653 SARS-2 Spain Hum Valencia001

MK

062184 SA

RS

-1 US

A H

um

Urb

ani icS

AR

S-C

7-MA

AY559091 S

AR

S-1 S

ingapore NA

SinP

4

AY282752 SARS-1 HongKong NA Su10

MT1849

11 S

ARS-2 USA H

um Cru

iseA-24

LR75

7995

SA

RS

-2 C

hina

Hum

sea

food

-mar

ket

AY395000 SARS-1 China NA LC3

AY394991 SARS-1 China NA HZS2-Fc

3102X

S t aB a

nih

C 1-S

RA

S 318374JK

FJ882941 SARS-1 USA NA ExoN1 P3pp8

JF292903 SARS-1 USA Mus MA15-ExoN1 d4ym5

FJ882962 SARS-1 USA Mus MA15-ExoN1 P3pp10AY345988 SARS-1 HongKong NA AG03

8-3U

KH ta

B ani

hC 1-

SR

AS 345351

QG

AY3044

88 S

ARS-1 H

ongKong NA S

Z16

KF514396 SARS-1 USA NA WTic c3P10-2009

MT159722 SARS-2 USA Hum CruiseA-6

AY

283796 SA

RS

-1 Sin

gap

ore N

A S

in2679

MT159717 SARS-2 U

SA Hum C

ruiseA-1

KF514409 SARS-1 USA NA WTic c2P20-2009

KF514421 SARS-1 USA NA WTic c1 2P20-2010

GQ

1535

42 S

AR

S-1

Ch

ina

Bat

HK

U3-

7

AY394999 SARS-1 China NA LC2

JX163927 S

AR

S-1 U

SA

NA

Tor2-F

P1-10851

KF514388 SARS-1 USA NA WTic c1 5P20-2010

GQ

1535

47 S

AR

S-1

Ch

ina

Bat

HK

U3-

12

MT1

3504

4 SA

RS-

2 C

hina

Hum

235

HQ

890541 SA

RS

-1 US

A M

us MA

15 d2ym1

FJ882940 SARS-1 USA NA ExoN1 P3pp37

KF514414 SARS-1 USA NA ExoN1 c5P20-2009

FJ882938 SARS-1 USA NA wtic-MB

AA

N er

opa

gni

S 1-

SR

AS

4909

55Y

648

niS

MT126808 SARS-2 Brazil Hum SP02

AY502932 SARS-1 Taiwan NA TW9

KY

4171

47 S

AR

S-1

Ch

ina

Bat

Rs4

237

MN

9383

84 S

AR

S-2

Chi

na H

um S

Z-00

2a 2

020

AY394987 SARS-1 China NA HZS2-Fb

MT0

4033

3.1

Pan

golin

PC

oV G

X-P

4L

KF514419 SARS-1 USA NA WTic c1P10-2009

JF292916 SAR

S-1 USA M

us MA

15 d3om1

MG

7729

34 S

AR

S-1

Ch

ina

Bat

Co

VZ

XC

21

GQ

1535

39 S

AR

S-1

Ch

ina

Bat

HK

U3-

4

AY357075 SARS-1 NA NA PUMC02

KF514412 SARS-1 USA NA ExoN1 c13P20-2009

MN996530 SARS-2 China Hum WIV06

AY3044

86 S

ARS-1 H

ongKong NA S

Z3

AY362698 SARS-1 NA NA TWC2

MT192765 SARS-2 USA Hum PC00101P

KF

5699

96 S

AR

S-1

Ch

ina

Bat

LY

Ra1

1

KU

9736

92 S

AR

S-1

Chi

na B

at F

46

KF514413 SARS-1 USA NA WTic c1 6P20-2010

HQ890539 SARS-1 USA Mus MA15-ExoN1 d3om1

AY485277 SARS-1 NA NA Sino1-11

KF514392 SARS-1 USA NA WTic c1 4P20-2010

AY

714217 SA

RS

-1 US

A H

um

CD

C-200301157

HQ890530 SARS-1 USA Mus MA15-ExoN1 d2ym5

KF514391 SARS-1 USA NA ExoN1 c5 9P20-2010

MT159718 SARS-2 USA Hum CruiseA-2

AY394983 SARS-1 China NA HSZ2-A

MT0398

88 S

ARS-2 U

SA Hum

MA1

AY6868

64 S

ARS-1 N

A Mon B

039

AY395002 SARS-1 China NA LC5

KF514393 SARS-1 USA NA ExoN1 c5 10P20-2010

MT1

8833

9 SA

RS-2

USA H

um M

N3-M

DH3

FJ882936 SARS-1 USA NA wtic-MB P3pp2

MT0

4995

1 SARS-2

Chi

na H

um Y

unna

n-01

AY345986 SARS-1 HongKong NA AG01

KF514408 SARS-1 USA NA WTic c2P1-2009

DQ

6488

57 S

AR

S-1

NA

Bat

279

-200

5

MT1

0605

2 S

AR

S-2

US

A H

um C

A7

EU371560 SARS-1 NA NA BJ182a

AY278490 SARS-1 NA NA BJ03

MT184908 SARS-2 USA Hum CruiseA-21

MT066176 SARS-2 Taiwan Hum NTU02

JF292907 SAR

S-1 U

SA M

us MA

15 d2ym2

MT106053 SARS-2 USA H

um CA8

HQ890526 SARS-1 USA Mus MA15-ExoN1 d2ym1

FJ882950 SARS-1 USA NA ExoN1 P3pp60

HQ890531 SARS-1 USA Mus MA15-ExoN1 d4ym1

MT019530 SARS-2 China Hum WH-02-2019

KY

4171

49 S

AR

S-1

Chi

na B

at R

s425

5

MT066156 SARS-2 Italy Hum INMI1

AY56

8539

SA

RS-

1 C

hina

Hum

GZ0

401

LC529905 SARS-2 Japan Hum TKYE6182

MT093571 SARS-2 Sweden Hum 01

MT123293 SARS-2 China Hum IQTC03

MT159711 SARS-2 USA Hum CruiseA-13

EU371563 SARS-1 NA NA BJ182-8

AY395001 SARS-1 China NA LC4

GQ

1535

41 S

AR

S-1

Ch

ina

Bat

HK

U3-

6

KY

4171

48 S

AR

S-1

Ch

ina

Bat

Rs4

247

MT007544 SARS-2 Australia Hum VIC01

LR757998 SARS-2 China Hum seafood-market

AY502929 SARS-1 Taiwan NA TW6

MT0

4033

6.1

Pan

golin

PC

oV G

X-P

5E

AY54

5918

SA

RS-

1 C

hina

NA

HC

-GZ-

32-0

3

JF292908 SAR

S-1 USA M

us MA

15 d2ym3

AY61

3949

SARS-

1 Chi

na M

on P

C4-13

6

DQ

497008 SA

RS

-1 US

A M

us M

A-15

AY394986 SARS-1 C

hina NA H

SZ-Cb

MT184912 SARS-2 USA Hum CruiseA-25

JX163928 SAR

S-1 USA

NA Tor2-FP1-10895

AY39

4996

SA

RS

-1 N

A N

A Z

S-B

KF514404 SARS-1 USA NA WTic c1 9P20-2010

EU371559 SARS-1 NA NA ZJ02

KF514411 SARS-1 USA NA ExoN1 c13P10-2009

AY310120 S

AR

S-1 N

A N

A FR

A

AY348314 SARS-1 Taiwan NA TC3

AY

559090 SA

RS

-1 Singapore N

A S

inP3

KJ4

7381

1 S

AR

S-1

Ch

ina

Bat

JL

2012

AY54

5917

SAR

S-1

Chin

a NA

HC-

GZ-

81-0

3

KJ4

7381

5 S

AR

S-1

Ch

ina

Bat

GX

2013

MT1

2329

2 S

AR

S-2

Chi

na H

um IQ

TC04

AY

559083 SA

RS

-1 Sin

gap

ore N

A S

in3408

KC

8810

06 S

AR

S-1

Chi

na B

at R

s336

7

MT1

5282

4 SA

RS-

2 US

A Hu

m W

A2

MN996528 SARS-2 China Hum WIV04

MT1

6371

8 SA

RS-2

USA H

um W

A6-UW

3

AY39

0556

SA

RS-

1 C

hina

NA

GZ0

2

AY502928 SARS-1 Taiwan NA TW5

KF514410 SARS-1 USA NA ExoN1 c8P20-2009

MT07

2688

SARS-2

Nep

al H

um 6

1-TW

KY

4171

51 S

AR

S-1

Chi

na B

at R

s732

7

MT159713 SARS-2 USA Hum CruiseA-15

FJ882961 SA

RS

-1 US

A M

us MA

15 P3pp5

FJ882937 SARS-1 USA NA wtic-MB P3pp18

HQ890540 SARS-1 USA Mus MA15-ExoN1 d3om2

MT159710 SARS-2 USA Hum CruiseA-9

FJ882954 SARS-1 USA NA ExoN1 P3pp46

KJ4

7381

4S

AR

S-1

Ch

ina

Bat

Hu

B20

13

DQ

182595 SA

RS

-1 Ch

ina N

A Z

J0301

JX99

3988

SA

RS

-1 C

hin

a B

at C

p-Y

un

nan

2011

GU

553363 SAR

S-1 China H

um H

KU

-39849 HA

RR

OD

-00001

AY283797 SAR

S-1 Singapore N

A Sin2748

FJ882929 SARS-1 USA NA ExoN1 P3pp1

AY502925 SAR

S-1 Taiwan N

A TW2

FJ882959 SARS-1 USA Mus MA15-ExoN1 P3pp6

AY

351680 SA

RS

-1 NA

NA

ZM

Y-1

AY502924 SARS-1 Taiwan NA TW11

MK

062180 SA

RS

-1 US

A H

um

Urb

ani icS

AR

S-M

A

AY502926 SAR

S-1 Taiw

an NA

TW3

DQ

6488

56 S

AR

S-1

NA

Bat

273

-200

5

AY338175 SARS-1 NA NA TC2

MT159714 SARS-2 USA Hum CruiseA-16

AY321118 SARS-1 NA NA TW

C

AY5720

35 S

ARS-1 C

hina

Mon

civ

et01

0

FJ58

8686

SA

RS

-1 C

hina

Bat

Rs6

72-2

006

KY

4171

44 S

AR

S-1

Chi

na B

at R

s408

4

MT019533 SARS-2 China Hum WH-05

KY

4171

42 S

AR

S-1

Ch

ina

Bat

As6

526

FJ882933 SARS-1 USA NA wtic-MB P3pp6

DQ

0223

05 S

AR

S-1

Ch

ina

Bat

HK

U3-

1

AY

559086 SA

RS

-1 Sin

gap

ore N

A S

in849

AY283798 S

AR

S-1 S

ingapore NA

Sin2774

MT1

0605

4 S

AR

S-2

US

A H

um T

X1

MN

9965

32 S

AR

S-1

Chi

na B

at R

aTG

13

MT159716 SARS-2 USA Hum CruiseA-18

MT0

2088

1 SA

RS-

2 U

SA H

um W

A1-

F6

JF292917 SA

RS

-1 US

A M

us MA

15 d3om2

KY

4171

46 S

AR

S-1

Chi

na B

at R

s423

1

MT184907 SARS-2 USA Hum CruiseA-19

AY278488 SARS-1 NA NA BJ01

FJ882944 SARS-1 USA NA ExoN1 P3pp23

KF514406 SARS-1 USA NA ExoN1 c13P1-2009

AY278487 SARS-1 NA NA BJ02

KF514403 SARS-1 USA NA ExoN1 c5 1P20-2010

FJ882935 SARS-1 USA NA wtic-MB P3pp21

AY61

3947

SA

RS-

1 C

hina

NA

GZ0

402

AY57

2038

SA

RS-

1 C

hina

Mon

civ

et02

0

MT159706 SARS-2 USA Hum CruiseA-8

FJ882927 SARS-1 USA NA wtic-MB P1pp1

AY54

5914

SARS-1

Chi

na N

A HC-S

Z-79

-03

KF514390 SARS-1 USA NA ExoN1 c5 4P20-2010

NC 045512 SARS-2 China Hum Wuhan-Hu-1

AY864806 SARS-1 NA N

A BJ202

MT019532 SARS-2 China Hum WH-04-2019

GQ

1535

45 S

AR

S-1

Ch

ina

Bat

HK

U3-

10

JN854286 SARS-1 UK NA recHKU-39849

KF514395 SARS-1 USA NA ExoN1 c8P1-2009

AY

559097 SA

RS

-1 Sin

gap

ore N

A S

in3408L

AY559089 S

AR

S-1 S

ingapore NA

SinP

2

AY283795 S

AR

S-1 S

ingapore NA

Sin2677

AS

RA

S 78

0955

YA

N er

opa

gni

S 1-

V527

3ni

S

HQ

890542 SAR

S-1 USA M

us MA

15 d2om1

DQ640652 SARS-1 China Hum GDH-BJH01

FJ882949 SARS-1 USA NA wtic-MB P3pp23

AY54

5915

SARS-

1 Chi

na N

A HC-S

Z-DM

1-03

MN988668 SARS-2 China Hum WHU01

MT159719 SARS-2 USA Hum CruiseA-3

AY278491 SARS-1 China NA HKU-39849

MT184909 SARS-2 USA Hum CruiseA-22

KF514401 SARS-1 USA NA ExoN1 c5 5P20-2010

AY39

5003

SA

RS

-1 N

A N

A Z

S-C

FJ882947 SARS-1 USA NA wtic-MB P3pp7

MN99

4467

SARS-2

USA H

um C

A1

HQ

890544 SAR

S-1 USA

Mus M

A15 d2om

3

FJ882952 SA

RS

-1 US

A M

us MA

15 P3pp4

AY772062 SARS-1 NA Mus W

H20

MN988669 SARS-2 China Hum WHU02

MT019529 SARS-2 China Hum WH-01-2019

HQ

890545 SAR

S-1 USA

Mus M

A15 d2om

4

MT118835 SARS-2 USA Hum CA9

JX99

3987

SA

RS

-1 C

hin

a B

at R

p-S

haa

nxi

2011

FJ882934 SARS-1 USA NA wtic-MB P3pp29A

N 1-S

RA

S 240214Q

D1f

R taB

AY57

2034

SARS-1

Chi

na M

on c

ivet

007

JF292905 SARS-1 USA Mus MA15-ExoN1 d3om4

MG

7729

33 S

AR

S-1

Chi

na B

at C

oVZC

45

MN996531 SARS-2

China Hum W

IV07

AY559088 S

AR

S-1 S

ingapore NA

SinP

1

KF514407 SARS-1 USA NA ExoN1 c5 7P20-2010

FJ882943 SARS-1 USA Mus MA15-ExoN1

AY

559081 SA

RS

-1 Sin

gap

ore N

A S

in842

AY463059 SARS-1 NA N

A QXC1

DQ

0842

00 S

AR

S-1

Ch

ina

Bat

HK

U3-

3

MK

062181 SA

RS

-1 US

A H

um

Urb

ani icS

AR

S-C

3

AY5459

19 S

ARS-1 C

hina

NA C

FB-S

Z-94

-03

JF292918 SA

RS

-1 US

A M

us MA

15 d3om3

JF292914 SA

RS

-1 US

A M

us MA

15 d4ym4

AY39

4997

SA

RS

-1 N

A N

A Z

S-A

MT159708 SARS-2 USA Hum CruiseA-11

FJ882956 SARS-1 USA NA ExoN1 P3pp53

AY502927 SAR

S-1 Taiwan N

A TW

4

MT0

4425

7 SARS-2

USA H

um IL

2

KF514422 SARS-1 USA NA WTic c1 3P20-2010 N

C 004718 S

AR

S-1 C

anad

a Hu

m To

r2

AY

559093 SA

RS

-1 Sin

gap

ore N

A S

in845

AY

559085 SA

RS

-1 Sin

gap

ore N

A S

in848

MN996527 SARS-2 China H

um WIV02

MT1

6372

0 SA

RS-

2 U

SA H

um W

A8-

UW

5

JF292910 SA

RS

-1 US

A M

us MA

15 d2ym5

AY461660 S

AR

S-1 R

ussia NA

SoD

AY394992 SARS-1 China NA HZS2-CKF514415 SARS-1 USA NA W

Tic c1 7P20-2010

KY

4171

50 S

AR

S-1

Chi

na B

at R

s487

4

MT123291 SARS-2 China Hum IQTC02

HQ890528 SARS-1 USA Mus MA15-ExoN1 d2ym3

AY278554 SARS-1 H

ongKong NA W

1

DQ

0841

99 S

AR

S-1

Ch

ina

Bat

HK

U3-

2

JQ316196 SARS-1 UK NA HKU-39849 UO

B

MT020781 SARS-2 Finland Hum Jan29

AY595412 SARS-1 China NA LLJ-2004

HQ890537 SARS-1 USA Mus MA15-ExoN1 d2om4

MN994468 SARS-2 USA Hum CA2

MT0270

63 S

ARS-2 U

SA Hum

CA4

JX162087 SARS-1 USA NA ExoN1 c5P10

JF292921 SARS-1 USA NA wtic-MB c1P1

AY3949

95 S

ARS-1 C

hina N

A HSZ-C

c

MT192773 SARS-2 VietNam H

um nCoV-19-02S

KF514397 SARS-1 USA NA WTic c2P10-2009

KC

8810

05 S

AR

S-1

Chi

na B

at R

sSH

C01

4

KY

4171

52 S

AR

S-1

Chi

na B

at R

s940

1

GQ

1535

46 S

AR

S-1

Ch

ina

Bat

HK

U3-

11

AY3949

94 S

ARS-1 C

hina N

A HSZ-B

c

MT2

2661

0 SARS-2

Chi

na H

um K

MS1

FJ882928 SARS-1 USA NA ExoN1 P1pp1

AY

297028 SA

RS

-1 NA

NA

ZJ01

FJ882963 S

AR

S-1 U

SA

Hu

m P

2

AY394850 SARS-1 NA NA WHU

GU

553364 SAR

S-1 China H

um H

KU

-39849 HA

RR

OD

-00002

AY

274119 SA

RS

-1 Can

ada H

um

Tor2

AY463060 SARS-1NA N

A QXC2

KF514402 SARS-1 USA NA ExoN1 c5 6P20-2010

AS

RA

S 28

0955

YA

N er

opa

gni

S 1-

258

niS

MT192759 SARS-2 Taiwan Hum CGMH-CGU-01

KF3

6745

7 S

AR

S-1

Chi

na B

at W

IV1

JF292902 SARS-1 USA Mus MA15-ExoN1 d4ym4

AY

559092 SA

RS

-1 Singapore N

A S

inP5

FJ882926 SARS-1 USA NA ExoN1

MT1597

15 S

ARS-2 U

SA Hum

Cru

iseA-1

7

KF514389 SARS-1 USA NA ExoN1 c8P10-2009

AY

394978 SA

RS

-1 NA

NA

GZ-B

MT039873 SARS-2 China Hum HZ-1

JF292909 SA

RS

-1 US

A M

us MA

15 d2ym4

AY61

3950

SARS-

1 Chi

na M

on P

C4-22

7

MT027064 SARS-2 USA Hum CA5

FJ882960 SARS-1 USA NA ExoN1 P3pp34MT012098 SARS-2 India Hum 29

MT0270

62 S

ARS-2 U

SA Hum

CA3

MT093631 SARS-2 China Hum WH-09

AY357076 SARS-1 NA NA PUMC03

MT159721 SARS-2 USA Hum CruiseA-5

AY502923 SARS-1 Taiwan NA TW10

FJ882948 SA

RS

-1 US

A M

us MA

15 P3pp3

MT188340 SARS-2 USA Hum MN2-MDH2

AS

RA

S 48

0955

YA

N er

opa

gni

S 1-

V567

3ni

S

MT159707 SARS-2 USA Hum CruiseA-10

MN

9853

25 S

AR

S-2

USA

Hum

WA

1

MT0

2088

0 SA

RS-

2 U

SA H

um W

A1-

A12

HQ890538 SARS-1 USA Mus MA15-ExoN1 d2om5

MT159709 SARS-2 USA Hum CruiseA-12

AY68

6863

SARS-1

NA M

on A

022

AY5155

12 S

ARS-1 C

hina

Mon H

C-SZ-6

1-03

FJ882931 SARS-1 USA NA ExoN1 P3pp12

AY

559096 SA

RS

-1 Sin

gap

ore N

A S

in850

MN

9752

62 S

AR

S-2

Chi

na H

um S

Z-00

5b 2

020

MN

9974

09 S

AR

S-2

US

A H

um A

Z1

EU371561 SARS-1 NA NA BJ182b

JF292912 SAR

S-1 USA M

us MA

15 d4ym2

MT163716 SARS-2 USA Hum WA3-UW1

MT1

3504

2 SA

RS-

2 C

hina

Hum

231

MN98

8713

SARS-2

USA H

um IL

1

MT039890 SARS-2 SouthKorea Hum SNU01

FJ882939 SARS-1 USA NA wtic-MB P3pp16

MT192772 SARS-2 VietN

am Hum nCoV-19-01S

JF292906 SARS-1 USA Mus MA15-ExoN1 d3om5

HQ890532 SARS-1 USA Mus MA15-ExoN1 d4ym2

KF514405 SARS-1 USA NA ExoN1 c5 2P20-2010

3102Be

H t aB a

nih

C 1-S

RA

S 218374JK

AY394993 SARS-1 NA NA HGZ8L2

AY313906 SARS-1 China NA GD69

KY

4171

43 S

AR

S-1

Chi

na B

at R

s408

1

MT019531 SARS-2 China Hum WH-03-2019

LR757996 SARS-2 China Hum seafood-market

MT1

8834

1 SARS-2

USA H

um M

N1-M

DH1

MT1

3504

3 S

AR

S-2

Chi

na H

um 2

33

KF514394 SARS-1 USA NA WTic c1P20-2009

MT1

3504

1 SA

RS-

2 C

hina

Hum

105

MT1

6371

7 SA

RS-2

USA H

um W

A4-UW

2

FJ882930 SARS-1 USA NA ExoN1

FJ882957 S

AR

S-1 U

SA

Mu

s MA

15

JF292904 SARS-1 USA Mus MA15-ExoN1 d3om3

HQ890534 SARS-1 USA Mus MA15-ExoN1 d2om1

KP

8868

09 S

AR

S-1

Ch

ina

Bat

YN

LF

34C

AY338174 SARS-1 NA NA TC1

HQ

890543 SA

RS

-1 US

A M

us MA

15 d2om2

AY508724 SARS-1 NA NA NS-1

AY304495 SARS-1 HongKong NA GZ50

MT123290 SARS-2 China Hum IQTC01

KY

3524

07 S

AR

S-1

Ken

ya B

at K

Y72

MK

062183 SA

RS

-1 US

A H

um

Urb

ani icS

AR

S-C

7

JX163923 S

AR

S-1 U

SA

NA

Tor2-FP1-10912

JF292919 SAR

S-1 USA M

us MA

15 d3om4

MT039887 SARS-2 USA Hum WI1

AY27

8489

SA

RS-

1 N

A N

A G

D01

JX163924 S

AR

S-1 U

SA

NA

Tor2-FP1-10851

FJ882945 SA

RS

-1 US

A M

us MA

15 P3pp6

KP

8868

08 S

AR

S-1

Ch

ina

Bat

YN

LF

31C

JF292913 SA

RS

-1 US

A M

us MA

15 d4ym3

MT0

4033

4.1

Pan

golin

PC

oV G

X-P

1E

AY654624 SARS-1 NA NA TJF

AY61

3948

SARS-

1 Chi

na M

on P

C4-13

JX163926 S

AR

S-1 U

SA

NA

Tor2-F

P1-10912

KF514398 SARS-1 USA NA WTic c1 10P20-2010

FJ882932 SARS-1 USA NA wtic-MB P3pp14

KJ4

7381

6 S

AR

S-1

Chi

na B

at Y

N20

13

DQ

0716

15 S

AR

S-1

Ch

ina

Bat

Rp

3

GQ

1535

40 S

AR

S-1

Ch

ina

Bat

HK

U3-

5

AY

559095 SA

RS

-1 Sin

gap

ore N

A S

in847

FJ882958 SA

RS

-1 US

A M

us MA

15 P3pp

7

MT159705 SARS-2 USA Hum CruiseA-7

KF514418 SARS-1 USA NA WTic c3P1-2009

MT159712 SARS-2 USA Hum CruiseA-14

HQ

890546 SAR

S-1 USA

Mus M

A15 d2om

5

JX163925 S

AR

S-1 U

SA

NA

Tor2-F

P1-10895

0.08

0.16

0.24

0.32

0.4

0.48

0.56

0.64

0.72

0.04

0.12

0.2

0.28

0.36

0.44

0.52

0.6

0.68

Tree scale: 0.1

Taxonomy

SARS-CoV-2

SARS-CoV

Pangolin CoV

Host

Human

Bat

Pangolin

Monkey

Civets

Mouse

Pig

Unknown

Figure 1 Maximum Likelihood (ML) phylogeny representing relationship amongst 312 SARS-CoV, 103 SARS-CoV-2, and five Pangolin CoVstrains. The whole-genome sequences of 420 isolates were aligned using MUSCLE and stripped to include the highly conserved alignments across allstrains. The final alignment was subjected to RAxML to generate the ML phylogeny utilizing the GTRGAMMA model of nucleotide substitutionwith 100 bootstrap replicates. The phylogeny is depicted with branch length consideration. The inner-circle represents the taxonomy of all strains(depicting SARS-CoV, SARS-CoV-2 and Pangolin CoV). The outermost circle represents the respective host of each strain. Inner Blue and Grayalternative dashed lines represent an internal tree scale with a branch length increment of 0.04 from inside to outside.

Full-size DOI: 10.7717/peerj.9576/fig-1

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 14/31

Page 15: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

We also examined the ANI scores of the above-identified closest strains of SARS-CoV-2with all SARS-CoV, SARS-CoV-2 and Pangolin CoV genomic sequences used in this study(Table S2). We found that the Bat RaTG13 strain is 95.97–96.12% identical whencompared to all SARS-CoV-2 genomes under study. Amongst the SARS-CoV strains, BatCoVZC45 and Bat CoVZXC21 are its closest neighbors with 89% genome identity,indicating a common ancestor in bat coronaviruses, whereas, with Pangolin CoV, theyshare lowest (~86%) nucleotide identity. Bat CoVZC45 and Bat CoVZXC21 are ~97%identical to each other, whereas their identity with SARS-CoV-2 (~89%) is higher thanthat with SARS-CoV (86–89%) and lowest with Pangolin CoV (~85%). Similarly,Pangolin CoVs share maximum identity with Bat RaTG13 SARS-CoV (86.50%) followedby SARS-CoV-2 (86.30–86.38%) and lowest with other SARS-CoV. Therefore, we canargue that as SARS-CoV-2, Bat RaTG13 and Pangolin CoV have considerable genomesimilarity, they possibly diverged from a common ancestor and SARS-CoV-2 was latertransmitted to humans through recombination and transformation events via an unknownintermediate host.

Phylogeny relationship at SARS-CoV-2 levelSARS-CoV-2 strains, like other viruses, have a high mutation rate for better adaptabilityand survival. Therefore, one of our aims was to understand the diversity amongSARS-CoV-2 isolates. Whole-genome sequences of 167 SARS-CoV-2 were aligned usingMUSCLE (alignment length 29,950 nucleotides) and the conserved sites of length 29,725were used to generate a maximum-likelihood phylogeny using RAxML (Fig. 2; Fig. S4).Irrespective of high similarity amongst all (>99.90% identity), several subclades can beidentified from the phylogeny based on their distance from each other, depicting theirsubtypes. Our data has 97 isolates from the USA, 46 from China, five from Japan, fourfrom Spain, three from Taiwan, two from Vietnam and India, one isolate each fromSweden, South Korea, Pakistan, Nepal, Italy, Finland, Brazil and Australia. From thegenome sequence data of 15 distinct geographical locations, we were able to identify a fewphylogenetic clusters depicting country-specific patterns. We identified multiple clustersper country suggesting the existence of multiple subtypes. The KMS1 isolate fromChina was found to be the most distinct one, followed by WA-UW230, WA-UW194,WA-UW218 isolates from the USA. The Wuhan Hu-1 isolate shares a sister cladewith the USA Cruise samples, indicative of high similarity. Isolates sampled at theUniversity of Washington (WA, USA) form two separate subclades within two differentclades in the phylogeny, suggesting that even amongst people residing in a limitedarea in the state of Washington, multiple strains exist and are causing COVID-19.Furthermore, both the WA subclades have Valencia-Spain isolates as the closestbranch/subclade, suggesting that at least two individuals from these two locationsindependently encountered each other and transmitted different subtypes of the virus.This study also included two distinct genomes sampled from India which weredistributed by the phylogeny into two separate clusters signifying the existence ofdifferent subtypes.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 15/31

Page 16: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

As strains from diverse locations continue to be sequenced and analyzed, more reliablecountry-specific patterns showing relatedness can be obtained. Similar to several cuttingedge studies (Woo et al., 2010; Fuk-Woo Chan et al., 2020; Zhang, Wu & Zhang, 2020;Rehman et al., 2020), this study also provides an example of how phylogenetic analysiscan help generate clusters of identical strains, further enabling the identification ofsignature mutations amongst those diverse groups.

MT093631 China WH-09

MT246469 USA WA-UW212

NC 045512 China Wuhan-Hu-1

MT121215 China SH01 MT246479 USA WA-UW222

MT233523 Spain Valencia8

MT184911 U

SA CruiseA-24

MT253700 China HZ-4

81M

T039873 China HZ-1

MT027064 U

SACA

5

MT2

4644

9 U

SA W

A-U

W19

2

MT184913 USA CruiseA-26

MT246478 USA WA-UW221

MT246486 USA WA-UW229

MT159709 USA CruiseA-12

MT253697 China HZ-178

MT253708 China H

Z-79

MT159719 USA CruiseA-3

MT246476 USA WA-UW219

MN996531 China W

IV07

MT251975 USA WA-UW239

MT253703 China HZ-551

MT246667 USA FDAARGOS 983

MT246488 USA WA-UW231

MN988669 China WHU02

MT253698 China H

Z-185

MT2

5197

3 U

SA W

A-U

W23

7

MT066176 Taiwan NTU02

MT135042 China 2

31

MT251980 USA WA-UW242

MT019529 China IPBCAM

S-WH-01

LC529905 Japan TKYE6182

MT2

5197

9 U

SA W

A-U

W24

1

MT2

5370

6 Ch

ina

HZ-

62

MT251978 USA WA-UW235

MT246473 USA WA-UW216

MT039888 U

SA MA1

MT184907 USA CruiseA-19

MT246461 USA WA-UW204

MT253701 China HZ-48

MT039887 USA WI1

MT163717 USA WA4-UW2

MT159712 USA CruiseA-14

MT123291 China IQTC02

MT2

4645

0 U

SA W

A-U

W19

3

MT019533 China IPBCA

MS-W

H-05

MT2

2661

0 Ch

ina

KMS1

20P

S li z

arB

8086

21T

M

MT123290 China IQTC01

MT246474 USA WA-UW217

MT066156 Italy IN

MI1

MT240479 Pakistan Gilgit1

MN

994468 USA CA

2

MT019531 China IPBCAMS-WH-03

MT123293 China IQTC03

LC528233 Japan 19-027

LR757998 China Wuhan seafood m

arket

MT184909 U

SA CruiseA-22

MT253696 China HZ-162

MT0

6617

5 Ta

iwan

NTU

01

MT246466 USA WA-UW209

LC534419 Ja

pan 19-4

37

MT246452 USA WA-UW195

MT192772 VietNam 01S

MN996529 China WIV05

791

WU-

AW

AS

U 45

4642

TM

MN

9752

62 C

hina

HKU

-SZ-

005b

MN9

9740

9 US

A AZ

1

MT246477 USA WA-UW220

MT118835 USA CA9

MN996530 China WIV06

MT0

4425

7 U

SA IL

2

MT027062 U

SA CA3

MT163718 USA WA6-UW3

MT246471 USA WA-UW214

MT019532 China IPBCAMS-WH-04

MT198652 Spain Valencia003

MT246464 USA WA-UW207

MT152824 USA WA2M

T159713 USA CruiseA-15

MT246489 USA WA-UW232

MT253705 China HZ-60 M

T027063 USA CA4

MT251977 USA WA-UW234

MT163716 U

SA WA

3-UW

1

MT2

4648

1 US

A W

A-UW

224

MT188340 USA M

N2-MDH2

MT2

3352

2 Spain

Valen

cia7

MN

9944

67 U

SA C

A1

MT012098 India 29

MT159711 USA CruiseA-13

MT159720 USA CruiseA-4

MN

9887

13U

SA IL

1

MT246475 USA WA-UW218

MT163719 USA WA7-UW4

MT159722 USA CruiseA-6

MT246457 USA WA-UW200

MT233526 USA DAARGOS 983

MT2

5197

6 U

SA W

A-U

W24

0

MN996528 China WIV04

MT2

4648

4 U

SA W

A-U

W22

7

MT246459 USA WA-UW202

MT159707 USA CruiseA-10

MT2

5370

9 Ch

ina

HZ-

90

MT2

4648

0 USA

WA-

UW22

3

774-Z

H ani

hC 996352

TM

MT159705 USA CruiseA-7

MT1

35043 Chin

a 233

MT0

4995

1 Ch

ina

Yunn

an-0

1

LR757996 China Wuhan seafood market

MT184908 USA CruiseA-21

9-Aesi

urC

AS

U 017951T

M

MT039890 SouthKorea SN

U01

MT2

4646

7 US

A W

A-UW

210

LC534418 Japan 19-031

MT0

5049

3 In

dia 1

66

MT253702 China HZ-49

MT2

4647

0 USA

WA-U

W21

3

MT188339 USA MN3-MDH3

MT020781 Finland 29 Jan

MT007544 Australia VIC01

MT246472 USA WA-UW215

MT253710 China HZ-91

MT2

4645

3 U

SA W

A-U

W19

6

MT159721 USA CruiseA-5

MT246462 USA WA-UW205

MT1

9276

5 USA

PC0

0101

P

MT159708 USA CruiseA-11

MT2

4645

1 USA

WA-U

W19

4

MT159706 USA CruiseA-8

MT192773 VietNam 02S

MT020881 USA WA1-F6

MT1

2329

2 Ch

ina

IQTC

04

MT072688 N

epal 61-TW

MT246455 USA WA-UW198

MT192759 Taiwan CGM

H-CGU-01M

N93

8384

Chi

na H

KU-S

Z-00

2a

MT2

4646

0 US

A W

A-UW

203

MT246482 USA WA-UW225

MT159718 U

SA CruiseA-2

MT1

3504

4 Chin

a 235

MT1

0605

4 US

A TX

1

MT253704 China HZ-576

MT044258 USA CA6M

T159717 USA CruiseA-1

MN996527 China W

IV02

MT1

5971

5 U

SA C

ruis

eA-1

7

MN988668 China WHU01

LC528232 Japan 19-020

MT159714 USA CruiseA-16

MT159716 USA CruiseA-18

MT233519 Spain Valencia5

MN985325 USA WA1

MT106053 USA CA8

MT1

0605

2 U

SA C

A7

MT020880 USA WA1-A12

MT2

5197

4 US

A W

A-UW

238

MT253707 China HZ-638

MT184910 USA CruiseA-23

MT1

3504

1 Chi

na 10

5

MT188341 USA MN1-MDH1

MT2

4649

0 U

SA W

A-U

W23

3

LR75

7995

Chi

na W

uhan

seaf

ood

mar

ket

MT2

4648

7 U

SA W

A-U

W23

0

MT093571 Sw

eden 01

MT019530 China IPBCAMS-W

H-02

MT184912 USA CruiseA-25

0.00005

0.00015

0.00025

0.00035

0.00045

0.00055

0.00065

0.00075

0.00085

0.00095

Tree scale: 0.001

Geo Location:

China

USA

India

Spain

Vietnam

Pakistan

Sweden

Japan

Italy

Brazil

Figure 2 ML phylogeny representing relationship amongst 167 SARS-CoV-2 strains. The whole-genome sequences of 167 isolates were alignedusing MUSCLE, stripped to include only conserved alignments, and subjected to RAxML to generate the ML phylogeny utilizing the GTRGAMMAmodel of nucleotide substitution with 100 bootstrap replicates. The phylogeny is depicted with branch length consideration. The inner circlerepresents the respective geo-location of each strain. Inner Blue and Gray alternative dashed lines represent an internal tree scale with a branchlength increment of 0.00005 from inside to outside. Full-size DOI: 10.7717/peerj.9576/fig-2

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 16/31

Page 17: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Mutations are continuously diversifying the Human SARS-CoV-2strainsLike the previous section, the whole genome sequence alignment of 167 SARS-CoV-2isolates were used for this study. As anticipated, ~100 nucleotides at both N andC-terminals were highly diverse and, therefore, changes in these sites were not consideredin this study. For the rest of the alignment, the percentage of nucleotide occurrence at eachsite (along with gap and other non-standard nucleotides) was calculated using theSARS-CoV-2 Wuhan Hu-1 genome as the reference (Fig. 3). The mutational probabilitywas classified into four categories based on random distribution patterns: high (>19%variation), medium (4–19% variation), low (2–3% variation) and very low (<2% variation).Overall, 415 mutation sites were identified, among which non-standard nucleotideswere present in the genome at 125 sites, which have been ignored in this study.We ultimately selected 290 sites with a confirmed variation. Amongst these, we found43 sites with high, medium and low mutation probabilities within 9 out of 12protein-coding regions in SARS-CoV-2 genomes, which also included the genes (S, M andN), identified as core proteins across genus Betacoronavirus. Amongst all mutation sites,we found 21 sites within the gene encoding Spike protein, 18 sites in the gene encodingNucleocapsid protein and three sites in gene encoding M protein, suggesting that eventhese highly conserved protein-coding regions (core proteins as later identified viapan-genome analysis) are capable of mutating. Also, we recognized 142 mutational sites inorf1a, 50 in orf1b, 6 in orf3a, 2 in the envelope protein-coding gene, 4 in orf6, and orf8, 3 inorf7b and 1 in orf10. We also identified 36 sites within non-coding regions.

Amongst sites with high mutation probability, C8,782T in orf1a showed ~35% variationfrom C to T, suggesting a transition mutation. Likewise, T28,144C in orf8 also has~35% transition mutation variations. We found three hotspot mutational sites withinthe orf1b region of orf1ab, which has three transitions, two C to T (C17,747T andC18,060T; ~20% variation) and one A to G (A17,858G; 19% variation). In the spikeprotein-coding region, we were able to identify one transition site (A23,403G nucleotide)with medium (11%) mutational probability. Overall, we were able to identify nine hotspotswith more than two consecutive probable mutational nucleotides in vicinity (197–210(non-coding), 508–522 (orf1a), 686–694 (orf1a), 4,879–4,881 (orf1a), 20,298–20,300(orf1b), 21,385–21,389 (orf1b), 21,991–21,993 (S), 28,878–28,883 (N), 29,750–29,759(non-coding)). Most of these hotspot sites showed ~1–2% mutational probabilities andnone of these have high or medium mutational probability as discussed in Fig. 3.We believe these genomic locations are the probable sites of evolutionary divergence, andtherefore govern the evolutionary capabilities of these viruses.

We also analyzed transition and transversion possibilities within all these mutationalcombinations. Out of the 290 identified mutational sites, 244 had a nucleotide substitution,distributed into 158 transitions and 86 transversions. Transition events are known to bemore frequent and less likely to cause a change at the protein level as compared totransversions (Sanjuán et al., 2010). Most transition events are silent mutations, whiletransversion events may cause change at the protein level and therefore impact the

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 17/31

Page 18: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

pathogenesis of the disease (Lyons & Lauring, 2017). Therefore, we suggest that amongstSARS-CoV-2 genomes, transversion events, that are occurring in significant numbers,would provide more evolutionary diversity and possibly lead to divergence, as compared totransition events.

Site 29095

Site 11764

Sit

e 12

773

Site 22033

Site 01059

Site 19269

Sit

e 00

119

1942

1 et

iS

Site 00832

Sit

e 00

198

Site 06636

Site 27226

Site 05716

Site

160

75

Site 22984

Site 09157

Sit

e 00

104

Site 19175

Site 23952

Site 07016

Site

002

06Site 29752

Site 1

8060

Site 12115

Site

154

18

Site 22988

Site 18

603

Site

002

10

Site 29774

Site 21991

Site 09924

Sit

e 12

578

Site 29635

Site 21707

Site 27806

Site 03518

Site 25563

Site 01469

Site 29742

Site 09159

Site 20936

Site 05062

Site 29751

Site 28863

Site 06026

Site 06996

Site 09511

Site 02277

Site 04402

Site 29757

Site 1

7423

Site 27525

Site 00692

Site

132

25Site 00833

Site 26326

Sit

e 12

600

Site 18975

Site 03259

Site 09474

Site

002

05

Site 06031

Site 29756

Site 00693

Site 26936

Site 06819

Site 12355

Site

005

09

Site

168

87

Site 25810

Site

003

32

43521 etiS

Site 00

522

Site 02971

Site 12160

Site 09274

Site 00694

Site 09561

Site 21575

Site 28657

Site 00691

Site 11750

Site 08768

Site 11207

Site 09034

Site 21316

Site 01348

Site 12213

Site

146

57

Site

005

14

Site 03373

Site 11956

Site 0

0517

Site

155

97

Site 29754

Site

148

05

Site 07866Site 08987

Sit

e 00

124

Site 12208

Site 02632

Site 04880

Site 03992

Site 24351

100 etiS

11

Site 28916

Site 19065

Site 05784

Site 18

814

Sit

e 12

582

Site 01545

Site 11233

Site 28409

Site

170

00

Sit

e 00

120

Site 03738

Sit

e 12

572

Site 00

654

Site

005

10

Site 01691

Site 05572

Site

156

07

Site 20031

Site 22303

Site 11161

Site 11752

Site 27327

Sit

e 00

197

Site 00690

Site 04879

Site

132

26

Site 29254

100 etiS

12

Site08388

Site 09477

Site 26729

Site

003

54

Site 25775Site 25494

Site

172

47

Site 06968

Site 22104

Sit

e 12

660

Site 28854

Site 03411

Site

164

67

Site 18877

No

mu

tation

Site 21644

Site 28881

Site 29230

3742

1 et

iS

Site 06035

Site

005

16

Site 21387

Site 27925

Site 29736

Site 21784

Site 04288

Site 21385

Site 20299

Site 28792

Site 24034

Site 21386

Site 1

7747

Site 21147

Site 10319

Site 22224

Site 20980

Site 29753

Site 06695

Site 12378

Site 12467

Site 20823

Site 09180

Site 21389

Site 11320

Site 26354

Site 20298

Site 29758

Site 22785

Site 10036

Site 1

7474

Site 26144

Site

144

08

Site 08782

Site 10232Site

178

58

Site 29755

Site 28077

Site 27964

Site 29750

Site 18

149

Site

153

24

Site 01548

Sit

e 12

793

Site 00689

Sit

e 00

201

Site 28144Site 27807

Site 05052

Site 0

0521

Site

174

10 Site 09491

Site 00686

Site

169

12

Sit

e 13

072

Site

173

76

Site 12202

Site 12464

Site 28882

Site 29301

Site 07479

Site 27722

Site 29553

Site 02717

Site 20281

Site 12041

Site 27276

Site 08001

Site 29303

Site 04881

Sit

e 00

200

Site 04307

Site

005

15

Site

004

90

Site 01102

Sit

e 00

140

Site

005

08

Site 19610

Site 10507

Site 23185

Site 11410

Site 0

0519

Site 03778

Site 29759

Site 18

689

Site 00884

Site 28883

Site

002

08

Site 19018

Site

002

21

Site 27046

Sit

e 00

186

Site 03099

Site

142

29

Site 24325

Site 28878

Site 27384

Site 00687

Site 02269

Site 04642

Site 03037

Sit

e 12

685

Site

005

12

Site 22432S

ite 0

0202

Site 03177

Site 29527

Site 21992

Site 00688

Site 06501

41521 etiS

Site

168

77

Site 23403

Site 25979

Site

173

73

Site 07105

Site

002

41Site 29140

Site 0

0518

Site 05845

Site 28378

Site 00

614

Site 06310

Site 05084

Site 01385

Site 21137

Site 02446

Site 20300Site

005

20

Site

159

27

Site

002

54

Site 09534

Site 02091

Site 21993

Site 01397

Site

002

04

Site 11083

Site

005

11Si

te 0

0513

A

C

G

T

Others

GapMutational Probability

High

Medium

Low

Very Low

orf1a

orf1b

S

orf3a

E

orf6

M

orf7b

orf8

N

orf10

Figure 3 Mutation sites and hotspots distribution amongst 167 SARS-CoV-2 genomes as compared to Wuhan Hu-1. MUSCLE-basedalignment was used to analyze the distribution of all nucleotides (including gap and other nucleotides) at each alignment site and the relativepercentage of each nucleotide is represented here in a circular manner respectively for each nucleotide. The innermost circle represents the respectiveorf and non-coding DNA according to the location as mentioned in the third circle from inside. The second circle represents the mutationalprobability score of each locus distributed in high, medium, low and very low categories. Full-size DOI: 10.7717/peerj.9576/fig-3

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 18/31

Page 19: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Evolutionary relationship amongst SARS-CoV-2, Bat CoV andPangolin CoVTo re-examine the closest hosts of these viruses, we performed homology studies on smallgenome fragments (100 nucleotides) to identify their close relatives. We segmented each ofthe SARS-CoV-2, Pangolin CoV, SARS-CoV and MERS genome (reference genome ofeach category) into 100 nucleotide fragments, identified their closest homologs in thevirus database, filtered the strains belonging to its taxonomy, selected top 50 hits persegment, calculated the host CoV distribution and represented it as a heatmap for eachsegment/strain along with the closest (topmost) hit per segment/strain (Fig. 4).MERS-CoV is already known to originate from the bat, however, the infection spread viacamels either directly/indirectly (Azhar et al., 2014). On excluding Camel CoVs fromthe data, we found several segments showing maximum similarity representation in BatCoVs, followed by Hedgehog and Human CoV strains, however, most of the parts wereuniquely present in MERS. As expected, when Camel CoV strains were included, they werethe closest best hit as indicated in the “MERS Closest” category (Fig. 4).

When we further investigated the same patterns in the SARS-CoV genome, weidentified the Bat CoV as the closest strain for most of the reference segments, followedby SARS-CoV-2 and Pangolin CoV. In the closest category, we found Bat CoV as theclosest suggesting their evolutionary origin. However, many segments do not showany closeness with any strain, indicating their uniqueness in the reference genome.We performed the same analysis on the Pangolin CoV strain GX-P5E and found thatseveral segments have close homology-distribution within SARS-CoV and SARS-CoV-2genomes. Interestingly, as visible in the “Pangolin CoV Closest” category, several hits showmaximum similarity with Bat SARS-CoV, followed by SARS-CoV-2 and a few HumanCoVs.

Likewise, SARS-CoV-2 genome fragments show maximum homology representation inBat CoV followed by Pangolin CoV. In the “closest category”, Bat CoV indisputably standsout as the closest host, which is corroborated by whole-genome phylogeny and thepercentage identity matrix studies. However, the presence of several Pangolin CoVhomologous segments in this analysis indicates that certain specific segments in theSARS-CoV-2 genome share a closer relationship with Pangolin CoV strains as comparedto Bat CoVs.

Overall, our analysis indicates that the SARS-CoV-2 genome has a close relationshipwith Bat CoV and Pangolin CoV, along with the presence of a few unique segments.This reasserts that SARS-CoV-2 strains might have evolved from a common ancestorof Bat SARS-CoV and Pangolin CoV strains and during evolution, several segments in theSARS-CoV-2 genome might have diverged as compared to Bat SARS-CoV and PangolinCoV. The recognized unique genomic segments in this study may be used as potentialtargets for diagnostic kits.

Pan-genome (Core, accessory and unique gene) analysisRegardless of their approximately similar genome size, the members of Betacoronavirusharbor huge diversity (5–14 CDS) in the number of protein-coding regions, as per the

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 19/31

Page 20: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

available NCBI genome annotations. SARS-CoV and SARS-CoV-2, belonging to subgenusSarbecovirus, have 14 and 12 protein-coding regions respectively, revealing the diversitywithin the same subgenus. Similarly, Pangolin CoV (studied strain: GX-P5E), which alsobelongs to subgenus Sarbecovirus, has only nine protein-coding regions. To identify the

1930

1 19

400

7601 7700

25901 26000

1101

120

0

21001 21100

26101 26200

24101 24200

5901 6000

11001 11100

12801 12900

1830

1 18

400

25501 25600

2901

300

0

5401 5500

20501 20600

21801 21900

2030

1 204

00

10901 11000

701

800

28901 29000

23501 23600

26901 27000

23001 23100

6201 6300

3601

370

0

24601 24700

10101 10200

14301 14400

3001

310

0

8101 820016

401

1650

0

9701 9800

13701 13800

3801

390

0

25101 25200

12301 12400

27101 27200

8001 8100

9801 9900

2701

280

0

1630

1 16

400

6701 6800

28801 2890016

901

1700

0

9001 9100

21201 21300

27301 27400

10201 10300

1790

1 18

000

14901 15000

25201 25300

12501 12600

26201 26300

23901 24000

3901

400

0

2801

290

0

6401 6500

22801 22900

28201 28300

7501 7600

12601 12700

2301

240

0

30101 30200

29201 29300

5801 5900

10501 10600

8401 8500

23101 23200

1920

1 19

300

2201

230

0

5201 5300

27001 27100

21301 21400

6301 6400

27601 27700

29501 29600

1880

1 18

900

1760

1 17

700

20401 20500

14201 14300

25601 25700

27801 27900

1710

1 17

200

29701 29800

22701 22800

30001 30100

901

1000

1960

1 19

700

29801 29900

1970

1 19

800

10401 10500

21901 22000

13901 14000

11801 11900

26501 26600

15501 15600

4101

4200

25701 25800

4701 4800

13401 13500

1670

1 16

800

11401 11500

14101 14200

801

900

2501

260

0

4201

4300

8301 8400

24001 24100

29601 29700

1400114100

26801 26900

6801 6900

7301 74007201 7300

12701 12800

10001 10100

28101 28200

1890

1 19

000

24801 24900

15301 15400

29001 29100

4001

4100

8901 9000

1910

1 19

200

26401 26500

15401 15500

10301 10400

9601 9700

3201

330

0

1701

180

0

14801 14900

15101 15200

1860

1 18

700

22601 22700

1850

1 18

600

1750

1 17

600

7901 8000

4801 4900

22001 22100

7001 7100

7801 7900

14401 14500

12001 12100

7701 7800

5701 5800

1990

1 20

000

21101 21200

002 101

1840

1 18

500

1780

1 17

900

1680

1 16

900

20601 20700

1770

1 17

800

1601

170

0

13101 13200

8701 8800

29301 29400

1700

1 17

100

25301 25400

3701

380

0

24301 24400

26601 26700

22201 22300

5101 5200

20801 20900

23201 23300

3501

360

0

11701 11800

28701 28800

1201

130

0

15701 15800

4501 4600

1301

140

0

5601 5700

25401 25500

11201 11300

1001

110

0

2001

210

0

6101 6200

15001 15100

2601

270

0

28001 28100

14701 14800

22301 22400

22901 23000

22401 22500

00261 10161

2010

1 202

00

22101 22200

23801 23900

27201 27300

1740

1 17

500

12901 13000

0006

1 10

951

5301 5400

201

300

21701 21800

4901 5000

00161 100611620

1 16

300

26701 26800

1501

160

0

9201 9300

8201 8300

2101

220

0

7101 720017

301

1740

0

25001 25100

12101 12200

27701 278004401 4500

1801

190

0

501

600

13001 13100

9501 9600

28401 28500

23301 23400

22501 22600

2000

1 201

00

8601 87001

100

1950

1 19

600

601

700

301

400

24201 24300

2020

1 203

00

10701 10800

4601 4700

28501 28600

20701 20800

6601 6700

9401 9500

29901 30000

3401

350

0

23601 23700

4301

4400

1800

1 18

100

15601 15700

5501 5600

1660

1 16

700

6901 7000

13201 13300

7401 7500

20901 21000

27901 28000

23401 23500

21401 21500

15201 15300

8801 8900

5001 5100

12401 12500

9301 9400

1401

150

0

14601 14700

21601 21700

11601 11700

401

500

21501 21600

9101 9200

1980

1 19

900

6501 6600

12201 12300

29101 29200

25801 25900

1870

1 18

800 13501 13600

11101 11200

11301 11400

13801 13900

1720

1 17

300

0095

1 10

851

10601 10700

13301 13400

1940

1 19

500

27501 27600

11901 12000

24901 25000

1901

200

0

26001 2610018

201

1830

0

24501 24600

24701 24800

1650

1 16

600

28301 28400

13601 13700

10801 10900

1900

1 19

100

11501 1160031

01 3

200

23701 23800

24401 24500

14501 14600

28601 28700

2401

250

0

26301 26400

9901 10000

27401 27500

1810

1 18

200

8501 8600

29401 29500

3301

340

0

6001 6100

Bat-CoV

Hedgehog-CoV

Human-CoV

Bird-CoV

MERS Unique

Bat-CoV

SARS-CoV-2

Pangolin-CoV

Bird/Rabbit-CoV

SARS-CoV Unique

Bat-CoV

SARS-CoV-2

Human-CoV

Other-CoV

Pangolin-CoV Unique

Bat-CoV

Pangolin-CoV

Human-CoV

SARS-CoV-2 Unique

MERS Closest

Camel

SARS-CoV Closest

Bat

SARS1 Unique

Pangolin CoV Closest

Bat-CoV

SARS-CoV-2

Human-CoV

Pangolin-CoV Unique

Representation Score

0

10

20

30

40

50

60

70

80

90

100

SARS-CoV-2 Closest

Bat CoV

Human

Pangolin

SARS2 Unique

SARS-CoV-2 Closest

SARS-CoV Closest

Pangolin CoV Closest

MERS Closest

SA

RS

-CoV

-2

Pan

golin

-CoV

SA

RS

-CoV

ME

RS

-CoV

Figure 4 Host distribution of each 100-nucleotide segment in the reference genomes of MERS, SARS-CoV, Pangolin CoV, and SARS-CoV-2.Each 100-nucleotide segment per strain was subjected to BLASTn against the virus database. Self-category hits were removed from the analysis, hostinformation of top 50 hits was identified, and the relative percentage is represented here as a heatmap for each segment in each genome. In the closestcategory, the host information of best hit (after removing self-hit) is represented in the form of a multi-color stripe. The innermost circle representsthe location of each segment to start from 1 to 100 nucleotides to the total genome length of the genome.

Full-size DOI: 10.7717/peerj.9576/fig-4

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 20/31

Page 21: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

presence of (co-)orthologous proteins present across all these genomes, a bidirectional(reciprocal) best BLAST hit-based homology detection strategy was followed using the toolProteinortho (Lechner et al., 2011) (Fig. 5; Fig. S5). A total of 44 protein-coding regionclusters represent the pan-genome of nineteen studied genus Betacoronavirus membersamongst which, 3 protein-coding regions (S, M and N proteins) were found to beconserved across all genomes, representing the core genome. A similar analysis hasbeen performed on all Betacoronavirus species; however, it does not include PangolinCoV (Alam et al., 2020). Our study also recognized that the pan-genome of genusBetacoronavirus is open, which means that with the addition of a new species, thepan-genome is continuously increasing (Fig. S6). The increase in the pan-genome at eachstep is approximately proportional to the increase in unique genes. Since each newlyidentified strain has additional unique gene sequences, it is effectively impossible to predicttheir complete pan-genome. As expected, the members of each subgenus, which areclosest to each other, have only a few variations, however, with the addition of a newsubgenus member(s), the pan-genome expands immediately.

orf1ab, the largest coding region, codes for two polyproteins ORF1a and ORF1ab(266–13,468,13,468–21,555) because of the “leaky mechanism” of the ribosome, a (-1)frameshift just upstream of the orf1a translation termination codon. All CoVs, except forstrain MHV A59 of subgenus Embecovirus, translate the full-length orf1ab. MHV A59,

Figure 5 Pan-genome (Core, accessory, and unique) analysis amongst 19 Betacoronavirus referencegenomes (including one Pangolin CoV). The X-axis represents the strain combinations as black-filledcircles, whereas Y-axis represents the distribution of genes in the respective combinations and inside eachvertical bar, those gene names are written. For each strain, their respective subgenus category has beenpointed out on the left side as the colored scale in the corner. In the left-down corner, the number ofgenes encoded by each genome is depicted as a horizontal bar. In the lowermost panel, pan-genomecategories (core, accessory and unique genes) are shown. Full-size DOI: 10.7717/peerj.9576/fig-5

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 21/31

Page 22: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

however, has separate polyproteins from orf1a and orf1b. Due to the absence of ribosomalslippage reported in several coronaviruses, ORF1a is encoded by only 9 out of 18genomes, making it an accessory protein cluster of the pan-genome. ORF1b polyprotein isencoded as a part of orf1ab in most of the species belonging to genus Betacoronavirus,again except for strain MHV A59, where it is encoded separately as RdRp, forming aunique gene. Envelope protein (76–82 aa across Betacoronavirus) is another accessoryprotein encoded by 15 out of 18 genomes. In another cluster, we identified the presence ofsmall envelope protein in two out of those three genomes where E protein is absent(both in subgenus Nobecovirus). No E protein-encoding region is found in the MHV A59genome annotations. The gene encoding ORF3 is distributed into two subunits (orf3aand orf3b) in SARS-CoV (via an internal ribosomal entry mechanism), whereas onlythe first subunit (orf3a; 276 aa) is present in SARS-CoV-2. The gene for ORF3a isconserved only amongst subgenus Sarbecovirus organisms. This protein activates thepro-transcription of gene IL-1β and helps its maturation; both signals are prerequisitesfor NLRP3 (NOD-, LRR- and pyrin domain-containing protein 3) inflammasomeactivation during infection (Siu et al., 2019). Conservation of ORF3a across SARS-CoV-2,SARS-CoV and Pangolin CoVs provides a clue about its unique infection strategy. ORF3b(155 aa) is only present in SARS-CoV and absent in all other coronaviruses. It hasbeen suggested that ORF3b supports SARS-CoV infection by inhibiting the type-1Interferon (IFN) synthesis and therefore, IFN signaling (McBride & Fielding, 2012).ORF6 is also a subgenus Sarbecovirus specific protein (conserved across SARS-CoV,SARS-CoV-2 and Pangolin CoV) localized in the Endoplasmic Reticulum or the Golgimembrane. During infection, it interacts with and interrupts the nuclear import complexformation. During infection-caused interferon signaling, this interruption inhibits STAT1transportation to the nucleus and blocks the expression of genes involved, thereforemimicking an antiviral state (Frieman et al., 2007).

Another accessory protein of the Betacoronavirus pan-genome is ORF7a, which is aunique protein amongst subgenus Sarbecovirus members. In SARS-CoV, it is knownthat bone marrow stromal antigen-2 (BST-2 or tetherin or CD317) can restrict the virusrelease from inside the cell causing partial inhibition of infection. However, ORF7a caninhibit the action of this protein, thus promoting the infection (Taylor et al., 2015). ORF7bis present in both SARS-CoV-2 and SARS-CoV but absent in Pangolin CoV NCBIannotations. Like orf7b in SARS-CoV, its initiation codon overlaps with orf7a viaribosome leaky scanning and it works as a structural component (Schaecher, Mackenzie &Pekosz, 2007). orf8 gene (365 nucleotides) is uniquely present in SARS-CoV-2 andPangolin CoV but absent in SARS-CoV. Distinctively, orf8 in SARS-CoV is separatedinto two proteins ORF8a (119 nucleotides) and ORF8b (254 nucleotides). ORF8 inSARS-CoV-2 is highly divergent from the inflammasome activator ORF8b in SARS-CoV(Yuen et al., 2020). The function of this protein is not clear yet, but it has been suggestedthat overall ORF8ab or ORF8a and ORF8b together modulate viral replication andpathogenesis in unknown fashion (McBride & Fielding, 2012). ORF10 is a unique proteinin SARS-CoV-2 strains. We also performed the homology analysis of ORF10 protein

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 22/31

Page 23: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

against the NCBI Non-redundant database and found no hits, which suggests that thisprotein is specific to SARS-CoV-2.

This analysis also revealed that some proteins are specific to a subgenus. SubgenusEmbecovirus has a unique combination of proteins, namely HE, ORF4, internal protein,NS2a and NS3a, that are present in most genomes of this subgenus. These proteins mightbe considered unique to subgenus Embecovirus as compared to the rest of the genusBetacoronavirus. Hemagglutinin Esterase glycoprotein (HE) is a nonessential proteinwith sialic acid-binding and acetyl esterase activity that create tinier spikes (secondattachment factor) on the viral envelope in several MHV strains. orf4 (also annotated asorf5a, ns12.9) encodes a nonstructural accessory protein that is, known to function as typeI interferon antagonist. The internal gene is encoded within the N gene in MHV CoV;however, its function is not well-known. Similarly, subgenus Merbecovirus members alsohave some unique proteins (NS3a, NS3b, NS4b, ORF5), that are absent in other subgenera.We also identified unique proteins in each organism, which were absent in otherBetacoronavirus. Overall, it can be argued that although genus Betacoronavirus membershave ~10 proteins coding regions per genome, they have conserved similarity in only~10% (3–4 proteins) of the genes and their diversity (~40% accessory genes and ~50%unique genes) spans the rest of the genome, leading to their different pathogenesis acrossdiverse hosts and distinct evolution. We identified that even among subgenus Sarbecovirusmembers (SARS-CoV, SARS-CoV-2 and Pangolin CoV), orf3b, orf7b, orf8, orf9b andorf10 are uniquely distributed within at least one genome, therefore depicting widediversity within the same subgenus.

CONCLUSIONSHumanity has already encountered coronavirus infections twice in the past two decades,in the form of the SARS and MERS epidemics. The latest coronavirus infection,COVID-19, caused by the highly pathogenic and transmissible SARS-CoV-2, turned into apandemic in sheer three months. This work provides a brief review of the taxonomy,history, origin, and genome organization of the virus, followed by comparative genomicsresearch focused on understanding the evolutionary relationship amongst 167SARS-CoV-2, 312 SARS-CoV and 5 Pangolin CoV genome sequences as available on29 March 2020. We identified that SARS-CoV-2 strains form a monophyletic clade distinctfrom SARS-CoV and Pangolin CoV. This study reaffirmed that they are closest to theBat CoV RaTG13 strain followed by Pangolin CoV, suggesting that SARS-CoV-2 evolvedfrom a common ancestor putatively residing in bat or pangolin hosts. We identified severalmutation sites and hotspots within SARS-CoV genomes with high, medium and lowprobabilities. Homology studies based on 100 nucleotide segments in the SARS-CoV-2reference genome pointed out their close relationships with Bat and Pangolin CoVs, alongwith the unique nature of a few segments. We recognized that 44 protein-coding regionsconstitute the pan-genome for nineteen genus Betacoronavirus strains. Moreover, theirpan-genome is open, highlighting the wide diversity provided by newly identified novelstrains. Even members of subgenus Sarbecovirus are diverse relative to each other due to

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 23/31

Page 24: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

the relative presence of unique protein-coding regions orf3b, orf7b, orf8 and orf9b andorf10. Overall, this review and research highlight the diversity within the availableSARS-CoV-2 genomes, their potential mutational sites hotspots, and probableevolutionary relationship with other coronaviruses, which might further assist in ourunderstanding of their evolution, epidemiology and pathogenicity.

ACKNOWLEDGEMENTSWe acknowledge Tanvi Kale, IBAB MSc (2018–20) student for graciously offering herartwork to use as a cover image for this manuscript.

ADDITIONAL INFORMATION AND DECLARATIONS

FundingDepartment of Science and Technology (DST), Government of India and IBAB, Bengaluruprovided financial and infrastructure support to Gaurav Sharma. The funders had no rolein study design, data collection and analysis, decision to publish, or preparation of themanuscript.

Grant DisclosuresThe following grant information was disclosed by the authors:DST-INSPIRE Faculty Award from Department of Science and Technology (DST),Government of India to GS.

Competing InterestsThe authors declare that they have no competing interests.

Author Contributions� Arohi Parlikar performed the experiments, analyzed the data, authored or revieweddrafts of the paper, and approved the final draft.

� Kishan Kalia performed the experiments, analyzed the data, prepared figures and/ortables, authored or reviewed drafts of the paper, and approved the final draft.

� Shruti Sinha performed the experiments, analyzed the data, authored or reviewed draftsof the paper, and approved the final draft.

� Sucheta Patnaik performed the experiments, analyzed the data, authored or revieweddrafts of the paper, and approved the final draft.

� Neeraj Sharma performed the experiments, analyzed the data, prepared figures and/ortables, authored or reviewed drafts of the paper, and approved the final draft.

� Sai Gayatri Vemuri performed the experiments, analyzed the data, authored or revieweddrafts of the paper, and approved the final draft.

� Gaurav Sharma conceived and designed the experiments, performed the experiments,analyzed the data, prepared figures and/or tables, authored or reviewed drafts of thepaper, and approved the final draft.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 24/31

Page 25: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Data AvailabilityThe following information was supplied regarding data availability:

The strains and genomes used in this research were retrieved from NCBI Virus(https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) on 29 March 2020. This information isprovided in Table S1.

Supplemental InformationSupplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj.9576#supplemental-information.

REFERENCESAhmed SF, Quadeer AA, McKay MR. 2020. Preliminary identification of potential vaccine targets

for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies.Viruses 12(3):254 DOI 10.3390/v12030254.

Alam I, Kamau A, Kulmanov M, Arold ST, Pain A, Gojobori T, Duarte CM. 2020. Functionalpangenome analysis provides insights into the origin, function and pathways to therapy ofSARS-CoV-2 coronavirus. bioRxiv DOI 10.1101/2020.02.17.952895.

Andersen KG, Rambaut A, Lipkin WI, Holmes EC, Garry RF. 2020. The proximal origin ofSARS-CoV-2. Nature Medicine 26:450–452 DOI 10.1038/s41591-020-0820-9.

Angelini MM, Akhlaghpour M, Neuman BW, Buchmeier MJ. 2013. Severe acute respiratorysyndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles.mBio 4(4):e00524-13 DOI 10.1128/mBio.00524-13.

Azhar EI, El-Kafrawy SA, Farraj SA, Hassan AM, Al-Saeed MS, Hashem AM, Madani TA. 2014.Evidence for camel-to-human transmission of MERS coronavirus. New England Journal ofMedicine 370(26):2499–2505 DOI 10.1056/NEJMoa1401505.

Barr JN, Fearns R. 2010. How RNA viruses maintain their genome integrity. Journal of GeneralVirology 91(6):1373–1387 DOI 10.1099/vir.0.020818-0.

Boni MF, Lemey P, Jiang X, Lam TT-Y, Perry B, Castoe T, Rambaut A, Robertson DL. 2020.Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19pandemic. bioRxiv DOI 10.1101/2020.03.30.015008.

Bouvet M, Lugari A, Posthuma CC, Zevenhoven JC, Bernard S, Betzi S, Imbert I, Canard B,Guillemot JC, Lécine P, Pfefferle S, Drosten C, Snijder EJ, Decroly E, Morelli X. 2014.Coronavirus Nsp10, a critical co-factor for activation of multiple replicative enzymes.Journal of Biological Chemistry 289(37):25783–25796 DOI 10.1074/jbc.M114.577353.

Buchfink B, Xie C, Huson DH. 2014. Fast and sensitive protein alignment using DIAMOND.Nature Methods 12(1):59–60 DOI 10.1038/nmeth.3176.

Chen Y, Liu Q, Guo D. 2020. Emerging coronaviruses: genome structure, replication, andpathogenesis. Journal of Medical Virology 92(4):418–423 DOI 10.1002/jmv.25681.

Chen Y, Su C, Ke M, Jin X, Xu L, Zhang Z, Wu A, Sun Y, Yang Z, Tien P, Ahola T, Liang Y,Liu X, Guo D. 2011. Biochemical and structural insights into the mechanisms of SARScoronavirus RNA ribose 2′-O-Methylation by nsp16/nsp10 protein complex. PLOS Pathogens7(10):e1002294 DOI 10.1371/journal.ppat.1002294.

Christiam C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009.BLAST+: architecture and applications. BMC Bioinformatics 10:421DOI 10.1186/1471-2105-10-421.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 25/31

Page 26: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Coleman CM, Frieman MB. 2014. Coronaviruses: important emerging human pathogens.Journal of Virology 88(10):5209–5212 DOI 10.1128/JVI.03488-13.

Cong Y, Ulasli M, Schepers H, Mauthe M, V’kovski P, Kriegenburg F, Thiel V, De Haan CAM,Reggiori F. 2019. Nucleocapsid protein recruitment to replication-transcription complexesplays a crucial role in coronaviral life cycle. Journal of Virology 94(4):e01925-19DOI 10.1128/JVI.01925-19.

Cottam EM, Whelband MC, Wileman T. 2014. Coronavirus NSP6 restricts autophagosomeexpansion. Autophagy 10(8):1426–1441 DOI 10.4161/auto.29309.

Cui J, Li F, Shi ZL. 2019. Origin and evolution of pathogenic coronaviruses. Nature ReviewsMicrobiology 17(3):181–192 DOI 10.1038/s41579-018-0118-9.

De Haan CAM, Haijema BJ, Schellen P, Schreur PW, Te Lintelo E, Vennema H, Rottier PJM.2008. Cleavage of group 1 coronavirus spike proteins: how furin cleavage is traded off againstheparan sulfate binding upon cell culture adaptation. Journal of Virology 82(12):6078–6083DOI 10.1128/JVI.00074-08.

Edgar RC. 2004.MUSCLE: multiple sequence alignment with high accuracy and high throughput.Nucleic Acids Research 32(5):1792–1797 DOI 10.1093/nar/gkh340.

Fehr AR, Perlman S. 2015. Coronaviruses: an overview of their replication and pathogenesis.In: Maier H, Bickerton E, Britton P, eds. Coronaviruses: Methods and Protocols. New York:Springer, 1–23.

Follis KE, York J, Nunberg JH. 2006. Furin cleavage of the SARS coronavirus spike glycoproteinenhances cell–cell fusion but does not affect virion entry. Virology 350(2):358–369DOI 10.1016/j.virol.2006.02.003.

Frieman M, Yount B, Heise M, Kopecky-Bromberg SA, Palese P, Baric RS. 2007. Severe acuterespiratory syndrome coronavirus ORF6 antagonizes STAT1 function by sequestering nuclearimport factors on the rough endoplasmic reticulum/golgi membrane. Journal of Virology81(18):9812–9824 DOI 10.1128/JVI.01012-07.

Fuk-Woo Chan J, Kok K-H, Zhu Z, Chu H, Kai-Wang To K, Yuan S, Yuen K-Y. 2020. Genomiccharacterization of the 2019 novel human-pathogenic coronavirus isolated from a patient withatypical pneumonia after visiting Wuhan. Emerging Microbes & Infections 9(1):221–236DOI 10.1080/22221751.2020.1719902.

Gadlage MJ, Sparks JS, Beachboard DC, Cox RG, Doyle JD, Stobart CC, Denison MR. 2010.Murine hepatitis virus nonstructural protein 4 regulates virus-induced membrane modificationsand replication complex function. Journal of Virology 84(1):280–290DOI 10.1128/JVI.01772-09.

Gorbalenya AE, Baker SC, Baric RS, De Groot RJ, Drosten C, Gulyaeva AA, Haagmans BL,Lauber C, Leontovich AM, Neuman BW, Penzar D, Perlman S, Poon LLM, Samborskiy DV,Sidorov IA, Sola I, Ziebuhr J. 2020. The species severe acute respiratory syndrome-relatedcoronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nature Microbiology5(4):536–544 DOI 10.1038/s41564-020-0695-z.

Gordon DE, Jang GM, Bouhaddou M, Xu J, Obernier K, White KM, O’Meara MJ, Rezelj VV,Guo JZ, Swaney DL, Tummino TA, Huettenhain R, Kaake RM, Richards AL,Tutuncuoglu B, Foussard H, Batra J, Haas K, Modak M, Kim M, Haas P, Polacco BJ,Braberg H, Fabius JM, Eckhardt M, Soucheray M, Bennett MJ, Cakir M, McGregor MJ, Li Q,Meyer B, Roesch F, Vallet T, Mac Kain A, Miorin L, Moreno E, Naing ZZC, Zhou Y, Peng S,Shi Y, Zhang Z, Shen W, Kirby IT, Melnyk JE, Chorba JS, Lou K, Dai SA, Barrio-Hernandez I, Memon D, Hernandez-Armenta C, Lyu J, Mathy CJP, Perica T, Pilla KB,Ganesan SJ, Saltzberg DJ, Rakesh R, Liu X, Rosenthal SB, Calviello L, Venkataramanan S,

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 26/31

Page 27: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Liboy-Lugo J, Lin Y, Huang X-P, Liu Y, Wankowicz SA, Bohn M, Safari M, Ugur FS, Koh C,Savar NS, Tran QD, Shengjuler D, Fletcher SJ, O’Neal MC, Cai Y, Chang JCJ,Broadhurst DJ, Klippsten S, Sharp PP, Wenzell NA, Kuzuoglu D, Wang H-Y, Trenker R,Young JM, Cavero DA, Hiatt J, Roth TL, Rathore U, Subramanian A, Noack J, Hubert M,Stroud RM, Frankel AD, Rosenberg OS, Verba KA, Agard DA, Ott M, Emerman M, Jura N,von Zastrow M, Verdin E, Ashworth A, Schwartz O, d’Enfert C, Mukherjee S, Jacobson M,Malik HS, Fujimori DG, Ideker T, Craik CS, Floor SN, Fraser JS, Gross JD, Sali A, Roth BL,Ruggero D, Taunton J, Kortemme T, Beltrao P, Vignuzzi M, García-Sastre A, Shokat KM,Shoichet BK, Krogan NJ. 2020. A SARS-CoV-2 protein interaction map reveals targets fordrug repurposing. Epub ahead of print 30 April 2020. Nature DOI 10.1038/s41586-020-2286-9.

Graham RL, Sims AC, Brockway SM, Baric RS, Denison MR. 2005. The nsp2 replicase proteinsof murine hepatitis virus and severe acute respiratory syndrome coronavirus are dispensable forviral replication. Journal of Virology 79(21):13399–13411DOI 10.1128/JVI.79.21.13399-13411.2005.

Graham RL, Sparks JS, Eckerle LD, Sims AC, Denison MR. 2008. SARS coronavirusreplicase proteins in pathogenesis. Virus Research 133(1):88–100DOI 10.1016/j.virusres.2007.02.017.

Grossoehme NE, Li L, Keane SC, Liu P, Dann CE, Leibowitz JL, Giedroc DP. 2009. CoronavirusN protein N-terminal domain (NTD) specifically binds the transcriptional regulatory sequence(TRS) and melts TRS-cTRS RNA duplexes. Journal of Molecular Biology 394(3):544–557DOI 10.1016/j.jmb.2009.09.040.

Hamre D, Procknow JJ. 1966. A new virus isolated from the human respiratory tract.Proceedings of the Society for Experimental Biology and Medicine 121(1):190–193DOI 10.3181/00379727-121-30734.

Han Q, Lin Q, Jin S, You L. 2020. Coronavirus 2019-nCoV: a brief perspective from the front line.Journal of Infection 80(4):373–377 DOI 10.1016/j.jinf.2020.02.010.

Hu D, Zhu C, Ai L, He T, Wang Y, Ye F, Yang L, Ding C, Zhu X, Lv R, Zhu J, Hassan B, Feng Y,Tan W, Wang C. 2018. Genomic characterization and infectivity of a novel SARS-likecoronavirus in Chinese bats. Emerging Microbes & Infections 7(1):1–10DOI 10.1038/s41426-018-0155-5.

Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, Zhang L, Fan G, Xu J, Gu X, Cheng Z, Yu T, Xia J,Wei Y, Wu W, Xie X, Yin W, Li H, Liu M, Xiao Y, Gao H, Guo L, Xie J, Wang G, Jiang R,Gao Z, Jin Q, Wang J, Cao B. 2020. Clinical features of patients infected with 2019 novelcoronavirus in Wuhan, China. Lancet 395(10223):497–506DOI 10.1016/S0140-6736(20)30183-5.

Issa E, Merhi G, Panossian B, Salloum T, Tokajian S. 2020. SARS-CoV-2 and ORF3a:nonsynonymous mutations, functional domains, and viral pathogenesis. mSystems5(3):e00266-20 DOI 10.1128/mSystems.00266-20.

Kendall EJC, Bynoe ML, Tyrrell DAJ. 1962. Virus isolations from common colds occurring in aresidential school. BMJ 2(5297):82–86 DOI 10.1136/bmj.2.5297.82.

Kirchdoerfer RN, Ward AB. 2019. Structure of the SARS-CoV nsp12 polymerase bound to nsp7and nsp8 co-factors. Nature Communications 10(1):1–9 DOI 10.1038/s41467-019-10280-3.

Kopecky-Bromberg SA, Martinez-Sobrido L, Frieman M, Baric RA, Palese P. 2007. Severe acuterespiratory syndrome coronavirus open reading frame (ORF) 3b, ORF 6, and nucleocapsidproteins function as interferon antagonists. Journal of Virology 81(2):548–557DOI 10.1128/JVI.01782-06.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 27/31

Page 28: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Kumar S, Stecher G, Li M, Knyaz C, Tamura K. 2018.MEGA X: molecular evolutionary geneticsanalysis across computing platforms. Molecular Biology and Evolution 35(6):1547–1549DOI 10.1093/molbev/msy096.

Lai AL, Millet JK, Daniel S, Freed JH, Whittaker GR. 2020. The SARS-CoV fusion peptide formsan extended bipartite fusion platform that perturbs membrane order in a calcium-dependentmanner. Journal of Molecular Biology 429(24):3875–3892 DOI 10.1016/j.jmb.2017.10.017.

Lau SKP, Feng Y, Chen H, Luk HKH, Yang W-H, Li KSM, Zhang Y-Z, Huang Y, Song Z-Z,Chow W-N, Fan RYY, Ahmed SS, Yeung HC, Lam CSF, Cai J-P, Wong SSY, Chan JFW,Yuen K-Y, Zhang H-L, Woo PCY, Perlman S. 2015. Severe acute respiratory syndrome (SARS)coronavirus ORF8 protein is acquired from SARS-related coronavirus from greater horseshoebats through recombination. Journal of Virology 89(20):10532–10547DOI 10.1128/JVI.01048-15.

Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. 2011. Proteinortho: detectionof (Co-)orthologs in large-scale analysis. BMC Bioinformatics 12(1):124DOI 10.1186/1471-2105-12-124.

Lei J, Kusov Y, Hilgenfeld R. 2018. Nsp3 of coronaviruses: structures and functions of a largemulti-domain protein. Antiviral Research 149:58–74 DOI 10.1016/j.antiviral.2017.11.001.

Lescure FX, Bouadma L, Nguyen D, Parisey M, Wicky PH, Behillil S, Gaymard A,Bouscambert-DuchampM, Donati F, Le Hingrat Q, Enouf V, Houhou-Fidouh N, Valette M,Mailles A, Lucet JC, Mentre F, Duval X, Descamps D, Malvy D, Timsit JF, Lina B,Van-der-Werf S, Yazdanpanah Y. 2020. Clinical and virological data of the first cases ofCOVID-19 in Europe: a case series. Lancet Infectious Diseases 20:697–706DOI 10.1016/S1473-3099(20)30200-0.

Letunic I, Bork P. 2019. Interactive tree of life (iTOL) v4: recent updates and new developments.Nucleic Acids Research 47(W1):W256–W259 DOI 10.1093/nar/gkz239.

Liu DX, Fung TS, Chong KK-L, Shukla A, Hilgenfeld R. 2014. Accessory proteins of SARS-CoVand other coronaviruses. Antiviral Research 109:97–109 DOI 10.1016/j.antiviral.2014.06.013.

Lugari A, Betzi S, Decroly E, Bonnaud E, Hermant A, Guillemot JC, Debarnot C, Borg JP,Bouvet M, Canard B, Morelli X, Lécine P. 2010. Molecular mapping of the RNA cap2′-O-methyltransferase activation interface between severe acute respiratory syndromecoronavirus nsp10 and nsp16. Journal of Biological Chemistry 285(43):33230–33241DOI 10.1074/jbc.M110.120014.

Lv L, Li G, Chen J, Liang X, Li Y. 2020. Comparative genomic analysis revealed specific mutationpattern between human coronavirus SARS-CoV-2 and Bat-SARSr-CoV RaTG13. bioRxivDOI 10.1101/2020.02.27.969006.

Lyons DM, Lauring AS. 2017. Evidence for the selective basis of transition-to-transversionsubstitution bias in two RNA viruses. Molecular Biology and Evolution 34(12):3205–3215DOI 10.1093/molbev/msx251.

Marra MA, Jones SJM, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YSN, Khattra J,Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL,Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG,Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H,Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R,Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S,Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M,Petric M, Skowronski DM, Upton C, Roper RL. 2003. The genome sequence of theSARS-associated coronavirus. Science 300(5624):1399–1404 DOI 10.1126/science.1085953.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 28/31

Page 29: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

McBride R, Fielding B. 2012. The role of severe acute respiratory syndrome (SARS)-coronavirusaccessory proteins in virus pathogenesis. Viruses 4(11):2902–2923 DOI 10.3390/v4112902.

McIntosh K, Becker WB, Chanock RM. 1967. Growth in suckling-mouse brain of “IBV-like”viruses from patients with upper respiratory tract disease. Proceedings of the National Academyof Sciences of the United States of America 58(6):2268–2273 DOI 10.1073/pnas.58.6.2268.

Menachery VD, Debbink K, Baric RS. 2014. Coronavirus non-structural protein 16: evasion,attenuation, and possible treatments. Virus Research 194:191–199DOI 10.1016/j.virusres.2014.09.009.

Mousavizadeh L, Ghasemi S. 2020. Genotype and phenotype of COVID-19: their roles inpathogenesis. Epub ahead of print 31 March 2020. Journal of Microbiology, Immunology andInfection DOI 10.1016/j.jmii.2020.03.022.

Peiris JSM. 2012. Coronaviruses. Medical Microbiology (Eighteenth Edition) 12:587–593DOI 10.1016/B978-0-7020-4089-4.00072-X.

Pritchard L, Glover RH, Humphris S, Elphinstone JG, Toth IK. 2016. Genomics and taxonomyin diagnostics for food security: soft-rotting enterobacterial plant pathogens. Analytical Methods8(1):12–24 DOI 10.1039/C5AY02550H.

Rehman S, Shafique L, Ihsan A, Liu Q. 2020. Evolutionary trajectory for the emergence of novelcoronavirus SARS-CoV-2. Pathogens 9(3):240 DOI 10.3390/pathogens9030240.

Sanjuán R, Nebot MR, Chirico N, Mansky LM, Belshaw R. 2010. Viral mutation rates.Journal of Virology 84(19):9733–9748 DOI 10.1128/JVI.00694-10.

Schaecher SR, Mackenzie JM, Pekosz A. 2007. The ORF7b protein of severe acute respiratorysyndrome coronavirus (SARS-CoV) is expressed in virus-infected cells and incorporated intoSARS-CoV particles. Journal of Virology 81(2):718–731 DOI 10.1128/JVI.01691-06.

Simmonds P, Adams MJ, Benkő Mária, Breitbart M, Brister JR, Carstens EB, Davison AJ,Delwart E, Gorbalenya AE, Harrach B, Hull R, King AMQ, Koonin EV, Krupovic M,Kuhn JH, Lefkowitz EJ, Nibert ML, Orton R, Roossinck MJ, Sabanadzovic S, Sullivan MB,Suttle CA, Tesh RB, Van der Vlugt RA, Varsani A, Zerbini FM. 2017. Virus taxonomy in theage of metagenomics. Nature Reviews Microbiology 15(3):161–168DOI 10.1038/nrmicro.2016.177.

Siu K, Yuen K, Castano-Rodriguez C, Ye Z, Yeung M, Fung S, Yuan S, Chan C, Yuen K,Enjuanes L, Jin D. 2019. Severe acute respiratory syndrome coronavirus ORF3a proteinactivates the NLRP3 inflammasome by promoting TRAF3-dependent ubiquitination of ASC.FASEB Journal 33(8):8865–8877 DOI 10.1096/fj.201802418R.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of largephylogenies. Bioinformatics 30(9):1312–1313 DOI 10.1093/bioinformatics/btu033.

Stobart CC, Sexton NR, Munjal H, Lu X, Molland KL, Tomar S, Mesecar AD, Denison MR.2013. Chimeric exchange of coronavirus nsp5 proteases (3CLpro) identifies common anddivergent regulatory determinants of protease activity. Journal of Virology 87(23):12611–12618DOI 10.1128/JVI.02050-13.

Subissi L, Imbert I, Ferron F, Collet A, Coutard B, Decroly E, Canard B. 2014. SARS-CoVORF1b-encoded nonstructural proteins 12–16: replicative enzymes as antiviral targets.Antiviral Research 101:122–130 DOI 10.1016/j.antiviral.2013.11.006.

Surjit M, Lal SK. 2008. The SARS-CoV nucleocapsid protein: a protein with multifarious activities.Infection, Genetics and Evolution 8(4):397–405 DOI 10.1016/j.meegid.2007.07.004.

Tan YJ, Lim SG, Hong W. 2005. Characterization of viral proteins encoded by theSARS-coronavirus genome. Antiviral Research 65(2):69–78 DOI 10.1016/j.antiviral.2004.10.001.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 29/31

Page 30: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Tanaka T, Kamitani W, DeDiego ML, Enjuanes L, Matsuura Y. 2012. Severe acute respiratorysyndrome coronavirus nsp1 facilitates efficient propagation in cells through a specifictranslational shutoff of host mRNA. Journal of Virology 86(20):11128–11137DOI 10.1128/JVI.01700-12.

Taylor JK, Coleman CM, Postel S, Sisk JM, Bernbaum JG, Venkataraman T, Sundberg EJ,Frieman MB. 2015. Severe acute respiratory syndrome coronavirus ORF7a inhibits bonemarrow stromal antigen 2 virion tethering through a novel mechanism of glycosylationinterference. Journal of Virology 89(23):11820–11833 DOI 10.1128/JVI.02274-15.

Te Velthuis AJW, Van den Worm SHE, Snijder EJ. 2012. The SARS-coronavirus nsp7+nsp8complex is a unique multimeric RNA polymerase capable of both de novo initiation and primerextension. Nucleic Acids Research 40(4):1737–1747 DOI 10.1093/nar/gkr893.

Tok TT, Tatar G. 2017. Structures and functions of coronavirus proteins : molecular modeling ofviral nucleoprotein. International Journal of Virology & Infectious Diseases 2:1–7.

Tyrrell DA, Bynoe ML. 1966. Cultivation of viruses from a high proportion of patients with colds.Lancet 1(7428):76–77 DOI 10.1016/S0140-6736(66)92364-6.

Woo PCY, Huang Y, Lau SKP, Tsoi HW, Yuen KY. 2005. In silico analysis of ORF1ab incoronavirus HKU1 genome reveals a unique putative cleavage site of coronavirus HKU1 3C-likeprotease. Microbiology and Immunology 49(10):899–908DOI 10.1111/j.1348-0421.2005.tb03681.x.

Woo PCY, Huang Y, Lau SKP, Yuen KY. 2010. Coronavirus genomics and bioinformaticsanalysis. Viruses 2(8):1805–1820 DOI 10.3390/v2081803.

Xu J, Zhao S, Teng T, Abdalla AE, Zhu W, Xie L, Wang YGX. 2020. Systematic comparison oftwo animal-to-human transmitted human coronaviruses: SARS-CoV-2 and SARS-CoV. Viruses12(2):244 DOI 10.3390/v12020244.

Yuen KS, Ye ZW, Fung SY, Chan CP, Jin DY. 2020. SARS-CoV-2 and COVID-19: the mostimportant research questions. Cell & Bioscience 10:40 DOI 10.1186/s13578-020-00404-4.

Zaki AM, Van Boheemen S, Bestebroer TM, Osterhaus ADME, Fouchier RAM. 2012. Isolationof a novel coronavirus from a man with pneumonia in Saudi Arabia. New EnglandJournal of Medicine 367(19):1814–1820 DOI 10.1056/NEJMoa1211721.

Zeng Z, Deng F, Shi K, Ye G, Wang G, Fang L, Xiao S, Fu Z, Peng G. 2018. Dimerization ofcoronavirus nsp9 with diverse modes enhances its nucleic acid binding affinity.Journal of Virology 92(17):e00692-18 DOI 10.1128/JVI.00692-18.

Zeng C, Wu A, Wang Y, Xu S, Tang Y, Jin X, Wang S, Qin L, Sun Y, Fan C, Snijder EJ,Neuman BW, Chen Y, Ahola T, Guo D. 2016. Identification and characterization of a ribose2′-O-Methyltransferase encoded by the ronivirus branch of nidovirales. Journal of Virology90(15):6675–6685 DOI 10.1128/JVI.00658-16.

Zhang T, Wu Q, Zhang Z. 2020. Probable pangolin origin of SARS-CoV-2 associated with theCOVID-19 outbreak. Current Biology 30(7):1346–1351.e2 DOI 10.1016/j.cub.2020.03.022.

Zhong NS, Zheng BJ, Li YM, Poon LLM, Xie ZH, Chan KH, Li PH, Tan SY, Chang Q, Xie JP,Liu XQ, Xu J, Li DX, Yuen KY, Peiris JSM, Guan Y. 2003. Epidemiology and cause of severeacute respiratory syndrome (SARS) in Guangdong. People’s Republic of China, in February362:1353–1358 DOI 10.1016/S0140-6736(03)14630-2.

Zhou P, Yang X-L, Wang X-G, Hu B, Zhang L, Zhang W, Si H-R, Zhu Y, Li B, Huang C-L,Chen H-D, Chen J, Luo Y, Guo H, Jiang R-D, Liu M-Q, Chen Y, Shen X-R, Wang X,Zheng X-S, Zhao K, Chen Q-J, Deng F, Liu L-L, Yan B, Zhan F-X, Wang Y-Y, Xiao G-F,Shi Z-L. 2020. A pneumonia outbreak associated with a new coronavirus of probable bat origin.Nature 579(7798):270–273 DOI 10.1038/s41586-020-2012-7.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 30/31

Page 31: Understanding genomic diversity, pan-genome ...Understanding genomic diversity, pan-genome,andevolutionofSARS-CoV-2 Arohi Parlikar*, Kishan Kalia*, Shruti Sinha, Sucheta Patnaik, Neeraj

Zhu X, Fang L, Wang D, Yang Y, Chen J, Ye X, Foda MF, Xiao S. 2017. Porcine deltacoronavirusnsp5 inhibits interferon-$β$ production through the cleavage of NEMO. Virology 502:33–38DOI 10.1016/j.virol.2016.12.005.

Zhu N, Zhang D, WangW, Li X, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R, Niu P, Zhan F,Ma X, Wang D, Xu W, Wu G, Gao GF, Tan W. 2020. A novel coronavirus from patients withpneumonia in China, 2019. New England Journal of Medicine 382(8):727–733DOI 10.1056/NEJMoa2001017.

Parlikar et al. (2020), PeerJ, DOI 10.7717/peerj.9576 31/31