modeling sequence evolution in hiv-1 infection with recombination

12
Modeling sequence evolution in HIV-1 infection with recombination Elena E. Giorgi a,b , Bette T. Korber a,c , Alan S. Perelson a,c,n , Tanmoy Bhattacharya a,c a Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM 87545, USA b Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA c Santa Fe Institute, Santa Fe, NM 87501, USA HIGHLIGHTS We describe a continuous time, age-dependent model of early HIV infection. We show that the new model yields the same results as a simplied model we published in 2009. We introduce recombination in the model. We show how recombination affects the results compared to our previous model. article info Article history: Received 24 September 2012 Received in revised form 12 February 2013 Accepted 27 March 2013 Available online 6 April 2013 Keywords: HIV Population dynamics Viral evolution abstract Previously we proposed two simplied models of early HIV-1 evolution. Both showed that under a model of neutral evolution and exponential growth, the mean Hamming distance (HD) between genetic sequences grows linearly with time. In this paper we describe a more realistic continuous-time, age- dependent mathematical model of infection and viral replication, and show through simulations that even in this more complex description, the mean Hamming distance grows linearly with time. This remains unchanged when we introduce recombination, though the condence intervals of the mean HD obtained ignoring recombination are overly conservative. Published by Elsevier Ltd. 1. Introduction The theory of population genetics has been applied to a vast number of organisms, from primates to bacteria and virus. Combined with coalescent theory and phylogenetic methods, it can trace back in time the dynamics and evolution of species. Previously (Lee et al., 2009; Giorgi et al., 2010), we developed an intra-host evolutionary model for HIV-1 during early infection and tested the genetic diversity of early HIV-1 samples against a null model of neutral evolution. The method allows one to distinguish infections that are established by a single virus from those initiated by multiple viruses. It also allows one to estimate the time since infection and the onset of immune selection. Though the model is now widely used and performs well on different datasets (Keele et al., 2008; Wood et al., 2009; Keele et al., 2009), including very large ones obtained through 454 sequencing (Fischer et al., 2010), the viral lifecycle described in Lee et al. (2009) was a simplication of the underlying biology. Here we study a more realistic model of HIV replication and how the new approach affects the results. Following in Sewall Wright's footsteps, Kimura (1968) used the FokkerPlanck equation to develop a continuous time model of population genetics. This was later expanded by Gillespie (1984) who developed explicit simulation frameworks for continuous time stochastic models in population genetics. In the rst part of this paper, we develop a continuous time, age-dependent model to describe the genetic drift in a growing viral population. We use a deterministic approach and derive a model using the Von Foerster equation and renewal condition with age-dependent birth and death rates. We then expand this into a completely stochastic model using the FokkerPlanck equation and follow, through simulation, the variation in genetic distances across a nite population of viral sequences. We mea- sure genetic distances in the sample using Hamming distances (HD), dened as the number of base positions at which two sequences differ. Model simulations show that even in this fully randomized model, the growth in mean HD, averaged over multi- ple runs, is linear in time, and that different choices of birth and death rates only affect the rate of the growth but not this behavior. Contents lists available at SciVerse ScienceDirect journal homepage: www.elsevier.com/locate/yjtbi Journal of Theoretical Biology 0022-5193/$ - see front matter Published by Elsevier Ltd. http://dx.doi.org/10.1016/j.jtbi.2013.03.026 n Corresponding author at: Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. Tel.: +1 505 667 6829; fax: +1 505 665 3493. E-mail address: [email protected] (A.S. Perelson). Journal of Theoretical Biology 329 (2013) 8293

Upload: tanmoy

Post on 08-Dec-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Modeling sequence evolution in HIV-1 infection with recombination

Journal of Theoretical Biology 329 (2013) 82–93

Contents lists available at SciVerse ScienceDirect

Journal of Theoretical Biology

0022-51http://d

n CorrNationafax: +1

E-m

journal homepage: www.elsevier.com/locate/yjtbi

Modeling sequence evolution in HIV-1 infection with recombination

Elena E. Giorgi a,b, Bette T. Korber a,c, Alan S. Perelson a,c,n, Tanmoy Bhattacharya a,c

a Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM 87545, USAb Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USAc Santa Fe Institute, Santa Fe, NM 87501, USA

H I G H L I G H T S

� We describe a continuous time, age-dependent model of early HIV infection.

� We show that the new model yields the same results as a simplified model we published in 2009.� We introduce recombination in the model.� We show how recombination affects the results compared to our previous model.

a r t i c l e i n f o

Article history:Received 24 September 2012Received in revised form12 February 2013Accepted 27 March 2013Available online 6 April 2013

Keywords:HIVPopulation dynamicsViral evolution

93/$ - see front matter Published by Elsevierx.doi.org/10.1016/j.jtbi.2013.03.026

esponding author at: Theoretical Biology al Laboratory, Los Alamos, NM 87545, USA. Tel505 665 3493.ail address: [email protected] (A.S. Perelson).

a b s t r a c t

Previously we proposed two simplified models of early HIV-1 evolution. Both showed that under a modelof neutral evolution and exponential growth, the mean Hamming distance (HD) between geneticsequences grows linearly with time. In this paper we describe a more realistic continuous-time, age-dependent mathematical model of infection and viral replication, and show through simulations thateven in this more complex description, the mean Hamming distance grows linearly with time. Thisremains unchanged when we introduce recombination, though the confidence intervals of the mean HDobtained ignoring recombination are overly conservative.

Published by Elsevier Ltd.

1. Introduction

The theory of population genetics has been applied to a vastnumber of organisms, from primates to bacteria and virus.Combined with coalescent theory and phylogenetic methods, itcan trace back in time the dynamics and evolution of species.Previously (Lee et al., 2009; Giorgi et al., 2010), we developed anintra-host evolutionary model for HIV-1 during early infection andtested the genetic diversity of early HIV-1 samples against a nullmodel of neutral evolution. The method allows one to distinguishinfections that are established by a single virus from thoseinitiated by multiple viruses. It also allows one to estimate thetime since infection and the onset of immune selection. Thoughthe model is now widely used and performs well on differentdatasets (Keele et al., 2008; Wood et al., 2009; Keele et al., 2009),including very large ones obtained through 454 sequencing(Fischer et al., 2010), the viral lifecycle described in Lee et al.

Ltd.

nd Biophysics, Los Alamos.: +1 505 667 6829;

(2009) was a simplification of the underlying biology. Here westudy a more realistic model of HIV replication and how the newapproach affects the results.

Following in Sewall Wright's footsteps, Kimura (1968) used theFokker–Planck equation to develop a continuous time model ofpopulation genetics. This was later expanded by Gillespie (1984)who developed explicit simulation frameworks for continuoustime stochastic models in population genetics.

In the first part of this paper, we develop a continuous time,age-dependent model to describe the genetic drift in a growingviral population. We use a deterministic approach and derive amodel using the Von Foerster equation and renewal conditionwith age-dependent birth and death rates. We then expand thisinto a completely stochastic model using the Fokker–Planckequation and follow, through simulation, the variation in geneticdistances across a finite population of viral sequences. We mea-sure genetic distances in the sample using Hamming distances(HD), defined as the number of base positions at which twosequences differ. Model simulations show that even in this fullyrandomized model, the growth in mean HD, averaged over multi-ple runs, is linear in time, and that different choices of birth anddeath rates only affect the rate of the growth but not this behavior.

Page 2: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–93 83

In the second part of the paper we study how recombinationaffects the results. Genetic recombination takes place when anewly constructed nucleic acid molecule arises from multipletemplate strains. Recombination events in HIV-1 are common(Onafuwa-Nuga and Telesnitsky, 2009; Robertson et al., 1995;Wooley et al., 1997; Sabino et al., 1994) and result from multipleinfections of the same target cell (Chen et al., 2005; Dang et al.,2004; Jung et al., 2002; Levy et al., 2004). A number of studieshave indicated that the HIV recombination rate is several timeslarger than its mutation rate (Batorsky et al., 2011; Jetzt et al.,2000; Levy et al., 2004; Shriner et al., 2004; Suryavanshi and Dixit,2007; Neher and Leitner, 2010; Zhang et al., 2010), and thatrecombinants reach fixation and undergo a rapid expansionin the population, thus representing an evolutionary shortcut(Ramirez et al., 2008).

Previous recombination studies have suggested that recombi-nation affects a virus's ability to escape CTL responses (Mostowyet al., 2011) and has a significant impact on the emergence of drugresistance (Althaus and Bonhoeffer, 2005; Fraser, 2005; Kouyoset al., 2009; Moutouh et al., 1996; Vijay et al., 2008). In Vijay et al.(2008) the authors use a simulation to study the effects ofrecombination on the changes in mean Hamming distance withtime. However, in the works cited above, it is assumed that thehost's immune response plays a major role in defining the fitnessof the viral strains. In this paper, we address recombination in ouroriginal model of exponential viral growth with no positiveselection pressure from the environment, a scenario that isrelevant to early dynamics of viral growth in the acute phase ofHIV infection prior to the initiation of an effective adaptiveimmune response. Under this scenario, we see that recombinationdoes not affect the linearity in time of the mean HD growth,though it does change the growth of the variance of the HDdistribution.

2. Results

2.1. Continuous time model of viral evolution

In Lee et al. (2009) we described two simplified models of viralevolution, where cell infection and cell death happen at fixed,discrete times. In this section we present a more realistic modelbased on a continuous time birth and death process. We alsointroduce age-dependency in the model, where age represents thetime since a cell was infected. The corresponding stochasticformulation is presented in Appendix A.

As stated in Lee et al. (2009), we assume that one uniquegenetic strain, called the transmitted/founder virus, initiates theinfection. In general, it takes some time before the host's immuneresponse kicks in, and since during the early phase of the infectionthe viral population is much smaller than the target cell popula-tion, we assume that viral evolution is initially driven by expo-nential growth and random accumulation of mutations at allowedsites (Ribeiro et al., 2010). Genetic diversity can also be achievedthrough recombination, which takes place when two or moredistinct strains infect one cell and give rise to new, recombinedgenomes. However, when the viral population is sufficiently smallwith respect to the target cell population, the effect of recombina-tion during this initial phase of the infection should be negligible.Under these hypotheses, mutations from the founder strainaccumulate randomly following a Poisson distribution (Lee et al.,2009).

In all our analyses, we ignore the presence of proteins such asAPOBEC, which results in hypermutated sequences and can modifythe mutation rate. We have included the capability to identify and

exclude APOBEC mediated hypermutation in previous implemen-tations of our model (Giorgi et al., 2010).

Once a virion enters a cell, the HIV enzyme reverse transcrip-tase transcribes the viral RNA into DNA, which is then integratedinto the host genome. A viral genome integrated into the DNA ofthe host cell is called a provirus. The time from viral entry into atarget cell until the first virions start budding out (called theeclipse phase) has been estimated to last about 24 h (Perelsonet al., 1996; Markowitz et al., 2003), although in some cases thevirus can lie dormant inside the cell for years. The infected cell willon average survive another 24 h while producing virus (Markowitzet al., 2003), and during this time it will produce tens of thousandsof new viral particles (Chen et al., 2007). HIV is rapidly clearedfrom circulation, with an average half-life of roughly 45 min (DeBoer et al., 2010; Ramratnam et al., 1999), and on average, in earlyinfection, between 6 and 10 virions go on to successfully infectnew cells (Ribeiro et al., 2010). The number of virions thatsuccessfully infect new cells in this initial phase is called the basicreproductive ratio, usually denoted R0.

In all that follows, we neglect the time a virus spends outside acell and choose to follow provirus instead. We only considermutations that may occur during reverse transcription and neglectthe extremely rare mutations that happen after integration in thecell's genome.

Given this framework, we let Iða; g; tÞ be the number of infectedcells that at time t are of age a (i.e. a represents the time elapsedsince the cell was infected) and generation g (i.e. the genome itcarries has undergone g infection cycles since the transmittedvirus, with one reverse transcription event occurring at eachinfection). Here a and t are non-negative, real variables, and g isa non-negative integer.

We assume that both the birth rate αðaÞ (the number of cellsinfected by each cell of age a per unit time) and death rate βðaÞ (thenumber of cells of age a that die per unit time) are functions thatdepend only on the age of the infected cell, a, and not on time t.Notice that αðaÞ is in fact an infection rate. Here we choose to call itbirth rate to show that the classic birth and death process providesa good infection model when neglecting the time the virus spendsoutside the cell. Let Λ denote the lag before new virions startbudding out of the infected cell (i.e. the length of the eclipsephase) and impose that αðaÞ ¼ 0 for a≤Λ. In the discussion thatfollows, we implicitly assume that all functions evaluate to zero onnegative arguments.

It has been shown (Von Foerster, 1959; Nisbet and Gurney,1982; Huddleston, 1983) that the dynamics of an age-structuredbirth and death system is determined by the following differentialequations, called the Von Foerster equation and the renewalcondition, respectively:

∂∂t

Iða; g; tÞ þ ∂∂a

Iða; g; tÞ ¼ −βðaÞIða; g; tÞ

Ið0; g þ 1; tÞ ¼Z t

0Iða; g; tÞαðaÞ da ð1Þ

with initial conditions:

Iða; g;0Þ ¼ I0 if a¼ g¼ 00 otherwise:

�ð2Þ

The first equation in (1) describes how infected cells are lost viadeath, whereas the second one describes how newly infected cellsare formed via births. This is a classic age-structured birth anddeath system (Murray, 2002; Kevorkian, 2002), with the additionof the parameter g for generation cycles. Such models are calledage-maturity structured and have been described in the context ofcell kinetics, where the parameter g in that case represents thenumber of cell divisions (Bernard et al., 2003; Zilman et al., 2010).

Page 3: Modeling sequence evolution in HIV-1 infection with recombination

Fig. 1. We generated Iðg; tÞ using the continuous, age-dependent model describedin Section 1. Here NB ¼ 2600 base pairs, αðaÞ ¼ α0ða−ΛÞ for a4Λ, and βðaÞ ¼ β0a. Ateach day we sampled N0 ¼ 30 sequences, calculated the mean HD and thenaveraged over 10,000 runs. The blue dots shot the mean HD for one particularrun, the red dots the average over all runs, and the black dashed line underneathshows the mean HD calculated from the theoretical framework. The vertical dashedlines are the 95% confidence intervals.

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–9384

Let f ðaÞ ¼ αðaÞe−R a

0βðsÞ ds. Notice that e−

R a

0βðsÞ ds can be inter-

preted as the expected fraction of cells whose age at death isgreater than a. Therefore, f(a) represents the expected rate atwhich an infected cell survives up to age a and starts newinfections. Let ~f be the Fourier transform of f, i.e.

~f ðkÞ ¼Z ∞

Λf ðaÞeika da: ð3Þ

We assume that βðaÞ is a strictly positive function withlima-∞βðaÞ≠0 and such that lima-∞f ðaÞ ¼ 0. It follows (seeAppendix B) that a closed solution of Eq. (1) is given by

Iðg; tÞ ¼ I02π

Z t

0e−R a

0βðsÞ ds

Z ∞

−∞

~f ðkÞge−ikðt−aÞ dk da ð4Þ

2.2. Mean hamming distance

In Keele et al. (2008) we attained plasma samples from HIV-1subjects early after infection, from which viral sequences werederived. Roughly 80% of those subjects were infected by a singleviral strain, and we called those infections homogeneous. Underour model of neutral exponential growth, mutations from thefounder strain accumulate randomly following a Poisson distribu-tion due to reverse transcription errors introduced when the virusinfects a cell. As mutations accumulate, the mean of the Poissondistribution grows with time. Therefore, given a sample of viralsequences taken from a homogeneously infected subject, it ispossible to count the number of mutations across all pairs ofsequences, fit a Poisson distribution to the frequency counts, anduse the mean of the Poisson to estimate how much time haselapsed since the beginning of the infection.

The Hamming distance (HD) between two genetic sequences isdefined as the number of bases at which the two differ. In ouroriginal model (Lee et al., 2009; Giorgi et al., 2010), we showedthat for homogeneous infections, in the absence of selection, themean HD is given by

E½HD�ðtÞ ¼∑ddPðHD¼ djtÞ≈∑

d

∑gIðg; tÞPBinomðd;2gNB; ϵÞ∑gIgðtÞ

ð5Þ

where ϵ is the viral mutation rate per base per generation, NB is thelength of the sequences, and g is the number of generation stepssince the most recent common ancestor, or MRCA. PBinomðd;n; pÞ isthe binomial distribution with parameters n and p:

PBinomðd;n;pÞ ¼n

d

� �pdð1−pÞn−d:

Here we are neglecting the possibility of multiple mutations at thesame site, as these are expected to be very rare in early HIVinfection (Lee et al., 2009).

In the synchronous infection model described in Lee et al.(2009), we assume that each provirus produces R0 new provirusesand that all infected cells are of the same generation and are bornand die synchronously. We calculated Eq. (5) for Iðg; tÞ ¼ Rg

0(synchronous model) and for a slightly more complicated, but stillsimplified, scenario in which cells produce virions in two syn-chronous bursts, as a first approximation to an asynchronousinfection. In both cases we saw that the mean HD (calculated overall pairs of sequences in a given sample) grows linearly with timeand is proportional to 2gϵNB. In Lee et al. (2009) we showed thatthe MRCA is at most a couple of generations away from thetransmitted/founder strain, and it coincides with the transmitted/founder strain in most cases. Since, in HIV infections, one replica-tion cycle is roughly 2 days (Markowitz et al., 2003), we can usethis result to estimate the time since the MRCA in early homo-geneous infections (Keele et al., 2008; Wood et al., 2009; Keeleet al., 2009).

We now show that the linearity in time of the mean HD growthremains unchanged when substituting the more general expres-sion of Iðg; tÞ from Eq. (4) into Eq. (5). We first derive theexpression of the mean HD at generation g when the death rateis constant, i.e. βðaÞ ¼ β0 for all a40, and the birth rate is apiecewise linear function in age of the form αðaÞ ¼ 0 for all a≤Λ,and αðaÞ ¼ α0ða−ΛÞ for all a4Λ. Expanding Eq. (4) using theseparticular birth and death rates, one obtains

Iðg; tÞ ¼ I0αg0

ð2gÞ! ½ðt−ΛgÞ2g−ðΛgÞ2g�e−β0t ð6Þ

Given the biological properties of HIV infection, we can imposethe condition that both β and α be positive, analytic, and asymp-totically constant functions, and, in addition, that αðaÞ be null bothat 0 and at infinity. We can then extend the above derivation ofIðg; tÞ for general death and birth rates by noting that a boundeddeath rate function βðaÞ can always be approximated with a stepfunction (a function that is constant on a given partition), and ~f inEq. (4) can be calculated for a generic birth rate from Eq. (3) usingthe Taylor expansion of α.

Finally, we use a simulation based on Eqs. (4) and (5) togenerate Iðg; tÞ for different choices of piecewise linear death ratefunctions and a piecewise linear birth rate function as definedabove. For these choices of birth and death rate functions, wegenerate the full population of infected cells, out of which werandomly sample N0 sequences at each day since the start ofinfection. At any given time t, we assume that each infected cell inIðg; tÞ carries one proviral strain that has undergone exactly ginfection cycles from the MRCA. Therefore, for each one of the N0

sequences, knowing how many infection cycles it has undergone,we randomly draw from a Poisson distribution the number ofmutations it has accumulated from the MRCA. We then calculatethe mean intersequence HD assuming a star phylogeny, in otherwords, for any two given sequences s1 and s2 in the sample,HD½s1; s2� ¼HD0½s0; s1� þ HD0½s0; s1�, where s0 is the MRCA, and HD0

is the Hamming distance from the MRCA. Finally, we calculate themean HD over all pairs and average over 10,000 runs. In Fig. 1 weshow the results when both death and birth rates are linearfunctions of age. Different choices of birth and death rates donot change the linearity in growth of the mean HD, but only affect

Page 4: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–93 85

the rate at which the mean HD grows (results not shown). Thisresult confirms what we have previously shown in Lee et al.(2009) using a discrete generation model, namely that the meanHD is proportional to 2ϵgNB.

2.3. Recombination

We now discuss the effect of recombination on the growth ofthe mean HD. We start with the same scenario described in Leeet al. (2009): a single infecting strain establishes infection, noselection, exponentially growing viral population, and a negligibleprobability of observing doubly mutated sites (we show in Leeet al. (2009) that the probability of back mutations is ≤Oðg2ϵ2=NBÞ,which is negligible in our scenario of early infections). This is anapt description of early, heterosexually transmitted, HIV infections(Keele et al., 2008), and, under this scenario, prior to the onset ofselection, the HD frequency counts follow a Poisson distribution.However, while in a Poisson model the mean HD of a sample of Nsequences should approximately equal the HD variance timesN=N−1, the HD variance was 4.7% lower than expected whenaveraging across all homogeneous subjects described in Keeleet al. (2008). This difference fell within the 95% CI and was notstatistically significant. Nonetheless, given the current advances insequencing technology and the availability of larger sample sizes(i.e. 454 sequencing Margulies et al., 2005) it is interesting toinvestigate whether this difference is due to recombination, whichwe neglected in our original model.

We now show that in the presence of recombination the HDfrequency counts no longer follow a Poisson distribution, thusexplaining possible divergences between the variance and themean. As discussed in the previous section, in the absence ofrecombination, our simulation results show that both the contin-uous and stochastic models yield the same basic results as theprevious simplified discrete generation model, namely that themean HD is proportional to 2ϵgNB.

Preliminary phylogenetic analyses conducted on viral samplesattained at serial time points from early, homogeneous infectionssuggest that the mutation rate ϵ may vary greatly across patients(Wallstrom et al., 2010). Therefore, in all that follows, we revert toour simplified, synchronous infection model and assume thatchanges in the mean HD growth rate caused by possible diver-gences from such a simplified model are folded into the parameterϵ, which needs to be measured in each patient individually.

To allow for the possibility of multiple strains infecting thesame cell and possibly giving rise to a recombinant strain, weintroduce a recombination rate 0oρo1. As a simplification, ateach generation we let each viral strain either undergo a mutationevent or a recombination (not both) with probabilities ρ andð1−ρÞϵ, respectively. The more general case can easily be solvedusing the same methods. We assume ρ is constant in time and, forsimplicity, we also assume that the parental strains recombine at asingle position θ, which is a uniformly distributed random variablebetween 0 and the length NB of the genome. Given the MRCA s0,let HD0ðs; tÞ be the Hamming distance of sequence s from s0, wheres was sampled at time t. For sufficiently small sample sizes, earlyhomogeneous infections follow a star-like phylogeny and both theHD0 and the intersequence HD follow a Poisson distribution (Leeet al., 2009; Keele et al., 2008). In particular, the expectations arerelated, i.e. E½HD� ¼ 2E½HD0�. Under these assumptions, the HD0

probability distribution at time t+dt is given by

~P ðHD0 ¼ d; t þ dtÞ ¼ ð1−ρÞPϵðHD0 ¼ d; t þ dtÞ þ ρPρðHD0 ¼ d; t þ dtÞð7Þ

where Pϵ and Pρ are the probability of a genome being at distanced from s0 after a mutation or a recombination event, respectively.

Suppose that Pρ has mean μρ and variance s2ρ . Similarly, we denoteby μϵ and s2ϵ the mean and variance of Pϵ. Then

~μðtÞ ¼ EðHD0jtÞ ¼ ð1−ρÞμϵðtÞ þ ρμρðtÞ ð8Þand

~s2 ¼ VarðHD0Þ ¼ s2ϵ þ 2ρμϵðμϵ−μρÞ þ ρðμ2;ρ−μ2;ϵÞ−ρ2ðμϵ−μρÞ2 ð9Þwhere μ2;ϵ and μ2;ρ indicate the second moments of Pϵ and Pρ,respectively. When ρ¼ 0, we get ~s2 ¼ s2ϵ and when ρ¼ 1, ~s2 ¼ s2ρ .Notice also that when μϵ ¼ μρ one has

~s2ðtÞ ¼ s2ϵ ðtÞ þ ρðμ2;ρðtÞ−μ2;ϵðtÞÞ ð10ÞIn what follows, we suppress the variable t when it is clear fromthe context. When viral genomes undergo mutation:

PϵðHD0 ¼ d; t þ dtÞ ¼ ∑d

k ¼ 0

~P ðd−k; tÞPmðk; dtÞ ð11Þ

where Pmðk; dtÞ is the probability of having kmutations appear in atime interval dt. Assuming the independence of ~P and Pm, we canuse the fact that cumulants add under convolution. Let dλ be themean of the distribution Pm in the time interval dt (for example, ina discrete, synchronous model, it would be the mutation ratetimes the sequence length). Then, omitting Oðdλ2Þ terms,

μϵðt þ dtÞ ¼ ~μðtÞ þ dλ

s2ϵ ðt þ dtÞ ¼ ~s2ðtÞ þ dλ

μ2;ϵðt þ dtÞ ¼ ~μ2ðtÞ þ ð1þ 2 ~μðtÞÞ dλ ð12Þ

2.4. Recombination alone

In order to compute Pρ, it is useful to first envision a scenario inwhich an initial, genetically diverse population of I0 sequences isallowed to grow exponentially with no accumulation of mutations.In other words, we set ϵ¼ 0 and ρ¼ 1. Suppose the HD0 distribu-tion of the I0 sequences has initial mean μ0 and variance s20. Ateach generation a recombination position θ is drawn randomlyfrom a uniform distribution, with θ¼ 0;…;NB, where the twoparental strings split and recombine. If θ¼ 0, then the recombi-nant is exactly the first sequence, whereas if θ¼NB, the recombi-nant is the second sequence. Under this scenario it is easy to seethat the mean Hamming distance of a population of sequences willremain unchanged from the initial μ0, whereas the total variancein the population will decline. As we show in Appendix C, at anygeneration g, the expectation values of the mean HD0, μðgÞ, and theestimator of the variance s2ðgÞ are given by

μðgÞ ¼ μ0 ð13Þand

s2ðgÞ ¼ NB−1NB

μ0 þ1NB

s20 þ 2s20−μ0g þ 2

NB

NB þ 1

� �g−1

þ OðN−1B Þ

" #∏g−1

j ¼ 0

Nj−1Nj

ð14Þwhere μ0 and s20 are the mean and variance HD0 of the initialpopulation of I0 sequences, Nj is the size of the population atgeneration j, and NB is the total sequence length.

In order to test the formulae in (14), we devised a simulationbased on the previously described synchronous infection model(Lee et al., 2009) where no site is mutated more than once (infinitesite assumption). We start at generation g¼0 with I0 sequences,each with an initial HD0 drawn from a probability distributionwith mean and variance μ0 and s20, respectively (in the simulationsshown in Figs. we used μ0 ¼ s20 ¼ 10). At each later generation weresample I0R

g0 sequences from the previous population of size

I0Rg−10 , and introduce recombination and mutation events as

described below.

Page 5: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–9386

We first tested the scenario where ρ¼ 0 and ϵ¼ 0:003 (we used amuch larger value than the estimated 10−5 HIV-1 mutation rate inorder to make mutations accumulate faster since by the time we reacha population size of 210 or larger the simulation becomes computa-tionally intensive). In the mutation only scenario, we randomly drawthe number of positions to mutate from a Poisson distribution withmean ϵNB, where NB is the length of the sequences. If HD0ðs; g−1Þ isthe HD0 of a sequence s at generation g−1, and d is randomly drawnfrom Poisðλ¼ ϵNBÞ, then HD0ðs; gÞ ¼HD0ðs; g−1Þ þ d, and the d posi-tions where these mutations take place are drawn randomly from auniform distribution. Conversely, in the recombination only scenario,i.e., when ρ¼ 1 and ϵ¼ 0, at every generation g, for every sequence sin the sample, we draw the two parents from the previous generationand a recombination position θ uniformly and form s as therecombinant child.

In Figs. 2 and 3 we show the mean HD0 and variance HD0 for thetwo scenarios, respectively, averaged over 10,000 runs. The simula-tion is represented by the black dots, and the theoretical results areshown in the red dashed line. 95% confidence intervals are shown in

Generations

Tot

al M

ean

0 1 2 3 4 7

10

11

12

13

5 6

Mean HD, ρ=0, ε=0.003, 10,000 runs

Fig. 2. Comparison between theory (dashed line) and simulation (dots) for an exponenvariance of the HDs are averaged over 10,000 runs. At generation g¼0 the initial popuintervals (71.96 times the standard deviation over runs) are represented by vertical segvisible. (For interpretation of the references to color in this figure caption, the reader is

Fig. 3. Comparison between theory (dashed line) and simulation (dots) for an exponevariance of HD are averaged over 10,000 runs. At generation g¼0 the initial population(71.96 times the standard deviation over runs) are represented by vertical segments.references to color in this figure caption, the reader is referred to the web version of th

gray, though in Fig. 2 they are barely visible. In both scenarios thesimulation results fall within the 95% confidence intervals from thetheoretical expectations. Note that in Fig. 3, since the fluctuations atdifferent times are positively correlated, all points remain above themean, following a large fluctuation in the first generation.

2.5. Mutations and recombination

In Appendix C we generalize the methods used to deriveEq. (14) and find recursive formulae for the general case whenboth ρ and ϵ are non-zero. For the mean HD0 we obtain

μðgÞ ¼NBϵð1−ρÞg þ μ0: ð15ÞThis shows that even in the presence of recombination, the meanHD0 grows linearly like in the mutation only scenario. The slope ofthe linear growth here is diminished by a factor 1−ρ, which reflectsthe fact that in our model a recombining sequence does not mutate.

Let ξ¼ ð1−ρÞϵ, and let x and y be any given positions in thegenome. In Appendix D we derive the following expressions for

0 1 2 3 4 5 6 7

10.0

10.5

11.0

11.5

12.0

12.5

13.0

13.5

Generations

Tot

al V

aria

nce

HD variance, ρ=0, ε=0.003, 10,000 runs

tially growing population with ρ¼ 0 (no recombination) and ϵ¼ 0:003. Mean andlation is made of 100 sequences with HD mean and variance λ0 ¼ 10. Confidencements, which are smaller than the size of the symbols in the Fig. and hence are notreferred to the web version of this article.)

ntially growing population with ρ¼ 1 and ϵ¼ 0 (recombination only). Mean andis made of 100 sequences with HD mean and variance λ0 ¼ 10. Confidence intervalsErrors at different time points are positively correlated. (For interpretation of theis article.)

Page 6: Modeling sequence evolution in HIV-1 infection with recombination

Fig. 5. Comparison between theory (dashed line) and simulation (dots) of thecontribution of each position pair (denoted x and y, which here we took to be onethird and two thirds of the genome) to the variance of HD0 for an exponentiallygrowing population. Here ρ¼ 0:5 and ϵ¼ 0:003. These results were obtained from10,000 independent simulation runs with the same initial conditions, as describedin the main text. Confidence intervals (71.96 times the standard deviation overruns) are represented by vertical segments. (For interpretation of the references tocolor in this figure caption, the reader is referred to the web version of this article.)

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–93 87

the HD0 mean μx at position x, the HD0 variance estimator s2x atposition x, and the HD0 covariance estimator sxy between any twopositions x and y:

μxðg þ 1Þ ¼ μxðgÞ þ ξ ð16Þ

s2x ðg þ 1Þ ¼ Ng−1Ng

s2x ðgÞ þ ξð1−ξÞ ð17Þ

sxyðg þ 1Þ ¼ 1−ρy−x

NB þ 1

� �Ng−1Ng

sxyðgÞ−Ngþ1

Ngþ1−1ξ2

þξ ϵ−1

Ngþ1

� �−2ϵξ2

1−ϵNgþ1

ð18Þ

By summing Eqs. (16)–(18) across all positions x and y we getthe mean HD0 and the variance estimator s2ðgÞ in the general casewhen both mutation and recombination are present. Again, weverified these theoretical derivations against numerous simulationruns. The simulation was designed as follows.

For a fixed recombination rate ρ and mutation rate ϵ, we firstdraw the number of recombination events from a binomialdistribution with probability ρ. Subsequently, for each sequence sin the new generation, if s is a recombinant, we uniformly drawthe two parents and a recombination position θ, and form s as therecombinant child. Else, we randomly draw a number of mutatedpositions from a Poisson distribution with mean ϵNB, where NB isthe length of the sequences. If HD0ðs; g−1Þ is the HD0 of a sequences at generation g−1, and d is randomly drawn from a Poissondistribution with mean ϵNB, then HD0ðs; gÞ ¼HD0ðs; g−1Þ þ d, andthe d positions where these mutations take place are drawnrandomly from a uniform distribution.

Because of computational limitations, we limited the numberof generations to 6–8 and simulated 10,000 sets, then averaged theHD0 means and variances and compared the results to thetheoretical results discussed in the previous sections. Instead ofsimulating from the very beginning of the infection, we started atgeneration g¼0 with I0 ¼ 100 sequences, each proviral sequencewith an HD0 from the founder strain randomly drawn from aPoisson distribution with mean λ¼ 10. The code was written in R(R Development Core Team, 2010).

In Figs. 4–6 we compare the simulation results with thetheoretical derivations. The simulation results are denoted withthe black dots and the theoretical derivations with the red dashed

Variance, ρ=0.5, ε=0.003, 10,000 runs

Generations

Var

ianc

e at

pos

ition

X

0 1 2 3 4 5

0.048

0.049

0.050

0.051

0.052

0.053

Fig. 4. Comparison between theory (dashed line) and simulation (dots) of the contribugenome) to the variance of HD0 for an exponentially growing population. Here ρ¼ 0:5 anwith the same initial conditions, as described in the main text. Confidence intervals (7Notice that the errors are positively correlated. (For interpretation of the references to c

lines. Notice that because Eqs. (D.3) and (D.4) were obtained up toterms of the order N−2, where N is the sample size, when we carrythe summation over all positions to obtain the general expressionfor s2ðgÞ, the fluctuations become of the order N−1. In Fig. 6 weomit the comparison with the theoretical derivation for the firsttwo generations. In fact, for these, the sample size is small enoughthat fluctuations are sizable. However, starting from g¼3, theoryand simulation overlap almost perfectly.

We performed a simulation with parameters taken from theearly homogeneous sample SUMA, previously described in Keeleet al. (2008) and Lee et al. (2009) (Fig. 7). This time the initialpopulation consisted of a single infecting strain and, at eachgeneration, we sampled N¼35 sequences, which was the sample

Variance, ρ=0.5, ε=0.003, 10,000 runs

0 1 2 3 4 5

0.048

0.049

0.050

0.051

0.052

Generations

Var

ianc

e at

pos

ition

Y

tion of each position (denoted x and y, located at one third and two thirds of thed ϵ¼ 0:003. These results were obtained from 10,000 independent simulation runs1.96 times the standard deviation over runs) are represented by vertical segments.olor in this figure caption, the reader is referred to the web version of this article.)

Page 7: Modeling sequence evolution in HIV-1 infection with recombination

Mean, ρ=0.5, ε=0.003, 10,000 runs

Generations

Tot

al M

ean

0 1 2 3 4 5

10.0

10.2

10.4

10.6

10.8

11.0

11.2

0 1 2 3 4 5

10.0

10.2

10.4

10.6

10.8

11.0

11.2

11.4

Variance, ρ=0.5, ε=0.003, 10,000 runs

Generations

Tot

al V

aria

nce

Fig. 6. Comparison between theory (dashed line) and simulation (dots) of the mean and variance HD0 for an exponentially growing population with ρ¼ 0:5 and ϵ¼ 0:003.The left panel represents the mean, and the right panel the variance, both averaged over 10,000 runs. Like before, at generation g¼0 the initial population is made of 100sequences with HD mean and variance λ0 ¼ 10. The simulation is represented by the black dots, and the theory by the dashed red line. Confidence intervals (71.96 timesthe standard deviation over runs) are represented by vertical segments. In the right panel, the first few generations are omitted. The theory we developed neglects earlystochastic events, which are of the order 1=Ng , where Ng is the population size at generation g. Hence, we find that only when the population size is large enough, dotheory and simulation overlap. (For interpretation of the references to color in this figure caption, the reader is referred to the web version ofthis article.)

Mean HD, no recombination

Generations

Tot

al M

ean

HD Variance, no recombination

Generations

Tot

al V

aria

nce

Mean HD, with recombination

Generations

Tot

al M

ean

1 2 3 4 5 6 7 8

−0.5

0.0

0.5

1.0

1.5

1 2 3 4 5 6 7 8

−0.5

0.0

0.5

1.0

1.5

1 2 3 4 5 6 7 8

−0.5

0.0

0.5

1.0

1.5

1 2 3 4 5 6 7 8

−0.5

0.0

0.5

1.0

1.5

HD Variance, with recombination

Generations

Tot

al V

aria

nce

Fig. 7. Simulation runs based on the patient SUMA (Keele et al., 2008). Assuming a single strain initiated the infection, at every generation we sampled N¼35 sequences,calculated the HD mean and HD variance, and then repeated 10,000 times. The top panels represent the simulation run in the absence of recombination, the bottom panelsrepresent the simulation when recombination was present (ρ¼ 2� 10−5, as estimated in Neher and Leitner, 2010). The dots represent the HD mean (left) and HD variance(right) averaged over 10,000 runs. The vertical bars represent the 95% CIs. Finally, the diamond denotes SUMA's HD mean (left) and HD variance (right). As one can see, theeffect of recombination is so small that both models fit the data well, thus validating our previous approach.

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–9388

Page 8: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–93 89

size of the original alignment. Under this particular scenario, onaverage, the HD variance was lower than the mean HD. This wasconsistent with what observed across all homogeneous patients inKeele et al. (2008), for whom the HD variance was, on average,4.7% lower than what expected from a Poisson model. According toour original Poisson model, SUMA was estimated to be fivegenerations into the infection, which was consistent with a Fiebigstage II (Fiebig et al., 2003). In Fig. 7 we show that addingrecombination to the model does indeed cause the HD frequencycounts to no longer follow a Poisson distribution, but the effect isso small that the estimate attained through our original model stillremains valid.

3. Conclusions

We developed a model to study the rates of evolution and geneticdiversification in acute HIV-1 infection. We used the model toexamine the effects of recombination in acute HIV infection sampledprior to the onset of selection and during the exponential growth ofthe viral population. Previously, we neglected recombination andused a simplified model of infectionwhere all infected cells producedvirions and died synchronously (Lee et al., 2009). Using this modelwe showed that the mean intersequence HD grows linearly withtime as EðHDÞ∝2ϵNBg, where ϵ is the mutation rate, NB is the lengthof the sequences, and g is the generation number. We applied thismodel to homogeneously infected subjects (Keele et al., 2008; Woodet al., 2009) and, using the linearity in time of the mean HD, for eachhomogeneously infected subject, were able to estimate the time sincethe most recent common ancestor.

In this paper, we explored a more complex model of HIVreplication. This was motivated by the fact that in general, ourestimates in Keele et al. (2008) were well correlated with the timesince the infection obtained from clinical data. However, in aPoisson model, we expect the mean HD0 to be equal to thevariance of the HD0 times N−1=N, where N is the sample size.Even though not statistically divergent from a Poisson distribution,all homogeneous subjects presented in Keele et al. (2008) had avariance HD that was on average 4.7% lower than the expectedtheoretical value had the distribution been a Poisson. Here weshowed that even in an age-dependent continuous model ofinfection, the mean HD0 still grows linearly with time, and thiscontinues to hold in a fully stochastic model. While recombinationdid not affect the linearity of the mean HD0 growth, in thepresence of recombination the HD0 distribution is no longerPoisson. In this paper we modeled the new dependency of thevariance HD0 with time, as it no longer equals that of the meanHD0. We also showed how this new framework helps explainsome of the divergence between mean and variance HD inhomogeneous samples previously described in Keele et al.(2008). Furthermore, because the effect is small, our work vali-dates our previous methods that estimate the time of infectionbased on a Poisson model.

This work suggests that our previous, simplified model (Leeet al., 2009) provides reasonable estimates of the time and numberof generations since a founder strain initiates the infection even inthe presence of in vivo recombination. It is also robust againstinstances of in vitro recombination (Salazar-Gonzalez et al., 2008),which becomes a relevant issue particularly when using 454deep sequencing (Fischer et al., 2010). For 454 samples in parti-cular, where sample sizes can reach the tens of thousands, ourmethod provides a robust alternative to computationally intensemethods such as those employed in the software package BEAST(Drummond and Rambaut, 2007).

Finally, we notice that the mathematical framework developedhere can be used in cell division models where it is relevant to

keep track of the number of divisions cells undergo (generations)(Zilman et al., 2010; Bernard et al., 2003).

Acknowledgments

This work was supported by the US Department of Energythrough the LANL LDRD program, the Center for HIV/AIDS VaccineImmunology, and NIH grants U19-AI067854-07, UM1-AI100645-01,AI028433, and OD011095.

Appendix A. Stochastic model

We present a stochastic formulation of the continuous modelpresented in Section 2.1. In all that follow we use the samemathematical framework developed for birth and death stochasticprocesses, with the added parameter g for generation. To keep thediscussion more general, we will use the word “births” instead of“new infections.” Clearly, when applied to the infection modeldescribed in this paper, new births are newly infected cells.

Denote U ¼Rþ � Zþ the space of all pairs u¼ ða; gÞ, as definedin Section 2. We define a state of the system as a functionI : U-Zþ that associates to every u∈U a positive integerIðuÞ ¼ Iða; gÞ, the number of infected cells of age a and generation g.

Let W be the space of all functions I : Rþ � Zþ-Zþ. At anygiven time t∈Rþ, and for any given state I∈W, let PðI; tÞ be theprobability of the system being described by the state I at time t.Even though P : W � Rþ-½0;1�, for simplicity we omit the depen-dency in time and consider P as an element in SðWÞ, whereSðWÞ ¼ fW-½0;1�g, the space of all operators from W into ½0;1�.

To better illustrate the derivation, we first show how to obtainthe master equation discretized in time increments Δt. DefineΔ : Rþ �W-W as follows. Given τ∈Rþ, I∈W, and ða; gÞ∈U, thenΔτIða; gÞ ¼ Iðaþ τ; gÞ−Iða; gÞ. From the above notice that for everyI∈W and every ða; gÞ∈U, ðI þ ΔτIÞða; gÞ ¼ Iðaþ τ; gÞ. ΔτI representsthe difference between state I at time t þ τ and state I at time twhen cells age only (e.g. no deaths nor births). It follows that,given Itþτ , the state at time t is Itþτ−ΔτI.

We also define the Kroenecker Delta operator as δ : U-W asfollows. Given u0∈U, δu0∈W is defined so that for every u∈U suchthat u≠u0, δu0 ðuÞ ¼ 0 and δu0 ðu0Þ ¼ 1.

Suppose now that at a given time t, the system is completelydescribed by the state I. We wish to compute the probabilityPðI; t þ ΔtÞ. In other words, we want to see how the system evolvesafter an increment in time of Δt. There are three possiblecontributions to the term PðI; t þ ΔtÞ. In what follows we assumethat for every age a and generation g, out of all infected cellsIða; gÞ only one can die or be born at the time. Hence, thecontributions are

1.

No deaths, no births, cells can only survive. The contribution inthis case is

PðI−ΔΔt I; tÞ∏u∈U

ð1−βðuÞΔtÞðI−ΔΔt IÞðuÞ

∏u∈U

ð1−αðuÞΔtÞðI−ΔΔt IÞðuÞ ðA:1Þ

2.

One cell u0 dies. The contribution (summed over all possibleu0∈U) is

∑u0∈U

PðI þ δu0 ; tÞ ∏u≠u0

ð1−βðuÞΔtÞIðuÞ !

βðu0ÞΔtðIðu0Þ þ 1Þ ðA:2Þ

Page 9: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–9390

3.

One cell u0 is born. The contribution (summed over all possibleu0∈U) is

∑u∈U

PðI−δu0 ¼ ð0;gþ1Þ; tÞ ∏uð1−βðuÞΔtÞIðuÞ

� �αðuÞΔtðIðuÞ−δu0 ¼ ð0;gþ1ÞÞ ðA:3Þ

We expand each of the above terms in Δt and neglect terms inOðΔt2Þ to obtain, for each piece, respectively:

1.

PðI−ΔΔt I; tÞ 1− ∑

u∈UðI−ΔΔt IÞðuÞ½βðuÞ þ αðuÞ�Δt

� �ðA:4Þ

2.

∑u∈U

PðI þ δu; tÞðIðuÞ þ 1ÞβðuÞΔt ðA:5Þ

3.

∑u∈U

PðI−δu0 ¼ ð0;gþ1Þ; tÞðIðuÞ−δu0 ¼ ð0;gþ1ÞÞαðuÞΔt ðA:6Þ

The first order master equation of the process, discretized in time,is therefore

PðI; t þ ΔtÞ−PðI−ΔΔt I; tÞΔt

¼−PðI−ΔΔt I; tÞ ∑u∈U

ðI−ΔΔt IÞðuÞ½βðuÞ þ αðuÞ�� �

þ∑u∈U

PðI þ δu; tÞðIðuÞ þ 1ÞβðuÞ

þ∑u∈U

PðI−δð0;gþ1Þ; tÞðIðuÞ−δu0 ¼ ð0;gþ1ÞÞαðuÞ

ðA:7ÞTo obtain the continuous master equation we have to take the

limit for Δt-0 on both sides of Eq. (A.7). In order to do so, we usean approximation that is a direct consequence of the definition offunctional derivative (Gelfand et al., 1963): given x; y∈W andf∈SðWÞ, the following holds

f ðxþ yÞ≃f ðxÞ þZUyðuÞ ∂f

∂xðuÞ du ðA:8Þ

Using the above, the left-hand side of Eq. (A.7) becomes

∂P∂t

−ZU

∂I∂a

ðuÞ ∂P∂IðuÞ du ðA:9Þ

Notice that in the absence of births and deaths (i.e. α¼ β¼ 0),the above reduces to

∂P∂t

ðI; tÞ−ZU

∂I∂a

ðuÞ ∂P∂IðuÞ ðI; tÞ du¼ 0 ðA:10Þ

which describes the survival of all cells present at time t¼0.In order to expand the right-hand side (R.H.S.) of Eq. (A.7), we

collect the terms in β together, and similarly for the terms in α. Weapproximate to the first order in Δt and let Δt-0.

RHSðβÞ ¼ PðI; tÞ ∑uIðuÞβðuÞ þ∑

uβðuÞ−∑

uIðuÞβðuÞ

� �

þ∑u

∂P∂IðuÞ ðIÞ� �

IðuÞβðuÞ þ oðΔtÞ ðA:11Þ

In the limit the above becomes

PðI; tÞZUβðuÞ duþ

ZUβðuÞ ∂P

∂IðuÞ IðuÞ du ðA:12Þ

By expanding RHS (α) in a similar way, one obtains thecontinuous master equation:

∂P∂t

ðI; tÞ−ZU

∂I∂a

ðuÞ ∂P∂IðuÞ ðI; tÞ du¼

ZUμðuÞ PðI; tÞ þ ∂P

∂IðuÞ ðI; tÞIðuÞ� �

du

−Z∂UαðuÞ PðI; tÞ− ∂P

∂IðuÞ ðI; tÞIðuÞ� �

du

−ZU

∂P∂Ið0; g þ 1Þ ðI; tÞIðuÞαðuÞ du ðA:13Þ

where ∂U ¼ fu¼ ða; gÞ∈Uja¼ 0; g40g. In all that follows weassume that newborn cells cannot give birth, which implies thatαj∂U ¼ 0. As a consequence, the second term on the RHS is null. Wenow use the following two claims to re-write the above equation.

A.1. Appendix A.1

Claim. Let u∈U. We define a map SðWÞ-SðWÞ that to every P∈SðWÞassociates a ~Pu∈SðWÞ so defined: ~PuðIÞ ¼ IðuÞPðIÞ. Then∂ ~PuðIÞ∂IðuÞ ¼ PðIÞ þ IðuÞ ∂PðIÞ

∂IðuÞ ðA:14Þ

Proof. Given ϵ40 and u0∈U, define δu0 ;ϵ∈W as the followingKroencker delta: δu0 ;ϵðuÞ ¼ ϵδðu−u0Þ. Then~PuðI þ δu;ϵÞ− ~PuðIÞ ¼ ðI þ δu;ϵÞðuÞPðI þ δu0 ;ϵÞ−IðuÞPðIÞ

¼ δu;ϵðuÞPðI þ δu0 ;ϵÞ þ IðuÞ½PðI þ δu;ϵÞ−PðIÞ� ðA:15ÞBy dividing by ϵ and letting ϵ-0 we get the assert. □

A.2. Appendix A.2

Claim. Given u0≠u, we have

∂ ~PuðIÞ∂Iðu0Þ

¼ IðuÞ ∂PðIÞ∂Iðu0Þ

ðA:16Þ

As a consequence, we can rewrite the master equation asfollows:

∂P∂t

ðI; tÞ−ZU

∂I∂a

ðuÞ ∂P∂IðuÞ ðI; tÞ du¼

ZU

∂∂IðuÞ ½IðuÞPðI; tÞ�μðuÞ du

−ZU

∂∂Iðu0Þ

½IðuÞPðI; tÞ�αðuÞ du ðA:17Þ

where u0 ¼ ð0; g þ 1Þ.Eq. (A.17) reduces to well known results when the birth and

death rates are both constant, in which case, if we integrate bothsides over the domain WN ¼ fI∈WjRUI¼Ntg, where Nt is the totalnumber of cells at time t, and N0 is the initial number of cells, weobtain

∂P∂t

ðNt ; tÞ ¼ β½ðNt þ 1ÞPðNt þ 1; tÞ−NtPðNt ; tÞ�þα½ðNt−1ÞPðNt−1; tÞ−NtPðNt ; tÞ� ðA:18Þ

which is the master equation of the classic linear birth and deathprocess. This is fully discussed in Reichl (1998), where a solution isgiven in terms of the generating function Fðz; tÞ ¼∑Nt PðNt ; tÞzNt by

Fðz; tÞ ¼ βðz−1Þeðα−βÞt−α z þ β

αðz−1Þeðα−βÞt−α zþ β

� �N0

: ðA:19Þ

Here N0 is the initial number of infected cells.

Appendix B. Continuous model, closed form solution

We now develop the closed form solution to Eq. (1). One canverify that, given a smooth function φðg; xÞ such that φðg; xÞ ¼ 0when xo0, then

Iða; g; tÞ ¼ φðg; t−aÞe−R a

0βðsÞ ds ðB:1Þ

Page 10: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–93 91

satisfies the first equation in (1). Substituting into the renewalcondition, one gets that φ should also satisfy

φðg þ 1; tÞ ¼Z ∞

ΛαðaÞe−

R a

0βðsÞ dsφðg; t−aÞ da: ðB:2Þ

Given the initial population of I0 cells, at any given time t, the onlycells at g¼0 are the survivors from the initial population. In otherwords:

Iða;0; tÞ ¼ I0e−R a

0βðsÞ ds if a¼ t

0 otherwise:

(ðB:3Þ

It follows that for g¼0, φ is completely determined byφð0; t−aÞ ¼ δðt−aÞI0, where δðt−aÞ is the singular density. Wheng¼1

I0

Z ∞

0f ðaÞδða−tÞ da¼ I0f ðtÞ ¼ φð1; tÞ ðB:4Þ

and therefore one can show recursively that for g≥1

φðg; tÞ ¼ I0

Z ∞

0…Z ∞

0∏g

i ¼ 1f ðaiÞ dai

!δ ∑

g

i ¼ 1ai−t

!: ðB:5Þ

By reordering the above and substituting f with its Fouriertransform ~f one obtains

φðg; tÞ ¼ I0

Z ∞

0…Z ∞

0

Z ∞

−∞eikð∑ai−tÞ dk

2π∏g

i ¼ 1f ðaiÞ dai

!

¼ I02π

Z ∞

−∞e−ikt ∏

g

i ¼ 1

Z ∞

0f ðaiÞeikai dai

!dk

¼ I02π

Z ∞

−∞e−ikt ~f ðkÞg dk ðB:6Þ

Substituting Eq. (B.6) in Eq. (B.1), one gets Eq. (4):

Iðg; tÞ ¼ I02π

Z t

0e−R a

0βðsÞ ds

Z ∞

−∞

~f ðkÞge−ikðt−aÞ dk da ðB:7Þ

which reduces the problem to a quadrature.

Appendix C. HD variance for an exponentially growingpopulation in the presence of recombination alone

We now illustrate how to obtain expressions for the mean HD0

and the expected value of the variance of HD0 as functions ofgeneration g in the case where the population grows exponen-tially, the mutation rate is ϵ¼ 0, and the recombination rate isρ¼ 1.

For every position x between 1 and NB, we compute both themean HD0 and the HD0 variance at that position across allsequences, and for every pair of positions x and y, we will computethe covariance across all sequences, and finally we will obtain thetotal mean HD0 and HD0 variance by summing across all positions.

Since there are no new mutations, at any generation g, themean HD at position x is given by HDxðgÞ ¼ d=NB, and hence

μðgÞ ¼ ∑NB

x ¼ 1HDxðgÞ ¼ d: ðC:1Þ

An unbiased estimator of the variance at position x is given by

s2x ðgÞ ¼Ng

Ng−1ðHD2

x−HD2x Þ; ðC:2Þ

Here, for simplicity of notation, we have omitted the dependencyin g. So, if s2x ðg þ 1Þ is the expected variance at generation g+1,summing over all sequences s we obtain

s2ðg þ 1Þ ¼ 1Ng

∑s

HDx−1Ng

∑s′HDx

� �2 !

¼ Ng−1Ng

s2ðgÞ: ðC:3Þ

This is not quite the total variance as we still need to add in thecontribution from the covariance between every pair of positions xand y.

To understand how the expected covariance sx;y between anytwo sites x and y changes with generation g, we make thefollowing observation: if the split position θ falls between thetwo positions, then x and y will be inherited independently to thenext generation. If, on the other hand, θ does not split the twopositions, then the new covariance will be a random variate withexpectation the second cumulant of the covariance at the previousdistribution. In order to use this observation, we split the set ofsequences in two groups: the ones derived from two parentsequences which recombined somewhere between x and y,and those derived instead from a recombination split outside xand y.

Let N1 be the number of sequences in the first group, and N2 thenumber of sequences in the second group, so that N1 þ N2 ¼Ng

(when clear from the context, we omit the dependency in g forsimplicity of notation). We define wi ¼ EðNi=NgÞ and κi1x and κi11 thecumulants of the HD in the i-th group at a fixed generation g. Anunbiased estimator of the covariance is given by

sxyðgÞ ¼1

Ng−1∑s∈Ng

ðHDx−HDxÞðHDy−HDyÞ ðC:4Þ

where Ng is the total number of sequences at generation g, HDx

and HDy are the Hamming distances at positions x and y,respectively, and HDx and HDy their means. By splitting thesummation over sequences in N1 and sequences in N2, one gets

sxyðgÞ ¼ ∑i ¼ 1;2

Ni−1Ng−1

si þ ∑i ¼ 1;2

Ni

Ng−1ðHDx;i−HDxÞðHDy;i−HDyÞ ðC:5Þ

where si is the variance over subset Ni, and HDx;i and HDy;i are themeans at positions x and y, respectively, within group Ni. Takingexpectations on both sides of the above equation, one obtains

EðsxyÞjgþ1 ¼ ∑i ¼ 1;2

wiκi11 þ

Ngþ1

Ngþ1−1∑

i ¼ 1;2wiκ

i1xκ

i1y

"

− ∑i ¼ 1;2

wiκi1x

!∑

j ¼ 1;2wjκ

j1y

!#: ðC:6Þ

Notice that the first term on the right hand side is the weightedaverage of the κi11 's, and the second one is the weighted covariancebetween κi1x and κi1y.

The first cumulants are the mean of the corresponding dis-tribution, hence κi1x ¼ κi1y ¼ d at all times, whereas with the abovedefinitions of N1 and N2, we get κ111 ¼ 0 (the split positionseparates x and y and hence the number of mutations at eachsite become independent variables), and κ211 ¼ ðNg−1=NgÞEðsxyÞ.Furthermore, w2 ¼ ð1−ðy−xÞ=ðNB þ 1ÞÞ. Therefore, substituting allof the above into Eq. (C.6), we obtain

EðsxyÞðg þ 1Þ ¼ 1−y−x

NB þ 1

� �Ng−1Ng

EðsxyÞðgÞ ðC:7Þ

By adding across all positions we get

s2ðgÞ ¼ ∑NB

x ¼ 1s2x ðgÞ þ 2 ∑

xoysxyðgÞ

¼ ∏g−1

j ¼ 0

Nj−1Nj

∑NB

x ¼ 1s2x ð0Þ þ 2 ∑

xoy1−

y−xNB þ 1

� �g

sx;yð0Þ" #

ðC:8Þ

With these recursive relations, we have reduced the problem tocomputing the initial quantities s2x ð0Þ and sx;yð0Þ. Consider theinitial population to be made of I0 sequences such that the HD0 ofany given sequence s from the founder strain s0 is a randomvariable from a probability distribution P with mean and varianceμ0 and s0, respectively. Then the moment generating function M of

Page 11: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–9392

the HD0 distribution is given by

Mðu1;…;uNB−1Þ ¼1NB

1þ ∑NB−1

i ¼ 1eui

!" #μ0N−1B

: ðC:9Þ

Here NB is the length of the genomes, and u1;…uNB−1 are indepen-dent variables representing the positions in the genome (there is atotal of NB positions, but only NB−1 are independent due to theconstraint that the initial mean is μ0). Taking the first and secondderivatives, we find

s2x ð0Þ ¼NB−1N2

B

μ0 þ1

N2B

s20 ðC:10Þ

and

sxyð0Þ ¼1

N2B

ðs20−μ0Þ ðC:11Þ

It is easy to verify that summing the above two equations over allpositions, one gets the initial variance s20.

Combining Eqs. (C.8), (C.10), and (C.11), we get the expressionfor the total HD variance at generation g40:

s2ρ ðgÞ ¼ d 1−1NB

� �−2

d

N2B

∑xoy ≤NB

1−y−x

NB þ 1

� �g

� ∏g−1

j ¼ 0

Nj−1Nj

¼ d 1−1NB

� �−2

dg þ 2

� NB

NB þ 1

� �g−1

þ OðN−1B Þ

" #∏g−1

j ¼ 0

Nj−1Nj

ðC:12Þ

where we have used the following relationship:

2 ∑NB−1

x ¼ 1∑NB

y ¼ xþ11−

y−xNB þ 1

� �g

¼ 2ðNB þ 1Þg ∑

NB−1

k ¼ 1kðkþ 1Þg

¼ Ngþ1B

ðg þ 2ÞðNB þ 1Þg−1þ OðNBÞ: ðC:13Þ

Appendix D. HD variance for an exponentially growingpopulation when both recombination and mutation rate arenon-null

We now extend the methods used to derive Eq. (14) and findrecursive formulae for the general case when both ρ and ϵ are non-null. Eq. (C.6) can be generalized to an indefinite number ofpartitions Ni such that ∑iNi ¼Ng , the total number of sequencesat generation g. We use this to find analogous recursive formulaefor the cumulants of the mean HD0 distribution when bothrecombination and mutation events are present.

Fix positions x and y and choose N1 to be the number ofsequences derived from a recombination event that split positionsx and y; N2 the number of sequences derived from either arecombination event that did not split positions x and y, or froma division with no mutation; and, finally, denote Nk, with k¼3,4 or5 the number of sequences such that only the x position hasmutated, only the y position, or both, respectively. Like before, letwi ¼ EðNi=NgÞ. Assume that mutations happen with a rate ϵ (persite, per generation), and recombination events happen with a rateρ. Given this set-up, we notice that

EðwjÞ ¼

y−xNB þ 1

ρ if j¼ 1

1−y−x

NB þ 1

� �ρþ ð1−ρÞð1−ϵÞ if j¼ 2

ϵð1−ρÞð1−ϵÞ if j¼ 3 or 4ð1−ρÞϵ2 otherwise

8>>>>>>><>>>>>>>:

ðD:1Þ

Let ξ¼ ϵð1−ρÞ. Carrying out similar computations as sketched inthe previous section, one notices that, for every positions x and y

μxðg þ 1Þ ¼ μxðgÞ þ ξ ðD:2Þ

s2x ðg þ 1Þ ¼ Ng−1Ng

s2x ðgÞ þ ξð1−ξÞ ðD:3Þ

sxyðg þ 1Þ ¼ 1−ρy−x

NB þ 1

� �Ng−1Ng

sxyðgÞ−Ngþ1

Ngþ1−1ξ2

þξ ϵ−1

Ngþ1

� �−2ϵξ2

1−ϵNgþ1

ðD:4Þ

By summing Eq. (D.2) over all positions x one obtains Eq. (15).Similarly, by summing Eqs. (D.3) and (D.4) across all positions, andnoticing that we can still use the same initial values we used inAppendix C, one obtains the general expression for s2ðgÞ.

References

Althaus, C.L., Bonhoeffer, S., 2005. Stochastic interplay between mutation andrecombination during the acquisition of drug resistance mutations in humanimmunodeficiency virus type 1. J. Virol. 79, 13572–13578.

Batorsky, R., Kearney, M.F., Palmer, S.E., Maldarelli, F., Rouzine, I.M., Coffin, J.M.,2011. Estimate of effective recombination rate and average selection coefficientfor HIV in chronic infection. Proc. Natl. Acad. Sci. USA 108, 5661–5666.

Bernard, S., Pujo-Menjouet, L., Mackey, M.C., 2003. Analysis of cell kinetics using acell division marker: mathematical modeling of experimental data. Biophys. J.84, 3414–3424.

Chen, J., Dang, Q., Unutmaz, D., Pathak, V.K., Maldarelli, F., Powell, D., Hu, W.S., 2005.Mechanisms of nonrandom human immunodeficiency virus type 1 infectionand double infection: preference in virus entry is important but is not the solefactor. J. Virol. 79, 4140–4149.

Chen, H.Y., DiMascio, M., Perelson, A.S., Ho, D.D., Zhang, L., 2007. Determination ofvirus burst size in vivo using a single-cycle SIV in rhesus macaques. Proc. Natl.Acad. Sci. USA 104, 19079–19084.

Dang, Q., Chen, J., Unutmaz, D., Coffin, J.M., Pathak, V.K., Powell, D., KewalRamani, V.N.,Maldarelli, F., Hu, W.S., 2004. Nonrandom HIV-1 infection and double infectionvia direct and cell-mediated pathways. Proc. Natl. Acad. Sci. USA 101, 632–637.

De Boer, R.J., Ribeiro, R.M., Perelson, A.S., 2010. Current estimates for HIV-1production imply rapid viral clearance in lymphoid tissues. PLoS Comput. Biol.6, e1000906.

Drummond, A.J., Rambaut, A., 2007. BEAST: Bayesian evolutionary analysis bysampling trees. BMC Evol. Biol. 7, 214–223.

Fiebig, E.W., Wright, D.J., Rawal, B.D., Garrett, P.E., Schumacher, R.T., et al., 2003.Dynamics of HIV viremia and antibody seroconversion in plasma donors:implications for diagnosis and staging of primary HIV infection. AIDS 17,1871–1879.

Fischer, W., Bhattacharya, T., Keele, B.F., Giorgi, E.E., Hraber, P.T., Perelson, A.S.,Shaw, G.M., Korber, B.T., et al., 2010. Rapid mutational escape from cytotoxicT-cell responses in acute HIV-1 infection—an ultra-deep view. PLoS ONE 5,e12303.

Fraser, C., 2005. HIV recombination: what is the impact on antiretroviral therapy?J. R. Soc. Interface 2 (489).

Gelfand, Fomin, S.V., 1963. Calculus of Variations. Englewood Cliffs, NJ.Giorgi, E.E., Funkhouser, B., Athreya, G., Perelson, A.S., Korber, B.T., Bhattacharya, T.,

2010. Estimating time since infection in early homogeneous HIV-1 samplesusing a Poisson model. BMC Bioinformatics 11, 532–539.

Gillespie, J.H., 1984. The status of the neutral theory. Science 224, 732–733.Huddleston, J.V., 1983. Population dynamics with age- and time-dependent birth

and death rates. Bull. Math. Biol. 45, 827–836.Jetzt, A.E., Yu, H., Klarmann, G.J., Ron, Y., Preston, B.D., Dougherty, J.P., 2000. High

rate of recombination throughout the human immunodeficiency virus type Igenome. J. Virol. 74, 1234–1240.

Jung, A., Meier, R., Vartanian, J.P., Bocharov, G., Jung, V., Fischer, U., Meese, E.,Wain-Hobson, S., Meyerhans, A., 2002. Multiply infected spleen cells in HIVpatients. Nature 418, 144.

Keele, B.F., Giorgi, E.E., Salazar-Gonzalez, J.F., Decker, J.M., Pham, K.T., et al., 2008.Identification and characterization of transmitted and early founder virusenvelopes in primary HIV-1 infection. Proc. Natl. Acad. Sci. USA 105,7552–7557.

Keele, B.F., Li, H., Learn, G.H., Hraber, P., Giorgi, E.E., Grayson, T., Sun, C., Chen, Y.,Yeh, W.W., Letvin, N.L., Mascola, J.R., Nabel, G.J., Haynes, B.F., Bhattacharya, T.,Perelson, A.S., Korber, B.T., Hahn, B.H., Shaw, G.M., 2009. Low-dose rectalinoculation of rhesus macaques by SIVsmE660 or SIVmac251 recapitulateshuman mucosal infection by HIV-1. J. Exp. Med. 206, 1117–1134.

Kevorkian, J., 2002. Partial Differential Equations: Analytical Solutions Techniques,second ed. Springer-Verlag, New York.

Kimura, M., 1968. Evolutionary rate at the molecular level. Nature 217, 624–626.Kouyos, R.D., Fouchet, D., Bonhoffer, S., 2009. Recombination and drug resistance in

HIV: population dynamics and stochasticity. Epidemics 1, 58–69.

Page 12: Modeling sequence evolution in HIV-1 infection with recombination

E.E. Giorgi et al. / Journal of Theoretical Biology 329 (2013) 82–93 93

Lee, H.Y., Giorgi, E.E., et al., 2009. Modeling sequence evolution in acute HIV-1infection. J. Theor. Biol. 261, 341–360.

Levy, D.N., Aldrovandi, G.M., Kutsch, O., Shaw, G.M., 2004. Dynamics of HIV-1recombination in its natural target cells. Proc. Natl. Acad. Sci. USA 101,4204–4209.

Markowitz, M., Louie, M., Hurley, A., Sun, E., Di Mascio, M., et al., 2003. A novelantiviral intervention results in more accurate assessment of human immuno-deficiency virus type 1 replication dynamics and T-cell decay in vivo. J. Virol. 77,5037–5038.

Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., et al., 2005. Genomesequencing in microfabricated high-density picolitre reactors. Nature 437,376–380.

Mostowy, R., Kouyos, R.D., Fouchet, D., Bonhoffer, S., 2011. The role of recombina-tion for the coevolutionary dynamics of HIV and the immune response. PLoSOne 6, e16052.

Moutouh, L., Corbeil, J., Richman, D.D., 1996. Recombination leads to the rapidemergency of HIV-1 dually resistant mutants under selective drug pressure.Proc. Natl. Acad. Sci. USA 93, 6106–6111.

Murray, J.D., 2002. Mathematical Biology, third ed. Springer-Verlag, Berlin,Heidelberg.

Neher, R.A., Leitner, T., 2010. Recombination rate and selection strength in HIVintra-patient evolution. PLoS Comput. Biol. 6, e1000660.

Nisbet, R.M., Gurney, W.S.C., 1982. Modeling Fluctuating Populations. John Wileyand Sons.

Onafuwa-Nuga, A., Telesnitsky, A., 2009. The remarkable frequency of humanimmunodeficiency virus type 1 genetic recombination. Microbiol. Mol. Biol.Rev. 73, 451–480.

Perelson, A.S., Neumann, A.U., Markowitz, M., Leonard, J.M., Ho, D.D., 1996. HIV-1dynamics in vivo: virion clearance rate, infected cell life-span, and viralgeneration time. Science 271, 1582–1586.

R Development Core Team, 2010. R: A Language and Environment for StatisticalComputing. R Foundation for Statistical Computing, Vienna, Austria. ISBN3-900051-07-0, URL ⟨http://www.R-project.org⟩.

Ramirez, B.C., Simon-Loriere, E., Galetto, R., Negroni, M., 2008. Implications ofrecombination for HIV diversity. Virus Res. 134, 64–73.

Ramratnam, B., Bonhoeffer, S., Binley, J., Hurley, A., Zhang, L., et al., 1999. Rapidproduction and clearance of HIV-1 and hepatitis C virus assessed by largevolume plasma apheresis. Lancet 354, 1782–1785.

Reichl, L.E., 1998. A Modern Course in Statistical Physics. John Wiley and Sons.

Robertson, D.L., Sharp, P.M., McCutchan, F.E., Hahn, B.H., 1995. Recombination inHIV-1. Nature 374, 124.

Ribeiro, R.M., Qin, L., Chavez, L.L., Li, D., Self, S.G., Perelson, A.S., 2010. Estimation ofthe initial viral growth rate and basic reproductive number during acute HIV-1infection. J. Virol. 12, 6096–6102.

Sabino, E.C., Shpaer, E.G., Morgado, M.G., Korber, B.T., Diaz, R.S., Bongertz, V.,Cavalcante, S., Galvo-Castro, B., Mullins, J.I., Mayer, A., 1994. Identification ofhuman immunodeficiency virus type 1 envelope genes recombinant betweensubtypes B and F in two epidemiologically linked individuals from Brazil.J. Virol. 68, 6340–6346.

Salazar-Gonzalez, J.F., Bailes, E., Pham, K.T., Salazar, M.G., Guffey, M.B., Keele, B.F.,Derdeyn, C.A., Farmer, P., Hunter, E., Allen, S., Manigart, O., Mulenga, J.,Anderson, J.A., Swanstrom, R., Haynes, B.F., Athreya, G.S., Korber, B.T., Sharp,P.M., Shaw, G.M., Hahn, B.H., 2008. Deciphering human immunodeficiencyvirus type 1 transmission and early envelope diversification by single-genomeamplification and sequencing. J. Virol. 82, 3952–3970.

Shriner, D., Rodrigo, A.G., Nickle, D.C., Mullins, J.I., 2004. Pervasive genomicrecombination of HIV-1 in vivo. Genetics 167, 1573–1583.

Suryavanshi, G.W., Dixit, N.M., 2007. Emergence of recombinant forms of HIV:dynamics and scaling. PLoS Comput. Biol. 3, e205.

Vijay, N.N.V., Ajmani, R., Perelson, A.S., Dixit, N.M., 2008. Recombination increaseshuman immunodeficiency virus fitness, but not necessarily diversity. J. Gen.Virol. 89, 1467.

Von Foerster, H., 1959. Some remarks on changing populations. In: Stohlman Jr., F.(Ed.), The Kinetic of Cellular Proliferation. Grune & Stratton, New York,pp. 382–407.

Wallstrom, T.C., Daniels, M.G., Bhattacharya, T., 2010. Personal communication.Wood, N., Bhattacharya, T., Keele, B.F., Giorgi, E.E., Liu, M., Gaschen, B., Daniels, M.,

Ferrari, G., Haynes, B.F., McMichael, A., Shaw, G.M., Hahn, B.H., Korber, B.,Seoighe, C., 2009. HIV evolution in early infection: selection pressures, patternsof insertion and deletion, and the impact of APOBEC. PLoS Pathog. 5, e1000414.

Wooley, D.P., Smith, R.A., Czajak, S., Desrosiers, R.C., 1997. Direct demonstration ofretroviral recombination in rhesus monkey. J. Virol. 71, 9650–9653.

Zhang, M., Foley, B., Schultz, A.K., Macke, J.P., Bulla, I., Stanke, M., Morgenstern, B.,Korber, B., Leitner, T., 2010. The role of recombination in the emergence of acomplex and dynamic HIV epidemic. Retrovirology 7, 25.

Zilman, A., Ganusov, V.V., Perelson, A.S., 2010. Stochastic models of lymphocyteproliferation and death. PLoS ONE 5, e12775.