A statistical learning approach to infer transmissions ofinfectious diseases from deep sequencing data
Maryam Alamil1, Joseph Hughes2, Karine Berthier3, Cecile Desbiez3,Gael Thebaud4 and Samuel Soubeyrand1.
1BioSP, INRA, 84914, Avignon, France2MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
3Pathologie Vegetale, INRA, 84140 Montfavet, France4BGPI, INRA, SupAgro, Cirad, Univ. Montpellier, Montpellier, France
12 March 2019
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 1 / 21
Overview
1 Introduction
2 Methodology
3 Results
4 Conclusion
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 2 / 21
Overview
1 Introduction
2 Methodology
3 Results
4 Conclusion
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 3 / 21
Introduction
Inferring transmission links, for fast evolving pathogens, using viralgenetic data is crucial to make epidemiological predictions and todesign control strategies.
Pathogen sequence data have been exploited to infer who infectedwhom, by using empirical and model-based approaches.
Data collected with deep sequencing techniques provide a subsampleof the pathogen variants that were present in the host at samplingtime. They are expected to give better insight into epidemiologicallinks.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 4 / 21
Introduction
Here, we present a Statistical Learning Approach For EstimatingEpidemiological Links from deep sequencing data (SLAFEEL), whichis summarized as follows:
After that, we apply this approach to three real cases of animal,human and plant epidemics.
Then, we show the impact of introducing penalization and, therefore,using training data on the inference.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 5 / 21
Overview
1 Introduction
2 Methodology
3 Results
4 Conclusion
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 6 / 21
Pseudo-evolutionary model for the evolution andtransmission of populations of sequences
+ It describes transitions between sets of sequences sampled at differenttimes from an infected host and its putative sources.
+ It takes into consideration the loss and gain of virus variants duringwithin-host evolution and their loss during between-hosttransmissions.
+ We built a sort of regression function parameterised by anevolutionary parameter and a penalisation parameter, where:
the response variable is the set of sequences S = {S1, ...,SJ} observedfrom a recipient host unit,
the explanatory variable is the set of sequences S(0) = {S(0)1 , ...,S(0)
I }observed from a putative source.
the coefficients are weights measuring how each sequence in S(0)
contributes to explaining each sequence in S.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 7 / 21
Estimation and calibration of parameters, and inference oftransmissions
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 8 / 21
Estimation and calibration of parameters, and inference oftransmissions
Sequencing
Learning stage
Inference of transmission links for the whole data
set by assessing the link intensity between each directed pair of hosts
is calibrated by buildingand optimising a criterion
that compares contact information and inferred
sources of infection for training hosts
θ
~Θ
Maximise
μ
f μ ,θ(Sm∣Sm '); m '≠m
Identify the most likely source of
s (m;θ)=argmaxm'≠m
f μm '
(θ) ,θ(Sm∣Sm ' )
mgiven θ
μ m' (θ)
S1
(m1) , ... , S Im1
(m1)
S1
(m2) , ... , S Im2
(m2 )
S1
(m3) , ... , S Im3
(m3)
S1
(m4 ) , ... , S Im4
(m4 )
S1
(m5) , ... , S Im5
(m5 )
given θ
m1
m2
m3
m4
m5
m5
m4
m3
with respect to
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 9 / 21
Semi-parametric pseudo-evolutionary model (1/2)+ Its general form is given by a penalized pseudo-likelihood:
fµ,θ(
S1, ...,SJ |S(0)1 , ...,S(0)
I
)= Pθ(W )×
J∏j=1
∑I
i=1 wijKµ{d(Sj ,S(0)i ); ∆ij}∑I
i=1 wij︸ ︷︷ ︸pj
where:
d(., .) is a distance function giving the number of different nucleotidesbetween two sequences,
∆ij is the duration separating the two sequences Sj and S(0)i ,
wij = 1/nj for indices i corresponding to sequences S(0)i minimally distant
from the sequence Sj (the number of such sequences denote nj) and wij = 0otherwise,
Kµ(.,∆) is a kernel smoother parameterised by an evolutionary parameter µ.It is the probability distribution function of the binomial law with size equalsto the sequence length and success probability 3(1− exp(−4µ∆))/4,
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 10 / 21
Semi-parametric pseudo-evolutionary model (2/2)Pθ(W ) is a parametric penalization for the weight matrix W , parameterisedby a penalisation parameter θ. It measures the likelihood of thecontributions of explanatory sequences S(0)
1 , ...,S(0)I (measured by
∑Jj=1 wij ,
i = 1, ..., I) to the response set of sequences S1, ...,SJ .
Two hypotheses are considered for the penalization:
1 H1 : E[∑J
j=1wij
]= J/I. Two associated penalization shapes:
ã Pθ(W ) =∏I
i=1 Φ(∑J
j=1 wij ; JI , θ
JI
(1 − 1
I
)),
ã Pθ(W ) = θχ2
(∑Ii=1
(∑Jj=1
wij −J/I)2
J/I ; I − 1
),
2 H2 : E[
1J∑J
j=1 min(d(Sj ,S(0)f (j)))
]= dobs . A linked penalization shape:
ã Pθ(W ) = θ∏J
j=1 Φ(∑I
i=1 wijd(Sj ,S(0)i ); dobs, σ
2obs
).
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 11 / 21
Overview
1 Introduction
2 Methodology
3 Results
4 Conclusion
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 12 / 21
Inference of epidemiological links for hosts with differentimmunological histories
116 104
113
Naive chain (5 groups)
410 405
Vaccinated chain (7 groups)
113 115
417 409
106 112
115
400 413
108
105
…..
410
405
…..
414
412
First group Last group
First group Last group
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 13 / 21
Inference of epidemiological links for hosts with differentimmunological histories
Naive chain of InfluenzaHosts Contact
113 115115 113104 113, 115, 116116 104, 113, 115109 104, 111, 116111 104, 109, 116105 108, 109, 111108 105, 109, 111106 105, 108, 112112 105, 106, 108
Vaccinated chain of InfluenzaHosts Contact
405 410410 405409 405, 410, 417417 405, 409, 410401 409, 417415 401, 416416 401, 415403 406, 415, 416406 403, 415, 416412 403, 406, 414414 403, 406, 412400 412, 413, 414413 400, 412, 414
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 13 / 21
Inference of epidemiological links for hosts with differentimmunological histories
Naive chain of InfluenzaHosts Contact
113 115115 113104 113, 115, 116116 104, 113, 115109 104, 111, 116111 104, 109, 116105 108, 109, 111108 105, 109, 111106 105, 108, 112112 105, 106, 108
Vaccinated chain of InfluenzaHosts Contact
405 410410 405409 405, 410, 417417 405, 409, 410401 409, 417415 401, 416416 401, 415403 406, 415, 416406 403, 415, 416412 403, 406, 414414 403, 406, 412400 412, 413, 414413 400, 412, 414
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 13 / 21
Inference of epidemiological links for hosts with differentimmunological historiesA
Tree inferred by optimizing the penalization
with contact information for the last group
Time (day)
2 4 6 8 10 12 14
113113113
115115115
104104104
116116
109109
111111
105
108108
106
112112
Host
BTree inferred by optimizing the penalization
with contact information for the last group
Time (day)
2 4 6 8 10 12 14 16 18 20 22 24
405405405405410410410
409409417
401401
415416
403406
412412412414414414414414
400400413413413
Host
Figure: Transmissions inferred in the naive chain (A) and vaccinated chain (B) of Swineinfluenza virus using pair of training hosts in the last group for calibrating thepenalization. Training hosts are written in bold. The thickness of each arrow isproportional to the intensity of the corresponding inferred link.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 13 / 21
Inference of epidemiological links for hosts with differentimmunological historiesA
Tree inferred by optimizing the penalization
with contact information for the last group
Time (day)
2 4 6 8 10 12 14
113113113
115115115
104104104
116116
109109
111111
105
108108
106
112112
Host
BTree inferred by optimizing the penalization
with contact information for the last group
Time (day)
2 4 6 8 10 12 14 16 18 20 22 24
405405405405410410410
409409417
401401
415416
403406
412412412414414414414414
400400413413413
Host
Consistent estimations with the two pairs of training hosts.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 13 / 21
Impact of training hosts
A B C
Figure: Transmissions inferred in the vaccinated chain of SIV using different sets oftraining hosts for calibrating the penalization: (A) a pair of hosts in the last group ofthe chain (B) a pair of hosts in the middle of the chain (C) three hosts in the last groupand the middle of the chain. Training hosts are highlighted. The thickness of each arrowis proportional to the intensity of the corresponding inferred link.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 14 / 21
Impact of training hostsA B C
Using more contact information allows a finer calibration ofthe penalisation and, consequently, a more accurate resolu-tion of transmissions.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 14 / 21
Comparaison between SLAFEEL and BadTrIP
Figure: Discrepancy between inferred transmission graphs obtained with SLAFFEL (withand without penalization) and BadTrIP and reference graphs, for naive and vaccinatedchains of SIV. This discrepancy is measured by the proportion of correct sourceidentifications.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 15 / 21
Inference of epidemiological links in a low diversitypathogen population
Recipient DonorG3817 G3729G3820 G3729G3821 G3729G3823 G3729G3851 G3752
Senga et al. (2017)
Jawie
Kaku
a
Kis
si K
am
a
Kissi Te
ngKissi Tongi
Kpeje Bongre
Luawa
Male
ma
Ma
mb
olo
Mandu
Njaluahun
Nongowa
Figure: Most likely epidemiological links between Ebola patients cumulating to 20%probability for each recipient (i.e., for each recipient, potential donors were ranked withrespect to link intensity, and the subset of donors with higher ranks for which the sum oflink intensities reached 0.2 were retained to be displayed in the graph).
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 16 / 21
Inference of epidemiological links in a low diversitypathogen population
Recipient DonorG3817 G3729G3820 G3729G3821 G3729G3823 G3729G3851 G3752
Senga et al. (2017)
Jawie
Kaku
a
Kis
si K
am
a
Kissi Te
ngKissi Tongi
Kpeje Bongre
Luawa
Male
ma
Mam
bolo
Mandu
Njaluahun
Nongowa
The Jawie chiefdom seems to be an interface between KissiTeng and Kissi Tongi chiefdoms on the one hand and mostof the other chiefdoms on theother hand.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 16 / 21
Inference of epidemiological links at a metapopulation scale
+ Geographic proximity is used as contact information
10 km
Figure: Epidemiological links inferred between 27 salsify patches based on sets ofpotyvirus variants sequenced from 189 infected plants sampled in a 40× 10 kmregion of south-eastern France.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 17 / 21
Inference of epidemiological links at a metapopulation scale
10 km
No secondary arrows are displayed,Non-negligible proportion of long links,Environmental factors and intra-host demo-geneticfactors may play a role in the transmission of the virus.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 17 / 21
Overview
1 Introduction
2 Methodology
3 Results
4 Conclusion
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 18 / 21
Conclusion and perspectives
F SLAFEEL is adaptable to very different contexts and data fromanimal, human and plant epidemics.
F SLAFEEL is valuable in non-standard situations where classicalmechanistic assumptions may be erroneous and when sequencing andvariant calling may be noisy.
F Introducing a penalization and using more contact information lead toaccurate inferences of transmission links.
F Calibrate and assess its efficiency by applying it to simulated datagenerated with diverse sampling efforts, sequencing techniques andstochastic models of viral evolution and transmission.
F Investigate the statistical relationship between inferred transmissionlinks and environmental factors.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 19 / 21
References
M Alamil, J Hughes, K Berthier, C Desbiez, G Thebaud, and S Soubeyrand.Inferring epidemiological links from deep sequencing data: a statistical learningapproach for human, animal and plant diseases.Submitted.
C Desbiez, A Schoeny, B Maisonneuve, K Berthier, et al.Molecular and biological characterization of two potyviruses infecting lettuce insoutheastern france.Plant Pathology, 66(6):970–979, 2017.
SK Gire, A Goba, KG Andersen, RS Sealfon, et al.Genomic surveillance elucidates Ebola virus origin and transmission during the 2014outbreak.Science, 345:1369–1372, 2014.
PR Murcia, J Hughes, P Battista, L Lloyd, GJ Baillie, et al.Evolution of an Eurasian Avian-like influenza virus in naıve and vaccinated pigs.PLoS Pathogens, 8(5):e1002730, 2012.
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 20 / 21
Thank you for your attention!
We welcome your questions,comments & suggestions!
Maryam ALAMIL (INRA-BioSP) ModStatSAP-2019 12 March 2019 21 / 21
Impact of introducing a penalizationA
Tree inferred without penalization
Time (day)
2 4 6 8 10 12 14
113113113
115115115
104104104
116116
109109
111111
105
108108
106
112112
Host
BTree inferred without penalization
Time (day)
2 4 6 8 10 12 14 16 18 20 22 24
405405405405410410410
409409417
401401
415416
403406
412412412414414414414414
400400413413413
Host
Figure: Transmissions inferred in the naive chain (A) and vaccinated chain (B) ofSIV without including the penalisation and, therefore, without including traininghosts.