Microbiome Data Science - Phylogenetic Tree, Bacterial Growth and Biosynthetic Gene Clusters
Hongzhe Li
Perelman Professor of Biostatistics
University of Pennsylvania
June 14, 2021
1
Microbiome and its Function
conditions seen in paediatrics that have been associated with changes in the gut microbiota (table 1).8–16
Inflammatory bowel disease (IBD): This is one of the most extensively studied human conditions associated with the gut microbiota. The composition of the gut microbiota differs between healthy individuals and patients with IBD both in terms of species richness (ie, numbers of bacterial species) and species abundances (ie, number of individuals per species). As bacteria are identified by sequencing, rather than by functional characteristics in the culture laboratory, the individual bacterial species or genus (depending on the classification of sequence data) are commonly referred to as operational taxonomic units in microbiota research. Studies have reported patients with IBD to have decreased bacterial diversity, and reduced abundances of Firmicutes and Bacteroidetes and an increased abundance of Proteobacteria compared with healthy individuals.17 Latest research suggests that IBD pathogenesis is due to the interaction of environmental factors (eg, smoking, diet and stress) and the host's genetic susceptibility, which is influenced by commensal microbiota, which activates either pathogenic or protective immune responses.18 Evidence from mouse models provides further support for the role of gut microbiota in pathogenesis of IBD.19
Necrotising enterocolitis (NEC): The pathogenesis of NEC is multifactorial although the gut microbiota is thought to play a crucial role. Studies in both humans and animal models have described changes in the gut microbiota including a reduction in bacterial diversity and increased abundances of Proteobacteria in preterm infants with NEC compared with healthy preterm infants.20 However, the results have been inconsistent across studies, and to date, no single causative set of microorganisms has been identified.
Atopic diseases: Conditions such as eczema, asthma and food allergies are increasing in incidence. This is often linked to the hygiene hypothesis. It is thought that the lack of early-life exposure to microbial antigens in hygienic developed countries alters the microbiota composition of the infant gut, which disrupts immune development causing allergic disease.21 For example, species like Bacteroides fragilis reportedly induces immunological tolerance through immune receptor signalling pathways.22 Also the infant gut microbiota is affected by environmental factors including pets, residing in rural homes and siblings shown to have protective effects against asthma and allergies.23 The concept that altered microbiome composition influences childhood allergic disease susceptibility is further supported by data from epidemiological studies that report higher prevalence of atopic diseases in infants delivered by caesarean sections, formula fed infants and those exposed to antibiotics.24
Type 1 diabetes: The gut microbiota is involved in regulation of the metabolic–immune axis.8 Research
Figure 1 The human microbiome plays an important role in control of vital homeostatic mechanisms in the body. These include enhanced metabolism, resistance to infection and inflammation, prevention against autoimmunity as well as an effect on the gut–brain axis. SCFA, short-chain fatty acid.
Research in practice
Amon P, Sanderson I. Arch Dis Child Educ Pract Ed 2017;0:1–4. doi:10.1136/archdischild-2016-311643 (first published 28 February 2017)
https://ep.bmj.com/content/102/5/257 (Amon and Sanderson, 2016)
2
Microbiome, metabolites and immunology
Kim (2018, Immunology)
Immune regulation by microbiome metabolites
Immunology, Volume: 154, Issue: 2, Pages: 220-229, First published: 22 March 2018, DOI: (10.1111/imm.12930)
Question: how does the microbiome produce different metabolites?
3
The Human Microbiome and Cancer
Rajagopala (2017 Cancer Prevention Research).
Question - microbiome-based individual treatment assignment?
4
The Human Microbiome and Cancer Treatment
Anti-PD-1 immunotherapy in melanoma patients
Gopalakrishnan et al (2018 Science).
5
Shotgun Metagenomics
Slide from Katie Pollard
Question: can we understand the growth dynamics?
6
Big data problem
One sample:
50 million sequences
300 letters each
from 100s of species mixed
Reference database:
∼300,000 bacterial genomes
5 million letters each
Three examples
Role of statistics in microbiome data science
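To give a rough sense of the scale quoted above, the letter counts can simply be multiplied out. This is a back-of-the-envelope sketch; the variable names are illustrative and the inputs are the approximate figures from the slide.

```python
# Back-of-the-envelope scale, using the slide's approximate figures.
reads_per_sample = 50_000_000   # ~50 million sequences in one sample
read_length = 300               # ~300 letters per sequence
genomes_in_db = 300_000         # ~300,000 reference bacterial genomes
genome_length = 5_000_000       # ~5 million letters per genome

sample_letters = reads_per_sample * read_length
database_letters = genomes_in_db * genome_length

print(f"one sample:         {sample_letters:.1e} letters")
print(f"reference database: {database_letters:.1e} letters")
```

Under these figures, one sample holds on the order of 10^10 letters and the reference database on the order of 10^12.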
7
Example 1 - data on the phylogenetic tree
Data = probability mass on the tree
Wang, Cai and Li (2021, JASA)
Wang, Cai and Li (2021, Biometrika)
8
Estimation of Wasserstein distance
$P$ and $Q$ are two true probability mass distributions on the tree
Wasserstein distance between two microbial communities
\[ D(P, Q) = \sum_{e \in E} L_e \, |P_e - Q_e| , \]
Weighted UniFrac distance - plugin estimate
\[ D(\hat{P}, \hat{Q}) := \sum_{e \in E} L_e \, |\hat{P}_e - \hat{Q}_e| . \]
Estimate is biased when some $P_e \approx Q_e$
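The plug-in (weighted UniFrac) formula can be sketched in a few lines. This is a minimal illustration, not the authors' software: the tree encoding (a child-to-parent dictionary with one edge per non-root node), the function names and the toy tree are all assumptions made here. The mass on an edge is taken to be the total leaf mass below it.

```python
def edge_masses(parent, leaf_mass):
    """Mass below each edge; an edge is identified by its child node."""
    mass = {v: 0.0 for v in parent}
    for leaf, m in leaf_mass.items():
        v = leaf
        while v in parent:          # walk up until the root (which has no parent)
            mass[v] += m
            v = parent[v]
    return mass

def weighted_unifrac(parent, length, p, q):
    """Plug-in distance: sum over edges of L_e * |P_e - Q_e|."""
    P, Q = edge_masses(parent, p), edge_masses(parent, q)
    return sum(length[e] * abs(P[e] - Q[e]) for e in parent)

# Toy tree (hypothetical): root R with children A and B; A has leaves a1, a2.
parent = {"A": "R", "B": "R", "a1": "A", "a2": "A"}
length = {"A": 1.0, "B": 2.0, "a1": 0.5, "a2": 0.5}
p = {"a1": 0.50, "a2": 0.25, "B": 0.25}   # leaf proportions, sample 1
q = {"a1": 0.25, "a2": 0.25, "B": 0.50}   # leaf proportions, sample 2

d = weighted_unifrac(parent, length, p, q)
print(d)
```

With estimated proportions in place of the true ones, each term |P_e - Q_e| is estimated with a non-negative bias when the true masses are nearly equal, which is the issue the slide points to.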
9
MET - moment-screening estimator on a tree
Wang, Cai and Li (2021 JASA): using implicit best polynomial approximation that incorporates the tree structure.
MET is computationally efficient and is minimax rate-optimal ( $\frac{s \log(2d+2/s)}{n \log n}$ )
Wu and Yang (2016) and Jiao, Han, and Weissman (2018) and references therein). To be more specific, let $F^K_e(P_e, Q_e)$ be some polynomial of degree at most $K$ designed for $(P_e, Q_e)$ at edge $e \in E_r$
\[ F^K_e(P_e, Q_e) = \sum_{k_1, k_2 = 0}^{K} f_e(k_1, k_2) \, P_e^{k_1} Q_e^{k_2} . \]
The choice of $F^K_e(P_e, Q_e)$ is usually determined by $(\hat{P}_{e,0}, \hat{Q}_{e,0})$ to approximate the target functional locally (see, e.g. Jiao, Han, and Weissman, 2018). The corresponding unbiased estimator of $F^K_e(P_e, Q_e)$ can be written as
\[ \hat{F}^K_e = \sum_{k_1, k_2 = 0}^{K} f_e(k_1, k_2) \, H_{k_1}(\hat{P}_{e,1}) \, H_{k_2}(\hat{Q}_{e,1}), \]
where $H_k(\hat{P}_{e,1})$ is an unbiased estimator for $P_e^k$, i.e. $H_k(x) = \prod_{m=0}^{k-1} \left( x - \frac{m}{n} \right)$ if $k \ge 1$ and $H_k(x) = 1$ when $k = 0$. Thus, the bias of estimator $\sum_{e \in E_r} L_e \hat{F}^K_e$ is mainly the approximation error of the carefully chosen polynomials
\[ \left| E \sum_{e \in E_r} L_e \left( \hat{F}^K_e - |P_e - Q_e| \right) \right| \le \sum_{e \in E_r} L_e \sup_{x, y} \left| F^K_e(x, y) - |x - y| \right| . \]
The squared error of estimator $\sum_{e \in E_r} L_e \hat{F}^K_e$ can thus be decomposed into bias (Bias A) and variance (Variance A) as illustrated in left half of Figure 2.
[Figure 2 diagram. Left half: $\sum_{e \in E_r} L_e |P_e - Q_e|$, $\sum_{e \in E_r} L_e F^K(P_e, Q_e)$ and $\sum_{e \in E_r} L_e \hat{F}^K_e$, linked by Bias A and Variance A. Right half: $\sum_{e \in E_r} L_e |\hat{P}_e - \hat{Q}_e|$ and $\sum_{e \in E_r} L_e F^K(P_e, Q_e)$, linked by Bias B and "Variance B".]
Figure 2: Bias and variance decomposition of polynomial approximation estimator and moment screening estimator.
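The unbiasedness of $H_k$ can be checked numerically. The sketch below assumes a Poisson sampling model, i.e. a count $X \sim \mathrm{Poisson}(nP_e)$ with $x = X/n$; that model is an assumption made here for illustration and is not stated in the excerpt. Under it, $E[H_k(X/n)] = P_e^k$, because the $k$-th factorial moment of a Poisson variable with mean $\lambda$ is $\lambda^k$.

```python
import math

def H(k, x, n):
    """H_k(x) = prod_{m=0}^{k-1} (x - m/n), with H_0(x) = 1."""
    out = 1.0
    for m in range(k):
        out *= x - m / n
    return out

def poisson_mean_of_H(k, p, n, cutoff=200):
    """E[H_k(X/n)] for X ~ Poisson(n*p), by direct summation of the pmf."""
    lam = n * p
    pmf = math.exp(-lam)            # P(X = 0)
    total = 0.0
    for x in range(cutoff):
        total += pmf * H(k, x / n, n)
        pmf *= lam / (x + 1)        # P(X = x + 1) from P(X = x)
    return total

n, p, k = 50, 0.1, 3                # illustrative values only
expected_value = poisson_mean_of_H(k, p, n)
print(expected_value, p ** k)       # the two numbers should agree closely
```

The naive power $(X/n)^k$ does not have this property, which is why the falling-factorial form is used.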
10
MET - simulation demonstration
Simulation: distance to normal centroid of 200 healthy vs 200 disease samples
11
Crohn’s disease vs healthy colon biopsies
16 inactive CD patients, 18 healthy, p=0.0099 (MET), 0.025 (plugin)
12
Question - Where is the difference?
13
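Optimal Minimum-cost Flow
Let $P$ and $Q$ be two population probability mass distributions.
Phylogenetic distance is actually a weighted $L_\alpha$ norm of the optimal flow
\[ \Big( \sum_{e \in E} L_e |\Delta^*_e|^\alpha \Big)^{1/\alpha} = \|\Delta^*\|_\alpha, \quad \text{where } \Delta^* = \{\Delta^*_e\}_{e \in E} \]
Unique solution of minimum-cost flow problem - optimal net flows:
\[ \Delta^*_e = \sum_{v \in \tau(e)} (p_v - q_v) = P_e - Q_e \]
Null - No difference in population densities
\[ H_0 : \Delta^* = 0 \quad \text{vs} \quad H_1 : \Delta^* \neq 0. \]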
14
PERMANOVA - Two Sample Test
PERMANOVA is a distance-based method where the distance $D(P, Q) = \|\Delta^*\|_2$ is used.
PERMANOVA is asymptotically equivalent to
\[ D^2\Big( \frac{1}{m_1} \sum_{j=1}^{m_1} P_{1,j}, \; \frac{1}{m_2} \sum_{j=1}^{m_2} P_{2,j} \Big) = \sum_{e \in E} L_e (P_{1,e} - P_{2,e})^2 \]
However, we know little about the properties of PERMANOVA
When is PERMANOVA consistent?
Can it be improved? How?
Can we locate the differences?
15
When does PERMANOVA work?
Normality under null hypothesis.
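PERMANOVA is consistent when
\[ \|\Delta^*\|_2^2 \gg \frac{\sqrt{\operatorname{tr}[H(\Sigma(t) + n^{-1}\operatorname{diag}(P_\mu(t)))H^T]^2}}{m_1 + m_2}, \]
where $\Sigma(t) = (1 - t)\Sigma_1 + t\Sigma_2$ and $P_\mu(t) = (1 - t)P_{1,\mu} + tP_{2,\mu}$. $H_{ev} = \sqrt{L_e}$ if $v \in \tau(e)$.
PERMANOVA is a sum-of-squares type statistic.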
16
What does real data look like?
$\Delta^*$ seems to be sparse. But sum-of-squares type statistics have better power for dense alternatives.
17
Detector of Active Flow
We consider a maximum-type statistic
\[ M^* = \sup_{e \in E} |M_e|, \]
where
\[ M_e = \frac{P_{1,e} - P_{2,e}}{\sqrt{\sigma^2_{1,e}/m_1 + \sigma^2_{2,e}/m_2}} . \]
All identified active flows: $E_A = \{e : |M_e| > q_\alpha\}$
Wang, Cai and Li (2021): Biometrika.
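The edge-wise statistic and the resulting active set can be sketched as follows. This is an illustration under stated assumptions, not the DAFOT implementation: group means stand in for $P_{1,e}$ and $P_{2,e}$, sample variances stand in for $\sigma^2_{k,e}$, and the threshold is passed in by hand because the slide does not say how $q_\alpha$ is calibrated. The function name and toy data are hypothetical.

```python
from statistics import mean, variance

def active_flow(group1, group2, threshold):
    """group1, group2: lists of samples, each a list of edge masses.
    Returns (M*, indices of edges with |M_e| > threshold, all M_e)."""
    m1, m2 = len(group1), len(group2)
    stats = []
    for e in range(len(group1[0])):
        x1 = [s[e] for s in group1]
        x2 = [s[e] for s in group2]
        # Standard error of the difference in group means (assumed plug-in form).
        se = (variance(x1) / m1 + variance(x2) / m2) ** 0.5
        stats.append((mean(x1) - mean(x2)) / se)
    m_star = max(abs(s) for s in stats)
    active = [e for e, s in enumerate(stats) if abs(s) > threshold]
    return m_star, active, stats

# Toy data: two edges, three samples per group; only edge 0 differs.
g1 = [[0.50, 0.30], [0.52, 0.28], [0.48, 0.31]]
g2 = [[0.20, 0.30], [0.22, 0.29], [0.18, 0.31]]
m_star, active, stats = active_flow(g1, g2, threshold=3.0)  # threshold is illustrative
print(m_star, active)
```

Because the maximum is driven by the single largest edge, a statistic of this form is sensitive to a few strongly differing edges rather than many weak ones. The sketch fails with a division by zero if an edge has zero variance in both groups.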
18
Real Data: Ulcerative Colitis (UC)
There are 3 groups of samples.
Group         A    B    C
Sample size   18   14   15
UC            N    Y    Y
Treatment     N    N    Y
Reads are placed by SEPP. 3991 leaves (tips) and 3990 internal nodes.
19
Real Data: Global test
4 different methods are applied.
               DAFOT   DAFOT-log   PERM-L1   PERM-L2
Group A vs B   0.007   0.256       0.378     0.460
Group A vs C   0.147   0.495       0.305     0.270
Group B vs C   0.648   0.832       0.639     0.677
20
Real Data: Where is the difference?
21
Example 2 - Bacterial Growth Rate Estimation
Korem et al., 2015; Pienkowska et al., 2019.
Only works for bacteria with known complete genomes.
22
Metagenome assembled genomes (MAGs)
50% species have no genome sequences (Sangwan et al 2016)
23
Coverages of contigs - 6 PLEASE samples
Top 3: normal controls. Bottom 3: IBD patients.
[Figure: six panels, one per sample; x-axis: contig.unordered (0 to 140), y-axis: normalized log-coverage]
24
PCA vs Coverages - 6 PLEASE samples
Top 3: normal controls. Bottom 3: IBD patients.
[Figure: six panels, one per sample; x-axis: contig.ordered (0 to 140), y-axis: normalized log-coverage]
25
Illustration of the Statistical/Computational Problem
For a given bacterium:
26
Optimal permutation recovery
Dynamics Estimator of Microbial Communities (DEMIC)
https://github.com/scottdaniel/sbx_demic (Scott Daniel)
Key Step - PCA for permutation recovery of contigs along the bacterial genomes.
Gao and Li (2018 Nature Methods)
Statistical model - permuted monotone matrix model
Optimal permutation recovery and estimation of growth rates.
Ma, Cai and Li (2021 JASA)
Ma, Cai and Li (2021 Biometrika)
Cai, Li and Ma (2021, JMLR)
27
Optimal permutation recovery
For a given MAG
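Permuted monotone matrix model: $X$ is GC-adjusted log-read counts along the genome - $n$ samples and $p$ contigs,
\[ Y_{n \times p} = \pi(X_{n \times p}), \quad X_{n \times p} = \Theta_{n \times p} + Z_{n \times p} \]
where $X, \Theta, Z \in \mathbb{R}^{n \times p}$, $\pi$ is a column-permutation operator, and
\[ \Theta \in D = \{ \Theta = (\theta_{ij}) : 0 < \theta_{i,j} \le \theta_{i,j+1} < \infty, \ \forall i, j \}. \]
$Z$: some additive noise (i.i.d. Gaussian, $N(0, \sigma^2)$).
The goal is to recover $\pi$ based on observed $Y$.
Solution: 1st PC, $\hat{\pi} = r(w_1^\top Y)$ as an estimate of $\pi$, where $w_1$ is the vector of loading coefficients of the 1st PC.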
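The PCA step can be sketched on simulated data. This is a pure-Python illustration under several assumptions that are not taken from the slide: $r(\cdot)$ is read as the ranking operator, each row is centered across contigs before the PCA, the leading loading vector is found by power iteration, and its sign is fixed so that the loadings sum to a non-negative value. The simulation settings are arbitrary.

```python
import random

def first_pc_loading(Y, iters=200):
    """Leading eigenvector of the row-centered Gram matrix, by power iteration."""
    n, p = len(Y), len(Y[0])
    Yc = [[y - sum(row) / p for y in row] for row in Y]
    G = [[sum(Yc[i][k] * Yc[j][k] for k in range(p)) for j in range(n)]
         for i in range(n)]
    w = [1.0] * n
    for _ in range(iters):
        w = [sum(G[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    if sum(w) < 0:                      # sign convention (an assumption)
        w = [-v for v in w]
    return w

def recover_order(Y):
    """Column indices of Y sorted by their first-PC score w_1^T Y."""
    w = first_pc_loading(Y)
    p = len(Y[0])
    scores = [sum(w[i] * Y[i][j] for i in range(len(Y))) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j])

# Simulate X = Theta + Z with theta_ij = a_i * eta_j + b_i, then permute columns.
random.seed(0)
n, p, sigma = 8, 12, 0.05
a = [1 + random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]
X = [[a[i] * j + b[i] + random.gauss(0, sigma) for j in range(p)] for i in range(n)]
perm = list(range(p))
random.shuffle(perm)
Y = [[X[i][perm[k]] for k in range(p)] for i in range(n)]

order = recover_order(Y)
recovered = [perm[k] for k in order]    # original positions, in recovered order
print(recovered)
```

In this low-noise setting the recovered sequence of original positions should be increasing. With real coverage data the orientation (which end is the origin of replication) is a separate question that this sketch does not address.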
28
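Theoretical Properties (Ma, Cai and Li 2021 JASA)
Linear growth model - the parameter space for $\Theta$:
\[ D_L = \{ \Theta \in \mathbb{R}^{n \times p} : \theta_{ij} = a_i \eta_j + b_i, \text{ where } a_i, b_i \ge 0 \text{ for } 1 \le i \le n, \ 0 \le \eta_j \le \eta_{j+1} \text{ for } 1 \le j \le p - 1 \}, \]
A key quantity:
\[ \Gamma(\Theta) = \Big( n^{-1} \sum_{i=1}^{n} a_i^2 \Big)^{1/2} \cdot \min_{1 \le i < j \le p} |\eta_i - \eta_j| . \]
Theorem (Exact Recovery)
Suppose the entries of the noise $Z$ are i.i.d. $N(0, \sigma^2)$. Then under some mild conditions, whenever
\[ \Gamma \gtrsim \sigma \sqrt{\frac{\log p}{n}}, \]
we have $\hat{\pi} = \pi$ with probability at least $1 - p^{-c}$.
29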
Estimation of PTR (Ma, Cai and Li 2021 Biometrika)
Proposed estimators of peak/trough coverage: $\hat{\Theta}_{\max}$ / $\hat{\Theta}_{\min}$:
1 Obtain the optimal permutation estimator $\hat{\pi}$ to reorder the columns (contigs);
2 Fit simple linear regression for each row (sample);
3 Define $\hat{\Theta}_{\max}$ and $\hat{\Theta}_{\min}$ as the fitted maximum and minimum values.
=⇒ DEMIC algorithm.
Optimal and adaptive estimation of PTR and the two extreme values (peak and trough) for general growth model.
Ma, Cai and Li (2021): Biometrika
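Steps 2 and 3 above can be sketched for a single sample whose contigs have already been reordered. The following is a minimal illustration, not the DEMIC code: using the contig position 0, 1, ..., p-1 as the regressor, and converting to a ratio with the natural exponential, are both assumptions made here.

```python
import math

def fit_line(y):
    """Least-squares intercept and slope of y against positions 0..len(y)-1."""
    p = len(y)
    xbar = (p - 1) / 2
    ybar = sum(y) / p
    sxx = sum((x - xbar) ** 2 for x in range(p))
    sxy = sum((x - xbar) * (v - ybar) for x, v in enumerate(y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def peak_trough(y_ordered):
    """Fitted maximum and minimum along the ordered contigs."""
    intercept, slope = fit_line(y_ordered)
    fitted = [intercept + slope * x for x in range(len(y_ordered))]
    return max(fitted), min(fitted)

def ptr_from_log(peak, trough):
    """Peak-to-trough ratio, assuming coverages are on the natural-log scale."""
    return math.exp(peak - trough)

log_cov = [1.0, 1.5, 2.0, 2.5]          # toy log-coverages for one sample
peak, trough = peak_trough(log_cov)
ptr = ptr_from_log(peak, trough)
print(peak, trough, ptr)
```

Using the fitted values rather than the raw extremes damps the effect of noise in individual contigs.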
30
Penn PLEASE Study (Lewis et al. (2015): Cell Host & Microbe)
PLEASE (Pediatric Crohn's Disease) study at Penn: 90 × 4 shotgun metagenomic samples and 26 normal children (ave 11 × 10^6 paired-end reads).
Outcome: Fecal calprotectin (FCP) (reduction below 250 mcg/g).
Metabolomics: fecal metabolites.
90 Children with Active Crohn's Disease
Baseline: Stool Microbiome, Dietary recalls x 3, FCP, PCDAI
Treatment at Discretion of Treating Physician: Diet Therapy (n=38), Anti-TNF Therapy (n=52)
Week 1: Stool Microbiome, Dietary recalls x 3, FCP
Week 4: Stool Microbiome, Dietary recalls x 3, FCP
Week 8: Stool Microbiome, Dietary recalls x 3, FCP, PCDAI
Anti-TNF: 26 (50%) with a reduction in FCP below 250 mcg/g.
Enteral Diet: 12 (32%) with a reduction in FCP below 250 mcg/g.
Lewis, Chen et al. (2015): Cell Host & Microbe.
31
Species with differential growth rate
DEMIC estimated growth dynamics for 278 species, 20% in 50 or more samples.
7 MAGs with different growth rate in healthy and Crohn's disease samples (FDR < 0.05)
Contig cluster   Completeness   Contamination   Healthy vs Crohn's   Marker lineage
metabat2.187     61.7%          0.0%            High                 k__Bacteria
metabat2.239     58.5%          1.8%            High                 o__Clostridiales
metabat2.250     66.6%          0.8%            High                 p__Proteobacteria
metabat2.259     79.3%          2.1%            High                 k__Bacteria
metabat2.270     72.0%          2.0%            High                 f__Lachnospiraceae
metabat2.369     68.8%          2.8%            High                 f__Lachnospiraceae
metabat2.55      55.2%          1.9%            Low                  o__Clostridiales
32
Shift of growth rates after treatment
o__Clostridiales, o__Clostridiales, k__Bacteria (uncharacterized)
[Figure, panels A-C. Contamination vs completeness of MAGs, with sample number as legend. ePTR for metabat2.239, metabat2.259 and metabat2.55 by disease status (Control vs Crohn) and by time point (1 to 4). Heatmap of samples by MAGs, annotated by disease (Control, Crohn) and treatment (antiTNF, Diet); sample types: Control, Crohn Baseline, Crohn Week 1, Crohn Week 4, Crohn Week 8.]
33
Example 3 - Deep Learning for BGC Prediction
Bioactive secondary metabolites (SMs) - antibiotics, anticancer reagents, etc.
SMs - encoded by genes that cluster together in a genetic package, referred to as a biosynthetic gene cluster (BGC).
promoter was inserted upstream of the operon starting with
VF0844. Introduction of this construct into E. coli resulted in a
yellow-pigmented strain that produced a new compound with
an absorption maximum at 425 nm (Figure 3C). Purification of
the V. fischeri compound and analysis by a combination of 1D-
and 2D-NMR experiments and high-resolution MS revealed
a structure with a similar scaffold to the E. coli APE but a
4-hydroxy-3,5-dimethylphenyl head group (Figures 3B and
[Figure 3, panels A-D. (A) Heat map of COGs across 1,021 gene clusters, with subfamilies 1, 2 and 3 and the positions of Xc, Fj, Vf and Ec marked. (B) Chemical structures: APEEc, APEVf, xanthomonadin, flexirubin (R = H, CH3, Cl). (C) Bacterial pellets: Vf WT, Vf ΔAPE, Ec Top10 + ermE-APEVf, Ec Top10 + APEVf, Ec Top10 + vector, Ec Top10 + APEEc, Ec CFT073 WT. (D) Gene clusters with a 2 kb scale bar: Escherichia coli CFT073, APEEc (c1186 - c1204); Flavobacterium johnsonii ATCC 17061, flexirubin (Fjoh_1075 - Fjoh_1110); Vibrio fischeri ES114, APEVf (VF0841 - VF0860); Xanthomonas campestris ATCC 33913, xanthomonadin (XCC3998 - XCC4015). Gene legend: CoA ligase, Ketosynthase, Thioesterase, Ketoreductase, Methyltransferase, Redox tailoring, Acyl/glycosyltransferase, Transport, LolA, Ammonia lyase, Dehydratase, ACP, Unknown.]
Figure 3. APE Gene Clusters Comprise the Largest Known BGC Family
(A) Heat map and dendrogram of all 1,021 detected APE family gene clusters, based on Clusters of Orthologous Groups generated by OrthoMCL (Li et al., 2003) using our adapted version of the Lin distance metric (Lin et al., 2006) that includes sequence similarity. Light gray indicates the presence of one gene from a COG, whereas darker gray tones indicate the presence of two or three genes from a COG. The two BGC subfamilies that functioned as the starting point of our analysis (subfamilies 1 and 2) are shown in green and red, respectively, while the smaller BGC subfamily that includes the xanthomonadin and flexirubin gene clusters (subfamily 3) is shown in blue. The positions of the two experimentally targeted gene clusters (Ec for Escherichia coli CFT073 and Vf for Vibrio fischeri ES114) as well as the Xanthomonas campestris ATCC 33913 xanthomonadin (Xc) and Flavobacterium johnsonii ATCC 17061 flexirubin (Fj) gene clusters are indicated below the heat map. See Figure S5 for a version with more detailed annotations.
(B) Chemical structures obtained for the APE compounds from E. coli and V. fischeri, and the previously determined chemical structures of xanthomonadin and flexirubin. Note the difference in polyene acyl chain length as well as the distinct tailoring patterns on the aryl head groups.
(C) Bacterial pellets from strains harboring APE gene clusters showing the pigmentation conferred by aryl polyenes.
(D) Genetic architecture of the four characterized aryl polyene gene clusters. The inset in the Flavobacterium johnsonii flexirubin gene cluster is a subcluster putatively involved in the biosynthesis of dialkylresorcinol (Fuchs et al., 2013), which is acylated to an APE to form flexirubin.
See Data S1 for schematics of all 1,021 APE gene clusters from (A).
416 Cell 158, 412–421, July 17, 2014 ©2014 Elsevier Inc.
Cimermancic et al (2014, Cell): ClusterFinder (2-state HMM).
34
MIBiG - Repository of Known BGCs
MIBiG database: Repository of Known Biosynthetic Gene Clusters
BGC0000007: aflatoxin biosynthetic gene cluster from Aspergillus flavus
Location: 1 - 82,511 nt. (total: 82,511 nt). This entry is originally from NCBI GenBank AY510452.1.
Gene legend: core biosynthetic genes, additional biosynthetic genes, transport-related genes, regulatory genes, resistance genes, other genes
General information about the BGC
MIBiG accession: BGC0000007
Short description: aflatoxin biosynthetic gene cluster from Aspergillus flavus
Status: Minimal annotation: yes; Completeness: Unknown
Biosynthetic class(es): Polyketide
Loci: NCBI GenBank: AY510452.1
Compounds: aflatoxin
Species: Aspergillus flavus
References:
Variability in nitrogen regulation of aflatoxin production by Aspergillus flavus strains. Ehrlich KC et al., Appl. Microbiol. Biotechnol. (2002) PMID:12382060
Aflatoxin biosynthesis cluster gene cypA is required for G aflatoxin formation. Ehrlich KC et al., Appl. Environ. Microbiol. (2004) PMID:15528514
35
Identification of all BGCs in bacterial genomes
Goal: identifying and categorizing all the BGCs in 5666 bacterial genomes.
Training data set: 1,984 BGC gene sequences from MIBiG v1.4 database, ORF/gene prediction, Pfam domains. 3,685 Pfam domains, 1094 species
Background: 5,666 reference genomes from NCBI database, 11,427 unique Pfam domains. $n_{\text{non-BGC}}$ = 10,128 controls.
DeepBiG: An LSTM RNN-based deep learning algorithm
Embedding - Pfam domain names, Pfam clans, Pfam function (1126-dim vector)
Liu, Li and Li (2021): manuscript
36
Known BGCs
No. of Pfam domains in 1984 BGCs.
[Figure: histogram; x-axis: No. of Pfams (0 to 300), y-axis: Count (0 to 300)]
37
Pfam clans
Sequence, structures, functional similarity and similarity between pHMMs.
https://www.ebi.ac.uk
38
DeepBiG - deep learning and embedding
Data: Contiguous segments of 256 Pfam sequences (+padding)
Embedding: Pfam domain names, Pfam clans, Pfam function descriptions ⇒ LSTM RNN
[Architecture diagram. Per Pfam: 102-d PfamEmb; 64-d ClanEmb; top 64 chars of summary, each char c1, c2, …, c64 mapped to a 32-d emb, then 30 size 3 filters with padding, then maxpooling, giving 960-d ClarEmb. Concatenation gives the 1126-d Emb. Pfam 1, Pfam 2, …, Pfam 256 → Emb 1, Emb 2, …, Emb 256 → two rows of LSTM cells → Softmax → Labels.]
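The input preparation described on this slide can be sketched as below. This is an illustration only: cutting the Pfam sequence into non-overlapping windows, the padding token and the function name are assumptions made here, and the dictionary merely checks that the three component widths quoted in the diagram add up to the 1126-d embedding.

```python
PAD = "<PAD>"          # hypothetical padding token
SEGMENT_LEN = 256      # segment length quoted on the slide

def segment(pfams, length=SEGMENT_LEN, pad=PAD):
    """Cut a list of Pfam IDs into fixed-length segments, padding the last one."""
    segments = []
    for start in range(0, len(pfams), length):
        chunk = list(pfams[start:start + length])
        chunk += [pad] * (length - len(chunk))
        segments.append(chunk)
    return segments

# Component widths as quoted in the architecture diagram.
EMB_DIMS = {"pfam_name": 102, "clan": 64, "summary_chars": 960}
total_dim = sum(EMB_DIMS.values())

genome = [f"PF{i:05d}" for i in range(600)]   # made-up Pfam IDs for illustration
segments = segment(genome)
print(len(segments), len(segments[-1]), total_dim)
```

A real pipeline might instead use overlapping or sliding windows; the slide does not say which.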
39
DeepBiG - Data Augmentation
In expectation, a sequence has one Pfam domain replaced; each epoch uses newly perturbed data.
[Diagram: a Pfam sequence of length L (Pfam 1, Pfam 2, Pfam 3, …, Pfam L); each Pfam is replaced by one of its similar Pfams (e.g. Pfam 2_1, …, Pfam 2_i, …, Pfam 2_n for Pfam 2) with prob 1/L, giving a positive/fake Pfam sequence of length L.]
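The replacement scheme can be sketched as follows. This is a minimal illustration rather than the DeepBiG code: the similarity table, the placeholder Pfam names and the uniform choice among similar Pfams are assumptions made here. Each position is replaced independently with probability 1/L, so the expected number of replacements is L × 1/L = 1 when every Pfam has at least one similar Pfam.

```python
import random

def augment(seq, similar, rng):
    """Replace each Pfam by a randomly chosen similar Pfam with probability 1/len(seq)."""
    L = len(seq)
    out = []
    for pf in seq:
        candidates = similar.get(pf, [])
        if candidates and rng.random() < 1.0 / L:
            out.append(rng.choice(candidates))
        else:
            out.append(pf)
    return out

# Hypothetical sequence and similarity table (placeholder names, not real accessions).
seq = ["PF_A", "PF_B", "PF_C", "PF_D"]
similar = {"PF_A": ["PF_A1", "PF_A2"], "PF_B": ["PF_B1"],
           "PF_C": ["PF_C1", "PF_C2"], "PF_D": ["PF_D1"]}

rng = random.Random(0)
trials = 2000
replaced = 0
for _ in range(trials):                 # a fresh perturbation per draw, as per epoch
    new_seq = augment(seq, similar, rng)
    replaced += sum(a != b for a, b in zip(seq, new_seq))
avg_replaced = replaced / trials
print(avg_replaced)
```

Drawing a new perturbed copy at every epoch means the network rarely sees exactly the same sequence twice.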
40
DeepBiG - Embedding, multi-class, training set
[Figure: t-SNE plot (TSNE 1 vs TSNE 2), colored by class: NRP, Other, Polyketide, RiPP, Saccharide, Terpene]
41
DeepBiG - Embedding, binary, testing set
[Figure: t-SNE plot (TSNE 1 vs TSNE 2), colored by label: BGC, Non-BGC]
42
DeepBiG Prediction Results
Testing set: 13 contigs of 9 genomes with 291 known BGCs never used in training; 10 x 13 = 130 artificial genomes with the 291 known BGCs fixed in their original genomes and the other regions replaced with non-BGCs.
              DeepBGC          DeepBiG-C        DeepBiG          DeepBiG-O
Pfam level
  Precision   0.834 (0.0172)   0.830 (0.0148)   0.822 (0.0112)   0.777 (0.0157)
  Recall      0.745 (0.0082)   0.821 (0.0052)   0.862 (0.0061)   0.880 (0.0036)
  F1          0.787 (0.0067)   0.825 (0.0071)   0.842 (0.0060)   0.825 (0.0091)
Pfam BGC level (pBGC) - F1 score, by overlap
  0.2         0.694 (0.0131)   0.747 (0.0150)   0.781 (0.0147)   0.765 (0.0112)
  0.4         0.670 (0.0089)   0.724 (0.0163)   0.763 (0.0136)   0.741 (0.0115)
  0.6         0.626 (0.0108)   0.676 (0.0162)   0.714 (0.0158)   0.693 (0.0139)
  0.8         0.523 (0.0148)   0.574 (0.0199)   0.614 (0.0173)   0.584 (0.0220)
  1.0         0.270 (0.0156)   0.287 (0.0141)   0.302 (0.0161)   0.228 (0.0129)
DeepBGC: Hannigan et al., 2019 NAR.
43
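As a quick consistency check on the Pfam-level rows, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the function name is illustrative. Agreement is only expected to be approximate, since the reported figures may be averages over repeated runs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) at the Pfam level, copied from the slide.
reported = {
    "DeepBGC":   (0.834, 0.745, 0.787),
    "DeepBiG-C": (0.830, 0.821, 0.825),
    "DeepBiG":   (0.822, 0.862, 0.842),
    "DeepBiG-O": (0.777, 0.880, 0.825),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {f1:.3f}")
```

The comparison shows the trade-off in the table: DeepBiG-O has the highest recall but the lowest precision, while DeepBiG has the best balance.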
DeepBiG multiclass prediction
Testing set: 160 new BGCs deposited to MIBiG v1.5
[Figure: t-SNE plot (TSNE 1 vs TSNE 2), colored by class: NRP, Other, Polyketide, RiPP, Saccharide, Terpene]
44
DeepBiG multiclass prediction
Testing set: 160 new BGCs deposited to MIBiG v1.5
F1 scores
          DeepBiG-C       DeepBiG         DeepBGC
Overlap   pBGC   gBGC     pBGC   gBGC     pBGC   gBGC
0.20      77.2   74.7     95.7   93.2     90.7   90.7
0.40      72.2   72.2     93.8   92.0     88.3   86.4
0.60      66.7   68.5     91.4   92.0     85.8   84.0
0.80      60.5   63.6     89.5   90.7     82.1   81.5
1.00      54.9   60.5     85.2   88.3     74.7   79.0
45
All BGCs predicted by DeepBiG
There are 170,685 predicted BGCs in all 5666 bacterial genomes (ave 30.1, range 0-243 BGCs)
[Figure, panels (a)-(d). (a) Portion of BGCs by class (NRP, Other, Polyketide, RiPP, Saccharide, Terpene), DeepBiG vs MIBiG. (b) #Genome vs Ratio. (c) #Genome vs #BGC. (d) #BGC vs #Gene.]
46
Concluding remarks
Three problems and statistical methods in microbiome data science: strong empirical performance with theoretical insights.
Next steps:
precision nutrition, microbiome and health
nutrition and host immunology
47
Acknowledgments
Many thanks to:
Li lab (NIH grants: R01GM129781; R01GM123056)
Shulei Wang, Yuan Gao, Rong Ma, Mingyang Liu and Yun Li
Biology collaborators
Gary Wu, MD (Gastroenterology)
Rick Bushman, PhD (Microbiology)
James Lewis, MD (Gastroenterology and DBEI)
Kyle Bittinger, PhD (CHOP)
People in their labs
T. Tony Cai, PhD (The Wharton School)
48