microbiome data science - phylogenetic tree, bacterial

Microbiome Data Science - Phylogenetic Tree,Bacterial Growth and Biosynthetic Gene Clusters

Hongzhe LiPerelman Professor of Biostatistics

University of Pennsylvania

June 14, 2021

1

Microbiome and its Function

conditions seen in paediatrics that have been asso-ciated with changes in the gut microbiota (table 1).8–16

Inflammatory bowel disease (IBD)—This is one ofthe most extensively studied human conditions asso-ciated with the gut microbiota. The composition ofthe gut microbiota differs between healthy individualsand patients with IBD both in terms of species rich-ness (ie, numbers of bacterial species) and speciesabundances (ie, number of individuals per species). Asbacteria are identified by sequencing, rather than byfunctional characteristics in the culture laboratory, theindividual bacterial species or genus (depending onthe classification of sequence data) are commonlyreferred to as operational taxonomic units in micro-biota research.Studies have reported patients with IBD to have

decreased bacterial diversity, and reduced abundancesof Firmicutes and Bacteroidetes and an increasedabundance of Proteobacteria compared with healthyindividuals.17 Latest research suggests that IBD patho-genesis is due to the interaction of environmentalfactors (eg, smoking, diet and stress) and the host’sgenetic susceptibility, which is influenced by com-mensal microbiota, which activates either pathogenicor protective immune responses.18 Evidence frommouse models provides further support for the role ofgut microbiota in pathogenesis of IBD.19

Necrotising enterocolitis (NEC)—The pathogenesisof NEC is multifactorial although the gut microbiota

is thought to play a crucial role. Studies in bothhumans and animal models have described changes inthe gut microbiota including a reduction in bacterialdiversity and increased abundances of Proteobacteriain preterm infants with NEC compared with healthypreterm infants.20 However, the results have beeninconsistent across studies, and to date, no singlecausative set of microorganisms has been identified.Atopic diseases—Conditions such as eczema, asthma

and food allergies are increasing in incidence. This isoften linked to the hygiene hypothesis. It is thoughtthat the lack of early-life exposure to microbial anti-gens in hygienic developed countries alters the micro-biota composition of the infant gut, which disruptsimmune development causing allergic disease.21 Forexample, species like Bacteroides fragilis reportedlyinduces immunological tolerance through immunereceptor signalling pathways.22 Also the infant gutmicrobiota is affected by environmental factorsincluding pets, residing in rural homes and siblingsshown to have protective effects against asthma andallergies.23 The concept that altered microbiomecomposition influences childhood allergic disease sus-ceptibility is further supported by data from epi-demiological studies that report higher prevalence ofatopic diseases in infants delivered by caesarean sec-tions, formula fed infants and those exposed toantibiotics.24

Type 1 diabetes—The gut microbiota is involved inregulation of the metabolic–immune axis.8 Research

Figure 1 The human microbiome plays an important role in control of vital homeostatic mechanisms in the body. These includeenhanced metabolism, resistance to infection and inflammation, prevention against autoimmunity as well as an effect on the gut–brain axis. SCFA, short-chain fatty acid.

Research in practice

2 Amon P, Sanderson I. Arch Dis Child Educ Pract Ed 2017;0:1–4. doi:10.1136/archdischild-2016-311643

copyright. on O

ctober 1, 2019 by guest. Protected byhttp://ep.bm

j.com/

Arch Dis C

hild Educ Pract Ed: first published as 10.1136/archdischild-2016-311643 on 28 February 2017. Dow

nloaded from

https://ep.bmj.com/content/102/5/257 (Amon and Sanderson, 2016)2

Microbiome, metabolites and immunologyKim (2018, Immunology)

Immune regulation by microbiome metabolites

Immunology, Volume: 154, Issue: 2, Pages: 220-229, First published: 22 March 2018, DOI: (10.1111/imm.12930)

Question: how microbiome produces different metabolites?3

The Human Microbiome and Cancer

Rajagopala (2017 Cancer Prevention Research).

Question - microbiome-based individual treatment assignment?

4

The Human Microbiome and Cancer TreatmentAnti-PD-1 immunotherapy in melanoma patients

Gopalakrishnan et al (2018 Science).

5

Shotgun MetagenomicsSlide from Katie Pollard

Question: can we understand the growth dynamics?

6

Big data problemOne sample:

50 million sequences300 letters eachfrom 100s of species mixed

reference database:

∼ 300,000 bacterial genomes5 million letters each

Three examples

role of statistics in microbiome data science

7

Example 1 - data on the phylogenetic treeData = probability mass on the tree

Wang, Cai and Li (2021, JASA)Wang, Cai and Li (2021, Biometrika)

8

Estimation of Wasserstein distance

P and Q are two true probability mass distributions on the tree

Wasserstein distance between two microbial communities

D(P,Q) =∑

e∈ELe |Pe −Qe| ,

Weighted UniFrac distance - plugin estimate

D(P , Q) :=∑

e∈ELe

∣∣∣Pe − Qe∣∣∣ .

Estimate is biased when some Pe ≈ Qe

9

MET - moment-screening estimator on a treeWang, Cai and Li (2021 JASA): using implicit best polynomial approximationthat incorporates the tree structure.

MET is computationally efficient and is minimax rate-optimal ( s log(2d+2/s)

n logn )

Wu and Yang (2016) and Jiao, Han, and Weissman (2018) and references therein). To be more

specific, let FKe (Pe, Qe) be some polynomial of degree at most K designed for (Pe, Qe) at edge

e ∈ Er

FKe (Pe, Qe) =

K∑

k1,k2=0

fe(k1, k2)P k1e Q

k2e .

The choice of FKe (Pe, Qe) is usually determined by (Pe,0, Qe,0) to approximate target functional

locally (see, e.g. Jiao, Han, and Weissman, 2018). The corresponding unbiased estimator of

FKe (Pe, Qe) can be written as

FKe =

K∑

k1,k2=0

fe(k1, k2)Hk1(Pe,1)Hk2(Qe,1),

where Hk(Pe,1) is an unbiased estimator for P ke , i.e. Hk(x) =

∏k−1m=0

(x− m

n

)if k ≥ 1 and

Hk(x) = 1 when k = 0. Thus, the bias of estimator∑

e∈ErLeF

Ke is mainly the approximation

error of the carefully chosen polynomials∣∣∣∣∣E∑

e∈Er

Le

(FKe − |Pe −Qe|

)∣∣∣∣∣ ≤∑

e∈Er

Le supx,y

∣∣FKe (x, y)− |x− y|

∣∣ .

The squared error of estimator∑

e∈ErLeF

Ke can thus be decomposed into bias (Bias A) and vari-

ance (Variance A) as illustrated in left half of Figure 2.∑

e∈ErLe|Pe −Qe|

∑e∈Er

LeFK(Pe, Qe)

∑e∈Er

LeFKe

Bias A

Variance A

∑e∈Er

Le|Pe − Qe|

∑e∈Er

LeFK(Pe, Qe)

Bias B

“Variance B”

Figure 2: Bias and variance decomposition of polynomial approximation estimator and momentscreening estimator.

10

10

MET - simulation demonstration

Simulation: distance to normal centroid of 200 healthy vs 200 disease samples

11

Crohn’s disease vs healthy colon biopsies

16 inactive CD patients, 18 healthy, p=0.0099 (MET), 0.025 (plugin)

12

Question - Where is the difference?

13

Optimal Minimum-cost FlowLet P and Q are two population probability mass distributions.

Phylogenetic distance is actually weighted Lα norm of optimal flow

(∑

e∈ELe|∆∗e|α

)1/α

= ‖∆∗‖α, where ∆∗ = {∆∗e}e∈E

Unique solution of minimum-cost flow problem - optimal net flows:

∆∗e =∑

v∈τ(e)(pv − qv) = Pe −Qe

Null - No difference in population densities

H0 : ∆∗ = 0 vs H1 : ∆∗ 6= 0.

14

PERMANOVA - Two Sample Test

PERMANOVAis a distance-based method where distance D(P,Q) = ‖∆∗‖2 isused.

PERMANOVA is asymptotically equivalent to

D2

1

m1

m1∑

j=1

P1,j ,1

m2

m2∑

j=1

P2,j

=

∑

e∈ELe(P1,e − P2,e)

2

However, we know little about the property for PERMANOVA

When PERMANOVA is consistent?

Can it be improved? How?

Can we locate the differences?

15

When does PERMANOVA work?

Normality under null hypothesis.

PERMANOVA is consistent when

‖∆∗‖22 �√

tr[H(Σ(t) + n−1diag(Pµ(t)))HT ]2

m1 +m2,

where Σ(t) = (1− t)Σ1 + tΣ2 and Pµ(t) = (1− t)P1,µ + tP2,µ.Hev =

√Le if v ∈ τe.

PERMANOVA is a sum-of-square type statistics.

16

What does real data look like?∆∗ seems to be sparse. But sum-of-square type statistics has better power fordense alternative.

17

Detector of Active Flow

We consider a maximum type statistics

M∗ = supe∈E|Me|,

where

Me =P1,e − P2,e√

σ21,e/m1 + σ22,e/m2

.

All identified active flow EA = {e : |Me| > qα}

Wang,Cai and Li (2021): Biometrika.

18

Real Data: Ulcerative Colitis(UC)

There are 3 groups of samples.

Group A B C

Sample size 18 14 15UC N Y Y

Treatment N N Y

Reads are placed by SEPP. 3991 leaves(tips) and 3990 internal nodes.

19

Real Data: Global test

4 different methods are applied.

DAFOT DAFOT-log PERM-L1 PERM-L2

Group A vs B 0.007 0.256 0.378 0.460Group A vs C 0.147 0.495 0.305 0.270Group B vs C 0.648 0.832 0.639 0.677

20

Real Data: Where is the difference?

21

Example 2 - Bacterial Growth Rate EstimationKorem et al., 2015; Pienkowska et al., 2019.

Only works for bacteria with known complete genomes.22

Metagenome assembled genomes (MAGs)50% species have no genome sequences ( Sangwan et al 2016)

23

Coverages of contigs - 6 PLEASE samplesTop 3: normal controls. Bottom 3: IBD patients.

0 20 40 60 80 100 120 140

−0.

6−

0.4

−0.

20.

00.

20.

4

contig.unordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.

8−

0.6

−0.

4−

0.2

0.0

0.2

0.4

0.6

contig.unordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.

50.

00.

5

contig.unordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.

50.

00.

5

contig.unordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.

50.

00.

5

contig.unordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.

6−

0.2

0.0

0.2

0.4

0.6

contig.unordered

norm

aliz

ed lo

g−co

vera

ge

24

PCA vs Coverages - 6 PLEASE samplesTop 3: normal controls. Bottom 3: IBD patients.

0 20 40 60 80 100 120 140

−0.6

−0.4

−0.2

0.0

0.2

0.4

contig.ordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

contig.ordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.5

0.0

0.5

contig.ordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.5

0.0

0.5

contig.ordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.5

0.0

0.5

contig.ordered

norm

aliz

ed lo

g−co

vera

ge

0 20 40 60 80 100 120 140

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

contig.ordered

norm

aliz

ed lo

g−co

vera

ge

25

Illustration of the Statistical/Computational ProblemFor a given bacteria:

26

Optimal permutation recovery

Dynamics Estimator of Microbial Communities (DEMIC)https://github.com/scottdaniel/sbx demic (Scott Daniel)

Key Step - PCA for permutation recovery of contigs along the bacterialgenomes.Gao and Li (2018 Nature Methods)

Statistical model - permuted monotone matrix model

Optimal permutation recovery and estimation of growth rates.Ma, Cai and Li (2021 JASA)Ma, Cai and Li (2021 Biometrika)Cai, Li and Ma (2021, JMLR)

27

Optimal permutation recovery

For a given MAG

Permuted monotone matrix model: X is GC-adjusted log-read countsalong the genome - n samples and p contigs,

Yn×p = π(Xn×p), Xn×p = Θn×p + Zn×p

where X,Θ, Z ∈ Rn×p, π is a column-permutation operator, and

Θ ∈ D =

{Θ = (θij) : 0 < θi,j ≤ θi,j+1 <∞,∀i, j

}.

Z: some additive noise (i.i.d. Gaussian, N(0, σ2)).

The goal is to recover π based on observed Y .

Solution: 1st PC, π = r(w>1 Y ) as an estimate of π, w1 is loadingcoefficients of the 1st PC.

28

Theoretical Properties (Ma, Cai and Li 2021 JASA)Linear growth model - the parameter space for Θ:

DL =

{Θ ∈ Rn×p :

θij = aiηj + bi, where ai, bi ≥ 0 for 1 ≤ i ≤ n,

0 ≤ ηj ≤ ηj+1 for 1 ≤ j ≤ p− 1

},

A key quantity:

Γ(Θ) =

(n−1

n∑

i=1

a2i

)1/2

· min1≤i<j≤p

|ηi − ηj |.

Theorem (Exact Recovery)

Suppose the noise Z are i.i.d. N(0, σ2). Then under some mild conditions,whenever

Γ & σ

√log p

n,

we have π = π with probability at least 1− p−c.29

Estimation of PTR (Ma, Cai and Li 2021 Biometrika)

Proposed estimators of peak/trough coverage: Θmax/ Θmin:

1 Obtain the optimal permutation estimator π to reorder the columns(contigs);

2 Fit simple linear regression for each row (sample);

3 Define Θmax and Θmin as the fitted maximum and minimum values.

=⇒ DEMIC algorithm.

Optimal and adaptive estimation of PTR and the two extreme values (peakand trough) for general growth model.Ma, Cai and Li (2021): Biometrika

30

Penn PLEASE Study (Lewis et al. (2015): Cell Host & Microbe)PLEASE (Pediatric Crohn’s Disease) study at Penn: 90× 4 shotgun metagenomicsamples and 26 normal children (ave 11×106 paired-end reads).Outcome: Fecal calprotection (FCP) (reduction below 250mcg/g).

Metabolomics: fecal metabolites.

Week 1: Stool Microbiome, Dietary recalls x 3, FCP

Week 4: Stool Microbiome, Dietary recalls x 3, FCP

Week 8: Stool Microbiome, Dietary recalls x 3, FCP, PCDAI

90 Children with Active Crohn’s Disease

Diet Therapy (n=38) Anti-TNF Therapy (n=52)

Treatment at Discretion of Treating Physician

Baseline: Stool Microbiome, Dietary recalls x 3, FCP, PCDAI

Anti-TNF: 26 (50%)a reduction in FCPbelow 250 mcg/g.

Enteral Diet: 12(32%) a reduction inFCP below 250mcg/g.

Lewis, Chen et al.

(2015): Cell Host &

Microbe.

31

Species with differential growth rate

DEMIC estimated growth dynamics for 278 species, 20% in 50 or moresamples.

7 MAGs with different growth rate in healthy and Crohn’s disease samples(FDR< 0.05)

Contig cluster Completeness Contamination Healthy vs Marker lineageCrohn’s

metabat2.187 61.7% 0.0% High kBacteriametabat2.239 58.5% 1.8% High oClostridialesmetabat2.250 66.6% 0.8% High pProteobacteriametabat2.259 79.3% 2.1% High kBacteriametabat2.270 72.0% 2.0% High fLachnospiraceaemetabat2.369 68.8% 2.8% High fLachnospiraceaemetabat2.55 55.2% 1.9% Low oClostridiales

32

Shift of growth rates after treatmentoClostridiales, oClostridiales, kbacteria (uncharacterized)

0

3

6

9

0 25 50 75 100Completeness

Contamination

Sample Number

metabat2.239 metabat2.259 metabat2.55

Control Crohn Control Crohn Control Crohn1.5

1.8

2.1

2.4

1.5

2.0

2.5

1.0

1.5

2.0

2.5

Disease

ePTR

factor(Disease)ControlCrohn

metabat2.239 metabat2.259 metabat2.55

1 2 3 4 1 2 3 4 1 2 3 4

1.5

2.0

2.5

1.5

2.0

2.5

3.0

3.5

1.00

1.25

1.50

1.75

2.00

Time

ePTR

factor(Time)1234

P4030

P4029

P4028

P4026

P4025

P4024

P4023

P4022

P4021

P4020

P4019

P4018

P4017

P4014

P4013

P4011

P4010

P4009

P4007

P4006

P4005

P4004

P4002

P4001

P4000

P7015.01

P7014.01

P7013.01

P7012.01

P7011.01

P7010.01

P7009.01

P7008.01

P7006.01

P7005.01

P7004.01

P7003.01

P7002.01

P7001.01

P6019.01

P6017.01

P6010.01

P6009.01

P6005.01

P6004.01

P5066.01

P5063.01

P5039.01

P5037.01

P5027.01

P5020.01

P5018.01

P5017.01

P5013.01

P5012.01

P5011.01

P5008.01

P5004.01

P6018.01

P6016.01

P6014.01

P6013.01

P6012.01

P6008.01

P6003.01

P5065.01

P5064.01

P5060.01

P5058.01

P5057.01

P5056.01

P5055.01

P5054.01

P5053.01

P5052.01

P5050.01

P5048.01

P5047.01

P5046.01

P5045.01

P5044.01

P5042.01

P5041.01

P5040.01

P5034.01

P5033.01

P5031.01

P5030.01

P5029.01

P5023.01

P5022.01

P5015.01

P5010.01

P5007.01

P5006.01

P5003.01

P5002.01

P5001.01

metabat2.371metabat2.100metabat2.187metabat2.209metabat2.87metabat2.270metabat2.279metabat2.77metabat2.218metabat2.304metabat2.217metabat2.243metabat2.351metabat2.314metabat2.172metabat2.55metabat2.329metabat2.390metabat2.108metabat2.278metabat2.373metabat2.392metabat2.239metabat2.85metabat2.486metabat2.369metabat2.250metabat2.353metabat2.259metabat2.145metabat2.494metabat2.75metabat2.316metabat2.339metabat2.192metabat2.542metabat2.442metabat2.61metabat2.525metabat2.561metabat2.126

TreatmentDisease Disease

ControlCrohn

TreatmentantiTNFDiet

−4

−2

0

2

4

A

B C

50

100

200

ControlCrohn BaselineCrohn Week 1Crohn Week 4Crohn Week 8

Sample Type

Week 8Week 4Week 1Baseline

33

Example 3 - Deep Learning for BGC PredictionBioactive secondary metabolites (SMs) - antibiotics, anticancer reagents, etcSMs - encoded by genes that cluster together in a genetic package, referred toas a biosynthetic gene cluster (BGC).

promoter was inserted upstream of the operon starting with

VF0844. Introduction of this construct into E. coli resulted in a

yellow-pigmented strain that produced a new compound with

an absorption maximum at 425 nm (Figure 3C). Purification of

the V. fischeri compound and analysis by a combination of 1D-

and 2D-NMR experiments and high-resolution MS revealed

a structure with a similar scaffold to the E. coli APE but a

4-hydroxy-3,5-dimethylphenyl head group (Figures 3B and

A

Escherichia coli CFT073, APEEc (c1186 - c1204)

Flavobacterium johnsonii ATCC 17061, flexirubin (Fjoh_1075 - Fjoh_1110)

Vibrio fischeri ES114, APEVf (VF0841 - VF0860)

Xanthomonas campestris ATCC 33913, xanthomonadin (XCC3998 - XCC4015)

flexirubin (R = H, CH3, Cl)

O

O

xanthomonadin

O

O

Br

APEVf

HO

APEEcHO Br

O

HO

R

O

O

O

O

B

D

Vf Δ

AP

E

C

2 kb

CoA ligaseKetosynthaseThioesteraseKetoreductaseMethyltransferaseRedox tailoringAcyl/glycosyltransferase

TransportLolAAmmonia lyaseDehydratase

ACP

Unknown

Vf

WT

Ec

To

p10 +

erm

E-A

PE

Vf

Ec T

op

10

+ A

PE

Vf

Ec

To

p10

+ v

ecto

r

Ec

To

p10

+ A

PE

Ec

Ec C

FT

073

WT

Xc FjVfEc

Subfamily 1

34

CO

Gs

1,021 Gene Clusters

Subfamily 2 Subfamily 3

Figure 3. APE Gene Clusters Comprise the Largest Known BGC Family

(A) Heat map and dendrogram of all 1,021 detected APE family gene clusters, based on Clusters of Orthologous Groups generated by OrthoMCL (Li et al., 2003)

using our adapted version of the Lin distancemetric (Lin et al., 2006) that includes sequence similarity. Light gray indicates the presence of one gene from a COG,

whereas darker gray tones indicate the presence of two or three genes from a COG. The two BGC subfamilies that functioned as the starting point of our analysis

(subfamilies 1 and 2) are shown in green and red, respectively, while the smaller BGC subfamily that includes the xanthomonadin and flexirubin gene clusters

(subfamily 3) is shown in blue. The positions of the two experimentally targeted gene clusters (Ec for Escherichia coli CFT073 and Vf for Vibrio fischeri ES114) as

well as the Xanthomonas campestrisATCC 33913 xanthomonadin (Xc) and Flavobacterium johnsonii ATCC 17061 flexirubin (Fj) gene clusters are indicated below

the heat map. See Figure S5 for a version with more detailed annotations.

(B) Chemical structures obtained for the APE compounds from E. coli and V. fischeri, and the previously determined chemical structures of xanthomonadin and

flexirubin. Note the difference in polyene acyl chain length as well as the distinct tailoring patterns on the aryl head groups.

(C) Bacterial pellets from strains harboring APE gene clusters showing the pigmentation conferred by aryl polyenes.

(D) Genetic architecture of the four characterized aryl polyene gene clusters. The inset in the Flavobacterium johnsonii flexirubin gene cluster is a subcluster

putatively involved in the biosynthesis of dialkylresorcinol (Fuchs et al., 2013), which is acylated to an APE to form flexirubin.

See Data S1 for schematics of all 1,021 APE gene clusters from (A).

416 Cell 158, 412–421, July 17, 2014 ª2014 Elsevier Inc.

Cimermancic et al (2014, Cell): ClusterFinder (2-state HMM).34

MIBiG - Repository of Knows BGCsMIBiG database

Repository of Known Biosynthetic Gene Clusters

View antiSMASH-generated outputDownload Cluster GenBank file

BGC0000007: aflatoxin biosynthetic gene cluster from Aspergillus flavusLocation: 1 - 82,511 nt. (total: 82,511 nt). This entry is originally from NCBI GenBank AY510452.1.

10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000

Legend:core biosynthetic genes additional biosynthetic genes transport-related genes regulatory genes resistance genes other genes

reset view zoom toselection

Gene details

Select a gene to view the details available for it

General Compounds Genes History KnownClusterBlast NRPS/PKS domainsGeneral information about the BGC

MIBiG accession BGC0000007Short description aflatoxin biosynthetic gene cluster from Aspergillus flavus

Status Minimal annotation: yes Completeness: Unknown

Biosynthetic class(es) PolyketideLoci NCBI GenBank: AY510452.1Compounds aflatoxinSpecies Aspergillus flavus [taxonomy]

References Variability in nitrogen regulation of aflatoxin production by Aspergillus flavus strains. Ehrlich KC et al., Appl. Microbiol. Biotechnol. (2002) PMID:12382060 Aflatoxin biosynthesis cluster gene cypA is required for G aflatoxin formation. Ehrlich KC et al., Appl. Environ. Microbiol. (2004) PMID:15528514

Found an error? Please report it. Comments or questions on MIBiG? Please do not hesitate to contact us. Anything to add? Please suggest an update.

Contact

Download

35

Identification of all BGCs in bacterial genomes

Goal: identifying and categorizing all the BGCs in 5666 bacterial genomes.

Training Data set:1,984 BGC gene sequences from MIBiG v1.4 database, ORF/gene prediction,Pfam domains. 3,685 Pfam domains, 1094 species

Background: 5,666 reference genomes from NCBI database, 11,427 uniquePfam domains. nnon−BGC = 10, 128 controls.

DeepBiG: A LSTM RNN based deep learning algorithmEmbedding - Pfam domain names, Pfam clans, Pfam function (1126-dimvector)Liu, Li and Li (2021): manuscript

36

Known BGCsNo. of Pfam domains in 1984 BGCs.

0 50 100 200 250 300150 No. of Pfams

0

50

100

150

200

250

300C

ount

37

Pfam clansSequence, structures, functional similarity and similarity between pHMMs.

https://www.ebi.ac.uk38

DeepBiG - deep learning and embeddingData: Contiguous segments of 256 Pfam sequences (+padding)Embedding: Pfam domain names, Pfam clans, Pfam function descriptions ⇒LSTM RNN

Pfam

102-d PfamEmb 64-d ClanEmb

Top 64 chars of summary

32-d emb

of c1

32-d emb

of c64

32-d emb

of c2

…

30 size 3 filters with padding,then maxpooling

960-d ClarEmb

1126-d Emb

Pfam 1 Pfam 2 Pfam 3 …… Pfam 254 Pfam 255 Pfam 256

Concatenation

Emb 1 Emb 2 Emb 3 Emb 254 Emb 255 Emb 256

LSTM LSTM LSTM LSTM LSTM LSTM

LSTM LSTM LSTM LSTM LSTM LSTM

Softmax

Labels

39

DeepBiG - Data AugmentationOn expectation, a sequence has one Pfam domain being replaced, each epochwith new perturbed data.

Data ArgumentOn expectation, a sequence will have one pfam being replaced.

Pfam 1 Pfam 2 Pfam 3 …… Pfam L

Pfam 2_i Pfam 2_nPfam 2_1

Each Pfam will be replaced by its Similar Pfams with prob 1/L

Positive/Fake pfam sequence with length L

40

DeepBiG - Embedding, multi-class, training set

20 10 0 10 20TSNE 1

30

20

10

0

10

20

30

TSN

E 2

NRPOtherPolyketideRiPPSaccharideTerpene

41

DeepBiG - Embedding, binary, testing set

60 40 20 0 20 40TSNE 1

20

10

0

10

20

TSN

E 2

BGCNon-BGC

42

DeepBiG Prediction ResultsTesting set: 13 contigs of 9 genomes with 291 known BGCs never used intraining, 10x13=130 artificial genomes with 291 known BGCs fixed in originalgenomes, other replaced with non-BGCs.

DeepBGC DeepBiG-C DeepBiG DeepBiG-OPfam level

Precision 0.834(0.0172) 0.830(0.0148) 0.822(0.0112) 0.777(0.0157)Recall 0.745(0.0082) 0.821(0.0052) 0.862(0.0061) 0.880(0.0036)F1 0.787(0.0067) 0.825(0.0071) 0.842(0.0060) 0.825(0.0091)

Overlap Pfam BGC level (pBGC) - F1 score0.2 0.694(0.0131) 0.747(0.0150) 0.781(0.0147) 0.765(0.0112)0.4 0.670(0.0089) 0.724(0.0163) 0.763(0.0136) 0.741(0.0115)0.6 0.626(0.0108) 0.676(0.0162) 0.714(0.0158) 0.693(0.0139)0.8 0.523(0.0148) 0.574(0.0199) 0.614(0.0173) 0.584(0.0220)1.0 0.270(0.0156) 0.287(0.0141) 0.302(0.0161) 0.228(0.0129)

DeepBGC: Hannigan et al., 2019 NAR.43

DeepBiG multiclass predictionTesting set: 160 new BGC were deposited to MiBIG v1.5

4 3 2 1 0 1 2TSNE 1

1

0

1

2

3

TSN

E 2

NRPOtherPolyketideRiPPSaccharideTerpene

44

DeepBiG multiclass prediction

Testing set: 160 new BGC were deposited to MiBIG v1.5

F1 scores

Overlap DeepBiG-C DeepBiG DeepBGCpBGC gBGC pBGC gBGC pBGC gBGC

0.20 77.2 74.7 95.7 93.2 90.7 90.70.40 72.2 72.2 93.8 92.0 88.3 86.40.60 66.7 68.5 91.4 92.0 85.8 84.00.80 60.5 63.6 89.5 90.7 82.1 81.51.00 54.9 60.5 85.2 88.3 74.7 79.0

45

All BGCs predicted by DeepBiGThere are 170, 685 predicted BGCs in all 5666 bacteria genomes (ave 30.1,0-243 BGCs)

NRP Other Polyketide RiPP Saccharide Terpene0.000.050.100.150.200.250.300.350.40

Porti

on

DeepBiG

MiBIG

0.0 0.2 0.4 0.6 0.8 1.0Ratio

0

200

400

600

800

#Gen

ome

(a) (b)

0 50 100 150 200 250#BGC

0

250

500

750

1000

#Gen

ome

0 100 200 300 400#Gene

0

25000

50000

75000

100000

#BGC

(c) (d) 46

Concluding remarks

Three problems and statistical methods in microbiome data sciencestrong empirical performance with theoretical insights.

Next steps:precision nutrition, microbiome and healthnutrition and host immunology

47

Acknowledgments

Many thanks to:

Li lab (NIH grants: R01GM129781; R01GM123056)

Shulei Wang, Yuan Gao, Rong Ma, Mingyang Liu and Yun Li

Biology collaborators

Gary Wu, MD (Gastroenterology)

Rick Bushman, PhD (Microbiology)

James Lewis, MD (Gastroenterology and DBEI)

Kyle Bittinger, PhD (CHOP)

People in their labs

T. Tony Cai, PhD (The Wharton School)

48

microbiome data science - phylogenetic tree, bacterial

Documents