GEBA Project Summary
Dongying Wu
Phylogenetic Tree Building (Martin Wu)
Concatenate alignments of 31 marker genes; build a PhyML tree
667 non-GEBA genomes, 53 GEBA genomes
Phylogenetic Distance (PD)
PD = sum of all the branch lengths
PD{A,B,C} = a + b + c + d
[Figure: example tree with taxa A, B and C and branch lengths a, b, c and d]
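A minimal sketch of the PD sum, assuming the example tree has the topology ((A:a,B:b):d,C:c), which is one arrangement consistent with PD{A,B,C} = a + b + c + d. The branch lengths and the names `PARENT` and `phylo_diversity` are illustrative and do not come from the slides.

```python
# Hypothetical rooted tree stored as child -> (parent, branch length).
# The topology ((A:a, B:b):d, C:c) is an assumption; the lengths are made up.
PARENT = {
    "A": ("AB", 1.0),     # branch a
    "B": ("AB", 2.0),     # branch b
    "AB": ("root", 0.5),  # branch d
    "C": ("root", 3.0),   # branch c
}


def phylo_diversity(parent, taxa):
    """Sum the lengths of all branches on the paths from the taxa to the root."""
    used = {}  # node -> length of the branch above it, so each branch counts once
    for taxon in taxa:
        node = taxon
        while node in parent and node not in used:
            up, length = parent[node]
            used[node] = length
            node = up
    return sum(used.values())


print(phylo_diversity(PARENT, {"A", "B", "C"}))  # a + b + c + d
```

This version counts every branch up to the root. Some PD definitions count only the minimal subtree connecting the chosen taxa, which differs for subsets that do not span the root.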
Phylogenetic Distance Contribution of GEBA genomes
53 random non-GEBA taxa (from a pool of 667) contribute 3.15 to the tree PD
(standard deviation: 0.68 over 100 samplings)
The total tree PD is 88.8; the GEBA genomes add 11.0 to the tree.
The 26 GEBA Actinobacteria add 4.29 to the total PD (Actinobacteria as a whole add 8.128 PD)
26 random non-GEBA Actinobacteria (from a pool of 47) contribute 1.37 PD
(standard deviation: 0.28 over 100 samplings)
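The random-sampling comparison can be sketched as follows. This assumes that the contribution of a set of taxa is the tree PD with them minus the tree PD without them, and that each sampling draws k taxa without replacement from the pool. The toy tree and all function names are hypothetical.

```python
import random
import statistics


def tree_pd(parent, taxa):
    """PD of a taxon set: total length of branches on their paths to the root."""
    used = {}
    for taxon in taxa:
        node = taxon
        while node in parent and node not in used:
            up, length = parent[node]
            used[node] = length
            node = up
    return sum(used.values())


def pd_contribution(parent, all_taxa, subset):
    """PD lost when `subset` is removed from the full tree (assumed definition)."""
    return tree_pd(parent, all_taxa) - tree_pd(parent, all_taxa - subset)


def sampled_contribution(parent, all_taxa, pool, k, n_samples=100, seed=0):
    """Mean and standard deviation of the contribution of k random taxa from pool."""
    rng = random.Random(seed)
    values = [
        pd_contribution(parent, all_taxa, set(rng.sample(sorted(pool), k)))
        for _ in range(n_samples)
    ]
    return statistics.mean(values), statistics.pstdev(values)


# Toy tree (child -> (parent, branch length)); values are illustrative only.
TOY = {"A": ("AB", 1.0), "B": ("AB", 2.0), "AB": ("root", 0.5), "C": ("root", 3.0)}
ALL_TAXA = {"A", "B", "C"}
mean, sd = sampled_contribution(TOY, ALL_TAXA, pool={"A", "B"}, k=1)
print(mean, sd)
```

On the real tree, `pool` would hold the 667 non-GEBA taxa and `k` would be 53, matching the numbers on the slide.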
227,562 genes from 56 genomes => 17,176,180 links
Blastp: E value cutoff 1e-10, report 10000 hits
Only blastp hits that span 80% of the lengths of both genes are kept as links
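A small sketch of the link filter described above. The `Hit` record and its field names are hypothetical; in practice the lengths and aligned spans would come from whatever tabular BLAST output is being parsed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One blastp hit; the field names are illustrative, not a BLAST format."""
    query: str
    subject: str
    evalue: float
    query_len: int     # full length of the query gene
    subject_len: int   # full length of the subject gene
    query_span: int    # residues of the query covered by the alignment
    subject_span: int  # residues of the subject covered by the alignment


def keep_as_link(hit, max_evalue=1e-10, min_cover=0.8):
    """Keep a hit only if it passes the E-value cutoff and spans 80% of both genes."""
    return (
        hit.evalue <= max_evalue
        and hit.query_span >= min_cover * hit.query_len
        and hit.subject_span >= min_cover * hit.subject_len
    )


hits = [
    Hit("g1", "g2", 1e-50, 100, 110, 95, 96),  # covers both genes: kept
    Hit("g1", "g3", 1e-50, 100, 110, 95, 50),  # covers under 80% of g3: dropped
    Hit("g1", "g4", 1e-5, 100, 100, 100, 100),  # E-value too weak: dropped
]
links = [(h.query, h.subject) for h in hits if keep_as_link(h)]
print(links)
```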
Gene Family Classification
Links (matrix of sequence identities)
Expansion
Inflation (I=2)
MCL Clustering Algorithm
equilibrium state
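The expansion and inflation steps can be illustrated with a compact pure-Python sketch. It is not the MCL program itself: it assumes a symmetric link matrix of identities, adds self-loops of weight 1, and reads clusters from the non-zero entries of each row at equilibrium. The gene names and weights are invented.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(links, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster genes from {(gene_a, gene_b): identity} links."""
    nodes = sorted({gene for pair in links for gene in pair})
    index = {gene: i for i, gene in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), weight in links.items():
        m[index[a]][index[b]] = weight
        m[index[b]][index[a]] = weight
    for i in range(n):
        m[i][i] = 1.0  # self-loops, an assumption of this sketch
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: matrix squaring
        inflated = normalize_columns(  # inflation: element-wise power, rescaled
            [[value ** inflation for value in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:  # equilibrium state reached
            break
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


toy_links = {("g1", "g2"): 0.9, ("g1", "g3"): 0.9, ("g2", "g3"): 0.9,
             ("g4", "g5"): 0.8}
families = mcl(toy_links)
print(families)
```

With I = 2 the inflation step squares each entry, which strengthens strong links relative to weak ones before the columns are rescaled.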
Family size (genes/genome)    Number of families
1/56                          46689
2/56 - 5/56                   10601
5/56 - 10/56                  2755
10/56 - 20/56                 1588
20/56 - 1                     1305
1 - 5                         511
5 - 10                        27
10 - 20                       6
20 - 50                       3
50 - 100                      1
Evenness estimation
Gene distribution ratio for family X:
genome    ratio
A         0.316
B         0.105
C         0.026
D         0
E         0.184
F         0.215
G         0.158
Median: 0.184
Distances from the median: 0.132, 0.079, 0.158, 0.184, 0, 0.031, 0.031
dist: average distance = 0.087
Evenness = 100 x e^(-4 x dist)
Universality: ratio of genomes that a family appears in
Evenness: even distribution of gene family members across genomes
Size: number of members in a gene family
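A worked version of the evenness example, as a sketch. The slide labels 0.184 as the median, so the reference value is passed in explicitly rather than recomputed (the plain median of the seven listed ratios is 0.158). The function name `evenness` is illustrative.

```python
import math


def evenness(ratios, reference):
    """Return (average distance from the reference, 100 * e^(-4 * distance))."""
    dist = sum(abs(r - reference) for r in ratios) / len(ratios)
    return dist, 100 * math.exp(-4 * dist)


# Gene distribution ratios for family X in genomes A to G, from the slide.
ratios = [0.316, 0.105, 0.026, 0.0, 0.184, 0.215, 0.158]
dist, score = evenness(ratios, reference=0.184)
print(round(dist, 3), round(score, 1))
```

The recomputed average distance agrees with the 0.087 quoted on the slide. A perfectly even family has dist = 0 and therefore an evenness of 100.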
Family size
Large families:
famID    size                 function
F2669    4210 (75/genome)     ABC-type transport system ATP-binding protein
F2670    1542 (27/genome)     multi-sensor hybrid histidine kinase
F2671    1367 (24/genome)     short chain dehydrogenase
F2672    1157 (20/genome)     acyl-CoA synthetase
F2673    782 (14/genome)      serine/threonine protein kinase
F2674    755 (13/genome)      two-component system response regulator (LuxR family)
F2675    735 (13/genome)      two-component system response regulator (winged helix family)
F2676    614 (11/genome)      drug resistance transporter
F2677    606 (11/genome)      transcriptional regulator, LacI family
F2678    568 (10/genome)      two-component system sensor histidine kinase
F2679    543 (10/genome)      sugar ABC transporter, permease component
Low universality large families:
famID    size    organism number    family function                          taxonomy
F2682    461     7                  outer membrane protein                   Bacteroidetes; Proteobacteria
F2699    303     6                  outer membrane protein                   Bacteroidetes
F2736    180     6                  anti-sigma factor                        Bacteroidetes; Proteobacteria
F2760    153     6                  transcriptional regulator, AraC family   Bacteroidetes; Proteobacteria
F2772    147     5                  RNA polymerase ECF-type sigma factor     Bacteroidetes (Sphingobacteriales)
F2801    129     11                 DNA-binding protein                      Actinobacteria (Actinobacteridae)
F2827    114     3                  FtsX transmembrane transport protein     Bacteroidetes (Sphingobacteriales)
F2867    103     3                  hypothetical protein                     Actinobacteria (Coriobacteriaceae)
3 out of 9 largest families have very low evenness values (< 5):
short chain dehydrogenase
acyl-CoA synthetase
two-component system response regulator (LuxR)
[Figure: three plots, one for each of the three families above; the x-axis runs from 0 to 60, matching the numbered genome list that follows]
56 Halobacteria Halorhabdus_utahensis
55 Halobacteria Halomicrobium_mukohataei
54 Halobacteria Halogeometricum_borinquense
53 Aminanaerobia Thermanaerovibrio_acidaminovorans
52 Deferribacteres Dethiosulfovibrio_peptidovorans
51 Deinococci Meiothermus_silvanus
50 Deinococci Meiothermus_ruber
49 Chloroflexi Thermobaculum_terrenum
48 Chloroflexi Sphaerobacter_thermophilus
47 Actinobacteria Conexibacter_woesei
46 Actinobacteria Atopobium_parvulum
45 Actinobacteria Slackia_heliotrinireducens
44 Actinobacteria Eggerthella_lenta
43 Actinobacteria Cryptobacterium_curtum
42 Actinobacteria Acidimicrobium_ferrooxidans
41 Actinobacteria Kribbella_flavida
40 Actinobacteria Catenulispora_acidiphila
39 Actinobacteria Stackebrandtia_nassauensis
38 Actinobacteria Geodermatophilus_obscurus
37 Actinobacteria Nakamurella_multipartita
36 Actinobacteria Actinosynnema_mirum
35 Actinobacteria Saccharomonospora_viridis
34 Actinobacteria Tsukamurella_paurometabola
33 Actinobacteria Gordonia_bronchialis
32 Actinobacteria Streptosporangium_roseum
31 Actinobacteria Thermobispora_bispora
30 Actinobacteria Thermomonospora_curvata
29 Actinobacteria Nocardiopsis_dassonvillei
28 Actinobacteria Kytococcus_sedentarius
27 Actinobacteria Brachybacterium_faecium
26 Actinobacteria Beutenbergia_cavernae
25 Actinobacteria Cellulomonas_flavigena
24 Actinobacteria Xylanimonas_cellulosilytica
23 Actinobacteria Jonesia_denitrificans
22 Actinobacteria Sanguibacter_keddieii
21 Firmicutes Anaerococcus_prevotii
20 Firmicutes Alicyclobacillus_acidocaldarius
19 Firmicutes Veillonella_parvula
18 Firmicutes Desulfotomaculum_acetoxidans
17 Fusobacteria Sebaldella_termitidis
16 Fusobacteria Leptotrichia_buccalis
15 Fusobacteria Streptobacillus_moniliformis
14 Spirochaetes Brachyspira_murdochii
13 Bacteroidetes Planctomyces_limnophilus
12 Bacteroidetes Rhodothermus_marinus
11 Bacteroidetes Capnocytophaga_ochracea
10 Bacteroidetes Chitinophaga_pinensis
09 Bacteroidetes Pedobacter_heparinus
08 Bacteroidetes Spirosoma_linguale
07 Bacteroidetes Dyadobacter_fermentans
06 Epsilonproteobacteria Sulfurospirillum_deleyianum
05 Deferribacteres Denitrovibrio_acetiphilus
04 Deltaproteobacteria Haliangium_ochraceum
03 Deltaproteobacteria Desulfomicrobium_baculatum
02 Deltaproteobacteria Desulfohalobium_retbaense
01 Gammaproteobacteria Kangiella_koreensis
Phylum-specific families
Actinobacteria: 26 of 56 genomes
Gene number    Probability that all genes come from Actinobacteria by chance
1 0.4643
2 0.2157
3 0.1001
4 0.0465
5 0.0216
6 0.0100
7 0.0047
8 0.0022
9 0.0010
10 0.0005
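The tabulated values are consistent with (26/56)^n, which is the chance that n genes all fall in Actinobacteria if each gene is assumed to come independently and with equal probability from any of the 56 genomes. A short sketch under that assumption, with an illustrative function name:

```python
def chance_all_from_phylum(n_genes, phylum_genomes=26, total_genomes=56):
    """Probability that n independent genes all come from one phylum's genomes."""
    return (phylum_genomes / total_genomes) ** n_genes


for n in (1, 3, 5, 10):
    print(n, round(chance_all_from_phylum(n), 4))
```

At a family size of 10 the chance drops to about 0.0005, which fits the use of size >= 10 as the threshold for calling a family phylum specific.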
712 families (size >=10) are phylum specific
[Figure: family size (0 to 350) against organism number (0 to 25) for phylum-specific families; annotated counts: 42 and 670]
Family size Actinobacteria Bacteroidetes Deinococci Firmicutes Fusobacteria Halobacteria Total
10<= x <20 430 37 1 1 5 20 494
20<= x <30 103 9 2 114
30<= x <40 22 5 1 28
40<= x <50 7 1 1 9
50<= x <60 6 6
60<= x <70 4 1 5
70<= x <80 1 1 2
80<= x <90 3 1 4
90<= x <100 2 2
100<= x 3 3 6
Total 581 58 1 1 6 23
Phylum-specific families from more than two organisms
F2699 Bacteroidetes=303; outer membrane protein
*F2752 Actinobacteria=160; RNA polymerase, sigma-24 subunit, ECF family
F2772 Bacteroidetes=147; putative ECF-type RNA polymerase sigma factor
F2801 Actinobacteria=129; DNA-binding protein
F2827 Bacteroidetes=114; FtsX-related transmembrane transport protein
F2867 Actinobacteria=103; unknown functions
The largest 6 phylum-specific families
* From 15 organisms
Novel gene families: none of the genes in a family has a GenBank hit (E-value cutoff: 1e-5)
[Figure: Novel GEBA families; family number (log scale, 1 to 100000) against family size (0 to 30)]
31744 novel families (34353 genes)
[Figure: Novel GEBA families; family size against genome numbers]
Streptococcus agalactiae “pan-genome”
Tettelin H. et al. PNAS 2005;102:13950-13955
217,079 genes from 53 GEBA Bacterial genomes
60024 families, N genomes
Number of families with the selected genomes
A: N from 1 to 53
B: For every N, sample the families 100 times
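Steps A and B can be sketched as a family accumulation curve. This assumes that each sampling draws N genomes at random without replacement and counts the distinct families found in them; the toy data and the name `family_accumulation` are hypothetical.

```python
import random


def family_accumulation(genome_families, n_samples=100, seed=0):
    """Average number of distinct families in N random genomes, for every N."""
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    curve = {}
    for n in range(1, len(genomes) + 1):      # step A: N from 1 to all genomes
        counts = []
        for _ in range(n_samples):            # step B: 100 samplings per N
            chosen = rng.sample(genomes, n)
            found = set().union(*(genome_families[g] for g in chosen))
            counts.append(len(found))
        curve[n] = sum(counts) / len(counts)
    return curve


# Toy input: genome -> set of family IDs present in it (made-up values).
toy = {
    "g1": {"F1", "F2"},
    "g2": {"F2", "F3"},
    "g3": {"F1", "F2", "F3", "F4"},
}
curve = family_accumulation(toy)
print(curve)
```

With the real data the input would map each of the 53 bacterial genomes to its families, and the last point of the curve would equal the total of 60024 families.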
[Figure: gene family number (including families with single members) against genome number, for bacteria from the GEBA project]
[Figure: new genome families against number of genomes]
Actinobacteria: (73 genomes, including 26 GEBA genomes)
Streptococcus agalactiae (8 strains)
Enterobacteriaceae: (40 genomes)
9 Escherichia coli, 10 Yersinia pestis, 11 Salmonella enterica, 3 Shigella flexneri
Bacteria: (53 GEBA genomes)
[Figure: gene family number (including families with single members) against genome number, for S. agalactiae, Enterobacteriaceae, Actinobacteria and bacteria from the GEBA project]
[Figure: total gene number against genome number, for the same four groups]
[Figure: gene family number against total gene number, for the same four groups]
Calculate the PD (phylogenetic diversity) of a sub-tree
[Figure: phylogenetic diversity against genome number, for Enterobacteriaceae, Actinobacteria and bacteria from the GEBA project]
[Figure: gene family number against phylogenetic diversity, for the same three groups]
How far down the road GEBA has to go in terms of PD coverage
232812 Bacterial/Archaeal ss-rRNA from Greengenes
45997 clusters
MCL 99% Identity at 80% span
42426 Greengenes Bacterial/Archaeal ss-rRNA
667 Combo Bacterial ss-rRNA
50 Combo Archaeal ss-rRNA
56 GEBA ss-rRNA
Retrieve alignments from greengenes
QuickTree
Distance tree for all representatives
Filter out ss-rRNA from genome projects (99% identity cutoff)
Filter out 18751 low-quality sequences: short sequences (<=1200 nt), low-quality sequences, duplicates, chimeras
Trim by the greengenes mask
74437 non-environmental Bacterial/Archaeal ss-rRNA from Greengenes
10397 clusters
MCL 99% Identity at 80% span
9946 Greengenes Bacterial/Archaeal ss-rRNA
667 Combo Bacterial ss-rRNA
50 Combo Archaeal ss-rRNA
56 GEBA ss-rRNA
Retrieve alignments from greengenes
QuickTree
Distance tree for non-environmental representatives
Filter out ss-rRNA from genome projects (99% identity cutoff)
Filter out low-quality sequences: short sequences (<=1200 nt), low-quality sequences, duplicates, chimeras
Trim by the greengenes mask
[Figure: phylogenetic diversity against organism numbers for GEBA genomes, pre-GEBA genomes, organisms from the greengenes database, and organisms from the greengenes database (excluding environmental samples)]
* Start from Haemophilus influenzae Rd KW20
** In each group, the taxa are sorted by their PD contributions in descending order
The slopes of the linear regression lines represent the PD contribution of the genomes (each window contains 50 genomes)
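A sketch of the sliding-window slope calculation. It assumes that the regression is an ordinary least-squares fit of cumulative PD against genome rank within each window of 50 genomes; the input series here is synthetic and the function names are illustrative.

```python
def ls_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def window_slopes(cumulative_pd, window=50):
    """Slope of cumulative PD over every run of `window` consecutive genomes."""
    slopes = []
    for start in range(len(cumulative_pd) - window + 1):
        xs = list(range(start, start + window))
        ys = cumulative_pd[start:start + window]
        slopes.append(ls_slope(xs, ys))
    return slopes


# Synthetic series: each added genome contributes exactly 0.048 PD.
series = [0.048 * i for i in range(120)]
slopes = window_slopes(series)
print(len(slopes), round(slopes[0], 3))
```

When taxa are sorted by PD contribution in descending order, the slope falls as the window moves right, and the window position where it crosses 0.048 marks where new genomes start contributing less than the GEBA average.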
[Figure: slope against window position (window size 50 genomes) for the non-environmental Greengenes representatives; 0.048 is marked]
[Figure: slope against window position (window size 50 genomes) for the pre-GEBA genomes]
GEBA genomes: 0.048
Only the top 150 PD contributors out of 717 pre-GEBA genomes have an average PD contribution greater than the GEBA genomes.
Genome sequencing efforts to date have covered only 11.5% of the phylogenetic diversity in this study.
We can pick an additional 550 organisms and still have an average PD contribution greater than or equal to the 56 GEBA genomes
To increase PD coverage to 50%, we need to sequence at least 1520 more genomes
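One way to arrive at a "genomes needed" figure is a greedy count, sketched below. It treats each candidate's PD contribution as a fixed, precomputed number and adds candidates in descending order until the target coverage is reached. That is a simplification, since real contributions depend on which taxa are already in the tree, and the slides do not state the exact procedure. All names and numbers in the example are hypothetical.

```python
def genomes_needed(total_pd, covered_pd, contributions, target=0.5):
    """Count the top contributors needed to reach target * total_pd.

    Returns None if the candidates cannot reach the target.
    """
    goal = target * total_pd
    needed = 0
    for contribution in sorted(contributions, reverse=True):
        if covered_pd >= goal:
            break
        covered_pd += contribution
        needed += 1
    return needed if covered_pd >= goal else None


# Toy numbers: 11.5 of 100 PD units already covered, five candidates left.
print(genomes_needed(100.0, 11.5, [5.0, 20.0, 10.0, 5.0, 10.0]))
```

Because the largest contributors are taken first, the count is a lower bound on the number of genomes required, which matches the "at least" wording on the slide.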
Non-environmental Tree
[Figure: slope against window position (window size 50 genomes) for the Greengenes representatives; 0.0515 is marked]
All-representative Tree
Current genome sequences only cover 2.2% of the PD
We can pick an additional 4400 organisms and still have an average PD contribution greater than or equal to the 56 GEBA genomes
To cover 50% of the phylogenetic diversity, we have to sequence 9218 more genomes
[Figure: rbcL phylogenetic tree of 50 sequences with bootstrap support values and a 0.5 scale bar; clades are labelled I, II, III, IV and V. Sequences with a numeric ID prefix (for example 2500650649 Thermomonospora curvata) come from GEBA genomes.]
Methylococcus_capsulatus
2500680479_Acidimicrobium_ferrooxidans
Arabidopsis_thaliana
Synechococcus_elongatus
Microcystis_aeruginosa
2500714725_Meiothermus_silvanus
Cylindrotheca_closterium
2500650649_Thermomonospora_curvata
Rhodobacter_sphaeroides
Ralstonia_eutropha
Oligotropha_carboxidovorans
Burkholderia_xenovorans_LB400
Hydrogenovibrio_marinus
Thiobacillus_denitrificans
Rhodospirillum_rubrum
Rhodopseudomonas_palustris
Methanocaldococcus_jannaschii
2500153872_Halogeometricum_borinquense
2500517881_Halomicrobium_mukohataei
Pyrococcus_furiosus
Thermococcus_kodakarensis
Archaeoglobus_fulgidus
Staphylothermus_marinus
2500608982_Planctomyces_limnophilus
2500348490_Veillonella_parvula
2500583064_Rhodothermus_marinus
2500546384_Meiothermus_ruber
2500712642_Meiothermus_silvanus
Rhodopseudomonas_palustris
Chlorobium_tepidum
Chlorobium_phaeobacteroides
Microcystis_aeruginosa
2500621325_Alicyclobacillus_acidocaldarius
Bacillus_subtilis
Bacillus_licheniformis
Archaeoglobus_fulgidus
2500706456_Nakamurella_multipartita
Arthrobacter_aurescens
Rhizobium_leguminosarum
Roseovarius_sp_HTCC2601
Burkholderia_xenovorans_LB400
Roseobacter_sp_MED193
Pseudomonas_putida
Burkholderia_ambifaria
Jannaschia_sp_CCS1
Xanthobacter_autotrophicus
2500697965_Geodermatophilus_obscurus
2500403940_Pedobacter_heparinus
2500516270_Dyadobacter_fermentans
Thermomicrobium_roseum
[Figure: rbcL active sites; alignment of the catalytic and RuBP-binding residues for the 50 sequences listed above, grouped into forms I, II, III, IV and V]
[Figure: Calvin cycle pathway. Metabolites: Glycerate-3-P, P-glyceroyl-P, GAP, DHAP, Fructose-1,6-P, Fructose-6-P, Xylulose-P, Ribulose-5-P, Ribulose-1,5-P, CO2. Genes: rbcL, pgk, gap, tpiA, glpX, tktA, rpe]
Organism phylum rpe prk rbcL rbcS pgk
Thermomonospora_curvata_DSM_43183 Actinobacteria x x I x x
Meiothermus_silvanus_DSM_0994 Deinococci x x I,IV x x
Acidimicrobium_ferrooxidans Actinobacteria x x I x x
*Halogeometricum_borinquense_DSM_11551 Halobacteria x III x
Halomicrobium_mukohataei_DSM_12286 Halobacteria x III x
Alicyclobacillus_acidocaldarius_subsp Firmicutes x x IV x
Meiothermus_ruber_DSM_01279 Deinococci x x IV x
Nakamurella_multipartita_DSM_44233 Actinobacteria x x IV
Planctomyces_limnophilus_DSM_03776 Bacteroidetes x IV x
Rhodothermus_marinus_DSM_4252 Bacteroidetes x x IV x
Veillonella_parvula_DSM_02008 Firmicutes x IV x
Geodermatophilus_obscurus_DSM_43160 Actinobacteria x x V x
Pedobacter_heparinus_DSM_02366 Bacteroidetes x x V x
Dyadobacter_fermentans_DSM_18053 Bacteroidetes x x V x
Calvin Cycle
* Finished genome