steven essinger, robi polikar, gail rosen k electrical ...sessinger.com/posters/nn_isme.pdfsteven...
TRANSCRIPT
!
Metagenomic Fragments
Naïve Bayes Classifier
Training Database
Unsupervised Clustering Algorithm (e.g. ART)
Fragments Clustered by
Class (e.g. Phyla)
Goal: Predict the taxonomic classifica2on of organisms based on the fragments obtained from an environmental sample that may include many previously uniden2fied organisms.
Conclusion: • Compared to other unsupervised and semi-‐supervised approaches, we
cluster shorter reads (500bp) and more strains (200 to 400) than any other method, to show the clustering method’s feasibili2es on real metagenomics datasets.
• We demonstrate that adap2ve resonance theory is able to cluster novel phyla beGer than K-‐means when there are a large number of fragments to cluster. This is due to the incremental learning capability of ART and its ability to learn non-‐spherical clusters.
• On an extremely challenging dataset of grouping 500bp reads from 204 strains spanning 17 phyla, ART is able to accomplish this with 43% accuracy (5.9% by chance)
Neural Network-based Taxonomic Clustering for Metagenomics
Steven Essinger, Robi Polikar, Gail Rosen Electrical & Computer Engineering, Drexel University, 3141 Chestnut Street
Philadelphia, PA 19104, US
This work was supported in part by National Science Foundation award #0845827
Challenge
• The challenge we face is that we cannot simply cluster fragments together that are similar in composi2on as many clustering methods tend to do.
• While two strains may be similar inter-‐genomically, each generally will vary greatly intra-‐genomically. Since the fragments we are clustering represent short samples of each strain’s genome, we expect that the fragments in each cluster will vary greatly.
• Current methods do not address next-‐genera2on sequencing technology • LikelyBin: successful only for low complexity samples (2-‐10 species) • GSOM: successful when read lengths are greater than 8kbp • CompostBin: successfully tested only for low-‐complexity samples
Experiment 1: Training on 2 large phyla to cluster 17 smaller phyla
Experiment 2: Training on 17 smaller phyla to cluster 2 large phyla
Experiment 3: Training on examples of each phyla to cluster the rest
Experiment! 1! 2! 3!Training Phyla! 2! 17! 19!Test Phyla! 17! 2! 19!Training Strains! 431! 204! 320!Test Strains! 204! 431! 315! Table 1
• 635 microbe genomes obtained from Na2onal Center for Informa2on Biotechnology
• Dataset spans 19 different phyla: We selected this level since it is comprised of microbes that are much more diverse than those belonging to the levels of genus or species
• Whole-‐genomes used in training database
• Test fragments obtained from test strains by random sample 500 bp in length, 100x
Fig. 3
Fig. 1
Summary
Test Data
Algorithm
Results
Fig. 2
1 2 3
1 23
known cl
usters!
unknown cl
usters!
!!
!
"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!
+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!
L,! MINOPHN!
!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!
!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,! H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!
;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!
! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!
<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!
!!
!
"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!
+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!
L,! MINOPHN!
!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!
!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,! H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!
;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!
! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!
<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!
!!
!
"#$! %&'(&! $)%$&'*$+",! -$! "#$+! .""$*%"$/! "(! 0123"$&! "#$! 4!1.&5$3"! %#61.! '+"(!4!5&(2%3,!75.'+8! "#$! "$3"! 9&.5*$+"3!:$&$!0(*%1$"$16!+(;$1! "(! "#$!<=>!01.33'9'$&8! &$321"'+5! '+!.!3'*'?1.&16!0#.11$+5'+5!30$+.&'(!9(&!01.33'9'0."'(+,!@+!$)%$&'*$+"!A8!:$!9(11(:$/!.!/'99$&$+"!.%%&(.0#!9&(*!"#$!%&'(&! ":(!$)%$&'*$+"3!B6!%.&"'"'(+'+5! "#$!+2*B$&!(9!3"&.'+3!(9!$.0#!(9!"#$!CD!01.33$3!'+"(!":(!5&(2%3,!E+$!5&(2%!(9!A4F!3"&.'+3!:.3! 23$/! 9(&! "&.'+'+5! "#$!<=>!:#'1$! "#$! &$*.'+'+5!ACG!3"&.'+3!:$&$!23$/!'+!"#$!"$3"!3$",!73!'+!.11!(9!"#$!$)%$&'?*$+"38! "#$! "$3"! 3$"! 0(+3'3"$/! (9! CFF! 9&.5*$+"38! GFFB%! '+!1$+5"#8! 9(&! $.0#! 3"&.'+! :#'1$! "#$! "&.'+'+5! 3$"! 0(+3'3"$/! (9!:#(1$?5$+(*$3!9(&!$.0#!"&.'+'+5!3"&.'+,!H#$! &$321"3! (9! .11! "#&$$! $)%$&'*$+"3! .&$! %&(;'/$/! '+! "#$!
+$)"!3$0"'(+,!I.0#!$)%$&'*$+"!:.3!B(("3"&.%%$/!4G!"'*$3!"(!%&(;'/$!.!1$;$1!(9!0(+9'/$+0$!'+!"#$!&$321"3!J448!4AK,!!
L,! MINOPHN!
!"! #$%&'()&*+,-.,/'0(*(*1,2*,3,40'1&,%5640,+2,7489+&',-:,9)044&',%5640,7!"&.'+'+5!/.".3$"!(9!QAC!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!4! %#61.,! H#$! &$*.'+'+5! 4FQ! 3"&.'+3! 3%.++'+5! CR! /'99$&$+"!%#61.!:$&$!23$/!.3!"$3"!3"&.'+3!.3!/$30&'B$/!.B(;$,!H#$!"&.'+?'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$! "#$! "$3"! 9&.5?*$+"3!:$&$!(B".'+$/!9&(*!"#$!"$3"!3"&.'+3!B6!&.+/(*16!3.*?%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$!&$./3! $.0#,! =("#! S?*$.+3! .+/! 7MH! :$&$! '*%1$*$+"$/! "(!0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$!<=>!30(&$3! .3! 9$."2&$!;$0?"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@,!
!Table I. The results of clustering 20400 fragments spanning 17 different phyla when trained on another 2 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 17 and the vigilance parameter, , for ART was set to 0.1. Group-ing these fragments by chance into clusters of similar phyla we would ex-pect accuracy of 1/17 or 5.9%. !=("#!.15(&'"#*3!5&(2%$/!.11!(9! "#$! 9&.5*$+"3! '+"(!CR!/'9?9$&$+"! 0123"$&3! .3! '+"$+/$/,! H#$! 7MH! .15(&'"#*! %$&9(&*$/!B$""$&! 23'+5! B("#! 9'52&$3! (9! *$&'"! "#.+! "#$! S?*$.+3! .15(?&'"#*,! @+"$&$3"'+5168! S?*$.+3! :.3! .B1$! "(! '3(1."$! /'99$&$+"!%#61.!B$""$&!"#.+!'"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!:#'1$! 9(&!7MH! "#$! 0(+;$&3$! '3! "&2$,!H#$! &$321"3! '*%16! "#."!7MH!'3!5&(2%'+5!3'*'1.&!%#61.!"(5$"#$&8!B2"!"#$!0123"$&3!.&$!0(+".'+'+5!*(&$!"#.+!C!%#61.!"#$&$B6!/&';'+5!/(:+!"#$!30(&$!9(&! '3(1."'+5! /'99$&$+"! %#61.,! H#$! (%%(3'"$! '3! "&2$! 9(&! S?*$.+38! 3255$3"'+5! "#."! 3'*'1.&! %#61.! .&$! /'3"&'B2"$/! .*(+5!3$;$&.1! 0123"$&3! &."#$&! "#.+! (+$8! B2"! +("! .0&(33! .11! 0123"$&38!("#$&:'3$!:$!:(21/!$)%$0"!.!3'*'1.&!30(&$!9(&!'3(1."'+5!/'9?9$&$+"!%#61.,!
;"! #$%&'()&*+,3.,/'0(*(*1,2*,-:,9)044&',%5640,+2,7489+&',3,40'1&,%5640,7!"&.'+'+5!/.".3$"!(9!4FQ!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CR!%#61.U! "#$!(%%(3'"$!(9!$)%$&'*$+"!C,!H#$!&$*.'+'+5!QAC!3"&.'+3! 3%.++'+5!4!/'99$&$+"!%#61.!:$&$!23$/!.3! "$3"! 3"&.'+3,!H#$! "&.'+'+5! /.".3$"! 0(+3'3"$/! (9!:#(1$! 5$+(*$3!:#'1$! "#$!"$3"! 9&.5*$+"3! :$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+?/(*16! 3.*%1'+5! $.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%!+201$("'/$!&$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$!'*%1$?*$+"$/! "(! 0123"$&! "#$! 9&.5*$+"3! 23'+5! "#$! <=>! 30(&$3! .3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@,!
! Table II. Results of clustering 43100 fragments spanning 2 different phyla when trained on another 17 different phyla using the two figures of merit described in the methods section. The free parameter for K-means was set to 2. Grouping these fragments by chance into clusters of similar phyla we would expect accuracy of 1/2or 50%. ART grouped these fragments into 4 clusters with the vigilance parameter, , set at 0.025. !-#'1$!S?*$.+3!:.3!%&(5&.**$/!"(!5&(2%!.11!(9!"#$!9&.5?*$+"3! '+"(!4!0123"$&38! "#$!7MH!.15(&'"#*!$)#'B'"$/!"#$!B$3"!%$&9(&*.+0$! :#$+! 5&(2%'+5! .11! 9&.5*$+"3! '+"(! Q! 0123"$&3,!H#$! 7MH! .15(&'"#*! %$&9(&*$/! 31'5#"16! B$""$&! 23'+5! B("#!9'52&$3!(9!*$&'"! "#.+! "#$!S?*$.+3!.15(&'"#*,!S?*$.+3!:.3!.B1$! "(! '3(1."$! /'99$&$+"! %#61.!*.&5'+.116!B$""$&! "#.+! '"!:.3!.B1$!"(!5&(2%!3'*'1.&!%#61.!"(5$"#$&!.+/!"#$!3.*$!'3!"&2$!9(&!7MH!.1B$'"!."!.!1.&5$&!*.&5'+,!H#$!%$&9(&*.+0$!(9!"#$!.15(?&'"#*3! .%%$.&! "(! B$!*20#! B$""$&! 9(&! $)%$&'*$+"! 4! "#.+! "#$!%&$;'(238!B2"!'"!'3!'*%(&".+"!"(!+("$!"#."!0#.+0$!:.3!G,VW!'+!$)%$&'*$+"!C8!:#'1$!0#.+0$!'3!GFW!'+!"#'3!$)%$&'*$+",!!
<"! #$%&'()&*+,=.,/'0(*(*1,2*,&$0)%4&9,2>,&075,%5640, +2,7489+&',+5&,'&9+,7!"&.'+'+5!/.".3$"!(9!A4F!3"&.'+3!:.3!0(+3"&20"$/!3%.++'+5!CD!%#61.,!H#$!&$*.'+'+5!ACG!3"&.'+3!.13(!3%.++$/! "#$! 3.*$!CD!/'99$&$+"!%#61.!.+/!:$&$!23$/!.3!"$3"!3"&.'+3,!H#$!"&.'+'+5!/.".3$"!0(+3'3"$/!(9!:#(1$!5$+(*$3!:#'1$!"#$!"$3"!9&.5*$+"3!:$&$! (B".'+$/! 9&(*! "#$! "$3"! 3"&.'+3! B6! &.+/(*16! 3.*%1'+5!$.0#! (9! "#$*! CFF! "'*$3! $)"&.0"'+5! GFFB%! +201$("'/$! &$./3!$.0#,!=("#!S?*$.+3!.+/!7MH!:$&$! '*%1$*$+"$/! "(!0123"$&!"#$!9&.5*$+"3!23'+5! "#$!<=>!30(&$3!.3!9$."2&$!;$0"(&3,!H#$!&$321"3!.&$!32**.&'T$/!'+!H.B1$!@@@,!!
Proposed Algorithm Input: • Metagenomic reads (fragments) from next-gen
sequencing technology • Training database (TDB) – consists of G labeled
genomes, previously acquired • Unsupervised clustering algorithm (e.g. ART, K-means) • Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of
G genome probability profiles Do: i = 1, ..., G
Do: j = 1, …4N (# of diff. motif perm.)
End
End B. Score fragments, evaluate fragment, f using NBC
Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs
each of length N in fragment, f: [M1, M2, M3, …, MJ]T
2. Calculate probability of fragment belonging to genomei in TDB:
End
C. Build feature matrix for unsupervised classifier
D. Call unsupervised clustering algorithm • Cluster each fragment using corresponding
feature vector of dimension G Output: • Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit
• Accuracy to group similar classes together
• Accuracy of algorithm to isolate dissimilar classes
C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla
Features NBC Scores genome1 genome2 … genomeG
Frag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . .
. . . . . . Obj
ects
FragF SF,1 . . SF,G
Proposed Algorithm Input: • Metagenomic reads (fragments) from next-gen
sequencing technology • Training database (TDB) – consists of G labeled
genomes, previously acquired • Unsupervised clustering algorithm (e.g. ART, K-means) • Set free parameters (e.g. K in K-means and v in ART) Algorithm: A. Train Naïve Bayes Classifier (NBC) motifs, M of
G genome probability profiles Do: i = 1, ..., G
Do: j = 1, …4N (# of diff. motif perm.)
End
End B. Score fragments, evaluate fragment, f using NBC
Do: f = 1, …, F (# of fragments) 1. Identify J (N-1) overlapping motifs
each of length N in fragment, f: [M1, M2, M3, …, MJ]T
2. Calculate probability of fragment belonging to genomei in TDB:
End
C. Build feature matrix for unsupervised classifier
D. Call unsupervised clustering algorithm • Cluster each fragment using corresponding
feature vector of dimension G Output: • Fragments clustered by taxonomic class (e.g. Phyla, Genus, Strain, etc.) Test: Figures of Merit
• Accuracy to group similar classes together
• Accuracy of algorithm to isolate dissimilar classes
C: # of clusters P: # of taxonomic classes (e.g. phyla) fc: # of frag. in cluster, c fcp:# of frag. in cluster, c belonging to taxonomic class, p ft’: # of fragments from taxonomic class, p F: total number of fragments in all phyla
Features NBC Scores genome1 genome2 … genomeG
Frag1 S1,1 S1,2 . S1,G Frag2 S2,1 S2,2 . .
. . . . . . Obj
ects
FragF SF,1 . . SF,G