Introduction to the CGE servers

Center for Genomic Epidemiology

Aim:

•   To provide the scientific foundation for future internet-based solutions, where a central database will enable simplification of total genome sequence information and comparison to all other sequenced isolates, including spatial-temporal analysis.

•   To develop algorithms for rapid analyses of whole genome DNA sequences, tools for analyses and extraction of information from the sequence data, and internet/web interfaces for using the tools in the global scientific and medical community.

Tools for species identification

Name of Service | Description | URL (cge.cbs.dtu.dk/services/) | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | SpeciesFinder | Online | Published Feb 2014, PMID: 24574292
KmerFinder | Species identification using overlapping 16mers | KmerFinder | Online | Published Jan 2014, PMID: 24172157
TaxonomyFinder | Taxonomy identification using functional protein domains | TaxonomyFinder | | Published in PMID: 24574292 + Oksana's PhD thesis
Reads2Type | Species identification on client computer | Reads2Type | Online | Published Feb 2014, PMID: 24574292

Benchmarking of Methods for Bacterial Species Identification

PMID: 24574292

Training data: 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)

Evaluation data

NCBI draft genomes
•   695 isolates from species that overlap with the training set (151 species)

SRA draft genomes
•   10,407 sets of short reads from Illumina (168 species)
•   10,407 draft genomes from Illumina data (168 species)

16S rRNA

•   16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al., Int. J. Syst. Bacteriol., 1977)

•   Tremendous amounts of 16S rRNA sequence data are available in databases

Concerns:
•   Low resolution
•   Some genomes contain several copies of the 16S rRNA gene with inter-gene variation
•   The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome

CGE implementation of 16S species identification: SpeciesFinder

Reference database
•   16S rRNA genes are isolated from the genomes in the training data using RNAmmer (Lagesen, NAR, 2007).

Method
•   Input genomes are BLASTed against the 16S rRNA genes in the reference database.
•   The best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.
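The slide does not say how coverage, % identity, bitscore, mismatches and gaps are combined when the best hit is picked. As a minimal sketch, assuming a simple lexicographic priority in that order (an assumption, not the documented SpeciesFinder rule) and using made-up hit values, the selection could look like this:

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One alignment between the input genome and a reference 16S rRNA gene."""
    species: str
    coverage: float   # fraction of the reference gene covered, 0..1
    identity: float   # percent identity of the alignment
    bitscore: float
    mismatches: int
    gaps: int


def rank_key(hit: Hit):
    # Assumed priority order (not stated on the slide): higher coverage,
    # identity and bitscore are better; fewer mismatches and gaps are better.
    return (-hit.coverage, -hit.identity, -hit.bitscore, hit.mismatches, hit.gaps)


def best_hit(hits: Sequence[Hit]) -> Optional[Hit]:
    """Return the top-ranked hit, or None when there are no hits."""
    return min(hits, key=rank_key) if hits else None


# Illustrative values only; they are not taken from any real BLAST run.
hits = [
    Hit("Escherichia coli", 1.00, 99.8, 2800.0, 3, 0),
    Hit("Shigella flexneri", 1.00, 99.8, 2795.0, 3, 1),
    Hit("Salmonella enterica", 0.98, 97.1, 2600.0, 40, 2),
]
print(best_hit(hits).species)
```

With these toy values the first two hits tie on coverage and identity, so the bitscore decides. Near-ties of this kind are one way to picture the "low resolution" concern listed for 16S.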

KmerFinder

•   Genomes in the training data are chopped into 16mers:

ATGACGTATGATTGATGACGTAGTAGTCC

•   Immune system inspired downsampling (figure labels: MHC-I, 9mer)

•   Only 16mers with a specific prefix are kept

16mer database

ATGAATGTGTGAGTGA: CP001921 (Acinetobacter baumannii), CP000521 (Acinetobacter baumannii), CP002522 (Acinetobacter baumannii)
ATGACTGTGCCCCTGA: CP001921 (Acinetobacter baumannii), CP002301 (Buchnera aphidicola)

Unknown isolate

Unique 16mers:
ATGAATGTGTGAGTGA
ATGACTGTGCCCCTGA
ATGAAAAAAAAAAAA

Species | Match | No. of Kmer hits
Acinetobacter baumannii | CP001921 | 2
Acinetobacter baumannii | CP000521 | 1
Acinetobacter baumannii | CP002521 | 1
Buchnera aphidicola | CP002301 | 1

KmerFinder is very robust – it only needs one 16mer!

Desulfovibrio piger GOR1 SRR097356

>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGA
CGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCA
CGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT

N50 = 110
Total no. of bp: 210

Prediction

Species | Match | No. of Kmer hits
Flavobacterium psychrophilum | AM398681 | 1
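The chopping, prefix filtering and hit counting described for KmerFinder can be sketched in a few lines. This is a toy illustration, not the KmerFinder code. The prefix `ATGA` is an assumption, chosen because the slide's example 16mers share it. The toy "genomes" are just the slide's example 16mers, and the accession labels are copied from the slide's 16mer database listing.

```python
from collections import Counter, defaultdict

K = 16
PREFIX = "ATGA"  # assumed prefix, shared by the slide's example 16mers


def kmers(seq, k=K, prefix=PREFIX):
    """Yield every overlapping k-mer of seq that starts with prefix."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer.startswith(prefix):
            yield kmer


def build_database(genomes):
    """Map each retained k-mer to the set of genome IDs that contain it."""
    db = defaultdict(set)
    for genome_id, seq in genomes.items():
        for kmer in kmers(seq):
            db[kmer].add(genome_id)
    return db


def count_hits(query, db):
    """Count, per genome, how many unique query k-mers are in the database."""
    hits = Counter()
    for kmer in set(kmers(query)):
        for genome_id in db.get(kmer, ()):
            hits[genome_id] += 1
    return hits


# Toy training genomes built from the two 16mers shown on the slide.
genomes = {
    "CP001921": "ATGAATGTGTGAGTGA" + "CC" + "ATGACTGTGCCCCTGA",
    "CP000521": "ATGAATGTGTGAGTGA",
    "CP002522": "ATGAATGTGTGAGTGA",
    "CP002301": "ATGACTGTGCCCCTGA",
}
db = build_database(genomes)

# Toy unknown isolate that carries both 16mers.
query = "ATGAATGTGTGAGTGA" + "CC" + "ATGACTGTGCCCCTGA"
hits = count_hits(query, db)
for genome_id, n in hits.most_common():
    print(genome_id, n)
```

In this toy run CP001921 collects two hits and the other three genomes one each, the same pattern as the hit table on the slide. Keeping only k-mers with a fixed prefix shrinks the database while still sampling positions across the whole genome.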

TaxonomyFinder

•   Input set of prokaryotic genomes
•   Gene prediction
    –   Prodigal gene prediction
    –   User submitted genes
•   Whole genome proteome scan against 3 HMM-based databases
    –   PfamA
    –   TIGRFAM
    –   Superfamily
•   CD-HIT clustering of all CDSs with no hits to any HMM-database
•   Gene grouping based on functional domain profiles
    –   Example: protein MTGENLPPELPATAQAWRASVLYGQHLQLIRHLCVTCPRWSQSTSR with domains A, B, C gives Profile: A-B-C
•   Whole genome functional profile formation (for each genome)
•   Specific-profile finding
    –   Phyla-specific domains
    –   Species-specific domains
•   Taxonomy level-specific gene database creation
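To make the profile idea concrete, here is a small sketch of how domain profiles such as A-B-C could be formed and how species-specific profiles could be picked out. It assumes that a profile is the ordered list of a protein's domains and that "specific" means present in every genome of one species and absent from all others. Both are assumptions for illustration, and all genome, species and domain names are hypothetical.

```python
from collections import defaultdict


def profile(domains):
    """Join a protein's domains, in order along the sequence, into a profile."""
    return "-".join(domains)


def genome_profiles(proteins):
    """Collect the set of functional profiles found in one genome."""
    return {profile(domains) for domains in proteins if domains}


def species_specific(genomes, species_of):
    """Profiles present in every genome of a species and in no other species."""
    by_species = defaultdict(list)
    for genome_id, proteins in genomes.items():
        by_species[species_of[genome_id]].append(genome_profiles(proteins))

    result = {}
    for species, profile_sets in by_species.items():
        core = set.intersection(*profile_sets)
        others = set().union(*(
            profiles
            for other, sets in by_species.items() if other != species
            for profiles in sets
        ))
        result[species] = core - others
    return result


# Hypothetical input: each genome is a list of proteins, each protein a list
# of domain names as they would come out of the HMM database scan.
genomes = {
    "g1": [["A", "B", "C"], ["D"]],
    "g2": [["A", "B", "C"], ["E"]],
    "g3": [["D"], ["F", "G"]],
}
species_of = {"g1": "species X", "g2": "species X", "g3": "species Y"}

print(species_specific(genomes, species_of))
```

The sketch ignores proteins without any domain hit, which the slide handles separately through CD-HIT clustering.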

Reads2Type

•   Reads2Type pushes the analysis to the user; the server provides a database of 50-mers
•   SuffixTree: efficient data structure for string matching
•   Narrow Down Approach:
    –   Reads2Type compares 50-mers of combined marker genes against raw reads
    –   Shared Probes vs Unique Probe
•   Definition: quick and dirty taxonomy identification of single isolates
•   DB of 50-mers from marker genes
    –   16S rRNA: training data genomes -> RNAmmer (other)
    –   ITS: training data (Mycobacterium)
    –   GyrB: training data (Enterobacteriaceae)
    –   Resulting database ~5 MB
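One possible reading of the narrow-down idea with shared and unique probes is sketched below. It is an interpretation, not the Reads2Type implementation: a plain substring test stands in for the suffix tree, the probes are 8 bases long instead of 50 to keep the example readable, and the probe sequences and taxon names are hypothetical.

```python
def narrow_down(reads, probes):
    """Narrow the candidate taxa using the probes found in the reads.

    probes maps a probe sequence to the set of taxa carrying it: a probe with
    one taxon is "unique", a probe with several taxa is "shared".
    """
    candidates = None  # None means no probe has been observed yet
    for probe, taxa in probes.items():
        if any(probe in read for read in reads):
            if len(taxa) == 1:
                return set(taxa)  # a unique probe settles the call
            # A shared probe only narrows the set of remaining candidates.
            candidates = set(taxa) if candidates is None else candidates & taxa
    return candidates or set()


# Hypothetical probe database (real probes would be 50-mers of marker genes).
probes = {
    "ACGTACGT": {"taxon A", "taxon B", "taxon C"},  # shared
    "TTGGCCAA": {"taxon B", "taxon C"},             # shared
    "GGGTTTAA": {"taxon C"},                        # unique
}

reads = ["CCACGTACGTCC", "TTGGCCAAGG"]
print(narrow_down(reads, probes))                      # two candidates remain
print(narrow_down(reads + ["AAGGGTTTAACC"], probes))   # unique probe found
```

Because the probe database is small (about 5 MB according to the slide), this kind of lookup can run on the client computer, which is the point of pushing the analysis to the user.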

rMLST

CGE implementation

•   For each genome in the training data, the 53 ribosomal genes were extracted.

•   Genomes in the evaluation sets were aligned to each gene collection using BLAT (only hits with at least 95% identity and 95% coverage were considered as a potential match).

•   The closest match among the training genomes was selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments across all genes.

Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):1005-15.
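The 95% identity and 95% coverage filter, followed by a choice across all genes, can be sketched as below. The thresholds come from the slide. How the per-gene results are combined is not specified there, so the rule used here (most genes matched, then highest summed identity) is an assumption, and the hit records are made up.

```python
MIN_IDENTITY = 95.0  # percent, threshold from the slide
MIN_COVERAGE = 95.0  # percent, threshold from the slide


def potential_matches(hits):
    """Keep only the hits that pass both thresholds."""
    return [h for h in hits
            if h["identity"] >= MIN_IDENTITY and h["coverage"] >= MIN_COVERAGE]


def closest_genome(hits):
    """Pick the training genome with the most matched genes.

    Ties are broken by the summed identity (assumed rule, see lead-in).
    """
    totals = {}
    for h in potential_matches(hits):
        genes, identity = totals.get(h["genome"], (0, 0.0))
        totals[h["genome"]] = (genes + 1, identity + h["identity"])
    return max(totals, key=totals.get) if totals else None


# Illustrative hit records; genome IDs and values are hypothetical.
hits = [
    {"genome": "G1", "gene": "rpsA", "identity": 99.0, "coverage": 100.0},
    {"genome": "G1", "gene": "rplB", "identity": 98.5, "coverage": 97.0},
    {"genome": "G2", "gene": "rpsA", "identity": 99.5, "coverage": 100.0},
    {"genome": "G2", "gene": "rplB", "identity": 96.0, "coverage": 90.0},
]
print(closest_genome(hits))
```

In this toy data the last record fails the coverage threshold, so G2 keeps only one matched gene and G1 is selected.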

Results

(16S rRNA)

Overlap in predictions

Isolates in the NCBI drafts set for which all four methods predict the species to be different from the annotated one. * NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.

[Figures: per-species prediction results for the benchmarked methods. The species-name labels (for example Acinetobacter baumannii) and axis values were garbled during text extraction and are omitted here.]

Speed

Method | Estimated speed (mm:ss)
16S | 00:13*
KmerFinder | 00:09*
TaxonomyFinder | 11:33*
rMLST | 00:45*
Reads2Type | 00:55**

* Estimation based on draft genomes
** Estimation based on short reads

Summary of taxonomy benchmark study

•   KmerFinder had the highest accuracy and was the fastest method.

•   SpeciesFinder (16S rRNA-based) had the lowest accuracy.

•   Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.

Tools for further typing

Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
MLST | Multilocus sequence typing | MLST | Published Apr 2012, PMID: 22238442
PlasmidFinder | Identification of plasmids in Enterobacteriaceae | PlasmidFinder | Published Apr 2014, PMID: 24777092
pMLST | pMLST of plasmids in Enterobacteriaceae | pMLST | Published Apr 2014, PMID: 24777092

Multilocus Sequence Typing (MLST)

•   First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145)

•   The nucleotide sequences of internal regions of approx. 7 housekeeping genes are determined by PCR followed by Sanger sequencing

•   Each distinct allele is assigned an arbitrary number

•   The unique combination of alleles is the sequence type (ST)
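The step from allele numbers to a sequence type is a table lookup, which the following sketch illustrates. The locus names, allele numbers and ST numbers are hypothetical placeholders and do not come from any real MLST scheme.

```python
# Hypothetical 7-locus scheme; real schemes name actual housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: one allele number per locus -> sequence type.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
}


def sequence_type(alleles):
    """Return the ST for a dict locus -> allele number, or None if unknown."""
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)


isolate = {"locus1": 1, "locus2": 2, "locus3": 1, "locus4": 1,
           "locus5": 3, "locus6": 1, "locus7": 1}
print(sequence_type(isolate))
```

An allele combination that is missing from the table returns `None` here, which corresponds to an isolate whose ST has not been defined in the scheme.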

Using WGS data for MLST

www.cbs.dtu.dk/services/MLST

Input types:
•   Assembled genome
•   454 – single end reads
•   454 – paired end reads
•   Illumina – single end reads
•   Illumina – paired end reads
•   Ion Torrent
•   SOLiD – single end reads
•   SOLiD – mate pair reads

MLST schemes:

Acinetobacter baumannii #1, Acinetobacter baumannii #2, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales

Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes

Pseudomonas aeruginosa, Pasteurella multocida, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis

Extended Output

aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
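The numbers in this output line can be read as follows: the alignment (HSP) spans 349 of the 498 positions of the allele, so the identity is 100% but only over part of the allele, which is presumably why the line is flagged as a warning. The sketch below parses the line and computes that coverage. The regular expression and the definition of a "perfect" match are assumptions made for this example.

```python
import re

line = ("aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, "
        "aro_122 is the best match for aro")

# Pull identity, HSP length, allele length and gap count out of the line.
match = re.search(
    r"Identity: (\d+(?:\.\d+)?)%, HSP/Length: (\d+)/(\d+), Gaps: (\d+)", line)
identity = float(match.group(1))
hsp, length, gaps = (int(match.group(i)) for i in (2, 3, 4))

# Share of the allele that is covered by the alignment.
coverage = 100.0 * hsp / length

# Assumed criterion: full identity over the full allele length, without gaps.
perfect = identity == 100.0 and hsp == length and gaps == 0

print(f"coverage = {coverage:.1f}%, perfect match: {perfect}")
# prints: coverage = 70.1%, perfect match: False
```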

What is the MLST web-service used for?

MLST schemes usage:
•   A. baumannii #1: 4%
•   A. baumannii #2: 6%
•   A. chronobacter: 2%
•   Campylobacter: 6%
•   E. coli #1: 21%
•   E. coli #2: 4%
•   E. faecalis: 2%
•   E. faecium: 3%
•   K. pneumoniae: 8%
•   Leptospira #1: 2%
•   Leptospira #2: 5%
•   L. monocytogenes: 3%
•   P. aeruginosa: 2%
•   S. agalactiae: 2%
•   S. aureus: 7%
•   S. enterica: 6%
•   S. pneumoniae: 7%
•   Other: 9%

PlasmidFinder and pMLST

The PlasmidFinder database contains replicons, not entire plasmids.

Tools for phenotyping

Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
ResFinder | Identification of acquired antibiotic resistance genes | ResFinder | Published Nov 2012, PMID: 22782487
VirulenceFinder | Identification of virulence genes in E. coli (and S. aureus and Enterococcus) | VirulenceFinder | E. coli published Feb 2014, PMID: 24574290
MyDbFinder | Identification of genes from the user's own database | MyDbFinder | Will be published in book chapter
PathogenFinder | Prediction of pathogenic potential | PathogenFinder | Published Oct 2013, PMID: 24204795

ResFinder

•   Input: NGS reads (Illumina, Ion Torrent, 454, ...), passed through the assembly pipeline
•   Input: Sanger sequences or FASTA files
•   ResFinder (BLAST)
•   Output: resistance gene profile
    –   List of genes
    –   Accession numbers
    –   Theoretical resistance phenotype

Evaluation

•   200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)
•   ResFinder settings: 98% ID, 60% length coverage
•   Phenotypic tests, 3,051 in total
    –   482 resistant
    –   2,569 susceptible

=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests

23 discrepancies -> 16, typically in relation to spectinomycin in E. coli
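The comparison between predicted and measured resistance can be sketched as below. The two thresholds are the ones quoted on the slide. The gene names, the gene-to-antimicrobial mapping and the test results are hypothetical placeholders, and the toy data is not meant to reproduce the figures reported above.

```python
MIN_IDENTITY = 98.0  # percent identity threshold quoted on the slide
MIN_COVERAGE = 60.0  # percent length coverage threshold quoted on the slide


def predicted_resistances(hits, phenotype_of):
    """Antimicrobials predicted resistant from hits passing both thresholds."""
    return {phenotype_of[h["gene"]] for h in hits
            if h["identity"] >= MIN_IDENTITY and h["coverage"] >= MIN_COVERAGE}


def agreement(predicted, observed, tested):
    """Percent of tested antimicrobials where prediction and phenotype agree."""
    same = sum((drug in predicted) == (drug in observed) for drug in tested)
    return 100.0 * same / len(tested)


# Hypothetical gene -> antimicrobial mapping and hypothetical BLAST hits.
phenotype_of = {"geneA": "drug 1", "geneB": "drug 2", "geneC": "drug 3"}
hits = [
    {"gene": "geneA", "identity": 100.0, "coverage": 100.0},
    {"gene": "geneB", "identity": 99.0, "coverage": 55.0},   # too short
    {"gene": "geneC", "identity": 97.0, "coverage": 100.0},  # too divergent
]

predicted = predicted_resistances(hits, phenotype_of)
observed = {"drug 1", "drug 3"}  # hypothetical phenotypic test results
tested = ["drug 1", "drug 2", "drug 3", "drug 4"]
print(predicted, agreement(predicted, observed, tested))
```

Each tested antimicrobial counts once, whether the isolate is resistant or susceptible, which mirrors how the slide reports agreement over all phenotypic tests rather than over resistant results only.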

Alternatives to ResFinder

Unpublished or uncategorized

Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Status | Publication
PanFunPro | Groups homologous proteins based on functional domain content | PanFunPro | Online | Published in F1000Research 2013, 2:265
SerotypeFinder | Identification of serotypes | SerotypeFinder-1.0 | Online | Not yet published
Restriction-ModificationFinder | Identification of RM system genes | Restriction-ModificationFinder | Online | Will only be published in book chapter
HostPhinder | Prediction of the host of a bacteriophage | HostPhinder | Online, but under development | Not yet published
MetaVirFinder | Identification of virus in metagenomic data | MetaVirFinder | Online, but under development | Not yet published
MGmapper | Identifies the content of metagenomic samples | MGmapper | Online, but under development | Not yet published

Tools for phylogeny

Name of Service | Description | URL (cge.cbs.dtu.dk/services) | Status | Publication
SnpTree | Creation of phylogenetic trees based on SNPs | snpTree | Online | Published Dec 2012, PMID: 23281601
CSIPhylogeny | Creation of phylogenetic trees based on SNPs | CSIPhylogeny | Online | Planned
NDtree | Creation of phylogenetic trees | NDtree | Online | Published in Feb 2014, PMID: 24505344

Web-service usage

[Figure: chart of the usage share of each CGE web service. The percentage values and the service labels cannot be paired reliably from the extracted text.]

Type of data uploaded to MLST web-service

•   454, single reads
•   454, paired-end
•   Ion Torrent
•   Illumina, single reads
•   Illumina, paired-end reads
•   Assembled draft genomes