Recombinase in Trio (RIT) Elements in Bacterial
Genomes: Assessing the Distribution and Mobility of
a Novel yet Widespread Set of Mobile Genes.
by
Nicole Dorothy Ricker
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Physical and Environmental Sciences University of Toronto Scarborough
© Copyright by Nicole Ricker 2016
Abstract
The research performed over the course of my doctoral training describes the environmental distribution, mobility,
expression and potential role of a newly described family of mobile elements, and provides
information on the challenges and potential benefits of environmental metagenomics. Sequencing technologies have
evolved considerably over the course of this work, and evaluating the limitations and opportunities provided by
these evolving technologies has formed a significant portion of my thesis work. The remainder of the work has
been dedicated to understanding the distribution and mechanisms of Recombinase in Trio (RIT) elements, a
previously underappreciated mobile element found in a large diversity of strains, but predominantly in
non-pathogenic bacteria. RIT elements contain three tyrosine-based site-specific recombinases
and display a characteristic gene order and repeat architecture that is conserved across 7 bacterial phyla (Van Houdt
et al. 2006; Van Houdt et al. 2012; Ricker et al. 2013). RIT elements have been postulated to be mobile due to the
occurrence of multiple identical copies within individual genomes, and are commonly found on plasmids and in
genomic islands, including plant symbiosis and catabolic islands. The ability of RIT elements to excise and relocate
themselves was tested using a variety of mating experiments. Although the determination of a potential target site
sequence was initially elusive, the discovery that the RIT element also included a 20 bp palindrome adjacent to one
of the terminal inverted repeats allowed for the alignment of the target genes and revealed the original target site
sequence. Subsequently, RIT element mobility was observed during conjugation, and the transformants analyzed
provided some insight into the mechanism of recombination. Finally, environmental sampling was performed on
Southern Ontario streams in order to develop a methodology for evaluating the mobilome of bacterial
communities.
Acknowledgments
No great achievement is accomplished without having a thousand people to thank. It
would be impossible to list all the people that have helped, supported and encouraged me over
the years and I hope that you truly understand my gratitude for each and every one of you. I
want to especially thank my outstanding supervisor, Roberta Fulthorpe, for all of your amazing
mentorship over the past 6 years. You have provided me with encouragement and support when
I felt unsure, clarity and direction when I was muddled, and a firm kick when I was stalled. Not
to mention physical labour and beautiful lake scenery for balance, and the insight to recognize
an amazing opportunity when it came knocking. I could not ask for a better supervisor for my
PhD, or a better mentor for my career. I would also like to thank my committee members (Don
Jackson and William Navarre) for their outstanding insight and recommendations throughout the
project, as well as their patience and encouragement.
To my husband Toby – you have been my rock throughout my PhD and have done so
much more than I ever could have asked of you. From sampler design, creation and installation
to learning site-specific recombination mechanisms and moving to Belgium (twice!), you’ve put
in the blood, sweat and tears of this PhD and I am truly blessed to have such a wonderful partner
in my life. Thanks also to my Mom for her endless support including hopping on a plane last
minute to help with the first move to Belgium – and for helping to make sure I didn’t fall apart
once we got there. Thanks to my Dad for assisting with all the reference site samplers and
reminding me why I’m in this field by constantly making me defend science; and to my siblings
for keeping me grounded while also reminding me that I could do this.
My time at UTSC has been filled with amazing people and opportunities that I had never
anticipated. I want to thank everyone at the Fulthorpe lab (past and present) for all your support
and encouragement, and for putting up with endless talks about RIT elements. I especially have
to thank Tony, Roxana and Rosemary for all your dedication and friendship. Last but not least, I
am so grateful for having had the opportunity to work with Rob Van Houdt and Bernard Hallet,
as well as Ann Provoost, Kristel Mijnendonckx and all the other members of SCK•CEN, and I am
grateful to the W. Garfield Weston Foundation for providing funding for this international collaboration.
Table of Contents
Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Appendices
Chapter 1 Introduction
1.1 References
Chapter 2 The Role of Mobile Genetic Elements in Prokaryotic Adaptation
2 Horizontal Gene Transfer
2.1 Intracellular MGEs
2.2 Intercellular MGEs
2.3 Impact on Genome Evolution
2.4 References
Chapter 3 The Limitations of Draft Assemblies for Understanding Prokaryotic Adaptation and Evolution
3 Introduction
3.1 Methods
3.2 Results
3.2.1 Assembly Quality for Cupriavidus metallidurans CH34
3.2.2 Contigs terminate at repeated elements and mobile elements
3.2.3 Fragmentation is greatest at genomic island sites
3.2.4 Investigating the relative contribution of multiple replicons or presence of documented mobility genes by comparison with other strains
3.2.5 Fragmentation Evident in Real Data
3.3 Discussion
3.4 Acknowledgements
3.5 References
Chapter 4 Phylogeny and Organization of Recombinase in Trio (RIT) Elements
4 Introduction
4.1 Methods
4.2 Results and Discussion
4.2.1 Abundance and Occurrence in Database
4.2.2 RIT Structure and Organization
4.2.3 Inferred RIT Functionality
4.2.4 Evidence for RIT Mobility Within Closely Related Strains
4.2.5 Similarities between RIT elements and evidence for broad distribution
4.2.6 RIT Classification
4.3 Conclusions
4.4 Acknowledgements
4.5 References
Chapter 5 The Chlorocatechol Degradative Operon in Burkholderia sp. strain OLGA172 Resides in Chromosomal Area of Genome Plasticity as revealed through PacBio Single-Molecule Sequencing
5 Introduction
5.1 Materials and Methods
5.1.1 Short read NGS sequencing
5.1.2 PacBio Single Molecule Sequencing
5.1.3 Assembly of Short Read Technologies and PacBio corrected reads
5.1.4 Gene Annotation and Contig Validation
5.1.5 Comparisons to Related Finished Genomes
5.1.6 Large Plasmid Extraction
5.2 Results
5.2.1 Overall Genome Analysis
5.2.2 Biological consistency of the Assembly
5.2.3 Capacity of the PacBio Assembly for comparative studies
5.2.4 Highlighting a region of Strain Specificity – The Chlorocatechol (CC) Degradative Operon
5.2.5 Limitations of the PacBio Assembly
5.3 Discussion
5.4 Acknowledgements
5.5 References
Chapter 6 Expression and Activity of RIT Elements
6 Introduction
6.1 Materials and Methods
6.1.1 Growth of Bacterial Strains
6.1.2 Construct creation
6.1.3 Mating-out Assays
6.1.4 Conjugation Experiments
6.1.5 Expression Experiments
6.2 Results
6.2.1 No evidence of Intra-cellular mobility without a target site
6.2.2 Target site identification
6.2.3 Sequencing analysis of transconjugants
6.2.4 Application of these Results to other RIT Elements
6.3 Discussion
6.4 References
Chapter 7 Developing a standardized method for analyzing gene content of bacterial communities in streams with varying degrees of urbanization
7 Introduction
7.1 Materials and Methods
7.1.1 Sampling locations and collection of benthic invertebrates
7.1.2 Sampler Design
7.1.3 Bacterial Community Assessment
7.1.4 Quantitative PCR
7.2 Results
7.2.1 Macroinvertebrate metrics of ecosystem health
7.2.2 Community diversity measures
7.2.3 Quantitative PCR
7.2.4 Correlations between bacterial communities and water quality parameters
7.2.5 Primer design specific to RIT elements
7.3 Discussion
7.3.1 Biomonitoring
7.3.2 Bacterial community assessment
7.4 Acknowledgements
7.5 References
Chapter 8 Conclusions and Future Directions
8 References
9 Appendix 1 Extra Tables
Appendix 2 Sampler Construction and Site Information
List of Tables
Table 3.2.1: Number of contigs aligning and coverage statistics for each of the four replicons in C. metallidurans CH34 using Velvet and ABySS genome assembly software.
Table 3.2.2: Details on the terminal regions for 7 large contigs.
Table 3.2.3: Genomic islands found on chromosome 1 of CH34.
Table 3.2.4: Velvet assembly metrics of the 5 genomes compared.
Table 4.2.1: Summary of information of putative RIT elements found in this study.
Table 4.2.2: Potential recognition or regulatory sites contained within terminal inverted repeats.
Table 5.2.1: Statistics of PacBio unitigs assigned as putative replicons.
Table 5.2.2: Comparison of assembled genome of Burkholderia sp. str. OLGA172 with other closely related Burkholderia strains.
Table 6.1.1: List of strains used in this study.
Table 6.1.2: List of constructs created during this study.
Table 6.2.1: Decrease in optical density of cell cultures after induction with IPTG.
Table 6.2.2: Conserved sequences found in a variety of alpha- and beta-Proteobacteria containing RIT elements.
Table 7.1.1: Sampling locations for river assessments.
Table 7.1.2: Primers for quantitative PCR.
Table 7.2.1: Comparison of field sites based on biotic indices of benthics obtained during this study.
Table 7.2.2: DeltaCt comparison of environmental samplers by quantitative real-time PCR.
Table 7.2.3: Water quality parameters for each site.
Table 7.2.4: Correlations of the bacterial communities to available water quality data.
List of Figures
Figure 3.2.1: Number of assembled contigs in Velvet aligning to replicons in C. metallidurans CH34.
Figure 3.2.2: Geneious alignment of assembled contigs to two key regions containing genomic islands in C. metallidurans CH34.
Figure 3.2.3: Relationship between N50 (as percentage of the largest replicon in the genome) and three parameters thought to influence assembly quality.
Figure 3.2.4: Relationship between three measures of assembly quality (maximum contig length, N50 and N50 as percent of longest replicon) and number of genomic islands as predicted by IslandViewer.
Figure 3.2.5: Geneious alignment of real contigs obtained from the GAGE assembly data (Salzberg et al. 2012).
Figure 4.2.1: Comparison of the taxonomic representation of our RIT collection with the abundance of the same taxonomic grouping in the NCBI genome database.
Figure 4.2.2: Names and arrangements of tyrosine recombinase sub-families.
Figure 4.2.3: Comparison of conservation between the Int1 (pAE1) recombinases (A - top) and Int3 (SG5) recombinases (B - bottom) from 40 divergent representatives.
Figure 4.2.4: Arrangement of RIT elements on the chromosome of Caulobacter sp. K31.
Figure 4.2.5: Phylogenetic analysis by 16S (A) and nucleotide sequence of the RIT elements obtained in this study (B).
Figure 4.2.6: Individual congruency trees for each of the recombinases in a selection of RIT elements.
Figure 5.2.1: Chromosome 1 of Burkholderia sp. str. OLGA172 as determined by PacBio sequencing.
Figure 5.2.2: Large plasmid extraction.
Figure 5.2.3: MAUVE alignment of chromosome 1 from six Burkholderia strains.
Figure 5.2.4: Genomic arrangement of chromosome 1 genes from Burkholderia sp. str. OLGA172 and comparison to homologous regions of related strains.
Figure 6.1.1: Constructs used in the final conjugation experiment.
Figure 6.2.1: Expression of recombinase genes from pKK223-OlgaA-C and pKK223-K31A-C expression vectors.
Figure 6.2.2: PCR amplification using primers designed to amplify out from the kanamycin gene.
Figure 6.2.3: Orientation of RIT elements in Caulobacter sp. K31 relative to the direction of the target gene DUF1738.
Figure 6.2.4: Final experimental design.
Figure 6.2.5: Reversal of RIT element in positive transconjugants.
Figure 6.2.6: Mating results for the recipient strain containing pTrc99-K31A-C and pACYC-TSV1.
Figure 6.2.7: Target site 1 transconjugants retaining both kanamycin and tetracycline resistance.
Figure 6.2.8: Sequencing results of co-integrate structure of clone I4.
Figure 6.3.1: Model for RIT element mobility based on experimental results.
Figure 7.1.1: Map of sampling locations.
Figure 7.1.2: Aquatic environment bacterial community samplers.
Figure 7.2.1: Lake Simcoe region samplers after retrieval.
Figure 7.2.2: Cluster analysis of T-RFLP data showing within sampler variation.
Figure 7.2.3: Principal coordinate analysis of T-RFLP results from sampler replicates.
Figure 7.2.4: Principal coordinate analysis (PCoA) of the bacterial community compositions revealed by 16S pyrosequencing data.
List of Appendices
Appendix 1: Extra tables
Table S1: Primers used in this study
Table S2: Dissolved oxygen values by month
Table S3: RIT elements determined to date
Appendix 2: Sampler construction and site information
Chapter 1 Introduction
My research relates to understanding the mechanisms of bacterial adaptation, and
particularly how bacteria acquire and distribute genes through horizontal gene transfer (HGT).
This is a topic that impacts every field of biology from ecology to medicine due to the ubiquity
of bacteria and the range of diverse skills that they acquire through HGT, including
pathogenesis, antibiotic resistance, root nodulation and xenobiotic compound degradation
(Springael and Top, 2004; Frost et al. 2005; Siefert 2009). For this reason, understanding the
mechanism and regulation of the genes involved in HGT provides universally applicable
benefits. The goal of my graduate research has been to better characterize the genes involved in
creating diversity within individual bacterial genomes, and to make progress towards
investigating the effects that exposure to environmental pollutants has on the abundance and
activity of mobile genetic elements (MGEs). This work was inspired by similar research into the
distribution and expression of integrons (Wright et al. 2008; Koenig et al. 2009) and plasmids
(Smalla and Sobecky, 2002; Springael and Top, 2004). I provide a summary of our
understanding of the range of MGEs found in bacteria and some details on their agents of
mobility in the next chapter (Chapter 2) to assist the reader.
The main focus of my work has been devoted to understanding a previously
uncharacterized set of mobility genes termed a Recombinase in Trio (RIT) element (Van Houdt
et al. 2009; Ricker et al. 2013). At the start of my project, a former master’s student had recently
discovered a recombinase in a chlorobenzoate degrader designated Burkholderia sp. str.
OLGA172 (Jin, 2010) that later proved to be a RIT element. The Fulthorpe lab has studied this
strain as the representative of a larger collection of chlorobenzoate degraders isolated from
pristine sites during a biogeography survey (Fulthorpe et al. 1998). These pristine isolates are of
particular interest since their chromosomally located chlorobenzoate degradation genes may be
ancestral to widely disseminated plasmid-borne catabolic genes that are highly active in
contaminated sites. In OLGA172, a RIT element was found lying just upstream from the
catabolic genes and I had an interest in determining if it had a role in the movement of the
catabolic operon, with a view to the larger interest of understanding the overall role of RIT
elements and a possible link to the evolution of catabolic traits.
As next generation sequencing was becoming common at that time, OLGA172 was
submitted for Illumina sequencing and subsequently for 454 sequencing in order to assemble the
complete genome and provide context to the RIT element and adjacent catabolic genes.
Unfortunately, all contigs containing the RIT element were disconnected because the element is present in
multiple copies within the genome. The bioinformatics community has long acknowledged this
technical drawback of short read technology, but its importance to assembly quality and our
understanding of bacterial evolution had been underestimated. I document these issues in
Chapter 3. I recognized the potential of longer read technologies in understanding our strain and
submitted it for sequencing on the PacBio RSII platform. These improvements allowed for the
creation of a closed genome of OLGA172, which can subsequently be used to address specific
questions regarding the role of RIT elements in the evolution of this strain. I detail the larger
implications of the fundamental improvements achieved through the introduction of high
throughput long read sequencing technologies in Chapter 5 using OLGA172 as an example.
This chapter describes the closed genome of OLGA172 achieved using the PacBio sequencing
technology, and compares the genomic context surrounding the catabolic genes (and RIT
element) found in this strain with other fully sequenced relatives.
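The assembly problem described above can be illustrated with a toy sketch. It assumes a simplified de Bruijn graph model in which the k-mer length stands in for read length; the function name `branching_nodes` and the sequences are hypothetical and are not taken from OLGA172 or from any assembler used in this thesis. When a repeated element is longer than the k-mer overlap, the graph branches at the ends of the repeat, which is where an assembler would have to break contigs; once k exceeds the repeat length, the ambiguity disappears.

```python
from collections import defaultdict
from typing import Set


def branching_nodes(genome: str, k: int) -> Set[str]:
    """Return the (k-1)-mers with more than one distinct successor or
    predecessor in a de Bruijn graph built from every k-mer of `genome`."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        left, right = kmer[:-1], kmer[1:]
        successors[left].add(right)
        predecessors[right].add(left)
    nodes = set(successors) | set(predecessors)
    return {n for n in nodes
            if len(successors[n]) > 1 or len(predecessors[n]) > 1}


# Hypothetical toy genome: one 8 bp element present in two copies,
# separated and flanked by unique sequence.
repeat = "GATTACAG"
genome = "CCTCG" + repeat + "TGGCA" + repeat + "AACGT"

short_k = branching_nodes(genome, 6)   # k-mers shorter than the repeat
long_k = branching_nodes(genome, 10)   # k-mers longer than the repeat

print(sorted(short_k))  # ['GATTA', 'TACAG'] (the two ends of the repeat)
print(sorted(long_k))   # [] (no ambiguity once k spans the repeat)
```

In this simplified picture, longer reads play the role of the larger k: a read that spans the whole element plus unique flanking sequence resolves which copy it came from.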
In Chapter 4, I discuss the distribution and organization of Recombinase in Trio (RIT)
elements, a previously underappreciated mobile element found in a large diversity of strains, but
predominantly in non-pathogenic bacteria. Drawing on in-depth in silico searching, I outline
overall RIT element organization and distribution in currently sequenced genomes, and
highlight individual strains harboring multiple identical copies of the same RIT element. Rob
Van Houdt of the Belgian Nuclear Research Centre (SCK•CEN) was the first author
on the original paper recognizing and naming the RIT elements (Van Houdt et al. 2009). On
reading a poster abstract I published on the distribution of RIT elements, he contacted me and
we established a collaboration. I travelled to Belgium for 3 months in 2011 and again for 9
months the following year after securing a fellowship in order to investigate the activity of RIT
elements in his lab. The experimental evidence I gathered supporting the intracellular mobility
of these elements is presented in Chapter 6.
At the outset of my PhD work, my intention was to investigate the "mobilome" of bacterial
communities exposed to low levels of environmental contamination. Other researchers have
asserted that environmental pollutants are increasing the ‘evolvability’ of bacterial communities
by increasing their capacity for horizontal gene transfer (Baquero, 2009; Gillings and Stokes,
2012). In order to properly address this question of innate evolvability within a bacterial
community, I wanted to investigate the impact of environmental pollutants on the mobile
elements themselves separately from co-selection by the resistance genes being mobilized. The
cost of investigating an environmental ‘mobilome’ necessitates the prudent identification of
appropriate sites to be characterized. Accordingly, I surveyed the suitability of several stream
sites in Ontario for this kind of work and eventually designed and sampled several of them. I
also examined in detail our current ability to quantify MGEs via various methods. Chapter 7
details this sampling strategy and the molecular characterizations I was able to perform on the
bacterial communities, with several interesting results.
1.1 References
Baquero, F. 2009. Environmental stress and evolvability in microbial systems. Clin. Microbiol. Infect. 15(Suppl. 1):5-10.
Frost, L.S., Leplae, R., Summers, A.O. and Toussaint, A. 2005. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3:722-732.
Fulthorpe, R.R., Rhodes, A.N. and Tiedje, J.M. 1998. High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Applied and Environmental Microbiology 64(5):1620-1627.
Gillings, M.R. and Stokes, H.W. 2012. Are humans increasing bacterial evolvability? Trends in Ecology & Evolution 27(6):346-352.
Jin, S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. M.Sc. Thesis, Dept. Ecology and Evolutionary Biology, University of Toronto.
Koenig, J.E., C. Sharp, M. Dlutek, B. Curtis, M. Joss, Y. Boucher and W.F. Doolittle. 2009. Integron Gene Cassettes and Degradation of Compounds Associated with Industrial Waste: The Case of the Sydney Tar Ponds. PLOS One 4:1-9.
Ricker, N., Qian, H. and Fulthorpe, R.R. 2013. Phylogeny and Organization of Recombinase in Trio (RIT) Elements. Plasmid 70(2):226-239.
Siefert, J.L. 2009. Defining the Mobilome. In: Horizontal Gene Transfer: Genomes in Flux. pp. 13-27. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.
Smalla, K. and P.A. Sobecky. 2002. The prevalence and diversity of mobile genetic elements in bacterial communities of different environmental habitats: insights gained from different methodological approaches. FEMS Microbiol. Ecol. 42:165-175.
Springael, D. and E.M. Top. 2004. Horizontal gene transfer and microbial adaptation to xenobiotics: new types of mobile genetic elements and lessons from ecological studies. Trends in Microbiol. 12(2):53-58.
Van Houdt, R., S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-226.
Wright, M.S., Baker-Austin, C., Lindell, A.H., Stepanauskas, R., Stokes, H.W. and J.V. McArthur. 2008. Influence of industrial contamination on mobile genetic elements: class 1 integron abundance and gene cassette structure in aquatic bacterial communities. ISME Journal 2:417-428.
Chapter 2 The Role of Mobile Genetic Elements in Prokaryotic Adaptation
2 Horizontal Gene Transfer
Bacterial evolution is a dynamic process involving gene mutation, inversion, exchange,
deletion and acquisition of exogenous DNA (Snyder and Champness, 2007). Each of these
processes varies in rate of occurrence and the scope of possible outcomes (Brüssow, 2008) and
the selection pressures shaping these outcomes are applied from multiple levels – gene, group,
population, or community. Horizontal gene transfer (HGT), also known as lateral gene transfer
(LGT), has been shown to play an important role in the evolutionary dynamics at all of these levels,
and is arguably the most important evolutionary mechanism working at the population and
community levels. HGT allows individual genomes to remain compact while providing access
to a larger pool of potentially beneficial genes maintained within the community (Darmon and
Leach, 2014). The study of the genes involved in horizontal gene transfer (aka the mobilome)
has received substantial attention due to the increasing prevalence of antibiotic resistance.
Antibiotics and antibiotic resistance genes are common contaminants from wastewater,
agriculture and aquaculture (Perry and Wright, 2013; Gillings et al. 2015), and the abundance of
individual resistance genes in soil environments is increasing over time (Knapp et al. 2010).
Understanding the dynamics of gene movement within environmental communities is therefore
fundamental to establishing how existing resistances will be disseminated and in anticipating
sources of new resistance genes.
A mobile genetic element (MGE) is defined as any discrete segment of DNA that can
move within or between genomes (Siefert, 2009) and is inclusive of plasmids, phages,
integrative conjugative elements (ICEs), transposons and the myriad of smaller elements
capable of inter- or intra-cellular movement (for a complete review, see Bellanger et al. 2014).
The distinction between these different categories is often blurred for various reasons including
the modular nature of mobile element evolution, and the enormous time scale at which these
elements have been evolving (Lawrence and Hendrickson, 2008; Siguier et al. 2014). Although
all of these elements fit into the classification of transposable elements (Toussaint and Merlin,
2002; Curcio and Derbyshire, 2003; Roberts et al., 2008), the term transposable element
inherently suggests that the mobility of the elements is through transposition. Since
transposition and site-specific recombination are fundamentally different processes
(biochemically), it is preferable to use the more inclusive term of Mobile Genetic Elements
(MGEs) to refer to the full spectrum of genes involved in HGT. The term genomic island is also
sometimes seen as equivalent to transposable element; however, there are a variety of definitions
for this particular term, many of which overlap with current definitions for other mobile
elements. For the purposes of this thesis, the term genomic island will be used to refer to
regions of a genome that are not shared with close relatives of the isolate, regardless of any
evidence regarding current mobility.
There are three mechanisms for the acquisition of exogenous DNA by a bacterial cell.
These are conjugation (formation of a junction between two cells for genetic exchange),
transduction (movement of bacterial genes mediated by phage infection) and natural transformation (direct
uptake of DNA from the surrounding environment by competent cells) (Olendzenski and Gogarten, 2009). On
entry into a new cell, an MGE can be degraded by nucleases, maintained extrachromosomally (in the case
of most plasmids and some phages) or become integrated into the genome of the new organism
through either homologous or illegitimate recombination (Lawrence and Retchless, 2009). There
are also a number of mobile elements that integrate into the genome independently, in either a
random or site-specific manner (Hallet et al. 2004; Siguier et al. 2014). The roles that MGEs
fulfill in a bacterial genome are varied and poorly understood. They are most frequently studied
for their role in the acquisition or dissemination of selectable traits such as antibiotic resistance,
symbiosis, pathogenicity or catabolism. Many confer no such useful functions and are
commonly considered a form of ‘selfish’ DNA. However, these elements can be a key
component of genome flexibility as they mediate deletions and inversions both through their
own activity and by providing homologous regions within the genome. Many MGEs have also
been found to affect expression of surrounding genes through the presence of outward facing
promoters or the production of molecules involved in regulation (Darmon and Leach, 2014).
The presence of formerly mobile elements (e.g. defective prophages) or partial remnants of MGEs can
likewise impact bacterial adaptation by providing sites of homology or through mobilization in trans
by intact elements. There are many categories into which MGEs can be sub-divided; however,
for the purposes of this chapter they will be described according to the degree of their mobility.
This chapter aims to clarify the individual terms used for the different classes of mobile
elements that are capable of independent movement, including both movement within a cell and
movement between bacterial cells. This is in no way an exhaustive description of the elements involved in
bacterial adaptation, and the reader is directed to recent excellent reviews for further information
(Bellanger et al. 2014; Darmon and Leach, 2014; Siguier et al. 2014).
2.1 Intracellular MGEs
By definition, intracellular mobile elements are those capable of transfer to different
locations in a chromosome or between different replicons within a bacterial cell (between
multiple chromosomes or from a chromosome to a plasmid). These elements can only be
transferred horizontally (between cells) when they become associated with larger, self-
transmissible elements described in section 2.2. The simplest form of MGE has traditionally
been the insertion sequence (IS), although smaller non-autonomous mobile elements have
recently been described. ISs generally range in size from 700-2500 bp, can facilitate their own
movement and contain only the genes required for transposition (1-3 ORFs coding for
transposase enzymes and regulatory genes) with flanking inverted repeats (Mahillon and
Chandler, 1998; Siefert, 2009). Insertion sequences are grouped into individual families
(www-is.biotoul.fr) based on several shared characteristics. The most important of these characteristics
is similarity in the primary sequence of their encoded transposases (Siguier et al., 2014) but
family members also share other features including the organization of open reading frames,
target site preferences, and similarities in the length and sequence of both their short terminal
inverted repeats and the direct repeats generated upon insertion (Siefert, 2009; Siguier et al.
2014). The majority of insertion sequences are mobilized by a DDE transposase (where DDE
refers to the conserved Asp, Asp, Glu residues in the active site) and there are several large
families of these enzymes that have been further divided into subgroups (Siguier et al. 2014).
There are also several other transposase chemistries that have been identified including enzymes
with a DEDD catalytic motif (related to Holliday junction resolvases) and the HUH (two
histidine residues separated by a large hydrophobic residue) enzymes utilized by both
IS200/IS605 and IS91 related elements (Siguier et al. 2015).
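As a rough illustration of two of the structural hallmarks listed above (a length of roughly 700-2500 bp and short terminal inverted repeats), the following Python sketch screens a candidate sequence. It is a minimal sketch and not a method used in this thesis: the function names, the 50 bp search window and the 10 bp minimum repeat length are assumptions chosen for illustration. Real inverted repeats are often imperfect, whereas this check only accepts perfect matches.

```python
# Illustrative sketch only; names and thresholds are hypothetical assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(seq: str, max_len: int = 50) -> int:
    """Length of the longest perfect terminal inverted repeat (0 if none).

    The right end is reverse-complemented so that it can be compared
    base by base with the left end, working inwards from both termini.
    """
    seq = seq.upper()
    window = min(max_len, len(seq) // 2)  # the two repeats must not overlap
    if window == 0:
        return 0
    left = seq[:window]
    right_rc = reverse_complement(seq[-window:])
    length = 0
    for left_base, right_base in zip(left, right_rc):
        if left_base != right_base:
            break
        length += 1
    return length


def looks_like_is(seq: str, min_ir: int = 10) -> bool:
    """Crude screen: size within 700-2500 bp and a perfect terminal inverted repeat."""
    size_ok = 700 <= len(seq) <= 2500
    return size_ok and terminal_inverted_repeat_length(seq) >= min_ir


if __name__ == "__main__":
    left_end = "GGAAGGTGCG"  # made-up repeat, not taken from a real IS
    candidate = left_end + "T" * 800 + reverse_complement(left_end)
    print(len(candidate), terminal_inverted_repeat_length(candidate), looks_like_is(candidate))
```

A screen of this kind says nothing about the encoded transposase, which the text identifies as the most important criterion for assigning an IS to a family; it only checks the size and repeat architecture.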
Transposons have traditionally been distinguished from ISs due to the presence of
accessory genes (also called passenger genes or cargo) that serve purposes not related to
transposition (Siguier et al. 2014). However, since related transposase enzymes have been found
in both ISs and transposons, this naming system conflicts with the homology-based families
that have been defined. In addition, there are transposons that are created through the
coordinated movement of two flanking ISs (composite transposons) and these are separate from
the unit transposons that have a mobility gene at one end of the element. Unit transposons are
sometimes mobilized through the action of a site-specific recombinase, and these enzymes are alternately
referred to as transposases or recombinases depending on whether authors are emphasizing the MGE
being mobilized (Siguier et al. 2015) or the phylogenetic relatedness of the enzyme to other proteins
(Carraro and Burrus, 2015). As we progress towards the metagenomic age it is preferable to
group MGEs according to the phylogeny of the mobility enzyme since this is more amenable to
computational analysis and also speaks to the actual mechanism of mobility. Forming families
based on the homology of the mobility enzymes should still be viewed primarily as a ‘grouping
by descent’ method rather than an attempt to define the limitations of individual families since
the acquisition of accessory genes may not be a fixed feature of the family.
Many transposons encode separate integration and resolution systems, and the resolution
mechanisms are commonly performed by site-specific recombinases. Site-specific recombinases
(SSRs) can be divided into two unrelated families based on the use of either a tyrosine or a
serine residue in the recombination event (Schumann, 2006). These enzymes are commonly
used by bacteriophages to integrate their genomes into the host chromosome when the virus
enters lysogeny, and many are also able to excise when circumstances dictate entry into the lytic
phase (Hirano et al. 2011). However, members of both classes of SSRs have been found in a
variety of recombination reactions involving viral and bacterial DNA including integration,
excision, inversion, control of plasmid copy number and movement of transposons (Nunes-
Duby, 1998; Hallet et al. 2004; Mazel, 2006; Siguier et al. 2015).
In addition to the recombinases responsible for mobilizing phage genomes and
transposons, integron integrases are a sub-family of tyrosine-based site-specific recombinases (TBSSRs)
that have been found to be functionally discrete from all other characterized tyrosine
recombinases (Mazel, 2006). Integrons consist of the integron integrase (the TBSSR), a primary
recombination site (attI) and an outward facing promoter, and are responsible for the acquisition
of gene cassettes (individual genes that have an appropriate attC site for integration) in a non-
disruptive and functional orientation (Hall and Collis, 1995; Mazel, 2006). The incorporation of
gene cassettes in this manner allows for the immediate use of the newly acquired genes;
however, integrons also serve as storage for additional genes since the existing gene cassettes are
maintained in an array which can be composed of hundreds of individual genes (Cambray et al.
2010; Domingues et al. 2012). Gene cassettes generally decrease in activity with distance from
the primary promoter, but can be shuffled under stressful circumstances since the integron
integrase is activated by the SOS response (Guerin et al. 2009). This provides a pool of potential
genes that do not pose a transcriptional burden to the cell but are available if necessary
(Cambray et al. 2011; Darmon and Leach, 2014).
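The positional effect described above can be sketched with a toy model in Python. This is an illustrative sketch only: the class and method names, the cassette labels and the halving of relative expression at each successive position are hypothetical assumptions, not measured values. The behaviours taken from the text are that cassettes are integrated at attI next to the promoter, that expression falls with distance from the promoter, and that cassettes can be re-ordered within the array.

```python
from dataclasses import dataclass, field


@dataclass
class IntegronArray:
    """Toy cassette array; index 0 is the position adjacent to attI and the promoter."""

    cassettes: list = field(default_factory=list)

    def integrate(self, cassette: str) -> None:
        """Insert a cassette at attI, i.e. at the promoter-proximal position."""
        self.cassettes.insert(0, cassette)

    def relative_expression(self, decay: float = 0.5) -> dict:
        """Assumed geometric fall-off of expression with distance from the promoter."""
        return {name: decay ** position for position, name in enumerate(self.cassettes)}

    def reshuffle_to_front(self, cassette: str) -> None:
        """Excise a stored cassette and re-integrate it at attI (e.g. after integrase induction)."""
        self.cassettes.remove(cassette)
        self.integrate(cassette)


if __name__ == "__main__":
    array = IntegronArray()
    for name in ("cassette_A", "cassette_B", "cassette_C"):
        array.integrate(name)
    print(array.cassettes, array.relative_expression())
    array.reshuffle_to_front("cassette_A")
    print(array.cassettes, array.relative_expression())
```

In this model the oldest cassette sits farthest from the promoter and is barely expressed until it is moved back to the first position, which mirrors the storage role described in the text.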
Integrons have not been found to be mobile themselves, but are commonly mobilized by
other MGEs (Collis et al. 2002; Hall and Collis, 1995; Mazel, 2006; Boucher et al. 2007). They
are sometimes referred to as mobile genetic elements (or components of the mobilome) due to
their role in horizontal gene transfer by integrating gene cassettes (Ragan and Beiko, 2009;
Olschlager and Hacker, 2008; Taylor et al., 2011). Integron classes are created based on
homology of the tyrosine recombinase and attI site, and not by the function of the genes
associated with them, in recognition of the transient nature of these associations. Integrons are
of great interest due to the unique adaptive capacity they provide, and are understandably among
the best studied of the mobile element classes due to their strong association with antibiotic
resistance genes.
In addition to direct impacts related to their mobility, ISs and transposons are also
involved in gene activation and regulation and can promote genome rearrangements either
directly or by providing scattered regions of homology (Curcio and Derbyshire, 2003).
Recombination may also occur at homologous regions within transposable element sequences
resulting in greater diversity (Ling and Cordaux, 2010). In recent years it has become apparent
that IS elements have a dramatic impact on genome evolution, ranging from inactivation and
regulation of individual genes to the complete re-organization of genomes through IS expansion
and subsequent genome streamlining (Siguier et al. 2014; Darman and Leach 2014).
2.2 Intercellular MGEs
Intercellular MGEs are similar to those described above except that in addition to the
genes required for integration/replication they also carry all the genes necessary for facilitating
their own movement between bacterial cells. Plasmids by classical definition are maintained
independently of the chromosome within a cell, and are therefore distinguished from
transposons since the latter are integrated into the host chromosome (Siefert, 2009). Both
plasmids and certain types of transposons can move between cells by conjugation.
Plasmids are often thought of as small extra-chromosomal DNA elements that carry non-
essential traits and are therefore easily lost when not needed. However, as the number of
characterized genomes has increased, it has become clear that many bacteria maintain plasmids
of considerable size (up to 2 Mb) and complexity. In some highly stressful environments up to
78% of culturable bacteria have been observed to carry plasmids, most of them large (>50 kb)
(Fulthorpe et al. 1993). Presumably these plasmids carry traits that provide a fitness benefit
that outweighs reproductive pressure to maintain small plasmid sizes, either through supplying
access to a unique niche for the individual strain or through an increased adaptive potential
inherent to the presence of the plasmid itself. These plasmids have indeed been found to contain
niche specific attributes such as symbiosis or catabolic pathways, and to be maintained as stable
components of the genome (Konstantinidis and Tiedje, 2004). Moreover, it is now established
that some replicons previously characterized as either megaplasmids (1-2 Mb) or second
chromosomes are more accurately a combination of both elements. The term chromid has been
used to describe second or third chromosomes that utilize a plasmid partitioning mechanism but
contain genes essential to the survival of the cell (Harrison et al. 2010). In addition to
megaplasmids and chromids, there have also been plasmids isolated that are capable of
integrating into the chromosome of their host (Osborn and Boltner, 2002).
Bacteriophage represent some of the most abundant replicating genetic structures
known, probably exceeding 10^29 in the ocean alone (Schumann, 2006). Lytic phage
immediately commence phage production upon entry into the bacterial cell, resulting in lysis of
the bacterial cell and extinction of that particular cell lineage. However, temperate phage, under
favorable conditions, will instead integrate into the chromosome of the host bacterium and may
be maintained for multiple generations until the phage is induced to enter its lytic lifestyle.
Induction is often the result of chemical or nutritional stress threatening the survival of the host
bacterium, but can also respond to a number of environmental triggers (Schumann, 2006).
Occasionally, bacterial DNA is accidentally packaged into the phage in addition to, or instead
of, phage DNA. This process, referred to as transduction, allows the bacterial DNA to be
transferred to a new host and has been observed with a number of virulence and pathogenicity
traits (Lima et al., 2008). Gene transfer agents (GTAs) are an extreme example of transduction
in that these elements exclusively package random fragments of bacterial DNA into the phage
capsid. Since there is no phage DNA, these capsids are not infective but are instead a
genetically stable component of the bacterial genome (Lang and Beatty, 2007).
Integrated phages, termed prophages, are prevalent in many bacteria, averaging one per
genome sequenced, and the advent of high throughput sequencing has provided a number of
assembled bacteriophage genomes for analysis. Comparison of these sequences has revealed
that the current genomes are the products of extensive non-homologous recombination events,
the result of both very frequent recombination between phage genomes and the enormity of the
evolutionary time scale on which these events have been taking place (Hendrix and Casjens,
2008). The role of bacteriophage in HGT through transduction is well documented; however, the
role they play in directly introducing beneficial genes as a means of ensuring vertical inheritance
is less studied. Metagenomic analyses of phage communities have revealed that bacteriophage
contain an unprecedented diversity of genetic sequences that are readily exchanged between
different phage genomes and that are equally available to the bacterial host of these genomes
due to the ease with which genes are transferred (Hendrix and Casjens, 2008).
Most transposons are not self-transferable between cells and are therefore covered in
Section 2.1 on Intracellular MGEs. Some transposons however, including Tn916, have acquired
genes that provide the capability of intercellular movement, and were therefore named
‘conjugative transposons’ (hence Tn916 is often referred to as CTn916). However, many of
the conjugative transposons move entirely through the action of a site-specific recombinase
instead of a canonical DDE transposase, leading to the re-classification of these elements as
Integrative and Conjugative Elements (ICEs, also synonymous with the retired terminology of
constin) (Rowland and Stark, 2005; Wozniak and Waldor, 2010). ICEs can be distinguished
from many transposons by the non-random insertion mediated by the site-specific recombinase,
and commonly form an excised circular intermediate that does not replicate autonomously prior
to transfer to the recipient cell. However, many ICEs were originally defined as genomic
islands and named according to the traits that they were conferring (symbiosis islands,
pathogenicity islands, etc.), and therefore the terms genomic island, conjugative transposon and ICE
have been used interchangeably (Burrus et al. 2002; Juhas et al. 2009; Roberts et al. 2008;
Wozniak and Waldor, 2010; Siguier et al. 2015). Inactivation of mobility genes or physical
separation from the conjugation machinery can negate the intercellular mobility of a transposon,
so that the role of the MGE in that cell differs from the role of its homologues in other cells. Studies have
also revealed that MGEs can be mobilized in trans by other mobile elements in the cell, and that
MGE resolution systems can rescue plasmid resolution functions (Hallet et al. 2004). This
illustrates the interconnectedness of the different MGE categories.
2.3 Impact on Genome Evolution
Antibiotic resistance is arguably the largest human health concern of our century, as
evidenced by a call from the World Health Organization that all governments should prepare a
comprehensive national plan for surveillance and mitigation of antibiotic resistance (Leung et al.
2011). However, in addition to monitoring and limiting the distribution of current resistance
genes in pathogens, it is becoming increasingly apparent that environmental reservoirs serve as
an important source of available resistance genes that can be acquired by human pathogens
(Finley et al. 2013; Forsberg et al. 2012; Pruden et al. 2012; Perry and Wright, 2013). This has
led to a flood of studies investigating the presence of resistance genes in different
environmental reservoirs, including pristine and/or ancient soils where antibiotic exposure could
not have contributed to the observed resistance (Allen et al. 2009, D’Costa et al. 2011).
Whether these resistance genes pose a tangible threat to human health depends on the ease with
which they could be acquired by pathogens, and therefore it is no longer sufficient to investigate
merely the presence of these genes in environmental samples. The context of resistance genes,
including both the strains harboring them and their potential for mobility, has become the new
focus of environmental studies on antibiotic resistance. Looking ahead, quantifying the
likelihood of new combinations of mobile elements and resistance genes emerging from a given
environment will require a greater understanding of the mobilome of different environments.
This has previously been unfeasible; however, the scope of environmental metagenomics is
rapidly expanding with the advent of low-cost, high-throughput sequencing technologies. It is
now increasingly common to analyze complete assembled metagenomes, highlighting the
importance of developing standardized methods that can be applied to environmental samples.
However, our ability to locate and potentially identify mobile genetic elements will only be
useful if we can also confirm the functions of putative mobile elements.
Antibiotic resistance genes coming from clinical sources have been likened to an invasive
species (Pruden et al. 2012; Gillings et al. 2015). They are introduced into the environment in
wastewater and agriculture in the same way as chemical pollutants, but they pose far different
kinds of threats since they are present in replicative organisms and on self-transmissible
elements. They may also form new combinations that aid in their dissemination or maintenance
within a population. The aggressiveness of their dissemination is determined by the nature of
the mobile genetic element with which they are associated, which is why it is so vitally
important that we improve our understanding of the nature and diversity of these mobile
elements in bacterial communities. Our current level of understanding of the transposable
elements, and tyrosine based site-specific recombinases in particular, is akin to an uninitiated
gardener – we can group elements based on shared characteristics, and can recognize some
known weeds, but are left in awe of the diversity that we have yet to explore.
The recognition that environmental bacterial communities serve as a reservoir of
resistance genes has important implications for managing antibiotic resistance in pathogens.
There are many mechanisms by which antibiotic resistance genes can be maintained in a
complex microbial community in the absence of selection pressure. One is co-selection by other
environmental pollutants, as has already been seen with heavy metals (Stepanauskas et al. 2006;
McArthur et al. 2011; Wright et al. 2006; Wright et al. 2008). Co-selection has undoubtedly
impacted antibiotic resistance gene maintenance, given the high concentration of heavy metals
relative to antibiotics in the environment (Stepanauskas et al. 2006). Heavy metal resistance
genes are commonly found on the same transmissible plasmids and transposons carrying
antibiotic resistance genes, and the class 1 integrons are commonly associated with resistance to
disinfectants in addition to both heavy metals and antibiotics (Gillings et al. 2015). This
highlights the role that seemingly unrelated environmental pollutants may play in the
maintenance and dissemination of antibiotic resistance in bacterial communities. Secondly, it
has been shown that some resistance genes can be silenced by regulatory mechanisms that
allow for the maintenance of the gene within the population, while others evolve from genes that serve
alternative purposes in the absence of antibiotic pressure (proto-resistance genes). Movement of
these genes into other genomic locations or other strains can create or restore the antibiotic
resistance phenotype when selection is applied (Perry and Wright, 2013). Antibiotic resistance
genes that are incorporated into integrons can likewise be inactivated and therefore present a
reduced burden to the host strain. Since gene cassettes are promoter-less, the genes contained
within the integron cassette array are prone to severe polar effects, with the genes farthest from
the integron integrase rarely transcribed. Since SOS induction can result in the re-organization
of the cassettes maintained within the array, integrons represent an ideal storage site where
potentially useful genes can be maintained (Cambray et al. 2011). Thirdly, sub-inhibitory
concentrations of antibiotics have been shown to increase the potential for evolution of new
traits within populations, through increased mutation rates and horizontal transfer (Baquero
2009; Gillings and Stokes, 2012). This highlights the important distinction between minimum
inhibitory concentrations (toxicity) of a chemical and minimal selective concentrations – a
distinction not currently addressed in regulations designed to determine appropriate limitations
on release of chemicals to the environment. Whether there are other environmental pollutants
that specifically impact the ‘evolvability’ of bacterial communities remains an open question
(Gillings and Stokes, 2012).
It is important to realize that although the established mechanisms of HGT generally refer
to the acquisition of novel genes or transposable elements from other organisms, the mobile
elements themselves (plasmids, phages, ICEs) evolve over time through the transfer or
rearrangement of modules between MGEs within a bacterial cell. IS density has been shown to
be higher in bacterial plasmids than in their host chromosomes, which may be the result of
preferential targeting by some transposable elements into plasmids using rolling-circle
replication (Siguier et al. 2014). The modular nature of plasmids and phages has been well
established (Hendrix et al. 2000; Toussaint and Merlin, 2002) as evidenced by the broad
diversity of accessory genes that are commonly found on plasmids with homologous replication
systems (Heuer and Smalla, 2012). IS elements and other intracellular MGEs can facilitate
transfers of gene segments, and can also serve to recombine different MGEs, and therefore the
categories established for the different elements should be considered fluid (Osborn and Boltner,
2002; Toussaint and Merlin, 2002). Insertion sequences interspersed throughout a genome can
be beneficial to the bacteria for the purposes of incorporating exogenous DNA or disabling the
ability of a phage to excise from a genome in order to preserve beneficial genes (which the
phage had been using as a selective agent to ensure inheritance). There is therefore a complex
balancing act between the risks involved in maintaining potentially mobile genes, and the
benefit derived from the genome plasticity these genes enable.
The distribution of IS elements in a genome is non-random, resulting in regions of the
genome where insertion of a new element is less likely to be detrimental (Plague, 2010). As a
result, mobile elements often invade each other (Darmon and Leach, 2014). This can result in
new chimeric mobile elements, and fragments of inactivated MGEs that can serve as
homologous regions for further rearrangements. These genomic regions have alternatively been
referred to as genomic islands (Langille and Brinkman, 2009), regions of genome plasticity
(RGPs) (Ogier et al. 2010), or ‘junkyards’ of MGEs (Schwartz et al. 2003), but they serve an
important role by providing relatively safe regions for the acquisition of incoming mobile
elements.
2.4 References
Allen, H. K., Moe, L. A., Rodbumrer, J., Gaarder, A., & Handelsman, J. (2009). Functional metagenomics reveals diverse β-lactamases in a remote Alaskan soil. The ISME journal, 3(2), 243-251.
Baquero, F. 2009. Environmental stress and evolvability in microbial systems. Clin. Microbiol. Infect. 15(Suppl.1):5-10.
Bellanger, X., Payot, S., Leblond-Bourget, N., & Guédon, G. (2014). Conjugative and mobilizable genomic islands in bacteria: evolution and diversity. FEMS microbiology reviews, 38(4), 720-760.
Boucher, Y., Labbate, M., Koenig, J. E., & Stokes, H. W. (2007). Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends in microbiology, 15(7), 301-309.
Brüssow, H. 2008. Phage-bacterium co-evolution and its implication for bacterial pathogenesis. In: Horizontal Gene Transfer in the Evolution of Pathogens. pp. 49-77. Cambridge University Press, New York, NY, USA.
Burrus, V., G. Pavlovic, B. Decaris and G. Guedon. 2002. Conjugative transposons: the tip of the iceberg. Molecular Microbiology 46(3): 601-610.
Cambray, G., A-M. Guerout and D. Mazel. 2010. Integrons. Annual Review of Genetics. 44:141–166.
Cambray, G., N. Sanchez-Alberola, S. Campoy, E. Guerin, S. Da Re, B. Gonzalez-Zorn, M-C. Ploy, J. Barbe, D. Mazel and I. Erill. 2011. Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons. Mobile DNA 2(1):6.
Carraro N. and Burrus V. 2015. Biology of Three ICE Families: SXT/R391, ICEBs1, and ICESt1/ICESt3, p 289-309. In Craig N, Chandler M, Gellert M, Lambowitz A, Rice P, Sandmeyer S (ed), Mobile DNA III. ASM Press, Washington, DC. doi: 10.1128/microbiolspec.MDNA3-0008-2014
Collis, C.M., Kim, M., Stokes, H.W. and R.M. Hall. 2002. Integron-encoded IntI integrases preferentially recognize the adjacent cognate attI site in recombination with a 59-be site. Molecular Microbiology 46(5): 1415-1427.
Curcio, M. J., & Derbyshire, K. M. 2003. The outs and ins of transposition: from mu to kangaroo. Nature Reviews Molecular Cell Biology, 4(11), 865-877.
Darmon, E. and D.R.F. Leach. 2014. Bacterial Genome Instability. Microbiology and Molecular Biology Reviews. 78(1):1-39.
Domingues S., G.J. da Silva, K. M. Nielsen. 2012. Integrons: vehicles and pathways for horizontal dissemination in bacteria. Mob. Genet. Elements 2:211-223.
D’Costa, V. M., King, C. E., Kalan, L., Morar, M., Sung, W. W., Schwarz, C., ... & Wright, G. D. (2011). Antibiotic resistance is ancient. Nature, 477(7365), 457-461.
Finley, R. L., Collignon, P., Larsson, D. J., McEwen, S. A., Li, X. Z., Gaze, W. H., ... & Topp, E. (2013). The scourge of antibiotic resistance: the important role of the environment. Clinical Infectious Diseases, cit355.
Forsberg, K. J., Reyes, A., Wang, B., Selleck, E. M., Sommer, M. O., & Dantas, G. (2012). The shared antibiotic resistome of soil bacteria and human pathogens. Science, 337(6098), 1107-1111.
Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. 2005. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3:722-732.
Fulthorpe, R. R., Liss, S. N., & Allen, D. G. (1993). Characterization of bacteria isolated from a bleached kraft pulp mill wastewater treatment system. Canadian journal of microbiology, 39(1), 13-24.
Gillings, M. R., & Stokes, H. W. 2012. Are humans increasing bacterial evolvability? Trends in ecology & evolution, 27(6), 346-352.
Gillings, M.R., Gaze, W.H., Pruden, A., Smalla, K., Tiedje, J.M. and Yong-Guan, Z. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME journal doi:10.1038/ismej.2014.226
Guerin, E., G. Cambray, N. Sanchez-Alberola, S. Campoy, I. Erill, S. Da Re, B. Gonzalez-Zorn, J. Barbé, M.C. Ploy and D. Mazel. 2009. The SOS response controls integron recombination. Science 324:1034.
Hall, R.M. and C.M. Collis. 1995. Mobile gene cassettes and integrons: capture and spread of genes by site-specific recombination. Molecular Microbiology 15(4): 593-600.
Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips. ASM Press, Washington, D.C. USA
Harrison, P. W., Lower, R. P., Kim, N. K., & Young, J. P. W. (2010). Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends in microbiology, 18(4), 141-148.
Hendrix, R. W., Lawrence, J. G., Hatfull, G., and Casjens, S. 2000. The origins and ongoing evolution of viruses. Trends Microbiol. 8, 504–508.
Hendrix, R.W. and S.R. Casjens. 2008. The Role of Bacteriophages in the Generation and Spread of Bacterial Pathogens. In: Horizontal Gene Transfer in the Evolution of Pathogenesis. pp. 79-112. Cambridge University Press, New York, NY, USA.
Heuer, H., & Smalla, K. (2012). Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS microbiology reviews, 36(6), 1083-1104.
Hirano, N., Muroi, T., Takahashi, H., & Haruki, M. (2011). Site-specific recombinases as tools for heterologous gene integration. Applied microbiology and biotechnology, 92(2), 227-239.
Juhas, M., van der Meer, J. R., Gaillard, M., Harding, R. M., Hood, D. W., & Crook, D. W. (2009). Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS microbiology reviews, 33(2), 376-393.
Knapp, C.W., J. Dolfing, P.A. Ehlert and D.W. Graham. 2010. Evidence of increasing antibiotic resistance gene abundances in archived soils since 1940. Environ. Sci. Technol. 44:580–587. doi:10.1021/es901221x
Konstantinidis, K. T., & Tiedje, J. M. (2004). Trends between gene content and genome size in prokaryotic species with larger genomes. Proceedings of the National Academy of Sciences of the United States of America, 101(9), 3160-3165.
Lang, A.S. and J.T. Beatty. 2007. Importance of widespread gene transfer agent genes in alpha-proteobacteria. Trends in Microbiology 15:54-62.
Lawrence, J.G. and H. Hendrickson. 2008. Genomes in Motion: Gene Transfer as a Catalyst for Genome Change. In: Horizontal Gene Transfer in the Evolution of Pathogens. pp. 3-22. Cambridge University Press, New York, NY, USA.
Langille, M.G.I. and F.S.L. Brinkman, IslandViewer: an integrated interface for computational identification and visualization of genomic islands, Bioinformatics (2009) Jan. 16 (EPub). PMID: 19151094
Lawrence, J.G. and A.C. Retchless. 2009. The Interplay of Homologous Recombination and Horizontal Gene Transfer in Bacterial Speciation. In: Horizontal Gene Transfer: Genomes in Flux. pp. 29-54. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.
Leung, E., Weil, D. E., Raviglione, M., & Nakatani, H. (2011). The WHO policy package to combat antimicrobial resistance. Bulletin of the World Health Organization, 89(5), 390-392.
Lima, W.C., A.C.M. Paquola, A.M. Varani, M-A. Van Sluys and C.F.M. Menck. 2008. Laterally transferred genomic islands in Xanthomonadales related to pathogenicity and primary metabolism. FEMS Microbiology Letters 281:87–97.
Ling A, Cordaux R (2010) Insertion Sequence Inversions Mediated by Ectopic Recombination between Terminal Inverted Repeats. PLoS ONE 5(12): e15654. doi:10.1371/journal.pone.0015654
Mahillon, J. and M. Chandler, Insertion sequences, Microbiol. Mol. Biol. Rev. 62 (1998) 725-774.
Mazel, D. 2006. Integrons: agents of bacterial evolution. Nat. Rev. Microbiol. 4:608-620.
McArthur, J. V., Tuckfield, R. C., Lindell, A. H., & Baker-Austin, C. (2011). When rivers become reservoirs of antibiotic resistance: industrial effluents and gene nurseries.
Nunes-Düby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T., Landy, A., 1998. Similarities and differences among 105 members of the Int family of site-specific recombinases 26, 391-406.
Ogier, J. C., Calteau, A., Forst, S., Goodrich-Blair, H., Roche, D., Rouy, Z., ... & Gaudriault, S. (2010). Units of plasticity in bacterial genomes: new insight from the comparative genomics of two bacteria interacting with invertebrates, Photorhabdus and Xenorhabdus. BMC genomics, 11(1), 568.
Olendzenski, L. and J.P. Gogarten. 2009. Gene Transfer: Who Benefits? In: Horizontal Gene Transfer: Genomes in Flux. pp. 3-12. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.
Olschlager, T. and J. Hacker. 2008. Genomic Islands in the Bacterial Chromosome – Paradigms of Evolution in Quantum Leaps. In: Horizontal Gene Transfer in the Evolution of Pathogenesis. pp. 113-134. Cambridge University Press, New York, NY, USA.
Osborn, A. M. and D. Boltner. 2002. When phage, plasmids and transposons collide: genomic islands, and conjugative- and mobilizable-transposons as a mosaic continuum. Plasmid 48: 202-212.
Perry, J. and G.D. Wright. 2013. The antibiotic resistance “mobilome”: searching for the link between environment and clinic. Frontiers in Microbiology. 4:1-7.
Plague, G.R. 2010. Intergenic transposable elements are not randomly distributed. Genome Biol. Evol. 2:584-590.
Pruden, A., & Arabi, M. 2012. Quantifying anthropogenic impacts on environmental reservoirs of antibiotic resistance. Antimicrobial Resistance in the Environment, 173-202.
Ragan M.A. and R.G. Beiko. 2009. Lateral Gene Transfer: Open Issues. Philosophical Transactions of the Royal Society B. 364: 2241–2251.
Roberts, A.P., M. Chandler, P. Courvalin, G. Guedon, P. Mullany, T. Pembroke, J.I. Rood, C.J. Smith, A. O. Summers, M. Tsuda and D. E. Berg. 2008. Revised nomenclature for transposable genetic elements. Plasmid. 60: 167-173.
Rowland, S.J. and W.M. Stark. 2005. Site-specific recombination by the serine recombinases. In: The Dynamic Bacterial Genome pp. 83-120. Cambridge University Press, NY, NY, USA.
Schwartz, E., A. Henne, R. Cramm, T. Eitinger, B. Friedrich and G. Gottschalk. 2003. Complete Nucleotide Sequence of pHG1: A Ralstonia eutropha H16 Megaplasmid Encoding Key Enzymes of H2-based Lithoautotrophy and Anaerobiosis. J. Mol. Biol. 332: 369–383
Schumann, W. 2006. Sequence specific recombination classes. In: Dynamics of the Bacterial genome pp. 97-98 John Wiley & Sons.
Siefert, J.L. 2009. Defining the Mobilome. In: Horizontal Gene Transfer: Genomes in Flux. pp. 13-27. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.
Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev 38: 865-891.
Siguier P, Gourbeyre E, Varani A, Ton-Hoang B, Chandler M. 2015. Everyman’s Guide to Bacterial Insertion Sequences, p 555-590. In Craig N, Chandler M, Gellert M, Lambowitz A, Rice P, Sandmeyer S (ed), Mobile DNA III. ASM Press, Washington, DC. doi: 10.1128/microbiolspec.MDNA3-0030-2014
Snyder, and W. Champness. 2007. Molecular Genetics of Bacteria.
Taylor, N.G.H., D.W. Verner-Jeffreys and C. Baker-Austin. 2011. Aquatic systems: maintaining, mixing and mobilizing antimicrobial resistance? Trends in Ecology and Evolution 26(6): 278-284.
Toussaint, A. and C. Merlin. 2002. Mobile Elements as a Combination of Functional Modules. Plasmid 47: 26-35.
Van Houdt, R.V, S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-226.
Wozniak, R. A., & Waldor, M. K. (2010). Integrative and conjugative elements: mosaic mobile genetic elements enabling dynamic lateral gene flow. Nature Reviews Microbiology, 8(8), 552-563.
Wright, M. S., Peltier, G. L., Stepanauskas, R., & McArthur, J. V. (2006). Bacterial tolerances to metals and antibiotics in metal-contaminated and reference streams. FEMS microbiology ecology, 58(2), 293-302.
Wright, M.S., Baker-Austin, C., Lindell, A.H., Stepanauskas, R., Stokes, H.W. and J.V. McArthur. 2008. Influence of industrial contamination on mobile genetic elements: class 1 integron abundance and gene cassette structure in aquatic bacterial communities. ISME Journal 2: 417-428.
Chapter 3 The Limitations of Draft Assemblies for Understanding Prokaryotic Adaptation and Evolution
Acknowledgements and Contributions: This chapter is reproduced as published in Genomics
(Ricker, N., Qian, H., & Fulthorpe, R. R. 2012. The limitations of draft assemblies for
understanding prokaryotic adaptation and evolution. Genomics, 100(3), 167-175
doi:10.1016/j.ygeno.2012.06.009) with minor modifications.
3 Introduction
Next generation sequencing (NGS) platforms have revolutionized how we obtain genetic
information, leading to rapid advances in the fields of genomics and metagenomics. These
methods rely on sequencing chemistries newer than the chain-termination method (Sanger et al. 1977) and on highly parallel
operations that result in high yields at low costs per read but so far produce considerably shorter
reads (in the range of 35-500 nucleotides) than Sanger sequencing (600 to 1500 nucleotides).
Shorter reads increase the required complexity of the assembly algorithms (Miller et al. 2010),
although the ability to sequence to very high coverage can overcome many of the original issues
in genome assembly including read errors and coverage gaps (Wetzel et al. 2011). The utility of
next generation sequencing has been demonstrated in examining new variants, or very close
relatives, of previously sequenced strains (reviewed in MacLean et al. 2009). This type of
assembly, known as a reference or mapping assembly, is relatively straightforward provided that
the two strains share high sequence identity across their genomes. However, in many bacterial
species, the ‘core’ genes that are shared between closely related strains are supplemented by a
significant fraction of ‘dispensable’ genes that vary between the strains of a given species
(Medini et al. 2005). Assembling these sections of sequence data or entire genomes in the
absence of a suitable reference strain (referred to as de novo assembly) is a far more difficult
task (Pop 2009). Whole genome shotgun assemblies using traditional Sanger sequencing have
been utilized for many years for this purpose but the cost and effort required to do this type of
intense sequencing has been prohibitive for all but the largest laboratories (MacLean et al.
2009). The advent of NGS platforms promises to alleviate the financial and technical demands
of obtaining high quality sequence data; however, the presence of repetitive elements in genomic
sequence remains a confounding issue in genome assembly that is difficult to resolve through
coverage alone (Wetzel et al. 2011).
Many assembly programs for NGS data utilize de Bruijn graphing techniques (see
Miller et al. 2010) to perform de novo assemblies of the high number of reads produced, with
the goal of finding the shortest path through the sequence data that includes as much of the
sequence data as possible. For genomes with a high content of repetitive sequences, some
assembly programs will produce an overly compressed alignment, and possible mis-assemblies,
when multiple copies of a repeat are collapsed to one location (Chevreux et al. 1999; Phillippy et
al. 2008). Accurate graphs (those that do not collapse repetitive elements) will often form a
‘frayed rope’ pattern in repetitive sections whereby a path converges at the repeats and then
diverges again (multiple paths leading into the repeat and multiple paths leading out of the
repeat again) since there are multiple true alignments possible. Some assembly programs
specifically search for the characteristic features that repetitive elements create within a graph
such as convergent, divergent or cyclic paths (Miller et al. 2010) and therefore terminate at these
repetitive elements to ensure that they are not overly compressed in the final assembly. This
results in a more fractured assembly, but prevents the errors introduced by arbitrarily collapsing
the repeats to one location.
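The 'frayed rope' pattern described above can be illustrated with a toy example. The following is a minimal Python sketch, not the algorithm implemented in any particular assembler: the function names, the 31 bp toy sequence and the choice of k = 5 are illustrative assumptions, and only the forward strand is considered. It builds a de Bruijn graph from error-free reads and reports the nodes at which paths converge or diverge, which fall at the two ends of the repeated segment.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one predecessor or more than one successor."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1
    nodes = set(graph) | set(indegree)
    return sorted(
        node for node in nodes
        if indegree.get(node, 0) > 1 or len(graph.get(node, ())) > 1
    )


# Toy sequence (hypothetical): the 8 bp segment GGCATCGA occurs twice,
# separated and flanked by unique sequence.
repeat = "GGCATCGA"
genome = "TTGAC" + repeat + "CCTAA" + repeat + "AGTTC"

# Error-free reads: every 10 bp window of the toy sequence.
reads = [genome[i:i + 10] for i in range(len(genome) - 10 + 1)]

graph = de_bruijn_graph(reads, k=5)
print(branching_nodes(graph))  # ['GGCA', 'TCGA']
```

The two reported nodes are the first and last 4-mers of the repeat: two paths enter at GGCA and two leave at TCGA, so a graph-based assembler must either collapse the two copies or end its contigs there.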
Realistically, the assembly software is not expected to produce a perfectly aligned
genome but rather to reduce the sequencing reads into a manageable number of contigs
(‘contiguous sequence’ – the sequence produced by the assembly of multiple overlapping reads)
for finishing. ‘Finishing’ is the process of closing all contig gaps, correcting introduced errors,
and confirming low coverage regions of the assembly through PCR and cloning experiments at
the bench. These experiments can still be expected to take months to years, even with excellent
sequence data and the best software currently available (Nagarajan et al. 2010). For this reason,
complete genome finishing is rarely carried out, both because of the effort required and because the
aim of many sequencing projects is limited to looking for a small number of differences
between the new strain and a previously sequenced close relative. The resulting genome
projects are often submitted as unfinished draft assemblies, or as ‘assembled with likely errors’
(Phillippy et al. 2008).
Although not as repetitive as eukaryotic genomes, prokaryotic genomes contain a variety
of repeated elements ranging in size from 1-6 bp microsatellites (Ellegren 2004; Mayer et al.
2010) to larger elements such as transposons, insertion sequences, rRNA operons, tRNA genes,
and rhs family genes (Lupski and Weinstock 1992). The computational issues that repetitive
genomes pose to NGS assembly have been discussed in other recent papers (Miller et al. 2010;
Wetzel et al. 2011; MacLean et al. 2009; Zhang et al. 2011), but there has been remarkably little
emphasis on the relative value of the portion of the genome that remains fragmented in these
draft assemblies. To this end, we performed an in silico experiment using simulated long and
short read data for the fully sequenced genome of Cupriavidus metallidurans CH34 (hereafter
simply referred to as CH34). This organism was sequenced by the Joint Genome Institute (JGI)
using whole genome shotgun cloning (WGS) with a combination of three randomly sheared
libraries (3, 8 and 40 kb insert sizes) and an additional 3,752 individual Sanger reads for
finishing (Van Houdt et al. 2009). It was chosen for this study because of the high quality
finishing and annotation that has been performed (Van Houdt et al. 2009; Janssen et al. 2010) as
well as the nature of the genome, which contains two large chromosomes and many types of
mobile elements. We anticipated that the repetitive elements contained within this
genome would be a hindrance to assembly, and that this simulation would serve to illustrate the
portions of the genome that are inherently resistant to automated assembly. Four additional
strains (Caulobacter sp. K31, Gramella forsetii KT0803, Rhodobacter sphaeroides 2.4.1 and
Bordetella bronchiseptica RB50) were also included which varied in G+C content, number of
replicons, repeat content of the genome and percentage of genes annotated as involved in
mobility (plasmids, phages and transposons). A detailed analysis of each individual strain was
not performed since the genomic islands have not been characterized; however, genomic island
predictions were available from the IslandViewer website (Langille and Brinkman, 2009), which
utilizes multiple software programs to predict genomic islands from the completed sequence.
The predicted genomic islands in these strains were considerably smaller than those determined
in CH34, so it is expected that some of the predicted islands may actually be components of one
larger island.
Only two assembly programs were utilized since the presence of repeated elements is a
commonly acknowledged issue in assembly algorithms (Pevzner and Tang, 2001; Kingsford et
al. 2010), and a comparison of computational effectiveness was outside the scope of this study.
Our intent was rather to illustrate the biological significance of the regions most likely to remain
unassembled by the nature of their sequence. The Velvet assembler was chosen because the
algorithms have been improved to prevent over-collapsing of repeats (Zerbino et al. 2009). The
ABySS assembler (Simpson et al. 2009) was utilized to determine whether the results were
specific to the Velvet algorithms. Our goal for this project was to use the well-annotated CH34
genome to better understand the biological relevance of the sections of the genome left
unassembled and to examine which aspects of genome complexity would be most problematic
to assemble into large contigs given ideal data. This serves to illustrate the inherent issues in
draft assemblies of prokaryotic genomes, which, as we also show, are only further complicated by
the use of real data.
3.1 Methods
All genomes were obtained from the NCBI website (www.ncbi.nlm.nih.gov) with the
following GenBank accession numbers: Cupriavidus metallidurans CH34 (CP000352-
CP000355), Caulobacter sp. K31 (CP000927.1-CP000929.1), Gramella forsetii KT0803
(CU207366.1), Bordetella bronchiseptica RB50 (BX470250.1) and Rhodobacter sphaeroides
2.4.1 (CP000143.1-CP000147.1, DQ232586.1, DQ232587.1). These files were used to create
error-free simulated long read (400 bp length at 10x coverage) and short read (75 bp length at
45x coverage) data for assembly in Velvet using a custom-made Python program (available on
request). These datasets were assembled using Velvet version 1.1.05 (Zerbino and Birney, 2008)
using the max_kmer and big_assembly settings as these settings gave the best assembly
statistics (N50 and max contig). The final graph of the Velvet assembly for C. metallidurans
CH34 used 4,260,497 of the 4,265,686 (99.9%) simulated reads and resulted in a total of 139
contigs. The maximum contig length was 674,170 bp and the N50 value for the assembly (size
of contig for which 50% of assembled reads are in a contig of that size or larger) was 159,531
bp. The median coverage was calculated as 11.8. The N50 and longest contig stats for the other
genomes are listed in Table 3.2.4. Paired-end libraries with 100 bp reads were also created for two
different insert distances (180 and 3000). The paired-end dataset for C. metallidurans CH34
was assembled in ABySS version 1.1.3 (Simpson et al. 2009) with a final N50 value of 36,682
and maximum contig size of 166,493 bp.
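The custom Python program used for the simulations is not reproduced in this chapter. As a rough sketch of how error-free reads at a target coverage could be generated, the following assumes uniform random start positions, forward-strand reads only, and a linear (non-circular) sequence; none of these choices is confirmed by the text, and `simulate_reads` and the toy sequence are hypothetical.

```python
import random


def simulate_reads(genome, read_length, coverage, seed=0):
    """Sample error-free reads until the requested mean coverage is reached.

    Hypothetical helper: uniform random start positions, forward strand only,
    and no wrap-around at the ends of the sequence.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds sequence length")
    rng = random.Random(seed)
    # Mean coverage = (number of reads * read length) / sequence length.
    n_reads = round(coverage * len(genome) / read_length)
    last_start = len(genome) - read_length
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[start:start + read_length] for start in starts]


# Toy 10,000 bp random sequence standing in for a real replicon.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))

# Read lengths and coverages as stated in the Methods.
long_reads = simulate_reads(toy_genome, read_length=400, coverage=10)
short_reads = simulate_reads(toy_genome, read_length=75, coverage=45)
print(len(long_reads), len(short_reads))  # 250 6000
```

Because every read is an exact substring of the input, any contig break in an assembly of such data reflects the structure of the sequence itself rather than sequencing error.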
Assembled contigs were aligned to reference sequences using Geneious Pro version
5.5.2 (Drummond et al. 2010). Despite the error-free nature of the simulated data, alignments
were performed at 98% identity since imperfect repeats (repeats with a small number of single
base pair differences) could be seen as sequencing errors by the assembler and would be
incorrectly collapsed thereby introducing errors into the final contigs. Coverage statistics
included were those determined by the Geneious program and therefore represent coverage of
reference by unique contigs only, with no allowance for contig repetition, instead of true
coverage of the reference genome if all repeats could be accounted for. Examination of genes
adjacent to the ends of contigs was performed using the NCBI BLAST tool (Altschul et al. 1990),
and the GenBank entries for each replicon (www.ncbi.nlm.nih.gov). Repeat content of the
genomes was estimated by calculating the uniqueness of each genome at k-mer lengths of 31
and 1000 and then taking the average of these two calculations. Assembly files from the GAGE
study (Salzberg et al. 2012) were downloaded and aligned by the same metrics, or by the
addition of a maximum 500 bp gap parameter as necessary.
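The text does not define 'uniqueness' precisely. One plausible reading, sketched below in Python, takes it as the fraction of k-mer positions whose k-mer occurs exactly once in the sequence, and takes repeat content as one minus the mean uniqueness over the two k-mer lengths. Both definitions, and the function names, are assumptions made for illustration.

```python
from collections import Counter


def kmer_uniqueness(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once in the sequence."""
    n_positions = len(sequence) - k + 1
    if n_positions <= 0:
        raise ValueError("k is longer than the sequence")
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    unique_positions = sum(1 for count in counts.values() if count == 1)
    return unique_positions / n_positions


def repeat_content(sequence, k_values=(31, 1000)):
    """Assumed estimate: one minus the mean uniqueness over the chosen k values."""
    mean_uniqueness = sum(kmer_uniqueness(sequence, k) for k in k_values) / len(k_values)
    return 1.0 - mean_uniqueness


# Toy check with small k: in ACGACG the 3-mer ACG occurs twice,
# so only 2 of the 4 positions (CGA and GAC) hold a unique 3-mer.
print(kmer_uniqueness("ACGACG", 3))           # 0.5
print(repeat_content("ACGACG", k_values=(3, 6)))  # 0.25
```

Storing every 1000-mer of a multi-megabase genome as a string is memory-hungry, so a practical version would hash the k-mers or use a dedicated k-mer counting tool; the sketch is intended for toy-scale input only.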
3.2 Results
3.2.1 Assembly Quality for Cupriavidus metallidurans CH34
CH34 has 4 large replicons (Table 3.2.1) and a multitude of well-annotated smaller mobile
elements including genomic islands, transposons and insertion sequences (Van Houdt et al.
2009; Monchy et al. 2007). On the two chromosomes, there are four sets of 5S, 16S and 23S
rRNA genes (2 sets on each) and 62 tRNA genes (8 of which are duplicates found on the second
chromosome) (Janssen et al. 2010). There are 16 documented genomic islands (11 on
chromosome 1, none on chromosome 2, 3 on pMOL30 and 2 on pMOL28), as well as 57
insertion sequences and 19 other transposable elements distributed across the four replicons
(Janssen et al. 2010).
Table 3.2.1: Number of contigs aligning and coverage statistics for each of the four
replicons in C. metallidurans CH34 using Velvet and ABySS genome assembly software.
Velvet Assembly
Replicon | Size (bp) | Number of contigs aligned at 98% | Largest contig (bp) | Total bases in contigs longer than 10 kb | Total bases in contigs longer than 5 kb | Total bases in contigs longer than 1 kb
Chr 1 | 3,928,089 | 75 | 674,226 (17.2%) | 3,786,365 (96.4%) | 3,835,365 (97.6%) | 3,875,047 (98.6%)
Chr 2 | 2,580,084 | 63 | 541,760 (21.0%) | 2,466,986 (95.6%) | 2,504,599 (97.1%) | 2,532,450 (98.2%)
pMOL30 | 233,720 | 18 | 58,279 (24.9%) | 212,451 (90.9%) | 212,451 (90.9%) | 230,532 (98.6%)
pMOL28 | 171,459 | 9 | 101,867 (59.4%) | 156,377 (91.2%) | 156,377 (91.2%) | 171,008 (99.7%)

ABySS Assembly
Replicon | Size (bp) | Number of contigs aligned at 98% | Largest contig (bp) | Total bases in contigs longer than 10 kb | Total bases in contigs longer than 5 kb | Total bases in contigs longer than 1 kb
Chr 1 | 3,928,089 | 470 | 166,493 (4.2%) | 3,435,784 (87.5%) | 3,669,623 (93.4%) | 3,784,364 (96.3%)
Chr 2 | 2,580,084 | 212 | 107,711 (4.2%) | 2,242,357 (86.9%) | 2,416,359 (93.6%) | 2,511,515 (97.3%)
pMOL30 | 233,720 | 59 | 29,993 (12.8%) | 155,523 (66.5%) | 190,452 (81.5%) | 219,738 (94.0%)
pMOL28 | 171,459 | 36 | 50,670 (29.6%) | 135,295 (78.9%) | 140,927 (82.2%) | 162,258 (94.6%)
Assembly of our simulated dataset (see Methods) in Velvet produced 139 contigs. This
assembly was aligned to the reference sequence of each of the four
replicons (Table 3.2.1) in Geneious version 5.5.2 (Drummond et al. 2010). Several of the
contigs were found to align to multiple replicons (Figure 3.2.1), including one that aligned to all
four replicons (corresponding to Tn6049). The largest contig that was shared in more than one
location was contig 152 (length 10,403 bp), which is found on both chromosome 1 and 2 and
corresponds to Tn6048. Likewise, a single contig, 5471 bp, corresponded to the 4 rRNA
operons that are evenly divided between the two chromosomes. All contigs mapped to the
reference genome at 98% identity. The genome was also assembled using ABySS version 1.3.3
(Simpson et al. 2009). This assembly was considerably more fragmented than the Velvet
assembly (Table 3.2.1) and had two small contigs (915 bp and 740 bp) that did not align with
any of the replicons at 98% identity. Due to the considerably larger number of fragments from
this assembly, the causes of contig termination were not determined for the contigs produced
by ABySS; however, both programs had greater difficulty assembling
the genomic island-rich pMOL30 than pMOL28 and showed similar contig distribution
patterns (Figure 3.2.2).
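The summary statistics used to compare the two assemblies (the N50 value quoted in the Methods and the size-threshold columns of Table 3.2.1) can be computed from a list of contig lengths. The Python sketch below uses the common base-weighted definition of N50, which is an assumption here since the Methods describe it in terms of reads; the function names and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


def bases_in_contigs_longer_than(contig_lengths, threshold):
    """Total bases contained in contigs strictly longer than the threshold."""
    return sum(length for length in contig_lengths if length > threshold)


def percent_of_reference(n_bases, reference_size):
    """Express a base count as a percentage of the reference replicon size."""
    return round(100 * n_bases / reference_size, 1)


# Toy contig lengths (hypothetical): total 300, so N50 is reached at the 80 bp contig.
toy_lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(toy_lengths))                               # 80
print(bases_in_contigs_longer_than(toy_lengths, 25))  # 260

# Check against one cell of Table 3.2.1: largest Velvet contig on chromosome 1.
print(percent_of_reference(674_226, 3_928_089))       # 17.2
```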
Figure 3.2.1: Number of assembled contigs in Velvet aligning to replicons in C.
metallidurans CH34.
Venn diagram is based on 98% sequence identity. It is important to note that there are no shared
contigs found solely between chromosome 1 and pMOL30, or solely between chromosome 2
and pMOL28.
Figure 3.2.2: Geneious alignment of assembled contigs to two key regions containing
genomic islands in C. metallidurans CH34.
Top two images are from the Velvet assembly, bottom images are the same regions from the
ABySS assembly. The grey bar indicates region coverage, and the black lines are reference
sequence (solid line) and location of the contigs with respect to the reference. The top
alignment for each assembler includes the GI rich region ranging from approximately 1.2 Mb to
1.8 Mb on chromosome 1 and contains the two largest genomic islands. The bottom alignment
is to the full length of pMOL28, with the heavy metal resistance island highlighted (location is
as indicated in Monchy et al. 2007). There are more contigs listed in Table 3 than are visible on
the figure since contigs mapping to repetitive elements can only be mapped onto the
chromosome once.
3.2.2 Contigs terminate at repeated elements and mobile elements
The large contigs from the Velvet assembly were examined to determine the genomic
determinants that had caused their termination (see Table 3.2.2). We anticipated that the
known repeated elements would be the main cause of termination in our error-free dataset, and
this was primarily found to be the case. Seven of the largest contigs (four from chromosome 1 and one
each from the other replicons) were investigated and, of the 14 terminal regions, 12 were found
to have terminated at a previously documented mobile element. The other two corresponded
with genes that would not be expected to be mobile. One of these genes was found to have an
internal repeat structure that interfered with assembly, and the other was found to have a second
copy of the same gene present on both chromosome 1 and chromosome 2 (at 99% identity).
When all contigs greater than 1 kb in length from chromosome 1 were included in the analysis
(data not shown), 75% (35/46) of the termination points were from documented mobile
elements. All other termination points were from duplicate genes found on multiple replicons
with the exception of a shared gene cluster between CMGI-2 and CMGI-3 which are both
located on chromosome 1 (discussed in section 3.2.3) and the rRNA operons for which there are
two copies on each chromosome. Of the mobile elements in this genome, Tn6049 and ISRme3
were found in the highest abundance (12 copies and 10 copies, respectively), and Tn6049 was
the only element found on all four replicons.
Table 3.2.2: Details on the terminal regions for 7 large contigs.
For simplicity, only the four largest contigs from chromosome 1 and the largest single contig
from each of the other replicons are included. The gene or mobile element responsible for the
contig termination is listed along with the number of times that element occurs in the total
genome.
Contig Name | Size | Replicon | 5' terminus | # in genome | 3' terminus | # in genome
Contig 17 | 674226 | chr 1 | ISRme4 | 2 (both on chr1) | sodium sulphate symporter | 2 (100% to chr1, 99% to chr2)
Contig 113 | 358847 | chr 1 | ISRme7 | 2 (both on chr1) | IS1087B | 2 (both on chr1)
Contig 125 | 309700 | chr 1 | Tn6049 | 12 (across all four replicons) | IS1090 | 4 (all on chr1)
Contig 143 | 302838 | chr 1 | IS1087B | 2 (both on chr1) | Tn6049 | 12 (across all replicons)
Contig 220 | 541760 | chr 2 | Tn6048 | 3 (1 on chr1, 2 on chr2) | ISRme3 | 10 (across all but pMOL28)
Contig 252 | 58279 | pMOL30 | merE from Tn4380 | 2 (1 on each plasmid) | repeated sequence within copB gene | 1
Contig 239 | 101867 | pMOL28 | merE from Tn4380 | 2 (1 on each plasmid) | IS1086 | 3 (across all but pMOL30)
3.2.3 Fragmentation is greatest at genomic island sites
Interestingly, although there were long contigs distributed across all of the replicons in
the Velvet assembly, the distribution of the smaller contigs was not found to be uniform.
Instead there were regions on each of the replicons that were markedly fragmented with small to
medium (61-5000 bp) contigs arranged in a pattern of small overlaps or with gaps between
(Figure 3.2.2). Recognizing that genomic islands frequently contain smaller embedded
transposable elements and therefore many repeated elements, we overlaid the known genomic
island co-ordinates with the assembled fragments for chromosome 1. As noted earlier,
chromosome 1 contains 11 of the 16 genomic islands found in CH34. A sequential ordering of
the longest contigs corresponding to chromosome 1 revealed that only one of the genomic
islands (CMGI-9) was fully captured in a large contig. This is not surprising as this island has no
documented repetitive elements (not even terminal repeats) that would have interfered with
assembly. Since the genomic islands appeared to be linked with the prevalence of fragments, we
aligned the contigs to each of the chromosome 1 islands individually in Geneious. In general,
the larger genomic islands aligned to higher numbers of contigs (Table 3.2.3). The four largest
islands each had a minimum of 2 contigs longer than 5 kb, representing accessory genes that
were congruent without interference from mobile elements or repeated segments. However, the
termination points of these contigs serve to highlight the difficulties of obtaining complete
assemblies of even these relatively small regions (compared to the genome). As would be
expected, many of the contigs terminated at a documented insertion sequence or transposon that
was found at another location in the genome (sometimes within the same genomic island).
Tn6049 (with a length of 3461 bp) is a very promiscuous transposable element found in 12
locations in the genome, including on 5 of the 11 genomic islands, and it terminated assembly at
each of the locations where it was found. In addition, there were other genes that were present on more
than one of the genomic islands and therefore interfered with proper assembly. CMGI-2 and
CMGI-3 share several homologous gene clusters (see Table 3.2.3) and have similar conjugal
transfer genes. Two of these genes (trbB and trbF) share high sequence identity across their
length (97 and 92%, respectively) and were found to cause the termination of contigs
containing the conjugal transfer genes in both of these islands. CMGI-3 also has multiple
copies of IS1071, and in some cases this element appears to have been responsible for the
mobilization of fragments of adjacent genes, which are then also repeated within the island,
further fragmenting the assembly.
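The overlay of island co-ordinates and aligned contigs was performed in Geneious. Purely as an illustration of the counting involved, the Python sketch below tallies contigs that lie entirely within an island and bins them by size, in the manner of Table 3.2.3. The co-ordinates are invented toy values, the function names are hypothetical, and the exact bin boundaries are assumptions.

```python
def contigs_within_island(island, contig_alignments):
    """Return lengths of contigs whose aligned interval lies entirely inside the island.

    Intervals are (start, end) pairs in 1-based inclusive reference co-ordinates.
    """
    island_start, island_end = island
    return [
        end - start + 1
        for start, end in contig_alignments
        if start >= island_start and end <= island_end
    ]


def size_bins(lengths):
    """Count contigs per size class (boundaries are an assumption)."""
    bins = {"<1 kb": 0, "1-5 kb": 0, "5-10 kb": 0, ">10 kb": 0}
    for length in lengths:
        if length < 1_000:
            bins["<1 kb"] += 1
        elif length < 5_000:
            bins["1-5 kb"] += 1
        elif length < 10_000:
            bins["5-10 kb"] += 1
        else:
            bins[">10 kb"] += 1
    return bins


# Toy island and contig alignments (invented co-ordinates).
toy_island = (1_000, 30_000)
toy_contigs = [
    (500, 1_500),       # overlaps the island boundary: excluded
    (1_200, 1_900),     # 701 bp
    (2_000, 5_500),     # 3,501 bp
    (6_000, 12_500),    # 6,501 bp
    (13_000, 29_000),   # 16,001 bp
    (29_500, 31_000),   # overlaps the island boundary: excluded
]

inside = contigs_within_island(toy_island, toy_contigs)
print(inside)             # [701, 3501, 6501, 16001]
print(size_bins(inside))  # {'<1 kb': 1, '1-5 kb': 1, '5-10 kb': 1, '>10 kb': 1}
```

Excluding the contigs that straddle an island boundary mirrors the convention of counting only contigs completely covered by the island.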
CH34 is most noted for its ability to withstand heavy metals (Janssen et al. 2010) and
many of the genes conferring these abilities are contained within three genomic islands
distributed on the two plasmids pMOL30 and pMOL28. The two large islands account for
almost the full length of pMOL30 and approximately a third of the length of pMOL28. Each
island also contains “nested” islands with partial or complete mobile elements that separate
different functional modules (Van Houdt et al. 2009). An examination of the Geneious
alignments for both pMOL28 and the region of chromosome 1 containing two genomic islands
conferring such notable phenotypes as hydrogenotrophy and metabolism of aromatic
compounds revealed that these regions are highly fragmented in comparison to surrounding
regions in both the Velvet and ABySS assemblies (Figure 3.2.2).
Table 3.2.3: Genomic islands found on chromosome 1 of CH34.
Naming, sizes and content information are derived from previous characterization (Van Houdt
et al. 2009). Contig information is solely from the Velvet assembly for simplicity.
Name of Element | Size | Content Information | Contigs aligned within region (a, b) | Size Range of Aligned Contigs (b)
CMGI-1 | 109,598 bp | Tn6049; Closely related to pathogenicity island in P. aeruginosa | 3 | 1-5 kb: 2; >10 kb: 1
CMGI-2 | 101,637 bp | Tn4371 family integrase, hydrogenotrophy, metabolism of aromatic compounds | 12 | <1 kb: 6; 1-5 kb: 2; 5-10 kb: 1; >10 kb: 3
CMGI-3 | 97,042 bp | Tn4371 family integrase, carbon fixation, hydrogenotrophy | 16 | <1 kb: 7; 1-5 kb: 6; 5-10 kb: 1; >10 kb: 2
CMGI-4 | 56,529 bp | Tn4371 family integrase, Tn6048 | 1 | >10 kb: 1
CMGI-5 | 25,423 bp | 63 bp direct repeats | 3 | 1-5 kb: 3
CMGI-6 | 17,638 bp | Tn6049 | 3 | 1-5 kb: 3
CMGI-7 | 15,362 bp | Tn6049 | 1 | 1-5 kb: 1
CMGI-8 | 12,257 bp | Tn6049, IS1087 | 3 | 1-5 kb: 3
CMGI-9 | 20,648 bp | Integrase, no direct repeats | 0 | Contained within large contig
CMGI-10 | 20,947 bp | 3 Insertion Sequences | 5 | 1-5 kb: 4; 5-10 kb: 1
CMGI-11 | 10,824 bp | Flanked by ISCme7 | 3 | <1 kb: 2; 5-10 kb: 1
a these numbers are an approximation because the alignments were performed at 98% and therefore some of the small contigs align to multiple places where imperfect repeats occur
b numbers are only for contigs completely covered by genomic island; each island generally aligns to the ends of two larger contigs that are not included in these numbers
3.2.4 Investigating the relative contribution of multiple replicons or presence of documented mobility genes by comparison with other strains
In addition to our in-depth analysis of CH34, we simulated datasets for an additional 4
genomes that varied in overall genome size, G+C content, number of replicons and predicted
mobile element content. The metrics for all 5 genomes assembled using simulated unpaired
long and short read datasets are summarized in Table 3.2.4. The Velvet assembly data included
in Table 3.2.4 is based on alignment to the reference genome at 98% nucleotide identity with no
gaps (see methods), and there were no significant errors in the contigs that would limit their
ability to align with these restrictions. As was expected, the large genomes consistently
produced a larger number of contigs after assembly, and the assembly quality in terms of both
N50 value and maximum contig size relative to largest chromosome decreased with increasing
genome size. To assess the causes of fragmentation in large genomes, we specifically
included strains that varied in both the number of replicons and the number of genes
annotated as related to horizontal gene transfer by the JCVI Comprehensive Microbial Resource
(JCVI-CMR; http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi). Based on overall
genome size, number of replicons and k-mer repetitiveness, it was expected that CH34 would
have the poorest (most fragmented) assembly. However, Caulobacter sp. K31 fared the worst by
each of the common metrics listed in Table 3.2.4. Interestingly, the best N50 and maximum
contig sizes were obtained for Rhodobacter sphaeroides 2.4.1 even though this genome
is composed of 7 different replicons (Figure 3.2.3). Furthermore, although CH34 was the second
poorest assembly in terms of number of contigs and N50 value, Bordetella bronchiseptica RB50
had a smaller maximum contig size. Even allowing for its large overall genome size, this was
unexpected given the nature of the genome. This genome had been chosen specifically because only
0.37% of its gene content has been attributed to mobile functions (plasmids, phages and
transposons) by the JCVI-CMR (http://cmr.jcvi.org/cgi-bin/CMR) and because it contains only one
replicon. It also had the lowest percentage of repetitive k-mers by our calculations (see
methods) and should theoretically assemble more easily.
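For readers less familiar with the metrics compared here, N50 is the contig length at which contigs of that length or longer account for at least half of the assembled bases. The following is a minimal sketch of that calculation, and of N50 expressed as a percentage of the longest replicon. The function names and the toy contig lengths are illustrative assumptions and are not taken from the assemblies described in this chapter.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def n50_percent_of_replicon(contig_lengths, longest_replicon_bp):
    """Express N50 as a percentage of the longest replicon in the genome."""
    return 100.0 * n50(contig_lengths) / longest_replicon_bp


# Hypothetical contig lengths (bp), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
print(n50_percent_of_replicon(toy_contigs, 1000))
```

Normalizing N50 by the longest replicon, as done in Table 3.2.4, allows genomes of different sizes to be compared on the same scale.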
To compare these results to the findings described for the well-annotated genome CH34
in the absence of characterized genomic islands, these strains were evaluated according to
genomic islands predicted by programs contained within IslandViewer (Langille and Brinkman
2009). Although the precise number or size of the individual islands has not been verified (and
is overestimated in CH34), the total number of predicted genomic islands significantly
correlates with the maximum contig size, N50 value and N50 as a percentage of longest replicon.
As had been seen in CH34, the most fragmented portion of the Bordetella bronchiseptica genome
corresponded to a 22 kb segment of repeated gene content shared between two predicted
genomic islands (99% nucleotide identity), and likewise the Caulobacter sp. K31 assembly also
had a large (10.5 kb) segment that was perfectly repeated between two predicted genomic
islands.
Table 3.2.4: Velvet assembly metrics of the 5 genomes compared.
Unique k-mer percentage was calculated as described in the methods. Mobile gene numbers
were obtained from the JCVI-CMR (http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi).
Coverage calculation is defined as total number of reference bases covered by unique contigs at
98% nucleotide identity without gaps or repeating of individual contigs. SIGI and DIMOB are
the individual programs that IslandViewer (Langille and Brinkman, 2009) utilizes to predict
genomic islands.
Metric | Caulobacter sp. K31 | Cupriavidus metallidurans CH34 | Bordetella bronchiseptica RB50 | Gramella forsetii KT0803 | Rhodobacter sphaeroides 2.4.1
Genome size (Mb) | 5.89 | 6.91 | 5.34 | 3.8 | 4.6
No. replicons | 3 | 4 | 1 | 1 | 7
GC content (%) | 66.3 | 62 | 68.1 | 36.6 | 68.8
% Unique k-mers | 98.55 | 98.2 | 99.5 | 99.09 | 99.18
Contigs | 151 | 139 | 104 | 90 | 99
N50 (bp) | 155,182 | 159,531 | 261,616 | 564,738 | 740,045
Longest contig (bp) | 495,932 | 674,226 | 550,697 | 899,275 | 1,010,805
N50 vs. longest replicon (%) | 2.83 | 2.91 | 4.78 | 10.31 | 13.51
Mobile genes | 162 | 164 | 19 | 49 | 103
Mobile genes (% of genome) | 2.96 | 2.65 | 0.37 | 1.36 | 2.46
Islands by SIGI-HMM only | 13 | 3 | 9 | 2 | 6
Islands by DIMOB only | 3 | 12 | 1 | 7 | 1
Islands predicted by both | 9 | 5 | 5 | 1 | 2
Total no. of islands | 25 | 20 | 15 | 10 | 9
Coverage (%) | 98.8 | 98.6 | 98.9 | 99 | 99.7
Figure 3.2.3: Relationship between N50 (as percentage of the largest replicon in the
genome) and three parameters thought to influence assembly quality.
Top: genome size, r = -0.81 (ns); Middle: percent unique k-mers, r = 0.54 (ns); and Bottom:
number of replicons, r = 0.42 (ns).
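As an illustration of how the coefficients in Figure 3.2.3 relate to Table 3.2.4, the sketch below computes Pearson's r between N50 (as a percentage of the longest replicon) and each of the three genome parameters, using the values in that table. The helper function and variable names are illustrative assumptions. This is a plain re-calculation from the tabulated values rather than the original analysis, and it gives coefficients that round to those quoted in the caption.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Values from Table 3.2.4, in column order:
# K31, CH34, RB50, KT0803, R. sphaeroides 2.4.1
n50_pct = [2.83, 2.91, 4.78, 10.31, 13.51]
genome_size_mb = [5.89, 6.91, 5.34, 3.8, 4.6]
unique_kmers_pct = [98.55, 98.2, 99.5, 99.09, 99.18]
replicons = [3, 4, 1, 1, 7]

for name, values in [
    ("Genome size (Mb)", genome_size_mb),
    ("% unique k-mers", unique_kmers_pct),
    ("No. replicons", replicons),
]:
    print(f"{name}: r = {pearson_r(values, n50_pct):.2f}")
```

With only five genomes, even a coefficient as large as -0.81 does not reach significance, which is why all three relationships are marked (ns).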
Figure 3.2.4: Relationship between three measures of assembly quality (maximum contig
length, N50 and N50 as percent of longest replicon) and number of genomic islands as
predicted by IslandViewer.
The Pearson correlations between N50 or N50 as percent of longest replicon and number of
predicted islands are statistically significant (p<0.05) but are also clearly curvilinear.
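The significance of a Pearson correlation on five genomes can be checked by converting r to a t statistic with n - 2 = 3 degrees of freedom. The sketch below does this for the three assembly measures against the total number of predicted islands in Table 3.2.4. The type of test used for the figure is not stated, so the two-tailed critical value (3.182 for alpha = 0.05 and 3 degrees of freedom, from a standard t table) is an assumption, as are the function and variable names.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Assumed threshold: two-tailed, alpha = 0.05, 3 degrees of freedom.
T_CRITICAL_DF3 = 3.182

# Values from Table 3.2.4 (K31, CH34, RB50, KT0803, R. sphaeroides 2.4.1).
islands = [25, 20, 15, 10, 9]
measures = {
    "Longest contig (bp)": [495932, 674226, 550697, 899275, 1010805],
    "N50 (bp)": [155182, 159531, 261616, 564738, 740045],
    "N50 vs. longest replicon (%)": [2.83, 2.91, 4.78, 10.31, 13.51],
}

results = {}
for name, values in measures.items():
    r = pearson_r(islands, values)
    t = t_statistic(r, len(islands))
    results[name] = (r, t, abs(t) > T_CRITICAL_DF3)
    print(f"{name}: r = {r:.3f}, t = {t:.2f}, exceeds threshold: {results[name][2]}")
```

Under this assumption both N50-based measures exceed the threshold. Because the relationships are curvilinear, a linear coefficient should in any case be read as a rough summary of the trend.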
3.2.5 Fragmentation Evident in Real Data
The benefit of using simulated ideal data for this type of analysis is that patterns can be
detected that may otherwise be masked due to the variations in sequencing coverage,
introduction of sequencing specific errors and high number of contigs produced by real
sequencing projects. In order to take our findings and compare them to real sequencing
scenarios, we examined the assembly data from Rhodobacter sphaeroides 2.4.1. This strain was
utilized in the Genome Assembly Gold-Standard Evaluations (GAGE) study that compared the
assembly efficiency of 7 different open access software programs (Salzberg et al. 2012). The
assembled contigs from that study are freely available. We downloaded the contigs from the
GAGE Velvet assembly of R. sphaeroides 2.4.1 and aligned them to the finished genome in the
same way that we compared CH34 contigs generated from simulated sequence to its final
genome. When the R. sphaeroides 2.4.1 contigs from the GAGE assembly were mapped to its
finished chromosome 1 in Geneious, only 454 fragments (of a total of 1192 contigs and
scaffolds) could be aligned at 98% identity - resulting in only 65% coverage of the chromosome.
This indicated that the assembled contigs contained internal errors, so we allowed gaps of up to
500 bp in the Geneious alignment. This improved the coverage of chromosome 1 from 65% to
96.3%. Regardless of whether gaps were allowed, the density of small contigs
was greatly increased in regions predicted to be genomic islands (Figure 3.2.5). For the
alignment without gaps, only one of the predicted genomic islands was assembled, whereas 4
out of 9 of the islands were assembled when gaps were allowed. The two islands predicted by
both programs in IslandViewer had a large number of fragments for their relative size (13
fragments for 12.5kb and 12 fragments for 7.5 kb).
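The coverage figures quoted above are the fraction of reference bases covered by at least one aligned contig. A minimal sketch of that calculation is given below, assuming the alignment has already been reduced to (start, end) intervals on the reference. The function name, the interval convention (0-based, end-exclusive) and the example intervals are illustrative assumptions. The sketch does not reproduce the Geneious alignment itself or its 98% identity filter.

```python
def reference_coverage(intervals, reference_length):
    """Percentage of reference positions covered by at least one interval.

    intervals: iterable of (start, end) pairs, 0-based and end-exclusive.
    Overlapping or adjacent intervals are merged so no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A gap separates this interval from the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / reference_length


# Hypothetical aligned contigs on a 1,000 bp reference.
example = [(0, 100), (50, 150), (300, 400)]
print(reference_coverage(example, 1000))
```

Merging overlapping intervals before summing prevents small contigs that align to the same repeat from inflating the coverage estimate.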
Figure 3.2.5: Geneious alignment of real contigs obtained from the GAGE assembly data
(Salzberg et al. 2012).
Top alignment is at 98% identity with no gaps allowed, bottom alignment is 98% identity with
up to 500 bp gaps allowed. The region shown includes 3 putative genomic islands that are all
clearly visible by the increased occurrence of small contigs in these regions. These islands
occur at 216-228 kb, 550-557 kb and 632-648 kb and are roughly indicated with curved
brackets. Since these are only predicted islands, the precise borders may not be accurate and
individual islands could be components of a larger combined island.
3.3 Discussion
The Genomes OnLine Database (v. 3.0; http://genomesonline.org accessed 19th March
2012) lists 3532 completed genomes of which 1045 are listed as permanent draft assemblies.
The status of permanent draft implies that finishing experiments to verify or extend the existing
contigs are not expected to be performed, and the draft status is likely to be related to repeated
elements that cannot be resolved by computerized means. Contrary to the early view that many
of these smaller repeated elements represent “junk DNA” (Mayer et al. 2010), microsatellites in
the form of tandem repeats and transposable elements such as insertion sequences have both
been found to regulate transcription of adjacent genes (Versalovic et al. 1991; Mahillon and
Chandler, 1998). These repetitive elements also function as important components of genome
plasticity by mediating DNA re-arrangements including chromosomal deletions, duplications
and inversions (Lupski and Weinstock, 1992; Touchon and Rocha 2007). Larger transposable
elements such as transposons and integrative conjugative elements (ICEs) can also be found in
multiple copies within a genome, particularly if there are multiple large replicons as is
commonly found in certain bacterial families such as the Burkholderiaceae (Janssen et al. 2010;
Amadou et al. 2008; Tuanyok et al. 2008). Reaching the stage of a draft genome is sufficient if
the goal is to discover interesting and novel genes or operons that do not contain repeated
elements, with the consequence that many genome projects are being published at the draft
assembly stage and then terminated (Nagarajan et al. 2010). These draft assemblies can have a
number of errors including collapsed repeats, rearrangements and inversions (Phillippy et al.
2008; Salzberg et al. 2012; Narzisi and Mishra, 2011) as well as having an unknown fraction of
the genome unaccounted for. In this study, we used simulated NGS data to confirm that
currently available software programs are capable of accurately recognizing repeated segments
in the DNA and that these repeats would be the primary cause of contig termination in the
assembly. Having established the causes of termination, we wanted to better understand the
nature of the fragmented regions of draft assemblies since the relative importance of these
unassembled regions has to our knowledge never been addressed.
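To illustrate why repeats terminate contigs in a de Bruijn graph assembler such as Velvet, the sketch below builds a toy graph whose nodes are (k-1)-mers and reports every node with more than one distinct successor. The sequence, the value of k and the function name are illustrative assumptions. Real assemblers also handle reverse complements, sequencing errors and coverage, none of which is modelled here.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the (k-1)-mer nodes that have more than one distinct successor.

    Each k-mer in the sequence contributes one edge, from its first k-1 bases
    to its last k-1 bases. A node with several successors is a point where
    the assembler cannot tell which path continues the contig.
    """
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return {node: sorted(nexts) for node, nexts in successors.items() if len(nexts) > 1}


# A 7 bp repeat placed in two different sequence contexts (toy example).
repeat = "GATTACA"
with_repeat = "CC" + repeat + "TG" + repeat + "AC"
without_repeat = "CC" + repeat + "TG"

print(branching_nodes(with_repeat, 4))
print(branching_nodes(without_repeat, 4))
```

In the repeated sequence, the node at the end of the repeat leads to two different flanking sequences, so an unambiguous contig must stop there. The sequence without the second copy has no such branch.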
An examination of the genes adjacent to the termination points for the longest contigs
(Table 3.2.2) clearly confirmed that the assemblies were terminated due to the presence of
repeated elements. These repeated elements included known mobile elements and genes
containing internal repeat structures (as expected), but also genes that were repeated in more
than one genomic location (commonly on two separate replicons within this genome). This type
of repetition (within or between replicons) is important in the evolution of novel traits, since one
copy of the gene is free to evolve without risking functional impairment to the host cell because
the other copy is preserved (the duplicate gene hypothesis; Ohno, 1970). Some transposable
elements have been found to specifically target transmissible plasmids and the subsequent
plasmid-chromosome exchanges facilitate assembly of genes into modules (Siguier et al. 2006),
with the result that individual genomes will commonly have identical transposable elements and
accessory genes distributed on both the main chromosome and some or all of the associated
plasmids (as was seen here). Likewise, the findings from both B. bronchiseptica RB50 and
Caulobacter sp. K31 illustrated that predicted genomic islands within the same chromosome can
carry repetitive gene content which can interfere with assembly in the absence of repeats across
different replicons. Neither of these two large repeated segments contained any insertion
sequences or transposons, but were composed almost exclusively of hypothetical proteins. The
hypothetical nature of these genes prevents an estimation of the causes of gene duplication in
these strains, although one copy of the 22 kb portion of B. bronchiseptica RB50 is contained
within an intact phage documented by the BacMap Genome Atlas website (Stothard et al. 2005).
The second copy in this strain and both repeated segments in Caulobacter sp. K31 were not part
of any documented phage (intact or otherwise) but their presence in two separate predicted
islands within the chromosome could facilitate genomic island evolution.
It was expected that the number of genomic islands would correlate with the
percentage of genes annotated as involved in mobility, but this was not found to be the case.
Rhodobacter sphaeroides 2.4.1 had 2.46% of the genes attributed to mobility functions, yet had
a smaller number of islands than other strains with this percentage of mobile gene content and a
more successful assembly in terms of N50 and maximum contig size. Given the high number of
plasmids found in this strain it is reasonable that this high percentage of mobility functions
relates directly to plasmid genes. These would not be expected to interfere with assembly since
incompatibility prevents plasmids with highly similar transfer genes from co-existing within
cells. It was interesting to note that although both of the mobility related metrics (% mobile
genes and predicted genomic islands) correctly predicted Caulobacter sp. K31 to be the most
difficult to assemble, the number of genomic islands was a better indicator of assembly
complexity for Bordetella bronchiseptica RB50 than mobile gene content. In addition, the most
logical genome characteristic to interfere with assembly would be repetitiveness (measured as %
unique k-mers) but this also was not an invariant predictor of the ease of assembly.
The validity of this work rests on the assumption that the simulated reads generated from
the genomic data could be accurately assembled. There were no errors evident in any of the
alignments performed from the Velvet unpaired simulated data when using a k-mer length of 57,
although there had been a number of single base pair errors introduced when using the standard
settings and there were substantial SNPs introduced in the ABySS contigs (data not shown).
This illustrates the high level of accuracy that Velvet can achieve with non-repetitive elements,
as well as the high quality repeat recognition of this particular software program. In examining
the distribution of the long reads from the Velvet assembly against chromosome 1 of CH34, the
unassembled fragments tended to group together and these regions showed a clear association
with the prevalence of repeated elements in the genomic islands. It is important to recognize that
in an actual sequencing project the reconstruction of the genome would be further complicated
by the presence of sequencing errors and variations in the level of coverage due to decreased
amplification robustness, the latter of which may be more prominent in repetitive stretches due
to the secondary structure formed by palindromic repeats (Jin, 2010). In comparing our
simulated assemblies to the data available from the GAGE Velvet assembly of R. sphaeroides
2.4.1, it was clear that our correlation between the distribution of small contigs and the location
of genomic islands was still valid when using real data.
Draft genome assemblies may lead us to unintentionally disregard the most important
parts of prokaryotic genomes. Although eukaryotic genomes are more repeat rich than
prokaryotic genomes, the reasons for this repetitiveness are vastly different between the
kingdoms. In prokaryotic organisms, horizontal gene transfer is a prominent means of acquiring
novel genes and rearrangements facilitated by mobile elements increase diversity. Insertion
sequences can spread to high prevalence within a genome, and their activity may be specifically
increased in response to changing environmental conditions. Since their behavior is strongly
linked to adaptation, these elements are of great interest (Dobrindt et al. 2004). Larger mobile
elements are primarily assimilations of smaller elements (Toussaint and Merlin, 2002) or serve
as recombination sites for incoming genetic information (Coleman et al. 2006; Penn et al. 2009),
with the result that genomic islands and large transposable elements are inherently resistant to
computerized assembly. These regions are full of complete or partial mobile genetic elements
and are therefore problematic for genome assembly, but ironically they are the most likely to
carry the genes responsible for any novel traits under investigation, particularly if they were
acquired horizontally. Assembly software alone is capable of reconstructing genes and
complete operons, provided they are not interrupted by repetitive sequences or present in more
than one copy within the genome (e.g. on separate replicons). In one study it was determined that
the majority of genes can be reconstructed from even very short reads (25 bp); however, genes
containing repeats (primarily intergenic repeats or mobile elements such as transposons, IS
elements and prophages) account for the vast majority of the unassembled genome (Kingsford et
al. 2010). In our study, 40 of the 75 contigs corresponding to chromosome 1 of CH34 were
fully contained within genomic islands (Table 3.2.3) and an additional 16 contigs were found to
overlap with the edge of a genomic island. Many of the functional genes contained within these
genomic islands were assembled indicating that examining the mid-range contigs (5-50 kb) of a
draft assembly may be more informative in terms of recently acquired content. The genomic
context of these newly acquired genes is lost when the data is left as a draft assembly, and the
utility of the public databases is decreased by the introduction of incomplete or incorrect data.
As an example, the largest genomic island in CH34 (CMGI-1) is almost identical to a
pathogenicity island (PAGI-2C) found in Pseudomonas aeruginosa clone C, indicating recent
transfer between industrially contaminated sites and nosocomial pathogens (Van Houdt et al.
2009). Based on our Velvet assembly simulation, a draft assembly of CH34 would have left this
island in pieces and evidence of this important transfer event would remain hidden. In our own
laboratory, we have discovered a Recombinase in Trio (RIT) element adjacent to the
chlorobenzoate degrading genes of Burkholderia sp. R172 (Accession number AY168634.1)
that is homologous to one of the RIT elements found in CMGI-2 of CH34 (Van Houdt et al.
2009). This association was determined through Sanger sequencing, and was not apparent from
next-generation sequencing data alone, despite reads from both Solexa (Illumina) and
Roche 454 sequencing (Jin, 2010). Other sequenced strains available in the GenBank database
reveal that this is not an isolated event. For example, there are two other homologous RIT
elements found in the draft assembly of the PAH degrading strain Burkholderia sp. Ch1-1.
Prior to additional work that has recently improved the quality of this assembly, the contigs
containing each of the RIT elements in this strain terminated at the edges of these elements,
revealing absolutely no genomic context.
The role of genomic islands in bacterial adaptation is becoming increasingly clear, yet
many of the genes contained within these islands have not been characterized (Penn et al. 2009).
Indeed, a defining feature of genomic islands is a high abundance of conserved hypothetical
proteins (Van Houdt et al. 2009). Understanding the possible roles of the multitude of currently
hypothetical genes will require intensive experiments, and the development of these experiments
may be hampered by the incomplete information included in draft assemblies (Phillippy et al.
2008). With decreasing sequencing costs, initial draft genomes are going to increase in
prevalence, inundating the public databases with incomplete or fragmented genome projects
that decrease the overall utility of these databases for other analyses, particularly those relating
to horizontal gene transfer. This issue has been addressed in a number of publications, and there
are validation tools available that can aid in distinguishing mis-assemblies (Phillippy et al.
2008). We submit that many of the genes responsible for prokaryotic adaptation will be present
in these highly recombinational or potentially mobile regions that are inherently resistant to
automated assembly, and that extensive finishing experiments, which not only close the
created contigs but also correct the introduced errors, should therefore be an important
focus of any sequencing project. Furthermore, the very elements disrupting the automated
assembly have a wealth of information to provide regarding the evolution and transferability of
these genes, and also may have a role in the regulation of these important genomic regions. As
technological improvements become available to ease the assembly of bacterial genomes,
recognizing the high relative importance of these regions will be key to creating the incentive
needed to pursue novel ways of finishing genomes and to improve our knowledge of bacterial
adaptation.
3.4 Acknowledgements
Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D Scholarship to
NR is gratefully acknowledged. The funding agency had no role in this study.
3.5 References
Sanger, F., S. Nicklen and A.R. Coulson. 1977. DNA sequencing with chain-terminating inhibitors, Proc. Natl. Acad. Sci. USA 74:5463-5467.
Miller, J.R., S. Koren and G. Sutton. 2010. Assembly algorithms for next-generation sequencing data, Genomics 95:315-327.
Wetzel, J., C. Kingsford and M. Pop. 2011. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies, BMC Bioinformatics 12:95. http://www.biomedcentral.com/1471-2105/12/95
MacLean, D., J.D.G. Jones and D.J. Studholme. 2009. Application of 'next-generation' sequencing technologies to microbial genetics, Nat. Rev. Microbiol. 7:287-296.
Medini, D., C. Donati, H. Tettelin, V. Masignani and R. Rappuoli. 2005. The Microbial Pan-Genome, Curr. Opin. Genet. Dev. 15:589-594.
Pop, M. 2009. Genome assembly reborn: recent computational challenges, Briefings Bioinf. 10(4):354-366.
Chevreux, B., T. Wetter and S. Suhai. 1999. Genome sequence assembly using trace signals and additional sequence information, Comput. Sci. Biol.: Proc. German Conference on Bioinformatics GCB'99 GCB:45–56.
Phillippy, A.M., M.C. Schatz and M. Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly, Genome Biol. 9:R55 (doi:10.1186/gb-2008-9-3-r55)
Nagarajan, N.C., M.D. Cook, H. G. Bonaventura, A. Richards, K.A. Bishop-Lilly, R. DeSalle, T.D. Read and M. Pop. 2010. Finishing genomes with limited resources: lessons from an ensemble of microbial genomes, BMC Genomics 11:242.
Ellegren, H. 2004. Microsatellites: Simple Sequences with Complex Evolution, Nat. Rev. Genet. 5:435-445.
Mayer, C., F. Leese and R. Tollrian. 2010. Genome-wide analysis of tandem repeats in Daphnia pulex – a comparative approach. BMC Genomics 11:277 (http://www.biomedcentral.com/1471-2164/11/277)
Lupski, J.R. and G.M. Weinstock. 1992. Short, Interspersed Repetitive DNA Sequences in Prokaryotic Genomes, J. Bact. 174(14):4525-4529.
Zhang, W., J. Chen, Y. Yang, Y. Tang, J. Shang and B. Shen. 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies, PLoS ONE 6(3): e17915. doi:10.1371/journal.pone.0017915
Van Houdt, R., S. Monchy, N. Leys and M. Mergeay. 2009. New mobile elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria, Antonie van Leeuwenhoek 96:205-226.
Janssen, P.J., R. Van Houdt, H. Moors, P. Monsieurs, N. Morin, A. Michaux, M.A. Benotmane, N. Leys, T. Vallaeys, A. Lapidus, S. Monchy, C. Medigue, S. Taghavi, S. McCorkle, J. Dunn, D. van der Lelie and M. Mergeay. 2010. The Complete Genome Sequence of Cupriavidus metallidurans Strain CH34, a Master Survivalist in Harsh and Anthropogenic Environments, PLoS ONE 5(5):e10433. doi:10.1371/journal.pone.0010433.
Langille, M.G.I. and F.S.L. Brinkman. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands, Bioinformatics. Jan. 16 (EPub). PMID: 19151094
Pevzner, P.A. and H. Tang. 2001. Fragment assembly with double-barreled data, Bioinformatics 17:S225–S233.
Kingsford, C., M.C. Schatz and M. Pop. 2010. Assembly complexity of prokaryotic genomes using short reads, BMC Bioinformatics 11:21 (http://www.biomedcentral.com/1471-2105/11/21)
Zerbino, D.R., G.K. McEwen, E.H. Margulies and E. Birney. 2009. Pebble and Rock Band: Heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler, PLoS ONE 4(12):e8407. doi:10.1371/journal.pone.0008407
Simpson, J.T., K. Wong, S.D. Jackman, J.E. Schein, S.J.M. Jones and I. Birol. 2009. ABySS: A parallel assembler for short read sequence data structures, Genome Research 19:1117-1123.
Monchy, S., M.A. Benotmane, P. Janssen, T. Vallaeys, S. Taghavi, D. van der Lelie and M. Mergeay. 2007. Plasmids pMOL28 and pMOL30 of Cupriavidus metallidurans are specialized in the maximal viable response to heavy metals, J. Bact. 189(20):7417-7425.
Drummond, A.J., B. Ashton, S. Buxton, M. Cheung, A. Cooper, C. Duran, M. Field, J. Heled, M. Kearse, S. Markowitz, R. Moir, S. Stones-Havas, S. Sturrock, T. Thierer and A. Wilson. 2010. Geneious v5.5, Available from http://www.geneious.com
Salzberg, S.L., A.M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T.J. Treangen, M.C. Schatz, A.L. Delcher, M. Roberts, G. Marçais, M. Pop and J.A. Yorke. 2012. GAGE: A critical evaluation of genome assemblies and assembly algorithms, Genome Research 22:557-567.
Versalovic, J., T. Koeuth and J.R. Lupski. 1991. Distribution of Repetitive DNA Sequences in Eubacteria and Application to Fingerprinting of Bacterial Genomes. Nucleic Acids Res. 19(24):6823-6831.
Mahillon, J. and M. Chandler. 1998. Insertion sequences, Microbiol. Mol. Biol. Rev. 62:725-774.
Touchon, M. and E.P.C. Rocha. 2007. Causes of Insertion Sequences Abundance in Prokaryotic Genomes, Mol. Biol. Evol. 24(4):969-981.
Amadou, C., G. Pascal, S. Mangenot, M. Glew, C. Bontemps, D. Capela, S. Carrere, S. Cruveiller, C. Dossat, A. Lajus, M. Marchetti, V. Poinsot, Z. Rouy, B. Servin, M. Saad, C. Schenowitz, V. Barbe, J. Batut, C. Medigue and C. Masson-Boivin. 2008. Genome Sequence of the β-rhizobium Cupriavidus taiwanensis and comparative genomics of rhizobia, Genome Res. 18:1472-1483.
Tuanyok, A., B.R. Leadem, R.K. Auerbach, S.M. Beckstrom-Sternberg, J.S. Beckstrom-Sternberg, M. Mayo, V. Wuthiekanun, T.S. Brettin, W.C. Nierman, S.J. Peacock, B.J. Currie, D.M. Wagner and P. Keim. 2008. Genomic Islands from Five Strains of Burkholderia pseudomallei. BMC Genomics 9:566. doi:10.1186/1471-2164-9-566
Narzisi, G. and B. Mishra. 2011. Comparing De Novo Genome Assembly: The Long and Short of It, PLoS ONE 6(4):e19175. doi:10.1371/journal.pone.0019175
Ohno, S. 1970. Evolution by gene duplication, Springer-Verlag, New York.
Siguier, P., J. Filee and M. Chandler. 2006. Insertion sequences in prokaryotic genomes. Curr. Opin. Microbiol. 9:526-531.
Stothard, P., G. Van Domselaar, S. Shrivastava, A. Guo, B. O'Neill, J. Cruz, M. Ellison and D.S. Wishart. 2005. BacMap: an interactive picture atlas of annotated bacterial genomes, Nucleic Acids Research 33:D317-D320.
Jin, S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172, M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.
Dobrindt, U., B. Hochhut, U. Hentschel and J. Hacker. 2004. Genomic islands in pathogenic and environmental microorganisms, Nat. Rev. Microbiol. 2:414–424.
Toussaint, A. and C. Merlin. 2002. Mobile Elements as a Combination of Functional Modules, Plasmid 47:26-35.
Coleman, M.L., M.B. Sullivan, A.C. Martiny, C. Steglich, K. Barry, E.F. DeLong and S.W. Chisholm. 2006. Genomic Islands and the Ecology and Evolution of Prochlorococcus. Science 311:1768 (doi: 10.1126/science.1122050)
Penn, K., C. Jenkins, M. Mett, D.W. Udwary, E.A. Gontang, R.P. McGlinchey, B. Foster, A. Lapidus, S. Podell, E.E. Allen, B.S. Moore and P.R. Jensen. 2009. Genomic islands link secondary metabolism to functional adaptation in marine Actinobacteria. ISME J. 3:1193-1203.
Zerbino, D.R. and E. Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res. 18:821-829.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers and D.J. Lipman. 1990. Basic local alignment search tool, J. Mol. Biol. 215:403-410.
Chapter 4 Phylogeny and Organization of Recombinase in Trio (RIT) Elements
Acknowledgements and Contributions: This chapter is reproduced as published in Plasmid,
with some modifications (Ricker, N., Qian, H., & Fulthorpe, R. R. 2013. Phylogeny and
organization of recombinase in trio (RIT) elements. Plasmid, 70(2), 226-239
doi:10.1016/j.plasmid.2013.04.003).
4 Introduction
A mobile genetic element (MGE) is defined as any discrete segment of DNA that can
move within or between genomes (Frost et al., 2005) and is inclusive of plasmids, phages,
integrative conjugative elements (ICEs), and the myriad of smaller elements capable of inter- or
intra-cellular movement (classified as transposable elements; see Roberts et al., 2008). The
mobility of some of these elements occurs through the action of site-specific recombinases
(SSRs), which are divided into two classes defined by an absolutely conserved residue integral
to the active site (tyrosine or serine). Tyrosine recombinases (TBSSRs, often just referred to as
integrases) are extremely diverse, sharing only 3 absolutely conserved residues among all
members described to date (Nunes-Düby et al., 1998); however, there are 24 sub-families
described in the NCBI conserved domain database (Marchler-Bauer et al., 2005). A recent
in-depth analysis of TBSSRs has instead divided the known representatives into 56 families of 4
or more elements which were found to correlate with type of mobile genetic element (plasmid,
phage or prophage) in 87% of the families (Van Houdt et al., 2012). This analysis suggests that
the functional roles of these elements may be different between the families and may be directly
related to the nature of the mobile elements they are associated with.
Recombinase in Trio (RIT) elements were first defined in 2009 in Cupriavidus
metallidurans CH34 (Van Houdt et al., 2009). The original description noted the common
occurrence of conserved elements comprised of three TBSSRs with overlapping open reading
frames. The three tyrosine recombinases in these elements were all of similar size, generally
with the largest enzyme first and the smallest enzyme in the middle. These elements were
postulated as being independently mobile for three reasons: the diversity of organisms found to
be harboring homologous elements, specific gene interruptions implying targeted integration,
and the presence of highly similar elements in more than one location in the same genome, as is
often seen in transposons. After discovering a RIT element in a chlorobenzoate degrading strain
in our own lab, we decided to further investigate the distribution of these elements in currently
available genomes in order to characterize their associations and potential for mobility.
4.1 Methods
We used the NCBI databases and BLAST analysis tools (Altschul et al., 1997) to obtain
progressively less homologous sequences to the two original RIT elements found in C.
metallidurans CH34 (Van Houdt et al., 2009). All similarity matches that still conformed to a
three adjacent recombinase format were utilized for additional searches. The three
recombinases from each of the intact elements were analyzed through BlastP comparison to the
Conserved Domain Database (CDD) (Marchler-Bauer et al., 2005) and the highest scoring
matches were consistently to the pAE1, SG4 and SG5 sub-families of tyrosine recombinases,
respectively (in order of transcription). Therefore, all members of these sub-families from the
Conserved Domain Database were investigated for inclusion as RIT elements. A random
sample of enzymes from other subfamilies was also investigated in order to determine the
ubiquity of the triad arrangement. Clusters were organized and key features identified
on the basis of BLAST homology. Automated multiple alignments were performed using
the Muscle alignment program (Edgar, 2004) within Geneious (Drummond et al., 2011).
Neighbour-joining trees were also prepared in Geneious, using Jukes-Cantor models and
bootstrap re-sampling with 100 replicates. For nucleotide comparisons a 70% support threshold
was used (no outgroup for full RIT element trees; delta-Proteobacteria outgroup for 16S since
there was only one representative from this class). Amino acid comparisons were prepared
using an 80% support threshold, using a RIT element from Acidithiobacillus as the outgroup to
anchor the trees.
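As a rough illustration of the distance correction behind the Jukes-Cantor neighbour-joining trees described above, the sketch below computes the Jukes-Cantor distance between two aligned nucleotide sequences. It is a minimal stand-alone example and not the Geneious implementation; the function name `jukes_cantor_distance` and the handling of gaps (skipping any column with a non-ACGT character) are assumptions made for illustration.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor (JC69) distance between two aligned nucleotide sequences.

    Columns with a gap or ambiguous base in either sequence are skipped
    (an assumption of this sketch, not a statement about Geneious).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = mismatches / compared  # observed proportion of differing sites
    if p >= 0.75:
        return math.inf  # substitutions are saturated under JC69
    # JC69 correction: d = -(3/4) * ln(1 - (4/3) * p)
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

The corrected distance is always at least as large as the raw proportion of differing sites, because it accounts for multiple substitutions at the same position.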
4.2 Results and Discussion
4.2.1 Abundance and Occurrence in Database
Through our homology searches of the NCBI database, we were able to find 148
sequences containing three adjacent tyrosine recombinases that we classified as putative RIT
elements. These elements were separated into groups based on homology to the third
recombinase (see section 4.2.3), and the information for these groups is listed in Supplemental
Table S3. These putative RIT elements were obtained from 63 different genera across 7 phyla
of bacteria; given the diversity of the elements found, this is not expected to be an exhaustive
list. As summarized in Table 4.2.1, the Proteobacteria accounted for the majority of the
strains (25, 17, 7 and 1 strains from the alpha through delta classes, respectively, representing
59.5% (50/84) of the total strains). This was a significant divergence from the expected
representation both for Proteobacteria in general (which represent 42% of the genomes in the
NCBI database) as well as for the alpha-, beta- and gamma-Proteobacteria individually. The
gamma-Proteobacteria are the most abundant in the database; however, both alpha- and beta-
Proteobacteria had higher representation in the strains harbouring RIT elements (Figure 4.2.1).
It is possible that this is an artifact of beginning the homology search with a beta-
Proteobacteria representative, but this would not fully account for the abundance of alpha-
Proteobacteria found. It is likely, however, that the majority of these RIT elements are connected
by plasmid distribution and that the small number of isolated elements from particular phyla
represents rare transfer events. This is supported by the fact that searches initiated from the
gamma-Proteobacteria and other poorly represented phyla consistently returned results from the
alpha- or beta-Proteobacteria representatives. Other RIT elements may be more broadly
distributed among gamma-Proteobacteria or other phyla but went undetected because they
were not sufficiently homologous to the RIT elements found to date.
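The over-representation of Proteobacteria reported above can be illustrated with a one-sided exact binomial test. The text does not state which test produced the significance marks in Figure 4.2.1, so the choice of test here is an assumption, as is the helper name `binomial_upper_tail`; only the strain counts (50 of 84) and the 42% database proportion come from the text.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1.0 - p0) ** (n - i) for i in range(k, n + 1))


# Strain counts for the alpha, beta, gamma and delta classes (from the text).
rit_proteobacteria = 25 + 17 + 7 + 1
total_strains = 84
ncbi_fraction = 0.42  # share of Proteobacteria among NCBI genomes (from the text)

observed = rit_proteobacteria / total_strains
p_value = binomial_upper_tail(rit_proteobacteria, total_strains, ncbi_fraction)
print(f"observed = {observed:.1%}, one-sided p = {p_value:.2g}")
```

A test of this kind treats each strain as an independent draw from the database, which ignores sampling bias in which genomes are sequenced; it is meant only to show how the comparison could be set up.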
Table 4.2.1: Summary of information of putative RIT elements found in this study.
Taxonomic grouping | No. RIT elements | No. strains | pAE1 range (aa) | SG4 range (aa) | SG5 range (aa) | Gene adjacent/interrupted
alpha-Proteobacteria | 45 | | | | |
  Caulobacterales | 3 | 1 | 403 | 313 | 330 | DUF1738
  Rhizobiales | 20 | 10 | 305-425 | 304-373 | 281-362 | variable
  Rhodospirillales | 11 | 7 | 228-508 | 303-454 | 329-335 | hypothetical, methylase
  Sphingomonadales | 12 | 7 | 331-515 | 301-455 | 324-348 | hypothetical, methylase
beta-Proteobacteria | 35 | | | | |
  Burkholderiales | 29 | 15 | 348-425 | 308-457 | 329-349 | variable (IS66, RadC, transposase, integrase)
  Rhodocyclales | 5 | 1 | 411-417 | 310-325 | 294-332 | integrase
  unclassified | 1 | 1 | 318 | 324 | 337 | hypothetical
gamma-Proteobacteria | 12 | | | | |
  Acidithiobacillales | 2 | 1 | 414 | 311 | 331 | integrase catalytic unit
  Alteromonadales | 6 | 2 | 321-417 | 312-322 | 327-335 | variable
  Enterobacteriales | 2 | 2 | 315, 419 | 308, 330 | 337, 338 | RadC, methylase
  Legionellales | 1 | 1 | 418 | 332 | 335 | hypothetical
  Pseudomonadales | 1 | 1 | 411 | 323 | 354 | trbI conjugative genes
delta-Proteobacteria | 1 | | | | |
  Desulfobacteriales | 1 | 1 | 409 | 338 | 337 | reverse transcriptase
Acidobacteria | 3 | | | | |
  Solibacterales | 3 | 1 | 412, 710 | 314, 452 | 332, 336 | integrase, hypothetical
Actinobacteria | 19 | | | | |
  Actinomycetales | 5 | 5 | 304-511 | 308-332 | 329 | integrase, transposase, DNA directed reverse transcriptase
  Bifidobacteriales | 14 | 6 | 400 | 321 | 351 | transposase, integrase
Bacteroidia | 7 | | | | |
  Bacteroidales | 5 | 5 | 407-426 | 313-341 | 336-343 | hypothetical
  Flavobacteriales | 2 | 1 | 425 | 330 | 337 | RadC
  Cytophagales | 1 | 1 | 422 | 327 | 337 | DNA repair protein
Firmicutes | 13 | | | | |
  Clostridiales | 12 | 11 | 404-537 | 283-334 | 337-342 | variable
  Bacillales | 3 | 2 | 407-413 | 327-329 | 338-340 | IstB domain-containing protein ATP-binding protein
Verrucomicrobia | 4 | 1 | | | |
  Opitutales | 4 | 1 | 432 | 330 | 336 | MerR regulator
Planctomycetes | 1 | | | | |
  Planctomycetales | 1 | 1 | 419 | 321 | 330 | hypothetical
Figure 4.2.1: Comparison of the taxonomic representation of our RIT collection with the
abundance of the same taxonomic grouping in the NCBI genome database.
The NCBI numbers included both completed genomes and incomplete sequencing projects.
Significant differences (α = 0.05) are indicated with a double asterisk.
4.2.2 RIT Structure and Organization
As mentioned in section 4.2.1, the NCBI Conserved Domain Database currently has 24
described subfamilies of tyrosine recombinases. All of the elements had one gene from each of
the three subfamilies pAE1, SG4 and SG5 and they were always found in the same order and
orientation (Figure 4.2.2; discussed in section 4.2.3). This pattern was also confirmed in the
recent work examining the distribution of tyrosine-based site-specific recombinases on different
types of mobile elements (Van Houdt et al., 2012). In that work the three families of
tyrosine-based site-specific recombinases specifically involved in the formation of RIT elements were
designated FamilyIntegrase (FamInt) 1, 5 and 2 (also in order of transcription) and were
documented as having 64, 54 and 63 members, respectively. The number of included elements
was more conservative than our study as inclusion was based on confidence in family
membership for each individual recombinase. In our study we used the trio arrangement of these
recombinases as the hallmark of these novel elements and so we included elements that had
individual genes for which there was lower confidence in the family designation (see section
4.2.3). In addition to the 148 putative RIT elements (i.e., genes in trios), we found only 15
sequences that corresponded to an individual recombinase from one of these sub-families but
were not found in a trio. There were also 20 putatively degraded RIT elements. The latter were
distinguished by the presence of one or two documented recombinases and pseudogenes or
small ORFs in the remainder of the corresponding region. There was no pattern evident in
terms of which recombinase was missing in these degraded structures. These recombinases may
be RIT remnants due to inactivation or ancient distribution of these elements but may also
indicate that some or all of these subfamilies can function outside of the RIT arrangement.
As can be seen in Table 4.2.1, there is also a wide range of sizes observed for each of the
recombinases. This is particularly evident in the pAE1 (Int1) recombinase, which varies from
305 to 710 amino acids in length. The SG4 (Int2) recombinase is less variable (283-457 amino
acids), and the third (SG5, Int3) even less so (281-351 amino acids), although the individual
variations may also be an artifact of automated annotations. The pattern of sizes originally
described for these elements (largest first, smallest in the middle) is also variable. The largest
recombinase is in the middle position for 10 elements and in the third position for 6 elements.
Although these RIT elements cannot be assumed to be active, the presence of 6 similar elements
with Int2 as the largest enzyme in 5 different members of the Sphingomonadales suggests that
variation in the pattern of sizes is tolerated.
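The size pattern discussed above (largest recombinase first, smallest in the middle) can be checked mechanically for any element. The sketch below is illustrative only: the function name `size_pattern` is hypothetical, and the example lengths are the single values listed for Caulobacterales in Table 4.2.1.

```python
def size_pattern(int1_aa: int, int2_aa: int, int3_aa: int) -> dict:
    """Report which position holds the largest and smallest recombinase."""
    sizes = {"Int1": int1_aa, "Int2": int2_aa, "Int3": int3_aa}
    largest = max(sizes, key=sizes.get)
    smallest = min(sizes, key=sizes.get)
    return {
        "largest": largest,
        "smallest": smallest,
        # Pattern from the original description: largest first, smallest in the middle.
        "matches_original_pattern": largest == "Int1" and smallest == "Int2",
    }


# Caulobacterales lengths from Table 4.2.1 (pAE1, SG4, SG5).
print(size_pattern(403, 313, 330))
```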
Figure 4.2.2: Names and arrangements of tyrosine recombinase sub-families.
Names are families according to the NCBI conserved domain database for the three integrases
that comprise the RIT elements (see section 4.3.3). Arrows indicate the direction of
transcription for Int1-3 (in order of transcription). The inverted repeats (IR) have only been
confirmed in a small number of the putative RIT elements (see section 4.4.3).
4.2.3 Inferred RIT Functionality
Within our putative RIT elements we saw a broad diversity of recombinase sizes and
amino acid sequences, but conservation was always highest in the C-terminal of each enzyme.
This is commonly found in the tyrosine recombinases and in other characterized phage
integrases in which the N-terminal region is involved in site recognition and the C-terminal
contains all of the catalytic sites (Esposito and Scocca, 1997). The consistency of this finding
across the intact RIT elements examined in this study implies that all three of these
recombinases are being selectively maintained in this arrangement. The CDD utilizes these
conserved regions to support inclusion of novel phage integrases into each of the currently
outlined sub-families. If the novel enzyme contains sufficient conserved residues to surpass a
pre-determined domain specific threshold then it is designated as specific to that particular sub-
family. For the tyrosine recombinases, we have found no literature to date that investigates the
functionality of the individual sub-families, which limits the utility of these classifications for
evaluating whether each recombinase is functional. However, of the 148 RIT elements included
in our assessment, 93% have at least one recombinase that meets the specific criteria for
inclusion in the designated subfamily. Int3 (SG5) is the least divergent of the three recombinases
(Figure 4.2.3A,B), and 105 of the RIT elements (71%) have domain-specific SG5 genes. Of
the elements, 66% have domain-specific pAE1 (Int1) genes, while only 25% have domain-specific
SG4 genes (Int2). In 36%, both pAE1 and SG5 are domain specific. Only 17 (11%) contain the
amino acid residues required for designation of all three integrases as pAE1, SG4 and SG5 by
the CDD. For the remainder, the top (lowest E-value) non-specific sub-family hit was
consistently to the expected group based on position within the RIT (i.e., Int1, 2 or 3).
If we infer functionality from the presence of duplicate elements in a genome, we can
ask whether recombinases lacking the threshold number of conserved residues for
subfamily designation may still be active. There are 19 species containing more than one
identical RIT element within their genome, only 4 of which have all three subfamily specific
integrases. There are 6 instances of genomes with identically duplicated RIT elements that are
lacking subfamily specific SG4 genes (section 4.2.4.1) and also 6 closely related strains with
duplicated elements lacking a subfamily specific SG5 gene (section 4.2.4.2). There is a single
instance of identical elements in a genome without a subfamily specific pAE1 recombinase (in
Sinorhizobium meliloti 1021 pSymA) and a separate genome (Dinoroseobacter shibae DFL12)
has identical elements where only the SG5 recombinase is subfamily specific. This may imply
that not all three enzymes are strictly required for mobility, or that one or more of the residues
currently used to delineate a subfamily specific enzyme are not necessary for this function.
Figure 4.2.3: Comparison of conservation between the Int1 (pAE1) recombinases (A - top)
and Int3 (SG5) recombinases (B - bottom) from 40 divergent representatives.
Level of conservation is illustrated through shading (dark lines represent conserved amino
acids). As can be seen, both enzymes increase in conservation towards the C-terminal, and
conservation is higher in the third recombinase.
4.2.4 Evidence for RIT Mobility Within Closely Related Strains
For the purposes of this discussion, we assume that identical RIT
sequences in the same strain imply mobility and that high levels of similarity between RIT
elements located in different strains or species are evidence of horizontal transfer, likely via an
intermediary replicon. In their paper originally defining RIT elements, Van Houdt et al. (2009)
described two non-homologous RITs in Cupriavidus metallidurans CH34. The first of these
RIT elements (RITCme1) bears high nucleotide identity (greater than 90%) to truncated RIT
elements in Cupriavidus necator H16 pHG1 (two identical inverted RIT element fragments close
together with integrase remnants in between) and to a degraded RIT element in Burkholderia sp.
str. CCGE1002 (with only Int2 still listed as intact and the others listed as pseudogenes). In
addition, RITCme1 shares 84% nucleotide identity with two identical RIT elements in the
unassembled whole genome sequence data of the PAH-degrading strain Burkholderia sp. Ch1-1
and with our newly identified element RITBphyt1 (Jin, 2010). RITBphyt1 was discovered in a
chlorobenzoate-degrading Russian soil isolate designated Burkholderia phytofirmans OLGA172
(formerly R172). In this strain, the RIT element is found adjacent to the chlorocatechol
degradative operon (Jin, 2010). There are no chlorocatechol-degrading genes in C.
metallidurans CH34, indicating that the genes adjacent to the RIT element are not shared
between these two strains; however, each of these strains does have partial IS66 elements
overlapping the RIT elements, which may represent the target site for insertion.
Our dataset did reveal two clusters of RITs sharing greater than 85% nucleotide identity
over their full lengths: one cluster of RITs found in Acidiphilium/Caulobacter strains, and
another cluster of RITs from Bifidobacterium longum. Each of these shows evidence of recent
mobility in that 1) 100% identical sequences are found in different locations within individual
strains, 2) 80-100% identical sequences occur in separate species, and 3) the RIT elements share
higher identity than the surrounding genes. Interestingly, although gene synteny appears to be
conserved in many members of the Bifidobacterium cluster, the adjacent genes in the
Acidiphilium/Caulobacter cluster are highly variable, indicating that the RIT elements have not
been mobilized as part of a larger element. Details on these two informative groups are given
below.
4.2.4.1 Caulobacter/Acidiphilium cluster
This cluster of highly similar RITs comes from the genomes of three strains and the
plasmids they contain. Two of the strains are from the genus Acidiphilium, while the third is
from the genus Caulobacter; the two genera share approximately 86% 16S rRNA sequence identity.
Caulobacter sp. K31 is a chlorophenol degrader isolated from groundwater in Finland
(Männisto et al., 2001). There are two identical RIT elements on the K31 chromosome, and
another identical copy on one of the two plasmids in this strain (pCAUL02 – length 178 kb).
Acidiphilium multivorum AIU301 is an aerobic, anoxygenic and phototrophic bacterium from
pyritic acid mine drainage well known for its metal tolerance
(www.bio.nite.go.jp/dogan/project/view/AM1). A. multivorum AIU301 carries one RIT element
on the chromosome and 2 identical copies on one of its 8 plasmids (pACMV1 – length 272 kb).
Acidiphilium cryptum JF-5 is a facultative iron-respiring strain isolated from coal mine lake
sediment (www.ncbi.nlm.nih/bioproject/58447). A. cryptum JF-5 shows high gene synteny with
A. multivorum AIU301 except for a 225 kb region from AIU301 that is a probable genomic
island (www.bio.nite.go.jp/dogan/project/view/AM1). There is no RIT element on the A.
cryptum JF-5 chromosome; however, this strain also carries 8 plasmids. One of these (pACRY01 – 203
kb) carries a RIT element that is identical to those found in A. multivorum AIU301 (except that
one of the inverted repeats is only 97% similar). A second plasmid in this same strain
(pACRY03 – 89 kb) carries a RIT element that bears 84% nucleotide identity with the RIT
elements on pACRY01, and 82% sequence identity over 92% of the length of the RIT elements in
Caulobacter sp. K31 (no significant alignment to 238 bp of int1).
The RIT elements in this cluster are clearly moving as one intact unit since within each
individual strain the nucleotide identity is 100% for the entire RIT element, including the three
recombinase genes and the additional sequence between the enzymes and the inverted repeats.
Similarity in the gene fragments surrounding the RIT elements is suggestive of specific target
genes for integration – in this case the DUF1738 gene (also sometimes annotated as an anti-
restriction protein; Figure 4.2.4). This is supported by the fact that the target gene is also
consistent on both the pACRY01 and pACRY03 plasmids, and there is no copy of the
interrupted gene (DUF1738) on the A. cryptum JF-5 chromosome or any of the other 6 plasmids
in that strain, consistent with the RIT element occurrence. Interestingly, as outlined in Figure
4.2.4 for the Caulobacter sp. K31 RIT elements, although the elements appear to have
integrated into homologous genes, the relative orientation of the RIT element to the target gene
is not always consistent and has impacted the gene annotation.
The RIT elements found in A. multivorum AIU301 share 83% nucleotide identity with
those found in Caulobacter sp. K31 and the terminal inverted repeats are almost identical
between the two genera. The Caulobacter strain shows perfect 34 bp repeats for each of the RIT
elements, whereas the Acidiphilium RIT elements have a SNP in the 5’ repeat and the repeats
do not span the full 34 bp (they therefore form imperfect repeats of 30-33 bp in length). Despite the
decreased identity, all of these inverted repeats have 8 bp regions that are absolutely conserved
(discussed in section 4.2.4.4). The interrupted genes in A. multivorum AIU301 are all annotated
as hypothetical proteins; however, the protein upstream of the RIT element found on the
chromosome has 73 and 70% homology, respectively, with the DUF1738 protein fragments
found surrounding RIT1 and RIT2 on the K31 chromosome.
Figure 4.2.4: Arrangement of RIT elements on the chromosome of Caulobacter sp. K31.
The two RIT elements and inverted repeats are identical, and are found within the same gene
(DUF1738); however, the orientation is reversed and the DUF1738 nucleotide identity is not as
high as within the RIT element. When inserted in the correct orientation, the RIT element
appears to restore the DUF1738 sequence; however, this is not the case when the orientation is
reversed.
4.2.4.2 Bifidobacterium longum cluster
The Bifidobacterium longum cluster consists of 15 RIT elements sharing 99-100%
nucleotide identity distributed across 7 strains of these intestinal bacteria. These RIT elements
have been previously characterized as Mobile Integrase Cassettes (MIC) (Lee et al., 2008) and a
search of other intestinal bacteria led those researchers to suggest that these elements may be
unique to the Bifidobacteria. Five of these strains contain multiple copies and almost all are
flanked by similar transposases that range from 68 to 100% nucleotide identity. In the strains
with more than one RIT element, one of the elements is commonly found in the reverse
orientation with respect to the direction of transcription of the transposase gene, which is
consistent with the duplicate RITs in both Caulobacter sp. K31 and A. multivorum AIU301.
The combination of reversed relative orientation and the decreased level of nucleotide identity
between the transposases implies that the transposases may be a target of the RIT elements and
not responsible for their movement. However, unlike the Caulobacter/Acidiphilium cluster,
within these genomes there are other homologous transposases (up to 99% nucleotide identity)
that have not been interrupted by a RIT element.
The Bifidobacterium longum strains are all very similar (99% nucleotide identity for
16S), suggesting that the similarities observed in the RIT elements may simply be associated
with vertical transmission rather than duplication and mobility. In some circumstances there is a
high degree of surrounding gene synteny, which supports this interpretation. There is, however,
also evidence for significant genetic rearrangements specific to the RIT elements themselves. B.
longum DJ010A has three RIT elements. One of these elements, RIT1, is surrounded by genes
that are 99% conserved in the other B. longum strains and the genes are found in the same order.
The genes surrounding RIT2 and RIT3 are conserved in other B. longum strains as well;
however, the gene synteny is not preserved, as these genes are found scattered throughout the
other genomes. The only strain that does show high gene synteny with the genes surrounding
RIT2 is B. longum F8; however, the RIT element itself is annotated as occurring in the opposite
orientation in relation to the surrounding genes.
4.2.4.3 Target Sites
Although hampered by issues with incomplete annotations due to gene interruptions, an
examination of the full collection of RIT elements clearly indicates that there are specific genes
that serve as target sites for integration. The genes immediately adjacent to or interrupted by the
elements are commonly the same in cases of multiple identical elements within one strain and
between different strains harboring elements with >65% SG5 protein identity. In addition to the
genes described in sections 4.2.4.1 and 4.2.4.2, clusters of RIT elements were also found
associated with IS66, RadC, methylase/helicase genes and integrase genes. There is even a RIT
element in Aromatoleum aromaticum EbN1 which appears to have interrupted a second RIT
element. Whether the variability in gene targets stems from sequence evolution or lack of the
original target site in individual strains has not been investigated, and a specific target sequence
within these genes could not be determined.
4.2.4.4 Inverted Repeats
The tyrosine recombinases are a highly diverse family of proteins, with variable
complexity in both their DNA binding sites and their requirement for accessory proteins (Azaro
and Landy, 2002; Rajeev et al., 2009). The presence of multiple identical RIT elements in
different parts of the genomes of some strains revealed terminal features that may be involved in
recombinase binding or regulation. Alignment of the Bifidobacterium longum RIT sequences
identified a 97 bp inverted repeat that is absolutely conserved and always 41 bp from Int1 and 3
bp from Int3. The inverted repeats identified in the Caulobacter and Acidiphilium strains were
only 30-34 bp in length, followed by a section of presumably non-coding sequence between the
inverted repeats and the recombinase enzymes (illustrated in Figures 4.2.2 and 4.2.4).
Alignment of the inverted repeats and non-coding sequence from the Caulobacter and
Acidiphilium strains with the long inverted repeats from the Bifidobacterium revealed an
interesting pattern of smaller repeats that may serve as recognition or regulatory sites for the
recombinases (Table 4.2.2). Within the inverted repeats, there were two highly conserved direct
repeats of T(A/T)ATGCCG with a 9 bp intervening sequence. Furthermore, an inverted copy of
this repeat was found at an interval of 48-49 bp towards the recombinase genes. This pattern was
also confirmed at both ends of the RIT elements for our B. phytofirmans OLGA172 strain, C.
metallidurans CH34, and Burkholderia sp. Ch1-1. Whether this indicates relatedness of these
RIT elements to the Caulobacter/Acidiphilium cluster is not clear. In addition, 12 bp direct
repeats separated by 5 bases were found in both Candidatus Solibacter usitatus Ellin6076 and
Gramella forsetii KT0803 (inverted copy at a distance of 43 bp in both cases). Similar partial
patterns were found in other strains as well, but more information is needed to determine their
relevance.
There is evidence of RIT mobilization of adjacent genes in two bacteria. Opitutus terrae
PB90-1 and Dinoroseobacter shibae DFL12 each have identical sequences that extend
beyond the RIT element but do not have any other mobile elements associated with them. In
the O. terrae PB90-1 strain, this identical sequence (including the RIT) was found in four copies
in the genome and the region extending beyond the RIT element was 1.6 kb in length and
contained a merR regulator, a heavy metal transport/detoxification gene and a hypothetical
protein. In D. shibae DFL12, there were two copies of the RIT and 2.7 kb of additional
sequence including a gene annotated as a type III restriction protein subunit found on two
different plasmids within this strain. In both of these circumstances, the copied regions are
flanked by inverted repeats. In O. terrae PB90-1, this 37 bp inverted repeat had 9 bp imperfect
direct repeats (A/TGT/CTATGTG) separated by 8 bp and an inverted copy at a distance of 49
bp, consistent with the pattern observed in the other strains. For D. shibae DFL12, the
region was flanked by larger inverted repeats of 124 bp (bringing the upstream repeat to within
2 bp of the start codon for the RIT element). An imperfect direct repeat separated by 9 bp was
found within this region (A/TTATGCC/GG); however, no clear inverted version was identified.
Table 4.2.2: Potential recognition or regulatory sites contained within terminal inverted
repeats.
The sites occur at a precise distance between the repeats and the coding sequence for the
recombinase genes. Bolded bases are direct repeats contained within the terminal inverted
repeats and for which there is an inverted copy at a precise distance in the direction of the
recombinase genes.
Strain | 5’ Sequence
Burkholderia phytofirmans OLGA172 | TTATGCCGATTCCCGGATTATGCCG..49..CGGCATAA
Cupriavidus metallidurans CH34 | TTATGCCGACTCCCCGATTATGCCG..49..CGGCATAA
Burkholderia sp. Ch1-1 | TTATGCCGACTTCCCGATTATGCCG..49..CGGCATAA
Caulobacter sp. K31 | TAATGCCGCGATCCGGATTATGCCG..49..CGGCATAA
Acidiphilium multivorum AIU301 | TAATGCCGAGATCCGGATTATGCCG..49..CGGCATAA
Bifidobacterium longum NCC2705 | TTAAGCCGGGTTTGTTGTTAAGCCG..48..CGGCTTAA
Frankia sp. EANpec1 | TTATGCCGAGGGCCGGGTTATGCCG..49..CGGCATAA
Novosphingobium sp. PP1Y | TAATGCCGTGACCCGGATTATGCCG..49..CGGCATAA
Candidatus S. usitatus Ellin6076 | ACTATGCCGCGTCCCGGACTATGCCGCGT..43..ACGCGGCATAGT
Gramella forsetii KT0803 | ATTATGTAAAGTAAATTATTATGTAAAGT..43..ACTTTACATAAT
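The repeat architecture summarized in Table 4.2.2 can be checked directly from the listed sequences: each 5’ entry consists of two copies of a direct repeat separated by a short spacer, and the sequence after the gap should be the reverse complement of that repeat. The sketch below tests this for three rows of the table. The helper names are hypothetical, and the repeat and spacer lengths (8 and 9 bases, or 12 and 5 bases) are taken from the text.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def split_direct_repeats(five_prime: str, repeat_len: int, spacer_len: int):
    """Split a 'repeat + spacer + repeat' string into its three parts."""
    if len(five_prime) != 2 * repeat_len + spacer_len:
        raise ValueError("sequence length does not match repeat/spacer lengths")
    first = five_prime[:repeat_len]
    spacer = five_prime[repeat_len:repeat_len + spacer_len]
    second = five_prime[repeat_len + spacer_len:]
    return first, spacer, second


# (strain, 5' sequence, inverted copy, repeat length, spacer length) from Table 4.2.2
rows = [
    ("Cupriavidus metallidurans CH34", "TTATGCCGACTCCCCGATTATGCCG", "CGGCATAA", 8, 9),
    ("Bifidobacterium longum NCC2705", "TTAAGCCGGGTTTGTTGTTAAGCCG", "CGGCTTAA", 8, 9),
    ("Gramella forsetii KT0803", "ATTATGTAAAGTAAATTATTATGTAAAGT", "ACTTTACATAAT", 12, 5),
]

for strain, five_prime, inverted, repeat_len, spacer_len in rows:
    first, spacer, second = split_direct_repeats(five_prime, repeat_len, spacer_len)
    # Two checks: the direct repeats are identical, and the downstream copy is inverted.
    print(strain, first == second, reverse_complement(second) == inverted)
```

Rows such as Caulobacter sp. K31, where the two direct repeats differ at one position (TAATGCCG and TTATGCCG), would fail the strict equality check and would need a degenerate pattern such as T(A/T)ATGCCG instead.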
4.2.5 Similarities Between RIT Elements and Evidence for Broad Distribution
Of the 148 putative RIT elements, 64 elements were chosen for phylogenetic analysis
based on nucleotide sequence of all three recombinases. These were chosen on the basis of
having come from completed genomes, and were spread across 38 different genera. Only one
representative was included if there were multiple identical elements within an individual strain
(19 instances), and only one representative was used from the 17 nearly identical RIT elements
in the Bifidobacterium cluster since the 16S sequences of these strains were also 99% identical.
The 16S from the main chromosome was utilized as a proxy for strain phylogeny, even in
circumstances where the RITs were solely present on plasmids. As discussed in section 4.2.1,
the RIT elements that we found were largely from Proteobacteria (Figure 4.2.5A); however, the
presence of RIT elements from the Actinobacteria, Firmicutes, Bacteroidetes and Acidobacteria
suggests that these elements are not restricted to any particular phylum of bacteria, and the
diversity implies that they are an ancient feature of bacterial genomes.
A neighbour-joining tree of all the putative RIT elements yields a tree with very long
branches (Figure 4.2.5B), reflecting the deep divergence in the RIT elements obtained in this
study. Most clusters harbour elements from more than one genus. Only three of these clusters
are completely contained within one taxon (two clusters from the alpha-Proteobacteria and one
from Actinomycetes). The other clusters are dominated by one taxon (commonly alpha- or beta-
Proteobacteria, consistent with their abundance in the dataset) with unexpected additions from
other classes or even other phyla.
The presence of multiple, diverse RIT elements in individual strains was a common
finding. As is highlighted in Figure 4.2.5B, Burkholderia phymatum STM815, C. metallidurans
CH34, C. necator H16 pHG1, Mesorhizobium loti MAFF303099, Novosphingobium sp. PP1Y
and Candidatus Solibacter usitatus Ellin6076 each have more than one RIT element assigned
to very different RIT clusters. In all except Novosphingobium sp. PP1Y and C. metallidurans
CH34 there are plasmids harboring RIT elements that could account for more recent transfer
events. Many of the other RIT elements (including RITCme1 and RITCme2 from C.
metallidurans CH34 and several RIT elements from the Rhizobiales) are contained within
genomic islands; however, the lack of genomic island information for the majority of these strains
prevented a full analysis of this relationship. An examination of the putative RIT elements from
genomes that have been analyzed on the IslandViewer website (Langille, 2009) revealed that
presence within a putative genomic island was common for RIT elements. Of the 20 strains that
corresponded with our list and had been analyzed in IslandViewer, only 4 had their RIT
elements separated from the putative islands (Acidithiobacillus ferrooxidans ATCC 53993,
Aromatoleum aromaticum EbN1, Sinorhizobium fredii NGR234 and Opitutus terrae PB90-1).
All of the other RIT elements analyzed were found within predicted islands or within 1 kb of a
predicted island. The small islands predicted by IslandViewer can prove to be components of
one larger island, as is found in Mesorhizobium loti MAFF303099. A 610 kb region of this
chromosome has been documented as a symbiosis island (Uchiumi et al., 2004); however, the
prediction programs have instead illustrated it as 10-12 smaller islands. All three chromosomal
RIT elements from this strain are found contained within this symbiosis island, although only
one of them is documented as occurring directly within the predicted islands by IslandViewer.
Interestingly, the regions of the symbiosis island where the RIT elements occur are also high in
transposases from a variety of families. In this strain there is a fourth RIT element found on the
pMLa plasmid, and similarly in both Burkholderia phymatum STM815 and Burkholderia
phytofirmans PsJN there are identical RIT elements that are present on both a genomic island on
the chromosome and one or more plasmids, suggesting a possible route of intragenomic
variability within these islands.
The environmental distribution of RIT elements was found to be quite broad, with
representatives ranging from an isolate from the head of an off-shore oil-producing well to
an intracellular amoebal pathogen. There was an even contribution of strains present from each
type of environment – 32% each from the combined soil/sediment/sludge environments and all
combined water environments (freshwater, seawater, groundwater and wastewater). In addition,
there were 18% specifically identified as plant-associated strains, and 16% were animal-associated,
including a small number of pathogens. Disregarding the animal-associated and
seawater environments (for which there is no straightforward delineation as clean or
contaminated), almost half of the strains (42%) were isolated from contaminated
environments. We note this could be due to a bias towards the study of these environments,
particularly in light of the increased abundance of alpha- and beta-Proteobacteria in our RIT
element collection compared to the NCBI database.
Figure 4.2.5: Phylogenetic analysis by 16S (A) and nucleotide sequence of the RIT
elements obtained in this study (B).
Only one representative is included for identical elements found within individual or highly
similar strains (99% 16S identity). Scale bars are at the bottom of each figure and the re-sampling
percentage is indicated at each node. Symbols in Figure 4.2.5B are used to illustrate
different RIT elements found in individual strains.
4.2.6 RIT Classification
Van Houdt et al. (2012) created a classification system
consisting of 11 types of RIT elements, based on assignment of the three recombinases to the
same NCBI protein clusters. Four of these types (3, 4, 5 and 7) were further subdivided since one
of the recombinases (commonly Int2) was associated with recombinases from more than one
protein cluster, suggesting the possibility of modular evolution. We wanted to evaluate whether
our larger collection of RIT elements showed congruent evolution of all three recombinase
genes in these four groups of RIT elements.
Phylogenetic trees were constructed from the amino acid sequences of each of the three
recombinases from 40 individual RIT elements. These sequences corresponded to the
members of types 3A/B, 4A/B/C, 5A/B, and 7A/B/C, as well as the other RIT elements
from our collection that were found to cluster with these groups based on homology to the third
recombinase. This resulted in the inclusion of the type 10 RIT element (Gramella forsetii
KT0804) since it was found to cluster with the type 7 elements in our analysis. A single type 2
strain was also included (Acidithiobacillus) and used as an outgroup to root the three trees.
In our analysis (Figures 4.2.6A-C), it is clear that although the clustering occurs at
different levels of similarity for the three recombinases, congruency is evident for each of the
members of types 3, 4, 5 and 7. As the protein cluster memberships are based solely on
modified BLAST scores (Klimke et al., 2009), they do not necessarily reflect phylogenetic
relationships made evident by neighbour-joining analysis. As an example, for each of the
elements in type 3 and type 4, the Int2 recombinases are listed as being from the same protein
cluster and the Int1 recombinases separate into different protein clusters. Yet in Figures 4.2.6A
and 4.2.6B it is clear that the branching order relationships are preserved between the type 3 and
type 4 proteins for both Int1 and Int2. The type 7 elements are a much more diverse group of
recombinases, and from our analysis it would appear that the type 10 elements actually form one
of several sub-clusters within this group, but the three recombinases show congruency
nonetheless. More research is needed in order to determine an appropriate clustering cutoff for
designation of RIT element types.
Figure 4.2.6: Individual congruency trees for each of the recombinases in a selection of
RIT elements.
The individual trees correspond to Int1 (A), Int2 (B) and Int3 (C). Number and letter designations
are according to the types defined in Van Houdt et al. 2012.
4.3 Conclusions
In this analysis, we expand upon the findings of Van Houdt et al. (2012) by evaluating a
more diverse collection of RIT elements. Through this collection we are able to confirm that the
integrases are consistently from three subfamilies of tyrosine site-specific recombinases (pAE1,
SG4 and SG5) in the same order and orientation. Although not all of these enzymes surpassed
the specific domain threshold for inclusion in the individual subfamilies, the highest scoring hits
were consistently to these groups and our protein neighbour joining trees provide evidence that
the genes have evolved together. Recognizing their association is very informative for
furthering our understanding of these three sub-families of TBSSRs. Functionality (mobility)
could be inferred for a small number of elements that are copied within a genome or in closely
related strains. It should be noted that the intention of this study is not to suggest that all of the
putative RIT elements we have outlined are active, but rather to examine the distribution of
these elements in nature in the hopes of better understanding their role in bacterial genomes.
Examining elements fitting within the description of a recombinase in trio allows for a better
understanding of the overall distribution of these elements. The maintenance of three integrases
belonging to specific sub-families of the tyrosine recombinases in the same arrangement and in
a potentially active form in such a wide diversity of organisms clearly suggests that there is
some benefit to their presence in genomes. It has yet to be determined whether their structural
organization is due to high levels of co-transfer or is evidence of functional interdependency. It
is our hope that the terminal repeat sequences outlined in this study will prove useful in
furthering an understanding of the mode of action of these recombinases. Any discussion of the
putative role of these repeats would be preliminary given the limited knowledge on the mode of
action of these specific sub-families of tyrosine recombinases, individually or in combination.
There are no current models that are consistent with a role for all three recombinases.
RIT elements bearing high similarities (greater than 80% nucleotide identity) across their
full length can be found within closely related genera (such as Burkholderia and Cupriavidus) in
the absence of any other gene synteny. This suggests the elements in these strains can be
mobilized as an intact unit rather than as a component of a larger transposable element. We
expect that any horizontal transfer events are mediated by plasmid movement between these
closely related strains, a logical supposition supported by the prevalence of identical RIT copies
on both the chromosome and one or more plasmids within the same strain. Plasmids may also
be responsible for transport of these elements over greater phylogenetic distances; however, the
broad divergence of these elements suggests ancient origins.
It is clear from this assessment that RIT elements are capable of coordinated intracellular
movement. It is not clear if the triad structure, though most common, is a strict requirement for
the functioning of these enzymes since a small number (11%) of seemingly intact recombinases
from the pAE1, SG4 and SG5 subfamilies were found by themselves or just in pairs. Many
questions remain to be answered. It is clear, however, that several years of genome sequencing
have brought to light a new element that adds to the astounding potential for bacterial adaptation
through recombination.
4.4 Acknowledgements
Funding in the form of an NSERC Discovery Grant to RF and an NSERC CGS-D Scholarship to
NR is gratefully acknowledged. The funding agency had no role in this study.
4.5 References
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389-402.
Azaro, M.A., Landy, A., 2002. λ Integrase and the λ Int family, in: Craig, N.L., Craigie, R.,
Gellert, M. (Eds.), Mobile DNA II. pp. 118-148.
Drummond, A.J., Ashton, B., Buxton, S., Cheung, M., Cooper, A., Heled, J. et al., 2011.
Geneious v5.0.4. Available at: http://www.geneious.com.
Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic acids research 32, 1792-7.
Esposito, D., Scocca, J.J., 1997. The integrase family of tyrosine recombinases:
evolution of a conserved active site domain. Nucleic acids research 25, 3605-14.
Frost, L.S., Leplae, R., Summers, A.O., Toussaint, A., 2005. Mobile genetic elements:
the agents of open source evolution. Nature reviews. Microbiology 3, 722-32.
Jin, S., 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a
Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. M.Sc. Thesis, University of Toronto.
Klimke, W., Agarwala, R., Badretdin, A., Chetvernin, S., Ciufo, S., Fedorov, B., Kiryutin,
B., O'Neill, K., Resch, W., Resenchuk, S., Schafer, S., Tolstoy, I., Tatusova, T., 2009. The National Center for Biotechnology Information's Protein Clusters Database. Nucleic acids research 37, 216-223.
Langille, M.G.I., Brinkman, F.S.L., 2009. IslandViewer: an integrated interface for
computational identification and visualization of genomic islands. Bioinformatics Jan. 16 (E.
Lee, J.-H., Karamychev, V.N., Kozyavkin, S.A., Mills, D., Pavlov, A.R., Pavlova, N.V.,
Polouchine, N.N., Richardson, P.M., Shakhova, V.V., Slesarev, A.I., Weimer, B., O'Sullivan, D.J., 2008. Comparative genomic analysis of the gut bacterium
Bifidobacterium longum reveals loci susceptible to deletion during pure culture growth. BMC genomics 9, 247.
Marchler-Bauer, A., Anderson, J.B., Cherukuri, P.F., DeWeese-Scott, C., Geer, L.Y.,
Gwadz, M., He, S., Hurwitz, D.I., Jackson, J.D., Ke, Z., Lanczycki, C.J., Liebert, C.A., Liu, C., Lu, F., Marchler, G.H., Mullokandov, M., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Yamashita, R.A., Yin, J.J., Zhang, D., Bryant, S.H., 2005. CDD: a Conserved Domain Database for protein classification. Nucleic acids research 33, D192-6.
Männistö, M.K., Salkinoja-Salonen, M.S., Puhakka, J.A., 2001. In situ polychlorophenol
bioremediation potential of the indigenous bacterial community of boreal groundwater. Water research 35, 2496-504.
Nunes-Düby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T., Landy, A., 1998.
Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic acids research 26, 391-406.
Rajeev, L., Malanowska, K., Gardner, J.F., 2009. Challenging a paradigm: the role of
DNA homology in tyrosine recombinase reactions. Microbiology and molecular biology reviews : MMBR 73, 300-9.
Roberts, A.P., Chandler, M., Courvalin, P., Guédon, G., Mullany, P., Pembroke, T.,
Rood, J.I., Smith, C.J., Summers, A.O., Tsuda, M., Berg, D.E., 2008. Revised nomenclature for transposable genetic elements. Plasmid 60, 167-73.
Uchiumi, T., Ohwada, T., Itakura, M., Nukui, N., Dawadi, P., Kaneko, T., Tabata, S.,
Yokoyama, T., Tejima, K., Saeki, K., Omori, H., Hayashi, M., Sriprang, R., Murooka, Y., Tajima, S., Simomura, K., Nomura, M., Suzuki, A., Shimoda, Y., Sioya, K., Mitsui, H., 2004. Expression Islands Clustered on the Symbiosis Island of the Mesorhizobium loti Genome. Society.
Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic
elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.
Van Houdt, R., Leplae, R., Mergeay, M., 2012. Towards a more accurate annotation of
tyrosine-based site-specific recombinases in bacterial genomes. Mobile DNA 3, 6. doi:10.1186/1759-8753-3-6
Chapter 5
The Chlorocatechol Degradative Operon in Burkholderia sp. strain OLGA172 Resides in Chromosomal Area of Genome Plasticity as revealed through PacBio Single-Molecule Sequencing
Acknowledgements and Contributions: This chapter has been submitted for consideration to
the journal Genomics and represents the compilation of efforts of previous graduate students as
well as my own work. Contributors to this chapter (besides myself and Roberta Fulthorpe) are as
follows: Jackie Goordial (investigation of CC operon), Soulbee Jin (primer walking and
assembly of CC operon region), Heng (Tony) Qian (short read data assemblies), Shu Yi
(Roxana) Shen (comparison of chromosome 1 breakpoint, assistance with figures and
bioinformatics support), Rosemary Saati (qPCR for copy number analysis, PCR confirmation
and small plasmid extraction).
5 Introduction
Next generation sequencing has revolutionized our approach to the study of genomes.
However, due to the short read lengths of the initial next generation sequencers, the complete
assembly of even the smallest of these genomes into a manageable number of contigs from
sequence data alone is problematic, particularly in repetitive regions of the genome (Phillipy et
al. 2008; Ricker et al. 2012; Ghodsi et al. 2013; Barbosa et al. 2014). Many researchers have
called for increased efforts in completing sequencing projects (Parkhill, 2000; Phillipy et al.
2008; Klassen 2012); however, the high costs and time requirements have made this goal
unachievable for many labs. As a result, incomplete genome projects and permanent draft
genomes are increasingly dominating databases. The number of genomes deposited in the
Genomes Online Database (version 5.0; https://gold.jgi-psf.org accessed 31 October 2014) has
grown from 3532 genomes in 2012 (Ricker et al., 2012) to 53,514 genomes, 73% (38,962) of
which are bacterial genomes. Of these, only 12% (6648) of the bacterial genomes have been
finished and 44% (23,551) of the remainder are listed as permanent draft (up from 29%
permanent drafts in 2012). This increase in permanent draft genomes results not only from the
practical and financial obstacles faced in finishing a genome, but also occurs due to the limited
scope of the original project. Many sequencing projects are initiated in order to determine
whether a discrete set of known genes are present in an organism, or to investigate differences
between closely related strains. For each of these types of projects, a permanent draft genome is
sufficient since short read sequencing technologies can assemble the individual genes
(Branscomb and Predki, 2002) and reference-based assembly to close relatives can reveal strain
specific gene content. However, the prevalence of permanent draft genomes in the publicly
available databases is problematic for researchers looking to understand overall genome
organization and the impacts of horizontal gene transfer on prokaryotic evolution.
Although smaller than eukaryotic genomes, many prokaryotes are still challenging to
assemble due to the presence of multiple replicons and highly repetitive elements shared within
and between these replicons. Any repeat longer than the sequenced read results in an ambiguity
that cannot be resolved by assembly algorithms due to the simultaneous existence of more than
one true path through the data (Ghodsi et al. 2013). The ribosomal operons, typically ranging
from 5-7 kbp, represent the largest repeat class found in the majority of bacteria (Koren et al.
2013) and therefore read lengths on a kilobase scale are required to produce an ungapped or
‘closed’ assembly. PacBio SMRT sequencing produces read lengths of up to 40 kb as well as a
large volume of shorter reads to be used for error correction, simultaneously providing the
benefits of hybrid short and long read libraries without the additional cost and effort of using
two separate libraries or sequencing technologies. Most importantly, PacBio SMRT sequencing
does not suffer from amplification biases or low coverage in regions containing secondary
structure (due to palindromes or GC content), but rather produces random errors that can be
detected and algorithmically managed (Chin et al. 2013; Koren et al. 2013).
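The ambiguity created by a repeat that is longer than the reads can be illustrated with a toy example. The following Python sketch is illustrative only: the sequences, the `read_spectrum` helper and the assumption of error-free reads sampled at every position are hypothetical and are not part of the analyses in this chapter. Two invented genomes that differ only in the order of the unique segments lying between three copies of a repeat yield identical read sets until the reads are long enough to span the repeat plus one flanking base on each side.

```python
from collections import Counter

# Hypothetical 8 bp repeat (G/C only) and unique flanking segments (A/T only).
REPEAT = "GCCGGCGC"
A, B, C, D = "ATTA", "TATAT", "AATT", "TTAA"

# Two different "true paths": the unique segments B and C are swapped.
genome_1 = A + REPEAT + B + REPEAT + C + REPEAT + D
genome_2 = A + REPEAT + C + REPEAT + B + REPEAT + D


def read_spectrum(genome, read_len):
    """Multiset of all error-free reads of a fixed length (idealized sampling)."""
    return Counter(genome[i:i + read_len]
                   for i in range(len(genome) - read_len + 1))


def distinguishable(read_len):
    """True if reads of this length can tell the two arrangements apart."""
    return read_spectrum(genome_1, read_len) != read_spectrum(genome_2, read_len)


for read_len in (len(REPEAT) + 1, len(REPEAT) + 2):
    print(read_len, distinguishable(read_len))
```

In this toy case the two arrangements only become distinguishable once a read covers the whole repeat together with unique sequence on both sides, which is the reason kilobase-scale reads are needed to span rRNA operons.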
Genome organization has been linked with important lifestyle and evolutionary traits
(Slater et al. 2009; Harrison et al. 2010; Heuer and Smalla, 2012), and consequently
understanding the nature and size of individual replicons can be very informative. Fragmented
draft genomes contain anywhere from tens to hundreds of individual contigs, therefore relevant
information on replicon characteristics is lost. Prokaryotic genomes range from 200 to over
9000 genes, and there is a strong link between environment and genome complexity
(Konstantinidis and Tiedje, 2004; Cordero and Hogeweg 2009). Since horizontal gene transfer is
a prominent mechanism of bacterial adaptation, the genomic context of individual genes is
important for discerning the evolutionary origins and transferability of these traits. Of the 4394
completed genomes available in the NCBI genome database (accessed 27 Feb 2015), 33% (1474
genomes) contain more than one replicon. Plasmid localization implies not only potential for
dissemination, but can also give insight into gene expression (Lopez-Leal et al. 2014) and the
rate of evolution since existence on a discretely replicating plasmid is effectively equivalent to a
gene duplication event equal to the copy number of the plasmid (Norman et al. 2009). Similarly,
nucleotide substitution rates have also been observed to be higher on secondary chromosomes
and smaller replicons than on the primary replicon, as have rates of recombination and
rearrangements (Chain et al. 2006). Understanding the genomic context of genes found on
chromosomes is also important since outward-facing promoters that are contained within
adjacent mobile elements may directly regulate adaptive traits. As an example, in Bacteroides
fragilis, antimicrobial resistance genes exhibit increased expression from insertion sequence (IS)
elements found upstream (Sóki 2013). In a recent study using whole genome short read
sequencing of resistant isolates, Sydenham et al. (2015) found that the majority of contigs
bearing resistance genes terminated within 200 bp of the gene in question and therefore
information on the upstream gene content was lost.
In this paper, we have assembled the complete genome of Burkholderia sp. str.
OLGA172 (formerly Burkholderia sp. R172), a 3-chlorobenzoate (CBA) mineralizer isolated
from an uncontaminated Boreal forest soil in northwestern Russia as part of a global survey of
CBA and 2,4-D degrading soil bacteria (Fulthorpe et al. 1998). Chlorocatechol (CC) is a central
intermediate in the CBA degradation pathway, as well as the degradation of several other
chlorinated aromatic chemicals of environmental concern, such as chlorophenols,
chlorobenzenes and chlorobiphenyls (Schlomann, 1994). Understanding the evolution and
distribution of CC degradative pathways therefore has benefits for the remediation of a variety
of anthropogenic compounds. The catabolic genes for CC degradation are usually found in an
operon, and often, but not always, located on catabolic plasmids (Liu et al. 2005; van der Meer
et al. 1992; Fulthorpe et al. 1995; Leveau & van der Meer, 1996) or on mobile elements such as
integrative and conjugative elements (ICEs) (Sentchilo et al. 2009). In Burkholderia sp. str.
OLGA172, the complete modified ortho pathway for catechol degradation has been identified
(Jin, 2010; Accession number: AY168634.1); however, numerous Sanger and short read
sequencing approaches have failed to reveal the genomic context of the region surrounding this
operon. In this paper, we illustrate how the presence of a large repeated element, termed a
Recombinase in Trio (RIT) element, next to the operon interfered with this analysis and how the
use of PacBio single molecule sequencing overcame these issues to produce a non-fragmented
draft assembly.
5.1 Materials and Methods
5.1.1 Short read NGS sequencing
Sequencing was carried out using an Illumina (Solexa) Genome Analyzer (GA) II at the
Centre for the Analysis of Genomic Evolution and Function (CAGEF) at the University of
Toronto, using two flow cells. A Roche 454 Genome Sequencer FLX (GS-FLX) was also used
to sequence the genome at The Genome Quebec Innovation Centre at McGill University, using
one quarter of a PicoTitre Plate, and de novo assembly of the raw reads was carried out to
generate contigs. Illumina and 454 sequencing gave 364,718,452 bp and 75,636,539 bp,
respectively. Hybrid assembly of the trimmed reads from both datasets was performed using
Velvet version 1.2.08 (Zerbino and Birney, 2008) and resulted in 1508 contigs with a maximum
contig length of 89,084 bp (mean value 5186 bp, N50 of 12,043). BLAST (Altschul et al.
1990) was employed to determine the identity of the other genes present on the contig
surrounding the CC degradative genes and they were confirmed through synteny analysis of
closely related species and PCR amplification.
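For readers unfamiliar with the N50 statistic reported above, a minimal Python sketch of its usual definition follows: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name `n50` and the contig lengths in the example are hypothetical and are not taken from the Velvet output.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [10, 8, 6, 4, 2]
print(n50(example_lengths))
```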
5.1.2 PacBio Single Molecule Sequencing
Sequencing was performed by Genome Quebec Innovation Centre using 8 SMRT cells
in a PacBio RSII sequencer. There were a total of 736,020 raw subreads with an average length
of 4,949 bp and a maximum length of 23,822 bp. Contigs were assembled at the Innovation
Centre through the Hierarchical Genome Assembly Process (HGAP) workflow (Chin et al.,
2013) including pre-assembly error correction, Celera assembly, and polishing with Quiver. The
corrected long reads produced after the pre-assembly error correction process were obtained
from the Innovation Centre and utilized to examine coverage and disagreements in the final
assembly using Geneious (version 6.0.3, http://www.geneious.com, Kearse et al., 2012) and in
hybrid assemblies. Alignments were performed with a minimum overlap of 200 bp, minimum
overlap consensus of 98% and maximum of 30% errors throughout the read alignment and 10%
gaps. The permissive error and gap rates were utilized due to the expected high rate of random
errors in the individual PacBio reads.
5.1.3 Assembly of Short Read Technologies and PacBio corrected reads
In order to determine whether the PacBio assembly could be improved through the
addition of short read sequencing data, hybrid assemblies of the PacBio corrected long reads and
a combination of Illumina and 454 short reads (individually and combined) were performed
using Mira (Chevreaux et al. 1999) on default settings. Contigs generated were compared to the
unitigs from the PacBio-only assembly using the Mauve (Darling et al. 2004) whole genome alignment
option in Geneious.
5.1.4 Gene Annotation and Contig Validation
Annotation was performed automatically using the RAST server (Aziz et al., 2008)
utilizing Glimmer3 with no backfilling. The beginning sequence of each PacBio unitig was
compared to the complete unitig using BLAST (Altschul et al. 1990) and regions repeated at both
ends of the unitigs were trimmed from the final replicons. Identification of individual mobile
elements was also performed using BLAST, accessed through the ISFinder website
(www-is.biotoul.fr; Siguier et al. 2006). GC skew and coding density were determined and visualized
using DNAPlotter (Carver et al. 2009). Highly similar repeats (>98% nucleotide ID) within
individual replicons were determined using REPuter
(https://bibiserv2.cebitec.uni-bielefeld.de/reputer; Kurtz et al. 2001). Repeated elements greater
than 1000 bp between replicons were determined using BLAST alignment with a match/mismatch score
of 1/-3. Primers were
designed to confirm placement of the RIT element adjacent to the tfd operon on chromosome 1
and a second identical RIT element found on chromosome 2. Primers were also designed to
confirm placement of the 191 kb fragment that was removed from chromosome 2 in the hybrid
assembly. Primer sequences are listed in Appendix 1.
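The trimming of regions repeated at both ends of a unitig, described above, can be sketched as follows. This is a simplified illustration that assumes an exact match between the start and end of the sequence; the analysis in this chapter used BLAST, which tolerates mismatches. The function names, the `min_len` parameter and the example sequence are hypothetical.

```python
def terminal_overlap(seq, min_len=4):
    """Length of the longest prefix of seq that is repeated exactly at its end."""
    for size in range(len(seq) // 2, min_len - 1, -1):
        if seq[:size] == seq[-size:]:
            return size
    return 0


def trim_circular(seq, min_len=4):
    """Remove the duplicated end of a circular replicon assembled as a linear unitig."""
    size = terminal_overlap(seq, min_len)
    return seq[:-size] if size else seq


# Hypothetical unitig: the first 8 bases are repeated at the end.
unitig = "ACGTTGCA" + "GGATCC" + "ACGTTGCA"
print(terminal_overlap(unitig), trim_circular(unitig))
```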
5.1.5 Comparisons to Related Finished Genomes
MAUVE alignments (Darling et al. 2004) of individual replicons from Burkholderia
phytofirmans PsJN, Burkholderia xenovorans LB400 and Burkholderia sp. CCGE1001, 1002
and 1003 were performed using Geneious version 6.0.3 (http://www.geneious.com, Kearse et al.
2012), using the Muscle (Edgar, 2004) alignment option and a minimum local co-linear block
(LCB) value of 400. Putative genomic islands on chromosome 1 were determined using
IslandViewer (Dhillon et al. 2013; www.pathogenomics.sfu.ca/islandviewer).
5.1.6 Large Plasmid Extraction
Large plasmid extraction was based on the method of Andrup et al. (2008). The samples were
run on 0.5% Megabase agarose (BioRad) for 21.5 hours at 4-6 °C before staining for 1.5 hours in
ethidium bromide and destaining for 5 days at 4 °C. Plasmid sizes were estimated using the
BAC Tracker Supercoiled DNA ladder (Epicentre) as well as previously sequenced plasmids in
Cupriavidus metallidurans CH34 and B. phytofirmans PsJN.
5.2 Results
5.2.1 Overall Genome Analysis
The PacBio assembly for Burkholderia sp. str. OLGA172 produced 11 unitigs, 4 of which were
discarded as small contigs of vector control sequence and one was found to contain a partial 16S
sequence. There was also one unitig (9570 bp) that appears to be valid sequence (top matches
are to Burkholderia strains) but that did not have any features consistent with being an
independent replicon. This sequence was found to align with one of the larger unitigs, and to
terminate in a repeated element (IS66) that is repeated in multiple locations within the genome.
The remaining 5 unitigs were trimmed for overlapping terminal repeats (see methods) and
retained for further analysis, providing an estimated total genome size of 8,574,889 bp. Each of
these unitigs was aligned with the corrected long reads provided by the Genome Quebec
Innovation Centre, and 19% (4697/24528) could not be aligned at 98% nucleotide
identity. A selection of the unmatched reads was analyzed by BLAST comparison to the 5 unitigs
and manual inspection indicated that there were small gaps (10-20 bp on average) that prevented
alignment. However, as can be seen in Table 5.2.1, depth of coverage for each unitig ranged
from 6x to 19x from the corrected long reads alone, which provides a reasonable level of
confidence in the assembly. Coverage for total reads obtained through PacBio sequencing
ranged from 49x for the unitig designated plasmid 3 to 255x for the largest unitig. In order to
determine the overall complexity of this genome, highly similar repeated elements greater than 1
kb were quantified for the three largest unitigs (see methods). There were a total of 105
repeated elements on the three largest unitigs (including the 16S operons) and 56 elements were
found to have highly similar copies on both chromosome 1 and chromosome 2. The distribution
of repeated elements classifies Burkholderia sp. str. OLGA172 as at least a class II difficulty of
assembly (Koren et al. 2013) due to having a large number of repeated elements in addition to
16S rDNA operons (more than 100 repeats greater than 500 bp) and potentially a class III
difficulty due to the presence of two repeats greater than 7 kb. For this reason it is not
surprising that our previous sequencing efforts using a combination of Illumina and Roche 454
sequencing had been unsuccessful in producing a reasonable assembly of this genome (Jin,
2010). As we had the additional short read data from both the Illumina and Roche 454
sequencing platforms available for this genome we also performed hybrid assemblies using all
three sets of sequencing reads. All assemblies that incorporated the Illumina reads failed to
produce an adequate assembly, presumably due to the low quality of the original Illumina data.
The assembly of the PacBio and 454 data resulted in considerably more contigs than the PacBio
assembly, with only 11 contigs larger than 2 kb. The alignment of these 11 large contigs from
the hybrid assemblies agreed very well with the unitigs from the original PacBio assembly
(Table 5.2.1) with the exception of the removal of a 191,800 bp fragment from the PacBio unitig
corresponding with chromosome 2 and the creation of an additional contig of 79,317 bp.
Primers were designed targeting the regions flanking the 191 kb fragment within the PacBio
chromosome 2 unitig, and products of the expected size were obtained confirming the original
PacBio placement of this fragment (data not shown). The 79 kb contig was discarded as it was
highly fragmented, with strings of uncalled bases (Ns).
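The repeat-based assembly difficulty classes referred to above can be expressed as a simple decision rule. The sketch below uses only the thresholds as paraphrased in this section (more than 100 repeats greater than 500 bp for class II, any repeat greater than 7 kb for class III); the original definitions in Koren et al. (2013) may include further criteria, and the function name, parameter names and repeat lengths in the example are hypothetical.

```python
def assembly_difficulty_class(repeat_lengths_bp,
                              many_repeats=100,
                              mid_repeat_bp=500,
                              large_repeat_bp=7000):
    """Assign class I, II or III from a list of repeat lengths (in bp).

    Thresholds follow the paraphrase in the text, not necessarily the
    full definitions of Koren et al. (2013).
    """
    if any(length > large_repeat_bp for length in repeat_lengths_bp):
        return "III"
    mid_scale = sum(1 for length in repeat_lengths_bp if length > mid_repeat_bp)
    if mid_scale > many_repeats:
        return "II"
    return "I"


# Hypothetical repeat catalogues, for illustration only.
few_repeats = [5500] * 7                      # e.g. only rRNA operons
many_mid_repeats = [1200] * 105               # many mid-sized repeats
with_large_repeats = many_mid_repeats + [7500, 8200]

for catalogue in (few_repeats, many_mid_repeats, with_large_repeats):
    print(assembly_difficulty_class(catalogue))
```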
Table 5.2.1: Statistics of PacBio unitigs assigned as putative replicons.
Alignments were performed using only the corrected long reads from the PacBio SMRT
sequencing aligned on unitigs after trimming of repetitive terminal regions.
Replicon             Number of aligned reads   Pairwise Identity (%)   Mean coverage   Standard Deviation
Chromosome 1         12,027                    99.3                    19.7            4.2
Chromosome 2         9,129                     99.3                    19.7            3.9
Plasmid pOLGA1       458                       99.3                    11.6            2.8
Plasmid pOLGA2       195                       99.3                    10.1            4.2
Plasmid pOLGA3       22                        99.5                    5.9             1.7
Total aligned reads  21,831
Unaligned reads      4,697 (19%)
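As an illustration of how a mean coverage and its standard deviation can be derived from aligned reads, the following Python sketch counts the depth at each position of a replicon. It is not the Geneious implementation: the function name, the 0-based end-exclusive alignment coordinates, the use of the population standard deviation and the example values are all assumptions made for this sketch.

```python
from statistics import mean, pstdev


def coverage_stats(replicon_length, alignments):
    """Mean and standard deviation of per-base depth.

    alignments: iterable of (start, end) pairs, 0-based and end-exclusive.
    """
    depth = [0] * replicon_length
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return mean(depth), pstdev(depth)


# Hypothetical 10 bp replicon covered by two overlapping reads.
example_mean, example_sd = coverage_stats(10, [(0, 5), (3, 10)])
print(example_mean, example_sd)
```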
5.2.2 Biological consistency of the Assembly
In addition to using traditional assembly metrics such as size and coverage, we wanted to
specifically investigate the quality of our assembly in terms of biologically relevant features. To
perform this investigation we compared our PacBio assembly of Burkholderia sp. str. OLGA172
with the available completed genomes of other Burkholderia strains (Table 5.2.2) including
Burkholderia sp. str. CCGE1001 (unpublished), CCGE1002 (Ormeño-Orrillo et al., 2012),
CCGE1003 (unpublished), B. phytofirmans PsJN (Weilharter et al., 2011) and B. xenovorans
LB400 (Chain et al., 2006), all of which bear 16S nucleotide identities of 97% or greater to our
strain. There were 8853 genes annotated in our assembled genome, including 7 complete rRNA
operons and 64 tRNA genes. The rRNA operons were distributed between the two
chromosomes, 3 on the largest chromosome and 4 on the second chromosome, which is not
common but is also documented in Burkholderia sp. CCGE1002. The number of tRNA genes
was consistent with the other genomes, as were the estimated sizes of the chromosomes. The
gene distribution for chromosome 1 is illustrated in Figure 5.2.1.
Figure 5.2.1: Chromosome 1 of Burkholderia sp. str. OLGA172 as determined by PacBio
sequencing.
Chromosome is illustrated after manual trimming of ends (see methods). Rings correspond to
the following (moving from outside towards the middle): Coding sequences (CDS) in the
forward direction; CDS in reverse direction; rRNA operons (black) and tRNA genes (grey); all
CDS annotated as mobile elements (transposons, phage integrases, transposition helper
proteins); GC plot; GC skew. The two innermost rings are black for above average value and
grey for below average. Image created in DNAPlotter (Carver et al. 2009).
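GC skew is conventionally calculated per window as (G - C)/(G + C). The Python sketch below shows this calculation for non-overlapping windows; the function name, window size and example sequence are hypothetical, and DNAPlotter's own windowing and averaging settings may differ.

```python
def gc_skew(sequence, window=1000, step=None):
    """GC skew, (G - C) / (G + C), for successive windows along a sequence."""
    step = step or window
    seq = sequence.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C are reported as zero skew.
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values


# Hypothetical 12 bp sequence split into three 4 bp windows.
example_skew = gc_skew("GGGC" + "CCCG" + "ATAT", window=4)
print(example_skew)
```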
Table 5.2.2: Comparison of assembled genome of Burkholderia sp. str. OLGA172 with
other closely related Burkholderia strains.
Strain names listed are Burkholderia sp. str. unless a species name has been formally accepted
(listed in italics with the strain name).
Strain name           Chromosome 1  Chromosome 2  Plasmids               rRNA  tRNA  Accession Numbers
OLGA172               4.65 Mb       3.50 Mb       271 kb, 137 kb, 23 kb  21    64
B. phytofirmans PsJN  4.47 Mb       3.63 Mb       121 kb                 18    63    NC_010676.1, NC_010679.1, NC_010681.1
B. xenovorans LB400   4.90 Mb       3.36 Mb       1.47 Mb*               18    65    NC_007951.1, NC_007952.1, NC_007953.1
CCGE1001              4.06 Mb       2.77 Mb       none                   18    61    NC_015136.1, NC_015137.1
CCGE1002              3.52 Mb       2.59 Mb       1.28 Mb*, 489 kb       21    73    NC_014117.1, NC_014120.1
CCGE1003              4.08 Mb       2.97 Mb       none                   18    63    NC_014539, NC_014540.1
* indicates a third chromosome as opposed to a plasmid, as annotated in the Genbank entry.
For each of the unitigs, replication and partitioning genes were located and verified to be
consistent with designated replicons (data not shown). As can be seen in Table 5.2.2, the
number and size of additional replicons can be quite variable in Burkholderia strains, and we
therefore experimentally verified our genome expectations by performing a large plasmid
extraction on Burkholderia sp. str. OLGA172 (Figure 5.2.2). This indicated two large plasmids
with sizes consistent with those obtained through the PacBio assembly as well as two smaller
plasmids. The larger of these two smaller plasmids is consistent with the unitig designated
plasmid 3 in the assembly (23 kb). Although none of the close relatives examined in this study
had a similarly sized plasmid, the replication and partitioning genes found on this replicon had
highest homology with those from a 45 kb plasmid found in Burkholderia sp. KJ006 (76%
protein homology with 77% coverage). With the exception of a mobile element found in
multiple locations within the genome (IS66), the majority of the coding sequences identified on
the plasmid corresponded to hypothetical or conserved hypothetical genes. A contig
corresponding with the 8 kb plasmid was not found. There is one unitig that is approximately
the right size (9570 bp); however, there were no plasmid replication genes or other identifying
features to suggest that this corresponds to the small plasmid visible on the gel. As discussed
above, a comparison of this unitig with the complete assembly suggests that it is already
accounted for in the consensus assembly. The nature of the 8 kb plasmid has not been
determined, but it is possible that it corresponds to an excised mobile element that is contained
within the assembly.
Figure 5.2.2: Large plasmid extraction.
The first lane contains the BAC Tracker supercoiled DNA ladder (Epicentre); however, as the
plasmids were outside of the ideal range, their sizes were instead estimated by comparison to
plasmids from previously sequenced strains. Samples are Cupriavidus metallidurans CH34
(lane 2; plasmid sizes 233 kb and 171 kb), Burkholderia sp. str. OLGA172 (lane 3; PacBio sizes
listed as 271 kb and 137 kb) and B. phytofirmans PsJN (lane 4; 121 kb plasmid). The smaller
plasmid in Burkholderia sp. str. OLGA172 is also visible below the non-chromosomal marker
band of the ladder, and a smaller element is visible at the 8 kb marker.
5.2.3 Capacity of the PacBio Assembly for comparative studies
Although the other sequenced Burkholderia genomes share less than 99% 16S identity with our
strain, the chromosome 1 genes were expected to show evolutionary conservation across these
genomes. We therefore aligned our assembled chromosome with chromosome 1 from 6
Burkholderia strains using the MAUVE feature in Geneious (see methods). As can be seen in
Figure 5.2.3, gene order and organization in our assembly were comparable with those seen for
the other Burkholderia strains, with large locally collinear blocks (LCBs) shared between all 6
strains. There was greater similarity in the genomic arrangement of our strain with B.
phytofirmans PsJN than with the other strains; however, as expected, there were large gaps both
within and between LCBs corresponding to strain-specific genomic islands. For the largest of
these gaps the sequence was inspected to identify whether the break was specific to our
assembly or was also observed in the other strains. In each of these cases, there was a consistent
break point where regions of strain specificity were observed in all 6 genomes, including several
islands with clear insertions following tRNA genes consistent with expectations for genomic
island or prophage insertion sites.
Figure 5.2.3: MAUVE alignment of chromosome 1 from six Burkholderia strains.
Locally collinear blocks (LCBs) are denoted by rectangles, and the level of identity within those blocks
is illustrated by the height of the vertical lines contained within the rectangles. Genomes are
(from top to bottom): OLGA172, B. xenovorans LB400, CCGE1001, CCGE1002, B.
phytofirmans PsJN and CCGE1003. LCBs drawn below the horizontal refer to inverted
segments.
5.2.4 Highlighting a region of Strain Specificity – The Chlorocatechol (CC) Degradative Operon
Burkholderia sp. OLGA172 is a 3-chlorobenzoate (CBA) degrader representative of a
large collection of unstable CBA degraders isolated from pristine environments. Previously in
our lab, primers targeting chlorocatechol dioxygenase (CCD) genes (Leander et al. 1998) were
used to confirm the presence of a modified ortho pathway of chlorocatechol (CC) degradation
genes in Burkholderia sp. str. OLGA172 (GenBank: AY168634). Primer walking techniques
were used to determine a 10 kb genomic region surrounding the CC degradative operon that
included the full operon and a set of three tyrosine site-specific recombinases (Jin, 2010). This
set of recombinases is now recognized as a Recombinase in Trio (RIT) element (Van Houdt et
al. 2009). Full genome sequencing using both the Illumina and 454 platforms was utilized to
assemble the complete genome and provide context for the CC degradative operon (Jin, 2010),
and a hybrid assembly of these two sequencing technologies produced a 14 kb contig that
connected the CC degradative operon to a segment of chromosome 1 from a number of strains
from the “plant-beneficial-environmental” (PBE) clade of Burkholderia (Suarez-Moreno et al.
2012).
Chlorocatechol degradation genes that encode an ortho cleavage pathway have been
repeatedly identified in proteobacteria isolated from various contaminated systems. Separate
isolations resulted in the use of different notations for the catabolic genes: the clc genes from
chlorobenzoate-degrading Pseudomonas knackmussii B13 (Chatterjee et al. 1981), the tfd genes
from 2,4-D-degrading Cupriavidus pinatubonensis JMP134 (Don and Pemberton, 1981; Don et
al. 1995), the tcb genes from the trichlorobenzene degrader Pseudomonas sp. P51 (van der Meer et al.
1991), and the tft genes from trichlorophenol-degrading Burkholderia phenoliruptrix AC1100
(Hubner et al. 1998). In spite of the different notations, the operons share sequence similarities
consistent with divergent evolution from a common ancestor. In most instances these CC
operons are located on IncP1 plasmids, suggestive of a rapid evolutionary response to the
introduction of novel and abundant anthropogenic chloroaromatics to the environment. For
instance, there is a very high degree of sequence similarity between tfd genes located on
different plasmids in strains isolated all over the world. Carriage on plasmids not only
facilitates rapid distribution but also has the added benefit of providing an increased copy
number of the degradative genes, which is important due to the toxicity of the intermediate
degradation products (van der Meer, 2003). There are two known instances where the CC genes
were not found on plasmids. The clc genes originally described in B13 are contained within an
integrative conjugative element that has been shown to be self-transmissible (Sentchilo et al.
2009; Gaillard et al. 2006), and are also found in Burkholderia LB400. The tfd genes of
Burkholderia sp. str. RASC (also known as TFD3, isolated from sewage sludge in Oregon, USA) are reported
to be chromosomal (Suwa et al. 1994; Tonso et al. 1995). However, Sakai et al. (2014) recently
isolated 7 Burkholderia and one Cupriavidus 2,4-D degrading strains from paddy fields with CC
genes highly homologous to those of RASC, and have shown them to be located on a group of
megaplasmids 580-900 kb in size.
In spite of OLGA172's isolation on chlorobenzoate as a selective substrate, and its close
phylogenetic relationship to LB400, the CC operon is highly homologous to the tfd rather
than the clc genes (85% nucleotide identity to the tfdC gene of JMP134 and 79% to Burkholderia
sp. NK8; Liu et al. 2001). There is high homology amongst the CC genes from OLGA172 and
other strains isolated from pristine sites around the world, but there is no evidence of
plasmid locations (Leander et al. 1998). For all these reasons, confirming the genomic
context in this strain would support our hypothesis that these genes are ancestral, and aid
in understanding the evolution of the chlorocatechol degradation operon. The contig
generated through the hybrid Illumina and 454 assembly illustrated that the CBA genes
were likely to be located on chromosome 1 due to the presence of typical chromosome 1
genes adjacent to one end of the operon (Figure 5.2.4). However, at the other end of the
operon the contig terminated within the RIT element, and therefore could not provide
additional information to confirm the genes present beyond that point. Amplification using
Thermal Asymmetric Interlaced PCR (TAIL-PCR) from the RIT element to other possible
contigs that overlap with this region revealed a ‘junkyard’ of complete and partial mobile
elements and hypothetical proteins in the region adjacent to the RIT element (Jin, 2010).
With the PacBio assembly we were able to assemble this highly fragmented region of the
original genome assembly and connect it to chromosome 1 genes conserved with the other
Burkholderia strains. The assembly also revealed the presence of another copy of the RIT
element on the second chromosome that is identical in sequence to the end of the inverted
repeats (3393 bp). It is therefore likely that this large repeated element hindered both PCR
confirmation of the CCD operon location and our previous next-generation sequencing
assemblies based on shorter reads.
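The effect of such a repeat on assembly can be illustrated with a simple rule of thumb: a single read can only place a repeat copy unambiguously if it covers the whole repeat plus some unique sequence on both flanks. The sketch below applies that rule to the 3393 bp identical region reported above. The read lengths and the 50 bp anchor are illustrative assumptions, not values from this study, and the function name `read_spans_repeat` is hypothetical.

```python
REPEAT_BP = 3393  # identical RIT region shared by chromosomes 1 and 2 (from the text)


def read_spans_repeat(read_len_bp, repeat_bp=REPEAT_BP, min_anchor_bp=50):
    """Return True if one read can cover the repeat plus a unique anchor on each flank.

    min_anchor_bp is an assumed value chosen only for illustration.
    """
    return read_len_bp >= repeat_bp + 2 * min_anchor_bp


# Assumed, order-of-magnitude read lengths; not measurements from this work.
example_reads = {
    "short read (assumed 150 bp)": 150,
    "mid-length read (assumed 500 bp)": 500,
    "long read (assumed 8000 bp)": 8000,
}

for label, length in example_reads.items():
    verdict = "spans the repeat" if read_spans_repeat(length) else "cannot resolve the repeat"
    print(f"{label}: {verdict}")
```

Under these assumptions only reads several kilobases long can bridge the repeat, which is consistent with the fragmented short-read assemblies described above.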
Amplification and sequencing were performed from tfdC to the ribonuclease G gene in order to
confirm the placement of the CC degradative operon on chromosome 1. Alignment of this
region to other related strains indicated that the CC degradative operon is the starting point for a
strain-specific region of genome plasticity (RGP) that extends for 52 kb. There is no tRNA
flanking the region and it does not begin with an integrase or transposase, although there are a
number of mobile element proteins contained within. Comparison of this region to the other
related Burkholderia strains does indicate, however, that this segment of the chromosome is
highly strain specific. In each of these strains there is high homology and gene synteny for 90 kb
leading up to the region where the CC degradative operon is found in Burkholderia sp. str.
OLGA172, and gene synteny ends in all of these strains after the ribonuclease G protein (Figure
5.2.4). The portion of the chromosome leading up to this break in synteny is also conserved in
other, more distantly related, species including Cupriavidus and Ralstonia strains. In
Burkholderia sp. CCGE1001 there is a 62 kb genomic island documented in this site and the
genomic island integrase is the first gene after the ribonuclease. Although none of the other
strains have documented genomic islands in this location, this region is clearly involved in strain
specificity due to the complete disruption of gene synteny. Some of these strains also contain
RIT elements that are highly homologous (>80% nucleotide identity) to those of OLGA172; however, none of
these RIT elements occur in the same genomic location. The reasons for the lack of synteny
following the RNase G gene in each of these strains are not clear at this time.
Figure 5.2.4: Genomic arrangement of chromosome 1 genes from Burkholderia sp. str.
OLGA172 and comparison to homologous regions of related strains.
Note that the grey arrows do not indicate a type of gene but instead indicate that the genes
found in this genomic location are not shared among the different strains included in the
analysis. Note also that the genome sequence of Burkholderia sp. NK8 has not been
completed and therefore only the plasmid was available for comparison.
5.2.5 Limitations of the PacBio Assembly
The RIT element on chromosome 2 is flanked on both sides by at least 15 kb of
complete and partial mobile genetic elements, including several that are also repeated in the
‘junkyard’ region adjacent to the RIT element on chromosome 1. Interestingly, the TAIL-PCR
sequencing results and the PacBio assembly disagreed on the nature of the ‘junkyard’ region
flanking each of the RIT elements. Primers were designed that spanned the entire distance from
tfdC to the opposite end of the RIT element on chromosome 1 based on both the PacBio genome
assembly and TAIL-PCR sequencing results (which PacBio places adjacent to the chromosome
2 RIT element). PCR products of the expected size were produced from both primer sets;
however, sequencing revealed that the PacBio-based product had multiple peaks indicative of a likely
PCR chimera. The original TAIL-PCR primer set produced good quality sequencing results,
suggesting that the PacBio assembly had misassembled the two ‘junkyard’ regions. There was
no homologous RIT element found on any of the plasmid sequences.
5.3 Discussion
Historically, bacterial genomes have been defined as one large circular chromosome
with additional information carried on transient plasmids that were not a defining feature of the
species. However, the discovery of secondary chromosomes, or chromids (Harrison et al. 2010),
and other stable replicons that contribute important lifestyle characteristics, has necessitated a
closer inspection of bacterial replicon dynamics. Besides the mobility associated with gene
occurrence on a plasmid, these studies have also revealed differences in gene regulation and in
the rates of recombination, rearrangement or mutation for both plasmid and chromid genes, as
well as a separation of core and secondary functions between different replicons (Chain et al.
2006). Many genome projects and assembly platforms discuss assembly metrics with the goal
of creating one large contig without considering the prevalence of multiple replicons in
environmental isolates. Of the 4386 completed genomes available through the NCBI genome
database (accessed 01 March 2015), 1306 (30%) contain multiple replicons, of which almost
half (644) contain more than 2 replicons. The use of PacBio SMRT sequencing allows for the
primary goal of gene identification while also providing important genome characterization and
a putative location for those genes that can be experimentally validated.
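The proportions quoted above follow directly from the three database counts. The short sketch below reproduces the arithmetic; the variable names and the helper `percent` are illustrative assumptions, and only the counts are taken from the text.

```python
# Counts quoted in the text (NCBI genome database, accessed 01 March 2015).
completed_genomes = 4386
multi_replicon = 1306   # completed genomes containing multiple replicons
more_than_two = 644     # of those, genomes containing more than 2 replicons


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of completed genomes with multiple replicons (quoted as 30% in the text).
print(percent(multi_replicon, completed_genomes))
# Share of multi-replicon genomes with more than 2 replicons ("almost half").
print(percent(more_than_two, multi_replicon))
```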
As with many genome sequencing projects, the initial goal of this work was to
investigate the genomic context of the chlorocatechol degradation operon in our strain. However,
the presence of a large repeated element directly adjacent to the operon, and a copy of it present
on the second chromosome, resulted in a fragmented assembly that could not be reconciled
through short-read sequencing or via different PCR methods. PacBio SMRT sequencing allowed
us to locate our operon in a region of strain specificity that was particularly difficult to assemble
due to the presence of multiple mobile genes and gene fragments. Although sequencing revealed
that the mobile element ‘junkyard’ surrounding the RIT element had been misassembled, the
overall organization of the PacBio assembly agrees well with our experimental
results. The assembly also has the added benefit of providing a putative assembly of the difficult
regions, from which primers can be designed and tested.
Only one of the closely related Burkholderia strains used in this study (B. xenovorans
LB400) contained CC degradative genes, and these genes occur in a well-documented
integrative conjugative element (ICEclc) located on chromosome 1 (Pradervand et al. 2014).
The region surrounding the CC degradative operon in OLGA172 was designated as a potential
island by the IslandViewer website (Dhillon et al. 2013;
www.pathogenomics.sfu.ca/islandviewer); however, a close inspection of the genes present does
not support mobility of this region. Therefore, we have used the term region of genome plasticity
(RGP) to describe the genomic context. There are no conjugation or transfer genes to suggest
that this is an integrative conjugative element (ICE) or prophage, and no flanking repeats were
identified to suggest a transposon or genomic island. The CC degradative genes found in our
strain were not localized to the same region of the genome as those found in LB400, showed no
evidence of being contained in an ICE, and bore limited protein identity with the clc genes from
LB400 (50-65% protein identity). There is only low similarity (tfdC has only 56% protein identity)
to the genes carried on megaplasmids described by Sakai et al. (2014). However, the region
directly adjacent to the CC degradative operon on these megaplasmids corresponds with a
conserved region referred to as the chromid region due to its homology with genes occurring on
the second chromosome (or ‘chromid’; Harrison et al. 2010) of Burkholderia phytofirmans
PsJN, Burkholderia xenovorans LB400 and Burkholderia sp. CCGE1002. The authors therefore
suggested that the acquisition of the degradative genes may have been the result of insertion of
the ancestor plasmid into a mobile element adjacent to the genes on the chromid (designated
Tn6233) and subsequent acquisition of the genes and one copy of the mobile element on
resolution (Sakai et al. 2014). Not surprisingly, this chromid region is also homologous to genes
found on the second chromosome of Burkholderia sp. OLGA172. There are no tfd genes found
on the second chromosome of our strain; however, the presence of identical RIT elements on
both chromosomes provides an opportunity for these genes to be transferred through
homologous recombination between replicons. In the case of our strain, it is clear that the CCD
operon is located on the primary chromosome. Due to the pristine nature of the soil environment
where this strain was isolated, it is possible that this operon is utilized for a different purpose in
the natural environment of OLGA172. This is further supported by the variable degradation
ability observed in this strain, which is likely the result of transcriptional and biochemical
inefficiencies on this substrate (Goordial, 2010). This is consistent with published findings
indicating that toxic intermediates in the chlorocatechol pathway can accumulate, and that this
toxicity is evident when only one copy of the degradative operon is maintained in the cell
(Perez-Pantoja et al. 2003).
Although not within the scope of this project, the source and role of the third plasmid
(pOLGA_3, 23 kb in length) would also make an interesting study. The plasmid replication
gene for plasmid 3 was strongly homologous (>75% nucleotide identity) only to those of two other
Burkholderia strains; however, it showed lower homology (~30%) to plasmids from a diversity
of sources. Included among these were two very small plasmids, a 12 kb plasmid found in
Ralstonia solanacearum and a 3.2 kb plasmid isolated from Laribacter hongkongensis. In
addition to these, the replication gene was also 41% homologous to a P2-like phage isolated
from Burkholderia cepacia complex, and this phage was unique as it was the sole representative
from that study for which the prophage is maintained as a plasmid within the cell (Lynch et al.
2010). The majority of the remaining matches corresponded to whole genome shotgun
sequences and therefore the evolution of this particular replicon cannot be further investigated at
this time. It would be interesting to further examine the relationship between Burkholderia
phages and plasmid evolution.
While ease of assembly is often a key factor in sequencing decisions, it has been our
experience that hesitation in adopting PacBio SMRT sequencing has been
attributed to cost-per-base comparisons with other available technologies. Certainly for projects
sequencing a number of isolates or for routine testing of clinically relevant strains, the cost of
PacBio sequencing is still prohibitive. However, we submit that obtaining not only the
functional gene content but also the number of individual replicons and
the intact assembly of mobile genetic elements contained within the assembly provides a
tangible benefit to current and future comparative studies that justifies the increased investment.
This represents a reasonably priced option whereby the immediate goals of any individual
sequencing project can be achieved without contributing to the increased abundance of
fragmented genomes in the public databases.
5.4 Acknowledgements
The authors gratefully acknowledge Eric Collins at the University of Alaska Fairbanks and
Tony (Heng) Qian for assistance with genome assembly and annotation. NR is grateful to
Ann Provoost and Kristel Mijnendonckx at the Belgian Nuclear Research Centre (SCK·CEN)
for providing C. metallidurans CH34 DNA and for assistance and guidance for the large
plasmid extractions. B. phytofirmans PsJN was kindly provided by Angela Sessitsch of the
Austrian Institute of Technology. Funding in the form of a NSERC Discovery Grant to RF
and a NSERC CGS-D Scholarship and Michael Smith Foreign Study Supplement to NR are
gratefully acknowledged. The funding agency had no role in this study.
5.5 References
Altschul, S.F., W. Gish, W. Miller, E.W. Myers and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
Aziz, R.K., D. Bartels, A.A. Best, M. DeJongh, T. Disz, R.A. Edwards, K. Formsma, S. Gerdes, E.M. Glass, M. Kubal, F. Meyer, G.J. Olsen, R. Olson, A.L. Osterman, R.A. Overbeek, L.K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G.D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke and O. Zagnitko. 2008. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9:75. doi:10.1186/1471-2164-9-75
Andrup, L., K.K. Barfod, G.B. Jensen and L. Smidt. 2008. Detection of large plasmids from the Bacillus cereus group. Plasmid 59(2):139-143. doi: 10.1016/j.plasmid.2007.11.005.
Barbosa, E.G.V, F.F. Aburialle, R.T.J. Ramos, A.R. Carneiro, Y.L. Loir, J.B.A. Miyoshi, A. Silva and V. Azevedo. 2014. Value of a newly sequenced bacterial genome. World J Biol Chem 5(2):161-168.
Branscomb, E. and P. Predki. 2002. On the high value of low standards. J. Bact. 184(23):6406-6409.
Carver, T., N. Thomson, A. Bleasby, M. Berriman and J. Parkhill. 2009. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics 25(1):119-20.
Chatterjee, D.K., S.T. Kellogg, S. Hamada and A.M. Chakrabarty. 1981. Plasmid specifying total degradation of 3-chlorobenzoate by a modified ortho pathway. J Bacteriol 146(2):639–646.
Chain, P.S., V.J. Denef, K.T. Konstantinidis, L.M. Vergez, L. Agullo, V.L. Reyes, L. Hauser, M. Cordova, L. Gomez, M. Gonzalez, M. Lan, V. Lao, F. Larimer, J.J. LiPuma, E. Mahenthiralingam, S.A. Malfatti, C.J. Marx, J.J. Parnell, A. Ramette, P. Richardson, M. Seeger, D. Smith, T. Spilker, W.J. Sul, T.V. Tsoi, L.E. Ulrich, I.B. Zhulin and J.M. Tiedje. 2006. Burkholderia xenovorans LB400 harbors a multi-replicon, 9.73-Mbp genome shaped for versatility. Proc Natl Acad Sci U.S.A. 103(42):15280-7.
Chevreux, B., T. Wetter and S. Suhai. 1999. Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.
Chin, C-S., D.H. Alexander, P. Marks, A.A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E.E. Eichler, S.W. Turner and J. Korlach. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10: 563-569. doi:10.1038/nmeth.2474
Cordero, O.X. and P. Hogeweg. 2009. The impact of long-distance horizontal gene transfer on prokaryotic genome size. Proc Natl Acad Sci U.S.A. 106(51):21748-21753.
Darling, A.C., B. Mau, F.R. Blattner and N.T. Perna. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research 14(7):1394-1403.
Dhillon, B.K., T.A. Chiu, M.R. Laird, M.G.I. Langille and F.S.L. Brinkman. 2013. IslandViewer update: improved genomic island discovery and visualization. Nucleic Acids Res 41(Web server issue):W129-132. PMID: 23677610
Don, R.H. and J.M. Pemberton. 1981. Properties of six pesticide degradation plasmids isolated from Alcaligenes paradoxus and Alcaligenes eutrophus. J Bacteriol 145(2):681-686.
Don, R.H., A.J. Weightman, H.J. Knackmuss and K.N. Timmis. 1995. Transposon mutagenesis and cloning analysis of the pathways for degradation of 2,4-dichlorophenoxyacetic acid and 3-chlorobenzoate in Alcaligenes eutrophus JMP134(pJP4). J Bacteriol 161(1):85–90.
Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.
Fulthorpe, R.R., C. McGowan, O.V. Maltseva, W.E. Holben and J.M. Tiedje. 1995. 2,4-Dichlorophenoxyacetic acid-degrading bacteria contain mosaics of catabolic genes. Appl Environ Microbiol 61(9):3274-3281.
Fulthorpe, R.R., A.N. Rhodes and J.M. Tiedje. 1998. High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Appl Environ Microbiol 64(5):1620-1627.
Gaillard, M., T. Vallaeys, F.J. Vorhölter, M. Minoia, C. Werlen, V. Sentchilo, A. Pühler and J.R. van der Meer. 2006. The clc element of Pseudomonas sp. strain B13, a genomic island with various catabolic properties. J Bacteriol 188: 1999-2013.
Ghodsi, M., C.M. Hill, I. Astrovskaya, H. Lin, D.D. Sommer, S. Koren and M. Pop. 2013. De novo likelihood-based measures for comparing genome assemblies. BMC Research Notes 6:334.
Goordial, J. 2010. Characterization of a Novel Chlorobenzoate Degrading bacterium: Burkholderia phytofirmans OLGA172, Isolated from a Pristine Environment. M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.
Harrison, P.W., R.P. Lower, N.K. Kim and J.P.W. Young. 2010. Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends Microbiol 18(4):141-148.
Heuer, H. and K. Smalla. 2012. Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS Microbiol Rev 36(6):1083-1104.
Hubner, A., C.E. Danganan, L. Xun, A.M. Chakrabarty and W. Hendrickson. 1998. Genes for 2,4,5-trichlorophenoxyacetic acid metabolism in Burkholderia cepacia AC1100: characterization of the tftC and tftD genes and locations of the tft operons on multiple replicons. Appl Environ Microbiol 64:2086–2093.
Jin, S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.
Kearse, M., R. Moir, A. Wilson, S. Stones-Havas, M. Cheung, S. Sturrock, S. Buxton, A. Cooper, S. Markowitz, C. Duran, T. Thierer, B. Ashton, P. Mentjies and A. Drummond. 2012. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28(12):1647-1649.
Klassen, J.L. and C.R. Currie. 2012. Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics 13:14.
Konstantinidis, K.T. and J.M. Tiedje. 2004. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci U.S.A. 101(9):3160-3165.
Koren, S., G.P. Harhay, T.P. Smith, J.L. Bono, D.M. Harhay, S.D. Mcvey, D. Radune, N.H. Bergman and A.M. Phillippy. 2013. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 14(9): R101.
Kurtz, S., J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye and R. Giegerich. 2001. REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res 29(22):4633-4642.
Leander, M., T. Vallaeys and R. Fulthorpe. 1998. Amplification of putative chlorocatechol dioxygenase gene fragments from α- and β-Proteobacteria. Can J Microbiol 44(5): 482-486.
Leveau, J.H.J., C. Werlen and J.R. van der Meer. 1996. Molecular mechanisms of genetic adaptation to xenobiotic compounds. International Biodeterioration & Biodegradation 37(3):252.
Liu, S., N. Ogawa and K. Miyashita. 2001. The chlorocatechol degradative genes, tfdT-CDEF, of Burkholderia sp. strain NK8 are involved in chlorobenzoate degradation and induced by chlorobenzoates and chlorocatechols. Gene 268(1):207-214.
Liu, S., N. Ogawa, T. Senda, A. Hasebe and K. Miyashita. 2005. Amino acids in positions 48, 52 and 73 differentiate the substrate specificities of the highly homologous chlorocatechol 1,2-dioxygenases CbnA and TcbC. J. Bact. 187(15):5427-5436.
López-Leal, G., M.L. Tabche, S. Castillo-Ramírez, A. Mendoza-Vargas, M.A. Ramírez-Romero and G. Dávila. 2014. RNA-Seq analysis of the multipartite genome of Rhizobium etli CE3 shows different replicon contributions under heat and saline shock. BMC Genomics 15(1):770.
Lynch, K.H., P. Stothard and J.J. Dennis. 2010. Genomic analysis and relatedness of P2-like phages of the Burkholderia cepacia complex. BMC Genomics 11(1): 599.
Norman, A., L.H. Hansen and S.J. Sørensen. 2009. Conjugative plasmids: vessels of the communal gene pool. Philos Trans R Soc London [Biol] 364(1527):2275-2289.
Ormeño-Orrillo, E., M.A. Rogel, L.M.O. Chueire, J.M. Tiedje, E. Martínez-Romero and M. Hungria. 2012. Genome Sequences of Burkholderia sp. Strains CCGE1002 and H160, Isolated from Legume Nodules in Mexico and Brazil. J Bacteriol 194(24):6927. doi:10.1128/JB.01756-12.
Parkhill, J. 2000. In defense of complete genomes. Nat Biotechnol. 18:493–494.
Perez-Pantoja, D., T. Ledger, D.H. Pieper and B. Gonzalez. 2003. Efficient turnover of chlorocatechols is essential for growth of Ralstonia eutropha JMP134 (pJP4) in 3-chlorobenzoic acid. J Bacteriol 185(5):1534-1542.
Phillippy, A.M., M.C. Schatz and M. Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9:R55.
Pradervand, N., S. Sulser, F. Delavat, R. Miyazaki, I. Lamas and J.R. van der Meer. 2014. An Operon of Three Transcriptional Regulators Controls Horizontal Gene Transfer of the Integrative and Conjugative Element ICEclc in Pseudomonas knackmussii B13. PLoS Genetics. DOI: 10.1371/journal.pgen.1004441
Ricker, N., H. Qian and R.R. Fulthorpe. 2012. The limitations of draft assemblies for understanding prokaryotic adaptation and evolution. Genomics 100(3):167-175.
Sakai, Y., N. Ogawa, Y. Shimomura and T. Fujii. 2014. A 2,4-dichlorophenoxyacetic acid degradation plasmid pM7012 discloses distribution of an unclassified megaplasmid group across bacterial species. Microbiology 160(3):525-536.
Schlömann, M. 1994. Evolution of chlorocatechol catabolic pathways. Biodegradation 5(3-4):301-321.
Sentchilo, V., K. Czechowska, N. Pradervand, M. Minoia, R. Miyazaki and J.R. van der Meer. 2009. Intracellular excision and reintegration dynamics of the ICEclc genomic island of Pseudomonas knackmussii sp. strain B13. Mol Microbiol 72(5):1293-1306.
Siguier, P., J. Pérochon, L. Lestrade, J. Mahillon and M. Chandler. 2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 34(suppl 1):D32-D36.
Slater, S.C., B.S. Goldman, B. Goodner, J.C. Setubal, S.K. Farrand, E.W. Nester, T.J. Burr, L. Banta, A.W. Dickerman, I. Paulsen, L. Otten, G. Suen, R. Wench, N.F. Almeida, F. Arnold, O.T. Burton, Z. Du, A. Ewing, E. Godsy, S. Heisel, K.L. Houmiel, J. Jhaveri, J. Lu, N.M. Miller, S. Norton, Q. Chen, W. Phoolcharoen, V. Ohlin, D. Ondrusek, N. Pride, S.L. Sticklin, J. Sun, C. Wheeler, L. Wilson, H. Zhu and D.W. Wood. 2009. Genome Sequences of Three Agrobacterium Biovars Help Elucidate the Evolution of Multichromosome Genomes in Bacteria. J. Bact. 191(8):2501-2511. doi:10.1128/JB.01779-08
Sóki, J. 2013. Extended role for insertion sequence elements in the antibiotic resistance of Bacteroides. World J Clin Infect Dis 3:1-12.
Suárez-Moreno, Z.R., J. Caballero-Mellado, B.G. Coutinho, L. Mendonça-Previato, E.K. James and V. Venturi. 2012. Common features of environmental and potentially beneficial plant-associated Burkholderia. Microb Ecol 63(2):249-266.
Suwa, Y., W.E. Holben and L.J. Forney. 1994. Cloning of a novel 2,4-D catabolic gene isofunctional to tfdA from Pseudomonas sp. strain TFD3, abstr. Q-403, p. 459. In Abstracts of the 94th General Meeting of the American Society for Microbiology 1994. American Society for Microbiology, Washington, D.C.
Sydenham, T.V., J. Sóki, H. Hasman, M. Wang and U.S. Justesen. 2015. Identification of antimicrobial resistance genes in multidrug-resistant clinical Bacteroides fragilis isolates by whole genome shotgun sequencing. Anaerobe 31:59-64.
Tonso, N.L., V.G. Matheson and W.E. Holben. 1995. Polyphasic characterization of a suite of bacterial isolates capable of degrading 2,4-D. Microb. Ecol. 30: 1–22.
van der Meer, J.R., A.R. van Neerven, E.J. de Vries, W.M. de Vos and A.J. Zehnder. 1991. Cloning and characterization of plasmid-encoded genes for the degradation of 1,2-dichloro-, 1,4-dichloro-, and 1,2,4-trichlorobenzene of Pseudomonas sp. strain P51. J Bacteriol 173(1):6–15.
van der Meer, J.R., W.M. De Vos, S. Harayama and A.J.B. Zehnder. 1992. Molecular Mechanisms of Genetic Adaptation to Xenobiotic Compounds. Microbiol Rev 56(4):677-694.
van der Meer, J.R. 2003. Evolution of metabolic pathways for degradation of environmental pollutants. Encyclopedia of Environmental Microbiology.
Van Houdt, R., S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-26.
Weilharter, A., B. Mitter, M.V. Shin, P.S. Chain, J. Nowak and A. Sessitsch. 2011. Complete genome sequence of the plant growth-promoting endophyte Burkholderia phytofirmans strain PsJN. Journal of Bacteriology 193(13):3383-4. doi: 10.1128/JB.05055-11.
Zerbino, D.R. and E. Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5):821-829.
96
Chapter 6 Expression and Activity of RIT Elements
Acknowledgements and Contributions: This work was performed in collaboration with the
researcher who originally described the RIT elements, Rob Van Houdt, at the Belgian Nuclear
Research Center (SCK•CEN), due to his shared interest in elucidating RIT mobility and
expression characteristics. The experiments were carried out over two research terms at the
SCK•CEN for a total of almost 12 months over a two-year period. Experimental design,
training and project supervision at the SCK•CEN were provided by Rob Van Houdt. Wietse
Heylen and Ann Provoost provided technical assistance.
6 Introduction
As discussed in Chapter 4, RIT elements contain three tyrosine-based site-specific recombinases (TBSSRs) and display a
characteristic gene order and repeat architecture that is conserved across 7 bacterial phyla (Van
Houdt et al. 2009; Van Houdt et al. 2012; Ricker et al. 2013). Since the recombinases of the RIT
elements belong to sub-families of TBSSRs that have been commonly annotated as integrases,
the two terms will be considered equivalent and used interchangeably. As I have shown, RIT
elements can occur as multiple identical copies within individual genomes and are commonly
found on plasmids and in genomic islands, including plant symbiosis and catabolic islands.
These observations support the idea that they are mobile and that their role in genetic
rearrangement/movement is likely to be a universal one. In this chapter, I describe the series of
experiments performed to look for mobility of RIT elements. For these experiments, I obtained
Caulobacter sp. K31 (generously donated by Craig Stephens of Santa Clara University, Santa
Clara, California, USA) since the presence of multiple identical RIT copies in this genome
strongly suggests that the elements are mobile in this strain. Many of the experiments were also
performed in parallel with our strain of interest, Burkholderia sp. OLGA172. The ability of RIT
elements to excise and relocate was tested using a variety of mating experiments ranging from
non-specific intracellular mobility to site-specific targeting during conjugation. The general
strategy used was to separate the three recombinase genes from their flanking inverted repeats to
induce the recombinases (on one vector) to move a selectable marker (kanamycin resistance)
which has been inserted between the inverted repeats (RIT::Km cassette) on a separate vector.
In the initial experiments, a conjugative plasmid (pOX38) was also added to the cells and
conjugation occurred after induction. If the RIT::Km cassette is mobilized onto the conjugative
plasmid, it will escape the original cell and be detected in the recipient as newly acquired
kanamycin resistance. In the second set of experiments, the RIT::Km cassette
was carried by a suicide construct, a vector that is capable of its own conjugative transfer but
cannot be maintained in the recipient cell. The expression vector was contained within the
recipient cell, along with a target site plasmid, and induction occurred during conjugation so that
the recombinases would be active when the suicide construct entered the cell.
6.1 Materials and Methods
6.1.1 Growth of Bacterial Strains
All E. coli cultures were grown in LB media supplemented with antibiotics when appropriate
(kanamycin 50 µg/mL, ampicillin 100 µg/mL, tetracycline 20 µg/mL, streptomycin 50 µg/mL,
chloramphenicol 30 µg/mL). M9 media with and without the addition of 1 mM leucine was
utilized for differentiating auxotrophs, and LB with 0.3 mM diaminopimelic acid (DAP) was
utilized for growing the MFDpir strain (provided by Jean Marc Ghigo from the Institut Pasteur,
Paris, France). Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 (provided by Craig
Stephens from Santa Clara University, Santa Clara, California, USA) were grown in
Pseudomonas F media and Peptone Yeast Extract (PYE), respectively.
Table 6.1.1: List of strains used in this study.
Donor and Recipient Strains | Genotype
E. coli DG1 | mcrA Δ(mrr-hsdRMS-mcrBC, modification-, restriction-) φ80lacZΔM15 ΔlacX74 recA1 araD139 Δ(ara-leu)7697 galU galK rpsL endA1 nupG
E. coli DH5α | dlacZ ΔM15 Δ(lacZYA-argF) U169 recA1 endA1 hsdR17(rK-mK+) supE44 thi-1 gyrA96 relA1
E. coli S17-1 λpir | TpR SmR recA, thi, pro, hsdR-M+RP4: 2-Tc:Mu: Km Tn7 λpir
E. coli MFDpir | MG1655 RP4-2-Tc::[ΔMu1::aac(3)IV-ΔaphA-Δnic35-ΔMu2::zeo] ΔdapA::(erm-pir) ΔrecA
E. coli HB101 | F- mcrB mrr hsdS20(rB- mB-) recA13 leuB6 ara-14 proA2 lacY1 galK2 xyl-5 mtl-1 rpsL20(SmR) glnV44 λ-
Table 6.1.2: List of constructs created during this study.
Constructs | Resistances | Description
pKK223-K31IntExp | Amp | Open reading frames for K31 RIT element TBSSRs inserted downstream of tac promoter
pKK223-OlgaIntExp | Amp | Open reading frames for Olga RIT element TBSSRs inserted downstream of tac promoter
pACYC184-RIT::Km | Tc, Km | Kanamycin cassette inserted between the inverted repeats of the RIT element from K31; courtesy of Wietse Heylen
pTrc99-K31RITA-C | Amp | K31 RIT element recombinases inserted in pTrc99 backbone
pACYC-TSV1 | Tc | Target site 1 from K31 DUF1738 gene inserted in pACYC184 backbone
pACYC-TSV2 | Tc | Target site 2 from DUF1738 gene inserted in pACYC184 backbone
pSF100-RIT::Km | Km | Suicide construct containing Km cassette flanked by RIT element repeats
6.1.2 Construct creation
Expression constructs were created by inserting only the open reading frames for each
recombinase (individually or combined as a single transcript) using primers designed to be
compatible with the cloning sites of both the pKK223.1 and pTrc99A vector backbones. Donor
plasmids were created by first inserting a complete RIT element into the pACYC184 vector
backbone and then amplifying a new backbone that contained the flanking sequences (including
the inverted repeats) but without the open reading frames for the integrase genes. This new
backbone was then ligated with a kanamycin gene cassette in order to create the pACYC184-
RIT::Km donor plasmid. Target sequence oligos were designed with ends compatible with the
cloning site in pACYC184 and directly ligated to create the target site 1 and target site 2
plasmids (pACYC-TSV1 and pACYC-TSV2, respectively). Plasmids were cloned into chemically
competent DG1 cells, selected on appropriate antibiotics, and confirmed by restriction digest
analysis and sequencing. The suicide vector was created by amplifying the kanamycin cassette
flanked by the RIT element sequence from pACYC184-RIT::Km and then ligating the sequence
into a pSF100 suicide vector backbone. This backbone contains the R6K origin of replication
and therefore was maintained in S17-1 λpir host prior to conjugations.
Figure 6.1.1: Constructs used in the final conjugation experiment.
Diagrams were created using pDRAW software. The target site in pACYC184-target1 is
indicated in orange. The kanamycin cassette in pSF100-Km is flanked by the approximately
300 bp of RIT element sequence on either side of the recombinases, which includes the inverted repeats.
Since the recipient cells required the addition of two separate plasmids (expression vector
and target site vector), E. coli DH5α cells containing the expression plasmid were made
competent by washing as follows: overnight cultures of bacteria were diluted by 1/100 in fresh
media with antibiotics and grown for 2-3 hours to an OD600 of approximately 0.4. All tubes,
solutions and cultures were chilled on ice for 30 minutes, with occasional swirling. Cells were
pelleted at 5000 rpm for 5 minutes in a centrifuge at 4 °C. Supernatant was removed and ice
cold sterile Milli-Q water was used to re-suspend the cells. Pelleting and re-suspension of cells
was repeated using sequentially smaller volumes of water until a final volume of 100 µL
remained per tube. Electroporation of the target plasmid was performed in 1 mm cuvettes using
50 µL of competent cells (1.8 kV).
6.1.3 Mating-out Assays
Expression plasmid (pKK223 backbone) and pACYC184-RIT::Km donor plasmid were
transformed into the same DH5α cell. A conjugative plasmid, pOX38, was also introduced
through conjugation. Maintenance of all three plasmids was confirmed by
plasmid-specific PCR and visualization of the individual plasmids after DNA extraction
(Promega Wizard Miniprep kit, according to the manufacturer’s instructions). After
confirmation that all three plasmids were present, cells were conjugated with E. coli HB101. As
HB101 is streptomycin resistant, transconjugants resistant to both streptomycin and kanamycin
would be evidence of pOX38 mediated transfer of the RIT::Km cassette into the recipient strain.
6.1.4 Conjugation Experiments
Cultures were inoculated from single colonies into LB with appropriate antibiotics (3 mL) and
grown at 37 °C for 6 hours. Cells were then washed with saline to remove antibiotics. Donor and
recipient cells were re-suspended in appropriate volumes to create approximately the same cell
density for each. Matings were performed by mixing cultures on filters placed on plain LB media
(uninduced) or LB + 0.2 mM IPTG (induced) and incubating overnight at 37 °C. The following
day the filters were removed and placed in microcentrifuge tubes containing 1 mL saline and
vortexed. Undiluted and 1/10 diluted cultures were then plated on selective media to search for
transconjugants. Both pre-mating and post-mating cultures were also serially diluted to
determine total counts of donor and recipient cells. Putative transconjugants were inoculated
with toothpicks into PCR grade water and streaked onto selective media for confirmation, as
well as utilized for colony PCR. Based on colony PCR results, individual colonies from the
selective media were grown for plasmid extractions and further confirmed through restriction
digests and sequencing. Conjugation experiments were also varied to include stationary phase
cultures or greater density log phase cultures, as well as including pre-induced cells (induction
for 2 hours prior to mating) and using pACYC-TSV2 in the recipient cells. Finally, due to the
high prevalence of false positives in the initial suicide construct mating experiments,
conjugations were also performed using MFDpir with pSF100-RIT::Km as the donor cell with
the same recipients as earlier experiments (separate matings with target 1 and target 2 recipients)
and these matings were performed on LB + 0.3 mM diaminopimelic acid (DAP) to support
growth of the MFDpir strain.
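The plate counts described above are typically converted into a transfer frequency (transconjugants per recipient or per donor cell). The following is a minimal sketch of that arithmetic, not the calculation used in this work: the function names, the 0.1 mL plating volume and all colony counts are hypothetical and serve only to illustrate the conversion.

```python
def cfu_per_ml(colonies: int, dilution: float, plated_ml: float) -> float:
    """Return colony-forming units per mL of the undiluted suspension.

    dilution is the fraction of the original suspension that was plated
    (1.0 for undiluted, 0.1 for a 1/10 dilution).
    """
    if colonies < 0 or dilution <= 0 or plated_ml <= 0:
        raise ValueError("counts must be non-negative; dilution and volume must be positive")
    return colonies / (dilution * plated_ml)


def transfer_frequency(transconjugants_per_ml: float, reference_per_ml: float) -> float:
    """Return transconjugants per reference cell (donor or recipient)."""
    if reference_per_ml <= 0:
        raise ValueError("reference count must be positive")
    return transconjugants_per_ml / reference_per_ml


# Hypothetical plate counts, for illustration only (not data from this study).
PLATED_ML = 0.1  # assumed plating volume in mL
recipients = cfu_per_ml(colonies=150, dilution=1e-6, plated_ml=PLATED_ML)
transconjugants = cfu_per_ml(colonies=12, dilution=0.1, plated_ml=PLATED_ML)
frequency = transfer_frequency(transconjugants, recipients)
print(f"{frequency:.1e} transconjugants per recipient")
```

Expressing the result per recipient rather than per donor is a design choice; either denominator can be passed as the reference count.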
6.1.5 Expression Experiments
Expression experiments with and without induction were performed on E. coli cells containing
pKK223-OlgaA-C and pKK223-K31A-C. RNA extractions were performed using Trizol and
RNA was treated with deoxyribonuclease I (Invitrogen) according to the manufacturer’s
instructions. PCR amplification was performed on the DNase-treated samples to confirm the absence of DNA
contamination in the RNA samples. DNase-treated RNA was used as template for first-strand cDNA
synthesis. RNA, 50 ng µL⁻¹ random hexamers, 10 mM dNTP mix and diethylpyrocarbonate
(DEPC)-treated water were incubated at 65 °C for 5 minutes and then at 4 °C for 1 minute. For each
sample, 40 units of RNase inhibitor (RNaseOUT, Invitrogen) and 200 units of SuperScript III reverse
transcriptase (Invitrogen) were added, and the reaction was incubated at 25 °C for 5 minutes and then
heated to 50 °C for 1.5 hours. The reaction was stopped by heating to 70 °C for 15 minutes. Quantitative PCR
(qPCR) was performed using SYBR Green I technology on an ABI 7300 Sequence Detection
System (Applied Biosystems). A master mix for each PCR run was prepared with SYBR Green
PCR Master Mix (KAPA Biosystems) and 0.5 μM primers. The following amplification
program was used: 95°C 2 min, 40 cycles at 95°C for 15 s followed by 58°C for 35 seconds. A
dissociation step was added (95°C for 15s, 60°C for 30s, 95°C for 15s) to produce melting
curves of products that could be analyzed for primer dimers and PCR artifacts. Representative
samples were run on a 1% agarose gel to confirm that products were the expected size.
Dilutions of genomic DNA from Burkholderia sp. str. OLGA172 ranging from 10¹ to 10⁻⁴ ng µL⁻¹
total DNA were included to create a standard curve for each primer set, and dilutions of cDNA
ranging from 50 to 1 ng were used to analyze primer efficiency on transcripts. PCR efficiencies
for all primers used were in the range of 90-110%. A standardized threshold setting of 0.8 units
above the background level was utilized for every experiment for consistency. Each sample was
normalized against 16S using the comparative ΔCt method (relative expression = 2^ΔΔCt).
Results were considered significant if there was a minimum 2-fold difference in expression
between treated and control samples (t-test, α = 0.05).
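The efficiency and relative expression calculations described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis script used here: efficiency is taken from the standard-curve slope as E = 10^(-1/slope) - 1, ΔCt is defined as Ct(16S) - Ct(target) so that the exponent in 2^ΔΔCt is positive for increased expression, and every Ct value below is hypothetical. The t-test is not reproduced.

```python
def regression_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def pcr_efficiency(log10_conc, ct_values):
    """Amplification efficiency from a standard curve: E = 10^(-1/slope) - 1."""
    slope = regression_slope(log10_conc, ct_values)
    return 10 ** (-1.0 / slope) - 1.0


def delta_ct(ct_reference, ct_target):
    """Delta Ct = Ct(16S) - Ct(target); assumed sign convention (larger = more transcript)."""
    return ct_reference - ct_target


def fold_change(dct_treated, dct_control):
    """Relative expression 2^(ddCt), with ddCt = dCt(treated) - dCt(control)."""
    return 2.0 ** (dct_treated - dct_control)


# Standard curve spanning 10^1 to 10^-4 ng/uL; the Ct values are hypothetical.
log10_conc = [1, 0, -1, -2, -3, -4]
standard_ct = [12.0, 15.3, 18.6, 21.9, 25.2, 28.5]
efficiency = pcr_efficiency(log10_conc, standard_ct)

# Hypothetical Ct values for one recombinase gene and the 16S reference.
induced = delta_ct(ct_reference=10.0, ct_target=22.0)
uninduced = delta_ct(ct_reference=10.0, ct_target=24.0)
fold = fold_change(induced, uninduced)
meets_threshold = fold >= 2.0 or fold <= 0.5  # the 2-fold criterion only

print(f"efficiency = {efficiency:.1%}, fold change = {fold:.1f}")
```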
6.2 Results
Most of this work was performed over two separate research terms at the Belgian Nuclear
Research Centre (SCK•CEN) and the results will therefore be discussed chronologically. In the
initial research term, the goal of the experiments was to determine whether the RIT element
could excise and insert into a conjugative plasmid without the addition of a specified target site.
For this reason, a mating-out assay was designed. This was necessary since the project was
limited in duration (3 months) and a target site was not readily apparent from the examinations
performed at that stage. It was also anticipated that strong induction could potentially overcome
the need for a specific target site. To assess the mobility potential of RIT elements, I created
an expression plasmid containing only the open reading frames for the three recombinases that
make up the RIT element, under the control of an IPTG-inducible promoter. Initial expression
constructs were created using the recombinase open reading frames from each of Burkholderia
sp. OLGA172 and Caulobacter sp. K31 in the pKK223.1 vector backbone (to create pKK223-
K31A-C and pKK223-OlgaA-C). The complete RIT element in which the recombinase open
reading frames were replaced with a single kanamycin resistance gene (RIT::Km cassette) was
inserted into a pACYC184 vector backbone to create the donor vector. A conjugative plasmid
(pOX38) was also present in the donor strain to act as a recipient of the mobilized kanamycin
gene and to facilitate transfer to the recipient cells. Donor cells containing all three plasmids in
an E. coli DH5α (nalidixic acid resistant) background were mated with E. coli HB101 cells
(streptomycin resistant). Provided that the kanamycin resistance gene had been transferred to the
pOX38 conjugative plasmid, transconjugants would be selected on LB-Km-Sm. All constructs
were confirmed initially by restriction digest analysis and then sent for sequencing to confirm
key regions of the sequence. For the pKK223 expression plasmids, the matings were performed
prior to receiving sequence confirmation due to time restrictions.
6.2.1 No evidence of intracellular mobility without a target site
Although there was a surprisingly high level of spontaneous double resistant mutants in the
pKK223-K31A-C mating, there were no confirmed instances of a kanamycin gene being
transferred to the recipient. There were no spontaneous mutants observed for the pKK223-
OlgaA-C mating. The first round of sequencing suggested a possible point mutation in the first
recombinase for each of the expression constructs (both Olga and K31), which may have
rendered the proteins non-functional. These mutations were originally disregarded due to their
proximity to the sequencing primers, but were confirmed with subsequent re-sequencing. Also,
expression experiments revealed that recombinase expression was observed both with and
without induction. For this reason, it was determined that the pKK223 constructs did not
provide sufficient control over the recombinase genes.
There were significant differences in the expression of individual integrase genes in the
expression constructs derived from K31 and Olga (Figure 6.2.1). Although recombinase
expression was only slightly increased upon induction, it was clear that the first recombinase of
each RIT element had the highest expression, presumably due to the presence of the pKK223
promoter upstream. The expression of the second recombinase was significantly decreased
relative to the first gene (p<0.05 for K31 and p<0.005 for Olga). However, expression of the
third recombinase decreased in the K31 construct but increased in the Olga construct (p<0.05
for LB and p<0.005 for IPTG) compared to expression of the first recombinase. PCR products
were obtained from the cDNA using primers designed to amplify from the first to the second
recombinase, indicating that these were transcriptionally linked; however, no product was
produced from primers linking the second and third recombinase genes. As these constructs
were not going to be utilized for further studies, the reasons for increased int3 expression in the
Olga vector were not further investigated.
Figure 6.2.1: Expression of recombinase genes from pKK223-OlgaA-C and pKK223-
K31A-C expression vectors.
Values are relative abundance of integrase expression to 16S expression. Expression did not
differ significantly between induced and un-induced cultures. Significant differences between
individual recombinase genes (both within and between species) are discussed in the text.
Upon returning for a second research term, it was decided that new constructs would be
created in the pTrc99A vector backbone such that the recombinase genes would be under the
control of the more stringent lacIq regulator. Initial experiments again took place without target
site sequences in the form of a conjugative mating-out assay with the original donor plasmid.
Although induction of the recombinase genes had a clear impact on cell density (see Table
6.2.1), there were no transfers of kanamycin resistance to recipient cells, and no evidence for
recombination or rearrangement between the plasmids found in the donor cells. PCR
amplification from primers designed to amplify outwards from the kanamycin cassette
suggested that the RIT element was being excised (Figure 6.2.2); however, attempts to confirm
the presence of a restored backbone lacking the kanamycin cassette were unsuccessful both by
PCR and based on plasmid isolations. Therefore, if the RIT element is excised in the absence of
a specific target site, it occurs at levels below the detection limit for these methods.
Table 6.2.1: Decrease in optical density of cell cultures after induction with IPTG.
Constructs | uninduced (OD600) | 1 mM IPTG (OD600)
pTrc99A (empty vector) | 0.901 (0.052) | 0.842 (0.007)
pTrc99K31A-C | 0.736 (0.057) | 0.213 (0.001)
pTrc99K31-RITA | 0.695 (0.044) | 0.342 (0.006)
pTrc99K31-RITB | 0.787 (0.027) | 0.664 (0.016)
pTrc99K31-RITC | 0.666 (0.014) | 0.663 (0.017)
Figure 6.2.2: PCR amplification using primers designed to amplify out from the
kanamycin gene.
Lanes 1 and 7 are GeneRuler 1 kb plus ladder. The 700 bp product seen in lane 6 is consistent
with that expected if the kanamycin cassette were being excised, and the large bright band
evident in lanes 2, 4 and 5 is consistent with the complete plasmid backbone with kanamycin still
present. Note that the smaller product is also evident in lane 2, which contains only the
p184::Km donor plasmid without the expression plasmid. Lane 3 contains only the expression
plasmid and lanes 4 and 5 have both a donor and an expression plasmid present.
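The effect of induction on cell density in Table 6.2.1 can be expressed as a percentage decrease per construct. The sketch below uses only the first OD600 value of each table cell; the values in parentheses are left out because the table does not state what they represent. The function and variable names are illustrative.

```python
# First OD600 value of each cell in Table 6.2.1: (uninduced, 1 mM IPTG).
od600 = {
    "pTrc99A (empty vector)": (0.901, 0.842),
    "pTrc99K31A-C": (0.736, 0.213),
    "pTrc99K31-RITA": (0.695, 0.342),
    "pTrc99K31-RITB": (0.787, 0.664),
    "pTrc99K31-RITC": (0.666, 0.663),
}


def percent_decrease(uninduced: float, induced: float) -> float:
    """Percentage drop in OD600 relative to the uninduced culture."""
    if uninduced <= 0:
        raise ValueError("uninduced OD600 must be positive")
    return 100.0 * (uninduced - induced) / uninduced


decreases = {name: percent_decrease(u, i) for name, (u, i) in od600.items()}
for name, pct in decreases.items():
    print(f"{name}: {pct:.1f}% lower OD600 after induction")
```

On these values the construct carrying all three recombinases shows the largest drop, followed by the construct carrying RITA alone, while the RITC construct is essentially unchanged.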
6.2.2 Target site identification
A potential target site sequence was initially elusive. As identified in
Chapter 4, RIT elements found in multiple copies within a strain are commonly identical up to the
ends of the 30-38 bp inverted repeats presumed to mark the ends of the element. Despite
the fact that all 3 RIT elements found in Caulobacter sp. K31 were 100% identical and had
targeted the same gene (DUF1738) for insertion, in silico removal of the RIT sequences to the
end of the inverted repeats did not result in the reconstruction of the original genes. Further
investigation revealed that there was an additional sequence, a perfect 20 bp palindrome, that
was adjacent to one of the terminal inverted repeats. Whether this palindrome occurred
upstream or downstream of the RIT element (relative to recombinase transcription) was not
consistent. It was determined that the location of the palindrome correlated with the direction of
transcription of the target gene (in this case the DUF1738 gene) as opposed to the RIT element
recombinases (Figure 6.2.3). A Blast search of the 20 bp palindrome revealed that the sequence
did not exist in other DUF1738 genes lacking a RIT element, but instead revealed additional
RIT elements that had not been previously identified. Therefore, it was determined that this
palindrome sequence must be a component of the RIT element. Further, I hypothesized that an
inversion of the RIT element relative to the palindrome must occur either during or after
integration in the target site.
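A sequence is a perfect palindrome in the DNA sense when it equals its own reverse complement. The check below is a minimal sketch of that test applied to the 20 bp K31 palindrome listed in Table 6.2.2; it is not the procedure used to find the palindrome, and the function names and the mutated comparison sequence are introduced here for illustration only.

```python
# Translation table for complementing lower- and upper-case DNA bases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_perfect_palindrome(seq: str) -> bool:
    """True when the sequence reads the same on both strands (5' to 3')."""
    return len(seq) > 0 and seq == reverse_complement(seq)


k31_palindrome = "ttatgccgatatcggcataa"   # 20 bp palindrome, Caulobacter sp. K31
mutated = "ttatgccgatatcggcatag"          # hypothetical single-base change

print(is_perfect_palindrome(k31_palindrome))
print(is_perfect_palindrome(mutated))
```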
Figure 6.2.3: Orientation of RIT elements in Caulobacter sp. K31 relative to the direction
of the target gene DUF1738.
Genomic locations are given to the left of each diagram. ‘IR’ designates the inverted repeats
that occur at each end of the recombinases.
Removal of the complete RIT element including the palindrome sequence allowed for the
perfect reconstruction and alignment of the DUF1738 target genes of K31 and revealed the
original target site sequence. Alignment of the same region of DUF1738 from other strains in
which there was no evidence of RIT element insertion was used to determine a second potential
target site. The latter differed from the K31 derived site by 4 bp (gtcg vs. gggc). With this
information, two target site vectors were created in pACYC184 to act as recipients for the
mobilized kanamycin gene, termed pACYC-TSV1 and pACYC-TSV2. Each contained a 45 bp
target sequence in a pACYC184 vector backbone containing only tetracycline resistance for
selection. Each target site plasmid was electroporated separately into a strain containing the
pTrc99-K31A-C expression plasmid to create recipient strains with a putative target site and
inducible recombinase genes. The suicide construct (pSF100-RIT::Km) was introduced by
conjugation and the final experimental design is illustrated in Figure 6.2.4. Transconjugants
capable of growth in both kanamycin and tetracycline were tested for kanamycin insertion in the
target site via PCR.
Maintenance of the suicide construct within the recipient cell, and thus the detection of a
high number of false positive (TcR and KmR) cells, proved to be a feature of this experimental
design. Because this might have been caused by an active Mu phage carried by the S17-1 donor strain
facilitating recombination between the replicons (Ferrières et al. 2010), conjugation experiments
were also performed in a Mu-free donor strain (MFDpir); however, high false positive rates were
still observed with this strain. Nevertheless, movement of the kanamycin cassette
specifically into the target vector was confirmed by sequencing of positive clone products:
TSV1A resulting from the original S17-1 mating with the target site 1 recipient and TSV2A
resulting from the MFDpir mating with the target site 2 recipient.
Figure 6.2.4: Final experimental design.
The recombinase enzymes are represented by blue circles labeled A, B and C although there is
no evidence yet to suggest that all three are required or where they may bind. The sequences for
the palindrome are written above and the putative binding sites are in bold font. Induction of the
expression plasmid would result in production of the three recombinase proteins which would
then be free to act on the inverted repeats flanking the kanamycin gene and mobilize it into the
target site plasmid.
6.2.3 Sequencing analysis of transconjugants
Analysis of the sequence surrounding the kanamycin cassette after insertion in the target
sites shed some light on the mechanism of insertion. For both the TSV1A and TSV2A
recombinants, the kanamycin gene has been inserted in the opposite orientation relative to the
palindrome when compared to the original suicide vector. Using two different target sites
illustrated that the target site sequence itself was unchanged when the element inserted – with
only the first target site this couldn’t be determined since the sequence flanking the kanamycin
cassette matched TSV1 (see Figure 6.2.5). From these recombinants, it is clear that target site
sequences are unchanged with RIT insertion, and that the 4 bp sequences on each end of the
palindrome have been altered. Therefore it would appear that the strand exchange occurs at
both ends of the palindrome, as opposed to in the centre of the palindrome as would be expected
based on other known cross-over regions (Hallet et al. 2004).
Figure 6.2.5: Reversal of RIT element in positive transconjugants.
Labels are included on the left. The first 4 bases correspond to the portion of the target site that
differs between target site 1 and 2 (gtcg/gggc respectively), and the last 4 bases (cact)
correspond to the continuation of the target site. ‘IR’ represents the inverted repeats that flank
the kanamycin cassette. The palindrome and inverted repeat sequences are unchanged after
recombination.
As illustrated in Figure 6.2.6, the TSV1A recombinant has a plasmid that is larger than the
donor plasmid (pSF100-RIT::Km), which is 4.4 kb. The original target site plasmid was 2.3 kb
and therefore the addition of the kanamycin cassette should have resulted in a plasmid of 3.2 kb.
Sequencing was inconclusive; however, restriction digestion suggests that TSV1A has two
copies of the original pACYC184 backbone connected by the kanamycin cassette. This can be
seen in Figure 6.2.6, as the HindIII digestion contains all the original bands from the target site
plasmid and two additional bands consistent with the kanamycin cassette inserted into the target
site plasmid. It is possible that both the original target plasmid and the recombinant were present
in the same cell; however, there is no original target plasmid visible on the mini-prep gel, and the
digest bands are of equal intensity, which suggests equal amounts of both plasmids. Therefore, if
both versions had been present in the cell they should have been visible prior to digestion.
Figure 6.2.6: Mating results for the recipient strain containing pTrc99-K31A-C and
pACYC-TSV1.
The miniprep results show the number of plasmids in each strain (A) and the HindIII digest
illustrates that the recombinant (TSV1A, labeled +I in the figure) has maintained the original
recipient plasmid bands and also acquired two new bands. The uninduced strain (-I) has bands
corresponding to all three of the plasmids. Labels are listed in the legend and the first lane of
each gel has the GeneRuler 1 kb plus ladder.
By contrast, the TSV2A recombinant plasmid runs much farther on the gel when
undigested than even the original constructs (Figure 6.2.7). However digestion and sequencing
confirmed it to have the expected size and the sequence corresponded to one copy of the
pACYC184 backbone and the kanamycin cassette inserted specifically in the target site.
Digestion of the TSV1A and TSV2A plasmids using enzymes that should produce a single
linear band (BamHI) confirmed these differences. For the TSV2 plasmid there was a single band
consistent with expectations for the recombinant plasmid; however, the target site 1 recombinant
plasmid gave bands of both the original plasmid and the recombinant plasmid (data not shown).
Figure 6.2.7: Target site 1 transconjugants retaining both kanamycin and tetracycline
resistance.
Lane 1 contains the GeneRuler 1 kb plus ladder. Lane 2 is the TSV2A clone. The small
plasmid is the pACYC-TSV2 plasmid with the kanamycin cassette inserted. The larger band
was lost from the strain after sub-culturing. Lane 3 is the TSV1A plasmids (expression plasmid
and recombined pACYC-TSV1 with Km) and lane 4 is the original recipient strain.
It is important to note that RIT element mobility was only observed during conjugation, as
this may be important to understanding the mechanism of mobility. Although the high false
positive rate made it difficult to find positive recombinants, it also provided an opportunity to
try inducing the plasmids when they were all present in one strain. This induction was
performed on both the TSV1A positive recombinant clone and the previously uninduced clone
that had retained all three original plasmids (designated ‘–I’ in Figure 6.2.6). Plasmids were
collected from a large volume of the induced cells and no rearrangements of any kind were
observed in the subsequent plasmids either by gel or PCR analysis.
In the collection of KmR/TcR clones, there were a number of potential recombinants in
addition to TSV1 and TSV2. These appear to have larger plasmids, perhaps as a result of
unresolved co-integrate structures (similar to the plasmid visible in Figure 6.2.7 above the TSV2
recombinant plasmid). Primers designed to amplify across the target site revealed a putative co-
integrate that had an altered target site. Sequencing analysis of this clone (I4) confirmed it to be
a co-integrate of the donor and target plasmids. Sequencing out from the kanamycin gene
revealed the presence of both the donor (pSF100) and target (pACYC184) backbones but no
palindrome was found adjacent to either inverted repeat. Beyond each of the inverted repeats
the sequence matched to the same half of the target site sequence. Clone I4 was a result of the
target site 1 mating, and therefore the target site sequence was identical to the sequence flanking
the kanamycin cassette (28 bp on one end and 17 bp on the other end), which explains the
presence of two copies of the target site sequence. Sequencing from the vector backbones
revealed the palindrome sequence to be separate from the kanamycin cassette and flanked on
either side by the other half of the target site (illustrated in Figure 6.2.8).
Figure 6.2.8: Sequencing results of co-integrate structure of clone I4.
The original target site and suicide vectors are shown at the top. The coloured boxes represent
the sequences in common between the two (less than 30 bp for each). The bottom figure shows a
simplified version of the co-integrate illustrating the location of the target site sequences and
palindrome at the junction of the two plasmids.
6.2.4 Application of these Results to other RIT Elements
In recognizing the importance of the palindrome sequence to the mechanism of these novel
elements, I performed an in silico search specific to the palindrome/inverted repeat arrangement.
This search led me to identify additional RIT elements in the database (included in supplemental
table 1). This suggests that RIT elements may be grouped according to conservation of their
palindromes. Palindrome conservation groups span a wide range of species. The palindrome
sequences were most variable in the centre region, and in some cases this central core was no
longer a perfect palindrome sequence, suggesting that conservation of the key motifs is
functionally important as opposed to maintenance of a perfect palindrome structure. As can be
seen in Table 6.2.2, the conserved sequences in the palindromes also correspond with conserved
sequences in the inverted repeats (the presumed binding sites identified in chapter 4). Whether
these homologous sequences correspond to binding sites or facilitate the creation of a stem-loop
structure has not been determined.
Table 6.2.2: Conserved sequences found in a variety of alpha- and beta-Proteobacteria
containing RIT elements.
Strain | Palindrome Sequence | Inverted Repeat Sequence
Caulobacter sp. K31 | ttatgccgatatcggcataa | cataatgccgcgatccggattatgccg
Sinorhizobium medicae WSM419 pSMED02 | ttatgccgatatcggcataa | cattatgccgtacgccggattatgccgcatggcc
Acidiphilium cryptum JF-5 pACRY03 | ttatgccgatatcggcataa | cataatgccgtgattcggattatgccgcatgacc
Novosphingobium PP1Y | ttatgccgatatcggcataa | taatgccgtgacccggattatgccg
Acidiphilium multivorum AIU301 | tgccccttatgccgacatcggcataaggggca | taatgccgagatccggattatgccg
Frankia sp. EAN1pec | ttatgccgacgtcggcataag | ttatgccgagggccgggttatgccg
Cupriavidus metallidurans CH34 RITCme1 | catgccgctagcggcatg | ttatgccgactccccgattatgccg
Burkholderia sp. Ch1-1 | cctgtcatgccgctagcggcatgacagg | ttatgccgacttcccgattatgccg
Mesorhizobium loti st. NZP2037 | ttatgccgacgtcggcataa | ttatgccgatgtccggattatgccg
Phaeobacter gallaeciensis DSM26640 | ttatgccgacatcggcataagg | cataatgccgatgttcagattatgccgcg
Acidovorax sp. KKS102 | cgctgcttatggagagctctccataagcagcg | gcagcgttatgcacagcacgcagttatgcacagttgg
Leptothrix cholodnii SP6 | ctgcttatggagagctttccataagcag | gcagcgttatgcacagcacgcagttatgcacagt
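The distinction between perfect and imperfect palindromes drawn above can be checked computationally. The following is a minimal Python sketch, not the in silico search used in this work: it compares each sequence with its own reverse complement, and the function names are illustrative assumptions. The three sequences are copied from Table 6.2.2.

```python
# Complement table for lower-case DNA, matching the case used in Table 6.2.2.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.lower().translate(COMPLEMENT)[::-1]


def palindrome_mismatches(seq: str) -> int:
    """Count positions where seq differs from its own reverse complement.

    Zero means a perfect palindrome. An odd-length sequence can never
    score zero because its central base cannot pair with itself.
    """
    rc = reverse_complement(seq)
    return sum(a != b for a, b in zip(seq.lower(), rc))


# Three palindromes copied from Table 6.2.2.
palindromes = {
    "Caulobacter sp. K31": "ttatgccgatatcggcataa",
    "Frankia sp. EAN1pec": "ttatgccgacgtcggcataag",
    "Leptothrix cholodnii SP6": "ctgcttatggagagctttccataagcag",
}

for strain, seq in palindromes.items():
    n = palindrome_mismatches(seq)
    label = "perfect" if n == 0 else f"imperfect ({n} mismatched positions)"
    print(f"{strain}: {label}")
```

A count of mismatched positions is only a crude score; it does not attempt to realign sequences that carry an extra flanking base.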
6.3 Discussion
Tyrosine-based site-specific recombinases (TBSSRs) are a broad group of enzymes
that perform conservative DNA recombination through the coordinated breakage, exchange
and resealing of all four DNA strands (Hallet et al. 2004). There are approximately 400,000
TBSSRs in the NCBI database, many of which can be assigned to one of 24 sub-families based
on conserved domains in the C-terminal catalytic domain (www.ncbi.nlm.nih.gov/cdd). Those
associated with mobile genes have been further divided into putative role-specific sub-families
(Van Houdt et al. 2012). Only a small number of these enzymes have been characterized
biochemically and they exhibit extensive diversity in both their recombination mechanisms and
the nature of the attachment sites. This is not unexpected given the varied roles that these
enzymes perform in the cell. These functions can be separated into three different categories –
chromosome or plasmid maintenance (by ensuring correct separation of multimers), intercellular
distribution (phages, ICEs and genomic islands) and intracellular generation of diversity (phase
switching and cassette integration) (Hallet et al. 2004; Subramanya et al. 1997; Tribble et al.
2000; Tirumalai et al. 1997; Guo et al. 1997; Cheetham and Katz 1995; Rowe-Magnus and
Mazel, 2001).
Tyrosine-based site-specific recombinases are essential for the correct separation of
circular replicons. The best studied representative is the XerCD/dif system which functions to
resolve chromosomal dimers produced during replication (Hallet et al. 2004). These
recombinases are distinguished from those utilized in homologous recombination since they
require only short (~30 base pair) sequences to perform recombination. These sites are referred
to as the “core” or “cross-over” site and usually possess dyad symmetry that facilitates the
binding of the recombinases to recognition motifs (Hallet et al. 2004). The DNA strands are cut
and exchanged at the borders of the central region separating the recognition motifs (Hallet et al.
2004).
Large conjugative plasmids and other mobile elements that are self-transmissible
commonly encode their own site-specific recombinase adjacent to the recombination site for
integration (Hallet et al. 2004). These sites can consist of only the core site (as in the Cre/loxP
system), or be more complex. The relative positioning of recombination sites specifies whether
the recombination reaction will result in integration, excision or inversion of the intervening
DNA. When the sites are located on a single replicon, directly repeated recombination sites will
cause excision, while inverted repeated sites cause inversion (Hallet et al. 2004). Tyrosine
recombinase systems can be specific to individual mobile elements or can be provided in trans
from the host chromosome (Huber and Waldor, 2002).
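The dependence of the outcome on site orientation can be illustrated with a toy model. This Python sketch is purely illustrative and makes several assumptions: the replicon is a string of arbitrary letter labels, the two recombination sites are written as '>' or '<' to show their orientation, and reversing the order of the labels stands in for inversion of the DNA segment. The function name `recombine` is hypothetical.

```python
def recombine(dna: str):
    """Toy outcome of recombination between two sites on one replicon.

    Sites are written as '>' or '<' (orientation); letters are arbitrary
    labels for the surrounding DNA. Returns (remaining_replicon, excised).
    """
    positions = [i for i, ch in enumerate(dna) if ch in "<>"]
    if len(positions) != 2:
        raise ValueError("expected exactly two recombination sites")
    i, j = positions
    between = dna[i + 1 : j]
    if dna[i] == dna[j]:
        # Directly repeated sites: the intervening DNA is excised (as a
        # circle carrying one site) and one site remains on the replicon.
        return dna[: i + 1] + dna[j + 1 :], between
    # Inverted sites: the intervening DNA is flipped in place.
    return dna[: i + 1] + between[::-1] + dna[j:], None


print(recombine("AB>CDE>FG"))  # direct repeats
print(recombine("AB>CDE<FG"))  # inverted repeats
```

Integration is the reverse of the excision case: a circle carrying one site recombines with a replicon carrying the other, leaving the inserted DNA flanked by directly repeated sites.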
Many transposons encode separate integration and resolution systems. Two examples
using a site-specific resolution system are the Tn3 family and “Mu-like transposons” including
Tn552 and the Tn5053/Tn402 family (Hallet et al. 2004). For both of these systems, the
(usually DDE) transposase initiates the creation of a co-integrate which joins the donor and
target sequence through two directly repeating copies of the transposon. These co-integrates are
then resolved through the activity of the site-specific recombinase through intra-molecular
recombination between the two copies at the transposon resolution site (res), resulting in one
copy of the transposon in each location (Hallet et al. 2004). For the majority of Tn3 members,
the resolution occurs through the action of a serine SSR commonly referred to as the resolvase.
Characterized res sites of Tn3 family members contain three
12 bp inversely oriented binding sites, the first of which is the recombination core site, while the
other two are accessory elements required for recombination to proceed (Hallet et al.
2004). The sequence identity and spacing between these three sites vary among
different members, and it has been determined that some elements (including Tn552, ISXc5,
Tn1546 and TnXO1) may each contain direct repeats instead of inversely oriented motifs at one
of the two accessory binding sites (Hallet et al. 2004). A sub-family of the Tn3
elements (including Tn4430, Tn5401 and the Tn4651/Tn5041 families) utilizes a
tyrosine-based site-specific recombinase (TBSSR) for the resolution of the transposase-driven
co-integrates (Hallet et al. 2004).
These experiments demonstrated the movement of a Km cassette carried within a RIT
structure to a recipient plasmid harboring either of two closely related target sites, but only
during conjugative events. The target site was identified after careful examination of the gene
sequence uncovered the presence of a palindrome that mobilized as part of the RIT element.
This finding provides clues to the integration event mechanism.
The results obtained so far suggest that RIT elements can be transferred between
replicons within a bacterial cell during the process of conjugation and that integration occurs at
the ends of the palindrome sequence. This suggests that either conjugation or the presence of
single-stranded DNA is central to RIT activity. Analysis of the sequence surrounding the
kanamycin cassette after insertion in each target site shed some light on the mechanism of
insertion. In both cases the palindrome appears downstream of the kanamycin gene in terms of
transcription, whereas it is upstream in the original construct in pSF100. This suggests that the
palindrome location is determined by the target site sequence, that the palindrome may
serve as the attachment and integration site, and that the RIT::Km is inverted either during or after
integration. Previously characterized tyrosine recombinases (such as XerC/D and Cre) bind to
sites exhibiting dyad symmetry, and crossover occurs at the centre of this symmetry (Hallet et al.
2004). As can be seen in Table 6.2.2, there are complementary sequences found in both the
palindrome and the inverted repeats. It is therefore proposed that the RIT element recombinases
may bind to one half of the palindrome sequence and to the complementary sequence within the
inverted repeats, and that crossover occurs between the palindrome and the inverted repeat
(Figure 6.3.1). If the crossover occurs between the palindrome and the inverted repeats, then the
core sites are also more consistent with those seen for XerC/D, since the string of A/T is internal
and the G/C is external to the crossover region.
Figure 6.3.1: Model for RIT element mobility based on experimental results.
IR indicates the locations of the inverted repeats. Illustration of palindrome direction relative to
kanamycin transcription in pSF100-RIT::Km suicide construct (top) and in the transconjugants
obtained (middle). The bottom picture shows the proposed binding sites and crossover regions in RIT
integration involving a circular intermediate. Key residues predicted to be involved in binding
are shown in bold, and strand exchange occurs between the palindrome and the inverted repeats.
Site-specific recombination events were detected in these experiments, but many more
may have been found if not for the high rate of false positives. These were due to the
independent maintenance of the suicide construct within the recipient cell, or to recombination
events that appear to be separate from RIT activation. As discussed in the results, the
substitution of the MFD strain to replace S17-1 did not eliminate the false positive issue,
suggesting that the source of the issue with pSF100 is not the active Mu phage described in the
latter strain. There were significant regions of homology (~ 200 bp) between the suicide
construct and both the donor and expression plasmids found in the recipient cell. This should
not be an issue in a recA1 mutant background, but it has been shown that ATP-independent
re-annealing of single-stranded substrates can still occur although strand exchange is eliminated
(Bryant and Lehman 1986). I therefore cannot preclude the possibility that the plasmids are
becoming integrated with each other and our strong selection maintains these co-integrate
structures. In addition, the ATP-dependent functions of the RecA1 protein can be partially
restored at a pH of 6.5 or lower (Kawashima et al. 1984). As alterations in the pH during
conjugation were not monitored, there is the possibility that homologous recombination
accounts for a significant fraction of the false positives observed. A third possibility for the
source of this issue could be cross-reactivity with the active XerC and XerD homologs found in
the recipient strain. There have been phage elements described that do not carry their own site-
specific recombinases but rather depend on the action of the host recombination machinery to
facilitate their integration (Huber and Waldor, 2002). As these recombinases are essential to
chromosome separation and cell reproduction, it is not feasible to perform the experiments in a
XerC/D deficient background in order to determine whether these genes are contributing to the
high false positive rate.
Although there is currently insufficient evidence to speculate on the role that RIT
elements may play in the cell, the results obtained in this study indicate that they may be
specifically active during the process of conjugation. The lack of kanamycin movement upon
induction of the recombinases in cells already possessing all three plasmid constructs
(expression, target and suicide construct) was in sharp contrast to the diversity of arrangements
obtained when induction occurred as the RIT::Km cassette was conjugating in. This is consistent
with the data obtained in Chapter 4 that indicates that RIT elements are commonly associated
with one or more plasmids in an individual strain. This activation could be similar to integrons,
where a single-stranded substrate is necessary for integration of gene cassettes to occur, or could
be indicative of a role for these genes in the acquisition of genes directly from incoming
plasmids regardless of the ability of that plasmid to be maintained long term within the
recipient cell. In this manner, having RIT elements specifically active during conjugation
events would be a powerful means of generating diversity from transient plasmid
associations.
6.4 Acknowledgements
This project was funded through the W. Garfield Weston Foundation Doctoral Fellowship
Program. Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D
Scholarship to NR is also gratefully acknowledged. The funding agencies had no role in this
study.
6.5 References
Bryant, F.R. and Lehman, I.R. 1986. ATP-independent renaturation of complementary DNA strands by the mutant recA1 protein from Escherichia coli. The Journal of Biological Chemistry 261(28):12988-12993.
Cheetham, B. F., & Katz, M. E. (1995). A role for bacteriophages in the evolution and transfer of bacterial virulence determinants. Molecular microbiology, 18(2), 201-208.
Ferrières, L., G. Hémery, T. Nham, A.M. Guérout, D. Mazel, C. Beloin and J.M. Ghigo. 2010. Silent mischief: Bacteriophage Mu insertions contaminate E. coli random mutagenesis performed using suicidal transposon-delivery plasmids mobilized by broad-host range RP4 conjugative machinery. J. Bacteriol. 192(24):6418-27.
Grindley, N.D.F., Whiteson, K.L., and Rice, P.A. 2006. Mechanisms of Site-Specific Recombination. Annu. Rev. Biochem. 75:567-605.
Guo, F., Gopaul, D. N., & Van Duyne, G. D. (1997). Structure of Cre recombinase complexed with DNA in a site-specific recombination synapse. Nature, 389(6646), 40-46.
Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips. ASM Press, Washington, D.C. USA.
Huber, K. E., & Waldor, M. K. (2002). Filamentous phage integration requires the host recombinases XerC and XerD. Nature, 417(6889), 656-659.
Kawashima, H., Horii, T., Ogawa, T. and Ogawa, H. 1984. Functional domains of Escherichia coli recA protein deduced from the mutational sites in the gene. Mol Gen Genet (molecular and general genetics) 193(2):288-292.
Ricker, N., Qian, H., and Fulthorpe, R. 2013. Phylogeny and Organization of Recombinase in Trio (RIT) Elements. Plasmid. 70(2):226-239.
Rowe-Magnus, D. A., & Mazel, D. (2001). Integrons: natural tools for bacterial genome evolution. Current opinion in microbiology, 4(5), 565-569.
Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev. 38(5):865-891.
Subramanya, H. S., Arciszewska, L. K., Baker, R. A., Bird, L. E., Sherratt, D. J., & Wigley, D. B. (1997). Crystal structure of the site‐specific recombinase, XerD. The EMBO Journal, 16(17), 5178-5187.
Tirumalai, R. S., Healey, E., & Landy, A. (1997). The catalytic domain of λ site-specific recombinase. Proceedings of the National Academy of Sciences, 94(12), 6104-6109.
Tribble, G., Ahn, Y. T., Lee, J., Dandekar, T., & Jayaram, M. (2000). DNA recognition, strand selectivity, and cleavage mode during integrase family site-specific recombination. Journal of Biological Chemistry, 275(29), 22255-22267.
Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.
Van Houdt, R., Leplae, R., Mergeay, M., 2012. Towards a more accurate annotation of tyrosine-based site-specific recombinases in bacterial genomes. Mobile DNA 3(6) doi:10.1186/1759-8753-3-6
Chapter 7
Developing a standardized method for analyzing gene content of bacterial communities in streams with varying degrees of urbanization
7 Introduction
A key challenge in characterizing the mobilome of environmental samples is the ability
to draw comparisons between diverse environments. The ideal study involves collecting
samples before and after an environmental disturbance; however, this is limited to anticipated
point source contamination events. Moreover, the information obtained can only be utilized in
drawing comparisons specific to that location and time point. Unfortunately, environmental
pollution is not limited to these discrete and known point source events. Anthropogenic
pollutants from domestic, industrial and agricultural settings contribute a diverse array of
chemical compounds to the environment (Gillings et al. 2015). Increased urbanization and
decreased vegetation likewise contribute to increased levels of environmental pollutants
(particularly polyaromatic hydrocarbons) surrounding human activities (Johnsen and Karlson,
2007).
Despite the inherent issues in drawing comparisons between sites with highly variable
anthropogenic impacts, a baseline community mobilome needs to be established from which
future studies can draw comparisons. There are a variety of metrics currently available for
classifying anthropogenic impacts on freshwater streams. In our region, the Ontario Benthos
Biomonitoring Network (OBBN) coordinates efforts to monitor impacts to both lakes and
streams and has developed appropriate methods for comparing the benthic invertebrate
populations between sites to establish anthropogenic impacts (Jones et al. 2007). The analysis of
benthic macroinvertebrate populations is a well-established biomonitoring tool for comparing
cumulative impacts of human activities in river systems (Rosenberg and Resh 1993; Wright et
al. 2000). However, understanding how bacterial communities are impacted at these sites is not
directly comparable to these macro-organism metrics. In order to determine which
environmental pollutants cause changes in bacterial diversity or gene content, there must be a
standardized bacterial community on which to perform testing. This standardized community
must account for varying bacterial populations within individual spatial niches as these can be
expected to respond differently to selection pressures. Gene transfer is
particularly efficient in biofilm communities; therefore, obtaining samples through filtering of
stream water is less than ideal since it minimizes the genetic contribution of these important
communities. However, it is equally difficult to account for differences in sediment composition
when comparing bacterial communities between streams, and this difficulty increases
when comparing relatively pristine (reference) sites with more channelized urban
streams. Moreover, individual sediment samples can be impacted by variations in groundwater
inputs, which can be a source of various pollutants. Finally, in order to be used in a risk
assessment framework, the bacterial community should represent a reasonable route of exposure
for individuals either through direct contact or downstream water usage. For these reasons, we
chose to utilize columns filled with a standardized substrate that the bacterial community
could colonize for a pre-determined length of time. This allows bacteria that are present
intermittently in the water column to colonize the sampler columns in addition to the ubiquitous
water-inhabiting bacteria.
I designed sand filled columns to capture and integrate the bacterial communities of streams for
study. These columns were attached to flotation devices and floated in the water column at a
shallow depth from the surface in 6 streams in southern Ontario, Canada. Two of the chosen
streams are minimally impacted by human activities, and the other four streams are moderately
impacted by a variety of pollutants (see Table 7.1.1). The sources of anthropogenic stress
included in this study (urbanization, waste water outflows, agricultural practices and landfill
leachate) were chosen in order to avoid strong selection by any particular pollutant and instead
focus on circumstances where the communities are exposed to a variety of stressors.
7.1 Materials and Methods
The reference site samplers were placed in rivers contained within the Saugeen Valley
Conservation Authority (SVCA) at sites chosen based on the 2010 water quality monitoring
report from this agency (SVCA, 2010) and are both located in streams actively monitored by the
Provincial Water Quality Monitoring Network (PWQMN). This region has a low road density
and includes the provincial reference sites utilized for the Ontario Benthos Biomonitoring
Network (OBBN) assessments (Jones, 2006). Sampler placement was chosen based on ease of
accessibility; however, neither sampler was placed in the precise location of the PWQMN station
for the reference sites due to concerns that these locations provided public access, which could lead to
tampering. The impacted sites were chosen in the more urbanized Lake Simcoe watershed,
based on recommendations from the Lake Simcoe Regional Conservation Authority (LSRCA)
staff and a 2004 study of contaminants found in the rivers of the Lake Simcoe watershed
(LSRCA, 2004). All impacted sites show significant accumulations of polyaromatic
hydrocarbons (PAHs), which is to be expected in an urbanized watershed. The Uxbridge Brook
site was chosen due to its use as a discharge stream for a wastewater treatment plant in the area
(at a distance of approximately 2.5 km from discharge to sampling site), and is the only site that
is located precisely at the PWQMN site. The Maskinonge River site did not have any
contamination that exceeded the provincial limits according to the 2004 report but was chosen
due to its location downstream of an intensive sod farm. According to the 2004 contamination
study, the West Holland River was heavily impacted by organochlorine pesticides, including
DDT and its breakdown product DDE, among other contaminants. The Dyment’s Creek
location was not highlighted in the 2004 study but was included due to the availability of
concurrent chemical analyses performed by researchers at Environment Canada. At this
location, chemical screening has been performed on the groundwater flowing beneath the stream
and results from previous years have been published (Roy and Bickerton 2011). Contaminants
found in this location are diverse and include volatile organic chemicals, metals and petroleum
products.
7.1.1 Sampling locations and collection of benthic invertebrates
Chemical data and site characteristics are listed in Table 7.1.1. Provincial water quality
monitoring data were available for all streams except for Dyment’s Creek; however, stream and
sediment data for this latter site were provided for 2011 by researchers at Environment Canada.
For each of the sampling locations, benthic communities were sampled according to the OBBN
protocols (Jones et al. 2007) and animals were preserved in 70% ethanol for transportation.
Organic material was isolated by density separation in concentrated salt solution (if necessary)
and animals were classified using microscope-assisted identification to at least the 27-group
level (details in OBBN protocol, Jones et al. 2007). When possible, Trichoptera, Ephemeroptera
and Coleoptera were identified to the family level in order to utilize a more accurate tolerance
value for the coarse Hilsenhoff biotic index calculations. Benthic sampling was not performed
on the West Holland Canal and Maskinonge River sites due to the depth of the river at these
locations, and samples were not collected from the North Saugeen River site in the fall of 2012
due to the presence of clam beds that should not be disturbed. Although samples were collected
in both the fall and the summer across multiple years for individual sites, only the fall counts
were used for analysis for consistency. Benthic counts were also obtained from each of the
conservation authorities in order to supplement the available data. Anthropogenic impact
was assessed using the coarse Hilsenhoff Biotic Index (cHBI; a
modification of the HBI developed by Hilsenhoff, 1987) and Simpson’s diversity index
(Simpson, 1949), as well as the percentage of relatively intolerant species recovered (combination of
Ephemeroptera, Plecoptera and Trichoptera; %EPT).
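The three metrics named above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the calculation used in this study: the Hilsenhoff-type index is taken as the abundance-weighted mean tolerance value, Simpson's diversity is taken in its 1 - sum(p_i^2) form (which approaches 1 for diverse communities), and the counts and tolerance values below are hypothetical, not OBBN values.

```python
def hilsenhoff_index(counts, tolerance):
    """Abundance-weighted mean tolerance: sum(n_i * t_i) / N."""
    total = sum(counts.values())
    return sum(n * tolerance[taxon] for taxon, n in counts.items()) / total


def simpson_diversity(counts):
    """Simpson's diversity as 1 - sum(p_i^2)."""
    total = sum(counts.values())
    return 1 - sum((n / total) ** 2 for n in counts.values())


def percent_ept(counts, ept=("Ephemeroptera", "Plecoptera", "Trichoptera")):
    """Percentage of individuals belonging to the three EPT orders."""
    total = sum(counts.values())
    return 100 * sum(counts.get(taxon, 0) for taxon in ept) / total


# Hypothetical counts and tolerance values, for illustration only.
counts = {"Ephemeroptera": 10, "Trichoptera": 5,
          "Chironomidae": 60, "Oligochaeta": 25}
tolerance = {"Ephemeroptera": 5, "Trichoptera": 4,
             "Chironomidae": 7, "Oligochaeta": 8}

print(hilsenhoff_index(counts, tolerance))
print(simpson_diversity(counts))
print(percent_ept(counts))
```

Because the index is a weighted mean of tolerance values, identifying groups to family level changes the result only through the tolerance value assigned to each group.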
Table 7.1.1: Sampling locations for river assessments.
Those in bold exceed the available recommended limits (Provincial Water Quality Objectives or
Canadian Water Quality Guidelines for Protection of Aquatic Life (SVCA, 2010)); There is no
listed guideline for PHCs; Abbreviations - PAH: polyaromatic hydrocarbon, OC:
organochlorine, PHC: petroleum hydrocarbon, VOCs: volatile organic compounds.
Site | Bank Width | Depth (avg) | Surrounding Region | Chemicals of Concern
Hamilton Creek | 11.9 m | 0.41 m | Forest |
North Saugeen River | 15.2 m | 0.46 m | Forest |
Uxbridge Brook | 6.5 m | 0.40 m | Downstream of sewage outflow | PAHs, Phenols, PHCs, Cr, Cu
Maskinonge River | 9 m | 0.40 m | Downstream of sod farming | Pesticides, Cr, Cu
West Holland River | Not determined | Not determined | Within Holland Marsh (agriculture) | PAHs, Phenols, OC pesticides, PHCs, Hg, Cd, Cr, Pb, Cu
Dyment’s Creek | 2-6 m | 5-50 cm | Historic landfill turned residential | VOCs
Figure 7.1.1: Map of sampling locations.
The two reference sites (orange triangles) are North Saugeen (NS) and Hamilton Creek (HC),
which are located within the SVCA. The four impacted sites (red triangles) are Dyment’s Creek
(DC), West Holland Canal (WH), Maskinonge Creek (MC) and Uxbridge Brook (UX) and are
all located within the LSRCA.
7.1.2 Sampler Design
All samplers were created from 1 ½” (inner diameter) polycarbonate tubes cut to one foot in
length and fitted with 1 ½” copper to DWM pipe adapters machined internally to fit to the pipe.
The ends of the adapters were fitted with screening material and nylon in order to prevent the
entry of invertebrates or litter from the stream. All samplers were filled with autoclaved, fine
grain sand. Samplers were floated mid-stream within 2 inches of the stream surface, and were
sub-sampled monthly throughout the four-month exposure period. At the end of four months the
samplers were retrieved and replicate samples were obtained from each portion of the sampler (inflow
end, center, outflow end) to determine within-sampler variation.
Figure 7.1.2: Aquatic environment bacterial community samplers.
Constructed samplers (A) were attached to 2 L. bottles to be used as flotation devices. The
devices were attached to cinder blocks to keep them in place in the stream (B).
7.1.3 Bacterial Community Assessment
DNA was extracted from the sampler substrate using a PowerSoil extraction kit (MoBio).
Terminal restriction fragment length polymorphism (T-RFLP) analysis using fluorescently labeled 16S
primers was used to compare the sampler diversity, and pyrosequencing was performed on
selected samplers to examine bacterial community diversity.
Initial T-RFLP comparisons between the inflow, center and outflow sub-samples were
performed to determine whether the bacterial communities were consistent throughout the
length of the sampler. Fluorescently labeled 16S primers (27F-FAM and 1492R-HEX) were
used for amplification and digestion was performed using AluI. Digested samples were
analyzed by the Guelph Molecular Supercenter Laboratory Services Division and statistical
analysis was performed using R (R-project.org). Sub-samples were subsequently combined (3
separate replicates of sub-samples where possible) for between-sampler comparisons, also by
T-RFLP using the same methods. Principal coordinate analysis (separate analyses for Bray-Curtis
and Jaccard distance measures) of the T-RFLP profiles from combined samples was performed using the
pco command in the ecodist package of R. Principal coordinate scores were compared to water
quality parameters, benthic invertebrate metrics and genetic (qPCR) data. Correlations were
automatically generated using the corr function.
Pyrosequencing was performed on one replicate of combined samples, using a Roche 454
FLX titanium instrument (MR DNA Molecular Research LP). Primers utilized were provided
by the facility (27Fmod and 530R) and targeted the V1-V3 16S region (Yarza et al. 2014). Data
analysis was performed using programs in the QIIME pipeline (Caporaso et al. 2010), including
Denoiser (Reeder and Knight, 2010) and UCLUST (Edgar, 2010). Sequences were rarefied to
1455 reads per sample (corresponding to the lowest read count obtained), OTUs were grouped
based on 97% similarity, and taxonomy was assigned according to the Greengenes Database
(DeSantis et al. 2006) files from May 2013. Beta diversity was evaluated using the vegan
package in R with either Bray-Curtis or Binary Jaccard settings.
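The rarefaction and beta diversity steps described above can be sketched in plain Python. This is a minimal illustration, not the QIIME or vegan implementation used here: the function names are assumptions, the OTU counts are hypothetical, and rarefaction is implemented as random subsampling of reads without replacement to a fixed depth.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample one sample's OTU counts to `depth` reads without replacement."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rarefied = {}
    for otu in random.Random(seed).sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples of OTU counts."""
    otus = set(a) | set(b)
    num = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    den = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return num / den


def binary_jaccard(a, b):
    """Jaccard dissimilarity on presence/absence of OTUs."""
    present_a = {o for o, n in a.items() if n > 0}
    present_b = {o for o, n in b.items() if n > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


# Hypothetical OTU counts for two samplers.
sample_a = {"otu1": 6, "otu2": 4}
sample_b = {"otu1": 2, "otu3": 8}

print(rarefy(sample_a, depth=5))
print(bray_curtis(sample_a, sample_b))
print(binary_jaccard(sample_a, sample_b))
```

Bray-Curtis weights shared OTUs by abundance, whereas binary Jaccard considers only which OTUs are present, which is why the two measures are reported separately.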
Table 7.1.2: Primers for quantitative PCR.
Primer Name | Sequence | Amplicon Size | Annealing Temp (°C) | Reference
qPCR-intI1F | ACCAACCGAACAGGCTTATG | ~286 bp | 62 | Nemergut et al. (2004)
qPCR-intI1R | GAGGATGCGAACCACTTCCAT | ~286 bp | 62 | Nemergut et al. (2004)
qPCR-16S-338F | ACTCCTACGGGAGGCAGCAG | ~200 bp | 63 | Fierer et al. (2005)
qPCR-16S-518R | ATTACCGCGGCTGCTGG | ~200 bp | 63 | Fierer et al. (2005)
sulI-F | CACCGGAAACATCGCTGCA | 158 bp | 55 | Cheng et al. 2013
sulI-R | AAGTTCCGCCGCAAGGCT | 158 bp | 55 | Cheng et al. 2013
IncP1 korA-F | TCATCGACAACGACTACAACG | 117 bp | | Smalla et al 2013
IncP1 korA-R | TTCTTCTTGCCCTTCGCCAG | 117 bp | | Smalla et al 2013
IS1071_qPCR-F | GCACCAAGTCTGGGAATGAT | ~200 bp | 60 | This study
IS1071_qPCR-R | ACGGGCATAGTGTTTCTTGG | ~200 bp | 60 | This study
IR_Olga | TTATGCCGATTCCCGGATTATGCCG | 3.5 kb | 54 | This study
IR_K31 | TAATGCCGCGATCCGGATTATGCCG | 3.5 kb | 54 | This study
IR_ambig | TWATGCCGIIIYCCSGATTATGCCG | 3.5 kb | 54 | This study
IR_less_ambig | TTATGCCGIIIYCCSGATTATGCCG | 3.5 kb | 54 | This study
7.1.4 Quantitative PCR
Quantitative PCR (qPCR) was performed using SYBR Green I technology on an ABI
7300 Sequence Detection System (Applied Biosystems). A master mix for each PCR run was
prepared with SYBR Green PCR Master Mix (KAPA Biosystems) and 0.5 μM primers. The
following amplification program was used: 95°C for 2 min, then 40 cycles of 95°C for 15 s followed by
60°C for 35 seconds. A dissociation step was added (95°C for 15s, 60°C for 30s, 95°C for 15s)
to analyze the melting curves of products for primer dimers and PCR artifacts. Representative
samples were run on a 1% agarose gel to confirm that products were the expected size. Primers
used for MGE comparisons between samplers are listed in Table 7.1.2. Primer efficiencies were
between 93% and 109%.
7.1.5 RIT inverted repeat primer design and PCR
The inverted repeats from the strains listed in Table 6.2.2 were aligned and used to
design ambiguous primers targeting the inverted repeats flanking their respective RIT elements.
Since the primers were designed to target the inverted repeats, the same primer would be
expected to anneal at each end and amplify the full RIT element. Two specific (non-ambiguous)
primers were also created targeting the inverted repeats for Burkholderia sp. OLGA172 and
Caulobacter sp. K31 individually. The two specific primers were tested to verify that they were
strain specific and the ambiguous primers were shown to amplify the RIT elements in both
strains. The ambiguous primers were tested on sampler DNA to search for RIT elements
bearing comparable inverted repeats that could be amplified. PCR was carried out using HotStar
Taq at 54°C plus 1 µL BSA per reaction, with an extension time of 3 minutes and 30 seconds.
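The relationship between the ambiguous and specific primers in Table 7.1.2 can be checked in silico by expanding the degenerate bases into a regular expression. The Python sketch below is an illustration, not the procedure used to design the primers. It assumes the standard IUPAC meanings of W (A or T), S (C or G) and Y (C or T), and it treats inosine (I) as matching any base, which is a simplifying assumption. The function names are hypothetical.

```python
import re

# Degenerate codes appearing in Table 7.1.2. Inosine ('I') is treated
# here as pairing with any base, which is a simplifying assumption.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "Y": "[CT]", "I": "[ACGT]",
}


def primer_to_regex(primer):
    """Expand a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def primer_matches(primer, target):
    """True if the target sequence is one of the sequences the primer encodes."""
    return primer_to_regex(primer).fullmatch(target.upper()) is not None


# Primer sequences copied from Table 7.1.2.
IR_OLGA = "TTATGCCGATTCCCGGATTATGCCG"
IR_K31 = "TAATGCCGCGATCCGGATTATGCCG"
IR_AMBIG = "TWATGCCGIIIYCCSGATTATGCCG"
IR_LESS_AMBIG = "TTATGCCGIIIYCCSGATTATGCCG"

for name, primer in [("IR_ambig", IR_AMBIG), ("IR_less_ambig", IR_LESS_AMBIG)]:
    print(name, primer_matches(primer, IR_OLGA), primer_matches(primer, IR_K31))
```

Under these assumptions IR_ambig covers both specific primer sequences, while IR_less_ambig covers only the IR_Olga sequence because its second position is fixed as T.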
7.2 Results
7.2.1 Macroinvertebrate metrics of ecosystem health
Prior to examining the microbial community from the samplers, the overall health of the stream
was estimated based on biomonitoring of benthic communities from the stream sediment.
Where possible, benthic animals were collected directly from the sites using the OBBN
approved kick and sweep method. Abundance and identification data were also obtained from
the relevant conservation authorities and these data were used to supplement the benthic
monitoring data acquired during this study. Table 7.2.1 shows the results of the biotic indices for
the four sites at which benthic animals could be obtained concurrent with sampling. Biotic
indices fluctuate seasonally; therefore, the data used for calculating these biotic indices
corresponded to the fall counts for all sites, which also coincided with the sampling season for
the conservation authorities.
Table 7.2.1: Comparison of field sites based on biotic indices of benthics obtained during
this study.
The coarse Hilsenhoff Biotic Index (cHBI) ranks sites with a score below 5 as healthy and
increasingly polluted above 5. Other indicators of a healthy benthic community include a high
Simpson’s diversity score (approaching 1) and high abundance of species known to be intolerant
to pollutants (%EPT).
Site | cHBI | cHBI Rating | Simpson's Diversity | %EPT
Dyment’s Creek | 6.93 | Poor | 0.48 | 5.69
North Saugeen River | 5.33 | Fair | 0.63 | 14.18
Uxbridge Brook | 5.23 | Fair | 0.79 | 16.75
Hamilton Creek | 6.12 | Fairly poor | 0.60 | 13.25
The SVCA sites historically show lower degrees of impact than the LSRCA sites (as
indicated by a lower cHBI value); however, the cHBI values calculated in this study were higher
than had been observed in previous years by the SVCA. The most recent benthic data provided
by this conservation authority dated from 2007, and therefore recent trends could not be
identified for these sites. However, land usage in this region has not changed in that time period,
and a 2010 water quality status report published by the conservation authority confirmed
that these particular sites had retained excellent water quality (SVCA, 2010). This is
supported by the data available from the PWQMN, which also indicate that there has not been a
drastic change in water quality during this period. For the LSRCA sites, only the Uxbridge
Brook site and the Dyment’s Creek site could be sampled by the kick and sweep method due to
the depth of the river at the other two locations. A sediment grab sample taken at
the Maskinonge River site was devoid of benthic
organisms, and the West Holland Canal was too deep for a grab sample to be
attempted. However, benthic counts were obtained from the conservation authority for the
Maskinonge River for the 2011, 2012 and 2014 fall sampling events at a nearby sampling
location used by the Provincial Water Quality Monitoring Network (PWQMN). The average
Hilsenhoff Family Biotic Index (FBI – which is the family level version of the cHBI) and %EPT
for this site were 6.35 and 2.43%, respectively (averaged across the three years, standard
deviation of 0.32 (FBI) and 0.92 (%EPT)). Since the other sites had not been analyzed to the
family level, the family-level benthic data obtained from the LSRCA were collapsed to the same
level of identification used for the other sites and a modified cHBI value of
5.90 was calculated. Therefore when the benthics were analyzed to only the 27-group level, both
Uxbridge Brook and the Maskinonge River gave better cHBI ratings than Hamilton Creek
(Table 7.2.1). Uxbridge Brook also had the highest percentage of sensitive species of benthic
invertebrates (Ephemeroptera, Plecoptera and Trichoptera) of any of the sites analyzed, which
generally indicates a healthy river ecosystem. No data were available from the LSRCA
pertaining to biomonitoring in the West Holland Canal due to the depth of this waterway.
The samplers recovered from each of the impacted sites were visually distinguishable
from each other and from the reference sites, indicative of the varying nature of the ecosystems
(Figure 7.2.1). The Maskinonge River sampler was the least changed visually from the reference
sites, but was coated in duckweed (an aquatic plant of the Lemnoideae, a subfamily within the Araceae),
likely as a result of slow water movement coupled with high phosphorus levels. The Uxbridge
Brook and West Holland Canal sites each had substantial green algae coating the samplers, and
all three of these streams have high phosphorus levels according to the PWQMN data (see Table
7.2.3). The Dyment’s Creek sampler was thickly coated in an unknown chemical substance that
had badly discoloured the sampler and could not be removed. The sand inside the samplers
could also be distinguished visually between sites, indicating that there had been substantial
input of substrate while in situ, likely as a result of sediment deposition during rain events.
Figure 7.2.1: Lake Simcoe region samplers after retrieval.
Samplers shown (bottom to top) are Maskinonge River, West Holland Canal and Dyment’s
Creek.
7.2.2 Community diversity measures
Individual extractions from the front, center and back portions of each sampler from the 2012
season were sent for T-RFLP analysis. Any T-RFLP sample that had a total peak height of less
than 3000 was removed from the analysis, which included the centre samples from the North
Saugeen (NS), Dyment’s Creek (Barrie), West Holland Canal (CF) and Hamilton Creek (HC)
samplers as well as the outflow sample from the North Saugeen (leaving only the inflow for that
sampler). Samplers that had good band representation for all three sampler regions, Uxbridge
(UX) and Maskinonge Creek (SOD), all grouped together as did the front and back samples
from the remaining samplers (Figure 7.2.2).
Figure 7.2.2: Cluster analysis of T-RFLP data showing within sampler variation.
Samples were obtained from the F,C,B (inflow (front), center and outflow (back)) portions of
the samplers after retrieval for comparison of the bacterial community heterogeneity within the
length of the sampler. Abbreviations are as follows: Dyment’s Creek (Barrie), West Holland
(CF), North Saugeen (NS), Hamilton Creek (HC), Maskinonge Creek (SOD) and Uxbridge
Brook (UX). Top diagram is the binary Jaccard comparison for the within sampler variation.
The bottom diagram illustrates the improved clustering of the samplers when abundance is included (Bray-Curtis);
however, the West Holland Canal (CF) samples group with HC instead of with each other.
Although no conclusions could be drawn for the North Saugeen sampler due to poor
amplification, each of the other samplers harbored its own unique community. When
phylotype abundances were included in the statistical analyses, the bacterial communities
showed high similarity between the different locations within the samplers, except that the West
Holland inflow and outflow samples (CF-F and CF-B) did not cluster together. When abundance
is disregarded, however, the West Holland samples do cluster, albeit with only slightly greater
similarity to each other than to the two reference sites.
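The contrast between the abundance-based and presence/absence comparisons can be illustrated with a short Python sketch. It assumes the standard definitions of Bray-Curtis dissimilarity and binary Jaccard dissimilarity; the peak profiles are hypothetical and were chosen only to show how a single dominant shared band can pull two different sites together under Bray-Curtis while leaving Jaccard unaffected.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity on peak abundances (0 = identical)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator


def jaccard_binary(a, b):
    """Jaccard dissimilarity on presence/absence of each peak."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


# Hypothetical relative peak heights (percent) for five fragment sizes.
# Index 0 plays the role of a dominant band shared between sites.
site_a_inflow = [50, 20, 20, 10, 0]
site_a_outflow = [10, 35, 35, 20, 0]
site_b = [55, 5, 5, 10, 25]

# With abundance, the inflow sample sits closer to the other site ...
print(bray_curtis(site_a_inflow, site_a_outflow), bray_curtis(site_a_inflow, site_b))
# ... while on presence/absence it sits closer to its own outflow sample.
print(jaccard_binary(site_a_inflow, site_a_outflow), jaccard_binary(site_a_inflow, site_b))
```

Running both measures side by side, as was done for the T-RFLP data, is one way to check whether a clustering result depends on a few highly abundant peaks.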
In order to provide a greater quantity of DNA for further analysis, the front (inflow),
centre and back (outflow) DNA from each sampler was combined and purified using GE
Healthcare S200 purification columns (in triplicate), and the pooled DNA was used for
subsequent analysis. Two replicates of pooled samples from each sampler, along with two DNA
extractions from the sediment at the Barrie site for comparison, were analyzed via T-RFLP.
Although the sampler replicates generally gave good agreement by T-RFLP analysis, the West
Holland River sampler (CF) and Hamilton Creek sampler again showed greater similarity
between the two sites than between the two replicates. Further investigation revealed that the
primary driver of the clustering was the abundance of one particular band (190 bp), which
accounted for close to 50% of the total T-RFLP peaks in all of the samples except for
Maskinonge (SOD), Dyment’s Creek (BH-1, BL-1) and Uxbridge Brook (UX). For this
reason, the principal coordinate analysis was performed separately for the Bray-Curtis distance
and Jaccard distance between sites. As can be seen in Figure 7.2.3, the West Holland Canal
was the only sampler that segregated differently by the two distance calculations. However, by
both of these analyses the West Holland sampler bacterial community was more similar to the
reference sites than to any of the other impacted sites.
Figure 7.2.3: Principal coordinate analysis of T-RFLP results from sampler replicates.
Samples in both graphs are numbered as follows: Dyment’s Creek (1,2), West Holland Canal
(3,4), Hamilton Creek (5,6), North Saugeen (7,8), Maskinonge Creek (9,10) and Uxbridge
Brook (11,12). The Uxbridge Brook sites are overlapping in both plots and therefore the
individual numbers are not visible.
In addition to the T-RFLP analysis, pyrosequencing was also performed on the
combined sampler DNA; however, for the Dyment’s Creek sample there was insufficient DNA
for the analysis. The Uxbridge Brook sample was run twice from the same DNA to serve as
technical replicates. An examination of the pyrosequencing data revealed that the primary
driving force for the observed similarities between the reference sites and the West Holland
Canal site (CAR) was the prevalence of one particular species at a number of the sites. This was
consistent with the T-RFLP analysis, where it had also been determined that a single dominant
band was influencing the observed clustering. In the pyrosequencing data, three of the samples
analyzed showed a clear dominance of Pseudomonas fluorescens. Two of these samples
corresponded with reference locations: the North Saugeen sample had 41% P. fluorescens, and
the same species accounted for 45% of the reads at the Hamilton Creek location. It also
comprised 51% of the reads from the West Holland Canal site. Using the available 16S
sequence for P. fluorescens PC17 from the NCBI GenBank database, it was determined that the
dominant band observed in the T-RFLP data was consistent with this species, and therefore
dominance of this species could be established in the T-RFLP data as well. In contrast, the
remaining three sites (Uxbridge Brook, Maskinonge River and Dyment’s Creek) had
considerably lower prevalence of P. fluorescens based on both the pyrosequencing reads and the
observed T-RFLP band. For the Uxbridge Brook sample and the Maskinonge Creek sample, P.
fluorescens was still the most abundant single OTU in the pyrosequencing dataset; however, this
species accounted for only 18.5% and 19.4%, respectively. The Dyment’s Creek samples had
insufficient DNA for pyrosequencing; however, the T-RFLP data indicate that P. fluorescens
was effectively absent from this site both within the sampler DNA and also in the analyzed
sediment samples (less than 0.05% in any sample).
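Matching a T-RFLP band to a reference 16S sequence amounts to predicting where the first restriction site falls downstream of the labelled primer. The following Python sketch shows that idea only. The recognition site, cut position and toy amplicon are hypothetical, since the enzyme and primers used for the T-RFLP analysis are not restated in this section.

```python
def terminal_fragment_length(amplicon, site, cut_offset):
    """Predict the length of the labelled 5' terminal fragment.

    amplicon   -- sequence starting at the 5' end of the labelled primer
    site       -- recognition sequence of the restriction enzyme
    cut_offset -- number of bases into the site after which the enzyme cuts
    Returns None when the recognition site does not occur in the amplicon.
    """
    position = amplicon.upper().find(site.upper())
    if position == -1:
        return None
    return position + cut_offset


# Toy amplicon: 20 bases without the site, then one hypothetical 4-base site.
amplicon = "ACGT" * 5 + "GGCC" + "TTAA" * 3

# Hypothetical enzyme cutting in the middle of its site (GG^CC).
predicted = terminal_fragment_length(amplicon, "GGCC", 2)
print(predicted)  # 22
```

A predicted length close to the size of an observed band is consistent with that taxon, but does not prove identity, because unrelated taxa can share the same first cut site.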
Figure 7.2.4: Principal coordinate analysis (PCoA) of the bacterial community
compositions revealed by 16S pyrosequencing data.
Beta diversity is determined using Bray-Curtis distances. UX12 and UX122 are technical
replicates of the same DNA. Abbreviations are as follows: Uxbridge Brook (UX), North
Saugeen (NS), West Holland Canal (CAR), and Hamilton Creek (HC).
7.2.3 Quantitative PCR
Quantitative PCR results for sampler DNA are shown in Table 7.2.2. DNA yields from
the sampler sands were typically low, so pooled samples were used. Each biological replicate is
a pool of inflow, center and outflow extractions. However, the low concentration of mobile
genes relative to total 16S gene copies required that different amounts of DNA be used for 16S
analysis compared to target gene analysis in order to stay within the linear range of the
respective primers. Data were not converted to estimates of copy number since the target gene
amplifications were still at or approaching the limit of linear amplification despite the larger
volume of DNA used for the target genes relative to the 16S genes. Results given are therefore
cycle threshold values (Ct) normalized to the 16S amplification for the same sample (deltaCt)
but not converted to actual gene copies, since a doubling at each cycle cannot be assumed.
Results are averages of at least two independent biological replicates.
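As a minimal Python sketch of the normalization described above, deltaCt is taken here as the target gene Ct minus the 16S Ct for the same sample, so that a low value corresponds to a relatively abundant target. The Ct values are hypothetical, and the sketch ignores the different DNA input amounts used for the 16S and target reactions.

```python
from statistics import mean


def delta_ct(target_ct, reference_ct):
    """Target Ct minus 16S Ct; larger values mean relatively less target."""
    return target_ct - reference_ct


def mean_delta_ct(replicates):
    """Average deltaCt over (target_ct, reference_ct) pairs."""
    return mean(delta_ct(target, reference) for target, reference in replicates)


# Hypothetical Ct values for two biological replicates of one sampler.
replicates = [(27.1, 22.0), (26.8, 22.1)]

# The value is deliberately left as a Ct difference. Converting it to a
# fold difference with 2 ** -deltaCt would assume a doubling at every cycle.
print(round(mean_delta_ct(replicates), 2))  # 4.9
```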
Table 7.2.2: DeltaCt comparison of environmental samplers by quantitative real-time
PCR.
The deltaCt is determined by subtracting the 16S cycle threshold (Ct) value from the target gene
Ct value. A low deltaCt value is therefore indicative of a high concentration of target gene since
there was a smaller difference between the target and the 16S gene cycle thresholds. The
melting temperature (Tm) of the resulting PCR product is included as this differed according to
site for some primer sets.
Site                 IS1071 deltaCt   IS1071 Tm (°C)   sulI deltaCt   sulI Tm (°C)   IncP deltaCt   IncP Tm (°C)
Uxbridge Brook       4.93             88               6.76           85.4           9.65           85.2
Dyment's Creek       5.88             87.7             9.67           85.4           8.95           85.7
West Holland Canal   9.52             87.7             8.78           88             9.52           88.65
Maskinonge Creek     9.43             87.7             10.03          87.1           9.92           89.35
Hamilton Creek       10.22            87.7             9.45           88             9.20           88.65
North Saugeen        12.84            87.7             9.45           87.1           8.78           89.1
In addition to low amplification, for all except the IS1071 primers the environmental
samples gave multiple and/or broad peaks, indicating that more than one product was produced.
These peaks had a higher melting temperature than would be expected for primer dimers, and
when representative samples were analyzed by gel electrophoresis, a single band was observed
(data not shown). This suggests that the multiple peaks on the melting curve analysis are the
result of similarly sized PCR products with varying G+C content. In support of this assertion,
the melting temperature of the qPCR product was consistent between different replicates of the
same sampler, as would be expected for site-specific differences in the target sequence. The results included in Table 7.2.2
correspond to the primer sets that gave good reproducibility across multiple biological
replicates, including the IS1071 primers designed in this study and primers from other published
studies that targeted the IncP plasmid backbone (Smalla et al. 2013) and the sulI gene conferring
resistance to sulfonamide antibiotics (Cheng et al. 2013).
In addition to the primer sets listed in Table 7.1.2 there were also a number of other
mobile element primer sets tested from the existing literature, including primers designed to
target the Tn3 and Tn21 classes of transposons (Gotz et al. 1996). These primers were not
originally designed for qPCR analysis; however, the amplified product was an appropriate size
for this type of analysis. The deltaCt values obtained with these primer sets were quite low
(ranging from 3 to 8), indicating high abundance of these transposons in all sites; however, both
peak quality and reproducibility between replicates were insufficient for the results to be
included. Conversely, the IS1071 primers were both highly specific and highly reproducible
between biological replicates (within 0.5 Ct after normalization), consistent with their design to
target only one specific member of the Tn3 family of transposons.
7.2.4 Correlations between bacterial communities and water quality parameters
With the exception of Dyment’s Creek, water quality parameters were available from the
Provincial Water Quality Monitoring Network (PWQMN) for each of the streams used in this
project. For Uxbridge Brook the sampling site corresponded to the precise location of the
PWQMN monitoring station; however, for accessibility reasons the other sampling sites were
located in other regions of the same streams. For Dyment’s Creek, water quality parameters
were available from Environment Canada for 2011 and these values were used to approximate
the conditions in 2012. For each of the streams, the water quality parameters were averaged
over the full year and the values are included in Table 7.2.3.
Table 7.2.3: Water quality parameters for each site.
Data for all but the Dyment’s Creek site are the averages from the 2012 data available through
the PWQMN database. Data for Dyment’s Creek are the averages from the stream water
chemistry data provided by Environment Canada for the 2011 field season.
Site                 Chloride (mg/L)   Phosphorus (mg/L)   DO (mg/L)   EC (µS/cm)   pH     Fe (µg/L)   Nitrates (mg/L)
Maskinonge River     92.7              0.1755              8.12        831          7.66   550         0.77
Uxbridge Brook       45.16             0.1154              10.07       538          7.87   545         2.25
Hamilton Creek       4.8               0.0035              9.03        441          8.36   24          0.526
North Saugeen        5.53              0.003               10.22       427          8.21   No data     0.382
Dyment's Creek       245.08            0.024               7.60        1022         7.73   169         2.76
West Holland River   92.55             0.14                9.26        735.06       7.74   294         1.01
The principal coordinates derived from both the T-RFLP and the pyrosequencing data
were compared to the available water quality parameters, and some significant correlations were
observed (Table 7.2.4). The T-RFLP data contained more replicates (since there were duplicates
of each sample) and included the Dyment’s Creek site; therefore, they will be the primary data
discussed. Principal coordinate 1 (PC1) showed the strongest correlations with %EPT and
dissolved oxygen, suggesting that high biological oxygen demand accounted for much of the
variability between these sites. PC1 also correlated with chloride levels, which are a dominant
feature of all the contaminated sites. PC2 did not show a correlation with any of the water quality
parameters but was correlated with the abundance of both IS1071 and sulI in the population.
The IS1071 deltaCt correlated in opposite directions with nitrate concentrations and P. fluorescens abundance. Since the
gene abundance is given as deltaCt, a high value corresponds to low abundance of the gene and
therefore high IS1071 gene abundance was correlated with high nitrate concentrations.
Conversely, IS1071 was not as abundant in the sites that had substantial P. fluorescens present.
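The direction of the IS1071 and nitrate relationship can be checked with a small Python sketch that computes Pearson's R from the site-level values in Tables 7.2.2 and 7.2.3. This is an illustration only: the thesis analysis was based on a different set of observations (as its degrees of freedom show), so the coefficient obtained here need not match the tabulated one, and no p-value is computed.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Site-level values copied from Tables 7.2.2 and 7.2.3, in the order
# Uxbridge, Dyment's, West Holland, Maskinonge, Hamilton, North Saugeen.
is1071_delta_ct = [4.93, 5.88, 9.52, 9.43, 10.22, 12.84]
nitrates_mg_l = [2.25, 2.76, 1.01, 0.77, 0.526, 0.382]

r = pearson_r(is1071_delta_ct, nitrates_mg_l)
t = t_statistic(r, len(is1071_delta_ct))
print(r, t)
```

The coefficient is strongly negative. Because a high deltaCt means low gene abundance, a negative coefficient here corresponds to higher IS1071 abundance at higher nitrate concentrations.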
Table 7.2.4: Correlations of the bacterial communities to available water quality data.
Only those correlations that were found to be significant are included in this table. PC numbers
refer to the first and second principal coordinates from the T-RFLP analysis (Bray-Curtis) and
from separate analyses of the pyrosequencing data by Bray-Curtis (BC) or Jaccard (J). DO is
dissolved oxygen and DO4 is the average dissolved oxygen over the summer months (June to
September). Gene abundances from qPCR analysis are listed by their gene name (sulI and
IS1071).
Parameter             Correlate        Pearson's R   Degrees of freedom   p<
T-RFLP-PC1            PyroPC2-BC       0.936         4                    0.01
T-RFLP-PC1            PyroPC2-J        0.934         5                    0.01
T-RFLP-PC1            %EPT             0.900         4                    0.02
T-RFLP-PC1            DO               0.855         5                    0.02
T-RFLP-PC1            Chloride         -0.798        5                    0.05
T-RFLP-PC2            PyroPC1-BC       0.955         4                    0.01
T-RFLP-PC2            PyroPC1-J        -0.820        5                    0.05
T-RFLP-PC2            IS1071           -0.879        5                    0.01
T-RFLP-PC2            sulI             -0.883        5                    0.01
PyroPC1-Bray_Curtis   PyroPC1-J        -0.989        4                    N/A
PyroPC1-Bray_Curtis   Simpsons I       0.997         2                    0.01
PyroPC1-Bray_Curtis   IS1071           -0.868        4                    0.05
PyroPC1-Bray_Curtis   Nitrates         0.895         4                    0.02
PyroPC1-Bray_Curtis   P. fluorescens   -0.922        4                    0.01
PyroPC2-Bray_Curtis   PyroPC2-J        0.929         4                    N/A
PyroPC2-Bray_Curtis   %EPT             0.977         3                    0.01
PyroPC2-Bray_Curtis   DO               0.844         4                    0.05
PyroPC1-Jaccard       IS1071           0.843         5                    0.02
PyroPC1-Jaccard       P. fluorescens   0.932         5                    0.01
PyroPC2-Jaccard       %EPT             0.959         5                    0.001
PyroPC2-Jaccard       sulI             -0.777        5                    0.05
PyroPC2-Jaccard       DO               0.838         8                    0.01
cHBI                  DO               -0.905        7                    0.001
cHBI                  Chloride         0.874         7                    0.01
Simpsons I            sulI             -0.929        3                    0.05
Simpsons I            Total P          0.882         7                    0.01
%EPT                  DO               0.793         7                    0.02
%EPT                  DO4              0.845         7                    0.01
IS1071                Nitrates         -0.932        5                    0.01
IS1071                P. fluorescens   0.764         5                    0.02
IncP                  Total P          0.936         6                    0.001
DO                    Chloride         -0.808        8                    0.01
7.2.5 Primer design specific to RIT elements
In order to expand the current study to include RIT elements, it was necessary to develop
primers that could be used to search for novel elements in environmental samples. Although the
third integrase in the RIT element is more highly conserved than the first two integrases,
alignments of the nucleotide sequences from our RIT collection showed no promising candidate
regions from which to design primers. To address this lack of conservation, the RIT elements
were divided into sub-groups sharing higher homology and primers were designed specific to
conserved regions within the third integrase for these groups. Although specific
primers could be designed for genus-level groups such as Sinorhizobium and Acidiphilium,
there was insufficient conservation to design primers aimed at broader groups, making qPCR for
RIT elements in environmental samples unfeasible. However, the discovery of conserved
sequences within the inverted repeats from RIT elements found in 10 different genera (see Table
6.2.2) highlighted another route by which RIT element distribution in environmental samples
could be evaluated. Primers were designed that could bind to these conserved regions in the
terminal inverted repeats and would therefore amplify the complete RIT element (because the
repeats are inverted, the forward and reverse primers are identical). Burkholderia sp. str. OLGA172 and
Caulobacter sp. K31 were used as the control strains and primers were designed that would
match each of the strains specifically. These primers were shown to have no cross-amplification
with the alternate control. An alignment of the inverted repeats from all 10 strains was then
used to design two ambiguous primer sets: the first had degenerate bases in all locations
where there were disagreements (RIT_ambig) and the second kept some of the original bases
found in OLGA172 in case the fully ambiguous primer lacked specificity (RIT_less_ambig).
Both Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 produced bands of the expected
size with the fully ambiguous primers and this primer set was therefore utilized to search for
similar RIT elements in the environmental samplers. Faint positive bands of the expected size
were amplified from sampler DNA (specifically the Uxbridge Brook and Dyment’s Creek sites);
however, cloning and characterization of these products was beyond the scope of this project.
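The construction of a fully ambiguous primer from an alignment can be sketched in Python as follows. This is an illustrative sketch rather than the procedure actually used: it collapses each alignment column into the corresponding IUPAC ambiguity code and reports how many distinct sequences the resulting primer represents. The three aligned fragments are hypothetical and are not real RIT inverted repeat sequences.

```python
# IUPAC nucleotide ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned):
    """Collapse equal-length, gap-free aligned sequences into one IUPAC string."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(IUPAC[frozenset(column)] for column in zip(*aligned))


def degeneracy(primer):
    """Number of distinct sequences that an IUPAC-coded primer represents."""
    sizes = {code: len(bases) for bases, code in IUPAC.items()}
    total = 1
    for base in primer:
        total *= sizes[base]
    return total


# Hypothetical aligned fragments standing in for conserved repeat regions.
repeats = ["ACGTTGCA", "ACGCTGCA", "ACGTTGAA"]

primer = degenerate_consensus(repeats)
print(primer, degeneracy(primer))  # ACGYTGMA 4
```

Fixing some positions to the base found in one control strain, as in the less ambiguous primer set, lowers the degeneracy and so trades breadth of detection for specificity.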
7.3 Discussion
7.3.1 Biomonitoring
The use of benthic invertebrates to establish the health of a stream ecosystem is well established
(Rosenberg and Resh, 1993; Jones et al. 2007). Although the reference sites historically
performed better by these metrics than the impacted sites, the biomonitoring scores for the
reference sites were less favorable in this analysis. In particular, the coarse Hilsenhoff biotic
index (cHBI) ranked the reference streams as ‘fair’, which was unexpected but is also consistent
with the observations by the SVCA that biomonitoring scores are changing. The exact reasons
for this trend have not been elucidated since land use and water quality parameters have not
been declining for these streams, but it has been suggested that it could be the result of
increasing temperatures due to changing climate conditions (SVCA, 2010). This highlights an
important caveat with using the cHBI value for determining the overall health of an ecosystem
since higher temperatures can mimic the stress effects of the organic pollutants for which this
method was originally designed (Hilsenhoff, 1987). For the impacted sites, Dyment’s Creek
performed the worst in terms of cHBI score with an overall ranking of ‘poor’. Pollution entering
this stream is one potential reason for the poor biomonitoring ranking; however, lack of gravel
substrate is likely to be a strong contributing factor, as the sediment consisted primarily of sand
and debris from the landfill including tires, metal rims and plastic garbage bags. Trichoptera
were abundant at this site (leading to a high %EPT score); however, they were found to be
exclusively from the Hydropsychidae family, which is known to be tolerant of degraded
conditions. The biomonitoring ranking for the Uxbridge site was much better than anticipated,
and even exceeded the cHBI ranking for the reference sites. This site is 2.5 km downstream of a
wastewater outflow and was expected to show some anthropogenic impact due to the presence
of elevated levels of PAHs and other contaminants. However, the stream itself has extensive
riparian vegetation, good shading and a varied sediment composition that would be expected to
support a diversity of benthic organisms. It is also likely that the wastewater outflow adds
nutrients, as seen by the elevated phosphorus and nitrate concentrations. All of these factors
may have contributed to a high overall biomonitoring ranking despite the presence of organic
contaminants. This site also had particularly prolific algae growth and a correspondingly high
abundance of Coleoptera (grazers) that may have impacted the overall ranking. These were
exclusively of the Elmidae family, which is known to be more tolerant of organic pollutants
than other beetle families. The LSRCA benthic data are analyzed at the family level, and do
show a slightly higher FBI value of 5.78 (‘fairly poor’) for the 2012 sampling. However, the
Uxbridge Brook site also had the highest %EPT values obtained in this study and although the
Trichoptera were all Hydropsychidae (and therefore more tolerant than other families), the
Ephemeroptera individuals came from families that are known to be quite sensitive to organic
pollution. The Simpson’s diversity metric also ranked the Uxbridge Brook site as healthier than
the reference sites (Table 7.2.1). This suggests that from a biomonitoring point of view, the
Uxbridge Brook site is maintaining a healthy benthic community.
7.3.2 Bacterial community assessment
The placement of the samplers directly in the water column was important for two
reasons: first, to minimize between-site differences in bacterial communities specific to the
nature of the river substrate; and second, to specifically characterize the members of the
bacterial community that were most accessible either by direct contact with the river, or through
downstream applications such as irrigation or drinking water sources. In this manner, the
samplers can be used as a proxy for the indigenous bacterial community and utilized for
quantitative PCR comparison of mobile genes. The use of a sand substrate for colonization
inside the samplers was considered preferable to simply filtering water samples since this could
allow for the establishment of biofilm communities that may not be evident otherwise. These
communities therefore represent an accumulated population from the 4-month sampling period as
opposed to a single sampling event. Although not originally foreseen, this method had a
particular advantage over single time point sampling due to the presence of sediment washed
into the samplers after rain events. This was originally seen as a disadvantage since the goal
was to maintain a standardized substrate across all streams. In reality, however, these transient
communities that enter the water column after rain events are also accessible through direct
contact or downstream applications, making their inclusion in the samples beneficial. The
samplers still provide a standardized substrate for colonization, which minimizes the variation
and allows for comparison of streams regardless of sediment composition. In this way, the
samplers are designed specifically to capture the bacterial members that are transiently present
in the water column as well as providing a suitable substrate for the establishment of biofilm
communities that would normally be established within the sediment. Although the low DNA
yields obtained from the samplers made analysis difficult, there were distinctive communities
identified for each of the sites.
The sampler diversity was examined by T-RFLP and bands were found to correlate with
between-stream differences as opposed to within-sampler differences. This suggests that the
water movement through the samplers was sufficient to create homogenous conditions in terms
of oxygen and nutrient distributions. The added benefit of this consistency is that the large
volume of soil present in the samplers can be frozen and used for additional analyses at a later
date. The strong abundance of one particular species in the majority of the samplers was
unexpected, and could be an indication that the length of time that the samplers are in the stream
may need to be increased. It is possible that the initial colonization of the new substrate is
accomplished by P. fluorescens and that over a longer length of time the bacterial community
would diversify to more closely resemble the sediment community for that particular stream.
This is an issue that would need to be addressed before this method could be utilized in further
studies since the increased abundance of this one particular species interferes with subsequent
analysis. Interestingly, the only sample from the contaminated sites that showed an
overwhelming abundance of P. fluorescens was the West Holland Canal sampler; however, this
is also the site for which the available water quality measurements were taken at a great distance (8.4
km) from the sampler. The actual PWQMN site was not accessible for a sampler due to both the dimensions of
the river and level of public presence. The sampler was placed upstream of the PWQMN site, and the
stretch of the canal between the sampling site and the PWQMN site is entirely agricultural;
therefore, it would be expected that the PWQMN data represent the worst-case scenario for
agricultural inputs to this river, and the conditions at the actual sampling site may be less
severe. Secondly, due to the volume of water moving through the canal the dissolved oxygen
levels are likely to be quite variable in this stream. Finally, this location was chosen based on
pollutants found in the sediment of the West Holland River; however, the levels of contamination
found in the water itself were not found to exceed provincial guidelines (LSRCA, 2004). Given
the depth of the canal, it is likely that the bacterial community in the water column (and
therefore in the sampler) is rarely in contact with these contaminants.
The Maskinonge Creek, Uxbridge Brook and Dyment’s Creek samples all had lower
than 20% P. fluorescens abundance. In addition to the contaminants listed in Table 7.1.1, there
are a couple of water quality parameters that could account for the lack of this species at the
LSRCA sites. First, the LSRCA sites all have significantly higher levels of phosphorus and
chloride, higher conductivity readings and lower pH values than the SVCA sites. In addition,
the dissolved oxygen (DO) values during the sampling period (June – September) were very low
for both the Maskinonge Creek and Dyment’s Creek locations (supplemental table S2). The
Uxbridge Brook location did not have DO levels below those of the reference sites; however,
when compared to the North Saugeen site it is clear that the duration of low dissolved oxygen
levels was much longer at Uxbridge Brook (around 8 mg/L for all four months as opposed to
only 2 months) and therefore the DO levels during the sampling time were lower than the annual
averages would suggest. Since P. fluorescens is a strictly aerobic organism, the extended low
dissolved oxygen levels could certainly account for the decreased abundance of this species at
the contaminated sites.
Since the Maskinonge Creek, Uxbridge Brook and Dyment’s Creek samplers all
segregated from the other sites in terms of phylogenetic diversity, it was expected that these
three sites would also segregate from the others in terms of quantitative PCR results. In terms of
IS1071 abundance, the Dyment’s Creek and Uxbridge Brook samples both had significantly
higher IS1071 abundance compared to the other sites, and all sites were significantly higher than
the North Saugeen River. The West Holland Canal and Maskinonge Creek samples were not
significantly different from each other based on IS1071 abundance and the West Holland Canal
was only marginally higher than the Hamilton Creek sample (P=0.047). Given the
known association of IS1071 with catabolic operons, it was expected that the highest
abundances would occur in the sites with complex contaminants, and they did. However, I also
expected IS1071 to be more abundant at the two agricultural sites than at Uxbridge Brook, which was not the case.
The observed pattern could reflect the strong association of IS1071 with IncP plasmids (Dennis, 2005;
Dunon et al. 2013), in which case it would be a result of wastewater and landfill leachate entering the
Uxbridge Brook and Dyment’s Creek sites, respectively. The increased abundance of the sulI
antibiotic resistance gene solely at the Uxbridge Brook site is also consistent with the
expectations from wastewater outflow. All sites had comparable numbers for IncP backbone
abundance, but with different melting temperatures of the qPCR products, suggesting that there
is a much greater diversity of these plasmids than only the ones carrying the sulI genes.
However it should be noted that these primers were designed to be used in conjunction with a
specific probe and therefore the results obtained may simply be an artifact of the method used.
The broad distribution of IS1071 found in this study suggests that continued studies on
this particular element could prove interesting. Most notably, the increased abundance of IS1071
at the Dyment’s Creek location, coupled with the known diversity of contaminants entering this
stream from the groundwater, suggests that this would be an interesting location for a more
detailed study. There are a number of known contaminants in the Uxbridge Brook location, yet
the biomonitoring data do not indicate a decrease in overall ecosystem health. However the
bacterial community at the Uxbridge Brook site was clearly altered in comparison to the SVCA
sites, which merits further investigation of the impacts that this alteration has on the overall
community dynamics. The Uxbridge Brook site also had unexpectedly high levels of IS1071, as
well as high levels of sulI gene abundance. This site would therefore also be an interesting
location for a mobile element study in order to determine whether biomonitoring using
macroinvertebrates is informative for understanding the bacterial community response to
environmental contaminants. The qPCR results obtained in this study suggest that the bacterial
community is enriched in genetic elements commonly associated with IncP plasmids, including
IS1071 and sulI, which may indicate that this community represents an increased risk of
resistance gene transmission. However, primers targeting the class 1 integrase gene (intI1) did
not detect this gene in any of the environmental samplers, which was unexpected since the sulI
gene is a known component of the class 1 integrons commonly found on IncP plasmids
(Schlüter et al. 2007). The reasons for the absence of intI1 in the samplers are unclear, as the
primers showed equal efficiency to the IS1071 primers on control DNA and a sub-sample
taken from the samplers earlier in the season had been positive for intI1. This is not likely to be
due to total bacterial abundance as the 16S results were comparable between the sub-sample and
the final sampler extractions; however, it could be indicative of a change in the bacterial
population that resulted in a decrease in intI1 relative to the total community. The highly
specific and reproducible results obtained with the IS1071 primers suggest that these primers
may be better suited to identifying impacted bacterial communities when DNA concentration is
a limiting factor. As IS1071 is commonly found in multiple copies in the genome, it makes
for a more robust target for qPCR analysis. IS1071 is also associated with several catabolic
plasmids and transposons (Dunon et al. 2013; Van Houdt et al. 2000; Top and Springael, 2003)
and therefore provides a broader target than solely antibiotic resistance plasmids.
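The comparison of a target gene against the 16S signal described above can be sketched as a simple normalization. This is a minimal illustration only: it assumes a conventional standard curve of the form Ct = slope × log10(copies) + intercept, which is not described in the text, and the function names, curve parameters and Ct values are all hypothetical.

```python
import math


def copies_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert a standard curve of the form Ct = slope * log10(copies) + intercept."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope: float) -> float:
    """Efficiency implied by the curve slope (1.0 means doubling every cycle)."""
    return 10 ** (-1.0 / slope) - 1.0


def relative_abundance(target_copies: float, reference_copies: float) -> float:
    """Target gene copies per copy of the reference gene (here 16S rRNA)."""
    if reference_copies <= 0:
        raise ValueError("reference copy number must be positive")
    return target_copies / reference_copies


if __name__ == "__main__":
    # Hypothetical curve parameters and Ct values, for illustration only.
    slope, intercept = -3.5, 38.0
    is1071 = copies_from_ct(24.0, slope, intercept)
    rrna_16s = copies_from_ct(17.0, slope, intercept)
    ratio = relative_abundance(is1071, rrna_16s)
    print(f"IS1071 copies per 16S copy: {ratio:.4f}")
```

Expressing each target as a ratio to 16S makes a drop in one gene relative to the total community visible even when total bacterial abundance is unchanged.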
There are two major challenges in examining mobile elements in community samples
beyond the integron and plasmid replication genes that are currently used. The first challenge is
diversity of the nucleotide sequence of the target genes. For plasmids, the essential nature of the
replication genes provides sufficient conservation for primer design (Smalla et al. 2013; Götz et
al. 1996) and these primers have been utilized on environmental samples, albeit often in
association with a secondary probe for increased specificity. Class 1 integrons in particular are
highly conserved specifically due to selection for the antibiotic resistance genes that they are
associated with (Gillings et al. 2015). This conservation is not typical of mobile elements found
in the environment, with the result that testing for other families of mobile elements through
targeted qPCR or microarray approaches is not possible except in cases where the goal is to
track the abundance and distribution of a previously determined individual element. This was
illustrated in this study by the inconsistent results obtained through the Tn3 and Tn21 primers,
as well as the inability to target groups of RIT elements. Although global distribution of
individual mobile elements has not been commonly observed, in this study we designed qPCR
primers specifically targeting IS1071 and found this particular element to have a broad
distribution in environmental samples.
The goal of this study was to develop a reproducible method of analyzing the
anthropogenic impacts on freshwater stream bacterial communities, with the aim of classifying
reference and impacted locations for further characterization. These samplers can be used to
draw comparisons between different locations in a manner that is not dependent on sediment
quality (or availability) and is not impacted by spatial variations caused by groundwater inflow.
This characterization can also point directly to individual elements that may warrant further
investigation in the impacted sites. In this way, sites can be chosen for which metagenomic
characterization would be informative and this information can be accumulated and stored for
future analysis of general trends.
7.4 Acknowledgements
The following people are gratefully acknowledged for their insight and contributions: Jim Roy,
Alex Fitzgerald and Lee Grapentine (Environment Canada), Dave Lembcke and Rob Wilson
(LSRCA), Martha Nicol and Shaun Anthony (SVCA), Chris Jones (OBBN), landowners for
access especially Brouwer Sod Farms in Keswick and Jason Verkaik at Carron Farms, Shu Yi
(Roxana) Shen, Rosemary Saati and the other summer students for assistance, Toby Ricker for
design and construction of samplers, Ross Reid for assistance with sampler access and
placement. Funding in the form of an NSERC Discovery Grant to RF and an NSERC PGS-D
Scholarship to NR is also gratefully acknowledged. The funding agency had no role in this
study.
7.5 References
Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., ... & Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature methods, 7(5), 335-336.
Cheng, W., Chen, H., Su, C., & Yan, S. (2013). Abundance and persistence of antibiotic resistance genes in livestock farms: a comprehensive investigation in eastern China. Environment international, 61, 1-7.
Dennis, J. J. (2005). The evolution of IncP catabolic plasmids. Current opinion in biotechnology, 16(3), 291-298.
DeSantis, T. Z., Hugenholtz, P., Keller, K., Brodie, E. L., Larsen, N., Piceno, Y. M., ... & Andersen, G. L. (2006). NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic acids research, 34(suppl 2), W394-W399.
Dunon, V., Sniegowski, K., Bers, K., Lavigne, R., Smalla, K., & Springael, D. (2013). High prevalence of IncP-1 plasmids and IS1071 insertion sequences in on-farm biopurification systems and other pesticide-polluted environments. FEMS microbiology ecology, 86(3), 415-431.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.
Fierer, N., Jackson, J. A., Vilgalys, R., & Jackson, R. B. (2005). Assessment of soil microbial community structure by use of taxon-specific quantitative PCR assays. Applied and environmental microbiology, 71(7), 4117-4120.
Gillings, M.R., Gaze, W.H., Pruden, A., Smalla, K., Tiedje, J.M. and Yong-Guan, Z. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME journal doi:10.1038/ismej.2014.226
Götz, A., Pukall, R., Smit, E., Tietze, E., Prager, R., Tschäpe, H., ... & Smalla, K. (1996). Detection and characterization of broad-host-range plasmids in environmental bacteria by PCR. Applied and Environmental Microbiology, 62(7), 2621-2628.
Hilsenhoff, W.L. 1987. An improved biotic index of organic stream pollution. Great Lakes Entomology 20: 31-39.
Jechalke, S., Dealtry, S., Smalla, K., & Heuer, H. (2013). Quantification of IncP-1 plasmid prevalence in environmental samples. Applied and environmental microbiology, 79(4), 1410-1413.
Johnsen, A. R., & Karlson, U. (2007). Diffuse PAH contamination of surface soils: environmental occurrence, bioavailability, and microbial degradation. Applied Microbiology and Biotechnology, 76(3), 533-543.
Jones F C, Somers KM, Craig B, Reynoldson TB (2007) Ontario Benthos Biomonitoring Network: Protocol Manual. Queen’s Printer for Ontario.
LSRCA, 2004. Lake Simcoe Watershed Toxic Pollutant Screening Program 2004 Report. Lake Simcoe Region Conservation Authority. Drafted July 2005.
Nemergut, D. R., Martin, A. P., & Schmidt, S. K. (2004). Integron diversity in heavy-metal-contaminated mine tailings and inferences about integron evolution. Applied and environmental microbiology, 70(2), 1160-1168.
Reeder, J., & Knight, R. (2010). Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nature methods, 7(9), 668-669.
Rosenberg DM, Resh VH (1993) Freshwater biomonitoring and benthic macroinvertebrates. Chapman and Hall, New York
Roy, J. W., & Bickerton, G. (2011). Toxic groundwater contaminants: an overlooked contributor to urban stream syndrome? Environmental science & technology, 46(2), 729-736.
Schlüter, A., Szczepanowski, R., Pühler, A., & Top, E. M. (2007). Genomics of IncP-1 antibiotic resistance plasmids isolated from wastewater treatment plants provides evidence for a widely accessible drug resistance gene pool. FEMS microbiology reviews, 31(4), 449-477.
Simpson, E.H. 1949. Measurement of diversity. Nature (London) 163:688.
SVCA, 2010. Saugeen Conservation Water Quality Status Report. Saugeen Valley Conservation Authority. Drafted March 2011.
Top, E. M., & Springael, D. (2003). The role of mobile genetic elements in bacterial adaptation to xenobiotic organic compounds. Current Opinion in Biotechnology, 14(3), 262-269.
Van Houdt, R., Toussaint, A., Ryan, M. P., Pembroke, J. T., Mergeay, M., & Adley, C. C. (2000). The Tn4371 ICE family of bacterial mobile genetic elements.
Wright JF, Sutcliffe DW, Furse MT (2000) Assessing the biological quality of fresh waters: RIVPACS and other techniques. Freshwater Biological Association, Ambleside
Yarza, P, P. Yilmaz, E. Pruesse, F.O. Glöckner, W. Ludwig, K-H Schleifer, W.B. Whitman, J. Euzéby, R. Amann and R. Rosselló-Móra. 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Reviews Microbiology 12:635-645. doi:10.1038/nrmicro3330
Chapter 8 Conclusions and Future Directions
The majority of my PhD research has been dedicated to better understanding the nature
and potential role of Recombinase in Trio (RIT) elements, a novel mobile element that involves
three linked tyrosine-based site-specific recombinases from separate sub-families. From this
work it is evident that in certain ways RIT elements could be considered comparable to insertion
sequences, MGEs that encode genes for their own movement but carry no other functional
information. From my intensive search of the extant sequence databases, it is clear that in the
vast majority of cases the RIT elements contain only the open reading frames coding for the
individual recombinase proteins that are presumably responsible for their mobility. These open
reading frames are flanked by inverted repeats that are evidently involved in the excision and
re-integration of these elements. However, there are no IS families that are known to be mobilized
solely through the activity of a TBSSR, and unlike insertion sequences, which can occur in high
numbers in an individual genome (Siguier et al. 2014), multiple identical RIT elements within
an individual genome are relatively rare. As was described in Chapter 4, copy numbers of RIT
elements identified to date range from 1 to 5 and the majority occur in only one copy in the
genome. Therefore their primary role is not likely to be to provide homologous regions for
genome rearrangements. IS elements also work in concert to mobilize larger segments of DNA
(composite transposons), which has not been seen in RIT elements. There were only two
instances identified in Chapter 4 where a RIT element appears to have mobilized adjacent genes.
In each of these instances the additional genes are found between the RIT element and one of
the inverted repeats. This suggests a mechanism more consistent with the transposons and ICEs
that utilize a site-specific recombinase at one end of the element. In addition, since mobility of
the RIT element was only observed during conjugation, it is possible that their role in the cell
may be specific to MGE evolution, or movement of genes between replicons, as opposed to
larger genome rearrangements. The prevalence of RIT elements in multi-replicon genomes is
also consistent with a potential role in plasmid evolution.
The experiments performed here allow for some speculation on how RIT elements may
function. The first issue to be addressed is the presence of three TBSSRs since this is not
consistent with current knowledge of tyrosine recombinases. Since the recombination reaction
proceeds in a very symmetrical manner, a role for three separate enzymes is difficult to envision. However,
site-specific recombination generally utilizes two copies of each recombinase and therefore a
symmetrical arrangement is possible. This is supported by the presence of three putative
binding sites at each end of the RIT element (two within the inverted repeat and a third within
the palindrome sequence). This is also consistent with other mobile elements such as phage λ
and Tn4430, both of which have complex binding requirements for the recombination reactions
(Hallet et al. 2004). However, as only two cut sites are created in the crossover reaction, it is still
unclear whether all three recombinases would be active simultaneously or whether they
facilitate different reactions (integration vs. excision or inter- vs. intra-molecular
recombination).
The second issue to be addressed from these experiments is the apparent need for
conjugation in order for the RIT element to be mobilized. Since the recombinases were
separated from the complete RIT element and were already being induced in the recipient cell,
the experiment performed in Chapter 6 should not be interpreted as indicating that RIT elements
are likely to be mobilized while conjugating into a new host. The conjugation experiment was
chosen since it allowed for a single-stranded conformation, which has been shown to be
necessary for some integron recombination reactions (Loot et al. 2010). Within the cell,
single-stranded DNA is produced prior to conjugation and also by replicons that undergo rolling circle
replication. The large plasmids carrying RIT elements that I have identified are not likely to
replicate in this manner; however, this type of replication has been associated with some
integrative and conjugative elements (Wright et al. 2015). As RIT elements are generally found
contained within genomic islands (some of which may actually be ICEs), this provides an
opportunity whereby RIT elements could be activated specifically during either replication of
these elements or preparation for conjugative transfer. RIT elements could effectively be silent
when the ICE is integrated into the chromosome and become active when the ICE is stimulated
to excise from the chromosome. This allows for a brief time when RIT elements could mobilize
and have an impact on other targets within the cell.
There are many aspects of RIT element mobility that have yet to be determined. Most
importantly, the contribution of each individual recombinase is still an open question.
Expression plasmids containing each recombinase separately have already been prepared in the
pTrc99 backbone; however, these experiments have not been performed due to the false positive
issues with the experimental design. Purification of the recombinase proteins is also ongoing at
the SCK•CEN by Rob Van Houdt in order to perform binding assays on the target site, inverted
repeats and the palindrome sequence. The presence of a palindrome sequence beyond the
inverted repeats is a unique feature compared to other MGEs. Whether this sequence provides
accessory binding sites for regulation or forms an important secondary structure for activity has
yet to be determined.
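In sequence terms, a palindrome of the kind discussed above is a stretch of DNA that equals its own reverse complement, and an inverted repeat is a pair of segments that are reverse complements of each other. The following minimal sketch shows how such sequences can be checked computationally. The function names are illustrative, and the example sequence GAATTC is the well-known EcoRI recognition site, used here only as a stand-in; it is not the RIT palindrome.

```python
# Translation table mapping each base to its complement (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    upper = seq.upper()
    return upper == reverse_complement(upper)


def find_palindromes(seq: str, length: int) -> list[tuple[int, str]]:
    """Return (start, window) for every window of the given length that is palindromic."""
    upper = seq.upper()
    hits = []
    for start in range(len(upper) - length + 1):
        window = upper[start:start + length]
        if is_palindrome(window):
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    print(find_palindromes("AAGAATTCAA", 6))
```

The same `reverse_complement` helper can be used to confirm that two flanking sequences form an inverted repeat, by comparing one flank with the reverse complement of the other.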
Chapter 4 was published in 2012 and there were 148 RIT elements obtained through in
silico searching at that time. Subsequent BLAST searches from known elements and also
specifically from the palindrome sequences identified have brought this number to 183
(Supplemental Table 1); however, this is still clearly not an exhaustive list, as numbers continue
to increase with each subsequent search. Unfortunately, the prevalence of draft genomes
continues to limit our ability to examine the genomic context for many of these strains. It is
important to note, however, that despite their broad distribution there are remarkably few
instances of RIT elements shared between close relatives. This strongly suggests that RIT
elements are distributed between strains on other mobile elements and rarely incorporated into
more stable regions of the genome.
The inverted repeat primers developed in this project could be used to isolate additional
RIT elements from environmental samples, and additional primers could be designed based on
conserved repeats flanking other RIT elements in the collection. However, the most interesting
products of these primers would be the very rare instances where the RIT element is mobilizing
more genes than just the three recombinases. These could be preferentially retained through
size selection in order to separate them from the amplicons containing only the recombinase
genes. Of course, understanding the abundance and diversity of RIT elements containing highly
similar repeats would also be informative. Although a distinctive role for the RIT elements has
not yet been established, the work described here has revealed a great deal about their
distribution, their associations and their mobility. It is an important part of the larger goal of
improving our overall understanding of the nature of the many MGEs that have yet to be
characterized.
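As an illustration of how the degenerate inverted repeat primers in Table S1 relate to the strain-specific ones, the sketch below expands a degenerate primer into a regular expression and tests it against a target sequence. The primer sequences are taken from Table S1. Treating the "I" positions as inosine that pairs with any base is an assumption made for this example, and the function names are illustrative.

```python
import re

# IUPAC nucleotide codes used here; "I" (assumed to be inosine) is treated as matching any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "Y": "[CT]", "R": "[AG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]", "I": "[ACGT]",
}


def primer_to_regex(primer: str) -> re.Pattern:
    """Expand a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def primer_matches(primer: str, target: str) -> bool:
    """True if the degenerate primer covers the whole target sequence."""
    return primer_to_regex(primer).fullmatch(target.upper()) is not None


# Primer sequences as listed in Table S1.
IR_OLGA = "TTATGCCGATTCCCGGATTATGCCG"
IR_K31 = "TAATGCCGCGATCCGGATTATGCCG"
IR_AMBIG = "TWATGCCGIIIYCCSGATTATGCCG"
IR_LESS_AMBIG = "TTATGCCGIIIYCCSGATTATGCCG"

if __name__ == "__main__":
    for name, primer in (("IR_ambig", IR_AMBIG), ("IR_less_ambig", IR_LESS_AMBIG)):
        for target_name, target in (("IR_Olga", IR_OLGA), ("IR_K31", IR_K31)):
            print(name, target_name, primer_matches(primer, target))
```

Under these assumptions, IR_ambig covers both strain-specific repeats, while IR_less_ambig covers IR_Olga only, because its second position is fixed as T rather than W.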
At the outset of this work, many approaches were considered to quantify the mobilome
of stream environments at different levels of contamination including microarrays and a
comprehensive set of qPCR assays. My conclusion after investigating the degree of
conservation of even related elements was that approaches based on sequence similarities or
conserved primers for many MGE families were unfeasible. Fortunately, the decreasing price of
sequencing makes metagenomics of environmental samples a viable option for the near future.
This approach has not been utilized in this study for two reasons. The first is computational
cost. Although read lengths of Illumina next-generation sequencing have increased from only
35 bp at the beginning of this project to as much as 250 bp currently, the full genes would need
to be assembled in order for individual mobile elements to be assigned to families. This is
completely feasible on genomic samples, and has been performed on a small number of
metagenomic samples, but the cost is significant since it requires enough sequencing depth to
assemble an entire community (3-4 Gbp of sequence, depending on quality), as well as
sufficient computing power to perform the assembly (Luo et al. 2012).
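To give a rough sense of scale for the sequencing depth discussed above, the following sketch converts a target depth into a read count. The 3-4 Gbp range and the 35 bp and 250 bp read lengths come from the text; the function name is illustrative, and the calculation ignores quality filtering and paired-end layout.

```python
def reads_needed(total_bp: int, read_length_bp: int) -> int:
    """Smallest number of reads of the given length that yields at least total_bp."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    # Ceiling division using integers only.
    return -(-total_bp // read_length_bp)


if __name__ == "__main__":
    for depth_gbp in (3, 4):
        total_bp = depth_gbp * 1_000_000_000
        for read_length in (35, 250):
            count = reads_needed(total_bp, read_length)
            print(f"{depth_gbp} Gbp at {read_length} bp per read: {count:,} reads")
```

At 250 bp per read, 3 Gbp corresponds to 12 million reads, compared with roughly 86 million reads at 35 bp, all of which must then be assembled.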
The second issue with performing metagenomics currently is the annotation of the
assembled mobility genes. Automated annotation pipelines such as MG-RAST annotate
according to function of the closest homologues; however, functional annotations are limited to
general terms such as ‘integrase’, ‘mobile element protein’ or ‘transposase’. As described in
Chapter 2, these terms are uninformative given the diversity of enzymes categorized by these
terms. Moreover, the RIT element recombinases often are not annotated as mobile elements at
all but rather as either hypothetical proteins or components of ‘recombination and repair’
functions. Therefore the annotated TBSSR component of the metagenome can alternatively be
found in three different annotation categories – mobile element proteins, phage integrases, or
recombination and repair. Since the majority of the TBSSR families have not been investigated,
there is little to be gained currently from metagenomics for these enzymes. This is clearly
illustrated by the unique adaptive role played by integrons. The type and extent of bacterial
adaptation potential provided by integrons was completely unknown to science until their
association with antibiotic resistance justified a more detailed investigation of their activity.
However, without information on the mechanisms and distribution of other families of tyrosine
recombinases, we have a very limited ability to predict how unique these capabilities may be. In
order to better understand the potential roles of the many components of the mobilome, we must
begin to fill the knowledge gaps that clearly exist. More precise annotations, including genomic
context, will be necessary to fully understand the diversity of MGEs in environmental samples
and to begin to examine their abundance and distribution in contaminated vs. reference
ecosystems. This knowledge is important for determining the role that these MGEs play in both
individual genomes and bacterial communities, and will be essential to developing risk
assessment frameworks for monitoring both the current and developing risks of antibiotic
resistance genes in the environment. As we move into the inevitable age of environmental
metagenomics, annotation of available sequence data will continue to be the largest obstacle to
understanding bacterial evolution. The improvements in both sequencing technologies and
assembly algorithms have been important first steps; however increased funding for
experimental characterization of putative functions is still a limiting factor.
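The annotation problem described above, where candidate TBSSRs are scattered across three general annotation categories, can be illustrated with a small sketch that pools genes by label. The three category labels are taken from the text. The input format, gene identifiers and function name are hypothetical and do not reproduce the output of MG-RAST or any other pipeline.

```python
from collections import defaultdict

# Annotation categories in which TBSSRs are reported to appear (from the text).
TBSSR_LABELS = (
    "mobile element protein",
    "phage integrase",
    "recombination and repair",
)


def pool_tbssr_candidates(annotations):
    """Group gene identifiers by the first TBSSR-associated label found in their annotation.

    `annotations` is an iterable of (gene_id, annotation_text) pairs (hypothetical format).
    """
    pooled = defaultdict(list)
    for gene_id, text in annotations:
        lowered = text.lower()
        for label in TBSSR_LABELS:
            if label in lowered:
                pooled[label].append(gene_id)
                break
    return dict(pooled)


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    example = [
        ("gene_001", "Mobile element protein"),
        ("gene_002", "Phage integrase"),
        ("gene_003", "DNA recombination and repair"),
        ("gene_004", "hypothetical protein"),
    ]
    print(pool_tbssr_candidates(example))
```

Note that a keyword-based pooling of this kind still misses recombinases annotated only as hypothetical proteins, which is one reason more precise annotations are needed.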
8 References
Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips. ASM Press, Washington, D.C. USA
Loot, C., Bikard, D., Rachlin, A. and Mazel, D., 2010. Cellular pathways controlling integron cassette site folding. The EMBO journal 29(15):2623-2634.
Luo, C., D. Tsementzi, N. Kyrpides, T. Read and K. T. Konstantinidis. 2012. Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. PLoS ONE 7(2):e30087.
Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev 38: 865-891.
Wright LD, Johnson CM, Grossman AD (2015) Identification of a Single Strand Origin of Replication in the Integrative and Conjugative Element ICEBs1 of Bacillus subtilis. PLoS Genet 11(10): e1005556. doi:10.1371/journal.pgen.1005556
9 Appendix 1 Extra Tables
Table S1: Primers used in this study
Primer Name | Sequence | Reference
qPCR-intI1F | ACCAACCGAACAGGCTTATG | Nemergut et al. (2004) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-intI1R | GAGGATGCGAACCACTTCCAT |
qPCR-16S-338F | ACTCCTACGGGAGGCAGCAG | Fierer et al. (2005) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-16S-518R | ATTACCGCGGCTGCTGG |
sulI-F | CACCGGAAACATCGCTGCA | Luo et al. 2010. Environ. Sci. Technol. 44:7220–7225
sulI-R | AAGTTCCGCCGCAAGGCT |
IncP1 korA-F | TCATCGACAACGACTACAACG | Jechalke et al. 2013 Appl. Environ. Microbiol. 79(4):1410-1413.
IncP1 korA-R | TTCTTCTTGCCCTTCGCCAG |
IS1071_qPCR-F | GCACCAAGTCTGGGAATGAT | This study
IS1071_qPCR-R | ACGGGCATAGTGTTTCTTGG | This study
IR_Olga | TTATGCCGATTCCCGGATTATGCCG | This study
IR_K31 | TAATGCCGCGATCCGGATTATGCCG | This study
IR_ambig | TWATGCCGIIIYCCSGATTATGCCG | This study
IR_less_ambig | TTATGCCGIIIYCCSGATTATGCCG | This study
184circle-F | CCTCGCTAACGGATTCACCA | This study
184circle-R | TGGTGAATCCGTTAGCGAGG | This study
pTrc99-up_XbaI | CTTATCTAGAGTGAAATTGTTATCCGCTCACAATTCCAC | This study
pTrc99-dn_HindIII | ATGCAAGCTTGGCTGTTTTGGCGGATGAGAGAAG | This study
pTrc99-RitA-F | cttatctagacaggaaacagatcATGATTACGTGCGGGCCATTC | This study
pTrc99-RitA-R | ctagaagcttcgttgctagccTCATAGCGTGCCTCCCGCA | This study
pTrc99-RitB-F | cttatctagacaggaaacagatcATGAGCCTCACCGACCAGCTC | This study
pTrc99-RitB-R | ctagaagcttcgttgctagccTCATTGCACAGCTTCCCGGC | This study
pTrc99-RitC-F | cttatctagacaggaaacagatcATGAGCGCCGCCGCCTT | This study
pTrc99-RitC-R | tagaagcttacgttgctagcaTTAGAGACCTTCCAAGAACGCGAG | This study
K31RitA-up | CGATGATCGTCCGAGTCTGG | This study
K31RitC-dn | CACCACGGCGTCGATCCAGC | This study
Olga-RITA-up | CGTCCGTAGACGATCAAGG | This study
Olga-RITC-dn | GGACATGAATCATCTGAGACG | This study
Target1-FOR | AATTCCACCGCCCTGCACGAGCTGTCGCACTGGACGGGCTGCA | This study
Target1-REV | GCCCGTCCAGTGCGACAGCTCGTGCAGGGCGGTGG | This study
Target2-FOR | AATTCCACCGCCCTGCACGAGCTGGGCCACTGGACGGGCTGCA | This study
Target2-REV | GCCCGTCCAGTGGCCCAGCTCGTGCAGGGCGGTGG | This study
Target-up | CGACAGCTCGTGCAGGGC | This study
Target1-down | TGCACGAGCTGTCGCACTGGACGGG | This study
Target2-down | GTCCAGTGGCCCAGCTCGTGCAGGG | This study
Olga_RITBup | GCACTGCGACGTACCGAGC | This study
Olga_RITBdn | GCTATCTCAGCAGGAACTGTCC | This study
K31_RITBup | CAGGAACAGCGGCGTGTC | This study
K31_RITBdn | CTCCAACACGTACTGGTATCTGG | This study
pSF100-FOR-PstI | ATAACTGCAGATACCCACGCCGAAACAAG | This study
pSF100-REV-EcoRI | CGTCGAATTCATCGCTAGTTTGTTTTGACTCC | This study
Ampli-tet-5_out | GACGATGAGCGCATTGTTAG | K. Mijnendonckx
Ampli-tet-3_out | TCAGGGACAGCTTCAAGGAT | K. Mijnendonckx
Table S2: Dissolved oxygen values by month. Values for the Dyment’s Creek location are averages of multiple values taken each month during the 2011 field season. Values for all other sites are the corresponding 2011 value for that month from the PWQMN database. All values are in mg/L.
Site | June | July | August | September
Maskinonge River | 6.38 | -- | 5.23 | --
Uxbridge Brook | 8.32 | 8.09 | 8.08 | 8.95
Hamilton Creek | -- | -- | 8.66 | 9.95
North Saugeen | 10.4 | 8.07 | 8.53 | 10.17
Dyment's Creek | 5.9 | 6.52 | 6.69 | 6.8
Table S3: RIT Elements documented to date.
Strain Genbank Accension Location
(Phylum - if other than Proteobacteria);Class; Order
Burkholderia phytofirmans OLGA172 RITBphyt01 beta; Burkholderiales Cupriavidus metallidurans CH34 RITCme1 chromosome 1 CP000352
1393469-1396637 beta; Burkholderiales
Burkholderia sp. Ch1-1 NZ_JH603161.1
scaff_3 4103936-4107104 beta; Burkholderiales
Burkholderia sp. Ch1-1 NZ_JH603161.1
scaff_3 368714-371882 beta; Burkholderiales
Novosphingobium sp. PP1Y (RIT1) NC_015580.1
1558240-1561090 alpha; Sphingomonadales
Acidiphilium multivorum AIU301 NC_015186.1
449594-452732 alpha; Rhodospirillales
Acidiphilium multivorum AIU301 pACMV1(RIT1) NC_015178.1
223812-226950 alpha; Rhodospirillales
Acidiphilium multivorum AIU301 pACMV1 (RIT2) NC_015178.1
253938-257076 alpha; Rhodospirillales
Acidiphilium cryptum JF-5 pACRY01 NC_009467.1
175771-178909 alpha; Rhodospirillales
Caulobacter sp. K31 chromosome (RIT1) - NC_010338.1 NC_010338.1
2151285-2154423 alpha; Caulobacterales
Caulobacter sp. K31 chromosome (RIT2) NC_010338.1
2422880-2426018 alpha; Caulobacterales
Caulobacter sp. K31 pCAUL02 RIT1 - NC_010333.1 NC_010333.1 57564-60702 alpha; Caulobacterales Novosphingobium sp. PP1Y (RIT1) NC_015580.1
1558240-1561090 alpha; Sphingomonadales
Acidiphilium cryptum JF-5 pACRY03 NC_009469.1 38619-41757 alpha; Rhodospirillales Sinorhizobium medicae WSM419 pSMED02 NC_009621.1
369069-372212 alpha; Rhizobiales
Cupriavidus metallidurans CH34 RITCme2 CP000352
1362583-1365838 beta; Burkholderiales
Brenneria sp. EniD312 NZ_CM001230.1 3719322-3722580 gamma; Enterobacteriales
Bordetella petrii strain DSM 12804 (RIT1) NC_010170.1
1547845-1551100 beta; Burkholderiales
Burkholderia sp. YI23 plasmid byi-1p CP003090.1
1702794-1706046 beta; Burkholderiales
Burkholderia phymatum STM815 chromosome 2 CP001044.1
1976580-1979793 beta; Burkholderiales
Marinobacter sp. ELB17 NZ_AAXY01000001.1 415435-418639 gamma; Altermonodales
155
Marinobacter sp. ELB17 NZ_AAXY01000001.1 536430-539634 gamma; Altermonodales
Marinobacter sp. ELB17 NZ_AAXY01000007.1 167649-170853 gamma; Altermonodales
Marinobacter sp. ELB17 NZ_AAXY01000009.1 44734-47938 gamma; Altermonodales Aromatoleium aromaticum EbN1 NC_006513.1
281475-284649 beta; Rhodocyclales
Aromatoleium aromaticum EbN1 (plasmid 2) NC_006824.1
105863-109037 beta; Rhodocyclales
Aromatoleium aromaticum EbN1 (plasmid 2) NC_006824.1
129992-133166 beta; Rhodocyclales
Cupriavidus necator H16 pHG1 (RIT1) AY305378 32312-35249 beta; Burkholderiales Sinorhizobium fredii NGR234 pNGR234a (RIT2) NC_000914.2
233129-236294 alpha; Rhizobiales
Thauera sp. MZ1T CP001281.2 419038-422197 beta; Rhodocyclales
Candidatus Solibacter usitatus Ellin6076 (RIT1) NC_008536.1
4188436-4191597
(Acidobacteria);Solibacteres; Solibacterales
Candidatus Solibacter usitatus Ellin6076 (RIT2) NC_008536.1
9597168-9600446
(Acidobacteria);Solibacteres; Solibacterales
Mesorhizobium loti MAFF303099 (RIT1) NC_002678.2
4814973-4818126 alpha; Rhizobiales
Mesorhizobium loti MAFF303099 (RIT2) NC_002678.2
4880343-4883496 alpha; Rhizobiales
Cupriavidus necator H16 pHG1 (RIT2) AY305378 40646-43658 beta; Burkholderiales Acidovorax sp. NO-1 NZ_AGTS01000021.1 WGS ctg22 beta; Burkholderiales Leptothrix cholodnii SP-6 (RIT2) NC_010524.1
837820-841107 beta; Burkholderiales
Candidatus Accumulibacter phophatis clade IIA str. UW-1 pAph01 NC_013193.1
125367-128300 beta; unclassified
Leptothrix cholodnii SP-6 (RIT1) NC_010524.1
828122-831358 beta; Burkholderiales
Burkholderia sp. Ch1-1 NZ_JH603159.1
scaff_1 776303-779530 beta; Burkholderiales
Mesorhizobium loti MAFF303099 pMLa NC_002679.1
283344-286544 alpha; Rhizobiales
Thiomonas sp str. 3As FP475956.1 1980031-1983223 beta; Burkholderiales
Bifidobacterium longum NCC2705 (RIT1) NC_004307.2
1146734-1149950
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum NCC2705 (RIT2) NC_004307.2
1151346-1154562
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum NCC2705 (RIT3) NC_004307.2
1510118-1506902
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum F8 FP929034.1 977730-981079
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum DJ010A (RIT1) NC_010816.1 35114-38464
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum NC_010816.1 389423- (Actinobacteria);
156
DJ010A (RIT2) 392773 Bifidobacteriales Bifidobacterium longum DJ010A (RIT3) NC_010816.1
2152995-2156345
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum DJ010A (RIT4) NC_010816.1
1541436-1544786
(Actinobacteria); Bifidobacteriales
Bifidobacterium longum infantis 157F (RIT1) NC_015052.1
117374-120729
(Actinobacteria); Bifidobacteriales
B. longum subsp. longum JCM1217 (RIT1) NC_015067.1
998527-1001743
(Actinobacteria); Bifidobacteriales
B. longum subsp. longum JCM1217 (RIT2) NC_015067.1
1356344-1352994
(Actinobacteria); Bifidobacteriales
B. longum subsp. longum JDM301 (RIT1) NC_014169.1
516654-519870
(Actinobacteria); Bifidobacteriales
B. longum subsp. longum JDM301 (RIT2) NC_014169.1
894276-897626
(Actinobacteria); Bifidobacteriales
B. longum subsp. longum JDM301 (RIT3) NC_014169.1
2274116-2270900
(Actinobacteria); Bifidobacteriales
Burkholderia phymatum STM815 pBphy01 NC_010625.1
1602828-1606047 beta; Burkholderiales
Burkholderia phymatum STM815 pBphy02 (RIT3) NC_010627.1
393453-396672 beta; Burkholderiales
Mesorhizobium loti MAFF303099 (RIT3) NC_002678.2
5046735-5049426 alpha; Rhizobiales
Pseudomonas aeruginosa NCM1179 DF126593.1
WGS 3897228-3900501 gamma; Pseudomonodales
Acidovorax sp. NO-1 NZ_AGTS01000037.1 WGS ctg39 beta; Burkholderiales Burkholderia phymatum STM815 pBphy02 (RIT1) NC_010627.1
228640-231843 beta; Burkholderiales
Burkholderia sp. Ch1-1 NZ_JH603161.1
scaff_3 250576-253779 beta; Burkholderiales
Burkholderia sp. Ch1-1 NZ_JH603161.1
scaff_3 905113-908316 beta; Burkholderiales
Singulisphaera acidiphila DSM 18658 YP_007202199.1
2680371-2,683,568
Planctomycetes; Planctomycetales
Burkholderia phymatum STM815 pBphy02 (RIT2) NC_010627.1
367807-370963 beta; Burkholderiales
Mesorhizobium amorphae CCNWGS0123 NZ_AGSN01000188.1
WGS ctg00205 alpha; Rhizobiales
Polaromonas sp. JS666 NC_007948.1 723577-726741 beta; Burkholderiales
Acidithiobacillus ferroxidans ATCC 53993 NC_011206.1
309036-312209 gamma; Acidithiobacillales
Acidithiobacillus ferroxidans ATCC 53993 NC_011206.1
619670-622843 gamma; Acidithiobacillales
Burkholderia sp. H160 NZ_ABYL01000018.1
WGS ctg00540; 70230-73427 beta; Burkholderiales
Marinobacter ELB17 NZ_AAXY01000003.1 128404-131292 gamma; Alteromonadales
Klebsiella pneumoniae 342 NC_011283.1 1834119- gamma; Enterobacteriales
157
1836998
Bacteroides fragilis 3.1.12 NZ_EQ973215.1 WGS 152668-155859 (Bacteroidetes) Bacteroidales
Paracoccus sp. TRP NZ_AEPN01000095.1 WGS ctg 98 alpha; Rhodobacteriales Bacteroides sp. 2_2_4 NZ_EQ973384.1 superctg1.30 (Bacteroidetes) Bacteroidales
Opitutus terrae PB90-1 NC_010571.1 1510958-1514235 (Verrucomicrobia); Opitutales
Opitutus terrae PB90-1 NC_010571.1 3830062-3833339 (Verrucomicrobia); Opitutales
Opitutus terrae PB90-1 NC_010571.1 5652242-5565519 (Verrucomicrobia); Opitutales
Opitutus terrae PB90-1 NC_010571.1 5678108-5681385 (Verrucomicrobia); Opitutales
Novosphingobium pentaromativorans US6-1 NZ_AGFM01000100.1
WGS ctg00100 alpha; Sphingomonadales
Sphingomonas sp. SKA58 NZ_AAQG01000023.1 WGS 84-3662 alpha; Sphingomonadales Novosphingobium pentaromativorans US6-1 pLA1 (RIT2) NZ_AGFM01000122.1 77388-80973 alpha; Sphingomonadales Novosphingobium nitrogenifiges DSM 19370 NZ_AEWJ01000060.1
WGS ctg00067 alpha; Sphingomonadales
Novosphingobium sp. PP1Y (RIT2) NC_015580.1
2444845-2448439 alpha; Sphingomonadales
Roseobacter litoralis Och 149 NC_015730.1 1129515-1133109 alpha; Rhodobacteriales
Verminephrobacter aporrectodeae subsp. tuberculatae At4 NZ_AFAL01000379.1 WGS ctg385 beta; Burkholderiales
Candidatus Solibacter usitatus Ellin6076 (RIT3) NC_008536.1 3194175-3198639 (Acidobacteria); Solibacteres; Solibacterales
Burkholderia phytofirmans PsJN chromosome 1 (RIT2) NC_010681.1
259065-262635 beta; Burkholderiales
Mesorhizobium amorphae CCNWGS0123 NZ_AGSN01000034.1
WGS ctg00035 alpha; Rhizobiales
Sinorhizobium fredii NGR234 pNGR234a (RIT1) NC_000914.2
230024-233072 alpha; Rhizobiales
Sphingopyxis alaskensis RB2256 NC_008048.1
446826-450282 alpha; Sphingomonadales
Sphingomonas sp. KA1 pCAR3 NC_008308.1
193546-197002 alpha; Sphingomonadales
Novosphingobium pentaromativorans US6-1 pLA1 (RIT1) NZ_AGFM01000122.1 71938-75388 alpha; Sphingomonadales
Novosphingobium sp. PP1Y (RIT3) NC_015580.1 328363-331816 alpha; Sphingomonadales
Erythrobacter sp. SD-21 WGS NZ_ABCG01000002.1 66491-69569 alpha; Sphingomonadales
Dinoroseobacter shibae DFL 12 pDSHI01 NC_009955.1 100303-103738 alpha; Rhodobacterales
Dinoroseobacter shibae DFL 12 pDSHI03 NC_009957.1 67841-71276 alpha; Rhodobacterales
Mesorhizobium alhagi CCNWXJ12-2 NZ_AHAM01000339.1 WGS ctg361 alpha; Rhizobiales
Sinorhizobium meliloti 1021plasmid pSymA (RIT1) NC_003037.1
901120-904243 alpha; Rhizobiales
Sinorhizobium meliloti 1021plasmid pSymA (RIT2) NC_003037.1
1225061-1228184 alpha; Rhizobiales
Sulfitobacter sp. NAS-14.1 NZ_AALZ01000022.1 WGS 5396-7976 alpha; Rhodobacterales
Pelagibacterium halotolerans B2 NC_016078.1
3574154-3577274 alpha; Rhizobiales
Agrobacterium vitis S4 pTiS4 NC_011982.1 78540-81684 alpha; Rhizobiales
Roseovarius sp. 217 NZ_AAMV01000021.1 32068-35074 alpha; Rhodobacterales
Novosphingobium sp. PP1Y pMpl NC_015583.1 864650-867611 alpha; Sphingomonadales
Methylocystis sp. ATCC 49242
ctg206: 54584-57731 alpha; Rhizobiales
Rhodococcus opacus B4 pROBO2 NC_012521.1 80391-83520
(Actinobacteria); Actinomycetales
Mycobacterium vanbaalenii PYR-1 NC_008726.1
6334046-6337148
(Actinobacteria); Actinomycetales
Burkholderia sp. Ch1-1 NZ_JH603161.1
scaf_3 240108-243225 beta; Burkholderiales
Frankia sp. EAN1pec NC_009921.1 495942-498771
(Actinobacteria); Actinomycetales
Marinobacter aquaeolei VT8 pMAQU02 NC_008739.1
113911-117085 gamma; Alteromonadales
Rhizobium leguminosarum bv. viciae plasmid pRL11 NC_008384.1
376296-379437 alpha; Rhizobiales
Rhizobium leguminosarum bv. viciae plasmid pRL10 NC_008381.1 53267-56093 alpha; Rhizobiales
Mesorhizobium loti R7A symbiosis island AL672113.1 127743-130902 alpha; Rhizobiales
Syntrophobotulus glycolicus DSM 8271 NC_015172.1 48211-51807 (Firmicutes); Clostridiales
Desulfotomaculum gibsoniae DSM 7213 NZ_AGJQ01000018.1 WGS 99329-102528 (Firmicutes); Clostridiales
Dehalobacter sp. DCA NC_018866.1 1225943-1229157 (Firmicutes); Clostridiales
Desulfobacterium autotrophicum HRM2 NC_012108.1
3001915-3005127 delta; Desulfobacteriales
Lentibacillus sp. Grb1 NZ_AGAV01000005.1 contig005 (Firmicutes); Bacillus
Clostridium saccharolyticum WM1 NC_014376.1 3030356-3033574 (Firmicutes); Clostridiales
Bacillus sp. 10403023 (RIT2) NZ_HE610986.1 726773-729999 (Firmicutes); Bacillus
Sulfobacillus acidophilus TPY (RIT2) NC_015757.1
2174723-2177995 (Firmicutes); Clostridiales
Sulfobacillus acidophilus TPY (RIT1) NC_015757.1
489401-492631 (Firmicutes); Clostridiales
Legionella drancourtii LLAP12 JH413829.1 scaffold37 gamma; Legionellales
Heliobacterium modesticaldum Ice1 NC_010337.2 188072-191303 (Firmicutes); Clostridiales
Desulfosporosinus sp. OT TOU NZ_AGAF01000082.1
WGS assembly 178 (Firmicutes); Clostridiales
Acetivibrio cellulolyticus CD2 NZ_AEDB02000020.1
WGS scf3_ctg20 16804-20008 (Firmicutes); Clostridiales
Dysgonomonas mossii DSM 22836 NZ_ADLW01000025.1 WGS ctg1.25 (Bacteroidetes); Bacteroidales
Gramella forsetii KT0803 NC_008571.1 1284588-1287826
(Bacteroidetes); Flavobacteriales
Gramella forsetii KT0803 NC_008571.1 1401240-1404478
(Bacteroidetes); Flavobacteriales
Cupriavidus necator H16 pHG1 (RIT3) NC_005241.1 51476-54721 beta; Burkholderiales
Echinicola vietnamensis DSM 1752 NC_019904.1 24444-27688 (Bacteroidetes); Cytophagia
Bordetella petrii strain DSM 12804 (RIT2) NC_010170.1
1107785-1111030 beta; Burkholderiales
Bordetella petrii strain DSM 12804 (RIT3) NC_010170.1
1365410-1368586 beta; Burkholderiales
Johnsonella ignava ATCC 51276 NZ_ACZL01000056.1 WGS ctg1.56 (Firmicutes); Clostridiales
Roseburia inulinivorans DSM 16841 NZ_ACFY01000116.1 WGS ctg476.1 (Firmicutes); Clostridiales
Bacteroides finegoldii DSM 17565 NZ_ABXI02000050.1
WGS ctg8.4 46223-49494 (Bacteroidetes); Bacteroidales
Bacillus sp. 10403023 (RIT1) NZ_HE610986.1 730450-733658 (Firmicutes); Bacillus
Prevotella buccae D17 NZ_GG739978.1 WGS superctg1.53 (Bacteroidetes); Bacteroidales
Nocardioides sp. JS614 NC_008699.1 634392-637611
(Actinobacteria); Actinomycetales
Intrasporangium calvum DSM43043 NC_014830.1
948667-952111
(Actinobacteria); Actinomycetales
Mesorhizobium alhagi CCNWXJ12-2 NZ_AHAM01000340.1 WGS ctg362 alpha; Rhizobiales
Aromatoleum aromaticum EbN1 NC_006513.1 284698-287802 beta; Rhodocyclales
Sphaerochaeta pleomorpha str. Grapes NC_016633.1
1,216,600-1,219,767 Sphaerochaeta
Corynebacterium halotolerans YIM 70093 = DSM 44683 CP003697.1
126,593-129,791
Actinobacteria; Corynebacteriales
Mycobacterium kansasii ATCC 12478 CP006835.1
3,763,863-3,767,049
Actinobacteria; Corynebacteriales
Clostridium difficile QCD-63q42
Microlunatus phosphovorus NM-1 AP012204.1 5,536,190-5,539,379 Actinobacteria; Propionibacteriales
Prevotella oralis ATCC 33269
Thermoanaerobacterium thermosaccharolyticum M0795 NC_019970.1 425,454-428,667 Firmicutes; Clostridia
Sphingomonas echinoides ATCC 1482 PRJNA76627
Alpha-Proteobacteria; Sphingomonadales
Sphingobium yanoikuyae XLDN2-5 PRJNA71691
Alpha-Proteobacteria; Sphingomonadales
Pseudomonas extremaustralis 14-3 substr. 14-3b strain 14-3 PRJNA77729
Gamma-Proteobacteria; Pseudomonadales
Gordonia rhizosphera NBRC 16068 PRJDB4
Actinobacteria; Corynebacteriales
Thioalkalivibrio nitratireducens DSM 14787 PRJNA178382
Gamma-Proteobacteria; Chromatiales
Bacillus sp. 10403023 RIT2 PRJEA70827 Firmicutes; Bacilli
Singulisphaera acidiphila DSM 18658 PRJNA82973 Planctomycetes; Planctomycetales
Afipia felis ATCC 53690 PRJNA52159 Alpha-Proteobacteria; Rhizobiales
Burkholderia terrae BS001 PRJNA157903 Beta-Proteobacteria; Burkholderiales
Acidocella sp. MX-AZ02 PRJNA171232 Alpha-Proteobacteria; Rhodospirillales
Celeribacter baekdonensis B30 PRJNA170411
Alpha-Proteobacteria; Rhodobacterales
Candidatus Microthrix parvicella RN1 - WGS contig 2605_44 CANL01000039.1
Actinobacteria; Candidatus Microthrix
Roseivivax atlanticus strain 22II-s10s contig25 AQQW01000025.1
10,337-13,415
Alpha-Proteobacteria; Rhodobacterales
Phaeobacter gallaeciensis DSM26640 pGal_B134 NC_023148.1 Phaeobacter gallaeciensis DSM26640 NC_023137.1
Alpha-Proteobacteria; Rhodobacterales
Bradyrhizobium sp. STM 3809 PRJEA72433 Alpha-Proteobacteria; Rhizobiales
Draconibacterium orientale strain FH5T CP007451.1
3,816,933-3,819,813
Bacteroidetes; Bacteroidales
Photorhabdus temperata subsp. temperata Meg1 PRJNA217865
Gamma-Proteobacteria; Enterobacteriales
Burkholderia glathei PRJEB6934 Beta-Proteobacteria; Burkholderiales
acid mine drainage metagenome
Mycobacterium austroafricanum strain DSM 44191 PRJEB5747 Actinobacteria; Corynebacteriales
Ferrovum myxofaciens strain P3G Contig179 PRJNA255880
Beta-Proteobacteria; Ferrovales
Acidovorax sp. KKS102 RIT1 CP003872.1 1283645-1286932 Beta-Proteobacteria; Burkholderiales
Acidovorax sp. KKS102 RIT2 CP003872.1 1297698-1300985
Acidovorax sp. KKS102 RIT3 CP003872.1 1302872-1306159
Acidovorax sp. KKS102 RIT4 CP003872.1 2254833-2258120
837820-841107
Arthrobacter sp. Soil736 LMSB01000017.1 64,829-67,968
Actinobacteria; Micrococcales
Thalassobacter stenotrophicus CYRX01000028.1 WGS
Alpha-Proteobacteria; Rhodobacterales
Thioclava dalianensis strain DLFJ1-1 JHEH01000032.1 WGS
Alpha-Proteobacteria; Rhodobacterales
Paenirhodobacter enshiensis strain DW2-9 contig24_scaffold11 JFZB01000023.1
52,270-55,423
Alpha-Proteobacteria; Rhodobacterales
Appendix 2 Sampler Construction and Site Information
River Samplers
1 ½” x 1 ft length clear polycarbonate tube - cut with bandsaw, deburred and belt sanded
End caps: 1 ½” copper to DWM adapters for fittings – machined inside to fit and adhered with 100% polyurethane construction adhesive; window screening and nylon adhered internally to fitting ends with hot glue
Filled with fine grain sand and attached with 90 lb threaded marine approved rope to two 4L pop bottles for floatation (2” distance between sampler and pop bottle lid). Tied with the same rope to cement block with minimum of 2 ft length of rope.

2011 Site assessment

June 30, 2011: Hamilton Creek and Rocky Saugeen samplers installed

Hamilton Creek – West Back Line, just north of Chatsworth Rd. 24 (East of Williamsford)
GPS: 44°24’18”N 80°47’47”W
Upstream of bridge (east of road)
Main channel width – 11.9 m (39 ft)
Water depth and hydraulic head
1/3 – 35 cm and no head
midchannel – 48 cm (2 cm head)
2/3 – 42 cm (1 cm head)
Velocity – 10 m travelled in 15 seconds (0.667 m/s)
Temperature – 16°C
Transparency – clear to bottom
Sediment – silt and boulders ranging from 7 - 30 cm diameter; boulders look like pieces of cement and are covered in gravel type substance (and red on the bottom)
Surrounding vegetation – grasses from edge, lots of variety
Other notes: very wide with many dead trees – swamp? Perhaps this is low water for the region; lots of fish with stripe down the side (up to 7 cm in length)

Rocky Saugeen – 8th concession west of Traverston Rd (south of Grey Rd 12)
GPS: 44°27’35”N 81°2’42”W
Downstream of bridge (north of road)
Main channel width – 6.8 m (22 ft), narrows upstream to approx. 5.5 m
Water depth and hydraulic head measurements (taken upstream of bridge!)
1/3 – 19 cm and no head
midchannel – 24 cm (5 cm head)
2/3 – 30 cm (4 cm head)
Velocity – 10 m travelled in 16.7 seconds (0.599 m/s)
Temperature – 16°C
Transparency – clear to bottom
Sediment – rocks ranging from 10 - 30 cm diameter
Surrounding vegetation – overhanging willow (upstream) and cedar trees; lots of riparian vegetation
Other notes: no fish observed; in a cedar forest with lots of trees; old sign hanging by river ‘Markdale fire truck water load area’; difficulty placing sampler – rocky bottom therefore hard to find solid place without tilting and water turbulent and pulling sampler down (about 20-30 cm below top of the water and only about 10 cm above rocks)

July 1, 2011

North Saugeen – 8th sideroad between conc. 4 & 6 (East of Moorsburg)
GPS: 44°20’16”N 80°56’0”W
Runs parallel to road
Main channel width – approx 15 m (50 ft)
Water depth and hydraulic head
1/3 – 40 cm (2 cm head)
midchannel – 38 cm (6 cm head)
2/3 – 30 cm (2 cm head)
Velocity – 11 meter travelled in 25 seconds (0.44 m/s)
Temperature – 20°C
Transparency – clear to bottom
Sediment – rocks (up to approx 20 cm diameter) and pebble; some boulders look like pieces of cement and are covered in gravel type substance (as seen at Hamilton Creek); moss on rocks
Surrounding vegetation – all cedar trees; many dead cedars as far as visible – widening of river?
Other notes: placed sampler in a pocket 68 cm deep and behind a tree to minimize visibility to kayakers

July 8, 2011

East Holland River at Green Lane
Width (estimated from bridge): 14.9 m at widest; 9 m under bridge
Depth: 50 cm (1/3 channel); 58 cm mid-channel
Velocity: 1 min 14 sec for 10 m
Temperature: 24°C
Turbidity: about 10 cm; very muddy therefore quite turbid
Sediment: Mud and rock (sand and boulders)
Vegetation: lots of grasses; healthy riparian region therefore turbidity not likely from erosion in this section of the river
Other observations: crayfish observed and turtle; much deeper than previous visit; green tinge to the water
Placed sampler under bridge but midstream to discourage interference – gone after one month. Could have been taken or could have been washed away in a rain event.
July 8, 2011

Uxbridge Brook at Davis Drive
Depth: 40 cm (1/3 near bank), 53 cm mid-stream, 26 cm (1/3 opposite bank)
Width: 6.5 m, 6.1 m
Velocity: 12.9 seconds for 10 m
Temperature: 21°C
Turbidity: clear to bottom
Vegetation: lots of riparian vegetation; grasses and ferns and trees
Sediment: pebbles at edges, medium boulders covered with green plants in center
Other observations: lots of dragonflies and butterflies
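The velocities quoted in the site notes appear to be a travel distance divided by a travel time. The short Python sketch below reproduces that arithmetic from the recorded timings; the function name `float_velocity` and the dictionary layout are illustrative assumptions, not part of the original field protocol.

```python
def float_velocity(distance_m: float, seconds: float) -> float:
    """Mean surface velocity (m/s) for an object timed over a known distance."""
    if seconds <= 0:
        raise ValueError("travel time must be positive")
    return distance_m / seconds


# (distance in m, travel time in s) as recorded in the site notes
timings = {
    "Hamilton Creek": (10, 15),
    "Rocky Saugeen": (10, 16.7),
    "North Saugeen": (11, 25),
    "East Holland": (10, 74),  # recorded as 1 min 14 sec
    "Uxbridge Brook": (10, 12.9),
}

velocities = {site: float_velocity(d, t) for site, (d, t) in timings.items()}

for site, v in velocities.items():
    print(f"{site}: {v:.3f} m/s")
```

Under this assumption the East Holland and Uxbridge Brook timings give roughly 0.135 and 0.775 m/s, which agree with the 0.13 and 0.77 m/s entries in the summary table if those were truncated rather than rounded.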
Location Hamilton Creek
Rocky Saugeen/Gypsy Creek
North Saugeen
East Holland
Uxbridge Brook
Date installed 30-Jun 30-Jun 01-Jul 08-Jul 08-Jul
Date collected 08-Nov
Street Location West Back Line
house#443843 ncession west of
t Church Road; st of Traverston
east of Moorseburg; 8th sdrd b/t conc. 4 & 6
Green Lane/Rogers Reservoir at Davis Drive
Placement in stream
upstream of bridge
downstream from road crossing
closest to far bank
under bridge, midstream
downstream of road crossing
GPS 44 24' 18" N 44 27' 35" N 44 20' 16" N
80 47' 47" W 81 2' 42" W 80 56' 0" W
Bank Width 11.9 m
6.8 m (narrows to 5.5 m)
~ 50 ft but narrows up and downstream 9 m 6.5 m
Water depth (cm) 35 19 40 58 40
48 24 38 50 53
42 30 30 --- 26
avg water depth (cm) 41.7 24.3 36.0 54.0 39.7
Hydraulic Head (cm) 0.0 0 2
2.0 5 6
1.0 4 2
avg hydraulic head (cm) 1.0 3.0 3.3
Velocity 0.67 m/s 0.60 m/s 0.44 m/s 0.13 m/s 0.77 m/s
Temperature (Celsius) 16 16 24 21
Transparency clear to bottom clear to bottom
clear to bottom
about 10 cm; very muddy
clear to bottom except in deep pools
Sediment silt and boulders rocky (10-30 cm diameter)
rocks with moss and pebble
mud and rock
sand and pebble at edge, bigger rocks in middle with attached algae and green plants; deep sand in pool
Surrounding vegetation
varied; grasses from edge
overhanging willow and cedars; lots of riparian veg.
widening section therefore dead and dying cedars along length
lots of grasses; healthy riparian region therefore turbidity not likely due to erosion in this section
a lot of grass; healthy riparian vegetation - grasses & ferns etc.
Comments
boulders have gravel type substance coating; lots of 5-7 cm fish (stripe on side); very wide - not a straight channel (swamp at low tide?); many dead trees
scared a deer away; no fish observed; heavily wooded area; Markdale firetruck water load area (old sign)
placed sampler in 68 cm deep pocket
crayfish observed and turtle; much deeper than previous visit; green tinge to water
a lot of dragonflies and butterflies;
off of Traverston Rd
Water samples collected 18-Jul 18-Jul 18-Jul 04-Aug 04-Aug
Water Depth (cm) 28 19 32 57 42
43.5 34 28 59 59
53 28 52 57
avg Water depth (cm) 41.5 27 37.3 58
Hydraulic head 0 1 3 0 3
0 0 5 0 3
0 0 5 2
avg Hydraulic head (cm) 0.0 0.3 4.3 0
Wetted bank width 14.3 m 4.6 m --- 8.5 m ---
Hydrolab - Temp 26.16 23.38 25.62 20 17.62
SpC (mS/cm) 0.422 0.45 0.436 0.68 0.521
Dissolved Oxygen 6.1 mg/L 6.55 mg/L 9.11 mg/L 7.28 mg/L 7.89 mg/L
pH 7.84 7.76 7.36 7.53 7.64
Total Dissolved Solids 0.3 g/L 0.3 g/L 0.3 g/L 0.4 g/L 0.3 g/L
DO% 80 83 120.2 87.7 90.9
BOD when sampled 6.4 mg/L 6.5 mg/L 8.60 mg/L 7.68 mg/L 7.36 mg/L
BOD after 5 days at 20C 6.64 mg/L 6.59 mg/L 6.84 mg/L ---- ----
iPhone GPS 44 15' 56" N 44 22' 7"N NOTE: rain event sampling
80 44' 23" W 80 53' 37" W
lots of crayfish
little fish; dark, small crayfish
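The average water depth and average hydraulic head rows for the installation visit can be checked against the individual readings. The Python sketch below assumes each average is a simple arithmetic mean of the readings recorded for that site, rounded to one decimal place; the variable names are illustrative, and the East Holland site has only two depth readings because the third is not recorded.

```python
from statistics import mean

# Water depth readings (cm) at installation, per site
depths_cm = {
    "Hamilton Creek": [35, 48, 42],
    "Rocky Saugeen": [19, 24, 30],
    "North Saugeen": [40, 38, 30],
    "East Holland": [58, 50],  # third reading not recorded
    "Uxbridge Brook": [40, 53, 26],
}

# Hydraulic head readings (cm); recorded for the first three sites only
heads_cm = {
    "Hamilton Creek": [0, 2, 1],
    "Rocky Saugeen": [0, 5, 4],
    "North Saugeen": [2, 6, 2],
}

avg_depth = {site: round(mean(v), 1) for site, v in depths_cm.items()}
avg_head = {site: round(mean(v), 1) for site, v in heads_cm.items()}

for site in depths_cm:
    head = avg_head.get(site, "---")
    print(f"{site}: depth {avg_depth[site]} cm, head {head} cm")
```

With these inputs the computed means match the tabulated averages (41.7, 24.3, 36.0, 54.0 and 39.7 cm for depth; 1.0, 3.0 and 3.3 cm for hydraulic head).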