Recombinase in Trio (RIT) Elements in Bacterial

Genomes: Assessing the Distribution and Mobility of

a Novel yet Widespread Set of Mobile Genes.

by

Nicole Dorothy Ricker

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Department of Physical and Environmental Sciences University of Toronto Scarborough

© Copyright by Nicole Ricker 2016

Recombinase in Trio (RIT) Elements in Bacterial Genomes:

Assessing the Distribution and Mobility of a Novel yet

Widespread Set of Mobile Genes.

Nicole Dorothy Ricker Doctor of Philosophy

Department of Physical and Environmental Sciences University of Toronto Scarborough

2016

Abstract

The research performed over the course of my doctoral training outlines the environmental distribution, mobility, expression and potential role of a newly described family of mobile elements, and provides valuable information on the challenges and potential benefits of environmental metagenomics. Sequencing technologies have evolved considerably over the course of this work, and evaluating the limitations and opportunities provided by these evolving technologies has formed a significant portion of my thesis work. The remainder of the work has been dedicated to understanding the distribution and mechanisms of Recombinase in Trio (RIT) elements, a previously underappreciated mobile element found in a large diversity of strains, but predominantly in non-pathogenic bacteria. RIT elements contain three tyrosine-based site-specific recombinases and display a characteristic gene order and repeat architecture that is conserved across 7 bacterial phyla (Van Houdt et al. 2006; Van Houdt et al. 2012; Ricker et al. 2013). RIT elements have been postulated to be mobile due to the occurrence of multiple identical copies within individual genomes, and are commonly found on plasmids and in genomic islands, including plant symbiosis and catabolic islands. The ability of RIT elements to excise and relocate themselves was tested using a variety of mating experiments. Although the determination of a potential target site sequence was initially elusive, the discovery that the RIT element also included a 20 bp palindrome adjacent to one of the terminal inverted repeats allowed for the alignment of the target genes and revealed the original target site sequence. Subsequently, RIT element mobility was observed during conjugation, and the transconjugants analyzed provided some insight into the mechanism of recombination. Finally, environmental sampling was performed on Southern Ontario streams in order to develop a methodology for evaluating the mobilome of bacterial communities.

Acknowledgments

No great achievement is accomplished without having a thousand people to thank. It

would be impossible to list all the people that have helped, supported and encouraged me over

the years and I hope that you truly understand my gratitude for each and every one of you. I

want to especially thank my outstanding supervisor, Roberta Fulthorpe, for all of your amazing

mentorship over the past 6 years. You have provided me with encouragement and support when

I felt unsure, clarity and direction when I was muddled, and a firm kick when I was stalled. Not

to mention physical labour and beautiful lake scenery for balance, and the insight to recognize

an amazing opportunity when it came knocking. I could not ask for a better supervisor for my

PhD, or a better mentor for my career. I would also like to thank my committee members (Don

Jackson and William Navarre) for their outstanding insight and recommendations throughout the

project, as well as their patience and encouragement.

To my husband Toby – you have been my rock throughout my PhD and have done so

much more than I ever could have asked of you. From sampler design, creation and installation

to learning site specific recombination mechanisms and moving to Belgium (twice!), you’ve put

in the blood, sweat and tears of this PhD and I am truly blessed to have such a wonderful partner

in my life. Thanks also to my Mom for her endless support including hopping on a plane last

minute to help with the first move to Belgium – and for helping to make sure I didn’t fall apart

once we got there. Thanks to my Dad for assisting with all the reference site samplers and

reminding me why I’m in this field by constantly making me defend science; and to my siblings

for keeping me grounded while also reminding me that I could do this.

My time at UTSC has been filled with amazing people and opportunities that I had never

anticipated. I want to thank everyone at the Fulthorpe lab (past and present) for all your support

and encouragement, and for putting up with endless talks about RIT elements. I especially have

to thank Tony, Roxana and Rosemary for all your dedication and friendship. Last but not least, I

am so grateful for having had the opportunity to work with Rob Van Houdt and Bernard Hallet,

as well as Ann Provoost, Kristel Mijnendonckx and all the other members of the SCK•CEN, and am grateful

to the W. Garfield Weston Foundation for providing funding for this international collaboration.

Table of Contents

Acknowledgments .......................................................................................................................... iii

Table of Contents ........................................................................................................................... iv

List of Tables ............................................................................................................................... viii

List of Figures ................................................................................................................................. x

List of Appendices ....................................................................................................................... xiii

Chapter 1 Introduction .................................................................................................................... 1

1.1 References ........................................................................................................................... 3

Chapter 2 The Role of Mobile Genetic Elements in Prokaryotic Adaptation ................................ 5

2 Horizontal Gene Transfer ........................................................................................................... 5

2.1 Intracellular MGEs .............................................................................................................. 7

2.2 Intercellular MGEs .............................................................................................................. 9

2.3 Impact on Genome Evolution ........................................................................................... 12

2.4 References ......................................................................................................................... 15

Chapter 3 The Limitations of Draft Assemblies for Understanding Prokaryotic Adaptation and Evolution ........................................................................................................................... 21

3 Introduction .............................................................................................................................. 21

3.1 Methods ............................................................................................................................. 24

3.2 Results ............................................................................................................................... 25

3.2.1 Assembly Quality for Cupriavidus metallidurans CH34 ..................................... 25

3.2.2 Contigs terminate at repeated elements and mobile elements .............................. 29

3.2.3 Fragmentation is greatest at genomic island sites ................................................. 30

3.2.4 Investigating the relative contribution of multiple replicons or presence of documented mobility genes by comparison with other strains ............................. 32

3.2.5 Fragmentation Evident in Real Data ..................................................................... 36

3.3 Discussion ......................................................................................................................... 38

3.4 Acknowledgements ........................................................................................................... 42

3.5 References ......................................................................................................................... 43

Chapter 4 Phylogeny and Organization of Recombinase in Trio (RIT) Elements ....................... 47

4 Introduction .............................................................................................................................. 47

4.1 Methods ............................................................................................................................. 48

4.2 Results and Discussion ..................................................................................................... 48

4.2.1 Abundance and Occurrence in Database .............................................................. 48

4.2.2 RIT Structure and Organization ............................................................................ 51

4.2.3 Inferred RIT Functionality .................................................................................... 53

4.2.4 Evidence for RIT Mobility Within Closely Related Strains ................................. 55

4.2.5 Similarities between RIT elements and evidence for broad distribution. ............. 61

4.2.6 RIT Classification ................................................................................................. 65

4.3 Conclusions ....................................................................................................................... 67


4.5 References ......................................................................................................................... 69

Chapter 5 The Chlorocatechol Degradative Operon in Burkholderia sp. strain OLGA172 Resides in Chromosomal Area of Genome Plasticity as revealed through PacBio Single-Molecule Sequencing ............................................................................................................... 71

5 Introduction .............................................................................................................................. 71

5.1 Materials and Methods ...................................................................................................... 74

5.1.1 Short read NGS sequencing .................................................................................. 74

5.1.2 PacBio Single Molecule Sequencing .................................................................... 74

5.1.3 Assembly of Short Read Technologies and PacBio corrected reads .................... 75

5.1.4 Gene Annotation and Contig Validation ............................................................... 75

5.1.5 Comparisons to Related Finished Genomes ......................................................... 76

5.1.6 Large Plasmid Extraction ...................................................................................... 76

5.2 Results ............................................................................................................................... 76

5.2.1 Overall Genome Analysis ..................................................................................... 76

5.2.2 Biological consistency of the Assembly ............................................................... 78

5.2.3 Capacity of the PacBio Assembly for comparative studies .................................. 82

5.2.4 Highlighting a region of Strain Specificity – The Chlorocatechol (CC) Degradative Operon ............................................................................................ 83

5.2.5 Limitations of the PacBio Assembly ...................................................................... 86

5.3 Discussion ......................................................................................................................... 87


5.5 References ......................................................................................................................... 90

Chapter 6 Expression and Activity of RIT Elements .................................................................... 96

6 Introduction .............................................................................................................................. 96

6.1 Materials and Methods ...................................................................................................... 97

6.1.1 Growth of Bacterial Strains .................................................................................. 97

6.1.2 Construct creation ................................................................................................. 98

6.1.3 Mating-out Assays .............................................................................................. 100

6.1.4 Conjugation Experiments .................................................................................... 100

6.1.5 Expression Experiments ...................................................................................... 101

6.2 Results ............................................................................................................................. 101

6.2.1 No evidence of Intra-cellular mobility without a target site ............................... 102

6.2.2 Target site identification ..................................................................................... 105

6.2.3 Sequencing analysis of transconjugants .............................................................. 108

6.2.4 Application of these Results to other RIT Elements ........................................... 112

6.3 Discussion ....................................................................................................................... 113

6.4 References ....................................................................................................................... 117

Chapter 7 Developing a standardized method for analyzing gene content of bacterial communities in streams with varying degrees of urbanization............................................. 119

7 Introduction ............................................................................................................................ 119

7.1 Materials and Methods .................................................................................................... 120

7.1.1 Sampling locations and collection of benthic invertebrates ............................... 121

7.1.2 Sampler Design ................................................................................................... 123

7.1.3 Bacterial Community Assessment ...................................................................... 124

7.1.4 Quantitative PCR ................................................................................................ 125

7.2 Results ............................................................................................................................. 126

7.2.1 Macroinvertebrate metrics of ecosystem health ................................................. 126

7.2.2 Community diversity measures ........................................................................... 129

7.2.3 Quantitative PCR ................................................................................................ 133

7.2.4 Correlations between bacterial communities and water quality parameters ....... 134

7.2.5 Primer design specific to RIT elements .............................................................. 137

7.3 Discussion ....................................................................................................................... 138

7.3.1 Biomonitoring ..................................................................................................... 138

7.3.2 Bacterial community assessment ........................................................................ 139

7.4 Acknowledgements ......................................................................................................... 143

7.5 References ....................................................................................................................... 144

Chapter 8 Conclusions and Future Directions ............................................................................ 147

8 References .............................................................................................................................. 151

9 Appendix 1 Extra Tables ........................................................................................................ 152

Appendix 2 Sampler Construction and Site Information ............................................................ 162

List of Tables

Table 3.2.1: Number of contigs aligning and coverage statistics for each of the four replicons in

C. metallidurans CH34 using Velvet and ABySS genome assembly software. ............................. 26

Table 3.2.2: Details on the terminal regions for 7 large contigs. .................................................. 29

Table 3.2.3: Genomic islands found on chromosome 1 of CH34. ................................................ 31

Table 3.2.4: Velvet assembly metrics of the 5 genomes compared. ............................................. 33

Table 4.2.1: Summary of information of putative RIT elements found in this study. .................. 49

Table 4.2.2: Potential recognition or regulatory sites contained within terminal inverted repeats. 60

Table 5.2.1: Statistics of PacBio unitigs assigned as putative replicons. ..................................... 78

Table 5.2.2: Comparison of assembled genome of Burkholderia sp. str. OLGA172 with other

closely related Burkholderia strains. ............................................................................................. 80

Table 6.1.1: List of strains used in this study. .............................................................................. 97

Table 6.1.2: List of constructs created during this study. ............................................................. 98

Table 6.2.1: Decrease in optical density of cell cultures after induction with IPTG. ................. 104

Table 6.2.2: Conserved sequences found in a variety of alpha- and beta-Proteobacteria containing

RIT elements. .............................................................................................................................. 112

Table 7.1.1: Sampling locations for river assessments. .............................................................. 122

Table 7.1.2: Primers for quantitative PCR. ................................................................................. 125

Table 7.2.1: Comparison of field sites based on biotic indices of benthics obtained during this

study. ........................................................................................................................................... 127

Table 7.2.2: DeltaCt comparison of environmental samplers by quantitative real-time PCR. ... 133

Table 7.2.3: Water quality parameters for each site. .................................................................. 135

Table 7.2.4: Correlations of the bacterial communities to available water quality data. ............ 136

List of Figures

Figure 3.2.1: Number of assembled contigs in Velvet aligning to replicons in C. metallidurans

CH34. ............................................................................................................................................ 27

Figure 3.2.2: Geneious alignment of assembled contigs to two key regions containing genomic

islands in C. metallidurans CH34. ................................................................................................ 28

Figure 3.2.3: Relationship between N50 (as percentage of the largest replicon in the genome)

and three parameters thought to influence assembly quality. ....................................................... 35

Figure 3.2.4: Relationship between three measures of assembly quality (maximum contig length,

N50 and N50 as percent of longest replicon) and number of genomic islands as predicted by

IslandViewer. ................................................................................................................................ 36

Figure 3.2.5: Geneious alignment of real contigs obtained from the GAGE assembly data

(Salzberg et al. 2012). ................................................................................................................... 37

Figure 4.2.1: Comparison of the taxonomic representation of our RIT collection with the

abundance of the same taxonomic grouping in the NCBI genome database. ............................... 51

Figure 4.2.2: Names and arrangements of tyrosine recombinase sub-families. ............................. 52

Figure 4.2.3: Comparison of conservation between the Int1 (pAE1) recombinases (A - top) and

Int3 (SG5) recombinases (B - bottom) from 40 divergent representatives. .................................. 54

Figure 4.2.4: Arrangement of RIT elements on the chromosome of Caulobacter sp. K31. ......... 57

Figure 4.2.5: Phylogenetic analysis by 16S (A) and nucleotide sequence of the RIT elements

obtained in this study (B). ............................................................................................................. 64

Figure 4.2.6: Individual congruency trees for each of the recombinases in a selection of RIT

elements. ....................................................................................................................................... 67

Figure 5.2.1: Chromosome 1 of Burkholderia sp. str. OLGA172 as determined by PacBio

sequencing. .................................................................................................................................... 79

Figure 5.2.2: Large plasmid extraction. ........................................................................................ 81

Figure 5.2.3: MAUVE alignment of chromosome 1 from six Burkholderia strains. ................... 82

Figure 5.2.4: Genomic arrangement of chromosome 1 genes from Burkholderia sp. str.

OLGA172 and comparison to homologous regions of related strains. ......................................... 86

Figure 6.1.1: Constructs used in the final conjugation experiment. ............................................. 99

Figure 6.2.1: Expression of recombinase genes from pKK223-OlgaA-C and pKK223-K31A-C

expression vectors. ...................................................................................................................... 103

Figure 6.2.2: PCR amplification using primers designed to amplify out from the kanamycin

gene. ............................................................................................................................................ 104

Figure 6.2.3: Orientation of RIT elements in Caulobacter sp. K31 relative to the direction of the

target gene DUF1738. ................................................................................................................. 106

Figure 6.2.4: Final experimental design. .................................................................................... 107

Figure 6.2.5: Reversal of RIT element in positive transconjugants. ........................................... 108

Figure 6.2.6: Mating results for the recipient strain containing pTrc99-K31A-C and pACYC-

TSV1. .......................................................................................................................................... 109

Figure 6.2.7: Target site 1 transconjugants retaining both kanamycin and tetracycline resistance. 110

Figure 6.2.8: Sequencing results of co-integrate structure of clone I4. ...................................... 111

Figure 6.3.1: Model for RIT element mobility based on experimental results. .......................... 115

Figure 7.1.1: Map of sampling locations. ................................................................................... 123

Figure 7.1.2: Aquatic environment bacterial community samplers. ........................................... 124

Figure 7.2.1: Lake Simcoe region samplers after retrieval. ........................................................ 128

Figure 7.2.2: Cluster analysis of T-RFLP data showing within sampler variation. .................... 129

Figure 7.2.3: Principal coordinate analysis of T-RFLP results from sampler replicates. ........... 131

Figure 7.2.4: Principal coordinate analysis (PCoA) of the bacterial community compositions

revealed by 16S pyrosequencing data. ........................................................................................ 132

List of Appendices

Appendix 1: Extra tables ……………………………………………………… 152

Table S1: Primers used in this study ………………………………….. 152

Table S2: Dissolved oxygen values by month ………………………… 154

Table S3: RIT elements determined to date …………………………… 154

Appendix 2: Sampler construction and site information …………………….. 162

Chapter 1 Introduction

My research relates to understanding the mechanisms of bacterial adaptation, and

particularly how bacteria acquire and distribute genes through horizontal gene transfer (HGT).

This is a topic that impacts every field of biology from ecology to medicine due to the ubiquity

of bacteria and the range of diverse skills that they acquire through HGT, including

pathogenesis, antibiotic resistance, root nodulation and xenobiotic compound degradation

(Springael and Top, 2004; Frost et al. 2005; Siefert 2009). For this reason, understanding the

mechanism and regulation of the genes involved in HGT provides universally applicable

benefits. The goal of my graduate research has been to better characterize the genes involved in

creating diversity within individual bacterial genomes, and to make progress towards

investigating the effects that exposure to environmental pollutants has on the abundance and

activity of mobile genetic elements (MGEs). This work was inspired by similar research into the

distribution and expression of integrons (Wright et al. 2008; Koenig et al. 2009) and plasmids

(Smalla and Sobecky, 2002; Springael and Top, 2004). I provide a summary of our

understanding of the range of MGEs found in bacteria and some details on their agents of

mobility in the next chapter (Chapter 2) to assist the reader.

The main focus of my work has been devoted to understanding a previously

uncharacterized set of mobility genes termed a Recombinase in Trio (RIT) element (Van Houdt

et al. 2009; Ricker et al. 2013). At the start of my project, a former master’s student had recently

discovered a recombinase in a chlorobenzoate degrader designated Burkholderia sp. str.

OLGA172 (Jin, 2010) that later proved to be a RIT element. The Fulthorpe lab has studied this

strain as the representative of a larger collection of chlorobenzoate degraders isolated from

pristine sites during a biogeography survey (Fulthorpe et al. 1998). These pristine isolates are of

particular interest since their chromosomally located chlorobenzoate degradation genes may be

ancestral to widely disseminated plasmid-borne catabolic genes that are highly active in

contaminated sites. In OLGA172, a RIT element was found lying just upstream from the

catabolic genes and I had an interest in determining if it had a role in the movement of the

catabolic operon, with a view to the larger interest of understanding the overall role of RIT

elements and a possible link to the evolution of catabolic traits.

As next generation sequencing was becoming common at that time, OLGA172 was

submitted for Illumina sequencing and subsequently for 454 sequencing in order to assemble the

complete genome and provide context to the RIT element and adjacent catabolic genes.

Unfortunately, all contigs containing the RIT element were disconnected because the element is present in

multiple copies within the genome. The bioinformatics community has long acknowledged this

technical drawback of short read technology, but its importance to assembly quality and our

understanding of bacterial evolution had been underestimated. I document these issues in

Chapter 3. I recognized the potential of longer read technologies in understanding our strain and

submitted it for sequencing on the PacBio RSII platform. These improvements allowed for the

creation of a closed genome of OLGA172, which can subsequently be used to address specific

questions regarding the role of RIT elements in the evolution of this strain. I detail the larger

implications of the fundamental improvements achieved through the introduction of high

throughput long read sequencing technologies in Chapter 5 using OLGA172 as an example.

This chapter describes the closed genome of OLGA172 achieved using the PacBio sequencing

technology, and compares the genomic context surrounding the catabolic genes (and RIT

element) found in this strain with other fully sequenced relatives.

In Chapter 4, I discuss the distribution and organization of Recombinase in Trio (RIT)

elements, a previously underappreciated mobile element found in a large diversity of strains, but

predominantly in non-pathogenic bacteria. Drawing on in-depth in silico searching, I outline

overall RIT element organization and distribution in currently sequenced genomes, and

highlight individual strains harboring multiple identical copies of the same RIT element. Rob

Van Houdt of the Belgian Nuclear Research Centre (SCK•CEN) was the first author

on the original paper recognizing and naming the RIT elements (Van Houdt et al. 2009). On

reading a poster abstract I published on the distribution of RIT elements, he contacted me and

we established a collaboration. I travelled to Belgium for 3 months in 2011 and again for 9

months the following year after securing a fellowship in order to investigate the activity of RIT

elements in his lab. The experimental evidence I gathered supporting the intracellular mobility

of these elements is presented in Chapter 6.

At the outset of my PhD work, my intention was to investigate the "mobilome" of bacterial

communities exposed to low levels of environmental contamination. Other researchers have

asserted that environmental pollutants are increasing the ‘evolvability’ of bacterial communities

by increasing their capacity for horizontal gene transfer (Baquero, 2009; Gillings and Stokes,

2012). In order to properly address this question of innate evolvability within a bacterial

community, I wanted to investigate the impact of environmental pollutants on the mobile

elements themselves separately from co-selection by the resistance genes being mobilized. The

cost of investigating an environmental ‘mobilome’ necessitates the prudent identification of

appropriate sites to be characterized. Accordingly, I surveyed the suitability of several stream

sites in Ontario for this kind of work and eventually designed samplers and sampled several of these sites. I

also examined in detail our current ability to quantify MGEs via various methods. Chapter 7

details this sampling strategy and the molecular characterizations I was able to perform on the

bacterial communities, with several interesting results.

1.1 References

Baquero, F. 2009. Environmental stress and evolvability in microbial systems. Clin. Microbiol. Infect. 15(Suppl.1):5-10.

Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. 2005. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3:722-732.

Fulthorpe, R. R., Rhodes, A. N., & Tiedje, J. M. 1998. High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Applied and Environmental Microbiology, 64(5), 1620-1627.

Gillings, M. R., & Stokes, H. W. 2012. Are humans increasing bacterial evolvability? Trends in ecology & evolution, 27(6), 346-352.

Jin, S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. M.Sc. Thesis, Dept. Ecology and Evolutionary Biology, University of Toronto.

Koenig, J.E., C. Sharp, M. Dlutek, B. Curtis, M. Joss, Y. Boucher and W.F. Doolittle. 2009. Integron Gene Cassettes and Degradation of Compounds Associated with Industrial Waste: The Case of the Sydney Tar Ponds. PLOS One 4: 1-9.

Ricker, N., H. Qian and R.R. Fulthorpe. 2013. Phylogeny and Organization of Recombinase in Trio (RIT) Elements. Plasmid. 70(2):226-239.

Siefert, J.L. 2009. Defining the Mobilome. In: Horizontal Gene Transfer: Genomes in Flux. pp. 13-27. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Smalla, K. and P.A. Sobecky. 2002. The prevalence and diversity of mobile genetic elements in bacterial communities of different environmental habitats: insights gained from different methodological approaches. FEMS Microbiol. Ecol. 42:165-175.

Springael, D. and E.M. Top. 2004. Horizontal gene transfer and microbial adaptation to xenobiotics: new types of mobile genetic elements and lessons from ecological studies. Trends in Microbiol. 12(2):53-58.

Van Houdt, R., S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-226.

Wright, M.S., Baker-Austin, C., Lindell, A.H., Stepanauskas, R., Stokes, H.W. and J.V. McArthur. 2008. Influence of industrial contamination on mobile genetic elements: class 1 integron abundance and gene cassette structure in aquatic bacterial communities. ISME Journal 2: 417-428.

Chapter 2 The Role of Mobile Genetic Elements in Prokaryotic Adaptation

2 Horizontal Gene Transfer

Bacterial evolution is a dynamic process involving gene mutation, inversion, exchange,

deletion and acquisition of exogenous DNA (Snyder and Champness, 2007). Each of these

processes varies in rate of occurrence and the scope of possible outcomes (Brüssow, 2008) and

the selection pressures shaping these outcomes are applied from multiple levels – gene, group,

population, or community. Horizontal gene transfer (HGT), also known as lateral gene transfer

(LGT), has been shown to play an important role in the evolutionary dynamics at all of these levels,

and is arguably the most important evolutionary mechanism working at the population and

community levels. HGT allows individual genomes to remain compact while providing access

to a larger pool of potentially beneficial genes maintained within the community (Darmon and

Leach, 2014). The study of the genes involved in horizontal gene transfer (also known as the mobilome)

has received substantial attention due to the increasing prevalence of antibiotic resistance.

Antibiotics and antibiotic resistance genes are common contaminants from wastewater,

agriculture and aquaculture (Perry and Wright, 2013; Gillings et al. 2015), and the abundance of

individual resistance genes in soil environments is increasing over time (Knapp et al. 2010).

Understanding the dynamics of gene movement within environmental communities is therefore

fundamental to establishing how existing resistances will be disseminated and in anticipating

sources of new resistance genes.

A mobile genetic element (MGE) is defined as any discrete segment of DNA that can

move within or between genomes (Siefert, 2009) and is inclusive of plasmids, phages,

integrative conjugative elements (ICEs), transposons and the myriad of smaller elements

capable of inter- or intra-cellular movement (for a complete review, see Bellanger et al. 2014).

The distinction between these different categories is often blurred for various reasons including

the modular nature of mobile element evolution, and the enormous time scale at which these

elements have been evolving (Lawrence and Hendrickson, 2008; Siguier et al. 2014). Although

all of these elements fit into the classification of transposable elements (Toussaint and Merlin,

2002; Curcio and Derbyshire, 2003; Roberts et al., 2008), the term transposable element

inherently suggests that the mobility of the elements is through transposition. Since

transposition and site-specific recombination are fundamentally different processes

(biochemically), it is preferable to use the more inclusive term of Mobile Genetic Elements

(MGEs) to refer to the full spectrum of genes involved in HGT. The term genomic island is also

sometimes seen as equivalent to transposable element; however, there are a variety of definitions

for this particular term, many of which overlap with current definitions for other mobile

elements. For the purposes of this thesis, the term genomic island will be used to refer to

regions of a genome that are not shared with close relatives of the isolate, regardless of any

evidence regarding current mobility.

There are three mechanisms for the acquisition of exogenous DNA into a bacterial cell.

These are conjugation (formation of a junction between two cells for genetic exchange),

transduction (movement of bacterial genes mediated by phage infection) and transformation (direct

uptake of DNA from the surrounding environment by competent cells) (Olendzenski and Gogarten, 2009). On

entry into a new cell, an MGE can be degraded by nucleases, maintained extrachromosomally (in the case

of most plasmids and some phages) or become integrated into the genome of the new organism

through either homologous or illegitimate recombination (Lawrence and Retchless, 2009). There

are also a number of mobile elements that integrate into the genome independently, in either a

random or site-specific manner (Hallet et al. 2004; Siguier et al. 2014). The roles that MGEs

fulfill in a bacterial genome are varied and poorly understood. They are most frequently studied

for their role in the acquisition or dissemination of selectable traits such as antibiotic resistance,

symbiosis, pathogenicity or catabolism. Many confer no such useful functions and are

commonly considered a form of ‘selfish’ DNA. However, these elements can be a key

component of genome flexibility as they mediate deletions and inversions both through their

own activity and by providing homologous regions within the genome. Many MGEs have also

been found to affect expression of surrounding genes through the presence of outward facing

promoters or the production of molecules involved in regulation (Darmon and Leach, 2014).

The presence of previously mobile elements (e.g. defective prophages) or partial remnants of MGEs can

likewise impact bacterial adaptation by providing sites of homology or through in trans activity

by intact elements. There are many categories into which MGEs can be sub-divided; however,

for the purposes of this chapter they will be described according to the degree of their mobility.

This chapter aims to clarify the individual terms used for the different classes of mobile

elements that are capable of independent movement, including movement both within a cell and

between bacterial cells. This is in no way an exhaustive description of the elements involved in

bacterial adaptation, and the reader is directed to recent excellent reviews for further information

(Bellanger et al. 2014; Darmon and Leach, 2014; Siguier et al. 2014).

2.1 Intracellular MGEs

By definition, intracellular mobile elements are those capable of transfer to different

locations in a chromosome or between different replicons within a bacterial cell (between

multiple chromosomes or from a chromosome to a plasmid). These elements can only be

transferred horizontally (between cells) when they become associated with the larger,

self-transmissible elements described in section 2.2. The simplest form of MGE has traditionally

been the insertion sequence (IS), although smaller non-autonomous mobile elements have

recently been described. ISs generally range in size from 700-2500 bp, can facilitate their own

movement and contain only the genes required for transposition (1-3 ORFs coding for

transposase enzymes and regulatory genes) with flanking inverted repeats (Mahillon and

Chandler, 1998; Siefert, 2009). Insertion sequences are grouped into individual families

(www-is.biotoul.fr) based on several shared characteristics. The most important of these characteristics

is similarity in the primary sequence of their encoded transposases (Siguier et al., 2014) but

family members also share other features including the organization of open reading frames,

target site preferences, and similarities in the length and sequence of both their short terminal

inverted repeats and the direct repeats generated upon insertion (Siefert, 2009; Siguier et al.

2014). The majority of insertion sequences are mobilized by a DDE transposase (where DDE

refers to the conserved Asp, Asp, Glu residues in the active site) and there are several large

families of these enzymes that have been further divided into subgroups (Siguier et al. 2014).

There are also several other transposase chemistries that have been identified including enzymes

with a DEDD catalytic motif (related to Holliday junction resolvases) and the HUH (two

histidine residues separated by a large hydrophobic residue) enzymes utilized by both

IS200/IS605 and IS91 related elements (Siguier et al. 2015).
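
The structural features described above (a length of roughly 700-2500 bp and flanking inverted repeats) can be illustrated with a minimal Python sketch. This is an illustration only and is not a method used in this thesis: the function names, the 50 bp search window and the mismatch tolerance are all assumptions chosen for the example, and real IS annotation relies on transposase homology as described above.

```python
# Illustrative sketch only: names, window size and tolerance are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def terminal_inverted_repeat_length(seq, max_len=50, max_mismatches=0):
    """Length of the terminal inverted repeat found at the two ends of seq.

    The left end is compared, base by base, with the reverse complement of
    the right end. Scanning stops once more than max_mismatches mismatches
    have been seen, or after max_len bases.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    best = 0
    mismatches = 0
    for i in range(limit):
        if seq[i] == rc[i]:
            best = i + 1  # repeat extends through this matching position
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
    return best


def in_typical_is_size_range(seq, low=700, high=2500):
    """True if the sequence length falls in the 700-2500 bp range cited above."""
    return low <= len(seq) <= high


# Toy element: a 7 bp left end, a spacer, and the matching inverted right end.
left = "GGCACTG"
element = left + "A" * 10 + reverse_complement(left)
print(terminal_inverted_repeat_length(element))  # prints 7
```

The toy element is far shorter than a real IS, so `in_typical_is_size_range(element)` returns False; the sketch is only meant to make the notion of terminal inverted repeats concrete.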

Transposons have traditionally been distinguished from ISs due to the presence of

accessory genes (also called passenger genes or cargo) that serve purposes not related to

transposition (Siguier et al. 2014). However, since related transposase enzymes have been found

in both ISs and transposons, this naming system conflicts with the homology-based families

that have been defined. In addition, there are transposons that are created through the

coordinated movement of two flanking ISs (composite transposons) and these are separate from

the unit transposons that have a mobility gene at one end of the element. Unit transposons are

sometimes mobilized through the action of a site-specific recombinase, and these enzymes are alternately

referred to as transposases or recombinases depending on whether the name reflects the MGE

they mobilize (Siguier et al. 2015) or their phylogenetic relatedness to other proteins

(Carraro and Burrus, 2015). As we progress towards the metagenomic age it is preferable to

group MGEs according to the phylogeny of the mobility enzyme since this is more amenable to

computational analysis and also speaks to the actual mechanism of mobility. Forming families

based on the homology of the mobility enzymes should still be viewed primarily as a ‘grouping

by descent’ method rather than an attempt to define the limitations of individual families since

the acquisition of accessory genes may not be a fixed feature of the family.

Many transposons encode separate integration and resolution systems, and the resolution

mechanisms are commonly performed by site-specific recombinases. Site-specific recombinases

(SSRs) can be divided into two unrelated families based on the use of either a tyrosine or a

serine residue in the recombination event (Schumann, 2006). These enzymes are commonly

used by bacteriophages to integrate their genomes into the host chromosome when the virus

enters lysogeny, and many are also able to excise when circumstances dictate entry into the lytic

phase (Hirano et al. 2011). However, members of both classes of SSRs have been found in a

variety of recombination reactions involving viral and bacterial DNA including integration,

excision, inversion, control of plasmid copy number and movement of transposons

(Nunes-Duby, 1998; Hallet et al. 2004; Mazel, 2006; Siguier et al. 2015).

In addition to the recombinases responsible for mobilizing phage genomes and

transposons, integron integrases are a sub-family of tyrosine-based site-specific recombinases (TBSSRs)

that have been found to be functionally distinct from all other characterized tyrosine

recombinases (Mazel, 2006). Integrons consist of the integron integrase (the TBSSR), a primary

recombination site (attI) and an outward facing promoter, and are responsible for the acquisition

of gene cassettes (individual genes that have an appropriate attC site for integration) in a

non-disruptive and functional orientation (Hall and Collis, 1995; Mazel, 2006). The incorporation of

gene cassettes in this manner allows for the immediate use of the newly acquired genes;

however, integrons also serve as storage for additional genes since the existing gene cassettes are

maintained in an array which can be composed of hundreds of individual genes (Cambray et al.

2010; Domingues et al. 2012). Gene cassettes generally decrease in activity with distance from

the primary promoter, but can be shuffled under stressful circumstances since the integron

integrase is activated by the SOS response (Guerin et al. 2009). This provides a pool of potential

genes that do not pose a transcriptional burden to the cell but are available if necessary

(Cambray et al. 2011; Darmon and Leach, 2014).

Integrons have not been found to be mobile themselves, but are commonly mobilized by

other MGEs (Collis et al. 2002; Hall and Collis, 1995; Mazel, 2006; Boucher et al. 2007). They

are sometimes referred to as mobile genetic elements (or components of the mobilome) due to

their role in horizontal gene transfer by integrating gene cassettes (Ragan and Beiko, 2009;

Olschlager and Hacker, 2008; Taylor et al., 2011). Integron classes are created based on

homology of the tyrosine recombinase and attI site, and not by the function of the genes

associated with them, in recognition of the transient nature of these associations. Integrons are

of great interest due to the unique adaptive capacity they provide, and are understandably among

the best studied of the mobile element classes due to their strong association with antibiotic

resistance genes.

In addition to direct impacts related to their mobility, ISs and transposons are also

involved in gene activation and regulation and can promote genome rearrangements either

directly or by providing scattered regions of homology (Curcio and Derbyshire, 2003).

Recombination may also occur at homologous regions within transposable element sequences

resulting in greater diversity (Ling and Cordaux, 2010). In recent years it has become apparent

that IS elements have a dramatic impact on genome evolution, ranging from inactivation and

regulation of individual genes to the complete re-organization of genomes through IS expansion

and subsequent genome streamlining (Siguier et al. 2014; Darmon and Leach, 2014).

2.2 Intercellular MGEs

Intercellular MGEs are similar to those described above except that in addition to the

genes required for integration/replication they also carry all the genes necessary for facilitating

their own movement between bacterial cells. Plasmids by classical definition are maintained

independently of the chromosome within a cell, and are therefore distinguished from

transposons since the latter are integrated into the host chromosome (Siefert, 2009). Both

plasmids and certain types of transposons can move between cells by conjugation.

Plasmids are often thought of as small extra-chromosomal DNA elements that carry

non-essential traits and are therefore easily lost when not needed. However, as the number of

characterized genomes has increased, it has become clear that many bacteria maintain plasmids

of considerable size (up to 2 Mb) and complexity. In some highly stressful environments up to

78% of culturable bacteria have been observed to carry plasmids, most of them large (>50 kb)

(Fulthorpe et al. 1993). Presumably, these plasmids carry traits that confer a fitness benefit

that outweighs reproductive pressure to maintain small plasmid sizes, either through supplying

access to a unique niche for the individual strain or through an increased adaptive potential

inherent to the presence of the plasmid itself. These plasmids have indeed been found to contain

niche specific attributes such as symbiosis or catabolic pathways, and to be maintained as stable

components of the genome (Konstantinidis and Tiedje, 2004). Moreover, it is now established

that some replicons previously characterized as either megaplasmids (1-2 Mb) or second

chromosomes are more accurately a combination of both elements. The term chromid has been

used to describe second or third chromosomes that utilize a plasmid partitioning mechanism but

contain genes essential to the survival of the cell (Harrison et al. 2010). In addition to

megaplasmids and chromids, there have also been plasmids isolated that are capable of

integrating into the chromosome of their host (Osborn and Boltner, 2002).

Bacteriophage represent some of the most abundant replicating genetic structures

known, probably exceeding 10^29 in the ocean alone (Schumann, 2006). Lytic phage

immediately commence phage production upon entry into the bacterial cell, resulting in lysis of

the bacterial cell and extinction of that particular cell lineage. However, temperate phage, under

favorable conditions, will instead integrate into the chromosome of the host bacterium and may

be maintained for multiple generations until the phage is induced to enter its lytic lifestyle.

Induction is often the result of chemical or nutritional stress threatening the survival of the host

bacterium, but can also respond to a number of environmental triggers (Schumann, 2006).

Occasionally, bacterial DNA is accidentally packaged into the phage in addition to, or instead

of, phage DNA. This process, referred to as transduction, allows the bacterial DNA to be

transferred to a new host and has been observed with a number of virulence and pathogenicity

traits (Lima et al., 2008). Gene transfer agents (GTAs) are an extreme example of transduction

in that these elements exclusively package random fragments of bacterial DNA into the phage

capsid. Since there is no phage DNA, these capsids are not infective but are instead a

genetically stable component of the bacterial genome (Lang and Beatty, 2007).

Integrated phages, termed prophages, are prevalent in many bacteria, averaging one per

genome sequenced, and the advent of high throughput sequencing has provided a number of

assembled bacteriophage genomes for analysis. Comparisons of these sequences have revealed

that the current genomes are the products of extensive non-homologous recombination events,

the result of both very frequent recombination between phage genomes and the enormity of the

evolutionary time scale on which these events have been taking place (Hendrix and Casjens,

2008). The role of bacteriophage in HGT through transduction is well documented; however, the

role they play in directly introducing beneficial genes as a means of ensuring vertical inheritance

is less studied. Metagenomic analyses of phage communities have revealed that bacteriophage

contain an unprecedented diversity of genetic sequences that are readily exchanged between

different phage genomes and that are equally available to the bacterial host of these genomes

due to the ease with which genes are transferred (Hendrix and Casjens, 2008).

Most transposons are not self-transferable between cells and are therefore covered in

Section 2.1 on Intracellular MGEs. Some transposons however, including Tn916, have acquired

genes that provide the capability of intercellular movement, and were therefore named

‘conjugative transposons’ (hence Tn916 is often referred to as CTn916). However, many of

the conjugative transposons move entirely through the action of a site-specific recombinase

instead of a canonical DDE transposase, leading to the re-classification of these elements as

Integrative and Conjugative Elements (ICEs, also synonymous with the retired terminology of

constin) (Rowland and Stark, 2005; Wozniak and Waldor, 2010). ICEs can be distinguished

from many transposons by the non-random insertion mediated by the site-specific recombinase,

and commonly form an excised circular intermediate that does not replicate autonomously prior

to transfer to the recipient cell. However, many ICEs were originally defined as genomic

islands and named according to the traits that they were conferring (symbiosis islands,

pathogenicity islands, etc.); therefore, the terms genomic island, conjugative transposon and ICE

have been used interchangeably (Burrus et al. 2002; Juhas et al. 2009; Roberts et al. 2008;

Wozniak and Waldor, 2010; Siguier et al. 2015). Inactivation of mobility genes or physical

separation from the conjugation machinery can negate the intercellular mobility of a transposon,

so that the role of the MGE in that cell differs from that of its homologues in other cells. Studies have

also revealed that MGEs can be mobilized in trans by other mobile elements in the cell, and that

MGE resolution systems can rescue plasmid resolution functions (Hallet et al. 2004). This

illustrates the interconnectedness of the different MGE categories.

2.3 Impact on Genome Evolution

Antibiotic resistance is arguably the largest human health concern of our century, as

evidenced by a call from the World Health Organization that all governments should prepare a

comprehensive national plan for surveillance and mitigation of antibiotic resistance (Leung et al.

2011). However, in addition to monitoring and limiting the distribution of current resistance

genes in pathogens, it is becoming increasingly apparent that environmental reservoirs serve as

an important source of available resistance genes that can be acquired by human pathogens

(Finley et al. 2013, Forsberg et al. 2012, Pruden et al. 2012; Perry and Wright, 2013). This has

led to a flood of studies investigating the presence of resistance genes in different

environmental reservoirs, including pristine and/or ancient soils where antibiotic exposure could

not have contributed to the observed resistance (Allen et al. 2009, D’Costa et al. 2011).

Whether these resistance genes pose a tangible threat to human health depends on the ease with

which they could be acquired by pathogens, and therefore it is no longer sufficient to investigate

merely the presence of these genes in environmental samples. The context of resistance genes,

including both the strains harboring them and their potential for mobility, has become the new

focus of environmental studies on antibiotic resistance. Looking ahead, quantifying the

likelihood of new combinations of mobile elements and resistance genes emerging from a given

environment will require a greater understanding of the mobilome of different environments.

This has previously been infeasible; however, the scope of environmental metagenomics is

rapidly expanding with the advent of low cost, high throughput, sequencing technologies. It is

now increasingly common to analyze complete assembled metagenomes, highlighting the

importance of developing standardized methods that can be applied to environmental samples.

However, our ability to locate and potentially identify mobile genetic elements will only be

useful if we can also confirm the functions of putative mobile elements.

Antibiotic resistance genes coming from clinical sources have been likened to an invasive

species (Pruden et al. 2012; Gillings et al. 2015). They are introduced into the environment in

wastewater and agriculture in the same way as chemical pollutants, but they pose far different

kinds of threats since they are present in replicating organisms and on self-transmissible

elements. They may also form new combinations that aid in their dissemination or maintenance

within a population. The aggressiveness of their dissemination is determined by the nature of

the mobile genetic element with which they are associated, which is why it is so vitally

important that we improve our understanding of the nature and diversity of these mobile

elements in bacterial communities. Our current level of understanding of the transposable

elements, and tyrosine-based site-specific recombinases in particular, is akin to that of an uninitiated

gardener – we can group elements based on shared characteristics, and can recognize some

known weeds, but are left in awe of the diversity that we have yet to explore.

The recognition that environmental bacterial communities serve as a reservoir of

resistance genes has important implications for managing antibiotic resistance in pathogens.

There are many mechanisms by which antibiotic resistance genes can be maintained in a

complex microbial community in the absence of selection pressure. One is co-selection by other

environmental pollutants, as has already been seen with heavy metals (Stepanauskas et al. 2006;

McArthur et al. 2011; Wright et al. 2006; Wright et al. 2008). Co-selection has undoubtedly

impacted antibiotic resistance gene maintenance, given the high concentration of heavy metals

relative to antibiotics in the environment (Stepanauskas et al. 2006). Heavy metal resistance

genes are commonly found on the same transmissible plasmids and transposons carrying

antibiotic resistance genes, and the class 1 integrons are commonly associated with resistance to

disinfectants in addition to both heavy metals and antibiotics (Gillings et al. 2015). This

highlights the role that seemingly unrelated environmental pollutants may play in the

maintenance and dissemination of antibiotic resistance in bacterial communities. Secondly, it

has been shown that some resistance genes can be silenced by regulatory mechanisms that

allow for the maintenance of the gene within the population, while others evolve from genes that serve

alternative purposes in the absence of antibiotic pressure (proto-resistance genes). Movement of

these genes into other genomic locations or other strains can create or restore the antibiotic

resistance phenotype when selection is applied (Perry and Wright, 2013). Antibiotic resistance

genes that are incorporated into integrons can likewise be inactivated and therefore present a

reduced burden to the host strain. Since gene cassettes are promoter-less, the genes contained

within the integron cassette array are prone to severe polar effects, with the genes farthest from

the integron integrase rarely transcribed. Since SOS induction can result in the re-organization

of the cassettes maintained within the array, integrons represent an ideal storage site where

potentially useful genes can be maintained (Cambray et al. 2011). Thirdly, sub-inhibitory

concentrations of antibiotics have been shown to increase the potential for evolution of new

traits within populations, through increased mutation rates and horizontal transfer (Baquero

2009; Gillings and Stokes, 2012). This highlights the important distinction between minimum

inhibitory concentrations (toxicity) of a chemical and minimal selective concentrations – a

distinction not currently addressed in regulations designed to determine appropriate limitations

on release of chemicals to the environment. Whether there are other environmental pollutants

that specifically impact the ‘evolvability’ of bacterial communities remains an open question

(Gillings and Stokes, 2012).

It is important to realize that although the established mechanisms of HGT generally refer

to the acquisition of novel genes or transposable elements from other organisms, the mobile

elements themselves (plasmids, phages, ICEs) evolve over time through the transfer or

rearrangement of modules between MGEs within a bacterial cell. IS density has been shown to

be higher in bacterial plasmids than in their host chromosomes, which may be the result of

preferential targeting by some transposable elements into plasmids using rolling-circle

replication (Siguier et al. 2014). The modular nature of plasmids and phages has been well

established (Hendrix et al. 2000; Toussaint and Merlin, 2002) as evidenced by the broad

diversity of accessory genes that are commonly found on plasmids with homologous replication

systems (Heuer and Smalla, 2012). IS elements and other intracellular MGEs can facilitate

transfers of gene segments, and can also serve to recombine different MGEs, and therefore the

categories established for the different elements should be considered fluid (Osborn and Boltner,

2002; Toussaint and Merlin, 2002). Insertion sequences interspersed throughout a genome can

be beneficial to the bacterium for the purposes of incorporating exogenous DNA or disabling the

ability of a phage to excise from a genome in order to preserve beneficial genes (which the

phage had been using as a selective agent to ensure inheritance). There is therefore a complex

balancing act between the risks involved in maintaining potentially mobile genes, and the

benefit derived from the genome plasticity these genes enable.

The distribution of IS elements in a genome is non-random, resulting in regions of the

genome where insertion of a new element is less likely to be detrimental (Plague, 2010). As a

result, mobile elements often invade each other (Darmon and Leach, 2014). This can result in

new chimeric mobile elements, and fragments of inactivated MGEs that can serve as

homologous regions for further rearrangements. These genomic regions have variously been

referred to as genomic islands (Langille and Brinkman, 2009), regions of genome plasticity

(RGPs) (Ogier et al. 2010), or ‘junkyards’ of MGEs (Schwartz et al. 2003), but they serve an

important role by providing relatively safe regions for the acquisition of incoming mobile

elements.

2.4 References

Allen, H. K., Moe, L. A., Rodbumrer, J., Gaarder, A., & Handelsman, J. (2009). Functional metagenomics reveals diverse β-lactamases in a remote Alaskan soil. The ISME journal, 3(2), 243-251.

Baquero, F. 2009. Environmental stress and evolvability in microbial systems. Clin. Microbiol. Infect. 15(Suppl.1):5-10.

Bellanger, X., Payot, S., Leblond-Bourget, N., & Guédon, G. (2014). Conjugative and mobilizable genomic islands in bacteria: evolution and diversity. FEMS microbiology reviews, 38(4), 720-760.

Boucher, Y., Labbate, M., Koenig, J. E., & Stokes, H. W. (2007). Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends in microbiology, 15(7), 301-309.

Brüssow, H. 2008. Phage-bacterium co-evolution and its implication for bacterial pathogenesis. In: Horizontal Gene Transfer in the Evolution of Pathogens. pp. 49-77. Cambridge University Press, New York, NY, USA.

Burrus, V., G. Pavlovic, B. Decaris and G. Guedon. 2002. Conjugative transposons: the tip of the iceberg. Molecular Microbiology 46(3): 601-610.

Cambray, G., A-M. Guerout and D. Mazel. 2010. Integrons. Annual Review of Genetics. 44:141–166.

Cambray, G., N. Sanchez-Alberola, S. Campoy, E. Guerin, S. Da Re, B. Gonzalez-Zorn, M-C. Ploy, J. Barbe, D. Mazel and I. Erill. 2011. Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons. Mobile DNA 2(1):6.

Carraro N. and Burrus V. 2015. Biology of Three ICE Families: SXT/R391, ICEBs1, and ICESt1/ICESt3, p 289-309. In Craig N, Chandler M, Gellert M, Lambowitz A, Rice P, Sandmeyer S (ed), Mobile DNA III. ASM Press, Washington, DC. doi: 10.1128/microbiolspec.MDNA3-0008-2014

Collis, C.M., Kim, M., Stokes, H.W. and R.M. Hall. 2002. Integron-encoded IntI integrases preferentially recognize the adjacent cognate attI site in recombination with a 59-be site. Molecular Microbiology 46(5): 1415-1427.

Curcio, M. J., & Derbyshire, K. M. 2003. The outs and ins of transposition: from mu to kangaroo. Nature Reviews Molecular Cell Biology, 4(11), 865-877.

Darmon, E. and D.R.F. Leach. 2014. Bacterial Genome Instability. Microbiology and Molecular Biology Reviews. 78(1):1-39.

Domingues S., G.J. da Silva, K. M. Nielsen. 2012. Integrons: vehicles and pathways for horizontal dissemination in bacteria. Mob. Genet. Elements 2:211-223.

D’Costa, V. M., King, C. E., Kalan, L., Morar, M., Sung, W. W., Schwarz, C., ... & Wright, G. D. (2011). Antibiotic resistance is ancient. Nature, 477(7365), 457-461.

Finley, R. L., Collignon, P., Larsson, D. J., McEwen, S. A., Li, X. Z., Gaze, W. H., ... & Topp, E. (2013). The scourge of antibiotic resistance: the important role of the environment. Clinical Infectious Diseases, cit355.

Forsberg, K. J., Reyes, A., Wang, B., Selleck, E. M., Sommer, M. O., & Dantas, G. (2012). The shared antibiotic resistome of soil bacteria and human pathogens. Science, 337(6098), 1107-1111.

Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. 2005. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3:722-732.

Fulthorpe, R. R., Liss, S. N., & Allen, D. G. (1993). Characterization of bacteria isolated from a bleached kraft pulp mill wastewater treatment system. Canadian journal of microbiology, 39(1), 13-24.

Gillings, M. R., & Stokes, H. W. 2012. Are humans increasing bacterial evolvability? Trends in ecology & evolution, 27(6), 346-352.

Gillings, M.R., Gaze, W.H., Pruden, A., Smalla, K., Tiedje, J.M. and Yong-Guan, Z. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME journal doi:10.1038/ismej.2014.226

Guerin, E., G. Cambray, N. Sanchez-Alberola, S. Campoy, I. Erill, S. Da Re, B. Gonzalez-Zorn, J. Barbé, M.C. Ploy and D. Mazel. 2009. The SOS response controls integron recombination. Science 324:1034.

Hall, R.M. and C.M. Collis. 1995. Mobile gene cassettes and integrons: capture and spread of genes by site-specific recombination. Molecular Microbiology 15(4): 593-600.

Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips. ASM Press, Washington, D.C., USA.

Harrison, P. W., Lower, R. P., Kim, N. K., & Young, J. P. W. (2010). Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends in microbiology, 18(4), 141-148.

Hendrix, R. W., Lawrence, J. G., Hatfull, G., and Casjens, S. 2000. The origins and ongoing evolution of viruses. Trends Microbiol. 8, 504–508.

Hendrix, R.W. and S.R. Casjens. 2008. The Role of Bacteriophages in the Generation and Spread of Bacterial Pathogens. In: Horizontal Gene Transfer in the Evolution of Pathogenesis. pp. 79-112. Cambridge University Press, New York, NY, USA.

Heuer, H., & Smalla, K. (2012). Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS microbiology reviews, 36(6), 1083-1104.

Hirano, N., Muroi, T., Takahashi, H., & Haruki, M. (2011). Site-specific recombinases as tools for heterologous gene integration. Applied microbiology and biotechnology, 92(2), 227-239.

Juhas, M., van der Meer, J. R., Gaillard, M., Harding, R. M., Hood, D. W., & Crook, D. W. (2009). Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS microbiology reviews, 33(2), 376-393.

Knapp, C.W., J. Dolfing, P.A. Ehlert and D.W. Graham. 2010. Evidence of increasing antibiotic resistance gene abundances in archived soils since 1940. Environ. Sci. Technol. 44:580–587. doi:10.1021/es901221x

Konstantinidis, K. T., & Tiedje, J. M. (2004). Trends between gene content and genome size in prokaryotic species with larger genomes. Proceedings of the National Academy of Sciences of the United States of America, 101(9), 3160-3165.

Lang, A.S. and J.T. Beatty. 2007. Importance of widespread gene transfer agent genes in alpha-proteobacteria. Trends in Microbiology 15:54-62.

Lawrence, J.G. and H. Hendrickson. 2008. Genomes in Motion: Gene Transfer as a Catalyst for Genome Change. In: Horizontal Gene Transfer in the Evolution of Pathogens. pp. 3-22. Cambridge University Press, New York, NY, USA.

Langille, M.G.I. and F.S.L. Brinkman, IslandViewer: an integrated interface for computational identification and visualization of genomic islands, Bioinformatics (2009) Jan. 16 (EPub). PMID: 19151094

Lawrence, J.G. and A.C. Retchless. 2009. The Interplay of Homologous Recombination and Horizontal Gene Transfer in Bacterial Speciation. In: Horizontal Gene Transfer: Genomes in Flux. pp. 29-54. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Leung, E., Weil, D. E., Raviglione, M., & Nakatani, H. (2011). The WHO policy package to combat antimicrobial resistance. Bulletin of the World Health Organization, 89(5), 390-392.

Lima, W.C., A.C.M. Paquola, A.M. Varani, M-A. Van Sluys and C.F.M. Menck. 2008. Laterally transferred genomic islands in Xanthomonadales related to pathogenicity and primary metabolism. FEMS Microbiology Letters 281:87–97.

Ling A, Cordaux R (2010) Insertion Sequence Inversions Mediated by Ectopic Recombination between Terminal Inverted Repeats. PLoS ONE 5(12): e15654. doi:10.1371/journal.pone.0015654

Mahillon, J. and M. Chandler, Insertion sequences, Microbiol. Mol. Biol. Rev. 62 (1998) 725-774.

Mazel, D. 2006. Integrons: agents of bacterial evolution. Nat. Rev. Microbiol. 4:608-620.

McArthur, J. V., Tuckfield, R. C., Lindell, A. H., & Baker-Austin, C. (2011). When rivers become reservoirs of antibiotic resistance: industrial effluents and gene nurseries.

Nunes-Düby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T., Landy, A., 1998. Similarities and differences among 105 members of the Int family of site-specific recombinases 26, 391-406.

Ogier, J. C., Calteau, A., Forst, S., Goodrich-Blair, H., Roche, D., Rouy, Z., ... & Gaudriault, S. (2010). Units of plasticity in bacterial genomes: new insight from the comparative genomics of two bacteria interacting with invertebrates, Photorhabdus and Xenorhabdus. BMC genomics, 11(1), 568.

Olendzenski, L. and J.P. Gogarten. 2009. Gene Transfer: Who Benefits? In: Horizontal Gene Transfer: Genomes in Flux. pp. 3-12. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Olschlager, T. and J. Hacker. 2008. Genomic Islands in the Bacterial Chromosome – Paradigms of Evolution in Quantum Leaps. In: Horizontal Gene Transfer in the Evolution of Pathogenesis. pp. 113-134. Cambridge University Press, New York, NY, USA.

Osborn, A. M. and D. Boltner. 2002. When phage, plasmids and transposons collide: genomic islands, and conjugative- and mobilizable-transposons as a mosaic continuum. Plasmid 48: 202-212.

Perry, J. and G.D. Wright. 2013. The antibiotic resistance “mobilome”: searching for the link between environment and clinic. Frontiers in Microbiology. 4:1-7.

Plague, G.R. 2010. Intergenic transposable elements are not randomly distributed. Genome Biol. Evol. 2:584-590.

Pruden, A., & Arabi, M. 2012. Quantifying anthropogenic impacts on environmental reservoirs of antibiotic resistance. Antimicrobial Resistance in the Environment, 173-202.

Ragan M.A. and R.G. Beiko. 2009. Lateral Gene Transfer: Open Issues. Philosophical Transactions of the Royal Society B. 364: 2241–2251.

Roberts, A.P., M. Chandler, P. Courvalin, G. Guedon, P. Mullany, T. Pembroke, J.I. Rood, C.J. Smith, A. O. Summers, M. Tsuda and D. E. Berg. 2008. Revised nomenclature for transposable genetic elements. Plasmid. 60: 167-173.

Rowland, S.J. and W.M. Stark. 2005. Site-specific recombination by the serine recombinases. In: The Dynamic Bacterial Genome pp. 83-120. Cambridge University Press, NY, NY, USA.

Schwartz, E., A. Henne, R. Cramm, T. Eitinger, B. Friedrich and G. Gottschalk. 2003. Complete Nucleotide Sequence of pHG1: A Ralstonia eutropha H16 Megaplasmid Encoding Key Enzymes of H2-based Lithoautotrophy and Anaerobiosis. J. Mol. Biol. 332: 369–383.

Schumann, W. 2006. Sequence specific recombination classes. In: Dynamics of the Bacterial genome pp. 97-98 John Wiley & Sons.

Siefert, J.L. 2009. Defining the Mobilome. In: Horizontal Gene Transfer: Genomes in Flux. pp. 13-27. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev 38: 865-891.

Siguier P, Gourbeyre E, Varani A, Ton-Hoang B, Chandler M. 2015. Everyman’s Guide to Bacterial Insertion Sequences, p 555-590. In Craig N, Chandler M, Gellert M, Lambowitz A, Rice P, Sandmeyer S (ed), Mobile DNA III. ASM Press, Washington, DC. doi: 10.1128/microbiolspec.MDNA3-0030-2014

Snyder, and W. Champness. 2007. Molecular Genetics of Bacteria.

Taylor, N.G.H., D.W. Verner-Jeffreys and C. Baker-Austin. 2011. Aquatic systems: maintaining, mixing and mobilizing antimicrobial resistance? Trends in Ecology and Evolution 26(6): 278-284.

Toussaint, A. and C. Merlin. 2002. Mobile Elements as a Combination of Functional Modules, Plasmid 47 (2002) 26-35.

Van Houdt, R.V, S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-226.

Wozniak, R. A., & Waldor, M. K. (2010). Integrative and conjugative elements: mosaic mobile genetic elements enabling dynamic lateral gene flow. Nature Reviews Microbiology, 8(8), 552-563.

Wright, M. S., Peltier, G. L., Stepanauskas, R., & McArthur, J. V. (2006). Bacterial tolerances to metals and antibiotics in metal-contaminated and reference streams. FEMS microbiology ecology, 58(2), 293-302.

Wright, M.S., Baker-Austin, C., Lindell, A.H., Stepanauskas, R., Stokes, H.W. and J.V. McArthur. 2008. Influence of industrial contamination on mobile genetic elements: class 1 integron abundance and gene cassette structure in aquatic bacterial communities. ISME Journal 2: 417-428.

Chapter 3 The Limitations of Draft Assemblies for Understanding Prokaryotic Adaptation and Evolution

Acknowledgements and Contributions: This chapter is reproduced as published in Genomics

(Ricker, N., Qian, H., & Fulthorpe, R. R. 2012. The limitations of draft assemblies for

understanding prokaryotic adaptation and evolution. Genomics, 100(3), 167-175

doi:10.1016/j.ygeno.2012.06.009) with minor modifications.

3 Introduction

Next generation sequencing (NGS) platforms have revolutionized how we obtain genetic

information, leading to rapid advances in the fields of genomics and metagenomics. These

methods rely on sequencing chemistries newer than the Sanger method (Sanger et al. 1977) and highly parallel

operations that result in high yields at low costs per read but so far produce considerably shorter

reads (in the range of 35-500 nucleotides) than Sanger sequencing (600 to 1500 nucleotides).

Shorter reads increase the required complexity of the assembly algorithms (Miller et al. 2010),

although the ability to sequence to very high coverage can overcome many of the original issues

in genome assembly including read errors and coverage gaps (Wetzel et al. 2011). The utility of

next generation sequencing has been demonstrated in examining new variants, or very close

relatives, of previously sequenced strains (reviewed in MacLean et al. 2009). This type of

assembly, known as a reference or mapping assembly, is relatively straightforward provided that

the two strains share high sequence identity across their genome. However, in many bacterial

species, the ‘core’ genes that are shared between closely related strains are supplemented by a

significant fraction of ‘dispensable’ genes that vary between the strains of a given species

(Medini et al. 2005). Assembling these sections of sequence data or entire genomes in the

absence of a suitable reference strain (referred to as de novo assembly) is a far more difficult

task (Pop 2009). Whole genome shotgun assemblies using traditional Sanger sequencing have

been utilized for many years for this purpose but the cost and effort required to do this type of

intense sequencing has been prohibitive for all but the largest laboratories (MacLean et al.

2009). The advent of NGS platforms promises to alleviate the financial and technical demands

of obtaining high-quality sequence data; however, repetitive elements in genomic

sequence remain a confounding issue in genome assembly that is difficult to resolve through

coverage alone (Wetzel et al. 2011).

Many assembly programs for NGS data utilize de Bruijn graphing techniques (see

Miller et al. 2010) to perform de novo assemblies of the high number of reads produced, with

the goal of finding the shortest path through the sequence data that includes as much of the

sequence data as possible. For genomes with a high content of repetitive sequences, some

assembly programs will produce an overly compressed alignment, and possible mis-assemblies,

when multiple copies of a repeat are collapsed to one location (Chevreux et al. 1999; Phillippy et

al. 2008). Accurate graphs (those that do not collapse repetitive elements) will often form a

‘frayed rope’ pattern in repetitive sections whereby a path converges at the repeats and then

diverges again (multiple paths leading into the repeat and multiple paths leading out of the

repeat again) since there are multiple true alignments possible. Some assembly programs

specifically search for the characteristic features that repetitive elements create within a graph

such as convergent, divergent or cyclic paths (Miller et al. 2010) and therefore terminate at these

repetitive elements to ensure that they are not overly compressed in the final assembly. This

results in a more fractured assembly, but prevents the errors introduced by arbitrarily collapsing

the repeats to one location.
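
The 'frayed rope' pattern described above can be reproduced on a toy example. The following Python sketch is illustrative only and is not the algorithm implemented in Velvet or ABySS: the sequence, the read length of 10 and k = 5 are assumptions chosen so that a single 7 bp repeat occurs twice in different contexts. Nodes are (k-1)-mers, and a node with more than one incoming or outgoing edge marks where paths converge on, or diverge from, the repeat.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_points(graph):
    """Map each node with in-degree > 1 or out-degree > 1 to (in-degree, out-degree)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for successor in successors:
            in_degree[successor] += 1
    nodes = set(graph) | set(in_degree)
    return {
        node: (in_degree[node], len(graph.get(node, ())))
        for node in nodes
        if in_degree[node] > 1 or len(graph.get(node, ())) > 1
    }


# Hypothetical toy sequence: the repeat "GATTACA" occurs twice.
genome = "CCCTGATTACAAGGT" + "TTTCGATTACATCAG"
# Error-free reads of length 10 taken at every position (an assumption).
reads = [genome[i:i + 10] for i in range(len(genome) - 10 + 1)]

graph = de_bruijn_graph(reads, k=5)
branches = branch_points(graph)
for node, (n_in, n_out) in sorted(branches.items()):
    print(node, "in:", n_in, "out:", n_out)
```

In this toy graph two paths enter the repeat and two leave it, so an assembler that refuses to guess which exit belongs to which entry would end its contigs at the repeat boundaries.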

Realistically, the assembly software is not expected to produce a perfectly aligned

genome but rather to reduce the sequencing reads into a manageable number of contigs

(‘contiguous sequence’ – the sequence produced by the assembly of multiple overlapping reads)

for finishing. ‘Finishing’ is the process of closing all contig gaps, correcting introduced errors,

and confirming low coverage regions of the assembly through PCR and cloning experiments at

the bench. These experiments can still be expected to take months to years, even with excellent

sequence data and the best software currently available (Nagarajan et al. 2010). For this reason,

complete genome finishing is rarely carried out both due to the effort required, and because the

aim of many sequencing projects is limited to looking for a small number of differences

between the new strain and a previously sequenced close relative. The resulting genome

projects are often submitted as unfinished draft assemblies, or as ‘assembled with likely errors’

(Phillippy et al. 2008).

Although not as repetitive as eukaryotic genomes, prokaryotic genomes contain a variety

of repeated elements ranging in size from 1-6 bp microsatellites (Ellegren 2004; Mayer et al.

2010) to larger elements such as transposons, insertion sequences, rRNA operons, tRNA genes,

and rhs family genes (Lupski and Weinstock 1992). The computational issues that repetitive

genomes pose to NGS assembly have been discussed in other recent papers (Miller et al. 2010;

Wetzel et al. 2011; MacLean et al. 2009; Zhang et al. 2011), but there has been remarkably little

emphasis on the relative value of the portion of the genome that remains fragmented in these

draft assemblies. To this end, we performed an in silico experiment using simulated long and

short read data for the fully sequenced genome of Cupriavidus metallidurans CH34 (hereafter

simply referred to as CH34). This organism was sequenced by the Joint Genome Institute (JGI)

using whole genome shotgun cloning (WGS) with a combination of three randomly sheared

libraries (3, 8 and 40 kb insert sizes) and an additional 3,752 individual Sanger reads for

finishing (Van Houdt et al. 2009). It was chosen for this study because of the high quality

finishing and annotation that has been performed (Van Houdt et al. 2009; Janssen et al. 2010) as

well as the nature of the genome, which contains two large chromosomes and many types of

mobile elements. It was our anticipation that the repetitive elements contained within this

genome would be a hindrance to assembly, and that this simulation would serve to illustrate the

portions of the genome that are inherently resistant to automated assembly. Four additional

strains (Caulobacter sp. K31, Gramella forsetii KT0803, Rhodobacter sphaeroides 2.4.1 and

Bordetella bronchiseptica RB50) were also included which varied in G+C content, number of

replicons, repeat content of the genome and percentage of genes annotated as involved in

mobility (plasmids, phages and transposons). A detailed analysis of each individual strain was

not performed since the genomic islands have not been characterized; however, genomic island

predictions were available from the IslandViewer website (Langille and Brinkman, 2009) which

utilizes multiple software programs to predict genomic islands from the completed sequence.

The predicted genomic islands in these strains were considerably smaller than those determined

in CH34, so it is expected that some of the predicted islands may actually be components of one

larger island.

Only two assembly programs were utilized since the presence of repeated elements is a

commonly acknowledged issue in assembly algorithms (Pevzner and Tang, 2001; Kingsford et

al. 2010), and a comparison of computational effectiveness was outside the scope of this study.

Our intent was rather to illustrate the biological significance of the regions most likely to remain

unassembled by the nature of their sequence. The Velvet assembler was chosen because the

algorithms have been improved to prevent over-collapsing of repeats (Zerbino et al. 2009). The

ABySS assembler (Simpson et al. 2009) was utilized to determine whether the results were

specific to the Velvet algorithms. Our goal for this project was to use the well-annotated CH34

genome to better understand the biological relevance of the sections of the genome left

unassembled and to examine which aspects of genome complexity would be most problematic

to assemble into large contigs given ideal data. This serves to illustrate the inherent issues in

draft assemblies of prokaryotic genomes, which we also illustrate is only further complicated by

the use of real data.

3.1 Methods

All genomes were obtained from the NCBI website (www.ncbi.nlm.nih.gov) with the

following GenBank accession numbers: Cupriavidus metallidurans CH34 (CP000352-

CP000355), Caulobacter sp. K31 (CP000927.1-CP000929.1), Gramella forsetii KT0803

(CU207366.1), Bordetella bronchiseptica RB50 (BX470250.1) and Rhodobacter sphaeroides

2.4.1 (CP000143.1-CP000147.1, DQ232586.1, DQ232587.1). These files were used to create

error-free simulated long read (400 bp length at 10x coverage) and short read (75 bp length at

45x coverage) data for assembly in Velvet using a custom-made Python program (available on

request). These datasets were assembled using Velvet version 1.1.05 (Zerbino and Birney, 2008)

using the max_kmer and big_assembly settings as these settings gave the best assembly

statistics (N50 and max contig). The final graph of the Velvet assembly for C. metallidurans

CH34 used 4,260,497 of the 4,265,686 (99.9%) simulated reads and resulted in a total of 139

contigs. The maximum contig length was 674,170 bp and the N50 value for the assembly (size

of contig for which 50% of assembled reads are in a contig of that size or larger) was 159,531

bp. The median coverage was calculated as 11.8. The N50 and longest contig statistics for the other

genomes are listed in Table 3.2.4. Paired-end libraries with 100 bp reads were also created for two

different insert distances (180 and 3000 bp). The paired-end dataset for C. metallidurans CH34

was assembled in ABySS version 1.1.3 (Simpson et al. 2009) with a final N50 value of 36,682 bp

and maximum contig size of 166,493 bp.
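
The read simulation and the N50 statistic described above can be sketched as follows. This is not the custom program used in this study; it is a minimal illustration which assumes that reads are sampled uniformly from the forward strand of a linear sequence, and the function names are hypothetical. N50 is computed here over contig lengths in bases, which is the usual convention.

```python
import random


def simulate_reads(sequence, read_length, coverage, seed=0):
    """Sample error-free reads at uniformly random positions up to the target coverage."""
    if read_length > len(sequence):
        raise ValueError("read_length exceeds sequence length")
    rng = random.Random(seed)
    # Number of reads needed so that reads * read_length ~ coverage * genome size.
    n_reads = (coverage * len(sequence)) // read_length
    last_start = len(sequence) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(sequence[start:start + read_length])
    return reads


def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical 1,000 bp sequence; read lengths and coverages follow the text.
toy_genome = "ACGT" * 250
short_reads = simulate_reads(toy_genome, read_length=75, coverage=45)
long_reads = simulate_reads(toy_genome, read_length=400, coverage=10)
print(len(short_reads), len(long_reads))
print(n50([100, 80, 60, 40, 20]))
```

A simulator for circular replicons would also need to allow reads that span the origin, which this sketch omits.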

Assembled contigs were aligned to reference sequences using Geneious Pro version

5.5.2 (Drummond et al. 2010). Despite the error-free nature of the simulated data, alignments

were performed at 98% identity since imperfect repeats (repeats with a small number of single

base pair differences) could be seen as sequencing errors by the assembler and would be

incorrectly collapsed thereby introducing errors into the final contigs. Coverage statistics

included were those determined by the Geneious program and therefore represent coverage of

reference by unique contigs only, with no allowance for contig repetition, instead of true

coverage of the reference genome if all repeats could be accounted for. Examination of genes

adjacent to the ends of contigs was performed using the NCBI Blast tool (Altschul et al. 1990),

and the Genbank entries for each replicon (www.ncbi.nlm.nih.gov). Repeat content of the

genomes was estimated by calculating the uniqueness of each genome at k-mer lengths of 31

and 1000 and then taking the average of these two calculations. Assembly files from the GAGE

study (Salzberg et al. 2012) were downloaded and aligned by the same metrics, or by the

addition of a maximum 500 bp gap parameter as necessary.
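
One way to compute such a uniqueness-based estimate is sketched below. The exact calculation used in this study is not given in the text, so the definitions here are assumptions: uniqueness is taken as the fraction of k-mer positions whose k-mer occurs exactly once on the forward strand, and repeat content as the mean of one minus uniqueness over the chosen k values. The toy call uses small k values so that it runs on a short, hypothetical string.

```python
from collections import Counter


def kmer_uniqueness(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once in the sequence."""
    n_positions = len(sequence) - k + 1
    if n_positions <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    unique_positions = sum(1 for count in counts.values() if count == 1)
    return unique_positions / n_positions


def repeat_content(sequence, k_values=(31, 1000)):
    """Mean of (1 - uniqueness) over the given k-mer lengths (assumed definition)."""
    return sum(1 - kmer_uniqueness(sequence, k) for k in k_values) / len(k_values)


# Toy example: "ACG" is repeated, so some 2-mers occur twice.
toy = "ACGTACGA"
print(kmer_uniqueness(toy, 2))
print(repeat_content(toy, k_values=(2, 4)))
```

For a genome of several megabases at k = 1000, storing every k-mer as a string is memory intensive; hashing each k-mer or using a suffix array would be a more practical design.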

3.2 Results

3.2.1 Assembly Quality for Cupriavidus metallidurans CH34

CH34 has 4 large replicons (Table 3.2.1) and a multitude of well-annotated smaller mobile

elements including genomic islands, transposons and insertion sequences (Van Houdt et al.

2009; Monchy et al. 2007). On the two chromosomes, there are four sets of 5S, 16S and 23S

rRNA genes (2 sets on each) and 62 tRNA genes (8 of which are duplicates found on the second

chromosome) (Janssen et al. 2010). There are 16 documented genomic islands (11 on

chromosome 1, none on chromosome 2, 3 on pMOL30 and 2 on pMOL28), as well as 57

insertion sequences and 19 other transposable elements distributed across the four replicons

(Janssen et al. 2010).

Table 3.2.1: Number of contigs aligning and coverage statistics for each of the four

replicons in C. metallidurans CH34 using Velvet and ABySS genome assembly software.

Velvet Assembly  Size (bp)  Contigs aligned at 98%  Largest contig (bp)  Total bases in contigs >10 kb  Total bases in contigs >5 kb  Total bases in contigs >1 kb
Chr 1            3,928,089  75                      674,226 (17.2%)      3,786,365 (96.4%)              3,835,365 (97.6%)             3,875,047 (98.6%)
Chr 2            2,580,084  63                      541,760 (21.0%)      2,466,986 (95.6%)              2,504,599 (97.1%)             2,532,450 (98.2%)
pMOL30           233,720    18                      58,279 (24.9%)       212,451 (90.9%)                212,451 (90.9%)               230,532 (98.6%)
pMOL28           171,459    9                       101,867 (59.4%)      156,377 (91.2%)                156,377 (91.2%)               171,008 (99.7%)

ABySS Assembly   Size (bp)  Contigs aligned at 98%  Largest contig (bp)  Total bases in contigs >10 kb  Total bases in contigs >5 kb  Total bases in contigs >1 kb
Chr 1            3,928,089  470                     166,493 (4.2%)       3,435,784 (87.5%)              3,669,623 (93.4%)             3,784,364 (96.3%)
Chr 2            2,580,084  212                     107,711 (4.2%)       2,242,357 (86.9%)              2,416,359 (93.6%)             2,511,515 (97.3%)
pMOL30           233,720    59                      29,993 (12.8%)       155,523 (66.5%)                190,452 (81.5%)               219,738 (94.0%)
pMOL28           171,459    36                      50,670 (29.6%)       135,295 (78.9%)                140,927 (82.2%)               162,258 (94.6%)

From our simulated dataset (see Methods), Velvet produced an assembly of 139 contigs.

This assembly was aligned to the reference sequence of each of the four

replicons (Table 3.2.1) in Geneious version 5.5.2 (Drummond et al. 2010). Several of the

contigs were found to align to multiple replicons (Figure 3.2.1), including one that aligned to all

four replicons (corresponding to Tn6049). The largest contig that was shared in more than one

location was contig 152 (length 10,403 bp), which is found on both chromosome 1 and 2 and

corresponds to Tn6048. Likewise, a single contig, 5471 bp, corresponded to the 4 rRNA

operons that are evenly divided between the two chromosomes. All contigs mapped to the

reference genome at 98% identity. The genome was also assembled using ABySS version 1.3.3

(Simpson et al. 2009). This assembly was considerably more fragmented than the Velvet

assembly (Table 3.2.1) and had two small contigs (915 bp and 740 bp) that did not align with

any of the replicons at 98% identity. Due to the considerably larger number of fragments from

this assembly, the causes of contig termination were not determined for the contigs produced

from the ABySS software; however, both software programs had greater difficulty assembling

the genomic island-rich pMOL30 compared to pMOL28 and showed similar contig distribution

patterns (Figure 3.2.2).
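
The percentages in Table 3.2.1 are consistent with dividing each base count by the replicon size; for example, the largest Velvet contig on chromosome 1 gives 674,226 / 3,928,089 = 17.2%. A small Python sketch of this bookkeeping follows. The function names and the toy contig lengths are hypothetical, and only the two replicon sizes and largest-contig values in the final lines are taken from the table.

```python
def bases_in_long_contigs(contig_lengths, min_length):
    """Total bases held in contigs strictly longer than min_length."""
    return sum(length for length in contig_lengths if length > min_length)


def percent_of_replicon(bases, replicon_size):
    """Bases expressed as a percentage of the replicon size, to one decimal place."""
    return round(100 * bases / replicon_size, 1)


def summarize(contig_lengths, replicon_size, thresholds=(10_000, 5_000, 1_000)):
    """Map each length threshold to (bases in longer contigs, percent of replicon)."""
    summary = {}
    for threshold in thresholds:
        bases = bases_in_long_contigs(contig_lengths, threshold)
        summary[threshold] = (bases, percent_of_replicon(bases, replicon_size))
    return summary


# Hypothetical contig lengths for a hypothetical 21,500 bp replicon.
toy_summary = summarize([12_000, 6_000, 3_000, 500], replicon_size=21_500)
print(toy_summary)

# Largest Velvet contigs on chromosome 1 and pMOL28, values from Table 3.2.1.
print(percent_of_replicon(674_226, 3_928_089))
print(percent_of_replicon(101_867, 171_459))
```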

Figure 3.2.1: Number of assembled contigs in Velvet aligning to replicons in C.

metallidurans CH34.

Venn diagram is based on 98% sequence identity. It is important to note that there are no shared

contigs found solely between chromosome 1 and pMOL30, or solely between chromosome 2

and pMOL28.

Figure 3.2.2: Geneious alignment of assembled contigs to two key regions containing

genomic islands in C. metallidurans CH34.

Top two images are from the Velvet assembly, bottom images are the same regions from the

ABySS assembly. The grey bar indicates region coverage, and the black lines are reference

sequence (solid line) and location of the contigs with respect to the reference. The top

alignment for each assembler includes the GI rich region ranging from approximately 1.2 Mb to

1.8 Mb on chromosome 1 and contains the two largest genomic islands. The bottom alignment

is to the full length of pMOL28, with the heavy metal resistance island highlighted (location is

as indicated in Monchy et al. 2007). There are more contigs listed in Table 3 than are visible on

the figure since contigs mapping to repetitive elements can only be mapped onto the

chromosome once.

3.2.2 Contigs terminate at repeated elements and mobile elements

The large contigs from the Velvet assembly were examined to determine the genomic

determinants that had caused their termination (see Table 3.2.2). It was our anticipation that the

known repeated elements would be the main cause of termination in our error-free dataset, and

this was primarily found to be the case. Seven of the largest contigs (four from chromosome 1 and one

each from the other replicons) were investigated and of the 14 terminal regions, 12 were found

to have terminated at a previously documented mobile element. The other two corresponded

with genes that would not be expected to be mobile. One of these genes was found to have an

internal repeat structure that interfered with assembly, and the other was found to have a second

copy of the same gene present on both chromosome 1 and chromosome 2 (at 99% identity).

When all contigs greater than 1 kb in length from chromosome 1 were included in the analysis

(data not shown), 76% (35/46) of the termination points were from documented mobile

elements. All other termination points were from duplicate genes found on multiple replicons

with the exception of a shared gene cluster between CMGI-2 and CMGI-3 which are both

located on chromosome 1 (discussed in section 3.2.3) and the rRNA operons for which there are

two copies on each chromosome. Of the mobile elements in this genome, Tn6049 and ISRme3

were found in the highest abundance (12 copies and 10 copies, respectively), and Tn6049 was

the only element found on all four replicons.
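
A simple way to check whether the sequence at a contig terminus is repeated elsewhere is to count its copies on each replicon. The sketch below is a simplification of the BLAST-based examination used here: it counts exact matches on both strands only, so copies at 98-99% identity would be missed. The sequences and names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_occurrences(seq, query):
    """Count exact, possibly overlapping, matches of query in seq."""
    count = 0
    start = seq.find(query)
    while start != -1:
        count += 1
        start = seq.find(query, start + 1)
    return count


def copies_per_replicon(replicons, element):
    """Exact copies of an element on either strand of each replicon."""
    rc = reverse_complement(element)
    result = {}
    for name, seq in replicons.items():
        n = count_occurrences(seq, element)
        if rc != element:  # avoid double counting palindromic elements
            n += count_occurrences(seq, rc)
        result[name] = n
    return result


# Hypothetical replicons; "GATTACA" stands in for a mobile element.
replicons = {
    "chr1": "TTTGATTACACCCGATTACAGG",
    "chr2": "AAATGTAATCAAA",  # carries the element on the reverse strand
    "plasmid": "ACGTACGT",
}
copies = copies_per_replicon(replicons, "GATTACA")
print(copies, "total:", sum(copies.values()))
```

Any terminal sequence with a total count above one is a candidate explanation for why the assembler could not extend the contig unambiguously.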

Table 3.2.2: Details on the terminal regions for 7 large contigs.

For simplicity, only the four largest contigs from chromosome 1 and the largest single contig

from each of the other replicons are included. The gene or mobile element responsible for the

contig termination is listed along with the number of times that element occurs in the total

genome.

Contig Name  Size (bp)  Replicon  5' terminus       # in genome                     3' terminus                         # in genome
Contig 17    674226     chr 1     ISRme4            2 (both on chr1)                sodium sulphate symporter           2 (100% to chr1, 99% to chr2)
Contig 113   358847     chr 1     ISRme7            2 (both on chr1)                IS1087B                             2 (both on chr1)
Contig 125   309700     chr 1     Tn6049            12 (across all four replicons)  IS1090                              4 (all on chr1)
Contig 143   302838     chr 1     IS1087B           2 (both on chr1)                Tn6049                              12 (across all replicons)
Contig 220   541760     chr 2     Tn6048            3 (1 on chr1, 2 on chr2)        ISRme3                              10 (across all but pMOL28)
Contig 252   58279      pMOL30    merE from Tn4380  2 (1 on each plasmid)           repeated sequence within copB gene  1
Contig 239   101867     pMOL28    merE from Tn4380  2 (1 on each plasmid)           IS1086                              3 (across all but pMOL30)

3.2.3 Fragmentation is greatest at genomic island sites

Interestingly, although there were long contigs distributed across all of the replicons in

the Velvet assembly, the distribution of the smaller contigs was not found to be uniform.

Instead there were regions on each of the replicons that were markedly fragmented with small to

medium (61-5000 bp) contigs arranged in a pattern of small overlaps or with gaps between

(Figure 3.2.2). Recognizing that genomic islands frequently contain smaller embedded

transposable elements and therefore many repeated elements, we overlaid the known genomic

island co-ordinates with the assembled fragments for chromosome 1. As noted earlier,

chromosome 1 contains 11 of the 16 genomic islands found in CH34. A sequential ordering of

the longest contigs corresponding to chromosome 1 revealed that only one of the genomic

islands (CMGI-9) was fully captured in a large contig. This is not surprising as this island has no

documented repetitive elements (not even terminal repeats) that would have interfered with

assembly. Since the genomic islands appeared to be linked with the prevalence of fragments, we

aligned the contigs to each of the chromosome 1 islands individually in Geneious. In general,

the larger genomic islands aligned to higher numbers of contigs (Table 3.2.3). The four largest

islands each had a minimum of 2 contigs longer than 5 kb, representing accessory genes that

were congruent without interference from mobile elements or repeated segments. However, the

termination points of these contigs serve to highlight the difficulties of obtaining complete

assemblies of even these relatively small regions (compared to the genome). As would be

expected, many of the contigs terminated at a documented insertion sequence or transposon that

was found at another location in the genome (sometimes within the same genomic island).

Tn6049 (with a length of 3461 bp) is a very promiscuous transposable element found in 12

locations in the genome including on 5 of the 11 genomic islands and terminated assembly in

each of the locations it was found. In addition, there were other genes that were present on more

than one of the genomic islands and therefore interfered with proper assembly. CMGI-2 and

CMGI-3 share several homologous gene clusters (see Table 3.2.3) and have similar conjugal

transfer genes. Two of these genes (trbB and trbF) share high sequence identity across their

length (97 and 92%, respectively) and were found to cause the termination of contigs

containing the conjugal transfer genes in both of these islands. CMGI-3 also has multiple

copies of IS1071, and in some cases this element appears to have been responsible for the

mobilization of fragments of adjacent genes, which are then also repeated within the island,

further fragmenting the assembly.
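
The overlay of genomic island coordinates with aligned contigs can be sketched as an interval containment test. In this Python illustration the coordinates are hypothetical, alignments are given as 1-based inclusive (start, end) pairs on the reference, and the exact boundaries between size classes are an assumption. Only contigs lying completely within an island are counted, as in footnote b of Table 3.2.3.

```python
def size_bin(length):
    """Assign a contig length in bp to a size class (boundaries are assumed)."""
    if length < 1_000:
        return "<1 kb"
    if length < 5_000:
        return "1-5 kb"
    if length <= 10_000:
        return "5-10 kb"
    return ">10 kb"


def contigs_within_island(island, alignments):
    """Count contigs fully contained in the island, grouped by size class."""
    island_start, island_end = island
    bins = {}
    for start, end in alignments:
        if start >= island_start and end <= island_end:
            label = size_bin(end - start + 1)
            bins[label] = bins.get(label, 0) + 1
    return bins


# Hypothetical island and contig alignment coordinates on one replicon.
island = (1_000, 30_000)
alignments = [
    (500, 1_500),       # overlaps the island boundary, not counted
    (2_000, 2_800),     # 801 bp
    (3_000, 6_000),     # 3,001 bp
    (7_000, 13_000),    # 6,001 bp
    (14_000, 29_000),   # 15,001 bp
    (29_500, 31_000),   # overlaps the island boundary, not counted
]
island_bins = contigs_within_island(island, alignments)
print(island_bins)
```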

CH34 is most noted for its ability to withstand heavy metals (Janssen et al. 2010) and

many of the genes conferring these abilities are contained within three genomic islands

distributed on the two plasmids pMOL30 and pMOL28. The two large islands account for

almost the full length of pMOL30 and approximately a third of the length of pMOL28. Each

island also contains “nested” islands with partial or complete mobile elements that separate

different functional modules (Van Houdt et al. 2009). An examination of the Geneious

alignments for both pMOL28 and the region of chromosome 1 containing two genomic islands

conferring such notable phenotypes as hydrogenotrophy and metabolism of aromatic

compounds revealed that these regions are highly fragmented in comparison to surrounding

regions in both the Velvet and ABySS assemblies (Figure 3.2.2).

Table 3.2.3: Genomic islands found on chromosome 1 of CH34.

Naming, sizes and content information are derived from previous characterization (Van Houdt

et al. 2009). Contig information is solely from the Velvet assembly for simplicity.

Name of Element  Size        Content Information                                                         Contigs aligned within region (a,b)  Size range of aligned contigs (b)
CMGI-1           109,598 bp  Tn6049; Closely related to pathogenicity island in P. aeruginosa            3                                    1-5 kb: 2; >10 kb: 1
CMGI-2           101,637 bp  Tn4371 family integrase, hydrogenotrophy, metabolism of aromatic compounds  12                                   <1 kb: 6; 1-5 kb: 2; 5-10 kb: 1; >10 kb: 3
CMGI-3           97,042 bp   Tn4371 family integrase, carbon fixation, hydrogenotrophy                   16                                   <1 kb: 7; 1-5 kb: 6; 5-10 kb: 1; >10 kb: 2
CMGI-4           56,529 bp   Tn4371 family integrase, Tn6048                                             1                                    >10 kb: 1
CMGI-5           25,423 bp   63 bp direct repeats                                                        3                                    1-5 kb: 3
CMGI-6           17,638 bp   Tn6049                                                                      3                                    1-5 kb: 3
CMGI-7           15,362 bp   Tn6049                                                                      1                                    1-5 kb: 1
CMGI-8           12,257 bp   Tn6049, IS1087                                                              3                                    1-5 kb: 3
CMGI-9           20,648 bp   Integrase, no direct repeats                                                0                                    Contained within large contig
CMGI-10          20,947 bp   3 Insertion Sequences                                                       5                                    1-5 kb: 4; 5-10 kb: 1
CMGI-11          10,824 bp   Flanked by ISCme7                                                           3                                    <1 kb: 2; 5-10 kb: 1

(a) These numbers are an approximation because the alignments were performed at 98% and therefore some of the small contigs align to multiple places where imperfect repeats occur.

(b) Numbers are only for contigs completely covered by the genomic island; each island generally aligns to the ends of two larger contigs that are not included in these numbers.

3.2.4 Investigating the relative contribution of multiple replicons or presence of documented mobility genes by comparison with other strains

In addition to our in-depth analysis of CH34, we simulated datasets for an additional 4

genomes that varied in overall genome size, G+C content, number of replicons and predicted

mobile element content. The metrics for all 5 genomes assembled using simulated unpaired

long and short read datasets are summarized in Table 3.2.4. The Velvet assembly data included

in Table 3.2.4 is based on alignment to the reference genome at 98% nucleotide identity with no

gaps (see methods), and there were no significant errors in the contigs that would limit their

ability to align with these restrictions. As expected, the larger genomes consistently

produced a larger number of contigs after assembly, and the assembly quality in terms of both

N50 value and maximum contig size relative to largest chromosome decreased with increasing

genome size. In order to assess the causes of fragmentation for large genomes, we specifically

included strains with variations in both the number of replicons and the number of genes

annotated as related to horizontal gene transfer by the JCVI Comprehensive Microbial Resource

(JCVI-CMR; http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi). Based on overall

genome size, number of replicons and k-mer repetitiveness, it was expected that CH34 would

have the poorest (most fragmented) assembly. However, Caulobacter sp. K31 fared the worst by

each of the common metrics listed in Table 3.2.4. Interestingly, the best N50 and maximum

contig sizes were obtained for Rhodobacter sphaeroides 2.4.1 despite the fact that this genome

is composed of 7 different replicons (Figure 3.2.3). Furthermore, although CH34 was the second

poorest assembly in terms of number of contigs and N50 value, Bordetella bronchiseptica RB50

had a smaller maximum contig size. Despite the large overall size of this genome, this result

was unexpected given its other characteristics. The genome had been chosen specifically because

only 0.37% of its gene content has been attributed to mobile functions (plasmids, phages and

transposons) by the JCVI-CMR (http://cmr.jcvi.org/cgi-bin/CMR) and because it contains only

one replicon. It also had the lowest percentage of repetitive k-mers by our calculations (see

methods) and should theoretically assemble more easily.
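For readers less familiar with the metric, N50 is the contig length at which contigs of that length or longer account for at least half of the assembled bases. A minimal Python sketch of the calculation follows; the function name `n50` and the example lengths are illustrative and are not taken from the assemblies described here.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    # Walk from the longest contig down until half of the assembly is covered.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [8, 8, 4, 3, 3, 2, 2, 2]
example_n50 = n50(example_lengths)
```

Because N50 depends only on the length distribution, it says nothing about whether the contigs are correct, which is why the alignments to the reference are reported alongside it.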

To compare these results to the findings described for the well-annotated genome CH34

in the absence of characterized genomic islands, these strains were evaluated according to

genomic islands predicted by programs contained within IslandViewer (Langille and Brinkman

2009). Although the precise number or size of the individual islands has not been verified (and

is overestimated in CH34), the total number of predicted genomic islands correlates significantly

with the maximum contig size, N50 value and N50 as a percentage of longest replicon.

As had been seen in CH34, the most fragmented portion of the Bordetella bronchiseptica genome

corresponded to a 22 kb segment of repeated gene content shared between two predicted

genomic islands (99% nucleotide identity), and likewise the Caulobacter sp. K31 assembly also

had a large (10.5 kb) segment that was perfectly repeated between two predicted genomic

islands.

Table 3.2.4: Velvet assembly metrics of the 5 genomes compared.

Unique k-mer percentage was calculated as described in the methods. Mobile gene numbers

were obtained from the JCVI-CMR (http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi).

Coverage calculation is defined as total number of reference bases covered by unique contigs at

98% nucleotide identity without gaps or repeating of individual contigs. SIGI and DIMOB are

the individual programs that IslandViewer (Langille and Brinkman, 2009) utilizes to predict

genomic islands.

Metric | Caulobacter sp. K31 | Cupriavidus metallidurans CH34 | Bordetella bronchiseptica RB50 | Gramella forsetii KT0803 | Rhodobacter sphaeroides 2.4.1
--- | --- | --- | --- | --- | ---
Genome size (Mb) | 5.89 | 6.91 | 5.34 | 3.8 | 4.6
No. replicons | 3 | 4 | 1 | 1 | 7
GC content | 66.3 | 62 | 68.1 | 36.6 | 68.8
% Unique k-mers | 98.55 | 98.2 | 99.5 | 99.09 | 99.18
Contigs | 151 | 139 | 104 | 90 | 99
N50 (bp) | 155,182 | 159,531 | 261,616 | 564,738 | 740,045
Longest contig (bp) | 495,932 | 674,226 | 550,697 | 899,275 | 1,010,805
N50 vs. longest replicon (%) | 2.83 | 2.91 | 4.78 | 10.31 | 13.51
Mobile genes | 162 | 164 | 19 | 49 | 103
Mobile genes, % of genome | 2.96 | 2.65 | 0.37 | 1.36 | 2.46
Islands by SIGI-HMM only | 13 | 3 | 9 | 2 | 6
Islands by DIMOB only | 3 | 12 | 1 | 7 | 1
Predicted by both | 9 | 5 | 5 | 1 | 2
Total # of islands | 25 | 20 | 15 | 10 | 9
Coverage percentage | 98.8 | 98.6 | 98.9 | 99 | 99.7

Figure 3.2.3: Relationship between N50 (as percentage of the largest replicon in the

genome) and three parameters thought to influence assembly quality.

Top: genome size, r = -0.81 (ns); Middle: percent unique k-mers, r = 0.54 (ns); and Bottom:

Number of Replicons, r = 0.42 (ns).

Figure 3.2.4: Relationship between three measures of assembly quality (maximum contig

length, N50 and N50 as percent of longest replicon) and number of genomic islands as

predicted by IslandViewer.

The Pearson correlations between N50 or N50 as percent of longest replicon and number of

predicted islands are statistically significant (p<0.05) but are also clearly curvilinear.
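The correlation coefficients behind Figure 3.2.4 can be recomputed from the values in Table 3.2.4. The sketch below is a plain-Python Pearson coefficient; the helper name `pearson_r` is illustrative, and the statistics software used for the original analysis is not stated in this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values transcribed from Table 3.2.4, in the order
# K31, CH34, RB50, KT0803, 2.4.1.
total_islands = [25, 20, 15, 10, 9]
n50_bp = [155182, 159531, 261616, 564738, 740045]
longest_contig_bp = [495932, 674226, 550697, 899275, 1010805]

r_n50 = pearson_r(total_islands, n50_bp)
r_longest = pearson_r(total_islands, longest_contig_bp)
```

Both coefficients are negative: strains with more predicted islands have smaller N50 values and shorter longest contigs. With only five genomes, and a relationship that is visibly curvilinear, a linear coefficient should be read as a rough summary rather than a model of the relationship.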

3.2.5 Fragmentation Evident in Real Data

The benefit of using simulated ideal data for this type of analysis is that patterns can be

detected that may otherwise be masked due to the variations in sequencing coverage,

introduction of sequencing specific errors and high number of contigs produced by real

sequencing projects. In order to take our findings and compare them to real sequencing

scenarios, we examined the assembly data from Rhodobacter sphaeroides 2.4.1. This strain was

utilized in the Genome Assembly Gold-Standard Evaluations (GAGE) study that compared the

assembly efficiency of 7 different open access software programs (Salzberg et al. 2012). The

assembled contigs from that study are freely available. We downloaded the contigs from the

GAGE Velvet assembly of R. sphaeroides 2.4.1 and aligned them to the finished genome in the

same way that we compared CH34 contigs generated from simulated sequence to its final

genome. When the R. sphaeroides 2.4.1 contigs from the GAGE assembly were mapped to its

finished chromosome 1 in Geneious, only 454 fragments (of a total of 1192 contigs and

scaffolds) could be aligned at 98% identity, resulting in only 65% coverage of the chromosome.

This indicated that the assembled contigs contained internal errors, so we allowed for up to 500

bp gaps in the Geneious alignment. This improved the coverage of chromosome 1 from 65% to

96.3%. Regardless of whether gaps were allowed or not, the density of small contigs

was greatly increased in regions predicted to be genomic islands (Figure 3.2.5). For the

alignment without gaps, only one of the predicted genomic islands was assembled, whereas 4

out of 9 of the islands were assembled when gaps were allowed. The two islands predicted by

both programs in IslandViewer had a large number of fragments for their relative size (13

fragments for 12.5 kb and 12 fragments for 7.5 kb).
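The coverage figures quoted above are the fraction of reference bases spanned by at least one aligned contig. Assuming the alignments have been exported as (start, end) coordinates on the reference, that calculation can be sketched as follows; the function name and the 1-based inclusive coordinate convention are assumptions, and this is not the Geneious implementation.

```python
def coverage_percent(intervals, reference_length):
    """Percent of reference bases covered by at least one interval.

    Intervals are (start, end) pairs, 1-based and inclusive (an assumption).
    Overlapping or duplicated intervals are merged so no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / reference_length


# Hypothetical alignment coordinates on a 1,000 bp reference.
example = coverage_percent([(1, 400), (350, 600), (800, 1000)], 1000)
```

Merging before counting matters here because small contigs from imperfect repeats can align to more than one place, and overlapping alignments would otherwise inflate the total.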

Figure 3.2.5: Geneious alignment of real contigs obtained from the GAGE assembly data

(Salzberg et al. 2012).

Top alignment is at 98% identity with no gaps allowed, bottom alignment is 98% identity with

up to 500 bp gaps allowed. The region shown includes 3 putative genomic islands that are all

clearly visible by the increased occurrence of small contigs in these regions. These islands

occur at 216-228 kb, 550-557 kb and 632-648 kb and are roughly indicated with curved

brackets. Since these are only predicted islands, the precise borders may not be accurate and

individual islands could be components of a larger combined island.

3.3 Discussion

The Genomes OnLine Database (v. 3.0; http://genomesonline.org accessed 19th March

2012) lists 3532 completed genomes of which 1045 are listed as permanent draft assemblies.

The status of permanent draft implies that finishing experiments to verify or extend the existing

contigs are not expected to be performed, and the draft status is likely to be related to repeated

elements that cannot be resolved by computerized means. Contrary to the early view that many

of these smaller repeated elements represent “junk DNA” (Mayer et al. 2010), microsatellites in

the form of tandem repeats and transposable elements such as insertion sequences have both

been found to regulate transcription of adjacent genes (Versalovic et al. 1991; Mahillon and

Chandler, 1998). These repetitive elements also function as important components of genome

plasticity by mediating DNA re-arrangements including chromosomal deletions, duplications

and inversions (Lupski and Weinstock, 1992; Touchon and Rocha 2007). Larger transposable

elements such as transposons and integrative conjugative elements (ICEs) can also be found in

multiple copies within a genome, particularly if there are multiple large replicons as is

commonly found in certain bacterial families such as the Burkholderiaceae (Janssen et al. 2010;

Amadou et al. 2008; Tuanyok et al. 2008). Reaching the stage of a draft genome is sufficient if

the goal is to discover interesting and novel genes or operons that do not contain repeated

elements, with the consequence that many genome projects are being published at the draft

assembly stage and then terminated (Nagarajan et al. 2010). These draft assemblies can have a

number of errors including collapsed repeats, rearrangements and inversions (Phillippy et al.

2008; Salzberg et al. 2012; Narzisi and Mishra, 2011) as well as having an unknown fraction of

the genome unaccounted for. In this study, we used simulated NGS data to confirm that

currently available software programs are capable of accurately recognizing repeated segments

in the DNA and that these repeats would be the primary cause of contig termination in the

assembly. Having established the causes of termination, we wanted to better understand the

nature of the fragmented regions of draft assemblies since the relative importance of these

unassembled regions has to our knowledge never been addressed.

An examination of the genes adjacent to the termination points for the longest contigs

(Table 3.2.2) clearly confirmed that the assemblies were terminated due to the presence of

repeated elements. These repeated elements included known mobile elements and genes

containing internal repeat structures (as expected) but also of genes that were repeated in more

than one genomic location (commonly on two separate replicons within this genome). This type

of repetition (within or between replicons) is important in the evolution of novel traits since one

copy of the gene can be free to evolve without risking functional impairment to the host cell due

to the other preserved copy (the duplicate gene hypothesis; Ohno, 1970). Some transposable

elements have been found to specifically target transmissible plasmids and the subsequent

plasmid-chromosome exchanges facilitate assembly of genes into modules (Siguier et al. 2006),

with the result that individual genomes will commonly have identical transposable elements and

accessory genes distributed on both the main chromosome and some or all of the associated

plasmids (as was seen here). Likewise, the findings from both B. bronchiseptica RB50 and

Caulobacter sp. K31 illustrated that predicted genomic islands within the same chromosome can

carry repetitive gene content which can interfere with assembly in the absence of repeats across

different replicons. Neither of these two large repeated segments contained any insertion

sequences or transposons, but were composed almost exclusively of hypothetical proteins. The

hypothetical nature of these genes prevents an estimation of the causes of gene duplication in

these strains, although one copy of the 22 kb portion of B. bronchiseptica RB50 is contained

within an intact phage documented by the BacMap Genome Atlas website (Stothard et al. 2005).

The second copy in this strain and both repeated segments in Caulobacter sp. K31 were not part

of any documented phage (intact or otherwise) but their presence in two separate predicted

islands within the chromosome could facilitate genomic island evolution.

It was expected that the number of genomic islands would have correlated with the

percentage of genes annotated as involved in mobility, but this was not found to be the case.

Rhodobacter sphaeroides 2.4.1 had 2.46% of the genes attributed to mobility functions, yet had

a smaller number of islands than other strains with this percentage of mobile gene content and a

more successful assembly in terms of N50 and maximum contig size. Given the high number of

plasmids found in this strain it is reasonable that this high percentage of mobility functions

relates directly to plasmid genes. These would not be expected to interfere with assembly since

incompatibility prevents plasmids with highly similar transfer genes from co-existing within

cells. It was interesting to note that although both of the mobility related metrics (% mobile

genes and predicted genomic islands) correctly predicted Caulobacter sp. K31 to be the most

difficult to assemble, the number of genomic islands was a better indicator of assembly

complexity for Bordetella bronchiseptica RB50 than mobile gene content. In addition, the most

logical genome characteristic to interfere with assembly would be repetitiveness (measured as %

unique k-mers) but this also was not an invariant predictor of the ease of assembly.
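One way to quantify repetitiveness is the percentage of distinct k-mers that occur exactly once in a sequence. The sketch below uses that definition as an assumption; the exact calculation behind Table 3.2.4 is described in the methods and may differ, for example in how reverse complements or multiple replicons are handled. The function name and the toy sequence are illustrative.

```python
from collections import Counter


def percent_unique_kmers(sequence, k):
    """Percent of distinct k-mers that occur exactly once in `sequence`."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    singletons = sum(1 for n in counts.values() if n == 1)
    return 100.0 * singletons / len(counts)


# Toy sequence in which the 4-mer "ACGT" occurs twice.
toy = percent_unique_kmers("ACGTTTACGT", 4)
```

A genome-wide percentage of this kind averages over the whole sequence, so a few long repeats concentrated in genomic islands can fragment an assembly while barely changing the value.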

The validity of this work rests on the assumption that the simulated reads generated from

the genomic data could be accurately assembled. There were no errors evident in any of the

alignments performed from the Velvet unpaired simulated data when using a k-mer length of 57,

although there had been a number of single base pair errors introduced when using the standard

settings and there were substantial SNPs introduced in the ABySS contigs (data not shown).

This illustrates the high level of accuracy that Velvet can achieve with non-repetitive elements,

as well as the high quality repeat recognition of this particular software program. In examining

the distribution of the long reads from the Velvet assembly against chromosome 1 of CH34, the

unassembled fragments tended to group together and these regions showed a clear association

with the prevalence of repeated elements in the genomic islands. It is important to recognize that

in an actual sequencing project the reconstruction of the genome would be further complicated

by the presence of sequencing errors and variations in the level of coverage due to decreased

amplification robustness, the latter of which may be more prominent in repetitive stretches due

to the secondary structure formed by palindromic repeats (Jin, 2010). In comparing our

simulated assemblies to the data available from the GAGE Velvet assembly of R. sphaeroides

2.4.1, it was clear that our correlation between the distribution of small contigs and the location

of genomic islands was still valid when using real data.

Draft genome assemblies may lead us to unintentionally disregard the most important

parts of prokaryotic genomes. Although eukaryotic genomes are more repeat rich than

prokaryotic genomes, the reasons for this repetitiveness are vastly different between the

kingdoms. In prokaryotic organisms, horizontal gene transfer is a prominent means of acquiring

novel genes and rearrangements facilitated by mobile elements increase diversity. Insertion

sequences can spread to high prevalence within a genome, and their activity may be specifically

increased in response to changing environmental conditions. Since their behavior is strongly

linked to adaptation, these elements are of great interest (Dobrindt et al. 2004). Larger mobile

elements are primarily assimilations of smaller elements (Toussaint and Merlin, 2002) or serve

as recombination sites for incoming genetic information (Coleman et al. 2006; Penn et al. 2009),

with the result that genomic islands and large transposable elements are inherently resistant to

computerized assembly. These regions are full of complete or partial mobile genetic elements

and are therefore problematic for genome assembly, but ironically they are the most likely to

carry the genes responsible for any novel traits under investigation, particularly if they were

acquired horizontally. Assembly software alone is capable of reconstructing genes and

complete operons, provided they are not interrupted by repetitive sequences or present in more

than one copy within the genome (i.e. on separate replicons). In one study it was determined that

the majority of genes can be reconstructed from even very short reads (25 bp); however, genes

containing repeats (primarily intergenic repeats or mobile elements such as transposons, IS

elements and prophages) account for the vast majority of the unassembled genome (Kingsford et

al. 2010). In our study, 40 of the 75 contigs corresponding to chromosome 1 of CH34 were

fully contained within genomic islands (Table 3.2.3) and an additional 16 contigs were found to

overlap with the edge of a genomic island. Many of the functional genes contained within these

genomic islands were assembled, indicating that examining the mid-range contigs (5-50 kb) of a

draft assembly may be more informative in terms of recently acquired content. The genomic

context of these newly acquired genes is lost when the data is left as a draft assembly, and the

utility of the public databases is decreased by the introduction of incomplete or incorrect data.

As an example, the largest genomic island in CH34 (CMGI-1) is almost identical to a

pathogenicity island (PAGI-2C) found in Pseudomonas aeruginosa clone C, indicating recent

transfer between industrially contaminated sites and nosocomial pathogens (Van Houdt et al.

2009). Based on our Velvet assembly simulation, a draft assembly of CH34 would have left this

island in pieces and evidence of this important transfer event would remain hidden. In our own

laboratory, we have discovered a Recombinase in Trio (RIT) element adjacent to the

chlorobenzoate degrading genes of Burkholderia sp. R172 (Accession number AY168634.1)

that is homologous to one of the RIT elements found in CMGI-2 of CH34 (Van Houdt et al.

2009). This association was determined through Sanger sequencing, and was not apparent from

the next-generation sequencing reads alone, which were provided by both Solexa (Illumina) and

Roche 454 sequencing (Jin, 2010). Other sequenced strains available in the GenBank database

reveal that this is not an isolated event. For example, there are two other homologous RIT

elements found in the draft assembly of the PAH degrading strain Burkholderia sp. Ch1-1.

Prior to additional work that has recently improved the quality of this assembly, the contigs

containing each of the RIT elements in this strain terminated at the edges of these elements,

revealing absolutely no genomic context.

The role of genomic islands in bacterial adaptation is becoming increasingly clear, yet

many of the genes contained within these islands have not been characterized (Penn et al. 2009).

Indeed, a defining feature of genomic islands is a high abundance of conserved hypothetical

proteins (Van Houdt et al. 2009). Understanding the possible roles of the multitude of currently

hypothetical genes will require intensive experiments, and the development of these experiments

may be hampered by the incomplete information included in draft assemblies (Phillippy et al.

2008). With decreasing sequencing costs, initial draft genomes are going to increase in

prevalence, inundating the public databases with incomplete or fragmented genome projects

which decrease the overall utility of these databases for other analyses, particularly those relating

to horizontal gene transfer. This issue has been addressed in a number of publications, and there

are validation tools available that can aid in distinguishing mis-assemblies (Phillippy et al.

2008). We submit that many of the genes responsible for prokaryotic adaptation will be present

in these highly recombinational or potentially mobile regions that are inherently resistant to

automated assembly, and that extensive finishing experiments, which not only close the

created contigs but also correct the introduced errors, should therefore be an important

focus of any sequencing project. Furthermore, the very elements disrupting the automated

assembly have a wealth of information to provide regarding the evolution and transferability of

these genes, and also may have a role in the regulation of these important genomic regions. As

technological improvements become available to ease the assembly of bacterial genomes,

recognizing the high relative importance of these regions will be key to creating the incentive

needed to pursue novel ways of finishing genomes and to improve our knowledge of bacterial

adaptation.

3.4 Acknowledgements

Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D Scholarship to

NR is gratefully acknowledged. The funding agency had no role in this study.

3.5 References

Sanger, F., S. Nicklen and A.R. Coulson. 1977. DNA sequencing with chain-terminating inhibitors, Proc. Natl. Acad. Sci. USA 74:5463-5467.

Miller, J.R., S. Koren and G. Sutton. 2010. Assembly algorithms for next-generation sequencing data, Genomics 95:315-327.

Wetzel, J., C. Kingsford and M. Pop. 2011. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies, BMC Bioinformatics 12:95. http://www.biomedcentral.com/1471-2105/12/95

MacLean, D., J.D.G. Jones and D.J. Studholme. 2009. Application of ’next-generation’ sequencing technologies to microbial genetics, Nat. Rev. Microbiol. 7:287-296.

Medini, D., C. Donati, H. Tettelin, V. Masignani and R. Rappuoli. 2005. The Microbial Pan-Genome, Curr. Opin. Genet. Dev. 15:589-594.

Pop, M. 2009. Genome assembly reborn: recent computational challenges, Briefings Bioinf. 10(4):354-366.

Chevreux, B., T. Wetter and S. Suhai. 1999. Genome sequence assembly using trace signals and additional sequence information, Comput. Sci. Biol.: Proc. German Conference on Bioinformatics GCB'99 GCB:45–56.

Phillippy, A.M., M.C. Schatz and M. Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly, Genome Biol. 9:R55 (doi:10.1186/gb-2008-9-3-r55).

Nagarajan, N.C., M.D. Cook, H. G. Bonaventura, A. Richards, K.A. Bishop-Lilly, R. DeSalle, T.D. Read and M. Pop. 2010. Finishing genomes with limited resources: lessons from an ensemble of microbial genomes, BMC Genomics 11:242.

Ellegren, H. 2004. Microsatellites: Simple Sequences with Complex Evolution, Nat. Rev. Genet. 5:435-445.

Mayer, C., F. Leese and R. Tollrian. 2010. Genome-wide analysis of tandem repeats in Daphnia pulex – a comparative approach, BMC Genomics 11:277 (http://www.biomedcentral.com/1471-2164/11/277).

Lupski, J.R. and G.M. Weinstock. 1992. Short, Interspersed Repetitive DNA Sequences in Prokaryotic Genomes, J. Bact. 174(14):4525-4529.

Zhang, W., J. Chen, Y. Yang, Y. Tang, J. Shang and B. Shen. 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies, PLoS ONE 6(3):e17915. doi:10.1371/journal.pone.0017915

Van Houdt, R., S. Monchy, N. Leys and M. Mergeay. 2009. New mobile elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria, Antonie van Leeuwenhoek 96:205-226.

Janssen, P.J., R. Van Houdt, H. Moors, P. Monsieurs, N. Morin, A. Michaux, M.A. Benotmane, N. Leys, T. Vallaeys, A. Lapidus, S. Monchy, C. Medigue, S. Taghavi, S. McCorkle, J. Dunn, D. van der Lelie and M. Mergeay. 2010. The Complete Genome Sequence of Cupriavidus metallidurans Strain CH34, a Master Survivalist in Harsh and Anthropogenic Environments, PLoS ONE 5(5):e10433. doi:10.1371/journal.pone.0010433

Langille, M.G.I. and F.S.L. Brinkman. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands, Bioinformatics. Jan. 16 (EPub). PMID: 19151094

Pevzner, P.A. and H. Tang. 2001. Fragment assembly with double-barreled data, Bioinformatics 17:S225–S233.

Kingsford, C., M.C. Schatz and M. Pop. 2010. Assembly complexity of prokaryotic genomes using short reads, BMC Bioinformatics 11:21 (http://www.biomedcentral.com/1471-2105/11/21).

Zerbino, D.R., G.K. McEwen, E.H. Margulies and E. Birney. 2009. Pebble and Rock Band: Heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler, PLoS ONE 4(12):e8407. doi:10.1371/journal.pone.0008407

Simpson, J.T., K. Wong, S.D. Jackman, J.E. Schein, S.J.M. Jones and I. Birol. 2009. ABySS: A parallel assembler for short read sequence data structures, Genome Research 19:1117-1123.

Monchy, S., M.A. Benotmane, P. Janssen, T. Vallaeys, S. Taghavi, D. van der Lelie and M. Mergeay. 2007. Plasmids pMOL28 and pMOL30 of Cupriavidus metallidurans are specialized in the maximal viable response to heavy metals, J. Bact. 189(20):7417-7425.

Drummond, A.J., B. Ashton, S. Buxton, M. Cheung, A. Cooper, C. Duran, M. Field, J. Heled, M. Kearse, S. Markowitz, R. Moir, S. Stones-Havas, S. Sturrock, T. Thierer and A. Wilson. 2010. Geneious v5.5, Available from http://www.geneious.com

Salzberg, S.L., A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marçais, M. Pop and J. A. Yorke. 2012. GAGE: A critical evaluation of genome assemblies and assembly algorithms, Genome Research 22:557-567.

Versalovic, J., T. Koeuth and J.R. Lupski. 1991. Distribution of Repetitive DNA Sequences in Eubacteria and Application to Fingerprinting of Bacterial Genomes, Nucleic Acids Res. 19(24):6823-6831.

Mahillon, J. and M. Chandler. 1998. Insertion sequences, Microbiol. Mol. Biol. Rev. 62:725-774.

Touchon, M. and E.P.C. Rocha. 2007. Causes of Insertion Sequences Abundance in Prokaryotic Genomes, Mol. Biol. Evol. 24(4):969-981.

Amadou, C., G. Pascal, S. Mangenot, M. Glew, C. Bontemps, D. Capela, S. Carrere, S. Cruveiller, C. Dossat, A. Lajus, M. Marchetti, V. Poinsot, Z. Rouy, B. Servin, M. Saad, C. Schenowitz, V. Barbe, J. Batut, C. Medigue and C. Masson-Boivin. 2008. Genome Sequence of the β-rhizobium Cupriavidus taiwanensis and comparative genomics of rhizobia, Genome Res. 18:1472-1483.

Tuanyok, A., B.R. Leadem, R.K. Auerbach, S.M. Beckstrom-Sternberg, J.S. Beckstrom-Sternberg, M. Mayo, V. Wuthiekanun, T.S. Brettin, W.C. Nierman, S.J. Peacock, B.J. Currie, D.M. Wagner and P. Keim. 2008. Genomic Islands from Five Strains of Burkholderia pseudomallei, BMC Genomics 9:566. doi:10.1186/1471-2164-9-566

Narzisi, G. and B. Mishra. 2011. Comparing De Novo Genome Assembly: The Long and Short of It, PLoS ONE 6(4):e19175. doi:10.1371/journal.pone.0019175

Ohno, S. 1970. Evolution by gene duplication, Springer-Verlag, New York.

Siguier, P., J. Filee and M. Chandler. 2006. Insertion sequences in prokaryotic genomes, Curr. Opin. Microbiol. 9:526-531.

Stothard, P., G. Van Domselaar, S. Shrivastava, A. Guo, B. O'Neill, J. Cruz, M. Ellison and D.S. Wishart. 2005. BacMap: an interactive picture atlas of annotated bacterial genomes, Nucleic Acids Research 33:D317-D320.

Jin, S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172, M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.

Dobrindt, U., B. Hochhut, U. Hentschel and J. Hacker. 2004. Genomic islands in pathogenic and environmental microorganisms, Nat. Rev. Microbiol. 2:414–424.

Toussaint, A. and C. Merlin. 2002. Mobile Elements as a Combination of Functional Modules, Plasmid 47:26-35.

Coleman, M.L., M.B. Sullivan, A.C. Martiny, C. Steglich, K. Barry, E.F. DeLong and S.W. Chisholm. 2006. Genomic Islands and the Ecology and Evolution of Prochlorococcus, Science 311:1768 (doi:10.1126/science.1122050).

Penn, K., C. Jenkins, M. Mett, D.W. Udwary, E.A. Gontang, R.P. McGlinchey, B. Foster, A. Lapidus, S. Podell, E.E. Allen, B.S. Moore and P.R. Jensen. 2009. Genomic islands link secondary metabolism to functional adaptation in marine Actinobacteria, ISME J. 3:1193-1203.

Zerbino, D.R. and E. Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res. 18:821-829.

Altschul, S.F., W. Gish, W. Miller, E.W. Myers and D.J. Lipman. 1990. Basic local alignment search tool, J. Mol. Biol. 215:403-410.

Chapter 4 Phylogeny and Organization of Recombinase in Trio (RIT) Elements

Acknowledgements and Contributions: This chapter is reproduced as published in Plasmid,

with some modifications (Ricker, N., Qian, H., & Fulthorpe, R. R. 2013. Phylogeny and

organization of recombinase in trio (RIT) elements. Plasmid, 70(2), 226-239

doi:10.1016/j.plasmid.2013.04.003).

4 Introduction

A mobile genetic element (MGE) is defined as any discrete segment of DNA that can

move within or between genomes (Frost et al., 2005) and is inclusive of plasmids, phages,

integrative conjugative elements (ICEs), and the myriad of smaller elements capable of inter- or

intra-cellular movement (classified as transposable elements; see Roberts et al., 2008). The

mobility of some of these elements occurs through the action of site-specific recombinases

(SSRs), which are divided into two classes defined by an absolutely conserved residue integral

to the active site (tyrosine or serine). Tyrosine recombinases (TBSSRs, often just referred to as

integrases) are extremely diverse, sharing only 3 absolutely conserved residues among all

members described to date (Nunes-Düby et al., 1998); however, there are 24 sub-families

described in the NCBI conserved domain database (Marchler-Bauer et al., 2005). A recent

in-depth analysis of TBSSRs has instead divided the known representatives into 56 families of 4

or more elements which were found to correlate with type of mobile genetic element (plasmid,

phage or prophage) in 87% of the families (Van Houdt et al., 2012). This analysis suggests that

the functional roles of these elements may be different between the families and may be directly

related to the nature of the mobile elements they are associated with.

Recombinase in Trio (RIT) elements were first defined in 2009 in Cupriavidus

metallidurans CH34 (Van Houdt et al., 2009). The original description noted the common

occurrence of conserved elements composed of three TBSSRs with overlapping open reading

frames. The three tyrosine recombinases in these elements were all of similar size, generally

with the largest enzyme first and the smallest enzyme in the middle. These elements were

postulated as being independently mobile for three reasons: the diversity of organisms found to

be harboring homologous elements, specific gene interruptions implying targeted integration,

and the presence of highly similar elements in more than one location in the same genome, as is

often seen in transposons. After discovering a RIT element in a chlorobenzoate degrading strain

in our own lab, we decided to further investigate the distribution of these elements in currently

available genomes in order to characterize their associations and potential for mobility.

4.1 Methods

We used the NCBI databases and BLAST analysis tools (Altschul et al., 1997) to obtain

progressively less homologous sequences to the two original RIT elements found in C.

metallidurans CH34 (Van Houdt et al., 2009). All similarity matches that still conformed to a

three adjacent recombinase format were utilized for additional searches. The three

recombinases from each of the intact elements were analyzed through BlastP comparison to the

Conserved Domain Database (CDD) (Marchler-Bauer et al., 2005) and the highest scoring

matches were consistently to the pAE1, SG4 and SG5 sub-families of tyrosine recombinases,

respectively (in order of transcription). Therefore all members of these sub-families from the

Conserved Domain Database were investigated for inclusion as RIT elements. A random

sample of enzymes from other subfamilies was also investigated in order to determine the

ubiquity of the triad arrangement. Elements were organized into clusters, and their key features

identified, through BLAST homology. Automated multiple alignments were performed using

the Muscle alignment program (Edgar, 2004) within Geneious (Drummond et al., 2011).

Neighbour joining trees were also prepared in Geneious, using Jukes-Cantor models and

bootstrap re-sampling with 100 replicates. For nucleotide comparisons a 70% support threshold

was used (no outgroup for full RIT element trees; delta-Proteobacteria outgroup for 16S since

there was only one representative from this class). Amino acid comparisons were prepared

with an 80% support threshold, using a RIT element from Acidithiobacillus as the outgroup to

anchor the trees.
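
The Jukes-Cantor model named above corrects the observed proportion of differing sites, p, to a distance d = -(3/4) ln(1 - 4p/3). The following is a minimal Python sketch of that correction only, assuming two pre-aligned nucleotide sequences and skipping any column that contains a gap; the function names are hypothetical, and this is not the Geneious implementation used for the trees in this chapter.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped
    (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once sequences are saturated.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy sequences for illustration only (not RIT data).
    print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))
```

The corrected distance is always at least as large as p, which is why deeply diverged elements produce the long branches seen in neighbour joining trees built from such distances.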

4.2 Results and Discussion

4.2.1 Abundance and Occurrence in Database

Through our homology searches of the NCBI database, we were able to find 148

sequences containing three adjacent tyrosine recombinases that we classified as putative RIT

elements. These elements were separated into groups based on homology to the third

recombinase (see section 4.2.3), and the information for these groups is listed in Supplemental

Table S3. These putative RIT elements were obtained from 63 different genera across 7 phyla

of bacteria and this is not expected to be an exhaustive list given the diversity of the elements

found. As summarized in Table 4.2.1, the Proteobacteria accounted for the majority of the

strains (25, 17, 7 and 1 strains from the alpha through delta classes, respectively, representing

59.5% (50/84) of the total strains). This was a significant divergence from the expected

representation both for Proteobacteria in general (which represent 42% of the genomes in the

NCBI database) as well as for the alpha-, beta- and gamma-Proteobacteria individually. The

gamma-Proteobacteria are the most abundant in the database, however both alpha- and beta-

Proteobacteria had higher representation in the strains harbouring RIT elements (Figure 4.2.1).

There is the possibility that this is an artifact of beginning the homology search with a beta-

Proteobacteria representative, but this would not fully account for the abundance of alpha-

Proteobacteria found. It is likely however that the majority of these RIT elements are connected

by plasmid distribution and that the small number of isolated elements from particular phyla

represent a rare transfer event. This is supported by the fact that searches initiated from the

gamma-Proteobacteria and other poorly represented phyla consistently returned results from the

alpha- or beta-Proteobacteria representatives. There could potentially be other RIT elements that

are more broadly distributed among gamma-Proteobacteria or other phyla that we were not able

to detect since they were not homologous enough to the RIT elements found to date.
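
The over-representation of Proteobacteria can be illustrated with the counts given above (50 of 84 strains against an expected database share of 42%). The sketch below uses a one-sided exact binomial test; this choice of test is an assumption made for illustration, as the test behind the significance marks in Figure 4.2.1 is not stated here, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Values taken from the text of section 4.2.1.
observed_strains = 50      # Proteobacterial strains harbouring RIT elements
total_strains = 84         # all strains harbouring RIT elements
database_fraction = 0.42   # share of Proteobacteria among NCBI genomes

p_value = binomial_upper_tail(observed_strains, total_strains, database_fraction)

print(f"observed fraction: {observed_strains / total_strains:.3f}")
print(f"one-sided p-value: {p_value:.4g}")
```

Under these assumptions the p-value falls well below 0.05, in line with the significant divergence reported for the Proteobacteria as a whole.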

Table 4.2.1: Summary of information of putative RIT elements found in this study.

Taxonomic grouping        No. RIT   No.      pAE1 range  SG4 range  SG5 range  Gene adjacent/interrupted
                          elements  strains  (aa)        (aa)       (aa)
alpha-Proteobacteria      45
  Caulobacterales         3         1        403         313        330        DUF1738
  Rhizobiales             20        10       305-425     304-373    281-362    variable
  Rhodospirillales        11        7        228-508     303-454    329-335    hypothetical, methylase
  Sphingomonadales        12        7        331-515     301-455    324-348    hypothetical, methylase
beta-Proteobacteria       35
  Burkholderiales         29        15       348-425     308-457    329-349    variable (IS66, RadC, transposase, integrase)
  Rhodocyclales           5         1        411-417     310-325    294-332    integrase
  unclassified            1         1        318         324        337        hypothetical
gamma-Proteobacteria      12
  Acidithiobacillales     2         1        414         311        331        integrase catalytic unit
  Alteromonadales         6         2        321-417     312-322    327-335    variable
  Enterobacteriales       2         2        315, 419    308, 330   337, 338   RadC, methylase
  Legionellales           1         1        418         332        335        hypothetical
  Pseudomonadales         1         1        411         323        354        trbI conjugative genes
delta-Proteobacteria      1
  Desulfobacterales       1         1        409         338        337        reverse transcriptase
Acidobacteria             3
  Solibacterales          3         1        412, 710    314, 452   332, 336   integrase, hypothetical
Actinobacteria            19
  Actinomycetales         5         5        304-511     308-332    329        integrase, transposase, DNA directed reverse transcriptase
  Bifidobacteriales       14        6        400         321        351        transposase, integrase
Bacteroidia               7
  Bacteroidales           5         5        407-426     313-341    336-343    hypothetical
  Flavobacteriales        2         1        425         330        337        RadC
  Cytophagales            1         1        422         327        337        DNA repair protein
Firmicutes                13
  Clostridiales           12        11       404-537     283-334    337-342    variable
  Bacillales              3         2        407-413     327-329    338-340    IstB domain-containing protein ATP-binding protein
Verrucomicrobia           4         1
  Opitutales              4         1        432         330        336        MerR regulator
Planctomycetes            1
  Planctomycetales        1         1        419         321        330        hypothetical

Figure 4.2.1: Comparison of the taxonomic representation of our RIT collection with the

abundance of the same taxonomic grouping in the NCBI genome database.

The NCBI numbers included both completed genomes and incomplete sequencing projects.

Significant differences (α = 0.05) are indicated with a double asterisk.

4.2.2 RIT Structure and Organization

As mentioned in section 4.2.1, the NCBI Conserved Domain Database currently has 24

described subfamilies of tyrosine recombinases. All of the elements had one gene from each of

the three subfamilies pAE1, SG4 and SG5 and they were always found in the same order and

orientation (Figure 4.2.2; discussed in section 4.2.3). This pattern was also confirmed in the

recent work examining the distribution of tyrosine based site specific recombinases on different

types of mobile elements (Van Houdt et al., 2012). In that work the three families of tyrosine

based site specific recombinase specifically involved in the formation of RIT elements were

designated FamilyIntegrase (FamInt) 1, 5 and 2 (also in order of transcription) and were

documented as having 64, 54 and 63 members, respectively. The number of included elements

was more conservative than our study as inclusion was based on confidence in family

membership for each individual recombinase. In our study we used the trio arrangement of these

recombinases as the hallmark of these novel elements and so we included elements that had

individual genes for which there was lower confidence in the family designation (see section

4.2.3). In addition to the 148 putative RIT elements, i.e. genes in trios, we found only 15

sequences that corresponded with an individual recombinase from one of these sub-families but

not found in a trio. In addition there were 20 putatively degraded RIT elements. The latter were

distinguished by the presence of one or two documented recombinases and pseudogenes or

small ORFs in the remainder of the corresponding region. There was no pattern evident in

terms of which recombinase was missing in these degraded structures. These recombinases may

be RIT remnants due to inactivation or ancient distribution of these elements but may also

indicate that some or all of these subfamilies can function outside of the RIT arrangement.

As can be seen in Table 4.2.1, there is also a wide range of sizes observed for each of the

recombinases. This is particularly evident in the pAE1 (Int1) recombinase, which varies from

305 to 710 amino acids in length. The SG4 (Int2) recombinase is less variable (283-457 amino

acids), and the third (SG5, Int3) even less so (281-351 amino acids) but the individual

variations may also be an artifact of automated annotations. The pattern of sizes originally

described for these elements (largest first, smallest in the middle) is also variable. The largest

recombinase is in the middle position for 10 elements and in the third position for 6 elements.

Although these RIT elements cannot be assumed to be active, the presence of 6 similar elements

with Int2 as the largest enzyme in 5 different members of the Sphingomonadales suggests that

variation in the pattern of sizes is tolerated.
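
The size pattern discussed above (largest recombinase first, smallest in the middle) can be checked for any element from its three annotated protein lengths. The following is a minimal sketch with a hypothetical function name; it simply reports which position holds the largest and smallest enzyme, and ties resolve to the earlier position.

```python
def size_pattern(int1_len, int2_len, int3_len):
    """Return (largest, smallest, canonical) for one RIT element.

    `canonical` is True when the element follows the originally described
    pattern: largest recombinase first (Int1), smallest in the middle (Int2).
    Lengths are in amino acids.
    """
    lengths = {"Int1": int1_len, "Int2": int2_len, "Int3": int3_len}
    largest = max(lengths, key=lengths.get)
    smallest = min(lengths, key=lengths.get)
    canonical = largest == "Int1" and smallest == "Int2"
    return largest, smallest, canonical


if __name__ == "__main__":
    # Caulobacterales lengths from Table 4.2.1 (pAE1 403, SG4 313, SG5 330).
    print(size_pattern(403, 313, 330))
    # Hypothetical lengths illustrating an element with Int2 as the largest enzyme.
    print(size_pattern(331, 455, 324))
```

Because automated annotation can misplace start codons, any such tally is only as reliable as the annotated lengths it is given.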

Figure 4.2.2: Names and arrangements of tyrosine recombinase sub-families.

Names are families according to the NCBI conserved domain database for the three integrases

that comprise the RIT elements (see section 4.2.3). Arrows indicate the direction of

transcription for Int1-3 (in order of transcription). The inverted repeats (IR) have only been

confirmed in a small number of the putative RIT elements (see section 4.2.4.4).

4.2.3 Inferred RIT Functionality

Within our putative RIT elements we saw a broad diversity of recombinase sizes and

amino acid sequences, but conservation was always highest in the C-terminal region of each enzyme.

This is commonly found in the tyrosine recombinases and in other characterized phage

integrases in which the N-terminal region is involved in site recognition and the C-terminal

contains all of the catalytic sites (Esposito and Scocca, 1997). The consistency of this finding

across the intact RIT elements examined in this study implies that all three of these

recombinases are being selectively maintained in this arrangement. The CDD utilizes these

conserved regions to support inclusion of novel phage integrases into each of the currently

outlined sub-families. If the novel enzyme contains sufficient conserved residues to surpass a

pre-determined domain specific threshold then it is designated as specific to that particular sub-

family. For the tyrosine recombinases, we have found no literature to date that investigates the

functionality of the individual sub-families, which limits the utility of these classifications for

evaluating whether each recombinase is functional. However, of the 148 RIT elements included

in our assessment, 93% have at least one recombinase that meets the specific criteria for

inclusion in the designated subfamily. Int3 (SG5) is the least divergent of the three elements

(Figure 4.2.3A,B), and 105 of the RIT elements (71%) have domain specific SG5 genes. 66% of

the elements have domain specific pAE1 (Int1) genes, while only 25% have domain specific

SG4 genes (Int2). In 36%, both pAE1 and SG5 are domain specific. Only 17 (11%) contain the

amino acid residues required for designation of all three integrases as pAE1, SG4 and SG5 by

the CDD. For the remainder, the top (lowest E-value) non-specific sub-family hit was

consistently to the expected group based on position within the RIT (i.e. Int1, 2 or 3).
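
The proportions reported above amount to a tally of subfamily-specific hits per element. The sketch below shows one way such a tally could be made, assuming each element is reduced to three booleans recording whether its pAE1, SG4 and SG5 hits passed the CDD domain-specific threshold; the function name and the toy records are hypothetical and are not the thesis dataset.

```python
from collections import Counter


def tally_specificity(elements):
    """Tally domain-specific hits over RIT elements.

    `elements` is an iterable of (pae1_specific, sg4_specific, sg5_specific)
    booleans. Returns {category: (count, rounded percentage)}.
    """
    counts = Counter()
    total = 0
    for pae1, sg4, sg5 in elements:
        total += 1
        counts["pAE1"] += pae1
        counts["SG4"] += sg4
        counts["SG5"] += sg5
        counts["at_least_one"] += (pae1 or sg4 or sg5)
        counts["all_three"] += (pae1 and sg4 and sg5)
    if total == 0:
        return {}
    return {key: (value, round(100 * value / total)) for key, value in counts.items()}


if __name__ == "__main__":
    # Four made-up elements, for illustration only.
    toy = [
        (True, False, True),
        (False, False, True),
        (True, True, True),
        (False, False, False),
    ]
    print(tally_specificity(toy))
    # The reported percentages follow from the counts in the text:
    print(round(100 * 105 / 148), round(100 * 17 / 148))
```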

If we infer functionality from the presence of duplicate elements in a genome, we can

assess whether recombinases that lack the threshold number of conserved residues for

subfamily designation may still be active. There are 19 species containing more than one

identical RIT element within their genome, only 4 of which have all three subfamily specific

integrases. There are 6 instances of genomes with identically duplicated RIT elements that are

lacking subfamily specific SG4 genes (section 4.2.4.1) and also 6 closely related strains with

duplicated elements lacking a subfamily specific SG5 gene (section 4.2.4.2). There is a single

instance of identical elements in a genome without a subfamily specific pAE1 recombinase (in

Sinorhizobium meliloti 1021 pSymA) and a separate genome (Dinoroseobacter shibae DFL12)

has identical elements where only the SG5 recombinase is subfamily specific. This may imply

that all three enzymes are not strictly required for mobility, or that one or more of the residues

currently used to delineate a subfamily specific enzyme are not necessary for this function.

Figure 4.2.3: Comparison of conservation between the Int1 (pAE1) recombinases (A - top)

and Int3 (SG5) recombinases (B - bottom) from 40 divergent representatives.

Level of conservation is illustrated through shading (dark lines represent conserved amino

acids). As can be seen, both enzymes increase in conservation towards the C-terminal, and

conservation is higher in the third recombinase.

4.2.4 Evidence for RIT Mobility Within Closely Related Strains

For the purposes of this discussion, we assume that identical RIT

sequences in the same strain imply mobility and that high levels of similarity between RIT

elements located in different strains or species are evidence of horizontal transfer, likely via an

intermediary replicon. In their paper originally defining RIT elements, Van Houdt et al. (2009)

described two non-homologous RITs in Cupriavidus metallidurans CH34. The first of these

RIT elements (RITCme1) bears high nucleotide identity (greater than 90%) to truncated RIT

elements in Cupriavidus necator H16 pHG1 (two identical inverted RIT element fragments close

together with integrase remnants in between) and to a degraded RIT element in Burkholderia sp.

str. CCGE1002 (with only Int2 still listed as intact and the others listed as pseudogenes). In

addition, RITCme1 shares 84% nucleotide identity to two identical RIT elements in the

unassembled whole genome sequence data of the PAH degrading strain Burkholderia sp. Ch1-1

and to our newly identified element RITBphyt1 (Jin, 2010). RITBphyt1 was discovered in a

chlorobenzoate degrading Russian soil isolate designated Burkholderia phytofirmans OLGA172

(formerly R172). In this strain, the RIT element is found adjacent to the chlorocatechol

degradative operon (Jin, 2010). There are no chlorocatechol degrading genes found in C.

metallidurans CH34, indicating that the genes adjacent to the RIT element are not shared

between these two strains; however, each of these strains does have partial IS66 elements

overlapping the RIT elements which may represent the target site for insertion.

Our dataset did reveal two clusters of RITs sharing greater than 85% nucleotide identity

over their full lengths: one cluster of RITs found in Acidiphilium/Caulobacter strains, and

another cluster of RITs from Bifidobacterium longum. Each of these shows evidence of recent

mobility in that 1) 100% identical sequences are found in different locations within individual

strains, 2) 80-100% identical sequences occur in separate species, and 3) the RIT elements share

higher identity than the surrounding genes. Interestingly, although gene synteny appears to be

conserved in many of the Bifidobacterium cluster, the adjacent genes in the

Acidiphilium/Caulobacter cluster are highly variable indicating that the RIT elements have not

been mobilized as part of a larger element. Details on these two informative groups are given

below.

4.2.4.1 Caulobacter/Acidiphilium cluster

This cluster of highly similar RITs comes from the genomes of three strains and the

plasmids they contain. Two of the strains are from the genus Acidiphilium, while the third is

from the genus Caulobacter, which share approximately 86% 16S rRNA sequence identity.

Caulobacter sp. K31 is a chlorophenol degrader isolated from groundwater in Finland

(Männistö et al., 2001). There are two identical RIT elements on the K31 chromosome, and

another identical copy on one of the two plasmids in this strain (pCAUL02 – length 178 kb).

Acidiphilium multivorum AIU301 is an aerobic, anoxygenic and phototrophic bacterium from

pyritic acid mine drainage well known for its metal tolerance

(www.bio.nite.go.jp/dogan/project/view/AM1). A. multivorum AIU301 carries one RIT element

on the chromosome and 2 identical copies on one of its 8 plasmids (pACMV1 – length 272 kb).

Acidiphilium cryptum JF-5 is a facultative iron-respiring strain isolated from coal mine lake

sediment (www.ncbi.nlm.nih/bioproject/58447). A. cryptum JF-5 shows high gene synteny with

A. multivorum AIU301 except for a 225 kb region from AIU301 that is a probable genomic

island (www.bio.nite.go.jp/dogan/project/view/AM1). There is no RIT element on the A.

cryptum JF-5 chromosome, however it also carries 8 plasmids. One of these (pACRY01 – 203

kb) carries a RIT element that is identical to those found in A. multivorum AIU301, except that

one of the inverted repeats is only 97% similar. A second plasmid in this same strain

(pACRY03 – 89 kb) carries a RIT element that bears 84% nucleotide identity with the RIT

elements on pACRY01, and 82% sequence identity to 92% of the RIT elements in Caulobacter

sp. K31 (no significant alignment to 238 bp of int1).

The RIT elements in this cluster are clearly moving as one intact unit since within each

individual strain the nucleotide identity is 100% for the entire RIT element, including the three

recombinase genes and the additional sequence between the enzymes and the inverted repeats.

Similarity in the gene fragments surrounding the RIT elements is suggestive of specific target

genes for integration – in this case the DUF1738 gene (also sometimes annotated as an anti-

restriction protein; Figure 4.2.4). This is supported by the fact that the target gene is also

consistent on both the pACRY01 and pACRY03 plasmids, and there is no copy of the

interrupted gene (DUF1738) on the A. cryptum JF-5 chromosome or any of the other 6 plasmids

in that strain, consistent with the RIT element occurrence. Interestingly, as outlined in Figure

4.2.4 for the Caulobacter sp. K31 RIT elements, although the elements appear to have

integrated into homologous genes, the relative orientation of the RIT element to the target gene

is not always consistent and has impacted the gene annotation.

The RIT elements found in A. multivorum AIU301 share 83% nucleotide identity with

those found in Caulobacter sp. K31 and the terminal inverted repeats are almost identical

between the two genera – the Caulobacter strain shows perfect 34 bp repeats for each of the RIT

elements, however the Acidiphilium RIT elements have a SNP in the 5’ repeat and the repeats

are not the full 34 bp (therefore form imperfect repeats of 30-33 bp in length). Despite the

decreased identity, all of these inverted repeats have 8 bp regions that are absolutely conserved

(discussed in section 4.2.4.4). The interrupted genes in A. multivorum AIU301 are all annotated

as hypothetical proteins, however the protein upstream of the RIT element found on the

chromosome has 73 and 70% homology respectively with the DUF1738 protein fragments

found surrounding RIT1 and RIT2 on the K31 chromosome.

Figure 4.2.4: Arrangement of RIT elements on the chromosome of Caulobacter sp. K31.

The two RIT elements and inverted repeats are identical, and are found within the same gene

(DUF1738) however the orientation is reversed and the DUF1738 nucleotide identity is not as

high as within the RIT element. When inserted in the correct orientation, the RIT element

appears to restore the DUF1738 sequence, however this is not the case when the orientation is

reversed.

4.2.4.2 Bifidobacterium longum cluster

The Bifidobacterium longum cluster consists of 15 RIT elements sharing 99-100%

nucleotide identity distributed across 7 strains of these intestinal bacteria. These RIT elements

have been previously characterized as Mobile Integrase Cassettes (MIC) (Lee et al., 2008) and a

search of other intestinal bacteria led those researchers to suggest that these elements may be

unique to the Bifidobacteria. Five of these strains contain multiple copies and almost all are

flanked by similar transposases that range from 68 to 100% nucleotide identity. In the strains

with more than one RIT element, one of the elements is commonly found in the reverse

orientation with respect to the direction of transcription of the transposase gene, which is

consistent with the duplicate RITs in both Caulobacter sp. K31 and A. multivorum AIU301.

The combination of reversed relative orientation and the decreased level of nucleotide identity

between the transposases implies that the transposases may be a target of the RIT elements and

not responsible for their movement. However, unlike the Caulobacter/Acidiphilium cluster,

within these genomes there are other homologous transposases (up to 99% nucleotide identity)

that have not been interrupted by a RIT element.

The Bifidobacterium longum strains are all very similar (99% nucleotide identity for

16S), suggesting that the similarities observed in the RIT elements may simply be associated

with vertical transmission rather than duplication and mobility. In some circumstances there is a

high degree of surrounding gene synteny which supports this interpretation. There is however

also evidence for significant genetic rearrangements specific to the RIT elements themselves. B.

longum DJ010A has three RIT elements. One of these elements, RIT1, is surrounded by genes

that are 99% conserved in the other B. longum strains and the genes are found in the same order.

The genes surrounding RIT2 and RIT3 are conserved in other B. longum strains as well,

however the gene synteny is not preserved as these genes are found scattered throughout the

other genomes. The only strain that does show high gene synteny with the genes surrounding

RIT2 is B. longum F8, however the RIT element itself is annotated as occurring in the opposite

orientation in relation to the surrounding genes.

4.2.4.3 Target Sites

Although hampered by issues with incomplete annotations due to gene interruptions, an

examination of the full collection of RIT elements clearly indicates that there are specific genes

that serve as target sites for integration. The genes immediately adjacent to or interrupted by the

elements are commonly the same in cases of multiple identical elements within one strain and

between different strains harboring elements with >65% SG5 protein identity. In addition to the

genes described in sections 4.2.4.1 and 4.2.4.2, clusters of RIT elements were also found

associated with IS66, RadC, methylase/helicase genes and integrase genes. There is even a RIT

element in Aromatoleum aromaticum EbN1 which appears to have interrupted a second RIT

element. Whether the variability in gene targets stems from sequence evolution or lack of the

original target site in individual strains has not been investigated and a specific target sequence

within these genes could not be determined.

4.2.4.4 Inverted Repeats

The tyrosine recombinases are a highly diverse family of proteins, with variable

complexity in both their DNA binding sites and their requirement for accessory proteins (Azaro

and Landy, 2002; Rajeev et al., 2009). The presence of multiple identical RIT elements in

different parts of the genomes of some strains revealed terminal features that may be involved in

recombinase binding or regulation. Alignment of the Bifidobacterium longum RIT sequences

identified a 97 bp inverted repeat that is absolutely conserved and always 41 bp from Int1 and 3

bp from Int3. The inverted repeats identified in the Caulobacter and Acidiphilium strains were

only 30-34 bp in length followed by a section of presumably non-coding sequence between the

inverted repeats and the recombinase enzymes (illustrated in Figures 4.2.2 and 4.2.4).

Alignment of the inverted repeats and non-coding sequence from the Caulobacter and

Acidiphilium strains with the long inverted repeats from the Bifidobacterium revealed an

interesting pattern of smaller repeats that may serve as recognition or regulatory sites for the

recombinases (Table 4.2.2). Within the inverted repeats, there were two highly conserved direct

repeats of T(A/T)ATGCCG with a 9 bp intervening sequence. Furthermore, an inverted repeat

was also found at an interval of 48-49 bp towards the recombinase enzyme. This pattern was

also confirmed at both ends of the RIT elements for our B. phytofirmans OLGA172 strain, C.

metallidurans CH34, and Burkholderia sp. Ch1-1. Whether this indicates relatedness of these

RIT elements to the Caulobacter/Acidiphilium cluster is not clear. In addition, 12 bp direct

repeats separated by 5 bases were found in both Candidatus Solibacter usitatus Ellin6076 and

Gramella forsetii KT0803 (inverted copy at a distance of 43 bp in both cases). Similar partial

patterns were found in other strains as well, but more information is needed to determine their

relevance.

There is evidence of RIT mobilization of adjacent genes in two bacteria. Opitutus terrae

PB90-1 and Dinoroseobacter shibae DFL12 each have identical sequences that extended

beyond the RIT element but did not have any other mobile elements associated with them. In

the O. terrae PB90-1 strain, this identical sequence (including the RIT) was found in four copies

in the genome and the region extending beyond the RIT element was 1.6 kb in length and

contained a merR regulator, a heavy metal transport/detoxification gene and a hypothetical

protein. In D. shibae DFL12, there were two copies of the RIT and 2.7 kb of additional

sequence including a gene annotated as a type III restriction protein subunit found on two

different plasmids within this strain. In both of these circumstances, the copied regions are

flanked by inverted repeats. In O. terrae PB90-1, this 37 bp inverted repeat had 9 bp imperfect

direct repeats (A/TGT/CTATGTG) separated by 8 bp and an inverted copy at a distance of 49

bp, consistent with the pattern observed in the other strains. For D. shibae DFL12, the

region was flanked by larger inverted repeats of 124 bp (bringing the upstream repeat to within

2 bp of the start codon for the RIT element). An imperfect direct repeat separated by 9 bp was

found within this region (A/TTATGCC/GG) however no clear inverted version was identified.

Table 4.2.2: Potential recognition or regulatory sites contained within terminal inverted

repeats.

The sites occur at a precise distance between the repeats and the coding sequence for the

recombinase genes. Bolded bases are direct repeats contained within the terminal inverted

repeats and for which there is an inverted copy at a precise distance in the direction of the

recombinase genes.

Strain                              5’ Sequence
Burkholderia phytofirmans OLGA172   TTATGCCGATTCCCGGATTATGCCG..49..CGGCATAA
Cupriavidus metallidurans CH34      TTATGCCGACTCCCCGATTATGCCG..49..CGGCATAA
Burkholderia sp. Ch1-1              TTATGCCGACTTCCCGATTATGCCG..49..CGGCATAA
Caulobacter sp. K31                 TAATGCCGCGATCCGGATTATGCCG..49..CGGCATAA
Acidiphilium multivorum AIU301      TAATGCCGAGATCCGGATTATGCCG..49..CGGCATAA
Bifidobacterium longum NCC2705      TTAAGCCGGGTTTGTTGTTAAGCCG..48..CGGCTTAA
Frankia sp. EAN1pec                 TTATGCCGAGGGCCGGGTTATGCCG..49..CGGCATAA
Novosphingobium sp. PP1Y            TAATGCCGTGACCCGGATTATGCCG..49..CGGCATAA
Candidatus S. usitatus Ellin6076    ACTATGCCGCGTCCCGGACTATGCCGCGT..43..ACGCGGCATAGT
Gramella forsetii KT0803            ATTATGTAAAGTAAATTATTATGTAAAGT..43..ACTTTACATAAT
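
The pattern summarized in Table 4.2.2 (two direct repeats of T(A/T)ATGCCG separated by 9 bp, with an inverted copy 48-49 bp further on) can be searched for in a sequence with a short script. The following is a minimal Python sketch rather than the procedure used in this study: the function names are hypothetical, the gap is assumed to be measured from the end of the second direct repeat to the start of the inverted copy, and only the T(A/T)ATGCCG consensus is matched, so the divergent repeats of B. longum, Candidatus S. usitatus and G. forsetii would need their own patterns.

```python
import re

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Two direct repeats of T(A/T)ATGCCG separated by a 9 bp spacer.
# The lookahead allows overlapping candidate sites to be reported.
MOTIF = re.compile(r"(?=(T[AT]ATGCCG)([ACGT]{9})(T[AT]ATGCCG))")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_repeat_sites(seq, gap_range=(48, 49)):
    """Find direct-repeat pairs followed by an inverted copy.

    Returns a list of (start, first_repeat, second_repeat, gap) tuples, where
    `gap` is the distance in bp between the end of the second direct repeat
    and the start of its reverse complement.
    """
    seq = seq.upper()
    hits = []
    for match in MOTIF.finditer(seq):
        first, spacer, second = match.group(1), match.group(2), match.group(3)
        end = match.start() + len(first) + len(spacer) + len(second)
        inverted = reverse_complement(second)
        for gap in range(gap_range[0], gap_range[1] + 1):
            if seq[end + gap:end + gap + len(inverted)] == inverted:
                hits.append((match.start(), first, second, gap))
    return hits


if __name__ == "__main__":
    # The B. phytofirmans OLGA172 repeats from Table 4.2.2, with the 49 bp of
    # intervening sequence replaced by placeholder bases (not real sequence).
    example = "TTATGCCGATTCCCGGATTATGCCG" + "A" * 49 + "CGGCATAA"
    print(find_repeat_sites(example))
```

Widening `gap_range` or relaxing the consensus would be a simple way to screen the remaining putative RIT elements for the partial patterns noted above.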

4.2.5 Similarities between RIT elements and evidence for broad distribution.

Of the 148 putative RIT elements, 64 elements were chosen for phylogenetic analysis

based on nucleotide sequence of all three recombinases. These were chosen on the basis of

having come from completed genomes, and were spread across 38 different genera. Only one

representative was included if there were multiple identical elements within an individual strain

(19 instances), and only one representative was used from the 17 nearly identical RIT elements

in the Bifidobacterium cluster since the 16S sequences of these strains were also 99% identical.

The 16S from the main chromosome was utilized as a proxy for strain phylogeny, even in

circumstances where the RITs were solely present on plasmids. As discussed in section 4.2.1,

the RIT elements that we found were largely from Proteobacteria (Figure 4.2.5A), however the

presence of RIT elements from the Actinobacteria, Firmicutes, Bacteroidetes and Acidobacteria

suggest that these elements are not restricted to any particular phylum of bacteria, and the

diversity implies that they are an ancient feature of bacterial genomes.

A neighbour joining tree of all the putative RIT elements yields a tree with very long

branches (Figure 4.2.5B), reflecting the deep divergence in the RIT elements obtained in this

study. Most clusters harbour elements from more than one genus. Only three of these clusters

are completely contained within one taxon (two clusters from the alpha-Proteobacteria and one

from Actinomycetes). The other clusters are dominated by one taxon (commonly alpha- or beta-

Proteobacteria, consistent with their abundance in the dataset) with unexpected additions from

other classes or even other phyla.

The presence of multiple, diverse RIT elements in individual strains was a common

finding. As is highlighted in Figure 4.2.5B, Burkholderia phymatum STM815, C. metallidurans

CH34, C. necator H16 pHG1, Mesorhizobium loti MAFF303099, Novosphingobium sp. PP1Y

and Candidatus Solibacter usitatus Ellin6076 each have more than one RIT element assigned

to very different RIT clusters. In all except Novosphingobium sp. PP1Y and C. metallidurans

CH34 there are plasmids harboring RIT elements that could account for more recent transfer

events. Many of the other RIT elements (including RITCme1 and RITCme2 from C.

metallidurans CH34 and several RIT elements from the Rhizobiales) are contained within

genomic islands; however, the lack of genomic island information for the majority of these strains

prevented a full analysis of this relationship. An examination of the putative RIT elements from

genomes that have been analyzed on the IslandViewer website (Langille, 2009) revealed that

presence within a putative genomic island was common for RIT elements. Of the 20 strains that

corresponded with our list and had been analyzed in IslandViewer, only 4 had their RIT

elements separated from the putative islands (Acidithiobacillus ferrooxidans ATCC 53993,

Aromatoleum aromaticum EbN1, Sinorhizobium fredii NGR234 and Opitutus terrae PB90-1).

All of the other RIT elements analyzed were found within predicted islands or within 1 kb of a

predicted island. The small islands predicted by IslandViewer can prove to be components of

one larger island, as is found in Mesorhizobium loti MAFF303099. A 610 kb region of this

chromosome has been documented as a symbiosis island (Uchiumi et al., 2004) however the

prediction programs have instead illustrated it as 10-12 smaller islands. All three chromosomal

RIT elements from this strain are found contained within this symbiosis island, although only

one of them is documented as occurring directly within the predicted islands by IslandViewer.

Interestingly, the regions of the symbiosis island where the RIT elements occur are also high in

transposases from a variety of families. In this strain there is a fourth RIT element found on the

pMLa plasmid, and similarly in both Burkholderia phymatum STM815 and Burkholderia

phytofirmans PsJN there are identical RIT elements that are present on both a genomic island on

the chromosome and one or more plasmids, suggesting a possible route of intragenomic

variability within these islands.

The environmental distribution of RIT elements was found to be quite broad, with

representatives from such diverse environments as the head of an off-shore oil producing well to

an intracellular amoebal pathogen. There was an even contribution of strains present from each

type of environment – 32% each from the combined soil/sediment/sludge environments and all

combined water environments (freshwater, seawater, groundwater and wastewater). In addition,

there were 18% specifically identified as plant-associated strains, and 16% were animal

associated including a small number of pathogens. Disregarding the animal associated and

seawater environments (for which there is no straightforward delineation as clean or

contaminated), almost half of the isolates (42%) have been isolated from contaminated

environments. We note this could be due to a bias towards the study of these environments,

particularly in light of the increased abundance of alpha- and beta-Proteobacteria in our RIT

element collection compared to the NCBI database.

Figure 4.2.5: Phylogenetic analysis by 16S (A) and nucleotide sequence of the RIT

elements obtained in this study (B).

Only one representative is included for identical elements found within individual or highly

similar strains (99% 16S identity). Scale bars are at the bottom of each figure and the

re-sampling percentage is indicated at each node. Symbols in panel B are used to illustrate

different RIT elements found in individual strains.

4.2.6 RIT Classification

Van Houdt et al. (2012) created a classification system

consisting of 11 types of RIT elements, based on assignment of the three recombinases to the

same NCBI protein clusters. Four of these types (3, 4, 5 and 7) were further subdivided since one

of the recombinases (commonly Int2) was associated with recombinases from more than one

protein cluster, suggesting the possibility of modular evolution. We wanted to evaluate whether

our larger collection of RIT elements showed congruent evolution of all three recombinase

genes in these four groups of RIT elements.

Phylogenetic trees of the amino acid sequence for each of the three recombinases from

40 individual RIT elements were included in this analysis. These sequences corresponded to the

members of types 3A/B, 4A/B/C, 5A/B, and group 7A/B/C, as well as the other RIT elements

from our collection that were found to cluster with these groups based on homology to the third

recombinase. This resulted in the inclusion of the type 10 RIT element (Gramella forsetii

KT0804) since it was found to cluster with the type 7 elements in our analysis. A single type 2

strain was also included (Acidithiobacillus) and used as an outgroup to root the three trees.

In our analysis (Figures 4.2.6A-C), it is clear that although the clustering occurs at

different levels of similarity for the three recombinases, congruency is evident for each of the

members of types 3, 4, 5 and 7. As the protein cluster memberships are based solely on

modified Blast scores (Klimke et al., 2009), they do not necessarily reflect phylogenetic

relationships made evident by neighbour joining analysis. As an example, for each of the

elements in type 3 and type 4, the Int2 recombinases are listed as being from the same protein

cluster and the Int1 recombinases separate into different protein clusters. Yet in Figures 4.2.6A

and 4.2.6B it is clear that the branching order relationships are preserved between the type 3 and

type 4 proteins for both Int1 and Int2. The type 7 elements are a much more diverse group of

recombinases, and from our analysis it would appear that the type 10 elements actually form one

of several sub-clusters within this group, but the three recombinases show congruency

nonetheless. More research is needed in order to determine an appropriate clustering cutoff for

designation of RIT element types.

Figure 4.2.6: Individual congruency trees for each of the recombinases in a selection of

RIT elements.

The individual trees correspond to Int1 (A), Int2 (B) and Int3 (C). Number and letter designations

are according to the types defined in Van Houdt et al. 2012.

4.3 Conclusions

In this analysis, we expand upon the findings of Van Houdt et al. (2012) by evaluating a

more diverse collection of RIT elements. Through this collection we are able to confirm that the

integrases are consistently from three subfamilies of tyrosine site-specific recombinases (pAE1,

SG4 and SG5) in the same order and orientation. Although not all of these enzymes surpassed

the specific domain threshold for inclusion in the individual subfamilies, the highest scoring hits

were consistently to these groups and our protein neighbour joining trees provide evidence that

the genes have evolved together. Recognizing their association is very informative for

furthering our understanding of these three sub-families of TBSSRs. Functionality (mobility)

could be inferred for a small number of elements that are copied within a genome or in closely

related strains. It should be noted that the intention of this study is not to suggest that all of the

putative RIT elements we have outlined are active, but rather to examine the distribution of

these elements in nature in the hopes of better understanding their role in bacterial genomes.

Examining elements fitting within the description of a recombinase in trio allows for a better

understanding of the overall distribution of these elements. The maintenance of three integrases

belonging to specific sub-families of the tyrosine recombinases in the same arrangement and in

a potentially active form in such a wide diversity of organisms clearly suggests that there is

some benefit to their presence in genomes. It has yet to be determined whether their structural

organization is due to high levels of co-transfer or is evidence of functional interdependency. It

is our hope that the terminal repeat sequences outlined in this study will prove useful in

furthering an understanding of the mode of action of these recombinases. Any discussion of the

putative role of these repeats would be preliminary given the limited knowledge on the mode of

action of these specific sub-families of tyrosine recombinases, alone or in combination.

There are no current models that are consistent with a role for all three recombinases.

RIT elements bearing high similarities (greater than 80% nucleotide identity) across their

full length can be found within closely related genera (such as Burkholderia and Cupriavidus) in

the absence of any other gene synteny. This suggests the elements in these strains can be

mobilized as an intact unit rather than as a component of a larger transposable element. We

expect that any horizontal transfer events are mediated by plasmid movement between these

closely related strains, a logical supposition supported by the prevalence of identical RIT copies

on both the chromosome and one or more plasmids within the same strain. Plasmids may also

be responsible for transport of these elements over greater phylogenetic distances; however, the

broad divergence of these elements suggests ancient origins.

It is clear from this assessment that RIT elements are capable of coordinated intracellular

movement. It is not clear if the triad structure, though most common, is a strict requirement for

the functioning of these enzymes since a small number (11%) of seemingly intact recombinases

from the pAE1, SG4 and SG5 subfamilies were found by themselves or just in pairs. Many

questions remain to be answered. It is clear, however, that several years of genome sequencing

have brought to light a new element that adds to the astounding potential for bacterial adaptation

through recombination.


Funding in the form of an NSERC Discovery Grant to RF and an NSERC CGS-D Scholarship to

NR is gratefully acknowledged. The funding agency had no role in this study.

4.5 References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389-402.

Azaro, M.A., Landy, A., 2002. λ Integrase and the λ Int family, in: Craig, N.L., Craigie, R., Gellert, M. (Eds.), Mobile DNA II. pp. 118-148.

Drummond, A.J., Ashton, B., Buxton, S., Cheung, M., Cooper, A., Heled, J. et al., 2011. Geneious v5.0.4. Available at: http://www.geneious.com.

Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32, 1792-7.

Esposito, D., Scocca, J.J., 1997. The integrase family of tyrosine recombinases: evolution of a conserved active site domain. Nucleic acids research 25, 3605-14.

Frost, L.S., Leplae, R., Summers, A.O., Toussaint, A., 2005. Mobile genetic elements: the agents of open source evolution. Nature reviews. Microbiology 3, 722-32.

Jin, S., 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. University of Toronto M.Sc. Thesis.

Klimke, W., Agarwala, R., Badretdin, A., Chetvernin, S., Ciufo, S., Fedorov, B., Kiryutin, B., O'Neill, K., Resch, W., Resenchuk, S., Schafer, S., Tolstoy, I., Tatusova, T., 2009. The National Center for Biotechnology Information's Protein Clusters Database. Nucleic acids research 37, 216-223.

Langille, M.G.I., Brinkman, F.S.L., 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics Jan. 16 (E.

Lee, J.-H., Karamychev, V.N., Kozyavkin, S.A., Mills, D., Pavlov, A.R., Pavlova, N.V., Polouchine, N.N., Richardson, P.M., Shakhova, V.V., Slesarev, A.I., Weimer, B., O'Sullivan, D.J., 2008. Comparative genomic analysis of the gut bacterium Bifidobacterium longum reveals loci susceptible to deletion during pure culture growth. BMC genomics 9, 247.

Marchler-Bauer, A., Anderson, J.B., Cherukuri, P.F., DeWeese-Scott, C., Geer, L.Y., Gwadz, M., He, S., Hurwitz, D.I., Jackson, J.D., Ke, Z., Lanczycki, C.J., Liebert, C.A., Liu, C., Lu, F., Marchler, G.H., Mullokandov, M., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Yamashita, R.A., Yin, J.J., Zhang, D., Bryant, S.H., 2005. CDD: a Conserved Domain Database for protein classification. Nucleic acids research 33, D192-6.

Männisto, M.K., Salkinoja-Salonen, M.S., Puhakka, J.A., 2001. In situ polychlorophenol bioremediation potential of the indigenous bacterial community of boreal groundwater. Water research 35, 2496-504.

Nunes-Düby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T., Landy, A., 1998. Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic acids research 26, 391-406.

Rajeev, L., Malanowska, K., Gardner, J.F., 2009. Challenging a paradigm: the role of DNA homology in tyrosine recombinase reactions. Microbiology and molecular biology reviews : MMBR 73, 300-9.

Roberts, A.P., Chandler, M., Courvalin, P., Guédon, G., Mullany, P., Pembroke, T., Rood, J.I., Smith, C.J., Summers, A.O., Tsuda, M., Berg, D.E., 2008. Revised nomenclature for transposable genetic elements. Plasmid 60, 167-73.

Uchiumi, T., Ohwada, T., Itakura, M., Nukui, N., Dawadi, P., Kaneko, T., Tabata, S., Yokoyama, T., Tejima, K., Saeki, K., Omori, H., Hayashi, M., Sriprang, R., Murooka, Y., Tajima, S., Simomura, K., Nomura, M., Suzuki, A., Shimoda, Y., Sioya, K., Uchiumi, T., Ohwada, T., Itakura, M., Mitsui, H., Nukui, N., 2004. Expression Islands Clustered on the Symbiosis Island of the Mesorhizobium loti Genome. Society.

Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.

Van Houdt, R., Leplae, R., Mergeay, M., 2012. Towards a more accurate annotation of tyrosine-based site-specific recombinases in bacterial genomes. Mobile DNA 3(6) doi:10.1186/1759-8753-3-6

Chapter 5

The Chlorocatechol Degradative Operon in Burkholderia sp. strain OLGA172 Resides in Chromosomal Area of Genome Plasticity as Revealed through PacBio Single-Molecule Sequencing

Acknowledgements and Contributions: This chapter has been submitted for consideration to

the journal Genomics and represents the compilation of efforts of previous graduate students as

well as my own work. Contributors to this chapter (besides myself and Roberta Fulthorpe) are as

follows: Jackie Goordial (investigation of CC operon), Soulbee Jin (primer walking and

assembly of CC operon region), Heng (Tony) Qian (short read data assemblies), Shu Yi

(Roxana) Shen (comparison of chromosome 1 breakpoint, assistance with figures and

bioinformatics support), Rosemary Saati (qPCR for copy number analysis, PCR confirmation

and small plasmid extraction).

5 Introduction

Next generation sequencing has revolutionized our approach to the study of genomes.

However, due to the short read lengths of the initial next generation sequencers, the complete

assembly of even the smallest of these genomes into a manageable number of contigs from

sequence data alone is problematic, particularly in repetitive regions of the genome (Phillipy et

al. 2008; Ricker et al. 2012; Ghodsi et al. 2013; Barbosa et al. 2014). Many researchers have

called for increased efforts in completing sequencing projects (Parkhill, 2000; Phillipy et al.

2008; Klassen 2012); however, the high costs and time requirements have simply made this goal

unachievable for many labs. As a result, incomplete genome projects and permanent draft

genomes are increasingly dominating databases. The number of genomes deposited in the

Genomes Online Database (version 5.0; https://gold.jgi-psf.org accessed 31 October 2014) has

grown from 3532 genomes in 2012 (Ricker et al., 2012) to 53,514 genomes, 73% (38,962) of

which are bacterial genomes. Of these, only 12% (6648) of the bacterial genomes have been

finished and 44% (23,551) of the remainder are listed as permanent draft (up from 29%

permanent drafts in 2012). This increase in permanent draft genomes results not only from the

practical and financial obstacles faced in finishing a genome, but also occurs due to the limited

scope of the original project. Many sequencing projects are initiated in order to determine

whether a discrete set of known genes are present in an organism, or to investigate differences

between closely related strains. For each of these types of projects, a permanent draft genome is

sufficient since short read sequencing technologies can assemble the individual genes

(Branscomb and Predki, 2002) and reference-based assembly to close relatives can reveal strain

specific gene content. However, the prevalence of permanent draft genomes in the publicly

available databases is problematic for researchers looking to understand overall genome

organization and the impacts of horizontal gene transfer on prokaryotic evolution.

Although smaller than eukaryotic genomes, many prokaryotes are still challenging to

assemble due to the presence of multiple replicons and highly repetitive elements shared within

and between these replicons. Any repeat longer than the sequenced read results in an ambiguity

that cannot be resolved by assembly algorithms due to the simultaneous existence of more than

one true path through the data (Ghodsi et al. 2013). The ribosomal operons, typically ranging

from 5-7 kbp, represent the largest repeat class found in the majority of bacteria (Koren et al.

2013) and therefore read lengths on a kilobase scale are required to produce an ungapped or

‘closed’ assembly. PacBio SMRT sequencing produces read lengths of up to 40 kb as well as a

large volume of shorter reads to be used for error correction, simultaneously providing the

benefits of hybrid short and long read libraries without the additional cost and effort of using

two separate libraries or sequencing technologies. Most importantly, PacBio SMRT sequencing

does not suffer from amplification biases or low coverage in regions containing secondary

structure (due to palindromes or GC content), but rather produces random errors that can be

detected and algorithmically managed (Chin et al. 2013; Koren et al. 2013).
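The relationship between repeat length and read length described above can be illustrated with a minimal sketch. The function name, the one-base flank requirement and the example repeat lengths are illustrative assumptions only, not values or methods taken from the cited studies.

```python
def unresolved_repeats(repeat_lengths_bp, read_length_bp, min_flank_bp=1):
    """Return the repeats that a single read of the given length cannot span.

    A read is assumed to resolve a repeat only if it covers the whole repeat
    plus at least `min_flank_bp` of unique sequence on each side
    (an illustrative simplification).
    """
    return [
        repeat
        for repeat in repeat_lengths_bp
        if repeat + 2 * min_flank_bp > read_length_bp
    ]


# Hypothetical repeat lengths: an insertion sequence and an rRNA operon.
repeats = [1300, 6000]

print(unresolved_repeats(repeats, read_length_bp=150))    # short reads
print(unresolved_repeats(repeats, read_length_bp=10000))  # kilobase-scale reads
```

With 150 bp reads both hypothetical repeats remain ambiguous, whereas a 10 kb read spans each of them.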

Genome organization has been linked with important lifestyle and evolutionary traits

(Slater et al. 2009; Harrison et al. 2010; Heuer and Smalla, 2012), and consequently

understanding the nature and size of individual replicons can be very informative. Fragmented

draft genomes contain anywhere from tens to hundreds of individual contigs, therefore relevant

information on replicon characteristics is lost. Prokaryotic genomes range from 200 to over

9000 genes, and there is a strong link between environment and genome complexity

(Konstantinidis and Tiedje, 2004; Cordero and Hogeweg 2009). Since horizontal gene transfer is

a prominent mechanism of bacterial adaptation, the genomic context of individual genes is

important for discerning the evolutionary origins and transferability of these traits. Of the 4394

completed genomes available in the NCBI genome database (accessed 27 Feb 2015), 33% (1474

genomes) contain more than one replicon. Plasmid localization implies not only potential for

dissemination, but can also give insight into gene expression (Lopez-Leal et al. 2014) and the

rate of evolution since existence on a discretely replicating plasmid is effectively equivalent to a

gene duplication event equal to the copy number of the plasmid (Norman et al. 2009). Similarly,

nucleotide substitution rates have also been observed to be higher on secondary chromosomes

and smaller replicons than on the primary replicon, as have rates of recombination and

rearrangements (Chain et al. 2006). Understanding the genomic context of genes found on

chromosomes is also important since outward-facing promoters that are contained within

adjacent mobile elements may directly regulate adaptive traits. As an example, in Bacteroides

fragilis, antimicrobial resistance genes exhibit increased expression from insertion sequence (IS)

elements found upstream (Sóki 2013). In a recent study using whole genome short read

sequencing of resistant isolates, Sydenham et al. (2015) found that the majority of contigs

bearing resistance genes terminated within 200 bp of the gene in question and therefore

information on the upstream gene content was lost.

In this paper, we have assembled the complete genome of Burkholderia sp. str.

OLGA172 (formerly Burkholderia sp. R172), a 3-chlorobenzoate (CBA) mineralizer isolated

from an uncontaminated Boreal forest soil in northwestern Russia as part of a global survey of

CBA and 2,4-D degrading soil bacteria (Fulthorpe et al. 1998). Chlorocatechol (CC) is a central

intermediate in the CBA degradation pathway, as well as the degradation of several other

chlorinated aromatic chemicals of environmental concern, such as chlorophenols,

chlorobenzenes and chlorobiphenyls (Schlomann, 1994). Understanding the evolution and

distribution of CC degradative pathways therefore has benefits for the remediation of a variety

of anthropogenic compounds. The catabolic genes for CC degradation are usually found in an

operon, and often, but not always, located on catabolic plasmids (Liu et al. 2005; van der Meer

et al. 1992; Fulthorpe et al. 1995; Leveau & van der Meer, 1996) or on mobile elements such as

integrative and conjugative elements (ICEs) (Sentchilo et al. 2009). In Burkholderia sp. str.

OLGA172, the complete modified ortho pathway for catechol degradation has been identified

(Jin, 2010; Accession number: AY168634.1); however, numerous Sanger and short read

sequencing approaches have failed to reveal the genomic context of the region surrounding this

operon. In this paper, we illustrate how the presence of a large repeated element, termed a

Recombinase in Trio (RIT) element, next to the operon interfered with this analysis and how the

use of PacBio single molecule sequencing overcame these issues to produce a non-fragmented

draft assembly.

5.1 Materials and Methods

5.1.1 Short read NGS sequencing

Sequencing was carried out on an Illumina (Solexa) Genome Analyzer (GA) II at the

Centre for the Analysis of Genomic Evolution and Function (CAGEF) at the University of

Toronto, using two flow cells. A Roche 454 Genome Sequencer FLX (GS-FLX) was also used

to sequence the genome at The Genome Quebec Innovation Centre at McGill University, using

one quarter of a PicoTitre Plate, and de novo assembly of the raw reads was carried out to

generate contigs. Illumina and 454 sequencing gave 364,718,452 bp and 75,636,539 bp

respectively. Hybrid assembly of the trimmed reads from both datasets was performed using

Velvet version 1.2.08 (Zerbino and Birney, 2008) and resulted in 1508 contigs with a maximum

contig length of 89,084 bp (mean value 5186 bp, N50 of 12,043). BLAST (Altschul et al.

1990) was employed to determine the identity of the other genes present on the contig

surrounding the CC degradative genes and they were confirmed through synteny analysis of

closely related species and PCR amplification.
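For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is conventionally computed from a list of contig lengths. The contig lengths in the example are invented for illustration and are not the Velvet output from this study.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from the longest contig downwards, first reaches half of the
    total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of five contigs (total 280 bp).
example = [100, 80, 50, 30, 20]
print(n50(example))                 # 80
print(sum(example) / len(example))  # mean contig length: 56.0
```

The example shows why the N50 and the mean can differ substantially: the mean is pulled down by many small contigs, while the N50 reflects where the bulk of the assembled sequence lies.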

5.1.2 PacBio Single Molecule Sequencing

Sequencing was performed by Genome Quebec Innovation Centre using 8 SMRT cells

in a PacBio RSII sequencer. There were a total of 736,020 raw subreads with an average length

of 4,949 bp and a maximum length of 23,822 bp. Contigs were assembled at the Innovation

Centre through the Hierarchical Genome Assembly Process (HGAP) workflow (Chin et al.,

2013) including pre-assembly error correction, Celera assembly, and polishing with Quiver. The

corrected long reads produced after the pre-assembly error correction process were obtained

from the Innovation Centre and utilized to examine coverage and disagreements in the final

assembly using Geneious (version 6.0.3, http://www.geneious.com, Kearse et al., 2012) and in

hybrid assemblies. Alignments were performed with a minimum overlap of 200 bp, minimum


overlap consensus of 98% and maximum of 30% errors throughout the read alignment and 10%

gaps. The permissive error and gap rates were utilized due to the expected high rate of random

errors in the individual PacBio reads.

5.1.3 Assembly of Short Read Technologies and PacBio corrected reads

In order to determine whether the PacBio assembly could be improved through the

addition of short read sequencing data, hybrid assemblies of the PacBio corrected long reads and

a combination of Illumina and 454 short reads (individually and combined) were performed

using Mira (Chevreaux et al. 1999) with default settings. The resulting contigs were compared to the

unitigs of the PacBio-only assembly using the Mauve (Darling et al. 2004) whole genome alignment

option in Geneious.

5.1.4 Gene Annotation and Contig Validation

Annotation was performed automatically using the RAST server (Aziz et al., 2008)

utilizing Glimmer3 with no backfilling. The beginning sequence of each PacBio unitig was

compared to the complete unitig using Blast (Altschul et al. 1990) and regions repeated at both

ends of the unitigs were trimmed from the final replicons. Identification of individual mobile

elements was also performed using Blast, accessed through the ISFinder website

(www-is.biotoul.fr; Siguier et al. 2006). GC skew and coding density were determined and visualized

using DNAPlotter (Carver et al. 2009). Highly similar repeats (>98% nucleotide ID) within

individual replicons were determined using REPuter

(https://bibiserv2.cebitec.uni-bielefeld.de/reputer; Kurtz et al. 2001). Repeated elements longer than 1000 bp shared between

replicons were determined using Blast alignment with match/mismatch scores of 1/-3. Primers were

designed to confirm placement of the RIT element adjacent to the tfd operon on chromosome 1

and a second identical RIT element found on chromosome 2. Primers were also designed to

confirm placement of the 191 kb fragment that was removed from chromosome 2 in the hybrid

assembly. Primer sequences are listed in Appendix 1.
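The trimming of terminal regions repeated at both ends of a unitig can be sketched as follows. This is a simplified illustration that assumes an exact match between the start and the end of the sequence. The study itself used Blast for this comparison, which tolerates mismatches, and the function name and `min_overlap` default are hypothetical.

```python
def trim_terminal_overlap(unitig, min_overlap=4):
    """Remove the trailing copy of a sequence that is repeated at both ends.

    A circular replicon assembled as a linear unitig often ends with the same
    bases it starts with. This sketch looks for the longest suffix that is
    identical to the prefix (exact matching only) and removes it.
    """
    longest_possible = len(unitig) // 2
    for k in range(longest_possible, min_overlap - 1, -1):
        if unitig[:k] == unitig[-k:]:
            return unitig[:-k]
    return unitig  # no terminal overlap of at least min_overlap bases


# Toy unitig: the first 8 bases are repeated at the end.
toy = "ACGTTGCA" + "GGGCCCAT" + "ACGTTGCA"
print(trim_terminal_overlap(toy))  # ACGTTGCAGGGCCCAT
```

For real PacBio unitigs the overlap is typically thousands of bases long and not perfectly identical, so an alignment-based comparison is the appropriate tool; the exact-match version above only conveys the idea.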

5.1.5 Comparisons to Related Finished Genomes

MAUVE alignments (Darling et al. 2004) of individual replicons from Burkholderia

phytofirmans PsJN, Burkholderia xenovorans LB400 and Burkholderia sp. CCGE1001, 1002

and 1003 were performed using Geneious version 6.0.3 (http://www.geneious.com, Kearse et al.

2012), using the Muscle (Edgar, 2004) alignment option and a minimum local co-linear block

(LCB) value of 400. Putative genomic islands on chromosome 1 were determined using

IslandViewer (Dhillon et al. 2013; www.pathogenomics.sfu.ca/islandviewer).

5.1.6 Large Plasmid Extraction

Large plasmid extraction was based on the method of Andrup et al. (2008). The samples were

run on 0.5% Megabase agarose (BioRad) for 21.5 hours at 4-6 °C before staining for 1.5 hours in

ethidium bromide and destaining for 5 days at 4 °C. Plasmid sizes were estimated using the

BAC Tracker Supercoiled DNA ladder (Epicentre) as well as previously sequenced plasmids in

Cupriavidus metallidurans CH34 and B. phytofirmans PsJN.

5.2 Results

5.2.1 Overall Genome Analysis

The PacBio assembly for Burkholderia sp. str. OLGA172 produced 11 unitigs, 4 of which were

discarded as small contigs of vector control sequence and one was found to contain a partial 16S

sequence. There was also one unitig (9570 bp) that appears to be valid sequence (top matches

are to Burkholderia strains) but that did not have any features consistent with being an

independent replicon. This sequence was found to align with one of the larger unitigs, and to

terminate in an IS66 element that is repeated at multiple locations within the genome.

The remaining 5 unitigs were trimmed for overlapping terminal repeats (see methods) and

retained for further analysis, providing an estimated total genome size of 8,574,889 bp. Each of

these unitigs was aligned with the corrected long reads provided by the Genome Quebec

Innovation Centre, and 19% (4697/24528) could not be aligned at 98% nucleotide


identity. A selection of the unmatched reads were analyzed by Blast comparison to the 5 unitigs

and manual inspection indicated that there were small gaps (10-20 bp on average) that prevented

alignment. However, as can be seen in Table 5.2.1, depth of coverage for each unitig ranged

from 6x to 19x from the corrected long reads alone, which provides a reasonable level of

confidence in the assembly. Coverage for total reads obtained through PacBio sequencing

ranged from 49x for the unitig designated plasmid 3 to 255x for the largest unitig. In order to

determine the overall complexity of this genome, highly similar repeated elements greater than 1

kb were quantified for the three largest unitigs (see methods). There were a total of 105

repeated elements on the three largest unitigs (including the 16S operons) and 56 elements were

found to have highly similar copies on both chromosome 1 and chromosome 2. The distribution

of repeated elements classifies Burkholderia sp. str OLGA172 as at least a class II difficulty of

assembly (Koren et al. 2013) due to having a large number of repeated elements in addition to

16S rDNA operons (more than 100 repeats greater than 500 bp) and potentially a class III

difficulty due to the presence of two repeats greater than 7 kb. For this reason it is not

surprising that our previous sequencing efforts using a combination of Illumina and Roche 454

sequencing had been unsuccessful in producing a reasonable assembly of this genome (Jin,

2010). As we had the additional short read data from both the Illumina and Roche 454

sequencing platforms available for this genome we also performed hybrid assemblies using all

three sets of sequencing reads. All assemblies that incorporated the Illumina reads failed to

produce an adequate assembly, presumably due to the low quality of the original Illumina data.

The assembly of the PacBio and 454 data resulted in considerably more contigs than the PacBio-only

assembly, but only 11 of these contigs were larger than 2 kb. The alignment of these 11 large contigs from

the hybrid assemblies agreed very well with the unitigs from the original PacBio assembly

(Table 5.2.1) with the exception of the removal of a 191,800 bp fragment from the PacBio unitig

corresponding with chromosome 2 and the creation of an additional contig of 79,317 bp.

Primers were designed targeting the regions flanking the 191 kb fragment within the PacBio

chromosome 2 unitig, and products of the expected size were obtained confirming the original

PacBio placement of this fragment (data not shown). The 79 kb contig was discarded as it was

highly fragmented, with strings of uncalled bases (Ns).
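The difficulty classification referred to above can be expressed as a short sketch. The thresholds are those paraphrased in this section (more than 100 repeats greater than 500 bp for class II, any repeat greater than 7 kb for class III); this is not a reimplementation of the procedure in Koren et al. (2013), and the function name and example repeat lengths are hypothetical.

```python
def assembly_difficulty_class(repeat_lengths_bp):
    """Assign an assembly difficulty class from a list of repeat lengths,
    using only the thresholds quoted in the text (assumed simplification)."""
    mid_scale_repeats = sum(1 for r in repeat_lengths_bp if r > 500)
    if any(r > 7000 for r in repeat_lengths_bp):
        return "III"  # at least one repeat longer than 7 kb
    if mid_scale_repeats > 100:
        return "II"   # many repeats in addition to the rRNA operons
    return "I"        # few repeats other than the rRNA operons


# Hypothetical repeat sets, not the OLGA172 repeat catalogue.
print(assembly_difficulty_class([6000] * 7))                   # I
print(assembly_difficulty_class([1200] * 105))                 # II
print(assembly_difficulty_class([1200] * 105 + [7500, 9000]))  # III
```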

Table 5.2.1: Statistics of PacBio unitigs assigned as putative replicons.

Alignments were performed using only the corrected long reads from the PacBio SMRT

sequencing aligned on unitigs after trimming of repetitive terminal regions.

Replicon          Number of aligned reads   Pairwise identity (%)   Mean coverage   Standard deviation
Chromosome 1      12,027                    99.3                    19.7            4.2
Chromosome 2      9129                      99.3                    19.7            3.9
Plasmid pOLGA1    458                       99.3                    11.6            2.8
Plasmid pOLGA2    195                       99.3                    10.1            4.2
Plasmid pOLGA3    22                        99.5                    5.9             1.7

Total aligned reads   21,831
Unaligned reads       4,697 (19%)

5.2.2 Biological consistency of the Assembly

In addition to using traditional assembly metrics such as size and coverage, we wanted to

specifically investigate the quality of our assembly in terms of biologically relevant features. To

perform this investigation we compared our PacBio assembly of Burkholderia sp. str. OLGA172

with the available completed genomes of other Burkholderia strains (Table 5.2.2) including

Burkholderia sp. str. CCGE1001 (unpublished), CCGE1002 (Ormeño-Orrillo et al., 2012),

CCGE1003 (unpublished), B. phytofirmans PsJN (Weilharter et al., 2011) and B. xenovorans

LB400 (Chain et al., 2006), all of which bear 16S nucleotide identities of 97% or greater to our

strain. There were 8853 genes annotated in our assembled genome, including 7 complete rRNA

operons and 64 tRNA genes. The rRNA operons were distributed between the two

chromosomes, 3 on the largest chromosome and 4 on the second chromosome, which is not

common but is also documented in Burkholderia sp. CCGE1002. The number of tRNA genes

was consistent with the other genomes, as were the estimated sizes of the chromosomes. The

gene distribution for chromosome 1 is illustrated in Figure 5.2.1.

Figure 5.2.1: Chromosome 1 of Burkholderia sp. str. OLGA172 as determined by PacBio

sequencing.

Chromosome is illustrated after manual trimming of ends (see methods). Rings correspond to

the following (moving from outside towards the middle): Coding sequences (CDS) in the

forward direction; CDS in reverse direction; rRNA operons (black) and tRNA genes (grey); all

CDS annotated as mobile elements (transposons, phage integrases, transposition helper

proteins); GC plot; GC skew. The two innermost rings are black for above-average values and

grey for below average. Image created in DNAPlotter (Carver et al. 2009).
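As an illustration of the GC skew ring, the sketch below computes GC skew in non-overlapping windows using the conventional definition (G - C) / (G + C). The window size and the toy sequence are assumptions made for the example, and DNAPlotter's own windowing and averaging are not reproduced here.

```python
def gc_skew(sequence, window=1000):
    """Return (G - C) / (G + C) for each non-overlapping window.

    Windows with no G or C are reported as 0.0. A trailing partial window
    is ignored.
    """
    seq = sequence.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        values.append((g - c) / (g + c) if (g + c) else 0.0)
    return values


# Toy sequence: a G-rich window followed by a C-rich window.
toy = "GGGGCCAA" + "CCCCGGTT"
print(gc_skew(toy, window=8))  # positive skew, then negative skew
```

In a bacterial chromosome the sign of the skew typically switches near the origin and terminus of replication, which is why the ring is useful when checking that an assembled replicon is biologically plausible.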

Table 5.2.2: Comparison of the assembled genome of Burkholderia sp. str. OLGA172 with

other closely related Burkholderia strains.

Strain names listed are Burkholderia sp. str. unless a species name has been formally accepted

(listed in italics with the strain name).

Strain name            Chromosome 1   Chromosome 2   Plasmids                rRNA   tRNA   Accession numbers
OLGA172                4.65 Mb        3.50 Mb        271 kb, 137 kb, 23 kb   21     64
B. phytofirmans PsJN   4.47 Mb        3.63 Mb        121 kb                  18     63     NC_010676.1, NC_010679.1, NC_010681.1
B. xenovorans LB400    4.90 Mb        3.36 Mb        1.47 Mb*                18     65     NC_007951.1, NC_007952.1, NC_007953.1
CCGE1001               4.06 Mb        2.77 Mb        none                    18     61     NC_015136.1, NC_015137.1
CCGE1002               3.52 Mb        2.59 Mb        1.28 Mb*, 489 kb        21     73     NC_014117.1, NC_014120.1
CCGE1003               4.08 Mb        2.97 Mb        none                    18     63     NC_014539, NC_014540.1

* indicates a third chromosome as opposed to a plasmid, as annotated in the Genbank entry.
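
To make the comparison in Table 5.2.2 easier to follow, the replicon sizes can be summed to give an approximate total genome size for each strain. The following is a minimal sketch that uses only the rounded sizes transcribed from the table (kb values converted to Mb); the variable and function names are illustrative, and the totals are only as precise as the rounded table entries.

```python
# Replicon sizes in Mb, transcribed from Table 5.2.2 (kb values divided by 1000).
replicon_sizes_mb = {
    "OLGA172": [4.65, 3.50, 0.271, 0.137, 0.023],
    "B. phytofirmans PsJN": [4.47, 3.63, 0.121],
    "B. xenovorans LB400": [4.90, 3.36, 1.47],
    "CCGE1001": [4.06, 2.77],
    "CCGE1002": [3.52, 2.59, 1.28, 0.489],
    "CCGE1003": [4.08, 2.97],
}


def genome_summary(sizes):
    """Return (number of replicons, total size in Mb rounded to 3 decimals)."""
    return len(sizes), round(sum(sizes), 3)


summary = {strain: genome_summary(s) for strain, s in replicon_sizes_mb.items()}

for strain, (n_replicons, total_mb) in summary.items():
    print(f"{strain}: {n_replicons} replicons, {total_mb:.2f} Mb in total")
```

As a consistency check, the total obtained for B. xenovorans LB400 agrees with the 9.73-Mbp genome size given in the title of Chain et al. (2006).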

For each of the unitigs, replication and partitioning genes were located and verified to be

consistent with designated replicons (data not shown). As can be seen in Table 5.2.2, the

number and size of additional replicons can be quite variable in Burkholderia strains, and we

therefore experimentally verified our genome expectations by performing a large plasmid

extraction on Burkholderia sp. str. OLGA172 (Figure 5.2.2). This indicated two large plasmids

with sizes consistent with those obtained through the PacBio assembly as well as two smaller

plasmids. The larger of these two smaller plasmids is consistent with the unitig designated

plasmid 3 in the assembly (23 kb). Although none of the close relatives examined in this study

had a similarly sized plasmid, the replication and partitioning genes found on this replicon had

highest homology with those from a 45 kb plasmid found in Burkholderia sp. KJ006 (76%

protein homology with 77% coverage). With the exception of a mobile element found in

multiple locations within the genome (IS66), the majority of the coding sequences identified on

the plasmid corresponded to hypothetical or conserved hypothetical genes. A contig

corresponding to the 8 kb plasmid was not found. There is one unitig of approximately

the right size (9570 bp); however, it carries no plasmid replication genes or other identifying

features to suggest that it corresponds to the small plasmid visible on the gel. As discussed

above, a comparison of this unitig with the complete assembly suggests that it is already

accounted for in the consensus assembly. The nature of the 8 kb plasmid has not been

determined, but it is possible that it corresponds to an excised mobile element that is contained

within the assembly.

Figure 5.2.2: Large plasmid extraction.

The first lane contains the BAC Tracker supercoiled DNA ladder (Epicentre); however, as the

plasmids were outside of the ideal range their size was instead estimated based on comparison to

plasmids from previously sequenced strains. Samples are Cupriavidus metallidurans CH34

(lane 2; plasmid sizes 233 kb and 171 kb), Burkholderia sp. str. OLGA172 (lane 3; PacBio sizes

listed as 271 kb and 137 kb) and B. phytofirmans PsJN (lane 4; 121 kb plasmid). The smaller

plasmid in Burkholderia sp. str. OLGA172 is also visible below the non-chromosomal marker

band of the ladder, and a smaller element is visible at the 8 kb marker.

5.2.3 Capacity of the PacBio Assembly for comparative studies

Although the comparison strains share less than 99% 16S identity with our strain, it was expected that the

chromosome 1 genes would show evolutionary conservation with other sequenced Burkholderia

genomes. We therefore aligned chromosome 1 of our assembly with chromosome 1 from the 5 other

Burkholderia strains using the MAUVE feature in Geneious (see methods). As can be seen in

Figure 5.2.3, gene order and organization from our assembly was comparable with that seen for

the other Burkholderia strains, with large locally collinear blocks (LCBs) shared between all 6

strains. There was greater similarity in the genomic arrangement of our strain with B.

phytofirmans PsJN than with the other strains; however, as expected, there were large gaps both

within and between LCBs corresponding to strain specific genomic islands. For the largest of

these gaps the sequence was inspected to identify whether the break was specific to our

assembly or was also observed in the other strains. In each of these cases, there was a consistent

break point where regions of strain specificity were observed in all 6 genomes, including several

islands with clear insertions following tRNA genes consistent with expectations for genomic

island or prophage insertion sites.

Figure 5.2.3: MAUVE alignment of chromosome 1 from six Burkholderia strains.

Locally collinear blocks (LCBs) are denoted by rectangles, and the level of identity within those blocks

is illustrated by the height of the vertical lines contained within the rectangles. Genomes are

(from top to bottom): OLGA172, B. xenovorans LB400, CCGE1001, CCGE1002, B.

phytofirmans PsJN and CCGE1003. LCBs drawn below the horizontal refer to inverted

segments.

5.2.4 Highlighting a region of Strain Specificity – The Chlorocatechol (CC) Degradative Operon

Burkholderia sp. OLGA172 is a 3-chlorobenzoate (CBA) degrader representative of a

large collection of unstable CBA degraders isolated from pristine environments. Previously in

our lab, primers targeting chlorocatechol dioxygenase (CCD) genes (Leander et al. 1998) were

used to confirm the presence of a modified ortho pathway of chlorocatechol (CC) degradation

genes in Burkholderia sp. str. OLGA172 (Genbank: AY168634). Primer walking techniques

were used to determine a 10 kb genomic region surrounding the CC degradative operon that

included the full operon and a set of three tyrosine site specific recombinases (Jin, 2010). This

set of recombinases is now recognized as a Recombinase in Trio (RIT) element (Van Houdt et

al. 2009). Full genome sequencing using both the Illumina and 454 platforms was utilized to

assemble the complete genome and provide context for the CC degradative operon (Jin, 2010),

and a hybrid assembly of these two sequencing technologies produced a 14 kb contig that

connected the CC degradative operon to a segment of chromosome 1 from a number of strains

from the “plant-beneficial-environmental” (PBE) clade of Burkholderia (Suarez-Moreno et al.

2012).

Chlorocatechol degradation genes that encode an ortho cleavage pathway have been

repeatedly identified in proteobacteria isolated from various contaminated systems. Separate

isolations resulted in the use of different notations for the catabolic genes: the clc genes from

chlorobenzoate degrading Pseudomonas knackmussii B13 (Chatterjee et al. 1981), the tfd genes

from 2,4-D degrading Cupriavidus pinatubonensis JMP134 (Don and Pemberton, 1981; Don et

al. 1985), tcb genes from trichlorobenzene degrader Pseudomonas sp. P51 (van der Meer et al.

1991), and the tft genes from trichlorophenol degrading Burkholderia phenoliruptrix AC1100

(Hubner et al. 1998). In spite of the different notations, the operons share sequence similarities

consistent with divergent evolution from a common ancestor. In most instances these CC

operons are located on IncP1 plasmids, suggestive of a rapid evolutionary response to the

introduction of novel and abundant anthropogenic chloroaromatics to the environment. For

instance, there is a very high degree of sequence similarity between tfd genes located on

different plasmids in strains isolated all over the world. Carriage on plasmids not only

facilitates rapid distribution but also has the added benefit of providing an increased copy

number of the degradative genes, which is important due to the toxicity of the intermediate

degradation products (van der Meer, 2003). There are two known instances where the CC genes

were not found on plasmids. The clc genes originally described in B13 are contained within an

integrative conjugative element that has been shown to be self-transmissible (Sentchilo et al.

2009; Gaillard et al. 2006), and are also found in Burkholderia LB400. The tfd genes of

Burkholderia sp. str. RASC (also known as TFD3, isolated from sewage sludge in Oregon, USA) are reported

to be chromosomal (Suwa et al. 1994, Tonso et al. 1995). However, recently Sakai et al. (2014)

isolated 7 Burkholderia and one Cupriavidus 2,4-D degrading strains from paddy fields with CC

genes highly homologous to those of RASC, and have shown them to be located on a group of

megaplasmids 580-900 kb in size.

In spite of OLGA's isolation on chlorobenzoate as a selective substrate, and its close

phylogenetic relationship to LB400, the CC operon is highly homologous to the tfd rather

than the clc genes (85% nucleotide identity to the tfdC gene of JMP134 and 79% to that of Burkholderia

sp. NK8 (Liu et al. 2001)). There is high homology amongst the CC genes from OLGA172 and

other strains isolated from pristine sites around the world, but there is no evidence of

plasmid locations (Leander et al. 1998). For all these reasons, confirming the genomic

context in this strain would support our hypothesis that these genes are ancestral, and aid

in understanding the evolution of the chlorocatechol degradation operon. The contig

generated through the hybrid Illumina and 454 assembly illustrated that the CBA genes

were likely to be located on chromosome 1 due to the presence of typical chromosome 1

genes adjacent to one end of the operon (Figure 5.2.4). However, at the other end of the

operon the contig terminated within the RIT element, and therefore could not provide

additional information to confirm the genes present beyond that point. Amplification using

Thermal Asymmetric Interlaced PCR (TAIL-PCR) from the RIT element to other possible

contigs that overlap with this region revealed a ‘junkyard’ of complete and partial mobile

elements and hypothetical proteins in the region adjacent to the RIT element (Jin, 2010).

With the PacBio assembly we were able to assemble this highly fragmented region of the

original genome assembly and connect it to chromosome 1 genes conserved with the other

Burkholderia strains. The assembly also revealed the presence of another copy of the RIT

element on the second chromosome that is identical in sequence to the end of the inverted

repeats (3393 bp). It is therefore likely that this large repeated element hindered both PCR

confirmation of the CCD operon location and our previous next-generation sequence

assemblies based on shorter reads.
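
The kind of check that reveals a repeated element on more than one replicon can be sketched as an exact-match search on both strands of each replicon. The following is a minimal illustration under stated assumptions: the function names and the toy sequences are hypothetical, exact string matching stands in for a proper alignment, and this is not the procedure used to analyse the assembly.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_copies(element, replicons):
    """Return (replicon_name, start, strand) for every exact copy of element."""
    element = element.upper()
    rc = reverse_complement(element)
    queries = [("+", element)]
    if rc != element:  # avoid counting a palindromic element twice
        queries.append(("-", rc))
    hits = []
    for name, seq in replicons.items():
        seq = seq.upper()
        for strand, query in queries:
            pos = seq.find(query)
            while pos != -1:
                hits.append((name, pos, strand))
                pos = seq.find(query, pos + 1)
    return hits


# Hypothetical toy data: one copy on each chromosome, none on the plasmid.
toy_element = "ATGCGTAC"
toy_replicons = {
    "chromosome_1": "TTTT" + toy_element + "GGGG",
    "chromosome_2": "CCCC" + reverse_complement(toy_element) + "AAAA",
    "plasmid_1": "ACACACACACAC",
}
hits = find_copies(toy_element, toy_replicons)
```

Any element reported on more than one replicon, and longer than the available read length, is a candidate for the sort of assembly ambiguity described above.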

Amplification and sequencing were performed from tfdC to the ribonuclease G gene in order to

confirm the placement of the CC degradative operon on chromosome 1. Alignment of this

region to other related strains indicated that the CC degradative operon is the starting point for a

strain specific region of genome plasticity (RGP) that extends for 52 kb. There is no tRNA

flanking the region and it does not begin with an integrase or transposase, although there are a

number of mobile element proteins contained within. Comparison of this region to the other

related Burkholderia strains does indicate, however, that this segment of the chromosome is

highly strain specific. In each of these strains there is high homology and gene synteny for 90 kb

leading up to the region where the CC degradative operon is found in Burkholderia sp. str.

OLGA172, and gene synteny ends in all of these strains after the ribonuclease G protein (Figure

5.2.4). The portion of the chromosome leading up to this break in synteny is also conserved in

other, more distantly related, species including Cupriavidus and Ralstonia strains. In

Burkholderia sp. CCGE1001 there is a 62 kb genomic island documented in this site and the

genomic island integrase is the first gene after the ribonuclease. Although none of the other

strains have documented genomic islands in this location, this region is clearly involved in strain

specificity due to the complete disruption of gene synteny. Some of these strains also contain

RIT elements highly homologous (>80% nucleotide identity) to that of OLGA172; however, none of

these RIT elements occur in the same genomic location. The reason for the lack of synteny

following the RNase G gene in each of these strains is not clear at this time.

Figure 5.2.4: Genomic arrangement of chromosome 1 genes from Burkholderia sp. str.

OLGA172 and comparison to homologous regions of related strains.

Note that the grey arrows do not indicate a type of gene but instead indicate that the genes

found in this genomic location are not shared among the different strains included in the

analysis. Note also that the genome sequence of Burkholderia sp. NK8 has not been

completed and therefore only the plasmid was available for comparison.

5.2.5 Limitations of the PacBio Assembly

The RIT element on chromosome 2 is flanked on both sides by at least 15 kb of

complete and partial mobile genetic elements, including several that are also repeated in the

‘junkyard’ region adjacent to the RIT element on chromosome 1. Interestingly, the TAIL-PCR

sequencing results and the PacBio assembly disagreed on the nature of the ‘junkyard’ region

flanking each of the RIT elements. Primers were designed that spanned the entire distance from

tfdC to the opposite end of the RIT element on chromosome 1 based on both the PacBio genome

assembly and TAIL-PCR sequencing results (which PacBio places adjacent to the chromosome

2 RIT element). PCR products of the expected size were produced from both primer sets;

however, sequencing revealed that the PacBio-based product had multiple peaks indicative of a likely

PCR chimera. The original TAIL-PCR primer set produced good quality sequencing results,

suggesting that the PacBio assembly had mis-assembled the two ‘junkyard’ regions. There was

no homologous RIT element found on any of the plasmid sequences.

5.3 Discussion

Historically, bacterial genomes have been defined as one large circular chromosome

with additional information carried on transient plasmids that were not a defining feature of the

species. However, the discovery of secondary chromosomes, or chromids (Harrison et al. 2010),

and other stable replicons that contribute important lifestyle characteristics, has necessitated a

closer inspection of bacterial replicon dynamics. Besides the mobility associated with gene

occurrence on a plasmid, these studies have also revealed differences in gene regulation and in

the rates of recombination, rearrangement or mutation for both plasmid and chromid genes, as

well as a separation of core and secondary functions between different replicons (Chain et al.

2006). Many genome projects and assembly platforms discuss assembly metrics with the goal

of creating one large contig without considering the prevalence of multiple replicons in

environmental isolates. Of the 4386 completed genomes available through the NCBI genome

database (accessed 01 March 2015), 1306 (30%) contain multiple replicons, of which almost

half (644) contain more than 2 replicons. The use of PacBio SMRT sequencing allows for the

primary goal of gene identification while also providing important genome characterization and

a putative location for those genes that can be experimentally validated.
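
The proportions quoted in this paragraph can be reproduced directly from the counts. The following is a minimal sketch that uses only the three counts stated in the text; the variable and function names are illustrative.

```python
# Counts as stated in the text (NCBI genome database, accessed 01 March 2015).
total_genomes = 4386   # completed genomes
multi_replicon = 1306  # genomes with multiple replicons
more_than_two = 644    # genomes with more than 2 replicons


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pct_multi = percent(multi_replicon, total_genomes)
pct_gt2_of_multi = percent(more_than_two, multi_replicon)
pct_gt2_of_all = percent(more_than_two, total_genomes)

print(f"Multiple replicons: {pct_multi:.1f}% of completed genomes")
print(f"More than 2 replicons: {pct_gt2_of_multi:.1f}% of multi-replicon genomes")
print(f"More than 2 replicons: {pct_gt2_of_all:.1f}% of all completed genomes")
```

These values correspond to the "30%" and "almost half" figures given above, and show that genomes with more than 2 replicons make up roughly one in seven of all completed genomes in that snapshot.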

As with many genome sequencing projects, the initial goal of this work was to

investigate the genomic context of the chlorocatechol degradation operon in our strain. However,

the presence of a large repeated element directly adjacent to the operon, and a copy of it present

on the second chromosome, resulted in a fragmented assembly that could not be reconciled

through short read sequencing or via different PCR methods. PacBio SMRT sequencing allowed

us to locate our operon in a region of strain specificity that was particularly difficult to assemble

due to the presence of multiple mobile genes and gene fragments. Although sequencing revealed

that the mobile element ‘junkyard’ surrounding the RIT element had been misassembled,

the overall organization of the PacBio assembly agrees well with our experimental

results. The assembly also has the added benefit of providing a putative assembly of the difficult

regions, from which primers can be designed and tested.

Only one of the closely related Burkholderia strains used in this study (B. xenovorans

LB400) contained CC degradative genes, and these genes occur in a well-documented

integrative conjugative element (ICEclc) located on chromosome 1 (Pradervand et al. 2014).

The region surrounding the CC degradative operon in OLGA172 was designated as a potential

island by the IslandViewer website (Dhillon et al. 2013;

www.pathogenomics.sfu.ca/islandviewer); however, a close inspection of the genes present does

not support mobility of this region. Therefore we have used the term region of genome plasticity

(RGP) to describe the genomic context. There are no conjugation or transfer genes to suggest

that this is an integrative conjugative element (ICE) or prophage, and no flanking repeats were

identified to suggest a transposon or genomic island. The CC degradative genes found in our

strain were not localized to the same region of the genome as those found in LB400, showed no

evidence of being contained in an ICE, and bore limited protein identity with the clc genes from

LB400 (50-65% protein ID). There is only low similarity (tfdC has only 56% protein identity)

to the genes carried on the megaplasmids described by Sakai et al. (2014). However, the region

directly adjacent to the CC degradative operon on these megaplasmids corresponds with a

conserved region referred to as the chromid region due to its homology with genes occurring on

the second chromosome (or ‘chromid’; Harrison et al. 2010) of Burkholderia phytofirmans

PsJN, Burkholderia xenovorans LB400 and Burkholderia sp. CCGE1002. The authors therefore

suggested that the acquisition of the degradative genes may have been the result of insertion of

the ancestor plasmid into a mobile element adjacent to the genes on the chromid (designated

Tn6233) and subsequent acquisition of the genes and one copy of the mobile element on

resolution (Sakai et al. 2014). Not surprisingly, this chromid region is also homologous to genes

found on the second chromosome of Burkholderia sp. OLGA172. There are no tfd genes found

on the second chromosome of our strain; however, the presence of identical RIT elements on

both chromosomes provides an opportunity for these genes to be transferred through

homologous recombination between replicons. In the case of our strain, it is clear that the CCD

operon is located on the primary chromosome. Due to the pristine nature of the soil environment

where this strain was isolated, it is possible that this operon is utilized for a different purpose in

the natural environment of OLGA172. This is further supported by the variable degradation

ability observed in this strain, which is likely the result of transcriptional and biochemical

inefficiencies on this substrate (Goordial, 2010). This is consistent with published findings

indicating that toxic intermediates in the chlorocatechol pathway can accumulate, and that this

toxicity is evident when only one copy of the degradative operon is maintained in the cell

(Perez-Pantoja et al. 2003).

Although not within the scope of this project, the source and role of the third plasmid

(pOLGA_3, 23 kb in length) would also make an interesting study. The plasmid replication

gene for plasmid 3 was only strongly homologous (>75% nucleotide identity) to two other

Burkholderia strains; however, it showed lower homology (~30%) to plasmids from a diversity

of sources. Included among these were two very small plasmids, a 12 kb plasmid found in

Ralstonia solanacearum and a 3.2 kb plasmid isolated from Laribacter hongkongensis. In

addition to these, the replication gene was also 41% homologous to a P2-like phage isolated

from Burkholderia cepacia complex, and this phage was unique as it was the sole representative

from that study for which the prophage is maintained as a plasmid within the cell (Lynch et al.

2010). The majority of the remaining matches corresponded to whole genome shotgun

sequences and therefore the evolution of this particular replicon cannot be further investigated at

this time. It would be interesting to further examine the relationship between Burkholderia

phages and plasmid evolution.

While ease of assembly is often a key factor in sequencing decisions, it has been our

experience that hesitations in adopting the use of PacBio SMRT sequencing have been

attributed to cost per base comparisons to other available technologies. Certainly for projects

sequencing a number of isolates or for routine testing of clinically relevant strains, the cost of

PacBio sequencing is still prohibitive. However, we submit that

obtaining not only the functional gene content but also the number of individual replicons and

the intact assembly of mobile genetic elements contained within the assembly provides a

tangible benefit to current and future comparative studies that justifies the increased investment.

This represents a reasonably priced option whereby the immediate goals of any individual

sequencing project can be achieved without contributing to the increased abundance of

fragmented genomes in the public databases.


5.4 Acknowledgements

The authors gratefully acknowledge Eric Collins at the University of Alaska Fairbanks and

Tony (Heng) Qian for assistance with genome assembly and annotation. NR is grateful to

Ann Provoost and Kristel Mijnendonckx at the Belgian Nuclear Research Centre (SCK·CEN)

for providing C. metallidurans CH34 DNA and for assistance and guidance for the large

plasmid extractions. B. phytofirmans PsJN was kindly provided by Angela Sessitsch of the

Austrian Institute of Technology. Funding in the form of an NSERC Discovery Grant to RF

and an NSERC CGS-D Scholarship and Michael Smith Foreign Study Supplement to NR is

gratefully acknowledged. The funding agency had no role in this study.

5.5 References

Altschul, S.F., W. Gish, W. Miller, E.W. Myers and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Aziz R.K., D. Bartels, A.A. Best, M. DeJongh, T. Disz, R.A. Edwards, K. Formsma, S. Gerdes, E.M. Glass, M. Kubal, F. Meyer, G.J. Olsen, R. Olson, A.L. Osterman, R.A. Overbeek, L.K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G.D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke and O. Zagnitko. 2008. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9:75. doi:10.1186/1471-2164-9-75

Andrup, L., K.K. Barfod, G.B. Jensen and L. Smidt. 2008. Detection of large plasmids from the Bacillus cereus group. Plasmid 59(2):139-143. doi:10.1016/j.plasmid.2007.11.005

Barbosa, E.G.V., F.F. Aburialle, R.T.J. Ramos, A.R. Carneiro, Y.L. Loir, J.B.A. Miyoshi, A. Silva and V. Azevedo. 2014. Value of a newly sequenced bacterial genome. World J Biol Chem 5(2):161-168.

Branscomb, E. and P. Predki. 2002. On the high value of low standards. J. Bact. 184(23):6406-6409.

Carver, T., N. Thomson, A. Bleasby, M. Berriman and J. Parkhill. 2009. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics 25(1):119-20.

Chatterjee D.K., S.T. Kellogg, S. Hamada and A.M. Chakrabarty. 1981. Plasmid specifying total degradation of 3-chlorobenzoate by a modified ortho pathway. J Bacteriol 146(2):639–646.

Chain, P.S., V.J. Denef, K.T. Konstantinidis, L.M. Vergez, L. Agullo, V.L. Reyes, L. Hauser, M. Cordova, L. Gomez, M. Gonzalez, M. Lan, V. Lao, F. Larimer, J.J. LiPuma, E. Mahenthiralingam, S.A. Malfatti, C.J. Marx, J.J. Parnell, A. Ramette, P. Richardson, M. Seeger, D. Smith, T. Spilker, W.J. Sul, T.V. Tsoi, L.E. Ulrich, I.B. Zhulin and J.M. Tiedje. 2006. Burkholderia xenovorans LB400 harbors a multi-replicon, 9.73-Mbp genome shaped for versatility. Proc Natl Acad Sci U.S.A. 103(42):15280-7.

Chevreux, B., T. Wetter and S. Suhai. 1999. Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.

Chin, C-S., D.H. Alexander, P. Marks, A.A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E.E. Eichler, S.W. Turner and J. Korlach. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10:563-569. doi:10.1038/nmeth.2474

Cordero, O.X. and P. Hogeweg. 2009. The impact of long-distance horizontal gene transfer on prokaryotic genome size. Proc Natl Acad Sci U.S.A. 106(51):21748-21753.

Darling, A.C., B. Mau, F.R. Blattner and N.T. Perna. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research 14(7):1394-1403.

Dhillon, B.K., T.A. Chiu, M.R. Laird, M.G.I. Langille and F.S.L. Brinkman. 2013. IslandViewer update: improved genomic island discovery and visualization. Nucleic Acids Res 41(Web server issue):W129-132. PMID: 23677610

Don, R.H. and J.M. Pemberton. 1981. Properties of six pesticide degradation plasmids isolated from Alcaligenes paradoxus and Alcaligenes eutrophus. J Bacteriol 145(2):681-686.

Don R.H., A.J. Weightman, H.J. Knackmuss and K.N. Timmis. 1985. Transposon mutagenesis and cloning analysis of the pathways for degradation of 2,4-dichlorophenoxyacetic acid and 3-chlorobenzoate in Alcaligenes eutrophus JMP134(pJP4). J Bacteriol 161(1):85–90.

Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.

Fulthorpe, R.R., C. McGowan, O.V. Maltseva, W.E. Holben and J.M. Tiedje. 1995. 2,4-Dichlorophenoxyacetic acid-degrading bacteria contain mosaics of catabolic genes. Appl Environ Microbiol 61(9):3274-3281.

Fulthorpe, R.R., A.N. Rhodes and J.M. Tiedje. 1998. High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Appl Environ Microbiol 64(5):1620-1627.

Gaillard M., T. Vallaeys, F.J. Vorhölter, M. Minoia, C. Werlen, V. Sentchilo, A. Pühler and J.R. van der Meer. 2006. The clc element of Pseudomonas sp. strain B13, a genomic island with various catabolic properties. J Bacteriol 188:1999-2013.

Ghodsi, M., C.M. Hill, I. Astrovskaya, H. Lin, D.D. Sommer, S. Koren and M. Pop. 2013. De novo likelihood-based measures for comparing genome assemblies. BMC Research Notes 6:334.

Goordial, J. 2010. Characterization of a Novel Chlorobenzoate Degrading bacterium: Burkholderia phytofirmans OLGA172, Isolated from a Pristine Environment. M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.

Harrison, P.W., R.P. Lower, N.K. Kim and J.P.W. Young. 2010. Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends Microbiol 18(4):141-148.

Heuer, H. and K. Smalla. 2012. Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS Microbiol Rev 36(6):1083-1104.

Hubner A., C.E. Danganan, L. Xun, A.M. Chakrabarty and W. Hendrickson. 1998. Genes for 2,4,5-trichlorophenoxyacetic acid metabolism in Burkholderia cepacia AC1100: characterization of the tftC and tftD genes and locations of the tft operons on multiple replicons. Appl Environ Microbiol 64:2086–2093.

Jin S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.

Kearse, M., R. Moir, A. Wilson, S. Stones-Havas, M. Cheung, S. Sturrock, S. Buxton, A. Cooper, S. Markowitz, C. Duran, T. Thierer, B. Ashton, P. Meintjes and A. Drummond. 2012. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28(12):1647-1649.

Klassen J.L. and C.R. Currie. 2012. Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics 13:14.

Konstantinidis, K.T. and J.M. Tiedje. 2004. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci U.S.A. 101(9):3160-3165.

Koren, S., G.P. Harhay, T.P. Smith, J.L. Bono, D.M. Harhay, S.D. Mcvey, D. Radune, N.H. Bergman and A.M. Phillippy. 2013. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 14(9):R101.

Kurtz, S., J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye and R. Giegerich. 2001. REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res 29(22):4633-4642.

Leander, M., T. Vallaeys and R. Fulthorpe. 1998. Amplification of putative chlorocatechol dioxygenase gene fragments from α-and β-Proteobacteria. Can J Microbiol 44(5):482-486.

Leveau, J.H.J., C. Werlen and J.R. van der Meer. 1996. Molecular mechanisms of genetic adaptation to xenobiotic compounds. International Biodeterioration & Biodegradation 37(3):252.

Liu, S., N. Ogawa and K. Miyashita. 2001. The chlorocatechol degradative genes, tfdT-CDEF, of Burkholderia sp. strain NK8 are involved in chlorobenzoate degradation and induced by chlorobenzoates and chlorocatechols. Gene 268(1):207-214.

Liu, S., N. Ogawa, T. Senda, A. Hasebe and K. Miyashita. 2005. Amino acids in positions 48, 52 and 73 differentiate the substrate specificities of the highly homologous chlorocatechol 1,2-dioxygenases CbnA and TcbC. J. Bact. 187(15):5427-5436.

López-Leal, G., M.L. Tabche, S. Castillo-Ramírez, A. Mendoza-Vargas, M.A. Ramírez-Romero and G. Dávila. 2014. RNA-Seq analysis of the multipartite genome of Rhizobium etli CE3 shows different replicon contributions under heat and saline shock. BMC Genomics 15(1):770.

Lynch, K.H., P. Stothard and J.J. Dennis. 2010. Genomic analysis and relatedness of P2-like phages of the Burkholderia cepacia complex. BMC Genomics 11(1):599.

Norman, A., L.H. Hansen and S.J. Sørensen. 2009. Conjugative plasmids: vessels of the communal gene pool. Philos Trans R Soc London [Biol] 364(1527):2275-2289.

Ormeño-Orrillo E., M.A. Rogel, L.M.O. Chueire, J.M. Tiedje, E. Martínez-Romero and M. Hungria. 2012. Genome Sequences of Burkholderia sp. Strains CCGE1002 and H160, Isolated from Legume Nodules in Mexico and Brazil. J Bacteriol 194(24):6927. doi:10.1128/JB.01756-12

Parkhill J. 2000. In defense of complete genomes. Nat Biotechnol 18:493–494.

Perez-Pantoja, D., T. Ledger, D.H. Pieper and B. Gonzalez. 2003. Efficient turnover of chlorocatechols is essential for growth of Ralstonia eutropha JMP134 (pJP4) in 3-chlorobenzoic acid. J Bacteriol 185(5):1534-1542.

Phillippy A.M., M.C. Schatz and M. Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 9:R55.

Pradervand, N., S. Sulser, F. Delavat, R. Miyazaki, I. Lamas and J.R. van der Meer. 2014. An Operon of Three Transcriptional Regulators Controls Horizontal Gene Transfer of the Integrative and Conjugative Element ICEclc in Pseudomonas knackmussii B13. PLoS Genetics. doi:10.1371/journal.pgen.1004441

Ricker, N., H. Qian and R.R. Fulthorpe. 2012. The limitations of draft assemblies for understanding prokaryotic adaptation and evolution. Genomics 100(3):167-175.

Sakai, Y., N. Ogawa, Y. Shimomura and T. Fujii. 2014. A 2,4-dichlorophenoxyacetic acid degradation plasmid pM7012 discloses distribution of an unclassified megaplasmid group across bacterial species. Microbiology 160(3):525-536.

Schlömann, M. 1994. Evolution of chlorocatechol catabolic pathways. Biodegradation 5(3-4):301-321.

Sentchilo, V., K. Czechowska, N. Pradervand, M. Minoia, R. Miyazaki and J.R. van der Meer. 2009. Intracellular excision and reintegration dynamics of the ICEclc genomic island of Pseudomonas knackmussii sp. strain B13. Mol Microbiol 72(5):1293-1306.

Siguier, P., J. Pérochon, L. Lestrade, J. Mahillon and M. Chandler. 2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 34(suppl 1):D32-D36.

Slater, S.C., B.S. Goldman, B. Goodner, J.C. Setubal, S.K. Farrand, E.W. Nester, T.J. Burr, L. Banta, A.W. Dickerman, I. Paulsen, L. Otten, G. Suen, R. Wench, N.F. Almeida, F. Arnold, O.T. Burton, Z. Du, A. Ewing, E. Godsy, S. Heisel, K.L. Houmiel, J. Jhaveri, J. Lu, N.M. Miller, S. Norton, Q. Chen, W. Phoolcharoen, V. Ohlin, D. Ondrusek, N. Pride, S.L. Sticklin, J. Sun, C. Wheeler, L. Wilson, H. Zhu and D.W. Wood. 2009. Genome Sequences of Three Agrobacterium Biovars Help Elucidate the Evolution of Multichromosome Genomes in Bacteria. J. Bact. 191(8):2501-2511. doi:10.1128/JB.01779-08

Sóki, J. 2013. Extended role for insertion sequence elements in the antibiotic resistance of Bacteroides. World J Clin Infect Dis 3:1-12.

Suárez-Moreno, Z.R., J. Caballero-Mellado, B.G. Coutinho, L. Mendonça-Previato, E.K. James and V. Venturi. 2012. Common features of environmental and potentially beneficial plant-associated Burkholderia. Microb Ecol 63(2):249-266.

Suwa, Y., W.E. Holben and L.J. Forney. 1994. Cloning of a novel 2,4-D catabolic gene isofunctional to tfdA from Pseudomonas sp. strain TFD3, abstr. Q-403, p. 459. In Abstracts of the 94th General Meeting of the American Society for Microbiology 1994. American Society for Microbiology, Washington, D.C.

Sydenham, T.V., J. Sóki, H. Hasman, M. Wang and U.S. Justesen. 2015. Identification of antimicrobial resistance genes in multidrug-resistant clinical Bacteroides fragilis isolates by whole genome shotgun sequencing. Anaerobe 31:59-64.

Tonso, N.L., V.G. Matheson and W.E. Holben. 1995. Polyphasic characterization of a suite of bacterial isolates capable of degrading 2,4-D. Microb. Ecol. 30:1–22.

van der Meer J.R., A.R. van Neerven, E.J. de Vries, W.M. de Vos and A.J. Zehnder. 1991. Cloning and characterization of plasmid-encoded genes for the degradation of 1,2-dichloro-, 1,4-dichloro-, and 1,2,4-trichlorobenzene of Pseudomonas sp. strain P51. J Bacteriol 173(1):6–15.

van der Meer, J.R., W.M. De Vos, S. Harayama and A.J.B. Zehnder. 1992. Molecular Mechanisms of Genetic Adaptation to Xenobiotic Compounds. Microbiol Rev 56(4):677-694.


van der Meer, J. R. 2003. Evolution of metabolic pathways for degradation of environmental pollutants. Encyclopedia of Environmental Microbiology.
Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.
Weilharter, A., B. Mitter, M.V. Shin, P.S. Chain, J. Nowak and A. Sessitsch. 2011. Complete genome sequence of the plant growth-promoting endophyte Burkholderia phytofirmans strain PsJN. Journal of Bacteriology. 193(13):3383-4. doi: 10.1128/JB.05055-11.
Zerbino, D. R., & Birney, E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18(5), 821-829.


Chapter 6 Expression and Activity of RIT Elements

Acknowledgements and Contributions: This work was performed in collaboration with the

researcher who originally described the RIT elements, Rob Van Houdt, at the Belgian Nuclear

Research Center (SCK•CEN), due to his shared interest in elucidating RIT mobility and

expression characteristics. The experiments were carried out over two research terms at the

SCK•CEN for a total of almost 12 months over a two-year period. Experimental design,

training and project supervision at the SCK•CEN were provided by Rob Van Houdt. Wietse

Heylen and Ann Provoost provided technical assistance.

6 Introduction

As discussed in Chapter 4, RIT elements contain three TBSSRs and display a

characteristic gene order and repeat architecture that is conserved across 7 bacterial phyla (Van

Houdt et al. 2009; Van Houdt et al. 2012; Ricker et al. 2013). Since the recombinases of the RIT

elements belong to sub-families of TBSSRs that have been commonly annotated as integrases,

the two terms will be considered equivalent and used interchangeably. As I have shown, RIT

elements can occur as multiple identical copies within individual genomes and are commonly

found on plasmids and in genomic islands, including plant symbiosis and catabolic islands.

These observations support the idea that they are mobile and that their role in genetic

rearrangement/movement is likely to be a universal one. In this chapter, I describe the series of

experiments performed to look for mobility of RIT elements. For these experiments, I obtained

Caulobacter sp. K31 (generously donated by Craig Stephens of Santa Clara University, Santa

Clara, California, USA) since the presence of multiple identical RIT copies in this genome is

strongly indicative of their putative mobility in this strain. Many of the experiments were also

performed in parallel with our strain of interest, Burkholderia sp. OLGA172. The ability of RIT

elements to excise and relocate was tested using a variety of mating experiments ranging from

non-specific intracellular mobility to site-specific targeting during conjugation. The general

strategy used was to separate the three recombinase genes from their flanking inverted repeats to

induce the recombinases (on one vector) to move a selectable marker (kanamycin resistance)

which has been inserted between the inverted repeats (RIT::Km cassette) on a separate vector.


In the initial experiments, a conjugative plasmid (pOX38) was also added to the cells and

conjugation occurred after induction. If the RIT::Km cassette were mobilized to the conjugative

plasmid, it would escape the original cell and be detected in the recipient (as evidenced by

kanamycin resistance gained by the recipient). In the second set of experiments, the RIT::Km cassette

was carried by a suicide construct – a vector capable of its own conjugative transfer but that

cannot be maintained in the recipient cell. The expression vector was contained within the

recipient cell, along with a target site plasmid and induction occurred during conjugation so that

the recombinases would be active when the suicide construct entered the cell.


6.1.1 Growth of Bacterial Strains

All E. coli cultures were grown in LB media supplemented with antibiotics when appropriate

(kanamycin 50 µg/mL, ampicillin 100 µg/mL, tetracycline 20 µg/mL, streptomycin 50 µg/mL,

chloramphenicol 30 µg/mL). M9 media with and without the addition of 1 mM leucine was

utilized for differentiating auxotrophs, and LB with 0.3 mM diaminopimelic acid (DAP) was

utilized for growing the MFDpir strain (provided by Jean Marc Ghigo from the Institut Pasteur,

Paris, France). Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 (provided by Craig

Stephens from Santa Clara University, Santa Clara, California, USA) were grown in

Pseudomonas F media and Peptone Yeast Extract (PYE), respectively.

Table 6.1.1: List of strains used in this study.

Donor and Recipient Strains   Genotype
E. coli DG1                   mcrA Δ(mrr-hsdRMS-mcrBC, modification-, restriction-) φ80lacZΔM15 ΔlacX74 recA1 araD139 Δ(ara-leu)7697 galU galK rpsL endA1 nupG
E. coli DH5α                  dlacZΔM15 Δ(lacZYA-argF)U169 recA1 endA1 hsdR17(rK-mK+) supE44 thi-1 gyrA96 relA1
E. coli S17-1 λpir            TpR SmR recA thi pro hsdR-M+ RP4-2-Tc::Mu-Km::Tn7 λpir
E. coli MFDpir                MG1655 RP4-2-Tc::[ΔMu1::aac(3)IV-ΔaphA-Δnic35-ΔMu2::zeo] ΔdapA::(erm-pir) ΔrecA
E. coli HB101                 F- mcrB mrr hsdS20(rB- mB-) recA13 leuB6 ara-14 proA2 lacY1 galK2 xyl-5 mtl-1 rpsL20(SmR) glnV44 λ-


Table 6.1.2: List of constructs created during this study.

Construct            Resistances   Description
pKK223-K31IntExp     Amp           Open reading frames for K31 RIT element TBSSRs inserted downstream of tac promoter
pKK223-OlgaIntExp    Amp           Open reading frames for Olga RIT element TBSSRs inserted downstream of tac promoter
pACYC184-RIT::Km     Tc, Km        Kanamycin cassette inserted between the inverted repeats of the RIT element from K31; courtesy of Wietse Heylen
pTrc99-K31RITA-C     Amp           K31 RIT element recombinases inserted in pTrc99 backbone
pACYC-TSV1           Tc            Target site 1 from K31 DUF1738 gene inserted in pACYC184 backbone
pACYC-TSV2           Tc            Target site 2 from DUF1738 gene inserted in pACYC184 backbone
pSF100-RIT::Km       Km            Suicide construct containing Km cassette flanked by RIT element repeats

6.1.2 Construct creation

Expression constructs were created by inserting only the open reading frames for each

recombinase (individually or combined as a single transcript) using primers designed to be

compatible with the cloning sites of both the pKK223.1 and pTrc99A vector backbones. Donor

plasmids were created by first inserting a complete RIT element into the pACYC184 vector

backbone and then amplifying a new backbone that contained the flanking sequences (including

the inverted repeats) but without the open reading frames for the integrase genes. This new

backbone was then ligated with a kanamycin gene cassette to create the pACYC184-RIT::Km

donor plasmid. Target sequence oligos were designed with ends compatible with the

cloning site in pACYC184 and directly ligated to create the target site 1 and target site 2

plasmids (pACYC-TSV1 and pACYC-TSV2, respectively). Plasmids were cloned into chemically

competent DG1 cells, selected on appropriate antibiotics, and confirmed by restriction digest

analysis and sequencing. The suicide vector was created by amplifying the kanamycin cassette

flanked by the RIT element sequence from pACYC184-RIT::Km and then ligating the sequence

into a pSF100 suicide vector backbone. This backbone contains the R6K origin of replication

and therefore was maintained in S17-1 λpir host prior to conjugations.


Figure 6.1.1: Constructs used in the final conjugation experiment.

Diagrams were created using pDRAW software. The target site in pACYC184-target1 is

indicated in orange. The kanamycin cassette in pSF100-Km is flanked by approximately

300 bp of the RIT element sequence that flanks the recombinases, including the inverted repeats.

Since the recipient cells required the addition of two separate plasmids (expression vector

and target site vector), E. coli DH5α cells containing the expression plasmid were made

competent by washing as follows: overnight cultures of bacteria were diluted by 1/100 in fresh

media with antibiotics and grown for 2-3 hours to an OD600 of approximately 0.4. All tubes,

solutions and cultures were chilled on ice for 30 minutes, with occasional swirling. Cells were

pelleted at 5000 rpm for 5 minutes in a centrifuge at 4 °C. Supernatant was removed and ice

cold sterile Milli-Q water was used to re-suspend the cells. Pelleting and re-suspension of cells

was repeated using sequentially smaller volumes of water until a final volume of 100 µL

remained per tube. Electroporation of the target plasmid was performed in 1 mm cuvettes using

50 µL of competent cells (1.8 kV).


6.1.3 Mating-out Assays

Expression plasmid (pKK223 backbone) and pACYC184-RIT::Km donor plasmid were

transformed into the same DH5α cell. A conjugative plasmid, pOX38, was also introduced

through conjugation. Maintenance of all three plasmids was confirmed by

plasmid-specific PCR and visualization of the individual plasmids after DNA extraction

(Promega Wizard Miniprep kit, according to the manufacturer’s instructions). After

confirmation that all three plasmids were present, cells were conjugated with E. coli HB101. As

HB101 is streptomycin resistant, transconjugants resistant to both streptomycin and kanamycin

would be evidence of pOX38-mediated transfer of the RIT::Km cassette into the recipient strain.

6.1.4 Conjugation Experiments

Cultures were inoculated from single colonies into LB with appropriate antibiotics (3 mL) and

grown at 37 °C for 6 hours. Cells were then washed with saline to remove antibiotics. Donor and

recipient cells were re-suspended in appropriate volumes to create approximately the same cell

density for each. Matings were performed by mixing cultures on filters on plain LB media (for

uninduced) or LB + 0.2 mM IPTG (induced) and incubated overnight at 37 °C. The following

day the filters were removed and placed in microcentrifuge tubes containing 1 mL saline and

vortexed. Undiluted and 1/10 diluted cultures were then plated on selective media to search for

transconjugants. Both pre-mating and post-mating cultures were also serially diluted to

determine total counts of donor and recipient cells. Putative transconjugants were inoculated

with toothpicks into PCR grade water and streaked onto selective media for confirmation, as

well as utilized for colony PCR. Based on colony PCR results, individual colonies from the

selective media were grown for plasmid extractions and further confirmed through restriction

digests and sequencing. Conjugation experiments were also varied to include stationary phase

cultures or greater density log phase cultures, as well as including pre-induced cells (induction

for 2 hours prior to mating) and using pACYC-TSV2 in the recipient cells. Finally, due to the

high prevalence of false positives in the initial suicide construct mating experiments,

conjugations were also performed using MFDpir with pSF100-RIT::Km as the donor cell with

the same recipients as earlier experiments (separate matings with target 1 and target 2 recipients)

and these matings were performed on LB + 0.3 mM diaminopimelic acid (DAP) to support

growth of the MFDpir strain.
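
The plate counts described above are commonly converted into a transfer frequency. The following is a minimal sketch of that arithmetic, not the calculation used in this work: it assumes a plated volume of 0.1 mL and expresses the frequency as transconjugants per recipient. The function names and the example colony counts are hypothetical.

```python
def cfu_per_ml(colonies: int, dilution_factor: float, plated_volume_ml: float = 0.1) -> float:
    """Convert a colony count on one plate to CFU per mL of the undiluted suspension.

    dilution_factor is the fold dilution that was plated (1 for undiluted,
    10 for a 1/10 dilution). plated_volume_ml is an assumed plating volume;
    it is not stated in the text.
    """
    if dilution_factor <= 0 or plated_volume_ml <= 0:
        raise ValueError("dilution factor and plated volume must be positive")
    return colonies * dilution_factor / plated_volume_ml


def transfer_frequency(transconjugants_per_ml: float, recipients_per_ml: float) -> float:
    """Transconjugants per recipient cell."""
    if recipients_per_ml <= 0:
        raise ValueError("recipient count must be positive")
    return transconjugants_per_ml / recipients_per_ml


# Hypothetical colony counts, for illustration only.
transconjugants = cfu_per_ml(colonies=12, dilution_factor=1)    # undiluted plate
recipients = cfu_per_ml(colonies=150, dilution_factor=1e6)      # 10^-6 dilution plate
frequency = transfer_frequency(transconjugants, recipients)
print(f"{frequency:.1e} transconjugants per recipient")
```

Normalizing to the recipient count is one common convention; normalizing to the donor count would only change the denominator.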


6.1.5 Expression Experiments

Expression experiments with and without induction were performed on E. coli cells containing

pKK223-OlgaA-C and pKK223-K31A-C. RNA extractions were performed using Trizol and

RNA was treated with deoxyribonuclease I (Invitrogen) according to the manufacturer’s

instructions. PCR amplification was performed on DNase-treated samples to ensure no DNA

contamination in RNA samples. DNase-treated RNA was used as template for first-strand cDNA

synthesis. RNA, 50 ng µL⁻¹ random hexamers, 10 mM dNTP mix and diethylpyrocarbonate

(DEPC)-treated water were incubated at 65 ºC for 5 minutes, and 4 ºC for 1 minute. Then, 40 units of

RNase inhibitor (RNAseOUT, Invitrogen) and 200 units Superscript III reverse transcriptase

(Invitrogen) were used for each sample, incubated at 25ºC for 5 minutes and then heated to 50ºC

for 1.5 hours. The reaction was stopped by heating to 70ºC for 15 minutes. Quantitative PCR

(qPCR) was performed using SYBR Green I technology on an ABI 7300 Sequence Detection

System (Applied Biosystems). A master mix for each PCR run was prepared with SYBR Green

PCR Master Mix (KAPA Biosystems) and 0.5 μM primers. The following amplification

program was used: 95°C for 2 min, followed by 40 cycles of 95°C for 15 s and 58°C for 35 s. A

dissociation step was added (95°C for 15s, 60°C for 30s, 95°C for 15s) to produce melting

curves of products that could be analyzed for primer dimers and PCR artifacts. Representative

samples were run on a 1% agarose gel to confirm that products were the expected size.

Dilutions of genomic DNA from Burkholderia sp. str. OLGA172 ranging from 10¹ to 10⁻⁴ ng µL⁻¹

total DNA were included to create a standard curve for each primer set and dilutions of cDNA

ranging from 50 to 1 ng were used to analyze primer efficiency on transcripts. PCR efficiencies

for all primers used were in the range of 90-110%. A standardized threshold setting of 0.8 units

above the background level was utilized for every experiment for consistency. Each sample was

normalized against 16S using the comparative ΔΔCt method (relative expression = 2^(-ΔΔCt)).

Results were considered significant if there was a minimum of a 2-fold difference in expression

between treated and control samples (t-test, α = 0.05).
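
The two calculations in this section, primer efficiency from a standard curve and relative expression normalized to 16S, can be sketched as follows. This is a minimal sketch under the standard comparative Ct convention (relative expression equals 2 raised to the negative ΔΔCt, which assumes roughly 100% efficiency). The function names and Ct values are hypothetical and are not data from this study.

```python
def amplification_efficiency(slope: float) -> float:
    """Percent PCR efficiency from the slope of a standard curve.

    The slope is that of Ct plotted against log10(template amount);
    a slope of about -3.32 corresponds to 100% efficiency (perfect doubling).
    """
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def relative_expression(ct_target_treated: float, ct_ref_treated: float,
                        ct_target_control: float, ct_ref_control: float) -> float:
    """Fold change of the target gene (treated vs. control), normalized to a reference gene."""
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values, with 16S as the reference gene.
fold_change = relative_expression(ct_target_treated=22.0, ct_ref_treated=12.0,
                                  ct_target_control=24.0, ct_ref_control=12.0)
print(f"efficiency at slope -3.32: {amplification_efficiency(-3.32):.1f}%")
print(f"fold change: {fold_change:.1f}")
```

In this example the target crosses the threshold two cycles earlier in the treated sample at the same 16S Ct, which corresponds to a 4-fold increase.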

6.2 Results

Most of this work was performed over two separate research terms at the Belgian Nuclear

Research Centre (SCK•CEN) and the results will therefore be discussed chronologically. In the

initial research term, the goal of the experiments was to determine whether the RIT element


could excise and insert into a conjugative plasmid without the addition of a specified target site.

For this reason, a mating-out assay was designed. This was necessary since the project was

limited in duration (3 months) and a target site was not readily apparent from the examinations

performed at that stage. It was also anticipated that strong induction could potentially overcome

the need for a specific target site. For ascertaining mobility potential of RIT elements, I created

an expression plasmid containing only the open reading frames for the three recombinases that

make up the RIT element, under the control of an IPTG-inducible promoter. Initial expression

constructs were created using the recombinase open reading frames from each of Burkholderia

sp. OLGA172 and Caulobacter sp. K31 in the pKK223.1 vector backbone (to create pKK223-

K31A-C and pKK223-OlgaA-C). The complete RIT element in which the recombinase open

reading frames were replaced with a single kanamycin resistance gene (RIT::Km cassette) was

inserted into a pACYC184 vector backbone to create the donor vector. A conjugative plasmid

(pOX38) was also present in the donor strain to act as a recipient of the mobilized kanamycin

gene and to facilitate transfer to the recipient cells. Donor cells containing all three plasmids in

an E. coli DH5α (nalidixic acid resistant) background were mated with E. coli HB101 cells

(streptomycin resistant). Provided that the kanamycin resistance gene had been transferred to the

pOX38 conjugative plasmid, transconjugants would be selected on LB-Km-Sm. All constructs

were confirmed initially by restriction digest analysis and then sent for sequencing to confirm

key regions of the sequence. For the pKK223 expression plasmids, the matings were performed

prior to receiving sequence confirmation due to time restrictions.

6.2.1 No evidence of Intra-cellular mobility without a target site

Although there was a surprisingly high level of spontaneous double resistant mutants in the

pKK223-K31A-C mating, there were no confirmed instances of a kanamycin gene being

transferred to the recipient. There were no spontaneous mutants observed for the pKK223-

OlgaA-C mating. The first round of sequencing suggested a possible point mutation in the first

recombinase for each of the expression constructs (both Olga and K31), which may have

rendered the proteins non-functional. These mutations were originally disregarded due to their

proximity to the sequencing primers, but were confirmed with subsequent re-sequencing. Also,

expression experiments revealed that recombinase expression was observed both with and

without induction. For this reason, it was determined that the pKK223 constructs did not

provide sufficient control over the recombinase genes.


There were significant differences in the expression of individual integrase genes in the

expression constructs derived from K31 and Olga (Figure 6.2.1). Although recombinase

expression was only slightly increased upon induction, it was clear that the first recombinase of

each RIT element had the highest expression, presumably due to the presence of the pKK223

promoter upstream. The expression of the second recombinase was significantly decreased

relative to the first gene (p<0.05 for K31 and p<0.005 for Olga). However, expression of the

third recombinase decreased in the K31 construct but increased in the Olga construct (p<0.05

for LB and p<0.005 for IPTG) compared to expression of the first recombinase. PCR products

were obtained from the cDNA using primers designed to amplify from the first to the second

recombinase, indicating that these were transcriptionally linked; however, no product was

produced from primers linking the second and third recombinase genes. As these constructs

were not going to be utilized for further studies, the reasons for increased int3 expression in the

Olga vector were not further investigated.

Figure 6.2.1: Expression of recombinase genes from pKK223-OlgaA-C and pKK223-

K31A-C expression vectors.

Values are relative abundance of integrase expression to 16S expression. Expression did not

differ significantly between induced and un-induced cultures. Significant differences between

individual recombinase genes (both within and between species) are discussed in the text.


Table 6.2.1: Decrease in optical density of cell cultures after induction with IPTG.

Construct                 Uninduced (OD600)   1 mM IPTG (OD600)
pTrc99A (empty vector)    0.901 (0.052)       0.842 (0.007)
pTrc99K31A-C              0.736 (0.057)       0.213 (0.001)
pTrc99K31-RITA            0.695 (0.044)       0.342 (0.006)
pTrc99K31-RITB            0.787 (0.027)       0.664 (0.016)
pTrc99K31-RITC            0.666 (0.014)       0.663 (0.017)

Upon returning for a second research term, it was decided that new constructs would be

created in the pTrc99A vector backbone such that the recombinase genes would be under the

control of the more stringent lacIq regulator. Initial experiments again took place without target

site sequences in the form of a conjugative mating-out assay with the original donor plasmid.

Although induction of the recombinase genes had a clear impact on cell density (see Table

6.2.1), there were no transfers of kanamycin resistance to recipient cells, and no evidence for

recombination or rearrangement between the plasmids found in the donor cells. PCR

amplification from primers designed to amplify outwards from the kanamycin cassette

suggested that the RIT element was being excised (Figure 6.2.2); however, attempts to confirm

the presence of a restored backbone lacking the kanamycin cassette were unsuccessful both by

PCR and based on plasmid isolations. Therefore, if the RIT element is excised in the absence of

a specific target site, it occurs at levels below the detection limit for these methods.

Figure 6.2.2: PCR amplification using primers designed to amplify out from the

kanamycin gene.

Lanes 1 and 7 are GeneRuler 1 kb plus ladder. The 700 bp product seen in lane 6 is consistent

with that expected if the kanamycin cassette were being excised and the large bright band

evident in lanes 2, 4 and 5 is consistent with the complete plasmid backbone with kanamycin still

present. Note that the smaller product is also evident in lane 2 which contains only the

p184::Km donor plasmid without the expression plasmid. Lane 3 contains only the expression

plasmid and lanes 4 and 5 have both a donor and an expression plasmid present.
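
The impact of induction on cell density reported in Table 6.2.1 can also be expressed as a percentage decrease in OD600 for each construct. The sketch below uses only the mean OD600 values from that table; the bracketed values are ignored, and the helper name `percent_decrease` is illustrative.

```python
# Mean OD600 readings transcribed from Table 6.2.1: (uninduced, 1 mM IPTG).
od600 = {
    "pTrc99A (empty vector)": (0.901, 0.842),
    "pTrc99K31A-C": (0.736, 0.213),
    "pTrc99K31-RITA": (0.695, 0.342),
    "pTrc99K31-RITB": (0.787, 0.664),
    "pTrc99K31-RITC": (0.666, 0.663),
}


def percent_decrease(uninduced: float, induced: float) -> float:
    """Percentage drop in OD600 of the induced culture relative to the uninduced culture."""
    return (uninduced - induced) / uninduced * 100.0


for construct, (uninduced, induced) in od600.items():
    print(f"{construct}: {percent_decrease(uninduced, induced):.1f}% lower OD600 after induction")
```

Expressed this way, the construct carrying all three recombinases shows the largest drop (roughly 71%), while the RITC construct is essentially unchanged.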

6.2.2 Target site identification

The determination of a potential target site sequence was initially elusive. As identified in

chapter 4, RIT elements found in multiple copies within a strain are commonly identical up to the

ends of the 30-38 bp inverted repeats presumed to designate the ends of the element. Despite

the fact that all 3 RIT elements found in Caulobacter sp. K31 were 100% identical and had

targeted the same gene (DUF1738) for insertion, in silico removal of the RIT sequences to the

end of the inverted repeats did not result in the reconstruction of the original genes. Further

investigation revealed that there was an additional sequence, a perfect 20 bp palindrome, that

was adjacent to one of the terminal inverted repeats. Whether this palindrome occurred

upstream or downstream of the RIT element (relative to recombinase transcription) was not

consistent. It was determined that the location of the palindrome correlated with the direction of

transcription of the target gene (in this case the DUF1738 gene) as opposed to the RIT element

recombinases (Figure 6.2.3). A BLAST search of the 20 bp palindrome revealed that the sequence

did not exist in other DUF1738 genes lacking a RIT element, but instead revealed additional

RIT elements that had not been previously identified. Therefore, it was determined that this

palindrome sequence must be a component of the RIT element. Further, I hypothesized that an

inversion of the RIT element relative to the palindrome must occur either during or after

integration in the target site.
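
A sequence is a perfect palindrome in the DNA sense when it equals its own reverse complement. As a minimal sketch of that check, the following applies it to the 20 bp sequence listed for Caulobacter sp. K31 in Table 6.2.2; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_perfect_palindrome(seq: str) -> bool:
    """True when the sequence reads the same on both strands (5' to 3')."""
    return seq == reverse_complement(seq)


k31_palindrome = "ttatgccgatatcggcataa"  # Caulobacter sp. K31, Table 6.2.2
print(len(k31_palindrome), is_perfect_palindrome(k31_palindrome))  # 20 True
```

Because such a sequence is identical on both strands, a search for it does not need to consider the two strands separately.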


Figure 6.2.3: Orientation of RIT elements in Caulobacter sp. K31 relative to the direction

of the target gene DUF1738.

Genomic locations are given to the left of each diagram. ‘IR’ designates the inverted repeats

that occur at each end of the recombinases.

Removal of the complete RIT element including the palindrome sequence allowed for the

perfect reconstruction and alignment of the DUF1738 target genes of K31 and revealed the

original target site sequence. Alignment of the same region of DUF1738 from other strains in

which there was no evidence of RIT element insertion was used to determine a second potential

target site. The latter differed from the K31-derived site within a 4 bp stretch (gtcg vs. gggc). With this

information, two target site vectors were created in pACYC184 to act as recipients for the

mobilized kanamycin gene, termed pACYC-TSV1 and pACYC-TSV2. Each contained a 45 bp

target sequence in a pACYC184 vector backbone containing only tetracycline resistance for

selection. Each target site plasmid was electroporated separately into a strain containing the

pTrc99-K31A-C expression plasmid to create recipient strains with a putative target site and

inducible recombinase genes. The suicide construct (pSF100-RIT::Km) was introduced by

conjugation and the final experimental design is illustrated in Figure 6.2.4. Transconjugants

capable of growth in both kanamycin and tetracycline were tested for kanamycin insertion in the

target site via PCR.
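
The in silico removal described above can be illustrated with a toy sequence. This is a sketch under stated assumptions, not the analysis performed on the K31 genome. The palindrome and inverted repeat are taken from Table 6.2.2, but the flanking target sequence and the element payload are hypothetical, the right-hand repeat is assumed to be an exact reverse complement of the left, and the element layout (palindrome, repeat, payload, repeat) is simplified.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def excise(sequence: str, first_motif: str, last_motif: str) -> str:
    """Delete everything from the start of first_motif to the end of last_motif."""
    start = sequence.find(first_motif)
    end = sequence.rfind(last_motif)
    if start == -1 or end == -1 or end < start:
        raise ValueError("motifs not found in the expected order")
    return sequence[:start] + sequence[end + len(last_motif):]


palindrome = "ttatgccgatatcggcataa"        # Table 6.2.2, Caulobacter sp. K31
ir_left = "cataatgccgcgatccggattatgccg"    # Table 6.2.2, Caulobacter sp. K31
ir_right = reverse_complement(ir_left)     # assumption: perfect inverted repeat

# Hypothetical sequences, for illustration only.
target_left, target_right = "acgtacgagc", "tgcaagctca"
payload = "gattaca" * 4                    # stands in for the recombinase genes

intact_target = target_left + target_right
occupied_target = target_left + palindrome + ir_left + payload + ir_right + target_right

ir_only = excise(occupied_target, ir_left, ir_right)
with_palindrome = excise(occupied_target, palindrome, ir_right)

print(ir_only == intact_target)          # False: the palindrome is left behind
print(with_palindrome == intact_target)  # True: the target is reconstructed
```

Removing only the region bounded by the inverted repeats leaves the palindrome in the target, which mirrors why the original gene could not be reconstructed until the palindrome was treated as part of the element.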


Maintenance of the suicide construct within the recipient cell, and thus the detection of a

high number of false positive (TcR and KmR) cells, proved to be a feature of this experimental

design. Because this might have resulted from an active Mu phage carried by the S17-1 donor strain

facilitating recombination between the replicons (Ferrières et al. 2010), conjugation experiments

were also performed with a Mu-free donor strain (MFDpir); however, high false positive rates were

still observed with this strain. Nevertheless, movement of the kanamycin cassette

specifically into the target vector was confirmed by sequencing of positive clone products –

TSV1A resulting from the original S17-1 mating with the target site 1 recipient and TSV2A

resulting from the MFDpir mating with the target site 2 recipient.

Figure 6.2.4: Final experimental design.

The recombinase enzymes are represented by blue circles labeled A, B and C although there is

no evidence yet to suggest that all three are required or where they may bind. The sequences for

the palindrome are written above and the putative binding sites are in bold font. Induction of the

expression plasmid would result in production of the three recombinase proteins which would

then be free to act on the inverted repeats flanking the kanamycin gene and mobilize it into the

target site plasmid.


6.2.3 Sequencing analysis of transconjugants

Analysis of the sequence surrounding the kanamycin cassette after insertion in the target

sites shed some light on the mechanism of insertion. For both the TSV1A and TSV2A

recombinants, the kanamycin gene has been inserted in the opposite orientation relative to the

palindrome when compared to the original suicide vector. Using two different target sites

illustrated that the target site sequence itself was unchanged when the element inserted – with

only the first target site this could not be determined since the sequence flanking the kanamycin

cassette matched TSV1 (see Figure 6.2.5). From these recombinants, it is clear that target site

sequences are unchanged with RIT insertion, and that the 4 bp sequences on each end of the

palindrome have been altered. Therefore it would appear that the strand exchange occurs at

both ends of the palindrome, as opposed to in the centre of the palindrome as would be expected

based on other known cross-over regions (Hallet et al. 2004).

Figure 6.2.5: Reversal of RIT element in positive transconjugants.

Labels are included on the left. The first 4 bases correspond to the portion of the target site that

differs between target site 1 and 2 (gtcg/gggc respectively), and the last 4 bases (cact)

correspond to the continuation of the target site. ‘IR’ represents the inverted repeats that flank

the kanamycin cassette. The palindrome and inverted repeat sequences are unchanged after

recombination.


As illustrated in Figure 6.2.6, the TSV1A recombinant has a plasmid that is larger than the

donor plasmid (pSF100-RIT::Km), which is 4.4 kb. The original target site plasmid was 2.3 kb

and therefore the addition of the kanamycin cassette should have resulted in a plasmid of 3.2 kb.

Sequencing was inconclusive; however, restriction digestion suggests that TSV1A has two

copies of the original pACYC184 backbone connected by the kanamycin cassette. This can be

seen in Figure 6.2.6, as the HindIII digestion contains all the original bands from the target site

plasmid and two additional bands consistent with the kanamycin cassette inserted into the target

site plasmid. It is possible that both the original target plasmid and the recombinant were present

in the same cell; however, no original target plasmid is visible on the mini-prep gel, and the

digest bands are of equal intensity, which would suggest equal amounts of both plasmids. Therefore, if

both versions had been present in the cell, they should have been visible prior to digestion.
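
The size argument above can be made explicit with simple arithmetic. The sketch below uses only the sizes quoted in the text (2.3 kb target plasmid, 3.2 kb expected recombinant, 4.4 kb donor). The size of a structure with two backbone copies joined by one cassette is an inference from those numbers, not a measured value.

```python
target_plasmid_kb = 2.3        # original target site plasmid (from the text)
expected_recombinant_kb = 3.2  # target plasmid plus kanamycin cassette (from the text)
donor_plasmid_kb = 4.4         # pSF100-RIT::Km (from the text)

# Cassette size implied by the two figures above.
cassette_kb = expected_recombinant_kb - target_plasmid_kb


def recombinant_size_kb(backbone_copies: int, cassette_copies: int = 1) -> float:
    """Predicted size of a recombinant with the given number of backbone and cassette copies."""
    return backbone_copies * target_plasmid_kb + cassette_copies * cassette_kb


single_backbone = recombinant_size_kb(1)
double_backbone = recombinant_size_kb(2)

print(f"implied cassette size: {cassette_kb:.1f} kb")
print(f"one backbone + cassette: {single_backbone:.1f} kb")
print(f"two backbones + cassette: {double_backbone:.1f} kb "
      f"(donor plasmid: {donor_plasmid_kb:.1f} kb)")
```

Under these assumptions only the two-backbone structure is larger than the donor plasmid, which is consistent with the interpretation of the HindIII digest given above.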

Figure 6.2.6: Mating results for the recipient strain containing pTrc99-K31A-C and

pACYC-TSV1.

The miniprep results show the number of plasmids in each strain (A) and the HindIII digest

illustrates that the recombinant (TSV1A, labeled +I in the figure) has maintained the original

recipient plasmid bands and also acquired two new bands. The uninduced strain (-I) has bands

corresponding to all three of the plasmids. Labels are listed in the legend and the first lane of

each gel has the GeneRuler 1 kb plus ladder.


By contrast, the TSV2A recombinant plasmid runs much farther on the gel when

undigested than even the original constructs (Figure 6.2.7). However, digestion and sequencing

confirmed it to have the expected size and the sequence corresponded to one copy of the

pACYC184 backbone and the kanamycin cassette inserted specifically in the target site.

Digestion of the TSV1A and TSV2A plasmids using an enzyme that should produce a single

linear band (BamHI) confirmed these differences. For the TSV2 plasmid there was a single band

consistent with expectations for the recombinant plasmid; however, the target site 1 recombinant

plasmid gave bands of both the original plasmid and the recombinant plasmid (data not shown).

Figure 6.2.7: Target site 1 transconjugants retaining both kanamycin and tetracycline

resistance.

Lane 1 contains the GeneRuler 1 kb plus ladder. Lane 2 is the TSV2A clone. The small

plasmid is the pACYC-TSV2 plasmid with the kanamycin cassette inserted. The larger band

was lost from the strain after sub-culturing. Lane 3 contains the TSV1A plasmids (expression plasmid

and recombined pACYC-TSV1 with Km) and lane 4 is the original recipient strain.

It is important to note that RIT element mobility was only observed during conjugation, as

this may be important to understanding the mechanism of mobility. Although the high false

positive rate made it difficult to find positive recombinants, it also provided an opportunity to

try inducing the plasmids when they were all present in one strain. This induction was

performed on both the TSV1A positive recombinant clone and the previously uninduced clone

that had retained all three original plasmids (designated ‘–I’ in Figure 6.2.6). Plasmids were


collected from a large volume of the induced cells and no rearrangements of any kind were

observed in the subsequent plasmids either by gel or PCR analysis.

In the collection of KmR/TcR clones, there were a number of potential recombinants in

addition to TSV1 and TSV2. These appear to have larger plasmids, perhaps as a result of

unresolved co-integrate structures (similar to the plasmid visible in Figure 6.2.7 above the TSV2

recombinant plasmid). Primers designed to amplify across the target site revealed a putative co-

integrate that had an altered target site. Sequencing analysis of this clone (I4) confirmed it to be

a co-integrate of the donor and target plasmids. Sequencing out from the kanamycin gene

revealed the presence of both the donor (pSF100) and target (pACYC184) backbones but no

palindrome was found adjacent to either inverted repeat. Beyond each of the inverted repeats

the sequence matched to the same half of the target site sequence. Clone I4 was a result of the

target site 1 mating, and therefore the target site sequence was identical to the sequence flanking

the kanamycin cassette (28 bp on one end and 17 bp on the other end), which explains the

presence of two copies of the target site sequence. Sequencing from the vector backbones

revealed the palindrome sequence to be separate from the kanamycin cassette and flanked on

either side by the other half of the target site (illustrated in Figure 6.2.8).

Figure 6.2.8: Sequencing results of co-integrate structure of clone I4.

The original target site and suicide vectors are shown at the top. The coloured boxes represent

the sequences in common between the two (less than 30 bp for each). The bottom figure shows a

simplified version of the co-integrate illustrating the location of the target site sequences and

palindrome at the junction of the two plasmids.


6.2.4 Application of these Results to other RIT Elements

In recognizing the importance of the palindrome sequence to the mechanism of these novel

elements, I performed an in silico search specific to the palindrome/inverted repeat arrangement.

This search led me to identify additional RIT elements in the database (included in supplemental

table 1). This suggests that RIT elements may be grouped according to conservation of their

palindromes. Palindrome conservation groups span a wide range of species. The palindrome

sequences were most variable in the centre region, and in some cases this central core was no

longer a perfect palindrome sequence, suggesting that conservation of the key motifs is

functionally important as opposed to maintenance of a perfect palindrome structure. As can be

seen in Table 6.2.2, the conserved sequences in the palindromes also correspond with conserved

sequences in the inverted repeats (the presumed binding sites identified in chapter 4). Whether

these homologous sequences correspond to binding sites or facilitate the creation of a stem-loop

structure has not been determined.

Table 6.2.2: Conserved sequences found in a variety of alpha- and beta-Proteobacteria containing RIT elements.

Strain | Palindrome Sequence | Inverted Repeat Sequence
Caulobacter sp. K31 | ttatgccgatatcggcataa | cataatgccgcgatccggattatgccg
Sinorhizobium medicae WSM419 pSMED02 | ttatgccgatatcggcataa | cattatgccgtacgccggattatgccgcatggcc
Acidiphilium cryptum JF-5 pACRY03 | ttatgccgatatcggcataa | cataatgccgtgattcggattatgccgcatgacc
Novosphingobium PP1Y | ttatgccgatatcggcataa | taatgccgtgacccggattatgccg
Acidiphilium multivorum AIU301 | tgccccttatgccgacatcggcataaggggca | taatgccgagatccggattatgccg
Frankia sp. EAN1pec | ttatgccgacgtcggcataag | ttatgccgagggccgggttatgccg
Cupriavidus metallidurans CH34 RITCme1 | catgccgctagcggcatg | ttatgccgactccccgattatgccg
Burkholderia sp. Ch1-1 | cctgtcatgccgctagcggcatgacagg | ttatgccgacttcccgattatgccg
Mesorhizobium loti st. NZP2037 | ttatgccgacgtcggcataa | ttatgccgatgtccggattatgccg
Phaeobacter gallaeciensis DSM26640 | ttatgccgacatcggcataagg | cataatgccgatgttcagattatgccgcg
Acidovorax sp. KKS102 | cgctgcttatggagagctctccataagcagcg | gcagcgttatgcacagcacgcagttatgcacagttgg
Leptothrix cholodnii SP6 | ctgcttatggagagctttccataagcag | gcagcgttatgcacagcacgcagttatgcacagt
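
The distinction drawn above between perfect and imperfect palindromes can be checked directly on the sequences in Table 6.2.2. The following is a minimal Python sketch, not the in silico search used in this work: it compares each sequence position by position with its own reverse complement. The function names are illustrative, and no alignment is performed, so the count is only meaningful for even-length sequences.

```python
# Complement table for lowercase DNA, matching the case used in Table 6.2.2.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a lowercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_mismatches(seq: str) -> int:
    """Count positions where seq differs from its own reverse complement.

    Zero means the sequence is a perfect palindrome in the DNA sense.
    """
    rc = reverse_complement(seq)
    return sum(1 for a, b in zip(seq, rc) if a != b)


# Palindrome sequences copied from Table 6.2.2 (a subset, for illustration).
palindromes = {
    "Caulobacter sp. K31": "ttatgccgatatcggcataa",
    "Cupriavidus metallidurans CH34": "catgccgctagcggcatg",
    "Leptothrix cholodnii SP6": "ctgcttatggagagctttccataagcag",
}

for strain, seq in palindromes.items():
    print(strain, palindrome_mismatches(seq))
```

In this sketch the Caulobacter and Cupriavidus sequences are perfect palindromes, whereas the Leptothrix sequence deviates only in its central region, in line with the observation that the centre is the most variable part.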


6.3 Discussion

Tyrosine-based site-specific recombinases (TBSSRs) are a broad group of enzymes

which perform conservative DNA recombination through the coordinated breakage, exchange

and resealing of all 4 DNA strands (Hallet et al. 2004). There are approximately 400,000

TBSSRs in the NCBI database, many of which can be assigned to one of 24 sub-families based

on conserved domains in the C-terminal catalytic domain (www.ncbi.nlm.nih.gov/cdd). Those

associated with mobile genes have been further divided into putative role-specific sub-families

(Van Houdt et al. 2012). Only a small number of these enzymes have been characterized

biochemically and they exhibit extensive diversity in both their recombination mechanisms and

the nature of the attachment sites. This is not unexpected given the varied roles that these

enzymes perform in the cell. These functions can be separated into three different categories –

chromosome or plasmid maintenance (by ensuring correct separation of multimers), intercellular

distribution (phages, ICEs and genomic islands) and intracellular generation of diversity (phase

switching and cassette integration) (Hallet et al. 2004; Subramanya et al. 1997; Tribble et al.

2000; Tirumalai et al. 1997; Guo et al. 1997; Cheetham and Katz 1995; Rowe-Magnus and

Mazel, 2001).

Tyrosine-based site-specific recombinases are essential for the correct separation of

circular replicons. The best studied representative is the XerCD/dif system which functions to

resolve chromosomal dimers produced during replication (Hallet et al. 2004). These

recombinases are distinguished from those utilized in homologous recombination since they

require only short (~30 base pair) sequences to perform recombination. These sites are referred

to as the “core” or “cross-over” site and usually possess dyad symmetry that facilitates the

binding of the recombinases to recognition motifs (Hallet et al. 2004). The DNA strands are cut

and exchanged at the borders of the central region separating the recognition motifs (Hallet et al.

2004).

Large conjugative plasmids and other mobile elements that are self-transmissible

commonly encode their own site-specific recombinase adjacent to the recombination site for

integration (Hallet et al. 2004). These sites can consist of only the core site (as in the Cre/loxP

system), or be more complex. The relative positioning of recombination sites specifies whether

the recombination reaction will result in integration, excision or inversion of the intervening


DNA. When the sites are located on a single replicon, directly repeated recombination sites will

cause excision, while inversely repeated sites cause inversion (Hallet et al. 2004). Tyrosine

recombinase systems can be specific to individual mobile elements or can be provided in trans

from the host chromosome (Huber and Waldor, 2002).
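
The dependence of the outcome on site orientation can be illustrated with a toy Python sketch. It is not a model of any real recombinase: the 5 bp site, the replicon strings and the function names are hypothetical, strand exchange is treated as occurring exactly at the site boundaries, and a circular replicon is written as a linear string.

```python
from typing import Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def recombine(replicon: str, site: str) -> Tuple[str, str]:
    """Recombine the first two copies of `site` found on one replicon.

    Directly repeated sites give excision of the intervening DNA (one site
    remains on the replicon); inversely repeated sites give inversion.
    Returns (outcome, resulting replicon sequence).
    """
    first = replicon.find(site)
    if first < 0:
        raise ValueError("recombination site not found")
    after_first = first + len(site)
    direct = replicon.find(site, after_first)
    inverted = replicon.find(reverse_complement(site), after_first)

    if direct >= 0 and (inverted < 0 or direct < inverted):
        # Direct repeat: drop the segment plus the second copy of the site.
        return "excision", replicon[:after_first] + replicon[direct + len(site):]
    if inverted >= 0:
        # Inverted repeat: flip the segment lying between the two sites.
        segment = replicon[after_first:inverted]
        return "inversion", (replicon[:after_first]
                             + reverse_complement(segment)
                             + replicon[inverted:])
    raise ValueError("second recombination site not found")


SITE = "AAGTC"  # hypothetical asymmetric site, chosen only for illustration
direct_repeat = "CC" + SITE + "GGTT" + SITE + "AA"
inverted_repeat = "CC" + SITE + "GGTT" + reverse_complement(SITE) + "AA"

print(recombine(direct_repeat, SITE))
print(recombine(inverted_repeat, SITE))
```

The site has to be asymmetric for orientation to be defined at all, which is why a perfectly palindromic site is not used in this illustration.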

Many transposons encode separate integration and resolution systems. Two examples

using a site-specific resolution system are the Tn3 family and “Mu-like transposons” including

Tn552 and the Tn5053/Tn402 family (Hallet et al. 2004). For both of these systems, the

(usually DDE) transposase initiates the creation of a co-integrate which joins the donor and

target sequence through two directly repeating copies of the transposon. These co-integrates are

then resolved through the activity of the site-specific recombinase through intra-molecular

recombination between the two copies at the transposon resolution site (res), resulting in one

copy of the transposon in each location (Hallet et al. 2004). For the majority of Tn3 members,

the resolution occurs through the action of a serine SSR commonly referred to as the resolvase.

The res sites of the Tn3 members that have been characterized each contain three

12 bp inversely oriented binding sites, the first of which is the recombination core site and the

other two correspond to accessory elements required for recombination to proceed (Hallet et al.

2004). The sequence identity and spacing between these three sites has some variability in

different members, and it has been determined that some elements (including Tn552, ISXc5,

Tn1546 and TnXO1) may each contain direct repeats instead of inversely oriented motifs at one

of the two accessory binding sites (Hallet et al. 2004). There is a sub-family of the Tn3

elements (including Tn4430, Tn5401 and the Tn4651/Tn5041 families) that utilizes a

tyrosine-based site-specific recombinase (TBSSR) for the resolution of the transposase-driven

co-integrates (Hallet et al. 2004).

These experiments demonstrated the movement of a Km cassette carried within a RIT

structure to a recipient plasmid harboring either of two closely related target sites, but only

during conjugative events. The target site was identified after careful examination of the gene

sequence uncovered the presence of a palindrome that mobilized as part of the RIT element.

This finding provides clues to the integration event mechanism.

The results obtained so far suggest that RIT elements can be transferred between

replicons within a bacterial cell during the process of conjugation and that integration occurs at


the ends of the palindrome sequence. This suggests that either conjugation or the presence of

single stranded DNA is central to RIT activity. Analysis of the sequence surrounding the

kanamycin cassette after insertion in each target site shed some light on the mechanism of

insertion. In both cases the palindrome appears downstream of the kanamycin gene in terms of

transcription whereas it is upstream in the original construct in pSF100. This suggests the

palindrome location is determined by the target site sequence, and that the palindrome may

serve as the attachment and integration site and the RIT::Km is inverted either during or after

integration. Previously characterized tyrosine recombinases (such as XerC/D and Cre) bind to

sites exhibiting dyad symmetry and crossover occurs at the centre of this symmetry (Hallet et al.

2004). As can be seen in Table 6.2.2, there are complementary sequences found in both the

palindrome and the inverted repeats. It is therefore proposed that the RIT element recombinases

may bind to one half of the palindrome sequence and to the complementary sequence within the

inverted repeats and that crossover occurs between the palindrome and the inverted repeat

(Figure 6.3.1). If the crossover occurs between the palindrome and the inverted repeats then the

core sites are also more consistent with those seen for XerC/D since the string of A/T is internal

and the G/C is external to the crossover region.

Figure 6.3.1: Model for RIT element mobility based on experimental results.

IR indicates the locations of the inverted repeats. Illustration of palindrome direction relative to

kanamycin transcription in pSF100-RIT::Km suicide construct (top) and in the transconjugants


obtained (middle). Bottom picture is proposed binding sites and crossover regions in RIT

integration involving a circular intermediate. Key residues predicted to be involved in binding

are shown in bold and strand exchange occurs between the palindrome and the inverted repeats.

Site-specific recombination events were detected in these experiments, but many more

may have been found if not for the high rate of false positives. These were due to the

independent maintenance of the suicide construct within the recipient cell, or to recombination

events that appear to be separate from RIT activation. As discussed in the results, the

substitution of the MFD strain to replace S17-1 did not eliminate the false positive issue,

suggesting that the source of the issue with pSF100 is not the active Mu phage described in the

latter strain. There were significant regions of homology (~ 200 bp) between the suicide

construct and both the donor and expression plasmids found in the recipient cell. This should

not be an issue in a recA1 mutant background, but it has been shown that ATP-independent re-

annealing of single stranded substrates can still occur although strand exchange is eliminated

(Bryant and Lehman 1986). I therefore cannot preclude the possibility that the plasmids are

becoming integrated with each other and our strong selection maintains these co-integrate

structures. In addition, the ATP-dependent functions of the RecA1 protein can be partially

restored at a pH of 6.5 or lower (Kawashima et al. 1984). As alterations in the pH during

conjugation were not monitored, there is the possibility that homologous recombination

accounts for a significant fraction of the false positives observed. A third possibility for the

source of this issue could be cross-reactivity with the active XerC and XerD homologs found in

the recipient strain. There have been phage elements described that do not carry their own site-

specific recombinases but rather depend on the action of the host recombination machinery to

facilitate their integration (Huber and Waldor, 2002). As these recombinases are essential to

chromosome separation and cell reproduction, it is not feasible to perform the experiments in a

XerC/D deficient background in order to determine whether these genes are contributing to the

high false positive rate.

Although there is currently insufficient evidence to speculate on the role that RIT

elements may play in the cell, the results obtained in this study indicate that they may be

specifically active during the process of conjugation. The lack of kanamycin movement upon

induction of the recombinases in cells already possessing all three plasmid constructs


(expression, target and suicide construct) was in sharp contrast to the diversity of arrangements

obtained when induction occurred as the RIT::Km cassette was conjugating in. This is consistent

with the data obtained in Chapter 4, which indicate that RIT elements are commonly associated

with one or more plasmids in an individual strain. This activation could be similar to integrons,

where a single stranded substrate is necessary for integration of gene cassettes to occur, or could

be indicative of a role for these genes in the acquisition of genes directly from incoming

plasmids regardless of the ability of that plasmid to be maintained long term within the

recipient cell. In this manner, having RIT elements specifically active during conjugation

events would be a powerful means of generating diversity from transient plasmid

associations.


This project was funded through the W. Garfield Weston Foundation Doctoral Fellowship

Program. Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D

Scholarship to NR is also gratefully acknowledged. The funding agencies had no role in this

study.

6.5 References

Bryant, F.R. and Lehman, I.R. 1986. ATP-independent renaturation of complementary DNA strands by the mutant recA2 protein from Escherichia coli. The journal of biological chemistry 261(28):12988-12993.

Cheetham, B. F., & Katz, M. E. (1995). A role for bacteriophages in the evolution and transfer of bacterial virulence determinants. Molecular microbiology, 18(2), 201-208.

Ferrières, L., G. Hémery, T. Nham, A.M. Guérout, D. Mazel, C. Beloin and J.M. Ghigo. 2010. Silent mischief: Bacteriophage Mu insertions contaminate E. coli random mutagenesis performed using suicidal transposon-delivery plasmids mobilized by broad-host range RP4 conjugative machinery. J. Bacteriol. 192(24):6418-27.

Grindley, N.D.F., Whiteson, K.L., and Rice, P.A. 2006. Mechanisms of Site-Specific Recombination. Annu. Rev. Biochem. 75:567-605.

Guo, F., Gopaul, D. N., & Van Duyne, G. D. (1997). Structure of Cre recombinase complexed with DNA in a site-specific recombination synapse. Nature, 389(6646), 40-46.

Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips ASM Press, Washington, D.C. USA

Huber, K. E., & Waldor, M. K. (2002). Filamentous phage integration requires the host recombinases XerC and XerD. Nature, 417(6889), 656-659.

Kawashima, H., Horii, T., Ogawa, T. and Ogawa, H. 1984. Functional domains of Escherichia coli recA protein deduced from the mutational sites in the gene. Mol Gen Genet (molecular and general genetics) 193(2):288-292.

Ricker, N., Qian, H., and Fulthorpe, R. 2013. Phylogeny and Organization of Recombinase in Trio (RIT) Elements. Plasmid. 70(2):226-239.

Rowe-Magnus, D. A., & Mazel, D. (2001). Integrons: natural tools for bacterial genome evolution. Current opinion in microbiology, 4(5), 565-569.

Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev. 38(5):865-891.

Subramanya, H. S., Arciszewska, L. K., Baker, R. A., Bird, L. E., Sherratt, D. J., & Wigley, D. B. (1997). Crystal structure of the site‐specific recombinase, XerD. The EMBO Journal, 16(17), 5178-5187.

Tirumalai, R. S., Healey, E., & Landy, A. (1997). The catalytic domain of λ site-specific recombinase. Proceedings of the National Academy of Sciences, 94(12), 6104-6109.

Tribble, G., Ahn, Y. T., Lee, J., Dandekar, T., & Jayaram, M. (2000). DNA recognition, strand selectivity, and cleavage mode during integrase family site-specific recombination. Journal of Biological Chemistry, 275(29), 22255-22267.

Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.

Van Houdt, R., Leplae, R., Mergeay, M., 2012. Towards a more accurate annotation of tyrosine-based site-specific recombinases in bacterial genomes. Mobile DNA 3(6) doi:10.1186/1759-8753-3-6


Chapter 7 Developing a standardized method for analyzing gene content of bacterial communities in streams with varying degrees of urbanization

7 Introduction

A key challenge in characterizing the mobilome of environmental samples is the ability

to draw comparisons between diverse environments. The ideal study involves collecting

samples before and after an environmental disturbance, however this is limited to anticipated

point source contamination events. Moreover, the information obtained can only be utilized in

drawing comparisons specific to that location and time point. Unfortunately, environmental

pollution is not limited to these discrete and known point source events. Anthropogenic

pollutants from domestic, industrial and agricultural settings contribute a diverse array of

chemical compounds to the environment (Gillings et al. 2015). Increased urbanization and

decreased vegetation likewise contribute to increased levels of environmental pollutants

(particularly polyaromatic hydrocarbons) surrounding human activities (Johnsen and Karlson,

2007).

Despite the inherent issues in drawing comparisons between sites with highly variable

anthropogenic impacts, a baseline community mobilome needs to be established from which

future studies can draw comparisons. There are a variety of metrics currently available for

classifying anthropogenic impacts on freshwater streams. In our region, the Ontario Benthic

Biomonitoring Network (OBBN) coordinates efforts to monitor impacts to both lakes and

streams and has developed appropriate methods for comparing the benthic invertebrate

populations between sites to establish anthropogenic impacts (Jones et al. 2007). The analysis of

benthic macroinvertebrate populations is a well-established biomonitoring tool for comparing

cumulative impacts of human activities in river systems (Rosenberg and Resh 1993; Wright et

al. 2000). However, understanding how bacterial communities are impacted at these sites is not

directly comparable to these macro-organism metrics. In order to determine which

environmental pollutants cause changes in bacterial diversity or gene content, there must be a

standardized bacterial community on which to perform testing. This standardized community


must account for varying bacterial populations within individual spatial niches as these can be

expected to respond differently to selection pressures. Gene transfer mechanisms are

particularly proficient in biofilm communities; obtaining samples through filtering of

stream water is therefore less than ideal since it minimizes the genetic contribution of these

important communities. However, it is equally difficult to account for differences in sediment

composition when comparing bacterial communities between streams, and this difficulty increases

when comparing between relatively pristine (reference) sites and more channelized urban

streams. Moreover, individual sediment samples can be impacted by variations in groundwater

inputs, which can be a source of various pollutants. Finally, in order to be used in a risk

assessment framework, the bacterial community should represent a reasonable route of exposure

for individuals either through direct contact or downstream water usage. For these reasons, we

chose to utilize columns filled with a standardized substrate on which the bacterial community

could colonize for a pre-determined length of time. This allows for bacteria that are present

intermittently in the water column to colonize the soil columns in addition to the ubiquitous

water inhabiting bacterial members.

I designed sand-filled columns to capture and integrate the bacterial communities of streams for

study. These columns were attached to flotation devices and floated in the water column at a

shallow depth from the surface in 6 streams in southern Ontario, Canada. Two of the chosen

streams are minimally impacted by human activities, and the other four streams are moderately

impacted by a variety of pollutants (see Table 7.1.1). The sources of anthropogenic stress

included in this study (urbanization, waste water outflows, agricultural practices and landfill

leachate) were chosen in order to avoid strong selection by any particular pollutant and instead

focus on circumstances where the communities are exposed to a variety of stressors.


The reference site samplers were placed in rivers contained within the Saugeen Valley

Conservation Authority (SVCA) at sites chosen based on the 2010 water quality monitoring

report from this agency (SVCA, 2010) and are both located in streams actively monitored by the

Provincial Water Quality Monitoring Network (PWQMN). This region has a low road density

and includes the provincial reference sites utilized for the Ontario Benthos Biomonitoring

Network (OBBN) assessments (Jones, 2006). Sampler placement was chosen based on ease of

accessibility; however, neither sampler was placed in the precise location of the PWQMN station

for the reference sites due to concerns that the locations provided public access and may lead to

tampering. The impacted sites were chosen in the more urbanized Lake Simcoe watershed,

based on recommendations from the Lake Simcoe Regional Conservation Authority (LSRCA)

staff and a 2004 study of contaminants found in the rivers of the Lake Simcoe watershed

(LSRCA, 2004). All impacted sites show significant accumulations of polyaromatic

hydrocarbons (PAHs), which is to be expected in an urbanized watershed. The Uxbridge Brook

site was chosen due to its use as a discharge stream for a wastewater treatment plant in the area

(at a distance of approximately 2.5 km from discharge to sampling site), and is the only site that

is located precisely at the PWQMN site. The Maskinonge River site did not have any

contamination that exceeded the provincial limits according to the 2004 report but was chosen

due to its location downstream of an intensive sod farm. The West Holland River was heavily

impacted according to the 2004 contamination study by organochlorine pesticides including

DDT (and its breakdown product DDE), among other contaminants. The Dyment’s Creek

location was not highlighted in the 2004 study but was included due to the availability of

concurrent chemical analyses performed by researchers at Environment Canada. At this

location, chemical screening has been performed on the groundwater flowing beneath the stream

and results from previous years have been published (Roy and Bickerton 2011). Contaminants

found in this location are diverse and include volatile organic chemicals, metals and petroleum

products.

7.1.1 Sampling locations and collection of benthic invertebrates

Chemical data and site characteristics are listed in Table 7.1.1. Provincial water quality

monitoring data were available for all streams except for Dyment’s Creek; however, stream and

sediment data for this latter site were provided for 2011 by researchers at Environment Canada.

For each of the sampling locations, benthic communities were sampled according to the OBBN

protocols (Jones et al. 2007) and animals were preserved in 70% ethanol for transportation.

Organic material was isolated by density separation in concentrated salt solution (if necessary)

and animals were classified using microscope-assisted identification to at least the 27-group

level (details in OBBN protocol, Jones et al. 2007). When possible, Trichoptera, Ephemeroptera

and Coleoptera were identified to the family level in order to utilize a more accurate tolerance

value for the coarse Hilsenhoff biotic index calculations. Benthic sampling was not performed


on the West Holland Canal and Maskinonge River sites due to the depth of the river at these

locations, and samples were not collected from the North Saugeen River site in the fall of 2012

due to the presence of clam beds that should not be disturbed. Although samples were collected

in both the fall and the summer across multiple years for individual sites, only the fall counts

were used for analysis for consistency. Benthic counts were also obtained from each of the

conservation authorities in order to supplement the available data. Anthropogenic

impact was determined using the coarse Hilsenhoff Biotic Index (cHBI; a

modification of the HBI developed by Hilsenhoff, 1987) and Simpson’s diversity index

(Simpson, 1949) as well as percent recovered of relatively intolerant species (combination of

Ephemeroptera, Plecoptera and Trichoptera).
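
The three metrics named above can be sketched in a few lines of Python. This is a hedged illustration and not the calculation used for this study: the taxon counts and tolerance values below are invented placeholders (they are not OBBN tolerance values), the function names are illustrative, and Simpson's diversity is assumed to take the form 1 minus the sum of squared proportions, which approaches 1 for a diverse community.

```python
from typing import Dict, Iterable


def simpson_diversity(counts: Dict[str, int]) -> float:
    """Simpson's diversity as 1 - sum(p_i^2); approaches 1 for diverse samples."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def hilsenhoff_index(counts: Dict[str, int], tolerance: Dict[str, float]) -> float:
    """Abundance-weighted mean tolerance value: sum(n_i * t_i) / N."""
    total = sum(counts.values())
    return sum(n * tolerance[taxon] for taxon, n in counts.items()) / total


def percent_ept(counts: Dict[str, int],
                ept: Iterable[str] = ("Ephemeroptera", "Plecoptera", "Trichoptera")) -> float:
    """Percentage of individuals belonging to the three intolerant orders."""
    total = sum(counts.values())
    return 100.0 * sum(counts.get(taxon, 0) for taxon in ept) / total


# Hypothetical kick-and-sweep counts for one site (illustration only).
counts = {
    "Ephemeroptera": 20,
    "Plecoptera": 5,
    "Trichoptera": 15,
    "Chironomidae": 50,
    "Oligochaeta": 10,
}
# Hypothetical tolerance values on a 0 (intolerant) to 10 (tolerant) scale.
tolerance = {
    "Ephemeroptera": 5,
    "Plecoptera": 1,
    "Trichoptera": 4,
    "Chironomidae": 7,
    "Oligochaeta": 8,
}

print(hilsenhoff_index(counts, tolerance))
print(simpson_diversity(counts))
print(percent_ept(counts))
```

Because the index is weighted by abundance, a single dominant tolerant group (Chironomidae in this invented sample) pulls the score upward even when intolerant taxa are present.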

Table 7.1.1: Sampling locations for river assessments.

Those in bold exceed the available recommended limits (Provincial Water Quality Objectives or

Canadian Water Quality Guidelines for Protection of Aquatic Life (SVCA, 2010)); There is no

listed guideline for PHCs; Abbreviations - PAH: polyaromatic hydrocarbon, OC:

organochlorine, PHC: petroleum hydrocarbon, VOCs: volatile organic compounds.

Site | Bank Width | Depth (avg) | Surrounding Region | Chemicals of Concern
Hamilton Creek | 11.9 m | 0.41 m | Forest |
North Saugeen River | 15.2 m | 0.46 m | Forest |
Uxbridge Brook | 6.5 m | 0.40 m | Downstream of sewage outflow | PAHs, Phenols, PHCs, Cr, Cu
Maskinonge River | 9 m | 0.40 m | Downstream of sod farming | Pesticides, Cr, Cu
West Holland River | Not determined | Not determined | Within Holland Marsh (agriculture) | PAHs, Phenols, OC pesticides, PHCs, Hg, Cd, Cr, Pb, Cu
Dyment’s Creek | 2-6 m | 5-50 cm | Historic landfill turned residential | VOCs


Figure 7.1.1: Map of sampling locations.

The two reference sites (orange triangles) are North Saugeen (NS) and Hamilton Creek (HC),

which are located within the SVCA. The four impacted sites (red triangles) are Dyment’s Creek

(DC), West Holland Canal (WH), Maskinonge Creek (MC) and Uxbridge Brook (UX) and are

all located within the LSRCA.

7.1.2 Sampler Design

All samplers were created from 1 ½” (inner diameter) polycarbonate tubes cut to one foot in

length and fitted with 1 ½” copper to DWM pipe adapters machined internally to fit to the pipe.

The ends of the adapters were fitted with screening material and nylon in order to prevent the

entry of invertebrates or litter from the stream. All samplers were filled with autoclaved, fine

grain sand. Samplers were floated mid-stream within 2 inches of the stream surface, and were

sub-sampled monthly throughout the 4 month exposure time. At the end of four months the

samplers were retrieved and replicate samples obtained from each portion of the sampler (inflow

end, center, outflow end) to determine within sampler variation.


Figure 7.1.2: Aquatic environment bacterial community samplers.

Constructed samplers (A) were attached to 2 L bottles to be used as flotation devices. The

devices were attached to cinder blocks to keep them in place in the stream (B).

7.1.3 Bacterial Community Assessment

DNA extraction of sampler soil was performed using a PowerSoil extraction kit (MoBio).

Terminal restriction fragment length polymorphism (T-RFLP) using fluorescently labeled 16S

primers was used to compare the sampler diversity, and pyrosequencing was performed on

selected samplers to examine bacterial community diversity.

Initial T-RFLP comparisons between the inflow, center and outflow sub-samples were

performed to determine whether the bacterial communities were consistent throughout the

length of the sampler. Fluorescently labeled 16S primers (27F-FAM and 1492R-HEX) were

used for amplification and digestion was performed using AluI. Digested samples were

analyzed by the Guelph Molecular Supercenter Laboratory Services Division and statistical

analysis was performed using R (R-project.org). Sub-samples were subsequently combined (3

separate replicates of sub-samples where possible) for between-sampler comparisons, also by

T-RFLP using the same methods. Principal coordinate analysis (separate analyses for Bray-Curtis

and Jaccard distance measures) of the T-RFLP data from combined samples was performed using the

pco command in the ecodist package of R. Principal coordinate scores were compared to water

quality parameters, benthic invertebrate metrics and genetic (qPCR) data. Correlations were

automatically generated using the corr function.
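
The two distance measures that feed the principal coordinate analysis can be written out explicitly. The Python sketch below is for illustration only and does not reproduce the ecodist workflow described above: the peak-height vectors are invented, the function names are illustrative, and only the pairwise distances are computed, not the ordination itself.

```python
from typing import Sequence


def bray_curtis(x: Sequence[float], y: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator


def binary_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_x = {i for i, a in enumerate(x) if a > 0}
    present_y = {i for i, b in enumerate(y) if b > 0}
    union = present_x | present_y
    if not union:
        raise ValueError("both profiles are empty")
    return 1.0 - len(present_x & present_y) / len(union)


# Hypothetical T-RFLP profiles: one value per terminal fragment size.
sampler_a = [10, 0, 5, 5]
sampler_b = [10, 5, 0, 5]

print(bray_curtis(sampler_a, sampler_b))
print(binary_jaccard(sampler_a, sampler_b))
```

Bray-Curtis is sensitive to peak abundance, while the binary Jaccard measure responds only to which fragments are present, which is one reason to run the two analyses separately.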


Pyrosequencing was performed on one replicate of combined samples, using a Roche 454

FLX titanium instrument (MR DNA Molecular Research LP). Primers utilized were provided

by the facility (27Fmod and 530R) and targeted the V1-V3 16S region (Yarza et al. 2014). Data

analysis was performed using programs in the QIIME pipeline (Caporaso et al. 2010), including

Denoiser (Reeder and Knight, 2010) and UCLUST (Edgar, 2010). Sequences were rarefied to

1455 reads per sample (corresponding to the lowest read count obtained), OTUs were grouped

based on 97% similarity, and taxonomy was assigned according to the Greengenes Database

(DeSantis et al. 2006) files from May 2013. Beta diversity was evaluated using the vegan

package in R with either Bray-Curtis or Binary Jaccard settings.
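
Rarefying to the lowest read count amounts to drawing reads at random, without replacement, until every sample has the same depth. The Python sketch below illustrates the idea under stated assumptions and does not reproduce the QIIME implementation: the OTU names and counts are invented, and the function name and fixed seed are illustrative choices.

```python
import random
from collections import Counter
from typing import Dict


def rarefy(otu_counts: Dict[str, int], depth: int, seed: int = 0) -> Counter:
    """Subsample `depth` reads without replacement from one sample."""
    # Expand the count table into one list entry per read.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(reads) < depth:
        raise ValueError("sample has fewer reads than the rarefaction depth")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return Counter(rng.sample(reads, depth))


# Hypothetical sample with 2000 reads across three OTUs.
sample = {"OTU_1": 1200, "OTU_2": 700, "OTU_3": 100}
rarefied = rarefy(sample, depth=1455)

print(sum(rarefied.values()))
```

Samples with fewer reads than the chosen depth cannot be rarefied, which is why the depth is set by the smallest sample.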

Table 7.1.2: Primers for quantitative PCR.

Primer Name | Sequence | Amplicon Size | Annealing Temp (°C) | Reference
qPCR-intI1F | ACCAACCGAACAGGCTTATG | ~ 286 bp | 62 | Nemergut et al. (2004)
qPCR-intI1R | GAGGATGCGAACCACTTCCAT | ~ 286 bp | 62 | Nemergut et al. (2004)
qPCR-16S-338F | ACTCCTACGGGAGGCAGCAG | ~ 200 bp | 63 | Fierer et al. (2005)
qPCR-16S-518R | ATTACCGCGGCTGCTGG | ~ 200 bp | 63 | Fierer et al. (2005)
sulI-F | CACCGGAAACATCGCTGCA | 158 bp | 55 | Cheng et al. 2013
sulI-R | AAGTTCCGCCGCAAGGCT | 158 bp | 55 | Cheng et al. 2013
IncP1 korA-F | TCATCGACAACGACTACAACG | 117 bp | | Smalla et al 2013
IncP1 korA-R | TTCTTCTTGCCCTTCGCCAG | 117 bp | | Smalla et al 2013
IS1071_qPCR-F | GCACCAAGTCTGGGAATGAT | ~200 bp | 60 | This study
IS1071_qPCR-R | ACGGGCATAGTGTTTCTTGG | ~200 bp | 60 | This study
IR_Olga | TTATGCCGATTCCCGGATTATGCCG | 3.5 kb | 54 | This study
IR_K31 | TAATGCCGCGATCCGGATTATGCCG | 3.5 kb | 54 | This study
IR_ambig | TWATGCCGIIIYCCSGATTATGCCG | 3.5 kb | 54 | This study
IR_less_ambig | TTATGCCGIIIYCCSGATTATGCCG | 3.5 kb | 54 | This study

7.1.4 Quantitative PCR

Quantitative PCR (qPCR) was performed using SYBR Green I technology on an ABI

7300 Sequence Detection System (Applied Biosystems). A master mix for each PCR run was

prepared with SYBR Green PCR Master Mix (KAPA Biosystems) and 0.5 μM primers. The

following amplification program was used: 95°C 2 min, 40 cycles at 95°C for 15 s followed by


60°C for 35 seconds. A dissociation step was added (95°C for 15s, 60°C for 30s, 95°C for 15s)

to analyze the melting curves of products for primer dimers and PCR artifacts. Representative

samples were run on a 1% agarose gel to confirm that products were the expected size. Primers

used for MGE comparisons between samplers are listed in Table 7.1.2. Primer efficiencies were

between 93% and 109%.

RIT inverted repeat primer design and PCR

The inverted repeats from the strains listed in Table 6.2.2 were aligned and used to

design ambiguous primers targeting the inverted repeats flanking their respective RIT elements.

Since the primers were designed to target the inverted repeats, the same primer would be

expected to anneal at each end and amplify the full RIT element. Two specific (non-ambiguous)

primers were also created targeting the inverted repeats for Burkholderia sp. OLGA172 and

Caulobacter sp. K31 individually. The two specific primers were tested to verify that they were

strain specific and the ambiguous primers were shown to amplify the RIT elements in both

strains. The ambiguous primers were tested on sampler DNA to search for RIT elements

bearing comparable inverted repeats that could be amplified. PCR was carried out using HotStar

Taq at 54°C plus 1 μL BSA per reaction, with an extension time of 3 minutes and 30 seconds.
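
The expected behaviour of the ambiguous primers can be checked in silico against the two strain-specific sequences in Table 7.1.2. The Python sketch below is an illustration, not the design procedure used here: it expands IUPAC ambiguity codes into a regular expression and, as a simplifying assumption, treats inosine (I) as pairing with any base. The function names are illustrative.

```python
import re

# IUPAC nucleotide codes; inosine (I) is assumed here to match any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "Y": "[CT]", "R": "[AG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]", "I": "[ACGT]",
}


def primer_to_regex(primer: str) -> "re.Pattern[str]":
    """Translate a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


def primer_matches(primer: str, target: str) -> bool:
    """True if the degenerate primer covers the target over its full length."""
    return primer_to_regex(primer).fullmatch(target) is not None


# Primer sequences copied from Table 7.1.2.
IR_OLGA = "TTATGCCGATTCCCGGATTATGCCG"
IR_K31 = "TAATGCCGCGATCCGGATTATGCCG"
IR_AMBIG = "TWATGCCGIIIYCCSGATTATGCCG"
IR_LESS_AMBIG = "TTATGCCGIIIYCCSGATTATGCCG"

for name, primer in (("IR_ambig", IR_AMBIG), ("IR_less_ambig", IR_LESS_AMBIG)):
    print(name, primer_matches(primer, IR_OLGA), primer_matches(primer, IR_K31))
```

Under these assumptions IR_ambig covers both strain-specific sequences, whereas IR_less_ambig covers only the OLGA172 sequence because of the fixed T at the second position.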

7.2 Results

7.2.1 Macroinvertebrate metrics of ecosystem health

Prior to examining the microbial community from the samplers, the overall health of the stream

was estimated based on biomonitoring of benthic communities from the stream sediment.

Where possible, benthic animals were collected directly from the sites using the OBBN

approved kick and sweep method. Abundance and identification data were also obtained from

the relevant conservation authorities and these data were used to supplement the benthic

monitoring data acquired during this study. Table 7.2.1 shows the results of the biotic indices for

the four sites at which benthic animals could be obtained concurrent with sampling. Biotic

indices fluctuate seasonally; therefore, the data used for calculating these biotic indices

corresponded to the fall counts for all sites, which also coincided with the sampling season for

the conservation authorities.

Table 7.2.1: Comparison of field sites based on biotic indices of benthics obtained during

this study.

The coarse Hilsenhoff Biotic Index (cHBI) ranks sites with a score below 5 as healthy and

increasingly polluted above 5. Other indicators of a healthy benthic community include a high

Simpson’s diversity score (approaching 1) and high abundance of species known to be intolerant

to pollutants (%EPT).

Site                  cHBI   cHBI Rating   Simpson's Diversity   %EPT
Dyment’s Creek        6.93   Poor          0.48                  5.69
North Saugeen River   5.33   Fair          0.63                  14.18
Uxbridge Brook        5.23   Fair          0.79                  16.75
Hamilton Creek        6.12   Fairly poor   0.60                  13.25
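The three metrics reported in Table 7.2.1 can be illustrated with a short calculation. The following Python sketch assumes that the Hilsenhoff-type index is the abundance-weighted mean of taxon tolerance scores, that Simpson's diversity is expressed as one minus the sum of squared proportions, and that %EPT is the percentage of individuals belonging to Ephemeroptera, Plecoptera and Trichoptera. The counts and tolerance scores shown are hypothetical and are not data from this study; the function names are illustrative.

```python
def biotic_index(counts, tolerance):
    """Abundance-weighted mean tolerance score: sum(n_i * t_i) / N."""
    total = sum(counts.values())
    return sum(n * tolerance[taxon] for taxon, n in counts.items()) / total


def simpson_diversity(counts):
    """Simpson's diversity as 1 - sum(p_i^2); values approaching 1 indicate higher diversity."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def percent_ept(counts, ept=("Ephemeroptera", "Plecoptera", "Trichoptera")):
    """Percentage of all individuals that belong to the EPT orders."""
    total = sum(counts.values())
    return 100.0 * sum(counts.get(taxon, 0) for taxon in ept) / total


# Hypothetical counts and tolerance scores, for illustration only.
counts = {"Ephemeroptera": 10, "Trichoptera": 10, "Chironomidae": 60, "Oligochaeta": 20}
tolerance = {"Ephemeroptera": 5, "Trichoptera": 4, "Chironomidae": 7, "Oligochaeta": 8}

print("biotic index:", biotic_index(counts, tolerance))
print("Simpson's diversity:", round(simpson_diversity(counts), 2))
print("%EPT:", percent_ept(counts))
```

In this hypothetical sample the community is dominated by tolerant taxa, so the index exceeds 5 while diversity and %EPT remain moderate, which is the pattern the caption describes for more impacted sites.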

The SVCA sites historically show lower degrees of impact than the LSRCA sites (as

indicated by a lower cHBI value); however, the cHBI values calculated in this study were higher

than had been observed in previous years by the SVCA. The most recent benthic data provided

by this conservation authority were from 2007, and therefore recent trends could not be

identified for these sites. However, land usage in this region has not changed in that time period,

and a 2010 water quality status report published by the conservation authority also confirmed

that these particular sites had retained excellent water quality (SVCA, 2010). This was

confirmed in the data available from the PWQMN, which also indicates that there has not been a

drastic change in water quality during this period. For the LSRCA sites, only the Uxbridge

Brook site and the Dyment’s Creek site could be sampled by the kick and sweep method due to

the depth of the river at the other two locations. A sediment grab sample taken at the

Maskinonge River site was devoid of benthic organisms, and the West Holland Canal was

too deep for a grab sample to be attempted. However, benthic counts were obtained from

the conservation authority for the

Maskinonge River for the 2011, 2012 and 2014 fall sampling events at a nearby sampling

location used by the Provincial Water Quality Monitoring Network (PWQMN). The average

Hilsenhoff Family Biotic Index (FBI – which is the family level version of the cHBI) and %EPT

for this site were 6.35 and 2.43%, respectively (averaged across the three years, standard

deviation of 0.32 (FBI) and 0.92 (%EPT)). Since the other sites had not been analyzed to the

family level, the family-level benthic data obtained from the LSRCA were collapsed to the same

level of identification used for the other sites, and a modified cHBI value of

5.90 was calculated. Therefore, when the benthics were analyzed to only the 27-group level, both

Uxbridge Brook and the Maskinonge River gave better cHBI ratings than Hamilton Creek

(Table 7.2.1). Uxbridge Brook also had the highest percentage of sensitive species of benthic

invertebrates (Ephemeroptera, Plecoptera and Trichoptera) of any of the sites analyzed, which

generally indicates a healthy river ecosystem. There was no data available from the LSRCA

pertaining to biomonitoring in the West Holland Canal due to the depth of this waterway.

The samplers recovered from each of the impacted sites were visually distinguishable

from each other and from the reference sites, indicative of the varying nature of the ecosystems

(Figure 7.2.1). The Maskinonge River sampler was the least changed visually from the reference

sites, but was coated in duckweed (an aquatic plant of the Lemnoideae, a subfamily within the Araceae),

likely as a result of slow water movement coupled with high phosphorus levels. The Uxbridge

Brook and West Holland Canal sites each had substantial green algae coating the samplers, and

all three of these streams have high phosphorus levels according to the PWQMN data (see Table

7.2.3). The Dyment’s Creek sampler was thickly coated in an unknown chemical substance that

had badly discoloured the sampler and could not be removed. The sand inside the samplers

could also be distinguished visually between sites indicating that there had been substantial

input of substrate while in situ, likely as a result of sediment deposition during rain events.

Figure 7.2.1: Lake Simcoe region samplers after retrieval.

Samplers shown (bottom to top) are Maskinonge River, West Holland Canal and Dyment’s

Creek.

7.2.2 Community diversity measures

Individual extractions from the front, center and back portions of each sampler from the 2012

season were sent for T-RFLP analysis. Any T-RFLP sample that had a total peak height of less

than 3000 was removed from the analysis, which included the centre samples from the North

Saugeen (NS), Dyment’s Creek (Barrie), West Holland Canal (CF) and Hamilton Creek (HC)

samplers as well as the outflow sample from the North Saugeen (leaving only the inflow for that

sampler). For the samplers that had good band representation in all three sampler regions, Uxbridge

(UX) and Maskinonge Creek (SOD), the three samples from each sampler grouped together, as did the front and back samples

from the remaining samplers (Figure 7.2.2).

Figure 7.2.2: Cluster analysis of T-RFLP data showing within sampler variation.

Samples were obtained from the F,C,B (inflow (front), center and outflow (back)) portions of

the samplers after retrieval for comparison of the bacterial community heterogeneity within the

length of the sampler. Abbreviations are as follows: Dyment’s Creek (Barrie), West Holland

(CF), North Saugeen (NS), Hamilton Creek (HC), Maskinonge Creek (SOD) and Uxbridge

Brook (UX). The top diagram is the binary Jaccard comparison for the within-sampler variation.

The bottom diagram illustrates the improved clustering of the samplers when abundance is included (Bray-Curtis);

however, the West Holland Canal (CF) samples group with HC instead of with each other.

Although no conclusions could be drawn for the North Saugeen sampler due to poor

amplification, each of the other samplers harbored its own unique community. When

phylotype abundances were included in the statistical analyses, the bacterial communities

showed high similarity between the different locations within the samplers, except that the West

Holland inflow and outflow samples (CF-F and CF-B) did not cluster together. When abundance

is disregarded, however, the West Holland samples do cluster, albeit with only slightly greater

similarity to each other than to the two reference sites.
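The contrast between the two distance measures can be made concrete with a small example. The following Python sketch computes the binary Jaccard distance (presence or absence of peaks only) and the Bray-Curtis distance (peak abundance included) for two T-RFLP profiles. The profiles are hypothetical and were chosen so that both are dominated by a single shared peak; the function names are illustrative and do not correspond to the software used in this study.

```python
def jaccard_distance(profile_a, profile_b):
    """Binary Jaccard distance: 1 - (shared peaks / all peaks), ignoring peak height."""
    peaks_a = {size for size, height in profile_a.items() if height > 0}
    peaks_b = {size for size, height in profile_b.items() if height > 0}
    union = peaks_a | peaks_b
    if not union:
        return 0.0
    return 1.0 - len(peaks_a & peaks_b) / len(union)


def bray_curtis_distance(profile_a, profile_b):
    """Bray-Curtis distance: sum|a_i - b_i| / sum(a_i + b_i) over all peaks."""
    sizes = set(profile_a) | set(profile_b)
    difference = sum(abs(profile_a.get(s, 0) - profile_b.get(s, 0)) for s in sizes)
    total = sum(profile_a.get(s, 0) + profile_b.get(s, 0) for s in sizes)
    return difference / total if total else 0.0


# Hypothetical profiles: fragment size (bp) -> relative peak height (%).
sampler_1 = {190: 80, 75: 10, 120: 10}
sampler_2 = {190: 85, 310: 10, 402: 5}

print("Jaccard:", jaccard_distance(sampler_1, sampler_2))
print("Bray-Curtis:", bray_curtis_distance(sampler_1, sampler_2))
```

Because the two hypothetical profiles share only the dominant peak, they appear dissimilar by the Jaccard measure but similar by Bray-Curtis, which is the kind of discrepancy a single abundant fragment can introduce between the two analyses.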

In order to provide a greater quantity of DNA for further analysis, the front (inflow),

centre and back (outflow) DNA from each sampler was combined and purified using GE

Healthcare S200 purification columns (in triplicate), and the pooled DNA was used for

subsequent analysis. Two replicates of pooled samples from each sampler, along with two DNA

extractions from the sediment at the Barrie site for comparison, were analyzed via T-RFLP.

Although the sampler replicates generally gave good agreement by T-RFLP analysis, the West

Holland River sampler (CF) and Hamilton Creek sampler again showed greater similarity

between the two sites than between the two replicates. Further investigation revealed that the

primary driver of the clustering was the abundance of one particular band (190 bp), which

accounted for close to 50% of the total T-RFLP peaks in all of the samples except for

Maskinonge (SOD), Dyment’s Creek (BH-1, BL-1) and Uxbridge Brook (UX). For this

reason, the principal coordinate analysis was performed separately for the Bray-Curtis distance

and Jaccard distance between sites. As can be seen in Figure 7.2.3, the West Holland Canal

was the only sampler that segregated differently by the two distance calculations. However, by

both of these analyses the West Holland sampler bacterial community was more similar to the

reference sites than to any of the other impacted sites.

Figure 7.2.3: Principal coordinate analysis of T-RFLP results from sampler replicates.

Samples in both graphs are numbered as follows: Dyment’s Creek (1,2), West Holland Canal

(3,4), Hamilton Creek (5,6), North Saugeen (7,8), Maskinonge Creek (9,10) and Uxbridge

Brook (11,12). The Uxbridge Brook sites are overlapping in both plots and therefore the

individual numbers are not visible.

In addition to the T-RFLP analysis, pyrosequencing was also performed on the

combined sampler DNA; however, for the Dyment’s Creek sample there was insufficient DNA

for the analysis. The Uxbridge Brook sample was run twice from the same DNA to serve as

technical replicates. An examination of the pyrosequencing data revealed that the primary

driving force for the observed similarities between the reference sites and the West Holland

Canal site (CAR) was the prevalence of one particular species at a number of the sites. This was

consistent with the T-RFLP analysis, where it had also been determined that a single dominant

band was influencing the observed clustering. In the pyrosequencing data, three of the samples

analyzed showed a clear dominance of Pseudomonas fluorescens. Two of these samples

corresponded with reference locations – the North Saugeen sample had 41% P. fluorescens, and

the same species accounted for 45% of the reads at the Hamilton Creek location. It also

comprised 51% of the reads from the West Holland Canal site. Using the available 16S

sequence for P. fluorescens PC17 from the NCBI GenBank database, it was determined that the

dominant band observed in the T-RFLP data was consistent with this species, and therefore

dominance of this species could be established in the T-RFLP data as well. In contrast, the

remaining three sites (Uxbridge Brook, Maskinonge River and Dyment’s Creek) had

considerably less prevalence of P. fluorescens based on both the pyrosequencing reads and the

observed T-RFLP band. For the Uxbridge Brook sample and the Maskinonge Creek sample, P.

fluorescens was still the most abundant single OTU in the pyrosequencing dataset; however, this

species accounted for only 18.5% and 19.4%, respectively. The Dyment’s Creek samples had

insufficient DNA for pyrosequencing; however, the T-RFLP data indicate that P. fluorescens

was effectively absent from this site both within the sampler DNA and also in the analyzed

sediment samples (less than 0.05% in any sample).

Figure 7.2.4: Principal coordinate analysis (PCoA) of the bacterial community

compositions revealed by 16S pyrosequencing data.

Beta diversity is determined using Bray-Curtis distances. UX12 and UX122 are technical

replicates of the same DNA. Abbreviations are as follows: Uxbridge Brook (UX), North

Saugeen (NS), West Holland Canal (CAR), and Hamilton Creek (HC).

7.2.3 Quantitative PCR

Quantitative PCR results for sampler DNA are shown in Table 7.2.2. DNA yields from

the sampler sands were typically low, so pooled samples were used. Each biological replicate is

a pool of inflow, center and outflow extractions. However, the low concentration of mobile

genes relative to total 16S gene copies required that different amounts of DNA be used for 16S

analysis compared to target gene analysis in order to stay within the linear range of the

respective primers. Data was not converted to estimates of copy number since the target gene

amplifications were still at or approaching the limit of linear amplification despite the larger

volume of DNA used for the target genes relative to the 16S genes. Results given are therefore

cycle threshold values (Ct) normalized to the 16S amplification for the same sample (deltaCt)

but not converted to actual gene copies since doubling at each cycle cannot be assumed.

Results are averages of at least two independent biological replicates.

Table 7.2.2: DeltaCt comparison of environmental samplers by quantitative real-time

PCR.

The deltaCt is determined by subtracting the 16S cycle threshold (Ct) value from the target gene

Ct value. A low deltaCt value is therefore indicative of a high concentration of target gene since

there was a smaller difference between the target and the 16S gene cycle thresholds. The

melting temperature (Tm) of the resulting PCR product is included as this differed according to

site for some primer sets.

Site                 IS1071 deltaCt   IS1071 Tm (°C)   sulI deltaCt   sulI Tm (°C)   IncP deltaCt   IncP Tm (°C)
Uxbridge Brook       4.93             88               6.76           85.4           9.65           85.2
Dyment's Creek       5.88             87.7             9.67           85.4           8.95           85.7
West Holland Canal   9.52             87.7             8.78           88             9.52           88.65
Maskinonge Creek     9.43             87.7             10.03          87.1           9.92           89.35
Hamilton Creek       10.22            87.7             9.45           88             9.20           88.65
North Saugeen        12.84            87.7             9.45           87.1           8.78           89.1
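The normalization behind Table 7.2.2 can be sketched in a few lines. The following Python example assumes that deltaCt is the target gene Ct minus the 16S Ct for the same sample, which is the direction implied by the positive values in the table and by the statement that a low deltaCt indicates a high concentration of the target gene. The raw Ct pairs are hypothetical; the IS1071 deltaCt values are those reported in Table 7.2.2. Function names are illustrative.

```python
from statistics import mean


def delta_ct(target_ct, reference_ct):
    """deltaCt for one sample: target gene Ct minus 16S (reference) Ct."""
    return target_ct - reference_ct


def mean_delta_ct(replicates):
    """Average deltaCt over biological replicates given as (target_ct, reference_ct) pairs."""
    return mean(delta_ct(target, reference) for target, reference in replicates)


# Hypothetical raw Ct values for two biological replicates of one sampler.
example_replicates = [(24.1, 19.2), (24.3, 19.3)]
print("mean deltaCt:", round(mean_delta_ct(example_replicates), 2))

# IS1071 deltaCt values as reported in Table 7.2.2.
is1071_delta_ct = {
    "Uxbridge Brook": 4.93,
    "Dyment's Creek": 5.88,
    "West Holland Canal": 9.52,
    "Maskinonge Creek": 9.43,
    "Hamilton Creek": 10.22,
    "North Saugeen": 12.84,
}

# A lower deltaCt means the target is more abundant relative to 16S,
# so sorting in ascending order ranks sites from highest to lowest abundance.
ranked_sites = sorted(is1071_delta_ct, key=is1071_delta_ct.get)
print(ranked_sites)
```

No conversion to copy number is attempted here, in keeping with the caution above that doubling at each cycle cannot be assumed for these amplifications.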

In addition to low amplification, for all except the IS1071 primers the environmental

samples gave multiple and/or broad peaks, indicating that more than one product was produced.

These peaks had a higher melting temperature than would be expected for primer dimers, and

when representative samples were analyzed by gel electrophoresis, a single band was observed

(data not shown). This suggests that the multiple peaks on the melting curve analysis are the

result of similarly sized PCR products with varying G+C content. In support of this assertion,

the melting temperature of the qPCR product was consistent between different replicates of the

same sampler, consistent with specific target differences. The results included in Table 7.2.2

correspond to the primer sets that gave good reproducibility across multiple biological

replicates, including the IS1071 primers designed in this study and primers from other published

studies that targeted the IncP plasmid backbone (Smalla et al. 2013) and the sulI gene conferring

resistance to sulfonamide antibiotics (Cheng et al. 2013).

In addition to the primer sets listed in Table 7.1.2 there were also a number of other

mobile element primer sets tested from the existing literature, including primers designed to

target the Tn3 and Tn21 classes of transposons (Gotz et al. 1996). These primers were not

originally designed for qPCR analysis; however, the amplified product was an appropriate size

for this type of analysis. The deltaCt values obtained with these primer sets were quite low

(ranging from 3 to 8), indicating high abundance of these transposons in all sites; however, both

peak quality and reproducibility between replicates were insufficient for the results to be

included. Conversely, the IS1071 primers were both highly specific and highly reproducible

between biological replicates (within 0.5 Ct after normalization), consistent with their design to

target only one specific member of the Tn3 family of transposons.

7.2.4 Correlations between bacterial communities and water quality parameters

With the exception of Dyment’s Creek, water quality parameters were available from the

Provincial Water Quality Monitoring Network (PWQMN) for each of the streams used in this

project. For Uxbridge Brook the sampling site corresponded to the precise location of the

PWQMN monitoring station; however, for accessibility reasons the other sampling sites were

located in other regions of the same streams. For Dyment’s Creek, water quality parameters

were available from Environment Canada for 2011 and these values were used to approximate

the conditions in 2012. For each of the streams, the water quality parameters were averaged

over the full year and the values are included in Table 7.2.3.

Table 7.2.3: Water quality parameters for each site.

Data for all but the Dyment’s Creek site are the averages from the 2012 data available through

the PWQMN database. Data for Dyment’s Creek are the averages from the stream water

chemistry data provided by Environment Canada for the 2011 field season.

Site                 Chloride (mg/L)   Phosphorus (mg/L)   DO (mg/L)   EC (µS/cm)   pH     Fe (µg/L)   Nitrates (mg/L)
Maskinonge River     92.7              0.1755              8.12        831          7.66   550         0.77
Uxbridge Brook       45.16             0.1154              10.07       538          7.87   545         2.25
Hamilton Creek       4.8               0.0035              9.03        441          8.36   24          0.526
North Saugeen        5.53              0.003               10.22       427          8.21   No data     0.382
Dyment's Creek       245.08            0.024               7.60        1022         7.73   169         2.76
West Holland River   92.55             0.14                9.26        735.06       7.74   294         1.01

The Principal Coordinates derived from both the T-RFLP and the pyrosequencing data

were compared to the available water quality parameters, and some significant correlations were

observed (Table 7.2.4). The T-RFLP data contained more replicates (since there were duplicates

of each sample) and included the Dyment’s Creek site; therefore, they will be the primary data

discussed. Principal coordinate 1 (PC1) showed the strongest correlations with %EPT and

dissolved oxygen, indicating that high biological oxygen demand had accounted for much of the

variability between these sites. PC1 also correlated with chloride levels, which are a dominant

feature of all the contaminated sites. PC2 did not show a correlation with any of the water quality

parameters but was correlated with the abundance of both IS1071 and sulI in the population.

The IS1071 deltaCt correlated in opposite directions with nitrate concentrations and P. fluorescens abundance. Since the

gene abundance is given as deltaCt, a high value corresponds to low abundance of the gene, and

therefore high IS1071 gene abundance was correlated with high nitrate concentrations.

Conversely, IS1071 was not as abundant in the sites that had substantial P. fluorescens present.
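The direction of the IS1071 and nitrate relationship can be illustrated from the site-level values already reported. The following Python sketch computes Pearson's correlation coefficient between the IS1071 deltaCt values in Table 7.2.2 and the nitrate concentrations in Table 7.2.3. It uses one averaged value per site, which is an assumption made for illustration, so it approximates rather than reproduces the coefficient reported for the replicate-level analysis. The function names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3 values)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Site order: Uxbridge Brook, Dyment's Creek, West Holland, Maskinonge,
# Hamilton Creek, North Saugeen (values from Tables 7.2.2 and 7.2.3).
is1071_delta_ct = [4.93, 5.88, 9.52, 9.43, 10.22, 12.84]
nitrates_mg_per_l = [2.25, 2.76, 1.01, 0.77, 0.526, 0.382]

r = pearson_r(is1071_delta_ct, nitrates_mg_per_l)
t = t_statistic(r, len(is1071_delta_ct))
print("Pearson's r:", round(r, 3))
print("t statistic:", round(t, 2), "with", len(is1071_delta_ct) - 2, "degrees of freedom")
```

The coefficient obtained from these site averages is approximately -0.92. Because a high deltaCt denotes low gene abundance, this negative value corresponds to higher IS1071 abundance at sites with higher nitrate concentrations.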

Table 7.2.4: Correlations of the bacterial communities to available water quality data.

Only those correlations that were found to be significant are included in this table. PC numbers

refer to the first and second principal coordinate from the T-RFLP analysis (Bray-Curtis) and

separate analyses of pyrosequencing analysis by Bray-Curtis (BC) or Jaccard (J). DO is

dissolved oxygen and DO4 is the average dissolved oxygen over the summer months (June to

September). Gene abundances from qPCR analysis are listed by their gene name (sulI and

IS1071).

Parameter             Correlate        Pearson's R   Degrees of freedom   p<
T-RFLP-PC1            PyroPC2-BC       0.936         4                    0.01
T-RFLP-PC1            PyroPC2-J        0.934         5                    0.01
T-RFLP-PC1            %EPT             0.900         4                    0.02
T-RFLP-PC1            DO               0.855         5                    0.02
T-RFLP-PC1            Chloride         -0.798        5                    0.05
T-RFLP-PC2            PyroPC1-BC       0.955         4                    0.01
T-RFLP-PC2            PyroPC1-J        -0.820        5                    0.05
T-RFLP-PC2            IS1071           -0.879        5                    0.01
T-RFLP-PC2            sulI             -0.883        5                    0.01
PyroPC1-Bray_Curtis   PyroPC1-J        -0.989        4                    N/A
PyroPC1-Bray_Curtis   Simpsons I       0.997         2                    0.01
PyroPC1-Bray_Curtis   IS1071           -0.868        4                    0.05
PyroPC1-Bray_Curtis   Nitrates         0.895         4                    0.02
PyroPC1-Bray_Curtis   P. fluorescens   -0.922        4                    0.01
PyroPC2-Bray_Curtis   PyroPC2-J        0.929         4                    N/A
PyroPC2-Bray_Curtis   %EPT             0.977         3                    0.01
PyroPC2-Bray_Curtis   DO               0.844         4                    0.05
PyroPC1-Jaccard       IS1071           0.843         5                    0.02
PyroPC1-Jaccard       P. fluorescens   0.932         5                    0.01
PyroPC2-Jaccard       %EPT             0.959         5                    0.001
PyroPC2-Jaccard       sulI             -0.777        5                    0.05
PyroPC2-Jaccard       DO               0.838         8                    0.01
cHBI                  DO               -0.905        7                    0.001
cHBI                  Chloride         0.874         7                    0.01
Simpsons I            sulI             -0.929        3                    0.05
Simpsons I            Total P          0.882         7                    0.01
%EPT                  DO               0.793         7                    0.02
%EPT                  DO4              0.845         7                    0.01
IS1071                Nitrates         -0.932        5                    0.01
IS1071                P. fluorescens   0.764         5                    0.02
IncP                  Total P          0.936         6                    0.001
DO                    Chloride         -0.808        8                    0.01

7.2.5 Primer design specific to RIT elements

In order to expand the current study to include RIT elements, it was necessary to develop

primers that could be used to search for novel elements in environmental samples. Although the

third integrase in the RIT element is more highly conserved than the first two integrases,

alignments of the nucleotide sequences from our RIT collection showed no promising candidate

regions from which to design primers. To address this lack of conservation, the RIT elements

were divided into sub-groups sharing higher homology and primers were designed specific to

conserved regions within the third integrase for these groups. Although specific

primers could be designed for genus-level groups such as Sinorhizobium and Acidiphilium,

there was insufficient conservation to design primers aimed at broader groups, making qPCR for

RIT elements in environmental samples unfeasible. However, the discovery of conserved

sequences within the inverted repeats from RIT elements found in 10 different genera (see Table

6.2.2) highlighted another route by which RIT element distribution in environmental samples

could be evaluated. Primers were designed that could bind to these conserved regions in the

terminal inverted repeats and would therefore amplify the complete RIT element (because the targets are inverted

repeats, the forward and reverse primers are identical). Burkholderia sp. str. OLGA172 and

Caulobacter sp. K31 were used as the control strains and primers were designed that would

match each of the strains specifically. These primers were shown to have no cross-amplification

with the alternate control. An alignment of the inverted repeats from all 10 strains was then

used to design two ambiguous primer pairs – the first had degenerate bases in all locations

where there were disagreements (IR_ambig) and a second set kept some of the original bases

found in OLGA172 in case the fully ambiguous primer lacked specificity (IR_less_ambig).

Both Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 produced bands of the expected

size with the fully ambiguous primers and this primer set was therefore utilized to search for

similar RIT elements in the environmental samplers. Faint bands of the expected size

were amplified from sampler DNA (specifically the Uxbridge Brook and Dyment’s Creek sites);

however, cloning and characterization of these products were beyond the scope of this project.

7.3 Discussion

7.3.1 Biomonitoring

The use of benthic invertebrates to establish the health of a stream ecosystem is well established

(Rosenberg and Resh, 1993; Jones et al. 2007). Although the reference sites historically

performed better by these metrics than the impacted sites, the biomonitoring scores for the

reference sites were less favorable in this analysis. In particular, the coarse Hilsenhoff biotic

index (cHBI) ranked the reference streams as ‘fair’, which was unexpected but is also consistent

with the observations by the SVCA that biomonitoring scores are changing. The exact reasons

for this trend have not been elucidated since land use and water quality parameters have not

been declining for these streams, but it has been suggested that it could be the result of

increasing temperatures due to changing climate conditions (SVCA, 2010). This highlights an

important caveat with using the cHBI value for determining the overall health of an ecosystem

since higher temperatures can mimic the stress effects of the organic pollutants for which this

method was originally designed (Hilsenhoff, 1987). For the impacted sites, Dyment’s Creek

performed the worst in terms of cHBI score with an overall ranking of ‘poor’. Pollution entering

this stream is one potential reason for the poor biomonitoring ranking; however, the lack of gravel

substrate is likely to be a strong contributing factor, as the sediment consisted primarily of sand

and debris from the landfill including tires, metal rims and plastic garbage bags. Trichoptera

were abundant at this site (leading to a high %EPT score) however they were found to be

exclusively from the Hydropsychidae family, which are known to be tolerant to degraded

conditions. The biomonitoring ranking for the Uxbridge site was much better than anticipated,

and even exceeded the cHBI ranking for the reference sites. This site is 2.5 km downstream of a

wastewater outflow and was expected to show some anthropogenic impact due to the presence

of elevated levels of PAHs and other contaminants. However, the stream itself has extensive

riparian vegetation, good shading and a varied sediment composition that would be expected to

support a diversity of benthic organisms. It is also likely that the wastewater outflow adds

nutrients, as seen by the elevated phosphorus and nitrate concentrations. All of these factors

may have contributed to a high overall biomonitoring ranking despite the presence of organic

contaminants. This site also had particularly prolific algal growth and a correspondingly high

abundance of Coleoptera (grazers) that may have impacted the overall ranking. These were

exclusively of the Elmidae family, which are known to be more tolerant of organic pollutants

than other beetle families. The LSRCA benthic data is analyzed at the family level, and does

show a slightly higher FBI value of 5.78 (‘fairly poor’) for the 2012 sampling. However, the

Uxbridge Brook site also had the highest %EPT values obtained in this study and although the

Trichoptera were all Hydropsychidae (and therefore more tolerant than other families), the

Ephemeroptera individuals came from families that are known to be quite sensitive to organic

pollution. The Simpson’s diversity metric also ranked the Uxbridge Brook site as healthier than

the reference sites (Table 7.2.1). This suggests that from a biomonitoring point of view, the

Uxbridge Brook site is maintaining a healthy benthic community.

7.3.2 Bacterial community assessment

The placement of the samplers directly in the water column was important for two

reasons: first, to minimize between-site differences in bacterial communities specific to the

nature of the river substrate; and second, to specifically characterize the members of the

bacterial community that were most accessible either by direct contact with the river, or through

downstream applications such as irrigation or drinking water sources. In this manner, the

samplers can be used as a proxy for the indigenous bacterial community and utilized for

quantitative PCR comparison of mobile genes. The use of a sand substrate for colonization

inside the samplers was considered preferable to simply filtering water samples since this could

allow for the establishment of biofilm communities that may not be evident otherwise. These

communities therefore represent an accumulated population from the 4-month sampling period as

opposed to a single sampling event. Although not originally foreseen, this method had a

particular advantage over single time point sampling due to the presence of sediment washed

into the samplers after rain events. This was originally seen as a disadvantage since the goal

was to maintain a standardized substrate across all streams. In reality, however, these transient

communities that enter the water column after rain events are also accessible through direct

contact or downstream applications, making their inclusion in the samples beneficial. The

samplers still provide a standardized substrate for colonization, which minimizes the variation

and allows for comparison of streams regardless of sediment composition. In this way, the

samplers are designed specifically to capture the bacterial members that are transiently present

in the water column as well as providing a suitable substrate for the establishment of biofilm

communities that would normally be established within the sediment. Although the low DNA

yields obtained from the samplers made analysis difficult, there were distinctive communities

identified for each of the sites.

The sampler diversity was examined by T-RFLP, and band patterns were found to reflect

between-stream differences as opposed to within-sampler differences. This suggests that the

water movement through the samplers was sufficient to create homogenous conditions in terms

of oxygen and nutrient distributions. The added benefit of this consistency is that the large

volume of soil present in the samplers can be frozen and used for additional analyses at a later

date. The strong abundance of one particular species in the majority of the samplers was

unexpected, and could be an indication that the length of time that the samplers are in the stream

may need to be increased. It is possible that the initial colonization of the new substrate is

accomplished by P. fluorescens and that over a longer length of time the bacterial community

would diversify to more closely resemble the sediment community for that particular stream.

This is an issue that would need to be addressed before this method could be utilized in further

studies since the increased abundance of this one particular species interferes with subsequent

analysis. Interestingly, the only sample from the contaminated sites that showed an

overwhelming abundance of P. fluorescens was the West Holland Canal sampler; however, this

is also the site for which the available water quality measurements were taken at a great distance (8.4

km) from the sampler. The actual PWQMN site was not accessible for a sampler due to both the dimensions of

the river and the level of public presence. First, the sampler was placed upstream of the PWQMN site, and the

stretch of the canal between the sampling site and the PWQMN site is entirely agricultural;

the PWQMN data would therefore be expected to represent the worst-case scenario for

agricultural inputs to this river, and the conditions at the actual sampling site may be less

severe. Secondly, due to the volume of water moving through the canal the dissolved oxygen

levels are likely to be quite variable in this stream. Finally, this location was chosen based on

pollutants found in the sediment of the West Holland River; however, the levels of contamination

found in the water itself were not found to exceed provincial guidelines (LSRCA, 2004). Given

the depth of the canal, it is likely that the bacterial community in the water column (and

therefore in the sampler) is rarely in contact with these contaminants.

The Maskinonge Creek, Uxbridge Brook and Dyment’s Creek samples all had lower

than 20% P. fluorescens abundance. In addition to the contaminants listed in Table 7.1.1, there

are a couple of water quality parameters that could account for the lack of this species at the

LSRCA sites. First, the LSRCA sites all have significantly higher levels of phosphorus and

chloride, higher conductivity readings and lower pH values than the SVCA sites. In addition,

the dissolved oxygen (DO) values during the sampling period (June – September) were very low

for both the Maskinonge Creek and Dyment’s Creek locations (supplemental table S2). The

Uxbridge Brook location did not have DO levels below those of the reference sites; however,

when compared to the North Saugeen site it is clear that the duration of low dissolved oxygen

levels was much longer at Uxbridge Brook (around 8 mg/L for all four months as opposed to

only 2 months) and therefore the DO levels during the sampling time were lower than the annual

averages would suggest. Since P. fluorescens is a strictly aerobic organism, the extended low

dissolved oxygen levels could certainly account for the decreased abundance of this species at

the contaminated sites.

Since the Maskinonge Creek, Uxbridge Brook and Dyment’s Creek samplers all

segregated from the other sites in terms of phylogenetic diversity, it was expected that these

three sites would also segregate from the others in terms of quantitative PCR results. In terms of

IS1071 abundance, the Dyment’s Creek and Uxbridge Brook samples both had significantly

higher IS1071 abundance compared to the other sites, and all sites were significantly higher than

the North Saugeen River. The West Holland Canal and Maskinonge Creek samples were not

significantly different from each other based on IS1071 abundance and the West Holland Canal

was only slightly significantly increased over the Hamilton Creek sample (P=0.047). Given the

known association of IS1071 with catabolic operons, it was expected that the highest

abundances would occur in the sites with complex contaminants, and they did. However, I also

expected higher abundance of IS1071 at the two agricultural sites than at Uxbridge Brook.

The observed pattern could reflect the strong association of IS1071 with IncP plasmids (Dennis, 2005;

Dunon et al. 2013), which would be introduced by wastewater and landfill leachate entering the

Uxbridge Brook and Dyment’s Creek sites, respectively. The increased abundance of the sulI

antibiotic resistance gene solely at the Uxbridge Brook site is also consistent with the

expectations from wastewater outflow. All sites had comparable numbers for IncP backbone

abundance, but with different melting temperatures of the qPCR products, suggesting that there

is a much greater diversity of these plasmids than only the ones carrying the sulI genes.

However it should be noted that these primers were designed to be used in conjunction with a

specific probe and therefore the results obtained may simply be an artifact of the method used.


The broad distribution of IS1071 found in this study suggests that continued studies on

this particular element could prove interesting. Most notably, the increased abundance of IS1071

at the Dyment’s Creek location, coupled with the known diversity of contaminants entering this

stream from the groundwater, suggests that this would be an interesting location for a more

detailed study. There are a number of known contaminants in the Uxbridge Brook location, yet

the biomonitoring data do not indicate a decrease in overall ecosystem health. However the

bacterial community at the Uxbridge Brook site was clearly altered in comparison to the SVCA

sites, which merits further investigation of the impacts that this alteration has on the overall

community dynamics. The Uxbridge Brook site also had unexpectedly high levels of IS1071, as

well as high levels of sulI gene abundance. This site would therefore also be an interesting

location for a mobile element study in order to determine whether biomonitoring using

macroinvertebrates is informative for understanding the bacterial community response to

environmental contaminants. The qPCR results obtained in this study suggest that the bacterial

community is enriched in genetic elements commonly associated with IncP plasmids, including

IS1071 and sulI, which may indicate that this community represents an increased risk of

resistance gene transmission. However primers targeting the class 1 integrase gene (intI1) did

not detect this gene in any of the environmental samplers, which was unexpected since the sulI

gene is a known component of the class 1 integrons commonly found on IncP plasmids

(Schlüter et al. 2007). The reason for the absence of intI1 in the samplers is unclear, as the

primers showed equal efficiency to the IS1071 primers on control DNA and a sub-sampling

taken from the samplers earlier in the season had been positive for intI1. This is not likely to be

due to total bacterial abundance as the 16S results were comparable between the sub-sample and

the final sampler extractions, however it could be indicative of a change in the bacterial

population that resulted in a decrease in intI1 relative to the total community. The highly

specific and reproducible results obtained with the IS1071 primers suggest that these primers

may be better suited to identifying impacted bacterial communities when DNA concentration is

a limiting factor. Because IS1071 is commonly found in multiple copies in the genome, it provides

a more robust target for qPCR analysis. IS1071 is also associated with several catabolic

plasmids and transposons (Dunon et al. 2013; Van Houdt et al. 2000; Top and Springael, 2003)

and therefore provides a broader target than solely antibiotic resistance plasmids.
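The distinction drawn above between total bacterial abundance and the abundance of a gene relative to the total community can be illustrated by normalizing target gene copies to 16S rRNA gene copies. This is a minimal sketch of one common normalization, not a description of the calculations performed in this study. The function name and all copy numbers are hypothetical and chosen only to show the arithmetic.

```python
def relative_abundance(target_copies, copies_16s):
    """Return target gene copies per 16S rRNA gene copy."""
    if copies_16s <= 0:
        raise ValueError("16S rRNA gene copy number must be positive")
    return target_copies / copies_16s


# Hypothetical copy numbers per reaction, for illustration only;
# they are not measurements from this study.
early_subsample = {"intI1": 5.0e3, "16S": 1.0e7}
final_sampler = {"intI1": 5.0e1, "16S": 1.1e7}

early_ratio = relative_abundance(early_subsample["intI1"], early_subsample["16S"])
final_ratio = relative_abundance(final_sampler["intI1"], final_sampler["16S"])

print(f"early sub-sample: {early_ratio:.2e} intI1 copies per 16S copy")
print(f"final sampler:    {final_ratio:.2e} intI1 copies per 16S copy")
```

In this invented example the 16S signal is nearly unchanged while the normalized intI1 signal drops about 100-fold, which is the kind of shift that a change in community composition, rather than in total abundance, would produce.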


There are two major challenges in examining mobile elements in community samples

beyond the integron and plasmid replication genes that are currently used. The first challenge is

diversity of the nucleotide sequence of the target genes. For plasmids, the essential nature of the

replication genes provides sufficient conservation for primer design (Smalla et al. 2013; Götz et

al. 1996), and these primers have been utilized on environmental samples, albeit often in

association with a secondary probe for increased specificity. Class 1 integrons in particular are

highly conserved specifically due to selection for the antibiotic resistance genes that they are

associated with (Gillings et al. 2015). This conservation is not typical of mobile elements found

in the environment, with the result that testing for other families of mobile elements through

targeted qPCR or microarray approaches is not possible except in cases where the goal is to

track the abundance and distribution of a previously determined individual element. This was

illustrated in this study by the inconsistent results obtained through the Tn3 and Tn21 primers,

as well as the inability to target groups of RIT elements. Although global distribution of

individual mobile elements has not been commonly observed, in this study we designed qPCR

primers specifically targeting IS1071 and found this particular element to have a broad

distribution in environmental samples.

The goal of this study was to develop a reproducible method of analyzing the

anthropogenic impacts on freshwater stream bacterial communities, with the aim of classifying

reference and impacted locations for further characterization. These samplers can be used to

draw comparisons between different locations in a manner that is not dependent on sediment

quality (or availability) and is not impacted by spatial variations caused by groundwater inflow.

This characterization can also point directly to individual elements that may warrant further

investigation in the impacted sites. In this way, sites can be chosen for which metagenomic

characterization would be informative and this information can be accumulated and stored for

future analysis of general trends.


The following people are gratefully acknowledged for their insight and contributions: Jim Roy,

Alex Fitzgerald and Lee Grapentine (Environment Canada), Dave Lembcke and Rob Wilson

(LSRCA), Martha Nicol and Shaun Anthony (SVCA), Chris Jones (OBBN), landowners for

access especially Brouwer Sod Farms in Keswick and Jason Verkaik at Carron Farms, Shu Yi


(Roxana) Shen, Rosemary Saati and the other summer students for assistance, Toby Ricker for

design and construction of samplers, Ross Reid for assistance with sampler access and

placement. Funding in the form of an NSERC Discovery Grant to RF and an NSERC PGS-D

Scholarship to NR is also gratefully acknowledged. The funding agency had no role in this

study.

7.5 References

Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., ... & Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature methods, 7(5), 335-336.

Cheng, W., Chen, H., Su, C., & Yan, S. (2013). Abundance and persistence of antibiotic resistance genes in livestock farms: a comprehensive investigation in eastern China. Environment international, 61, 1-7.

Dennis, J. J. (2005). The evolution of IncP catabolic plasmids. Current opinion in biotechnology, 16(3), 291-298.

DeSantis, T. Z., Hugenholtz, P., Keller, K., Brodie, E. L., Larsen, N., Piceno, Y. M., ... & Andersen, G. L. (2006). NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic acids research, 34(suppl 2), W394-W399.

Dunon, V., Sniegowski, K., Bers, K., Lavigne, R., Smalla, K., & Springael, D. (2013). High prevalence of IncP-1 plasmids and IS1071 insertion sequences in on-farm biopurification systems and other pesticide-polluted environments. FEMS microbiology ecology, 86(3), 415-431.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.

Fierer, N., Jackson, J. A., Vilgalys, R., & Jackson, R. B. (2005). Assessment of soil microbial community structure by use of taxon-specific quantitative PCR assays. Applied and environmental microbiology, 71(7), 4117-4120.

Gillings, M.R., Gaze, W.H., Pruden, A., Smalla, K. Tiedje, J.M. and Yong-Guan, Z. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME journal doi:10.1038/ismej.2014.226

Götz, A., Pukall, R., Smit, E., Tietze, E., Prager, R., Tschäpe, H., ... & Smalla, K. (1996). Detection and characterization of broad-host-range plasmids in environmental bacteria by PCR. Applied and Environmental Microbiology, 62(7), 2621-2628.

Hilsenhoff, W.L. 1987. An improved biotic index of organic stream pollution. Great Lakes Entomology 20: 31-39.


Jechalke, S., Dealtry, S., Smalla, K., & Heuer, H. (2013). Quantification of IncP-1 plasmid prevalence in environmental samples. Applied and environmental microbiology, 79(4), 1410-1413.

Johnsen, A. R., & Karlson, U. (2007). Diffuse PAH contamination of surface soils: environmental occurrence, bioavailability, and microbial degradation. Applied Microbiology and Biotechnology, 76(3), 533-543.

Jones F C, Somers KM, Craig B, Reynoldson TB (2007) Ontario Benthos Biomonitoring Network: Protocol Manual. Queen’s Printer for Ontario.

LSRCA, 2004. Lake Simcoe Watershed Toxic Pollutant Screening Program 2004 Report. Lake Simcoe Region Conservation Authority. Drafted July 2005.

Nemergut, D. R., Martin, A. P., & Schmidt, S. K. (2004). Integron diversity in heavy-metal-contaminated mine tailings and inferences about integron evolution. Applied and environmental microbiology, 70(2), 1160-1168.

Reeder, J., & Knight, R. (2010). Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nature methods, 7(9), 668-669.

Rosenberg DM, Resh VH (1993) Freshwater biomonitoring and benthic macroinvertebrates. Chapman and Hall, New York

Roy, J. W., & Bickerton, G. (2011). Toxic groundwater contaminants: an overlooked contributor to urban stream syndrome?. Environmental science & technology, 46(2), 729-736.

Schlüter, A., Szczepanowski, R., Pühler, A., & Top, E. M. (2007). Genomics of IncP-1 antibiotic resistance plasmids isolated from wastewater treatment plants provides evidence for a widely accessible drug resistance gene pool. FEMS microbiology reviews, 31(4), 449-477.

Simpson, E.H. 1949. Measurement of diversity. Nature (London) 163:688.

SVCA, 2010. Saugeen Conservation Water Quality Status Report. Saugeen Valley Conservation Authority. Drafted March 2011.

Top, E. M., & Springael, D. (2003). The role of mobile genetic elements in bacterial adaptation to xenobiotic organic compounds. Current Opinion in Biotechnology, 14(3), 262-269.

Van Houdt, R., Toussaint, A., Ryan, M. P., Pembroke, J. T., Mergeay, M., & Adley, C. C. (2000). The Tn4371 ICE family of bacterial mobile genetic elements.


Wright JF, Sutcliffe DW, Furse MT (2000) Assessing the biological quality of fresh waters: RIVPACS and other techniques. Freshwater Biological Association, Ambleside

Yarza, P, P. Yilmaz, E. Pruesse, F.O. Glöckner, W. Ludwig, K-H Schleifer, W.B. Whitman, J. Euzéby, R. Amann and R. Rosselló-Móra. 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Reviews Microbiology 12:635-645. doi:10.1038/nrmicro3330


Chapter 8 Conclusions and Future Directions

The majority of my PhD research has been dedicated to better understanding the nature

and potential role of Recombinase in Trio (RIT) elements, a novel mobile element that involves

three linked tyrosine-based site-specific recombinases from separate sub-families. From this

work it is evident that in certain ways RIT elements could be considered comparable to insertion

sequences, MGEs that encode genes for their own movement but carry no other functional

information. From my intensive search of the extant sequence databases, it is clear that in the

vast majority of cases the RIT elements contain only the open reading frames coding for the

individual recombinase proteins that are presumably responsible for their mobility. These open

reading frames are flanked by inverted repeats that are evidently involved in the excision and re-

integration of these elements. However there are no IS families that are known to be mobilized

solely through the activity of a TBSSR, and unlike insertion sequences which can occur in high

numbers in an individual genome (Siguier et al. 2014), multiple identical RIT elements within

an individual genome are relatively rare. As was described in Chapter 4, copy numbers of RIT

elements identified to date range from 1 to 5 and the majority occur in only one copy in the

genome. Therefore their primary role is not likely to be to provide homologous regions for

genome rearrangements. IS elements also work in concert to mobilize larger segments of DNA

(composite transposons), which has not been seen in RIT elements. There were only two

instances identified in Chapter 4 where a RIT element appears to have mobilized adjacent genes.

In each of these instances the additional genes are found between the RIT element and one of

the inverted repeats. This suggests a mechanism more consistent with the transposons and ICEs

that utilize a site-specific recombinase at one end of the element. In addition, since mobility of

the RIT element was only observed during conjugation, it is possible that their role in the cell

may be specific to MGE evolution, or movement of genes between replicons, as opposed to

larger genome rearrangements. The prevalence of RIT elements in multi-replicon genomes is

also consistent with a potential role in plasmid evolution.

The experiments performed here allow for some speculation on how RIT elements may

function. The first issue to be addressed is the presence of three TBSSRs since this is not

consistent with current knowledge of tyrosine recombinases. Since the recombination reaction


proceeds in a very symmetrical manner, a role for three separate enzymes is difficult to envision. However,

site-specific recombination generally utilizes two copies of each recombinase and therefore a

symmetrical arrangement is possible. This is supported by the presence of three putative

binding sites at each end of the RIT element (two within the inverted repeat and a third within

the palindrome sequence). This is also consistent with other mobile elements such as phage λ

and Tn4430, both of which have complex binding requirements for the recombination reactions

(Hallet et al. 2004). However as only two cut sites are created in the crossover reaction it is still

unclear whether all three recombinases would be active simultaneously or whether they

facilitate different reactions (integration vs. excision or inter- vs. intra-molecular

recombination).

The second issue to be addressed from these experiments is the apparent need for

conjugation in order for the RIT element to be mobilized. Since the recombinases were

separated from the complete RIT element and were already being induced in the recipient cell,

the experiment performed in Chapter 6 should not be interpreted as indicating that RIT elements

are likely to be mobilized while conjugating into a new host. The conjugation experiment was

chosen since it allowed for a single-stranded conformation, which has been shown to be

necessary for some integron recombination reactions (Loot et al. 2010). Within the cell, single

stranded DNA is produced prior to conjugation and also by replicons that undergo rolling circle

replication. The large plasmids carrying RIT elements that I have identified are not likely to

replicate in this manner, however this type of replication has been associated with some

integrated conjugative elements (Wright et al. 2015). As RIT elements are generally found

contained within genomic islands (some of which may actually be ICEs), this provides an

opportunity whereby RIT elements could be activated specifically during either replication of

these elements or preparation for conjugative transfer. RIT elements could effectively be silent

when the ICE is integrated into the chromosome and become active when the ICE is stimulated

to excise from the chromosome. This allows for a brief time when RIT elements could mobilize

and have an impact on other targets within the cell.

There are many aspects of RIT element mobility that have yet to be determined. Most

importantly, the contribution of each individual recombinase is still an open question.

Expression plasmids containing each recombinase separately have already been prepared in the

pTrc99 backbone, however these experiments have not been performed due to the false positive


issues with the experimental design. Purification of the recombinase proteins is also ongoing at

the SCK•CEN by Rob Van Houdt in order to perform binding assays on the target site, inverted

repeats and the palindrome sequence. The presence of a palindrome sequence beyond the

inverted repeats is a unique feature compared to other MGEs. Whether this sequence provides

accessory binding sites for regulation or forms an important secondary structure for activity has

yet to be determined.

Chapter 4 was published in 2012 and there were 148 RIT elements obtained through in

silico searching at that time. Subsequent Blast searches from known elements and also

specifically from the palindrome sequences identified have brought this number to 183

(Supplemental Table 1), however this is still clearly not an exhaustive list as numbers continue

to increase with each subsequent search. Unfortunately, the prevalence of draft genomes

continues to limit our ability to examine the genomic context for many of these strains. It is

important to note however that despite their broad distribution there are remarkably few

instances of RIT elements shared between close relatives. This strongly suggests that RIT

elements are distributed between strains on other mobile elements and rarely incorporated into

more stable regions of the genome.

The inverted repeat primers developed in this project could be used to isolate additional

RIT elements from environmental samples, and additional primers could be designed based on

conserved repeats flanking other RIT elements in the collection. However, the most interesting

products of these primers would be the very rare instances where the RIT element is mobilizing

more genes than just the three recombinases. These could be preferentially retained through

size selection in order to separate them from the amplicons containing only the recombinase

genes. Of course understanding the abundance and diversity of RIT elements containing highly

similar repeats would also be informative. Although a distinctive role for the RIT elements has

not yet been established, the work described here has revealed a great deal about their

distribution, their associations and their mobility. It is an important part of the larger goal of

improving our overall understanding of the nature of the many MGEs that have yet to be

characterized.
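The way the degenerate inverted repeat primers in Table S1 relate to the two strain-specific repeats can be sketched as follows. This is a minimal illustration, not the procedure used to design the primers. It assumes the standard IUPAC meanings of W, Y and S, and it treats inosine (I) as matching any base, which is a simplification. The function names are hypothetical.

```python
import math
import re

# Allowed bases for each symbol. "I" (inosine) is treated as matching any
# base, which is a simplifying assumption made for this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "Y": "CT", "R": "AG",
    "K": "GT", "M": "AC", "N": "ACGT", "I": "ACGT",
}


def primer_to_regex(primer):
    """Translate a degenerate primer into a compiled regular expression."""
    return re.compile("".join(f"[{IUPAC[base]}]" for base in primer.upper()))


def primer_matches(primer, target):
    """True if the degenerate primer covers the target over its full length."""
    return primer_to_regex(primer).fullmatch(target.upper()) is not None


def degeneracy(primer):
    """Number of distinct sequences represented by the degenerate primer."""
    return math.prod(len(IUPAC[base]) for base in primer.upper())


# Primer sequences copied from Table S1.
IR_OLGA = "TTATGCCGATTCCCGGATTATGCCG"
IR_K31 = "TAATGCCGCGATCCGGATTATGCCG"
IR_AMBIG = "TWATGCCGIIIYCCSGATTATGCCG"
IR_LESS_AMBIG = "TTATGCCGIIIYCCSGATTATGCCG"

for name, primer in (("IR_ambig", IR_AMBIG), ("IR_less_ambig", IR_LESS_AMBIG)):
    print(name, "degeneracy:", degeneracy(primer),
          "| covers IR_Olga:", primer_matches(primer, IR_OLGA),
          "| covers IR_K31:", primer_matches(primer, IR_K31))
```

Under these assumptions IR_ambig covers both the OLGA172 and K31 repeats, whereas IR_less_ambig covers only the OLGA172 repeat because its second position is fixed as T.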

At the outset of this work, many approaches were considered to quantify the mobilome

of stream environments at different levels of contamination including microarrays and a


comprehensive set of qPCR assays. My conclusion after investigating the degree of

conservation of even related elements was that approaches based on sequence similarities or

conserved primers for many MGE families were unfeasible. Fortunately the decreasing price of

sequencing makes metagenomics of environmental samples a viable option for the near future.

This approach has not been utilized in this study for two reasons. The first is computational

cost. Although read lengths of Illumina next generation sequencing have increased from only

35 bp at the beginning of this project to as much as 250 bp currently, the full genes would need

to be assembled in order for individual mobile elements to be assigned to families. This is

completely feasible on genomic samples, and has been performed on a small number of

metagenomic samples, but the cost is significant since it requires enough sequencing depth to

assemble an entire community (3-4 Gbp of sequence, depending on quality), as well as

sufficient computing power to perform the assembly (Luo et al. 2012).
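To give a sense of scale for the figures quoted above, the number of reads needed to reach a given total yield can be estimated by dividing the yield by the read length. This is back-of-the-envelope arithmetic only: it ignores paired-end layouts, quality filtering and uneven coverage across community members, and the function name is hypothetical.

```python
def reads_required(total_bp, read_length_bp):
    """Minimum number of reads of a given length needed to reach total_bp."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bp // read_length_bp)


GBP = 10**9

for total_gbp in (3, 4):          # sequencing yield range quoted in the text
    for read_length in (35, 250):  # read lengths quoted in the text
        n_reads = reads_required(total_gbp * GBP, read_length)
        print(f"{total_gbp} Gbp at {read_length} bp: {n_reads:,} reads")
```

At 250 bp, 3 to 4 Gbp corresponds to roughly 12 to 16 million reads, while the same yield at 35 bp would require more than 85 million reads, each carrying far less information for assembling full-length genes.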

The second issue with performing metagenomics currently is the annotation of the

assembled mobility genes. Automated annotation pipelines such as MG-RAST annotate

according to function of the closest homologues, however functional annotations are limited to

general terms such as ‘integrase’, ‘mobile element protein’ or ‘transposase’. As described in

Chapter 2, these terms are uninformative given the diversity of enzymes categorized by these

terms. Moreover, the RIT element recombinases often are not annotated as mobile elements at

all but rather as either hypothetical proteins or components of ‘recombination and repair’

functions. Therefore the annotated TBSSR component of the metagenome can alternatively be

found in three different annotation categories – mobile element proteins, phage integrases, or

recombination and repair. Since the majority of the TBSSR families have not been investigated,

there is little to be gained currently from metagenomics for these enzymes. This is clearly

illustrated by the unique adaptive role played by integrons. The type and extent of bacterial

adaptation potential provided by integrons was completely unknown to science until their

association with antibiotic resistance justified a more detailed investigation of their activity.

However without information on the mechanisms and distribution of other families of tyrosine

recombinases, we have a very limited ability to predict how unique these capabilities may be. In

order to better understand the potential roles of the many components of the mobilome, we must

begin to fill the knowledge gaps that clearly exist. More precise annotations, including genomic

context, will be necessary to fully understand the diversity of MGEs in environmental samples


and to begin to examine their abundance and distribution in contaminated vs. reference

ecosystems. This knowledge is important for determining the role that these MGEs play in both

individual genomes and bacterial communities, and will be essential to developing risk

assessment frameworks for monitoring both the current and developing risks of antibiotic

resistance genes in the environment. As we move into the inevitable age of environmental

metagenomics, annotation of available sequence data will continue to be the largest obstacle to

understanding bacterial evolution. The improvements in both sequencing technologies and

assembly algorithms have been important first steps; however increased funding for

experimental characterization of putative functions is still a limiting factor.

8 References

Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips ASM Press, Washington, D.C. USA

Loot, C., Bikard, D., Rachlin, A. and Mazel, D., 2010. Cellular pathways controlling integron cassette site folding. The EMBO journal 29(15):2623-2634.

Luo, C., D. Tsementzi, N. Kyrpides, T. Read and K. T. Konstantinidis. 2012. Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. PLoS ONE 7(2):e30087.

Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev 38: 865-891.

Wright LD, Johnson CM, Grossman AD (2015) Identification of a Single Strand Origin of Replication in the Integrative and Conjugative Element ICEBs1 of Bacillus subtilis. PLoS Genet 11(10): e1005556. doi:10.1371/journal.pgen.1005556


9 Appendix 1 Extra Tables

Table S1: Primers used in this study

Primer Name         Sequence  Reference
qPCR-intI1F         ACCAACCGAACAGGCTTATG  Nemergut et al. (2004) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-intI1R         GAGGATGCGAACCACTTCCAT
qPCR-16S-338F       ACTCCTACGGGAGGCAGCAG  Fierer et al. (2005) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-16S-518R       ATTACCGCGGCTGCTGG
sulI-F              CACCGGAAACATCGCTGCA  Luo et al. 2010. Environ. Sci. Technol. 44:7220–7225
sulI-R              AAGTTCCGCCGCAAGGCT
IncP1 korA-F        TCATCGACAACGACTACAACG  Jechalke et al 2013 Appl. Environ. Microbiol. 79(4):1410-1413.
IncP1 korA-R        TTCTTCTTGCCCTTCGCCAG
IS1071_qPCR-F       GCACCAAGTCTGGGAATGAT  This study
IS1071_qPCR-R       ACGGGCATAGTGTTTCTTGG  This study
IR_Olga             TTATGCCGATTCCCGGATTATGCCG  This study
IR_K31              TAATGCCGCGATCCGGATTATGCCG  This study
IR_ambig            TWATGCCGIIIYCCSGATTATGCCG  This study
IR_less_ambig       TTATGCCGIIIYCCSGATTATGCCG  This study
184circle-F         CCTCGCTAACGGATTCACCA  This study
184circle-R         TGGTGAATCCGTTAGCGAGG  This study
pTrc99-up_XbaI      CTTATCTAGAGTGAAATTGTTATCCGCTCACAATTCCAC  This study
pTrc99-dn_HindIII   ATGCAAGCTTGGCTGTTTTGGCGGATGAGAGAAG  This study
pTrc99-RitA-F       cttatctagacaggaaacagatcATGATTACGTGCGGGCCATTC  This study
pTrc99-RitA-R       ctagaagcttcgttgctagccTCATAGCGTGCCTCCCGCA  This study
pTrc99-RitB-F       cttatctagacaggaaacagatcATGAGCCTCACCGACCAGCTC  This study
pTrc99-RitB-R       ctagaagcttcgttgctagccTCATTGCACAGCTTCCCGGC  This study
pTrc99-RitC-F       cttatctagacaggaaacagatcATGAGCGCCGCCGCCTT  This study
pTrc99-RitC-R       tagaagcttacgttgctagcaTTAGAGACCTTCCAAGAACGCGAG  This study
K31RitA-up          CGATGATCGTCCGAGTCTGG  This study
K31RitC-dn          CACCACGGCGTCGATCCAGC  This study
Olga-RITA-up        CGTCCGTAGACGATCAAGG  This study
Olga-RITC-dn        GGACATGAATCATCTGAGACG  This study
Target1-FOR         AATTCCACCGCCCTGCACGAGCTGTCGCACTGGACGGGCTGCA  This study
Target1-REV         GCCCGTCCAGTGCGACAGCTCGTGCAGGGCGGTGG  This study
Target2-FOR         AATTCCACCGCCCTGCACGAGCTGGGCCACTGGACGGGCTGCA  This study
Target2-REV         GCCCGTCCAGTGGCCCAGCTCGTGCAGGGCGGTGG  This study
Target-up           CGACAGCTCGTGCAGGGC  This study
Target1-down        TGCACGAGCTGTCGCACTGGACGGG  This study
Target2-down        GTCCAGTGGCCCAGCTCGTGCAGGG  This study
Olga_RITBup         GCACTGCGACGTACCGAGC  This study
Olga_RITBdn         GCTATCTCAGCAGGAACTGTCC  This study
K31_RITBup          CAGGAACAGCGGCGTGTC  This study
K31_RITBdn          CTCCAACACGTACTGGTATCTGG  This study
pSF100-FOR-PstI     ATAACTGCAGATACCCACGCCGAAACAAG  This study
pSF100-REV-EcoRI    CGTCGAATTCATCGCTAGTTTGTTTTGACTCC  This study
Ampli-tet-5_out     GACGATGAGCGCATTGTTAG  K. Mijnendonckx
Ampli-tet-3_out     TCAGGGACAGCTTCAAGGAT  K. Mijnendonckx

Table S2: Dissolved oxygen values by month. Values for the Dyment’s Creek location are averages of multiple values taken each month during the 2011 field season. Values for all other sites are the corresponding 2011 value for that month from the PWQMN database. All values are in mg/L.

Site              June   July   August  September
Maskinonge River  6.38   --     5.23    --
Uxbridge Brook    8.32   8.09   8.08    8.95
Hamilton Creek    --     --     8.66    9.95
North Saugeen     10.4   8.07   8.53    10.17
Dyment's Creek    5.9    6.52   6.69    6.8


Table S3: RIT Elements documented to date.

Strain GenBank Accession Location

(Phylum - if other than Proteobacteria);Class; Order

Burkholderia phytofirmans OLGA172 RITBphyt01 beta; Burkholderiales Cupriavidus metallidurans CH34 RITCme1 chromosome 1 CP000352

1393469-1396637 beta; Burkholderiales

Burkholderia sp. Ch1-1 NZ_JH603161.1

scaff_3 4103936-4107104 beta; Burkholderiales



Novosphingobium sp. PP1Y (RIT1) NC_015580.1

1558240-1561090 alpha; Sphingomonadales

Acidiphilium multivorum AIU301 NC_015186.1

449594-452732 alpha; Rhodospirillales

Acidiphilium multivorum AIU301 pACMV1(RIT1) NC_015178.1


Acidiphilium multivorum AIU301 pACMV1 (RIT2) NC_015178.1


Acidiphilium cryptum JF-5 pACRY01 NC_009467.1


Caulobacter sp. K31 chromosome (RIT1) - NC_010338.1 NC_010338.1

2151285-2154423 alpha; Caulobacterales

Caulobacter sp. K31 chromosome (RIT2) NC_010338.1

2422880-2426018 alpha; Caulobacterales

Caulobacter sp. K31 pCAUL02 RIT1 - NC_010333.1 NC_010333.1 57564-60702 alpha; Caulobacterales Novosphingobium sp. PP1Y (RIT1) NC_015580.1


Acidiphilium cryptum JF-5 pACRY03 NC_009469.1 38619-41757 alpha; Rhodospirillales Sinorhizobium medicae WSM419 pSMED02 NC_009621.1

369069-372212 alpha; Rhizobiales

Cupriavidus metallidurans CH34 RITCme2 CP000352


Brenneria sp. EniD312 NZ_CM001230.1 3719322-3722580 gamma; Enterobacteriales

Bordetella petrii strain DSM 12804 (RIT1) NC_010170.1


Burkholderia sp. YI23 plasmid byi-1p CP003090.1


Burkholderia phymatum STM815 chromosome 2 CP001044.1


Marinobacter sp. ELB17 NZ_AAXY01000001.1 415435-418639 gamma; Alteromonadales

Marinobacter sp. ELB17 NZ_AAXY01000009.1 44734-47938 gamma; Alteromonadales

Aromatoleum aromaticum EbN1 NC_006513.1

281475-284649 beta; Rhodocyclales

Aromatoleum aromaticum EbN1 (plasmid 2) NC_006824.1


Aromatoleum aromaticum EbN1 (plasmid 2) NC_006824.1


Cupriavidus necator H16 pHG1 (RIT1) AY305378 32312-35249 beta; Burkholderiales Sinorhizobium fredii NGR234 pNGR234a (RIT2) NC_000914.2


Thauera sp. MZ1T CP001281.2 419038-422197 beta; Rhodocyclales

Candidatus Solibacter usitatus Ellin6076 (RIT1) NC_008536.1

4188436-4191597

(Acidobacteria);Solibacteres; Solibacterales

Candidatus Solibacter usitatus Ellin6076 (RIT2) NC_008536.1

9597168-9600446


Mesorhizobium loti MAFF303099 (RIT1) NC_002678.2




Cupriavidus necator H16 pHG1 (RIT2) AY305378 40646-43658 beta; Burkholderiales Acidovorax sp. NO-1 NZ_AGTS01000021.1 WGS ctg22 beta; Burkholderiales Leptothrix cholodnii SP-6 (RIT2) NC_010524.1


Candidatus Accumulibacter phosphatis clade IIA str. UW-1 pAph01 NC_013193.1

125367-128300 beta; unclassified

Leptothrix cholodnii SP-6 (RIT1) NC_010524.1




Mesorhizobium loti MAFF303099 pMLa NC_002679.1


Thiomonas sp str. 3As FP475956.1 1980031-1983223 beta; Burkholderiales

Bifidobacterium longum NCC2705 (RIT1) NC_004307.2

1146734-1149950

(Actinobacteria); Bifidobacteriales


1151346-1154562



1510118-1506902


Bifidobacterium longum F8 FP929034.1 977730-981079


Bifidobacterium longum DJ010A (RIT1) NC_010816.1 35114-38464


Bifidobacterium longum DJ010A (RIT2) NC_010816.1 389423-392773 (Actinobacteria); Bifidobacteriales

Bifidobacterium longum DJ010A (RIT3) NC_010816.1

2152995-2156345


Bifidobacterium longum DJ010A (RIT4) NC_010816.1

1541436-1544786


Bifidobacterium longum infantis 157F (RIT1) NC_015052.1

117374-120729


B. longum subsp. longum JCM1217 (RIT1) NC_015067.1

998527-1001743


B. longum subsp. longum JCM1217 (RIT2) NC_015067.1

1356344-1352994


B. longum subsp. longum JDM301 (RIT1) NC_014169.1

516654-519870



894276-897626



2274116-2270900


Burkholderia phymatum STM815 pBphy01 NC_010625.1


Burkholderia phymatum STM815 pBphy02 (RIT3) NC_010627.1




Pseudomonas aeruginosa NCM1179 DF126593.1

WGS 3897228-3900501 gamma; Pseudomonadales

Acidovorax sp. NO-1 NZ_AGTS01000037.1 WGS ctg39 beta; Burkholderiales Burkholderia phymatum STM815 pBphy02 (RIT1) NC_010627.1






Singulisphaera acidiphila DSM 18658 YP_007202199.1

2680371-2683568

Planctomycetes; Planctomycetales

Burkholderia phymatum STM815 pBphy02 (RIT2) NC_010627.1


Mesorhizobium amorphae CCNWGS0123 NZ_AGSN01000188.1

WGS ctg00205 alpha; Rhizobiales

Polaromonas sp. JS666 NC_007948.1 723577-726741 beta; Burkholderiales

Acidithiobacillus ferrooxidans ATCC 53993 NC_011206.1

309036-312209 gamma; Acidithiobacillales

Acidithiobacillus ferrooxidans ATCC 53993 NC_011206.1

619670-622843 gamma; Acidithiobacillales

Burkholderia sp. H160 NZ_ABYL01000018.1

WGS ctg00540; 70230-73427 beta; Burkholderiales

Marinobacter ELB17 NZ_AAXY01000003.1 128404-131292 gamma; Alteromonadales

Klebsiella pneumoniae 342 NC_011283.1 1834119-1836998 gamma; Enterobacteriales

Bacteroides fragilis 3.1.12 NZ_EQ973215.1 WGS 152668-155859 (Bacteroidetes) Bacteroidales

Paracoccus sp. TRP NZ_AEPN01000095.1 WGS ctg 98 alpha; Rhodobacterales

Bacteroides sp. 2_2_4 NZ_EQ973384.1 superctg1.30 (Bacteroidetes) Bacteroidales

Opitutus terrae PB90-1 NC_010571.1 1510958-1514235 (Verrucomicrobia); Opitutales




Novosphingobium pentaromativorans US6-1 NZ_AGFM01000100.1

WGS ctg00100 alpha; Sphingomonadales

Sphingomonas sp. SKA58 NZ_AAQG01000023.1 WGS 84-3662 alpha; Sphingomonadales

Novosphingobium pentaromativorans US6-1 pLA1 (RIT2) NZ_AGFM01000122.1 77388-80973 alpha; Sphingomonadales

Novosphingobium nitrogenifigens DSM 19370 NZ_AEWJ01000060.1 WGS ctg00067 alpha; Sphingomonadales

Novosphingobium sp. PP1Y (RIT2) NC_015580.1


Roseobacter litoralis Och 149 NC_015730.1 1129515-1133109 alpha; Rhodobacteriales

Verminephrobacter aporrectodeae subsp. tuberculatae At4 NZ_AFAL01000379.1 WGS ctg385 beta; Burkholderiales

Candidatus Solibacter usitatus Ellin6076 (RIT3) NC_008536.1 3194175-3198639


Burkholderia phytofirmans PsJN chromosome 1 (RIT2) NC_010681.1


Mesorhizobium amorphae CCNWGS0123 NZ_AGSN01000034.1

WGS ctg00035 alpha; Rhizobiales

Sinorhizobium fredii NGR234 pNGR234a (RIT1) NC_000914.2


Sphingopyxis alaskensis RB2256 NC_008048.1


Sphingomonas sp. KA1 pCAR3 NC_008308.1


Novosphingobium pentaromativorans US6-1 pLA1 (RIT1) NZ_AGFM01000122.1 71938-75388 alpha; Sphingomonadales

Novosphingobium sp. PP1Y (RIT3) NC_015580.1


Erythrobacter sp. SD-21 WGS NZ_ABCG01000002.1 66491-69569 alpha; Sphingomonadales

Dinoroseobacter shibae DFL 12 pDSHI01 NC_009955.1 100303-103738 alpha; Rhodobacteriales

Dinoroseobacter shibae DFL 12 pDSHI03 NC_009957.1 67841-71276 alpha; Rhodobacteriales

Mesorhizobium alhagi CCNWXJ12-2 NZ_AHAM01000339.1 WGS ctg361 alpha; Rhizobiales


Sinorhizobium meliloti 1021plasmid pSymA (RIT1) NC_003037.1


Sinorhizobium meliloti 1021plasmid pSymA (RIT2) NC_003037.1


Sulfitobacter sp. NAS-14.1 NZ_AALZ01000022.1 WGS 5396-7976 alpha; Rhodobacteriales

Pelagibacterium halotolerans B2 NC_016078.1


Agrobacterium vitis S4 pTiS4 NC_011982.1 78540-81684 alpha; Rhizobiales

Roseovarius sp. 217 NZ_AAMV01000021.1 32068-35074 alpha; Rhodobacteriales

Novosphingobium sp. PP1Y pMpl NC_015583.1


Methylocystis sp. ATCC 49242

ctg206: 54584-57731 alpha; Rhizobiales

Rhodococcus opacus B4 pROBO2 NC_012521.1 80391-83520

(Actinobacteria); Actinomycetales

Mycobacterium vanbaalenii PYR-1 NC_008726.1

6334046-6337148



scaf_3 240108-243225 beta; Burkholderiales

Frankia sp. EAN1pec NC_009921.1 495942-498771


Marinobacter aquaeolei VT8 pMAQU02 NC_008739.1

113911-117085 gamma; Alteromonadales

Rhizobium leguminosarum bv. viciae plasmid pRL11 NC_008384.1


Rhizobium leguminosarum bv. viciae plasmid pRL10 NC_008381.1 53267-56093 alpha; Rhizobiales

Mesorhizobium loti R7A symbiosis island AL672113.1


Syntrophobotulus glycolicus DSM 8271 NC_015172.1 48211-51807 (Firmicutes); Clostridiales

Desulfotomaculum gibsoniae DSM 7213 NZ_AGJQ01000018.1 WGS 99329-102528 (Firmicutes); Clostridiales

Dehalobacter sp. DCA NC_018866.1 1225943-1229157 (Firmicutes); Clostridiales

Desulfobacterium autotrophicum HRM2 NC_012108.1

3001915-3005127 delta; Desulfobacteriales

Lentibacillus sp. Grb1 NZ_AGAV01000005.1 contig005 (Firmicutes); Bacillus

Clostridium saccharolyticum WM1 NC_014376.1 3030356-3033574 (Firmicutes); Clostridiales

Bacillus sp. 10403023 (RIT2) NZ_HE610986.1 726773-729999 (Firmicutes); Bacillus

Sulfobacillus acidophilus TPY (RIT2) NC_015757.1


Sulfobacillus acidophilus TPY (RIT1) NC_015757.1


Legionella drancourtii LLAP12 JH413829.1 scaffold37 gamma; Legionellales

Heliobacterium modesticaldum Ice1 NC_010337.2


Desulfosporosinus sp. OT TOU NZ_AGAF01000082.1

WGS assembly 178 (Firmicutes); Clostridiales

Acetivibrio cellulolyticus CD2 NZ_AEDB02000020.1 WGS scf3_ctg20 16804-20008 (Firmicutes); Clostridiales

Dysgonomonas mossii DSM 22836 NZ_ADLW01000025.1 WGS ctg1.25 (Bacteroidetes); Bacteroidales

Gramella forsetii KT0803 NC_008571.1 1284588-1287826

(Bacteroidetes); Flavobacteriales

Gramella forsetii KT0803 NC_008571.1 1401240-1404478

(Bacteroidetes); Flavobacteriales

Cupriavidus necator H16 pHG1 (RIT3) NC_005241.1 51476-54721 beta; Burkholderiales

Echinicola vietnamensis DSM 1752 NC_019904.1 24444-27688 (Bacteroidetes); Cytophagia





Johnsonella ignava ATCC 51276 NZ_ACZL01000056.1 WGS ctg1.56 (Firmicutes); Clostridiales

Roseburia inulinivorans DSM 16841 NZ_ACFY01000116.1 WGS ctg476.1 (Firmicutes); Clostridiales

Bacteroides finegoldii DSM 17565 NZ_ABXI02000050.1

WGS ctg8.4 46223-49494 (Bacteroidetes); Bacteroidales

Bacillus sp. 10403023 (RIT1) NZ_HE610986.1 730450-733658 (Firmicutes); Bacillus

Prevotella buccae D17 NZ_GG739978.1 WGS superctg1.53 (Bacteroidetes); Bacteroidales

Nocardioides sp. JS614 NC_008699.1 634392-637611


Intrasporangium calvum DSM43043 NC_014830.1

948667-952111


Mesorhizobium alhagi CCNWXJ12-2 NZ_AHAM01000340.1 WGS ctg362 alpha; Rhizobiales

Aromatoleum aromaticum EbN1 NC_006513.1


Sphaerochaeta pleomorpha str. Grapes NC_016633.1

1,216,600-1,219,767 Sphaerochaeta

Corynebacterium halotolerans YIM 70093 = DSM 44683 CP003697.1

126,593-129,791

Actinobacteria; Corynebacteriales

Mycobacterium kansasii ATCC 12478 CP006835.1

3,763,863-3,767,049


Clostridium difficile QCD-63q42

Microlunatus phosphovorus NM-1 AP012204.1 5,536,190-5,539,379 Actinobacteria; Propionibacteriales

Prevotella oralis ATCC 33269

Thermoanaerobacterium thermosaccharolyticum M0795 NC_019970.1 425,454-428,667 Firmicutes; Clostridia

Sphingomonas echinoides ATCC 1482 PRJNA76627

Alpha-Proteobacteria; Sphingomonadales

Sphingobium yanoikuyae XLDN2-5 PRJNA71691

Alpha-Proteobacteria; Sphingomonadales

Pseudomonas extremaustralis 14-3 substr. 14-3b strain 14-3 PRJNA77729

Gamma-Proteobacteria; Pseudomonadales

Gordonia rhizosphera NBRC 16068 PRJDB4


Thioalkalivibrio nitratireducens DSM 14787 PRJNA178382

Gamma-Proteobacteria; Chromatiales

Bacillus sp. 10403023 RIT2 PRJEA70827 Firmicutes; Bacilli

Singulisphaera acidiphila DSM 18658 PRJNA82973 Planctomycetes; Planctomycetales

Afipia felis ATCC 53690 PRJNA52159 Alpha-Proteobacteria; Rhizobiales

Burkholderia terrae BS001 PRJNA157903 Beta-Proteobacteria; Burkholderiales

Acidocella sp. MX-AZ02 PRJNA171232 Alpha-Proteobacteria; Rhodospirillales

Celeribacter baekdonensis B30 PRJNA170411

Alpha-Proteobacteria; Rhodobacterales

Candidatus Microthrix parvicella RN1 - WGS contig 2605_44 CANL01000039.1

Actinobacteria; Candidatus Microthrix

Roseivivax atlanticus strain 22II-s10s contig25 AQQW01000025.1

10,337-13,415


Phaeobacter gallaeciensis DSM26640 pGal_B134 NC_023148.1

Phaeobacter gallaeciensis DSM26640 NC_023137.1


Bradyrhizobium sp. STM 3809 PRJEA72433 Alpha-Proteobacteria; Rhizobiales

Draconibacterium orientale strain FH5T CP007451.1

3,816,933-3,819,813

Bacteroidetes; Bacteroidales

Photorhabdus temperata subsp. temperata Meg1 PRJNA217865

Gamma-Proteobacteria; Enterobacteriales

Burkholderia glathei PRJEB6934 Beta-Proteobacteria; Burkholderiales

acid mine drainage metagenome

Mycobacterium austroafricanum strain DSM 44191 PRJEB5747


Ferrovum myxofaciens strain P3G Contig179 PRJNA255880

Beta-Proteobacteria; Ferrovales

Acidovorax sp. KKS102 RIT1 CP003872.1 1283645-1286932 Beta-Proteobacteria; Burkholderiales

Acidovorax sp. KKS102 RIT2 CP003872.1 1297698-1300985

Acidovorax sp. KKS102 RIT3 CP003872.1 1302872-1306159

Acidovorax sp. KKS102 RIT4 CP003872.1 2254833-2258120

837820-841107

Arthrobacter sp. Soil736 LMSB01000017.1 64,829-67,968

Actinobacteria; Micrococcales

Thalassobacter stenotrophicus CYRX01000028.1 WGS


Thioclava dalianensis strain DLFJ1-1 JHEH01000032.1 WGS


Paenirhodobacter enshiensis strain DW2-9 contig24_scaffold11 JFZB01000023.1

52,270-55,423



Appendix 2 Sampler Construction and Site Information

River Samplers

1 ½” x 1 ft length clear polycarbonate tube - cut with bandsaw, deburred and belt sanded

End caps: 1 ½” copper to DWM adapters for fittings – machined inside to fit and adhered with 100% polyurethane construction adhesive; window screening and nylon adhered internally to fitting ends with hot glue

Filled with fine grain sand and attached with 90 lb threaded marine approved rope to two 4L pop bottles for floatation (2” distance between sampler and pop bottle lid). Tied with the same rope to cement block with minimum of 2 ft length of rope.

2011 Site assessment

June 30, 2011: Hamilton Creek and Rocky Saugeen samplers installed

Hamilton Creek – West Back Line, just north of Chatsworth Rd. 24 (east of Williamsford)
GPS: 44°24'18"N 80°47'47"W
Upstream of bridge (east of road)
Main channel width – 11.9 m (39 ft)
Water depth and hydraulic head: 1/3 – 35 cm and no head; midchannel – 48 cm (2 cm head); 2/3 – 42 cm (1 cm head)
Velocity – 10 m travelled in 15 seconds (0.667 m/s)
Temperature – 16 °C
Transparency – clear to bottom
Sediment – silt and boulders ranging from 7 - 30 cm diameter; boulders look like pieces of cement and are covered in gravel type substance (and red on the bottom)
Surrounding vegetation – grasses from edge, lots of variety
Other notes: very wide with many dead trees – swamp? Perhaps this is low water for the region; lots of fish with stripe down the side (up to 7 cm in length)

Rocky Saugeen – 8th concession west of Traverston Rd (south of Grey Rd 12)
GPS: 44°27'35"N 81°2'42"W
Downstream of bridge (north of road)
Main channel width – 6.8 m (22 ft), narrows upstream to approx. 5.5 m
Water depth and hydraulic head measurements (taken upstream of bridge!): 1/3 – 19 cm and no head; midchannel – 24 cm (5 cm head); 2/3 – 30 cm (4 cm head)
Velocity – 10 m travelled in 16.7 seconds (0.599 m/s)
Temperature – 16 °C
Transparency – clear to bottom


Sediment – rocks ranging from 10 - 30 cm diameter
Surrounding vegetation – overhanging willow (upstream) and cedar trees; lots of riparian vegetation
Other notes: no fish observed; in a cedar forest with lots of trees; old sign hanging by river ‘Markdale fire truck water load area’; difficulty placing sampler – rocky bottom therefore hard to find solid place without tilting and water turbulent and pulling sampler down (about 20-30 cm below top of the water and only about 10 cm above rocks)

July 1, 2011

North Saugeen – 8th sideroad between conc. 4 & 6 (East of Moorsburg)
GPS: 44°20'16"N 80°56'0"W
Runs parallel to road
Main channel width – approx 15 m (50 ft)
Water depth and hydraulic head: 1/3 – 40 cm (2 cm head); midchannel – 38 cm (6 cm head); 2/3 – 30 cm (2 cm head)
Velocity – 11 m travelled in 25 seconds (0.44 m/s)
Temperature – 20 °C
Transparency – clear to bottom
Sediment – rocks (up to approx 20 cm diameter) and pebble; some boulders look like pieces of cement and are covered in gravel type substance (as seen at Hamilton Creek); moss on rocks
Surrounding vegetation – all cedar trees; many dead cedars as far as visible – widening of river?
Other notes: placed sampler in a pocket 68 cm deep and behind a tree to minimize visibility to kayakers

July 8, 2011

East Holland River at Green Lane
Width (estimated from bridge): 14.9 m at widest; 9 m under bridge
Depth: 50 cm (1/3 channel); 58 cm mid-channel
Velocity: 1 min 14 sec for 10 m
Temperature: 24 °C
Turbidity: about 10 cm; very muddy therefore quite turbid
Sediment: mud and rock (sand and boulders)
Vegetation: lots of grasses; healthy riparian region therefore turbidity not likely from erosion in this section of the river
Other observations: crayfish observed and turtle; much deeper than previous visit; green tinge to the water
Placed sampler under bridge but midstream to discourage interference – gone after one month. Could have been taken or could have been washed away in a rain event.


July 8, 2011

Uxbridge Brook at Davis Drive
Depth: 40 cm (1/3 near bank), 53 cm mid-stream, 26 cm (1/3 opposite bank)
Width: 6.5 m, 6.1 m
Velocity: 12.9 seconds for 10 m
Temperature: 21 °C
Turbidity: clear to bottom
Vegetation: lots of riparian vegetation; grasses and ferns and trees
Sediment: pebbles at edges, medium boulders covered with green plants in center
Other observations: lots of dragonflies and butterflies
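The velocities quoted in these notes are consistent with dividing the distance travelled by the elapsed time. A minimal Python sketch of that arithmetic, using the timings recorded above, is given here; the function name `surface_velocity` and the dictionary layout are illustrative assumptions, not part of the original field protocol.

```python
def surface_velocity(distance_m, time_s):
    """Return velocity in m/s for an object timed over a measured distance."""
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    return distance_m / time_s


# (distance in m, elapsed time in s) as recorded in the field notes.
# East Holland: 1 min 14 sec is entered as 74 s.
timings = {
    "Hamilton Creek": (10.0, 15.0),
    "Rocky Saugeen": (10.0, 16.7),
    "North Saugeen": (11.0, 25.0),
    "East Holland River": (10.0, 74.0),
    "Uxbridge Brook": (10.0, 12.9),
}

velocities = {
    site: surface_velocity(distance, seconds)
    for site, (distance, seconds) in timings.items()
}

for site, v in velocities.items():
    print(f"{site}: {v:.3f} m/s")
```

Timing a single pass in this way gives only a rough estimate of flow, so the values are best read as approximate.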

Location: Hamilton Creek | Rocky Saugeen/Gypsy Creek | North Saugeen | East Holland | Uxbridge Brook

Date installed: 30-Jun | 30-Jun | 01-Jul | 08-Jul | 08-Jul

Date collected: 08-Nov

Street Location West Back Line

house#443843 ncession west of

t Church Road; st of Traverston

east of Moorseburg; 8th sdrd b/t conc. 4 & 6

Green Lane/Rogers Reservoir at Davis Drive

Placement in stream: upstream of bridge | downstream from road crossing | closest to far bank | under bridge, midstream | downstream of road crossing

GPS: 44° 24' 18" N, 80° 47' 47" W | 44° 27' 35" N, 81° 2' 42" W | 44° 20' 16" N, 80° 56' 0" W

Bank Width: 11.9 m | 6.8 m (narrows to 5.5 m) | ~ 50 ft but narrows up and downstream | 9 m | 6.5 m

Water depth (cm): 35 | 19 | 40 | 58 | 40
Water depth (cm): 48 | 24 | 38 | 50 | 53
Water depth (cm): 42 | 30 | 30 | --- | 26
avg water depth (cm): 41.7 | 24.3 | 36.0 | 54.0 | 39.7

Hydraulic Head (cm): 0.0 | 0 | 2
Hydraulic Head (cm): 2.0 | 5 | 6
Hydraulic Head (cm): 1.0 | 4 | 2
avg hydraulic head (cm): 1.0 | 3.0 | 3.3


Velocity: 0.67 m/s | 0.60 m/s | 0.44 m/s | 0.13 m/s | 0.77 m/s

Temperature (Celsius): 16 | 16 | --- | 24 | 21

Transparency: clear to bottom | clear to bottom | clear to bottom | about 10 cm; very muddy | clear to bottom except in deep pools

Sediment: silt and boulders | rocky (10-30 cm diameter) | rocks with moss and pebble | mud and rock | sand and pebble at edge, bigger rocks in middle with attached algae and green plants; deep sand in pool

Surrounding vegetation: varied; grasses from edge | overhanging willow and cedars; lots of riparian veg. | widening section therefore dead and dying cedars along length | lots of grasses; healthy riparian region therefore turbidity not likely due to erosion in this section | a lot of grass; healthy riparian vegetation - grasses & ferns etc.

Comments: boulders have gravel type substance coating; lots of 5-7 cm fish (stripe on side); very wide - not a straight channel (swamp at low tide?); many dead trees | scared a deer away; no fish observed; heavily wooded area; Markdale firetruck water load area (old sign) | placed sampler in 68 cm deep pocket | crayfish observed and turtle; much deeper than previous visit; green tinge to water | a lot of dragonflies and butterflies

off of Traverston Rd

Water samples collected: 18-Jul | 18-Jul | 18-Jul | 04-Aug | 04-Aug

Water Depth (cm): 28 | 19 | 32 | 57 | 42
Water Depth (cm): 43.5 | 34 | 28 | 59 | 59
Water Depth (cm): 53 | 28 | 52 | 57
avg Water depth (cm): 41.5 | 27 | 37.3 | 58

Hydraulic head: 0 | 1 | 3 | 0 | 3
Hydraulic head: 0 | 0 | 5 | 0 | 3
Hydraulic head: 0 | 0 | 5 | 2
avg Hydraulic head (cm): 0.0 | 0.3 | 4.3 | 0

Wetted bank width: 14.3 m | 4.6 m | --- | 8.5 m | ---

Hydrolab - Temp: 26.16 | 23.38 | 25.62 | 20 | 17.62

SpC (mS/cm): 0.422 | 0.45 | 0.436 | 0.68 | 0.521

Dissolved Oxygen: 6.1 mg/L | 6.55 mg/L | 9.11 mg/L | 7.28 mg/L | 7.89 mg/L

pH: 7.84 | 7.76 | 7.36 | 7.53 | 7.64

Total Dissolved Solids: 0.3 g/L | 0.3 g/L | 0.3 g/L | 0.4 g/L | 0.3 g/L

DO%: 80 | 83 | 120.2 | 87.7 | 90.9

BOD when sampled: 6.4 mg/L | 6.5 mg/L | 8.60 mg/L | 7.68 mg/L | 7.36 mg/L

BOD after 5 days at 20C: 6.64 mg/L | 6.59 mg/L | 6.84 mg/L | ---- | ----

iPhone GPS: 44° 15' 56" N, 80° 44' 23" W | 44° 22' 7" N, 80° 53' 37" W

NOTE: rain event sampling

lots of crayfish

little fish; dark, small crayfish
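The average water depths reported for the sampler installation visits appear to be arithmetic means of whichever readings were available at each site, with missing readings (shown as ---) skipped. The Python sketch below reproduces that calculation under this assumption; the helper name `mean_of_available` and the use of `None` for a missing reading are illustrative choices, not part of the original record.

```python
def mean_of_available(readings, ndigits=1):
    """Average the readings that are present, ignoring None (missing) entries."""
    values = [r for r in readings if r is not None]
    if not values:
        return None
    return round(sum(values) / len(values), ndigits)


# Water depth readings (cm) from the installation visits, one list per site.
# None stands for a reading that was not recorded.
depth_cm = {
    "Hamilton Creek": [35, 48, 42],
    "Rocky Saugeen/Gypsy Creek": [19, 24, 30],
    "North Saugeen": [40, 38, 30],
    "East Holland": [58, 50, None],
    "Uxbridge Brook": [40, 53, 26],
}

avg_depth_cm = {site: mean_of_available(r) for site, r in depth_cm.items()}

for site, avg in avg_depth_cm.items():
    print(f"{site}: {avg} cm")
```

The same helper can be applied to the hydraulic head readings, where it yields the tabulated averages of 1.0, 3.0 and 3.3 cm for the first three sites.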
