

Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations

2021

Discovering genomic islands using DNA sequence embedding

Priyanka Banerjee Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Banerjee, Priyanka, "Discovering genomic islands using DNA sequence embedding" (2021). Graduate Theses and Dissertations. 18451. https://lib.dr.iastate.edu/etd/18451

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].


Discovering genomic islands using DNA sequence embedding

by

Priyanka Banerjee

A thesis submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Major: Computer Science

Program of Study Committee:
Iddo Friedberg, Co-major Professor
Oliver Eulenstein, Co-major Professor
Qi Li

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2021

Copyright © Priyanka Banerjee, 2021. All rights reserved.


DEDICATION

I would like to dedicate this thesis to my parents and my sister for their unconditional love and

support for my decision to study computer science.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGMENTS

ABSTRACT

CHAPTER 1. INTRODUCTION
    1.1 Genomic islands
    1.2 Predicting genomic islands
    1.3 Contributions
    1.4 Organization

CHAPTER 2. REVIEW OF LITERATURE
    2.1 Comparative genomics based GI prediction
    2.2 Sequence composition-based GI prediction
        2.2.1 Gene level GI prediction
        2.2.2 Nucleotide level GI prediction
    2.3 Machine learning in GI prediction

CHAPTER 3. METHODS AND PROCEDURES
    3.1 TreasureIsland Framework
        3.1.1 Framework Overview
        3.1.2 Model construction stage
        3.1.3 Identification of GI stage

CHAPTER 4. EXPERIMENTS
    4.1 Evaluation on model construction stage
        4.1.1 Evaluation of DNA embedding
        4.1.2 Evaluation of classifier
    4.2 Evaluation of GI identification
        4.2.1 Experiment on comparative genomics data
        4.2.2 Experiment on comparative genomics test data and literature data
        4.2.3 Experiment on unseen data

CHAPTER 5. SUMMARY AND DISCUSSION

BIBLIOGRAPHY

APPENDIX. HYPERPARAMETER TUNING: K-MER


LIST OF TABLES

Table 3.1 Parameters used for the identification of GI stage

Table 4.1 Dataset information

Table 4.2 Performance of TreasureIsland and other baseline GI predictors on 104 genomes from the M dataset

Table 4.3 Performance of TreasureIsland and other baseline GI predictors on 626 GIs and 1981 non-GIs from test-set 1

Table 4.4 Performance of TreasureIsland and other baseline GI predictors on 80 GIs from test-set 2

Table 4.5 Average GI overlap score of each predictor on the predictions from reference predictors on 6 genomes

Table .1 Accuracy of different k-mer sizes on the classifiers Logistic Regression (LR), Support Vector Machine (SVM), and K-Nearest Neighbour (KNN)


LIST OF FIGURES

Figure 3.1 An overview of the framework TreasureIsland

Figure 3.2 Construction of the DNA embedding model using the DBOW document vector model

Figure 3.3 The merging and fine-tuning phase. In this example, Tu is set to 0.75 and Tl is set to 0.50

Figure 4.1 Creating the dataset for the model construction stage

Figure 4.2 Performance of different doc2vec models and the baseline methods BoW and TF-IDF on the similarity task

Figure 4.3 Precision, Recall, F1 score and Accuracy for doc2vec DBOW and other baseline representations on the classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN)

Figure 4.4 Overall accuracy of the classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN) on different embeddings

Figure 4.6 A. Precision-recall curve B. ROC curve of the DBOW + SVM classifier model

Figure 4.7 Prediction of the Escherichia coli O157 genome NC 002695.1 measured on the reference data from dataset M


ACKNOWLEDGMENTS

I would like to take this opportunity to express my gratitude to those who helped me with various aspects of conducting this research and writing this thesis. First and foremost, I thank Dr. Iddo Friedberg for his guidance, encouragement, and involvement throughout the research. I am also thankful for the many constructive discussions with the members of Dr. Friedberg's lab. I would like to thank Dr. Oliver Eulenstein for his support and very beneficial insight into the research and the writing of the thesis. Finally, I would like to thank my committee member Dr. Qi Li, whose course on NLP proved very helpful and whose later suggestions and insights helped my research as well.


ABSTRACT

Genomic islands (GIs) are clusters of genes acquired by bacterial genomes during horizontal gene transfer (HGT). These islands play a crucial role in the evolution of bacteria by helping them adapt to changing environments, which makes the detection of GIs an important problem in medical and environmental research. There have been many previous studies on computationally identifying GIs, but most rely either on closely related genomes or on annotated nucleotide sequences, with predictions based on a fixed set of known features. Previous research on unannotated sequences has not reached good accuracy, both because little information is taken into account during prediction and because it lacks a GI boundary detection method. In this thesis, I present a machine learning-based framework called TreasureIsland, which uses an unsupervised representation of DNA sequences to predict GIs. I propose to improve GI boundary detection with a boundary fine-tuning method that attains better precision. I evaluate the efficiency of the framework using a reference dataset obtained from a comparative genomics method and from the literature. The evaluations show that the framework achieves high recall and accuracy compared to other GI predictors.


CHAPTER 1. INTRODUCTION

1.1 Genomic islands

Bacterial genomes evolve through a variety of processes, with a special place reserved for horizontal gene transfer, or HGT. HGT allows the acquisition of foreign genetic material, which provides a mechanism of quick adaptation to a changing environment by rapidly conferring new phenotypes, including stress resistance and antibiotic resistance. Genomic islands (GIs) are clusters of foreign genes acquired by HGT. GIs can be further classified, somewhat simplistically, into several subtypes based on their key gene content: pathogenicity islands (PAIs) containing pathogenic or virulent genes, resistance islands containing antimicrobial-resistance genes, symbiosis islands containing genes that establish symbiosis with legumes, and metabolic islands containing adaptive metabolic abilities [4]. GIs have some distinguishing features: (i) a size typically in the range of 10-200 kbp [11]; (ii) a sequence composition that is generally different from the core genome, specifically in terms of GC% content and dinucleotide frequency; and (iii) frequent association with tRNA-encoding genes, flanking direct repeats, and mobility genes, with a high prevalence of phage-related genes and hypothetical proteins [9]. This wide range of adaptive functions makes the identification of GIs of particular environmental and medical interest [11]; [14].

1.2 Predicting genomic islands

GIs can be discovered using experimental methods such as DNA-DNA hybridization, subtractive hybridization, or counter-selectable markers [9]; [28]; [23]. These processes for detecting strain-specific GIs can be expensive and time-consuming; hence the need for computational techniques for predicting GIs. Existing work on computationally predicting GIs is broadly divided into two types: comparative-genomic and sequence-composition. Comparative genomics-based approaches use closely related bacterial and archaeal genomes [4]. A GI in this case is identified when a cluster of genes is present in an organism but absent from all related genomes [15]. Recently, Bertelli and colleagues showed that while comparative-genomic approaches can predict GI boundaries precisely, the most obvious disadvantage of this method is its dependency on the availability of closely related genomes, along with the variance in the result depending on the selection of these genomes [4]. Sequence-composition methods are based on identifying atypical sequences in the core genome. To achieve this, these methods use various previously researched structural features, such as sequence bias in terms of GC%, dinucleotide content, codon usage or k-mer counts, and the presence of an insertion site, mobility genes, phage genes, hypothetical proteins, and direct flanking repeats [9]; [14]. Capturing most of these features requires annotated sequences, because the features are tied to genes. Prediction on unannotated sequences is therefore based solely on finding nucleotide-level biases, with a very small feature set, leading those prediction models to have low precision. Gene-level sequence-composition research, on the other hand, can predict GIs with higher precision and accuracy; its downside is the dependency on annotated sequences, which may not be available for newly sequenced genomes or may contain annotation errors. In summary, the GI prediction techniques have certain limitations: (i) the requirement of closely related genomes in the comparative genomics-based approach; (ii) the dependency on annotated genomes in the gene-level sequence-based approach; (iii) the lack of a good feature set in the nucleotide-level sequence-based approach.

To address these issues, I developed an unsupervised representation of DNA sequences that does not require computing a fixed number of features. The method overcomes the challenge of requiring annotated genomes, as it takes an unannotated DNA sequence as input. Furthermore, it does not require the availability of related genomes. This unsupervised algorithm can capture the semantic similarity of DNA segments, which helps in the classification task of predicting GIs.


1.3 Contributions

The major contributions of this research include:

1. Introduction of an unsupervised representation of DNA that captures the semantic similarity of DNA sequences.

2. Building a machine learning-based predictive framework TreasureIsland to detect the GIs

for any given DNA sequence.

3. Development of a technique to refine the initial GI boundary.

4. Analysis of the capabilities of this framework to accurately identify GIs.

1.4 Organization

The rest of the thesis is organized as follows. Chapter 2 reviews the current work on predicting GIs. Chapter 3 details the framework developed to address the GI prediction problem. Chapter 4 presents the evaluation of the framework and the results obtained. Chapter 5 presents the conclusion and future work.


CHAPTER 2. REVIEW OF LITERATURE

Most of the previous work on the identification of GIs can be put into two categories: comparative genomics studies and sequence-composition studies.

2.1 Comparative genomics based GI prediction

Comparative genomics-based GI prediction identifies GIs in a genome by comparing the genome structure of closely related bacterial and archaeal genomes. IslandPick is one of the most prominent comparative genomics prediction techniques; it uses both Mauve and BLAST to align genomes [15]. This technique was shown to have high accuracy and contributed a reliable reference data set that was shown to be comparable to the GIs in the literature data set. Other tools, such as tRNAcc, use both comparative genomics and the presence of tRNA to predict GIs [21].

The advantage of a comparative genomics technique is that it gives a more precise boundary and is generally more reliable. On the other hand, the process requires the availability of related genomes, which excludes all genomes without a certain number of closely related genomes. Comparative genomics prediction of GIs is also sensitive to both gene loss and HGT in the sequence [4].

2.2 Sequence composition-based GI prediction

Sequence composition-based methods generally try to identify GIs by looking for sequence anomalies in the core genome. Such an anomaly could be due to various biases, for example in GC content, dinucleotide content, or codon usage. Certain features are also widely associated with GIs, such as the presence of tRNA, mobile genes, phage-related genes, or hypothetical proteins [9, 16, 5]. Composition-based methods have the advantage of identifying more recent transfers from distantly related genomes that contain mobile genes [2]. Sequence composition methods generally work at either the gene level or the nucleotide level.

2.2.1 Gene level GI prediction

Gene-level GI prediction has seen a rise in performance in recent years, since genes are the functional unit of genomes. Some GIs contain viral structural genes or conjugation gene sets, which help their mobility. The site specificity of these genes on chromosomes is determined by integrases, and GIs often contain genes encoding integrases, mainly from the tyrosine recombinase family. For most GIs, the integration target site is within a tRNA gene. Some of the most prominent contributions have been made by IslandPath-DIMOB [2], GIHunter [7], SIGI-HMM [26], PredictBias [22] and Islander [13]. SIGI-HMM uses HMMs to predict GIs on the basis of codon usage bias [26]. PredictBias uses several features, such as insertion elements and virulence factors, to predict GIs [22]. IslandPath-DIMOB identifies dinucleotide bias in groups of eight genes combined with the presence of mobility genes [15]; its performance was later improved with extended HMM profiles for searching mobility genes [2]. Islander uses only the presence of tRNA to precisely predict GIs [13].

Gene-level prediction has been able to achieve good precision, as the boundaries can be well identified in the presence of genes, and with the growing understanding of structural features, gene-level tools make direct use of these features. This technique, on the other hand, depends on the availability of correctly annotated genomes.

2.2.2 Nucleotide level GI prediction

In nucleotide-level prediction, most methods use windows of different sizes to measure biases such as GC content, dinucleotide, or k-mer bias. Some popular prediction tools in this category, such as AlienHunter [25] and GI-SVM [18], use a sliding-window technique to find GI regions. AlienHunter computes an Interpolated Variable Order Motifs (IVOM) score, which helps identify regions that are atypical compared to the core genome in terms of GC content, dinucleotides, and codon usage. It also uses a two-state Hidden Markov Model (HMM) to identify the boundaries precisely. There are also tools, such as Zisland Explorer, that do not use a window method to identify GIs; Zisland Explorer divides the genome into sequences based on its GC profile and identifies possible candidate regions for GIs [27].

Sequence-level predictors generally take into account much less information than gene-level predictors [4]. The advantage of sequence-level prediction is that it does not require gene annotations, but this also makes it harder for such models to reach good precision. These models also lack a boundary refinement process, except for MTGIpick, which uses a version of the Markov Jensen-Shannon divergence to refine boundaries [8].

Recently, there has also been work on combining the advantages of various prediction tools into a composite or hybrid tool. The most popular of these is IslandViewer4, which combines the strengths of IslandPath-DIMOB, SIGI-HMM, Islander, and IslandPick [3].

2.3 Machine learning in GI prediction

Early research on genomic islands shed light on the features most commonly associated with GIs. This led to machine learning models that leverage one or more of these features to provide more accurate results. Most machine learning methods fall under gene-level GI prediction, taking an annotated genome sequence as input. One of the early machine learning tools is Wn-SVM, which measures a typicality score for the composition of each gene with respect to the core genome and uses a one-class SVM approach to detect GIs [24]. GIDetector combines eight gene-level features (Interpolated Variable Order Motif (IVOM) compositional score, insertion point, size, density, repeats, integrase, phage, and RNA) to classify GI from non-GI using a decision tree; the paper also identifies the more important features among the eight, such as the presence of insertion sites, phages, and repeats [6]. Later, GIHunter used the same machine learning technique of decision trees to classify and identify GIs, based on some new and revised features such as mobile gene information, intergenic distance, and highly expressed genes [7]. GIHunter trains its model on the reference data of 118 genomes from the IslandPick paper and achieves high precision and accuracy in GI prediction among the other machine learning techniques [4]. The more recent GI-SVM works on predicting GIs in unannotated genomes and states the need for such techniques for newly sequenced genomes; it uses a one-class SVM to identify GIs based on k-mer counts in sequences [18]. Another recent work on GI prediction uses deep learning. This tool, ShutterIsland, is based on comparative genomics and uses a service from PATRIC (the Pathosystems Resource Integration Center) to compare genome regions among closely related species, reaching good prediction accuracy [1].

As seen above, machine learning methods have the potential to predict GIs with better precision and recall, but sequence-level predictions, which have the general advantage of not requiring annotated genomes, still lack a good feature set. The existing models also have poor precision due to the lack of a boundary refinement procedure. This forms the motivation for my research: to find a better way of capturing features from unannotated sequences and to refine GI boundaries.


CHAPTER 3. METHODS AND PROCEDURES

3.1 TreasureIsland Framework

This chapter describes the framework used to predict the genomic islands from a DNA

sequence.

3.1.1 Framework Overview

The computational framework I developed consists of two stages: (i) the prediction model construction stage, which builds a GI/non-GI classifier, and (ii) the GI identification stage, which locates GIs in an input DNA sequence. As Figure 3.1 shows at a high level, in the first stage I build an embedding model that represents variable-length DNA as fixed-length vectors. These vectors are then used to classify segments of DNA into GI or non-GI regions of a genome. At the end of the first stage, I am left with an embedding model and a classifier for DNA segments. In the second stage, I take a DNA sequence as input and divide it into non-overlapping segments of a certain size. These segments are then embedded and classified using the embedding and classifier models, respectively, from the first stage. The segments classified as GI are then processed to refine their boundaries, yielding the GI regions within the input DNA.

3.1.2 Model construction stage

In this stage, I construct the embedding and classifier models for DNA segments.

3.1.2.1 DNA embedding

Background To make use of the power of machine learning techniques, biological sequence data must be converted to a form that machines can work with. Previously, the popular method of embedding DNA data was one-hot encoding; given the volume of DNA data, one-hot encoding becomes an expensive technique.

Figure 3.1: An overview of the framework TreasureIsland

With the advances in NLP word embedding techniques, DNA embedding methods improved as well. One of the first papers on converting DNA to vectors using a word embedding model from Natural Language Processing (NLP), by Patrick Ng (2017), showed the effectiveness of dna2vec in numeric operations such as concatenation and assessing global alignment similarity. For instance, consider an operation using the Nearest-Neighbor algorithm [20].

Nearest-Neighbor($\vec{AAC} + \vec{TCT}$) $\in$ {AACTCT, TCTAAC}, and

$\vec{ACGAT} - \vec{GAT} + \vec{ATC} \approx \vec{ACATC}$

Here, the nearest neighbours of the sum of the vectors for the 3-mers AAC and TCT include their string concatenation AACTCT. The second equation shows the result of nucleotide concatenation expressed as vector arithmetic. Thus, the importance of representing DNA as vectors is demonstrated.

Natural Language Processing (NLP) has explored many word embedding techniques in recent years, as embedding is a vital preprocessing step for machine learning tasks. Some of these techniques, such as word2vec, where words are converted to vectors, have been found to be particularly powerful because they capture the semantic meaning and context of words [19]. Word2vec is a neural network that uses both target words and context words to convert a word into a fixed-length vector; the vocabulary is built from the corpus and fed into the model. The word2vec model has two variants, known as Continuous Bag of Words (CBOW) and Skip-Gram. CBOW learns representations by using the context words to predict the target word: a supervised learning setup with context words as input and the target word as output. The Skip-Gram model learns representations by using the target word to predict the context words: a supervised learning setup with the target word as input and context words as output.

After the success of word2vec, researchers extended the same idea to vectorize multiple words in the form of a sentence, paragraph, or even a document [17]. Although a weighted average of word vectors or a bag-of-words model is a simple solution, it does not capture word order. The paragraph vector is an extension of the word2vec model proposed by [17]. It converts a variable-length text into a fixed-length vector. Each paragraph is identified by a paragraph ID, which is converted to a vector representing that paragraph. There are two types of paragraph models: Distributed Memory (DM) and Distributed Bag of Words (DBOW). In the DM model, the paragraph ID is added as another word alongside the words themselves; the model learns word vectors along with paragraph vectors by trying to predict the current word using both the context words and the paragraph ID. This model is analogous to the CBOW model in word2vec. DBOW ignores the order of the words: it predicts a randomly sampled word from the paragraph given the paragraph ID. This process is analogous to the Skip-Gram model in word2vec.


DNA as a document. Biological sequences can be represented analogously to natural language. To use the power of paragraph vectors, I treat segments of DNA as a paragraph or a document (both have the same meaning in this context). A k-mer in bioinformatics is a sub-sequence of length k from biological sequence data; k-mers can be considered analogous to the words of a text document. Since DNA is made of the four nucleotides {A, C, G, T}, the maximum number of possible k-mers is 4^k. The DNA embedding process can be seen in Figure 3.2.

Figure 3.2: Construction of DNA embedding model using the DBOW document vector model

Preprocessing DNA The DNA sequence (or document), which in this case is a genomic island or a non-genomic island, is converted to lower case and represented as a sequence of k-mers (or words). There are two widely used methods for obtaining these k-mers from DNA: the sliding-window (overlapping) method and the non-overlapping method. For example, if the original sequence is GCTTAATTC, the overlapping window method (k=3) gives rise to the k-mers [GCT, CTT, TTA, TAA, AAT, ATT, TTC], while the non-overlapping method (k=3) generates [GCT, TAA, TTC].
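These two schemes can be sketched as a single helper (the function name and defaults are my own, not from the thesis):

```python
def kmers(seq: str, k: int = 3, overlapping: bool = True) -> list[str]:
    """Split a DNA sequence into its k-mer 'words' (lower-cased,
    as in the preprocessing step above)."""
    seq = seq.lower()
    # overlapping: slide by one position; non-overlapping: jump by k
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

kmers("GCTTAATTC")                     # ['gct', 'ctt', 'tta', 'taa', 'aat', 'att', 'ttc']
kmers("GCTTAATTC", overlapping=False)  # ['gct', 'taa', 'ttc']
```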

Constructing a DNA embedding model Each GI and non-GI is converted into a document, which contains k-mers as its words and a unique paragraph ID as its tag. Two different paragraph vector models are trained: the Distributed Memory (DM) model and the Distributed Bag of Words (DBOW) model. At the end of the training phase, I obtain a paragraph vector model with a fixed vector size. The choice of k-mer generation method, the value of k, the model type, and the other hyper-parameters, such as vector size and window size, depends on the final cross-validated classification results.

3.1.2.2 Constructing classifier

After training the embedding model, I obtain vectors for the training and test data sets by inferring them via gradient descent from the embedding model (with the rest of the model parameters fixed), as stated in [17]. The training vectors are then fed into machine learning algorithms to complete a binary classification task: GI (class 1) versus non-GI (class 0).
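A minimal sketch of this classification step, using scikit-learn's SVC (an SVM is among the classifiers evaluated in Chapter 4) on stand-in random vectors; the data and hyper-parameters below are illustrative, not the thesis's actual training set or tuned settings:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 8))   # stand-in for inferred DNA vectors
y_train = np.array([0, 1] * 20)      # 0 = non-GI (class 0), 1 = GI (class 1)

# probability=True enables per-segment class probabilities, which the
# identification stage later compares against the thresholds Tu and Tl.
clf = SVC(probability=True).fit(X_train, y_train)
gi_probs = clf.predict_proba(X_train)[:, 1]   # probability of class 1 (GI)
```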

3.1.3 Identification of GI stage

This stage takes a DNA sequence as input and identifies all possible genomic islands in the sequence. The DNA-embedding model and the classifier from the first stage are used here. The parameters used in this stage are explained in Table 3.1.


Table 3.1: Parameters used for the identification of GI stage

Parameter             Notation  Description
DNA sequence          D         input DNA sequence
sequence window size  Ws        window size of the initial non-overlapping segments
kmer size             k         size of the k-mers (words) in the sequences
minimum GI size       GIm       the minimum size of a GI, set by the user
tune window size      Wt        the window size by which the borders are tuned (increased or decreased) on either side of a GI
upper threshold       Tu        the probability above which a segment is classified as a GI
lower threshold       Tl        the probability below which a segment is classified as non-GI

3.1.3.1 DNA vectors

Given the input D, it is divided into segments of fixed length: D = [d1, d2, ..., dn]. These segments are then treated as individual DNA documents. Each document is preprocessed in the same way as in the first stage, by finding the k-mers of size k. The documents are then embedded by inferring vectors from the DNA-embedding model.
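The segmentation step can be sketched as follows (the window value corresponds to the user-supplied Ws; a short toy string stands in for a genome):

```python
def segment_sequence(dna, window):
    """Split the input DNA into consecutive fixed-length segments d1..dn.
    The final segment may be shorter than the window."""
    return [dna[i:i + window] for i in range(0, len(dna), window)]

segments = segment_sequence("A" * 25, 10)
print([len(s) for s in segments])  # [10, 10, 5]
```

Each returned segment then goes through the same k-mer preprocessing and vector inference as a training document.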

3.1.3.2 DNA classification

The DNA vectors are then fed into the classifier. The probabilities p1, p2, ..., pn of class 1 (the GI class) for each of the segments in D are then measured.

3.1.3.3 Merging phase

Now that D is divided into segments and assigned a probability for each segment, the process

can move on to the merging phase. In the merging phase, adjacent GI segments are merged to

form a new larger GI segment. As can be seen from Figure 3.3, there are two thresholds set for

the merging step: Upper Threshold Tu and Lower Threshold Tl. The segments with pi greater

than Tu are considered a GI. If two or more adjacent segments are found to be greater than Tu,

the segments are merged, and the entire section is considered to be a GI. It can also be seen from

the figure that the segments with probabilities less than Tl are considered a non-GI segment. The


segments having probabilities between Tl and Tu are considered partial GI segments, the intuition being that portions of these segments may belong to either class. To find more precise borders in these portions, after the merging phase is completed the segments are prepared for the fine-tuning phase. Each GI region is analyzed for flanking segments (segments on either side of the considered GI segment). If either or both flanking segments have a probability between Tl and Tu, those segments are also attached to the identified GI region. This prepares the segments for the final fine-tuning phase, shown in the third stage of the figure.
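The merging rule can be sketched over the per-segment probabilities (a simplified, hypothetical helper; it ignores the corner case of one flank sitting between two islands):

```python
def merge_segments(probs, t_upper=0.75, t_lower=0.50):
    """Merge runs of adjacent GI segments (p > t_upper) and attach any
    flanking partial segments (t_lower <= p <= t_upper) on either side.
    Returns (start, end) segment-index ranges, end exclusive."""
    regions, i, n = [], 0, len(probs)
    while i < n:
        if probs[i] > t_upper:
            j = i
            while j < n and probs[j] > t_upper:  # merge the adjacent GI run
                j += 1
            start, end = i, j
            if start > 0 and t_lower <= probs[start - 1] <= t_upper:
                start -= 1                       # attach left flanking segment
            if end < n and t_lower <= probs[end] <= t_upper:
                end += 1                         # attach right flanking segment
            regions.append((start, end))
            i = j
        else:
            i += 1
    return regions

print(merge_segments([0.2, 0.6, 0.9, 0.8, 0.55, 0.3]))  # [(1, 5)]
```

Here segments 2 and 3 form the GI run, and the partial segments on both sides (0.6 and 0.55) are attached for fine-tuning.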

Figure 3.3: The merging and fine-tuning phase. In this example, Tu is set to 0.75 and Tl is set to

0.50


3.1.3.4 Fine-tuning phase

The fine-tuning phase is carried out to find more accurate boundaries for the GIs predicted in

the previous phase. Figure 3.3 shows that in the third stage, the GIs along with the flanking

segments are separated from the sequence D. Each GI segment obtained above has an

approximate left and right border, based on the point where it has been divided in the first step.

These borders may not contain the entire GI region or may have overshot the actual GI region.

To get more precise boundaries for the GI regions, I adopt a fine-tuning method.

A few constraints are kept in mind for this method. First, when moving outward during the tuning process, a border must have an outer limit beyond which it cannot move. If a border has a flanking segment on its side, the outer limit is set at the middle of that flanking segment; the midpoint is chosen to prevent any overlapping of GI regions in case there is another GI on the other side of the flanking segment. If a border has no flanking segment, the outer limit is the original GI border. Second, GI segments cannot be smaller than the minimum GI size GIm entered by the user. The next step is to tune the borders on both ends of the GI, namely the left and right borders.

If there is a flanking segment, intuitively, the GI border must lie either on the current GI

border or outside of the GI region. When calculating the outer border on either side, the other

end is fixed. Each time the border slides out by the tuning window Wt, the probability of the new

fragment is found. The process stops either when the current fragment probability falls below Tu or when the outer boundary limit has been reached. This ensures that if there is any flanking

segment on the side of a border, the current GI segment is expanded to include any possible GI

segments into its body.

The inner border for each side is calculated only when there is no flanking segment on that side.

Intuitively, if there is no flanking segment on the side of a GI, the GI border must lie at the

current GI border or should be inside the GI region. When calculating the inner border on either

side, the border on the other end is fixed. Each time the border slides in by the tuning window

Wt, the probability of the new fragment is found. The process is stopped either when the previous


fragment has a higher probability than the current fragment or when the minimum GI size GIm is reached.

While finalizing the borders at each end of the GI, left and right, I use the outer border if a flanking segment is present and the inner border if not. This process achieves a good balance between sensitivity and specificity in the prediction of GI segments.
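A minimal sketch of the outward tuning loop for the right border follows; gi_probability is a stand-in for the trained embedder plus classifier, and the left-border and inward cases mirror the same pattern:

```python
def tune_right_border_outward(dna, left, right, outer_limit, step,
                              gi_probability, t_upper=0.75):
    """Slide the right border outward by `step` (the tune window Wt) while
    the enlarged fragment still scores above t_upper, keeping the left end
    fixed and never passing the outer limit."""
    while (right + step <= outer_limit
           and gi_probability(dna[left:right + step]) > t_upper):
        right += step
    return right

# Toy stand-in: probability drops once the fragment extends past the island.
island_end = 40
prob = lambda frag: 0.9 if len(frag) <= island_end else 0.4
dna = "A" * 100
print(tune_right_border_outward(dna, 0, 30, 60, 5, prob))  # 40
```

With the stand-in scorer, the border slides out from 30 and stops at 40, where the next extension would drop below Tu.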


CHAPTER 4. EXPERIMENTS

In this chapter, I examine the effectiveness of the TreasureIsland framework in predicting genomic islands in microbial organisms. I first conduct experiments on the model construction stage (Section 4.1), followed by experiments on the genomic island identification stage (Section 4.2).

Dataset background Obtaining a reliable reference set for GIs has always been a challenge for researchers. The few experimentally verified GIs are not sufficient to form a reference set, yet a GI reference is necessary to evaluate the performance of predictors. The comparative genomics technique closely approximates manual curation of GI regions and is hence considered a reliable source of data. An early curated dataset was used to build IslandPath [12]. A benchmark GI dataset was first constructed in IslandPick [15] using 118 genomes and later revised to 108 genomes in IslandPath-DIMOB [2]. The 2008 IslandPick study also curated some previously published GIs, which were consolidated into the literature dataset. Apart from general GI prediction, particular research focus has been given to resistance and pathogenicity islands. A database including pathogenicity and resistance islands is constructed in [29], although it does not contain negative data. Information on the datasets can be found in Table 4.1.

Table 4.1: Dataset information

Notation Total genomes Number of GI Number of non-GI Data Source

M (Main) 104 1845 3266 [2]

E (Early) 32 269 0 [12]

L (Literature) 6 80 0 [15]

P (PAI) 111 264 0 [29]


4.1 Evaluation on model construction stage

Dataset creation The dataset used for training the document vectors includes the main

dataset M (104 organisms), the dataset E, excluding the organisms common in the L dataset and

M dataset (32 - 8 = 24 organisms), and the P dataset, which also excludes the organisms common to the L and M datasets (111 - 14 = 97 organisms). The organisms common to M are removed to make sure there are no conflicting GIs in the M dataset, and the common organisms

from L are removed to make sure the L dataset can be used for testing later. Thus, the total

positive data set is combined from 1845 (M dataset) + 172 (E dataset) + 199 (P dataset) = 2216

positive GIs. The total negative dataset of 3266 non-GIs only comes from the M dataset.

As shown in Figure 4.1, to reduce redundancy in the data for the machine learning models, which may occur due to similarity among the sequences, I run CD-HIT (Cluster Database at High Identity with Tolerance) [10] at an 80% sequence identity cut-off. CD-HIT clusters sequences whose similarity exceeds the set threshold. This resulted in positive and negative datasets of 1900 and 1607 sequences, respectively.

Before training either model, it is important to set aside the test data. Since only the M dataset provides both positive and negative examples per organism, the test data is drawn from the M dataset alone. Thus, 20% of its positive and negative data is used as test data via a stratified train/test split, resulting in 380 GIs and 322 non-GIs in the test set (702), and 1520 GIs and 1285 non-GIs in the training set (2805).

4.1.1 Evaluation of DNA embedding

In general, I use gensim's doc2vec package in Python to find the paragraph vectors using both the DM and DBOW algorithms.

4.1.1.1 Setting

Dataset Since document vector models usually need a large dataset for training [17], I use the dataset obtained before CD-HIT was applied, with the test data removed.


Figure 4.1: Creating the dataset for the model construction stage

This gave me a positive dataset of 1836 GIs (2216 - 380 = 1836) and a negative dataset of 2944

non-GIs (3266 - 322 = 2944), to train the DNA embedding model.

Task and models used The first evaluation aims to understand the power of paragraph vector models on DNA sequences and their ability to find similarity among documents of the same class: the doc2vec model should place similar documents close to each other in the vector space. Cosine similarity is widely used to measure the distance between two points in a vector space. The embedding models evaluated with this method are the doc2vec distributed memory (DM) model, the doc2vec distributed bag of words (DBOW) model, a concatenated version of DM and DBOW [17], the Bag of Words (BoW) model, and the Term Frequency-Inverse Document Frequency (TF-IDF) model. BoW is the simplest representation, where a sentence is represented as a bag of words from the dictionary. TF-IDF is a weighting metric applied on top of BoW, measuring how important a word is in a sentence. The pre-processing steps for all models are kept the same: the DNA sequences for both the GI and non-GI classes are


converted to lower case and overlapping k-mers are obtained. Each of the doc2vec models (DM, DBOW, DM + DBOW) is trained on the positive and negative datasets, using a unique tag for each doc2vec TaggedDocument in gensim. For the BoW and TFIDF models, I use gensim's Dictionary, doc2bow, and TfidfModel in Python. A dictionary is built from the same training dataset, after which the BoW vector representation is obtained using doc2bow, and the TFIDF representation is obtained by applying TfidfModel to the BoW representations.

Evaluation metrics The initial evaluations of the DNA document vectors are carried out

by measuring the relatedness of the documents. This is done by measuring the first n most

similar documents, in terms of cosine similarity, for each document trained. If the document is

most similar to another document of the same class, a positive score is given by its similarity

rank. First, the n most similar documents to document x are found. The scores of the similar documents y1, y2, ..., yn at ranks 1, 2, ..., n are [n, n-1, ..., 1]. The classes are either 1 or 0 (GI or non-GI). If the class of document x equals the class of document yi, that similarity score is added to the total similarity score. The maximum similarity score is n + (n-1) + ... + 1 = n(n+1)/2. The relatedness score is calculated as total similarity score / maximum similarity score. n is set to 10 to find the 10

most similar documents to each document. This method helped in the initial tuning of the

hyper-parameters for the doc2vec model. The relatedness scores helped in understanding the

similarity in the documents belonging to the GI and non-GI classes.
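The relatedness score can be computed as follows (a sketch; the neighbor lookup itself would come from the doc2vec model's cosine-similarity ranking):

```python
def relatedness_score(query_class, neighbor_classes):
    """neighbor_classes: classes of the n most similar documents in rank
    order. The neighbor at rank r (1-based) contributes n - r + 1 points
    when its class matches the query document's class."""
    n = len(neighbor_classes)
    total = sum(n - i for i, c in enumerate(neighbor_classes) if c == query_class)
    maximum = n * (n + 1) // 2  # n + (n-1) + ... + 1
    return total / maximum

# Query is a GI (class 1); three of its four nearest neighbors are GIs.
print(relatedness_score(1, [1, 1, 0, 1]))  # (4 + 3 + 1) / 10 = 0.8
```

A score of 1.0 means every near neighbor shares the query's class; lower scores indicate mixing between the GI and non-GI classes.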

4.1.1.2 Results

As can be seen in Figure 4.2, the DBOW doc2vec model performs best in the similarity task. The DM model performs poorly, which suggests that, when using DNA embedding for the GI task, word order might not be useful information. Furthermore, the concatenated version of doc2vec does not perform well, possibly due to the poor performance of the DM model. Interestingly, the baseline models BoW and TFIDF performed well, which suggests the importance of word count and word relevance as features in DNA embedding for GI tasks. Even though they perform reasonably well, these models have the disadvantage of the high dimensionality of one-hot encoding. The DBOW model ignores word order and tries to predict a randomly sampled word from the paragraph, given the paragraph id as input. This is a simpler model than DM and uses less storage space, as it does not store the word vectors. From this experiment, I find that the doc2vec DBOW model can be powerful in understanding semantic similarity between DNA sequences of the same class.

Figure 4.2: Performance of different doc2vec models and baseline methods BoW and TF-IDF on the similarity task


4.1.2 Evaluation of classifier

4.1.2.1 Setting

Dataset The dataset used for the classification task comprises the positive and negative datasets obtained after running CD-HIT, as mentioned above. The train/test division is 80%/20% of the total positive and negative set, keeping the same balance of positive and negative data in both sets. As mentioned above, the training set consists of 2805 examples and the test set of 702 examples.

Task and models used The training and test vectors for the classifiers are first inferred from each of the doc2vec DNA embedding models (DBOW, DM, and DM+DBOW) using gradient descent in the doc2vec inference stage. The BoW and TFIDF embeddings are used as baselines, with their training and test vectors obtained from the models built in the previous experiment. The task is formulated as binary classification, with labels 1 and 0 representing GI and non-GI, respectively. The machine learning classifiers used are those most commonly associated with document classification tasks: SVM, Logistic Regression, and K-Nearest Neighbour.

Evaluation metrics The classifiers are evaluated based on overall accuracy, precision, recall, and F1-score (the harmonic mean of precision and recall). The classification task also helps evaluate the performance of the different DNA embedding models.

4.1.2.2 Results

From Figure 4.3, it can be seen that the DBOW + SVM model has the highest precision, recall, F1-score, and accuracy. Overall, SVM performs the best among all classifiers, as shown in Figure 4.4. Even though DBOW + SVM performs best in the classification task, it is interesting that the TFIDF + SVM model also performs quite well, showing that word relevance might indeed be a good factor for DNA embedding. The doc2vec DM model performs poorly in the classification task as well, in keeping with the results of the previous similarity-task experiment. The fact that results from TFIDF and BoW are sometimes comparable to the DBOW embedding could be attributed to DNA embedding involving a limited variety of k-mers, and even fewer unseen k-mers, when trained with enough data; the total number of possible k-mers over the 4 nucleotides A, T, G, C is 4^k. Apart from doc2vec DBOW performing better in the classification task, it is also important to note that training the DBOW + classifier pipeline takes the least amount of time, especially compared with the baseline BoW and TFIDF models, which use a one-hot encoding. The DBOW + SVM model's precision-recall curve and ROC (receiver operating characteristic) curve can be seen in Figure 4.6, which helps us understand the performance of this binary classifier.

Figure 4.3: Precision, Recall, F1-score and Accuracy for doc2vec DBOW and other baseline representations on classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN)

Figure 4.4: Overall accuracy of classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN) on different embeddings

Figure 4.6: (a) Precision-recall curve and (b) ROC curve of the DBOW + SVM classifier model

Hyper-parameters The hyper-parameters need to be tuned at several levels. At the pre-processing level, the hyper-parameters include the k-mer value and the choice between overlapping and non-overlapping windows for extracting the k-mers. The k values tested are {3, 4, 5, 6, 7, 8, 9}; the optimal k-mer value is chosen to be 6 and the window method is chosen to be overlapping, based on the classification results shown in the appendix. At the document embedding level, the hyper-parameters mainly tuned are vector size, window, epochs, alpha, and dbow_words for the doc2vec DBOW model, and vector size, window, epochs, alpha, and dm_concat for the doc2vec DM model. The best choice for the DBOW model is found to be vector size 50, window 10, epochs 150, alpha 0.025. In the classification task, the SVM model has hyper-parameters C, gamma, and kernel, which are optimal at C = 2, gamma = 1, kernel = RBF. These are obtained from 10-fold cross-validated grid search results.
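The cross-validated grid search can be sketched with scikit-learn. The toy data below stands in for the inferred DNA vectors, the parameter grid is illustrative, and cv is reduced from the 10 folds used here to suit the tiny example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for the 50-dimensional inferred DNA vectors and GI labels.
X, y = make_classification(n_samples=120, n_features=50, random_state=0)

# probability=True lets the fitted SVM later emit per-segment GI probabilities.
param_grid = {"C": [1, 2, 4], "gamma": [0.1, 1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

search.best_estimator_ is then the classifier carried into the identification stage.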

4.2 Evaluation of GI identification

This section evaluates the second stage of the framework, which identifies GIs from the input DNA sequence.

Data In general, any nucleotide sequence of length greater than or equal to the minimum GI size can be entered to identify GI regions, but for this evaluation I have used only whole prokaryotic genomes from the National Center for Biotechnology Information (NCBI) server. The whole genomes downloaded for experiments 1 and 2 are the 104 genomes used in the M dataset, because these genomes have properly identified positive and negative GIs. The genomes downloaded from NCBI for the experiment on unseen data are listed later in Section 4.2.3.


Evaluation metrics The general evaluation metric used for GI prediction is similar to the metric used to compare previous GI predictors [4]. The following values are computed based on nucleotide overlaps:

(i) True Positive (TP): the number of nucleotides in the positive prediction that overlap with positive reference data.
(ii) True Negative (TN): the number of nucleotides outside the positive prediction that overlap with negative reference data.
(iii) False Positive (FP): the number of nucleotides in the positive prediction that overlap with negative reference data.
(iv) False Negative (FN): the number of nucleotides outside the positive prediction that overlap with positive reference data.

Based on these values, the following evaluation metrics are used:

Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = 2 × Precision × Recall / (Precision + Recall); Accuracy = (TP + TN) / (TP + FP + TN + FN)
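These metrics follow directly from the four nucleotide counts; the counts in the example are hypothetical:

```python
def gi_metrics(tp, tn, fp, fn):
    """Nucleotide-level precision, recall, F1, and accuracy for GI prediction."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# Hypothetical counts, e.g. in kilobases of overlap.
p, r, f1, acc = gi_metrics(tp=80, tn=60, fp=20, fn=40)
print(p, acc)  # 0.8 0.7
```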

4.2.1 Experiment on comparative genomics data

This experiment is designed to understand how well the boundary detection technique works

in the second stage, using the models trained in the first stage.

4.2.1.1 Experimental setup

A total of 104 genomes are used to identify all GI regions. The predictions from TreasureIsland are compared against baseline GI prediction models that have previously shown good results: a high-precision tool based on detecting tRNA fragments (Islander), sequence composition-based tools (IslandPath-DIMOB and Sigi-HMM), and a hybrid tool (IslandViewer4). The reference dataset used for this task is the M dataset of 1845 GIs and 3266 non-GIs. Since this reference dataset is based on the comparative genomics tool IslandPick, IslandPick has not been included among the baseline methods, as that would lead to biased results.


Table 4.2: Performance of TreasureIsland and other baseline GI predictors on 104 genomes from

the M dataset

Predictor Precision Recall F1-score Accuracy

TreasureIsland 0.8946 0.9054 0.8925 0.9427

IslandViewer4 0.9025 0.7528 0.7935 0.8912

IslandPath DIMOB 0.8957 0.4515 0.5394 0.7624

SIGI-HMM 0.9585 0.1850 0.2766 0.7039

Islander 0.9807 0.1397 0.2040 0.6970

4.2.1.2 Results

An example result from the framework is shown in Figure 4.7, which shows the TreasureIsland prediction for the genome Escherichia coli O157:H7 str. Sakai (NC_002695.1) together with the positive and negative GI regions from the reference dataset. Table 4.2 shows that TreasureIsland has the highest recall, F1-score, and accuracy compared to the baseline methods. Islander shows the highest precision, as it works only by identifying tRNA as the feature, plus a few other features in its filtering technique. TreasureIsland has a precision quite close to IslandViewer4, a hybrid prediction method and a reliable tool for GI prediction. Even though the performance of this framework is good, it must be kept in mind that it is a machine learning-based predictor that has used many of these GIs and non-GIs in its training data set. This motivates further investigation of the framework in experiment 2. This experiment, however, gives a good idea of the potential of the framework to predict GIs from an input sequence, and especially of the second stage's use of the models from the first stage.

Parameters The parameters required to identify the GIs are described in Table 3.1: window size, kmer size, minimum GI size, tune window, upper threshold (Tu), and lower threshold (Tl). The GI identification results on the 104 genomes are optimal with window size 10000, kmer size 6, minimum GI size 10000 (in keeping with previous research on genomic island sizes [11]), tune window 1000, upper threshold (Tu) 0.75, and lower threshold (Tl) 0.5.


Figure 4.7: Prediction of the Escherichia coli O157 genome NC_002695.1 measured against the reference data from dataset M

4.2.2 Experiment on comparative genomics test data and literature data

This experiment is done to understand the prediction power of the framework on the test data

sets.

4.2.2.1 Experimental setup

For this experiment, two reference test sets are considered. Test-set 1 consists of all GIs not included in the training data for building the classifier: 626 GIs and 1981 non-GIs from 104 genomes. Test-set 2 consists of the previously mentioned L dataset: 80 GIs from 6 genomes. The negative datasets for each of the 6 genomes in test-set 2 are kept the same as in the original M dataset, which keeps the FP values in the test-set 2 results the same as in the experiment on comparative genomics data. The baseline predictors used for this experiment are IslandViewer 4, IslandPath-DIMOB, Sigi-HMM, and Islander.


Table 4.3: Performance of TreasureIsland and other baseline GI predictors on 626 GIs and 1981 non-GIs from test-set 1

Predictor Precision Recall F1-score Accuracy

TreasureIsland 0.8110 0.8068 0.7783 0.9254

IslandViewer4 0.8452 0.6718 0.6769 0.8803

IslandPath DIMOB 0.8269 0.3507 0.3933 0.7711

SIGI-HMM 0.9612 0.1537 0.2203 0.7601

Islander 0.9711 0.1156 0.1570 0.7471

Table 4.4: Performance of TreasureIsland and other baseline GI predictors on 80 GIs from test-set 2

Predictor Precision Recall F1-score Accuracy

TreasureIsland 0.9700 0.9351 0.9507 0.9440

IslandViewer4 0.9980 0.6691 0.7912 0.8165

IslandPath DIMOB 0.9976 0.4788 0.6361 0.6998

SIGI-HMM 1.0 0.2048 0.3133 0.5539

Islander 1.0 0.2264 0.3535 0.5600

4.2.2.2 Results

As can be seen in Table 4.3, the results from test-set 1 show that TreasureIsland has the highest recall, F1-score, and accuracy among the models compared. As in experiment 1, the precision of this framework is close to that of the IslandViewer4 predictor. Table 4.4 shows the results on the curated literature data used as test-set 2. In general, the predictors improve in precision, which means the True Positives must have increased, as the False Positives are constant due to the shared negative data set. These results show that this framework consistently performs well in recall and accuracy compared to the baseline predictors.

4.2.3 Experiment on unseen data

This experiment is designed to understand the capability of the framework to predict GIs on unseen genomes, i.e., genomes it has not been trained on in either the embedding or the classification step.


4.2.3.1 Experimental setup

To perform this experiment, 6 genomes are selected randomly: Rhizobium leguminosarum (NC_011369.1), Streptosporangium roseum (NC_013595), Stenotrophomonas maltophilia (NC_015947), Enterobacter soli (NC_015968.1), Mycobacterium tuberculosis (NC_016768.1), and Escherichia coli W (NC_017635.1). Since there is no fixed reference data set labeling the positive and negative GI regions for these genomes, I evaluate this experiment by checking for overlaps with some of the baseline methods. The baseline methods included in this experiment are IslandViewer 4, IslandPath-DIMOB, Sigi-HMM, IslandPick, and Islander. To understand how much of the predicted regions from each predictor are in common with the other predictors, I calculate the overlapping regions of each predictor with respect to the other predictors.

overlap score = (overlapping GI regions of the predictor and the reference predictor) / (total predictions from the reference predictor)
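At the nucleotide level, this score can be sketched as follows, assuming GI regions are half-open (start, end) intervals and that each predictor's own regions do not overlap one another:

```python
def overlap_length(a, b):
    """Length of the overlap between two (start, end) nucleotide intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def overlap_score(predictions, references):
    """Fraction of the reference predictor's GI nucleotides that this
    predictor's GI regions cover."""
    covered = sum(overlap_length(p, r) for r in references for p in predictions)
    total = sum(end - start for start, end in references)
    return covered / total

preds = [(100, 300), (500, 700)]
refs = [(200, 400), (600, 650)]
print(overlap_score(preds, refs))  # (100 + 50) / 250 = 0.6
```

Averaging this score over the 6 genomes, for each predictor/reference pair, yields the entries of Table 4.5.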

Table 4.5: Average GI overlap score of each predictor (rows) on the predictions from each reference predictor (columns) on 6 genomes

Predictor TreasureIsland IslandViewer4 IslandPath-DIMOB IslandPick Sigi-HMM Islander
TreasureIsland 1.0 0.7739 0.7463 0.7094 0.9106 0.8919
IslandViewer4 0.2352 1.0 1.0 1.0 1.0 1.0
IslandPath-DIMOB 0.1636 0.6324 1.0 0.3497 0.3360 0.7229
IslandPick 0.0647 0.2204 0.1422 1.0 0.1943 0.3253
Sigi-HMM 0.0931 0.3705 0.2282 0.3857 1.0 0.3542
Islander 0.0594 0.1953 0.2783 0.3488 0.1774 1.0

4.2.3.2 Results

From Table 4.5, it can be seen that TreasureIsland makes predictions that overlap with the predictions of all baselines, with the highest overlap score against the Sigi-HMM predictor, followed by the Islander predictor. Since IslandViewer4 is a composite predictor that includes IslandPath-DIMOB, Sigi-HMM, and Islander, its results overlap with most


of the other predictors. It can also be seen that the baseline predictors overlap relatively little with TreasureIsland, which indicates that TreasureIsland has the potential to make many novel predictions.


CHAPTER 5. SUMMARY AND DISCUSSION

In this thesis, I have examined the problem of predicting GIs in microorganisms and have proposed a machine learning-based framework to predict them. This framework takes unannotated nucleotide sequences as input and uses an unsupervised representation of DNA to classify GIs and non-GIs. I also introduce a boundary refinement technique to locate these GI regions precisely.

The results obtained in this thesis show that the framework has high recall and accuracy, and a precision comparable to some of the current baseline predictors. It has also been shown that this framework has the potential to discover novel GI regions not covered by other predictors. This research opens the door to an unsupervised way of representing DNA, which can be helpful in other machine learning-based DNA prediction tasks. The framework is also an important addition to the list of currently available GI predictors for uncovering more potential GI regions. The advantage of this framework is that it requires no related genomes or gene annotations to predict GIs, meaning freshly sequenced, unannotated genomes can be used to predict GI regions.

However, it is worth mentioning that, since the predictor does not use any prior information such as gene components as features, making this framework purely unsupervised, the number of GIs predicted may be inflated. These predicted GIs may fall into a biological grey zone, where it is unknown whether a region is a GI or not. Thus, as future work, it will be useful to examine the predicted GIs, find the gene annotations linked to them, and functionally categorize them. This might also uncover more possible features and help further advance research on genomic islands.


BIBLIOGRAPHY

[1] Assaf, R., Xia, F., and Stevens, R. Identifying genomic islands with deep neuralnetworks. bioRxiv (2019), 525030.

[2] Bertelli, C., and Brinkman, F. S. Improved genomic island predictions withislandpath-dimob. Bioinformatics 34, 13 (2018), 2161–2167.

[3] Bertelli, C., Laird, M. R., Williams, K. P., Group, S. F. U. R. C., Lau, B. Y.,Hoad, G., Winsor, G. L., and Brinkman, F. S. Islandviewer 4: expanded prediction ofgenomic islands for larger-scale datasets. Nucleic acids research 45, W1 (2017), W30–W35.

[4] Bertelli, C., Tilley, K. E., and Brinkman, F. S. Microbial genomic island discovery,visualization and analysis. Briefings in bioinformatics 20, 5 (2019), 1685–1698.

[5] Che, D., Hasan, M. S., and Chen, B. Identifying pathogenicity islands in bacterialpathogenomics using computational approaches. Pathogens 3, 1 (2014), 36–56.

[6] Che, D., Hockenbury, C., Marmelstein, R., and Rasheed, K. Classification ofgenomic islands using decision trees and their ensemble algorithms. BMC genomics 11, 2(2010), 1–9.

[7] Che, D., Wang, H., Fazekas, J., and Chen, B. An accurate genomic island predictionmethod for sequenced bacterial and archaeal genomes. Journal of Proteomics &Bioinformatics 7, 8 (2014), 214.

[8] Dai, Q., Bao, C., Hai, Y., Ma, S., Zhou, T., Wang, C., Wang, Y., Huo, W., Liu,X., Yao, Y., et al. Mtgipick allows robust identification of genomic islands from a singlegenome. Briefings in bioinformatics 19, 3 (2018), 361–373.

[9] Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J. Genomic islands inpathogenic and environmental microorganisms. Nature Reviews Microbiology 2, 5 (2004),414–424.

[10] Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. Cd-hit: accelerated for clustering thenext-generation sequencing data. Bioinformatics 28, 23 (2012), 3150–3152.

[11] Hacker, J., and Kaper, J. B. Pathogenicity islands and the evolution of microbes.Annual Reviews in Microbiology 54, 1 (2000), 641–679.

[12] Hsiao, W., Wan, I., Jones, S. J., and Brinkman, F. S. Islandpath: aiding detection ofgenomic islands in prokaryotes. Bioinformatics 19, 3 (2003), 418–420.


[13] Hudson, C. M., Lau, B. Y., and Williams, K. P. Islander: a database of precisely mapped genomic islands in tRNA and tmRNA genes. Nucleic Acids Research 43, D1 (2015), D48–D53.

[14] Juhas, M., Van Der Meer, J. R., Gaillard, M., Harding, R. M., Hood, D. W., and Crook, D. W. Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS Microbiology Reviews 33, 2 (2009), 376–393.

[15] Langille, M. G., Hsiao, W. W., and Brinkman, F. S. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics 9, 1 (2008), 1–10.

[16] Langille, M. G., Hsiao, W. W., and Brinkman, F. S. Detecting genomic islands using bioinformatics approaches. Nature Reviews Microbiology 8, 5 (2010), 373–382.

[17] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning (2014), PMLR, pp. 1188–1196.

[18] Lu, B., and Leong, H. W. GI-SVM: a sensitive method for predicting genomic islands based on unannotated sequence of a single genome. Journal of Bioinformatics and Computational Biology 14, 1 (2016), 1640003.

[19] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

[20] Ng, P. dna2vec: consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279 (2017).

[21] Ou, H.-Y., Chen, L.-L., Lonnen, J., Chaudhuri, R. R., Thani, A. B., Smith, R., Garton, N. J., Hinton, J., Pallen, M., Barer, M. R., et al. A novel strategy for the identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites in closely related bacteria. Nucleic Acids Research 34, 1 (2006), e3.

[22] Pundhir, S., Vijayvargiya, H., and Kumar, A. PredictBias: a server for the identification of genomic and pathogenicity islands in prokaryotes. In Silico Biology 8, 3–4 (2008), 223–234.

[23] Reyrat, J.-M., Pelicic, V., Gicquel, B., and Rappuoli, R. Counterselectable markers: untapped tools for bacterial genetics and pathogenesis. Infection and Immunity 66, 9 (1998), 4011–4017.

[24] Tsirigos, A., and Rigoutsos, I. A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Research 33, 3 (2005), 922–933.


[25] Vernikos, G. S., and Parkhill, J. Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics 22, 18 (2006), 2196–2203.

[26] Waack, S., Keller, O., Asper, R., Brodag, T., Damm, C., Fricke, W. F., Surovcik, K., Meinicke, P., and Merkl, R. Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7, 1 (2006), 1–12.

[27] Wei, W., Gao, F., Du, M.-Z., Hua, H.-L., Wang, J., and Guo, F.-B. Zisland Explorer: detect genomic islands by combining homogeneity and heterogeneity properties. Briefings in Bioinformatics 18, 3 (2017), 357–366.

[28] Winstanley, C. Spot the difference: applications of subtractive hybridisation to the study of bacterial pathogens. Journal of Medical Microbiology 51, 6 (2002), 459–467.

[29] Yoon, S. H., Park, Y.-K., and Kim, J. F. PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands. Nucleic Acids Research 43, D1 (2015), D624–D630.


APPENDIX. HYPERPARAMETER TUNING: K-MER

K-mer classification results

Table .1: Accuracy of different k-mer sizes with the classifiers Logistic Regression (LR), Support Vector Machine (SVM), and K-Nearest Neighbour (KNN), under the overlapping-window and non-overlapping-window methods.

                 overlapping window        non-overlapping window
k-mer size      LR      SVM     KNN        LR      SVM     KNN
3             0.7379  0.8219  0.7550     0.7108  0.8419  0.7906
4             0.8063  0.9160  0.8547     0.8219  0.9188  0.8490
5             0.8291  0.9373  0.8746     0.8291  0.9288  0.8618
6             0.8433  0.9530  0.9160     0.7877  0.8960  0.7991
7             0.8063  0.9188  0.8846     0.7393  0.8675  0.7920
8             0.7578  0.8974  0.8519     0.7251  0.7528  0.7051
9             0.7350  0.8746  0.8219     0.6125  0.5370  0.4986
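The two windowing schemes compared in Table .1 differ only in the stride used when splitting a DNA sequence into k-mers: an overlapping window advances one base at a time, while a non-overlapping window advances k bases. A minimal sketch of this tokenization step (the function name `kmers` is illustrative, not taken from the thesis):

```python
def kmers(seq, k, overlapping=True):
    """Split a DNA sequence into k-mer tokens.

    With overlapping=True the window slides by 1 base; otherwise it
    advances by k bases (non-overlapping), so each base appears in
    exactly one k-mer.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

print(kmers("ATGCGT", 3))                     # ['ATG', 'TGC', 'GCG', 'CGT']
print(kmers("ATGCGT", 3, overlapping=False))  # ['ATG', 'CGT']
```

The resulting k-mer lists are what the embedding model consumes before the LR, SVM, and KNN classifiers in Table .1 are trained on the sequence vectors.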