Graduate Theses and Dissertations
Iowa State University Capstones, Theses and Dissertations
2021

Discovering genomic islands using DNA sequence embedding
Priyanka Banerjee, Iowa State University
Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Banerjee, Priyanka, "Discovering genomic islands using DNA sequence embedding" (2021). Graduate Theses and Dissertations. 18451. https://lib.dr.iastate.edu/etd/18451
This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].
Discovering genomic islands using DNA sequence embedding
by
Priyanka Banerjee
A thesis submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Major: Computer Science
Program of Study Committee:
Iddo Friedberg, Co-major Professor
Oliver Eulenstein, Co-major Professor
Qi Li
The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred.
Iowa State University
Ames, Iowa
2021
Copyright © Priyanka Banerjee, 2021. All rights reserved.
DEDICATION
I would like to dedicate this thesis to my parents and my sister for their unconditional love and
support for my decision to study computer science.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Genomic islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Predicting genomic islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
CHAPTER 2. REVIEW OF LITERATURE . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Comparative genomics based GI prediction . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Sequence composition-based GI prediction . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Gene level GI prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Nucleotide level GI prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Machine learning in GI prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
CHAPTER 3. METHODS AND PROCEDURES . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 TreasureIsland Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.2 Model construction stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.3 Identification of GI stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER 4. EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 Evaluation on model construction stage . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 Evaluation of DNA embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.2 Evaluation of classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Evaluation of GI identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.1 Experiment on comparative genomics data . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Experiment on comparative genomics test data and literature data . . . . . . . 28
4.2.3 Experiment on unseen data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
CHAPTER 5. SUMMARY AND DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . 32
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
APPENDIX. HYPERPARAMETER TUNING: K-MER . . . . . . . . . . . . . . . . . . . . 36
LIST OF TABLES
Page
Table 3.1 Parameters used for the identification of GI stage . . . . . . . . . . . . . . . 13
Table 4.1 Dataset information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Table 4.2 Performance of TreasureIsland and other baseline GI predictors on 104 genomes from the M dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Table 4.3 Performance of TreasureIsland and other baseline GI predictors on 626 GIs and 1981 non-GIs from test-set 1 . . . . . . . . . . . . . . . . . . . . . . . . 29
Table 4.4 Performance of TreasureIsland and other baseline GI predictors on 80 GIs from test-set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Table 4.5 Average GI overlap score of each predictor on the predictions from reference predictors on 6 genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Table .1 Accuracy of different k-mer sizes on classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN) . . . . . . . . 36
LIST OF FIGURES
Page
Figure 3.1 An overview of the framework TreasureIsland . . . . . . . . . . . . . . . . . 9
Figure 3.2 Construction of DNA embedding model using the DBOW document vector model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 3.3 The merging and fine-tuning phase. In this example, Tu is set to 0.75 and Tl is set to 0.50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 4.1 Creating the dataset for the model construction stage . . . . . . . . . . . . . 19
Figure 4.2 Performance of different doc2vec models and baseline methods BoW and TF-IDF on similarity task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Figure 4.3 Precision, Recall, F1 score and Accuracy for doc2vec DBOW and other baseline representations on classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN) . . . . . . . . . . . . . 23
Figure 4.4 Overall accuracy of classifiers Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN) on different embeddings . . . 24
Figure 4.6 A. Precision recall curve B. ROC curve of the DBOW + SVM classifier model 24
Figure 4.7 Prediction of Escherichia coli O157 genome NC 002695.1 measured on the reference data from dataset M . . . . . . . . . . . . . . . . . . . . . . . . . . 28
ACKNOWLEDGMENTS
I would like to take this opportunity to express my gratitude to those who helped me with
various aspects of conducting the research and the writing of this thesis. First and foremost, Dr.
Iddo Friedberg, for his guidance, encouragement, and involvement throughout the research. I am
thankful for all constructive discussions from the lab members in Dr. Friedberg’s lab. I would also
like to thank Dr. Oliver Eulenstein for his support and very beneficial insight on the research and
writing of the thesis. I would also like to thank my committee member Dr. Qi Li whose course on
NLP proved very helpful and later her suggestions and insights helped my research as well.
ABSTRACT
Genomic islands (GIs) are clusters of genes acquired by bacterial genomes through the process of horizontal gene transfer (HGT). These islands play a crucial role in the evolution of bacteria by helping them adapt to changing environments. The detection of GIs is therefore an important problem in medical and environmental research. There have been many previous studies on computationally identifying GIs, but most rely either on closely related genomes or on annotated nucleotide sequences, with predictions based on a fixed set of known features. Previous research on unannotated sequences has not reached good accuracy, owing to the limited information taken into account during prediction and the lack of a GI boundary detection method. In this thesis, I present a machine learning-based framework called TreasureIsland, which uses an unsupervised representation of DNA sequences to predict GIs. I improve GI boundary detection by using a boundary fine-tuning method to attain better precision. I evaluate the efficiency of my framework using a reference dataset obtained by the comparative genomics method and from the literature. The evaluations show that this framework achieves high recall and accuracy compared to other GI predictors.
CHAPTER 1. INTRODUCTION
1.1 Genomic islands
Bacterial genomes evolve through a variety of processes, with a special place reserved for
horizontal gene transfer or HGT. HGT allows the acquisition of foreign genetic material, which
provides a mechanism of quick adaptation to a changing environment, by rapidly conferring new
phenotypes including stress resistance and antibiotic resistance. Genomic islands (GIs) are clusters of foreign genes acquired by HGT. GIs can be further classified, somewhat simplistically, into several subtypes based on their key gene content: pathogenicity islands (PAIs) containing pathogenic or virulence genes, resistance islands containing antimicrobial resistance genes, symbiosis islands containing genes that establish symbiosis with legumes, and metabolic islands containing adaptive metabolic abilities [4]. GIs have some distinguishing features: (i) a size typically in the range of 10-200 kbp [11]; (ii) a sequence composition that is generally different from the core genome, specifically in terms of GC% content and dinucleotide frequency; and (iii) frequent association with tRNA-encoding genes, flanking direct repeats, and mobility genes, with a high prevalence of phage-related genes and hypothetical proteins [9]. This wide range of adaptive functions makes the identification of GIs of particular environmental and medical interest [11, 14].
1.2 Predicting genomic islands
GIs can be discovered using experimental methods such as DNA-DNA hybridization, subtractive hybridization, or the use of counter-selectable markers [9, 28, 23]. These processes for detecting strain-specific GIs can be expensive and time-consuming; hence the need for computational techniques for predicting GIs. The existing work on computationally predicting GIs is broadly divided into two types: comparative-genomic and sequence-composition. The comparative genomics-based approaches involve the use of closely
related bacterial and archaeal genomes [4]. A GI in this case is identified when a cluster of genes
is present in one organism but absent from related genomes [15]. Recently, Bertelli and colleagues showed that while comparative-genomics-based approaches can predict GI boundaries precisely, the most obvious disadvantage of this method is its dependency on the availability of closely related genomes, along with variance in the results depending on the selection of these genomes [4]. The sequence-composition methods are based on
identifying atypical sequences in the core genome. To achieve this, these methods make use of
various structural features previously researched, such as a sequence bias in terms of GC%,
dinucleotide content, codon usage or k-mer count, presence of an insertion site, mobility gene,
phage genes, hypothetical proteins, and direct flanking repeats [9, 14]. Most of these features are gene-related, so capturing them requires annotated sequences. Prediction on unannotated sequences is therefore based solely on nucleotide-level biases, and this very small feature set leads to prediction models with low precision. Gene-level sequence-composition research, by contrast, can predict GIs with higher precision and accuracy, but it depends on annotated sequences, which may be unavailable for newly sequenced genomes or may contain annotation errors. In summary, the GI prediction techniques have certain limitations: (i) comparative genomics-based approaches require closely related genomes; (ii) gene-level sequence-based approaches depend on annotated genomes; (iii) nucleotide-level sequence-based approaches lack a good feature set.
To address the above issues, I developed an unsupervised representation of the DNA
sequences, which does not require any computation of a fixed number of features. The method
can overcome the challenge of using annotated genomes, as it takes as an input an unannotated
DNA sequence. Furthermore, it does not require the availability of related genomes. This
unsupervised algorithm can capture the semantic similarity of the DNA segments, which helps in
the classification task of predicting GIs.
1.3 Contributions
The major contributions of this research include:
1. Introduction of an unsupervised representation of DNA, that captures the semantic
similarity of DNA sequences.
2. Building a machine learning-based predictive framework TreasureIsland to detect the GIs
for any given DNA sequence.
3. Development of a technique to refine the initial GI boundary.
4. Analysis of the capabilities of this framework to accurately identify GIs.
1.4 Organization
The rest of the thesis is organized as follows. Chapter 2 provides the review of the current
work that has been done on predicting GIs. Chapter 3 provides the framework detail to deal with
the GI prediction problem. Chapter 4 presents the evaluation done on the framework and the
results obtained. Chapter 5 presents the conclusion and future work.
CHAPTER 2. REVIEW OF LITERATURE
Most of the previous work on the identification of GIs can be put into two categories: comparative genomics studies and sequence composition studies.
2.1 Comparative genomics based GI prediction
Comparative genomics-based GI prediction predicts GI in a genome by comparing the genome
structure of closely related bacterial and archaeal genomes. IslandPick is one of the most
prominent comparative genomics prediction techniques which uses both Mauve and BLAST to
align genomes [15]. This technique was shown to have high accuracy and contributed towards
building a reliable reference data set which was shown to be comparable to the GIs in the
literature data-set. Some other tools such as tRNAcc used both comparative genomics and the
presence of tRNA to predict GIs [21].
The advantage of using a comparative genomics technique is that it gives us a more precise
boundary and it is generally more reliable. On the other hand, this process also requires the
availability of related genomes, which eliminates all genomes without a certain number of closely
related genomes. Comparative genomics prediction of GIs is also sensitive to both gene loss and
HGT in the sequence [4].
2.2 Sequence composition-based GI prediction
The sequence composition-based methods generally try to identify GIs by looking for sequence
anomalies in the core genome. This anomaly could be due to various reasons such as a bias in
%GC content, dinucleotide content, or codon usage. There are also certain GI features widely
associated with GIs, such as the presence of tRNA, mobile genes, phage-related genes, or
hypothetical proteins [9, 16, 5]. The composition-based methods have the advantage of identifying
more recent transfers from distantly related genomes, which contain mobile genes [2]. The
sequence composition methods generally work at the gene level or nucleotide level.
2.2.1 Gene level GI prediction
Gene level GI prediction has seen a rise in performance in recent years since genes are the
functional unit of genomes. Some GIs contain viral structural genes or conjugation gene sets,
which helps them in their mobility. The site specificity of these genes on chromosomes is
determined by integrases. GIs often contain genes encoding integrases, mainly from the tyrosine recombinase family. For most GIs, the target site of integration is within a tRNA gene. Some of the most prominent contributions have been made by IslandPath-DIMOB
[2], GIHunter [7], SIGI-HMM [26], PredictBias [22] and Islander [13]. SIGI-HMM uses HMMs to
predict GIs on the basis of codon usage bias [26]. PredictBias uses several features, such as insertion elements and virulence factors, to predict GIs [22]. IslandPath-DIMOB detects dinucleotide bias across eight genes combined with the presence of mobility genes [15], and its performance was later improved with extended HMM profiles for searching mobility genes [2].
Islander uses only the presence of tRNA to precisely predict GIs [13].
The gene-level prediction has been able to achieve a good precision, as the boundaries can be
well identified in the presence of genes. With the rise in the understanding of structural features,
gene-level prediction makes use of these features in their tools. This technique, on the other hand,
is dependent on the availability of correctly annotated genomes.
2.2.2 Nucleotide level GI prediction
In the nucleotide level prediction, most of the predictions use windows of different sizes to
measure biases such as GC content, dinucleotides, or k-mer bias. Some popular prediction tools in
this category, such as AlienHunter [25] and GI-SVM [18], use a sliding window technique to find
out GI regions. AlienHunter computes an Interpolated Variable Order Motifs (IVOM) score
which helps in identifying atypical regions in a genome compared to the core genomic regions, in
terms of GC content, dinucleotides, and codon usage. They also make use of two-state Hidden
Markov Models (HMMs) to precisely identify the boundaries. There are also some tools, such as Zisland Explorer, that do not use a window method to identify GIs. Zisland Explorer divides the genome into sequences based on its GC profile and identifies possible candidate regions for GIs [27].
The sequence level predictors generally take into account much less information compared to
the gene level predictors [4]. The advantage of sequence level prediction is that it does not require
gene annotations, but this also makes it more difficult for such models to reach good precision. These models also lack a boundary refinement process, with the exception of MTGIpick, which uses a version of the Markov Jensen–Shannon divergence to refine boundaries [8].
Recently, there has also been some work to combine the advantages of various prediction tools
to provide a composite or hybrid tool. The most popular of these tools is the IslandViewer4 tool,
which combines the strengths of IslandPath-DIMOB, SIGI-HMM, Islander, and IslandPick to
form a composite tool [3].
2.3 Machine learning in GI prediction
Early research on genomic islands shed light on the features most commonly associated with
GIs. This led to the use of machine learning models to leverage one or more of these features to
provide more accurate results. Most machine learning methods fall under the gene level GI
prediction, where they take as an input the annotated genome sequence. One of the early
machine learning tools is Wn-SVM which measures the typicality score in terms of the
composition of each gene to belong to the core genome and uses one class SVM approach to
detect GI [24]. GIDetector, combines eight gene-level features, Interpolated Variable Order Motif
compositional (IVOM) score, insertion point, size, density, repeats, integrase, phage, RNA to
classify GI from non-GI using a decision tree. This paper also identifies the more important
features among the eight features such as the presence of insertion sites, phages and repeats [6].
Later, GIHunter uses the same machine learning technique of decision trees to classify and
identify GIs, based on some new and revised features such as mobile gene information, intergenic
distance, and highly expressed genes [7]. GIHunter trains its model on the reference data of 118
genomes from the IslandPick paper and achieves high precision and accuracy on the prediction of
GIs compared with the other machine learning techniques [4]. The more recent GI-SVM predictor works on predicting GIs in unannotated genomes and states the need for such techniques when predicting GIs in newly sequenced genomes. GI-SVM uses a one-class SVM to identify GIs based on the k-mer counts in sequences [18]. Another recent line of research on GI prediction uses deep learning techniques. This tool, ShutterIsland, based on comparative genomics, uses a service from PATRIC (the Pathosystems Resource Integration Center) to compare genome regions among closely related species and reaches a good prediction accuracy [1].
As seen above, machine learning methods have the potential to predict GIs with better precision and recall, but sequence-level predictions, which have the general advantage of not requiring annotated genomes, do not utilize a good feature set. The existing models also have poor precision due to the lack of a boundary refinement procedure. This forms the motivation for my research: to find a better way of capturing features from unannotated sequences and to work on the refinement of GI boundaries.
CHAPTER 3. METHODS AND PROCEDURES
3.1 TreasureIsland Framework
This chapter describes the framework used to predict the genomic islands from a DNA
sequence.
3.1.1 Framework Overview
The computational framework I developed consists of two stages: (i) the prediction model construction stage, for classification of GI/non-GI segments, and (ii) the GI identification stage, for the input DNA.
As seen in Figure 3.1 at a high level, in the first stage I build an embedding model which helps
me to represent the variable-length DNA in terms of fixed-length vectors. These vectors are then
used to classify the segments of DNA into a GI or a non-GI region in a genome. At the end of the
first stage, I am left with an embedding model and a classifier for DNA segments. In the second
stage, I take as input a DNA sequence and divide it into non-overlapping segments of a
certain size. These segments are then embedded and classified using the embedding and classifier
models, respectively, from the first stage. The GI classified segments are then processed to refine
the boundaries to output the GI regions within the input DNA.
3.1.2 Model construction stage
In this stage, I construct the embedding and classifier models for DNA segments.
3.1.2.1 DNA embedding
Background To make use of the power of machine learning techniques, biological sequence
data must be converted to a form that can be understood by the machines. Previously, the
popular method of embedding DNA data was to use a one-hot encoding. Given the volume of
Figure 3.1: An overview of the framework TreasureIsland
DNA data, one-hot encoding becomes an expensive technique. With advances in the NLP technique of word embedding, DNA embedding methods improved as well. One of the first papers to convert DNA to vectors using a word embedding model from Natural Language Processing (NLP), by Patrick Ng (2017), showed the effectiveness of DNA2vec in numeric operations such as concatenation and assessing global alignment similarity. For instance, consider the following operations using the Nearest-Neighbor algorithm [20]:
Nearest-Neighbor(vec(AAC) + vec(TCT)) ∈ {AACTCT, TCTAAC}

vec(ACGAT) − vec(GAT) + vec(ATC) ≈ vec(ACATC)

where vec(·) denotes the embedding vector of a k-mer.
Here, the neighbours of the 3-mers AAC and TCT when added overlap with their string
concatenation AACTCT. The second equation shows the result of the nucleotide concatenation.
Thus, the importance of representing DNA as vectors is demonstrated.
Natural Language Processing (NLP) has explored many word embedding techniques in recent years, as embedding is a vital preprocessing step for machine learning tasks. One such technique, word2vec, which converts words to vectors, has been found particularly powerful because it captures the semantic meaning and context of words [19]. It is a neural
network that uses both target words and context words to convert a word into a fixed-length
vector. The vocabulary is built from the corpus and fed into the model. The word2vec model has
two different variants, known as Continuous Bag of Words (CBOW) and Skip-Gram. The CBOW
learns representations by using the context word to predict the target word. It is a supervised
learning algorithm with context words as input and a target word as output. The Skip-Gram
model learns representations by using the target word to predict the context words. It is a
supervised learning algorithm with a target word as input and context words as output.
After the successful implementation of word2vec, researchers have tried to extend the same
idea to vectorize multiple words in the form of a sentence, paragraph, or even a document [17].
Even though the weighted averaging of word vectors and bag-of-words models were simple solutions, they did not capture word order. The paragraph vector is an extension of the word2vec model
proposed by [17]. It converts a variable-length sentence into a fixed-length vector. Each
paragraph is identified by a paragraph ID and is then converted to a vector to represent every
paragraph. There are two different types of paragraph models: Distributed Memory (DM) and Distributed Bag of Words (DBOW). In the DM model, a paragraph ID is added as another word in
addition to the words. The model learns the word vectors along with the paragraph vectors,
which is done by trying to predict the current word using both the context words and the
paragraph ID. This model is analogous to the CBOW model in word2vec. DBOW ignores the order of the words; it predicts a randomly sampled word from the paragraph given the paragraph ID. This process is analogous to the Skip-Gram model in word2vec.
DNA as a document. Biological language representation can be analogous to natural
language. To use the power of the paragraph vectors, I treat segments of DNA as a paragraph or
a document (both have the same meaning in this context). A k-mer in bioinformatics is a sub-sequence of length k drawn from biological sequence data; k-mers can be considered similar to words in a text document. Since DNA is made of the four nucleotides {A, C, G, T}, the maximum number of possible k-mers is 4^k. The DNA embedding process can be seen in Figure 3.2.
Figure 3.2: Construction of DNA embedding model using the DBOW document vector model
Preprocessing DNA The DNA sequence (or document), which in this case is a genomic island or a non-genomic island, is converted to lower case and represented as a sequence of
k-mers (or words). There are two widely used methods to obtain these k-mers from DNA. The first is the sliding window, or overlapping, method; the second is the non-overlapping method. For example, if the original sequence is GCTTAATTC, the overlapping window method (k=3) gives rise to the k-mers [GCT, CTT, TTA, TAA, AAT, ATT, TTC], while the non-overlapping method (k=3) generates [GCT, TAA, TTC].
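As a minimal sketch (not the thesis code; the function name and arguments are my own), both k-mer generation methods can be written as one helper:

```python
def kmers(seq, k=3, overlapping=True):
    """Split a DNA sequence into k-mer 'words' after lower-casing it.

    overlapping=True  -> sliding-window (overlapping) method, step 1
    overlapping=False -> non-overlapping method, step k
    Trailing fragments shorter than k are dropped in both cases.
    """
    seq = seq.lower()
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

print(kmers("GCTTAATTC", 3, True))   # ['gct', 'ctt', 'tta', 'taa', 'aat', 'att', 'ttc']
print(kmers("GCTTAATTC", 3, False))  # ['gct', 'taa', 'ttc']
```

The two calls reproduce the GCTTAATTC example above, in lower case, matching the preprocessing step.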
Constructing a DNA embedding model Each GI and non-GI is then converted into a document, which contains k-mers as its words and a unique paragraph ID as its tag. Two different paragraph vector models are trained, namely the Distributed Memory (DM) model and the Distributed Bag of Words (DBOW) model. At the end of the training phase, I obtain a paragraph vector model with a fixed-length vector size. The selection of the k-mer generation method, the value of k, the model type, and the other hyper-parameters such as vector size and window size depends on the final cross-validated classification results.
3.1.2.2 Constructing classifier
After training the embedding model, I obtain the vectors for the training and test data sets by inferring them by gradient descent from the embedding model (the rest of the model parameters are fixed), as stated in [17]. The training vectors are then fed into machine learning algorithms to complete a binary classification task: GI (class 1) or non-GI (class 0).
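As an illustration of this step, the sketch below fits a probabilistic SVM (one of the classifier families evaluated in Chapter 4) on stand-in vectors; the random arrays are placeholders for the actual inferred doc2vec embeddings:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in training data: 40 "embedding" vectors of dimension 32,
# the first 20 labelled GI (class 1), the rest non-GI (class 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 32))
y = np.array([1] * 20 + [0] * 20)

# probability=True enables predict_proba, which the identification
# stage later needs for the per-segment GI probabilities p1..pn.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

p_gi = clf.predict_proba(X[:1])[0, 1]  # column 1 = probability of class 1 (GI)
assert 0.0 <= p_gi <= 1.0
```

Enabling probability estimates matters for this framework: the merging and fine-tuning phases operate on class-1 probabilities, not on hard labels.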
3.1.3 Identification of GI stage
This stage takes as an input a DNA sequence and identifies all possible genomic islands in the
sequence. The DNA-embedding model and the classifier from the first stage are used here. The
parameters used for this stage are explained in Table 3.1.
Table 3.1: Parameters used for the identification of GI stage
Parameter Notation Description
DNA sequence D input DNA sequence
sequence window size Ws Window size of the initial non-overlapping sequence
kmer size k size of the kmer or words in the sequences
minimum gi size GIm This is the minimum size of a GI set by the user
tune window size Wt This is the window size by which the borders are tuned
(increased or decreased) on either side of a GI
upper threshold Tu probability of a segment above which a segment is classified
as GI
lower threshold Tl probability of a segment below which a segment is classified
as non-GI
3.1.3.1 DNA vectors
Given the input D, it is divided into segments of fixed length: D = [d1, d2, ..., dn]. These segments are then considered to be individual DNA documents. I take each of these documents
and preprocess the documents in the same way as the first stage, by finding the k-mers of size k.
The documents are then embedded by inferring vectors from the DNA-embedding model.
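A minimal sketch of this segmentation (the helper name and the window size value are placeholders; here the final partial window is kept):

```python
def segment_dna(dna, Ws):
    """Divide the input sequence D into non-overlapping windows d1..dn of size Ws."""
    return [dna[i:i + Ws] for i in range(0, len(dna), Ws)]

d = segment_dna("A" * 25000, Ws=10000)
print([len(s) for s in d])  # [10000, 10000, 5000]
```

Each window is then preprocessed into k-mers and embedded exactly like the training documents.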
3.1.3.2 DNA classification
The DNA vectors are then fed into the classifier. The probabilities p1, p2, ..., pn of class 1 (the GI class) are then measured for each of the segments in D.
3.1.3.3 Merging phase
Now that D is divided into segments and assigned a probability for each segment, the process
can move on to the merging phase. In the merging phase, adjacent GI segments are merged to
form a new larger GI segment. As can be seen from Figure 3.3, there are two thresholds set for
the merging step: Upper Threshold Tu and Lower Threshold Tl. The segments with pi greater
than Tu are considered GIs. If two or more adjacent segments are found to be greater than Tu, the segments are merged, and the entire section is considered to be one GI. It can also be seen from the figure that segments with probabilities less than Tl are considered non-GI segments. The
segments having probabilities between Tl and Tu are considered partial GI segments, the intuition being that portions of these segments may belong to either class. To find a more precise border in these portions, after the merging phase is completed, the segments need to be
prepared for the fine-tuning phase. Each of these GI regions is then analyzed for flanking segments (segments on either side of the considered GI segment). If either or both flanking
segments have a probability between Tl and Tu, then those segments are also attached to the
identified GI region. This is done to prepare the segments for the final fine-tuning phase, which
can be seen in the third stage of the figure.
Figure 3.3: The merging and fine-tuning phase. In this example, Tu is set to 0.75 and Tl is set to
0.50
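My reading of the merging step can be sketched as follows (an illustration, not the thesis implementation; the function name and the returned tuple layout are my own). The thresholds match the Figure 3.3 example (Tu = 0.75, Tl = 0.50):

```python
def merge_segments(p, Tu=0.75, Tl=0.50):
    """Merge adjacent high-probability segments into candidate GI regions.

    p is the list of per-segment GI probabilities p1..pn.  Returns
    (start, end, left_flank, right_flank) index tuples: start..end are
    merged segments with p_i > Tu, and a flank index is set when the
    neighbouring segment is "partial" (Tl <= p <= Tu), else None.
    """
    regions, i, n = [], 0, len(p)
    while i < n:
        if p[i] > Tu:
            start = i
            while i + 1 < n and p[i + 1] > Tu:  # merge adjacent GI segments
                i += 1
            end = i
            left = start - 1 if start > 0 and Tl <= p[start - 1] <= Tu else None
            right = end + 1 if end + 1 < n and Tl <= p[end + 1] <= Tu else None
            regions.append((start, end, left, right))
        i += 1
    return regions

# Segments 1 and 2 merge into one GI; segments 0 and 3 become its flanks.
print(merge_segments([0.60, 0.90, 0.80, 0.55, 0.10]))  # [(1, 2, 0, 3)]
```

Attaching the partial flanks here is what hands the fine-tuning phase candidate material on either side of the merged GI.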
3.1.3.4 Fine-tuning phase
The fine-tuning phase is carried out to find more accurate boundaries for the GIs predicted in
the previous phase. Figure 3.3 shows that in the third stage, the GIs along with the flanking
segments are separated from the sequence D. Each GI segment obtained above has an
approximate left and right border, based on the point where it has been divided in the first step.
These borders may not contain the entire GI region or may have overshot the actual GI region.
To get more precise boundaries for the GI regions, I adopt a fine-tuning method.
A few constraints apply to this method. First, when the borders move outward during the tuning
process, they must have an outer limit beyond which they cannot move. If a border has a flanking
segment on its side, the outer limit is set at the middle of the flanking segment; the midpoint
is used to prevent GI regions from overlapping should another GI lie on the other side of the
flanking segment. If a border has no flanking segment, the outer limit is the original GI
border. Second, a GI segment cannot be smaller than the minimum GI size GIm supplied as input.
The next step is to tune the borders on both ends of the GI, namely the left and right borders.
If there is a flanking segment, intuitively, the GI border must lie either at the current GI
border or outside the GI region. When calculating the outer border on either side, the border on
the other end is fixed. Each time the border slides out by the tuning window Wt, the probability
of the new fragment is computed. The process stops either when the current fragment's probability
falls below Tu or when the outer boundary limit is reached. This ensures that if there is a
flanking segment on the side of a border, the current GI segment is expanded to include any
possible GI segments within it.
The inner border on each side is calculated only when there is no flanking segment on that side.
Intuitively, if there is no flanking segment beside a GI, the GI border must lie at the current
GI border or inside the GI region. When calculating the inner border on either side, the border
on the other end is fixed. Each time the border slides in by the tuning window Wt, the
probability of the new fragment is computed. The process stops either when the previous fragment
has a higher probability than the current fragment or when the minimum GI size GIm is reached.
While finalizing the borders at each end of the GI, I take the outer border if a flanking
segment is present and the inner border if not. This process achieves a good balance between
sensitivity and specificity in predicting GI segments.
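The outer-border tuning described above can be sketched as follows for the right border (left-border and inner-border tuning are analogous). This is an illustrative simplification, not the thesis code; `classify` stands in for the trained classifier's GI probability for the fragment between two coordinates, and the toy stub below is purely hypothetical.

```python
# A simplified sketch of outer-border tuning for the right border.

def tune_right_border_out(left, right, outer_limit, classify,
                          w_t=1000, t_u=0.75):
    """Slide the right border out by W_t while the enlarged fragment still
    classifies as a GI (p >= T_u), never passing the outer limit."""
    while right + w_t <= outer_limit:
        if classify(left, right + w_t) < t_u:
            break                     # expansion no longer looks like a GI
        right += w_t
    return right

# Toy probability: pretend the "true" GI spans positions 0..12000, so any
# fragment ending past 12000 scores below the threshold.
toy = lambda left, right: 0.9 if right <= 12000 else 0.4

print(tune_right_border_out(0, 10000, 15000, toy))  # 12000
```

With the toy stub, the border expands from 10000 to 12000 in steps of Wt = 1000 and then stops, because the next expansion drops the fragment probability below Tu.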
CHAPTER 4. EXPERIMENTS
In this chapter, I examine the effectiveness of the TreasureIsland framework in predicting
genomic islands in microbial organisms. I first conduct experiments on the model construction
stage (Section 4.1), followed by experiments on the genomic island identification stage
(Section 4.2).
Dataset background Obtaining a reliable reference set of GIs has always been a challenge
for researchers. The few experimentally verified GIs are not sufficient on their own, so a
computationally derived GI reference is needed to evaluate the performance of the predictors.
Comparative genomics computation closely resembles manual curation of GI regions and is
therefore considered a reliable source of data. An early curated dataset was used to build
IslandPath [12]. The first benchmark GI dataset was constructed in IslandPick [15] using 118
genomes and was later revised to 108 genomes in IslandPath-DIMOB [2]. The 2008 IslandPick study
also mentions curation of some previously published GIs, which is consolidated here as the
literature dataset. Beyond general GI prediction, particular research attention has been given
to resistance and pathogenicity islands: a database including pathogenicity and resistance
islands is constructed in [29], although it contains no negative data. Information on the
datasets can be found in Table 4.1.
Table 4.1: Dataset information
Notation Total genomes Number of GI Number of non-GI Data Source
M (Main) 104 1845 3266 [2]
E (Early) 32 269 0 [12]
L (Literature) 6 80 0 [15]
P (PAI) 111 264 0 [29]
4.1 Evaluation on model construction stage
Dataset creation The dataset used for training the document vectors includes the main
dataset M (104 organisms); the dataset E, excluding the organisms shared with the L and M
datasets (32 - 8 = 24 organisms); and the P dataset, likewise excluding the organisms shared
with the L and M datasets (111 - 14 = 97 organisms). The organisms common with M are removed to
ensure there are no conflicting GIs in the M dataset, and the organisms common with L are
removed so that the L dataset can be used for testing later. The total positive dataset thus
combines 1845 (M dataset) + 172 (E dataset) + 199 (P dataset) = 2216 positive GIs. The total
negative dataset of 3266 non-GIs comes from the M dataset alone.
As shown in Figure 4.1, to reduce redundancy in the data for the machine learning models, which
may arise from sequence similarity, I run CD-HIT (Cluster Database at High Identity with
Tolerance) [10] at an 80% sequence identity cut-off. CD-HIT clusters sequences that exceed a set
similarity threshold. This resulted in positive and negative datasets of 1900 and 1607
sequences, respectively.
Before training either model, it is important to separate out the test data. Since not all
datasets provide both positive and negative data for an organism, the test data is drawn only
from the M dataset. So, 20% of this positive and negative dataset is held out as test data using
a stratified train-test split, resulting in 380 GIs and 322 non-GIs in the test set (702), and
1520 GIs and 1285 non-GIs in the training set (2805).
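The stratified 80/20 split described above can be sketched with scikit-learn. The sequence strings below are toy placeholders of my own; only the class sizes (1900 GIs, 1607 non-GIs) come from the text, and with these sizes the stratified split reproduces the reported 380/322 test composition.

```python
# A sketch of the stratified train-test split (illustrative, not the
# thesis code).
from sklearn.model_selection import train_test_split

X = [f"seq{i}" for i in range(1900 + 1607)]   # placeholder sequences
y = [1] * 1900 + [0] * 1607                   # 1 = GI, 0 = non-GI

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

print(len(X_tr), len(X_te))                   # 2805 702
print(y_te.count(1), y_te.count(0))           # 380 322
```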
4.1.1 Evaluation of DNA embedding
In general, I use gensim's doc2vec package in Python to find the paragraph vectors using both
the DM and DBOW algorithms.
4.1.1.1 Setting
Dataset Since document vector models usually need a large dataset for training
[17], I use the dataset obtained before CD-HIT was applied, with the test data removed from it.
Figure 4.1: Creating the dataset for the model construction stage
This gave me a positive dataset of 1836 GIs (2216 - 380 = 1836) and a negative dataset of 2944
non-GIs (3266 - 322 = 2944), to train the DNA embedding model.
Task and models used The first evaluation assesses the power of the paragraph
vector models on DNA sequences and their ability to find similarity among documents of the same
class: a doc2vec model should place similar documents close to each other in the vector space.
Cosine similarity is widely used to measure the distance between two points in a vector space.
The embedding models measured with this method are the doc2vec distributed memory (DM) model,
the doc2vec distributed bag of words (DBOW) model, a concatenated version of DM and DBOW [17],
a Bag of Words (BoW) model, and a Term Frequency-Inverse Document Frequency (TF-IDF) model. BoW
is the simplest representation, in which a sentence is represented as a bag of words from the
dictionary; TF-IDF is a weighting metric applied on top of BoW that measures how important a
word is in a sentence. The pre-processing steps are kept the same for every model: the DNA
sequences for both the GI and non-GI classes are converted to lower case and overlapping k-mers
are obtained. Each of the doc2vec models (DM, DBOW, DM + DBOW) is trained on the positive and
negative datasets, using a unique tag for each doc2vec TaggedDocument in gensim. For the BoW and
TF-IDF models, I use gensim's Dictionary, doc2bow, and TfidfModel packages in Python: a
dictionary is built from the same training dataset, after which the BoW vector representation is
obtained with doc2bow and the TF-IDF representation is obtained by applying TfidfModel to the
BoW representations.
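The pre-processing step (lower-casing and overlapping k-mer extraction) can be sketched in plain Python; the resulting token lists are what would be wrapped in gensim TaggedDocument objects for doc2vec training. The function name is illustrative, not from the thesis code.

```python
# A sketch of k-mer "tokenization" of a DNA sequence into document words.

def to_kmers(seq, k=6, overlapping=True):
    """Lower-case a DNA sequence and split it into k-mers; a step of 1
    gives overlapping k-mers, a step of k gives non-overlapping ones."""
    seq = seq.lower()
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

print(to_kmers("ATGCGTAC", k=6))                     # overlapping 6-mers
print(to_kmers("ATGCGTAC", k=4, overlapping=False))  # non-overlapping 4-mers
```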
Evaluation metrics The initial evaluation of the DNA document vectors measures
the relatedness of the documents. For each trained document, the n most similar documents in
terms of cosine similarity are found. If a similar document belongs to the same class as the
query document, a positive score determined by its similarity rank is awarded: the similar
documents y1, y2, ..., yn at ranks 1, 2, ..., n receive scores n, n-1, ..., 1. The classes are
either 1 or 0 (GI or non-GI). If the class of document x equals the class of document yi, the
corresponding score is added to the total similarity score. The maximum similarity score is
n + (n-1) + ... + 1, and the relatedness score is the total similarity score divided by the
maximum similarity score. n is set to 10, finding the 10 most similar documents for each
document. This method helped in the initial tuning of the hyper-parameters for the doc2vec
model, and the relatedness scores helped in understanding the similarity among documents
belonging to the GI and non-GI classes.
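The relatedness score above can be sketched as follows (illustrative, not the thesis code). The neighbour classes would come from gensim's most-similar lookup; here they are passed in directly.

```python
# A sketch of the rank-weighted relatedness score for one document x.

def relatedness(x_class, neighbour_classes):
    """neighbour_classes: classes of the n most similar documents, in rank
    order. Rank r (1-based) contributes n - r + 1 when the class matches."""
    n = len(neighbour_classes)
    total = sum(n - rank for rank, c in enumerate(neighbour_classes)
                if c == x_class)
    maximum = n * (n + 1) // 2        # n + (n-1) + ... + 1
    return total / maximum

# x is a GI (class 1); its 5 nearest neighbours by cosine similarity:
print(relatedness(1, [1, 1, 0, 1, 0]))   # (5 + 4 + 2) / 15
```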
4.1.1.2 Results
As can be seen in Figure 4.2, the DBOW doc2vec model performs best in the similarity task. The
DM model performs poorly, which suggests that for DNA embedding in the GI task, word order may
not be useful information. Furthermore, the concatenated version of doc2vec does not perform
well, possibly due to the poor performance of the DM component. Interestingly, the baseline
models BoW and TF-IDF performed well, which suggests the importance of word count and word
relevance as features in DNA embedding for GI tasks. Even though they perform reasonably well,
these models have the disadvantage of the high dimensionality of one-hot encoding. The DBOW
model ignores word order and tries to predict a randomly sampled word from the paragraph, given
the paragraph id as input; it is a simpler model than DM and uses less storage space, as it does
not store the word vectors. From this experiment, I find that the doc2vec DBOW model can be
powerful in understanding semantic similarity between DNA sequences of the same class.

Figure 4.2: Performance of different doc2vec models and baseline methods BoW and TF-IDF on the
similarity task
4.1.2 Evaluation of classifier
4.1.2.1 Setting
Dataset The dataset used for the classification task comprises the positive and
negative datasets obtained after running CD-HIT, as described above. The train-to-test split is
80-20% of the total positive and negative set, keeping the same balance of positive and negative
data in both sets. As mentioned above, the training set contains 2805 examples and the test set
contains 702.
Task and models used The training and test vectors for the classifiers are first
inferred from each of the doc2vec DNA embedding models (DBOW, DM, and DM+DBOW) using gradient
descent in the doc2vec inference stage. The BoW and TF-IDF embeddings serve as baselines; their
training and test representations come from the models built in the previous experiment. The
task is formulated as binary classification, where the labels 1 and 0 represent GI and non-GI,
respectively. I use machine learning classifiers commonly associated with document
classification tasks: SVM, Logistic Regression, and K-Nearest Neighbour.
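The classification step can be sketched with scikit-learn as below. The toy Gaussian clouds stand in for the inferred 50-dimensional document vectors (they are not real data); the SVC settings are the hyper-parameter values reported later in this chapter (C=2, gamma=1, RBF kernel).

```python
# A sketch of the binary GI/non-GI classification step (toy data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy 50-dimensional "document vectors": two shifted Gaussian clouds
X = np.vstack([rng.normal(0, 1, (100, 50)), rng.normal(2, 1, (100, 50))])
y = np.array([0] * 100 + [1] * 100)       # 0 = non-GI, 1 = GI

clf = SVC(C=2, gamma=1, kernel="rbf", probability=True).fit(X, y)
probs = clf.predict_proba(X)[:, 1]        # the p_i used in the merging phase
print(round(clf.score(X, y), 2))
```

Note that `probability=True` is needed so the classifier can emit the per-segment GI probabilities consumed by the merging phase.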
Evaluation metrics The classifiers are evaluated on overall accuracy, precision,
recall, and F1-score (the harmonic mean of precision and recall). The classification task also
serves to evaluate the performance of the different DNA embedding models.
4.1.2.2 Results
From Figure 4.3, it can be seen that the DBOW + SVM model has the highest precision, recall,
F1-score, and accuracy. Overall, SVM performs best among all the classifiers, as Figure 4.4
shows. Even though DBOW + SVM performs best in the classification task, it is interesting that
the TF-IDF + SVM model also performs quite well, showing that word relevance may indeed be a
good feature for DNA embedding. The doc2vec DM model again performs poorly in the classification
task, in keeping with the results of the previous similarity experiment. That the TF-IDF and BoW
results are sometimes comparable to the DBOW embedding could be attributed to the fact that DNA
embeddings involve a limited variety of k-mers, and even fewer unseen k-mers when trained with
enough data: the total number of possible k-mers over the four nucleotides A, T, G, C is 4^k.
Beyond its better classification performance, doc2vec DBOW is also notable because training the
DBOW + classifier pipeline takes the least amount of time among the methods compared, especially
relative to the BoW and TF-IDF models, which use a one-hot encoding. The precision-recall curve
and the ROC (receiver operating characteristic) curve of the DBOW + SVM model are shown in
Figure 4.6, which helps in understanding the performance of the binary classifier.

Figure 4.3: Precision, Recall, F1-score and Accuracy for doc2vec DBOW and other baseline
representations on classifiers Logistic Regression (LR), Support Vector Machine (SVM),
K-Nearest Neighbour (KNN)

Figure 4.4: Overall accuracy of classifiers Logistic Regression (LR), Support Vector Machine
(SVM), K-Nearest Neighbour (KNN) on different embeddings

Figure 4.6: (a) Precision-recall curve and (b) ROC curve of the DBOW + SVM classifier model
Hyper-parameters The hyper-parameters need to be tuned at several levels. At the
pre-processing level, the hyper-parameters include the k-mer value and the choice between
overlapping and non-overlapping windows for extracting the k-mers. The k values tested are
{3, 4, 5, 6, 7, 8, 9}; the optimal k-mer value is chosen to be 6 and the window method is chosen
to be overlapping, based on the classification results shown in the appendix. At the document
embedding level, the hyper-parameters tuned are mainly vector size, window, epochs, alpha, and
dbow_words for the doc2vec DBOW model, and vector size, window, epochs, alpha, and dm_concat for
the doc2vec DM model. The best configuration for the DBOW model is found to be vector size 50,
window 10, 150 epochs, and alpha 0.025. In the classification task, the SVM model has
hyper-parameters such as C, gamma, and kernel, which are optimal at C = 2, gamma = 1, and an RBF
kernel. These values are obtained from a 10-fold cross-validated grid search.
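The cross-validated grid search over the SVM hyper-parameters can be sketched as below. The data and the (reduced) grid are toy stand-ins for illustration, not the thesis setup; only the 10-fold cross-validation and the RBF kernel come from the text.

```python
# A sketch of the 10-fold cross-validated grid search for the SVM.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 10)), rng.normal(1.5, 1, (60, 10))])
y = np.array([0] * 60 + [1] * 60)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [1, 2, 4], "gamma": [0.1, 1]},
                    cv=10)
grid.fit(X, y)
print(grid.best_params_)   # best C and gamma for this toy data
```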
4.2 Evaluation of GI identification
This section evaluates the second stage of the framework, which identifies GIs in an input DNA
sequence.
Data In general, any nucleotide sequence at least as long as the minimum GI size
can be entered to identify GI regions, but for this evaluation I have used only whole
prokaryotic genomes from the National Center for Biotechnology Information (NCBI) server. The
whole genomes downloaded for experiments 1 and 2 are the 104 genomes used in the M dataset,
because these genomes have properly identified positive and negative GIs. The genomes
downloaded from NCBI for the experiment on unseen data are mentioned later in Section 4.2.3.
Evaluation metrics The general evaluation metric used for GI prediction is
similar to the metric used to compare previous GI predictors [4]. The following values are
computed based on nucleotide overlaps:
(i) True Positive (TP): the number of nucleotides in the positive prediction that overlap with
positive reference data. (ii) True Negative (TN): the number of nucleotides outside the positive
prediction that overlap with negative reference data. (iii) False Positive (FP): the number of
nucleotides in the positive prediction that overlap with negative reference data. (iv) False
Negative (FN): the number of nucleotides outside the positive prediction that overlap with
positive reference data.
Based on these values, the following evaluation metrics are used:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 x Precision x Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
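The nucleotide-overlap counts can be sketched as below (illustrative, not the thesis code). Regions are half-open (start, end) intervals of nucleotide positions; the helper counts positions covered by both region sets.

```python
# A sketch of the nucleotide-overlap evaluation on toy intervals.

def overlap(regions_a, regions_b):
    """Number of nucleotide positions covered by both region sets."""
    covered_a = set()
    for s, e in regions_a:
        covered_a.update(range(s, e))
    return sum(1 for s, e in regions_b for p in range(s, e) if p in covered_a)

pred_pos = [(0, 100)]        # predicted GI
ref_pos  = [(50, 150)]       # reference GI
ref_neg  = [(150, 300)]      # reference non-GI

tp = overlap(pred_pos, ref_pos)              # shared with positive reference
fp = overlap(pred_pos, ref_neg)              # shared with negative reference
fn = sum(e - s for s, e in ref_pos) - tp     # positive reference missed
tn = sum(e - s for s, e in ref_neg) - fp     # negative reference not predicted

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn, round(f1, 3))          # 50 0 50 150 0.667
```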
4.2.1 Experiment on comparative genomics data
This experiment is designed to understand how well the boundary detection technique works
in the second stage, using the models trained in the first stage.
4.2.1.1 Experimental setup
A total of 104 genomes are used to identify all GI regions. The predictions from TreasureIsland
are compared against several baseline GI prediction models that have previously shown good
results: a high-precision tool based on detecting tRNA fragments (Islander), sequence
composition-based tools (IslandPath-DIMOB and Sigi-HMM), and a hybrid tool (IslandViewer4). The
reference dataset used for this task is the M dataset of 1845 GIs and 3266 non-GIs. Since this
reference dataset is based on the comparative genomics tool IslandPick, IslandPick itself is not
included among the baseline methods, as that would lead to biased results.
Table 4.2: Performance of TreasureIsland and other baseline GI predictors on 104 genomes from
the M dataset
Predictor Precision Recall F1-score Accuracy
TreasureIsland 0.8946 0.9054 0.8925 0.9427
IslandViewer4 0.9025 0.7528 0.7935 0.8912
IslandPath DIMOB 0.8957 0.4515 0.5394 0.7624
SIGI-HMM 0.9585 0.1850 0.2766 0.7039
Islander 0.9807 0.1397 0.2040 0.6970
4.2.1.2 Results
An example of the framework's output is shown in Figure 4.7: the prediction for the genome
Escherichia coli O157:H7 str. Sakai (NC 002695.1) from TreasureIsland, together with the
positive and negative GI regions of the reference dataset. Table 4.2 shows that TreasureIsland
has the highest recall, F1-score, and accuracy compared to the baseline methods. Islander shows
the highest precision, as it identifies GIs using only tRNA features plus a few other features
in its filtering technique. TreasureIsland's precision is quite close to that of IslandViewer4,
which is a hybrid prediction method and a reliable tool for GI prediction. Even though the
framework performs well here, it must be kept in mind that it is a machine learning-based
predictor that used many of these GIs and non-GIs in its training set. This motivates further
investigation of the framework in experiment 2. This experiment nevertheless gives a good
indication of the framework's potential to predict GIs from an input sequence, especially of the
second stage's use of the models from the first stage.
Parameters The parameters required to identify the GIs are described in Table 3.1:
window size, k-mer size, minimum GI size, tuning window, upper threshold (Tu), and lower
threshold (Tl). GI identification on the 104 genomes is found to be optimal with window size
10000, k-mer size 6, minimum GI size 10000 (in keeping with previous research on genomic island
sizes [11]), tuning window 1000, upper threshold (Tu) 0.75, and lower threshold (Tl) 0.5.
Figure 4.7: Prediction of Escherichia coli O157 genome NC 002695.1 measured on the reference
data from dataset M
4.2.2 Experiment on comparative genomics test data and literature data
This experiment assesses the predictive power of the framework on the test data sets.
4.2.2.1 Experimental setup
For this experiment, two reference test sets are considered. Test-set 1 consists of all GIs not
included in the training data for the classifier: 626 GIs and 1981 non-GIs from the 104 genomes.
Test-set 2 consists of the previously mentioned L dataset: 80 GIs from 6 genomes. The negative
data for each of the 6 genomes in test-set 2 are kept the same as in the original M dataset,
which keeps the FP values in the test-set 2 results the same as in the experiment on comparative
genomics data. The baseline predictors used for this experiment are IslandViewer 4,
IslandPath-DIMOB, Sigi-HMM, and Islander.
Table 4.3: Performance of TreasureIsland and other baseline GI predictors on 626 GIs and 1981
non-GIs from test-set 1
Predictor Precision Recall F1-score Accuracy
TreasureIsland 0.8110 0.8068 0.7783 0.9254
IslandViewer4 0.8452 0.6718 0.6769 0.8803
IslandPath DIMOB 0.8269 0.3507 0.3933 0.7711
SIGI-HMM 0.9612 0.1537 0.2203 0.7601
Islander 0.9711 0.1156 0.1570 0.7471
Table 4.4: Performance of TreasureIsland and other baseline GI predictors on 80 GIs from
test-set 2
Predictor Precision Recall F1-score Accuracy
TreasureIsland 0.9700 0.9351 0.9507 0.9440
IslandViewer4 0.9980 0.6691 0.7912 0.8165
IslandPath DIMOB 0.9976 0.4788 0.6361 0.6998
SIGI-HMM 1.0 0.2048 0.3133 0.5539
Islander 1.0 0.2264 0.3535 0.5600
4.2.2.2 Results
In this experiment, Table 4.3 shows that on test-set 1, TreasureIsland has the highest recall,
F1-score, and accuracy among the models compared. As in experiment 1, the precision of the
framework is close to that of the IslandViewer4 predictor.
Table 4.4 shows the results on the curated literature data used as test-set 2. In general, the
predictors improve in precision, which means the true positives must have increased, since the
false positives are constant due to the shared negative data set. These results show that the
framework consistently performs well in terms of recall and accuracy when compared to the
baseline predictors.
4.2.3 Experiment on unseen data
This experiment is designed to assess the capability of the framework to predict GIs on genomes
it has not seen in either the embedding or the classification step.
4.2.3.1 Experimental setup
To perform this experiment, 6 genomes are selected at random: Rhizobium leguminosarum
(NC 011369.1), Streptosporangium roseum (NC 013595), Stenotrophomonas maltophilia (NC 015947),
Enterobacter soli (NC 015968.1), Mycobacterium tuberculosis (NC 016768.1), and Escherichia coli
W (NC 017635.1). Since there is no fixed reference data set labeling the positive and negative
GI regions of these genomes, I evaluate this experiment by checking for overlaps with some of
the baseline methods: IslandViewer 4, IslandPath-DIMOB, Sigi-HMM, IslandPick, and Islander. To
understand how much of the predicted regions from each predictor are shared with the other
predictors, I calculate the overlapping regions of each predictor with respect to every other
predictor:

overlap score = (overlapping GI regions of predictor and reference predictor) / (total
predictions from the reference predictor)
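The overlap score can be sketched as follows. The thesis does not spell out the exact region-matching rule, so this sketch makes the assumption that a reference region counts as overlapped if it shares any nucleotides with some predicted region; the intervals below are toy data.

```python
# A sketch of the overlap score between two predictors' region sets.

def overlap_score(predictions, reference):
    """Fraction of reference regions overlapped by at least one prediction.
    Regions are half-open (start, end) intervals."""
    def intersects(a, b):
        return a[0] < b[1] and b[0] < a[1]
    hit = sum(1 for r in reference
              if any(intersects(p, r) for p in predictions))
    return hit / len(reference)

pred = [(0, 5000), (20000, 30000)]                     # one predictor
ref  = [(1000, 4000), (10000, 12000), (25000, 26000)]  # reference predictor
print(overlap_score(pred, ref))   # 2 of 3 reference regions are overlapped
```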
Table 4.5: Average GI overlap score of each predictor on the predictions from reference
predictors on 6 genomes (rows: predictor; columns: reference predictor)

Predictor           TreasureIsland   IslandViewer4   IslandPath-DIMOB   IslandPick   Sigi-HMM   Islander
TreasureIsland      1.0              0.7739          0.7463             0.7094       0.9106     0.8919
IslandViewer4       0.2352           1.0             1.0                1.0          1.0        1.0
IslandPath-DIMOB    0.1636           0.6324          1.0                0.3497       0.3360     0.7229
IslandPick          0.0647           0.2204          0.1422             1.0          0.1943     0.3253
Sigi-HMM            0.0931           0.3705          0.2282             0.3857       1.0        0.3542
Islander            0.0594           0.1953          0.2783             0.3488       0.1774     1.0
4.2.3.2 Results
From Table 4.5 it can be seen that TreasureIsland's predictions overlap with predictions from
every baseline predictor, with the highest overlap score against the Sigi-HMM predictor,
followed by the Islander predictor. Since IslandViewer4 is a composite predictor that includes
IslandPath-DIMOB, Sigi-HMM, and Islander, its results overlap with most of the other predictors.
It can also be seen that the baseline predictors overlap relatively little with TreasureIsland,
which indicates that TreasureIsland has the potential to make many novel predictions.
CHAPTER 5. SUMMARY AND DISCUSSION
In this thesis, I have examined the problem of predicting GIs in microorganisms and have
proposed a machine learning-based framework to predict GIs. This framework takes unannotated
nucleotide sequences as an input and uses an unsupervised representation of DNA to classify the
GIs and non-GIs. I also introduce a boundary refinement technique to delineate these GI regions
precisely.
The results obtained in this thesis show us that the framework has a high recall and accuracy,
and a precision comparable to some of the current baseline predictors. It has also been shown
that this framework has the potential to discover novel GI regions that have not been covered by
other predictors. This research opens the door to an unsupervised way of representing DNA,
which can be helpful for other machine learning-based DNA prediction tasks. The framework also
makes an important addition to the list of GI predictors currently available to uncover more
potential GI regions. The advantage of using this framework is that it does not require any
related genomes or gene annotations to predict GIs, which means freshly sequenced unannotated
genomes can be used to predict GI regions.
However, it is worth noting that since the GI predictor does not use any prior information such
as gene components as features, making this framework a purely unsupervised process, there is a
possibility of inflation in the number of GIs predicted. Some predicted GIs may fall into a
biological grey zone, where it is not known whether a region is a GI or not.
Thus, as future work, it will be useful to look into the GIs predicted to find the gene annotations
linked to them and functionally categorize the GIs. This might also uncover more possible
features and help to further advance the research in genomic islands.
BIBLIOGRAPHY
[1] Assaf, R., Xia, F., and Stevens, R. Identifying genomic islands with deep neural networks. bioRxiv (2019), 525030.
[2] Bertelli, C., and Brinkman, F. S. Improved genomic island predictions with IslandPath-DIMOB. Bioinformatics 34, 13 (2018), 2161–2167.
[3] Bertelli, C., Laird, M. R., Williams, K. P., Group, S. F. U. R. C., Lau, B. Y., Hoad, G., Winsor, G. L., and Brinkman, F. S. IslandViewer 4: expanded prediction of genomic islands for larger-scale datasets. Nucleic Acids Research 45, W1 (2017), W30–W35.
[4] Bertelli, C., Tilley, K. E., and Brinkman, F. S. Microbial genomic island discovery, visualization and analysis. Briefings in Bioinformatics 20, 5 (2019), 1685–1698.
[5] Che, D., Hasan, M. S., and Chen, B. Identifying pathogenicity islands in bacterial pathogenomics using computational approaches. Pathogens 3, 1 (2014), 36–56.
[6] Che, D., Hockenbury, C., Marmelstein, R., and Rasheed, K. Classification of genomic islands using decision trees and their ensemble algorithms. BMC Genomics 11, 2 (2010), 1–9.
[7] Che, D., Wang, H., Fazekas, J., and Chen, B. An accurate genomic island prediction method for sequenced bacterial and archaeal genomes. Journal of Proteomics & Bioinformatics 7, 8 (2014), 214.
[8] Dai, Q., Bao, C., Hai, Y., Ma, S., Zhou, T., Wang, C., Wang, Y., Huo, W., Liu, X., Yao, Y., et al. MTGIpick allows robust identification of genomic islands from a single genome. Briefings in Bioinformatics 19, 3 (2018), 361–373.
[9] Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J. Genomic islands in pathogenic and environmental microorganisms. Nature Reviews Microbiology 2, 5 (2004), 414–424.
[10] Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 23 (2012), 3150–3152.
[11] Hacker, J., and Kaper, J. B. Pathogenicity islands and the evolution of microbes. Annual Reviews in Microbiology 54, 1 (2000), 641–679.
[12] Hsiao, W., Wan, I., Jones, S. J., and Brinkman, F. S. IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics 19, 3 (2003), 418–420.
[13] Hudson, C. M., Lau, B. Y., and Williams, K. P. Islander: a database of precisely mapped genomic islands in tRNA and tmRNA genes. Nucleic Acids Research 43, D1 (2015), D48–D53.
[14] Juhas, M., Van Der Meer, J. R., Gaillard, M., Harding, R. M., Hood, D. W., and Crook, D. W. Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS Microbiology Reviews 33, 2 (2009), 376–393.
[15] Langille, M. G., Hsiao, W. W., and Brinkman, F. S. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics 9, 1 (2008), 1–10.
[16] Langille, M. G., Hsiao, W. W., and Brinkman, F. S. Detecting genomic islands using bioinformatics approaches. Nature Reviews Microbiology 8, 5 (2010), 373–382.
[17] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning (2014), PMLR, pp. 1188–1196.
[18] Lu, B., and Leong, H. W. GI-SVM: a sensitive method for predicting genomic islands based on unannotated sequence of a single genome. Journal of Bioinformatics and Computational Biology 14, 01 (2016), 1640003.
[19] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[20] Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279 (2017).
[21] Ou, H.-Y., Chen, L.-L., Lonnen, J., Chaudhuri, R. R., Thani, A. B., Smith, R., Garton, N. J., Hinton, J., Pallen, M., Barer, M. R., et al. A novel strategy for the identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites in closely related bacteria. Nucleic Acids Research 34, 1 (2006), e3.
[22] Pundhir, S., Vijayvargiya, H., and Kumar, A. PredictBias: a server for the identification of genomic and pathogenicity islands in prokaryotes. In Silico Biology 8, 3-4 (2008), 223–234.
[23] Reyrat, J.-M., Pelicic, V., Gicquel, B., and Rappuoli, R. Counterselectable markers: untapped tools for bacterial genetics and pathogenesis. Infection and Immunity 66, 9 (1998), 4011–4017.
[24] Tsirigos, A., and Rigoutsos, I. A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Research 33, 3 (2005), 922–933.
[25] Vernikos, G. S., and Parkhill, J. Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics 22, 18 (2006), 2196–2203.
[26] Waack, S., Keller, O., Asper, R., Brodag, T., Damm, C., Fricke, W. F., Surovcik, K., Meinicke, P., and Merkl, R. Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7, 1 (2006), 1–12.
[27] Wei, W., Gao, F., Du, M.-Z., Hua, H.-L., Wang, J., and Guo, F.-B. Zisland Explorer: detect genomic islands by combining homogeneity and heterogeneity properties. Briefings in Bioinformatics 18, 3 (2017), 357–366.
[28] Winstanley, C. Spot the difference: applications of subtractive hybridisation to the study of bacterial pathogens. Journal of Medical Microbiology 51, 6 (2002), 459–467.
[29] Yoon, S. H., Park, Y.-K., and Kim, J. F. PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands. Nucleic Acids Research 43, D1 (2015), D624–D630.
APPENDIX. HYPERPARAMETER TUNING: K-MER

k-mer classification results

Table .1: Accuracy of different k-mer sizes on classifiers Logistic Regression (LR), Support
Vector Machine (SVM), K-Nearest Neighbour (KNN)

              overlapping window          non-overlapping window
k-mer size    LR      SVM     KNN         LR      SVM     KNN
3             0.7379  0.8219  0.7550      0.7108  0.8419  0.7906
4             0.8063  0.9160  0.8547      0.8219  0.9188  0.8490
5             0.8291  0.9373  0.8746      0.8291  0.9288  0.8618
6             0.8433  0.9530  0.9160      0.7877  0.8960  0.7991
7             0.8063  0.9188  0.8846      0.7393  0.8675  0.7920
8             0.7578  0.8974  0.8519      0.7251  0.7528  0.7051
9             0.7350  0.8746  0.8219      0.6125  0.5370  0.4986