Metagenomic Sequence Analysis using
Hybrid Approach
Dissertation
Submitted by
Umesh G Gadhe
Roll No: 121122007
in partial fulfillment of the requirements
for the degree of
M.Tech Computer Engineering
Under the guidance of
Dr. Sachin Deshmukh
College of Engineering, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
June, 2013
DEPARTMENT OF COMPUTER ENGINEERING
AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation titled
Metagenomic Sequence Analysis using Hybrid
Approach
has been successfully completed
By
Umesh G Gadhe
(121122007)
and is approved for the degree of
Master of Technology.
Dr. Sachin Deshmukh Dr. J V Aghav
Project Guide Head of Department
Dept. of Computer Engineering Dept. of Computer Engineering
and Information Technology, and Information Technology,
College of Engineering Pune, College of Engineering Pune,
Shivajinagar, Pune - 5. Shivajinagar, Pune - 5.
Date:
Dedicated to
my mother Smt. Vimal G Gade and my aunty Smt. Sunita B Gade who have
always been a constant source of inspiration for me, my entire life
and
my father Shri. Gangadhar D Gade and my uncle Shri. Bhaskar D Gade, who
have always been my role model for hard work, persistence, patience and always
supported me open heartedly in all my endeavours.
Acknowledgments
I express my deepest gratitude towards my project guide Dr. Sachin Deshmukh for
his constant help and encouragement throughout the project work. I have been fortunate
to have a guide who gave me the freedom to explore on my own and at the same time
helped me plan the project with timely reviews and constructive comments and suggestions
wherever required. A big thanks to him for having faith in me throughout the project,
and for his calm and understanding nature, which helped me cope with the hard moments
and mistakes that I came across in the project work.
I am also grateful to Dr. A D Sahasrabudhe (Director, College of Engineering,
Pune) and Dr. J V Aghav (Head, Department of Computer Engineering and Infor-
mation Technology, College of Engineering, Pune) for providing all resources for project
work whenever needed.
I would like to thank Dr. Aarati Desai (Domain Expert, Persistent Systems, Pune),
who taught us the bioinformatics subject so enthusiastically and practically that it made
understanding the concepts easy. Special thanks for the continuous support and
encouragement she extended whenever I was stuck on a problem during the project work. I
also take this opportunity to thank all the teachers and staff who have constantly helped
me grow, learn and mature, both personally and professionally, throughout the process.
A BIG thanks goes to my dearest classmates who made this journey so joyful and
lively; without them the journey wouldn’t have been so interesting and memorable.
They have always supported and guided me, and have helped me stay sane throughout this
and every other chapter of my life. I greatly value their friendship and deeply appreciate
their belief in me.
Most importantly, I would like to express my heart-felt gratitude to my family.
Umesh G Gadhe
Abstract
The term metagenome was coined in 1998 by Handelsman. Metagenomics is the
study of metagenomes, genetic material recovered directly from environmental samples
in their natural conditions. While traditional microbial genomics relies upon cultivated
clonal cultures, over 99% of species resist cultivation and remain undiscovered.
Sequencing of environmental DNA (metagenomics) has shown tremendous potential to
drive the discovery and understanding of these un-culturable species.
Culture independent methods are used to obtain information about the genetic diversity,
population structure, and ecological roles of members of the communities.
Over the past few years, the major challenge associated with metagenomics has shifted
from generating to analyzing sequences. Metagenomic analysis includes the identification,
functional and evolutionary analysis of the genomic sequences of a community of organ-
isms. There are many challenges involved in the analysis of these datasets including sparse
meta-data, a high volume of sequence data, genomic heterogeneity and incomplete se-
quences. Due to this nature of metagenomic data, analysis is very complex and requires
new approaches and significant computational resources. Advances in computational
analysis techniques are essential to meet all these challenges. While supervised learning
has proven useful in practice, shortcomings exist. Methods trained on the genomes in
publicly available genomic databases like GenBank make an implicit assumption that
known genomes are representatives of microbes waiting to be found by metagenomic
projects. This assumption is clearly violated by many of the metagenomic samples. Al-
ternatively, genome signatures can be used for unsupervised clustering by learning the
signatures from the set of sequences without the use of known genomes. But applying
either method for analysis has its own shortcomings.
This dissertation studies and experiments with different data mining strategies in both
approaches, and proposes an optimized hybrid model that removes the existing
shortcomings and improves binning performance.
Contents
List of Figures i
List of Tables ii
1 BIOINFORMATICS 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Role of Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 BIOLOGY BEHIND BIOINFORMATICS 4
2.1 Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 DNA (DeoxyriboNucleic Acid) . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 RNA (RiboNucleic Acid) . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Mapping Biology in Computer Terms . . . . . . . . . . . . . . . . . . . . 9
3 METAGENOMICS 10
3.1 Extraction and Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.6 Experimental Design and Statistical Analysis . . . . . . . . . . . . . . . . 15
3.7 Sharing and Storage of Data . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 DATA MINING TOOLS AND TECHNIQUES 16
4.1 PhyScimm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.1 SCIMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.2 Phymm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 PhymmBL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 RESULTS AND OBSERVATIONS 22
5.1 Dataset 1 (Synthetic Metagenomic Dataset) . . . . . . . . . . . . . . . . 22
5.2 Dataset 2 (FAMeS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6 DESIGN AND IMPLEMENTATION 27
6.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7 CONCLUSIONS AND FUTURE SCOPE 36
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Appendix-A 38
Appendix-B 42
Bibliography 43
List of Figures
1.1 Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 Basic cell structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 DNA Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Protein Synthesis steps . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Flow diagram of a typical metagenome analysis . . . . . . . . . . 11
4.1 SCIMM Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Accuracy of Blast and Phymm at Phylum level classification . . 21
4.3 Accuracy of Blast and Phymm at Genus level classification . . . 21
5.1 Accuracy of PhymmBL and PhySCIMM at 1000 bp read length 23
5.2 Accuracy of PhymmBL and PhySCIMM at 200 bp read length 24
5.3 Accuracy of PhymmBL and PhySCIMM with less complex dataset 25
6.1 Block diagram of proposed solution . . . . . . . . . . . . . . . . . . 28
6.2 Accuracy comparison for abundance separation . . . . . . . . . . 31
6.3 Cluster accuracy in descending order of size . . . . . . . . . . . . . 32
6.4 Schematic representation of approach-1 . . . . . . . . . . . . . . . . 33
6.5 Accuracy over the iterations in approach-1 . . . . . . . . . . . . . 34
6.6 Schematic representation of approach-2 . . . . . . . . . . . . . . . . 34
6.7 Accuracy over the iterations in approach-2 . . . . . . . . . . . . . 35
List of Tables
3.1 Sequencing Technologies . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.1 Accuracy table sorted by cluster size . . . . . . . . . . . . . . . . . 32
1 Accuracy of PhymmBL and PhySCIMM at 1000 bp read length 38
2 Accuracy of PhymmBL and PhySCIMM at 800 bp read length 38
3 Accuracy of PhymmBL and PhySCIMM at 400 bp read length 39
4 Accuracy of PhymmBL and PhySCIMM at 200 bp read length 39
5 Accuracy of PhymmBL and PhySCIMM at Phylum Level . . . . 39
6 Accuracy of PhymmBL and PhySCIMM at Class Level . . . . . 40
7 Accuracy of PhymmBL and PhySCIMM at Order Level . . . . . 40
8 Accuracy of PhymmBL and PhySCIMM at Family Level . . . . 40
9 Accuracy of PhymmBL and PhySCIMM at Genus Level . . . . . 41
10 Accuracy of PhymmBL and PhySCIMM at 200 bp read length 41
11 Paper publication status . . . . . . . . . . . . . . . . . . . . . . . . . 42
Chapter 1
BIOINFORMATICS
1.1 Introduction
The field of bioinformatics came into existence about 40 years ago; the term
“Bioinformatics” came into wide use in the 1990s. At the beginning, the field dealt
with the creation and maintenance of databases to store biological data such as
nucleotide sequences and amino acid sequences. Further development in the field involved
database design issues and the development of complex interfaces.
Bioinformatics is an interdisciplinary field that develops and improves upon methods
for storing, retrieving, organizing and analyzing biological data. A major activity in
bioinformatics is to develop software tools to generate useful biological knowledge [1].
Bioinformatics applies computer science and information technology to problems in
biological science; it is the science of managing and analyzing biological data using
advanced computational techniques. Database and information systems are used to
collect, store, and analyze biological information, and this analysis can then be used in
gene-based drug discovery and development. Figure 1.1 visualizes the interdisciplinary
nature of the field.
Bioinformatics has four main streams:
1. Genomics
The term genomics was coined by Thomas Roderick. Genomics is a discipline in
genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics
to sequence, assemble, and analyze the function and structure of genomes. The field
genomics includes efforts to determine the entire DNA sequence of organisms and
Figure 1.1: Bioinformatics
fine scale genetic mapping. In contrast, the investigation of the roles and functions
of single genes is a primary focus of molecular biology or genetics and is a common
topic of modern medical and biological research. Genomics has wide applications
in drug discovery, development, diagnostics and therapy.
2. Proteomics
The word proteomics was coined in 1997. Proteomics is the study of protein
sequences, i.e. amino acid sequences, in complement to the genome. It studies the
structure and function of proteins, which is very useful for understanding the metabolism
of an organism; proteins are vital constituents of any organism and form the metabolic
pathways of the cell. Proteomics is an interdisciplinary field that grew out of the
research and development of the Human Genome Project; it explores the proteome, the
overall intracellular protein composition, structure, and activity patterns, and is an
important component of functional genomics [2]. Functions of proteins include enzyme
catalysis, transport, mechanical support, organelle constituents, storage reserves,
metabolic control, protection mechanisms, toxins, and regulation of osmotic pressure.
3. Cheminformatics
Cheminformatics is an interdisciplinary branch combining chemistry, computer
science and information science. Key problems in cheminformatics are storing
molecules, exact-structure lookup, substructure search and similarity search.
4. Pharmacogenomics
Pharmacogenomics analyses how genetic makeup affects an individual’s response to
drugs. It deals with the influence of genetic variation on drug response in patients
by correlating gene expression with drug’s efficacy or toxicity. By doing so, phar-
macogenomics aims to develop means to optimize drug therapy, with respect to
the patients’ genotype, to ensure maximum response with minimal adverse effects.
Such approaches promise the advent of “personalized medicine” in which drugs and
drug combinations are optimized for each individual’s unique genetic makeup [3].
1.2 Role of Bioinformatics
1. Analysis and interpretation of various types of biological data including nu-
cleotide and amino acid sequences, protein domains and protein structures.
2. Development of new algorithms and statistics with which to assess biological
information, such as relationships among members of large datasets.
3. Development and implementation of tools that enable efficient access and
management of different types of information, such as various databases and inte-
grated mapping information.
Chapter 2
BIOLOGY BEHIND
BIOINFORMATICS
2.1 Cell
The cell was discovered by Robert Hooke in 1665. The cell is the basic structural
and functional unit of all known living organisms. It is the smallest unit of life that is
classified as a living thing, and is often called the building block of life. Figure 2.1 gives
the basic cell structure with its major components [4].
Figure 2.1: Basic cell structure
Every organism has different body parts performing different functions: bones, for
example, form a supportive frame for the body, while skin provides a protective layer.
But if the basic unit underneath them all is the cell, how do cells perform different
functions in different body parts? The answer lies in the proteins they synthesize:
proteins define the type and the functions of the cell.
2.2 DNA (DeoxyriboNucleic Acid)
The Swiss biochemist Friedrich Miescher first observed DNA in the late 1800s. DNA
is the hereditary material in humans and almost all other organisms, and nearly every
cell in a person’s body has the same DNA. Most DNA is located in the cell nucleus
(where it is called nuclear DNA), but a small amount of DNA can also be found in the
mitochondria. An organism’s complete set of nuclear DNA is called its genome [5].
Figure 2.2 gives a pictorial representation of the DNA structure [6].
DNA is made of chemical building blocks called nucleotides. These building blocks are
made of three parts: a phosphate group, a sugar group and one of four types of nitrogen
bases. To form a strand of DNA, nucleotides are linked into chains, with the phosphate
and sugar groups alternating. The four types of nitrogen bases found in nucleotides are:
adenine (A), thymine (T), guanine (G) and cytosine (C). The order or sequence of these
bases determines what biological instructions are contained in a strand of DNA.
For example, the sequence ATCGTT might instruct for blue eyes, while ATCGCT might
instruct for brown.
To understand DNA’s double helix from a chemical standpoint, picture a twisted
ladder whose sides are strands of alternating sugar and phosphate groups. Each rung of
the ladder is made up of two nitrogen bases, paired together by hydrogen bonds. Because of the
highly specific nature of this type of chemical pairing, base A always pairs with base T,
and likewise C with G. So, if you know the sequence of the bases on one strand of a DNA
double helix, it is a simple matter to figure out the sequence of bases on the other strand.
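The base-pairing rule above is simple enough to sketch in a few lines of code (an illustrative aside, not part of any tool discussed in this dissertation):

```python
# Watson-Crick pairing: A always pairs with T, and C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand: str) -> str:
    """Derive the second strand of the double helix from the first."""
    return "".join(PAIRS[base] for base in strand)

print(complement_strand("ATCGTT"))  # prints TAGCAA
```

Applying the function twice returns the original strand, which mirrors the symmetry of the pairing rule.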
Figure 2.2: DNA Structure
DNA’s unique structure enables the molecule to copy itself during cell division. When
a cell prepares to divide, the DNA helix splits down the middle and becomes two single
strands. These single strands serve as templates for building two new, double-stranded
DNA molecules - each a replica of the original DNA molecule. In this process, an A base
is added wherever there is a T, a C where there is a G, and so on until all of the bases
once again have partners.
In addition, when proteins are being made, the double helix unwinds to allow a single
strand of DNA to serve as a template. This template strand is then transcribed into
mRNA, which is a molecule that conveys vital instructions to the cell’s protein-making
machinery [5].
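Transcription of the template strand into mRNA follows the same pairing rule, with uracil (U) substituting for thymine (T); a minimal sketch:

```python
# Template DNA base -> complementary mRNA base (U replaces T in RNA).
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template: str) -> str:
    """Build the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_MRNA[base] for base in template)

print(transcribe("TACGGT"))  # prints AUGCCA
```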
2.3 RNA (RiboNucleic Acid)
RNA is very similar to DNA, but differs in a few important structural details. RNA
is made up of a long chain of components called nucleotides. Each nucleotide consists of
a nucleotide base, a ribose sugar, and a phosphate group. The sequence of nucleotides
allows RNA to encode genetic information.
RNA forms the system responsible for converting the DNA code, i.e. genes, into proteins.
There are three types of RNA:
1. Messenger RNA (mRNA)
Messenger RNA is synthesized from a gene segment of DNA. It contains the infor-
mation on the primary sequence of amino acids in a protein to be synthesized. The
messenger RNA carries the code to the ribosomal site.
2. Transfer RNA (tRNA)
Transfer RNA resides in the cytoplasm of the cell. It acts as a carrier of amino acids
during protein synthesis, carrying the corresponding amino acid to the protein synthesis
site. The particular amino acid is determined by the anticodon (a triplet base code
complementary to the codon), which defines the tRNA. There is at least one tRNA
for each amino acid, and there can be multiple triplet codons for one amino acid.
3. Ribosomal RNA (rRNA)
Ribosomal RNA is a structural component of ribosomes. The specific function of
rRNA is not fully established, but it binds proteins and mRNA to provide the site
of protein synthesis [7].
2.4 Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically
folded into a globular or fibrous form, facilitating their biological functions. A polypeptide is
a single linear polymer chain of amino acids bonded together by peptide bonds between
the carboxyl and amino groups of adjacent amino acid residues. The sequence of amino
acids in a protein is defined by the sequence of a gene, which is encoded in the genetic
code.
Proteins are very important molecules in our cells. They are involved in virtually all
cell functions. Each protein within the body has a specific function. There are several
types of proteins depending on chain of amino acids and their structure.
e.g. Antibodies are specialized proteins involved in defending the body from anti-
gens.
Proteins are responsible for movement.
Enzymes are proteins that facilitate biochemical reactions.
Hormonal Proteins are messenger proteins which help to coordinate certain bodily
activities.
Structural Proteins are fibrous and stringy and provide support.
Transport Proteins are carrier proteins which move molecules from one place to
another around the body [8].
Protein synthesis is the process of making proteins. It is extremely complex and must
be carried out as a series of steps to ensure it is done properly. Figure 2.3 gives a
visual representation of the protein synthesis steps [9].
Figure 2.3: Protein Synthesis steps
General steps followed during protein synthesis are:
1. DNA unwinds exposing a gene.
2. The gene undergoes transcription forming “mRNA”.
3. Messenger RNA transfers the information from the nucleus to the ribosomes.
4. During the translation process, ribosomes read mRNA which indicates which amino
acid to add to the polypeptide chain and its sequence.
5. Each transfer RNA (tRNA) molecule conveys a particular amino acid to the ribo-
some.
6. At the ribosome, the amino acid that has been delivered by tRNA attaches to the
peptide chain, lengthening it.
When the translation process is complete, the ribosome releases the polypeptide, and
the new protein generally undergoes further processing at other sites within the cyto-
plasm.
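Steps 4 to 6 above amount to reading the mRNA three bases at a time and looking each codon up in the genetic code; a toy sketch (only a handful of the 64 codons are shown, purely for illustration):

```python
# Tiny excerpt of the standard genetic code; the real table has 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the start codon
    "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna: str) -> list:
    """Read mRNA codon by codon, stopping at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

print(translate("AUGUUUGGCUAA"))  # prints ['Met', 'Phe', 'Gly']
```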
2.5 Mapping Biology in Computer Terms
As computer people, we can better appreciate these biological concepts by mapping
them onto something we know well. In computer terms, a living organism is a complete
computer system whose cells are the hardware. Biological networks define the
functionality of the hardware; the protein interaction network forms the software, the
program that drives the system through the hardware. Each element in a biological
network, i.e. each protein, is analogous to a function in the program. A gene can be
seen as the instructions in a function body that uniquely identify the program, and DNA
as the universal repository of the instructions of all functions, i.e. all gene sequences.
There is also the non-coding DNA, which actually forms the control unit of the system,
but we are not concerned with that part here and will set it aside to avoid confusion.
Each base on the DNA is an individual instruction, or more precisely a machine
instruction like load, store, or add.
Whatever we are doing in bioinformatics can be considered as reverse engineering.
Reverse engineering is the process of discovering the technological principles of a device,
object, or system through analysis of its structure, function, and operation. It often
involves taking something (e.g. a mechanical device, electronic component, software
program, or biological, chemical, or organic matter) apart and analyzing its workings in
detail [10].
Chapter 3
METAGENOMICS
Metagenomics is the study of metagenomes, genetic material recovered directly from
environmental samples. While traditional microbial genomics relies upon cultivated
clonal cultures, the fact that over 99% of species resist cultivation has skewed our
view of microbial diversity. Metagenomics offers a powerful
lens for viewing the microbial world that has the potential to revolutionize understanding
of the entire living world. Sequencing of environmental DNA (metagenomics) has shown
tremendous potential to drive the discovery and understanding of the “un-culturable ma-
jority” of species. Culture independent methods are used to obtain information about the
genetic diversity, population structure, and ecological roles of members of the communi-
ties. These methods complement or even replace culture-based approaches and bypass
some of their limitations.
Over the past few years, the major challenge associated with metagenomics has shifted
from generating to analyzing sequences. Metagenomic analysis includes the identification,
functional and evolutionary analysis of the genomic sequences of a community of organ-
isms. There are many challenges involved in the analysis of these data sets including
sparse metadata, a high volume of sequence data, genomic heterogeneity and incomplete
sequences. Due to the nature of metagenomic data, analysis is very complex and requires
new approaches and significant compute resources. Advances in computational analysis
techniques are essential to move the field forward.
Figure 3.1: Flow diagram of a typical metagenome analysis
The general steps followed in metagenomic analysis are shown in Figure 3.1.
3.1 Extraction and Processing
Sample processing is the first and most crucial step in any metagenomic analysis. As
metagenomic samples are taken in their natural condition, two important requirements
must be taken care of:
1. DNA extracted should be representative of all cells present in the sample
2. Sufficient amount of high quality nucleic acids must be obtained for subsequent
analysis
3.2 Sequencing
DNA sequencing is the process of reading the nucleotide bases in a DNA molecule.
It includes any method
or technology that is used to determine the order of the four bases adenine, guanine,
cytosine, and thymine in a strand of DNA. The advent of DNA sequencing has
significantly accelerated biological research and discovery. Various DNA sequencing methods
are developed such as Sanger Sequencing, Quantitative PCR, Illumina, Roche454, SOLiD,
etc.
Next-generation sequencing (NGS) methods are now widely applied in the field of
metagenomics due to their lower cost per gigabase and higher speed, but they lag behind
Sanger sequencing in read length, which makes further analysis difficult when reads are
short. Some of the sequencing technologies, along with their cost per gigabase and read
length, are given in Table 3.1.
Technology           Read length   Cost/gigabase
Sanger Sequencing    >700 bp       $400,000
454/Roche            600-800 bp    $20,000
Illumina             300 bp        $50
SOLiD                50 bp         $40
Table 3.1: Sequencing Technologies
The longer the sequence, the better the ability to obtain accurate information, but
Table 3.1 shows the cost rising as we demand longer reads. To achieve analysis goals
with short (and therefore cheaper) reads, the demand for efficient assembly algorithms
that can assemble these short reads with minimal error has grown.
3.3 Assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments of a
much longer DNA sequence in order to reconstruct the original sequence.
Many DNA assembly packages have been developed. They are broadly classified into
two types:
1. Reference-guided assembly tools: these pairwise-align reads to a specified
reference genome to guide assembly. Reference-based assembly works well if the
metagenomic dataset contains sequences for which closely related reference
genomes are already available.
e.g. SOAP, Bowtie.
2. De novo assembly tools: these do not align against any reference; they generally
use an overlap-based approach to assemble the read sequences.
e.g. Velvet, ALLPATHS-LG, SOAPdenovo, Celera.
Several factors need to be considered when exploring the reasons for assembling
metagenomic data. These can be condensed into two important questions. First, what is
the length of the sequencing reads used to generate the metagenomic dataset? Second,
are longer sequences required for annotation? Some approaches, e.g. IMG/M, prefer
assembled contigs, while other pipelines such as MG-RAST require reads of only 75 bp.
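The overlap idea behind de novo assembly can be sketched minimally. Real assemblers such as Velvet build de Bruijn graphs and handle sequencing errors, so the following greedy suffix-prefix merge is only a conceptual illustration:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a: str, b: str, min_len: int = 3) -> str:
    """Join two reads on their overlap, as a greedy assembler would."""
    n = overlap(a, b, min_len)
    return a + b[n:]

print(merge("ATTAGACCTG", "CCTGCCGGAA"))  # prints ATTAGACCTGCCGGAA
```

Repeatedly merging the pair of reads with the largest overlap yields contigs; the branching and chimeric-join problems discussed later arise exactly when several reads share the same overlap.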
3.4 Binning
Binning refers to the process of sorting DNA sequences into groups that represent an
individual genome, or genomes from closely related organisms. Two general approaches
are used in binning:
1. Taxonomy-dependent Binning: The majority of methods available for binning
metagenomic datasets belong to the taxonomy-dependent category. In these methods,
the extent of similarity of reads to reference sequences or pre-computed models
drives the binning process.
(a) Similarity-based methods: the majority of these methods work by aligning
reads to known sequences or to Hidden Markov Models built from known
sequences.
e.g. IMG/M, MG-RAST, MEGAN, CARMA, MetaPhyler
(b) Composition-based methods: these exploit the fact that genomes have a
conserved nucleotide composition (e.g. a certain GC content or a particular
abundance distribution of k-mers), which is also reflected in sequence
fragments of the genomes.
e.g. PhyloPythia, S-GSOM, TACOA
2. Taxonomy-independent Binning: these methods mainly use the correlation
information contained in the read sequences themselves, via statistical or
network-based approaches. TETRA, for example, computes the pairwise correlations
between the tetra-nucleotide patterns of all reads and uses this information to
segregate reads into distinct bins.
e.g. TETRA, SOM, CompostBin, AbundanceBin, MetaCluster.
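The genome-signature idea underlying composition-based and TETRA-style binning can be sketched as a k-mer frequency vector plus a pairwise similarity. This is a simplified illustration, not the actual TETRA algorithm (which correlates z-scores of tetranucleotide patterns):

```python
import math
from collections import Counter

def kmer_profile(read: str, k: int = 4) -> dict:
    """Normalized k-mer (tetranucleotide) frequency vector of a read."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def similarity(p: dict, q: dict) -> float:
    """Cosine similarity between two profiles; reads with similar
    signatures are candidates for the same bin."""
    dot = sum(p[x] * q.get(x, 0.0) for x in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

Reads whose profiles correlate strongly would be placed in the same bin; the short-read limitation discussed below shows up here as profiles that are too sparse to be distinctive.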
Important considerations when choosing a binning algorithm are the type of input data
available and the existence of suitable training datasets or reference genomes. In general,
composition-based binning is not reliable for short reads, as they do not contain enough
information. A short read may, however, contain similarity to a known gene, and this
information can be used to assign the read to a specific bin; this obviously requires the
availability of reference data. If the query sequences are not closely related to the
reference data, reference-based binning will be inefficient, assigning reads only at very
high taxonomic levels.
Post-assembly, the binning of contigs can lead to the generation of partial genomes of
yet-uncultured or unknown organisms, which in turn can be used to perform similarity-
based binning of other metagenomic datasets. Prior to assembly with clonal assemblers,
binning can be used to increase accuracy, reduce the complexity of the assembly effort,
and reduce computational requirements.
3.5 Annotation
DNA annotation is the process of attaching biological information to sequences. It
consists of two main steps:
1. Identifying elements on the genome, a process called gene prediction.
2. Attaching biological information to these elements.
In case of metagenomic annotation two approaches are used:
1. Annotation on assembled sequences: the longer the sequence, the better the
information. If the assembly has produced large enough contigs, it is better to use
existing genome annotation tools such as RAST or IMG. For this approach to be
successful, contigs of 30,000 bp or longer are required.
2. Annotation on un-assembled reads: annotation can also be performed directly
on unassembled reads, but here the existing annotation tools are not very effective.
3.6 Experimental Design and Statistical Analysis
Many of the early metagenomic shotgun-sequencing projects were focused on targeted
exploration of specific organisms (e.g. uncultured organisms in low-diversity acid mine
drainage). Reduction of sequencing cost and a much wider appreciation of the utility of
metagenomics to address fundamental questions in microbial ecology now require proper
experimental designs with appropriate statistical analysis.
3.7 Sharing and Storage of Data
Data sharing has a long tradition in the field of genome research. Efficient storage and
sharing of data provides metadata and centralized services to the research community.
For metagenomic data, this will require a whole new level of organization and
collaboration [11].
Chapter 4
DATA MINING TOOLS AND
TECHNIQUES
From the above we can say that efficient next-generation sequencing methods and
metagenomic assembly determine the resolution of further metagenomic analysis, due to
the rule of thumb “Longer the sequence, better the information”. Although assembly
yields longer sequences, it also bears the risk of creating chimeric contigs, in particular
in samples with closely related species or highly conserved sequences that occur across
species. Assembly efforts increase with more branching in the reads. More branching
in turn creates more possibility for tip, bubble, chimeric connections which are primary
sources of errors in assembly. Furthermore, assembly distorts abundance information,
as overlapping sequences from different species will be identified as belonging to the
same genome and consequently joined. This leads to a relative under representation of
sequences of abundant species [12]. Also the computational complexity of metagenomic
assemblers is still a big question to worry about.
As discussed earlier, binning prior to assembly can be used to increase accuracy and reduce the complexity of the assembly effort. From the flow graph shown above we can see that there are also methodologies for annotation and further statistical analysis which can work on binned reads directly, skipping assembly. Binning of metagenomic reads is therefore a crucial step both for effective assembly and for the further analysis steps that bypass assembly.
Let us review the existing methods that are prominently used in metagenomic classification.
4.1 PhyScimm
The main objectives of metagenomic binning are to find out which metagenomic reads belong to the same strain and where those strains fit on the phylogenetic tree of life. Clustering can be used efficiently to achieve the first objective; classification achieves the second by assigning taxonomic labels to sequences.
Composition-based clustering and classification methods use oligonucleotide frequencies as the property on which metagenomic reads are binned. Composition-based classification methods train on the oligonucleotide frequencies of existing genomes and classify reads using a supervised approach. The recently developed composition-based method Phymm trains an Interpolated Markov Model (IMM) on existing genomes and uses it to classify new reads.
Supervised learning assumes that the training reads are representative of the reads to be classified, but this is not always the case, especially for metagenomic reads, which may contain novel sequences. To tackle this scenario, an unsupervised learning approach, which learns genomic signatures on a set of sequences without the use of existing genomes, can be combined with the supervised approach.
Besides oligonucleotide frequencies, the Markov chain model has great potential to discriminate between sequences [13], and has been implemented successfully in both unsupervised [14] and supervised [15] settings. PhyScimm is a hybrid approach: “SCIMM + Phymm”.
4.1.1 SCIMM
Let us first understand the IMM. In an nth-order Markov chain model, the ith element of a sequence depends on the previous n elements. Given a sequence s and a model m, the likelihood of s being generated by m is given by equation 4.1:

P(s | m) = ∏_{i=n+1}^{|s|} P(s_i | s_{i−1} s_{i−2} ... s_{i−n})    (4.1)
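A fixed-order Markov chain likelihood of this form can be sketched in a few lines of Python. This is an illustrative sketch only (the function names and the add-one smoothing are our own simplifications, not code from Phymm or SCIMM):

```python
import math
from collections import defaultdict

def train_markov(seqs, n):
    """Estimate P(s_i | s_{i-n}...s_{i-1}) for an nth-order Markov chain
    by counting each length-n context and the base that follows it."""
    context_counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(n, len(s)):
            context_counts[s[i - n:i]][s[i]] += 1
    return context_counts, n

def log_likelihood(model, s):
    """log P(s | m): sum over i = n+1..|s| of log P(s_i | previous n bases),
    with add-one smoothing over the 4 DNA bases."""
    counts, n = model
    ll = 0.0
    for i in range(n, len(s)):
        ctx = counts[s[i - n:i]]
        total = sum(ctx.values())
        ll += math.log((ctx[s[i]] + 1) / (total + 4))
    return ll
```

A sequence whose composition matches the training genome receives a higher log-likelihood than a mismatched one, which is exactly the property binning exploits.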
The IMM is a variable-order Markov chain model which switches among fixed-order Markov models depending on context. It may happen that some 8-mers occur much more frequently than certain 6-mers in a sequence, giving more reliable estimates for them; but from the above equation we can observe that 4^{n+1} parameters are needed to build an accurate model, which is exponential in the order of the model. Higher-order models are generally preferable for accurate predictions, but only if they actually yield more accurate estimates than their lower-order counterparts, and they come at the expense of exponentially growing complexity. To balance accuracy against complexity, the IMM gives more weight to oligomers that occur frequently and less to infrequently occurring ones, and uses a linear combination of variable-length models according to the assigned weights. It falls back to a shorter-oligomer model when the longer model is insufficient to produce good-quality predictions, and keeps interpolating among variable-length Markov models depending on context as it moves along the sequence [16].
IMM training creates a probabilistic decision tree using information gain as the splitting criterion. Consider windows of length n + 1 drawn from a set of sequences. The first split selects the position i in the window where the mutual information MI(X_i, X_{n+1}) is maximal, for i = 1...n, and the procedure continues iteratively by computing the conditional mutual information of the remaining positions given the particular nucleotide base at the position chosen by the parent node. To compute the likelihood of a novel sequence, we follow the decision tree down from the root by looking at the nucleotide bases in the novel sequence window.
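The first split of this decision tree can be illustrated as follows (a hypothetical sketch; `best_split_position` and the bit-valued mutual information are our own illustration, not taken from the Phymm implementation):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """MI(X, Y) in bits, estimated from joint samples [(x, y), ...]."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def best_split_position(windows):
    """First split of the decision tree: the position i (1..n) in each
    (n+1)-length window maximizing MI(X_i, X_{n+1})."""
    n = len(windows[0]) - 1
    return max(range(1, n + 1),
               key=lambda i: mutual_information([(w[i - 1], w[-1]) for w in windows]))
```

For example, in windows where the last base is fully determined by position 1 and independent of position 2, the split is made at position 1.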
SCIMM uses the same general algorithm as CEM (Classification Expectation Maximization), where the data points are read sequences and the IMMs are the cluster models. For each sequence s and IMM m, we compute the log of P(m) P(s | m), assign s to the m for which this score is maximal, and retrain the IMMs after every round of assignments.
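The CEM loop can be sketched as below. This is a simplified illustration, not SCIMM itself: a plain smoothed k-mer profile stands in for the IMM, and all function names are our own.

```python
import math
from collections import defaultdict

def train_model(seqs, k=3):
    """Stand-in for IMM training: a k-mer count profile of one cluster."""
    counts = defaultdict(int)
    total = 0
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            total += 1
    return counts, total, k

def log_score(model, s):
    """Stand-in for log P(s | m): sum of smoothed k-mer log-probabilities."""
    counts, total, k = model
    return sum(math.log((counts[s[i:i + k]] + 1) / (total + 4 ** k))
               for i in range(len(s) - k + 1))

def cem(reads, init_assign, iters=10):
    """Classification EM: retrain one model per cluster, then move every
    read to the cluster maximizing log P(m) + log P(s | m); repeat."""
    k_clusters = max(init_assign) + 1
    assign = list(init_assign)
    for _ in range(iters):
        models = [train_model([r for r, a in zip(reads, assign) if a == c])
                  for c in range(k_clusters)]
        priors = [max(assign.count(c), 1) / len(reads) for c in range(k_clusters)]
        assign = [max(range(k_clusters),
                      key=lambda c: math.log(priors[c]) + log_score(models[c], r))
                  for r in reads]
    return assign
```

Starting from a noisy seed partition, a few iterations are enough to separate reads of clearly different composition, mirroring how SCIMM refines its seed clusters.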
SCIMM Pipeline: A functional block diagram of the SCIMM pipeline is shown in Figure 4.1. The algorithm starts by initializing k IMMs. To initialize the IMMs for SCIMM, we can use either an unsupervised approach, e.g. LikelyBin or CompostBin, or a supervised approach, e.g. Phymm, on a random subset of the sequences with a user-specified number of clusters k, and train an IMM on every cluster returned.
The initial partitioning step forms the initial or seed clusters required by SCIMM clustering. We obtained the initial partitioning of the sequences using PhymmBL, a hybrid of composition-based and similarity-based classification: 3,000 sequences are chosen at random, classified, and clustered at a certain taxonomic level, which forms the initial k clusters.
Figure 4.1: SCIMM Pipeline
SCIMM re-trains the IMMs on all metagenomic sequences starting from the seed clusters and assigns them to their corresponding clusters; this loop is shown in Figure 4.1. Over the course of the iterations, the IMMs converge to a set that represents the phylogenetic sources.
4.1.2 Phymm
Unsupervised clustering is less effective on highly complex datasets with many microbial strains (> 20). The classification method Phymm [15] is immune to dataset complexity. It can be used for initial partitioning to reduce dataset complexity by clustering samples from the same genus or family into one cluster.
The hybrid of Phymm (supervised) and SCIMM (unsupervised) forms PhyScimm; if existing unsupervised clustering methods are used for the initial partitioning instead of Phymm, we refer to it simply as SCIMM [17].
As Phymm is a supervised approach, its performance degrades if the query sequences do not come from the taxonomic strains on which it was trained. The results above clearly show that PhyScimm outperforms SCIMM for mixtures containing more than 20 strains. For low-complexity data, SCIMM is comparable to PhyScimm and outperforms it at lower taxonomic levels.
4.2 PhymmBL
This method is a hybrid of the composition-based and similarity-based approaches. As we have already seen for PhyScimm, the IMM has great potential to discriminate sequences in metagenomics. Phymm, a method based on IMMs, trains on the existing non-redundant labelled genomes of the NCBI RefSeq database [18], creating a suite of IMMs, one per labelled sequence in the database. Phymm exhibits a dramatic improvement in results, especially for short read lengths around 100 bp. As we have already seen in the table above, sequencing technologies are becoming cheaper at the cost of shorter read lengths, e.g. Illumina and SOLiD.
BLAST (Basic Local Alignment Search Tool) [19] is a similarity-based approach which compares a query sequence against the sequences in NCBI RefSeq. It is the most accurate method if the query sequences are members of a taxonomic group already present in the database. If novel sequences are present in the query sample, its performance drops drastically in proportion to the fraction of novel sequences.
Each query sequence is scored against each Phymm IMM to find the probability of that particular IMM generating the query sequence, and the query sequence is assigned the label of the IMM returning the highest probability. In parallel, it is aligned against the database sequences using BLAST and assigned the label of the best hit. Finally, the best combined score is computed with equation 4.2, which weights the scores of the two individual methods.
score = IMM + 1.2(4− log(E)) (4.2)
where
IMM: Log-likelihood score returned by IMM
E: best E-value returned by BLAST
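Equation 4.2 translates directly into code. The following is a minimal sketch (the function names are our own, and the logarithm base is taken here as the natural log, following the text):

```python
import math

def phymmbl_score(imm_log_likelihood, blast_evalue):
    """Combined score of equation 4.2: the IMM log-likelihood plus a
    weighted BLAST term that grows as the best E-value shrinks."""
    return imm_log_likelihood + 1.2 * (4 - math.log(blast_evalue))

def classify(read_scores):
    """Pick the taxonomic label with the highest combined score.
    read_scores: {label: (imm_log_likelihood, best_evalue)} (hypothetical)."""
    return max(read_scores,
               key=lambda lab: phymmbl_score(*read_scores[lab]))
```

Note that a more significant BLAST hit (smaller E-value) raises the combined score, so the similarity evidence can override a slightly weaker composition score.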
Naturally, Phymm's accuracy improves with read length. Phymm exhibited 32.8 % accuracy for 100-bp reads at the genus level and 60.3 % for 400-bp reads, as shown in Figure 4.2 and Figure 4.3. This is much better than existing methods: CARMA shows 6 % for 100-bp reads at the genus level, and the SVM-based PhyloPythia exhibits 7.1 % for 1000-bp reads at the genus level [15].
Figure 4.2: Accuracy of Blast and Phymm at Phylum level classification
Figure 4.3: Accuracy of Blast and Phymm at Genus level classification
From Figure 4.2 and Figure 4.3, we can observe that Phymm outperforms BLAST at the upper levels of classification above 400-bp read length. BLAST is superior to Phymm as we go down the phylogenetic classification towards the species level; for shorter read lengths in particular, it outperforms Phymm at all levels.
Chapter 5
RESULTS AND OBSERVATIONS
To compare the performance of PhymmBL and PhySCIMM, we experimented with two datasets. We compared accuracy to find the drawbacks of these tools and ways to improve them. Accuracy represents the percentage of metagenomic reads that are clustered or classified correctly to their phylogenetic source, out of the total number of classified reads, as given by equation 5.1:

Accuracy = (Number of correctly clustered or classified reads) / (Total number of classified reads)    (5.1)
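Equation 5.1 as a small helper function (an illustrative sketch; the names are our own):

```python
def binning_accuracy(predicted, truth):
    """Equation 5.1: percentage of classified reads whose predicted
    phylogenetic source matches the true one."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(predicted)

# e.g. 3 of 4 reads assigned to the right source
print(binning_accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"]))  # → 75.0
```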
5.1 Dataset 1 (Synthetic Metagenomic Dataset)
This synthetic metagenomic dataset was created artificially to emulate a real metagenomic environment. Its reads were randomly chosen from RefSeq (NCBI). It includes a core library of all complete bacterial and archaeal genomes, comprising 539 distinct species, 53 genera, 48 families, 34 orders, 21 classes, and 14 phyla.
To control for under-representation of some clades in the available data, query sets were filtered so that all species under consideration had at least two sister species within the clade under consideration. For example, in the experiment that masked exact species matches but allowed intra-genus comparisons, without this filtering step, if a given species were the only sequenced representative of its genus, it would have been impossible to assign a correct genus label to reads from that species. Each synthetic test set initially contained 5 randomly selected reads from each of the 1,146 chromosomes and plasmids in the RefSeq reference data, totalling 5,730 reads representing 539 bacterial and archaeal species [15].
We conducted experiments with different read lengths: 200, 400, 800 and 1000 bp. With each read-length dataset, we observed accuracy at all taxonomic levels, i.e. phylum, class, order, family, genus and species.
Figures 5.1 and 5.2 show the accuracy of PhymmBL and PhySCIMM with default parameters at the various taxonomic levels, for 1000-bp and 200-bp read lengths respectively. Detailed observations at the other read lengths are given in Appendix-A. We can observe that the accuracy of PhymmBL in both cases is very close to 100 %. This happens because all query reads are picked from RefSeq (NCBI), on which PhymmBL has been trained: PhymmBL performs outstandingly if the species in the metagenomic sample are known, i.e. a reference is available. The performance of PhySCIMM, in comparison, is very poor, because our dataset contains 539 species, i.e. it is highly complex. A detailed comparison at all read lengths can be found in the supplementary document.
Figure 5.1: Accuracy of PhymmBL and PhySCIMM at 1000 bp read length
5.2 Dataset 2 (FAMeS)
To evaluate various methods that are used to process metagenomic sequences, sim-
ulated datasets of varying complexity were constructed by combining sequencing reads
randomly selected from 113 isolate genomes. These datasets were designed to model real
metagenomes in terms of complexity and phylogenetic composition [20]. The datasets are dominated by a few species, but have a long tail of very-low-abundance species.
Figure 5.2: Accuracy of PhymmBL and PhySCIMM at 200 bp read length
We created a dataset with 9 species, dominated by 4 species (85.09 %). We deliberately kept few species to emulate a less complex dataset, and experimented with it at all levels of taxonomic classification with default parameters.
From Figure 5.3, we can observe that PhymmBL again gives very high accuracy with this dataset. This could be because the metagenomic mixture is constructed from already known species. Comparing the performance of PhymmBL here with its performance on dataset-1, we can conclude that PhymmBL's performance is independent of dataset complexity, i.e. of the number of strains present in the metagenomic dataset. The point to be observed is that the performance of PhySCIMM has gone up drastically compared to dataset-1: we got about a 38 % to 65 % difference in accuracy. This suggests that PhySCIMM performs poorly on highly complex data, i.e. mixtures with many species.
Going into detail, we got 13 clusters at the output of PhySCIMM at the species level. Of these 13 clusters, 12 were dominated by the 4 species present in high abundance (85.09 %) in the input dataset:
Figure 5.3: Accuracy of PhymmBL and PhySCIMM with less complex dataset
5 clusters were dominated by Xylella fastidiosa with an average accuracy of 92.03 %;
3 clusters by Rhodopseudomonas palustris with an average accuracy of 71.74 %;
3 clusters by Rhodospirillum rubrum with an average accuracy of 87.84 %;
1 cluster by Moorella thermoacetica with an average accuracy of 92.85 %;
1 cluster contained mixed reads with no specific dominance (a cluster of un-clustered reads).
This shows that PhySCIMM could not distinguish the species with lower abundance. The clustering accuracy of the highly abundant species is also hampered by the noise introduced by the low-abundance species.
5.3 Drawbacks
From above two experiments, we can brief drawbacks found in existing system as
1. Performance of Physcimm (Unsupervised Approach) degrades with high complexity
data (mixture with many species).
2. Physcimm (Unsupervised Approach) fails to identify species with lower abundance
25
level.
3. Physcimm (Unsupervised Approach) clustering accuracy of species with higher
abundance level degrades due to presence of low abundant species (noise).
26
Chapter 6
DESIGN AND
IMPLEMENTATION
6.1 Design
The experimental results discussed in the previous chapter point out the major problems with unsupervised methods in metagenomic binning:
1. High complexity of the metagenomic dataset (number of species in the dataset).
2. Differences in the abundance levels of species in the metagenomic dataset.
To minimize these two factors, we need a system that brings down the complexity of the metagenomic dataset before processing it with the unsupervised method. The system should also be capable of providing nearly equally abundant input to the unsupervised method.
Considering these requirements, we designed the system shown in Figure 6.1. The stages of the system design are explained below.
1. Filtering Reads: This step separates the reads in the metagenomic mixture into high-abundance reads and low-abundance reads based on k-mer frequency. This helps us separate the noisy low-abundance species from the high-abundance species, which are biologically more significant [21]. By doing this we reduce the complexity of the data by splitting it into two parts, and we can re-iterate over this step with different values of k. This step minimizes the abundance variation in the input dataset, bringing it toward an evenly distributed dataset with nearly uniform abundance.
Figure 6.1: Block diagram of proposed solution
2. Left wing: The left wing of the proposed method bins the high-abundance species. Since the low-abundance species are separated off in the filtering block, the clustering accuracy of the biologically significant species improves. The left wing receives an input dataset of high-abundance species with minimized abundance variation, and extracts most of the biological significance, since an ecosystem is dominated by a few species.
3. Random-1: This step selects random sequences from the input metagenomic reads and aligns them against an existing database using an existing supervised binning tool. This gives us the strains in the input that are already present in the database, i.e. of known taxonomy, and also serves as the seed input for training the IMMs.
4. Iterative IMM: Starting from the seed bins given by the above step, IMMs are trained, classify the remaining input reads, and re-train themselves iteratively.
5. Right wing: The right wing of the proposed method takes care of the low-abundance species. It receives an input dataset of low-abundance species with minimized abundance variation, and we use the same approach to bin these reads as in the left wing.
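The filtering stage (stage 1 above) can be sketched as follows. This is an illustrative sketch, not the actual implementation: the value of k, the threshold, and the mean-k-mer-frequency score of a read are assumptions made for the example.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Global k-mer frequency table over all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def split_by_abundance(reads, k=15, threshold=2.0):
    """Stage 1 of the design: reads whose mean k-mer frequency exceeds the
    threshold are treated as coming from high-abundance species; the rest
    go to the low-abundance split (k and threshold are illustrative)."""
    counts = kmer_counts(reads, k)
    high, low = [], []
    for r in reads:
        kmers = [r[i:i + k] for i in range(len(r) - k + 1)]
        mean_freq = sum(counts[w] for w in kmers) / len(kmers)
        (high if mean_freq > threshold else low).append(r)
    return high, low
```

Reads from an abundant species share k-mers with many other reads and therefore score high, while reads from a rare species carry mostly unique k-mers.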
6.2 Method
The primary assumption of this method is that reads sampled from a genome follow a normal distribution [22]. The Expectation Maximization (EM) method is used to find the parameters of this distribution, and abundance is discriminated based on the fitted Poisson distribution: sequencing produces reads with unequal sampling when species abundance levels vary.
The sequencing output in metagenomics is considered a mixture of m Poisson distributions, m being the number of species. The goal is to find the mean values λ_1 to λ_m of these Poisson distributions, which are the abundance levels of the species.
When sequencing a genome, the probability of a read starting at a certain position is

P_r = N / (G − L + 1)    (6.1)
where
N : Number of reads
G : Genome size
L: Length of reads.
Given G ≫ L, this approximates to N/G. Assume x is a read and an l-tuple w belongs to x. The number of occurrences of w in the set of reads follows a Poisson distribution with parameter λ = N(L − l + 1)/(G − L + 1), which approximates to NL/G. In metagenomics, G is the total genome size of all species in the sample. If the abundance of a species i is n, the total number of occurrences of w in the whole metagenomic sample coming from the different species also follows a Poisson distribution, with parameter λ_i = nλ. The problem of finding the relative abundance levels of the different species is thus transformed into the modelling of a mixed Poisson distribution.
Given metagenomic reads as input, the algorithm first counts the occurrences of every l-tuple over all reads. Denote

x = {n(w_i)},  i = 1 ... W    (6.2)

where
n(w_i): count of tuple w_i
W: total number of l-tuples
The goal of the algorithm is to optimize the logarithm of the joint probability log P(x, θ) of obtaining the particular l-tuple counts x together with the parameters θ = {S, g, λ}, where
S: total number of bins
g = {g_i}: genome sizes
λ = {λ_i}: abundance levels
The hidden variable in this optimization problem is the bin identity to which each l-tuple belongs. The Expectation Maximization (EM) algorithm is used to solve it:
1. Initialize the total number of bins S, with genome sizes g_i and abundance levels λ_i.
2. (E-step) Calculate the probability that the l-tuple w_j (j = 1, 2, ..., W) comes from the ith species, given its count n(w_j):

P(w_j ∈ s_i | n(w_j)) = g_i / [ ∑_{m=1}^{S} g_m (λ_m / λ_i)^{n(w_j)} e^{(λ_i − λ_m)} ]    (6.3)

3. (M-step) Calculate new values for g_i and λ_i:

g_i = ∑_{j=1}^{W} P(w_j ∈ s_i | n(w_j)),    λ_i = [ ∑_{j=1}^{W} n(w_j) P(w_j ∈ s_i | n(w_j)) ] / g_i    (6.4)

4. Iterate steps 2 and 3 until the parameters converge or the number of iterations exceeds a maximum. Convergence is defined as

| λ_i^{t+1} / λ_i^t − 1 | < 10^{−5}  and  | g_i^{t+1} / g_i^t − 1 | < 10^{−5}    (6.5)
Once the EM algorithm converges, we can estimate the probability that a read is assigned to a bin, based on the binning results of its l-tuples, as

P(r_k ∈ s_i) = ∏_{w_j ∈ r_k} P(w_j ∈ s_i | n(w_j)) / ∑_{s_i ∈ S} [ ∏_{w_j ∈ r_k} P(w_j ∈ s_i | n(w_j)) ]    (6.6)

where
r_k: a given read
w_j: the l-tuples belonging to r_k
s_i: a bin
A read is assigned to the bin with the highest probability. A read remains unassigned if 90 % of its l-tuples are excluded.
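The EM procedure of steps 1-4 can be sketched in Python as below. This is an illustrative sketch with our own initialization scheme and numerical guard values; the E-step of equation 6.3 is computed in its equivalent normalized form g_i λ_i^{n(w_j)} e^{−λ_i} / ∑_m g_m λ_m^{n(w_j)} e^{−λ_m}.

```python
import math

def em_abundance(tuple_counts, n_bins, iters=200, tol=1e-5):
    """EM for a mixture of Poissons over l-tuple counts (equations 6.3-6.4).
    tuple_counts: the observed n(w_j); returns the bin weights g_i and the
    abundance levels lambda_i."""
    W = len(tuple_counts)
    lo, hi = min(tuple_counts), max(tuple_counts)
    # step 1: crude initialization, lambdas spread across the observed counts
    lam = [lo + (i + 1) * (hi - lo) / (n_bins + 1) + 1e-6 for i in range(n_bins)]
    g = [W / n_bins] * n_bins
    for _ in range(iters):
        # step 2 (E-step, eq. 6.3): posterior that tuple j came from bin i,
        # computed in log space for numerical stability
        post = []
        for n_wj in tuple_counts:
            logs = [math.log(max(g[i], 1e-12)) - lam[i]
                    + n_wj * math.log(max(lam[i], 1e-12)) for i in range(n_bins)]
            m = max(logs)
            exps = [math.exp(v - m) for v in logs]
            z = sum(exps)
            post.append([e / z for e in exps])
        # step 3 (M-step, eq. 6.4): update g_i and lambda_i
        new_g = [sum(post[j][i] for j in range(W)) for i in range(n_bins)]
        new_lam = [sum(tuple_counts[j] * post[j][i] for j in range(W))
                   / max(new_g[i], 1e-12) for i in range(n_bins)]
        # step 4: stop when all parameters have converged
        done = all(abs(new_lam[i] / max(lam[i], 1e-12) - 1) < tol
                   for i in range(n_bins))
        g, lam = new_g, new_lam
        if done:
            break
    return g, lam
```

On tuple counts drawn from two well-separated abundance levels, the recovered lambdas approach the per-group mean counts, i.e. the two abundance levels.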
6.3 Results
We used the FAMeS dataset to study the effect of abundance separation on overall binning accuracy. To set a benchmark for the accuracy improvement, we first performed this experiment with manual separation of the metagenomic input: as our dataset is synthetic and its full composition is known, we could separate it by abundance with 100 % accuracy. The accuracy improvement is shown in Figure 6.2.
Figure 6.2: Accuracy comparison for abundance separation
Automatic: accuracy after abundance separation on k-mer frequency.
Manual: accuracy after manual abundance separation.
Before: accuracy without considering abundance variance.
Figure 6.2 clearly shows the effect of abundance variation on the overall accuracy of metagenomic binning. Though we got some improvement in overall accuracy, it is still not very impressive when set against the computational and time overhead caused by adding the abundance-separation step before binning.
While considering other possible solutions, we re-analyzed the output clusters of PhyScimm, as shown in Table 6.1.
Size Accuracy
5136 98.91
3018 98.51
2031 99.02
1878 98.83
1226 33.61
1041 58.6
708 97.46
613 94.29
551 57.71
452 91.15
378 92.86
372 78.76
340 84.64
300 65.67
Table 6.1: Accuracy table sorted by cluster size
Figure 6.3: Cluster accuracy in descending order of size
Consider cluster size as an indicator of species abundance. Figure 6.3 shows accuracy in descending order of cluster size.
Let us treat cluster size as a spectrum width and accuracy as an indicator of overlap with the adjacent spectra: higher accuracy means less overlap of that cluster with its adjacently sized clusters. The task is then to split the whole spectrum so that the abundance variation is minimized, which can be done by observing the trend of accuracy against size. From Figure 6.3, we can observe a groove at accuracy 33.61 %: the cluster to its left has accuracy 98.83 % and the cluster to its right 58.60 %. This clearly indicates that the groove cluster (33.61 %) overlaps strongly with the right spectrum and only slightly with the left one. We therefore split the dataset at this groove cluster and add it to the right split. In general: “split the clusters at the groove and add the groove cluster to the split with more overlap”.
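The "split at the groove" rule can be sketched as follows. This is a hypothetical sketch: the (size, accuracy) tuple representation and the tie-breaking are our own assumptions for the example.

```python
def split_at_groove(clusters):
    """clusters: list of (size, accuracy) sorted by descending size.
    Cut at the minimum-accuracy interior cluster (the groove) and attach it
    to the neighbouring side showing more overlap, i.e. the side whose
    adjacent cluster has the lower accuracy."""
    groove = min(range(1, len(clusters) - 1), key=lambda i: clusters[i][1])
    left_acc = clusters[groove - 1][1]
    right_acc = clusters[groove + 1][1]
    if right_acc <= left_acc:
        # more overlap to the right: groove cluster joins the right split
        return clusters[:groove], clusters[groove:]
    return clusters[:groove + 1], clusters[groove + 1:]
```

Applied to the first rows of Table 6.1, the 33.61 % cluster is detected as the groove and attached to the right split, matching the example in the text.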
To imitate a real metagenomic environment and to have sufficient data for this iterative approach, we selected the whole FAMeS dataset, with 113 species and 114,457 reads.
Over the iterations, we tried two approaches.
Approach 1: A schematic representation of approach-1 is shown in Figure 6.4. In this approach, we collectively analysed the output clusters of every split in the previous iteration and formed new splits for the next iteration where necessary.
Figure 6.4: Schematic representation of approach-1
Figure 6.5: Accuracy over the iterations in approach-1
From Figure 6.5, we can observe that we got an 8.78 % improvement in overall accuracy. Sensitivity in iteration-3 is 102018/114457 = 89.13 %.
Approach 2: A schematic representation of approach-2 is shown in Figure 6.6. In this approach, we went on analysing and splitting the output of every iteration separately, similar to a tree structure.
Figure 6.6: Schematic representation of approach-2
From Figure 6.7, we can observe that we got a 13.89 % improvement in overall accuracy. Sensitivity in iteration-3 is 98849/114457 = 86.36 %.
Figure 6.7: Accuracy over the iterations in approach-2
The only disadvantage of approach-2 is that the number of output clusters keeps growing as the iterations proceed. In our experiment we got 132 clusters in the third iteration, which is more than the total number of species in the input dataset (113); this means some species are spread over more than one cluster.
Chapter 7
CONCLUSIONS AND FUTURE
SCOPE
7.1 Conclusions
We successfully experimented with metagenomic binning on different datasets and identified the potential of the unsupervised approach to metagenomic binning. We identified abundance variance as a major factor in the effectiveness of unsupervised metagenomic binning, performed different experiments to overcome this problem, and obtained a peak accuracy improvement of 13.89 %. The k-mer frequency measure for identifying abundance did not perform well on its own and could be improved by hybridizing it with another measure. Though we obtained a satisfactory improvement in accuracy, there is still much scope for improvement in unsupervised metagenomic binning, and further research in this direction will certainly advance the field of metagenomics.
7.2 Future Scope
The way ahead involves applying innovative approaches that use data characteristics to handle the abundance-variance problem. In the k-mer frequency criterion, the selection of a proper k is a crucial decision that affects its abundance-detection capability; automatic selection of the k-value requires further research into its relationship with data characteristics. Combining another measure with the k-mer frequency count would also be beneficial. In the iterative approaches, the number of iterations to carry out is a prime decision, because after a certain number of iterations the results start degrading. It may happen that a particular species is split into two or more clusters; in such cases a Markov model can be developed to identify clusters containing the same species and merge them into one. The time complexity of Markov model training is very high, which limited us to experimenting with a few datasets; intense parallelization of the training phase using the Hadoop architecture could solve this problem. Further refinement of the experiments with different datasets will also help to remove any remaining minor drawbacks.
Appendix-A
A.1 Dataset-1
A.1.1 Accuracy at 1000 bp read length
PhymmBL PhyScimm
phylum 99.598 61.534
class 99.556 60.355
order 99.488 57.563
family 99.36 48.246
genus 98.476 34.176
Table 1: Accuracy of PhymmBL and PhySCIMM at 1000 bp read length
A.1.2 Accuracy at 800 bp read length
PhymmBL PhyScimm
phylum 99.598 61.475
class 99.519 60.950
order 99.435 55.915
family 99.305 47.059
genus 98.335 33.651
Table 2: Accuracy of PhymmBL and PhySCIMM at 800 bp read length
A.1.3 Accuracy at 400 bp read length
PhymmBL PhyScimm
phylum 99.616 60.703
class 99.593 57.873
order 99.506 52.725
family 99.397 45.416
genus 98.441 30.848
Table 3: Accuracy of PhymmBL and PhySCIMM at 400 bp read length
A.1.4 Accuracy at 200 bp read length
PhymmBL PhyScimm
phylum 99.581 59.013
class 99.538 50.928
order 99.541 48.664
family 99.433 38.730
genus 98.721 26.807
Table 4: Accuracy of PhymmBL and PhySCIMM at 200 bp read length
A.1.5 Accuracy at Phylum Level
PhymmBL PhyScimm
1000 99.598 61.534
800 99.598 61.475
400 99.596 60.703
200 99.581 59.013
Table 5: Accuracy of PhymmBL and PhySCIMM at Phylum Level
A.1.6 Accuracy at Class Level
PhymmBL PhyScimm
1000 99.556 60.355
800 99.519 60.150
400 99.593 57.873
200 99.538 50.928
Table 6: Accuracy of PhymmBL and PhySCIMM at Class Level
A.1.7 Accuracy at Order Level
PhymmBL PhyScimm
1000 99.488 57.563
800 99.435 55.915
400 99.50 52.725
200 99.541 48.664
Table 7: Accuracy of PhymmBL and PhySCIMM at Order Level
A.1.8 Accuracy at Family Level
PhymmBL PhyScimm
1000 99.36 50.558
800 99.305 48.101
400 99.397 44.488
200 99.433 38.258
Table 8: Accuracy of PhymmBL and PhySCIMM at Family Level
A.1.9 Accuracy at Genus Level
PhymmBL PhyScimm
1000 98.476 34.797
800 98.335 33.651
400 98.441 30.848
200 98.721 26.807
Table 9: Accuracy of PhymmBL and PhySCIMM at Genus Level
A.2 Dataset-2
A.2.1 Accuracy at various taxonomic levels
PhymmBL PhyScimm
phylum 99.018 96.539
class 98.38 92.987
order 97.842 91.355
family 97.83 91.637
genus 97.521 91.477
Table 10: Accuracy of PhymmBL and PhySCIMM at 200 bp read length
Appendix-B
B.1 Publication Status
Title Conference Status
Metagenomic Binning:
An Overview and Methods NCIC-2013, Coimbatore Published
Table 11: Paper publication status
Bibliography
[1] “Bioinformatics”, http://en.wikipedia.org/wiki/Bioinformatics
[2] Paul R. Graves and Timothy A. J. Haystead, “Molecular Biologist’s Guide to Proteomics”, American Society for Microbiology, 2002. [Online] Available: http://www.pubmedcentral.gov/articlerender.fcgi?artid=120780
[3] William E. Evans and Mary V. Relling, “Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics”, Science, Vol. 286, no. 5439, pp. 487-491, DOI: 10.1126/science.286.5439.487. [Online] Available: http://www.sciencemag.org/content/286/5439/487.full
[4] “Cell Structure”, http://en.wikipedia.org/wiki/Cell (biology)
[5] Raven, Johnson, Losos, Mason and Singer, “Biology, Eighth Edition”, McGraw-Hill Higher Education.
[6] “DNA Structure” http://www.accessexcellence.org/RC/VL/GG/dna2.php
[7] Robert W. Simons, “RNA structure and function”, Cold Spring Harbor Laboratory Press, 1998.
[8] Gregory A. Petsko and Dagmar Ringe, “Protein structure and function”, New
Science Press, 2004.
[9] “Protein Synthesis”, http://en.wikipedia.org/wiki/File:Proteinsynthesis.png
[10] “Reverse Engineering”, http://en.wikipedia.org/wiki/Reverse engineering
[11] T. Thomas, J. Gilbert, and F. Meyer, “Metagenomics - a guide from
sampling to data analysis.” [Online]. Available: http://www.pubmedcentral.nih.gov/
articlerender.fcgi?artid=3351745
[12] H. Teeling and F. O. Glockner, “Current opportunities and challenges in microbial metagenome analysis - a bioinformatic perspective,” Brief Bioinform, 2012. [Online]. Available: http://bib.oxfordjournals.org/content/early/2012/09/26/bib.bbs039.full.pdf
[13] J. Bohlin, E. Skjerve, and D. W. Ussery, “Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes,” BMC Genomics, vol. 9, p. 104, 2008. [Online]. Available: http://www.biomedcentral.com/content/pdf/1471-2164-9-104.pdf
[14] A. Kislyuk, S. Bhatnagar, J. Dushoff, and J. S. Weitz, “Unsupervised statistical
clustering of environmental shotgun sequences,” BMC Bioinformatics, vol. 10, p.
316, 2009. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-10-316
[15] A. Brady and S. L. Salzberg, “Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models,” Nat. Methods, vol. 6, no. 9, pp. 673–676, 2009. [Online]. Available: http://www.nature.com/nmeth/journal/v6/n9/suppinfo/nmeth.1358 S1.html
[16] S. Salzberg, A. Delcher, S. Kasif, and O. White, “Microbial gene identification using
interpolated markov models,” Nucleic Acids Res., vol. 26, no. 2, pp. 544–548, 1998.
[17] D. R. Kelley and S. L. Salzberg, “Clustering metagenomic sequences with
interpolated markov models.” [Online]. Available: http://www.pubmedcentral.nih.
gov/articlerender.fcgi?artid=3098094
[18] K. D. Pruitt, T. A. Tatusova, and D. R. Maglott, “NCBI reference sequence
(refseq): a curated non-redundant sequence database of genomes, transcripts and
proteins,” Nucleic Acids Research, vol. 33, no. Database-Issue, pp. 501–504, 2005.
[Online]. Available: http://dx.doi.org/10.1093/nar/gki025
[19] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Res., vol. 25, pp. 3389–3402, 1997.
[20] K. Mavromatis, N. Ivanova, K. Barry, H. Shapiro, E. Goltsman, A. C. McHardy, I. Rigoutsos, A. Salamov, F. Korzeniewski, M. Land, A. Lapidus, I. Grigoriev, P. Richardson, P. Hugenholtz, and N. C. Kyrpides, “Use of simulated data sets to evaluate the fidelity of metagenomic processing methods,” 2007.
[21] Y.-W. Wu and Y. Ye, “A novel abundance-based algorithm for binning metagenomic sequences using l-tuples,” J. Comput. Biol., vol. 18, pp. 523–534, 2011.
[22] E. S. Lander and M. S. Waterman, “Genomic mapping by fingerprinting random clones: a mathematical analysis,” Genomics, vol. 2, no. 3, pp. 231–239, Apr. 1988.