Metagenomic Sequence Analysis using
Hybrid Approach
Dissertation
Submitted by
Umesh G Gadhe
Roll No: 121122007
in partial fulfillment of the requirements
for the degree of
M.Tech Computer Engineering
Under the guidance of
Dr. Sachin Deshmukh
College of Engineering, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
June, 2013
DEPARTMENT OF COMPUTER ENGINEERING
AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation titled
Metagenomic Sequence Analysis using Hybrid
Approach
has been successfully completed
By
Umesh G Gadhe
(121122007)
and is approved for the degree of
Master of Technology.
Dr. Sachin Deshmukh Dr. J V Aghav
Project Guide Head of Department
Dept. of Computer Engineering Dept. of Computer Engineering
and Information Technology, and Information Technology,
College of Engineering Pune, College of Engineering Pune,
Shivajinagar, Pune - 5. Shivajinagar, Pune - 5.
Date:
Dedicated to
my mother Smt. Vimal G Gade and my aunty Smt. Sunita B Gade who have
always been a constant source of inspiration for me, my entire life
and
my father Shri. Gangadhar D Gade and my uncle Shri. Bhaskar D Gade, who
have always been my role model for hard work, persistence, patience and always
supported me open heartedly in all my endeavours.
Acknowledgments
I express my deepest gratitude towards my project guide Dr. Sachin Deshmukh for
his constant help and encouragement throughout the project work. I have been fortunate
to have a guide who gave me the freedom to explore on my own and at the same time
helped me plan the project with timely reviews and constructive comments and suggestions
wherever required. A big thanks to him for having faith in me throughout the project,
and for his calm and understanding nature, which helped me cope with the hard moments
and mistakes that I came across in the project work.
I am also grateful to Dr. A D Sahasrabudhe (Director, College of Engineering,
Pune) and Dr. J V Aghav (Head, Department of Computer Engineering and Infor-
mation Technology, College of Engineering, Pune) for providing all resources for project
work whenever needed.
I would like to thank Dr. Aarati Desai (Domain Expert, Persistent Systems, Pune),
who taught us the bioinformatics subject so enthusiastically and practically that it made
understanding the concepts easy. Special thanks for the continuous support and
encouragement she extended whenever I was stuck on a problem during the project work. I
also take this opportunity to thank all the teachers and staff who have constantly helped
me grow, learn and mature, both personally and professionally, throughout the process.
A BIG thanks goes to my dearest classmates who made this journey so joyful and
lively; without them the journey wouldn’t have been so interesting and memorable.
They have always supported and guided me, and have helped me stay sane throughout this
and every other chapter of my life. I greatly value their friendship and deeply appreciate
their belief in me.
Most importantly, I would like to express my heart-felt gratitude to my family.
Umesh G Gadhe
Abstract
The term metagenome was coined in 1998 by Handelsman. Metagenomics is the
study of metagenomes, genetic material recovered directly from environmental samples
in their natural conditions. While traditional microbial genomics relies upon cultivated
clonal cultures, over 99% of species resist cultivation and remain undiscovered.
Sequencing of environmental DNA (metagenomics) has shown tremendous potential to
drive the discovery and understanding of these un-culturable species.
Culture independent methods are used to obtain information about the genetic diversity,
population structure, and ecological roles of members of the communities.
Over the past few years, the major challenge associated with metagenomics has shifted
from generating to analyzing sequences. Metagenomic analysis includes the identification,
functional and evolutionary analysis of the genomic sequences of a community of organ-
isms. There are many challenges involved in the analysis of these datasets including sparse
meta-data, a high volume of sequence data, genomic heterogeneity and incomplete se-
quences. Due to this nature of metagenomic data, analysis is very complex and requires
new approaches and significant computational resources. Advances in computational
analysis techniques are essential to meet all these challenges. While supervised learning
has proven useful in practice, shortcomings exist. Methods trained on the genomes in
publicly available genomic databases like GenBank make an implicit assumption that
known genomes are representatives of microbes waiting to be found by metagenomic
projects. This assumption is clearly violated by many of the metagenomic samples. Al-
ternatively, genome signatures can be used for unsupervised clustering by learning the
signatures from the set of sequences without the use of known genomes. But applying
either method for analysis has its own shortcomings.
This dissertation studies and experiments with different data mining strategies in both
approaches, and proposes an optimized hybrid model that removes the existing
shortcomings and improves binning performance.
Contents
List of Figures i
List of Tables ii
1 BIOINFORMATICS 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Role of Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 BIOLOGY BEHIND BIOINFORMATICS 4
2.1 Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 DNA (DeoxyriboNucleic Acid) . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 RNA (RiboNucleic Acid) . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Mapping Biology in Computer Terms . . . . . . . . . . . . . . . . . . . . 9
3 METAGENOMICS 10
3.1 Extraction and Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.6 Experimental Design and Statistical Analysis . . . . . . . . . . . . . . . . 15
3.7 Sharing and Storage of Data . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 DATA MINING TOOLS AND TECHNIQUES 16
4.1 PhyScimm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.1 SCIMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.2 Phymm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 PhymmBL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 RESULTS AND OBSERVATIONS 22
5.1 Dataset 1 (Synthetic Metagenomic Dataset) . . . . . . . . . . . . . . . . 22
5.2 Dataset 2 (FAMeS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6 DESIGN AND IMPLEMENTATION 27
6.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7 CONCLUSIONS AND FUTURE SCOPE 36
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Appendix-A 38
Appendix-B 42
Bibliography 43
List of Figures
1.1 Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 Basic cell structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 DNA Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Protein Synthesis steps . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Flow diagram of a typical metagenome analysis . . . . . . . . . . 11
4.1 SCIMM Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Accuracy of Blast and Phymm at Phylum level classification . . 21
4.3 Accuracy of Blast and Phymm at Genus level classification . . . 21
5.1 Accuracy of PhymmBL and PhySCIMM at 1000 bp read length 23
5.2 Accuracy of PhymmBL and PhySCIMM at 200 bp read length 24
5.3 Accuracy of PhymmBL and PhySCIMM with less complex dataset 25
6.1 Block diagram of proposed solution . . . . . . . . . . . . . . . . . . 28
6.2 Accuracy comparison for abundance separation . . . . . . . . . . 31
6.3 Cluster accuracy in descending order of size . . . . . . . . . . . . . 32
6.4 Schematic representation of approach-1 . . . . . . . . . . . . . . . . 33
6.5 Accuracy over the iterations in approach-1 . . . . . . . . . . . . . 34
6.6 Schematic representation of approach-2 . . . . . . . . . . . . . . . . 34
6.7 Accuracy over the iterations in approach-2 . . . . . . . . . . . . . 35
List of Tables
3.1 Sequencing Technologies . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.1 Accuracy table sorted by cluster size . . . . . . . . . . . . . . . . . 32
1 Accuracy of PhymmBL and PhySCIMM at 1000 bp read length 38
2 Accuracy of PhymmBL and PhySCIMM at 800 bp read length 38
3 Accuracy of PhymmBL and PhySCIMM at 400 bp read length 39
4 Accuracy of PhymmBL and PhySCIMM at 200 bp read length 39
5 Accuracy of PhymmBL and PhySCIMM at Phylum Level . . . . 39
6 Accuracy of PhymmBL and PhySCIMM at Class Level . . . . . 40
7 Accuracy of PhymmBL and PhySCIMM at Order Level . . . . . 40
8 Accuracy of PhymmBL and PhySCIMM at Family Level . . . . 40
9 Accuracy of PhymmBL and PhySCIMM at Genus Level . . . . . 41
10 Accuracy of PhymmBL and PhySCIMM at 200 bp read length 41
11 Paper publication status . . . . . . . . . . . . . . . . . . . . . . . . . 42
Chapter 1
BIOINFORMATICS
1.1 Introduction
The field of bioinformatics came into existence about 40 years ago; the term
“Bioinformatics” came into wide use in the 1990s. At the beginning, the field dealt
with the creation and maintenance of databases to store biological data such as
nucleotide sequences and amino acid sequences. Further development in the field involved
database design issues and the development of complex interfaces.
Bioinformatics is an interdisciplinary field that develops and improves upon methods
for storing, retrieving, organizing and analyzing biological data. A major activity in
bioinformatics is to develop software tools to generate useful biological knowledge [1].
Bioinformatics applies computer science and information technology to problems in
biological science; it is the science of managing and analyzing biological data using
advanced computational techniques. Database and information systems are used to
collect, store, and analyze biological information, and this analysis can then be used in
gene-based drug discovery and development. Figure 1.1 visualizes the interdisciplinary
nature of the field.
Bioinformatics has four main streams:
1. Genomics
The term genomics was coined by Thomas Roderick. Genomics is a discipline in
genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics
to sequence, assemble, and analyze the function and structure of genomes. The field
genomics includes efforts to determine the entire DNA sequence of organisms and
Figure 1.1: Bioinformatics
fine scale genetic mapping. In contrast, the investigation of the roles and functions
of single genes is a primary focus of molecular biology or genetics and is a common
topic of modern medical and biological research. Genomics has wide applications
in drug discovery, development, diagnostics and therapy.
2. Proteomics
The word proteomics was coined in 1997. Proteomics is the study of protein
sequences, i.e. amino acid sequences, in complement to the genome. It studies the
structure and function of proteins, which is very useful for understanding the metabolism
of an organism; proteins are vital constituents of any organism and form the metabolic
pathways of the cell. Proteomics is an interdisciplinary field that grew out of the
research and development of the Human Genome Project; it explores the proteome, the
overall intracellular protein composition, structure, and activity patterns, and is an
important component of functional genomics [2]. Functions of proteins include enzyme
catalysis, transport, mechanical support, organelle constituents, storage reserves,
metabolic control, protection mechanisms, toxins, and regulation of osmotic pressure.
3. Cheminformatics
Cheminformatics is an interdisciplinary branch combining chemistry, computer
science and information science. Key problems in cheminformatics are storing
molecules, exact-structure lookup, substructure search and similarity search.
4. Pharmacogenomics
Pharmacogenomics analyses how genetic makeup affects an individual’s response to
drugs. It deals with the influence of genetic variation on drug response in patients
by correlating gene expression with drug’s efficacy or toxicity. By doing so, phar-
macogenomics aims to develop means to optimize drug therapy, with respect to
the patients’ genotype, to ensure maximum response with minimal adverse effects.
Such approaches promise the advent of “personalized medicine” in which drugs and
drug combinations are optimized for each individual’s unique genetic makeup [3].
1.2 Role of Bioinformatics
1. Analysis and interpretation of various types of biological data including nu-
cleotide and amino acid sequences, protein domains and protein structures.
2. Development of new algorithms and statistics with which to assess biological
information, such as relationships among members of large datasets.
3. Development and implementation of tools that enable efficient access and
management of different types of information, such as various databases and inte-
grated mapping information.
Chapter 2
BIOLOGY BEHIND
BIOINFORMATICS
2.1 Cell
The cell was discovered by Robert Hooke in 1665. The cell is the basic structural
and functional unit of all known living organisms. It is the smallest unit of life that is
classified as a living thing, and is often called the building block of life. Figure 2.1 gives
the basic cell structure with its major components [4].
Figure 2.1: Basic cell structure
Every organism has different body parts performing different functions: bones, for
example, form a supportive frame for the body, while skin provides a protective layer.
But if the basic unit underneath them all is the cell, how do cells perform different
functions in different body parts? The answer lies in the proteins they synthesize:
proteins define the type and the functions of the cell.
2.2 DNA (DeoxyriboNucleic Acid)
The Swiss biochemist Friedrich Miescher first observed DNA in the late 1800s. DNA
is the hereditary material in humans and almost all other organisms, and nearly every
cell in a person’s body has the same DNA. Most DNA is located in the cell nucleus
(where it is called nuclear DNA), but a small amount of DNA can also be found in the
mitochondria. An organism’s complete set of nuclear DNA is called its genome [5].
Figure 2.2 gives a pictorial representation of the DNA structure [6].
DNA is made of chemical building blocks called nucleotides. These building blocks are
made of three parts: a phosphate group, a sugar group and one of four types of nitrogen
bases. To form a strand of DNA, nucleotides are linked into chains, with the phosphate
and sugar groups alternating. The four types of nitrogen bases found in nucleotides are:
adenine (A), thymine (T), guanine (G) and cytosine (C). The order or sequence of these
bases determines what biological instructions are contained in a strand of DNA.
For example, the sequence ATCGTT might instruct for blue eyes, while ATCGCT might
instruct for brown.
To understand DNA’s double helix from a chemical standpoint, picture a twisted
ladder whose sides are strands of alternating sugar and phosphate groups. Each rung of
the ladder is made up of two nitrogen bases, paired together by hydrogen bonds. Because of the
highly specific nature of this type of chemical pairing, base A always pairs with base T,
and likewise C with G. So, if you know the sequence of the bases on one strand of a DNA
double helix, it is a simple matter to figure out the sequence of bases on the other strand.
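The base-pairing rule above is simple enough to sketch in a few lines of code (an illustrative aside, not part of any tool discussed in this dissertation):

```python
# Watson-Crick pairing: A always pairs with T, and C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand: str) -> str:
    """Derive the second strand of the double helix from the first."""
    return "".join(PAIRS[base] for base in strand)

print(complement_strand("ATCGTT"))  # prints TAGCAA
```

Applying the function twice returns the original strand, which mirrors the symmetry of the pairing rule.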
Figure 2.2: DNA Structure
DNA’s unique structure enables the molecule to copy itself during cell division. When
a cell prepares to divide, the DNA helix splits down the middle and becomes two single
strands. These single strands serve as templates for building two new, double-stranded
DNA molecules - each a replica of the original DNA molecule. In this process, an A base
is added wherever there is a T, a C where there is a G, and so on until all of the bases
once again have partners.
In addition, when proteins are being made, the double helix unwinds to allow a single
strand of DNA to serve as a template. This template strand is then transcribed into
mRNA, which is a molecule that conveys vital instructions to the cell’s protein-making
machinery [5].
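Transcription of the template strand into mRNA follows the same pairing rule, with uracil (U) substituting for thymine (T); a minimal sketch:

```python
# Template DNA base -> complementary mRNA base (U replaces T in RNA).
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template: str) -> str:
    """Build the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_MRNA[base] for base in template)

print(transcribe("TACGGT"))  # prints AUGCCA
```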
2.3 RNA (RiboNucleic Acid)
RNA is very similar to DNA, but differs in a few important structural details. RNA
is made up of a long chain of components called nucleotides. Each nucleotide consists of
a nucleotide base, a ribose sugar, and a phosphate group. The sequence of nucleotides
allows RNA to encode genetic information.
RNA forms the system responsible for converting the DNA code, i.e. genes, into proteins.
There are three types of RNA:
1. Messenger RNA (mRNA)
Messenger RNA is synthesized from a gene segment of DNA. It contains the infor-
mation on the primary sequence of amino acids in a protein to be synthesized. The
messenger RNA carries the code to the ribosomal site.
2. Transfer RNA (tRNA)
Transfer RNA resides in the cytoplasm of the cell. It acts as a carrier of amino acids
during protein synthesis, carrying the corresponding amino acid to the protein synthesis
site. The particular amino acid is determined by the anticodon (a triplet base code
complementary to the codon), which defines the tRNA. There is at least one tRNA
for each amino acid, and there can be multiple triplet codons for one amino acid.
3. Ribosomal RNA (rRNA)
Ribosomal RNA is a structural component of ribosomes. The specific function of
rRNA is not fully established, but it binds proteins and mRNA to provide the site
of protein synthesis [7].
2.4 Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically
folded into a globular or fibrous form, facilitating their biological functions. A polypeptide is
a single linear polymer chain of amino acids bonded together by peptide bonds between
the carboxyl and amino groups of adjacent amino acid residues. The sequence of amino
acids in a protein is defined by the sequence of a gene, which is encoded in the genetic
code.
Proteins are very important molecules in our cells. They are involved in virtually all
cell functions. Each protein within the body has a specific function. There are several
types of proteins depending on chain of amino acids and their structure.
e.g. Antibodies are specialized proteins involved in defending the body from anti-
gens.
Proteins are responsible for movement.
Enzymes are proteins that facilitate biochemical reactions.
Hormonal Proteins are messenger proteins which help to coordinate certain bodily
activities.
Structural Proteins are fibrous and stringy and provide support.
Transport Proteins are carrier proteins which move molecules from one place to
another around the body [8].
Protein synthesis is the process of making proteins. It is extremely complex and must
be carried out as a series of steps to ensure it is done properly. Figure 2.3 gives a
visual representation of the protein synthesis steps [9].
Figure 2.3: Protein Synthesis steps
General steps followed during protein synthesis are:
1. DNA unwinds exposing a gene.
2. The gene undergoes transcription forming “mRNA”.
3. Messenger RNA transfers the information from the nucleus to the ribosomes.
4. During the translation process, ribosomes read mRNA which indicates which amino
acid to add to the polypeptide chain and its sequence.
5. Each transfer RNA (tRNA) molecule conveys a particular amino acid to the ribo-
some.
6. At the ribosome, the amino acid that has been delivered by tRNA attaches to the
peptide chain, lengthening it.
When the translation process is complete, the ribosome releases the polypeptide, and
the new protein generally undergoes further processing at other sites within the cyto-
plasm.
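Steps 4 to 6 above amount to reading the mRNA three bases at a time and looking each codon up in the genetic code; a toy sketch (only a handful of the 64 codons are shown, purely for illustration):

```python
# Tiny excerpt of the standard genetic code; the real table has 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the start codon
    "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna: str) -> list:
    """Read mRNA codon by codon, stopping at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

print(translate("AUGUUUGGCUAA"))  # prints ['Met', 'Phe', 'Gly']
```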
2.5 Mapping Biology in Computer Terms
As computer people, we can better appreciate these biological concepts by mapping
them onto something we know well. In computer terms, a living organism is a complete
computer system whose cells are the hardware. Biological networks define the
functionality of the hardware; the protein interaction network forms the software, the
program that drives the system through the hardware. Each element in a biological
network, i.e. each protein, is analogous to a function in the program. A gene can be
seen as the instructions in a function body that uniquely identify the program, and DNA
as the universal repository of the instructions of all functions, i.e. all gene sequences.
There is also the non-coding DNA, which actually forms the control unit of the system,
but we are not concerned with that part here and will set it aside to avoid confusion.
Each base on the DNA is an individual instruction, or more precisely a machine
instruction like load, store, or add.
Whatever we are doing in bioinformatics can be considered as reverse engineering.
Reverse engineering is the process of discovering the technological principles of a device,
object, or system through analysis of its structure, function, and operation. It often
involves taking something (e.g. a mechanical device, electronic component, software
program, or biological, chemical, or organic matter) apart and analyzing its workings in
detail [10].
Chapter 3
METAGENOMICS
Metagenomics is the study of metagenomes, genetic material recovered directly from
environmental samples. While traditional microbial genomics relies upon cultivated
clonal cultures, the fact that over 99% of species resist cultivation has skewed our
view of microbial diversity. Metagenomics offers a powerful
lens for viewing the microbial world that has the potential to revolutionize understanding
of the entire living world. Sequencing of environmental DNA (metagenomics) has shown
tremendous potential to drive the discovery and understanding of the “un-culturable ma-
jority” of species. Culture independent methods are used to obtain information about the
genetic diversity, population structure, and ecological roles of members of the communi-
ties. These methods complement or even replace culture-based approaches and bypass
some of their limitations.
Over the past few years, the major challenge associated with metagenomics has shifted
from generating to analyzing sequences. Metagenomic analysis includes the identification,
functional and evolutionary analysis of the genomic sequences of a community of organ-
isms. There are many challenges involved in the analysis of these data sets including
sparse metadata, a high volume of sequence data, genomic heterogeneity and incomplete
sequences. Due to the nature of metagenomic data, analysis is very complex and requires
new approaches and significant compute resources. Advances in computational analysis
techniques are essential to move the field forward.
Figure 3.1: Flow diagram of a typical metagenome analysis
The general steps followed in metagenomic analysis are shown in Figure 3.1.
3.1 Extraction and Processing
Sample processing is the first and most crucial step in any metagenomic analysis. As
metagenomic samples are taken in their natural condition, two important requirements
must be taken care of:
1. DNA extracted should be representative of all cells present in the sample
2. Sufficient amount of high quality nucleic acids must be obtained for subsequent
analysis
3.2 Sequencing
DNA sequencing is the process of reading the nucleotide bases in a DNA molecule.
It includes any method
or technology that is used to determine the order of the four bases adenine, guanine,
cytosine, and thymine in a strand of DNA. The advent of DNA sequencing has
significantly accelerated biological research and discovery. Various DNA sequencing methods
are developed such as Sanger Sequencing, Quantitative PCR, Illumina, Roche454, SOLiD,
etc.
Next-generation sequencing (NGS) methods are now widely applied in the field of
metagenomics due to their lower cost per gigabase and higher speed, but they lag behind
Sanger sequencing in read length, which makes further analysis difficult when reads are
short. Some of the sequencing technologies, along with their cost per gigabase and read
length, are given in Table 3.1.
Technology           Read length   Cost/gigabase
Sanger Sequencing    >700 bp       $400,000
454/Roche            600-800 bp    $20,000
Illumina             300 bp        $50
SOLiD                50 bp         $40
Table 3.1: Sequencing Technologies
The longer the sequence, the better the ability to obtain accurate information, but
Table 3.1 shows the cost rising as we demand longer reads. To achieve analysis goals
with short (and therefore cheaper) reads, the demand for efficient assembly algorithms
that can assemble these short reads with minimal error has grown.
3.3 Assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments of a
much longer DNA sequence in order to reconstruct the original sequence.
Many DNA assembly packages have been developed. They are broadly classified into
two types:
1. Reference-guided assembly tools: these pairwise-align reads to a specified
reference genome to guide assembly. Reference-based assembly works well if the
metagenomic dataset contains sequences for which closely related reference
genomes are already available.
e.g. SOAP, Bowtie.
2. De novo assembly tools: these do not align against any reference; they generally
use an overlap-based approach to assemble the read sequences.
e.g. Velvet, ALLPATHS-LG, SOAPdenovo, Celera.
Several factors need to be considered when exploring the reasons for assembling
metagenomic data. These can be condensed into two important questions. First, what is
the length of the sequencing reads used to generate the metagenomic dataset? Second,
are longer sequences required for annotation? Some approaches, e.g. IMG/M, prefer
assembled contigs, while other pipelines such as MG-RAST require reads of only 75 bp.
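The overlap idea behind de novo assembly can be sketched minimally. Real assemblers such as Velvet build de Bruijn graphs and handle sequencing errors, so the following greedy suffix-prefix merge is only a conceptual illustration:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a: str, b: str, min_len: int = 3) -> str:
    """Join two reads on their overlap, as a greedy assembler would."""
    n = overlap(a, b, min_len)
    return a + b[n:]

print(merge("ATTAGACCTG", "CCTGCCGGAA"))  # prints ATTAGACCTGCCGGAA
```

Repeatedly merging the pair of reads with the largest overlap yields contigs; the branching and chimeric-join problems discussed later arise exactly when several reads share the same overlap.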
3.4 Binning
Binning refers to the process of sorting DNA sequences into groups that represent an
individual genome, or genomes from closely related organisms. Two general approaches
are used in binning:
1. Taxonomy-dependent Binning: The majority of methods available for binning
metagenomic datasets belong to the taxonomy-dependent category. In these methods,
the extent of similarity of reads to reference sequences or pre-computed models
drives the binning process.
(a) Similarity-based methods: the majority of these methods work by aligning
reads to known sequences or to Hidden Markov Models built from known
sequences.
e.g. IMG/M, MG-RAST, MEGAN, CARMA, MetaPhyler
(b) Composition-based methods: these exploit the fact that genomes have a
conserved nucleotide composition (e.g. a certain GC content or a particular
abundance distribution of k-mers), which is also reflected in sequence
fragments of the genomes.
e.g. PhyloPythia, S-GSOM, TACOA
2. Taxonomy-independent Binning: these methods mainly use the correlation
information contained in the read sequences themselves, via statistical or
network-based approaches. TETRA, for example, computes the pairwise correlations
between the tetra-nucleotide patterns of all reads and uses this information to
segregate reads into distinct bins.
e.g. TETRA, SOM, CompostBin, AbundanceBin, MetaCluster.
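The genome-signature idea underlying composition-based and TETRA-style binning can be sketched as a k-mer frequency vector plus a pairwise similarity. This is a simplified illustration, not the actual TETRA algorithm (which correlates z-scores of tetranucleotide patterns):

```python
import math
from collections import Counter

def kmer_profile(read: str, k: int = 4) -> dict:
    """Normalized k-mer (tetranucleotide) frequency vector of a read."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def similarity(p: dict, q: dict) -> float:
    """Cosine similarity between two profiles; reads with similar
    signatures are candidates for the same bin."""
    dot = sum(p[x] * q.get(x, 0.0) for x in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

Reads whose profiles correlate strongly would be placed in the same bin; the short-read limitation discussed below shows up here as profiles that are too sparse to be distinctive.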
Important considerations when choosing a binning algorithm are the type of input data
available and the existence of suitable training datasets or reference genomes. In general,
composition-based binning is not reliable for short reads, as they do not contain enough
information. A short read may, however, contain similarity to a known gene, and this
information can be used to assign the read to a specific bin; this obviously requires the
availability of reference data. If the query sequences are not closely related to the
reference data, reference-based binning will be inefficient, assigning reads only at very
high taxonomic levels.
Post-assembly, the binning of contigs can lead to the generation of partial genomes of
yet-uncultured or unknown organisms, which in turn can be used to perform similarity-
based binning of other metagenomic datasets. Prior to assembly with clonal assemblers,
binning can be used to increase accuracy, reduce the complexity of the assembly effort,
and reduce computational requirements.
3.5 Annotation
DNA annotation is the process of attaching biological information to sequences. It
consists of two main steps:
1. Identifying elements on the genome, a process called gene prediction.
2. Attaching biological information to these elements.
In case of metagenomic annotation two approaches are used:
1. Annotation on assembled sequences: the longer the sequence, the better the
information. If the assembly has produced large enough contigs, it is better to use
existing genome annotation tools such as RAST or IMG. For this approach to be
successful, contigs of 30,000 bp or longer are required.
2. Annotation on un-assembled reads: annotation can also be performed directly
on unassembled reads, but here the existing annotation tools are not very effective.
3.6 Experimental Design and Statistical Analysis
Many of the early metagenomic shotgun-sequencing projects were focused on targeted
exploration of specific organisms (e.g. uncultured organisms in low-diversity acid mine
drainage). Reduction of sequencing cost and a much wider appreciation of the utility of
metagenomics to address fundamental questions in microbial ecology now require proper
experimental designs with appropriate statistical analysis.
3.7 Sharing and Storage of Data
Data sharing has a long tradition in the field of genome research. Efficient storage and
sharing of data provides metadata and centralized services to the research community.
For metagenomic data, this will require a whole new level of organization and
collaboration [11].
Chapter 4
DATA MINING TOOLS AND
TECHNIQUES
From the above we can say that efficient next-generation sequencing methods and
metagenomic assembly determine the resolution of further metagenomic analysis, due to
the rule of thumb “Longer the sequence, better the information”. Although assembly
yields longer sequences, it also bears the risk of creating chimeric contigs, in particular
in samples with closely related species or highly conserved sequences that occur across
species. Assembly efforts increase with more branching in the reads. More branching
in turn creates more possibility for tip, bubble, chimeric connections which are primary
sources of errors in assembly. Furthermore, assembly distorts abundance information,
as overlapping sequences from different species will be identified as belonging to the
same genome and consequently joined. This leads to a relative under representation of
sequences of abundant species [12]. Also the computational complexity of metagenomic
assemblers is still a big question to worry about.
As discussed earlier, binning prior to assembly can be used to increase accuracy and reduce the complexity of the assembly effort. From the flow graph shown above we can see that there are also methodologies for annotation and further statistical analysis which can work on binned reads directly, skipping assembly. Binning of metagenomic reads is therefore a crucial step both for effective assembly and for the further analysis steps that bypass assembly.
Let us review the existing methods that are prominently used in metagenomic classification.
4.1 PhyScimm
The main objectives of metagenomic binning are to find out which metagenomic reads belong to the same strain and where those strains fit on the phylogenetic tree of life. Clustering can be used efficiently to achieve the first objective; classification achieves the second by assigning taxonomic labels to sequences.
Composition-based clustering and classification methods use oligonucleotide frequencies as the property on which metagenomic reads are binned. Composition-based classification methods train on the oligonucleotide frequencies of existing genomes and classify reads using a supervised approach. The recently developed composition-based method Phymm trains an Interpolated Markov Model (IMM) on existing genomes and uses it to classify new reads.
Supervised learning assumes that the training reads are representative of the reads to be classified, but this is not always the case, especially for metagenomic reads, which may contain novel sequences. To tackle this scenario, an unsupervised learning approach, which learns genomic signatures on a set of sequences without the use of existing genomes, can be combined with the supervised approach.
Besides oligonucleotide frequencies, the Markov chain model has great potential to discriminate between sequences [13], and has been implemented successfully in both unsupervised [14] and supervised [15] settings. PhyScimm is a hybrid approach: “SCIMM + Phymm”.
4.1.1 SCIMM
Let us first understand the IMM. In an nth-order Markov chain model, the ith element of a sequence depends on the previous n elements. Given a sequence s and a model m, the likelihood of s being generated by m is given by equation 4.1:

P(s | m) = ∏_{i=n+1}^{|s|} P(s_i | s_{i−1} s_{i−2} ... s_{i−n})    (4.1)
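A fixed-order Markov chain likelihood of this form can be sketched in a few lines of Python. This is an illustrative sketch only (the function names and the add-one smoothing are our own simplifications, not code from Phymm or SCIMM):

```python
import math
from collections import defaultdict

def train_markov(seqs, n):
    """Estimate P(s_i | s_{i-n}...s_{i-1}) for an nth-order Markov chain
    by counting each length-n context and the base that follows it."""
    context_counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(n, len(s)):
            context_counts[s[i - n:i]][s[i]] += 1
    return context_counts, n

def log_likelihood(model, s):
    """log P(s | m): sum over i = n+1..|s| of log P(s_i | previous n bases),
    with add-one smoothing over the 4 DNA bases."""
    counts, n = model
    ll = 0.0
    for i in range(n, len(s)):
        ctx = counts[s[i - n:i]]
        total = sum(ctx.values())
        ll += math.log((ctx[s[i]] + 1) / (total + 4))
    return ll
```

A sequence whose composition matches the training genome receives a higher log-likelihood than a mismatched one, which is exactly the property binning exploits.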
The IMM is a variable-order Markov chain model which switches among fixed-order Markov models depending on context. It may happen that some 8-mers occur much more frequently than certain 6-mers in a sequence, giving more reliable estimates for them; but from the above equation we can observe that 4^{n+1} parameters are needed to build an accurate model, which is exponential in the order of the model. Higher-order models are generally preferable for accurate predictions, but only if they actually yield more accurate estimates than their lower-order counterparts, and they come at the expense of exponentially growing complexity. To balance accuracy against complexity, the IMM gives more weight to oligomers that occur frequently and less to infrequently occurring ones, and uses a linear combination of variable-length models according to the assigned weights. It falls back to a shorter-oligomer model when the longer model is insufficient to produce good-quality predictions, and keeps interpolating among variable-length Markov models depending on context as it moves along the sequence [16].
IMM training creates a probabilistic decision tree using information gain as the splitting criterion. Consider windows of length n + 1 drawn from a set of sequences. The first split selects the position i in the window where the mutual information MI(X_i, X_{n+1}) is maximal, for i = 1...n, and the procedure continues iteratively by computing the conditional mutual information of the remaining positions given the particular nucleotide base at the position chosen by the parent node. To compute the likelihood of a novel sequence, we follow the decision tree down from the root by looking at the nucleotide bases in the novel sequence window.
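The first split of this decision tree can be illustrated as follows (a hypothetical sketch; `best_split_position` and the bit-valued mutual information are our own illustration, not taken from the Phymm implementation):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """MI(X, Y) in bits, estimated from joint samples [(x, y), ...]."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def best_split_position(windows):
    """First split of the decision tree: the position i (1..n) in each
    (n+1)-length window maximizing MI(X_i, X_{n+1})."""
    n = len(windows[0]) - 1
    return max(range(1, n + 1),
               key=lambda i: mutual_information([(w[i - 1], w[-1]) for w in windows]))
```

For example, in windows where the last base is fully determined by position 1 and independent of position 2, the split is made at position 1.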
SCIMM uses the same general algorithm as CEM (Classification Expectation Maximization), where the data points are read sequences and the IMMs are the cluster models. For each sequence s and IMM m, we compute the log of P(m) P(s | m), assign s to the m for which this score is maximal, and retrain the IMMs after every round of assignments.
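The CEM loop can be sketched as below. This is a simplified illustration, not SCIMM itself: a plain smoothed k-mer profile stands in for the IMM, and all function names are our own.

```python
import math
from collections import defaultdict

def train_model(seqs, k=3):
    """Stand-in for IMM training: a k-mer count profile of one cluster."""
    counts = defaultdict(int)
    total = 0
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            total += 1
    return counts, total, k

def log_score(model, s):
    """Stand-in for log P(s | m): sum of smoothed k-mer log-probabilities."""
    counts, total, k = model
    return sum(math.log((counts[s[i:i + k]] + 1) / (total + 4 ** k))
               for i in range(len(s) - k + 1))

def cem(reads, init_assign, iters=10):
    """Classification EM: retrain one model per cluster, then move every
    read to the cluster maximizing log P(m) + log P(s | m); repeat."""
    k_clusters = max(init_assign) + 1
    assign = list(init_assign)
    for _ in range(iters):
        models = [train_model([r for r, a in zip(reads, assign) if a == c])
                  for c in range(k_clusters)]
        priors = [max(assign.count(c), 1) / len(reads) for c in range(k_clusters)]
        assign = [max(range(k_clusters),
                      key=lambda c: math.log(priors[c]) + log_score(models[c], r))
                  for r in reads]
    return assign
```

Starting from a noisy seed partition, a few iterations are enough to separate reads of clearly different composition, mirroring how SCIMM refines its seed clusters.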
SCIMM Pipeline: A functional block diagram of the SCIMM pipeline is shown in Figure 4.1. The algorithm starts by initializing k IMMs. To initialize the IMMs for SCIMM, we can use either an unsupervised approach, e.g. LikelyBin or CompostBin, or a supervised approach, e.g. Phymm, on a random subset of the sequences with a user-specified number of clusters k, and train an IMM on every cluster returned.
The initial partitioning step forms the initial or seed clusters required by SCIMM clustering. We obtained the initial partitioning of the sequences using PhymmBL, a hybrid of composition-based and similarity-based classification: 3,000 sequences are chosen at random, classified, and clustered at a certain taxonomic level, which forms the initial k clusters.
Figure 4.1: SCIMM Pipeline
SCIMM re-trains the IMMs on all metagenomic sequences starting from the seed clusters and assigns them to their corresponding clusters; this loop is shown in Figure 4.1. Over the course of the iterations, the IMMs converge to a set that represents the phylogenetic sources.
4.1.2 Phymm
Unsupervised clustering is less effective on highly complex datasets with many microbial strains (> 20). The classification method Phymm [15] is immune to dataset complexity. It can be used for initial partitioning to reduce dataset complexity by clustering samples from the same genus or family into one cluster.
The hybrid of Phymm (supervised) and SCIMM (unsupervised) forms PhyScimm; if existing unsupervised clustering methods are used for the initial partitioning instead of Phymm, we refer to it simply as SCIMM [17].
As Phymm is a supervised approach, its performance degrades if the query sequences do not come from the taxonomic strains on which it was trained. The results above clearly show that PhyScimm outperforms SCIMM for mixtures containing more than 20 strains. For low-complexity data, SCIMM is comparable to PhyScimm and outperforms it at lower taxonomic levels.
4.2 PhymmBL
This method is a hybrid of the composition-based and similarity-based approaches. As we have already seen for PhyScimm, the IMM has great potential to discriminate sequences in metagenomics. Phymm, a method based on IMMs, trains on the existing non-redundant labelled genomes of the NCBI RefSeq database [18], creating a suite of IMMs, one per labelled sequence in the database. Phymm exhibits a dramatic improvement in results, especially for short read lengths around 100 bp. As we have already seen in the table above, sequencing technologies are becoming cheaper at the cost of shorter read lengths, e.g. Illumina and SOLiD.
BLAST (Basic Local Alignment Search Tool) [19] is a similarity-based approach which compares a query sequence against the sequences in NCBI RefSeq. It is the most accurate method if the query sequences are members of a taxonomic group already present in the database. If novel sequences are present in the query sample, its performance drops drastically in proportion to the fraction of novel sequences.
Each query sequence is scored against each Phymm IMM to find the probability of that particular IMM generating the query sequence, and the query sequence is assigned the label of the IMM returning the highest probability. In parallel, it is aligned against the database sequences using BLAST and assigned the label of the best hit. Finally, the best combined score is computed with equation 4.2, which weights the scores of the two individual methods.
score = IMM + 1.2(4− log(E)) (4.2)
where
IMM: Log-likelihood score returned by IMM
E: best E-value returned by BLAST
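Equation 4.2 translates directly into code. The following is a minimal sketch (the function names are our own, and the logarithm base is taken here as the natural log, following the text):

```python
import math

def phymmbl_score(imm_log_likelihood, blast_evalue):
    """Combined score of equation 4.2: the IMM log-likelihood plus a
    weighted BLAST term that grows as the best E-value shrinks."""
    return imm_log_likelihood + 1.2 * (4 - math.log(blast_evalue))

def classify(read_scores):
    """Pick the taxonomic label with the highest combined score.
    read_scores: {label: (imm_log_likelihood, best_evalue)} (hypothetical)."""
    return max(read_scores,
               key=lambda lab: phymmbl_score(*read_scores[lab]))
```

Note that a more significant BLAST hit (smaller E-value) raises the combined score, so the similarity evidence can override a slightly weaker composition score.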
Naturally, Phymm's accuracy improves with read length. Phymm exhibited 32.8 % accuracy for 100-bp reads at the genus level and 60.3 % for 400-bp reads, as shown in Figure 4.2 and Figure 4.3. This is much better than existing methods: CARMA shows 6 % for 100-bp reads at the genus level, and the SVM-based PhyloPythia exhibits 7.1 % for 1000-bp reads at the genus level [15].
Figure 4.2: Accuracy of Blast and Phymm at Phylum level classification
Figure 4.3: Accuracy of Blast and Phymm at Genus level classification
From Figure 4.2 and Figure 4.3, we can observe that Phymm outperforms BLAST at the upper levels of classification above 400-bp read length. BLAST is superior to Phymm as we go down the phylogenetic classification towards the species level; for shorter read lengths in particular, it outperforms Phymm at all levels.
Chapter 5
RESULTS AND OBSERVATIONS
To compare the performance of PhymmBL and PhySCIMM, we experimented with two datasets. We compared accuracy to find the drawbacks of these tools and ways to improve them. Accuracy represents the percentage of metagenomic reads that are clustered or classified correctly to their phylogenetic source, out of the total number of classified reads, as given by equation 5.1:

Accuracy = (Number of correctly clustered or classified reads) / (Total number of classified reads)    (5.1)
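Equation 5.1 as a small helper function (an illustrative sketch; the names are our own):

```python
def binning_accuracy(predicted, truth):
    """Equation 5.1: percentage of classified reads whose predicted
    phylogenetic source matches the true one."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(predicted)

# e.g. 3 of 4 reads assigned to the right source
print(binning_accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"]))  # → 75.0
```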
5.1 Dataset 1 (Synthetic Metagenomic Dataset)
This synthetic metagenomic dataset was created artificially to emulate a real metagenomic environment. Its reads were randomly chosen from RefSeq (NCBI). It includes a core library of all complete bacterial and archaeal genomes, comprising 539 distinct species, 53 genera, 48 families, 34 orders, 21 classes, and 14 phyla.
To control for under-representation of some clades in the available data, query sets were filtered so that all species under consideration had at least two sister species within the clade under consideration. For example, in the experiment that masked exact species matches but allowed intra-genus comparisons, without this filtering step, if a given species were the only sequenced representative of its genus, it would have been impossible to assign a correct genus label to reads from that species. Each synthetic test set initially contained 5 randomly selected reads from each of the 1,146 chromosomes and plasmids in the RefSeq reference data, totalling 5,730 reads representing 539 bacterial and archaeal species [15].
We conducted experiments with different read lengths: 200, 400, 800 and 1000 bp. With each read-length dataset, we observed accuracy at all taxonomic levels, i.e. phylum, class, order, family, genus and species.
Figures 5.1 and 5.2 show the accuracy of PhymmBL and PhySCIMM with default parameters at the various taxonomic levels, for 1000-bp and 200-bp read lengths respectively. Detailed observations at the other read lengths are given in Appendix-A. We can observe that the accuracy of PhymmBL in both cases is very close to 100 %. This happens because all query reads are picked from RefSeq (NCBI), on which PhymmBL has been trained: PhymmBL performs outstandingly if the species in the metagenomic sample are known, i.e. a reference is available. The performance of PhySCIMM, in comparison, is very poor, because our dataset contains 539 species, i.e. it is highly complex. A detailed comparison at all read lengths can be found in the supplementary document.
Figure 5.1: Accuracy of PhymmBL and PhySCIMM at 1000 bp read length
5.2 Dataset 2 (FAMeS)
To evaluate various methods that are used to process metagenomic sequences, sim-
ulated datasets of varying complexity were constructed by combining sequencing reads
randomly selected from 113 isolate genomes. These datasets were designed to model real
metagenomes in terms of complexity and phylogenetic composition [20]. The datasets are dominated by a few species, but have a long tail of very-low-abundance species.
Figure 5.2: Accuracy of PhymmBL and PhySCIMM at 200 bp read length
We created a dataset with 9 species, dominated by 4 species (85.09 %). We deliberately kept few species to emulate a less complex dataset, and experimented with it at all levels of taxonomic classification with default parameters.
From Figure 5.3, we can observe that PhymmBL again gives very high accuracy with this dataset. This could be because the metagenomic mixture is constructed from already known species. Comparing the performance of PhymmBL here with its performance on dataset-1, we can conclude that PhymmBL's performance is independent of dataset complexity, i.e. of the number of strains present in the metagenomic dataset. The point to be observed is that the performance of PhySCIMM has gone up drastically compared to dataset-1: we got about a 38 % to 65 % difference in accuracy. This suggests that PhySCIMM performs poorly on highly complex data, i.e. mixtures with many species.
Going into detail, we got 13 clusters at the output of PhySCIMM at the species level. Of these 13 clusters, 12 were dominated by the 4 species present in high abundance (85.09 %) in the input dataset:
Figure 5.3: Accuracy of PhymmBL and PhySCIMM with less complex dataset
5 clusters were dominated by Xylella fastidiosa with an average accuracy of 92.03 %;
3 clusters by Rhodopseudomonas palustris with an average accuracy of 71.74 %;
3 clusters by Rhodospirillum rubrum with an average accuracy of 87.84 %;
1 cluster by Moorella thermoacetica with an average accuracy of 92.85 %;
1 cluster contained mixed reads with no specific dominance (a cluster of un-clustered reads).
This shows that PhySCIMM could not distinguish the species with lower abundance. The clustering accuracy of the highly abundant species is also hampered by the noise introduced by the low-abundance species.
5.3 Drawbacks
From above two experiments, we can brief drawbacks found in existing system as
1. Performance of Physcimm (Unsupervised Approach) degrades with high complexity
data (mixture with many species).
2. Physcimm (Unsupervised Approach) fails to identify species with lower abundance
25
level.
3. Physcimm (Unsupervised Approach) clustering accuracy of species with higher
abundance level degrades due to presence of low abundant species (noise).
26
Chapter 6
DESIGN AND
IMPLEMENTATION
6.1 Design
The experimental results discussed in the previous chapter point out the major problems with unsupervised methods in metagenomic binning:
1. High complexity of the metagenomic dataset (number of species in the dataset).
2. Differences in the abundance levels of species in the metagenomic dataset.
To minimize these two factors, we need a system that brings down the complexity of the metagenomic dataset before processing it with the unsupervised method. The system should also be capable of providing nearly equally abundant input to the unsupervised method.
Considering these requirements, we designed the system shown in Figure 6.1. The stages of the system design are explained below.
1. Filtering Reads: This step separates the reads in the metagenomic mixture into high-abundance reads and low-abundance reads based on k-mer frequency. This helps us separate the noisy low-abundance species from the high-abundance species, which are biologically more significant [21]. By doing this we reduce the complexity of the data by splitting it into two parts, and we can re-iterate over this step with different values of k. This step minimizes the abundance variation in the input dataset, bringing it toward an evenly distributed dataset with nearly uniform abundance.
Figure 6.1: Block diagram of proposed solution
2. Left wing: The left wing of the proposed method bins the high-abundance species. Since the low-abundance species are separated off in the filtering block, the clustering accuracy of the biologically significant species improves. The left wing receives an input dataset of high-abundance species with minimized abundance variation, and extracts most of the biological significance, since an ecosystem is dominated by a few species.
3. Random-1: This step selects random sequences from the input metagenomic reads and aligns them against an existing database using an existing supervised binning tool. This gives us the strains in the input that are already present in the database, i.e. of known taxonomy, and also serves as the seed input for training the IMMs.
4. Iterative IMM: Starting from the seed bins given by the above step, IMMs are trained, classify the remaining input reads, and re-train themselves iteratively.
5. Right wing: The right wing of the proposed method takes care of the low-abundance species. It receives an input dataset of low-abundance species with minimized abundance variation, and we use the same approach to bin these reads as in the left wing.
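The filtering stage (stage 1 above) can be sketched as follows. This is an illustrative sketch, not the actual implementation: the value of k, the threshold, and the mean-k-mer-frequency score of a read are assumptions made for the example.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Global k-mer frequency table over all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def split_by_abundance(reads, k=15, threshold=2.0):
    """Stage 1 of the design: reads whose mean k-mer frequency exceeds the
    threshold are treated as coming from high-abundance species; the rest
    go to the low-abundance split (k and threshold are illustrative)."""
    counts = kmer_counts(reads, k)
    high, low = [], []
    for r in reads:
        kmers = [r[i:i + k] for i in range(len(r) - k + 1)]
        mean_freq = sum(counts[w] for w in kmers) / len(kmers)
        (high if mean_freq > threshold else low).append(r)
    return high, low
```

Reads from an abundant species share k-mers with many other reads and therefore score high, while reads from a rare species carry mostly unique k-mers.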
6.2 Method
The primary assumption of this method is that reads sampled from a genome follow a normal distribution [22]. The Expectation Maximization (EM) method is used to find the parameters of this distribution, and abundance is discriminated based on the fitted Poisson distribution: sequencing produces reads with unequal sampling when species abundance levels vary.
The sequencing output in metagenomics is considered a mixture of m Poisson distributions, m being the number of species. The goal is to find the mean values λ_1 to λ_m of these Poisson distributions, which are the abundance levels of the species.
When sequencing a genome, the probability of a read starting at a certain position is

P_r = N / (G − L + 1)    (6.1)
where
N : Number of reads
G : Genome size
L: Length of reads.
Given G ≫ L, this approximates to N/G. Assume x is a read and an l-tuple w belongs to x. The number of occurrences of w in the set of reads follows a Poisson distribution with parameter λ = N(L − l + 1)/(G − L + 1), which approximates to NL/G. In metagenomics, G is the total genome size of all species in the sample. If the abundance of a species i is n, the total number of occurrences of w in the whole metagenomic sample coming from the different species also follows a Poisson distribution, with parameter λ_i = nλ. The problem of finding the relative abundance levels of the different species is thus transformed into the modelling of a mixed Poisson distribution.
Given metagenomic reads as input, the algorithm first counts the occurrences of every l-tuple over all reads. Denote

x = {n(w_i)},  i = 1 ... W    (6.2)

where
n(w_i): count of tuple w_i
W: total number of l-tuples
The goal of the algorithm is to optimize the logarithm of the joint probability log P(x, θ) of obtaining the particular l-tuple counts x together with the parameters θ = {S, g, λ}, where
S: total number of bins
g = {g_i}: genome sizes
λ = {λ_i}: abundance levels
The hidden variable in this optimization problem is the bin identity to which each l-tuple belongs. The Expectation Maximization (EM) algorithm is used to solve it:
1. Initialize the total number of bins S, with genome sizes g_i and abundance levels λ_i.
2. (E-step) Calculate the probability that the l-tuple w_j (j = 1, 2, ..., W) comes from the ith species, given its count n(w_j):

P(w_j ∈ s_i | n(w_j)) = g_i / [ ∑_{m=1}^{S} g_m (λ_m / λ_i)^{n(w_j)} e^{(λ_i − λ_m)} ]    (6.3)

3. (M-step) Calculate new values for g_i and λ_i:

g_i = ∑_{j=1}^{W} P(w_j ∈ s_i | n(w_j)),    λ_i = [ ∑_{j=1}^{W} n(w_j) P(w_j ∈ s_i | n(w_j)) ] / g_i    (6.4)

4. Iterate steps 2 and 3 until the parameters converge or the number of iterations exceeds a maximum. Convergence is defined as

| λ_i^{t+1} / λ_i^t − 1 | < 10^{−5}  and  | g_i^{t+1} / g_i^t − 1 | < 10^{−5}    (6.5)
Once the EM algorithm converges, we can estimate the probability that a read is assigned to a bin, based on the binning results of its l-tuples, as

P(r_k ∈ s_i) = ∏_{w_j ∈ r_k} P(w_j ∈ s_i | n(w_j)) / ∑_{s_i ∈ S} [ ∏_{w_j ∈ r_k} P(w_j ∈ s_i | n(w_j)) ]    (6.6)

where
r_k: a given read
w_j: the l-tuples belonging to r_k
s_i: a bin
A read is assigned to the bin with the highest probability. A read remains unassigned if 90 % of its l-tuples are excluded.
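The EM procedure of steps 1-4 can be sketched in Python as below. This is an illustrative sketch with our own initialization scheme and numerical guard values; the E-step of equation 6.3 is computed in its equivalent normalized form g_i λ_i^{n(w_j)} e^{−λ_i} / ∑_m g_m λ_m^{n(w_j)} e^{−λ_m}.

```python
import math

def em_abundance(tuple_counts, n_bins, iters=200, tol=1e-5):
    """EM for a mixture of Poissons over l-tuple counts (equations 6.3-6.4).
    tuple_counts: the observed n(w_j); returns the bin weights g_i and the
    abundance levels lambda_i."""
    W = len(tuple_counts)
    lo, hi = min(tuple_counts), max(tuple_counts)
    # step 1: crude initialization, lambdas spread across the observed counts
    lam = [lo + (i + 1) * (hi - lo) / (n_bins + 1) + 1e-6 for i in range(n_bins)]
    g = [W / n_bins] * n_bins
    for _ in range(iters):
        # step 2 (E-step, eq. 6.3): posterior that tuple j came from bin i,
        # computed in log space for numerical stability
        post = []
        for n_wj in tuple_counts:
            logs = [math.log(max(g[i], 1e-12)) - lam[i]
                    + n_wj * math.log(max(lam[i], 1e-12)) for i in range(n_bins)]
            m = max(logs)
            exps = [math.exp(v - m) for v in logs]
            z = sum(exps)
            post.append([e / z for e in exps])
        # step 3 (M-step, eq. 6.4): update g_i and lambda_i
        new_g = [sum(post[j][i] for j in range(W)) for i in range(n_bins)]
        new_lam = [sum(tuple_counts[j] * post[j][i] for j in range(W))
                   / max(new_g[i], 1e-12) for i in range(n_bins)]
        # step 4: stop when all parameters have converged
        done = all(abs(new_lam[i] / max(lam[i], 1e-12) - 1) < tol
                   for i in range(n_bins))
        g, lam = new_g, new_lam
        if done:
            break
    return g, lam
```

On tuple counts drawn from two well-separated abundance levels, the recovered lambdas approach the per-group mean counts, i.e. the two abundance levels.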
6.3 Results
We used the FAMeS dataset to study the effect of abundance separation on overall binning accuracy. To set a benchmark for the accuracy improvement, we first performed this experiment with manual separation of the metagenomic input: as our dataset is synthetic and its full composition is known, we could separate it by abundance with 100 % accuracy. The accuracy improvement is shown in Figure 6.2.
Figure 6.2: Accuracy comparison for abundance separation
Automatic: accuracy after abundance separation on k-mer frequency.
Manual: accuracy after manual abundance separation.
Before: accuracy without considering abundance variance.
Figure 6.2 clearly shows the effect of abundance variation on the overall accuracy of metagenomic binning. Though we got some improvement in overall accuracy, it is still not very impressive when set against the computational and time overhead caused by adding the abundance-separation step before binning.
While considering other possible solutions, we re-analyzed the output clusters of PhyScimm, as shown in Table 6.1.
Size Accuracy
5136 98.91
3018 98.51
2031 99.02
1878 98.83
1226 33.61
1041 58.6
708 97.46
613 94.29
551 57.71
452 91.15
378 92.86
372 78.76
340 84.64
300 65.67
Table 6.1: Accuracy table sorted by cluster size
Figure 6.3: Cluster accuracy in descending order of size
Consider cluster size as an indicator of species abundance. Figure 6.3 shows accuracy in descending order of cluster size.
Let us treat cluster size as a spectrum width and accuracy as an indicator of overlap with the adjacent spectra: higher accuracy means less overlap of that cluster with its adjacently sized clusters. The task is then to split the whole spectrum so that the abundance variation is minimized, which can be done by observing the trend of accuracy against size. From Figure 6.3, we can observe a groove at accuracy 33.61 %: the cluster to its left has accuracy 98.83 % and the cluster to its right 58.60 %. This clearly indicates that the groove cluster (33.61 %) overlaps strongly with the right spectrum and only slightly with the left one. We therefore split the dataset at this groove cluster and add it to the right split. In general: “split the clusters at the groove and add the groove cluster to the split with more overlap”.
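The "split at the groove" rule can be sketched as follows. This is a hypothetical sketch: the (size, accuracy) tuple representation and the tie-breaking are our own assumptions for the example.

```python
def split_at_groove(clusters):
    """clusters: list of (size, accuracy) sorted by descending size.
    Cut at the minimum-accuracy interior cluster (the groove) and attach it
    to the neighbouring side showing more overlap, i.e. the side whose
    adjacent cluster has the lower accuracy."""
    groove = min(range(1, len(clusters) - 1), key=lambda i: clusters[i][1])
    left_acc = clusters[groove - 1][1]
    right_acc = clusters[groove + 1][1]
    if right_acc <= left_acc:
        # more overlap to the right: groove cluster joins the right split
        return clusters[:groove], clusters[groove:]
    return clusters[:groove + 1], clusters[groove + 1:]
```

Applied to the first rows of Table 6.1, the 33.61 % cluster is detected as the groove and attached to the right split, matching the example in the text.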
To imitate a real metagenomic environment and to have sufficient data for this iterative approach, we selected the whole FAMeS dataset, with 113 species and 114,457 reads.
Over the iterations, we tried two approaches.
Approach 1: A schematic representation of approach-1 is shown in Figure 6.4. In this approach, we collectively analysed the output clusters of every split in the previous iteration and formed new splits for the next iteration where necessary.
Figure 6.4: Schematic representation of approach-1
Figure 6.5: Accuracy over the iterations in approach-1
From Figure 6.5, we can observe that we got an 8.78 % improvement in overall accuracy. Sensitivity in iteration-3 is 102018/114457 = 89.13 %.
Approach 2: A schematic representation of approach-2 is shown in Figure 6.6. In this approach, we went on analysing and splitting the output of every iteration separately, similar to a tree structure.
Figure 6.6: Schematic representation of approach-2
From Figure 6.7, we can observe that we got a 13.89 % improvement in overall accuracy. Sensitivity in iteration-3 is 98849/114457 = 86.36 %.
Figure 6.7: Accuracy over the iterations in approach-2
The only disadvantage of approach-2 is that the number of output clusters keeps growing as the iterations proceed. In our experiment we got 132 clusters in the third iteration, which is more than the total number of species in the input dataset (113); this means some species are spread over more than one cluster.
Chapter 7
CONCLUSIONS AND FUTURE
SCOPE
7.1 Conclusions
We successfully experimented with metagenomic binning on different datasets and identified the potential of the unsupervised approach to metagenomic binning. We identified abundance variance as a major factor in the effectiveness of unsupervised metagenomic binning, performed different experiments to overcome this problem, and obtained a peak accuracy improvement of 13.89 %. The k-mer frequency measure for identifying abundance did not perform well on its own and could be improved by hybridizing it with another measure. Though we obtained a satisfactory improvement in accuracy, there is still much scope for improvement in unsupervised metagenomic binning, and further research in this direction will certainly advance the field of metagenomics.
7.2 Future Scope
The way ahead involves applying innovative approaches that use data characteristics to handle the abundance-variance problem. In the k-mer frequency criterion, the selection of a proper k is a crucial decision that affects its abundance-detection capability; automatic selection of the k-value requires further research into its relationship with data characteristics. Combining another measure with the k-mer frequency count would also be beneficial. In the iterative approaches, the number of iterations to carry out is a prime decision, because after a certain number of iterations the results start degrading. It may happen that a particular species is split into two or more clusters; in such cases a Markov model can be developed to identify clusters containing the same species and merge them into one. The time complexity of Markov model training is very high, which limited us to experimenting with a few datasets; intense parallelization of the training phase using the Hadoop architecture could solve this problem. Further refinement of the experiments with different datasets will also help to remove any remaining minor drawbacks.
Appendix-A
A.1 Dataset-1
A.1.1 Accuracy at 1000 bp read length
PhymmBL PhyScimm
phylum 99.598 61.534
class 99.556 60.355
order 99.488 57.563
family 99.36 48.246
genus 98.476 34.176
Table 1: Accuracy of PhymmBL and PhySCIMM at 1000 bp read length
A.1.2 Accuracy at 800 bp read length
PhymmBL PhyScimm
phylum 99.598 61.475
class 99.519 60.950
order 99.435 55.915
family 99.305 47.059
genus 98.335 33.651
Table 2: Accuracy of PhymmBL and PhySCIMM at 800 bp read length
A.1.3 Accuracy at 400 bp read length
PhymmBL PhyScimm
phylum 99.616 60.703
class 99.593 57.873
order 99.506 52.725
family 99.397 45.416
genus 98.441 30.848
Table 3: Accuracy of PhymmBL and PhySCIMM at 400 bp read length
A.1.4 Accuracy at 200 bp read length
PhymmBL PhyScimm
phylum 99.581 59.013
class 99.538 50.928
order 99.541 48.664
family 99.433 38.730
genus 98.721 26.807
Table 4: Accuracy of PhymmBL and PhySCIMM at 200 bp read length
A.1.5 Accuracy at Phylum Level
PhymmBL PhyScimm
1000 99.598 61.534
800 99.598 61.475
400 99.596 60.703
200 99.581 59.013
Table 5: Accuracy of PhymmBL and PhySCIMM at Phylum Level
A.1.6 Accuracy at Class Level
PhymmBL PhyScimm
1000 99.556 60.355
800 99.519 60.150
400 99.593 57.873
200 99.538 50.928
Table 6: Accuracy of PhymmBL and PhySCIMM at Class Level
A.1.7 Accuracy at Order Level
PhymmBL PhyScimm
1000 99.488 57.563
800 99.435 55.915
400 99.50 52.725
200 99.541 48.664
Table 7: Accuracy of PhymmBL and PhySCIMM at Order Level
A.1.8 Accuracy at Family Level
PhymmBL PhyScimm
1000 99.36 50.558
800 99.305 48.101
400 99.397 44.488
200 99.433 38.258
Table 8: Accuracy of PhymmBL and PhySCIMM at Family Level
A.1.9 Accuracy at Genus Level
PhymmBL PhyScimm
1000 98.476 34.797
800 98.335 33.651
400 98.441 30.848
200 98.721 26.807
Table 9: Accuracy of PhymmBL and PhySCIMM at Genus Level
A.2 Dataset-2
A.2.1 Accuracy at various taxonomic levels
PhymmBL PhyScimm
phylum 99.018 96.539
class 98.38 92.987
order 97.842 91.355
family 97.83 91.637
genus 97.521 91.477
Table 10: Accuracy of PhymmBL and PhySCIMM at 200 bp read length
Appendix-B
B.1 Publication Status
Title Conference Status
Metagenomic Binning:
An Overview and Methods NCIC-2013, Coimbatore Published
Table 11: Paper publication status
Bibliography
[1] “Bioinformatics”, http://en.wikipedia.org/wiki/Bioinformatics
[2] Paul R. Graves and Timothy A. J. Haystead, “Molecular Biologist’s Guide to Proteomics”, American Society for Microbiology, 2002. [Online] Available: http://www.pubmedcentral.gov/articlerender.fcgi?artid=120780
[3] William E. Evans and Mary V. Relling, “Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics”, Science, Vol. 286, no. 5439, pp. 487-491, DOI: 10.1126/science.286.5439.487. [Online] Available: http://www.sciencemag.org/content/286/5439/487.full
[4] “Cell Structure”, http://en.wikipedia.org/wiki/Cell (biology)
[5] Raven, Johnson, Losos, Mason and Singer, “Biology, Eighth Edition”, McGraw-Hill Higher Education.
[6] “DNA Structure” http://www.accessexcellence.org/RC/VL/GG/dna2.php
[7] Robert W. Simons, “RNA structure and function”, Cold Spring Harbor Laboratory Press, 1998.
[8] Gregory A. Petsko and Dagmar Ringe, “Protein structure and function”, New
Science Press, 2004.
[9] “Protein Synthesis”, http://en.wikipedia.org/wiki/File:Proteinsynthesis.png
[10] “Reverse Engineering”, http://en.wikipedia.org/wiki/Reverse engineering
[11] T. Thomas, J. Gilbert, and F. Meyer, “Metagenomics - a guide from
sampling to data analysis.” [Online]. Available: http://www.pubmedcentral.nih.gov/
articlerender.fcgi?artid=3351745
[12] H. Teeling and F. O. Glockner, “Current opportunities and challenges in microbial metagenome analysis - a bioinformatic perspective,” Brief Bioinform, 2012. [Online]. Available: http://bib.oxfordjournals.org/content/early/2012/09/26/bib.bbs039.full.pdf
[13] J. Bohlin, E. Skjerve, and D. W. Ussery, “Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes,” BMC Genomics, vol. 9, p. 104, 2008. [Online]. Available: http://www.biomedcentral.com/content/pdf/1471-2164-9-104.pdf
[14] A. Kislyuk, S. Bhatnagar, J. Dushoff, and J. S. Weitz, “Unsupervised statistical
clustering of environmental shotgun sequences,” BMC Bioinformatics, vol. 10, p.
316, 2009. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-10-316
[15] A. Brady and S. L. Salzberg, “Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models,” Nat. Methods, vol. 6, no. 9, pp. 673–676, 2009. [Online]. Available: http://www.nature.com/nmeth/journal/v6/n9/suppinfo/nmeth.1358 S1.html
[16] S. Salzberg, A. Delcher, S. Kasif, and O. White, “Microbial gene identification using
interpolated markov models,” Nucleic Acids Res., vol. 26, no. 2, pp. 544–548, 1998.
[17] D. R. Kelley and S. L. Salzberg, “Clustering metagenomic sequences with
interpolated markov models.” [Online]. Available: http://www.pubmedcentral.nih.
gov/articlerender.fcgi?artid=3098094
[18] K. D. Pruitt, T. A. Tatusova, and D. R. Maglott, “NCBI reference sequence
(refseq): a curated non-redundant sequence database of genomes, transcripts and
proteins,” Nucleic Acids Research, vol. 33, no. Database-Issue, pp. 501–504, 2005.
[Online]. Available: http://dx.doi.org/10.1093/nar/gki025
[19] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Res., vol. 25, pp. 3389–3402, 1997.
[20] K. Mavromatis, N. Ivanova, K. Barry, H. Shapiro, E. Goltsman, A. C. McHardy, I. Rigoutsos, A. Salamov, F. Korzeniewski, M. Land, A. Lapidus, I. Grigoriev, P. Richardson, P. Hugenholtz, and N. C. Kyrpides, “Use of simulated data sets to evaluate the fidelity of metagenomic processing methods,” 2007.
[21] Y.-W. Wu and Y. Ye, “A novel abundance-based algorithm for binning metagenomic sequences using l-tuples,” J. Comput. Biol., vol. 18, pp. 523–534, 2011.
[22] E. S. Lander and M. S. Waterman, “Genomic mapping by fingerprinting random clones: a mathematical analysis,” Genomics, vol. 2, no. 3, pp. 231–239, Apr. 1988.