bioinformatics review - november 2015 issue
DESCRIPTION
The Digital issue of Bioinformatics Review, November 2015
NOVEMBER 2015 VOL 1 ISSUE 2
Explained:
CRISPR-ERA and
CRISPR/Cas9 system
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads
“A cell is regarded as the
true biological atom.”
- George Henry Lewes
Public Service Ad sponsored by IQLBioinformatics
Contents
November 2015
Topics
Editorial
Systems Biology
  Cancer: From the Eyes of Mathematical and Systems Biology
  Introduction to Mathematical Modelling Part-3
  Introduction to Mathematical Modelling (Last Part)
  Explore Tuberculosis: A Systems Biology Approach
Software
  IBS: Modifying the organization of biological sequences diagrammatically
Tools
  Explained: CRISPR-ERA and CRISPR/Cas9 system
  Installing Gromacs on Ubuntu for MD Simulation
  Cl-Dash: Speeding up Cloud Computing in Bioinformatics
Sequence Analysis
  How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads
News
  Structural Identification of Macromolecules in Solution with DARA Webserver
Genomics
  GenomeD3 plot: Easy visualization of genomes
CHIEF EDITOR
Dr. Prashant Pant
EDITORIAL BOARD
SECTION EDITORS
ALTAF ABDUL KALAM MANISH KUMAR MISHRA
SANJAY KUMAR PRAKASH JHA NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send E-mail
requests to [email protected]. Please include contact detail in your message.
BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com
at $5 per issue. Back issues in print format cost $2 for delivery in India and $11 for international delivery,
subject to availability. Pre-payment is required.
CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025
STAFF ADDRESS To contact any of the Bioinformatics Review staff member, simply format the address as [email protected]
PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social
and Educational Welfare Association (SEWA) Trust (registered under Trust Act 1882). Copyright 2015
SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used
under licence by SEWA Trust. Published in India.
EXECUTIVE EDITOR FOUNDING EDITOR
FOZAIL AHMAD MUNIBA FAIZA
EDITORIAL
With speculations about the future in mind, moving ahead slowly and steadily is not only an option but also wisdom. BiR, in its second month, moves ahead with a similar philosophy. This month’s highlight is BiR’s very first public showcase before the scientific community, at an international conference on a newly emerging field, the importance of soil microbes as drivers of various processes, to be held in Prague, Czech Republic (EU). This has been a research area of immense interest to me, and I would like to share a few things about it. Soil microbial diversity was long regarded as less worthy of study than other forms of life, until it was discovered quite recently that more than 95% of the microbial diversity in any environmental sample is unknown and uncultured, and holds huge biotechnological, medical and agronomical potential. This kick-started a new branch of genomics and bioinformatics, now popularly known as metagenomics, which deals with community genomes from environmental samples. Metagenomics uses the tools and techniques of genomics together with computational biology to handle the large data sets derived from multiple genomes by next generation sequencing (NGS) technologies. It is one of the sources of Big Data coming from molecular biologists. Even today, the primary concern is to sequence DNA from environmental samples and correlate the metagenomic data with its probable functions, as opposed to conventional culture-based approaches. For these reasons we chose this international platform to introduce BiR to the world’s scientific community and to showcase it as an excellent vector for propagating scientific news and developments. With these slow and small efforts, we hope for a steady metamorphosis of BiR into a known standard for scientific reports and news.
Dr. Prashant Pant
Editor-in-Chief
Letters and responses:
Bioinformatics Review | 6
Cancer: From the Eyes of Mathematical Biology
Sanjay Kumar
Image Credit: Google Images
“A cell biologist says it is an uncontrolled proliferation (increase in number by division and growth) of cells; molecular biologists call it a mutant variety of some biomolecules forcing a cell to commit to such an uncontrolled cell division cycle.”
The month of November has just arrived with its first glimpse of winter. We welcome this month with an evergreen and hot topic: cancer research. This time we intend to introduce you to an old research topic with a new vision.
Cancer, an ailment with no fully reliable remedy, has been pursued as a career by many researchers. A cell biologist says it is an uncontrolled proliferation (increase in number by division and growth) of cells; molecular biologists call it a mutant variety of some biomolecules forcing a cell to commit to such an uncontrolled cell division cycle. But how does a systems biologist see this kind of problem? Let us try to approach it in a different way.
When proteins have not been assigned a name based on their function or structure, scientists label them according to their molecular weight, e.g. p53, p200, p19. Scientists have shown abnormally high expression of the p53 protein in cancerous cells and tissues. p53 drives the expression of the other proteins that regulate the cell cycle, which in the normal scenario lets the cell divide into two; p53 also promotes the production of its own inhibitor, the Mdm2 protein. When a mutation in p53 makes it fail to recognize abnormalities, the levels of p53, and consequently of Mdm2, p21 and other p53-regulated proteins, do not rise. The division of abnormal cells then continues indefinitely and causes cancer.
From a mathematical biology perspective, systems biologists write ordinary differential equations (ODEs). These equations are nothing more than representations of the chemical reactions, and combinations of reactions, occurring inside a cell. Our previous blogs (by Fozail Ahmad) described how to combine chemical reactions into ODEs and how the equations follow zero-order kinetics (the reaction rate does not depend on the concentration of any participating chemical), first-order kinetics (the rate depends on the concentration of one participating chemical) and second-order kinetics (the rate depends on the product of two concentrations, or on the square of one). In addition, I would like to mention that some reactions occur with the help of biomolecular machines. These machines (enzymes) help the reaction to occur but are not consumed by it, and so they affect the reaction through a different form of kinetics, described in 1913 in the combined work of the German biochemist Leonor Michaelis and the Canadian physician and biochemist Maud Menten.
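The four rate laws named above can be written down in a few lines. The following is a minimal sketch in Python, not the model from the referenced paper; the function names and the parameters `k`, `v_max` and `k_m` are illustrative assumptions.

```python
# Illustrative rate laws; all names and parameter values are assumptions.

def zero_order(k):
    """Rate is constant and independent of any concentration."""
    return k

def first_order(k, a):
    """Rate is proportional to the concentration of one species, a."""
    return k * a

def second_order(k, a, b):
    """Rate is proportional to the product of two concentrations."""
    return k * a * b

def michaelis_menten(v_max, k_m, s):
    """Enzyme-catalysed rate: saturates at v_max for large substrate s."""
    return v_max * s / (k_m + s)
```

A useful property of the Michaelis-Menten form is that the rate equals half of `v_max` when the substrate concentration equals `k_m`, which is how `k_m` is usually read off experimental curves.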
So, in a normal cell, when p53 senses danger it signals the cell by increasing p21, which combines with PCNA (Proliferating Cell Nuclear Antigen, a protein that supports DNA replication during cell division) and stops the cell division. This type of cell cycle is shown in one of the accompanying diagrams, as is the mutated case, in which p53 cannot sense the cellular damage and the cell therefore keeps dividing as usual.
We have also included a combined picture that illustrates how the different stages of mathematical biology look. These figures specifically contrast cancer cells with normal cells.
Reference: Alam MJ, Kumar S, Singh V, Singh RKB (2015) Bifurcation in Cell Cycle Dynamics Regulated by p53. PLoS ONE 10(6): e0129620. doi:10.1371/journal.pone.0129620
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129620
Introduction to Mathematical Modelling (Part 3 of 3)
Fozail Ahmad
Image Credit: Stock Photos
“For modeling the systems behavior, suitable methods have been developed. Among them are two methods, commonly used in modeling of metabolic process, modeling of signaling and regulatory pathways.”
Derivation of Mathematical Equations for Understanding Systems Behaviour:
Depending on the nature of the biological process, it is essential to understand the different modeling approaches, since a number of methods have been used for different biological systems. Functionally, most cellular processes are dynamic and change with the environment, for example the signaling or regulation of specific genes when a cell is exposed to an unusual medium. To describe such time-dependent phenomena, it is necessary to choose mathematical equations that can capture these dynamic effects. In other biological systems, where cellular products or molecules do not change over time (i.e., their concentration remains the same), it is not necessary to describe the details of the underlying dynamics. Suitable methods have been developed for modeling the behavior of these systems. Two are commonly used: modeling of metabolic processes, and modeling of signaling and regulatory pathways.
1. Modeling Metabolic Process
Metabolism is an essential process in all living beings. It provides the energy and building blocks for survival, for the synthesis of larger molecules and for the degradation of unnecessary or toxic substances in a cell. Understanding metabolic mechanisms has been a major research interest for decades, but the complete interplay of the underlying mechanisms is not yet understood.
One of the key parameters in any metabolic study is the metabolic flux, that is, the utilization (conversion) of metabolites along metabolic pathways. It is therefore important to understand and predict the metabolic flux for all patterns of metabolism, since the fluxes indicate which biochemical routes are being used. Here the model is fitted either to a hypothetical framework or to a known biochemical route, so as to identify any particular step in the production or degradation of a desired molecule by the cultured cell or bacterium that limits the overall rate at which the process occurs (a metabolic bottleneck). The result of the study then tells the researcher how to genetically modify the cell or bacterium to optimize the yield of the particular end product.
Changes in metabolic flux are of less concern when the biochemical processes operate at steady state and the entry of unnecessary molecules is completely blocked, so that the process remains quasi-stationary and no external perturbation takes place.
As an example of such a quasi-stationary metabolic process, consider the conversion of a sugar to a sugar phosphate. In this process the enzyme hexokinase adds a phosphate group to glucose (C6H12O6), yielding the compound glucose-6-phosphate. The reaction should be balanced in terms of atoms and electrical charges. In chemical notation, the balanced reaction is written as
C6H12O6 + ATP^4- -> C6H11O6PO3^2- + ADP^3- + H+
In this reaction, both sides of the equation are in stoichiometric balance. When investigating a more complex metabolic network, each individual chemical reaction is bound by stoichiometric balance constraints, such as mass, number of molecules, concentration and charges of the reactants, that can be used to formulate mathematical equations. Keep in mind that these reaction constraints are not independent of each other and should be solved together to develop a reliable mathematical model. The validity of the models can be tested with wet lab techniques that use detectable or radioactively labeled substances. Labeled atoms can be traced across a number of key metabolites, revealing the cellular flux distribution, which helps validate or disprove the metabolic network model.
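The atom and charge balance discussed above can be checked mechanically. The sketch below does this in Python for the hexokinase reaction; the elemental formulas and the charge states (ATP as a 4- ion, ADP as a 3- ion, glucose-6-phosphate as a 2- ion) are assumptions that correspond to the fully deprotonated forms usually written at physiological pH, and the function names are illustrative.

```python
from collections import Counter

# (elemental composition, net charge) per species.
# Charge states are assumptions: fully deprotonated forms.
SPECIES = {
    "glucose": (Counter(C=6, H=12, O=6), 0),
    "ATP": (Counter(C=10, H=12, N=5, O=13, P=3), -4),
    "G6P": (Counter(C=6, H=11, O=9, P=1), -2),
    "ADP": (Counter(C=10, H=12, N=5, O=10, P=2), -3),
    "H+": (Counter(H=1), +1),
}

def totals(side):
    """Sum atoms and charge over one side; side maps species -> coefficient."""
    atoms, charge = Counter(), 0
    for name, coeff in side.items():
        formula, q = SPECIES[name]
        for element, n in formula.items():
            atoms[element] += coeff * n
        charge += coeff * q
    return atoms, charge

def is_balanced(reactants, products):
    """True when both sides carry the same atoms and the same net charge."""
    return totals(reactants) == totals(products)

reactants = {"glucose": 1, "ATP": 1}
products = {"G6P": 1, "ADP": 1, "H+": 1}
balanced = is_balanced(reactants, products)
```

In a larger network, the same bookkeeping is applied to every reaction, and the stoichiometric coefficients become the entries of the matrix that constrains the fluxes.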
Explained: CRISPR-ERA and CRISPR/Cas9 system
Tariq Abdullah
“When a viral DNA (bacteriophage, in this case) integrates into the bacterial genome, it produces RNA which is taken up by Cas9.”
The CRISPR/Cas9 system is a bacterial defence mechanism against bacteriophage infection. When fragments of viral DNA (bacteriophage, in this case) are integrated into the bacterial genome, they are transcribed into RNA that is taken up by Cas9. Cas9 and the RNA drift through the cell together, and as soon as they encounter a DNA sequence complementary to the RNA, the complex attaches to it and Cas9 cuts the DNA at that site. Because the viral DNA is cut, the virus is prevented from multiplying. The bacterium thus defends itself by precisely cutting the matching viral DNA using the CRISPR/Cas9 system.
The recent use of the CRISPR/Cas9 system for gene editing in humans, animals and bacteria has led to a lot of interesting research in this area. It requires the design of a single guide RNA (sgRNA), which is a challenging process. CRISPR-ERA was developed to solve this problem.
So what is CRISPR-ERA?
CRISPR-ERA is a new tool, available at http://crispr-era.stanford.edu, developed by Honglei Liu et al. The name is an acronym for Clustered Regularly Interspaced Short Palindromic Repeat-mediated Editing, Repression, and Activation.
What does CRISPR-ERA do?
According to the authors of CRISPR-ERA,
“The major goal of our designer tool is to address the discrepancy for designing sgRNAs that allow efficient and highly specific repression or activation of genes and for generating genome-wide sgRNA libraries for genetic screening in different organisms.”
(Bioinformatics, 31(22), 2015, 3676-3678, doi: 10.1093/bioinformatics/btv423)
How does CRISPR-ERA work?
CRISPR-ERA looks up all targetable sites for each target gene, searching for the pattern N20NGG (N = any nucleotide). It then calculates an E-score and an S-score.
1. E-score is the efficacy score, based on sequence features such as GC content (%GC), presence of poly-thymidine and location information.
2. S-score is the specificity score, based on the genome-wide off-target binding sites. For each sgRNA design, Bowtie is used to compute genome-wide sequences that contain an adjacent NRG (R = A or G) protospacer adjacent motif (PAM) site and have zero, one, two or three mismatches to the sgRNA; these are regarded as off-target binding sites. The penalty score for an NAG off-target is smaller than for an NGG off-target.
The sgRNAs are finally ranked by the sum of E-score and S-score, and the result is presented according to these scores.
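The site lookup described above (the N20NGG pattern) and two of the sequence features that feed the E-score can be sketched in Python. This is an illustration only and not CRISPR-ERA's implementation: the function names, the poly-T run length of four and the absence of any actual scoring formula are assumptions, and off-target search with Bowtie is not covered.

```python
import re

def revcomp(seq):
    """Reverse complement of a DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(seq):
    """Return (start, strand, protospacer, pam) for every N20NGG site.

    For the '-' strand, start is an index into the reverse complement.
    """
    seq = seq.upper()
    sites = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        # A lookahead is used so that overlapping sites are all reported.
        for m in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", s):
            sites.append((m.start(), strand, m.group(1), m.group(2)))
    return sites

def gc_percent(protospacer):
    """GC content of the 20-nt protospacer, in percent."""
    return 100.0 * sum(base in "GC" for base in protospacer) / len(protospacer)

def has_poly_t(protospacer, run=4):
    """True if the protospacer contains a run of T's (run length assumed)."""
    return "T" * run in protospacer

example = "AAAAA" + "ACGT" * 5 + "TGG" + "AAAA"
example_sites = find_sites(example)
```

Each candidate returned by `find_sites` could then be passed to `gc_percent` and `has_poly_t` before any ranking step.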
References & Further Reading
http://gizmodo.com/everything-you-need-to-know-about-crispr-the-new-tool-1702114381
Bioinformatics (2015) 31(22): 3676-3678. doi:10.1093/bioinformatics/btv423
Structural Identification of Macromolecules in solution with DARA web server
Muniba Faiza
Image Credit: Google Images
“DARA is a webserver which initially “computes the scattering profiles from the available structures / models in PDB (Protein Data Bank) and compares these profiles with a given SAXS pattern.””
To study macromolecules in homogeneous solution, a technique known as SAXS (Small Angle X-ray Scattering) is used, in which the measured scattering patterns are used to determine the structure of macromolecules such as proteins, nucleic acids and protein:nucleic acid complexes. In this experiment, a monochromatic X-ray beam illuminates the homogeneous solution, which produces a scattering pattern. From this pattern an ab initio particle shape can be generated, and the model is compared with the available theoretical data. Comparing the experimental scattering pattern with known scattering data helps in determining the structure. If the experimental data match one or more known scattering patterns, they may provide detailed information about the quaternary and tertiary structure.
DARA is a web server that initially “computes the scattering profiles from the available structures / models in PDB (Protein Data Bank) and compares these profiles with a given SAXS pattern.” The server is very fast: it compares more than 150,000 profiles within a few seconds, covering almost all the models available in the PDB. DARA provides good and enhanced results.
How DARA works?
DARA implements a new search algorithm that combines principal component analysis and k-d trees for rapid identification of scattering neighbours, including nucleic acids and complexes.
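To give a feel for why a k-d tree makes the neighbour search fast, here is a minimal nearest-neighbour sketch in Python. It is not DARA's code: it assumes each scattering profile has already been reduced (for example by principal component analysis) to a short tuple of numbers, and the function names and sample points are hypothetical.

```python
import math

def build(points, depth=0):
    """Build a k-d tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (distance, point) of the stored point closest to query."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the other branch only if it could hold a closer point.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

# Hypothetical 2-D coordinates standing in for reduced profiles.
points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build(points)
closest = nearest(tree, (9.0, 2.0))
```

The pruning test is the point of the structure: whole branches are skipped when they cannot contain anything closer than the best match found so far, so a query rarely has to visit every stored profile.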
SAXS data:
For each entry in the PDB, all biological assemblies are retrieved; for NMR entries, only the first model is considered. The data are represented in the form of curves. The theoretical scattering curves are obtained with the software CRYSOL 2.8, which is sufficient to cover models with a maximum intra-particle distance Dmax of up to 800 Å. For each model, CRYSOL calculates its Dmax, radius of gyration (Rg), molecular weight (MW) and excluded volume of the hydrated particle (V). For proteins, the secondary structure content was computed as the percentage of alpha helices and beta sheets.
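Two of the parameters listed above, Rg and Dmax, have simple geometric definitions that can be sketched in Python. This is a simplified illustration and not what CRYSOL computes: it assumes equally weighted atoms and ignores the hydration shell, and the function names are hypothetical.

```python
import math
from itertools import combinations

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid.

    Assumes every atom has equal weight (a simplification).
    """
    n = len(coords)
    centre = tuple(sum(c[i] for c in coords) / n for i in range(3))
    return math.sqrt(sum(math.dist(c, centre) ** 2 for c in coords) / n)

def d_max(coords):
    """Largest distance between any two points (maximum particle size)."""
    return max(math.dist(a, b) for a, b in combinations(coords, 2))

# Two points 2 units apart, coordinates in arbitrary units.
toy = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_rg = radius_of_gyration(toy)
toy_dmax = d_max(toy)
```

The pairwise loop in `d_max` grows quadratically with the number of atoms, which is acceptable for a sketch but would be replaced by a faster method for large assemblies.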
DARA computes various parameters and returns its output almost instantaneously. It reports about 100 neighbours of the query macromolecule, ranked so that the best-fitting curves come first. The result in Fig 1 shows the best structures obtained by calculation and comparison against the parameters considered. The result can also be downloaded.
DARA offers a rapid and easy way to analyze and identify macromolecules in solution, which is otherwise a difficult process. It is available at http://www.embl-hamburg.de/biosaxs/dara.html
Reference:
DARA: a web server for rapid search of structural neighbours using solution small angle X-ray scattering data
Alexey G. Kikhney, Alejandro Panjkovich, Anna V. Sokolova and Dmitri I. Svergun
Fig 1 Top three nearest neighbors for experimental SAXS data collected from glucose isomerase in a phosphate buffer.
Introduction to Mathematical Modelling (Last Part)
Fozail Ahmad
Image Credit: Google Images
“Parameters for any equation in a model describe certain biochemical features of the components involved in reactions or pathways under study.”
In the previous section, mathematical modeling was exemplified by a metabolic process and its biochemical regulation. The same can be done for signalling pathways and genetic regulatory processes. In every cellular phase, one observes the state of the cell changing under the influence of environmental factors, and it is quite difficult for a cell to maintain its functions and reach a steady state. Thus, one needs to fix a range of parameters for all molecular reactions when setting up a mathematical model.
Identification of Model Parameters:
Parameters for any equation in a model describe certain biochemical features of the components involved in the reactions or pathways under study. For example, when modeling the dynamics of a metabolic network, the mathematical equations inferred from the processes must contain parameters that represent the kinetic features of the metabolic enzymes involved, such as the number of reactions an enzyme can perform in a given period of time (i.e., the rate constant). We must know these kinetic parameters before setting up a well-defined system of differential equations. Ideally, therefore, the kinetic parameters for all relevant reaction components are determined experimentally.
In practice, however, a number of kinetic parameters are still not known, even for otherwise well-investigated biological components (enzymes, proteins and hormones), primarily because reliable experimental data are lacking. Very often the kinetic parameters have been measured in the wet lab (i.e., in vitro), but it has not been experimentally validated that the enzymes behave the same way inside a cell. This hurdle is overcome by measuring the overall dynamics of the system being studied. Computational procedures make this easier by providing estimation techniques that optimize the parameter values: different parameter sets are tried until the model fits the available experimental dataset. This method depends critically on the quality of the dataset, and predictions made from unreliable data will lead to unreliable parameters and to a limited model of no use. The development of even simple network models for biological processes lags far behind, mainly because high-quality experimental data are unavailable, and obtaining such data is still a major focus in the field of systems biology.
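The idea of trying parameter sets until the model fits the data can be shown with a deliberately small example. The Python sketch below estimates a single rate constant for first-order decay by a grid search over candidate values; the model, the synthetic observations and the candidate grid are all assumptions made for illustration, and real studies use more capable optimizers and noisy measurements.

```python
import math

def simulate(k, times, c0=1.0):
    """Concentration under first-order decay, c(t) = c0 * exp(-k * t)."""
    return [c0 * math.exp(-k * t) for t in times]

def sse(k, times, observed):
    """Sum of squared errors between model prediction and observations."""
    return sum((m - o) ** 2 for m, o in zip(simulate(k, times), observed))

def fit_k(times, observed, candidates):
    """Pick the candidate rate constant with the smallest error."""
    return min(candidates, key=lambda k: sse(k, times, observed))

# Synthetic, noise-free observations generated with k = 0.5 (an assumption).
times = [0, 1, 2, 3, 4]
observed = simulate(0.5, times)
candidates = [i / 100 for i in range(1, 201)]
best_k = fit_k(times, observed, candidates)
```

With noisy or sparse observations several candidates can fit almost equally well, which is the practical form of the data-quality problem described above.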
It is important to mention that some biological processes cannot be described by such simple models, which are based only on the concentrations of molecules and ignore the individual components, even though molecular movement has a significant impact on cellular mechanisms. Because molecules are closely packed in a cell, their thermal motion and the random movement caused by changing environmental conditions may initiate a signal that propagates across the cell and stops only when it reaches its target. To account for such random effects, a stochastic component must be incorporated into the equations of the model. Signaling molecules that have little observable effect can be neglected, whereas molecules that exist in only a few copies in the cell must not be neglected and should be integrated into the mathematical equations.
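One common way to add a stochastic component for low copy numbers is Gillespie's stochastic simulation algorithm, named here as an illustrative technique rather than the method the article prescribes. The Python sketch below applies it to the simplest possible case, a single degradation reaction; the function name, rate constant and copy number are assumptions.

```python
import random

def gillespie_decay(n0, k, t_end, seed=0):
    """Simulate A -> (nothing) with propensity k * n, one molecule at a time.

    Returns a list of (time, copy_number) events up to t_end.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while n > 0:
        rate = k * n                  # total propensity at this instant
        t += rng.expovariate(rate)    # random waiting time to next event
        if t > t_end:
            break
        n -= 1                        # one molecule is degraded
        trajectory.append((t, n))
    return trajectory

# Ten starting copies: few enough that random timing matters.
run = gillespie_decay(n0=10, k=1.0, t_end=5.0, seed=1)
```

Repeating the run with different seeds gives a spread of trajectories around the smooth exponential decay that an ODE would predict, and the spread is widest when copy numbers are small.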
The next issue after parameter optimization is the different time and distance scales of the components integrated into a pathway. For example, metabolism occurs within seconds or minutes, whereas genetic regulation takes longer (hours or even days) to exert its effect or to express a particular gene induced by metabolic processes at a greater distance. Signals (enzymes, proteins, hormones) may have to travel long distances across the cell membrane and through the circulatory system of body fluids between tissues. To handle these different length and time scales, we can use a multi-scale model, which avoids unnecessary complexity in the system.
Finally, it is important to remember that a developed model is only as good as the assumptions on which it is based.
IBS: Modifying the organization of biological sequences diagrammatically
Muniba Faiza
Image Credit: Google Images
“ILLUSTRATOR Of BIOLOGICAL SEQUENCE (IBS) which is used for representing the organization of protein or nucleotide sequences in an easy, efficient and precise manner.”
Many a time, we need to visualize and summarize the existing information about biological sequences such as protein or DNA. For this purpose, a new software package called ILLUSTRATOR Of BIOLOGICAL SEQUENCE (IBS) has been introduced, which represents the organization of protein or nucleotide sequences in an easy, efficient and precise manner. It visualizes various functional elements. IBS provides features such as diagramming of domains and motifs, rescaling, coloring and many more. The standalone packages of IBS are implemented in Java and support three major operating systems: Windows, Linux and Mac OS.
Key Features:
- Annotation of both protein and nucleotide sequences is supported through a variety of drawing elements.
- Better color visualization.
- An ‘export module’ with which the final artwork can be exported as a publication-quality figure.
- A user-friendly interface.
- Various built-in textures for coloring black-and-white diagrams as required.
- Easy retrieval of UniProt annotations.
IBS provides separate modes for proteins and DNA, so each type of sequence can be represented in its own mode. IBS may prove to be very useful software in many areas of biological research. For example, with the help of IBS one can easily diagram the translocations that occur in cancer alongside a parallel view of the wild-type arrangement of the sequence (as shown in Fig. 1).
“IBS provides an assistance in
drawing publication quality
diagrams of both protein and
nucleotide sequences.”
Fig. 1 The main interface of IBS. (A) The standalone software showing the domain organization of E3 SUMO-protein ligase RanBP2 (Flotho and Werner, 2012). (B) The online service presenting the organization of bromodomain proteins and translocations in cancer (Muller et al., 2011).
Reference:
IBS: an illustrator for the presentation and visualization of biological sequences
Wenzhong Liu, Yubin Xie, Jiyong Ma, Xiaotong Luo, Peng Nie, Zhixiang Zuo, Urs Lahrmann, Qi Zhao, Yueyuan Zheng, Yong Zhao, Yu Xue and Jian Ren
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads
Prashant Pant
Image Credit: Google Images
“Chimeras are usually formed during polymerase chain reaction (PCRs) but in some rare cases they are for real. Therefore, it becomes relevant to adopt methods which can clean the sequence datasets of Chimeras.”
A TYPICAL CHIMERIC SEQUENCE OBTAINED FROM PINTAIL VERSION 1.0
Detecting chimeric (or recombinant) sequences in a sequence dataset is an important part of sequence analysis, especially for the reconstruction of deep phylogenies and for sequence similarity analyses. This article focuses on methods of chimera detection in high-quality 16S rRNA sequences from Sanger sequencing with good read length (>750 bp). At such lengths the reads become potential candidates for chimera formation. As culture-independent approaches to analysing microbial diversity pick up speed with high-throughput sequencing methods, the number of chimeric sequences published in the databases is also increasing exponentially. This is the era of metagenomics, or simply put community DNA analysis, in which DNA from thousands of species is pooled and then analysed. This further increases the chances of chimera formation. Chimeras are usually formed during the polymerase chain reaction (PCR), but in some rare cases they are real. It therefore becomes relevant to adopt methods that can clean sequence datasets of chimeras.
Recently, a number of chimera detection programs for 16S rRNA gene sequences have been launched, namely Pintail, Mallard and Bellerophon. The first two applications are available at http://www.bioinformatics-toolkit.org and the last one at http://greengenes.lbl.gov/cgi-bin/nph-index.cgi. Pintail and Mallard detect chimeras and anomalies in 16S rRNA genes based on the extent of pairwise percentage similarity between the query and related sequences. In chimera analysis with Pintail 1.0, the query sequences that could be putative recombinants are compared on a one (query)-on-one (subject) basis with a list of closely related sequences identified by BLAST searches. Because Pintail is a one-on-one query-subject comparison, it is highly stringent. This is not the case with Mallard. In Mallard, one sequence from within a dataset of query sequences is randomly chosen as the subject, while the rest remain as queries. A many (query)-on-one (subject) comparison follows, which is easy and takes less time than Pintail. Note that Mallard is of limited use if the query sequences are too diverse or truly novel in the first place.
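The idea of comparing a query with a related sequence by pairwise percentage similarity can be illustrated with a windowed identity profile. The Python sketch below is not the algorithm used by Pintail or Mallard; it assumes the two sequences are already aligned to equal length, and the function name, window size and toy sequences are hypothetical.

```python
def window_identity(query, subject, window=10, step=10):
    """Percent identity of two pre-aligned sequences, window by window."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        s = subject[start:start + window]
        matches = sum(a == b for a, b in zip(q, s))
        profile.append(100.0 * matches / window)
    return profile

# Toy parents and a chimera built from the first half of one
# and the second half of the other.
parent_a = "ACGTACGTAC" * 2
parent_b = "TTGCATTGCA" * 2
chimera = parent_a[:10] + parent_b[10:]

profile_vs_a = window_identity(chimera, parent_a)
profile_vs_b = window_identity(chimera, parent_b)
```

A genuine sequence tends to show a roughly even identity profile against its closest relative, whereas a chimera matches one relative well over part of its length and another relative over the rest.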
Another program for detecting chimeras in 16S rRNA genes, Bellerophon ver 3.0 from Greengenes, is more dedicated to 16S rRNA sequences. Here the sequences must be submitted as a NAST (Nearest Alignment Space Termination Tool) formatted file. The NAST alignment server at Greengenes holds more than one million 16S rRNA sequence records. When the NAST formatted file is submitted, the server launches a local BLAST search of each query sequence against the 16S rRNA gene sequence library on the server. It then checks for potential chimeras in each query-subject alignment, one-on-one. The outcome of the entire process is a couple of Excel sheets, emailed to the user, listing the query sequences, their best matches and the BLAST score values. The user can set a BLAST score threshold, below which the software automatically excludes sequences from chimera detection. Finally, it reports whether a potential break-point was found (essentially in a Yes or No format). It is user-friendly and particularly good for large datasets with a high amount of sequence diversity.
The only demerit of the software is that a relatively novel sequence in the query batch receives a low score, because it is highly unrelated to the existing records, and thus risks being omitted. Hence, one has to be really careful while using these programs, as sequence diversity could be lost, especially if the data come from an extreme site (with more novel sequences) or from an NGS project with long reads and good coverage, as in the case of the PacBio machine. It is worth mentioning here that while Pintail and Mallard can be applied to any given DNA sequence data, Bellerophon is a dedicated program for 16S rRNA.
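The caution about score thresholds can be made concrete with a small Python sketch. It does not reproduce Bellerophon's behaviour or file formats; the record layout, function name and score values are hypothetical, and the point is only that low-scoring queries are set aside for manual review instead of being discarded silently.

```python
def split_by_score(hits, threshold):
    """Split (query_id, blast_score) pairs at a score threshold.

    Returns (kept, set_aside); set_aside holds possible novel sequences
    that deserve a manual look before being dropped.
    """
    kept = [h for h in hits if h[1] >= threshold]
    set_aside = [h for h in hits if h[1] < threshold]
    return kept, set_aside

# Hypothetical best-hit scores for three query sequences.
hits = [("seq1", 980.0), ("seq2", 410.0), ("seq3", 1200.0)]
kept, set_aside = split_by_score(hits, threshold=500.0)
```

Keeping the second list makes it possible to check whether a low score reflects a poor read or a genuinely novel organism.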
Explore Tuberculosis: A Systems Biology Approach
Fozail Ahmad
Image Credit: Google Images
“The bacterial two component system (TCS) is a signal transduction system that senses environmental stimuli and responds accordingly.”
Systems biology alone is not sufficient to fulfil the requirement of a molecular understanding of any organism at any level. It draws on multiple approaches and fields to resolve a particular issue arising from ongoing work. In this article you will find a combinatorial systems biology approach, i.e. molecular, cellular and network biology, to understanding how tuberculosis develops and how the pathogen succeeds in fighting the host immune system. A well-developed mathematical model of the PhoP-PhoR two-component system is also presented and explained, to demonstrate the mode of molecular regulation by the pathogen.
The bacterial two-component system (TCS) is a signal transduction system that senses environmental stimuli and responds accordingly. It consists of two regulatory proteins: one functions as a histidine kinase (HK) and the other as a response regulator (RR) in the signalling cascade. Mycobacterium tuberculosis has eleven two-component systems controlling the expression of genes that are critically involved in virulence, pathogenicity and survival. Studies have demonstrated that PhoPR-TCS is one of the eleven TCSs particularly involved in the virulence of the pathogen. PhoPR-TCS is a positive regulator of many genes, including those encoding the biosynthesis of lipids such as sulphatides (SL), diacyltrehalose (DAT) and polyacyltrehalose (PAT). These lipid components contribute to the virulence of M. tuberculosis. Studies have corroborated that pks2 is responsible for the biosynthesis of SL, and msl3 for that of DAT and PAT. The expression of these lipid biosynthesis genes is regulated by PhoP in association with the autokinase activity of PhoR.
In the case of the mycobacterial PhoPR TCS, Mg2+ ions have not been substantially proven to be a stimulating factor for PhoR. The model was simulated in MATLAB using the RK-4 method (the fourth-order Runge-Kutta method for differential equations). The behaviour of the TCS was found to be robust at all concentrations of Mg2+ ions. This finding can be applied during the development of drugs against tuberculosis, as it indicates which gene or protein is highly sensitive to its stimuli.
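As an illustration of the RK-4 scheme mentioned above, the following shell sketch integrates a toy first-order decay equation, dy/dt = -rate*y, using awk for the arithmetic. It is not the PhoP-PhoR model from the cited work, which was simulated in MATLAB; the function name rk4_decay, the toy equation and all parameter values are assumptions made only for illustration.

```shell
# rk4_decay RATE Y0 H N
# Integrate the toy equation dy/dt = -RATE*y from t = 0, starting at Y0,
# with step size H for N steps, using the classical fourth-order
# Runge-Kutta scheme. Prints the final value of y.
rk4_decay() {
    awk -v rate="$1" -v y="$2" -v h="$3" -v n="$4" '
        # Right-hand side of the toy differential equation.
        function f(v) { return -rate * v }
        BEGIN {
            for (i = 0; i < n; i++) {
                k1 = f(y)
                k2 = f(y + h * k1 / 2)
                k3 = f(y + h * k2 / 2)
                k4 = f(y + h * k3)
                # Weighted average of the four slope estimates.
                y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
            }
            printf "%.6f\n", y
        }'
}

# Example call (uncomment to run): rate 1, y(0) = 1, step 0.1, 10 steps
# rk4_decay 1 1 0.1 10
```

For this toy equation the exact solution at t = 1 is e^-1 (about 0.3679), so the printed value should lie close to that; a real TCS model would replace f with the coupled equations for the proteins involved.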
Fig: General presentation of the model, depicting the feedback mechanism of the system
The regulation of the TCS is affected by Mg2+ ions to a considerable extent, as shown by fluctuations in the levels of the PhoP and PhoR proteins. The ions have both positive and negative effects on the TCS. The results showed that important genes remain activated even after the ions are removed from the surrounding medium. Targeting ion influx and efflux would therefore be of no use in developing a drug against the pathogen. The model can be tested further with more simulations at varying ion concentrations. Since the TCS regulates genes that are directly involved in the pathogenicity and survival of Mycobacterium tuberculosis, understanding the nature and behaviour of each individual protein will provide insight into finding novel drug targets against tuberculosis. The simulation in this work represented the mechanism of gene regulation and its sensitivity towards the stimulus, and provided an understanding of how to proceed when targeting a molecule or protein in any other two-component system of the pathogen.
Reference source: Fozail Ahmad & Ravins Dohare*, Assessing Effect of Mg ion on PhoP-PhoR two component systems of Mycobacterium tuberculosis through Development of Mathematical Model, Int. Journal of Science and Research, 4(7), 2285-2289, Paper ID: SUB 156569
Cl-Dash: speeding up cloud computing in bioinformatics
Muniba Faiza
Image Credit: Google Images
“Cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, software that stores huge amounts of data and provides easy access to that data in relatively little time.”
After a lot of work in the field of bioinformatics, the genomes of many living organisms have been sequenced, and a lot of information has been generated at the RNA and protein levels. This has given rise to huge amounts of biological data whose storage is now an issue, because such an enormous volume of data cannot be stored on a personal computer or a local server. For this purpose cloud computing, the practice of managing and processing data on remote servers hosted on the internet, has been introduced in bioinformatics, though the origin of cloud computing is not very clear.
Cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, software that stores huge amounts of data and provides easy access to that data in relatively little time. The tool was developed by Paul Hodor, Amandeep Chawla, Andrew Clark and Lauren Neal of Booz Allen Hamilton, USA.
cl-dash is a starter kit that configures and deploys new Hadoop clusters in a few minutes. The clusters are provisioned on AWS (Amazon Web Services).
According to a paper published in Bioinformatics (Nov, 2015), cl-dash is based on the Hadoop distributed file system and the MapReduce programming pattern. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware. With cl-dash, a user acting as an ‘admin’ can create clusters (groups of nodes that store huge amounts of data) through a set of command-line tools whose names begin with ‘cl-’ (hence the name ‘cl-dash’). A YAML configuration file (config.yml) is required; with it, a new cluster can be created in minutes. Once the Hadoop cluster is formed, the user can easily access the data.
Such tools are needed because biological data keeps growing, and with it the demand for large data storage space. cl-dash provides a good way to manage such huge volumes of data.
NOTE:
An exhaustive list of references for this article is available from the author on personal request. For more details write
m.
Installing Gromacs on Ubuntu for MD Simulation
Tariq Abdullah
Image Credit: Google Images
“For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies.”
In bioinformatics, GROMACS is one of the most popular molecular dynamics simulation packages, with a wealth of built-in features. Installing GROMACS version 5.x.x+ on Ubuntu can be a tedious and cumbersome process, especially if you are just starting out. For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies. In addition, the installation instructions for version 5+ on the GROMACS website do not seem to work at the first attempt.
In this quick tutorial, I will show you how to install GROMACS on Ubuntu 14.04 LTS. It is expected to work on any version of Ubuntu. Post in the comments if you face any problem. I will also explain the meaning of the different commands along the way. To install GROMACS 5+, log into your Ubuntu system and open a terminal by pressing Ctrl+Alt+T together.
You need a good internet connection, as we will have to download various dependencies during the installation process. To install GROMACS, we need the following software on our system:
1. A C and C++ compiler (gcc and g++): build-essential, below, installs them if they are missing
2. CMake: a tool that generates the build files used to compile the binaries
3. build-essential: a meta-package that pulls in the packages needed to compile software
4. FFTW library: a library used by GROMACS to compute discrete Fourier transforms
5. Regression tests package
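A quick way to see which of these tools are already present is sketched below. The function name check_tools is hypothetical, and it only looks for commands on the PATH (for example gcc, g++, cmake and make); it does not detect the FFTW library or the regression tests package.

```shell
# check_tools TOOL...
# Print "missing: NAME" for every command that is not found on the PATH.
# Returns 0 if all commands were found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (uncomment to run):
# check_tools gcc g++ cmake make
```

Anything reported as missing should be installed by the apt-get steps that follow.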
Getting Started
If you have freshly installed Ubuntu, don’t forget to update the repository information and software packages on your system. Press Ctrl+Alt+T and a terminal will open up. In the terminal, type:
sudo apt-get update
sudo apt-get upgrade
Installation
The first step in installing GROMACS is to get cmake. In the terminal, type:
sudo apt-get install cmake
If asked “After this operation, 16.5 MB of additional disk space will be used. Do you want to continue?”, press y and then press Enter.
When the download and installation finish, you can check the version of cmake with the following command:
cmake --version
Next we need to install build-essential with this command:
sudo apt-get install build-essential
Before we go any further, it is good to know the path of our current working directory. In the terminal, type:
pwd
Note down the path it shows; it is very important and will be used during the actual GROMACS installation.
Now that we have cmake in place and we know the working directory, it’s time to download the regression tests package. It is possible to download this package automatically during installation, but most of the time it throws an error stating that the location of the file has changed, so let us do it the hard way to avoid any problem during installation. Copy and paste the following commands into your terminal (right-click to paste, or press Ctrl+Shift+V). They download the file and save it in your Downloads folder.
cd Downloads/
wget http://gerrit.gromacs.org/download/regressiontests-5.1.1.tar.gz
We now have the regression tests package in our Downloads folder as a compressed tar.gz archive. Let us extract it with:
tar xvzf regressiontests-5.1.1.tar.gz
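If extraction fails, the download may have been interrupted. As a minimal sketch, a downloaded archive can be checked by asking tar to list its contents without unpacking anything; the helper name archive_ok is hypothetical.

```shell
# archive_ok FILE
# Return 0 if tar can list the contents of the gzip-compressed archive
# FILE (so it is not truncated or corrupt), non-zero otherwise.
archive_ok() {
    tar tzf "$1" >/dev/null 2>&1
}

# Example call (uncomment to run):
# archive_ok regressiontests-5.1.1.tar.gz && echo "archive looks fine"
```

The same check can be applied to any other tar.gz archive downloaded in this tutorial.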
Now we need the FFTW Fourier transform library on our system. You can download it from fftw.org or install it from the repository with the following command:
sudo apt-get install libfftw3-dev
Okay, let us now download GROMACS 5.1.1 with this command (alternatively, you can download the latest version from the GROMACS website):
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.1.1.tar.gz
Now extract the GROMACS archive:
tar xvzf gromacs-5.1.1.tar.gz
Now move inside the GROMACS folder:
cd gromacs-5.1.1/
Create a directory called “build” where we will keep our compiled binaries:
mkdir build
Move inside the build directory:
cd build
It’s time to configure the GROMACS build. Replace “pwdpath” in the following command with the path of the working directory that you noted earlier:
cmake .. -DGMX_BUILD_OWN_FFTW=OFF -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_C_COMPILER=gcc -DREGRESSIONTEST_PATH=pwdpath/Downloads/regressiontests-5.1.1
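To avoid typing the path by hand, the options of the command above can be generated with the path filled in from a variable. This is only a sketch: the function name gmx_cmake_args is hypothetical, and it assumes the regression tests were extracted under Downloads in the given directory, as in the earlier steps.

```shell
# gmx_cmake_args [BASE_DIR]
# Print the cmake options used above, one per line, with "pwdpath"
# replaced by BASE_DIR (defaults to $HOME when no argument is given).
gmx_cmake_args() {
    base="${1:-$HOME}"
    printf '%s\n' \
        "-DGMX_BUILD_OWN_FFTW=OFF" \
        "-DREGRESSIONTEST_DOWNLOAD=OFF" \
        "-DCMAKE_C_COMPILER=gcc" \
        "-DREGRESSIONTEST_PATH=${base}/Downloads/regressiontests-5.1.1"
}

# Example call from inside the build directory (uncomment to run).
# The unquoted $(...) relies on word splitting, so it only works when
# the path contains no spaces.
# cmake .. $(gmx_cmake_args "$HOME")
```

Printing the options first, before passing them to cmake, makes it easy to confirm that the regression tests path is the one you expect.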
If everything goes well, the message in your terminal will say “Generating Done. Build files written…”. If not, make sure you have replaced pwdpath in the command with the path of your home directory. If you have forgotten it, just open another terminal and type pwd.
Now let’s first build and check, and then install the real thing:
make check
sudo make install
This may take some time, depending on your configuration. After completion, run:
source /usr/local/gromacs/bin/GMXRC
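If the install location differs, the source command fails with a "no such file" error. A small guard, sketched below, checks for the file before sourcing it; the function name load_gromacs is hypothetical, and the default path is the one used above.

```shell
# load_gromacs [GMXRC_PATH]
# Source the GROMACS environment script if it exists; otherwise print a
# message to stderr and return 1. Defaults to the path used above.
load_gromacs() {
    rc="${1:-/usr/local/gromacs/bin/GMXRC}"
    if [ -f "$rc" ]; then
        . "$rc"
    else
        echo "GMXRC not found at $rc" >&2
        return 1
    fi
}

# Example call (uncomment to run):
# load_gromacs && echo "GROMACS environment loaded"
```

The environment is set only for the current terminal session, so the call has to be repeated in each new terminal.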
After a successful installation, you can check the version of your GROMACS with the following command to make sure the installation finished as expected:
gmx pdb2gmx --version
***
GenomeD3 plot: Easy visualization of genomes
Muniba Faiza
Image Credit: Stock Images
“GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3 (Data-Driven Documents), a library used to produce dynamic, interactive data visualizations in web browsers.”
Sequencing genomes is important, and it is equally important to visualize them. Some tools exist for visualizing genomes, but they are static and standalone, and quite complex to install and use. Newer, more interactive tools with new features are required to ease the visualization of genomes. GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3 (Data-Driven Documents), a library used to produce dynamic, interactive data visualizations in web browsers. GenomeD3Plot is very user-friendly: it allows the user to interact with the data, alter the view dynamically, and easily resize or reposition the visualization in the browser.
The goal of Matthew R. Laird was to create a library with minimal external dependencies that could be integrated into existing web applications just as a developer might include an image or a table. GenomeD3Plot uses JSON for configuration, a standardized and well-supported data format that reduces the complexity of use and helps provide better visualization. The image is created in SVG format and can easily be exported to PNG format as required.
Fig.1 GenomeD3Plot circular and linear visualization of an example genome with annotation data
With GenomeD3Plot, the genome can be viewed in different tracks, for example to view a specific base pair or a series of base pairs, to visualize GC content, and more. GenomeD3Plot provides a rich API (application programming interface, which specifies how software components should interact) to manipulate the visualization dynamically.
A linear and a circular plot can also be tied together so that manipulating one, such as zooming or changing the visible region of the genome, causes a mirrored alteration in the other. A specific region can be recentred to bring it into focus. Many other features have been introduced in GenomeD3Plot for easy visualization and interpretation of genomes.
Note:
An exhaustive list of references for this article is available from the author on personal request. For more details write to