bioinformatics review - november 2015 issue


Upload: bioinformatics-review

Post on 26-Jul-2016

DESCRIPTION

The Digital issue of Bioinformatics Review, November 2015

TRANSCRIPT

Page 1: BIOINFORMATICS REVIEW - NOVEMBER 2015 ISSUE

NOVEMBER 2015 VOL 1 ISSUE 2

Explained:

CRISPR-ERA and

CRISPR/Cas9 system

How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

“A cell is regarded as the

true biological atom.”

- George Henry Lewes


Contents

November 2015

Topics

Editorial

Systems Biology
Cancer: From the Eyes of Mathematical and Systems Biology
Introduction to Mathematical Modelling Part-3
Introduction to Mathematical Modelling (Last Part)
Explore Tuberculosis: A Systems Biology Approach

Software
IBS: Modifying the organization of biological sequences diagrammatically

Tools
Explained: CRISPR-ERA and CRISPR/Cas9 system
Installing Gromacs on Ubuntu for MD Simulation
Cl-Dash- Speeding up Cloud Computing in Bioinformatics

Sequence Analysis
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

News
Structural Identification of Macromolecules in Solution with DARA Webserver

Genomics
GenomeD3 plot : Easy visualization of genomes


CHIEF EDITOR

Dr. Prashant Pant

EDITORIAL BOARD

SECTION EDITORS

ALTAF ABDUL KALAM MANISH KUMAR MISHRA

SANJAY KUMAR PRAKASH JHA NABAJIT DAS

REPRINTS AND PERMISSIONS

You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to [email protected]. Please include contact details in your message.

BACK ISSUE

Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT

PHONE +91. 991 1942-428 / 852 7572-667

MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS To contact any Bioinformatics Review staff member, simply format the address as [email protected]

PUBLICATION INFORMATION

Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) trust (registered under the Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and is used under licence by SEWA trust. Published in India.

EXECUTIVE EDITOR FOUNDING EDITOR

FOZAIL AHMAD MUNIBA FAIZA


EDITORIAL

With speculations about the future in mind, moving ahead slowly and steadily is not only an option but also wisdom. BiR, in its second month, moves ahead with a similar philosophy. This month’s highlight is BiR’s very first public showcasing and representation to the scientific community, at an international conference on a newly emerging field, the importance of soil microbes as drivers of various processes, to be held in Prague, Czech Republic (EU). This has been a research area of immense interest to me, and I would like to share a few things about it. Soil microbial diversity was long regarded as less worthy of study than other forms of life, until it was discovered quite recently that more than 95% of the microbial diversity in any environmental sample is unknown and uncultured, and holds huge biotechnological, medical, and agronomical potential. This kick-started a new branch of genomics and bioinformatics, now popularly known as metagenomics, which deals with community genomes from environmental samples. Metagenomics uses the tools and techniques of genomics, along with computational biology, to handle the large data sets derived from multiple genomes using next generation sequencing (NGS) technologies. It is one of the sources of Big Data coming from molecular biologists. Even today, the primary concern is to sequence DNA from environmental samples and correlate the metagenomic data with its probable functions, as opposed to conventional culture-based approaches. It was for these reasons that we chose this international platform to introduce BiR to the world’s scientific community and to showcase it as an excellent vector for propagating scientific news and developments. With these slow and small efforts, we hope for a steady metamorphosis of BiR into a known standard for scientific reports and news.

Dr. Prashant Pant

Editor-in-Chief

Letters and responses:

[email protected]


Bioinformatics Review | 6

Cancer: From the Eyes of Mathematical Biology

Sanjay Kumar

The month of November has just arrived with its first glimpse of winter. We welcome this month with an evergreen and hot topic: cancer research. This time we intend to introduce you to an old research topic with a new vision.

Cancer, an ailment with no fully reliable remedy, has been pursued as a career by many researchers. A cell biologist says it is an uncontrolled proliferation (increase in number by division and growth) of cells; molecular biologists call it the result of mutant varieties of some biomolecules that force a cell into such an uncontrolled cell division cycle. But how does a systems biologist see this kind of problem? Let us try to approach it in a different way.

When proteins are not assigned a name based on their function or structure, scientists label them according to their molecular weight, e.g. p53, p200, p19. Scientists have reported abnormally high expression of the p53 protein in cancerous cells and tissues. p53 controls the other proteins that regulate the cell cycle, in which a cell normally divides into two. p53 also promotes the production of its own inhibitor, the Mdm2 protein. A mutation in p53 that prevents it from recognizing abnormalities means that damage does not lead to an increase in p53 and, consequently, in Mdm2, p21 and other p53-regulated proteins. Thus the division of abnormal cells continues indefinitely and causes cancer.

From a mathematical biology perspective, systems biologists write down ordinary differential equations (ODEs). These mathematical formulae are nothing other than representations of the chemical reactions, and combinations of reactions, occurring inside a cell. Our previous blogs (by Fozail Ahmad) described how to combine chemical reactions in the form of ODEs, and how we use zero-order chemical kinetics (the reaction rate does not depend on the concentration of any participating chemical), first-order chemical kinetics (the reaction rate depends on the concentration of only one participating chemical) and second-order chemical kinetics (the reaction rate depends on the concentrations of two participating chemicals, or on the square of one) to form the equations. In addition, I would like to mention that some reactions occur with the help of biomolecular machines. These machines (enzymes) help the reactions to occur but are not consumed by them, and thus the reactions follow a different form of kinetics, described in 1913 in the combined work of the German biochemist Leonor Michaelis and the Canadian physician Maud Menten.
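The rate laws named above can be sketched in a few lines of Python. This is only an illustration: the function names, the symbols k, vmax and km, and the example numbers are assumptions chosen here, not values from the article or from the cited paper.

```python
def zero_order_rate(k):
    """Zero-order kinetics: the rate is constant, independent of any concentration."""
    return k


def first_order_rate(k, a):
    """First-order kinetics: the rate is proportional to one concentration."""
    return k * a


def second_order_rate(k, a, b):
    """Second-order kinetics: the rate is proportional to the product of two concentrations."""
    return k * a * b


def michaelis_menten_rate(vmax, km, s):
    """Michaelis-Menten kinetics for an enzyme-catalysed reaction.

    vmax is the maximal rate, km the substrate concentration at which the
    rate is half of vmax, and s the substrate concentration.
    """
    return vmax * s / (km + s)


if __name__ == "__main__":
    # At s == km the Michaelis-Menten rate is half of vmax.
    print(michaelis_menten_rate(vmax=10.0, km=2.0, s=2.0))
```

In an ODE model, each of these expressions would appear as a term on the right-hand side of the equation for the species being produced or consumed.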

So, in a normal cell, when p53 senses danger it signals the cell by increasing p21, which combines with PCNA (Proliferating Cell Nuclear Antigen, a protein that helps in cell division) and stops cell division. This type of cell cycle regulation is shown in one of the accompanying diagrams, and the mutated case, in which p53 cannot sense the cellular damage and the cell therefore continues to divide as normal, is shown in another.

We have also included a combined picture that shows what the different stages of mathematical biology look like. These figures contrast cancer cells with normal cells.

Reference: Alam MJ, Kumar S, Singh V, Singh RKB (2015) Bifurcation in Cell Cycle Dynamics Regulated by p53. PLoS ONE 10(6): e0129620. doi:10.1371/journal.pone.0129620

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129620


Introduction to Mathematical Modelling (Part 3 of 3)

Fozail Ahmad

Derivation of Mathematical Equations for Understanding Systems Behaviour:

Depending on the nature of the biological process, it is essential to understand different modeling approaches, as a number of methods have been used for different biological systems. Functionally, most cellular processes are dynamic and change with the environment, such as the signaling or regulation of specific genes when a cell is exposed to an unusual medium. To describe such time-dependent phenomena, it is necessary to choose mathematical equations that can capture these dynamic effects. In other biological systems, where cellular products or molecules do not change over time (i.e., concentrations remain the same), it is not necessary to describe the details of the underlying dynamics. Suitable methods have been developed for modeling system behavior. Two of them are commonly used: modeling of metabolic processes, and modeling of signaling and regulatory pathways.

1. Modeling Metabolic Process

Metabolism is an essential process in all living beings. It provides energy and building blocks for survival, synthesizes larger molecules and degrades unnecessary or toxic substances in a cell. Understanding metabolic mechanisms has been a major research interest for decades, but the complete interplay of the underlying mechanisms is not yet understood.

One of the key parameters in any metabolic study is the metabolic flux, that is, the utilization (conversion) of metabolites along metabolic pathways. It is therefore important to understand and predict the metabolic flux for all patterns of metabolism, which indicates which biochemical routes are being utilized. Here a model is fitted to a hypothetical framework, or to a known biochemical route, to identify any particular step in the production or degradation of a desired molecule by cultured cells or bacteria that limits the overall rate of the process (a metabolic bottleneck). The result of the study then guides researchers on how to genetically modify the cell or bacterium to optimize the yield of the particular end product.


In the overall process, changes in metabolic flux are of little concern when the biochemical processes operate at steady state and the entry of additional molecules is completely blocked, which leaves the process quasi-stationary, with no external perturbation.

For such a (quasi-stationary) metabolic process, we may consider the conversion of sugar to sugar phosphate. In this process, the enzyme hexokinase adds a phosphate group to glucose (C6H12O6), yielding the compound glucose-6-phosphate. The reaction should be balanced in terms of atoms and electrical charges. In chemical notation, the balanced reaction is written as

$\mathrm{C_6H_{12}O_6 + ATP^{4-} \rightarrow C_6H_{11}O_6PO_3^{2-} + ADP^{3-} + H^{+}}$

In this reaction, both sides of the equation are in stoichiometric balance. When investigating a more complex metabolic network, each individual chemical reaction is bound by stoichiometric balance constraints, such as mass, number of molecules, concentration and the charges on reactants, that can be used to formulate mathematical equations. Keep in mind that these reaction constraints are not independent of each other and should be solved in parallel (simultaneously) to develop a reliable mathematical model. The validity of the models can be tested through wet lab techniques using detectable or radioactively labeled substances. Labeled atoms can be traced across a number of key metabolites, and the resulting cellular flux distribution helps to validate or disprove the metabolic network model.
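The stoichiometric balance of the hexokinase reaction can be checked mechanically. The sketch below is a minimal illustration, not part of the original article. It assumes the fully deprotonated forms ATP(4-) and ADP(3-), with the elemental formulas written out by hand, and the dictionary and function names are chosen here for the example.

```python
from collections import Counter

# Each species maps to (atom counts, net charge).
# Assumption: ATP and ADP are taken in their fully deprotonated forms.
SPECIES = {
    "glucose": ({"C": 6, "H": 12, "O": 6}, 0),
    "ATP": ({"C": 10, "H": 12, "N": 5, "O": 13, "P": 3}, -4),
    "G6P": ({"C": 6, "H": 11, "O": 9, "P": 1}, -2),  # C6H11O6PO3 (2-)
    "ADP": ({"C": 10, "H": 12, "N": 5, "O": 10, "P": 2}, -3),
    "H+": ({"H": 1}, 1),
}


def totals(side):
    """Sum atoms and charge over one side of a reaction.

    `side` maps species name -> stoichiometric coefficient.
    """
    atoms = Counter()
    charge = 0
    for name, coeff in side.items():
        formula, z = SPECIES[name]
        for element, count in formula.items():
            atoms[element] += coeff * count
        charge += coeff * z
    return atoms, charge


def is_balanced(reactants, products):
    """True if both sides carry the same atoms and the same net charge."""
    return totals(reactants) == totals(products)


if __name__ == "__main__":
    reactants = {"glucose": 1, "ATP": 1}
    products = {"G6P": 1, "ADP": 1, "H+": 1}
    print(is_balanced(reactants, products))
```

For a whole metabolic network, the same bookkeeping is done for every reaction at once, which is why the constraints have to be solved together rather than one by one.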


Explained: CRISPR-ERA and CRISPR/Cas9 system

Tariq Abdullah

The CRISPR/Cas9 system is a bacterial defence mechanism against bacteriophage infection. When viral DNA (from a bacteriophage, in this case) integrates into the bacterial genome, it produces RNA, which is taken up by Cas9. Cas9 and the RNA together drift through the cell, and as soon as they encounter a sequence complementary to the RNA, the complex attaches to it. Cas9 then cuts the DNA at that site. Because the viral DNA is cut, the virus is prevented from multiplying. Thus the bacterium defends itself by precisely snipping out the viral DNA from its genome using the CRISPR/Cas9 system.

The recent application of the CRISPR/Cas9 system to gene editing in humans, animals and bacteria has led to a lot of interesting research in this area. It requires designing an sgRNA (single guide RNA), which is a challenging process. CRISPR-ERA was developed to solve this problem.

So what is CRISPR-ERA?

CRISPR-ERA is a new tool, available at http://crispr-era.stanford.edu, developed by Honglei Liu et al. The name is an acronym for Clustered Regularly Interspaced Short Palindromic Repeat-mediated Editing, Repression, and Activation.

What does CRISPR-ERA do?

According to the authors of CRISPR-ERA:

“The major goal of our designer tool is to address the discrepancy for designing sgRNAs that allow efficient and highly specific repression or activation of genes and for generating genome-wide sgRNA libraries for genetic screening in different organisms.”

Bioinformatics, 31(22), 2015, 3676–3678, doi: 10.1093/bioinformatics/btv423

How does CRISPR-ERA work?

CRISPR-ERA looks up all targetable sites for each target gene, that is, all matches to the pattern N20NGG (N = any nucleotide). It then calculates an E-score and an S-score for each site.


1. E-score is the efficacy score, based on sequence features such as GC content (%GC), the presence of poly-thymidine and location information.

2. S-score is the specificity score, based on genome-wide off-target binding sites. For each sgRNA design, Bowtie is used to find genome-wide sequences that contain an adjacent NRG (R = A or G) protospacer adjacent motif (PAM) site and zero, one, two, or three mismatches to the sequence complementary to the sgRNA; these are regarded as off-target binding sites. The penalty score for an NAG off-target is smaller than for an NGG off-target.

The sgRNAs are finally ranked by the sum of E-score and S-score, and the results are presented according to these scores.
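The site search and two of the sequence features mentioned above can be sketched as follows. This is an illustrative sketch only: it scans just the forward strand, the poly-thymidine run length of four is an assumption, and the functions do not reproduce CRISPR-ERA's actual E-score or S-score formulas, which are not given in this article.

```python
import re

# N20NGG on the forward strand: 20 nucleotides followed by a PAM of the form NGG.
# The lookahead lets overlapping sites be reported as well.
SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_sites(seq):
    """Return (start, protospacer, pam) for every N20NGG match in seq."""
    seq = seq.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in SITE.finditer(seq)]


def gc_content(protospacer):
    """Percentage of G and C bases in the protospacer."""
    return 100.0 * sum(base in "GC" for base in protospacer) / len(protospacer)


def has_poly_t(protospacer, run=4):
    """True if the protospacer contains a run of `run` or more thymidines.

    The default run length is a hypothetical threshold for illustration.
    """
    return "T" * run in protospacer


if __name__ == "__main__":
    example = "ACGTACGTACGTACGTACGTTGGAAA"
    for start, protospacer, pam in find_sites(example):
        print(start, protospacer, pam, gc_content(protospacer), has_poly_t(protospacer))
```

A real design tool would also scan the reverse complement and then search the whole genome for near-matches, which is the step the article attributes to Bowtie.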

References & Further Reading

http://gizmodo.com/everything-you-need-to-know-about-crispr-the-new-tool-1702114381

Bioinformatics (2015) 31 (22): 3676-3678. doi:10.1093/bioinformatics/btv423


Structural Identification of Macromolecules in solution with DARA web server

Muniba Faiza

To study macromolecules in homogeneous solution, a technique known as SAXS (Small Angle X-ray Scattering) is used, in which the observed scattering patterns are used to determine the structure of macromolecules such as proteins, nucleic acids and protein:nucleic acid complexes. In this experiment, a monochromatic X-ray beam illuminates the homogeneous solution, producing a scattering pattern from which an ab initio particle shape can be generated. This model is compared with the available theoretical data. Comparing the experimental scattering pattern with known scattering data is useful in determining the structure. If the experimental data match one or more known scattering patterns, the comparison may provide detailed information about the quaternary and tertiary structure.

DARA is a web server which initially “computes the scattering profiles from the available structures / models in PDB (Protein Data Bank) and compares these profiles with a given SAXS pattern.” The server is very fast: it compares more than 150,000 profiles within a few seconds, and it covers almost all the models available in PDB.

How DARA works

DARA implements a new search algorithm, consisting of principal component analysis and k-d trees, for rapid identification of scattering neighbours, including nucleic acids and complexes.
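The combination of principal component analysis and a k-d tree can be illustrated with NumPy and SciPy. The sketch below is not DARA's implementation. The profile library is random synthetic data standing in for precomputed scattering curves, and the number of components, the library size and the helper names are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial import cKDTree


def fit_pca(profiles, n_components):
    """Return the mean profile and the leading principal axes (rows)."""
    mean = profiles.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(profiles - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(profiles, mean, components):
    """Project profiles onto the principal axes."""
    return (profiles - mean) @ components.T


# Synthetic stand-in for a library of precomputed scattering profiles:
# 1000 curves, each sampled at 50 points.
rng = np.random.default_rng(0)
library = rng.random((1000, 50))

# Reduce every curve to a few coordinates, then index them in a k-d tree.
mean, components = fit_pca(library, n_components=5)
tree = cKDTree(project(library, mean, components))

# A query curve: library entry 42 with a small amount of noise added.
query = library[42] + 0.001 * rng.standard_normal(50)
distances, indices = tree.query(project(query[None, :], mean, components), k=3)
print(indices[0], distances[0])
```

The design idea is that the expensive step (projecting and indexing the whole library) is done once, so each new query only needs one projection and one tree lookup.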

SAXS data:

For each entry in PDB, all biological assemblies are retrieved; for NMR entries, only the first model is considered. The data are represented in the form of curves. The theoretical scattering curves are computed with the software CRYSOL 2.8, over a range sufficient to cover models with a maximum intra-particle distance Dmax of up to 800 Å. For each model, CRYSOL calculates its Dmax, radius of gyration (Rg), molecular weight (MW) and the excluded volume of the hydrated particle (V). For proteins, secondary structure content is computed as the percentage of alpha helices and beta sheets.
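Two of the parameters listed above, Rg and Dmax, have simple geometric definitions that can be sketched directly from atomic coordinates. This is a simplified illustration under stated assumptions: all atoms are given equal weight and the hydration shell is ignored, so the values will differ from what CRYSOL reports. The function names are chosen here for the example.

```python
import math
from itertools import combinations


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid.

    Assumes equal weights for all atoms (no mass or electron weighting).
    """
    n = len(coords)
    centre = [sum(c[i] for c in coords) / n for i in range(3)]
    return math.sqrt(sum(math.dist(c, centre) ** 2 for c in coords) / n)


def max_distance(coords):
    """Largest distance between any two points (Dmax); needs at least two points."""
    return max(math.dist(a, b) for a, b in combinations(coords, 2))


if __name__ == "__main__":
    # Four points at the corners of a 10 x 10 square, in angstroms.
    atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (10.0, 10.0, 0.0)]
    print(radius_of_gyration(atoms), max_distance(atoms))
```

The pairwise loop in `max_distance` is quadratic in the number of atoms, which is fine for a sketch but would be replaced by something faster for large structures.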

DARA computes various parameters and returns its output almost instantaneously. It reports almost 100 neighbours of the query macromolecule, ranked so that the best-fitting curves come first. Fig 1 shows the best structures obtained by calculation and comparison over the various parameters considered. The results can also be downloaded.

DARA offers a rapid and easy way to analyze and identify macromolecules in solution, which is otherwise a difficult process. It can be accessed at http://www.embl-hamburg.de/biosaxs/dara.html

Reference:

Alexey G. Kikhney, Alejandro Panjkovich, Anna V. Sokolova and Dmitri I. Svergun. DARA: a web server for rapid search of structural neighbours using solution small angle X-ray scattering data.

Fig 1 Top three nearest neighbors for experimental SAXS data collected from glucose isomerase in a phosphate buffer.


Introduction to Mathematical Modelling (Last Part)

Fozail Ahmad

In the previous section, mathematical modeling was exemplified by a metabolic process and its biochemical regulation. It can also be applied to signalling pathways and genetic regulatory processes. In every cellular phase, one observes the state of a cell changing under the effect of environmental factors. It is quite difficult for a cell to maintain its functions and reach a steady state. Thus, one needs to fix a range of parameters for all molecular reactions when building a mathematical model.

Identification of Model Parameters:

Parameters for any equation in a model describe certain biochemical features of the components involved in the reactions or pathways under study. For example, when modeling the dynamics of a metabolic network, the mathematical equations inferred from the processes must contain parameters that represent the kinetic features of the metabolic enzymes involved, such as the number of reactions an enzyme can perform within a given period of time (i.e., the rate constant). We must know these kinetic parameters before setting up a well-defined system of differential equations. Ideally, the kinetic parameters for all the relevant reaction components would be determined experimentally. In practice, however, a number of kinetic parameters, even for otherwise well-investigated biological components (enzymes, proteins and hormones), are still not known, primarily because reliable experimental data are lacking. Very often the kinetic parameters have been measured in the wet lab (i.e., in vitro), but it has not been validated that the enzymes behave similarly inside a cell. This hurdle is overcome by measuring the overall dynamics of the system being studied.

Computational procedures have made this easier by providing estimation techniques that optimize parameter values: different parameter sets are tried until the model fits the available experimental dataset. This method is critically dependent on the quality of the dataset, and predictions made from unreliable data will lead to unreliable parameters and to a limited model of no use. The development of even simple network models for biological processes lags far behind, mainly because high-quality experimental data are unavailable; obtaining such data is still a major focus in the field of systems biology.
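The idea of trying parameter sets until the model fits the data can be shown with a deliberately small example. The sketch below assumes a single first-order decay, A(t) = A0 * exp(-k * t), and searches a grid of candidate rate constants for the one with the smallest sum of squared errors. The model, the grid and the function names are assumptions for illustration; real studies use more capable optimizers and far larger models.

```python
import math


def simulate(a0, k, times):
    """Concentration over time for first-order decay with rate constant k."""
    return [a0 * math.exp(-k * t) for t in times]


def sse(observed, predicted):
    """Sum of squared errors between data and model output."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def fit_rate_constant(times, observed, a0, candidates):
    """Return the candidate k whose simulation is closest to the data."""
    return min(candidates, key=lambda k: sse(observed, simulate(a0, k, times)))


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0]
    # Synthetic "measurements" generated with k = 0.5, standing in for real data.
    observed = simulate(10.0, 0.5, times)
    candidates = [i / 100 for i in range(1, 201)]  # 0.01 ... 2.00
    print(fit_rate_constant(times, observed, 10.0, candidates))
```

Because the fit is only as good as the measurements, noisy or sparse data would move the selected value of k away from the true one, which is the dependence on data quality described above.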

It is important to mention that some biological processes cannot be described using simple models that are based only on the concentrations of molecules and that ignore the individual components, because molecular movements have a significant impact on cellular mechanisms. Because molecules are closely packed in a cell, their thermal motion and random movement under changing environmental conditions may initiate a signal that propagates across the cell and stops only when it reaches its target. To account for such random effects, a stochastic component must be incorporated into the equations of the model. Signaling molecules for which little effect is observed can be neglected, whereas molecules that exist in only a few copies in the cell must not be neglected and should be integrated into the mathematical equations.
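One common way to add a stochastic component is the Gillespie stochastic simulation algorithm, which is swapped in here as an example since the article does not name a specific method. The sketch simulates a molecule that is produced at a constant rate and degraded in proportion to its copy number. The reaction scheme, rate values and function name are assumptions for illustration.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Simulate production (rate k_prod) and degradation (rate k_deg * n).

    Returns a list of (time, copy_number) pairs, starting at (0.0, n0).
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        production = k_prod
        degradation = k_deg * n
        total = production + degradation
        if total == 0:
            break  # nothing can happen any more
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its rate.
        if rng.random() * total < production:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


if __name__ == "__main__":
    for time, copies in gillespie_birth_death(2.0, 0.1, 5, 10.0)[:10]:
        print(round(time, 3), copies)
```

With only a handful of copies present, each run with a different seed gives a visibly different trajectory, which is exactly the behaviour a concentration-only ODE model averages away.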

The next issue, after the optimization of parameters, is the different time and distance scales of the components integrated into a pathway. For example, metabolism occurs within seconds or minutes, whereas genetic regulation takes longer (hours or even days) to exert its effect or to express a particular gene induced by metabolic processes at a greater distance. Signals (enzymes, proteins, hormones) may have to travel long distances, across the cell membrane and through the circulatory system of body fluids between tissues. To handle these different length and time scales, we can use a multi-scale model and so keep the complexity of the system manageable.

Finally, it is important to keep in mind that a developed model is only as good as the assumptions upon which it is based.


IBS: Modifying the organization of biological sequences diagrammatically

Muniba Faiza

Many a time, we need to visualize and summarize the existing information about biological sequences such as proteins or DNA. For this purpose, a new software package has been introduced, called Illustrator of Biological Sequences (IBS), which represents the organization of protein or nucleotide sequences in an easy, efficient and precise manner. It visualizes various functional elements. IBS provides features such as diagramming of domains and motifs, rescaling, coloring and many more. The standalone packages of IBS are implemented in Java and support three major operating systems: Windows, Linux and Mac OS.

Key Features:

- Annotation of both protein and nucleotide sequences is supported through a variety of drawing elements.
- Better color visualization.
- An ‘export module’ lets the final artwork be exported as a publication-quality figure.
- A user-friendly interface.
- Various built-in textures make it possible to color black-and-white diagrams as required.
- Easy retrieval of UniProt annotations.

IBS provides separate modes for proteins and DNA, so protein and DNA sequences are each represented in their own mode. IBS may prove very useful in many areas of biological research. For example, with the help of IBS one can easily diagram the translocations that occur in cancer alongside a parallel view of the wild-type arrangement of the sequence (as shown in Fig. 1).


“IBS provides an assistance in drawing publication quality diagrams of both protein and nucleotide sequences.”

Fig. 1 The main interface of IBS. (A) The standalone software showing the domain organization of E3 SUMO-protein ligase RanBP2 (Flotho and Werner, 2012). (B) The online service presenting the organization of bromodomain proteins and translocations in cancer (Muller et al., 2011).

Reference:

Wenzhong Liu, Yubin Xie, Jiyong Ma, Xiaotong Luo, Peng Nie, Zhixiang Zuo, Urs Lahrmann, Qi Zhao, Yueyuan Zheng, Yong Zhao, Yu Xue and Jian Ren. IBS: an illustrator for the presentation and visualization of biological sequences.


How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

Prashant Pant

[Figure: A typical chimeric sequence obtained from Pintail version 1.0]

Detecting chimeric (or recombinant) sequences in a sequence dataset is an important part of sequence analysis, especially for the reconstruction of deep phylogenies and for sequence similarity analyses. This article focuses on methods of chimera detection in high-quality 16S rRNA sequences from Sanger sequencing with good read length (>750 bp). Reads of this length are potential candidates for chimera formation. As culture-independent approaches to the analysis of microbial diversity gain pace with high-throughput sequencing methods, the number of chimeric sequences published in the databases is also increasing rapidly. This is the era of metagenomics or, simply put, community DNA analysis, in which DNA from thousands of species is pooled and then analysed. This further increases the chances of chimera formation. Chimeras are usually formed during the polymerase chain reaction (PCR), but in some rare cases they are real. It is therefore important to adopt methods that can clean sequence datasets of chimeras.

Recently, a number of chimera detection programs for 16S rRNA gene sequences have been launched, namely Pintail, Mallard and Bellerophon. The first two are available at http://www.bioinformatics-toolkit.org and the last is available at http://greengenes.lbl.gov/cgi-bin/nph-index.cgi. Pintail and Mallard detect chimeras and anomalies in 16S rRNA genes based on the extent of pairwise percentage similarity between the query and related sequences. In chimera analysis with Pintail 1.0, the query sequences, which could be putative recombinants, are compared on a one (query)-on-one (subject) basis with a list of closely related sequences identified by BLAST searches. Because Pintail is a one-on-one query-subject comparison, it is highly stringent. This is not the case with Mallard. In Mallard, one sequence from within a dataset of query sequences is randomly chosen as the subject, while the rest remain as queries. A many (query)-on-one (subject) comparison follows, which is easier and takes less time than Pintail. Note that Mallard is of limited use if the query sequences are too diverse or truly novel in the first place.
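The idea of comparing a query with a subject by pairwise percentage similarity can be illustrated with a sliding window over an aligned pair. This is a simplified sketch and not the statistic that Pintail or Mallard actually compute. The window size, step and function names are assumptions, and both sequences are assumed to be already aligned to the same length and at least one window long.

```python
def percent_identity(a, b):
    """Percentage of matching positions, ignoring columns with a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def windowed_identity(query, subject, window=100, step=25):
    """Percent identity in successive windows along an aligned pair."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    return [
        percent_identity(query[i:i + window], subject[i:i + window])
        for i in range(0, len(query) - window + 1, step)
    ]


def identity_spread(query, subject, window=100, step=25):
    """Difference between the best and worst window.

    A large spread means one part of the query matches the subject much
    better than another, which is the pattern a chimera would produce.
    """
    values = windowed_identity(query, subject, window, step)
    return max(values) - min(values)


if __name__ == "__main__":
    subject = "ACGT" * 50
    # Artificial "chimera": first half identical to the subject, second half unrelated.
    query = subject[:100] + "T" * 100
    print(windowed_identity(query, subject, window=100, step=100))
```

A genuine but divergent sequence tends to show a fairly even identity along its length, whereas a chimera shows a sharp change at the break-point.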

Another program for detecting chimeras in 16S rRNA genes, Bellerophon ver 3.0 from Greengenes, is dedicated to 16S rRNA sequences. Here, the sequences must be submitted as a NAST (Nearest Alignment Space Termination) formatted file. The NAST alignment server at Greengenes holds more than one million 16S rRNA sequence records. Upon submission of the NAST formatted file, the server launches a local BLAST search of each query sequence against the 16S rRNA gene sequence library on the server. It then checks for potential chimeras in each query-subject alignment, one-on-one. The outcome of the entire process is a couple of Excel sheets emailed to the user with the query sequences, their best matches, and the BLAST score values. The user can set a BLAST score threshold, below which the software automatically excludes sequences from chimera detection. Finally, it reports whether a potential break-point was found (essentially in Yes or No format). It is user-friendly and particularly good for large datasets with a high degree of sequence diversity.

The only demerit of the software is that a relatively novel sequence in the query batch receives a low score, being highly unrelated to the existing records, and thus risks being omitted. Hence, one has to be really careful while using these programs, as sequence diversity could be lost, especially if the data comes from an extreme site (with more novel sequences) or from an NGS project with long reads and good coverage, as in the case of a PacBio machine. It is worth mentioning here that while Pintail and Mallard can be applied to any given DNA sequence data, Bellerophon is a dedicated program for 16S rRNA.
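One way to guard against losing novel sequences to a score cutoff is to set the low-scoring queries aside for manual inspection instead of discarding them. The sketch below is only an illustration of that idea and is not how Bellerophon itself is implemented; the function name `split_by_score`, the two-column tab-separated layout, the cutoff of 500 and the demo IDs and scores are all assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: split a tab-separated table of "query<TAB>score" rows
# into rows that pass a score threshold and rows kept aside for manual review.
split_by_score() {
    table=$1        # input file: query ID, tab, score
    threshold=$2    # user-chosen cutoff (illustrative value only)
    passed=$3       # output file: rows with score >= threshold
    review=$4       # output file: rows below the threshold, kept for inspection
    : > "$passed"   # make sure both output files exist, even if empty
    : > "$review"
    awk -F '\t' -v t="$threshold" -v p="$passed" -v r="$review" '
        $2 + 0 >= t + 0 { print > p; next }
                        { print > r }
    ' "$table"
}

# Made-up demo data; the IDs and scores do not come from any real run.
demo_dir=$(mktemp -d)
printf 'seq1\t950\nseq2\t120\nseq3\t700\n' > "$demo_dir/scores.tsv"
split_by_score "$demo_dir/scores.tsv" 500 "$demo_dir/passed.tsv" "$demo_dir/review.tsv"
cat "$demo_dir/review.tsv"
```

The sequences listed in the review file can then be checked by hand, so that a genuinely novel organism is not dropped merely because nothing similar exists in the reference records.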


Explore Tuberculosis: A Systems Biology Approach

Fozail Ahmad


“The bacterial two-component system (TCS) is a signal transduction system that senses environmental stimuli and responds accordingly.”

Systems biology alone is not sufficient to fulfil the requirement of a molecular understanding of any organism at any level. It seeks to bring together multiple approaches and fields to resolve a particular issue arising from ongoing work. In this article you will find a combinatorial approach of systems biology, i.e. molecular, cellular and network biology, to understand how tuberculosis develops and how the pathogen succeeds in fighting the host immune system. A well-developed mathematical model of the PhoP-PhoR two-component system is also presented and explained to demonstrate the mode of molecular regulation by the pathogen.

The bacterial two-component system (TCS) is a signal transduction system that senses environmental stimuli and responds accordingly. This system consists of two regulatory proteins, one of which functions as a histidine kinase (HK) and the other as a response regulator (RR) in the course of the signal cascade. Mycobacterium tuberculosis has eleven two-component systems controlling the expression of genes that are critically involved in virulence, pathogenicity and survival. Studies have demonstrated that PhoPR-TCS is one of the eleven TCSs particularly involved in the virulence of the pathogen. PhoPR-TCS is a positive regulator of many genes that encode proteins for the biosynthesis of lipids such as sulphatides (SL), diacyltrehalose (DAT) and polyacyltrehalose (PAT). These lipid components contribute to the virulence of M. tuberculosis. Studies have corroborated that pks2 is responsible for the biosynthesis of SL, and msl3 for that of DAT and PAT. The expression of these lipid biosynthesis genes is regulated by PhoP in association with the autokinase activity of PhoR.

In the case of the mycobacterial PhoPR TCS, Mg2+ ions have not been substantially proved to be a stimulating factor for PhoR. The simulation of the model was carried out in MATLAB using the RK-4 (fourth-order Runge-Kutta) method for differential equations. As a result, the behavior of the TCS was found to be robust at all concentrations of Mg2+ ions. The finding can inform the development of drugs against tuberculosis by indicating which gene or protein is highly sensitive to its stimuli.
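For readers unfamiliar with RK-4, the classical fourth-order Runge-Kutta step is summarized below. This is the textbook scheme, not the specific equations of the PhoP-PhoR model, which are not reproduced in this article. The symbols are illustrative assumptions: $y$ stands for the vector of model concentrations (for example PhoP and PhoR), $f$ for the right-hand side of the model equations, and $h$ for the time step.

```latex
\[
\begin{aligned}
\frac{dy}{dt} &= f(t, y), \\
k_1 &= f\left(t_n,\; y_n\right), \\
k_2 &= f\left(t_n + \tfrac{h}{2},\; y_n + \tfrac{h}{2}\,k_1\right), \\
k_3 &= f\left(t_n + \tfrac{h}{2},\; y_n + \tfrac{h}{2}\,k_2\right), \\
k_4 &= f\left(t_n + h,\; y_n + h\,k_3\right), \\
y_{n+1} &= y_n + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right).
\end{aligned}
\]
```

Each step combines four slope estimates, weighted 1:2:2:1, to advance the solution from $t_n$ to $t_n + h$.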


Fig: General presentation of the model, depicting the feedback mechanism of the system

The regulation of the TCS is affected by Mg2+ ions to every possible extent, as shown by fluctuations in the levels of the PhoP and PhoR proteins. The ions have both positive and negative effects on the TCS. The results showed that important genes remain activated even after the ions are removed from the surrounding medium. So, targeting ion influx and efflux would be of no use for developing drugs against the pathogen. The model can be tested further with more simulations at varying ion concentrations. Since the TCS regulates genes that are directly involved in the pathogenicity and survival of Mycobacterium tuberculosis, understanding the nature and behaviour of the individual proteins will provide insight into finding novel drug targets against tuberculosis. The simulation in this work represented the mechanism of gene regulation and its sensitivity towards the stimulus, and provided an understanding of how to proceed when targeting a molecule or protein of any other two-component system of the pathogen.

Reference source: Fozail Ahmad & Ravins Dohare*, Assessing Effect of Mg ion on PhoP-PhoR two component systems of Mycobacterium tuberculosis through Development of Mathematical Model, Int. Journal of Science and Research, 4(7), 2285-2289, Paper ID: SUB 156569


Cl-Dash: speeding up cloud computing in bioinformatics

Muniba Faiza


“cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, software that stores huge amounts of data and provides very easy access to that data in relatively less time.”

After a lot of work in the field of bioinformatics, the genomes of many living organisms have been sequenced, and a lot of information has been generated at the RNA and protein levels. This has given rise to huge amounts of biological data whose storage is now an issue, because such an enormous volume of data cannot be stored on a personal computer or a local server. For this purpose, cloud computing, the practice of managing and processing data using remote servers hosted on the internet, has been introduced in bioinformatics, though the origin of cloud computing is not very clear.

cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, software that stores huge amounts of data and provides very easy access to that data in relatively less time. The tool was developed by Paul Hodor, Amandeep Chawla, Andrew Clark and Lauren Neal from Booz Allen Hamilton, USA.

cl-dash is a starter kit that configures and deploys new Hadoop clusters in a few minutes on infrastructure provided by AWS (Amazon Web Services).

According to a paper published in Bioinformatics (Nov, 2015), cl-dash is based on a distributed file system and the MapReduce programming pattern. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware. With the help of cl-dash, a user can create clusters (sets of nodes that store huge amounts of data) as an ‘admin’ through a set of command-line tools whose names begin with ‘cl-’ (hence the name ‘cl-dash’). A YAML configuration file (config.yml) is required, and with it a new cluster can be created in minutes. Once the Hadoop cluster is formed, the user can easily access the data.
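The MapReduce pattern itself can be illustrated on a single machine with ordinary Unix tools. The sketch below is only an analogy and uses neither Hadoop nor cl-dash; the function name `count_bases` and the demo sequence are made up for this example. The "map" stage emits one key (a base) per line, sorting plays the role of the "shuffle" that groups identical keys, and the "reduce" stage counts each group.

```shell
#!/bin/sh
# Single-machine analogy of the MapReduce pattern (not Hadoop itself).
count_bases() {
    # map:     fold -w 1 emits one base per line
    # shuffle: sort brings identical bases together
    # reduce:  uniq -c counts each group; awk prints "base<TAB>count"
    fold -w 1 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up demo sequence.
printf 'ACGTACGA\n' | count_bases
```

On a Hadoop cluster the same three stages run in parallel across many nodes, which is what makes the pattern suitable for very large datasets.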

Such tools are needed because biological data keeps growing, and with it the demand for large data storage space. cl-dash provides a good way to manage such huge volumes of data.

NOTE:

An exhaustive list of references for this article is available from the author on personal request. For more details, write to [email protected].


Installing Gromacs on Ubuntu for MD Simulation

Tariq Abdullah


“For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies.”

In bioinformatics, GROMACS is one of the most popular molecular dynamics simulation packages, with a load of features built in. Installing GROMACS version 5.x.x+ on Ubuntu can be a tedious and cumbersome process, especially if you are just starting out. For beginners, installing GROMACS and getting it to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies. Also, the installation instructions for version 5+ available on the GROMACS website do not seem to work at the first attempt.

In this quick tutorial, I will teach you how to install GROMACS on Ubuntu 14.04 LTS. It is expected to work on any version of Ubuntu. Post in the comments if you face any problem. I will also explain the meaning of the different commands along the way.

To install GROMACS 5+, log into your Ubuntu system and open a terminal by pressing Ctrl+Alt+T.

You need a good internet connection, as we will have to download various dependencies during the installation process. To install GROMACS, we need the following software on our system:

1. A C and C++ compiler, which comes built in with Ubuntu.

2. CMake: a tool that generates the build files used to compile binaries.

3. build-essential: a meta-package that pulls in the packages needed to compile software.

4. FFTW library: a library used by GROMACS to compute discrete Fourier transforms.

5. The regression tests package.

Getting Started

If you have freshly installed Ubuntu, don’t forget to update your repository information and the software packages on your system. Press Ctrl+Alt+T and a terminal will open up. In the terminal, type:

sudo apt-get update

sudo apt-get upgrade

Installation

The first step in installing GROMACS is to get cmake. In the terminal, type:

sudo apt-get install cmake

If asked “After this operation, 16.5 MB of additional disk space will be used. Do you want to continue?”, press y and then press Enter.

When the download and installation finish, you can check the version of cmake with the following command:

cmake --version

Next, we need to install build-essential with this command:

sudo apt-get install build-essential

Before we go any further, it is good to know the path of our current working directory. In the terminal, type:

pwd

Note down the path it shows. It is very important and will be used during the actual GROMACS installation.

Now that we have cmake in place and know the working directory, it’s time to download the regression tests package. It is possible to download this package automatically during installation, but most of the time that throws an error stating that the location of the file has changed, so let us do it the hard way to avoid any problem during installation. Copy and paste the following commands into your terminal (right-click to paste, or press Ctrl+Shift+V). They download the file and save it in your Downloads folder.

cd Downloads/

wget http://gerrit.gromacs.org/download/regressiontests-5.1.1.tar.gz

We now have the regression tests package in our Downloads folder as a compressed tar.gz archive. Let us extract it with:

tar xvzf regressiontests-5.1.1.tar.gz
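Since a moved or interrupted download is the usual source of trouble at this point, the extraction step above could be made a little more defensive. The following is a minimal sketch, not part of the original tutorial; the function name `safe_extract` is made up for this example. It relies only on standard `tar` options: `t` lists an archive without extracting it, which fails on a truncated or corrupt file.

```shell
#!/bin/sh
# Extract a .tar.gz archive only if it exists and tar can list its contents.
safe_extract() {
    archive=$1
    if [ ! -f "$archive" ]; then
        echo "missing: $archive" >&2
        return 1
    fi
    # "tar tzf" lists the contents without extracting anything;
    # it fails on a truncated or corrupt download.
    if ! tar tzf "$archive" > /dev/null 2>&1; then
        echo "unreadable archive: $archive" >&2
        return 1
    fi
    tar xzf "$archive"
}

# Usage, from the folder that holds the download:
#   safe_extract regressiontests-5.1.1.tar.gz
```

If the function reports a missing or unreadable archive, download the file again before continuing.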

Now we need the FFTW (Fourier transform) library on our system. You can download it from fftw.org or install it from the repository with the following command:

sudo apt-get install libfftw3-dev

Okay, let us now download GROMACS 5.1.1 with the command below. Alternatively, you can download the latest version from the GROMACS website.

wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.1.1.tar.gz

Now extract the GROMACS archive:

tar xvzf gromacs-5.1.1.tar.gz

Now move inside the GROMACS folder:

cd gromacs-5.1.1/

Create a directory called “build”, where we will keep our compiled binaries:

mkdir build

Move inside the build directory:

cd build

It’s time to configure the GROMACS build. In the following command, replace “pwdpath” with the path of the working directory that you noted earlier:

cmake .. -DGMX_BUILD_OWN_FFTW=OFF -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_C_COMPILER=gcc -DREGRESSIONTEST_PATH=pwdpath/Downloads/regressiontests-5.1.1
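Typing the path by hand is the step most likely to go wrong here, so the regression tests option can be built from a variable instead. This is a minimal sketch under the tutorial's own assumptions (the archive was extracted into a Downloads folder under the directory reported by pwd); the function name `regressiontest_flag` is hypothetical and is not part of GROMACS or CMake.

```shell
#!/bin/sh
# Print the -DREGRESSIONTEST_PATH option for a given base directory,
# after checking that the extracted regression tests folder really exists.
regressiontest_flag() {
    base_dir=$1    # the directory noted earlier with pwd, e.g. "$HOME"
    tests_dir="$base_dir/Downloads/regressiontests-5.1.1"
    if [ ! -d "$tests_dir" ]; then
        echo "regression tests not found at $tests_dir" >&2
        return 1
    fi
    echo "-DREGRESSIONTEST_PATH=$tests_dir"
}

# Usage: print the option and paste it at the end of the cmake command.
#   regressiontest_flag "$HOME"
```

If the function prints an error instead of the option, the regression tests archive was extracted somewhere else and the path needs to be adjusted.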

If everything goes well, the message in your terminal will say “Generating Done. Build files written…”. If not, make sure you have replaced pwdpath in the command with the path of your home directory. If you have forgotten it, just open another terminal and type pwd.

Now let’s check the build and then install the real thing:

make check

sudo make install

This may take some time, depending on your configuration. After completion, load the GROMACS environment into your shell:

source /usr/local/gromacs/bin/GMXRC

After a successful installation, you can check the version of GROMACS with the following command to make sure the installation finished as expected:

gmx pdb2gmx --version
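If the version command reports that gmx cannot be found, the usual cause is that GMXRC has not been sourced in the current terminal, because `source` only affects the shell in which it is run. A small check like the one below can confirm whether the command is reachable; it is a generic sketch, and the function name `check_on_path` is made up for this example.

```shell
#!/bin/sh
# Report whether a command is reachable through PATH.
check_on_path() {
    name=$1
    if location=$(command -v "$name"); then
        echo "$name found at $location"
    else
        echo "$name not found on PATH"
        return 1
    fi
}

# Usage, after sourcing GMXRC:
#   check_on_path gmx
```

If gmx is not found in a newly opened terminal, source GMXRC again in that terminal before running any GROMACS command.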

***


GenomeD3Plot: Easy visualization of genomes

Muniba Faiza


“GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3, i.e. the Data-Driven Documents library, which is used to produce dynamic, interactive data visualizations in web browsers.”

As important as it is to sequence genomes, it is equally important to visualize them. Some tools exist to visualize genomes, but they are static and standalone, and very complex to install and use. Newer, more interactive tools with new features are required to ease the visualization of genomes. GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3, i.e. the Data-Driven Documents library, which is used to produce dynamic, interactive data visualizations in web browsers. GenomeD3Plot is very user-friendly: it allows the user to interact with the data, alter the view dynamically, and easily resize or reposition the visualization in the browser.

The goal of Matthew R. Laird was to create a library with minimal external dependencies that could be integrated into existing web applications just as a developer might include an image or a table. GenomeD3Plot uses JSON for configuration, a standardized and well-supported data format that reduces the complexity of use and provides better visualization. The image is created in SVG format and can easily be exported to PNG format as required.


Fig. 1: GenomeD3Plot circular and linear visualization of an example genome with annotation data

With GenomeD3Plot, the genome can be viewed in different tracks, for example if one wishes to view a specific base pair or a series of base pairs, to visualize GC content, or more. GenomeD3Plot provides a rich API (application program interface, which specifies how software components should interact) to dynamically manipulate the visualization.

A linear and a circular plot can also be tied together so that manipulating one causes a mirrored alteration in the other, such as zooming or changing the visible region of the genome. A specific region can be recentered to bring it into focus. Many other features have been introduced in GenomeD3Plot for easy visualization and interpretation of genomes.

Note:

An exhaustive list of references for this article is available from the author on personal request. For more details, write to [email protected].
