
BIOINFORMATICS

STANDARD OPERATING PROCEDURES

FOR

R PROGRAMMING

DBT STAR COLLEGE SCHEME

Department of Information Technology

Published by

Coimbatore Institute of Information Technology

#156, 3rd Floor, Kalidas Road, Ramnagar,

Coimbatore – 641009, Tamil Nadu, India.

Website: www.ciitresearch.org

All Rights Reserved.

Original English Language Edition 2019 © Copyright by Coimbatore Institute of

Information Technology.

This book may not be duplicated in any way without the express written consent of the

publisher, except in the form of brief excerpts or quotations for the purpose of review. The

information contained herein is for the personal use of the reader and may not be

incorporated in any commercial programs, other books, database, or any kind of software

without written consent of the publisher. Making copies of this book or any portion thereof

for any purpose other than your own is a violation of copyright laws.

This edition has been published by Coimbatore Institute of Information Technology,

Coimbatore.

Limits of Liability/Disclaimer of Warranty: The authors and publisher have used their best efforts in preparing this Bioinformatics book, and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this book, and specifically disclaim any implied warranties of merchantability or fitness for any particular purpose. There are no warranties which extend beyond the descriptions contained in this paragraph. No warranty may be created or extended by sales representatives or written sales materials. Neither CiiT nor the authors shall be liable for any loss of profit or any other commercial damage, including but not limited to special, incidental, consequential, or other damages.

Trademarks: All brand names and product names used in this book are trademarks,

registered trademarks, or trade names of their respective holders.

ISBN 978-81-939960-6-5

This book is printed in 70 gsm papers.

Printed in India by Mahasagar Technologies.

Coimbatore Institute of Information Technology,

#156, 3rd Floor, Kalidas Road, Ramnagar,

Coimbatore – 641009, Tamil Nadu, India.

Phone: 0422-4377821

www.ciitresearch.org

Authors

Dr. M. S Vijaya

Mrs. G. Sophia Reena

Mrs. R. Jayasree

Notes

**********

ABOUT DBT STAR COLLEGE SCHEME

The STAR COLLEGE SCHEME was initiated by the Department of Biotechnology, New Delhi, to support colleges and university departments offering undergraduate education in improving science teaching. The programme aims to improve the skills of teachers by organising faculty training, to improve the curriculum, and to emphasise practical training for students by providing access to specialized infrastructure. Five of our science departments, namely Botany, Zoology, Physics, Chemistry, and Mathematics, have been included in the Star College Scheme from the academic year 2012 – 2013, and these departments were elevated to star status in the year 2017. In the academic year 2016 – 2017 the DBT Star College Scheme was extended to three more departments of our institution, namely Computer Science, Information Technology and Computer Applications.

Under this scheme, the curriculum has been updated with interdisciplinary topics on bio-computational methods for the students of Information Technology. To enhance interdisciplinary knowledge among students and faculty members, the department organized various guest lectures, workshops and faculty development programmes on topics such as Computational Biology and Bio-Programming with Python and Perl. The students were given the opportunity to take up internships and interdisciplinary projects, and were able to visit various research institutes such as CCMB, Trans University, and CBNR. This scheme has enhanced the technical skills of students in processing biological data and motivated them to take up projects in related fields, which has improved their placement opportunities in research laboratories.


R PROGRAMMING

OBJECTIVES OF THE COURSE

This course is designed to build foundation skills in the R programming language. Its primary goal is to familiarize participants with R syntax. Participants also get to know the different databases available online, learn about sequence alignment and similarity, and learn to predict gene and protein sequences.

COURSE OUTCOMES

Analyze Protein sequencing and Nucleic acid sequencing.

Find protein interaction, activity, modification and its function.

Predict Gene and establish genomic library.

Drug designing and discovery from data of functional genomics and

proteomics.

Seek jobs in the area of pharmaceuticals, drug designing, Forensic science,

analyzing hereditary disease.


CONTENTS

S. No   Title of the Paper

01      Retrieve Genome Sequence Data

02      Shell Script To Search And Retrieve Sequences

03      Finding The Name of The Input Sequence Using Grep Command Using Shell Script

04      Analyse And Visualize Graph For Protein Interaction Data


Bioinformatics

Department of Information Technology 1

1. RETRIEVE GENOME SEQUENCE DATA

AIM: To retrieve genome sequence data via the NCBI website using the seqinr package and write it as a FASTA file.

ALGORITHM:

Step 1: Start -> All Programs -> R tool.

Step 2: Include the package seqinr using the library function; inside the function, require() loads the package before finding which ACNUC database the accession is stored in.

Step 3: The number of databases in dbs is found using length() and stored in numdbs.

Step 4: Using a for loop, choose the bank of each database in turn.

Step 5: Check whether the sequence is in the ACNUC database db using the accession number and store the result in resquery.

Step 6: If the check succeeds, query the database with the accession number and store the result in query2.

Step 7: Retrieve the sequence using the getSequence() method; if the accession is not found in any database, print an error.

Step 8: Close the bank and return the sequence.

Step 9: Pass the accession number to getncbiseq().

Step 10: Write the result with file.out="den1.fasta".

Step 11: Stop the process.

CODING:

library(seqinr)

getncbiseq <- function(accession)
{
  require("seqinr") # this function requires the SeqinR R package
  # first find which ACNUC database the accession is stored in:
  dbs <- c("genbank","refseq","refseqViruses","bacterial")
  numdbs <- length(dbs)
  for (i in 1:numdbs)
  {
    db <- dbs[i]
    choosebank(db)
    # check if the sequence is in ACNUC database 'db':
    resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE)
    if (!(inherits(resquery, "try-error"))) {
      queryname <- "query2"
      thequery <- paste("AC=", accession, sep="")
      query2 <- query(queryname, thequery)
      # see if a sequence was retrieved:
      seq <- getSequence(query2$req[[1]])
      closebank()
      return(seq)
    }
    closebank()
  }
  # reached only if the accession was not found in any of the databases
  print(paste("ERROR: accession",accession,"was not found"))
}

# retrieve the sequence for accession NC_001477 and write it as a FASTA file
dengueseq <- getncbiseq("NC_001477")
write.fasta(names="DEN-1",sequences=dengueseq, file.out="den1.fasta")
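To make the layout of the output file concrete, the following is a minimal base-R sketch of what a FASTA writer does: it writes a header line beginning with ">" followed by the sequence wrapped at a fixed width. The helper name write_fasta_sketch, the 60-character width and the toy sequence are assumptions made for illustration and are not part of seqinr; in the experiment itself write.fasta() does this work.

```r
# Hypothetical helper (not part of seqinr): write one sequence in FASTA format.
write_fasta_sketch <- function(name, bases, file, width = 60) {
  # 'bases' is a character vector of single letters, as getSequence() returns
  one_string <- paste(bases, collapse = "")
  # cut the sequence into chunks of at most 'width' characters
  starts <- seq(1, nchar(one_string), by = width)
  ends <- pmin(starts + width - 1, nchar(one_string))
  chunks <- substring(one_string, starts, ends)
  # header line first, then the wrapped sequence lines
  writeLines(c(paste0(">", name), chunks), con = file)
  invisible(chunks)
}

# Toy data only: 130 made-up bases, not the DEN-1 genome
toy_seq <- rep(c("a", "c", "g", "t", "a"), times = 26)
out_file <- tempfile(fileext = ".fasta")
write_fasta_sketch("toy-sequence", toy_seq, out_file)

# Read the file back to inspect the header and the wrapped lines
fasta_lines <- readLines(out_file)
print(fasta_lines)
```

Reading the file back with readLines() is a quick way to confirm that the first line is the header and that the remaining lines, joined together, give back the original sequence.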



OUTPUT:

RESULT:

Thus the genome sequence data is retrieved and written as a FASTA file, and the output is verified.



ANNEXURE I

1. Subject : Information Technology

2. Area of experiment : R Tool

3. Experiment Details:

3.1 The task of retrieving genome sequence data via the NCBI website using the seqinr package and writing it as a FASTA file will be assigned to each student.

3.2 Duration of experiment: 3 Hours.

3.3 List of recurring and non-recurring instruments / equipment required,

sources for procurement of the required items.

Recurring: Computer with Internet.

Non-Recurring: NIL

3.4 Recurring and non-recurring cost per experiment.

Recurring

S. No   Description              Quantity   Rate (Rs)   Amount (Rs)

1.      Computer with internet   59         -           -

TOTAL                                                   -

Non- Recurring: NIL

4. Learning outcomes

4.1. Principles and skills learnt.

4.1.1. Brief introduction into basic functionalities of R.

4.1.2. Understand basic packages in R.

4.2 Further scope and application.

Bioinformatics is an interdisciplinary field of study that combines biology with computer science to understand biological data. R packages provide a powerful tool kit that makes the process of retrieving genome sequence data via the NCBI website easy and intuitive. Finally, based on the result of this experiment, the students may take up a group project to build bioinformatics-related projects using the R tool.

4.3 Safety aspects: NA

4.4. Regulatory requirements, if any: NA

4.5. Possible waste generated and disposal method: NA

4.6. IT support and other support needed.

4.6.1 Screen shots of the programme can be used in future for

demonstration.

4.7. Troubleshooting and bottlenecks expected.

4.7.1. Before starting the programme, ensure that the system has a working internet connection.

5. List of references supporting the experiment and the relevant

references

i. https://www.youtube.com/watch?v=sK3ykyInU8o

ii. https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/

6. Any other relevant information: NA



2. SHELL SCRIPT TO SEARCH AND RETRIEVE SEQUENCES

AIM: To write a program to search and to retrieve the sequences and

MEDLINE References using ENTREZ database.

ALGORITHM:

Step 1: Start -> All Programs -> R tool.

Step 2: Include the rentrez package.

Step 3: Store the gene IDs in gene_ids.

Step 4: Use entrez_link() with dbfrom="gene", id=gene_ids and db="nuccore", and store the result in linked_seq_ids.

Step 5: Extract the linked RefSeq RNA transcripts from linked_seq_ids and store them in linked_transripts.

Step 6: Pass linked_transripts as the parameter to head().

Step 7: Fetch the data using entrez_fetch() with db="nuccore", id=linked_transripts and rettype="fasta", and store it in all_recs.

Step 8: Pass all_recs as the parameter to class().

Step 9: Display the first 500 characters of the records using cat(strwrap(substr(all_recs, 1, 500)), sep="\n").

Step 10: Write the records to the file "my_transcripts.fasta".

CODING & OUTPUT:

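library(rentrez) # provides entrez_link() and entrez_fetch()

# gene IDs to look up
gene_ids <- c(351, 11647)

# find the nucleotide records linked to these genes
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")

# keep the linked RefSeq RNA transcripts
linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transripts)

# fetch the transcripts in FASTA format
all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
class(all_recs)

# display the first 500 characters of the records
cat(strwrap(substr(all_recs, 1, 500)), sep="\n")

# save the records to a file
write(all_recs, file="my_transcripts.fasta")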
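entrez_fetch() with rettype="fasta" returns all records together as one long character string. The following minimal base-R sketch, which uses a made-up two-record string in place of the real download, shows one way such a string could be split into record names and sequences. The helper name split_fasta_text and the toy records are hypothetical, and the sketch assumes every header is followed by at least one sequence line.

```r
# Hypothetical helper: split one FASTA-formatted string into named sequences.
split_fasta_text <- function(fasta_text) {
  lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]            # drop empty lines between records
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)           # every header starts a new record
  # join the sequence lines that belong to the same record
  sequences <- tapply(lines[!is_header], record_id[!is_header],
                      paste, collapse = "")
  sequences <- as.vector(sequences)        # plain character vector, one per record
  names(sequences) <- sub("^>", "", lines[is_header])
  sequences
}

# Toy text only: stands in for the string returned by entrez_fetch()
toy_recs <- ">rec1 first toy record\nACGTAC\nGTA\n\n>rec2 second toy record\nTTGGCC\n"
parsed <- split_fasta_text(toy_recs)
print(parsed)
print(nchar(parsed))
```

For real analyses a dedicated FASTA reader is usually the safer choice; the sketch is only meant to show the structure of the text that is written to the output file.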



RESULT:

Thus the shell script to search and retrieve sequences is executed and the output is verified.



ANNEXURE II




3.1 The task of searching and retrieving sequences and MEDLINE references using the ENTREZ database will be assigned to each student.


3.3 List of recurring and non-recurring instruments / equipment

required, sources for procurement of the required items.


Non-Recurring: NIL


Recurring


1. Computer


TOTAL -

Non- Recurring: NIL








R packages provide a powerful tool kit that makes searching and retrieving sequences and MEDLINE references using the ENTREZ database easy and intuitive. Finally, based on the result of this experiment, the students may take up a group project to build bioinformatics-related projects using the R tool.





4.6.1 Screen shots of the programme can be used in

future for demonstration.





references

i. http://www.powershow.com/view/21d300-MDVmZ/Retrieving_Information_Using_Entrez_powerpoint_ppt_presentation

ii. https://www.ncbi.nlm.nih.gov/books/NBK184582/

6. Any other relevant information: NA.



3. FINDING THE NAME OF THE INPUT SEQUENCE USING GREP

COMMAND USING SHELL SCRIPT

AIM: To write a shell script to find the name of the input sequence and the top hit from a BLAST output file using the grep command, without viewing the file directly.

ALGORITHM:

Step 1: Start -> All Programs -> R tool.

Step 2: Download the 16S Microbial database from the NCBI website using download.file() with mode="wb".

Step 3: Include the package 'rBLAST' using the library function.

Step 4: Extract the archive downloaded from the NCBI website into the folder 16SMicrobialDB using untar().

Step 5: Load some test data by reading the example sequences with readRNAStringSet() from the package 'rBLAST'.

Step 6: Typing 'seq' at the R> prompt displays the RNAStringSet instance and its length.

Step 7: Load a BLAST database using blast(), replacing db with the location and name of the BLAST database, and store it in 'bl'.

Step 8: Use the predict function by passing 'bl' and a sequence as parameters and store the result in 'cl'.

Step 9: Print the first five BLAST hits from 'cl'.


CODING & OUTPUT:

(Lines without the R> or + prompt are the output)

## download the 16S Microbial database from NCBI
R> download.file("ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz",
+     "16SMicrobial.tar.gz", mode='wb')
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz'
ftp data connection made, file length 4192539 bytes
==================================================
downloaded 4.0 MB

R> library(rBLAST)
R> untar("16SMicrobial.tar.gz", exdir="16SMicrobialDB")

## load some test data
R> seq <- readRNAStringSet(system.file("examples/RNA_example.fasta",
+     package="rBLAST"))
R> seq
  A RNAStringSet instance of length 5
    width seq                                                                 names
[1]  1481 AGAGUUUGAUCCUGGCUCAGAACGAACGCUGGCG...UGGUGAUCGGGGUGAAGUCGUAACAAGGUAACC 1675
[2]  1404 GCUGGCGGCAGGCCUAACACAUGCAAGUCGAACG...GGCAGCCGACCACGGUAAGGUCAGCGACUGGGG 4399
[3]  1426 GGAAUGCUNAACACAUGCAAGUCGCACGGGCAGC...GUGUAGUCGNAACAAGGUAGCCGUAGGGGAACC 4403
[4]  1362 GCUGGCGGAAUGCUUAACACAUGCAAGUCGCACG...GAGUUGGUUUUACCUUAGGUGUCUAGGCUAACC 4404
[5]  1458 AGAGUUUGAUUAUGGCUCAGAGCGAACGCUGGCG...GCGACUGGGGUGAAGUCGUAACAAGGUAACCGU 4411

## load a BLAST database (replace db with the location + name of the BLAST DB)
R> bl <- blast(db="./16SMicrobialDB/16SMicrobial")
R> bl
BLAST Database
Location: /home/hahsler/baR/rMSA/16SMicrobialDB/16SMicrobial
Database: 16S Microbial Sequences
    17,624 sequences; 25,680,771 total bases
Date: Jun 2, 2015 12:00 AM    Longest sequence: 2,211 bases
Volumes:
    /home/hahsler/baR/rMSA/16SMicrobialDB/16SMicrobial

## query a sequence using BLAST
R> cl <- predict(bl, seq[1,])
R> cl[1:5,]
  QueryID                     SubjectID Perc.Ident Alignment.Length Mismatches Gap.Openings Q.start
1    1675 gi|559795231|ref|NR_104821.1|      90.82             1459        124            8      16
2    1675 gi|444304125|ref|NR_074549.1|      85.99             1249        158           15     235
3    1675 gi|444304125|ref|NR_074549.1|      94.20               69          4            0       1
4    1675 gi|645320383|ref|NR_117601.1|      85.99             1249        158           15     235
5    1675 gi|645320383|ref|NR_117601.1|      93.85               65          4            0       5
  Q.end S.start S.end     E Bits
1  1468       5  1459 0e+00 1943
2  1478     247  1483 0e+00 1321
3    69       1    69 3e-22  106
4  1478     243  1479 0e+00 1321
5    69       1    65 6e-20   99
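The aim of this experiment mentions grep and a BLAST output file, so the following minimal base-R sketch shows how grep() can pull the query name and the top hit out of such a file without opening it in a viewer. It reuses the five rows of the cl[1:5,] table; storing them in a tab-separated text file, and choosing the top hit by the highest bit score, are assumptions made for this sketch.

```r
# Rows copied from the cl[1:5,] table, written as tab-separated text.
# Keeping them in a file with this layout is an assumption of the sketch.
blast_lines <- c(
  "1675\tgi|559795231|ref|NR_104821.1|\t90.82\t1459\t124\t8\t16\t1468\t5\t1459\t0e+00\t1943",
  "1675\tgi|444304125|ref|NR_074549.1|\t85.99\t1249\t158\t15\t235\t1478\t247\t1483\t0e+00\t1321",
  "1675\tgi|444304125|ref|NR_074549.1|\t94.20\t69\t4\t0\t1\t69\t1\t69\t3e-22\t106",
  "1675\tgi|645320383|ref|NR_117601.1|\t85.99\t1249\t158\t15\t235\t1478\t243\t1479\t0e+00\t1321",
  "1675\tgi|645320383|ref|NR_117601.1|\t93.85\t65\t4\t0\t5\t69\t1\t65\t6e-20\t99"
)
blast_file <- tempfile(fileext = ".txt")
writeLines(blast_lines, blast_file)

# grep() with value = TRUE returns the matching lines themselves
all_lines <- readLines(blast_file)
hits_for_query <- grep("^1675\t", all_lines, value = TRUE)

# Split each matching line into its 12 fields
fields <- strsplit(hits_for_query, "\t", fixed = TRUE)
query_name <- unique(vapply(fields, function(f) f[1], character(1)))
bit_scores <- as.numeric(vapply(fields, function(f) f[12], character(1)))

# The top hit is taken to be the subject with the highest bit score
top_hit <- fields[[which.max(bit_scores)]][2]

print(query_name)
print(top_hit)
```

The same pattern can be run from a shell with the grep command on the file itself; the R version is shown here only because the rest of the experiment is carried out in the R tool.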

RESULT:

Thus the shell script to find the name of the input sequence using the grep command is executed and the output is verified.



ANNEXURE III




3.1 The task of finding the name of the input sequence using the grep command in a shell script will be assigned to each student.


3.3 List of recurring and non-recurring instruments / equipment required,

sources for procurement of the required items.


Non-Recurring: NIL


Recurring


1. Computer


TOTAL -

Non- Recurring: NIL






R packages provide a powerful tool kit that makes the process of finding the name of the input sequence using the grep command easy. Finally, based on the result of this experiment, the students may take up a group project to build bioinformatics-related projects using the R tool.





4.6.1 Screen shots of the programme can be used in

future for demonstration.





references

i. https://www.youtube.com/watch?v=YXROs4pnHO0

ii. https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/grep




4. ANALYSE AND VISUALIZE GRAPH FOR PROTEIN INTERACTION

DATA

AIM: To write a program to analyze and visualize graph for Protein

Interaction Data.

ALGORITHM:

Step 1: Start -> All Programs -> R tool.

Step 2: Include the library files "graph" and "yeastExpData".

Step 3: Use data() on "litG" and display the number of nodes and edges by typing litG.

Step 4: Include the library file "RBGL", pass litG as the parameter to the function connectedComp() and store the result as myconnectedcomponents, which is then displayed.

Step 5: Pass myconnectedcomponents to the length function, which displays the number of connected components.

Step 6: Include the library "Rgraphviz".

Step 7: Pass component3 (the third connected component) and litG to the subGraph function and store the result in mysubgraph.

Step 8: Use the function layoutGraph by passing mysubgraph with layoutType "neato" and store the result in mygraphplot.

Step 9: The renderGraph function is used to display the graph by passing mygraphplot.


CODING:

library("graph")
library("yeastExpData")
data("litG")
litG
# A graphNEL graph with undirected edges
# Number of Nodes = 2885
# Number of Edges = 315

library("RBGL")
myconnectedcomponents <- connectedComp(litG)
myconnectedcomponents[[1]]
# [1] "YBL072C"
myconnectedcomponents[[2]]
# [1] "YBL083C"
myconnectedcomponents[[3]]
# [1] "YBR009C" "YBR010W" "YNL030W" "YNL031C" "YOL139C" "YAR007C" "YBR073W"
# [8] "YER095W" "YJL173C" "YNL312W" "YBL084C" "YDR146C" "YLR127C" "YNL172W"
# [15] "YLR134W" "YMR284W" "YER179W" "YIL144W" "YML104C" "YOR191W" "YDL008W"
# [22] "YDL030W" "YDL042C" "YDR004W" "YGR162W" "YMR117C" "YDR386W" "YDR485C"
# [29] "YDL043C" "YDR118W" "YMR106C" "YML032C" "YDR076W" "YDR180W" "YDL013W"
# [36] "YDR227W"
length(myconnectedcomponents)
# [1] 2642

# keep the third connected component for plotting
component3 <- myconnectedcomponents[[3]]

library("Rgraphviz")
mysubgraph <- subGraph(component3, litG)
mygraphplot <- layoutGraph(mysubgraph, layoutType="neato")
renderGraph(mygraphplot)
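connectedComp() groups the nodes so that any two proteins in the same group are linked by a chain of interactions, while a protein with no interactions forms a group of its own. The following minimal base-R sketch illustrates the idea on a made-up six-node graph using simple label propagation. It is not the RBGL implementation, and the node names and the helper find_components are hypothetical.

```r
# Toy undirected graph (made-up protein names, not taken from litG)
edges <- matrix(c("P1", "P2",
                  "P2", "P3",
                  "P4", "P5"), ncol = 2, byrow = TRUE)
nodes <- c("P1", "P2", "P3", "P4", "P5", "P6")

# Hypothetical helper: group the nodes into connected components
find_components <- function(nodes, edges) {
  label <- setNames(seq_along(nodes), nodes)  # every node starts on its own
  repeat {
    changed <- FALSE
    for (i in seq_len(nrow(edges))) {
      a <- edges[i, 1]
      b <- edges[i, 2]
      smallest <- min(label[a], label[b])
      if (label[a] != smallest || label[b] != smallest) {
        label[c(a, b)] <- smallest            # both ends take the smaller label
        changed <- TRUE
      }
    }
    if (!changed) break                       # stop once no label changes
  }
  unname(split(names(label), label))          # one character vector per component
}

components <- find_components(nodes, edges)
print(components)
print(length(components))
```

As in the experiment, length() of the result gives the number of connected components, and any single component can be picked out with [[ ]] for plotting.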



OUTPUT:

RESULT:

Thus the output is verified and the result is obtained.



ANNEXURE IV




3.1 The task of analyzing and visualizing a graph for protein interaction data will be assigned to each student.


3.3 List of recurring and non-recurring instruments / equipment

required, sources for procurement of the required items.


Non-Recurring: NIL


Recurring


1. Computer


TOTAL -

Non- Recurring: NIL






R packages provide a powerful tool kit that makes the process of manipulating and visualizing data easy and intuitive. Finally, based on the result of this experiment, the students may take up a group project to build bioinformatics-related projects using the R tool.





4.6.1 Screen shots of the programme can be used in future for

demonstration.




5. List of references supporting the experiment and the relevant references

i. https://www.youtube.com/watch?v=UYclmg1_KLk

ii. https://www.r-bloggers.com/obtaining-a-protein-protein-interaction-network-for-a-gene-list-in-r

