BIOINFORMATICS
STANDARD OPERATING PROCEDURES
FOR
R PROGRAMMING
DBT STAR COLLEGE SCHEME
Department of Information Technology
Published by
Coimbatore Institute of Information Technology
#156, 3rd Floor, Kalidas Road, Ramnagar,
Coimbatore – 641009, Tamil Nadu, India.
Website: www.ciitresearch.org
All Rights Reserved.
Original English Language Edition 2019 © Copyright by Coimbatore Institute of
Information Technology.
This book may not be duplicated in any way without the express written consent of the
publisher, except in the form of brief excerpts or quotations for the purpose of review. The
information contained herein is for the personal use of the reader and may not be
incorporated in any commercial programs, other books, database, or any kind of software
without written consent of the publisher. Making copies of this book or any portion thereof
for any purpose other than your own is a violation of copyright laws.
This edition has been published by Coimbatore Institute of Information Technology,
Coimbatore.
Limits of Liability/Disclaimer of Warranty: The authors and publisher have used their
best efforts in preparing this Bioinformatics book, and the authors make no representations
or warranties with respect to the accuracy or completeness of the contents of this book, and
specifically disclaim any implied warranties of merchantability or fitness for any particular
purpose. There are no warranties which extend beyond the descriptions contained in this
paragraph. No warranty may be created or extended by sales representatives or written sales
materials. Neither CiiT nor the authors shall be liable for any loss of profit or any other
commercial damage, including but not limited to special, incidental, consequential, or other
damages.
Trademarks: All brand names and product names used in this book are trademarks,
registered trademarks, or trade names of their respective holders.
ISBN 978-81-939960-6-5
This book is printed on 70 gsm paper.
Printed in India by Mahasagar Technologies.
Coimbatore Institute of Information Technology,
#156, 3rd Floor, Kalidas Road, Ramnagar,
Coimbatore – 641009, Tamil Nadu, India.
Phone: 0422-4377821
www.ciitresearch.org
Authors
Dr. M. S Vijaya
Mrs. G. Sophia Reena
Mrs. R. Jayasree
ABOUT DBT STAR COLLEGE SCHEME
The STAR COLLEGE SCHEME was initiated by the Department of
Biotechnology, New Delhi, to support colleges and university departments
offering undergraduate education in improving science teaching. The
programme aims to improve the skills of teachers through faculty
training, an improved curriculum, and an emphasis on practical training for
students by providing access to specialized infrastructure. Five of our
science departments, namely Botany, Zoology, Physics, Chemistry, and
Mathematics, have been included in the Star College Scheme since the academic
year 2012 – 2013, and these departments were elevated to star status
in the year 2017. In the academic year 2016 – 2017, the DBT Star College
Scheme was extended to three more departments of our institution, namely
Computer Science, Information Technology, and Computer Applications.
Under this scheme, the curriculum has been updated with
interdisciplinary topics on bio-computational methods for the students of
Information Technology. To enhance interdisciplinary knowledge among
students and faculty members, the department organized various guest
lectures, workshops, and faculty development programmes on topics such as
Computational Biology and Bio-Programming with Python and Perl. The students
were given the opportunity to do internships and projects of an interdisciplinary
nature and were able to visit various research institutes such as CCMB, Trans
University, and CBNR. This scheme has enhanced the technical skills of
students in processing biological data and motivated them to take up projects in
related fields, which has improved their placement opportunities in research
laboratories.
R PROGRAMMING
OBJECTIVES OF THE COURSE
This course is designed to build foundation skills in the R programming
language. The primary goal of this course is to familiarize participants
with R syntax. Participants will also learn about the different databases
available online, learn about sequence alignment and similarity, and
predict gene and protein sequences.
COURSE OUTCOMES
Analyze protein sequencing and nucleic acid sequencing.
Find protein interaction, activity, modification, and function.
Predict genes and establish a genomic library.
Carry out drug design and discovery using data from functional genomics and proteomics.
Seek jobs in the areas of pharmaceuticals, drug design, forensic science, and the analysis of hereditary disease.
CONTENTS
S. No   Title of the Paper
01      Retrieve Genome Sequence Data
02      Shell Script To Search And Retrieve Sequences
03      Finding The Name of The Input Sequence Using Grep Command Using Shell Script
04      Analyse And Visualize Graph For Protein Interaction Data
Bioinformatics
Department of Information Technology 1
1. RETRIEVE GENOME SEQUENCE DATA
AIM: To retrieve genome sequence data via the NCBI website using the seqinr
package and write it to a FASTA file.
ALGORITHM:
Step 1: Start -> All programs -> R tool.
Step 2: Include the package seqinr using the library() function. The require()
function is also called inside getncbiseq() so that the function loads seqinr itself.
Step 3: Store the names of the ACNUC databases in dbs, then find the length of dbs
and store it in numdbs.
Step 4: Using a for loop, choose each database in turn with choosebank().
Step 5: Check if the sequence is in the ACNUC database db using the accession
number and store the result in resquery.
Step 6: If the check succeeds, query the database with the accession number and
store the result in query2.
Step 7: Use the getSequence() function to extract the retrieved sequence. If the
accession is not found in any database, print an error message.
Step 8: Close the bank and return the sequence.
Step 9: Pass the accession number to the getncbiseq() function.
Step 10: Write the result using write.fasta() with file.out="den1.fasta".
Step 11: Stop the process.
CODING:
library(seqinr)
getncbiseq <- function(accession)
{
  require("seqinr") # this function requires the SeqinR R package
  # first find which ACNUC database the accession is stored in:
  dbs <- c("genbank","refseq","refseqViruses","bacterial")
  numdbs <- length(dbs)
  for (i in 1:numdbs)
  {
    db <- dbs[i]
    choosebank(db)
    # check if the sequence is in ACNUC database 'db':
    resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE)
    if (!(inherits(resquery, "try-error"))) {
      queryname <- "query2"
      thequery <- paste("AC=", accession, sep="")
      query2 <- query(queryname, thequery)
      # see if a sequence was retrieved:
      seq <- getSequence(query2$req[[1]])
      closebank()
      return(seq)
    }
    closebank()
  }
  print(paste("ERROR: accession",accession,"was not found"))
}
# retrieve the DEN-1 Dengue virus genome sequence by its accession number:
dengueseq <- getncbiseq("NC_001477")
# write the sequence to a FASTA file:
write.fasta(names="DEN-1",sequences=dengueseq, file.out="den1.fasta")
OUTPUT:
RESULT:
Thus the genome sequence data is retrieved and written to a FASTA file, and the
output is verified.
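The database fallback used in getncbiseq() (try each bank in turn and return on the first successful query) can be tried without a network connection. The following is a minimal sketch of that pattern only, not the seqinr implementation: toy_banks, lookup_in_bank() and find_accession() are hypothetical names, and the accession numbers and sequences are made-up toy data.

```r
# Hypothetical stand-in for an online lookup: each "bank" is a named list
# mapping accession numbers to sequences (toy data, not real records).
toy_banks <- list(
  genbank = list(AB_TOY01 = "atgcatgc"),
  refseq  = list(NC_TOY01 = "ggccttaa")
)

# Look up an accession in one bank; signal an error if it is absent,
# mimicking a failed database query.
lookup_in_bank <- function(bank, accession) {
  records <- toy_banks[[bank]]
  if (is.null(records[[accession]])) {
    stop("accession not in bank: ", bank)
  }
  records[[accession]]
}

# Try each bank in turn and return the first successful result.
find_accession <- function(accession, banks = names(toy_banks)) {
  for (bank in banks) {
    result <- try(lookup_in_bank(bank, accession), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
  }
  stop("accession ", accession, " was not found in any bank")
}

found <- find_accession("NC_TOY01")
print(found)
```

Unlike the program above, which prints an error message, this sketch calls stop() when no bank contains the accession, so that a failed lookup cannot be mistaken for a sequence.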
ANNEXURE I
1. Subject : Information Technology
2. Area of experiment : R Tool
3. Experiment Details:
3.1 The task of retrieving genome sequence data via the NCBI website
using the seqinr package and writing it to a FASTA file will be assigned to each
student.
3.2 Duration of experiment: 3 Hours.
3.3 List of recurring and non-recurring instruments / equipment required,
sources for procurement of the required items.
Recurring: Computer with Internet.
Non-Recurring: NIL
3.4 Recurring and non-recurring cost per experiment.
Recurring:
S. No   Description              Quantity   Rate (Rs)   Amount (Rs)
1.      Computer with internet   59         -           -
        TOTAL                                           -
Non-Recurring: NIL
4. Learning outcomes
4.1. Principles and skills learnt.
4.1.1. Brief introduction into basic functionalities of R.
4.1.2. Understand basic packages in R.
4.2 Further scope and application.
Bioinformatics is an interdisciplinary field of study that combines biology
with computer science to understand biological data. R packages provide
a powerful toolkit that makes the process of retrieving genome sequence data
via the NCBI website easy and intuitive. Based on the result of this experiment,
the students may take up a group project to build bioinformatics-related
applications using the R tool.
4.3 Safety aspects: NA
4.4. Regulatory requirements, if any: NA
4.5. Possible waste generated and disposal method: NA
4.6. IT support and other support needed.
4.6.1 Screen shots of the programme can be used in future for
demonstration.
4.7. Troubleshooting and bottlenecks expected.
4.7.1. Before starting the programme, ensure that the system has a
working internet connection.
5. List of references supporting the experiment and the relevant
references
i. https://www.youtube.com/watch?v=sK3ykyInU8o
ii. https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/
6. Any other relevant information: NA
2. SHELL SCRIPT TO SEARCH AND RETRIEVE SEQUENCES
AIM: To write a program to search and to retrieve the sequences and
MEDLINE References using ENTREZ database.
ALGORITHM:
Step 1: Start -> All programs -> R tool.
Step 2: Include the rentrez package.
Step 3: Store the gene IDs in gene_ids.
Step 4: Use entrez_link() to link the gene database to the nuccore database with
dbfrom="gene", id=gene_ids and db="nuccore", and store the result in linked_seq_ids.
Step 5: Extract the linked RefSeq RNA transcript IDs from linked_seq_ids and store
them in linked_transcripts.
Step 6: Pass linked_transcripts as the parameter to head().
Step 7: Fetch the data using entrez_fetch() with db="nuccore", id=linked_transcripts
and rettype="fasta", and store it in all_recs.
Step 8: Pass all_recs as the parameter to class().
Step 9: Display the first 500 characters of the records using
cat(strwrap(substr(all_recs, 1, 500)), sep="\n").
Step 10: Write the fetched records to a file using file="my_transcripts.fasta".
CODING & OUTPUT:
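library(rentrez)
# gene IDs to look up in the NCBI gene database
gene_ids <- c(351, 11647)
# find nuccore records linked to these genes
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
# keep only the linked RefSeq RNA transcripts
linked_transcripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transcripts)
# fetch the transcripts in FASTA format
all_recs <- entrez_fetch(db="nuccore", id=linked_transcripts, rettype="fasta")
class(all_recs)
# show the first 500 characters of the fetched records
cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
# save the records to a FASTA file
write(all_recs, file="my_transcripts.fasta")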
RESULT:
Thus the program to search and retrieve sequences using the ENTREZ database is
executed and the output is verified.
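The fetched FASTA records are held as plain text, so they can be inspected further with base R alone. The following is a minimal sketch, assuming the text is a single character string in which every record has one header line starting with ">" followed by at least one sequence line. The variable toy_recs and its contents are made-up stand-ins, not data returned by NCBI.

```r
# Toy stand-in for FASTA text held in one character string
# (made-up records, not real NCBI data).
toy_recs <- paste(
  ">toy_1 example transcript one",
  "ATGGCGTACG",
  "TTGACC",
  ">toy_2 example transcript two",
  "GGGCCCAAAT",
  "",
  sep = "\n"
)

# Split the text into lines and drop empty ones.
fasta_lines <- strsplit(toy_recs, "\n", fixed = TRUE)[[1]]
fasta_lines <- fasta_lines[nzchar(fasta_lines)]

# Header lines start with ">"; every other line is sequence data.
is_header <- grepl("^>", fasta_lines)
headers <- sub("^>", "", fasta_lines[is_header])

# Number the records, then join the sequence lines within each record.
record_id <- cumsum(is_header)
seqs <- tapply(fasta_lines[!is_header], record_id[!is_header],
               paste, collapse = "")

# Name each sequence by the first word of its header.
seqs <- setNames(as.vector(seqs), sub(" .*$", "", headers))

print(headers)
print(nchar(seqs))
```

Counting the header lines gives the number of records fetched, and nchar() gives the length of each sequence.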
ANNEXURE II
1. Subject : Information Technology
2. Area of experiment : R Tool
3. Experiment Details:
3.1 The task of searching and retrieving the sequences and
MEDLINE references using the ENTREZ database will be assigned to each student.
3.2 Duration of experiment: 3 Hours.
3.3 List of recurring and non-recurring instruments / equipment
required, sources for procurement of the required items.
Recurring: Computer with Internet.
Non-Recurring: NIL
3.4 Recurring and non-recurring cost per experiment.
Recurring:
S. No   Description              Quantity   Rate (Rs)   Amount (Rs)
1.      Computer with internet   59         -           -
        TOTAL                                           -
Non-Recurring: NIL
4. Learning outcomes
4.1. Principles and skills learnt.
4.1.1. Brief introduction into basic functionalities of R.
4.1.2. Understand basic packages in R.
4.2 Further scope and application.
Bioinformatics is an interdisciplinary field of study that combines biology
with computer science to understand biological data. R packages provide
a powerful toolkit that makes searching and retrieving sequences
and MEDLINE references using the ENTREZ database easy and intuitive.
Based on the result of this experiment, the students may take up a group
project to build bioinformatics-related applications using the R tool.
4.3 Safety aspects: NA
4.4. Regulatory requirements, if any: NA
4.5. Possible waste generated and disposal method: NA
4.6. IT support and other support needed.
4.6.1 Screen shots of the programme can be used in
future for demonstration.
4.7. Troubleshooting and bottlenecks expected.
4.7.1. Before starting the programme, ensure that the system has a
working internet connection.
5. List of references supporting the experiment and the relevant
references
i. http://www.powershow.com/view/21d300-MDVmZ/Retrieving_Information_Using_Entrez_powerpoint_ppt_presentation
ii. https://www.ncbi.nlm.nih.gov/books/NBK184582/
6. Any other relevant information: NA.
3. FINDING THE NAME OF THE INPUT SEQUENCE USING GREP
COMMAND USING SHELL SCRIPT
AIM: To write a shell script that uses the grep command to find the name of
the input sequence and the top hit in a BLAST output file without viewing the
file directly.
ALGORITHM:
Step 1: Start -> All programs -> R tool.
Step 2: Download the 16S Microbial database from the NCBI website using
download.file() and choose mode="wb".
Step 3: Include the package "rBLAST" using the library function.
Step 4: Specify the URL of the database on the NCBI FTP site in the download.file()
call, and extract the downloaded archive using untar().
Step 5: Load some test data with readRNAStringSet() from the example file in
package="rBLAST" and store it in seq.
Step 6: Typing seq at the R> prompt displays the RNAStringSet instance and its length.
Step 7: Load a BLAST database using blast() and store it in bl. Replace db with the
location and name of the BLAST database.
Step 8: Use the predict function, passing bl and a query sequence as parameters,
and store the result in cl.
Step 9: Print the first five BLAST hits from cl.
Step 10: Stop the process.
CODING & OUTPUT:
(Lines beginning with R> or + are commands; the remaining lines are the output.)

## download the 16S Microbial database from NCBI
R> download.file("ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz",
+    "16SMicrobial.tar.gz", mode='wb')
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz'
ftp data connection made, file length 4192539 bytes
==================================================
downloaded 4.0 MB

R> library(rBLAST)
R> untar("16SMicrobial.tar.gz", exdir="16SMicrobialDB")

## load some test data
R> seq <- readRNAStringSet(system.file("examples/RNA_example.fasta",
+    package="rBLAST"))
R> seq
  A RNAStringSet instance of length 5
    width seq                                                                    names
[1]  1481 AGAGUUUGAUCCUGGCUCAGAACGAACGCUGGCG...UGGUGAUCGGGGUGAAGUCGUAACAAGGUAACC 1675
[2]  1404 GCUGGCGGCAGGCCUAACACAUGCAAGUCGAACG...GGCAGCCGACCACGGUAAGGUCAGCGACUGGGG 4399
[3]  1426 GGAAUGCUNAACACAUGCAAGUCGCACGGGCAGC...GUGUAGUCGNAACAAGGUAGCCGUAGGGGAACC 4403
[4]  1362 GCUGGCGGAAUGCUUAACACAUGCAAGUCGCACG...GAGUUGGUUUUACCUUAGGUGUCUAGGCUAACC 4404
[5]  1458 AGAGUUUGAUUAUGGCUCAGAGCGAACGCUGGCG...GCGACUGGGGUGAAGUCGUAACAAGGUAACCGU 4411

## load a BLAST database (replace db with the location + name of the BLAST DB)
R> bl <- blast(db="./16SMicrobialDB/16SMicrobial")
R> bl
BLAST Database
Location: /home/hahsler/baR/rMSA/16SMicrobialDB/16SMicrobial
Database: 16S Microbial Sequences
    17,624 sequences; 25,680,771 total bases
Date: Jun 2, 2015 12:00 AM    Longest sequence: 2,211 bases
Volumes:
    /home/hahsler/baR/rMSA/16SMicrobialDB/16SMicrobial

## query a sequence using BLAST
R> cl <- predict(bl, seq[1,])
R> cl[1:5,]
  QueryID                     SubjectID Perc.Ident Alignment.Length Mismatches Gap.Openings Q.start
1    1675 gi|559795231|ref|NR_104821.1|      90.82             1459        124            8      16
2    1675 gi|444304125|ref|NR_074549.1|      85.99             1249        158           15     235
3    1675 gi|444304125|ref|NR_074549.1|      94.20               69          4            0       1
4    1675 gi|645320383|ref|NR_117601.1|      85.99             1249        158           15     235
5    1675 gi|645320383|ref|NR_117601.1|      93.85               65          4            0       5
  Q.end S.start S.end     E Bits
1  1468       5  1459 0e+00 1943
2  1478     247  1483 0e+00 1321
3    69       1    69 3e-22  106
4  1478     243  1479 0e+00 1321
5    69       1    65 6e-20   99
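A hit table of this shape can also be reduced to one best hit per query in base R. The following is a minimal sketch, assuming that "best" means the highest bit score. The data frame hits is a made-up toy table that reuses only the column names QueryID, SubjectID, Perc.Ident and Bits from the output above; its values are not BLAST results.

```r
# Toy hit table using the column names shown in the output above
# (values are made up for illustration).
hits <- data.frame(
  QueryID    = c("q1", "q1", "q2", "q2", "q2"),
  SubjectID  = c("s_a", "s_b", "s_c", "s_d", "s_e"),
  Perc.Ident = c(90.8, 86.0, 94.2, 99.1, 85.5),
  Bits       = c(1943, 1321, 106, 2100, 99),
  stringsAsFactors = FALSE
)

# Sort by descending bit score, then keep the first row seen for each query.
ranked <- hits[order(hits$Bits, decreasing = TRUE), ]
top_hits <- ranked[!duplicated(ranked$QueryID), ]

# Restore a stable order by query name and tidy the row names.
top_hits <- top_hits[order(top_hits$QueryID), ]
rownames(top_hits) <- NULL

print(top_hits)
```

Sorting once and then dropping duplicated query names avoids an explicit loop over queries.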
RESULT:
Thus the shell script to find the name of the input sequence using grep
command is verified.
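The grep step named in the aim can also be tried from within R, whose grep() function searches a character vector for a pattern. The following is a minimal sketch, assuming a plain-text BLAST report in which the query is introduced by a line starting with "Query=" and the hits are listed after a line starting with "Sequences producing significant alignments". The report contents, report_file and all names in it are made-up toy values written to a temporary file.

```r
# Write a toy BLAST-style report to a temporary file
# (made-up excerpt, not real BLAST output).
report <- c(
  "BLASTN (toy excerpt)",
  "",
  "Query= toy_query_1 example input sequence",
  "",
  "Sequences producing significant alignments:   (Bits)  Value",
  "",
  "toy_subject_A example hit                      1943    0.0",
  "toy_subject_B example hit                      1321    0.0"
)
report_file <- tempfile(fileext = ".txt")
writeLines(report, report_file)

# Read the file into a character vector, one element per line.
report_lines <- readLines(report_file)

# grep(..., value = TRUE) returns the matching lines themselves.
query_line <- grep("^Query=", report_lines, value = TRUE)
query_name <- sub("^Query=\\s*(\\S+).*$", "\\1", query_line[1])

# The top hit is the first non-empty line after the alignments header.
header_at <- grep("^Sequences producing significant alignments", report_lines)[1]
after_header <- report_lines[(header_at + 1):length(report_lines)]
top_hit_line <- after_header[nzchar(trimws(after_header))][1]
top_hit_name <- sub("\\s.*$", "", top_hit_line)

print(query_name)
print(top_hit_name)
```

Only the two extracted names are printed, so the report itself never has to be opened and read.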
ANNEXURE III
1. Subject : Information Technology
2. Area of experiment : R Tool
3. Experiment Details:
3.1 The task of finding the name of the input sequence using the grep
command in a shell script will be assigned to each student.
3.2 Duration of experiment: 3 Hours.
3.3 List of recurring and non-recurring instruments / equipment required,
sources for procurement of the required items.
Recurring: Computer with Internet.
Non-Recurring: NIL
3.4 Recurring and non-recurring cost per experiment.
Recurring:
S. No   Description              Quantity   Rate (Rs)   Amount (Rs)
1.      Computer with internet   59         -           -
        TOTAL                                           -
Non-Recurring: NIL
4. Learning outcomes
4.1. Principles and skills learnt.
4.1.1. Brief introduction into basic functionalities of R.
4.1.2. Understand basic packages in R.
4.2 Further scope and application.
Bioinformatics is an interdisciplinary field of study that combines biology
with computer science to understand biological data. R packages provide
a powerful toolkit that makes the process of finding the name of the input
sequence using the grep command easy. Based on the result of this experiment,
the students may take up a group project to build bioinformatics-related
applications using the R tool.
4.3 Safety aspects: NA
4.4. Regulatory requirements, if any: NA
4.5. Possible waste generated and disposal method: NA
4.6. IT support and other support needed.
4.6.1 Screen shots of the programme can be used in
future for demonstration.
4.7. Troubleshooting and bottlenecks expected.
4.7.1. Before starting the programme, ensure that the system has a
working internet connection.
5. List of references supporting the experiment and the relevant
references
i. https://www.youtube.com/watch?v=YXROs4pnHO0
ii. https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/grep
6. Any other relevant information: NA
4. ANALYSE AND VISUALIZE GRAPH FOR PROTEIN INTERACTION
DATA
AIM: To write a program to analyze and visualize graph for Protein
Interaction Data.
ALGORITHM:
Step 1: Start -> All programs -> R tool.
Step 2: Include the libraries "graph" and "yeastExpData".
Step 3: Use data() to load "litG" and display the number of nodes and edges by
typing litG.
Step 4: Include the library "RBGL", pass litG as the parameter to the function
connectedComp() and store the result in myconnectedcomponents. Display
individual components by indexing into the result.
Step 5: Pass myconnectedcomponents to the length() function, which
displays the number of connected components.
Step 6: Include the library "Rgraphviz".
Step 7: Store the third connected component in component3, pass component3 and
litG to the subGraph() function and store the result in mysubgraph.
Step 8: Use the function layoutGraph() by passing mysubgraph with layoutType
"neato" and store the result in mygraphplot.
Step 9: Use the renderGraph() function to display the graph by passing
mygraphplot.
Step 10: Stop the process.
CODING:
R> library("graph")
R> library("yeastExpData")
R> data("litG")
R> litG
A graphNEL graph with undirected edges
Number of Nodes = 2885
Number of Edges = 315
R> library("RBGL")
R> myconnectedcomponents <- connectedComp(litG)
R> myconnectedcomponents[[1]]
[1] "YBL072C"
R> myconnectedcomponents[[2]]
[1] "YBL083C"
R> myconnectedcomponents[[3]]
 [1] "YBR009C" "YBR010W" "YNL030W" "YNL031C" "YOL139C" "YAR007C" "YBR073W"
 [8] "YER095W" "YJL173C" "YNL312W" "YBL084C" "YDR146C" "YLR127C" "YNL172W"
[15] "YLR134W" "YMR284W" "YER179W" "YIL144W" "YML104C" "YOR191W" "YDL008W"
[22] "YDL030W" "YDL042C" "YDR004W" "YGR162W" "YMR117C" "YDR386W" "YDR485C"
[29] "YDL043C" "YDR118W" "YMR106C" "YML032C" "YDR076W" "YDR180W" "YDL013W"
[36] "YDR227W"
R> length(myconnectedcomponents)
[1] 2642
R> library("Rgraphviz")
R> component3 <- myconnectedcomponents[[3]]  # proteins in the third connected component
R> mysubgraph <- subGraph(component3, litG)
R> mygraphplot <- layoutGraph(mysubgraph, layoutType="neato")
R> renderGraph(mygraphplot)
OUTPUT:
RESULT:
Thus the protein interaction data is analyzed and visualized as a graph, and the output is verified.
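A connected component is a set of proteins that can all reach one another through interactions, and a protein with no interactions forms a component of size one, like the first two components listed above. The following is a minimal base R sketch of that idea using breadth-first search on a toy graph. It is an illustration of the concept and not the RBGL implementation: find_components() and the protein names P1 to P6 are hypothetical.

```r
# Toy undirected interaction list (hypothetical protein names).
edges <- data.frame(
  from = c("P1", "P2", "P4"),
  to   = c("P2", "P3", "P5"),
  stringsAsFactors = FALSE
)
nodes <- c("P1", "P2", "P3", "P4", "P5", "P6")  # P6 has no interactions

# Build an adjacency list: each node maps to the nodes it interacts with.
adjacency <- setNames(vector("list", length(nodes)), nodes)
for (i in seq_len(nrow(edges))) {
  a <- edges$from[i]
  b <- edges$to[i]
  adjacency[[a]] <- c(adjacency[[a]], b)
  adjacency[[b]] <- c(adjacency[[b]], a)
}

# Breadth-first search from every node that has not been visited yet.
find_components <- function(nodes, adjacency) {
  visited <- setNames(rep(FALSE, length(nodes)), nodes)
  components <- list()
  for (start in nodes) {
    if (visited[[start]]) next
    queue <- start
    visited[[start]] <- TRUE
    members <- character(0)
    while (length(queue) > 0) {
      current <- queue[1]
      queue <- queue[-1]
      members <- c(members, current)
      for (neighbour in adjacency[[current]]) {
        if (!visited[[neighbour]]) {
          visited[[neighbour]] <- TRUE
          queue <- c(queue, neighbour)
        }
      }
    }
    components[[length(components) + 1]] <- members
  }
  components
}

components <- find_components(nodes, adjacency)
print(length(components))
print(components)
```

In this toy graph P1, P2 and P3 form one component, P4 and P5 form a second, and the isolated P6 forms a third.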
ANNEXURE IV
1. Subject : Information Technology
2. Area of experiment : R Tool
3. Experiment Details:
3.1 The task of analyzing and visualizing a graph for protein interaction
data will be assigned to each student.
3.2 Duration of experiment: 3 Hours.
3.3 List of recurring and non-recurring instruments / equipment
required, sources for procurement of the required items.
Recurring: Computer with Internet.
Non-Recurring: NIL
3.4 Recurring and non-recurring cost per experiment.
Recurring:
S. No   Description              Quantity   Rate (Rs)   Amount (Rs)
1.      Computer with internet   59         -           -
        TOTAL                                           -
Non-Recurring: NIL
4. Learning outcomes
4.1. Principles and skills learnt.
4.1.1. Brief introduction into basic functionalities of R.
4.1.2. Understand basic packages in R.
4.2 Further scope and application.
Bioinformatics is an interdisciplinary field of study that combines biology
with computer science to understand biological data. R packages provide
a powerful toolkit that makes the process of manipulating and visualizing data
easy and intuitive. Based on the result of this experiment, the students
may take up a group project to build bioinformatics-related applications using
the R tool.
4.3 Safety aspects: NA
4.4. Regulatory requirements, if any: NA
4.5. Possible waste generated and disposal method: NA
4.6. IT support and other support needed.
4.6.1 Screen shots of the programme can be used in future for
demonstration.
4.7. Troubleshooting and bottlenecks expected.
4.7.1. Before starting the programme, ensure that the system has a
working internet connection.
5. List of references supporting the experiment and the relevant references
i. https://www.youtube.com/watch?v=UYclmg1_KLk
ii. https://www.r-bloggers.com/obtaining-a-protein-protein-interaction-network-for-a-gene-list-in-r
6. Any other relevant information: NA