Practice – file types (Cont.)
Load the “Mysequence.doc” file into Webcutter using “Choose file” and then “Upload sequence file”.
- Notice that the “sequence” in the sequence box is a string of nonsense characters: a .doc file stores formatting data along with the text.
Clear the input, then browse to and load the .txt file. Run an analysis.
Always keep your sequences in .txt files for downstream analysis.
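The point about .doc versus .txt can be checked in a script before a file is uploaded anywhere. The sketch below is only an illustration: the function name `looks_like_plain_sequence` and the allowed alphabet (IUPAC nucleotide codes plus a gap character) are assumptions, not part of the course material.

```python
from pathlib import Path

# Assumption: IUPAC nucleotide letters plus a gap; widen this set for proteins.
ALLOWED = set("ACGTUNRYKMSWBDHV-")


def looks_like_plain_sequence(path):
    """Return True if the file is ASCII text holding only sequence letters."""
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        # Binary formats such as .doc contain non-ASCII bytes and fail here.
        return False
    lines = [line.strip() for line in text.splitlines()]
    # Skip blank lines and FASTA header lines that start with '>'.
    residues = "".join(l for l in lines if l and not l.startswith(">"))
    return bool(residues) and set(residues.upper()) <= ALLOWED
```

A file that fails this check is probably not safe to paste into a sequence analysis tool.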
Representation of sequence
The need to represent associated info with sequence
• Structured data entry
• Specialized databases
  – 3-D structure
  – Mutation / diseases
  – Protein family / protein domain
  – Interaction
  – Pathway
  – …
Representation of sequence
The need to represent associated info with sequence
• Structured data entry
• Specialized databases
• Complex / customized data structure
  – Object-oriented data representation (Mount, p. 44-45)
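An object-oriented representation bundles a sequence with its annotations in one structure. The following is a minimal sketch of that idea, not the scheme described in Mount: the class names `SequenceRecord` and `Feature`, their fields, and the 1-based inclusive coordinates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One annotated region of a sequence (hypothetical layout)."""
    kind: str          # e.g. "domain"
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    note: str = ""


@dataclass
class SequenceRecord:
    """A sequence together with its associated information."""
    accession: str
    name: str
    organism: str
    sequence: str
    features: list = field(default_factory=list)
    references: list = field(default_factory=list)

    def feature_sequence(self, feature):
        """Return the residues covered by one feature."""
        return self.sequence[feature.start - 1:feature.end]
```

With made-up data, `SequenceRecord("DEMO1", "demo", "Homo sapiens", "MKTAYIAKQR")` plus a `Feature("domain", 2, 4)` lets the record answer questions about itself, such as which residues a domain covers.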
Public Resources for Bioinformatics
• Databases
• Analysis Tools
Observe: the lists of databases and services at NCBI, EBI, KEGG, and Ensembl.
Pet Project: MDM2, or your favorite gene
What can we know about this gene? Search the “curated” databases.
To prepare for future analysis, save annotated sequence files as genename.html (in a target folder).
For downstream sequence analysis, save the pure sequence as a FASTA format file.
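A FASTA file is a header line starting with ">" followed by the sequence, usually wrapped to a fixed width. The helper below sketches how a pure sequence could be written in that format; the function name `to_fasta` and the 60-character line width are assumptions for illustration.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as FASTA text: '>' header, then wrapped residues."""
    # Drop spaces, digits-free line breaks and other whitespace left by copy/paste.
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

Writing the returned string to a plain text file (for example with `open(path, "w")`) gives a file that most sequence analysis tools accept.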
Where is information about my gene available, and how much?
Observe: The information content and presentation format for the same gene in SwissProt, NCBI Protein, NCBI Gene, etc.
Public Resources (I) – Databases and data sources
A sea of databases: over 1,000.
Content-specific, such as DNA, protein, structure, etc.
Species-specific, such as FlyBase, WormBase, OMIM, etc.
System-specific, such as MetaCyc, AFCS, etc.
Database concept:
Database – efficiently store, update, and retrieve information (data).
Database management systems – Access, Sybase, MySQL, Oracle, etc.
Types of databases – relational DB, object DB, native XML DB.
Database concept – tables in relational databases
Protein table
| Accession | Organ. | Ref. | Name | Key words | Features | … |
|---|---|---|---|---|---|---|
| … | … | medline1 | TNF | … | … | … |
| … | … | medline2 | P53 | … | … | … |
Query: “TNF” = TNF[All Fields] searches every column; TNF[Name] searches only the Name column.
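The difference between an all-fields query and a field-restricted query can be sketched with a toy table held as a list of dictionaries. This is not how Entrez is implemented; the row contents, the column keys, and the two function names are invented for illustration.

```python
# Toy protein table; every value here is a placeholder, not real data.
PROTEINS = [
    {"accession": "ACC1", "ref": "medline1", "name": "TNF",
     "keywords": "cytokine"},
    {"accession": "ACC2", "ref": "medline2", "name": "P53",
     "keywords": "tumor suppressor; TNF response"},
]


def search_all_fields(rows, term):
    """Like TERM[All Fields]: match the term in any column."""
    term = term.lower()
    return [r for r in rows if any(term in str(v).lower() for v in r.values())]


def search_field(rows, column, term):
    """Like TERM[Name]: match the term in one named column only."""
    return [r for r in rows if r[column].lower() == term.lower()]
```

Here the all-fields search for "TNF" returns both rows, because the second row mentions TNF in its key words, while the name-restricted search returns only the first.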
Database concept – relationship between tables
Protein table
| Accession | Organ. | Ref. | Name | Key words | Features | … |
|---|---|---|---|---|---|---|
| … | … | medline1 | P27 | … | … | … |
| … | … | medline2 | P53 | … | … | … |
Reference table
| ID | title | year | author | abstract | … |
|---|---|---|---|---|---|
| medline1 | … | 1970 | … | … | … |
| medline2 | … | 1980 | … | … | … |
The Ref. column of the protein table points to the ID column of the reference table.
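The link between the two tables can be tried out with Python's built-in `sqlite3` module. The sketch below is a toy under stated assumptions: the column subset, the accession values, and the titles are placeholders, and only the medline IDs, names, and years come from the slide.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a protein and a reference table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE reference (id TEXT PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE protein (
            accession TEXT PRIMARY KEY,
            name TEXT,
            ref TEXT REFERENCES reference(id)
        );
        INSERT INTO reference VALUES ('medline1', 'placeholder title', 1970);
        INSERT INTO reference VALUES ('medline2', 'placeholder title', 1980);
        INSERT INTO protein VALUES ('ACC1', 'P27', 'medline1');
        INSERT INTO protein VALUES ('ACC2', 'P53', 'medline2');
    """)
    return conn


def references_for(conn, protein_name):
    """Follow protein.ref to reference.id and return (id, year) pairs."""
    cursor = conn.execute(
        "SELECT r.id, r.year FROM protein AS p "
        "JOIN reference AS r ON p.ref = r.id WHERE p.name = ?",
        (protein_name,),
    )
    return cursor.fetchall()
```

Storing the reference once and pointing to it by ID means the title, year, and abstract do not have to be repeated in every protein row that cites it.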
Observe/Practice
Search for MDM2 in the Gene database and in the Protein database.
Search for MDM2 in “All Text” vs. “gene name” in the Gene database. Compare the results.
Download the human MDM2 protein sequences for all 8 isoforms.
Public Resources (II) – Analysis tools
• Web-based analysis tools – easy to use, but often with fewer customization options.
• Stand-alone analysis tools – require installation and configuration, but provide more customization options.
• Commercial analysis tools
• Scripting for bioinformatics projects
Practice: navigating the related resources through links
Using the “PubMed” link, search annotated references on MDM2.
Using the “GEO Profiles” link, search gene expression information on MDM2.
Using the “Map Viewer” link, observe the chromosome location and gene structure of the MDM2 locus. Change the “Map Viewer” options to include the prediction of CpG islands.
Public Resources for Bioinformatics
•Databases : how to find relevant information.
•Analysis Tools
Web-based tools
• Identification of web-based bioinformatics resources
  – Portals, lists
  – Google search
• Organization
  – Bookmarks
  – HTML page
Web-based tools
Practice – retrieve a genomic sequence from Ensembl and perform reverse complementation with SMS.
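Reverse complementation itself is a two-step operation: swap each base for its partner, then reverse the string. The sketch below shows that logic in plain Python; the function name `reverse_complement` is an assumption, and only A, C, G, T, and N are handled (other characters pass through unchanged).

```python
# Translation table mapping each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]
```

For example, `reverse_complement("ATGCN")` gives `"NGCAT"`.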
Stand-alone tools 1.
Rules of thumb:
• Make a folder for each program.
• Make a sub-folder for input/output if necessary.
• Link GUI-based .exe applications to the program menu.
Stand-alone tools 2.
Practice – the ClustalX application.
1. Download the zip file to the GMS6014 folder.
2. Unzip the files to a folder named “clustalx”.
3. Edit the 3TNF file with WordPad and save.
4. Activate the .exe file.
5. Load the sequence file, select sequences, perform the alignment.
6. Write the alignment to a ps file.
Stand-alone tools 3.
Command line applications: these account for a large number of high-quality, sophisticated programs.
Practice – (install and) run standalone BLAST on your own computer.
Pet Projects:
Identifying the ortholog of MDM2 (or of tumor necrosis factor, TNF) in an insect genome.
Practice – Install the BLAST program (1)
1. Download the BLAST executable file and save it in a folder, such as c:\GMS6014\blast\
2. Run the installation program by double-clicking it. Inspect the folder after installation.
3. Add three more folders to your blast directory: “query”, “dbs”, and “out”.
Practice – Install the BLAST program (2)
4. Inspect the contents of the doc, data, and bin folders. Move the programs from blast\bin to the blast folder.
5. Bring up a command (cmd) window by typing “cmd” in the Start → Run box.
6. Go to the blast folder by typing “cd C:\GMS6014\blast”.
7. Try to run the program by typing “blastall”, and read the output.
Practice – BLAST search on your own computer
1. Download the data file from the course web page, or from Ensembl. Save it in the blast\dbs folder.
2. Start a CMD window and navigate to the C:\GMS6014\blast folder.
3. At the prompt “C:\GMS6014\blast >”, type the command “formatdb -i dbs\Dm.P -p T” to format the dataset for the program.
4. Compose the query sequence and save it as “3TNF.txt” in the “blast\query\” folder.
5. Initiate the search by typing “blastall -p blastp -d dbs\Dm.P -i query\4_MMD2.fasta -o out\Mdm2_DmP.html -T T”, where -i names the query file to search with.
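The two commands in the steps above can also be assembled in a script, which avoids typing errors such as a word processor turning a plain hyphen into a dash. This is a sketch under assumptions: the helper names are invented, only the flags that appear in the slide are used, and the legacy `formatdb` and `blastall` programs are assumed to be on the search path when `run` is called.

```python
import subprocess


def formatdb_command(db_path, is_protein=True):
    """Build the argument list for formatting a sequence database."""
    return ["formatdb", "-i", db_path, "-p", "T" if is_protein else "F"]


def blastall_command(program, db_path, query_path, out_path, html=True):
    """Build the argument list for one search with the slide's flags."""
    cmd = ["blastall", "-p", program, "-d", db_path,
           "-i", query_path, "-o", out_path]
    if html:
        cmd += ["-T", "T"]   # ask for HTML output, as in the slide
    return cmd


def run(cmd):
    """Execute a command; raises CalledProcessError if the program fails."""
    subprocess.run(cmd, check=True)
```

Calling `run(formatdb_command(r"dbs\Dm.P"))` from the blast folder would correspond to step 3, and `run(blastall_command("blastp", r"dbs\Dm.P", query, output))` to step 5, with `query` and `output` set to your own file names.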
What’s in a command?
formatdb -i dbs\Dm.P -p T
• formatdb: the program, which formats a database for searching.
• -i dbs\Dm.P: “Feed me the input file name.”
• -p T: “Tell me, is it a protein sequence file?” (T means true.)
For more info, refer to the “user manual” file in the blast\doc folder.
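The anatomy of such a command (a program name followed by flag and value pairs) can be made explicit with a few lines of Python. The function `parse_command` is a hypothetical helper, and it assumes every flag is followed by exactly one value, which holds for the commands in these slides but not for command lines in general.

```python
def parse_command(command):
    """Split 'program -flag value ...' into the program and a flag dict."""
    tokens = command.split()
    program, rest = tokens[0], tokens[1:]
    # Pair each flag (even positions) with the value that follows it.
    options = dict(zip(rest[0::2], rest[1::2]))
    return program, options
```

Applied to the command above, it returns the program `formatdb` and the options `-i` (input file) and `-p` (protein or not).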
Advantages of Running BLAST on Your Own Machine
Do it at any time, with no waiting in a queue.
Search for multiple sequences at once.
Search a defined data set.
Automate BLAST analysis.
Combine BLAST with other analyses.
…
BLAST is a program implemented in C/C++
void BlastTickProc(Int4 sequence_number, BlastThrInfoPtr thr_info)
{
    if (thr_info->tick_callback &&
        (sequence_number > (thr_info->last_db_seq + thr_info->db_incr))) {
        NlmMutexLockEx(&thr_info->callback_mutex);
        thr_info->last_db_seq += thr_info->db_incr;
        thr_info->tick_callback(sequence_number, thr_info->number_of_pos_hits);
        thr_info->last_tick = Nlm_GetSecs();
        NlmMutexUnlock(thr_info->callback_mutex);
    }
    return;
}
/*
    Sends out a message every PERIOD (i.e., 60 secs.) for the index.
    This function runs as a separate thread and only runs on a threaded
    platform.
*/
Should I care?
Programming language comparison
/* TRANSLATION: 3 or 6 frame translate cDNA sequences */
//---------------------------------------------------------------------------
#include "translation.hpp"

int main(int argc, char **argv)
{
    int num_seq = 0;
    char string[MAXLINE];
    DSEQ *dseq;

    infile.getline(string, MAXLINE);
    if (string[0] == '>')
        strncpy(dbname, string, MAXLINE);
    while (!infile.eof()) {
        dseq = Get_Lib_Seq();
        if (dseq->reverse == 0)
            Translation(&dseq->name[1], dseq->seq);
        else
            Translation(&dseq->name[1], dseq->r_seq);
        num_seq++;
        if (num_seq % 1000 == 0) {
            cout << num_seq << endl;
            cout << dseq->name << endl;
        }
        delete dseq;
    }
    infile.close();
    outfile.close();
    cout << num_seq << " translated" << endl;
    getch();
    return 0;
}

DSEQ *Get_Lib_Seq()
{
    int i, n;
    char str[MAXLINE];
    DSEQ *dseq;

    n = 0;
    dseq = new DSEQ;
    strcpy(dseq->name, dbname);
    while (infile.getline(str, MAXLINE)) {
        if (str[0] == '>') {          // next header: remember it and stop
            strcpy(dbname, str);
            break;
        }
        for (i = 0; i < strlen(str); i++) {
            if (n == MAXSEQ) break;
            dseq->seq[n++] = str[i];
        }
    }
    dseq->seq[n] = '\0';
    if (n == MAXSEQ)
        cout << "WARNING: sequence " << dbname << " too long!" << endl;
    dseq->len = n;
    if (dseq->name[9] == '3')
        Reverse(dseq);
    else
        dseq->reverse = 0;
    return dseq;
}

void Reverse(DSEQ *dseq)   // reverse-complement dseq->seq into dseq->r_seq
{
    int i, j;
    j = 0;
    for (i = dseq->len - 1; i >= 0; i--) {   // i >= 0 so the first base is not skipped
        if (dseq->seq[i] == 'A' || dseq->seq[i] == 'a') dseq->r_seq[j++] = 'T';
        if (dseq->seq[i] == 'C' || dseq->seq[i] == 'c') dseq->r_seq[j++] = 'G';
        if (dseq->seq[i] == 'G' || dseq->seq[i] == 'g') dseq->r_seq[j++] = 'C';
        if (dseq->seq[i] == 'T' || dseq->seq[i] == 't') dseq->r_seq[j++] = 'A';
        if (dseq->seq[i] == 'N' || dseq->seq[i] == 'n') dseq->r_seq[j++] = 'N';
    }
    dseq->r_seq[j++] = '\0';
    dseq->reverse = 1;
}

void Translation(char name[], char seq[])
{
    char ppseq[MAXSEQ / 3];
    for (int f = 0; f < 3; f++) {
        outfile << ">" << "F_" << f << name << endl;
        int j = 0;
        int len = strlen(seq);
        for (int i = f; i < len; i = i + 3)
            ppseq[j++] = Translate(&seq[i]);
        ppseq[j++] = '\0';
        int m = strlen(ppseq) / 50;   // output 50 aa per line
        for (int n = 0; n <= m; n++) {
            for (int i = n * 50; i < 50 * (n + 1); i++) {
                if (ppseq[i] == '\0') break;   // stop before writing the terminator
                outfile << ppseq[i];
            }
            outfile << endl;
        }
    }
}

char Translate(char s[])
{
    int c1, c2, c3;
    char P, code[3];
    //*** standard translation table, A(0), C(1), G(2), T(3) ***
    char table[4][4][4] = {
        {{'K','N','K','N'},{'T','T','T','T'},{'R','S','R','S'},{'I','I','M','I'}},
        {{'Q','H','Q','H'},{'P','P','P','P'},{'R','R','R','R'},{'L','L','L','L'}},
        {{'E','D','E','D'},{'A','A','A','A'},{'G','G','G','G'},{'V','V','V','V'}},
        {{'*','Y','*','Y'},{'S','S','S','S'},{'*','C','W','C'},{'L','F','L','F'}}};
    //*** table2 for n at 3rd position ***
    char table2[4][4] = {{'X','T','X','X'},{'X','P','R','L'},
                         {'X','A','G','V'},{'X','S','X','X'}};
    strncpy(code, s, 3);
    c1 = Convert(code[0]);
    c2 = Convert(code[1]);
    c3 = Convert(code[2]);
    if (c1 >= 4 || c2 >= 4)
        P = 'X';   // can be optimized further here by considering....
    else {
        if (c3 >= 4) P = table2[c1][c2];
        else P = table[c1][c2][c3];
        // P=table[Convert(code[0])][Convert(code[1])][Convert(code[2])];
    }
    return (P);
}

int Convert(char c)
{
    char s = c;
    if (s == 'A' || s == 'a') return (0);
    if (s == 'C' || s == 'c') return (1);
    if (s == 'G' || s == 'g') return (2);
    if (s == 'T' || s == 't' || s == 'U' || s == 'u') return (3);
    if (s == 'N' || s == 'n') return (4);
    else return (5);
}
# Translation -- read from a FASTA DNA file and translate into three frames
# Note: the Bio.Fasta, Bio.Tools.Translate and Bio.Alphabet modules used in
# older Biopython releases have been removed; this version uses Bio.SeqIO
# and Seq.translate instead.
from Bio import SeqIO

ifile = "S:\\Seq\\test.fasta"

# take the first record in the file
cur_rec = next(SeqIO.parse(ifile, "fasta"))
cur_seq = cur_rec.seq

for frame in range(3):
    # trim to a whole number of codons, then translate with the standard code (table 1)
    sub = cur_seq[frame:]
    sub = sub[:len(sub) - len(sub) % 3]
    print(frame, sub.translate(table=1))
Translation in C (first listing) versus translation in Python (second listing).
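For readers without Biopython installed, the same three-frame translation can be sketched with the Python standard library alone. This is an illustrative version, not the course script: the names `CODON_TABLE`, `translate`, and `three_frames` are assumptions, the table is the standard genetic code, and any codon containing a letter other than A, C, G, T (or U) becomes "X", which is simpler than the C listing's special handling of N in the third position.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate one reading frame; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    # Step through complete codons only, starting at the frame offset.
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna):
    """Return the translations of frames 0, 1 and 2."""
    return [translate(dna, frame) for frame in range(3)]
```

For instance, `translate("ATGGCCTAA")` gives `"MA*"`, where `*` marks the stop codon.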
Observe: scripting is not that difficult
Example: Python and Biopython.
1. Simple Python scripts.
2. Batch BLAST with a Python script.
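A batch BLAST script can be as small as a loop that prepares one search per query file. The sketch below is not the course's script: the function name `batch_blast_commands`, the ".fasta" file pattern, and the choice to name each report after its query file are assumptions, and it only builds the command lists so that they can be inspected before anything is run.

```python
from pathlib import Path


def batch_blast_commands(query_dir, db_path, out_dir, program="blastp"):
    """Build one blastall command per .fasta file found in query_dir."""
    commands = []
    for query in sorted(Path(query_dir).glob("*.fasta")):
        # Name each HTML report after its query file.
        out_file = Path(out_dir) / (query.stem + ".html")
        commands.append([
            "blastall", "-p", program, "-d", str(db_path),
            "-i", str(query), "-o", str(out_file), "-T", "T",
        ])
    return commands
```

Each returned list can then be passed to `subprocess.run(cmd, check=True)`, assuming the legacy `blastall` program is installed and the database has already been formatted.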
Representation of sequence
The need to include annotations and functional information with each sequence.
• Structured data entry
  – GenBank
  – EMBL / SwissProt
Observe: The difference in data structure between SwissProt, NCBI Protein, and NCBI Gene.