Lane Medical Library & Knowledge Management Center, http://lane.stanford.edu
Perl Programming for Biologists
PART 2: Tue Aug 28th 2007
Yannick Pouliot, PhD, Bioresearch Informationist
Lane Medical Library & Knowledge Management Center
To Dos
- Close all programs other than IE on your laptop
- Log into virtual room
- YP: log into Safari
To Do - 2
Please download all class materials from http://lane.stanford.edu/howto/index.html?id=_2593
Class Focus for Session #2
1. Converting file contents
2. Introducing BioPerl
3. Perl and relational databases
And remember: Ask LOTS OF QUESTIONS
Cautions - Reminder
- All examples pertain to MS Office 2003
  - Unclear what is to be expected for MS Office 2007
- All contents pertain to Perl 5.x, not 6.x
  - V.5 and 6 are NOT compatible
  - V.5 is far, far more common, so not much of an issue
Questions from last session?
Part 1: Converting file contents
Converting Data Stored in Flatfiles
- Input: ExampleOutputExcel3.csv
  - File generated last week by Excel3.pl
- Let’s look at and run Convert1.pl → Convert5.pl

Name         Function
Convert1.pl  Open file, write its contents into another file
Convert2.pl  Same as Convert1.pl, but parse and print only first bit of info
Convert3.pl  Same as Convert2.pl, but interchange first bit with second bit
Convert4.pl  Same as Convert3.pl, but remove period in UniGene Cluster name
Convert5.pl  Same as Convert2.pl, but print additional elements of array
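
The kind of conversion these scripts perform can be sketched in a few lines of Perl. This is not the code of Convert3.pl or Convert4.pl, which is not reproduced here. It is a minimal illustration that assumes each line is comma-separated, with a UniGene cluster name in the first field. The sample rows and the helper name convert_line are made up for the example.

```perl
use strict;
use warnings;

# Made-up sample rows; the real layout of ExampleOutputExcel3.csv is not shown here.
my @lines = (
    "Hs.2,NAT2,N-acetyltransferase 2",
    "Hs.4,ADH1B,alcohol dehydrogenase 1B",
);

# Hypothetical helper: swap the first two comma-separated fields and
# remove the period from the cluster name (e.g. "Hs.2" becomes "Hs2").
sub convert_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;    # naive split: assumes no quoted commas
    return $line if @fields < 2;          # leave lines with a single field untouched
    $fields[0] =~ s/\.//;                 # remove the first period only
    @fields[ 0, 1 ] = @fields[ 1, 0 ];    # interchange first and second fields
    return join ',', @fields;
}

print convert_line($_), "\n" for @lines;
```

With the sample rows above, the first line printed is NAT2,Hs2,N-acetyltransferase 2. A real CSV file with quoted fields would need a proper CSV parser rather than a plain split.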
Part 2: BioPerl
BioPerl: Overview
BioPerl = >1,000 modules divided into 7 packages
- Not all in 1.4
- 1.4 = stable release

Bioperl Package      Functions
bioperl (the core)   Most of the main functionality of Bioperl
bioperl-db           Using Bioperl with BioSQL and local relational databases
bioperl-ext          Interaction with some alignment functions and the Staden package
bioperl-gui          Some preliminary work on a graphical user interface to some Bioperl functions
bioperl-microarray   Microarray-specific functions
bioperl-pedigree     Manipulating genotype, marker, and individual data for linkage studies
bioperl-run          Wrappers to a lot of external programs
Other, Non-BioPerl Modules
BioPerl: You Have A Friend In High Places
The big deal:
- BioPerl provides “objects” for various types of sequence data and their associated features and annotations.
- These objects provide interfaces for
  - analysis of these sequences with a wide variety of external programs (BLAST, FASTA, clustalw and EMBOSS, to name just a few)
  - various types of databases for storage and retrieval of sequences
    - remote (GenBank, EMBL, etc.)
    - local (MySQL, flat-file databases, GFF, etc.)
So What Is This Object Business?
What A Biology-Related Program Looks Like When Coded According To The Object Paradigm
[Figure: diagram of interacting objects labeled Protein, DNA, RNA, Gene, Organism, Species, LivingObject and Sequence]
Objects Inherit From A Class Or Prior Object
[Figure: a class is the prototype for all objects of its type. An object is either created from the class (“new”) or derived from an existing object: Object 1 (ancestor) gives rise to Object 2. Example labels: Sequence, RNA, Protein, DNA]
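
The idea on this slide can be sketched in plain Perl, without BioPerl. The example below is illustrative only: it assumes that Sequence is the ancestor class and that DNA and Protein derive from it. The class layout, the attribute names and the type method are invented for the example and are not BioPerl's actual classes.

```perl
use strict;
use warnings;

# Base class: the prototype shared by every kind of sequence.
package Sequence;

sub new {
    my ( $class, %args ) = @_;
    my $self = { id => $args{id}, residues => $args{residues} };
    return bless $self, $class;    # "new" creates an object of the calling class
}

sub id     { return $_[0]{id} }
sub length { return CORE::length( $_[0]{residues} ) }
sub type   { return 'generic sequence' }

# Derived classes inherit new(), id() and length(), and override type().
package DNA;
our @ISA = ('Sequence');
sub type { return 'DNA' }

package Protein;
our @ISA = ('Sequence');
sub type { return 'protein' }

package main;

my $dna = DNA->new( id => 'example1', residues => 'ATGGCC' );
my $pep = Protein->new( id => 'example2', residues => 'MA' );

for my $seq ( $dna, $pep ) {
    printf "%s is a %s of length %d\n", $seq->id, $seq->type, $seq->length;
}
```

Neither DNA nor Protein defines new, id or length; Perl finds them in Sequence through @ISA. This is the same mechanism that lets a BioPerl sequence object respond to methods defined in the classes it inherits from.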
An example: Class inheritance for shape concepts
Key BioPerl Links
BioPerl 1.4 installed as part of Perl 5.8.8.822 (what you downloaded)
BioPerl home: http://www.bioperl.org/wiki/Main_Page
Getting started: http://www.bioperl.org/wiki/Getting_Started (lots of examples)
BioPerl Example: Querying GenBank To Retrieve Sequence Properties
- Seq7.pl
- Seq8.pl
- Seq9.pl → after exercise (next slide)
- Seq11.pl → after exercise (next slide)
Related docs:
- GenBank search: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html
- SeqIO: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/genbank.html
  - See also http://www.bioperl.org/wiki/HOWTO:SeqIO
- And most importantly: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html
Exercise: Print An Additional Sequence Feature
- Add an additional sequence feature to Seq8.pl
- What to print: see Methods for Seq object at http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html
Quiz Questions based on Seq11.pl

    use warnings;
    use strict;
    use Bio::DB::GenBank;
    use Bio::DB::Query::GenBank;    # query class used below, loaded explicitly

    # ---------------------------------------------------------------------------
    # main
    $| = 1;    # Autoflush the selected output handle (STDOUT)

    my $gb = Bio::DB::GenBank->new(
        -format     => 'GenBank',
        -seq_start  => 0,
        -seq_end    => 1000,
        -strand     => 1,
        -complexity => 0,
    );    # put in some restrictions as to what is retrieved and stored into GenBank object ...

    # get a stream via a query string
    my $query = Bio::DB::Query::GenBank->new(
        -query => 'Homo sapiens[Organism] AND M-cadherin',
        -db    => 'nucleotide',
    );
    my $seqio = $gb->get_Stream_by_query($query);

    my $i = 0;    # count total number of sequences
    while ( my $seq = $seqio->next_seq ) {
        print "seq id =", $seq->id,
            "\t version = ", $seq->version,
            "\t seq acc number = ", $seq->accession_number,
            "\t seq length = ", $seq->length, "\n";
        $i++;
    }
    print "retrieved $i sequences from GenBank \n";
    # --------------------------------------------------------------------------
More Quizzing: Seq10.pl
- Run Seq10.pl
  - Why the warning messages?
- Specifying strands
  - 1 for plus
  - 2 for minus
- Complexity: A GenBank nucleotide entry is often a part of a larger biological blob that contains other GI numbers (e.g., translated protein)
- Complexity regulates the display:
  0 - get the whole blob
  1 - get the bioseq for gi of interest (default in Entrez)
  2 - get the minimal bioseq-set containing the gi of interest
  3 - get the minimal nuc-prot containing the gi of interest
  4 - get the minimal pub-set containing the gi of interest
Some Cautions
- Be careful when querying databases → have an idea of how many sequences you may be downloading/processing
- Know that Perl might eat up all of your CPU cycles
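
One defensive habit that follows from these cautions is to cap the number of records a loop will process. The sketch below shows the pattern in plain Perl. The make_stream helper is a made-up stand-in for a sequence stream (it hands back one record per call, then undef), and the cap of 3 is an arbitrary value chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence stream: returns one record per call
# and undef once exhausted, much like a next_seq-style iterator.
sub make_stream {
    my @records = @_;
    return sub { return shift @records };
}

my $max_records = 3;    # arbitrary safety cap for this example
my $next        = make_stream( map { "record$_" } 1 .. 10 );

my $seen = 0;
while ( defined( my $rec = $next->() ) ) {
    if ( $seen >= $max_records ) {
        warn "Stopping after $max_records records; more are available\n";
        last;
    }
    $seen++;
    print "processing $rec\n";
}
print "processed $seen records\n";
```

The same structure applies when the records come from a remote database: the loop stops after a known number of records however many the query matched.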
Part 3: Interacting With A Database
Preliminaries: Updating ODBC Manager
- First we need to tell the ODBC Manager where to find the “GenesToEvaluate” DB, i.e., register it as a data source
- More at http://lane.stanford.edu/howto/index.html?id=_1751
Example Perl Programs That Interact With A Database
- Ancillary files:
  - ExampleOutputExcel3.csv needed as input to Access1.pl
  - Access2.pl and Access3.pl don’t need this file
- All programs rely on GenesToEvaluate.mdb (Access DB)
In Closing: Suggestions
- Modify the programs provided here
  - Baby steps…
- Save often
  - Keep lots of prior versions so you can recover from your mistakes
- SU provides lots of documentation → use it!
- Google is invaluable