Lane Medical Library & Knowledge Management Center, http://lane.stanford.edu. Perl Programming for Biologists, PART 2: Tue Aug 28th 2007. Yannick Pouliot, PhD, Bioresearch Informationist, Lane Medical Library & Knowledge Management Center


Page 1:

Lane Medical Library & Knowledge Management Center
http://lane.stanford.edu

Perl Programming for Biologists

PART 2: Tue Aug 28th 2007

Yannick Pouliot, PhD
Bioresearch Informationist
Lane Medical Library & Knowledge Management Center

Page 2:

To Dos
  Close all programs other than IE on your laptop
  Log into virtual room
  YP: log into Safari

Page 3:

To Do - 2

Please download all class materials from http://lane.stanford.edu/howto/index.html?id=_2593

Page 4:

Class Focus for Session #2

1. Converting file contents

2. Introducing BioPerl

3. Perl and relational databases

And remember: Ask LOTS OF QUESTIONS

Page 5:

Cautions - Reminder

All examples pertain to MS Office 2003
  Unclear what is to be expected for MS Office 2007

All contents pertain to Perl 5.x, not 6.x
  V.5 and 6 are NOT compatible
  V.5 is far, far more common, so not much of an issue

Page 6:

Questions from last session?

Page 7:

Part 1: Converting file contents

Page 8:

Converting Data Stored in Flatfiles

Input: ExampleOutputExcel3.csv
  File generated last week by Excel3.pl

Let’s look at and run Convert1.pl → Convert5.pl

Name         Function
Convert1.pl  Open file, write its contents into another file
Convert2.pl  Same as Convert1.pl, but parse and print only first bit of info
Convert3.pl  Same as Convert2.pl, but interchange first bit with second bit
Convert4.pl  Same as Convert3.pl, but remove period in UniGene Cluster name
Convert5.pl  Same as Convert2.pl, but print additional elements of array
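The Convert scripts themselves are not reproduced in these slides, so the following is only a minimal sketch of the kind of processing the table describes (roughly the Convert3.pl/Convert4.pl steps). The sub names convert_line and convert_file are hypothetical, and the column layout is an assumption: field 1 is taken to be a UniGene cluster name such as "Hs.12345", with other values after it.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumed layout (not confirmed by the slides):
# field 1 is a UniGene cluster name such as "Hs.12345", other fields follow.
sub convert_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    return $line if @fields < 2;          # nothing to interchange
    $fields[0] =~ s/\.//;                 # remove the first period: Hs.12345 -> Hs12345
    @fields[0, 1] = @fields[1, 0];        # interchange first and second fields
    return join ',', @fields;
}

# Hypothetical helper: read one file line by line, write converted lines to another.
sub convert_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} convert_line($line), "\n";
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}

print convert_line("Hs.12345,CDH15,extra\n"), "\n";    # prints CDH15,Hs12345,extra
```

A plain split on commas does not handle quoted fields that contain commas; if the CSV written by Excel3.pl uses such quoting, a dedicated parser such as the Text::CSV module from CPAN would likely be a safer choice.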

Page 9:

Part 2: BioPerl

Page 10:

BioPerl: Overview

BioPerl = >1,000 modules divided into 7 packages
  Not all are in 1.4
  1.4 = stable release

Bioperl Package      Functions
bioperl (the core)   Most of the main functionality of Bioperl
bioperl-db           Using Bioperl with BioSQL and local relational databases
bioperl-ext          Interaction with some alignment functions and the Staden package
bioperl-gui          Some preliminary work on a graphical user interface to some Bioperl functions
bioperl-microarray   Microarray-specific functions
bioperl-pedigree     Manipulating genotype, marker, and individual data for linkage studies
bioperl-run          Wrappers to a lot of external programs

Page 11:

Other, Non-BioPerl Modules

Page 12:

BioPerl: You Have A Friend In High Places

The big deal: BioPerl provides “objects” for various types of sequence data and their associated features and annotations.

These objects provide interfaces for:
  analysis of these sequences with a wide variety of external programs (BLAST, FASTA, clustalw and EMBOSS, to name just a few)
  various types of databases for storage and retrieval of sequences
    remote (GenBank, EMBL, etc.)
    local (MySQL, flat files, GFF, etc.)

Page 13:

So What Is This Object Business?

Page 14:

What A Biology-Related Program Looks Like When Coded According To The Object Paradigm

[Diagram labels: Protein, DNA, RNA, Gene, Organism, Species, LivingObject, Sequence]

Page 15:

Objects Inherit From A Class Or Prior Object

[Diagram labels: Class = prototype for all objects of this type; Create an object (“new”); Object 1 (ancestor); Derive an object from an existing object; Object 2; Sequence, DNA, RNA, Protein]
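The inheritance idea on this slide can be sketched in plain Perl 5. This is an illustration only: the Sequence and DNA packages below and their methods are hypothetical stand-ins for the diagram's labels, not BioPerl's actual classes (such as Bio::Seq), which are considerably richer.

```perl
use strict;
use warnings;

package Sequence;    # the class: prototype for all objects of this type

sub new {            # constructor, called as Sequence->new(...) or DNA->new(...)
    my ($class, %args) = @_;
    my $self = { id => $args{id}, seq => $args{seq} };
    return bless $self, $class;
}

sub id     { my ($self) = @_; return $self->{id} }
sub length { my ($self) = @_; return CORE::length($self->{seq}) }

package DNA;
our @ISA = ('Sequence');    # DNA inherits new, id, and length from Sequence

sub revcom {                # method that only makes sense for DNA
    my ($self) = @_;
    (my $rc = reverse $self->{seq}) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

package main;

my $dna = DNA->new(id => 'demo1', seq => 'ATGC');
print $dna->id, " length=", $dna->length, " revcom=", $dna->revcom, "\n";
# prints: demo1 length=4 revcom=GCAT
```

The point of the sketch is that DNA defines only what is specific to DNA; everything else is found in its ancestor through @ISA, which is the same mechanism that lets a BioPerl sequence object answer calls such as $seq->id and $seq->length.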

Page 16:

An example: Class inheritance for shape concepts

Page 17:

Key BioPerl Links

BioPerl 1.4 installed as part of Perl 5.8.8.822 (what you downloaded)

BioPerl home: http://www.bioperl.org/wiki/Main_Page

Getting started: http://www.bioperl.org/wiki/Getting_Started
  Lots of examples

Page 18:

BioPerl Example: Querying GenBank To Retrieve Sequence Properties

Seq7.pl
Seq8.pl
Seq9.pl → after exercise (next slide)
Seq11.pl → after exercise (next slide)

Related docs:
  GenBank search: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html
  SeqIO: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/genbank.html
    See also http://www.bioperl.org/wiki/HOWTO:SeqIO
  And most importantly: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html

Page 19:

Exercise: Print An Additional Sequence Feature

Add an additional sequence feature to Seq8.pl
  What to print: see Methods for Seq object at http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html

Page 20:

Quiz Questions based on Seq11.pl

use warnings;
use strict;

use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;    # loaded explicitly: the query object is built directly below

# ---------------------------------------------------------------------------
# main

$| = 1;    # Force unbuffered STDOUT

my $gb = Bio::DB::GenBank->new(
    -format     => 'GenBank',
    -seq_start  => 0,
    -seq_end    => 1000,
    -strand     => 1,
    -complexity => 0,
);    # put in some restrictions as to what is retrieved and stored into GenBank object ...

# get a stream via a query string
my $query = Bio::DB::Query::GenBank->new(
    -query => 'Homo sapiens[Organism] AND M-cadherin',
    -db    => 'nucleotide',
);
my $seqio = $gb->get_Stream_by_query($query);

my $i = 0;    # count total number of sequences
while (my $seq = $seqio->next_seq) {
    print "seq id =", $seq->id,
          "\t version = ", $seq->version,
          "\t seq acc number = ", $seq->accession_number,
          "\t seq length = ", $seq->length, "\n";
    $i++;
}
print "retrieved $i sequences from GenBank \n";

# --------------------------------------------------------------------------

Page 21:

More Quizzing: Seq10.pl

Run Seq10.pl
  Why the warning messages?

Specifying strands
  1 for plus
  2 for minus

Complexity: A GenBank nucleotide entry is often a part of a larger biological blob that contains other GI numbers (e.g., translated protein)

Complexity regulates the display:
  0 - get the whole blob
  1 - get the bioseq for gi of interest (default in Entrez)
  2 - get the minimal bioseq-set containing the gi of interest
  3 - get the minimal nuc-prot containing the gi of interest
  4 - get the minimal pub-set containing the gi of interest

Page 22:

Some Cautions

Be careful when querying databases → have an idea of how many sequences you may be downloading/processing
Know that Perl might eat up all of your CPU cycles

Page 23:

Part 3: Interacting With A Database

Page 24:

Preliminaries: Updating ODBC Manager

First we need to add directions to the “GenesToEvaluate” DB to ODBC Manager
  More at http://lane.stanford.edu/howto/index.html?id=_1751

Page 25:

Example Perl Programs That Interact With A Database

Ancillary files:
  ExampleOutputExcel3.csv is needed as input to Access1.pl
  Access2.pl and Access3.pl don’t need this file

All programs rely on GenesToEvaluate.mdb (Access DB)

Page 26:

In Closing: Suggestions

Modify the programs provided here
  Baby steps…
Save often
  Keep lots of prior versions so you can recover from your mistakes
SU provides lots of documentation → use it!
Google is invaluable