
BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING 1

Topics
  Quiz 3 Solutions
  More functions on arrays
  References and Dereferencing
  Two-dimensional arrays
  Using Hashes to Pass Parameters to subroutines
  Arrays of Hashes
  Hashes of Hashes
  Parsing
  Chapter 10 of Tisdall
  Program 3


Quiz 3 Solution Problem 1

#!/usr/bin/perl
use strict;
use warnings;
{
    my($a) = -3;
    my($b) = 7;
    my($n) = 10;

    srand(0);

    my(@numbers) = myrandom($n,$a,$b);

    for (my $i = 0; $i < $n; $i++){
        print "$numbers[$i] \n";
    }

    exit;
}

sub myrandom{
    my($n, $a, $b)=@_;
    my @nums;
    for (my $i = 0; $i < $n; $i++){
        $nums[$i] = ($b-$a)*rand(1) + $a;
    }
    return(@nums);
}

Output:
-2.9884033203125
-0.64434814453125
3.4813232421875

There is actually no simple way to generate a number x such that a <= x <= b with rand; this code generates numbers x such that a <= x < b.
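For integers the closed range is easy to reach. The following is a minimal sketch, not part of the quiz solution: rand_int_list is a hypothetical helper name, and it assumes integer bounds with lo <= hi.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return $n random integers x with $lo <= x <= $hi (both ends included).
# rand_int_list is a hypothetical helper name, not from the quiz.
sub rand_int_list {
    my ($n, $lo, $hi) = @_;
    my @nums;
    for (1 .. $n) {
        # rand($k) returns a value in [0, $k); int() drops the fraction
        push @nums, $lo + int(rand($hi - $lo + 1));
    }
    return @nums;
}

srand(0);
my @ints = rand_int_list(10, -3, 7);
print "@ints\n";
```

Because int(rand($hi - $lo + 1)) yields 0 through $hi - $lo, both -3 and 7 can appear in the list.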


Quiz 3 Solution Problem 2

#!/usr/bin/perl

use strict;

use warnings;

{

# define our hash

my(%coins) = ("Cys", 25, "Asp", 10, "Glu", 5);

my(@key_list) = keys(%coins);

my $key;

# loop through our hash:

foreach $key (@key_list) {

print "The value of $key is $coins{$key}\n";

}

exit;

}

Output:

The value of Glu is 5

The value of Cys is 25

The value of Asp is 10


More Functions on Arrays

splice ARRAY,OFFSET,LENGTH,LIST

Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any. The array grows or shrinks as necessary. In list context, returns the list of elements removed from ARRAY.

@A = qw/ A list of words /;
splice @A,1,0,"short";
print "@A\n";
@B = splice @A,1,3,"few";
print "@A\n";
print "spliced out: @B\n";

Output:
A short list of words
A few words
spliced out: short list of

Note: OFFSET is the offset from the start of the array.
LENGTH is the number of elements to remove from the array, starting at OFFSET.


More Functions on Arrays

grep BLOCK LIST
grep EXPR, LIST

Evaluates the BLOCK or EXPR for each element of LIST (setting $_ to each element) and returns the list of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true.

@A = (12, 6, 3, 20, 22);
@B = grep($_ > 10, @A);     # @B = (12, 20, 22)
$big = grep $_ > 10, @A;    # $big = 3

@A = qw / A list of words /;
@B = grep /o/, @A;          # @B = ("of" , "words")

Using grep to Remove Comments from a File

#!/usr/bin/perl -w

# Thomas Bonham

# 06/06/08

if($#ARGV !=0) {

print "usage: path to the configuration\n";

exit;

}

$fileName=$ARGV[0];

open(O,"<$fileName") || die($!);

open(N,">$fileName.free") || die($!);

while(<O>) {

next if($_ =~/^#.*/) ;

print N $_

}


http://linuxgazette.net/152/misc/lg/2_cent_tip__removing_the_comments_out_of_a_configuration_file.html
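The script above filters lines with a while loop and next rather than with grep itself. A minimal sketch of the same filter written with grep follows; strip_comments is a hypothetical helper name, and the sample lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_comments is a hypothetical helper: keep only the lines
# that do not start with '#'.
sub strip_comments {
    my @lines = @_;
    return grep { !/^#/ } @lines;
}

my @config = ("# a comment\n", "host=localhost\n", "#port=80\n", "port=8080\n");
print strip_comments(@config);
```

As in the original script, only lines whose first character is # are dropped; indented comments are kept.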


More Functions on Arrays

map BLOCK LIST

map EXPR, LIST

Evaluates the BLOCK or EXPR for each element of LIST (setting $_ to each element) and returns the list value of the results of each such evaluation.

@A = qw / A list of words /;

@L = map length, @A;

print "@L\n"; #prints: 1 4 2 5

@L = map uc, @A;

print "@L\n"; #prints: A LIST OF WORDS


Creating References

my $fruit = fruit_i_like();

sub fruit_i_like {
    my @fruit = ('apple', 'banana', 'orange');
    return \@fruit;
}

What does $fruit hold after calling the subroutine?

This works with scalars, arrays and hashes too.

my $scalar_ref = \$a_scalar;
my $hash_ref = \%a_hash;
my $subroutine_ref = \&a_subroutine;

References to anonymous arrays and hashes

my $array_ref = ['apple', 'banana', 'orange'];
my $hash_ref = {name => 'Becky', age => 23};


Dereferencing – Part 1 (Via Sigils)

#!/usr/bin/perl
use strict;
use warnings;
my $scalar = "This is a scalar";
my $scalar_ref = \$scalar;
print "Reference: " . $scalar_ref . "\n";
print "Dereferenced: " . $$scalar_ref . "\n";

Output:
Reference: SCALAR(0x182a2b4)
Dereferenced: This is a scalar

Similarly for arrays

#!/usr/bin/perl
use strict;
use warnings;
my $array_ref = ['apple', 'banana', 'orange'];
my @array = @$array_ref;
print "Reference: $array_ref\n";
print "Dereferenced: @array\n";

Output:
Reference: ARRAY(0x22a0a4)
Dereferenced: apple banana orange

http://www.merriam-webster.com/dictionary/sigil


Dereferencing – Part 1

Similarly for hashes

#!/usr/bin/perl
use strict;
use warnings;
my $hash_ref = {name => 'Becky', age => 23};
my %hash = %$hash_ref;
print "Reference: $hash_ref\n";
print "Dereferenced:\n";
foreach my $k (keys %hash) {
    print "$k: $hash{$k}\n";
}

Output:
Reference: HASH(0x22a0a4)
Dereferenced:
name: Becky
age: 23


Dereferencing – Part 2

#!/usr/bin/perl

use strict; use warnings;

my $scalar = "This is a scalar";

my $scalar_ref = \$scalar;

print "Reference: " . $scalar_ref . "\n";

print "Dereferenced: " . ${$scalar_ref} . "\n";

Output:

Reference: SCALAR(0x182a2b4)

Dereferenced: This is a scalar


Dereferencing – Part 3

The arrow operator also allows you to dereference references to arrays or hashes.

#!/usr/bin/perl

use strict;

use warnings;

my $array_ref = ['apple', 'banana', 'orange'];

print "My first fruit is: " . $array_ref->[0] . "\n";

Output:

My first fruit is: apple

Here is similar code for a hash

#!/usr/bin/perl
use strict;
use warnings;
my $hash_ref = {name => 'Becky', age => 23};
foreach my $k (keys %$hash_ref) {
    print "$k: " . $hash_ref->{$k} . "\n";
}

Output:
name: Becky
age: 23


Arrays, Lists and References

We saw that it is useful to pass arrays and hashes to subroutines using references:

some_subroutine(\@a, \@b);   # passes references to @a and @b

Sometimes it is useful to create a reference to an anonymous array by using square brackets [ ]:

$B = [1, 2, 3, 4];    # $B points to an anonymous array
print "@$B\n";        # this is called "dereferencing" $B
1 2 3 4
print "$$B[1]\n";     # think of "$B" as the "name" of the list
2

We can create a two-dimensional array by creating an array of array references:

@A = ([1, 2, 3, 4], [5, 6, 7, 8]);
print "@A\n";
ARRAY(0x80dd60) ARRAY(0x80dda8)    -- these are the references in A
print "$A[1][2]\n";
7

Notice that the 0-th row is the first row and that the 0-th column is the first column.


Two-dimensional arrays

We can create a two-dimensional array by creating an array of array references:

@A = ([1, 2, 3, 4], [5, 6, 7, 8]);

Access a two-dimensional array by using double brackets:

print "$A[1][2]\n";
7

The number of rows is just the size of the array:

$rows = scalar @A;

We can get the size of the first row as follows:

# braces are needed: @$A[0] would be parsed as @{$A}[0], a slice of the array referenced by $A
$cols = scalar @{$A[0]};

Print out all rows and columns:

for ($i = 0; $i < $rows; $i++) {
    for ($j = 0; $j < $cols; $j++) {
        print "$A[$i][$j] ";
    }
    print "\n"; # newline after each row
}


Two-dimensional arrays There is no need to declare the size of an array, so arrays can be created dynamically:

my @A = ();

my $rows = 100;

my $cols = 100;

# create a matrix with 1's on diagonal

for ($i = 0; $i < $rows; $i++) {

for ($j = 0; $j < $cols; $j++) {

$A[$i][$j] = 0;

}

$A[$i][$i] = 1;

}

Array sizes can be changed dynamically:

$A[0][200] = 123; # first row now has 201 items

# but other rows are unaffected!
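To check the claim that extending one row leaves the others alone, here is a small sketch using a 3x3 matrix instead of 100x100 (the sizes and the index 5 are chosen only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a 3x3 identity matrix, then extend only the first row.
my @A = ();
my $n = 3;
for my $i (0 .. $n - 1) {
    for my $j (0 .. $n - 1) {
        $A[$i][$j] = ($i == $j) ? 1 : 0;
    }
}
$A[0][5] = 123;    # positions 3 and 4 of row 0 now exist but are undef

# Report the length of each row
for my $i (0 .. $#A) {
    my $len = scalar @{ $A[$i] };
    print "row $i has $len items\n";
}
```

Row 0 reports 6 items while rows 1 and 2 still report 3. The skipped positions hold undef, so printing them under use warnings would produce a warning.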


Two-dimensional arrays

Arrays can contain any scalar values.
Two-dimensional arrays in Perl do not have to be "rectangular".
Each row can have a different length.

# this array has three rows with lengths 3, 4 and 1
my @A = ( ["John", "Jim", "Bill"],
          [ 100, 23.5, "ATCGTTGA", \%codons ],
          [ 0 ]);

Example: pretty print a 2D array

#!/usr/bin/perl
use strict;
use warnings;
# File: print_array.pl

my @A = ();

# initialize the two dimensional array
# with some numbers
my $rows = 6;
my $cols = 5;
for (my $i=0; $i < $rows; $i++) {
    for (my $j=0; $j < $cols; $j++) {
        $A[$i][$j] = $i*100 + $j*17;
    }
}

# print and quit
print_2Darray(@A);
exit;

# a subroutine to print out a two
# dimensional rectangular array using
# 5 digits per array element
sub print_2Darray {
    my (@a) = @_;
    my $rows = scalar @a;
    my $cols = scalar @{$a[0]};
    for (my $i=0; $i < $rows; $i++) {
        for (my $j=0; $j < $cols; $j++) {
            printf "%5d ", $a[$i][$j];
        }
        print "\n"; # newline after each row
    }
}

% print_array.pl
    0    17    34    51    68
  100   117   134   151   168
  200   217   234   251   268
  300   317   334   351   368
  400   417   434   451   468
  500   517   534   551   568


Example: transpose a 2D array (exchange rows and columns)

#!/usr/bin/perl
use strict;
use warnings;

# File: transpose.pl
my @A = ();

# initialize the two dimensional array
my $rows = 6;
my $cols = 5;
for (my $i=0; $i < $rows; $i++) {
    for (my $j=0; $j < $cols; $j++) {
        $A[$i][$j] = $i*100 + $j*17;
    }
}
print "A:\n";
print_2Darray(@A);

my @B = transpose(@A);

print "B:\n";
print_2Darray(@B);

exit;

sub transpose {

my (@a) = @_;

my @b = ();

my $rows = scalar @a;

my $cols = scalar @{$a[0]};

for (my $i=0; $i < $rows; $i++) {

for (my $j=0; $j < $cols; $j++) {

$b[$j][$i] = $a[$i][$j];

}

}

return @b;

}

sub print_2Darray {

.

.

.

}



% transpose.pl
A:
    0    17    34    51    68
  100   117   134   151   168
  200   217   234   251   268
  300   317   334   351   368
  400   417   434   451   468
  500   517   534   551   568
B:
    0   100   200   300   400   500
   17   117   217   317   417   517
   34   134   234   334   434   534
   51   151   251   351   451   551
   68   168   268   368   468   568


Conditional Operator

$a = COND ? VALUE_1 : VALUE_2;

is short for

if COND then $a = VALUE_1 else $a = VALUE_2;

Examples:

# silly way to compute abs($x)
my $abs_x = ($x > 0)? $x : -$x;

# a safe way to compute an average
my $ave = ($n > 0)? $sum / $n : 0.0;

# get max of a, b:
my $max = ($a > $b)? $a : $b;

# safely update a hash count (the assignment needs its own parentheses because
# = binds more loosely than ?:, and a key seen for the first time gets count 1):
exists $count{$x}? $count{$x}++ : ($count{$x} = 1);
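The hash-count idiom is easier to read when the conditional operator only chooses a value and a single assignment stores it. A minimal sketch, assuming we want to count bases in a DNA string (count_bases is a hypothetical helper name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each base occurs in a DNA string.
# count_bases is a hypothetical helper name.
sub count_bases {
    my ($dna) = @_;
    my %count;
    foreach my $base (split //, $dna) {
        # conditional operator: add one to an existing count, or start at 1
        $count{$base} = exists $count{$base} ? $count{$base} + 1 : 1;
    }
    return %count;
}

my %count = count_bases("AACGTTTA");
foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
```

In everyday Perl a plain $count{$base}++ gives the same result, because incrementing a missing hash entry creates it and counts from zero.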


Hashes for Passing Parameters

Problem: suppose you have some subroutines that take many arguments
  Don’t want to have to remember the right order
  Don’t want to have to remember the default values

Solution: Pass in a hash of the form (arg1 => value1, arg2 => value2, etc.)
  This allows arguments to appear in any order in the calling program
  Subroutine can detect missing arguments (using exists) and supply default values

Example: Suppose we need to write a subroutine that prints DNA, with an optional header line, optional protein translation, and optional line numbers:

# assume subroutines "print_header()", "print_protein()" and

# "print_dna()" are already defined

sub output_dna {

my ($dna, $header, $linelength, $translate, $linenumbers) = @_;

print_header($header) if (length $header > 0);

if ($translate) { print_protein($dna, $linelength, $linenumbers)}

else { print_dna($dna, $linelength, $linenumbers); }

}

Calls in the main program would look like this:

output_dna($dna, $header, 60, 0, 1); # header, no protein, linenums

output_dna($dna, "", 60, 1, 0); # no header, protein, no line numbers

Problems:

1. Even if you just want to print the dna, you still have to give all the arguments (in the right order).

2. Code is not self-documenting (will you remember what it does in 6 months?)

3. Clumsy for other people to use.

REMINDER -- The conditional operator:

$a = COND ? VALUE_1 : VALUE_2; # if COND then $a = VALUE_1 else $a = VALUE_2

Version 2: Note that missing arguments are set to a default value

sub output_dna {

my (%args) = @_;

my $dna = exists $args{dna} ? $args{dna} : "";

my $header = exists $args{header} ? $args{header} : "";

my $linelength = exists $args{linelength} ? $args{linelength} : 60;

my $translate = exists $args{translate} ? $args{translate} : 0;

my $linenumbers = exists $args{linenumbers} ? $args{linenumbers} : 0;

print_header($header) if (length $header > 0);

if ($translate) {print_protein($dna, $linelength, $linenumbers); }

else { print_dna($dna, $linelength, $linenumbers); }

}

Calls in the main program now look like this:

output_dna(linelength=>30, dna=>$x); # arguments can be in any order

output_dna(dna=>$dnastring, translate=>1); # missing args set to default values
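A common alternative to testing each key with exists is to list the defaults first and let the caller's pairs overwrite them. This is a sketch of that idiom, not the version used in the lecture; get_args is a hypothetical helper that only builds and returns the argument hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of output_dna's argument handling: the defaults come
# first in the list, so any key supplied by the caller overrides them.
sub get_args {
    my %args = (
        dna         => "",
        header      => "",
        linelength  => 60,
        translate   => 0,
        linenumbers => 0,
        @_,            # caller's key/value pairs win over the defaults
    );
    return %args;
}

my %args = get_args(linelength => 30, dna => "ACGT");
foreach my $name (sort keys %args) {
    print "$name = '$args{$name}'\n";
}
```

This works because a later key in a list assigned to a hash replaces an earlier one with the same name. One difference from the exists version: a misspelled argument name is silently added to the hash rather than ignored.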



Arrays of Hashes

#!/usr/bin/perl -w
#demonstrates an array of hashes;
use strict;
use warnings;
my @AoH;
my $role;
my $href;

@AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm", },
    { husband => "george", wife => "jane", son => "elroy", },
    { husband => "homer", wife => "marge", son => "bart", },
);
print "@AoH \n";

Output:
HASH(0x22a0ac) HASH(0x229f8c) HASH(0x1846024)


Arrays of Hashes – Manipulating the Variables

You can set a key/value pair of a particular hash as follows:

$AoH[0]{husband} = "fred";

To capitalize the husband of the second hash, apply a substitution:

$AoH[1]{husband} =~ s/(\w)/\u$1/;

Arrays of Hashes - Printing Method 1

#!/usr/bin/perl -w


use strict;

use warnings;

my @AoH;

my $role;

my $href;


@AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm", },
    { husband => "george", wife => "jane", son => "elroy", },
    { husband => "homer", wife => "marge", son => "bart", },
);
for $href ( @AoH ) {
    print "{ ";
    for $role ( keys %$href ) {
        print "$role=$href->{$role} ";
    }
    print "}\n";
}

Output:
{ son=bamm bamm wife=betty husband=barney }
{ son=elroy wife=jane husband=george }
{ son=bart wife=marge husband=homer }

What does the -> do in the print statement?


Arrays of Hashes - Printing Method 2

#!/usr/bin/perl -w


use strict;

use warnings;

my @AoH;

my $role;

my $href;

my $i;


@AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm", },
    { husband => "george", wife => "jane", son => "elroy", },
    { husband => "homer", wife => "marge", son => "bart", },
);
for $i ( 0 .. $#AoH ) {
    print "$i is { ";
    for $role ( keys %{ $AoH[$i] } ) {
        print "$role=$AoH[$i]{$role} ";
    }
    print "}\n";
}

Output:
0 is { son=bamm bamm wife=betty husband=barney }
1 is { son=elroy wife=jane husband=george }
2 is { son=bart wife=marge husband=homer }

Note that $#AoH is the subscript of the last element in the array @AoH.

Hashes of Hashes

#!/usr/bin/perl -w
#demonstrates a hash of hashes;
use strict;
use warnings;
my $family;
my $role;

my %HoH = (
    flintstones => {
        lead => "fred",
        pal  => "barney",
    },
    jetsons => {
        lead      => "george",
        wife      => "jane",
        "his boy" => "elroy",    # key quotes needed
    },
    simpsons => {
        lead => "homer",
        wife => "marge",
        kid  => "bart",
    },
);

# print the whole thing
foreach $family ( keys %HoH ) {
    print "$family: ";
    foreach $role ( keys %{ $HoH{$family} } ) {
        print "$role=$HoH{$family}{$role} ";
    }
    print "\n";
}

Output:
simpsons: kid=bart lead=homer wife=marge
jetsons: his boy=elroy lead=george wife=jane
flintstones: lead=fred pal=barney


Hashes of Hashes - Printing

#!/usr/bin/perl -w
#demonstrates a hash of hashes;
use strict;
use warnings;
my $family;
my $roles;
my $role;
my $person;

my %HoH = (
    flintstones => {
        lead => "fred",
        pal  => "barney",
    },
    jetsons => {
        lead      => "george",
        wife      => "jane",
        "his boy" => "elroy",    # key quotes needed
    },
    simpsons => {
        lead => "homer",
        wife => "marge",
        kid  => "bart",
    },
);

# print the whole thing, using temporaries
while ( ($family,$roles) = each %HoH ) {
    print "$family: ";
    while ( ($role,$person) = each %$roles ) {   # using each precludes sorting
        print "$role=$person ";
    }
    print "\n";
}

Output:
simpsons: kid=bart lead=homer wife=marge
jetsons: his boy=elroy lead=george wife=jane
flintstones: lead=fred pal=barney
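Since each returns pairs in the hash's internal order, a repeatable listing needs sort keys instead. A minimal sketch with the same data; format_family is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %HoH = (
    flintstones => { lead => "fred",   pal  => "barney" },
    jetsons     => { lead => "george", wife => "jane", "his boy" => "elroy" },
    simpsons    => { lead => "homer",  wife => "marge", kid => "bart" },
);

# format_family is a hypothetical helper: one line per family,
# with the roles in alphabetical order.
sub format_family {
    my ($family, $roles) = @_;
    my @pairs = map { "$_=$roles->{$_}" } sort keys %$roles;
    return "$family: @pairs";
}

# sort keys gives a repeatable order, unlike keys or each alone
foreach my $family (sort keys %HoH) {
    print format_family($family, $HoH{$family}), "\n";
}
```

The families now always print as flintstones, jetsons, simpsons, whatever order the hash stores them in.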


Chapter 10 Parsing GenBank


Data Types and Parsers

A databank or data store
A flat file as compared to a database
We are going to explore the art of parsing using this data

Alternative parsing software repositories include
  National Institutes of Health (NIH): www.ncbi.nlm.nih.gov
  European Bioinformatics Institute (EBI): www.ebi.ac.uk
  European Molecular Biology Laboratory (EMBL): www.embl.de


GenBank Files - I

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:[email protected], Tel:+81-3-3353-1211(ex.62692),
            Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"


GenBank Files - II

/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP

RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"


GenBank Files - IIIBASE COUNT 564 a 715 c 768 g 440 tORIGIN 1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg 61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg 121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt 181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat 241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat 301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac 361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc 421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat 481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat 541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca 601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg 661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg 721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt 781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg 841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca 901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag 961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca 1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta

…

2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa//


A Little More About GenBank

FEATURES table
  Specific information about locations of exons, regulatory regions, important mutations etc.

More information is contained at http://www.ncbi.nlm.nih.gov/genbank/GenBankFtp.html

http://www.ncbi.nlm.nih.gov/Genbank/
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). There are approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the WGS division as of April 2011.


GenBank: Separating Sequence and Annotation

Method 1
  Slurp the GenBank record into an array and look through the lines

Method 2
  Put the whole GenBank record into a scalar and use regular expressions to look through it

Example files
  record.gb
  library.gb


Parsing Genbank Files Using Arrays – Example 10-1

#!/usr/bin/perl
# Example 10-1 Extract annotation and sequence from GenBank file

use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module

# declare and initialize variables
my @annotation = ( );
my $sequence = '';
my $filename = 'record.gb';

parse1(\@annotation, \$sequence, $filename);

# Print the annotation, and then
# print the DNA in new format just to check if we got it okay.
print @annotation;

print_sequence($sequence, 50);

exit;

################################################################################
# Subroutine
################################################################################

# parse1
#
# -parse annotation and sequence from GenBank record

sub parse1 {

    my($annotation, $dna, $filename) = @_;

    # $annotation-reference to array
    # $dna       -reference to scalar
    # $filename  -scalar

    # declare and initialize variables
    my $in_sequence = 0;
    my @GenBankFile = ( );

    # Get the GenBank data into an array from a file
    @GenBankFile = get_file_data($filename);

    # Extract all the sequence lines
    foreach my $line (@GenBankFile) {

Parsing Genbank Files Using Arrays – Example 10-1

        if( $line =~ /^\/\/\n/ ) {         # If $line is end-of-record line //\n,
            last;                          # break out of the foreach loop.
        } elsif( $in_sequence) {           # If we know we're in a sequence,
            $$dna .= $line;                # add the current line to $$dna.
        } elsif ( $line =~ /^ORIGIN/ ) {   # If $line begins a sequence,
            $in_sequence = 1;              # set the $in_sequence flag.
        } else{                            # Otherwise
            push( @$annotation, $line);    # add the current line to @annotation.
        }
    }

    # remove whitespace and line numbers from DNA sequence
    $$dna =~ s/[\s0-9]//g;
}


Some Musings on the Array Approach

if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,

last; #break out of the foreach loop.

}

If we have a large number of forward slashes we can change the delimiter to a ! like this

m!//\n!

The order of the tests for what we are currently gathering in the file is in the reverse order of things in the file

For example we first gather the annotation lines and then set a flag when the ORIGIN start-of-sequence line is found


Parsing Genbank Files Using Scalars

We may end up with several lines in the same scalar; there are modifiers to our search code that can make our life easier

Pattern Modifiers
  /g (What does this do?)
  /i (How about this one?)

Regular expression matching symbols
  ^ (What does this do?)
  $ (How about this one?)
  . (What does this do?)


Special Multiline Regular Expression Modifiers

m
  Treat string as multiple lines. That is, change "^" and "$" from matching at only the very start or end of the string to the start or end of any line anywhere within the string.

s
  Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match. The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while yet allowing "^" and "$" to match, respectively, just after and just before newlines within the string.


Pattern Modifiers in Action

#!/usr/bin/perl

use strict;

use warnings;

"AAC\nGTT" =~ /^.*$/;

print $&, "\n";

exit;

This fails because there is no match and $& has no value

Suppose we use
"AAC\nGTT" =~ /^.*$/m;
Output:
AAC

Suppose we use
"AAC\nGTT" =~ /^.*$/s;
Output:
AAC
GTT
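The three cases can be compared side by side in one script. This is a sketch rather than code from the text: match_text is a hypothetical helper that returns the matched text or undef, which avoids printing $& after a failed match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# match_text is a hypothetical helper: return the matched text, or
# undef when the pattern does not match, instead of relying on $&.
sub match_text {
    my ($string, $regex) = @_;
    return $string =~ /($regex)/ ? $1 : undef;
}

my $s = "AAC\nGTT";
my %tests = (
    'no modifier' => qr/^.*$/,
    '/m'          => qr/^.*$/m,
    '/s'          => qr/^.*$/s,
);
foreach my $name (sort keys %tests) {
    my $m = match_text($s, $tests{$name});
    print "$name: ", defined $m ? "matched [$m]" : "no match", "\n";
}
```

Each pattern is built with qr// so that it carries its own modifier into the helper.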


Separating Annotations from Sequence

Recall that a GenBank record starts with LOCUS and ends with //

The input record separator
  This is normally set to newline, so each call to read a scalar from a file handle gets one line

The input record separator is denoted by $/

$/ = "//\n";

Now a call to read a scalar from a file handle takes all data up to the GenBank end of record separator
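A small sketch of record-at-a-time reading follows. It makes two assumptions that are not in the lecture: the "library" is a made-up two-record string opened as an in-memory file, and read_records is a hypothetical helper that uses local so the old separator is restored automatically.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two tiny made-up records; real GenBank records are much longer.
my $library = "LOCUS rec1\nORIGIN\n 1 acgt\n//\nLOCUS rec2\nORIGIN\n 1 ttga\n//\n";

# read_records is a hypothetical helper: return all records from a filehandle
sub read_records {
    my ($fh) = @_;
    # local restores the old value of $/ automatically when the sub returns
    local $/ = "//\n";
    my @records = <$fh>;
    return @records;
}

# Open the string as an in-memory file so no file on disk is needed
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close($fh);

print "Read ", scalar(@records), " records\n";
print "First record:\n$records[0]";
```

Using local does the same job as saving $/ in a variable and restoring it by hand, but the restore also happens if the subroutine exits early.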


Example 10-2 – Extract Record and Annotation Sequence from a GenBank Record

#!/usr/bin/perl
# Example 10-2 Extract the annotation and sequence sections from the first
# record of a GenBank library

use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'record.gb';
my $save_input_separator = $/;

# Open GenBank library file
unless (open(GBFILE, $filename)) {
    print "Cannot open GenBank file \"$filename\"\n\n";
    exit;
}

# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";

$record = <GBFILE>;

# reset input separator
$/ = $save_input_separator;

# Now separate the annotation from the sequence data

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# Print the two pieces, which should give us the same as the

# original GenBank file, minus the // at the end

print $annotation, $dna;

exit;


The Regular Expressions In Depth

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
  What do the two sets of parentheses do?
  What do we do with the contents of these parentheses?
  What does the /s modifier do?

^(LOCUS.*ORIGIN\s*\n)
  Can you explain this?

\/\/\n
  What does this match?

Notice that it took one line of Perl code to extract annotation and sequence


Parsing Annotations

#!/usr/bin/perl -w
# Example 10-3 Parsing GenBank annotations using arrays

use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variables
my @genbank = ( );
my $locus = '';
my $accession = '';
my $organism = '';

# Get GenBank file data
@genbank = get_file_data('record.gb');

# Let's start with something simple. Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.

for my $line (@genbank) {
    if($line =~ /^LOCUS/) {
        $line =~ s/^LOCUS\s*//;
        $locus = $line;
    }elsif($line =~ /^ACCESSION/) {
        $line =~ s/^ACCESSION\s*//;
        $accession = $line;
    }elsif($line =~ /^  ORGANISM/) {    # ORGANISM is indented by two spaces
        $line =~ s/^\s*ORGANISM\s*//;
        $organism = $line;
    }
}
print "*** LOCUS ***\n";
print $locus;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;

exit;

Notice the use of the flags and that this program handles single line entries.


Tackling the Definition Field

#!/usr/bin/perl -w
# Example 10-4 Parsing GenBank annotations using arrays, take 2



# Declare and initialize variables
my @genbank = ( );
my $locus = '';
my $accession = '';
my $organism = '';
my $definition = '';
my $flag = 0;

# Get GenBank file data
@genbank = get_file_data('record.gb');

# Let's start with something simple. Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.

for my $line (@genbank) {
    if($line =~ /^LOCUS/) {
        $line =~ s/^LOCUS\s*//;
        $locus = $line;
    }elsif($line =~ /^DEFINITION/) {
        $line =~ s/^DEFINITION\s*//;
        $definition = $line;
        $flag = 1;                      # we are now inside the DEFINITION field
    }elsif($line =~ /^ACCESSION/) {
        $line =~ s/^ACCESSION\s*//;
        $accession = $line;
        $flag = 0;                      # the DEFINITION field has ended
    }elsif($flag) {
        chomp($definition);
        $definition .= $line;           # append a continuation line
    }elsif($line =~ /^  ORGANISM/) {    # ORGANISM is indented by two spaces
        $line =~ s/^\s*ORGANISM\s*//;
        $organism = $line;
    }
}


Parsing the Annotations Using Regular Expressions

sub open_file
  Given the filename, return the filehandle
sub get_next_record
  Given the filehandle, get the record (we can get the offset by first calling tell)
sub get_annotation_and_dna
  Given a record, split it into annotation and cleaned-up sequence
sub search_sequence
  Given a sequence and a regular expression, return array of locations of hits
sub search_annotation
  Given a GenBank annotation and a regular expression, return array of locations of hits
sub parse_annotation
  Separate out the fields of the annotation in a convenient form
sub parse_features
  Given the features field, separate out the components


Byte offset

The Perl function tell allows us to save the byte offset of the record of interest
  The byte offset is the number of bytes into the file where the information of interest lies

We can return “instantaneously” to this location in the file using the Perl function seek
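The pairing of tell and seek can be tried without a real library file. In this sketch the two-record string and the helper name record_offsets are assumptions made for illustration; the string is opened as an in-memory file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up two-record library, opened as an in-memory file
my $library = "LOCUS rec1\n//\nLOCUS rec2\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

# record_offsets is a hypothetical helper: byte offset of each record start
sub record_offsets {
    my ($fh) = @_;
    local $/ = "//\n";
    my @offsets;
    my $offset = tell($fh);          # position before reading a record
    while (my $record = <$fh>) {
        push @offsets, $offset;
        $offset = tell($fh);         # position where the next record starts
    }
    return @offsets;
}

my @offsets = record_offsets($fh);
print "Record offsets: @offsets\n";

# Jump straight back to the second record; 0 means "from start of file"
seek($fh, $offsets[1], 0);
{
    local $/ = "//\n";
    my $second = <$fh>;
    print $second;
}
close($fh);
```

The first record is 14 bytes long, so the offsets are 0 and 14, and the seek lands exactly on the L of the second LOCUS line.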


Parsing of the Data

1. Separate out the annotation and the sequence (some capability to search exists at this stage)

2. Extract out the fields

3. Parse the features table


Example 10-5. GenBank Library Subroutines

#!/usr/bin/perl
# Example 10-5 - test program of GenBank library subroutines

use strict;
use warnings;
# Don't use BeginPerlBioinfo
# Since all subroutines defined in this file
# use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variables
my $fh; # variable to store filehandle
my $record;
my $dna;
my $annotation;
my $offset;
my $library = 'library.gb';

# Perform some standard subroutines for test

$fh = open_file($library);

$offset = tell($fh);

while( $record = get_next_record($fh) ) {

    ($annotation, $dna) = get_annotation_and_dna($record);

    if( search_sequence($dna, 'AAA[CG].')) {
        print "Sequence found in record at offset $offset\n";
    }
    if( search_annotation($annotation, 'homo sapiens')) {
        print "Annotation found in record at offset $offset\n";
    }

    $offset = tell($fh);
}

exit;


Subroutines

# open_file
#
# - given filename, set filehandle

sub open_file {

    my($filename) = @_;
    my $fh;

    unless(open($fh, $filename)) {
        print "Cannot open file $filename\n";
        exit;
    }
    return $fh;
}

# get_next_record
#
# - given filehandle to open GenBank library file, get next record

sub get_next_record {

    my($fh) = @_;

    my($offset);
    my($record) = '';
    my($save_input_separator) = $/;

    $/ = "//\n";

    $record = <$fh>;

    $/ = $save_input_separator;

    return $record;
}


Subroutines

# get_annotation_and_dna
#
# - given GenBank record, get annotation and DNA

sub get_annotation_and_dna {

    my($record) = @_;

    my($annotation) = '';
    my($dna) = '';

    # Now separate the annotation from the sequence data
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

    # clean the sequence of any whitespace, digits (line numbers) or / characters
    # (the / has to be written \/ in the character class, because
    # / is a metacharacter, so it must be "escaped" with \)
    $dna =~ s/[\s\/\d]//g;

    return($annotation, $dna)
}

# search_sequence
#
# - search sequence with regular expression

sub search_sequence {

    my($sequence, $regularexpression) = @_;

    my(@locations) = ( );

    while( $sequence =~ /$regularexpression/ig ) {
        push( @locations, pos );
    }

    return (@locations);
}


Subroutines

# search_annotation
#
# - search annotation with regular expression

sub search_annotation {

    my($annotation, $regularexpression) = @_;

    my(@locations) = ( );

    # note the /s modifier-. matches any character including newline
    while( $annotation =~ /$regularexpression/isg ) {
        push( @locations, pos );
    }

    return (@locations);
}

Output:
Sequence found in record at offset 0
Annotation found in record at offset 0
Sequence found in record at offset 6358
Annotation found in record at offset 6358
Sequence found in record at offset 12573
Annotation found in record at offset





22722


Parsing the Annotations at the Top Level

Let’s start with one of the simpler annotations

DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074

We need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.


Parsing the Annotations at the Top level

/^[A-Z].*\n(^\s.*\n)*/m

What does /m do?

^[A-Z].*\n
  Capital letter at the beginning of the line followed by any number of characters (except newlines) followed by a newline

(^\s.*\n)*
  Matches a space or tab at the beginning of the line followed by any number of characters (except newlines) followed by a newline
  ()* means 0 or more of these
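To see the pattern at work, this sketch runs it over a shortened, made-up annotation. split_fields is a hypothetical helper name, and the group is written as (?: ) because its contents are not needed separately.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up, shortened annotation: continuation lines start with spaces
my $annotation = "DEFINITION  Homo sapiens PCCX1 mRNA,\n"
               . "            complete cds.\n"
               . "ACCESSION   AB031069\n"
               . "VERSION     AB031069.1\n";

# split_fields is a hypothetical helper: return each top-level field
# (first line plus its indented continuation lines) as one string.
sub split_fields {
    my ($text) = @_;
    my @fields;
    while ($text =~ /^[A-Z].*\n(?:^\s.*\n)*/gm) {
        push @fields, $&;    # $& holds the text of the last match
    }
    return @fields;
}

my @fields = split_fields($annotation);
print scalar(@fields), " fields found\n";
print "--- field ---\n$_" foreach @fields;
```

The two-line DEFINITION entry comes back as a single field, while ACCESSION and VERSION are one line each.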


Example 10-6 Parsing GenBank Data

#!/usr/bin/perl
# Example 10-6 - test program for parse_annotation subroutine



# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';

# Open library and read a record
$fh = open_file($library);

$record = get_next_record($fh);

# Parse the sequence and annotation
($annotation, $dna) = get_annotation_and_dna($record);

# Extract the fields of the annotation
%fields = parse_annotation($annotation);

# Print the fields
foreach my $key (keys %fields) {
    print "******** $key *********\n";
    print $fields{$key};
}

exit;


Subroutines

# parse_annotation
#
# given a GenBank annotation, returns a hash with
# keys: the field names
# values: the fields

sub parse_annotation {

    my($annotation) = @_;
    my(%results) = ( );

    while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
        my $value = $&;
        (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
        $results{$key} = $value;
    }

    return %results;
}


Extracting the key

while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {

my $value = $&;

(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

$results{$key} = $value;

}

What does the bolded line (the one that sets $key) do?
  First assigns $key the value $value
  Uses the /s modifier for embedded newlines
  Replaces $key with $1, which is a special variable holding the match between the first pair of parentheses


Parsing the Features Table

Let’s tackle source, gene and CDS feature keys

     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC

Will use an array rather than a hash since there can be multiple instances of the same feature in a record


Parsing the Features Table

#!/usr/bin/perl
# - main program to test parse_features



# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my @features;
my $library = 'library.gb';

# Get the fields from the first GenBank record in a library
$fh = open_file($library);

$record = get_next_record($fh);

($annotation, $dna) = get_annotation_and_dna($record);

%fields = parse_annotation($annotation);

# Extract the features from the FEATURES table
@features = parse_features($fields{'FEATURES'});

# Print out the features
foreach my $feature (@features) {

    # extract the name of the feature (or "feature key")
    my($featurename) = ($feature =~ /^ {5}(\S+)/);

    print "******** $featurename *********\n";
    print $feature;
}

exit;


Subroutine

# parse_features
#
# extract the features from the FEATURES field of a GenBank record

sub parse_features {

    my($features) = @_; # entire FEATURES field in a scalar variable

    # Declare and initialize variables
    my(@features) = (); # used to store the individual features

    # Extract the features
    while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
        my $feature = $&;
        push(@features, $feature);
    }

    return @features;
}


A Closer Look at the Crucial Regular Expression

while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm )

The line begins with exactly 5 spaces followed by a non-whitespace character \S, followed by any number of non-newlines .*, followed by a newline

Next come zero or more lines that each begin with exactly 21 spaces followed by a non-whitespace character \S, followed by any number of non-newlines .*, followed by a newline
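The column rule can be tested on a small, made-up FEATURES fragment. In this sketch the indents are built with the x operator so the 5 and 21 space counts are explicit; feature_names is a hypothetical helper, and the extra capture group around the feature key is an addition for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a small made-up FEATURES fragment with the column layout the
# regular expression expects: feature keys after 5 spaces, qualifiers after 21.
my $key_indent  = ' ' x 5;
my $qual_indent = ' ' x 21;
my $features =
      $key_indent . "source          1..2487\n"
    . $qual_indent . "/organism=\"Homo sapiens\"\n"
    . $key_indent . "gene            229..2199\n"
    . $qual_indent . "/gene=\"PCCX1\"\n";

# feature_names is a hypothetical helper: list the feature key of each block
sub feature_names {
    my ($text) = @_;
    my @names;
    while ($text =~ /^ {5}(\S+).*\n(?:^ {21}\S.*\n)*/gm) {
        push @names, $1;    # $1 is the feature key, e.g. "source"
    }
    return @names;
}

my @names = feature_names($features);
print "Features: @names\n";
```

A qualifier line can never start a new feature here, because its sixth character is a space rather than the non-whitespace character the pattern requires.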


Homework for Next Week

Read Tisdall Chapter 10
  Particular attention to the Indexing GenBank with DBM section

Exercises 10.3 and 10.6

Begin working on Program 3

Quiz 4 next week