BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING 1
Topics
  Quiz 3 Solutions
  More functions on arrays
  References and Dereferencing
  Two-dimensional arrays
  Using Hashes to Pass Parameters to subroutines
  Arrays of Hashes
  Hashes of Hashes
  Parsing
    Chapter 10 of Tisdall
  Program 3
Quiz 3 Solution Problem 1
#!/usr/bin/perl
use strict;
use warnings;
{
    my($a) = -3;
    my($b) = 7;
    my($n) = 10;
    srand(0);
    my(@numbers) = myrandom($n,$a,$b);
    for (my $i = 0; $i < $n; $i++){
        print "$numbers[$i] \n";
    }
    exit;
}
sub myrandom{
    my($n, $a, $b)=@_;
    my @nums;
    for (my $i = 0; $i < $n; $i++){
        $nums[$i] = ($b-$a)*rand(1) + $a;
    }
    return(@nums);
}
Output:
-2.9884033203125
-0.64434814453125
3.4813232421875
There is actually no simple way to generate a real number x such that a <= x <= b; this generates numbers x such that a <= x < b, because rand(1) returns a value in [0, 1).
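When the bounds are integers and an integer result is wanted, the closed interval is straightforward. The following is a minimal sketch under that assumption; the helper name random_int is hypothetical and is not part of the quiz solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# random_int is a hypothetical helper: it returns an integer x
# with $lo <= x <= $hi (both bounds included).
# rand($k) returns a real number in [0, $k), so int(rand($k))
# is one of the integers 0 .. $k-1.
sub random_int {
    my ($lo, $hi) = @_;
    return $lo + int(rand($hi - $lo + 1));
}

srand(0);
my @draws = map { random_int(-3, 7) } 1 .. 10;
print "@draws\n";
```

Every value printed lies between -3 and 7 inclusive; the exact sequence depends on the Perl build's random number generator.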
Quiz 3 Solution Problem 2
#!/usr/bin/perl
use strict;
use warnings;
{
# define our hash
my(%coins) = ("Cys", 25, "Asp", 10, "Glu", 5);
my(@key_list) = keys(%coins);
my $key;
# loop through our hash:
foreach $key (@key_list) {
print "The value of $key is $coins{$key}\n";
}
exit;
}
Output:
The value of Glu is 5
The value of Cys is 25
The value of Asp is 10
More Functions on Arrays
splice ARRAY,OFFSET,LENGTH,LIST
Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any. The array grows or shrinks as necessary. In list context, returns the list of elements removed from ARRAY.
@A = qw/ A list of words /;
splice @A,1,0,"short";
print "@A\n";
@B = splice @A,1,3,"few";
print "@A\n";
print "spliced out: @B\n";
Output:
A short list of words
A few words
spliced out: short list of
Note: OFFSET is the offset from the start of the array
LENGTH is the number of elements (not characters) to remove, starting at OFFSET
More Functions on Arrays
grep BLOCK LIST
grep EXPR, LIST
Evaluates the BLOCK or EXPR for each element of LIST (setting $_ to each element) and returns the list of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true.
@A = (12, 6, 3, 20, 22);
@B = grep($_ > 10, @A); # @B = (12, 20, 22)
$big = grep $_ > 10, @A; # $big = 3
@A = qw / A list of words /;
@B = grep /o/, @A; # @B = ("of" , "words")
Using grep to Remove Comments from a File
#!/usr/bin/perl -w
# Thomas Bonham
# 06/06/08
if($#ARGV !=0) {
print "usage: path to the configuration\n";
exit;
}
$fileName=$ARGV[0];
open(O,"<$fileName") || die($!);
open(N,">$fileName.free") || die($!);
while(<O>) {
next if($_ =~/^#.*/) ;
print N $_
}
http://linuxgazette.net/152/misc/lg/2_cent_tip__removing_the_comments_out_of_a_configuration_file.html
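The script above filters with a regular expression inside a while loop rather than with the grep function itself. As a rough sketch of the same filtering done with grep on a list of lines (the helper name strip_comments and the sample configuration lines are assumptions made for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_comments is a hypothetical helper: it keeps every line
# that does not begin with '#'.
sub strip_comments {
    my @lines = @_;
    return grep { !/^#/ } @lines;
}

# Made-up configuration lines used only for this illustration.
my @config = (
    "# a comment\n",
    "host = localhost\n",
    "#port = 80\n",
    "port = 8080\n",
);

print strip_comments(@config);
```

Reading the whole file into a list first is convenient for small configuration files; the while loop in the original script is the better choice for very large files because it holds only one line in memory at a time.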
More Functions on Arrays
map BLOCK LIST
map EXPR, LIST
Evaluates the BLOCK or EXPR for each element of LIST (setting $_ to each element) and returns the list value of the results of each such evaluation.
@A = qw / A list of words /;
@L = map length, @A;
print "@L\n"; #prints: 1 4 2 5
@L = map uc, @A;
print "@L\n"; #prints: A LIST OF WORDS
Creating References
my $fruit = fruit_i_like();
sub fruit_i_like {
    my @fruit = ('apple', 'banana', 'orange');
    return \@fruit;
}
What does $fruit hold after calling the subroutine?
This works with scalars, hashes and subroutines too.
my $scalar_ref = \$a_scalar;
my $hash_ref = \%a_hash;
my $subroutine_ref = \&a_subroutine;
References to anonymous arrays and hashes
my $array_ref = ['apple', 'banana', 'orange'];
my $hash_ref = {name => 'Becky', age => 23};
Dereferencing – Part 1 (Via Sigils)
#!/usr/bin/perl
use strict;
use warnings;
my $scalar = "This is a scalar";
my $scalar_ref = \$scalar;
print "Reference: " . $scalar_ref . "\n";
print "Dereferenced: " . $$scalar_ref . "\n";
Output:
Reference: SCALAR(0x182a2b4)
Dereferenced: This is a scalar
Similarly for arrays
#!/usr/bin/perl
use strict;
use warnings;
my $array_ref = ['apple', 'banana', 'orange'];
my @array = @$array_ref;
print "Reference: $array_ref\n";
print "Dereferenced: @array\n";
Output:
Reference: ARRAY(0x22a0a4)
Dereferenced: apple banana orange
http://www.merriam-webster.com/dictionary/sigil
Dereferencing – Part 1
Similarly for hashes
#!/usr/bin/perl
use strict;
use warnings;
my $hash_ref = {name => 'Becky', age => 23};
my %hash = %$hash_ref;
print "Reference: $hash_ref\n";
print "Dereferenced:\n";
foreach my $k (keys %hash) {
    print "$k: $hash{$k}\n";
}
Output:
Reference: HASH(0x22a0a4)
Dereferenced:
name: Becky
age: 23
Dereferencing – Part 2
#!/usr/bin/perl
use strict; use warnings;
my $scalar = "This is a scalar";
my $scalar_ref = \$scalar;
print "Reference: " . $scalar_ref . "\n";
print "Dereferenced: " . ${$scalar_ref} . "\n";
Output:
Reference: SCALAR(0x182a2b4)
Dereferenced: This is a scalar
Dereferencing – Part 3
The arrow operator also allows you to dereference references to arrays or hashes.
#!/usr/bin/perl
use strict;
use warnings;
my $array_ref = ['apple', 'banana', 'orange'];
print "My first fruit is: " . $array_ref->[0] . "\n";
Output:
My first fruit is: apple
Here is similar code for a hash
#!/usr/bin/perl
use strict;
use warnings;
my $hash_ref = {name => 'Becky', age => 23};
foreach my $k (keys %$hash_ref) {
    print "$k: " . $hash_ref->{$k} . "\n";
}
Output:
name: Becky
age: 23
Arrays, Lists and References
We saw that it is useful to pass arrays and hashes to subroutines using references:
some_subroutine(\@a, \@b); # passes references to @a and @b
Sometimes it is useful to create a reference to an anonymous array by using square brackets [ ]:
$B = [1, 2, 3, 4]; # $B points to a list
print "@$B\n"; # this is called "dereferencing" $B
1 2 3 4
print "$$B[1]\n"; # think of "$B" as the "name" of the list
2
We can create a two-dimensional array by creating an array of lists:
@A = ([1, 2, 3, 4], [5, 6, 7, 8]);
print "@A\n";
ARRAY(0x80dd60) ARRAY(0x80dda8) -- these are the references in A
print "$A[1][2]\n";
7
Notice that the 0th row is the first row and that the 0th column is the first column
Two-dimensional arrays
We can create a two-dimensional array by creating an array of lists:
@A = ([1, 2, 3, 4], [5, 6, 7, 8]);
Access a two-dimensional array by using double brackets:
print "$A[1][2]\n";
7
The number of rows is just the size of the array:
$rows = scalar @A;
We can get the size of the first row as follows:
# braces are needed: @$A[0] would be parsed as a slice of the array referenced by $A, not as row 0 of @A
$cols = scalar @{$A[0]};
Print out all rows and columns:
for ($i = 0; $i < $rows; $i++) {
    for ($j = 0; $j < $cols; $j++) {
        print "$A[$i][$j] ";
    }
    print "\n"; # newline after each row
}
Two-dimensional arrays
There is no need to declare the size of an array, so arrays can be created dynamically:
my @A = ();
my $rows = 100;
my $cols = 100;
# create a matrix with 1's on diagonal
for ($i = 0; $i < $rows; $i++) {
for ($j = 0; $j < $cols; $j++) {
$A[$i][$j] = 0;
}
$A[$i][$i] = 1;
}
Array sizes can be changed dynamically:
$A[0][200] = 123; # first row now has 201 items
# but other rows are unaffected!
Two-dimensional arrays
Arrays can contain any scalar values
Two-dimensional arrays in Perl do not have to be "rectangular"
Each row can have a different length
# this array has three rows with lengths 3, 4 and 1
my @A = ( ["John", "Jim", "Bill"],
          [ 100, 23.5, "ATCGTTGA", \%codons ],
          [ 0 ]);
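Because rows may differ in length, code that walks such an array should ask each row for its own size instead of reusing the size of row 0. A small sketch follows; the helper name row_lengths and the contents of %codons are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# row_lengths is a hypothetical helper: it returns the number of
# elements in each row of a two-dimensional array.
sub row_lengths {
    my @rows = @_;
    return map { scalar(@$_) } @rows;
}

# Stand-in contents for %codons, assumed for this illustration only.
my %codons = (ATG => "M", TGG => "W");

my @A = ( ["John", "Jim", "Bill"],
          [ 100, 23.5, "ATCGTTGA", \%codons ],
          [ 0 ] );

my @lengths = row_lengths(@A);
for my $i (0 .. $#A) {
    print "row $i has $lengths[$i] elements\n";
}
```

For this array the reported lengths are 3, 4 and 1, matching the comment in the slide.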
Example: pretty print a 2D array
#!/usr/bin/perl
use strict;
use warnings;
# File: print_array.pl
my @A = ();
# initialize the two dimensional array
# with some numbers
my $rows = 6;
my $cols = 5;
for (my $i=0; $i < $rows; $i++) {
    for (my $j=0; $j < $cols; $j++) {
        $A[$i][$j] = $i*100 + $j*17;
    }
}
# print and quit
print_2Darray(@A);
exit;
# a subroutine to print out a two
# dimensional rectangular array using
# 5 digits per array element
sub print_2Darray {
    my (@a) = @_;
    my $rows = scalar @a;
    my $cols = scalar @{$a[0]};
    for (my $i=0; $i < $rows; $i++) {
        for (my $j=0; $j < $cols; $j++) {
            printf "%5d ", $a[$i][$j];
        }
        print "\n"; # newline after each row
    }
}
% print_array.pl
    0    17    34    51    68
  100   117   134   151   168
  200   217   234   251   268
  300   317   334   351   368
  400   417   434   451   468
  500   517   534   551   568
Example: transpose a 2D array (exchange rows and columns)
#!/usr/bin/perl
use strict;
use warnings;
# File: transpose.pl
my @A = ();
# initialize the two dimensional array
my $rows = 6;
my $cols = 5;
for (my $i=0; $i < $rows; $i++) {
    for (my $j=0; $j < $cols; $j++) {
        $A[$i][$j] = $i*100 + $j*17;
    }
}
print "A:\n";
print_2Darray(@A);
my @B = transpose(@A);
print "B:\n";
print_2Darray(@B);
exit;
sub transpose {
my (@a) = @_;
my @b = ();
my $rows = scalar @a;
my $cols = scalar @{$a[0]};
for (my $i=0; $i < $rows; $i++) {
for (my $j=0; $j < $cols; $j++) {
$b[$j][$i] = $a[$i][$j];
}
}
return @b;
}
sub print_2Darray {
.
.
.
}
% transpose.pl
A:
    0    17    34    51    68
  100   117   134   151   168
  200   217   234   251   268
  300   317   334   351   368
  400   417   434   451   468
  500   517   534   551   568
B:
    0   100   200   300   400   500
   17   117   217   317   417   517
   34   134   234   334   434   534
   51   151   251   351   451   551
   68   168   268   368   468   568
Conditional Operator
$a = COND ? VALUE_1 : VALUE_2;
is short for
if COND then $a = VALUE_1 else $a = VALUE_2;
Examples:
# silly way to compute abs($x)
my $abs_x = ($x > 0)? $x : -$x;
# a safe way to compute an average
my $ave = ($n > 0)? $sum / $n : 0.0;
# get max of a, b:
my $max = ($a > $b)? $a : $b;
# safely update a hash count (the assignment needs parentheses because ?: binds tighter than =):
exists $count{$x} ? $count{$x}++ : ($count{$x} = 1);
Hashes for Passing Parameters
Problem: suppose you have some subroutines that take many arguments
  Don’t want to have to remember the right order
  Don’t want to have to remember the default values
Solution: Pass in a hash of the form (arg1 => value1, arg2 => value2, etc.)
  This allows arguments to appear in any order in calling program
  Subroutine can detect missing arguments (using exists) and supply default values
Example: Suppose we need to write a subroutine that prints DNA, with an optional header line, optional protein translation, and optional line numbers:
# assume subroutines "print_header()", "print_protein()" and
# "print_dna()" are already defined
sub output_dna {
my ($dna, $header, $linelength, $translate, $linenumbers) = @_;
print_header($header) if (length $header > 0);
if ($translate) { print_protein($dna, $linelength, $linenumbers)}
else { print_dna($dna, $linelength, $linenumbers); }
}
Calls in the main program would look like this:
output_dna($dna, $header, 60, 0, 1); # header, no protein, linenums
output_dna($dna, "", 60, 1, 0); # no header, protein, no line numbers
Problems:
1. Even if you just want to print the dna, you still have to give all the arguments (in the right order).
2. Code is not self-documenting (will you remember what it does in 6 months?)
3. Clumsy for other people to use.
REMINDER -- The conditional operator:
$a = COND ? VALUE_1 : VALUE_2; # if COND then $a = VALUE_1 else $a = VALUE_2
Version 2: Note that missing arguments are set to a default value
sub output_dna {
my (%args) = @_;
my $dna = exists $args{dna} ? $args{dna} : "";
my $header = exists $args{header} ? $args{header} : "";
my $linelength = exists $args{linelength} ? $args{linelength} : 60;
my $translate = exists $args{translate} ? $args{translate} : 0;
my $linenumbers = exists $args{linenumbers} ? $args{linenumbers} : 0;
print_header($header) if (length $header > 0);
if ($translate) {print_protein($dna, $linelength, $linenumbers); }
else { print_dna($dna, $linelength, $linenumbers); }
}
Calls in the main program now look like this:
output_dna(linelength=>30, dna=>$x); # arguments can be in any order
output_dna(dna=>$dnastring, translate=>1); # missing args set to default values
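The per-argument exists tests in Version 2 can also be written as a single hash merge, since later key/value pairs in a list override earlier ones when the list is assigned to a hash. The sketch below shows that alternative idiom only; describe_options is a hypothetical name, and it returns a string instead of printing DNA so that the effect of the defaults is easy to see.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# describe_options is a hypothetical helper that accepts the same named
# arguments as output_dna and reports the values it would actually use.
sub describe_options {
    my %defaults = (
        dna         => "",
        header      => "",
        linelength  => 60,
        translate   => 0,
        linenumbers => 0,
    );
    # Caller-supplied pairs come last, so they override the defaults.
    my %args = (%defaults, @_);
    return join(" ", map { $_ . "=" . $args{$_} } sort keys %args);
}

print describe_options(linelength => 30, dna => "ACGT"), "\n";
print describe_options(dna => "ACGT", translate => 1), "\n";
```

One trade-off of the merge is that a misspelled argument name is silently added to %args instead of being noticed, so a check of the keys against %defaults may be worth adding in real code.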
Arrays of Hashes
#!/usr/bin/perl -w
#demonstrates an array of hashes;
use strict;
use warnings;
my @AoH;
my $role;
my $href;
@AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm", },
    { husband => "george", wife => "jane", son => "elroy", },
    { husband => "homer", wife => "marge", son => "bart", },
);
print "@AoH \n";
Output:
HASH(0x22a0ac) HASH(0x229f8c) HASH(0x1846024)
Arrays of Hashes – Manipulating the Variables
You can set a key/value pair of a particular hash as follows:
$AoH[0]{husband} = "fred";
To capitalize the husband of the second hash, apply a substitution:
$AoH[1]{husband} =~ s/(\w)/\u$1/;
Arrays of Hashes - Printing Method 1
#!/usr/bin/perl -w
#demonstrates an array of hashes;
use strict;
use warnings;
my @AoH;
my $role;
my $href;
@AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm", },
    { husband => "george", wife => "jane", son => "elroy", },
    { husband => "homer", wife => "marge", son => "bart", },
);
for $href ( @AoH ) {
    print "{ ";
    for $role ( keys %$href ) {
        print "$role=$href->{$role} ";
    }
    print "}\n";
}
Output:
{ son=bamm bamm wife=betty husband=barney }
{ son=elroy wife=jane husband=george }
{ son=bart wife=marge husband=homer }
What does the -> do in the print statement?
Arrays of Hashes - Printing Method 2
#!/usr/bin/perl -w
#demonstrates an array of hashes;
use strict;
use warnings;
my @AoH;
my $role;
my $href;
my $i;
@AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm", },
    { husband => "george", wife => "jane", son => "elroy", },
    { husband => "homer", wife => "marge", son => "bart", },
);
for $i ( 0 .. $#AoH ) {
    print "$i is { ";
    for $role ( keys %{ $AoH[$i] } ) {
        print "$role=$AoH[$i]{$role} ";
    }
    print "}\n";
}
Output:
0 is { son=bamm bamm wife=betty husband=barney }
1 is { son=elroy wife=jane husband=george }
2 is { son=bart wife=marge husband=homer }
Note that $#AoH is the subscript of the last element in the array @AoH
Hashes of Hashes
#!/usr/bin/perl -w
#demonstrates a hash of hashes;
use strict;
use warnings;
my $family;
my $role;
my %HoH = (
    flintstones => {
        lead => "fred",
        pal => "barney",
    },
    jetsons => {
        lead => "george",
        wife => "jane",
        "his boy" => "elroy", # key quotes needed
    },
    simpsons => {
        lead => "homer",
        wife => "marge",
        kid => "bart",
    },
);
# print the whole thing
foreach $family ( keys %HoH ) {
    print "$family: ";
    foreach $role ( keys %{ $HoH{$family} } ) {
        print "$role=$HoH{$family}{$role} ";
    }
    print "\n";
}
Output:
simpsons: kid=bart lead=homer wife=marge
jetsons: his boy=elroy lead=george wife=jane
flintstones: lead=fred pal=barney
Hashes of Hashes - Printing
#!/usr/bin/perl -w
#demonstrates a hash of hashes;
use strict;
use warnings;
my $family;
my $roles;
my $role;
my $person;
my %HoH = (
    flintstones => {
        lead => "fred",
        pal => "barney",
    },
    jetsons => {
        lead => "george",
        wife => "jane",
        "his boy" => "elroy", # key quotes needed
    },
    simpsons => {
        lead => "homer",
        wife => "marge",
        kid => "bart",
    },
);
# print the whole thing, using temporaries
while ( ($family,$roles) = each %HoH ) {
    print "$family: ";
    while ( ($role,$person) = each %$roles ) { # using each precludes sorting
        print "$role=$person ";
    }
    print "\n";
}
Output:
simpsons: kid=bart lead=homer wife=marge
jetsons: his boy=elroy lead=george wife=jane
flintstones: lead=fred pal=barney
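Hash order is not predictable, and each cannot be combined with sort. When a stable order matters, one option is to iterate over sorted keys at both levels. A minimal sketch, using the same data; the helper name format_family is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %HoH = (
    flintstones => { lead => "fred",   pal  => "barney" },
    jetsons     => { lead => "george", wife => "jane", "his boy" => "elroy" },
    simpsons    => { lead => "homer",  wife => "marge", kid => "bart" },
);

# format_family is a hypothetical helper: it builds one output line for
# a family, with the roles listed in alphabetical order.
sub format_family {
    my ($family, $roles) = @_;
    my @pairs = map { $_ . "=" . $roles->{$_} } sort keys %$roles;
    return "$family: " . join(" ", @pairs);
}

foreach my $family ( sort keys %HoH ) {
    print format_family($family, $HoH{$family}), "\n";
}
```

With sorted keys the families always appear as flintstones, jetsons, simpsons, and the roles within each family are alphabetical, regardless of the internal hash order.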
Chapter 10 Parsing GenBank
Data Types and Parsers
A databank or data store
  A flat file as compared to a database
We are going to explore the art of parsing using this data
Alternative parsing software repositories include
  National Institutes of Health (NIH)  www.ncbi.nlm.nih.gov
  European Bioinformatics Institute (EBI)  www.ebi.ac.uk
  European Molecular Biology Laboratory (EMBL)  www.embl.de
GenBank Files - I
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:[email protected], Tel:+81-3-3353-1211(ex.62692),
            Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
GenBank Files - II
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
GenBank Files - III
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
      301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
      361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
      421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
      481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
      541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
      601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
      661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
      721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
      781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
      841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
      901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
      961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
     1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
…
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//
A Little More About GenBank
FEATURES table
  Specific information about locations of exons, regulatory regions, important mutations etc.
More information is contained at
  http://www.ncbi.nlm.nih.gov/genbank/GenBankFtp.html
  http://www.ncbi.nlm.nih.gov/Genbank/
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). There are approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the WGS division as of April 2011.
GenBank: Separating Sequence and Annotation
Method 1
  Slurp the GenBank record into an array and look through the lines
Method 2
  Put the whole GenBank record into a scalar and use regular expressions to look through it
Example files
  record.gb
  library.gb
Parsing GenBank Files Using Arrays – Example 10-1
#!/usr/bin/perl
# Example 10-1 Extract annotation and sequence from GenBank file
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables
my @annotation = ( );
my $sequence = '';
my $filename = 'record.gb';
parse1(\@annotation, \$sequence, $filename);
# Print the annotation, and then
# print the DNA in new format just to check if we got it okay.
print @annotation;
print_sequence($sequence, 50);
exit;
################################################################################
# Subroutine
################################################################################
# parse1
#
# -parse annotation and sequence from GenBank record
sub parse1 {
    my($annotation, $dna, $filename) = @_;
    # $annotation-reference to array
    # $dna -reference to scalar
    # $filename -scalar
    # declare and initialize variables
    my $in_sequence = 0;
    my @GenBankFile = ( );
    # Get the GenBank data into an array from a file
    @GenBankFile = get_file_data($filename);
    # Extract all the sequence lines
    foreach my $line (@GenBankFile) {
        if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,
            last; #break out of the foreach loop.
        } elsif( $in_sequence) { # If we know we're in a sequence,
            $$dna .= $line; # add the current line to $$dna.
        } elsif ( $line =~ /^ORIGIN/ ) { # If $line begins a sequence,
            $in_sequence = 1; # set the $in_sequence flag.
        } else{ # Otherwise
            push( @$annotation, $line); # add the current line to @annotation.
        }
    }
    # remove whitespace and line numbers from DNA sequence
    $$dna =~ s/[\s0-9]//g;
}
Some Musings on the Array Approach
if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,
last; #break out of the foreach loop.
}
If we have a large number of forward slashes we can change the delimiter to a ! like this
m!//\n!
The order of the tests for what we are currently gathering in the file is in the reverse order of things in the file
For example we first gather the annotation lines and then set a flag when the ORIGIN start-of-sequence line is found
Parsing GenBank Files Using Scalars
We may end up with several lines in the same scalar; there are modifiers to our search code that can make our life easier
Pattern Modifiers
  /g (What does this do?)
  /i (How about this one?)
Regular expression matching symbols
  ^ (What does this do?)
  $ (How about this one?)
  . (What does this do?)
Special Multiline Regular Expression Modifiers
m
  Treat string as multiple lines. That is, change "^" and "$" from matching at only the very start or end of the string to the start or end of any line anywhere within the string.
s
  Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.
The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while yet allowing "^" and "$" to match, respectively, just after and just before newlines within the string. (The $* variable is obsolete and was removed in Perl 5.10.)
Pattern Modifiers in Action
#!/usr/bin/perl
use strict;
use warnings;
"AAC\nGTT" =~ /^.*$/;
print $&, "\n";
exit;
This fails because there is no match and $& has no value
Suppose we use
"AAC\nGTT" =~ /^.*$/m;
Output:
AAC
Suppose we use
"AAC\nGTT" =~ /^.*$/s;
Output:
AAC
GTT
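The three cases can be compared side by side in one script. This is a sketch for experimentation only; first_match is a hypothetical helper, and the /ms case is added to illustrate the combined behaviour described on the previous slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# first_match is a hypothetical helper: it returns the text matched by
# $regex in $string, or undef when there is no match.
sub first_match {
    my ($string, $regex) = @_;
    return $string =~ /($regex)/ ? $1 : undef;
}

my $string = "AAC\nGTT";
my @cases = (
    [ 'no modifier', qr/^.*$/   ],
    [ '/m',          qr/^.*$/m  ],
    [ '/s',          qr/^.*$/s  ],
    [ '/ms',         qr/^.*$/ms ],
);

for my $case (@cases) {
    my ($label, $regex) = @$case;
    my $hit = first_match($string, $regex);
    if (defined $hit) {
        $hit =~ s/\n/\\n/g;    # show embedded newlines visibly
        print "$label: matched '$hit'\n";
    } else {
        print "$label: no match\n";
    }
}
```

Testing the result of the match before using it avoids the problem in the first version above, where $& is printed even though nothing matched.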
Separating Annotations from Sequence
Recall that a GenBank record starts with LOCUS and ends with //
The input record separator
  This is normally set to newline so each call to read a scalar from a file handle gets one line
  The input record separator is denoted by $/
  $/ = "//\n";
  Now a call to read a scalar from a file handle takes all data up to the GenBank end of record separator
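The effect of changing $/ can be tried without a GenBank file by reading from a string through an in-memory filehandle. A minimal sketch: read_records is a hypothetical helper, the two tiny records are made up, and local is used here in place of the save-and-restore of $/ that the course examples use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_records is a hypothetical helper: it returns every record from
# the filehandle, where a record ends with the line "//".
sub read_records {
    my ($fh) = @_;
    local $/ = "//\n";    # the old value is restored when the sub returns
    my @records = <$fh>;
    return @records;
}

# Two made-up, heavily shortened records used only for illustration.
my $data = "LOCUS AAA\nORIGIN\n 1 acgt\n//\nLOCUS BBB\nORIGIN\n 1 ttga\n//\n";

open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close($fh);

print "Number of records: ", scalar(@records), "\n";
print "Second record:\n$records[1]";
```

Each element of @records is one whole record including its closing // line, which is the form the regular expression in Example 10-2 expects.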
Example 10-2 – Extract Record and Annotation Sequence from a GenBank Record
#!/usr/bin/perl
# Example 10-2 Extract the annotation and sequence sections from the first
# record of a GenBank library
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'record.gb';
my $save_input_separator = $/;
# Open GenBank library file
unless (open(GBFILE, $filename)) {
    print "Cannot open GenBank file \"$filename\"\n\n";
    exit;
}
# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";
$record = <GBFILE>;
# reset input separator
$/ = $save_input_separator;
# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
# Print the two pieces, which should give us the same as the
# original GenBank file, minus the // at the end
print $annotation, $dna;
exit;
The Regular Expressions In Depth
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
  What do the two sets of parentheses do?
  What do we do with the contents of these parentheses?
  What does the /s modifier do?
^(LOCUS.*ORIGIN\s*\n)
  Can you explain this?
\/\/\n
  What does this match?
Notice that it took one line of Perl code to extract annotation and sequence
Parsing Annotations
#!/usr/bin/perl -w
# Example 10-3 Parsing GenBank annotations using arrays
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my @genbank = ( );
my $locus = '';
my $accession = '';
my $organism = '';
# Get GenBank file data
@genbank = get_file_data('record.gb');
# Let's start with something simple. Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.
for my $line (@genbank) {
    if($line =~ /^LOCUS/) {
        $line =~ s/^LOCUS\s*//;
        $locus = $line;
    }elsif($line =~ /^ACCESSION/) {
        $line =~ s/^ACCESSION\s*//;
        $accession = $line;
    }elsif($line =~ /^  ORGANISM/) {
        $line =~ s/^\s*ORGANISM\s*//;
        $organism = $line;
    }
}
print "*** LOCUS ***\n";
print $locus;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;
exit;
Notice the use of the flags and that this program handles single line entries.
Tackling the Definition Field
#!/usr/bin/perl -w
# Example 10-4 Parsing GenBank annotations using arrays, take 2
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my @genbank = ( );
my $locus = '';
my $accession = '';
my $organism = '';
my $definition = '';
my $flag = 0;
# Get GenBank file data
@genbank = get_file_data('record.gb');
# Let's start with something simple. Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.
for my $line (@genbank) {
    if($line =~ /^LOCUS/) {
        $line =~ s/^LOCUS\s*//;
        $locus = $line;
    }elsif($line =~ /^DEFINITION/) {
        $line =~ s/^DEFINITION\s*//;
        $definition = $line;
        $flag = 1;
    }elsif($line =~ /^ACCESSION/) {
        $line =~ s/^ACCESSION\s*//;
        $accession = $line;
        $flag = 0;
    }elsif($flag) {
        chomp($definition);
        $definition .= $line;
    }elsif($line =~ /^  ORGANISM/) {
        $line =~ s/^\s*ORGANISM\s*//;
        $organism = $line;
    }
}
Parsing the Annotations Using Regular Expressions
sub open_file
  Given the filename, return the filehandle
sub get_next_record
  Given the file handle, get the record (we can get the offset by first calling tell)
sub get_annotation_and_dna
  Given a record, split it into annotation and cleaned-up sequence
sub search_sequence
  Given a sequence and a regular expression, return array of locations and hits
sub search_annotation
  Given a GenBank annotation and a regular expression, return array of locations and hits
sub parse_annotation
  Separate out the fields of the annotation in a convenient form
sub parse_features
  Given the features field, separate out the components
Byte offset
The Perl function tell allows us to save the byte offset of the record of interest
  The byte offset is the number of bytes into the file where the information of interest lies
We can return "instantaneously" to this location in the file using the Perl function seek
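A short sketch of saving offsets with tell and jumping back with seek follows. It reads from an in-memory string so that it is self-contained; record_offsets is a hypothetical helper and the two records are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# record_offsets is a hypothetical helper: it returns the byte offset at
# which each "//\n"-terminated record starts.
sub record_offsets {
    my ($fh) = @_;
    local $/ = "//\n";
    my @offsets;
    my $offset = tell($fh);            # position before the first record
    while (defined(my $record = <$fh>)) {
        push @offsets, $offset;
        $offset = tell($fh);           # position before the next record
    }
    return @offsets;
}

# Two made-up records used only for illustration.
my $data = "LOCUS AAA\n//\nLOCUS BBB\n//\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @offsets = record_offsets($fh);
print "Record offsets: @offsets\n";

# Jump straight back to the second record; 0 means "from the start of file".
seek($fh, $offsets[1], 0) or die "Cannot seek: $!";
{
    local $/ = "//\n";
    my $second = <$fh>;
    print $second;
}
close($fh);
```

Storing only the offsets keeps an index of a large library small, because a record is read again only when it is actually needed.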
Parsing of the Data
1. Separate out the annotation and the sequence (some capability to search exists at this stage)
2. Extract out the fields
3. Parse the features table
Example 10-5. GenBank Library Subroutines
#!/usr/bin/perl
# Example 10-5 - test program of GenBank library subroutines
use strict;
use warnings;
# Don't use BeginPerlBioinfo
# Since all subroutines defined in this file
# use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my $fh; # variable to store filehandle
my $record;
my $dna;
my $annotation;
my $offset;
my $library = 'library.gb';
# Perform some standard subroutines for test
$fh = open_file($library);
$offset = tell($fh);
while( $record = get_next_record($fh) ) {
    ($annotation, $dna) = get_annotation_and_dna($record);
    if( search_sequence($dna, 'AAA[CG].')) {
        print "Sequence found in record at offset $offset\n";
    }
    if( search_annotation($annotation, 'homo sapiens')) {
        print "Annotation found in record at offset $offset\n";
    }
    $offset = tell($fh);
}
exit;
Subroutines
# open_file
#
# - given filename, set filehandle
sub open_file {
    my($filename) = @_;
    my $fh;
    unless(open($fh, $filename)) {
        print "Cannot open file $filename\n";
        exit;
    }
    return $fh;
}
# get_next_record
#
# - given filehandle to open GenBank library file, get next record
sub get_next_record {
    my($fh) = @_;
    my($offset);
    my($record) = '';
    my($save_input_separator) = $/;
    $/ = "//\n";
    $record = <$fh>;
    $/ = $save_input_separator;
    return $record;
}
Subroutines
# get_annotation_and_dna
#
# - given GenBank record, get annotation and DNA
sub get_annotation_and_dna {
    my($record) = @_;
    my($annotation) = '';
    my($dna) = '';
    # Now separate the annotation from the sequence data
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
    # clean the sequence of any whitespace, digits (the position numbers) or / characters
    # (the / has to be written \/ in the character class, because
    # / is the pattern delimiter, so it must be "escaped" with \)
    $dna =~ s/[\s\/\d]//g;
    return($annotation, $dna);
}
# search_sequence
#
# - search sequence with regular expression
sub search_sequence {
    my($sequence, $regularexpression) = @_;
    my(@locations) = ( );
    while( $sequence =~ /$regularexpression/ig ) {
        # pos must be given the searched variable; a bare pos refers to $_
        push( @locations, pos($sequence) );
    }
    return (@locations);
}
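A small sketch of how a record can be split into annotation and sequence. The record below is hypothetical and heavily shortened, and the cleanup step here also strips digits (the position numbers at the start of each sequence line), which is an assumption of this sketch:

```perl
use strict;
use warnings;

# Hypothetical, heavily shortened GenBank-style record
my $record = "LOCUS       AB031069\n"
           . "DEFINITION  Homo sapiens PCCX1 mRNA.\n"
           . "ORIGIN\n"
           . "        1 agatggcggc gctgaggggt\n"
           . "       61 gcgcggtccg\n"
           . "//\n";

# Everything up to and including the ORIGIN line is annotation;
# everything between ORIGIN and the closing // is sequence data
my ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# Remove whitespace, slashes and position numbers from the sequence
$dna =~ s/[\s\/\d]//g;

print "Annotation:\n$annotation";
print "DNA: $dna\n";
```

The /s modifier is what lets .* run across the newlines inside the record.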
Subroutines
# search_annotation
#
# - search annotation with regular expression
sub search_annotation {
    my($annotation, $regularexpression) = @_;
    my(@locations) = ( );
    # note the /s modifier: . matches any character including newline
    while( $annotation =~ /$regularexpression/isg ) {
        # pos must be given the searched variable; a bare pos refers to $_
        push( @locations, pos($annotation) );
    }
    return (@locations);
}
Output:
Sequence found in record at offset 0
Annotation found in record at offset 0
Sequence found in record at offset 6358
Annotation found in record at offset 6358
Sequence found in record at offset 12573
Annotation found in record at offset 12573
Sequence found in record at offset 18032
Annotation found in record at offset 18032
Sequence found in record at offset 22722
Annotation found in record at offset 22722
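The search subroutines collect one location per match inside a while loop with the /g modifier. A minimal sketch of that loop on a made-up sequence, showing that pos($sequence) gives the offset just past each match while the built-in array element $-[0] gives the offset where the match starts (recording the start is an addition of this sketch):

```perl
use strict;
use warnings;

# Made-up sequence containing two matches of AAA[CG]. (case-insensitive)
my $sequence = 'ccAAACtggaaagAtt';

my @starts;   # offset where each match begins
my @ends;     # offset just past each match, as reported by pos
while ($sequence =~ /AAA[CG]./ig) {
    push @starts, $-[0];
    push @ends,   pos($sequence);
}

print "Match starts: @starts\n";
print "Match ends:   @ends\n";
```

In scalar context with /g, each pass through the loop resumes the search where the previous match ended, which is why the loop stops once no further match is found.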
Parsing the Annotations at the Top Level
Let’s start with one of the simpler annotations
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing
            CXXC domain 1, complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
We need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.
Parsing the Annotations at the Top Level
/^[A-Z].*\n(^\s.*\n)*/m
What does /m do? It lets ^ match at the beginning of every line inside the string, not only at the very beginning of the string
^[A-Z].*\n
Capital letter at the beginning of the line followed by any number of characters (except newlines) followed by a newline
(^\s.*\n)*
Matches a space or tab at the beginning of the line followed by any number of characters (except newlines) followed by a newline
()* means 0 or more of these
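A short sketch of the pattern at work on the three fields shown earlier, with and without /m. The non-capturing group (?: ) and the outer capturing parentheses are substitutions made here so that the whole field is returned in list context; the matching behavior is otherwise the same:

```perl
use strict;
use warnings;

my $annotation = "DEFINITION  Homo sapiens PCCX1 mRNA for protein containing\n"
               . "            CXXC domain 1, complete cds.\n"
               . "ACCESSION   AB031069\n"
               . "VERSION     AB031069.1  GI:8100074\n";

# With /m, ^ matches at the start of every line, so every field is found
my @fields = $annotation =~ /(^[A-Z].*\n(?:^\s.*\n)*)/gm;

# Without /m, ^ only matches at the very start of the string,
# so only the first line of the first field is found
my @without_m = $annotation =~ /(^[A-Z].*\n(?:^\s.*\n)*)/g;

print "With /m:    ", scalar(@fields), " fields\n";
print "Without /m: ", scalar(@without_m), " field\n";
print "First field:\n$fields[0]";
```

Note that the first field keeps its indented continuation line, because that line starts with whitespace rather than a capital letter.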
Example 10-6 Parsing GenBank Data
#!/usr/bin/perl
# Example 10-6 - test program for parse_annotation subroutine
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo;     # see Chapter 6 about this module
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';
# Open library and read a record
$fh = open_file($library);
$record = get_next_record($fh);
# Parse the sequence and annotation
($annotation, $dna) = get_annotation_and_dna($record);
# Extract the fields of the annotation
%fields = parse_annotation($annotation);
# Print the fields
foreach my $key (keys %fields) {
    print "******** $key *********\n";
    print $fields{$key};
}
exit;
Subroutines
# parse_annotation
#
# given a GenBank annotation, returns a hash with
# keys: the field names
# values: the fields
sub parse_annotation {
    my($annotation) = @_;
    my(%results) = ( );
    while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
        my $value = $&;
        (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
        $results{$key} = $value;
    }
    return %results;
}
Extracting the key
while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
my $value = $&;
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;
$results{$key} = $value;
}
What does the line (my $key = $value) =~ s/^([A-Z]+).*/$1/s; do?
First assigns $key the value $value
Uses the /s modifier so that . also matches the embedded newlines
Replaces the contents of $key with $1, the special variable holding the text matched by the first pair of parentheses
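A minimal sketch of the copy-then-substitute idiom on a single field, using the DEFINITION field as sample input:

```perl
use strict;
use warnings;

my $value = "DEFINITION  Homo sapiens PCCX1 mRNA for protein containing\n"
          . "            CXXC domain 1, complete cds.\n";

# The parentheses make the assignment happen first, so the substitution
# is applied to the new copy in $key and $value is left untouched.
# With /s, .* swallows the rest of the field, newlines included,
# and the whole match is replaced by the captured field name.
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

print "Key:   $key\n";
print "Value:\n$value";
```

Without the /s modifier, .* would stop at the first newline and the continuation line would be left behind in $key.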
Parsing the Features Table
Let’s tackle the source, gene and CDS feature keys
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD
                     finger and a CXXC
We will use an array rather than a hash, since there can be multiple instances of the same feature in a record
Parsing the Features Table
#!/usr/bin/perl
# - main program to test parse_features
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo;     # see Chapter 6 about this module
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my @features;
my $library = 'library.gb';
# Get the fields from the first GenBank record in a library
$fh = open_file($library);
$record = get_next_record($fh);
($annotation, $dna) = get_annotation_and_dna($record);
%fields = parse_annotation($annotation);
# Extract the features from the FEATURES table
@features = parse_features($fields{'FEATURES'});
# Print out the features
foreach my $feature (@features) {
    # extract the name of the feature (or "feature key")
    my($featurename) = ($feature =~ /^ {5}(\S+)/);
    print "******** $featurename *********\n";
    print $feature;
}
exit;
Subroutine
# parse_features
#
# extract the features from the FEATURES field of a GenBank record
sub parse_features {
    my($features) = @_;   # entire FEATURES field in a scalar variable
    # Declare and initialize variables
    my(@features) = ();   # used to store the individual features
    # Extract the features
    while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
        my $feature = $&;
        push(@features, $feature);
    }
    return @features;
}
A Closer Look at the Crucial Regular Expression
while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm )
The line begins with exactly 5 spaces followed by a non-whitespace character \S, followed by any number of non-newlines .* followed by a newline
Next come zero or more continuation lines, each beginning with exactly 21 spaces followed by a non-whitespace character \S, followed by any number of non-newlines .* followed by a newline
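A small sketch of the expression applied to a shortened features table. The indentation is built with the x operator so that the 5-space and 21-space prefixes are exact; the variable names $key_indent and $cont_indent are introduced here for illustration only, and the group is made non-capturing so the whole feature is returned in list context:

```perl
use strict;
use warnings;

my $key_indent  = ' ' x 5;    # feature key lines start with 5 spaces
my $cont_indent = ' ' x 21;   # qualifier lines start with 21 spaces

# Shortened features table with three features
my $features = join '',
    qq{${key_indent}source          1..2487\n},
    qq{${cont_indent}/organism="Homo sapiens"\n},
    qq{${cont_indent}/db_xref="taxon:9606"\n},
    qq{${key_indent}gene            229..2199\n},
    qq{${cont_indent}/gene="PCCX1"\n},
    qq{${key_indent}CDS             229..2199\n},
    qq{${cont_indent}/gene="PCCX1"\n};

# One array element per feature: the key line plus its qualifier lines
my @features = $features =~ /(^ {5}\S.*\n(?:^ {21}\S.*\n)*)/gm;

# Pull the feature key out of each element
my @names;
for my $feature (@features) {
    my ($name) = $feature =~ /^ {5}(\S+)/;
    push @names, $name;
}

print "Found ", scalar(@features), " features: @names\n";
```

Because the features are kept in an array, a record with several gene or CDS entries would simply produce several elements with the same feature key.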
Homework for Next Week
Read Tisdall Chapter 10
Pay particular attention to the Indexing GenBank with DBM section
Exercises 10.3 and 10.6
Begin working on Program 3
Quiz 4 next week