Short introduction to perl & gff
DESCRIPTION
Short introduction to perl & gff. Marcus Ronninger, The Linnaeus Centre for Bioinformatics. Motivation: Bioinformatics yields lots of information. The information has to be mined. Build or modify text files. Small changes can take a long time with lots of data.
TRANSCRIPT
Short introduction to perl & gff
Marcus Ronninger
The Linnaeus Centre for Bioinformatics
Motivation
• Bioinformatics yields lots of information
• The information has to be mined
• Build or modify text files
• Small changes can take a long time with lots of data
• Example: change every letter to lower case
• With script programming this can be done in less than a second
perl
• Practical Extraction and Report Language
• Scripts
• Object oriented programming
• Graphical web interface, CGI
• Possibilities
• BioPerl
Example
Example of a very simple perl script, to_lower_case.pl
#!/usr/bin/perl -w
use strict;

my $seqfile = $ARGV[0];
my $outfile = $ARGV[1];

open (SEQ, $seqfile) || die "Can't open file: $seqfile";
open (OUTFILE, "> $outfile") || die "Can't open file: $outfile";

while(<SEQ>){
    if ($_ =~ /^\>.*\n/){    # line starting with ">" (header line): copy unchanged
        print OUTFILE $_;
    }
    else{                    # any other line: convert to lower case
        print OUTFILE lc ($_);
    }
}
Useful tools for parsing files
• Scalar $
• Array @
• Regular expression /.fasta/
• Split, @chars = split //, $word
• Substitute s/old-regex/new-string/
• Upper and lower case: uc, lc
• Escape sequences: \n, \t, \s (whitespace, inside a regular expression) etc.
• Subroutines: sub
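The tools listed above can be tried out together in a few lines. The following is only a minimal sketch: the example word, the file name and the helper count_char are hypothetical and not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar: a single value (hypothetical example data)
my $word = "ACGTacgt";

# Array: split with an empty pattern gives one character per element
my @chars = split //, $word;

# Regular expression: does the (hypothetical) file name end in .fasta?
my $filename = "example.fasta";
my $is_fasta = ($filename =~ /\.fasta$/) ? 1 : 0;

# Substitute: upper-case a copy of the word, then replace every T with U
(my $rna = uc($word)) =~ s/T/U/g;

# sub: wrap a reusable step in a subroutine (hypothetical helper)
sub count_char {
    my ($string, $char) = @_;
    my $count = 0;
    foreach my $c (split //, $string) {
        $count++ if $c eq $char;
    }
    return $count;
}

print "chars: @chars\n";
print "fasta: $is_fasta\n";
print "upper: ", uc($word), "\tlower: ", lc($word), "\n";
print "rna:   $rna\n";
print "count: ", count_char(lc($word), "a"), "\n";
```

Note that the dot is escaped in /\.fasta$/ so that it matches a literal "." rather than any character.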
General feature format, gff
• AKA “gene finding format”
• A format for handling output from different feature finding programs
• Processes can be decoupled but the result can still be put together
• Makes it easy to include external algorithms
General feature format
The construction of the format is very simple. The values are tab-delimited.
SEQ1  EMBL  atg   103  105  .   +   0
SEQ1  EMBL  exon  103  172  .   +   0
1.    2.    3.    4.   5.   6.  7.  8.
1. Sequence name
2. Source of the feature
3. Feature type
4. Start
5. End
6. Score - most feature finding programs have some kind of score for the found motif
7. Strand - can either be + or -
8. Frame - 0, 1 or 2, or "." when the feature has no frame
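Because the values are tab-delimited, one line can be taken apart with split. The sketch below assumes exactly the eight columns listed above; the hash keys and the sub name parse_gff_line are my own labels, not part of any GFF library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Labels for the eight columns, in order (names chosen for this sketch)
my @FIELDS = qw(seqname source feature start end score strand frame);

# Split one tab-delimited GFF line into a hash keyed by field name
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    die "Expected 8 tab-delimited fields, got " . scalar(@values) . "\n"
        unless @values == 8;
    my %record;
    @record{@FIELDS} = @values;    # hash slice: pair each label with its value
    return \%record;
}

my $rec = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\n");

# Start and end are both part of the feature, hence the +1
my $length = $rec->{end} - $rec->{start} + 1;
print "$rec->{feature} on $rec->{seqname}, strand $rec->{strand}, length $length\n";
```

Returning a hash reference keeps the column names next to the values, which makes later filtering (for example, on feature type or strand) easier to read than numeric array indices.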
Small example
A small script that transforms known transcription factor binding sites (TFBS) into a .gff file
TFBS  Position  Motif
AP-2  -101      ccccaccccc
NF-1  -116      tgggctgcggccca
Hgcs  -117      ctgggctgcggc
Input file:
#Gfap
#Known TFBS (Besnard et al 1991)
#count backwards from the TSS
#start -14
AP-2: ccccaccccc -101
NF-1: tgggctgcggccca -116
Hgcs: ctgggctgcggc -117
Example
Basically the same procedure as the perl example above
$seqlength = 5000;
$gff = "";
while (<LIT>){
    if ($_ =~ /^#start/){
        $rel_start = $';
    }
    elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)){
        make_gff($_, $rel_start, "Literature");
    }
}
Example
sub make_gff{
    my $start;
    my $stop;
    my ($seq, $rs, $type) = @_;
    my @feature = split(/\s+/, $seq); # now the array has the feature information: name, motif, position
    if($type eq "Literature"){
        $start = $seqlength + $rs + $feature[2];
        $stop = $start + length($feature[1]) - 1;
        $sign = '.';
        $gff .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
    }
    etc.
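The excerpt relies on global variables and a file handle that is opened elsewhere. A self-contained variant might look like the sketch below. It is not the original script: the sub name make_gff_line and the in-memory input are hypothetical, and the coordinates follow the formula exactly as written in the excerpt, so the output of the original script (part of which is elided) may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seqlength = 5000;    # sequence length, as in the excerpt

# Build one GFF line from an input line such as "AP-2: ccccaccccc -101".
# $rel_start is the offset read from the "#start" line.
sub make_gff_line {
    my ($line, $rel_start, $type) = @_;
    my ($name, $motif, $pos) = split /\s+/, $line;
    my $start = $seqlength + $rel_start + $pos;
    my $stop  = $start + length($motif) - 1;
    return join("\t", $name, $type, $name, $start, $stop, 'undef', '.', '.') . "\n";
}

# Hypothetical in-memory input instead of the LIT file handle
my @lines = (
    "#Known TFBS (Besnard et al 1991)",
    "#start -14",
    "AP-2: ccccaccccc -101",
);

my $rel_start = 0;
my $gff = "";
foreach my $line (@lines) {
    if ($line =~ /^#start\s+(-?\d+)/) {
        $rel_start = $1;    # capture the number directly instead of using $'
    }
    elsif ($line !~ /^#/ && $line =~ /\w+/) {
        $gff .= make_gff_line($line, $rel_start, "Literature");
    }
}
print $gff;
```

Returning the line from the sub, rather than appending to a global $gff inside it, lets the script run under use strict and makes the sub easy to test on its own.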
Example
Output: a file named lit.gff with the following contents
AP-2:  Literature  AP-2:  4886  4895  undef  .  .
NF-1:  Literature  NF-1:  4871  4884  undef  .  .
Hgcs:  Literature  Hgcs:  4870  4881  undef  .  .
This can now be loaded into programs that support the gff format, e.g. Apollo
Apollo
• Gff files are boring to view as they are
• Use graphical software
• Apollo, a sequence annotation editor
• Great for viewing gff files together with the sequence
References
• Tisdall J.D., “Beginning Perl for Bioinformatics”, 2001, O’Reilly
• http://www.sanger.ac.uk/Software/formats/GFF/
• http://www.fruitfly.org/annot/apollo/