5.2 FASTA: Analyzing complex input
Overall design:
Read the FASTA file (several sequences).
For each sequence:
1. Read the FASTA sequence
1.1. Read FASTA header
1.2. Read each line until next FASTA header
2. For each sequence: Do something
2.1. Compute G+C content
2.2. Print header and G+C content
Let’s see how it’s done…
[Flowchart: Start → Read line → End of input? If yes: End. If no: Save header → Read line → Header or end of input? If no: Concatenate to sequence → Read line, then check again. If yes: Do something, then return to "End of input?".]
5.3
# 1. Read FASTA sequence
$fastaLine = <STDIN>;
while (defined $fastaLine) {
    # 1.1. Read FASTA header (drop the newline and the leading ">")
    chomp $fastaLine;
    $header = substr($fastaLine,1);
    $seq = "";    # start a new, empty sequence for this header
    $fastaLine = <STDIN>;
    # 1.2. Read sequence until next FASTA header
    while ((defined $fastaLine) and
           (substr($fastaLine,0,1) ne ">" ))
    {
        chomp $fastaLine;
        $seq .= $fastaLine;
        $fastaLine = <STDIN>;
    }
    # 2. Do something
    ... # 2.1 compute $gcContent
    print "$header: $gcContent\n";
}
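Step 2.1 is left open in the code above. One possible way to fill it in is sketched below. The helper name gc_content is an assumption made for this illustration, not part of the original script, and it returns a fraction between 0 and 1.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed): fraction of G and C letters in a sequence.
sub gc_content {
    my ($seq) = @_;
    my $length = length $seq;
    return 0 if $length == 0;        # avoid dividing by zero on an empty sequence
    my $gc = ($seq =~ tr/GCgc//);    # tr with an empty replacement only counts matches
    return $gc / $length;
}

# 4 of the 6 letters are G or C, so this prints 0.67
printf "%.2f\n", gc_content("ATGCGC");
```

Inside the loop above this could be used as $gcContent = gc_content($seq);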
5.4 Class exercise 4a
1. Write a script that reads lines of names and expenses:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
For each line print the name and the sum. Stop when you reach "END".
2. Change your script to read names and expenses on separate lines. Identify lines with numbers by a "+" sign as the first character in the string:
Yossi
+6.10
+16.50
+5.00
Dana
+21.00
+6.00
Refael
+6.10
+24.00
+7.00
+8.00
END
Sum the numbers while there is a '+' sign before them.
Output (for both questions):
Yossi 27.6
Dana 27
Refael 45.1
5.5 Class exercise 4a
3. (Home Ex. 2 Q. 5) Write a script that reads several protein sequences in FASTA format, and prints the name and length of each sequence. Start with the example code from the last lesson.
4*. Write a script that reads several DNA sequences in FASTA format, and prints FASTA output of the sequences whose header starts with 'Chr07'.
5**. Write a script that reads several DNA sequences in FASTA format, and prints FASTA output of the sequences whose header contains 'Chr07'.
5.7 Reading files
Open a file for reading, and link it to a filehandle:
open(IN, "<EHD.fasta");
Then read lines from the filehandle, exactly as you would from <STDIN>:
my $line = <IN>;          # reads one line
my @inputLines = <IN>;    # reads all the remaining lines
foreach $line (@inputLines) ...
Every filehandle that is opened should be closed:
close(IN);
Always check that the open didn't fail (e.g. if a file by that name doesn't exist):
open(IN, "<$file") or die "can't open file $file";
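The same steps (open, check, read, close) can be wrapped in one small subroutine. The sketch below is an illustration only. The name read_lines is assumed, and it uses the three-argument form of open with a lexical filehandle ($in), a common alternative to the bareword IN handle used in these slides.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed): return all lines of a file, without newlines.
sub read_lines {
    my ($file) = @_;
    # Three-argument open: the mode "<" is separate from the file name
    open(my $in, "<", $file) or die "can't open file $file: $!";
    my @lines = <$in>;    # list context: read all the lines at once
    close($in);
    chomp @lines;         # remove the newline from every line
    return @lines;
}

# Demo: runs only for file names given on the command line
foreach my $file (@ARGV) {
    my @lines = read_lines($file);
    print "$file: ", scalar(@lines), " lines\n";
}
```

The special variable $! holds the system's reason for the failure, which makes the die message more useful.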
5.8 Writing to files
Open a file for writing, and link it to a filehandle:
open(OUT, ">EHD.analysis") or die...
NOTE: If a file by that name already exists it will be overwritten!
You can also append lines to the end of an existing file:
open(OUT, ">>EHD.analysis") or die...
Print to a file (in both cases):
print OUT "The mutation is in exon $exonNumber\n";
Note: there is no comma between the filehandle (OUT) and the string.
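The difference between ">" and ">>" can be tried out with a short sketch. The helper name write_line is an assumption made for this illustration, and it uses a lexical filehandle instead of OUT.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed): write one line of text to a file.
# $mode is ">" (overwrite) or ">>" (append).
sub write_line {
    my ($mode, $file, $text) = @_;
    open(my $out, $mode, $file) or die "can't open file $file: $!";
    print $out "$text\n";    # no comma after the filehandle
    close($out) or die "can't close file $file: $!";
}

# Demo: runs only when an output file name is given on the command line
if (@ARGV) {
    my $file = $ARGV[0];
    write_line(">",  $file, "first");     # the file now holds one line
    write_line(">>", $file, "second");    # the file now holds two lines
    write_line(">",  $file, "third");     # the two earlier lines are gone
}
```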
5.9 File test operators
You can ask questions about a file or a directory by its name (a filehandle also works):
if (-e $name) { print "The file $name exists!\n"; }
-e $name exists
-r $name is readable
-w $name is writable by you
-z $name has zero size
-s $name has non-zero size (returns size)
-f $name is a file
-d $name is a directory
-l $name is a symbolic link
-T $name is a text file
-B $name is a binary file (opposite of -T)
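Several of these tests can be combined to describe a path. The following is a minimal sketch. The name describe_path and the wording of the returned messages are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed): describe a path using file test operators.
sub describe_path {
    my ($name) = @_;
    return "does not exist" unless -e $name;
    return "directory"      if -d $name;
    return "empty file"     if -z $name;
    my $size = -s $name;    # -s returns the size in bytes
    return "file of $size bytes";
}

# Demo: describe every name given on the command line
foreach my $name (@ARGV) {
    print "$name: ", describe_path($name), "\n";
}
```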
5.10 Working with paths
open( IN, '<D:\workspace\Perl\p53.fasta' );
• Always use a full path name: it is safer and clearer to read
• Remember to use \\ in double quotes
open( IN, "<D:\\workspace\\Perl\\$name.fasta" );
• (usually) you can also use /
open( IN, "<D:/workspace/Perl/$name.fasta" );
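The quoting rules can be checked without opening any file. In this small sketch the value "p53" for $name is an assumed example. It shows that the single-quoted path and the double-quoted path with \\ are the same string.

```perl
use strict;
use warnings;

my $name = "p53";    # assumed example value

my $single  = 'D:\workspace\Perl\p53.fasta';         # single quotes: no interpolation
my $double  = "D:\\workspace\\Perl\\$name.fasta";    # \\ gives one backslash
my $forward = "D:/workspace/Perl/$name.fasta";       # forward slashes need no escaping

print "$single\n$double\n$forward\n";
print "The single and double quoted paths are equal\n" if $single eq $double;
```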
5.11 Reading files: example
$line = <STDIN>;
chomp $line;
# Each loop iteration processes one input line and prints its output
while ($line ne "END") {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print "$name $sum\n";
    # Read next line
    $line = <STDIN>;
    chomp $line;
}
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.12 Reading files: example
open(IN, '<D:\perl_ex\in.txt') or die "can't open input file";
$line = <IN>;
chomp $line;
# Each loop iteration processes one input line and prints its output
while ($line ne "END") {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print "$name $sum\n";
    # Read next line
    $line = <IN>;
    chomp $line;
}
close(IN);
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.13 Reading files: example
open(IN, '<D:\perl_ex\in.txt') or die "can't open input file";
open(OUT,'>D:\perl_ex\out.txt') or die "can't open output file";
$line = <IN>;
chomp $line;
# Each loop iteration processes one input line and prints its output
while ($line ne "END") {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print OUT "$name $sum\n";
    # Read next line
    $line = <IN>;
    chomp $line;
}
close(IN);
close(OUT);
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.14 Class exercise 5a
1. Change the script for class exercise 4a.2 to read the lines from an input file (instead of reading lines from the keyboard).
2. Now, in addition, write the output of the previous question to a file named 'D:\perl_ex\class.ex.4a2.out' (instead of printing to the screen).
3*. Now, before opening 'D:\perl_ex\class.ex.4a2.out', check if it exists, and if so, print a message that the output file already exists, and exit the script.
4*. Change the script for class exercise 4a.3 to receive two strings from the user: 1) the name of a FASTA file 2) the name of an output file. Then read from the FASTA file given by the user, and write to the output file also supplied by the user.
5.16 Command line arguments
It is common to give arguments (separated by spaces) within the command-line for a program or a script:
> perl -w findProtein.pl D:\perl_ex\in.fasta 2 430
They will be stored in the array @ARGV:
'D:\perl_ex\in.fasta'
'2'
'430'
foreach my $arg (@ARGV){ print "$arg\n";}
Output:
D:\perl_ex\in.fasta
2
430
5.17 Command line arguments
A space inside an argument splits it into two arguments:
> perl -w findProtein.pl D:\my perl\in.fasta 2 430
@ARGV:
'D:\my'
'perl\in.fasta'
'2'
'430'
foreach my $arg (@ARGV){ print "$arg\n";}
Output:
D:\my
perl\in.fasta
2
430
5.18 Command line arguments
Enclose an argument that contains a space in double quotes to keep it as one argument:
> perl -w findProtein.pl "D:\my perl\in.fasta" 2 430
@ARGV:
'D:\my perl\in.fasta'
'2'
'430'
foreach my $arg (@ARGV){ print "$arg\n";}
Output:
D:\my perl\in.fasta
2
430
5.19 Command line arguments
It is common to give arguments (separated by spaces) within the command-line for a program or a script:
> perl -w findProtein.pl D:\perl_ex\in.fasta D:\perl_ex\out.txt
They will be stored in the array @ARGV:
my $inFile = $ARGV[0];
my $outFile = $ARGV[1];
Or more simply:
my ($inFile,$outFile) = @ARGV;
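A script that expects two file names may want to check that it actually received them. The sketch below is one possible way to do this. The helper name parse_args and the usage message are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed): return (input file, output file),
# or die with a usage message if the number of arguments is wrong.
sub parse_args {
    my @args = @_;
    @args == 2
        or die "Usage: perl findProtein.pl <input file> <output file>\n";
    my ($inFile, $outFile) = @args;
    return ($inFile, $outFile);
}

# Demo: runs only when arguments are given on the command line
if (@ARGV) {
    my ($inFile, $outFile) = parse_args(@ARGV);
    print "Reading from $inFile, writing to $outFile\n";
}
```

In numeric context an array gives its number of elements, so @args == 2 tests how many arguments were passed.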
5.22 Reading files: example
Reminder: the class exercise from three days ago.
$line = <STDIN>;
chomp $line;
# Each loop iteration processes one input line and prints its output
while ($line ne "END") {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print "$name $sum\n";
    # Read next line
    $line = <STDIN>;
    chomp $line;
}
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.24 Reading files: example
my ($inFileName) = @ARGV;
open(IN, "<$inFileName") or die "can't open $inFileName";
$line = <IN>;
chomp $line;
# Each loop iteration processes one input line and prints its output
while ($line ne "END") {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print "$name $sum\n";
    # Read next line
    $line = <IN>;
    chomp $line;
}
close(IN);
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.25 Reading files: example
my ($inFileName, $outFileName) = @ARGV;
open(IN, "<$inFileName") or die "can't open $inFileName";
open(OUT, ">$outFileName") or die "can't open $outFileName";
$line = <IN>;
chomp $line;
# Each loop iteration processes one input line and prints its output
while ($line ne "END") {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print OUT "$name $sum\n";
    # Read next line
    $line = <IN>;
    chomp $line;
}
close(IN);
close(OUT);
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
END
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.26 Reading files: example
my ($inFileName, $outFileName) = @ARGV;
open(IN, "<$inFileName") or die "can't open $inFileName";
open(OUT, ">$outFileName") or die "can't open $outFileName";
$line = <IN>;
chomp $line;
# The loop now stops at the end of the input, so no "END" line is needed
while (defined $line) {
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print OUT "$name $sum\n";
    # Read next line
    $line = <IN>;
    chomp $line;    # at the end of the input $line is undefined here, and -w warns
}
close(IN);
close(OUT);
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
Output:
Yossi 27.6
Dana 27
Refael 45.1
5.27 Reading files: example
my ($inFileName, $outFileName) = @ARGV;
open(IN, "<$inFileName") or die "can't open $inFileName";
open(OUT, ">$outFileName") or die "can't open $outFileName";
$line = <IN>;
# Each loop iteration processes one input line and prints its output
while (defined $line) {
    chomp $line;    # chomp only after checking that the line is defined
    # Separate name and numbers
    @nameAndNums = split(/ /, $line);
    $name = $nameAndNums[0];
    @nums = split(/,/, $nameAndNums[1]);
    $sum = 0;
    # Sum numbers
    foreach $num (@nums) {
        $sum = $sum + $num;
    }
    print OUT "$name $sum\n";
    # Read next line
    $line = <IN>;
}
close(IN);
close(OUT);
Input:
Yossi 6.10,16.50,5.00
Dana 21.00,6.00
Refael 6.10,24.00,7.00,8.00
Output:
Yossi 27.6
Dana 27
Refael 45.1
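The per-line work in these examples can also be moved into a subroutine, which makes it easy to try on a single line. This is a sketch under assumptions: the name sum_expenses is made up for the illustration, and the loop uses a lexical filehandle and reads inside the while condition, a common variation on the loop shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed): turn a line such as
# "Yossi 6.10,16.50,5.00" into the name and the sum of the numbers.
sub sum_expenses {
    my ($line) = @_;
    chomp $line;
    my ($name, $numbers) = split(/ /, $line);
    my $sum = 0;
    foreach my $num (split(/,/, $numbers)) {
        $sum += $num;
    }
    return ($name, $sum);
}

# Demo: runs only when an input file name is given on the command line
if (@ARGV) {
    my ($inFileName) = @ARGV;
    open(my $in, "<", $inFileName) or die "can't open $inFileName: $!";
    # The loop ends when the read returns undef at the end of the input
    while (defined(my $line = <$in>)) {
        my ($name, $sum) = sum_expenses($line);
        print "$name $sum\n";
    }
    close($in);
}
```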