Introduction to BASH, AWK, and PERL
Victor Anisimov, NCSA
FIU / SSERCA / XSEDE Workshop, Apr 4-5, 2013, Miami, FL
April 18, 2023
MOTIVATION
• Increase Productivity of Research & Development
Scripting languages take less effort than compiled programming languages when implementing small computational projects
Scripts are more portable than binary code
Scripts are easy to maintain
Lab materials: /home/anisimov/labs.tgz on FIU cluster
Important: type "module add make" after logging in to the FIU cluster
BASH, AWK, and PERL
BASH is a Linux shell
AWK is a language for data post-processing
PERL is a versatile programming language
Common feature:
interpreted programming languages
How to decide which one I will need:
project complexity dictates which language to use
Objective of the Course
As of now:
• No prerequisites are necessary
• No change in the way you think
• No need to memorize abstract concepts
At the end of the day:
• You will learn three programming languages
• You will improve your project organization skills
• You will increase your productivity
Every Project Works with Data
• Data generation by computation
• Extraction of data from text files
• Data format conversion
• Data computation
• Data analysis and reporting
• Data archival and retrieval
Scripting languages can handle this work without turning the data processing into a major programming project
Projects have Complex Processing Flows
• Input to a program depends on the result of another program
• The process includes many steps that need to be automated
• The process is not standard and has to be created
• The process needs to be optimized
Scripting Languages are perfect for automation of repetitive processes
Elements of Programming Language
• Data types
• Conditional statements
• Loops
• Functions / procedures
• Input / Output
Our first guide to this virtual world is the BASH shell.
BASH Data Types
• BASH treats all variables as text strings
• Limited support for integer arithmetic
#!/bin/bash
greetings="Hello ${USER}!" # example of a string
today=`date` # run a program by enclosing it in grave accents
echo "${greetings} Today is ${today}"
N=1; let N=N+2; echo "Integer math: 1+2=${N}"
R=0.1; R=`echo "$R+1.2" | bc -l`; echo "FP math: 0.1+1.2=${R}"
$ chmod 755 01-hello.sh
$ ./01-hello.sh
Hello victor! Today is Thu Apr 4 13:37:02 EST 2013
Integer math: 1+2=3
FP math: 0.1+1.2=1.3
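The same two kinds of arithmetic can also be written with `$(( ))` arithmetic expansion and `$( )` command substitution. The sketch below is only an illustration: the helper names `add_int` and `add_real` are made up here, and the floating-point sum is delegated to awk rather than bc.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the lab files).

# Integer addition with arithmetic expansion.
add_int() {
  echo $(( $1 + $2 ))
}

# Floating-point addition delegated to awk, because BASH itself
# only supports integer arithmetic.
add_real() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a + b }'
}

echo "Integer math: 1+2=$(add_int 1 2)"
echo "FP math: 0.1+1.2=$(add_real 0.1 1.2)"
```

Both forms nest more cleanly than `let` and grave accents, which helps once commands are combined.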
BASH Conditional Statements
One more data type: built-in variables (positional and special parameters)
$# - number of arguments; $0 - self name; $1, $2, … - command-line arguments
#!/bin/bash
# supported string comparison conditions: == !=
# supported arithmetic conditions: -eq (==) -ne (!=) -lt (<) -le (<=) -gt (>) -ge (>=)
if [ $# != 2 ] ; then
  echo "USAGE $0 argument1 argument2" ; exit
fi
if [ "$1" -gt "$2" ] ; then
  echo "True: $1 -gt $2"
else
  echo "False: $1 -gt $2"
fi
$ ./02-conditions.sh
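To see both branches of the comparison without creating a script file, the same logic can be wrapped in a function. This is a sketch only; `compare_ints` is a hypothetical name, and it uses `return` instead of `exit` so that a wrong argument count does not close the shell.

```shell
#!/bin/bash
# Hypothetical helper: compare two integers and report the result.
compare_ints() {
  if [ $# -ne 2 ] ; then
    echo "USAGE: compare_ints int1 int2" ; return 1
  fi
  if [ "$1" -gt "$2" ] ; then
    echo "True: $1 -gt $2"
  else
    echo "False: $1 -gt $2"
  fi
}

compare_ints 5 3
compare_ints 2 7
```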
BASH Loops
• Loop over list (example: 03-loops.sh)
LIST="01 02 03 04 05"
for job in ${LIST} ; do
  echo "job number ${job}"
done
• Conditional loop
N=1
while [ ${N} -le 5 ] ; do
  echo ${N}
  let N=N+1
done
• C-style loop
LIMIT=5
for ((a=1; a <= LIMIT ; a++)) ; do
  echo ${a}
done
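The list loop and the C-style loop combine naturally: one can build the job list, the other can consume it. The sketch below assumes a hypothetical helper, `make_job_list`, that generates zero-padded job numbers instead of typing them by hand.

```shell
#!/bin/bash
# Build a zero-padded job list like "01 02 03" with a C-style loop.
# make_job_list is a hypothetical helper name.
make_job_list() {
  local limit=$1 list="" a
  for ((a=1; a <= limit; a++)) ; do
    list+=$(printf "%02d " "${a}")
  done
  echo "${list% }"   # strip the trailing space
}

LIST=$(make_job_list 5)
for job in ${LIST} ; do
  echo "job number ${job}"
done
```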
BASH Procedures / Functions
• Functions contain the repetitive parts of the code
#!/bin/bash
# declaration of function
filenameGenerator()
{
echo "$1.out"
}
# call the function and supply arguments
filenameGenerator 1
filenameGenerator 2
$ ./04-functions.sh
1.out
2.out
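A function's output is often captured in a variable rather than printed. The sketch below reuses filenameGenerator for that purpose; the `local` variable and the surrounding loop are additions made for illustration.

```shell
#!/bin/bash
# Same function as above, but its output is captured in a variable
# instead of being printed directly.
filenameGenerator()
{
  local base=$1      # local keeps the variable private to the function
  echo "${base}.out"
}

for job in 1 2 ; do
  outfile=$(filenameGenerator "${job}")
  echo "job ${job} writes to ${outfile}"
done
```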
BASH Input / Output
• I/O is extremely simple in BASH
cat file.out               # send file content to standard output
mycode.sh | mytool.sh      # send output to another program
mycode.sh > /dev/null      # get rid of unwanted output
mycode.sh &> log.out &     # run in the background, logging stdout and stderr to log.out
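Standard output and standard error are separate streams, which matters when only one of them should be discarded or piped. In the sketch below, the function `mycode` is a hypothetical stand-in for mycode.sh that writes one line to each stream.

```shell
#!/bin/bash
# Hypothetical stand-in for mycode.sh: writes one line to standard output
# and one line to standard error.
mycode() {
  echo "result: 42"
  echo "warning: demo message" >&2
}

mycode 2> /dev/null              # keep stdout, discard stderr
mycode 2>&1 | wc -l              # merge both streams into the pipe
mycode > /dev/null 2>&1          # discard everything
```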
Sample BASH Project
• Perform text replacement in a file (05-project.sh)
#!/bin/bash
if [ $# -ne 1 ] ; then
  echo "Usage: $0 file.coor"
else
  # create name for output file
  outfile=`echo $1 | sed 's/\.coor/\.pdb/'`
  # replace "HETATM" by "ATOM  " in the text (padded to six characters so the PDB columns stay aligned)
  cat $1 | sed 's/HETATM/ATOM  /' > $outfile
  # count number of processed lines
  wc -l $outfile
fi
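The two text operations in this project can be tried on their own before running them on a real file. The sketch below is an alternative formulation, not the lab script: `pdb_name` and `fix_records` are hypothetical names, the output name is derived with parameter expansion instead of sed, and `water.coor` is a made-up file name.

```shell
#!/bin/bash
# Hypothetical helper: derive the output name with parameter expansion.
# ${1%.coor} removes a trailing ".coor" from the first argument.
pdb_name() {
  echo "${1%.coor}.pdb"
}

# Replace HETATM with ATOM on standard input; the two trailing spaces keep
# the record name six characters wide so the columns stay aligned.
fix_records() {
  sed 's/HETATM/ATOM  /'
}

pdb_name water.coor
echo "HETATM    1  O   HOH" | fix_records
```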
AWK
• Although simple and powerful, BASH code can quickly become bulky because of limited structural constructs
• AWK is designed to simplify data extraction and post-processing, so it nicely complements BASH when computational projects become a little more involved
Developed by Aho, Weinberger, and Kernighan
The Power of AWK in Action
• Compute the sum of numbers in one line of code
#!/bin/bash
awk 'BEGIN{sum=0} {for (i = 1; i <= NF; i++) sum += $i} END{print sum}'
$ echo "1.2 2.3 3.4" | ./01-sum.sh
6.9
AWK logistics:
• Section BEGIN{…} is executed once at the beginning
• Standard input is processed line by line by the main program body, i.e. by the second {…} block
• NF is a built-in variable equal to the number of fields in the current input line
• $1, $2, … are the individual input fields
• i is the loop index, so we can address each field as $i
• Input fields are processed in the C-style for-loop and their values are summed up
• Section END{…} is executed once at the end of execution
• Variable type is automatically recognized by awk based on operation type
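The same main-body/END structure extends to other summary statistics. The following sketch is a hypothetical variation of 01-sum.sh (the function name `stats` is invented here) that also counts the fields and reports their mean.

```shell
#!/bin/bash
# Hypothetical variation of 01-sum.sh: report the count, sum, and mean of
# all fields read from standard input.
stats() {
  awk '{ for (i = 1; i <= NF; i++) { sum += $i; n++ } }
       END { if (n > 0) printf "%d %.2f %.2f\n", n, sum, sum / n }'
}

echo "1.2 2.3 3.4" | stats
```

No BEGIN block is needed here because awk treats uninitialized variables as zero in arithmetic.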
AWK: Input Field Separator (option -F)
• AWK accepts custom field separators
#!/bin/bash
awk -F"$1" '{for (i = 1; i <= NF; i++) print $i}'
Use comma as field separator
$ echo "1,a,3,b:5" | ./02-inpfields.sh ,
1
a
3
b:5
Challenge: Try using different field separators
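As one possible take on the challenge, the sketch below splits the same input on a colon instead of a comma. The function name `split_fields` is hypothetical; it wraps the awk command from 02-inpfields.sh so it can be called without a script file.

```shell
#!/bin/bash
# Hypothetical helper: split standard input on the separator given as $1
# and print one field per line.
split_fields() {
  awk -F"$1" '{ for (i = 1; i <= NF; i++) print $i }'
}

echo "1,a,3,b:5" | split_fields :
```

With the colon as separator the commas are ordinary characters, so the input breaks into only two fields.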
AWK: PDB-to-XYZ Format Conversion
#!/bin/bash
# Convert PDB file to XYZ format
if [ $# -ne 1 ] ; then
echo "Usage: $0 input.pdb"
else
cat $1 |
awk 'BEGIN {n=0}
{ if($1 == "ATOM") {n=n+1; a[n]=$3; x[n]=$5; y[n]=$6; z[n]=$7} }
END {
printf "%d\n\n", n;
for (i=1; i<=n; i++)
printf "%-5s %7.3f %7.3f %7.3f\n", a[i], x[i],y[i],z[i];
}'
fi
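A quick way to check the conversion is to feed it a few records inline. In the sketch below, `pdb_to_xyz` is a hypothetical wrapper around the same awk program, and the three-atom input is made up for illustration; it simply follows the column layout the script assumes (atom name in field 3, coordinates in fields 5 to 7).

```shell
#!/bin/bash
# Same awk program as in 03-convert.sh, wrapped in a function that reads
# standard input. The sample records below are made up for illustration.
pdb_to_xyz() {
  awk 'BEGIN { n = 0 }
       $1 == "ATOM" { n++; a[n] = $3; x[n] = $5; y[n] = $6; z[n] = $7 }
       END {
         printf "%d\n\n", n
         for (i = 1; i <= n; i++)
           printf "%-5s %7.3f %7.3f %7.3f\n", a[i], x[i], y[i], z[i]
       }'
}

pdb_to_xyz <<'EOF'
REMARK hypothetical water molecule
ATOM 1 O HOH 0.000 0.000 0.117
ATOM 2 H1 HOH 0.000 0.757 -0.469
ATOM 3 H2 HOH 0.000 -0.757 -0.469
EOF
```

The output starts with the atom count and a blank line, followed by one line per atom, which is the layout the XYZ format expects.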
03-convert.sh
Arrays in AWK are super easy !!!
AWK: Column Block-average
#!/bin/bash
# compute block-average for data from loan.out
if [ $# -ne 2 ] ; then
echo "USAGE: $0 blocksize column" ; exit
fi
cat loan.out | awk -v blocksize=$1 -v column=$2 '
BEGIN{n=0; j=0}
{ if(NF==10) {x[n]=$column; n++} } # read all data
END{
nblocks = n / blocksize;
for(i=0; i<nblocks; i++){ # loop over blocks
aver=0.0; # compute average for each block
for(nRecs=0; nRecs<blocksize && j<n; nRecs++) { aver += x[j]; j++ }
printf "%4d %9.3f %d\n", i+1, aver/nRecs, nRecs;
}
}'
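Block averaging is easiest to verify on a tiny data set where the answer is obvious. The sketch below is a simplified variant rather than the lab script: `block_average` is a hypothetical name, it reads standard input instead of loan.out, and it drops the NF==10 filter so any table works.

```shell
#!/bin/bash
# Block-average one column of whitespace-separated data from standard input.
# block_average is a hypothetical name; unlike 04-blockaverage.sh it does
# not filter on NF==10, so it works on any table.
block_average() {
  awk -v blocksize="$1" -v column="$2" '
    { x[n++] = $column }                      # read all data
    END {
      for (j = 0; j < n; j += blocksize) {    # loop over blocks
        sum = 0; nRecs = 0
        for (k = j; k < j + blocksize && k < n; k++) { sum += x[k]; nRecs++ }
        printf "%4d %9.3f %d\n", j / blocksize + 1, sum / nRecs, nRecs
      }
    }'
}

# Five records in blocks of two: the last block holds a single record.
printf '%s\n' "a 1" "a 2" "a 3" "a 4" "a 5" | block_average 2 2
```

The third output column shows how many records went into each block, which makes a short final block easy to spot.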
04-blockaverage.sh
AWK: Multiple Input Files
• Alternative processing of input data from a file
#!/bin/bash
# alternative way of handling input files
inpfile="loan.out" # input file to be processed
nlines=`wc -l ${inpfile} | awk '{print $1}'` # get number of lines
awk -v inpfile=${inpfile} -v size=${nlines} '
BEGIN{
command = "cat " inpfile; # string concatenation
for(i=0; i<size; i++) {
command | getline; # getting a line from the file
if(NF==10) print $0; # print entire line
}
}'
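awk can also read a file directly with `getline line < file`, which removes the need to count lines first. This is a sketch of that alternative, not the lab's method: `filter_by_nf` is a hypothetical name, and the temporary file stands in for loan.out.

```shell
#!/bin/bash
# Hypothetical helper: print only the lines of a file that have a given
# number of fields, reading the file with getline inside BEGIN.
filter_by_nf() {
  awk -v inpfile="$1" -v want="$2" '
    BEGIN {
      # getline returns 1 per line read, 0 at end of file, -1 on error,
      # so the loop stops by itself and no line count is needed.
      while ((getline line < inpfile) > 0) {
        if (split(line, fields, " ") == want) print line
      }
      close(inpfile)
    }'
}

tmpfile=$(mktemp)
printf '%s\n' "a b c" "only two" "d e f" > "${tmpfile}"
filter_by_nf "${tmpfile}" 3
rm -f "${tmpfile}"
```

Because each file gets its own `getline` stream, the same pattern extends to reading several input files in one awk program.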
05-nfiles-demo.sh, 06-nfiles-full.sh
AWK: Functions – Return Absolute Value
• Compute absolute value
#!/bin/sh
awk 'function abs(x){return ((x+0.0 < 0.0) ? -x : x)} {print abs($1)}'
$ echo -23.11 | ./07-function.sh
23.11
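User-defined functions can call each other and take several arguments. The sketch below is an illustration only: `max_abs` and the awk function `max` are names introduced here, combined with the same `abs` to find the largest absolute value on a line.

```shell
#!/bin/bash
# Hypothetical example: two user-defined awk functions working together to
# find the largest absolute value among all input fields.
max_abs() {
  awk 'function abs(x) { return (x + 0.0 < 0.0) ? -x : x }
       function max(a, b) { return (a > b) ? a : b }
       { for (i = 1; i <= NF; i++) best = max(best + 0, abs($i + 0)) }
       END { print best + 0 }'
}

echo "3.5 -23.11 7" | max_abs
```

Adding 0 to a field forces awk to treat it as a number, so the comparison inside `max` is numeric rather than textual.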
AWK: Writing to File
• AWK writes to file by using the mechanism of output redirection
#!/bin/sh
# redirecting output to a file
if [ $# -ne 1 ] ; then
echo "Usage $0 input.pdb" ; exit
fi
output=`echo $1 | sed 's/\.pdb/\.txt/'`
cat $1 | awk -v fname=${output} '{print $0 > fname}'
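Since the output file name is an ordinary awk expression, it can change from line to line. The sketch below uses that to sort records into separate files; `split_by_record` is a hypothetical name and the three input records are made up.

```shell
#!/bin/bash
# Hypothetical helper: write each input line to <outdir>/<first field>.txt,
# so ATOM and HETATM records end up in separate files.
split_by_record() {
  awk -v outdir="$1" '{ fname = outdir "/" $1 ".txt"; print $0 > fname }'
}

outdir=$(mktemp -d)
printf '%s\n' "ATOM 1 C" "HETATM 2 O" "ATOM 3 N" | split_by_record "${outdir}"
wc -l "${outdir}"/*.txt
rm -rf "${outdir}"
```

Within a single awk run, `>` truncates a file only when it is first opened, so later lines are appended to the same file.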
08-file.sh
Exercise
NCSA Loan Simulator (copyleft), FIU Workshop 2013, will be our computational kernel
Input:
Starting balance = $ 1000.00
Annual interest = % 7.20
Minimum payment = % 1.00
Output:
month: 1 balance: 1006.00 charge: 6.00 payment: 259.00 interest: 6.00
month: 2 balance: 751.48 charge: 4.48 payment: 259.00 interest: 10.48
month: 3 balance: 495.43 charge: 2.95 payment: 259.00 interest: 13.43
month: 4 balance: 237.85 charge: 1.42 payment: 237.85 interest: 14.85
Simulation results:
Borrowed 1000.00
Paid 1014.85 in 4 months
Finance charge 14.85
Write a script to optimize the loan duration
The program is not flexible enough, so how do we get the answer we need?
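Before wrapping the simulator in a script, it helps to understand the arithmetic behind the sample output. The sketch below is inferred from the numbers shown above and is not the simulator's source: it assumes the monthly charge is the balance times one twelfth of the annual rate, rounded to cents, followed by a fixed payment (259.00 in the sample). The function name `simulate_loan` and the trial payments 100 and 500 are hypothetical.

```shell
#!/bin/bash
# Sketch of the loan arithmetic, inferred from the sample output above; it
# is not the simulator's source. Arguments: balance, annual interest (%),
# and a fixed monthly payment. Prints: months, total paid, finance charge.
simulate_loan() {
  awk -v balance="$1" -v annual="$2" -v payment="$3" 'BEGIN {
    rate = annual / 100 / 12                          # monthly interest rate
    while (balance > 0 && months < 1200) {            # cap guards against endless loans
      charge = sprintf("%.2f", balance * rate) + 0    # round to cents
      balance += charge
      pay = (payment < balance) ? payment : balance   # last payment clears the rest
      balance -= pay
      paid += pay; interest += charge; months++
    }
    printf "%d %.2f %.2f\n", months, paid, interest
  }'
}

# Try several payments and watch how the duration and finance charge change.
for payment in 100 259 500 ; do
  echo "payment ${payment}: $(simulate_loan 1000 7.2 ${payment})"
done
```

With a payment of 259 this reproduces the sample run (4 months, 1014.85 paid, 14.85 finance charge). A loop like the one at the bottom is one way to scan candidate payments when the program itself cannot search for the answer.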
PERL
• Full-fledged (interpreted) programming language
• Highly optimized and amazingly fast
• Ideal for data processing and data extraction
• Lots of reusable plug-ins available for download
• Short learning curve
• If you know the C language, you already know Perl
Practical Extraction and Report Language by Larry Wall
PERL: Program Structure
#!/usr/bin/perl -w
# -w enables warnings
my $inpFileName = ""; # string; the semicolon at the end of a statement is mandatory
my $sum = 0.0; # floating point
if (@ARGV != 1) { # number of command-line arguments
  printf " USAGE %s loan.out\n", $0; exit } # $0 is self program name
else {
  $inpFileName = $ARGV[0]; # read 1st command-line argument
  # open file descriptor for reading (<)
  unless (open INP, "<$inpFileName") { die "Error: Cannot open input file $inpFileName" }
  readData();
  close INP; # close file descriptor after reading is done
  print "All Done\n";
}
sub readData {
  # do the work here (will be described later)
}
PERL: Pattern Matching
Extracting specific parts from text files is often a non-trivial task
# Patterns
my $ap = "\\S+"; # Any non-space sequence
my $lp = "\\w+\\d*"; # Label (text) pattern
my $ip = "-?\\d+"; # Integer pattern
my $rp = "-?\\d*\\.?\\d*"; # Real pattern
my $ep = "[+-]?\\d+\\.?\\d*[DE]?[+-]?\\d*"; # Exponential pattern (scientific format)
Mask:
\s – whitespace character
\S – non-whitespace character
\w – word character (a-zA-Z0-9_)
\W – anything but a word character
\d – numeric character (0-9)
\D – anything except a numeric character
\. – literal period (an unescaped . matches any character)
Multiplier:
[+-]? – either + or - or neither
+ – one or more instances
? – optional instance
* – any number of instances
PERL: Arrays
@ARGV # built-in array for command-line arguments
my @array = (); # array declaration
# accessing array elements
for(my $i=0; $i < $nRecords; $i++) {
printf "%9.3f \n", $array[$i];
}
# returning and passing arrays
($nRecords, $total) = readData( $ARGV[1], \@array );
sub readData {
my ($column, $data) = @_;
$$data[$i] = $substring; # an array passed this way must be accessed through its reference
}
Exercise: Data Extraction Project
• Use the data from loan.out
• Read a specified column
• Sum up the values
• Extra credit: make sure that the values to be summed up have type real
01-parser.pl
Useful Internet Resources
• BASH: http://tldp.org/LDP/abs/html/
• AWK: http://www.gnu.org/software/gawk/manual/gawk.html
• PERL: http://www.perl.org
• Book: Learning Perl, by Randal L. Schwartz, O’Reilly
Let us know your opinion
http://www.bitly.com/fiuworkshop
Thank you !!!