Advanced Awk for Sysadmins
By Vishal Bhatia on June 1, 2011 in How-Tos, Sysadmins, Tools / Apps
http://www.linuxforu.com/2011/06/advanced-awk-for-sysadmins/
In this article, we will discuss advanced Awk functionality, including string- and time-manipulation
functions, associative arrays and user-defined functions. These can be very useful for a systems
administrator to quickly summarise and analyse the information available in various log files.
To begin with, let’s take a look at how some of the advanced string-manipulation functions are
used. One of the most useful functions available in Awk is the split function. It takes three
parameters: a source string, an array to store the split elements (starting from index 1) and the
optional field separator fs. The fs can also be a regular-expression. The split function is very
useful whenever there is a need for more than one delimiter; for example, if the Squid log has
entries like the following:
1300063856.476 199 127.0.0.1 TCP_MISS/204 396 GET
http://www.google.co.in/csi? - DIRECT/74.125.71.104 text/html
1300063865.415 28710 127.0.0.1 TCP_MISS/200 18303 CONNECT www.google.com:443
- DIRECT/74.125.71.104 -
To fetch the FQDN or IP for a URL from each log entry, using the default fs of " ", check for
the occurrence of the pattern GET in the 6th field — and if found, split the 7th field into an array
url, using the delimiter as /, and print the 3rd index of url:
awk '$6~/GET/{split($7,url,"/"); print url[3]}' /var/log/squid/access.log
www.gmail.com
mail.google.com
google.co.in
www.google.co.in
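Since the separator can be a regular expression, a whole run of characters can act as a single delimiter. A minimal sketch, using an invented sample string (not from the article's logs), where any run of digits delimits:

```shell
# split() with a regular-expression separator: each run of digits is one delimiter.
# "a1b22c333d" is an arbitrary sample string for illustration.
awk 'BEGIN{
    n = split("a1b22c333d", arr, "[0-9]+")   # n holds the number of elements
    for (i = 1; i <= n; i++) print i, arr[i]
}'
```

This prints the four letter pieces, one per line, with their indexes 1 to 4.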
Another very important string manipulation function is gsub(r, t, s). As the man page states,
it substitutes t for occurrences of the regular expression r in the string s. If s is not given, $0 is
used.
Awk also provides a sub(), which is the same as gsub, except that it only replaces the first
occurrence. For example, to replace every occurrence of one or more spaces in the file with a
single tab, you could use the following code:
![Page 2: Advanced Awk for Sysadmins_LFY](https://reader036.vdocument.in/reader036/viewer/2022081821/552b1e054a795931588b459e/html5/thumbnails/2.jpg)
awk '{gsub(/ +/,"\t"); print $0}' <file_name>
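To see the difference between sub() and gsub() side by side, here is a small sketch on inline sample text (the string is invented for illustration):

```shell
# sub() replaces only the first run of spaces; gsub() replaces every run.
echo "one  two   three" |
awk '{ s = $0
       sub(/ +/, "-", s)    # first occurrence only, in the copy s
       gsub(/ +/, "-")      # every occurrence, in $0
       print s "|" $0 }'
```

The output, `one-two   three|one-two-three`, shows sub() touching only the first gap while gsub() collapses them all.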
Besides these, there are several other string-manipulation functions like index, length, match,
strtonum, substr, toupper, tolower, etc.
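A quick illustration of a few of these, on an arbitrary sample string:

```shell
awk 'BEGIN{
    s = "Squid/3.1"
    print toupper(s)        # upper-cased copy: SQUID/3.1
    print length(s)         # number of characters: 9
    print index(s, "/")     # position of the first "/": 6
    print substr(s, 1, 5)   # 5 characters from position 1: Squid
}'
```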
Time conversion
Now, let’s have a look at the time conversion functions available in Awk, which provides three
very useful functions to fetch or manipulate time: systime(), strftime() and mktime().
systime(): Returns the current time-stamp, for example, awk 'BEGIN{print systime()}'.
strftime(): Returns the time in the specified format (similar to what is used for the date command). For example, %d-%m-%Y %H:%M:%S can be used to return time in the “DD-MM-YYYY HH24:MI:SS” format.
mktime(): Takes a datespec of the form YYYY MM DD HH MM SS [DST] and returns a time-stamp in the same form as that returned by systime().
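A round-trip between mktime() and strftime() ties the three functions together. Note that these are gawk extensions; the date below is an arbitrary sample, and TZ is pinned to UTC so the result is reproducible:

```shell
# mktime() parses a datespec into epoch seconds; strftime() formats it back.
TZ=UTC awk 'BEGIN{
    ts = mktime("2011 06 01 12 30 00")        # seconds since the epoch
    print strftime("%d-%m-%Y %H:%M:%S", ts)   # 01-06-2011 12:30:00
}'
```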
The following code snippet can be used to convert the time-stamp provided in Squid logs to the
standard date format:
awk '{$1=strftime("%d-%m-%Y %H:%M:%S",$1); print $0}' /var/log/squid/access.log
Often, systems administrators need to figure out a date which is a few days prior to or after the
current date, which is helpful for archiving or purging log files older than a particular date. Some
simple logic, along with the strftime and systime functions, makes this task trivial.
Since Linux stores time as the number of seconds since epoch (midnight, January 1, 1970 GMT),
and there are 86,400 seconds in a day, we can easily calculate the date a few days prior to or after
the current date, using the code snippets below, which return the dates 5 days after and 8 days
prior to the current date:
awk -v days="5" 'BEGIN{print strftime("%Y-%m-%d",systime()+(days*86400))}'
awk -v days="-8" 'BEGIN{print strftime("%Y-%m-%d",systime()+(days*86400))}'
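The same arithmetic works in reverse: given two dates, mktime() (a gawk extension) yields the number of days between them. The dates below are sample values, with TZ pinned to UTC to avoid DST effects:

```shell
TZ=UTC awk 'BEGIN{
    d1 = mktime("2011 06 01 00 00 00")
    d2 = mktime("2011 06 15 00 00 00")
    print (d2 - d1) / 86400    # 14 days apart
}'
```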
Arrays in Awk
Awk’s support of associative arrays, combined with its ability to treat individual columns as
fields, provides a very powerful mechanism for analysing and processing data. This allows for
running SQL-like grouping functions on columns of text. Before having a look at practical
examples, let’s briefly discuss arrays in Awk.
Just like other variables in Awk, we don’t have to explicitly define arrays or the type of data they
would contain. An array index can be any integer or a string, or even both for a single array.
Thus, an array variable could contain both the number “10” and the string “error” as its indexes.
Awk doesn’t support multidimensional arrays, but we will shortly see how we can combine
different fields to emulate that functionality.
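As a minimal sketch of the SQL-like grouping this enables (the input lines here are invented sample data, not from any real log):

```shell
# GROUP BY on column 1: count how many times each value appears.
printf 'GET /a\nPOST /b\nGET /c\n' |
awk '{ count[$1]++ }                          # associative array keyed by field 1
END{ for (k in count) print k, count[k] }' |
sort                                          # for-in order is unspecified
```

This prints `GET 2` and `POST 1`, the per-value counts of the first column.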
Example uses of arrays
To get a count of the connections opened by a client to access the SSH service on your server,
you’d use the netstat -antp command, whose output contains the following seven fields:
Proto Recv-Q Send-Q Local Address             Foreign Address            State        PID/Program name
tcp        0      0 ::ffff:192.168.56.101:22  ::ffff:192.168.56.1:53063  ESTABLISHED  2655/2
tcp        0      0 ::ffff:192.168.56.101:22  ::ffff:192.168.56.1:53064  ESTABLISHED  2696/4
To process this, let’s use:
netstat -antp | awk -v IGNORECASE=2 -v prt=22 '
{ split($4,laddr,":"); split($5,faddr,":");
if(laddr[5]==prt)sum[faddr[4]]+=1
}END{for(ip in sum) print ip"\t"sum[ip]}'
192.168.56.101 1
192.168.56.1 2
In this case, we split the 4th field (the destination address) and the 5th (the source address), using : as the separator; take the 5th index (the destination port) and the 4th (the source IP) respectively from the resultant arrays; and generate a sum on a per-source-IP basis.
We have not considered the state of the connection — whether it’s OPEN, ESTABLISHED or
TIME_WAIT. However, if the need is to consider the connection state, too, then we can emulate
multidimensional array functionality by combining multiple fields to form an array index.
For example, by changing the port to 80, and adding the 6th field to the array index at the time of
the sum calculation, we can get the source IP and connection-state-wise sum for the connections
established to the Web server:
netstat -antp | awk -v IGNORECASE=2 -v prt=80 '
{ split($4,laddr,":"); split($5,faddr,":");
if(laddr[5]==prt)sum[faddr[4]":"$6]+=1
}END{for(ip in sum) print ip"\t"sum[ip]}'
192.168.56.1:TIME_WAIT 1
User-defined functions
Awk also allows for the creation of user-defined functions, which can be used within the script.
Functions can be declared anywhere within the code, even after they have been used. This is
because Awk reads the entire program before executing it.
#! /bin/sh
#User-defined functions example
awk '
{nodecount()}   # Execute the user-defined nodecount() for each row to update the node count
$0~/<clusternode /,/<\/clusternode>/{ print }   # Print the rows between "<clusternode " and "</clusternode>"
function nodecount()   # Function definition for nodecount
{
    if($0~/clusternode name/){count+=1}
}
END{print "Node Count="count}   # Print the count value in the END pattern, which is executed once
' $1
sh test.sh cluster.conf
<clusternode name="node-01.example.com" nodeid="1">
<fence>
</fence>
</clusternode>
<clusternode name="node-02.example.com" nodeid="2">
<fence>
</fence>
</clusternode>
<clusternode name="node-03.example.com" nodeid="3">
<fence>
</fence>
</clusternode>
Node Count=3
In the example above, we have created a script, test.sh and passed the cluster.conf file as
the input to it. The script uses the nodecount() function before declaring it, and executes it for
every row. A counter is updated for each row containing the entry “clusternode name”. Then it
prints the information for each node, using the range-pattern /<clusternode /,/<\/clusternode>/. In the end, the count is displayed in the END block. This example also
demonstrates how range-patterns in Awk can be used for parsing XML files.
Range-patterns: The range-pattern is a very useful tool for extracting information from text-
files. For example, to get the table definitions from a MySQL dump, the range-pattern /CREATE TABLE `/,/;/ can be used, since ; marks the end of a table definition:
awk -v IGNORECASE=1 '/CREATE TABLE `/,/;/' dump_file_name
Similarly, if we know the text specifying the beginning (create table `<tablename>) and
ending (drop table `<next table>) of a table of data in the MySQL dump file, we can use
that as a range pattern to extract the data for that particular table. We can pass the table name as a
variable, tname, to the script:
awk -v IGNORECASE=1 -v tname=host 'BEGIN{str="^CREATE TABLE `"}
$0~str""tname,/drop table/{ if($0!~/drop table/)print }' dump_file_name
However, for a scenario where some text is contained within tags specified by the same markers
(for example, table data between multiple CREATE TABLE statements within a dump file), the
next statement, along with a marker variable, can be used to extract the data. The script below
demonstrates this:
#! /bin/sh
awk -v IGNORECASE=1 -v tname=$1 '   #Variable tname provides the table name
BEGIN{str="^CREATE TABLE `";defstart=0}   #Specify the pattern to match and set the marker variable to its default
$0~str""tname{ defstart=!defstart; print; next }   #If the row matches the table name, toggle the variable, print the row and move to the next row without further processing
defstart==1{ if($0~str||$0~/drop table/){exit} else {print} }   #If the marker variable is set, print the rows till the next row with a create or drop table statement
' $2
Let’s pass two arguments to this script; the first one is the table name, and the second the dump
file.
In the BEGIN block, we specify the pattern to match (rows beginning with CREATE TABLE `) and initialise the marker variable to 0. When the pattern and table name are matched in a
row, we can toggle the variable, print the row, and use the next statement to jump to the next row
of the dump file, without processing further statements in the script. Since the marker variable is
now set, further statements will be executed for each row, till the pattern is found again in the
dump file. Thus, we will get all the data contained within the markers.
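A simplified, self-contained variant of the marker-variable-plus-next technique, on invented sample text where every section is opened by the same marker word:

```shell
# Extract the rows belonging to section "B"; all sections start with "SECTION".
printf 'SECTION A\na1\nSECTION B\nb1\nb2\nSECTION C\nc1\n' |
awk -v name="B" '
$1 == "SECTION" { inside = ($2 == name); next }  # set the marker, skip the marker row
inside { print }                                 # print only rows inside the wanted section
'
```

Here the marker variable is set (rather than toggled) on each section header, which also ends the previous section automatically; the output is the two rows b1 and b2.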
In this article, we have seen how the functionality provided by Awk can quickly come to the aid
of systems administrators whenever they need to parse, summarise or extract certain data from
text files. Awk is fairly easy to learn if you already know C. For those who are interested in
learning more about Awk, a very detailed user guide is available here.