The Great Awk


LinuxUser/July-August 2001 59

POWERTOOLS

Samuel Palmer takes a look at the awk programming language, the perfect hacker’s tool for editing text files and analysing your data

Awk sounds like the noise made by a dying seagull, although the odd name, as is usual with such things, has relatively prosaic origins: it derives from the initials of the surnames of the creators of awk, Alfred V Aho, Peter J Weinberger and Brian W Kernighan. Rightly or wrongly, the majority of the credit for awk is generally given to Kernighan, although the name would indicate that this shouldn’t necessarily be so.

Awk is, in fact, a powerful but simple programming language that, like grep and sed, has become an essential part of the Unix tool kit, and by extension, the Linux tool kit. Awk was designed to fulfill a common requirement on computer systems: to edit text files, especially where those files are used to store information, and to re-arrange, classify, validate and analyse the data contained in those files. Such work is laborious and subject to error when done manually. The alternative, writing programs in C or other high-level programming languages, is usually impractical and time consuming.

Awk could be said to be the godfather of the macro facilities that are provided with modern spreadsheet programs, but is far more powerful, far more versatile, and can be used as a rapid solution in more diverse scenarios.

From awk to gawk

The first incarnation of the language appeared in 1977, and originated, like Unix and C, under the aegis of Bell Laboratories, which has contributed a disproportionate share of the innovative technologies of the last half century. Kernighan is head of Bell’s Computing Structures Research Department, and is best known as the co-author of The C Programming Language with Dennis Ritchie.

There are several varieties of awk. The original specification, as released in 1977, is still referenced because it is the default version on some versions of Unix. A revised version, called nawk, or new awk, was finally released as part of Unix System V Release 3.1 in 1988, although it had already been in internal use within AT&T for several years. Nawk is the most usual implementation of awk on Unix. Nawk added some new features to the language and cleaned up some “dark corners”, as Effective Awk Programming puts it. The preferred adaptation of the language for Linux, as implemented by the Free Software Foundation, is gawk, or GNU awk, which was written in 1986 by Paul Rubin and Jay Fenlason, and was reworked in 1989 by David Trueman and Arnold Robbins. Gawk contains a number of extensions to nawk that increase its functionality and power. The POSIX specification of awk includes feedback from both the gawk designers and the original awk designers.

Gawk, like so much of the work of the GNU project, is an essential feature of Linux, and is just one of the many tools that give some justification to Richard Stallman’s often disparaged claim that Linux should be known to the world as GNU/Linux. There is a further implementation of awk, mawk, or Mike’s awk implementation, which is also free software, and is available with some Linux distributions.

Mawk was written by Mike Brennan, who claims as the main benefit of mawk that it is “the fastest awk implementation I know. It’s even a lot faster than GNU awk (which is much faster than the awks that Unix vendors ship with their systems)”.


On the command line

The purpose of awk is to allow complex pattern recognition and relatively complicated arithmetic functions in programs containing one or two lines. An awk program contains a sequence of patterns and actions. Unlike conventional programming languages, awk can be said to be data-driven. Awk searches a file, or a set of specified files, for a required pattern of data, and then takes the appropriate action (or set of actions), which may be quite complex. Awk is a natural extension of grep and sed, which can be used to perform similar tasks, and was conceived as such by the original designers, as a means of extending the processing capabilities of grep and sed to more complex forms of data. The difference is that awk has a much greater range of pattern recognition tools, can handle arithmetic, has flow-control statements, can store values in user-defined variables, and supports user-defined functions. Awk can perform relatively complex pattern matching, file editing and analysis tasks over multiple files. As such, awk often removes the need for a full programming language, and offers a rapid route to global edits or data analysis.

Typically awk might be used for one-off tasks, but an awk script can also be stored in a file. Awk is one of those classic Unix utilities that has all kinds of unpredictable uses far beyond the original remit of its design: a general purpose programming language that doesn’t need extensive programming experience to achieve the desired results.

Awk can be invoked in two forms, which can be conventionally defined as follows:

awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)

The options are -F to define a field separator to be found in the data, and -v to assign a variable that can be used in the script. The script may be written on the command line, or contained in a file. Awk can be used to process multiple files that contain the defined pattern.
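As a rough illustration of the two invocation forms and the -F and -v options, the following sketch runs the same one-line script three ways. The colon-separated sample data and the variable name dept are invented for the example and are not taken from any real system.

```shell
# Hypothetical colon-separated sample data: name:department:salary
data=$(mktemp)
printf '%s\n' 'alice:eng:52000' 'bob:ops:41000' 'carol:eng:60000' > "$data"

# Form 1: the script is written on the command line.
# -F sets the field separator; -v assigns a variable before the script runs.
inline=$(awk -F: -v dept=eng '$2 == dept { print $1 }' "$data")

# Form 2: the same script stored in a file and loaded with -f.
script=$(mktemp)
printf '%s\n' '$2 == dept { print $1 }' > "$script"
fromfile=$(awk -F: -v dept=eng -f "$script" "$data")

# A var=value operand is assigned when awk reaches it in the argument list,
# just before the file that follows it is read.
operand=$(awk -F: '$2 == dept { print $1 }' dept=ops "$data")

printf '%s\n' "$inline" "$fromfile" "$operand"
rm -f "$data" "$script"
```

The first two commands should print the same names, since only the place where the script lives has changed; the third selects a different department by way of the var=value operand.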

Patterns can be defined as combinations of regular expressions and comparison operations on strings, numbers, fields, variables, and array elements. Actions may perform arbitrary processing on selected lines. The language is C-like, but has no declarations; strings and numbers are the built-in data types. Some benefits of awk include automatic file handling, associative arrays, user-defined and reserved functions, recursion, regular expressions, multidimensional arrays, and formatted output using printf and sprintf. Empty patterns and actions can be defined for specific purposes. While typical examples of awk programs show one-line applications, awk can in fact be used to compose quite complex operations, and a program that is being used to process data is more likely to be several lines long. Awk has the structures to support this. The simplest awk program might be as follows:

awk '/LinuxUser/ {print}' *.txt

This program will scan all files in the current directory with the suffix .txt, search for any occurrence of the string LinuxUser, and print to the terminal all lines containing that text.
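A slightly longer program gives a feel for how patterns, associative arrays and printf fit together. This is only a sketch: the three-column log format (host, status, bytes) and the host names are made up for the example.

```shell
# Hypothetical sample: host, HTTP status, bytes sent
log=$(mktemp)
printf '%s\n' \
  'web1 200 512' \
  'web2 404 0' \
  'web1 200 2048' \
  'db1 500 128' \
  'web2 200 1024' > "$log"

report=$(awk '
  # Pattern combining a regular expression match with a numeric comparison
  $1 ~ /^web/ && $2 == 200 {
      bytes[$1] += $3      # associative array keyed by host name
      hits[$1]++
  }
  # Empty pattern: this action runs for every input line
  { total++ }
  # END runs once, after the last line has been read
  END {
      printf "%-5s %4s %6s\n", "host", "hits", "bytes"
      # Fixed output order; the order of "for (h in bytes)" is unspecified
      n = split("web1 web2", order, " ")
      for (i = 1; i <= n; i++)
          printf "%-5s %4d %6d\n", order[i], hits[order[i]], bytes[order[i]]
      printf "lines read: %d\n", total
  }
' "$log")

printf '%s\n' "$report"
rm -f "$log"
```

No variable or array is declared anywhere; each springs into existence on first use, which is what keeps programs of this kind so short.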

Get it from the source

Effective awk Programming is written by Arnold Robbins, one of the developers of gawk, and the co-author of sed & awk. This book is required reading for the Linux programmer who wants to explore the potential of awk and gawk. It is generally considered to give the most in-depth coverage of the many titles available on the subject. The book was written under the auspices of the Free Software Foundation and is also available electronically, in which form it can be freely copied and distributed under the terms of the Free Software Foundation’s Free Documentation Licence. A portion of the proceeds from sales of this book goes to the FSF to support further development of free and open source software. Effective awk Programming is a complete guide to the gawk 3.1 implementation of the language, and also contains the most up-to-date and thorough elucidation of the POSIX standard for awk available anywhere.

It has been said that The Awk Programming Language, by Aho, Kernighan and Weinberger, the originators of the language, “is to AWK what The C Programming Language is to C. It’s the bible”. As the original guide to the language it offers some insight into the intentions of the authors, and offers a complete set of examples.

sed & awk was written by Dale Dougherty and Arnold Robbins, and is subject to the same praise as the books above. The book progresses from a simple introduction to the benefits of both sed and awk, towards detailed descriptions of the tools, regular expression syntax and other intricacies. sed & awk is a standard text for Unix programmers and administrators. O’Reilly also publishes a Pocket Reference edition of sed & awk.



Short cuts

From a programmer’s point of view an awk program can be seen as a quick subroutine that can be invoked on its own without the requirement for the surrounding program superstructure. From a user’s point of view, awk allows someone with a rudimentary knowledge of programming structures to process data according to his or her own requirements. Awk is, in fact, a scripting language that was designed to achieve a limited number of tasks. Some may argue that, as a language, it has been superseded by Perl and other scripting languages, but it is simpler to master and quicker to use.

Because awk uses a syntax that looks very much like C, it makes itself attractive as a short cut for programmers to get a task done quickly. As such, awk is often used as a prototyping tool that lends itself to iterative testing of algorithms. Once the proof is working it is a relatively easy process to convert the awk program into another language, or to embed the program in a working script.

The authors of awk claim that it has been used for a diversity of applications “from databases to circuit design, from numerical analysis to graphics, from compilers to system administration, from a first language for non-programmers to the implementation language for software engineering courses”. The most typical application remains that for which it was originally designed: to scan and edit text files and to produce reports on the data held therein.

If you cannot determine when awk should be used in preference to other languages, take the advice that Robert M Slade gave in a review of the O’Reilly book, sed & awk: “The Enlightened Ones say that you should never use C if you can do it with a script, never use a script if you can do it with awk, never use awk if you can do it with sed, and never use sed if you can do it with grep.”

Awk and nawk and gawk

A classic definition of the capabilities and differences between the popular implementations of awk is given by Dale Dougherty and Arnold Robbins in sed & awk.

With original awk, you can:

• Think of a text file as made up of records and fields in a textual database.
• Perform arithmetic and string operations.
• Use programming constructs such as loops and conditionals.
• Produce formatted reports.
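The capabilities in this first list can be seen together in one short sketch. The marks file below is invented for the purpose: each record holds a name followed by any number of scores, and the pass mark of 50 is an arbitrary assumption.

```shell
# Hypothetical sample: a name followed by a variable number of marks
marks=$(mktemp)
printf '%s\n' 'ann 70 80 90' 'ben 40 55 49' 'cat 88 92' > "$marks"

summary=$(awk '
  {
      sum = 0
      for (i = 2; i <= NF; i++)      # loop over the numeric fields of this record
          sum += $i
      avg = sum / (NF - 1)           # arithmetic on field values

      if (avg >= 50)                 # conditional; 50 is an assumed pass mark
          grade = "pass"
      else
          grade = "fail"

      tag = substr($1, 1, 1) NR      # string operation: first letter plus record number
      printf "%-3s %-4s %5.1f %s\n", tag, $1, avg, grade
  }
' "$marks")

printf '%s\n' "$summary"
rm -f "$marks"
```

Each input line is a record, each whitespace-separated word a field, and NF and NR report the number of fields in the current record and the number of records read so far.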

With nawk, you can also:

• Define your own functions
• Execute Unix commands from a script
• Process the results of Unix commands
• Process command-line arguments more gracefully
• Work more easily with multiple input streams
• Flush open output files and pipes (latest Bell Labs awk)
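Several of the nawk additions can be sketched in a few lines. The function name max, the echo commands and the arguments alpha and beta are all placeholders chosen for the example; the program exits from its BEGIN block, so the arguments are never opened as files.

```shell
result=$(awk '
  # User-defined function: the larger of two numbers
  function max(a, b) { return a > b ? a : b }

  BEGIN {
      # Process the results of a Unix command, one line at a time
      cmd = "echo 3; echo 17; echo 5"
      while ((cmd | getline line) > 0)
          biggest = max(biggest, line + 0)
      close(cmd)

      # Execute a Unix command and keep its exit status
      status = system("true")

      # ARGC and ARGV expose the command-line arguments to the script
      printf "max=%d status=%d args=%d first=%s\n", biggest, status, ARGC - 1, ARGV[1]
      exit
  }
' alpha beta)

printf '%s\n' "$result"
```

The getline loop reads the command’s output much as the main loop reads a file, and close ensures the pipe is finished with before the program moves on.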

In addition, with GNU awk (gawk), you can:

• Use regular expressions to separate records, as well as fields
• Skip to the start of the next file, not just the next record
• Perform more powerful string substitutions
• Retrieve and format system time values
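For readers with gawk installed, the sketch below touches three of these extensions: a regular expression as the record separator, the gensub function for substitutions with back-references, and strftime for formatting time values. It assumes a gawk binary on the PATH and does nothing otherwise; the sample strings are invented, and TZ is set to UTC only to make the date reproducible.

```shell
if command -v gawk >/dev/null 2>&1; then
    # A regular expression as record separator: any run of digits ends a record
    records=$(printf 'a1b22c' | gawk 'BEGIN { RS = "[0-9]+" } { print NR ": " $0 }')

    # gensub returns the modified string and allows back-references such as \1
    swapped=$(gawk 'BEGIN { print gensub(/([a-z]+)=([0-9]+)/, "\\2=\\1", "g", "x=1 y=22") }')

    # strftime formats a timestamp; 0 is the start of the Unix epoch
    epoch=$(TZ=UTC0 gawk 'BEGIN { print strftime("%Y-%m-%d", 0) }')

    printf '%s\n' "$records" "$swapped" "$epoch"
fi
```

None of the three works in the original awk or in nawk, so scripts that need to stay portable should avoid them or test for gawk first, as this one does.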