awk & sed
awk - Read a file and split the contents
awk is one of the most powerful utilities used in the unix world. Whenever it comes to text parsing, sed and awk do some unbelievable things. In this first article on awk, we will see the basic usage of awk.
The syntax of awk is:
awk 'pattern{action}' file
where pattern is the condition that selects the lines on which the action is executed. If no pattern is given, the action is executed for every line of the file. If no action is given, the default action of printing the line is performed. Let us see some examples:
Assume a file, say file1, with the following content:
$ cat file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
This file has 2 fields in it. The first field indicates the name of a person, and the second field denotes their domain. The first line is the header record.
1. To print only the names present in the file:
$ awk '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru
The above awk command does not have any pattern or condition. Hence, the action is executed on every line of the file. The action statement reads "print $1". While reading a file, awk splits each line into columns that are available as $1, $2, $3 and so on: the first column is accessible using $1, the second using $2, etc. Hence the above command prints all the names, which happen to be the first column in the file.
2. Similarly, to print the second column of the file:
$ awk '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration
3. In the first example, the list of names got printed along with the header record. How to omit the header record and get only the names printed?
$ awk 'NR!=1{print $1}' file1
Deepak
Neha
Vijay
Guru
The above awk command uses a special variable NR. NR denotes the line number, ranging from 1 to the actual line count. The condition 'NR!=1' indicates not to execute the action part for the first line of the file, and hence the header record gets skipped.
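To see the values NR takes, the record number can be printed next to each line. The following is a minimal sketch, assuming the same two-column file1; it is recreated in a temporary directory (an illustrative location) so that the snippet is self-contained.

```shell
# Recreate the sample file in a scratch directory (the path is illustrative).
dir=$(mktemp -d)
printf 'Name Domain\nDeepak Banking\nNeha Telecom\nVijay Finance\nGuru Migration\n' > "$dir/file1"

# Prefix every record with its record number; NR is 1 for the header line.
numbered=$(awk '{print NR": "$0}' "$dir/file1")
echo "$numbered"

# NR>1 is an equivalent way of writing NR!=1 for skipping the header.
names=$(awk 'NR>1{print $1}' "$dir/file1")
echo "$names"
```

The first command shows the header as record 1, which is why excluding NR equal to 1 drops it.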
4. How do we print the entire file contents?
$ awk '{print $0}' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
$0 stands for the entire line. And hence when we do "print $0", the whole line gets printed.
5. How do we get the entire file content printed in another way?
$ awk '1' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration
The above awk command has only the pattern or condition part and no action part. The '1' in the pattern means "true", so the condition holds for every line. As said above, when no action statement is given the default action is to print the line, and hence the entire file contents get printed.
Let us now consider a file with a delimiter. The delimiter used here is a comma. A comma-separated file is called a CSV file. Assume the file contents to be:
$ cat file1
Name,Domain,Expertise
Deepak,Banking,MQ Series
Neha,Telecom,Power Builder
Vijay,Finance,CRM Expert
Guru,Migration,Unix
This file contains 3 fields. The new field is the expertise of the respective person.
6. Let us try to print the first column of this csv file using the same method as mentioned in Point 1.
$ awk '{print $1}' file1
Name,Domain,Expertise
Deepak,Banking,MQ
Neha,Telecom,Power
Vijay,Finance,CRM
Guru,Migration,Unix
The output looks weird, doesn't it? We expected only the first column to get printed, but awk printed more than that, and not a consistent amount either. If you look carefully, it printed every line up to the first space. By default, awk uses white space as the delimiter, which could be a single space, a tab or a series of spaces. Hence our file was split into fields on spaces rather than on commas.
Since our requirement now involves dealing with a file which is comma separated, we need to specify the delimiter.
$ awk -F"," '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru
awk has a command line option "-F" with which we can specify the delimiter. Once the delimiter is specified, awk splits every line on that delimiter, and hence we got the names by printing the first column $1.
7. awk has a special variable called "FS" which stands for field separator. In place of the command line option "-F", we can also use "FS".
$ awk '{print $1,$3}' FS="," file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix
8. Similarly, to print the second column:
$ awk -F, '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration
9. To print the first and third columns, ie., the name and the expertise:
$ awk -F"," '{print $1, $3}' file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix
10. The output shown above is not easily readable since the third column has more than one word. It would be better if the displayed fields were separated by a delimiter. Let's use a comma to separate the output fields. Also, let's discard the header record.
$ awk -F"," 'NR!=1{print $1,$3}' OFS="," file1
Deepak,MQ Series
Neha,Power Builder
Vijay,CRM Expert
Guru,Unix
OFS is another awk special variable. Just like how FS is used to separate the input fields, OFS (Output field separator) is used to separate the output fields.
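Both separators can also be set inside the awk program itself, in a BEGIN block that runs before any input is read. The following is a minimal sketch, assuming a shortened copy of the comma-separated sample data; the pipe character used for OFS is only an illustrative choice.

```shell
# Shortened copy of the CSV sample, written to a scratch directory.
dir=$(mktemp -d)
printf 'Name,Domain,Expertise\nDeepak,Banking,MQ Series\nNeha,Telecom,Power Builder\n' > "$dir/file1"

# FS splits the input on commas; OFS is inserted wherever print is given
# comma-separated arguments.
out=$(awk 'BEGIN{FS=","; OFS="|"} NR!=1{print $1,$3}' "$dir/file1")
echo "$out"
```

OFS is used only between arguments separated by a comma in print. Writing print $1 $3 would concatenate the two fields with nothing in between.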
awk - Passing arguments or shell variables to awk
In one of our earlier articles, we saw how to read a file in awk. At times, we might have some requirements wherein we need to pass some arguments to the awk program or to access a shell variable or an environment variable inside awk. Let us see in this article how to pass and access arguments in awk:
Let us take a sample file with contents, and a variable "x":
$ cat file1
24
12
34
45
$ echo $x
3
Now, say we want to add the value of the shell variable x to every value in the file.
1. awk provides a "-v" option to pass arguments. Using this, we can pass the shell variable to it.
$ awk -v val=$x '{print $0+val}' file1
27
15
37
48
As seen above, the shell variable $x is assigned to the awk variable "val". This variable "val" can directly be accessed in awk.
2. awk provides another way of passing argument to awk without using -v. Just before specifying the file name to awk, provide the shell variable assignments to awk variables as shown below:
$ awk '{print $0,val}' OFS=, val=$x file1
24,3
12,3
34,3
45,3
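One difference between the two methods is worth knowing: a variable set with -v is already available in a BEGIN block, whereas an assignment placed before the file name takes effect only when awk reaches that argument, that is, after BEGIN has run. A small sketch, assuming a shortened numeric file1 and the shell variable x (the BEGIN message is only for illustration):

```shell
dir=$(mktemp -d)
printf '24\n12\n' > "$dir/file1"
x=3

# With -v, val is set before BEGIN executes.
with_v=$(awk -v val="$x" 'BEGIN{print "BEGIN sees [" val "]"} {print $0+val}' "$dir/file1")
echo "$with_v"

# With an argument-style assignment, val is still empty inside BEGIN.
with_arg=$(awk 'BEGIN{print "BEGIN sees [" val "]"} {print $0+val}' val="$x" "$dir/file1")
echo "$with_arg"
```

Quoting the shell variable as "$x" keeps the assignment intact even when the value contains spaces.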
3. How to access environment variables in awk? Unlike ordinary shell variables, environment variables can be accessed in awk without passing them as above. awk has a special array ENVIRON which holds them.
$ echo $x
3
$ export x
$ awk '{print $0,ENVIRON["x"]}' OFS=, file1
24,3
12,3
34,3
45,3
Quoting file content:
Sometimes we might have a requirement wherein we have to quote the file contents. Assume you have a file which contains a list of database tables, and you need to quote every entry:
$ cat file
CUSTOMER
BILL
ACCOUNT
4. Pass a variable to awk which contains the single quote. Print the quote, line, quote.
$ awk -v q="'" '{print q $0 q}' file
'CUSTOMER'
'BILL'
'ACCOUNT'
5. Similarly, to double quote the contents, pass the variable within single quotes:
$ awk '{print q $0 q}' q='"' file
"CUSTOMER"
"BILL"
"ACCOUNT"
awk - Match a pattern in a file in Linux
In one of our earlier articles in the awk series, we saw the basic usage of awk or gawk. In this one, we will mainly see how to search for a pattern in a file using awk, either in the entire line or in a specific column.
Let us consider a csv file with the following contents. The data in the csv file is a kind of expense report.
Let us see how to use awk to filter data from the file.
$ cat file
Medicine,200
Grocery,500
Rent,900
Grocery,800
Medicine,600
1. To print only the records containing Rent:
$ awk '$0 ~ /Rent/{print}' file
Rent,900
~ is the symbol used for pattern matching. The / / symbols are used to specify the pattern. The above line indicates: If the line($0) contains(~) the pattern Rent, print the line. 'print' statement by default prints the entire line. This is actually the simulation of grep command using awk.
2. By default, awk does pattern matching on the entire line, and hence $0 can be left off as shown below:
$ awk '/Rent/{print}' file
Rent,900
3. Since awk prints the line by default on a true condition, print statement can also be left off.
$ awk '/Rent/' file
Rent,900
In this example, whenever the line contains Rent, the condition becomes true and the line gets printed.
4. In the above examples, the pattern matching is done on the entire line; however, the pattern we are looking for is only in the first column. This might lead to incorrect results if the file contains the word Rent in other places. To match a pattern only in the first column($1):
$ awk -F, '$1 ~ /Rent/' file
Rent,900
The -F option in awk is used to specify the delimiter. It is needed here since we are going to work on the specific columns which can be retrieved only when the delimiter is known.
5. The above pattern match will also match if the first column contains "Rents". To match exactly for the word "Rent" in the first column:
$ awk -F, '$1=="Rent"' file
Rent,900
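The difference between the regex match and the exact comparison shows up only when the data contains a longer value. A minimal sketch, assuming a shortened copy of the expense file to which a hypothetical "Rents,100" record has been added:

```shell
dir=$(mktemp -d)
# The "Rents,100" record is a hypothetical addition to the sample data.
printf 'Medicine,200\nRent,900\nRents,100\n' > "$dir/file"

# Regex match: true for any first column that contains "Rent".
regex_match=$(awk -F, '$1 ~ /Rent/' "$dir/file")
echo "$regex_match"

# String comparison: true only when the first column is exactly "Rent".
exact_match=$(awk -F, '$1 == "Rent"' "$dir/file")
echo "$exact_match"
```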
6. To print only the 2nd column for all "Medicine" records:
$ awk -F, '$1 == "Medicine"{print $2}' file
200
600
7. To match for patterns "Rent" or "Medicine" in the file:
$ awk '/Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
8. Similarly, to match for this above pattern only in the first column:
$ awk -F, '$1 ~ /Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
9. What if the first column contains the word "Medicines"? The above example will match it as well. In order to exactly match only Rent or Medicine:
$ awk -F, '$1 ~ /^Rent$|^Medicine$/' file
Medicine,200
Rent,900
Medicine,600
The ^ symbol indicates the beginning and $ the end of the text being matched, which here is the first column ($1) rather than the whole line. ^Rent$ matches exactly the word Rent in the first column, and the same holds for the word Medicine as well.
10. To print the lines which do not contain the pattern Medicine:
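The same anchored match can also be written with a single pair of anchors around a group, which stays readable as more alternatives are added. A minimal sketch, assuming a shortened copy of the expense file plus a hypothetical "Medicines,50" record:

```shell
dir=$(mktemp -d)
# "Medicines,50" is a hypothetical record added to show what gets excluded.
printf 'Medicine,200\nGrocery,500\nRent,900\nMedicines,50\n' > "$dir/file"

# ^( ... )$ anchors the whole alternation to the full first column.
grouped=$(awk -F, '$1 ~ /^(Rent|Medicine)$/' "$dir/file")
echo "$grouped"
```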
$ awk '!/Medicine/' file
Grocery,500
Rent,900
Grocery,800
The ! is used to negate the pattern search.
11. To negate the pattern on the first column alone:
$ awk -F, '$1 !~ /Medicine/' file
Grocery,500
Rent,900
Grocery,800
12. To print all records whose amount is greater than 500:
$ awk -F, '$2>500' file
Rent,900
Grocery,800
Medicine,600
13. To print the Medicine record only if it is the 1st record:
$ awk 'NR==1 && /Medicine/' file
Medicine,200
This is how the logical AND(&&) condition is used in awk. The record is retrieved only if it is the first record(NR==1) and it is a Medicine record.
14. To print all those Medicine records whose amount is greater than 500:
$ awk -F, '/Medicine/ && $2>500' file
Medicine,600
15. To print all the Medicine records and also those records whose amount is greater than 600:
$ awk -F, '/Medicine/ || $2>600' file
Medicine,200
Rent,900
Grocery,800
Medicine,600
This is how the logical OR(||) condition is used in awk.
awk - Join or merge lines on finding a pattern
In one of our earlier articles, we discussed joining all lines in a file and also joining every 2 lines in a file. In this article, we will see how we can join lines on encountering a pattern using awk or gawk.
Let us assume a file with the following contents. There is a line with START in-between. We have to join all the lines following the pattern START.
$ cat file
START
Unix
Linux
START
Solaris
Aix
SCO
1. Join the lines following the pattern START without any delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file
UnixLinux
SolarisAixSCO
Basically, what we are trying to do is: accumulate the lines following START and print them on encountering the next START. /START/ searches for lines containing the pattern START, and the commands within the first {} run only on those lines. A newline is printed if the line is not the first line(NR!=1); without this condition, a blank line would appear at the very beginning of the output since the file starts with a START.
The next command prevents the remaining part of the program from being executed for the START lines. The second set of braces {} therefore works only on the lines not containing START. This part simply prints the line without a terminating newline character(printf). As a result, we get all the lines after the pattern START on the same line. The END block prints a newline at the end, without which the prompt would appear at the end of the last line of output. Note that printf $0 uses the line itself as the format string, so a line containing a % character may be misinterpreted; printf "%s", $0 avoids this.
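Because printf treats its first argument as a format string, input lines that contain a percent sign are safer to print through an explicit "%s" format. The following is a minimal sketch of the same join, assuming a hypothetical input in which one line is "100%":

```shell
dir=$(mktemp -d)
# The third record contains a percent sign, which printf would otherwise
# try to interpret as the start of a format specifier.
printf 'START\nUnix\n100%%\nSTART\nAix\n' > "$dir/file"

# Same logic as above, but the line is passed as data, not as the format.
joined=$(awk '/START/{if (NR!=1) print ""; next} {printf "%s", $0} END{print ""}' "$dir/file")
echo "$joined"
```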
2. Join the lines following the pattern START with space as delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file
Unix Linux
Solaris Aix SCO
This is same as the earlier one except it uses the format specifier %s in order to accommodate an additional space which is the delimiter in this case.
3. Join the lines following the pattern START with comma as delimiter.
$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
Unix,Linux
Solaris,Aix,SCO
Here, we form a complete line and store it in a variable x and print the variable x whenever a new pattern starts. The command: x=(!x)?$0:x","$0 is like the ternary operator in C or Perl. It means if x is empty, assign the current line($0) to x, else append a comma and the current line to x. As a result, x will contain the lines joined with a comma following the START pattern. And in the END label, x is printed since for the last group there will not be a START pattern to print the earlier group.
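One caveat with testing (!x): a line that consists of just 0 is also treated as empty, so it would be overwritten by the next line. A possible variation, sketched here under the assumption that such values can occur, counts the items in a group instead; the names sep and n are illustrative, and the sample input is hypothetical.

```shell
dir=$(mktemp -d)
# Hypothetical input whose first group starts with the value 0.
printf 'START\n0\n1\nSTART\nAix\nSCO\n' > "$dir/file"

joined=$(awk -v sep="," '
  # On START: flush the previous group if it had any items, then reset.
  /START/ { if (n) print x; x = ""; n = 0; next }
  # n++ is 0 (false) for the first item of a group, non-zero afterwards.
  { x = (n++ ? x sep $0 : $0) }
  END { if (n) print x }
' "$dir/file")
echo "$joined"
```

Passing the separator with -v also means the same program can join with a space, a colon or any other string without editing the awk code.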
4. Join the lines following the pattern START with comma as delimiter with also the pattern matching line.
$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux
START,Solaris,Aix,SCO
The difference here is the missing next statement. Because next is not there, the commands present in the second set of curly braces are applicable for the START line as well, and hence it also gets concatenated.
5. Join the lines following the pattern START with comma as delimiter with also the pattern matching line. However, the pattern line should not be joined.
$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
START
Unix,Linux
START
Solaris,Aix,SCO
In this, instead of forming START as part of the variable x, the START line is printed. As a result, the START line comes out separately, and the remaining lines get joined.
awk - 10 examples to group data in a CSV or text file
awk is very powerful when it comes to file formatting. In this article, we will discuss some wonderful grouping features of awk. awk can group data based on a column or field, or on a set of columns. It uses the powerful associative array for grouping. If you are new to awk, this article will be easier to understand if you first go over the article on how to parse a simple CSV file using awk.
Let us take a sample CSV file with the below contents. The file is kind of an expense report containing items and their prices. As seen, some expense items have multiple entries.
$ cat file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. To find the total of all numbers in second column. i.e, to find the sum of all the prices.
$ awk -F"," '{x+=$2}END{print x}' file
3000
The delimiter(-F) used is a comma since it is a comma separated file. x+=$2 stands for x=x+$2. When a line is parsed, the second column($2), which is the price, is added to the variable x. At the end, the variable x contains the sum. This example is the same as the one discussed in the awk example of finding the sum of all numbers in a file.
If your input file is a text file with the only difference being the comma not present in the above file, all you need to make is one change. Remove this part from the above command: -F"," . This is because the default delimiter in awk is whitespace.
2. To find the total sum of particular group entry alone. i.e, in this case, of "Item1":
$ awk -F, '$1=="Item1"{x+=$2;}END{print x}' file
800
This gives us the total sum of all the items pertaining to "Item1". In the earlier example, no condition was specified since we wanted awk to work on every line or record. In this case, we want awk to work on only the records whose first column($1) is equal to Item1.
3. If the data to be worked upon is present in a shell variable:
$ VAR="Item1"
$ awk -F, -v inp=$VAR '$1==inp{x+=$2;}END{print x}' file
800
-v is used to pass the shell variable to awk, and the rest is same as the last one.
4. To find unique values of first column
$ awk -F, '{a[$1];}END{for (i in a)print i;}' file
Item1
Item2
Item3
Arrays in awk are associative, which is a very powerful feature. Associative arrays have an index and a corresponding value. Example: a["Jan"]=30 means that in the array a, "Jan" is an index with value 30. In our case here, we use only the index without values. So, the command a[$1] works like this: when the first record is processed, an index "Item1" is stored in the array named a. During the second record, a new index "Item2" is stored, during the third "Item3", and so on. During the 4th record, since the "Item2" index is already there, no new index is added, and the same continues.
Now, once the file is processed completely, the control goes to the END label where we print all the index items. The for loop in awk comes in 2 variants: the first is the C language kind of for loop, the second is the one used for associative arrays.
for i in a : This means for every index in the array a . The variable "i" holds the index value. In place of "i", it can be any variable name. Since there are 3 elements in the array, the loop will run for 3 times, each time holding the value of an index in the "i". And by printing "i", we get the index values printed.
To understand the for loop better, look at this:
for (i in a)
{
print i;
}
Note: The order of the output in the above command may vary from system to system. Associative arrays do not store the indexes in sequence and hence the order of the output need not be the same in which it is entered.
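When a predictable order is needed, one simple option is to pipe the output of awk through sort. A minimal sketch, assuming the same sample file recreated in a temporary directory:

```shell
dir=$(mktemp -d)
printf 'Item1,200\nItem2,500\nItem3,900\nItem2,800\nItem1,600\n' > "$dir/file"

# awk collects the unique first-column values; sort fixes the output order.
unique=$(awk -F, '{a[$1]} END{for (i in a) print i}' "$dir/file" | sort)
echo "$unique"
```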
5. To find the sum of individual group records. i.e, to sum all records pertaining to Item1 alone, Item2 alone, and so on.
$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' file
Item1, 800
Item2, 1300
Item3, 900
a[$1]+=$2 . This can be written as a[$1]=a[$1]+$2. This works like this: When the first record is processed, a["Item1"] is assigned 200(a["Item1"]=200). During second "Item1" record, a["Item1"]=800 (200+600) and so on. In this way, every index item in the array is stored with the appropriate value associated to it which is the sum of the group. And in the END label, we print both the index(i) and the value(a[i]) which is nothing but the sum.
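The same idea extends to other per-group figures. As an illustration, the sketch below keeps a second array with the number of records per group and prints the average price; the array names sum and cnt are illustrative choices, and sort is used only to get a stable output order.

```shell
dir=$(mktemp -d)
printf 'Item1,200\nItem2,500\nItem3,900\nItem2,800\nItem1,600\n' > "$dir/file"

# sum[] accumulates the prices, cnt[] counts the records of each group.
avg=$(awk -F, '{sum[$1]+=$2; cnt[$1]++} END{for (i in sum) print i "," sum[i]/cnt[i]}' "$dir/file" | sort)
echo "$avg"
```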
6. To find the sum of all entries in second column and add it as the last record.
$ awk -F"," '{x+=$2;print}END{print "Total,"x}' file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
Total,3000
This is same as the first example except that along with adding the value every time, every record is also printed, and at the end, the "Total" record is also printed.
7. To print the maximum or the biggest record of every group:
$ awk -F, '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,600
Item2,800
Item3,900
Before storing the value($2) in the array, the current second column value is compared with the existing value and stored only if the value in the current record is bigger. And finally, the array will contain only the maximum values against every group. Finding the smallest element in the group needs slightly more than changing the "lesser than(<)" symbol to greater than(>): an unset a[$1] compares as 0, so the first value of a group would never be stored. The condition has to store the first value unconditionally, for example: if (!($1 in a) || a[$1] > $2). The syntax for if in awk is similar to the C language syntax:
if (condition){ <code for true condition >}else{ <code for false condition> }
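For the smallest value per group, the first value seen for a group has to be stored unconditionally, because an array element that has not been set yet compares as 0 and no positive price is smaller than that. A minimal sketch, assuming the sample file plus a hypothetical extra record "Item3,100" so that the minimum differs from the first record of the group:

```shell
dir=$(mktemp -d)
# "Item3,100" is a hypothetical extra record.
printf 'Item1,200\nItem2,500\nItem3,900\nItem2,800\nItem1,600\nItem3,100\n' > "$dir/file"

# Store $2 when the group is new ($1 not yet an index of a) or when it is
# smaller than the value stored so far.
min=$(awk -F, '!($1 in a) || $2 < a[$1] {a[$1]=$2} END{for (i in a) print i, a[i]}' OFS=, "$dir/file" | sort)
echo "$min"
```

The test ($1 in a) checks whether an index exists without creating it, which a plain reference to a[$1] would do.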
8. To find the count of entries against every group:
$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file
Item1 2
Item2 2
Item3 1
a[$1]++ : This can be put as a[$1]=a[$1]+1. When the first "Item1" record is parsed, a["Item1"]=1, and every time another "Item1" record is encountered, this count is incremented. The same follows for the other entries as well. This code simply increments the count by 1 for the respective index on encountering a record. And finally, on printing the array, we get the item entries and their respective counts.
9. To print only the first record of every group:
$ awk -F, '!a[$1]++' file
Item1,200
Item2,500
Item3,900
A little tricky this one. In this awk command, there is only a condition and no action statement. As a result, if the condition is true, the current record gets printed by default. !a[$1]++ : When the first record of a group is encountered, a[$1] evaluates to 0 since ++ is post-fix, and not(!) of 0 is 1 which is true, and hence the first record gets printed. Now, when the second record of "Item1" is parsed, a[$1] is 1 (it will become 2 after the command since the increment is post-fix). Not(!) of 1 is 0 which is false, and the record does not get printed. In this way, the first record of every group gets printed. Simply by removing the '!' operator, the above command will print all records other than the first record of the group.
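The same condition works on the whole line when $0 is used as the index, which removes duplicate lines while keeping the original order. A minimal sketch, assuming a hypothetical list file with repeated entries; the array name seen is an illustrative choice.

```shell
dir=$(mktemp -d)
# Hypothetical input with repeated lines.
printf 'Unix\nLinux\nUnix\nAix\nLinux\n' > "$dir/list"

# A line is printed only the first time it is seen.
dedup=$(awk '!seen[$0]++' "$dir/list")
echo "$dedup"
```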
10. To join or concatenate the values of all group items. Join the values of the second column with a colon separator:
$ awk -F, '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
This if condition is pretty simple: If there is some value in a[$1], then append or concatenate the current value using a colon delimiter, else just assign it to a[$1] since this is the first value. To make the above if block clear, let me put it this way: "if (a[$1])" means "if a[$1] has some value".
if(a[$1])
a[$1]=a[$1]":"$2;
else
a[$1]=$2
The same can be achieved using the awk ternary operator as well which is same as in the C language.
$ awk -F, '{a[$1]=a[$1]?a[$1]":"$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
The ternary operator is a short form of the if-else condition. An example of the ternary operator is: x=x>10?"Yes":"No" means if x is greater than 10, assign "Yes" to x, else assign "No". In the same way: a[$1]=a[$1]?a[$1]":"$2:$2 means if a[$1] has some value assign a[$1]":"$2 to a[$1], else simply assign $2 to a[$1].
Concatenate variables in awk : One more thing to notice is the way string concatenation is done in awk. To concatenate 2 variables in awk, use a space in-between. Examples:
z=x y #to concatenate x and y
z=x":"y #to concatenate x and y with a colon separator.
awk - 10 examples to split a file into multiple files
In this article of the awk series, we will see the different scenarios in which we need to split a file into multiple files using awk. A file can be split into multiple files based on a condition, based on a pattern, or because the file is big and hence needs to be split into smaller files.
Sample File1:Let us consider a sample file with the following contents:
$ cat file1
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. Split the file into 3 different files, one for each item. i.e, All records pertaining to Item1 into a file, records of Item2 into another, etc.
$ awk -F, '{print > $1}' file1
The files generated by the above command are as below:
$ cat Item1
Item1,200
Item1,600
$ cat Item3
Item3,900
$ cat Item2
Item2,500
Item2,800
This looks so simple, right? print prints the entire line, and the line is printed to a file whose name is $1, which is the first field. This means, the first record will get written to a file named 'Item1', and the second record to 'Item2', third to 'Item3', 4th goes to 'Item2', and so on.
2. Split the files by having an extension of .txt to the new file names.
$ awk -F, '{print > ($1".txt")}' file1
The only change here from the above is concatenating the string ".txt" to the $1 which is the first field. As a result, we get the extension to the file names. The files created are below:
$ ls *.txt
Item1.txt Item2.txt Item3.txt
3. Split the files by having only the value(the second field) in the individual files, i.e, only 2nd field in the new files without the 1st field:
$ awk -F, '{print $2 > ($1".txt")}' file1
The print command prints the entire record. Since we want only the second field to go to the output files, we do: print $2.
$ cat Item1.txt
200
600
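When the first column has many distinct values, awk keeps one output file open per value, and some implementations can run into a limit on open files. A possible workaround, sketched below as an assumption rather than a requirement, is to append with >> and close each file right after writing. The variable names d and f are illustrative.

```shell
dir=$(mktemp -d)
printf 'Item1,200\nItem2,500\nItem3,900\nItem2,800\nItem1,600\n' > "$dir/file1"

# Build the output name, append the value, and close the file immediately.
# ">>" is needed because a closed file reopened with ">" would be truncated.
awk -F, -v d="$dir" '{f = d "/" $1 ".txt"; print $2 >> f; close(f)}' "$dir/file1"
cat "$dir/Item1.txt"
```

Since >> appends, running the command a second time in the same directory adds the values again, so the old output files should be removed first.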
4. Split the files so that all the items whose value is greater than 500 are in the file "500G.txt", and the rest in the file "500L.txt".
$ awk -F, '{if($2<=500)print > "500L.txt";else print > "500G.txt"}' file1
The output files created will be as below:
$ cat 500L.txt
Item1,200
Item2,500
$ cat 500G.txt
Item3,900
Item2,800
Item1,600
Check the second field($2). If it is less than or equal to 500, the record goes to "500L.txt", else to "500G.txt".
Other way to achieve the same thing is using the ternary operator in awk:
$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1
The condition for greater or lesser than 500 is checked and the appropriate file name is assigned to variable x. The record is then written to the file present in the variable x.
Sample File2:Let us consider another file with a different set of contents. This file has a pattern 'START' at frequent intervals.
$ cat file2
START
Unix
Linux
START
Solaris
Aix
SCO
5. Split the file into multiple files at every occurrence of the pattern START .
$ awk '/START/{x="F"++i;}{print > x;}' file2
This command contains 2 sets of curly braces: The control goes to the first set of braces only on encountering a line containing the pattern START. The second set is executed for every line since it has no condition and hence is always true. On encountering the pattern START, a new file name is created and stored. When the first START comes, x will contain "F1", the control goes to the next set of braces and the record is written to F1, and the subsequent records go to the file "F1" till the next START comes. On encountering the next START, x will contain "F2" and the subsequent lines go to "F2" till the next START, and so on.
$ cat F1
START
Unix
Linux
$ cat F2
START
Solaris
Aix
SCO
6. Split the file into multiple files at every occurrence of the pattern START. But the line containing the pattern should not be in the new files.
$ awk '/START/{x="F"++i;next}{print > x;}' file2
The only difference from the above is the inclusion of the next command. A line containing START enters the first set of curly braces, and the next command then makes awk start reading the next line immediately. As a result, the START lines do not reach the second set of curly braces and hence START does not appear in the split files.
$ cat F1
Unix
Linux
$ cat F2
Solaris
Aix
SCO
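If the input can contain lines before the first START, the variable x is still empty when those lines are reached, and redirecting to an empty file name fails. One way to guard against that, sketched here with a hypothetical "preamble" line in the input, is to write only when x has been set. The variable d is an illustrative name for the output directory.

```shell
dir=$(mktemp -d)
# Hypothetical input that has one line before the first START.
printf 'preamble\nSTART\nUnix\nLinux\nSTART\nSolaris\n' > "$dir/file2"

# Lines seen before any START are skipped because x is still empty.
awk -v d="$dir" '/START/{x = d "/F" (++i); next} x != "" {print > x}' "$dir/file2"
cat "$dir/F1"
cat "$dir/F2"
```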
7. Split the file by inserting a header record in every new file.
$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print > x;}' file2
The change here from the earlier one is this: Before the next command, we write the header record into the file. This is the right place to write the header record since this is where the file is created first.
$ cat F1
ANY HEADER
Unix
Linux
$ cat F2
ANY HEADER
Solaris
Aix
SCO
Sample File3: Let us consider a file with the sample contents:
$ cat file3
Unix
Linux
Solaris
AIX
SCO
8. Split the file into multiple files at every 3rd line . i.e, First 3 lines into F1, next 3 lines into F2 and so on.
$ awk 'NR%3==1{x="F"++i;}{print > x}' file3
In other words, this is nothing but splitting the file into equal parts. The condition does the trick here: NR%3==1 : NR is the line number of the current record. NR%3 will be equal to 1 for every 3rd line such as 1st, 4th, 7th and so on. And at every 3rd line, the file name is changed in the variable x, and hence the records are written to the appropriate files.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
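The number of lines per file can be turned into a parameter. In the sketch below the chunk size is passed with -v as n (an illustrative name), and the condition is written as (NR-1)%n==0 so that it also works when n is 1:

```shell
dir=$(mktemp -d)
printf 'Unix\nLinux\nSolaris\nAIX\nSCO\n' > "$dir/file3"

# Start a new output file at records 1, n+1, 2n+1, ...
awk -v n=2 -v d="$dir" '(NR-1)%n==0{x = d "/F" (++i)} {print > x}' "$dir/file3"
cat "$dir/F1"
cat "$dir/F3"
```

With n=2 and 5 input lines this produces three files, the last one holding the single remaining line.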
Sample File4:Let us update the above file with a header and trailer:
$ cat file4
HEADER
Unix
Linux
Solaris
AIX
SCO
TRAILER
9. Split the file at every 3rd line without the header and trailer in the new files.
$ sed '1d;$d;' file4 | awk 'NR%3==1{x="F"++i;}{print > x}'
The earlier command does the work for us; the only thing needed is to pass the file to it without the header and trailer. sed does that for us: '1d' deletes the 1st line, '$d' deletes the last line.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
10. Split the file at every 3rd line, retaining the header and trailer in every file.
$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print> a[j];}' file4
This one is a little tricky. Before the file is processed, the first line is read using getline into the variable f. NR%3 is checked against 2 instead of 1 as in the earlier case: since the first line is a header, we need to split the files at the 2nd, 5th, 8th lines, and so on. All the file names are stored in the array "a" for later processing. Without the END label, all the files will have the header record, but only the last file will have the trailer record. So, the END label writes the trailer record to all the files other than the last file. This relies on $0 still holding the last record read (the trailer) inside END, as specified by POSIX and implemented by gawk.
$ cat F1
HEADER
Unix
Linux
Solaris
TRAILER
$ cat F2
HEADER
AIX
SCO
TRAILER
awk - 10 examples to read files with multiple delimiters
In this article of the awk series, we will see how to use awk to read or parse text or CSV files containing multiple delimiters or repeating delimiters. Also, we will discuss some peculiar delimiters and how to handle them using awk.
Let us consider a sample file. This colon separated file contains item, purchase year and a set of prices separated by a semicolon.
$ cat file
Item1:2010:10;20;30
Item2:2012:12;29;19
Item3:2014:15;50;61
1. To print the 3rd column which contains the prices:
$ awk -F: '{print $3}' file
10;20;30
12;29;19
15;50;61
This is straight forward. By specifying colon(:) in the option with -F, the 3rd column can be retrieved using the $3 variable.
2. To print the 2nd component of $3 alone:
$ awk -F '[:;]' '{print $4}' file
20
29
50
What did we do here? We specified multiple delimiters: one is : and the other is ; . How does awk parse the file? It's simple. While reading the line, whenever the delimiter : or ; is encountered, the part read so far is stored in $1. On encountering one of the delimiters again, the next part is stored in $2, and this continues till the end of the line is reached. The three price components thus end up in $3, $4 and $5, so $4 contains the second part of the price component above. Note: Always keep in mind that multiple delimiters have to be specified inside square brackets( [;:] ).
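An alternative that keeps the colon as the only field separator is to break up the price column with awk's built-in split() function, which fills an array and returns the number of pieces. A minimal sketch, assuming a shortened copy of the sample file; the array name p and the counter n are illustrative.

```shell
dir=$(mktemp -d)
printf 'Item1:2010:10;20;30\nItem2:2012:12;29;19\n' > "$dir/file"

# split() stores the pieces of $3 in p[1]..p[n] and returns n.
parts=$(awk -F: '{n = split($3, p, ";"); print $1, p[1], p[n], n}' "$dir/file")
echo "$parts"
```

This prints the item, the first price, the last price and the number of prices, while $1, $2 and $3 keep their colon-based meaning.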
3. To sum the individual components of the 3rd column and print it:
$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file
Item1:2010:60
Item2:2012:60
Item3:2014:126
The individual components of the price($3) column are available in $3, $4 and $5. Simply sum them up, store the result in $3, and print all the variables. OFS (output field separator) is used to specify the delimiter while printing the output. Note: If we do not use the OFS, awk will print the fields using the default output delimiter which is space.
4. Un-group or re-group every record depending on the price column:
$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file
Item1:2010:10
Item1:2010:20
Item1:2010:30
Item2:2012:12
Item2:2012:29
Item2:2012:19
Item3:2014:15
Item3:2014:50
Item3:2014:61
The requirement here is: New records have to be created for every component of the price column. Simply, a loop is run on from columns 3 to 5, and every time a record is framed using the price component.
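When the number of price components varies from record to record, the loop can run up to NF, the number of fields in the current record, instead of a fixed 5. A minimal sketch that sums however many prices are present, assuming a hypothetical variation of the sample in which the second record has four prices:

```shell
dir=$(mktemp -d)
# The second record has four prices; this is a hypothetical variation.
printf 'Item1:2010:10;20;30\nItem2:2012:12;29;19;40\n' > "$dir/file"

# Reset the sum for every record, then add fields 3 through NF.
totals=$(awk -F '[;:]' '{s = 0; for (i = 3; i <= NF; i++) s += $i; print $1, $2, s}' OFS=: "$dir/file")
echo "$totals"
```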
5-6. Read file in which the delimiter is square brackets:
$ cat file
123;abc[202];124
125;abc[203];124
127;abc[204];124
5. To print the value present within the brackets:
$ awk -F '[][]' '{print $2}' file
202
203
204
At first sight, the delimiter used in the above command might be confusing. It's simple. 2 delimiters are to be used in this case: one is [ and the other is ]. Since the delimiters themselves are square brackets, which have to be placed within square brackets, it looks tricky at first.
Note: If square brackets are delimiters, it should be put in this way only, meaning first ] followed by [. Using the delimiter like -F '[[]]' will give a different interpretation altogether.
6. To print the first value, the value within brackets, and the last value:
$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file
123;202;124
125;203;124
127;204;124
3 delimiters are used in this case with semi-colon also included.
7-8. Read or parse a file containing a series of delimiters:
$ cat file
123;;;202;;;203
124;;;213;;;203
125;;;222;;;203
The above file contains a series of 3 semi-colons between every 2 values.
7. Using the multiple delimiter method:
$ awk -F'[;;;]' '{print $2}' file
Blank output! The above delimiter, though specified as 3 semi-colons, is as good as a single semi-colon (;) delimiter, since all the characters inside the brackets are the same. Due to this, $2 is the value between the first and the second semi-colon, which in our case is empty, and hence a blank line is printed for every record.
8. Using the delimiter without square brackets:
$ awk -F';;;' '{print $2}' file
202
213
222
The expected output! No square brackets are used, and we got the output we wanted.
Difference between using square brackets and not using them: When a set of delimiters is specified using square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' means to separate the contents on encountering either ':' or ';'. However, when a set of delimiters is specified without square brackets, awk looks at them literally, as a sequence, to separate the contents. For example, -F ':;' means to separate the contents only on encountering a colon followed by a semi-colon. Hence, in the last example, the file contents are separated only when a set of 3 continuous semi-colons is encountered.
9. Read or parse a file containing a series of delimiters of varying lengths: In the file below, the 1st and 2nd columns are separated by 3 semi-colons, whereas the 2nd and 3rd are separated by 4 semi-colons.
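The difference can be seen by counting fields on the same line with both forms of the delimiter. A small sketch; the sample line is made up for illustration and is not from the files above:

```shell
# Hypothetical sample line containing both ';;;' and ':'.
line='a;;;b:c'

# Bracket expression: split on every single ';' or ':'.
# Fields are: a, (empty), (empty), b, c
bracket_nf=$(printf '%s\n' "$line" | awk -F'[;:]' '{ print NF }')

# No brackets: split only on three consecutive semi-colons.
# Fields are: a, b:c
literal_nf=$(printf '%s\n' "$line" | awk -F';;;' '{ print NF }')

echo "bracket: $bracket_nf fields, literal: $literal_nf fields"
```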
$ cat file
123;;;202;;;;203
124;;;213;;;;203
125;;;222;;;;203
$ awk -F';'+ '{print $2,$3}' file
202 203
213 203
222 203
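The '+' is a regular expression quantifier which indicates one or more of the preceding character. The shell joins ';'+ into the single argument ;+ before awk sees it, so the delimiter is one or more semi-colons, and hence both the run of 3 semi-colons and the run of 4 semi-colons get matched.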
10. Using a word as a delimiter:
$ cat file
123Unix203
124Unix203
125Unix203
Retrieve the numbers before and after the word "Unix" :
$ awk -F'Unix' '{print $1, $2}' file
123 203
124 203
125 203
In this case, we use the word "Unix" as the delimiter, and hence $1 and $2 contain the appropriate values. Keep in mind, it is not just special characters which can be used as delimiters. Letters and whole words can be used as delimiters as well.
P.S: We will discuss the awk split function and how to use it on these types of multiple-delimited files.
awk - Passing awk variables to shell
In one of our earlier articles, we discussed how to access or pass shell variables to awk. In this one, we will see how to access awk variables in the shell, or in other words how to get awk values into shell variables. Let us see the different ways in which we can achieve this.
Let us consider a file with the sample contents as below:
$ cat file
Linux 20
Solaris 30
HPUX 40
1. Access the value of the entry "Solaris" in a shell variable, say x:
$ x=`awk '/Solaris/{a=$2;print a}' file`
$ echo $x
30
This approach is fine as long as we want to access only one value. What if we have to access multiple values in shell?
2. Access the value of "Solaris" in x, and "Linux" in y:
$ z=`awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file`
$ echo "$z"
y=20
x=30
$ eval $z
$ echo $x
30
$ echo $y
20
awk prints the assignments "x=30" and "y=20" as text, and this output is collected in the shell variable "z". The eval command evaluates the variable, meaning it executes the commands present in the variable. As a result, "x=30" and "y=20" get executed, and x and y become shell variables with the appropriate values.
3. Same using the sourcing method:
$ awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file > f1
$ source f1
$ echo $x
30
$ echo $y
20
Here, instead of collecting the output of the awk command in a variable, it is re-directed to a temporary file. The file is then sourced, or in other words executed in the same shell. As a result, "x" and "y" become shell variables.
Note: The appropriate way of sourcing depends on the shell being used. The "source" command is used here since the shell is bash; the portable equivalent is the dot command (. ./f1).
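Both eval and sourcing execute whatever text awk printed. Where that is a concern, one alternative (not from the article, offered only as a sketch) is to have awk print the bare values on one line and let the shell's read builtin split them. The temporary directory and the awk variable names s and l are assumptions made for this illustration:

```shell
# Recreate the sample file in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Linux 20' 'Solaris 30' 'HPUX 40' > "$tmpdir/file"

# awk prints "30 20"; read assigns the first word to x and the second to y.
read -r x y <<EOF
$(awk '$1=="Solaris"{s=$2} $1=="Linux"{l=$2} END{print s, l}' "$tmpdir/file")
EOF

echo "x=$x y=$y"
rm -rf "$tmpdir"
```

Nothing printed by awk is executed here; it is only split into words, which is why this form is often preferred when the file contents are not fully trusted.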
awk - 10 examples to insert / remove / update fields of a CSV file
How do we manipulate a text / CSV file using awk/gawk? How do we insert a column between columns, remove columns, or update a particular column? Let us discuss this in this article.
Consider a CSV file with the following contents:
$ cat file
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
1. To insert a new column (say serial number) before the 1st column:
$ awk -F, '{$1=++i FS $1;}1' OFS=, file
1,Unix,10,A
2,Linux,30,B
3,Solaris,40,C
4,Fedora,20,D
5,Ubuntu,50,E
$1=++i FS $1 => In awk, values are concatenated simply by writing them next to each other, separated by a space. This expression concatenates a new field (++i) with the 1st field along with the delimiter (FS), and assigns it back to the 1st field ($1). FS contains the file delimiter.
2. To insert a new column after the last column:
$ awk -F, '{$(NF+1)=++i;}1' OFS=, file
Unix,10,A,1
Linux,30,B,2
Solaris,40,C,3
Fedora,20,D,4
Ubuntu,50,E,5
$NF indicates the value of the last column. Hence, by assigning something to $(NF+1), a new field is inserted at the end automatically.
3. Add 2 columns after the last column:
$ awk -F, '{$(NF+1)=++i FS "X";}1' OFS=, file
Unix,10,A,1,X
Linux,30,B,2,X
Solaris,40,C,3,X
Fedora,20,D,4,X
Ubuntu,50,E,5,X
The explanation given for the above 2 examples holds good here.
4. To insert a column before the 2nd last column:
$ awk -F, '{$(NF-1)=++i FS $(NF-1);}1' OFS=, file
Unix,1,10,A
Linux,2,30,B
Solaris,3,40,C
Fedora,4,20,D
Ubuntu,5,50,E
NF-1 points to the 2nd last column. Hence, concatenating the serial number to the beginning of $(NF-1) ends up inserting a column before the 2nd last.
5. Update the 2nd column by adding 10 to its value:
$ awk -F, '{$2+=10;}1' OFS=, file
Unix,20,A
Linux,40,B
Solaris,50,C
Fedora,30,D
Ubuntu,60,E
$2 is incremented by 10.
6. Convert a specific column (1st column) to uppercase in the CSV file:
$ awk -F, '{$1=toupper($1)}1' OFS=, file
UNIX,10,A
LINUX,30,B
SOLARIS,40,C
FEDORA,20,D
UBUNTU,50,E
Using the toupper function of awk, the 1st column is converted from lowercase to uppercase.
7. Extract only the first 3 characters of a specific column (1st column):
$ awk -F, '{$1=substr($1,1,3)}1' OFS=, file
Uni,10,A
Lin,30,B
Sol,40,C
Fed,20,D
Ubu,50,E
Using the substr function of awk, a substring of only the first few characters can be retrieved. Character positions in awk start at 1, so substr($1,1,3) takes 3 characters starting from the first one.
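awk's substr counts character positions from 1, and the length argument is optional. A small sketch on a literal string rather than the CSV file:

```shell
# Characters 1 to 3 of the string.
first3=$(echo 'Solaris' | awk '{ print substr($0, 1, 3) }')

# From character 4 to the end (length omitted).
rest=$(echo 'Solaris' | awk '{ print substr($0, 4) }')

echo "$first3 $rest"
```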
8. Empty the value in the 2nd column:
$ awk -F, '{$2="";}1' OFS=, file
Unix,,A
Linux,,B
Solaris,,C
Fedora,,D
Ubuntu,,E
Set the 2nd column ($2) to blank (""). Now, when the line is printed, $2 will be blank.
9. Remove/Delete the 2nd column from the CSV file:
$ awk -F, '{for(i=1;i<=NF;i++)if(i!=x)f=f?f FS $i:$i;print f;f=""}' x=2 file
Unix,A
Linux,B
Solaris,C
Fedora,D
Ubuntu,E
Just emptying a particular column leaves the column in place with an empty value. To remove a column, all the subsequent columns have to move one position ahead. The for loop runs over all the fields. Using the ternary operator, every column other than the 2nd is concatenated to the variable "f", with FS as the delimiter. At the end, the variable "f", which contains the updated record, is printed and then reset. The column to be removed is passed through the awk variable "x", and hence just by setting the appropriate number in x, any specific column can be removed.
10. Join the 3rd column with the 2nd column using ':' and remove the 3rd column:
$ awk -F, '{$2=$2":"$x;for(i=1;i<=NF;i++)if(i!=x)f=f?f FS $i:$i;print f;f=""}' x=3 file
Unix,10:A
Linux,30:B
Solaris,40:C
Fedora,20:D
Ubuntu,50:E
Almost the same as the last example, except that first the 3rd column ($3) is concatenated with the 2nd column ($2) and then removed.
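The column-removal loop can be wrapped in a small shell function. The sketch below is a variation, not the article's exact code: drop_column is a hypothetical helper name, and it tracks the separator in its own variable instead of testing "f" with the ternary operator, since that test treats a leading field of 0 or an empty leading field as "nothing collected yet".

```shell
# drop_column N : read comma-separated records on stdin, print them without column N.
# (hypothetical helper, written for this illustration)
drop_column() {
  awk -F, -v x="$1" '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++)
      if (i != x) { out = out sep $i; sep = FS }   # sep is empty only before the first kept field
    print out
  }'
}

result=$(printf '%s\n' 'Unix,10,A' 'Linux,30,B' | drop_column 2)
zero_first=$(printf '%s\n' '0,x,2' | drop_column 2)

printf '%s\n' "$result"
echo "$zero_first"
```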
gawk - Date and time calculation functions
gawk has 3 functions to calculate date and time:
systime
strftime
mktime
Let us see in this article how to use these functions:
systime: This function is equivalent to the Unix date (date +%s) command. It gives the Unix time, the total number of seconds elapsed since the epoch (01-01-1970 00:00:00 UTC).
$ echo | awk '{print systime();}'
1358146640
Note: systime function does not take any arguments.
strftime: A very common function used in gawk to format the systime into a calendar format. Using this function, from the systime, the year, month, date, hours, mins and seconds can be separated.
Syntax: strftime (<format specifiers>,unix time);
1. Printing current date time using strftime:
$ echo | awk '{print strftime("%d-%m-%y %H-%M-%S",systime());}'
14-01-13 12-37-45
strftime takes format specifiers which are the same as the format specifiers available with the date command: %d for the day of the month, %m for the month number (01 to 12), %y for the 2 digit year, %H for the hour in 24 hour format, %M for minutes and %S for seconds. In this way, strftime converts Unix time into a date string.
2. Display current date time using strftime without systime:
$ echo | awk '{print strftime("%d-%m-%y %H-%M-%S");}'
14-01-13 12-38-08
Both the arguments of strftime are optional. When the timestamp is not provided, it takes the systime by default.
3. strftime with no arguments:
$ echo | awk '{print strftime();}'
Mon Jan 14 12:30:05 IST 2013
strftime without the format specifiers provides the output in the same default format as the Unix date command.
mktime: The mktime function converts a given date time string into Unix time, the same format that systime returns.
Syntax: mktime(date time string) # where date time string is a string which contains at least 6 components in the following order: YYYY MM DD HH MM SS
1. Printing timestamp for a specific date time:
$ echo | awk '{print mktime("2012 12 21 0 0 0");}'
1356028200
This gives the Unix time for the date 21-Dec-2012. mktime interprets the string in the local time zone, so the value depends on the time zone setting of the machine; the value above corresponds to IST.
2. Using strftime with mktime:
$ echo | awk '{print strftime("%d-%m-%Y",mktime("2012 12 21 0 0 0"));}'
21-12-2012
The output of mktime can be validated by formatting the mktime output using the strftime function as above.
3. Negative date in mktime:
$ echo | awk '{print strftime("%d-%m-%Y",mktime("2012 12 -1 0 0 0"));}'
29-11-2012
mktime accepts out-of-range values, including negative ones, and normalizes them. Day 0 of a month is the last day of the previous month (30th Nov here), so -1 in the date position is one day before that, which in this case leads to 29th Nov 2012.
4. Negative hour value in mktime:
$ echo | awk '{print strftime("%d-%m-%Y %H-%M-%S",mktime("2012 12 3 -2 0 0"));}'
02-12-2012 22-00-00
-2 in the hours position indicates 2 hours before midnight of the specified date, which in this case leads to "2-Dec-2012 22" hours.
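Since mktime and strftime work in the local time zone, the same command gives different numbers on differently configured machines. A sketch that pins the time zone to UTC for reproducible results. It assumes the awk on the PATH provides the gawk time functions, and it uses a BEGIN block in place of the "echo |" idiom:

```shell
# TZ=UTC applies only to these awk invocations.
epoch=$(TZ=UTC awk 'BEGIN { print mktime("2012 12 21 0 0 0") }')
roundtrip=$(TZ=UTC awk 'BEGIN { print strftime("%d-%m-%Y %H:%M:%S", mktime("2012 12 21 0 0 0")) }')

echo "$epoch"      # 1356048000, which is 19800 seconds (5h30m) more than the IST value
echo "$roundtrip"  # 21-12-2012 00:00:00
```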
gawk - Calculate date / time difference between timestamps
How to find the time difference between timestamps using gawk? Let us consider a file where the 1st column is the Process name, 2nd is the start time of the process, and 3rd column is the end time of the process.
The requirement is to find the time consumed by the process which is the difference between the start and the end times.
1. File in which the date and time components are separated by a space:
$ cat file
P1,2012 12 4 21 36 48,2012 12 4 22 26 53
P2,2012 12 4 20 36 48,2012 12 4 21 21 23
P3,2012 12 4 18 36 48,2012 12 4 20 12 35
Time difference in seconds:
$ awk -F, '{d2=mktime($3);d1=mktime($2);print $1","d2-d1,"secs";}' file
P1,3005 secs
P2,2675 secs
P3,5747 secs
Using mktime function, the Unix time is calculated for the date time strings, and their difference gives us the time elapsed in seconds.
2. File with a different date format:
$ cat file
P1,2012-12-4 21:36:48,2012-12-4 22:26:53
P2,2012-12-4 20:36:48,2012-12-4 21:21:23
P3,2012-12-4 18:36:48,2012-12-4 20:12:35
Note: This file has the start time and end time in a format different from the one mktime expects.
Difference in seconds:
$ awk -F, '{gsub(/[-:]/," ",$2);gsub(/[-:]/," ",$3);d2=mktime($3);d1=mktime($2);print $1","d2-d1,"secs";}' file
P1,3005 secs
P2,2675 secs
P3,5747 secs
Using the gsub function, the '-' and ':' are replaced with a space. This is done because the components of the mktime argument should be space separated.
Difference in minutes:
$ awk -F, '{gsub(/[-:]/," ",$2);gsub(/[-:]/," ",$3);d2=mktime($3);d1=mktime($2);print $1","(d2-d1)/60,"mins";}' file
P1,50.0833 mins
P2,44.5833 mins
P3,95.7833 mins
Dividing the difference in seconds by 60 gives us the difference in minutes.
3. File with only the date, without the time part:
$ cat file
P1,2012-12-4,2012-12-6
P2,2012-12-4,2012-12-8
P3,2012-12-4,2012-12-5
Note: The start and end times have only the date components, no time components.
Difference in seconds:
$ awk -F, '{gsub(/-/," ",$2);gsub(/-/," ",$3);$2=$2" 0 0 0";$3=$3" 0 0 0";d2=mktime($3);d1=mktime($2);print $1","d2-d1,"secs";}' file
P1,172800 secs
P2,345600 secs
P3,86400 secs
In addition to replacing the '-' with spaces, 0's are appended to the date field since mktime requires the date in the 6 component format.
Difference in days:
$ awk -F, '{gsub(/-/," ",$2);gsub(/-/," ",$3);$2=$2" 0 0 0";$3=$3" 0 0 0";d2=mktime($3);d1=mktime($2);print $1","(d2-d1)/86400,"days";}' file
P1,2 days
P2,4 days
P3,1 days
A day has 86400 (24*60*60) seconds, and hence by dividing the duration in seconds by 86400, the duration in days can be obtained.
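Plain division gives fractional minutes or days. Where an hours:minutes:seconds reading is preferred, the difference in seconds can be split with integer arithmetic. A minimal sketch that takes the 3005 second difference of P1 as a literal value instead of recomputing it with mktime:

```shell
secs=3005   # P1's difference from the first example

# int() drops the fraction; % gives the remainder.
hms=$(awk -v s="$secs" 'BEGIN {
  printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
}')

echo "$hms"   # 00:50:05
```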
sed - Include or Append a line to a file
sed is one of the most important editors we use in UNIX. It supports a lot of file editing tasks. In this article, we will see a specific set of sed options.
Assume I have a flat file, empFile, containing employee name and employee id as shown below:
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
1. How to add a header line say "Employee, EmpId" to this file using sed?
$ sed '1i Employee, EmpId' empFile
Employee, EmpId
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
This command does the following: The number '1' tells sed that the operation is to be done only for the first line. 'i' is the insert command, which outputs the text that follows before the addressed line. So, '1i' means to insert the text before the first line, and hence we got the header in the output.
However, the header appears only in the output; the file itself still has the old contents. So, if the requirement is to update the original file with this output, the user has to re-direct the output of the sed command to a temporary file and then move it over the original file.
Systems with GNU sed provide the '-i' option, which edits the file in-place. Let us see the same example using the '-i' option:
$ cat empFile
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
$ sed -i '1i Employee, EmpId' empFile
$ cat empFile
Employee, EmpId
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
As shown above, the '-i' option edits the file in-place without the need of a temporary file.
2. How to add a line '-------' after the header line or the 1st line?
$ sed -i '1a ---------------' empFile
$ cat empFile
Employee, EmpId
---------------
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
'1a' is similar to '1i' except that 'i' inserts the content before the addressed line, whereas 'a' appends the content after it. And hence in this case, the '----' line gets included after the 1st line. As you may have guessed, using '2i' would have worked just as well.
3. How to add a trailer line to this file?
$ sed -i '$a ---------------' empFile
$ cat empFile
Employee, EmpId
---------------
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
---------------
To add a line after the last line of the file using the above methods, we would need to know the total line count of the file. However, sed has the '$' symbol, which denotes the last line. '$a' tells sed to append the following content after the last line of the file.
4. How to add a record after a particular record? Let us assume the sample file contains only 3 records as shown below:
Employee, EmpId
---------------
Hilesh, 1001
Harshal, 1004
Keyur, 1005
---------------
Now, if I want to insert the record for the employee 'Bharti' after the employee 'Hilesh':
$ sed -i '/Hilesh/a Bharti, 1002' empFile
$ cat empFile
Employee, EmpId
---------------
Hilesh, 1001
Bharti, 1002
Harshal, 1004
Keyur, 1005
---------------
If you look at the above sed command carefully, all we have done is use a pattern in place of a number. /Hilesh/a tells sed to append the following content after every line matching the pattern 'Hilesh', and hence the result.
5. How to add a record before a particular record? Say, add the record for the employee 'Aparna' before the employee record of 'Harshal'
$ sed -i '/Harshal/i Aparna, 1003' empFile
$ cat empFile
Employee, EmpId
---------------
Hilesh, 1001
Bharti, 1002
Aparna, 1003
Harshal, 1004
Keyur, 1005
---------------
Similarly, /Harshal/i tells sed to insert the following content before every line matching the pattern 'Harshal'.
Note: As said above, the '-i' option as used here works with GNU sed. BSD sed (as found on macOS) also has '-i', but expects a backup-suffix argument after it. Where '-i' is not available, the user has to re-direct the output to a temporary file and move it over the original file.
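For systems without a usable '-i', the temporary-file approach can be sketched as below. It also uses the POSIX spelling of the insert command ("i\" followed by the text on the next line), since the one-line '1i text' form is a GNU convenience. The directory and file names are throwaway ones created for this illustration:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'Hilesh, 1001' 'Bharti, 1002' > "$tmpdir/empFile"

# Write the edited copy to a temporary file; replace the original only if sed succeeded.
sed '1i\
Employee, EmpId' "$tmpdir/empFile" > "$tmpdir/empFile.tmp" &&
  mv "$tmpdir/empFile.tmp" "$tmpdir/empFile"

first_line=$(sed -n '1p' "$tmpdir/empFile")
line_count=$(wc -l < "$tmpdir/empFile" | tr -d ' ')
echo "$first_line ($line_count lines)"
rm -rf "$tmpdir"
```

Redirecting straight onto the input file (sed ... file > file) would not work, because the shell empties the file before sed gets to read it.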
sed - Replace or substitute file contents
In one of our earlier articles, we saw how to insert a line or append a line to an existing file using sed. In this article, we will see how we can do data manipulation or substitution in files using sed.
Let us consider a sample file, sample1.txt, as shown below:
apple
orange
banana
pappaya
1. To add something to the beginning of every line in a file, say to add the word Fruit:
$ sed 's/^/Fruit: /' sample1.txt
Fruit: apple
Fruit: orange
Fruit: banana
Fruit: pappaya
The character 's' stands for substitution. What follows 's' is the character, word or regular expression to replace, followed by the character, word or text to replace it with. '/' separates the 's' command, the content to replace and the content to replace with. The '^' character anchors the match at the beginning of the line, and hence the phrase 'Fruit: ' gets added at the beginning of every line.
2. Similarly, to add something to the end of every line:
$ sed 's/$/ Fruit/' sample1.txt
apple Fruit
orange Fruit
banana Fruit
pappaya Fruit
The character '$' is used to denote the end of the line. And hence this means, replace the end of the line with 'Fruit' which effectively means to add the word 'Fruit' to the end of the line.
3. To replace or substitute a particular character, say to replace 'a' with 'A'.
$ sed 's/a/A/' sample1.txt
Apple
orAnge
bAnana
pAppaya
Please note that in every line only the first occurrence of 'a' is replaced, not all. The example shown here is a single character replacement; the same can easily be done for a word as well.
4. To replace or substitute all occurrences of 'a' with 'A'
$ sed 's/a/A/g' sample1.txt
Apple
orAnge
bAnAnA
pAppAyA
5. Replacing the first occurrence or all occurrences is fine. What if we want to replace the second occurrence, the third occurrence, or in other words the nth occurrence?
To replace only the 2nd occurrence of a character :
$ sed 's/a/A/2' sample1.txt
apple
orange
banAna
pappAya
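Please note above: the 'a' in apple has not changed, and neither has the one in orange, since there is no 2nd occurrence of 'a' in these words. However, the changes have happened appropriately in banana and pappaya.
6. Now, say to replace all occurrences from the 2nd occurrence onwards (combining a number with 'g' in this way is GNU sed behavior and may not work with other versions of sed):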
$ sed 's/a/A/2g' sample1.txt
apple
orange
banAnA
pappAyA
7. Say, you want to replace 'a' only in a specific line say 3rd line, not in the entire file:
$ sed '3s/a/A/g' sample1.txt
apple
orange
bAnAnA
pappaya
'3s' denotes the substitution to be done is only for the 3rd line.
8. To replace or substitute 'a' on a range of lines, say from 1st to 3rd line:
$ sed '1,3s/a/A/g' sample1.txt
Apple
orAnge
bAnAnA
pappaya
9. To replace the entire line with something. For example, to replace 'apple' with 'apple is a Fruit'.
$ sed 's/.*/& is a Fruit/' sample1.txt
apple is a Fruit
orange is a Fruit
banana is a Fruit
pappaya is a Fruit
The '&' symbol denotes the entire pattern matched. In this case, since we are using '.*', which matches the entire line, '&' contains the entire line. This type of matching is really useful when you have a file containing a list of file names and you want to, say, rename them, as we have shown in one of our earlier articles: Rename group of files
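As a small illustration of '&' used more than once in the replacement, the sketch below turns a list of file names into mv commands. The file names are made up, and the commands are only printed, not executed:

```shell
# Each '&' in the replacement expands to the whole matched line.
commands=$(printf '%s\n' 'report.txt' 'notes.txt' | sed 's/.*/mv & &.bak/')

printf '%s\n' "$commands"
```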
10. Using sed, we can also do multiple substitution. For example, say to replace all 'a' to 'A', and 'p' to 'P':
$ sed 's/a/A/g; s/p/P/g' sample1.txt
APPle
orAnge
bAnAnA
PAPPAyA
OR This can also be done as:
$ sed -e 's/a/A/g' -e 's/p/P/g' sample1.txt
APPle
orAnge
bAnAnA
PAPPAyA
The option '-e' is used to supply each sed command separately when there is more than one set of substitutions to be done.
OR The multiple substitution can also be done as shown below spanning multiple lines:
$ sed -e 's/a/A/g' \
> -e 's/p/P/g' sample1.txt
APPle
orAnge
bAnAnA
PAPPAyA
sed - Read from a file or write into a file
In this sed article, we will see how to read a file into a sed output, and also how to write a section of a file content to a different file.
Let us assume we have 2 files, file1 and file2 with the following content:
$ cat file1
1apple
1banana
1mango
$ cat file2
2orange
2strawberry
sed has 2 commands for reading and writing:
r filename : To read the content of the file specified in filename
w filename : To write to the file specified in filename
Let us see some examples now:
1. Read the file2 after every line of file1.
$ sed 'r file2' file1
1apple
2orange
2strawberry
1banana
2orange
2strawberry
1mango
2orange
2strawberry
r file2 reads the file contents of file2. Since there is no specific number before 'r', it means to read the file contents of file2 for every line of file1. And hence the above output.
2. The above output is not very useful. Say, we want to read the file2 contents after the 1st line of file1:
$ sed '1r file2' file1
1apple
2orange
2strawberry
1banana
1mango
'1r' indicates to read the contents of file2 only after reading the line1 of file1.
3. Similarly, we can also try to read a file contents on finding a pattern:
$ sed '/banana/r file2' file1
1apple
1banana
2orange
2strawberry
1mango
The file2 contents are read on finding the pattern banana and hence the above output.
4. To read a file content on encountering the last line:
$ sed '$r file2' file1
1apple
1banana
1mango
2orange
2strawberry
The '$' indicates the last line, and hence the file2 contents are read after the last line. Hey, hold on. The above example is put to show the usage of $ in this scenario. If your requirement is really something like above, you need not use sed. cat file1 file2 will do :) .
Let us now move onto the writing part of sed. Consider a file, file1, with the below contents:
$ cat file1
apple
banana
mango
orange
strawberry
1. Write the lines from 2nd to 4th to a file, say file2.
$ sed -n '2,4w file2' file1
The command '2,4w file2' writes the lines from 2 to 4 to file2. What is the option "-n" for? By default, sed prints every line it reads, and hence the above command without "-n" would still print the file1 contents on the standard output. In order to suppress this default output, "-n" is used. Let us print the file2 contents to check the result.
$ cat file2
banana
mango
orange
Note: Even after running the above command, the file1 contents still remain intact.
2. Write the contents from the 3rd line onwards to a different file:
$ sed -n '3,$w file2' file1
$ cat file2
mango
orange
strawberry
As explained earlier, '3,$' indicates from the 3rd line to the end of the file.
3. To write a range of lines, say to write from lines apple through mango :
$ sed -n '/apple/,/mango/w file2' file1
$ cat file2
apple
banana
mango
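The effect of "-n" on the w command, and the fact that the source file is left untouched, can be checked with a sketch like the one below. It works in a throwaway directory so that no real file1 or file2 is overwritten:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' apple banana mango orange strawberry > "$tmpdir/file1"

# Without -n: lines 2 to 4 go to file2, and all 5 lines are still printed.
shown=$(cd "$tmpdir" && sed '2,4w file2' file1)

written=$(cat "$tmpdir/file2")
shown_count=$(printf '%s\n' "$shown" | wc -l | tr -d ' ')
source_count=$(wc -l < "$tmpdir/file1" | tr -d ' ')

echo "printed: $shown_count lines, file1 still has: $source_count lines"
printf '%s\n' "$written"
rm -rf "$tmpdir"
```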
sed - Selective printing
In this sed article, we are going to see the different options sed provides to selectively print contents in a file. Let us take a sample file with the following contents:
$ cat file
Gmail 10
Yahoo 20
Redif 18
1. To print the entire file contents:
$ sed '' file
Gmail 10
Yahoo 20
Redif 18
2. To print only the line containing 'Gmail'. In other words, to simulate the grep command:
$ sed '/Gmail/p' file
Gmail 10
Gmail 10
Yahoo 20
Redif 18
Within the slashes, we specify the pattern which we try to match. The 'p' command tells sed to print the line. Look at the above result carefully: the line 'Gmail' got printed twice. Why? This is because the default behavior of sed is to print every line after processing it. On top of that, since we explicitly asked sed to print the line containing the pattern 'Gmail' by specifying 'p', the line 'Gmail' got printed twice. How to get the desired result now?
$ sed -n '/Gmail/p' file
Gmail 10
The desired result can be obtained by suppressing the default printing which can be done by using the option "-n". And hence the above result.
3. To delete the line containing the pattern 'Gmail'. In other words, to simulate the "grep -v" command option in sed:
$ sed '/Gmail/d' file
Yahoo 20
Redif 18
The "d" command deletes the lines matching the pattern. As said earlier, the default action of sed is to print. Hence, all the other lines got printed, and the line containing the pattern 'Gmail' got deleted since we specified the "d" command explicitly.
Along the same lines, say to delete the first line of the file:
$ sed '1d' file
Yahoo 20
Redif 18
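The correspondence with grep mentioned in examples 2 and 3 can be checked side by side. A small sketch that feeds the same three sample lines to both tools:

```shell
data='Gmail 10
Yahoo 20
Redif 18'

# sed -n '/pat/p' behaves like grep 'pat'
sed_match=$(printf '%s\n' "$data" | sed -n '/Gmail/p')
grep_match=$(printf '%s\n' "$data" | grep 'Gmail')

# sed '/pat/d' behaves like grep -v 'pat'
sed_rest=$(printf '%s\n' "$data" | sed '/Gmail/d')
grep_rest=$(printf '%s\n' "$data" | grep -v 'Gmail')

[ "$sed_match" = "$grep_match" ] && [ "$sed_rest" = "$grep_rest" ] && echo "same results"
```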
4. Print lines till you encounter a specific pattern, say till 'Yahoo' is encountered.
$ sed '/Yahoo/q' file
Gmail 10
Yahoo 20
The "q" command tells to quit from that point onwards. This sed command tells to keep printing(which is default) and stop processing once the pattern "Yahoo" is encountered.
Printing Range of Lines:
Till now, what we saw is to retrieve a line or a set of lines based on a condition. Now, we will see how to get the same for a given range:
Consider the below sample file:
$ cat file
Gmail 10
Yahoo 20
Redif 18
Inbox 15
Live 23
Hotml 09
5. To print the first 3 lines, or from lines 1 through 3:
$ sed -n '1,3p' file
Gmail 10
Yahoo 20
Redif 18
The option "-n" suppresses the default printing. "1,3p" indicates to print from lines 1 to 3.
The same can also be achieved through:
$ sed '3q' file
Gmail 10
Yahoo 20
Redif 18
3q denotes to quit after reading the 3rd line. Since the "-n" option is not used, the first 3 lines get printed.
6. Similar to line number ranges, sed can also work on pattern ranges. Say, to print the lines between the patterns "Yahoo" and "Live":
$ sed -n '/Yahoo/,/Live/p' file
Yahoo 20
Redif 18
Inbox 15
Live 23
The pattern is always specified between the slashes. The comma operator is used to specify the range. This command tells sed to print all the lines from the pattern "Yahoo" through the pattern "Live", both inclusive.
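One behavior of pattern ranges worth knowing: if the end pattern never matches, the range stays open until the end of the input. A short sketch with a deliberately absent end pattern ("NoSuch" is just a placeholder):

```shell
data='Gmail 10
Yahoo 20
Redif 18'

# The range starts at Yahoo; since NoSuch never appears, it runs to the last line.
open_range=$(printf '%s\n' "$data" | sed -n '/Yahoo/,/NoSuch/p')

printf '%s\n' "$open_range"
```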
7. To print the lines from pattern "Redif" till the end of the file.
$ sed -n '/Redif/,$p' file
Redif 18
Inbox 15
Live 23
Hotml 09
The earlier examples were line number ranges and pattern ranges. sed allows us to use both (line number and pattern) in the same command itself. This command indicates to print the lines from pattern "Redif" till the end of the file($).
8. Similarly, to print contents from the beginning of the file till the pattern "Inbox":
$ sed -n '1,/Inbox/p' file
Gmail 10
Yahoo 20
Redif 18
Inbox 15
sed - Replace or substitute file contents - Part 2
In one of our earlier articles, we saw about Replace and substitute using sed . In continuation to it, we will see a few more frequent search and replace operations done on files using sed.
Let us consider a file with the following contents:
$ cat file
RE01:EMP1:25:2500
RE02:EMP2:26:2650
RE03:EMP3:24:3500
RE04:EMP4:27:2900
1. To replace the first two(2) characters of a string or a line with say "XX":
$ sed 's/^../XX/' file
XX01:EMP1:25:2500
XX02:EMP2:26:2650
XX03:EMP3:24:3500
XX04:EMP4:27:2900
The "^" symbol indicates from the beginning. The two dots indicate 2 characters.
The same thing can also be achieved without using the caret (^) symbol as shown below. This also works because sed replaces the leftmost match, and the leftmost two characters are the ones at the beginning of the line.
$ sed 's/../XX/' file
2. Along the same lines, to remove or delete the first two characters of a string or a line:
$ sed 's/^..//' file
01:EMP1:25:2500
02:EMP2:26:2650
03:EMP3:24:3500
04:EMP4:27:2900
Here the replacement string is empty, and hence the matched characters get deleted.
3. Similarly, to remove/delete the last two characters in the string:
$ sed 's/..$//' file
RE01:EMP1:25:25
RE02:EMP2:26:26
RE03:EMP3:24:35
RE04:EMP4:27:29
4. To add a string to the end of a line:
$ sed 's/$/.Rs/' file
RE01:EMP1:25:2500.Rs
RE02:EMP2:26:2650.Rs
RE03:EMP3:24:3500.Rs
RE04:EMP4:27:2900.Rs
Here the string ".Rs" is being added to the end of the line.
5. To add an empty space to the beginning of every line in a file:
$ sed 's/^/ /' file
 RE01:EMP1:25:2500
 RE02:EMP2:26:2650
 RE03:EMP3:24:3500
 RE04:EMP4:27:2900
To make any sed change permanent, or in other words to save the changes in the same file, use the option "-i":
$ sed -i 's/^/ /' file
$ cat file
 RE01:EMP1:25:2500
 RE02:EMP2:26:2650
 RE03:EMP3:24:3500
 RE04:EMP4:27:2900
6. To remove empty spaces from the beginning of a line:
$ sed 's/^ *//' file
RE01:EMP1:25:2500
RE02:EMP2:26:2650
RE03:EMP3:24:3500
RE04:EMP4:27:2900
"^ *" (a caret followed by a space and a *) indicates a sequence of zero or more spaces at the beginning of the line.
7. To remove empty spaces from beginning and end of string.
$ sed 's/^ *//; s/ *$//' file
RE01:EMP1:25:2500
RE02:EMP2:26:2650
RE03:EMP3:24:3500
RE04:EMP4:27:2900
This example also shows how to use multiple sed substitutions as part of the same command.
The same command can also be written as:
$ sed -e 's/^ *//' -e 's/ *$//' file
8. To add a character before and after a string. Or in other words, to encapsulate the string with something:
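$ sed 's/.*/"&"/' file
"RE01:EMP1:25:Rs.2500"
"RE02:EMP2:26:Rs.2650"
"RE03:EMP3:24:Rs.3500"
"RE04:EMP4:27:Rs.2900"
".*" matches the entire line, and '&' denotes the text matched. The replacement "&" therefore puts a double-quote at the beginning and at the end of the line. Note that the output shown carries an "Rs." prefix in the last field, which the sample file listed above does not have; run against that file, the last field would be plain 2500, 2650 and so on.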
9. To remove the first and last character of a string:
$ sed 's/^.//;s/.$//' file
RE01:EMP1:25:2500
RE02:EMP2:26:2650
RE03:EMP3:24:3500
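RE04:EMP4:27:2900
Note: The output shown corresponds to input lines that have one extra character at each end, such as the double-quoted lines produced in the previous example. Run against the sample file as listed at the top, the command would print E01:EMP1:25:250 for the first line.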
10. To remove everything till the first digit comes :
$ sed 's/^[^0-9]*//' file
01:EMP1:25:2500
02:EMP2:26:2650
03:EMP3:24:3500
04:EMP4:27:2900
Similarly, to remove everything till the first letter comes:
$ sed 's/^[^a-zA-Z]*//' file
11. To remove a numerical word from the end of the string:
$ sed 's/[0-9]*$//' file
RE01:EMP1:25:
RE02:EMP2:26:
RE03:EMP3:24:
RE04:EMP4:27:
12. To get the last column of a file with a delimiter. The delimiter in this case is ":".
$ sed 's/.*://' file
2500
2650
3500
2900
At first glance, one might expect the output of the above command to be the same contents without the first column and the delimiter. But sed is greedy: given '.*:', it matches as much as it can, up to the last colon, and consumes everything before it. Hence, we get only the content after the last colon.
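The greedy match can be contrasted with a pattern that cannot run past the first colon. A small sketch on a single record from the sample file:

```shell
line='RE01:EMP1:25:2500'

# '.*:' is greedy: it consumes everything through the LAST colon.
last_col=$(echo "$line" | sed 's/.*://')

# '[^:]*:' cannot cross a colon: it consumes only through the FIRST colon.
without_first=$(echo "$line" | sed 's/[^:]*://')

echo "$last_col"
echo "$without_first"
```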
13. To convert the entire line into lower case:
$ sed 's/.*/\L&/' file
re01:emp1:25:rs.2500
re02:emp2:26:rs.2650
re03:emp3:24:rs.3500
re04:emp4:27:rs.2900
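\L is the sed switch (a GNU sed extension) to convert to lower case. The operand following the \L gets converted. Since & (the pattern matched, which is the entire line in this case) follows \L, the entire line gets converted to lower case. The "rs." in the output shows the effect on mixed-case text; the sample file as listed above has no "Rs." prefix, so against it only RE and EMP would change.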
14. To convert the entire line or a string to uppercase :
$ sed 's/.*/\U&/' file
RE01:EMP1:25:RS.2500
RE02:EMP2:26:RS.2650
RE03:EMP3:24:RS.3500
RE04:EMP4:27:RS.2900
Same as above, \U instead of \L.
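\L and \U are GNU sed features. Where they are not available, whole-line case conversion can be done with tr instead. This is a different tool, shown only as a portable sketch:

```shell
line='RE01:EMP1:25:2500'

# tr maps each upper-case letter to its lower-case counterpart; digits and ':' pass through.
lower=$(echo "$line" | tr '[:upper:]' '[:lower:]')
upper=$(echo "$lower" | tr '[:lower:]' '[:upper:]')

echo "$lower"
echo "$upper"
```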
sed - 25 examples to delete a line or pattern in a file
In this article of sed tutorial series, we are going to see how to delete or remove a particular line or a particular pattern from a file using the sed command.
Let us consider a file with the sample contents as below:
$ cat file
Cygwin
Unix
Linux
Solaris
AIX
1. Delete the 1st line or the header line:
$ sed '1d' file
Unix
Linux
Solaris
AIX
d command is to delete a line. 1d means to delete the first line.
The above command will show the file content by deleting the first line. However, the source file remains unchanged. To update the original file itself with this deletion or to make the changes permanently in the source file, use the -i option. The same is applicable for all the other examples.
sed -i '1d' file
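Note: The -i option is not part of standard (POSIX) sed. GNU sed accepts it as shown above; BSD sed (including the one on macOS) requires a backup suffix argument after -i. If -i is not available, re-direct the sed output to a file, and rename the output file to the original file.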
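The re-direct and rename approach can be sketched as below. The scratch directory and the file.tmp name are assumptions made for the illustration only.

```shell
# Work in a scratch directory so that no real file is touched
dir=$(mktemp -d)
printf '%s\n' Cygwin Unix Linux Solaris AIX > "$dir/file"

# Write the edited copy first; replace the original only if sed succeeded
sed '1d' "$dir/file" > "$dir/file.tmp" && mv "$dir/file.tmp" "$dir/file"

result=$(cat "$dir/file")
echo "$result"

# Clean up the scratch directory
rm -r "$dir"
```

Chaining with && keeps the original file intact if the sed command fails.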
2. Delete a particular line, 3rd line in this case:
$ sed '3d' file
Cygwin
Unix
Solaris
AIX
3. Delete the last line or the trailer line of the file:
$ sed '$d' file
Cygwin
Unix
Linux
Solaris
$ indicates the last line.
4. Delete a range of lines, from 2nd line till 4th line:
$ sed '2,4d' file
Cygwin
AIX
The range is specified using the comma operator.
5. Delete all lines outside the specified range, here everything other than the 2nd to 4th lines:
$ sed '2,4!d' file
Unix
Linux
Solaris
The ! operator indicates negative condition.
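Deleting everything outside a range gives the same result as printing only that range. A small sketch comparing the two forms on the sample content:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# Delete every line that is NOT in the range 2 to 4
kept_by_delete=$(printf '%s\n' "$data" | sed '2,4!d')

# Suppress default printing (-n) and print only the range 2 to 4
kept_by_print=$(printf '%s\n' "$data" | sed -n '2,4p')

echo "$kept_by_delete"
echo "$kept_by_print"
```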
6. Delete the first line AND the last line of a file, i.e, the header and trailer line of a file.
$ sed '1d;$d' file
Unix
Linux
Solaris
Multiple commands are separated using the ';' operator. Similarly, say to delete the 2nd and 4th lines, you can use: '2d;4d'.
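As a sketch, the same pair of deletions can be given either in one script separated by ';' or as separate -e expressions:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# Two commands in one script, separated by ';'
with_semicolon=$(printf '%s\n' "$data" | sed '2d;4d')

# The same two commands, each passed with its own -e option
with_e=$(printf '%s\n' "$data" | sed -e '2d' -e '4d')

echo "$with_semicolon"
echo "$with_e"
```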
7. Delete all lines beginning with a particular character, 'L' in this case:
$ sed '/^L/d' file
Cygwin
Unix
Solaris
AIX
'^L' indicates lines beginning with L.
8. Delete all lines ending with a particular character, 'x' in this case:
$ sed '/x$/d' file
Cygwin
Solaris
AIX
'x$' indicates lines ending with 'x'. AIX did not get deleted because the X is capital.
9. Delete all lines ending with either x or X, i.e case-insensitive delete:
$ sed '/[xX]$/d' file
Cygwin
Solaris
[xX] indicates either 'x' or 'X'. So, this will delete all lines ending with either small 'x' or capital 'X'.
10. Delete all blank lines in the file
$ sed '/^$/d' file
Cygwin
Unix
Linux
Solaris
AIX
'^$' indicates lines containing nothing and hence the empty lines get deleted. However, this won't delete lines containing only some blank spaces.
11. Delete all lines which are empty or which contain just some blank spaces:
$ sed '/^ *$/d' file
Cygwin
Unix
Linux
Solaris
AIX
'*' indicates 0 or more occurrences of the previous character. '^ *$' indicates a line containing zero or more spaces. Hence, this will delete all lines which are either empty or lines with only some blank spaces.
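Note that '^ *$' covers spaces only; a line holding a tab is left alone. A sketch using the POSIX character class [[:space:]] to cover tabs as well, on a hypothetical input built for the illustration:

```shell
# Hypothetical input: one empty line, one line of spaces, one line with a tab
data=$(printf 'Cygwin\n\nUnix\n   \nLinux\n\t\nSolaris\n')

# Removes the empty and spaces-only lines; the tab-only line survives
spaces_only=$(printf '%s\n' "$data" | sed '/^ *$/d')

# [[:space:]] matches spaces and tabs, so every blank-looking line goes
all_blank=$(printf '%s\n' "$data" | sed '/^[[:space:]]*$/d')

echo "$all_blank"
```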
12. Delete all lines which are entirely in capital letters:
$ sed '/^[A-Z]*$/d' file
Cygwin
Unix
Linux
Solaris
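[A-Z] matches any single capital letter, so '^[A-Z]*$' matches a line made up only of capital letters. Since '*' also allows zero occurrences, this pattern matches empty lines as well.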
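Because '*' also matches zero characters, '^[A-Z]*$' removes empty lines too. A sketch of a stricter form that requires at least one capital letter; the input is hypothetical, and LC_ALL=C is set on the assumption that it keeps the [A-Z] range limited to the ASCII capitals:

```shell
# Hypothetical input with an empty line in the middle
data=$(printf 'Cygwin\n\nAIX\nUnix\n')

# Zero or more capitals: deletes AIX and also the empty line
loose=$(printf '%s\n' "$data" | LC_ALL=C sed '/^[A-Z]*$/d')

# One or more capitals: deletes AIX but keeps the empty line
strict=$(printf '%s\n' "$data" | LC_ALL=C sed '/^[A-Z][A-Z]*$/d')

echo "$loose"
echo "$strict"
```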
13. Delete the lines containing the pattern 'Unix'.
$ sed '/Unix/d' file
Cygwin
Linux
Solaris
AIX
The pattern is specified within a pair of slashes.
14. Delete the lines NOT containing the pattern 'Unix':
$ sed '/Unix/!d' file
Unix
15. Delete the lines containing the pattern 'Unix' OR 'Linux':
$ sed '/Unix\|Linux/d' file
Cygwin
Solaris
AIX
The OR condition is specified using the | operator. In order not to get the pipe(|) interpreted as a literal, it is escaped using a backslash.
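Two other ways of writing the same OR condition, shown as a sketch: the -E option (extended regular expressions, accepted by GNU and BSD sed), where the pipe needs no backslash, and two separate delete commands.

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# Extended regex: | is the alternation operator without any escaping
with_ere=$(printf '%s\n' "$data" | sed -E '/Unix|Linux/d')

# Two independent delete commands give the same result
with_two=$(printf '%s\n' "$data" | sed -e '/Unix/d' -e '/Linux/d')

echo "$with_ere"
echo "$with_two"
```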
16. Delete the lines starting from the 1st line till encountering the pattern 'Linux':
$ sed '1,/Linux/d' file
Solaris
AIX
Earlier, we saw how to delete a range of lines. Range can be in many combinations: Line ranges, pattern ranges, line and pattern, pattern and line.
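A short sketch of two of the other range combinations on the same sample content:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# Pattern and line: from the line matching Unix up to line 3
pattern_to_line=$(printf '%s\n' "$data" | sed '/Unix/,3d')

# Pattern range: from the line matching Linux up to the line matching Solaris
pattern_to_pattern=$(printf '%s\n' "$data" | sed '/Linux/,/Solaris/d')

echo "$pattern_to_line"
echo "$pattern_to_pattern"
```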
17. Delete the lines starting from the pattern 'Linux' till the last line:
$ sed '/Linux/,$d' file
Cygwin
Unix
18. Delete the last line ONLY if it contains the pattern 'AIX':
$ sed '${/AIX/d;}' file
Cygwin
Unix
Linux
Solaris
$ is for the last line. To delete a particular line only if it contains the pattern AIX, put the line number in place of the $. This is how we can implement the 'if' condition in sed.
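The line-number form of this 'if' condition can be sketched as follows, here checking the 3rd line:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# The 3rd line is deleted because it contains Linux
hit=$(printf '%s\n' "$data" | sed '3{/Linux/d;}')

# The 3rd line does not contain AIX, so nothing is deleted
miss=$(printf '%s\n' "$data" | sed '3{/AIX/d;}')

echo "$hit"
echo "$miss"
```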
19. Delete the last line ONLY if it contains either the pattern 'AIX' or 'HPUX':
$ sed '${/AIX\|HPUX/d;}' file
Cygwin
Unix
Linux
Solaris
20. Delete the lines containing the pattern 'Solaris' only if it is present in the lines from 1 to 4.
$ sed '1,4{/Solaris/d;}' file
Cygwin
Unix
Linux
AIX
This will delete the lines containing the pattern Solaris only if they are among the first four lines, nowhere else.
21. Delete the line containing the pattern 'Unix' and also the next line:
$ sed '/Unix/{N;d;}' file
Cygwin
Solaris
AIX
N command appends the next line to the pattern space. d deletes the entire pattern space which contains the current and the next line.
22. Delete only the line following the line containing the pattern 'Unix', not the matching line itself:
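One edge case is a match on the last line, where N has no next line to read; GNU sed, for instance, then prints the pattern space and exits without running d. A common guard, sketched here, is '$!N', which runs N on every line except the last:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# Match in the middle: Unix and the following line are deleted
middle=$(printf '%s\n' "$data" | sed '/Unix/{$!N;d;}')

# Match on the last line: N is skipped, d still deletes the line
last=$(printf '%s\n' "$data" | sed '/AIX/{$!N;d;}')

echo "$middle"
echo "$last"
```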
$ sed '/Unix/{N;s/\n.*//;}' file
Cygwin
Unix
Solaris
AIX
Using the substitution command s, we delete from the newline character till the end, which effectively deletes the next line after the line containing the pattern Unix.
23. Delete the line containing the pattern 'Linux', also the line before the pattern:
$ sed -n '/Linux/{s/.*//;x;d;};x;p;${x;p;}' file | sed '/^$/d'
Cygwin
Solaris
AIX
This one is a little tricky. In order to delete the line prior to the pattern, we store every line in a buffer called the hold space. Whenever the pattern matches, we delete the content present in both the pattern space, which contains the current line, and the hold space, which contains the previous line.
Let me explain this command. 'x;p;' gets executed for every line: x exchanges the content of the pattern space with the hold space, and p prints the pattern space. As a result, every time, the current line goes to the hold space, and the previous line comes to the pattern space and gets printed. When the pattern /Linux/ matches, we empty (s/.*//) the pattern space, exchange (x) it with the hold space (as a result of which the hold space becomes empty), and delete (d) the pattern space, which now contains the previous line. Hence, the current and the previous lines get deleted on encountering the pattern Linux. The ${x;p;} prints the last line, which would otherwise remain in the hold space.
The second sed command removes the empty lines created by the first sed command.
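To see where the empty lines come from, the output of the first sed command can be captured before it is filtered. A sketch on the sample content:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# First stage only: an empty line appears for line 1 (the hold space is
# still empty) and again after the match (the hold space was emptied)
stage1=$(printf '%s\n' "$data" | sed -n '/Linux/{s/.*//;x;d;};x;p;${x;p;}')

# Second stage: drop the empty lines
final=$(printf '%s\n' "$stage1" | sed '/^$/d')

printf '%s\n' "$stage1"
printf '%s\n' "$final"
```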
24. Delete only the line prior to the line containing the pattern 'Linux', not the matching line itself:
$ sed -n '/Linux/{x;d;};1h;1!{x;p;};${x;p;}' file
Cygwin
Linux
Solaris
AIX
This is almost the same as the last one, with a few changes. On encountering the pattern /Linux/, we exchange (x) and delete (d). As a result of the exchange, the current line remains in the hold space, and the previous line which came into the pattern space gets deleted.
1h;1!{x;p;} - 1h copies the current line to the hold space only if it is the first line. Exchange and print are done for all the other lines. This could have been simply: x;p . The drawback is that it gives an empty line at the beginning because, during the first exchange between the pattern space and the hold space, an empty line comes into the pattern space since the hold space is empty.
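The difference between the two forms can be checked with a short sketch on the sample content:

```shell
# Sample content used in this article
data=$(printf '%s\n' Cygwin Unix Linux Solaris AIX)

# Plain x;p for every line: the first exchange prints an empty line
simple=$(printf '%s\n' "$data" | sed -n '/Linux/{x;d;};x;p;${x;p;}')

# 1h for the first line, x;p for the rest: no leading empty line
guarded=$(printf '%s\n' "$data" | sed -n '/Linux/{x;d;};1h;1!{x;p;};${x;p;}')

printf '%s\n' "$simple"
printf '%s\n' "$guarded"
```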
25. Delete the line containing the pattern 'Linux', the line before, the line after:
$ sed -n '/Linux/{N;s/.*//;x;d;};x;p;${x;p;}' file | sed '/^$/d'
Cygwin
AIX
With the explanations of the last 2 commands, this should be fairly simple to understand.
sed - 10 examples to print lines from a file
In this article of sed series, we will see how to print a particular line using the print(p) command of sed. Let us consider a file with the following contents:
$ cat file
AIX
Solaris
Unix
Linux
HPUX
1. Print only the first line of the file:
$ sed -n '1p' file
AIX
Similarly, to print a particular line, put the line number before 'p'.
2. Print only the last line of the file:
$ sed -n '$p' file
HPUX
$ indicates the last line.
3. Print lines which do not contain 'X':
$ sed -n '/X/!p' file
Solaris
Unix
Linux
!p indicates the negative condition to print.
4. Print lines which contain the character 'u' or 'x':
$ sed -n '/[ux]/p' file
Unix
Linux
[ux] matches either 'u' or 'x', so lines containing either character are printed.
5. Print lines which end with 'x' or 'X':
$ sed -n '/[xX]$/p' file
AIX
Unix
Linux
HPUX
6. Print lines beginning with either 'A' or 'L':
$ sed -n '/^A\|^L/p' file
AIX
Linux
The escaped pipe (\|) separates alternative patterns. Any number of alternatives can be provided for searching in this way.
7. Print every alternate line:
$ sed 'n;d' file
AIX
Unix
HPUX
n command prints the current line, and immediately reads the next line into pattern space. d command deletes the line present in pattern space. In this way, alternate lines get printed.
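The complementary set, the even-numbered lines, can be printed with the same n command. A sketch: with -n, n no longer prints the current line, and p prints the line that n has just read.

```shell
# Sample content used in this article
data=$(printf '%s\n' AIX Solaris Unix Linux HPUX)

# Odd-numbered lines: n prints the current line, d drops the next one
odd=$(printf '%s\n' "$data" | sed 'n;d')

# Even-numbered lines: printing is suppressed, p prints the line read by n
even=$(printf '%s\n' "$data" | sed -n 'n;p')

echo "$odd"
echo "$even"
```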
8. Print every 2 lines:
$ sed 'n;n;N;d' file
AIX
Solaris
HPUX
n;n; => This prints 2 lines, and the 3rd line is then present in the pattern space. N command reads the next line and joins it with the current line, and d deletes everything present in the pattern space. With this, the 3rd and 4th lines present in the pattern space get deleted. Since this repeats till the end of the file, it ends up printing 2 lines out of every 4.
9. Print lines ending with 'X' within a range of lines:
$ sed -n '/Unix/,${/X$/p;}' file
HPUX
The range of lines chosen starts from the line containing the pattern 'Unix' and runs till the end of the file ($). The commands present within the braces are applied only to this range of lines. Within this group, only the lines ending with 'X' are printed.
10. Print range of lines excluding the starting and ending line of the range:
$ sed -n '/Solaris/,/HPUX/{//!p;}' file
Unix
Linux
The range of lines chosen is from 'Solaris' to 'HPUX'. The action within the braces is applied only to this range of lines. If no pattern is provided in pattern matching (//), the last matched pattern is considered. For example, when the line containing the pattern 'Solaris' matches the range of lines and gets inside the curly braces, since no pattern is present, the last pattern (Solaris) is matched. Since this matching is true, the line is not printed (!p), and the same becomes true for the last line in the group as well.
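For comparison, the same result can be had by naming both patterns explicitly instead of relying on the empty pattern (//). A sketch on the sample content:

```shell
# Sample content used in this article
data=$(printf '%s\n' AIX Solaris Unix Linux HPUX)

# Empty pattern: // reuses the last regular expression that was applied
short=$(printf '%s\n' "$data" | sed -n '/Solaris/,/HPUX/{//!p;}')

# Explicit form: print only lines matching neither Solaris nor HPUX
explicit=$(printf '%s\n' "$data" | sed -n '/Solaris/,/HPUX/{/Solaris/!{/HPUX/!p;};}')

echo "$short"
echo "$explicit"
```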
sed - 10 examples to replace / delete / print lines of CSV file
How to use sed to work with a CSV file? Or how to work with any file in which fields are separated by a delimiter?
Let us consider a sample CSV file with the following content:
$ cat file
Solaris,25,11
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
1. To remove the 1st field or column:
$ sed 's/[^,]*,//' file
25,11
31,2
21,3
45,4
12,5
This regular expression matches a sequence of non-comma characters ([^,]*) followed by a comma and deletes them, which results in the 1st field getting removed.
2. To print only the last field, OR remove all fields except the last field:
$ sed 's/.*,//' file
11
2
3
4
5
This regex removes everything till the last comma (.*,) which results in deleting all the fields except the last field.
3. To print only the 1st field:
$ sed 's/,.*//' file
Solaris
Ubuntu
Fedora
LinuxMint
RedHat
This regex (,.*) removes the characters starting from the 1st comma till the end, resulting in deleting all the fields except the first field.
4. To delete the 2nd field:
$ sed 's/,[^,]*,/,/' file
Solaris,11
Ubuntu,2
Fedora,3
LinuxMint,4
RedHat,5
The regex (,[^,]*,) searches for a comma and a sequence of non-comma characters followed by a comma, which results in matching the 2nd column, and replaces the matched pattern with just a comma, ultimately deleting the 2nd column.
Note: Deleting fields in the middle gets tougher in sed since every field has to be matched literally.
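One way to reach a field further in, sketched below, is the numeric flag of the s command, which acts on the Nth match only. Each match of ',[^,]*' is a comma plus the field after it, so flag N removes field N+1. The 4-field record is a hypothetical example.

```shell
# Hypothetical 4-field record
line='a,b,c,d'

# 1st match of ",field" is ",b": removes the 2nd field
no_second=$(printf '%s\n' "$line" | sed 's/,[^,]*//1')

# 2nd match of ",field" is ",c": removes the 3rd field
no_third=$(printf '%s\n' "$line" | sed 's/,[^,]*//2')

echo "$no_second"
echo "$no_third"
```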
5. To print only the 2nd field:
$ sed 's/[^,]*,\([^,]*\).*/\1/' file
25
31
21
45
12
The regex matches the first field, second field and the rest, however groups the 2nd field alone. The whole line is now replaced with the 2nd field(\1), hence only the 2nd field gets displayed.
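Grouping can also be used to rearrange fields. A sketch that swaps the first two fields using two groups, on a record taken from the sample file:

```shell
# One record from the sample file
line='Solaris,25,11'

# \1 is the 1st field, \2 the 2nd; the replacement writes them in reverse order
swapped=$(printf '%s\n' "$line" | sed 's/\([^,]*\),\([^,]*\)/\2,\1/')

echo "$swapped"
```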
6. Print only lines in which the last column is a single digit number:
$ sed -n '/.*,[0-9]$/p' file
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
The regex (,[0-9]$) checks for a single digit in the last field and the p command prints the line which matches this condition.
7. To number all lines in the file:
$ sed = file | sed 'N;s/\n/ /'
1 Solaris,25,11
2 Ubuntu,31,2
3 Fedora,21,3
4 LinuxMint,45,4
5 RedHat,12,5
This simulates the cat -n command. awk does it easily using the special variable NR. The '=' command of sed prints the line number of every line, followed by the line itself on the next line. The sed output is piped to another sed command to join every 2 lines.
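A sketch comparing the two-stage sed pipeline with the awk form using NR, on the first three records of the sample file:

```shell
# First three records of the sample file
data=$(printf '%s\n' Solaris,25,11 Ubuntu,31,2 Fedora,21,3)

# sed: '=' prints each line number on its own line, the second sed joins pairs
with_sed=$(printf '%s\n' "$data" | sed = | sed 'N;s/\n/ /')

# awk: NR holds the current line number
with_awk=$(printf '%s\n' "$data" | awk '{print NR, $0}')

echo "$with_sed"
echo "$with_awk"
```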
8. Replace the last field by 99 if the 1st field is 'Ubuntu':
$ sed 's/\(Ubuntu\)\(,.*,\).*/\1\299/' file
Solaris,25,11
Ubuntu,31,99
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
This regex matches 'Ubuntu' as the 1st group, everything from the first comma till the last comma as the 2nd group, and then the last column. In the replacement part, the 1st and 2nd groups along with the new number 99 are substituted.
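The same condition can also be written as an address in front of the s command, which is closer to an 'if'. A sketch; anchoring with ^ and the trailing comma is an added assumption so that only a 1st field that is exactly 'Ubuntu' qualifies:

```shell
# Three records from the sample file
data=$(printf '%s\n' Solaris,25,11 Ubuntu,31,2 Fedora,21,3)

# Only on lines whose 1st field is Ubuntu, replace the last field with 99
updated=$(printf '%s\n' "$data" | sed '/^Ubuntu,/s/[^,]*$/99/')

echo "$updated"
```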
9. Delete the 2nd field if the 1st field is 'RedHat':
$ sed 's/\(RedHat,\)[^,]*\(.*\)/\1\2/' file
Solaris,25,11
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,,5
The 1st field 'RedHat' and the remaining fields after the 2nd field are grouped, and the replacement is done with only these two groups, resulting in the content of the 2nd field getting deleted. Note that both delimiters are retained, so the field is left empty.
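If the field is to be removed along with its delimiter, so that no empty field remains, one possible variation is sketched below:

```shell
# Two records from the sample file
data=$(printf '%s\n' LinuxMint,45,4 RedHat,12,5)

# On RedHat lines, remove the first ",field" pair, i.e. the 2nd field
trimmed=$(printf '%s\n' "$data" | sed '/^RedHat,/s/,[^,]*//')

echo "$trimmed"
```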
10. To insert a new column at the end (last column):
$ sed 's/.*/&,A/' file
Solaris,25,11,A
Ubuntu,31,2,A
Fedora,21,3,A
LinuxMint,45,4,A
RedHat,12,5,A
The regex (.*) matches the entire line and replaces it with the line itself (&) followed by the new field.
11. To insert a new column in the beginning (1st column):
$ sed 's/.*/A,&/' file
A,Solaris,25,11
A,Ubuntu,31,2
A,Fedora,21,3
A,LinuxMint,45,4
A,RedHat,12,5
Same as the last example, except that the line matched is preceded by the new column.
Note: sed is generally not preferred on files which have fields separated by a delimiter because it is very difficult to access fields in sed, unlike awk or Perl where splitting fields is a breeze.
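For comparison, a sketch of how two of the above tasks might look in awk, where -F sets the input delimiter and OFS the output delimiter:

```shell
# Three records from the sample file
data=$(printf '%s\n' Solaris,25,11 Ubuntu,31,2 RedHat,12,5)

# Print only the 2nd field
second=$(printf '%s\n' "$data" | awk -F, '{print $2}')

# Replace the last field by 99 if the 1st field is Ubuntu; the trailing 1 prints every line
updated=$(printf '%s\n' "$data" | awk -F, -v OFS=, '$1 == "Ubuntu" {$NF = 99} 1')

echo "$second"
echo "$updated"
```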