
Chapter 14 - Advanced Filter - awk

awk is a very powerful command in Unix/Linux systems, capable of formatting files and generating suitable reports from them. It is named after its authors, Aho, Weinberger and Kernighan (awk).

awk doesn't belong to the do-one-thing-well family of Unix/Linux commands. It can do several things, and most of them quite well. It also uses regular expressions, variables and several built-in functions, which make it an even more powerful command. Let's start learning awk with some simple examples and then move on to complex ones.

Simple awk Filters -

awk has a syntax quite similar to that of sed. Here is the syntax for awk -

awk options 'selection_criteria {action}' file(s)

options - The different options that awk supports.
selection_criteria - A form of addressing that selects the lines from the given file.
action - The operation that we want to perform on the selected lines.
file(s) - The file(s) on which the selection and action are performed.

The selection criteria in awk have a wider scope. They can be patterns like /mangesh/ or line addresses that use awk's built-in variable, NR. Further, they can also be conditional expressions using the && and || operators, as used in the shell. You can select lines on practically any condition.

The following command selects mangesh from emp.csv

awk '/mangesh/ { print } ' emp.csv

The command prints the lines containing the word mangesh. The selection_criteria section selects the lines that are processed by the action section (print). If selection_criteria is missing, the action applies to all lines. If action is missing, the entire line is printed. Either of the two (but not both) can be omitted, and whatever remains must be enclosed within a pair of single (not double) quotes.

   awk '/mangesh/ { print }' emp.csv 
   awk '/mangesh/' emp.csv --------------------------------- By default it prints the selected lines
   awk '/mangesh/ { print $0 }' emp.csv -------------------- $0 is the complete line

You cannot run a command like this - awk emp.csv . It will give you an error like this -

error - awk : emp.csv
awk: ^Syntax Error
The command requires either the selection criteria or the action, along with the filename.

Splitting a Line Into Fields -

awk uses the special parameter $0 to indicate the entire line. It also identifies fields by $1, $2, $3 and so on. Since these parameters also have a special meaning to the shell, single-quoting an awk program protects them from interpretation by the shell.

By default, awk uses a contiguous sequence of spaces and tabs as a single delimiter. As our file is , (comma) separated, we must use the -F option to specify the delimiter in our command. Here is an example that prints the name, designation, department and salary of all the people.

awk -F "," '{ print 42,$3,$4,$6 } ' emp.csv

Notice that a , (comma) has been used to delimit the field specifications (in the print section). This ensures that each field is separated from the other by a space. If we don't put the comma, the fields will be glued together.

We can print selected lines using line addresses, e.g. if you want to print lines 1 to 5 from the file emp.csv, use awk's built-in variable NR to specify the line numbers.

awk -F "," 'NR == 1 , NR == 5 {print NR , $2,$3,$6 } ' emp.csv

printf : Formatting Output -

We can format the above output using the printf statement. awk's printf is C-like: the %s format is used for string data and %d for numeric data. We can now format the above output as -

awk -F "," ' /mangesh/ { printf "%4d %-15s %-20s %d\n " NR , $2,$3,$6 } ' emp.csv

The name and designation are now printed in fields 15 and 20 characters wide, respectively; the - symbol left-justifies the output. The line number is 4 characters wide, right-justified. Note that printf requires \n to print a newline after each line.

Redirecting Standard Output -

Every print and printf statement can be separately redirected with the > and | symbols. However, make sure the filename or command that follows these symbols is enclosed within double quotes. For example, the following statement sorts the output of the printf statement -

printf "%s %-10s %-12s %-8s\n" , $1 ,$3 ,$4 ,$6 | "sort"

If you use redirection instead, the filename should be enclosed in quotes in a similar manner -

printf "%s %-10s %-12s %-8s\n" , $1 ,$3 ,$4 ,$6 > "mlist"

awk thus provides the flexibility of separately manipulating the different output streams . But don't forget the quotes !

Variables and Expressions - 

We can use variables and expressions with awk. Expressions comprise strings, numbers and variables combined using operators. Unlike most programming languages, awk doesn't have types like char, int, long, double, etc. Every expression can be interpreted either as a string or as a number, and awk makes the necessary conversion according to context.

awk also allows the use of user-defined variables without declaring them. Variables are case-sensitive; i is different from I. Unlike shell variables, awk variables don't use the $ either in assignment or in evaluation.

   i = "3" 
   print i

A user-defined variable needs no initialization. awk has a mechanism for identifying the type and initial value of a variable from its context.

Strings in awk are always double-quoted and can contain any character. awk provides no operator for concatenating strings; strings are concatenated by simply placing them side by side -

   i = "mangadaku" ; j = "com" 
   print i j ---------------------------------------------- prints mangadakucom
   print i "." j ------------------------------------------Prints mangadaku.com

Concatenation is not affected by the type of the variables; a numeric and a string value can be concatenated easily. The following examples demonstrate how awk makes automatic conversions when concatenating and adding variables -

   i = "4" ; j = 3 ; m = "A" 
   print i j ---------------------------------------------- j converted to string ; prints 43
   print i + j ------------------------------------------i is converted to number and prints 7
   print j + m ---------------------------------------- m is converted to 0 (it has no digits) ; prints 3

Even though we assigned "4" (a string) to i, we could use it for numeric computation. Also observe that when a number is added to a string that contains no digits, awk converts the string to zero.

Expressions also have true and false values associated with them. Any non-empty string is true, and so is any non-zero number. The statement

if (x)

is true if x is a non-null string or a non-zero number.
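
A minimal sketch (run entirely in the BEGIN section, so it needs no input file) illustrates these truth values; only the first message is printed -

   awk 'BEGIN {
       x = "hello" ; y = 0
       if (x) print "x is true"          # a non-empty string is true
       if (y) print "y is true"          # 0 is false , so this never prints
   }'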

The Comparison Operators -  

Let's consider a scenario where you want to print the name, designation and salary of the employees whose designation is either chairman or director. Here is how we can use the selection criteria with awk - as designation is the third field, we use $3 in the selection criteria.

awk -F "," '$3 == "director" || $3 =="chairman" { printf "%-20s %-12s %d\n" , $2,$3,$6 } ' emp.csv

~ and !~ : The Regular Expression Operators -

Previously we had used awk with a regular expression in this manner -

   awk -F "," '/sa[kx]s*ena/' emp.csv

This matches patterns like saxena and saksena anywhere in the line and not in a specified field. For matching a regular expression against a field, awk offers the ~ and !~ operators to match and negate a match, respectively. With these operators, matching becomes more specific, as seen in the following examples -

   awk -F "," '$2 ~ /[cC]ho[wu]dh?ury/ || $2 ~ /sa[xk]s?ena/' emp.csv    Matches second field 
   awk -F "," '$2 ~ /[cC]ho[wu]dh?ury|sa[xk]s?ena/                       Same as Above 
   awk -F "," '$3 !~ /director|chairman/                                 Neither director nor chairman 

Remember that the ~ and !~ operators work only with a field specifier ($1, $2, etc.). The delimiting of alternative patterns with the | is an ERE feature, and awk uses extended regular expressions. However, awk doesn't accept the IRE and TRE used by grep and sed.

To match a string embedded in a field, you must use ~ instead of ==. Similarly, to negate a match, use !~ instead of !=.

To locate only the g.m.s, you just can't use this -

   awk -F "," '$3 ~ /g.m/ { printf " .....

Because g.m. is embedded in d.g.m., locating just the g.m.s requires the use of ^ and $. These characters have slightly different meanings when used by awk. Rather than matching lines, they are used to indicate the beginning and end of a field, respectively (unless you use them with $0). So, if you use the condition -

   awk -F "," '$3 ~ /^g.m./ { printf " .... 

You will locate only the g.m.s and discard the d.g.m.s

To match a string at the beginning of a field, precede the search pattern with a ^. Similarly, use a $ to match a pattern at the end of a field.
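
Spelt out in full, a sketch of the command (assuming the same emp.csv layout, with the name in $2, designation in $3 and salary in $6) would be -

   awk -F "," '$3 ~ /^g.m./ { printf "%-20s %-12s %d\n", $2, $3, $6 }' emp.csv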

Number Comparison - 

awk can also be used to compare numbers, both integer and floating point, and make relational tests on them. The operators are summarized in the table below. Let's print the names of employees having a salary greater than 5000 -

awk -F "," '$6 > 5000 { printf "%-20s %-12s %d\n" , $2,$3,$6 } ' emp.csv
Operator   Meaning
<          Less than
<=         Less than or equal to
==         Equal to
!=         Not equal to
>=         Greater than or equal to
>          Greater than
~          Matches a regular expression
!~         Doesn't match a regular expression

Number Processing - 

awk can perform computations on numbers using the arithmetic operators +, -, *, / and % (modulus). It also overcomes one of the major limitations of the shell - the inability to handle decimal numbers. Here is an example which computes the hra and da on your salary slip -

awk -F "," '$3 == "director" { printf "%-20s %-12s %d %d %d\n" , $2,$3,$6,%6*0.4,$6*0.15 } ' emp.csv

Variables -

While awk has certain built-in variables, like NR and $0, it also permits the user to use variables of his/her choice. We can now print a serial number, using the variable kount, and apply it to those directors drawing a salary exceeding 5000 -

awk -F "|" '$3 == "director" && $6 > 5000 { kount = kount + 1 printf "%3d %-20s %-12s %d\n" , kount,$2,$3,$6 } ' emp.csv

The initial value of kount was 0 (by default). That's why the first line is correctly assigned the number 1. awk also accepts the following shorthand forms -

   kount++ -------------------------------------------------- same as kount = kount + 1 
   kount += 2 ----------------------------------------------- same as kount = kount + 2
   printf "%3d\n", ++kount ---------------------------------- Increment kount before printing

-f Option : Storing awk programs in a file - 

You can store your awk programs in a file with a .awk extension for easier identification. Store the following program in a file, say emp.awk -

$3 == "director" && $6 > 5000 { kount = kount + 1 printf "%3d %-20s %-12s %d\n" , kount,$2,$3,$6 }

Observe that this time we haven't used quotes to enclose the awk program. You can now use awk with the -f filename option to obtain the same output -

awk -F "|" -f emp.awk emp.csv
awk -F "|" -f empawk1.awk emp.csv

Like the shell, awk also uses the # for providing comments. To execute such a program, use the -f option.

The BEGIN and END Sections - 

If you have to print something before processing the first line, for example a heading, then the BEGIN section can be used quite gainfully. Similarly, the END section is useful for printing totals after processing is over.

The BEGIN and END sections are optional and take the form

     BEGIN { action } 
     END { action }

These two sections, when present, are delimited by the body of the awk program. We can use them to print a suitable heading at the beginning and the average salary at the end. Store this awk program in a separate file, empawk1.awk -

   BEGIN { 
   printf "\t\tEmployee Abstract\n\n"
   } 
   $6 > 5000 {                          # Increment the serial number and add up the pay
   kount++ ; tot += $6                  # multiple assignments in one line 
   printf "%3d %-20s %-12s %d\n", kount, $2, $3, $6
   } 
   END { printf "\n\tThe average basic pay is %6d\n", tot/kount } 

Like all filters, awk reads standard input when the filename is omitted. We can make awk behave like a simple scripting language by doing all the work in the BEGIN section. This is how we can perform floating point arithmetic -

awk 'BEGIN { printf "%f\n" ,22/7 }'

Depending on your version of awk, the prompt may or may not be returned, which means that awk may still be reading standard input. Use [Ctrl-d] to return the prompt.

Always place the opening brace on the same line in which the section (BEGIN or END) begins. If you don't do that, awk will generate some strange messages!

Built In Variables -  

awk has several built-in variables. They are all assigned automatically, though it is also possible for a user to reassign some of them. We have already used NR, which signifies the record number of the current line. We will now take a look at some of the other variables.

The FS Variable - As stated earlier, awk uses a contiguous string of spaces as the default field delimiter. FS redefines this field separator, which in the sample file emp.lst happens to be |. When used at all, it must occur in the BEGIN section so that the body of the program knows its value before it starts processing.

BEGIN { FS= "|" }'
This is an alternative to the -F option which does the same thing
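
Put together, a sketch of the earlier director report that uses FS instead of -F (assuming the comma-separated emp.csv) looks like this -

   awk 'BEGIN { FS = "," } $3 == "director" { print $2, $3, $6 }' emp.csv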

The OFS Variable - When we used the print statement with comma-separated arguments, each argument was separated from the other by a space. This is awk's default output field separator, and it can be reassigned using the variable OFS in the BEGIN section.

BEGIN { OFS= "~" }'

When you reassign this variable to a tilde (~), awk will use this character for delimiting the print arguments. This is a useful variable for creating lines with delimited fields.
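
For example, a sketch that rewrites the name, designation and salary fields of emp.csv with ~ as the output delimiter would be -

   awk -F "," 'BEGIN { OFS = "~" } { print $2, $3, $6 }' emp.csv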

The NF Variable - NF comes in quite handy for cleaning up a database of lines that don't contain the right number of fields. By using it on a file, say emp.lst, you can locate those lines not having six fields, which have crept in due to faulty data entry -

awk 'BEGIN { FS= "|" } NF != 6 { print "Record No " , NR , "has" , NF , " fields" } ' emp.lst

The FILENAME Variable - FILENAME stores the name of the current file being processed. Like grep and sed, awk can use multiple filenames in the command line. By default, awk doesn't print the filename, but you can instruct it to do so -

'$6 < 5000 { print FILENAME, $0 }'
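
As a complete sketch (the second filename, emp_old.csv, is purely hypothetical and is shown only to demonstrate multiple files on the command line) -

   awk -F "," '$6 < 5000 { print FILENAME, $0 }' emp.csv emp_old.csv ----- emp_old.csv is a hypothetical second file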
Variable   Significance
NR         Cumulative number of lines read
FS         Input field separator
OFS        Output field separator
NF         Number of fields in the current line
FILENAME   Current input file
ARGC       Number of arguments in the command line
ARGV       List of arguments

Functions -

awk has several built-in functions, performing both arithmetic and string operations. Some of these functions take a variable number of arguments, and one (length) uses no argument in its variant form. There are two arithmetic functions which a programmer will expect awk to offer: int calculates the integral portion of a number (without rounding off), while sqrt calculates the square root of a number. awk also has some of the common string handling functions you can hope to find in any language.

length determines the length of its argument, and if no argument is present, the entire line is assumed to be the argument. We can use length (without any argument) to locate lines whose length exceeds 1024 characters -

awk -F "|" 'length > 1024 ' emp.csv

We can use length with a field as well . The following program selects those people who have short names .

awk -F "|" 'length($2) < 10 ' emp.csv

index - index(s1,s2) determines the position of string s2 within a larger string s1. This function is useful for validating single-character fields. If a field can take only the values a, b, c, d or e, then we can use this function to find out whether the single-character field can be located within the string abcde.

y = index ("abcde" ,"c") ----------------------This returns value 3

substr - The substr(stg,m,n) function extracts a substring from a string stg. m represents the starting point of extraction, and n indicates the number of characters to be extracted. Let's select the people born between 1946 and 1951 -

awk -F "|" 'substr($5,7,2) > 45 && substr ($5,7,2) < 52' emp.csv

Note that awk does indeed possess a mechanism for identifying the type of an expression from its context. It identified the date field as a string for use with substr and then converted it to a number for making the numeric comparison.

split - split(stg,arr,ch) breaks up a string stg on the delimiter ch and stores the fields in an array arr[]. Here is how we can convert the date field to the format YYYYMMDD -

awk -F "," '{ split($5,ar,"/") ; print "19" ar[3] ar[2] ar[1] }' emp.csv

system - You may want to print the system date at the beginning of the report. For running a UNIX command within awk, you will have to use the system function. Here are two examples -

     BEGIN {  
     system("tput clear") ------------------ Clears the screen
     system("date") } ----------------------- Executes the Unix date command

The table below summarizes the built-in awk functions -

Function            Description
int(x)              Returns the integer value of x
sqrt(x)             Returns the square root of x
length              Returns the length of the complete line
substr(stg,m,n)     Returns a portion of string stg of length n, starting from position m
index(s1,s2)        Returns the position of string s2 within string s1
split(stg,arr,ch)   Splits string stg into array arr using delimiter ch; returns the number of fields
system("cmd")       Runs the UNIX command cmd and returns its exit status