
Chapter - 12  Regular Expressions grep and sed

We often need to search a file for a pattern, either to see the lines that contain (or do not contain) it or to replace it with something else. In this chapter we will discuss two important filters that are specially suited for these tasks - grep and sed. grep searches for a pattern in a file; sed goes further and can even manipulate the individual characters in a line. In fact, sed can do several things, some of them quite well.

grep : Searching a pattern - 

UNIX / Linux has a special family of commands that handle search requirements, and grep is the principal command of this family. grep scans its input and displays the lines containing the pattern, the line numbers, or the names of the files where the pattern occurs.

Let's run grep to search for the string director in the file emp.csv.

grep "director" emp.csv

It is good practice to put the search pattern in quotes, even when it is not strictly necessary. Quoting becomes essential if the search string consists of more than one word or uses any of the shell's special characters such as * and $.

We can redirect the output of the above command to a separate file. Here is the command that stores the result in director.txt:

grep "director" emp.csv > director.txt

grep silently returns the prompt when the pattern can't be located -

grep "vp" emp.csv

grep can be used with multiple filenames, in which case it prefixes each output line with the name of the file it came from. In this example grep searches two files.

grep "director" emp.csv director.txt

Quoting is essential when the pattern contains multiple words -

grep "Mangesh Pande" emp.csv

grep : options -

Like any other UNIX/Linux command, grep has options that help us match patterns. Let's see some of the most commonly used ones -

Ignoring case ( -i) :

When we are searching for a name but are not sure of its case (upper case or lower case), we use the -i (ignore) option. The option ignores case for pattern matching -

grep -i "mangesh" emp.csv

Deleting Lines ( -v) :

grep can play an inverse role too; the -v (inverse) option selects all the lines except those containing the pattern. Thus we can select every line except the ones containing the string director. The input file itself is not modified. If we need to, we can redirect the output to a separate file called other.txt.

grep -v "director" emp.csv > other.txt

Displaying Line Numbers ( -n) :

The -n (number) option displays the line numbers of the lines containing the pattern, along with the lines themselves -

grep -n "director" emp.csv

Counting Lines Containing Pattern ( -c) :

Suppose you want to know how many directors there are in a file. The -c (count) option counts the number of lines containing the pattern (which is not the same as the number of occurrences). The following command does this job -

grep -c "director" emp.csv

If you run this command with multiple files , the filename is prefixed to the line count -

grep -c "director" emp*.csv

Displaying Filenames ( -l) :

The -l (list) option displays only the names of files containing the pattern :

grep -l "director" *.csv
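The options covered so far can be tried together on a small file. The sketch below is only illustrative: the contents of emp.csv are made up for the example (the chapter's actual file is not shown), and the file is created in a temporary directory.

```shell
# Hypothetical sample data; the real emp.csv used in this chapter may differ.
workdir=$(mktemp -d)
cat > "$workdir/emp.csv" <<'EOF'
2233,Mangesh Pande,director,marketing,12/03/87,4500
3212,mangesh kulkarni,manager,sales,24/07/85,3800
3564,Nikhil Muthal,executive,sales,05/11/90,3200
1006,j.b. saxena,director,production,30/01/79,4800
EOF

any_case=$(grep -i "mangesh" "$workdir/emp.csv")     # both spellings of the name
others=$(grep -v "director" "$workdir/emp.csv")      # lines without "director"
numbered=$(grep -n "director" "$workdir/emp.csv")    # matches prefixed with "N:"
count=$(grep -c "director" "$workdir/emp.csv")       # number of matching lines
files=$(cd "$workdir" && grep -l "director" *.csv)   # names of files with a match

printf '%s\n' "$numbered"
echo "lines with director: $count (in $files)"
```

Capturing each result in a variable is just a convenience for inspecting the output; running the same commands directly at the prompt shows the same lines.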

Matching Multiple Patterns ( -e) :

With the -e option we can match both spellings of mangesh. Each pattern must be preceded by its own -e -

grep -e "Mangesh" -e "mangesh" emp.csv

You may be wondering whether there is a more convenient way when there are several more such patterns or strings to match. The answer is yes: grep supports sophisticated pattern matching techniques that can display the same lines with a single expression. We will see this most important and interesting feature, called Regular Expressions, later in this chapter.

Taking Patterns from a file ( -f) :

We can place all the search patterns in a separate file, one pattern per line. grep uses the -f option to take patterns from a file: store the above two search patterns in a file search.list and provide this file to grep with the -f option.

grep -f search.list emp.csv
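As a quick check that -e and -f are interchangeable here, the following sketch runs both forms on the same input. The three-line emp.csv is invented for the example; only the search.list name comes from the text above.

```shell
# Hypothetical sample data written to a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' '2233,Mangesh Pande,director' \
              '3212,mangesh kulkarni,manager' \
              '3564,Nikhil Muthal,executive' > "$workdir/emp.csv"

# One pattern per line, as the -f option expects.
printf '%s\n' 'Mangesh' 'mangesh' > "$workdir/search.list"

with_e=$(grep -e "Mangesh" -e "mangesh" "$workdir/emp.csv")
with_f=$(grep -f "$workdir/search.list" "$workdir/emp.csv")

printf '%s\n' "$with_f"
```

Both variables should hold the same two lines, since the pattern file simply lists what was given with -e.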

Let's now summarise the options that we have looked at for the grep command. The table below shows the options used by grep.

Options    Operation
-i         Ignores case for matching
-v         Doesn't display lines matching the expression
-n         Displays line numbers along with lines
-c         Displays count of lines containing the pattern
-l         Displays the list of filenames only
-e         Specifies expression with this option. Can be used multiple times. Also used for matching an expression beginning with a hyphen (-).
-x         Matches pattern with entire line (doesn't match embedded pattern)
-f         Takes patterns from file, one per line
-o         Displays only the matched part of each line (GNU grep); to save output in a file, use shell redirection

Introduction to Regular Expressions -

In our emp.csv file we have seen the names Mangesh and mangesh, which are two different strings. If we want to locate all such patterns in a file, it is tedious to give each pattern to grep with a separate -e option. Just as the shell's wild cards match similar filenames with a single expression, grep uses an expression of a different type to match a group of similar patterns. This feature belongs to the command and has nothing to do with the shell. It relies on a set of special characters (metacharacters); if an expression uses any of these characters, it is termed a Regular Expression. Some of the characters used by regular expressions are also meaningful to the shell.

POSIX identifies regular expressions as belonging to two categories - basic and extended. grep supports basic regular expressions (BRE) by default and extended regular expressions (ERE) with the -E option. sed uses the BRE set by default. We will start with a minimal treatment of the BRE set and then take up the ERE in the next section. We will expand the coverage of the BRE when we discuss sed.

Regular expressions are interpreted by the command, not by the shell. Quoting ensures that the shell isn't able to interfere and interpret the metacharacters in its own way.

Regular Expressions Character Subset -

The table below summarises the character subset of regular expressions. Let's see the details of each in the following examples.

Expression   Operation
*            Zero or more occurrences of the previous character
m*           Nothing or m, mm, mmm etc.
.            A single character
.*           Nothing or any number of characters
[pqr]        A single character p, q or r
[1-5]        A digit between 1 and 5
[^pqr]       A single character which is not a p, q or r
[^a-zA-Z]    A non-alphabetic character
^man         Pattern man at beginning of line
man$         Pattern man at end of line
^$           Lines containing nothing

The Character Class - 

A regular expression with a character class lets you specify a group of characters enclosed within a pair of square brackets [ ]; the match is then performed for a single character in the group. This is similar to the shell's wild cards that we saw in the previous chapter.

[ma]

matches either m or a. The metacharacters [ ] can now be used to match both Mangesh and mangesh. The following regular expression does this job -

grep "[mM]angesh" emp.csv

A single pattern has matched two similar strings; that's what regular expressions are all about. We can also use ranges, both for letters and numerals. The pattern [a-zA-Z0-9] matches a single alphanumeric character. When you use a range, make sure that the character on the left of the hyphen has a lower ASCII value than the one on the right. Also remember that uppercase precedes lowercase in the ASCII sequence.

Negating a Class (^) -

Regular expressions use the ^ (caret) symbol to negate the character class, while the shell uses ! (bang). When the character class begins with this character, all characters other than the ones grouped in the class are matched. So, [^a-zA-Z] matches a single nonalphabetic character.

The * (asterisk) -

The * (asterisk) refers to the immediately preceding character. Its interpretation is the trickiest of the lot and has no resemblance to the shell's wild card of the same name. The following regular expression indicates that the previous character can occur many times or not at all:

m*

It matches a single m or any number of ms. Because the previous character may not occur at all, it also matches a null string. Understand the significance of the words "zero or more occurrences of the previous character", which describe the meaning of *. Don't make the mistake of using m* to match a string beginning with m; use mm* instead.

The Dot .

A . matches any single character. The shell uses the ? character for the same purpose. The pattern

3...

matches a four-character pattern beginning with a 3. The shell's equivalent pattern is 3???.

The regular expression .*

The dot with the * (.*) constitutes a very useful regular expression. It signifies any number of characters, or none. Suppose we want to look up the name j. saxena but are not sure whether it actually exists in the file as j.b. saxena or as joginder saxena. No problem, just embed the .* in the search string.

grep "j.*saxena" emp.csv
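The character class, the * and the .* can be compared side by side on a handful of names. This is a minimal sketch; the list of names is invented for illustration and is fed to grep on standard input instead of coming from emp.csv.

```shell
# Hypothetical list of names, one per line.
names=$(printf '%s\n' 'Mangesh' 'mangesh' 'nikhil' \
                      'j.b. saxena' 'joginder saxena' 'j. gupta')

either_case=$(printf '%s\n' "$names" | grep "[mM]angesh")   # class matches M or m
star_count=$(printf '%s\n' "$names" | grep -c "m*")         # m* can match nothing, so no line is excluded
mm_count=$(printf '%s\n' "$names" | grep -c "mm*")          # mm* needs at least one m
saxenas=$(printf '%s\n' "$names" | grep "j.*saxena")        # j, any characters, then saxena

printf '%s\n' "$either_case" "$saxenas"
echo "m* matched $star_count lines, mm* matched $mm_count"
```

The two counts make the point about the asterisk concrete: m* selects all six lines, while mm* selects only the line that really contains a lowercase m.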

Specifying Pattern Locations ( ^ and $ )

Most of the regular expression characters are used for matching patterns, but there are two that match the position of a pattern, at the beginning or end of a line:

  • ^ ( caret ) -- For matching at the beginning of a line
  • $ -- For matching at the end of a line

Consider an example. You want to extract the lines where the emp-id begins with a 3. What happens if we use the expression below?

3...

This won't work, as the character 3 followed by three characters can occur anywhere in the line. We must tell grep that the pattern occurs at the beginning of the line, and the ^ does it easily.

grep "^3" emp.csv

Similarly, to select those lines where the salary lies between 4000 and 4999, you have to use the $ at the end of the pattern.

grep "4...$" emp.csv

We can reverse the search and select only those lines where the emp-id doesn't begin with a 3. The expression will be

grep "^[^3]" emp.csv

The ls command has no option that lists only directories. However, we can use a pipeline to "grep" those lines from the listing that begin with a d

ls -l | grep "^d"

The caret (^) has a triple role to play in regular expressions. When placed at the beginning of a character class (e.g. [^a-z]), it negates every character of the class. When placed outside it, and at the beginning of the expression (e.g. ^3...), the pattern is matched at the beginning of the line. At any other location (e.g. a^b), it matches itself literally.
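The anchored patterns above can be exercised on a small made-up file. In this sketch the emp.csv contents and the reports directory are assumptions chosen for illustration: the emp-id is the first field and a four-digit salary is the last.

```shell
# Hypothetical sample data: emp-id first, salary last.
workdir=$(mktemp -d)
cat > "$workdir/emp.csv" <<'EOF'
2233,Mangesh Pande,director,marketing,12/03/87,4500
3212,mangesh kulkarni,manager,sales,24/07/85,3800
3564,Nikhil Muthal,executive,sales,05/11/90,3200
1006,j.b. saxena,director,production,30/01/79,4800
EOF
mkdir "$workdir/reports"

id_starts_3=$(grep "^3" "$workdir/emp.csv")       # emp-id begins with 3
salary_4xxx=$(grep "4...$" "$workdir/emp.csv")    # line ends with 4 plus three characters
id_not_3=$(grep "^[^3]" "$workdir/emp.csv")       # first character is anything but 3
dirs_only=$(ls -l "$workdir" | grep "^d")         # only the directory entries of the listing

printf '%s\n' "$id_starts_3" "$salary_4xxx" "$dirs_only"
```

Note that grep reads the pipeline's output in the last command because no filename is given; naming a file there would make grep ignore its standard input.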

Extended Regular Expressions ( ERE) and egrep -

Extended Regular Expressions (ERE) make it possible to match dissimilar patterns with a single expression. This set uses some additional characters, and grep supports it with the -E option. Linux grep has this option. If your version of grep doesn't support it, use egrep (without the -E option) instead.

The ? and + -

The ERE set includes two special characters, + and ?. They are often used in place of the * to restrict the matching scope.

  • + ------- Matches one or more occurrences of the previous character
  • ? ------- Matches zero or one occurrence of the previous character

In both cases the emphasis is on the previous character. This means that m+ matches m, mm, mmm etc., but unlike m*, it doesn't match nothing. The expression m? matches either a single instance of m or nothing. These characters restrict the scope of the match as compared to *.
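The difference between *, + and ? shows up clearly when the same three words are run through each of them. The words below are made up for the purpose; they differ only in the number of ms.

```shell
# Hypothetical test words with zero, one and two ms.
words=$(printf '%s\n' 'haer' 'hamer' 'hammer')

star=$(printf '%s\n' "$words" | grep 'ham*er')       # zero or more m (BRE)
plus=$(printf '%s\n' "$words" | grep -E 'ham+er')    # one or more m (ERE)
opt=$(printf '%s\n' "$words" | grep -E 'ham?er')     # zero or one m (ERE)

echo "star: $(echo $star)"
echo "plus: $(echo $plus)"
echo "opt:  $(echo $opt)"
```

Here * accepts all three words, + drops the word without an m, and ? drops the word with two ms.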


Matching Multiple Patterns ( | , ( and ) )

The | is the delimiter of multiple patterns. Using it, we can locate both sengupta and dasgupta in the file without using the -e option twice. Don't put spaces around the |, because they would be treated as part of the patterns.

grep -E 'sengupta|dasgupta' emp.csv

The ERE thus handles the problem easily, but offers an even better alternative. The characters ( and ) let you group patterns, and when you use the | inside the parentheses, you can frame an even more compact pattern.

grep -E '(sen|das)gupta' emp.csv
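The three ways of locating the two surnames can be checked against each other. The sketch below uses a made-up list of surnames on standard input in place of emp.csv.

```shell
# Hypothetical surnames, one per line.
surnames=$(printf '%s\n' 'sengupta' 'dasgupta' 'gupta' 'sen')

two_e=$(printf '%s\n' "$surnames" | grep -e 'sengupta' -e 'dasgupta')
alternation=$(printf '%s\n' "$surnames" | grep -E 'sengupta|dasgupta')
grouped=$(printf '%s\n' "$surnames" | grep -E '(sen|das)gupta')

printf '%s\n' "$grouped"
```

All three variables should contain just sengupta and dasgupta; the grouped form is only a more compact way of writing the same alternation.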

The table below summarises the extended regular expression set used by grep, egrep and awk.

Expression       Operation
ch+              Matches one or more occurrences of character ch
ch?              Matches zero or one occurrence of character ch
exp1|exp2        Matches exp1 or exp2
GIF|JPEG         Matches GIF or JPEG
(x1|x2)x3        Matches x1x3 or x2x3
(lock|ver)wood   Matches lockwood or verwood

sed : The stream editor -

sed is one of the finest commands in the Linux/UNIX tool repository, and it combines the work of several filters. The command is derived from ed, the original UNIX editor. sed performs noninteractive operations on a data stream - hence its name, the stream editor.

sed uses instructions to act on text. An instruction combines an address for selecting lines with an action to be taken on them. sed uses the following syntax -

sed options 'address action' file(s)

The address and action are enclosed within single quotes. Addressing in sed is done in two ways:

  • By one or two line numbers (like 5,7)
  • By specifying a / - enclosed pattern which occurs in a line (like /text:/)

In the first form we specify either one line number to select a single line, or a set of two (5,7) to select a group of contiguous lines. Likewise, the second form uses one or two patterns. The action component is drawn from sed's internal commands, which can be used for printing, quitting, inserting, deleting and so on. We will call these actions commands.

Line Addressing -

To understand line addressing, let's see a simple example. The following command prints the first 2 lines of the file emp.csv.

sed '2q' emp.csv

Let's understand the command. 2 is the address: it tells sed to act on line 2. q (quit) is the action that sed performs there, so sed prints lines 1 and 2 (printing every line is its default behaviour) and then quits. The entire instruction is given in single quotes, with the filename as argument. In this way we can simulate head -n 2 emp.csv.

Generally, instead of q we use the p (print) command to display lines. However, this command behaves in a seemingly strange manner: it outputs the selected lines in addition to all the lines that sed prints by default. So the selected lines appear twice. To suppress this behaviour, use the -n option whenever you use the p command.

sed -n '1,5p' emp.csv

prints the first 5 lines of the file emp.csv.

Similarly, to print the last line of the file, use the $

sed -n '$p' emp.csv

Selecting Lines from Anywhere -

In the previous examples we printed the first few lines and the last line of the file. If we want to print lines from anywhere in the file, say from line 10 to line 15, we use the following command -

sed -n '10,15p' emp.csv
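The line-addressing commands seen so far can be tried on a file whose contents make the line numbers obvious. This sketch generates a hypothetical 20-line file in which each line simply states its own number.

```shell
# Hypothetical input: 20 lines reading "line 1" ... "line 20".
datafile=$(mktemp)
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$datafile"

first_two=$(sed '2q' "$datafile")           # like head -n 2
first_five=$(sed -n '1,5p' "$datafile")     # -n suppresses the default output
last_line=$(sed -n '$p' "$datafile")        # $ addresses the last line
middle=$(sed -n '10,15p' "$datafile")       # a range from the middle of the file

printf '%s\n' "$first_two" "$last_line"
```

Dropping -n from any of the p commands and rerunning it is a quick way to see the selected lines appear twice.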

Selecting Multiple Group of Lines from Anywhere -

sed is not restricted to selecting a single group of lines. We can select as many sections as we like from just about anywhere. Each instruction goes on a line of its own:

   sed -n '1,2p
   7,9p
   $p' emp.csv

Negating the action ( ! ) -

sed also has a negation operator (!), which can be used with any action. Let's return to the example of printing the first 2 lines of a file. The command sed -n '1,2p' emp.csv can also be written as -

sed -n '3,$!p' emp.csv
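A small check confirms that the negated form selects the same lines as the direct one. The five-line input here is invented for the example.

```shell
# Hypothetical five-line input.
datafile=$(mktemp)
printf '%s\n' 'line 1' 'line 2' 'line 3' 'line 4' 'line 5' > "$datafile"

direct=$(sed -n '1,2p' "$datafile")      # print lines 1 and 2
negated=$(sed -n '3,$!p' "$datafile")    # print every line NOT in 3 to the end

printf '%s\n' "$negated"
```

Both commands should print line 1 and line 2; the ! simply inverts which lines the p command applies to.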

Multiple Instructions ( -e and -f ) -

In our previous example we saw an instruction that selects 3 different segments from a file. sed provides the -e and -f options for use with multiple instructions. The -e option lets you enter as many instructions as you wish, each preceded by the option. Here is the same command written with the -e option.

sed -n -e '1,2p' -e '7,9p' -e '$p' emp.csv

If you have too many instructions, the best thing is to place them all in a file and give that file to sed. Let's store the 3 instructions in a file instr.txt. We can now use the -f option to direct sed to take its instructions from the file:

sed -n -f instr.txt emp.csv

sed is quite liberal and allows us to use multiple -f options with multiple files. We can also combine -e and -f in the same command.

   sed -n -f instr.txt -f instr1.txt emp.csv 
   sed -n -e '/mangesh/p' -f instr.txt -f instr1.txt emp.csv

The second example uses context addressing (/mangesh/p) in an instruction. This is the other form of addressing used by sed, and we will discuss it next.
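The equivalence of -e and -f can be verified with a throwaway instruction file. In this sketch only the name instr.txt comes from the text; the ten-line data file is an assumption made for the example.

```shell
# Hypothetical ten-line input and an instruction file with one command per line.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 10 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$workdir/data.txt"
printf '%s\n' '1,2p' '7,9p' '$p' > "$workdir/instr.txt"

with_e=$(sed -n -e '1,2p' -e '7,9p' -e '$p' "$workdir/data.txt")
with_f=$(sed -n -f "$workdir/instr.txt" "$workdir/data.txt")

printf '%s\n' "$with_f"
```

Both forms should print lines 1, 2, 7, 8, 9 and 10, in that order.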

Context Addressing -

This is the second form of addressing, called context addressing, which allows you to specify one or two patterns to locate lines. The pattern must be bounded by a / on both sides. When we specify a single pattern, all the lines matching the pattern are selected. We can "grep" director in this way.

sed -n '/director/p' emp.csv

We can also provide the comma-separated pair of context addresses to select a group of lines .

sed -n '/mangesh/,/nikhil/p' emp.csv
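Both kinds of context address can be tried on a few lines of sample data. The four records below are made up for the illustration; the real emp.csv may be laid out differently.

```shell
# Hypothetical sample data.
datafile=$(mktemp)
printf '%s\n' '1001,mangesh,director' '1002,sambit,manager' \
              '1003,nikhil,executive' '1004,gunjan,director' > "$datafile"

directors=$(sed -n '/director/p' "$datafile")          # every line matching the pattern
block=$(sed -n '/mangesh/,/nikhil/p' "$datafile")      # from the first pattern through the second

printf '%s\n' "$directors"
printf '%s\n' "$block"
```

The single address picks the first and last records, while the pair of addresses picks the contiguous block of the first three.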

With regular expressions - Context addressing can be used with the regular expressions that we have used with grep. Here is an example with sed.

sed -n '/[mM]angesh/p' emp.csv

We can use the anchoring characters , ^ and $ as part of the regular expression syntax .This is how we can locate all people born in the year 1987

sed -n '/87.....$/p' emp.csv

Regular expressions in grep and sed are actually more powerful than we have used so far. They use some more special characters which we will see in our further discussions 

Writing selected lines to a file (w) -

Irrespective of the way you select lines (by line or context addressing), you can use the w (write) command to write the selected lines to a separate file. The following command writes the output to the file list:

sed -n '/director/w list' emp.csv

We can store the lines pertaining to the directors, managers and executives in three separate files.

   sed -n '/director/w list
   /manager/w mlist
   /executive/w elist' emp.csv

The same thing can be done with line addressing. Suppose we have a file main.txt which has around 1000 lines and we want to split it into two separate files. Here is the command for it.

   sed -n '1,500w main1
   501,$w main2' main.txt

The -n option is required with the w command only to suppress printing of all lines on the terminal. Even without it, the selected lines are written to the respective files.
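The w command can be watched at work in a scratch directory. In this sketch the sample records and the output names dlist, mlist and elist are assumptions chosen for the example. The filename given to w runs to the end of the line, so nothing should follow it on that line.

```shell
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' '1001,mangesh,director' '1002,sambit,manager' \
              '1003,nikhil,executive' '1004,gunjan,director' > "$workdir/emp.csv"

# Run inside the directory so that the w commands create their files there.
(
    cd "$workdir" || exit 1
    sed -n '/director/w dlist
/manager/w mlist
/executive/w elist' emp.csv
)

directors=$(cat "$workdir/dlist")
managers=$(cat "$workdir/mlist")
executives=$(cat "$workdir/elist")
printf '%s\n' "$directors"
```

After the command runs, each of the three files holds only the lines for its own designation.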

Text editing -

This section discusses some of the editing commands available in sed. sed can insert text and change the existing text in a file. We will see how sed uses the i (insert), a (append), c (change) and d (delete) commands to edit the text in a file.

Inserting and changing Lines ( i ,a ,c ) -

Let's insert text in the file sample.txt using sed's i command. To insert two lines of text before the first line, run the following command

   sed '1i\
   Hello mangesh\
   You have entered two lines
   ' sample.txt > dummy

First enter the instruction 1i, which inserts text before line number 1. Then enter a \ before pressing [enter]. You can now key in as many lines as you wish. Each line except the last has to be terminated by a \ before hitting [enter]. sed identifies the line without the \ as the last line of input. We have to follow this technique while using the a and c commands also.

The above command writes the two lines of inserted text, followed by the existing lines, to standard output, which we redirected to a temporary file dummy. We must move this file to sample.txt to be able to use it.

mv dummy sample.txt ; head -n 2 sample.txt
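The whole insert-and-replace cycle can be rehearsed on a scratch file. The sketch below assumes a two-line sample.txt created just for the example, and writes the inserted text without leading indentation so that no extra spaces can end up in the output.

```shell
# Hypothetical two-line file in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'first existing line' 'second existing line' > "$workdir/sample.txt"

# Insert two lines before line 1; every inserted line except the last ends in a backslash.
sed '1i\
Hello mangesh\
You have entered two lines' "$workdir/sample.txt" > "$workdir/dummy"

# Replace the original with the edited copy.
mv "$workdir/dummy" "$workdir/sample.txt"
top=$(head -n 2 "$workdir/sample.txt")
printf '%s\n' "$top"
```

The original file is never edited in place here; the change only takes effect once the temporary copy is moved over it.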

Doublespacing Text -

The following command appends a blank line after each line of the file as it is printed. This is another way of doublespacing text.

   sed 'a\
   ' emp.csv

Deleting Lines (d) -

sed uses the d (delete) command to emulate grep's -v option of selecting lines not containing the pattern. We can use either of the commands below. Note that the -n option is not to be used with d.

   sed '/director/d' emp.csv > list
   sed -n '/director/!p' emp.csv > list

Both select all lines except those containing director and save them in list.
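That the two commands are interchangeable, and that both behave like grep -v, can be checked on a few made-up records:

```shell
# Hypothetical sample data.
datafile=$(mktemp)
printf '%s\n' '1001,mangesh,director' '1002,sambit,manager' \
              '1003,nikhil,executive' '1004,gunjan,director' > "$datafile"

with_d=$(sed '/director/d' "$datafile")         # delete matching lines, print the rest
with_not_p=$(sed -n '/director/!p' "$datafile") # print only the non-matching lines
with_grep=$(grep -v 'director' "$datafile")     # grep's inverse selection

printf '%s\n' "$with_d"
```

All three results should be the manager and executive lines only.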

Deleting Blank Lines -

A blank line consists of any number of spaces or tabs, or nothing at all. How can we delete these lines from a file? We need to frame a pattern which matches zero or more occurrences of a space or tab.

sed '/^[ <tab>]*$/d' sample.txt

Here <tab> stands for a literal tab character: you need to press the space bar and then the [tab] key inside the character class. Providing a ^ at the beginning and a $ at the end matches lines that contain only whitespace. Note that this expression also matches lines containing nothing.
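Since a literal tab is hard to see (and to type) on the command line, one possible approach is to keep it in a shell variable and build the pattern inside double quotes. This is only a sketch of that idea; the variable name tab and the four-line input are assumptions made for the example.

```shell
# Hold a literal tab in a variable so that it is visible in the script.
tab=$(printf '\t')

# Input: a text line, an empty line, a line of spaces and a tab, another text line.
cleaned=$(printf 'alpha\n\n \t \nbeta\n' | sed "/^[ $tab]*\$/d")

printf '%s\n' "$cleaned"
```

Inside double quotes the $ of the regular expression has to be escaped as \$ so that the shell passes it to sed unchanged.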

Substitution (s) -

Substitution is the most important feature of sed, and this is one job that sed does exceedingly well. It lets you replace a pattern in its input with something else. Let's try to substitute (replace) the comma (,) with | in our emp.csv file.

sed 's/,/|/' emp.csv | head -n 3

Look at the output of the head command: sed has replaced only the leftmost, that is the first, occurrence of the comma in each line. We need to use the g (global) flag to replace all the commas.

sed 's/,/|/g' emp.csv | head -n 3

We can limit the vertical boundaries too by specifying an address.

sed '1,3s/,/|/g' emp.csv | head -n 3

Substitution is not restricted to a single character , it can be any string . Lets replace the word director with member in the first five lines of emp.csv

sed '1,5s/director/member/g' emp.csv

We can use regular expressions for substitution with sed. The anchoring characters ^ and $ can be used as well. This is how we can add the prefix 1 to all emp-ids.

sed 's/^/1/' emp.csv | head -n 2

Likewise, we can add the suffix .00 to the salary.

sed 's/$/.00/' emp.csv | head -n 2
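The effect of each substitution is easiest to see on a single record. The line used below is invented for illustration, with the emp-id first and the salary last.

```shell
# Hypothetical record: emp-id, name, designation, salary.
record='2233,mangesh,director,4500'

first_only=$(printf '%s\n' "$record" | sed 's/,/|/')     # only the leftmost comma
all_commas=$(printf '%s\n' "$record" | sed 's/,/|/g')    # g flag: every comma
prefixed=$(printf '%s\n' "$record" | sed 's/^/1/')       # ^ anchors at the start of the line
suffixed=$(printf '%s\n' "$record" | sed 's/$/.00/')     # $ anchors at the end of the line

printf '%s\n' "$first_only" "$all_commas" "$prefixed" "$suffixed"
```

The anchors match a position and not a character, so nothing is removed from the line in the last two commands; text is only added.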

Performing Multiple substitution

We can perform multiple substitutions with one invocation of sed. Simply press [enter] at the end of each instruction and close the quote after the last one.

   sed 's/Mangesh/Pande/g
   s/Nikhil/Rahul/g' emp.csv

Remembered Pattern ( // ) -

In all our previous examples we searched for a pattern and then replaced it with something. The following three commands do the same job:

   sed 's/director/member/' emp.csv
   sed '/director/s//member/' emp.csv
   sed '/director/s/director/member/' emp.csv

The second form suggests that sed "remembers" the scanned pattern and stores it in // (2 front slashes). The //, representing an empty (or null) regular expression, is interpreted to mean that the search and substituted patterns are the same. We will call it the remembered pattern.

However, when you use // in the target string, it means you are removing the pattern totally. The following command removes every comma from the file:

sed 's/,//g' emp.csv

The address /director/ in the third form appears to be redundant. However, you must understand this form also, because it widens the scope of substitution. It's possible that you may like to replace a string in all lines containing a different string:

sed -n '/marketing/s/director/member/p' emp.csv
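The following sketch compares the three equivalent forms and then the scoped substitution. The two records are made up for the example.

```shell
# Hypothetical sample data: two directors in different departments.
records=$(printf '%s\n' '1001,mangesh,director,marketing' '1004,gunjan,director,sales')

plain=$(printf '%s\n' "$records" | sed 's/director/member/')
remembered=$(printf '%s\n' "$records" | sed '/director/s//member/')      # // reuses /director/
explicit=$(printf '%s\n' "$records" | sed '/director/s/director/member/')

# Address and search pattern differ: only marketing lines are changed and printed.
scoped=$(printf '%s\n' "$records" | sed -n '/marketing/s/director/member/p')

printf '%s\n' "$remembered"
echo "scoped: $scoped"
```

The first three results are identical, while the scoped command leaves the sales director untouched and prints only the marketing line.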

Basic Regular Expressions Revisited -

To master sed, one must appreciate the numerous possibilities that regular expressions throw up with this command - more so than in grep. In this discussion we will see some more characters from the BRE set. Both grep and sed use these characters, but sed exploits them to the hilt. We will see the following three types of expression:

  • The repeated pattern - This uses a single symbol, &, to make the entire source pattern appear at the destination also.
  • The Interval Regular Expression ( IRE ) - This expression uses the characters { and } with a single number or a pair of numbers between them.
  • The Tagged Regular Expression ( TRE ) - This expression groups patterns with ( and ) and represents them at the destination with numbered tags.

The Repeated Pattern ( & )

We sometimes encounter situations where the source pattern also occurs at the destination. We can then use the special character & to represent it. All of these commands do the same thing:

   sed 's/director/executive director/' emp.csv
   sed 's/director/executive &/' emp.csv
   sed '/director/s//executive &/' emp.csv

All of these commands replace director with executive director. The &, known as the repeated pattern, here expands to the entire source string.
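A short check of the three forms on two made-up records:

```shell
# Hypothetical sample data.
records=$(printf '%s\n' '1001,mangesh,director' '1002,sambit,manager')

literal=$(printf '%s\n' "$records" | sed 's/director/executive director/')
repeated=$(printf '%s\n' "$records" | sed 's/director/executive &/')       # & stands for the matched text
combined=$(printf '%s\n' "$records" | sed '/director/s//executive &/')     # remembered pattern plus &

printf '%s\n' "$repeated"
```

Each result changes the first record to end in executive director and leaves the manager line as it was.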

 

Interval Regular Expression ( IRE ) -

We have matched a pattern at the beginning and end of a line. But what about matching it at any specified location, or within a zone? sed and grep also use the interval regular expression ( IRE ), which uses an integer (or two) to specify the number of times the preceding character can occur. The IRE uses an escaped pair of curly braces and takes three forms:

  • ch\{m\} - The character ch can occur exactly m times.
  • ch\{m,n\} - Here, ch can occur between m and n times.
  • ch\{m,\} - Here, ch can occur at least m times.

All of these have the single-character regular expression ch as the first element. This can be a literal character, a . (dot), or a character class. It is followed by a pair of escaped curly braces containing either a single number m, or a range of numbers lying between m and n, to determine the number of times the character preceding it can occur. The values of m and n can't exceed 255.

To illustrate the first form, let's consider this small telephone directory where a person has either a wired phone ( 8 digits ) or a mobile phone ( 10 digits ).

   $ cat teledir.txt
   Deepak Sharma 02022568
   Sambit Kumar  07122233
   Magesh Pande  22987654
   Gunjan Verma 9878945678
   Nikhil Muthal 9890654832

Let's use grep to select only those users who have a mobile phone. We must use an IRE to indicate that a numeral must occur 10 times:

   $ grep '[0-9]\{10\}' teledir.txt
   Gunjan Verma 9878945678
   Nikhil Muthal 9890654832

Let's now consider the second form of the IRE , using sed this time . Since this matches a pattern within a zone , we can display the listing for those files that have the write bit set either for group or others :

   $ ls -l | sed -n '/^.\{5,8\}w/p'
   -r-xr-xrwx   3 mangesh   dialout  426 May 15 17:38 comj
   -r-xr-xrwx   3 mangesh   dialout  426 May 15 15:38 runj
   -r-xrw-r-x   3 mangesh   dialout  589 Apr 15 19:25 prog.ksh
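Both forms of the IRE can be reproduced without relying on a real directory listing. In the sketch below, the telephone directory is the one shown above, while the three permission strings are made-up stand-ins for the first column of ls -l.

```shell
# The telephone directory from the text, kept in a variable.
teledir=$(printf '%s\n' 'Deepak Sharma 02022568' 'Sambit Kumar  07122233' \
                        'Magesh Pande  22987654' 'Gunjan Verma 9878945678' \
                        'Nikhil Muthal 9890654832')

# First form: a digit repeated exactly 10 times in a row.
mobiles=$(printf '%s\n' "$teledir" | grep '[0-9]\{10\}')

# Second form: a w preceded by 5 to 8 characters from the start of the line,
# that is, in the group or others part of a permission string.
perms=$(printf '%s\n' '-r-xr-xrwx' '-r-xrw-r-x' '-rwxr-xr-x')
writable=$(printf '%s\n' "$perms" | sed -n '/^.\{5,8\}w/p')

printf '%s\n' "$mobiles" "$writable"
```

The third permission string has its only w in the owner's part, so it falls outside the zone and is not selected.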

Extracting Lines Based on Length -

With IRE , you can use the following commands to select lines longer than 100 characters . The second one additionally imposes a limit of 150 on the maximum length :

   sed -n '/.\{101,\}/p' foo      Line length at least 101
   grep '^.\{101,150\}$' foo      Line length between 101 and 150 

The ^ and $ are required in the second example , otherwise lines longer than 150 characters would also be selected . Remember that a regular expression always tries to match the longest pattern possible .
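The role of the anchors is easier to see with small numbers. This sketch scales the limits down to 4 and 6 characters (an assumption made purely to keep the test lines short) and uses three made-up lines of length 2, 4 and 8.

```shell
# Hypothetical lines of 2, 4 and 8 characters.
lines=$(printf '%s\n' 'ab' 'abcd' 'abcdefgh')

at_least_4=$(printf '%s\n' "$lines" | sed -n '/.\{4,\}/p')     # 4 or more characters
between_4_6=$(printf '%s\n' "$lines" | grep '^.\{4,6\}$')      # whole line is 4 to 6 characters
unanchored=$(printf '%s\n' "$lines" | grep '.\{4,6\}')         # any 4 to 6 characters somewhere in the line

printf '%s\n' "$between_4_6"
echo "unanchored also matches: $(echo $unanchored)"
```

Without ^ and $, the 8-character line is selected as well, because some stretch of 4 to 6 characters can always be found inside it.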

The Tagged Regular Expression ( TRE ) -

This is the most complex of all regular expressions, and possibly the finest. It relates to breaking up a line into groups and then extracting one or more of these groups. The tagged regular expression ( TRE ) requires two regular expressions to be specified -- one each for the source and target patterns.

This is how the TRE works. You have to identify the segments of a line that you wish to extract and enclose each segment within a matched pair of escaped parentheses. For instance, if you need to extract a number, you will have to represent that number as \([0-9]*\). A series of nonalphabetic characters can be represented as \([^a-zA-Z]*\). Every grouped pattern automatically acquires the numeric label n, where n signifies the nth group from the left. To reproduce a group at the destination, you have to use the tag \n. This means that the first group is represented as \1, the second one as \2, and so forth.

Let's illustrate this with a simple example. Consider the telephone directory that was used in the previous section. Apart from telephone numbers, this file also contains names in the sequence first_name last_name. Some lines may have more than one space between the components, and our job is to use the TRE to take care of these imperfections.

We will create a new list from this file that shows the surname first, followed by a comma, and then the first name and the rest of the line. For instance, the entry for Magesh Pande should appear as Pande, Magesh. We will have to frame two groups of alphabetic characters and then reverse them in the target pattern, while inserting a comma between them. This is how we obtain a sorted list:

   $ sed 's/\([a-zA-Z]*\) *\([a-zA-Z]*\)/\2, \1/' teledir.txt | sort
   Kumar, Sambit  07122233
   Muthal, Nikhil 9890654832
   Pande, Magesh  22987654
   Sharma, Deepak 02022568
   Verma, Gunjan 9878945678

The first group, \([a-zA-Z]*\), represents zero or more occurrences of alphabetic characters; this effectively captures the first name. An identical pattern takes care of the surname. Both uppercase and lowercase letters must be included in the class because the names begin with capitals. These two groups are separated by zero or more occurrences of a space ( *). In the target pattern, we recreate these groups in reverse order with the tags \2 and \1. The comma between them is treated by sed literally.
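The same swap can be tried without creating teledir.txt by feeding the directory's five lines to sed on standard input. This is only a sketch; it also shows what happens when the character class leaves out the uppercase letters.

```shell
# The telephone directory from the previous section, kept in a variable.
teledir=$(printf '%s\n' 'Deepak Sharma 02022568' 'Sambit Kumar  07122233' \
                        'Magesh Pande  22987654' 'Gunjan Verma 9878945678' \
                        'Nikhil Muthal 9890654832')

# \1 is the first name, \2 the surname; the target swaps them around a comma.
swapped=$(printf '%s\n' "$teledir" |
          sed 's/\([a-zA-Z]*\) *\([a-zA-Z]*\)/\2, \1/' | sort)

# With [a-z] only, both groups match nothing at the capital letter that starts the line.
lower_only=$(printf '%s\n' 'Deepak Sharma 02022568' |
             sed 's/\([a-z]*\) *\([a-z]*\)/\2, \1/')

printf '%s\n' "$swapped"
echo "lowercase-only class gives: $lower_only"
```

Only the first two words of each line take part in the match, so the spacing before the telephone number is carried over unchanged.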

You will find numerous uses for the TRE when using sed. Though it's quite cryptic and difficult to comprehend initially, you must understand it if you want sed to serve as a gateway to learning regular expressions.

Internal Commands Used by sed

Command            Description
i, a, c            Inserts, appends and changes text
d                  Deletes lines
10q                Quits after reading the first 10 lines
p                  Prints lines on standard output
3,$p               Prints lines 3 to end ( -n option required )
$!p                Prints all lines except the last line ( -n option required )
/begin/,/end/p     Prints lines enclosed between begin and end ( -n option required )
q                  Quits after reading up to the addressed line
r flname           Places contents of file flname after line
w flname           Writes addressed lines to file flname
=                  Prints line number addressed
s/s1/s2/           Replaces first occurrence of expression s1 in all lines with expression s2
10,20s/-/:/        Replaces first occurrence of - in lines 10 to 20 with a :
s/s1/s2/g          Replaces all occurrences of expression s1 in all lines with expression s2
s/-/:/g            Replaces all occurrences of - in all lines with a :